Key Points in this Paper

Paper Details | My Notes
---|---
Page 2: Simple Storage. For each partition, the log is a set of segment files, operated in append-only mode. | At the moment, it is not clear whether logs will be lost when a node crashes.
Page 2: Every X messages, all new log entries are flushed. A message is exposed to consumers only after it is flushed. | Exposing a message only after flushing is a reasonable trade-off toward safety, at the expense of latency (see the latency data later).
Page 3: A consumer consumes messages from a partition sequentially. The message offset is the message's location (byte offset) in the log. The client calculates the offset of the next message by adding the size of the current message to the current offset. | This is a nice trick: the byte offset naturally acts as an increasing id, which saves a lookup table from message id to byte offset. (See the offset sketch after the table.)
Page 3: Each broker keeps in memory a sorted list of offsets, including the offset of the first message in every segment file. | A classical trick for append-only record systems. We should estimate the total memory usage for all these offsets and the total storage size; I did not see this estimate in the paper.
Page 3: Efficient Transfer. Each pull request retrieves hundreds of KB of data, not just one message. There is no broker-side caching; Kafka relies on the kernel's file system page cache. |
Page 3: Use sendfile to efficiently deliver bytes from a broker to a consumer. | sendfile is a nice optimization API provided by *nix. But how does Java provide it? Presumably via NIO's FileChannel.transferTo (see the sketch after the table).
Page 3: Stateless Broker. Brokers do not maintain consumption state; the client side does that. As a result, the broker keeps all log files for several days. One direct benefit is that this supports rewind. | The idea is very clever, and rewind is a huge usability win. It is not clear how much storage is needed for all log files during that retention window, or how to rebalance the logs if a disk is close to capacity.
Page 3: Distributed Coordination. Each producer can publish a message either to a randomly selected partition or to a partition determined by a partitioning key and function. | A minimal partitioner sketch follows the table.
Page 4: Kafka supports consumer groups. The idea is that (1) the consumed offset for each partition is stored as a persistent znode, and (2) membership for producers and consumers is recorded as ephemeral znodes. On any membership change, the system rebalances. If there are race conditions in recognizing membership (e.g., multiple changes in quick succession or slow reactions), the system waits a bit and retries; in practice, the process usually stabilizes after a few retries. | Kafka demonstrates how to use Zookeeper to manage membership, which is quite interesting (see the znode sketch after the table).
Page 4: Delivery Guarantees. Kafka only guarantees at-least-once delivery. Within a single partition, messages are delivered to a consumer in order. There is no guarantee on the ordering of messages coming from different partitions. |
Page 5: Kafka stores a CRC for each message. Corrupted messages are removed. | A small CRC check sketch follows the table.
Page 5: Without tuning, the end-to-end latency is about 10 seconds on average. |
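
A minimal sketch of the offset bookkeeping from the Page 3 notes, assuming a fixed per-message header and an in-memory array of each segment's first offset; names like `LOG_OVERHEAD` and `segmentFirstOffsets` are illustrative, not the paper's.

```java
import java.util.Arrays;

// Byte-offset addressing: the consumer derives the next offset from the
// current offset plus the current message's on-disk size, and the broker
// locates the owning segment file by binary-searching a sorted list of
// first offsets (one entry per segment).
public class OffsetIndexSketch {
    // Hypothetical fixed per-message header size (length + CRC), for illustration.
    static final long LOG_OVERHEAD = 8;

    // Next offset = current offset + bytes occupied by the current message.
    static long nextOffset(long currentOffset, long payloadBytes) {
        return currentOffset + LOG_OVERHEAD + payloadBytes;
    }

    // First offsets of each segment file, kept sorted in memory.
    static long[] segmentFirstOffsets = {0, 1_000_000, 2_000_000};

    // Find the index of the segment that contains the requested offset.
    static int segmentFor(long offset) {
        int idx = Arrays.binarySearch(segmentFirstOffsets, offset);
        // An exact hit points at the segment itself; otherwise binarySearch
        // returns (-(insertionPoint) - 1), and the owning segment is the one
        // just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        long next = nextOffset(1_500_000, 4096);
        System.out.println("next offset = " + next + ", segment = " + segmentFor(next));
    }
}
```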
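
On the sendfile question in the Page 3 note: Java exposes this kind of zero-copy transfer through `java.nio.channels.FileChannel.transferTo`, which can delegate to sendfile when the destination is a socket. A minimal sketch, with a made-up segment file name, host, and port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Stream a log segment to a consumer socket without copying the bytes
// through user space (FileChannel.transferTo can use sendfile underneath).
public class SendfileSketch {
    public static void main(String[] args) throws IOException {
        Path segment = Path.of("00000000000000000000.log"); // illustrative segment file name
        try (FileChannel file = FileChannel.open(segment, StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("consumer-host", 9092))) {
            long position = 0;
            long remaining = file.size();
            // transferTo may send fewer bytes than requested, so loop until done.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```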
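
A minimal sketch of the two partition-selection strategies from the Page 3 "Distributed Coordination" row: random when no key is given, otherwise hash-of-key modulo the partition count. The hashing choice here is an assumption for illustration, not necessarily Kafka's actual partitioning function.

```java
import java.util.concurrent.ThreadLocalRandom;

// Random selection when there is no key, otherwise a deterministic
// hash-of-key mapping so that all messages with the same key land on
// the same partition.
public class PartitionerSketch {
    static int choosePartition(String key, int numPartitions) {
        if (key == null) {
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
        // Mask the sign bit so the modulo result is non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(choosePartition(null, 8));      // random partition
        System.out.println(choosePartition("user-42", 8)); // stable partition for this key
    }
}
```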
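
A minimal sketch of the Page 4 coordination idea using the standard `org.apache.zookeeper` client: consumer membership as an ephemeral znode (removed automatically when the session dies, which is what triggers a rebalance) and the consumed offset as a persistent znode. The paths, group, and consumer names are illustrative, not Kafka's actual znode layout, and parent paths are assumed to already exist.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Register a consumer as an ephemeral znode and record its offset for a
// partition as a persistent znode.
public class ZkCoordinationSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> {});

        // Ephemeral: disappears when this consumer's session ends, so the
        // rest of the group notices the membership change and rebalances.
        zk.create("/consumers/group1/ids/consumer1",
                  "topicA".getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Persistent: the last consumed offset survives consumer restarts.
        byte[] offset = "123456789".getBytes(StandardCharsets.UTF_8);
        String path = "/consumers/group1/offsets/topicA/0";
        if (zk.exists(path, false) == null) {
            zk.create(path, offset, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            zk.setData(path, offset, -1); // -1 = ignore version check
        }
        zk.close();
    }
}
```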
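
A small sketch of the per-message CRC check from the Page 5 note, using `java.util.zip.CRC32`; how Kafka actually lays out the checksum inside the message header is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Compute and verify a per-message CRC; a message whose payload does not
// match the stored checksum would be treated as corrupted and removed.
public class CrcCheckSketch {
    static boolean isIntact(byte[] payload, long storedCrc) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue() == storedCrc;
    }

    public static void main(String[] args) {
        byte[] msg = "hello kafka".getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(msg);
        System.out.println(isIntact(msg, crc.getValue())); // true
    }
}
```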

Possible Improvements and Open Questions |
---|---
Load Balancing | Instead of randomly selecting a partition, try a best-of-two choice to improve load balancing without losing scalability (see the sketch after this table).
Node-Crash Tolerance | Use cloud storage, or even a tape system if cost is a concern, to back up the log files. This is mentioned at the end of Section 3.
Topic Count? | From the paper, it is not clear how many topics the system can hold. Does the topic count affect performance, both scaling performance and storage (offset znodes)?
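
A minimal sketch of the best-of-two idea in the Load Balancing row: sample two random partitions and send to the one with the smaller current load. The load metric (`partitionLoad`, e.g., queued bytes) is a stand-in; any cheap per-partition signal would do.

```java
import java.util.concurrent.ThreadLocalRandom;

// "Power of two choices": pick two random partitions, keep the less loaded
// one. This flattens the load distribution compared to a single random pick
// while staying O(1) per message.
public class BestOfTwoSketch {
    static int choosePartition(long[] partitionLoad) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int a = rnd.nextInt(partitionLoad.length);
        int b = rnd.nextInt(partitionLoad.length);
        return partitionLoad[a] <= partitionLoad[b] ? a : b;
    }

    public static void main(String[] args) {
        long[] load = {120, 7, 300, 45}; // illustrative per-partition load values
        System.out.println("chosen partition = " + choosePartition(load));
    }
}
```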