This post will be the first in a series of blog posts where I will be talking about the limitations of Apache Kafka as a task queue and how we overcame these limitations at a previous employer where I managed tens of fairly high-throughput Apache Kafka clusters. This post will lay the groundwork for explaining how Apache Kafka works, such that the rest of the posts are easy to follow.
Hopefully, these articles will avoid future battle scars for people who dabble with Apache Kafka. ❤️ 🤕
High-level concepts Link to heading
A high-level architecture of Apache Kafka. The arrows show how records flow through the system.
Apache Kafka (from now on, referred to as “Kafka”), is a system that asynchronously transports messages from a set of producers to a set of consumers. “Messages” are, in Kafka lingo, called records. A Kafka cluster consists of a set of brokers. Each record passes through a broker1. Each broker stores the records to disk such that consumers can consume them at a later stage. I will go more into detail about how a broker works later.
Like most message-passing systems, Kafka also has the concept of a topic. Topics are used to organise the records stored in a Kafka cluster. Every record sent to a Kafka cluster has a topic destination. To consume that record, a consumer must use the same topic name. For example, if you are running a website analytics company, you might have a topic called user_clicks that contains all the tracked events of website visitors.
A topic is split up into partitions. Each partition is associated with a broker2. A broken can have multiple partitions assigned to it. As depicted in the figure above, this means that there can be brokers that do not store any records for a specific topic.
Partitions are the core concept that allows Kafka to scale horizontally. Incoming records are assigned to a partition and, through that, a broker. Which partition a record is routed to is up to the producer of the record to decide. Each message has an optional key. If the key is null, the message is sent to a random partition (ie, Round-Robin). If the key is set to a string, the producer picks a partition based on a hash of the key (ie, something like partition := hash(record.key) % numberOfPartitions).
Each consumer belongs to a consumer group, and each consumer group is associated with a topic. Every record will be sent to one consumer in each consumer group. This means that each consumer group for topix X will, as a whole, receive every record for topix X. This allows for fan-out such that you can have multiple downstream consumer systems that each process every message. In the case of the website analytics company, you might have one downstream consumer group that triggers alerts and another consumer group that generates hourly website statistics that can be graphed. Each of these two systems receives all the user events.
Okay, so far, I have described a horizontally scalable message-passing system supporting fan-out. I have left out one particular detail, which is how partitions and consumers relate. To be able to explain that, I need to talk about how brokers store their data.
The partition log Link to heading
An append-only log of records. Each record has an index called ‘offset’.
Each partition on a broker is stored as a log on disk. A log is an append-only file where each record gets added at the end. Each record has an implicit offset counting from the start of the log.
To avoid needing to store all records for infinity, the on-disk log file is chunked up in something like ~100MB files3. Files older than a configurable TTL are deleted. It’s worth pointing out that the record’s offset remains the same. Another important thing to notice here is that no messages are deleted once they have been consumed by all subscribed consumers.
Consumers simply stream the records from this log. Since writing to disk and reading is done by append-only and streaming, Kafka has a high-throughput.
Consumer partitions Link to heading
An append-only log of records, showing the last processed record for each consumer group.
The way Kafka keeps track of which records have been consumed is by, for each consumer group, keeping a reference to each partition’s last offset that it has consumed. I tend to think of Kafka’s internal representation as something like this:
consumer_groups:
  X:
    partitions:
      0: 88
      1: 65
      2: 23
      4: 32
      5: 103
  Y:
    partitions:
      0: 83
      1: 61
      2: 37
      4: 42
      5: 112
As soon as consumer group X has processed record 66 in partition 1, it updates consumer_groups.X.partitions.1 to 66.
To avoid contention in incrementing these consumer offsets, every partition is assigned to one consumer in each consumer group. It is up to each consumer to update these offsets whenever they want (every minute, every message, after 10 messages, etc.). This means that there is only one consumer per consumer group that consumes each partition. This has immense implications, which my next blog post will be about.
Further reading Link to heading
- Key Concepts from Kafka’s documentation.
- The Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Keps. A great long-form article about the insights leading up to Apache Kafka.
- Strictly speaking, each record usually passes through multiple brokers for redundancy reasons. That said, to keep this article simple, we can ignore that for now. ↩︎ 
- Actually, every partition is associated with multiple brokers for redundancy reasons. This means that every record is written to all brokers for a specific partition. That said, for simplicity, let’s just assume that there is just one broker per partition for now. ↩︎ 
- The chunk size is configurable. ↩︎