Last week, we explored how Kafka consumers work in consumer groups and how rebalancing ensures partition distribution. That sparked an interesting question about how consumers actually communicate with brokers and detect rebalancing needs. Thanks, Michał. The question is again on point, as that’s what I wanted to expand on!
Again, communication sounds simple: consumers just get messages from brokers, right? But since I'm writing about it, you may already guess that behind the scenes, it's not that simple.
Today, we'll explore the communication protocol between Kafka consumers and brokers.
We'll see how what appears to be a simple "push" system is actually an elegant pull-based protocol and how rebalancing notifications occur through clever use of error codes rather than explicit push notifications.
Consumer Groups: Why We Need Them and How They Work
Before we jump right into the protocol details, let's discuss a fundamental problem in message processing: how do you scale message consumption across multiple applications? Imagine you're running an e-commerce platform processing thousands of orders per second. A single application instance couldn't handle that load - you need to distribute the work somehow.
This is where consumer groups come in. The concept is beautifully simple: instead of having a single consumer handle all messages, you can have multiple consumers work together, each handling a portion of the messages.
One more powerful aspect of consumer groups is having multiple groups independently processing the same topic. For example:
One group might process orders for fulfillment
Another group might analyze orders for business intelligence
A third group might archive orders for compliance
Each group maintains its own view of which messages have been processed, allowing different applications to work at their own pace.
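In kafkajs terms, that's just two consumers with different group IDs subscribed to the same topic. A minimal sketch (the group names are illustrative, and the kafka client instance is set up as shown in the next section):

// Two independent groups reading the same 'orders' topic,
// each committing its own offsets and working at its own pace
const fulfillment = kafka.consumer({ groupId: 'order-fulfillment' });
const analytics = kafka.consumer({ groupId: 'order-analytics' });

await fulfillment.subscribe({ topic: 'orders' });
await analytics.subscribe({ topic: 'orders' });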
Think of it like a team of workers in a warehouse - instead of one person trying to process all incoming packages, you have multiple workers, each handling their own set of packages.
The Basic Idea: Dividing Work
At its core, a consumer group is just what it sounds like - a group of consumers working together. When you create a consumer in Kafka, you assign it to a group:
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ brokers: ['localhost:9092'] }); // illustrative broker address

const consumer = kafka.consumer({
  groupId: 'order-processing'
});
await consumer.connect();
await consumer.subscribe({ topic: 'orders' });
But how does Kafka ensure these consumers work together efficiently? The magic lies in how Kafka divides work using partitions. Remember that each Kafka topic is split into partitions - these become our units of work distribution. Kafka ensures that each partition is assigned to exactly one consumer in the group.
This simple rule has powerful implications:
Work is naturally distributed across consumers
Message ordering is preserved within each partition
Consumers can work independently without coordination
How Work Gets Divided: Partition Assignment
We discussed that part in detail in the last edition, but let’s recap it and see how this works in practice.
Imagine you have a topic with six partitions and three consumers in a group. Kafka will distribute the partitions like this:
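Consumer 1 → partitions 0, 1
Consumer 2 → partitions 2, 3
Consumer 3 → partitions 4, 5

(The exact pairings depend on the assignment strategy, but with six partitions and three consumers, each consumer ends up with two.)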
This distribution isn't random - it's managed by a broker acting in a special role called the Group Coordinator. Think of the Group Coordinator as a supervisor, ensuring work is fairly distributed. It keeps track of:
Which consumers are currently active
Which partitions are assigned to each consumer
When assignments need to change
Consumer groups also handle membership changes, ensuring that work is redistributed when:
A new consumer joins the group.
A consumer crashes or becomes unresponsive.
Network issues cause temporary disconnects.
New partitions are created.
This is where the concept of rebalancing comes in. When any change occurs in the group, Kafka triggers a rebalancing process to redistribute partitions among the remaining consumers. It's like a warehouse supervisor reassigning work when workers come or go.
The rebalancing process follows a careful sequence:
The Group Coordinator notices a change (new consumer, failed consumer, etc.)
It pauses all message processing
It gathers information about available consumers
It redistributes partitions among them
Processing resumes with new assignments
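To make that sequence concrete, here's a simplified sketch of the redistribution step (all names are illustrative - real brokers coordinate this through the group protocol's JoinGroup/SyncGroup requests, not a local function call):

// Simplified model of the coordinator's redistribution step
type Assignment = Map<string, number[]>; // consumer ID -> assigned partitions

function rebalance(consumers: string[], partitionCount: number): Assignment {
  const assignment: Assignment = new Map(
    consumers.map((id): [string, number[]] => [id, []])
  );
  // Hand out all partitions round-robin across the current members
  for (let partition = 0; partition < partitionCount; partition++) {
    assignment.get(consumers[partition % consumers.length])!.push(partition);
  }
  return assignment;
}

// One of three consumers crashed -> recompute for the remaining two
const newAssignments = rebalance(['consumer-a', 'consumer-b'], 6);
// Map { 'consumer-a' => [0, 2, 4], 'consumer-b' => [1, 3, 5] }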
Making It Reliable: Handling Failures
Now, we're getting to the more complex aspects of consumer groups. In a distributed system, lots of things can go wrong:
Consumers can crash
Network connections can fail
Processes can become slow or unresponsive
Consumers regularly send heartbeat messages to the Group Coordinator, saying, "I'm still here and working!" If a consumer stops sending heartbeats, the coordinator knows something's wrong and can trigger a rebalance.
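Conceptually, the client-side loop looks like this (a sketch built around a hypothetical sendHeartbeat function; kafkajs and other clients run the real loop for you in the background):

// Illustrative heartbeat loop - the coordinator answers each heartbeat
// with an error code instead of pushing notifications to consumers.
const REBALANCE_IN_PROGRESS = 27; // actual Kafka protocol error code

async function heartbeatLoop(
  sendHeartbeat: () => Promise<{ errorCode: number }>,
  intervalMs: number
): Promise<void> {
  while (true) {
    const { errorCode } = await sendHeartbeat();
    if (errorCode === REBALANCE_IN_PROGRESS) {
      // A rebalance is signalled via the response - stop fetching
      // and rejoin the group to get new assignments
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}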
Each time the group membership changes, Kafka assigns a new "generation ID" to the group. This is crucial for preventing "split-brain" scenarios, where different consumers have different ideas about who should be processing what.
Think of the generation ID as a group membership version number. When it changes:
All consumers must get new partition assignments
Old assignments become invalid
Consumers can't process messages until they join the new generation
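On the broker side, you can picture the generation check roughly like this (an illustrative sketch, not actual broker code; ILLEGAL_GENERATION, code 22, is the real Kafka protocol error code):

// Stale consumers are rejected with an error code instead of a push
const ILLEGAL_GENERATION = 22;
const NONE = 0;

function validateGeneration(
  requestGenerationId: number,
  currentGenerationId: number
): number {
  if (requestGenerationId !== currentGenerationId) {
    // The consumer belongs to an old generation: it must rejoin
    // and get fresh assignments before committing any offsets
    return ILLEGAL_GENERATION;
  }
  return NONE;
}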
We could model the Consumer Group metadata that keeps track of all this as:

// A minimal sketch of a per-consumer entry (fields are illustrative)
interface ConsumerMetadata {
  memberId: string;
  assignedPartitions: number[];
}

interface ConsumerGroupMetadata {
  groupId: string;
  generationId: number; // ⬅️ Increments with each membership change
  consumers: Map<string, ConsumerMetadata>;
  leader: string; // Member ID of the consumer elected as group leader
}
When a consumer joins the group, a session is created. The session is, in other words, that consumer's specific connection to the group. Staying with our warehouse example, it represents an employee's shift. Kafka needs to track active consumers in the group to distribute load correctly and ensure reliable processing.
The session's settings tell Kafka how long the Group Coordinator should wait to hear from a consumer before declaring it dead, and how long consumers get to rejoin while a rebalance is in progress. In other words, employees can go home if there's no work for them while stocktaking is happening.
const consumer = kafka.consumer({
  groupId: 'order-processing',
  // How long can a consumer be silent before it's considered dead?
  sessionTimeout: 30000,
  // How long to wait during rebalancing?
  rebalanceTimeout: 60000
});
These timeouts help Kafka distinguish between temporary hiccups and actual failures.
Now that we have recapped consumer groups and why we need them, let's examine how the protocol actually implements all this functionality.
The Protocol Foundation: The Dance of Requests and Responses
When you write seemingly simple code to consume messages from Kafka, a complicated dance is happening behind the scenes. Every interaction between consumers and brokers follows a strict protocol designed to handle the complexities of distributed systems.
Let's start with something that might seem mundane but is actually fascinating: the request header. Every single message between a consumer and broker starts with this structured header:
interface RequestHeader {
  apiKey: number; // Identifies request type
  apiVersion: number; // Protocol version used
  correlationId: number; // Request tracking
  clientId: string; // Consumer identifier
}
This isn't just bureaucratic overhead - each field is crucial to reliable distributed communication. The apiKey, for instance, isn't just a number - it's part of a carefully designed protocol that supports operations like:
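For example, a few of the well-known API keys (the numeric values are fixed by the Kafka protocol specification):

enum ApiKey {
  Fetch = 1,        // Pull records from partitions
  OffsetCommit = 8, // Persist the consumer's progress
  OffsetFetch = 9,  // Read back committed offsets
  JoinGroup = 11,   // Enter a consumer group
  Heartbeat = 12,   // Tell the coordinator "I'm alive"
  LeaveGroup = 13,  // Exit the group cleanly
  SyncGroup = 14,   // Receive partition assignments
}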