Imagine you’re sending a message to Kafka by calling something simple like:
producer.send(new ProducerRecord<>("someTopic", "Hello, Kafka!"));
We often treat this like a “black box”. We put messages on one side and get them on the other. The message leaves the producer, goes through the broker, and eventually appears in a consumer. This sounds straightforward, but behind the scenes, the technical implementation is a bit more complex.
Kafka uses an append-only log for each partition, storing messages in files on disk. We discussed that in detail in The Write-Ahead Log: The underrated Reliability Foundation for Databases and Distributed systems. Thanks to that, if the process crashes mid-write, Kafka detects partial data (via checksums) and discards it upon restart.
As I got positive feedback on mixing pseudocode (no offence, TypeScript!) with concept explanations, let's try to show that flow today!
Of course, we won’t replicate all of real Kafka’s complexities (replication, the full batch format, time-based file rolling, etc.), but we’ll try to stay close enough logically to explain it and get closer to the backbone.
By the end, we’ll have:
A minimal producer that batches messages in memory.
A segmented log class that writes these batches in an append-only manner.
A small fsync toggle to ensure data is physically on disk (or rely on OS caching).
A recoverOnStartup method that checks for partial writes if a crash occurred and truncates them.
We’ll also discuss why each piece exists and how that gives you a closer look at tooling internals.
If you’re not into Kafka, that’s fine. This article can also help you understand how other messaging tools use the disk and a WAL to keep their guarantees!
Before we jump into the topic, a short sidetrack. Or, actually, two.
First, I invite you to join my online workshop, Practical Introduction to Event Sourcing. I think you got a dedicated email about it, so let me just link here to the page with details and a special 10% discount for you. It’s available through this link: https://ti.to/on3/dddacademy/discount/Oskar. Be quick, as the workshop will happen in precisely 2 weeks!
Secondly, we just released the stable version of the MongoDB event store in Emmett. I wrote a detailed article explaining how we did it and how you can do it. Since you’re here, you’ll surely like such nerd sniping.
See: https://event-driven.io/en/mongodb_event_store/
Making it consistent and performant was challenging, so I think that's an interesting read. If you're considering using key-value databases like DynamoDB and CosmosDB, this article can outline the challenges and solutions.
My first choice is still PostgreSQL, but I'm happy with the MongoDB implementation we came up with.
If MongoDB is already part of your tech stack and the constraints outlined in the article are not deal-breakers, this approach can deliver a pragmatic, production-friendly solution that balances performance, simplicity, and developer familiarity.
Ok, going back to our Kafka thing!
Producer Batching: The First Step
When your code calls producer.send, real Kafka doesn’t instantly push that single message to the broker. Instead, it accumulates messages into batches to reduce overhead.
For example, if batch.size is set to 16 KB, Kafka’s producer library tries to fill up to 16 KB of messages for a particular partition, or waits up to the time defined in linger.ms if the batch isn’t full yet, before sending them as one record batch. This drastically improves throughput, though it can add slight latency.
Below is pseudocode that demonstrates why we do batching at all. It doesn’t store anything on disk or send anything over the network; it just collects messages until we decide to flush:
interface Broker {
  send(batch: Buffer[]): void;
}

class SimpleInMemoryProducer {
  private buffer: Buffer[] = [];
  private bufferedBytes = 0;

  constructor(private broker: Broker, private maxBatchSize: number) {}

  send(msg: Buffer): void {
    // if this message would overflow the batch, ship what we have first
    if (this.bufferedBytes + msg.length > this.maxBatchSize) {
      this.flush();
    }
    this.buffer.push(msg);
    this.bufferedBytes += msg.length;
  }

  flush(): void {
    if (this.buffer.length === 0) return;

    // hand the whole batch to the broker in one call
    this.broker.send(this.buffer);
    this.buffer = [];
    this.bufferedBytes = 0;
  }
}
In real Kafka, we’d have compression, partitioner logic, etc. But the concept stands: accumulate messages → send them in bigger chunks.
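One thing the snippet above skips is the time-based flush that linger.ms controls: a batch also gets sent once it has waited long enough, even if it isn’t full. Here’s a minimal sketch of that idea, wrapping the producer above and using a Node.js timer (LingeringProducer and lingerMs are illustrative names of mine, not Kafka’s API):

// Flushes when lingerMs elapses, even if the batch isn't full yet.
// Size-based flushing still happens inside SimpleInMemoryProducer itself.
class LingeringProducer {
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private producer: SimpleInMemoryProducer,
    private lingerMs: number,
  ) {}

  send(msg: Buffer): void {
    this.producer.send(msg);

    // arm the timer on the first message after the last time-based flush;
    // simplified: real Kafka tracks batches (and timers) per partition
    if (this.timer === null) {
      this.timer = setTimeout(() => {
        this.producer.flush();
        this.timer = null;
      }, this.lingerMs);
    }
  }
}

// Usage: flush either at 16 KB (maxBatchSize) or after 5 ms, whichever comes first
// const producer = new LingeringProducer(
//   new SimpleInMemoryProducer(broker, 16 * 1024),
//   5,
// );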
Brokers are responsible for coordinating the data transfer between producers and consumers and for ensuring that data is stored durably on disk.
This is important for “under the hood” log writes because the broker typically writes entire batches, possibly compressed, to disk in a single append. That’s one of the essential things to know about why Kafka is performant. After the message is sent to the broker, it’s just stored in the log and transferred to consumers. No additional logic happens.
As explained in the article about the WAL, Kafka follows the classical WAL pattern:
Log First: Kafka producers append messages to a specific partition in a topic. Each partition is a WAL where new messages are appended at the end. Partitions are immutable—once a message is written, it cannot be modified. Each message in a partition is assigned a monotonic offset, which acts as a unique position marker.
Flush to Disk: Kafka ensures durability by persisting messages to disk before acknowledging the write to the producer. Each Kafka broker flushes log entries to disk periodically (configurable via flush.messages or flush.ms). For stronger guarantees, Kafka producers can request acknowledgements from multiple brokers (e.g., acks=all), ensuring data is replicated before the write is acknowledged.
Apply Changes Later: Consumers read messages from the topic’s partition sequentially, starting from a specific offset. The position of a consumer within a partition is called the offset. Consumers commit their offsets to Kafka or external stores to keep track of their progress. This model allows Kafka to achieve high throughput: sequential reads and writes are fast and efficient on modern storage hardware. Consumers are responsible for state management (e.g., deduplication or applying the messages), which keeps Kafka lightweight.
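To make that read loop concrete, here is a minimal sketch of a consumer that reads sequentially from its last committed offset and commits after processing. The PartitionLog and OffsetStore shapes are illustrative assumptions of mine, not Kafka’s client API:

// Illustrative shapes, just to show the offset-based read loop
interface LogRecord {
  offset: number;
  data: Buffer;
}

interface PartitionLog {
  read(fromOffset: number, maxRecords: number): LogRecord[];
}

interface OffsetStore {
  load(): number; // last committed position (0 when starting fresh)
  commit(offset: number): void;
}

const consume = (
  log: PartitionLog,
  offsets: OffsetStore,
  handle: (record: LogRecord) => void,
): void => {
  let position = offsets.load();

  // sequential read from the committed position onwards
  for (const record of log.read(position, 100)) {
    // the consumer owns state management (deduplication, projections, etc.)
    handle(record);
    position = record.offset + 1;
  }

  // committing after processing gives at-least-once delivery:
  // a crash before commit means the records are re-read and re-handled
  offsets.commit(position);
};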
Single File Append: The Simplest Broker-Side Implementation
If we were to implement the broker side in a naive manner, we could keep a single file for all messages. Whenever a batch arrives, we append it to the end of that file, storing it in the following format:
[ offset: 4 bytes | dataLength: 4 bytes | data: N bytes ]
Where:
offset - a logical position in the log,
dataLength - how big our record is; in other words, how many bytes the record data contains,
data - the payload itself.
Using the Node.js built-in fs (File System) module, we could code the basic append-to-log logic roughly as follows.
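Here’s a minimal sketch, assuming synchronous writes and the [ offset | dataLength | data ] layout above (the appendToLog name, the file name, and the fsyncOnWrite toggle are illustrative; real Kafka’s batch format also carries checksums, timestamps, and more):

import { openSync, writeSync, fsyncSync, closeSync } from 'fs';

// Encode [ offset | dataLength | data ] and append it to a single log file
const appendToLog = (
  fd: number,
  offset: number,
  data: Buffer,
  fsyncOnWrite: boolean,
): void => {
  const header = Buffer.alloc(8);
  header.writeUInt32BE(offset, 0); // offset: 4 bytes
  header.writeUInt32BE(data.length, 4); // dataLength: 4 bytes

  // appending header + payload in one write keeps the record contiguous
  writeSync(fd, Buffer.concat([header, data]));

  // optionally force the OS page cache to be flushed to physical disk
  if (fsyncOnWrite) {
    fsyncSync(fd);
  }
};

// Usage: open the file in append mode and write records sequentially
const fd = openSync('partition-0.log', 'a');
appendToLog(fd, 0, Buffer.from('Hello, Kafka!'), true);
appendToLog(fd, 1, Buffer.from('Second message'), true);
closeSync(fd);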