We've strayed quite far from our Queue Broker series in recent editions, but today we're back to it! Did you miss it?
In an initial article, we introduced the QueueBroker—a central component that manages queuing, backpressure, and concurrency. In distributed systems, queues do more than buffer requests or smooth traffic spikes. They can ensure order, enforce grouping, and provide guarantees for exclusive processing. These properties are critical when designing systems that must be both scalable and consistent.
On Friday, I wrote an extensive guide on how to handle idempotency in command handling on my blog. I explained the challenges in ensuring that our business operations are handled correctly without duplicates. I went through the various ways to handle it, both explicitly and generically, from building predictable state machines to handling it in business logic to leveraging Optimistic Concurrency, retries, and distributed locks.
I recommend handling idempotency using business logic, as it’s highly dependent on the business rules. And those rules tend to change. If we use generic handling, we’re always adding additional overhead, even if most of our applications rarely have idempotency issues. That’s something to consider, as it can also increase costs in cloud environments.
Such implementations are also tricky. We need to handle thread safety and atomic updates correctly. I showed the implementation using a distributed lock like Redis, as in-memory implementation isn’t that simple.
And “in-memory implementation isn’t that simple” is a good challenge. And I’ll take it today!
It’s a good opportunity to merge ideas from those two articles. We’ll build a thread-safe, in-memory IdempotencyKey Store by combining single-writer, grouped queuing with idempotency guarantees. We’ll also explore the broader architectural principles behind grouping and ordering in queuing systems like Amazon SQS FIFO, Kafka, and Azure Service Bus.
But first, let’s start with the problem grouping solves.
Why Grouping Matters
In distributed systems, operations on shared state—whether it's a bank account, an order, or a device—are often processed asynchronously. But asynchronous processing introduces risk:
Transactions arrive out of order: Imagine a bank processes a withdrawal before a deposit. A user with a $0 balance may see a failed transaction, even though funds were deposited seconds earlier.
Duplicate messages cause inconsistent state: A hotel system receives two commands to check in a guest. Without safeguards, the system may allocate two rooms to the same guest.
Concurrency can break assumptions: In an IoT system, a device sends a “door open” event while another service processes a “door locked” command. Without guarantees, these commands could execute simultaneously, leaving the system in an undefined state.
Grouping allows related operations to be processed sequentially while operations on different groups run concurrently. We correlate a set of tasks, declaring that they must be processed in a certain order. For example:
Banking systems group by account: Ensuring withdrawals and deposits for the same account are serialized.
Order management systems group by order: Ensuring updates to the same order (e.g., payment, shipping) are processed in order.
IoT systems group by device: Ensuring events from the same device don’t conflict.
Grouping isn’t a new idea. Systems like Amazon SQS FIFO, Azure Service Bus, and Kafka have implemented variations of this pattern to solve similar problems.
How Popular Systems Handle Grouping and Ordering
Amazon SQS FIFO
Regular Amazon SQS doesn’t provide any ordering guarantee. Still, some years ago, AWS added a variation: Amazon SQS FIFO (First-In-First-Out).
Amazon SQS FIFO ensures messages with the same message group ID are processed in the right order. Groups are independent, so tasks for one group don't block tasks in another. For example, in an e-commerce system, you can group messages by orderId to ensure updates for the same order (e.g., "Order Placed", "Payment Confirmed") are processed sequentially, while messages for different orders are processed concurrently.
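For illustration, sending a grouped message with the AWS SDK v3 could look like the sketch below; the queue URL and message shape are placeholders:

import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Messages with the same MessageGroupId are delivered in order;
// different groups can be consumed concurrently
await sqs.send(
  new SendMessageCommand({
    QueueUrl: "https://sqs.eu-west-1.amazonaws.com/123456789012/orders.fifo",
    MessageBody: JSON.stringify({ type: "PaymentConfirmed", orderId: "order-123" }),
    // The grouping key: all messages for this order stay in order
    MessageGroupId: "order-123",
    // FIFO queues also require deduplication (explicit id or content-based)
    MessageDeduplicationId: "payment-confirmed-order-123",
  }),
);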
Of course, ordering guarantees come at the cost of throughput: FIFO queues process fewer messages per second than standard queues. We may also face bottlenecks in group processing: a slow task in one group delays all subsequent tasks in that group.
Azure Service Bus
Azure Service Bus uses sessions to group related messages. Messages in a session are processed sequentially, similar to SQS FIFO, but with added features like session locks to ensure only one consumer handles a session at a time.
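A minimal sketch with the @azure/service-bus SDK, assuming a queue named "orders" with sessions enabled (the connection string and names are placeholders):

import { ServiceBusClient } from "@azure/service-bus";

const client = new ServiceBusClient("<connection-string>");

// Sending: the sessionId plays the role of the grouping key
const sender = client.createSender("orders");
await sender.sendMessages({
  body: { type: "PaymentConfirmed", orderId: "order-123" },
  sessionId: "order-123",
});

// Receiving: acceptSession takes the session lock, so only this receiver
// handles messages for "order-123" until the lock is released or expires
const receiver = await client.acceptSession("orders", "order-123");
const messages = await receiver.receiveMessages(10);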
Session locks make Azure Service Bus robust against failures, but managing them adds complexity. Throughput can be impacted if too many messages are grouped into the same session.
Kafka
Kafka achieves grouping through partitioning: messages with the same key are routed to the same partition. Each partition is processed sequentially, and partitions are assigned across consumers for parallel processing.
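With kafkajs, for instance, that boils down to publishing with a key; the broker address and topic below are placeholders:

import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "orders-app", brokers: ["localhost:9092"] });
const producer = kafka.producer();

await producer.connect();
// Messages with the same key are hashed to the same partition,
// so they're appended and consumed in order within that partition
await producer.send({
  topic: "orders",
  messages: [
    { key: "order-123", value: JSON.stringify({ type: "OrderPlaced" }) },
    { key: "order-123", value: JSON.stringify({ type: "PaymentConfirmed" }) },
  ],
});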
Still, partitioning requires careful design. Partitions represent a physical split, so poor partitioning can lead to uneven workload distribution. Within a consumer group, each partition is assigned to a single consumer, so you cannot distribute one partition's messages across multiple consumers. And when consumers fail or new ones join, partitions must be reassigned, temporarily disrupting processing.
RabbitMQ
RabbitMQ ensures message ordering within a queue. Messages are delivered in the same order they are enqueued. However, once multiple consumers subscribe to the queue, RabbitMQ balances messages across consumers in a round-robin fashion, potentially disrupting the ordering.
RabbitMQ doesn't natively support grouping like SQS FIFO or Azure Service Bus sessions. Implementing message grouping typically requires creating additional queues or custom routing logic with exchanges. To achieve that, you could:
Use one queue per group. For instance, in an e-commerce system, you can create a queue for each orderId to process updates like "Order Placed" and "Order Shipped" sequentially.
Alternatively, use consistent hashing to route messages with the same key (e.g., orderId) to the same queue, ensuring grouped ordering (see the sketch below).
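The second option is usually implemented with the rabbitmq_consistent_hash_exchange plugin. A sketch with amqplib, assuming that plugin is enabled (the queue and exchange names are illustrative):

import amqp from "amqplib";

const connection = await amqp.connect("amqp://localhost");
const channel = await connection.createChannel();

// The consistent-hash exchange hashes the routing key to pick a queue
await channel.assertExchange("orders", "x-consistent-hash", { durable: true });

// For this exchange type, binding keys are weights, not routing patterns
for (const queue of ["orders-1", "orders-2"]) {
  await channel.assertQueue(queue, { durable: true });
  await channel.bindQueue(queue, "orders", "1");
}

// All messages routed with the same key end up in the same queue
channel.publish(
  "orders",
  "order-123",
  Buffer.from(JSON.stringify({ type: "OrderPlaced" })),
);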
While creating separate queues for each group ensures strict ordering, it increases operational complexity and can lead to scaling issues: managing thousands of queues for a high-cardinality grouping key like userId can overwhelm RabbitMQ. It also increases resource usage, as each queue consumes memory and CPU, adding overhead in large-scale systems.
Guaranteeing idempotency with a queue
The generic implementation of idempotency handling looks as follows. We:
Attempt to acquire the lock for the specified key.
If the attempt wasn’t successful, return the default result.
If we acquire the lock, then run the business logic.
If it fails, release the lock. We don't want to keep the key locked; we want to let someone fix the state and retry later.
If it was successful, accept the lock and return the result.
In code it can look as follows:
async function idempotent<T>(
  handler: () => Promise<T>,
  options: IdempotentOptions<T>,
): Promise<T> {
  const { key, timeoutInMs, defaultResult, store } = options;

  // Attempt to acquire the lock for the specified key
  const acquired = await store.tryLock(key, { timeoutInMs });

  // If the key was already used, return the default result
  if (!acquired) return await defaultResult();

  try {
    // Run the business logic
    const result = await handler();

    // Mark the operation as successful
    await store.accept(key);

    return result;
  } catch (error) {
    // Release the key so someone can fix the state and retry
    await store.release(key);
    throw error;
  }
}
type IdempotentOptions<T> = {
  store: IdempotencyKeyStore;
  key: string;
  defaultResult: () => T | Promise<T>;
  timeoutInMs?: number;
};
The store can be defined as:
type IdempotencyKeyStore = {
  tryLock: (
    key: string,
    options?: { timeoutInMs?: number },
  ) => boolean | Promise<boolean>;
  accept: (key: string) => void | Promise<void>;
  release: (key: string) => void | Promise<void>;
};
The store has the following methods:
tryLock - tries to lock the specified key for a certain period of time. Returns true if the lock was acquired and false otherwise.
accept - accepts the key lock after successful handling.
release - releases the lock in case of handling failure.
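Putting the pieces together, calling it could look like this; checkInGuest and the result type are hypothetical stand-ins for your business logic:

type CheckInResult = { status: "CheckedIn" | "AlreadyProcessed" };

const result = await idempotent<CheckInResult>(
  () => checkInGuest(command), // hypothetical command handler
  {
    store, // any IdempotencyKeyStore implementation
    key: `check-in-${command.guestId}`,
    timeoutInMs: 60_000,
    // Returned when the key is already locked, i.e., for a duplicate request
    defaultResult: () => ({ status: "AlreadyProcessed" }),
  },
);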
What’s most important is that those operations should be thread-safe and atomic. Without that, we may face race conditions. The safest option is to use a distributed lock, e.g., Redis or a relational database.
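To give a flavour of that safest option: a Redis-backed store could be sketched with ioredis as below. The key prefix, default timeout, and retention choice are my assumptions, not a prescribed implementation:

import Redis from "ioredis";

const redis = new Redis();

const redisIdempotencyKeyStore: IdempotencyKeyStore = {
  // SET ... NX is atomic: only one caller can claim the key before it expires
  tryLock: async (key, options) =>
    (await redis.set(
      `idempotency:${key}`,
      "pending",
      "PX",
      options?.timeoutInMs ?? 60_000,
      "NX",
    )) === "OK",
  // On success, drop the expiry so later duplicates keep being rejected
  // (a real system would pick an explicit retention period instead)
  accept: async (key) => {
    await redis.persist(`idempotency:${key}`);
  },
  // On failure, delete the key so the operation can be retried
  release: async (key) => {
    await redis.del(`idempotency:${key}`);
  },
};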
But we’re not here today to choose the safest options but to learn!
What if we had an implementation that ensures we schedule the processing of a certain task one at a time? Actually, we already have one from an older edition.
Let’s join those forces together!
We'll use the queue broker's single-writer capabilities to ensure that only a single check for the existence of a specific idempotency key runs at a time.
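To give you a taste of where this is heading, the shape could be something like the sketch below. The QueueBroker API here (enqueue a task under a group id; tasks in the same group run one at a time, in order) is my assumption based on the earlier edition, not the final implementation:

// Hypothetical QueueBroker API from the earlier edition: tasks enqueued
// under the same group id run sequentially; different groups run concurrently
type QueueBroker = {
  enqueue: <T>(groupId: string, task: () => Promise<T> | T) => Promise<T>;
};

const inMemoryIdempotencyKeyStore = (
  broker: QueueBroker,
): IdempotencyKeyStore => {
  const keys = new Map<
    string,
    { status: "pending" | "accepted"; expiresAt: number }
  >();

  return {
    // The broker serializes all operations for the same key, so the
    // check-then-set below can't interleave with another caller's
    tryLock: (key, options) =>
      broker.enqueue(key, () => {
        const existing = keys.get(key);
        const stillHeld =
          existing &&
          (existing.status === "accepted" || existing.expiresAt > Date.now());
        if (stillHeld) return false;

        keys.set(key, {
          status: "pending",
          expiresAt: Date.now() + (options?.timeoutInMs ?? 60_000),
        });
        return true;
      }),
    accept: (key) =>
      broker.enqueue(key, () => {
        keys.set(key, { status: "accepted", expiresAt: Infinity });
      }),
    release: (key) =>
      broker.enqueue(key, () => {
        keys.delete(key);
      }),
  };
};

Because every operation on the same key funnels through the same single-writer group, the check-then-set in tryLock is race-free without any distributed lock.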