Deduplication in Distributed Systems: Myths…

Nov 25, 2024

This week, we'll discuss deduplication strategies. We'll see whether they're useful and consider scenarios where you may need them. We'll also do a reality check on the promises of exactly-once delivery made by messaging vendors. TLDR: they're broken.

6 Comments

Szymon Bernad

Nov 28

In the Azure ServiceBus example I don't know what this block is supposed to be doing:

> if (this.sessionCache.hasMessage(sessionId, messageId)) {
>   context.ack(deduplicationCache.get<T>(sessionId, messageId));
>   return;
> }

Is deduplicationCache different than the sessionCache? Or was it a copy-paste error?

Oskar Dudycz

Dec 2

Essentially, it’s the same logic; the difference is the lifespan and scope of the two options. In Azure Service Bus, it’s explicitly tied to the session. Idempotence is not guaranteed between sessions (or partitions, as ASB supports either session scope or partition scope).
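
To make the scope point concrete, here is a minimal sketch of a deduplication cache keyed by session first and message ID second. The class and method names (`SessionDedupCache`, `closeSession`) are hypothetical and are not part of the Azure SDK or of the article's code; the sketch only illustrates that a hit is possible within one session and that entries live as long as the session does.

```typescript
// Hypothetical session-scoped deduplication cache (illustrative names only).
class SessionDedupCache<T> {
  // sessionId -> (messageId -> cached processing result)
  private readonly sessions = new Map<string, Map<string, T>>();

  has(sessionId: string, messageId: string): boolean {
    const session = this.sessions.get(sessionId);
    return session !== undefined && session.has(messageId);
  }

  get(sessionId: string, messageId: string): T | undefined {
    const session = this.sessions.get(sessionId);
    return session === undefined ? undefined : session.get(messageId);
  }

  set(sessionId: string, messageId: string, result: T): void {
    let session = this.sessions.get(sessionId);
    if (session === undefined) {
      session = new Map<string, T>();
      this.sessions.set(sessionId, session);
    }
    session.set(messageId, result);
  }

  // Dropping a session drops its entries: the cache lifespan is the session's.
  closeSession(sessionId: string): void {
    this.sessions.delete(sessionId);
  }
}

const cache = new SessionDedupCache<string>();
cache.set("session-A", "msg-1", "processed");

const sameSession = cache.has("session-A", "msg-1");  // duplicate detected
const otherSession = cache.has("session-B", "msg-1"); // same ID, other session: not detected
cache.closeSession("session-A");
const afterClose = cache.has("session-A", "msg-1");   // session gone, entry gone
```

The same message ID arriving through a different session, or after the session has ended, is treated as new, which is why this kind of guarantee cannot replace idempotent handling in the application.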

Szymon Bernad

Dec 2

But what about sessionCache and deduplicationCache being used at the same time? I don't get this part.

Oskar Dudycz

Dec 2 (edited)

Ah ok, you're right, my bad! It's a copy-paste issue. Sorry for that; I just fixed it!

Sergey Pichkurov

Dec 5

Great read, thank you (a bit of a pity GCP is not included).

BTW, Kafka Streams supports full EOS (exactly-once semantics), which means duplicates should not be a concern if your processing logic is Kafka/KS (Kafka Streams) only. That fits most EDA processing patterns, at the cost of committing to the stateful and far-from-lightweight (especially OOTB) nature of KS, although it's tunable, especially in releases after 2.8.

Anyway, ALO (at-least-once) semantics still prevail, so (as pointed out in this and other articles) the key is to make business logic idempotent. In this light, I'm not quite sure what the concept of a (standalone, stateful) Broker can add on top of cloud-native architectures built on modern service buses. I.e., the broker can deduplicate, right, but then it might face the same issue when delivering from the Broker to downstream services, which again requires idempotency on the consumer end?

Oskar Dudycz

Dec 7

> BTW, Kafka Streams supports full EOS semantics, which means duplicates should not be a concern if your processing logic is Kafka/KS (Kafka Streams) only.

Yup, that's the benefit if you fully control the processing. Another example is an event store implementation on top of a relational database. If you add a table with monotonic checkpoints, you can check whether the event's position has already been processed (that is, whether the checkpoint stored in the database is already at or past the event position). If it is, you can skip committing the transaction and avoid making duplicate changes.

Still, that works only if the storage supports such usage and all changes are made within the same storage. That's why Kafka Streams keeps its local state in RocksDB: to be able to handle the additional capabilities it needs.
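
The checkpoint idea can be sketched as follows. This is a minimal in-memory illustration, not a real event store: `CheckpointedProjection` and `StoredEvent` are hypothetical names, and the two field updates stand in for writes that, in a relational database, would be committed in a single transaction together with the checkpoint row.

```typescript
// Hypothetical event shape: position is the monotonic position in the stream.
type StoredEvent = { position: number; amount: number };

class CheckpointedProjection {
  private checkpoint = 0; // last processed position (0 means nothing processed yet)
  private total = 0;      // example read-model state

  // Returns true when the event was applied, false when it was skipped as a duplicate.
  handle(event: StoredEvent): boolean {
    // A stored checkpoint at or past the event position means "already processed".
    if (this.checkpoint >= event.position) {
      return false;
    }
    // Assumption: in a database, these two writes share one transaction,
    // so the state change and the checkpoint either both commit or both roll back.
    this.total += event.amount;
    this.checkpoint = event.position;
    return true;
  }

  get state(): { checkpoint: number; total: number } {
    return { checkpoint: this.checkpoint, total: this.total };
  }
}

const projection = new CheckpointedProjection();
const results = [
  { position: 1, amount: 10 },
  { position: 2, amount: 5 },
  { position: 2, amount: 5 }, // redelivery of an already processed event
].map((event) => projection.handle(event));
```

The redelivered event at position 2 is skipped, so the total stays at 15. The approach depends on the state and the checkpoint living in the same storage; if they were written separately, a crash between the two writes could still produce a duplicate or a lost update.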

> Anyway, ALO semantics still prevail, so (as pointed out in this and other articles) the key is to make business logic idempotent. In this light, I'm not quite sure what the concept of a (standalone, stateful) Broker can add on top of cloud-native architectures built on modern service buses.

Yup, that's why I'm claiming that Exactly-Once Delivery is a myth. It only ever holds in terms of the bus's internal logic, not in terms of application processing. As you said, you always need to take care of it on your own.
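
As a rough illustration of taking care of it on the application side, the sketch below shows a handler that stays correct under at-least-once delivery by remembering the IDs of messages it has already applied. All names (`PaymentHandler`, `PaymentRecorded`) are hypothetical, and the in-memory `Set` stands in for a durable store that would be updated atomically with the business state.

```typescript
// Hypothetical message shape; messageId is assumed to be stable across redeliveries.
type PaymentRecorded = { messageId: string; accountId: string; amount: number };

class PaymentHandler {
  private readonly processedIds = new Set<string>();
  private readonly balances = new Map<string, number>();

  handle(message: PaymentRecorded): void {
    // Redelivered message: the effect was already applied, so do nothing.
    if (this.processedIds.has(message.messageId)) {
      return;
    }
    const current = this.balances.get(message.accountId) ?? 0;
    this.balances.set(message.accountId, current + message.amount);
    // Assumption: in production this marker is persisted together with the balance.
    this.processedIds.add(message.messageId);
  }

  balanceOf(accountId: string): number {
    return this.balances.get(accountId) ?? 0;
  }
}

const handler = new PaymentHandler();
const payment: PaymentRecorded = { messageId: "m-1", accountId: "acc-1", amount: 100 };

// Simulate at-least-once delivery: the broker hands over the same message three times.
handler.handle(payment);
handler.handle(payment);
handler.handle(payment);

const balance = handler.balanceOf("acc-1");
```

The balance ends up at 100 regardless of how many times the message is redelivered, so the handler does not depend on any delivery guarantee beyond at-least-once.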

Architecture Weekly