Architecture Weekly #163 - 22nd January 2024
Welcome to the new week!
Last week, I wrote that we should not optimise our code for reusability. We should reflect our business process as it is to run it efficiently. The size of the code is orthogonal to that. We do not always cut all redundancies; sometimes, we add them.
Today, I continue those considerations. I look at stream IDs, event type prefixes and other event data, and discuss whether you should slice them off or keep them.
Just like we design our code with the reader in mind, we should also design our events. We need to remember that domain logic is just one of their uses; we also record events to build read models and to integrate with other parts of the system. Read my article for a nuanced explanation of those considerations.
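To make that concrete, here's a hypothetical sketch (the event name and fields are mine, not from the article) of a "self-contained" event that deliberately keeps data a strict normaliser might slice off:

```python
from dataclasses import dataclass

# Hypothetical example: an event that repeats data (guest name, room type)
# which a strict normaliser would cut. The redundancy lets read models and
# other modules consume the event without extra lookups or joins.
@dataclass(frozen=True)
class RoomReserved:
    reservation_id: str   # also encoded in the stream id, kept here anyway
    guest_id: str
    guest_name: str       # redundant with the guest's own stream
    room_type: str
    nightly_rate_cents: int
    nights: int

    def total_cents(self) -> int:
        return self.nightly_rate_cents * self.nights

event = RoomReserved("res-42", "guest-7", "Ada Lovelace", "Deluxe", 12_000, 3)

# A read model line can be built from the event alone, no joins needed:
summary = f"{event.guest_name}: {event.nights} nights, {event.total_cents()} cents"
```

The point is not the specific fields but the trade-off: a slightly bigger event in exchange for consumers that don't need to look elsewhere.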
Events (schema) versioning is a boogeyman for people learning Event Sourcing: a spooky tale told at the campfire. There's some truth in it, as migrations are always challenging. As time flows, the events' definitions may change. Our business changes, and we need to add more information. Sometimes, we have to fix a bug or modify a definition for a better developer experience.
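As a taste of one such pattern, here's a minimal, hypothetical sketch of "upcasting": translating an old event payload into the current schema at read time, so the stored events never need migration. The event type, field names and version numbers are illustrative only:

```python
import json

# Hypothetical sketch: v2 of UserRegistered split "full_name" into
# "first_name"/"last_name" and added a "source" field. An upcaster
# maps old payloads to the new shape on deserialisation.

def upcast_v1_to_v2(payload: dict) -> dict:
    first, _, last = payload.pop("full_name").partition(" ")
    payload["first_name"] = first
    payload["last_name"] = last
    payload.setdefault("source", "unknown")  # new field with a safe default
    return payload

# Registry keyed by (event type, stored schema version).
UPCASTERS = {("UserRegistered", 1): upcast_v1_to_v2}

def deserialise(event_type: str, version: int, raw: str) -> dict:
    payload = json.loads(raw)
    upcaster = UPCASTERS.get((event_type, version))
    return upcaster(payload) if upcaster else payload

stored = '{"user_id": "u-1", "full_name": "Grace Hopper"}'
event = deserialise("UserRegistered", 1, stored)
# event now matches the v2 shape: first_name, last_name, source
```

The stored stream stays untouched; only the read path knows about old versions.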
I invite you to the next webinar for Architecture Weekly paid subscribers: “Simple patterns for events schema versioning”. This time, there's no special guest, just me, but I hope that's fine too!
I’ll show you these patterns in practice, getting my hands dirty in code. I’m sure that after this webinar, the common scenarios won’t be too scary for you. It’ll also be a good chance to discuss and ask questions about unclear scenarios. So even though it’s a webinar, I intend it to be interactive!
The webinar will happen on Thursday, January 25th, at 6 PM CET (UTC+1) and will last 1-1.5 hours, depending on the number of questions.
Read all the details:
Become a paid subscriber and join us live!
The talk that I probably enjoyed the most in the last Domain-Driven Design Europe was the one by Andreas Pinhammer. Why?
Many people ask me how to apply DDD in a "real project" or introduce it to a "large existing project". This talk gives a decent case study and a fresh look.
Andreas explained their journey of introducing DDD to the complex insurance domain. He showed how they tried to apply the latest-and-greatest recommended tooling, how they stumbled along the way, and what they eventually came up with.
I liked the conclusion around the platform team and that, in their case, it wasn't "kubernetes-and-all-that-technical-stuff-core-team", but more a business back-office focused on enabling other business modules for other business teams. I think that's how it should be done.
Watch the talk; it's crisp, just 38m. It's not revolutionary, but it nicely explains how you can start applying DDD and what to watch for.
In the same spirit, check also an excellent article from Ryan Shriver showing how Domain-Driven Design tools can help you modernise architecture. I liked that it presented the context in which tools like the C4 Model, Event Storming, and Message Flows and Bounded Context diagrams can play together and when to use them:
In architecture modernisation, it’s essential to understand what to leave as is, what to polish, and what to rewrite. I don’t like the term “tech debt”; I think it looks at the real problem from the wrong perspective. That’s something I’ll cover at some point here or on the blog. Nevertheless, the issue is real, no matter how we phrase it. I liked Pete Hodgson’s take on this topic:
He explained his approach called “Tech Debt Wall”
A tech debt wall is a 2-dimensional map used to track a codebase’s tech debt as individual issues. The Y-axis of this map represents value - how valuable would it be to fix the issue. The X-axis represents cost - roughly how expensive would it be to fix the issue. Whenever an engineer notices a piece of tech debt, they write a brief description on a sticky (or the virtual equivalent) and place it in the appropriate place on the wall, based on their approximation of the value of fixing the issue and how much it would cost.
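The wall is essentially a set of (value, cost) pairs, so it can be sketched in a few lines of code. This is my own toy illustration; the quadrant names and threshold are assumptions, not from Pete's article:

```python
# Toy model of the tech debt wall: each sticky is a (value, cost) pair,
# and the quadrant it lands in suggests what to do with it.
# The threshold and quadrant labels are illustrative assumptions.

def quadrant(value: int, cost: int, threshold: int = 5) -> str:
    high_value = value >= threshold
    low_cost = cost < threshold
    if high_value and low_cost:
        return "quick win"
    if high_value:
        return "plan as a project"
    if low_cost:
        return "fix opportunistically"
    return "probably not worth it"

stickies = [
    ("flaky integration tests", 8, 2),
    ("rewrite legacy billing module", 9, 9),
    ("rename confusing helper", 3, 1),
    ("migrate dead admin panel", 2, 8),
]
wall = {name: quadrant(value, cost) for name, value, cost in stickies}
```

Sorting or bucketing like this is what makes the wall actionable: the high-value, low-cost corner is where the team should look first.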
Read the article to understand how to build it and to look at the graphical examples.
Speaking of modernisation: Martin Fowler updated his guide on Continuous Integration. It’s a revamped version, adjusted to modern practices.
One of the updated pieces is a guide around code reviews and feature branches:
The pre-integration code review can be problematic for Continuous Integration because it usually adds significant friction to the integration process. Instead of an automated process that can be done within minutes, we have to find someone to do the code review, schedule their time, and wait for feedback before the review is accepted. Although some organizations may be able to get to flow within minutes, this can easily end up being hours or days - breaking the timing that makes Continuous Integration work.
I don’t entirely agree with this part. I think the biggest issue comes when such a review happens on a long-living branch with many changes. Throwing away code reviews before merging is not a solution on its own, as people may still not find the time or even feel motivated to do them afterwards. The process is not as simple as choosing whether we do code reviews before or after the merge.
Incidents are often triggers for changing process and architecture; that’s what happened to Slack. They had a major network disruption that caused a cascading failure, even though they deployed across multiple availability zones (AZs). The isolation turned out not to be as good as it should have been, and a single AZ failure led to significant user-visible disruptions. To address this, Slack adopted a "siloing" strategy, where each AZ operates independently, reducing the impact of localised failures. They based that on the AWS Cell-Based Architecture.
The core innovation was the introduction of an "AZ drain button," enabling quick traffic rerouting (“draining”) away from problematic AZs. As they wrote:
AZs are cells, and cells may be drained. Like a lot of satisfying infrastructure work, an AZ drain button is conceptually simple yet complicated in practice. The design goals we chose are:
1. Remove as much traffic as possible from an AZ within 5 minutes. (…)
2. Drains must not result in user-visible errors. (…)
3. Drains and undrains must be incremental. (…)
4. The draining mechanism must not rely on resources in the AZ being drained. (…)
This was achieved through the Envoy xDS ecosystem, allowing for efficient traffic management. This cell-based approach has enhanced Slack's system resilience, ensuring that issues in one AZ don't affect the entire network.
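To illustrate the incremental-drain idea (this is a toy model of mine, not Slack’s actual implementation and not real Envoy xDS configuration), imagine per-AZ traffic weights that are reduced stepwise and redistributed to the healthy AZs:

```python
# Toy model of incremental draining: instead of flipping an AZ's traffic
# to zero, a drain removes a percentage of its weight per step and spreads
# it evenly over the remaining AZs, mirroring the goal that drains and
# undrains must be incremental. All names and numbers are illustrative.

def drain(weights: dict[str, int], az: str, percent: int) -> dict[str, int]:
    """Return new weights with `percent` of the AZ's traffic removed,
    redistributed evenly across the other AZs."""
    removed = weights[az] * percent // 100
    new = dict(weights)
    new[az] -= removed
    others = [a for a in new if a != az]
    for a in others:
        new[a] += removed // len(others)
    return new

weights = {"az-1": 100, "az-2": 100, "az-3": 100}
step1 = drain(weights, "az-1", 50)    # partial drain: az-1 keeps half
step2 = drain(step1, "az-1", 100)     # full drain: az-1 serves nothing
```

Because each step is small and reversible, an "undrain" is just the same operation with the weights shifted back.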
Staying with materials provided by tool creators, but jumping to another topic: modelling.
I see that many people struggle to model denormalised data correctly. For many years, we were taught how to normalise our storage for relational databases. That got so deep under our skin that we try to keep those practices even though most cloud-native databases do not match them. There are not many great resources on key-value storage modelling. One of those I can recommend is the materials from MongoDB:
Of course, they’re biased by the tool perspective, but if you’re looking for a starting point, they’re a decent place to begin.
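For a flavour of the difference, here’s a small, illustrative contrast between a normalised relational shape and a denormalised document shape; the collection and field names are made up:

```python
# Normalised (relational-style): order lines live in a separate table
# and must be joined back by order_id on every read.
orders = [{"order_id": 1, "customer": "Ada"}]
order_lines = [
    {"order_id": 1, "sku": "BOOK-1", "qty": 2},
    {"order_id": 1, "sku": "PEN-9", "qty": 1},
]

# Denormalised (document-style): the lines are embedded in the order,
# so reading an order is a single key-value lookup, no join needed.
order_doc = {
    "order_id": 1,
    "customer": "Ada",
    "lines": [
        {"sku": "BOOK-1", "qty": 2},
        {"sku": "PEN-9", "qty": 1},
    ],
}

# The whole aggregate is available in one read:
total_qty = sum(line["qty"] for line in order_doc["lines"])
```

The document shape optimises for the access pattern (read the whole order at once) at the price of the redundancy and update discipline that normalisation was designed to avoid.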
I’ll end this release with a call for responsibility.
Have you heard about the UK Post Office scandal? You really need to check the coverage:
More than 900 sub-postmasters and postmistresses were prosecuted after faulty software wrongly made it look like money was missing from their branches. (…)
Horizon was introduced by the Post Office in 1999. The system was developed by the Japanese company Fujitsu, for tasks like accounting and stocktaking.
Sub-postmasters complained about bugs in the system after it falsely reported shortfalls - often for many thousands of pounds.
Some attempted to plug the gap with their own money, as their contracts stated that they were responsible for any shortfalls. Many faced bankruptcy or lost their livelihoods as a result.
After 20 years, campaigners won a legal battle to have their cases reconsidered. To date only 93 convictions have been overturned. Under government plans, victims will be able to sign a form to say they are innocent, in order to have their convictions overturned and claim compensation.
The intriguing part is that the software is still in use. As our industry progresses, our responsibility should progress too, along with the quality of our delivery. Now I’m curious about the systems that will be built by putting prompts into “Generative AI”. Unfortunately, I foresee even worse complications.
See also the whitepaper with a telling title from Adam Tornhill and Markus Borg.
It concludes with:
This benchmarking study shows that AI is nowhere near replacing humans in a coding context; today’s AI is simply too error-prone, and far from a point where it is able to securely modify existing code. However, by introducing a novel fact-checking model for the AI output, we can elevate generative AI to a point where it is genuinely useful as several complex code smells can be mitigated safely. This allows us to optimize for understanding – the dominant and most human-intensive aspect – not just the narrow task of writing new code.
Check also other links!
p.s. I invite you to join the paid version of Architecture Weekly. It already contains the exclusive Discord channel for subscribers (and my GitHub sponsors), monthly webinars, etc. It is a vibrant space for knowledge sharing. Don’t wait to be a part of it!
p.s.2. Ukraine is still under brutal Russian invasion. A lot of Ukrainian people are hurt, without shelter and need help. You can help in various ways, for instance, directly helping refugees, spreading awareness, and putting pressure on your local government or companies. You can also support Ukraine by donating, e.g. to the Ukraine humanitarian organisation, Ambulances for Ukraine or Red Cross.
The AHA Stack - Combine Astro, htmx and Alpine.js to create modern web applications sending HTML over the wire, replacing the SPA JS-heavy approach with a much simpler set of mental models and workflows.