Architecture Weekly #192 - Talk is cheap, show me the numbers! Benchmarking and beyond!
On benchmarking foundations for performance issue analysis
Boy, it's the fourth episode of the refreshed format, time flies!
We started by learning the importance of connection pooling, discussing how to implement it, and then plot-twisting to explain why what we learned so far is not yet production-ready and what it actually means to be production-ready!
Lots of talking, especially considering that those topics were there mostly to trigger discussions on different deployment strategies (e.g., serverless or not), how and when to scale, and why queuing is essential. Why so much talking? Because it's cheap! It's cheap to discuss and analyse tradeoffs before we make final decisions. Yet, we can't fall into analysis paralysis, constantly going back and forth; at some point, we'll hear:
Talk is cheap; show me the numbers!
If we don't, we'll see them. And I intend to show you the real numbers today.
But before I do, I need to tell you a secret. I'm writing those articles not only for you but also for me, maybe even primarily for me, as I'm writing them not to forget. Forget about what? About what I've learned. But I hope you'll forgive me for this selfishness, as there's an additional benefit for you: they're real. They're showing my journey and are the outcome of my real architecture considerations.
Currently, I'm building two Open Source tools: Emmett and Pongo. Both are Node.js storage tools. Emmett intends to bring Event Sourcing to the JS/TS community and take their applications back to the future. Pongo is like Mongo, but on PostgreSQL, joining the Mongo experience with PostgreSQL's consistency and capabilities. So far, the feedback is positive. But it's an intriguing journey touching multiple aspects, not only technical but also product-building, as I want to make it sustainable for me and useful for users.
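To give a feel for what that means in practice, here's a minimal sketch of the Mongo-like API Pongo aims for; the import and method names are my assumptions about how such a client looks, so check Pongo's docs rather than treating this as a reference:

```typescript
import { pongoClient } from '@event-driven-io/pongo';

// Assumed package and client names - illustrative, not a verified reference.
type User = { name: string; age: number };

const pongo = pongoClient('postgresql://localhost:5432/postgres');
const users = pongo.db().collection<User>('users');

// Documents end up as JSONB rows in PostgreSQL, so you get Mongo-style
// ergonomics with PostgreSQL consistency underneath.
await users.insertOne({ name: 'Anita', age: 25 });
const anita = await users.findOne({ name: 'Anita' });
```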
The hidden trap of building something useful is that people will use it, which can cause some trouble. I'm a big fan of the slogan "Make It Work, Make It Right, Make It Fast". Why? From the C2 Wiki:
If it doesn't work right, in what sense does it work at all?
Here's my interpretation. First crank out code that handles one common case (MakeItWork). Then fix all of the special cases, error handling, etc. so all tests pass (MakeItRight).
Another interpretation, "Make it right" means make the code more clear, i.e., refactor. "Make it work" is the part about getting the code to operate correctly. A rephrase might be, "Make it work correctly, make the source code clear, make it run quickly."
So, I think that what's in Emmett and Pongo works. Based on the feedback, it sounds kind of right. Is it fast? Yes and no.
The issue
That's how we got to the troubles users can cause. They might want to use your tool, deploy it, and verify whether it works and how fast it works. And they may find out that it doesn't work as they expected.
Jokes aside, having early adopters and getting a feedback loop is crucial. I'm lucky to have people like Fernando who have contacted me and said they observed something weird.
He said that locally, hosting PostgreSQL in a Docker container, the application was running instantly and super fast. Still, he never got a request processing time below one second when deployed to the target environment. That might not have been an issue a few years back, but it definitely sounds unacceptable nowadays. I told you that I'm lucky, and I am.
Fernando is building a serverless-first product, which is a common case for startups or those in the early phases of development. With this model, you pay only for what you use. That means each request will (or can) spin up a new stateless environment. We cannot have a shared state between the calls, e.g., we cannot share a connection pool.
That's a challenge but also a valid use case that I want to support. One of the reasons why I selected the Node.js environment to build my tools is that it helps me deliver faster and more sustainably.
We already discussed in Mastering Database Connection Pooling that there are "Connection Pool as a Service" offerings.
They can keep the shared connection, but they can cause issues if you're using an application-level proxy like pg pool.
I used connection pooling as a default design choice for both Emmett and Pongo. They're sharing a common dependency called Dumbo, which is responsible for connection management and running database queries.
Nevertheless, I added the option to turn off pooling to support the "Connection Pool as a Service" scenario. Knowing his use case, I also recommended that option earlier to Fernando, and that's also one of the reasons for the performance issue. We're getting there.
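To make the difference concrete, here's a minimal sketch using plain node-postgres rather than Dumbo's internals: the pooled variant keeps a long-lived, shared pool, while the non-pooled variant opens and closes a connection per call and leaves pooling to an external service:

```typescript
import { Pool, Client } from 'pg';

const connectionString = process.env.DATABASE_URL!;

// Pooled: a long-lived pool shared between requests (classic server deployment).
const pool = new Pool({ connectionString, max: 10 });

export const queryPooled = async (sql: string, params: unknown[] = []) => {
  // pool.query checks out a connection, runs the query, and returns it to the pool.
  const result = await pool.query(sql, params);
  return result.rows;
};

// Non-pooled: a fresh client per call (serverless-friendly), with pooling
// delegated to an external proxy such as the provider's pooler or PgBouncer.
export const queryNonPooled = async (sql: string, params: unknown[] = []) => {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    const result = await client.query(sql, params);
    return result.rows;
  } finally {
    await client.end();
  }
};
```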
The Initial Analysis
Emmett and Pongo are logically simple tools; as the sketch after this list shows, they're:
taking the request (reading events, appending new ones, filtering documents, updating them, etc.),
translating that into SQL,
getting a database connection (and opening a transaction if it's a write request),
executing the SQL query,
committing the transaction, if one was open, and closing the connection.
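Here's a rough, illustrative sketch of that flow using plain node-postgres rather than Dumbo's actual code; the helper name and table are made up, and only the connection and transaction lifecycle matters:

```typescript
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Illustrative only: in Emmett/Pongo the database work is delegated to Dumbo,
// but the lifecycle is the same - get a connection, run SQL, commit, release.
export const appendEvents = async (streamId: string, events: object[]) => {
  const client = await pool.connect(); // get a database connection
  try {
    await client.query('BEGIN'); // writes run inside a transaction
    for (const event of events) {
      await client.query(
        'INSERT INTO events (stream_id, data) VALUES ($1, $2)',
        [streamId, JSON.stringify(event)],
      );
    }
    await client.query('COMMIT');
  } catch (error) {
    await client.query('ROLLBACK');
    throw error;
  } finally {
    client.release(); // return the connection to the pool
  }
};
```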
I'm outsourcing the database work to the PostgreSQL driver, so I'm not doing any sneaky things that could complicate this simple process.
That's why the first step was to rule out external and the most obvious issues. What could they be?
Deployment specifics? Serverless, in general, is vulnerable to the cold start issue. A cold start happens when a function is invoked after not being called for some time; the whole environment then needs to be started. The next request can reuse this environment without additional delay. After some period of inactivity, the environment is "put to sleep" again, so you don't use resources and don't pay for them.
Still, that wasn't the issue here, as we verified that the slowness was happening constantly, not only on the first request.
Some deployments are different. For example, Cloudflare Workers use a custom, slimmed-down Node.js; even AWS recently introduced its own Node.js runtime.
But it wasn't that; it also happened when running locally against the actual database.
The distance between the database and the application hosting? If they were hosted in different regions (e.g., the database in the USA and the application in Europe), that could add significant latency.
It wasn't that either; both deployments were in nearby regions.
So maybe the issue is with the database? It could have been underprovisioned (i.e., short on the resources it needed). If you're using innovative hosting like NeonDB or Supabase, which add their own sprinkles to make PostgreSQL serverless and cheap, then maybe those sprinkles are the reason.
Nah, running against regular PostgreSQL hosted on Azure RDS gave the same subpar results.
Then maybe something in the application was causing delays, for instance, an unusually big payload? That could increase the latency between the application and the database and add to the serialisation time.
Nope, the payloads were even smaller than they typically are.
So we're back to the initial point, and the thesis is that it can be a bug in Emmett or Pongo.
Show me the numbers!
If that sounds like guessing, then you're right; calling it analysis is a bit of an exaggeration. But it's a necessary step to avoid dumb mistakes and to understand the deployment specifics, data usage, and scenarios. That's needed to reproduce the issue and set up benchmarks. Yes, benchmarks, because we've finally got to the numbers.
I had reached the moment when I knew the primary issue. I wanted to know more about the reasons, so I had to choose the tooling. It worked and it was right, but it wasn't fast, and to fix that, benchmarks were needed.
I chose Benchmark.js, as it's the most popular microbenchmarking tool in Node.js. I didn't need load tests yet, as I wanted to pinpoint the specific issue and make the initial improvement. Each development environment has its own benchmarking tools; I'll show you the syntax, but it's not as critical as understanding what we're doing.
The first step was reproducing the intended usage and confirming that we were getting similar numbers.
I've set up a NeonDB free-tier database in a region close to my local machine. This database is underprovisioned for heavier usage, and that's fine, as such a setup should make the observed issue even more pronounced.
I've set up the test with the event store connected to that database (of course, making the connection configurable) and doing basic operations like appending and inserting. It looked like this:
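A minimal sketch of what such a setup can look like, assuming Emmett's @event-driven-io/emmett-postgresql package with a getPostgreSQLEventStore / appendToStream API (the names and event shape are assumptions, not the exact original code):

```typescript
import { getPostgreSQLEventStore } from '@event-driven-io/emmett-postgresql';

// Configurable connection string, so the same test can run against
// a local Docker container or the NeonDB free-tier database.
const connectionString =
  process.env.BENCHMARK_POSTGRESQL_CONNECTION_STRING ??
  'postgresql://postgres:postgres@localhost:5432/postgres';

// Assumed API: the factory may also accept pooling options.
const eventStore = getPostgreSQLEventStore(connectionString);

// A basic operation to benchmark: append a single event to a stream.
export const appendGuestCheckedIn = (guestId: string) =>
  eventStore.appendToStream(`guest_stay-${guestId}`, [
    { type: 'GuestCheckedIn', data: { guestId, checkedInAt: new Date() } },
  ]);
```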
Having that, I could set up the benchmark code:
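A simplified sketch of how Benchmark.js can wrap such an asynchronous operation; deferred benchmarks are needed because the operation returns a promise (the suite below reuses the hypothetical appendGuestCheckedIn helper from the previous snippet):

```typescript
import Benchmark from 'benchmark';
import { randomUUID } from 'node:crypto';
// Hypothetical helper from the setup sketched above.
import { appendGuestCheckedIn } from './event-store';

const suite = new Benchmark.Suite('event store');

suite
  .add('appendToStream (single event)', {
    // defer: true lets Benchmark.js measure promise-based operations.
    defer: true,
    fn: async (deferred: Benchmark.Deferred) => {
      await appendGuestCheckedIn(randomUUID());
      deferred.resolve();
    },
  })
  .on('cycle', (event: Benchmark.Event) => {
    // Prints ops/sec and the relative margin of error for each case.
    console.log(String(event.target));
  })
  .on('complete', () => {
    console.log('Benchmarks finished');
  })
  .run({ async: true });
```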
The next part of the article is for paid users. If you're not one yet, you can use a free month's trial until the end of August: https://www.architecture-weekly.com/b3b7d64d. Check it out and decide if you like it and want to stay. I hope that you will!