I’ve been a backend engineer for a while. Like, a while. Here is an actual picture of me designing my most recent backend architecture:
It's been my experience that almost all architectures come down to a small set of general patterns. In this article I’m going to lay out the various patterns, what problems they solve, and when they are appropriate.
What is a backend system? A backend system is a centralized place where data is processed, stored and made available for access. Does that sound super general? It is! But there are some important details in that description. For one, it's a “centralized place”. It doesn’t happen in your browser, or on your phone, or inside a random frog. It happens on a server somewhere, at a location that is well known (i.e. it has an address like google.com). This well-known location processes and stores data - for example, when you send a direct message to someone it goes to a backend system where it gets stored. Then later, the recipient can see that message by logging into the system and looking at their messages.
The most critical thing that backend systems do is care for data. Imagine you are sending someone a saucy DM, and you typed “I never want you to leave,” but the system mangled your message and what they received was “I want you to leave.” Or, maybe worse, your message gets sent 1000 times. Or it says it was sent, but it never was. Or it was sent to the wrong person. Good backend architecture guarantees that these things never happen.
Backend techniques we’re going to cover in this article:
- Synchronous processing, Synchronous error handling
- Asynchronous processing, Synchronous error handling
- Asynchronous processing, Asynchronous error handling
- Broadcasting without guarantees
- Broadcasting with guarantees
- Batching - when and how to do it
Overview
All of the above backend systems are composed of these building blocks:
System input -> processing (stateless service) -> sink (stateful service)
where each sink can become the system input for another process.
Why do we split stateless processing from stateful sinks? Because it makes everything much easier. Let’s talk about what we need to do:
- Safeguard the data by making sure something is always responsible for its safety
- Handle all the requests that come to us without overloading and crashing too much
- Validate data before accepting it
These things can be shockingly difficult to accomplish, especially at scale. It's easy to set yourself up for failure - like maybe you try to batch requests in your stateless service before putting them into a sink for performance reasons. What happens when the service unexpectedly goes down? You told the user that you took responsibility for the data, but then dropped their baby on the floor before handing it off. Poor form. Party foul.
Or maybe you think to yourself, this is a small little thing, we won’t need to worry about scaling. We’ll just have one server and store state in memory. If it gets wiped no big deal, the user just gets logged out or whatever. Cool cool. Now your service is crashing constantly because people are actually using it and you keep running out of memory. You’re at the biggest server your cloud service provides, it's costing an arm and a leg and now you need to redesign your entire system in the middle of a running production outage. Not fun. Not fun for you and not fun for your users.
Or maybe you think, it's too hard to get the user to provide us with valid data, we’ll just accept whatever they give us and sort it out later. You have just opened yourself up to an absolute world of pain. Now every time the user posts some new nonsense data you have to comb through it, determine what fields mean what and try to massage it into something you can make sense of. This is a special circle of integration hell, where nothing works and no one knows why. A squishy interface is never OK.
We split things out into stateless and stateful services because doing so solves these problems. You have guarantees about data integrity. You can (more) easily scale your system when you inevitably need to. You have a specific place that is responsible for gatekeeping and rejecting bad data before it gets into storage. I cannot overstate this enough - the more guarantees you put at the edge of your system, the simpler all of your downstream systems will be, the fewer bugs you will have and the more people will like you.
Stateless-ness
A stateless service provides only one guarantee - a request is either processed successfully, or it is not. If the service dies in the middle of a request, the client is guaranteed to know that the request failed. The client will not be given a cheery 200 response, then find out later that their request was dropped on the floor.
Stateless services have no memory of what has happened, or what requests they did or did not process. They don’t guarantee that the data will be there later. They don’t guarantee that you will find love. They have one purpose in life, and that is handling requests, telling you if they succeeded or failed, and then immediately forgetting about them. They never (never!) take responsibility for the data. If something takes responsibility for data, it is by definition a stateful service.
Imagine a stateless service as a bank teller. They sit in front of the bank, and they provide access to your money (data), but they don’t actually keep your money (data) in their pockets. They take it from you and put it directly into the bank, or they take it from the bank and give it to you, but they never say, ‘Oh, I’ll just put this in my pocket and put it in the bank later’.
Examples of stateless services are: APIs that store nothing, Lambda functions, Cron jobs and Batch jobs.
Stateful-ness
Stateful services are quite different - they are the bank. Stateful services exist to store and protect your data. They swaddle it up and care for it, making sure it never gets lost or corrupted. Your users should never access it directly for the same reason you should never access the bank directly - that’s what tellers are for.
Stateful services are of course much more difficult to scale. It’s easy to get new tellers, but it's hard to set up a whole new bank. Cloud providers have gotten better about this over time, and if you need to do it, you should use whatever system your cloud provider has set up. If you find that you are scaling a stateful system yourself, re-evaluate your architecture immediately because you’ve done something wrong. (Or you're in the extreme minority of people who legitimately need to run an unmanaged database. In this case, may God have mercy on your soul.)
Examples of stateful services are: Databases, Queues, File systems, APIs that store requests or logs, and External APIs (APIs that you transfer ownership of data to)
Sink/Database Terminology
For the rest of this article, the word “database” is synonymous with “sink”. Databases are most commonly used so that’s what I’ll use in the examples, but really anything that takes responsibility for data could be swapped in.
Architectures
Synchronous processing, Synchronous error handling
This is the most common backend architecture. It takes some input, validates that the input is in the expected shape (a number between 1 and 100, or a string shorter than 10 characters), stores the data in the sink and then returns a success.
In the case of failure, for example if the provided input was 300 when it was supposed to be less than 100, an error is returned to the caller.
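Here’s a minimal sketch of that flow in Python. The `db` client, the table name and the 1-to-100 rule are all made up for illustration - the point is the shape: validate, hand off to the sink, then (and only then) report success.

```python
# Minimal sketch of synchronous processing with synchronous error handling.
# `db` and the field/table names are hypothetical stand-ins for your real sink.

def handle_request(payload: dict, db) -> tuple[int, dict]:
    value = payload.get("value")

    # Validate at the edge: reject bad data before it ever reaches the sink.
    if not isinstance(value, int) or not 1 <= value <= 100:
        return 400, {"error": "value must be an integer between 1 and 100"}

    # The stateless service never holds on to the data; it hands it straight to the sink.
    db.insert("values", {"value": value})

    # Only after the sink has accepted the data do we report success.
    return 200, {"status": "stored"}
```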
You might ask, where is logging in all this? As the provider of the service, I want to know when an error happens! Thanks for the easy question that I made up for myself to answer! You are not limited to a single sink - you can write your logs to another database, like Elasticsearch or Datadog or some such.
But Tim, you say. What if Elasticsearch is down, but I still want my logs. Is ES now a dependency? Shouldn’t I write to disk just in case?
The answer is no. If you write your logs to disk, then this is no longer a stateless service. The logs are State with a capital S, and your service is now responsible for caretaking them. You will lose all of the niceties of a stateless service, and are now faced with the complexity of a stateful service. In other words - don’t do it. You will be sad. Maybe not today, and maybe not tomorrow, but soon.
If you do not wish to be dependent on your sink always being available, there are other options open to you. Since queues are more reliable than Elasticsearch (way more - ES goes down more often than a narcoleptic sloth), you can always push your logs into a queue instead.
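As a rough sketch of what that could look like with the kafka-python client - the topic name and event shape here are assumptions on my part:

```python
# Sketch: ship log events to a Kafka topic instead of writing them to local disk.
# Assumes the kafka-python package; the topic name and event shape are made up.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def log_event(level: str, message: str) -> None:
    # Fire-and-forget: the service stays stateless, the queue takes responsibility.
    producer.send("service-logs", {"level": level, "message": message})
```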
This is what Kafka was designed for, and it works fine. ‘But,’ you might say, ‘what if my queue goes down? I’ve just replaced one dependency with another.’ First, I must ask you: what are your requirements here? Is it good enough to just say, ‘well, when the queue is down the system is down’? In almost all circumstances you can harden your database and queue such that the likelihood of either of them going down is extremely small. You can totally hit your SLAs and everything is copacetic. This architecture is good and fine, and you should use it 99.99999% of the time, because it's simple and fine. It’s fine. For real.
BUT. Let’s say for the moment that either you have a very sadistic boss, or you really do work at a bank or something and any downtime at all is a super big deal. You can invert the pattern like this:
The API in this case ONLY does validation, and then puts the request on a queue for actual processing and storage. (You’ll notice this is actually the “Asynchronous processing, Synchronous error handling” pattern from later in the article.) The benefit of this architecture is that you’ve decoupled “receiving” a request from “processing” a request. Now you’ve limited your possible failures to only the gateway into the system. The gateway can remain simple and pure with as few dependencies as possible, keeping your (apparently very intense) SLAs nice and safe. There’s a sketch of this gateway after the list below. The tradeoffs are:
- You won't be able to return the result of your processing synchronously, which can sometimes be a deal-breaker
- Splitting and maintaining a separate validation API and processing worker is a lot more work, which will make your project manager sad.
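Here’s a minimal sketch of that validation-only gateway. The `queue` client, the payload shape and the 10-character rule are all illustrative:

```python
# Sketch of the inverted pattern: the gateway only validates and enqueues.
# `queue` stands in for any durable queue client (SQS, Kafka, etc.).
import json
import uuid

def gateway(payload: dict, queue) -> tuple[int, dict]:
    text = payload.get("message")

    # All validation happens here, at the edge, before anything is accepted.
    if not isinstance(text, str) or len(text) >= 10:
        return 400, {"error": "message must be a string under 10 characters"}

    request_id = str(uuid.uuid4())

    # Once the queue accepts this, the queue (a stateful service) owns the data.
    queue.publish(json.dumps({"id": request_id, "message": text}))

    # 202 Accepted: we took the request, but it hasn't been processed yet.
    return 202, {"id": request_id, "status": "queued"}
```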
Of course, if you would like your system to have 100% uptime, I might suggest a different profession. Maybe you can author science fiction.
Getting back to simpler patterns, here is another variation - the polling system:
Notice how this is the same pattern as the first - it's just that the connection is initiated from your cron task instead of from a client. The biggest difference is that you need to provide a method for re-processing on failure, since the cron task takes on the role of “caller” as well as processor. There are a couple of ways to address reprocessing. You can maintain a “last successful” watermark and always fetch data from after that watermark (works great when the data is timestamped), or you can provide a way to run the cron manually over a given range. Always make sure your ingest is idempotent!
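A rough sketch of the watermark approach might look like this - `source`, `db`, and the table and field names are hypothetical:

```python
# Sketch of a polling cron task that uses a "last successful" watermark.
# `source`, `db`, and the table and field names are hypothetical.
from datetime import datetime, timezone

def run_poll(source, db) -> None:
    # Start from the high-water mark left behind by the last successful run.
    watermark = db.get_watermark("orders_poll") or datetime(1970, 1, 1, tzinfo=timezone.utc)

    records = source.fetch_since(watermark)

    for record in records:
        # Idempotent ingest: an upsert keyed on the record id means re-running
        # the same range never creates duplicates.
        db.upsert("orders", key=record["id"], data=record)

    # Only advance the watermark once everything in this run has been stored.
    if records:
        db.set_watermark("orders_poll", max(r["updated_at"] for r in records))
```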
Here’s another one, where the data source is a queue that we own:
On failure we put the item back in the queue. After enough failures the item goes into a DLQ (dead letter queue). Failed items like this should be handled manually, since they are likely malformed.
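A sketch of that retry-then-DLQ flow, with `queue`, `dlq`, and `process` standing in for your real clients and processing logic:

```python
# Sketch of a consumer that retries failed items and parks repeat offenders in a DLQ.
# `queue`, `dlq`, `db`, and `process` are placeholders for your real clients and logic.
MAX_ATTEMPTS = 3

def consume_one(queue, dlq, db, process) -> None:
    item = queue.receive()
    if item is None:
        return

    try:
        process(item.body, db)
        queue.ack(item)  # only now does the queue release responsibility for the item
    except Exception:
        if item.attempts + 1 >= MAX_ATTEMPTS:
            dlq.publish(item.body)  # likely malformed - hand it off for manual review
            queue.ack(item)
        else:
            queue.retry(item)  # put it back in the queue for another attempt
```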
This “Synchronous processing, Synchronous error handling” architecture is most appropriate when your API can validate and process each request quickly. What does quickly mean? It means “your users will not complain that it’s too slow, and your network requests will not time out.” What that means will depend on your users and your network architecture, but it's usually somewhere between 5 and 10 seconds. We’ll get to what to do when your processing takes too long next.
Asynchronous processing, Synchronous error handling
Alright, so you can’t do your processing synchronously because you need to crunch a lot of numbers, and it's just going to take far too long (like more than 10 seconds). Your users will get antsy looking at a spinner, and your network gateways will start severing your long-held connections because they think you fell asleep at the wheel. What do?
This do. Split your validation from your processing. That way you can guarantee the client that their request will be fulfilled, even if it isn’t expressly fulfilled right now. Now your queue worker can take its own sweet time fulfilling the request, and the client can relax knowing that it's in the works. If the queue starts to back up with requests, no big deal! The queue workers are stateless so they are easy to scale out during the spike and then scale in when the spike is over.
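Here’s a rough sketch of what the worker side might look like - `queue`, `db`, and `crunch_numbers` are stand-ins for your real clients and the slow work itself:

```python
# Sketch of the worker side: it can take as long as it needs, and you can run
# as many copies as the backlog demands. `queue`, `db`, and `crunch_numbers`
# are illustrative stand-ins.
import json

def worker_loop(queue, db, crunch_numbers) -> None:
    while True:
        item = queue.receive()  # returns None when the queue is empty
        if item is None:
            continue

        request = json.loads(item.body)
        result = crunch_numbers(request)  # the slow part the client didn't wait for

        db.insert("results", {"request_id": request["id"], "result": result})
        queue.ack(item)  # the queue only lets go of the item once the result is stored
```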
This architecture is most appropriate when you can validate the request quickly, but it will take too much time to process it for the client to wait. As mentioned in the previous section, this architecture can also be useful for uptime guarantees.
Asynchronous processing, Asynchronous error handling
Sometimes the request is so large that you can’t validate the request at the same time that you receive it. This is usually the case with large files, where you need to go through the whole file and determine if the contents are valid. In this case our stateless service gateway modifies its “success” return value to mean “successfully uploaded.” You will need another mechanism to tell the client if their file was successfully processed or not.
The best way to handle this case is to stash the raw file somewhere (S3 or equivalent), put a ticket into the work queue, then inform the user that the upload is being processed. Once the queue worker is finished you can store the results of the processing in the database. Then you can notify the user that processing has completed via a queue, or just wait for them to poll for status.
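A sketch of that upload gateway using boto3 - the bucket name, queue URL, and message shape are assumptions:

```python
# Sketch of the upload gateway: stash the raw file, enqueue a ticket, return "processing".
# Uses boto3; the bucket, queue URL, and message shape are assumptions.
import json
import uuid
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

def handle_upload(fileobj, filename: str) -> dict:
    upload_id = str(uuid.uuid4())
    key = f"uploads/{upload_id}/{filename}"

    # Stash the raw file first; S3 now owns the bytes.
    s3.upload_fileobj(fileobj, "raw-uploads-bucket", key)

    # Ticket for the queue worker that will validate and process the file later.
    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/file-processing",
        MessageBody=json.dumps({"upload_id": upload_id, "s3_key": key}),
    )

    # "Success" here only means "successfully uploaded", not "successfully processed".
    return {"upload_id": upload_id, "status": "processing"}
```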
Broadcasting without guarantees
Sometimes you need to inform other downstream components that something has happened, or some data has been inserted/updated. This often happens in event driven systems and microservices and is a pretty standard pattern for syncing state across service boundaries. It's a perfectly reasonable architecture - assuming you are OK with losing events sometimes. What happens when the write to the db is successful and the write to the queue fails? In order to return an error to the user you’d have to have some way of rolling back the change that was made to the DB. Not a fun prospect.
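To make that failure window concrete, here’s a sketch of the naive dual write (`db` and `queue` are placeholders):

```python
# Sketch of broadcasting without guarantees: two separate writes, no shared transaction.
# `db` and `queue` are placeholders; the gap between the two calls is the problem.
import json

def save_and_broadcast(order: dict, db, queue) -> None:
    db.insert("orders", order)  # write #1 succeeds

    # If the process dies or the queue is unavailable right here, the row exists
    # but downstream consumers never hear about it - and there is no easy rollback.
    queue.publish(json.dumps({"type": "order_created", "order_id": order["id"]}))
```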
Broadcasting with guarantees
There is a way to guarantee that events are processed - put the events in the database at the same time that you write your actual data to the database as part of a single transaction.
This is known as the “Outbox Pattern.” You write both to the standard tables and also an “outbox” table in a single transaction. It’s a fair bit of work on the developers' part, so only do this if you need to. CDC (change data capture) is very similar but automates the second write, reducing the load on the developers.
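Here’s a minimal sketch of the outbox write, using stdlib sqlite3 purely for illustration - in production this would be your real database, but the idea is the same: the data row and the event row commit (or fail) together.

```python
# Sketch of the outbox pattern: one transaction writes both the data and the event.
# Table and column names are made up for illustration.
import json
import sqlite3

def create_order(conn: sqlite3.Connection, order: dict) -> None:
    with conn:  # one transaction: both inserts commit together, or neither does
        conn.execute(
            "INSERT INTO orders (id, customer, total) VALUES (?, ?, ?)",
            (order["id"], order["customer"], order["total"]),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order_created", json.dumps({"order_id": order["id"]})),
        )
    # A separate relay process reads the outbox table and publishes to the queue.
```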
Batching
Batching is the practice of bundling up multiple requests to be processed at once. It's generally done for performance reasons - writing 10,000 items in a single batch to a database is much faster than writing a single item 10,000 times. We’ve said before that you shouldn’t batch your items in your API, so how should we do it? Like this:
In this architecture, the Queue Worker pulls as many items from the queue as you’d like to batch at once. In this way your API and Queue Worker can remain stateless. Your Queue maintains responsibility for your data until the Queue Worker finishes writing the batch, at which point it releases the data from the Queue. Simple and effective.
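Here’s a rough sketch of that batching worker - `queue`, `db`, and the batch size are illustrative:

```python
# Sketch of batching inside the queue worker rather than the API.
# `queue` and `db` are stand-ins; pick a batch size your sink writes efficiently.
BATCH_SIZE = 10_000

def drain_batch(queue, db) -> None:
    items = queue.receive_many(max_messages=BATCH_SIZE)
    if not items:
        return

    # One bulk write is far cheaper than one write per item.
    db.bulk_insert("events", [item.body for item in items])

    # Only after the sink has the data does the queue let go of it.
    for item in items:
        queue.ack(item)
```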
If you find that your queue can’t handle a large number of messages, or becomes prohibitively expensive with large numbers of messages (I’m frowning at you, SQS) then you have chosen the wrong queue for your system. If you outgrow SQS, switch to Kafka (or Kinesis if you hate yourself) - don’t try to half-ass it by batching in your API.
Of course as always, if you can, use an off-the-shelf solution like AWS Firehose. Less code is more better.
Conclusion
This has been a whirlwind, high level tour of basic architectures. There is of course much more to say, and many more opinions to sling. If you’re interested I can provide more detail, or go over advanced topics like caching strategies in later articles. Let me know what you’d like to see in the comments!