When your product or service is down, you lose more than just money. You lose the trust of your users and partners.
Therefore, being proactive and preparing for production incidents is essential.
Here are 7 pieces of advice to help you and your team prepare for production incidents and deal with them when they happen.
Before the incident:
#1 Don’t let error logs fall through the cracks
You probably use some kind of logger in your service to record errors and other useful data. If not, it's time to start!
For example, when an exception is raised, a lot of developers write something like this:
try {
// Block of code to try
}
catch (Exception e) {
// With an SLF4J-style logger, use {} placeholders and pass the exception as the last argument so the stack trace is logged
logger.error("Got an exception, context: {}", context, e);
}
If your service log is flooded with unknown exceptions, you should be worried, because bad things are happening and users may be impacted.
Therefore, you should monitor the error logs in your service and fire alerts on them.
Some errors are harmless and some are severe; decide for yourself which ones you want to be alerted about.
The ELK Stack is currently one of the most popular log management and log analysis platforms on the market, and it is widely used for monitoring.
You can run a periodic query on top of Elasticsearch to find undesired errors, and fire alerts when they happen.
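For example, here is a minimal sketch of such a periodic check. It assumes Elasticsearch is reachable at http://localhost:9200, that the logs live in an index called app-logs, and that each entry has a level field and a @timestamp field (all of these are assumptions you would adapt to your own setup):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ErrorLogWatcher {

    public static void main(String[] args) throws Exception {
        // Count ERROR-level log entries from the last 5 minutes using the _count API
        String query = """
            {"query": {"bool": {"must": [
              {"match": {"level": "ERROR"}},
              {"range": {"@timestamp": {"gte": "now-5m"}}}
            ]}}}""";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/app-logs/_count")) // hypothetical cluster and index
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(query))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        // The _count API answers with JSON like {"count":42,...}; extract the number crudely
        Matcher m = Pattern.compile("\"count\":(\\d+)").matcher(response.body());
        long errors = m.find() ? Long.parseLong(m.group(1)) : 0;
        if (errors > 10) { // the threshold is an arbitrary example
            System.out.println("ALERT: " + errors + " error logs in the last 5 minutes");
        }
    }
}

In practice you would run this on a schedule and route the alert to your paging tool instead of printing it.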
#2 Catch bad behaviors early by adding statistics to your service.
Understanding the “status” of your service is essential for ensuring its reliability and stability. Detailed information and high visibility of the processes in your service not only helps your team react to issues but also gives them the confidence to make changes.
One of the best ways to gain this “status” insight is with a robust monitoring system that gathers metrics, visualizes data, and fires an alert when things appear to be broken.
A robust monitoring and alerting system will help you solve issues sooner and will minimize the damage done by the incident.
As with logs, you can configure alerts on your service based on statistics such as latency, queue lengths, CPU and memory usage, and so on.
Prometheus and Grafana are among the most popular monitoring and alerting tools.
You can also set up PagerDuty to make sure you always have an owner that handles incoming alerts.
When configuring metrics and alerts, focus on the service's purpose. What is its job? For example, if it's serving HTTP requests, then an important metric to set up is a count of HTTP response status codes.
If a service's job is to write to a database, then an important metric to have is the rate of writes to the database.
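As an illustration, here is a minimal sketch using the Prometheus Java client; the metric name http_requests_total, the status label, and port 9091 are just example choices:

import io.prometheus.client.Counter;
import io.prometheus.client.exporter.HTTPServer;

public class Metrics {

    // Count HTTP responses by status code so you can alert on a spike of 5xx
    static final Counter httpResponses = Counter.build()
        .name("http_requests_total")
        .help("Total HTTP requests by response status code")
        .labelNames("status")
        .register();

    public static void main(String[] args) throws Exception {
        // Expose a /metrics endpoint for Prometheus to scrape (port 9091 is arbitrary)
        HTTPServer server = new HTTPServer(9091);

        // In your request-handling code, increment the counter per response:
        httpResponses.labels("200").inc();
        httpResponses.labels("500").inc();
    }
}

From there, a Prometheus or Grafana alert on the rate of 5xx responses gives you exactly the kind of early warning this tip is about.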
#3 Prepare runbooks and invest in service documentation.
It's possible that when things break, the owner or expert of the broken service won't be available, or the person handling the incident won't know how to deal with it. Therefore, it's a good idea to have a prepared response plan for emergencies. (Here are some best practices for writing one)
The benefit of these prepared response plans (or runbooks) is that you don't need an expert to be available every time there is a problem. This in turn reduces burnout, since the same person no longer has to deal with the same problems over and over again.
If you can automate these plans, even better. For example, restart the service automatically if its health check is failing.
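A minimal sketch of such a watchdog, assuming a hypothetical /health endpoint on localhost:8080 and a hypothetical systemd unit called my-service:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class HealthWatchdog {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

        while (true) {
            boolean healthy;
            try {
                // Hypothetical health endpoint - replace with your service's own
                HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/health"))
                    .timeout(Duration.ofSeconds(2))
                    .build();
                HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
                healthy = response.statusCode() == 200;
            } catch (Exception e) {
                healthy = false;
            }

            if (!healthy) {
                // Hypothetical restart command - adjust to however your service is managed
                new ProcessBuilder("systemctl", "restart", "my-service").inheritIO().start().waitFor();
            }

            Thread.sleep(30_000); // poll every 30 seconds
        }
    }
}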
The downside of response plans is that they are written for specific cases, and you can never cover all the cases.
So it might be wise to also train your incident responders on generic actions, like finding a log or rolling back a deployment.
Another important aspect of this is "service documentation". It doesn't have to be super detailed; it just needs to give the incident responder entry points into the service and the critical things to know about it.
During an incident:
#4 “Read the damn error message”
The first thing you should do when you want to solve the issue at hand is to understand what exactly went wrong.
For example, if you get an error message from the log, read it carefully and look for details like "what just happened" and "where it happened", and only then think about "why it happened" and what you should do about it.
So the error messages in your log are valuable. They should be as detailed as possible and include relevant context, so anybody who reads them knows the "what" and the "where" immediately. (Stack traces are great for the "where" part.)
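As an illustration, compare a vague message with one that carries the "what" and the "where" (orderId, amountCents and the provider name are hypothetical fields):

// Vague: tells the reader nothing about what failed or for whom
logger.error("Payment failed");

// Detailed: states what happened and for which entity, and keeps the stack trace for the "where"
logger.error("Payment charge failed for orderId={} amountCents={} provider={}", orderId, amountCents, "stripe", e);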
#5 Use data to understand the issue
When you encounter a production issue, your first instinct shouldn't be to guess, jump to a conclusion, and try to fix the problem; that can waste time and resources. Instead, use the data you have at hand to devise a hypothesis and then validate that hypothesis with hard data. Think of it as playing detective.
There are two ways to go about this: induction or deduction.
With induction, you locate the relevant data, organize it, and then devise a hypothesis. Once you have a hypothesis, you can use data to prove it and then fix the problem.
Deduction works similarly but involves enumerating the possible causes or hypotheses first and then eliminating the ones that don't fit with the data. This allows you to refine your remaining hypothesis until you arrive at a conclusion that can be proved with data.
Either way, data should be at the heart of your efforts to solve the issue.
#6 Aviate, Navigate, Communicate
One of the first things pilots learn at ground school is what to do in case something goes wrong with the aircraft - “Aviate, Navigate, Communicate”.
Aviate means keeping the plane in the air. Navigate - to a safe location. And lastly, Communicate - let ground control know you are in trouble so they can help.
We can apply the same principles used by pilots to the software domain too.
In cases such as complete downtime of your system (but not only then):
- Aviate - Do everything you can to bring the system back up first, while saving data so you can analyze the root cause later (taking a thread dump, for example - see the sketch after this list).
- Navigate - Keep making decisions even when you lack complete information (avoid analysis paralysis).
- Communicate - Inform everyone that there is an issue; maybe they can help.
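A quick way to save that kind of evidence before you restart anything is a thread dump. Here is a minimal in-process sketch (running jstack <pid> from the outside works just as well; the output path is an arbitrary example):

import java.io.PrintWriter;
import java.util.Map;

public class ThreadDumper {

    // Write the stack trace of every live thread to a file before you restart the service
    public static void dump(String path) throws Exception {
        try (PrintWriter out = new PrintWriter(path)) {
            for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
                out.println("Thread: " + entry.getKey().getName());
                for (StackTraceElement frame : entry.getValue()) {
                    out.println("    at " + frame);
                }
                out.println();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        dump("/tmp/thread-dump.txt");
    }
}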
Credit to Barak Luzon and Ariel Pizatsky, who mentioned these principles in an excellent presentation they gave - When The Firefighters Come Knocking.
#7 Avoid psychological biases and pitfalls
Our brain is wired to make decision-making simpler. In doing so, it exposes itself to biases, heuristics, and other quirks that may seem like “bad decisions” in hindsight.
One such example is the ‘simulation heuristic’. The simulation heuristic is a psychological heuristic, or simplified mental strategy, according to which people determine the likelihood of an event based on how easy it is to picture it mentally.
That is, you may think you know the incident's root cause simply because it's easier for you to imagine it.
Another example is confirmation bias: the tendency to search for, interpret, favor, and recall information in a way that confirms or supports one's prior beliefs or values.
For instance, you are more likely to find evidence that supports your existing hypothesis, and ignore evidence that disproves it.
So what can you do to avoid these human biases? It’s hard to say, and I don’t think there is a silver bullet here. But being aware of them is the first step to mitigating them.
I encourage you to watch a great talk by Boris Cherkasky that dives into how psychological biases affect incident response.
Wrapping up
Investing in a resilient monitoring and alerting system will result in more confidence when performing actions and deploying features on the system.
Training your team to react to disasters will remove the fear of breaking things.
As a result, you may increase the development velocity of the now “fearless” developers.
If you have any other tips, thoughts, or insights about production incidents, please share them here!