Understanding resilience engineering is essential for organizations that want to survive as we accelerate into the online future. Let me tell you what resilience engineering is and how it can help you roll with the punches that come with success.
The only way to keep the Dutch at home is by grounding their offspring, as our government has discovered twice over. So, after spending two weeks dreading the end of the holiday season, last week marked the start of the second homeschooling season in a year.
Thanks mostly to my very structured wife and non-male children (no coincidence there), our household was prepared. We set up two desks in the living room, made schedules, and brought laptops, tablets, headphones, pencils, paper. But when it was time for the digital school gates to open, they… didn’t.
Watskeburt? (That’s Dutch slang for “what happened?”)
In our case, the school’s digital learning and communication platforms choked on the peak load all the homeschoolers generated. Luckily, Netflix/YouTube/the bookcase/outside was still online, so we just postponed the inevitable for a bit. By lunchtime, everything was already forgotten, replaced by the regular remote schooling joys and frustrations.
After dinner and coffee and ice cream, and some delicious white chocolate, I, ehm… Where was I? Oh yeah, I started to get curious about that morning’s outage. And I found an interesting ongoing status report by the Parro team. Apparently, they already anticipated the surge and had taken what they believed would be enough measures to survive it:
“To be fair, we expected that Parro, with all the measures taken, would be able to handle the new extreme peak traffic.”
After what I imagine must have been a stressful few hours, the Parro team fixed all issues and had a smoothly running system by the end of the day again. Kudos to them. That’s what mattered most to their customers, the teachers and parents of our small kingdom.
But their work is not done, as the updates on the status page show. Where is the next bottleneck? When will it become a problem? How will they fix it? Will they ever be done?
They won’t. This is the new normal. But it doesn’t have to be a negative thing. There is a way to use these weird times to engineer ourselves into the future.
Let’s examine how.
Peaks are the new normal
Now that we are accelerating into the online future, unpredictable massive peak usage will become just another truth. Scary? Sure. Exciting? I think so. But a reality, nonetheless.
This will lead to two types of problems to deal with:
- Known unknowns: a problem that hasn’t happened yet, but is not surprising once it happens. Server failures, network troubles, that kind of stuff.
- Unknown unknowns: something that comes as a complete surprise, shocking systems into failure. Or, as Donald Rumsfeld called them: “the ones we don’t know we don’t know.”
The first type is tough enough to handle as it is. But a well-designed, robust system can withstand these known unknowns. Redundancy, retries, fallbacks, failovers: when designing highly available systems, these terms are familiar. We use them to deflect what we know can go wrong, like a server going down or a request to a service timing out.
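To make those terms a bit more tangible, here is a minimal sketch of a retry with a fallback in Python. Everything in it is made up for illustration: the function name `call_with_retry`, the number of attempts, and the delays are assumptions, not something taken from Parro or any particular library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `primary` a few times, then give up and use `fallback`.

    Hypothetical helper for illustration only. It handles two known
    unknowns: a request timing out and a server being unreachable.
    """
    for attempt in range(attempts):
        try:
            return primary()
        except (TimeoutError, ConnectionError):
            # Wait a little longer after each failure (exponential backoff),
            # but do not wait after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: degrade gracefully instead of crashing.
    return fallback()
```

You could call it with a function that fetches fresh data as `primary` and one that returns a cached copy as `fallback`. Note that it only catches the failures we thought of in advance, which is exactly what makes them known unknowns.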
But what about a global pandemic leading to sudden, massive demand on a system designed for a different reality? Surely, you can’t handle all the ways a system might break down under these unpredictable conditions?
But you can. It’s called resilience engineering.
Handling the unknown unknowns
Resilience engineering is a field still very much inside the scientific domain. It is closely related to the areas of human error, cognitive systems engineering, and safety. You know, plane crashes, oil spills, and nuclear disasters, that type of stuff.
Websites and mobile apps going down rarely lead to environmental disaster or death. That doesn’t mean we can’t learn anything from the tons of research that went into making our world a little safer. Resilience engineering takes all the good stuff that prevents 747s and power plants from going down and applies it to our young (software) engineering field.
Old thinking, new thinking
As my colleague Hans Bossenbroek so eloquently put it: as the need for agility and speed increases, we need to move towards event thinking. The solution: event-driven, modular, highly scalable systems. But also: super complex architectures resembling chaotic spider webs. And a gazillion events flowing through at breakneck speed.
Of course, that’s not something you design and deploy at the end of the first sprint. And it might be a long way off from your massive monolith. But since we live in web-scale-or-die times, it’s the logical solution to real-world problems. As the technology evolves, so should the way you manage it. You have to fight the urge to control the new system the way you did your old one.
Instead of focusing on the things that might go wrong (known unknowns!), try to look at the things that go right and do those things more often. This will help you and your teams:
- Understand what leads to the right thing and do that more often.
- Increase the chance of success instead of nervously awaiting the next breakdown.
- Be proactive instead of reactive.
- Drive towards continuous improvement.
You read that right: it’s about humans as much as it is about hardware and software!
Systems thinking
You see, systems encompass both people and technology. Think about it: when was the last time you heard about a developer accidentally deleting a production database? Or the other way around: some Sherlock Holmes type saving the day by tweaking a setting even the most experienced colleague was unaware of?
And that is where resilience comes from: humans doing the right things. We do that because we just know what is right. We have a knack for it, some kind of intuition. It’s what keeps complex systems from falling apart.
And when it does fall apart, like last week, something must have tipped the scales. Despite all the hard work by the humans and the computers, systems go down from time to time. And when they do, they have to be brought back up as soon as possible.
Now, stop and think: if production goes down, what do you do? Maybe something like:
- Find the root cause.
- Fix it.
- Make sure that never ever happens again.
If you want to practice resilience engineering: wrong answer.
There is no root cause!
Stop again. Think about the complex system. We already saw that keeping a system running is a balancing act. Finding a root cause would imply that there is a single thing that caused the error. If good behavior is explained by complexity, how can a single cause explain faulty behavior?
It can’t. It’s a logical fallacy. There is no root cause.
This also means that a proper postmortem should not be about explaining the hunt for a single point of failure. An excellent postmortem is about showing that you understand that complexity sometimes results in unwanted behavior. And that you realize, in hindsight, that many things were working together until they didn’t. But that you have all learned from it. That you now know how to do even more things right.
In other words: good postmortems are blameless, balancing team safety and accountability.
Going resiliently into the future
So, to err is… complex. Faults are a given. Resilience engineering is not about preventing them. It’s about doing ever more things right so that you can increase complexity while maintaining control. So that when the next unknown unknown strikes, you will be prepared.
So get to it! Learn:
- What is resilience engineering? (You already know!)
- How do I start practicing it?
- Whom should I involve?
- How do we get better over time?
Need help? Keep an eye on our blog series on resilience engineering to find out the answers to these questions!
Observations on the resilience of bone
In the meantime, here is an excellent talk by Dr. Richard Cook about resilience engineering and bones. Bones? Yes, you know, you have them inside your body most of the time! And they are a great vehicle for explaining two types of resilience engineering. Watch and learn: