I recently visited a meetup in Düsseldorf where Russ Miles was talking about chaos engineering, its techniques, its mindset and the technology that supports this awesome way of exploring weaknesses in systems. After the talk ended, it was clear that the attending developers were fascinated by these concepts, as was I. There was so much motivation to start testing and experimenting with software systems, it was awesome!
The next day, when I was thinking about the things I’d learned, I came up with some questions that I could not answer myself. Fortunately, Russ is a person interested in sharing knowledge and experience. That’s why he invites his audience to contact him as soon as questions come to mind. So that is what I did.
Sven: Hey Russ! I was thinking about your talk last evening. I’m curious how you would explain the value of chaos engineering, or of things like Game-Days, where the developers are pulled away from their regular work for a whole day, to the management? All these things improve the software quality, but that is not something you can make money from directly.
Russ: Hey! So firstly, a “Game Day” can be a lot shorter than a day. In my experience it’s a few hours at most and fits well into the usual ceremonies around your agile process and preferably around retrospectives. So it can be achieved with a minimally increased time investment in comparison to the process you may already be running. Second, when I explain chaos engineering to managers, I use the term “continuous limited scope disaster recovery”. I also point out that software resilience is directly related to money, in most cases, and that this is a proactive, lightweight approach to avoid costly outages.
Sven: Makes sense! But in my experience, even those arguments don’t convince everybody. Some may argue that they have never had an outage, so why should they invest?
Russ: Usually, there has been an outage in the recent past, so people are familiar with the potential costs of not investing in resilient software. If that’s the case, I ask about that outage and what they are doing to avoid it and other outages in the future. I also ask how they’re now building up confidence in their systems again. Another important point I mention is that for a small investment they can get big returns. They just have to add a little more discipline through resilience engineering practices.
If they haven’t experienced an outage themselves in their own systems, I bet they have been impacted by an outage in someone else’s system that they depend on. So I use that as a “you wouldn’t want to be them” perspective. In addition, I also have a growing catalogue of production failure stories that I could tell if people are still unconvinced.
Sven: You mentioned that things like Game-Days fit well into agile processes. So would you say that an agile process is a prerequisite to efficient chaos engineering?
Russ: Oh no, not exactly. I’d say that any business-critical system of software development and delivery will gain from improved resilience. If you’re agile, then you might have the sorts of learning loops in place that are critical already, but chaos engineering does not require an agile process, microservices or cloud native. It’s much more broadly applicable than those things. They sometimes help, but they’re not a prerequisite.
And just another remark on arguing for chaos engineering: As customers, we’ve all experienced an outage at some point. So even if people have never dealt with a system outage in their own business, they know how an outage feels.
Sven: Maybe that’s a good point to start from, if you have to argue for these things with someone. Something like “How would you feel if you sat down on the couch to watch a movie and the service was down? Now imagine the mood of your customers when your product is unavailable.”
Russ, that made it a lot clearer for me! Thank you for your time and your thoughts!
Russ: No problem at all!
For me, the most important outcome of this conversation is:
Don’t try to sell chaos to the management!
Better call it “continuous limited scope disaster recovery” and point out that their revenue stream is directly linked to their software. If the software is unstable, the revenue stream is unstable. And if this is not enough to convince the skeptics, remind them how it feels to be hit by a system outage as a customer.
There are also a few other dos and don’ts that Russ points out in his article “Chaos Engineering for the Business - 6 Tips for how to explain Chaos Engineering to non-technical stakeholders”.
Authors: Sven Hettwer, Russ Miles
This article was originally published at medium.com/@SvenHettwer