AWS outage
Yes! A whole AWS region was unreachable.
At 12:47 AM PDT on April 21st, 2011, the AWS US East region went down!
The 24 hours of Amazon Web Services downtime disrupted critical services for its customers' businesses, including big tech companies such as Netflix.
Amazon engineers later released a memo describing the root cause and how they resolved it.
Tech companies started to mitigate. Netflix, for instance, aimed to enable its systems to automatically recover from and cope with such failures. The huge effort of Netflix engineers resulted in the creation of "Chaos Monkey" and the broader "Netflix Simian Army": a suite of tools that simulate increasingly catastrophic levels of failure.
Chaos Monkeys
Borrowing Netflix's own description: the name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables.
The Simian Army includes seven different monkeys that everyone should be aware of:
Chaos Gorilla: simulates the failure of an entire AWS availability zone.
Chaos Kong: simulates the failure of an entire AWS region, such as one in North America or Europe.
Latency Monkey: induces artificial delays or downtime in the RESTful client-server communication layer to simulate service degradation and ensure that dependent services respond appropriately (a small sketch of this idea follows the list).
Conformity Monkey: finds instances that don't adhere to best practices and shuts them down, giving the service owner a chance to re-launch them properly.
Doctor Monkey: taps into the health checks that run on each instance, finds unhealthy instances, and proactively shuts them down if owners don't fix the root cause in time.
Janitor Monkey: ensures that their cloud environment is running free of clutter and waste; searches for unused resources and disposes of them.
Security Monkey: an extension of Conformity Monkey; finds and terminates instances with security violations or vulnerabilities, such as improperly configured AWS security groups.
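To make the core idea concrete, here is a minimal, hypothetical sketch of a Chaos Monkey-style experiment in Python: pick one random running instance from a group that opted in and terminate it. This is not Netflix's actual implementation; the tag name, region, and opt-in convention below are illustrative assumptions.

```python
# Hypothetical Chaos Monkey-style sketch: terminate one random opted-in instance.
import random

import boto3

REGION = "us-east-1"         # assumed region for the experiment
TARGET_TAG = "chaos-opt-in"  # hypothetical tag marking instances that opted in


def find_candidates(ec2):
    """Return IDs of running instances that opted in to chaos experiments."""
    response = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{TARGET_TAG}", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


def unleash_the_monkey(really_terminate=False):
    """Pick one random candidate and (optionally) terminate it."""
    ec2 = boto3.client("ec2", region_name=REGION)
    candidates = find_candidates(ec2)
    if not candidates:
        print("No opted-in instances found; the monkey goes back to sleep.")
        return
    victim = random.choice(candidates)
    if really_terminate:
        ec2.terminate_instances(InstanceIds=[victim])
        print(f"Monkey terminated {victim}")
    else:
        print(f"Dry run: monkey would have terminated {victim}")


if __name__ == "__main__":
    unleash_the_monkey(really_terminate=False)
```

The dry-run default and the opt-in tag are deliberate design choices for a sketch like this: you want blast-radius control before you let anything loose on real workloads.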
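And here is a similarly hedged sketch of the Latency Monkey idea: wrap outbound REST calls and inject random artificial delays (or an occasional simulated failure) so callers can verify that their timeouts, retries, and fallbacks actually work. The probabilities and delay window are illustrative assumptions, not Netflix's values.

```python
# Hypothetical Latency Monkey-style sketch: degrade outbound HTTP calls on purpose.
import random
import time

import requests

INJECT_PROBABILITY = 0.3          # assumed: 30% of calls get extra latency
DELAY_RANGE_SECONDS = (0.5, 3.0)  # assumed artificial delay window
FAILURE_PROBABILITY = 0.05        # assumed: 5% of calls fail outright


class InjectedFailure(Exception):
    """Raised instead of performing the real call, to simulate downtime."""


def degraded_get(url, timeout=5.0, **kwargs):
    """Perform an HTTP GET that is sometimes slowed down or failed on purpose."""
    if random.random() < FAILURE_PROBABILITY:
        raise InjectedFailure(f"Simulated outage while calling {url}")
    if random.random() < INJECT_PROBABILITY:
        delay = random.uniform(*DELAY_RANGE_SECONDS)
        print(f"Injecting {delay:.2f}s of latency before calling {url}")
        time.sleep(delay)
    return requests.get(url, timeout=timeout, **kwargs)


if __name__ == "__main__":
    try:
        response = degraded_get("https://example.com/health")
        print("Status:", response.status_code)
    except (InjectedFailure, requests.RequestException) as error:
        print("Dependency degraded, falling back:", error)
```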
Lessons Learned
Netflix uses these seven monkeys to achieve its resilience objectives, constantly injecting failures into pre-production and production environments.
The result of adopting these tools was shocking: Netflix services failed in ways the teams could never have predicted or imagined, creating a learning environment for the whole organization, and during working hours :)
Netflix teams learned how to make their services more resilient and to evolve them far beyond the competition.
Chaos engineering is not something to be afraid of; rather, it is something that helps us learn about potential failures in our infrastructure, environments, and deployments.
This is a great example of how we can integrate learning into our daily work, and how we can turn mistakes and failures into an opportunity for learning rather than something to be punished for.