
Getting the Monkeys off Your Back – The Importance of SRE and Chaos Engineering

Sep 23, 2024
Steve De Smet

While some famous primates like King Kong, Bubbles, Donkey Kong, and – more recently – Sun Wukong 🐒 have made headlines, this blog is about a lesser-known cousin: the Chaos Monkey.

In my previous blog, I wrote about Shift Right and the importance of DRP testing – how to make sure that any business disruption is quickly and efficiently mitigated. I mentioned walkthrough tests, parallel tests, and even full breakdown tests. These are all techniques for coping with live failure events, which will inevitably occur. However, organizations are not powerless when it comes to avoiding these failures. On the contrary, with SRE, or Site Reliability Engineering, organizations can drastically lower the likelihood of these failures. SRE, a term coined by Google, aims to create scalable and highly reliable software systems.

One particularly nifty tool in the pocket of every Site Reliability Engineer is Chaos Engineering. Chaos engineering is built on the idea of proactively identifying and addressing potential failures before they can impact end users. The practice was pioneered by Netflix with the introduction of their so-called Chaos Monkey – software designed to randomly terminate instances in their production system, to verify the robustness of their solutions.

Chaos Monkey laid the foundation for modern chaos engineering by purposefully injecting faults into systems to test and verify their robustness and resilience. While the name might hint otherwise, chaos engineering is not about chaos – it is about controlled experiments to identify potential points of failure in a system before they cause problems.

Over time, Netflix added to their chaos family and created what is lovingly dubbed the ‘Simian Army’: a variety of different monkeys, each created with a different purpose.
The monkeys have included:

  • Chaos Monkey – randomly disables production instances to test if the system can automatically recover from unexpected failures
  • Latency Monkey – introduces artificial delays to simulate network or service slowdowns, testing the system’s performance under degraded conditions
  • Doctor Monkey – performs automated health checks to identify and remove faulty instances, ensuring the system stays healthy
  • Janitor Monkey – identifies and cleans up unused or outdated resources, optimizing the system and reducing costs
  • Security Monkey – scans for security vulnerabilities, ensuring the system adheres to best practices and is protected from threats
  • Chaos Gorilla – simulates a large-scale outage by taking down an entire AWS availability zone (a cloud data center), testing the system’s ability to handle massive failures
  • Chaos Kong – simulates the failure of an entire AWS region (a group of availability zones), testing the system’s global resilience and disaster recovery capabilities
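
To make the Chaos Monkey idea concrete, here is a minimal Python sketch of one experiment loop: terminate a random instance, let the system heal, and verify that it is back at its steady state. This is not Netflix's implementation. The `Cluster` class is a hypothetical in-memory stand-in for an auto-scaling group, and every name in it is an assumption made for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    """Toy stand-in for an auto-scaling group (hypothetical, in-memory only)."""
    desired_size: int
    instances: list = field(default_factory=list)
    _counter: int = 0

    def __post_init__(self):
        self.heal()  # start the initial instances

    def heal(self):
        """Drop terminated instances and start replacements up to desired_size."""
        self.instances = [i for i in self.instances if i.healthy]
        while len(self.instances) < self.desired_size:
            self._counter += 1
            self.instances.append(Instance(f"instance-{self._counter}"))

    def healthy_count(self):
        return sum(1 for i in self.instances if i.healthy)


def chaos_monkey(cluster, rng):
    """Terminate one randomly chosen healthy instance and return it."""
    victim = rng.choice([i for i in cluster.instances if i.healthy])
    victim.healthy = False
    return victim


def run_experiment(cluster, rounds, seed=0):
    """Repeat: inject a failure, let the cluster heal, verify steady state."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    terminated = []
    for _ in range(rounds):
        victim = chaos_monkey(cluster, rng)
        terminated.append(victim.name)
        cluster.heal()
        if cluster.healthy_count() != cluster.desired_size:
            raise RuntimeError("cluster did not recover")
    return terminated
```

Calling `run_experiment(Cluster(desired_size=3), rounds=5)` returns the names of the five terminated instances, or raises if the cluster ever fails to return to three healthy instances. In a real setup the healing step would be the platform's own recovery mechanism, and the steady-state check would be a business metric rather than an instance count.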

While chaos engineering has its roots in cloud and infrastructure availability, I believe that every QA engineer should take its principles to heart: perform controlled experiments to build reliable products. By putting our systems and applications to the test, we make sure they perform appropriately when they need to.
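
At the level of an ordinary test suite, the same principle can be applied by injecting faults into a dependency and checking that the code under test copes. The sketch below is one possible shape for such a test, under stated assumptions: `FlakyService` and `fetch_with_retry` are hypothetical names invented for this example, and the retry policy (three attempts, no back-off) is chosen only to keep it short.

```python
class FlakyService:
    """Hypothetical dependency that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_to_inject):
        self.failures_left = failures_to_inject
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("injected fault")
        return "ok"


def fetch_with_retry(service, max_attempts=3):
    """Call service.fetch(), retrying on ConnectionError up to max_attempts times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return service.fetch()
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try again
    raise last_error


def test_survives_two_injected_faults():
    service = FlakyService(failures_to_inject=2)
    assert fetch_with_retry(service) == "ok"
    assert service.calls == 3  # two failures, then one success


def test_gives_up_after_three_injected_faults():
    service = FlakyService(failures_to_inject=3)
    try:
        fetch_with_retry(service)
    except ConnectionError:
        return  # expected: the fault budget was exceeded
    raise AssertionError("expected ConnectionError")
```

The two test functions define the experiment's hypothesis explicitly: the client tolerates up to two consecutive failures and surfaces the error on the third. They follow the pytest naming convention but can also be called directly.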

Next time you’re creating a test approach or fine-tuning a Master Test Plan – be sure to get the monkeys off your back! 🐡

About the author

SogetiLabs Country Lead | Belgium
Steve is a strong advocate of Quality Engineering throughout all phases of the SDLC. With almost a decade of background in Digital Assurance & Quality Engineering, he has gathered experience through various roles within the craft: Test analyst, Test Manager, Program Quality Manager, etc.
