GETTING THE MONKEYS OFF YOUR BACK – THE IMPORTANCE OF SRE AND CHAOS ENGINEERING

September 23, 2024

Steve De Smet

While some famous primates like King Kong, Bubbles, Donkey Kong, and – more recently – Sun Wukong 🐒 have made headlines, this blog is about a lesser-known cousin: the Chaos Monkey.

In my previous blog, I wrote about Shift Right and the importance of DRP (Disaster Recovery Plan) testing: how to make sure that any business disruption is mitigated quickly and efficiently. I mentioned walkthrough tests, parallel tests, and even full breakdown tests. These are all techniques for coping with live failure events, which will inevitably occur. However, organizations are not powerless when it comes to avoiding these failures. On the contrary, with Site Reliability Engineering (SRE), organizations can drastically lower the likelihood of such failures. SRE, a term coined by Google, aims to create scalable and highly reliable software systems.

One particularly nifty tool in the pocket of every Site Reliability Engineer is chaos engineering. Chaos engineering is built on the idea of proactively identifying and addressing potential failures before they impact end users. The practice was first adopted by Netflix, with the introduction of its so-called Chaos Monkey: software designed to randomly terminate instances in the production system, to verify that the system keeps working when individual instances disappear.
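To make the idea tangible, here is a minimal sketch of the principle in Python. It is not Netflix's implementation: the `Fleet` class, its `reconcile` method, and the `terminate_random_instance` helper are hypothetical names that simulate a group of instances in memory, so nothing real is terminated.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Fleet:
    """Hypothetical in-memory stand-in for a group of service instances."""
    desired_size: int
    running: set = field(default_factory=set)
    _next_id: int = 0

    def reconcile(self) -> None:
        """Start replacement instances until the fleet is back at its desired size."""
        while len(self.running) < self.desired_size:
            self.running.add(f"instance-{self._next_id}")
            self._next_id += 1


def terminate_random_instance(fleet: Fleet, rng: random.Random) -> str:
    """Pick one running instance at random and remove it from the fleet."""
    victim = rng.choice(sorted(fleet.running))
    fleet.running.remove(victim)
    return victim


# Bring up three instances, knock one out at random, then let the fleet recover.
fleet = Fleet(desired_size=3)
fleet.reconcile()
victim = terminate_random_instance(fleet, random.Random(42))
degraded_size = len(fleet.running)
fleet.reconcile()
recovered_size = len(fleet.running)
print(f"terminated {victim}: {degraded_size} running, {recovered_size} after recovery")
```

The interesting part is not the termination itself but the check afterwards: the experiment only tells you something if you verify that the system got back to its desired state on its own.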

Chaos Monkey laid the foundation for modern chaos engineering by purposefully injecting faults into systems to test and verify their robustness and resilience. While the name might hint otherwise, chaos engineering is not about chaos. It is about controlled experiments that identify potential points of failure in a system before they cause problems.

Over time, Netflix added to its chaos family and created what is lovingly dubbed the ‘Simian Army’: a variety of monkeys, each created with a different purpose.
The monkeys have included:

Chaos Monkey – randomly disables production instances to test whether the system can automatically recover from unexpected failures.
Latency Monkey – introduces artificial delays to simulate network or service slowdowns, testing the system’s performance under degraded conditions.
Doctor Monkey – performs automated health checks to identify and remove faulty instances, ensuring the system stays healthy.
Janitor Monkey – identifies and cleans up unused or outdated resources, optimizing the system and reducing costs.
Security Monkey – scans for security vulnerabilities, ensuring the system adheres to best practices and is protected from threats.
Chaos Gorilla – simulates a large-scale outage by taking down an entire AWS availability zone (one or more data centers within a region), testing the system’s ability to handle massive failures.
Chaos Kong – simulates the loss of an entire AWS region, testing the system’s global resilience and disaster recovery capabilities.
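As an illustration of what one of these monkeys does, the idea behind a latency-injection experiment can be sketched in a few lines of Python. This is a simplified sketch under my own assumptions, not the Simian Army's code: `with_injected_latency`, `fetch_recommendations`, and `recommendations_with_fallback` are hypothetical names, and the "service" is just a local function.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def with_injected_latency(func, delay_seconds):
    """Wrap func so that every call is delayed by delay_seconds (the injected fault)."""
    def delayed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return delayed


def fetch_recommendations(user_id):
    """Hypothetical downstream call that normally answers right away."""
    return [f"title-for-{user_id}"]


def recommendations_with_fallback(fetch, user_id, timeout_seconds):
    """Call fetch, but serve a default list if it does not answer in time."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch, user_id)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return ["popular-title"]
    finally:
        # Do not block on the slow call; the caller gets its answer immediately.
        pool.shutdown(wait=False)


# Same caller, once against the healthy dependency and once against the slowed-down one.
slow_fetch = with_injected_latency(fetch_recommendations, delay_seconds=0.5)
normal = recommendations_with_fallback(fetch_recommendations, "u1", timeout_seconds=0.2)
degraded = recommendations_with_fallback(slow_fetch, "u1", timeout_seconds=0.2)
print(normal, degraded)
```

The experiment passes when the caller degrades gracefully, here by serving a default list, instead of hanging along with its slow dependency.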

Although chaos engineering has its roots in cloud and infrastructure availability, I believe that every QA engineer should take its principles to heart: perform controlled experiments to build reliable products. By putting our systems and applications to the test, we make sure they perform as expected when it matters.
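What "controlled" means in practice can be sketched as a small experiment loop: confirm the system is healthy, inject one fault, check whether the expected behavior still holds, and always undo the fault. The Python below is an illustrative sketch only; `run_experiment`, `ExperimentResult`, and the toy `replicas` system are hypothetical names I made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    ran: bool              # False if the experiment was aborted before injecting anything
    hypothesis_held: bool  # True if the system stayed healthy while the fault was active


def run_experiment(
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one controlled experiment: verify, inject, verify again, always roll back."""
    if not steady_state():
        # Never add a fault to a system that is already unhealthy.
        return ExperimentResult(ran=False, hypothesis_held=False)
    inject_fault()
    try:
        held = steady_state()
    finally:
        rollback()
    return ExperimentResult(ran=True, hypothesis_held=held)


# Toy system under test: a checkout service with two replicas.
replicas = {"checkout-a": True, "checkout-b": True}


def checkout_is_available() -> bool:
    return any(replicas.values())


def stop_one_replica() -> None:
    replicas["checkout-a"] = False


def restart_replica() -> None:
    replicas["checkout-a"] = True


result = run_experiment(checkout_is_available, stop_one_replica, restart_replica)
print(result)
```

The same shape works for a test plan on paper: state the hypothesis, limit the blast radius to one fault, and define the rollback before you start.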

Next time you’re creating a test approach or fine-tuning a Master Test Plan – be sure to get the monkeys off your back! 🐵

About the author

Steve is a strong advocate of Quality Engineering throughout all phases of the SDLC. With almost a decade of experience in Digital Assurance & Quality Engineering, he has held various roles within the craft: Test Analyst, Test Manager, Program Quality Manager, etc.

Steve De Smet

SogetiLabs Country Lead | Belgium
