Introduction
Recently, one of my colleagues shared his experiences from a Chaos Engineering assignment he conducted for a customer. I found them so interesting that I offered to publish them on our SogetiLabs platform. He accepted, and this blog is the result.
Chaos Engineering
Chaos Engineering is a testing method in which organizations deliberately introduce chaos and failures into a system[1] to test its resilience and reliability. The concept originates from software development, and organizations usually apply it by conducting controlled experiments in a production environment. In relation to DevSecOps, Chaos Engineering involves testing security mechanisms and identifying vulnerabilities under real-world conditions, enabling teams to respond proactively to threats and improve the overall security of the system.
With Chaos Engineering, you experience how a system responds to disruptions and can identify its weak spots. The goal is to create controlled, real-life situations from which teams can learn, strengthening the system against unforeseen events.
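To make the idea of a controlled experiment concrete, here is a minimal sketch of what a first, very simple experiment could look like. Everything in it is an illustrative assumption rather than part of a real assignment: the service names, the use of systemd, and the two-minute observation window.

```python
import random
import subprocess
import time

# Hypothetical, non-critical services agreed on with the teams beforehand.
SERVICES = ["orders-api", "billing-worker", "notification-service"]

def run_experiment() -> None:
    """Stop one randomly chosen service and check whether the platform
    brings it back within the agreed observation window."""
    target = random.choice(SERVICES)
    print(f"Injecting failure: stopping {target}")
    subprocess.run(["systemctl", "stop", target], check=True)

    time.sleep(120)  # the agreed observation window (here: 2 minutes)

    state = subprocess.run(
        ["systemctl", "is-active", target],
        capture_output=True, text=True,
    ).stdout.strip()
    print(f"After the window, {target} is: {state}")
    # In a real experiment you would record the outcome, notify the team,
    # and roll the disruption back if recovery did not happen automatically.

if __name__ == "__main__":
    run_experiment()
```

The essence is not the script itself but the discipline around it: a known blast radius, an agreed window, and a rollback plan.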
Chaos Engineering vs. Recovery Tests
To date, organizations have mainly performed Recovery Tests. Chaos Engineering and Recovery Tests both aim at improving the resilience of systems, but their approaches differ.
Recovery Tests typically have a limited scope: they specifically test a system’s recovery mechanism after a failure has occurred. The aim is to evaluate how quickly and effectively the system can recover, to ensure prompt problem resolution and service restoration.
An example is restoring data from a database. However, this approach is very limited and does not answer questions such as: ‘What if someone gains access to the system?’ or ‘What if components stop working due to configuration errors?’ In practice, we see that organizations rarely, if ever, test these types of situations. As a result, teams are poorly prepared, or not prepared at all, when they arise.
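To illustrate how narrow such a test is, below is a minimal sketch of a classic database recovery test, assuming PostgreSQL; the database name, backup path, and `orders` table are hypothetical. It measures how long a restore takes and whether data came back, and nothing more: it says nothing about intrusions or configuration errors.

```python
import subprocess
import time

# Hypothetical names; adjust to your own setup.
DB_NAME = "appdb_recovery_test"
BACKUP_FILE = "/backups/appdb_latest.dump"

def timed_restore() -> None:
    """Restore a dump into a test database, time it, and sanity-check
    that data actually came back - the classic recovery test in a nutshell."""
    start = time.monotonic()
    subprocess.run(
        ["pg_restore", "--clean", "--dbname", DB_NAME, BACKUP_FILE],
        check=True,
    )
    elapsed = time.monotonic() - start

    # Crude verification: the restored database must contain rows.
    result = subprocess.run(
        ["psql", "-d", DB_NAME, "-tAc", "SELECT count(*) FROM orders;"],
        capture_output=True, text=True, check=True,
    )
    rows = int(result.stdout.strip())
    print(f"Restore took {elapsed:.0f}s and recovered {rows} rows")

if __name__ == "__main__":
    timed_restore()
```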
Chaos Engineering is a method that allows teams to prepare for the situations behind the questions above. It keeps teams alert by structurally and randomly taking systems within an application landscape offline. As a rule, organizations are not very enthusiastic about this approach, which is understandable, because it introduces actual disruptions into the production environment.
So, while Chaos Engineering helps identify weaknesses in a system by introducing controlled failures, Recovery Tests specifically examine the system’s ability to recover after an outage has occurred. Both are valuable approaches to improving system resilience.
Chaos Engineering – A Hands-on Experience
Recently, my colleague Bram van de Kerkhof took the initial step towards Chaos Engineering at a client. The rest of this blog reflects his experiences and perspectives regarding the outcomes and conclusions.
Scope
The SecOps principles are of paramount importance to this customer. The customer’s challenge was that, in theory, they applied these principles, but it was unclear whether their practice was really at the right level in terms of maturity and prioritization. That is why we chose to focus on resilience and turn it into an activity involving all development teams. To minimize impact, we ensured that the critical and production systems remained unaffected during the activities unless the parties involved were explicitly informed. In practice, this meant that we ran the recovery test in the acceptance environment. This wasn’t perfect, but it was an initial step towards Chaos Engineering.
Approach
Within the agreed scope, the customer gave us carte blanche: we were allowed to do whatever we wanted. Our approach was to create chaos at different levels. Where possible, we added a degree of persistence to the disruption so that recovery alone was not enough: the teams had to devise and execute multiple actions to remedy it.
For example, we adjusted logic within the infrastructure so that important databases were emptied every 10 minutes. For other teams, we simulated a ransomware attack, sometimes by encrypting a database or manipulating dependencies. To make recovery even more difficult, we combined this with force-removing the source code in Git. By doing this in practice and actually removing resources or making them inaccessible, rather than settling for a theoretical or limited recovery exercise (e.g., restoring data to a database), we found the results clearly more applicable. The last trump card we played was a scenario that barred senior developers/engineers from participating in solving the disruption from the start. As a result, less experienced team members suddenly had to perform a recovery themselves. This made them very nervous and more than once exposed a single point of failure.
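To give an impression of what such a persistent disruption could look like, here is a minimal sketch of a recurring ‘wipe’ job, assuming PostgreSQL in an acceptance environment. The database name, table names, and interval are hypothetical, chosen to mirror the 10-minute scenario above. The point is the persistence: restoring a backup does not help until the team finds and stops the job itself.

```python
import subprocess
import time

# Hypothetical targets, agreed on with the client within the scope.
DB_NAME = "team_a_acceptance"
TABLES = ["orders", "customers"]
INTERVAL_SECONDS = 600  # empty the tables every 10 minutes

def wipe_tables() -> None:
    """Empty the agreed tables so that a single restore is not enough:
    the data disappears again until this job is found and stopped."""
    for table in TABLES:
        subprocess.run(
            ["psql", "-d", DB_NAME, "-c", f"TRUNCATE TABLE {table} CASCADE;"],
            check=True,
        )

if __name__ == "__main__":
    while True:
        wipe_tables()
        time.sleep(INTERVAL_SECONDS)
```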
Outcomes
After carrying out the Chaos Engineering activities, we evaluated the results with the teams, applying the same process baselines they would use for normal disruptions and incidents. The feedback from these evaluations was that a lot went well – sometimes even better than expected – but that there were also plenty of opportunities for improvement. For example, it became clear that within some teams there was a lack of knowledge sharing between team members. This became visible because, as mentioned before, we had sidelined the senior developer/engineer within the team. The persistence also meant that teams had to make multiple recovery attempts, forcing them to thoroughly investigate the root cause instead of merely treating symptoms.
They also found that they had spent a lot of time writing recovery scripts without understanding the root cause of the problem. This allowed the ‘attackers’ to retain access and remain within the environment longer than necessary. That was also the reason the client took immediate action: during the ‘postmortem’[2] evaluation, they planned training courses to enhance the teams’ skills on this subject. It was great to see the teams getting actively involved right away, enjoying themselves, and defining follow-up steps to prepare for future ‘chaotic’ situations.
The purists among us will say that what we’ve done is still a long way from true Chaos Engineering. But based on the principle that small steps lead to bigger ones, I felt it valuable to share this experience nonetheless. Walking before running, as it were.
Conclusion
With the rise of DevOps within teams, the focus on short feedback loops, and the expansion of automation, I am convinced that Chaos Engineering will play an increasingly important role in security. We need to move away from the approach in which organizations only sporadically check whether their systems are resilient. Only when we actually build resilience into the processes (the PDCA cycle) can organizations conclude, with a greater degree of certainty, that they are resilient. Or that they are not, but then it is at least clear what they must do to improve. The primary goal is to ensure business continuity. Because this is still a long way off for many companies, my advice to you is: start small, as we did, and see how much can come out of it.
Recognition of contribution
This blog is written in co-authorship with Bram van de Kerkhof.
Bram van de Kerkhof is a Subject Matter Expert (SME) for DevSecOps within Sogeti. DevSecOps, a secure-by-design methodology rooted in the core values of DevOps, is his area of expertise. With a focus on integrating security into the development process and production environment, Bram frequently fields inquiries from clients seeking to enhance their organization’s resilience. Recently, he embarked on an enlightening exploration of Chaos Engineering for one such client, an experience he is eager to share.
[1] By system, we mean a system consisting of people, processes, and techniques.
[2] In the context of IT, “postmortem” refers to a retrospective evaluation or analysis conducted after a disruption, incident, or period of operational activity has ended. It is intended to draw lessons, highlight successes, and identify areas for improvement for future reference or actions.