
CrowdStrike – And What We Can Learn From This

Sep 2, 2024
Mikko Ruuhamo

During the summer holidays, many of you may have read about CrowdStrike’s less-than-successful release of a new version. For those who missed it, a brief summary of the issue can be found here: https://www.crn.com/news/security/2024/microsoft-crowdstrike-update-caused-outage-for-8-5-million-windows-devices

In short: CrowdStrike, a cybersecurity software company, released an update that sent Windows machines running its software into a worldwide boot loop. Not everything went smoothly, and the timing of the release, a Friday afternoon right before the weekend, didn’t help the response, as many employees had likely already left for the weekend.

So, What Happened?

CrowdStrike published a root cause analysis of the incident, which can be found at:
https://www.crowdstrike.com/wp-content/uploads/2024/08/Channel-File-291-Incident-Root-Cause-Analysis-08.06.2024.pdf

The analysis is 12 pages long and, in my opinion, quite well written. I won’t go through it in detail here, since it deserves to be read in full. In short, the error came about for the following reasons:

  • CrowdStrike released a new software version with a new feature that inspects interprocess communication (IPC) between Windows processes and blocks attacks that abuse it.
  • The new version of this feature added one more input parameter, bringing the total to 21 input parameters that the detection logic uses to check for these attacks. However, the integration code that supplies these parameters still sent only the previous 20.
  • The code still passed the various test cycles, because the only test case for the new input parameter was a simple wildcard match, i.e. a check that some value, any value, comes through.
  • So the program expected 21 input parameters but received only 20. When it tried to read the 21st parameter from a memory location that didn’t exist, the result was an out-of-bounds memory read (see the sketch after this list).
  • As a result of this out-of-bounds read, the system crashed.
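To make the mechanism concrete, here is a minimal Python sketch of that mismatch. Everything in it is illustrative; the names and structure are invented for this post and are not CrowdStrike’s actual code. In Python the out-of-range access fails safely with an exception, whereas in the sensor’s C++ code the same mistake became an out-of-bounds memory read inside a component running in the Windows kernel.

```python
# Minimal, purely illustrative sketch of the mismatch (not CrowdStrike's code).
EXPECTED_FIELD_COUNT = 21  # the new detection logic assumes 21 input fields

def evaluate_detection(input_fields):
    """Reads every expected field, including the new 21st one."""
    for index in range(EXPECTED_FIELD_COUNT):
        value = input_fields[index]  # index 20 does not exist in a 20-item list
        print(f"field {index}: {value}")

# The integration side still supplies only 20 values:
fields_from_template = [f"value-{i}" for i in range(20)]

try:
    evaluate_detection(fields_from_template)
except IndexError as error:
    # Python fails safely with an exception; in memory-unsafe C++ the same
    # read goes out of bounds and, inside a kernel driver, crashes Windows.
    print(f"caught: {error}")
```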

And What Can We Learn From This?

Of course, that testing should be increased 😊 But beyond that obvious answer from a software tester, this is a good moment to think about how a similar situation could be prevented. CrowdStrike listed several observations and corrective actions in their report; I’ll share a few here, along with my own thoughts:

  • The importance of regression testing and test maintenance. If regression testing had been done properly, with an up-to-date and comprehensive regression suite, an issue like this would have been caught.
  • Different levels of test environments. A mistake like this shouldn’t get past the development or test environment, let alone make it to production.
  • Input parameters and test content. It’s not enough to test with wildcards, as in “well, some value comes through, so the test passes.” You really need to think about which values to check and verify (see the sketch after this list).
  • Boundary testing. Don’t feed the system only the values it expects, but also the values it doesn’t expect.
  • Negative testing. CrowdStrike’s report states that test automation existed for this feature, but for the new parameter there was only a check that some value comes through (a regex wildcard). In other words, only the happy path was covered.
  • Going live on a Friday. Deployments often happen on Fridays to avoid disrupting environments during the week. However, if something goes wrong, fixing it is slower and more expensive.
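To illustrate the difference between those test styles, here is a small pytest-style sketch. The parser, its 21-field contract, and the test names are hypothetical and invented for this example; they are not taken from CrowdStrike’s code or report.

```python
# Sketch of wildcard-style vs. value/boundary/negative checks (hypothetical code).
import pytest

def parse_template_fields(raw: str) -> list[str]:
    """Hypothetical parser that should always yield exactly 21 fields."""
    fields = raw.split(";")
    if len(fields) != 21:
        raise ValueError(f"expected 21 fields, got {len(fields)}")
    return fields

# Wildcard-style test: only asserts that *something* comes back.
def test_wildcard_only():
    fields = parse_template_fields(";".join(f"v{i}" for i in range(21)))
    assert fields  # "some value came through" -- passes far too easily

# Value check: verify the new 21st field actually carries the expected content.
def test_new_field_value():
    fields = parse_template_fields(";".join(f"v{i}" for i in range(21)))
    assert fields[20] == "v20"

# Boundary / negative test: feed the input the system does NOT expect,
# i.e. only 20 fields, and require a controlled failure instead of a crash.
def test_too_few_fields_rejected():
    with pytest.raises(ValueError):
        parse_template_fields(";".join(f"v{i}" for i in range(20)))
```

The wildcard-style test passes as long as anything at all comes back; the value check and the negative test are the ones that would actually flag a missing 21st field.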

Testers and testing rarely get praised. We are usually the ones who slow down deployments and ask tough questions. After all, our job is to try to break a working system and find its errors, so praise for a job well done is rare. On the other hand, our work is precisely about preventing situations like this, and events like this show that testing matters.

Despite everything, CrowdStrike should be praised for how they handled the situation after the proverbial stuff hit the fan.

What CrowdStrike Did Well, In My Opinion:

  • They identified the error quickly, found a solution to the problem, and distributed the fix.
  • They acknowledged the issue and communicated it quickly, so that people were not left in the dark.
  • They apologized and provided a communication channel for those affected by the issue.
  • They didn’t shift the blame onto others or throw those responsible under the bus.
  • They were transparent about the cause of the error and admitted the mistakes that were made.

Everyone makes mistakes, and sometimes a mistake causes significant problems. In this case, the mistake was a big one. But beyond preventing errors, what’s important is how you respond and act when they occur. It’s said that trust is built gradually over time but can be lost very quickly. Many probably thought that in the cybersecurity field, where trust in the product is almost everything, this mistake would drive CrowdStrike into bankruptcy.

But in my opinion, CrowdStrike handled the aftermath so strongly and transparently that my trust in them is even stronger after this event—or at least, as long as they can prevent this kind of situation from happening and no similar issues arise anytime soon. This is also reflected in their stock price, which has stabilized and even started to rise slightly after the initial drop.

What I take away from this is that while it’s crucial to prevent things from hitting the fan, it’s even more important how you react and act when they do. Because no matter how hard you try to prevent mistakes, 100% error-free software does not exist.

This reminds me of the famous Apollo 13 incident (which also has a great movie made about it). It is perhaps one of the best-known examples of a situation where it initially seemed like almost everything had failed, yet the team came together, solved the problems one by one, and everyone got home safely. This kind of “successful failure” can even strengthen a team, provided the error resolution, the aftermath, and the fixing of the root causes are handled correctly.

We usually learn the most from our mistakes. Let’s learn from this one as well!

About the author

Mikko is a test automation consultant with a long history of using test automation, mainly Robot Framework. He studied embedded systems and graduated as an IT engineer.
