Imagine it is Friday night, and after a long week of hard work you are set to enjoy some much-deserved time off. You have planned a night with your significant other to finally watch that movie you have heard so much about. Your kids are at their grandparents' and you order your favourite takeout.
But then disaster strikes. The food ordering system has an outage, the payment provider is down, and when you try to start the movie on your chosen streaming provider you are faced with timeouts and errors.
Each of these issues can be worked around or is probably just temporary, but together they severely impact your experience as a customer. They might lead you to order food from another vendor next time or to choose a different streaming platform for your movies and series.
It is therefore imperative that organizations spend time and resources on preventing outages as much as possible and when they happen make sure that either the customer does not notice at all or the problem is quickly solved.
A Digital Immune System
There are many reasons an outage can occur, so we need several strategies to reduce the risk of outages and to recover quickly when they do happen.
So how can we do it? Humans have an immune system that fights off threats, and we keep it strong with healthy practices: healthy food, regular exercise, enough sleep and rest, and a visit to a doctor or specialist when needed.
We can give our digital systems an immune system in the same way. The six pillars of a Digital Immune System do just that.
Software Supply Chain Security
First and foremost, getting software from commit to production takes several steps, and at each of them both accidents and malicious intent can have a profound effect on the reliability of the released software. Cases such as the SolarWinds attack show how much the reliability and customer experience of software systems depend on good supply chain security.
Chaos Engineering
Chaos Engineering refers to the practice of deliberately introducing failures into systems and observing how the systems respond. These chaos experiments can be applied at different levels and in different environments, depending on the maturity of the organization. There is much to say about Chaos Engineering, but in a nutshell the practice starts with a hypothesis about how the system should behave, followed by an experiment that introduces failures such as added latency and intermittent errors. This allows engineers to demonstrate the reliability of their systems, ultimately in production.
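To make the hypothesis-and-experiment idea a bit more concrete, here is a minimal sketch in Python. It is an illustration only: the names `with_chaos`, `run_experiment` and `InjectedFault` are made up for this example and do not come from any chaos engineering tool, and a real experiment would run against real services with dedicated tooling.

```python
import random
import time


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate an intermittent failure."""


def with_chaos(func, failure_rate, extra_latency, rng=random, sleep=time.sleep):
    """Wrap func so every call is delayed and some calls fail at random.

    All parameter names here are hypothetical and chosen for this sketch.
    """
    def wrapper(*args, **kwargs):
        sleep(extra_latency)                 # injected latency
        if rng.random() < failure_rate:      # injected intermittent failure
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)
    return wrapper


def run_experiment(call, requests, min_success_rate):
    """Check the hypothesis 'at least min_success_rate of requests succeed'.

    Returns a tuple (hypothesis_holds, observed_success_rate).
    """
    successes = 0
    for _ in range(requests):
        try:
            call()
            successes += 1
        except InjectedFault:
            pass  # a failed request counts against the hypothesis
    rate = successes / requests
    return rate >= min_success_rate, rate
```

You would wrap a call to a dependency with `with_chaos`, state a hypothesis such as "95% of requests still succeed", and let `run_experiment` tell you whether it holds. If it does not, you have found a weakness before your customers did.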
AI-Augmented Testing
By using AI to augment traditional testing we can increase the confidence in our systems and, for example, generate scenarios that humans would struggle to come up with. As more and more of the testing process is automated and AI is applied to test autonomously, we save time, increase confidence and reduce risk.
Observability
We need end-to-end observability over our systems, not only in production but throughout the entire software development life cycle (SDLC). The insight we gain from metrics, logs and traces allows engineers to see issues before they hit production. Once the system is in production, it allows engineers to detect issues and take measures before users are impacted, thus improving the customer experience.
Auto Remediation
With the increased complexity of software systems, new problems pop up constantly. Depending on the scale of the system, this can require many engineers to look after it. With Auto Remediation we can handle many different cases automatically, both inside and between components. This reduces the demand on human resources and allows systems to repair themselves continuously. Examples include retrying failed requests, autoscaling based on metrics, and running more complex workflows in response to specific events.
And since Auto Remediation runs 24×7 it will also help ensure those Friday nights are problem free.
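Retrying failed requests is probably the simplest form of Auto Remediation. The sketch below shows one common way to do it, retrying with exponential backoff and a bit of random jitter. Treat it as an assumption-laden example: the function name `retry` and its parameters are invented for this post, and production systems usually rely on a well-tested library or platform feature instead.

```python
import random
import time


def retry(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation until it succeeds or the attempts are used up.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: time to involve a human
            # Exponential backoff: base_delay, 2x, 4x, ... plus random jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

The backoff gives a struggling dependency room to recover, and the jitter keeps many clients from retrying at exactly the same moment. When every attempt fails, the last error is re-raised so that it is not silently swallowed.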
Site Reliability Engineering (SRE)
Automation and AI are important to increase efficiency and reduce the workload of experienced engineers. However, the importance of the human side cannot be overstated. By applying good engineering practices and balancing delivery velocity with reliability, teams keep a constant focus on the customer experience. This reduces the burden of remediation and of dealing with technical debt.
1 + 1 = 3
None of the concepts in the six pillars is really new. However, by applying them together we get more than the sum of the parts. In the end, giving your customers the best possible experience is the purpose of IT solutions, and that is where Digital Immune Systems really shine.
Because all of us deserve our Friday night movie with our favourite food, without frustration.