As solutions move from local to global and from physical to cloud, we are creating robust IT ecosystems and applications, building in the necessary failover and safety features. Stability, uptime, and performance are built in from an architectural point of view and are no longer an afterthought. We rely on global providers, proven solutions, and guaranteed SLAs. We sleep soundly at night, knowing we have taken every measure and precaution we could… And then the smallest pebble somehow ends up in the machine and grinds everything to a halt.
As IT systems become more complex and intertwined, it is crucial that business continuity is guaranteed as far as possible. Even the best-laid plans – and best-developed products – will inevitably, and usually unexpectedly, fail. While Shift Left has been huge over the past decade (and rightly so), Shift Right has been steadily gaining merit.
DRP, or Disaster Recovery Process, testing makes sure that the contingency plans we put in place kick in quickly and effectively in case of an interruption or disaster. Depending on the source you consult, there are different levels of DRP testing, but they all fall into one of two groups: static or dynamic DRP tests.
The static categories can include:
- Checklist testing – simply reviewing the available documentation and checking completeness and compliance
- Tabletop testing – the DRP team reviews their role and runs through the steps of the DRP scenario
- Walkthrough testing – similar to Tabletop testing, but usually in more detail, with the intent to uncover any gap or weakness
The dynamic categories involve actual system interruption, and can include:
- Simulation testing – a potential disaster scenario is simulated in a controlled test environment, which allows the DRP team to do a realistic dry run of the recovery plan
- Parallel testing – in the case of IT infrastructure DRP, the backup systems are run in parallel with the primary systems to make sure they function properly and can take over the production load quickly and efficiently if needed
- Full-interruption testing – one of the more impactful DRP tests, where the actual production system is deliberately brought to a halt to make sure the necessary backup systems and processes all work as intended
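To make the dynamic categories a little more concrete, here is a minimal Python sketch of a full-interruption drill in miniature. Everything in it is hypothetical and invented for illustration: the `Service` and `Failover` classes and the `run_interruption_drill` helper are in-memory stand-ins, not a real failover mechanism. A real drill would target actual infrastructure and would also measure how long recovery takes.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical stand-in for a real system; it only tracks whether it is up."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Failover:
    """Routes requests to the primary and falls back to the backup on failure."""

    def __init__(self, primary: Service, backup: Service) -> None:
        self.primary = primary
        self.backup = backup

    def handle(self, request: str) -> str:
        try:
            return self.primary.handle(request)
        except ConnectionError:
            return self.backup.handle(request)


def run_interruption_drill(router: Failover, requests: list[str]) -> list[str]:
    """Take the primary down on purpose, replay the requests, then restore it."""
    router.primary.healthy = False
    try:
        return [router.handle(r) for r in requests]
    finally:
        # Always bring the primary back, even if the drill itself fails.
        router.primary.healthy = True


router = Failover(Service("primary"), Service("backup"))
results = run_interruption_drill(router, ["order-1", "order-2"])
print(results)  # ['backup served order-1', 'backup served order-2']
```

If the backup had been misconfigured, the drill would raise an error instead of returning results, which is exactly the kind of gap we would rather find during a planned test than during a real outage.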
While DRP testing is often overlooked, it can have a dramatic impact on an organization’s operational readiness in case of unforeseen circumstances. Dynamic DRP tests are a great way to safely uncover gaps in the procedures in the worst case, or to build trust and confidence in the recovery plan in the best case. DRP tests make sure that all contingency procedures are well documented and that the teams are practiced and prepared, which keeps disruption to a minimum and business continuity at a maximum. As the IT landscape of any organization changes over time, it is important to repeat DRP testing on a regular basis so the plan keeps pace with the changed system conditions.
In today’s market, any business downtime can be a killer, with a direct impact on reputation, people, and the bottom line. It is therefore crucial that organizations complement their Shift Left approach with production monitoring and DRP testing to minimize any impact or business disruption.
“Hope for the best, plan for the worst.” – Lee Child
PS: A branch of QA that complements DRP testing is Chaos Engineering – more on that in the next blog… 🐒