I have a fondness for watching documentaries about aviation disasters.
Now, before you judge me as someone with a psychological disorder (yes, we all slow down when we pass an accident on the highway, but planes crashing into each other or into the ground?), let me explain why I watch these depressing films and what they have to do with IT work.
I should start by noting that, as a private pilot, I have a direct interest in why aviation accidents happen. Learning from others’ mistakes is an important part of staying safe up there.
There’s also a more universal explanation for the impulse. As one article on the subject puts it: “Ever see a car accident happen and find yourself compelled to Google what happened? Dr. Mayer says this is also our survival instincts at work. ‘This acts as a preventive mechanism to give us information on the dangers to avoid and to flee from,’ he says.”
But then there’s another reason I watch these documentaries, one that has only recently become clear to me: seeing how mistakes are made in a domain where mistakes can kill can be generalized to how mistakes can be avoided in other domains where the results, while less catastrophic to human life, are still of serious concern.
In my case, and likely in anyone’s case who is reading this, that’s the domain of IT work.
The most important lesson I take away from aviation disaster stories is that disasters are rarely the result of a single mistake; they result from a chain of mistakes, any one of which, if caught in time, would have prevented the negative outcome.
Let me give an example of one such case and see what we, as IT professionals, might learn from it.
“On the night of 1 July 2002, Bashkirian Airlines Flight 2937, a Tupolev Tu-154 passenger jet, and DHL Flight 611, a Boeing 757 cargo jet, collided in mid-air over Überlingen, a southern German town on Lake Constance, near the Swiss border. All 69 passengers and crew aboard the Tupolev and both crew members of the Boeing were killed.”
On the night of July 1, 2002, two aircraft collided over Überlingen, Germany, resulting in the death of 71 people onboard the two aircraft.
The accident investigation that followed determined that the following chain of events led to the disaster:
- The Air Traffic Controller in charge of the safety of both planes was overloaded as the result of the temporary departure of another controller in the center.
- An optical collision warning system was out of service for maintenance, but the controller had not been informed of this.
- A phone system used by controllers to coordinate with other ATC centers had been taken down for service during his shift.
- A change to TCAS (the Traffic Collision Avoidance System) on both aircraft that would have helped, one derived from a similar incident months earlier, had not yet been implemented.
- The training manuals for both airplanes provided confusing information about whether TCAS or the ATC’s instructions should take priority if they conflicted.
- Another change to TCAS, one that would have alerted the controller to the conflict between his instructions and the TCAS instructions, had not yet been deployed.
Many issues led to the disaster (all of which, thankfully, have since been resolved), but the important thing to note is that if any one of them had not arisen, the accident would likely not have happened.
That being true, what can we learn from this?
I would argue that, in each case, the people responsible for the “system” of air traffic control, airplane systems design, and crew training could have recognized that the issue in front of them might lead to a disaster and dealt with it in a timely manner. This is true even though each issue, taken by itself, could have been (and probably was) dismissed as being of little importance.
In other words, a mindset that says any single issue should be addressed as soon as possible, without waiting for a detailed analysis of how it might contribute to a negative outcome, might have made all the difference here.
And here is where I think we can apply some lessons from this accident, and many others, to our work on IT projects.
We should assume, absent evidence to the contrary, that a single issue arising during a project could have negative implications that are not immediately obvious, and that it should therefore be addressed and remediated as soon as practicable.
The difficult part of implementing this advice is deciding whether a single issue really could affect the entire project, and weighing the cost of immediate remediation against the cost of eventual failure. There is no easy answer to this. I tend to believe that unless there is a strong argument showing why a single issue cannot become part of a failure chain, it should be fixed now. Alternatively, if the cost of failure is judged to be less than the cost of immediate remediation, the issue can be put aside (but not ignored) for the time being.
To put this into perspective in our line of work:
Let’s imagine a system to be delivered that provides web-based consumer access to a catalog of items.
Let’s further imagine that the following are true:
- The catalog data is loaded into the system database using a CSV export of data from another system of ancient vintage.
- Some of the data imported goes into text fields.
- Those text fields are directly used by the services layer.
- Some of those text fields determine specific execution paths through the service layer code.
- That service code assumes the execution paths can be completely specified at design time.
- The UI layer is designed assuming that delivery of catalog data for display will be “browser safe”–i.e., no characters that will not display as intended.
This is a simple example, and over-constrained, but I think you can see where this is going.
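To make the failure chain concrete, here is a minimal sketch of the kind of service-layer code the list above implies. Everything in it is hypothetical (the CSV column names, the handler functions, the `CATEGORY_HANDLERS` table); the point is only to show where the assumption that execution paths are completely specified at design time lives in the code.

```python
import csv

# Hypothetical handlers for the execution paths the designers
# believed were completely specified at design time.
def handle_book(row):
    return f"Book: {row['title']}"

def handle_music(row):
    return f"Album: {row['title']}"

CATEGORY_HANDLERS = {
    "book": handle_book,
    "music": handle_music,
}

def load_catalog(csv_path):
    """Load rows from the legacy system's CSV export, unmodified."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def render_item(row):
    # The weak link: any category value the designers did not
    # anticipate (a typo, stray whitespace, an oddly encoded
    # character) raises KeyError and breaks the page.
    return CATEGORY_HANDLERS[row["category"]](row)
```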
If data from the source system, destined for the target system’s text fields, contains characters that are not properly handled by the services layer and/or the UI layer, bad outcomes are likely to result.
For instance, some older systems accept text pasted from MS Word documents, in which Word has promoted plain single and double quotes to their “curly” equivalents, and store the resulting Unicode characters in raw form. Downstream, this might cause failures in the service layer or improper display in the UI layer.
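Here is a small, hedged illustration of that failure mode (the title value and the `normalize_text` helper are my own invention, not part of any particular system): the “curly” quotes arrive as the Unicode code points U+201C and U+201D rather than the ASCII quote character, so naive comparisons fail, and a page that assumes a limited character set may render them badly.

```python
# A title pasted from Word: the quotes are U+201C/U+201D, not ASCII 0x22.
raw_title = "\u201cThe Art of Flight\u201d"

# The ASCII-quoted value a naive service layer might compare against.
print(raw_title == '"The Art of Flight"')        # False

# One possible normalization step: map common "smart" punctuation
# back to its ASCII equivalents before the data goes any further.
SMART_PUNCTUATION = {
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "-",    # en and em dashes
}

def normalize_text(value: str) -> str:
    return value.translate(str.maketrans(SMART_PUNCTUATION))

print(normalize_text(raw_title) == '"The Art of Flight"')   # True
```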
Most of us, as experienced IT professionals, would likely never let this happen. We would sanitize the data at some point in the process, and/or provide protections in the service/UI layers to prevent such data from producing unacceptable outcomes.
But, for a moment, I want you to think of this less as an argument for “defense in depth” programming and more as an exercise in taking each step of the process outlined above as a separate item, without knowing how each one builds toward the ultimate, undesirable outcome, and deciding to mitigate it on the basis of the simple possibility that it might cause a problem.
For example, if the engineer responsible for coding the CSV import process says “the likelihood of having problems with bad data can be ignored or taken care of in the services layer”, my suggested answer would be “you cannot be sure of that, and if we cannot be sure it won’t happen, you need to code against it”.
And, I would give the same answer to the services layer engineer who says “the CSV process will deal with any such issues”. You need to code against it.
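Continuing the hypothetical sketches above (and reusing the made-up `CATEGORY_HANDLERS` and `normalize_text` names from them), “code against it” in both places might look roughly like this: the import step rejects what it cannot handle, and the service layer still refuses to assume the import step caught everything.

```python
def import_row(raw_row):
    """Import-side defense: normalize the text and reject rows the
    downstream dispatch could never handle."""
    row = {key: normalize_text(value) for key, value in raw_row.items()}
    if row.get("category", "").strip().lower() not in CATEGORY_HANDLERS:
        raise ValueError(f"unknown category in import: {row.get('category')!r}")
    return row

def render_item_defensively(row):
    """Service-side defense: do not trust that the importer caught everything."""
    category = normalize_text(row.get("category", "")).strip().lower()
    handler = CATEGORY_HANDLERS.get(category)
    if handler is None:
        # Fail safely: fall back to a generic listing (and, in a real
        # system, log the surprise) instead of letting one unexpected
        # value break the whole page.
        return f"Catalog item: {row.get('title', 'unknown')}"
    return handler(row)
```

Neither function knows how the other behaves, and that is the point: each one removes a link from the failure chain on its own.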
It may sound like I’m simply suggesting that “defensive coding” is a good idea, and it is. But, even granting that the example given is too easy, the broader idea I am suggesting is that you need a mindset that removes each and every item in a possible failure chain without knowing, for certain, that it will ever cause a problem.
This suggestion is not without its drawbacks, and I would encourage you to provide your thoughts, pro or con, in the comments section of this blog.
In the meantime, I’ll be over here watching another disaster documentary….