Like many people in IT, I don’t like risks. More precisely, I often make sure people address risk and are prepared for what comes from taking, and avoiding, risks. But covering risks, no matter which way you choose, is prone to some issues:
|The geeky way where solutions are used to cover every single technical risk||The hero way where risk assessment is based on avoiding past (painful) experiences|
Three orientations are often considered:
- taking into account the complete risk analysis
- weighing the risk with the correct business values
- covering the risk with a constructive approach
Let’s examine some real-life situations and see what we can learn from them.
|The Geek||The Hero|
|…with BGP and our dual site load balancing solution, consumers will always be able to use our eCommerce platform…||…last year we had so many connections that the eCommerce server stopped handling them. We are now able to handle 10 times more connections from our customers…|
|What about the (new) risk of a wrong failover in the new load-balancing solution, or of a misconfiguration between them?||Did we test that the backend servers are able to handle that many service calls? Wasn’t the network already at 40% utilization last year?|
When analyzing a risk (or, for that matter, an event that has caused a problem), the complete service chain has to be evaluated, including, iteratively, the risk reduction solution itself. A new solution can also change the risk analysis of other components, as in any change management situation.
Remember how many times we, geeks and heroes united, underestimated the inherent risk of a mitigation solution, especially the risk that something different happens and the risk introduced by the change itself.
Whenever possible, solutions should be as independent as possible from the system whose risks they cover. Solutions that add complexity are often the cause of an exponential increase in the overall risk on the system.
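To make the "complete service chain" point concrete, here is a minimal sketch that treats the dual-site setup as two redundant sites behind a balancing layer. The availability figures, the independence of failures, and the function names are all assumptions chosen for illustration; they are not measurements from the situations above.

```python
from math import prod

def serial(*availabilities: float) -> float:
    """Every component must work, so availabilities multiply."""
    return prod(availabilities)

def redundant(*availabilities: float) -> float:
    """At least one component must work (independent failures assumed)."""
    return 1 - prod(1 - a for a in availabilities)

# Hypothetical availability of one site, for illustration only.
SITE = 0.99

def dual_site(balancer: float) -> float:
    """Two redundant sites behind a balancing layer that can itself fail."""
    return serial(balancer, redundant(SITE, SITE))

single_site = serial(SITE)
ideal = redundant(SITE, SITE)           # what "always available" silently assumes
with_good_balancer = dual_site(0.995)   # better than one site, still far from ideal
with_poor_balancer = dual_site(0.98)    # worse than the single site it replaced

for label, value in [
    ("single site", single_site),
    ("ideal dual site", ideal),
    ("dual site, good balancer", with_good_balancer),
    ("dual site, poor balancer", with_poor_balancer),
]:
    print(f"{label:>26}: {value:.4%}")
```

Under these assumptions, the balancing layer (including its failover logic and configuration) caps what the redundancy can deliver, and a fragile one leaves the platform less available than before.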
|The Geek||The Hero|
|We can’t risk any failure this winter season; we must be able to handle every connection from customers. Last year we peaked at 10,000, but this year, with the new top-of-the-line servers, we’ll be able to handle 10 times more.||Last year we had to completely shut down our partner’s web services because the servers weren’t responding anymore. The sales departments were very angry with us. We must set up a load-balancing solution to handle the traffic this time.|
|How many customers does the marketing department expect this winter? 11,000? How much does customer acquisition cost? $100 per user? Out of the 1,000 customers above last year’s peak, how many actually buy something? 100, at an average revenue of $200 per customer? So we have a $120k risk (1,000 × $100 in acquisition cost plus 100 × $200 in revenue)? Don’t the new servers and their setup cost $500k? Wouldn’t it be possible to mitigate the risk with a static “waiting page” presenting our winter offering, keeping more than 50% of the customers on our site?||Wasn’t last year’s issue mainly caused by a partner’s bug sending hundreds of calls per minute? Wouldn’t it be possible to enforce a call rate limit (globally or per source)? Wouldn’t it be better to charge partners back based on that system? Isn’t that already in the partners’ contracts?|
Evaluate the business value of the service (and anticipate the business value lost when the risk occurs) together with the business departments that depend on the service’s results. Don’t keep the solution within IT unless the risk is purely technical. These are business risks that must be shared: they will occur, and their consequences must be anticipated in the business figures (which themselves already account for other risks).
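The back-of-the-envelope figures from the Geek’s column can be written down as a small calculation. This is only a sketch: the class and field names are hypothetical, the 10% conversion rate is derived from "100 out of 1,000", and treating every customer kept by a waiting page as not lost is a simplifying assumption.

```python
from dataclasses import dataclass

@dataclass
class OverloadRisk:
    expected_customers: int    # marketing forecast for the season
    capacity: int              # customers the current platform can serve
    acquisition_cost: float    # money already spent to bring one customer in
    conversion_rate: float     # share of customers who actually buy
    revenue_per_buyer: float   # average revenue per buying customer

    @property
    def customers_at_risk(self) -> int:
        return max(0, self.expected_customers - self.capacity)

    def exposure(self, retained_share: float = 0.0) -> float:
        """Value lost if the platform saturates; retained_share is the
        fraction of overflow customers kept anyway (e.g. by a waiting page)."""
        lost = self.customers_at_risk * (1 - retained_share)
        per_customer = self.acquisition_cost + self.conversion_rate * self.revenue_per_buyer
        return lost * per_customer

winter = OverloadRisk(
    expected_customers=11_000,
    capacity=10_000,
    acquisition_cost=100.0,
    conversion_rate=0.10,
    revenue_per_buyer=200.0,
)
SERVER_UPGRADE_COST = 500_000.0  # figure quoted for the new servers and their setup

print(f"exposure without mitigation: ${winter.exposure():,.0f}")
print(f"exposure with waiting page:  ${winter.exposure(retained_share=0.5):,.0f}")
print(f"cost of the server upgrade:  ${SERVER_UPGRADE_COST:,.0f}")
```

Putting the three numbers side by side is the point: the comparison belongs in a discussion with the business departments, since the inputs are theirs.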
|The Geek||The Hero|
|To improve our cybersecurity suite of solutions, we have acquired an appliance that will limit the request rate at the edge of our network. We are deploying it for all the public websites and web services.||Last year we had an outage due to many errors in a business-critical integration flow, which caused many corrections and flow resubmissions on our side. We should re-evaluate our target load and add new capacity to handle it, even if it means stopping other flows to free up capacity.|
|Didn’t we have an issue last year with your solution when it blocked a user named “Tom Select”? We spent days trying to figure it out, as the requests were blocked at the network level and the application wasn’t even aware of the error. Did you contact each team so that they can define, with the business departments, the rates that should be set and the chain of responsibility for transferring the event to the right team? For example, a key partner using a web service may be allowed special rates for a particular sales operation, or the business partnership team could use this to make upsell deals with low-revenue partners.||Wouldn’t it be less risky to accept in production only integration flows that have been completely tested, including non-nominal use cases? If those errors are due to the partner’s data, wouldn’t it be more constructive to reject the data and let the partner (or the partner’s software) handle it? We don’t have the information to correct these errors in the flow itself; the partner does, and the partner certainly needs to know that the flow will not be completed (or not in time). Moreover, the business partnership teams should be informed and could use those cases as leverage in later business negotiations.|
Risk mitigation that hides the occurrence of the risk is often a non-constructive approach. A risk affects a complete chain (technical and business). Unless its causes and consequences are purely technical, its occurrence has to be anticipated:
- as a business-technical event, caused by a technical event but with business consequences that have to be traceable and may warrant communication (to business partners especially);
- or as a business event, which should be anticipated and mitigated through business requirements and should leave room for business traceability.
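As an illustration of a rate limit that stays traceable instead of silently dropping requests, here is a minimal per-source token bucket. It is a sketch under assumptions: the class names, the partner identifiers, and the limits are hypothetical, and the caller supplies the clock so the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class Decision:
    allowed: bool
    source: str
    reason: str  # returned to the caller and logged, never hidden

class PerSourceRateLimiter:
    """One token bucket per calling source (for example, per partner)."""

    def __init__(self, default_per_minute: float,
                 overrides: Optional[Dict[str, float]] = None) -> None:
        self.default = default_per_minute
        # Rates agreed with the business for specific sources.
        self.overrides = overrides or {}
        self._buckets: Dict[str, Tuple[float, float]] = {}  # source -> (tokens, last seen)

    def limit_for(self, source: str) -> float:
        return self.overrides.get(source, self.default)

    def check(self, source: str, now: float) -> Decision:
        """Decide on one call from `source` at time `now` (in seconds)."""
        limit = self.limit_for(source)
        tokens, last = self._buckets.get(source, (limit, now))
        # Refill in proportion to elapsed time, capped at one minute's worth.
        tokens = min(limit, tokens + (now - last) * limit / 60.0)
        if tokens >= 1.0:
            self._buckets[source] = (tokens - 1.0, now)
            return Decision(True, source, "ok")
        self._buckets[source] = (tokens, now)
        return Decision(False, source, f"rate limit of {limit:g} calls/min exceeded")

limiter = PerSourceRateLimiter(default_per_minute=2, overrides={"key-partner": 5})
burst = [limiter.check("buggy-partner", now=0.0) for _ in range(3)]
for decision in burst:
    print(decision)
```

Because every denial carries the source and the reason, the application can report it (in an HTTP service, typically as a 429 Too Many Requests response) and the partnership team can follow up, which is what the edge appliance in the example above could not offer.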
It’s all about risks; they need to be addressed and mitigated, but above all they need to be shared. We often try, with the most sincere intentions, to hide them, showing how geeks and heroes can make IT magical.
But business is more than IT: it is a world where taking risks is a daily routine, and business people need to be in the loop to cover the risks taken as precisely as possible. A more precise and transparent risk analysis, based on and tied to the correct business values and covered with a constructive approach, is key to reducing the coverage required, which frees up investment, some of it for IT…