Incident Response - A Digital Solution
When people ask what I do for work and I say that I work at TaskCall, an incident response system, I usually get a blank look. The description is accurate, but it relies on two words, incident response, that most people have never seen together. Unless you have needed it yourself, there is a good chance you have not heard of it, even though it is a growing segment.
Possible Incident Types
Incident response is the process of addressing a technical issue that occurs in a company. Such an issue could be:
- Business application and database issues
- Untested deployment releases
- Maintenance issues
- Cyber-security attack
- Network outage
- Datacenter outage
- Hardware failure
Examples of Notable Major Incidents
These issues or incidents (the latter is the technical term) are not as infrequent as people may be led to believe. They happen quite often, and the bigger the firm, the more often they happen. Only when they last for an extended period do they come into the public limelight, eating into the company's reputation and revenue. Some notable outages of recent times are:
- Facebook and WhatsApp (5th October, 2021) – The outage lasted for 6 hours and affected 3.6 billion users. Estimates suggest that losses amounted to $100,000,000.
- Tesla (10th November, 2021) – Tesla owners were unable to start their cars from the app for several hours. Tesla only realized there was an issue after owners started complaining. At least 500 people reported the issue.
- Amazon (8th December, 2021) – AWS, Prime Video and Alexa were affected. It was the third major outage for Amazon in the year.
Incident Response Policy
We live in a fast-paced digital world and a competitive landscape. Statistics suggest that 53% of mobile users will leave a site if it takes more than 3 seconds to load. If that is the benchmark, then an outage of several minutes or, worse, several hours can be irreparably damaging to the business. Hence, it is of utmost importance that each company has a policy for handling incidents when they do occur. It should outline:
- How incidents should be detected
- Who they should be assigned to
- How responders will collaborate
- How the impact should be minimized
- How customers will be notified
- How internal teams and stakeholders will be kept informed of progress
- How the cause and impact should be evaluated to avoid future occurrences
Apart from the response itself, the resolution and the setting of future precautions, these steps can be fully automated and handled by monitoring tools and incident response systems. Automation expedites the whole process and, in turn, reduces the losses incurred.
Automation in Incident Response
- Incident Detection – Entire systems and infrastructures can be watched automatically by monitoring tools like Amazon CloudWatch, Datadog, PRTG, Pingdom, CrowdStrike, etc. These tools are very efficient at tracking the performance of your system and identifying anomalies when they occur.
- Incident Notification – Once the incident is detected, it is passed on to an incident response system like TaskCall through integrations. The incident response system then determines who the correct on-call responders are for the type of issue and persistently notifies them through emails, push notifications, SMS, voice calls and ChatOps tools like Slack. If the primary on-call responder does not respond, the incident is automatically escalated to the secondary.
- Response Mobilization – Every second of downtime is costly and the objective should be to reduce it. Incident response systems do just that. They provide all the tools necessary for immediate collaboration and for engaging other responders. In some cases it is also possible to use these tools to trigger pre-configured resolutions of the incident with a single click.
- Impact Identification – A technical infrastructure is a mesh of interwoven components. An impact in one area can affect others. A sophisticated incident response application can automatically help identify and visualize these secondary and tertiary impacts on the system, saving responders even more time.
- Status Dashboards – Although customers are generally notified about ongoing issues at the surface level through social media like Twitter or Facebook or the company’s own website, status updates to internal teams and stakeholders are provided through status dashboards offered by incident response systems. Technical details can be posted here and shared securely with the interested parties. This saves responders from having to answer the same inquiries about progress from multiple colleagues.
- Postmortems – After the incident has been resolved, its evaluation is critical. Incident response systems provide all the data needed here – what caused the issue, how long it took to acknowledge it, how long it took to resolve it, how many people were involved, which systems were impacted, which business segments were impacted and what the monetary loss was. All critical information is automatically put together for an effective postmortem.
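To make the notification step above a little more concrete, here is a minimal sketch in Python of how escalation from a primary to a secondary on-call responder could work. It is an illustration only and not how TaskCall or any other product is implemented. The names `Responder`, `EscalationPolicy` and `notify_until_acknowledged`, as well as the `send` and `acknowledged` callbacks, are hypothetical and made up for this example. A real system would also wait for a timeout between notification rounds, which the `acknowledged` callback stands in for here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Responder:
    name: str
    channels: List[str]  # hypothetical channel labels, e.g. ["push", "sms", "voice"]


@dataclass
class EscalationPolicy:
    # Ordered on-call levels: index 0 is the primary, index 1 the secondary, and so on.
    levels: List[Responder]
    # Number of notification rounds per level before escalating (an assumed setting).
    attempts_per_level: int = 2


def notify_until_acknowledged(
    incident_title: str,
    policy: EscalationPolicy,
    send: Callable[[Responder, str, str], None],
    acknowledged: Callable[[Responder], bool],
) -> Optional[Responder]:
    """Notify each level in order and return the responder who acknowledges.

    Returns None if nobody in the policy acknowledges the incident.
    """
    for responder in policy.levels:
        for _ in range(policy.attempts_per_level):
            # Persistent notification: try every channel the responder has.
            for channel in responder.channels:
                send(responder, channel, incident_title)
            # Stand-in for "wait for the timeout, then check for a response".
            if acknowledged(responder):
                return responder
        # No acknowledgement at this level, so escalate to the next one.
    return None
```

In use, `send` would wrap whatever delivers the email, SMS or call, and `acknowledged` would check whether the responder has accepted the incident. Keeping both as callbacks means the escalation order can be tested without sending a single real notification.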
Responding to an incident is not fun. It is made worse by the time crunch each responder works under, knowing that every minute is costing the firm thousands of dollars. The process is strenuous as it is, but it can be made far easier through the use of incident response systems. They work in conjunction with your monitoring tools to keep you sane when everything else is turning upside down. They will save you precious time and money.