The Stages of Incident Management
Server downtime, data breaches and other service issues on the business side are among the most common issues that IT and DevOps teams face. How efficiently teams manage availability and performance plays a significant role in user experience and revenue. Finding and remediating incidents is integral to business performance, but how exactly is it managed?
What is Incident Management?
Incident management is a process to detect and resolve unforeseen events or service performance issues. An incident can be any unexpected event or service interruption that causes a disruption or degradation of service performance. While incidents will vary widely in the severity of the impact, ultimately the goal is to detect and restore the service to its normal operational state as quickly as possible. Incident management processes are typically measured by mean time to detection (MTTD) and mean time to resolution (MTTR).
The KPIs & Cost of Incident Management
Before reviewing the incident management lifecycle, there are several important KPIs that need to be tracked and measured. Not only do these KPIs ensure the organization’s processes are improving over time, but they also help you understand the cost of each incident. These include:
- Time to detection: This is the average time to detect and alert the appropriate team of a potential incident.
- Time to resolution: This is the average time it takes the team to resolve an incident after being alerted.
- SLA compliance rate: This is the percentage of incidents that are resolved within the timeframe defined in the service-level agreement (SLA).
- Incident backlog: This is the number of pending incidents that have yet to be resolved.
- Customer satisfaction rate: This is the percentage of customers that are satisfied with the IT services and how past incidents have been managed by the provider.
- Incident cost: Finally, each incident should be measured in terms of how much it impacts technical operations and/or the bottom line. With business impact alerts, for example, Anodot lets you assign a monetary value to each metric that’s monitored. This impact value is then attached to future alerts so that you know how much the anomaly has cost thus far. (This tool can gauge, based on your company’s size and operations, how much incidents are costing your company.)
These are just a few important KPIs to consider.
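As a rough illustration, the KPIs listed above can be computed from a simple incident log. The sketch below is not taken from any particular tool: the `Incident` record, its field names, the four-hour SLA window and the per-hour cost figure are all assumptions made for the example. It also assumes that SLA compliance is measured over resolved incidents only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime             # when the problem actually began
    detected: datetime            # when the team was alerted
    resolved: Optional[datetime]  # None while the incident is still open
    cost_per_hour: float = 0.0    # assumed monetary impact per hour


def mean_hours(deltas):
    """Average an iterable of timedeltas, in hours (0.0 if empty)."""
    deltas = list(deltas)
    if not deltas:
        return 0.0
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 3600


def incident_kpis(incidents, sla=timedelta(hours=4)):
    """Compute basic KPIs; `sla` is an assumed resolution window."""
    closed = [i for i in incidents if i.resolved is not None]
    within_sla = [i for i in closed if i.resolved - i.detected <= sla]
    return {
        # Mean time to detection: start of the problem -> alert.
        "mttd_hours": mean_hours(i.detected - i.started for i in incidents),
        # Mean time to resolution: alert -> resolution (closed incidents only).
        "mttr_hours": mean_hours(i.resolved - i.detected for i in closed),
        # Share of closed incidents resolved inside the SLA window.
        "sla_compliance": len(within_sla) / len(closed) if closed else 1.0,
        # Incidents that are still open.
        "backlog": len(incidents) - len(closed),
        # Cost accrues from the start of the problem until resolution.
        "total_cost": sum(
            i.cost_per_hour * (i.resolved - i.started).total_seconds() / 3600
            for i in closed
        ),
    }


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1)
    log = [
        Incident(t0, t0 + timedelta(hours=1), t0 + timedelta(hours=3), 100.0),
        Incident(t0, t0 + timedelta(hours=3), t0 + timedelta(hours=9), 50.0),
        Incident(t0, t0 + timedelta(hours=2), None),
    ]
    print(incident_kpis(log))
```

Customer satisfaction is left out because it comes from surveys rather than from incident timestamps.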
Incident Management Lifecycle: From Detection to Resolution
As mentioned, the end goal of incident management is to go from detection to resolution in the fastest time possible. There are, however, tools and processes that must be employed to achieve that goal. Although there isn’t always a specific process for managing every possible scenario, a general framework should be in place in order to guide decision-making.
In their Incident Management Handbook, Atlassian describes the following five stages of the incident management lifecycle:
- Detect: First and foremost, an effective monitoring tool must be in place to detect problems and alert the appropriate team before customers notice them. This means that the monitoring solution must provide real-time alerts of potential problems before they turn into incidents.
- Respond: The next stage is to respond as quickly as possible to the incident by escalating it to the appropriate person or team to find a resolution. It’s important to first understand the incident’s impact on the business and use that to prioritize accordingly. Is it an alert for a KPI that doesn’t much affect your bottom line? Anodot’s Business Impact Alerts tool, for example, uses historical data and user feedback to anticipate each incident’s monetary impact and assign a score. Most teams may not have this feature, but are dealing with some form of alerting system which requires consistent screening and prioritization so that critical incidents don’t fall through the cracks.
- Recover: Because incidents inevitably occur, the highest priority is to restore the service to normal operational levels. Once that’s complete and the customer is satisfied, that’s a good time to analyze and glean insights from the incident.
- Learn: To prevent the same issue from happening again, it’s important to establish the root cause and links between anomalies and events. In addition, each incident should be tied back directly to its business impact in order to prioritize future anomalies. Specifically, each incident should include a “post mortem” analysis, or a written record of the incident that highlights its impact, causes, initial actions taken to resolve it, and follow-up actions to prevent it from happening again. Since each incident may be associated with a different technical team, it’s recommended to nominate a single person familiar with the incident to perform post-mortem analysis and have them interview any relevant stakeholders to truly understand the business impact—for example, whether it is monetary or technical. After this analysis is complete, it should be presented to the team to ensure everyone knows their roles and responsibilities in preventing future incidents.
- Improve: Finally, each incident is an opportunity to improve the underlying infrastructure and the incident management process so that time-to-resolution is always within an acceptable time frame.
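The Respond stage above depends on screening alerts by business impact so that critical incidents are handled first. A minimal sketch of that idea is a priority queue keyed on an estimated cost per hour. The `TriageQueue` class, the impact estimates and the paging threshold are hypothetical and are not how Anodot or any other product scores alerts.

```python
import heapq
from itertools import count


class TriageQueue:
    """Hypothetical alert queue that surfaces the costliest alert first."""

    def __init__(self, page_threshold=1000.0):
        self._heap = []
        # Tie-breaker so alerts with equal impact come out in arrival order.
        self._order = count()
        # Assumed cutoff (cost per hour) above which on-call is paged.
        self.page_threshold = page_threshold

    def add(self, name, impact_per_hour):
        # heapq is a min-heap, so the impact is negated to pop the largest first.
        heapq.heappush(self._heap, (-impact_per_hour, next(self._order), name))

    def next_alert(self):
        """Pop the highest-impact alert and suggest how to escalate it."""
        neg_impact, _, name = heapq.heappop(self._heap)
        impact = -neg_impact
        action = "page on-call" if impact >= self.page_threshold else "ticket"
        return name, impact, action

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    queue = TriageQueue()
    queue.add("stale dashboard", 20.0)
    queue.add("checkout revenue drop", 5000.0)
    queue.add("login errors", 1200.0)
    while queue:
        print(queue.next_alert())
```

In practice the impact estimate is the hard part; the queue only ensures that, once an estimate exists, low-impact alerts cannot crowd out critical ones.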
A few essential capabilities enable automated, accelerated incident management: automated business monitoring, business impact measurement, real-time alerts, and root cause analysis.
Popular tools for incident management include Google’s open source GRR Rapid Response framework, AlienVault OSSIM, Datadog, Splunk, New Relic, Dynatrace and Anodot Autonomous Business Monitoring.
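To make the idea of automated monitoring with real-time alerts more concrete, here is a deliberately simple sketch that flags a metric value when it deviates from a rolling baseline by more than a set number of standard deviations. It is a basic z-score check, not the algorithm used by any of the tools named above; the class name, window size and threshold are assumptions for illustration.

```python
from collections import deque
from statistics import mean, stdev


class RollingAnomalyDetector:
    """Hypothetical z-score detector over a sliding window of recent values."""

    def __init__(self, window=30, threshold=3.0, min_points=5):
        self.values = deque(maxlen=window)  # recent "normal" observations
        self.threshold = threshold          # z-score above which we alert
        self.min_points = min_points        # baseline size needed before alerting

    def observe(self, value):
        """Return True if `value` looks anomalous against the recent window."""
        is_anomaly = False
        if len(self.values) >= self.min_points:
            mu = mean(self.values)
            sigma = stdev(self.values)
            # A flat baseline (sigma == 0) gives no scale, so nothing is flagged.
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            # Keep anomalous points out of the baseline so one spike
            # does not mask the next one.
            self.values.append(value)
        return is_anomaly


if __name__ == "__main__":
    detector = RollingAnomalyDetector()
    for reading in [100, 102, 98, 101, 99, 100, 103, 97, 50, 101]:
        if detector.observe(reading):
            print(f"ALERT: unusual value {reading}")
```

Excluding anomalies from the baseline is a design choice with a trade-off: a genuine, lasting shift in the metric will keep alerting until the detector is reset. Production systems typically also account for seasonality and trends, which this sketch ignores.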
Case Study: Incident Management in Finance
Let’s see incident management in action at Credit Karma, a U.S.-based company that offers online credit management tools and personal finance services.
Credit Karma’s main challenge was time to detection. Teams responsible for revenue, user experience, IT systems and ad campaigns were using various self-developed tools to monitor hundreds of thousands of business and technical metrics. The issue, however, was that it typically took at least 24 hours before they could detect incidents. For example, revenue for one specific webpage had decreased by 50 percent over the course of three days, and the incident was then prolonged for several more days by the need to perform root cause analysis manually.
The company switched to an automated AI-based solution that independently monitors various silos for anomalies and alerts the relevant team in real time. Credit Karma was able to cut mean time to detection from days to hours.
Summary: The Incident Management Lifecycle
Incident management is the process of detecting and resolving unexpected events or performance issues before they negatively impact the end-user. Organizations must have a general framework that empowers teams to detect, respond to and recover from each incident. Similarly, once the service is performing at its normal level, each incident represents an opportunity to learn and improve both the internal processes and the tools used to detect and resolve incidents.
In terms of technological requirements, the most essential parts of effective incident management are automated monitoring for faster time-to-detection, real-time alerts of potential incidents, and correlation analysis to achieve the fastest possible time-to-resolution.