An Introduction to Incident Management


Incident management is one of the most critical IT support processes that an IT organization needs to get right. Service outages can be costly to the business and IT teams need an efficient way to respond to and resolve these issues quickly. According to a 2015 HDI study, incident management remains a top priority for 65% of IT teams around the world.

Incident management 101

Here’s how the IT Infrastructure Library (ITIL) defines an incident: “An unplanned interruption that causes, may cause, or reduces the quality of an IT Service.” Incident Management is an IT service management (ITSM) process that aims to restore normal service operation as fast as possible, minimizing the impact on the business and end user. A business application going down is an incident. The printer not working is also an incident.

“A crawling-but-not-yet-dead web server can be an incident, too. It’s running slowly, and interfering with productivity. Worse yet, it poses the even greater risk of complete failure.” – Nick Wright, Service Operations Manager at Atlassian

Incident Management vs. Problem Management: A problem is the not-yet-known root cause behind one or more incidents. In the incidents above, where the printer is down and the web server is crawling, a misconfigured router could be the underlying problem behind both. Incident management focuses on short-term fixes and on doing whatever is necessary to restore the service; it does not include completing a root cause analysis to identify why an incident occurred. We’ll talk about managing recurring incidents (underlying problems) in the problem management blog.

Mean time to resolution (MTTR)

Mean time to resolution (MTTR) is a service-level metric that measures the average elapsed time from when an incident is reported until it is resolved. It is typically measured in business hours. MTTR is one of the key drivers of customer satisfaction, as users may be either completely down or forced to use workarounds until their incidents have been resolved.

Consequently, improving major incident response is a top goal for IT teams, specifically finding ways to lower MTTR and streamline the search for the root cause to prevent future outages. The diagram below outlines what’s included in MTTR. A Forrester study found that most of that time, about 70%, is spent in the Investigation and Diagnosis phase, because IT teams find it difficult to collaborate and share the insights needed to reach a resolution quickly.
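
To make the MTTR definition above concrete, here is a minimal Python sketch that averages the time between "reported" and "resolved" across a handful of incidents. The function name, the tuple layout, and the sample timestamps are all hypothetical, and the sketch uses plain wall-clock time rather than business hours, so treat it as an illustration of the arithmetic and not as how any particular service desk tool calculates the metric.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - reported) over incidents that have been resolved.

    `incidents` is a list of (reported, resolved) datetime pairs; `resolved`
    is None while an incident is still open. This layout is an assumption
    made for the example.
    """
    durations = [
        resolved - reported
        for reported, resolved in incidents
        if resolved is not None  # open incidents have no resolution time yet
    ]
    if not durations:
        raise ValueError("no resolved incidents to average")
    # Start the sum from a zero timedelta so the result is a timedelta.
    return sum(durations, timedelta()) / len(durations)


# Made-up sample data: two resolved incidents and one still open.
incidents = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 11, 0)),   # 2 hours
    (datetime(2024, 1, 9, 13, 0), datetime(2024, 1, 9, 17, 0)),  # 4 hours
    (datetime(2024, 1, 10, 10, 0), None),                        # still open
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 3:00:00
```

Open incidents are left out of the average here; a team could just as reasonably count them up to the current time, which would raise the reported figure.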

Incident management priorities

So what are the key areas and priorities for incident management for IT teams?

Respond effectively so they can recover fast, which starts with defining who is accountable for the incident.
Communicate clearly with their stakeholders: service owners, others within the organization, and ultimately their customers.
Collaborate effectively to solve the issue faster as a team and remove barriers that prevent them from sharing and collaborating.
Continuously improve to learn from these outages and apply these lessons to improve a service or even refine the process in the future.

StatusPage: While every team uses different solutions for communication, we recommend a dedicated tool like StatusPage for incident communication. This provides a central source of truth for the current status of an incident as well as a record of past incident communication. Stakeholders can customize how they want to receive StatusPage updates, whether over email, text message, or a ChatOps tool like HipChat.

Incident management process

An incident management process helps service desks investigate, record, and resolve service interruptions or outages. An ITIL incident management workflow aims to reduce downtime and negative impacts. The IT Service Desk template comes with an incident management workflow, which ensures that you log, diagnose, and resolve incidents. We recommend you start with this workflow and adapt it to your business needs. When managed well, incident records can reveal missing service requirements, potential improvements, and future training needs for team members.

The ITIL incident management process, in brief:

Service end users, monitoring systems, or internal IT members report interruptions.
The service desk describes and logs the incident. They link together all reports related to the service interruption.
The service desk records the date and time, reporter name, and a unique ID for the incident. JIRA Service Desk does this automatically.
A service desk agent labels the incidents with appropriate categorization. The team uses these categories during post-incident reviews and for reporting.
A service desk agent prioritizes the incident based on impact and urgency.
The team diagnoses the incident, the services affected, and possible solutions. Agents communicate with incident reporters to help complete this diagnosis.
If needed, the service desk team escalates the incident to second-line support representatives. These are the people who work regularly on the affected systems.
The service desk resolves the service interruption and verifies that the fix is successful. The resolution is fully documented for future reference.
The service desk closes the incident.
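
The steps above can be sketched as a small Python model: an incident record that gets an ID and timestamp when logged, a priority derived from impact and urgency, and a status that only moves along allowed transitions. Everything here is an assumption made for illustration: the state names, the 1 to 3 scales, the `impact + urgency - 1` mapping (one commonly used priority matrix, not something the process above prescribes), and the class and function names. It is not how JIRA Service Desk implements its workflow.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count
from typing import Optional

# Hypothetical workflow: each status maps to the statuses it may move to.
# Escalation is optional, so diagnosis can go straight to "resolved".
TRANSITIONS = {
    "logged": {"categorized"},
    "categorized": {"prioritized"},
    "prioritized": {"diagnosing"},
    "diagnosing": {"escalated", "resolved"},
    "escalated": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),
}

_next_id = count(1)  # simple in-memory ID generator for the example


def compute_priority(impact: int, urgency: int) -> int:
    """Map impact and urgency (1 = high, 3 = low) to priority 1 (highest) to 5."""
    if impact not in (1, 2, 3) or urgency not in (1, 2, 3):
        raise ValueError("impact and urgency must be 1, 2, or 3")
    return impact + urgency - 1


@dataclass
class Incident:
    reporter: str
    summary: str
    # Recorded automatically at logging time: unique ID and timestamp.
    id: int = field(default_factory=lambda: next(_next_id))
    reported_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    status: str = "logged"
    category: Optional[str] = None
    priority: Optional[int] = None

    def advance(self, new_status: str) -> None:
        """Move to new_status, rejecting transitions the workflow does not allow."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"cannot move from {self.status!r} to {new_status!r}")
        self.status = new_status


# Walk one made-up incident through the whole process, including escalation.
incident = Incident(reporter="alice", summary="Web server responding slowly")
incident.category = "web"
incident.advance("categorized")
incident.priority = compute_priority(impact=1, urgency=2)
incident.advance("prioritized")
for step in ("diagnosing", "escalated", "resolved", "closed"):
    incident.advance(step)

print(incident.priority, incident.status)  # 2 closed
```

Keeping the allowed transitions in one table makes it straightforward to adapt the workflow to your own business needs, for example by adding a reopen path from "resolved" back to "diagnosing".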