Incident process?

Many companies have an incident process. An incident process (often called an incident management process) is a defined set of steps an organization follows to detect, respond to, manage, resolve, and learn from incidents that disrupt normal operations or pose a risk.

flowchart LR
  A[Incident detected] --> B
  B[Incident declared] --> C
  C[Mitigation in place] --> D
  D[Incident resolved] --> E
  E[Incident review] --> F
  F[Corrective / Preventive Actions] --> G[Incident closed]
A common incident process.
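
To make the flow concrete, here is a minimal Python sketch of the process as a strictly linear state machine. The stage labels come from the diagram; the `Stage`, `ORDER`, and `Incident` names are my own illustration and not part of any standard or tool. Real processes often allow loops, for example a mitigation that fails and has to be redone, which this sketch deliberately leaves out.

```python
from enum import Enum


class Stage(Enum):
    # Labels copied from the flowchart; the member names are illustrative.
    DETECTED = "Incident detected"
    DECLARED = "Incident declared"
    MITIGATED = "Mitigation in place"
    RESOLVED = "Incident resolved"
    REVIEWED = "Incident review"
    ACTIONS_DONE = "Corrective / Preventive Actions"
    CLOSED = "Incident closed"


# Enum members iterate in definition order, which matches the diagram.
ORDER = list(Stage)


class Incident:
    """Hypothetical incident record that can only move forward one stage at a time."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.stage = Stage.DETECTED
        self.history = [Stage.DETECTED]

    def advance(self) -> Stage:
        """Move to the next stage; refuse to go past 'Incident closed'."""
        if self.stage is Stage.CLOSED:
            raise ValueError("incident is already closed")
        self.stage = ORDER[ORDER.index(self.stage) + 1]
        self.history.append(self.stage)
        return self.stage


incident = Incident("Checkout latency spike")
while incident.stage is not Stage.CLOSED:
    incident.advance()

# Print every stage the incident passed through, in order.
for stage in incident.history:
    print(stage.value)
```

Keeping the history around is a design choice: it gives the later review a timeline to work from.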

The exact meaning depends a bit on the field, but the core idea is the same: handle unexpected problems in a structured, repeatable way.

A key part of an incident process is learning from incidents. It involves things like:

  1. Bringing everyone involved into a room to discuss what happened and the lessons learned, and to come up with candidate action items that reduce the likelihood of something similar happening again (a postmortem); and
  2. Implementing the above action items.

flowchart LR
  A[Incident detected] --> B
  B[Incident declared] --> C
  C[Mitigation in place] --> D
  D[Incident resolved] --> E
  subgraph Learnings
    E[Postmortem] --> F
  end
  F[Corrective / Preventive Actions] --> G[Incident closed]
Learning from incidents.

Incident?

A question that commonly comes up during my incident trainings is “But what is an incident, anyway? How do we define it?” Knowing what an incident is, and what it is not, is crucial. It directly affects when to declare an incident and when not to.

Yet defining what an incident is can be surprisingly hard. Many definitions seem to miss an important part. For example, ChatGPT tells me:

“An incident is an unplanned event that:

  • Disrupts services or operations
  • Reduces quality or performance
  • Poses a security, safety, or compliance risk”

Even this definition misses the crucial aspect of learning from “near misses”, the times when things almost went wrong.

An incident definition needs to be

  • Easy to remember, so that anyone can recall it; and
  • Actionable, so that anyone quickly knows when to declare an incident.1

So far, my favourite definition is

An incident is an event in which the organisation has an opportunity to learn something.

or phrased somewhat differently

Raise an incident when you think a postmortem would be useful for what you are seeing.

These definitions are outcome-focused. They assume that the most important outcome of an incident is the learning: preventing the same thing from happening again.
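
To illustrate the difference, here is a small Python sketch that compares an impact-based checklist (like the ChatGPT definition above) with the learning-based rule. Everything in it is hypothetical: the `Event` fields, the two function names, and the example events are mine. In particular, `postmortem_would_help` stands for a human judgment call and is not something you can compute.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical description of something that happened."""
    description: str
    disrupted_service: bool = False
    degraded_quality: bool = False
    posed_risk: bool = False
    # A judgment made by whoever observed the event, not a measured value.
    postmortem_would_help: bool = False


def declare_by_impact(event: Event) -> bool:
    """Impact-based rule: declare only if one of the three criteria holds."""
    return event.disrupted_service or event.degraded_quality or event.posed_risk


def declare_by_learning(event: Event) -> bool:
    """Learning-based rule: declare whenever a postmortem would be useful."""
    return event.postmortem_would_help


# Assumed near miss: nothing customer-facing happened, but there is a lot to learn.
near_miss = Event(
    description="Bad release caught by a canary check before reaching customers",
    postmortem_would_help=True,
)

# Assumed outage: both rules agree here.
outage = Event(
    description="Checkout unavailable for 20 minutes",
    disrupted_service=True,
    postmortem_would_help=True,
)

for event in (near_miss, outage):
    print(event.description, declare_by_impact(event), declare_by_learning(event))
```

Under these assumptions the near miss is skipped by the impact-based rule and picked up by the learning-based one, while the outage is declared under both.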

  1. In some of the organisations I have worked in, anyone could declare an incident.