Reliability vs. Resilience

In safety systems engineering (SSE), people talk about “safety type 1” and “safety type 2”. The first is about prevention and reliability; the second is about resilience. Type 1 has been around for a long time, while type 2 is fairly new.

Prevention will not save us

Software systems (and software companies) are complex, and for complex systems, type 1 safety (prevention) will not save us. There are three key reasons:

Reason 1a: It’s impossible to prevent all failures in a complex system. Many failure modes are unknown in advance, and you cannot write unit tests for error scenarios you don’t know about.

Reason 1b: The context around a system is never static. It is often assumed that all failure is introduced by an operator, but this is not always the case. For example, load changes over time, new users register, auto-scaling kicks in, third-party providers become unavailable, or our servers or our database run out of memory. A system is dynamic in many dimensions, and many things can happen.

Reason 2: As long as an operator is making changes to a system, mistakes will occasionally happen. They can’t be fully prevented. For example, every new deployment of some software runs the risk of breaking it. You can, however, reduce the likelihood of mistakes, and this is what type 1 has focused on.

Resilient systems save us from unprevented errors

Safety type 2 instead focuses on resilience. It does not entirely replace safety type 1 (there is still value in automated checks in CI/CD), but the insight is that such checks will not prevent all errors. Type 2 accepts that something will eventually break and tries to minimize the impact that the breakage has on the business. Safety investments must be balanced between type 1 and type 2. In my experience, most companies focus too much on type 1.

Generally, resilient companies handle unprevented errors much better. In a way, if a company is good at safety type 2, it doesn’t need to focus as much on prevention. For example, let’s say that a change to a software system is first rolled out to a random 0.1% of users, and that the change can be rolled back automatically within 60 seconds. If the change has an unprevented bug, the bug has almost no negative impact on the business.
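To make the example above a bit more concrete, here is a minimal sketch of the two decisions involved: which users see the change, and when to roll it back. Everything in it is an assumption made for illustration: the function names, the hash-based bucketing (a stable stand-in for random sampling), the 2% error threshold, and the minimum request count. A real setup would wire these into its own deployment tooling and metrics.

```python
import hashlib

# All values below are illustrative assumptions, not recommendations.
TOTAL_BUCKETS = 10_000
ROLLOUT_BUCKETS = 10      # 10 of 10,000 buckets = 0.1% of users
MAX_ERROR_RATE = 0.02     # roll back if more than 2% of requests fail
MIN_REQUESTS = 50         # wait for some traffic before judging the change


def in_rollout(user_id: str, buckets: int = ROLLOUT_BUCKETS) -> bool:
    """Return True if this user should get the new version.

    Hashing the user id spreads users roughly evenly over the buckets
    and keeps the decision stable, so a user does not flip between
    versions from one request to the next.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % TOTAL_BUCKETS < buckets


def should_roll_back(errors: int, requests: int) -> bool:
    """Decide whether the new version should be rolled back.

    Compares the observed error rate of the rollout group against the
    threshold, but only once enough requests have been seen.
    """
    if requests < MIN_REQUESTS:
        return False
    return errors / requests > MAX_ERROR_RATE
```

A monitoring loop could call `should_roll_back` every few seconds with the counters of the rollout group and trigger the rollback as soon as it returns `True`. Only the small rollout group is ever exposed to the bug, and only until the rollback completes.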

The type 1 to type 2 shift

The shift from type 1 to type 2 has many implications. Here are some of the shifts that I have seen:

Service levels: There is a shift from talking about system quality (availability, latency, etc.) as “the system is either up or down” to “the system availability is X%”.
The organization starts to distinguish between deploying a system and releasing a feature.
The rollout strategy for a new feature becomes a key part of the development process early on. This includes staggered rollouts, random sampling, and feature flags.
The time it takes to roll back a system becomes more important than preventing errors in the system. When an organization realizes that rolling back is itself very error-prone, it finds that rolling forward (deploying a fix as a new change) is much simpler. It then focuses on reducing the general time to deploy.
A stronger focus on the observability of user impact in production (service levels) over “if CI/CD passes, it works”.
A stronger focus on shipping things to production over “if it works on staging, my work is done”.
A stronger focus on getting smaller changes out in production as soon as possible (to know it’s working) over weeks of work to prevent all possible bugs.
An organization celebrates learning from mistakes and is blameless.
A stronger focus on DevOps as a culture: developers are more involved with the rollout and with how customers use the system.
Incident training is a natural part of daily work: practicing for things going bad, because they eventually will.
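The service-level shift in the list above, from “up or down” to “availability is X%”, comes down to simple arithmetic. The sketch below assumes availability is measured per request, which is only one of several possible definitions. The function names and the “error budget” framing (how much of the allowed failure has been used up) are illustrative additions.

```python
def availability(successful: int, total: int) -> float:
    """Share of requests served successfully, as a percentage."""
    if total == 0:
        return 100.0  # no traffic means nothing failed
    return 100.0 * successful / total


def error_budget_remaining(successful: int, total: int,
                           target_percent: float) -> float:
    """Fraction of the allowed failures that is still unused.

    1.0 means no failures so far, 0.0 means the target is exactly met,
    and a negative value means the target has been missed.
    """
    allowed_failures = total * (100.0 - target_percent) / 100.0
    actual_failures = total - successful
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no room for failure.
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


# Example: 1,000,000 requests, 500 of them failed, target 99.9%.
print(availability(999_500, 1_000_000))                  # about 99.95
print(error_budget_remaining(999_500, 1_000_000, 99.9))  # about 0.5
```

Framing quality this way turns “is it up?” into “how much failure can we still afford this period?”, which is a question a team can plan around.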

Shifting towards safety type 2 also increases agility: you are resilient to experiments with negative outcomes.
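One way to stay resilient to an experiment with a negative outcome is to deploy it behind a feature flag, so that releasing it (and un-releasing it) is a separate step from deploying the code. The sketch below is a deliberately tiny in-memory version. The `FeatureFlags` class, the `new_discount` flag, and the checkout function are all hypothetical; a real system would typically read flags from a shared store or a flag service.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store, for illustration only."""
    enabled: dict[str, bool] = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so deployed code stays dark
        # until someone explicitly releases it.
        return self.enabled.get(name, False)

    def release(self, name: str) -> None:
        self.enabled[name] = True

    def kill(self, name: str) -> None:
        self.enabled[name] = False


def checkout_total(prices_in_cents: list[int], flags: FeatureFlags) -> int:
    """Sum the cart, applying the experimental discount only if released."""
    total = sum(prices_in_cents)
    if flags.is_enabled("new_discount"):
        return total - total // 10  # experimental 10% discount
    return total


flags = FeatureFlags()
cart = [1000, 500]
print(checkout_total(cart, flags))  # deployed but not released: 1500
flags.release("new_discount")
print(checkout_total(cart, flags))  # released: 1350
flags.kill("new_discount")
print(checkout_total(cart, flags))  # experiment switched off: 1500
```

The new code path is in production the whole time. Turning the experiment off is a flag change, not a new deployment.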

Further reading