Shower thought: Why do we talk about MTTR (Mean Time To Resolution), when it’s a latency? Wouldn’t it make more sense to talk about P99TTR (99th percentile Time To Resolution) given that latencies usually have a long tail? 🤔 #MTTR #SRE #DevOps
— Jens Rantil (@JensRantil) March 10, 2021
Today I would like to talk about why Mean Time To Recovery (MTTR) is the wrong metric to look at.
For the past few years, many software engineers have been using the DORA metrics to track software delivery performance. One of the DORA metrics is “Time to Restore Service”, also known as “Mean Time To Recovery (MTTR)”. A couple of years ago Courtney Nash wrote “MTTR is a Misleading Metric—Now What?”, in which she argued that the MTTR concept is too simplistic. I could not agree more.
When I recently wrote Mean vs. Median, I was reminded of Courtney’s article:
[…] measures of central tendency like the mean, aren’t a good representation of positively-skewed data, in which most values are clustered around the left side of the distribution while the right tail of the distribution is longer and contains fewer values. The mean will be influenced by the spread of the data, and the inherent outliers.
In essence, she was saying that the mean is a fairly useless performance number for recovery times.
Just as software engineers use percentiles as a performance number for latencies, we should use percentiles when analyzing recovery times. A recovery time is just the latency of fixing something, usually measured in minutes or hours instead of milliseconds or seconds. We want to know whether the recovery time for 95% of all incidents is going down; the mean tells us nothing about that.
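To make this concrete, here is a minimal sketch in Python. The incident durations are made up for illustration, and `percentile` is a small hypothetical helper that uses the nearest-rank method, which is only one of several common percentile definitions.

```python
import statistics

# Hypothetical recovery times in minutes for 20 incidents (made-up data).
recovery_minutes = [12, 15, 18, 20, 22, 25, 27, 30, 32, 35,
                    38, 40, 45, 50, 55, 60, 75, 90, 240, 960]


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest value such that at least p% of the
    data is less than or equal to it.
    """
    ordered = sorted(values)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


mean = statistics.mean(recovery_minutes)
p50 = percentile(recovery_minutes, 50)
p95 = percentile(recovery_minutes, 95)

# How many incidents were resolved faster than the mean?
faster_than_mean = sum(1 for m in recovery_minutes if m < mean)

print(f"mean: {mean:.2f} min")
print(f"P50:  {p50} min")   # 35
print(f"P95:  {p95} min")   # 240
print(f"{faster_than_mean} of {len(recovery_minutes)} incidents beat the mean")
```

In this made-up data set, two long incidents pull the mean above 90 minutes even though half of the incidents were resolved within 35 minutes. The mean describes neither the typical incident nor the tail, while P50 and P95 describe each directly.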
So why is MTTR used in the first place, and not PTTR (Percentile of Time To Recover)? Probably because a mean is so much easier to calculate. DORA metrics are gathered from lots of companies, and percentiles are harder to compute and to combine across data sets.
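One way to read "percentiles are hard" is that they do not survive summarization. The sketch below is my own illustration, not a description of how DORA collects its data: the two companies and their numbers are made up, and `combine_means` is a hypothetical helper. It shows that a pooled mean can be rebuilt from per-company `(count, mean)` pairs, while a pooled median (P50) cannot be rebuilt from per-company medians.

```python
import statistics

# Hypothetical recovery times in minutes from two companies (made-up data).
company_a = [10, 12, 15, 20, 300]
company_b = [30, 35, 40, 45, 50, 55, 60]
pooled = company_a + company_b


def combine_means(summaries):
    """Merge (count, mean) summaries into the mean of the pooled data."""
    total = sum(count for count, _ in summaries)
    return sum(count * mean for count, mean in summaries) / total


# The mean survives summarization: two (count, mean) pairs are enough.
combined_mean = combine_means([
    (len(company_a), statistics.mean(company_a)),
    (len(company_b), statistics.mean(company_b)),
])
pooled_mean = statistics.mean(pooled)

# The median does not: a count-weighted average of the per-company
# medians is not the median of the pooled raw data.
weighted_medians = (
    len(company_a) * statistics.median(company_a)
    + len(company_b) * statistics.median(company_b)
) / len(pooled)
pooled_median = statistics.median(pooled)

print(f"combined mean {combined_mean:.1f} vs pooled mean {pooled_mean:.1f}")
print(f"weighted medians {weighted_medians} vs pooled median {pooled_median}")
```

Here the combined mean matches the pooled mean, but the weighted medians give 32.5 while the true pooled median is 37.5. Getting the percentile right requires the raw incident durations, or at least a histogram of them, from every participant.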