Why MTTR is a Misleading Metric (And What to Track Instead)
Many engineering teams have that dashboard, the one they've been staring at for months, watching MTTR stubbornly refuse to budge despite all their hard work. Leadership keeps asking why the number isn't improving. The team knows they're doing better work, but the metric tells a different story.
Here's a simple math problem that breaks many organizations.
You have 10 incidents. 9 resolve in 5 minutes each. 1 takes 6 hours (360 minutes).
Your MTTR says 40.5 minutes.
Do you think it tells a good story?
Most of the incidents were short, but your MTTR suggests everything takes 40+ minutes.
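Here's that arithmetic as a minimal Python sketch, in case you want to check it yourself:

```python
from statistics import mean, median

# 9 quick incidents at 5 minutes each, plus one 6-hour outlier
durations_minutes = [5] * 9 + [360]

print(f"MTTR (mean): {mean(durations_minutes):.1f} minutes")   # 40.5
print(f"Median:      {median(durations_minutes):.1f} minutes")  # 5.0
```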
The message is simple: Stop using MTTR as your north star metric. The math behind it rarely tells you anything meaningful.
Similar to the prevention paradox I recently wrote about, where successful resilience work can make itself appear unnecessary, MTTR creates its own illusion while hiding the real story of your system's health.
The Demoralization Effect
A few months ago, I spoke with a platform engineering team that had spent over two years dramatically improving their systems. They'd implemented comprehensive monitoring, automated recovery procedures, and streamlined their incident response. The team was proud of their work, and rightfully so.
But their MTTR dashboard told a different story. Despite all their improvements, the number stubbornly hovered around 85 minutes. Some months it even went up.
The team was demoralized because leadership questioned their efforts. "If you're really improving things, why isn't MTTR going down?"
When we dug deeper, the true story came to light. The team had become really good at detecting small issues early, problems that would previously have cascaded into major outages. They were catching these issues in minutes and resolving them quickly. But they were also tackling increasingly complex infrastructure problems that genuinely took hours to solve properly.
Their stubborn MTTR was actually masking significant progress: minimal customer-impacting outages in 18 months, a 90% reduction in alert fatigue, and a team that had transformed from a reactive firefighting approach to proactive system improvement.
MTTR not only failed to reflect their success; it actively undermined it.
The Math Simply Doesn't Work
MTTR has become the default way organizations measure incident response. Teams dutifully display their MTTR dashboards, track improvements over time, and use these numbers to justify investments, performing operational excellence for leadership rather than actually achieving it.
But there's a fundamental problem. Incident duration data follows a power-law distribution. Most incidents are resolved quickly: a container restart here, a cache flush there. But occasionally, you get the big ones: database corruptions, complex cascading failures, or novel security breaches that can take hours or days to resolve.
When you average that 5-minute container restart with a 6-hour database recovery, you get a number that describes neither incident.
As discussed in the VOID Report, incident duration data follows a positively-skewed distribution where "measures of central tendency like the mean, aren't a good representation of positively-skewed data, in which most values are clustered around the left side of the distribution while the right tail of the distribution is longer and contains fewer values."
The mean gets pulled toward the outliers, creating a metric that doesn't reflect the typical experience of either your users or your incident response teams. In essence, you're using a statistic designed for normal distributions on data that's anything but normal.
This is exactly why teams see their MTTR stay flat or even increase despite genuine improvements. As teams become better at detecting problems early, they catch more small issues that can be resolved quickly.
But as systems continue to evolve, teams also take on more complex problems that naturally take longer to solve properly. The occasional 6-hour incident dominates the average, making months of 5-minute fixes invisible.
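To make that concrete, here's an illustrative sketch with made-up numbers: a team whose typical incident used to take around 25 minutes gets much faster at the common case, takes on one genuinely complex problem, and watches its MTTR get worse anyway.

```python
from statistics import mean, median

# Hypothetical quarter BEFORE the improvements: problems cascade before
# anyone notices, so every incident takes a while to resolve.
before = [20, 25, 30, 22, 28]   # minutes

# Hypothetical quarter AFTER: early detection keeps most incidents tiny,
# but the team also tackles one genuinely complex 6-hour problem.
after = [5] * 9 + [360]         # minutes

print(f"MTTR before: {mean(before):.1f} min (median {median(before)} min)")
print(f"MTTR after:  {mean(after):.1f} min (median {median(after)} min)")
# MTTR before: 25.0 min (median 25 min)
# MTTR after:  40.5 min (median 5.0 min)
```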
Time Doesn't Equal Impact
But the statistical issues are just the beginning. MTTR misses the actual point.
A 30-minute incident affecting 100,000 users is fundamentally different from a 2-hour incident affecting 5 users. MTTR treats them identically.
Consider these real-world scenarios:
Live streaming platform: Video playback fails for 10 minutes during the final quarter of a playoff game. 50,000 concurrent viewers lose service at the most critical moment. Social media explodes with complaints. Potential subscription cancellations. Customer support is overwhelmed.
Versus: "My watchlist" feature breaks for 2 hours during off-peak. A few dozen users noticed. Minimal business impact, easily communicated via banner notification.
E-commerce site: Your checkout process crashes for 15 minutes during a Black Friday flash sale. Direct revenue loss of an estimated $500K. Abandoned carts. Frustrated customers are switching to competitors. Marketing spend wasted.
Versus: The "recommended items" widget fails for 3 hours on a Tuesday morning. Slight decrease in discovery metrics. No immediate revenue impact. Most users don't even notice.
In both cases, MTTR would suggest the longer incidents were "worse" than the shorter ones.
But which ones actually damaged the business?
Losing a non-critical service for 5 hours isn't the same as losing your most critical one for 10 minutes. MTTR can't distinguish between these scenarios.
The Human Reality
There is an even more fundamental challenge beyond the statistical issues: incidents are inherently human processes, and MTTR completely ignores human and organizational factors.
Real incident response involves complex dynamics that just can't be captured in a simple time measurement.
First, complex incidents often require multiple teams from different parts of the organization: database specialists, network engineers, security experts, and product managers. Each handoff introduces delays, communication gaps, and potential misalignment that MTTR treats as "inefficiency" rather than necessary collaborative work.
Additionally, during high-stress incidents, engineers frequently face new failure modes and must diagnose novel problems, often with incomplete information. The time spent carefully analyzing symptoms to avoid making things worse isn't captured meaningfully by MTTR.
Shift changes, escalation procedures, approval processes, and dependencies on third-party vendors all introduce delays that have nothing to do with technical competence but significantly impact resolution time.
Finally, thorough incident response often involves deliberately slowing down to understand the real problems, verify fixes, and prevent recurrence or cascading effects, especially for large-scale events. MTTR incentivizes speed over precaution and learning, potentially making systems less resilient in the long term.
These human factors introduce variability that makes MTTR misleading in practice, and the metric misses the very aspects of incident response that distinguish operational excellence.
When Good Metrics Go Bad
MTTR actively drives counterproductive behaviors when used as a performance metric. When teams are measured by average resolution time, they often end up optimizing for the metric rather than for actual resilience (a process called surrogation):
Prioritizing quick fixes over understanding the contributing factors (root causes)
Rushing to close tickets rather than implementing lasting solutions
Looking for someone responsible rather than addressing systemic issues
Skipping thorough post-incident verification so tickets close faster
Gaming the numbers by avoiding declaring incidents or manipulating start and stop times to improve the averages
These behaviors make organizations less resilient, not more, and ultimately create a culture focused on looking good on dashboards rather than building genuinely resilient systems.
The "Better Metric" Trap
I often see teams recognize MTTR's limitations and try alternatives, such as MTTM (Mean Time to Mitigate) or MTTD (Mean Time to Detect). The thinking goes: "If we measure time to mitigation instead of full resolution, we'll better capture when customer pain stops."
But this also misses the actual issue.
MTTM has exactly the same statistical problem as MTTR. You're still averaging highly skewed data. Whether you measure "time to resolve" or "time to mitigate," the math doesn't change. That same 6-hour outlier will dominate your MTTM just like it dominated MTTR.
The real issue is using averages at all on this type of messy data, not where we draw the finish line.
What to Track Instead
So, what should you measure instead? Great question! And like much of computer science, it depends!
It depends on what you're trying to achieve, but here are better alternatives:
The Dashboard That Actually Tells Your Story
Picture this: You're in a quarterly review. Instead of staring at an MTTR dashboard showing 45 minutes (making your team look bad), you pull up a different set of metrics:
"99.97% login success rate, only 12 users affected by incidents this quarter."
"Zero revenue-impacting outages in the last 6 months."
"95% of incidents resolved in under 8 minutes."
Suddenly, the conversation shifts from "why is your MTTR so high?" to "how did you achieve such results?"
A big part of resilience work is telling the story that work actually deserves.
Metrics shape the story about your team's work, and that story determines everything from budget approvals to career advancement to whether leadership views resilience investments as worthwhile. So get it right!
MTTR often tells a story of mediocrity and failure. Impact metrics and percentiles tell a story of excellence, learning, and genuine improvement.
You must tell the resilience stories in a way that leadership can understand and appreciate.
Focus on Customer Experience
Use Service Level Objectives (SLOs) that measure availability, latency, and error rates from the customer's point of view. Instead of asking "How long did it take to fix?" ask "What percentage of user requests succeeded?"
I worked with a team whose MTTR was consistently "terrible" at 95 minutes. Leadership was frustrated. However, their SLOs told a different story: 99.95% uptime, with the vast majority of their "incidents" being minor maintenance work that users never even noticed.
The team had been solving the right problems and preventing customer impact, but the MTTR made them look like they were failing.
Start with user-facing metrics, such as "login requests succeed 99.9% of the time." Focus on what users actually experience, not what your infrastructure is doing.
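As a rough sketch of what tracking that looks like (the request counts and the 99.9% target below are placeholders, not a recommendation):

```python
# Minimal SLO check, assuming you can count login attempts and failures
# over a measurement window (e.g., from your load balancer or metrics store).
SLO_TARGET = 0.999             # "login requests succeed 99.9% of the time"

total_requests = 4_200_000     # placeholder numbers for a 30-day window
failed_requests = 2_900

success_rate = 1 - failed_requests / total_requests
error_budget = (1 - SLO_TARGET) * total_requests   # failures we can tolerate
budget_remaining = error_budget - failed_requests

print(f"Success rate:     {success_rate:.4%}")              # 99.9310%
print(f"Error budget:     {error_budget:.0f} failed requests")
print(f"Budget remaining: {budget_remaining:.0f} failed requests")
```

Notice that no incident duration appears anywhere in that calculation; what matters is how many user requests failed.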
Impact-Focused Metrics
Impact-focused metrics measure the actual effect of incidents on users, business operations, and system performance:
Number of users affected
Duration of user impact
Revenue loss or business disruption
Service Level Objective (SLO) violations
Error rates and availability percentages
Customer satisfaction scores during and after incidents
Here's a real example: An e-commerce team had two incidents in the same week. Their MTTR dashboard showed:
Incident A: 3 hours to resolve
Incident B: 20 minutes to resolve
Leadership asked why the team seemed more concerned about Incident B. But the impact metrics told the real story:
Incident A: Affected 5 internal users, $0 revenue impact
Incident B: Affected 50,000 customers during peak hours, estimated $200K revenue loss
MTTR suggested Incident A was "9x worse." Impact metrics revealed the truth. Again, stories matter!
Categorize incidents by business impact—Critical (revenue-affecting, customer-facing), High (internal productivity loss), Medium (degraded experience), Low (no user impact). Track the count and duration of each category separately.
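Here's a minimal sketch of that kind of tracking, assuming each incident record already carries an impact category, the duration of user impact, and a rough count of affected users (the field names and numbers are made up for illustration):

```python
from collections import defaultdict

# Hypothetical incident records; in practice these would come from your
# incident tracker, with the category assigned during the review.
incidents = [
    {"category": "Critical", "user_impact_minutes": 15,  "users_affected": 50_000},
    {"category": "Low",      "user_impact_minutes": 180, "users_affected": 5},
    {"category": "Medium",   "user_impact_minutes": 45,  "users_affected": 1_200},
    {"category": "Critical", "user_impact_minutes": 10,  "users_affected": 8_000},
]

totals = defaultdict(lambda: {"count": 0, "impact_minutes": 0, "users": 0})
for incident in incidents:
    bucket = totals[incident["category"]]
    bucket["count"] += 1
    bucket["impact_minutes"] += incident["user_impact_minutes"]
    bucket["users"] += incident["users_affected"]

for category, bucket in totals.items():
    print(f"{category:>8}: {bucket['count']} incidents, "
          f"{bucket['impact_minutes']} minutes of user impact, "
          f"{bucket['users']:,} users affected")
```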
Impact-focused metrics prioritize what matters most: they measure how many users are affected and for how long, so teams focus on reducing real-world pain rather than just closing tickets quickly. This aligns engineering work with actual business risk and value. They also expose hidden risks: two incidents with identical durations can have vastly different impacts, and impact metrics reveal which incidents truly hurt the business and deserve deeper attention, while surfacing the chronic issues and fragile components that consistently cause high user impact.
Unlike MTTR, which incentivizes quick fixes, impact metrics encourage healthy behaviors by encouraging teams to address underlying causes and prevent recurrence, thereby improving long-term resilience.
Use Percentiles, Not Averages
Use percentiles (P95, P99) instead of averages to understand your real incident distribution. These metrics are less influenced by outliers and give you a clearer picture of what’s actually happening.
Let's return to our original 10 incidents example to see how percentiles tell a completely different story than MTTR: 9 incidents at 5 minutes, 1 incident at 360 minutes (6 hours)
P50 (median): 5 minutes - This tells you that half of your incidents resolve in 5 minutes or less
P90: 5 minutes - This tells you that 90% of your incidents resolve in 5 minutes or less
P95: 5 minutes - This tells you that 95% of your incidents resolve in 5 minutes or less
P99: 360 minutes - This tells you about your worst-case scenario, the 1% of incidents that take much longer
MTTR (mean): 40.5 minutes - This tells you... what exactly?
What does that tell us?
The percentiles tell you that your incident response is actually excellent for the vast majority of cases. 95% of your incidents resolve in just 5 minutes! The P99 shows you have an occasional complex incident that takes 6 hours, which is important to know and address separately.
MTTR, however, suggests that a "typical" incident takes over 40 minutes, which is completely false. No incident in your dataset actually took 40 minutes. It's a mathematical artifact that doesn't represent anyone's actual experience.
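With only ten data points, the exact value of a high percentile depends on how your tooling interpolates, so here's a minimal NumPy sketch that sticks to the unambiguous numbers for this example: the mean, the median, the share of incidents resolved within five minutes, and the worst case.

```python
import numpy as np

durations = np.array([5] * 9 + [360])   # minutes

print(f"Mean (MTTR):               {durations.mean():.1f} min")              # 40.5
print(f"P50 (median):              {np.percentile(durations, 50):.0f} min")  # 5
print(f"Resolved within 5 minutes: {(durations <= 5).mean():.0%}")           # 90%
print(f"Worst case:                {durations.max()} min")                   # 360
```

On a real incident history with hundreds of data points, np.percentile(durations, [50, 90, 95, 99]) gives you the full distribution in one line.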
Why this matters for decision-making:
P50 and P90 help you understand your standard operational capability
P95 and P99 help you identify outliers that need special attention
Trends in percentiles show whether you're improving typical performance (P50/P90) or getting better at handling complex incidents (P95/P99)
P99 resolution time tells you what your most challenging incidents look like. P50 tells you about your typical response. Both are more useful than a skewed average that represents nobody's actual experience.
When you track percentiles over time, you might see improvements in both your standard incident response (lower P50/P90) and your complex incident handling (lower P95/P99), insights that MTTR completely obscures.
Noticed the "might"?
Here's the thing: even percentiles won't give you a simple, linear story of improvement. Modern systems are complex, distributed, and constantly evolving. Features are deployed continuously, infrastructure updates are regular, and teams are constantly changing.
Expecting any single metric to capture this complexity and show steady improvement is like wrapping yourself in a warm blanket during a snowstorm; it feels comforting, but it doesn't actually change the weather outside.
Here's the practical advantage of percentiles: they're a relatively good transitional metric if your organization is currently committed to MTTR. Because percentiles tell such a dramatically different and more accurate story than MTTR, leadership is unlikely to reject them. When you show that P50 is 5 minutes while MTTR claims 40+ minutes, the mathematical problem becomes obvious. But more importantly, percentiles don't threaten existing processes because they are fairly similar to MTTR. This makes them perfect for organizations that want to wrap themselves in a warm blanket but also tell a better story.
To get more value out of percentiles, categorize your incidents into classes: deployment issues, bugs, regressions, testing issues, infrastructure failures, operational issues, security incidents, and third-party failures.
Each category has different root causes, recovery patterns, and prevention strategies. Averaging them together hides the specific improvements needed for each type.
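A rough sketch of that per-category breakdown, assuming each incident record carries a category label and a duration (pandas is used purely for convenience here, and the data is invented):

```python
import pandas as pd

# Hypothetical incident history; in practice, export this from your tracker.
incidents = pd.DataFrame({
    "category": ["deployment", "deployment", "infrastructure", "third-party",
                 "deployment", "infrastructure", "security", "third-party"],
    "duration_minutes": [6, 4, 240, 35, 8, 180, 90, 25],
})

# Per-category P50 and P90 instead of one global average that hides the differences.
summary = (
    incidents.groupby("category")["duration_minutes"]
    .quantile([0.5, 0.9])
    .unstack()
    .rename(columns={0.5: "P50", 0.9: "P90"})
)
print(summary)
```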
What the Numbers Can't Tell You
Remember that platform engineering team from the beginning? The one with the stubborn 85-minute MTTR despite two years of improvements? When we dug deeper into their incidents, we discovered something very interesting.
Their "worst" incident, a 6-hour database corruption that dominated their MTTR, had actually taught them more about system resilience than dozens of quick fixes combined. The team had to coordinate across five different groups, improvise solutions when their runbooks failed, and ultimately redesign their backup strategy. Six months later, they prevented three similar issues from happening at all.
But their MTTR dashboard captured none of this. It just recorded "6 hours" and moved on.
This is what resilience engineering teaches us: the most valuable insights from incidents aren't about metrics, they're about adaptation and learning.
Here are some of the important and useful lessons that they actually learned:
Their mental model of how database replication worked was completely wrong. The incident revealed assumptions they'd held for years about their architecture that turned out to be false.
That team discovered that their monitoring was blind to a specific type of database issue they didn't even know existed.
When their standard recovery procedures failed, the team had to improvise. The database specialist was on vacation, the network team was handling a separate issue, and the junior engineer on-call had limited experience. Despite all that, they succeeded. This taught them something crucial: their runbooks were incomplete, their staffing assumptions were wrong, and their "junior" engineers were more capable than anyone realized.
The incident required five teams, seven approval processes, and coordination across three time zones. It showed them exactly where their architecture was too tightly coupled and where their processes created unnecessary bottlenecks.
Each of these stories contains insights that make the organization more resilient. But MTTR reduces all of this richness to a single number that obscures the very learning that prevents future incidents.
What Resilient Organizations Actually Measure
Organizations that really care about resilience don't obsess over incident duration. They track how well they learn from each incident and how psychologically safe teams feel when reporting problems:
Do we detect similar problems faster each time?
Are teams getting better at improvising when standard procedures fail?
What surprises us about our systems, and how do we turn those surprises into knowledge?
Are we reducing coordination overhead through better architecture and clearer ownership?
Which recovery strategies actually work under pressure?
How quickly do insights from one incident improve our response to future ones?
Going back to that original team: when leadership asked "If you're really improving things, why isn't MTTR going down?" they were asking the wrong question.
The right question was: "Are you building systems that learn, adapt, and get stronger after each failure?"
The answer, hidden beneath that stubborn MTTR, was absolutely yes.
Their 90% reduction in alert fatigue meant they could focus on real problems. Their early detection systems meant small issues stayed small. Their systematic approach to complex incidents meant they were building institutional knowledge that would prevent future outages.
None of this showed up in MTTR. All of it showed up in their actual resilience.
The goal isn't to reduce incident response to an elusive average resolution time. The goal is to build systems and teams that learn from every incident, adapt when plans fail, and turn today's surprises into tomorrow's preventive measures.
That's the story great engineering teams should be telling. And it's the story that MTTR will never capture.
References
I'm not the first to question the usefulness of MTTR. People have been highlighting these problems for years:
John Allspaw argued in 2018 that shallow incident data like MTTR "generates very little insight" because incidents are "dynamic events with people making decisions under time pressure" that can't be captured in simple averages.
The VOID Report confirmed these concerns empirically, finding that "measures of central tendency like the mean aren't a good representation of positively-skewed data."
Štěpán Davidovič's "Incident Metrics in SRE: Critically Evaluating MTTR and Friends" used Monte Carlo simulations to show that reliable calculations of incident duration improvements aren't possible with MTTR.
Lorin Hochstein has demonstrated through statistical analysis that when incident durations follow power-law distributions, "observed MTTR trends convey no useful information at all." His work with power-law-distributed data shows how sample means become unreliable indicators of system performance.
What I hope to contribute here is a practical guide for the many engineering teams still using MTTR today, helping them understand not just why these metrics are misleading, but what they can implement instead to actually improve their resilience.
Allspaw, J. (2018). "Moving Past Shallow Incident Data." Adaptive Capacity Labs.
Nash, C. et al. (2021-24). "The VOID Report."
Davidovič, Š. (2021). "Incident Metrics in SRE: Critically Evaluating MTTR and Friends."
Hochstein, L. (2024). "MTTR: When sample means and power laws combine, trouble follows." Surfing Complexity.