The Prevention Paradox: Why Successful Resilience Work Becomes Its Own Enemy
I recently spoke with a Staff Engineer whose team was being "rightsized" (I'm keeping details vague to protect their privacy). Five years earlier, after a catastrophic Black Friday outage, leadership had given them carte blanche to build world-class resilience. They hired the best engineers money could buy, put a comprehensive incident learning process in place, implemented operational readiness reviews, automated gradual deployments with continuous testing, built a chaos engineering program, and created feedback loops that institutionalized resilience thinking across the engineering organization. Name it, and they were either doing it or planning to.
No major outages in almost two years. Some smaller ones, of course, but nothing catastrophic. Customer satisfaction was high. Resilience that made many envious. More importantly, the organization was healthy: no blame culture, and an eagerness to learn and do whatever it could to limit customer-facing failures.
It worked great. Everyone seemed happy. They even spoke publicly about their practices and published them on their own engineering blog.
Then came a quarterly business review.
"What exactly does your team do? I see a lot of salary costs, but what are the deliverables? Can you quantify the ROI?" asked the newly appointed VP of Engineering.
The team had spent the better part of the past five years preventing disasters and building operational and organizational resilience capabilities, not gathering evidence or preparing PowerPoints. They scrambled to explain their value and didn't have the data to back up their claims. If you are not prepared to answer the VP's questions, "We prevented failures" and "we learned a lot" don't cut it. They simply don't translate to budget spreadsheets.
"Nothing's breaking right now. We need to refocus resources on features that drive revenue. We can revisit this stuff later if we need to," said the VP.
Six months later, they were down to only two engineers, and just a few months after that, the bigger and longer outages returned. Finger-pointing started again. People focused on themselves, saving their own jobs and trying to show the VP how cost-efficient they were. The organization was already losing its operational resilience culture.
This is the prevention paradox, and it has played out countless times throughout history. The Y2K bug offers perhaps the most famous example. After years of preparation and billions in investment, January 1, 2000, passed without major incident. Rather than celebrating this success, many dismissed Y2K concerns as overblown hysteria precisely because the preventive measures worked as intended. The very absence of catastrophe made it difficult to justify the resources that had been spent on prevention.
Why does this happen?
The prevention paradox cycle happens because our brains struggle to value "non-events": things that didn't happen. We're wired to respond to and remember actual events and visible outcomes. When a disaster is prevented, there's no dramatic story to hold onto or share, so no one hears about it.
After a non-event, people conclude:
"See? Nothing happened, so clearly it wasn't a real problem."
Rather than recognizing:
"Nothing happened because we took appropriate precautions."
Paradoxically, the more successful the resilience efforts, the more that effort appears unnecessary in retrospect.
Organizations tend to focus on systems that are currently working while ignoring the preventive work that keeps them working. A database that hasn't failed in two years seems "naturally reliable" rather than "well-maintained."
When systems work well, we often attribute it to good initial design or "stable technology." When they fail, we blame the incident on bad luck or external factors.
The immediate cost of prevention work is visible and concrete. The future cost of potential failures is abstract and uncertain.
And leadership isn't immune to it. They experience the same cognitive bias as everyone else: if nothing bad is happening, maybe nothing bad was going to happen anyway. They see the salary of the chaos engineering team; they don't see the €2M outage that never happened.
The prevention paradox is the most common, predictable, yet devastating cycle I see in software organizations, and it affects even the most "advanced" organizations out there. Literally NO ONE is immune to it.
The Resilience Amnesia Cycle
Here's how the prevention paradox typically takes hold in an organization:
1. The Crisis
A major outage hits. Revenue lost. Customers angry. Executives embarrassed. "This can never happen again!"
2. The Investment
Leadership opens the checkbook. "Hire the best SREs and engineers. Build comprehensive incident learning processes. Implement operational excellence. Make it resilient. Whatever it takes!"
3. The Success
The team delivers. They build robust technical systems, adopt chaos engineering, and develop advanced observability. They establish incident response processes to learn from every incident. They implement Operational Readiness Reviews (ORR) and Continuous Integration and Continuous Deployment (CI/CD) practices that catch issues before deployment. They create feedback loops between development and operations that institutionalize resilience thinking across the entire engineering organization.
Systems become resilient. Large outages become rare. Customer satisfaction improves. The organization develops the capability to anticipate problems and adapt quickly to changing conditions.
4. The Return-On-Investment (ROI) Questions
New leadership. New budget cycle. Different priorities. "What does the team actually do? Can we quantify their impact? Are we over-invested here?"
5. The Cuts
"Nothing's broken, so clearly we don't need this level of investment. Let's move some people to features. We can always scale back up if needed."
6. The Return
Outages start happening again. The organization has lost its learning capabilities. Teams make the same mistakes repeatedly. "How did we get here? We need to invest in resilience!"
And the cycle repeats.
The team that prevented the crisis gets disbanded. The team that responds to the new crisis gets celebrated as heroes.
This is what I call the Resilience Amnesia Cycle. It is a predictable pattern where organizations systematically forget why they invested in prevention, precisely because that prevention worked.
Why Teams Struggle to Justify Their Existence
The truth is that most resilience-focused teams (reliability, SRE, Ops, and so on) are totally unprepared for budget scrutiny because they were never asked to justify themselves initially. They were hired in crisis mode with a simple mandate: "Fix this."
They optimized for technical and organizational excellence, not business justification.
When the inevitable budget questions come, they face several challenges:
1. Success Is Invisible
"We prevented X number of potential outages" and "we improved organizational learning velocity" doesn't hit the same as "We shipped 12 new features." Prevention work creates an absence of problems and improved adaptive capacity, which are inherently hard to measure and communicate.
2. No Business Metrics Framework
Teams might track technical metrics, but they rarely translate these into business impact. They can point to improved velocity and better preparedness across incidents and operational reviews, but they can't answer "What's the ROI?" because they never built a framework to calculate it, or collected the data needed to even start.
3. Different Languages
Engineers speak in availability percentages, bug fixes, number of deployments and rollbacks, and learning cycles from incidents. Finance speaks in revenue impact and cost per outcome. Neither group is fluent in the other's language.
4. Temporal Mismatch
Prevention work pays off over long time horizons. Budget cycles demand quarterly justification. The engineer who spent six months building chaos experiments and incident learning processes that will prevent next year's outage struggles to show this quarter's value.
5. Attribution Challenges
When systems are resilient, is it because of good initial architecture? Tested dependencies? Or the daily work of the teams building organizational resilience? It's genuinely hard to know, and teams rarely build systems to prove their contribution.
The Leadership Perspective
To be fair to leadership, their questions aren't completely unreasonable. They're managing competing priorities with limited resources. From their perspective:
The resilience team was expensive to build and maintain
Current systems appear stable without obvious ongoing issues
Feature development has clear, measurable business impact
Market pressure demands shipping new capabilities
The "We can always rebuild the team if we need it" is an appealing justification
The problem isn't that leadership doesn't value resilience; in my experience, they always do. It's that, unfortunately, successful resilience work makes itself appear unnecessary.
The Real Cost of Resilience Decay
What leadership (and, to some extent, everyone in an organization) typically underestimates is the lag time and the compound nature of resilience degradation. The decay follows a predictable pattern that unfolds slowly, until it's too late to realize what has happened.
In the first few months after the cuts, everything appears fine. Systems continue running on the momentum of previous investments while technical debt accumulates slowly in the background. Incident response processes start being skipped and reviews become optional, but there's no immediate pain to signal the danger. Small issues begin appearing around month six: individual incidents that seem unrelated, response times that gradually increase, no single smoking gun to point to. More critically, the organization stops learning from incidents effectively, losing the institutional knowledge that once made it resilient.
The degradation accelerates dramatically as problems compound. What used to be small contained failures start cascading across systems. Teams spend more time firefighting and less time building features, and fingers start pointing at people rather than examining systemic issues. The organization makes the same mistakes repeatedly because institutional learning has degraded, the very capability that once prevented these failures.
By 18-24 months, major outages return with a vengeance. Leadership now faces the same crisis that triggered their original resilience investment, but with 18 months of accumulated technical debt and lost organizational learning capability stacked on top.
Finally comes the frantic rebuilding phase, but now they're rebuilding resilience capabilities while simultaneously managing active fires, a much harder and exponentially more expensive proposition than simply maintaining prevention capabilities would have been.
Read our blog post, "The Quiet Erosion", for more details on how an organization slowly drifts into failure.
Disclaimer: The degradation timeline is based on my experience working in the software industry, where I've observed this pattern repeatedly across different company sizes and industries. While I don’t have the data to quantify this timeline accurately, industry practitioners consistently report similar degradation patterns. The timeline varies based on factors like system complexity, organizational size, and the depth of original resilience investments. Still, the general pattern of slow-then-sudden degradation is a common experience across the software industry. Organizations with more mature practices may see slower degradation, while those with less institutional knowledge may experience faster decline.
The Cost of Resilience Cuts: How organizational resilience degrades over time after prevention investments are cut
Breaking The Cycle: Making Prevention Visible
The key to breaking the Resilience Amnesia Cycle is making invisible work visible and translating technical prevention into business value. This requires both concrete measurement frameworks and deep cultural change: tactical metrics without culture change won't stick, and culture change without metrics won't convince leadership.
Measurement Frameworks
1. Calculate Downtime Cost and ROI
In 2024, Information Technology Intelligence Consulting (ITIC) estimated that 90 percent of enterprises face costs exceeding $300,000 per hour of downtime, with 41 percent reporting costs of $1 million to over $5 million per hour.
Downtime costs for smaller businesses range from approximately $137 to $427 per minute, and for larger enterprises, they can reach $16,000 per minute.
The industry average cost of downtime is estimated at about $9,000 per minute.
Downtime costs can be approximated using the formula:
Downtime Cost = Minutes of Downtime x Cost per Minute
Start with that equation. It is simple and straightforward.
You can, of course, develop models that better estimate the ROI of resilience work.
Prevention ROI = (Potential Failure Cost × Probability × Prevention Success Rate) / Prevention Investment Cost
Even imperfect models are better than no models, and you can always improve them iteratively.
The key is to start putting a price on prevention.
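To make this concrete, here is a minimal sketch of the two formulas above in Python. All of the inputs (outage duration, cost per minute, probability, success rate, team cost) are hypothetical and should be replaced with your own estimates:

```python
# Minimal sketch of the two formulas above. All figures are hypothetical.

def downtime_cost(minutes_of_downtime: float, cost_per_minute: float) -> float:
    """Downtime Cost = Minutes of Downtime x Cost per Minute."""
    return minutes_of_downtime * cost_per_minute


def prevention_roi(potential_failure_cost: float, probability: float,
                   prevention_success_rate: float,
                   prevention_investment_cost: float) -> float:
    """Prevention ROI = (Potential Failure Cost x Probability x
    Prevention Success Rate) / Prevention Investment Cost."""
    return (potential_failure_cost * probability *
            prevention_success_rate) / prevention_investment_cost


# A 4-hour outage at the industry-average ~$9,000 per minute
outage_cost = downtime_cost(minutes_of_downtime=4 * 60, cost_per_minute=9_000)
print(f"Estimated outage cost: ${outage_cost:,.0f}")  # $2,160,000

# Hypothetical inputs: 40% chance of that outage in a year, prevention work
# catches 80% of such cases, and the resilience team costs $600,000 per year.
roi = prevention_roi(outage_cost, probability=0.4, prevention_success_rate=0.8,
                     prevention_investment_cost=600_000)
print(f"Prevention ROI: {roi:.2f}x")  # ~1.15x
```

Even a rough model like this turns "we prevent outages" into a number leadership can compare against other investments.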
2. Quantify Prevented Failures
Document the "alternate reality" and use simulations to show what could have happened without prevention.
Track and document near-misses with their potential business impact and share these with leadership (a minimal tracking sketch follows these examples):
"ORR process caught database scaling issue that could have caused a 4-hour outage during Black Friday, saving $1.2M in potential revenue loss."
"Incident review process from previous incident prevented similar failure pattern across three other teams, saving $500K in potential revenue loss."
"Chaos experiment identified memory leak that would have led to failure during product launch, saving $300K in potential revenue loss."
3. Create Prevention Dashboards
Build visibility into prevention work with business-relevant metrics (a small sketch of how a few of these might be computed follows this list):
Issues caught before customer impact
System resilience improvements over time
Technical debt prevented vs. remediated
Learning from incidents and reviews
Time to detect, time to respond, time to recover trends
On-call confidence index
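As noted above, here is a small sketch of how the detect/respond/recover trend metrics might be computed from incident timestamps; the incident records and field names are illustrative assumptions:

```python
# Minimal sketch of computing detect/respond/recover metrics from incident
# timestamps. The incident records and field names are illustrative assumptions.
from datetime import datetime
from statistics import mean

incidents = [
    {"started": "2024-10-01T10:00", "detected": "2024-10-01T10:08",
     "responded": "2024-10-01T10:15", "recovered": "2024-10-01T11:02"},
    {"started": "2024-11-12T22:30", "detected": "2024-11-12T22:34",
     "responded": "2024-11-12T22:40", "recovered": "2024-11-12T23:05"},
]


def minutes_between(start: str, end: str) -> float:
    fmt = "%Y-%m-%dT%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


time_to_detect = mean(minutes_between(i["started"], i["detected"]) for i in incidents)
time_to_respond = mean(minutes_between(i["started"], i["responded"]) for i in incidents)
time_to_recover = mean(minutes_between(i["started"], i["recovered"]) for i in incidents)

print(f"Mean time to detect:  {time_to_detect:.0f} min")
print(f"Mean time to respond: {time_to_respond:.0f} min")
print(f"Mean time to recover: {time_to_recover:.0f} min")
```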
4. Build Compelling Narratives
Document how prevention work creates measurable business value:
Share "near miss" stories and highlight specific instances where measures prevented failures
Before/after resilience work comparisons
Customer satisfaction correlations with resilience investments
Developer productivity gains from improved systems
Innovation velocity enabled by confidence in system resilience
Learn from others' mistakes and point to organizations that failed to prepare
Break prevention into measurable achievements
5. Establish Prevention SLAs
Just as you have uptime SLAs, consider creating accountability for prevention work (a small compliance-check sketch follows this list):
Complete ORR for 100% of new major feature deployments
Conduct post-incident reviews for 100% of Sev-1 and Sev-2 incidents
Allocate at least 20% of engineering time to addressing technical debt
Execute a minimum of two chaos experiments per quarter for all critical dependencies
Maintain test coverage above 85% across all services
Conduct a full-scale simulation (GameDay) once a month to validate incident response capabilities
Review, verify, and exercise at least one runbook weekly
Ensure every runbook has been verified within the past 6 months
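As referenced above, here is a small sketch of how a couple of these SLAs (runbook freshness and post-incident review coverage) might be checked automatically; the data structures and thresholds are illustrative assumptions:

```python
# Minimal sketch of checking two of the prevention SLAs above: runbook freshness
# and post-incident review coverage. Data structures and thresholds are
# illustrative assumptions, not a standard format.
from datetime import date, timedelta

RUNBOOK_MAX_AGE = timedelta(days=180)   # "verified within the past 6 months"
TODAY = date(2025, 1, 15)               # fixed so the example stays reproducible

runbooks = [
    {"name": "db-failover", "last_verified": date(2024, 12, 1)},
    {"name": "cache-flush", "last_verified": date(2024, 5, 20)},   # stale
]

incidents = [
    {"id": "INC-101", "severity": 1, "review_done": True},
    {"id": "INC-102", "severity": 2, "review_done": False},        # SLA miss
]

stale_runbooks = [r["name"] for r in runbooks
                  if TODAY - r["last_verified"] > RUNBOOK_MAX_AGE]

sev_1_and_2 = [i for i in incidents if i["severity"] in (1, 2)]
review_coverage = sum(i["review_done"] for i in sev_1_and_2) / len(sev_1_and_2)

print(f"Runbooks overdue for verification: {stale_runbooks}")
print(f"Post-incident review coverage (Sev-1/2): {review_coverage:.0%}, target 100%")
```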
6. Build Institutional Memory
Document the reasoning behind protective measures and continuously educate stakeholders about:
Why specific resilience investments were made
What disasters they're designed to prevent
How past incidents shaped current practices
The compound value of organizational learning capabilities
Regularly help stakeholders understand the value of resilience
Cultural Transformation
Even with perfect measurement frameworks, organizations will resist investing in prevention due to what are called "organizational antibodies"—the people and processes that extinguish new ideas as soon as they begin to course through the organization. Overcoming the Resilience Amnesia Cycle requires deep cultural change:
1. Leadership Modeling
Executives must visibly value and celebrate prevention work. When an ORR catches a critical issue or a chaos experiment prevents a repeat failure, it should be celebrated as much as shipping a major feature.
2. Career Path Recognition
Create senior career paths for prevention specialists. Principal SREs and Distinguished Engineers in reliability and resilience should be as valued as their counterparts in product development.
3. Shared Context
Regularly share the cost and impact of outages across the organization. When teams understand that a 4-hour outage costs $800K, they better appreciate the teams that prevent such outages.
4. Prevention Storytelling
Develop and share organizational narratives around prevention heroes, not just feature development or incident response heroes. The engineer who prevents a disaster through thoughtful ORR or systematic chaos engineering should get the same recognition as the one who fixes it.
Moving forward
As systems become more complex and customer expectations continue rising, the impacts of the prevention paradox become even more important to understand. Organizations that can't maintain prevention investments will find themselves in a continuous cycle of failure and reactive investment.
The most successful organizations I work with treat prevention work as a strategic capability, not a cost center. They understand that in a world where software is eating everything, resilience is a competitive advantage.
More importantly, they recognize that organizational learning and adaptation capabilities are what separate resilient organizations from fragile ones. The ability to anticipate problems, learn from incidents, and continuously improve is what enables sustainable success in uncertain environments.
If you recognize the prevention paradox in your organization, here's how to start addressing it today:
Audit your current prevention work - What's already happening that leadership doesn't see?
Calculate your prevention ROI - What failures have been avoided and what would they have cost?
Document near-misses - Start building a catalog of prevented failures and lessons learned.
Create visibility dashboards - Make prevention work as visible as feature delivery.
Build business-impact narratives - Connect technical prevention to business outcomes.
Measure learning velocity - Show how quickly the organization adapts and improves.
Don't wait until it is too late. The prevention paradox isn't inevitable. Organizations that recognize and actively counter it build more reliable systems, retain better engineers, create stronger learning capabilities, and develop sustainable competitive advantages.
However, it requires conscious effort to value invisible work and resist the cognitive biases that make successful prevention appear unnecessary.
The best failure is the one that never happens. The challenge is proving it.