What 1,000 Executives Know But Can't Fix

I have no idea how I missed this report when it came out. Cockroach Labs published their State of Resilience 2025 back in late 2024, surveying 1,000 senior technology executives across North America, Europe, and Asia-Pacific, and it somehow slipped past me entirely. Better late than never, because there's a lot here worth digging into, at least if you're willing to ignore Part 6, which makes the same mistake almost every vendor resilience report makes: five sections of genuinely useful organizational data followed by a pivot to "and here's how our product fixes it." It's understandable. They sell distributed databases. But it's also disappointing, because their own data argues against the conclusion, and because it reinforces a persistent myth in this industry that you can buy your way to resilience. More on that in a moment.

The data was collected in August-September 2024, so it's about 18 months old. If anything, that makes the findings more relevant now. Since this survey was fielded, DORA has gone into full effect, NIS2 enforcement is ramping up across EU member states, AI agents are being woven into operational workflows at a pace that would have seemed aggressive even a year ago, and the CrowdStrike-shaken organizations that told researchers they were "significantly improving their planning" have had a full year and a half to either follow through or quietly drift back to the status quo. The structural dynamics this report captures don't resolve themselves in 18 months; they compound. And the rapid adoption of AI in operations is adding new failure modes to systems that were already struggling with the old ones.

The headline numbers first: the average enterprise experiences 86 outages per year, averaging 196 minutes each. Every single company surveyed lost revenue to outages in the past twelve months. For large enterprises, outage-related losses averaged $495K annually.
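For a sense of scale, here's a quick back-of-envelope calculation on those two figures. The inputs are the report's averages; the arithmetic that follows is mine, not theirs:

```python
# Rough cumulative downtime implied by the report's headline averages.
# The two inputs below are from the report; everything derived is my own arithmetic.
outages_per_year = 86
minutes_per_outage = 196

total_minutes = outages_per_year * minutes_per_outage  # 16,856 minutes
total_hours = total_minutes / 60                        # ~281 hours
total_days = total_hours / 24                           # ~11.7 days

print(f"~{total_hours:.0f} hours (~{total_days:.1f} days) of outage time per year")
```

Call it roughly 281 hours, or nearly twelve full days, of something being broken somewhere every year, on average.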

But the number that made me pause for a second: 95% of executives say they are aware of at least one unresolved operational weakness that puts their organization at risk. 72% say they have multiple. And 48% say their organizations are doing insufficient work to address those weaknesses. Nearly every leader knows where the cracks are, and almost half say nothing adequate is being done. The prevention paradox, playing out at scale with hard data behind it.

The blockers are familiar too. Other teams' priorities take precedence (38%). Budget constraints (36%). Lack of leadership buy-in (32%). Meanwhile 92% of teams must deprioritize essential work to fight fires, 48% work overtime and weekends to restore operations, and 39% report a growing backlog of post-mortems. The feedback loop I keep coming back to in client work is right there in the numbers: less time for improvement leads to more incidents, which leads to even less time for improvement.

Then there's the human cost that rarely shows up in resilience reports. 82% of leaders said they or their team members fear losing their jobs following a significant outage. Think about what that does to everything else. If you've ever wondered why your blameless post-mortems don't feel blameless, there's your answer. You can design the most thoughtful incident analysis process imaginable, but 82% job-fear will override any process document every single time.

Now, Part 6 and why it's a missed opportunity. Look at the causes of downtime: network issues (38%), software issues (36%), cyberattacks (36%), cloud provider reliability (35%), third-party failures (33%), environmental factors (31%), human error (31%), capacity issues (30%), hardware failures (30%). That distribution is remarkably flat. Nothing really dominates. And the report itself notes it's consistent regardless of company size, sector, or geography.

To me, that flat distribution is the most important finding in the entire report, and the authors barely pause on it.

Here's why it matters so much. The conventional reading is that these are nine independent failure modes, each requiring its own technical solution. Network problems need better network architecture. Software issues need better testing. Capacity problems need better scaling. And each vendor can point to their slice of the chart and say "we fix that part," and they're often right about their slice.

But the flatness itself is in fact the diagnostic clue that something else is going on. Nine unrelated failure categories don't land within eight percentage points of each other by coincidence. They land that way when they're all symptoms of the same underlying condition. Network issues, software bugs, human error, capacity problems: these aren't nine independent diseases. They're nine ways that the same organizational gaps express themselves in production. Poor feedback loops, misaligned incentives, the distance between how leaders imagine work happens and how it actually happens, insufficient learning from failure: these systemic causes don't prefer one failure category over another. They just make all of them more likely, roughly equally.
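If you want to see just how flat that is, here's a quick check using the nine percentages as listed above. The numbers are the report's; the calculation is mine:

```python
# Downtime causes as reported (percent of respondents citing each).
causes = {
    "network issues": 38, "software issues": 36, "cyberattacks": 36,
    "cloud provider reliability": 35, "third-party failures": 33,
    "environmental factors": 31, "human error": 31,
    "capacity issues": 30, "hardware failures": 30,
}

spread = max(causes.values()) - min(causes.values())  # 8 percentage points
mean = sum(causes.values()) / len(causes)             # ~33.3%

print(f"spread: {spread} points, mean: {mean:.1f}%")
# No category sits more than ~5 points from the mean: there is no dominant cause to "fix".
```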

Think of it like a doctor seeing a patient with fatigue, headaches, joint pain, and skin problems all at similar severity. You could treat each symptom with a different specialist and a different prescription. But the flat distribution across unrelated systems is itself the signal that something systemic is driving all of it. Treating the headaches won't help the joints, because neither is the actual problem.

This is what makes tool-first approaches to resilience so seductive and so ultimately inadequate. If one cause dominated, say network issues at 70% and everything else in single digits, you'd have a clear technical problem with a clear technical solution. But when the failure profile is this even, the slices are the wrong unit of analysis entirely. No single technical investment will meaningfully shift the overall failure rate, because the technical categories are where the problems surface, not where they originate. The only intervention that touches all nine simultaneously is the organization's ability to detect, respond to, and learn from whatever breaks next, regardless of category. That's an organizational capability, not a technology purchase.

This inability to see symptoms as symptoms, this habit of treating the surface categories as root causes and reaching for technical fixes to organizational problems, is exactly why I wrote Why We Still Suck at Resilience. And here, in a vendor's own dataset, is the evidence for the entire thesis of the book.

The report also shows that 100% of organizations already do some form of resilience testing, yet 71% do no failover testing, 62% skip regular backup and restoration exercises, and the average outage still takes over three hours to resolve. The tooling exists. The organizational capability to use it effectively does not.

Cockroach Labs collected data that illuminates exactly this, but unfortunately drew a technical conclusion from organizational evidence. Of course, I understand why. But somebody needs to measure the layer they skipped: the feedback loops, the gap between how leaders think work happens and how it actually happens, the tensions that keep organizations stuck even when they know what's broken. That's the data nobody is collecting systematically, and it's where the actual answers live. I’ll get to that in 2026.

Full report here if you want to read it yourself: https://www.cockroachlabs.com/guides/the-state-of-resilience-2025/

//Adrian
