Beyond Root Cause: A Better Approach to Understanding Complex System Failures

The Limitations of Traditional Root Cause Analysis

In a recent interview with Iluminr, I was asked the following question:

"Which framework do you think needs to be retired or radically rethought?"

My answer was clear: the traditional "root cause analysis" and "5 whys" frameworks.

This is the director’s cut version of that answer.

However, I should first confess: I once believed deeply in the 5 Whys method. It's featured in numerous books and endorsed by industry leaders. As a young engineer, questioning these established practices wasn't my first instinct. I not only used it for years but also passionately shared it with colleagues and taught it to others.

Who could blame me? The framework is intuitive, easy to explain, and simple to implement. It makes perfect sense… on the surface. Keep asking "why" until you find the ultimate cause, fix that cause, and the incident will never recur.

The Turning Point

As I grew in seniority and gained more experience with complex systems, I became increasingly uncomfortable with the limitations I was seeing. The implicit accusatory tone of "why" questions started to bother me. Team members would become defensive rather than reflective during post-mortems.

Something seemed wrong.

More importantly, I noticed that our "solutions" weren't preventing similar incidents. We'd fix the specific issue we had identified, only to see a different manifestation of the same systemic problem appear again and again.

Once I understood that the goal of incident investigation was learning, everything changed.

And once you see the benefits of a more nuanced approach, there's no going back.

The Fundamental Problem

These traditional approaches are based on outdated, linear thinking that assumes failures have single, identifiable causes that can be eliminated. But the truth is that's not how complex systems work.

Complex systems never have one single root cause. Instead, they have multiple contributing factors that combine to create failures, and it is the accumulation of these factors that eventually leads to a breakdown.

And these failures are non-deterministic: recreating what looks like the same conditions would likely produce a different outcome, because these systems operate in dynamic environments where conditions and context continuously change.

Why These Frameworks Persist

Yet despite these obvious limitations, these frameworks persist in organizations worldwide. They're comforting in their simplicity. They give us the illusion of control. You can find the root cause, fix it, and the problem is solved.

Organizations also like them because they often lead to solutions that appear straightforward and actionable. "Retrain the engineer" or "Add another approval step to the process" are easier actions to document than "Our system has fundamental design flaws that interact in unpredictable ways."

Real-World Examples: When 5 Whys Lead to Wrong Conclusions

In the interview, I shared an example where a company had experienced a 2-hour database outage. Their initial 5 Whys analysis went something like this:

  1. Why did the database go down? Because it ran out of storage space.

  2. Why did it run out of storage space? Because the log files grew too large.

  3. Why did the log files grow too large? Because log rotation wasn't functioning properly.

  4. Why wasn't log rotation functioning properly? Because the engineer who set it up used incorrect settings.

  5. Why did the engineer use incorrect settings? Because they weren't properly trained in database configuration.

Conclusion: “We need better training for engineers on database configuration.”

And that seems OK. It makes sense, right?

But if you really think about it, this analysis was almost engineered to produce a linear story ending with "insufficient training." It implicitly blamed the engineer and missed the systemic issues.

Diagram showing traditional root cause analysis as a linear path leading to a single cause

Complex Systems Demand Different Thinking

Instead of accepting that conclusion, we added some different questions alongside their existing framework.

The instruction was to be curious about the context and to explore different dimensions: culture, processes, and tools.

After a few iterations, we eventually ended up with something like this:

  1. How did you first become aware of the issue? I noticed alerts showing unusual disk usage patterns an hour before the crash, but they weren't critical alerts, so I was finishing another urgent task first.

  2. How did the system appear to be functioning at that time? It seemed normal except for the disk usage. We've had similar warnings before that resolved themselves, so I wasn't immediately concerned.

  3. What were you focusing on when making decisions about priorities? I was trying to balance multiple alerts. Since we typically prioritize customer-facing issues, I was working on a payment processing issue first.

  4. How was the log rotation system originally set up? It was configured during our migration six months ago. We copied settings from our test environment, which had different usage patterns. The rotation was set for weekly rather than daily because test data volumes were much smaller. (A quick back-of-the-envelope sketch of why that mattered follows this list.)

  5. How do changes to these systems typically get reviewed? We usually have a checklist for infrastructure changes, but during the migration period, we moved quickly to meet deadlines, and some review steps were abbreviated.
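
To make the fourth answer concrete, here's a back-of-the-envelope sketch of why a rotation interval copied from a low-volume test environment can quietly exhaust a production disk. The numbers are hypothetical and purely illustrative; the real environment almost certainly differed.

```python
# Back-of-the-envelope check: is a weekly rotation interval safe here?
# All numbers below are hypothetical, for illustration only.

def days_until_disk_full(free_gb: float, log_growth_gb_per_day: float,
                         rotation_interval_days: int) -> float:
    """Worst case: logs accumulate for a full rotation interval before any
    space is reclaimed, so peak log usage is growth * interval."""
    peak_log_gb = log_growth_gb_per_day * rotation_interval_days
    if peak_log_gb <= free_gb:
        return float("inf")  # rotation reclaims space before the disk fills
    return free_gb / log_growth_gb_per_day

# Test environment: tiny log volumes, so weekly rotation looks perfectly safe.
print(days_until_disk_full(free_gb=50, log_growth_gb_per_day=0.2,
                           rotation_interval_days=7))  # inf

# Production after the migration: same weekly setting, much faster growth.
print(days_until_disk_full(free_gb=50, log_growth_gb_per_day=12,
                           rotation_interval_days=7))  # ~4.2 days: the disk
                                                       # fills before rotation runs
```

The setting was never "wrong" in the environment it was written for; it became wrong when the context changed, which is exactly the kind of interaction a single chain of "why" questions tends to flatten.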

This new approach led the team to different conclusions. Same incident, different ending. They ended up implementing several improvements instead of just pushing for “more training”:

  • Revising alert classification to better distinguish critical issues (a minimal sketch of this idea follows the list)

  • Establishing dedicated maintenance periods

  • Enhancing the infrastructure change review process

  • Creating more accurate test environments

  • Addressing workload prioritization issues
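
As one illustration of the first item, here's a minimal sketch of alert classification based on projected time-to-full rather than a static usage threshold. The function, severity levels, and cut-offs are assumptions for illustration, not the team's actual tooling.

```python
# Minimal sketch: grade disk alerts by projected time-to-full
# rather than by a static "disk is X% full" threshold.
# Severity names and cut-offs are hypothetical.

def classify_disk_alert(used_pct: float, growth_pct_per_hour: float) -> str:
    """Return a severity based on how soon the disk is projected to fill."""
    if growth_pct_per_hour <= 0:
        return "info"  # not growing: informational only
    hours_to_full = (100.0 - used_pct) / growth_pct_per_hour
    if hours_to_full < 2:
        return "critical"  # page immediately, even if usage still looks modest
    if hours_to_full < 24:
        return "warning"   # needs attention today
    return "info"

# The pattern from the incident: moderate usage, but growing fast.
print(classify_disk_alert(used_pct=60, growth_pct_per_hour=25))   # "critical"

# High usage but flat growth: no reason to interrupt other work.
print(classify_disk_alert(used_pct=85, growth_pct_per_hour=0.1))  # "info"
```

A rule along these lines might have paged for the "unusual disk usage patterns" the engineer noticed an hour before the crash, rather than leaving the prioritization call entirely to whoever happened to be juggling alerts at the time.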

The key difference wasn't just in asking better questions. The approach fundamentally recognized that incidents emerge from complex interactions between people, technology, and organizational factors, rather than from a single cause or person's mistake.

Diagram showing systems thinking approach displaying multiple interconnected contributing factors in incident analysis.


Here is another example, where a critical service went down during a deployment. The initial 5 Whys analysis looked like this:

  1. Why did the service fail? Because an invalid configuration was deployed.

  2. Why was an invalid configuration deployed? Because it wasn't properly tested.

  3. Why wasn't it tested? Because the engineer was rushing to meet a deadline.

  4. Why was there a rush? Because the project was running behind schedule.

  5. Why was it behind schedule? Because the estimate was too optimistic.

Conclusion: “We need to improve the estimation process."

Instead, the new approach revealed:

  • The deployment tools made it too easy to accidentally include unrelated changes

  • The monitoring system didn't catch the issue because it was designed to detect hard failures, not degraded performance

  • An engineer had been doing manual system checks that caught several issues early, but this wasn't a formal practice

  • The system degraded gradually rather than failing immediately, making cause-and-effect relationships harder to establish

This, too, led to multiple improvements, including better deployment tooling, enhanced monitoring, formalizing the previously informal system checks, and a deeper understanding of how the services degrade under specific conditions.
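
To illustrate the "enhanced monitoring" piece: a hard-failure check only fires once the service stops responding, so gradual degradation slips through unnoticed. Here's a minimal, hypothetical sketch of a check that compares recent latency against a historical baseline instead; the names, numbers, and threshold are assumptions, not the team's actual setup.

```python
# Hypothetical sketch: flag gradual degradation, not just hard failures.

from statistics import median

def degradation_alert(recent_ms: list[float], baseline_ms: list[float],
                      ratio_threshold: float = 1.5) -> bool:
    """Alert when median latency drifts well above its historical baseline,
    even though every request is still technically succeeding."""
    if not recent_ms or not baseline_ms:
        return False
    return median(recent_ms) > ratio_threshold * median(baseline_ms)

# The service stays "up" the whole time, but responses keep getting slower.
baseline = [110, 120, 115, 118, 112]        # ms, sampled from a healthy period
recent = [190, 210, 205, 198, 220]          # ms, after the problematic deployment
print(degradation_alert(recent, baseline))  # True, well before any hard failure
```

The specific heuristic matters less than the shift in question: from "is the service down?" to "is the service behaving the way we expect?"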

Picture of a replica Trojan Horse, made of wood.

The Trojan Horse Approach: Implementing Change Without Resistance

After years of trying to improve how teams analyze incidents, I've learned that announcing "your approach is wrong!" rarely works. Instead, I've developed what I call the "Trojan Horse" approach to changing incident analysis practices.

Rather than launching a frontal assault on established methodologies, I've found it more effective to introduce new thinking from within existing frameworks. Just like that old wooden horse, this approach appears harmless but carries within it ideas that can transform how organizations understand incidents.

Here's how I typically introduce it:

  1. Start with the familiar format of a post-incident review that leadership expects

  2. Gradually introduce open-ended "what" and "how" questions alongside the traditional "why" questions

  3. Be curious about the context and explore its various dimensions, including culture, processes, and tools

  4. Highlight the richer insights and context these alternative questions produce

  5. Expand the scope beyond finding a single root cause to mapping the system interactions and dynamics

Over time, organizations eventually shift from simplistic root cause thinking to a more nuanced understanding of complex systems, without the resistance that often comes with rejecting established methods outright.

Practical Questions That Transform Incident Analysis

What I've found most effective is introducing subtle but powerful questions into existing processes:

  • "What surprised you during this incident?"

  • "Where did your understanding of the system prove incorrect?"

  • "How did this make sense to everyone involved at the time?"

  • "What pressures and constraints shaped the environment in which decisions were made?"

  • "Who knew things that others didn't?"‘

  • “What were we afraid to talk about before this problem happened?”

These questions don't disrupt the familiar framework but gently expand thinking beyond simplistic cause-and-effect reasoning.

Language Evolution: Shifting From "Root Cause" to "Contributing Factors"

Similarly, I've found you can transform how teams conceptualize incidents by gradually shifting terminology:

  • From "root cause" to "contributing factors"

  • From "human error" to "systemic conditions"

  • From "failure" to "unexpected behavior" or “surprise”

  • From "preventing" to "learning"

  • From "cause" to "influence"

Most people won't even notice these subtle shifts happening in conversations and documentation, but over time, they profoundly change how incidents are understood.

Building Sustainable Improvement in Your Organization

The Trojan Horse approach works because it acknowledges the real-world constraints we all face:

  • Teams have limited time for incident reviews

  • People have varying levels of expertise in systems thinking

  • Leaders want clear, actionable outcomes

  • Everyone feels pressure to "just fix it and move on"

It works precisely because it respects these constraints rather than fighting against them. It enables continuous improvement without demanding radical change all at once.

Be Patient, Be Persistent

The most successful change often happens not through revolution, but through evolution, making each incident review just a little bit better than the last one.

Remember: people don't resist change; they resist being changed.

By meeting teams where they are and gradually expanding their perspective, we create sustainable improvement rather than resistance.

Resilience needs to be nurtured, not imposed.
