Controls vs Guardrails: Why Organizations Struggle with Resilience Despite Having All the Right Pieces


After a decade of helping organizations improve their resilience practices, I keep seeing the same puzzle. Companies invest heavily in operational readiness reviews, well-architected reviews, incident reviews, chaos engineering, alerting, monitoring, etc. They have leadership buy-in, dedicated teams, and sophisticated tooling. Yet despite having all the right pieces, many still struggle to build genuine resilience.

The more I observe this pattern across industries, the more convinced I become that we're dealing with something fundamental about human psychology: our natural responses to uncertainty systematically undermine the very resilience we're trying to build.

The Evolutionary Trap

The human brain evolved to treat uncertainty as potential danger. When we don't know what's coming next, our amygdala activates, triggering stress and the urgent need to regain control. This served our ancestors well when uncertainty meant immediate physical threats, but it creates problems in complex organizational systems.

We're pattern-seeking creatures who prefer clear cause-and-effect relationships. When faced with ambiguity, we instinctively impose structure with rules, procedures, and approval gates that provide the illusion of predictability, even when they can't actually remove the underlying uncertainty.

Think about what happens after an outage. Teams naturally default to adding more approval gates, extending checklists, requiring additional sign-offs. Each control feels rational in isolation and provides psychological relief by making us feel we're "doing something" about the problem.

But together, these controls create systems so constrained that engineers can't respond effectively when something truly unexpected happens. The very mechanisms designed to prevent problems end up preventing the adaptive response that could have avoided a bigger failure.

Research reveals exactly why this backfires. Organizations that handle crises well are those that can flexibly navigate different responses during real emergencies, rather than simply following rigid procedures. Yet our natural response to past failures is to create more rigid procedures.

When outcomes are uncertain, our decision-making shifts from calculation to heuristics and mental shortcuts. We fall back on availability bias (overweighting recent incidents) and confirmation bias (seeking information that supports our existing beliefs about what went wrong). This leads to controls that address the specific failure we just experienced while missing the broader patterns that create system brittleness.

The desire for predictability is so strong that we often choose the feeling of control over the reality of safety. This explains why organizations continue tracking metrics like MTTR even when teams understand the metric is mathematically meaningless, or why they maintain cumbersome approval processes that become rubber stamps but still create friction during emergencies.

Controls vs Guardrails

The key to understanding and solving this problem is recognizing that most organizations blur a crucial line between two fundamentally different approaches to safety: controls and guardrails.

Controls dictate how work gets done. They're prescriptive, active during normal operations, and create friction for everyone, regardless of whether there's any actual danger. Like tollbooths on a highway, they slow down every single person, every single time, even when there's no safety issue.

I've seen many organizations create elaborate chaos engineering processes with good intentions. They want to prevent teams from causing unintended damage. But these weeks-long coordination requirements create cognitive overload that makes teams avoid learning activities entirely. That's a control masquerading as a safety practice.

The most telling sign that controls have gone too far is when engineers stop raising concerns because "the process doesn't allow for that" or "nobody would listen anyway." That's adaptive capacity disappearing in real time.

Guardrails, on the other hand, define safe operating boundaries while preserving flexibility within those bounds. Like highway guardrails, they activate only when you're approaching real danger, not during normal operations. They make the safe path also the easy path.

Think of it like ziplining in a forest. The controls approach says "Ziplining is dangerous, so we'll require permits, 6 weeks of training, and supervised access only on Tuesdays." Result? Nobody ziplines, or people sneak in and zipline without any safety equipment because the official process is too cumbersome.

The guardrails approach says "Ziplining is dangerous, so here's a harness, safety line, and helmet." The safety equipment enables the risky activity rather than preventing it. People zipline frequently and safely because the gear only constrains them when there's real danger: exceeded weight limits, equipment failure, or bad weather.

A guardrail approach to chaos engineering might provide lightweight frameworks with ready-made integrations, but allow teams to adapt scope, timing, and focus based on what they're trying to learn about their specific systems. The safety comes from built-in blast radius limits, automatic rollback procedures, and environment isolation, not from bureaucratic overhead.
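To make the shape of such a guardrail concrete, here is a minimal sketch in Python. It's illustrative only: `inject_latency`, `error_rate`, and `rollback` are hypothetical stand-ins for whatever fault-injection and metrics tooling a team already has. The safety lives in the blast-radius cap, the error-budget abort, and the unconditional rollback, not in an approval workflow.

```python
import random
import time


def run_latency_experiment(inject_latency, error_rate, rollback,
                           max_blast_radius=0.05,   # never touch more than 5% of traffic
                           error_budget=0.02,       # abort if error rate exceeds 2%
                           duration_s=300,
                           check_interval_s=5):
    """Run a latency-injection experiment inside fixed safety boundaries."""
    # The blast-radius cap is the boundary, not a review meeting.
    inject_latency(fraction_of_traffic=max_blast_radius, delay_ms=200)
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            if error_rate() > error_budget:
                # The guardrail trips only when real danger appears.
                print("Error budget exceeded, stopping experiment early.")
                break
            time.sleep(check_interval_s)
    finally:
        rollback()  # always restore steady state, even if the experiment crashes


if __name__ == "__main__":
    # Stand-in callables so the sketch runs on its own.
    run_latency_experiment(
        inject_latency=lambda fraction_of_traffic, delay_ms: print(
            f"injecting {delay_ms}ms latency into {fraction_of_traffic:.0%} of traffic"),
        error_rate=lambda: random.uniform(0.0, 0.03),
        rollback=lambda: print("latency injection removed"),
        duration_s=15,
    )
```

Teams stay free to change what they inject, where, and when; the boundaries are the only fixed part.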

This distinction shows up everywhere once you know to look for it. I've seen incident reports describing production access processes as "cumbersome," creating friction during the exact moments when adaptive capacity matters most. The irony is that these access controls often become rubber stamps. When so many people need production access for legitimate work, approval processes default to "yes" without real scrutiny.

Meanwhile, guardrails like automated safety checks, environment-specific tooling, and default read-only permissions would actually prevent dangerous actions without slowing down normal troubleshooting.
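As a sketch of that idea (again hypothetical, not any particular tool's behavior), a default read-only guardrail for production queries could look roughly like this; `execute` stands in for whatever query runner is already in use, and `break_glass` is an imagined audited override:

```python
WRITE_KEYWORDS = {"insert", "update", "delete", "drop", "alter", "truncate"}


def guarded_query(sql, environment, execute, break_glass=False, audit=print):
    """Run a query, defaulting to read-only in production.

    Reads flow through with zero friction; writes against production trip the
    guardrail unless an audited break-glass override is used.
    """
    is_write = any(word in sql.lower().split() for word in WRITE_KEYWORDS)
    if environment == "production" and is_write and not break_glass:
        raise PermissionError("Writes against production require break-glass mode.")
    if is_write and break_glass:
        audit(f"break-glass write in {environment}: {sql}")
    return execute(sql)
```

The asymmetry is the point: the check is invisible during normal troubleshooting and only engages when an action could actually cause harm.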

The pattern is often exacerbated during incident reviews. Organizations naturally gravitate toward quick fixes after outages. The urgency to "do something" drives teams to immediately jump to finding root causes and generating action items. But as John Allspaw observes in his excellent talk Incident Analysis: How *Learning* is Different Than *Fixing*, when "your goal is to fix, you're gonna fix something whether or not it was the right thing to fix," and "once teams find a plausible fix, time and production pressure cause them to stop exploring other options."

Here's the trap I see many organizations fall into: when incident reviews become focused on quick fixes rather than deep understanding, they systematically generate more controls. Each incident generates new approval processes, additional checkpoints, and longer procedures. Every. Single. Time. The very mechanism meant to build resilience ends up eroding it through control accumulation.

A guardrails approach to incident learning resists the temptation for quick fixes. Instead, it focuses on understanding "what was difficult for people to understand during the incident, what was surprising for people about the incident." Answering these difficult questions helps design better guardrails.

The difference is critical: quick incident reviews create controls and bureaucracy, while deep understanding-focused reviews help create good guardrails and in turn build adaptability.

But wait, what happens when engineers take shortcuts that bring down critical systems?

This question becomes crucial when we consider why people work around safety measures. In my experience, there are two distinct patterns:

Systematic workarounds, where multiple people consistently bypass the same controls, reveal design problems. When everyone ignores a safety measure because it conflicts with getting work done effectively, that's feedback that the control isn't designed for the reality of the work environment.

Individual violations, where someone consciously ignores a safety protocol despite understanding the risks, represent accountability issues requiring different responses such as training, supervision, or removal from the role.

The key difference is in the data pattern. One person removing a safety device represents an individual problem. Everyone consistently working around the same procedure indicates a system design problem requiring guardrail redesign, not more enforcement. When operational staff or engineers consistently bypass safety measures, it's usually because those measures force them to choose between being safe and being effective. That's a design problem, not a people problem.

For the small percentage who truly are bad actors, attempting to prevent malicious behavior through process controls is futile. If someone really wants to cause trouble, they'll find a way around controls. This is where foundational security principles become essential: comprehensive auditing that records every action, immutable infrastructure that can't be tampered with, and "detect, isolate, replace" strategies. These work as guardrails. They don't prevent every possible action through approvals, but they make malicious changes visible and automatically containable.
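Sketched in the same spirit, a "detect, isolate, replace" loop might look like the following; `list_instances`, `expected_checksum`, `quarantine`, and `replace_from_image` are hypothetical stand-ins for whatever inventory, attestation, and provisioning tooling is in place:

```python
def detect_isolate_replace(list_instances, expected_checksum, quarantine,
                           replace_from_image, audit=print):
    """Contain tampering through visibility and replacement, not prevention.

    Each instance is compared against its known-good immutable image; any
    drift is logged, the instance is cut off from traffic, and a clean
    replacement is provisioned in its place.
    """
    for instance in list_instances():
        if instance["checksum"] != expected_checksum(instance["image"]):
            audit(f"drift detected on {instance['id']}")   # detect: make it visible
            quarantine(instance["id"])                      # isolate: limit the scope
            replace_from_image(instance["image"])           # replace: restore known-good state
```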

You can't control your way out of determined bad actors, but you can architect systems that make their actions obvious and limited in scope.

Breaking the Cycle

The human tendency to add controls in the face of uncertainty is so deeply wired that even organizations with excellent resilience instincts and intentions fall into this trap. After incidents, the cultural pressure to "do something" combined with our psychological need for control creates an almost irresistible urge to add approval gates, extend procedures, and create more detailed documentation.

The path forward requires consciously auditing our practices through a controls vs guardrails lens. Where are we creating friction during normal operations when we should be creating safety boundaries? Where are we demanding compliance when we should be enabling adaptation?

The goal isn't eliminating all structure. It's ensuring that our structure enables the adaptive capacity that makes systems genuinely resilient rather than merely compliant.

Here's the important bit: resilience comes from systems that can learn and adapt, not from preventing all possible changes. When we build tollbooths instead of guardrails, we optimize for the feeling of control rather than the reality of safety.

Smart guardrails enable adaptation by making the safe path also the effective path. Rigid controls kill adaptation by forcing people to choose between following procedures and solving problems.

To really improve resilience, organizations need to understand this distinction and design safety mechanisms that activate when needed but don't interfere with normal operations. They need to measure outcomes that matter rather than compliance metrics that feel good. They need to create psychological safety that enables people to surface problems early rather than hide them to avoid bureaucratic friction.

Most importantly, they need to recognize that our instinctive response to uncertainty—adding more controls—is often the enemy of the adaptive capacity that creates real resilience.

You don't make dangerous activities safe by preventing access. You make them safe by giving people the right safety equipment and boundaries.
