Your best chaos engineering happens before you break anything
Dr. David Woods, one of the pioneers of resilience engineering, said something that a few of us keep repeating because it captures the essence of why chaos engineering is valuable:
"Just planning to inject a fault usually reveals that the system works differently than you thought."
Most teams skip straight to running chaos experiments. They pick a tool, design a failure scenario, inject the fault, observe what happens. They think the learning happens during the experiment.
They're wrong. The deepest learning often happens before you break anything at all.
Here's what I mean.
The meeting you've probably been in
You're discussing what happens when X fails. Someone says "obviously the circuit breaker kicks in." Everyone nods. The conversation moves on. Meeting ends.
Three months later, X actually fails in production. The circuit breaker doesn't work the way anyone thought. Or it doesn't exist at all. Or it exists but was disabled during a migration last quarter and nobody remembered to re-enable it.
The problem is that everyone had different mental models of how the system works. Nobody discovered this until real failure exposed it.
This happens constantly. Teams operate under the illusion of shared understanding. Everyone uses the same words. Everyone nods at the same moments.
Then something breaks, and you discover nobody actually agreed about how anything worked.
The hypothesis conversation
Here's what to do instead. Before running any chaos experiment, gather your team and have what I call the hypothesis conversation.
Step 1: Pick a specific failure scenario
Don't start with "what if the database goes down?" It’s too vague.
Instead start with something like: "What happens when our primary PostgreSQL instance becomes unavailable during peak traffic?"
The specificity matters. Vague scenarios produce vague answers. Specific scenarios force people to articulate their actual mental models.
Step 2: Silent writing first
This is the critical step most teams skip.
Before any discussion, have everyone write down:
What they expect to happen
How long they expect recovery to take
What customer impact they expect
What monitoring and alerts they expect to see
Give people 5 minutes of silence to write.
Why silent writing? Because it prevents groupthink. In open discussion, the loudest person always speaks first. Suddenly everyone's nodding along. The quieter team members, who might have spotted something nobody else considered, stay silent. You've just lost the diversity of perspective that makes this valuable.
Silent writing forces everyone to commit to their understanding before social dynamics kick in.
Step 3: Share and compare
Go around the room. Have each person read what they wrote.
No judgment. No critique. Just listen.
This is where it gets interesting. You'll discover:
Someone assumes automatic failover that doesn't exist.
Another person thinks there's queuing that isn't implemented.
The new engineer admits they have no idea what happens. That's valuable honesty the rest of the team needs to hear.
The senior engineer describes behavior from the old architecture, before the migration two quarters ago.
Nobody agrees on expected recovery time.
Your team has been working together for months or years. You thought you understood how the system works. You're discovering right now that you don't share the same understanding at all.
Step 4: Investigate the gaps
Don't just note the disagreements and move on. Dig into them:
"Why did you think queuing was implemented?"
"When was the last time we actually tested failover?"
"Where is that behavior documented?"
"Who would know the answer to this?"
Often you'll discover that nobody knows for certain. The system evolved. Documentation didn't keep up. People made assumptions based on how they thought things worked. Those assumptions never got validated.
Step 5: Decide what to do
Now you have options:
Run the chaos experiment to find out what actually happens.
Check the code or configuration to verify behavior.
Talk to the team that owns that component.
Update documentation based on what you learned.
Fix the gap you just discovered.
You might decide the experiment is still valuable. But you've already learned something critical: your team doesn't share the same mental model of how your system behaves. That's worth knowing regardless of what you do next.
Real example
I watched this play out a few years ago at a company planning their first database failover experiment.
The team gathered to discuss expectations. Everyone seemed aligned. "Database fails over to replica, maybe 30 seconds of elevated errors, everything recovers." Heads nodded. The consensus felt solid.
Then they did silent writing.
When everyone shared, the room got quiet.
The DBA expected 10-15 seconds of failover time based on the configuration settings.
Application engineers expected 30-60 seconds based on what they'd seen in past incidents.
The SRE thought the connection pool would need a manual restart, because that's how it worked in the old system.
A junior engineer thought reads would continue but writes would fail. That was actually a reasonable assumption given the architecture.
The architect assumed the application would automatically reconnect after failover.
Nobody was sure if the health check would detect the failover or keep routing traffic to the failed instance.
Six people. Six different mental models of the same system behavior.
They spent the next 30 minutes investigating:
Checked the database configuration: the failover timeout was actually 30 seconds, but nobody had tested it recently.
Looked at the application code: connection pool settings were wrong; they'd never refresh connections after a failover.
Reviewed the health check logic: it would completely miss the failover and keep sending traffic to the dead instance.
Found monitoring gaps: no visibility into connection pool state, so they wouldn't see the problem during the experiment.
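To make the connection pool finding concrete, here's roughly what "never refresh connections after a failover" means, along with the kind of validate-on-checkout guard that addresses it. This is a minimal sketch, assuming Python and psycopg2's SimpleConnectionPool; the hostnames are invented and their actual setup may have looked quite different.

```python
# Minimal sketch, not their actual code: validate connections at checkout so
# sockets still pointing at the old primary get discarded instead of failing
# application requests after a failover.
import psycopg2
from psycopg2 import pool

db_pool = pool.SimpleConnectionPool(
    minconn=1,
    maxconn=10,
    dsn="host=db-primary.internal dbname=app user=app",  # hypothetical DSN
)

def get_connection(max_attempts=3):
    """Check out a connection, throwing away any that no longer respond."""
    for _ in range(max_attempts):
        conn = db_pool.getconn()
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")  # cheap validation query
            return conn
        except psycopg2.OperationalError:
            # Stale connection to the failed primary: close it so the pool
            # replaces it with a fresh one against the new primary.
            db_pool.putconn(conn, close=True)
    raise RuntimeError("no healthy database connection available")
```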
They fixed three critical issues before running any experiment:
Updated connection pool configuration to handle failover
Fixed the health check logic
Added monitoring for connection pool state
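The health check fix was the same idea in miniature: stop reporting healthy just because the process is up, and actually probe the database, including whether it will accept writes. A rough sketch, assuming a Flask endpoint and psycopg2; the names and details are invented, but the shape of the fix is the point.

```python
# Illustrative sketch of a health check that catches a failover instead of
# missing it. The old check effectively returned "ok" as long as the process
# was running; this one fails fast when the database is unreachable or read-only.
from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)

@app.route("/healthz")
def health():
    try:
        conn = psycopg2.connect(
            "host=db-primary.internal dbname=app user=app connect_timeout=2"  # hypothetical DSN
        )
        with conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery()")
            read_only = cur.fetchone()[0]
        conn.close()
        if read_only:
            # Still pointing at an unpromoted replica: reads work, writes won't.
            return jsonify(status="read-only"), 503
        return jsonify(status="ok")
    except psycopg2.OperationalError:
        # Can't reach the database at all: tell the load balancer to stop routing here.
        return jsonify(status="db-unreachable"), 503
```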
When they finally ran the experiment two weeks later, they knew what to expect and what to watch for. The failover worked. More importantly, they'd built shared understanding of how their system actually behaves.
Return on investment
The hypothesis conversation took 45 minutes. In that time, they discovered:
Critical gaps in monitoring
Incorrect connection pool configuration that would have caused extended downtime
Missing health check logic that would have routed traffic to a dead instance
Misaligned team understanding of basic system behavior
All before creating any risk, touching any systems, or needing any special tools.
Compare this to what usually happens: run the experiment, something unexpected occurs, scramble to understand what went wrong, argue about whether the system is working correctly or the experiment is flawed, end up more confused than when you started.
The hypothesis conversation is chaos engineering!
You're systematically exploring how your system behaves under failure. Sometimes that exploration happens through conversation. Sometimes through code review. Sometimes through documentation analysis. Sometimes through actual experiments.
But the learning doesn't wait for the experiment. It starts the moment you begin asking the right questions.
Start tomorrow
Here's your homework for this week:
Pick your scenario
Choose something specific that your team worries about. "What happens when [specific service] becomes unavailable during [specific condition]?"
Don't pick the scariest scenario. Pick something bounded and concrete. You're learning the practice, not stress-testing your team.
Get the right people in the room
You need diversity of perspective. Developers who wrote the code. Operators who run it in production. New team members who haven't absorbed all the assumptions yet. The architect who designed it. The SRE who monitors it. Include product managers too.
Six to eight people is ideal. Enough for diverse perspectives, small enough for real conversation.
Set the context
Before you start, frame it clearly:
"We're not testing anyone's knowledge. We're discovering where our mental models differ. There are no wrong answers. The goal is to find gaps in our shared understanding before they surprise us in production."
This framing matters. People need to feel safe admitting uncertainty.
Do the silent writing
Give people 5 minutes. Remind them to be specific. What exactly do you expect to happen? Not "it'll probably be fine" but "the circuit breaker will trip after three failed requests, fallback logic will return cached data, customers will see stale information for 30-60 seconds."
The specificity reveals the mental model.
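To see why that level of detail matters, here's a toy version of the expectation above. Everything in it is made up (the thresholds, the names, the 30-second retry window), but it's concrete enough to be proven wrong, which is exactly what you want from a hypothesis.

```python
import time

# Toy circuit breaker matching the written expectation: open after three
# consecutive failures, serve cached data while open, retry upstream later.
class CircuitBreaker:
    def __init__(self, failure_threshold=3, retry_after=30.0):
        self.failure_threshold = failure_threshold
        self.retry_after = retry_after      # seconds to wait before retrying upstream
        self.failures = 0
        self.opened_at = None

    def call(self, upstream, fallback):
        # While open and not yet due for a retry, serve the cached fallback.
        if self.opened_at is not None and time.time() - self.opened_at < self.retry_after:
            return fallback()
        try:
            result = upstream()
            self.failures, self.opened_at = 0, None   # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()           # trips after three failed requests
            return fallback()                          # customers see stale data

# Hypothetical usage:
# breaker.call(fetch_from_service, lambda: cache.get("last_known_good"))
```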
Share without judgment
Go around the room. Have each person read what they wrote.
Don't critique. Don't correct. Don't debate yet. Just listen and note where understanding differs.
Resist the urge to immediately resolve disagreements. First, just hear all the perspectives.
Investigate together
Pick the biggest gaps. The places where mental models differ most.
Dig into them as a group. Check the code. Look at configuration. Review documentation. Talk to other teams.
Find out what actually happens. Update your understanding together.
You'll learn more in 30 minutes than most chaos experiments reveal.
What's next
Try this with your team this week. Then hit reply and tell me:
What gaps did you discover? Were you surprised by anything? Did you fix something before running any experiment? What happened when you tried to investigate the gaps?
I read every reply. Your experiences help me understand what actually works in practice, not just in theory.
Next newsletter: "What to do after the hypothesis conversation". How to design experiments that actually test what you're uncertain about, not just what you already know.
Until then,
Adrian