What to do after the hypothesis conversation
Last time, I walked you through the hypothesis conversation: how to discover that your team has completely different mental models of how your system behaves, all before running any chaos experiment.
Several of you replied with stories. One team discovered that nobody knew whether their cache actually had an eviction policy. Another found out their "automatic" failover required some manual steps. A third learned that their monitoring would completely miss the failure mode they were worried about.
Good. That's exactly what should happen.
Now comes the harder question: what do you actually do with all these gaps you just discovered?
Most teams make one of two mistakes here. Either they panic and try to fix everything immediately, or they shrug and run the experiment anyway "just to see what happens."
Both are wrong. Let me show you a better path.
Three types of gaps
When you run a hypothesis conversation, you uncover three distinct types of gaps. Each requires a different response.
Type 1: Knowledge gaps
These are gaps in understanding where the answer exists somewhere. You just need to find it.
"Does the circuit breaker have a timeout?" "What's our connection pool size?" Someone knows. It's in the code or configuration. You just need to look it up.
What to do: Investigate before experimenting.
Don't run a chaos experiment to answer questions you can answer by reading code, checking configuration, or talking to the team that built the thing. That's using a sledgehammer when you need a magnifying glass.
Spend an hour investigating:
Read the relevant code
Check configuration files
Look at past incidents where this failed
Talk to the team that owns the component
Review any existing documentation
Update your mental models based on what you find. Then reconvene and share what you learned.
Half the time, this investigation reveals more gaps. "Oh, I thought Steve configured that, but he left six months ago and nobody knows if it's still set up that way."
Perfect. Now you know what you don't know.
Type 2: Uncertainty gaps
These are gaps where nobody knows the answer because the behavior emerges from interactions between components.
You can understand every component separately and still have no idea what happens when they all run together under specific conditions.
You checked the code. The connection pool has reconnection logic. The database has failover logic. The load balancer has health check logic. The application has retry logic. Each piece makes sense. The implementations look correct.
But when database failover happens during peak traffic while the cache is cold, what actually happens? Do the retry storms from 50 application instances create a thundering herd that prevents the database from recovering? Does the load balancer pull instances out of rotation before they stop retrying, or after? Do the health checks start passing before the connection pool is actually ready, sending traffic to instances that will just error?
Nobody knows for certain, and it's hard to reason about. You can't get the answer from reading the code, because it depends on timing, load, and how all of these components behave when they interact at once.
That's emergence. The behavior comes from the interaction, not from the individual components.
What to do: Design experiments specifically to explore these interactions.
This is what chaos engineering is actually good for. Testing things where behavior emerges from complexity, where understanding the parts doesn't tell you how the whole system behaves.
Before you design the experiment, get specific about what interaction you're uncertain about:
"We're uncertain how our retry logic interacts with database failover during high traffic. Each component looks correct in isolation, but we've never validated whether they work together without creating cascading problems. We need to know because retry storms could turn a 30-second database blip into a 90-minute outage."
Now you can design an experiment:
Inject database unavailability with realistic load
Watch how retry behavior scales across all instances
Monitor whether health checks and retries synchronize badly
Track whether the database can actually accept connections when it comes back
Observe the recovery timeline end-to-end
The experiment has a clear learning goal. You're testing how components interact under specific conditions, not whether individual components work.
You'll probably need to run this experiment under multiple conditions. Behavior that emerges from interaction often depends on load, timing, system state. What happens at 2 AM with low traffic might be completely different from what happens during peak hours.
Type 3: Design gaps
These are gaps where you discover something is missing or wrong in your system design.
"Wait, we don't have a circuit breaker at all? I thought we did."
"Our health check doesn't actually validate database connectivity, it just returns 200 OK?"
"There's no monitoring for connection pool state?"
These are real problems you just discovered. These aren't knowledge gaps. These aren't uncertainties.
What to do: Fix them before experimenting.
Don't run a chaos experiment to confirm that something you know is missing or broken is actually missing or broken. That's not learning, that's theater.
If you discover your health check is shallow when it should be deep, fix the health check. If you find out you're missing critical monitoring, add it. If the circuit breaker doesn't exist, decide whether you need one.
Some teams resist this. "But we want to see how bad it is!"
No. You don't learn from deliberately breaking things you already know are broken. You just create risk and waste limited resources. And your system will behave differently anyway once you address the missing or broken components.
Fix the known problems first. Then experiment to find the unknown problems.
The investigation phase
Let's say you ran your hypothesis conversation and discovered 15 different gaps. Don't immediately schedule 15 chaos experiments.
Instead, spend time investigating. Create a gaps document. List everything you discovered. Sort the list into the three buckets: knowledge gaps we need to investigate, uncertainties we need to experiment on, and design problems we need to fix.
Assign investigation work. Split the knowledge gaps among team members. Give people a week to investigate their assigned areas. This is building shared understanding before you create any risk.
Gather again. Have each person share what they learned. You'll find that investigating knowledge gaps often reveals more uncertainties or design problems. That's good. You're getting more precise about what you actually don't know.
Update your gaps document based on the investigation.
Prioritizing what to test
You've investigated the knowledge gaps. You've fixed the obvious design problems. Now you're left with genuine uncertainties about how components interact.
You probably can't test all of them immediately, because resources are limited. So prioritize.
Priority 1: High-impact, high-uncertainty
These are failure modes that would cause significant customer impact if they happened, and the behavior emerges from complex interactions you can't predict.
High stakes. Real uncertainty from emergence. Test this first.
Priority 2: High-impact, low-uncertainty
These are scary scenarios where you think you know how components interact, but the stakes are high enough that validation is worth it.
You probably won't learn much. But confirming that critical interactions work as expected builds confidence, and that's worth something.
Priority 3: Low-impact, high-uncertainty
These are things you're uncertain about but wouldn't cause major problems if the interactions failed.
"We're not sure how cache warming and background jobs interact during deployment, but worst case some requests are slower."
Interesting to know. Not urgent to test.
Priority 4: Low-impact, low-uncertainty
Don't test these at all. You're confident about how components interact and the impact is minimal.
Save your time and energy for experiments that actually teach you something important.
Designing your first experiment
Let's say you've prioritized and you're ready to design your first experiment. You've picked: "Test how retry logic and database failover interact during realistic load."
Here's how to design it:
Start with your learning goal
Be explicit: "We want to know how our application's retry logic interacts with database failover under realistic traffic load. Specifically, whether retries from multiple instances create problems for database recovery, and how long the system actually takes to return to normal."
Document the hypothesis clearly (before you start)
Start with your system properties:
"Each application instance maintains a connection pool of 20 connections. Connection timeout is set to 5 seconds. Under normal load, instances handle 50 requests/second. When a connection attempt fails, the pool marks that slot as failed and retries. Applications implement retry logic with 100ms base delay and 2x exponential backoff for a maximum of 4 retries (100ms, 200ms, 400ms, 800ms)."
Now you can make predictions:
"When the database becomes unavailable, each instance will experience connection failures at 5-second intervals. At 50 requests/second with 20 available connections, the pool will exhaust in approximately 0.4 seconds (20 connections / 50 requests/second). Applications will begin returning errors immediately. The retry sequence will complete in 1.5 seconds per request (100 + 200 + 400 + 800). After 4 failed retries, requests will return 500 errors to clients."
"When the database becomes available again, connection pools will attempt reconnection on the next incoming request. With staggered health checks running every 10 seconds across 12 instances, we expect the fleet to detect database availability within 10 seconds. Each instance will establish 20 new connections, taking approximately 100ms per connection (2 seconds total per instance). We expect 90% of traffic to succeed within 15 seconds of database recovery.”
This gives you specific measurements. You know what to instrument (pool exhaustion rate, retry timing, connection establishment time, error rates). You know what success looks like (90% traffic succeeding within 15 seconds). You can validate each number independently.
Specific. Measurable. Testable.
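The arithmetic behind those predictions is simple enough to write down and check before the experiment. Here's a throwaway sketch using only the numbers stated above; it says nothing about how your system actually behaves, it just keeps the math honest.

    // Sanity-check the predicted numbers from the hypothesis above.
    public class HypothesisMath {
        public static void main(String[] args) {
            int poolSize = 20;            // connections per instance
            int requestsPerSecond = 50;   // per instance, under normal load
            int instances = 12;

            double exhaustSeconds = (double) poolSize / requestsPerSecond;   // 0.4 s
            long retryBackoffMs = 100 + 200 + 400 + 800;                     // 1500 ms per request
            int detectionSeconds = 10;                                       // health check interval
            double reconnectSeconds = poolSize * 100 / 1000.0;               // 2.0 s per instance at 100 ms each

            System.out.printf("pool exhausts in %.1f s%n", exhaustSeconds);
            System.out.printf("retry sequence takes %d ms per request%n", retryBackoffMs);
            System.out.printf("detection within %d s, reconnection ~%.1f s per instance (%d instances)%n",
                    detectionSeconds, reconnectSeconds, instances);
        }
    }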
Plan what you'll observe
List the specific signals you want to watch: pool exhaustion, retry counts and timing, connection establishment time, error rates, and end-to-end recovery time. If you can't observe these things, stop. Add the missing observability first. Don't run experiments you can't interpret.
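As a rough sketch of what that means in practice, here are the kinds of counters you'd want wired up before the first run. Plain Java with made-up names; in reality you'd use whatever metrics library already feeds your dashboards.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.LongAdder;

    // Counters for the observables listed above: pool exhaustion, retry volume,
    // connection establishment time, and client-facing errors.
    class ExperimentObservations {
        final LongAdder poolExhausted = new LongAdder();       // checkout failed: no free connection
        final LongAdder retriesAttempted = new LongAdder();
        final LongAdder connectionsOpened = new LongAdder();
        final LongAdder connectMillisTotal = new LongAdder();  // time spent establishing connections
        final LongAdder clientErrors = new LongAdder();        // 500s returned to callers

        void startReporting() {
            ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
            reporter.scheduleAtFixedRate(() -> {
                long opened = connectionsOpened.sumThenReset();
                long connectMs = connectMillisTotal.sumThenReset();
                System.out.printf("exhaustions=%d retries=%d errors=%d avg_connect_ms=%d%n",
                        poolExhausted.sumThenReset(),
                        retriesAttempted.sumThenReset(),
                        clientErrors.sumThenReset(),
                        opened == 0 ? 0 : connectMs / opened);
            }, 1, 1, TimeUnit.SECONDS);
        }
    }

If any of these counters map to nothing in your current telemetry, that's the observability gap to close before you inject anything.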
Start small and safe
Do your first run in the development environment, limited blast radius, low traffic, manual injection.
Don't overcomplicate things. Don't start with production. Don't start with peak traffic. Don't start with automated injection. Build confidence progressively.
Run it and observe
Actually run the experiment. Watch what happens. Take notes. Don't try to fix things during the experiment. Just observe and learn.
Compare hypothesis to reality. What matched your expectations? What surprised you?
Write a detailed summary of the experiment, ideally using the same process you use for real incidents.
"The connection pool exhausted in 0.6 seconds, roughly matching the 0.4-second prediction. Applications returned 500 errors immediately. The retry sequence timing matched our implementation exactly: 1.5 seconds per request before final failure.
When the database came back online, the first instance detected it in 4 seconds through health checks. It started establishing connections. At 18 connections established, the instance crashed. Out of memory error.
We had overlooked connection cleanup. When the database went down, failed connections stayed in the pool marked as dead but not garbage collected. When the instance tried to establish 20 new connections, it actually had 20 dead connections plus 20 new ones. Memory usage spiked. The JVM killed the process.
The crash triggered our orchestration system to start a replacement instance. That instance came up, detected the healthy database, tried to establish connections, and crashed for the same reason. This happened to 8 instances before we caught it.
The remaining 4 instances stayed alive because they had been restarted recently for unrelated reasons. Their connection pools were clean. They successfully connected and started handling all traffic. Four instances serving the entire load meant each was now processing 150 requests/second instead of 50. Response times jumped to 800ms. Error rates climbed to 12% due to request timeouts.
We had to disable automatic instance replacement and manually restart each instance with connection pool cleanup logic added. Recovery took 23 minutes. The hypothesis predicted connection establishment time correctly. We never considered connection lifecycle management or the interaction between pool state and instance stability."
That's learning. The individual components worked correctly, but the interaction between them, at scale, under realistic conditions, created behavior that was hard to predict.
Now you know and have choices.
Option 1: Add explicit pool cleanup to your health check logic.
When the health check detects database unavailability, call pool.close() to force cleanup of all connections, then reinitialize the pool. This guarantees a clean state before attempting reconnection. The downside is a brief period where the instance can't serve requests during pool recreation, maybe 200-300ms. You need to ensure your load balancer can handle instances briefly going unhealthy.
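Here's what that could look like, assuming a HikariCP-style pool. The class name and the factory wiring are illustrative, and your framework's health check hook will differ.

    import com.zaxxer.hikari.HikariDataSource;

    import java.sql.Connection;
    import java.util.function.Supplier;

    // Option 1 sketch: if the health check can't reach the database, throw the whole
    // pool away and rebuild it, so reconnection always starts from a clean state.
    class RebuildOnFailureHealthCheck {
        private final Supplier<HikariDataSource> poolFactory;  // builds a fresh, fully configured pool
        private volatile HikariDataSource pool;

        RebuildOnFailureHealthCheck(Supplier<HikariDataSource> poolFactory) {
            this.poolFactory = poolFactory;
            this.pool = poolFactory.get();
        }

        boolean healthy() {
            try (Connection c = pool.getConnection()) {
                return c.isValid(2);       // 2-second validation timeout
            } catch (Exception e) {
                rebuildPool();             // brief unhealthy window while the pool is recreated
                return false;
            }
        }

        private synchronized void rebuildPool() {
            HikariDataSource old = pool;
            pool = poolFactory.get();      // fresh pool, no dead connections carried over
            old.close();                   // releases every connection held by the old pool
        }
    }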
Option 2: Configure connection max lifetime in the pool settings.
Set maxLifetime to something like 30 minutes. The pool will automatically evict and replace connections that exceed this age, regardless of their state. This prevents accumulation of dead connections over time. The tradeoff is ongoing connection churn during normal operation, which adds latency (probably 5-10ms per replaced connection). You also need to tune the lifetime value. Too short and you create unnecessary overhead. Too long and you don't solve the problem.
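In a HikariCP-style pool that's a one-line change; the property name and units differ in other pools, so treat this as a sketch rather than your config.

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    import java.util.concurrent.TimeUnit;

    // Option 2 sketch: cap connection age so dead connections can't accumulate indefinitely.
    class MaxLifetimePool {
        static HikariDataSource build(String jdbcUrl) {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl(jdbcUrl);
            config.setMaximumPoolSize(20);                              // matches the pool size above
            config.setConnectionTimeout(TimeUnit.SECONDS.toMillis(5));  // matches the 5-second connection timeout
            config.setMaxLifetime(TimeUnit.MINUTES.toMillis(30));       // evict and replace connections after 30 minutes
            return new HikariDataSource(config);
        }
    }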
Option 3: Implement connection validation on checkout.
Configure the pool to test connections before handing them to the application (testOnBorrow or equivalent). Dead connections get removed when the application requests them. This distributes cleanup across request processing rather than concentrating it at recovery time. The cost is added latency on every request, typically 1-5ms depending on your validation query. During the outage, you still accumulate dead connections, but they get cleaned up gradually as traffic arrives rather than all at once during reconnection.
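With an Apache Commons DBCP2-style pool, that looks roughly like the sketch below; other pools expose the same idea under different names.

    import org.apache.commons.dbcp2.BasicDataSource;

    // Option 3 sketch: validate connections on checkout so dead ones are discarded
    // gradually as traffic arrives, instead of all at once during recovery.
    class ValidatingPool {
        static BasicDataSource build(String jdbcUrl) {
            BasicDataSource pool = new BasicDataSource();
            pool.setUrl(jdbcUrl);
            pool.setMaxTotal(20);                  // matches the pool size above
            pool.setTestOnBorrow(true);            // test each connection before handing it out
            pool.setValidationQuery("SELECT 1");   // adds roughly 1-5ms per checkout
            return pool;
        }
    }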
You'll probably want to run the experiment again after you've chosen and implemented a fix. Emergent behavior depends on conditions, so validate the solution under the conditions that matter.
What's next
Last time I showed you how hypothesis conversations reveal what your team actually believes about your system. This time you learned what to do with those gaps: investigate the knowable, fix the broken, experiment on the emergent.
Most teams skip straight to breaking things. You now know better. The learning happens in the conversation, the investigation, and the careful observation of how components interact under real conditions. The chaos experiment is just one tool in that process.
Try this with your team. Start with the hypothesis conversation from last newsletter. Work through the investigation phase. Pick one uncertainty about emergent behavior and design an experiment around it.
Then tell me what you discovered.
Until then,
Adrian