Why We Still Suck at Resilience

Your organization does postmortems. Runs GameDays. Has invested in monitoring, chaos engineering, and operational readiness reviews. And yet the same types of incidents keep happening.

This book examines why. Drawing on two decades of experience in software engineering and a decade building resilience practices at AWS, Adrian Hornsby unpacks the organizational dynamics that quietly undermine even the most well-intentioned resilience programs. The tension between efficiency and resilience. The feedback loops that never close. The gap between how we think work happens and how it actually does.

What's inside

The book is organized around the recurring patterns that prevent organizations from becoming genuinely resilient. Each chapter builds on the last, moving from recognizing the problem to understanding the deeper dynamics at play.

The resilience paradox

Why organizations that invest heavily in resilience practices still experience the same failures, and what that tells us about how we've been approaching the problem.

Work-as-Imagined vs. Work-as-Done

The persistent gap between how leadership thinks work happens and what actually happens on the ground, and how this gap erodes resilience from within.

Enabling constraints

The difference between rules that control behavior and structures that create the conditions for adaptive, resilient responses.

Efficiency vs. resilience

The tension between optimizing for today's performance and preparing for tomorrow's surprises, and why it can never be fully resolved.

Broken feedback loops

How incident reviews, GameDays, and chaos experiments generate insights that never make it to the people and teams who need them most.

Adaptive capacity

Why resilience ultimately depends on people's ability to respond creatively to the unexpected, and how organizations systematically undermine this capacity.

Who this book is for

This book is written for people who are responsible for the reliability and resilience of software systems and the organizations that build them. People who have done the "right" things and are frustrated that it hasn't been enough.

VPs of Engineering and CTOs trying to understand why resilience investments aren't paying off
SRE and platform engineering leaders who see the same incident patterns repeating
Engineering managers caught between pressure to ship fast and pressure to be reliable
Chaos engineering and reliability practitioners who want to go beyond tooling
Anyone who has sat through a postmortem and thought "we've seen this before"

Stop having the same incidents

Get the book. If you want hands-on help, book a diagnostic call.