AI doesn't solve your problems. It moves them somewhere you can't see yet.

Estimated read time: 9 minutes


There's a seductive story about AI in operations that goes something like this: we have problems, let's deploy AI, it will fix them. Incidents will resolve faster, anomalies will get caught earlier, postmortems will draft themselves in minutes instead of hours, metrics will improve. I've been hearing this story recently and I don't doubt the promise, but improved metrics and solved problems are not the same thing. The problems don't go away when the metrics get better; they go somewhere else, take new forms, and show up in places nobody thought to look. The question is where they go, and what they look like once they get there.

I've been circling this question for a while. About a year ago I wrote about AI meta-operators and system responsibility, trying to work through what happens when AI agents start making operational decisions that used to require human judgment. More recently I wrote about chaos engineering for AI-generated code, arguing that the velocity and opacity of AI-assisted development demands systematic stress-testing because human review alone can't keep pace. In both cases I could feel the problems but I couldn't connect them, couldn't see why the same kinds of trouble kept showing up in different forms. In my book I argue that a common vocabulary is one of the most underrated tools in resilience work, because once you can name something you can discuss it, and once you can discuss it you can start to act on it. What I was missing was exactly that: a vocabulary for what AI does to the mess rather than just the mess itself.

***

David Woods has been mapping exactly that. He has been developing a heuristic he calls the Messy 9, which he recently premiered on the Fine Pod podcast and discussed in the Resilience in Software Foundation Slack channel. It's designed to bridge the science of how complex systems actually work and the practical need to do something about it. The setup is what Woods calls GCA: patterns over cycles of Growth, Complexification, and Adaptation. When new technological capabilities affect ongoing worlds of practice, processes of growth, complexification, and adaptation play out in lawful patterns, and stories of technology change should capture or envision the new forms of messiness that arise when apparent benefits get hijacked. The core message is that the messiness of the real world is conserved over attempts to improve systems, conserved in the formal sense described by the No Free Lunch and Robust Yet Fragile theorems: you don't eliminate messes, you move them.

Woods organises the recurring forms into nine patterns, grouped in threes: (1) congestion, cascades, and conflict; (2) saturation, lag, and friction; (3) tempos, surprises, and tangles. He describes them as a small set of generic keys you can use to unlock any episode of change to see how messes reappear, with much of the action living in the cross-connects and overlaps between them. Each points to processes that play out over time as systems grow, and each takes on unfamiliar forms when AI enters the picture.

Congestion, in Woods' framing, is what happens when a bunch of things are going on simultaneously and you have to deal with them all in the time available. Cascades are disturbances propagating across lines of interdependency, where one failure dumps load onto adjacent functions and the effects spread. Conflict is the question of who loses, who sacrifices, what gets prioritised when there's overload, what gets sacrificed first and what gets sacrificed later. These three are the most visible forms of messiness, and in traditional distributed systems we've built tooling to handle them: circuit breakers, bulkheads, load shedding, runbooks. But when AI is the operator, these patterns migrate into territory that existing tooling can't see.
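That existing tooling is worth making concrete, because it shows what "visible" messiness looks like. A circuit breaker, for instance, can be sketched in a few lines; this is a minimal illustration of the pattern, not tied to any particular library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after too many consecutive failures,
    calls are rejected outright until a cooldown elapses."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: shedding load")
            self.opened_at = None  # cooldown elapsed: allow a trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success resets the count
        return result
```

The point of the sketch is that the failure signal here is explicit and countable: an exception either happened or it didn't. That is precisely the property the AI-shaped versions of these patterns lack.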

Consider what cascading failure looks like when one model's output feeds another model's judgment, which triggers a third model's action. The failure propagates through reasoning rather than network calls and retry logic. A subtly wrong interpretation becomes a confident decision becomes an automated action, and the whole chain looks healthy from the outside because every component is performing exactly as designed. The cascade is there, running through inference rather than infrastructure, and it remains invisible until the consequences arrive in a form nobody anticipated.
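A toy sketch makes the shape of this visible. The three stages and their thresholds below are entirely invented for illustration; what matters is that every stage completes successfully and reports plausible confidence, so ordinary health checks see nothing, even though the first interpretation is subtly off:

```python
def interpret(metric_name, value):
    # Stage 1: misreads a benign deploy-time latency blip as saturation.
    if value > 200:
        return {"finding": "capacity_saturation", "confidence": 0.71}
    return {"finding": "nominal", "confidence": 0.95}

def decide(finding):
    # Stage 2: trusts the upstream finding at face value. Confidence
    # goes up, not down, because this rule is deterministic.
    if finding["finding"] == "capacity_saturation":
        return {"action": "scale_out", "confidence": 0.93}
    return {"action": "none", "confidence": 0.99}

def act(decision):
    # Stage 3: executes. From here the system state actually changes.
    return f"executed: {decision['action']}"

blip = interpret("p99_latency_ms", 240)  # transient blip, not saturation
print(act(decide(blip)))  # executed: scale_out
```

No stage raised an error, no retry fired, and a circuit breaker wrapped around any of these calls would stay closed. The cascade lives entirely in the content of the messages passed between stages.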

Saturation, in Woods' framework, is what happens when a system approaches the boundary where it runs out of capacity to deal with challenges, where different subsystems start dumping more overload onto other places and the saturation spreads. In traditional systems this means hitting a resource limit you can see, measure, and plan for. AI introduces a different kind: decision saturation, where AI handles enough operational decisions that the humans nominally overseeing it lose the ability to meaningfully evaluate what it's doing, simply because the volume and speed of AI-driven decisions exceed what human attention can track. The oversight saturates while the system hums along, and nobody notices because the dashboards still look green.
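The arithmetic of decision saturation is simple enough to sketch. The rates below are invented for illustration, but the shape is general: when the AI's decision tempo exceeds meaningful-review capacity, the unreviewed backlog grows linearly and oversight quietly saturates:

```python
ai_decisions_per_hour = 120    # assumed AI decision tempo
human_reviews_per_hour = 15    # assumed meaningful-review capacity

def unreviewed_backlog(hours):
    """Decisions that have happened but were never actually evaluated."""
    deficit_per_hour = ai_decisions_per_hour - human_reviews_per_hour
    return max(0, deficit_per_hour * hours)

for h in (1, 8, 40):
    print(f"after {h:>2} h: {unreviewed_backlog(h)} unreviewed decisions")
# After an 8-hour shift: 840 decisions nobody has looked at, every one
# of which individually "completed successfully" on the dashboards.
```

The backlog never shows up as a resource metric, because no resource is exhausted; what saturates is attention.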

Lag is what Woods describes as the pattern where organisations cut the resources they need to integrate new capabilities before they have actually integrated them, anticipating productivity gains and reducing experience, expertise, and people before the ostensible benefits have materialised. This is playing out everywhere right now: teams are being restructured around AI productivity assumptions while the actual work of figuring out where AI fits, where it doesn't, and what new failure modes it introduces is still in its earliest stages. The resources being cut are the very resources needed to discover whether the new capability actually works as advertised.

Friction undergoes an equally counterintuitive transformation. Woods frames friction as a necessary feature of bringing capabilities into practice, the offsetting costs and workload that arise when something new meets the complexity of the real world, and warns that if you underplay it, the things you try to deploy turn out not to work as well as you would like. The old friction in operations was obvious: manual processes, handoffs between teams, slow approval chains. AI removes it, which feels like progress, but friction served a function. It slowed things down enough for people to notice when something was off and created natural pause points where someone might say "wait, does this actually make sense?" Removing the friction removes the speed bumps that gave humans a chance to catch problems before they compounded, making the system faster and smoother while also making it more brittle in ways that only surface when speed and smoothness are exactly the wrong thing.

Tempos, the seventh pattern, describe what happens when different rates of change collide. In DevOps this is already familiar: the tempo of development influences operations, operations constrains development, and incident response introduces its own urgency that overrides both. AI adds new tempos that don't match existing ones: the speed at which models make decisions versus the speed at which humans can review them, the rate at which AI-driven changes accumulate versus the rate at which organisations can absorb their implications. These tempo mismatches create their own congestion as decisions pile up faster than the capacity to evaluate them.

Surprises, the eighth pattern, are not about rare events at the tail of a distribution. Woods insists that the dragons of surprise don't get weaker and more infrequent as systems improve; instead, systems generate new categories of surprise as they change. AI is particularly good at producing novel categories because it fails differently from humans. When a human operator makes a mistake, other humans can usually reconstruct the reasoning and see how someone under pressure with incomplete information made a wrong call. When AI makes a mistake the reasoning is opaque and nobody can explain why, which means nobody can confidently say it won't happen again in a different form, which means the organisation can't learn from it in the way it has always learned from human error.

The tangles might be the most troubling of all. Woods describes tangles as circular dependencies and strange loops, the kind of thing he first encountered in nuclear power plants where a critical function depended on an instantiation of itself. When multiple AI systems operate with overlapping domains, one monitoring infrastructure, another triaging incidents, a third managing capacity, they develop implicit dependencies that exist nowhere in any architecture diagram, learning to compensate for each other's behaviour in ways their operators never specified. These tangles are invisible during normal operations and surface during failure, when one system's unexpected behaviour cascades through compensation patterns that nobody knew existed, creating a debugging problem that is qualitatively different from anything human operators have encountered before. Woods gives a vivid example from the current AI gold rush itself: we need critical infrastructure to support AI computations, AI is being deployed to reduce the people who operate that infrastructure, and the AI doing the operating depends on the infrastructure it's supposed to be operating. The circular dependency is already there.
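Woods' example can be sketched as a dependency graph. The system names below are illustrative: the declared architecture is acyclic and the diagram looks clean, but adding the implicit edge (the AI's own inference runs on the infrastructure it operates) closes the strange loop, which a simple depth-first search can then find:

```python
declared = {
    "ai_operator": ["monitoring"],
    "monitoring": ["infrastructure"],
    "infrastructure": [],
}

implicit = {
    # the AI's inference itself runs on the infrastructure it operates
    "infrastructure": ["ai_operator"],
}

def find_cycle(graph):
    """Return one cycle as a list of nodes (first == last), or None."""
    visiting, done = set(), set()

    def dfs(node, path):
        visiting.add(node)
        for dep in graph.get(node, []):
            if dep in visiting:  # back edge: we found the loop
                return path[path.index(dep):] + [dep]
            if dep not in done:
                found = dfs(dep, path + [dep])
                if found:
                    return found
        visiting.discard(node)
        done.add(node)
        return None

    for start in graph:
        if start not in done:
            found = dfs(start, [start])
            if found:
                return found
    return None

# Merge declared and implicit edges into one graph.
merged = {n: list(deps) for n, deps in declared.items()}
for n, deps in implicit.items():
    merged.setdefault(n, []).extend(deps)

print(find_cycle(declared))  # None: the architecture diagram is clean
print(find_cycle(merged))    # the loop only exists once implicit edges are added
```

The hard part in practice is not the cycle detection, of course; it is that the implicit edges are never written down anywhere you could feed to a function like this.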

All of this converges on what Woods identifies as the key constraint: extra adaptive capacity is most needed when least affordable. AI adoption is a period of significant system change that demands more adaptive capacity, more ability to recognise and respond to novel situations, precisely because the system is in flux and the new failure modes haven't been mapped yet. AI adoption also consumes adaptive capacity, because organisations use it as a reason to reduce the human expertise that provides it. The need goes up while the supply goes down, and that gap is where the real risk lives, in a place that improved operational metrics will never show you.

None of this means AI is bad for operations; the improvements are real and sometimes substantial. The story that AI is going to solve your problems is almost certainly incomplete, because as Woods puts it, the Messy 9 exists to counter the tendency we all have to see whatever we develop as solving something instead of moving things and shifting processes. AI solves the problems you are measuring while generating new ones you haven't learned to see yet, and the messes migrate to places that your current instrumentation and your current organisational structure are not designed to detect. If you're deploying AI into your operations and your metrics are getting better, the question worth asking is where the mess went, because messiness is conserved over cycles of change. It takes new forms and it operates at new scales, and that puts a higher premium on exactly the experience, skill, and expertise needed to figure out how the system is working when it's not working the way you thought it was.

//Adrian
