When AI Writes Your Code, Chaos Engineering Writes Your Insurance Policy
AI-generated code has moved from curiosity to almost standard practice with breathtaking speed. Teams use AI to build entire features and even applications. Code generation tools translate requirements directly into pull requests. What took a developer days now takes minutes and a carefully crafted prompt. When done right, the AI productivity gains are real. Teams shipping AI-assisted code go faster.
The question facing engineering organizations today isn't whether to adopt these tools—developers already have—but how to understand and operate systems built with them safely.
The Acceleration of an Old Problem
Let’s be honest: code opacity didn't start with AI. Every engineering organization battles the same demons: developers leave, context evaporates, documentation rots, deadline pressure produces shortcuts. Six months after launch, someone gets paged at 3 AM to fix a system they didn't write and barely understand. This story is older than version control.
AI-generated code didn’t create this problem, but it accelerates it to speeds we've never seen before.
A developer writing code makes choices—explicit or implicit—about tradeoffs, edge cases, and failure modes. They might not always document these choices well, but they existed as conscious thoughts, at least momentarily. You can pull them into a meeting room and ask: "Why did you implement it this way?" They might not remember perfectly, but there's a conversation to have, a human memory to question.
AI-generated code compresses this process into statistical inference. The model chose an implementation based on patterns in its training data, optimizing for likelihood rather than your specific domain constraints. Three months later, when that code fails under unexpected load, there's no developer to ask. The model that generated it no longer exists in the same form—weights have shifted, training has evolved, the context window that held your requirements has long since evaporated. You're maintaining systems written by "intelligences" that can't yet fully be recalled for questioning.
The velocity compounds everything. When one developer writes problematic code, that's one problem to debug. When AI helps fifty developers write code twice as fast, you've potentially scaled both the productivity and the technical debt by orders of magnitude. The same old problems happen at unprecedented speed and volume.
Humans Are Still in the Loop
Let's be clear about what actually happens: AI doesn't drop code directly into production. A developer prompts the AI, reviews what it generates, modifies the output, integrates it with existing systems, has the AI write tests, and shepherds everything through code review. Humans remain deeply involved.
The challenge emerges from volume and velocity. When you're reviewing thirty AI-generated pull requests weekly instead of ten human-written ones, can you maintain the same focus and scrutiny? When the code looks correct—follows conventions, passes linters, handles obvious failure cases—how often do you catch the subtle problems that only emerge under production conditions the AI never considered?
Human review remains essential, but it faces scaling limits. We need systematic methods for stress-testing AI-generated code that complement human judgment. This is where chaos engineering can help.
Why Chaos Engineering Fits This Moment
Traditional code review asks: "Does this code do what it claims?" Chaos engineering asks: "What does this code do when everything goes wrong?" That second question becomes critical when you're shipping code whose internal logic you inherited rather than invented.
Run a chaos experiment that injects latency into downstream services. Watch which components start failing in ways nobody predicted. Maybe the AI-generated service client retries aggressively without backoff, turning a minor slowdown into a cascading failure. Code review might have caught this if the reviewer specifically looked for retry logic and thought to question its implementation. Chaos engineering catches it by creating the conditions where the problem reveals itself.
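To make that concrete, here is a minimal, self-contained Python sketch of the pattern such an experiment exposes; the client and downstream call are hypothetical stand-ins, and in a real experiment the latency would be injected at the network or proxy layer rather than in code.

```python
import time

CALL_COUNT = 0  # how many downstream calls one logical request ends up making

def downstream_call(injected_latency_s: float):
    """Hypothetical downstream dependency; the experiment injects latency until it times out."""
    global CALL_COUNT
    CALL_COUNT += 1
    time.sleep(injected_latency_s)
    raise TimeoutError("downstream call exceeded its deadline")

def aggressive_client(max_attempts: int = 20):
    """The kind of retry loop that slips through review: no backoff, no jitter, no budget."""
    for _ in range(max_attempts):
        try:
            return downstream_call(injected_latency_s=0.05)
        except TimeoutError:
            continue  # retry immediately, multiplying load on the already struggling service
    raise RuntimeError("gave up after hammering the dependency")

# The "experiment": add a small amount of latency and watch the call volume explode.
try:
    aggressive_client()
except RuntimeError:
    pass
print(f"One logical request produced {CALL_COUNT} downstream calls")
```

Under injected latency, one request fans out into twenty, which is exactly the amplification that turns a minor slowdown into a cascading failure.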
The difference becomes clearer over time. You discover patterns: this model's code tends to assume infinite memory; that model's error handling releases resources inconsistently. These insights feed directly into how you prompt AI going forward, what you hunt for in code review, and where you concentrate your testing.
How Chaos Engineering Must Evolve
Traditional chaos engineering assumed code written by humans, humans you could ask questions of. That assumption breaks down when substantial portions of your codebase emerge from statistical models.
Automated Knowledge Capture
Every chaos experiment that reveals unexpected behavior should generate structured documentation automatically. When an experiment discovers that an AI-generated service degrades catastrophically under certain conditions, the experiment should produce:
- Structured description of the failure mode(s)
- Metric thresholds that indicate the problem is emerging
- Potential mitigation steps based on observed and learned behavior
- Links to specific code exhibiting the issue
Engineers review and refine these auto-generated artifacts rather than writing them from scratch. This automation matters because AI code generation produces more components faster than humans can document manually. The documentation gap will become a problem without systematic capture.
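One possible shape for such an artifact, sketched here as a Python dataclass; the field names and example values are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ExperimentFinding:
    """Auto-generated record of an unexpected behavior surfaced by a chaos experiment."""
    experiment_id: str
    failure_mode: str                # structured description of what degraded and how
    trigger_conditions: str          # the injected fault that revealed the behavior
    metric_thresholds: dict = field(default_factory=dict)   # signals that the problem is emerging
    suggested_mitigations: list = field(default_factory=list)
    code_references: list = field(default_factory=list)     # links to the code exhibiting the issue

# Illustrative example; engineers review and refine records like this rather than
# writing them from scratch.
finding = ExperimentFinding(
    experiment_id="latency-injection-checkout-001",
    failure_mode="checkout service exhausts its connection pool under added dependency latency",
    trigger_conditions="300 ms of latency injected on calls to the payments dependency",
    metric_thresholds={"connection_pool_utilization": 0.9, "p99_latency_ms": 1200},
    suggested_mitigations=["bound retry attempts", "add exponential backoff with jitter"],
    code_references=["services/checkout/clients/payments_client.py"],
)
print(json.dumps(asdict(finding), indent=2))
```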
Feedback Loops into Code Generation
Here's where things get interesting. Your chaos experiments discover that AI-generated clients consistently lack good retry logic. This gap appears across multiple services. You document it, but more importantly, you can now update your prompts to account for this shortfall: "Implement retry logic with an exponential backoff and jitter pattern, with these specific thresholds […]"
AI-generated code improves because chaos engineering taught you what to demand. You can feed experiment findings directly into prompt libraries—an identified gap becomes a constraint in every subsequent request.
With such feedback loops, you can continuously update your prompts based on chaos experiment findings, shortening the critical learning cycle.
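For illustration, here is a minimal sketch of the retry shape such a prompt asks for; the attempt counts and delays are placeholder values to tune per service, not recommendations.

```python
import random
import time

def call_with_retry(operation, max_attempts=5, base_delay_s=0.1, max_delay_s=5.0):
    """Retry with exponential backoff and full jitter; thresholds are placeholders."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # out of budget: surface the failure instead of hammering the dependency
            # Exponential backoff capped at max_delay_s, with full jitter so many
            # clients recovering at once don't retry in lockstep.
            delay = min(max_delay_s, base_delay_s * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))

# Example usage with a hypothetical client call:
# order = call_with_retry(lambda: payments_client.charge(order_id))
```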
Pattern Detection at Scale
AI code generation creates an unusual opportunity: many components share similar origins. They emerged from similar models, responding to similar prompts, applying similar patterns from training data. Studies about code generation errors often identify recurring patterns or clusters of failures specific to certain models.
Chaos engineering tools can exploit this and systematically search for patterns common to specific AI models.
When you find one of these patterns, hunt for it everywhere across all AI-generated components simultaneously, discovering not just that Service A has a problem, but that twelve services share variants of the same flaw because they were all generated using similar models.
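Sketched in Python, that fleet-wide hunt might look like the following; `run_experiment` and the service list are placeholders for your own chaos tooling and service inventory.

```python
# Replay one known failure scenario against every AI-generated component and
# aggregate which services exhibit the same flaw. Names here are placeholders.

AI_GENERATED_SERVICES = ["checkout", "billing", "inventory", "notifications"]
SCENARIO = "dependency-latency-300ms"

def run_experiment(service: str, scenario: str) -> bool:
    """Placeholder: run the scenario via your chaos tooling; return True if the service degraded."""
    raise NotImplementedError("wire this to your chaos tooling")

def scan_fleet() -> list[str]:
    affected = [svc for svc in AI_GENERATED_SERVICES if run_experiment(svc, SCENARIO)]
    print(f"{len(affected)}/{len(AI_GENERATED_SERVICES)} services share this failure mode: {affected}")
    return affected
```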
Continuous Experimentation
AI enables shipping dozens of features weekly, each containing substantial generated code. Chaos experiments need to keep pace. However, embedding chaos experiments directly into CI/CD pipelines creates significant problems—the non-deterministic nature of chaos experiments conflicts with the deterministic requirements of deployment pipelines, and experiments require production-like load and extended runtimes. The solution is a separate, dedicated chaos pipeline running parallel to your CI/CD pipeline, allowing experiments to operate on their own schedule without blocking deployments while still feeding findings back into development practices.
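One rough sketch of that split, with placeholder hooks standing in for real tooling: the chaos pipeline works through its own queue on its own clock, and nothing in CI/CD ever waits on it.

```python
import time

EXPERIMENT_QUEUE = ["dependency-latency", "pod-kill", "disk-pressure"]  # illustrative scenarios
BATCH_INTERVAL_S = 6 * 60 * 60  # e.g. one batch every six hours, independent of deployments

def execute_experiment(scenario: str):
    """Placeholder: run the scenario against a production-like environment."""
    return None  # return a finding when something unexpected happens

def publish_finding(finding):
    """Placeholder: route findings back into review, runbooks, and prompt libraries."""

def chaos_pipeline():
    """Runs beside CI/CD, never inside it: deployments are never blocked on experiments."""
    while True:
        for scenario in EXPERIMENT_QUEUE:
            finding = execute_experiment(scenario)
            if finding is not None:
                publish_finding(finding)
        time.sleep(BATCH_INTERVAL_S)
```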
Note: For teams seeking deterministic validation in their CI/CD pipelines, deterministic simulators offer a middle ground. Tools like Antithesis model distributed system behavior deterministically, allowing exploration of failure scenarios with reproducible results. While they require significant investment to build and maintain, and can't capture all real-world complexities, they provide faster feedback than full chaos experiments while being more comprehensive than mocked tests. They work well in pipelines but should complement, not replace, chaos experiments against production-like environments.
The AI Testing AI Question
Could we use AI to design chaos experiments for AI-generated code? The idea is indeed seductive: provide an AI with code it generated previously plus observability data, then ask it to design experiments that would catch emerging issues.
This approach shows promise in my own experiments. With the right prompt, AI-designed chaos experiments sometimes target edge cases humans overlook, precisely because the AI recognizes patterns in generated code that humans don't consciously notice.
But we should still remain skeptical and in the loop. Both AIs operate from similar statistical foundations. They share similar blind spots—the AI experiment designer often misses the same edge cases the code generator misses, for exactly the same reasons. An AI trained primarily on historical data still struggles to imagine creative, realistic failure scenarios that don't appear commonly in training data.
As with other AI-generated artifacts, the key seems to be using AI to generate candidate experiments while requiring human engineers to review, refine, and prioritize them. AI handles the breadth—proposing dozens of scenarios quickly. Humans handle the depth—identifying which scenarios matter most for your specific systems and constraints.
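In code, that division of labor can be as small as the sketch below; `propose_experiments` stands in for whatever model integration you use, and the triage step is the human part that should not be automated away.

```python
# AI handles breadth: propose many candidate scenarios quickly (placeholder below).
# Humans handle depth: decide which ones matter for this system, and in what order.

def propose_experiments(code_summary: str, observability_notes: str) -> list[str]:
    """Placeholder for a model call; returns illustrative candidates."""
    return [
        "inject 300 ms latency on the payments dependency",
        "kill one replica during a rolling deploy",
        "return malformed JSON from the inventory API",
    ]

def human_triage(candidates: list[str]) -> list[str]:
    """Engineers pick and order the scenarios worth running."""
    for i, scenario in enumerate(candidates):
        print(f"[{i}] {scenario}")
    keep = input("Indices to run, in priority order (e.g. 2,0): ")
    return [candidates[int(i)] for i in keep.split(",") if i.strip()]
```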
We're learning by doing, and the lessons keep changing.
Final Thoughts
You simply can't afford to opt out of the race while competitors embrace AI-assisted development. The productivity gains are too substantial. Teams using these tools effectively ship features faster. As with most innovations, organizations that hesitate eventually lose ground against competitors who've figured out how to capture the upside while managing the downside.
The choice facing engineering leaders isn't "AI-generated code: yes or no?" Market forces already made that decision. The real choice is: "How do we operate systems built with AI assistance safely and sustainably?"
Chaos engineering offers a practical answer. By systematically exploring how systems fail, you build the operational understanding that rapid AI-assisted development threatens to erode. You discover hidden dependencies before they cause outages. You document failure modes before customers encounter them. You create feedback loops that improve both your code generation practices and your operational capabilities.
The machines write the code. Chaos engineering helps us understand what we've built and guides what we build next.