The Quiet Erosion: How Organizations Drift Into Failure
The Slack notification arrived at 3:17 AM on a Tuesday: "Payment system down. All hands."
Six hours later, as the team sat exhausted around the incident room table, the CTO asked the question we hear after every major outage:
"How did we get here?"
The post-mortem timeline showed no single catastrophic decision. Instead, it revealed dozens of small compromises, each one reasonable at the time:
"We only reduced test coverage for non-critical features."
"We temporarily bypassed code review for this urgent fix."
"We'll address that performance issue next sprint."
Every decision made perfect sense in isolation. Yet somehow, these incremental changes had accumulated into a system operating at the edge of failure, where one memory leak during peak traffic brought down their entire payment infrastructure.
This is the story of organizational drift, one of the most dangerous and least understood risks facing software teams. It is so hard to see because it happens slowly and quietly, through rational adaptations to competitive pressure and operational realities.
What makes drift particularly challenging is recognizing it before it's too late.
To illustrate how drift happens, I've created the story of TrendCart, a fictional e-commerce platform that experiences this exact pattern. While the company and characters are entirely fictional, the problems they experience are real. They are based on my own experiences and observations.
Disclaimer: TrendCart and all characters in this story are fictional. Any resemblance to real companies, people, or events is purely coincidental. This narrative is created solely to illustrate common organizational patterns and does not reference any actual organization.
The Story of TrendCart's Gradual Decline
Maya joined TrendCart as Lead Developer when the e-commerce platform was still celebrating its Series A funding. With 50,000 monthly active users and a reputation for reliability, TrendCart had carved out a niche for independent fashion designers.
"Our philosophy is simple," explained Raj, the CTO, during the onboarding process. "We deploy twice monthly, after full testing. Every commit gets two code reviews, unit and integration tests, and a security review. We take our customers' trust seriously." Maya was impressed by the clear and disciplined development process. Thorough yet not bureaucratic, with clear guidelines that developers consistently followed.
The Pressure Mounts
The first cracks appeared during a routine customer advisory call. Designer after designer asked about features that TrendCart didn't have, features that their competitor, CompeteCart, had already launched.
"When will you have social login?"
"Why can't I bulk upload products?"
"CompeteCart's analytics show me exactly which products are trending in real-time."
Maya watched the sales team's faces during these calls. Each missing feature represented lost revenue. By week's end, they were maintaining a "feature gap spreadsheet" that grew longer daily.
The emergency board meeting happened on a Friday. Maya wasn't invited, but the tension was palpable when leadership came out of the room.
"Eighteen months," the CEO announced to the engineering team. "We have eighteen months of runway. If we don't close the feature gap and accelerate growth, there won't be a TrendCart to run."
It was about survival.
The Reasonable Compromise
Raj called an all-hands engineering meeting. "We need to move faster without breaking what works," he began. "I'm proposing we shift to weekly deployments by streamlining our processes."
The plan seemed logical: keep thorough testing for payment processing and user data, reduce the scope of testing for "non-critical" features, maintain code review but expedite it for urgent fixes, and automate security checks where possible.
Maya raised her hand. "What is 'non-critical'?"
"Features that don't directly impact transactions or user data," Raj replied. "UI elements, analytics dashboards, recommendation engines, all important, but not customer-facing if they break."
Weekly deployments meant faster iteration, quicker feature delivery, and better competitive positioning. The compromise felt reasonable. Maintain safety for core functions while accelerating everything else.
The leadership team and the developers were happy with the faster pace.
For three months, it worked great.
Early Warning Signs Most Teams Miss
Six months later, Maya noticed troubling patterns. Deployment rollbacks had increased from once every few months to twice in the last month. The on-call rotation was getting paged more frequently for "minor" issues that developers would fix with quick patches rather than proper investigations.
Test coverage showed 71%, but Maya knew the number lied. Developers had learned to game the metrics: tests that verified mocks instead of actual functionality, integration tests marked as "flaky" and skipped in continuous integration (CI), and business logic validated only through happy-path scenarios.
"Look at this payment processing change," Maya showed her teammate Alex during a code review. "The tests pass, but they're mocking the actual payment gateway. We're testing that our mocks work, not that payments work."
During the monthly retrospective, Maya presented her concerns. "Yesterday's deploy broke user avatars for three hours. We only discovered it because a customer tweeted a screenshot. We're accumulating technical debt and we're getting comfortable with it.
"These issues aren't critical individually," she continued, "but they're creating systematic brittleness."
The product manager's response was immediate: "Every platform has technical debt. Our KPIs look excellent. Look, cart abandonment is down 15%, and transaction volume is up 23% this quarter. Let's not slow momentum over edge cases."
The team moved on.
The First Incident
Black Friday weekend. Traffic peaked at 3x normal levels. At 2:47 PM, the entire platform went down.
Thirty-seven minutes of complete outage during the busiest shopping period of the year.
$127,000 in lost revenue, nearly 3% of their yearly revenue, plus 3,000 abandoned carts and 12 customer complaints. The numbers didn’t look good.
But Maya knew the real cost was likely higher.
The investigation revealed a memory leak that had been flagged during sprint planning three months earlier. Categorized as "performance optimization," it lived in the ever-growing backlog of "technical debt to address later." Under normal load, it was invisible. During peak traffic, it consumed all the available memory, causing the application servers to crash.
The post-mortem focused on immediate fixes, including memory monitoring alerts, load balancing improvements, and automated restart procedures.
"We'll review the performance backlog next quarter," Raj concluded.
Within two weeks, the performance issues were again labeled "not customer-impacting" and deprioritized for feature work.
Maya started noticing a pattern in the incident reports. Each one ended with "this specific issue has been resolved," rather than addressing the category of problems that made this possible.
The Normalization of Deviance Pattern
A month later, a more serious incident happened: a privacy incident. Customer addresses were exposed, and the bug had merged customer profiles, so orders from one customer appeared in another's account history. It took six hours to fully identify the scope because logging wasn't detailed enough to trace which accounts had been affected.
The investigation revealed a cascade of small failures:
A developer had pushed directly to main to "quickly fix" a merge conflict.
The automated tests passed because they didn't test cross-user data isolation.
The staging environment didn't have enough data volume to reproduce the issue.
The code review was marked as "approved," but it clearly hadn't been thorough: none of the comments had been addressed. Instead, the change was merged with the message "Added a few comments, but approving to make the feature release date. Cutting a ticket to the backlog."
Maya expected this to be a wake-up call. Instead, the incident review followed a familiar pattern: "The root cause was the direct push to main, again. We'll fix that by blocking direct pushes altogether." The conversation never addressed why the developer felt pressure to bypass the process.
How Small Compromises Compound
Maya, feeling things were getting out of control, started mapping their journey. Using incident reports, deployment metrics, and her own observations, she created a timeline showing TrendCart's drift from disciplined engineering to normalized risk-taking.
She highlighted each small decision that, while logical in isolation, had collectively eroded their safeguards.
When the fourth incident occurred, a billing bug that undercharged customers for three weeks, Maya presented her diagram to the executive team.
"We didn't make a single decision to be unsafe," she explained. "We made hundreds of small trade-offs, each seeming reasonable at the time. But look where we've ended up."
She traced their path across the boundaries. "Here's where we first reduced the testing scope. Here's where we started accepting pull requests with failing tests if they weren't in 'critical paths.' Here's where we stopped requiring security reviews for changes to the user data models."
Recognizing Drift Before It's Too Late
Maya's presentation to executives didn't immediately change everything. The team asked good questions, but the first response was predictable: "We need better tooling to prevent these specific issues."
Then Maya showed them data from their own customer support tickets. The volume of "weird bugs" had tripled over the past six months. Customers were reporting issues that individually seemed minor but collectively painted a picture of a platform becoming unreliable.
"Our Net Promoter Score has dropped eight points," the customer success manager added. "Customers are saying we 'used to be reliable' but now they're not sure they can trust us with their business."
"When did we leave the safe zone?" the CEO asked.
"About seven months ago," Maya replied. "When we started treating the tests as a suggestion rather than a requirement."
"And the red zone?"
"We've been operating in the danger zone since we decided that certain types of bugs were 'acceptable business risks' rather than issues to be fixed. That normalizing of problems is what concerns me most."
The Recognition
The executive team finally understood they weren't looking at isolated incidents requiring specific fixes. They were seeing symptoms of systematic drift from engineering discipline toward normalized risk-taking.
What had felt like necessary adaptations to competitive pressure had gradually transformed into dangerous corner-cutting. Speed and agility had become excuses for compromising foundational practices.
But Maya proposed a solution that surprised them.
"We don't need to slow down or revert to monthly deployments," she explained. "We need to make drift visible and intentional rather than invisible and accidental."
The changes that followed weren't dramatic. They didn't slow down development or return to monthly deployments. Instead, they implemented what Maya called "intentional friction": small barriers designed to make unsafe shortcuts visible and require conscious decisions, rather than letting things drift.
They established "guardrail reviews", quarterly assessments specifically designed to call out drift in their development practices. They created clear, non-negotiable boundaries for security, testing, and code quality.
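As a rough sketch of what "intentional friction" can look like in a pipeline (this is not TrendCart's actual tooling; the environment variable names and the adaptations.log file are assumptions), the idea is that shortcuts stay possible, but only as explicit, recorded decisions that feed the next guardrail review:

```python
#!/usr/bin/env python3
"""Minimal "intentional friction" gate: a pre-merge step that allows shortcuts
only when they are made explicit and recorded (illustrative sketch)."""
import json
import os
import sys
from datetime import datetime, timezone

REQUIRED_CHECKS = ["unit_tests", "integration_tests", "security_scan", "code_review"]


def main() -> int:
    # Assumes CI exports which checks passed, e.g. CHECK_RESULTS='{"unit_tests": true, ...}'
    results = json.loads(os.environ.get("CHECK_RESULTS", "{}"))
    missing = [check for check in REQUIRED_CHECKS if not results.get(check)]
    if not missing:
        return 0  # all guardrails satisfied; merge proceeds quietly

    reason = os.environ.get("GUARDRAIL_OVERRIDE_REASON", "").strip()
    approver = os.environ.get("GUARDRAIL_OVERRIDE_APPROVER", "").strip()
    if not reason or not approver:
        print(f"Blocked: {missing} not satisfied and no explicit override provided.")
        print("Set GUARDRAIL_OVERRIDE_REASON and GUARDRAIL_OVERRIDE_APPROVER to proceed.")
        return 1

    # The shortcut is allowed, but it becomes a visible, reviewable record
    # instead of an invisible adaptation.
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "skipped_checks": missing,
        "reason": reason,
        "approver": approver,
    }
    with open("adaptations.log", "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    print(f"Override recorded for the next guardrail review: {record}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

The friction is deliberately small: one extra decision and one log line, but it turns silent drift into data.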
[Figure: Timeline showing TrendCart's organizational drift from safe practices through warning signs to major incidents and recovery]
The Adaptation Awareness Framework
Most importantly, they created a framework that distinguished between two types of changes:
Reactive adaptation: Responding to immediate pressures (what led to drift)
Proactive adaptation: Anticipating systemic risks (what Maya demonstrated)
The framework didn't restrict adaptation. Instead, it encouraged more Maya-style adaptation while making pressure-driven adaptations visible for organizational learning.
Maya had demonstrated exactly the adaptive capacity every organization needs: the ability to sense emerging risks, connect patterns across incidents, and proactively adjust the system's safety boundaries.
The goal was to cultivate more people who could adapt like Maya, not prevent adaptation.
The framework included:
Pattern Recognition: Regular analysis of when, why, and how teams adapted standard processes
Systemic Learning: Understanding what adaptations reveal about underlying system pressures
Adaptive Capacity Building: Strengthening the organization's ability to handle unexpected situations safely
Continuous Sense-Making: Using adaptations as data to improve the system rather than just controlling behavior
The Recovery
The transformation wasn't immediate, but it was measurable. Within a few months, the team had restored test coverage to a healthy 83% through systematic debt reduction, reduced deployment rollbacks by 60% through improved quality gates, and practically eliminated "hot fix" pushes to production. Customer tickets and on-call operations began improving as well.
"The key insight," Maya explained during an all-hands meeting, "wasn't choosing between speed and safety. It really was choosing between intentional trade-offs and invisible drift."
Epilogue
A few months later, TrendCart's main competitor suffered a major data breach that affected all of its customer records. The incident, caused by accumulated security compromises, forced the company to rebuild its platform and resulted in severe fines and extensive negative publicity.
Designers who had initially left TrendCart for CompeteCart's faster feature delivery began returning. They cited reliability and trust as primary factors.
The breach proved so costly that CompeteCart ultimately shut down its service.
"The irony," Maya reflected during a conference presentation about their journey, "is that we thought we were making necessary trade-offs to remain competitive. But by recognizing and managing our drift, we ended up with both speed and safety, which turned out to be the real competitive advantage all along."
Maya had adapted, too. When she first noticed the test quality issues, she could have focused solely on her own code and remained quiet. Instead, she adapted from individual contributor to system advocate.
That's what resilient organizations need: people who adapt their perspective from local optimization to system health. By encouraging this kind of positive adaptation, organizations actually become more adaptable as a whole.
A few months later, Maya was promoted to principal engineer.
The Cost of Invisible Drift
Why am I telling this story? Organizational failure rarely announces itself with dramatic warning signs. Instead, it accumulates through thousands of small compromises, each reasonable in isolation, lethal in combination.
The timeline above shows what most post-mortems forget to mention: the real "root cause" is rarely the final incident's immediate trigger, but the slow erosion of safety boundaries that made that incident inevitable. The memory leak, the privacy breach, and the billing bug weren't separate problems requiring separate solutions. They were symptoms of a system that had gradually drifted from good engineering practices toward normalized risk-taking.
The most insidious aspect of drift is how it makes teams complicit in their own degradation. When test coverage drops over time, no single day feels dangerous. When minor incidents become routine, each one seems manageable. When shortcuts become standard practice, organizations believe they're adapting to business realities rather than compromising their foundation.
Why Traditional Approaches Fail
Most organizations fail to detect drift because they focus on symptoms rather than causes. They detect problems after drift has already occurred, rather than preventing drift from accumulating in the first place.
Drift happens in the space between formal processes and daily reality.
It's the accumulation of small adaptations that individually seem reasonable but collectively undermine system safety.
Intentional Trade-offs
Every organization needs people who can adapt the way Maya did: sensing patterns, surfacing concerns, and strengthening the system's capacity to handle surprises.
Speed and safety aren't mutually exclusive, but they require constant vigilance to maintain together. The goal isn't to eliminate all risk or prevent all adaptations. Instead, it's making risk visible and consciously managed while building the organization's capacity to adapt safely.
To successfully balance speed and safety, you need to:
Track leading indicators of drift, not just incident outcomes
Treat process adaptations as valuable learning data, not violations to prevent
Celebrate teams that surface systemic issues early, not just those that respond to incidents quickly
Build adaptive capacity—the ability to handle unexpected situations safely
The Tale of Two Adaptations
This story reveals something important that I almost overlooked when writing this post: TrendCart experienced two completely different types of adaptation.
Drift-Inducing Adaptations (what the team did):
Individual responses to immediate pressure
Invisible to the broader organization
Focused on local optimization
Accumulated without systemic awareness
Resilience-Building Adaptations (what Maya demonstrated):
Proactive pattern recognition across the system
Made concerns visible to leadership
Focused on long-term system health
Enhanced organizational learning capacity
Next Steps
The next time someone asks, "How did we get here?" after a major incident, remember that the answer probably isn't in the immediate timeline. It's in the months or years of small compromises that made that moment inevitable.
Ask yourself:
What processes has your team adapted or "streamlined" in the past year?
How many of your quality metrics could be gamed without affecting actual quality?
When did you last review the cumulative impact of individual process adaptations?
What would your customers say about your reliability compared to six months ago?
Then take action:
This week: Implement basic drift tracking for your most critical processes
This month: Conduct a retrospective focused on accumulated technical debt and process adaptations
This quarter: Establish guardrail reviews and leading indicator monitoring
This year: Build systematic adaptation awareness into your organizational culture
Note: See the Implementation Toolkit below for a more comprehensive set of actions
Transformation is available to every team willing to choose conscious trade-offs over invisible drift.
The question isn't whether your organization is making compromises; every organization does. The question is whether you're making them intentionally, with full awareness of their cumulative effect, or allowing them to accumulate invisibly until the next major incident.
Choose intentional drift detection. Choose systematic quality practices. Choose competitive advantage through reliability.
Your customers, your team, and your future self will thank you.
Implementation Toolkit
Note: This toolkit represents my current practices, informed by research and field experience; however, organizational resilience is deeply contextual. I welcome suggestions, corrections, and additional strategies based on your own experiences. Please reach out with improvements or examples of what has worked (or hasn't worked) in your organization.
The Anatomy of Organizational Drift
There are typically four drift phases that an organization experiences:
Phase 1: The “Reasonable” Compromise
External pressure creates the need for adaptation
Leadership proposes logical process changes
Team maintains core safety practices while "optimizing" peripheral ones
Early metrics show positive results
Gap emerges between work-as-imagined (official processes) and work-as-done (actual practice)
Warning Signs:
Increased deployment frequency without proportional quality investment
Redefinition of "critical" vs "non-critical" systems
Process adaptations that become regular patterns
Phase 2: Metric Gaming
Teams adapt to new incentives by optimizing numbers rather than outcomes
Quality indicators lose correlation with actual quality
Small issues accumulate but remain below the visibility threshold
Teams engage in "satisficing" behavior, doing just enough to meet targets rather than achieve actual quality
Warning Signs:
Test coverage numbers hold steady while actual test quality deteriorates
Increased incidents that get “quick-fixed” rather than properly investigated
Growing backlog of "technical debt to address later"
Phase 3: Normalization
Incidents become less exceptional
Each problem gets treated in isolation rather than as part of a pattern
Team culture shifts from "how do we prevent this?" to "how do we fix it fast?"
Organization loses "chronic unease", the healthy skepticism about system safety that prevents complacency
Warning Signs:
Post-mortems focus on specific fixes rather than systemic improvements
Increasing acceptance of "edge cases" and "known issues"
Customer complaints about reliability start appearing
Phase 4: Crisis or Recovery
Accumulated risk manifests as a major incident or competitive threat
Organization either recognizes the pattern and implements systematic changes
Or continues to drift until catastrophic failure forces dramatic restructuring
Important note: Even successful recovery requires ongoing vigilance, as drift is a continuous process that can restart at any time
This framework draws from Sidney Dekker's concepts of "drift into failure" and "work-as-imagined vs work-as-done," Diane Vaughan's "normalization of deviance," and supporting research in organizational safety by James Reason, Charles Perrow, and Karl Weick.
Building Drift-Aware Organizations
Here are strategies for making drift visible before it becomes dangerous:
Leading Indicators to Monitor
Quality Metrics
Test coverage trends, not just absolute numbers (a minimal tracking sketch follows this list)
Test execution time and flakiness rates
Code review rejection rates and bypass frequency
Deployment rollback frequency and time-to-restore
Work-as-done vs work-as-imagined gaps (actual vs documented processes)
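Here is a minimal sketch of tracking two of these indicators as trends rather than snapshots. The weekly data shape and the thresholds are assumptions to adapt to whatever your CI system and deployment logs actually export.

```python
"""Toy drift-indicator tracker: coverage trend and rollback rate (illustrative)."""
from dataclasses import dataclass
from statistics import mean


@dataclass
class WeeklySnapshot:
    week: str
    test_coverage: float  # 0.0 - 1.0, from CI
    deployments: int
    rollbacks: int


def coverage_trend(history: list[WeeklySnapshot], window: int = 4) -> float:
    """Change in average coverage between the last window and the one before it."""
    recent = mean(s.test_coverage for s in history[-window:])
    previous = mean(s.test_coverage for s in history[-2 * window:-window])
    return recent - previous


def rollback_rate(history: list[WeeklySnapshot], window: int = 4) -> float:
    """Rollbacks per deployment over the most recent window."""
    recent = history[-window:]
    deploys = sum(s.deployments for s in recent)
    return sum(s.rollbacks for s in recent) / deploys if deploys else 0.0


if __name__ == "__main__":
    history = [
        WeeklySnapshot("W01", 0.82, 2, 0), WeeklySnapshot("W02", 0.81, 2, 0),
        WeeklySnapshot("W03", 0.80, 3, 0), WeeklySnapshot("W04", 0.79, 3, 1),
        WeeklySnapshot("W05", 0.77, 4, 0), WeeklySnapshot("W06", 0.75, 4, 1),
        WeeklySnapshot("W07", 0.73, 4, 1), WeeklySnapshot("W08", 0.71, 5, 2),
    ]
    # Example thresholds; the right values are a team decision, not a constant.
    drifting = coverage_trend(history) < -0.02 or rollback_rate(history) > 0.10
    print(f"coverage trend: {coverage_trend(history):+.1%}")
    print(f"rollback rate (last 4 weeks): {rollback_rate(history):.1%}")
    print("drift warning" if drifting else "within agreed bounds")
```

The point is the direction of the curve, not any single week's number; a slow, steady decline is exactly the signal a point-in-time dashboard hides.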
Process Health
Pattern analysis of process adaptations and their justifications
Time between incident detection and customer notification
Percentage of incidents that are repeat issues
Cross-team dependency failure rates
Weak signal detection rate (near-misses identified and reported)
Cultural Indicators
Employee survey responses about process confidence
Voluntary overtime trends during deployment periods
Knowledge transfer effectiveness between team members
Incident response stress levels and team satisfaction
"Chronic unease" levels (healthy skepticism about system safety)
Psychological safety scores for reporting problems and concerns
Systematic Drift Detection
Guardrail Reviews
Map all process adaptations from the previous quarter
Understand why teams felt adaptations were necessary
Assess the cumulative impact of individual compromises
Identify patterns that indicate systematic drift
Review and update boundaries based on emerging risks and system changes
Adaptation Awareness Implementation
Regular pattern analysis of when, why, and how teams adapt processes (a minimal logging sketch follows this list)
Focus on understanding systemic pressures that drive adaptations
Use adaptations as learning opportunities rather than compliance violations
Build organizational capacity to adapt safely rather than just limiting adaptations
Create psychological safety for discussing process pressures without blame
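As a minimal logging sketch for treating adaptations as data, the example below uses invented fields and pressure categories; the real value lies in the review conversation the summary feeds, not in the script itself.

```python
"""Illustrative adaptation log: record process deviations and summarize the
systemic pressures behind them, instead of treating them as violations."""
from collections import Counter
from dataclasses import dataclass


@dataclass
class Adaptation:
    date: str
    team: str
    what_was_adapted: str  # e.g. "skipped integration tests"
    stated_reason: str     # the team's own words, captured without blame
    pressure: str          # e.g. deadline, staffing, tooling, unclear-process


def pressure_summary(adaptations: list[Adaptation]) -> Counter:
    """Which systemic pressures drove the most adaptations this quarter?"""
    return Counter(a.pressure for a in adaptations)


if __name__ == "__main__":
    log = [
        Adaptation("2024-03-02", "checkout", "expedited code review",
                   "release deadline", "deadline"),
        Adaptation("2024-03-09", "checkout", "skipped integration tests",
                   "flaky CI job", "tooling"),
        Adaptation("2024-03-11", "catalog", "direct push to main",
                   "merge conflict before demo", "deadline"),
        Adaptation("2024-03-20", "catalog", "skipped security review",
                   "no reviewer available", "staffing"),
    ]
    for pressure, count in pressure_summary(log).most_common():
        print(f"{pressure}: {count} adaptation(s)")
```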
Pattern Recognition Training
Train incident responders to identify systemic issues
Implement post-mortems that surface contributing factors
Maintain cross-incident trend analysis
Regular review of themes across multiple incidents
Focus on "how the system normally succeeds," not just how it fails (Safety-II approach)
Recovery Strategies
For organizations already experiencing drift:
Stabilization
Implement automated quality gates that cannot be bypassed (a coverage-floor sketch follows this list)
Establish emergency change review process
Create visible dashboard of leading indicators
Begin systematic documentation of current state
Restore "chronic unease" through leadership modeling of safety-conscious behavior
Assessment
Conduct an audit of current practices vs. stated policies
Map all informal processes and shortcuts currently in use
Identify the highest-risk areas where drift has progressed furthest
Create a prioritized remediation plan
Interview frontline workers to understand work-as-done vs work-as-imagined gaps
Recovery
Implement changes gradually to avoid disrupting operations
Provide training and support for teams returning to disciplined practices
Establish positive incentives for quality behaviors
Measure and communicate progress regularly
Build adaptive capacity—ability to respond to unexpected situations
Reinforcement
Celebrate examples of teams surfacing systemic issues early
Share stories of how quality practices prevented incidents
Make adaptation awareness a regular part of team retrospectives
Maintain executive visibility into adaptation metrics
Institutionalize "productive failure"—learning from near-misses and small failures
Create psychological safety for reporting concerns without blame
Maintaining Vigilance
Continuous monitoring: Drift detection is ongoing, not a one-time fix
Boundary management: Regularly review and update safety boundaries as systems evolve
Learning orientation: Treat drift detection as organizational learning, not compliance checking
Leadership commitment: Executive teams must model and reinforce drift-resistant behaviors
Adaptive capacity building: Strengthen the organization's ability to handle unexpected situations safely
Further Reading
Sidney Dekker - "Drift into Failure: From Hunting Broken Components to Understanding Complex Systems" (2011)
Dekker's work provides the conceptual framework and vocabulary for understanding how complex systems gradually move toward failure boundaries through small adaptations.
Diane Vaughan - "The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA" (1996)
Vaughan's research introduced the concept of "normalization of deviance" - how organizations systematically lower their standards for what constitutes acceptable risk.
Charles Perrow - "Normal Accidents: Living with High-Risk Technologies" (Updated Edition, 1999)
Perrow argues that multiple and unexpected failures are built into society's complex and tightly coupled systems, and that accidents are unavoidable and cannot be designed around. His concept of "normal accidents" explains why some failures are inevitable in complex systems, regardless of safety measures.
Erik Hollnagel, David Woods & Nancy Leveson (Eds.) - "Resilience Engineering: Concepts and Precepts" (2006)
This book charts the efforts being made by researchers, practitioners and safety managers to enhance resilience by looking for ways to understand the changing vulnerabilities and pathways to failure.
James Reason - "Human Error" (1990) and "Managing the Risks of Organizational Accidents" (1997)
Reason introduces the Swiss cheese model, a conceptual framework for the description of accidents based on the notion that accidents will happen only if multiple barriers fail.
James Reason - "The Human Contribution: Unsafe Acts, Accidents and Heroic Recoveries" (2008)
Reason's later work that explores both the positive and negative aspects of human performance in complex systems.
Karl Weick - "Sensemaking in Organizations" (1995)
Karl E. Weick's book highlights how the "sensemaking" process — the creation of reality as an ongoing accomplishment that takes form when people make retrospective sense of the situations in which they find themselves — shapes organizational structure and behavior.
Karl Weick & Kathleen Sutcliffe - "Managing the Unexpected: Sustained Performance in a Complex World" (3rd Edition, 2015)
Essential reading on high-reliability organizations and how some organizations maintain safety despite operating in hazardous environments.
Nancy Leveson - "Engineering a Safer World: Systems Thinking Applied to Safety" (2011)
Leveson's STAMP (Systems-Theoretic Accident Model and Processes) provides a systems-thinking paradigm for safety engineering that has seen growing adoption, notably in the transportation industry.
Sidney Dekker - "Safety Differently: Human Factors for a New Era" (2014)
Dekker's evolution from traditional safety thinking toward a more nuanced understanding of how safety is created in practice.
Erik Hollnagel - "Safety-I and Safety-II: The Past and Future of Safety Management" (2014)
The traditional safety concept, known as Safety-I, and its associated methods and models have significantly contributed to the safety of industrial systems, but they have proven insufficient for complex socio-technical systems. Hollnagel argues for a shift from reactive to proactive safety thinking.