Resilience Bites #19 - LinkedIn Rewind (Weeks 36-38)

Sep 18

Welcome to Week 37 of LinkedIn Rewind!

This is the first edition of Resilience Bites coming to you from ResiliumLabs. I appreciate your patience during the transition—moving everything under one umbrella allows me to operate more efficiently as a solo creator and focus more energy on delivering valuable content to you.

Going forward, I'll continue curating my weekly rewinds while also highlighting the best posts from our community. You can find all my reposts by following the #ResilienceBites hashtag on LinkedIn.

Thanks for being here, and I hope you enjoy this new chapter!

LinkedIn Rewind

Resilience Myths

A few days ago, I was asked about common myths related to resilience. Here are a few of them I see across organizations, many of which sound reasonable but actually undermine the resilience they're meant to build.

This is what I could come up with this morning, but there are many more lurking beneath the surface.

Any others come to mind?

- Track incident counts as a measure of resilience

- MTTR tells us how good we are at recovery

- MTTD shows our detection capabilities

[…]

Continue reading on LinkedIn

On Success and Failures

Why do organizations struggle with learning and improving? Learning seems easy, right? We do it every day, from small habits to big life changes.

Well, it turns out organizations, and the people within these organizations, often treat failures and successes very differently.

But success and failure aren't opposites; they're products of the same system adapting to complex conditions.

When things go right (which is most of the time), we rarely ask, "Why did this succeed?"

Yet these everyday adaptations and successes, all the small adjustments teams make, contain really important learning opportunities that are too often ignored.

[…]

Continue reading on LinkedIn

On Seniority and Decisions

As engineers move up the seniority ladder, they often need to broaden their impact, working at an organizational level rather than on individual features or systems.

That move pulls them slowly away from on-call duties, day-to-day operational firefighting, and the messy realities of production systems. But that move creates something perverse: it slowly creates a gap between work-as-imagined and work-as-done.

And if the now-senior engineers hold onto making decisions versus supporting decisions, being a doer versus becoming a teacher, then it slowly creates a problem that undermines the very systems they're trying to improve.

PS: notice the "often". This is obviously not a generality, but an observation :-)

Continue reading on LinkedIn

Observation

It's easy to say you care about resilience. It's much harder to prove how much you actually care and recognize that most of what you think builds resilience actually erodes it.

Continue reading on LinkedIn

What Organizations Forget When Doing Chaos Engineering

You know what most organizations get wrong with chaos engineering? The hypothesis part.

Often, they will have one person write and validate the hypothesis. That's it. While it's WAY better than nothing, it's also a missed opportunity to learn even more. Let me explain.

To create a strong hypothesis, you need the whole team, not just one engineer. Bring in everyone involved in the system, including the product owner, technical product manager, developers, designers, and architects - the more, the better.

Why? Because the collective intelligence of the team outperforms individual expertise. Every. Single. Day.

[…]

Continue reading on LinkedIn

Podcast - On reliability and leadership

This is hands down one of the best podcast discussions on incident management, coordination, and reliability I've seen in ages!

Fantastic conversation between Beth Adele Long and Jade Rubick on the Decoding Leadership podcast.

Definitely worth a listen for anyone working in this space.

Direct link to YouTube

What is the difference between putting a satellite into orbit and resilience?

I've started to see resilience practices like putting a satellite into orbit and keeping it there. Not that I have experience putting a satellite into orbit, but hear me out :)

Think about it. Launching into orbit is incredibly hard work. You need a massive amount of energy, precise calculations, everything has to go perfectly. But that's just the beginning.

Once you're there, you face a constant battle against forces trying to pull your satellite back to Earth. Without regular course corrections, orbital decay is inevitable. But satellites don't just maintain a fixed orbit, they constantly adjust their trajectory to avoid space debris, navigate around other satellites, respond to solar winds and gravitational anomalies. Each adjustment is a response to conditions they couldn't fully predict. This feels very familiar to how organizations approach resilience.

[…]

Continue reading on LinkedIn

The Community Rewind

Todd Underwood

Yesterday we (Anthropic) published an engineering blog post which is a public discussion of some of the correctness/quality issues that have affected our models for the past several weeks. If you're curious about this topic, please do read the whole post.

For many years I have talked ( https://lnkd.in/ep8fMRTu for example from back in 2022 ) about the way that infrastructure and software problems can manifest as quality problems in complex ML systems. Previously, most of my public examples were for training systems, with my favorite examples something about systematically biased skipping of training data. This set of failures indicates how these problems can manifest in serving systems as well.

[…]

Continue reading on LinkedIn

Liz Fong-Jones

What a malfunctioning hotel light taught me about B2B sales ^H^H^H^H responding to incidents:

Last night, I turned out the lamp after a long day expecting to get a peaceful night of rest. The motion sensitive night light on one side of the bed turned off after 45 seconds. The other light did not. The seconds turned into minutes. I tossed and turned.

Eventually it became clear that absent some intervention, the light would remain on all night long. I reluctantly got up and started rearranging things. Maybe it was my charging cables dangling? Maybe it was the shoes on the floor? Maybe the sensor in front of the light was dirty?

[…]

Continue reading on LinkedIn

Peter Johnson

Chaos engineering isn’t about breaking things; it’s about building confidence in how systems handle failure. A mindset shift many teams could benefit from.

If you’re curious about resilience, chaos engineering or want to hear some great stories from Casey Rosenthal, this one’s worth a listen.

Direct link to Podcast

James Boyer

Some leaders are flailing so hard in the Efficiency Era that they're outsourcing the only part of the job that matters—the thinking.

They're not using AI to move faster, they're using it to sidestep learning. They're dodging the discomfort of not knowing and skipping over the struggle that might have actually made them better.

Under pressure, that avoidance compounds. And the cost isn't just real, it's dangerous.

This one is about how AI accelerates organizational dysfunction, speed-running what used to take years to unravel.

Continue reading on LinkedIn

Jade Garratt

How do you respond to critical feedback?

I think it takes an exceptionally well-trained mind to instantly welcome and appreciate criticism. I’m also not of the mind that all feedback is a gift - some feedback can be biased, unfair and intended to harm, not help.

And yet, sometimes we need the critical feedback to help us see something we hadn’t, help us reconsider and improve. Without it, we might carry on making the same mistakes, or perhaps never be as good, or competent or effective as we could be.

So with that in mind, here’s what I *try* to do when feedback feels hard to hear.

[…]

Continue reading on LinkedIn

Russ Miles

Some people would rather lose their job, their friends, and their credibility than utter the words, “I was wrong.”

In software, that affliction is everywhere. Developers cling to clever hacks like holy scripture. Architects defend design decisions as if they were defending the walls of Troy. Managers double down on bad bets because admitting error feels like admitting weakness. But here’s the thing: your need to be right is keeping you broke in every way that matters.

[…]

Continue reading on LinkedIn

Joe McKevitt

In our team we believe in shipping fast. Lots of small, frequent changes every day. It keeps us moving, keeps us learning and gets value into customers' hands without bottlenecks.

But here's the truth: sometimes moving fast bites back. A patch release recently brought our whole platform down.

And yet I wouldn't change our approach. Why? Because every incident teaches us something. This one pushed us to tighten monitoring, improve visibility, sharpen testing and strengthen on-call readiness.

Continue reading on LinkedIn

John Allspaw

Here are a few signals that can shed light on whether your organization is actually learning effectively from incidents:

1. Do post-incident review meetings include people from teams who were NOT directly involved in the incident?

2. Do engineers report learning things about their systems in post-incident meetings (and in the analysis write-ups) that they can’t learn anywhere else?

3. Are there explicit references to specific post-incident write-ups appearing in internal materials such as project roadmaps, runbooks, hiring plans, or design proposals—demonstrating that authors value and recognize the relevance of incidents?

There are many more, but these are a good start.

Read on LinkedIn


Adrian Hornsby