Controlled Failure / Practice Before It Breaks

Shey Sewani | Toronto

Some recent conversations touched on building a culture of reliability. This post explores one piece of that: practice.

Practice Before It Breaks

Being on call is tough. It’s stressful. The page comes in, adrenaline spikes, and you’re scrambling. It’s not a fun time.

One way to make that stress manageable is to practice. If you’ve never walked through a restart, an alert, or a login issue before something breaks, then of course you’re going to freeze when it does.

Drills give you a chance to interact with the system and figure things out while it’s still calm. You simulate the fire ahead of time, so when the real thing happens, you don’t panic. You’ve seen how the system behaves and you have an idea of what to expect.

Practice in Pieces

Not every drill needs to simulate a catastrophe. Some just walk through a single action—loading a dashboard, restarting a service, deploying a config. Others mimic real-world pressure: a customer hammering on a dashboard, a long-running cron job, or support backfilling data.

That kind of mess is what actually causes most incidents. Not the infrastructure itself, but the messy edge cases—customers doing unexpected things, real load, application-layer bugs.

And while you can walk through the steps on paper, it’s better to actually perform them and see what really breaks.

Practice in Prod

Staging environments are incomplete representations of production. There’s rarely enough data, and logging and alerting usually aren’t fully wired up. But if that’s where you feel comfortable starting, that’s fine. Start there.

Eventually, though, you’ll want to practice in prod. Nothing beats production. It’s where you learn how an incident really unfolds.

Plan Your Practice

A drill isn’t supposed to be a surprise. No Chaos Monkey, please; we don’t need more stress. So schedule the drill, pick a scope, and let other teams know what you’re doing.
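
One way to keep a drill from becoming a surprise is to write the plan down before running it. Here is a minimal sketch of that idea in Python. Everything in it is an assumption made for illustration: the `DrillPlan` class, its field names, and the example values are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class DrillPlan:
    """Hypothetical record of one planned drill; all field names are illustrative."""
    name: str
    scope: str
    environment: str
    starts_at: datetime
    notify: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return the reasons this plan is not ready to run (empty list if ready)."""
        issues = []
        if not self.scope.strip():
            issues.append("no scope chosen")
        if not self.notify:
            issues.append("no other teams notified")
        return issues

    def announcement(self) -> str:
        """Build the heads-up message to send to the other teams."""
        when = self.starts_at.strftime("%Y-%m-%d %H:%M")
        teams = ", ".join(self.notify)
        return (
            f"Planned drill '{self.name}' in {self.environment} at {when}. "
            f"Scope: {self.scope}. Heads-up to: {teams}."
        )


# Example values are made up for illustration.
plan = DrillPlan(
    name="restart the API service",
    scope="one instance behind the load balancer",
    environment="staging",
    starts_at=datetime(2030, 1, 15, 14, 0),
    notify=["support", "data"],
)

if not plan.problems():
    print(plan.announcement())
```

The point is not the code itself. The three things a drill needs (a time, a scope, and a list of people to tell) are easy to check before anyone touches the system.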

Other Benefits

There’s more to practicing than just not panicking during an incident. Drills are also a way to bring people into the team, to share language and techniques, and to build camaraderie.

They’re also one of the simplest ways to share institutional knowledge (because no matter how much you document, some of it just lives in people’s heads). And sharing that builds trust – it helps people feel like they belong.

Conclusion

Practice doesn’t solve everything. But it lowers panic, surfaces the unknown, and builds trust.


I’ve written elsewhere about operational reviews, and there’s more to say about observability, operability, and performance. This post focuses on practice, and specifically on its benefits.