Practice Before It Breaks

Shey Sewani | Toronto

Some recent conversations touched on building a culture of reliability. This post explores one piece of that: practice.

Being on call is tough. It’s stressful. The page comes in, adrenaline spikes, and you’re scrambling. It’s not a fun time.

One way to make that stress manageable is to practice. If you’ve never walked through a restart, an alert, or a login issue before it breaks, then of course you’re going to freeze when it does.

Drills give you a chance to interact with the system and figure things out while it’s still calm. You simulate the fire ahead of time, so when the real thing happens, you don’t panic. You’ve seen what happens, and you have an idea of what to expect.

Practice in Pieces

Not every drill needs to simulate a catastrophe. Some can focus on just one thing.

That one thing can be a single part of the process: loading a dashboard, restarting a service, deploying a new config. Or it can be a single scenario: a customer hammering the dashboard, a long-running cron job, or support backfilling data.

And while you can walk through the steps, it’s better to actually perform them and see what really breaks.

Practice in Prod

Staging environments are incomplete representations of production. There’s rarely enough data, and logging and alerting usually aren’t fully wired up. But if that’s where you feel comfortable starting, that’s fine. Start there.

Eventually, though, you’ll want to practice in prod. Nothing beats production. It’s where you learn how an incident really unfolds.

And yes, I was overzealous in suggesting you restart Redis. The truth is, infra isn’t usually what causes incidents. It’s misbehaving customers, real-world load, and application-layer bugs.

Plan Your Practice

A fire drill isn’t supposed to be a surprise. No Chaos Monkey, please; we don’t need more stress. So schedule the drill, pick a scope, and let other teams know what you’re doing.
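
If it helps to make that concrete, here’s a minimal sketch of what writing a plan down could look like. Everything in it is an assumption for illustration: the DrillPlan name, its fields, and the example values are hypothetical, not a tool or process I’m prescribing. The point is only that a drill has a time, a scope, and a list of people who were told ahead of time.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class DrillPlan:
    """A hypothetical record of one scheduled drill (all field names are illustrative)."""
    title: str
    starts_at: datetime              # scheduled ahead of time, never a surprise
    scope: str                       # the one thing being practiced
    environment: str = "staging"     # start wherever the team is comfortable
    notify: list[str] = field(default_factory=list)  # teams to tell beforehand

    def announcement(self) -> str:
        """Render the heads-up message to send before the drill starts."""
        teams = ", ".join(self.notify) if self.notify else "no other teams"
        return (
            f"Scheduled drill: {self.title}\n"
            f"When: {self.starts_at:%Y-%m-%d %H:%M}\n"
            f"Where: {self.environment}\n"
            f"Scope: {self.scope}\n"
            f"Notified: {teams}"
        )


# Example values are made up for illustration.
plan = DrillPlan(
    title="Restart the web service",
    starts_at=datetime(2025, 1, 15, 14, 0),
    scope="restart one service and watch the dashboards",
    notify=["support", "platform"],
)
print(plan.announcement())
```

A shared doc or a calendar invite does the same job. What matters is that the time, the scope, and the audience are decided before the drill starts.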

Other Benefits

There’s more to practicing than just not panicking during an incident. Drills are also a way to bring people into the team, to share language and techniques, and to build camaraderie.

They’re also one of the simplest ways to share institutional knowledge (because no matter how much you document, some of it just lives in people’s heads). And sharing that builds trust – it helps people feel like they belong.

Conclusion

Practice doesn’t solve everything. But it lowers panic, surfaces the unknown, and builds trust.

I’ve written elsewhere about operational reviews, and there’s more to say about observability, operability, and performance. This post focuses on one thing: the benefits of practice.