The Best Deploy Still Lets You Make Happy Hour | Sarah Hines | Monktoberfest 2025


Waking up to on-call pings from ski lifts and red-eye flights makes a good war story, but it’s not strength; it’s unmanaged risk. In this talk, Sarah shows how engineering managers can replace panic with psychological safety and resilience. Her thesis: when we design for failure and protect people’s time, we not only protect the mental health of workers, we ship better software.

Transcript

Hi. I’ve had my fair share of tales getting through production crashes. I’ve been woken up in the middle of the night with an emergency notification. I’ve debugged in a car, on a plane, on a train, and yes, even once in the woods. Dr. Seuss would be proud.

     And sometimes I wear those stories like a badge. They allow bonding with other geeks, but here’s the truth. They were expensive, exhausting failures.

     Let’s not confuse suffering with strength. Think back to the consoles we grew up on. Or, well, some of us; granted, my gaming tech was always years out of date. Like original Zelda, original Mario, Oregon Trail. There were no saves. If you died, you started over. It was about memorization and survival, brutal and unforgiving. Show of hands, who has ever rage-quit Mario? Yeah.

     Then saves arrived: memory cards, checkpoints, cloud sync. Suddenly you could explore, you could try bold strategies, you could put the controller down, eat dinner, and come back to the game that you were already playing. Saves made games more fun, more inclusive. And hardcore deaths still exist. These are fearsome words, fun if you choose them, but not exactly how you want your job to feel every day. And when we don’t conscientiously build safe systems and culture at work, we’re basically handing our colleagues hardcore mode by default.

     In the startup years I got interrupted at all hours: nights, vacations, mid-flight. I’m pretty sure I’ve done troubleshooting from a ski lift. It was absurd, and it became really clear: if I didn’t prioritize building safety nets, I was going to keep getting woken up. So we did it.

     One-click rollbacks we actually tested, failover servers that we pushed to fail over when it was quiet, tests baked in. We really designed for my selfish desire for sleep, and the fun thing was, when we protected our time, we delivered better work. Our clients were happier, and so were we.

     After that, I joined NAMI. It’s the largest grassroots mental health organization. We support millions of America’s families affected by mental illness so they can build better lives, and we do it in a federated network, each with their own needs, systems and challenges. At NAMI’s national office I lead systems, which encompasses four different teams. Across the teams I’ve led, I’ve seen the same thing: when people feel safe, they experiment. They try new things. They grow. When they don’t, they shrink back; they go into survival mode. Unsafe systems don’t just risk uptime. They hurt people. I once had a tech person … that’s not resilience, that’s a hostage system.

     And technical debt, it always charges interest, and that APR is brutal. It’s unpredictable.

     Biology backs this up. Stress produces a hormonal reaction, cortisol surges literally impair the prefrontal cortex, that’s the part of your brain that solves problems and remembers things. Under stress, your brain narrows. Your view of the problem and potential solutions shrinks.

     Julia Evans says debugging should start with curiosity, not panic. Calm brains debug faster.

     Sometimes the best debugging tool really is a little walk for your mental health. And it’s not just techies; this is everywhere. According to SAMHSA … in NAMI’s 52% of employees felt burned out. 25% considered quitting because of work’s impact on their mental health, and a quarter don’t know if their employers cover mental health benefits. So when we talk about unsafe systems, there is a human cost.

     So what does safety actually look like? I would argue it has three layers: Infrastructure, process, and culture.

     Infrastructure safety isn’t just staging servers, it’s designing for failure.

     When we added real alerting at NAMI, we discovered critical software was (I hate to admit this) four years past end of life. Painful, yes, but survivable.

     Infrastructure resilience isn’t about never failing. It’s about bouncing when you do.

     There are plenty of things you can do to increase infrastructure safety: observability, automated rollbacks, guardrails, all of the jargon. But here’s the real point: in complex systems, failure is inevitable. Resilience comes from cushions.
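     To make "automated rollbacks" concrete, here is a minimal Python sketch of one possible shape for that cushion. It is not from the talk and not any particular tool’s API: `deploy`, `health_check`, and `rollback` are hypothetical callables you would supply yourself, and the number of checks is an arbitrary assumption.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[str], None],       # hypothetical: pushes a version live
    health_check: Callable[[], bool],    # hypothetical: True if the service looks healthy
    rollback: Callable[[str], None],     # hypothetical: restores a known-good version
    new_version: str,
    previous_version: str,
    checks: int = 3,                     # assumption: a few consecutive checks are enough
) -> str:
    """Deploy new_version and return the version left running.

    If the deploy raises or any health check fails, roll back to
    previous_version automatically instead of paging a human.
    """
    try:
        deploy(new_version)
        # all() stops at the first failing check.
        healthy = all(health_check() for _ in range(checks))
    except Exception:
        healthy = False

    if not healthy:
        rollback(previous_version)
        return previous_version
    return new_version


if __name__ == "__main__":
    log = []
    running = deploy_with_rollback(
        deploy=lambda v: log.append(f"deploy {v}"),
        health_check=lambda: False,  # simulate a bad release
        rollback=lambda v: log.append(f"rollback {v}"),
        new_version="v2",
        previous_version="v1",
    )
    print(running, log)  # v1 ['deploy v2', 'rollback v1']
```

     The point of the sketch is only that the recovery path is written down and exercised by code, so a failed release ends in a boring rollback rather than a midnight page.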

     And on the process side, every team needs cheat codes for bad days. Deploy playbooks aren’t bureaucracy. They don’t stop failure, but they’re what keeps mistakes small and survivable. Heck, we got into a basic habit of always jotting down what we learned after failures. Once, when people were floundering, someone quietly pulled up the Slack eternal record. We had the notes. Suddenly we weren’t improvising. We solved it faster, and because we had breathing room, we took the time to automate a check so we would not have to rely on that one person’s memory if it happened again. That’s process safety. Documentation, runbooks, drills, postmortems, not as paperwork, but as muscle memory.

     Without process, you reinforce panic. With process, your team rehearses recovery, calmly, repeatedly, so when real-life crises hit, cortisol levels stay low and clarity stays high. Systems teach culture.

     If the system punishes falling, people stop climbing.

     But if the system has safety nets, people climb higher.

     That’s psychological safety. The freedom to speak up, to push back, to make mistakes and still be supported.

     Google’s Project Aristotle found it’s the number one factor in effective teams, yet too often we never actually train people how to create it. Postmortems turn failures into lessons, but we also need the flip side: the resilience replay.

     A short celebration of the boring deploys, the saves, the quiet successes. Which pieces came together to truly allow for a seamless release?

     Because when you shine a light on what went right, people learn to repeat it.

     And some of you may remember my talk from a past Monktoberfest where I told you about Steve Money, where the exact same requests got better responses when I sent them under a masculine alias. That was a work-around to improve outcomes for my company when we had to interface with unsafe systems.

     Cultural safety means nobody should ever feel like they need a Steve to be heard. Sorry, Steve.

     And I do need to give a quick plug to an initiative of NAMI’s, our stigma-free work initiative gives people practical tools across three areas: Awareness, culture and access, so teams can share knowledge, people normalize help-seeking and people can find care.

     I’ve told you about some of the safety nets I’ve worked on but when I sat down to write this talk, I really struggled on what suggestions to make, because the hard question is, how do you — how do I help you improve these patterns in your own unique environment? How would I even know? I don’t know your maturity. Some of you are in highly regulated enterprises, some of you are in startups, you’re working in different ecosystems, with different maturity levels with different cultures and different management. So if I had to give one piece of advice on where to start, it would be this: Start with your people.

     Ask your people: What stresses you out? What’s the Slack message you dread seeing? What’s the one thing you wish worked better right now, because it leaves you exposed?

     Interrogate yourself, too. And make sure you’re documenting the fire drills, the deer-in-the-headlights Slack thread, the recurring emergency tickets. Those are your guideposts of where to improve. And don’t forget, some wins are really simple: 75% of employees say it’s appropriate to talk about mental health at work, but a quarter don’t know if their companies offer mental health benefits, and the majority of all employees find all mental health topics useful. So read the handbook, know your employee assistance program. Sometimes the biggest fixes aren’t technical at all.

     And let me be honest: this is hard work. It can sound like I’m just saying, just add surveys, runbooks, celebrate normal deploys, you’ll be fine, but that’s not how it goes. It’s grueling work to make technical debt visible, to fight for every investment in the boring, invisible work that keeps systems safe. And the irony is that this is exactly the kind of work that gets pushed aside when emergencies hit. Every fire drill steals time from the foundation, which makes the next fire all the more likely.

     Nobody runs a parade for refactoring, most of the time it isn’t even seen.

     But that’s the real grit. The discipline to get the boring done. If you carve out that space, if you protect your team’s time, something starts shifting. People suggest more improvements, they buy in more, they take even more pride in their craft. They build with confidence because they trust the foundation under them.

     I’ve seen this at startups, nonprofits, enterprises, the pattern holds. When people can depend on safety, they give more of themselves.

     They build better.

     That’s why I keep pushing. Even when it’s invisible, even when it’s not respected, because building systems that lower cortisol, that make crises survivable instead of catastrophic is some of the most impactful work that I feel like I can do.

     It’s slow, it’s heavy, but it pays off, and people eventually get to feel that and they join you.

     And that kindness isn’t just fluff. It’s fiscal.

     Mental health protections, just like systems protections, make organizations more resilient.

     A global Lancet study found that every dollar invested in employee mental health yields a $4 return. Four to one: that’s ROI we’d all love. So we’ll always share our stars. They’re human. The bug caught before happy hour, the deploy so boring we forgot it happened. Because the best deploy is the one that still lets you make happy hour, or at this point in my life, bedtime.

     At NAMI, we say mental health matters everywhere. In tech it matters because safe systems build healthy teams. So here’s a toast, though I don’t have a drink: to leaders who value peace of mind. Let’s remove hardcore mode from our day jobs. Cheers.

    [applause].


