Navigating outages

3 min read · Jun 11, 2023
[Image: picture of a dog. Photo by @karsten116 on Unsplash]

One question that frequently arises during behavioral or cultural interviews is how we handle situations where a bug disrupts services and impacts a large number of users. This query has become a staple in assessing soft skills.

Dealing with an outage, much like other extreme scenarios, involves a series of steps and approaches. In this article, I aim to share insights based on my personal experiences—what has worked and what hasn't.

The sense of urgency

Over time, and with practice, I have learned to handle pressure more effectively. The main lesson has been to remain calm whenever possible.

Unfortunately, some individuals exploit these situations to amplify the sense of urgency or importance, using them as opportunities to validate their work.

Experience, context, and seniority help us navigate this: they let us analyze the problem objectively, prioritize by its impact on users and teams, and avoid succumbing to unnecessary stress.

In larger companies, where processes move more slowly, people often leverage this urgency to expedite progress. Hence, it’s crucial to stay vigilant and focused on what truly matters.

Gaining a deep understanding of the business and the impact of our work helps us recognize whether the pressure we face is truly warranted.

Looking ahead, turning problems into solutions

In handling these challenging situations, I have found a simple yet transformative approach: looking ahead and transforming problems into solutions.

Shifting our focus toward resolving the immediate issue and implementing the necessary patches keeps us from playing the blame game.

This means that, regardless of which team is in charge of the affected service, the energy must go into finding a quick solution. Once the stress is out of the picture and our services are back up and running, we can design a prevention plan that reduces the chances of recurrence.

I have personally found that thinking about the "why" yields positive outcomes. However, it’s more productive to let the immediate chaos settle before conducting retrospectives or post-mortems. These processes are vital for in-depth analysis, but performing them during heightened stress tends to worsen the situation.

Embracing accountability

Making mistakes is an integral part of every software engineer's journey. While they may serve as intense and costly learning experiences, focusing on the lessons they offer can be invaluable. If we bear the cost, we should ensure that we all gain something from it.

Promptly acknowledging errors, taking ownership, and learning from them have a profound impact on the social capital of our teams. Such actions not only encourage others to take responsibility, but also to tackle new challenges, even if they initially seem intimidating. In fact, as I recently read in “The Staff Engineer's Path”, a book I have previously recommended, this significantly boosts the team's psychological safety.

As we progress in our careers, we inevitably become the experienced members of our teams. It is crucial that we learn to navigate these situations effectively because, sooner or later, we will be the ones guiding others by being the grown-up in the room.

By incorporating these insights into our approach to outages and bugs, we can develop resilience, promote a proactive mindset, and cultivate an environment of growth and collaboration.

Thanks for reading ❤️
