
Reducing Outages and Improving Product Delivery Through Weekly Operations Reviews

| Shey Sewani | Toronto

After a major reorganization in engineering, I found myself co-managing a team handling 75% of our critical product services, services especially known for their soul-crushing on-call load. With my co-conspirator going on parental leave, I became responsible for leading both teams.

My top priority was to protect the morale of the two teams. We urgently needed a systemic approach to the on-call load, because it jeopardized the teams’ ability to deliver new features and maintain service reliability.

The goal was clear: reduce the on-call burden, boost morale, and improve system stability.


The Weekly Operations Meeting

In response, I created a meeting creatively titled “The Weekly Operations Review.” The main goal was to identify and address immediate operational pain points; the secondary goal was to aid knowledge sharing between the teams.

Because a developer’s cognitive bandwidth and time are invaluable, the success of this initiative hinged on running these meetings efficiently and following a strict agenda. The role of the ‘operator’ was key here. The operator guided the meeting and set the discussion agenda, usually based on team votes, and also had the authority to introduce urgent operational tasks as needed.

The meetings were open to all team members, with mandatory attendance for the ‘operator’, the engineers who had just come off on-call, and those scheduled to go on-call next.

Each meeting had three parts:

  • Core Metrics Review: We started each session by reviewing key service metrics like latency, response times, and queue lengths. The goal was to identify and understand any odd trends in the data. Any puzzling anomalies or concerning patterns were flagged for further investigation.

  • On-Call Hand-Offs: The on-call hand-off aimed to reduce the anxiety associated with on-call duties. Engineers who had just completed their rotation would share and discuss any incidents and the actions taken.

  • Open Platform for Discussion: Each meeting included time for team members to raise questions and talk through any operational issues they had noticed over the week. This open dialogue encouraged proactive problem-solving and was a chance to brief the team on upcoming changes.
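The core metrics review described above can be partly prepared ahead of the meeting. The sketch below is one possible way to pre-flag "odd trends", not the process the teams actually used: it compares this week's value of each metric with the mean and standard deviation of earlier weeks. The function name, the metric names, and the three-standard-deviation threshold are all assumptions made for illustration.

```python
from statistics import mean, stdev


def flag_anomalies(history, current, threshold=3.0):
    """Return metrics whose current value looks unusual.

    history:   dict of metric name -> list of earlier weekly values
    current:   dict of metric name -> this week's value
    threshold: how many standard deviations count as "odd"
               (an illustrative default, not a recommendation)
    """
    flagged = {}
    for name, past in history.items():
        if name not in current or len(past) < 2:
            continue  # too little data to judge this metric
        mu = mean(past)
        sigma = stdev(past)
        if sigma == 0:
            # Perfectly flat history: any change at all is worth a look.
            if current[name] != mu:
                flagged[name] = float("inf")
            continue
        z = (current[name] - mu) / sigma
        if abs(z) >= threshold:
            flagged[name] = round(z, 2)
    return flagged


# Hypothetical numbers: latency jumps, queue length stays in its usual range.
history = {
    "p99_latency_ms": [200, 210, 190, 205],
    "queue_length": [50, 52, 48, 51],
}
this_week = {"p99_latency_ms": 320, "queue_length": 50}
print(flag_anomalies(history, this_week))
```

With these made-up numbers only the latency metric is flagged. A list like this is a starting point for the discussion, and the judgment about what is actually puzzling or concerning still belongs to the people in the room.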


Benefits of the Meeting

  • Accelerated Product Delivery: By reducing the on-call burden, the team found more time to focus on product development.

  • Empowered Team: Regular discussions on our service metrics made knowledge sharing quicker and more intuitive. This rapid acquisition of knowledge helped the team advance their operational expertise and set up Service Level Indicators (SLIs) and Service Level Objectives (SLOs), giving them the grounding to engage in meaningful discussions with product management and leadership. Notably, the rotating ‘operator’ role in these meetings supported this growth by giving team members a chance to develop leadership skills in a supportive environment.

  • Strategic Investments Identified: The open (and candid) discussions were more effective for knowledge sharing than traditional information sessions. During these reviews, the team pinpointed shortcomings in documentation, tools, and overall system comprehension. Leveraging their collective experiences, they managed to sidestep over-complication and make well-informed choices, keeping their focus on the most impactful areas.
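For readers less familiar with the SLIs and SLOs mentioned above, here is a small worked example of the arithmetic that usually sits behind those conversations. The post does not say which indicators or targets the team chose, so the availability ratio, the 99.9% target, and the function names below are assumptions for illustration only.

```python
def availability_sli(good_requests, total_requests):
    """SLI as the fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is used up,
    and a negative value means the objective has been missed.
    """
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("slo_target must be below 1.0")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical month: 1,000,000 requests, 500 of them failed.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
remaining = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.0%}")
```

In this made-up month, roughly half of the error budget is left. A figure like that gives engineers and product managers a shared number to weigh reliability work against feature work.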


Why It Worked

These meetings cultivated a sense of belonging and purpose within the team. The structured yet welcoming approach to these weekly meetings offered a well-defined forum to identify and address issues on the appropriate timescale. Insights from these discussions enabled the engineers to design more reliable services. Moreover, these sessions significantly accelerated the integration of the two teams and the onboarding of new developers.

From the beginning, the weekly operations meeting was a success. It’s rare to find meetings as well attended as the operations review was. Within a month, the on-call load had noticeably lightened, and during my tenure, our reliability metrics soared to new (such great) heights. The operations review very quickly became a cornerstone of our team culture, playing a vital role in maintaining high morale and strong engagement.