Incident Naming Matters: A Simple Practice for a Better Incident Response

2024-05-10T14:00:00-04:00

I’ve found that naming incidents—especially with fun and creative names—really helps improve incident response management. Here’s why it works.

For Engineers: Using the incident name in logging, dashboarding, and other instrumentation simplifies the process for other developers to engage with and contribute to the incident response, reducing the need for extensive explanation and saving time.
For Incident Commanders: A single name for the incident simplifies searching through tickets, emails, Slack conversations, and other tools, speeding up and simplifying post-incident analysis.
For Leadership: When multiple incidents occur simultaneously, distinct names help prevent mix-ups, especially in communications with stakeholders.
For Everyone: Using fun names, like the “color-animal” format, as an example, “Blue-Camel”, adds a touch of silliness and can make a stressful situation a bit lighter.
For the Technology: Creative but non-specific names prevent unfairly labeling any one technology as the “root cause.” For example, “Yellow-Lemur” is a more neutral name compared to “The Great Cassandra Incident of 2018.”

Naming your incidents is a rather simple practice that can have an outsized positive impact on your incident management response and team culture 🐪.

Reducing Outages and Improving Product Delivery Through Weekly Operations Reviews

2024-04-28T11:00:00-04:00

After a major reorganization in engineering, I found myself co-managing a team handling 75% of our critical product services– services especially known for their soul-crushing on-call load. With my co-conspirator going on parental leave I became responsible for leading both teams.

My top priority was to protect the morale of the two teams. There was an urgent need for a systemic approach to address the on-call load as it jeopardized the team’s ability to deliver new features and maintain service reliability.

The goal was clear: reduce the on-call burden, boost morale, and improve system stability.

The Weekly Operations Meeting

In response, I created a meeting creatively titled, “The Weekly Operations Review.” The main goal was to identify and address the immediate operational pain points; secondly to aid in knowledge sharing between the teams.

Recognizing that a developer’s cognitive bandwidth and time are invaluable, the success of this initiative hinged on our ability to conduct these meetings efficiently, following a strict agenda. The role of the ‘operator’ was key here. Tasked with guiding the meeting and setting the discussion agenda—usually based on team votes—the operator also had the authority to introduce urgent operational tasks as needed.

The meetings were open to all team members, with mandatory attendance for the ‘operator’, anyone who was on-call previously, and those scheduled next.

Core Metrics Review: We started each session by reviewing key service metrics like latency, response times, and queue lengths. The goal was to identify and understand any odd trends in the data. Any puzzling anomalies or concerning patterns were flagged for further investigation.
On-Call Hand-Offs: The on-call hand-off aimed to reduce the anxiety associated with on-call duties. Engineers who had just completed their rotation would share and discuss any incidents and the actions taken.
Open Platform for Discussion: Each meeting included a time for team members to raise questions and talk through any operational issues they noticed over the week. This open dialogue encouraged proactive problem-solving and was a chance to brief the team on upcoming changes.

Benefits of the Meeting

Accelerated Product Delivery: By reducing the on-call burden, the team found more time to focus on product development.
Empowered Team: Regular discussions on our service metrics made knowledge sharing quicker and more intuitive. This rapid acquisition of knowledge helped the team advance their operational expertise and set up Service Level Indicators (SLIs) and Objectives (SLOs), giving them to engage in meaningful discussions with product management and leadership. Notably, the rotating ‘operator’ role in these meetings helped this growth by providing team members a chance to develop leadership skills in a supportive environment.
Strategic Investments Identified: The open (and candid) discussions were more effective for knowledge sharing than traditional information sessions. During these reviews, the team pinpointed shortcomings in documentation, tools, and overall system comprehension. Leveraging their collective experiences, they managed to sidestep over-complication and make well-informed choices, keeping their focus on the most impactful areas.

Why It Worked

These meetings cultivated a sense of belonging and purpose within the team. The structured yet welcoming approach to these weekly meetings offered a well-defined forum to identify and address issues on the appropriate timescale. Insights from these discussions enabled the engineers to design more reliable services. Moreover, these sessions significantly accelerated the integration of the two teams and the onboarding of new developers.

From the beginning, the weekly operations meeting was a success. It’s rare to find meetings as well-attended as the operations review was. Within a month, the on-call load had noticeably lightened, and during my tenure, our reliability metrics soared to new (such great) heights. The operations review very quickly became a cornerstone of our team culture, playing a vital role in maintaining high morale and strong engagement.

High Performance Indexing in PostgreSQL

2024-04-09T11:00:00-04:00

Initial Setup

When working with Rails apps, fast data retrieval is important part of the app’s overall performance.

Consider this query where we fetch the latest entries from a table:

SELECT * FROM pings WHERE sensor_id = 11 ORDER BY created_at DESC LIMIT 65

Without any indexes, this query will be slow, taking about 16ms as it scans the entire table.

Single Column Index

Now with a cost baseline, let’s apply a common index optimization.

Adding an index on sensor_id will reduce the query time to around 13ms—a small 3 ms improvement.

CREATE INDEX index_pings_on_sensor_id ON pings (sensor_id)

Multi-column Index

To further reduce the time, add a composite index on both sensor_id and created_at.You’ll see that the new composite index brings down the execution time to 0.079ms.

It’s also worth noting that in production environments, composite indexes are generally preferred over single column indexes. This is because postgresql can use these composite indexes in place of single column indexes too.

 CREATE INDEX index_pings_on_sensor_id_created_at ON pings (sensor_id, created_at)

Ordered Index

Optimizing further by ordering the created_at column in descending order in our index will make the query even faster, with execution time coming in at about 0.067ms

CREATE INDEX index_pings_on_sensor_id_create_at_desc on pings (sensor_id, created_at desc);

By applying a compound index on sensor_id and created_at in descending order for the pings table, we achieve a notable reduction in query cost. The result is an execution time of 0.067 ms, squeezing out just a little more performance.

By improving indexing strategies from a basic single column to a more ordered composite index, we’ve improved the query speed significantly, from 16ms down to under a millisecond. These indexing strategies are practical implementations that can lead to substantial performance gains in real-world scenarios.

Rails and SigNoz: Stumbling into Better Observability

2024-04-02T11:14:00-04:00

Hello everyone,

I’d like to share a quick update on what’s been happening with httpscout and some insights from the recent launch.

Initially, I set up Prometheus and Grafana for httpscout, enjoying the process of hosting them myself, but, after the latest launch, I noticed some gaps in monitoring. Mostly, I missed having tracing, and building dashboards in Grafana gradually became more painful.

On a friend’s recommendation, I decided to try SigNoz, an open-source Application Performance Monitoring (APM) and Observability tool.

I was skeptical–Rails has somewhat iffy support in the DevOps/SRE land, but I’m happy to say that setting up SigNoz was straightforward. They provide reasonably easy-to-follow instructions and a useful docker-compose file.

I also discovered some open-source Open Telemetry gems gems for Rails.

As of yesterday, httpscout is now monitored by a self-hosted SigNoz installation 🎉.

My first impressions are very positive– SigNoz is a good fit for this phase of httpscout’s development. It’s also past time to start moving away from Grafana and focus more on learning about OpenTelemetry.

More updates to come!

Notes

sumo logic: Ruby on Rails OpenTelemetry auto-instrumentation

Introducing httpscout.io: A Website Uptime Monitor

2024-03-20T11:14:00-04:00

Today, I’m excited to share a project close to my heart: httpscout.io. It’s now in early Beta as a reliable, cost-effective website uptime monitor.

I’ve been developing httpscout.io over the past months. Though it’s still evolving, launching now lets me gather feedback and see how it performs under real-world conditions.

This project has been a dream to work on. The site is built with Ruby 3, Rails 7, Bootstrap 5, Sidekiq 6, and PostgreSQL 15, and is hosted in Canada, independent of the major infrastructure providers.

Thanks to the support from my friends and family during this journey. If you have a moment, please visit httpscout.io to help test the site.

Stripe Integration with Rails: Pay Gem Tips and Fixes

2024-03-12T11:14:00-04:00

Update 2024-05-19

I extracted the Pay integration from httpscout.io and packaged it into a Rails app that is available on github.

Default URL Options

I didn’t find a reference to this in the docs. And if not configured the integration will error our with a reasonable error message. But, if you’re new to rails, it may not be so obvious. The integration requires that Rails.application.routes.default_url_options is set and available. I choose to set the value in both config/environments/development.rb and config/environments/production.rb, and used include Rails.application.routes.url_helpers in the class where I generate the checkout url.

Disable Telemetry

By default, the Stripe Ruby gem sends telemetry to Stripe. Disable telemetry if you’re in a resource-starved environment. While configuring Stripe, also set Stripe.api_version—it’s important to be aware of the API version you’re using.

Stripe Webhooks

When you are configuring a webhook endpoint in Stripe, you will be asked to select the events for which you want to receive webhooks. While you can choose to subscribe to all events, if you have limited resources, it is best to only subscribe to the events that you actually need. A list of the events that the Pay gem expects is available in this gist.

Hiring is Broken: A Guide for Early Stage Startups

2024-02-01T10:14:00-05:00

Here’s a secret. We all know the interview process is broken. The current method of interviewing, especially for Senior Developers, is exhaustive– it stretches resources thin without significantly improving the quality of hires.

Over the years of working at start-ups I’ve built a process tailored for teams of three to fifteen software developers that prioritizes meaningful engagement and organizational alignment. This methodology has proven effective in my experience.

The Foundation: It Starts with You

The energy you bring to an interview sets the tone for the entire process. A rushed or tired demeanor won’t inspire confidence or encourage the best from your candidates. The solution? Interview fewer candidates but do it more thoughtfully. Begin with a resume screen that focuses on those that truly resonate with you, looking for projects or accomplishments that spark your interest and align with your organizational needs.

Understanding Your Needs

Before diving into interviews, clarify to yourself what you’re seeking in a new hire. Is it mentorship ability, process improvement skills, sound architectural decision-making, or the capacity to lead the development of a new feature?

The Interview Process: A Structured Conversation

Allocate an hour and a half for an interview that explores the candidate’s past projects, technical challenges, and achievements. Start by walking the candidate through the process. The interview starts with project deep dives. Then switches to general software knowledge, and ends with a practical coding exercise.

Project Deep Dive: Ask them to describe a software system they built and are proud of. You’re looking to understand their role and why it was significant to them. Explore the technological challenges they faced and what they’d do differently in hindsight.

Consistent Delivery and Impactful Contribution: The ability to consistently deliver projects to completion is especially critical in smaller organizations, where each team member’s impact is magnified. Senior developers should have a history of successfully seeing projects through.

Quantifiable details: In addition to evaluating their delivery record also delve into the quantifiable details of their projects. Inquire about specifics such as the number of requests served, page load times, or the size of the largest database table. With this technique you gather evidence of their active involvement and understand the scale of their work.

Technology Alignment: Inquire about their favorite technology and how it aligns with your stack. This can reveal if they’re a good fit for the role you’re envisioning. For senior positions, especially, alignment with the existing tech stack is non-negotiable.

General Software Knowledge: Dive into their technical understanding with questions that cover fundamental and practical knowledge. Encourage guesses and ask them to reason about the question.

In the context of HTTP, what is the difference between PUT and PATCH?
What is the difference between Docker and Kubernetes?`
What is the difference between Javascript and Node.js?
What is the purpose of a database index?
Which datatypes are suitable to store money values such as the invoice total?

Practical Coding Exercise: Evaluate coding proficiency through straightforward tasks focusing on the basics of the langauge. Here’s a Python challenge I use:

Create a Person class.
Add functionality to the class to allow it to “speak it’s own name.”
Create a collection of Persons.
Filter out all persons named “X” from the collection you created.

Acknowledging the Two-Way Street

Remember, interviews are as much for the candidate as they are for you. Be prepared to answer questions about the role, on-call load, the technology and your team.

Hiring as a Skill

Not everyone is naturally adept at interviewing. It requires a breadth of experience and a nuanced understanding of both technical requirements and human engagement.

Reflecting on Building an App

2024-01-17T10:14:00-05:00

I Built an App 🍎

I built an app. Last week I demoed it to a select group of colleagues. I want to pause and reflect on the past three months.

I’ve realized that the app is more than just code. It’s the fulfilment of a lifelong dream and an expression of my values of simplicity, efficiency, and reliability.

The Journey Begins 🚉

Three months and 170 PRs – that’s what it took to transform an idea into an alpha-level app.

Initially, I planned to use Python and Django for my project. However, as a solo developer with a strong preference for back-end development, I decided to switch to Rails. The scaffolding feature of Rails made it easier for me to experiment and develop my project quickly.

Embracing the Pain 🏋‍♂️

At first, I was trying to rush the development of my app, hoping to create it in just one month like some of the successful app creators out there. However, this approach often resulted in more work and a growing feeling of resentment.

I eventually realized that greater productivity would come if I slowed down and was more deliberate in my approach. Slowing down allowed me to think more critically about what was necessary for the app rather than just trying to add every interesting feature that came to mind.

Strategic Decisions 🤔

Throughout this process, I learned as much about myself as I did about app development. I discovered my limits and strengths and the importance of not pushing myself to the brink.

A decision that paid off early was postponing unit testing until after the core app was developed. This approach went against “standard development practices” but allowed for the flexibility needed in the early stages.

As for testing, I found a preference for MiniTest over RSpec – a choice guided by the need for simplicity and clarity during those late-night coding sessions.

A Lovely Companion 🌍

ChatGPT became the programming buddy I never knew I needed. It could have been better, but it was a sounding board that helped me through some complicated decisions. I explored Copilot, but I found it intrusive– it didn’t align with my programming style.

Having Fun 🎉

One of my proudest achievements was implementing Prometheus and Grafana for monitoring. I love Datadog but promised to have fun and use as much open-source technology as possible. I can stare at the dashboards all day long.

Looking Ahead 📺

I forgot to build an invitation system, so I apologize for any confusion. Thank you for reading this far. The support of strangers and colleagues is immeasurable.

Running Postgres on AWS RDS? Reach out and say hi 👋🏽!

Getting More Out of AWS RDS: Beyond the Basic Settings

2023-08-21T11:14:00-04:00

Did you know that AWS RDS for PostgreSQL uses the default PostgreSQL configurations with minimal customization? There’s a good chance you can get more out of your database. Lets examine the parameters parameters that can dramatically improve a PostgreSQL database’s performance.

work_mem

This parameter determines the amount of memory allocated for sorting operations during queries. Insufficient work_mem can cause operations to spill onto disk, slowing down the qeurying process, while excessively high settings may deplete memory available for new connections. The optimal way to adjust work_mem is by monitoring your system. Start with 4MB for most new apps; if you’re running client reports, like a monthly summary, consider bumping it to 8MB.

log_temp_files

Enabling this setting helps track queries that are spilling to disk, indicated by log messages about temporary files or “external merge Disk”. For web applications with administrative backend or reporting functionality, starting with 4MB is advisable, adjusting as needed based on workload demands. Values larger than 64MB suggest conflicting workload requirements. Consider using a read replica for queries that require more memory.

maintenance_work_mem

By allocating more memory to maintenance_work_mem, maintenance tasks complete more rapidly, which minimizes the downtime or slow performance associated with these operations. This will increase the overall throughput of your database by helping it spend less time locked in maintenance tasks and more time serving requests. A good starting point is setting maintenance_work_mem to around 50% of your total memory.

Checkpoint Parameters

checkpoint_completion_target
checkpoint_timeout
max_wal_size
min_wal_size
log_checkpoints

The default settings in PostgreSQL lead to excessive checkpointing, creating additional write loads. The aim is to have regular checkpoints, guided by the checkpoint_timeout parameter. Increase max_wal_size gradually if the system hits the limit often, and increase checkpoint_timeout for databases with high I/O during checkpoints. For checkpoint_timeout, values up to one hour are sensible.

Here is a good “starter” configuration for a host with 4GB of memory:

checkpoint_completion_target    = 0.97
checkpoint_timeout              = 25m
max_wal_size                    = 4GB
min_wal_size                    = 512MB
log_checkpoints                 = on

wal_buffers

wal_buffers are important for improving write performance on servers where many clients are committing transactions simultaneously. The previous default of 16MB is a solid start– I recommend increasing this by 4MB for each CPU core (minus one). Please note, a system restart is required to apply the new settings. To minimize downtime, consider adjusting wal_buffers together with max_connections.

max_connections

Increase the default to 200 to allow for more simultaneous connections. This requires a restart, but restarting now affects fewer customers, making it a preferable option. Consider using AWS RDS Proxy for higher demands or if restarting isn’t an option.

effective_io_concurrency

This setting controls how many concurrent operations the underlying system can handle. Adjusting this parameter can lead to significant performance improvements for databases frequently performing Bitmap Heap Scans. I recommend starting with a value of 25 times the number of CPUs– you can explore settings up to 1000.

Logging Parameters

log_statement
log_min_duration_statement
log_lock_waits

To prevent the potential exposure of Personally Identifiable Information (PII), set log_statement to none to disable it and change the slow_query_logging setting to -1 to turn slow query logging off. To identify slow queries, consider using the auto_explain extension for brief periods. Enable log_lock_waits to identify whether extended lock waits occur. Lock wait timeouts lead to lower throughput.

statement_timeout

Long queries consume resources, slow down queries, block new connections, and cause conflicts with other transactions. Setting the statement_timeout parameter to a maximum of 80% of the client timeout is recommended to prevent this scenario. For instance, if using Puma with a default timeout of 60 seconds, set the statement_timeout parameter to a maximum of 48 seconds.

By adjusting these fifteen PostgreSQL parameters, you can significantly boost your database’s performance.

For those running PostgreSQL on AWS RDS or similar environments, remember that while defaults get you started, they are rarely optimized for all workloads. The configurations suggested here, including the starter settings for a 4GB server, are just the beginning. Customize them further based on your system’s specific demands and usage patterns.

Continual monitoring and adjustment of these parameters as your database grows and evolves are essential to maintaining a robust, responsive system. Keep experimenting, keep optimizing, and you’ll see not just improved performance but also potentially lower costs and happier users.

Notes

Photo by Adi Goldstein on Unsplash

Railcar+Railroad: A Ready-to-Deploy Rails 7 and Capistrano Solution

2023-07-19T11:14:00-04:00

These days software feels over-engineered and expensive. Long build and deploy cycles are commonplace. I miss what software development felt like circa 2016. I miss the easy debugability, how simple things were to learn, and the ability to carry over what I learned from one place to another.

If you’re like me, you miss the old days and enjoy working with Rails, are comfortable running a production server, or are interested in understanding how things worked in the past; then this post is for you.

Introducing Railcar and Railroad– two sibling projects demonstrating how to use Capistrano to deploy a Rails project to a Ubuntu server.

The Railcar repo is a Rails 7.0.5 project pre-configured with Capistrano; the Railroad repo with Ansible playbooks and roles to configure a Ubuntu 20.04 server with a Rails 7 environment (Rbenv, Nginx, LetsEncrypt/TLS, Unicorn, and Postgres).

Each repository has installation and setup instructions.

Railroad and Railcar are new projects with old technology, providing a suitable alternative to PaaS. It is a scaled-down version of the same setup that served Rails apps reliably for a decade.

Railcar and Railroad are not just trips down memory lane but valuable resources demonstrating how we can still leverage tools like Capistrano to deploy projects efficiently and effectively.

The outage.name website runs a slightly modified version of Railcar on a budget-friendly $7/month DigitalOcean VPS configured with Railroad.

Whether you’re an old-school Rails fan, a developer seeking to understand the roots of current technology, or someone looking for a viable alternative to PaaS, there’s something here for everyone.

Running Rails on AWS? Reach out and say hi 👋🏽!