On-call is usually greeted with complaints. Here's how to change the story and build a more humane, resilient on-call process.
A few years ago, I took ownership of a critical part of our infrastructure. It was a complex system that, if not properly managed, could have caused significant outages and widespread business disruption. Projects like this need a strong team of engineers who are available for emergencies outside business hours and who genuinely understand the systems and their dependencies.
What I found was a well-meaning team that was unprepared for the unpredictable nature of on-call work. My goal was to transform on-call from a perpetual source of stress into a model of steadiness.
Being on call can be stressful
According to Honeycomb CTO Charity Majors, on-call is a major socio-technical challenge and frequently one of the most stressful parts of a developer's career. That stress stems from a few fundamental issues that can make being on call especially hard:
Unpredictable work schedule: On-call engineers must be available to respond to incidents after hours. Personal time gets interrupted, and work-life balance becomes erratic.
High stakes and high pressure: Being on call means handling failures that can have a big impact on business operations, and those stakes create real strain. Tackling difficult problems alone, outside working hours, also adds to a sense of isolation.
Lack of preparation: Without the necessary training, preparation, and experience, engineers feel ill-equipped to deal with the problems they face. That amplifies the anxiety and the fear of making the wrong call.
Alert fatigue: Frequent, non-critical alarms desensitise engineers, making it harder to spot serious problems. The result is more stress, slower reaction times, missed critical alerts, and weakened system reliability, along with general job dissatisfaction and a reduced ability to act promptly when a real problem does arise.
Why are on-call procedures necessary?
Despite its difficulties, being on call is crucial for preserving the health and resilience of production systems. Someone has to cover your services outside business hours.
Keeping on-call engineers close to production systems matters because it ensures:
Business continuity: Quick response times are essential to reduce downtime and limit the impact on end users and business operations. They are the difference between a minor hiccup and a major outage.
Deep understanding of systems: Staying close to production helps you appreciate the subtleties of the environment. It fosters a culture where team members prioritise preventing problems and optimising performance rather than merely reacting to incidents.
Stronger soft skills: On-call engineers need a wide range of abilities, including crisis management, rapid decision-making, and clear communication. These skills support career growth well beyond the on-call setting.
Embracing a resilient on-call structure
The reality check for those on call
When my team first took on the on-call system, we found a patchwork of temporary fixes and a clear lack of confidence during deployments. The engineers were resilient, but they worked in isolation; every night on call was a solitary (mis)adventure. It was also evident that, given the fragility of our system, a major outage would happen soon if we didn't provide better planning and support. We were sitting on a ticking time bomb.
Our transformation began with small, foundational steps.
- Putting together a pre-on-call checklist
A pre-on-call checklist is a simple but effective way to ensure engineers have completed all required tasks before starting their shift. It reduces the chance of being caught unprepared, guards against mistakes, and encourages a proactive approach to incident management. The list we created was made up of the following categories, each with specific tasks underneath (a small sketch of how you might track it follows the list):
- Squad-level training
Training tailored to the squad's on-call duties. This includes role-specific assignments, standard on-call protocols, familiarity with the required tools and architecture, and simulation exercises.
- Onboarding documentation
Clear documentation of the roles and responsibilities involved in the on-call process. Ideally this includes thorough descriptions of each role, incident-handling procedures, escalation paths, and key contacts. We also need to confirm that this documentation is easy to find and updated regularly.
- Incident response guidelines
Guidelines for responding to and managing incidents. These should be thorough, potentially covering escalation protocols, organisational policies, and training programmes.
- On-call schedule
Provide the specifics of the on-call rotation, including who is available and when. Also make sure every scheduled shift is added to each engineer's calendar using the appropriate scheduling tools, so everyone stays informed and ready.
- On-call tools and access
Ensure that engineers have all the tools and access levels needed for on-call work, such as OpsGenie, AWS queues, and monitoring dashboards.
- Communication channels
Register for the relevant Google Groups and Slack channels. This could include department-specific incident channels or channels where incidents are shared and discussed; your organisation may even have a #all-incidents channel. Join the channels for your organisation's stability groups, cross-functional updates (like #marketing-updates) that may affect your services, and any other channels that carry information relevant to your users. Setting up a dedicated channel for on-call coordination, with all of the participants in it, also helps real-time communication during incidents.
- Runbooks and troubleshooting guides
Ensure every team member has access to the relevant runbooks and reviews them routinely to stay familiar with common problems and their fixes.
- The postmortem process
Create or strengthen a process for analysing and learning from incidents. Ideally this includes postmortem meetings where incidents are examined in detail to identify root causes and opportunities for improvement. Engineers should adopt a blameless approach, prioritising learning and growth over assigning blame.
- Emergency contacts
Make sure every member of the on-call team has access to a list of emergency contacts and understands when and how to reach emergency services.
- Regular reviews
Attend your squad's or organisation's operations, stability, and on-call process review meetings. These discussions should happen regularly, for example every two weeks or every month, to drive continuous improvement. Frequent reviews keep operations efficient and help refine the on-call process.
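To make the checklist harder to skip, you can encode it as data and surface anything outstanding at the start of a shift. The sketch below is hypothetical tooling, not what we actually ran; the category and item names simply mirror the list above.

```python
# Hypothetical pre-on-call checklist tracker; categories mirror the list above.
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    description: str
    done: bool = False

@dataclass
class PreOnCallChecklist:
    engineer: str
    items: dict[str, list[ChecklistItem]] = field(default_factory=dict)

    def outstanding(self) -> list[str]:
        """Return human-readable descriptions of unfinished items."""
        return [
            f"{category}: {item.description}"
            for category, category_items in self.items.items()
            for item in category_items
            if not item.done
        ]

checklist = PreOnCallChecklist(
    engineer="alex",
    items={
        "Squad-level training": [ChecklistItem("Complete an incident simulation exercise")],
        "On-call tools and access": [ChecklistItem("Verify OpsGenie and dashboard access", done=True)],
        "Communication channels": [ChecklistItem("Join #all-incidents and the on-call channel")],
    },
)

for gap in checklist.outstanding():
    print("Not ready:", gap)
```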
Once the engineers were comfortable with the pre-on-call checklist, we added them to the rotation. They understood the challenges in theory, but we cautioned them that handling a real incident can be messier and more involved.
Also make sure engineers are correctly included in whatever on-call compensation scheme your company maintains. Pagerly is a good resource on different on-call compensation models.
Wheel of misfortune: Practice role-playing exercises
We introduced a role-playing exercise called the "wheel of misfortune", modelled on Google's Site Reliability Engineering (SRE) practice. The concept is straightforward: we simulate service outages to evaluate and improve on-call engineers' responses in a safe setting. Running these simulations helped teams become better prepared to handle real emergencies.
According to a 2019 post by Google Cloud systems engineer Jesus Climent, "If you have played any role-playing game, you probably already know how it works: a leader such as the Dungeon Master, or DM, runs a scenario where some non-player characters get into a situation (in our case, a production emergency) and interact with the players, who are the people playing the role of the emergency responders."
It's a good way to make sure engineers can handle important but rare events with confidence and skill.
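To give a flavour of how a session can be run, here is a minimal, hypothetical scenario picker; the scenarios and prompts are invented and are not part of Google's SRE tooling.

```python
# Hypothetical "wheel of misfortune" scenario picker for a team drill.
import random

SCENARIOS = [
    {
        "title": "Checkout latency spike",
        "symptom": "p99 latency on the checkout service jumps from 300 ms to 4 s.",
        "prompts": [
            "Which dashboard do you open first?",
            "What would make you page the downstream payments team?",
        ],
    },
    {
        "title": "Queue backlog",
        "symptom": "The order-events queue depth has grown steadily for 30 minutes.",
        "prompts": [
            "How do you tell a consumer crash from a traffic surge?",
            "Which runbook applies here?",
        ],
    },
]

def spin_the_wheel(seed: int | None = None) -> None:
    """Pick a random scenario and print the facilitator's script."""
    scenario = random.Random(seed).choice(SCENARIOS)
    print(f"Scenario: {scenario['title']}")
    print(f"Symptom: {scenario['symptom']}")
    for prompt in scenario["prompts"]:
        print(f"- Facilitator asks: {prompt}")

if __name__ == "__main__":
    spin_the_wheel()
```

The facilitator plays the "DM", feeding the responder extra symptoms as they investigate, and the group reviews the response together afterwards.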
Using data to improve preparation for on-call
Step 1: Data verification
The next step was to make sure we had the right data for useful dashboards and alerts. This was crucial: effective monitoring and incident response depend on relevant, accurate data. We also re-examined whether our initial assumptions about what makes a "good" dashboard or alert were actually correct. Validating data quality and relevance ensures alerts are trustworthy indicators of real problems and helps prevent false positives.
Capturing and interpreting critical system-performance and business metrics is essential to an effective on-call system. We made sure we had the appropriate metrics for every service, recorded them consistently, and kept a dashboard for quick overviews.
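One way to keep that verification honest is to script it: list the metrics each service is expected to emit and flag anything missing or stale. The sketch below is illustrative only; the service names, metric names, and last-seen lookup are placeholders for whatever your monitoring system's API exposes.

```python
# Hypothetical data-verification pass: confirm every service emits the
# metrics its dashboards and alerts depend on. Names and the last-seen
# lookup are placeholders for your monitoring system's API.
from datetime import datetime, timedelta, timezone

REQUIRED_METRICS = {
    "checkout-service": ["http_requests_total", "http_request_duration_seconds"],
    "orders-worker": ["queue_depth", "job_failures_total"],
}

# Stand-in for a real query: when each metric last reported a data point.
LAST_SEEN = {
    ("checkout-service", "http_requests_total"): datetime.now(timezone.utc),
    ("orders-worker", "queue_depth"): datetime.now(timezone.utc) - timedelta(hours=3),
}

def verify_metrics(max_age: timedelta = timedelta(minutes=15)) -> list[str]:
    """Return 'service/metric' pairs that are missing or stale."""
    now = datetime.now(timezone.utc)
    problems = []
    for service, metrics in REQUIRED_METRICS.items():
        for metric in metrics:
            seen = LAST_SEEN.get((service, metric))
            if seen is None or now - seen > max_age:
                problems.append(f"{service}/{metric}")
    return problems

if __name__ == "__main__":
    for item in verify_metrics():
        print("Missing or stale metric:", item)
```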
Step 2: Improving dashboards and alerts
To address alert fatigue, a common problem in on-call systems, we overhauled our monitoring practices:
Alert reviews: We established routine reviews of false positives to cut down unnecessary noise and fine-tune alert sensitivity. The primary goal was to avoid the desensitisation that comes from a stream of pointless pages.
Dashboard reviews: We started prioritising dashboards to work out which are the most useful overall, with the best widgets, access, and visibility. Key questions we asked: Is the dashboard readable when it needs to be? Is it actionable? Is it actually used? Adrian Howard offers an excellent set of questions in a recent blog post that you can use to filter your dashboards further in a structured way.
Using golden signals: We used our golden-signals dashboard to monitor traffic, error rates, and key business indicators. The four golden signals are latency, traffic, errors, and saturation: latency measures response times, traffic tracks demand, errors track failed requests, and saturation captures resource utilisation. Together they make it easier for on-call engineers to assess system health and understand the impact of any change or problem.
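As a concrete, hypothetical illustration, here is how simple alert rules over the four golden signals might look when expressed in code; the metric names and thresholds are invented and should really be derived from your own SLOs.

```python
# Hypothetical golden-signals alert rules. Metric names and thresholds
# are invented for illustration; real values depend on your SLOs.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AlertRule:
    signal: str          # latency, traffic, errors, or saturation
    metric: str          # metric name in your monitoring system
    condition: Callable[[float], bool]
    summary: str

RULES = [
    AlertRule("latency", "checkout_p99_latency_ms", lambda v: v > 1500,
              "p99 checkout latency above 1.5 s"),
    AlertRule("errors", "checkout_error_rate", lambda v: v > 0.02,
              "More than 2% of checkout requests failing"),
    AlertRule("saturation", "worker_cpu_utilisation", lambda v: v > 0.85,
              "Worker CPU above 85% for the evaluation window"),
    AlertRule("traffic", "requests_per_second", lambda v: v < 1.0,
              "Traffic near zero: possible upstream or routing failure"),
]

def evaluate(samples: dict[str, float]) -> list[str]:
    """Return the summaries of any rules whose condition is met."""
    return [r.summary for r in RULES
            if r.metric in samples and r.condition(samples[r.metric])]

print(evaluate({"checkout_p99_latency_ms": 2100.0, "checkout_error_rate": 0.01}))
```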
Step 3: Raising awareness
After the team had a firm understanding of our deployments, flow, and health data, we started looking for further areas that might be optimised.
Understanding and monitoring your dependencies: For incident response, it's critical to understand both your upstream and downstream dependencies. We began by drawing architecture diagrams so everyone could follow the flow of information, making the interactions between components clear, for example where service A depends on service B and service B depends on service C. We then set up dashboards to monitor these dependencies, which lets us quickly spot and handle any odd behaviour and keep track of the interdependencies within the system (a small sketch of such a dependency check follows these points).
Communication about dependencies: We made sure every engineer knew how to reach the teams behind the services we depend on in an emergency.
Runbooks: We improved our runbooks by making them more accessible. Each runbook had enough context and example scenarios for a knowledgeable on-call engineer to dive straight into problem-solving.
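Here is a minimal sketch of that kind of dependency check, assuming each dependency exposes a health endpoint; the service names and URLs are placeholders.

```python
# Hypothetical dependency health check: poll the health endpoints of the
# services we depend on and flag anything unusual. URLs are placeholders.
import urllib.request
import urllib.error

DEPENDENCIES = {
    "service-b": "https://service-b.internal/healthz",
    "service-c": "https://service-c.internal/healthz",
}

def check_dependencies(timeout: float = 2.0) -> dict[str, str]:
    """Return a status string per dependency: 'ok', or the error seen."""
    statuses = {}
    for name, url in DEPENDENCIES.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                statuses[name] = "ok" if resp.status == 200 else f"HTTP {resp.status}"
        except (urllib.error.URLError, OSError) as exc:
            statuses[name] = f"unreachable: {exc}"
    return statuses

if __name__ == "__main__":
    for dependency, status in check_dependencies().items():
        print(f"{dependency}: {status}")
```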
Stress reduction for on-call
As soon as things settled into a routine, we began to optimise the areas that required more work or had gaps in our understanding.
Handover ritualisation: We turned the handover procedure into a ritual. Before the incoming engineer's on-call shift starts, the outgoing engineer formally reviews the system status, open issues, recent incidents, and potential risks. This leaves the incoming engineer thoroughly prepared for their shift.
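A lightweight way to enforce the ritual is a shared handover template. The sketch below is a hypothetical note generator whose fields mirror what the outgoing engineer walks through; nothing about it is specific to any particular tool.

```python
# Hypothetical handover note generator; fields mirror the handover ritual.
from datetime import date

HANDOVER_TEMPLATE = """\
On-call handover for {day}
Outgoing: {outgoing}   Incoming: {incoming}

1. System status: {status}
2. Open issues / ongoing incidents: {open_issues}
3. Notable events from the last shift: {recent_events}
4. Risks to watch (deploys, migrations, traffic events): {risks}
"""

def handover_note(outgoing: str, incoming: str, **sections: str) -> str:
    """Fill the template; unanswered sections default to 'nothing to report'."""
    defaults = {k: "nothing to report"
                for k in ("status", "open_issues", "recent_events", "risks")}
    defaults.update(sections)
    return HANDOVER_TEMPLATE.format(day=date.today(), outgoing=outgoing,
                                    incoming=incoming, **defaults)

print(handover_note("alex", "sam", risks="Payments provider maintenance window on Friday"))
```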
Honouring on-call accomplishments: We started celebrating what on-call engineers achieved. In every post-on-call review we recognised the difficulties encountered and the successes attained, from proactively resolving an issue before it got out of hand to managing a serious incident.
Continuous improvement and knowledge sharing
Regular post-incident reviews: To learn and grow from each event, we ran frequent, impartial post-incident reviews. Each session included a root-cause analysis with the on-call team involved, prioritising understanding the problem over assigning blame.
Monthly operational review meetings: We hold monthly operational reviews to keep our on-call process current and improving. Agenda items include:
Reviewing previous action items
On-call rotation status: Evaluating how well the current on-call setup is working. This includes adding new members to the rotation, assessing the group's performance, confirming all shifts are covered, and estimating the workload.
Recent incidents: Discussing any incidents that have occurred since the previous meeting.
Alert reviews: Examining reports on mean time to recovery (MTTR); a minimal example of this calculation follows the list.
Lessons learned: Sharing insights gleaned from recent incidents or alerts.
Pain points: Discussing the difficulties the team is facing and agreeing on ways to resolve them.
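For reference, MTTR over a review period is simply the average time from an incident opening to its resolution. The sketch below uses invented incident records; in practice the data would come from your incident tracker, for example an OpsGenie export.

```python
# Hypothetical MTTR calculation from invented incident records.
from datetime import datetime, timedelta

INCIDENTS = [
    {"opened": datetime(2024, 5, 3, 22, 10), "resolved": datetime(2024, 5, 3, 23, 5)},
    {"opened": datetime(2024, 5, 12, 2, 40), "resolved": datetime(2024, 5, 12, 4, 0)},
    {"opened": datetime(2024, 5, 20, 9, 15), "resolved": datetime(2024, 5, 20, 9, 45)},
]

def mean_time_to_recovery(incidents: list[dict]) -> timedelta:
    """Average time from an incident being opened to being resolved."""
    durations = [i["resolved"] - i["opened"] for i in incidents]
    return sum(durations, timedelta()) / len(durations)

print("MTTR this period:", mean_time_to_recovery(INCIDENTS))
```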
Knowledge base and documentation: Our knowledge base includes playbooks, runbooks, and documentation. It is updated frequently and linked from our monitoring systems, giving on-call engineers incident-specific guidance.
What comes next?
Once the firefighting has stopped and you have some breathing room, start actively investing in SLOs (Service Level Objectives) and SLIs (Service Level Indicators). They are how your team stops reacting to everything that happens and starts acting proactively. SLOs, such as 99.9% uptime, specify the target level of reliability for your services. SLIs are the measurements, such as the actual uptime percentage you achieved, that show how well you're meeting those targets.
By defining and maintaining SLOs and SLIs, your team can move from constantly responding to issues towards meeting agreed reliability targets.
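To make that concrete, here is a small sketch with invented figures showing how an SLI and the remaining error budget fall out of an availability SLO:

```python
# Hypothetical SLO / error-budget calculation with invented figures.
SLO_TARGET = 0.999          # 99.9% of requests should succeed this period
total_requests = 4_200_000  # requests served this period (example)
failed_requests = 3_150     # failed requests this period (example)

sli = 1 - failed_requests / total_requests          # measured availability (the SLI)
error_budget = (1 - SLO_TARGET) * total_requests    # failures the SLO allows
budget_remaining = error_budget - failed_requests

print(f"SLI (measured availability): {sli:.4%}")
print(f"Error budget this period: {error_budget:.0f} failed requests")
print(f"Budget remaining: {budget_remaining:.0f} requests "
      f"({budget_remaining / error_budget:.0%} of the budget left)")
```

Once the remaining budget is visible, questions like "can we ship this risky change this week?" can be answered with data rather than gut feel.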
Concluding remarks
“It is the engineering's responsibility to be on call and own their code. It is management’s responsibility to make sure that on-call does not suck. This is a handshake, it goes both ways...” – Charity Majors
Engineering leaders demonstrate effective on-call strategy through how they value and organise these systems. Your approach to on-call will depend on your company's unique requirements and history. Consider what you need and what you can live without as you adapt the plan to suit your team.
A proactive approach doesn't just preserve stability; it raises the bar for your operational resilience. It will take work, maintenance, continuous improvement, and care, but the good news is that it's achievable.