Squadcast Community for Squadcast

Posted on • Edited on • Originally published at squadcast.com

Why ‘owning Services’ is critical for effective Incident Response

There is a famous quote that goes like this:
‘For every minute spent organizing, an hour is earned.’

At least in the world of incident response, nothing is more apt than this. Digital infrastructure these days is made up of multiple services, and an outage could result from one impacted service or from several. So it's essential to have a catalog of all the services, along with the point of contact (service owner) responsible for maintaining each one.

However, in the absence of service ownership details, the incident response will go on for longer than necessary. And even basic questions such as these will seem like a mystery to everyone involved:

  • Which services are affected?
  • Who developed these services? And who is responsible for maintaining them?
  • Which other dependent services are also affected?

Without answers to these questions, incident response becomes a reactive process, with an obvious gap between Mean Time to Detection and Mean Time to Recovery. This not only hurts metrics closely tied to team goals (such as MTTA & MTTR) but also increases the chances of more customers being exposed to the issue.

So what’s new in this process?

Most readers here will argue that maintaining Service Ownership is an age-old practice. And rightly so: documenting the list of services and their respective owners has been standard practice for Infrastructure and Operations teams over the years, because they were responsible for the system’s performance and uptime.

But what has changed?

In recent years, it's not about - ‘are ownership details documented?’
Rather it's about - ‘where are ownership details documented?’

The foremost questions you need to ask yourself (and your team) are -
‘Do we have the details stored in the right place?’
‘Are the details centralized and easily accessible by everyone?’
‘Can everyone quickly access it during emergencies?’
‘Is there automation in place to alert the right people?’

Likely solution?

Better Ownership & Greater Transparency. Response teams must be able to access ownership details in seconds, not minutes. And the best place to document these details isn't just any random tool, but an Incident Management platform such as Squadcast.

And to meet this need, we’ve built a feature that can act as a centralized Service Directory, highlighting the health status of Services and their respective owners. This not only makes incident response less chaotic, but is also the first step in making it a proactive process, rather than a reactive process.

Before we get into the details of how modern incident response teams are using our Service Catalog to prevent incidents from spiralling out of control, let’s spend some time understanding what it means to actually ‘own Services’.

Service Ownership

Service ownership is the act where team members take responsibility for supporting the software they deliver at every stage of the development lifecycle. Since Service owners are the SMEs (subject matter experts) for their services – it makes a lot of sense for them to own response and resolution of production issues. This not only promotes a stable product but also bridges the gap between engineering teams and the impact they have on customers.

When it comes to Incident Management, being organized is a superpower that can prevent you from losing millions of dollars in a short window of downtime, all thanks to the timely availability of information. Conversely, every minute spent scrambling for data will only lead to more tickets and escalations.

Introducing Squadcast’s Service Catalog

Our Service Catalog is a Service Directory that acts like a centralized knowledge base, containing the specifics of each service and the personnel within the team responsible for maintaining it.

For each service, it typically answers questions about:

  • The owner(s) accountable for its uptime
  • The associated escalation policy
  • The health status of the service (whether it is degraded or functional)
  • The environment(s) where it is deployed (production / test / staging)
  • The various integrations configured for that service (which might need to be re-configured)
  • Its dependent upstream/downstream services (which will also get impacted)

Having all the service-related information in a centralized location can make Service Ownership less chaotic for the team, not only at the time of an outage but also when there is a partial service degradation.
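To make this concrete, here is a minimal sketch of what one catalog entry could look like as a data structure, and how dependency information could be used to work out which owners to contact. The `Service` class, its field names, the `impacted_by` helper and the sample services are all hypothetical and chosen for illustration; they are not Squadcast's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """One catalog entry. Field names are illustrative, not Squadcast's schema."""
    name: str
    owners: list[str]
    escalation_policy: str
    status: str = "functional"  # "functional" or "degraded"
    environments: list[str] = field(default_factory=list)
    depends_on: list[str] = field(default_factory=list)  # upstream services


def impacted_by(catalog: dict[str, Service], failing: str) -> set[str]:
    """Return every service that depends, directly or transitively, on `failing`."""
    impacted: set[str] = set()
    frontier = [failing]
    while frontier:
        current = frontier.pop()
        for svc in catalog.values():
            if current in svc.depends_on and svc.name not in impacted:
                impacted.add(svc.name)
                frontier.append(svc.name)
    return impacted


# Hypothetical sample data.
catalog = {
    s.name: s
    for s in [
        Service("payments-db", ["dba-team"], "db-escalation",
                environments=["production"]),
        Service("payments-api", ["payments-team"], "payments-escalation",
                environments=["production"], depends_on=["payments-db"]),
        Service("checkout", ["storefront-team"], "storefront-escalation",
                environments=["production"], depends_on=["payments-api"]),
    ]
}

# If payments-db degrades, find the downstream services and every owner involved.
affected = impacted_by(catalog, "payments-db")
owners_to_page = sorted(
    {owner for name in affected | {"payments-db"} for owner in catalog[name].owners}
)
```

With this sample data, a problem in `payments-db` surfaces both `payments-api` and `checkout` as impacted, which is exactly the kind of answer a responder needs in the first minute of an incident.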

Benefits of clearly defining Ownership for Services

  • Better, more accurate escalation to on-call (when an incident needs to be reported)
  • Improved accountability of services
  • Improved reliability of services
  • Happier customers due to faster incident resolution

But associating ownership with services is not as easy as it sounds. There are numerous processes and best practices that should be followed. Let’s read about that in the next section.

Defining a Service

Now let’s understand what exactly a Service is within Squadcast’s ecosystem.

What is a Service in Squadcast?

Services in Squadcast represent specific systems, applications, or core components of your infrastructure for which alerts are generated, and incidents get created.

In the simplest terms, a Service in Squadcast can be summarized as a component that you want to constantly monitor for uptime, report incidents at the slightest hint of performance degradation, and have certain people on-call to quickly remediate the issue.

For every service created in Squadcast, appropriate service owners should be defined.

Establishing Service Ownership

Establishing the culture of ‘owning Services’ will help you take the next big leap in your reliability journey, and every member involved in the process should buy in to the cause. This includes everyone in incident response, from the incident commander to the on-call engineers working on L1 issues.

So in the next section of this blog, let’s understand the best practices to keep in mind while configuring services and ownership. To check out the best practices to reduce MTTR for Services configured in Squadcast, refer to this guide.

1. Create a list of Services

First, create a list of all the services that are critical to your business. This should include both Technical Services and Business Services that need to be monitored 24/7. Start by differentiating between the two types of services and assign ownership to the appropriate teams accordingly, because even a few seconds of degradation or downtime can upset customers and stakeholders.

Technical Service - a discrete piece of code or functionality within the product, owned by the engineering team

Business Service - a combination of one or more Technical Services that has a direct impact on the business/customer

Business and technical service tags

2. Name Services appropriately

Using appropriate naming conventions will make incident response less chaotic during times of urgency. When naming services:

  • Avoid fancy terminology and use unique names that the team can easily recognize
  • Add an informative description that covers the intent of the service and the value it adds
  • Use tags to highlight whether the service has the potential to affect customers
  • Use naming conventions that clearly differentiate between business services and technical services

3. Pick the right owner to own the Service

Every Service should be wholly owned by a team or an individual. Ideally, this should be the same team responsible for developing and maintaining the service because they are the Subject Matter Experts who understand how the service works and should be notified when something goes wrong.

4. Set up on-call rotation & escalations

On-call rotations are key to distributing the load equally among team members. Based on your organization’s requirements and structure, you should build out a roster (a full-blown on-call calendar) for indicating how many individuals will be on-call at a given time and who will be notified straight away for certain severe incidents.

The best practice is to:

  • First define an escalation policy for the service
  • And then decide who will be on-call (which is usually the 1st layer of escalation for any service)
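As a rough illustration of how a layered escalation policy behaves, the sketch below models each layer as a delay plus a set of responders and works out who should have been notified after an incident has gone unacknowledged for a given time. The `EscalationLayer` class, the `responders_to_notify` function, the delays and the responder names are assumptions made for this example, not Squadcast configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLayer:
    """One layer of a hypothetical escalation policy."""
    after_minutes: int  # minutes without acknowledgement before this layer is notified
    responders: tuple[str, ...]


def responders_to_notify(policy: list[EscalationLayer],
                         minutes_unacknowledged: int) -> list[str]:
    """Return everyone who should have been notified by now, in escalation order."""
    notified: list[str] = []
    for layer in sorted(policy, key=lambda layer: layer.after_minutes):
        if layer.after_minutes > minutes_unacknowledged:
            break  # later layers have not been reached yet
        for responder in layer.responders:
            if responder not in notified:
                notified.append(responder)
    return notified


# Hypothetical policy: on-call first, then a backup, then a manager.
payments_policy = [
    EscalationLayer(0, ("primary-on-call",)),
    EscalationLayer(10, ("secondary-on-call",)),
    EscalationLayer(30, ("engineering-manager",)),
]

print(responders_to_notify(payments_policy, 15))
```

Keeping the on-call engineer in the first layer (delay of zero) reflects the practice described above: the policy is defined first, and the on-call rotation fills its first layer.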

5. Set up ‘Tags’ to classify Services

‘Tags’ help in classifying services appropriately, and classification adds context about how an incident on that service is likely to impact you. For example:

  • Classifying whether the service belongs to the test or production environment helps in prioritizing response
  • Classifying whether the service needs a high-priority response helps in identifying how severe the incident can become
  • Classifying whether the service has a direct impact on customers
  • Classifying similar services together to determine dependent services that can also potentially get affected, etc.
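The classifications above could be sketched as simple key-value tags that drive a coarse response priority. The tag keys (`environment`, `customer-facing`, `priority`, `group`), the priority rules and the sample services below are all assumptions made up for this example; your own tagging scheme will differ.

```python
def response_priority(tags: dict[str, str]) -> str:
    """Map a service's tags to a coarse response priority (illustrative rules only)."""
    if tags.get("environment") != "production":
        return "low"
    if tags.get("customer-facing") == "true" or tags.get("priority") == "high":
        return "high"
    return "medium"


def same_group(services: dict[str, dict[str, str]], name: str) -> list[str]:
    """Names of other services that share the 'group' tag with `name`."""
    group = services[name].get("group")
    if group is None:
        return []
    return sorted(
        other for other, tags in services.items()
        if other != name and tags.get("group") == group
    )


# Hypothetical services and tags.
services = {
    "checkout": {"environment": "production", "customer-facing": "true", "group": "payments"},
    "payments-api": {"environment": "production", "group": "payments"},
    "checkout-staging": {"environment": "test", "customer-facing": "true", "group": "payments"},
    "search": {"environment": "production", "group": "discovery"},
}

for name, tags in services.items():
    print(name, response_priority(tags), same_group(services, name))
```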

How to track its functioning

Setting up ownership for services is only the first step towards better incident response. To strengthen the value that it's adding, you can do the following:

1. Track SLOs

SLOs (Service Level Objectives) are one of the best indicators for measuring service functionality. Functional targets should be established for every service. These targets could take the form of an expected amount of uptime, an acceptable amount of latency, a maximum error rate, etc.

But the key point is to make sure the owner keeps tabs on these performance indicators, along with some form of automation that can notify the owner(s) as and when the targets are not being met.
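As a worked example of what such automation might check, the sketch below computes availability from request counts and how much of the error budget implied by an SLO target is left. The function names, the 99.9% target, the request counts and the 25% alerting threshold are all assumptions picked for illustration.

```python
def availability(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has failed
    return 1 - failed_requests / total_requests


def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / allowed_failures


def should_alert_owner(slo_target: float, total_requests: int,
                       failed_requests: int, threshold: float = 0.25) -> bool:
    """Alert once less than `threshold` of the error budget remains (assumed rule)."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) < threshold


# Hypothetical month: 1,000,000 requests against a 99.9% availability SLO,
# which allows roughly 1,000 failed requests.
print(availability(1_000_000, 400))
print(error_budget_remaining(0.999, 1_000_000, 400))
print(should_alert_owner(0.999, 1_000_000, 400))
print(should_alert_owner(0.999, 1_000_000, 900))
```

With 400 failures, about 60% of the budget is still available and no alert fires; at 900 failures only about 10% remains, so the owner would be notified before the SLO is actually breached.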

2. Use Analytics

Analytics is another useful medium to understand the health of the service. By analyzing a service’s past behavior, you can get answers to various questions like:

  • How prone is this service to outages/ incidents?
  • How many on-call engineers get actively involved during resolution?
  • What caused this service degradation, etc.

The key point is, analytics can be leveraged to decipher various patterns in a Service’s behavior. This data can be used to drive home numerous insights that can improve on-call and incident response processes.
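For instance, a few lines of code over past incident records are enough to see how incident-prone each service is and how quickly it gets acknowledged and resolved. The record layout and sample incidents below are invented for this sketch, and MTTA and MTTR are measured here from the moment the incident was triggered, which is one common convention rather than the only one.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical incident history; field names are assumptions for this example.
incidents = [
    {"service": "checkout", "triggered": datetime(2024, 1, 5, 10, 0),
     "acknowledged": datetime(2024, 1, 5, 10, 4), "resolved": datetime(2024, 1, 5, 10, 50)},
    {"service": "checkout", "triggered": datetime(2024, 1, 19, 22, 0),
     "acknowledged": datetime(2024, 1, 19, 22, 6), "resolved": datetime(2024, 1, 19, 22, 30)},
    {"service": "search", "triggered": datetime(2024, 1, 12, 8, 0),
     "acknowledged": datetime(2024, 1, 12, 8, 2), "resolved": datetime(2024, 1, 12, 8, 20)},
]


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def service_stats(records: list[dict]) -> dict[str, dict[str, float]]:
    """Per-service incident count, mean time to acknowledge and mean time to resolve."""
    by_service: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        by_service[record["service"]].append(record)

    stats = {}
    for service, items in by_service.items():
        stats[service] = {
            "incident_count": len(items),
            "mtta_minutes": mean([minutes_between(i["triggered"], i["acknowledged"]) for i in items]),
            "mttr_minutes": mean([minutes_between(i["triggered"], i["resolved"]) for i in items]),
        }
    return stats


stats = service_stats(incidents)
for service, values in sorted(stats.items()):
    print(service, values)
```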

3. Conduct transparent discussions with the team

Most of all, having open discussions with the team is very important for maintaining team harmony. Because service degradations are inevitable, these discussions also help bolster confidence and increase psychological safety. Exchanging perspectives and settling on an approach to deliver maximum uptime is the best way forward.

Conclusion

Customers and stakeholders tend to be happier when they see a healthy and functioning service. A functioning service is thus a result of proactive incident response, which is itself a byproduct of well-defined Service Ownership.

Squadcast is an incident management tool that’s purpose-built for SRE. Get rid of unwanted alerts, receive relevant notifications and integrate with popular ChatOps tools. Work in collaboration using virtual incident war rooms and use automation to eliminate toil.
