SRE book notes: Managing Incidents

#sre #books #incident

These are my notes on Chapter 14, Managing Incidents, of the book Site Reliability Engineering: How Google Runs Production Systems.

This post is part of a series. The previous post can be found here:

SRE book notes: Emergency Response

Hercules Lemke Merscher ・ Jan 30 ・ 2 min read


This chapter puts you in the shoes of several personas facing different incidents and shows how they handle the situation at hand.

It ends by summarizing the best practices for incident management:

Prioritize. Stop the bleeding, restore service, and preserve the evidence for root-causing.

Prepare. Develop and document your incident management procedures in advance, in consultation with incident participants.

Trust. Give full autonomy within the assigned role to all incident participants.

Introspect. Pay attention to your emotional state while responding to an incident. If you start to feel panicky or overwhelmed, solicit more support.

Consider alternatives. Periodically consider your options and re-evaluate whether it still makes sense to continue what you’re doing or whether you should be taking another tack in incident response.

Practice. Use the process routinely so it becomes second nature.

Change it around. Were you incident commander last time? Take on a different role this time. Encourage every team member to acquire familiarity with each role.
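To make the "Change it around" advice concrete, here is a minimal sketch of a role rotation, not anything prescribed by the book. The role names other than incident commander, the `rotate_roles` helper, and the team members are all assumptions made up for illustration; adapt them to whatever roles your own process defines.

```python
from collections import deque

# Hypothetical role list for this sketch. Only "incident commander" is named
# in the notes above; the other role names are assumptions.
ROLES = ["incident commander", "ops lead", "communications", "planning"]


def rotate_roles(members, incident_number):
    """Return a {role: member} mapping that shifts by one seat per incident."""
    if len(members) < len(ROLES):
        raise ValueError("need at least one member per role")
    seats = deque(members)
    # Rotate left so that a different person lands on each role every time.
    seats.rotate(-incident_number)
    return dict(zip(ROLES, seats))


if __name__ == "__main__":
    team = ["ana", "bo", "cy", "di"]  # hypothetical team members
    for n in range(3):
        print(n, rotate_roles(team, n))
```

Running the same rotation during drills as well as real incidents covers the "Practice" point too: over a few rounds everyone has held every role at least once.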