If you've ever been part of an on-call rotation, you probably know it can be a bit of a drag, especially if your alerting is a little too good: you wake up 5 times a night to pages that seem to resolve as soon as you're just awake enough to ensure you don't easily fall back asleep, and maybe it happens so often that when a serious page does come through, you don't catch it fast enough, and wham, you have an incident review to prepare for the next day.
Creating an iterative, audit-driven, testable program for your on-call rotation is taxing, sure, but the payoff is better data, more usable context, less time spent responding to noise, and more time spent preemptively identifying signal.
Auditing for Coverage
The first step in this process is knowing what is, and is not, being monitored. Which hosts are observed? Which services are observed? Is there a host for every service? The questions go on and on.
I sought to answer these questions myself by querying my source of truth, in this case, Nagios.
I wrote a script to prepare a spreadsheet of host monitors, and a sheet of service monitors that indicates whether or not a mapping to a monitored host exists:
#!/bin/bash

# Dump Nagios host definitions (host_name, address, alias, hostgroups) as CSV rows.
hosts () {
  cat /etc/nagios/conf.d/hosts/*.cfg | grep "host_name\|address\|alias\|hostgroups" | grep -v localhost | perl -ne '$line = $_;
    chomp($line);
    if ($line =~ /host_name(.*)/) {
      $match = $1;
      $match =~ s/ |,//g;
      print "\n".$match.",";
    };
    if ($line =~ /address(.*)/) {
      $match = $1;
      $match =~ s/ |,//g;
      print $match.",";
    }
    if ($line =~ /alias(.*)/) {
      $match = $1;
      $match =~ s/^\s+//;
      $match =~ s/,//g;
      print $match.",";
    }
    if ($line =~ /hostgroups(.*)/) {
      $match = $1;
      $match =~ s/^\s+//;
      $match =~ s/,//g;
      print $match.",";
    };
  '
}

# Dump Nagios service definitions (hostgroup_name, service_description, check_command) as CSV rows.
services () {
  cat /etc/nagios/conf.d/services/*.cfg | grep "hostgroup_name\|service_description\|check_command" | grep -v localhost | perl -ne '$line = $_;
    chomp($line);
    if ($line =~ /hostgroup_name(.*)/) {
      $match = $1;
      $match =~ s/ |,//g;
      print "\n".$match.",";
    };
    if ($line =~ /service_description(.*)/) {
      $match = $1;
      $match =~ s/ |,//g;
      print $match.",";
    }
    if ($line =~ /check_command(.*)/) {
      $match = $1;
      $match =~ s/ |,//g;
      print $match.",";
    };
  '
}

write_inv () {
  if [ "$1" = "dump" ]; then
    if [ "$2" = "hosts" ]; then
      hosts
    elif [ "$2" = "services" ]; then
      services
    else
      echo "bad option (services | hosts)"
    fi
  elif [ "$1" = "audit" ]; then
    DATE=$(date +%Y%m%d%H%M)
    OUTFILE_hosts="$HOME/hosts_$DATE.csv"
    OUTFILE_services="$HOME/services_$DATE.csv"
    echo "hostgroup_name,service_description,check_command," >> "$OUTFILE_services" && \
      echo "Writing $OUTFILE_services..." && \
      services >> "$OUTFILE_services"
    echo "hostname,address,alias,hostgroups," >> "$OUTFILE_hosts" && \
      echo "Writing $OUTFILE_hosts..." && \
      hosts >> "$OUTFILE_hosts"
  else
    echo "Options: { dump (prints to stdout) [hosts, services] | audit (writes to file) }"
  fi
}

main () {
  write_inv "$@"
}

main "$@"
So, running ./audit.sh audit will create that pair of CSVs. The service view gives you the host group, the service, and what the service actually tests (the check command points at a Nagios plugin, so you can verify which monitors are in use), and the hostgroup usage can then be cross-checked against the list of monitored hosts (which tells you which hosts make up that group).
This will tell you two things:
- What is monitored
- How well it is monitored (coverage)
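To make the cross-check concrete, here's a rough sketch of how I'd diff the two CSVs. The filenames are placeholders for whatever your audit run actually produced, and it assumes the columns line up with the headers the audit script writes:
import csv

# Placeholder filenames; substitute whatever your audit run actually produced.
HOSTS_CSV = 'hosts_202401011200.csv'
SERVICES_CSV = 'services_202401011200.csv'

# Hostgroups that monitored hosts actually belong to.
host_groups = set()
with open(HOSTS_CSV) as f:
    for row in csv.DictReader(f):
        # A host's hostgroups column may hold several space-separated groups.
        host_groups.update((row.get('hostgroups') or '').split())

# Hostgroups that service monitors are attached to.
service_groups = set()
with open(SERVICES_CSV) as f:
    for row in csv.DictReader(f):
        service_groups.add((row.get('hostgroup_name') or '').strip())
service_groups.discard('')

print('Service hostgroups with no monitored hosts:', sorted(service_groups - host_groups))
print('Host hostgroups with no service monitors:', sorted(host_groups - service_groups))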
This context alone, however, won't solve all your problems; it tells you whether something is, or is not, monitored, not whether that monitor is worth your time. You can identify if something is missing, but not if it's misaligned.
For example, on database servers you may be monitoring for deadlocks, but that will do you little good if the instance is already on fire (a network outage, a replication slot failure, the instance may have been terminated and an upstream libvirt check may not have caught it; the list goes on!). You can fill in those gaps, but calibrating your team's response takes some additional insight, which, again, is trackable and automatable over your audit cycle (e.g. quarterly).
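As a sketch of hunting for that kind of misalignment, you can compare the checks attached to a hostgroup against a short list of checks you'd expect it to carry. The hostgroup name and the expected check fragments below are hypothetical, so substitute whatever your services CSV actually contains:
import csv

SERVICES_CSV = 'services_202401011200.csv'   # placeholder filename from the audit run
HOSTGROUP = 'db-servers'                     # hypothetical hostgroup name
# Hypothetical fragments you'd expect to see in check_command for database hosts.
EXPECTED = {'deadlock', 'replication', 'connections', 'disk'}

present = set()
with open(SERVICES_CSV) as f:
    for row in csv.DictReader(f):
        if (row.get('hostgroup_name') or '').strip() == HOSTGROUP:
            command = (row.get('check_command') or '').lower()
            present.update(name for name in EXPECTED if name in command)

print('Expected checks with no monitor on %s: %s' % (HOSTGROUP, sorted(EXPECTED - present)))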
Alert Fatigue!
Key to improving your monitoring is understanding where alerts are coming from, and evaluating if the response was proportional.
If you know something will happen, you understand why and how it happens, it resolves itself within some period of time, and you routinely acknowledge a page for the issue only to find it resolved before you're online to check, then it's likely something that can be automated out of existence, escalating only if the issue persists (even if that escalation path adds a few seconds to indicate it is not resolvable by automation alone).
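What that automation might look like is specific to your tooling; here's a loose sketch against the PagerDuty REST API v2, where the environment variable, the From address, the title fragment, and the escalation window are all assumptions standing in for your own configuration:
import datetime
import os

import requests

# All of the specifics below are assumptions for the sketch.
API_KEY = os.environ['PAGERDUTY_RW_KEY']          # hypothetical env var; needs a key with write access
FROM_EMAIL = 'oncall-bot@example.com'             # PagerDuty requires a From header on writes
KNOWN_TRANSIENT = 'disk check timed out'          # title fragment of a page that usually self-heals
ESCALATE_AFTER = datetime.timedelta(minutes=10)   # how long to let automation hold the page

HEADERS = {
    'Accept': 'application/vnd.pagerduty+json;version=2',
    'Authorization': 'Token token={token}'.format(token=API_KEY),
    'From': FROM_EMAIL,
}

def handle_transients():
    # Only look at incidents that are still triggered (nobody has acknowledged them yet).
    resp = requests.get('https://api.pagerduty.com/incidents',
                        headers=HEADERS,
                        params={'statuses[]': ['triggered'], 'limit': 50, 'time_zone': 'UTC'})
    for inc in resp.json()['incidents']:
        if KNOWN_TRANSIENT not in inc['title'].lower():
            continue
        created = datetime.datetime.strptime(inc['created_at'], '%Y-%m-%dT%H:%M:%SZ')
        if datetime.datetime.utcnow() - created < ESCALATE_AFTER:
            # Young enough that it usually resolves itself: acknowledge it and move on.
            requests.put('https://api.pagerduty.com/incidents/{id}'.format(id=inc['id']),
                         headers=HEADERS,
                         json={'incident': {'type': 'incident_reference',
                                            'status': 'acknowledged'}})
        # Anything older than the window stays triggered, so it escalates to a human as usual.

if __name__ == '__main__':
    handle_transients()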
Most on-call pager services have some plugin for this, but what I'd like to talk about is auditing your responses as well to identify misaligned monitors.
Because I like spreadsheets, I run a script like this one periodically to see what we spend the most time responding to during an on-call rotation:
import requests
import json
import os
import datetime
import optparse

p = optparse.OptionParser(conflict_handler="resolve", description="Creates CSV for PagerDuty alert history; requires start and end date.")
p.add_option('-s', '--since', action='store', type='string', dest='since', default='', help='Start date for reporting')
p.add_option('-u', '--until', action='store', type='string', dest='until', default='', help='End date for reporting')
p.add_option('-k', '--key', action='store', type='string', dest='api_key', default='', help='PagerDuty API Key')
options, arguments = p.parse_args()

# Use the key passed on the command line, falling back to the environment.
if options.api_key:
    API_KEY = options.api_key
else:
    API_KEY = os.environ['PAGERDUTY_RO_KEY']

SINCE = options.since
UNTIL = options.until
STATUSES = []
TIME_ZONE = 'UTC'
LIMIT = 50
RUNDATE = datetime.datetime.today().strftime('%Y%m%d%H%M%S')
OUTDIR = os.path.join(os.path.expanduser('~'), 'pd-audit')

def list_incidents(offsetval):
    # Pull one page of incidents from the PagerDuty REST API v2.
    url = 'https://api.pagerduty.com/incidents'
    headers = {
        'Accept': 'application/vnd.pagerduty+json;version=2',
        'Authorization': 'Token token={token}'.format(token=API_KEY)
    }
    payload = {
        'since': SINCE,
        'until': UNTIL,
        'statuses[]': STATUSES,
        'limit': LIMIT,
        'time_zone': TIME_ZONE,
        'offset': offsetval
    }
    r = requests.get(url, headers=headers, params=payload)
    return r.text

def write_csv(resp, offset):
    # Write one page of incidents (title, created_at) to its own CSV.
    incidents = json.loads(resp)['incidents']
    outfile = os.path.join(OUTDIR, '%s-Incidents-%s.csv' % (RUNDATE, offset))
    with open(outfile, 'w') as incidents_data:
        for inc in incidents:
            incidents_data.write("%s,%s,\n" % (inc['title'], inc['created_at']))

if __name__ == '__main__':
    if not os.path.isdir(OUTDIR):
        os.makedirs(OUTDIR)
    more_status = True
    offset = 0
    while more_status:
        resp = list_incidents(offset)
        more = json.loads(resp)['more']
        if not more:
            more_status = False
            print("No more pages after current run. Writing '%s/%s-Incidents-%s.csv'..." % (OUTDIR, RUNDATE, offset))
        else:
            print("Writing '%s/%s-Incidents-%s.csv'..." % (OUTDIR, RUNDATE, offset))
        write_csv(resp, offset)
        # The API's offset counts records, not pages, so advance by the page size.
        offset += LIMIT
You can then sort the sheet by incident type: how many of these were helped by my manually intervening? How many were unnecessary pages that resolved just quickly enough to justify having a bot acknowledge them in the interim? How many could've gone better because we could've detected the issue sooner? (Remember that database example?) What could've happened differently to make what we actually did better? The list of questions goes on.
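If you'd rather not do that first sort by hand, a few lines over the same CSVs give you a starting tally (this assumes the ~/pd-audit/<rundate>-Incidents-<offset>.csv layout the export script writes, with the title in the first column):
import csv
import glob
import os
from collections import Counter

counts = Counter()
# Fold every page of the export into one tally of incident titles.
for path in glob.glob(os.path.join(os.path.expanduser('~'), 'pd-audit', '*-Incidents-*.csv')):
    with open(path) as f:
        for row in csv.reader(f):
            if row:
                counts[row[0]] += 1   # column 0 is the incident title

# Most frequent pages first: the top of this list is where the rotation's time went.
for title, count in counts.most_common(20):
    print('%5d  %s' % (count, title))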
The point of these exercises is to keep what is and is not working for you and your team front of mind. Having the data to say that something proved helpful, or proved to be kind of a drag, gives you some objective distance from what you actually responded to.
By implementing this iterative process, each pass may gain you more insight into where your team could be spending less time fighting fires, by investing in (or, in my case, divesting from) these monitoring blind spots.