We have three working environments – Dev, Stage, and Production.
Also, there are a bunch of alerts with different severities – info, warning, and critical.
For example:
...
- name: SSLexpiry.rules
  rules:
  - alert: SSLCertExpiring30days
    expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30
    for: 10m
    labels:
      severity: info
    annotations:
      summary: "SSL certificate warning"
      description: "SSL certificate for the {{ $labels.instance }} will expire within 30 days!"
...
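The routing shown below matches on an env label attached to every alert. This post doesn't show where that label comes from; a common way (and just an assumption here – the rtfm-* values are placeholders) is to set it via external_labels in each environment's prometheus.yml, together with the monitor label used in the Slack message templates:

...
global:
  external_labels:
    # hypothetical values – e.g. rtfm-dev / rtfm-stage / rtfm-production per environment
    env: rtfm-production
    monitor: monitoring-production
...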
Alerts are sent to Slack and OpsGenie.
The task is, depending on the environment and the severity level, to send notifications either to Slack only or to Slack + OpsGenie.
OpsGenie, in its turn, depending on the severity level, will:
- for warning – send an email plus a notification to its mobile application
- for critical – send an email plus a notification to its mobile application plus a bot's call to a mobile phone
Thus, the whole logic looks like the following:
- Dev:
  - all messages, regardless of severity – Slack only
- Staging:
  - info – Slack only
  - warning and critical – Slack + OpsGenie, with the warning priority (P3) set for OpsGenie
- Production:
  - info – Slack only
  - warning and critical – Slack + OpsGenie, with the critical priority (P1) set for OpsGenie
To split messages between Slack and OpsGenie we have three receivers configured, and in the warning and critical receivers the P3 or P1 priority is set for OpsGenie:
...
receivers:
- name: 'default'
  slack_configs:
  - send_resolved: true
    title_link: 'https://monitor.example.com/prometheus/alerts'
    title: '{{ if eq .Status "firing" }}:confused:{{ else }}:dancing_panda:{{ end }} [{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'
    text: "{{ range .Alerts }}*Priority*: `{{ .Labels.severity | toUpper }}`\nMonitoring host: {{ .Labels.monitor }}\n{{ .Annotations.description }}\n{{ end }}"
- name: 'warning'
  slack_configs:
  - send_resolved: true
    title_link: 'https://monitor.example.com/prometheus/alerts'
    title: '{{ if eq .Status "firing" }}:disappointed_relieved:{{ else }}:dancing_panda:{{ end }} [{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'
    text: "{{ range .Alerts }}*Priority*: `{{ .Labels.severity | toUpper }}`\nMonitoring host: {{ .Labels.monitor }}\n{{ .Annotations.description }}\n{{ end }}"
  opsgenie_configs:
  - priority: P3
- name: 'critical'
  slack_configs:
  - send_resolved: true
    title_link: 'https://monitor.example.com/prometheus/alerts'
    title: '{{ if eq .Status "firing" }}:scream:{{ else }}:dancing_panda:{{ end }} [{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'
    text: "{{ range .Alerts }}*Priority*: `{{ .Labels.severity | toUpper }}`\nMonitoring host: {{ .Labels.monitor }}\n{{ .Annotations.description }}\n{{ end }}"
  opsgenie_configs:
  - priority: P1
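Note that the slack_configs and opsgenie_configs above also need their API credentials, usually inherited from the global block; a minimal sketch (the key and the webhook URL below are placeholders, not real values) might look like:

...
global:
  # placeholders – set the real OpsGenie API key and Slack incoming webhook URL
  opsgenie_api_key: 'your-opsgenie-api-key'
  slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
...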
And the routing itself is done in the route block:
...
route:
  group_by: ['alertname', 'cluster', 'job', 'env']
  repeat_interval: 24h
  group_interval: 5m
  # capture All Dev + All INFO
  receiver: 'default'
  routes:
  # capture All WARN to the 'warning' with P3
  - match:
      severity: warning
    receiver: warning
    routes:
    # forward Dev WARN to the 'default'
    - match_re:
        env: .*(-dev).*
      receiver: default
  # capture All CRIT to the 'critical' with P1
  - match:
      severity: critical
    receiver: critical
    routes:
    # forward Stage CRIT to the 'warning'
    - match_re:
        env: .*(-stage).*
      receiver: warning
    # forward Dev CRIT to the 'default'
    - match_re:
        env: .*(-dev).*
      receiver: default
...
Here we have the 'default' route set – all alerts that don't match any of the rules below will be sent via this route, which sends a Slack notification only.

The additional routes are described as follows:

- match catches alerts with the severity: warning label
- in its nested route, match_re checks the env label – if it contains a "-dev" value, the alert is sent back to the default receiver
- all other alerts with the warning severity will go through the warning receiver

Similarly, the rules on the next level are applied – catch alerts with severity: critical and check them:

- if env matches .*(-stage).* – go to the warning receiver
- if env matches .*(-dev).* – go to the default receiver
- everything else (only env == production with severity == critical is left) – will go through the critical receiver
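To double-check the routing tree, you can use amtool, which ships with Alertmanager (the routes test subcommand is available in recent versions; the env values below are the placeholder ones assumed earlier, so adjust them to your real labels):

# validate the configuration file
$ amtool check-config alertmanager.yml

# which receiver gets a critical alert from Production?
$ amtool config routes test --config.file=alertmanager.yml severity=critical env=rtfm-production
critical

# and the same alert from Dev?
$ amtool config routes test --config.file=alertmanager.yml severity=critical env=rtfm-dev
default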
Using such an approach, you can write routing rules on any labels and with any nesting level to check conditions and select the next routes for alerts.
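For example, a purely hypothetical extension (the team label and the db-oncall receiver are not part of the config above) could notify a dedicated on-call receiver for one team's critical alerts and then, thanks to continue: true, still let the alert fall through to the generic critical route:

...
  routes:
  # hypothetical: notify the DB on-call first for their critical alerts...
  - match:
      severity: critical
      team: db
    receiver: db-oncall
    continue: true
  # ...then keep evaluating, so the alert still reaches the 'critical' receiver
  - match:
      severity: critical
    receiver: critical
...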