# Alerting design
Every monitoring system supports alerting. The hard part is not configuring alerts. The hard part is configuring the right alerts. Too many alerts and the on-call engineer learns to ignore them. Too few and real incidents go unnoticed.
Good alerting is a design discipline, not a checkbox.
## Properties of good alerts
An alert is good when it is:

- **Actionable.** Someone can do something about it right now. “Disk at 92%” is actionable: expand the volume or clean up files. “CPU spiked to 80% for 30 seconds” is usually not actionable: it might be a normal traffic burst.
- **Urgent.** It requires attention within the alert’s time window. A warning about certificate expiry in 30 days does not need to page someone at 3 AM. It belongs in a daily report or a ticket.
- **Novel.** It tells you something you did not already know. If the same alert fires every Tuesday during the batch job and you always acknowledge it without action, delete it.
- **Relevant to the recipient.** The database team should not get paged for a frontend JavaScript error. Route alerts to the team that owns the service.
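Per-team routing is usually implemented with a `team` label on each alert and a matching route per team. A minimal sketch — the label name and the receiver names (`db-oncall`, `frontend-oncall`, `catch-all`) are hypothetical and would be defined elsewhere in the Alertmanager config:

```yaml
# Sketch: route on an assumed `team` label so each alert reaches its owners.
route:
  receiver: catch-all        # fallback for alerts with no team label
  routes:
    - match:
        team: database
      receiver: db-oncall
    - match:
        team: frontend
      receiver: frontend-oncall
```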
## Symptom-based vs cause-based alerting
Cause-based alerts fire on infrastructure signals: CPU high, memory low, disk full. They are easy to write but often not actionable because the cause might not affect users.
Symptom-based alerts fire on user-facing signals: error rate high, latency degraded, success rate below SLO. They directly reflect user pain.
| Approach | Example | Problem |
|---|---|---|
| Cause-based | CPU > 90% for 5m | CPU spike might be normal. No user impact. |
| Symptom-based | Error rate > 1% for 5m | Directly indicates user-facing degradation. |
The recommended approach: page on symptoms, investigate causes. Your primary alerts should be “error rate is above threshold” and “latency is above threshold.” Cause-based metrics (CPU, memory, disk) appear on dashboards for investigation but rarely justify a page.
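Symptom metrics like these are typically precomputed as recording rules. A sketch of how `service:http_error_ratio:5m` and a matching latency series might be defined, assuming an `http_requests_total` counter with `status` and `service` labels and an `http_request_duration_seconds` histogram — adjust to your own instrumentation:

```yaml
groups:
  - name: symptom-recording-rules
    rules:
      # Fraction of requests returning 5xx, per service (assumed metric names)
      - record: service:http_error_ratio:5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
      # P99 latency per service, from the assumed request-duration histogram
      - record: service:http_latency_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```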
## Alert routing and escalation
Alerts flow from Prometheus through Alertmanager to notification channels. The routing tree determines who gets notified.
```mermaid
flowchart TD
    Prom[Prometheus] -->|firing alert| AM[Alertmanager]
    AM -->|severity=critical| PD[PagerDuty]
    AM -->|severity=warning| Slack[Slack #alerts-warning]
    AM -->|severity=info| Email[Daily digest email]
    PD -->|no ack in 10min| Esc1[Escalation: Secondary on-call]
    Esc1 -->|no ack in 15min| Esc2[Escalation: Engineering Manager]
    subgraph Routing
        AM
    end
    subgraph Escalation
        PD
        Esc1
        Esc2
    end
```
Critical alerts page through PagerDuty with escalation. Warnings go to Slack. Info-level alerts go to daily digests.
## Alertmanager configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: slack-warnings
  group_by: [alertname, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: slack-warnings
    - match:
        severity: info
      receiver: email-digest

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: "<pagerduty-integration-key>"
        description: "{{ .CommonAnnotations.summary }}"
        details:
          firing: "{{ .Alerts.Firing | len }}"
          dashboard: "https://grafana.example.com/d/red-dashboard"
  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/xxx"
        channel: "#alerts-warning"
        title: "{{ .CommonAnnotations.summary }}"
        text: "{{ .CommonAnnotations.description }}"
  - name: email-digest
    email_configs:
      - to: "team@example.com"
        send_resolved: true
```
### Grouping
`group_by: [alertname, service]` means that if 10 pods of the same service all fire the same alert, you get one notification listing all 10, not 10 separate pages.

`group_wait: 30s` waits 30 seconds after the first alert fires to batch any related alerts into a single notification. This prevents a cascade of pages during an incident.
## Inhibition and silencing

### Inhibition
If the entire cluster is down, you do not need separate alerts for every service in the cluster. Inhibition rules suppress child alerts when a parent alert is firing:
```yaml
inhibit_rules:
  - source_match:
      alertname: ClusterDown
    target_match_re:
      alertname: .+
    equal: [cluster]
```
When `ClusterDown` fires for `cluster=us-east-1`, all other alerts for that cluster are suppressed.
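For an inhibition rule to do anything, something must emit the parent alert. One possible shape for it — the `expr` here is illustrative, not a recommendation; use whatever signal reliably indicates a whole-cluster outage in your environment:

```yaml
# Hypothetical parent alert: fires when no scrape targets in a cluster
# are reachable. The job name and the 2m hold are assumptions.
- alert: ClusterDown
  expr: sum by (cluster) (up{job="node"}) == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "No targets reachable in cluster {{ $labels.cluster }}"
```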
### Silences
Silences temporarily mute alerts during planned maintenance. Create them via the Alertmanager UI or API:
```bash
# Silence all alerts for order-api for 2 hours during deployment
amtool silence add \
  --alertmanager.url=http://alertmanager:9093 \
  --author="deploy-bot" \
  --comment="Rolling deployment of order-api v2.3.1" \
  --duration=2h \
  service="order-api"
```
Always set a duration. Silences without expiry are forgotten and become permanent blind spots.
## Writing effective alert rules

### Error rate alert with a `for` clause
```yaml
- alert: HighErrorRate
  expr: service:http_error_ratio:5m > 0.01
  for: 5m
  labels:
    severity: critical
    service: "{{ $labels.service }}"
  annotations:
    summary: "Error rate above 1% on {{ $labels.service }}"
    description: |
      Error rate is {{ $value | humanizePercentage }}.
      Dashboard: https://grafana.example.com/d/red?var-service={{ $labels.service }}
      Runbook: https://wiki.example.com/runbooks/high-error-rate
```
The `for: 5m` clause prevents transient spikes from paging. The condition must be true continuously for 5 minutes before the alert fires.
### Latency alert
```yaml
- alert: HighP99Latency
  expr: service:http_latency_p99:5m > 2.0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P99 latency above 2s on {{ $labels.service }}"
```
### Resource exhaustion (cause-based, warning only)
```yaml
- alert: DiskSpaceRunningLow
  expr: |
    (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Disk space below 10% on {{ $labels.instance }}"
```
This is a cause-based alert, so it gets warning severity and routes to Slack, not PagerDuty.
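A static threshold can miss fast leaks and nag about slow ones. A common complement is a trend-based rule using `predict_linear`, which fires only when the disk is projected to actually fill; the four-hour horizon and one-hour window here are assumptions to tune:

```yaml
# Sketch: fire when free space, extrapolated from the last hour's trend,
# would hit zero within 4 hours. Horizon and window are tunable assumptions.
- alert: DiskWillFillSoon
  expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Disk on {{ $labels.instance }} projected to fill within 4 hours"
```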
## Alert review hygiene
Alerts degrade over time. Thresholds that made sense six months ago might be too tight or too loose today. Run a monthly alert review:
- List all alerts that fired in the last 30 days. For each one, ask: was it actionable? Did someone take action?
- Delete alerts that were always ignored. If every firing was acknowledged without investigation, the alert has no value.
- Tighten or loosen thresholds. If an alert fires daily and is always a false positive, the threshold is wrong.
- Check for missing alerts. Review recent incidents. Was there an alert for the symptom? If not, add one.
- Verify runbook links. Every critical alert should link to a runbook. Broken links waste time during incidents.
Track alert quality metrics:
```promql
# Distinct alerts that fired in the last 7 days (should be stable or decreasing)
count(count_over_time(ALERTS{alertstate="firing"}[7d]))

# Time to acknowledge (from PagerDuty API, imported as a metric)
pagerduty_incident_ack_seconds
```
A team that pages fewer than 2 times per on-call shift has healthy alerting. A team that pages 10+ times per shift has an alert fatigue problem that needs immediate attention.
## What comes next
Alerts tell you something is wrong. SLIs, SLOs, and error budgets give you a framework for deciding how wrong is too wrong, and when to prioritize reliability work over features.