# Alerting design
Every monitoring system supports alerting. The hard part is not configuring alerts. The hard part is configuring the right alerts. Too many alerts and the on-call engineer learns to ignore them. Too few and real incidents go unnoticed.
Good alerting is a design discipline, not a checkbox.
## Properties of good alerts
An alert is good when it is:

- **Actionable.** Someone can do something about it right now. “Disk at 92%” is actionable: expand the volume or clean up files. “CPU spiked to 80% for 30 seconds” is usually not actionable: it might be a normal traffic burst.
- **Urgent.** It requires attention within the alert’s time window. A warning about certificate expiry in 30 days does not need to page someone at 3 AM. It belongs in a daily report or a ticket.
- **Novel.** It tells you something you did not already know. If the same alert fires every Tuesday during the batch job and you always acknowledge it without action, delete it.
- **Relevant to the recipient.** The database team should not get paged for a frontend JavaScript error. Route alerts to the team that owns the service.
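Per-team routing is usually implemented with a `team` label on each alert and a matching route per team. A minimal sketch — the label name and the receiver names (`db-oncall`, `frontend-oncall`, `catch-all`) are hypothetical and would be defined elsewhere in the Alertmanager config:

```yaml
# Sketch: route on an assumed `team` label so each alert reaches its owners.
route:
  receiver: catch-all        # fallback for alerts with no team label
  routes:
    - match:
        team: database
      receiver: db-oncall
    - match:
        team: frontend
      receiver: frontend-oncall
```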
## Symptom-based vs cause-based alerting
Cause-based alerts fire on infrastructure signals: CPU high, memory low, disk full. They are easy to write but often not actionable because the cause might not affect users.
Symptom-based alerts fire on user-facing signals: error rate high, latency degraded, success rate below SLO. They directly reflect user pain.
| Approach | Example | Problem |
|---|---|---|
| Cause-based | CPU > 90% for 5m | CPU spike might be normal. No user impact. |
| Symptom-based | Error rate > 1% for 5m | Directly indicates user-facing degradation. |
The recommended approach: page on symptoms, investigate causes. Your primary alerts should be “error rate is above threshold” and “latency is above threshold.” Cause-based metrics (CPU, memory, disk) appear on dashboards for investigation but rarely justify a page.
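Symptom metrics like these are typically precomputed as recording rules. A sketch of how `service:http_error_ratio:5m` and a matching latency series might be defined, assuming an `http_requests_total` counter with `status` and `service` labels and an `http_request_duration_seconds` histogram — adjust to your own instrumentation:

```yaml
groups:
  - name: symptom-recording-rules
    rules:
      # Fraction of requests returning 5xx, per service (assumed metric names)
      - record: service:http_error_ratio:5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
      # P99 latency per service, from the assumed request-duration histogram
      - record: service:http_latency_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```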
## Alert routing and escalation
Alerts flow from Prometheus through Alertmanager to notification channels. The routing tree determines who gets notified.
```mermaid
flowchart TD
    Prom[Prometheus] -->|firing alert| AM[Alertmanager]
    AM -->|severity=critical| PD[PagerDuty]
    AM -->|severity=warning| Slack[Slack #alerts-warning]
    AM -->|severity=info| Email[Daily digest email]
    PD -->|no ack in 10min| Esc1[Escalation: Secondary on-call]
    Esc1 -->|no ack in 15min| Esc2[Escalation: Engineering Manager]
    subgraph Routing
        AM
    end
    subgraph Escalation
        PD
        Esc1
        Esc2
    end
```
Critical alerts page through PagerDuty with escalation. Warnings go to Slack. Info-level alerts go to daily digests.
## Alertmanager configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  receiver: slack-warnings
  group_by: [alertname, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: slack-warnings
    - match:
        severity: info
      receiver: email-digest

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: "<pagerduty-integration-key>"
        description: "{{ .CommonAnnotations.summary }}"
        details:
          firing: "{{ .Alerts.Firing | len }}"
          dashboard: "https://grafana.example.com/d/red-dashboard"
  - name: slack-warnings
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/xxx"
        channel: "#alerts-warning"
        title: "{{ .CommonAnnotations.summary }}"
        text: "{{ .CommonAnnotations.description }}"
  - name: email-digest
    email_configs:
      - to: "team@example.com"
        send_resolved: true
```
### Grouping
`group_by: [alertname, service]` means that if 10 pods of the same service all fire the same alert, you get one notification listing all 10, not 10 separate pages.

`group_wait: 30s` waits 30 seconds after the first alert fires to batch any related alerts into a single notification. This prevents a cascade of pages during an incident.
## Inhibition and silencing

### Inhibition
If the entire cluster is down, you do not need separate alerts for every service in the cluster. Inhibition rules suppress child alerts when a parent alert is firing:
```yaml
inhibit_rules:
  - source_match:
      alertname: ClusterDown
    target_match_re:
      alertname: .+
    equal: [cluster]
```
When `ClusterDown` fires for `cluster=us-east-1`, all other alerts for that cluster are suppressed.
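For an inhibition rule to do anything, something must emit the parent alert. One possible shape for it — the `expr` here is illustrative, not a recommendation; use whatever signal reliably indicates a whole-cluster outage in your environment:

```yaml
# Hypothetical parent alert: fires when no scrape targets in a cluster
# are reachable. The job name and the 2m hold are assumptions.
- alert: ClusterDown
  expr: sum by (cluster) (up{job="node"}) == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "No targets reachable in cluster {{ $labels.cluster }}"
```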
### Silences
Silences temporarily mute alerts during planned maintenance. Create them via the Alertmanager UI or API:
```bash
# Silence all alerts for order-api for 2 hours during deployment
amtool silence add \
  --alertmanager.url=http://alertmanager:9093 \
  --author="deploy-bot" \
  --comment="Rolling deployment of order-api v2.3.1" \
  --duration=2h \
  service="order-api"
```
Always set a duration. Silences without expiry are forgotten and become permanent blind spots.
## Writing effective alert rules

### Error rate alert with a `for` clause
```yaml
- alert: HighErrorRate
  expr: service:http_error_ratio:5m > 0.01
  for: 5m
  labels:
    severity: critical
    service: "{{ $labels.service }}"
  annotations:
    summary: "Error rate above 1% on {{ $labels.service }}"
    description: |
      Error rate is {{ $value | humanizePercentage }}.
      Dashboard: https://grafana.example.com/d/red?var-service={{ $labels.service }}
      Runbook: https://wiki.example.com/runbooks/high-error-rate
```
The `for: 5m` clause prevents transient spikes from paging. The condition must be true continuously for 5 minutes before the alert fires.
### Latency alert
```yaml
- alert: HighP99Latency
  expr: service:http_latency_p99:5m > 2.0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P99 latency above 2s on {{ $labels.service }}"
```
### Resource exhaustion (cause-based, warning only)
```yaml
- alert: DiskSpaceRunningLow
  expr: |
    (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Disk space below 10% on {{ $labels.instance }}"
```
This is a cause-based alert, so it gets warning severity and routes to Slack, not PagerDuty.
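A static threshold can miss fast leaks and nag about slow ones. A common complement is a trend-based rule using `predict_linear`, which fires only when the disk is projected to actually fill; the four-hour horizon and one-hour window here are assumptions to tune:

```yaml
# Sketch: fire when free space, extrapolated from the last hour's trend,
# would hit zero within 4 hours. Horizon and window are tunable assumptions.
- alert: DiskWillFillSoon
  expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Disk on {{ $labels.instance }} projected to fill within 4 hours"
```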
## Alert review hygiene
Alerts degrade over time. Thresholds that made sense six months ago might be too tight or too loose today. Run a monthly alert review:
- List all alerts that fired in the last 30 days. For each one, ask: was it actionable? Did someone take action?
- Delete alerts that were always ignored. If every firing was acknowledged without investigation, the alert has no value.
- Tighten or loosen thresholds. If an alert fires daily and is always a false positive, the threshold is wrong.
- Check for missing alerts. Review recent incidents. Was there an alert for the symptom? If not, add one.
- Verify runbook links. Every critical alert should link to a runbook. Broken links waste time during incidents.
Track alert quality metrics:
```promql
# Distinct alerts that fired in the last 7 days (should be stable or decreasing)
count(count_over_time(ALERTS{alertstate="firing"}[7d]))

# Time to acknowledge (from PagerDuty API, imported as a metric)
pagerduty_incident_ack_seconds
```
A team that pages fewer than 2 times per on-call shift has healthy alerting. A team that pages 10+ times per shift has an alert fatigue problem that needs immediate attention.
## What comes next
Alerts tell you something is wrong. SLIs, SLOs, and error budgets give you a framework for deciding how wrong is too wrong, and when to prioritize reliability work over features.