On-call tooling and runbooks

In this series (10 parts)
  1. The three pillars of observability
  2. Structured logging
  3. Metrics and Prometheus
  4. Grafana and dashboards
  5. Distributed tracing
  6. Log aggregation pipelines
  7. Alerting design
  8. SLIs, SLOs, and error budgets
  9. Real User Monitoring and synthetic testing
  10. On-call tooling and runbooks

Your monitoring detects the problem. Your alerts notify the right person. Now what? At 3 AM, half awake, staring at a PagerDuty notification, the on-call engineer needs to know exactly what to check, what to try, and when to escalate. That is what runbooks provide.

Runbook structure

A runbook is a document that guides an engineer through diagnosing and resolving a specific alert. Every critical alert should link to a runbook. The runbook follows a consistent structure:

1. Alert description

What does this alert mean? What SLI is affected? One or two sentences.

2. Symptoms

What does the user experience? This helps the on-call engineer verify the problem is real and assess severity.

3. Investigation steps

Ordered checklist of things to check. Each step includes the exact command or dashboard link:

```markdown
## Investigation

1. Check the error rate dashboard:
   [Grafana: Order API RED Dashboard](https://grafana.example.com/d/red?var-service=order-api)

2. Look at recent deployments:

   kubectl -n production rollout history deployment/order-api

3. Check downstream dependency health:

   up{job="payment-api"}
   up{job="inventory-api"}

4. Search logs for error details:

   {service="order-api", level="error"} |= "exception" | json

5. Check database connection pool:

   order_api_db_pool_active / order_api_db_pool_max
```

4. Mitigation steps

Actions that restore service while the root cause is investigated:

```markdown
## Mitigation

### If caused by a bad deployment:
Roll back to the previous version:

kubectl -n production rollout undo deployment/order-api

Verify the rollback:

kubectl -n production rollout status deployment/order-api

### If caused by database connection exhaustion:
Restart the affected pods to reset connection pools:

kubectl -n production rollout restart deployment/order-api

### If caused by downstream payment-api failure:
Enable the circuit breaker fallback:

kubectl -n production set env deployment/order-api PAYMENT_FALLBACK=true
```

5. Escalation

When and who to contact if the on-call engineer cannot resolve the issue:

```markdown
## Escalation

If the issue is not resolved within 30 minutes:
- Page the secondary on-call: PagerDuty escalation policy "order-team"
- For database issues: Page the DBA on-call
- For payment provider issues: Contact Stripe support (account ID in 1Password)
```

Complete runbook example

Here is a full runbook for a high error rate alert:

```markdown
# Runbook: HighErrorRate - order-api

## Alert
The order-api is returning errors above the 1% SLO threshold.
This directly impacts customers attempting to place orders.

## Symptoms
- Users see "Something went wrong" on the checkout page.
- Order completion rate drops in the business metrics dashboard.

## Severity
Critical - revenue impact. Resolve within 15 minutes or escalate.

## Investigation
1. Open the RED dashboard:
   https://grafana.example.com/d/red?var-service=order-api

2. Identify the error type:
   {service="order-api", level="error"} | json | line_format "{{ .error_code }}"

3. Check if a deployment happened in the last 30 minutes:
   kubectl -n production rollout history deployment/order-api

4. Check downstream services:
   - payment-api health: https://grafana.example.com/d/red?var-service=payment-api
   - inventory-api health: https://grafana.example.com/d/red?var-service=inventory-api
   - postgres: order_api_db_pool_active / order_api_db_pool_max

5. Check for resource exhaustion:
   - Memory: container_memory_working_set_bytes{pod=~"order-api.*"}
   - CPU: rate(container_cpu_usage_seconds_total{pod=~"order-api.*"}[5m])

## Mitigation
- Bad deployment -> rollback: kubectl -n production rollout undo deployment/order-api
- DB connection exhaustion -> restart: kubectl -n production rollout restart deployment/order-api
- Downstream failure -> enable fallback: kubectl -n production set env deployment/order-api PAYMENT_FALLBACK=true
- Traffic spike -> scale up: kubectl -n production scale deployment/order-api --replicas=10

## Escalation
- 15min unresolved: page secondary on-call via PagerDuty
- Database issues: page DBA on-call
- Infrastructure issues: page platform-team on-call
```

Linking alerts to runbooks

Every alert annotation should include a runbook URL:

```yaml
- alert: HighErrorRate
  expr: service:http_error_ratio:5m > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 1% on {{ $labels.service }}"
    runbook: "https://wiki.example.com/runbooks/{{ $labels.service }}/high-error-rate"
```

The PagerDuty or Slack notification includes the runbook link. The on-call engineer clicks directly to the relevant document without searching a wiki.

Store runbooks in a location that is:

  • Accessible during outages. If your wiki is hosted on the same infrastructure that is failing, you cannot read runbooks during the outage. Use an external wiki or keep a cached copy.
  • Version controlled. Runbooks in a Git repository get the same review process as code.
  • Searchable. Engineers should find runbooks by alert name, service name, or symptom.
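A Git repository also makes the "runbook coverage" goal enforceable in CI. As a rough sketch, a text-based check can fail the build when an alert rule has no `runbook:` annotation (the `check_runbook_coverage` helper and the `rules/alerts.yml` path are illustrative, and this is simple pattern matching, not a real YAML parser):

```shell
#!/bin/bash
# Heuristic CI check: every "- alert:" block in a rules file should
# carry a "runbook:" annotation. Text matching only, not a YAML parser.
set -euo pipefail

check_runbook_coverage() {
  # Print the name of each alert with no runbook: line before the next alert
  awk '
    /- alert:/ {
      if (in_alert && !has_runbook) print name
      in_alert = 1; has_runbook = 0; name = $3
    }
    /runbook:/ { has_runbook = 1 }
    END { if (in_alert && !has_runbook) print name }
  ' "$1"
}

# Usage in CI (path is illustrative):
#   missing=$(check_runbook_coverage rules/alerts.yml)
#   [ -z "$missing" ] || { echo "Alerts missing runbooks:"; echo "$missing"; exit 1; }
```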

Automating runbook steps

Manual runbook steps are a stepping stone. Once a mitigation is well-understood and safe, automate it:

Level 1: Copy-paste commands

The runbook provides exact commands. The engineer runs them manually.

Level 2: One-click scripts

Wrap common mitigations in scripts:

```bash
#!/bin/bash
# rollback-order-api.sh
set -euo pipefail

SERVICE="order-api"
NAMESPACE="production"

echo "Current revision:"
kubectl -n "$NAMESPACE" rollout history "deployment/$SERVICE" | tail -3

echo "Rolling back..."
kubectl -n "$NAMESPACE" rollout undo "deployment/$SERVICE"

echo "Waiting for rollout..."
kubectl -n "$NAMESPACE" rollout status "deployment/$SERVICE" --timeout=120s

echo "Verifying error rate..."
sleep 30
# Check if error rate has dropped (requires promtool or curl to Prometheus)
ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query?query=service:http_error_ratio:5m{service=\"$SERVICE\"}" | jq -r '.data.result[0].value[1]')
echo "Current error rate: $ERROR_RATE"
```

Level 3: Auto-remediation

The alert triggers an automated response. This is appropriate only for well-understood, safe, and reversible actions:

```yaml
# Example: auto-scale on high latency
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_request_duration_seconds_p99
        target:
          type: AverageValue
          averageValue: "500m"
```

Not everything should be auto-remediated. Rollbacks, data migrations, and anything that changes state should require human confirmation.
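One lightweight way to keep a human in the loop is a wrapper that refuses to run a state-changing command until the operator types an explicit confirmation. A minimal sketch (the `confirm_and_run` helper is hypothetical, not a standard tool):

```shell
#!/bin/bash
# Sketch of a confirmation gate for state-changing mitigations such as
# rollbacks. confirm_and_run is an illustrative helper, not a real tool.
set -euo pipefail

confirm_and_run() {
  local description="$1"; shift
  # Prompt goes to stderr so stdout stays clean for the wrapped command
  printf 'About to run: %s\nType "yes" to continue: ' "$description" >&2
  read -r answer
  if [ "$answer" = "yes" ]; then
    "$@"                      # run the actual command
  else
    echo "Aborted." >&2
    return 1
  fi
}

# Example: gate a rollback behind explicit confirmation
# confirm_and_run "rollback order-api in production" \
#   kubectl -n production rollout undo deployment/order-api
```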

Keeping runbooks accurate

Runbooks rot fast. A runbook written six months ago references dashboards that moved, commands for a deployment system you replaced, and escalation contacts who left the team.

Practices that keep runbooks fresh:

  1. Post-incident review updates. After every incident, update the runbook for the alert that fired. Add steps that worked, remove steps that were irrelevant.

  2. Runbook drills. Monthly, pick a random runbook and have an engineer walk through it on a staging environment. If any step is wrong or unclear, fix it immediately.

  3. Ownership. Each runbook has an owner (the team that owns the service). The owner reviews runbooks quarterly.

  4. Freshness tracking. Add a last_reviewed field to every runbook. Flag any runbook not reviewed in 90 days.

```yaml
# Runbook metadata header
metadata:
  service: order-api
  alert: HighErrorRate
  owner: order-team
  last_reviewed: 2026-03-15
  review_cadence: quarterly
  reviewers:
    - alice
    - bob
```
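A small script can enforce the 90-day rule by reading that header. The sketch below assumes GNU `date` and runbooks stored as `runbooks/*.md` (both assumptions):

```shell
#!/bin/bash
# Flag runbooks whose last_reviewed date is older than 90 days.
# Assumes GNU date (-d) and a "last_reviewed: YYYY-MM-DD" metadata line.
set -euo pipefail

MAX_AGE_DAYS=90

is_stale() {
  local runbook="$1" reviewed
  reviewed=$(grep -m1 'last_reviewed:' "$runbook" | awk '{print $2}' || true)
  if [ -z "$reviewed" ]; then
    return 0                  # no review date at all counts as stale
  fi
  local age_days=$(( ( $(date +%s) - $(date -d "$reviewed" +%s) ) / 86400 ))
  [ "$age_days" -gt "$MAX_AGE_DAYS" ]
}

for rb in runbooks/*.md; do   # directory layout is an assumption
  [ -e "$rb" ] || continue
  if is_stale "$rb"; then
    echo "STALE: $rb (last_reviewed more than ${MAX_AGE_DAYS} days ago)"
  fi
done
```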

On-call dashboards

The on-call engineer needs a single starting point. An on-call dashboard provides:

Row 1: Active alerts. All currently firing alerts with severity and duration.

Row 2: SLO status. Error budget remaining for each critical service. Red means budget is nearly exhausted.

Row 3: Recent changes. Deployments, config changes, and infrastructure events in the last 24 hours. Most incidents correlate with recent changes.

Row 4: Service health overview. RED metrics for every service in a grid. Green/yellow/red status at a glance.

```promql
# Recent deployments (using a deploy marker metric)
changes(deploy_timestamp{namespace="production"}[24h])

# Error budget remaining (0.001 is the 0.1% error budget of a 99.9% SLO)
1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d])) by (service)
  /
  sum(rate(http_requests_total[30d])) by (service)
) / 0.001
```

This dashboard is the first thing the on-call engineer opens when paged. It answers: “What is broken? What changed? How bad is it?”

On-call health metrics

Track the health of your on-call process itself:

| Metric | Target | Why |
| --- | --- | --- |
| Pages per shift | < 2 | More than 2 indicates alert noise |
| Time to acknowledge | < 5 minutes | Measures on-call responsiveness |
| Time to resolve | < 30 minutes | Measures runbook effectiveness |
| False positive rate | < 10% | High rate erodes trust in alerts |
| Runbook coverage | 100% of critical alerts | Every critical alert needs a runbook |

Review these metrics monthly. If pages per shift are climbing, the team is accumulating alert debt. If time to resolve is growing, runbooks are stale or incomplete.

What comes next

This article completes the Monitoring and Observability series. You now have a foundation spanning all three telemetry pillars, aggregation and storage, alerting and SLO design, user-facing monitoring, and incident response processes. The next step is applying these practices to your own systems: start with RED metrics and one SLO per critical service, then expand from there.
