
On-call culture and incident management

In this series (10 parts)
  1. What DevOps actually is
  2. The software delivery lifecycle
  3. Agile, Scrum, and Kanban for DevOps teams
  4. Trunk-based development and branching strategies
  5. Environments and promotion strategies
  6. Configuration management
  7. Secrets management
  8. Deployment strategies
  9. On-call culture and incident management
  10. DevOps metrics and measuring maturity

At 2:47 AM, your phone buzzes. PagerDuty. The checkout service is returning 500 errors. Revenue is dropping. Customers are tweeting. You have been on-call for three days and this is the second page tonight.

What happens next depends entirely on how your organization approaches incident management. Do you have a runbook? Is there a clear escalation path? Will anyone blame you for the outage in tomorrow’s meeting?

The answers to these questions determine whether engineers dread on-call or treat it as a normal, manageable part of the job.

What on-call should look like

On-call means you are the first responder when something breaks. It does not mean you are expected to fix everything alone, sacrifice your sleep indefinitely, or be punished when systems fail.

Healthy on-call has these properties:

  • Rotations are fair. Everyone on the team takes turns. Senior engineers do not exempt themselves. New team members get shadowing periods before going solo.
  • Compensation exists. On-call is work. Teams that pay for on-call or provide time off in lieu have lower burnout rates.
  • Escalation is clear. The on-call engineer is not the last line of defense. There is always someone to escalate to.
  • Alerts are actionable. Every page requires a human decision. If the response is always “restart the pod,” that should be automated.
  • Handoffs are documented. When a rotation ends, the outgoing engineer summarizes open issues, recent incidents, and anything the next person should know.

A rotation that pages someone six times a night is not an on-call problem. It is a reliability problem. Fix the system, not the rotation schedule.
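The point that "restart the pod" responses should be automated can be made concrete. One common approach is a Kubernetes liveness probe, which restarts an unhealthy container without paging anyone. This is a hypothetical fragment, not from the original: the container name, image, port, and `/healthz` endpoint are all illustrative.

```yaml
# Hypothetical pod spec fragment: let the platform restart the container
# automatically instead of paging a human to do it.
containers:
  - name: checkout
    image: registry.internal/checkout:1.42.0
    livenessProbe:
      httpGet:
        path: /healthz       # assumed health endpoint
        port: 8080
      periodSeconds: 10      # probe every 10 seconds
      failureThreshold: 3    # restart after ~30 seconds of failures
```

If the "fix" for an alert really is always a restart, a probe like this removes the page entirely; the alert that remains should be about restarts happening too often.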

Alert fatigue

Alert fatigue is the single biggest threat to effective on-call. When engineers receive 50 alerts per shift, they stop reading them carefully. Critical signals drown in noise.

The math is unforgiving. If 90% of alerts are noise and you receive 100 alerts per day, you are training your brain to ignore pages. The 10 real alerts get the same dismissive treatment as the 90 false ones.

Fixing alert fatigue:

  • Delete alerts that never lead to action. If nobody has responded to an alert type in three months, it should not page anyone.
  • Separate pages from notifications. A page wakes you up. A notification goes to Slack. CPU at 70% is a notification. CPU at 95% for five minutes is a page.
  • Set meaningful thresholds. Alert on symptoms (error rate, latency), not causes (CPU, memory). Users feel errors, not CPU spikes.
  • Review alert volume weekly. Track pages per shift. If the number is going up, investigate why. Target fewer than two pages per on-call shift.
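The page-versus-notification split above, including the "95% for five minutes" sustained-threshold rule, can be sketched in code. This is a toy illustration with assumed thresholds and class names, not a real alerting system:

```python
from collections import deque

NOTIFY_THRESHOLD = 0.70    # CPU at 70% -> Slack notification only
PAGE_THRESHOLD = 0.95      # CPU at 95%...
SUSTAIN_SECONDS = 300      # ...sustained for five minutes -> page

class CpuAlerter:
    """Toy alert router: pages only on a sustained breach, notifies otherwise."""

    def __init__(self):
        self.samples = deque()  # (timestamp, cpu) pairs within the window

    def observe(self, ts: float, cpu: float) -> str:
        self.samples.append((ts, cpu))
        # Drop samples older than the sustain window.
        while self.samples and ts - self.samples[0][0] > SUSTAIN_SECONDS:
            self.samples.popleft()
        window = ts - self.samples[0][0]
        if window >= SUSTAIN_SECONDS and all(c >= PAGE_THRESHOLD for _, c in self.samples):
            return "page"      # wake a human
        if cpu >= NOTIFY_THRESHOLD:
            return "notify"    # goes to Slack, nobody wakes up
        return "ok"
```

A single 96% spike yields only a notification; the same reading held for the full five minutes yields a page. The sustain window is the key design choice: it filters transient spikes out of the paging path.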

Incident severity levels

Not every incident is a fire. A clear severity framework helps responders prioritize and communicate.

| Severity | Criteria | Response time | Example |
| --- | --- | --- | --- |
| SEV-1 | Complete service outage, data loss risk, security breach | Immediate, all hands | Payment processing down for all users |
| SEV-2 | Major feature degraded, significant user impact | 15 minutes | Search returning errors for 30% of queries |
| SEV-3 | Minor feature broken, workaround available | 1 hour | Export CSV button failing, manual export works |
| SEV-4 | Cosmetic issue, no functional impact | Next business day | Dashboard chart rendering incorrectly |

The exact levels and criteria vary by organization. What matters is that they exist, everyone knows them, and they map to specific response expectations.

Severity drives two things: who gets paged and how urgently they need to respond. A SEV-1 pages the on-call engineer, their backup, and the engineering manager. A SEV-4 creates a Jira ticket.
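As a sketch, the severity-to-paging mapping described above might be encoded as a routing table. The role names here are hypothetical; the SEV-1 and SEV-4 rows mirror the examples in the text, and the SEV-2/SEV-3 rows are assumptions:

```python
# Hypothetical severity -> response policy. SEV-1 pages on-call, backup,
# and the engineering manager; SEV-4 only files a ticket.
ROUTING = {
    "SEV-1": {"page": ["oncall", "backup", "eng-manager"], "ticket": True},
    "SEV-2": {"page": ["oncall"], "ticket": True},          # assumed
    "SEV-3": {"page": ["oncall"], "ticket": True},          # assumed, low urgency
    "SEV-4": {"page": [], "ticket": True},                  # next business day
}

def route(severity: str) -> dict:
    """Look up who gets paged and whether a ticket is filed."""
    return ROUTING[severity]
```

Encoding the policy as data rather than tribal knowledge means the paging tool, the runbook, and the on-call handoff doc can all reference the same table.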

The incident commander role

For SEV-1 and SEV-2 incidents, one person takes the incident commander (IC) role. The IC does not debug. They coordinate.

IC responsibilities:

  • Declare the incident and set severity
  • Open a communication channel (war room, Slack channel, video call)
  • Assign roles: who is debugging, who is communicating, who is scribing
  • Make decisions: do we roll back? do we scale up? do we page the database team?
  • Provide regular status updates to stakeholders
  • Decide when the incident is resolved

The IC is a traffic controller, not a mechanic. They keep the response organized so that the people debugging can focus on debugging.

sequenceDiagram
  participant MON as Monitoring
  participant OC as On-Call Engineer
  participant IC as Incident Commander
  participant ENG as Engineering Team
  participant COMM as Communications Lead
  participant STAKE as Stakeholders

  MON->>OC: Alert: Checkout 500 error rate > 5%
  OC->>OC: Assess severity (SEV-1)
  OC->>IC: Escalate, declare SEV-1
  IC->>IC: Open incident channel
  IC->>ENG: Page backend and database teams
  IC->>COMM: Assign communications lead
  COMM->>STAKE: Status: Investigating checkout failures
  ENG->>IC: Root cause identified (bad deploy)
  IC->>ENG: Decision: Roll back deployment
  ENG->>IC: Rollback complete, monitoring
  COMM->>STAKE: Status: Fix deployed, monitoring
  IC->>IC: Verify recovery, close incident
  IC->>COMM: Schedule postmortem for tomorrow

Incident escalation flow: from alert to resolution, the IC coordinates while specialists debug.

Runbooks

A runbook is a step-by-step guide for responding to a specific type of incident. When your phone buzzes at 3 AM, you should not need to think creatively. You should follow a checklist.

Good runbooks include:

  • Symptoms: What does this alert mean? What is the user impact?
  • Diagnosis steps: Which dashboards to check. Which logs to query. Which commands to run.
  • Remediation steps: Restart this service. Scale up these pods. Roll back this deployment.
  • Escalation criteria: When to page the database team. When to escalate to SEV-1.
  • Verification: How to confirm the fix worked.

## Runbook: Checkout Service 5xx Spike

### Symptoms
- Error rate on /api/checkout exceeds 5%
- PagerDuty alert: checkout-error-rate-high

### Diagnosis
1. Check deployment history: was there a recent deploy?
   kubectl rollout history deployment/checkout -n production
2. Check database connectivity:
   kubectl exec -it checkout-pod -- pg_isready -h db.internal
3. Check dependent services:
   curl -s http://payment-svc.internal/health
   curl -s http://inventory-svc.internal/health

### Remediation
- If recent deploy: roll back
  kubectl rollout undo deployment/checkout -n production
- If database unreachable: check RDS status in AWS console
- If payment service down: enable checkout fallback mode
  kubectl set env deployment/checkout PAYMENT_FALLBACK=true

### Escalation
- Database issues: page @database-oncall
- Payment service: page @payments-oncall
- If unresolved after 15 minutes: escalate to SEV-1

Runbooks are living documents. Update them after every incident where the runbook was missing a step, had an outdated command, or pointed to a renamed service.

Postmortems without blame

The postmortem is where learning happens. It is also where organizations reveal their true culture. A blame-oriented postmortem teaches engineers to hide mistakes. A blameless postmortem teaches everyone to build safer systems.

Blameless does not mean unaccountable. It means focusing on systemic causes rather than individual errors. The question is never “who caused this?” It is “what conditions allowed this to happen?”

A human will always be the proximate cause of an outage. Someone deployed bad code. Someone misconfigured a firewall. Someone ran a query without a WHERE clause. The interesting question is: why did the system allow that action to cause an outage?

Postmortem structure

  1. Summary. One paragraph describing what happened, when, and the impact.
  2. Timeline. Minute-by-minute account of the incident from detection to resolution.
  3. Root cause analysis. What went wrong at a systemic level. Use “5 whys” or similar frameworks.
  4. Contributing factors. What made the incident worse or detection slower.
  5. What went well. Acknowledge the things that worked. Monitoring that caught the issue. Runbooks that guided the response.
  6. Action items. Specific, assigned, time-boxed improvements. Not “be more careful.”

The 5 whys in practice

Why did checkout fail?
  -> The database connection pool was exhausted.
Why was the pool exhausted?
  -> A new query was holding connections for 30 seconds.
Why was a 30-second query deployed?
  -> The code review did not flag the missing index.
Why did the review miss it?
  -> There is no automated query analysis in the CI pipeline.
Why is there no automated query analysis?
  -> It was never prioritized.

Action item: Add query analysis to CI that flags queries
without index usage on tables over 1M rows.
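That action item could be sketched as a CI check. In practice you would run `EXPLAIN` against a staging database and inspect the plan; here the plan is a string and the table statistics are invented, so treat this as an illustration of the shape of the check, not a working implementation:

```python
import re

ROW_THRESHOLD = 1_000_000
# Assumed table row counts; a real check would read these from pg_class
# or equivalent statistics.
TABLE_ROWS = {"orders": 42_000_000, "coupons": 1_200}

def flags_query(plan: str) -> list:
    """Return names of large tables the plan scans sequentially.

    A sequential scan on a table over the row threshold is the signature
    of the missing-index problem the postmortem identified.
    """
    scanned = re.findall(r"Seq Scan on (\w+)", plan)
    return [t for t in scanned if TABLE_ROWS.get(t, 0) > ROW_THRESHOLD]

plan = "Seq Scan on orders  (cost=0.00..870000.00 rows=42000000)"
print(flags_query(plan))  # ['orders']
```

A check like this, wired into the pipeline to fail the build, would have caught the 30-second query before it reached production.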

Every action item should pass this test: would this action have prevented or reduced the impact of this specific incident? Vague improvements like “improve monitoring” fail this test. “Add an alert when checkout error rate exceeds 2% for three minutes” passes.

Building the culture

Incident management is a cultural practice, not a tooling problem. PagerDuty, OpsGenie, and Grafana OnCall are valuable tools, but they do not create a healthy incident response culture.

Run incident response drills. Game days, chaos engineering, tabletop exercises. Practice responding to incidents when there is no real pressure. Teams that practice recover faster during real incidents.

Celebrate good incident response. Publicly recognize teams that detected issues quickly, communicated effectively, or wrote thorough postmortems. This reinforces the behaviors you want.

Track incident metrics over time. Mean time to detect (MTTD), mean time to resolve (MTTR), and incidents per week. These trends tell you whether your systems and processes are improving.
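As a sketch, MTTD and MTTR fall out of three timestamps per incident: when it started, when it was detected, and when it was resolved. The incident data below is invented for illustration:

```python
from datetime import datetime

# Invented incident records with start, detection, and resolution times.
incidents = [
    {"started": "2024-05-01T02:47", "detected": "2024-05-01T02:51", "resolved": "2024-05-01T03:35"},
    {"started": "2024-05-09T14:00", "detected": "2024-05-09T14:10", "resolved": "2024-05-09T15:00"},
]

def _minutes(a: str, b: str) -> float:
    fmt = "%Y-%m-%dT%H:%M"
    return (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).total_seconds() / 60

def mttd(incs) -> float:
    """Mean time to detect: average of started -> detected."""
    return sum(_minutes(i["started"], i["detected"]) for i in incs) / len(incs)

def mttr(incs) -> float:
    """Mean time to resolve: average of detected -> resolved."""
    return sum(_minutes(i["detected"], i["resolved"]) for i in incs) / len(incs)

print(mttd(incidents), mttr(incidents))  # 7.0 47.0
```

One convention to fix per team: whether MTTR is measured from incident start or from detection. The trend matters more than the absolute number, so the convention just needs to be consistent.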

Teams that invest in postmortem action items see measurable improvement quarter over quarter. The number of SEV-1 incidents drops. MTTR shrinks. On-call becomes less painful.

Make postmortems public. Within the engineering organization, at minimum. Cross-team visibility means one team’s outage teaches every team a lesson. Google, Cloudflare, and GitLab publish postmortems externally. Internal publishing is the starting point.

For a deeper treatment of incident response frameworks from an SRE perspective, see SRE incident response.

What comes next

You have environments, configuration, secrets, deployment strategies, and incident management. The question that ties it all together: how do you know if your DevOps practices are actually working? DORA metrics provide a data-driven answer, measuring deployment frequency, lead time, change failure rate, and mean time to recovery across your organization.
