
Incident response in practice

In this series (11 parts)
  1. What SRE is
  2. Reliability fundamentals
  3. SLIs, SLOs, and error budgets in practice
  4. Toil reduction and automation
  5. Capacity planning
  6. Performance testing and load testing
  7. Chaos engineering
  8. Incident response in practice
  9. Postmortems and learning from failure
  10. Production readiness reviews
  11. Reliability patterns for services

It is 2:47 AM. PagerDuty fires. Error rates on the checkout service jumped from 0.02% to 15%. Customers cannot complete purchases. Revenue is dropping by the minute.

What happens in the next 30 minutes depends entirely on whether your team has a practiced incident response process. With one, you triage fast, communicate clearly, and restore service. Without one, people scramble in different directions while the outage drags on.

From alert to incident

Not every alert is an incident. Alerts fire for transient spikes, brief network blips, and momentary resource pressure. Most resolve on their own.

An alert becomes an incident when it meets one or more of these criteria:

  • Customer-facing impact is confirmed
  • Multiple alerts fire for the same underlying issue
  • The on-call engineer cannot resolve it alone within 15 minutes
  • A critical business function is degraded or unavailable

The on-call engineer makes the initial call. When in doubt, declare the incident. It is cheaper to stand down a false alarm than to discover an hour later that you should have escalated.
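Teams that codify these criteria can make the declaration decision nearly mechanical. Below is a minimal sketch; the `AlertContext` fields and the 15-minute threshold mirror the list above, but the structure itself is an illustration, not a standard:

```python
from dataclasses import dataclass

@dataclass
class AlertContext:
    customer_impact_confirmed: bool   # criterion 1
    related_alert_count: int          # criterion 2: alerts for the same issue
    minutes_unresolved: int           # criterion 3
    critical_function_degraded: bool  # criterion 4

def should_declare_incident(ctx: AlertContext) -> bool:
    """Declare when any single criterion is met; when in doubt, declare."""
    return (
        ctx.customer_impact_confirmed
        or ctx.related_alert_count > 1
        or ctx.minutes_unresolved >= 15
        or ctx.critical_function_degraded
    )
```

Note the `or` logic: one criterion is enough, which matches the bias toward declaring.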

Severity levels

Severity determines the urgency, staffing, and communication cadence for the incident. Define levels in advance so there is no debate during the crisis.

Level | Definition                                         | Response time      | Communication cadence
SEV1  | Critical business function down, revenue impact    | Immediate, all hands | Every 15 to 30 minutes
SEV2  | Major feature degraded, significant user impact    | Within 15 minutes    | Every 30 to 60 minutes
SEV3  | Minor feature degraded, limited user impact        | Within 1 hour        | Hourly or as needed
SEV4  | Cosmetic issue or internal tooling, no user impact | Next business day    | Daily summary

SEV1 and SEV2 activate the full incident response process. SEV3 and SEV4 are tracked but handled through normal on-call procedures.

Severity can change during an incident. A SEV3 that worsens gets upgraded. A SEV1 where impact is less than initially assessed gets downgraded. The Incident Commander makes this call.
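Encoding the severity table as data keeps the policy unambiguous during a crisis and lets tooling enforce it. A sketch, with field names chosen for illustration:

```python
from typing import NamedTuple

class SeverityPolicy(NamedTuple):
    response: str
    update_cadence: str
    full_process: bool  # SEV1/SEV2 activate the full incident response process

SEVERITY_LEVELS = {
    "SEV1": SeverityPolicy("immediate, all hands", "every 15 to 30 minutes", True),
    "SEV2": SeverityPolicy("within 15 minutes", "every 30 to 60 minutes", True),
    "SEV3": SeverityPolicy("within 1 hour", "hourly or as needed", False),
    "SEV4": SeverityPolicy("next business day", "daily summary", False),
}
```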

Incident roles

Clear roles prevent the “everybody investigates, nobody communicates” pattern. Three roles form the core of incident response.

Incident Commander (IC)

The Incident Commander owns the incident. They do not debug. They coordinate. Their responsibilities:

  • Declare the incident and set severity
  • Assemble the response team
  • Open the incident communication channel
  • Track investigation threads and delegate tasks
  • Make decisions about mitigation strategies
  • Decide when to escalate or de-escalate severity
  • Declare the incident resolved
  • Schedule the postmortem

The IC role rotates. Every senior engineer should be trained and practice the role. A team that depends on one person to run incidents is a single point of failure.

Communications Lead

The Communications Lead handles all external and internal communication. They post status page updates, send stakeholder notifications, and keep the incident channel summary current.

This role exists so the IC and Operations Lead can focus on resolution without fielding “what is happening?” messages from five different Slack channels.

Operations Lead

The Operations Lead drives the technical investigation. They coordinate debugging efforts, suggest mitigation strategies, and execute changes. In practice, this is often the most senior engineer available who knows the affected system.

For large incidents, the Operations Lead may delegate to multiple investigation threads, each with an assigned engineer.

The incident lifecycle

sequenceDiagram
  participant Alert as Monitoring
  participant OnCall as On-Call Engineer
  participant IC as Incident Commander
  participant Team as Response Team
  participant Comms as Comms Lead

  Alert->>OnCall: Alert fires
  OnCall->>OnCall: Assess severity
  OnCall->>IC: Declare incident
  IC->>Team: Assemble response team
  IC->>Comms: Open incident channel
  Comms->>Comms: Post initial update
  Team->>Team: Investigate and mitigate
  Comms->>Comms: Stakeholder updates
  Team->>IC: Service restored
  IC->>IC: Close incident
  IC->>Team: Schedule postmortem

The incident lifecycle from alert to postmortem. Clear handoffs between roles keep the process moving.

Step 1: Detection

Monitoring systems detect the anomaly and fire an alert. The on-call engineer receives the page and begins assessment.

Good detection depends on good observability. If your alerts are noisy, engineers develop alert fatigue and slow their response. Tune alerts to fire on symptoms (elevated error rates, latency spikes) rather than causes (high CPU). For more on building effective alerting, see incident management.
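A symptom-based alert condition can be as simple as an error-rate threshold over an evaluation window. A hypothetical sketch; the 5% threshold is an example, not a recommendation:

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests in the evaluation window."""
    return 0.0 if total == 0 else errors / total

def should_page(errors: int, total: int, threshold: float = 0.05) -> bool:
    # Page on the symptom (user-visible error rate),
    # not the cause (CPU, memory, queue depth).
    return error_rate(errors, total) > threshold
```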

Step 2: Declaration

The on-call engineer assesses severity and declares the incident. This triggers the formal process.

Declaration involves:

  1. Creating a dedicated incident channel (e.g., #inc-2026-04-20-checkout)
  2. Posting an initial summary: what is broken, who is affected, current severity
  3. Paging the Incident Commander and Communications Lead
  4. Starting the incident timer
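The channel-naming convention in step 1 is easy to automate so nobody improvises a name at 3 AM. A sketch following the `#inc-YYYY-MM-DD-service` format from the example above:

```python
from datetime import date

def incident_channel_name(service: str, declared_on: date) -> str:
    """Build a channel name like #inc-2026-04-20-checkout."""
    return f"#inc-{declared_on.isoformat()}-{service}"
```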

Step 3: Assembly

The IC identifies which teams and engineers are needed. For a checkout service incident, that might include the checkout team, the payments team, and the infrastructure team.

Do not page everyone. Page the minimum set of people needed to investigate and resolve. You can always escalate later.

Step 4: Investigation and mitigation

This is where the actual debugging happens. The Operations Lead coordinates investigation threads.

Mitigation before root cause. This is the most important principle in incident response. Stop the bleeding first. If a bad deploy caused the issue, roll back. If a database is overwhelmed, enable the circuit breaker. If traffic is spiking, scale up.

You do not need to understand why something broke to fix the immediate impact. Root cause analysis happens in the postmortem, not during the incident.

Common mitigation actions:

  • Roll back the last deployment
  • Restart affected services
  • Scale up compute resources
  • Enable feature flags to disable problematic features
  • Failover to a secondary region
  • Block abusive traffic sources
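One way to keep these first-line mitigations at hand during an incident is a playbook lookup with escalation as the default. A hypothetical sketch; the condition names are assumptions:

```python
MITIGATION_PLAYBOOK = {
    "bad_deploy": "roll back the last deployment",
    "crashed_process": "restart affected services",
    "resource_exhaustion": "scale up compute resources",
    "broken_feature": "disable the feature behind its flag",
    "regional_outage": "fail over to the secondary region",
    "abusive_traffic": "block the offending traffic sources",
}

def first_mitigation(condition: str) -> str:
    # Default to escalation rather than improvising under pressure.
    return MITIGATION_PLAYBOOK.get(condition, "escalate to the Operations Lead")
```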

Step 5: Communication

The Communications Lead posts regular updates. For SEV1 incidents, updates go out every 15 to 30 minutes. Each update includes:

  • Current status (investigating, identified, mitigating, resolved)
  • What we know so far
  • What we are doing about it
  • Estimated time to resolution (if known)
  • Next update time

Even “no new information” is an update. Stakeholders who hear nothing assume the worst.
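A fixed template keeps updates consistent even when the Communications Lead is under pressure. A minimal sketch of a formatter for the five fields above; the field names are illustrative:

```python
from typing import Optional

def format_status_update(status: str, known: str, doing: str,
                         eta: Optional[str] = None,
                         next_update: str = "within 30 minutes") -> str:
    """Render one stakeholder update in the fixed five-field format."""
    return "\n".join([
        f"Status: {status}",
        f"What we know: {known}",
        f"What we are doing: {doing}",
        f"ETA: {eta or 'unknown'}",
        f"Next update: {next_update}",
    ])
```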

Step 6: Resolution

The IC declares the incident resolved when:

  • Customer-facing impact has ended
  • Metrics have returned to steady state
  • The mitigation is stable (not a temporary band-aid)

Resolution does not mean root cause is identified. It means the service is restored. The IC documents the resolution time and schedules the postmortem.

Blameless postmortems

A postmortem is a structured review conducted after every SEV1 and SEV2 incident. The purpose is learning, not blame.

The word “blameless” is critical. If engineers fear punishment for honest reporting, they will hide information. Hidden information means you cannot learn from incidents. The failures repeat.

Postmortem structure

Hold the postmortem within 3 to 5 business days of the incident. Use a consistent template:

1. Summary. One paragraph describing the incident: what happened, when, and how severe.

2. Timeline. A chronological list of events from first alert to resolution. Include timestamps.

14:47 UTC - Alert fires: checkout error rate > 5%
14:49 UTC - On-call acknowledges page
14:52 UTC - SEV1 declared, incident channel opened
14:55 UTC - IC pages checkout and payments teams
15:03 UTC - Root cause identified: bad config push to payment gateway
15:07 UTC - Config rollback initiated
15:12 UTC - Error rate dropping, service recovering
15:18 UTC - Metrics at steady state, incident resolved
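Timestamps like these make the impact section quantifiable. For example, time to mitigate and time to resolve fall straight out of the timeline (the date here is illustrative, borrowed from the channel-name example earlier):

```python
from datetime import datetime

first_alert = datetime.fromisoformat("2026-04-20T14:47")
mitigated = datetime.fromisoformat("2026-04-20T15:07")  # config rollback initiated
resolved = datetime.fromisoformat("2026-04-20T15:18")   # metrics at steady state

time_to_mitigate = (mitigated - first_alert).total_seconds() / 60  # 20.0 minutes
time_to_resolve = (resolved - first_alert).total_seconds() / 60    # 31.0 minutes
```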

3. Impact. Quantify the damage. How many users were affected? How much revenue was lost? How long was the service degraded?

4. Root cause. Describe the chain of events that led to the incident. Be specific. “Config change” is not a root cause. “A config change to the payment gateway timeout was pushed without review because the config pipeline bypasses code review for YAML files” is a root cause.

5. Contributing factors. What made the incident worse or delayed recovery? Slow alerting, missing runbooks, lack of monitoring on the affected component.

6. What went well. Acknowledge what worked. Fast detection, smooth role handoffs, effective communication. This reinforces good practices.

7. Action items. Concrete, assigned tasks with deadlines. Each action item prevents the specific incident or class of incidents from recurring.

Action item tracking

Action items are the entire point of the postmortem. Without follow-through, postmortems are just documentation theater.

Track action items in your team’s task tracker (Jira, Linear, GitHub Issues). Tag them as postmortem items. Review completion in weekly team meetings.

Good action items are specific and measurable:

  • “Add config change review requirement to the payment gateway pipeline” (owner: Platform team, due: April 30)
  • “Create runbook for payment gateway failures” (owner: Checkout team, due: May 7)
  • “Add alert for config deployment failures” (owner: SRE team, due: April 25)

Bad action items are vague: “improve monitoring,” “be more careful,” “add more tests.” These never get done and do not prevent recurrence.
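A lightweight validation step can reject vague action items before they enter the tracker. A hypothetical sketch, assuming items are recorded with `task`, `owner`, and `due` fields:

```python
def is_actionable(item: dict) -> bool:
    """An action item needs a concrete task, a named owner, and a due date."""
    vague = {"improve monitoring", "be more careful", "add more tests"}
    task = item.get("task", "").strip()
    return (
        bool(task)
        and task.lower() not in vague
        and bool(item.get("owner"))
        and bool(item.get("due"))
    )
```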

Building the muscle

Incident response is a skill. It improves with practice. Three practices build the muscle.

Incident drills. Run mock incidents quarterly. Inject a simulated failure, activate the response process, and practice role assignments. This is where chaos engineering and incident response intersect. Use chaos experiments as the trigger for response drills.

Rotation. Rotate the IC role across senior engineers. Everyone should be comfortable running an incident. This prevents a single point of failure and builds organizational resilience.

Review previous incidents. New team members should read past postmortems. They contain lessons about system architecture, failure modes, and team processes that no documentation can capture.

The incident response checklist

Keep this checklist accessible to every on-call engineer:

  1. Confirm the alert is real (check dashboards, not just the alert)
  2. Assess severity using the defined criteria
  3. Declare the incident and create the channel
  4. Page the IC and Communications Lead
  5. Post initial summary in the incident channel
  6. Investigate symptoms, not root cause
  7. Mitigate first, diagnose later
  8. Communicate at the defined cadence
  9. Declare resolved when metrics return to normal
  10. Schedule the postmortem within 5 business days

Print it. Tape it to your monitor. At 3 AM when adrenaline is high, a checklist beats memory every time.

What comes next

This article covers the structured response process for incidents. The broader topic of incident management, including on-call practices, escalation policies, and tooling, is covered in incident management. Combine both to build a complete incident handling capability for your organization.
