Incident response in practice
In this series (11 parts)
- What SRE is
- Reliability fundamentals
- SLIs, SLOs, and error budgets in practice
- Toil reduction and automation
- Capacity planning
- Performance testing and load testing
- Chaos engineering
- Incident response in practice
- Postmortems and learning from failure
- Production readiness reviews
- Reliability patterns for services
It is 2:47 AM. PagerDuty fires. Error rates on the checkout service jumped from 0.02% to 15%. Customers cannot complete purchases. Revenue is dropping by the minute.
What happens in the next 30 minutes depends entirely on whether your team has a practiced incident response process. With one, you triage fast, communicate clearly, and restore service. Without one, people scramble in different directions while the outage drags on.
From alert to incident
Not every alert is an incident. Alerts fire for transient spikes, brief network blips, and momentary resource pressure. Most resolve on their own.
An alert becomes an incident when it meets one or more of these criteria:
- Customer-facing impact is confirmed
- Multiple alerts fire for the same underlying issue
- The on-call engineer cannot resolve it alone within 15 minutes
- A critical business function is degraded or unavailable
The on-call engineer makes the initial call. When in doubt, declare the incident. It is cheaper to stand down a false alarm than to discover an hour later that you should have escalated.
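The criteria above are easy to encode so the on-call engineer (or a paging bot) can apply them consistently at 3 AM. This is a minimal sketch; the field names and the 15-minute threshold mirror the list above, but the shape of the assessment object is an illustrative assumption, not a real tool's API.

```python
from dataclasses import dataclass

@dataclass
class AlertAssessment:
    # Illustrative fields mirroring the incident criteria above.
    customer_impact_confirmed: bool
    related_alerts_firing: int      # alerts firing for the same underlying issue
    minutes_unresolved: int
    critical_function_degraded: bool

def should_declare_incident(a: AlertAssessment) -> bool:
    """An alert becomes an incident when any one criterion is met."""
    return (
        a.customer_impact_confirmed
        or a.related_alerts_firing > 1
        or a.minutes_unresolved >= 15
        or a.critical_function_degraded
    )
```

Note the `or`: one matching criterion is enough. When in doubt, the human still wins the argument with the bot, in favor of declaring.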
Severity levels
Severity determines the urgency, staffing, and communication cadence for the incident. Define levels in advance so there is no debate during the crisis.
| Level | Definition | Response time | Communication cadence |
|---|---|---|---|
| SEV1 | Critical business function down, revenue impact | Immediate, all hands | Every 15 to 30 minutes |
| SEV2 | Major feature degraded, significant user impact | Within 15 minutes | Every 30 to 60 minutes |
| SEV3 | Minor feature degraded, limited user impact | Within 1 hour | Hourly or as needed |
| SEV4 | Cosmetic issue or internal tooling, no user impact | Next business day | Daily summary |
SEV1 and SEV2 activate the full incident response process. SEV3 and SEV4 are tracked but handled through normal on-call procedures.
Severity can change during an incident. A SEV3 that worsens gets upgraded. A SEV1 where impact is less than initially assessed gets downgraded. The Incident Commander makes this call.
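Encoding the severity table as data (rather than tribal knowledge) lets paging and status tooling read it directly. A sketch, with values taken from the table above; the dictionary shape is an assumption for illustration:

```python
# Severity definitions from the table above, encoded as data.
# update_minutes is the (min, max) communication cadence in minutes.
SEVERITY = {
    "SEV1": {"response": "immediate, all hands", "update_minutes": (15, 30)},
    "SEV2": {"response": "within 15 minutes",    "update_minutes": (30, 60)},
    "SEV3": {"response": "within 1 hour",        "update_minutes": (60, None)},
    "SEV4": {"response": "next business day",    "update_minutes": None},
}

def activates_full_response(sev: str) -> bool:
    """SEV1 and SEV2 trigger the full incident response process."""
    return sev in ("SEV1", "SEV2")
```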
Incident roles
Clear roles prevent the “everybody investigates, nobody communicates” pattern. Three roles form the core of incident response.
Incident Commander (IC)
The Incident Commander owns the incident. They do not debug. They coordinate. Their responsibilities:
- Declare the incident and set severity
- Assemble the response team
- Open the incident communication channel
- Track investigation threads and delegate tasks
- Make decisions about mitigation strategies
- Decide when to escalate or de-escalate severity
- Declare the incident resolved
- Schedule the postmortem
The IC role rotates. Every senior engineer should be trained in it and practice it regularly. A team that depends on one person to run incidents has a single point of failure.
Communications Lead
The Communications Lead handles all external and internal communication. They post status page updates, send stakeholder notifications, and keep the incident channel summary current.
This role exists so the IC and Operations Lead can focus on resolution without fielding “what is happening?” messages from five different Slack channels.
Operations Lead
The Operations Lead drives the technical investigation. They coordinate debugging efforts, suggest mitigation strategies, and execute changes. In practice, this is often the most senior engineer available who knows the affected system.
For large incidents, the Operations Lead may delegate to multiple investigation threads, each with an assigned engineer.
The incident lifecycle
```mermaid
sequenceDiagram
    participant Alert as Monitoring
    participant OnCall as On-Call Engineer
    participant IC as Incident Commander
    participant Team as Response Team
    participant Comms as Comms Lead
    Alert->>OnCall: Alert fires
    OnCall->>OnCall: Assess severity
    OnCall->>IC: Declare incident
    IC->>Team: Assemble response team
    IC->>Comms: Open incident channel
    Comms->>Comms: Post initial update
    Team->>Team: Investigate and mitigate
    Comms->>Comms: Stakeholder updates
    Team->>IC: Service restored
    IC->>IC: Close incident
    IC->>Team: Schedule postmortem
```
The incident lifecycle from alert to postmortem. Clear handoffs between roles keep the process moving.
Step 1: Detection
Monitoring systems detect the anomaly and fire an alert. The on-call engineer receives the page and begins assessment.
Good detection depends on good observability. If your alerts are noisy, engineers develop alert fatigue and slow their response. Tune alerts to fire on symptoms (elevated error rates, latency spikes) rather than causes (high CPU). For more on building effective alerting, see incident management.
Step 2: Declaration
The on-call engineer assesses severity and declares the incident. This triggers the formal process.
Declaration involves:
- Creating a dedicated incident channel (e.g., #inc-2026-04-20-checkout)
- Posting an initial summary: what is broken, who is affected, current severity
- Paging the Incident Commander and Communications Lead
- Starting the incident timer
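The declaration steps above are worth scripting so nobody types channel names by hand under pressure. A hedged sketch: the `chat` and `pager` objects are hypothetical stand-ins for your Slack and PagerDuty integrations, not real client libraries.

```python
from datetime import datetime, timezone

def declare_incident(chat, pager, service: str, summary: str, severity: str) -> str:
    """Create the incident channel, post the initial summary, and page roles.

    `chat` and `pager` are hypothetical wrappers around your chat and
    paging tools; the method names here are illustrative.
    """
    now = datetime.now(timezone.utc)
    # Channel naming convention from the text, e.g. #inc-2026-04-20-checkout
    channel = f"#inc-{now:%Y-%m-%d}-{service}"
    chat.create_channel(channel)
    chat.post(channel, f"[{severity}] {summary} (declared {now:%H:%M} UTC)")
    pager.page(role="incident-commander", channel=channel)
    pager.page(role="communications-lead", channel=channel)
    return channel
```

One script, four of the five declaration steps done; the incident timer starts at the timestamp in the initial post.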
Step 3: Assembly
The IC identifies which teams and engineers are needed. For a checkout service incident, that might include the checkout team, the payments team, and the infrastructure team.
Do not page everyone. Page the minimum set of people needed to investigate and resolve. You can always escalate later.
Step 4: Investigation and mitigation
This is where the actual debugging happens. The Operations Lead coordinates investigation threads.
Mitigation before root cause. This is the most important principle in incident response. Stop the bleeding first. If a bad deploy caused the issue, roll back. If a database is overwhelmed, enable the circuit breaker. If traffic is spiking, scale up.
You do not need to understand why something broke to fix the immediate impact. Root cause analysis happens in the postmortem, not during the incident.
Common mitigation actions:
- Roll back the last deployment
- Restart affected services
- Scale up compute resources
- Enable feature flags to disable problematic features
- Failover to a secondary region
- Block abusive traffic sources
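A simple way to make "mitigation before root cause" actionable is to keep the symptom-to-first-action mapping as data the on-call engineer can query. The symptom keys and wording below are illustrative, drawn from the list above:

```python
# First mitigation action per symptom class; a real runbook would call
# deploy, feature-flag, and autoscaling APIs instead of returning strings.
MITIGATIONS = {
    "bad_deploy": "roll back the last deployment",
    "db_overload": "enable the circuit breaker",
    "traffic_spike": "scale up compute resources",
    "broken_feature": "disable via feature flag",
    "region_failure": "fail over to a secondary region",
    "abusive_traffic": "block the traffic source",
}

def first_action(symptom: str) -> str:
    """Return the stop-the-bleeding action, or escalate if unrecognized."""
    return MITIGATIONS.get(symptom, "escalate to the Operations Lead")
```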
Step 5: Communication
The Communications Lead posts regular updates. For SEV1 incidents, updates go out every 15 to 30 minutes. Each update includes:
- Current status (investigating, identified, mitigating, resolved)
- What we know so far
- What we are doing about it
- Estimated time to resolution (if known)
- Next update time
Even “no new information” is an update. Stakeholders who hear nothing assume the worst.
Step 6: Resolution
The IC declares the incident resolved when:
- Customer-facing impact has ended
- Metrics have returned to steady state
- The mitigation is stable (not a temporary band-aid)
Resolution does not mean root cause is identified. It means the service is restored. The IC documents the resolution time and schedules the postmortem.
Blameless postmortems
A postmortem is a structured review conducted after every SEV1 and SEV2 incident. The purpose is learning, not blame.
The word “blameless” is critical. If engineers fear punishment for honest reporting, they will hide information. Hidden information means you cannot learn from incidents. The failures repeat.
Postmortem structure
Hold the postmortem within 3 to 5 business days of the incident. Use a consistent template:
1. Summary. One paragraph describing the incident: what happened, when, and how severe.
2. Timeline. A chronological list of events from first alert to resolution. Include timestamps.
14:47 UTC - Alert fires: checkout error rate > 5%
14:49 UTC - On-call acknowledges page
14:52 UTC - SEV1 declared, incident channel opened
14:55 UTC - IC pages checkout and payments teams
15:03 UTC - Root cause identified: bad config push to payment gateway
15:07 UTC - Config rollback initiated
15:12 UTC - Error rate dropping, service recovering
15:18 UTC - Metrics at steady state, incident resolved
3. Impact. Quantify the damage. How many users were affected? How much revenue was lost? How long was the service degraded?
4. Root cause. Describe the chain of events that led to the incident. Be specific. “Config change” is not a root cause. “A config change to the payment gateway timeout was pushed without review because the config pipeline bypasses code review for YAML files” is a root cause.
5. Contributing factors. What made the incident worse or delayed recovery? Slow alerting, missing runbooks, lack of monitoring on the affected component.
6. What went well. Acknowledge what worked. Fast detection, smooth role handoffs, effective communication. This reinforces good practices.
7. Action items. Concrete, assigned tasks with deadlines. Each action item prevents the specific incident or class of incidents from recurring.
Action item tracking
Action items are the entire point of the postmortem. Without follow-through, postmortems are just documentation theater.
Track action items in your team’s task tracker (Jira, Linear, GitHub Issues). Tag them as postmortem items. Review completion in weekly team meetings.
Good action items are specific and measurable:
- “Add config change review requirement to the payment gateway pipeline” (owner: Platform team, due: April 30)
- “Create runbook for payment gateway failures” (owner: Checkout team, due: May 7)
- “Add alert for config deployment failures” (owner: SRE team, due: April 25)
Bad action items are vague: “improve monitoring,” “be more careful,” “add more tests.” These never get done and do not prevent recurrence.
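The difference between good and bad action items can be enforced mechanically: reject any item missing a task, an owner, or a due date. A tiny lint sketch; the dictionary shape is an illustrative assumption, not a specific tracker's schema:

```python
def is_actionable(item: dict) -> bool:
    """A postmortem action item must name a task, an owner, and a deadline."""
    return all(item.get(field) for field in ("task", "owner", "due"))
```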
Building the muscle
Incident response is a skill. It improves with practice. Three practices build the muscle.
Incident drills. Run mock incidents quarterly. Inject a simulated failure, activate the response process, and practice role assignments. This is where chaos engineering and incident response intersect. Use chaos experiments as the trigger for response drills.
Rotation. Rotate the IC role across senior engineers. Everyone should be comfortable running an incident. This prevents a single point of failure and builds organizational resilience.
Review previous incidents. New team members should read past postmortems. They contain lessons about system architecture, failure modes, and team processes that no documentation can capture.
The incident response checklist
Keep this checklist accessible to every on-call engineer:
- Confirm the alert is real (check dashboards, not just the alert)
- Assess severity using the defined criteria
- Declare the incident and create the channel
- Page the IC and Communications Lead
- Post initial summary in the incident channel
- Investigate symptoms, not root cause
- Mitigate first, diagnose later
- Communicate at the defined cadence
- Declare resolved when metrics return to normal
- Schedule the postmortem within 5 business days
Print it. Tape it to your monitor. At 3 AM when adrenaline is high, a checklist beats memory every time.
What comes next
This article covers the structured response process for incidents. The broader topic of incident management, including on-call practices, escalation policies, and tooling, is covered in incident management. Combine both to build a complete incident handling capability for your organization.