
Incident response in practice

In this series (11 parts)
  1. What SRE is
  2. Reliability fundamentals
  3. SLIs, SLOs, and error budgets in practice
  4. Toil reduction and automation
  5. Capacity planning
  6. Performance testing and load testing
  7. Chaos engineering
  8. Incident response in practice
  9. Postmortems and learning from failure
  10. Production readiness reviews
  11. Reliability patterns for services

It is 2:47 AM. PagerDuty fires. Error rates on the checkout service jumped from 0.02% to 15%. Customers cannot complete purchases. Revenue is dropping by the minute.

What happens in the next 30 minutes depends entirely on whether your team has a practiced incident response process. With one, you triage fast, communicate clearly, and restore service. Without one, people scramble in different directions while the outage drags on.

From alert to incident

Not every alert is an incident. Alerts fire for transient spikes, brief network blips, and momentary resource pressure. Most resolve on their own.

An alert becomes an incident when it meets one or more of these criteria:

  • Customer-facing impact is confirmed
  • Multiple alerts fire for the same underlying issue
  • The on-call engineer cannot resolve it alone within 15 minutes
  • A critical business function is degraded or unavailable

The on-call engineer makes the initial call. When in doubt, declare the incident. It is cheaper to stand down a false alarm than to discover an hour later that you should have escalated.
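Teams that codify these criteria can make the declaration decision nearly mechanical. Below is a minimal sketch; the `AlertContext` fields and the 15-minute threshold mirror the list above, but the structure itself is an illustration, not a standard:

```python
from dataclasses import dataclass

@dataclass
class AlertContext:
    customer_impact_confirmed: bool   # criterion 1
    related_alert_count: int          # criterion 2: alerts for the same issue
    minutes_unresolved: int           # criterion 3
    critical_function_degraded: bool  # criterion 4

def should_declare_incident(ctx: AlertContext) -> bool:
    """Declare when any single criterion is met; when in doubt, declare."""
    return (
        ctx.customer_impact_confirmed
        or ctx.related_alert_count > 1
        or ctx.minutes_unresolved >= 15
        or ctx.critical_function_degraded
    )
```

Note the `or` logic: one criterion is enough, which matches the bias toward declaring.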

Severity levels

Severity determines the urgency, staffing, and communication cadence for the incident. Define levels in advance so there is no debate during the crisis.

Level | Definition                                         | Response time      | Communication cadence
SEV1  | Critical business function down, revenue impact    | Immediate, all hands | Every 15 to 30 minutes
SEV2  | Major feature degraded, significant user impact    | Within 15 minutes    | Every 30 to 60 minutes
SEV3  | Minor feature degraded, limited user impact        | Within 1 hour        | Hourly or as needed
SEV4  | Cosmetic issue or internal tooling, no user impact | Next business day    | Daily summary

SEV1 and SEV2 activate the full incident response process. SEV3 and SEV4 are tracked but handled through normal on-call procedures.

Severity can change during an incident. A SEV3 that worsens gets upgraded. A SEV1 where impact is less than initially assessed gets downgraded. The Incident Commander makes this call.
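Encoding the severity table as data keeps the policy unambiguous during a crisis and lets tooling enforce it. A sketch, with field names chosen for illustration:

```python
from typing import NamedTuple

class SeverityPolicy(NamedTuple):
    response: str
    update_cadence: str
    full_process: bool  # SEV1/SEV2 activate the full incident response process

SEVERITY_LEVELS = {
    "SEV1": SeverityPolicy("immediate, all hands", "every 15 to 30 minutes", True),
    "SEV2": SeverityPolicy("within 15 minutes", "every 30 to 60 minutes", True),
    "SEV3": SeverityPolicy("within 1 hour", "hourly or as needed", False),
    "SEV4": SeverityPolicy("next business day", "daily summary", False),
}
```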

Incident roles

Clear roles prevent the “everybody investigates, nobody communicates” pattern. Three roles form the core of incident response.

Incident Commander (IC)

The Incident Commander owns the incident. They do not debug. They coordinate. Their responsibilities:

  • Declare the incident and set severity
  • Assemble the response team
  • Open the incident communication channel
  • Track investigation threads and delegate tasks
  • Make decisions about mitigation strategies
  • Decide when to escalate or de-escalate severity
  • Declare the incident resolved
  • Schedule the postmortem

The IC role rotates. Every senior engineer should be trained and practice the role. A team that depends on one person to run incidents is a single point of failure.

Communications Lead

The Communications Lead handles all external and internal communication. They post status page updates, send stakeholder notifications, and keep the incident channel summary current.

This role exists so the IC and Operations Lead can focus on resolution without fielding “what is happening?” messages from five different Slack channels.

Operations Lead

The Operations Lead drives the technical investigation. They coordinate debugging efforts, suggest mitigation strategies, and execute changes. In practice, this is often the most senior engineer available who knows the affected system.

For large incidents, the Operations Lead may delegate to multiple investigation threads, each with an assigned engineer.

The incident lifecycle

sequenceDiagram
  participant Alert as Monitoring
  participant OnCall as On-Call Engineer
  participant IC as Incident Commander
  participant Team as Response Team
  participant Comms as Comms Lead

  Alert->>OnCall: Alert fires
  OnCall->>OnCall: Assess severity
  OnCall->>IC: Declare incident
  IC->>Team: Assemble response team
  IC->>Comms: Open incident channel
  Comms->>Comms: Post initial update
  Team->>Team: Investigate and mitigate
  Comms->>Comms: Stakeholder updates
  Team->>IC: Service restored
  IC->>IC: Close incident
  IC->>Team: Schedule postmortem

The incident lifecycle from alert to postmortem. Clear handoffs between roles keep the process moving.

Step 1: Detection

Monitoring systems detect the anomaly and fire an alert. The on-call engineer receives the page and begins assessment.

Good detection depends on good observability. If your alerts are noisy, engineers develop alert fatigue and slow their response. Tune alerts to fire on symptoms (elevated error rates, latency spikes) rather than causes (high CPU). For more on building effective alerting, see incident management.
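A symptom-based alert condition can be as simple as an error-rate threshold over an evaluation window. A hypothetical sketch; the 5% threshold is an example, not a recommendation:

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests in the evaluation window."""
    return 0.0 if total == 0 else errors / total

def should_page(errors: int, total: int, threshold: float = 0.05) -> bool:
    # Page on the symptom (user-visible error rate),
    # not the cause (CPU, memory, queue depth).
    return error_rate(errors, total) > threshold
```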

Step 2: Declaration

The on-call engineer assesses severity and declares the incident. This triggers the formal process.

Declaration involves:

  1. Creating a dedicated incident channel (e.g., #inc-2026-04-20-checkout)
  2. Posting an initial summary: what is broken, who is affected, current severity
  3. Paging the Incident Commander and Communications Lead
  4. Starting the incident timer
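The channel-naming convention in step 1 is easy to automate so nobody improvises a name at 3 AM. A sketch following the `#inc-YYYY-MM-DD-service` format from the example above:

```python
from datetime import date

def incident_channel_name(service: str, declared_on: date) -> str:
    """Build a channel name like #inc-2026-04-20-checkout."""
    return f"#inc-{declared_on.isoformat()}-{service}"
```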

Step 3: Assembly

The IC identifies which teams and engineers are needed. For a checkout service incident, that might include the checkout team, the payments team, and the infrastructure team.

Do not page everyone. Page the minimum set of people needed to investigate and resolve. You can always escalate later.

Step 4: Investigation and mitigation

This is where the actual debugging happens. The Operations Lead coordinates investigation threads.

Mitigation before root cause. This is the most important principle in incident response. Stop the bleeding first. If a bad deploy caused the issue, roll back. If a database is overwhelmed, enable the circuit breaker. If traffic is spiking, scale up.

You do not need to understand why something broke to fix the immediate impact. Root cause analysis happens in the postmortem, not during the incident.

Common mitigation actions:

  • Roll back the last deployment
  • Restart affected services
  • Scale up compute resources
  • Enable feature flags to disable problematic features
  • Failover to a secondary region
  • Block abusive traffic sources
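One way to keep these first-line mitigations at hand during an incident is a playbook lookup with escalation as the default. A hypothetical sketch; the condition names are assumptions:

```python
MITIGATION_PLAYBOOK = {
    "bad_deploy": "roll back the last deployment",
    "crashed_process": "restart affected services",
    "resource_exhaustion": "scale up compute resources",
    "broken_feature": "disable the feature behind its flag",
    "regional_outage": "fail over to the secondary region",
    "abusive_traffic": "block the offending traffic sources",
}

def first_mitigation(condition: str) -> str:
    # Default to escalation rather than improvising under pressure.
    return MITIGATION_PLAYBOOK.get(condition, "escalate to the Operations Lead")
```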

Step 5: Communication

The Communications Lead posts regular updates. For SEV1 incidents, updates go out every 15 to 30 minutes. Each update includes:

  • Current status (investigating, identified, mitigating, resolved)
  • What we know so far
  • What we are doing about it
  • Estimated time to resolution (if known)
  • Next update time

Even “no new information” is an update. Stakeholders who hear nothing assume the worst.
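A fixed template keeps updates consistent even when the Communications Lead is under pressure. A minimal sketch of a formatter for the five fields above; the field names are illustrative:

```python
from typing import Optional

def format_status_update(status: str, known: str, doing: str,
                         eta: Optional[str] = None,
                         next_update: str = "within 30 minutes") -> str:
    """Render one stakeholder update in the fixed five-field format."""
    return "\n".join([
        f"Status: {status}",
        f"What we know: {known}",
        f"What we are doing: {doing}",
        f"ETA: {eta or 'unknown'}",
        f"Next update: {next_update}",
    ])
```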

Step 6: Resolution

The IC declares the incident resolved when:

  • Customer-facing impact has ended
  • Metrics have returned to steady state
  • The mitigation is stable (not a temporary band-aid)

Resolution does not mean root cause is identified. It means the service is restored. The IC documents the resolution time and schedules the postmortem.

Blameless postmortems

A postmortem is a structured review conducted after every SEV1 and SEV2 incident. The purpose is learning, not blame.

The word “blameless” is critical. If engineers fear punishment for honest reporting, they will hide information. Hidden information means you cannot learn from incidents. The failures repeat.

Postmortem structure

Hold the postmortem within 3 to 5 business days of the incident. Use a consistent template:

1. Summary. One paragraph describing the incident: what happened, when, and how severe.

2. Timeline. A chronological list of events from first alert to resolution. Include timestamps.

14:47 UTC - Alert fires: checkout error rate > 5%
14:49 UTC - On-call acknowledges page
14:52 UTC - SEV1 declared, incident channel opened
14:55 UTC - IC pages checkout and payments teams
15:03 UTC - Root cause identified: bad config push to payment gateway
15:07 UTC - Config rollback initiated
15:12 UTC - Error rate dropping, service recovering
15:18 UTC - Metrics at steady state, incident resolved
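Timestamps like these make the impact section quantifiable. For example, time to mitigate and time to resolve fall straight out of the timeline (the date here is illustrative, borrowed from the channel-name example earlier):

```python
from datetime import datetime

first_alert = datetime.fromisoformat("2026-04-20T14:47")
mitigated = datetime.fromisoformat("2026-04-20T15:07")  # config rollback initiated
resolved = datetime.fromisoformat("2026-04-20T15:18")   # metrics at steady state

time_to_mitigate = (mitigated - first_alert).total_seconds() / 60  # 20.0 minutes
time_to_resolve = (resolved - first_alert).total_seconds() / 60    # 31.0 minutes
```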

3. Impact. Quantify the damage. How many users were affected? How much revenue was lost? How long was the service degraded?

4. Root cause. Describe the chain of events that led to the incident. Be specific. “Config change” is not a root cause. “A config change to the payment gateway timeout was pushed without review because the config pipeline bypasses code review for YAML files” is a root cause.

5. Contributing factors. What made the incident worse or delayed recovery? Slow alerting, missing runbooks, lack of monitoring on the affected component.

6. What went well. Acknowledge what worked. Fast detection, smooth role handoffs, effective communication. This reinforces good practices.

7. Action items. Concrete, assigned tasks with deadlines. Each action item prevents the specific incident or class of incidents from recurring.

Action item tracking

Action items are the entire point of the postmortem. Without follow-through, postmortems are just documentation theater.

Track action items in your team’s task tracker (Jira, Linear, GitHub Issues). Tag them as postmortem items. Review completion in weekly team meetings.

Good action items are specific and measurable:

  • “Add config change review requirement to the payment gateway pipeline” (owner: Platform team, due: April 30)
  • “Create runbook for payment gateway failures” (owner: Checkout team, due: May 7)
  • “Add alert for config deployment failures” (owner: SRE team, due: April 25)

Bad action items are vague: “improve monitoring,” “be more careful,” “add more tests.” These never get done and do not prevent recurrence.
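A lightweight validation step can reject vague action items before they enter the tracker. A hypothetical sketch, assuming items are recorded with `task`, `owner`, and `due` fields:

```python
def is_actionable(item: dict) -> bool:
    """An action item needs a concrete task, a named owner, and a due date."""
    vague = {"improve monitoring", "be more careful", "add more tests"}
    task = item.get("task", "").strip()
    return (
        bool(task)
        and task.lower() not in vague
        and bool(item.get("owner"))
        and bool(item.get("due"))
    )
```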

Building the muscle

Incident response is a skill. It improves with practice. Three practices build the muscle.

Incident drills. Run mock incidents quarterly. Inject a simulated failure, activate the response process, and practice role assignments. This is where chaos engineering and incident response intersect. Use chaos experiments as the trigger for response drills.

Rotation. Rotate the IC role across senior engineers. Everyone should be comfortable running an incident. This prevents a single point of failure and builds organizational resilience.

Review previous incidents. New team members should read past postmortems. They contain lessons about system architecture, failure modes, and team processes that no documentation can capture.

The incident response checklist

Keep this checklist accessible to every on-call engineer:

  1. Confirm the alert is real (check dashboards, not just the alert)
  2. Assess severity using the defined criteria
  3. Declare the incident and create the channel
  4. Page the IC and Communications Lead
  5. Post initial summary in the incident channel
  6. Investigate symptoms, not root cause
  7. Mitigate first, diagnose later
  8. Communicate at the defined cadence
  9. Declare resolved when metrics return to normal
  10. Schedule the postmortem within 5 business days

Print it. Tape it to your monitor. At 3 AM when adrenaline is high, a checklist beats memory every time.

What comes next

This article covers the structured response process for incidents. The broader topic of incident management, including on-call practices, escalation policies, and tooling, is covered in incident management. Combine both to build a complete incident handling capability for your organization.
