
Postmortems and learning from failure

In this series (11 parts)
  1. What SRE is
  2. Reliability fundamentals
  3. SLIs, SLOs, and error budgets in practice
  4. Toil reduction and automation
  5. Capacity planning
  6. Performance testing and load testing
  7. Chaos engineering
  8. Incident response in practice
  9. Postmortems and learning from failure
  10. Production readiness reviews
  11. Reliability patterns for services

Every incident contains a lesson. The question is whether your organization actually learns it. Postmortems exist to extract those lessons, but only if you build a culture that values honesty over blame. This article covers how to run postmortems that produce real improvements instead of empty documents nobody reads.

Blameless postmortem as culture

A postmortem is not a form you fill out after an outage. It is a practice. A habit. A signal that your organization treats failure as an opportunity rather than a punishment.

Blameless culture means that the people closest to an incident can describe exactly what happened without fear of retaliation. This is harder than it sounds. Humans default to finding someone responsible. Leadership has to actively resist that instinct, repeatedly, until the team believes it.

Blameless does not mean accountable-less. People still own their work. The difference is that you investigate the system conditions that led to a mistake, not the character of the person who made it.

Why blame kills learning

When people expect punishment, they hide information. They minimize their role. They avoid volunteering details that might make them look bad. The postmortem becomes a sanitized narrative that misses the real causes.

Consider two scenarios after a database outage:

  • With blame: “The engineer ran a migration without testing it.” Everyone nods. The engineer feels terrible. Nobody asks why the migration pipeline lacks a staging step.
  • Without blame: “The migration ran against production because our deployment tooling doesn’t enforce a staging gate. The engineer followed the documented process, which had a gap.” Now you fix the tooling.

The second scenario produces a systemic fix. The first produces nothing except resentment.

The five whys technique

Five whys is a simple method: keep asking “why” until you reach a systemic cause. Start with the observable symptom and drill down.

  1. Why did the site go down? The database ran out of connections.
  2. Why did connections exhaust? A new feature opened connections without closing them.
  3. Why did this reach production? The code review didn’t catch the connection leak.
  4. Why didn’t the review catch it? The reviewer was unfamiliar with the database library.
  5. Why was an unfamiliar reviewer assigned? We have no ownership mapping for database-heavy code.

Five whys works well for simple causal chains. It breaks down when incidents have multiple contributing causes, which is most of the time.

Limitations of five whys

The technique assumes a single linear chain of causes. Real incidents rarely work that way. You can also reach very different conclusions depending on which “why” branch you follow at step two or three. Five whys is a useful starting point, not a complete analysis framework.

Contributing factors analysis

Most incidents have multiple causes that combined to produce the failure. Contributing factors analysis maps all the conditions that had to be true for the incident to occur.

Instead of asking “what was the root cause?”, ask “what conditions existed that allowed this to happen?” You typically find a mix of:

  • Technical factors (missing monitoring, inadequate capacity)
  • Process factors (unclear runbook, no staging gate)
  • Organizational factors (understaffed on-call rotation, knowledge silos)
  • Environmental factors (unexpected traffic spike, third-party API change)

No single factor is “the” cause. All of them contributed. Fixing any one of them might have prevented this specific incident. Fixing several of them makes your system broadly more resilient.
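One way to make contributing factors usable later is to record them as structured data rather than free text. The sketch below is a minimal, hypothetical model (the class names, categories, and example factors are assumptions, not part of any standard tooling) that captures each factor's category and whether it was a known risk, as the template section later recommends.

```python
from dataclasses import dataclass
from enum import Enum

class FactorCategory(Enum):
    TECHNICAL = "technical"
    PROCESS = "process"
    ORGANIZATIONAL = "organizational"
    ENVIRONMENTAL = "environmental"

@dataclass
class ContributingFactor:
    description: str
    category: FactorCategory
    known_risk: bool  # was this a known risk, or a surprise?

# Hypothetical factors from a database-outage postmortem
factors = [
    ContributingFactor("No staging gate in the migration pipeline",
                       FactorCategory.PROCESS, known_risk=False),
    ContributingFactor("No alert on connection-pool saturation",
                       FactorCategory.TECHNICAL, known_risk=True),
    ContributingFactor("No ownership mapping for database-heavy code",
                       FactorCategory.ORGANIZATIONAL, known_risk=False),
]

# Group factors by category for the postmortem document
by_category: dict[str, list[str]] = {}
for f in factors:
    by_category.setdefault(f.category.value, []).append(f.description)
```

Structured factors like these also feed the searchable archive described later: tags and categories can be aggregated across incidents instead of being locked inside prose.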

Postmortem process flow

The postmortem lifecycle starts during the incident and continues until action items are complete.

graph TD
  A["Incident detected"] --> B["Incident resolved"]
  B --> C["Schedule postmortem within 48 hours"]
  C --> D["Incident lead drafts timeline"]
  D --> E["Team reviews and adds context"]
  E --> F["Identify contributing factors"]
  F --> G["Define action items"]
  G --> H["Postmortem review meeting"]
  H --> I["Publish and share"]
  I --> J["Track action items to completion"]
  J --> K["Review in weekly ops meeting"]

Postmortem process from incident detection through action item completion.

Schedule the postmortem within 48 hours of resolution. Memories fade fast. If you wait a week, people will reconstruct events rather than recall them, and reconstruction introduces bias.

Postmortem template walkthrough

A good template guides the writer without constraining them. Here are the essential sections.

Summary

Two to three sentences. What happened, when, and what was the customer impact. Anyone in the company should understand this section without context.

Impact

Quantify the damage. How many users were affected? What was the duration? Did you violate any SLOs? Include numbers, not just qualitative descriptions. “5,200 users received errors for 34 minutes” is better than “some users experienced issues.”

Timeline

A chronological list of events from the first signal to full resolution. Use timestamps. Include both automated signals (alerts firing, metrics crossing thresholds) and human actions (who did what, when). The timeline is the backbone of the analysis.
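Merging automated signals and human actions into one chronological list is mechanical once both are timestamped. A minimal sketch (all event data below is hypothetical, invented for illustration):

```python
from datetime import datetime

# Hypothetical automated signals (alerts, threshold crossings), UTC timestamps
automated = [
    ("2024-03-01T14:02:00", "Alert: error rate > 5% on api-frontend"),
    ("2024-03-01T14:05:00", "Alert: db connection pool at 100%"),
]

# Hypothetical human actions during the response
human = [
    ("2024-03-01T14:07:00", "On-call acknowledges page"),
    ("2024-03-01T14:21:00", "Rollback of release 2024-03-01.2 started"),
    ("2024-03-01T14:36:00", "Error rate back under SLO; incident resolved"),
]

# Merge both streams into a single chronological timeline
timeline = sorted(automated + human, key=lambda e: datetime.fromisoformat(e[0]))
for ts, event in timeline:
    print(f"{ts}  {event}")
```

Keeping the two streams separate until the merge step makes it easy to spot detection gaps, for example a long delay between the first alert and the first human action.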

Contributing factors

List every condition that contributed to the incident. For each factor, note whether it was a known risk or a surprise. Group them into categories: technical, process, organizational.

Action items

Concrete steps to prevent recurrence or reduce impact. More on this in the next section.

Lessons learned

What went well during the response? What could have gone better? What surprised you? This section captures institutional knowledge that doesn’t fit neatly into action items.

Action item quality

Bad action items: “Improve monitoring.” “Be more careful.” “Add tests.”

Good action items are specific, owned, time-bound, and tracked.

  • Specific. Bad: "Fix the deploy process." Good: "Add a staging gate to the migration pipeline that blocks production deploys without a successful staging run."
  • Owned. Bad: "Someone should look into this." Good: "Alex will implement the staging gate."
  • Time-bound. Bad: "When we get to it." Good: "Complete by March 15."
  • Tracked. Bad: written in a doc nobody reads. Good: filed as a ticket in the sprint backlog.

Every postmortem should produce between three and seven action items. Fewer than three means you didn’t dig deep enough. More than seven means you’re trying to fix everything at once and will finish nothing.

Track action items in your regular project management tool, not in the postmortem document. The postmortem links to the tickets. Review completion status in your weekly operations meeting.

Prioritizing action items

Not all action items are equal. Categorize them:

  • Prevent recurrence: Changes that make this specific incident impossible
  • Reduce impact: Changes that limit the blast radius if something similar happens
  • Improve detection: Changes that help you find the problem faster next time

Prevention items take priority. If you can only do one thing, make it harder for the same failure to occur.

Sharing postmortems across the organization

A postmortem that lives in a folder nobody opens is wasted effort. The learning has to spread.

Postmortem reading clubs

Some organizations run bi-weekly sessions where a team presents a recent postmortem to a wider audience. The presenting team walks through the incident and the analysis. Other teams ask questions, share similar experiences, and sometimes spot patterns across multiple incidents.

These sessions build empathy between teams. When the payments team hears about a caching incident in the search team, they might realize their own caching layer has the same vulnerability.

Internal newsletters

A monthly digest of postmortems, summarized to two or three paragraphs each, reaches people who won’t attend a reading club. Include the contributing factors and action items. Skip the detailed timeline.

Searchable archive

Store postmortems in a searchable system. When an engineer encounters a new problem, they should be able to search for “connection pool exhaustion” and find three prior postmortems with different contributing factors and solutions.

Tag postmortems by system, failure mode, and contributing factor category. This makes pattern analysis possible. If you see five postmortems in six months where “missing staging gate” is a contributing factor, that tells you something about your deployment infrastructure.
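With tagged postmortems, the pattern analysis above reduces to counting factor tags across the archive. A minimal sketch (the tag data is hypothetical; a real archive would load these records from whatever system stores your postmortems):

```python
from collections import Counter

# Hypothetical tags pulled from six months of postmortems
postmortems = [
    {"system": "search",   "factors": ["missing staging gate", "cache stampede"]},
    {"system": "payments", "factors": ["missing staging gate"]},
    {"system": "api",      "factors": ["connection pool exhaustion"]},
    {"system": "search",   "factors": ["missing staging gate", "no ownership mapping"]},
]

# Count how often each contributing factor appears across incidents
factor_counts = Counter(f for pm in postmortems for f in pm["factors"])

# Factors recurring across multiple incidents are candidates for systemic fixes
recurring = [f for f, n in factor_counts.most_common() if n >= 2]
```

A factor that shows up in three of four incidents, as "missing staging gate" does here, is exactly the kind of signal that points at deployment infrastructure rather than at any single team.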

Measuring postmortem effectiveness

Track two metrics over time:

  1. Action item completion rate: What percentage of postmortem action items are completed within their target date? Below 70% means you are generating more work than you can absorb.
  2. Recurrence rate: How often do you see incidents with the same contributing factors? Declining recurrence means your postmortems are producing real improvements.

If your completion rate stays low, you are writing action items that are too ambitious or not prioritizing them in sprint planning. Adjust scope or allocate dedicated capacity for reliability work.
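The completion-rate metric is simple enough to compute directly from ticket data. A minimal sketch (the action-item dates are hypothetical, and a real version would pull them from your ticketing system):

```python
from datetime import date

# Hypothetical action items as (completed_on, due_by); None means still open
items = [
    (date(2024, 3, 10), date(2024, 3, 15)),
    (date(2024, 4, 2),  date(2024, 3, 30)),  # completed late
    (None,              date(2024, 3, 20)),  # still open
    (date(2024, 3, 18), date(2024, 3, 20)),
]

# Count items completed on or before their target date
on_time = sum(1 for done, due in items if done is not None and done <= due)
completion_rate = on_time / len(items)
print(f"On-time completion rate: {completion_rate:.0%}")  # prints "On-time completion rate: 50%"
```

Tracking the same number every month turns it into a trend; a rate that stays under the 70% threshold above is the signal to shrink action-item scope or reserve sprint capacity for reliability work.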

Common pitfalls

Skipping postmortems for “small” incidents. Small incidents often reveal the same systemic issues as large ones. Set a threshold (any incident over 10 minutes of user impact, for example) and stick to it.

Writing postmortems but never reading them. If nobody references past postmortems during future incidents, the archive is decorative. Build postmortem search into your incident response process.

Letting action items rot. Unfinished action items are worse than no action items. They create the illusion of improvement. Review the backlog monthly and close items that are no longer relevant.

What comes next

With a solid postmortem practice in place, the next step is making sure new services meet reliability standards before they launch. The next article on production readiness reviews covers how to evaluate whether a service is ready for production traffic and what to do when it is not.
