
Reliability fundamentals

In this series (11 parts)
  1. What SRE is
  2. Reliability fundamentals
  3. SLIs, SLOs, and error budgets in practice
  4. Toil reduction and automation
  5. Capacity planning
  6. Performance testing and load testing
  7. Chaos engineering
  8. Incident response in practice
  9. Postmortems and learning from failure
  10. Production readiness reviews
  11. Reliability patterns for services

“We had no downtime this quarter.” That statement sounds good. It tells you almost nothing. How did you measure it? What counts as downtime? For which users? Under what conditions?

Reliability engineering replaces vague claims with precise measurements. This article covers the math and mental models behind those measurements.

Availability and the nines

Availability is the fraction of time a system is operational and serving requests correctly. It’s expressed as a percentage, and the industry shorthand uses “nines.”

Three nines means 99.9% availability. Four nines means 99.99%. Each additional nine cuts the allowed downtime by a factor of ten and, as a rule of thumb, is about ten times harder to achieve.

Here’s what each level actually means in concrete downtime:

| Availability | Downtime/year | Downtime/month | Downtime/week |
| --- | --- | --- | --- |
| 99% (two nines) | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% (three nines) | 8.77 hours | 43.83 minutes | 10.08 minutes |
| 99.95% | 4.38 hours | 21.92 minutes | 5.04 minutes |
| 99.99% (four nines) | 52.56 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.30 seconds | 6.05 seconds |

Look at the jump from three nines to four nines. You go from 43 minutes of monthly downtime to about 4 minutes. That means your entire detection, response, and recovery pipeline needs to complete in under 4 minutes. Every month. Without exception.

Five nines allows 5.26 minutes of downtime per year. That’s roughly the time it takes a human to read an alert, open a laptop, and log in. At five nines, human intervention is too slow. Everything must be automated.

[Figure: allowed downtime per year versus number of nines, log-scale y-axis. Each additional nine reduces allowed downtime by a factor of 10.]
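The arithmetic behind the table is easy to script. A minimal sketch in Python, assuming an average 365.25-day year (the same convention the table uses):

```python
# Allowed downtime per period for a given availability target.
# Period lengths assume an average 365.25-day year.
SECONDS_PER = {
    "year": 365.25 * 24 * 3600,
    "month": 365.25 * 24 * 3600 / 12,
    "week": 7 * 24 * 3600,
}

def allowed_downtime_seconds(availability: float) -> dict[str, float]:
    """Downtime budget per period: period length times unavailability."""
    return {period: secs * (1.0 - availability) for period, secs in SECONDS_PER.items()}

for target in (0.99, 0.999, 0.9999, 0.99999):
    budget = allowed_downtime_seconds(target)
    print(f"{target:.3%}: {budget['month'] / 60:7.2f} minutes/month")
```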

MTBF and MTTR

Two metrics capture the dynamics of failure and recovery.

MTBF (Mean Time Between Failures) measures how often a system fails. If a service experiences 4 outages in a year, the MTBF is roughly 3 months. Higher MTBF means fewer failures.

MTTR (Mean Time To Recovery) measures how quickly you restore service after a failure. If your average incident takes 45 minutes to resolve, your MTTR is 45 minutes. Lower MTTR means faster recovery.

These two metrics connect directly to availability through a simple formula:

Availability = MTBF / (MTBF + MTTR)

Suppose your MTBF is 720 hours (30 days) and your MTTR is 0.72 hours (43 minutes). Your availability is:

720 / (720 + 0.72) ≈ 0.999 = 99.9%

This formula reveals something important. You have two levers for improving availability: fail less often (increase MTBF) or recover faster (decrease MTTR). In practice, reducing MTTR is usually easier and more cost-effective.

Why? Preventing all failures requires perfect systems. Recovering quickly requires good detection, clear runbooks, and automation. The first is impossible. The second is engineering work with a clear path forward.
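A short sketch makes both the formula and the two levers concrete (the numbers extend the worked example above):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# The worked example: 30-day MTBF, 43-minute MTTR.
print(f"{availability(720, 0.72):.4%}")   # -> 99.9001%

# The two levers, pulled one at a time:
print(f"{availability(720, 0.36):.4%}")   # halve MTTR  -> 99.9500%
print(f"{availability(1440, 0.72):.4%}")  # double MTBF -> 99.9500%
```

Halving MTTR and doubling MTBF yield the same availability, which is why the choice between them comes down to engineering cost.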

The economics of each lever

Investing in MTBF means preventing failures. That includes better testing, redundant infrastructure, careful code review, and chaos engineering. These are valuable but face diminishing returns. You can’t prevent every failure. Hardware fails. Networks partition. Humans make mistakes.

Investing in MTTR means faster recovery. That includes better monitoring (detect problems in seconds, not minutes), automated rollbacks, runbook automation, and incident response training. These investments have more linear returns because there’s always a step in the recovery process you can make faster.

Most organizations get the best reliability improvement per dollar by focusing on MTTR until their detection and recovery pipelines are solid, then shifting attention to MTBF.

Failure modes

Not all failures look the same. Understanding failure modes helps you design detection systems and recovery procedures.

Crash failures

The process terminates unexpectedly. The server stops responding. This is the simplest failure mode to detect and handle. Health checks catch it. Restart policies fix it. Load balancers route around it.

Omission failures

The system silently drops requests. It’s running, health checks pass, but some requests never get a response. These are insidious because the system appears healthy. You need request-level monitoring to catch them, not just process-level health checks.

Timing failures

The system responds, but too slowly. A database query that usually takes 10ms starts taking 5 seconds. The service is technically “up” but effectively unusable. Latency-based SLIs catch this where availability checks miss it.

Response failures

The system responds quickly with the wrong answer. A pricing service returns $0 for every item. A search service returns empty results for every query. These require semantic monitoring, checking that responses are not just fast but also correct.

Byzantine failures

The system behaves arbitrarily, potentially sending different responses to different clients or corrupting data. These are the hardest failures to detect and handle. They often require consensus protocols or redundant computation with voting.

Most production systems primarily deal with crash, omission, and timing failures. Design your monitoring to catch these three well before worrying about Byzantine tolerance.
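The first four failure modes map directly onto request-level checks. A minimal sketch, with an illustrative Request record and threshold that are assumptions for the example, not any particular monitoring system's API:

```python
from dataclasses import dataclass

@dataclass
class Request:
    answered: bool     # did the client receive any response?
    latency_ms: float  # how long the response took
    correct: bool      # did a semantic check on the response pass?

def classify(req: Request, latency_slo_ms: float = 500.0) -> str:
    """Classify one request against the failure modes above."""
    if not req.answered:
        return "omission"   # process is up, request silently dropped
    if req.latency_ms > latency_slo_ms:
        return "timing"     # technically up, effectively unusable
    if not req.correct:
        return "response"   # fast, but the wrong answer
    return "ok"

print(classify(Request(answered=True, latency_ms=5000.0, correct=True)))  # timing
```

Crash failures don't appear here because a crashed process answers nothing at all; health checks and restart policies cover that case.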

The reliability hierarchy

Reliability practices build on each other like Maslow’s hierarchy. You need the lower levels before the upper levels provide value.

```mermaid
graph BT
  A["Monitoring & Alerting"] --> B["Incident Response"]
  B --> C["Blameless Postmortems"]
  C --> D["Testing & Release Engineering"]
  D --> E["Capacity Planning"]
  E --> F["Product Development Integration"]
```

The reliability hierarchy. Each level depends on the levels below it. Start at the bottom.

Level 1: Monitoring and alerting

The foundation. Without monitoring, you learn about outages from angry users on Twitter. Good monitoring tells you what’s broken, when it broke, and provides the signals needed to understand why.

Monitoring should be based on symptoms (user-facing impact), not just causes (CPU usage, disk space). An SLO-based alert that fires when error rates exceed your budget is more useful than a CPU alert at 90%.
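A sketch of the difference, with illustrative numbers (the 99.9% target and the paging function are assumptions for the example):

```python
def should_page(errors: int, requests: int, slo: float = 0.999) -> bool:
    """Symptom-based alert: page when the observed error rate
    exceeds the error budget implied by the SLO."""
    if requests == 0:
        return False  # no traffic, no user-facing impact
    return errors / requests > (1.0 - slo)

# 50 errors in 10,000 requests is a 0.5% error rate against a 0.1% budget.
assert should_page(errors=50, requests=10_000)
# A box at 90% CPU that still serves every request correctly pages no one.
```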

Level 2: Incident response

When monitoring fires an alert, what happens? Incident response defines the process. Who gets paged? How do they communicate? What tools do they use? How do they escalate?

Without a defined incident response process, every outage becomes chaos. People duplicate effort, miss critical information, and take longer to resolve the issue than necessary.

Level 3: Blameless postmortems

After every significant incident, a postmortem examines what happened, why, and how to prevent recurrence. The “blameless” part is critical. If people fear punishment, they hide information. If they feel safe, they share the details that lead to real fixes.

A postmortem without action items is just documentation. Every postmortem should produce concrete, assigned work that improves the system.

Level 4: Testing and release engineering

Prevent failures from reaching production. This includes unit tests, integration tests, canary deployments, progressive rollouts, and automated rollback capabilities. Good release engineering makes deployments boring and reversible.

Level 5: Capacity planning

Anticipate growth before it becomes an emergency. Capacity planning uses load testing, traffic forecasting, and resource modeling to ensure your systems can handle tomorrow’s load, not just today’s.

Level 6: Product development integration

At the top of the hierarchy, reliability engineering integrates with product development. SREs participate in design reviews. Reliability requirements influence architecture decisions. Error budgets shape the release cadence.

Measuring what matters

Not everything that can be measured should be measured. Good reliability metrics share three properties:

  1. They reflect user experience. Latency percentiles matter more than CPU utilization. Error rates matter more than log volume. (A percentile sketch follows this list.)
  2. They are actionable. A metric that goes up or down without telling you what to do about it wastes attention.
  3. They have defined thresholds. A metric without a target is just a number. SLOs give metrics meaning by defining “good enough.”
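To make the first property concrete, here is a minimal nearest-rank percentile sketch; the sample latencies are made up:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[rank]

latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 13, 900]
print(percentile(latencies_ms, 50))  # 13  -- the median looks healthy
print(percentile(latencies_ms, 90))  # 250 -- the tail tells the real story
```

An average would blur those two slow requests into a roughly 126 ms figure that describes no actual request; percentiles keep the tail visible.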

The next article in this series covers SLIs, SLOs, and error budgets in detail. Those concepts transform raw metrics into a reliability management framework.

Compound availability

Real systems depend on multiple components. A request might traverse a load balancer, an API gateway, a microservice, and a database. The overall availability is the product of individual availabilities.

If each of four components has 99.9% availability:

0.999 × 0.999 × 0.999 × 0.999 ≈ 0.996 = 99.6%

Four three-nines components give you less than three nines combined. This is why distributed systems are harder to keep reliable: every additional dependency in the critical path multiplies in another availability factor below one.

Strategies to combat compound availability include redundancy (multiple instances of each component), graceful degradation (serve cached results when the database is slow), and reducing the dependency chain (fewer components in the critical path).
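Both effects are easy to compute, assuming component failures are independent (real systems only approximate this):

```python
from math import prod

def serial(*availabilities: float) -> float:
    """A chain of dependencies: every component must be up."""
    return prod(availabilities)

def redundant(a: float, replicas: int) -> float:
    """N independent replicas: at least one must be up."""
    return 1.0 - (1.0 - a) ** replicas

print(f"{serial(0.999, 0.999, 0.999, 0.999):.4%}")  # -> 99.6006%
# Making one component a redundant pair nearly removes it from the loss:
print(f"{serial(0.999, 0.999, 0.999, redundant(0.999, 2)):.4%}")  # -> 99.7002%
```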

What comes next

Reliability needs targets. Without them, you’re flying blind. The next article covers SLIs, SLOs, and error budgets in practice, the framework that turns reliability measurement into actionable engineering decisions. You’ll learn how to define what “reliable enough” means and what to do when you fall short.
