SLIs, SLOs, and error budgets

In this series (10 parts)
  1. The three pillars of observability
  2. Structured logging
  3. Metrics and Prometheus
  4. Grafana and dashboards
  5. Distributed tracing
  6. Log aggregation pipelines
  7. Alerting design
  8. SLIs, SLOs, and error budgets
  9. Real User Monitoring and synthetic testing
  10. On-call tooling and runbooks

Every system has failures. The question is not “will it fail?” but “how much failure is acceptable?” SLOs answer that question with a number. Error budgets turn the answer into an engineering decision framework.

Definitions

SLI (Service Level Indicator) is a metric that quantifies user experience. “The proportion of requests that completed in under 300ms” or “the proportion of requests that returned a non-error response.”

SLO (Service Level Objective) is a target for an SLI over a time window. “99.9% of requests complete in under 300ms, measured over a rolling 30-day window.”

SLA (Service Level Agreement) is a contractual commitment with financial consequences for missing the target. SLAs are for customers. SLOs are for engineering teams.

Error budget is the allowed amount of failure. If the SLO is 99.9%, the error budget is 0.1%.

Choosing SLIs

Good SLIs reflect what users experience. Bad SLIs measure infrastructure internals that may not correlate with user happiness.

Bad SLI                    Why                                                Better SLI
CPU utilization            High CPU does not always mean users are affected   Request success rate
Database replication lag   Users do not see replication lag directly          Data freshness from the user's perspective
Pod restart count          Restarts might be invisible to users               Availability (successful requests / total requests)

The two most useful SLI categories:

Availability SLI: The proportion of valid requests that were served successfully.

sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))

Latency SLI: The proportion of valid requests served faster than a threshold.

sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
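A practical caveat: `rate(...[30d])` range queries are expensive to evaluate on every dashboard refresh or alert evaluation. A common pattern is to precompute short-window SLI ratios with recording rules and aggregate them over the SLO window. A sketch, with hypothetical rule names:

```yaml
# Hypothetical recording rules: precompute 5-minute SLI ratios so
# dashboards and alerts avoid repeated 30-day range queries.
groups:
  - name: sli-recording
    rules:
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
      - record: sli:latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (service)
          /
          sum(rate(http_request_duration_seconds_count[5m])) by (service)
```

Then `avg_over_time(sli:availability:ratio_rate5m[30d])` approximates the 30-day SLI. Note it weights each 5-minute window equally rather than each request, so low-traffic periods count the same as peak periods; whether that is acceptable depends on your traffic shape.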

Setting SLO targets

SLO targets balance user expectations against engineering cost. Higher targets are exponentially more expensive to maintain.

SLO      Allowed downtime per 30 days   Engineering effort
99%      ~7.2 hours                     Basic redundancy
99.9%    ~43 minutes                    Multi-AZ, automated failover
99.95%   ~22 minutes                    Global load balancing, chaos engineering
99.99%   ~4.3 minutes                   Active-active multi-region, dedicated SRE team
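The downtime figures above follow directly from the SLO: the allowed downtime is the budget fraction multiplied by the window length. A minimal sketch:

```python
# Convert an SLO target into allowed full-downtime minutes per window.
# Reproduces the 30-day figures in the table above.

def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of complete downtime the error budget permits per window."""
    return (1 - slo) * window_days * 24 * 60

for slo in (0.99, 0.999, 0.9995, 0.9999):
    print(f"{slo:.2%}: {allowed_downtime_minutes(slo):.1f} minutes")
```

Running this prints 432.0, 43.2, 21.6, and 4.3 minutes, matching the table. In practice most failures are partial (elevated error rates, not total outages), so the real budget stretches further than pure downtime minutes suggest.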

Start with a target slightly above your current performance. If your service currently achieves 99.85% availability, set the SLO at 99.9%. This gives you a meaningful target that is achievable with focused work.

Do not set every service to 99.99%. Internal batch processing does not need the same reliability as the checkout flow. Match the SLO to user expectations and business impact.

Error budget calculation

The error budget is the complement of the SLO. For a 99.9% availability SLO over a 30-day window:

Error budget = 1 - 0.999 = 0.001 = 0.1%
Total requests in 30 days: 100,000,000
Allowed failures: 100,000,000 * 0.001 = 100,000

If you have consumed 60,000 of those 100,000 allowed failures halfway through the month, you have burned 60% of the budget in 50% of the time. That burn rate signals trouble.
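The pacing check in the paragraph above can be sketched as a small function; the ratio of budget consumed to window elapsed tells you whether consumption is sustainable:

```python
# Sketch of the pacing check described above: fraction of budget
# consumed divided by fraction of the window elapsed.

def budget_pacing(failures: int, allowed_failures: int,
                  days_elapsed: float, window_days: float = 30) -> float:
    """Ratio > 1 means budget is being spent faster than the window elapses."""
    consumed = failures / allowed_failures
    elapsed = days_elapsed / window_days
    return consumed / elapsed

# 60,000 of 100,000 allowed failures, halfway through the month:
print(budget_pacing(60_000, 100_000, days_elapsed=15))  # 1.2
```

A result of 1.2 means the budget is being consumed 20% faster than a steady, sustainable pace; anything above 1 will exhaust the budget before the window closes if it continues.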

Burn rate alerts

Traditional threshold alerts (“error rate > 1%”) do not account for budget consumption. A 1% error rate might be fine for a service with a 99% SLO but catastrophic for one with a 99.99% SLO.

Burn rate alerts measure how fast the error budget is being consumed relative to the window:

Burn rate = (actual error rate) / (budget error rate)

For a 99.9% SLO (budget rate = 0.1%):

  • If current error rate is 0.1%, burn rate = 1x (consuming budget exactly on pace)
  • If current error rate is 1.4%, burn rate = 14x (will exhaust budget in ~2 days)
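The two bullets above follow from the formula; at a sustained burn rate, the full budget lasts the window length divided by the rate. A minimal sketch for the 99.9% SLO:

```python
# Burn rate and projected time to budget exhaustion,
# per the formula above (99.9% SLO => budget error rate 0.001).

def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'on pace' the budget is being consumed."""
    return error_rate / (1 - slo)

def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """At a sustained burn rate, the budget lasts window/rate days."""
    return window_days / rate

rate = burn_rate(0.014, 0.999)   # 14x
print(days_to_exhaustion(rate))  # ~2.1 days
```

A 1.4% error rate against a 0.1% budget rate gives a 14x burn, which empties a 30-day budget in roughly 2.1 days, matching the bullet above.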

Multi-window burn rate alerting

Google’s SRE practices recommend alerting on two windows simultaneously to reduce false positives:

# Fast burn: 14.4x burn rate over 1 hour (checked against 5-minute window)
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
      /
      sum(rate(http_requests_total[1h])) by (service)
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      /
      sum(rate(http_requests_total[5m])) by (service)
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Fast error budget burn on {{ $labels.service }}"

# Slow burn: 3x burn rate over 3 days (checked against 6-hour window)
- alert: ErrorBudgetSlowBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[3d])) by (service)
      /
      sum(rate(http_requests_total[3d])) by (service)
    ) > (3 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[6h])) by (service)
      /
      sum(rate(http_requests_total[6h])) by (service)
    ) > (3 * 0.001)
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Slow error budget burn on {{ $labels.service }}"

Fast burn catches acute incidents (outages, bad deployments). Slow burn catches gradual degradation that goes unnoticed day-to-day.
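The multipliers in the rules above can be read as budget fractions: a burn rate b sustained for the alert's long window w consumes b·w/W of the budget, where W is the SLO window. A quick check of the thresholds used above:

```python
# Fraction of the 30-day budget consumed when a burn rate is
# sustained for the alert's long window (window in hours).

def budget_consumed(rate: float, alert_hours: float,
                    window_hours: float = 30 * 24) -> float:
    return rate * alert_hours / window_hours

print(budget_consumed(14.4, 1))    # 0.02: fast burn spends 2% of budget per hour
print(budget_consumed(3, 3 * 24))  # 0.30: slow burn threshold, 30% over 3 days
```

So the fast-burn rule fires when an hour of sustained errors would cost 2% of the monthly budget; the slow-burn rule catches a pace that would quietly spend 30% of it over three days.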

Visualizing error budget consumption

[Figure: error budget burn-down over the 30-day window.] The orange line shows actual budget consumption against the ideal steady pace. The incident on day 15 caused a fast burn that consumed budget ahead of that pace; a fast-burn alert would fire during this spike.

Error budget policy

An error budget policy defines what happens when the budget runs low or is exhausted:

Budget remaining > 50%: Ship features normally. Reliability is healthy.

Budget remaining 20-50%: Slow down risky changes. Require extra review for deployments. Focus testing on reliability-sensitive paths.

Budget remaining < 20%: Freeze non-critical feature deployments. Prioritize reliability improvements. Conduct a focused investigation into budget-consuming incidents.

Budget exhausted: Full feature freeze. All engineering effort goes to reliability until the budget recovers in the next window.

This policy gives product and engineering a shared language. Instead of arguing about whether to ship a feature or fix reliability, the error budget decides. It converts a subjective debate into an objective measurement.

# Example error budget policy document
policy:
  service: checkout-api
  slo: 99.9% availability over 30 days
  thresholds:
    - budget_remaining: ">50%"
      action: "Normal development velocity"
    - budget_remaining: "20-50%"
      action: "Extra deployment review, canary rollouts mandatory"
    - budget_remaining: "<20%"
      action: "Feature freeze, reliability sprint"
    - budget_remaining: "0%"
      action: "Full freeze until budget recovers"
  review_cadence: weekly
  stakeholders:
    - engineering-manager
    - product-manager
    - sre-team-lead
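The thresholds above can also live in code so that deployment tooling gates releases on the current budget automatically. A sketch, with hypothetical action names:

```python
# Sketch: the policy thresholds above as a lookup, so CI/CD tooling
# could gate deployments on the current budget (action names are
# illustrative, not part of any standard tool).

def policy_action(budget_remaining: float) -> str:
    """budget_remaining is the fraction of the error budget still unspent."""
    if budget_remaining > 0.50:
        return "normal"
    if budget_remaining >= 0.20:
        return "extra-review"
    if budget_remaining > 0.0:
        return "feature-freeze"
    return "full-freeze"

print(policy_action(0.60))  # normal
print(policy_action(0.35))  # extra-review
print(policy_action(0.05))  # feature-freeze
```

Wiring this into a deploy pipeline makes the policy self-enforcing rather than relying on someone remembering to check the dashboard.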

Practical tips

  1. Start with one SLO per critical service. Availability is the easiest to measure and the most impactful.
  2. Use a rolling window, not calendar months. A rolling 30-day window avoids the perverse incentive to burn budget at the start of the month.
  3. Exclude planned maintenance from SLI calculations if users are notified in advance.
  4. Review SLOs quarterly. As traffic patterns and architecture change, targets may need adjustment.
  5. Make the error budget dashboard visible. Display it on the team’s main screen. When everyone sees the budget, reliability becomes a shared concern.

What comes next

SLOs measure backend reliability. Real User Monitoring and synthetic testing measure what users actually experience in the browser, including Core Web Vitals and uptime checks that catch issues before users report them.
