SLIs, SLOs, and error budgets
Every system has failures. The question is not “will it fail?” but “how much failure is acceptable?” SLOs answer that question with a number. Error budgets turn the answer into an engineering decision framework.
Definitions
SLI (Service Level Indicator) is a metric that quantifies user experience. “The proportion of requests that completed in under 300ms” or “the proportion of requests that returned a non-error response.”
SLO (Service Level Objective) is a target for an SLI over a time window. “99.9% of requests complete in under 300ms, measured over a rolling 30-day window.”
SLA (Service Level Agreement) is a contractual commitment with financial consequences for missing the target. SLAs are for customers. SLOs are for engineering teams.
Error budget is the allowed amount of failure. If the SLO is 99.9%, the error budget is 0.1%.
Choosing SLIs
Good SLIs reflect what users experience. Bad SLIs measure infrastructure internals that may not correlate with user happiness.
| Bad SLI | Why | Better SLI |
|---|---|---|
| CPU utilization | High CPU does not always mean users are affected | Request success rate |
| Database replication lag | Users do not see replication lag directly | Data freshness from user perspective |
| Pod restart count | Restarts might be invisible to users | Availability (successful requests / total requests) |
The two most useful SLI categories:
Availability SLI: The proportion of valid requests that were served successfully.
```promql
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
```
Latency SLI: The proportion of valid requests served faster than a threshold.
```promql
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
```
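The same two ratios can be computed offline from raw counts — a minimal Python sketch, assuming the counters have already been queried from the metrics backend (function names are illustrative):

```python
# Sketch: the two SLI categories as plain ratios over a window.
# Counter values are assumed to come from a metrics backend query.

def availability_sli(total_requests: int, error_requests: int) -> float:
    """Proportion of requests served without an error response."""
    if total_requests == 0:
        return 1.0  # no traffic: conventionally treated as meeting the SLO
    return (total_requests - error_requests) / total_requests

def latency_sli(requests_under_threshold: int, total_requests: int) -> float:
    """Proportion of requests faster than the latency threshold."""
    if total_requests == 0:
        return 1.0
    return requests_under_threshold / total_requests

print(availability_sli(100_000_000, 80_000))   # 0.9992
print(latency_sli(99_500_000, 100_000_000))    # 0.995
```

Both are "good events over valid events"; that shape is what makes an SLI composable into an SLO target.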
Setting SLO targets
SLO targets balance user expectations against engineering cost. Higher targets are exponentially more expensive to maintain.
| SLO | Allowed downtime per 30 days | Engineering effort |
|---|---|---|
| 99% | ~7.2 hours | Basic redundancy |
| 99.9% | ~43 minutes | Multi-AZ, automated failover |
| 99.95% | ~22 minutes | Global load balancing, chaos engineering |
| 99.99% | ~4.3 minutes | Active-active multi-region, dedicated SRE team |
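The downtime figures in the table follow directly from the window length — a quick Python sketch (the helper name is illustrative):

```python
# Sketch: allowed downtime for an availability SLO over a rolling window.
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO permits per window."""
    return window_days * 24 * 60 * (1 - slo)

for slo in (0.99, 0.999, 0.9995, 0.9999):
    # Matches the table above: 432.0, 43.2, 21.6, and 4.3 minutes.
    print(f"{slo:.2%}: {allowed_downtime_minutes(slo):.1f} minutes")
```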
Start with a target slightly above your current performance. If your service currently achieves 99.85% availability, set the SLO at 99.9%. This gives you a meaningful target that is achievable with focused work.
Do not set every service to 99.99%. Internal batch processing does not need the same reliability as the checkout flow. Match the SLO to user expectations and business impact.
Error budget calculation
The error budget is the complement of the SLO. For a 99.9% availability SLO over a 30-day window:
```text
Error budget = 1 - 0.999 = 0.001 = 0.1%

Total requests in 30 days: 100,000,000
Allowed failures:          100,000,000 * 0.001 = 100,000
```
If you have consumed 60,000 of those 100,000 allowed failures halfway through the month, you have burned 60% of the budget in 50% of the time. That burn rate signals trouble.
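The arithmetic above can be sketched in a few lines of Python, using the same illustrative numbers:

```python
# Sketch: error budget consumption pace, mirroring the worked example above.
slo = 0.999
total_requests = 100_000_000          # expected requests over the 30-day window
budget = total_requests * (1 - slo)   # ~100,000 allowed failures

failures_so_far = 60_000
days_elapsed = 15

budget_burned = failures_so_far / budget   # fraction of budget consumed
window_elapsed = days_elapsed / 30         # fraction of the window elapsed

# Burning budget faster than the window elapses signals trouble.
print(f"burned {budget_burned:.0%} of budget in {window_elapsed:.0%} of the window")
```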
Burn rate alerts
Traditional threshold alerts (“error rate > 1%”) do not account for budget consumption. A 1% error rate might be fine for a service with a 99% SLO but catastrophic for one with a 99.99% SLO.
Burn rate alerts measure how fast the error budget is being consumed relative to the window:
Burn rate = (actual error rate) / (budget error rate)
For a 99.9% SLO (budget rate = 0.1%):
- If current error rate is 0.1%, burn rate = 1x (consuming budget exactly on pace)
- If current error rate is 1.4%, burn rate = 14x (will exhaust budget in ~2 days)
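Both bullet points fall out of the same two formulas — a small Python sketch (function names are illustrative):

```python
# Sketch: burn rate and projected time to exhaust the budget.
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the budget is consumed relative to an on-pace 1x."""
    return error_rate / (1 - slo)

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At burn rate 1x the budget lasts exactly one window."""
    return window_days / rate

r = burn_rate(0.014, 0.999)   # 1.4% errors against a 0.1% budget
print(f"{r:.1f}x burn rate, budget exhausted in {days_to_exhaustion(r):.1f} days")
```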
Multi-window burn rate alerting
Google’s SRE practices recommend alerting on two windows simultaneously to reduce false positives:
```yaml
# Fast burn: 14.4x burn rate over 1 hour (checked against 5-minute window)
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
      /
      sum(rate(http_requests_total[1h])) by (service)
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      /
      sum(rate(http_requests_total[5m])) by (service)
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Fast error budget burn on {{ $labels.service }}"

# Slow burn: 3x burn rate over 3 days (checked against 6-hour window)
- alert: ErrorBudgetSlowBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[3d])) by (service)
      /
      sum(rate(http_requests_total[3d])) by (service)
    ) > (3 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[6h])) by (service)
      /
      sum(rate(http_requests_total[6h])) by (service)
    ) > (3 * 0.001)
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Slow error budget burn on {{ $labels.service }}"
```
Fast burn catches acute incidents (outages, bad deployments). Slow burn catches gradual degradation that goes unnoticed day-to-day.
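The `and` between the two windows is the core of the technique: the long window confirms enough budget has actually burned, while the short window confirms the problem is still happening. A Python sketch of the same condition, with illustrative names:

```python
# Sketch: the multi-window condition the Prometheus rules encode.
# Alert only when BOTH windows exceed the burn-rate threshold, which
# suppresses brief spikes and incidents that have already recovered.
def should_alert(long_window_error_rate: float,
                 short_window_error_rate: float,
                 burn_rate_threshold: float,
                 budget_error_rate: float = 0.001) -> bool:
    threshold = burn_rate_threshold * budget_error_rate
    return (long_window_error_rate > threshold
            and short_window_error_rate > threshold)

# Fast-burn check (14.4x) for a 99.9% SLO:
print(should_alert(0.02, 0.0001, 14.4))  # False: short window has recovered
print(should_alert(0.02, 0.03, 14.4))    # True: still burning
```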
Visualizing error budget consumption
[Figure: error budget burn-down over a 30-day window, comparing ideal linear consumption against actual consumption.]

The orange line shows actual budget consumption. The incident on day 15 caused a fast burn that consumed budget ahead of the ideal pace. A fast-burn alert would fire during this spike.
Error budget policy
An error budget policy defines what happens when the budget runs low or is exhausted:
Budget remaining > 50%: Ship features normally. Reliability is healthy.
Budget remaining 20-50%: Slow down risky changes. Require extra review for deployments. Focus testing on reliability-sensitive paths.
Budget remaining < 20%: Freeze non-critical feature deployments. Prioritize reliability improvements. Conduct a focused investigation into budget-consuming incidents.
Budget exhausted: Full feature freeze. All engineering effort goes to reliability until the budget recovers in the next window.
This policy gives product and engineering a shared language. Instead of arguing about whether to ship a feature or fix reliability, the error budget decides. It converts a subjective debate into an objective measurement.
```yaml
# Example error budget policy document
policy:
  service: checkout-api
  slo: 99.9% availability over 30 days
  thresholds:
    - budget_remaining: ">50%"
      action: "Normal development velocity"
    - budget_remaining: "20-50%"
      action: "Extra deployment review, canary rollouts mandatory"
    - budget_remaining: "<20%"
      action: "Feature freeze, reliability sprint"
    - budget_remaining: "0%"
      action: "Full freeze until budget recovers"
  review_cadence: weekly
  stakeholders:
    - engineering-manager
    - product-manager
    - sre-team-lead
```
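Tooling can evaluate such a policy automatically — a minimal Python sketch of the tier lookup, with boundaries taken from the tiers above (the function name is illustrative):

```python
# Sketch: mapping remaining budget to the policy tier described above.
def policy_action(budget_remaining: float) -> str:
    """Return the policy action for a remaining-budget fraction (0.0-1.0)."""
    if budget_remaining > 0.5:
        return "Normal development velocity"
    if budget_remaining >= 0.2:
        return "Extra deployment review, canary rollouts mandatory"
    if budget_remaining > 0.0:
        return "Feature freeze, reliability sprint"
    return "Full freeze until budget recovers"

print(policy_action(0.35))  # Extra deployment review, canary rollouts mandatory
```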
Practical tips
- Start with one SLO per critical service. Availability is the easiest to measure and the most impactful.
- Use a rolling window, not calendar months. A rolling 30-day window avoids the perverse incentive to burn budget at the start of the month.
- Exclude planned maintenance from SLI calculations if users are notified in advance.
- Review SLOs quarterly. As traffic patterns and architecture change, targets may need adjustment.
- Make the error budget dashboard visible. Display it on the team’s main screen. When everyone sees the budget, reliability becomes a shared concern.
What comes next
SLOs measure backend reliability. Real User Monitoring and synthetic testing measure what users actually experience in the browser, including Core Web Vitals and uptime checks that catch issues before users report them.