SLIs, SLOs, and error budgets in practice
In this series (11 parts)
- What SRE is
- Reliability fundamentals
- SLIs, SLOs, and error budgets in practice
- Toil reduction and automation
- Capacity planning
- Performance testing and load testing
- Chaos engineering
- Incident response in practice
- Postmortems and learning from failure
- Production readiness reviews
- Reliability patterns for services
Every engineering team says they care about reliability. Few can answer a simple question: “How reliable is your service right now, and how reliable does it need to be?”
SLIs, SLOs, and error budgets answer that question with numbers. They turn reliability from a feeling into a measurable, manageable engineering constraint.
Service Level Indicators
A Service Level Indicator (SLI) is a quantitative measure of some aspect of the service level you provide. It captures what users actually experience.
Good SLIs measure from the user’s perspective, not the server’s. The difference matters.
Server perspective: “CPU is at 30%, all processes healthy.” User perspective: “My request took 3 seconds and returned an error.”
Both statements can be true at the same time. The server might be fine overall while a specific code path causes failures for a subset of users. SLIs focus on the user’s truth.
Common SLI types
Availability SLI. The proportion of valid requests that are served successfully. A 500 error is a failure. A 200 is a success. A 400 (client error) typically isn’t counted as a failure because the system behaved correctly.
availability = successful requests / total valid requests
Latency SLI. The proportion of requests that complete within a threshold. Not the average latency. The proportion under a target. This matters because averages hide tail latency problems.
latency SLI = (requests completed under 200ms) / (total requests)
Throughput SLI. The proportion of time the system can handle the expected request rate. Useful for batch processing systems where completing work within a time window matters more than individual request latency.
Error rate SLI. The inverse of availability, sometimes more intuitive. “0.1% of requests fail” is the same as “99.9% availability” but frames the problem differently.
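The availability and latency formulas above can be sketched in a few lines. This is a minimal illustration, assuming request records carry a status code and a latency; the record shape and field names are my own, not from any particular tooling.

```python
def availability_sli(requests):
    """Proportion of valid requests served successfully.
    5xx responses count as failures; 4xx client errors are excluded
    from the denominator because the system behaved correctly."""
    valid = [r for r in requests if r["status"] < 400 or r["status"] >= 500]
    if not valid:
        return 1.0
    successful = sum(1 for r in valid if r["status"] < 500)
    return successful / len(valid)

def latency_sli(requests, threshold_ms=200):
    """Proportion of requests completing under the latency threshold."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r["latency_ms"] < threshold_ms)
    return fast / len(requests)
```

Note that both functions return proportions, not averages, matching the guideline below.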
Writing good SLIs
Start with user journeys. What does the user do with your service? For an e-commerce platform, the critical journeys might be: browse products, add to cart, and complete checkout.
Each journey gets its own SLIs. The checkout SLI is more important than the product browsing SLI because a checkout failure means lost revenue.
A few guidelines for writing SLIs:
- Measure at the edge, as close to the user as possible. Load balancer logs or client-side telemetry beat application-level metrics.
- Use proportions, not averages. “99.9% of requests under 200ms” is better than “average latency is 50ms.”
- Include only valid requests. Bot traffic, health checks, and synthetic monitoring should be excluded.
- Keep it simple. If you can’t explain an SLI in one sentence, it’s too complex.
For a deeper treatment of how SLIs connect to observability systems, see SLIs, SLOs, and error budgets in observability.
Service Level Objectives
A Service Level Objective (SLO) is a target value or range for an SLI. It defines “reliable enough.”
If your SLI is “proportion of requests served under 200ms,” your SLO might be “99.9% of requests served under 200ms, measured over a 30-day rolling window.”
An SLO has three components:
- The SLI it targets. What are you measuring?
- The threshold. What’s the target value?
- The measurement window. Over what time period?
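The three components map naturally onto a small data structure. This is an illustrative representation, not a standard schema; the field names are my own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SLO:
    sli: str            # what is measured, e.g. "requests served under 200ms"
    target: float       # the threshold, e.g. 0.999
    window_days: int    # the measurement window, e.g. 30

    @property
    def error_budget(self) -> float:
        """The allowed unreliability: the gap between 100% and the target."""
        return 1.0 - self.target

checkout_slo = SLO(sli="checkout requests under 200ms",
                   target=0.999, window_days=30)
```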
Why not 100%?
A 100% SLO is wrong for three reasons.
First, it’s impossible. Hardware fails. Networks partition. Software has bugs. Even the most reliable systems in the world experience some downtime.
Second, it’s infinitely expensive. Going from 99.99% to 99.999% costs dramatically more than going from 99% to 99.9%. The engineering effort follows an exponential curve.
Third, it prevents shipping. If any failure violates your SLO, you can never deploy anything. Every deployment carries some risk. A 100% target means zero tolerance for risk, which means zero deployments.
The right SLO balances user expectations against engineering cost. Most web services land between 99.5% and 99.99% depending on how critical the service is.
Setting your first SLOs
Start with data. Measure your current performance for 2 to 4 weeks before setting targets. If your service currently achieves 99.95% availability, setting an SLO of 99.99% creates immediate pain. Setting it at 99.9% gives you room to learn.
SLOs should be slightly below your current performance. This creates a buffer for experimentation and gives you an error budget to work with.
Error budgets
The error budget is the allowed unreliability. It’s the gap between 100% and your SLO.
If your SLO is 99.9%, your error budget is 0.1%. Over a 30-day month, that’s:
30 days x 24 hours x 60 minutes x 0.001 = 43.2 minutes
You’re allowed 43.2 minutes of total downtime per month. Or equivalently, 0.1% of your requests can fail.
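The 43.2-minute figure generalizes to any target and window. A small helper, assuming a purely time-based budget (downtime minutes rather than failed-request counts):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime minutes allowed per window for a given SLO target."""
    return window_days * 24 * 60 * (1.0 - slo_target)
```

For example, `error_budget_minutes(0.999)` gives 43.2 minutes over 30 days, while a 99.99% target leaves only about 4.3 minutes.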
This reframes reliability as a budget, not a binary. You don’t need perfect reliability. You need enough reliability, and you get to spend the rest on velocity.
How error budgets change the conversation
Without error budgets, the reliability conversation is adversarial. SRE wants stability. Product wants features. Each side argues qualitatively and whoever talks louder wins.
With error budgets, the conversation is data-driven. “We have 30 minutes of error budget remaining this month. The proposed deployment has historically caused 5 minutes of degradation. That leaves 25 minutes of buffer. Let’s ship it.”
Or: “We burned through our error budget by day 15. Per our policy, we freeze feature deployments and focus on reliability work until the budget resets.”
Error budgets turn reliability into a shared resource that both teams manage together.
Error budget policies
An error budget policy defines what happens at different budget levels. Here’s a typical policy:
- Budget healthy (> 50% remaining). Ship features normally. Run experiments. Deploy as often as you like.
- Budget caution (20% to 50% remaining). Increase deployment scrutiny. Require canary deployments for all changes. Prioritize reliability fixes.
- Budget critical (< 20% remaining). Freeze feature deployments. Only ship reliability improvements and critical bug fixes.
- Budget exhausted (0% remaining). Full feature freeze. All engineering effort goes to reliability until the budget resets or a new measurement window starts.
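The tiers above can be encoded directly, which makes the policy enforceable by tooling rather than memory. The thresholds mirror the list; the action strings are illustrative:

```python
def budget_policy(remaining_fraction: float) -> str:
    """Map remaining error budget (0.0 to 1.0) to the agreed policy tier."""
    if remaining_fraction > 0.5:
        return "healthy: ship features normally"
    if remaining_fraction >= 0.2:
        return "caution: canary all changes, prioritize reliability fixes"
    if remaining_fraction > 0.0:
        return "critical: freeze features except reliability and critical fixes"
    return "exhausted: full feature freeze until the window resets"
```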
The policy must be agreed upon by both product and engineering leadership before it’s needed. Negotiating during an active budget crisis never goes well.
Burn rates and multi-window alerts
A simple “SLO violated” alert is too slow. By the time you’ve consumed your entire error budget, the damage is done. Burn rate alerts detect budget consumption speed and fire early.
The burn rate is how fast you’re consuming your error budget relative to the expected steady pace. A burn rate of 1 means you’ll exhaust the budget exactly at the end of the window. A burn rate of 2 means you’ll exhaust it in half the time.
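In other words, the burn rate is the observed error rate divided by the error rate that would consume the budget exactly at the end of the window:

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Ratio of the observed error rate to the steady budget-consumption
    rate. 1.0 = budget exhausted exactly at window end; 2.0 = half the
    window; 14.4 = roughly 2 days for a 30-day window."""
    budget_rate = 1.0 - slo_target   # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / budget_rate
```

With a 99.9% SLO, a 0.2% error rate is a 2x burn: the budget runs out by day 15.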
Multi-window burn rate alerts
Single-window alerts suffer from either too many false positives (short window) or too slow detection (long window). Multi-window alerts solve this by requiring both a long window and a short window to breach simultaneously.
| Alert severity | Long window | Short window | Burn rate | Budget consumed at alert |
|---|---|---|---|---|
| Page (immediate) | 1 hour | 5 minutes | 14.4x | 2% |
| Page (urgent) | 6 hours | 30 minutes | 6x | 5% |
| Ticket (next business day) | 3 days | 6 hours | 1x | 10% |
The first row says: if the burn rate over the last hour is 14.4x AND the burn rate over the last 5 minutes is also 14.4x, page someone immediately. This fires within minutes of a serious incident while avoiding false alarms from brief blips.
The third row catches slow burns. If you’re consuming budget at the expected rate over 3 days but that rate is sustained (confirmed by the 6-hour window), open a ticket. It’s not urgent, but it needs attention before the month ends.
Steady consumption uses the budget evenly across the month. A 2x burn rate exhausts the budget by day 15, triggering escalation policies well before month end.
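The table's rules reduce to a simple check: an alert fires only when both its windows exceed the burn-rate threshold. A sketch, assuming the per-window error rates come from your metrics system; the window names and severity labels are illustrative:

```python
# (long window, short window, burn-rate threshold, severity) per the table.
ALERT_RULES = [
    ("1h", "5m", 14.4, "page-immediate"),
    ("6h", "30m", 6.0, "page-urgent"),
    ("3d", "6h", 1.0, "ticket"),
]

def evaluate_alerts(error_rates: dict, slo_target: float = 0.999) -> list:
    """error_rates maps a window name to the observed error rate in it.
    Returns the severities whose long AND short windows both breach."""
    budget_rate = 1.0 - slo_target
    fired = []
    for long_w, short_w, threshold, severity in ALERT_RULES:
        long_burn = error_rates[long_w] / budget_rate
        short_burn = error_rates[short_w] / budget_rate
        if long_burn >= threshold and short_burn >= threshold:
            fired.append(severity)
    return fired
```

A brief blip raises the short-window burn but not the long-window one, so nothing fires; a sustained incident breaches both within minutes.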
SLO review cadence
SLOs aren’t set once and forgotten. They need regular review.
Monthly review. Check budget consumption, identify top contributors to budget burn, and verify that the SLO still reflects user expectations. This is a 30-minute meeting with SRE and the product team.
Quarterly review. Evaluate whether the SLO target itself is correct. Should you tighten it because users expect more? Should you loosen it because the current target is blocking velocity with no measurable user impact? Adjust the error budget policy if needed.
After major incidents. Any incident that consumes more than 20% of the error budget in a single event deserves an immediate SLO review. Was the SLO too loose, allowing a serious degradation without alerting? Or was the SLO appropriate and the incident was genuinely exceptional?
Putting it all together
Here’s how SLIs, SLOs, and error budgets work as a system:
- Identify user journeys. What are the critical paths through your service?
- Define SLIs. For each journey, what metrics capture user experience?
- Set SLOs. Based on current performance and user expectations, what targets make sense?
- Calculate error budgets. How much unreliability can you tolerate per measurement window?
- Create error budget policies. What actions trigger at different budget levels?
- Implement burn rate alerts. Detect budget consumption speed, not just violations.
- Review and adjust. Monthly check-ins on budget health. Quarterly reviews of the targets themselves.
This framework replaces “we need to be more reliable” with “we have 27 minutes of error budget remaining, the top consumer is the payment service timeout issue, and fixing it will recover 15 minutes of budget.”
That’s the power of SLIs, SLOs, and error budgets. They make reliability concrete, measurable, and actionable.
What comes next
Reliability targets are meaningless if your team is drowning in manual work. The next article covers toil reduction and automation, the practice of identifying, measuring, and systematically eliminating the repetitive operational work that prevents SRE teams from doing engineering.