What SRE is

In this series (11 parts)
  1. What SRE is
  2. Reliability fundamentals
  3. SLIs, SLOs, and error budgets in practice
  4. Toil reduction and automation
  5. Capacity planning
  6. Performance testing and load testing
  7. Chaos engineering
  8. Incident response in practice
  9. Postmortems and learning from failure
  10. Production readiness reviews
  11. Reliability patterns for services

Servers go down. Deploys fail. Disks fill up at 3 AM. Every engineering organization deals with these problems. The question is how.

Traditional operations teams handle incidents reactively, processing tickets and performing manual tasks. Site Reliability Engineering (SRE) takes a fundamentally different approach. It treats operations as a software problem and applies engineering discipline to solve it.

The origin story

In 2003, Ben Treynor joined Google to lead a production team. When asked what SRE is, his answer became the field’s defining statement: “SRE is what happens when you ask a software engineer to design an operations team.”

The insight was simple. Engineers who write software know how to automate repetitive work. They build systems, not checklists. Treynor’s team didn’t just keep Google running. They wrote code that kept Google running. That distinction matters more than it sounds.

Google published the SRE book in 2016, and the discipline spread rapidly. Today SRE teams exist at organizations of every size, from startups to enterprises.

SRE vs DevOps vs traditional ops

These three approaches solve the same underlying problem: getting reliable software to users. They differ in philosophy and execution.

Traditional operations separates development and operations into distinct teams. Developers write code and throw it over the wall. Ops catches it, deploys it, and pages someone when it breaks. Communication flows through tickets. Incentives are misaligned. Dev wants to ship fast. Ops wants stability.

DevOps is a cultural philosophy that breaks down silos between development and operations. It emphasizes shared ownership, continuous delivery, and automation. DevOps tells you what to value but not precisely how to implement those values.

SRE is a concrete implementation of DevOps principles. If DevOps is the interface, SRE is the class that implements it. SRE prescribes specific practices: error budgets, service level objectives, toil budgets, blameless postmortems, and capacity planning.

You can practice DevOps without SRE. You cannot practice SRE without embodying DevOps principles.

Reliability is a feature

Users don’t distinguish between “the app is slow” and “the app is broken.” Both mean the same thing: the product isn’t working.

Reliability is the most important feature of any system. It doesn’t matter how many features your product has if users can’t access them. A payment system that fails 1% of the time will lose customers. A messaging app with 5 seconds of latency will lose users to a competitor.

SRE treats reliability as a feature that requires engineering effort, just like any other feature. It gets designed, implemented, tested, and maintained. The difference is that reliability competes for engineering time with new features. Error budgets (covered in article 3 of this series) formalize that tradeoff.

What SREs actually do

An SRE’s day looks different from a traditional ops engineer’s day. Here’s the contrast.

Traditional ops engineer:

  • Processes deployment tickets
  • Manually scales infrastructure
  • Responds to alerts reactively
  • Writes runbooks in wikis
  • Troubleshoots the same issues repeatedly

Site reliability engineer:

  • Writes code to automate deployments
  • Builds autoscaling systems
  • Designs alerting based on SLOs
  • Converts runbooks into automated remediation
  • Eliminates recurring issues permanently

SREs write code. Typically 50% or more of their time goes to software engineering work: building tools, automating processes, improving monitoring, and designing resilient architectures. The rest goes to operational work, but that split is carefully managed.

SRE responsibility areas

SRE work spans several domains. Each feeds into the others.

graph TD
  A[SRE Responsibilities] --> B[Availability & Reliability]
  A --> C[Latency & Performance]
  A --> D[Monitoring & Observability]
  A --> E[Incident Response]
  A --> F[Capacity Planning]
  A --> G[Change Management]
  B --> H[SLIs / SLOs / Error Budgets]
  D --> I[Alerting & Dashboards]
  E --> J[On-call & Postmortems]
  F --> K[Load Testing & Forecasting]
  G --> L[Release Engineering & Automation]

SRE responsibilities span reliability measurement, observability, incident management, capacity planning, and change management.

Monitoring and observability form the foundation. You cannot improve what you cannot measure. Incident response builds on monitoring, turning signals into action. Postmortems turn incidents into systemic improvements. Release engineering ensures changes reach production safely. Capacity planning keeps systems ahead of growth.

Understanding toil

Toil is one of SRE’s most important concepts. It has a precise definition, not just “work I don’t like.”

Toil is work that meets all of these criteria:

  1. Manual. A human performs it, not a script or system.
  2. Repetitive. You’ve done it before and will do it again.
  3. Automatable. A machine could do it with sufficient engineering investment.
  4. Tactical. It’s reactive, responding to events rather than planning ahead.
  5. No lasting value. Once done, it doesn’t permanently improve the system.
  6. Scales linearly. If you double the number of services, the work doubles too.

Restarting a crashed service is toil. Rotating certificates by hand is toil. Manually provisioning user accounts is toil. Running a capacity review meeting is not toil. Writing an autoscaler is not toil. Designing a new monitoring dashboard is not toil.
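The six criteria can be turned into a quick checklist. The following is a minimal sketch, not a standard tool: the `TaskTraits` class, the `toil_score` and `is_toil` helpers, and the way the two example tasks are scored are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskTraits:
    """The six toil criteria from the list above, as yes/no answers."""
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_lasting_value: bool
    scales_linearly: bool


def toil_score(traits: TaskTraits) -> int:
    """Count how many of the six criteria the task meets."""
    return sum(getattr(traits, f.name) for f in fields(traits))


def is_toil(traits: TaskTraits) -> bool:
    """A task counts as toil only if it meets every criterion."""
    return toil_score(traits) == len(fields(traits))


# Illustrative scoring of two tasks mentioned in the text (assumed answers).
restart_crashed_service = TaskTraits(
    manual=True, repetitive=True, automatable=True,
    tactical=True, no_lasting_value=True, scales_linearly=True,
)
write_autoscaler = TaskTraits(
    manual=True, repetitive=False, automatable=False,
    tactical=False, no_lasting_value=False, scales_linearly=False,
)

print(is_toil(restart_crashed_service), toil_score(restart_crashed_service))
print(is_toil(write_autoscaler), toil_score(write_autoscaler))
```

Writing an autoscaler is still done by a human, so it meets the "manual" criterion, but it fails the other five. That is why a single criterion on its own is not enough to label work as toil.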

The distinction matters because toil is the enemy of engineering. Every hour spent on toil is an hour not spent building systems that prevent future toil.

The 50% rule

Google’s SRE book sets a firm cap: SREs should spend no more than 50% of their time on operational work, toil included. At least 50% goes to engineering work that reduces future toil or improves reliability.

This isn’t just a nice idea. It’s a contract. When toil exceeds 50%, the SRE team pushes back. Maybe the development team takes on some operational burden. Maybe a project gets prioritized to automate the toil away. Maybe the team stops onboarding new services until existing toil is addressed.

Without this boundary, SRE teams gradually become traditional ops teams. The engineering work gets deferred indefinitely while firefighting consumes every available hour. The 50% rule prevents that drift.

Tracking toil requires measurement. SRE teams typically log time spent on different categories of work, review the breakdown monthly, and flag when toil trends upward. The measurement itself creates accountability.
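A monthly review like this needs very little tooling. Here is a rough sketch, assuming time is logged as `(month, category, hours)` entries with the categories `"toil"` and `"engineering"`. The log format, the sample numbers, and the function names are hypothetical.

```python
from collections import defaultdict

TOIL_LIMIT = 0.50  # the 50% ceiling described above

# Hypothetical team time log: (month, category, hours).
time_log = [
    ("2024-01", "toil", 60.0),
    ("2024-01", "engineering", 100.0),
    ("2024-02", "toil", 90.0),
    ("2024-02", "engineering", 70.0),
]


def toil_fraction_by_month(entries):
    """Return {month: toil hours / total hours} for each logged month."""
    toil = defaultdict(float)
    total = defaultdict(float)
    for month, category, hours in entries:
        total[month] += hours
        if category == "toil":
            toil[month] += hours
    return {m: toil[m] / total[m] for m in sorted(total) if total[m] > 0}


def months_over_limit(fractions, limit=TOIL_LIMIT):
    """List the months where the toil share exceeded the limit."""
    return [month for month, frac in fractions.items() if frac > limit]


fractions = toil_fraction_by_month(time_log)
for month, frac in fractions.items():
    status = "over limit" if frac > TOIL_LIMIT else "ok"
    print(f"{month}: {frac:.1%} toil ({status})")
```

In the sample data, toil rises from 37.5% of logged hours in January to 56.25% in February, so February is the month the team would flag.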

Core SRE principles

Several principles run through all SRE practices:

Embrace risk. 100% is the wrong reliability target. Each increment toward it costs more than the last, and chasing it prevents you from shipping features. SRE sets an explicit reliability target below 100% and spends the remaining error budget on velocity.

Eliminate toil. Automate repetitive work. If a human does it more than twice, write code to do it.

Monitor meaningfully. Alerts should require human intelligence to resolve. If an alert has an obvious, mechanical response, automate that response instead.

Release safely. Use canary deployments, progressive rollouts, and automated rollbacks. Make releases boring.

Learn from failure. Blameless postmortems focus on systemic causes, not individual mistakes. Every incident is a learning opportunity the organization should capture and act on.

Plan for capacity. Organic growth and sudden traffic spikes both need to be modeled. Running out of capacity is a preventable failure.
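The "embrace risk" principle becomes concrete once the error budget is written as arithmetic: the budget is whatever the reliability target leaves over. The sketch below assumes a time-based availability target measured over a 30-day window. The function names and the window length are illustrative choices, not something this article prescribes.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime the target tolerates over the window, in minutes."""
    if not 0.0 < slo_target < 1.0:
        # A 100% target leaves no budget at all, so it is rejected here.
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
# After 10.8 minutes of downtime, about three quarters of the budget remains.
print(round(budget_remaining(0.999, 10.8), 2))
```

While budget remains, the team can spend it on risky launches. Once it goes negative, the same number argues for slowing down and working on reliability.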

Getting started with SRE

You don’t need Google-scale infrastructure to adopt SRE practices. Start with three things:

  1. Define SLOs for your most critical services. What does “working” mean from a user perspective? Set measurable targets.
  2. Measure toil. Track how your team spends time for two weeks. Categorize work as toil or engineering. The results will probably surprise you.
  3. Run blameless postmortems. After your next incident, focus on what the system allowed to happen, not who made a mistake.

These three practices create a foundation. Error budgets, automation projects, and capacity planning build on top of them naturally.
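For the first step, an SLO can start as a name, a target, and a way to compute the matching indicator. The following is a minimal sketch assuming a request-based availability SLI (good requests divided by total requests). The `Slo` class, the `checkout-availability` name, and the 99.5% target are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float  # e.g. 0.995 means 99.5% of requests must succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that were served successfully."""
    if total_requests == 0:
        # Assumption: with no traffic there is nothing to fail.
        return 1.0
    return good_requests / total_requests


def meets_slo(slo: Slo, good_requests: int, total_requests: int) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return availability_sli(good_requests, total_requests) >= slo.target


checkout = Slo(name="checkout-availability", target=0.995)

print(meets_slo(checkout, good_requests=99_700, total_requests=100_000))
print(meets_slo(checkout, good_requests=99_000, total_requests=100_000))
```

What counts as a "good" request is the part that needs thought. It should reflect what users experience, for example a successful response returned within an acceptable time.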

What comes next

Now that you understand what SRE is and why it matters, the next article covers the math behind reliability. Reliability fundamentals explores availability calculations, the nines table, MTBF, MTTR, failure modes, and the hierarchy of reliability practices that keep services running.