Toil reduction and automation
In this series (11 parts)
- What SRE is
- Reliability fundamentals
- SLIs, SLOs, and error budgets in practice
- Toil reduction and automation
- Capacity planning
- Performance testing and load testing
- Chaos engineering
- Incident response in practice
- Postmortems and learning from failure
- Production readiness reviews
- Reliability patterns for services
An SRE spends Monday morning manually restarting three services that crashed overnight. Tuesday she rotates expiring TLS certificates across twelve hosts. Wednesday she provisions accounts for new hires. Thursday she cleans up disk space on production servers. Friday she does it all again.
None of this work makes the system better. None of it prevents future problems. It just keeps things from getting worse. That’s toil.
What toil is (and isn’t)
Toil has a precise definition in SRE. It’s not just “boring work” or “work I’d rather not do.” Toil meets all six of these criteria:
- Manual. A human has to do it. A script or automated system cannot handle it today.
- Repetitive. It happens again and again. The first time you debug a new failure mode isn’t toil. The tenth time you restart the same crashing service is.
- Automatable. A computer could do it with sufficient engineering effort. Writing a quarterly business review isn’t toil because it requires human judgment that’s difficult to codify.
- Tactical. It’s interrupt-driven and reactive. You do it because something happened, not because you planned it.
- No lasting value. When it’s done, the system isn’t permanently better. You’ll need to do it again next week.
- Scales linearly with service growth. If you add 10 more services, the toil grows proportionally.
Not all operational work is toil. On-call shifts involve toil (restarting crashed processes) and non-toil (investigating novel failures). Writing a postmortem isn’t toil. Designing a monitoring dashboard isn’t toil. These activities produce lasting improvements.
The distinction matters because SRE treats toil as a measurable resource drain that should be minimized through engineering.
Measuring toil
You can’t reduce what you don’t measure. Toil measurement requires tracking how your team actually spends time.
The time tracking approach
For two weeks, have each team member categorize their work into buckets:
- Toil. Manual, repetitive operational work matching the six criteria above.
- Engineering. Writing code, designing systems, building automation, improving monitoring.
- Overhead. Meetings, email, administrative tasks, planning.
Calculate your toil percentage:
toil percentage = toil hours / (toil hours + engineering hours) x 100
Overhead is excluded from the ratio. You’re comparing toil against productive engineering work.
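The ratio above is simple enough to compute with a small helper. A minimal sketch in shell (the function name is illustrative):

```shell
# Compute the toil percentage from tracked hours.
# Overhead hours are deliberately excluded from the denominator.
toil_percentage() {
  # $1 = toil hours, $2 = engineering hours
  awk -v t="$1" -v e="$2" 'BEGIN { printf "%.1f", t / (t + e) * 100 }'
}

toil_percentage 30 50   # 30 toil hours against 50 engineering hours → 37.5
```

A team that logged 30 toil hours and 50 engineering hours over the tracking window sits at 37.5%, squarely in the normal range discussed below.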
What the numbers tell you
Google’s target is clear: toil should not exceed 50% of an SRE’s time. If it does, the team has a structural problem.
Here’s how to interpret your measurement:
- Below 30%. Healthy. Your automation efforts are paying off. Keep investing in engineering.
- 30% to 50%. Normal range. Identify the largest toil sources and prioritize automation projects.
- 50% to 70%. Warning zone. The team is becoming an ops team. Escalate to leadership. Reduce service load or increase automation investment.
- Above 70%. Critical. The team cannot do meaningful engineering work. Drastic intervention is needed, potentially handing services back to development teams or pausing new service onboarding.
Tracking toil over time
Don’t measure once and forget. Track toil percentage monthly. Plot the trend. A rising toil percentage is an early warning signal that automation isn’t keeping pace with growth.
Common toil tracking tools include time-tracking spreadsheets, ticketing system labels (tag tickets as “toil” or “engineering”), and custom dashboards that categorize on-call actions.
Common sources of toil
Certain operational tasks appear on almost every SRE team’s toil list.
Manual deployments. Someone SSHes into servers or clicks through a UI to deploy code. This should be a git push that triggers a pipeline.
Certificate rotation. TLS certificates expire. Someone manually generates new ones, copies them to servers, and restarts services. This should be handled by automated certificate management like cert-manager or ACME protocols.
Disk cleanup. Log files and temporary data fill disks. Someone logs in, identifies large files, and deletes them. This should be log rotation policies and automated cleanup jobs.
User provisioning. New employees need accounts, permissions, and access to systems. Someone processes each request manually. This should be identity management automation triggered by HR system events.
Capacity scaling. Traffic increases and someone manually adds instances or increases resource limits. This should be autoscaling policies based on load metrics.
Incident remediation. A known issue triggers an alert and someone executes a runbook manually. If the runbook steps are deterministic, a script should do them.
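To make one of these concrete, the disk cleanup task above is a natural first automation target. A hedged sketch in shell, assuming rotated logs follow a *.log.* naming convention (the directory and retention window are placeholders to adjust for your fleet):

```shell
# Delete rotated log files older than a retention window.
cleanup_old_logs() {
  dir=$1
  days=${2:-14}   # retention window in days, default 14
  # -mtime +N matches files last modified more than N days ago
  find "$dir" -type f -name '*.log.*' -mtime +"$days" -print -delete
}
```

Triggered from cron rather than by a human, this turns a recurring manual chore into a job nobody has to think about.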
Prioritizing automation
Not all toil is equally worth automating. The automation priority matrix helps you decide where to invest.
Plot each toil task on two axes: how frequently it occurs and how long it takes each time.
| | Low frequency (monthly) | Medium frequency (weekly) | High frequency (daily) |
|---|---|---|---|
| Long duration (> 1 hour) | Medium priority | High priority | Critical |
| Medium duration (15 to 60 min) | Low priority | Medium priority | High priority |
| Short duration (< 15 min) | Ignore for now | Low priority | Medium priority |
A task that happens daily and takes an hour each time is costing you 20+ hours per month. Automating it is almost certainly worth the investment. A task that happens monthly and takes 10 minutes can probably wait.
Factor in risk too. A manual task with high error consequences (like database migrations or certificate rotation) deserves automation even if it’s infrequent, because human error during that task could cause an incident.
The automation progression
Automation isn’t binary. Tasks progress through maturity levels, and each level delivers value.
```mermaid
graph LR
    A["Manual Process (undocumented)"] --> B["Documented Runbook (step-by-step)"]
    B --> C["Scripted Steps (human triggers)"]
    C --> D["Fully Automated (system triggers)"]
    D --> E["Self-Healing (detect + fix)"]
    style A fill:#EF553B,color:#fff
    style B fill:#FFA15A,color:#fff
    style C fill:#FECB52,color:#000
    style D fill:#00CC96,color:#fff
    style E fill:#636EFA,color:#fff
```
The progression from manual processes to self-healing systems. Each stage reduces toil and human involvement.
Stage 1: Undocumented manual process
The knowledge lives in one person’s head. When they’re on vacation, nobody knows how to do it. This is the most fragile and dangerous state.
Stage 2: Documented runbook
Steps are written down in a wiki or runbook. Anyone on the team can follow them. This doesn’t reduce the time spent on the task, but it reduces the risk of single points of knowledge and standardizes the process.
Stage 3: Scripted steps
The runbook is partially or fully converted into scripts. A human still triggers the script and monitors its execution, but the actual work is automated. Error rates drop because scripts don’t make typos.
A script that handles certificate rotation might look like:
```bash
#!/bin/bash
# rotate-certs.sh - Rotate the TLS certificate for a service
set -euo pipefail

# Require a service name; exit with a usage message if it is missing
SERVICE=${1:?usage: rotate-certs.sh SERVICE}

# Renew only the named certificate, then reload nginx to pick it up
certbot renew --cert-name "$SERVICE"
systemctl reload nginx
echo "Certificate rotated for $SERVICE"
```
The human runs ./rotate-certs.sh api-gateway. The script does the work. That’s a meaningful reduction in toil and error risk.
Stage 4: Fully automated
The system triggers the automation without human involvement. A cron job or event-driven pipeline handles the entire process. Humans are notified of the result but don’t need to intervene.
Certificate rotation becomes a cert-manager policy that automatically renews certificates 30 days before expiration, injects them into the cluster, and reloads services.
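Short of adopting cert-manager, even a cron entry moves a Stage 3 script to Stage 4, since the system rather than a human now triggers it. A hypothetical example, assuming the rotate-certs.sh script from the previous stage is installed at /usr/local/bin:

```shell
# /etc/cron.d/rotate-certs — run the rotation nightly with no human
# trigger; output is appended to a log so the result can be audited
# the next morning.
#
#   0 3 * * * root /usr/local/bin/rotate-certs.sh api-gateway >> /var/log/rotate-certs.log 2>&1
```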
Stage 5: Self-healing
The system detects problems and fixes them automatically. A crashed service restarts itself. A full disk triggers automated cleanup. A failed deployment rolls back without human intervention.
Self-healing combines monitoring, automated diagnosis, and automated remediation into a closed loop. The human only gets involved when the automation encounters a case it can’t handle.
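The closed loop can be sketched as a small function. This is a minimal illustration, not a production design: check_health, restart_service, and page_oncall are hypothetical hooks you would wire to your monitoring, init system, and paging tool.

```shell
# Self-healing sketch: restart once on a failed health check, and
# escalate to a human only if the restart does not restore health.
heal() {
  svc=$1
  check_health "$svc" && return 0   # healthy: nothing to do
  restart_service "$svc"            # automated remediation
  sleep 1                           # give the service time to come up
  if ! check_health "$svc"; then
    page_oncall "$svc: restart did not restore health"
    return 1
  fi
}
```

The key property is the last branch: the human is paged only for the cases the automation cannot fix itself.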
Building automation that lasts
Automation that breaks frequently or behaves unpredictably creates more toil than it eliminates. Follow these principles:
Make automation idempotent. Running it twice should produce the same result as running it once. If a certificate rotation script fails halfway through, rerunning it should complete the rotation without duplicating work.
Build in observability. Every automated action should produce logs, metrics, or events. When automation runs at 3 AM, you need to verify it worked correctly the next morning.
Add circuit breakers. Automation that detects unexpected state should stop and alert rather than proceeding blindly. A disk cleanup script that finds no files to delete should log that fact and exit, not delete random files trying to free space.
Test automation in staging. Automated processes that touch production should be validated in a staging environment first. This is especially important for destructive operations like cleanup jobs and scaling events.
Version control everything. Scripts, configurations, and automation pipelines belong in git. Treat automation code with the same rigor as application code.
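Idempotency in particular is worth a concrete sketch: check whether work is actually needed before doing it, so a rerun after a partial failure is a safe no-op. Here needs_renewal and rotate are placeholders; a real needs_renewal might invoke openssl x509 -checkend 2592000, which exits nonzero when a certificate expires within 30 days.

```shell
# Idempotent rotation: skip the work when nothing needs doing, so the
# script can be rerun safely after a partial failure.
safe_rotate() {
  cert=$1
  if ! needs_renewal "$cert"; then
    echo "up to date: $cert"   # rerunning changes nothing
    return 0
  fi
  rotate "$cert"
}
```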
Calculating the return on automation
Before investing engineering time in automation, estimate the return.
monthly toil cost = frequency per month x time per occurrence
automation build cost = estimated engineering hours to build + test
payback period = automation build cost / monthly toil cost
If a task takes 30 minutes and happens 20 times per month, the monthly toil cost is 10 hours. If automating it takes 40 engineering hours, the payback period is 4 months. After that, you’re saving 10 hours every month permanently.
Be honest about the build cost. Include testing, documentation, edge case handling, and ongoing maintenance. A common mistake is underestimating the effort to handle all the edge cases that a human currently resolves intuitively.
Some automation is worth building even with a long payback period if the manual task is error-prone and the consequences of errors are severe. Automating database failover might take months of engineering effort, but a single botched manual failover can cost millions in downtime.
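The payback arithmetic above fits in a tiny helper (a sketch; the name is illustrative): build hours divided by toil hours saved per month.

```shell
# Payback period in months for an automation investment.
payback_months() {
  # $1 = build hours, $2 = occurrences per month, $3 = hours each
  awk -v b="$1" -v f="$2" -v h="$3" 'BEGIN { printf "%.1f", b / (f * h) }'
}

payback_months 40 20 0.5   # the example above: 40 build hours vs 10 toil hours/month → 4.0
```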
Sustaining toil reduction
Toil reduction isn’t a one-time project. It’s an ongoing practice.
Reserve automation time. Block 20% of sprint capacity for automation projects. If automation competes with feature work for every sprint, features always win and toil accumulates.
Celebrate automation wins. When a team eliminates a significant source of toil, recognize it. Share the before-and-after metrics. Make automation a valued engineering achievement, not invisible background work.
Track new toil sources. Every new service, feature, or integration can introduce new toil. Include toil impact assessment in design reviews. Ask “what operational work does this create?” before approving new systems.
Rotate toil fairly. When manual work must be done, distribute it evenly across the team. Concentrating toil on junior engineers or specific individuals leads to burnout and attrition.
What comes next
This article completes the foundational concepts of Site Reliability Engineering. You now understand what SRE is, how to measure reliability, how to set targets with SLIs and SLOs, and how to reduce the operational burden through automation.
The next step is putting these ideas into practice. Start by measuring your team’s current toil percentage, picking the highest-impact automation candidate, and moving it from manual process to scripted steps. Small wins build momentum for larger reliability investments.
For a broader view of how SRE fits into the DevOps ecosystem, revisit What SRE is or explore other articles in this series.