Production readiness reviews
In this series (11 parts)
- What SRE is
- Reliability fundamentals
- SLIs, SLOs, and error budgets in practice
- Toil reduction and automation
- Capacity planning
- Performance testing and load testing
- Chaos engineering
- Incident response in practice
- Postmortems and learning from failure
- Production readiness reviews
- Reliability patterns for services
Shipping a new service to production is exciting. It is also the moment where gaps in reliability planning become real problems. A Production Readiness Review (PRR) is a structured evaluation that checks whether a service meets the bar for reliable operation before it takes live traffic.
The goal is not to slow teams down. The goal is to catch the gaps that cause 3 AM pages two weeks after launch.
What a PRR evaluates
A PRR covers the operational aspects of a service that development teams sometimes overlook. Each area below represents a category of risk.
Monitoring and alerting
Can you tell when the service is unhealthy? Specifically:
- Are SLIs defined and measured?
- Do alerts fire when SLOs are at risk?
- Are dashboards available showing request rate, error rate, and latency?
- Do alerts route to the correct on-call rotation?
A service without monitoring is a service you cannot operate. This is the single most common gap in PRRs.
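The "SLOs at risk" check above is often implemented as a burn-rate rule. A minimal sketch, assuming a ratio-based availability SLI; the 14.4x multiplier is a common fast-burn convention (exhausting a 30-day budget in roughly two days), not something prescribed by this article:

```python
# Hypothetical burn-rate check: page when the error budget is being
# consumed faster than a threshold multiple of the sustainable rate.

def should_page(error_rate: float, slo_target: float,
                burn_rate_threshold: float = 14.4) -> bool:
    """Return True when the SLO's error budget is burning too fast."""
    error_budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    if error_budget <= 0:
        return error_rate > 0                # a 100% SLO pages on any error
    burn_rate = error_rate / error_budget    # multiples of budget consumed
    return burn_rate >= burn_rate_threshold

# A 99.9% SLO with a 2% error rate burns budget 20x too fast: page.
print(should_page(error_rate=0.02, slo_target=0.999))
```

In practice this lives in your alerting system (e.g. as a recording and alerting rule), but encoding the threshold explicitly makes the paging criterion reviewable during the PRR.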
Runbooks for common failure modes
For every alert, there should be a runbook that describes what the alert means, how to diagnose the problem, and what remediation steps to take. Runbooks do not need to be perfect. They need to exist and be findable.
Ask the team: “If your service starts returning 500 errors at 2 AM, what does the on-call engineer do first?” If the answer is “look at the code,” you need a runbook.
Load testing
Has the service been tested under expected peak load? This means:
- Identifying the expected peak (often 2x to 3x the average)
- Running synthetic load at that level for a sustained period
- Measuring latency, error rate, and resource utilization under load
- Identifying the breaking point where the service degrades
Load testing in staging is acceptable. Load testing in production is better but requires careful coordination.
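The measurement side of a load test can be sketched in a few lines. This is a minimal illustration, not a load-testing tool: `send_request` is a simulated stand-in you would replace with a real call to your service, and the latency and error numbers are invented for the demo:

```python
# Minimal load-test sketch: drive concurrent requests and report
# latency percentiles and error rate. send_request is a stand-in
# for a real HTTP/gRPC call against the service under test.
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def send_request() -> tuple[float, bool]:
    """Simulated request: returns (latency_seconds, success)."""
    latency = random.uniform(0.01, 0.05)     # stand-in for network time
    time.sleep(latency)
    return latency, random.random() > 0.01   # ~1% simulated error rate

def run_load(total_requests: int, concurrency: int) -> dict:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: send_request(), range(total_requests)))
    latencies = sorted(r[0] for r in results)
    errors = sum(1 for r in results if not r[1])
    return {
        "p50": statistics.median(latencies),
        "p99": latencies[int(len(latencies) * 0.99) - 1],
        "error_rate": errors / total_requests,
    }

report = run_load(total_requests=200, concurrency=20)
print(report)
```

A real test would sustain the target rate for minutes to hours and watch resource utilization alongside these request metrics, but the outputs to review are the same: percentiles, error rate, and the load level at which they degrade.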
Rollback plan
Every deployment needs a rollback plan. The PRR checks:
- Can you roll back to the previous version in under five minutes?
- Have you tested the rollback procedure?
- Are database migrations backward-compatible so rollback does not corrupt data?
- Is the rollback automated or does it require manual steps?
A rollback plan that has never been tested is not a rollback plan. It is a hypothesis.
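The "is the rollback automated" question can be made concrete with a post-deploy health gate. A sketch under stated assumptions: `fetch_error_rate` and `rollback` are hypothetical stand-ins for your metrics API and deployment tooling, and the baseline and tolerance values are illustrative:

```python
# Hypothetical automated rollback decision: after a deploy, compare the
# new version's error rate to the pre-deploy baseline and roll back if
# it regresses beyond a tolerance.

def needs_rollback(baseline_error_rate: float, current_error_rate: float,
                   tolerance: float = 0.01) -> bool:
    """Roll back when errors exceed the baseline by more than tolerance."""
    return current_error_rate > baseline_error_rate + tolerance

def post_deploy_check(baseline: float, fetch_error_rate, rollback) -> str:
    current = fetch_error_rate()
    if needs_rollback(baseline, current):
        rollback()                           # invoke deployment tooling
        return "rolled back"
    return "healthy"

# Simulate a bad deploy: 5% errors against a 0.2% baseline triggers rollback.
print(post_deploy_check(baseline=0.002,
                        fetch_error_rate=lambda: 0.05,
                        rollback=lambda: None))
```

Exercising this path in staging, with the rollback actually executed end to end, is what turns the hypothesis into a tested plan.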
Dependency mapping
What does this service depend on? For each dependency:
- What happens if the dependency is unavailable?
- Is there a timeout configured?
- Is there a circuit breaker or fallback?
- What is the dependency’s SLA, and does your service’s SLO account for it?
Teams frequently underestimate how many dependencies they have. A service that calls three APIs, uses a database, reads from a cache, and publishes to a message queue has six dependencies. Each one can fail.
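The timeout, circuit breaker, and fallback questions above share a shape. A minimal sketch of the open/closed state machine; production services would typically use a resilience library with wall-clock reset timers and a half-open state, which this deliberately omits:

```python
# Minimal circuit-breaker sketch for a dependency call: after enough
# consecutive failures the breaker opens and requests go straight to
# the fallback instead of the failing dependency.
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(self, dependency, fallback):
        if self.open:
            return fallback()              # skip the dependency entirely
        try:
            result = dependency()
            self.failures = 0              # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            return fallback()

def flaky_dependency():
    raise TimeoutError("dependency timed out")

breaker = CircuitBreaker(failure_threshold=3)
for _ in range(5):
    response = breaker.call(flaky_dependency, fallback=lambda: "cached value")
print(breaker.open, response)
```

The PRR question for each dependency is then concrete: what is the fallback, and is "cached value" (or an empty response, or a queued retry) acceptable to users?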
Data retention and backup
- Is data backed up on a regular schedule?
- Has a backup restore been tested?
- What is the recovery point objective (how much data can you lose)?
- What is the recovery time objective (how long to restore)?
- Are there data retention policies that comply with organizational or regulatory requirements?
Backups that have never been restored are not backups. They are hopes.
Security review
- Has the service passed a security review?
- Are secrets stored in a vault, not in configuration files?
- Is TLS enforced for all network communication?
- Are inputs validated and sanitized?
- Are authentication and authorization implemented correctly?
Security is not optional for production readiness. A vulnerability in a new service becomes an incident for the entire organization.
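The "secrets in a vault, not in configuration files" check has a simple failure mode worth illustrating: a hardcoded credential. A sketch of the minimum bar, reading from the process environment (a vault client would replace `os.environ` in practice; the variable name here is invented for the demo):

```python
# Sketch: load secrets from the environment at runtime instead of
# embedding them in configuration files or source code.
import os

def load_secret(name: str) -> str:
    """Fetch a secret provisioned into the environment; fail loudly if absent."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"secret {name} not provisioned")
    return value

# Demo only: a real deployment would provision this via vault tooling.
os.environ["EXAMPLE_DB_PASSWORD"] = "demo-value"
print(load_secret("EXAMPLE_DB_PASSWORD"))
```

Failing loudly at startup when a secret is missing is deliberate: a service that silently falls back to a default credential is a finding in its own right.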
PRR workflow
The review follows a structured flow from request to approval.
graph TD
A["Team requests PRR"] --> B["SRE assigns reviewer"]
B --> C["Team completes self-assessment"]
C --> D["Reviewer evaluates each area"]
D --> E{"Findings?"}
E -->|"No blocking issues"| F["Approve for launch"]
E -->|"Advisory findings"| G["Approve with recommendations"]
E -->|"Blocking findings"| H["Remediation required"]
H --> I["Team addresses findings"]
I --> D
G --> J["Launch with follow-up items"]
F --> J
Production readiness review workflow from request through launch approval.
The cycle from findings to remediation can repeat multiple times. That is normal. Each iteration makes the service more resilient.
Checklist vs conversation
A checklist helps ensure nothing gets missed. But a PRR that is only a checklist misses the point.
The real value comes from the conversation between the development team and the reviewer. When a reviewer asks “what happens if your primary database fails over?”, the team’s answer reveals their understanding of the system. Sometimes the answer is “we haven’t thought about that.” That is exactly the kind of gap a PRR is designed to surface.
Structure the PRR as a meeting where the team walks through each area. Use the checklist as a guide, not a substitute for discussion.
The self-assessment
Before the review meeting, the team fills out a self-assessment covering each PRR area. This serves two purposes:
- It forces the team to think through operational concerns before the meeting.
- It gives the reviewer context so they can focus the conversation on gaps rather than areas that are already solid.
A good self-assessment is honest. “We have not load tested yet” is more useful than “load testing is planned.” The reviewer needs to know the current state, not the future state.
Launch blocking vs advisory findings
Not every gap should block a launch. The distinction matters.
Blocking findings represent risks that are likely to cause a significant incident. Examples:
- No monitoring or alerting configured
- No rollback plan
- No load testing for a high-traffic service
- Unresolved critical security vulnerability
Advisory findings represent risks that should be addressed but are unlikely to cause an immediate incident. Examples:
- Runbooks exist but are incomplete
- Load testing covered average load but not peak
- Backup restore has not been tested in the last 90 days
- Some alerts lack runbook links
Advisory findings ship with the service but are tracked as follow-up items with deadlines. If advisory findings are never resolved, they accumulate into blocking risk for the next major change.
Severity matrix
A simple matrix helps calibrate decisions:
| Impact if it fails | Likelihood of failure | Finding type |
|---|---|---|
| High | High | Blocking |
| High | Low | Advisory (urgent) |
| Low | High | Advisory |
| Low | Low | Note for future |
This keeps the process consistent. Without a matrix, whether something blocks a launch depends on who the reviewer is and what mood they are in.
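The matrix above is small enough to encode directly, which makes classification reproducible across reviewers. A sketch using the article's High/Low buckets:

```python
# The severity matrix as a lookup table: impact x likelihood -> finding type.
FINDING_MATRIX = {
    ("high", "high"): "blocking",
    ("high", "low"): "advisory (urgent)",
    ("low", "high"): "advisory",
    ("low", "low"): "note for future",
}

def classify_finding(impact: str, likelihood: str) -> str:
    """Map an impact/likelihood pair to a finding type per the matrix."""
    return FINDING_MATRIX[(impact.lower(), likelihood.lower())]

print(classify_finding("High", "High"))
```

Encoding it also forces the team to record an impact and likelihood for every finding, which is half the value of the calibration.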
When to require a PRR
Not every code change needs a full production readiness review. Reserve PRRs for:
- New services entering production for the first time
- Major architecture changes that alter failure modes (switching databases, adding a new dependency, changing deployment topology)
- Services crossing a traffic threshold where the blast radius of an outage increases significantly
- Services handling sensitive data for the first time
For incremental changes to existing services, rely on your regular code review and deployment processes.
Scaling PRRs across the organization
If you have hundreds of services, you cannot review every one individually. Tiered approaches work well:
- Tier 1 (critical services): Full PRR with SRE reviewer. Annual re-review.
- Tier 2 (standard services): Self-assessment plus lightweight SRE review. Re-review on major changes.
- Tier 3 (low-risk internal tools): Self-assessment only. Spot-check periodically.
The tier assignment should be based on user impact and traffic volume. A service that handles payments is tier 1. An internal admin dashboard might be tier 3.
Tracking PRR coverage
Track coverage as a percentage of services that have completed a PRR within the required timeframe. Low coverage in tier 1 is a problem. Low coverage in tier 3 is acceptable initially but should improve over time.
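Per-tier coverage reduces to a grouped fraction. A sketch with an illustrative service catalog; in practice the records would come from your service inventory:

```python
# Sketch: fraction of services in each tier with a current PRR.
from collections import defaultdict

def coverage_by_tier(services: list[dict]) -> dict[int, float]:
    """Return {tier: fraction of services with a current PRR}."""
    total = defaultdict(int)
    covered = defaultdict(int)
    for svc in services:
        total[svc["tier"]] += 1
        covered[svc["tier"]] += svc["prr_current"]  # bool counts as 0 or 1
    return {tier: covered[tier] / total[tier] for tier in total}

# Illustrative catalog entries, not real services.
catalog = [
    {"name": "payments", "tier": 1, "prr_current": True},
    {"name": "checkout", "tier": 1, "prr_current": False},
    {"name": "admin-dashboard", "tier": 3, "prr_current": False},
]
print(coverage_by_tier(catalog))
```

A dashboard over this number, broken out by tier, is usually enough to drive the improvement the article describes: tier 1 gaps get immediate attention, tier 3 gaps get a backlog item.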
Common mistakes
Treating PRR as a gate to punish teams. If teams see the PRR as an obstacle, they will avoid it or game the checklist. Position it as a partnership. The reviewer’s job is to help the team succeed in production, not to find reasons to block them.
Reviewing too late. If the PRR happens the day before launch, there is no time to fix blocking issues without delaying the launch. Start the PRR process early, ideally two weeks before the target launch date.
Ignoring advisory findings. Advisory findings that never get resolved erode trust in the process. Track them with the same discipline you apply to postmortem action items.
One-size-fits-all checklists. A data pipeline has different operational needs than a user-facing API. Tailor the checklist to the service type while keeping a common baseline.
What comes next
Now that you know how to evaluate services before launch, the next article on reliability patterns covers the engineering patterns that make services resilient during operation: graceful degradation, load shedding, circuit breakers, and more.