DevOps metrics and measuring maturity
In this series (10 parts)
- What DevOps actually is
- The software delivery lifecycle
- Agile, Scrum, and Kanban for DevOps teams
- Trunk-based development and branching strategies
- Environments and promotion strategies
- Configuration management
- Secrets management
- Deployment strategies
- On-call culture and incident management
- DevOps metrics and measuring maturity
“Are we doing DevOps well?” is a question every engineering leader eventually asks. The answer used to be subjective. One team would point to their CI/CD pipeline. Another would cite their uptime. Neither had a common benchmark.
The DORA (DevOps Research and Assessment) team, through years of survey data across thousands of organizations, identified four metrics that reliably predict software delivery performance and organizational outcomes. These are not vanity metrics. They correlate with profitability, market share, and employee satisfaction.
The four DORA metrics
1. Deployment frequency
How often does your team deploy to production?
This measures the batch size of changes. Frequent deployments mean small changes. Small changes are easier to test, easier to debug, and faster to roll back. A team deploying once a quarter ships large, risky releases. A team deploying multiple times per day ships tiny, low-risk increments.
Deployment frequency is a proxy for how well your delivery pipeline works. If deploys are painful, you deploy rarely. If deploys are automated and reliable, you deploy often.
2. Lead time for changes
How long does it take from code commit to code running in production?
This measures the efficiency of your entire pipeline: code review, CI builds, test suites, approval gates, and deployment automation. A lead time of one hour means a developer’s fix reaches users the same day. A lead time of two weeks means urgent fixes queue behind process bottlenecks.
Lead time is the metric most sensitive to pipeline problems. A test suite that flakes on 20% of builds forces retries and context switches that can easily double effective lead time. A manual approval gate that requires a manager’s sign-off adds days.
3. Change failure rate
What percentage of deployments cause a failure in production?
A failure is anything that requires remediation: a rollback, a hotfix, a patch. If you deploy 100 times and 15 of those cause incidents, your change failure rate is 15%.
This metric balances deployment frequency. Deploying 50 times a day means nothing if half those deployments break something. High performers deploy frequently and maintain low failure rates. They achieve this through testing, code review, feature flags, and canary deployments.
4. Mean time to recovery (MTTR)
When a failure occurs, how long does it take to restore service?
MTTR is the ultimate measure of your incident response capability. It includes detection time, diagnosis time, remediation time, and verification time. A 5-minute MTTR means your monitoring caught the issue instantly, your runbook guided a fast fix, and your deployment pipeline pushed the fix quickly. A 4-hour MTTR suggests gaps in observability, documentation, or deployment speed.
Performance tiers
DORA classifies organizations into four performance tiers based on these metrics:
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment frequency | Multiple per day | Weekly to monthly | Monthly to every 6 months | Fewer than once per 6 months |
| Lead time | Less than 1 hour | 1 day to 1 week | 1 to 6 months | More than 6 months |
| Change failure rate | 0-5% | 6-10% | 11-15% | 16-30% |
| MTTR | Less than 1 hour | Less than 1 day | 1 day to 1 week | More than 1 week |
The relationship between these tiers is not what most people expect. Elite performers deploy more frequently and have lower failure rates. Speed and stability are not a tradeoff. They reinforce each other.
Plotted as deployment frequency against change failure rate, elite performers occupy the high-frequency, low-failure corner. The common belief that “moving fast breaks things” is empirically false for teams with mature DevOps practices.
How to measure each metric
Deployment frequency
Source: your deployment tool. Count production deployments per day, week, or month.
Watch out for gaming. Some teams count “deploys” that only update config or documentation. Measure deployments that include code changes to production services.
```sql
-- Example: deployment frequency from a deployments table
SELECT
    DATE_TRUNC('week', deployed_at) AS week,
    COUNT(*) AS deployments,
    COUNT(DISTINCT service_name) AS services_deployed
FROM deployments
WHERE environment = 'production'
  AND deployed_at > NOW() - INTERVAL '90 days'
GROUP BY 1
ORDER BY 1;
```
Lead time for changes
Source: your version control and deployment systems. Measure the time between a commit merging to the main branch and that commit running in production.
This is harder than it sounds. A single deployment may include dozens of commits. Do you measure from the first commit or the last? The DORA definition uses the median time for all commits in a deployment. Most teams start with the simpler “time from merge to deploy.”
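As a starting point, here is a minimal sketch of the simpler “merge to deploy” version, assuming a hypothetical `changes` table that records a `merged_at` and a `deployed_at` timestamp for each change:

```sql
-- Example: median "merge to deploy" lead time per week, assuming a
-- hypothetical changes table with merged_at and deployed_at timestamps
SELECT
    DATE_TRUNC('week', deployed_at) AS week,
    PERCENTILE_CONT(0.5) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (deployed_at - merged_at)) / 3600
    ) AS median_lead_time_hours
FROM changes
WHERE deployed_at > NOW() - INTERVAL '90 days'
GROUP BY 1
ORDER BY 1;
```

The median is more robust than the mean here: a single change that sat in review for a month does not distort the weekly number.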
Change failure rate
Source: your incident tracking and deployment systems. Divide the number of deployments that caused an incident by the total number of deployments.
Define “caused an incident” precisely. Does a canary that was rolled back before users noticed count? It should. The canary caught a failure, and that failure was caused by a deployment change. Whether users felt it is a separate question about detection quality.
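A sketch of the calculation, assuming incidents are linked to the deployment that caused them through a hypothetical `caused_by_deployment_id` column:

```sql
-- Example: monthly change failure rate, assuming incidents carry a
-- hypothetical caused_by_deployment_id foreign key
SELECT
    DATE_TRUNC('month', d.deployed_at) AS month,
    COUNT(DISTINCT i.caused_by_deployment_id)::float
        / COUNT(DISTINCT d.id) AS change_failure_rate
FROM deployments d
LEFT JOIN incidents i ON i.caused_by_deployment_id = d.id
WHERE d.environment = 'production'
GROUP BY 1
ORDER BY 1;
```

The hard part is not the query but the linking: someone has to attribute each incident to a deployment, which is exactly the definitional work described above.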
Mean time to recovery
Source: your incident management system. Measure the time from when the failure began to service restoration, not just from when it was detected.
MTTR is an average. Averages hide outliers. Track the distribution. A team with a 30-minute median MTTR but a 12-hour p99 has a tail-risk problem. One in a hundred incidents takes half a day to resolve.
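A sketch of tracking that distribution, assuming a hypothetical `incidents` table with `detected_at` and `resolved_at` timestamps:

```sql
-- Example: median and p99 recovery time, assuming a hypothetical
-- incidents table with detected_at and resolved_at timestamps
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (resolved_at - detected_at)) / 60
    ) AS median_minutes,
    PERCENTILE_CONT(0.99) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (resolved_at - detected_at)) / 60
    ) AS p99_minutes
FROM incidents
WHERE resolved_at IS NOT NULL
  AND detected_at > NOW() - INTERVAL '1 year';
```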
DevOps maturity models
Maturity models give organizations a roadmap beyond the four metrics. They describe capabilities at each level, helping teams identify the next improvement to pursue.
```mermaid
graph LR
    L1["Level 1<br/>Ad Hoc"] --> L2["Level 2<br/>Managed"]
    L2 --> L3["Level 3<br/>Defined"]
    L3 --> L4["Level 4<br/>Measured"]
    L4 --> L5["Level 5<br/>Optimized"]
    style L1 fill:#e74c3c,stroke:#333,color:#fff
    style L2 fill:#f39c12,stroke:#333,color:#fff
    style L3 fill:#f1c40f,stroke:#333
    style L4 fill:#3498db,stroke:#333,color:#fff
    style L5 fill:#2ecc71,stroke:#333,color:#fff
```
Five-level maturity progression from ad hoc processes to continuous optimization.
Level 1: Ad Hoc. No standardized processes. Deployments are manual. Testing is sporadic. Outages are chaotic. Teams rely on individual heroics.
Level 2: Managed. Basic CI/CD exists. Some tests run automatically. Deployments are semi-automated. Incident response follows informal patterns.
Level 3: Defined. Standardized pipelines across teams. Comprehensive test suites. Infrastructure as code. Postmortems are routine. On-call is structured.
Level 4: Measured. DORA metrics are tracked. Teams set targets and measure progress. Feedback loops from production inform development priorities.
Level 5: Optimized. Continuous experimentation. Chaos engineering. Automated canary analysis. The organization invests proactively in reliability rather than reacting to incidents.
Most organizations are between Level 2 and Level 3. The jump from Level 3 to Level 4 requires a cultural shift: treating metrics as a feedback mechanism rather than a reporting obligation.
Common traps when optimizing metrics
Metrics are powerful. They are also dangerous. Goodhart’s Law applies: “When a measure becomes a target, it ceases to be a good measure.”
Trap 1: Optimizing deployment frequency by splitting deploys
A team deploys once per week. Management wants higher deployment frequency. The team starts splitting their weekly release into five daily deploys, each containing a fifth of the changes. The number goes up. Nothing else improves.
The goal behind deployment frequency is small batch sizes and fast feedback. Artificially splitting deploys achieves neither.
Trap 2: Hiding failures to lower change failure rate
If change failure rate is a performance target, teams have an incentive to not report failures. A canary rollback becomes “normal operation” instead of a change failure. Error rate spikes get attributed to “user behavior” instead of a bad deploy.
Change failure rate works as a diagnostic metric, not a performance target for individuals. Track it at the organizational level. Use it to identify systemic improvement opportunities, not to rank teams.
Trap 3: Cutting corners to reduce lead time
A long lead time might reflect thorough code review, comprehensive testing, and careful staging validation. Removing those steps reduces lead time but increases change failure rate. The metrics are a system. Improving one at the expense of another is not progress.
Trap 4: Measuring MTTR without measuring detection time
A complete recovery clock starts when the failure occurs, not when it is detected. If a failure goes undetected for two hours and then takes ten minutes to fix, users experienced a 130-minute outage, yet a team that only measures “time from page to resolution” reports ten minutes and misses the detection gap entirely. Invest in monitoring that catches issues before users report them.
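One way to make the gap visible is to split recovery time into its components. A minimal sketch, assuming incidents record hypothetical `started_at`, `detected_at`, and `resolved_at` timestamps:

```sql
-- Example: separating detection lag from fix time, assuming hypothetical
-- started_at, detected_at, and resolved_at timestamps on incidents
SELECT
    AVG(detected_at - started_at) AS avg_detection_lag,
    AVG(resolved_at - detected_at) AS avg_fix_time,
    AVG(resolved_at - started_at)  AS avg_total_recovery
FROM incidents
WHERE resolved_at IS NOT NULL
  AND started_at > NOW() - INTERVAL '90 days';
```

In practice `started_at` is usually backfilled during the postmortem, which is one more reason postmortems matter.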
Using metrics as a feedback loop
The right way to use DORA metrics:
- Baseline. Measure where you are today. No judgment, just data.
- Identify constraints. Which metric is worst? That is your bottleneck.
- Target improvements. Pick one metric. Set a modest improvement goal.
- Invest in capabilities. If lead time is the bottleneck, invest in test speed, build caching, and deployment automation.
- Measure again. Did the investment move the metric? Did other metrics stay stable or improve?
- Repeat. Continuous improvement, not a one-time project.
```mermaid
graph TD
    B["Baseline Metrics"] --> IC["Identify Constraint"]
    IC --> TI["Target Improvement"]
    TI --> INV["Invest in Capability"]
    INV --> M["Measure Again"]
    M -->|"Improved"| IC
    M -->|"No change"| RE["Re-evaluate Approach"]
    RE --> INV
```
Metrics-driven improvement cycle: measure, identify the bottleneck, invest, verify.
The cycle never ends. Elite performers are not organizations that reached a destination. They are organizations that continuously improve faster than their environment changes.
Beyond DORA
DORA metrics measure delivery performance. They do not measure everything that matters.
Reliability metrics: Error budgets, SLI/SLO compliance, availability percentages. These measure what users experience.
Developer experience metrics: Developer satisfaction surveys, onboarding time, time to first commit. These measure whether your engineers can do their best work.
Security metrics: Vulnerability remediation time, dependency update frequency, security incident rate. These measure your security posture.
Cost metrics: Infrastructure cost per request, cost per deployment, cloud waste percentage. These measure efficiency.
DORA metrics are the foundation. Build additional metrics around them based on what your organization needs to improve.
What comes next
This article completes the DevOps Fundamentals series. You have covered environments, configuration, secrets, deployment strategies, incident management, and metrics. The natural next step is depth. Pick the area where your organization scores lowest on the maturity model and go deep. If incidents are your problem, invest in observability and chaos engineering. If deployment speed is the bottleneck, focus on CI/CD pipeline optimization and test infrastructure. The metrics will tell you where to look. The practices you have learned will tell you what to build.