Deployment strategies
In this series (10 parts)
- What DevOps actually is
- The software delivery lifecycle
- Agile, Scrum, and Kanban for DevOps teams
- Trunk-based development and branching strategies
- Environments and promotion strategies
- Configuration management
- Secrets management
- Deployment strategies
- On-call culture and incident management
- DevOps metrics and measuring maturity
Deploying code is the moment where everything you have built meets reality. A bad deployment strategy turns a minor bug into a full outage. A good one limits the blast radius so that when something goes wrong, and it will, only a fraction of users notice.
There is no single best strategy. Each one trades off deployment speed, rollback complexity, resource cost, and risk exposure. The right choice depends on your system’s tolerance for downtime, your infrastructure budget, and how much you trust your test suite.
Recreate deployment
The simplest strategy. Stop the old version. Start the new version. There is a window where nothing is running.
```mermaid
graph LR
    subgraph P1["Phase 1"]
        V1A["v1"] --> V1B["v1"] --> V1C["v1"]
    end
    subgraph P2["Phase 2"]
        DOWN["Downtime<br/>(all stopped)"]
    end
    subgraph P3["Phase 3"]
        V2A["v2"] --> V2B["v2"] --> V2C["v2"]
    end
    P1 --> P2 --> P3
```
Recreate: full stop, then full start. Simple but causes downtime.
When to use it: Internal tools, batch processing systems, development environments. Anything where a few minutes of downtime is acceptable.
When to avoid it: User-facing production services. Any system with SLA requirements.
Rollback: Deploy the old version using the same process. Downtime again.
Rolling update
Replace instances one at a time (or in small batches). At any point during the rollout, both the old and new versions are running simultaneously.
Kubernetes uses rolling updates as its default deployment strategy. You configure maxSurge (how many extra pods to create) and maxUnavailable (how many pods can be down during the update).
```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
```
With maxUnavailable: 0, Kubernetes creates a new pod before terminating an old one. Capacity never drops below the desired count.
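To make the surge-and-terminate mechanics concrete, here is a toy Python simulation (not Kubernetes code; the real controller also waits on readiness probes and pod lifecycle, which this ignores):

```python
def rolling_update(desired: int, max_surge: int, max_unavailable: int):
    """Simulate a rolling update; records (old_ready, new_ready) after each cycle.

    The two limits pin total ready pods inside
    [desired - max_unavailable, desired + max_surge] for the whole rollout.
    (Kubernetes rejects maxSurge=0 with maxUnavailable=0, which would deadlock.)
    """
    old, new = desired, 0
    history = []
    while old > 0 or new < desired:
        # Create new pods, but never exceed desired + max_surge in total.
        grow = min(desired + max_surge - (old + new), desired - new)
        new += max(grow, 0)
        # Terminate old pods, but never drop below desired - max_unavailable.
        shrink = min(old, (old + new) - (desired - max_unavailable))
        old -= max(shrink, 0)
        history.append((old, new))
    return history

# With maxSurge=1, maxUnavailable=0, capacity never dips below 3:
print(rolling_update(3, 1, 0))  # [(2, 1), (1, 2), (0, 3)]
```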
When to use it: Stateless services where old and new versions can coexist. APIs with backward-compatible changes. Most day-to-day deployments.
When to avoid it: Database schema changes that break the old version. Deployments requiring atomic switchover.
Rollback: Kubernetes tracks revision history. A single kubectl rollout undo reverts to the previous version using the same rolling mechanism.
The risk is version skew. During the rollout, some requests hit v1 and some hit v2. If v2 changes an API response format, clients talking to different pods get different responses. Design for this: make changes backward-compatible or use feature flags to activate new behavior only after all pods are updated.
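One defensive pattern for version skew is the tolerant reader: clients read only the fields they understand and ignore extras, so they work against either version mid-rollout. A minimal sketch (the payload shapes and field names are invented for illustration):

```python
# Hypothetical payloads: v1 returns {"name": ...}; v2 adds "display_name".
V1_RESPONSE = {"name": "Ada"}
V2_RESPONSE = {"name": "Ada", "display_name": "Ada L."}

def render_user(payload: dict) -> str:
    # Prefer the new field when present, fall back to the old one, and
    # silently ignore anything else the server might add later.
    return payload.get("display_name") or payload["name"]

print(render_user(V1_RESPONSE))  # Ada
print(render_user(V2_RESPONSE))  # Ada L.
```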
Blue-green deployment
Maintain two identical environments: blue (current production) and green (new version). Deploy to green, test it thoroughly, then switch the router to point all traffic at green. Blue becomes the instant rollback target.
```mermaid
graph LR
    LB["Load Balancer"] -->|"100% traffic"| BLUE["Blue (v1)<br/>Current Production"]
    GREEN["Green (v2)<br/>New Version"] -.->|"0% traffic<br/>(standby)"| LB
    style BLUE fill:#4a9eff,stroke:#333,color:#fff
    style GREEN fill:#2ecc71,stroke:#333,color:#fff
```
Blue-green: traffic points entirely at one environment. The other waits on standby.
After switching:
```mermaid
graph LR
    LB["Load Balancer"] -->|"100% traffic"| GREEN["Green (v2)<br/>Now Production"]
    BLUE["Blue (v1)<br/>Rollback Target"] -.->|"0% traffic<br/>(standby)"| LB
    style BLUE fill:#4a9eff,stroke:#333,color:#fff
    style GREEN fill:#2ecc71,stroke:#333,color:#fff
```
After the switch, green serves all traffic. Blue is the instant rollback target.
When to use it: When you need zero-downtime deployments with instant rollback. Mission-critical services where rollback speed matters more than infrastructure cost.
When to avoid it: When you cannot afford double the infrastructure. When database migrations make it impossible to run two versions against the same data store.
Rollback: Flip the router back to blue. Takes seconds. This is the fastest rollback of any strategy.
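The whole cutover is a single pointer flip at the router, which is why rollback is symmetric and near-instant. A toy sketch of that flip (class and environment names are illustrative):

```python
class BlueGreenRouter:
    """All traffic goes to one environment; the other sits on standby."""

    def __init__(self):
        self.environments = {"blue": "v1", "green": "v2"}
        self.live = "blue"

    def route(self) -> str:
        return self.environments[self.live]

    def switch(self) -> None:
        # Cutover and rollback are the same operation: flip the pointer.
        self.live = "green" if self.live == "blue" else "blue"

router = BlueGreenRouter()
print(router.route())  # v1 (blue is production)
router.switch()        # deploy: all traffic moves to green
print(router.route())  # v2
router.switch()        # rollback: flip straight back
print(router.route())  # v1
```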
The database problem is real. If v2 adds a column and v1 does not expect it, flipping back to blue might work fine. If v2 removes a column, rollback breaks. Use expand-and-contract migration patterns: add the column in one deploy, start using it in the next, remove the old column in a third.
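The expand-and-contract sequence can be sketched with sqlite3 (table and column names are hypothetical; the contract step appears only as a comment because it belongs to a later release):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
db.execute("INSERT INTO users (full_name) VALUES ('Ada Lovelace')")

# Deploy 1 (expand): add the new column. v1 keeps working because it
# simply ignores a column it does not know about.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Deploy 2 (migrate): backfill, and have v2 write both columns so a
# rollback to v1 still finds full_name populated.
db.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")

# Deploy 3 (contract), only after no running version reads full_name:
#   ALTER TABLE users DROP COLUMN full_name
# Running it now would break any v1 instance still in service.

row = db.execute("SELECT full_name, display_name FROM users").fetchone()
print(row)  # ('Ada Lovelace', 'Ada Lovelace')
```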
Canary deployment
Route a small percentage of traffic to the new version while the majority stays on the old version. Monitor error rates, latency, and business metrics. If the canary looks healthy, gradually increase its traffic share. If something is wrong, route all traffic back to the old version.
```mermaid
graph LR
    LB["Load Balancer<br/>(traffic split)"]
    LB -->|"95%"| V1["v1 (stable)"]
    LB -->|"5%"| V2["v2 (canary)"]
    V2 --> MON["Monitoring<br/>Errors? Latency?"]
    MON -->|"Healthy"| INC["Increase to 25%, 50%, 100%"]
    MON -->|"Degraded"| RB["Rollback to 0%"]
    style V2 fill:#f1c40f,stroke:#333
```
Canary: a fraction of traffic tests the new version. Monitoring drives the rollout decision.
This is risk management applied to deployments. Instead of betting all traffic on an untested version, you expose a small slice of users and watch what happens.
Error rate during a canary rollout
A well-executed canary shows a brief spike in error rate at each traffic increase, quickly settling back to baseline. Small spikes that appear and resolve at each step are expected; if the error rate instead crosses the rollback threshold, automation reverts the canary to 0%.
When to use it: Production services where you want to validate changes with real traffic before full rollout. Services with good observability. Any time you are deploying a risky change.
When to avoid it: When you lack monitoring to detect problems in the canary. When your traffic volume is too low for a 5% split to be statistically meaningful.
Rollback: Route all traffic back to the stable version. Fast, but slightly slower than blue-green because the traffic split needs to be reconfigured.
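The promote-or-rollback decision is usually a comparison of canary metrics against the stable baseline. A minimal sketch of such a gate (the thresholds are invented for illustration):

```python
def canary_decision(baseline_error_rate: float, canary_error_rate: float,
                    max_ratio: float = 2.0, max_absolute: float = 0.05) -> str:
    """Decide whether to keep promoting the canary or route traffic back.

    Roll back if the canary errors noticeably more than the stable version,
    either relative to baseline or in absolute terms.
    """
    if canary_error_rate > max_absolute:
        return "rollback"
    if baseline_error_rate > 0 and canary_error_rate / baseline_error_rate > max_ratio:
        return "rollback"
    return "promote"

print(canary_decision(0.01, 0.011))  # promote: within noise of baseline
print(canary_decision(0.01, 0.03))   # rollback: 3x the baseline error rate
```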
Shadow deployment (traffic mirroring)
Copy production traffic to the new version without serving its responses to users. The old version handles all real responses. The new version processes the same requests in parallel, and you compare its behavior against the old version.
```
User request --> Load Balancer --> v1 (serves response)
                      |
                      +--> v2 (processes request, response discarded)
```
This is the safest strategy for testing changes that are hard to validate in staging. Machine learning model updates, search ranking changes, and database query optimizations all benefit from shadow testing against real traffic patterns.
When to use it: Validating performance-sensitive changes. Testing new ML models. Verifying that a rewrite produces the same results as the original.
When to avoid it: When the new version has side effects (sending emails, charging credit cards, writing to external systems). Mirrored traffic must be read-only or routed to sandboxed dependencies.
Rollback: Not applicable. The new version never served real traffic.
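The mirror-and-compare logic can be sketched in a few lines of Python (in a real system the shadow call runs asynchronously off the request path; this sketch runs it inline for clarity):

```python
def handle_request(request, v1_handler, v2_handler, mismatches: list):
    """Serve v1's response; run v2 on the same input and log any divergence."""
    response = v1_handler(request)       # the only response users ever see
    try:
        shadow = v2_handler(request)     # computed, compared, then discarded
        if shadow != response:
            mismatches.append((request, response, shadow))
    except Exception as exc:
        # A crash in the shadow copy must never affect the real response.
        mismatches.append((request, response, repr(exc)))
    return response

mismatches = []
handle_request({"q": "abc"}, lambda r: r["q"].upper(), lambda r: r["q"].upper(), mismatches)
print(len(mismatches))  # 0: versions agree
handle_request({"q": "abc"}, lambda r: r["q"].upper(), lambda r: r["q"], mismatches)
print(len(mismatches))  # 1: v2 diverged, but the user still got v1's answer
```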
Choosing a strategy
| Strategy | Downtime | Rollback speed | Infra cost | Risk | Complexity |
|---|---|---|---|---|---|
| Recreate | Yes | Minutes | 1x | High | Low |
| Rolling | No | Seconds-minutes | 1x + surge | Medium | Low |
| Blue-green | No | Seconds | 2x | Low | Medium |
| Canary | No | Seconds | 1x + small | Low | High |
| Shadow | No | N/A | 2x | None | High |
Most teams land on rolling updates for routine deploys and canary for high-risk changes. Blue-green works well when infrastructure cost is not a constraint and you value instant rollback. Shadow deployments are specialized tools for specific validation scenarios.
Rollback complexity
Rollback is the most important feature of any deployment strategy. It is also the most undertested. Teams practice deploying forward constantly but rarely practice rolling back.
Three factors determine rollback complexity:
Database state. If the new version ran a migration, rolling back the code does not roll back the schema. Use reversible migrations and test the rollback path.
Stateful services. If the new version wrote data in a new format, the old version must handle that data. Version your data formats.
External dependencies. If the new version registered a webhook with a third-party API, rolling back the code does not unregister the webhook. Track external side effects.
A rollback plan is not “revert the deployment.” It is a checklist that covers code, data, infrastructure, and external integrations.
Progressive delivery in practice
Modern deployment tooling combines these strategies into progressive delivery pipelines:
- Deploy to a canary (5% traffic)
- Run automated analysis for 10 minutes
- If metrics are healthy, promote to 25%
- Run analysis for 10 more minutes
- Promote to 50%, then 100%
- At any step, automated rollback if error rate or latency exceeds thresholds
Tools like Argo Rollouts, Flagger, and AWS CodeDeploy automate this progression. The deploy engineer’s job shifts from “click the deploy button and watch dashboards” to “define the rollout policy and let automation handle the rest.”
```yaml
# Argo Rollouts canary strategy
spec:
  strategy:
    canary:
      steps:
      - setWeight: 5
      - pause: { duration: 10m }
      - setWeight: 25
      - pause: { duration: 10m }
      - setWeight: 50
      - pause: { duration: 10m }
      - setWeight: 100
      canaryMetadata:
        labels:
          role: canary
```
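Stripped of tooling specifics, a stepped rollout policy is a loop that automation runs on your behalf. A toy Python sketch, where the `analyze` callback stands in for the metric checks a real controller performs during each pause:

```python
def progressive_rollout(weights, analyze) -> int:
    """Walk through canary traffic weights; bail out to 0% on the first bad analysis."""
    for weight in weights:
        # In a real pipeline this step reconfigures the load balancer,
        # then watches error rate and latency for the pause window.
        if not analyze(weight):
            return 0   # automated rollback: stable version takes all traffic
    return weights[-1]

print(progressive_rollout([5, 25, 50, 100], lambda w: True))    # 100
print(progressive_rollout([5, 25, 50, 100], lambda w: w < 50))  # 0
```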
What comes next
Deployments go wrong. Services degrade. Pages fire at 3 AM. The question is not whether incidents happen but how your team responds when they do. On-call culture, incident severity levels, runbooks, and blameless postmortems define whether an incident is a learning opportunity or a blame game.