Design a code deployment system
In this series (18 parts)
- Design a URL shortener
- Design a key-value store
- Design a rate limiter
- Design a web crawler
- Design a notification system
- Design a news feed
- Design a chat application
- Design a video streaming platform
- Design a music streaming service
- Design a ride-sharing service
- Design a food delivery platform
- Design a hotel booking platform
- Design a search engine
- Design a distributed message queue
- Design a code deployment system
- Design a payments platform
- Design an ad click aggregation system
- Design a distributed cache
Every production outage has a deployment somewhere in its ancestry. A code deployment system takes source code, turns it into runnable artifacts, and rolls those artifacts across a fleet of servers without taking down the service. The challenge is doing this safely at scale while keeping developer velocity high.
This design draws on message queues for async orchestration and reliability patterns for safe rollouts. We will also touch on observability for deployment health checks and storage systems for artifact persistence.
1. Requirements
Functional
- Developers push code to a repository; the system triggers a build automatically.
- Builds run in isolated environments with reproducible outputs.
- Artifacts are versioned and stored durably.
- Deployments support blue-green, canary, and rolling strategies.
- Rollback to any previous artifact completes within 2 minutes.
- Secrets are injected at deploy time, never baked into artifacts.
- A web dashboard shows pipeline status in real time.
Non-functional
| Metric | Target |
|---|---|
| DAU (developers) | 10,000 |
| Peak builds per minute | 500 |
| Avg build duration | 5 minutes |
| Artifact size (avg) | 200 MB |
| Target fleet size | 50,000 servers |
| Deploy latency (canary start) | < 30 seconds |
| Rollback latency | < 2 minutes |
| System availability | 99.95% |
2. Capacity Estimation
Build compute. At peak, 500 builds start per minute and each runs for 5 minutes, so by Little's law roughly 500 × 5 = 2,500 builds are in flight at any moment. With 2 vCPU and 4 GB RAM per container, that is 5,000 vCPUs and 10 TB of RAM reserved for builds.
Artifact storage. 10,000 developers producing an average of 3 builds per day: 30,000 artifacts daily. At 200 MB each, that is 6 TB per day. Retaining 30 days of artifacts requires 180 TB. Object storage like S3 handles this well with lifecycle policies for expiration.
Bandwidth. Deploying a 200 MB artifact to 50,000 servers in a rolling fashion (10% at a time) means 5,000 simultaneous downloads. That is 1 TB per wave. Using a P2P distribution layer (like BitTorrent) reduces origin bandwidth from 1 TB to roughly 50 GB per wave.
Queue throughput. Each build generates roughly 20 state transition events (queued, building, testing, packaging, uploading, etc.). At 500 builds per minute, that is 10,000 messages per minute or about 170 QPS on the message bus.
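The estimates above reduce to a few lines of arithmetic. A sketch, using the figures from the requirements table (the constants are the table's values, not new assumptions):

```python
# Back-of-the-envelope capacity math for the deployment system.
BUILDS_PER_MIN = 500           # peak build arrival rate
BUILD_MINUTES = 5              # average build duration
VCPU_PER_BUILD, GB_PER_BUILD = 2, 4

# Little's law: concurrency = arrival rate x duration.
concurrent_builds = BUILDS_PER_MIN * BUILD_MINUTES       # 2,500 containers
build_vcpus = concurrent_builds * VCPU_PER_BUILD         # 5,000 vCPUs
build_ram_tb = concurrent_builds * GB_PER_BUILD / 1000   # 10 TB

# Artifact storage: 10k devs x 3 builds/day x 200 MB, 30-day retention.
daily_artifacts = 10_000 * 3
daily_tb = daily_artifacts * 200 / 1_000_000             # 6 TB/day
retained_tb = daily_tb * 30                              # 180 TB

# Queue throughput: ~20 state-transition events per build.
events_per_min = BUILDS_PER_MIN * 20                     # 10,000 msgs/min
qps = events_per_min / 60                                # ~167 QPS
```
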
3. High-Level Architecture
```mermaid
graph TB
    Dev[Developer Push] --> VCS[Version Control Service]
    VCS --> WH[Webhook Handler]
    WH --> MQ[Message Queue]
    MQ --> BS[Build Scheduler]
    BS --> BP1[Build Pod 1]
    BS --> BP2[Build Pod 2]
    BS --> BPN[Build Pod N]
    BP1 --> AS[Artifact Store]
    BP2 --> AS
    BPN --> AS
    AS --> DC[Deploy Controller]
    DC --> LB[Load Balancer]
    LB --> FG[Fleet Green]
    LB --> FB[Fleet Blue]
    DC --> SM[Secret Manager]
    DC --> OBS[Observability]
    OBS --> DC
```
High-level architecture of the deployment system. The message queue decouples webhook ingestion from build scheduling, and the deploy controller drives rollout strategies.
The webhook handler is stateless and horizontally scalable. It validates payloads, deduplicates pushes to the same commit, and enqueues build requests. The build scheduler pulls from the queue and assigns work to available build pods.
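The validate-dedupe-enqueue flow can be sketched in a few lines. This is a minimal illustration, not the real service: the payload shape is assumed, and a plain dict stands in for the shared dedup store (in practice something like Redis with a TTL, so the stateless replicas share state):

```python
import time

class WebhookHandler:
    """Stateless handler: validates a push payload, dedupes repeated
    pushes to the same commit, and enqueues a build request."""

    def __init__(self, queue, dedup_store, dedup_ttl_s=300):
        self.queue = queue
        self.dedup = dedup_store      # maps "repo:sha" -> enqueue timestamp
        self.ttl = dedup_ttl_s

    def handle(self, payload: dict) -> bool:
        repo, sha = payload.get("repo"), payload.get("commit_sha")
        if not repo or not sha:
            raise ValueError("invalid push payload")
        key = f"{repo}:{sha}"
        now = time.time()
        last = self.dedup.get(key)
        if last is not None and now - last < self.ttl:
            return False              # duplicate push to the same commit
        self.dedup[key] = now
        self.queue.append({"repo": repo, "commit_sha": sha, "queued_at": now})
        return True
```

Because all state lives in the queue and the dedup store, any replica can handle any webhook, which is what makes horizontal scaling trivial.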
4. Deep Dives
4.1 Build Isolation and Scheduling
Each build runs inside a short-lived container with a clean filesystem. This guarantees reproducibility: the same commit always produces the same artifact regardless of what ran on that machine before.
The build scheduler maintains a priority queue. Hotfix branches get elevated priority. Builds from the same repository are serialized to avoid resource contention on shared caches (dependency downloads).
Container lifecycle:
- Scheduler claims a build request from the queue.
- A fresh container spins up with the repo checked out at the target commit.
- Build script executes (compile, test, package).
- On success, the artifact uploads to the artifact store with metadata (commit SHA, branch, timestamp, checksums).
- Container is destroyed. No state persists.
If a build pod crashes mid-build, the scheduler detects the missing heartbeat within 30 seconds, marks the build as failed, and re-enqueues it. A build gets at most 3 retry attempts before requiring manual intervention.
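The reaping logic above can be sketched as a periodic sweep over active builds. Function and field names here are illustrative assumptions; the constants come from the text:

```python
import time

HEARTBEAT_TIMEOUT_S = 30   # missing-heartbeat detection window
MAX_ATTEMPTS = 3           # retries before manual intervention

def reap_stalled_builds(active_builds, queue, now=None):
    """Re-enqueue builds whose pod heartbeat has gone silent.
    `active_builds` maps build_id -> {"last_heartbeat": ts, "attempts": n}.
    Returns the builds that exhausted their retries and need a human."""
    now = time.time() if now is None else now
    needs_attention = []
    for build_id, state in list(active_builds.items()):
        if now - state["last_heartbeat"] < HEARTBEAT_TIMEOUT_S:
            continue                              # pod is still alive
        del active_builds[build_id]               # mark the attempt failed
        if state["attempts"] >= MAX_ATTEMPTS:
            needs_attention.append(build_id)      # give up: page an operator
        else:
            queue.append({"build_id": build_id,
                          "attempts": state["attempts"] + 1})
    return needs_attention
```
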
Secret handling during builds. Build-time secrets (API keys for private package registries) are mounted as read-only volumes from a secret manager. They never appear in build logs. Log scrubbing catches accidental leaks by pattern-matching known secret formats.
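Pattern-based log scrubbing is straightforward to sketch. The patterns below are illustrative examples of well-known secret formats; a real scrubber would maintain a much larger registry:

```python
import re

# Illustrative patterns for common secret formats.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                 # GitHub personal token
    re.compile(r"(?i)(password|token|secret)\s*[:=]\s*\S+"),
]

def scrub(line: str) -> str:
    """Replace anything that looks like a secret with a redaction marker
    before the line is written to the build log."""
    for pattern in SECRET_PATTERNS:
        line = pattern.sub("[REDACTED]", line)
    return line
```
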
Resource limits. Each build container gets hard CPU and memory limits. A runaway build cannot starve other builds. If a build exceeds 30 minutes, the scheduler kills it and marks it as timed out. This prevents zombie builds from consuming cluster resources indefinitely.
Caching. Dependency downloads are the slowest part of most builds. We maintain a shared cache per repository (keyed on lock file hash) stored in the artifact store. Cache hit rates above 80% reduce average build time from 5 minutes to under 2 minutes.
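Keying the cache on the lock file hash means any dependency change invalidates the cache automatically, while identical lock files share one entry. A minimal sketch (the key layout is an assumption):

```python
import hashlib

def dependency_cache_key(repo: str, lockfile_bytes: bytes) -> str:
    """Derive the shared dependency-cache key for a repository from its
    lock file contents. Same lock file -> same key -> cache hit; any
    change to pinned dependencies produces a fresh key."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"dep-cache/{repo}/{digest[:16]}"
```
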
4.2 Pipeline State Machine
Every deployment follows a strict state machine. Transitions are event-driven through the message queue.
```mermaid
stateDiagram-v2
    [*] --> Queued
    Queued --> Building
    Building --> Testing
    Building --> BuildFailed
    Testing --> Packaging
    Testing --> TestFailed
    Packaging --> Uploading
    Uploading --> ReadyToDeploy
    ReadyToDeploy --> Deploying
    Deploying --> CanaryCheck
    CanaryCheck --> RollingOut
    CanaryCheck --> RollingBack
    RollingOut --> Deployed
    RollingBack --> RolledBack
    BuildFailed --> [*]
    TestFailed --> [*]
    Deployed --> [*]
    RolledBack --> [*]
```
Pipeline state machine. Each transition emits an event to the message queue, enabling real-time dashboard updates and audit logging.
Every state transition is persisted to a PostgreSQL table with timestamps. This gives us a complete audit trail. The dashboard subscribes to state change events via WebSocket for live updates.
A deployment can only move forward if health checks pass. The CanaryCheck state runs for a configurable window (default: 5 minutes) and evaluates error rates, latency percentiles, and custom metrics from the observability stack.
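Enforcing the state machine in code keeps an out-of-order event from corrupting a pipeline's history. A sketch of the transition table and guard, mirroring the diagram above (function names are illustrative):

```python
# Legal transitions, one entry per arrow in the state diagram.
TRANSITIONS = {
    "Queued": {"Building"},
    "Building": {"Testing", "BuildFailed"},
    "Testing": {"Packaging", "TestFailed"},
    "Packaging": {"Uploading"},
    "Uploading": {"ReadyToDeploy"},
    "ReadyToDeploy": {"Deploying"},
    "Deploying": {"CanaryCheck"},
    "CanaryCheck": {"RollingOut", "RollingBack"},
    "RollingOut": {"Deployed"},
    "RollingBack": {"RolledBack"},
}
TERMINAL = {"BuildFailed", "TestFailed", "Deployed", "RolledBack"}

def advance(current: str, target: str) -> str:
    """Validate a transition before persisting it; reject anything the
    state machine does not allow (e.g. Deploying straight to Deployed,
    which would skip the canary check)."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```
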
4.3 Deployment Strategies and Traffic Routing
The deploy controller supports three strategies. The choice depends on the service’s risk tolerance and fleet size.
Blue-green deployment. Two identical environments exist. The load balancer points all traffic to “blue.” We deploy to “green,” run health checks, then atomically switch the load balancer. Rollback is instant: switch back to blue. Downside: double the infrastructure cost.
Canary deployment. Route a small percentage of traffic (1% to 5%) to the new version. Monitor error rates and latency. If metrics stay healthy, gradually increase traffic. This is the default strategy for large fleets.
```mermaid
graph LR
    U[Users] --> LB[Load Balancer]
    LB -->|95% traffic| SV1[Stable v2.3]
    LB -->|5% traffic| CV1[Canary v2.4]
    CV1 --> MC[Metrics Collector]
    SV1 --> MC
    MC --> AE[Analysis Engine]
    AE -->|healthy| LB
    AE -->|unhealthy| RB[Rollback Trigger]
    RB --> LB
```
Canary deployment routing. The analysis engine compares canary metrics against the stable baseline and triggers automatic rollback if degradation exceeds thresholds.
Rolling deployment. Update servers in batches (e.g., 10% at a time). Each batch must pass health checks before the next batch starts. Slower than blue-green but requires no extra infrastructure.
The traffic split is implemented at the load balancer using weighted routing rules. For canary, we use consistent hashing on user ID so the same user always hits the same version during the rollout. This avoids confusing behavior from version inconsistency mid-session.
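The sticky assignment can be sketched as hashing the user ID into a fixed bucket space. Including a per-rollout identifier in the hash (an assumption here, named `rollout_epoch`) ensures different rollouts pick different canary populations:

```python
import hashlib

def route(user_id: str, canary_pct: int, rollout_epoch: str = "v2.4") -> str:
    """Deterministically assign a user to canary or stable. The hash maps
    (rollout_epoch, user_id) to a stable bucket in [0, 100): the same user
    always sees the same version during a rollout, and raising canary_pct
    only ever moves users from stable to canary, never the reverse."""
    digest = hashlib.sha256(f"{rollout_epoch}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_pct else "stable"
```
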
4.4 Rollback Mechanism
Rollback is not a special operation. It is just another deployment pointing at a previous artifact version. The artifact store keeps all versions immutable. Rolling back means:
- Operator (or automation) selects a known-good artifact version.
- Deploy controller initiates a new deployment with that artifact.
- The chosen strategy (typically blue-green for speed) executes.
Automated rollback triggers when canary metrics breach thresholds. The analysis engine compares the canary’s p99 latency and error rate against the stable fleet. If the canary’s error rate exceeds 2x the baseline or p99 latency exceeds 1.5x, rollback fires automatically.
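The trigger condition is a simple comparison against the baseline. A sketch using the thresholds from the text (the metrics-dict shape is an assumption):

```python
ERROR_RATE_MULTIPLIER = 2.0  # canary error rate may be at most 2x baseline
P99_MULTIPLIER = 1.5         # canary p99 latency may be at most 1.5x baseline

def should_rollback(canary: dict, baseline: dict) -> bool:
    """Compare canary metrics against the stable fleet's baseline and
    decide whether the automated rollback should fire. Each dict holds
    'error_rate' (a fraction) and 'p99_ms' (milliseconds)."""
    if canary["error_rate"] > baseline["error_rate"] * ERROR_RATE_MULTIPLIER:
        return True
    if canary["p99_ms"] > baseline["p99_ms"] * P99_MULTIPLIER:
        return True
    return False
```
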
4.5 Artifact Distribution at Scale
Pushing a 200 MB artifact to 50,000 servers from a central store would crush the origin. We use a tiered distribution model:
- Origin (artifact store): S3-compatible object storage. Serves as the source of truth.
- Regional caches: Each data center has a pull-through cache. The first server in a region pulls from origin; subsequent servers pull from the regional cache.
- P2P layer: Within a data center, servers that already have the artifact seed it to peers. This reduces cache load from O(N) to O(log N).
Artifact integrity is verified at every hop using SHA-256 checksums embedded in the deployment manifest.
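The per-hop verification is a one-function check. A sketch, assuming the manifest carries a hex-encoded SHA-256 digest:

```python
import hashlib

def verify_artifact(artifact_bytes: bytes, manifest_sha256: str) -> None:
    """Check a downloaded artifact against the checksum recorded in the
    deployment manifest. Run at every hop (regional cache, peer, final
    server) so a corrupt or tampered copy can never propagate further."""
    actual = hashlib.sha256(artifact_bytes).hexdigest()
    if actual != manifest_sha256:
        raise ValueError(
            f"checksum mismatch: expected {manifest_sha256}, got {actual}")
```
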
5. Trade-offs and Alternatives
| Decision | Option A | Option B | Our Choice |
|---|---|---|---|
| Build isolation | VMs (stronger isolation) | Containers (faster startup) | Containers. Roughly 2-second startup vs. 30 seconds for VMs; the weaker security boundary is acceptable for CI workloads. |
| Artifact store | Self-hosted (MinIO) | Managed (S3) | S3. Operational burden of managing 180 TB of storage is not worth the cost savings. |
| Queue technology | Kafka (high throughput) | SQS (managed, simpler) | Kafka. We need ordered event streams for the state machine and replay capability for debugging. |
| Distribution | Central pull | P2P hybrid | P2P hybrid. Central pull does not scale past 5,000 servers without massive origin bandwidth. |
Monorepo vs polyrepo builds. Monorepo builds require change detection to avoid rebuilding everything on every commit. This adds complexity (dependency graph analysis) but saves compute. Polyrepo is simpler but leads to more repositories and more pipeline configurations to maintain.
6. What Real Systems Actually Do
GitHub Actions uses ephemeral VMs (not containers) for stronger isolation between tenants. Each job gets a fresh VM that is destroyed after execution. This is slower but necessary for a multi-tenant SaaS platform.
Google’s Borg uses a binary distribution system called MPM (Midas Package Manager) that distributes packages via a content-addressable store with aggressive caching at every level of the network hierarchy.
Spinnaker (Netflix) popularized the canary analysis approach with its Automated Canary Analysis (ACA) system. It runs statistical tests comparing canary and baseline metrics before promoting a deployment.
Argo Rollouts implements progressive delivery as a Kubernetes-native controller, using custom resources to define canary and blue-green strategies declaratively.
Most large companies end up building custom deployment systems because off-the-shelf tools do not handle their specific fleet topology, security requirements, or legacy infrastructure constraints.
7. What Comes Next
This design assumes a relatively homogeneous fleet. Real-world extensions include:
- Multi-region coordination. Deploy to one region first, validate, then fan out. This requires a global deployment orchestrator that respects region-level health signals.
- Feature flags integration. Decouple code deployment from feature activation. Ship dark code and enable features independently of deploys.
- Database migrations. Schema changes require careful coordination with application deploys. A deployment system needs hooks for pre-deploy and post-deploy migration steps.
- Cost optimization. Auto-scaling the build pool based on queue depth rather than maintaining a fixed fleet of build pods.
The deployment system is one piece of a larger developer platform. It connects to the observability stack for health signals, the secret manager for credential injection, and the service mesh for traffic routing. Each of these deserves its own deep dive.
Understanding message queues helps you design the event-driven orchestration layer. Reliability patterns inform how you build rollback, retry, and circuit-breaking into every stage of the pipeline. Start there if you have not already.
A well-designed deployment system is the single highest-leverage investment a platform team can make. Every improvement to deploy speed and safety compounds across every team in the organization.