
Design a video streaming platform

In this series (18 parts)
  1. Design a URL shortener
  2. Design a key-value store
  3. Design a rate limiter
  4. Design a web crawler
  5. Design a notification system
  6. Design a news feed
  7. Design a chat application
  8. Design a video streaming platform
  9. Design a music streaming service
  10. Design a ride-sharing service
  11. Design a food delivery platform
  12. Design a hotel booking platform
  13. Design a search engine
  14. Design a distributed message queue
  15. Design a code deployment system
  16. Design a payments platform
  17. Design an ad click aggregation system
  18. Design a distributed cache

Video is the dominant form of internet traffic. Over 80% of all bytes crossing the wire today are video. Designing a platform that ingests 500 hours of content every minute, stores exabytes of data, and streams to 2 billion daily active users requires careful thinking about every layer of the stack.

This case study walks through the architecture of a YouTube-scale video streaming platform. We will cover the upload pipeline, adaptive bitrate transcoding, CDN strategy, recommendation at scale, and view count accuracy.

1. Requirements

Functional requirements

  • Upload: creators upload videos up to 12 hours long, with metadata (title, description, tags).
  • Transcode: convert raw uploads into multiple resolutions and codecs for adaptive streaming.
  • Stream: viewers watch videos with adaptive bitrate switching based on network conditions.
  • Search and discover: full-text search plus personalized recommendations.
  • Engagement: likes, comments, subscriptions, view counts.

Non-functional requirements

  • DAU: 2B
  • Concurrent streams: 100M peak
  • Uploads: 500 hours of video per minute
  • Storage: 1 EB total, growing by petabytes per day
  • Stream start latency: < 2 seconds (p99)
  • Availability: 99.99%
  • View count accuracy: eventual consistency within 30 seconds

2. Capacity estimation

Upload bandwidth. 500 hours/min of raw video at an average bitrate of 20 Mbps gives us roughly 600 Gbps of ingest bandwidth. After transcoding into 6 renditions, storage write amplification is about 3x.

Storage. Each minute of 1080p video is roughly 150 MB after encoding. 500 hours/min translates to about 4.5 TB of new 1080p content per minute; with the ~3x write amplification across all renditions, that is roughly 13.5 TB per minute, or close to 20 PB/day landing in storage systems.

Streaming bandwidth. If 100M concurrent viewers each consume 5 Mbps on average, we need 500 Tbps of egress bandwidth globally. CDN caching and edge presence are non-negotiable at this scale.

Metadata QPS. 2B DAU generating 10 API calls each (search, recommendations, video page loads) yields roughly 230K QPS sustained, with peaks at 500K+ QPS.
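The estimates above are straightforward arithmetic; a minimal sketch that reproduces them (using the article's assumed inputs, not measured values):

```python
# Back-of-envelope capacity estimates. All inputs are the article's
# stated assumptions, not measurements.
UPLOAD_HOURS_PER_MIN = 500
RAW_BITRATE_MBPS = 20              # average raw upload bitrate
CONCURRENT_VIEWERS = 100_000_000
AVG_STREAM_MBPS = 5
DAU = 2_000_000_000
CALLS_PER_USER_PER_DAY = 10

def ingest_gbps() -> float:
    # 500 hours of video arrive per wall-clock minute, i.e. 30,000
    # video-seconds per real second, each at 20 Mbps.
    video_seconds_per_second = UPLOAD_HOURS_PER_MIN * 3600 / 60
    return video_seconds_per_second * RAW_BITRATE_MBPS / 1000   # Gbps

def egress_tbps() -> float:
    return CONCURRENT_VIEWERS * AVG_STREAM_MBPS / 1_000_000     # Tbps

def metadata_qps() -> float:
    return DAU * CALLS_PER_USER_PER_DAY / 86_400

print(f"ingest: {ingest_gbps():.0f} Gbps")   # ~600 Gbps
print(f"egress: {egress_tbps():.0f} Tbps")   # ~500 Tbps
print(f"qps:    {metadata_qps():.0f}")       # ~231,000 sustained
```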

3. High-level architecture

graph TD
  Client[Client Apps] --> LB[Load Balancer]
  LB --> API[API Gateway]
  API --> US[Upload Service]
  API --> SS[Stream Service]
  API --> RS[Recommendation Service]
  API --> MS[Metadata Service]
  US --> OBS[Object Storage - Raw]
  US --> MQ[Message Queue]
  MQ --> TS[Transcode Workers]
  TS --> CDNO[Object Storage - Encoded]
  CDNO --> CDN[CDN Edge Nodes]
  SS --> CDN
  MS --> MDB[(Metadata DB)]
  MS --> Cache[Cache Cluster]
  RS --> ML[ML Feature Store]
  RS --> Cache
  CDN --> Client

High-level architecture of a video streaming platform. Upload and streaming paths are separated to isolate failure domains.

The system splits into three major paths: the upload pipeline (write-heavy, latency-tolerant), the streaming path (read-heavy, latency-critical), and the metadata/recommendation layer (high QPS, cacheable). Each can scale independently.

4. Deep dive: upload and transcode pipeline

When a creator uploads a video, the file goes directly to object storage. We do not route raw video bytes through our application servers. Instead, the client gets a pre-signed URL and uploads directly.

Upload flow

  1. Client requests an upload URL from the API gateway.
  2. API gateway validates auth, creates a video record in the metadata DB with status UPLOADING, and returns a pre-signed URL to object storage.
  3. Client uploads the raw file using multipart upload (chunks of 5-10 MB for resumability).
  4. Object storage sends a completion callback. The upload service publishes a VIDEO_UPLOADED event to message queues.
  5. Transcode workers pick up the event and begin processing.

sequenceDiagram
  participant C as Creator Client
  participant API as API Gateway
  participant DB as Metadata DB
  participant S3 as Object Storage
  participant MQ as Message Queue
  participant TW as Transcode Workers
  participant CDN as CDN Origin

  C->>API: Request upload URL
  API->>DB: Create video record (UPLOADING)
  API-->>C: Pre-signed upload URL
  C->>S3: Multipart upload (raw video)
  S3-->>API: Upload complete callback
  API->>MQ: Publish VIDEO_UPLOADED
  MQ->>TW: Consume event
  TW->>S3: Fetch raw video
  TW->>TW: Transcode to multiple renditions
  TW->>CDN: Push encoded segments
  TW->>DB: Update status (READY)
  TW->>MQ: Publish VIDEO_READY

Sequence diagram for the upload and transcode pipeline. The client never touches our application servers with raw bytes.
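Steps 1-2 of the flow can be sketched as follows. The signing scheme and storage layout are hypothetical stand-ins for whatever your object store's SDK provides (S3 and GCS both offer pre-signed URL APIs):

```python
# Sketch of the "request upload URL" step: validate the creator,
# create an UPLOADING record, hand back a pre-signed URL.
import hashlib
import hmac
import time
import uuid

SIGNING_KEY = b"demo-secret"      # assumption: a per-service signing key
UPLOAD_TTL_SECONDS = 3600

def sign_upload_url(bucket: str, key: str, expires_at: int) -> str:
    # Hypothetical HMAC-style signature, standing in for real SDK signing.
    payload = f"{bucket}/{key}:{expires_at}".encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return f"https://storage.example.com/{bucket}/{key}?expires={expires_at}&sig={sig}"

def start_upload(db: dict, creator_id: str) -> dict:
    video_id = uuid.uuid4().hex
    db[video_id] = {"creator": creator_id, "status": "UPLOADING"}
    expires_at = int(time.time()) + UPLOAD_TTL_SECONDS
    return {
        "video_id": video_id,
        "upload_url": sign_upload_url("raw-videos", video_id, expires_at),
    }

db: dict = {}
resp = start_upload(db, "creator-42")
print(db[resp["video_id"]]["status"])  # UPLOADING
```

The application server only ever handles a small JSON exchange; the multi-gigabyte upload goes straight to object storage.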

Transcoding at scale

Each video gets transcoded into a bitrate ladder, typically 6 renditions from 240p to 4K. A single 10-minute 4K video takes roughly 30 minutes of CPU time to transcode across all renditions, about 3x real-time. At 500 hours uploaded per minute, that works out to roughly 90,000 CPU cores of continuous encoding capacity, or approximately 9,000 multi-core workers running around the clock.

We use a priority queue for transcoding jobs. Popular creators or videos that are gaining traction quickly get bumped to the front. Each rendition is split into 4-second segments for HLS/DASH packaging, enabling adaptive bitrate switching on the client.

Each resolution tier has a target bitrate, and the client switches between tiers based on measured throughput.
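A bitrate ladder is ultimately just configuration. The values below are common industry defaults shown for illustration (assumptions, not this platform's exact numbers; real ladders are tuned per title):

```python
# A typical 6-rendition bitrate ladder (illustrative defaults).
BITRATE_LADDER = {
    "240p":  {"width": 426,  "height": 240,  "video_kbps": 400},
    "360p":  {"width": 640,  "height": 360,  "video_kbps": 800},
    "480p":  {"width": 854,  "height": 480,  "video_kbps": 1400},
    "720p":  {"width": 1280, "height": 720,  "video_kbps": 2800},
    "1080p": {"width": 1920, "height": 1080, "video_kbps": 5000},
    "2160p": {"width": 3840, "height": 2160, "video_kbps": 16000},
}

def rendition_for_throughput(measured_kbps: float, headroom: float = 0.8) -> str:
    """Pick the highest rendition whose bitrate fits the measured
    throughput with a safety headroom factor."""
    budget = measured_kbps * headroom
    best = "240p"                      # always have a floor
    for name, cfg in BITRATE_LADDER.items():
        if cfg["video_kbps"] <= budget:
            best = name
    return best

print(rendition_for_throughput(7000))  # 1080p
```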

The total storage per video varies wildly. A 10-minute video across all 6 renditions runs to roughly 4-5 GB (about 1.5 GB for the 1080p rendition alone). Multiply by millions of uploads per day and the numbers get serious fast.

Fault tolerance in transcoding

Transcode jobs fail. Hardware faults, OOM kills, and codec bugs are everyday occurrences at this scale. We handle this with idempotent jobs and at-least-once delivery from the message queue. Each segment gets a content-addressable hash. If a worker crashes mid-job, another worker picks it up and only re-encodes segments that are missing or corrupted.

We also partition transcode work by rendition. A single video spawns 6 independent jobs (one per rendition). This means a failure in the 4K encode does not block the 720p version from going live. The metadata DB tracks per-rendition status so the video becomes progressively available as renditions complete.

Segment packaging

Each rendition is split into segments (typically 2-6 seconds). The manifest file (.m3u8 for HLS, .mpd for DASH) lists all available renditions and segment URLs. The client player reads this manifest and picks the appropriate quality level for each segment.

Segment naming follows a deterministic scheme: {video_id}/{rendition}/{segment_number}.ts. This makes cache keys predictable across the CDN and simplifies purge operations when a creator replaces a video.
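A master manifest tying the naming scheme to the rendition ladder might look like this. The HLS tags (`#EXTM3U`, `#EXT-X-STREAM-INF`) are standard, but the renditions and bandwidth numbers here are illustrative:

```python
# Sketch of a master-manifest generator for the deterministic
# {video_id}/{rendition}/ layout described above.
RENDITIONS = [
    ("480p", 854, 480, 1_400_000),
    ("720p", 1280, 720, 2_800_000),
    ("1080p", 1920, 1080, 5_000_000),
]

def master_manifest(video_id: str) -> str:
    lines = ["#EXTM3U"]
    for name, width, height, bandwidth in RENDITIONS:
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={width}x{height}"
        )
        # Each rendition's media playlist lives under its own prefix,
        # so CDN cache keys and purges stay predictable.
        lines.append(f"{video_id}/{name}/index.m3u8")
    return "\n".join(lines)

print(master_manifest("abc123").splitlines()[0])  # #EXTM3U
```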

5. Deep dive: CDN streaming architecture

Video streaming is a CDN problem. Without edge caching, our origin servers would need 500 Tbps of bandwidth. With a well-designed CDN layer, origin traffic drops by 95% or more.

CDN topology

We use a multi-tier CDN approach. Edge nodes sit in 200+ cities globally. Regional mid-tier caches aggregate requests from edges. The origin shield protects our object storage from thundering herd problems.

graph LR
  V[Viewer] --> E1[Edge PoP - Mumbai]
  V2[Viewer] --> E2[Edge PoP - Delhi]
  V3[Viewer] --> E3[Edge PoP - London]
  E1 --> R1[Regional Cache - Asia]
  E2 --> R1
  E3 --> R2[Regional Cache - Europe]
  R1 --> OS[Origin Shield]
  R2 --> OS
  OS --> OBJ[Object Storage]

Multi-tier CDN architecture. Most requests terminate at edge nodes, reducing origin load by 95%+.

Cache hit optimization

Video content follows a heavy-tailed distribution. The top 20% of videos account for 80%+ of all views. We pre-warm popular content to edge nodes during off-peak hours. Long-tail content gets pulled on demand and cached with a TTL based on recent view velocity.
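Velocity-based TTLs can be expressed as a simple policy function; the thresholds below are illustrative assumptions, not tuned values:

```python
# Edge cache TTL driven by recent view velocity: content with momentum
# stays cached longer, the cold long tail expires quickly.
def edge_ttl_seconds(views_last_hour: int) -> int:
    if views_last_hour >= 10_000:   # trending: pin for a day
        return 86_400
    if views_last_hour >= 100:      # warm: keep for a few hours
        return 4 * 3600
    return 600                      # cold long tail: short TTL
```

In practice this runs alongside the pre-warming job, which pushes the known-popular head of the catalog to edges before peak hours.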

For live content, we use a push model where the encoder pushes segments to the CDN origin, which fans out to edge nodes. Latency target for live is under 10 seconds glass-to-glass.

Adaptive bitrate switching

The client player continuously measures download throughput. If a 1080p segment takes too long to download, the next segment request drops to 720p. This happens transparently. A well-tuned ABR algorithm maintains a playback buffer of 15-30 seconds to absorb network fluctuations.

Buffer management is critical. Too aggressive quality switches create visual jarring. Too conservative switching leads to rebuffering. Modern players use throughput estimation combined with buffer occupancy to make switching decisions.

The initial segment request is key to perceived performance. We serve the first segment from a low resolution (360p or 480p) regardless of the user’s bandwidth estimate. This gets pixels on screen fast. The player then ramps up quality over the next 2-3 segments as it builds a reliable throughput estimate. This technique cuts time-to-first-frame from 3+ seconds to under 1 second on most connections.

For mobile clients on cellular networks, we apply a more conservative switching policy. Cellular throughput is bursty, so we use a larger averaging window (10 seconds vs 3 seconds on WiFi) to avoid constant quality oscillation.
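The switching behavior described in this section (throughput estimation, buffer-aware headroom, a wider averaging window on cellular) can be sketched in one small controller. The ladder values and thresholds are illustrative assumptions:

```python
# Hybrid ABR sketch: smoothed throughput estimate plus buffer occupancy.
from collections import deque

LADDER_KBPS = {"360p": 800, "480p": 1400, "720p": 2800, "1080p": 5000}

class AbrController:
    def __init__(self, cellular: bool = False):
        # Wider averaging window on cellular to smooth bursty throughput.
        window = 10 if cellular else 3
        self.samples = deque(maxlen=window)

    def record_sample(self, kbps: float) -> None:
        self.samples.append(kbps)

    def next_quality(self, buffer_seconds: float) -> str:
        if not self.samples:
            return "360p"           # cold start: get pixels on screen fast
        est = sum(self.samples) / len(self.samples)
        # Thin buffer: switch conservatively regardless of throughput.
        headroom = 0.5 if buffer_seconds < 10 else 0.8
        budget = est * headroom
        choice = "360p"
        for name, kbps in LADDER_KBPS.items():
            if kbps <= budget:
                choice = name
        return choice

abr = AbrController()
for sample in (6000, 6500, 7000):
    abr.record_sample(sample)
print(abr.next_quality(buffer_seconds=20))  # 1080p
```

Note how the same throughput estimate yields a lower rendition when the buffer is thin: the controller trades quality for rebuffer protection.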

6. Deep dive: view counts and recommendations

Accurate view counts

View counts seem simple but they are not. At 100M concurrent viewers, naive counting creates a massive write hotspot. A single viral video might get 1M views per second.

We use a multi-stage counting pipeline:

  1. Client-side deduplication: the player sends a view event only after 30 seconds of watch time. This filters bots and accidental clicks.
  2. Edge aggregation: view events batch at the edge for 5-second windows before forwarding to the aggregation layer.
  3. Stream processing: a Kafka-based pipeline consumes view events. We use count-min sketch data structures to approximate counts in real-time while a slower batch pipeline reconciles exact numbers.
  4. Caching layer: displayed counts come from a cache with 30-second TTL. Users see “approximately correct” numbers that converge to exact values.
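The count-min sketch in stage 3 is small enough to show in full. Real pipelines use hardened library implementations; this minimal version illustrates the key property that estimates are overcounts, never undercounts:

```python
# Minimal count-min sketch for approximate view counting.
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # Derive an independent hash per row from a salted digest.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key: str) -> int:
        # Minimum across rows: hash collisions can only inflate a cell,
        # so the smallest cell is the tightest upper bound on the truth.
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.depth))

cms = CountMinSketch()
for _ in range(1000):
    cms.add("video:viral")
print(cms.estimate("video:viral"))  # >= 1000, never under
```

The fixed memory footprint (width x depth counters) is what makes this viable at 1M views/second on a viral video, while the batch pipeline reconciles exact totals later.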

Recommendation system

The recommendation engine handles two surfaces: the home feed and “up next” suggestions. Both run on a two-stage architecture. This is the highest-leverage system in the entire platform since recommendations drive the majority of total watch time.

Candidate generation narrows the full video corpus (say 800M videos) down to a few thousand candidates using lightweight models. These models rely on user watch history, subscription graph, and video embeddings. This stage must complete in under 50ms to stay within the overall latency budget.

Ranking takes the candidate set and scores each video using a heavier model that considers watch-time prediction, engagement prediction, and diversity constraints. The top 20-50 results get returned.

At 230K QPS for recommendations, we precompute results for the top 10M most active users and cache them. The remaining long-tail users get real-time inference with a p99 latency target of 200ms.
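The two-stage funnel can be sketched as below. The scoring functions are toy stand-ins for the real retrieval and watch-time models, which the article only describes at a high level:

```python
# Two-stage recommendation sketch: cheap candidate generation narrows
# the corpus, then a heavier ranker scores only the shortlist.
import random

def candidate_generation(corpus, user_history, k=500):
    # Stage 1: cheap filter. A real system uses approximate nearest
    # neighbor retrieval over embeddings; here we just exclude watched
    # videos and sample.
    pool = [v for v in corpus if v not in user_history]
    return random.sample(pool, min(k, len(pool)))

def rank(candidates, score_fn, top_n=20):
    # Stage 2: the expensive per-candidate model runs only on ~500
    # items instead of the full 800M corpus.
    return sorted(candidates, key=score_fn, reverse=True)[:top_n]

corpus = [f"video-{i}" for i in range(10_000)]
history = {"video-1", "video-2"}
shortlist = candidate_generation(corpus, history)
feed = rank(shortlist, score_fn=lambda v: hash(v) % 1000)
print(len(feed))  # 20
```

The design point is the asymmetry: stage 1 must be fast enough to scan a huge corpus, while stage 2 can afford a rich model because its input is tiny.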

7. Trade-offs and alternatives

Codec choice. H.264 has universal support but H.265 cuts bitrate by 30-40% at equal quality. AV1 goes even further (50% reduction) but encoding is 10x slower. Most platforms encode in all three and serve based on client capability.

Segment length. Shorter segments (2s) give faster quality switching but increase manifest size and HTTP overhead. Longer segments (6s) are more efficient but slower to adapt. 4 seconds is a common compromise.

Push vs pull CDN. For VOD content, pull-based caching works well since content is static and popularity-driven. For live streams, push-based distribution is necessary to minimize latency. Supporting both adds operational complexity.

View count consistency. Strong consistency for view counts would require distributed transactions and severely limit throughput. Eventual consistency with a 30-second window is acceptable for display purposes while audited counts (for monetization) run through a separate batch pipeline that guarantees exactly-once semantics.

Transcoding strategy. Eager transcoding (encode all renditions immediately) wastes resources on videos nobody watches. Lazy transcoding (encode on first view) adds latency for the first viewer. A hybrid approach encodes 480p and 720p eagerly, then adds higher/lower renditions based on demand.

8. What real systems actually do

YouTube processes over 500 hours of uploads per minute and serves billions of hours daily. They use a custom CDN (Google Global Cache) deployed inside ISP networks, cutting transit costs dramatically. Their recommendation system drives 70%+ of all watch time.

Netflix pre-encodes their entire catalog into 100+ renditions per title using per-shot encoding optimization. They partner with ISPs to deploy Open Connect Appliances (essentially Netflix-specific CDN boxes) inside ISP data centers. Their approach works because the catalog is relatively small (tens of thousands of titles) compared to UGC platforms.

Twitch focuses on live streaming, where the challenge shifts from storage to ultra-low-latency ingest and distribution. They use RTMP-based ingest and fan out via their own CDN plus partnerships with Akamai and CloudFront.

9. What comes next

This design covers the core video platform, but production systems add many more layers:

  • Content moderation: ML pipelines that scan uploads for policy violations before publishing. At 500 hours/min, this requires massive parallel inference capacity.
  • DRM and encryption: protecting premium content with Widevine, FairPlay, and PlayReady adds complexity to both the encoding pipeline and the client player.
  • Analytics pipeline: tracking engagement metrics, A/B test results, and creator analytics requires a separate data infrastructure.
  • Multi-region replication: ensuring upload availability across regions while maintaining a single global namespace for video IDs.
  • Cost optimization: tiered storage (hot/warm/cold) for video segments based on view recency. Moving long-tail content to cheaper storage tiers saves millions annually at this scale.
  • Offline playback: supporting downloads for mobile means pre-packaging DRM-protected renditions at specific quality levels, adding another dimension to the encoding pipeline.

Building a video platform at this scale is a continuous optimization problem. The video streaming domain is one of the most infrastructure-intensive problems in system design. Every component operates at extreme scale, and the interplay between encoding, storage, CDN, and client-side logic makes it a rich area for engineering trade-offs.

Start by nailing the upload pipeline and ABR streaming. Those two subsystems determine your user experience ceiling. Everything else (recommendations, view counts, moderation) layers on top of a solid content delivery foundation.
