
Design a music streaming service

In this series (18 parts)
  1. Design a URL shortener
  2. Design a key-value store
  3. Design a rate limiter
  4. Design a web crawler
  5. Design a notification system
  6. Design a news feed
  7. Design a chat application
  8. Design a video streaming platform
  9. Design a music streaming service
  10. Design a ride-sharing service
  11. Design a food delivery platform
  12. Design a hotel booking platform
  13. Design a search engine
  14. Design a distributed message queue
  15. Design a code deployment system
  16. Design a payments platform
  17. Design an ad click aggregation system
  18. Design a distributed cache

Music streaming looks simple from the outside. A user taps play, audio starts within 200 ms, and playback continues without interruption for hours. Under the hood, that experience requires a global audio delivery network, a metadata system serving billions of reads per day, a playlist service handling millions of concurrent writes, and a recommendation pipeline processing terabytes of listening data daily.

Requirements

Functional requirements

  1. Search and browse. Users search by track, artist, album, or genre. Results appear within 300 ms.
  2. Stream audio. Tracks play with under 200 ms start latency. Quality adapts to network conditions (96, 160, 320 kbps).
  3. Offline playback. Premium users download tracks for offline listening. The client manages local storage and license expiry.
  4. Playlist management. Users create, edit, reorder, and share playlists. Collaborative playlists support multiple editors.
  5. Real-time listening activity. Users see what their friends are currently playing. The “Now Playing” state updates within 2 seconds.
  6. Recommendations. The system generates personalized playlists (Discover Weekly, Daily Mix) and contextual suggestions.

Non-functional requirements

  • Scale. 600M monthly active users, 200M daily active, 80M peak concurrent streams.
  • Catalog. 100M tracks, growing at 100K tracks per day.
  • Availability. 99.99% for playback, 99.9% for social features and playlists.
  • Latency. Track start under 200 ms at p99 for users on broadband. Seek within a track under 100 ms.
  • Durability. Zero data loss for user libraries and playlists. Audio files replicated across three regions.

Capacity estimation

Storage. The average track is 3.5 minutes at three quality levels. At 320 kbps the high quality file is roughly 8.4 MB. Storing three bitrates averages about 15 MB per track. For 100M tracks that is 1.5 PB of audio. Metadata (track info, artist profiles, album art) adds another 200 TB.

Bandwidth. 80M concurrent streams at an average of 160 kbps each requires 12.8 Tbps of sustained throughput. A CDN absorbs nearly all of this since tracks are immutable and highly cacheable.

QPS. Play requests peak at 500K per second. Search hits 200K QPS. Playlist writes average 50K QPS with spikes to 150K during release events. The recommendation pipeline processes 20B listening events per day in batch, plus a real-time stream at 300K events per second.

Caching. The top 1% of tracks account for roughly 80% of plays. Caching the top 1M tracks at edge PoPs (15 TB per PoP) covers the vast majority of requests.
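The arithmetic in this section can be sanity-checked with a few lines; the constants are taken directly from the estimates above:

```python
SECONDS_PER_TRACK = 3.5 * 60
BITRATES_KBPS = [96, 160, 320]

# Bytes per track summed across all three quality tiers
bytes_per_track = sum(b * 1000 / 8 * SECONDS_PER_TRACK for b in BITRATES_KBPS)
print(f"{bytes_per_track / 1e6:.1f} MB per track")   # ~15.1 MB

# Audio storage for the full 100M-track catalog
catalog_bytes = 100e6 * bytes_per_track
print(f"{catalog_bytes / 1e15:.2f} PB of audio")     # ~1.51 PB

# Sustained egress at 80M concurrent streams, 160 kbps average
egress_bps = 80e6 * 160e3
print(f"{egress_bps / 1e12:.1f} Tbps")               # 12.8 Tbps
```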

High-level architecture

graph TD
  Client[Mobile / Desktop / Web Client]
  LB[Load Balancer]
  API[API Gateway]
  Auth[Auth Service]
  Search[Search Service]
  Catalog[Catalog Service]
  Playlist[Playlist Service]
  Stream[Streaming Service]
  Reco[Recommendation Service]
  Activity[Activity Service]
  CDN[CDN Edge Nodes]
  ObjStore[Object Storage - Audio Files]
  MetaDB[(Metadata DB - PostgreSQL)]
  PlaylistDB[(Playlist DB - DynamoDB)]
  SearchIdx[(Search Index - Elasticsearch)]
  Cache[Redis Cluster]
  MQ[Message Queue - Kafka]
  RecoStore[(Feature Store - Cassandra)]

  Client --> CDN
  Client --> LB --> API
  API --> Auth
  API --> Search --> SearchIdx
  API --> Catalog --> MetaDB
  API --> Playlist --> PlaylistDB
  API --> Stream
  API --> Reco --> RecoStore
  API --> Activity
  Stream --> CDN
  Stream --> ObjStore
  Catalog --> Cache
  Activity --> MQ
  MQ --> Reco

High-level architecture of the music streaming service. Clients fetch audio from CDN edge nodes and interact with backend services through an API gateway.

The system splits into six core services. The Streaming Service resolves track IDs to CDN URLs and handles adaptive bitrate negotiation. The Catalog Service owns track, artist, and album metadata. The Playlist Service manages user-created and system-generated playlists. The Search Service maintains an inverted index over the catalog. The Activity Service tracks listening events and powers the “Now Playing” feed. The Recommendation Service consumes activity data to generate personalized content.

Deep dive: audio delivery

Audio encoding and storage

Each uploaded track gets transcoded into three quality tiers: 96 kbps (Ogg Vorbis, for low bandwidth), 160 kbps (default), and 320 kbps (premium). Files are split into chunks of roughly 5 seconds each. Chunking enables seek without downloading the entire file and supports adaptive quality switching mid-stream.

Chunks land in object storage (S3 or equivalent) organized by track ID and quality level: audio/{track_id}/{quality}/{chunk_number}.ogg. A manifest file per track lists all chunks with byte offsets and durations.
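A manifest following this layout might look like the sketch below. The field names and the seek helper are illustrative, not a documented schema:

```python
# Illustrative manifest for one track; field names are assumptions.
manifest = {
    "track_id": "trk_123",
    "duration_ms": 210_000,
    "qualities": {
        "160": {
            "codec": "ogg/vorbis",
            "chunks": [
                # Each chunk: object key, byte offset, duration
                {"key": "audio/trk_123/160/0.ogg", "offset": 0, "duration_ms": 5000},
                {"key": "audio/trk_123/160/1.ogg", "offset": 100_000, "duration_ms": 5000},
            ],
        },
    },
}

def chunk_for_position(manifest, quality, position_ms):
    """Find the chunk covering a playback position, enabling direct seek
    without downloading earlier chunks."""
    elapsed = 0
    for chunk in manifest["qualities"][quality]["chunks"]:
        if elapsed + chunk["duration_ms"] > position_ms:
            return chunk["key"]
        elapsed += chunk["duration_ms"]
    return None
```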

Track play flow

When a user taps play, the client needs audio bytes flowing within 200 ms. Here is the sequence:

sequenceDiagram
  participant C as Client
  participant API as API Gateway
  participant SS as Streaming Service
  participant Cache as Redis Cache
  participant CDN as CDN Edge
  participant S3 as Object Storage

  C->>API: POST /play track_id, quality_pref
  API->>SS: resolve track
  SS->>Cache: lookup manifest for track_id
  Cache-->>SS: manifest cache hit
  SS-->>API: signed CDN URL + manifest
  API-->>C: playback session + CDN URL

  C->>CDN: GET chunk_0.ogg
  CDN-->>C: audio chunk (cache hit)

  Note over C: Playback starts

  C->>CDN: GET chunk_1.ogg
  CDN->>S3: cache miss, fetch from origin
  S3-->>CDN: audio chunk
  CDN-->>C: audio chunk

Sequence diagram for the track play flow. The client resolves a signed CDN URL through the streaming service, then fetches audio chunks directly from CDN edge nodes.

The streaming service returns a signed URL with a TTL of 24 hours. The client prefetches two chunks ahead of the current playback position. If a CDN edge has a cache miss, it pulls from origin and caches the chunk for subsequent listeners. Popular tracks stay warm in CDN cache indefinitely since audio files are immutable.

Adaptive bitrate

The client monitors download speed for each chunk. If throughput drops below 1.5x the current bitrate, the client steps down to the next quality tier at the next chunk boundary. If throughput recovers, it steps back up after three consecutive chunks at full speed. This avoids oscillation on flaky connections.

Network quality varies dramatically across the user base. A user on 5G in Seoul gets a stable 100 Mbps. A user on 3G in rural India sees 500 kbps with frequent drops. The adaptive bitrate logic must handle both extremes gracefully, defaulting to 96 kbps when it cannot measure throughput reliably.
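The step-down/step-up logic can be sketched as a small client-side state machine. The 1.5x headroom and the three-chunk recovery rule come from the description above; everything else (tier indexing, checking headroom against the higher tier before stepping up) is an assumption:

```python
TIERS_KBPS = [96, 160, 320]

class BitrateController:
    """Sketch of the client-side bitrate ladder; not a real player API."""

    def __init__(self, start_tier=1):
        self.tier = start_tier      # index into TIERS_KBPS
        self.stable_chunks = 0      # consecutive chunks downloaded at full speed

    def on_chunk_downloaded(self, throughput_kbps):
        current = TIERS_KBPS[self.tier]
        if throughput_kbps < 1.5 * current:
            # Step down immediately, effective at the next chunk boundary
            self.tier = max(0, self.tier - 1)
            self.stable_chunks = 0
        else:
            self.stable_chunks += 1
            next_tier = min(self.tier + 1, len(TIERS_KBPS) - 1)
            # Step up only after three consecutive healthy chunks, and only
            # if measured throughput also covers the higher tier with headroom
            if (self.stable_chunks >= 3
                    and throughput_kbps >= 1.5 * TIERS_KBPS[next_tier]):
                self.tier = next_tier
                self.stable_chunks = 0
        return TIERS_KBPS[self.tier]
```

Starting at the middle tier and defaulting to 96 kbps when throughput cannot be measured keeps the worst-case experience (rural 3G) playable.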

Offline playback

The client downloads entire tracks (all chunks) and stores them in an encrypted local cache. A license token from the auth service controls offline access: it expires after 30 days and requires a periodic check-in (once every 7 days when online). The client maintains a local SQLite database mapping downloaded tracks to their encrypted file paths and license state.
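Under these rules (30-day license TTL, 7-day online check-in), the client-side playability decision reduces to two clock comparisons. Function and field names here are illustrative:

```python
from datetime import datetime, timedelta

# Policy constants from the offline-playback rules described above
LICENSE_TTL = timedelta(days=30)
CHECKIN_INTERVAL = timedelta(days=7)

def can_play_offline(issued_at, last_checkin, now):
    """Return True if a downloaded track is still playable offline."""
    if now - issued_at > LICENSE_TTL:
        return False    # license expired; track must be re-downloaded
    if now - last_checkin > CHECKIN_INTERVAL:
        return False    # client must go online and check in first
    return True
```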

Deep dive: playlist management

Playlists are the most write-heavy entity in the system. Users reorder tracks, add and remove items, and collaborate on shared playlists. At 50K writes per second with spikes to 150K, the data model and conflict resolution strategy matter.

Data model

Each playlist is stored as a document in DynamoDB with this structure:

Partition key: playlist_id
Sort key: "META" | "TRACK#{position}"

META item:
  name, owner_id, collaborative (bool), follower_count, updated_at

TRACK items:
  track_id, added_by, added_at, position (float)

Using float positions allows insertions between existing tracks without rewriting every item. Insert between position 3.0 and 4.0 by assigning 3.5. After many insertions, a background compaction job renumbers positions to integers.
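The midpoint-insertion trick and the compaction pass can be sketched in a few lines (illustrative, not the service's actual code):

```python
def position_between(before, after):
    """Midpoint insertion: place a new track between two neighbors
    without rewriting any existing TRACK items."""
    return (before + after) / 2

def compact(positions):
    """Background renumbering back to integer-valued positions once
    float precision gets tight after many midpoint insertions.
    Returns a mapping of old position -> new position."""
    return {old: float(i + 1) for i, old in enumerate(sorted(positions))}
```

Repeated insertion at the same spot halves the gap each time, so float precision is exhausted after roughly 50 consecutive insertions between the same neighbors; the compaction job resets the gaps.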

Collaborative playlists

Collaborative playlists allow multiple users to add, remove, and reorder tracks simultaneously. Each mutation is appended to a message queue topic keyed by playlist ID. A single consumer per partition serializes writes, preventing conflicts. The playlist service applies operations in order and broadcasts the updated state to connected clients via WebSocket.

For offline edits (a user modifies a collaborative playlist while disconnected), the client queues operations locally and replays them on reconnection. Conflicts are resolved with last-writer-wins at the individual track level, using a vector clock to detect concurrent edits.
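One way to sketch the per-track conflict check: compare vector clocks, and fall back to last-writer-wins only when neither edit causally dominates the other. Names and structure are assumptions:

```python
def dominates(a, b):
    """True if vector clock a has seen every event in clock b."""
    return all(a.get(k, 0) >= v for k, v in b.items())

def resolve(op_a, op_b):
    """Keep the causally later operation; fall back to last-writer-wins
    (wall-clock timestamp) when the clocks are concurrent."""
    if dominates(op_a["clock"], op_b["clock"]):
        return op_a
    if dominates(op_b["clock"], op_a["clock"]):
        return op_b
    return max(op_a, op_b, key=lambda op: op["wall_ms"])  # LWW tiebreak
```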

Deep dive: recommendation pipeline

The recommendation engine is the competitive moat. It needs to process billions of listening events and generate personalized playlists for 200M daily active users.

Pipeline architecture

graph LR
  Events[Listening Events - 300K/s]
  Kafka[Kafka Topics]
  RT[Real-time Processor - Flink]
  Batch[Batch Processor - Spark]
  FeatStore[(Feature Store - Cassandra)]
  ModelTrain[Model Training - GPU Cluster]
  ModelServe[Model Serving - TensorFlow Serving]
  UserProfile[(User Taste Profiles)]
  Personalized[Personalized Playlists]

  Events --> Kafka
  Kafka --> RT
  Kafka --> Batch
  RT --> FeatStore
  Batch --> FeatStore
  FeatStore --> ModelTrain
  ModelTrain --> ModelServe
  FeatStore --> UserProfile
  UserProfile --> ModelServe
  ModelServe --> Personalized

Recommendation pipeline combining real-time and batch processing. Listening events flow through Kafka into both a Flink stream processor and a Spark batch job, feeding a shared feature store.

Real-time path

Every play, skip, save, and repeat event flows into Kafka. A Flink job updates the user’s taste profile in near real-time: incrementing genre affinity scores, tracking skip rates per artist, and computing session-level features like “time of day” and “listening duration.” These features land in the Cassandra feature store within 30 seconds of the event.

The real-time path powers contextual recommendations: “Because you just listened to jazz, here is a similar album.” The model serving layer queries the feature store for the user’s current taste vector and runs a lightweight nearest-neighbor search against precomputed track embeddings.

Batch path

A nightly Spark job processes the full day’s listening data (20B events, roughly 4 TB compressed). It retrains collaborative filtering models using matrix factorization across the full user-track interaction matrix (600M users x 100M tracks, extremely sparse). The output is a set of track embeddings and user embeddings stored in the feature store.

Weekly personalized playlists like “Discover Weekly” are generated in batch. For each user, the system retrieves their embedding, finds the top 200 candidate tracks by cosine similarity, filters out tracks they have already heard, applies diversity rules (no more than 3 tracks from one artist), and writes the final 30-track playlist to the playlist service.
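The candidate-selection step can be sketched as follows, assuming embeddings are available as plain vectors. This is a toy version of the batch job; a real system would use an approximate nearest-neighbor index rather than scanning candidates:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def discover_weekly(user_vec, tracks, heard, per_artist_cap=3, size=30):
    """tracks: list of (track_id, artist_id, embedding).
    Rank candidates by similarity to the user's embedding, drop tracks
    already heard, and cap tracks per artist for diversity."""
    ranked = sorted(tracks, key=lambda t: cosine(user_vec, t[2]), reverse=True)
    playlist, per_artist = [], Counter()
    for track_id, artist_id, _ in ranked:
        if track_id in heard or per_artist[artist_id] >= per_artist_cap:
            continue
        playlist.append(track_id)
        per_artist[artist_id] += 1
        if len(playlist) == size:
            break
    return playlist
```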

Cold start

New users with no listening history get recommendations based on registration data (country, age bracket) and the first few tracks they play. After 10 listened tracks, the system has enough signal for collaborative filtering to produce meaningful results. Until then, a popularity-based fallback serves trending tracks in the user’s locale.

Evaluation

Recommendation quality is measured offline and online. Offline metrics include precision at k (how many recommended tracks the user would actually play) and catalog coverage (what fraction of the 100M catalog gets recommended to at least one user per week). Online A/B tests measure skip rate, session length, and 30-day retention. A good recommendation system keeps the average skip rate below 25% and pushes average session length above 45 minutes.

Trade-offs and alternatives

Audio format: Ogg Vorbis vs AAC vs Opus. Ogg Vorbis offers good quality at low bitrates and is royalty-free. AAC has wider hardware decoder support, which matters for battery life on mobile. Opus is technically superior at low bitrates but has limited legacy device support. The trade-off is licensing cost vs device compatibility vs audio quality.

Playlist storage: DynamoDB vs PostgreSQL. DynamoDB handles the write throughput and provides single-digit millisecond latency at any scale. PostgreSQL would offer richer querying (full-text search on playlist names, complex joins for analytics) but requires careful sharding to handle 50K writes per second. DynamoDB wins for the primary path; a replicated read store in PostgreSQL serves analytics queries.

Recommendation model: collaborative filtering vs content-based vs hybrid. Pure collaborative filtering suffers from the cold start problem and popularity bias. Content-based filtering (analyzing audio features like tempo, key, energy) works for new tracks but misses social listening patterns. A hybrid approach that combines both, weighted by the amount of available data for each user, performs best in practice.

CDN strategy: push vs pull. For the top 1% of tracks, pushing chunks to edge nodes proactively avoids first-listener latency spikes on new releases. For the long tail, pull-through caching is more cost effective. A tiered approach uses push for known popular content and pull for everything else.

What real systems actually do

Spotify uses Ogg Vorbis at 96/160/320 kbps and stores audio in Google Cloud Storage. Their recommendation engine (described in numerous engineering blog posts) combines collaborative filtering, natural language processing on playlist names and blog text, and raw audio feature analysis using CNNs. They famously run a massive Kafka infrastructure processing over 30 trillion events per quarter.

Apple Music uses AAC and ALAC for lossless streaming. Their CDN infrastructure leverages both Akamai and their own edge network. Recommendations lean heavily on human curation combined with algorithmic suggestions.

YouTube Music benefits from Google’s infrastructure, serving audio from the same CDN that handles YouTube video. Their recommendation engine is an extension of YouTube’s deep learning recommendation system.

All three systems use chunked audio delivery, signed URLs for access control, and a combination of real-time and batch processing for recommendations. The differentiator is usually the recommendation quality and the catalog licensing agreements, not the infrastructure architecture.

What comes next

This design handles the core streaming, playlist, and recommendation workloads. Several areas deserve further exploration:

  • Lyrics sync and karaoke mode. Timestamped lyrics require a separate metadata pipeline syncing text to audio waveforms at 10 ms precision.
  • Podcasts and long-form audio. Episodes are 10-100x longer than tracks, changing the chunking strategy and CDN caching economics. Resume position tracking becomes critical.
  • Social features at scale. Friend activity feeds, shared listening sessions (“Group Session”), and social playlists introduce presence tracking and real-time sync challenges.
  • Audio quality: lossless and spatial audio. Lossless FLAC files are 3-5x larger than lossy equivalents, dramatically increasing storage and bandwidth costs. Spatial audio (Dolby Atmos) requires multi-channel encoding and decoder support on client devices.
  • Rights management and geo-restrictions. Licensing varies by country. The streaming service must enforce geo-fencing at the track level, requiring a rights database checked on every play request.

The hardest part of building a music streaming service is not the technology. It is negotiating catalog licenses with record labels. But assuming you have the rights, this architecture scales from day one to hundreds of millions of concurrent listeners.
