Design a distributed cache
In this series (18 parts)
- Design a URL shortener
- Design a key-value store
- Design a rate limiter
- Design a web crawler
- Design a notification system
- Design a news feed
- Design a chat application
- Design a video streaming platform
- Design a music streaming service
- Design a ride-sharing service
- Design a food delivery platform
- Design a hotel booking platform
- Design a search engine
- Design a distributed message queue
- Design a code deployment system
- Design a payments platform
- Design an ad click aggregation system
- Design a distributed cache
Why distributed caching matters
A single-node cache tops out around 100K reads per second before CPU and memory become bottlenecks. When your product serves 50M DAU and every page load triggers 5 to 10 cache lookups, you need a fleet of cache nodes working together. A distributed cache uses consistent hashing to partition keys, replicates data for fault tolerance, and keeps p99 latency under 1 ms. This article walks through designing one from scratch.
1. Requirements
Before jumping into architecture, pin down what the system needs to support. These numbers drive every capacity and design decision that follows.
Functional requirements
- Put(key, value, TTL): store a key-value pair with an optional time-to-live.
- Get(key): retrieve the value for a key; return null on miss.
- Delete(key): remove a key immediately.
- Bulk operations: multiget and multiset for batch access.
- Atomic counters: increment and decrement operations.
Non-functional requirements
| Metric | Target |
|---|---|
| DAU | 50 million |
| Peak read QPS | 5 million |
| Peak write QPS | 500K |
| p99 read latency | < 1 ms |
| Availability | 99.99% (< 53 min downtime/year) |
| Max value size | 1 MB |
| Total working set | 10 TB |
2. Capacity estimation
Storage: 50M users, average 200 cached objects per user at 1 KB each. That gives 50M x 200 x 1 KB = 10 TB. With a replication factor of 3, total raw storage is 30 TB. Using 64 GB nodes, we need roughly 470 nodes. Round up to 500 for headroom.
Bandwidth: 5M reads/sec x 1 KB average = 5 GB/s of read response traffic leaving the cache tier. Write side: 500K/sec x 1 KB = 500 MB/s. Replication triples that to 1.5 GB/s of write traffic across the cluster.
QPS per node: 5M reads / 500 nodes = 10K reads per node. Well within what a single Redis or Memcached instance handles. Each node should have capacity for 2x this load to handle traffic spikes during rolling deployments when half the cluster may be temporarily offline.
Network: each node handles 10K reads/sec x 1 KB = 10 MB/s read throughput. Add replication traffic (3x write fan-out) and heartbeat overhead, and each node needs at least a 1 Gbps NIC. For larger clusters, 10 Gbps NICs are standard to avoid becoming network-bound during rebalancing.
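The arithmetic above is worth sanity-checking. A few lines reproduce every number in this section (decimal units, 1 KB = 1,000 bytes, for simplicity):

```python
# Back-of-the-envelope capacity check for the estimates above.
USERS = 50_000_000           # DAU
OBJECTS_PER_USER = 200
OBJECT_SIZE = 1_000          # 1 KB average value
REPLICATION = 3
NODE_RAM = 64e9              # 64 GB per node
READ_QPS = 5_000_000
WRITE_QPS = 500_000

working_set = USERS * OBJECTS_PER_USER * OBJECT_SIZE    # bytes
raw_storage = working_set * REPLICATION
nodes = raw_storage / NODE_RAM

read_bw = READ_QPS * OBJECT_SIZE                        # bytes/sec out
write_bw = WRITE_QPS * OBJECT_SIZE * REPLICATION        # incl. replication

print(f"working set: {working_set / 1e12:.0f} TB")      # 10 TB
print(f"raw storage: {raw_storage / 1e12:.0f} TB")      # 30 TB
print(f"nodes:       {nodes:.0f}")                      # ~469, round to 500
print(f"read bw:     {read_bw / 1e9:.1f} GB/s")         # 5.0 GB/s
print(f"write bw:    {write_bw / 1e9:.1f} GB/s")        # 1.5 GB/s
```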
3. High-level architecture
```mermaid
graph TD
    C1[Client] --> LB[Load Balancer]
    C2[Client] --> LB
    C3[Client] --> LB
    LB --> R[Router / Proxy Layer]
    R --> S1[Shard 1<br/>Primary + Replicas]
    R --> S2[Shard 2<br/>Primary + Replicas]
    R --> S3[Shard 3<br/>Primary + Replicas]
    R --> SN[Shard N<br/>Primary + Replicas]
    S1 --> CM[Cluster Manager<br/>Config + Health]
    S2 --> CM
    S3 --> CM
    SN --> CM
    CM --> ZK[Coordination Service<br/>ZooKeeper / etcd]
```
High-level architecture: clients hit a proxy that routes to the correct shard based on key hash.
Clients connect through a thin proxy layer that hashes each key and routes the request to the correct shard. Each shard is a primary with one or two replicas. A cluster manager watches node health and triggers failover when a primary goes down. The coordination service (ZooKeeper or etcd) stores the authoritative cluster topology and provides distributed locking for leader election.
The proxy can be a standalone service or a client-side library. Client-side routing (like Redis Cluster) removes a network hop but pushes topology awareness into every client. A proxy (like Twemproxy or Envoy) centralizes routing logic at the cost of an extra hop adding roughly 0.1 ms. At 5M QPS, that extra hop costs about 500 CPU-seconds per second across the proxy fleet, so size the proxy tier accordingly.
4. Deep dive: consistent hashing and data placement
Caching at scale demands predictable data placement. Consistent hashing maps both keys and nodes onto the same hash ring. When a node joins or leaves, only 1/N of keys need to move. Compare this to simple modulo hashing where adding a single node invalidates nearly every key, causing a massive cache miss storm.
The hash function matters. Use a fast, well-distributed function like xxHash or MurmurHash3. MD5 and SHA-1 are cryptographically stronger than needed and 3 to 5x slower. At 5M operations per second, those extra microseconds add up.
Plain consistent hashing can produce uneven load. Virtual nodes fix this: each physical node owns 100 to 200 positions on the ring, smoothing out the distribution. With 500 physical nodes and 150 vnodes each, that is 75,000 points on the ring. The standard deviation of load drops below 5%.
How key routing works
When a client issues GET user:42, the proxy hashes the key and walks clockwise on the ring until it finds the first vnode. That vnode’s physical node is the primary owner. The proxy sends the request there. For writes, the primary acknowledges the client and then replicates asynchronously to the next two distinct physical nodes on the ring.
```mermaid
graph LR
    subgraph Hash Ring
        direction LR
        V1["vnode A1<br/>hash: 0x0A"] --> V2["vnode B1<br/>hash: 0x1F"]
        V2 --> V3["vnode C1<br/>hash: 0x35"]
        V3 --> V4["vnode A2<br/>hash: 0x4C"]
        V4 --> V5["vnode B2<br/>hash: 0x68"]
        V5 --> V6["vnode C2<br/>hash: 0x7A"]
        V6 --> V7["vnode A3<br/>hash: 0x91"]
        V7 --> V8["vnode B3<br/>hash: 0xB2"]
        V8 --> V9["vnode C3<br/>hash: 0xD5"]
        V9 --> V1
    end
    K1([key: user:42]) -.->|hashed to 0x22| V3
    K2([key: session:99]) -.->|hashed to 0x5E| V5
```
Consistent hash ring with virtual nodes. Keys map to the next clockwise vnode.
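The ring, vnodes, and clockwise replica walk fit in a short sketch. This version uses MD5 only to stay dependency-free; production code would use xxHash or MurmurHash3 as discussed above, and the class shape is illustrative, not any particular library's API:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (sketch)."""

    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self.ring = []                        # sorted (hash, physical node)

    def _hash(self, key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add_node(self, node: str):
        # Each physical node claims `vnodes` positions on the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def owners(self, key: str, replicas: int = 3):
        """Walk clockwise from the key's hash. The first vnode's physical
        node is the primary; the next distinct physical nodes hold the
        replicas."""
        idx = bisect.bisect_left(self.ring, (self._hash(key),))
        result = []
        for i in range(len(self.ring)):
            _, node = self.ring[(idx + i) % len(self.ring)]
            if node not in result:
                result.append(node)
                if len(result) == replicas:
                    break
        return result
```

Removing a node only remaps the keys it owned; everything else keeps routing to the same primaries, which is the whole point versus modulo hashing.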
Replication strategy
Each key is stored on the primary vnode and replicated to the next N-1 distinct physical nodes clockwise on the ring. With a replication factor of 3, every key lives on three separate machines. Writes go to the primary first, then propagate asynchronously to replicas. This gives us eventual consistency, which is acceptable for a cache. If you need stronger guarantees, synchronous replication is an option but it doubles write latency.
Eviction policies
When a node runs out of memory, it must evict keys. Common policies:
| Policy | Best for | Drawback |
|---|---|---|
| LRU | General workloads | Frequency-blind |
| LFU | Skewed access patterns | Slow to adapt |
| Random | Simplicity | Unpredictable |
| TTL-based | Session data | Doesn’t help with hot keys |
Most production caches use an approximated LRU. Redis samples 5 random keys and evicts the least recently used among them. This avoids maintaining a full LRU linked list, which would add 16 bytes of overhead per key. At 1 billion keys, that is 16 GB just for the eviction metadata.
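Sampled eviction is simple to express. A minimal sketch, assuming each entry carries a logical last-access counter (the class and its fields are invented for illustration, not Redis internals):

```python
import itertools
import random

class SampledLRUCache:
    """Approximated LRU: on memory pressure, sample a few random keys
    and evict the least recently used among them, avoiding a full
    LRU linked list."""

    def __init__(self, max_items: int, sample_size: int = 5):
        self.max_items = max_items
        self.sample_size = sample_size
        self.data = {}                 # key -> value
        self.last_access = {}          # key -> logical access time
        self._clock = itertools.count()

    def get(self, key):
        if key in self.data:
            self.last_access[key] = next(self._clock)
            return self.data[key]
        return None

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.max_items:
            self._evict_one()
        self.data[key] = value
        self.last_access[key] = next(self._clock)

    def _evict_one(self):
        # Sample up to `sample_size` keys; evict the coldest of the sample.
        sample = random.sample(list(self.data),
                               min(self.sample_size, len(self.data)))
        victim = min(sample, key=lambda k: self.last_access[k])
        del self.data[victim], self.last_access[victim]
```

With a sample of 5, the evicted key is rarely the globally oldest, but it is almost always cold, which is good enough in practice.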
For read workloads with heavily skewed access patterns, consider allkeys-lfu. LFU tracks access frequency, not just recency, and keeps popular keys in cache even if they have not been accessed in the last few seconds. Redis implements LFU with a logarithmic counter that decays over time, using only 8 bits per key.
5. Deep dive: failover and high availability
A cache with 500 nodes will see hardware failures routinely. At annual failure rates of 2 to 4% for commodity servers, that is 10 to 20 node failures per year, roughly one every two to five weeks. The system must detect and recover from failures automatically within seconds to meet 99.99% availability.
The failover process has three phases: detection, election, and promotion. Each phase must complete quickly and correctly. A slow failover means cache misses flood the database. An incorrect failover (promoting the wrong replica or creating a split-brain) means data corruption.
```mermaid
stateDiagram-v2
    [*] --> Healthy
    Healthy --> Suspected: Heartbeat timeout (3s)
    Suspected --> Healthy: Heartbeat resumes
    Suspected --> Failed: Quorum confirms failure
    Failed --> Promoting: Elect replica as new primary
    Promoting --> Healthy: Promotion complete, clients redirected
    Failed --> ManualReview: No healthy replica available
    ManualReview --> Healthy: Operator provisions new node
```
Failover state machine: from heartbeat timeout to replica promotion.
Detection
The cluster manager pings every node every second. If a node misses three consecutive heartbeats (3 seconds), it enters a “suspected” state. The manager then asks other nodes in the same shard to confirm. If a quorum (2 out of 3) agrees the node is unreachable, it is marked failed. This two-phase detection prevents false positives caused by temporary network blips or GC pauses.
Why not use a shorter timeout? At 500 nodes, a 1-second timeout with aggressive failure marking would trigger dozens of false failovers per day. The 3-second window balances detection speed against false positive rate. In practice, real hardware failures take minutes to develop (disk degradation, memory errors), so a few seconds of detection delay is negligible.
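The two-phase detection logic can be sketched as a small state machine. The class, its field names, and the synchronous `confirm_unreachable` call are illustrative simplifications of what a real cluster manager would do over RPC:

```python
HEARTBEAT_INTERVAL = 1.0   # seconds between pings
MISSED_BEATS = 3           # missed heartbeats before "suspected"
QUORUM = 2                 # shard peers that must confirm (2 of 3)

class FailureDetector:
    """Phase 1: heartbeat timeout marks a node suspected.
    Phase 2: a quorum of shard peers must confirm before it is failed."""

    def __init__(self):
        self.last_seen = {}    # node -> timestamp of last heartbeat
        self.state = {}        # node -> "healthy" | "suspected" | "failed"

    def heartbeat(self, node, now):
        self.last_seen[node] = now
        if self.state.get(node) != "failed":
            self.state[node] = "healthy"   # a resumed heartbeat clears suspicion

    def tick(self, now):
        # Called periodically by the cluster manager's monitor loop.
        for node, seen in self.last_seen.items():
            timed_out = now - seen >= MISSED_BEATS * HEARTBEAT_INTERVAL
            if self.state[node] == "healthy" and timed_out:
                self.state[node] = "suspected"

    def confirm_unreachable(self, node, votes):
        # Only a quorum of confirmations moves suspected -> failed,
        # filtering out network blips and GC pauses.
        if self.state.get(node) == "suspected" and votes >= QUORUM:
            self.state[node] = "failed"
```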
Recovery
Once a primary is marked failed:
- The cluster manager selects the replica with the most up-to-date replication offset.
- That replica is promoted to primary. This takes under 1 second.
- The cluster manager updates the routing table and pushes it to all proxies.
- Clients see a brief spike in latency (5 to 20 ms) during the switchover but no errors because the proxy retries.
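The first two steps reduce to a comparison of replication offsets plus a routing-table swap. A sketch with invented data shapes (the real cluster manager would do this over its coordination service, not in-process dictionaries):

```python
def pick_new_primary(replicas):
    """Choose the replica with the highest replication offset, i.e. the
    one that applied the most of the failed primary's write stream.
    `replicas` is a list of (node_id, replication_offset) pairs."""
    node_id, _ = max(replicas, key=lambda r: r[1])
    return node_id

def promote(routing_table, shard, replicas):
    """Promote the chosen replica and update the routing table that the
    proxies consume. Returns the new primary's node id."""
    new_primary = pick_new_primary(replicas)
    routing_table[shard] = new_primary
    return new_primary
```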
Split-brain prevention
Network partitions can cause two nodes to both think they are primary. To prevent this, a primary must maintain contact with a majority of the cluster manager nodes. If it loses quorum, it stops accepting writes and demotes itself to read-only. Clients get a clear error on writes, which is better than silent data divergence.
This ties directly to the CAP theorem. Our cache chooses AP (availability + partition tolerance) for reads and leans toward CP for writes during partitions.
6. Deep dive: persistence and warm restarts
A pure in-memory cache loses all data on restart. For a 64 GB node, repopulating from the database at 100K ops/sec takes over 10 minutes. During that window, every request hits the database, a classic thundering herd.
Two persistence approaches solve this:
Snapshotting (RDB): periodically dump the full dataset to disk. Fast to load on restart but you lose data written since the last snapshot. A 64 GB snapshot takes about 30 seconds to write using fork-and-copy-on-write.
Append-only log (AOF): log every write operation. On restart, replay the log. More durable but the log grows large and replay can be slow. Periodic compaction (rewriting the log with only current values) keeps it manageable.
The pragmatic choice is to combine both: take snapshots every 5 minutes and keep an AOF for writes since the last snapshot. On restart, load the snapshot then replay the AOF tail. Recovery drops from 10 minutes to under 30 seconds.
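The combined scheme is easy to sketch. This toy version uses JSON files and invented names (`cache.snapshot.json`, `cache.aof.log`); real systems like Redis use binary RDB/AOF formats and fork-based snapshotting:

```python
import json
import os

SNAPSHOT = "cache.snapshot.json"   # hypothetical file names for the sketch
AOF = "cache.aof.log"

def snapshot(data: dict):
    """Dump the full dataset atomically, then truncate the AOF: the log
    only needs to cover writes since the last snapshot."""
    tmp = SNAPSHOT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(data, f)
    os.replace(tmp, SNAPSHOT)      # atomic swap, no torn snapshot
    open(AOF, "w").close()

def log_write(op, key, value=None):
    """Append one write operation to the AOF."""
    with open(AOF, "a") as f:
        f.write(json.dumps({"op": op, "key": key, "value": value}) + "\n")

def recover() -> dict:
    """Warm restart: load the snapshot, then replay the AOF tail."""
    data = {}
    if os.path.exists(SNAPSHOT):
        with open(SNAPSHOT) as f:
            data = json.load(f)
    if os.path.exists(AOF):
        with open(AOF) as f:
            for line in f:
                entry = json.loads(line)
                if entry["op"] == "set":
                    data[entry["key"]] = entry["value"]
                elif entry["op"] == "del":
                    data.pop(entry["key"], None)
    return data
```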
Here is how the two strategies compare under different failure scenarios:
| Scenario | RDB only | AOF only | RDB + AOF |
|---|---|---|---|
| Clean restart | 15s load | 3 min replay | 15s load + 2s replay |
| Crash (data loss window) | Up to 5 min | < 1 sec | < 1 sec |
| Disk space usage | 64 GB snapshot | 120 GB+ log | 64 GB + small tail |
| Fork overhead | Yes (COW) | No (sequential) | Yes (periodic) |
For backend caching scenarios where the cache sits between the application and the database, persistence lets you survive restarts without hammering the backing store.
7. Trade-offs and alternatives
Cluster mode vs. sentinel
Redis offers two operational models, and the choice depends on your scale and team capacity.
Redis Sentinel monitors a set of primary/replica pairs and handles automatic failover. It does not shard data. You need an external mechanism (client-side consistent hashing or a proxy like Twemproxy) to distribute keys across multiple primaries. Sentinel works well for small to medium deployments (under 10 shards) where operational simplicity matters.
Redis Cluster bakes sharding into the protocol using 16,384 hash slots. Clients learn the slot map and talk directly to the right node. This removes the proxy hop but every client library must understand the cluster protocol. Cluster is the right choice at scale (50+ shards) because it handles both partitioning and failover in a single system.
Memcached + Twemproxy is the simplest option. Memcached has no built-in replication or cluster awareness. Twemproxy (or mcrouter) sits in front and handles consistent hashing. Failover is manual or requires external scripting. This works when you treat the cache as fully disposable.
| Approach | Sharding | Failover | Complexity |
|---|---|---|---|
| Redis Sentinel | Client-side or proxy | Automatic via sentinels | Moderate |
| Redis Cluster | Built-in (16,384 hash slots) | Gossip-based | Higher |
| Memcached + Twemproxy | Proxy-based | Manual or external | Lower |
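Redis Cluster's slot mapping is simple enough to reproduce: CRC16 (the XMODEM variant, polynomial 0x1021) of the key, mod 16,384, with hash tags letting clients pin related keys to the same slot. A sketch:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM (poly 0x1021, init 0x0000), the checksum Redis
    Cluster uses for slot assignment."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: bytes) -> int:
    """Map a key to one of 16,384 slots. If the key contains a hash tag
    like {user:42}, only the tag content is hashed, so related keys land
    in the same slot and can appear together in multi-key commands."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:                # non-empty tag only
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384
```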
Consistency vs. latency
Synchronous replication guarantees no data loss on failover but adds 0.5 to 1 ms per write. For a cache, that tradeoff rarely makes sense. Losing a few seconds of writes on failover is acceptable because the source of truth is the database.
Memory efficiency
A naive key-value store wastes memory on per-key overhead. Techniques to reduce it:
- Hash encoding for small objects: Redis stores hashes with fewer than 128 fields as a ziplist, cutting memory use by 10x.
- Key compression: prefix stripping or short key names save bytes at scale. At 1 billion keys, saving 10 bytes per key reclaims 10 GB.
- Slab allocation: Memcached pre-allocates memory in fixed-size slabs, reducing fragmentation.
8. What real systems actually do
Redis Cluster uses 16,384 hash slots distributed across shards. Each shard runs a primary and one or two replicas. Failover is gossip-based with a majority vote. It supports persistence via RDB and AOF. Most large deployments run it behind a proxy (like Redis Cluster Proxy or Envoy) to simplify client logic.
Memcached has no built-in replication or persistence. Sharding is done entirely client-side with consistent hashing. It is simpler and can be faster for pure read-heavy workloads because there is zero replication overhead. Facebook ran Memcached at scale with a custom solution called mcrouter for routing.
Amazon ElastiCache wraps both Redis and Memcached with managed failover, backups, and scaling. It uses multi-AZ replication for high availability. Under the hood, it is the same software with operational automation layered on top.
Pelikan (Twitter’s cache) was built specifically for predictable latency. It uses a single-threaded event loop per core and, because it is written in C rather than a garbage-collected language, avoids GC pauses entirely. Pelikan achieves p99.9 latency under 100 microseconds, which is 10x better than stock Redis under heavy load.
Dragonfly is a newer entrant that uses a shared-nothing architecture with io_uring for async I/O. A single Dragonfly instance on a 64-core machine can match the throughput of a multi-node Redis Cluster, simplifying operations at the cost of vertical scaling limits.
The common thread: all production caches use consistent hashing for partitioning, async replication for availability, and approximated LRU for eviction. The differences are in operational tooling, memory management, and how they handle edge cases like hot keys and large values.
9. Monitoring and operational health
Running a distributed cache at scale requires visibility into several key metrics:
Hit rate is the single most important metric. A healthy cache maintains 95%+ hit rate. If it drops below 90%, investigate whether keys are being evicted too aggressively, TTLs are too short, or the working set grew beyond memory capacity.
Replication lag measures how far behind replicas are from their primary. Under normal conditions, lag stays under 1 MB. During bulk loads or network congestion, lag can spike. If it exceeds 100 MB, the replica is at risk of serving stale data during a failover.
Memory fragmentation ratio compares memory allocated by the OS to memory actually used by the cache. A ratio above 1.5 means the allocator is wasting memory on fragmentation. Restart the node or switch to jemalloc to bring it down.
Eviction rate tells you whether the cache is under memory pressure. A sudden spike in evictions often correlates with a new feature pushing more data into the cache without a corresponding capacity increase.
Track these per shard, not just at the cluster level. A single overloaded shard can degrade the whole system while cluster-wide averages look fine.
10. What comes next
This design covers the core of a distributed cache. Several extensions are worth exploring depending on your scale:
- Hot key mitigation: replicate frequently accessed keys to all nodes or use a local L1 cache in the client. This prevents a single shard from becoming a bottleneck. At 50M DAU, the top 100 keys often account for 10% of all reads.
- Multi-region caching: run independent clusters per region with invalidation propagated through a message bus like Kafka. Adds complexity but drops latency for global users from 200 ms (cross-region database read) to under 1 ms (local cache hit).
- Cache stampede protection: use probabilistic early expiration or locking to prevent thundering herd on key expiry. A simple approach: refresh keys that are within 10% of their TTL on read. For a key with a 60-second TTL, any read after 54 seconds triggers an async refresh.
- Observability: track hit rate, eviction rate, memory fragmentation, and replication lag per shard. Alert on hit rate drops because they often signal upstream bugs, not cache problems.
- Client-side caching: add a small in-process LRU (1,000 to 10,000 keys) to avoid network round trips for the hottest keys. Redis 6 introduced server-assisted client caching with invalidation messages, so the local cache stays fresh without polling.
- Tiered storage: as datasets grow, keep hot data in memory and warm data on NVMe SSDs. Redis on Flash and Dragonfly both support this model, extending effective capacity by 5 to 10x at the cost of higher tail latency for SSD reads.
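The stampede-protection idea above (refresh within the last 10% of the TTL) fits in a few lines. The `cache` and `loader` shapes here are assumptions for the sketch, and a production version would kick off the refresh asynchronously instead of loading inline:

```python
import time

EARLY_FRACTION = 0.1    # start refreshing within the last 10% of the TTL

def get_with_early_refresh(cache, key, ttl, loader, now=None):
    """Read path with early refresh: once a key enters the last 10% of
    its TTL, a read triggers a reload instead of waiting for expiry, so
    expiry never produces a thundering herd of concurrent loads.
    `cache` maps key -> (value, expires_at); `loader` recomputes the
    value on miss or refresh."""
    now = time.monotonic() if now is None else now
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if expires_at - now > ttl * EARLY_FRACTION:
            return value               # comfortably fresh, serve as-is
    # Miss, expired, or inside the early-refresh window: reload.
    value = loader(key)
    cache[key] = (value, now + ttl)
    return value
```

For a 60-second TTL this matches the example above: any read after the 54-second mark refreshes the key.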
A distributed cache is the backbone of low-latency systems. Get consistent hashing and failover right and the rest is operational discipline.