Consistent hashing
In this series (20 parts)
- What is system design and why it matters
- Estimations and back-of-envelope calculations
- Scalability: vertical vs horizontal scaling
- CAP theorem and distributed system tradeoffs
- Consistency models
- Load balancing
- Caching: strategies and patterns
- Content Delivery Networks
- Databases: SQL vs NoSQL and when to use each
- Database replication
- Database sharding and partitioning
- Consistent hashing
- Message queues and event streaming
- API design: REST, GraphQL, gRPC
- Rate limiting and throttling
- Proxies: forward and reverse
- Networking concepts for system design
- Reliability patterns: timeouts, retries, circuit breakers
- Observability: logging, metrics, tracing
- Security in system design
You have five cache servers holding 200 million keys. One server dies. With naive modular hashing, hash(key) % n changes for almost every key when n drops from 5 to 4. That means roughly 80% of your keys now map to the wrong server. Every one of those lookups becomes a cache miss, and the miss storm hammers your database. Consistent hashing exists to prevent exactly this scenario.
Prerequisites
You should understand database sharding and the basic reasons we split data across multiple machines. Familiarity with caching and load balancing helps too, since consistent hashing shows up in both contexts.
The problem with modular hashing
The simplest way to assign keys to servers is modular arithmetic. Hash the key, divide by the number of servers, and use the remainder as the server index.
server = hash(key) % N
This works well when N is constant. The trouble starts the moment N changes. Suppose you have 4 servers and you add a fifth. The table below shows what happens to 8 keys:
| Key hash | hash % 4 | hash % 5 | Moved? |
|---|---|---|---|
| 17 | 1 | 2 | Yes |
| 23 | 3 | 3 | No |
| 44 | 0 | 4 | Yes |
| 58 | 2 | 3 | Yes |
| 71 | 3 | 1 | Yes |
| 86 | 2 | 1 | Yes |
| 92 | 0 | 2 | Yes |
| 103 | 3 | 3 | No |
Six out of eight keys moved. In the general case, adding one server to N causes roughly (N-1)/N of all keys to remap. For N = 100, that is 99% of your data moving at once. This is not acceptable in production.
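The remapping fraction is easy to verify empirically. A minimal sketch (key names are made up, and MD5 is used only as a stable, well-mixed hash, since Python's built-in hash() is salted per process):

```python
import hashlib

def bucket(key: str, n: int) -> int:
    # Stable 64-bit hash of the key, reduced modulo the server count
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % n

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(bucket(k, 4) != bucket(k, 5) for k in keys)
print(f"{moved / len(keys):.1%} of keys remapped")  # roughly 80%
```

Running this shows roughly 80% of keys landing on a different server after going from 4 to 5 servers, matching the (N-1)/N estimate.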
The hash ring
Consistent hashing maps both keys and servers onto the same circular space. Picture the output range of a hash function (say 0 to 2³² − 1) bent into a circle, so the maximum position wraps back around to position 0. Each server gets a position on this ring by hashing its identifier (IP address, hostname, some unique label). Each key also gets a position by hashing its value. To find which server owns a key, walk clockwise from the key’s position until you hit the first server.
```mermaid
graph LR
subgraph Ring["Hash Ring (0 to 2³²)"]
direction TB
A["Server A<br/>pos: 500M"]
B["Server B<br/>pos: 1.5B"]
C["Server C<br/>pos: 3B"]
end
K1["Key k1<br/>pos: 200M"] -->|clockwise| A
K2["Key k2<br/>pos: 900M"] -->|clockwise| B
K3["Key k3<br/>pos: 2B"] -->|clockwise| C
K4["Key k4<br/>pos: 3.5B"] -->|clockwise| A
```
Three servers on the hash ring. Each key walks clockwise to its nearest server.
When a key hashes to position 200M, it walks clockwise and lands on Server A at position 500M. Key k2 at 900M lands on Server B at 1.5B. Key k4 at 3.5B wraps around past the end of the ring and lands on Server A at 500M.
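In code, the clockwise walk is just a lookup in a sorted list of server positions. A minimal sketch (server names are illustrative; MD5 is used only as a stable hash onto a 32-bit ring):

```python
import bisect
import hashlib

def ring_pos(s: str) -> int:
    # Stable position on the 0 .. 2**32 - 1 ring
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

class HashRing:
    def __init__(self, servers):
        # The ring is a sorted list of (position, server) pairs
        self.points = sorted((ring_pos(s), s) for s in servers)

    def lookup(self, key: str) -> str:
        # First server at or clockwise of the key's position, wrapping at the end
        i = bisect.bisect_left(self.points, (ring_pos(key),))
        return self.points[i % len(self.points)][1]

ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
owner = ring.lookup("user:42")
```

The modulo on the index implements the wrap-around: a key hashing past the last server position lands on the first server of the sorted list.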
What happens when a server joins
Suppose Server D joins the ring at position 2.5B. Only the keys between Server B (1.5B) and Server D (2.5B) need to move. Those keys previously belonged to Server C. Everything else stays put. If keys are uniformly distributed across the ring, the fraction of keys that move is roughly 1/N, where N is the number of servers after the addition. With 4 servers, about 25% of keys relocate instead of 75%.
What happens when a server leaves
If Server B at position 1.5B crashes, only the keys between Server A (500M) and position 1.5B need a new home. They now belong to Server C (the next clockwise server). Again, roughly 1/N of keys move. The rest of the system is unaffected.
This is the core guarantee: node additions and removals cause only about K/N key movements, where K is the total number of keys and N is the number of nodes.
The imbalance problem
A hash ring with three physical nodes has three positions on the circle. Unless those positions happen to be perfectly equidistant (they almost never are), the arc lengths between nodes vary. A server that owns a larger arc handles more keys. In extreme cases, one server might handle 60% of all traffic while another handles 10%. That defeats the purpose of distributing load.
Consider the numbers. With 3 nodes placed randomly on a ring, the expected maximum load is about 2x the average. With 10 nodes, it improves but remains 1.5x to 1.8x. We need a way to smooth this out without adding physical hardware.
Virtual nodes
The solution is virtual nodes (vnodes). Instead of placing each physical server at one position on the ring, you place it at many positions. If you assign 150 virtual nodes to each physical server, three physical servers produce 450 points on the ring. The arc lengths between adjacent points become much more uniform, which means the load distribution tightens dramatically.
```mermaid
graph TB
subgraph Without["Without Virtual Nodes"]
direction LR
P1["Server A"]
P2["Server B"]
P3["Server C"]
end
subgraph With["With Virtual Nodes (3 per server)"]
direction LR
V1["A-0"] --- V2["B-0"]
V2 --- V3["C-0"]
V3 --- V4["A-1"]
V4 --- V5["B-1"]
V5 --- V6["C-1"]
V6 --- V7["A-2"]
V7 --- V8["B-2"]
V8 --- V9["C-2"]
end
```
Without virtual nodes, three servers create three large arcs. With 3 vnodes each, nine points create smaller, more uniform arcs.
How many virtual nodes?
The number of vnodes per server controls the tradeoff between balance and overhead. More vnodes means better balance but more entries in the ring lookup table and more metadata to manage.
| Vnodes per server | Std dev of load (% of mean) | Ring table entries (100 servers) |
|---|---|---|
| 1 | ~100% | 100 |
| 50 | ~14% | 5,000 |
| 150 | ~8% | 15,000 |
| 500 | ~4.5% | 50,000 |
At 150 vnodes per server, the standard deviation drops to about 8% of the mean load. That is tight enough for most production systems. Going beyond 500 vnodes rarely justifies the extra memory. A ring with 50,000 entries is still tiny (a few megabytes at most), but the routing metadata and rebalancing bookkeeping grow with it.
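The effect of vnode count on balance can be measured directly. A sketch (server labels, key names, and the 50,000-key sample are arbitrary; MD5 serves as a stable hash):

```python
import bisect
import hashlib
from collections import Counter

def ring_pos(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

def build_ring(servers, vnodes):
    # One ring point per (server, vnode-index) pair, e.g. "A#17"
    return sorted((ring_pos(f"{s}#{i}"), s) for s in servers for i in range(vnodes))

def owner(ring, key):
    i = bisect.bisect_left(ring, (ring_pos(key),))
    return ring[i % len(ring)][1]

keys = [f"k{i}" for i in range(50_000)]
for vnodes in (1, 150):
    ring = build_ring(["A", "B", "C"], vnodes)
    load = Counter(owner(ring, k) for k in keys)
    ratio = max(load.values()) / (len(keys) / 3)
    print(f"{vnodes:>3} vnodes per server: max load = {ratio:.2f}x the mean")
```

Raising the vnode count pulls the ratio toward 1.0x, at the cost of a larger ring table.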
Heterogeneous hardware
Virtual nodes also solve the problem of servers with different capacities. A server with 64 GB of RAM can be assigned 300 vnodes while a server with 32 GB gets 150. The load tracks the vnode count, so stronger machines absorb proportionally more keys without any changes to the hashing algorithm itself.
The lookup algorithm
In practice, the ring is implemented as a sorted array or balanced tree of vnode positions. To find the server for a key:
- Compute `hash(key)`.
- Binary search the sorted ring for the smallest position greater than or equal to the hash.
- If no position is greater, wrap around to the first entry (the ring is circular).
- Return the physical server associated with that vnode.
This lookup is O(log V), where V is the total number of vnodes. With 15,000 entries, that is about 14 comparisons. Even under heavy load, this adds negligible latency.
When a node is added, you insert its vnodes into the sorted structure and migrate keys from the successor nodes. When a node is removed, you delete its vnodes and hand keys to their new successors. Both operations touch only the affected key ranges.
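Both claims — only about 1/N of keys move, and every moved key lands on the new node — can be checked with a small simulation. A sketch (names and sizes are illustrative; MD5 as a stable hash):

```python
import bisect
import hashlib

def ring_pos(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

def build_ring(servers, vnodes=150):
    return sorted((ring_pos(f"{s}#{i}"), s) for s in servers for i in range(vnodes))

def owner(ring, key):
    i = bisect.bisect_left(ring, (ring_pos(key),))
    return ring[i % len(ring)][1]

keys = [f"k{i}" for i in range(20_000)]
before = build_ring(["A", "B", "C"])
after = build_ring(["A", "B", "C", "D"])  # Server D joins

moved = [k for k in keys if owner(before, k) != owner(after, k)]
# Every migrated key belongs to the new node; roughly a quarter migrate
assert all(owner(after, k) == "D" for k in moved)
print(f"{len(moved) / len(keys):.1%} of keys migrated")
```

Because inserting D's vnodes can only shorten arcs that now end at D, no key moves between the existing servers A, B, and C.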
Replication and fault tolerance
Consistent hashing pairs naturally with replication. A common strategy is to walk clockwise from a key’s position and place replicas on the next N distinct physical servers. If the replication factor is 3, the key lives on the first three unique physical servers encountered clockwise. Virtual nodes complicate this slightly because consecutive vnodes might belong to the same physical server, so the walk skips duplicate physical servers.
This gives each key a preference list of servers. If the primary fails, reads and writes shift to the next server in the list. The client or coordinator keeps trying clockwise until it finds a healthy node.
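The clockwise replica walk with physical-server deduplication can be sketched as follows (vnode count and server names are arbitrary; MD5 as a stable hash):

```python
import bisect
import hashlib

def ring_pos(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

def build_ring(servers, vnodes=16):
    return sorted((ring_pos(f"{s}#{i}"), s) for s in servers for i in range(vnodes))

def preference_list(ring, key, n=3):
    # Walk clockwise from the key, collecting the first n distinct physical servers
    start = bisect.bisect_left(ring, (ring_pos(key),))
    seen, prefs = set(), []
    for j in range(len(ring)):
        server = ring[(start + j) % len(ring)][1]
        if server not in seen:
            seen.add(server)
            prefs.append(server)
            if len(prefs) == n:
                break
    return prefs

ring = build_ring(["A", "B", "C", "D", "E"])
replicas = preference_list(ring, "user:42")
```

The `seen` set is what skips consecutive vnodes belonging to the same physical server; without it, two replicas of a key could land on one machine.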
Consistent hashing in the real world
Amazon Dynamo and DynamoDB
Amazon’s Dynamo, described in the original Dynamo paper (2007), used consistent hashing as its core partitioning strategy, and DynamoDB descends from that design. Each table’s partition key is hashed, and the hash determines placement on a ring. Vnodes map to partition ranges, and partitions split or merge automatically as throughput demands change. When a node fails, its key ranges move to the next nodes on the ring. Hinted handoff and anti-entropy (Merkle tree comparison) ensure replicas converge after failures.
The Dynamo paper reported that with 150 vnodes per node and a replication factor of 3, the system achieved load balance within 10% across nodes, even during node additions and removals. This tight balance held at scale with hundreds of nodes.
Apache Cassandra
Cassandra adopted the Dynamo-style ring from the start. Each node is assigned token ranges on a ring spanning −2⁶³ to 2⁶³ − 1. The partitioner (by default Murmur3Partitioner) hashes the partition key to determine placement. Cassandra supports two token assignment strategies:
- Single-token assignment. Each node gets one token, and the operator manually distributes tokens for balance. This was the original approach and is error-prone.
- Virtual nodes (vnodes). Each node gets `num_tokens` positions on the ring (256 was the long-standing default; Cassandra 4.0 lowered it to 16). The database automatically assigns random positions, and rebalancing on node addition or removal happens incrementally.
With vnodes enabled, adding a node to a 10-node Cassandra cluster moves roughly 10% of the data. Without vnodes, the operator would need to manually recalculate and reassign tokens, a process that is both slow and risky.
Memcached client libraries
Memcached itself is a simple key-value store with no built-in distribution. The consistent hashing happens in client libraries like libketama (originally built by Last.fm). The client maintains a ring of memcached server positions, routes each get and set to the correct server, and gracefully handles server additions and removals by only remapping the affected key range. When a memcached node goes down, roughly 1/N of requests miss the cache and fall through to the database, not the catastrophic miss storm of modular hashing.
Load balancers
Some load balancers use consistent hashing to ensure session affinity. The key is derived from the client IP or a session cookie, and the ring maps it to a backend server. If a backend goes down, only its sessions shift to other backends. Envoy Proxy and HAProxy both support consistent hashing as a load balancing strategy.
Comparing with other partitioning strategies
How does consistent hashing stack up against alternatives?
| Strategy | Keys moved on add/remove | Load balance | Complexity |
|---|---|---|---|
| Modular hashing (hash % N) | ~(N−1)/N of all keys | Good if static | Trivial |
| Range-based partitioning | Varies (split/merge) | Can hotspot | Medium |
| Consistent hashing (no vnodes) | ~1/N of all keys | Poor | Low |
| Consistent hashing (with vnodes) | ~1/N of all keys | Good | Low |
| Rendezvous (HRW) hashing | ~1/N of all keys | Good | Medium |
Rendezvous hashing (highest random weight) offers similar key-movement guarantees without a ring structure. Each key computes a weighted hash against every server and picks the highest score. The downside is O(N) work per lookup instead of O(log V). For systems with hundreds of nodes, that difference matters.
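Rendezvous hashing needs no ring at all; each key simply scores every server. A sketch (server names are illustrative; MD5 is used only as the scoring hash):

```python
import hashlib

def score(key: str, server: str) -> int:
    # Highest-random-weight: hash the (key, server) pair; the highest score wins
    return int.from_bytes(hashlib.md5(f"{key}|{server}".encode()).digest()[:8], "big")

def owner(key: str, servers) -> str:
    return max(servers, key=lambda s: score(key, s))

servers = ["A", "B", "C", "D"]
keys = [f"k{i}" for i in range(20_000)]
before = {k: owner(k, servers) for k in keys}
after = {k: owner(k, servers[:3]) for k in keys}  # server D removed

# Only keys that lived on the removed server change owners
assert all(before[k] == "D" for k in keys if before[k] != after[k])
```

The minimal-movement property falls out of the max: removing a server changes a key's owner only if that server held the key's highest score.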
Common pitfalls
Ignoring hash function quality. A weak hash function creates clusters on the ring. Use a well-distributed function like MurmurHash3, xxHash, or SHA-256 (if you need cryptographic strength). MD5 works but is slower than necessary for non-cryptographic use.
Too few vnodes. Running with 10 vnodes per server on a 5-node cluster gives you 50 ring points. The load variance is still high. Start with at least 100 to 200 vnodes per server.
Not accounting for replication in vnode walks. If your replication code simply takes the next 3 vnodes clockwise, two of those vnodes might map to the same physical server. Always deduplicate by physical server when building preference lists.
Forgetting about data migration costs. Consistent hashing minimizes which keys move, but it does not eliminate movement. Moving 10% of 1 TB is still 100 GB. Plan for background transfer, throttling, and consistency during migration windows.
Hot keys. Consistent hashing distributes keys uniformly, not access patterns. If one key receives 50% of all reads (a celebrity’s profile, a viral post), it will overload whichever node owns it. Solutions include key-level caching tiers, read replicas, or splitting hot keys into sub-keys.
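One common mitigation for hot keys is splitting them into sub-keys so reads fan out across several owners. A sketch, with a plain dict standing in for the distributed store and a made-up fan-out factor:

```python
import random

FANOUT = 8  # hypothetical number of sub-keys per hot key

def write_hot(store: dict, key: str, value) -> None:
    # Duplicate the value under every sub-key; each sub-key hashes
    # to a (likely) different node on the ring
    for i in range(FANOUT):
        store[f"{key}#{i}"] = value

def read_hot(store: dict, key: str):
    # Each reader picks a random sub-key, spreading reads across owners
    return store[f"{key}#{random.randrange(FANOUT)}"]

store = {}
write_hot(store, "celebrity:1", {"name": "Ada"})
profile = read_hot(store, "celebrity:1")
```

The trade is write amplification (FANOUT copies per update) for an up-to-FANOUT-fold spread of read load, so it suits read-heavy hot keys.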
Putting it all together
The consistent hashing recipe for a production system:
- Choose a hash function with good distribution (MurmurHash3 is the most common choice).
- Assign 150 to 256 vnodes per physical server.
- Store the ring as a sorted array of `{vnode_hash, physical_server}` pairs.
- For writes, hash the key, find the responsible server via binary search, and replicate to the next N-1 distinct physical servers clockwise.
- For reads, hash the key and query the responsible server. On failure, try the next server in the preference list.
- When adding a server, insert its vnodes, stream the affected key ranges from successor nodes, and update the ring atomically.
- When removing a server, hand its key ranges to successor nodes, delete its vnodes, and update the ring.
This pattern powers systems handling millions of requests per second across thousands of nodes. The math is simple. The implementation is straightforward. The impact on system availability during scaling events is enormous compared to the naive alternative.
Consistent hashing does one thing and does it well: it turns the violent, system-wide reshuffling of modular hashing into a calm, localized adjustment. That single property makes it indispensable in distributed systems.
What comes next
With data partitioned reliably across nodes, the next challenge is decoupling producers and consumers of work. Message queues provide the asynchronous backbone that lets services communicate without tight coupling, absorb traffic spikes, and retry failed operations gracefully.