Consistent hashing

In this series (20 parts)
  1. What is system design and why it matters
  2. Estimations and back-of-envelope calculations
  3. Scalability: vertical vs horizontal scaling
  4. CAP theorem and distributed system tradeoffs
  5. Consistency models
  6. Load balancing
  7. Caching: strategies and patterns
  8. Content Delivery Networks
  9. Databases: SQL vs NoSQL and when to use each
  10. Database replication
  11. Database sharding and partitioning
  12. Consistent hashing
  13. Message queues and event streaming
  14. API design: REST, GraphQL, gRPC
  15. Rate limiting and throttling
  16. Proxies: forward and reverse
  17. Networking concepts for system design
  18. Reliability patterns: timeouts, retries, circuit breakers
  19. Observability: logging, metrics, tracing
  20. Security in system design

You have five cache servers holding 200 million keys. One server dies. With naive modular hashing, hash(key) % n changes for almost every key when n drops from 5 to 4. That means roughly 80% of your keys now map to the wrong server. Every one of those lookups becomes a cache miss, and the miss storm hammers your database. Consistent hashing exists to prevent exactly this scenario.

Prerequisites

You should understand database sharding and the basic reasons we split data across multiple machines. Familiarity with caching and load balancing helps too, since consistent hashing shows up in both contexts.

The problem with modular hashing

The simplest way to assign keys to servers is modular arithmetic. Hash the key, divide by the number of servers, and use the remainder as the server index.

server = hash(key) % N

This works well when N is constant. The trouble starts the moment N changes. Suppose you have 4 servers and you add a fifth. The table below shows what happens to 8 keys:

| Key hash | hash % 4 | hash % 5 | Moved? |
|----------|----------|----------|--------|
| 17       | 1        | 2        | Yes    |
| 23       | 3        | 3        | No     |
| 44       | 0        | 4        | Yes    |
| 58       | 2        | 3        | Yes    |
| 71       | 3        | 1        | Yes    |
| 86       | 2        | 1        | Yes    |
| 92       | 0        | 2        | Yes    |
| 103      | 3        | 3        | No     |

Six out of eight keys moved. In the general case, growing from N to N + 1 servers remaps roughly N/(N+1) of all keys. For N = 100, that is about 99% of your data moving at once. This is not acceptable in production.
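A quick simulation makes the damage concrete. This is an illustrative sketch: the key names are arbitrary and MD5 merely stands in for any well-distributed hash.

```python
import hashlib

def bucket(key: str, n: int) -> int:
    """Stable hash of a key, reduced modulo the server count."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if bucket(k, 4) != bucket(k, 5))
print(f"{moved / len(keys):.1%} of keys moved")  # roughly 80%
```

A key stays put only when its hash leaves the same remainder mod 4 and mod 5, which happens for just 4 of every 20 hash values — hence the roughly 80% miss storm from the introduction.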

The hash ring

Consistent hashing maps both keys and servers onto the same circular space. Picture the output range of a hash function (say 0 to 2³² − 1) bent into a circle where position 0 meets position 2³² − 1. Each server gets a position on this ring by hashing its identifier (IP address, hostname, some unique label). Each key also gets a position by hashing its value. To find which server owns a key, walk clockwise from the key’s position until you hit the first server.

graph LR
  subgraph Ring["Hash Ring (0 to 2³²)"]
      direction TB
      A["Server A<br/>pos: 500M"]
      B["Server B<br/>pos: 1.5B"]
      C["Server C<br/>pos: 3B"]
  end
  K1["Key k1<br/>pos: 200M"] -->|clockwise| A
  K2["Key k2<br/>pos: 900M"] -->|clockwise| B
  K3["Key k3<br/>pos: 2B"] -->|clockwise| C
  K4["Key k4<br/>pos: 3.5B"] -->|clockwise| A

Three servers on the hash ring. Each key walks clockwise to its nearest server.

When a key hashes to position 200M, it walks clockwise and lands on Server A at position 500M. Key k2 at 900M lands on Server B at 1.5B. Key k4 at 3.5B wraps around past the end of the ring and lands on Server A at 500M.
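The clockwise walk can be sketched directly from the diagram’s (hypothetical) positions:

```python
RING_SIZE = 2**32
# (position, server) pairs from the diagram, kept sorted by position
SERVERS = [(500_000_000, "A"), (1_500_000_000, "B"), (3_000_000_000, "C")]

def owner(key_pos: int) -> str:
    """Return the first server at or clockwise after the key's position."""
    for pos, name in SERVERS:
        if pos >= key_pos:
            return name
    return SERVERS[0][1]  # wrapped past the end of the ring: first server

print(owner(200_000_000))    # A  (k1)
print(owner(900_000_000))    # B  (k2)
print(owner(2_000_000_000))  # C  (k3)
print(owner(3_500_000_000))  # A  (k4 wraps around)
```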

What happens when a server joins

Suppose Server D joins the ring at position 2.5B. Only the keys between Server B (1.5B) and Server D (2.5B) need to move. Those keys previously belonged to Server C. Everything else stays put. If keys are uniformly distributed across the ring, the fraction of keys that move is roughly 1/N, where N is the number of servers after the addition. With 4 servers, about 25% of keys relocate instead of 75%.

What happens when a server leaves

If Server B at position 1.5B crashes, only the keys between Server A (500M) and position 1.5B need a new home. They now belong to Server C (the next clockwise server). Again, roughly 1/N of keys move. The rest of the system is unaffected.

This is the core guarantee: node additions and removals cause O(K/N) key movements, where K is the total number of keys and N is the number of nodes.
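A small simulation illustrates the guarantee. Node and key names are made up, and MD5 again stands in for any well-distributed hash:

```python
import bisect
import hashlib

RING_SIZE = 2**32

def h(s: str) -> int:
    """Hash a string onto the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING_SIZE

def build_ring(nodes):
    return sorted((h(n), n) for n in nodes)

def lookup(ring, key: str) -> str:
    positions = [p for p, _ in ring]
    i = bisect.bisect_left(positions, h(key)) % len(ring)  # wrap at the end
    return ring[i][1]

keys = [f"k{i}" for i in range(50_000)]
before = build_ring([f"node{i}" for i in range(4)])
after = build_ring([f"node{i}" for i in range(5)])  # node4 joins

moved = [k for k in keys if lookup(before, k) != lookup(after, k)]
# Every relocated key moved TO the new node -- nothing else was shuffled.
assert all(lookup(after, k) == "node4" for k in moved)
print(f"{len(moved) / len(keys):.1%} of keys moved")
```

The assertion is the point: a join only claims keys for the joining node, so the rest of the cluster never sees its data reshuffled.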

The imbalance problem

A hash ring with three physical nodes has three positions on the circle. Unless those positions happen to be perfectly equidistant (they almost never are), the arc lengths between nodes vary. A server that owns a larger arc handles more keys. In extreme cases, one server might handle 60% of all traffic while another handles 10%. That defeats the purpose of distributing load.

Consider the numbers. With 3 nodes placed randomly on a ring, the expected maximum load is about 2x the average. With 10 nodes, it improves but remains 1.5x to 1.8x. We need a way to smooth this out without adding physical hardware.

Virtual nodes

The solution is virtual nodes (vnodes). Instead of placing each physical server at one position on the ring, you place it at many positions. If you assign 150 virtual nodes to each physical server, three physical servers produce 450 points on the ring. The arc lengths between adjacent points become much more uniform, which means the load distribution tightens dramatically.

graph TB
  subgraph Without["Without Virtual Nodes"]
      direction LR
      P1["Server A"]
      P2["Server B"]
      P3["Server C"]
  end
  subgraph With["With Virtual Nodes (3 per server)"]
      direction LR
      V1["A-0"] --- V2["B-0"]
      V2 --- V3["C-0"]
      V3 --- V4["A-1"]
      V4 --- V5["B-1"]
      V5 --- V6["C-1"]
      V6 --- V7["A-2"]
      V7 --- V8["B-2"]
      V8 --- V9["C-2"]
  end

Without virtual nodes, three servers create three large arcs. With 3 vnodes each, nine points create smaller, more uniform arcs.

How many virtual nodes?

The number of vnodes per server controls the tradeoff between balance and overhead. More vnodes means better balance but more entries in the ring lookup table and more metadata to manage.

| Vnodes per server | Std dev of load (% of mean) | Ring table entries (100 servers) |
|-------------------|-----------------------------|----------------------------------|
| 1                 | ~100%                       | 100                              |
| 50                | ~14%                        | 5,000                            |
| 150               | ~8%                         | 15,000                           |
| 500               | ~4.5%                       | 50,000                           |

At 150 vnodes per server, the standard deviation drops to about 8% of the mean load. That is tight enough for most production systems. Going beyond 500 vnodes rarely justifies the extra memory. A ring with 50,000 entries is still tiny (a few megabytes at most), but the routing metadata and rebalancing bookkeeping grow with it.
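The trend in the table can be reproduced in spirit with a short experiment; the exact percentages depend on the hash function and key set, so treat this as an illustrative sketch:

```python
import bisect
import hashlib
from collections import Counter
from statistics import mean, pstdev

RING_SIZE = 2**32

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING_SIZE

def load_spread(n_servers: int, vnodes: int, n_keys: int = 20_000) -> float:
    """Std dev of per-server key counts, as a fraction of the mean."""
    ring = sorted((h(f"s{i}#vn{v}"), i)
                  for i in range(n_servers) for v in range(vnodes))
    positions = [p for p, _ in ring]
    counts = Counter()
    for k in range(n_keys):
        i = bisect.bisect_left(positions, h(f"key{k}")) % len(ring)
        counts[ring[i][1]] += 1
    loads = [counts[s] for s in range(n_servers)]
    return pstdev(loads) / mean(loads)

for vn in (1, 50, 150):
    print(f"{vn:>3} vnodes/server -> stddev {load_spread(10, vn):.0%} of mean")
```

With one vnode per server the spread is dramatic; by 150 vnodes it tightens to a few percent of the mean.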

Heterogeneous hardware

Virtual nodes also solve the problem of servers with different capacities. A server with 64 GB of RAM can be assigned 300 vnodes while a server with 32 GB gets 150. The load tracks the vnode count, so stronger machines absorb proportionally more keys without any changes to the hashing algorithm itself.
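A sketch of capacity weighting, assuming a hypothetical fleet and an arbitrary calibration of 150 vnodes per 32 GB of RAM:

```python
# Hypothetical fleet: vnode count scales with RAM, so load tracks capacity.
VNODES_PER_GB = 150 / 32  # calibration choice: 150 vnodes per 32 GB server

fleet = {"big-box": 64, "small-box": 32}  # server -> GB of RAM
vnode_counts = {name: round(gb * VNODES_PER_GB) for name, gb in fleet.items()}
print(vnode_counts)  # {'big-box': 300, 'small-box': 150}
```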

The lookup algorithm

In practice, the ring is implemented as a sorted array or balanced tree of vnode positions. To find the server for a key:

  1. Compute hash(key).
  2. Binary search the sorted ring for the smallest position greater than or equal to the hash.
  3. If no position is greater, wrap around to the first entry (the ring is circular).
  4. Return the physical server associated with that vnode.

This lookup is O(log V), where V is the total number of vnodes. With 15,000 entries, that is about 14 comparisons. Even under heavy load, this adds negligible latency.

When a node is added, you insert its vnodes into the sorted structure and migrate keys from the successor nodes. When a node is removed, you delete its vnodes and hand keys to their new successors. Both operations touch only the affected key ranges.
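The sorted-array approach fits in a few dozen lines. This is a minimal sketch (MD5 for hashing, 150 vnodes per server, no key migration), not a production implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Ring stored as a sorted list of (position, server) pairs."""

    def __init__(self, vnodes: int = 150):
        self.vnodes = vnodes
        self._ring: list[tuple[int, str]] = []

    def _hash(self, s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32

    def add(self, server: str) -> None:
        # One entry per vnode; "server#i" gives each vnode its own position.
        for v in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{v}"), server))

    def remove(self, server: str) -> None:
        self._ring = [(p, s) for p, s in self._ring if s != server]

    def lookup(self, key: str) -> str:
        # Binary search for the first vnode at or after the key's position.
        i = bisect.bisect_left(self._ring, (self._hash(key), ""))
        if i == len(self._ring):
            i = 0  # wrap around: the ring is circular
        return self._ring[i][1]

ring = ConsistentHashRing()
for s in ("cache-a", "cache-b", "cache-c"):
    ring.add(s)
print(ring.lookup("user:1234"))  # deterministic: one of the three servers
```

Removing a server that does not own a given key leaves that key’s placement untouched, which is exactly the locality property the section describes.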

Replication and fault tolerance

Consistent hashing pairs naturally with replication. A common strategy is to walk clockwise from a key’s position and place replicas on the next N distinct physical servers. If the replication factor is 3, the key lives on the first three unique physical servers encountered clockwise. Virtual nodes complicate this slightly because consecutive vnodes might belong to the same physical server, so the walk skips duplicate physical servers.

This gives each key a preference list of servers. If the primary fails, reads and writes shift to the next server in the list. The client or coordinator keeps trying clockwise until it finds a healthy node.
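Building the preference list is just the clockwise walk plus a deduplication step. A sketch, assuming a ring stored as sorted (position, server) pairs with illustrative names:

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32

def preference_list(ring, key: str, n_replicas: int):
    """Walk clockwise from the key, collecting distinct physical servers."""
    positions = [p for p, _ in ring]
    start = bisect.bisect_left(positions, h(key)) % len(ring)
    chosen = []
    for step in range(len(ring)):
        server = ring[(start + step) % len(ring)][1]
        if server not in chosen:  # skip vnodes of already-chosen servers
            chosen.append(server)
            if len(chosen) == n_replicas:
                break
    return chosen

# Three physical servers, four vnodes each (names are illustrative)
ring = sorted((h(f"{s}#{v}"), s) for s in ("A", "B", "C") for v in range(4))
print(preference_list(ring, "order:42", 3))  # three distinct servers
```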

Consistent hashing in the real world

Amazon Dynamo and DynamoDB

Consistent hashing is the core partitioning strategy of the original Dynamo paper (2007), which describes the internal Amazon key-value store that later inspired DynamoDB. Each table’s partition key is hashed, and the hash determines placement on a ring. DynamoDB assigns vnodes to partition ranges and automatically splits or merges partitions as throughput demands change. When a node fails, its key ranges move to the next nodes on the ring. Hinted handoff and anti-entropy (Merkle tree comparison) ensure replicas converge after failures.

The Dynamo paper reported that with 150 vnodes per node and a replication factor of 3, the system achieved load balance within 10% across nodes, even during node additions and removals. This tight balance held at scale with hundreds of nodes.

Apache Cassandra

Cassandra adopted the Dynamo-style ring from the start. Each node is assigned token ranges on a ring spanning −2⁶³ to 2⁶³ − 1. The partitioner (by default Murmur3Partitioner) hashes the partition key to determine placement. Cassandra supports two token assignment strategies:

  1. Single-token assignment. Each node gets one token, and the operator manually distributes tokens for balance. This was the original approach and is error-prone.
  2. Virtual nodes (vnodes). Each node gets num_tokens positions (256 for years; Cassandra 4.0 lowered the default to 16). The database automatically assigns random positions, and rebalancing on node addition or removal happens incrementally.

With vnodes enabled, adding a node to a 10-node Cassandra cluster moves roughly 10% of the data. Without vnodes, the operator would need to manually recalculate and reassign tokens, a process that is both slow and risky.

Memcached client libraries

Memcached itself is a simple key-value store with no built-in distribution. The consistent hashing happens in client libraries like libketama (originally built by Last.fm). The client maintains a ring of memcached server positions, routes each get and set to the correct server, and gracefully handles server additions and removals by only remapping the affected key range. When a memcached node goes down, roughly 1/N of requests miss the cache and fall through to the database, not the catastrophic (N-1)/N miss storm of modular hashing.

Load balancers

Some load balancers use consistent hashing to ensure session affinity. The key is derived from the client IP or a session cookie, and the ring maps it to a backend server. If a backend goes down, only its sessions shift to other backends. Envoy Proxy and HAProxy both support consistent hashing as a load balancing strategy.

Comparing with other partitioning strategies

How does consistent hashing stack up against alternatives?

| Strategy                          | Keys moved on add/remove | Load balance    | Complexity |
|-----------------------------------|--------------------------|-----------------|------------|
| Modular hashing (hash % N)        | ~(N-1)/N of all keys     | Good if static  | Trivial    |
| Range-based partitioning          | Varies (split/merge)     | Can hotspot     | Medium     |
| Consistent hashing (no vnodes)    | ~1/N of all keys         | Poor            | Low        |
| Consistent hashing (with vnodes)  | ~1/N of all keys         | Good            | Low        |
| Rendezvous (HRW) hashing          | ~1/N of all keys         | Good            | Medium     |

Rendezvous hashing (highest random weight) offers similar key-movement guarantees without a ring structure. Each key computes a weighted hash against every server and picks the highest score. The downside is O(N) lookup per key instead of O(log V). For systems with hundreds of nodes, that difference matters.
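A minimal HRW sketch (server names hypothetical) shows why no ring is needed, and also demonstrates the key-movement property: removing any server other than a key’s owner never changes that key’s placement.

```python
import hashlib

def hrw_owner(servers, key: str) -> str:
    """Rendezvous (HRW): every server scores the key; highest score wins."""
    def score(server: str) -> int:
        return int(hashlib.md5(f"{server}|{key}".encode()).hexdigest(), 16)
    return max(servers, key=score)  # O(N) per lookup, no ring to maintain

servers = ["node1", "node2", "node3", "node4"]
owner = hrw_owner(servers, "session:abc")

# Removing a non-owner leaves the maximum score, and thus the owner, intact.
for s in servers:
    if s != owner:
        remaining = [x for x in servers if x != s]
        assert hrw_owner(remaining, "session:abc") == owner
print(owner)
```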

Common pitfalls

Ignoring hash function quality. A weak hash function creates clusters on the ring. Use a well-distributed function like MurmurHash3, xxHash, or SHA-256 (if you need cryptographic strength). MD5 works but is slower than necessary for non-cryptographic use.

Too few vnodes. Running with 10 vnodes per server on a 5-node cluster gives you 50 ring points. The load variance is still high. Start with at least 100 to 200 vnodes per server.

Not accounting for replication in vnode walks. If your replication code simply takes the next 3 vnodes clockwise, two of those vnodes might map to the same physical server. Always deduplicate by physical server when building preference lists.

Forgetting about data migration costs. Consistent hashing minimizes which keys move, but it does not eliminate movement. Moving 10% of 1 TB is still 100 GB. Plan for background transfer, throttling, and consistency during migration windows.

Hot keys. Consistent hashing distributes keys uniformly, not access patterns. If one key receives 50% of all reads (a celebrity’s profile, a viral post), it will overload whichever node owns it. Solutions include key-level caching tiers, read replicas, or splitting hot keys into sub-keys.

Putting it all together

The consistent hashing recipe for a production system:

  1. Choose a hash function with good distribution (MurmurHash3 is the most common choice).
  2. Assign 150 to 256 vnodes per physical server.
  3. Store the ring as a sorted array of {vnode_hash, physical_server} pairs.
  4. For writes, hash the key, find the responsible server via binary search, and replicate to the next N-1 distinct physical servers clockwise.
  5. For reads, hash the key and query the responsible server. On failure, try the next server in the preference list.
  6. When adding a server, insert its vnodes, stream the affected key ranges from successor nodes, and update the ring atomically.
  7. When removing a server, hand its key ranges to successor nodes, delete its vnodes, and update the ring.

This pattern powers systems handling millions of requests per second across thousands of nodes. The math is simple. The implementation is straightforward. The impact on system availability during scaling events is enormous compared to the naive alternative.

Consistent hashing does one thing and does it well: it turns the violent, system-wide reshuffling of modular hashing into a calm, localized adjustment. That single property makes it indispensable in distributed systems.

What comes next

With data partitioned reliably across nodes, the next challenge is decoupling producers and consumers of work. Message queues provide the asynchronous backbone that lets services communicate without tight coupling, absorb traffic spikes, and retry failed operations gracefully.
