Design a key-value store

In this series (18 parts)
  1. Design a URL shortener
  2. Design a key-value store
  3. Design a rate limiter
  4. Design a web crawler
  5. Design a notification system
  6. Design a news feed
  7. Design a chat application
  8. Design a video streaming platform
  9. Design a music streaming service
  10. Design a ride-sharing service
  11. Design a food delivery platform
  12. Design a hotel booking platform
  13. Design a search engine
  14. Design a distributed message queue
  15. Design a code deployment system
  16. Design a payments platform
  17. Design an ad click aggregation system
  18. Design a distributed cache

A key-value store is a dictionary that lives on many machines. You give it a key, it gives you back a blob of data. The interface is simple: put(key, value) and get(key). The engineering challenge is making that interface work at scale, across data centers, with predictable latency even when nodes fail. This case study walks through the design of a distributed key-value store similar to Amazon’s DynamoDB or Apache Cassandra.

Prerequisites

You should be comfortable with consistent hashing and database replication. Both concepts appear throughout this design. Familiarity with the CAP theorem will help you understand the trade-offs we make around consistency and availability.

Requirements

Functional requirements

  1. put(key, value): store a value under a given key.
  2. get(key): retrieve the value associated with a key.
  3. Keys are strings up to 256 bytes. Values are blobs up to 1 MB.
  4. Automatic data partitioning across nodes.
  5. Tunable consistency: the caller chooses how many replicas must acknowledge a read or write.

Non-functional requirements

  1. High availability: the store must remain writable even during node failures.
  2. Low latency: p99 read and write latency under 10 ms.
  3. Scalability: support 10 million DAU, 100,000 reads/s and 50,000 writes/s at peak.
  4. Durability: no data loss after a write is acknowledged.

Capacity estimation

Start with the numbers. 10 million DAU, each making roughly 10 reads and 5 writes per day. That gives us:

| Metric | Estimate |
| --- | --- |
| Daily reads | 100 million |
| Daily writes | 50 million |
| Average read QPS | 100M / 86,400 s ≈ 1,200 |
| Peak read QPS | ~100,000 (per the requirements; traffic is highly bursty) |
| Peak write QPS | ~50,000 |
| Average value size | 10 KB |
| Daily new data | 50M x 10 KB = 500 GB |
| Storage (1 year, 3 replicas) | 500 GB x 365 x 3 ≈ 550 TB |
| Bandwidth (write, peak) | 50,000 x 10 KB = 500 MB/s |

550 TB is large but manageable across a cluster of commodity machines with SSDs. At 4 TB of data per node, that is roughly 140 nodes; note that the 3x replication factor is already baked into the 550 TB figure.
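The arithmetic above is easy to check with a short script. This is just the back-of-envelope math restated in code; the constants come from the table:

```python
# Back-of-envelope check of the storage and node-count estimates above.
daily_writes = 50_000_000        # 10M DAU x 5 writes/day
value_size_kb = 10
replicas = 3
retention_days = 365
node_capacity_tb = 4

daily_gb = daily_writes * value_size_kb / 1e6          # KB -> GB
storage_tb = daily_gb * retention_days * replicas / 1e3
nodes = storage_tb / node_capacity_tb

print(f"daily new data: {daily_gb:.0f} GB")            # 500 GB
print(f"storage (1y, 3 replicas): {storage_tb:.1f} TB")  # 547.5 TB, ~550 TB
print(f"nodes at 4 TB each: {nodes:.0f}")              # ~137, call it 140
```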

High-level architecture

The system has four main layers: a client library that routes requests, a set of storage nodes organized in a hash ring, a replication protocol, and a background compaction engine.

graph TB
  Client["Client Library"]
  Coord["Coordinator Node"]
  subgraph Ring["Hash Ring"]
      N1["Node A"]
      N2["Node B"]
      N3["Node C"]
      N4["Node D"]
      N5["Node E"]
  end
  Client -->|"put(k,v) / get(k)"| Coord
  Coord --> N1
  Coord --> N2
  Coord --> N3
  N1 ---|"replication"| N2
  N2 ---|"replication"| N3
  N3 ---|"replication"| N4

High-level architecture: the client talks to a coordinator, which routes requests to the correct nodes on the hash ring.

The client hashes the key to find the coordinator node. The coordinator forwards the request to the N nodes responsible for that key and waits for W (write) or R (read) acknowledgments before responding. This is the quorum mechanism.

Deep dive 1: Partitioning with consistent hashing

Each node is assigned multiple positions (virtual nodes) on a hash ring. When a key arrives, we hash it to a point on the ring and walk clockwise to find the first N distinct physical nodes. Those N nodes are the preference list for that key.

Consistent hashing gives us two properties we need. First, adding or removing a node only moves a small fraction of the keys, roughly 1/n of the total for a cluster of n nodes (lowercase n here is the node count, distinct from the replication factor N). Second, virtual nodes let us balance load even when physical machines have different capacities. A beefy machine gets more virtual nodes; a smaller one gets fewer.

graph LR
  subgraph HashRing["Hash Ring"]
      direction TB
      VA1["A (vnode 1)"]
      VB1["B (vnode 1)"]
      VA2["A (vnode 2)"]
      VC1["C (vnode 1)"]
      VB2["B (vnode 2)"]
      VC2["C (vnode 2)"]
  end
  K["Key k7"] -->|"hash = 0x3F2A"| VB1
  VB1 -->|"replica 1"| VC1
  VC1 -->|"replica 2"| VA2

Virtual nodes on the hash ring. Key k7 hashes to Node B’s first virtual node. Replicas land on the next distinct physical nodes clockwise.

A typical production setup uses 150 to 256 virtual nodes per physical node. This gives a standard deviation of under 5% in key distribution across nodes.
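The ring walk described above fits in a few dozen lines. This is a minimal sketch, not a production implementation: node names, the vnode count, and the MD5 hash choice are illustrative.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes; all names are illustrative."""

    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (hash, physical_node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        # Each physical node claims `vnodes` positions on the ring.
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{node}#vnode{i}"), node))
        self.ring.sort()

    def preference_list(self, key: str, n: int = 3):
        """Walk clockwise from the key's hash, collecting the first
        n distinct physical nodes."""
        start = bisect.bisect(self.ring, (self._hash(key),))
        nodes, i = [], start
        while len(nodes) < n and i < start + len(self.ring):
            node = self.ring[i % len(self.ring)][1]
            if node not in nodes:
                nodes.append(node)
            i += 1
        return nodes

ring = ConsistentHashRing()
for node in ["A", "B", "C", "D", "E"]:
    ring.add_node(node)
print(ring.preference_list("k7", n=3))  # three distinct physical nodes
```

Because vnode positions are derived from hashes, adding a sixth node would claim 150 scattered positions and steal only a small, roughly even slice of keys from each existing node.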

Deep dive 2: Replication and quorum consistency

Every key is stored on N replicas (typically N = 3). The coordinator writes the key to all N nodes in parallel, but only waits for W acknowledgments before telling the client the write succeeded. Similarly, a read goes to all N nodes but returns after R replies.

The quorum condition is: W + R > N. If N = 3, W = 2, R = 2, then any read is guaranteed to see at least one replica that has the latest write. This is the sweet spot for most workloads.

sequenceDiagram
  participant C as Client
  participant Co as Coordinator
  participant N1 as Node 1
  participant N2 as Node 2
  participant N3 as Node 3
  C->>Co: put(key, value)
  Co->>N1: write(key, value, ts)
  Co->>N2: write(key, value, ts)
  Co->>N3: write(key, value, ts)
  N1-->>Co: ACK
  N2-->>Co: ACK
  Co->>C: success (W=2 met)
  Note over Co,N3: N3 ACK arrives later
  C->>Co: get(key)
  Co->>N1: read(key)
  Co->>N2: read(key)
  Co->>N3: read(key)
  N1-->>Co: value (ts=100)
  N3-->>Co: value (ts=99)
  Co->>C: return value (ts=100, R=2 met)

Write and read quorum flow with N=3, W=2, R=2. The coordinator waits for two acknowledgments on writes and picks the freshest value on reads.
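The coordinator logic in the diagram can be sketched as follows. Replica "nodes" here are plain dicts standing in for network calls, and everything runs synchronously; a real coordinator issues the writes in parallel and responds as soon as W ACKs arrive.

```python
import time

# Toy quorum coordinator; all names are illustrative.
class QuorumCoordinator:
    def __init__(self, replicas, n=3, w=2, r=2):
        assert w + r > n, "quorum condition: W + R > N"
        self.replicas = replicas[:n]   # each replica: key -> (timestamp, value)
        self.w, self.r = w, r

    def put(self, key, value):
        ts = time.time_ns()            # coordinator-assigned version
        acks = 0
        for node in self.replicas:     # real systems: parallel, return at W ACKs
            node[key] = (ts, value)
            acks += 1
        return acks >= self.w

    def get(self, key):
        replies = [node[key] for node in self.replicas if key in node]
        if len(replies) < self.r:
            raise TimeoutError("read quorum not met")
        return max(replies)[1]         # freshest timestamp wins

nodes = [{}, {}, {}]
store = QuorumCoordinator(nodes, n=3, w=2, r=2)
store.put("user:42", "alice")
print(store.get("user:42"))   # alice
```

The `max(replies)[1]` line is the read-repair opportunity: if replicas disagree, the coordinator knows which ones are stale and can push the freshest value back to them.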

Different consistency levels let callers trade latency for safety:

| Setting | W | R | Behavior |
| --- | --- | --- | --- |
| Strong | 3 | 1 | Every write confirmed by all replicas. Reads from any single node are consistent. |
| Quorum | 2 | 2 | Overlap guarantees the latest value is seen. Good default. |
| Fast write | 1 | 3 | Writes are fast, but reads must check all replicas. |
| Eventual | 1 | 1 | Fastest, but stale reads are possible. |

When W + R <= N, you lose the overlap guarantee. Reads might miss the latest write. This is fine for use cases like session caches where staleness is acceptable, but dangerous for financial data.

Conflict resolution

With W < N, two coordinators can accept conflicting writes for the same key. The store handles this with vector clocks or last-writer-wins (LWW) timestamps. Amazon's original Dynamo design uses vector clocks and pushes conflict resolution to the application. Cassandra uses LWW, which is simpler but silently drops one of the conflicting values.
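Both strategies are easy to illustrate. The sketch below shows LWW as a timestamp comparison and vector clocks as a per-node counter comparison; the node names and return labels are illustrative:

```python
# Sketch of the two conflict-resolution strategies described above.

def lww_merge(a, b):
    """(timestamp, value) pairs: the newer timestamp wins, the loser
    is silently dropped, as in Cassandra's LWW."""
    return max(a, b)

def vc_compare(vc_a, vc_b):
    """Compare two vector clocks (dicts mapping node -> counter).
    Returns 'a_newer', 'b_newer', 'equal', or 'concurrent'."""
    nodes = set(vc_a) | set(vc_b)
    a_ge = all(vc_a.get(n, 0) >= vc_b.get(n, 0) for n in nodes)
    b_ge = all(vc_b.get(n, 0) >= vc_a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a_newer"
    if b_ge:
        return "b_newer"
    return "concurrent"   # true conflict: surface both versions to the app

# Two coordinators accepted writes for the same key independently:
print(vc_compare({"A": 2, "B": 1}, {"A": 1, "B": 2}))  # concurrent
print(vc_compare({"A": 2, "B": 1}, {"A": 1, "B": 1}))  # a_newer
```

The "concurrent" case is exactly what LWW cannot express: a timestamp comparison always declares a winner, even when neither write saw the other.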

Hinted handoff

When a replica is down, the coordinator writes to a temporary stand-in node and attaches a “hint” indicating the intended destination. When the original node recovers, the stand-in forwards the data. This keeps the system writable even when nodes are temporarily unavailable.
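A toy version of the mechanism makes the bookkeeping concrete. Here node C is down, D acts as the stand-in, and the hint list records what must be forwarded on recovery; all names are illustrative:

```python
# Toy hinted handoff: node C is down, so its write goes to stand-in D
# along with a hint naming the intended destination.
nodes = {"A": {}, "B": {}, "D": {}}     # C is absent (down)
hints = []                              # (stand_in, intended_node, key, value)

def put(key, value, preference=("A", "B", "C"), stand_in="D"):
    for target in preference:
        if target in nodes:
            nodes[target][key] = value
        else:                           # target is down: write to the stand-in
            nodes[stand_in][key] = value
            hints.append((stand_in, target, key, value))

def on_recover(name):
    """When the node comes back, stand-ins forward the hinted data
    and drop their temporary copies."""
    nodes[name] = {}
    for stand_in, target, key, value in [h for h in hints if h[1] == name]:
        nodes[name][key] = nodes[stand_in].pop(key)
        hints.remove((stand_in, target, key, value))

put("k7", "v1")
on_recover("C")
print(sorted(nodes["C"].items()))   # [('k7', 'v1')]
```

The write succeeded with the full replica count even though C was down, which is the availability property the section describes.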

Deep dive 3: Storage engine and compaction

Each node stores data using a Log-Structured Merge-tree (LSM-tree). Writes go to an in-memory table (memtable). When the memtable reaches a threshold (typically 64 MB), it flushes to disk as a sorted immutable file called an SSTable.

Over time, SSTables accumulate. Reads must check the memtable and potentially many SSTables. Compaction merges overlapping SSTables, discards deleted keys, and produces larger sorted files. This keeps read amplification under control.

The compaction strategy matters. Size-tiered compaction groups SSTables of similar size and merges them. It is write-friendly but can temporarily double disk usage during a merge. Leveled compaction organizes SSTables into levels where each level is 10x larger than the previous. It uses less disk space and provides more predictable read latency, but generates more write I/O.

Bloom filters

Before checking an SSTable for a key, the node consults a Bloom filter. A Bloom filter is a probabilistic data structure that can tell you “definitely not here” or “maybe here.” With a false positive rate of 1%, a Bloom filter for 1 million keys uses about 1.2 MB of memory. This eliminates most unnecessary disk reads and is critical for keeping read latency low.
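A minimal Bloom filter is short enough to show in full. This sketch derives k hash positions from SHA-256 digests, which is a common teaching approach rather than what any particular storage engine does; the sizing formulas in the docstring are the standard ones.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter. For n keys and false-positive rate p, the
    optimal size is m = -n*ln(p)/ln(2)^2 bits with k = (m/n)*ln(2)
    hash functions."""

    def __init__(self, n, p=0.01):
        self.m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str):
        # Derive k positions from salted SHA-256 digests (illustrative).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def maybe_contains(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter(n=1_000_000, p=0.01)
print(f"{bf.m / 8 / 1e6:.2f} MB")        # ~1.2 MB, matching the estimate above
bf.add("user:42")
print(bf.maybe_contains("user:42"))      # True (no false negatives, ever)
print(bf.maybe_contains("user:43"))      # False with ~99% probability
```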

A read path looks like this:

  1. Check the memtable. If the key is there, return it.
  2. For each SSTable (newest first), check the Bloom filter.
  3. If the Bloom filter says “maybe,” do a binary search in the SSTable’s index.
  4. Return the first match found (newest SSTable wins).
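The four steps above can be sketched directly. For brevity this toy uses a Python set as the "Bloom filter" (behaving like one with a 0% false-positive rate) and a dict lookup in place of the binary search over the SSTable index:

```python
# Sketch of the read path: memtable first, then SSTables newest-first,
# each guarded by its per-table filter. Keys and values are illustrative.
memtable = {"k9": "v9-new"}
sstables = [  # newest first; each entry: (filter, key -> value data)
    ({"k7", "k9"}, {"k7": "v7", "k9": "v9-old"}),
    ({"k1", "k7"}, {"k1": "v1", "k7": "v7-older"}),
]

def get(key):
    if key in memtable:              # 1. check the memtable
        return memtable[key]
    for bloom, data in sstables:     # 2. newest SSTable first
        if key not in bloom:         #    filter says "definitely not here"
            continue
        if key in data:              # 3. real engines binary-search an index
            return data[key]         # 4. first (newest) match wins
    return None

print(get("k9"))   # v9-new  (memtable shadows the older SSTable version)
print(get("k7"))   # v7      (newest SSTable wins over the older one)
print(get("kX"))   # None
```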

Trade-offs and alternative approaches

AP vs. CP

This design leans AP on the CAP theorem spectrum. We favor availability and partition tolerance over strong consistency. The quorum mechanism gives us tunable consistency, but the default setting (W=2, R=2, N=3) allows brief windows of inconsistency during network partitions.

A CP key-value store like etcd or ZooKeeper uses consensus protocols (Raft, ZAB) to ensure every read sees the latest write. The trade-off is higher write latency (every write needs a majority of nodes to agree) and unavailability when a majority of nodes are down.

LSM-tree vs. B-tree

LSM-trees favor write throughput. All writes are sequential appends, which is ideal for SSDs and spinning disks alike. The cost is read amplification: a lookup might touch multiple SSTables. B-trees (used by PostgreSQL, MySQL InnoDB) provide faster point reads but slower writes because every write modifies a page in place.

For a key-value store handling 50,000 writes/s, the LSM-tree is the right call. Compaction and Bloom filters keep read latency manageable.

Consistency models

Different applications need different consistency models. A shopping cart can tolerate eventual consistency because merging two versions of a cart is straightforward. A payment ledger cannot. The tunable quorum lets one cluster serve both workloads by adjusting W and R per request.

What real systems actually do

Amazon's original Dynamo design (the paper that inspired DynamoDB) uses consistent hashing with virtual nodes, quorum-based replication, and vector clocks for conflict detection, pushing resolution to the application. DynamoDB, the managed service, defaults to eventually consistent reads and offers strong consistency as an option at the cost of higher read latency; it gives applications control over concurrent updates via conditional writes.

Cassandra uses a nearly identical architecture: consistent hashing, tunable quorum, LSM-trees with both size-tiered and leveled compaction. It differs by defaulting to last-writer-wins for conflict resolution and by supporting a richer data model (wide rows, CQL) on top of the key-value foundation.

Riak took the original Dynamo paper the furthest, supporting vector clocks, CRDTs for automatic conflict resolution, and a pluggable storage backend. It proved that you could build a fully decentralized (no leader) key-value store, but the operational complexity was high.

Redis Cluster takes a different approach. It uses hash slots (16,384 fixed slots) instead of a hash ring, assigns slots to nodes manually or via rebalancing, and supports atomic multi-key operations only when all keys map to the same slot. Because replication is asynchronous, acknowledged writes can be lost during failover, so it does not guarantee strong consistency. It is an in-memory store, so durability depends on RDB snapshots and AOF logs.
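Slot assignment in Redis Cluster is a simple formula: the slot is CRC16 of the key, modulo 16,384. A sketch (ignoring Redis's `{hash tag}` rule, which hashes only the tagged substring when present):

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM variant): polynomial 0x1021, initial value 0.
    This is the CRC Redis Cluster uses for key distribution."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    return crc16_xmodem(key.encode()) % 16384

# The Redis Cluster spec's own worked example: "123456789" -> slot 12739.
print(hash_slot("123456789"))   # 12739
```

Fixed slots trade the smooth rebalancing of a hash ring for simpler bookkeeping: moving load between nodes means reassigning whole slots, not rehashing keys.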

What comes next

This design gives you a working distributed key-value store, but several areas deserve deeper exploration:

  1. Failure detection: how nodes detect that a peer is down. Gossip protocols and phi-accrual failure detectors are the standard approaches.
  2. Anti-entropy: Merkle trees allow two replicas to efficiently find and repair differences in their data.
  3. Multi-datacenter replication: extending the quorum model across regions introduces new latency and consistency trade-offs.
  4. Hot key mitigation: when a single key receives disproportionate traffic, the coordinator becomes a bottleneck. Caching layers and request coalescing help.
  5. Range queries: a pure key-value store only supports point lookups. Adding range queries requires an ordered partitioning scheme instead of hash-based partitioning.

Each of these is a design problem on its own. The foundation you have here (consistent hashing, quorum replication, LSM-tree storage) is the starting point for all of them.
