Database replication
In this series (20 parts)
- What is system design and why it matters
- Estimations and back-of-envelope calculations
- Scalability: vertical vs horizontal scaling
- CAP theorem and distributed system tradeoffs
- Consistency models
- Load balancing
- Caching: strategies and patterns
- Content Delivery Networks
- Databases: SQL vs NoSQL and when to use each
- Database replication
- Database sharding and partitioning
- Consistent hashing
- Message queues and event streaming
- API design: REST, GraphQL, gRPC
- Rate limiting and throttling
- Proxies: forward and reverse
- Networking concepts for system design
- Reliability patterns: timeouts, retries, circuit breakers
- Observability: logging, metrics, tracing
- Security in system design
Prerequisites: Databases overview.
A single database server can handle a surprising amount of load. A well-tuned PostgreSQL instance on modern hardware can push 50,000 reads per second and sustain 10,000 writes per second. But one server is one point of failure. When it dies, your data is unavailable, and if the disk is gone, the data is gone too. Replication solves both problems by keeping copies of the same data on multiple machines.
This article covers the three major replication topologies, the spectrum between synchronous and asynchronous replication, and the conflict resolution strategies you need when writes can land on more than one node. By the end you will know which model fits which workload and what breaks when the network partitions.
Why replicate?
Three goals drive every replication setup.
Availability. If the primary node crashes, a replica can take over. A service with a single database has an upper bound on uptime equal to that machine’s uptime. Add a synchronous replica and you can fail over in seconds instead of restoring from a backup that might be hours old.
Read throughput. Most web applications are read-heavy. An e-commerce catalog might see a 100:1 read-to-write ratio. Sending reads to replicas lets you scale horizontally for the cheap operation while the leader handles the expensive one.
Durability. A write confirmed on two machines in two data centers survives the total loss of either data center. Without replication, you depend on backups, and backups always have a recovery point gap.
These goals conflict with each other. Synchronous replication maximizes durability but adds latency. Asynchronous replication maximizes throughput but risks data loss. The rest of this article explores the design space between those two extremes.
Leader-follower replication
This is the most common model. One node, the leader, accepts all writes. The other nodes, the followers, receive a stream of changes from the leader and apply them in order. Reads can go to any node.
```mermaid
sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2
    C->>L: INSERT INTO orders (...)
    L->>L: Write to WAL
    L-->>F1: Stream WAL entry
    L-->>F2: Stream WAL entry
    F1->>F1: Apply entry
    F2->>F2: Apply entry
    L->>C: ACK (write confirmed)
```
Leader-follower replication: the client writes to the leader, which streams changes to both followers.
PostgreSQL uses this model with streaming replication. MySQL uses it via binlog shipping. The leader writes to its write-ahead log (WAL), and each follower maintains a persistent connection that tails that log in near real time.
The advantages are clear. There is exactly one source of truth for writes, so there are no write conflicts. Followers can serve reads, which scales read capacity linearly. The model is simple to reason about and simple to monitor.
The downsides are equally clear. The leader is a bottleneck for writes. If the leader fails, you need a failover process to promote a follower, and during that window writes are unavailable. If failover is manual, you might be looking at minutes of downtime. Automated failover using tools like Patroni for PostgreSQL can bring that window down to under 30 seconds, but split-brain scenarios remain a real risk if the old leader comes back before the promotion completes.
Read-after-write consistency
A user submits a form. The write lands on the leader. The user’s next page load hits a follower that has not yet received the change. The user sees stale data and thinks the write was lost. This is the read-after-write problem, and it appears in every leader-follower system with asynchronous replication.
Solutions include routing reads to the leader for a short window after a write (typically a few seconds), tagging reads with a minimum replication position and waiting for the follower to catch up, or simply reading from the leader for data the user just modified. Each solution trades latency for consistency. Pick the cheapest one your application can tolerate. For deeper analysis of these trade-offs, see consistency models.
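The first option, routing a user’s reads to the leader for a short window after they write, can be sketched in a few lines. This is an illustrative Python sketch, not a real library API; the node names and the 5-second window are assumptions:

```python
import time

class ReadRouter:
    """Route a user's reads to the leader for a short window after each write.

    Hypothetical sketch: "leader"/"follower" labels and the window length
    are assumptions chosen for illustration.
    """

    def __init__(self, window_seconds=5.0):
        self.window = window_seconds
        self.last_write = {}  # user_id -> timestamp of most recent write

    def record_write(self, user_id):
        self.last_write[user_id] = time.monotonic()

    def choose_node(self, user_id):
        wrote_at = self.last_write.get(user_id)
        if wrote_at is not None and time.monotonic() - wrote_at < self.window:
            return "leader"   # a follower may still be behind this user's write
        return "follower"     # safe to serve from a replica
```

The window should be set a little above your observed p99 replication lag; too short and users still see stale reads, too long and the leader absorbs read traffic it was supposed to shed.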
Multi-leader replication
In multi-leader replication, two or more nodes accept writes independently and replicate changes to each other. The most common use case is a globally distributed system where you want low-latency writes in every region. If you have data centers in Virginia, Frankfurt, and Tokyo, each one can host a leader and accept local writes.
```mermaid
sequenceDiagram
    participant C1 as Client (US)
    participant L1 as Leader US
    participant L2 as Leader EU
    participant C2 as Client (EU)
    C1->>L1: UPDATE account SET ...
    C2->>L2: UPDATE account SET ...
    L1-->>L2: Replicate US write
    L2-->>L1: Replicate EU write
    L1->>L1: Detect conflict
    L2->>L2: Detect conflict
```
Multi-leader replication: both leaders accept writes and exchange changes asynchronously. Conflicts must be detected and resolved.
The benefit is straightforward. A user in Tokyo writes to a leader 5 ms away instead of 150 ms away. For write-heavy, latency-sensitive applications, that matters.
The cost is conflicts. Two leaders can concurrently modify the same row. When the changes replicate to each other, the system must decide which write wins. This is the central problem of multi-leader replication, and it makes the model significantly harder to operate than leader-follower.
CockroachDB avoids multi-leader in the traditional sense by using consensus per range, but tools like MySQL Group Replication and Galera Cluster operate closer to this model. In practice, most teams only reach for multi-leader when they have a hard requirement for multi-region writes with local latency.
Leaderless replication
In a leaderless system, any node can accept writes. The client sends a write to multiple nodes simultaneously, and the write is considered successful when a quorum of nodes acknowledges it. Reads also go to multiple nodes, and the client takes the most recent value.
The quorum rule is simple. If you have N replicas, require W acknowledgments for writes and R acknowledgments for reads. As long as W + R > N, every read quorum overlaps every write quorum, so at least one node you read from holds the latest write. A common configuration is N=3, W=2, R=2.
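The overlap guarantee is easy to see in a toy model. The sketch below is deliberately minimal, assuming versioned values and ignoring everything real systems add (hinted handoff, read repair, anti-entropy):

```python
class QuorumStore:
    """Toy leaderless store: N replicas, write quorum W, read quorum R.

    Illustrative only. The write lands on the first W replicas and the read
    queries the last R, so with W + R > N the two sets must intersect.
    """

    def __init__(self, n=3, w=2, r=2):
        assert w + r > n, "quorums must overlap to guarantee reading the latest write"
        self.n, self.w, self.r = n, w, r
        self.replicas = [{} for _ in range(n)]  # each maps key -> (version, value)

    def write(self, key, value, version):
        # Confirm once W replicas persist; the rest would catch up lazily.
        for replica in self.replicas[: self.w]:
            replica[key] = (version, value)

    def read(self, key):
        # Query R replicas and take the highest version seen.
        answers = [rep[key] for rep in self.replicas[-self.r :] if key in rep]
        return max(answers)[1] if answers else None
```

With N=3, W=2, R=2 the write set {0, 1} and read set {1, 2} share replica 1, which is why the read always sees the newest version even though replica 2 may be stale.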
Amazon’s Dynamo paper popularized this approach. Cassandra and Riak implement it. The model is appealing because there is no single point of failure for writes. Any node going down does not block operations as long as enough nodes remain to form a quorum.
The trade-off is complexity. Concurrent writes to the same key on different nodes create conflicts, just like multi-leader. You need a resolution strategy. Reads may return stale data if you relax the quorum (for example, R=1 for faster reads), and anti-entropy processes must eventually reconcile divergent replicas. For more on how these trade-offs relate to partition tolerance, see the CAP theorem.
Synchronous vs. asynchronous replication
This is the single most important configuration decision in any replication setup. It cuts across all three topologies.
Synchronous replication means the leader waits for at least one follower to confirm it has persisted the write before acknowledging the client. The write is durable on two machines before the client gets a success response. The cost is latency. If the follower is in the same data center, you add roughly 1 to 2 ms. If it is cross-region, you add 50 to 150 ms per write. If the synchronous follower goes down, writes block entirely until it recovers or the system reconfigures.
Asynchronous replication means the leader acknowledges the client as soon as it writes locally. Followers receive the change eventually. Writes are fast, but if the leader crashes before the change ships, the data is lost. In a system processing 10,000 writes per second with a 500 ms replication lag, a leader crash can lose up to 5,000 transactions.
Semi-synchronous replication is the practical middle ground. The leader waits for one follower (out of potentially many) to confirm. If that follower is slow, the system promotes a different follower to be the synchronous target. PostgreSQL supports this with synchronous_standby_names, and MySQL has semi-sync replication built in. You get the durability guarantee of at least two copies without the fragility of requiring a specific node to be healthy.
| Mode | Write latency | Data loss on leader crash | Write availability |
|---|---|---|---|
| Synchronous | +1 to 150 ms | None | Blocked if sync replica down |
| Semi-synchronous | +1 to 150 ms | None (with healthy replica) | Briefly delayed on failover |
| Asynchronous | Baseline | Up to full replication lag | Unaffected |
Most production systems use semi-synchronous for the primary replica and asynchronous for the rest.
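In PostgreSQL, that setup can be expressed with quorum-based synchronous commit (available since PostgreSQL 10). The standby names below are placeholders for illustration:

```ini
# postgresql.conf on the leader: commit waits for ANY 1 of the listed
# standbys to confirm; the others replicate asynchronously.
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (replica_a, replica_b, replica_c)'
```

If the currently-acknowledging standby slows down or disappears, any other listed standby can satisfy the quorum, which is exactly the fragility the semi-synchronous mode is designed to avoid.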
Replication lag
Replication lag is the delay between a write being committed on the leader and that write being visible on a follower. It is the central operational challenge of asynchronous replication. A healthy system might have lag under 10 ms. A system under heavy write load, or one where a follower has slow disks, can see lag climb to seconds or even minutes.
Lag creates three observable anomalies:
Stale reads. A client reads from a follower and gets data that is behind the leader. For many applications this is acceptable. For some, like showing a user their own recent actions, it is not.
Monotonic read violations. A client makes two reads that hit different followers. The second follower is further behind than the first, so the client sees data go backward in time. The fix is session stickiness: pin a client’s reads to a single follower for the duration of a session.
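Session stickiness is usually implemented by hashing a session identifier to a follower, so the same session always lands on the same node. A minimal sketch, assuming a static follower list (real routers also handle follower failure, which forces a remap and may allow one backward jump):

```python
import hashlib

def pick_follower(session_id, followers):
    """Pin a session's reads to one follower so successive reads never
    move backward in replication time. Follower names are illustrative.
    """
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return followers[int(digest, 16) % len(followers)]
```

Because the choice is a pure function of the session ID, every request in a session reads from the same replica without any shared routing state.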
Causal ordering violations. A client writes A, then writes B that depends on A. A follower receives B before A. Any consumer of that follower sees an impossible state. This matters for things like inserting a row and then inserting a foreign key reference to that row. The follower sees the reference before the row exists. Leader-follower replication preserves ordering on a single replication stream, but multi-leader and leaderless systems need explicit causal ordering mechanisms like vector clocks or hybrid logical clocks to avoid this problem.
Monitoring replication lag is non-negotiable. In PostgreSQL, query pg_stat_replication on the leader to see the byte offset difference between the leader’s WAL position and each follower’s replay position. In MySQL, check Seconds_Behind_Master (renamed Seconds_Behind_Source as of MySQL 8.0.22) on each replica. Alert when lag exceeds your application’s tolerance threshold, typically a few seconds for most workloads.
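On PostgreSQL 10 or later, the per-follower byte lag can be read with a query along these lines, run on the leader:

```sql
-- Replication lag in bytes for each connected follower.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```

Feed the result into your metrics system and alert on it; a lag graph that climbs steadily under normal load is an early warning that a follower’s disks or network cannot keep up.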
Conflict resolution
Conflicts arise whenever two nodes accept concurrent writes to the same data. This happens in multi-leader and leaderless systems. Leader-follower avoids it by design, which is one of its strongest selling points.
Last-write-wins (LWW). Each write carries a timestamp. The write with the highest timestamp wins. This is simple and deterministic, but it silently discards data. If two users edit the same document concurrently, one edit vanishes. Cassandra uses LWW by default. It works well for data where overwrites are semantically correct, like updating a user’s last-seen timestamp. It works poorly for anything where merging matters.
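The whole LWW rule fits in one line, which is both its appeal and its danger. A minimal sketch, assuming values are carried as (timestamp, payload) pairs (the tuple format and sample writes are illustrative):

```python
def lww_merge(local, remote):
    """Last-write-wins: keep whichever (timestamp, payload) pair has the
    higher timestamp. Ties fall through to payload comparison so every
    node resolves the conflict identically."""
    return max(local, remote)

# Two leaders update the same order concurrently; the later timestamp
# wins and the earlier edit is silently discarded.
us_write = (1700000000.120, "shipped")
eu_write = (1700000000.450, "cancelled")
```

Note what happened to the US write: it is gone, with no error and no trace. That silent discard is acceptable for a last-seen timestamp and catastrophic for a collaborative document.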
Version vectors. Each node maintains a vector of counters, one per node. When a write arrives, the node increments its own counter. On replication, the system compares vectors to detect concurrent writes. If neither vector dominates the other, the writes are concurrent and the system can flag a conflict for application-level resolution. Riak uses this approach. It preserves all concurrent writes as siblings and lets the application decide how to merge them.
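The comparison at the heart of this scheme is checking whether one vector dominates the other. A sketch of that check, with dicts of node-name to counter standing in for the per-node vectors:

```python
def compare(v1, v2):
    """Compare two version vectors (dicts of node -> counter).

    Returns "descends" (v1 saw everything v2 did and more), "precedes",
    "equal", or "concurrent". Concurrent means neither replica saw the
    other's write: a genuine conflict to surface as siblings.
    """
    nodes = set(v1) | set(v2)
    v1_ahead = any(v1.get(n, 0) > v2.get(n, 0) for n in nodes)
    v2_ahead = any(v2.get(n, 0) > v1.get(n, 0) for n in nodes)
    if v1_ahead and v2_ahead:
        return "concurrent"   # flag for application-level resolution
    if v1_ahead:
        return "descends"
    if v2_ahead:
        return "precedes"
    return "equal"
```

Unlike a timestamp, this comparison can say "neither happened first," which is precisely the information LWW throws away.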
CRDTs (Conflict-free Replicated Data Types). These are data structures designed so that concurrent updates always converge to the same state without coordination. A G-Counter (grow-only counter) lets each node increment independently; the merged value is the sum of all node-local counters. A G-Set (grow-only set) merges via union. More complex CRDTs exist for registers, maps, and sequences. The trade-off is that not every data model fits a CRDT, and the ones that do can have high storage overhead. Redis uses CRDTs in its active-active geo-replication feature.
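A G-Counter is small enough to show in full. Each node only ever increments its own slot, and merge takes the per-node maximum, so merging is commutative, associative, and idempotent. A minimal sketch:

```python
class GCounter:
    """Grow-only counter CRDT.

    Each node increments only its own slot; merge takes the per-node
    maximum, so replicas converge no matter the order or number of merges.
    """

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> that node's local increment total

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())
```

Two data centers can increment independently, exchange state in either order, even repeatedly, and both arrive at the same total, which is what "conflict-free" means in practice.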
Application-level resolution. The database stores all conflicting versions and surfaces them to the application. The application contains domain-specific logic to merge. For example, a shopping cart might union the items from both versions. This is the most flexible approach but pushes complexity into application code.
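The shopping-cart merge mentioned above might look like the following sketch. The domain rule here is an assumption for illustration: union the items, keeping the highest quantity seen for each. Concurrent adds are never lost, but a concurrently removed item can resurface, which is the classic weakness of this rule (and what Dynamo's cart famously exhibited):

```python
def merge_carts(versions):
    """Merge conflicting shopping-cart versions (item -> quantity dicts)
    by union, keeping the highest quantity seen per item. Illustrative
    domain logic, not a general-purpose resolver."""
    merged = {}
    for cart in versions:
        for item, qty in cart.items():
            merged[item] = max(merged.get(item, 0), qty)
    return merged
```

The database's only job is to keep both versions and hand them over; the merge policy lives entirely in application code, where the domain knowledge is.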
In practice, most teams avoid conflicts entirely by sticking with leader-follower replication and only reaching for multi-leader or leaderless when the latency or availability requirements demand it. When they do, LWW is the default because it requires zero application changes, and teams upgrade to version vectors or CRDTs only when data loss from LWW becomes unacceptable.
Choosing a replication model
The choice depends on your workload, your latency requirements, and how much operational complexity you are willing to absorb.
Leader-follower is the right default. Use it when your write volume fits on a single node (most applications), you want simple operations, and you can tolerate a brief write-unavailability window during failover. This covers the vast majority of production systems. If you need to scale beyond what a single leader can handle, look at database sharding before you look at multi-leader.
Multi-leader is for multi-region deployments where you need low-latency writes in every region. Accept the conflict resolution burden only when the latency savings justify it. Expect to invest significantly more in testing and monitoring.
Leaderless is for workloads that need high write availability and can tolerate eventual consistency. Time-series data, sensor ingestion, and append-heavy analytics workloads are good fits. If your data model requires strong consistency, leaderless is the wrong choice unless you are prepared to layer consensus on top, at which point you are building something closer to leader-follower anyway.
| Requirement | Recommended model |
|---|---|
| Simple operations, single-region | Leader-follower |
| Multi-region, low-latency writes | Multi-leader |
| High write availability, eventual consistency OK | Leaderless |
| Strong consistency required | Leader-follower with synchronous replication |
What comes next
Replication gives you copies of the same data on multiple nodes. But it does not help when the data itself is too large for a single node, or when write throughput exceeds what one leader can handle. That is the problem database sharding solves: splitting data across nodes so each node owns a subset. Sharding and replication are complementary. Most large-scale systems use both.