CAP theorem and distributed system tradeoffs

In this series (20 parts)
  1. What is system design and why it matters
  2. Estimations and back-of-envelope calculations
  3. Scalability: vertical vs horizontal scaling
  4. CAP theorem and distributed system tradeoffs
  5. Consistency models
  6. Load balancing
  7. Caching: strategies and patterns
  8. Content Delivery Networks
  9. Databases: SQL vs NoSQL and when to use each
  10. Database replication
  11. Database sharding and partitioning
  12. Consistent hashing
  13. Message queues and event streaming
  14. API design: REST, GraphQL, gRPC
  15. Rate limiting and throttling
  16. Proxies: forward and reverse
  17. Networking concepts for system design
  18. Reliability patterns: timeouts, retries, circuit breakers
  19. Observability: logging, metrics, tracing
  20. Security in system design

Every distributed system makes tradeoffs. You can pretend otherwise for a while, usually until an undersea cable gets cut or a data center loses power, but eventually the physics of networks forces your hand. The CAP theorem gives you a framework for reasoning about those tradeoffs before they show up as a 3 AM page.

Eric Brewer first presented the conjecture in 2000. Seth Gilbert and Nancy Lynch proved it formally in 2002. The result is simple to state and surprisingly hard to internalize. This article breaks it down, walks through real systems, and explains what it means for the design decisions you will actually face.

Prerequisites

You should be comfortable with horizontal and vertical scaling, including why we replicate data across multiple nodes in the first place. Familiarity with basic database replication helps but is not strictly required.

The three guarantees

The CAP theorem concerns three properties of a distributed data store.

Consistency means every read receives the most recent write or an error. If client A writes balance = 500 to node 1 and client B then immediately reads from node 2, client B must see 500. Not 400. Not stale data. This is linearizability, the strongest form of consistency. Weaker models exist and matter in practice, but CAP is about this one.

Availability means every request to a non-failing node receives a response, without guarantee that it contains the most recent write. The system never refuses to answer. If a node is up, it responds. Timeouts and errors do not count as “available.”

Partition tolerance means the system continues to operate despite arbitrary message loss or delay between nodes. Network partitions are not hypothetical. Google reported in a 2011 study that their wide-area network experienced partitions roughly 1.5 times per day across their data centers. Partitions happen. The question is what your system does when they happen.

Why you can only pick two

Here is the core insight. Suppose you have two nodes, N1 and N2, holding the same piece of data. A network partition cuts communication between them. A write arrives at N1. Now a read arrives at N2. The system has exactly three options:

  1. Refuse the read at N2. This preserves consistency (N2 does not return stale data) but sacrifices availability (N2 is up but not responding to reads). This is CP.

  2. Let N2 return its local (stale) copy. This preserves availability (N2 responds) but sacrifices consistency (the response is outdated). This is AP.

  3. Prevent the partition from happening. This is only possible if all nodes are on the same network segment with guaranteed connectivity. This is CA, but it is not partition tolerant, which means it is not really distributed.

There is no fourth option. During a partition, you pick consistency or availability. You cannot have both. This is not a limitation of current technology. It is a mathematical impossibility proven by Gilbert and Lynch.
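
The two realistic options can be sketched in a few lines. This is a minimal toy model, not a real replication protocol: the `Node` and `Cluster` classes and the `partitioned` flag are illustrative names, and the CP branch here refuses every read during a partition, which is cruder than what real quorum systems do.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

class Cluster:
    def __init__(self, mode):
        self.mode = mode                      # "CP" or "AP"
        self.n1, self.n2 = Node("N1"), Node("N2")
        self.partitioned = False

    def write(self, node, key, value):
        node.data[key] = value
        if not self.partitioned:              # replication only works while connected
            other = self.n2 if node is self.n1 else self.n1
            other.data[key] = value

    def read(self, node, key):
        if self.partitioned and self.mode == "CP":
            # CP: refuse rather than risk returning stale data
            raise RuntimeError("unavailable: cannot confirm latest write")
        return node.data.get(key)             # AP: respond, possibly stale

cluster = Cluster("AP")
cluster.write(cluster.n1, "balance", 500)     # replicated to both nodes
cluster.partitioned = True
cluster.write(cluster.n1, "balance", 400)     # only N1 sees this write
print(cluster.read(cluster.n2, "balance"))    # 500 -- stale, but available
```

Switching the mode to `"CP"` makes the same read raise an error instead: the node is up, but it refuses to answer, which is exactly the availability sacrifice described above.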

CAP theorem: pick any two

graph TD
  C["Consistency: every read sees the latest write"]
  A["Availability: every request gets a response"]
  P["Partition tolerance: system works despite network splits"]
  C --- CP["CP systems: ZooKeeper, HBase, MongoDB"]
  C --- CA["CA systems: traditional RDBMS (single node)"]
  A --- CA
  A --- AP["AP systems: Cassandra, DynamoDB, CouchDB"]
  P --- CP
  P --- AP

The CAP triangle. Real distributed systems must tolerate partitions, so the practical choice is CP or AP.

The partition tolerance reality

Here is the part many engineers miss. In any system that spans more than one machine connected by a network, partitions will happen. Switches fail. Cables get unplugged. Garbage collection pauses can make a node appear dead to its peers for hundreds of milliseconds. A 2014 study by Bailis and Kingsbury found that even within a single data center, network faults occur regularly enough to affect system behavior.

This means CA is not a real option for distributed systems. A single-node PostgreSQL instance is technically CA: it is consistent and available, but the moment you need to survive a machine failure, you replicate. And the moment you replicate, you must handle partitions.

The real question is never “should I tolerate partitions?” The answer is always yes if you are distributed. The real question is: “When a partition occurs, do I sacrifice consistency or availability?”

CP systems: consistency over availability

A CP system, during a partition, will refuse to serve requests rather than risk returning stale data. This sounds extreme, but it is exactly what you want for certain workloads.

ZooKeeper is the canonical example. It uses the ZAB (ZooKeeper Atomic Broadcast) protocol. Writes go through a leader node. If a majority of nodes cannot be reached (a partition isolates the leader from its quorum), the system stops accepting writes. Reads to minority-side nodes will also fail or return errors. ZooKeeper is used for coordination: leader election, distributed locks, configuration management. Returning stale configuration data could cause split-brain scenarios that are far worse than brief unavailability.
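
The majority rule ZooKeeper relies on reduces to a one-line check. This is a sketch of the arithmetic only, not ZooKeeper's implementation; `has_quorum` is a hypothetical helper name:

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """A partition side may process writes only if it holds a strict
    majority of the cluster -- at most one side can ever satisfy this."""
    return reachable > cluster_size // 2

# A 5-node ensemble split 3/2 by a partition:
print(has_quorum(3, 5))  # True  -- the majority side keeps serving
print(has_quorum(2, 5))  # False -- the minority side refuses writes
print(has_quorum(2, 4))  # False -- an even split leaves NO writable side
```

The strict majority is what prevents split-brain: two disjoint sides of a partition can never both hold more than half the nodes, so at most one side continues accepting writes.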

HBase builds on top of ZooKeeper for coordination and uses HDFS for storage. It provides strong consistency for reads and writes to a given row. During a region server failure (which looks like a partition to the system), the affected regions become temporarily unavailable while HBase reassigns them.

MongoDB with its default write concern and read preference behaves as a CP system. During a primary election (triggered by the primary becoming unreachable), the replica set cannot accept writes. Reads directed at the primary also fail. You can configure MongoDB to read from secondaries, which shifts it toward AP at the cost of reading potentially stale data.

The cost of CP is downtime during partitions. If your ZooKeeper cluster loses its quorum for 30 seconds, those 30 seconds are an outage for every service that depends on it. For a coordination service, this tradeoff is correct. For a user-facing shopping cart, it probably is not.

AP systems: availability over consistency

An AP system, during a partition, continues serving requests on both sides of the partition. Each side may diverge. When the partition heals, the system must reconcile conflicting writes.

Cassandra is designed from the ground up for availability. It uses a peer-to-peer architecture with no single leader. Writes go to any node and propagate asynchronously. With a replication factor of 3 and a consistency level of ONE, a write succeeds as soon as one replica acknowledges it. During a partition, both sides of the split continue accepting writes. When the partition heals, Cassandra uses last-write-wins (by timestamp) or custom conflict resolution to merge divergent data.
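
Last-write-wins reconciliation can be sketched as a merge by timestamp. The dictionaries and integer timestamps below are illustrative; Cassandra actually tracks timestamps per cell, at microsecond resolution:

```python
def lww_merge(replica_a: dict, replica_b: dict) -> dict:
    """Each replica maps key -> (timestamp, value).
    For every key, the entry with the higher timestamp wins."""
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Two partition sides diverged on the same key:
side_a = {"cart": (100, ["book"])}
side_b = {"cart": (105, ["book", "lamp"])}   # written later, other side
print(lww_merge(side_a, side_b))             # {'cart': (105, ['book', 'lamp'])}
```

Note the tradeoff hiding in this simplicity: if both sides wrote at nearly the same moment, the "loser" write is silently discarded, which is why clock skew matters so much in LWW systems.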

The numbers make the tradeoff concrete. A three-node Cassandra cluster with replication factor 3 and consistency level QUORUM requires 2 of 3 nodes to acknowledge a write. With consistency level ONE, it requires only 1. At ONE, a single-node partition still serves 100% of traffic on the majority side and the isolated node continues serving its local data. At QUORUM, the isolated node cannot satisfy writes, moving the system closer to CP behavior.

Amazon's Dynamo, the design that later informed DynamoDB, takes a similar approach. Amazon built it after the 2004 holiday season outage taught them that for a shopping cart, availability matters more than consistency. If a customer adds an item during a partition, both sides accept the write. When the partition heals, the system merges the carts. The worst case is a customer seeing a duplicate item, which is far better than the cart being unavailable during peak traffic.

CouchDB uses multi-version concurrency control and stores conflicting document revisions. Applications must resolve conflicts, but the database never refuses a write.
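
The pattern of never refusing a write and deferring conflict resolution to the application can be sketched as follows. This is an illustrative model, not CouchDB's actual revision-tree algorithm; `Doc` and the resolver callback are hypothetical names:

```python
class Doc:
    """A store that never rejects a write: concurrent updates survive
    as sibling revisions until the application resolves them."""
    def __init__(self):
        self.revisions = []          # all conflicting versions are kept

    def write(self, value):
        self.revisions.append(value) # never refused, even mid-partition

    def read(self, resolve):
        # the application supplies the conflict-resolution policy
        return resolve(self.revisions)

doc = Doc()
doc.write({"items": ["book"]})       # written on one side of the partition
doc.write({"items": ["lamp"]})       # written on the other side
merged = doc.read(lambda revs: {
    "items": sorted({i for r in revs for i in r["items"]})
})
print(merged)                        # {'items': ['book', 'lamp']}
```

The resolver here is a union merge, which fits a shopping-cart workload; a different application might instead keep the longest revision or prompt the user.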

CA systems: the single-node special case

A traditional relational database on a single server, say PostgreSQL or MySQL without replication, is CA. Every read sees the latest write (consistency). The server always responds if it is up (availability). But there is no partition to tolerate because there is only one node.

The moment you add a replica for failover, you enter distributed territory and must choose between CP and AP behavior during partitions. PostgreSQL with synchronous replication behaves as CP: the primary waits for the replica to confirm writes, and if the replica is unreachable, writes block. PostgreSQL with asynchronous replication leans AP: the primary continues accepting writes even if the replica falls behind, but reads from the replica may return stale data.
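
The difference between the two replication modes can be sketched as two write paths. The functions and the `replica_reachable` flag are illustrative, assuming a single primary and a single replica:

```python
def write_sync(primary: dict, replica: dict, replica_reachable: bool,
               key, value):
    """Synchronous replication (CP-leaning): the write fails
    unless the replica can acknowledge it."""
    if not replica_reachable:
        raise TimeoutError("write blocked: replica unreachable")
    primary[key] = value
    replica[key] = value

def write_async(primary: dict, replica: dict, replica_reachable: bool,
                key, value):
    """Asynchronous replication (AP-leaning): commit locally,
    replicate on a best-effort basis."""
    primary[key] = value
    if replica_reachable:
        replica[key] = value     # otherwise the replica simply lags

primary, replica = {}, {}
write_async(primary, replica, False, "x", 1)
print(primary, replica)          # {'x': 1} {} -- stale replica, writes continue
```

With `write_sync`, the same unreachable replica turns the write into an error: consistency is preserved at the price of availability, exactly the CP behavior described above.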

Real systems mapped to CAP

System               CAP category      Partition behavior                           Typical use case
ZooKeeper            CP                Stops accepting writes without quorum        Coordination, leader election
HBase                CP                Regions become unavailable during failover   Analytics on large datasets
MongoDB (default)    CP                No writes during primary election            Document storage, general purpose
Cassandra            AP                Both sides continue serving reads/writes     High-write throughput, time-series
DynamoDB             AP                Both sides accept operations, merge later    Shopping carts, session stores
CouchDB              AP                Accepts writes, stores conflicts             Offline-first applications
PostgreSQL (single)  CA                No partitions possible                       Traditional OLTP
MySQL (single)       CA                No partitions possible                       Web application backends
Spanner              CP (effectively)  Uses TrueTime for global consistency         Global transactions

Beyond the binary: tunable consistency

Modern systems blur the CP/AP line. Cassandra lets you choose your consistency level per query. At consistency level ALL, every replica must respond: that is CP behavior. At ONE, any single replica suffices: that is AP behavior. At QUORUM, you need a majority, which gives you a middle ground.

The formula is straightforward. If R + W > N (where R is the read consistency level, W is the write consistency level, and N is the replication factor), reads will always see the latest write. With N=3, W=2, R=2: 2 + 2 = 4 > 3, so strong consistency holds. With W=1, R=1: 1 + 1 = 2, which is not greater than 3, so stale reads are possible.
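
The condition translates directly into code. A minimal sketch of the overlap check, using the same N, W, and R from the text:

```python
def strongly_consistent(n: int, w: int, r: int) -> bool:
    """With replication factor n, write level w and read level r,
    reads see the latest write when r + w > n, because any read
    quorum must then overlap any write quorum in at least one replica."""
    return r + w > n

print(strongly_consistent(3, 2, 2))  # True  -- QUORUM writes, QUORUM reads
print(strongly_consistent(3, 1, 1))  # False -- ONE/ONE permits stale reads
print(strongly_consistent(3, 3, 1))  # True  -- ALL writes let ONE reads suffice
```

The last case is a common optimization for read-heavy workloads: pay the full cost once at write time (W = N) so that every read can be served by a single replica without losing the guarantee.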

This per-query tunability means a single Cassandra cluster can behave as CP for financial transactions (using QUORUM reads and writes) and AP for activity feeds (using ONE for both). The system does not fit neatly into one CAP corner. It slides between them based on your configuration.

DynamoDB offers similar flexibility with its strongly consistent read option. By default, reads are eventually consistent (AP). Flip a flag and a read will return the latest write (CP behavior for that read), at the cost of higher latency and lower throughput.

For a deeper dive into how these consistency levels relate to each other, see consistency models.

Practical decision framework

When designing a system, start with these questions:

What happens if a user sees stale data? If the answer is “they might overdraw their bank account,” you need strong consistency. If the answer is “they see a like count that is 3 seconds behind,” eventual consistency is fine.

What happens if the system is unavailable for 30 seconds? If the answer is "we lose revenue proportional to downtime" (Amazon's 2022 revenue of roughly $514 billion averages out to nearly a million dollars per minute), availability wins. If the answer is "an internal batch job retries," brief unavailability is acceptable.

How often do partitions actually occur in your deployment? Within a single data center on a well-managed network, partitions are rare (though not impossible). Across regions, they are routine. A system spanning US-East and EU-West will experience partitions measured in seconds to minutes multiple times per year.

Most real architectures are not one system. They are many systems composed together. Your user-facing API might be AP (always respond, even with slightly stale data) while the payment processing pipeline behind it is CP (never process a duplicate charge). A single application can, and usually should, make different CAP tradeoffs for different components.

Understanding the different types of databases and their replication strategies helps you match the right tool to each component.

Common misconceptions

“CAP means you always lose one of three.” Not quite. You lose one only during a partition. When the network is healthy, you can have all three. The theorem constrains behavior during failure, not during normal operation.

“CP means the system is never available.” A CP system is fully available when there is no partition. It only sacrifices availability during network splits, which in a well-run single-region deployment might happen a few times per year.

“Eventual consistency means data is always stale.” In practice, replication lag in systems like Cassandra is typically under 10 milliseconds within a data center. “Eventually” usually means “within a few milliseconds.” The risk is not perpetual staleness but rather the brief window where reads might not reflect the most recent write.

“CAP is outdated.” The theorem itself is as valid as ever. What has changed is our understanding of its nuance. Brewer himself wrote a 2012 retrospective clarifying that the “two of three” framing oversimplifies things. Modern systems treat consistency and availability as a spectrum, not a binary switch.

What comes next

The CAP theorem tells you that tradeoffs exist. Consistency models tell you exactly what those tradeoffs look like in practice: linearizability, sequential consistency, causal consistency, eventual consistency, and when each one is the right choice. That is where we go next.
