Database sharding and partitioning

In this series (20 parts)
  1. What is system design and why it matters
  2. Estimations and back-of-envelope calculations
  3. Scalability: vertical vs horizontal scaling
  4. CAP theorem and distributed system tradeoffs
  5. Consistency models
  6. Load balancing
  7. Caching: strategies and patterns
  8. Content Delivery Networks
  9. Databases: SQL vs NoSQL and when to use each
  10. Database replication
  11. Database sharding and partitioning
  12. Consistent hashing
  13. Message queues and event streaming
  14. API design: REST, GraphQL, gRPC
  15. Rate limiting and throttling
  16. Proxies: forward and reverse
  17. Networking concepts for system design
  18. Reliability patterns: timeouts, retries, circuit breakers
  19. Observability: logging, metrics, tracing
  20. Security in system design

A single database handles your application just fine until it does not. At some point the write volume exceeds what one machine can absorb, or the dataset outgrows the memory and disk of a single server. Vertical scaling buys time. You upgrade the CPU, double the RAM, move to faster SSDs. But hardware has a ceiling, and the cost curve goes exponential long before you hit it. Sharding is the alternative: split the data across multiple machines so each one handles a fraction of the total load.

This article builds on database replication. Replication copies the same data to multiple nodes for read scaling and fault tolerance. Sharding divides the data itself. Most production systems use both. If you have not read the replication article yet, start there.

Partitioning vs sharding

The terms get used interchangeably, but there is a useful distinction. Partitioning means dividing a table into smaller pieces. When those pieces live on the same machine, it is local partitioning. When they live on different machines, it is sharding. Sharding is distributed partitioning.

PostgreSQL supports declarative partitioning within a single instance. You can split an orders table into monthly partitions so the planner only scans the relevant months during a query. That is useful for pruning, archival, and maintenance. But it does not help when the total write throughput exceeds what one node can handle. For that, you need the data on separate physical servers.

Throughout this article, “shard” means a partition that lives on its own database server.

Why shard

Three reasons dominate:

Write throughput. A single PostgreSQL instance on good hardware handles roughly 10,000 to 50,000 write transactions per second depending on row size, indexes, and durability settings. An e-commerce platform doing 200,000 order writes per second during a flash sale cannot fit that on one machine. Four shards, each handling 50,000 writes per second, can.

Dataset size. A single-node database with 20 TB of data means every full table scan, backup, and index rebuild operates on 20 TB. Split across 10 shards, each node holds 2 TB. Backups run faster. Index rebuilds finish in a fraction of the time. The working set fits in memory on each node.

Isolation. A runaway query on one shard does not degrade performance for users whose data lives on another shard. This matters for multi-tenant SaaS systems where one customer’s analytics query should not slow down another customer’s checkout.

Horizontal partitioning

Sharding is a form of horizontal partitioning. You split rows across shards based on the value of a partition key (also called the shard key). Every row has exactly one home shard determined by its key value. The choice of partition key is the single most important decision in your sharding strategy.

A good partition key has three properties. It distributes data evenly so no shard holds significantly more rows than others. It distributes load evenly so no shard receives significantly more queries. And it aligns with your most common access pattern so most queries hit a single shard rather than fanning out to all of them.

For a social media application, user_id is often a strong partition key. Users read and write their own data far more than they access other users’ data. Each user’s posts, likes, and messages land on the same shard. The vast majority of requests touch one shard.

graph TD
  APP["Application"] --> ROUTER["Shard Router"]
  ROUTER -->|"user_id % 4 = 0"| S0["Shard 0
~25M users"]
  ROUTER -->|"user_id % 4 = 1"| S1["Shard 1
~25M users"]
  ROUTER -->|"user_id % 4 = 2"| S2["Shard 2
~25M users"]
  ROUTER -->|"user_id % 4 = 3"| S3["Shard 3
~25M users"]
  S0 --> R0["Replica"]
  S1 --> R1["Replica"]
  S2 --> R2["Replica"]
  S3 --> R3["Replica"]

Data distribution across four shards using modular hash partitioning. Each shard holds roughly 25 million users and replicates for fault tolerance.
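A minimal routing sketch matching the diagram, assuming a stable hash over user_id (the hash function and shard count here are illustrative choices, not a prescribed implementation):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(user_id: int) -> int:
    # Use a stable hash (not Python's per-process salted hash())
    # so routing stays consistent across restarts and machines.
    digest = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

Every row with the same key deterministically lands on one shard; the router needs only the key, no lookup table.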

Sharding strategies

Range-based sharding

Assign rows to shards based on contiguous ranges of the partition key. Users with IDs 1 through 1,000,000 go to shard 0, 1,000,001 through 2,000,000 to shard 1, and so on. This is simple to understand and makes range queries efficient. If you need all users with IDs between 500,000 and 600,000, you know exactly which shard to ask.

The downside is predictable: new users get sequential IDs, so all new writes hit the last shard. That shard becomes a hotspot while the older shards sit mostly idle. You can mitigate this by choosing a partition key that does not grow monotonically, but range sharding with auto-incrementing keys is a common mistake in practice.

Range sharding works well for time-series data. Partition by month and each shard holds one month of records. Queries almost always filter by time range, so they hit one or two shards. Old shards become read-only and can be compressed or moved to cheaper storage.
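A range router can be sketched as a binary search over sorted boundaries (the split points below are hypothetical):

```python
import bisect

# Inclusive upper bound of each shard's user_id range (hypothetical split).
UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000, 4_000_000]

def owning_shard(user_id: int) -> int:
    # The first shard whose upper bound covers the key owns it.
    return bisect.bisect_left(UPPER_BOUNDS, user_id)

def shards_for_range(lo: int, hi: int) -> range:
    # A range query touches only the shards whose ranges overlap [lo, hi].
    return range(owning_shard(lo), owning_shard(hi) + 1)

print(list(shards_for_range(500_000, 600_000)))  # [0]: a single shard
```

This is the payoff of range sharding: the router knows exactly which shards a range query needs.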

Hash-based sharding

Apply a hash function to the partition key and use the result to determine the shard. shard = hash(user_id) % num_shards. The hash function spreads keys uniformly across shards regardless of the key distribution. You avoid the hotspot problem that plagues range sharding with sequential keys.

The tradeoff is that range queries become expensive. If you need all users with IDs between 500,000 and 600,000, those users are scattered across all shards. You must query every shard and merge the results.

Hash sharding has a second problem: adding or removing shards changes the modulus, which remaps nearly every key to a different shard. With 4 shards, hash(key) % 4 sends key 17 to shard 1. With 5 shards, hash(key) % 5 might send it to shard 2. Scaling from 4 to 5 shards means moving roughly 80% of the data. This is where consistent hashing becomes essential. It limits data movement to approximately 1/n of the keys when adding the nth shard.
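The remapping cost is easy to verify empirically; here the key itself stands in for its hash:

```python
# Count how many keys change shards when naive modular hashing
# goes from 4 shards to 5.
keys = range(100_000)
moved = sum(1 for k in keys if k % 4 != k % 5)
print(f"{moved / len(keys):.0%} of keys move")  # 80% of keys move
```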

Directory-based sharding

Maintain a lookup table that maps each partition key (or key range) to a shard. The application queries the directory to find the right shard, then queries that shard for the data. This approach gives you complete flexibility. You can move individual users between shards without changing any hash function or range boundary. You can place high-value customers on dedicated hardware.

The directory itself becomes a critical dependency. Every data access requires a directory lookup first. If the directory is unavailable, the entire system is unavailable. You must replicate the directory, cache it aggressively, and handle cache invalidation when mappings change. In practice, directory-based sharding is often used as a layer on top of hash or range sharding. The directory maps logical shard IDs to physical hosts, making it possible to rebalance without application changes.
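A directory layer can be sketched as explicit overrides in front of a hash fallback (the key names and placements below are hypothetical):

```python
import hashlib

NUM_SHARDS = 4

# Explicit placements; everything not listed falls back to hash routing.
DIRECTORY = {"vip_customer_42": "shard-dedicated-0"}

def route(key: str) -> str:
    if key in DIRECTORY:          # explicit placement wins
        return DIRECTORY[key]
    digest = hashlib.md5(key.encode()).digest()
    return f"shard-{int.from_bytes(digest[:8], 'big') % NUM_SHARDS}"
```

Moving a customer to dedicated hardware is a single directory update; no hash function or range boundary changes.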

The hotspot problem

Even with hash sharding, hotspots happen. A celebrity with 50 million followers posts an update. The shard holding that celebrity’s data suddenly receives 50 million read requests in seconds. The hash function distributed users evenly, but it cannot predict that one user generates orders of magnitude more load than average.

Some systems handle this by splitting hot keys across multiple shards. The celebrity’s posts are replicated to 10 shards, and reads are distributed among them. Instagram built a system that detects hot partitions in real-time and temporarily fans out reads across replicas. Others pre-identify VIP keys and assign them dedicated resources.

The chart below illustrates how a naive hash distribution compares to consistent hashing when load is skewed by a small number of hot keys. With 8 shards and 5% of keys generating 60% of the load, naive modular hashing concentrates that load on fewer shards after resharding, while consistent hashing redistributes it more gracefully.

Naive hash sharding creates severe imbalance when hot keys cluster after resharding. Consistent hashing spreads the load more evenly across all shards.

Resharding

Your system grows. Four shards are not enough. You need eight. With naive hash sharding, hash(key) % 4 becomes hash(key) % 8, and half of all keys move to a new shard: a key stays put only when hash(key) % 8 happens to equal hash(key) % 4. During the migration, the system must handle reads and writes to data that might be on the old shard or the new one.

There are two common approaches to resharding. The first is a stop-the-world migration: take the system offline, redistribute the data, bring it back up. This is only acceptable for systems that can tolerate scheduled downtime. The second is online resharding, which is significantly more complex.

Online resharding typically works in phases. First, create the new shards and start double-writing: all new writes go to both the old shard and the new shard. Second, backfill historical data from old shards to new shards. Third, verify consistency between old and new shard mappings. Fourth, cut over reads to the new mapping. Fifth, stop writes to old shards and decommission them.
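The double-write and cutover phases above can be sketched as a routing wrapper (a simplified model, not a production migration tool):

```python
class ReshardingRouter:
    """Sketch of the double-write phase of an online reshard.

    old_route and new_route are callables mapping a key to a shard;
    the phase flag steps through the cutover described above.
    """
    def __init__(self, old_route, new_route):
        self.old_route = old_route
        self.new_route = new_route
        self.phase = "double_write"   # then "cutover"

    def shards_for_write(self, key):
        if self.phase == "double_write":
            # Writes land on both mappings so the new shards stay
            # current while the backfill copies historical rows.
            return {self.old_route(key), self.new_route(key)}
        return {self.new_route(key)}

    def shard_for_read(self, key):
        # Reads stay on the old mapping until verification passes.
        if self.phase == "double_write":
            return self.old_route(key)
        return self.new_route(key)
```

Backfill, verification, and decommissioning happen around this router; the routing logic itself only needs the phase flag and the two mappings.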

Vitess, the sharding middleware used by YouTube and Slack, automates this process. It uses a combination of consistent hashing and directory-based routing to perform live resharding with no downtime. But even with tooling, resharding is operationally expensive. It is worth choosing your initial shard count and key carefully to delay the first reshard as long as possible.

A common rule of thumb: start with more shards than you think you need. Running 16 logical shards on 4 physical machines costs almost nothing extra, and when you need to scale to 8 physical machines, you just move logical shards between hosts instead of splitting data.
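The logical-versus-physical split amounts to one level of indirection (host names and counts below are illustrative):

```python
NUM_LOGICAL = 16  # fixed forever; keys always route mod 16

# 16 logical shards packed onto 4 physical hosts.
placement = {shard: f"db-{shard % 4}" for shard in range(NUM_LOGICAL)}

def host_for(key: int) -> str:
    return placement[key % NUM_LOGICAL]  # key routing never changes

# Scaling out: reassign whole logical shards across 8 hosts. No key
# is re-hashed; only the placement table and the shard's files move.
placement = {shard: f"db-{shard % 8}" for shard in range(NUM_LOGICAL)}
```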

Cross-shard queries

The hardest part of sharding is not splitting the data. It is dealing with queries that span multiple shards.

A social media feed shows posts from all the users you follow. Those users live on different shards. Building the feed requires querying every shard that holds a followed user, sorting the results by timestamp, and merging them. If you follow 500 users spread across 16 shards, that is 16 parallel queries followed by a merge-sort. Latency is bounded by the slowest shard.
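The fan-out and merge can be sketched as a k-way merge over per-shard results (the data and shard count are stand-ins; real queries would run in parallel against each shard's database):

```python
import heapq

# Stand-in for per-shard storage; each shard returns posts newest-first.
FAKE_SHARDS = {
    0: [{"ts": 9, "text": "a"}, {"ts": 4, "text": "b"}],
    1: [{"ts": 7, "text": "c"}, {"ts": 2, "text": "d"}],
}

def fetch_posts(shard_id):
    return FAKE_SHARDS[shard_id]

def build_feed(shard_ids, limit=3):
    # One query per shard (in production these run in parallel, so
    # latency is bounded by the slowest shard), then a k-way merge
    # of the already-sorted streams.
    streams = [fetch_posts(s) for s in shard_ids]
    merged = heapq.merge(*streams, key=lambda p: p["ts"], reverse=True)
    return [p["text"] for p, _ in zip(merged, range(limit))]

print(build_feed([0, 1]))  # ['a', 'c', 'b']
```

Because each shard returns its posts already sorted, the merge is linear in the result size rather than a full re-sort.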

Cross-shard joins are worse. If orders is sharded by customer_id and products is sharded by product_id, joining them requires a full scatter-gather across all product shards for each order. This is why sharding strategy must align with access patterns. If 90% of your queries join orders with customers, shard both tables by customer_id and co-locate the data.

Cross-shard transactions require distributed coordination. Two-phase commit (2PC) ensures atomicity but adds latency and a blocking failure mode. If the coordinator crashes between the prepare and commit phases, all participants hold locks until the coordinator recovers. Most sharded systems avoid distributed transactions entirely and instead design around eventual consistency or use the saga pattern to break a logical transaction into compensatable local transactions.
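A saga can be sketched as ordered steps, each paired with a compensation that undoes it (the step names below are illustrative):

```python
def run_saga(steps):
    """Minimal saga runner: steps is a list of (action, compensation).

    Actions run in order; if one fails, compensations for the
    completed steps run in reverse to undo the partial work.
    """
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        raise

log = []

def reserve_inventory():
    log.append("reserved")

def release_inventory():
    log.append("released")

def charge_card():
    raise RuntimeError("payment declined")  # simulate a mid-saga failure

try:
    run_saga([(reserve_inventory, release_inventory),
              (charge_card, lambda: log.append("refunded"))])
except RuntimeError:
    pass

print(log)  # ['reserved', 'released']
```

Unlike 2PC, no locks are held across shards; the cost is that intermediate states are visible until compensation completes.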

For analytics and reporting that naturally span the full dataset, many teams maintain a separate denormalized read store. Stream changes from all shards into a data warehouse or search index. The sharded OLTP database handles transactional writes. The read store handles analytical queries. Each system is optimized for its workload.

Practical considerations

Connection management. With 16 shards, each with a primary and two replicas, you have 48 database instances. If your application has 100 backend servers, that is 4,800 potential connections. Connection pooling per shard (PgBouncer, ProxySQL) is not optional. It is a requirement.

Schema changes. Every ALTER TABLE must be applied to every shard. Rolling schema migrations across 16 shards take 16 times as long and have 16 times as many chances to fail partway through. Tools like gh-ost and pt-online-schema-change help, but operational complexity increases linearly with shard count.

Monitoring. You now need per-shard metrics for latency, throughput, disk usage, and replication lag. A shard running hot or falling behind on replication requires immediate attention. Aggregate metrics hide shard-level problems.

Backup and recovery. Restoring a single shard from backup means that shard’s data may be slightly behind the others. If cross-shard consistency matters, you need coordinated backup snapshots or a way to reconcile after partial restore.

When not to shard

Sharding adds significant complexity to every layer of your system. Before reaching for it, exhaust simpler alternatives. Vertical scaling, read replicas, caching, query optimization, and connection pooling solve most scaling problems without splitting the data.

The CAP theorem tells us that distributed systems face inherent tradeoffs between consistency and availability during network partitions. Sharding makes your database a distributed system with all the associated challenges. If your dataset fits on one beefy machine and you can handle the write throughput with proper indexing and hardware, a single-node database with replicas (covered in the databases overview) is simpler to operate, reason about, and debug.

Shard when you must, not when it sounds impressive. The best sharding decision is often to delay it until the data and access patterns are well understood, so you can choose the right partition key the first time.

What comes next

Sharding with a naive modulus hash breaks down when you add or remove nodes. Consistent hashing solves this by mapping both keys and nodes onto a ring, so adding a node only moves a fraction of the keys. That is the topic of the next article.
