
Scalability: vertical vs horizontal scaling

In this series (20 parts)
  1. What is system design and why it matters
  2. Estimations and back-of-envelope calculations
  3. Scalability: vertical vs horizontal scaling
  4. CAP theorem and distributed system tradeoffs
  5. Consistency models
  6. Load balancing
  7. Caching: strategies and patterns
  8. Content Delivery Networks
  9. Databases: SQL vs NoSQL and when to use each
  10. Database replication
  11. Database sharding and partitioning
  12. Consistent hashing
  13. Message queues and event streaming
  14. API design: REST, GraphQL, gRPC
  15. Rate limiting and throttling
  16. Proxies: forward and reverse
  17. Networking concepts for system design
  18. Reliability patterns: timeouts, retries, circuit breakers
  19. Observability: logging, metrics, tracing
  20. Security in system design

Your system handles 500 requests per second today. Marketing just signed a deal that will triple traffic in six weeks. You have two options: buy a bigger machine, or buy more machines. That decision shapes everything downstream, from your deployment pipeline to your database architecture to your on-call rotation at 3 a.m.

Scalability is the ability of a system to handle growing load without degrading the experience users care about: latency, throughput, and availability. If you have not already read through the basics of back-of-the-envelope estimations, do that first. You need to quantify load before you can reason about how to handle it.

What “load” actually means

Load is not one number. It is a collection of metrics that describe how hard your system is working. Common measures include requests per second (RPS), concurrent connections, database queries per second, cache hit ratio, and network bandwidth. A system that handles 10,000 RPS for read-heavy API calls is under very different stress than one handling 500 RPS of complex write transactions with cross-service coordination.

The key insight: you scale for the bottleneck. If your CPU is pinned at 80% while memory sits at 20%, throwing RAM at the problem does nothing. Profiling comes before scaling. Always.

Vertical scaling: make the machine bigger

Vertical scaling (scaling up) means replacing your current server with a more powerful one. More CPU cores, more RAM, faster disks, better NICs. Your application code does not change. Your deployment topology does not change. You just move to bigger hardware.

This is the simplest path. A single PostgreSQL instance on a machine with 64 cores and 512 GB of RAM can handle a surprising amount of work. Many startups run their entire production workload on a single, well-tuned server for years. There is no distributed coordination, no split-brain risk, no cross-node latency. One machine, one process, one source of truth.

The limits are real

Hardware has a ceiling. As of 2024, the largest single cloud instances offer around 448 vCPUs and 24 TB of RAM (AWS u-24tb1.metal). That sounds enormous, but consider a social network with 100 million daily active users generating 50,000 write transactions per second at peak. A single machine, no matter how large, will not handle that. Disk I/O becomes the wall long before you run out of CPU.
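A quick back-of-envelope check makes the number concrete. This is a sketch with assumed inputs (20 writes per user per day and a 2x peak factor are illustrative, not from the text):

```python
# Back-of-envelope: why 100M DAU can overwhelm one machine.
# Assumed numbers: 20 writes/user/day, peak traffic = 2x daily average.
daily_active_users = 100_000_000
writes_per_user_per_day = 20
peak_factor = 2

avg_wps = daily_active_users * writes_per_user_per_day / 86_400  # seconds/day
peak_wps = avg_wps * peak_factor

print(f"average: {avg_wps:,.0f} writes/s, peak: {peak_wps:,.0f} writes/s")
# Roughly 23,000 average and 46,000 peak writes/s -- in the ballpark of the
# 50,000 figure above, and far beyond one disk's sustainable write throughput.
```

Even modest per-user activity, multiplied by a hundred million users, lands in a range no single machine's disk subsystem can absorb.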

Vertical scaling also introduces a single point of failure. If that one powerful machine goes down, your entire service goes down. You can mitigate this with standby replicas, but now you are already moving toward a multi-machine architecture.

Cost is the other problem. Hardware pricing is not linear. A machine with 2x the CPU often costs 2.5x to 3x as much. The price-performance curve bends sharply at the high end.

The red curve shows how vertical scaling costs grow super-linearly. Horizontal scaling (blue) tracks closer to ideal linear cost because commodity hardware is cheap.

At small scale, vertical wins on simplicity. At large scale, horizontal wins on economics. The crossover point depends on your workload, but for most web services it lands somewhere between 10,000 and 50,000 RPS.
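The crossover can be sketched numerically. The constants below are assumptions chosen to mimic the curves described above (a 1.5 exponent for vertical pricing, a fixed overhead for the load balancer and orchestration), not real cloud prices:

```python
# Sketch of the two cost curves with assumed constants.
def vertical_cost(capacity_rps, base=400, base_rps=1_000, exponent=1.5):
    # Price grows super-linearly with capacity (exponent 1.5 assumed).
    return base * (capacity_rps / base_rps) ** exponent

def horizontal_cost(capacity_rps, per_node=100, node_rps=1_000, overhead=600):
    # Linear per-node cost plus fixed overhead (LB, orchestration).
    nodes = -(-capacity_rps // node_rps)  # ceiling division
    return overhead + nodes * per_node

for rps in (1_000, 10_000, 50_000):
    print(rps, round(vertical_cost(rps)), horizontal_cost(rps))
# At 1,000 RPS the single machine is cheaper; by 50,000 RPS the
# commodity fleet wins by more than an order of magnitude.
```

The exact crossover depends entirely on the constants, which is why the article can only give a workload-dependent range.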

Horizontal scaling: add more machines

Horizontal scaling (scaling out) means adding more servers and distributing the workload across them. Instead of one 64-core machine, you run sixteen 4-core machines behind a load balancer. Each machine handles a fraction of the traffic.

```mermaid
graph LR
  subgraph Vertical["Vertical Scaling"]
      C1["Client"] --> S1["Single Large Server<br/>64 CPU / 512 GB"]
      S1 --> DB1["Database"]
  end
  subgraph Horizontal["Horizontal Scaling"]
      C2["Client"] --> LB["Load Balancer"]
      LB --> W1["Server 1<br/>4 CPU / 16 GB"]
      LB --> W2["Server 2<br/>4 CPU / 16 GB"]
      LB --> W3["Server N<br/>4 CPU / 16 GB"]
      W1 --> DB2["Database"]
      W2 --> DB2
      W3 --> DB2
  end
```

Vertical scaling concentrates capacity in one node. Horizontal scaling spreads it across many, fronted by a load balancer.
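The load balancer's simplest dispatch strategy is round-robin: each request goes to the next server in rotation. A minimal sketch (server names are illustrative):

```python
# Minimal round-robin load balancer sketch: identical stateless
# workers take turns handling requests.
import itertools

class RoundRobinBalancer:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def route(self, request):
        server = next(self._cycle)  # pick the next server in rotation
        return server, request

lb = RoundRobinBalancer(["server-1", "server-2", "server-3"])
print([lb.route(f"req-{i}")[0] for i in range(4)])
# -> ['server-1', 'server-2', 'server-3', 'server-1']
```

Real load balancers layer health checks, connection draining, and weighted or least-connections strategies on top of this core idea.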

The advantages are significant. You get fault tolerance: if one server dies, the others keep serving traffic. You get incremental growth: need 20% more capacity? Add two more servers instead of replacing one. You get cost efficiency at scale because commodity hardware is cheap. And you can scale nearly without limit. Netflix, Google, and Amazon all run on thousands of commodity machines, not a handful of supercomputers.

The challenges are also significant

Horizontal scaling is not free. It introduces complexity that vertical scaling avoids entirely. The main challenges fall into four categories.

State management. If a user’s session lives on Server 3 and the load balancer routes their next request to Server 7, the session is gone. You must externalize state. Move sessions to Redis or a shared database. This is solvable, but it is a design constraint you did not have with one server.
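Externalized state can be sketched in a few lines. Here a plain dict stands in for Redis (in production you would use a Redis client); the point is that the session survives no matter which server handles the request:

```python
# Sketch of externalized sessions: any server can handle any request
# because session state lives in a shared store, never on the server.
shared_store = {}  # stand-in for Redis / a shared database

def handle_request(server_name, session_id):
    # Fetch (or create) the session from the shared store, not local memory.
    session = shared_store.setdefault(session_id, {"cart": []})
    session["cart"].append(f"item-from-{server_name}")
    return session["cart"]

handle_request("server-3", "sess-42")          # first request lands on server 3
cart = handle_request("server-7", "sess-42")   # next request lands on server 7
print(cart)  # both items present: the state was never local to either server
```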

Data consistency. With one database server, every read sees the latest write. With replicas or sharded databases, you get replication lag, stale reads, and conflict resolution problems. These trade-offs are formalized in the CAP theorem, which you should understand before designing any distributed data layer.

Network overhead. Calls between services on the same machine take microseconds. Calls across a network take milliseconds. That 1000x difference compounds when a single user request fans out to five or six downstream services. Serialization, deserialization, TCP handshakes, retries on failure: all of this adds latency and complexity.
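The arithmetic behind that compounding (latency figures assumed for illustration):

```python
# How fan-out compounds network latency (numbers assumed, not measured).
local_call_us = 1        # in-process call: ~1 microsecond
network_call_ms = 2.0    # cross-network RPC: ~2 milliseconds
fanout = 6               # downstream services per user request

sequential_ms = fanout * network_call_ms  # calls made one after another
parallel_ms = network_call_ms             # ideal floor: all calls in parallel

print(f"sequential: {sequential_ms} ms, parallel floor: {parallel_ms} ms")
# Even fully parallelized, one network hop still costs ~1000x an
# in-process call -- and retries and serialization add to that.
```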

Operational burden. Deploying to one server means one SSH session or one container push. Deploying to 200 servers means orchestration, rolling updates, health checks, and canary deployments. You need Kubernetes or a similar platform, and that platform itself requires expertise to operate.

Stateless vs stateful services

The single most important design decision for horizontal scaling is whether your services are stateless or stateful.

A stateless service keeps no local data between requests. Every request contains everything the service needs to process it (or the service fetches what it needs from an external store). Any instance can handle any request. Scaling is trivial: add more instances, point the load balancer at them, done.

A stateful service holds data in memory or on local disk that other instances do not have. A database, a cache with local state, a WebSocket server maintaining long-lived connections. Scaling stateful services is fundamentally harder because you must decide which instance owns which data, handle rebalancing when instances join or leave, and deal with replication for durability.

| Property | Stateless Service | Stateful Service |
| --- | --- | --- |
| Scaling method | Add instances behind LB | Partition data, replicate, coordinate |
| Failure impact | Other instances absorb traffic | Data may be lost or unavailable |
| Example | REST API server | Database, in-memory cache |
| Complexity | Low | High |
| Deployment | Rolling update, no drain needed | Requires graceful handoff |

The architecture pattern that works at scale: make as many services stateless as possible. Push all state to purpose-built stateful systems (databases, caches, message queues) that are designed to handle replication and failover.

Scaling the database

Application servers are the easy part. The database is where scaling gets hard.

For reads, you can add read replicas. A primary database handles all writes and replicates data asynchronously to one or more secondaries. Read traffic goes to the secondaries. This works well for read-heavy workloads (most web apps have a 10:1 or higher read-to-write ratio), but it does not help with write throughput.
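Sizing the replica pool is simple arithmetic once you have measured per-replica capacity. The capacity and headroom numbers below are assumptions for illustration; measure your own workload:

```python
# Read-replica sizing sketch (per-replica capacity is assumed; measure yours).
import math

read_rps = 20_000
write_rps = 2_000
replica_read_capacity = 6_000  # assumed sustainable reads/s per replica
headroom = 0.7                 # run replicas at 70% to absorb spikes

replicas = math.ceil(read_rps / (replica_read_capacity * headroom))
print(f"{replicas} read replicas; the primary still takes all {write_rps} writes/s")
```

Note the last line: no number of read replicas reduces the write load on the primary, which is exactly why sharding comes next.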

For writes, you need database sharding. Sharding splits your data across multiple database instances, each responsible for a subset of the data. User IDs 1 through 1 million go to shard 1, 1 million through 2 million go to shard 2, and so on. This distributes write load but introduces complexity: cross-shard queries, rebalancing when shards get hot, and the operational burden of managing many database instances.
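The range-based routing rule from the paragraph above is a one-line function (a shard size of 1,000,000 users, as in the example):

```python
# Range-based shard routing: user IDs map to shards by fixed ranges.
USERS_PER_SHARD = 1_000_000

def shard_for(user_id: int) -> int:
    # IDs 1..1,000,000 -> shard 1; 1,000,001..2,000,000 -> shard 2; ...
    return (user_id - 1) // USERS_PER_SHARD + 1

print(shard_for(1), shard_for(1_000_000), shard_for(1_000_001))
# -> 1 1 2
```

Range-based sharding is easy to reason about but prone to hot shards (new users cluster on the newest range); hash-based and consistent-hashing schemes trade that away for harder range queries.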

The progression for most systems looks like this:

  1. Single database instance (vertical scaling).
  2. Primary with read replicas (horizontal reads, vertical writes).
  3. Sharded database (horizontal reads and writes).

Each step adds operational complexity. Do not jump to step 3 on day one. Start with the simplest architecture that meets your current load, and scale when measurements prove you need it.

Scaling in practice: a concrete example

Suppose you are building a URL shortener. At launch, you expect 1,000 RPS for reads (redirects) and 100 RPS for writes (new short URLs). A single server with 8 cores and 32 GB of RAM, running a PostgreSQL instance and a Go HTTP server, handles this easily. Vertical scaling, no fuss.

Six months in, you hit 20,000 read RPS and 2,000 write RPS. The single machine is at 90% CPU. You have two paths:

Path A (vertical): Move to a 32-core, 128 GB machine. Cost jumps from $400/month to $1,800/month. You have headroom for maybe 4x more growth. Timeline: one afternoon.

Path B (horizontal): Deploy four 8-core application servers behind a load balancer. Add a Redis cache for hot URLs. Add a read replica for the database. Cost is roughly $2,000/month total, but you can now grow incrementally by adding $400 servers. Timeline: one to two weeks of engineering.

At 20,000 RPS, Path A is the right call for most teams. The engineering cost of Path B is real, and your time is better spent shipping features. But if you are confident traffic will 10x again within a year, investing in Path B now saves you from a painful migration later.

When to scale vertically vs horizontally

There is no universal rule, but here are guidelines that hold across most systems.

Start vertical. A single well-tuned server handles more traffic than most engineers expect. Premature horizontal scaling adds complexity that slows down feature development, increases debugging difficulty, and burns engineering hours on infrastructure instead of product.

Go horizontal when vertical hits diminishing returns. If the next tier of hardware costs 3x as much for only 1.5x the capacity, that is your signal. If your workload naturally partitions (users are independent, requests are stateless), horizontal scaling is straightforward.

Go horizontal when you need fault tolerance. A single server is a single point of failure. If your SLA requires 99.99% uptime (roughly 52 minutes of downtime per year), you need redundancy. That means multiple servers, which means horizontal scaling.
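The downtime budget falls straight out of the SLA percentage:

```python
# Allowed downtime per year at a given uptime SLA.
minutes_per_year = 365 * 24 * 60  # 525,600

for sla in (0.999, 0.9999, 0.99999):
    budget = minutes_per_year * (1 - sla)
    print(f"{sla:.3%} uptime -> {budget:.1f} minutes of downtime/year")
# 99.99% allows roughly 52 minutes per year -- less than one
# unattended hardware failure typically costs you.
```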

Go horizontal for stateless services first. The cost-benefit ratio is best here. Stateless API servers behind a load balancer are well-understood, easy to deploy, and resilient. Horizontally scaling the database comes later, when read replicas are not enough.

The hybrid reality

Most production systems use both. The application tier scales horizontally with a pool of stateless servers. The database tier scales vertically first (bigger instance, more IOPS), then horizontally with read replicas, and eventually with sharding. Caches like Redis or Memcached scale horizontally with consistent hashing. Message queues like Kafka scale horizontally with partitions.
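Consistent hashing, mentioned above for the cache tier, deserves a quick sketch. Keys and nodes hash onto a ring, and each key belongs to the first node clockwise from it; this is a simplified version (real implementations add virtual nodes for better balance, as the consistent hashing part of this series covers):

```python
# Simplified consistent-hashing ring: each key goes to the first node
# clockwise from its hash position. (No virtual nodes, for brevity.)
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes):
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        # First node at or after the key's position, wrapping around the ring.
        idx = bisect.bisect(self._hashes, h) % len(self._hashes)
        return self._ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.node_for("user:1234")
# Adding "cache-d" would move only the keys that fall between the new
# node and its predecessor on the ring; most keys keep their owner.
```

The payoff over naive `hash(key) % n` is exactly that last comment: adding or removing a node remaps only a small fraction of keys instead of nearly all of them.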

The skill is not choosing one approach over the other. It is knowing which approach to apply to which component at which stage of growth. That judgment comes from understanding the trade-offs deeply, measuring your system’s actual bottlenecks, and resisting the urge to over-engineer for hypothetical scale.

Key trade-offs at a glance

| Factor | Vertical Scaling | Horizontal Scaling |
| --- | --- | --- |
| Complexity | Low | High |
| Cost at small scale | Lower | Higher (LB, orchestration) |
| Cost at large scale | Super-linear growth | Near-linear growth |
| Fault tolerance | Single point of failure | Built-in redundancy |
| Max capacity | Hardware ceiling | Effectively unlimited |
| Data consistency | Simple (one node) | Requires distributed protocols |
| Deployment | Simple | Requires orchestration |

What comes next

Scaling horizontally introduces a fundamental tension between consistency, availability, and partition tolerance. You cannot have all three at the same time in a distributed system. The CAP theorem formalizes this constraint and gives you a framework for reasoning about which guarantees to sacrifice and when. That is where we go next.
