
Scalability: vertical vs horizontal scaling

In this series (20 parts)
  1. What is system design and why it matters
  2. Estimations and back-of-envelope calculations
  3. Scalability: vertical vs horizontal scaling
  4. CAP theorem and distributed system tradeoffs
  5. Consistency models
  6. Load balancing
  7. Caching: strategies and patterns
  8. Content Delivery Networks
  9. Databases: SQL vs NoSQL and when to use each
  10. Database replication
  11. Database sharding and partitioning
  12. Consistent hashing
  13. Message queues and event streaming
  14. API design: REST, GraphQL, gRPC
  15. Rate limiting and throttling
  16. Proxies: forward and reverse
  17. Networking concepts for system design
  18. Reliability patterns: timeouts, retries, circuit breakers
  19. Observability: logging, metrics, tracing
  20. Security in system design

Your system handles 500 requests per second today. Marketing just signed a deal that will triple traffic in six weeks. You have two options: buy a bigger machine, or buy more machines. That decision shapes everything downstream, from your deployment pipeline to your database architecture to your on-call rotation at 3 a.m.

Scalability is the ability of a system to handle growing load without degrading the experience users care about: latency, throughput, and availability. If you have not already read through the basics of back-of-the-envelope estimations, do that first. You need to quantify load before you can reason about how to handle it.

What “load” actually means

Load is not one number. It is a collection of metrics that describe how hard your system is working. Common measures include requests per second (RPS), concurrent connections, database queries per second, cache hit ratio, and network bandwidth. A system that handles 10,000 RPS for read-heavy API calls is under very different stress than one handling 500 RPS of complex write transactions with cross-service coordination.

The key insight: you scale for the bottleneck. If your CPU is pinned at 80% while memory sits at 20%, throwing RAM at the problem does nothing. Profiling comes before scaling. Always.

Vertical scaling: make the machine bigger

Vertical scaling (scaling up) means replacing your current server with a more powerful one. More CPU cores, more RAM, faster disks, better NICs. Your application code does not change. Your deployment topology does not change. You just move to bigger hardware.

This is the simplest path. A single PostgreSQL instance on a machine with 64 cores and 512 GB of RAM can handle a surprising amount of work. Many startups run their entire production workload on a single, well-tuned server for years. There is no distributed coordination, no split-brain risk, no cross-node latency. One machine, one process, one source of truth.

The limits are real

Hardware has a ceiling. As of 2024, the largest single cloud instances offer around 448 vCPUs and 24 TB of RAM (AWS u-24tb1.metal). That sounds enormous, but consider a social network with 100 million daily active users generating 50,000 write transactions per second at peak. A single machine, no matter how large, will not handle that. Disk I/O becomes the wall long before you run out of CPU.
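A quick back-of-envelope check makes the number concrete. This is a sketch with assumed inputs (20 writes per user per day and a 2x peak factor are illustrative, not from the text):

```python
# Back-of-envelope: why 100M DAU can overwhelm one machine.
# Assumed numbers: 20 writes/user/day, peak traffic = 2x daily average.
daily_active_users = 100_000_000
writes_per_user_per_day = 20
peak_factor = 2

avg_wps = daily_active_users * writes_per_user_per_day / 86_400  # seconds/day
peak_wps = avg_wps * peak_factor

print(f"average: {avg_wps:,.0f} writes/s, peak: {peak_wps:,.0f} writes/s")
# Roughly 23,000 average and 46,000 peak writes/s -- in the ballpark of the
# 50,000 figure above, and far beyond one disk's sustainable write throughput.
```

Even modest per-user activity, multiplied by a hundred million users, lands in a range no single machine's disk subsystem can absorb.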

Vertical scaling also introduces a single point of failure. If that one powerful machine goes down, your entire service goes down. You can mitigate this with standby replicas, but now you are already moving toward a multi-machine architecture.

Cost is the other problem. Hardware pricing is not linear. A machine with 2x the CPU often costs 2.5x to 3x as much. The price-performance curve bends sharply at the high end.

The red curve shows how vertical scaling costs grow super-linearly. Horizontal scaling (blue) tracks closer to ideal linear cost because commodity hardware is cheap.

At small scale, vertical wins on simplicity. At large scale, horizontal wins on economics. The crossover point depends on your workload, but for most web services it lands somewhere between 10,000 and 50,000 RPS.
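The crossover can be sketched numerically. The constants below are assumptions chosen to mimic the curves described above (a 1.5 exponent for vertical pricing, a fixed overhead for the load balancer and orchestration), not real cloud prices:

```python
# Sketch of the two cost curves with assumed constants.
def vertical_cost(capacity_rps, base=400, base_rps=1_000, exponent=1.5):
    # Price grows super-linearly with capacity (exponent 1.5 assumed).
    return base * (capacity_rps / base_rps) ** exponent

def horizontal_cost(capacity_rps, per_node=100, node_rps=1_000, overhead=600):
    # Linear per-node cost plus fixed overhead (LB, orchestration).
    nodes = -(-capacity_rps // node_rps)  # ceiling division
    return overhead + nodes * per_node

for rps in (1_000, 10_000, 50_000):
    print(rps, round(vertical_cost(rps)), horizontal_cost(rps))
# At 1,000 RPS the single machine is cheaper; by 50,000 RPS the
# commodity fleet wins by more than an order of magnitude.
```

The exact crossover depends entirely on the constants, which is why the article can only give a workload-dependent range.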

Horizontal scaling: add more machines

Horizontal scaling (scaling out) means adding more servers and distributing the workload across them. Instead of one 64-core machine, you run sixteen 4-core machines behind a load balancer. Each machine handles a fraction of the traffic.

```mermaid
graph LR
  subgraph Vertical["Vertical Scaling"]
      C1["Client"] --> S1["Single Large Server<br/>64 CPU / 512 GB"]
      S1 --> DB1["Database"]
  end
  subgraph Horizontal["Horizontal Scaling"]
      C2["Client"] --> LB["Load Balancer"]
      LB --> W1["Server 1<br/>4 CPU / 16 GB"]
      LB --> W2["Server 2<br/>4 CPU / 16 GB"]
      LB --> W3["Server N<br/>4 CPU / 16 GB"]
      W1 --> DB2["Database"]
      W2 --> DB2
      W3 --> DB2
  end
```

Vertical scaling concentrates capacity in one node. Horizontal scaling spreads it across many, fronted by a load balancer.
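The load balancer's simplest dispatch strategy is round-robin: each request goes to the next server in rotation. A minimal sketch (server names are illustrative):

```python
# Minimal round-robin load balancer sketch: identical stateless
# workers take turns handling requests.
import itertools

class RoundRobinBalancer:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def route(self, request):
        server = next(self._cycle)  # pick the next server in rotation
        return server, request

lb = RoundRobinBalancer(["server-1", "server-2", "server-3"])
print([lb.route(f"req-{i}")[0] for i in range(4)])
# -> ['server-1', 'server-2', 'server-3', 'server-1']
```

Real load balancers layer health checks, connection draining, and weighted or least-connections strategies on top of this core idea.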

The advantages are significant. You get fault tolerance: if one server dies, the others keep serving traffic. You get incremental growth: need 20% more capacity? Add two more servers instead of replacing one. You get cost efficiency at scale because commodity hardware is cheap. And you can scale nearly without limit. Netflix, Google, and Amazon all run on thousands of commodity machines, not a handful of supercomputers.

The challenges are also significant

Horizontal scaling is not free. It introduces complexity that vertical scaling avoids entirely. The main challenges fall into four categories.

State management. If a user’s session lives on Server 3 and the load balancer routes their next request to Server 7, the session is gone. You must externalize state. Move sessions to Redis or a shared database. This is solvable, but it is a design constraint you did not have with one server.
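Externalized state can be sketched in a few lines. Here a plain dict stands in for Redis (in production you would use a Redis client); the point is that the session survives no matter which server handles the request:

```python
# Sketch of externalized sessions: any server can handle any request
# because session state lives in a shared store, never on the server.
shared_store = {}  # stand-in for Redis / a shared database

def handle_request(server_name, session_id):
    # Fetch (or create) the session from the shared store, not local memory.
    session = shared_store.setdefault(session_id, {"cart": []})
    session["cart"].append(f"item-from-{server_name}")
    return session["cart"]

handle_request("server-3", "sess-42")          # first request lands on server 3
cart = handle_request("server-7", "sess-42")   # next request lands on server 7
print(cart)  # both items present: the state was never local to either server
```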

Data consistency. With one database server, every read sees the latest write. With replicas or sharded databases, you get replication lag, stale reads, and conflict resolution problems. These trade-offs are formalized in the CAP theorem, which you should understand before designing any distributed data layer.

Network overhead. Calls between services on the same machine take microseconds. Calls across a network take milliseconds. That 1000x difference compounds when a single user request fans out to five or six downstream services. Serialization, deserialization, TCP handshakes, retries on failure: all of this adds latency and complexity.
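The arithmetic behind that compounding (latency figures assumed for illustration):

```python
# How fan-out compounds network latency (numbers assumed, not measured).
local_call_us = 1        # in-process call: ~1 microsecond
network_call_ms = 2.0    # cross-network RPC: ~2 milliseconds
fanout = 6               # downstream services per user request

sequential_ms = fanout * network_call_ms  # calls made one after another
parallel_ms = network_call_ms             # ideal floor: all calls in parallel

print(f"sequential: {sequential_ms} ms, parallel floor: {parallel_ms} ms")
# Even fully parallelized, one network hop still costs ~1000x an
# in-process call -- and retries and serialization add to that.
```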

Operational burden. Deploying to one server means one SSH session or one container push. Deploying to 200 servers means orchestration, rolling updates, health checks, and canary deployments. You need Kubernetes or a similar platform, and that platform itself requires expertise to operate.

Stateless vs stateful services

The single most important design decision for horizontal scaling is whether your services are stateless or stateful.

A stateless service keeps no local data between requests. Every request contains everything the service needs to process it (or the service fetches what it needs from an external store). Any instance can handle any request. Scaling is trivial: add more instances, point the load balancer at them, done.

A stateful service holds data in memory or on local disk that other instances do not have. A database, a cache with local state, a WebSocket server maintaining long-lived connections. Scaling stateful services is fundamentally harder because you must decide which instance owns which data, handle rebalancing when instances join or leave, and deal with replication for durability.

| Property | Stateless Service | Stateful Service |
| --- | --- | --- |
| Scaling method | Add instances behind LB | Partition data, replicate, coordinate |
| Failure impact | Other instances absorb traffic | Data may be lost or unavailable |
| Example | REST API server | Database, in-memory cache |
| Complexity | Low | High |
| Deployment | Rolling update, no drain needed | Requires graceful handoff |

The architecture pattern that works at scale: make as many services stateless as possible. Push all state to purpose-built stateful systems (databases, caches, message queues) that are designed to handle replication and failover.

Scaling the database

Application servers are the easy part. The database is where scaling gets hard.

For reads, you can add read replicas. A primary database handles all writes and replicates data asynchronously to one or more secondaries. Read traffic goes to the secondaries. This works well for read-heavy workloads (most web apps have a 10:1 or higher read-to-write ratio), but it does not help with write throughput.
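Sizing the replica pool is simple arithmetic once you have measured per-replica capacity. The capacity and headroom numbers below are assumptions for illustration; measure your own workload:

```python
# Read-replica sizing sketch (per-replica capacity is assumed; measure yours).
import math

read_rps = 20_000
write_rps = 2_000
replica_read_capacity = 6_000  # assumed sustainable reads/s per replica
headroom = 0.7                 # run replicas at 70% to absorb spikes

replicas = math.ceil(read_rps / (replica_read_capacity * headroom))
print(f"{replicas} read replicas; the primary still takes all {write_rps} writes/s")
```

Note the last line: no number of read replicas reduces the write load on the primary, which is exactly why sharding comes next.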

For writes, you need database sharding. Sharding splits your data across multiple database instances, each responsible for a subset of the data. User IDs 1 through 1 million go to shard 1, 1 million through 2 million go to shard 2, and so on. This distributes write load but introduces complexity: cross-shard queries, rebalancing when shards get hot, and the operational burden of managing many database instances.
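The range-based routing rule from the paragraph above is a one-line function (a shard size of 1,000,000 users, as in the example):

```python
# Range-based shard routing: user IDs map to shards by fixed ranges.
USERS_PER_SHARD = 1_000_000

def shard_for(user_id: int) -> int:
    # IDs 1..1,000,000 -> shard 1; 1,000,001..2,000,000 -> shard 2; ...
    return (user_id - 1) // USERS_PER_SHARD + 1

print(shard_for(1), shard_for(1_000_000), shard_for(1_000_001))
# -> 1 1 2
```

Range-based sharding is easy to reason about but prone to hot shards (new users cluster on the newest range); hash-based and consistent-hashing schemes trade that away for harder range queries.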

The progression for most systems looks like this:

  1. Single database instance (vertical scaling).
  2. Primary with read replicas (horizontal reads, vertical writes).
  3. Sharded database (horizontal reads and writes).

Each step adds operational complexity. Do not jump to step 3 on day one. Start with the simplest architecture that meets your current load, and scale when measurements prove you need it.

Scaling in practice: a concrete example

Suppose you are building a URL shortener. At launch, you expect 1,000 RPS for reads (redirects) and 100 RPS for writes (new short URLs). A single server with 8 cores and 32 GB of RAM, running a PostgreSQL instance and a Go HTTP server, handles this easily. Vertical scaling, no fuss.

Six months in, you hit 20,000 read RPS and 2,000 write RPS. The single machine is at 90% CPU. You have two paths:

Path A (vertical): Move to a 32-core, 128 GB machine. Cost jumps from $400/month to $1,800/month. You have headroom for maybe 4x more growth. Timeline: one afternoon.

Path B (horizontal): Deploy four 8-core application servers behind a load balancer. Add a Redis cache for hot URLs. Add a read replica for the database. Cost is roughly $2,000/month total, but you can now grow incrementally by adding $400 servers. Timeline: one to two weeks of engineering.

At 20,000 RPS, Path A is the right call for most teams. The engineering cost of Path B is real, and your time is better spent shipping features. But if you are confident traffic will 10x again within a year, investing in Path B now saves you from a painful migration later.

When to scale vertically vs horizontally

There is no universal rule, but here are guidelines that hold across most systems.

Start vertical. A single well-tuned server handles more traffic than most engineers expect. Premature horizontal scaling adds complexity that slows down feature development, increases debugging difficulty, and burns engineering hours on infrastructure instead of product.

Go horizontal when vertical hits diminishing returns. If the next tier of hardware costs 3x as much for only 1.5x the capacity, that is your signal. If your workload naturally partitions (users are independent, requests are stateless), horizontal scaling is straightforward.

Go horizontal when you need fault tolerance. A single server is a single point of failure. If your SLA requires 99.99% uptime (roughly 52 minutes of downtime per year), you need redundancy. That means multiple servers, which means horizontal scaling.
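The downtime budget falls straight out of the SLA percentage:

```python
# Allowed downtime per year at a given uptime SLA.
minutes_per_year = 365 * 24 * 60  # 525,600

for sla in (0.999, 0.9999, 0.99999):
    budget = minutes_per_year * (1 - sla)
    print(f"{sla:.3%} uptime -> {budget:.1f} minutes of downtime/year")
# 99.99% allows roughly 52 minutes per year -- less than one
# unattended hardware failure typically costs you.
```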

Go horizontal for stateless services first. The cost-benefit ratio is best here. Stateless API servers behind a load balancer are well-understood, easy to deploy, and resilient. Horizontally scaling the database comes later, when read replicas are not enough.

The hybrid reality

Most production systems use both. The application tier scales horizontally with a pool of stateless servers. The database tier scales vertically first (bigger instance, more IOPS), then horizontally with read replicas, and eventually with sharding. Caches like Redis or Memcached scale horizontally with consistent hashing. Message queues like Kafka scale horizontally with partitions.
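Consistent hashing, mentioned above for the cache tier, deserves a quick sketch. Keys and nodes hash onto a ring, and each key belongs to the first node clockwise from it; this is a simplified version (real implementations add virtual nodes for better balance, as the consistent hashing part of this series covers):

```python
# Simplified consistent-hashing ring: each key goes to the first node
# clockwise from its hash position. (No virtual nodes, for brevity.)
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes):
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        # First node at or after the key's position, wrapping around the ring.
        idx = bisect.bisect(self._hashes, h) % len(self._hashes)
        return self._ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.node_for("user:1234")
# Adding "cache-d" would move only the keys that fall between the new
# node and its predecessor on the ring; most keys keep their owner.
```

The payoff over naive `hash(key) % n` is exactly that last comment: adding or removing a node remaps only a small fraction of keys instead of nearly all of them.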

The skill is not choosing one approach over the other. It is knowing which approach to apply to which component at which stage of growth. That judgment comes from understanding the trade-offs deeply, measuring your system’s actual bottlenecks, and resisting the urge to over-engineer for hypothetical scale.

Key trade-offs at a glance

| Factor | Vertical Scaling | Horizontal Scaling |
| --- | --- | --- |
| Complexity | Low | High |
| Cost at small scale | Lower | Higher (LB, orchestration) |
| Cost at large scale | Super-linear growth | Near-linear growth |
| Fault tolerance | Single point of failure | Built-in redundancy |
| Max capacity | Hardware ceiling | Effectively unlimited |
| Data consistency | Simple (one node) | Requires distributed protocols |
| Deployment | Simple | Requires orchestration |

What comes next

Scaling horizontally introduces a fundamental tension between consistency, availability, and partition tolerance. You cannot have all three at the same time in a distributed system. The CAP theorem formalizes this constraint and gives you a framework for reasoning about which guarantees to sacrifice and when. That is where we go next.
