Load balancing
In this series (20 parts)
- What is system design and why it matters
- Estimations and back-of-envelope calculations
- Scalability: vertical vs horizontal scaling
- CAP theorem and distributed system tradeoffs
- Consistency models
- Load balancing
- Caching: strategies and patterns
- Content Delivery Networks
- Databases: SQL vs NoSQL and when to use each
- Database replication
- Database sharding and partitioning
- Consistent hashing
- Message queues and event streaming
- API design: REST, GraphQL, gRPC
- Rate limiting and throttling
- Proxies: forward and reverse
- Networking concepts for system design
- Reliability patterns: timeouts, retries, circuit breakers
- Observability: logging, metrics, tracing
- Security in system design
A single server can handle maybe 10,000 concurrent connections before it starts dropping requests. Double the traffic and you have a problem. Triple it and the server falls over entirely. Load balancers exist to solve this. They sit between clients and your server pool, spreading requests so that no individual machine gets crushed.
This is not an optimization. It is a requirement for any system that expects growth. Every major web property runs behind multiple layers of load balancers. Google handles over 100,000 requests per second on Search alone. Netflix pushes 400+ Gbps of video traffic during peak hours. None of that works with a single backend.
Where load balancers sit in the stack
A load balancer is a reverse proxy that makes routing decisions. Clients send requests to the balancer’s IP. The balancer picks a backend server, forwards the request, and returns the response. From the client’s perspective, the balancer is the server. The actual backends are hidden.
```mermaid
graph TD
    C1["Client A"] --> LB["Load Balancer<br/>public IP: 203.0.113.1"]
    C2["Client B"] --> LB
    C3["Client C"] --> LB
    LB --> S1["Server 1<br/>10.0.0.1"]
    LB --> S2["Server 2<br/>10.0.0.2"]
    LB --> S3["Server 3<br/>10.0.0.3"]
    S1 --> DB["Database"]
    S2 --> DB
    S3 --> DB
```
Request flow from clients through a single load balancer to a pool of three backend servers. The database sits behind the application tier.
In production you rarely have just one load balancer. Most architectures place them at three points: between the internet and web servers, between web servers and application servers, and between application servers and data stores. Each layer can scale independently. If your caching layer absorbs 80% of reads, your database tier needs fewer machines than your application tier.
L4 vs L7 load balancing
Load balancers operate at two layers of the network stack and the distinction matters.
Layer 4 (transport) balancers work with TCP and UDP. They see source IPs, destination IPs, and port numbers. They do not inspect the payload. When a TCP connection arrives, the balancer picks a backend and forwards all packets for that connection to the same server. This is fast. An L4 balancer can handle millions of connections per second because it does almost no per-request work. Linux’s IPVS module can push 10 Gbps on commodity hardware.
Layer 7 (application) balancers understand HTTP, gRPC, WebSocket, and other application protocols. They can inspect headers, cookies, URL paths, and request bodies. This lets you do content-based routing: send /api/users to one service and /api/orders to another. You can terminate TLS at the balancer, rewrite headers, inject authentication tokens, and rate-limit by API key. The cost is higher CPU usage per request since parsing HTTP is more expensive than forwarding TCP packets.
Most systems use both. An L4 balancer at the edge handles raw connection distribution and DDoS mitigation. Behind it, L7 balancers do intelligent routing. AWS’s architecture makes this explicit: Network Load Balancer (NLB) operates at L4, Application Load Balancer (ALB) operates at L7. You often chain them.
When to pick one over the other: use L4 when you need raw throughput and protocol-agnostic forwarding. Use L7 when you need to make routing decisions based on request content. If you are building a microservices system with path-based routing, L7 is non-negotiable. If you are distributing TCP connections to a fleet of identical database replicas, L4 is simpler and faster.
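The routing decision that makes L7 worth its CPU cost can be sketched in a few lines. The service paths and backend addresses below are illustrative, not a real API:

```python
# Minimal sketch of L7 content-based routing: pick a backend pool
# by URL path prefix. Paths and pool addresses are made up.
ROUTES = [
    ("/api/users", ["10.0.1.1:8080", "10.0.1.2:8080"]),   # users service
    ("/api/orders", ["10.0.2.1:8080", "10.0.2.2:8080"]),  # orders service
]
DEFAULT_POOL = ["10.0.0.1:8080"]

def pick_pool(path: str) -> list[str]:
    # Longest-prefix match, the behavior most L7 proxies implement
    for prefix, pool in sorted(ROUTES, key=lambda r: -len(r[0])):
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL
```

An L4 balancer never sees the path at all, which is exactly why it is faster and why it cannot do this.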
Routing algorithms
The algorithm a load balancer uses to pick a backend determines how evenly traffic spreads. No algorithm is universally best. Each makes different tradeoffs between simplicity, fairness, and statefulness.
Round robin
The simplest approach. Requests go to servers in order: 1, 2, 3, 1, 2, 3. Weighted round robin assigns proportional shares, so a machine with weight 3 gets three requests for every one sent to a machine with weight 1. This works well when requests have roughly equal cost and servers have similar capacity. It falls apart when some requests take 10ms and others take 5 seconds, because the balancer has no idea which servers are actually busy.
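A minimal sketch of weighted round robin: each server appears in the rotation in proportion to its weight. The addresses and weights are illustrative:

```python
import itertools

# Naive weighted round robin: a server with weight 3 appears three
# times in the rotation, so it receives 3x the requests.
servers = {"10.0.0.1": 3, "10.0.0.2": 1}

rotation = itertools.cycle(
    [addr for addr, weight in servers.items() for _ in range(weight)]
)

def next_server() -> str:
    # Each call advances the cycle: .1, .1, .1, .2, .1, .1, .1, .2, ...
    return next(rotation)
```

Production balancers use a "smooth" variant that interleaves the weighted picks instead of sending bursts to the heavy server, but the proportions come out the same.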
Least connections
The balancer tracks how many active connections each backend holds and sends the next request to whichever server has the fewest. This adapts naturally to variable request costs. A server processing a slow database query accumulates connections and stops receiving new ones until it catches up. Weighted least connections combines this with server capacity weights. This is the default algorithm for most production HTTP load balancers because it handles real-world traffic patterns better than round robin.
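The bookkeeping behind least connections is simple enough to sketch; a real balancer would add locking and capacity weights:

```python
# Least-connections selection: track active connections per backend
# and send the next request to the least-loaded one. Addresses are
# illustrative.
active = {"10.0.0.1": 0, "10.0.0.2": 0, "10.0.0.3": 0}

def acquire() -> str:
    # Pick the backend with the fewest in-flight connections
    server = min(active, key=active.get)
    active[server] += 1
    return server

def release(server: str) -> None:
    # Called when the response completes; frees a slot on that backend
    active[server] -= 1
```

A server stuck on a slow query holds its connections open, so its count stays high and new traffic flows elsewhere automatically.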
IP hash
Hash the client’s IP address and use the result to select a server. The same client always reaches the same backend, which gives you crude session affinity without cookies or state. The downside is that the distribution depends on the hash function and the IP space. A few large corporate NATs can concentrate traffic on one server. Adding or removing servers rehashes everything and breaks all existing affinities.
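A sketch of the idea, using MD5 purely as a stable, well-distributed hash (the pool addresses are illustrative):

```python
import hashlib

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def pick_by_ip(client_ip: str) -> str:
    # Stable hash of the client IP, modulo pool size. Note the flaw:
    # adding or removing a server changes the modulus, which remaps
    # most clients to a different backend.
    digest = hashlib.md5(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]
```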
Consistent hashing
A smarter version of IP hash that solves the rehashing problem. Servers and request keys sit on a hash ring. Each request maps to the nearest server clockwise on the ring. When you add a server, only the requests that fall between the new node and its neighbor get reassigned. Everything else stays put. With 150 virtual nodes per server, the load variance drops below 10%. This is the algorithm distributed caches rely on, and the right choice whenever session locality or cache hit rates matter.
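A minimal hash ring with virtual nodes might look like this; MD5 as the hash function and 150 vnodes are illustrative defaults, not a specific library's API:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Any stable, well-distributed hash works; MD5 is used for brevity
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, servers, vnodes=150):
        # Each server gets `vnodes` points on the ring, which smooths
        # out the load variance across servers.
        self.ring = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def lookup(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at 0
        idx = bisect.bisect(self.keys, _hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

Adding a fourth server to a three-server ring should remap only about a quarter of the keys; a modulo-based scheme would remap about three quarters.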
Comparing distribution across algorithms
Consider how 10,000 requests distribute across 5 servers under each algorithm, assuming a realistic traffic pattern where 30% of requests come from 5% of clients and request durations vary from 2ms to 800ms.
Round robin produces perfect uniformity in request count but ignores server load. Least connections adapts to actual utilization so the numbers shift slightly based on which servers finish faster. IP hash creates visible hotspots because client IP distributions are not uniform. Consistent hashing with enough virtual nodes approaches the balance of round robin while preserving request affinity.
Health checks
A load balancer that sends traffic to a dead server is worse than no load balancer at all. Health checks solve this by probing backends at regular intervals and removing failed ones from the pool.
Passive health checks monitor live traffic. If a server returns five consecutive 503 errors in 30 seconds, the balancer marks it unhealthy and stops routing to it. This costs nothing in extra traffic but only detects failures after real users hit them.
Active health checks send synthetic probes. The balancer pings GET /health every 10 seconds. If three consecutive checks fail, the server is removed. When it starts passing again, it re-enters the pool. Active checks catch failures before users do but add load to backends. A pool of 100 servers checked every 5 seconds generates 20 health requests per second, which is trivial for most services.
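The pass/fail logic can be sketched independently of the probing itself. The thresholds below mirror the numbers above but are otherwise illustrative, and the probe result is passed in so the sketch stays network-free:

```python
# Active health check state machine: a backend is marked unhealthy
# after `fail_threshold` consecutive failed probes and restored after
# `rise_threshold` consecutive passing ones.
class HealthChecker:
    def __init__(self, fail_threshold=3, rise_threshold=2):
        self.fail_threshold = fail_threshold
        self.rise_threshold = rise_threshold
        self.healthy = True
        self.streak = 0  # consecutive probes disagreeing with current state

    def record(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            self.streak = 0  # probe agrees with current state; reset
        else:
            self.streak += 1
            threshold = self.rise_threshold if probe_ok else self.fail_threshold
            if self.streak >= threshold:
                self.healthy = probe_ok  # flip state, start fresh
                self.streak = 0
        return self.healthy
```

Requiring consecutive failures before removal prevents one dropped packet from ejecting a healthy server; requiring consecutive passes before reinstatement prevents a flapping server from rejoining too eagerly.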
Good health endpoints go beyond “process is alive.” They verify the server can reach its database, that disk space is above a threshold, and that response latency is within bounds. A server that returns 200 OK but takes 30 seconds to serve any real request is effectively dead. Your health check should reflect that.
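A sketch of what such an endpoint might compute; `check_database` is a hypothetical stand-in for a real connectivity probe, and the thresholds are illustrative:

```python
import shutil
import time

def check_database() -> bool:
    # Hypothetical placeholder: a real check would open a connection
    # and run a trivial query like SELECT 1.
    return True

def health_status(min_free_bytes=1 << 30, max_latency_s=1.0):
    # Measure how long the dependency check takes, since a slow
    # dependency makes the server effectively dead for users.
    start = time.monotonic()
    db_ok = check_database()
    latency = time.monotonic() - start
    disk_ok = shutil.disk_usage("/").free >= min_free_bytes
    ok = db_ok and disk_ok and latency <= max_latency_s
    body = {"db": db_ok, "disk": disk_ok, "latency_s": round(latency, 3)}
    return (200 if ok else 503), body
```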
Sticky sessions
Stateless backends are the goal, but reality is messier. Shopping carts stored in server memory, WebSocket connections, multi-step form wizards that accumulate state: sometimes a client needs to return to the same server.
Sticky sessions (session affinity) pin a client to a backend for the duration of a session. The balancer typically inserts a cookie like SERVERID=srv2 on the first response. Subsequent requests include this cookie and get routed to srv2. If srv2 dies, the session is lost and the client gets a new backend.
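The cookie-handling logic amounts to a few lines. The cookie name and server names below are illustrative, and `random.choice` stands in for whatever base algorithm the balancer uses:

```python
import random

POOL = {"srv1", "srv2", "srv3"}

def route(cookies: dict):
    # Honor an existing affinity cookie if that backend is still alive
    pinned = cookies.get("SERVERID")
    if pinned in POOL:
        return pinned, {}
    # Otherwise fall back to normal balancing and pin via Set-Cookie
    server = random.choice(sorted(POOL))
    return server, {"SERVERID": server}
```

Note the failure path: if the pinned server leaves the pool, the client silently gets a new backend, and any in-memory session state on the old one is gone.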
The tradeoff is clear. Sticky sessions let you defer the work of externalizing state, but they create uneven load distribution. One server might accumulate long-lived sessions while others sit idle. They also complicate deployments: rolling out a new version means draining sessions from old servers, which can take minutes or hours.
The better path is to move session state into a shared store like Redis or a database. Then any server can handle any request and you do not need sticky sessions at all. Use stickiness as a transitional measure, not a permanent architecture.
Global vs regional load balancing
Everything discussed so far covers load balancing within a single data center or region. Global load balancing distributes traffic across regions.
DNS-based global load balancing returns different IP addresses based on the client’s geographic location. A user in Tokyo gets the IP of your Tokyo data center. A user in London gets Frankfurt. This is how most CDNs work. The limitation is DNS TTLs. If your Tokyo data center goes down, clients with cached DNS records keep hitting it until the TTL expires. Setting TTLs to 60 seconds helps but adds DNS lookup overhead.
Anycast advertises the same IP address from multiple locations via BGP. The network itself routes packets to the nearest healthy instance. Cloudflare uses anycast across 300+ data centers. Failover is nearly instant because BGP reconverges in seconds. The downside is that anycast works best for stateless or connection-light protocols. Long-lived TCP connections can break during BGP route changes.
Global server load balancing (GSLB) combines health awareness with geographic routing. The GSLB system monitors the health and capacity of each region and directs traffic accordingly. If the US-East region is at 90% capacity while US-West is at 40%, GSLB shifts traffic westward even for US-East users. This is what large-scale systems use in practice.
For a deeper discussion on scaling strategies across regions, see scalability patterns.
Failure modes and mitigations
Load balancers themselves are single points of failure if deployed carelessly. The standard mitigation is active-passive or active-active redundancy.
In active-passive, two balancers share a virtual IP via a protocol like VRRP. The active balancer handles all traffic. If it fails, the passive one takes over the virtual IP within 1 to 3 seconds. The downside is that you pay for a machine that sits idle most of the time.
In active-active, both balancers handle traffic simultaneously. DNS round robin or anycast distributes connections across them. If one fails, the other absorbs the full load. This requires each balancer to be provisioned for peak traffic, not half of it.
Connection draining matters during planned failover. When you take a balancer out of rotation for maintenance, it should stop accepting new connections but finish processing existing ones. Cutting connections mid-request causes errors that are entirely avoidable.
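A sketch of the draining logic, assuming the balancer keeps a count of in-flight connections:

```python
import threading

# Connection draining: refuse new connections, then wait for the
# in-flight ones to finish before shutting down.
class Drainer:
    def __init__(self):
        self.accepting = True
        self.active = 0
        self.cond = threading.Condition()

    def try_accept(self) -> bool:
        with self.cond:
            if not self.accepting:
                return False  # new connections refused during drain
            self.active += 1
            return True

    def finish(self) -> None:
        with self.cond:
            self.active -= 1
            self.cond.notify_all()  # wake drain() to re-check the count

    def drain(self, timeout: float = 30.0) -> bool:
        with self.cond:
            self.accepting = False
            # Block until every in-flight connection completes,
            # or give up after the timeout
            return self.cond.wait_for(lambda: self.active == 0, timeout)
```

The timeout matters: a handful of pathologically slow requests should not hold a maintenance window hostage, so after the deadline the remaining connections get cut.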
Choosing the right setup
For a small service handling under 10,000 requests per second, a single L7 load balancer with round robin and active health checks is enough. Nginx or HAProxy on a 4-core machine handles this easily.
At 50,000 to 200,000 requests per second, you need L4 balancing in front of L7 balancers. Use least connections at both layers. Deploy active-passive pairs for redundancy. Start thinking about consistent hashing if your backends maintain local caches.
Beyond 200,000 requests per second, you are looking at global load balancing with anycast, multiple regional L4/L7 tiers, automated GSLB with health-aware routing, and careful capacity planning. At this scale, the load balancing infrastructure itself is a distributed system that requires its own monitoring, alerting, and failover testing.
The principles do not change with scale. Distribute traffic. Detect failures. Remove bad backends. Route intelligently. The mechanisms get more sophisticated, but the goals remain the same.
What comes next
Load balancers keep backends from being overwhelmed by distributing traffic. But distributing requests is only half the problem. The other half is avoiding redundant work entirely. Caching sits between the client and the backend, serving repeated requests from memory instead of recomputing them. It reduces latency, cuts backend load, and changes the economics of serving traffic at scale.