Rate limiting and throttling
In this series (20 parts)
- What is system design and why it matters
- Estimations and back-of-envelope calculations
- Scalability: vertical vs horizontal scaling
- CAP theorem and distributed system tradeoffs
- Consistency models
- Load balancing
- Caching: strategies and patterns
- Content Delivery Networks
- Databases: SQL vs NoSQL and when to use each
- Database replication
- Database sharding and partitioning
- Consistent hashing
- Message queues and event streaming
- API design: REST, GraphQL, gRPC
- Rate limiting and throttling
- Proxies: forward and reverse
- Networking concepts for system design
- Reliability patterns: timeouts, retries, circuit breakers
- Observability: logging, metrics, tracing
- Security in system design
A single misbehaving client can bring down an entire service. It does not need to be malicious. A retry loop with no backoff, a misconfigured cron job, or a viral product launch can generate enough traffic to saturate your database connections and starve every other user. Rate limiting is the mechanism that prevents this. It puts a ceiling on how many requests a client can make within a defined time window, and it rejects or delays everything beyond that ceiling.
This concept is foundational to building reliable APIs. If you have worked through API design principles, you already know that a well-designed API communicates constraints clearly. Rate limiting is one of the most important constraints you will enforce.
Why rate limiting matters
Without rate limiting, your system is exactly as robust as your least disciplined client. Consider a public API serving 10,000 customers. One customer deploys a bug that sends 50,000 requests per second instead of 50. Your database connection pool maxes out at 500 connections. Every other customer starts seeing timeouts. Your monitoring fires alerts. Engineers scramble. The fix is simple: cut off the offending client. But without rate limiting, you are doing that manually, after the damage is done.
Rate limiting also serves as a cost control mechanism. Cloud infrastructure bills scale with usage. If your system processes every request without limits, a traffic spike from a single source can double your compute costs in hours. Stripe, for example, enforces a default limit of 100 requests per second per API key. Shopify caps at 40 requests per second for REST API calls. These are not arbitrary numbers. They reflect the capacity each company is willing to allocate per tenant.
There is a distinction between rate limiting and throttling that is worth clarifying early. Rate limiting is a policy: “this client gets 1,000 requests per minute.” Throttling is the enforcement mechanism: when a client exceeds its limit, the system slows down or rejects requests. In practice, engineers use these terms interchangeably, but the separation helps when discussing implementation.
The token bucket algorithm
The token bucket is the most widely used rate limiting algorithm. It is simple, memory-efficient, and naturally supports burst traffic.
The idea: imagine a bucket that holds tokens. The bucket has a maximum capacity (the burst size). Tokens are added at a fixed rate (the refill rate). Each request consumes one token. If the bucket is empty, the request is rejected. If the bucket is full, new tokens are discarded.
Suppose you configure a bucket with capacity 10 and a refill rate of 2 tokens per second. A client can burst up to 10 requests instantly. After that, it can sustain 2 requests per second. If the client goes idle for 5 seconds, the bucket refills to 10, and another burst is available.
Amazon API Gateway, NGINX, and Envoy all use token bucket implementations. The algorithm requires only two values in memory per client: the current token count and the timestamp of the last refill. For a million clients, that is roughly 16 MB of state.
```mermaid
stateDiagram-v2
    [*] --> HasTokens
    HasTokens --> HasTokens: Request arrives, tokens > 0, consume token
    HasTokens --> Empty: Request arrives, tokens = 1, consume last token
    Empty --> Empty: Request arrives, reject (429)
    Empty --> HasTokens: Refill timer fires, add tokens
    HasTokens --> Full: Refill timer fires, tokens = capacity
    Full --> HasTokens: Request arrives, consume token
    Full --> Full: Refill timer fires, discard tokens
```
Token bucket state transitions. Tokens accumulate up to the bucket capacity. Requests drain tokens. An empty bucket triggers HTTP 429 responses until the next refill.
The token bucket gives you two tuning knobs. The capacity controls how large a burst you tolerate. The refill rate controls sustained throughput. For a payment API, you might set capacity to 5 and refill to 2 per second because you want tight control. For a search autocomplete endpoint, capacity 50 and refill 20 per second lets the client feel snappy while staying within bounds.
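A minimal in-memory sketch of the token bucket shows both knobs in action. The clock is injectable so the demo is deterministic; the class and parameter names are illustrative, not from any particular library.

```python
import time

class TokenBucket:
    """In-memory token bucket. `capacity` bounds burst size; `refill_rate`
    is tokens added per second. Only two values are stored per client:
    the token count and the last refill timestamp."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity      # a new client starts with a full bucket
        self.last_refill = clock()
        self.clock = clock

    def allow(self):
        now = self.clock()
        # Lazy refill: credit tokens for elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Deterministic demo with a fake clock: capacity 10, refill 2 tokens/second.
t = [0.0]
bucket = TokenBucket(capacity=10, refill_rate=2, clock=lambda: t[0])
burst = sum(bucket.allow() for _ in range(12))  # 10 allowed, 2 rejected
t[0] += 5.0                                     # 5 idle seconds refill the bucket
recovered = bucket.allow()                      # allowed again after the refill
```

Note the lazy refill: rather than running a timer per client, the bucket credits tokens on demand from the elapsed time, which is what keeps per-client state down to two values.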
The leaky bucket algorithm
The leaky bucket is the token bucket’s more disciplined sibling. Instead of allowing bursts, it smooths traffic into a constant output rate. Requests enter a queue (the bucket). The queue drains at a fixed rate. If the queue is full, new requests are dropped.
Think of it as a funnel. No matter how fast you pour water in, it drips out at the same speed. This is ideal for systems that need a perfectly uniform processing rate. Telecom networks historically used leaky buckets to shape packet flow because their downstream equipment could not tolerate bursts.
The downside is that legitimate burst traffic gets penalized. A user loading a dashboard that fires 8 parallel API calls will see some of those requests queued, even if the system has capacity. For most web APIs, the token bucket is preferred because it accommodates natural usage patterns.
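The queue-and-drain behavior can be sketched the same way. This is a simplified model (it drains whole requests and, as a sketch, drops fractional leak credit); the names are illustrative.

```python
import time
from collections import deque

class LeakyBucket:
    """Leaky bucket as a bounded queue: requests wait in the bucket and
    drain (are processed) at `leak_rate` per second; arrivals that find
    the bucket full are dropped."""

    def __init__(self, capacity, leak_rate, clock=time.monotonic):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.queue = deque()
        self.last_leak = clock()
        self.clock = clock

    def offer(self, request):
        # Drain whole requests for the time elapsed since the last leak.
        now = self.clock()
        drained = int((now - self.last_leak) * self.leak_rate)
        if drained:
            self.last_leak = now
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()
        if len(self.queue) >= self.capacity:
            return False    # bucket full: drop the request
        self.queue.append(request)
        return True

# Deterministic demo: capacity 5, draining 1 request per second.
t = [0.0]
lb = LeakyBucket(capacity=5, leak_rate=1, clock=lambda: t[0])
accepted = sum(lb.offer(i) for i in range(8))  # 5 queued, 3 dropped
t[0] += 2.0                                    # 2 requests drain out
freed = lb.offer("late")                       # accepted: room again
```

The burst penalty is visible here: 8 simultaneous arrivals get only 5 slots, regardless of how much downstream capacity exists.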
Fixed window counting
Fixed window is the simplest algorithm to implement. Divide time into windows of fixed size (say, 60 seconds). Maintain a counter per client per window. Increment the counter on each request. If the counter exceeds the limit, reject the request. Reset the counter when the window rolls over.
Implementation requires a single counter per client. With Redis, it is a single INCR command with a TTL equal to the window size. This is about as cheap as rate limiting gets.
The problem is boundary spikes. A client can send 100 requests at second 59 of window 1, then 100 more at second 0 of window 2. That is 200 requests in 2 seconds, despite a limit of 100 per minute. The system sees two compliant windows. The infrastructure sees a burst that is double the intended rate.
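The boundary spike is easy to reproduce with a minimal fixed window counter (fake clock for determinism; names are illustrative):

```python
class FixedWindowLimiter:
    """Fixed window counter: time is divided into windows of fixed size,
    and each window gets an independent counter."""

    def __init__(self, limit, window_seconds, clock):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counts = {}    # window index -> requests seen in that window

    def allow(self):
        window_id = int(self.clock() // self.window)
        count = self.counts.get(window_id, 0)
        if count >= self.limit:
            return False
        self.counts[window_id] = count + 1
        return True

# Boundary spike: 100 requests at second 59, 100 more at second 60.
t = [59.0]
fw = FixedWindowLimiter(limit=100, window_seconds=60, clock=lambda: t[0])
first = sum(fw.allow() for _ in range(100))    # window 0: all allowed
t[0] = 60.0                                    # one second later, new window
second = sum(fw.allow() for _ in range(100))   # window 1: all allowed again
# 200 requests passed in 2 seconds despite a limit of 100 per minute.
```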
Sliding window algorithms
Sliding window counters fix the boundary problem. There are two variants.
Sliding window log stores the timestamp of every request. To check the limit, count all timestamps within the last N seconds. This is perfectly accurate but expensive. For a client making 1,000 requests per minute, you store 1,000 timestamps per window. At scale, memory consumption becomes a real concern.
Sliding window counter is the practical compromise. It keeps the fixed window counters but weights the previous window by how much of it still overlaps the sliding window. If you are 30% into the current window, 70% of the previous window still overlaps, so the effective count is current_window_count + (previous_window_count * 0.70). Cloudflare uses this approach. It requires only two counters per client (current and previous window) and provides a good approximation with negligible memory overhead.
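The two-counter weighting can be sketched as follows (fake clock for determinism; the rollover logic keeps only the immediately previous window, and all names are illustrative):

```python
class SlidingWindowLimiter:
    """Sliding window counter: the previous window's count is weighted by
    the fraction of it that still overlaps the sliding window."""

    def __init__(self, limit, window_seconds, clock):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.window_id = None
        self.current = 0
        self.previous = 0

    def allow(self):
        now = self.clock()
        window_id = int(now // self.window)
        if window_id != self.window_id:
            # Roll over; anything older than one window no longer overlaps.
            adjacent = self.window_id is not None and window_id == self.window_id + 1
            self.previous = self.current if adjacent else 0
            self.current = 0
            self.window_id = window_id
        elapsed = (now % self.window) / self.window
        estimated = self.current + self.previous * (1 - elapsed)
        if estimated >= self.limit:
            return False
        self.current += 1
        return True

# Demo: limit 100/minute. Fill window 0, then move 30% into window 1.
t = [0.0]
sw = SlidingWindowLimiter(limit=100, window_seconds=60, clock=lambda: t[0])
filled = sum(sw.allow() for _ in range(120))   # 100 allowed in window 0
t[0] = 78.0                                    # 30% into window 1
# Effective count starts at 0 + 100 * 0.70 = 70, leaving room for 30.
allowed = sum(sw.allow() for _ in range(50))
```

Unlike the fixed window demo, a burst right after the window boundary is throttled because the previous window's traffic still counts against the limit.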
The sliding window counter is what most production systems use. It is nearly as accurate as the log approach, requires the same memory as fixed window, and eliminates the boundary spike problem. For a limit of 600 requests per minute, the worst-case error is roughly 5% under typical traffic patterns.
Request rates under limiting
The following chart shows how rate limiting affects actual request throughput over time. The client attempts to send bursts, but the limiter smooths the effective rate to the configured ceiling.
Two traffic spikes hit the system. The rate limiter caps effective throughput at 100 requests per second. Excess requests above the line are rejected with HTTP 429.
Distributed rate limiting
Everything above assumes a single process tracking request counts. In production, you have dozens or hundreds of application servers behind a load balancer. A client’s requests spread across all of them. Each server sees only a fraction of the client’s total traffic. Without coordination, a client limited to 100 requests per second could effectively get 100 multiplied by the number of servers.
There are three common solutions.
Centralized counter with Redis. Every application server checks and increments a counter in Redis before processing a request. Redis handles roughly 100,000 operations per second on a single instance. For most workloads, this is enough. The tradeoff is that every request now includes a network round trip to Redis, adding 0.5 to 2ms of latency. If Redis goes down, you need a fallback policy: fail open (allow all traffic) or fail closed (reject all traffic). Most systems fail open with local rate limiting as a backup.
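The fail-open policy amounts to a small wrapper around the central check. This is a toy sketch with stub callables standing in for a Redis-backed check and a local limiter; the exception and class names are invented for illustration.

```python
class CentralLimiterUnavailable(Exception):
    """Raised by the central check when the store is unreachable."""

class FailOpenLimiter:
    """Try the central (e.g. Redis-backed) check first; if the store is
    unreachable, fall back to a per-server local limiter instead of
    rejecting all traffic."""

    def __init__(self, central_check, local_check):
        self.central_check = central_check  # callable -> bool, may raise
        self.local_check = local_check      # callable -> bool

    def allow(self):
        try:
            return self.central_check()
        except CentralLimiterUnavailable:
            return self.local_check()

# Demo with stubs: the central store is down, so the local backup decides.
def central_down():
    raise CentralLimiterUnavailable

limiter = FailOpenLimiter(central_down, local_check=lambda: True)
decision = limiter.allow()   # failed open via the local backup
```

A fail-closed variant would return False in the except branch instead; which policy is right depends on whether availability or strict enforcement matters more for the endpoint.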
Sticky sessions. Route all requests from a given client to the same server, using consistent hashing on the API key or client IP. Each server tracks only its assigned clients. No coordination needed. The downside is uneven load distribution if some clients are much heavier than others, and you lose the rate limiting state if the server restarts.
Approximate distributed counting. Each server tracks counts locally and periodically syncs with a central store. Between syncs, the limit enforcement is approximate. If you sync every second and have 10 servers, a client might briefly exceed its limit by up to 10x during the sync gap. This is acceptable for some use cases (content feeds, telemetry ingestion) but not for others (payment APIs, authentication endpoints).
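The sync-gap overshoot is easy to reproduce in a toy simulation. The shared store here stands in for something like Redis, and all names are illustrative.

```python
class SharedStore:
    """Stand-in for a central counter store such as Redis."""
    def __init__(self):
        self.total = 0

class LocalLimiter:
    """Per-server limiter: counts requests locally and only learns about
    other servers' traffic when it syncs with the shared store."""

    def __init__(self, store, limit):
        self.store = store
        self.limit = limit
        self.local = 0        # requests allowed since the last sync
        self.known_total = 0  # global count as of the last sync

    def allow(self):
        if self.known_total + self.local >= self.limit:
            return False
        self.local += 1
        return True

    def sync(self):
        # Push the local delta, then read back the global total.
        self.store.total += self.local
        self.local = 0
        self.known_total = self.store.total

# 10 servers, a shared limit of 100, and no sync during the burst:
store = SharedStore()
servers = [LocalLimiter(store, limit=100) for _ in range(10)]
allowed = sum(s.allow() for s in servers for _ in range(100))  # 10x overshoot
for s in servers:
    s.sync()
after_sync = sum(s.allow() for s in servers)  # everyone now sees the overshoot
```

Each server independently allowed the full limit before syncing, which is exactly the up-to-Nx overshoot described above.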
For most teams, the Redis approach wins. It is simple, accurate, and the latency overhead is acceptable. Stripe, GitHub, and Discord all use centralized Redis counters for their public API rate limiting.
Where to enforce limits
Rate limiting can happen at multiple layers of your architecture. The right choice depends on what you are protecting.
At the API gateway or reverse proxy. This is the most common placement. NGINX, Envoy, Kong, and AWS API Gateway all have built-in rate limiting. Requests are rejected before they reach your application servers, which saves compute. The gateway has visibility into client identity (API key, IP address, JWT claims) and can apply per-client or per-route limits. This is where you stop volumetric abuse.
At the application layer. Some limits require business logic. “A free-tier user can create 5 projects per day” is not something a gateway can enforce without understanding your domain model. Application-level rate limiting checks counts against your database or cache within the request handler itself. It is slower but more expressive.
At the infrastructure layer. Cloud providers offer network-level rate limiting through services like AWS WAF, Cloudflare rate limiting rules, or iptables on individual hosts. These catch DDoS traffic and network-layer abuse before it reaches your application at all. Think of this as the outermost perimeter.
The best systems layer these defenses. Infrastructure-level rules block obvious attacks. The gateway enforces per-client API limits. The application layer handles domain-specific constraints. This defense-in-depth approach is a core piece of building reliable systems.
Communicating limits to clients
A well-implemented rate limiter communicates its state through HTTP headers. The standard headers (adopted by most major APIs) are:
- X-RateLimit-Limit: the maximum number of requests allowed in the window
- X-RateLimit-Remaining: how many requests the client has left
- X-RateLimit-Reset: the UTC epoch timestamp when the window resets
When a client exceeds the limit, the server responds with HTTP 429 (Too Many Requests) and a Retry-After header indicating how many seconds to wait. Good clients respect this. Bad clients ignore it, which is why server-side enforcement is non-negotiable.
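A small helper for emitting these headers might look like the following sketch. The X-RateLimit-* names follow the convention above; the function name and the demo timestamps are made up for illustration.

```python
import math
import time

def rate_limit_headers(limit, remaining, reset_epoch, now=None):
    """Build the standard rate limit headers for a response. When the
    limit is exhausted, add Retry-After with whole seconds to wait."""
    now = time.time() if now is None else now
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if remaining <= 0:
        # Round up so clients never retry before the window actually resets.
        headers["Retry-After"] = str(max(0, math.ceil(reset_epoch - now)))
    return headers

# Demo with made-up timestamps: the window resets 47.6 seconds from "now".
h = rate_limit_headers(100, 0, reset_epoch=1_700_000_060, now=1_700_000_012.4)
```

The Retry-After value is attached only on exhaustion, alongside the HTTP 429 status.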
This ties back to API design. Your rate limiting policy is part of your API contract. Document it. Include it in your OpenAPI spec. Make the limits discoverable at runtime through the response headers.
Choosing your algorithm
The choice depends on your priorities.
Use the token bucket when you want to allow bursts but enforce an average rate. This covers 80% of use cases. It is the default in AWS API Gateway, NGINX, and most API frameworks.
Use the leaky bucket when you need perfectly smooth output. Streaming pipelines, message queue consumers, and network traffic shapers benefit from constant-rate processing.
Use the sliding window counter when simplicity and memory efficiency matter and you want boundary-spike protection. Cloudflare and many internal systems at large tech companies use this.
Use fixed window only for prototyping or extremely cost-sensitive deployments where the boundary spike is acceptable.
Building a rate limiter from scratch
If you want to go deeper into the implementation details, including class design, thread safety, and testing strategies, the low-level design of a rate limiter article walks through building one step by step. It covers the token bucket implementation in code, the interface design, and how to make it work in a multithreaded environment.
What comes next
Rate limiting protects your services from excessive load, but it is only one layer in the request path. Before requests even reach your rate limiter, they pass through proxies that route, cache, and transform traffic. Understanding forward and reverse proxies will give you a complete picture of how requests flow from client to server and where each protection layer fits.