Design a rate limiter

In this series (18 parts)
  1. Design a URL shortener
  2. Design a key-value store
  3. Design a rate limiter
  4. Design a web crawler
  5. Design a notification system
  6. Design a news feed
  7. Design a chat application
  8. Design a video streaming platform
  9. Design a music streaming service
  10. Design a ride-sharing service
  11. Design a food delivery platform
  12. Design a hotel booking platform
  13. Design a search engine
  14. Design a distributed message queue
  15. Design a code deployment system
  16. Design a payments platform
  17. Design an ad click aggregation system
  18. Design a distributed cache

Every large-scale API needs a rate limiter. Without one, a single misbehaving client can consume all your capacity and take down the service for everyone else. The core challenge is not the algorithm (token bucket, sliding window) but making it work across dozens of servers, thousands of rules, and millions of users without adding meaningful latency to every request.

This case study builds on the concepts from the rate limiting and low-level design (LLD) rate limiter articles. Here we focus on the distributed system: how to share counters across nodes, where to place the limiter in the architecture, and what happens when the limiter itself fails.

Requirements

Functional:

  1. Throttle requests per client, per API endpoint, and globally.
  2. Support multiple rate limiting rules (e.g., 100 requests/minute per user AND 10,000 requests/minute globally for a given endpoint).
  3. Return standard HTTP headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.
  4. When a client exceeds the limit, respond with 429 Too Many Requests and a Retry-After header.
  5. Allow rule configuration changes without redeploying.
  6. Support tiered limits (free vs. paid users get different quotas).

Non-functional:

  • 50 million DAU, peak 500,000 QPS across all endpoints.
  • P99 latency overhead of the rate limiter must be under 5ms.
  • The limiter must not become a single point of failure. If the rate limiting infrastructure goes down, traffic should flow through (fail-open).
  • Rules should propagate to all nodes within 10 seconds of an update.

Capacity estimation

QPS: 50M DAU at an average of 200 requests/day per user is 10 billion requests/day, or roughly 115K average QPS. Peak traffic is ~500K QPS, and each request requires one rate limit check, so the limiter handles 500K lookups/second at peak.

Storage per counter: A single counter needs a key (user ID + rule ID, ~64 bytes), a count (8 bytes), and a TTL timestamp (8 bytes). That is roughly 80 bytes per counter. If we track 50M users across 5 rules each, that is 250M counters = 20 GB. This fits comfortably in a Redis cluster.

Bandwidth: Each rate limit check reads and increments a counter: ~80 bytes in, ~40 bytes out. At 500K QPS that is 60 MB/s, well within a Redis cluster’s capacity.
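These back-of-the-envelope figures can be checked with a few lines of arithmetic (the per-counter byte sizes are the estimates from the text, not measured values):

```python
# Back-of-the-envelope capacity check for the figures above.
dau = 50_000_000
requests_per_user_per_day = 200
daily_requests = dau * requests_per_user_per_day      # 10 billion/day
avg_qps = daily_requests / 86_400                     # ~115K average QPS

# Counter storage: key (~64 B) + count (8 B) + TTL timestamp (8 B).
bytes_per_counter = 64 + 8 + 8                        # 80 bytes
counters = dau * 5                                    # 5 rules per user
storage_gb = counters * bytes_per_counter / 1e9       # 20 GB

# Bandwidth at peak: ~80 B in + ~40 B out per check.
peak_qps = 500_000
bandwidth_mb_s = peak_qps * (80 + 40) / 1e6           # 60 MB/s

print(daily_requests, round(avg_qps), storage_gb, bandwidth_mb_s)
```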

High-level architecture

The rate limiter sits between the API gateway and the backend services. Clients never talk to the limiter directly. The gateway intercepts every request, calls the rate limiter, and either forwards the request or returns a 429.

graph LR
  C["Client"] --> GW["API Gateway"]
  GW --> RL["Rate Limiter Service"]
  RL --> RC["Redis Cluster"]
  RL --> RulesDB["Rules Config Store"]
  GW -->|"allowed"| BE["Backend Services"]
  GW -->|"429"| C
  RulesDB -->|"sync every 10s"| RL

High-level architecture. The API gateway consults the rate limiter on every request. Redis stores the counters; the rules config store holds the throttling rules.

There are three major components:

  1. Rate limiter service: A library or sidecar embedded in the gateway. It evaluates rules against counters stored in Redis.
  2. Redis cluster: Stores all counters with TTLs. Sharded by client ID for horizontal scaling.
  3. Rules config store: A database or config service (e.g., etcd, DynamoDB) that holds rate limiting rules. The limiter polls or subscribes to changes.

Deep dive: distributed token bucket

The token bucket algorithm is the best fit here. Each bucket has a maximum capacity and a refill rate. When a request arrives, we try to consume a token. If the bucket is empty, the request is rejected.

In a single-process system this is trivial. In a distributed system with 50+ gateway instances, all of them need to see the same bucket state. This is where Redis comes in.

Atomic check-and-decrement with Lua

Redis supports server-side Lua scripts that execute atomically. We use a single script to check the current token count, refill tokens based on elapsed time, and decrement if tokens are available:

-- Token bucket: atomically refill and try to consume one token.
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local now = tonumber(ARGV[3])          -- current time in seconds

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])

-- First request for this key: start with a full bucket.
if tokens == nil then
    tokens = max_tokens
    last_refill = now
end

-- Refill based on elapsed time, capped at the bucket's capacity.
local elapsed = now - last_refill
local new_tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

-- EXPIRE requires integer seconds; keep the key for two full refill cycles.
local ttl = math.ceil(max_tokens / refill_rate * 2)

if new_tokens >= 1 then
    new_tokens = new_tokens - 1
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    redis.call('EXPIRE', key, ttl)
    return {1, new_tokens, max_tokens}
else
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    redis.call('EXPIRE', key, ttl)
    return {0, new_tokens, max_tokens}
end

The script returns whether the request is allowed, the remaining tokens, and the max. The gateway translates this into the X-RateLimit-* headers.
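For intuition, the same check-refill-consume logic can be sketched in plain Python. This single-process mirror is illustrative only; the point of the Lua version is that Redis executes it atomically for all gateway instances:

```python
class TokenBucket:
    """Single-process mirror of the Lua script's logic (illustrative only)."""

    def __init__(self, max_tokens: float, refill_rate: float):
        self.max_tokens = max_tokens
        self.refill_rate = refill_rate   # tokens per second
        self.tokens = max_tokens         # start with a full bucket
        self.last_refill = 0.0

    def check(self, now: float) -> tuple[bool, float]:
        # Refill based on elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.max_tokens, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, self.tokens
        return False, self.tokens

bucket = TokenBucket(max_tokens=3, refill_rate=1.0)   # burst of 3, 1 token/s
print([bucket.check(now=0)[0] for _ in range(4)])     # [True, True, True, False]
print(bucket.check(now=2.0)[0])                       # 2s elapsed, refilled: True
```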

Why not sliding window?

Sliding window log is more accurate but stores a timestamp per request. At 500K QPS, a global rule alone accumulates 500K timestamps per second. Token bucket stores just two fields per key (token count and last refill time) regardless of request volume. At our scale, token bucket wins on both memory and latency.

sequenceDiagram
  participant C as Client
  participant GW as API Gateway
  participant RL as Rate Limiter
  participant R as Redis

  C->>GW: POST /api/orders
  GW->>RL: checkLimit(userId, endpoint)
  RL->>R: EVALSHA token_bucket_script
  R-->>RL: [allowed=1, remaining=42, limit=100]
  RL-->>GW: ALLOW, headers
  GW->>GW: Set X-RateLimit-Remaining: 42
  GW->>C: 200 OK (with rate limit headers)

  Note over C,R: When limit is exceeded

  C->>GW: POST /api/orders
  GW->>RL: checkLimit(userId, endpoint)
  RL->>R: EVALSHA token_bucket_script
  R-->>RL: [allowed=0, remaining=0, limit=100]
  RL-->>GW: DENY, headers
  GW->>C: 429 Too Many Requests + Retry-After

Sequence diagram for a rate limit check. The Lua script in Redis handles the token bucket logic atomically.

Deep dive: global vs. per-region limits

When your service runs in multiple regions (US-East, EU-West, AP-South), you face a choice: should the rate limit be global or per-region?

Per-region limits are simpler. Each region has its own Redis cluster and its own counters. A user hitting the US-East endpoint gets a separate budget from the same user hitting EU-West. This is fast (no cross-region calls) but inaccurate: a user can get 2x or 3x their intended quota by distributing requests across regions.

Global limits require cross-region coordination. Two approaches:

  1. Single leader region: All rate limit checks go to one Redis cluster. Simple, accurate, but adds cross-region latency (50-200ms). Unacceptable for our 5ms target.
  2. Eventual sync with local budgets: Each region gets an allocated portion of the global limit. Periodically (every 1-2 seconds), regions sync their usage to a central coordinator, which rebalances allocations. This trades some accuracy for speed.

The practical choice for most systems is approach 2. You accept that a user might briefly exceed their limit by 5-10% during rebalancing windows, but every individual request is checked in under 2ms against the local Redis.
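The rebalancing step in approach 2 can be sketched as follows. The proportional-allocation policy and the 10% floor are assumptions for illustration; production systems use more elaborate schemes:

```python
def rebalance(global_limit: int, usage: dict[str, int]) -> dict[str, int]:
    """Reallocate each region's share of the global limit in proportion
    to its recent usage, with a floor so idle regions keep some budget."""
    total = sum(usage.values())
    floor = max(1, global_limit // (len(usage) * 10))  # assumed 10%-of-even-share floor
    if total == 0:
        even = global_limit // len(usage)
        return {region: even for region in usage}
    # Note: floors and rounding can make allocations sum to slightly more
    # than global_limit -- the bounded over-admission the text describes.
    return {
        region: max(floor, round(global_limit * used / total))
        for region, used in usage.items()
    }

# 600 req/min global limit; US-East is carrying most of the traffic.
print(rebalance(600, {"us-east": 450, "eu-west": 120, "ap-south": 30}))
```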

Deep dive: graceful degradation and circuit breakers

What happens when Redis is down? If the rate limiter cannot check counters, you have two options:

  • Fail-closed: Reject all requests. Safe but causes a total outage whenever Redis hiccups.
  • Fail-open: Allow all requests through. Risky, but the backend services should have their own protection (connection limits, caching layers, queue-based load leveling).

For most services, fail-open is the right default. A rate limiter outage is temporary; the damage from letting unthrottled traffic through for 30 seconds is usually less than the damage from a complete service outage.

We wrap the Redis call in a circuit breaker. If Redis failures exceed a threshold (e.g., 5 failures in 10 seconds), the circuit opens and the limiter bypasses Redis entirely, following reliability patterns for fault tolerance.
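A stripped-down version of that breaker might look like this. The thresholds (5 failures in 10 seconds) and the fail-open default come from the text; the cooldown value and everything else are illustrative assumptions:

```python
class FailOpenBreaker:
    """Circuit breaker around the Redis check: fail-open when Redis is unhealthy."""

    def __init__(self, max_failures=5, window_s=10, cooldown_s=30):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.failures = []        # timestamps of recent failures
        self.opened_at = None     # when the circuit tripped, if open

    def allow(self, check_redis, now: float) -> bool:
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown_s:
                return True                  # open: fail-open, skip Redis entirely
            self.opened_at = None            # cooldown over: half-open, probe Redis
        try:
            return check_redis()             # closed: normal rate limit check
        except ConnectionError:
            self.failures = [t for t in self.failures if now - t < self.window_s]
            self.failures.append(now)
            if len(self.failures) >= self.max_failures:
                self.opened_at = now         # trip the breaker
            return True                      # fail-open on error

def down():
    raise ConnectionError("redis unreachable")

breaker = FailOpenBreaker()
results = [breaker.allow(down, now=float(i)) for i in range(6)]
print(results, breaker.opened_at)   # all True (fail-open); breaker trips at t=4.0
```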

stateDiagram-v2
  [*] --> Closed
  Closed --> Open: failures > threshold
  Open --> HalfOpen: timeout expires
  HalfOpen --> Closed: probe succeeds
  HalfOpen --> Open: probe fails

  state Closed {
      [*] --> CheckRedis
      CheckRedis --> Allowed: tokens > 0
      CheckRedis --> Denied: tokens = 0
  }

  state Open {
      [*] --> AllowAll
  }

  state HalfOpen {
      [*] --> ProbeRedis
      ProbeRedis --> TestResult
  }

Circuit breaker states for the rate limiter. In the open state, all requests are allowed through (fail-open). The half-open state sends a single probe to Redis before closing the circuit.

During the open state, the gateway can optionally fall back to in-memory counters per instance. These are not globally consistent, but they provide rough protection: if a single client is flooding one gateway instance, the local counter will still catch it.
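The in-memory fallback can be as simple as a fixed-window counter keyed by (client, window), which this sketch assumes; each gateway instance sees only its own traffic:

```python
from collections import defaultdict

class LocalFallbackLimiter:
    """Per-instance fixed-window counter used while the circuit is open.
    Not globally consistent, but still catches a client flooding one instance."""

    def __init__(self, limit_per_window: int, window_s: int = 60):
        self.limit = limit_per_window
        self.window_s = window_s
        self.counts = defaultdict(int)   # (client_id, window index) -> count

    def allow(self, client_id: str, now: int) -> bool:
        window = now // self.window_s
        self.counts[(client_id, window)] += 1
        return self.counts[(client_id, window)] <= self.limit

fallback = LocalFallbackLimiter(limit_per_window=3)
print([fallback.allow("user-1", now=0) for _ in range(4)])  # [True, True, True, False]
print(fallback.allow("user-1", now=60))                     # new window: True
```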

Trade-offs and alternative approaches

Approach | Pros | Cons
Token bucket in Redis | Simple, low memory, supports bursts | Requires Redis; slight inaccuracy with eventual sync
Sliding window counter | More uniform rate enforcement | Higher memory; more complex Lua script
Sliding window log | Exact per-request tracking | Stores every timestamp; O(n) cleanup
Local in-memory only | Zero network latency, no dependency | No global coordination; limits scale with instance count
API gateway built-in (e.g., Kong, Envoy) | Zero code; config-only | Less flexibility for complex multi-tier rules

Redis vs. Memcached: Redis wins here because of Lua scripting (atomic check-and-decrement), per-key TTLs, and clustering support. Memcached’s incr command works for simple counters but cannot express the token bucket refill logic atomically.

Sidecar vs. library: Running the rate limiter as a sidecar (separate process per pod) isolates failures but adds one network hop. Running it as an in-process library removes that hop but couples the limiter’s lifecycle to the application. At 500K QPS, the network hop matters; most high-performance implementations use a library.

What real systems actually do

Stripe uses a token bucket algorithm with Redis. They run a multi-tier system: per-API-key limits, per-endpoint limits, and global account limits. Their rate limiter is a library in the API gateway, not a separate service.

Cloudflare runs rate limiting at the edge in each PoP. They use local counters with periodic cross-PoP sync for global limits. Their architecture accepts some over-admission in exchange for zero-latency limit checks.

GitHub API uses a sliding window approach. Each response includes X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers. Authenticated users get 5,000 requests per hour; unauthenticated get 60.

AWS API Gateway offers built-in rate limiting with token bucket semantics. It supports per-API-key and per-stage throttling, configured entirely through the console or CloudFormation. Under the hood, it uses distributed counters synced across availability zones.

The common thread: everyone uses token bucket or sliding window, everyone stores counters in a fast in-memory store, and everyone accepts some imprecision in distributed setups rather than pay for strong consistency.

Rule configuration and propagation

Rate limiting rules live in a config store (DynamoDB, etcd, or a simple Postgres table). Each rule specifies a scope, a limit, and a time window:

{
  "rule_id": "user-orders-minute",
  "scope": "per_user",
  "endpoint": "/api/orders",
  "max_requests": 100,
  "window_seconds": 60,
  "tier": "free"
}

The rate limiter service polls the config store every 10 seconds and caches rules locally. When a rule changes, the worst case is a 10-second window where different gateway instances enforce different limits. This is acceptable for our use case. If you need faster propagation, use a pub/sub channel (Redis Pub/Sub or a message queue) to push updates.

Tiered rules are straightforward: look up the user’s tier from a cache or token claim, then select the matching rule set. Free users get 100 requests/minute; paid users get 1,000. The Lua script does not need to know about tiers. The gateway resolves the tier and passes the correct max_tokens and refill_rate to the script.
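Tier resolution at the gateway might look like the sketch below. The rule shape mirrors the JSON above; the cache layout and the `limiter_args` helper are illustrative assumptions:

```python
# Rules cached locally from the config store, keyed by (endpoint, tier).
# The field names mirror the JSON rule shown above.
RULES = {
    ("/api/orders", "free"): {"max_requests": 100, "window_seconds": 60},
    ("/api/orders", "paid"): {"max_requests": 1000, "window_seconds": 60},
}

def limiter_args(endpoint: str, tier: str) -> tuple[float, float]:
    """Resolve the rule for this tier and derive the (max_tokens, refill_rate)
    arguments the Lua script expects. The script stays tier-agnostic."""
    rule = RULES[(endpoint, tier)]
    max_tokens = float(rule["max_requests"])
    refill_rate = max_tokens / rule["window_seconds"]   # tokens per second
    return max_tokens, refill_rate

print(limiter_args("/api/orders", "free"))   # 100 tokens, ~1.67 tokens/s
print(limiter_args("/api/orders", "paid"))   # 1000 tokens, ~16.7 tokens/s
```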

Response header design

Clients need to know their budget. Standard headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1713600000

X-RateLimit-Reset is a Unix epoch timestamp. Some APIs use seconds-until-reset instead; a relative value sidesteps client clock skew, while the epoch form is more common and lets clients schedule retries against an absolute time.

On a 429 response, include Retry-After with the number of seconds until the bucket refills enough for at least one request. Well-behaved clients use this to back off automatically.
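With a token bucket, Retry-After falls out of the bucket state: the time for the token deficit to refill. A small derivation (not tied to any specific implementation):

```python
import math

def retry_after_seconds(tokens: float, refill_rate: float) -> int:
    """Seconds until the bucket holds at least one token again.
    tokens: current (possibly fractional) count, < 1 when a request was denied.
    refill_rate: tokens added per second."""
    deficit = 1.0 - tokens
    return max(1, math.ceil(deficit / refill_rate))

# Empty bucket refilling at 100 tokens/minute (~1.67 tokens/s):
print(retry_after_seconds(tokens=0.0, refill_rate=100 / 60))   # 1
# Refilling at 1 token per 30 seconds:
print(retry_after_seconds(tokens=0.25, refill_rate=1 / 30))    # 23
```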

Good rate limit headers are not just a courtesy. They allow clients to implement intelligent retry logic and smooth out their request patterns. Without them, clients resort to blind exponential backoff, which wastes both their time and your capacity during recovery windows.

What comes next

This case study covered the steady-state design: how requests flow, how counters work, and what happens when Redis fails. There are several extensions worth exploring:

  1. Rate limiting WebSocket connections. Token bucket works for request/response, but long-lived connections need message-rate limiting within the connection.
  2. Adaptive rate limiting. Instead of fixed rules, adjust limits based on real-time system load. When CPU utilization crosses 80%, tighten limits automatically.
  3. Cost-based limiting. Not all requests are equal. A search query costs more than a profile read. Assign different token costs to different endpoints.
  4. Multi-tenant fairness. In a SaaS platform, one tenant’s burst should not starve another. This leads to hierarchical token buckets or weighted fair queuing.

The rate limiter is a small component with outsized impact. Get it right and your users never notice it. Get it wrong and it becomes the single point of failure it was meant to prevent.
