Designing a rate limiter
In this series (15 parts)
- Introduction to low level design
- SOLID principles
- Design patterns: Creational
- Design patterns: Structural
- Design patterns: Behavioral
- Designing a parking lot
- Designing a library management system
- Designing an elevator system
- Designing a hotel booking system
- Designing a ride-sharing model
- Designing a rate limiter
- Designing a logging framework
- Designing a notification system
- API design and contract-first development
- Data modeling for system design
Every production API needs a rate limiter. Without one, a single client can exhaust your entire system’s capacity, whether through malice, a bug, or a sudden traffic spike. The goal is simple: enforce a ceiling on requests per client per time window, and reject everything beyond it. The challenge is doing it correctly in a distributed, multi-threaded environment.
If you have not already, review rate limiting for the high-level motivation. Here we focus on the low-level design: interfaces, data structures, concurrency, and distributed coordination.
Interface design
The rate limiter’s public contract should be minimal. A caller asks “is this request allowed?” and gets a yes or no answer. Everything else is an implementation detail.
public interface RateLimiter {
    boolean isAllowed(String clientId);
    RateLimitInfo getRateLimitInfo(String clientId);
}

public record RateLimitInfo(
    int limit,
    int remaining,
    long resetTimestamp
) {}
The RateLimitInfo object is how you populate the X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers in your HTTP response. Clients need this feedback to self-regulate.
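As a sketch of how a gateway filter might translate RateLimitInfo into those headers (the header map and helper are illustrative stand-ins, not a specific framework API):

```java
import java.util.HashMap;
import java.util.Map;

public class RateLimitHeaders {
    // Mirrors the RateLimitInfo record from the interface above.
    record RateLimitInfo(int limit, int remaining, long resetTimestamp) {}

    // Hypothetical helper: copy the limiter's view of a client into HTTP headers.
    static Map<String, String> toHeaders(RateLimitInfo info) {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-RateLimit-Limit", String.valueOf(info.limit()));
        headers.put("X-RateLimit-Remaining", String.valueOf(info.remaining()));
        headers.put("X-RateLimit-Reset", String.valueOf(info.resetTimestamp()));
        return headers;
    }

    public static void main(String[] args) {
        var info = new RateLimitInfo(100, 42, 1_700_000_000L);
        System.out.println(toHeaders(info).get("X-RateLimit-Remaining")); // 42
    }
}
```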
classDiagram
class RateLimiter {
<<interface>>
+isAllowed(clientId: String) boolean
+getRateLimitInfo(clientId: String) RateLimitInfo
}
class RateLimitInfo {
+int limit
+int remaining
+long resetTimestamp
}
class TokenBucketLimiter {
-Map~String, Bucket~ buckets
-RateLimitConfig defaultConfig
+isAllowed(clientId: String) boolean
+getRateLimitInfo(clientId: String) RateLimitInfo
}
class SlidingWindowLimiter {
-Map~String, Deque~ requestLogs
-RateLimitConfig defaultConfig
+isAllowed(clientId: String) boolean
+getRateLimitInfo(clientId: String) RateLimitInfo
}
class RateLimitConfig {
+int maxRequests
+long windowSizeMs
+int refillRate
}
RateLimiter <|.. TokenBucketLimiter
RateLimiter <|.. SlidingWindowLimiter
TokenBucketLimiter --> RateLimitConfig
SlidingWindowLimiter --> RateLimitConfig
RateLimiter --> RateLimitInfo
Class diagram showing the rate limiter interface with token bucket and sliding window implementations.
Token bucket: the core algorithm
The token bucket is the most widely used rate limiting algorithm. It is simple, memory-efficient, and naturally supports burst traffic. The idea: each client has a bucket that holds tokens. Every request consumes one token. Tokens refill at a fixed rate. If the bucket is empty, the request is denied.
public class Bucket {
    private double tokens;
    private final int maxTokens;
    private final double refillRatePerMs;
    private long lastRefillTimestamp;

    public Bucket(int maxTokens, double refillRatePerSecond) {
        this.tokens = maxTokens;
        this.maxTokens = maxTokens;
        this.refillRatePerMs = refillRatePerSecond / 1000.0;
        this.lastRefillTimestamp = System.currentTimeMillis();
    }

    public synchronized boolean tryConsume() {
        refill();
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    private void refill() {
        long now = System.currentTimeMillis();
        double elapsed = now - lastRefillTimestamp;
        tokens = Math.min(maxTokens, tokens + elapsed * refillRatePerMs);
        lastRefillTimestamp = now;
    }
}
The key insight is lazy refill. You do not run a background timer. Instead, you calculate the tokens that should have been added since the last request. This eliminates the need for a scheduled thread and reduces memory overhead.
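A quick worked example of the lazy refill arithmetic (the numbers are illustrative): a bucket with capacity 10 and a refill rate of 2 tokens per second has 3 tokens left, and the next request arrives 2,500 ms later.

```java
public class LazyRefillExample {
    public static void main(String[] args) {
        int maxTokens = 10;
        double refillRatePerMs = 2.0 / 1000.0; // 2 tokens per second
        double tokens = 3.0;                   // tokens left at the last refill
        long elapsedMs = 2_500;                // time since the last request

        // Same formula as Bucket.refill(): add what accrued, cap at capacity.
        tokens = Math.min(maxTokens, tokens + elapsedMs * refillRatePerMs);
        System.out.println(tokens); // 8.0
    }
}
```

Five tokens accrued during the gap, so the bucket goes from 3 to 8 without any background timer running.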
Request flow
When a request arrives, the API gateway intercepts it before it reaches the application logic. The rate limiter checks the client’s bucket and either allows the request through or returns HTTP 429.
sequenceDiagram
participant C as Client
participant G as API Gateway
participant R as RateLimiter
participant S as Storage
participant A as Application
C->>G: POST /api/resource
G->>R: isAllowed("client-42")
R->>S: fetch bucket for client-42
S-->>R: Bucket(tokens=5, lastRefill=t0)
R->>R: refill tokens based on elapsed time
R->>R: tryConsume one token
R->>S: update bucket state
R-->>G: allowed=true, remaining=4
G->>A: forward request
A-->>G: 200 OK with response body
G-->>C: 200 OK + rate limit headers
Sequence diagram for a rate-limited request passing through the API gateway.
When the limiter denies a request, the gateway returns a 429 status code with a Retry-After header. This tells the client exactly when to retry, preventing thundering herd problems.
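One way to derive the Retry-After value from token bucket state (a sketch; the parameters follow the Bucket fields above):

```java
public class RetryAfterExample {
    // Seconds until the bucket accrues its next whole token, rounded up so
    // the client never retries too early.
    static long retryAfterSeconds(double tokens, double refillRatePerSecond) {
        double deficit = 1.0 - tokens; // tokens still needed for one request
        if (deficit <= 0) return 0;    // a token is already available
        return (long) Math.ceil(deficit / refillRatePerSecond);
    }

    public static void main(String[] args) {
        // 0.2 tokens left, refilling at 2 tokens/sec: 0.8 / 2 = 0.4s, round up to 1.
        System.out.println(retryAfterSeconds(0.2, 2.0)); // 1
    }
}
```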
Thread safety in the single-node case
The synchronized keyword on tryConsume handles concurrent access to a single bucket, but the limiter also needs a thread-safe way to create and look up buckets, and a single global lock around the whole map would serialize every client’s requests. A ConcurrentHashMap with per-bucket locking avoids that: bucket creation is atomic, and each client locks only its own bucket.
public class TokenBucketLimiter implements RateLimiter {
    private final ConcurrentHashMap<String, Bucket> buckets;
    private final Map<String, RateLimitConfig> clientConfigs;
    private final RateLimitConfig defaultConfig;

    public TokenBucketLimiter(RateLimitConfig defaultConfig) {
        this.buckets = new ConcurrentHashMap<>();
        this.clientConfigs = new ConcurrentHashMap<>();
        this.defaultConfig = defaultConfig;
    }

    @Override
    public boolean isAllowed(String clientId) {
        RateLimitConfig config = clientConfigs
            .getOrDefault(clientId, defaultConfig);
        Bucket bucket = buckets.computeIfAbsent(
            clientId,
            k -> new Bucket(config.maxRequests(), config.refillRate())
        );
        return bucket.tryConsume();
    }

    public void setClientConfig(String clientId, RateLimitConfig config) {
        clientConfigs.put(clientId, config);
        buckets.remove(clientId); // force re-creation with new config
    }
}
The computeIfAbsent call is atomic, so concurrent first requests from a new client create exactly one bucket. Each bucket’s tryConsume synchronizes on its own instance, which means two different clients never contend with each other; contention only occurs when the same client sends concurrent requests.
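The atomicity of computeIfAbsent can be demonstrated directly: even when many threads race to fetch the same client’s bucket, the mapping function runs only once (the counter below instruments creation for the demo and is not part of the limiter):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ComputeIfAbsentDemo {
    public static void main(String[] args) throws Exception {
        var buckets = new ConcurrentHashMap<String, Object>();
        var creations = new AtomicInteger();

        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int i = 0; i < 1_000; i++) {
            pool.submit(() -> buckets.computeIfAbsent("client-42", k -> {
                creations.incrementAndGet(); // runs at most once per key
                return new Object();         // stands in for a new Bucket
            }));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(creations.get()); // 1
    }
}
```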
Per-client configuration
Not all clients are equal. A free-tier user gets 100 requests per minute. A paid customer gets 10,000. An internal service gets unlimited. The configuration model needs to support this.
public record RateLimitConfig(
    int maxRequests,
    long windowSizeMs,
    int refillRate
) {
    public static final RateLimitConfig FREE_TIER =
        new RateLimitConfig(100, 60_000, 2);

    public static final RateLimitConfig PRO_TIER =
        new RateLimitConfig(10_000, 60_000, 200);

    public static final RateLimitConfig INTERNAL =
        new RateLimitConfig(Integer.MAX_VALUE, 60_000, Integer.MAX_VALUE);
}
Store these configurations in a database or configuration service. When a client authenticates, look up their tier and attach the appropriate config. The rate limiter retrieves the config by client ID on every request.
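A minimal sketch of that lookup, with in-memory maps standing in for the database or configuration service (the tier names and mappings here are illustrative):

```java
import java.util.Map;

public class TierLookupExample {
    record RateLimitConfig(int maxRequests, long windowSizeMs, int refillRate) {}

    static final RateLimitConfig FREE_TIER = new RateLimitConfig(100, 60_000, 2);
    static final RateLimitConfig PRO_TIER = new RateLimitConfig(10_000, 60_000, 200);

    // clientId -> tier would normally come from the auth layer at login time.
    static final Map<String, String> CLIENT_TIERS = Map.of("client-42", "pro");
    static final Map<String, RateLimitConfig> TIER_CONFIGS =
            Map.of("free", FREE_TIER, "pro", PRO_TIER);

    static RateLimitConfig configFor(String clientId) {
        String tier = CLIENT_TIERS.getOrDefault(clientId, "free");
        return TIER_CONFIGS.get(tier);
    }

    public static void main(String[] args) {
        System.out.println(configFor("client-42").maxRequests()); // paid tier
        System.out.println(configFor("unknown").maxRequests());   // default tier
    }
}
```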
Distributed rate limiting with Redis
On a single server, an in-memory map works perfectly. But most production systems run behind a load balancer with multiple application instances. A client’s requests hit different servers, so the in-memory bucket on server A knows nothing about requests that hit server B.
Redis solves this. It is single-threaded, so operations are naturally serialized. A Lua script makes the check-and-decrement atomic.
-- rate_limit.lua
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])

if tokens == nil then
  tokens = max_tokens
  last_refill = now
end

local elapsed = now - last_refill
local new_tokens = math.min(max_tokens, tokens + elapsed * refill_rate / 1000)

if new_tokens >= 1 then
  new_tokens = new_tokens - 1
  redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
  redis.call('EXPIRE', key, 120)
  return {1, math.floor(new_tokens)}
else
  redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
  redis.call('EXPIRE', key, 120)
  return {0, 0}
end
The Lua script runs atomically on the Redis server. No race conditions, no distributed locks. The EXPIRE ensures that buckets for inactive clients get cleaned up automatically.
public class RedisRateLimiter implements RateLimiter {
    private final JedisPool jedisPool;
    private final String luaScript;
    private final RateLimitConfig defaultConfig;

    public RedisRateLimiter(JedisPool jedisPool, String luaScript,
                            RateLimitConfig defaultConfig) {
        this.jedisPool = jedisPool;
        this.luaScript = luaScript;
        this.defaultConfig = defaultConfig;
    }

    @Override
    public boolean isAllowed(String clientId) {
        try (Jedis jedis = jedisPool.getResource()) {
            RateLimitConfig config = getConfig(clientId);
            long now = System.currentTimeMillis();
            Object result = jedis.eval(
                luaScript,
                List.of("ratelimit:" + clientId),
                List.of(
                    String.valueOf(config.maxRequests()),
                    String.valueOf(config.refillRate()),
                    String.valueOf(now)
                )
            );
            @SuppressWarnings("unchecked")
            List<Long> values = (List<Long>) result;
            return values.get(0) == 1L;
        }
    }

    private RateLimitConfig getConfig(String clientId) {
        // Look up the client's tier here; fall back to the default config.
        return defaultConfig;
    }
}
Handling edge cases
Several situations require careful handling:
Clock skew. In a distributed system, server clocks drift. If server A thinks it is 12:00:00.000 and server B thinks it is 12:00:00.150, the refill calculation diverges. Using Redis server time (via redis.call('TIME')) instead of client time eliminates this.
Bucket cleanup. Without expiration, the bucket map grows without bound. The Redis approach uses EXPIRE. For in-memory implementations, a scheduled task or a size-limited cache (like Caffeine with expireAfterAccess) handles cleanup.
Burst handling. The token bucket naturally allows bursts up to maxTokens. If a client has not made requests for a while, their bucket is full, and they can send maxTokens requests instantly. This is usually desirable, but if you need strict per-second limits, the sliding window algorithm is a better fit.
Graceful degradation. If Redis is down, you have two choices: fail open (allow all requests) or fail closed (deny all requests). Most systems fail open with local rate limiting as a fallback, because rejecting all traffic is usually worse than temporarily allowing extra traffic.
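A fail-open wrapper can be sketched like this (the class name and the Predicate shape are illustrative, not a specific library API): try the distributed limiter first, and degrade to a local in-process limiter if storage is unreachable.

```java
import java.util.function.Predicate;

public class FailOpenRateLimiter {
    private final Predicate<String> distributed; // e.g. the Redis-backed check
    private final Predicate<String> local;       // e.g. an in-memory token bucket

    public FailOpenRateLimiter(Predicate<String> distributed,
                               Predicate<String> local) {
        this.distributed = distributed;
        this.local = local;
    }

    public boolean isAllowed(String clientId) {
        try {
            return distributed.test(clientId);
        } catch (RuntimeException e) {
            // Redis unreachable: fall back to the local limiter instead of
            // rejecting all traffic.
            return local.test(clientId);
        }
    }

    public static void main(String[] args) {
        var limiter = new FailOpenRateLimiter(
                id -> { throw new RuntimeException("redis down"); },
                id -> true);
        System.out.println(limiter.isAllowed("client-42")); // true
    }
}
```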
Testing the rate limiter
A rate limiter is tricky to test because it depends on time. Inject a clock abstraction to make tests deterministic.
public class Bucket {
    private final Clock clock;

    public Bucket(int maxTokens, double refillRate, Clock clock) {
        this.clock = clock;
        this.lastRefillTimestamp = clock.millis();
        // ...
    }

    private void refill() {
        long now = clock.millis();
        // ...
    }
}
// In tests: Clock.fixed is immutable and the bucket captures its clock at
// construction, so advance a mutable clock instead of reassigning a variable.
class MutableClock extends Clock {
    private Instant instant = Instant.now();
    void advance(Duration d) { instant = instant.plus(d); }
    @Override public Instant instant() { return instant; }
    @Override public ZoneId getZone() { return ZoneOffset.UTC; }
    @Override public Clock withZone(ZoneId zone) { return this; }
}

var clock = new MutableClock();
var bucket = new Bucket(10, 1.0, clock);
assertTrue(bucket.tryConsume()); // token available
// exhaust all tokens...
assertFalse(bucket.tryConsume()); // denied
clock.advance(Duration.ofSeconds(5)); // the bucket now sees a later time
assertTrue(bucket.tryConsume()); // tokens refilled
Test at least these scenarios: fresh bucket allows requests, exhausted bucket denies, tokens refill after waiting, burst does not exceed max, and concurrent access does not corrupt state.
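The concurrency scenario can be sketched as follows, using a minimal synchronized bucket (capacity only, refill omitted for brevity): fire twice as many concurrent requests as the bucket holds and check that exactly the capacity is admitted.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrentBucketTest {
    // Minimal bucket: enough to exercise atomic consumption under contention.
    static class Bucket {
        private double tokens;
        Bucket(int maxTokens) { this.tokens = maxTokens; }
        synchronized boolean tryConsume() {
            if (tokens >= 1.0) { tokens -= 1.0; return true; }
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        Bucket bucket = new Bucket(100);
        AtomicInteger allowed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int i = 0; i < 200; i++) {
            pool.submit(() -> { if (bucket.tryConsume()) allowed.incrementAndGet(); });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        // Exactly the bucket's capacity should be admitted, never more.
        System.out.println(allowed.get()); // 100
    }
}
```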
What comes next
The rate limiter protects your system from overload, but it produces signals you need to observe: how often clients hit limits, which tiers are under-provisioned, and whether your limits match actual usage patterns. The next article on designing a logging framework builds the infrastructure for capturing and routing those signals through a structured logging pipeline.