Designing a rate limiter
In this series (15 parts)
- Introduction to low level design
- SOLID principles
- Design patterns: Creational
- Design patterns: Structural
- Design patterns: Behavioral
- Designing a parking lot
- Designing a library management system
- Designing an elevator system
- Designing a hotel booking system
- Designing a ride-sharing model
- Designing a rate limiter
- Designing a logging framework
- Designing a notification system
- API design and contract-first development
- Data modeling for system design
Every production API needs a rate limiter. Without one, a single client can exhaust your entire system’s capacity, whether through malice, a bug, or a sudden traffic spike. The goal is simple: enforce a ceiling on requests per client per time window, and reject everything beyond it. The challenge is doing it correctly in a distributed, multi-threaded environment.
If you have not already, review rate limiting for the high-level motivation. Here we focus on the low-level design: interfaces, data structures, concurrency, and distributed coordination.
Interface design
The rate limiter’s public contract should be minimal. A caller asks “is this request allowed?” and gets a yes or no answer. Everything else is an implementation detail.
public interface RateLimiter {
    boolean isAllowed(String clientId);
    RateLimitInfo getRateLimitInfo(String clientId);
}

public record RateLimitInfo(
    int limit,
    int remaining,
    long resetTimestamp
) {}
The RateLimitInfo object is how you populate the X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers in your HTTP response. Clients need this feedback to self-regulate.
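As a sketch of how a gateway filter might translate RateLimitInfo into those headers (the header map and helper are illustrative stand-ins, not a specific framework API):

```java
import java.util.HashMap;
import java.util.Map;

public class RateLimitHeaders {
    // Mirrors the RateLimitInfo record from the interface above.
    record RateLimitInfo(int limit, int remaining, long resetTimestamp) {}

    // Hypothetical helper: copy the limiter's view of a client into HTTP headers.
    static Map<String, String> toHeaders(RateLimitInfo info) {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-RateLimit-Limit", String.valueOf(info.limit()));
        headers.put("X-RateLimit-Remaining", String.valueOf(info.remaining()));
        headers.put("X-RateLimit-Reset", String.valueOf(info.resetTimestamp()));
        return headers;
    }

    public static void main(String[] args) {
        var info = new RateLimitInfo(100, 42, 1_700_000_000L);
        System.out.println(toHeaders(info).get("X-RateLimit-Remaining")); // 42
    }
}
```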
classDiagram
class RateLimiter {
<<interface>>
+isAllowed(clientId: String) boolean
+getRateLimitInfo(clientId: String) RateLimitInfo
}
class RateLimitInfo {
+int limit
+int remaining
+long resetTimestamp
}
class TokenBucketLimiter {
-Map~String, Bucket~ buckets
-RateLimitConfig defaultConfig
+isAllowed(clientId: String) boolean
+getRateLimitInfo(clientId: String) RateLimitInfo
}
class SlidingWindowLimiter {
-Map~String, Deque~ requestLogs
-RateLimitConfig defaultConfig
+isAllowed(clientId: String) boolean
+getRateLimitInfo(clientId: String) RateLimitInfo
}
class RateLimitConfig {
+int maxRequests
+long windowSizeMs
+int refillRate
}
RateLimiter <|.. TokenBucketLimiter
RateLimiter <|.. SlidingWindowLimiter
TokenBucketLimiter --> RateLimitConfig
SlidingWindowLimiter --> RateLimitConfig
RateLimiter --> RateLimitInfo
Class diagram showing the rate limiter interface with token bucket and sliding window implementations.
Token bucket: the core algorithm
The token bucket is the most widely used rate limiting algorithm. It is simple, memory-efficient, and naturally supports burst traffic. The idea: each client has a bucket that holds tokens. Every request consumes one token. Tokens refill at a fixed rate. If the bucket is empty, the request is denied.
public class Bucket {
    private double tokens;
    private final int maxTokens;
    private final double refillRatePerMs;
    private long lastRefillTimestamp;

    public Bucket(int maxTokens, double refillRatePerSecond) {
        this.tokens = maxTokens;
        this.maxTokens = maxTokens;
        this.refillRatePerMs = refillRatePerSecond / 1000.0;
        this.lastRefillTimestamp = System.currentTimeMillis();
    }

    public synchronized boolean tryConsume() {
        refill();
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    private void refill() {
        long now = System.currentTimeMillis();
        double elapsed = now - lastRefillTimestamp;
        tokens = Math.min(maxTokens, tokens + elapsed * refillRatePerMs);
        lastRefillTimestamp = now;
    }
}
The key insight is lazy refill. You do not run a background timer. Instead, you calculate the tokens that should have been added since the last request. This eliminates the need for a scheduled thread and reduces memory overhead.
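A quick worked example of the lazy refill arithmetic (the numbers are illustrative): a bucket with capacity 10 and a refill rate of 2 tokens per second has 3 tokens left, and the next request arrives 2,500 ms later.

```java
public class LazyRefillExample {
    public static void main(String[] args) {
        int maxTokens = 10;
        double refillRatePerMs = 2.0 / 1000.0; // 2 tokens per second
        double tokens = 3.0;                   // tokens left at the last refill
        long elapsedMs = 2_500;                // time since the last request

        // Same formula as Bucket.refill(): add what accrued, cap at capacity.
        tokens = Math.min(maxTokens, tokens + elapsedMs * refillRatePerMs);
        System.out.println(tokens); // 8.0
    }
}
```

Five tokens accrued during the gap, so the bucket goes from 3 to 8 without any background timer running.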
Request flow
When a request arrives, the API gateway intercepts it before it reaches the application logic. The rate limiter checks the client’s bucket and either allows the request through or returns HTTP 429.
sequenceDiagram
participant C as Client
participant G as API Gateway
participant R as RateLimiter
participant S as Storage
participant A as Application
C->>G: POST /api/resource
G->>R: isAllowed("client-42")
R->>S: fetch bucket for client-42
S-->>R: Bucket(tokens=5, lastRefill=t0)
R->>R: refill tokens based on elapsed time
R->>R: tryConsume one token
R->>S: update bucket state
R-->>G: allowed=true, remaining=4
G->>A: forward request
A-->>G: 200 OK with response body
G-->>C: 200 OK + rate limit headers
Sequence diagram for a rate-limited request passing through the API gateway.
When the limiter denies a request, the gateway returns a 429 status code with a Retry-After header. This tells the client exactly when to retry, preventing thundering herd problems.
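One way to derive the Retry-After value from token bucket state (a sketch; the parameters follow the Bucket fields above):

```java
public class RetryAfterExample {
    // Seconds until the bucket accrues its next whole token, rounded up so
    // the client never retries too early.
    static long retryAfterSeconds(double tokens, double refillRatePerSecond) {
        double deficit = 1.0 - tokens; // tokens still needed for one request
        if (deficit <= 0) return 0;    // a token is already available
        return (long) Math.ceil(deficit / refillRatePerSecond);
    }

    public static void main(String[] args) {
        // 0.2 tokens left, refilling at 2 tokens/sec: 0.8 / 2 = 0.4s, round up to 1.
        System.out.println(retryAfterSeconds(0.2, 2.0)); // 1
    }
}
```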
Thread safety in the single-node case
The synchronized keyword on tryConsume handles concurrent access to a single bucket, but the limiter also needs a thread-safe way to create and look up buckets, and a single global lock around the whole map would serialize every client’s requests. A ConcurrentHashMap with per-bucket locking avoids that: bucket creation is atomic, and each client locks only its own bucket.
public class TokenBucketLimiter implements RateLimiter {
    private final ConcurrentHashMap<String, Bucket> buckets;
    private final Map<String, RateLimitConfig> clientConfigs;
    private final RateLimitConfig defaultConfig;

    public TokenBucketLimiter(RateLimitConfig defaultConfig) {
        this.buckets = new ConcurrentHashMap<>();
        this.clientConfigs = new ConcurrentHashMap<>();
        this.defaultConfig = defaultConfig;
    }

    @Override
    public boolean isAllowed(String clientId) {
        RateLimitConfig config = clientConfigs
            .getOrDefault(clientId, defaultConfig);
        Bucket bucket = buckets.computeIfAbsent(
            clientId,
            k -> new Bucket(config.maxRequests(), config.refillRate())
        );
        return bucket.tryConsume();
    }

    public void setClientConfig(String clientId, RateLimitConfig config) {
        clientConfigs.put(clientId, config);
        buckets.remove(clientId); // force re-creation with new config
    }
}
The computeIfAbsent call is atomic, so concurrent first requests from a new client create exactly one bucket. Each bucket’s tryConsume synchronizes on its own instance, which means two different clients never contend with each other; contention only occurs when the same client sends concurrent requests.
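The atomicity of computeIfAbsent can be demonstrated directly: even when many threads race to fetch the same client’s bucket, the mapping function runs only once (the counter below instruments creation for the demo and is not part of the limiter):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ComputeIfAbsentDemo {
    public static void main(String[] args) throws Exception {
        var buckets = new ConcurrentHashMap<String, Object>();
        var creations = new AtomicInteger();

        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int i = 0; i < 1_000; i++) {
            pool.submit(() -> buckets.computeIfAbsent("client-42", k -> {
                creations.incrementAndGet(); // runs at most once per key
                return new Object();         // stands in for a new Bucket
            }));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(creations.get()); // 1
    }
}
```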
Per-client configuration
Not all clients are equal. A free-tier user gets 100 requests per minute. A paid customer gets 10,000. An internal service gets unlimited. The configuration model needs to support this.
public record RateLimitConfig(
    int maxRequests,
    long windowSizeMs,
    int refillRate
) {
    public static final RateLimitConfig FREE_TIER =
        new RateLimitConfig(100, 60_000, 2);

    public static final RateLimitConfig PRO_TIER =
        new RateLimitConfig(10_000, 60_000, 200);

    public static final RateLimitConfig INTERNAL =
        new RateLimitConfig(Integer.MAX_VALUE, 60_000, Integer.MAX_VALUE);
}
Store these configurations in a database or configuration service. When a client authenticates, look up their tier and attach the appropriate config. The rate limiter retrieves the config by client ID on every request.
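A minimal sketch of that lookup, with in-memory maps standing in for the database or configuration service (the tier names and mappings here are illustrative):

```java
import java.util.Map;

public class TierLookupExample {
    record RateLimitConfig(int maxRequests, long windowSizeMs, int refillRate) {}

    static final RateLimitConfig FREE_TIER = new RateLimitConfig(100, 60_000, 2);
    static final RateLimitConfig PRO_TIER = new RateLimitConfig(10_000, 60_000, 200);

    // clientId -> tier would normally come from the auth layer at login time.
    static final Map<String, String> CLIENT_TIERS = Map.of("client-42", "pro");
    static final Map<String, RateLimitConfig> TIER_CONFIGS =
            Map.of("free", FREE_TIER, "pro", PRO_TIER);

    static RateLimitConfig configFor(String clientId) {
        String tier = CLIENT_TIERS.getOrDefault(clientId, "free");
        return TIER_CONFIGS.get(tier);
    }

    public static void main(String[] args) {
        System.out.println(configFor("client-42").maxRequests()); // paid tier
        System.out.println(configFor("unknown").maxRequests());   // default tier
    }
}
```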
Distributed rate limiting with Redis
On a single server, an in-memory map works perfectly. But most production systems run behind a load balancer with multiple application instances. A client’s requests hit different servers, so the in-memory bucket on server A knows nothing about requests that hit server B.
Redis solves this. It is single-threaded, so operations are naturally serialized. A Lua script makes the check-and-decrement atomic.
-- rate_limit.lua
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])

if tokens == nil then
  tokens = max_tokens
  last_refill = now
end

local elapsed = now - last_refill
local new_tokens = math.min(max_tokens, tokens + elapsed * refill_rate / 1000)

if new_tokens >= 1 then
  new_tokens = new_tokens - 1
  redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
  redis.call('EXPIRE', key, 120)
  return {1, math.floor(new_tokens)}
else
  redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
  redis.call('EXPIRE', key, 120)
  return {0, 0}
end
The Lua script runs atomically on the Redis server. No race conditions, no distributed locks. The EXPIRE ensures that buckets for inactive clients get cleaned up automatically.
public class RedisRateLimiter implements RateLimiter {
    private final JedisPool jedisPool;
    private final String luaScript;
    private final RateLimitConfig defaultConfig;

    public RedisRateLimiter(JedisPool jedisPool, String luaScript,
                            RateLimitConfig defaultConfig) {
        this.jedisPool = jedisPool;
        this.luaScript = luaScript;
        this.defaultConfig = defaultConfig;
    }

    @Override
    public boolean isAllowed(String clientId) {
        try (Jedis jedis = jedisPool.getResource()) {
            RateLimitConfig config = getConfig(clientId);
            long now = System.currentTimeMillis();
            Object result = jedis.eval(
                luaScript,
                List.of("ratelimit:" + clientId),
                List.of(
                    String.valueOf(config.maxRequests()),
                    String.valueOf(config.refillRate()),
                    String.valueOf(now)
                )
            );
            @SuppressWarnings("unchecked")
            List<Long> values = (List<Long>) result;
            return values.get(0) == 1L;
        }
    }

    private RateLimitConfig getConfig(String clientId) {
        // Look up the client's tier here; fall back to the default config.
        return defaultConfig;
    }
}
Handling edge cases
Several situations require careful handling:
Clock skew. In a distributed system, server clocks drift. If server A thinks it is 12:00:00.000 and server B thinks it is 12:00:00.150, the refill calculation diverges. Using Redis server time (via redis.call('TIME')) instead of client time eliminates this.
Bucket cleanup. Without expiration, the bucket map grows without bound. The Redis approach uses EXPIRE. For in-memory implementations, a scheduled task or a size-limited cache (like Caffeine with expireAfterAccess) handles cleanup.
Burst handling. The token bucket naturally allows bursts up to maxTokens. If a client has not made requests for a while, their bucket is full, and they can send maxTokens requests instantly. This is usually desirable, but if you need strict per-second limits, the sliding window algorithm is a better fit.
Graceful degradation. If Redis is down, you have two choices: fail open (allow all requests) or fail closed (deny all requests). Most systems fail open with local rate limiting as a fallback, because rejecting all traffic is usually worse than temporarily allowing extra traffic.
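A fail-open wrapper can be sketched like this (the class name and the Predicate shape are illustrative, not a specific library API): try the distributed limiter first, and degrade to a local in-process limiter if storage is unreachable.

```java
import java.util.function.Predicate;

public class FailOpenRateLimiter {
    private final Predicate<String> distributed; // e.g. the Redis-backed check
    private final Predicate<String> local;       // e.g. an in-memory token bucket

    public FailOpenRateLimiter(Predicate<String> distributed,
                               Predicate<String> local) {
        this.distributed = distributed;
        this.local = local;
    }

    public boolean isAllowed(String clientId) {
        try {
            return distributed.test(clientId);
        } catch (RuntimeException e) {
            // Redis unreachable: fall back to the local limiter instead of
            // rejecting all traffic.
            return local.test(clientId);
        }
    }

    public static void main(String[] args) {
        var limiter = new FailOpenRateLimiter(
                id -> { throw new RuntimeException("redis down"); },
                id -> true);
        System.out.println(limiter.isAllowed("client-42")); // true
    }
}
```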
Testing the rate limiter
A rate limiter is tricky to test because it depends on time. Inject a clock abstraction to make tests deterministic.
public class Bucket {
    private final Clock clock;

    public Bucket(int maxTokens, double refillRate, Clock clock) {
        this.clock = clock;
        this.lastRefillTimestamp = clock.millis();
        // ...
    }

    private void refill() {
        long now = clock.millis();
        // ...
    }
}
// In tests: Clock.fixed is immutable and the bucket captures its clock at
// construction, so advance a mutable clock instead of reassigning a variable.
class MutableClock extends Clock {
    private Instant instant = Instant.now();
    void advance(Duration d) { instant = instant.plus(d); }
    @Override public Instant instant() { return instant; }
    @Override public ZoneId getZone() { return ZoneOffset.UTC; }
    @Override public Clock withZone(ZoneId zone) { return this; }
}

var clock = new MutableClock();
var bucket = new Bucket(10, 1.0, clock);
assertTrue(bucket.tryConsume()); // token available
// exhaust all tokens...
assertFalse(bucket.tryConsume()); // denied
clock.advance(Duration.ofSeconds(5)); // the bucket now sees a later time
assertTrue(bucket.tryConsume()); // tokens refilled
Test at least these scenarios: fresh bucket allows requests, exhausted bucket denies, tokens refill after waiting, burst does not exceed max, and concurrent access does not corrupt state.
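The concurrency scenario can be sketched as follows, using a minimal synchronized bucket (capacity only, refill omitted for brevity): fire twice as many concurrent requests as the bucket holds and check that exactly the capacity is admitted.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrentBucketTest {
    // Minimal bucket: enough to exercise atomic consumption under contention.
    static class Bucket {
        private double tokens;
        Bucket(int maxTokens) { this.tokens = maxTokens; }
        synchronized boolean tryConsume() {
            if (tokens >= 1.0) { tokens -= 1.0; return true; }
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        Bucket bucket = new Bucket(100);
        AtomicInteger allowed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int i = 0; i < 200; i++) {
            pool.submit(() -> { if (bucket.tryConsume()) allowed.incrementAndGet(); });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        // Exactly the bucket's capacity should be admitted, never more.
        System.out.println(allowed.get()); // 100
    }
}
```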
What comes next
The rate limiter protects your system from overload, but it produces signals you need to observe: how often clients hit limits, which tiers are under-provisioned, and whether your limits match actual usage patterns. The next article on designing a logging framework builds the infrastructure for capturing and routing those signals through a structured logging pipeline.