Reliability patterns: timeouts, retries, circuit breakers
In this series (20 parts)
- What is system design and why it matters
- Estimations and back-of-envelope calculations
- Scalability: vertical vs horizontal scaling
- CAP theorem and distributed system tradeoffs
- Consistency models
- Load balancing
- Caching: strategies and patterns
- Content Delivery Networks
- Databases: SQL vs NoSQL and when to use each
- Database replication
- Database sharding and partitioning
- Consistent hashing
- Message queues and event streaming
- API design: REST, GraphQL, gRPC
- Rate limiting and throttling
- Proxies: forward and reverse
- Networking concepts for system design
- Reliability patterns: timeouts, retries, circuit breakers
- Observability: logging, metrics, tracing
- Security in system design
A single database query takes 30 seconds instead of 30 milliseconds. The calling service holds a thread open, waiting. That thread cannot serve other requests. Fifty more requests pile up behind it, each spawning a thread that also blocks. Your connection pool drains. The load balancer keeps routing traffic to this instance because it has not failed a health check yet. Within 90 seconds, the entire service is unresponsive, and every upstream caller inherits the same fate. One slow query just took down your checkout flow.
This is not hypothetical. This is Tuesday. Distributed systems do not fail cleanly. They fail partially, silently, and at the worst possible moment. If you have worked through networking concepts, you know that networks are unreliable by nature. Reliability patterns exist to absorb that unreliability before it propagates.
Timeouts: the first line of defense
Every network call needs a timeout. Every single one. If you do not set a timeout, you are saying “I am willing to wait forever,” and the system will eventually call that bluff.
There are two kinds of timeouts you need to think about. A connection timeout limits how long you wait to establish a TCP connection. A request timeout (sometimes called a read timeout) limits how long you wait for a response after the connection is established. These are separate values and should be tuned independently.
Setting a connection timeout of 1 second and a request timeout of 5 seconds is a reasonable starting point for internal service calls. For calls to third-party APIs, you might extend the request timeout to 10 or 15 seconds, but never leave it unbounded. An unbounded timeout is a thread leak waiting to happen.
The tricky part is choosing values that are tight enough to detect failures quickly but loose enough to avoid false positives under normal load variance. Look at your p99 latency for the downstream call. Set your timeout at roughly 2 to 3 times that value. If p99 is 200ms, a 500ms timeout gives you headroom without masking real problems.
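In code, the two limits are passed separately. Here is a minimal sketch using the popular requests library (an assumption about your HTTP stack; the same idea applies to any client), along with a hypothetical helper that derives a read timeout from observed p99 latency:

```python
import requests

# Connect and read limits are separate values, tuned independently.
CONNECT_TIMEOUT = 1.0   # seconds to establish the TCP connection
READ_TIMEOUT = 5.0      # seconds to wait for the response to arrive

def timeout_from_p99(p99_ms: float, factor: float = 2.5) -> float:
    """Derive a read timeout in seconds from observed p99 latency."""
    return (p99_ms * factor) / 1000.0

def fetch(url: str) -> requests.Response:
    # requests accepts a (connect, read) tuple, so a hung handshake and
    # a slow response are bounded separately. Never omit this argument:
    # the library's default is to wait forever.
    return requests.get(url, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
```

With a p99 of 200ms, `timeout_from_p99(200)` yields the 500ms figure from the rule of thumb above.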
Timeouts protect individual calls. But what happens when a call fails and you try again?
Retries: helpful until they are not
Retries are the most intuitive reliability mechanism and also the most dangerous. A transient network blip caused a failure, so you try again. Simple. Except when 10,000 clients all retry at the same instant.
Consider a service handling 5,000 requests per second. The downstream database hiccups for 2 seconds. During that window, 10,000 requests fail. If every client retries immediately, the database now faces 10,000 queued requests plus 5,000 new ones per second. The retry storm doubles or triples the load on an already struggling system. The database does not recover. It gets worse. This is the retry amplification problem, and it has caused some of the most spectacular outages in production systems.
Naive retries turn partial failures into total failures. Three rules make retries safe.
Rule 1: Limit the count. Three retries is a common ceiling. Beyond that, you are not going to get a different result. You are just adding load.
Rule 2: Use exponential backoff. Instead of retrying immediately, wait 100ms, then 200ms, then 400ms, then 800ms. Each retry waits twice as long as the previous one. This gives the downstream system breathing room to recover.
Rule 3: Add jitter. Exponential backoff alone is not enough. If 1,000 clients all start their backoff at the same millisecond, they will all retry at 100ms, all retry again at 200ms, and so on. The thundering herd just becomes a slower thundering herd. Jitter randomizes the wait time within the backoff window. Instead of waiting exactly 400ms, a client waits somewhere between 0ms and 400ms, chosen uniformly at random. This spreads retries across the time window and eliminates the synchronized spikes.
The formula looks like this: wait = random(0, base * 2^attempt). For attempt 0, wait is random between 0 and 100ms. For attempt 1, random between 0 and 200ms. For attempt 2, random between 0 and 400ms. The distribution flattens the load curve dramatically.
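The three rules fit in a few lines. A sketch in Python, assuming the transient failure surfaces as a `ConnectionError` (your client library may raise something else):

```python
import random
import time

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Full jitter: a uniformly random wait in [0, base * 2^attempt],
    capped so late attempts do not wait unboundedly long."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

def call_with_retries(operation, max_attempts: int = 3):
    """Run operation, retrying transient failures with backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                     # out of attempts: surface the error
            time.sleep(backoff_with_jitter(attempt))
```

Note that `max_attempts` bounds total tries, not just retries: rule 1, the retry ceiling, is enforced by the loop itself.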
Immediate retries create periodic load spikes that prevent recovery. Backoff with jitter produces a smooth decay back to normal throughput.
The red line shows what happens without backoff: synchronized retry waves hammer the downstream service in sharp bursts, each spike as tall as the original failure burst. The green line shows backoff with jitter: load decays smoothly, giving the downstream system a progressively easier time until it stabilizes.
Circuit breakers: stop calling what is already broken
Retries with backoff help when failures are transient. But what if the downstream service is not coming back in the next few seconds? What if it is down for minutes? Continuing to retry, even with backoff, wastes resources and adds latency to every request that could have failed fast.
A circuit breaker borrows its metaphor from electrical engineering. When too much current flows through a circuit, the breaker trips and cuts the connection, preventing damage. In software, a circuit breaker monitors the failure rate of calls to a downstream service and, when that rate crosses a threshold, stops making calls entirely for a cooling-off period.
The circuit breaker has three states.
Closed is the normal operating state. Requests pass through to the downstream service. The breaker tracks the number of recent failures. If the failure count exceeds a threshold (say, 5 failures in a 10-second window), the breaker transitions to Open.
Open means the breaker has tripped. All requests fail immediately with a predefined fallback response. No calls reach the downstream service. This protects the downstream from additional load and gives your callers a fast failure instead of a slow timeout. The breaker stays open for a configured duration, typically 30 to 60 seconds.
Half-Open is the probe state. After the open duration expires, the breaker allows a single request through to test whether the downstream has recovered. If that request succeeds, the breaker moves back to Closed. If it fails, the breaker returns to Open for another cooling-off period.
```mermaid
stateDiagram-v2
    [*] --> Closed
    Closed --> Open : Failure threshold exceeded
    Open --> HalfOpen : Timeout expires
    HalfOpen --> Closed : Probe succeeds
    HalfOpen --> Open : Probe fails
    Closed --> Closed : Success / failure below threshold
    Open --> Open : Requests fail fast
```
Circuit breaker state transitions. The half-open state acts as a controlled probe to test recovery before restoring full traffic.
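The state machine is small enough to sketch directly. This minimal version uses a plain consecutive-failure counter for brevity and takes an injectable clock so the open duration can be tested; names like `CircuitOpenError` are illustrative, not from any particular library:

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker fails fast instead of calling downstream."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock            # injectable for testing
        self.state = "closed"
        self.failures = 0             # simple counter; production code
        self.opened_at = 0.0          # would use a sliding window

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.open_seconds:
                self.state = "half-open"   # let a single probe through
            else:
                raise CircuitOpenError("failing fast")
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == "half-open":
            self.state = "open"            # probe failed: re-open
            self.opened_at = self.clock()
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.state = "open"        # threshold crossed: trip
                self.opened_at = self.clock()

    def _on_success(self):
        self.state = "closed"
        self.failures = 0
```

A caller would wrap each downstream call as `breaker.call(lambda: fetch_inventory(item_id))` and map `CircuitOpenError` to its fallback response.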
A good circuit breaker implementation tracks failures using a sliding window, not a simple counter. A rolling 10-second window with a 50% failure rate threshold is more robust than “5 failures total,” because the latter can trip on 5 failures spread across an hour of otherwise healthy traffic.
In the half-open state, some implementations allow a small percentage of traffic through (say 5%) rather than a single request. This gives a more statistically meaningful signal about whether the downstream has truly recovered.
The fallback behavior during the open state depends on your use case. For a product recommendation service, returning an empty list or a cached result is perfectly acceptable. For a payment processing call, you might queue the request for later processing via a message queue rather than dropping it. The right fallback is the one that preserves the best user experience given the constraint that the downstream is unavailable.
The bulkhead pattern: isolate the blast radius
Timeouts and circuit breakers protect individual call paths. The bulkhead pattern protects your service as a whole by isolating resource pools so that one failing dependency cannot consume all available resources.
The name comes from ship construction. A ship’s hull is divided into watertight compartments (bulkheads). If one compartment floods, the others remain dry, and the ship stays afloat. Without bulkheads, a single breach sinks the entire vessel.
In software, the most common bulkhead implementation is separate thread pools or connection pools per downstream dependency. Suppose your service calls three backends: user profiles, inventory, and pricing. Without bulkheads, all three share a single connection pool of 200 connections. If inventory becomes slow, all 200 connections can end up blocked on inventory calls, starving user profile and pricing calls even though those services are healthy.
With bulkheads, you allocate 80 connections to inventory, 60 to user profiles, and 60 to pricing. When inventory goes down, it can consume at most 80 connections. The remaining 120 connections continue serving profile and pricing requests normally. The failure is contained.
You can implement bulkheads at multiple levels. Thread pools, connection pools, separate process groups, or even separate Kubernetes pods per dependency. The granularity depends on how critical the isolation is. For most services, per-dependency connection pools are the right starting point.
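A per-dependency pool can be sketched with a semaphore per downstream. This is a simplified illustration (the pool sizes and names mirror the inventory example above and are not prescriptive); it fails fast when a pool is exhausted rather than queueing:

```python
import threading
from contextlib import contextmanager

class BulkheadFullError(Exception):
    """Raised when a dependency's pool has no free slots."""

class Bulkhead:
    """Caps concurrent calls to one dependency with a semaphore."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._sem = threading.Semaphore(max_concurrent)

    @contextmanager
    def acquire(self, timeout: float = 0.0):
        # Fail fast instead of queueing when the pool is exhausted, so a
        # slow dependency cannot stack up blocked callers.
        if not self._sem.acquire(timeout=timeout):
            raise BulkheadFullError(f"{self.name} pool exhausted")
        try:
            yield
        finally:
            self._sem.release()

# One pool per downstream, sized per dependency (illustrative numbers).
pools = {
    "inventory": Bulkhead("inventory", 80),
    "profiles": Bulkhead("profiles", 60),
    "pricing": Bulkhead("pricing", 60),
}
```

A call site wraps its downstream request in `with pools["inventory"].acquire(): ...`, so even a fully wedged inventory backend can never hold more than its 80 slots.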
The cost of bulkheads is reduced peak throughput for any single dependency. Those 80 connections to inventory mean you cannot burst to 200 during a flash sale. This is an intentional tradeoff. You are trading peak capacity for failure isolation. In almost every production system, that trade is worth making.
Idempotency: making retries safe at the application layer
Retries introduce a subtle problem that backoff and circuit breakers do not solve. If a client sends a payment request, the server processes it and sends a response, but the response is lost due to a network partition, the client will retry. Now the payment has been processed twice. The customer gets charged double.
Idempotency means that performing the same operation multiple times produces the same result as performing it once. GET requests are naturally idempotent. POST requests that create resources or trigger side effects are not, unless you design them to be.
The standard approach is an idempotency key. The client generates a unique identifier (a UUID works) and attaches it to the request. The server checks whether it has already processed a request with that key. If yes, it returns the stored result without re-executing the operation. If no, it processes the request and stores the result keyed by the idempotency key.
The implementation requires an atomic check-and-set operation. You cannot check for the key, then process the request, then store the result as three separate steps, because a second request could arrive between the check and the store. Most implementations use a database transaction or a distributed lock to ensure atomicity.
Idempotency keys have a TTL. Storing them forever is impractical. A 24-hour or 7-day window covers the vast majority of retry scenarios. After that window, the key expires, and a duplicate request would be treated as new. This is acceptable because retries that happen days later are not retries. They are new requests from confused clients, and you have bigger problems to debug.
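The check-and-set can be sketched in memory. This toy version uses one global lock for atomicity, which serializes all requests; a production store would use a database unique-key constraint or Redis SET NX and lock per key:

```python
import threading
import time

class IdempotencyStore:
    """In-memory sketch of an idempotency-key store with a TTL."""

    def __init__(self, ttl_seconds: float = 86_400.0):  # 24-hour window
        self.ttl = ttl_seconds
        self._lock = threading.Lock()
        self._results = {}   # idempotency key -> (stored_at, result)

    def execute(self, key: str, operation):
        # The check and the store happen under one lock, so two
        # concurrent requests with the same key cannot both run the
        # operation: the atomic check-and-set described above.
        with self._lock:
            entry = self._results.get(key)
            if entry is not None and time.monotonic() - entry[0] < self.ttl:
                return entry[1]          # duplicate: replay the stored result
            result = operation()         # first sight of this key: execute
            self._results[key] = (time.monotonic(), result)
            return result
```

A retried payment with the same key then replays the stored response instead of charging the customer twice.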
Stripe, one of the most widely studied payment APIs, supports an Idempotency-Key header on its mutating requests precisely so that clients can retry safely. This is not a nice-to-have. It is a fundamental requirement for safe retries in a distributed system.
Combining the patterns
These patterns are not alternatives. They are layers. A well-designed service call looks like this:
- The client sends the request with an idempotency key and a timeout of 500ms.
- If the request times out, the client retries with exponential backoff and jitter, up to 3 attempts, using the same idempotency key.
- The circuit breaker monitors the failure rate across all calls to that downstream. If failures cross the threshold, it trips open and subsequent calls fail fast with a fallback.
- The bulkhead ensures that even if this particular downstream is failing, the thread pool and connection pool for other dependencies remain unaffected.
Each layer addresses a different failure mode. Timeouts handle slow calls. Retries handle transient errors. Idempotency keys make retries safe. Circuit breakers handle sustained outages. Bulkheads contain blast radius. Skip any one of these and you leave a gap that production traffic will find.
Tuning for your system
The specific numbers depend on your SLAs, traffic patterns, and downstream behavior. Here are starting points drawn from production systems:
Timeouts: 2 to 3 times your p99 latency for internal calls. 10 to 15 seconds for external APIs.
Retries: 3 attempts maximum. Base backoff of 100ms. Cap the maximum wait at 5 to 10 seconds.
Circuit breaker: 10-second sliding window. Trip at 50% failure rate with a minimum of 10 requests (to avoid tripping on low traffic). Open duration of 30 seconds. Allow 5% of traffic through in half-open state.
Bulkheads: Size each pool based on expected peak throughput for that dependency plus 20% headroom. Monitor pool utilization and adjust.
None of these numbers are universal. Instrument everything. Track timeout rates, retry rates, circuit breaker state transitions, and bulkhead pool utilization. Feed that data into your observability stack so you can tune based on real behavior, not guesses.
If your service is exposed to external clients, pair these reliability patterns with rate limiting at the edge. Rate limiting controls how much traffic enters your system. Reliability patterns control how your system behaves when things go wrong internally. They complement each other.
What comes next
Reliability patterns keep your system running when things break. But how do you know things are breaking in the first place? How do you detect a slow query before it cascades? How do you distinguish a transient blip from a real outage? That is the domain of observability: metrics, logs, traces, and the alerting systems built on top of them. That is what we cover in observability.