Reliability patterns for services
In this series (11 parts)
- What SRE is
- Reliability fundamentals
- SLIs, SLOs, and error budgets in practice
- Toil reduction and automation
- Capacity planning
- Performance testing and load testing
- Chaos engineering
- Incident response in practice
- Postmortems and learning from failure
- Production readiness reviews
- Reliability patterns for services
Failure is not a matter of “if.” Services go down. Networks partition. Databases slow to a crawl. The question is how your service behaves when that happens. Does it collapse entirely, or does it degrade gracefully and recover on its own?
Reliability patterns are engineering techniques that protect your service when things go wrong. Each pattern addresses a specific failure mode. Used together, they make the difference between a minor blip and a cascading outage.
For a broader look at these patterns in the context of system design, see the system design reliability patterns article.
Graceful degradation
Graceful degradation means serving reduced functionality instead of failing completely. When a dependency is unavailable, you do less rather than nothing.
Consider an e-commerce product page. It calls three services: product catalog, recommendations, and reviews. If the recommendations service goes down, you still show the product and reviews. The page is less useful but still functional.
The key decisions are:
- Which features are essential vs optional?
- What does the user see when a non-essential feature is unavailable?
- How do you detect that a dependency is down?
Implement graceful degradation at the call site. Wrap each non-essential dependency call in a timeout with a fallback. The fallback might return cached data, a default value, or simply hide that section of the UI.
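A minimal Python sketch of this call-site pattern, assuming a hypothetical fetch_recommendations client (the timeout value and fallback are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Hypothetical client for the non-essential dependency; here it simulates an outage.
def fetch_recommendations(product_id):
    raise ConnectionError("recommendations service is down")

# Static fallback: an empty list tells the template to hide the section.
FALLBACK_RECOMMENDATIONS = []

_executor = ThreadPoolExecutor(max_workers=4)

def recommendations_with_fallback(product_id, timeout_s=0.2):
    """Call the non-essential dependency with a timeout and a fallback."""
    future = _executor.submit(fetch_recommendations, product_id)
    try:
        return future.result(timeout=timeout_s)
    except (TimeoutError, ConnectionError):
        # Degrade gracefully: the page still renders, just without recommendations.
        return FALLBACK_RECOMMENDATIONS
```

The same wrapper could return a cached value instead of the static default; only the except branch changes.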
Static fallbacks vs cached fallbacks
Static fallbacks return a hardcoded default. The recommendations section might show “Popular items” from a static list. They are simple to implement, but the content goes stale.
Cached fallbacks return the last known good response. More relevant to the user, but you need a caching layer and a strategy for cache expiration. A recommendation from an hour ago is fine. A recommendation from a month ago might not be.
Choose based on how time-sensitive the data is.
Load shedding
Load shedding is the deliberate rejection of excess requests to protect the service. When traffic exceeds what your service can handle, you have two choices: slow down for everyone, or reject some requests so the rest complete successfully.
Load shedding chooses the second option. It is better to serve 80% of requests successfully than to serve 100% of requests slowly (or not at all).
```mermaid
graph TD
    A["Request arrives"] --> B["Check current load"]
    B --> C{"Load > threshold?"}
    C -->|No| D["Process request"]
    C -->|Yes| E{"Priority request?"}
    E -->|Yes| D
    E -->|No| F["Return 503 Service Unavailable"]
    D --> G["Return response"]
```

Load shedding decision flow: reject low-priority requests when load exceeds the threshold.
Implementing load shedding
Measure load using a metric your service actually cares about. Options include:
- Active request count: How many requests are currently in-flight? This is the simplest and often the most effective.
- CPU utilization: Useful but can lag behind actual overload.
- Queue depth: For services that use internal queues.
Set a threshold based on load testing. If your service handles 500 concurrent requests cleanly and degrades at 600, set the shedding threshold at 500.
Priority-based shedding
Not all requests are equal. A health check from the load balancer is more important than a background analytics request. A request from a paying customer might take priority over an anonymous browsing session.
Assign priority levels and shed lowest-priority traffic first. This keeps the most important work flowing even under extreme load.
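A sketch of both ideas combined, using active request count as the load metric; the threshold and the two-level priority scheme are illustrative assumptions:

```python
import threading

class LoadShedder:
    """Reject low-priority requests once in-flight count exceeds the threshold."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight  # set from load testing
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_admit(self, priority):
        """priority: 'high' (health checks, paying customers) or 'low'."""
        with self.lock:
            if self.in_flight >= self.max_in_flight and priority != "high":
                return False  # caller returns 503 Service Unavailable
            self.in_flight += 1
            return True

    def release(self):
        """Call when the request completes, success or failure."""
        with self.lock:
            self.in_flight -= 1
```

A request handler would call try_admit on entry and release in a finally block, so shed requests never consume processing capacity.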
Request hedging
Request hedging sends the same request to multiple backends simultaneously and uses whichever response arrives first. This reduces tail latency at the cost of additional load.
Tail latency is the latency experienced by the slowest requests, typically the p99 or p99.9. A single slow backend instance can cause high tail latency even when the median is fast.
Hedging works well when:
- You have multiple replicas that can serve the same request
- The cost of an extra request is low compared to the cost of high latency
- Tail latency meaningfully impacts user experience
Be careful with hedging. Sending every request twice doubles your backend load. A common approach is to hedge only after a delay. Send the first request, wait 10 milliseconds, and if no response has arrived, send a second request to a different backend. Cancel the slower request when the faster one completes.
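A sketch of delayed hedging with asyncio; the replica names and latencies are made up to simulate one slow backend:

```python
import asyncio

async def call_backend(name, latency_s):
    # Stand-in for an idempotent read against one replica.
    await asyncio.sleep(latency_s)
    return f"response from {name}"

async def hedged_get(hedge_delay_s=0.01):
    """Send one request; if no reply within hedge_delay_s, hedge to a second replica."""
    first = asyncio.create_task(call_backend("replica-a", 0.05))  # slow replica
    done, _ = await asyncio.wait({first}, timeout=hedge_delay_s)
    if done:
        return first.result()  # fast path: no hedge needed
    second = asyncio.create_task(call_backend("replica-b", 0.005))
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # cancel the slower request once a winner arrives
    return done.pop().result()
```

With these latencies the hedge fires after 10 ms and the second replica wins; in the fast path the hedge request is never sent at all.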
When not to hedge
Do not hedge requests that have side effects. If the request creates an order or sends a notification, sending it twice creates duplicates. Hedging is for idempotent read operations only.
Health endpoints
Health endpoints tell infrastructure components (load balancers, orchestrators, service meshes) whether your service is operational.
Liveness: /healthz
The liveness endpoint answers one question: is this process alive and able to handle requests? A liveness check should be fast and lightweight. It verifies that the process is running and not deadlocked.
What to check:
- The process can respond to HTTP requests
- Critical internal threads are alive
What not to check:
- Database connectivity (that is a readiness concern)
- Downstream service availability
If liveness fails, the orchestrator restarts the container. You want this to trigger only when the process itself is broken, not when a dependency is temporarily down.
Readiness: /readyz
The readiness endpoint answers a different question: is this instance ready to receive traffic? A service might be alive but not ready, for example during startup while it warms a cache or establishes database connections.
What to check:
- Database connection pool is healthy
- Required caches are populated
- Configuration has been loaded
When readiness fails, the load balancer stops sending traffic to this instance. The instance stays alive and keeps trying to become ready. Once readiness passes again, traffic resumes.
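The two endpoints can be sketched with the standard library alone; the dependency checks here are illustrative placeholders for real probes of the connection pool, caches, and configuration:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative readiness state; a real service would probe these live.
checks = {"db_pool_healthy": True, "cache_warm": True, "config_loaded": True}

def liveness_status():
    # Liveness: the process is running and responding. No dependency checks.
    return 200

def readiness_status():
    # Readiness: 200 only when every startup dependency is satisfied.
    return 200 if all(checks.values()) else 503

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self.send_response(liveness_status())
        elif self.path == "/readyz":
            self.send_response(readiness_status())
        else:
            self.send_response(404)
        self.end_headers()

# HTTPServer(("", 8080), HealthHandler).serve_forever()
```

Note that liveness ignores the dependency checks entirely: a cold cache should fail /readyz, never /healthz.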
Keep health checks fast
Both endpoints should respond in under 100 milliseconds. If your readiness check queries the database, use a simple SELECT 1, not a complex query. A slow health check can cause cascading failures when the orchestrator marks healthy instances as unhealthy because the check timed out.
Exponential backoff with jitter
When a request fails, retrying immediately is often the worst thing you can do. If the failure was caused by overload, immediate retries add more load and make the problem worse. This is the thundering herd problem.
Exponential backoff increases the delay between retries. The first retry waits 1 second, the second waits 2 seconds, the third waits 4 seconds, and so on. This gives the failing system time to recover.
Jitter adds randomness to the backoff delay. Without jitter, all clients that failed at the same time will retry at the same time, recreating the thundering herd. With jitter, retries spread across a time window.
A common formula:
delay = min(base * 2^attempt + random(0, base), max_delay)
Where base is your initial delay (often 1 second), attempt is the retry count, and max_delay caps the maximum wait (often 30 to 60 seconds).
Always set a maximum retry count. Infinite retries against a permanently failing service waste resources and fill logs with errors.
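The formula and the retry cap translate directly to a few lines of Python (the base, cap, and retry count are the common defaults mentioned above, not requirements):

```python
import random
import time

def backoff_delay(attempt, base=1.0, max_delay=30.0):
    """delay = min(base * 2^attempt + random(0, base), max_delay)"""
    return min(base * (2 ** attempt) + random.uniform(0, base), max_delay)

def call_with_retries(operation, max_retries=5):
    """Retry a transient failure with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # retry budget exhausted; surface the error
            time.sleep(backoff_delay(attempt))
```

The jitter term random(0, base) is what spreads simultaneous retries across a window instead of letting them land together.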
Bulkhead isolation
A ship has bulkheads, walls that divide the hull into compartments. If one compartment floods, the others stay dry. The ship stays afloat.
Bulkhead isolation applies the same principle to software. Isolate failures so they do not cascade across the entire service.
Connection pool isolation
Your service calls three downstream APIs. If all three share a single HTTP connection pool and one API becomes slow, it can exhaust all connections. Now calls to the other two healthy APIs also fail because there are no available connections.
Give each downstream dependency its own connection pool with its own limits. If the slow API exhausts its pool, the other two pools are unaffected.
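A minimal sketch of per-dependency limits using bounded semaphores as stand-in pools; the dependency names and sizes are illustrative, and a real HTTP client would configure a separate connection pool per host instead:

```python
import threading

# One bounded "pool" per downstream dependency.
POOLS = {
    "catalog":         threading.BoundedSemaphore(20),
    "recommendations": threading.BoundedSemaphore(10),
    "reviews":         threading.BoundedSemaphore(10),
}

def call_downstream(dependency, request_fn):
    """Take a slot from that dependency's own pool; fail fast if it is exhausted."""
    pool = POOLS[dependency]
    if not pool.acquire(blocking=False):
        # Only this dependency is affected; the other pools keep serving.
        raise RuntimeError(f"{dependency} pool exhausted")
    try:
        return request_fn()
    finally:
        pool.release()
```

If the recommendations pool drains because that API is slow, calls to catalog and reviews still find free slots.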
Thread pool isolation
Similar to connection pools, you can isolate compute resources. Assign separate thread pools (or goroutine pools, or async task pools) to different categories of work. A slow background job should not consume the threads needed to serve user requests.
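In Python this might look like two separately sized executors; the pool sizes are arbitrary for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Separate pools: slow background jobs cannot starve user-facing requests.
user_pool = ThreadPoolExecutor(max_workers=16, thread_name_prefix="user")
batch_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="batch")

def handle_user_request(fn, *args):
    """Serve latency-sensitive work from its own dedicated pool."""
    return user_pool.submit(fn, *args)

def run_background_job(fn, *args):
    """Background work queues in its own small pool and can fall behind safely."""
    return batch_pool.submit(fn, *args)
```

Even if every batch worker is stuck, the sixteen user-facing threads remain available.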
Process isolation
The strongest form of bulkhead is process isolation. Run different workloads in separate containers or services. A memory leak in the analytics pipeline cannot crash the order processing service if they run in separate processes.
The trade-off is complexity. More isolated components mean more things to deploy, monitor, and coordinate. Choose your bulkhead boundaries based on the blast radius of failures.
Combining patterns
These patterns work best in combination. A well-designed service might use:
- Load shedding at the edge to reject excess traffic
- Graceful degradation to serve partial responses when dependencies fail
- Exponential backoff with jitter when retrying failed dependency calls
- Bulkhead isolation to prevent one slow dependency from affecting others
- Health endpoints to let infrastructure route traffic away from unhealthy instances
No single pattern solves all reliability problems. Each one addresses a specific failure mode. The art is knowing which patterns to apply and where.
What comes next
This article wraps up the Site Reliability Engineering series. Over the course of eleven articles, we covered SLOs, error budgets, monitoring, alerting, incident response, postmortems, production readiness, and now reliability patterns.
If you are just getting started, revisit the earlier articles on SLOs and error budgets. Those concepts form the foundation for everything else. The patterns covered here also connect well with system design reliability patterns and other system design topics where you will see these ideas applied at a larger scale.
Reliability is not a destination. It is a practice. Keep measuring, keep learning from incidents, and keep raising the bar.