Caching architecture patterns
In this series (12 parts)
- Monolith vs microservices
- Microservice communication patterns
- Service discovery and registration
- Event-driven architecture
- Distributed data patterns
- Caching architecture patterns
- Search architecture
- Storage systems at scale
- Notification systems
- Real-time systems architecture
- Batch and stream processing
- Multi-region and global systems
Caching at the single-service level is straightforward: store frequently accessed data closer to where it is needed to reduce latency and database load. At the architecture level, caching becomes a layered system spanning multiple tiers, geographic regions, and failure domains. The challenge shifts from “should I cache this?” to “where in the stack should this cache live, how do I keep it consistent, and what happens when it fails?”
In-process cache
The fastest cache is memory in the same process. No network hop, no serialization overhead, just a hash map lookup measured in nanoseconds. In-process caches like Guava Cache, Caffeine (Java), or a simple Map with TTL expiration work well for reference data that changes infrequently: configuration values, feature flags, country code lookups, permission matrices.
graph TD
Req["Request"] --> Handler
subgraph App["Application Process"]
Handler["Request Handler"]
Cache["In-Process Cache<br/>(Caffeine / LRU Map)"]
Handler -->|"1. check cache"| Cache
Cache -->|"2a. cache hit"| Handler
end
Handler -->|"2b. cache miss"| DB[("Database")]
DB -->|"3. populate cache"| Cache
In-process cache: data lives in the application’s heap memory. Cache hits avoid any external call.
The limitation is scope. Each application instance has its own cache, so there is no sharing between instances. If you have ten pods running the order service, you have ten independent caches that may hold different versions of the same data. After a cache miss, each instance fetches from the database independently, multiplying the load for cold keys by the number of instances.
In-process caches also compete with your application for heap memory. A large cache in a JVM application can trigger more frequent garbage collection pauses. Size these caches conservatively, typically 50-200 MB, and evict aggressively.
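As a sketch of the idea, here is a minimal in-process cache combining LRU eviction with per-entry TTL — a toy stand-in for what Caffeine or Guava Cache provide (the class name and defaults are illustrative, and it is not thread-safe):

```python
import time
from collections import OrderedDict

class TTLCache:
    """Minimal in-process cache: LRU eviction plus per-entry TTL.

    Illustrative only; real libraries add thread safety, weight-based
    sizing, and refresh-ahead.
    """

    def __init__(self, max_entries=1000, ttl_seconds=60.0):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self._data = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]      # lazy expiration on read
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def put(self, key, value):
        self._data[key] = (value, time.monotonic() + self.ttl)
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used
```

Because expired entries are only purged on read, a bounded `max_entries` is what actually caps heap usage.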
Distributed cache
A distributed cache like Redis or Memcached provides a shared cache layer that all application instances read from and write to. A cache hit from Redis takes 0.5-1ms (a network round trip) compared to nanoseconds for in-process, but the cache is shared across every instance, holds a single authoritative copy of each key, and survives application restarts.
graph TD
A1["App Instance 1"] --> Redis["Redis Cluster"]
A2["App Instance 2"] --> Redis
A3["App Instance 3"] --> Redis
Redis --> DB[("Database")]
A1 -->|"cache miss"| DB
Distributed cache: all instances share a single Redis cluster. One instance’s cache write benefits all others.
Redis dominates the distributed caching space for good reason. It supports rich data structures (strings, hashes, sorted sets, streams), built-in TTL expiration, Lua scripting for atomic operations, and pub/sub for cache invalidation broadcasts. Redis Cluster provides horizontal scaling by sharding keys across multiple nodes using fixed hash slots (CRC16 of the key modulo 16,384) rather than consistent hashing.
A production Redis cluster for caching typically runs 3-6 primary nodes with one replica each. Keys are distributed across 16,384 hash slots, and each primary owns a range of slots. When a node fails, its replica promotes automatically. Client libraries like Jedis or Lettuce handle slot redirection transparently.
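The standard access pattern against such a cluster is cache-aside: check the cache, fall back to the database on a miss, and populate the cache on the way back. A hedged sketch, assuming a redis-py style client exposing `get`/`setex` and a caller-supplied `loader` standing in for the database query:

```python
import json

def read_through(client, key, loader, ttl=300):
    """Cache-aside read: try the shared cache, fall back to the loader.

    `client` is assumed to expose redis-py style get/setex; `loader`
    is any zero-argument callable that fetches from the database.
    Values are JSON-encoded because Redis stores strings, not objects.
    """
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    value = loader()                        # e.g. a database query
    client.setex(key, ttl, json.dumps(value))  # populate for all instances
    return value
```

A hypothetical call site would look like `read_through(redis_client, "product:42", lambda: fetch_product(42))`, where `fetch_product` is whatever data-access function your service already has.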
Cache hierarchy: L1 / L2 / L3
Production systems rarely use a single cache layer. A cache hierarchy combines the strengths of multiple tiers: the speed of in-process memory, the shared state of a distributed cache, and the geographic reach of a CDN at the edge.
graph TD
Client["Client"] --> CDN["L3: CDN Edge Cache<br/>(CloudFront / Fastly)"]
CDN -->|"miss"| LB["Load Balancer"]
LB --> Handler
subgraph App["Application Instance"]
L1["L1: In-Process Cache<br/>(Caffeine, 64MB)"]
Handler["Request Handler"]
end
L1 -->|"miss"| Redis["L2: Redis Cluster"]
Redis -->|"miss"| DB[("Database")]
DB -->|"populate L2"| Redis
Redis -->|"populate L1"| L1
Handler --> L1
Three-tier cache hierarchy: L1 (in-process) catches hot keys, L2 (Redis) catches warm keys, L3 (CDN) caches static and semi-static responses at the edge.
A request first hits the CDN, which serves cached responses for static content and cacheable API responses. On CDN miss, the request reaches the application instance and checks the L1 in-process cache. On L1 miss, it checks the L2 Redis cluster. Only on L2 miss does the request hit the database. The response then populates all cache layers on the way back.
The effective cache hit rate compounds across layers. If the CDN (L3) absorbs 70% of total traffic, L1 serves 60% of what gets past it, and L2 serves 85% of the remainder, the database sees only 0.30 × 0.40 × 0.15 ≈ 1.8% of the original requests.
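The L1-then-L2 lookup inside the application can be sketched as a read-through across tiers. Assumptions: a plain dict stands in for the in-process cache, `l2` exposes redis-py style `get`/`setex`, and `load_from_db` is a hypothetical loader of last resort:

```python
def tiered_get(key, l1, l2, load_from_db, ttl=300):
    """Read through a two-tier hierarchy: L1 dict -> L2 client -> database.

    Each miss populates the layers above it on the way back, so the
    next reader at any tier gets a hit.
    """
    if key in l1:                   # L1 hit: in-process, no network call
        return l1[key]
    value = l2.get(key)             # L2 hit: one network round trip
    if value is None:
        value = load_from_db(key)   # full miss: hit the database
        l2.setex(key, ttl, value)   # populate L2 for all instances
    l1[key] = value                 # populate L1 for this instance
    return value
```

In a real deployment the L1 entries need their own (shorter) TTL, since nothing else ever invalidates them.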
Cache stampede prevention
A cache stampede (also called thundering herd) happens when a popular cached value expires and hundreds of concurrent requests all miss the cache simultaneously, flooding the database with identical queries. The database struggles under the load, responses slow down, timeouts cascade, and the system degrades.
Three techniques prevent stampedes.
Lock-based recomputation. When a cache miss occurs, the first thread acquires a distributed lock (using Redis SET NX with a TTL) and recomputes the value. Other threads wait for the lock to release and then read the freshly cached value. This serializes recomputation but prevents duplicate work.
Probabilistic early expiration. Instead of a hard TTL, each cache read has a small probability of triggering a background refresh before the TTL expires. The probability increases as the TTL approaches, making it very likely that someone refreshes the cache before it actually expires. The XFetch algorithm formalizes this approach.
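The XFetch decision rule is compact enough to show inline. A sketch, where `delta` is the observed recompute time for the value and `beta` > 1 makes early refresh more aggressive (parameter names follow the algorithm's usual presentation; the function itself is illustrative):

```python
import math
import random

def should_refresh(now, expires_at, delta, beta=1.0, rand=None):
    """XFetch-style probabilistic early expiration.

    Refresh when now - delta * beta * ln(u) >= expires_at, with u
    uniform in (0, 1]. Since ln(u) <= 0, the left side only ever moves
    forward in time, so the refresh probability rises smoothly as the
    hard expiry approaches and hits 1 at the expiry itself.
    """
    u = rand if rand is not None else 1.0 - random.random()  # (0, 1]
    return now - delta * beta * math.log(u) >= expires_at
```

Each cache reader evaluates this on every hit; whoever draws an early-refresh result recomputes in the background, so the hard expiry is almost never reached by live traffic.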
Background refresh. A dedicated process refreshes cache values before they expire, based on access patterns. Hot keys get refreshed frequently; cold keys are allowed to expire naturally. This completely eliminates stampedes for known hot keys but requires tracking access frequency.
sequenceDiagram
participant R1 as Request 1
participant R2 as Request 2
participant R3 as Request 3
participant Cache as Redis
participant DB as Database
R1->>Cache: GET product:42
Cache-->>R1: MISS
R1->>Cache: SET lock:product:42 NX EX 5
Cache-->>R1: OK (lock acquired)
R2->>Cache: GET product:42
Cache-->>R2: MISS
R2->>Cache: SET lock:product:42 NX EX 5
Cache-->>R2: FAIL (lock held)
R3->>Cache: GET product:42
Cache-->>R3: MISS
Note over R2,R3: Wait and retry
R1->>DB: SELECT * FROM products WHERE id=42
DB-->>R1: product data
R1->>Cache: SET product:42 (data, TTL=300)
R2->>Cache: GET product:42
Cache-->>R2: HIT (freshly cached)
R3->>Cache: GET product:42
Cache-->>R3: HIT
Lock-based stampede prevention: only the first request queries the database. Others wait for the cache to be populated.
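The same flow in code, as a hedged sketch: `client` is assumed to expose redis-py style `get`/`set`/`setex`/`delete`, and the `lock:` key prefix is illustrative. Real code would also cap the retry loop and handle loader failures.

```python
import json
import time

def get_with_lock(client, key, loader, ttl=300, lock_ttl=5, retry_delay=0.05):
    """Lock-based stampede prevention.

    One caller wins SET lock:<key> NX EX <lock_ttl> and recomputes;
    the rest poll until the winner has repopulated the cache. The
    lock's TTL bounds how long a crashed winner can block others.
    """
    while True:
        cached = client.get(key)
        if cached is not None:
            return json.loads(cached)
        # Try to become the single recomputing caller.
        if client.set("lock:" + key, "1", nx=True, ex=lock_ttl):
            try:
                value = loader()
                client.setex(key, ttl, json.dumps(value))
                return value
            finally:
                client.delete("lock:" + key)
        time.sleep(retry_delay)  # lock held elsewhere: wait and re-check
```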
Cache invalidation strategies
Phil Karlton famously said there are only two hard things in computer science: cache invalidation and naming things. In a distributed system, cache invalidation is especially tricky because data changes in one service and caches live in multiple layers across multiple regions.
TTL-based expiration is the simplest strategy. Every cached value gets a time-to-live, and the system tolerates stale data within that window. A 60-second TTL means data can be up to 60 seconds stale, which is acceptable for product listings or user profiles but not for account balances.
Event-driven invalidation uses events from the data owner to invalidate or update caches. When the product service updates a price, it publishes a PriceUpdated event. Consumers that cache product data listen for this event and invalidate their cached copies. This provides near-real-time consistency but requires event infrastructure.
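A consumer-side handler for such an event is a few lines. A sketch only: the event shape (`{"type": "PriceUpdated", "product_id": ...}`) and the cache key scheme are illustrative, not a real schema, and the Redis client is assumed to expose a redis-py style `delete`:

```python
def handle_price_updated(event, local_cache, redis_client):
    """Invalidate cached copies when the owning service publishes a change.

    Drops both the instance-local (L1) copy and the shared (L2) copy;
    the next read repopulates them with fresh data.
    """
    key = f"product:{event['product_id']}"
    local_cache.pop(key, None)   # drop the L1 copy in this instance
    redis_client.delete(key)     # drop the shared L2 copy
```

Note that every instance must run this handler (or subscribe to an invalidation broadcast) for the L1 copies to be cleared everywhere — deleting only the shared key leaves stale in-process entries behind.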
Write-through and write-behind update the cache as part of the write path. Write-through updates the cache synchronously on every write, guaranteeing the cache always has fresh data but adding write latency. Write-behind queues the cache update and applies it asynchronously, reducing write latency at the cost of a brief inconsistency window.
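The difference between the two is just where the cache update sits relative to the write path. A minimal sketch in the article's framing, with plain dicts standing in for the database and cache and a deque standing in for the async queue:

```python
from collections import deque

def write_through(key, value, db, cache):
    """Write-through: persist, then update the cache synchronously.

    Readers always see fresh data, at the cost of extra write latency.
    """
    db[key] = value
    cache[key] = value

write_behind_queue = deque()  # drained by a background worker

def write_behind(key, value, db):
    """Write-behind: persist now, queue the cache update for later."""
    db[key] = value
    write_behind_queue.append((key, value))

def drain_write_behind(cache):
    """Background worker: apply queued cache updates asynchronously.

    Until this runs, reads of the key may return the pre-write value.
    """
    while write_behind_queue:
        key, value = write_behind_queue.popleft()
        cache[key] = value
```

The inconsistency window the text mentions is exactly the gap between `write_behind` returning and `drain_write_behind` running.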
Geographic cache distribution
For global applications, a single cache region means users far from the data center experience high latency. Geographic cache distribution places cache nodes in multiple regions, close to users.
graph TD
subgraph US_East["US East"]
UE_Users["Users"] --> UE_Cache["Redis (Primary)"]
UE_Cache --> UE_DB[("Primary DB")]
end
subgraph EU_West["EU West"]
EU_Users["Users"] --> EU_Cache["Redis (Replica)"]
EU_Cache -->|"miss: cross-region"| UE_Cache
end
subgraph AP_South["AP South"]
AP_Users["Users"] --> AP_Cache["Redis (Replica)"]
AP_Cache -->|"miss: cross-region"| UE_Cache
end
UE_Cache -->|"replication"| EU_Cache
UE_Cache -->|"replication"| AP_Cache
Geographic cache distribution: regional replicas serve local reads. Misses fall back to the primary region.
Redis supports cross-region replication either through Redis Enterprise’s Active-Active geo-distribution (CRDT-based) or through simpler one-way, active-passive replication. In the active-passive model, writes go to the primary region and replicas in other regions receive updates asynchronously. Reads from the local replica are fast (1-2ms) even when the primary is on another continent.
The consistency trade-off depends on the replication model. With one-way replication, reads from replicas may be stale by the replication lag (typically 50-200ms cross-region). With CRDT-based active-active, both regions can accept writes, and conflicts are resolved automatically using last-writer-wins or custom merge logic. CRDTs work well for counters, sets, and flags but not for arbitrary data structures.
For most read-heavy workloads, regional read replicas with a primary write region provide the best balance of performance and simplicity. Reserve active-active for use cases that truly need local writes in multiple regions, like session stores or shopping carts.
Cache sizing and capacity planning
Under-provisioned caches evict frequently, reducing hit rates. Over-provisioned caches waste money. Capacity planning starts with understanding your working set: the subset of data that is accessed frequently enough to benefit from caching.
A good rule of thumb: your cache should hold your working set plus 20-30% headroom for variance. If your product catalog has 10 million items but 90% of traffic hits the top 100,000, your cache needs room for roughly 130,000 items, not 10 million. Monitor your eviction rate; if it exceeds 1-2% of operations, the cache is too small.
For Redis specifically, use the INFO memory command to track used memory, peak memory, and fragmentation ratio. Set maxmemory-policy to allkeys-lru for general caching (evict least recently used keys when full) or volatile-lru to only evict keys with a TTL set.
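Putting those recommendations into a redis.conf fragment (the memory limit is illustrative — size it to your measured working set plus headroom):

```conf
# redis.conf — illustrative values, tune to your working set
maxmemory 2gb
maxmemory-policy allkeys-lru   # or volatile-lru to evict only TTL'd keys
```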
What comes next
This article wraps up the first half of the series. From monolith vs microservices through communication patterns, service discovery, event-driven architecture, distributed data, and caching, you now have the vocabulary and mental models to design systems that scale with your team and your traffic. The remaining parts apply these patterns to concrete subsystems: search, storage at scale, notifications, real-time systems, batch and stream processing, and multi-region architecture.