Caching: strategies and patterns
In this series (20 parts)
- What is system design and why it matters
- Estimations and back-of-envelope calculations
- Scalability: vertical vs horizontal scaling
- CAP theorem and distributed system tradeoffs
- Consistency models
- Load balancing
- Caching: strategies and patterns
- Content Delivery Networks
- Databases: SQL vs NoSQL and when to use each
- Database replication
- Database sharding and partitioning
- Consistent hashing
- Message queues and event streaming
- API design: REST, GraphQL, gRPC
- Rate limiting and throttling
- Proxies: forward and reverse
- Networking concepts for system design
- Reliability patterns: timeouts, retries, circuit breakers
- Observability: logging, metrics, tracing
- Security in system design
Prerequisite: Load balancing.
Every millisecond of latency you shave off a request path compounds across millions of users. Caching is the single most effective lever you have. A well-placed cache turns a 50ms database read into a 1ms memory lookup. It absorbs traffic spikes that would otherwise flatten your backend. It is also one of the easiest things to get wrong.
This article walks through the core caching strategies, eviction policies, and the failure modes that catch teams off guard. By the end you will know when to reach for each pattern and what traps to avoid.
Why cache at all?
Consider a product catalog service handling 10,000 requests per second. Each request hits the database, which takes 20ms per query on average. That is 200 seconds of cumulative database time every second. The database buckles.
Now put a cache in front of it. If 95% of requests hit the cache at 0.5ms per lookup, the database only handles 500 requests per second. Average latency drops from 20ms to roughly 1.5ms (0.95 × 0.5ms for hits, plus 0.05 × 20.5ms for misses that pay both lookups). The database breathes. You can defer that expensive vertical scaling for months.
Caching works because most real workloads follow a power-law distribution. A small fraction of keys accounts for the vast majority of reads. Your top 1,000 products might serve 80% of all traffic. This skew is what makes caching viable.
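You can see this skew with a quick simulation. The sketch below draws requests from a Zipf-like distribution (weight proportional to 1/rank); the key count, request count, and exponent are illustrative assumptions, not measurements from any real system.

```python
import random
from collections import Counter

random.seed(42)

# Zipf-like access pattern: key at rank r is requested with weight ~ 1/r.
NUM_KEYS = 10_000
weights = [1 / rank for rank in range(1, NUM_KEYS + 1)]
requests = random.choices(range(NUM_KEYS), weights=weights, k=100_000)

counts = Counter(requests)
# Share of traffic served by the 1,000 most popular keys (10% of the keyspace).
top_1k = sum(count for _, count in counts.most_common(1_000))
share = top_1k / len(requests)
print(f"Top 1,000 keys serve {share:.0%} of requests")
```

With these parameters, caching just 10% of the keyspace captures roughly three quarters of all reads; steeper real-world skews capture even more.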
Cache-aside (lazy loading)
Cache-aside is the most common strategy. The application owns both the cache and the database. On a read, the app checks the cache first. On a miss, it reads from the database, writes the result into the cache, and returns the value.
```mermaid
sequenceDiagram
    participant App
    participant Cache
    participant DB
    App->>Cache: GET key
    Cache-->>App: MISS
    App->>DB: SELECT value
    DB-->>App: value
    App->>Cache: SET key = value
    App-->>App: return value
```
Cache-aside read flow: the application queries the cache, falls back to the database on a miss, and populates the cache for subsequent reads.
The application is responsible for keeping cache and database in sync. Writes go directly to the database. The corresponding cache entry is either invalidated (deleted) or left to expire via TTL.
Cache-aside is simple and resilient. If the cache goes down, the application still works by falling through to the database. The downside is that every cache miss results in three round trips: one to the cache, one to the database, and one to write the cache. For latency-sensitive paths, that initial miss penalty matters.
This pattern fits read-heavy workloads where stale data is acceptable for short periods. Most web applications start here.
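The read and write paths above fit in a few lines. This is a self-contained sketch: the dicts stand in for a real cache (e.g. Redis) and database, and the key names and TTL are illustrative assumptions.

```python
import time

cache: dict = {}                      # key -> (value, expires_at)
database = {"user:1": {"name": "Ada"}}
TTL_SECONDS = 60

def db_read(key):
    return database.get(key)          # ~20ms round trip in a real system

def get(key):
    """Cache-aside read: try the cache, fall back to the DB on a miss."""
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if time.monotonic() < expires_at:
            return value              # cache hit
    value = db_read(key)              # miss: read from the database...
    if value is not None:             # ...and populate the cache
        cache[key] = (value, time.monotonic() + TTL_SECONDS)
    return value

def put(key, value):
    """Write path: update the DB, then invalidate the cached copy."""
    database[key] = value
    cache.pop(key, None)              # delete rather than update the entry

print(get("user:1"))   # miss: reads the DB and populates the cache
print(get("user:1"))   # hit: served from memory
```

Note that `put` deletes the cache entry instead of overwriting it; the next read repopulates it from the database, which avoids racing writers leaving different values in the two stores.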
Read-through
Read-through looks similar to cache-aside from the application’s perspective, but the cache itself handles the database lookup. The app always talks to the cache. On a miss, the cache fetches the data from the database, stores it, and returns it.
The difference is ownership. In cache-aside the app orchestrates the miss path. In read-through the cache library or proxy handles it. This simplifies application code but couples your caching layer to your data access layer.
Read-through works well with cache providers like Amazon DynamoDB Accelerator (DAX), which transparently fronts DynamoDB with an in-memory cache. The application code does not change at all.
Write-through
Write-through ensures that every write goes through the cache to the database. The cache writes the data to the database synchronously before confirming the write to the application.
```mermaid
sequenceDiagram
    participant App
    participant Cache
    participant DB
    App->>Cache: SET key = value
    Cache->>DB: INSERT/UPDATE value
    DB-->>Cache: ACK
    Cache-->>App: ACK
```
Write-through flow: the cache proxies all writes to the database, guaranteeing consistency between cache and store.
The advantage is strong consistency. The cache is never stale because every write updates both layers atomically from the application’s perspective. The disadvantage is write latency. Every write now pays the cost of both a cache write and a database write in series. If your workload is write-heavy, this pattern can become a bottleneck.
Write-through is often combined with read-through. Together they form a fully transparent caching layer. The application treats the cache as its only data store.
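A combined write-through and read-through layer can be sketched as a single class. The dict-backed `db` is a stand-in for a real database, and the class and method names are illustrative, not from any particular library.

```python
class WriteThroughCache:
    """Minimal write-through + read-through layer over a backing store."""

    def __init__(self, db: dict):
        self.db = db
        self.cache: dict = {}

    def get(self, key):
        """Read-through: the app only ever talks to this layer."""
        if key in self.cache:
            return self.cache[key]
        value = self.db.get(key)      # miss: the cache loads from the DB
        if value is not None:
            self.cache[key] = value
        return value

    def set(self, key, value):
        """Write-through: both layers are updated before the caller's ACK.
        DB first, so a failed DB write never leaves the cache ahead of it."""
        self.db[key] = value
        self.cache[key] = value

db: dict = {}
store = WriteThroughCache(db)
store.set("sku:42", {"price": 999})
assert store.get("sku:42") == db["sku:42"]   # cache and DB always agree
```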
Write-behind (write-back)
Write-behind flips the latency trade-off. Writes go to the cache immediately, and the cache asynchronously flushes them to the database in the background. The application gets a fast acknowledgment.
This gives you very low write latency. The cache batches and coalesces writes, reducing database load. If the same key is updated 50 times in one second, the database only sees the final value.
The risk is data loss. If the cache node crashes before flushing, those writes are gone. You need replication or a write-ahead log in the cache layer to mitigate this. Redis with AOF (append-only file) persistence is a common choice, but even that has a small window of vulnerability depending on the fsync policy.
Write-behind suits workloads like analytics counters, session stores, and leaderboards where losing a few seconds of data is tolerable and write throughput matters more than durability.
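The batching and coalescing behavior is the whole point of write-behind, so the sketch below makes it explicit. A real system would flush from a background thread on a timer; calling `flush()` by hand keeps the sketch deterministic, and the dicts are stand-ins for the cache and database.

```python
cache: dict = {}
database: dict = {}
dirty: set = set()                  # keys written since the last flush

def write(key, value):
    cache[key] = value              # fast acknowledgment to the caller
    dirty.add(key)                  # remember that this key needs flushing

def flush():
    """Coalesced flush: 50 updates to one key become one DB write."""
    for key in dirty:
        database[key] = cache[key]  # crash before this line loses the writes
    dirty.clear()

for i in range(50):
    write("counter", i)             # 50 writes to the same key...
flush()
print(database["counter"])          # ...but the DB sees only the final value
```

The comment in `flush` marks the durability window the article warns about: anything still in `dirty` when the cache node dies is gone unless the cache itself is replicated or persisted.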
Cache invalidation
Phil Karlton famously said there are two hard things in computer science: cache invalidation and naming things. He was right about the first one.
The core question: when data changes in the database, how does the cache find out?
TTL-based expiration is the simplest approach. Every cache entry gets a time-to-live. After that period, the entry is evicted and the next read triggers a fresh lookup. TTL is a blunt instrument. Set it too short and you lose cache efficiency. Set it too long and users see stale data. A TTL of 60 seconds works for many read-heavy APIs. For user profile data, 5 minutes is common. For stock prices, 0 seconds (do not cache).
Event-driven invalidation is more precise. When a write occurs, the service publishes an event (via Kafka, SNS, or a change-data-capture stream), and the caching layer deletes or updates the affected entries. This approach gives you near-real-time consistency but adds infrastructure complexity. You need reliable event delivery and idempotent invalidation handlers.
Active invalidation on write is the middle ground. When the application writes to the database, it also deletes the cache key in the same code path. This is what most teams do in practice. The pattern is: write to the database first, then delete from the cache. Not the other way around. If you delete the cache first, a concurrent read can slip in between the delete and the database commit and repopulate the cache with the old value, which then lingers indefinitely. If you write to the database first and the cache delete fails, you have a stale entry, but a short TTL acts as a safety net for that case.
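Event-driven invalidation needs its handlers to be idempotent, because event buses redeliver. The sketch below uses a plain list of events as a stand-in for Kafka, SNS, or a CDC stream, and the event shape and key format are assumptions for illustration.

```python
cache = {"user:7": {"name": "Old"}}

def handle_change_event(event: dict) -> None:
    """Idempotent invalidation: deleting an already-absent key is a no-op,
    so a redelivered event is harmless."""
    cache.pop(f"user:{event['user_id']}", None)

# Duplicate delivery of the same change event, as a bus may produce.
events = [{"user_id": 7}, {"user_id": 7}]
for event in events:
    handle_change_event(event)

assert "user:7" not in cache   # invalidated exactly as if delivered once
```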
Eviction policies
Caches have finite memory. When the cache is full and a new entry needs space, the eviction policy decides which existing entry to remove. The choice of policy directly affects your hit rate.
LRU (Least Recently Used)
LRU evicts the entry that has not been accessed for the longest time. It works well for workloads with temporal locality, where recently accessed items are likely to be accessed again soon. Most caches default to LRU. Redis uses an approximated LRU that samples five random keys and evicts the least recently used among them. This avoids the overhead of maintaining a perfect LRU list across millions of keys.
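An exact LRU is easy to express with an ordered map; this sketch is the textbook version, not Redis's sampling approximation, and the class name is illustrative.

```python
from collections import OrderedDict

class LRUCache:
    """Exact LRU: evict the entry untouched for the longest time."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used

lru = LRUCache(2)
lru.set("a", 1)
lru.set("b", 2)
lru.get("a")          # touching "a" makes "b" the LRU entry
lru.set("c", 3)       # over capacity: "b" is evicted
print(lru.get("b"))   # None
```

Maintaining this ordered structure on every access is exactly the overhead Redis avoids by sampling instead.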
LFU (Least Frequently Used)
LFU evicts the entry with the fewest accesses. It handles frequency-skewed workloads better than LRU. If a key is accessed 1,000 times per hour, LFU will protect it even if it has not been touched in the last 30 seconds. Redis 4.0 added an approximated LFU policy. The downside: LFU is slow to adapt. A key that was popular yesterday but irrelevant today stays in the cache until its frequency count decays.
ARC (Adaptive Replacement Cache)
ARC maintains two LRU lists: one for entries accessed once (recency) and one for entries accessed more than once (frequency). It dynamically adjusts the balance between the two lists based on observed access patterns. If the workload shifts from frequency-skewed to recency-skewed, ARC adapts. IBM patented ARC, which limited its adoption in open-source caches; ZFS ships a variant of ARC as its block cache. In published benchmarks, ARC consistently outperforms pure LRU and pure LFU across diverse workloads, often by several percentage points of hit rate.
Hit rate and cache sizing
The relationship between cache size and hit rate follows a diminishing-returns curve. Going from 0 to 1 GB of cache might take your hit rate from 0% to 85%. Going from 1 GB to 2 GB only gets you from 85% to 92%. And going from 2 GB to 4 GB gets you from 92% to 96%.
Hit rate follows a logarithmic curve. Most of the benefit comes from the first fraction of total dataset size. Monitor your actual hit rate and right-size accordingly.
The optimal cache size depends on your working set, not your total dataset. If your database holds 500 GB but only 2 GB of data is “hot,” a 2 GB cache captures nearly all the value. Profile your access patterns. Plot the curve for your workload. Overspending on cache memory is wasted money. Underspending leaves latency on the table.
In production, track these metrics: hit rate, miss rate, eviction rate, and memory utilization. A hit rate below 80% usually signals that either the cache is too small or the access pattern has poor locality. An eviction rate that spikes during peak hours means you need more memory or a better eviction policy.
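Instrumenting these counters is cheap enough to build into the cache client itself. The class and method names below are illustrative, not from any particular library; in production you would export the same counters to your metrics system rather than compute ratios in process.

```python
class InstrumentedCache:
    """Cache-aside wrapper that tracks the metrics worth alerting on."""

    def __init__(self):
        self.data: dict = {}
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        if key in self.data:
            self.hits += 1
            return self.data[key]
        self.misses += 1
        value = loader(key)           # fall through to the source of truth
        self.data[key] = value
        return value

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

c = InstrumentedCache()
for key in ["a", "b", "a", "a", "c"]:
    c.get(key, loader=lambda k: k.upper())
print(f"hit rate: {c.hit_rate:.0%}")   # 2 hits out of 5 lookups: 40%
```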
The thundering herd problem
Imagine a popular cache key expires. One thousand concurrent requests all see a cache miss at the same moment. All one thousand hit the database simultaneously. The database chokes. This is the thundering herd.
Three defenses:
Lock-based recomputation. When a cache miss occurs, the first request acquires a distributed lock (using Redis SETNX or a similar mechanism) and recomputes the value. All other requests wait or get a slightly stale value. This serializes the expensive path and protects the database.
Stale-while-revalidate. Serve the expired value while a single background thread refreshes it. The cache stores both the value and its expiration time. On a read, if the entry is expired, the cache returns the stale value immediately and triggers an asynchronous refresh. This eliminates latency spikes entirely but requires your application to tolerate brief staleness.
Jittered TTLs. Instead of setting a flat 60-second TTL on all entries, add a random offset: 55 to 65 seconds. This spreads expirations across time, preventing mass simultaneous misses. Simple, effective, and too often overlooked.
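Two of these defenses combine naturally: jitter the TTL on write, and serve stale while one caller refreshes. In the sketch below the dict cache stands in for Redis, and the refresh runs synchronously for determinism where production code would hand it to a background task; all names are illustrative.

```python
import random
import time

BASE_TTL = 60

def jittered_ttl() -> float:
    """55-65s instead of a flat 60s, spreading expirations over time."""
    return BASE_TTL + random.uniform(-5, 5)

cache = {"hot": {"value": "v1", "expires_at": 0.0, "refreshing": False}}
db_calls = 0

def load_from_db(key):
    global db_calls
    db_calls += 1                     # count how often the DB is actually hit
    return "v2"

def get(key):
    """Stale-while-revalidate: expired entries still get served instantly."""
    entry = cache[key]
    if time.monotonic() >= entry["expires_at"] and not entry["refreshing"]:
        entry["refreshing"] = True    # one caller wins the refresh
        entry["value"] = load_from_db(key)   # would be async in production
        entry["expires_at"] = time.monotonic() + jittered_ttl()
        entry["refreshing"] = False
    return entry["value"]             # every caller gets an answer immediately

values = [get("hot") for _ in range(100)]
print(db_calls)   # one database hit despite a hundred concurrent-style reads
```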
Cold start
When you deploy a new cache node, or when a cache restarts after a crash, the cache is empty. Every request is a miss. The database gets hammered with the full request volume. This is the cold start problem.
Cache warming is the primary mitigation. Before routing traffic to a new cache node, preload it with the most frequently accessed keys. You can derive this set from access logs or from a key-frequency tracker in your application. Redis supports bulk loading via its protocol, and many orchestration tools can script a warm-up phase before health checks pass.
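The warming step can be sketched as a small script: rank keys by frequency in an access log, then preload the top of the list before the node takes traffic. The log format, key names, and `loader` function are assumptions made so the sketch runs self-contained.

```python
from collections import Counter

# A slice of an access log: one key per request served.
access_log = ["user:1", "user:2", "user:1", "user:3", "user:1", "user:2"]

def top_keys(log, n):
    """The n most frequently accessed keys, hottest first."""
    return [key for key, _ in Counter(log).most_common(n)]

def warm(cache: dict, keys, loader):
    """Preload hot keys before health checks let traffic through."""
    for key in keys:
        cache[key] = loader(key)

cache: dict = {}
warm(cache, top_keys(access_log, 2), loader=lambda k: f"value-of-{k}")
print(sorted(cache))   # the two hottest keys: ['user:1', 'user:2']
```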
Gradual traffic shifting works in conjunction with warming. Instead of cutting over 100% of traffic to the new node instantly, ramp it up: 10%, 25%, 50%, 100%. This gives the cache time to populate organically while the old node (or the database) absorbs the remaining load. This approach mirrors what you would do with load balancer traffic shifting during deployments.
Choosing a strategy
There is no universal best strategy. The choice depends on your consistency requirements, read/write ratio, and tolerance for complexity.
For a typical read-heavy web application serving product pages, cache-aside with TTL-based invalidation is the right starting point. It is simple, well-understood, and resilient to cache failures. Pair it with LRU eviction and jittered TTLs, and you will handle most scenarios.
For systems that demand strong consistency between cache and database, write-through combined with read-through gives you a transparent layer. The trade-off is higher write latency.
For write-heavy workloads like real-time analytics, write-behind gives you throughput at the cost of durability risk. Protect yourself with replication and persistence.
For a deeper look at how caching fits into a full architecture with distributed nodes and tiered layers, see HLD: Caching architecture. For how caches interact with content delivery at the edge, see CDNs. And for the database layer that sits behind your cache, see Databases overview.
What comes next
Caching reduces latency between your servers and your data store. But what about the latency between your users and your servers? That is where content delivery networks come in. CDNs push static and dynamic content to edge locations worldwide, cutting round-trip times from hundreds of milliseconds to single digits.
Next up: CDNs.