How Computers Work · Part 2

CPU caches and memory hierarchy: why memory access speed matters

In this series (6 parts)
  1. How a CPU actually works: cores, clocks, and execution
  2. CPU caches and memory hierarchy: why memory access speed matters
  3. CPU pipelines and instruction-level parallelism
  4. Memory models and why concurrent CPU code is hard
  5. SIMD and vectorization: parallelism on a single CPU core
  6. Processes, threads, and context switching

Your CPU can execute billions of operations per second. But it spends most of its time waiting for data. A single fetch from main memory costs roughly 200 clock cycles. During that time, a modern processor could have retired hundreds of instructions. The memory wall is not a theoretical concern; it is the primary bottleneck in almost every real program.

This article builds on How a CPU actually works. If you are not comfortable with clock cycles, pipelines, and registers, read that first.

The problem: RAM is slow

A CPU core running at 4 GHz completes one cycle every 0.25 nanoseconds. A DDR5 RAM access takes around 50 to 80 nanoseconds. That is a 200x to 300x mismatch. If the processor had to go to main memory for every instruction and every piece of data, it would be idle more than 99% of the time.

The solution is the same one that works in logistics, libraries, and grocery stores: keep the things you use most often close to where you use them. In a CPU, that means a hierarchy of progressively larger, slower storage layers between the registers and main memory.

The memory hierarchy

graph TD
  R["Registers
approx 1 KB | <1 ns | 0 cycles"]
  L1["L1 Cache
32-64 KB | approx 1 ns | approx 4 cycles"]
  L2["L2 Cache
256 KB - 1 MB | approx 3-5 ns | approx 12 cycles"]
  L3["L3 Cache
8-64 MB | approx 10-15 ns | approx 40 cycles"]
  RAM["Main Memory (DDR5)
16-128 GB | approx 50-80 ns | approx 200 cycles"]
  SSD["NVMe SSD
1-4 TB | approx 10,000 ns | N/A"]
  DISK["HDD
1-16 TB | approx 5,000,000 ns | N/A"]

  R --> L1
  L1 --> L2
  L2 --> L3
  L3 --> RAM
  RAM --> SSD
  SSD --> DISK

  style R fill:#2d6a4f,stroke:#1b4332,color:#fff
  style L1 fill:#40916c,stroke:#2d6a4f,color:#fff
  style L2 fill:#52b788,stroke:#40916c,color:#fff
  style L3 fill:#74c69d,stroke:#52b788,color:#000
  style RAM fill:#f4a261,stroke:#e76f51,color:#000
  style SSD fill:#e76f51,stroke:#c1121f,color:#fff
  style DISK fill:#c1121f,stroke:#780000,color:#fff

Think of it like a workshop. Registers are the tools in your hands. L1 cache is the workbench right in front of you. L2 is the tool chest two steps away. L3 is the shared cabinet across the room. RAM is the warehouse down the hall. And disk storage is the off-site depot you call when you need something unusual. Every step farther away costs more time.

Memory hierarchy table

Level      | Typical Size            | Latency (cycles) | Latency (ns) | Bandwidth (GB/s) | Managed By
Registers  | ~1 KB                   | 0                | < 0.25       | N/A (internal)   | Compiler
L1 Cache   | 32-64 KB per core       | ~4               | ~1           | ~1,000-2,000     | Hardware
L2 Cache   | 256 KB - 1 MB per core  | ~12              | ~3-5         | ~400-800         | Hardware
L3 Cache   | 8-64 MB shared          | ~40              | ~10-15       | ~200-400         | Hardware
RAM (DDR5) | 16-128 GB               | ~200             | ~50-80       | ~50-80           | OS + Hardware
NVMe SSD   | 1-4 TB                  | N/A              | ~10,000      | ~5-7             | OS
HDD        | 1-16 TB                 | N/A              | ~5,000,000   | ~0.1-0.3         | OS

These numbers vary by generation and vendor, but the ratios are consistent. Each step down costs roughly 3x to 10x more latency and gains roughly 10x to 1000x more capacity.

Visualizing the latency gap

Look at the jump from L3 to RAM: 12 ns to 65 ns, roughly a 5x increase. Now look at the jump from RAM to SSD: 65 ns to 10,000 ns, over 150x. The entire cache hierarchy exists to keep your working data in that narrow, fast region above the RAM cliff. When your code causes frequent L3 misses, every one costs ~50 ns of dead time. In a tight loop processing millions of elements, those misses accumulate into seconds of wasted wall-clock time.

Cache levels: L1, L2, L3

L1: the fastest cache

L1 is split into two halves on most architectures: L1i for instructions and L1d for data. Each is typically 32 KB or 64 KB per core. L1 is built with the fastest SRAM cells, physically close to the execution units, and delivers data in about 4 cycles. It is small because fast SRAM is expensive and because larger caches have longer access times due to physical wire delays.

L2: the mid-tier buffer

L2 is private to each core (on modern designs), typically 256 KB to 1 MB, and takes about 12 cycles. It acts as a victim cache or an inclusive superset of L1, depending on the architecture. When data is evicted from L1, it often lands in L2 rather than being discarded entirely.

L3: the shared pool

L3 is shared across all cores on a chip. It ranges from 8 MB (laptop parts) to 64 MB or more (server CPUs, AMD’s 3D V-Cache). Access takes about 40 cycles. Its primary role is to reduce traffic to main memory and to provide a coherent view of data across cores. When one core writes to an address that another core has cached, the L3 helps mediate that through coherence protocols like MESI or MOESI.

Cache hits and cache misses

When the CPU needs data at a particular address, it checks L1 first. If the data is there, that is a cache hit. The data goes straight to the execution unit with minimal delay.

If L1 does not have it, that is an L1 miss. The request goes to L2. If L2 has it, that is an L2 hit. Otherwise, L2 miss, check L3. If L3 misses, the request goes all the way to main memory.

sequenceDiagram
  participant CPU as CPU Core
  participant L1 as L1 Cache
  participant L2 as L2 Cache
  participant L3 as L3 Cache
  participant RAM as Main Memory

  CPU->>L1: Request address 0x7FA0
  Note over L1: MISS, 4 cycles wasted
  L1->>L2: Forward request
  Note over L2: MISS, 12 cycles wasted
  L2->>L3: Forward request
  Note over L3: MISS, 40 cycles wasted
  L3->>RAM: Forward request
  Note over RAM: Data fetched, 200 cycles
  RAM-->>L3: Return 64-byte cache line
  L3-->>L2: Return cache line
  L2-->>L1: Return cache line
  L1-->>CPU: Deliver data
  Note over CPU: Total 200+ cycles

On a full miss to RAM, the CPU stalls for roughly 200 cycles. Hardware techniques like out-of-order execution and prefetching hide some of this latency by doing other useful work while waiting. But if every access misses, there is nothing to hide behind.

A well-optimized program achieves L1 hit rates above 95%. The cost of misses follows directly from the average memory access time: hit time plus miss rate times miss penalty. At a 90% hit rate, the average access is already several times slower than an L1 hit. At 80%, your program is spending a significant fraction of its time waiting for memory.
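The hit-rate sensitivity can be made concrete with the standard average-memory-access-time (AMAT) formula. The sketch below assumes a simplified two-level model in which every L1 miss pays the full penalty to RAM; the 4-cycle hit time and 200-cycle penalty are the illustrative figures from the table above, not measurements of any specific CPU.

```c
#include <math.h>

/* Average memory access time (AMAT), in cycles, for a simplified
 * two-level model: every L1 miss is charged the full miss penalty.
 * Hit time and penalty are illustrative, not measured. */
double amat_cycles(double hit_rate, double hit_cycles, double miss_penalty_cycles) {
    return hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles;
}
```

With a 4-cycle hit and a 200-cycle penalty, a 95% hit rate averages 14 cycles per access, 90% averages 24, and 80% averages 44: more than ten times the hit time.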

Cache lines: the unit of transfer

Caches do not fetch individual bytes. They fetch fixed-size blocks called cache lines. On x86 and ARM, a cache line is 64 bytes. When you read a single integer at address 0x1000, the cache loads the entire 64-byte block from 0x1000 to 0x103F.

This design exploits a fundamental observation about programs: if you access one address, you are likely to access nearby addresses soon. Loading a full line amortizes the cost of the memory fetch across multiple accesses.

Example 1: cache line arithmetic

Consider an array of 32-bit integers (4 bytes each). A single 64-byte cache line holds:

64 bytes / 4 bytes per int = 16 integers per cache line

Scenario A: access every element sequentially. You load the first element, and the cache line comes in with 16 integers. The next 15 accesses are free (cache hits). For a 1,000-element array, you need:

1,000 / 16 = 63 cache line fetches (rounding up)

Scenario B: access every 16th element. Now every single access lands on a new cache line. For 1,000 / 16 = 63 accesses, you load:

63 cache line fetches for only 63 elements

Same number of cache line loads, but Scenario A processes 16x more data per fetch. Sequential access is not just “a bit faster.” It uses the hardware 16x more efficiently at the cache level.

Scenario C: access every 64th element (a stride of 256 bytes). Now each access lands three full cache lines beyond the previous one, so three out of every four lines are skipped entirely. Of each line you do fetch, you use 4 bytes and throw away 60, wasting 94% of the fetched bandwidth.
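The line counts in the three scenarios can be computed mechanically. This sketch counts how many distinct cache lines a strided walk touches, assuming 64-byte lines, 4-byte elements, and an array that starts on a line boundary (the same assumptions as above):

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64

/* Count how many distinct 64-byte cache lines are touched when
 * visiting every stride_elems-th element of an array of n_elems
 * elements of elem_bytes bytes each. Assumes the array starts on
 * a cache line boundary. */
size_t lines_touched(size_t n_elems, size_t elem_bytes, size_t stride_elems) {
    size_t count = 0;
    size_t last_line = (size_t)-1;  /* sentinel: no line seen yet */
    for (size_t i = 0; i < n_elems; i += stride_elems) {
        size_t line = (i * elem_bytes) / CACHE_LINE_BYTES;
        if (line != last_line) {
            count++;
            last_line = line;
        }
    }
    return count;
}
```

For 1,000 4-byte integers: stride 1 touches 63 lines (Scenario A), stride 16 also touches 63 (Scenario B, one per access), and stride 64 touches only 16 (Scenario C) while using just 4 bytes of each.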

Spatial and temporal locality

Cache effectiveness depends on two properties of your memory access patterns.

Spatial locality means accessing addresses near each other. Arrays, structs laid out in memory, sequential file reads: all of these exhibit spatial locality. The cache line design rewards spatial locality directly. If you touch one byte, the adjacent 63 bytes come for free.

Temporal locality means accessing the same address repeatedly in a short window. Loop counters, accumulator variables, frequently called function pointers: these stay hot in cache because they are reused before eviction. If you access a variable once and never again, caching it was wasted effort.

Good code has both. A tight loop over a contiguous array is the ideal case: each element is near the last (spatial), and the loop variable and accumulator are reused every iteration (temporal).

Bad code has neither. A linked list with nodes scattered across the heap, accessed once each, is the worst case. Every node is a cache miss, and nothing is reused.

Example 2: row-major vs column-major matrix access

Consider a 1,000 x 1,000 matrix of 32-bit integers stored in row-major order (C/C++ default). The matrix occupies 4,000,000 bytes.

Row-by-row access (good):

for (int i = 0; i < 1000; i++)
    for (int j = 0; j < 1000; j++)
        sum += matrix[i][j];

Elements matrix[i][0] through matrix[i][15] are in the same cache line. Each row of 1,000 elements needs:

1,000 / 16 = 63 cache lines per row
63 x 1,000 rows = 63,000 cache line fetches total

Column-by-column access (bad):

for (int j = 0; j < 1000; j++)
    for (int i = 0; i < 1000; i++)
        sum += matrix[i][j];

Moving from matrix[i][j] to matrix[i+1][j] jumps 4,000 bytes (one full row), about 62 cache lines away. Every access lands on a different cache line, so unless a large fraction of the matrix stays resident in cache, every single access is a miss.

1,000,000 accesses, each missing = 1,000,000 cache line fetches

The column-major loop fetches roughly 16x more cache lines than the row-major loop for the same computation. In practice, this translates to 5x to 15x slower execution, depending on the CPU and how much of the matrix fits in L3. This is not a contrived example. It is one of the most common performance bugs in numerical code.
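A minimal, self-checking version of the two loops might look like the sketch below. The 1,000 x 1,000 size matches the example above; both functions must produce the same sum, and the timing difference is left to perf or a clock of your choice.

```c
#include <stdint.h>

#define N 1000

/* Row-major traversal: the inner loop walks consecutive addresses,
 * so each 64-byte line serves 16 consecutive int accesses. */
int64_t sum_rows(int32_t m[N][N]) {
    int64_t sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}

/* Column-major traversal: each inner-loop step jumps 4,000 bytes,
 * landing on a different cache line every time. */
int64_t sum_cols(int32_t m[N][N]) {
    int64_t sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}
```

The compiler may interchange or vectorize simple loops like these at high optimization levels, which is why measuring with a profiler beats guessing.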

Example 3: structure of arrays vs array of structures

Game engines and physics simulations process millions of particles. Consider two layouts:

Array of Structures (AoS):

struct Particle {
    float x, y, z;       // position: 12 bytes
    float vx, vy, vz;    // velocity: 12 bytes
    float mass;           // 4 bytes
    int type;             // 4 bytes
};                        // total: 32 bytes per particle

Particle particles[1000000];

If a pass only needs the positions (x, y, z), each 64-byte cache line holds exactly 2 particles (64 / 32 = 2), and only 12 of each particle's 32 bytes are data the pass actually reads. The velocities, mass, and type ride along anyway: more than 60% of the fetched bandwidth is wasted.

Structure of Arrays (SoA):

struct Particles {
    float x[1000000];
    float y[1000000];
    float z[1000000];
    float vx[1000000];
    float vy[1000000];
    float vz[1000000];
    float mass[1000000];
    int type[1000000];
};

Now updating positions reads x[], y[], z[] contiguously. Each cache line gives you 16 floats. No wasted loads. The position update loop runs roughly 2x to 4x faster with SoA layout because every byte fetched from memory is actually used.

This is why game engines, SIMD libraries, and GPU compute kernels strongly prefer SoA layouts. The hardware rewards contiguous, homogeneous access.
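The two update loops can be sketched side by side. The field names follow the structs above; the particle count and dt are illustrative. Note that even this pass, which does read the velocities, drags mass and type through the cache in the AoS layout, while SoA streams exactly the six arrays it needs.

```c
#define NPART 8  /* small count for illustration; real engines use millions */

struct ParticleAoS {
    float x, y, z;     /* position: 12 bytes */
    float vx, vy, vz;  /* velocity: 12 bytes */
    float mass;
    int   type;
};                     /* 32 bytes per particle */

struct ParticlesSoA {
    float x[NPART], y[NPART], z[NPART];
    float vx[NPART], vy[NPART], vz[NPART];
    float mass[NPART];
    int   type[NPART];
};

/* AoS: each iteration touches a 32-byte struct; mass and type are
 * dragged into cache even though this pass never reads them. */
void update_aos(struct ParticleAoS *p, int n, float dt) {
    for (int i = 0; i < n; i++) {
        p[i].x += p[i].vx * dt;
        p[i].y += p[i].vy * dt;
        p[i].z += p[i].vz * dt;
    }
}

/* SoA: six contiguous streams; every byte fetched is used. */
void update_soa(struct ParticlesSoA *p, int n, float dt) {
    for (int i = 0; i < n; i++) {
        p->x[i] += p->vx[i] * dt;
        p->y[i] += p->vy[i] * dt;
        p->z[i] += p->vz[i] * dt;
    }
}
```

The SoA loop is also exactly the shape auto-vectorizers want: independent, contiguous, homogeneous streams.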

Cache thrashing

Cache thrashing happens when your access pattern repeatedly evicts data that will be needed again soon. The cache has limited associativity: each address can only map to a small number of cache slots (typically 4-way, 8-way, or 16-way set-associative). If multiple hot addresses map to the same set, they compete for slots and keep evicting each other.

A classic trigger is striding through memory at a power-of-two interval that matches the cache’s set mapping. For example, if your L1 cache is 32 KB, 8-way set-associative, with 64-byte lines, it has 64 sets, and addresses that lie an exact multiple of 4 KB apart (64 sets x 64 bytes) map to the same set. Accessing 9 arrays whose corresponding elements sit exact multiples of 4 KB apart makes all nine compete for the same 8-way set, so one of them is always being evicted.

Symptoms of cache thrashing:

  • Performance that is dramatically worse than expected for the data size
  • Performance that changes wildly with small changes to array sizes (e.g., 1024 vs 1025 elements)
  • High L1 or L2 miss rates in profiler output despite working set being smaller than the cache

Mitigation: add padding to arrays so addresses no longer collide on the same cache set. Some allocators align to cache line boundaries by default. Compiler pragmas and alignas in C++ can help.
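One way to sketch the padding fix, under the 32 KB / 8-way / 64-set geometry described above: rows of a power-of-two-sized table would start exactly 4 KB apart and collide on one set, so a one-cache-line pad per row shifts each successive row into a different set. The sizes and the 9-row count are illustrative.

```c
#include <stddef.h>

#define ARRAY_ELEMS 1024  /* 4 KB of floats: a power-of-two row size */
#define PAD_ELEMS   16    /* one 64-byte cache line of floats */
#define NROWS       9     /* one more row than the 8-way associativity */

/* Without PAD_ELEMS, rows start exactly 4 KB apart, so element j of
 * every row maps to the same L1 set (64 sets x 64 B = 4 KB period)
 * and the 9 rows evict each other from the 8-way set. The pad shifts
 * each successive row start into a different set. */
struct PaddedTable {
    float rows[NROWS][ARRAY_ELEMS + PAD_ELEMS];
};

/* Sum column j across all rows: the access pattern that thrashes
 * when the rows collide on one set. */
float sum_column(const struct PaddedTable *t, int j) {
    float sum = 0.0f;
    for (int r = 0; r < NROWS; r++)
        sum += t->rows[r][j];
    return sum;
}
```

The result is identical with or without the pad; only the set mapping, and therefore the miss rate, changes.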

Cache coherence: the multi-core complication

When multiple cores have their own L1 and L2 caches, they can hold stale copies of the same memory address. Cache coherence protocols (MESI, MOESI, MESIF) ensure that writes on one core are visible to reads on another.

The cost: when Core 0 writes to a cache line that Core 1 also has cached, the protocol invalidates Core 1’s copy. If both cores are writing to different variables that happen to share the same 64-byte cache line, they will constantly invalidate each other’s caches. This is called false sharing, and it can make “parallel” code slower than single-threaded code.

Fix: pad shared variables so each one occupies its own cache line. In C++, use alignas(64). In Java, the @Contended annotation does this. In Go, you will sometimes see padding fields in structs for exactly this reason.
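In C11 the same fix is spelled with alignas from <stdalign.h> (the macro form of _Alignas); 64 is the x86/ARM line size from earlier. A minimal sketch, with an illustrative four-thread counter array:

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64

/* Each counter gets its own cache line, so two threads incrementing
 * different counters never invalidate each other's cached copies.
 * Without the alignment, adjacent longs would share one 64-byte line
 * and every write by one thread would evict it from the other's cache. */
struct PerThreadCounter {
    alignas(CACHE_LINE) long count;
};

struct Counters {
    struct PerThreadCounter c[4];  /* one per thread; 4 is illustrative */
};
```

The alignment rounds each element up to a full line, trading 56 wasted bytes per counter for the elimination of coherence traffic.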

NUMA: non-uniform memory access

On multi-socket servers (2 or 4 physical CPUs), each socket has its own memory controller and its own bank of RAM. Accessing “local” memory (attached to the same socket) takes ~50 ns. Accessing “remote” memory (attached to the other socket, reached via an interconnect like AMD Infinity Fabric or Intel UPI) takes ~100 ns or more.

This means memory access time depends on which CPU is asking and where the data lives. The OS tries to allocate memory on the same NUMA node as the thread that requested it (first-touch policy), but if threads migrate between cores on different sockets, they can end up accessing remote memory frequently.

For most application developers, NUMA is invisible. But if you are writing database engines, large-scale in-memory stores, or high-performance computing code, NUMA-aware allocation and thread pinning can make a 30-50% difference in throughput.

In practice: real bottlenecks

Profiling cache behavior. On Linux, perf stat reports L1, L2, and L3 miss rates directly:

perf stat -e L1-dcache-load-misses,L1-dcache-loads,LLC-load-misses,LLC-loads ./my_program

An L1 miss rate above 5% is worth investigating. An LLC (last-level cache, usually L3) miss rate above 1% in a workload you expected to be compute-bound is a strong sign it is actually memory-bound.

On macOS, Instruments provides similar counters through the “Counters” profiling template.

Common surprises:

  • Hash maps with random access patterns are almost always cache-hostile. If your keys are uniformly distributed, every lookup is effectively a random memory access. For small maps, a sorted array with binary search can outperform a hash map because the memory accesses are more predictable.

  • Linked lists are the canonical cache-unfriendly data structure. Each node is a pointer chase to an arbitrary heap location. Even if the list is short, traversal causes frequent cache misses. Prefer vectors or arrays when possible.

  • Object-oriented code with deep inheritance hierarchies and virtual dispatch scatters vtable lookups, method code, and object data across memory. Data-oriented design (grouping data by how it is processed rather than what it “is”) consistently outperforms OOP in performance-critical paths.
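The linked-list point can be made concrete. This sketch builds the same sequence twice, once as heap-allocated nodes and once as a contiguous array, and sums both. The array loop enjoys spatial locality and hardware prefetching; the list traversal is a dependent pointer chase where each load address is unknown until the previous load completes. Timing is left to perf.

```c
#include <stdlib.h>

struct Node {
    int value;
    struct Node *next;
};

/* Build a linked list holding 0..n-1; each node is a separate
 * allocation, so consecutive nodes may land anywhere on the heap. */
struct Node *build_list(int n) {
    struct Node *head = NULL;
    for (int i = n - 1; i >= 0; i--) {
        struct Node *node = malloc(sizeof *node);
        node->value = i;
        node->next = head;
        head = node;
    }
    return head;
}

/* Dependent pointer chase: misses cannot be overlapped or prefetched
 * effectively, because the next address lives in the current node. */
long sum_list(const struct Node *head) {
    long sum = 0;
    for (; head; head = head->next)
        sum += head->value;
    return sum;
}

/* Contiguous traversal: sequential cache lines, prefetch-friendly. */
long sum_array(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Both functions compute the same answer; the difference shows up entirely in the cache-miss counters.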

What to look for in your code:

  1. ✓ Sequential array access in inner loops
  2. ✓ Small, contiguous data structures
  3. ✓ Hot data separated from cold data
  4. ⚠ Random access patterns (hash maps, pointer chasing)
  5. ⚠ Large strides through arrays
  6. ✗ Linked lists in hot paths
  7. ✗ False sharing between threads

Key takeaways

  1. The CPU-memory speed gap is the dominant performance bottleneck in modern computing. Caches exist to bridge it.
  2. Cache lines (64 bytes) are the fundamental unit. Sequential access is fast because it uses entire cache lines. Random access wastes most of each fetched line.
  3. Spatial locality (access nearby addresses) and temporal locality (reuse recently accessed addresses) are the two principles that make cache-friendly code.
  4. Loop order matters enormously for multi-dimensional arrays. Row-major traversal in row-major storage can be 10x faster than column-major traversal.
  5. False sharing between threads can silently destroy parallel performance. Pad shared variables to cache line boundaries.
  6. Profile before optimizing. perf stat and hardware counters tell you exactly where cache misses are costing you time.

What comes next

With the memory hierarchy covered, the next piece is understanding how the CPU keeps its pipeline full despite all these potential stalls. CPU pipelines and instruction-level parallelism covers pipelining, hazards, branch prediction, and out-of-order execution: the mechanisms that let a processor do useful work while waiting for slow memory.
