CUDA memory hierarchy: where your data lives matters
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read the following:
- CUDA thread hierarchy for grids, blocks, warps, and threads.
- CPU caches and memory hierarchy for the fundamental idea of trading capacity for speed.
You should be comfortable writing a simple CUDA kernel and launching it with <<<blocks, threads>>>. Everything in this article is about what happens after the kernel starts executing: where each variable lives, how fast it can be accessed, and how that choice determines whether your kernel runs at 10% or 90% of peak throughput.
The GPU memory hierarchy at a glance
A CPU has a straightforward hierarchy: registers, L1, L2, L3, DRAM. A GPU has a richer set of memory spaces because the programming model exposes them explicitly to the programmer. You, not the hardware, decide where most data lives.
graph TD
REG["Registers
Scope: per-thread
Latency: 1 cycle
Size: 255 max per thread
Declaration: automatic variables"]
LOCAL["Local Memory
Scope: per-thread
Latency: 200-800 cycles
Size: up to 512 KB per thread
Declaration: spilled registers"]
SHARED["Shared Memory
Scope: per-block
Latency: 5-30 cycles
Size: up to 164 KB per SM
Declaration: __shared__ keyword"]
L1["L1 Cache
Scope: per-SM
Latency: 30 cycles
Size: up to 256 KB per SM
Hardware managed"]
L2["L2 Cache
Scope: device-wide
Latency: 200 cycles
Size: 6-96 MB
Hardware managed"]
CONST["Constant Memory
Scope: device-wide, read-only
Latency: 5 cycles cached
Size: 64 KB
Declaration: __constant__ keyword"]
TEX["Texture Memory
Scope: device-wide, read-only
Latency: 5 cycles cached
Size: device memory
Declaration: texture objects"]
GLOBAL["Global Memory DRAM
Scope: device-wide
Latency: 200-800 cycles
Size: 8-80 GB
Declaration: cudaMalloc"]
REG --> L1
LOCAL --> L1
SHARED --> L1
L1 --> L2
CONST --> L2
TEX --> L2
L2 --> GLOBAL
style REG fill:#2d6a4f,stroke:#1b4332,color:#fff
style SHARED fill:#40916c,stroke:#2d6a4f,color:#fff
style CONST fill:#52b788,stroke:#40916c,color:#fff
style TEX fill:#52b788,stroke:#40916c,color:#fff
style L1 fill:#74c69d,stroke:#52b788,color:#000
style L2 fill:#f4a261,stroke:#e76f51,color:#000
style LOCAL fill:#e76f51,stroke:#c1121f,color:#fff
style GLOBAL fill:#c1121f,stroke:#780000,color:#fff
The key insight: on a CPU, the cache hierarchy is almost entirely hardware-managed. On a GPU, shared memory and constant memory are programmer-managed. You decide what goes in them, when it arrives, and when it gets evicted. That control is what separates a 50 GB/s kernel from a 1500 GB/s kernel on the same hardware.
Memory types reference table
| Memory | Declaration | Scope | Lifetime | Latency (cycles) | Typical Size | Bandwidth (GB/s) | Programmer Controlled |
|---|---|---|---|---|---|---|---|
| Registers | int x; (automatic) | Thread | Thread | ~1 | 255 regs/thread, 65536 regs/SM | N/A (internal) | Partially (compiler) |
| Local | Spilled registers, large arrays | Thread | Thread | ~200-800 | Up to 512 KB/thread | ~900 (via L1/L2) | No (compiler decides) |
| Shared | __shared__ float s[N]; | Block | Block | ~5-30 | Up to 164 KB/SM | ~12,000-19,000 | ✓ Yes |
| L1 Cache | Automatic | SM | Automatic | ~30 | Up to 256 KB/SM | ~12,000-19,000 | Partially (hints) |
| L2 Cache | Automatic | Device | Automatic | ~200 | 6-96 MB | ~4,000-6,000 | Partially (persistence) |
| Constant | __constant__ float c[N]; | Device | Application | ~5 (cached), ~200 (miss) | 64 KB total | Broadcast to warp | ✓ Yes |
| Texture | cudaTextureObject_t | Device | Application | ~5 (cached), ~200 (miss) | Device memory | Cached + spatial | ✓ Yes |
| Global | cudaMalloc(&ptr, size) | Device | cudaMalloc to cudaFree | ~200-800 | 8-80 GB | 900-3,350 | ✓ Yes |
The latency numbers are approximate and vary by architecture (Volta, Ampere, Hopper). The ratios between tiers remain consistent across generations.
Registers: the fastest storage on the chip
Every automatic variable inside a kernel function lives in a register by default. Registers are private to each thread and have single-cycle access latency. There is no faster storage on the entire GPU.
__global__ void saxpy(float a, float* x, float* y, float* out, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        float xi = x[idx]; // xi lives in a register
        float yi = y[idx]; // yi lives in a register
        out[idx] = a * xi + yi;
    }
}
Variables idx, xi, yi, and the parameter a all live in registers. The compiler allocates them automatically. You do not write register or any qualifier; you just declare a local variable.
The constraint: each SM has a fixed-size register file. On Ampere (SM 8.0), each SM has 65,536 32-bit registers. If your kernel uses 64 registers per thread and your block has 256 threads, that block alone consumes 64 * 256 = 16,384 registers. The SM can hold at most 65,536 / 16,384 = 4 such blocks simultaneously, giving you 1,024 threads per SM. More registers per thread means fewer concurrent threads, which means less latency hiding.
The compiler flag --maxrregcount=N lets you cap register usage per thread, at the cost of spilling excess variables to local memory.
Local memory: the register overflow area
Local memory is not a separate physical memory. It is global DRAM accessed with per-thread addressing. When the compiler runs out of registers for a thread (register spilling), or when you declare arrays too large for the register file, those values land in local memory.
__global__ void kernel_with_spill() {
    float big_array[256]; // too large for registers, lives in local memory
    for (int i = 0; i < 256; i++) {
        big_array[i] = sinf((float)i);
    }
}
Local memory accesses go through the L1 and L2 caches, so they do not always pay full DRAM latency. But they are still up to hundreds of times slower than registers. The rule: if your profiler (Nsight Compute) shows high local memory traffic, you have a problem. Reduce register pressure or restructure the algorithm.
Shared memory: the programmer-managed scratchpad
Shared memory is on-chip SRAM that is visible to all threads within a block. It sits physically next to the L1 cache and has comparable latency (5 to 30 cycles). The critical difference from L1: you control every byte that goes in and comes out.
SM internals
graph TD
subgraph SM["Streaming Multiprocessor"]
direction TB
RF["Register File
65,536 x 32-bit registers"]
WS["Warp Schedulers
4 schedulers per SM"]
subgraph OnChip["On-Chip Memory, up to 256 KB"]
direction LR
SMEM["Shared Memory
configurable portion"]
L1C["L1 Data Cache
remainder"]
end
CORES["CUDA Cores
FP32 + INT32 + FP64 + Tensor"]
end
L2C["L2 Cache
Device-wide"]
DRAM["HBM / Global Memory"]
WS --> RF
WS --> CORES
CORES --> OnChip
OnChip --> L2C
L2C --> DRAM
style RF fill:#2d6a4f,stroke:#1b4332,color:#fff
style WS fill:#40916c,stroke:#2d6a4f,color:#fff
style SMEM fill:#52b788,stroke:#40916c,color:#fff
style L1C fill:#74c69d,stroke:#52b788,color:#000
style CORES fill:#b7e4c7,stroke:#74c69d,color:#000
style L2C fill:#f4a261,stroke:#e76f51,color:#000
style DRAM fill:#c1121f,stroke:#780000,color:#fff
On Ampere GPUs, the on-chip memory for each SM is a unified 192 KB pool that is partitioned between shared memory and L1 cache. On Hopper, this increases to 256 KB. The cudaFuncSetAttribute API lets a kernel opt in to dynamic shared memory beyond the default 48 KB:
cudaFuncSetAttribute(
    myKernel,
    cudaFuncAttributeMaxDynamicSharedMemorySize,
    163840 // request 160 KB of shared memory
);
Declaring shared memory
Static allocation (size known at compile time):
__global__ void tile_multiply() {
    __shared__ float tile_A[32][32];
    __shared__ float tile_B[32][32];
    // all 1024 threads in the block can read/write tile_A and tile_B
}
Dynamic allocation (size known at launch time):
__global__ void flexible_kernel() {
    extern __shared__ float dynamic_smem[];
    // size is set when launching: kernel<<<blocks, threads, sharedBytes>>>()
}
A tiled matrix multiplication kernel
The canonical use of shared memory is tiling. Instead of having every thread fetch from slow global memory on every multiply, each thread block cooperatively loads a tile into shared memory, synchronizes, then computes from the fast on-chip copy.
#define TILE_SIZE 32

__global__ void matmul_tiled(float* A, float* B, float* C, int N) {
    // Assumes N is a multiple of TILE_SIZE; production code needs bounds checks.
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < N / TILE_SIZE; t++) {
        // cooperative load: each thread loads one element
        tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        __syncthreads(); // wait for all threads to finish loading
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads(); // wait before loading next tile
    }
    C[row * N + col] = sum;
}
Without tiling, each thread reads 2N floats from global memory for N multiply-accumulate operations. With tiling, global reads drop by a factor of TILE_SIZE (32x). That is the difference between being memory-bound and being compute-bound.
Constant memory: broadcast for free
Constant memory is a 64 KB region of device memory that is cached in a dedicated per-SM constant cache. When all threads in a warp read the same address, the hardware broadcasts the value in a single transaction. This makes constant memory ideal for kernel parameters, lookup tables, and coefficients that do not change during execution.
__constant__ float filter_weights[256];
// Host code: copy data before kernel launch
cudaMemcpyToSymbol(filter_weights, host_weights, 256 * sizeof(float));
__global__ void convolution(float* input, float* output, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        float sum = 0.0f;
        for (int k = 0; k < 256; k++) {
            // broadcast read; assumes input is padded by 256 elements
            sum += input[idx + k] * filter_weights[k];
        }
        output[idx] = sum;
    }
}
⚠ If threads in a warp access different constant memory addresses, the accesses are serialized. Constant memory only helps when access is uniform across the warp.
Texture memory: spatial locality for free
Texture memory is backed by global memory but cached through a dedicated read-only texture cache. It is optimized for 2D spatial locality: if a thread reads pixel (x, y), nearby threads reading (x+1, y) or (x, y+1) will likely hit the cache.
cudaTextureObject_t tex;
// ... setup texture object with cudaCreateTextureObject() ...
__global__ void bilinear_sample(float* output, cudaTextureObject_t tex, int w, int h) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < w && y < h) {
        // hardware-accelerated interpolation
        output[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    }
}
Texture memory also provides free boundary handling (clamp, wrap) and hardware-interpolated reads. For modern compute workloads, shared memory is usually preferred. Texture memory still wins for image processing and any kernel with 2D access patterns.
L1 and L2 caches: hardware-managed layers
The L1 cache sits on each SM and caches global and local memory accesses automatically. You do not explicitly load data into L1; the hardware manages eviction and replacement. On Ampere, L1 and shared memory share a unified 192 KB pool. Increasing shared memory allocation decreases L1 capacity and vice versa.
The L2 cache is shared across all SMs on the device. On A100, it is 40 MB. On H100, it is 50 MB. Starting with Ampere, you can set L2 persistence policies to pin frequently accessed data in L2:
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
size_t l2_size = prop.l2CacheSize;
// reserve 30% of L2 for persistent access
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, l2_size * 0.3);
L2 persistence is useful when a kernel repeatedly accesses a working set that fits in L2 but not in L1. Without persistence, other traffic can evict your hot data between kernel launches.
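Reserving the carve-out alone does not mark any data as persistent; you also attach an access-policy window to a stream. A sketch using the CUDA 11+ access-policy APIs (d_data, window_bytes, and stream are assumed to already exist):

```cuda
// Pin a hot address range into the persisting portion of L2 (CUDA 11+).
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = d_data;        // start of the hot working set
attr.accessPolicyWindow.num_bytes = window_bytes;  // should fit the reserved carve-out
attr.accessPolicyWindow.hitRatio  = 1.0f;          // fraction of accesses treated as persisting
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
```

Kernels launched into that stream will then keep the window resident in L2 until the policy is reset.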
Global memory: large, slow, essential
Global memory is the HBM (High Bandwidth Memory) DRAM attached to the GPU. It is the only memory visible to both the host (CPU) and the device (GPU), and it is where all large data structures live.
float* d_data;
cudaMalloc(&d_data, N * sizeof(float)); // allocate
cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice); // copy
myKernel<<<grid, block>>>(d_data); // use
cudaFree(d_data); // free
Global memory bandwidth on modern GPUs is enormous in absolute terms (A100: 2,039 GB/s, H100: 3,350 GB/s) but still 10x to 20x slower than shared memory bandwidth. The entire optimization strategy in CUDA boils down to: move data from global to on-chip memory, reuse it as many times as possible, then write results back.
Declaring each memory type: summary
// ---- Registers (automatic) ----
__global__ void kernel() {
    int tid = threadIdx.x; // register
    float val = 0.0f;      // register
}

// ---- Local memory (compiler-managed spill) ----
__global__ void kernel() {
    float big[512]; // too large for registers -> local memory
}

// ---- Shared memory (static) ----
__global__ void kernel() {
    __shared__ float smem[256]; // visible to all threads in block
}

// ---- Shared memory (dynamic) ----
__global__ void kernel() {
    extern __shared__ float smem[]; // size set at launch
}

// ---- Constant memory ----
__constant__ float coeff[64]; // file scope, 64 KB max total

// ---- Global memory ----
// Host side:
float* d_ptr;
cudaMalloc(&d_ptr, size);

// ---- Texture memory ----
// Host side: create cudaTextureObject_t via cudaCreateTextureObject()
Latency comparison
Plotted on a log scale, the latencies in the table above span four orders of magnitude: a register access at 1 cycle and a PCIe transfer at 10,000+ cycles cannot be compared on a linear axis. That spread should guide every optimization decision: move data up the hierarchy, reduce traffic down it.
Worked examples
Example 1: Register pressure and occupancy
A kernel uses 64 registers per thread. The target GPU (Ampere, SM 8.0) has 65,536 registers per SM and supports a maximum of 2,048 threads per SM.
Step 1: Maximum threads from register budget.
65,536 registers / 64 registers per thread = 1,024 threads per SM
Step 2: Convert to warps.
1,024 threads / 32 threads per warp = 32 warps per SM
The hardware maximum is 64 warps (2,048 threads). Register pressure cuts occupancy to 50%.
Step 3: Total threads across the device (108 SMs on A100).
1,024 threads per SM * 108 SMs = 110,592 threads total
✓ This is still a large number of threads, but the reduced occupancy means less latency hiding. If the kernel is memory-bound, this matters. If the kernel is compute-bound with high ILP, 50% occupancy may be perfectly fine.
⚠ Use --maxrregcount=32 to halve register usage and double occupancy, but only if the spilled variables do not create excessive local memory traffic.
Example 2: Shared memory sizing
A kernel loads a 32 x 32 tile of float values into shared memory.
Step 1: Bytes per tile.
32 * 32 * 4 bytes = 4,096 bytes = 4 KB per block
Step 2: How many blocks fit in one SM’s shared memory? Assume the SM has 48 KB of shared memory configured (the default Ampere configuration with 144 KB L1).
48 KB / 4 KB = 12 blocks per SM
But the hardware maximum is typically 32 blocks per SM (Ampere). So shared memory is not the bottleneck here.
Step 3: If the kernel uses two tiles (A and B for matrix multiplication):
2 * 4 KB = 8 KB per block. 48 KB / 8 KB = 6 blocks per SM. Still well within limits.
✓ Shared memory becomes the bottleneck when tiles are large. A 64 x 64 float tile takes 16 KB. Two tiles take 32 KB. With 48 KB available, only 1 block fits. That is 1,024 threads at most on an SM that can handle 2,048. You trade occupancy for data reuse.
Example 3: Global memory bandwidth utilization
An A100 GPU has 2,039 GB/s of HBM bandwidth. A kernel reads a 1 GB input array and writes a 1 GB output array. The kernel does 1 FLOP per element (element-wise add).
Step 1: Total data movement.
1 GB (read) + 1 GB (write) = 2 GB
Step 2: Time at peak bandwidth.
2 GB / 2,039 GB/s = 0.98 ms
Step 3: Total FLOPs.
1 GB / 4 bytes per float = 250 million elements, so 250 million FLOPs. The A100 can do 19.5 TFLOPS (FP32).
250 * 10^6 FLOPs / 19.5 * 10^12 FLOPS/s = 0.013 ms
Step 4: Arithmetic intensity.
250 * 10^6 FLOPs / 2 * 10^9 bytes = 0.125 FLOPs/byte
✓ The kernel is overwhelmingly memory-bound. It takes 0.98 ms to move the data and 0.013 ms to compute. No amount of shared memory or register optimization will help here. The only wins come from reducing data movement: fusing this kernel with the next operation, using half-precision, or restructuring the algorithm to increase arithmetic intensity.
⚠ This is a roofline model calculation. If your measured time is significantly higher than 0.98 ms, you have a coalescing or occupancy problem, not an algorithmic one.
In practice
- Profile first. Use Nsight Compute to measure actual memory throughput before optimizing. The l1tex__data_pipe_lsu_wavefronts_mem_shared metric tells you shared memory utilization; lts__t_sectors_srcunit_tex shows L2 traffic. Do not guess which memory level is the bottleneck.
- Shared memory is your primary optimization lever. If a kernel reads the same global memory data multiple times across threads in a block, shared memory tiling almost always helps. The tiled matrix multiply pattern generalizes to convolutions, reductions, scans, and stencil computations.
- Watch register pressure. The occupancy calculator in Nsight Compute shows you the limiting factor (registers, shared memory, or block size). If registers are the bottleneck, consider reducing live variables, using __launch_bounds__, or trading precision (float instead of double halves register usage for FP values on most architectures).
- Constant memory is free performance for uniform reads. If every thread reads the same coefficient array, move it to __constant__. The broadcast mechanism means one cache line serves the entire warp.
- L2 persistence is worth exploring for iterative algorithms. If your kernel launches repeatedly on the same working set (think iterative solvers, training loops), L2 persistence policies can prevent cold-cache misses on each launch.
- Texture memory is niche but powerful. For 2D interpolation, boundary handling, and image processing, the texture cache provides features that would cost you hundreds of lines of manual code.
What comes next
Knowing where data lives is necessary but not sufficient. How data is accessed matters just as much. If adjacent threads read addresses that are scattered across memory, the hardware cannot combine those requests into efficient transactions.
The next article, CUDA memory coalescing, covers coalesced vs. uncoalesced access patterns, the 128-byte transaction model, stride-1 access, structure of arrays vs. array of structures, and how a single line change can move your kernel from 10% to 90% of peak bandwidth.