
CUDA memory hierarchy: where your data lives matters

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article assumes you have read the earlier parts of this series, in particular the ones on writing your first CUDA kernel and on the thread hierarchy. You should be comfortable writing a simple CUDA kernel and launching it with <<<blocks, threads>>>. Everything in this article is about what happens after the kernel starts executing: where each variable lives, how fast it can be accessed, and how that choice determines whether your kernel runs at 10% or 90% of peak throughput.

The GPU memory hierarchy at a glance

A CPU has a straightforward hierarchy: registers, L1, L2, L3, DRAM. A GPU has a richer set of memory spaces because the programming model exposes them explicitly to the programmer. You, not the hardware, decide where most data lives.

graph TD
  REG["Registers
Scope: per-thread
Latency: 1 cycle
Size: 255 max per thread
Declaration: automatic variables"]
  LOCAL["Local Memory
Scope: per-thread
Latency: 200-800 cycles
Size: up to 512 KB per thread
Declaration: spilled registers"]
  SHARED["Shared Memory
Scope: per-block
Latency: 5-30 cycles
Size: up to 164 KB per SM
Declaration: shared keyword"]
  L1["L1 Cache
Scope: per-SM
Latency: 30 cycles
Size: up to 256 KB per SM
Hardware managed"]
  L2["L2 Cache
Scope: device-wide
Latency: 200 cycles
Size: 6-96 MB
Hardware managed"]
  CONST["Constant Memory
Scope: device-wide, read-only
Latency: 5 cycles cached
Size: 64 KB
Declaration: constant keyword"]
  TEX["Texture Memory
Scope: device-wide, read-only
Latency: 5 cycles cached
Size: device memory
Declaration: texture objects"]
  GLOBAL["Global Memory DRAM
Scope: device-wide
Latency: 200-800 cycles
Size: 8-80 GB
Declaration: cudaMalloc"]

  REG --> L1
  LOCAL --> L1
  SHARED --> L1
  L1 --> L2
  CONST --> L2
  TEX --> L2
  L2 --> GLOBAL

  style REG fill:#2d6a4f,stroke:#1b4332,color:#fff
  style SHARED fill:#40916c,stroke:#2d6a4f,color:#fff
  style CONST fill:#52b788,stroke:#40916c,color:#fff
  style TEX fill:#52b788,stroke:#40916c,color:#fff
  style L1 fill:#74c69d,stroke:#52b788,color:#000
  style L2 fill:#f4a261,stroke:#e76f51,color:#000
  style LOCAL fill:#e76f51,stroke:#c1121f,color:#fff
  style GLOBAL fill:#c1121f,stroke:#780000,color:#fff

The key insight: on a CPU, the cache hierarchy is almost entirely hardware-managed. On a GPU, shared memory and constant memory are programmer-managed. You decide what goes in them, when it arrives, and when it gets evicted. That control is what separates a 50 GB/s kernel from a 1500 GB/s kernel on the same hardware.

Memory types reference table

| Memory | Declaration | Scope | Lifetime | Latency (cycles) | Typical Size | Bandwidth (GB/s) | Programmer Controlled |
|---|---|---|---|---|---|---|---|
| Registers | int x; (automatic) | Thread | Thread | ~1 | 255 regs/thread, 65,536 regs/SM | N/A (internal) | Partially (compiler) |
| Local | Spilled registers, large arrays | Thread | Thread | ~200-800 | Up to 512 KB/thread | ~900 (via L1/L2) | No (compiler decides) |
| Shared | __shared__ float s[N]; | Block | Block | ~5-30 | Up to 164 KB/SM | ~12,000-19,000 | ✓ Yes |
| L1 Cache | Automatic | SM | Automatic | ~30 | Up to 256 KB/SM | ~12,000-19,000 | Partially (hints) |
| L2 Cache | Automatic | Device | Automatic | ~200 | 6-96 MB | ~4,000-6,000 | Partially (persistence) |
| Constant | __constant__ float c[N]; | Device | Application | ~5 (cached), ~200 (miss) | 64 KB total | Broadcast to warp | ✓ Yes |
| Texture | cudaTextureObject_t | Device | Application | ~5 (cached), ~200 (miss) | Device memory | Cached + spatial | ✓ Yes |
| Global | cudaMalloc(&ptr, size) | Device | cudaMalloc to cudaFree | ~200-800 | 8-80 GB | 900-3,350 | ✓ Yes |

The latency numbers are approximate and vary by architecture (Volta, Ampere, Hopper). The ratios between tiers remain consistent across generations.

Registers: the fastest storage on the chip

Every automatic variable inside a kernel function lives in a register by default. Registers are private to each thread and have single-cycle access latency. There is no faster storage on the entire GPU.

__global__ void saxpy(float a, float* x, float* y, float* out, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        float xi = x[idx];   // xi lives in a register
        float yi = y[idx];   // yi lives in a register
        out[idx] = a * xi + yi;
    }
}

Variables idx, xi, yi, and the parameter a all live in registers. The compiler allocates them automatically. You do not write register or any qualifier; you just declare a local variable.

The constraint: each SM has a fixed-size register file. On Ampere (SM 8.0), each SM has 65,536 32-bit registers. If your kernel uses 64 registers per thread and your block has 256 threads, that block alone consumes 64 * 256 = 16,384 registers. The SM can hold at most 65,536 / 16,384 = 4 such blocks simultaneously, giving you 1,024 threads per SM. More registers per thread means fewer concurrent threads, which means less latency hiding.

The compiler flag --maxrregcount=N lets you cap register usage per thread, at the cost of spilling excess variables to local memory.
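For illustration, a compile line that caps registers and checks the result might look like the following (the source filename is hypothetical; -Xptxas -v asks the device assembler to report per-kernel register usage and spill bytes, so you can see immediately whether the cap caused spills):

```shell
# Cap every kernel in this file at 32 registers per thread, and
# print register counts and spill stores/loads at compile time.
nvcc -O3 --maxrregcount=32 -Xptxas -v saxpy.cu -o saxpy
```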

Local memory: the register overflow area

Local memory is not a separate physical memory. It is global DRAM accessed with per-thread addressing. When the compiler runs out of registers for a thread (register spilling), or when you declare arrays too large for the register file, those values land in local memory.

__global__ void kernel_with_spill() {
    float big_array[256];  // too large for registers, lives in local memory
    for (int i = 0; i < 256; i++) {
        big_array[i] = sinf((float)i);
    }
}

Local memory accesses go through the L1 and L2 caches, so they do not always pay full DRAM latency. But they are still roughly 200x to 800x slower than registers. The rule: if your profiler (Nsight Compute) shows high local memory traffic, you have a problem. Reduce register pressure or restructure the algorithm.

Shared memory: the programmer-managed scratchpad

Shared memory is on-chip SRAM that is visible to all threads within a block. It sits physically next to the L1 cache and has comparable latency (5 to 30 cycles). The critical difference from L1: you control every byte that goes in and comes out.

SM internals

graph TD
  subgraph SM["Streaming Multiprocessor"]
      direction TB
      RF["Register File
65,536 x 32-bit registers"]
      WS["Warp Schedulers
4 schedulers per SM"]
      subgraph OnChip["On-Chip Memory, up to 256 KB"]
          direction LR
          SMEM["Shared Memory
configurable portion"]
          L1C["L1 Data Cache
remainder"]
      end
      CORES["CUDA Cores
FP32 + INT32 + FP64 + Tensor"]
  end
  L2C["L2 Cache
Device-wide"]
  DRAM["HBM / Global Memory"]

  WS --> RF
  WS --> CORES
  CORES --> OnChip
  OnChip --> L2C
  L2C --> DRAM

  style RF fill:#2d6a4f,stroke:#1b4332,color:#fff
  style WS fill:#40916c,stroke:#2d6a4f,color:#fff
  style SMEM fill:#52b788,stroke:#40916c,color:#fff
  style L1C fill:#74c69d,stroke:#52b788,color:#000
  style CORES fill:#b7e4c7,stroke:#74c69d,color:#000
  style L2C fill:#f4a261,stroke:#e76f51,color:#000
  style DRAM fill:#c1121f,stroke:#780000,color:#fff

On Ampere data center GPUs (A100), the on-chip memory for each SM is a unified 192 KB pool that you can partition between shared memory and L1 cache; consumer Ampere parts have a smaller pool. On Hopper, the pool grows to 256 KB. The cudaFuncSetAttribute API lets you choose the split:

cudaFuncSetAttribute(
    myKernel,
    cudaFuncAttributeMaxDynamicSharedMemorySize,
    163840  // request 160 KB of shared memory
);

Declaring shared memory

Static allocation (size known at compile time):

__global__ void tile_multiply() {
    __shared__ float tile_A[32][32];
    __shared__ float tile_B[32][32];
    // all 1024 threads in the block can read/write tile_A and tile_B
}

Dynamic allocation (size known at launch time):

__global__ void flexible_kernel() {
    extern __shared__ float dynamic_smem[];
    // size is set when launching: kernel<<<blocks, threads, sharedBytes>>>()
}
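As a sketch, a matching launch might give each block one float of scratch per thread (variable names here are illustrative, not part of the CUDA API):

```cpp
// Host side: the third launch parameter sets the dynamic shared memory size.
int threads = 256;
int blocks  = (n + threads - 1) / threads;
size_t sharedBytes = threads * sizeof(float);  // one float per thread in the block
flexible_kernel<<<blocks, threads, sharedBytes>>>();
```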

A tiled matrix multiplication kernel

The canonical use of shared memory is tiling. Instead of having every thread fetch from slow global memory on every multiply, each thread block cooperatively loads a tile into shared memory, synchronizes, then computes from the fast on-chip copy.

#define TILE_SIZE 32

__global__ void matmul_tiled(float* A, float* B, float* C, int N) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // assumes N is a multiple of TILE_SIZE; ragged edges need boundary guards
    for (int t = 0; t < N / TILE_SIZE; t++) {
        // cooperative load: each thread loads one element
        tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];

        __syncthreads();  // wait for all threads to finish loading

        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }

        __syncthreads();  // wait before loading next tile
    }

    C[row * N + col] = sum;
}

Without tiling, each thread reads 2N floats from global memory for N multiply-accumulate operations. With tiling, global reads drop by a factor of TILE_SIZE (32x). That is the difference between being memory-bound and being compute-bound.
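The factor-of-TILE_SIZE claim follows from a simple per-thread load count. A sketch in plain host arithmetic (helper names are ours, not part of the kernel above):

```cpp
#include <cassert>

// Naive matmul: each thread reads one element of A and one of B
// per multiply-accumulate, for N iterations.
constexpr long naive_loads_per_thread(long N) { return 2 * N; }

// Tiled matmul: each thread loads 2 floats per tile phase,
// and there are N / tile phases.
constexpr long tiled_loads_per_thread(long N, long tile) { return 2 * (N / tile); }
```

For N = 1024 and a 32-wide tile, that is 2,048 naive loads versus 64 tiled loads per thread: a 32x reduction in global traffic.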

Constant memory: broadcast for free

Constant memory is a 64 KB region of device memory that is cached in a dedicated per-SM constant cache. When all threads in a warp read the same address, the hardware broadcasts the value in a single transaction. This makes constant memory ideal for kernel parameters, lookup tables, and coefficients that do not change during execution.

__constant__ float filter_weights[256];

// Host code: copy data before kernel launch
cudaMemcpyToSymbol(filter_weights, host_weights, 256 * sizeof(float));

__global__ void convolution(float* input, float* output, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx + 255 < n) {  // guard the loop below, which reads input[idx .. idx+255]
        float sum = 0.0f;
        for (int k = 0; k < 256; k++) {
            sum += input[idx + k] * filter_weights[k];  // broadcast read
        }
        output[idx] = sum;
    }
}

⚠ If threads in a warp access different constant memory addresses, the accesses are serialized. Constant memory only helps when access is uniform across the warp.

Texture memory: spatial locality for free

Texture memory is backed by global memory but cached through a dedicated read-only texture cache. It is optimized for 2D spatial locality: if a thread reads pixel (x, y), nearby threads reading (x+1, y) or (x, y+1) will likely hit the cache.

cudaTextureObject_t tex;
// ... setup texture object with cudaCreateTextureObject() ...

__global__ void bilinear_sample(float* output, cudaTextureObject_t tex, int w, int h) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < w && y < h) {
        // hardware-accelerated interpolation
        output[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    }
}

Texture memory also provides free boundary handling (clamp, wrap) and hardware-interpolated reads. For modern compute workloads, shared memory is usually preferred. Texture memory still wins for image processing and any kernel with 2D access patterns.

L1 and L2 caches: hardware-managed layers

The L1 cache sits on each SM and caches global and local memory accesses automatically. You do not explicitly load data into L1; the hardware manages eviction and replacement. On Ampere, L1 and shared memory share a unified 192 KB pool. Increasing shared memory allocation decreases L1 capacity and vice versa.

The L2 cache is shared across all SMs on the device. On A100, it is 40 MB. On H100, it is 50 MB. Starting with Ampere, you can set L2 persistence policies to pin frequently accessed data in L2:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
size_t l2_size = prop.l2CacheSize;

// reserve 30% of L2 for persistent access
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, l2_size * 0.3);

L2 persistence is useful when a kernel repeatedly accesses a working set that fits in L2 but not in L1. Without persistence, other traffic can evict your hot data between kernel launches.
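Reserving the region is only half the setup: the data range itself is attached to a stream via an access policy window. A minimal sketch, assuming d_data, workingSetBytes, and stream already exist:

```cpp
// Mark d_data's byte range as persisting in L2 for kernels on this stream.
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = d_data;
attr.accessPolicyWindow.num_bytes = workingSetBytes;  // should fit the reserved region
attr.accessPolicyWindow.hitRatio  = 1.0f;             // fraction of accesses treated as persistent
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
```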

Global memory: large, slow, essential

Global memory is the HBM (High Bandwidth Memory) DRAM attached to the GPU. It is where all large data structures live, and it is the memory you manage explicitly from host code with cudaMalloc, cudaMemcpy, and cudaFree.

float* d_data;
cudaMalloc(&d_data, N * sizeof(float));                       // allocate
cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);  // copy
myKernel<<<grid, block>>>(d_data);                            // use
cudaFree(d_data);                                             // free

Global memory bandwidth on modern GPUs is enormous in absolute terms (A100: 2,039 GB/s, H100: 3,350 GB/s) but still 10x to 20x slower than shared memory bandwidth. The entire optimization strategy in CUDA boils down to: move data from global to on-chip memory, reuse it as many times as possible, then write results back.

Declaring each memory type: summary

// ---- Registers (automatic) ----
__global__ void kernel() {
    int tid = threadIdx.x;      // register
    float val = 0.0f;           // register
}

// ---- Local memory (compiler-managed spill) ----
__global__ void kernel() {
    float big[512];             // too large for registers -> local memory
}

// ---- Shared memory (static) ----
__global__ void kernel() {
    __shared__ float smem[256]; // visible to all threads in block
}

// ---- Shared memory (dynamic) ----
__global__ void kernel() {
    extern __shared__ float smem[];  // size set at launch
}

// ---- Constant memory ----
__constant__ float coeff[64];   // file scope, 64 KB max total

// ---- Global memory ----
// Host side:
float* d_ptr;
cudaMalloc(&d_ptr, size);

// ---- Texture memory ----
// Host side: create cudaTextureObject_t via cudaCreateTextureObject()

Latency comparison

Plotted on a log scale, the latencies above span four orders of magnitude: a register access at 1 cycle and a PCIe transfer at 10,000+ cycles cannot be compared on a linear axis. That range should guide every optimization decision: move data up the hierarchy, reduce traffic down it.

Worked examples

Example 1: Register pressure and occupancy

A kernel uses 64 registers per thread. The target GPU (Ampere, SM 8.0) has 65,536 registers per SM and supports a maximum of 2,048 threads per SM.

Step 1: Maximum threads from register budget.

65,536 registers / 64 registers per thread = 1,024 threads per SM

Step 2: Convert to warps.

1,024 threads / 32 threads per warp = 32 warps per SM

The hardware maximum is 64 warps (2,048 threads). Register pressure cuts occupancy to 50%.

Step 3: Total threads across the device (108 SMs on A100).

1,024 threads per SM * 108 SMs = 110,592 threads total

✓ This is still a large number of threads, but the reduced occupancy means less latency hiding. If the kernel is memory-bound, this matters. If the kernel is compute-bound with high ILP, 50% occupancy may be perfectly fine.

⚠ Use --maxrregcount=32 to halve register usage and double occupancy, but only if the spilled variables do not create excessive local memory traffic.
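The arithmetic above generalizes to any register count. A host-side sketch (helper names are ours):

```cpp
#include <cassert>

// Threads per SM permitted by the register file (Step 1).
constexpr int reg_limited_threads(int regs_per_sm, int regs_per_thread) {
    return regs_per_sm / regs_per_thread;
}

// Occupancy as a percentage of the hardware thread limit (Step 2).
constexpr int occupancy_pct(int active_threads, int max_threads) {
    return 100 * active_threads / max_threads;
}
```

With 64 registers per thread on an A100-class SM, this reproduces the numbers worked out above: 1,024 resident threads, 50% occupancy, 110,592 threads across 108 SMs.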

Example 2: Shared memory sizing

A kernel loads a 32 x 32 tile of float values into shared memory.

Step 1: Bytes per tile.

32 * 32 * 4 bytes = 4,096 bytes = 4 KB per block

Step 2: How many blocks fit in one SM's shared memory? Assume the SM is configured with 48 KB of shared memory (a common Ampere carve-out, leaving 144 KB for L1).

48 KB / 4 KB = 12 blocks per SM

But the hardware maximum is typically 32 blocks per SM (Ampere). So shared memory is not the bottleneck here.

Step 3: If the kernel uses two tiles (A and B for matrix multiplication):

2 * 4 KB = 8 KB per block. 48 KB / 8 KB = 6 blocks per SM. Still well within limits.

✓ Shared memory becomes the bottleneck when tiles are large. A 64 x 64 float tile takes 16 KB. Two tiles take 32 KB. With 48 KB available, only 1 block fits. That is 1,024 threads at most on an SM that can handle 2,048. You trade occupancy for data reuse.
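The same bookkeeping as a one-line host helper (the function name is ours; sizes are in KB):

```cpp
#include <algorithm>
#include <cassert>

// Resident blocks per SM: limited by shared memory or by the hardware block cap,
// whichever is hit first.
int blocks_per_sm(int smem_per_sm_kb, int smem_per_block_kb, int hw_max_blocks) {
    return std::min(smem_per_sm_kb / smem_per_block_kb, hw_max_blocks);
}
```

Plugging in the numbers above: 4 KB tiles give 12 blocks, 8 KB gives 6, and the two 16 KB tiles of the 64 x 64 case leave room for just 1 block per SM.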

Example 3: Global memory bandwidth utilization

An A100 GPU has 2,039 GB/s of HBM bandwidth. A kernel reads a 1 GB input array and writes a 1 GB output array. The kernel does 1 FLOP per element (element-wise add).

Step 1: Total data movement.

1 GB (read) + 1 GB (write) = 2 GB

Step 2: Time at peak bandwidth.

2 GB / 2,039 GB/s = 0.98 ms

Step 3: Total FLOPs.

1 GB / 4 bytes per float ≈ 256 million elements, so ≈ 256 million FLOPs at 1 FLOP each. The A100 can do 19.5 TFLOPS (FP32).

256 * 10^6 FLOPs / 19.5 * 10^12 FLOPS/s = 0.013 ms

Step 4: Arithmetic intensity.

256 * 10^6 FLOPs / 2 * 10^9 bytes = 0.125 FLOPs/byte

✓ The kernel is overwhelmingly memory-bound. It takes 0.98 ms to move the data and 0.013 ms to compute. No amount of shared memory or register optimization will help here. The only wins come from reducing data movement: fusing this kernel with the next operation, using half-precision, or restructuring the algorithm to increase arithmetic intensity.

⚠ This is a roofline model calculation. If your measured time is significantly higher than 0.98 ms, you have a coalescing or occupancy problem, not an algorithmic one.
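Plugging the article's rounded figures into a minimal roofline helper (function names are ours) confirms the imbalance:

```cpp
#include <cassert>
#include <cmath>

// Time (ms) to stream `bytes` at `bw` bytes/s: the memory roofline.
double mem_time_ms(double bytes, double bw) { return 1e3 * bytes / bw; }

// Time (ms) to execute `flops` at `peak` FLOP/s: the compute roofline.
double compute_time_ms(double flops, double peak) { return 1e3 * flops / peak; }

// FLOPs per byte moved; low values mean the kernel is memory-bound.
double arithmetic_intensity(double flops, double bytes) { return flops / bytes; }
```

For 2 GB of traffic at 2,039 GB/s and 256 million FLOPs at 19.5 TFLOPS, the memory time (~0.98 ms) dwarfs the compute time (~0.013 ms), at an arithmetic intensity near 0.13 FLOPs/byte.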

In practice

  1. Profile first. Use Nsight Compute to measure actual memory throughput before optimizing. The l1tex__data_pipe_lsu_wavefronts_mem_shared metric tells you shared memory utilization. lts__t_sectors_srcunit_tex shows L2 traffic. Do not guess which memory level is the bottleneck.

  2. Shared memory is your primary optimization lever. If a kernel reads the same global memory data multiple times across threads in a block, shared memory tiling almost always helps. The tiled matrix multiply pattern generalizes to convolutions, reductions, scans, and stencil computations.

  3. Watch register pressure. The occupancy calculator in Nsight Compute shows you the limiting factor (registers, shared memory, or block size). If registers are the bottleneck, consider reducing variables, using __launch_bounds__, or trading precision (float instead of double halves register usage for FP values on most architectures).

  4. Constant memory is free performance for uniform reads. If every thread reads the same coefficient array, move it to __constant__. The broadcast mechanism means one cache line serves the entire warp.

  5. L2 persistence is worth exploring for iterative algorithms. If your kernel launches repeatedly on the same working set (think iterative solvers, training loops), L2 persistence policies can prevent cold-cache misses on each launch.

  6. Texture memory is niche but powerful. For 2D interpolation, boundary handling, and image processing, the texture cache provides features that would cost you hundreds of lines of manual code.

What comes next

Knowing where data lives is necessary but not sufficient. How data is accessed matters just as much. If adjacent threads read addresses that are scattered across memory, the hardware cannot combine those requests into efficient transactions.

The next article, CUDA memory coalescing, covers coalesced vs. uncoalesced access patterns, the 128-byte transaction model, stride-1 access, structure of arrays vs. array of structures, and how a single line change can move your kernel from 10% to 90% of peak bandwidth.
