
Memory coalescing: the most important optimization you will learn

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article assumes you have read the following:

  • CUDA memory hierarchy for an understanding of registers, shared memory, L1/L2 caches, and global memory.

You should be comfortable writing simple CUDA kernels and understand that global memory is the slowest tier in the GPU memory hierarchy. Everything in this article builds on that foundation.

Why memory coalescing matters more than instruction count

Most CUDA kernels are memory-bound, not compute-bound. An NVIDIA A100 delivers 19.5 TFLOPS of FP32 compute but only 2,039 GB/s of memory bandwidth. A kernel that performs one floating-point operation per 4-byte float it loads can therefore sustain only about 510 GFLOPS from memory, roughly 40x less than the hardware can compute. The bottleneck is almost always the memory system.
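That 40x figure follows directly from the two spec numbers; a quick back-of-the-envelope check (spec values taken from the text above, not measured):

```python
peak_flops = 19.5e12   # A100 FP32, FLOP/s (spec value quoted above)
peak_bw = 2039e9       # A100 memory bandwidth, bytes/s

# A kernel doing 1 FLOP per 4-byte float is limited by how fast floats arrive:
data_limited_flops = (peak_bw / 4) * 1   # ~5.1e11 FLOP/s

ratio = peak_flops / data_limited_flops
print(f"compute outpaces memory by ~{ratio:.0f}x")  # ~38x
```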

Coalescing is the mechanism that determines whether your kernel uses 10% or 90% of that available bandwidth. Getting it wrong can degrade performance by 10x or more with zero change to your arithmetic. Getting it right is often the single largest optimization available.

What coalescing actually means

When a warp of 32 threads issues a load or store instruction, the hardware does not send 32 individual memory requests. Instead, it combines (coalesces) the requests into as few memory transactions as possible. Each transaction fetches an aligned 32-byte, 64-byte, or 128-byte segment from global memory, depending on the access size and caching behavior.

On modern NVIDIA GPUs (Volta and later), the L1 cache operates with 128-byte cache lines. A single transaction can fetch 128 bytes. If all 32 threads in a warp access 32 consecutive 4-byte floats starting at a 128-byte aligned address, the entire warp is served by exactly one 128-byte transaction.

That is the ideal case: one transaction, 128 bytes requested, 128 bytes transferred, 100% utilization.

graph LR
  subgraph Warp["Warp: 32 threads"]
      T0["T0: addr 0"]
      T1["T1: addr 4"]
      T2["T2: addr 8"]
      T31["T31: addr 124"]
  end
  subgraph Transaction1["Transaction 1 (128 bytes)"]
      Seg["Bytes 0 - 127"]
  end
  T0 --> Seg
  T1 --> Seg
  T2 --> Seg
  T31 --> Seg
  style Transaction1 fill:#2d6a2d,color:#fff

Coalesced access: 32 threads read 32 consecutive floats (128 bytes total). The hardware issues one 128-byte transaction. Every byte transferred is used. Bandwidth utilization is 100%.

Now consider what happens when threads access every other float (stride-2). Thread 0 reads address 0, thread 1 reads address 8, thread 2 reads address 16, and so on. The 32 threads now span 256 bytes of address space, requiring two 128-byte transactions. Half the bytes fetched are unused.

graph LR
  subgraph Warp2["Warp: 32 threads (stride-2)"]
      S0["T0: addr 0"]
      S1["T1: addr 8"]
      S2["T2: addr 16"]
      S31["T31: addr 248"]
  end
  subgraph Trans1["Transaction 1 (128 bytes)"]
      Seg1["Bytes 0 - 127"]
  end
  subgraph Trans2["Transaction 2 (128 bytes)"]
      Seg2["Bytes 128 - 255"]
  end
  S0 --> Seg1
  S1 --> Seg1
  S2 --> Seg1
  S31 --> Seg2
  style Trans1 fill:#8B6914,color:#fff
  style Trans2 fill:#8B6914,color:#fff

Strided access: 32 threads read every other float, spanning 256 bytes. Two 128-byte transactions are needed. Only 50% of transferred bytes are actually used.

The 128-byte transaction model

The memory controller works in units of aligned 128-byte segments. When a warp requests data, the hardware determines which 128-byte segments contain the requested addresses and issues one transaction per segment.

The key rules:

  1. Each transaction fetches exactly one aligned 128-byte segment.
  2. If a warp’s addresses fall within one segment, one transaction is issued.
  3. If addresses span N segments, N transactions are issued.
  4. Bytes within a fetched segment that no thread requested are wasted bandwidth.

The ratio of useful bytes to total bytes transferred is the bandwidth utilization. Coalesced access achieves close to 100%. Random access can drop below 5%.

Access patterns compared

Pattern                 Transactions per warp   Bandwidth utilization   Example                              Fix
Stride-1 (coalesced)    1                       100%                    a[threadIdx.x]                       Already optimal ✓
Stride-2                2                       50%                     a[2 * threadIdx.x]                   Restructure data layout
Stride-4                4                       25%                     Column access in row-major matrix    Transpose or use shared memory
Stride-32               32                      3.1%                    a[32 * threadIdx.x]                  SoA transformation
Random                  Up to 32                ~3-10%                  Hash table lookups                   Sorting, binning, or texture cache

Each additional transaction consumes memory bus bandwidth without delivering useful data to the warp. At stride-32, every single thread triggers its own 128-byte transaction, and 124 of those bytes are discarded. The kernel does 32x more memory work than the coalesced version for the same useful data.
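The transaction and utilization columns above can be reproduced with a few lines of Python that model the aligned-128-byte-segment rule (a simplified model of the hardware, ignoring sectoring and caching):

```python
def warp_transactions(addresses, segment=128):
    """Count how many aligned 128-byte segments a warp's byte addresses touch."""
    return len({addr // segment for addr in addresses})

def utilization(stride_floats, segment=128):
    """Transactions and bandwidth utilization for 32 threads reading 4-byte floats."""
    addrs = [i * stride_floats * 4 for i in range(32)]
    txns = warp_transactions(addrs, segment)
    useful = 32 * 4                  # bytes the warp actually requested
    transferred = txns * segment     # bytes the memory system moved
    return txns, useful / transferred

for stride in (1, 2, 4, 32):
    txns, util = utilization(stride)
    print(f"stride-{stride:2d}: {txns:2d} transactions, {util:.1%} utilization")
```

Running this prints 1, 2, 4, and 32 transactions at 100%, 50%, 25%, and 3.1% utilization, matching the table row for row.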

CUDA C++: naive vs coalesced matrix transpose

Matrix transpose is the textbook example of a coalescing problem. Reading a row-major matrix by rows is coalesced. Writing the transposed result by rows means writing by columns of the source, which is strided. The naive kernel has one coalesced access (read) and one uncoalesced access (write).

The fix uses shared memory as a staging buffer: read a tile with coalesced global loads into shared memory, synchronize, then write from shared memory with coalesced global stores.

// Naive transpose: coalesced reads, uncoalesced writes
__global__ void transposeNaive(const float* input, float* output, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        // Read is coalesced (consecutive threads read consecutive elements in a row)
        // Write is uncoalesced (consecutive threads write to consecutive rows = stride of width)
        output[col * height + row] = input[row * width + col];
    }
}

// Tiled transpose: coalesced reads AND coalesced writes via shared memory
#define TILE_DIM 32
#define BLOCK_ROWS 8

__global__ void transposeTiled(const float* input, float* output, int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1]; // +1 to avoid bank conflicts

    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced read from global memory into shared memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (xIndex < width && (yIndex + j) < height) {
            tile[threadIdx.y + j][threadIdx.x] = input[(yIndex + j) * width + xIndex];
        }
    }

    __syncthreads();

    // Transposed indices for the output tile
    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;

    // Coalesced write from shared memory to global memory
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (xIndex < height && (yIndex + j) < width) {
            output[(yIndex + j) * height + xIndex] = tile[threadIdx.x][threadIdx.y + j];
        }
    }
}

The tiled version reads the same data and writes the same result. The only difference is the order of memory accesses. On an A100 with a 4096x4096 matrix:

// Bandwidth measurement
int main() {
    const int WIDTH = 4096, HEIGHT = 4096;
    size_t bytes = WIDTH * HEIGHT * sizeof(float);

    float *d_input, *d_output;
    cudaMalloc(&d_input, bytes);
    cudaMalloc(&d_output, bytes);

    dim3 block(TILE_DIM, BLOCK_ROWS);
    dim3 grid((WIDTH + TILE_DIM - 1) / TILE_DIM, (HEIGHT + TILE_DIM - 1) / TILE_DIM);
    // The naive kernel maps one thread per element, so it needs a full
    // TILE_DIM x TILE_DIM block to cover the same grid as the tiled kernel
    dim3 blockNaive(TILE_DIM, TILE_DIM);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Benchmark naive
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++)
        transposeNaive<<<grid, blockNaive>>>(d_input, d_output, WIDTH, HEIGHT);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float naiveMs;
    cudaEventElapsedTime(&naiveMs, start, stop);
    float naiveBW = 2.0f * bytes * 100 / (naiveMs / 1000.0f) / 1e9;

    // Benchmark tiled
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++)
        transposeTiled<<<grid, block>>>(d_input, d_output, WIDTH, HEIGHT);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float tiledMs;
    cudaEventElapsedTime(&tiledMs, start, stop);
    float tiledBW = 2.0f * bytes * 100 / (tiledMs / 1000.0f) / 1e9;

    printf("Naive:  %.1f GB/s\n", naiveBW);
    printf("Tiled:  %.1f GB/s\n", tiledBW);
    // Typical results on A100:
    // Naive:  ~380 GB/s
    // Tiled:  ~1450 GB/s  (3.8x improvement)

    cudaFree(d_input);
    cudaFree(d_output);
    return 0;
}

The tiled version achieves roughly 3.5-4x the effective bandwidth of the naive version. The arithmetic is identical; the only change is the memory access pattern.

Python (CuPy): matrix transpose comparison

The same comparison in Python using CuPy, which lets you write raw CUDA kernels from Python:

import cupy as cp
import time

width, height = 4096, 4096

naive_kernel = cp.RawKernel(r'''
extern "C" __global__
void transposeNaive(const float* input, float* output, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        output[col * height + row] = input[row * width + col];
    }
}
''', 'transposeNaive')

tiled_kernel = cp.RawKernel(r'''
#define TILE_DIM 32
#define BLOCK_ROWS 8
extern "C" __global__
void transposeTiled(const float* input, float* output, int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (xIndex < width && (yIndex + j) < height)
            tile[threadIdx.y + j][threadIdx.x] = input[(yIndex + j) * width + xIndex];
    }
    __syncthreads();
    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (xIndex < height && (yIndex + j) < width)
            output[(yIndex + j) * height + xIndex] = tile[threadIdx.x][threadIdx.y + j];
    }
}
''', 'transposeTiled')

d_input = cp.random.randn(height, width, dtype=cp.float32)
d_naive_out = cp.empty((width, height), dtype=cp.float32)
d_tiled_out = cp.empty((width, height), dtype=cp.float32)

block = (32, 8)
block_naive = (32, 32)  # naive kernel maps one thread per element
grid = ((width + 31) // 32, (height + 31) // 32)
w, h = cp.int32(width), cp.int32(height)  # match the kernels' int parameters
bytes_transferred = 2 * width * height * 4  # read + write

# Warmup
for _ in range(10):
    naive_kernel(grid, block_naive, (d_input, d_naive_out, w, h))
    tiled_kernel(grid, block, (d_input, d_tiled_out, w, h))
cp.cuda.Stream.null.synchronize()

# Benchmark naive
start = cp.cuda.Event()
end = cp.cuda.Event()
start.record()
for _ in range(100):
    naive_kernel(grid, block_naive, (d_input, d_naive_out, w, h))
end.record()
end.synchronize()
naive_ms = cp.cuda.get_elapsed_time(start, end)
naive_bw = bytes_transferred * 100 / (naive_ms / 1000) / 1e9

# Benchmark tiled
start.record()
for _ in range(100):
    tiled_kernel(grid, block, (d_input, d_tiled_out, w, h))
end.record()
end.synchronize()
tiled_ms = cp.cuda.get_elapsed_time(start, end)
tiled_bw = bytes_transferred * 100 / (tiled_ms / 1000) / 1e9

print(f"Naive:  {naive_bw:.1f} GB/s")
print(f"Tiled:  {tiled_bw:.1f} GB/s")
# Typical results: Naive ~380 GB/s, Tiled ~1450 GB/s

Bandwidth vs stride: measured results

Consider a simple kernel that reads one float per thread at various stride values, with effective bandwidth measured against the theoretical peak of an A100 (2,039 GB/s).

Stride-1 achieves ~87% of theoretical peak. Each doubling of stride halves effective bandwidth because the number of transactions doubles. Random access drops below 60 GB/s, roughly 3% of peak. The memory hardware is identical in every case. Only the access pattern changes.

The numbers follow a predictable pattern. Stride-N requires N times as many transactions as stride-1 for the same useful data. The memory bus has a fixed transaction rate, so N times the transactions means roughly 1/N the effective bandwidth.
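The 1/N relationship can be sketched in a few lines. The stride-1 figure below is derived from the ~87% of peak quoted above, not a measurement:

```python
peak_bw = 2039.0              # GB/s, A100 theoretical peak
stride1_bw = 0.87 * peak_bw   # ~87% of peak at stride-1, as quoted above

# Stride-N needs N transactions per warp-load, so predicted bandwidth is 1/N
for stride in (1, 2, 4, 8, 16, 32):
    print(f"stride-{stride:2d}: ~{stride1_bw / stride:6.0f} GB/s")
```

The stride-32 prediction lands around 55 GB/s, consistent with the "below 60 GB/s" figure for worst-case access.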

Example 1: counting cache line transactions

Problem: A warp of 32 threads each loads one float (4 bytes). Consider two cases:

Case A: addresses 0, 4, 8, 12, …, 124 (stride-1 floats)

The 32 floats span bytes 0 through 127. That is exactly one 128-byte aligned segment.

  • Transactions needed: 1
  • Bytes transferred: 128
  • Bytes used: 128 (32 x 4)
  • Utilization: 100%

Case B: addresses 0, 128, 256, 384, …, 3968 (stride-32 floats)

Each thread accesses a float in a different 128-byte segment. Thread 0 hits segment [0, 127]. Thread 1 hits segment [128, 255]. Thread 31 hits segment [3968, 4095].

  • Transactions needed: 32
  • Bytes transferred: 4096 (32 x 128)
  • Bytes used: 128 (32 x 4)
  • Utilization: 3.1%

Case B transfers 32x more data than Case A for the same 128 bytes of useful information. The kernel runs at 3% of peak bandwidth even though every thread is doing useful work. The problem is purely the access pattern.

Example 2: AoS to SoA transformation

Problem: You have 1024 particles, each with position components x, y, z.

Array of Structures (AoS) layout:

struct Particle {
    float x, y, z;
};
Particle particles[1024];  // Memory: x0,y0,z0, x1,y1,z1, x2,y2,z2, ...

A kernel where thread i reads particles[i].x:

__global__ void processAoS(Particle* particles, float* output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        output[i] = particles[i].x * 2.0f;  // Stride-3 access!
    }
}

Thread 0 reads address 0 (x0). Thread 1 reads address 12 (x1). Thread 2 reads address 24 (x2). Consecutive threads are 12 bytes apart (3 floats x 4 bytes). This is stride-3 in float terms. A warp of 32 threads spans 32 x 12 = 384 bytes, requiring 3 transactions (segments [0,127], [128,255], [256,383]). Only 128 bytes of the 384 transferred are used. Utilization: 33%.

Structure of Arrays (SoA) layout:

struct ParticlesSoA {
    float x[1024];  // Memory: x0,x1,x2,...,x1023
    float y[1024];  // Memory: y0,y1,y2,...,y1023
    float z[1024];  // Memory: z0,z1,z2,...,z1023
};
ParticlesSoA particles;

Now thread i reads particles.x[i]:

__global__ void processSoA(ParticlesSoA particles, float* output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        output[i] = particles.x[i] * 2.0f;  // Stride-1 access!
    }
}

Thread 0 reads address 0, thread 1 reads address 4, thread 2 reads address 8. Consecutive threads access consecutive memory locations. One transaction serves the entire warp. Utilization: 100% ✓.

The SoA version achieves roughly 3x the bandwidth of the AoS version for this kernel, with no change to the computation. In real particle simulations, this transformation alone often delivers 2-5x speedup.
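The difference is visible directly from the byte addresses a warp touches under each layout. A sketch of the segment arithmetic (not a GPU measurement):

```python
SEGMENT = 128  # bytes per transaction

def segments(addresses):
    return {a // SEGMENT for a in addresses}

# AoS: particle i's x field sits at byte offset i * sizeof(Particle) = i * 12
aos_addrs = [i * 12 for i in range(32)]
# SoA: x[i] sits at byte offset i * 4
soa_addrs = [i * 4 for i in range(32)]

print("AoS transactions:", len(segments(aos_addrs)))   # 3 -> 33% utilization
print("SoA transactions:", len(segments(soa_addrs)))   # 1 -> 100% utilization
```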

Example 3: row vs column access in a 2D array

Problem: You have a 1024x1024 float matrix stored in row-major order. Two kernels access it differently.

Kernel A: one thread per row (uncoalesced)

__global__ void sumRows(const float* matrix, float* rowSums, int width) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int col = 0; col < width; col++) {
        sum += matrix[row * width + col];
    }
    rowSums[row] = sum;
}

In this kernel, different threads access different rows. Within each warp, thread 0 accesses row r, thread 1 accesses row r+1, and so on. The inner loop iterates over columns, but at any given iteration all threads in a warp access the same column in consecutive rows. The address pattern is (r+0)*width + col, (r+1)*width + col, … which has stride width (1024). This is badly uncoalesced.

Kernel B: redesigned for coalescing

__global__ void sumRowsCoalesced(const float* matrix, float* rowSums, int width) {
    int row = blockIdx.x;
    float sum = 0.0f;
    for (int col = threadIdx.x; col < width; col += blockDim.x) {
        sum += matrix[row * width + col];  // Threads in a warp read consecutive columns
    }
    // Warp-level reduction (simplified); lane 0 of each warp ends up with
    // that warp's partial sum
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }
    // Each warp's lane 0 contributes its partial sum
    // (rowSums must be zero-initialized before launch)
    if ((threadIdx.x & (warpSize - 1)) == 0) {
        atomicAdd(&rowSums[row], sum);
    }
}

Now threads within a warp access consecutive columns in the same row: row*width + col, row*width + col+1, row*width + col+2, … This is stride-1 and perfectly coalesced. The hardware issues one transaction per warp per iteration instead of 32.

The fix required rethinking which dimension maps to threads. Mapping threads to columns (stride-1 in memory) instead of rows (stride-width in memory) is the standard pattern for row-major 2D arrays.
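The two mappings can be compared with the same segment arithmetic used earlier. A sketch (byte addresses for one warp-load, 128-byte segments assumed):

```python
WIDTH = 1024   # floats per row, row-major
SEGMENT = 128  # bytes per transaction

def warp_segments(addresses):
    return len({a // SEGMENT for a in addresses})

# Kernel A: thread t owns row t; at a given loop iteration `col`, the warp
# reads matrix[(r + t) * WIDTH + col] -- a stride of WIDTH floats
col = 0
kernel_a = [((0 + t) * WIDTH + col) * 4 for t in range(32)]

# Kernel B: the warp shares one row; thread t reads column col0 + t -- stride-1
row, col0 = 0, 0
kernel_b = [(row * WIDTH + col0 + t) * 4 for t in range(32)]

print("Kernel A transactions per warp-load:", warp_segments(kernel_a))  # 32
print("Kernel B transactions per warp-load:", warp_segments(kernel_b))  # 1
```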

In practice

Profile before optimizing. Use ncu (Nsight Compute) to measure actual memory throughput and transaction counts. The metric l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum tells you how many 32-byte sectors were loaded. Compare it to the theoretical minimum for your data size.

Default to SoA for GPU data. AoS is natural for CPU code where one thread processes one struct sequentially. On the GPU, thousands of threads process the same field across many structs. SoA is almost always the better layout.

Use shared memory to fix uncoalesceable patterns. When the algorithm inherently requires non-unit-stride access (like matrix transpose), load data with coalesced reads into shared memory, then redistribute within the block using shared memory’s low latency and high bandwidth.

Watch out for alignment. Even stride-1 access loses efficiency if the base address is not aligned to a 128-byte boundary. Use cudaMalloc (which returns 256-byte aligned pointers) and avoid pointer arithmetic that breaks alignment.
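The alignment cost is easy to quantify with the segment model: a stride-1 warp whose base address is off by even one float straddles two segments instead of one.

```python
SEGMENT = 128  # bytes per transaction

def warp_segments(base, n=32, elem=4):
    """Segments touched by n threads reading consecutive elem-byte values."""
    return len({(base + i * elem) // SEGMENT for i in range(n)})

print(warp_segments(0))   # aligned base: 1 transaction
print(warp_segments(4))   # base offset by one float: 2 transactions
```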

Struct padding can silently destroy coalescing. A struct with {float x; double y;} is padded to 16 bytes. What looks like stride-1 struct access is actually stride-4 float access. Check struct sizes with sizeof() and consider explicit __align__ directives.
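You can verify the padding from the host side. With Python's ctypes, the mixed float/double struct above comes out at 16 bytes (this mirrors C layout rules under default alignment on typical 64-bit platforms):

```python
import ctypes

class Mixed(ctypes.Structure):
    _fields_ = [("x", ctypes.c_float),   # 4 bytes
                ("y", ctypes.c_double)]  # 8 bytes, 8-byte aligned

# 4 (x) + 4 (padding) + 8 (y) = 16 bytes per element
print(ctypes.sizeof(Mixed))   # 16
print(Mixed.y.offset)         # 8: padding was inserted before y
```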

Common pitfall: Indirect or gather access patterns (reading from index arrays like a[index[i]]) produce effectively random access even if the index array itself is accessed with stride-1. Profile these carefully.
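The gather problem shows up clearly in the segment model too: with indices scattered across a large array, nearly every thread in a warp lands in its own segment (hypothetical sizes, for illustration):

```python
import random

SEGMENT = 128
random.seed(0)

def warp_segments(addresses):
    return len({a // SEGMENT for a in addresses})

# a[index[i]]: the index array itself is read stride-1 (coalesced), but the
# gathered addresses are scattered across a 4 MB array of floats
N = 1 << 20                                    # 1M floats
index = [random.randrange(N) for _ in range(32)]
gather_addrs = [idx * 4 for idx in index]

print("segments touched by one warp:", warp_segments(gather_addrs))  # near 32
```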

What comes next

Coalescing tells you how to lay out data for efficient global memory access. But when the algorithm cannot be restructured for stride-1 access, shared memory provides an escape hatch. In the next article, Shared memory and tiling, we dive deep into bank conflicts, tiling strategies, and how to use the 48-164 KB of on-chip SRAM to reshape access patterns that would otherwise cripple your bandwidth.
