
Performance case studies and optimization patterns

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article assumes you have read the earlier parts of this series. You should be comfortable reading Nsight Compute reports, understand the roofline model, and know how to distinguish memory-bound from compute-bound kernels. All examples target CUDA Toolkit 12+ and compute capability 8.0+ (A100/Ampere) unless stated otherwise.

The optimization methodology

Most performance work fails because engineers start optimizing before they know where the time goes. The correct methodology has exactly three steps: profile, identify the bottleneck category, then apply the right optimization for that category.

graph TD
  A["Run Nsight Systems
timeline view"] --> B{"Where does
time go?"}
  B -->|"Kernel execution"| C["Run Nsight Compute
on hot kernel"]
  B -->|"Transfers / sync"| D["Overlap with
streams + async"]
  B -->|"CPU code"| E["Fix CPU side
before touching GPU"]
  C --> F{"Roofline
position?"}
  F -->|"Below memory roof"| G["Memory-bound
optimize bandwidth"]
  F -->|"Below compute roof"| H["Compute-bound
optimize arithmetic"]
  F -->|"Neither roof"| I["Latency-bound
increase parallelism"]
  G --> J["Coalescing, caching
shared memory tiling"]
  H --> K["Tensor cores, FMA
reduce instructions"]
  I --> L["Kernel fusion
graphs, larger blocks"]

  style A fill:#2d6a4f,stroke:#1b4332,color:#fff
  style B fill:#40916c,stroke:#2d6a4f,color:#fff
  style C fill:#52b788,stroke:#40916c,color:#fff
  style D fill:#74c69d,stroke:#52b788,color:#000
  style E fill:#95d5b2,stroke:#74c69d,color:#000
  style F fill:#f4a261,stroke:#e76f51,color:#000
  style G fill:#e76f51,stroke:#d62828,color:#fff
  style H fill:#d62828,stroke:#9b2226,color:#fff
  style I fill:#9b2226,stroke:#6a040f,color:#fff
  style J fill:#264653,stroke:#2a9d8f,color:#fff
  style K fill:#264653,stroke:#2a9d8f,color:#fff
  style L fill:#264653,stroke:#2a9d8f,color:#fff

This decision tree is not optional. Skipping the profiling step means you are guessing. Engineers who guess spend weeks optimizing memory access patterns in compute-bound kernels, or unrolling loops in kernels that are bottlenecked on PCIe transfers. The profiler tells you which category you are in. Trust it.

The profile-first rule

Run Nsight Systems first. It gives you the timeline: kernel launches, memory transfers, API calls, CPU execution, and gaps. The timeline tells you where the wall-clock time actually goes. Only after you know which kernel dominates should you open Nsight Compute for that specific kernel.

# Step 1: system-level timeline
nsys profile --stats=true ./my_application

# Step 2: deep-dive on the hottest kernel
ncu --set full --kernel-name my_hot_kernel ./my_application

Nsight Compute gives you the roofline position. That single chart tells you whether the kernel is memory-bound, compute-bound, or latency-bound. Every optimization you apply should move the dot on the roofline toward one of the two ceilings.

Case study 1: Memory-bound kernel optimization

The problem

A 2D stencil kernel processes a 4096x4096 grid of float values. Each output element reads 5 input elements (center plus 4 neighbors). The naive implementation achieves 180 GB/s on an A100 with a theoretical peak of 2039 GB/s HBM bandwidth. Nsight Compute shows the kernel is firmly below the memory roofline.

Before: naive implementation

__global__ void stencil_naive(const float* in, float* out, int N) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= N - 1 || y < 1 || y >= N - 1) return;

    out[y * N + x] = 0.2f * (in[y * N + x]
                            + in[(y - 1) * N + x]
                            + in[(y + 1) * N + x]
                            + in[y * N + (x - 1)]
                            + in[y * N + (x + 1)]);
}
// Launch: dim3 block(16, 16); dim3 grid((N+15)/16, (N+15)/16);

The profiler reveals three problems:

  1. Partial coalescing from the 16-wide block. Row accesses such as in[(y-1)*N + x] are contiguous, but with a 16x16 block each warp spans two 16-float rows, so it touches two 128-byte cache lines and uses only half of each. A 32x8 or 32x4 block would give full 32-wide coalescing.
  2. Redundant global memory reads. Each element is read by 5 neighboring threads. With no shared memory, the same data is loaded from HBM 5 times.
  3. Low L2 cache hit rate. The working set (4096x4096 floats = 64 MB) exceeds L2 capacity (40 MB on A100), so rows evict before reuse.

Optimization steps

Step 1: Fix block dimensions. Change to 32x8 so warps read 32 consecutive floats. This alone improves coalescing efficiency from 50% to 100%.

Step 2: Shared memory tiling. Load a tile with halo into shared memory. Each thread loads one element; halo threads load boundary elements. The 5-point stencil reuses data from shared memory instead of global memory.

__global__ void stencil_tiled(const float* in, float* out, int N) {
    const int TILE_X = 32, TILE_Y = 8;
    const int HALO = 1;
    __shared__ float tile[TILE_Y + 2 * HALO][TILE_X + 2 * HALO];

    int gx = blockIdx.x * TILE_X + threadIdx.x;
    int gy = blockIdx.y * TILE_Y + threadIdx.y;
    int lx = threadIdx.x + HALO;
    int ly = threadIdx.y + HALO;

    // Load center
    if (gx < N && gy < N)
        tile[ly][lx] = in[gy * N + gx];

    // Load halos (simplified; real code handles all 4 borders + corners)
    if (threadIdx.y == 0 && gy > 0)
        tile[0][lx] = in[(gy - 1) * N + gx];
    if (threadIdx.y == TILE_Y - 1 && gy + 1 < N)
        tile[ly + 1][lx] = in[(gy + 1) * N + gx];
    if (threadIdx.x == 0 && gx > 0)
        tile[ly][0] = in[gy * N + (gx - 1)];
    if (threadIdx.x == TILE_X - 1 && gx + 1 < N)
        tile[ly][lx + 1] = in[gy * N + (gx + 1)];

    __syncthreads();

    if (gx >= 1 && gx < N - 1 && gy >= 1 && gy < N - 1)
        out[gy * N + gx] = 0.2f * (tile[ly][lx]
                                   + tile[ly - 1][lx]
                                   + tile[ly + 1][lx]
                                   + tile[ly][lx - 1]
                                   + tile[ly][lx + 1]);
}

Step 3: Read-only cache hints. Use __ldg() or declare the input pointer as const float* __restrict__ to route loads through the read-only texture cache, reducing L2 pressure.

Results

graph LR
  A["Naive
180 GB/s"] -->|"+111%"| B["Coalesced blocks
380 GB/s"]
  B -->|"+147%"| C["Shared memory
940 GB/s"]
  C -->|"+33%"| D["ldg + restrict
1250 GB/s"]

  style A fill:#e76f51,stroke:#d62828,color:#fff
  style B fill:#f4a261,stroke:#e76f51,color:#000
  style C fill:#52b788,stroke:#40916c,color:#fff
  style D fill:#2d6a4f,stroke:#1b4332,color:#fff

The kernel went from 9% to 61% of peak HBM bandwidth. The remaining gap is expected: the stencil’s arithmetic intensity is low, so the kernel will never reach the compute roof. At 61% of memory bandwidth, the optimization is within the range where further gains require architectural changes (larger tiles, temporal blocking).

Case study 2: Compute-bound with tensor cores

The problem

A batched matrix multiply runs 256 independent 128x128 GEMM operations (FP16 accumulating to FP32). The existing implementation uses standard CUDA cores with shared memory tiling, achieving 12 TFLOPS on an A100 with a theoretical peak of 312 TFLOPS (tensor core) or 19.5 TFLOPS (FP32 CUDA core). The kernel is compute-bound per the roofline, but nowhere near the compute ceiling.

Before: CUDA core GEMM

The existing kernel uses a classic tiled approach with 16x16 tiles, achieving ~61% of the CUDA core FP32 peak. To go further, you must switch to tensor cores. Standard CUDA cores cannot close the 16x gap to tensor core throughput.

Optimization: WMMA (Warp Matrix Multiply-Accumulate)

Tensor cores operate on matrix fragments, not individual elements. The nvcuda::wmma API provides warp-level operations on 16x16x16 tiles:

#include <mma.h>
using namespace nvcuda;

__global__ void gemm_wmma(half* A, half* B, float* C,
                          int M, int N, int K) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);

    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int warpN = blockIdx.y;

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + warpN * 16, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, c_frag, N,
                            wmma::mem_row_major);
}

This is the minimal version. Production code adds shared memory staging, double buffering of fragments, and software pipelining of loads. But even this simple version changes the bottleneck category.

Results

graph LR
  A["CUDA cores
12 TFLOPS"] -->|"WMMA basic"| B["Tensor cores
85 TFLOPS"]
  B -->|"+ shared mem staging"| C["110 TFLOPS"]
  C -->|"+ double buffering"| D["165 TFLOPS"]

  style A fill:#e76f51,stroke:#d62828,color:#fff
  style B fill:#f4a261,stroke:#e76f51,color:#000
  style C fill:#52b788,stroke:#40916c,color:#fff
  style D fill:#2d6a4f,stroke:#1b4332,color:#fff

The WMMA kernel immediately delivers 7x the CUDA core throughput. With shared memory staging and double buffering, it reaches 53% of tensor core peak. The remaining gap comes from pipeline bubbles and shared memory bank conflicts, which require the mma.sync PTX-level API to address (used by cuBLAS internally).

Key lesson: When Nsight Compute shows you are compute-bound and the workload maps to matrix operations, tensor cores are not optional. The 16x throughput advantage over standard CUDA cores means no amount of instruction-level optimization can compensate for not using them.

Case study 3: Latency-bound with many small launches

The problem

A signal processing pipeline runs 500 tiny kernels per frame, each processing 256 to 1024 elements. Total kernel execution time is 0.8 ms per frame, but wall-clock time is 15 ms. The Nsight Systems timeline shows that 94% of the time is gaps between kernel launches.

Each kernel launch has fixed overhead: the CPU queues work, the driver processes the command, the GPU scheduler picks it up. On Ampere, this overhead is roughly 5 to 15 microseconds per launch. With 500 launches, that adds up to 2.5 to 7.5 ms of pure launch overhead, plus synchronization gaps.

Optimization: CUDA Graphs

CUDA Graphs capture a sequence of operations (kernels, copies, synchronization) into a graph structure. The driver optimizes the entire sequence at capture time, then replays it with a single launch:

cudaGraph_t graph;
cudaGraphExec_t graphExec;

// Capture the 500-kernel sequence
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

for (int i = 0; i < 500; i++) {
    small_kernel<<<1, 256, 0, stream>>>(data[i], size[i]);
}

cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// Replay with single launch overhead
for (int frame = 0; frame < num_frames; frame++) {
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);
}

cudaGraphExecDestroy(graphExec);
cudaGraphDestroy(graph);

Additional optimization: kernel fusion

Where data dependencies allow, fuse multiple tiny kernels into one. If kernels A, B, and C operate on independent data, a single kernel with thread blocks assigned to different workloads eliminates two launches:

__global__ void fused_abc(float* dataA, float* dataB, float* dataC,
                          int sizeA, int sizeB, int sizeC) {
    // Block counts for each sub-workload, derived from the element counts
    int gridDimA = (sizeA + blockDim.x - 1) / blockDim.x;
    int gridDimB = (sizeB + blockDim.x - 1) / blockDim.x;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (blockIdx.x < gridDimA) {
        // Work for kernel A
        if (tid < sizeA) processA(dataA, tid);
    } else if (blockIdx.x < gridDimA + gridDimB) {
        // Work for kernel B
        int local_tid = tid - gridDimA * blockDim.x;
        if (local_tid < sizeB) processB(dataB, local_tid);
    } else {
        // Work for kernel C
        int local_tid = tid - (gridDimA + gridDimB) * blockDim.x;
        if (local_tid < sizeC) processC(dataC, local_tid);
    }
}

Results

CUDA Graphs reduced frame time from 15 ms to 2.1 ms (7.1x improvement). Kernel fusion on the 50 most fusible groups further reduced it to 1.4 ms (total 10.7x). The GPU utilization went from 5% to 57%.

Performance improvement waterfall

The following chart shows the cumulative effect of optimizations from all three case studies, applied to a composite workload where the stencil, GEMM, and signal pipeline all contribute to total frame time.

The baseline frame time was 52.0 ms. After all optimizations, total frame time dropped to 6.5 ms, an 8x improvement. The largest single gain came from CUDA Graphs (12.9 ms saved), followed by tensor core adoption (12.0 ms). This is typical: architectural changes (switching execution models, using specialized hardware) deliver larger gains than incremental tuning.

Optimization checklist

| Optimization | Applies when | Expected gain | Implementation cost | Risk |
| --- | --- | --- | --- | --- |
| Fix memory coalescing | Profiler shows low global load efficiency | 1.5x to 4x bandwidth | Low (block dim change) | ✓ Safe |
| Shared memory tiling | Repeated access to same data region | 2x to 5x for stencils/reductions | Medium (halo logic, sync) | ⚠ Bank conflicts |
| Read-only cache hints | Input data is not modified by kernel | 10% to 30% bandwidth boost | Low (__ldg or __restrict__) | ✓ Safe |
| Tensor core WMMA | Matrix multiply or convolution workload | 5x to 16x over CUDA cores | High (data layout, fragment API) | ⚠ FP16 precision |
| CUDA Graphs | Many small kernel launches per frame | 3x to 10x for launch-bound workloads | Medium (capture/replay logic) | ⚠ Fixed topology |
| Kernel fusion | Independent small kernels on same stream | 1.5x to 3x launch overhead reduction | Medium (combined kernel logic) | ⚠ Code complexity |
| __launch_bounds__ | Register spilling detected in profiler | 5% to 20% from reduced spilling | Low (directive on kernel) | ⚠ May reduce occupancy |
| Async memcpy + streams | Transfers overlap with computation | Up to 2x if transfer-bound | Medium (stream management) | ✓ Safe |
| Warp shuffle reductions | Small reductions within a warp | 2x to 4x over shared memory reduction | Low (intrinsic calls) | ✓ Safe |
| Loop unrolling (#pragma unroll) | Tight inner loops with known trip count | 5% to 15% instruction throughput | Low (pragma) | ✗ Register pressure |

Worked examples

Example A: Optimization ROI calculation

A kernel takes 10 ms per invocation and runs 1000 times per day.

  • Optimization A: 2 engineer-days, expected 3x speedup (10 ms to 3.33 ms)
  • Optimization B: 0.5 engineer-days, expected 1.5x speedup (10 ms to 6.67 ms)

Time saved per day:

Optimization A saves (10 - 3.33) * 1000 = 6,670 ms = 6.67 seconds per day. Optimization B saves (10 - 6.67) * 1000 = 3,330 ms = 3.33 seconds per day.

ROI (time saved per engineer-day invested):

Optimization A: 6.67s / 2 days = 3.33 seconds saved per engineer-day. Optimization B: 3.33s / 0.5 days = 6.67 seconds saved per engineer-day.

Analysis: Optimization B has 2x better ROI despite lower absolute speedup. In practice, you would implement B first because it delivers value faster with less risk. Then decide if A is still worth the investment. This calculation changes if the kernel runs 100,000 times per day instead of 1,000, or if the application is latency-sensitive (where absolute reduction matters more than ROI).

The general principle: always compute ROI before starting optimization work. A 10x speedup on a kernel that runs once per hour is worth less than a 1.2x speedup on a kernel that runs millions of times per second.

Example B: Benchmark variance and outlier handling

You run a kernel 20 times and record these execution times (ms):

[5.1, 5.2, 4.9, 5.0, 5.3, 5.0, 4.8, 5.1, 5.2, 5.0,
 12.3, 5.1, 5.0, 4.9, 5.2, 5.1, 5.0, 5.2, 4.9, 5.1]

Statistics with all 20 samples:

  • Mean: 5.42 ms
  • Median: 5.1 ms
  • Standard deviation: 1.63 ms

Statistics without the 12.3 ms outlier:

  • Mean: 5.06 ms
  • Median: 5.1 ms
  • Standard deviation: 0.13 ms

The 12.3 ms sample inflates the mean by 7% and the standard deviation by 12.5x. Should you include it?

The correct approach:

  1. Do not silently discard it. The outlier is real data. Something caused it: OS interrupt, thermal throttling, GPU clock scaling, a competing process, or PCIe contention.
  2. Report the median and interquartile range (IQR) as your primary metric. The median (5.1 ms) is resistant to outliers. The IQR (4.95 ms to 5.2 ms) shows the true spread of typical runs.
  3. Report the outlier separately. State: “Median 5.1 ms (IQR 4.95 to 5.2), with 1 outlier at 12.3 ms (likely OS interrupt).”
  4. Use a formal outlier test if needed. The modified Z-score method: any sample more than 3.5 median absolute deviations (MAD) from the median is flagged. MAD = median(|xi - median|) = 0.1. Modified Z-score for 12.3 = 0.6745 * (12.3 - 5.1) / 0.1 = 48.6. This far exceeds 3.5, confirming it is a statistical outlier.

Benchmarking best practices:

  • Warm up the GPU. Run the kernel 3 to 5 times before recording. The first launch incurs context setup, JIT compilation of PTX, and memory allocation overhead.
  • Use CUDA events, not CPU timers. cudaEventRecord and cudaEventElapsedTime measure GPU time directly, excluding CPU-side queueing delays.
  • Lock GPU clocks. Use nvidia-smi -lgc <freq> to fix the GPU clock frequency. Boost clocks introduce variance because the GPU changes frequency based on thermal state.
  • Run enough samples. 20 to 50 runs is a reasonable minimum. Compute the coefficient of variation (std / mean). If it exceeds 5%, investigate the source of variance before trusting the results.
  • Avoid CPU-side bottlenecks. If the CPU cannot enqueue kernels fast enough, you are benchmarking driver overhead, not kernel performance. Use CUDA events within a stream and let the GPU pipeline fill.

// Correct GPU benchmarking pattern
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Warmup
for (int i = 0; i < 5; i++)
    my_kernel<<<grid, block>>>(args);
cudaDeviceSynchronize();

// Timed runs
float times[50];
for (int i = 0; i < 50; i++) {
    cudaEventRecord(start);
    my_kernel<<<grid, block>>>(args);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&times[i], start, stop);
}

// Report median, not mean
std::sort(times, times + 50);
float median = 0.5f * (times[24] + times[25]);  // even sample count: average the middle two
float p25 = times[12];
float p75 = times[37];
printf("Median: %.2f ms (IQR: %.2f - %.2f)\n", median, p25, p75);

cudaEventDestroy(start);
cudaEventDestroy(stop);

Common optimization patterns

Pattern 1: Bandwidth amplification

If your kernel reads the same data multiple times from global memory, you are wasting bandwidth. The fix is always the same hierarchy: registers (fastest, per-thread) then shared memory (fast, per-block) then L1/L2 (automatic, but eviction-prone) then global (slowest).

Ask this question for every global memory read: “Does another thread in this block need the same value?” If yes, load it into shared memory once.

Pattern 2: Arithmetic intensity boosting

Memory-bound kernels stay memory-bound unless you increase the ratio of computation to memory access. Techniques:

  • Loop tiling with larger tiles. Process more output elements per loaded tile.
  • Temporal blocking. Apply multiple stencil iterations within a single kernel, reusing data in shared memory across time steps.
  • Fused operations. Instead of kernel A writing to global memory and kernel B reading it back, fuse A and B so intermediate results stay in registers.

Pattern 3: Launch overhead amortization

Every kernel launch costs 5 to 15 microseconds. When kernel execution time is comparable to launch overhead, you lose significant GPU utilization. Solutions, in order of complexity:

  1. Increase work per kernel. Process more elements per launch.
  2. CUDA Graphs. Capture and replay entire sequences.
  3. Persistent kernels. Launch once, loop inside the kernel, communicate via global memory flags.

When to stop optimizing

Stop when any of these conditions is true:

  • You hit a hardware ceiling. The kernel is at 80%+ of the relevant roofline bound (memory bandwidth or compute throughput). Further gains require a different algorithm or different hardware.
  • The kernel is no longer the bottleneck. After optimizing kernel X, kernel Y might now dominate the timeline. Optimize Y instead.
  • The remaining gap is smaller than run-to-run variance. If your benchmark shows 3% coefficient of variation and the theoretical remaining gain is 5%, you cannot reliably measure further improvements.
  • ROI turns negative. If the next optimization takes 5 engineer-days for a 2% gain on a kernel that contributes 1% of total runtime, that time is better spent elsewhere.

The profiler always has the final word. After every optimization, re-profile. The bottleneck category can change: a memory-bound kernel that gets tiling might become compute-bound. A compute-bound kernel that gets tensor cores might become memory-bound (because tensor cores consume data faster than HBM can deliver it). Each shift requires a different optimization strategy.

In practice

Profile first, always. The single biggest waste of optimization time is working on the wrong kernel or the wrong bottleneck. Five minutes with Nsight Systems saves days of misguided work.

Optimization order matters. Apply the cheapest, highest-impact changes first. Fix coalescing before adding shared memory tiling. Add tiling before switching to tensor cores. Use CUDA Graphs before writing persistent kernels. Each step changes the bottleneck, and later steps may become unnecessary.

Measure everything with locked clocks and CUDA events. CPU timers, std::chrono, and unlocked GPU clocks all introduce measurement error that can exceed the size of your optimization. If you cannot measure it correctly, you cannot optimize it correctly.

Do not optimize code that does not matter. A kernel consuming 0.5% of total runtime is not worth touching regardless of how suboptimal it looks. Focus on the top 3 kernels by wall-clock contribution.

Tensor cores are transformative for matrix workloads. If your application involves GEMM, convolution, or any operation expressible as matrix multiply, check whether you are using tensor cores. The gap between CUDA cores and tensor cores on modern GPUs is large enough that it often dominates all other optimization considerations.

What comes next

The final article in this series, Where to go from here: CUDA ecosystem and next steps, covers what comes after these fundamentals: multi-GPU programming with NCCL, production deployment patterns, the CUDA library ecosystem (Thrust, CUB, cuBLAS, cuDNN), and resources for continued learning.
