Performance case studies and optimization patterns
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read:
- Occupancy, register pressure, and performance tuning, for understanding resource limiters, __launch_bounds__, and profiler-driven tuning.
- Debugging and profiling CUDA programs, for Nsight Systems/Compute workflows, the roofline model, and how to interpret profiler metrics.
You should be comfortable reading Nsight Compute reports, understand the roofline model, and know how to distinguish memory-bound from compute-bound kernels. All examples target CUDA Toolkit 12+ and compute capability 8.0+ (A100/Ampere) unless stated otherwise.
The optimization methodology
Most performance work fails because engineers start optimizing before they know where the time goes. The correct methodology has exactly three steps: profile, identify the bottleneck category, then apply the right optimization for that category.
graph TD
A["Run Nsight Systems
timeline view"] --> B{"Where does
time go?"}
B -->|"Kernel execution"| C["Run Nsight Compute
on hot kernel"]
B -->|"Transfers / sync"| D["Overlap with
streams + async"]
B -->|"CPU code"| E["Fix CPU side
before touching GPU"]
C --> F{"Roofline
position?"}
F -->|"Below memory roof"| G["Memory-bound
optimize bandwidth"]
F -->|"Below compute roof"| H["Compute-bound
optimize arithmetic"]
F -->|"Neither roof"| I["Latency-bound
increase parallelism"]
G --> J["Coalescing, caching
shared memory tiling"]
H --> K["Tensor cores, FMA
reduce instructions"]
I --> L["Kernel fusion
graphs, larger blocks"]
style A fill:#2d6a4f,stroke:#1b4332,color:#fff
style B fill:#40916c,stroke:#2d6a4f,color:#fff
style C fill:#52b788,stroke:#40916c,color:#fff
style D fill:#74c69d,stroke:#52b788,color:#000
style E fill:#95d5b2,stroke:#74c69d,color:#000
style F fill:#f4a261,stroke:#e76f51,color:#000
style G fill:#e76f51,stroke:#d62828,color:#fff
style H fill:#d62828,stroke:#9b2226,color:#fff
style I fill:#9b2226,stroke:#6a040f,color:#fff
style J fill:#264653,stroke:#2a9d8f,color:#fff
style K fill:#264653,stroke:#2a9d8f,color:#fff
style L fill:#264653,stroke:#2a9d8f,color:#fff
This decision tree is not optional. Skipping the profiling step means you are guessing. Engineers who guess spend weeks optimizing memory access patterns in compute-bound kernels, or unrolling loops in kernels that are bottlenecked on PCIe transfers. The profiler tells you which category you are in. Trust it.
The profile-first rule
Run Nsight Systems first. It gives you the timeline: kernel launches, memory transfers, API calls, CPU execution, and gaps. The timeline tells you where the wall-clock time actually goes. Only after you know which kernel dominates should you open Nsight Compute for that specific kernel.
# Step 1: system-level timeline
nsys profile --stats=true ./my_application
# Step 2: deep-dive on the hottest kernel
ncu --set full --kernel-name my_hot_kernel ./my_application
Nsight Compute gives you the roofline position. That single chart tells you whether the kernel is memory-bound, compute-bound, or latency-bound. Every optimization you apply should move the dot on the roofline toward one of the two ceilings.
Case study 1: Memory-bound kernel optimization
The problem
A 2D stencil kernel processes a 4096x4096 grid of float values. Each output element reads 5 input elements (center plus 4 neighbors). The naive implementation achieves 180 GB/s on an A100 with a theoretical peak of 2039 GB/s HBM bandwidth. Nsight Compute shows the kernel is firmly below the memory roofline.
Before: naive implementation
__global__ void stencil_naive(const float* in, float* out, int N) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < 1 || x >= N - 1 || y < 1 || y >= N - 1) return;
out[y * N + x] = 0.2f * (in[y * N + x]
+ in[(y - 1) * N + x]
+ in[(y + 1) * N + x]
+ in[y * N + (x - 1)]
+ in[y * N + (x + 1)]);
}
// Launch: dim3 block(16, 16); dim3 grid((N+15)/16, (N+15)/16);
The profiler reveals three problems:
- Poor coalescing on vertical neighbors. Reads to in[(y-1)*N + x] and in[(y+1)*N + x] are coalesced within a row, but the 16x16 block means each warp spans two rows, so only 16 threads per warp access contiguous memory. A 32x8 or 32x4 block would give full 32-wide coalescing.
- Redundant global memory reads. Each element is read by 5 neighboring threads. With no shared memory, the same data is loaded from HBM 5 times.
- Low L2 cache hit rate. The working set (4096x4096 floats = 64 MB) exceeds L2 capacity (40 MB on A100), so rows evict before reuse.
Optimization steps
Step 1: Fix block dimensions. Change to 32x8 so warps read 32 consecutive floats. This alone improves coalescing efficiency from 50% to 100%.
Step 2: Shared memory tiling. Load a tile with halo into shared memory. Each thread loads one element; halo threads load boundary elements. The 5-point stencil reuses data from shared memory instead of global memory.
__global__ void stencil_tiled(const float* in, float* out, int N) {
const int TILE_X = 32, TILE_Y = 8;
const int HALO = 1;
__shared__ float tile[TILE_Y + 2 * HALO][TILE_X + 2 * HALO];
int gx = blockIdx.x * TILE_X + threadIdx.x;
int gy = blockIdx.y * TILE_Y + threadIdx.y;
int lx = threadIdx.x + HALO;
int ly = threadIdx.y + HALO;
// Load center
if (gx < N && gy < N)
tile[ly][lx] = in[gy * N + gx];
// Load halos (simplified; real code handles all 4 borders + corners)
if (threadIdx.y == 0 && gy > 0)
tile[0][lx] = in[(gy - 1) * N + gx];
if (threadIdx.y == TILE_Y - 1 && gy + 1 < N)
tile[ly + 1][lx] = in[(gy + 1) * N + gx];
if (threadIdx.x == 0 && gx > 0)
tile[ly][0] = in[gy * N + (gx - 1)];
if (threadIdx.x == TILE_X - 1 && gx + 1 < N)
tile[ly][lx + 1] = in[gy * N + (gx + 1)];
__syncthreads();
if (gx >= 1 && gx < N - 1 && gy >= 1 && gy < N - 1)
out[gy * N + gx] = 0.2f * (tile[ly][lx]
+ tile[ly - 1][lx]
+ tile[ly + 1][lx]
+ tile[ly][lx - 1]
+ tile[ly][lx + 1]);
}
Step 3: Read-only cache hints. Use __ldg() or declare the input pointer as const float* __restrict__ to route loads through the read-only texture cache, reducing L2 pressure.
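Applied to the tiled kernel above, this is a signature-only change (a sketch; the kernel body is unchanged):

```cuda
// const + __restrict__ tells the compiler the input is read-only and
// non-aliased, allowing loads to route through the read-only data cache.
// __ldg() forces the same path explicitly for an individual load.
__global__ void stencil_tiled(const float* __restrict__ in,
                              float* __restrict__ out, int N) {
    // ... same body as above; optionally replace in[...] with __ldg(&in[...])
}
```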
Results
graph LR
A["Naive 180 GB/s"] -->|"+111%"| B["Coalesced blocks 380 GB/s"]
B -->|"+147%"| C["Shared memory 940 GB/s"]
C -->|"+33%"| D["ldg + restrict 1250 GB/s"]
style A fill:#e76f51,stroke:#d62828,color:#fff
style B fill:#f4a261,stroke:#e76f51,color:#000
style C fill:#52b788,stroke:#40916c,color:#fff
style D fill:#2d6a4f,stroke:#1b4332,color:#fff
The kernel went from 9% to 61% of peak HBM bandwidth. The remaining gap is expected: the stencil’s arithmetic intensity is low, so the kernel will never reach the compute roof. At 61% of memory bandwidth, the optimization is within the range where further gains require architectural changes (larger tiles, temporal blocking).
Case study 2: Compute-bound with tensor cores
The problem
A batched matrix multiply runs 256 independent 128x128 GEMM operations (FP16 accumulating to FP32). The existing implementation uses standard CUDA cores with shared memory tiling, achieving 12 TFLOPS on an A100 with a theoretical peak of 312 TFLOPS (tensor core) or 19.5 TFLOPS (FP32 CUDA core). The kernel is compute-bound per the roofline, but nowhere near the compute ceiling.
Before: CUDA core GEMM
The existing kernel uses a classic tiled approach with 16x16 tiles, achieving ~61% of the CUDA core FP32 peak. To go further, you must switch to tensor cores. Standard CUDA cores cannot close the 16x gap to tensor core throughput.
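For reference, the shape of that existing kernel (a condensed sketch of the classic tiled pattern, not the production code; assumes M, N, K are multiples of 16, which holds for the 128x128 batches here):

```cuda
#define TILE 16
__global__ void gemm_tiled(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < K; t += TILE) {
        // Stage one 16x16 tile of each operand in shared memory
        As[threadIdx.y][threadIdx.x] = A[row * K + t + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
// Launch: dim3 block(16, 16); dim3 grid(N / 16, M / 16);
```

However it is tuned, this kernel issues one FMA per thread per step on scalar data, which is exactly the throughput ceiling tensor cores break through.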
Optimization: WMMA (Warp Matrix Multiply-Accumulate)
Tensor cores operate on matrix fragments, not individual elements. The nvcuda::wmma API provides warp-level operations on 16x16x16 tiles:
#include <mma.h>
using namespace nvcuda;
__global__ void gemm_wmma(half* A, half* B, float* C,
int M, int N, int K) {
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
wmma::fill_fragment(c_frag, 0.0f);
int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
int warpN = blockIdx.y;
for (int k = 0; k < K; k += 16) {
wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
wmma::load_matrix_sync(b_frag, B + k * N + warpN * 16, N);
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
}
wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, c_frag, N,
wmma::mem_row_major);
}
This is the minimal version. Production code adds shared memory staging, double buffering of fragments, and software pipelining of loads. But even this simple version changes the bottleneck category.
Results
graph LR
A["CUDA cores 12 TFLOPS"] -->|"WMMA basic"| B["Tensor cores 85 TFLOPS"]
B -->|"+ shared mem staging"| C["110 TFLOPS"]
C -->|"+ double buffering"| D["165 TFLOPS"]
style A fill:#e76f51,stroke:#d62828,color:#fff
style B fill:#f4a261,stroke:#e76f51,color:#000
style C fill:#52b788,stroke:#40916c,color:#fff
style D fill:#2d6a4f,stroke:#1b4332,color:#fff
The WMMA kernel immediately delivers 7x the CUDA core throughput. With shared memory staging and double buffering, it reaches 53% of tensor core peak. The remaining gap comes from pipeline bubbles and shared memory bank conflicts, and closing it requires dropping to the mma.sync PTX-level API (which cuBLAS uses internally).
Key lesson: When Nsight Compute shows you are compute-bound and the workload maps to matrix operations, tensor cores are not optional. The 16x throughput advantage over standard CUDA cores means no amount of instruction-level optimization can compensate for not using them.
Case study 3: Latency-bound with many small launches
The problem
A signal processing pipeline runs 500 tiny kernels per frame, each processing 256 to 1024 elements. Total kernel execution time is 0.8 ms per frame, but wall-clock time is 15 ms. The Nsight Systems timeline shows that 94% of the time is gaps between kernel launches.
Each kernel launch has fixed overhead: the CPU queues work, the driver processes the command, the GPU scheduler picks it up. On Ampere, this overhead is roughly 5 to 15 microseconds per launch. With 500 launches, that adds up to 2.5 to 7.5 ms of pure launch overhead, plus synchronization gaps.
Optimization: CUDA Graphs
CUDA Graphs capture a sequence of operations (kernels, copies, synchronization) into a graph structure. The driver optimizes the entire sequence at capture time, then replays it with a single launch:
cudaGraph_t graph;
cudaGraphExec_t graphExec;
// Capture the 500-kernel sequence
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
for (int i = 0; i < 500; i++) {
small_kernel<<<1, 256, 0, stream>>>(data[i], size[i]);
}
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, 0);  // CUDA 12 signature: flags only
// Replay with single launch overhead
for (int frame = 0; frame < num_frames; frame++) {
cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
}
cudaGraphExecDestroy(graphExec);
cudaGraphDestroy(graph);
Additional optimization: kernel fusion
Where data dependencies allow, fuse multiple tiny kernels into one. If kernels A, B, and C operate on independent data, a single kernel with thread blocks assigned to different workloads eliminates two launches:
__global__ void fused_abc(float* dataA, float* dataB, float* dataC,
int sizeA, int sizeB, int sizeC,
int gridDimA, int gridDimB) { // block counts for A and B, passed from the host
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (blockIdx.x < gridDimA) {
// Work for kernel A
if (tid < sizeA) processA(dataA, tid);
} else if (blockIdx.x < gridDimA + gridDimB) {
// Work for kernel B
int local_tid = tid - gridDimA * blockDim.x;
if (local_tid < sizeB) processB(dataB, local_tid);
} else {
// Work for kernel C
int local_tid = tid - (gridDimA + gridDimB) * blockDim.x;
if (local_tid < sizeC) processC(dataC, local_tid);
}
}
Results
CUDA Graphs reduced frame time from 15 ms to 2.1 ms (7.1x improvement). Kernel fusion on the 50 most fusible groups further reduced it to 1.4 ms (total 10.7x). The GPU utilization went from 5% to 57%.
Performance improvement waterfall
Consider the cumulative effect of optimizations from all three case studies, applied to a composite workload where the stencil, GEMM, and signal pipeline all contribute to total frame time.
The baseline frame time was 52.0 ms. After all optimizations, total frame time dropped to 6.5 ms, an 8x improvement. The largest single gain came from CUDA Graphs (12.9 ms saved), followed by tensor core adoption (12.0 ms). This is typical: architectural changes (switching execution models, using specialized hardware) deliver larger gains than incremental tuning.
Optimization checklist
| Optimization | Applies when | Expected gain | Implementation cost | Risk |
|---|---|---|---|---|
| Fix memory coalescing | Profiler shows low global load efficiency | 1.5x to 4x bandwidth | Low (block dim change) | ✓ Safe |
| Shared memory tiling | Repeated access to same data region | 2x to 5x for stencils/reductions | Medium (halo logic, sync) | ⚠ Bank conflicts |
| Read-only cache hints | Input data is not modified by kernel | 10% to 30% bandwidth boost | Low (__ldg or __restrict__) | ✓ Safe |
| Tensor core WMMA | Matrix multiply or convolution workload | 5x to 16x over CUDA cores | High (data layout, fragment API) | ⚠ FP16 precision |
| CUDA Graphs | Many small kernel launches per frame | 3x to 10x for launch-bound workloads | Medium (capture/replay logic) | ⚠ Fixed topology |
| Kernel fusion | Independent small kernels on same stream | 1.5x to 3x launch overhead reduction | Medium (combined kernel logic) | ⚠ Code complexity |
| __launch_bounds__ | Register spilling detected in profiler | 5% to 20% from reduced spilling | Low (directive on kernel) | ⚠ May reduce occupancy |
| Async memcpy + streams | Transfers overlap with computation | Up to 2x if transfer-bound | Medium (stream management) | ✓ Safe |
| Warp shuffle reductions | Small reductions within a warp | 2x to 4x over shared memory reduction | Low (intrinsic calls) | ✓ Safe |
| Loop unrolling (#pragma unroll) | Tight inner loops with known trip count | 5% to 15% instruction throughput | Low (pragma) | ✗ Register pressure |
Worked examples
Example A: Optimization ROI calculation
A kernel takes 10 ms per invocation and runs 1000 times per day.
- Optimization A: 2 engineer-days, expected 3x speedup (10 ms to 3.33 ms)
- Optimization B: 0.5 engineer-days, expected 1.5x speedup (10 ms to 6.67 ms)
Time saved per day:
Optimization A saves (10 - 3.33) * 1000 = 6,670 ms = 6.67 seconds per day. Optimization B saves (10 - 6.67) * 1000 = 3,330 ms = 3.33 seconds per day.
ROI (time saved per engineer-day invested):
Optimization A: 6.67s / 2 days = 3.33 seconds saved per engineer-day. Optimization B: 3.33s / 0.5 days = 6.67 seconds saved per engineer-day.
Analysis: Optimization B has 2x better ROI despite lower absolute speedup. In practice, you would implement B first because it delivers value faster with less risk. Then decide if A is still worth the investment. This calculation changes if the kernel runs 100,000 times per day instead of 1,000, or if the application is latency-sensitive (where absolute reduction matters more than ROI).
The general principle: always compute ROI before starting optimization work. A 10x speedup on a kernel that runs once per hour is worth less than a 1.2x speedup on a kernel that runs millions of times per second.
Example B: Benchmark variance and outlier handling
You run a kernel 20 times and record these execution times (ms):
[5.1, 5.2, 4.9, 5.0, 5.3, 5.0, 4.8, 5.1, 5.2, 5.0,
12.3, 5.1, 5.0, 4.9, 5.2, 5.1, 5.0, 5.2, 4.9, 5.1]
Statistics with all 20 samples:
- Mean: 5.42 ms
- Median: 5.1 ms
- Standard deviation: 1.63 ms
Statistics without the 12.3 ms outlier:
- Mean: 5.06 ms
- Median: 5.1 ms
- Standard deviation: 0.13 ms
The 12.3 ms sample inflates the mean by 7% and the standard deviation by 12.5x. Should you include it?
The correct approach:
- Do not silently discard it. The outlier is real data. Something caused it: OS interrupt, thermal throttling, GPU clock scaling, a competing process, or PCIe contention.
- Report the median and interquartile range (IQR) as your primary metric. The median (5.1 ms) is resistant to outliers. The IQR (4.95 ms to 5.2 ms) shows the true spread of typical runs.
- Report the outlier separately. State: “Median 5.1 ms (IQR 4.95 to 5.2), with 1 outlier at 12.3 ms (likely OS interrupt).”
- Use a formal outlier test if needed. The modified Z-score method: any sample more than 3.5 median absolute deviations (MAD) from the median is flagged. MAD = median(|xi - median|) = 0.1. Modified Z-score for 12.3 = 0.6745 * (12.3 - 5.1) / 0.1 = 48.6. This far exceeds 3.5, confirming it is a statistical outlier.
Benchmarking best practices:
- Warm up the GPU. Run the kernel 3 to 5 times before recording. The first launch incurs context setup, JIT compilation of PTX, and memory allocation overhead.
- Use CUDA events, not CPU timers. cudaEventRecord and cudaEventElapsedTime measure GPU time directly, excluding CPU-side queueing delays.
- Lock GPU clocks. Use nvidia-smi -lgc <freq> to fix the GPU clock frequency. Boost clocks introduce variance because the GPU changes frequency based on thermal state.
- Run enough samples. 20 to 50 runs is a reasonable minimum. Compute the coefficient of variation (std / mean). If it exceeds 5%, investigate the source of variance before trusting the results.
- Avoid CPU-side bottlenecks. If the CPU cannot enqueue kernels fast enough, you are benchmarking driver overhead, not kernel performance. Use CUDA events within a stream and let the GPU pipeline fill.
// Correct GPU benchmarking pattern
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
// Warmup
for (int i = 0; i < 5; i++)
my_kernel<<<grid, block>>>(args);
cudaDeviceSynchronize();
// Timed runs
float times[50];
for (int i = 0; i < 50; i++) {
cudaEventRecord(start);
my_kernel<<<grid, block>>>(args);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&times[i], start, stop);
}
// Report median, not mean
std::sort(times, times + 50);
float median = 0.5f * (times[24] + times[25]);
float p25 = times[12];
float p75 = times[37];
printf("Median: %.2f ms (IQR: %.2f - %.2f)\n", median, p25, p75);
Common optimization patterns
Pattern 1: Bandwidth amplification
If your kernel reads the same data multiple times from global memory, you are wasting bandwidth. The fix is always the same hierarchy: registers (fastest, per-thread) then shared memory (fast, per-block) then L1/L2 (automatic, but eviction-prone) then global (slowest).
Ask this question for every global memory read: “Does another thread in this block need the same value?” If yes, load it into shared memory once.
Pattern 2: Arithmetic intensity boosting
Memory-bound kernels stay memory-bound unless you increase the ratio of computation to memory access. Techniques:
- Loop tiling with larger tiles. Process more output elements per loaded tile.
- Temporal blocking. Apply multiple stencil iterations within a single kernel, reusing data in shared memory across time steps.
- Fused operations. Instead of kernel A writing to global memory and kernel B reading it back, fuse A and B so intermediate results stay in registers.
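The fusion idea from the last bullet, sketched for two hypothetical elementwise operations (the names and the workload are illustrative):

```cuda
// Unfused: a scale kernel writes an intermediate array to HBM, then a bias
// kernel reads it back. Fused: the intermediate never leaves a register.
__global__ void scale_bias_fused(const float* __restrict__ in,
                                 float* __restrict__ out,
                                 float scale, float bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float tmp = in[i] * scale;  // intermediate stays in a register
        out[i] = tmp + bias;        // no round trip through global memory
    }
}
```

Fusion halves the global memory traffic here (one load and one store instead of two of each), which directly raises arithmetic intensity.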
Pattern 3: Launch overhead amortization
Every kernel launch costs 5 to 15 microseconds. When kernel execution time is comparable to launch overhead, you lose significant GPU utilization. Solutions, in order of complexity:
- Increase work per kernel. Process more elements per launch.
- CUDA Graphs. Capture and replay entire sequences.
- Persistent kernels. Launch once, loop inside the kernel, communicate via global memory flags.
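A minimal persistent-kernel skeleton, to show the shape of the third option. This sketch uses a crude volatile spin on a mapped host flag and a placeholder workload; production code should use libcu++ atomics with proper memory ordering.

```cuda
// Launch once with a single block: persistent_worker<<<1, 256>>>(flag, data, n);
// flag protocol (host writes via mapped/zero-copy memory):
//   0 = idle, 1 = work available, -1 = shut down
__global__ void persistent_worker(volatile int* flag, float* data, int n) {
    while (true) {
        if (threadIdx.x == 0) {
            while (*flag == 0) { /* spin until host posts work or shutdown */ }
        }
        __syncthreads();
        if (*flag == -1) return;
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            data[i] *= 2.0f;             // placeholder workload
        __syncthreads();
        if (threadIdx.x == 0) *flag = 0; // signal completion to the host
        __syncthreads();
    }
}
```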
When to stop optimizing
Stop when any of these conditions is true:
- You hit a hardware ceiling. The kernel is at 80%+ of the relevant roofline bound (memory bandwidth or compute throughput). Further gains require a different algorithm or different hardware.
- The kernel is no longer the bottleneck. After optimizing kernel X, kernel Y might now dominate the timeline. Optimize Y instead.
- The remaining gap is smaller than run-to-run variance. If your benchmark shows 3% coefficient of variation and the theoretical remaining gain is 5%, you cannot reliably measure further improvements.
- ROI turns negative. If the next optimization takes 5 engineer-days for a 2% gain on a kernel that contributes 1% of total runtime, that time is better spent elsewhere.
The profiler always has the final word. After every optimization, re-profile. The bottleneck category can change: a memory-bound kernel that gets tiling might become compute-bound. A compute-bound kernel that gets tensor cores might become memory-bound (because tensor cores consume data faster than HBM can deliver it). Each shift requires a different optimization strategy.
In practice
Profile first, always. The single biggest waste of optimization time is working on the wrong kernel or the wrong bottleneck. Five minutes with Nsight Systems saves days of misguided work.
Optimization order matters. Apply the cheapest, highest-impact changes first. Fix coalescing before adding shared memory tiling. Add tiling before switching to tensor cores. Use CUDA Graphs before writing persistent kernels. Each step changes the bottleneck, and later steps may become unnecessary.
Measure everything with locked clocks and CUDA events. CPU timers, std::chrono, and unlocked GPU clocks all introduce measurement error that can exceed the size of your optimization. If you cannot measure it correctly, you cannot optimize it correctly.
Do not optimize code that does not matter. A kernel consuming 0.5% of total runtime is not worth touching regardless of how suboptimal it looks. Focus on the top 3 kernels by wall-clock contribution.
Tensor cores are transformative for matrix workloads. If your application involves GEMM, convolution, or any operation expressible as matrix multiply, check whether you are using tensor cores. The gap between CUDA cores and tensor cores on modern GPUs is large enough that it often dominates all other optimization considerations.
What comes next
The final article in this series, Where to go from here: CUDA ecosystem and next steps, covers what to pursue after finishing these fundamentals: advanced topics like multi-GPU programming with NCCL, production deployment patterns, the CUDA ecosystem (Thrust, CUB, cuBLAS, cuDNN), and resources for continued learning.