
Debugging and profiling CUDA programs

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article builds on shared memory concepts from the previous post. You should be comfortable writing kernels that use __shared__ memory, understand thread/block indexing, and know the basics of warp execution. Everything here targets CUDA Toolkit 12+ and Nsight tools.

The debugging and profiling workflow

Most CUDA bugs are silent. The kernel launches, returns no error, and writes garbage to your output buffer. The GPU does not segfault in the way a CPU program does. You need a systematic workflow: compile with debug flags, run with sanitizers, check every API return code, then profile to find the actual bottleneck.

graph LR
  A["Compile
nvcc -G -lineinfo"] --> B["Run with
compute-sanitizer"]
  B --> C["Check API
return codes"]
  C --> D["Profile with
Nsight Systems"]
  D --> E["Deep-dive with
Nsight Compute"]
  E --> F["Optimize kernel"]
  F --> B

  style A fill:#2d6a4f,stroke:#1b4332,color:#fff
  style B fill:#40916c,stroke:#2d6a4f,color:#fff
  style C fill:#52b788,stroke:#40916c,color:#fff
  style D fill:#f4a261,stroke:#e76f51,color:#000
  style E fill:#e76f51,stroke:#d62828,color:#fff
  style F fill:#d62828,stroke:#9b2226,color:#fff

This loop is the core of GPU development. You will cycle through it dozens of times per kernel.

Common CUDA bugs

Wrong thread indexing

The single most common bug. You compute a global thread index, but the formula does not match your launch configuration. The kernel silently reads and writes out-of-bounds memory.

// BUG: forgot blockDim.x in the index calculation
int idx = blockIdx.x + threadIdx.x;  // wrong
int idx = blockIdx.x * blockDim.x + threadIdx.x;  // correct

A one-character typo (+ vs *) produces results that look plausible for small inputs and fail catastrophically on large ones.

Off-by-one and bounds checking

Even with correct indexing, you must guard against threads that exceed the data size. A grid of 256-thread blocks processing 1000 elements launches 4 blocks (1024 threads). The last 24 threads will read past the end of your array.

__global__ void scale(float *data, int n, float factor) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;  // essential bounds check
    data[idx] *= factor;
}

Race conditions in shared memory

Two threads in the same block write to the same shared memory location without a __syncthreads() barrier. The result depends on warp scheduling and changes between runs, making this bug extremely hard to reproduce.
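
Here is a minimal sketch of the pattern: a hypothetical kernel that reverses each block's 256-element tile in place. It assumes the data size is a multiple of the block size, so every thread reaches the barrier:

// Each block reverses its tile through shared memory; launch with 256-thread blocks
__global__ void reverseTiles(float *data) {
    __shared__ float tile[256];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[idx];
    __syncthreads();  // remove this and the read below races with the writes above
    data[idx] = tile[blockDim.x - 1 - threadIdx.x];
}

compute-sanitizer's racecheck tool, covered below, detects exactly this class of bug.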

Using device pointers on the host

Dereferencing a device pointer on the CPU does not crash immediately on all systems. On some driver versions it returns zero; on others it triggers a segfault. Either way, the compiler cannot catch it.

float *d_data;
cudaMalloc(&d_data, N * sizeof(float));
// BUG: cannot dereference d_data on the host
printf("first element: %f\n", d_data[0]);  // undefined behavior

The CUDA error checking macro

Every CUDA runtime call returns a cudaError_t. Ignoring it is the GPU equivalent of ignoring errno. The standard practice is a macro that checks the return value and prints the file, line, and human-readable error string.

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err = (call);                                        \
        if (err != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                 \
                    __FILE__, __LINE__, cudaGetErrorString(err));        \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Usage: wrap every CUDA call
CUDA_CHECK(cudaMalloc(&d_data, N * sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice));

// For kernel launches (async), check after synchronization
myKernel<<<grid, block>>>(d_data, N);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());

The do { ... } while (0) wrapper ensures the macro behaves as a single statement in if/else blocks. The cudaGetLastError() call after a kernel launch catches configuration errors (e.g., requesting too much shared memory). The cudaDeviceSynchronize() call catches runtime errors inside the kernel.
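
To see why the wrapper matters, consider the macro inside an if/else (the use_gpu flag here is hypothetical). With bare braces instead of do/while, the semicolon after the macro invocation would terminate the if statement and leave the else dangling:

// Expands to exactly one statement, so the else still binds to the if
if (use_gpu)
    CUDA_CHECK(cudaMalloc(&d_data, N * sizeof(float)));
else
    fprintf(stderr, "falling back to host allocation\n");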

Kernel printf for quick debugging

CUDA supports printf inside device code since compute capability 2.0. The output is buffered and flushed to the host on the next synchronization point.

__global__ void debugKernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    // Print only from a few threads to avoid flooding the console
    if (idx < 4) {
        printf("Thread %d: block=%d, local=%d, value=%.4f\n",
               idx, blockIdx.x, threadIdx.x, data[idx]);
    }
}

printf from thousands of threads will produce megabytes of interleaved output. Always guard it with a condition. The default print buffer is 1 MB; increase it with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size) if you need more.
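
For example, to raise the buffer to an assumed 8 MB (the limit must be set before the first kernel launch that uses printf):

// Enlarge the device-side printf FIFO from the default 1 MB to 8 MB
CUDA_CHECK(cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 8 * 1024 * 1024));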

compute-sanitizer: catching memory errors

compute-sanitizer (formerly cuda-memcheck) is the GPU equivalent of Valgrind. It instruments your kernel at runtime and detects out-of-bounds accesses, race conditions, and uninitialized reads.

Here is a kernel with an intentional out-of-bounds write:

#include <cstdio>

#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err = (call);                                        \
        if (err != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                 \
                    __FILE__, __LINE__, cudaGetErrorString(err));        \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

__global__ void buggyKernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // BUG: no bounds check, threads beyond n write out of bounds
    data[idx] = 1.0f;
}

int main() {
    int n = 1000;
    float *d_data;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));

    // Launch 1024 threads for 1000 elements: 24 threads write OOB
    buggyKernel<<<4, 256>>>(d_data, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}

Compile with device debug info and run under the sanitizer (-G embeds full debug symbols and disables device optimizations; for optimized builds use -lineinfo instead, rather than combining the two flags):

nvcc -G -o buggy buggy.cu
compute-sanitizer --tool memcheck ./buggy

Output (abbreviated):

========= Invalid __global__ write of size 4 bytes
=========     at 0x00000148 in buggyKernel(float*, int)
=========     by thread (232,0,0) in block (3,0,0)
=========     Address 0x7f2c00001f40 is out of bounds
=========
========= ERROR SUMMARY: 24 errors

The sanitizer pinpoints the exact thread, block, and instruction. The 24 errors correspond to threads 1000 through 1023.

Other useful sanitizer modes:

  • --tool racecheck detects shared memory race conditions
  • --tool initcheck finds reads of uninitialized device memory
  • --tool synccheck validates __syncthreads() usage across divergent warps

Nsight Systems: the timeline view

Nsight Systems (nsys) gives you a system-wide timeline of CPU calls, CUDA API invocations, kernel launches, and memory transfers. It answers the question: “Where is time being spent?”

nsys profile --stats=true -o report ./my_program
nsys stats report.nsys-rep

What to look for in the timeline:

  • Gaps between kernels: indicates CPU-side overhead or unnecessary synchronization
  • Long cudaMemcpy calls: data transfer dominates kernel execution
  • Serialized kernel launches: kernels that could overlap on different streams are running sequentially
  • Small kernels with high launch overhead: the kernel runs for 2 microseconds but the launch takes 5

Nsight Systems is your first profiling step. It tells you which kernel to focus on before you dive deeper.
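
One habit that pays off here: wrap the phases of your host code in NVTX ranges, which appear as named bars on the nsys timeline and make gaps easy to attribute. A minimal sketch using the header-only NVTX 3 bundled with the CUDA toolkit (the processBatch function and phase names are illustrative):

#include <nvtx3/nvToolsExt.h>

// Hypothetical host-side wrapper: each push/pop pair becomes a labeled
// range on the Nsight Systems timeline
void processBatch(float *d_data, const float *h_data, size_t bytes) {
    nvtxRangePushA("upload");
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    nvtxRangePop();

    nvtxRangePushA("compute");
    // ... kernel launches ...
    nvtxRangePop();
}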

Nsight Compute: kernel-level metrics

Nsight Compute (ncu) profiles a single kernel invocation in extreme detail. It collects hardware counters, computes derived metrics, and compares your kernel against theoretical peak performance.

ncu --set full -o kernel_report ./my_program
ncu -i kernel_report.ncu-rep

Key profiler metrics

Metric | What it measures | Ideal value | When it is low | What to do
--- | --- | --- | --- | ---
Achieved occupancy | Fraction of max warps active per cycle | >50% for most kernels | Threads are blocked on barriers or resources | Reduce register/shared memory per thread, adjust block size
SM throughput | Fraction of time SMs are doing useful work | Close to 100% | Too few blocks to fill the GPU, or high tail effect | Increase parallelism, check grid dimensions
Memory throughput (DRAM) | Fraction of peak memory bandwidth used | Depends on kernel type | Poor access patterns, low arithmetic intensity | Coalesce accesses, use shared memory tiling
Warp stall: memory dependency | Cycles warps wait for memory | Low (relative to compute) | Too many global memory accesses per instruction | Increase data reuse via shared memory or registers
Warp stall: barrier | Cycles warps wait at __syncthreads() | Low | Unbalanced work across threads in a block | Redistribute work, reduce barrier frequency
L1/L2 cache hit rate | Fraction of memory requests served by cache | High for data-reuse kernels | Strided or random access patterns | Restructure access to exploit spatial locality
FP32/FP64 utilization | Fraction of peak FLOP/s achieved | High for compute-bound kernels | Memory bound or low ILP | Increase arithmetic intensity, use ILP techniques
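
The full metric set is slow to collect, so for quick iteration you can request only the metrics you need. The names below are the speed-of-light throughput metrics in current Nsight Compute versions, but exact names vary between releases; list what your install supports with ncu --query-metrics:

ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed ./my_program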

The roofline model

The roofline model is the most useful mental framework for understanding kernel performance. It plots achievable performance (FLOP/s) as a function of arithmetic intensity (FLOPs per byte of memory traffic).

graph LR
  subgraph Roofline["Roofline Model Regions"]
      direction TB
      MB["Memory-Bound Region
Performance limited by bandwidth
Arithmetic intensity &lt; ridge point"]
      CB["Compute-Bound Region
Performance limited by peak FLOP/s
Arithmetic intensity &gt; ridge point"]
      RP["Ridge Point
Peak FLOP/s / Peak Bandwidth"]
  end
  MB --> RP
  RP --> CB

  style MB fill:#f4a261,stroke:#e76f51,color:#000
  style CB fill:#2d6a4f,stroke:#1b4332,color:#fff
  style RP fill:#e76f51,stroke:#d62828,color:#fff

Two ceilings define the model:

  1. Memory bandwidth ceiling: performance = arithmetic_intensity x peak_bandwidth (GB/s). This is a sloped line.
  2. Compute ceiling: performance = peak_FLOP/s. This is a horizontal line.

The ridge point is where the two lines intersect: ridge_point = peak_FLOP/s / peak_bandwidth. Kernels to the left of the ridge point are memory-bound. Kernels to the right are compute-bound.

Roofline chart for an A100 GPU

Peak FP32: 19.5 TFLOP/s. Peak HBM2e bandwidth: 2,039 GB/s (2.039 TB/s). Ridge point: 19.5 / 2.039 = 9.56 FLOP/byte.

Reading this roofline:

  • Kernel A (naive reduction): arithmetic intensity 2 FLOP/byte, achieves 3.5 TFLOP/s. It sits on the sloped line, meaning it is memory-bound. The roofline ceiling at 2 FLOP/byte is 2 x 2.039 = 4.08 TFLOP/s. The kernel reaches 86% of its memory-bound ceiling. To improve it, reduce memory traffic (better reuse, compression, fusion).
  • Kernel B (tiled matmul): arithmetic intensity 8 FLOP/byte, achieves 12 TFLOP/s. Still to the left of the ridge point, so still memory-bound. Ceiling is 8 x 2.039 = 16.3 TFLOP/s (73% efficiency). Larger tiles or better prefetching would help.
  • Kernel C (fused MLP): arithmetic intensity 40 FLOP/byte, achieves 17.5 TFLOP/s. Well past the ridge point, so compute-bound. Ceiling is 19.5 TFLOP/s (90% efficiency). To improve, increase instruction-level parallelism or use tensor cores.

Worked example 1: roofline analysis

Setup: A matrix multiplication kernel on a GPU with 20 TFLOP/s peak FP32 performance and 900 GB/s memory bandwidth. The kernel achieves 5 TFLOP/s. Profiling shows an arithmetic intensity of 15 FLOP/byte.

Step 1: find the ridge point.

Ridge point = peak FLOP/s / peak bandwidth = 20 x 10¹² / 900 x 10⁹ = 22.2 FLOP/byte.

Step 2: determine the bound.

The kernel’s arithmetic intensity (15 FLOP/byte) is less than the ridge point (22.2 FLOP/byte). The kernel is memory-bound.

Step 3: compute the theoretical maximum.

For a memory-bound kernel, the ceiling is: arithmetic_intensity x bandwidth = 15 x 900 x 10⁹ = 13.5 x 10¹² = 13.5 TFLOP/s.

Step 4: compute efficiency.

The kernel achieves 5 TFLOP/s out of a 13.5 TFLOP/s ceiling. That is 37% of the memory-bound roofline. This is poor. The kernel likely has non-coalesced memory accesses or excessive global memory traffic from redundant loads.
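
This arithmetic is mechanical enough to script. A small host-side sketch (the function name and constants are ours, mirroring this example's numbers):

#include <algorithm>
#include <cstdio>

// Achievable ceiling: the lower of the compute roof and the memory roof
double rooflineCeiling(double peakFlops, double peakBytesPerSec, double intensity) {
    return std::min(peakFlops, intensity * peakBytesPerSec);
}

int main() {
    double ceiling = rooflineCeiling(20e12, 900e9, 15.0);  // 13.5e12 FLOP/s
    printf("ceiling: %.1f TFLOP/s, efficiency: %.0f%%\n",
           ceiling / 1e12, 100.0 * 5e12 / ceiling);        // prints 13.5 and 37%
    return 0;
}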

What to do:

  • ✓ Check for coalesced access patterns (consecutive threads read consecutive addresses)
  • ✓ Add shared memory tiling to increase data reuse per byte loaded
  • ✓ Fuse this kernel with adjacent operations to reduce round-trips to DRAM
  • ✗ Increasing FLOP/s will not help because the kernel is not compute-bound

Worked example 2: occupancy analysis

Setup: An NVIDIA GPU where each SM supports a maximum of 2048 concurrent threads, 65,536 registers, and 48 KB of shared memory. Your kernel uses blocks of 256 threads, 32 registers per thread, and 8 KB of shared memory per block.

Step 1: thread-limited occupancy.

Max blocks per SM (by threads) = floor(2048 / 256) = 8 blocks. Active threads = 8 x 256 = 2048. Thread occupancy = 2048 / 2048 = 100%.

Step 2: register-limited occupancy.

Registers per block = 256 threads x 32 registers = 8,192 registers. Max blocks per SM (by registers) = floor(65,536 / 8,192) = 8 blocks. Active threads = 8 x 256 = 2048. Register occupancy = 2048 / 2048 = 100%.

Step 3: shared-memory-limited occupancy.

Shared memory per block = 8 KB. Max blocks per SM (by shared memory) = floor(48 / 8) = 6 blocks. Active threads = 6 x 256 = 1,536. Shared memory occupancy = 1,536 / 2,048 = 75%.

Step 4: determine the binding constraint.

The minimum across all resources determines actual occupancy:

Resource | Max blocks | Active threads | Occupancy
--- | --- | --- | ---
Threads per SM | 8 | 2,048 | 100%
Registers | 8 | 2,048 | 100%
Shared memory | 6 | 1,536 | 75%

Shared memory is the binding constraint. Actual occupancy is 75%.
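
You can cross-check hand calculations like these against the runtime's occupancy calculator, which also accounts for the register count the compiler actually chose. A sketch, assuming a kernel named myKernel that requests its 8 KB as dynamic shared memory:

int numBlocks = 0;
// Maximum resident blocks per SM for this kernel at 256 threads/block
// with 8 KB of dynamic shared memory per block
CUDA_CHECK(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocks, myKernel, 256, 8192));

cudaDeviceProp prop;
CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));
printf("predicted occupancy: %.0f%%\n",
       100.0 * numBlocks * 256 / prop.maxThreadsPerMultiProcessor);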

What to do:

  • ✓ Reduce shared memory per block from 8 KB to 6 KB (allows 8 blocks, reaching 100%)
  • ✓ Alternatively, increase the block size to 384 threads with the same 8 KB per block: the thread and register limits then cap the SM at 5 blocks, but 5 x 384 = 1,920 active threads raises occupancy to 94%
  • ⚠ 75% occupancy is often acceptable. Measure actual performance before optimizing; higher occupancy does not always mean faster execution if the kernel is compute-bound and already hides latency well

In practice

Use the error checking macro everywhere. Wrap every cudaMalloc, cudaMemcpy, and cudaFree call. Check cudaGetLastError() after every kernel launch. This costs zero runtime overhead (the error status is already set; you are just reading it) and saves hours of debugging silent failures.

Profile before optimizing. New CUDA programmers often guess at bottlenecks: “shared memory will make this faster.” Sometimes it does. Sometimes the kernel is already compute-bound and adding shared memory staging just increases register pressure. Nsight Compute will tell you in seconds what intuition takes hours to discover.

The roofline model is a ceiling, not a target. A kernel at 60% of its roofline ceiling is often good enough. The last 20% of performance typically requires intricate tuning (warp-level primitives, double buffering, manual register allocation) that makes the code fragile and hard to maintain. Know the ceiling so you can make an informed decision about when to stop optimizing.

compute-sanitizer is slow. Expect 10-50x slowdown. Use it with small inputs during development. Do not run it on production-scale data.

Nsight Systems before Nsight Compute. Always start with the timeline view. If your bottleneck is a 500 ms cudaMemcpy between two 2 ms kernels, no amount of kernel optimization will help. Fix the data transfer first.

Watch for occupancy traps. Higher occupancy means more warps available to hide latency. But beyond a threshold (often 50-60%), additional warps compete for cache and shared memory, sometimes making performance worse. Profile both directions.

What comes next

The next article covers device and host function qualifiers: __device__, __host__, and __host__ __device__ functions, how the compiler generates code for each, and patterns for writing portable code that runs on both CPU and GPU.
