Debugging and profiling CUDA programs
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article builds on shared memory concepts from the previous post. You should be comfortable writing kernels that use __shared__ memory, understand thread/block indexing, and know the basics of warp execution. Everything here targets CUDA Toolkit 12+ and Nsight tools.
The debugging and profiling workflow
Most CUDA bugs are silent. The kernel launches, returns no error, and writes garbage to your output buffer. The GPU does not segfault in the way a CPU program does. You need a systematic workflow: compile with debug flags, run with sanitizers, check every API return code, then profile to find the actual bottleneck.
graph LR
    A["Compile nvcc -G -lineinfo"] --> B["Run with compute-sanitizer"]
    B --> C["Check API return codes"]
    C --> D["Profile with Nsight Systems"]
    D --> E["Deep-dive with Nsight Compute"]
    E --> F["Optimize kernel"]
    F --> B
    style A fill:#2d6a4f,stroke:#1b4332,color:#fff
    style B fill:#40916c,stroke:#2d6a4f,color:#fff
    style C fill:#52b788,stroke:#40916c,color:#fff
    style D fill:#f4a261,stroke:#e76f51,color:#000
    style E fill:#e76f51,stroke:#d62828,color:#fff
    style F fill:#d62828,stroke:#9b2226,color:#fff
This loop is the core of GPU development. You will cycle through it dozens of times per kernel.
Common CUDA bugs
Wrong thread indexing
The single most common bug. You compute a global thread index, but the formula does not match your launch configuration. The kernel silently reads and writes out-of-bounds memory.
// BUG: forgot blockDim.x in the index calculation
int idx = blockIdx.x + threadIdx.x; // wrong
int idx = blockIdx.x * blockDim.x + threadIdx.x; // correct
Dropping the blockDim.x factor produces results that look correct whenever a single block suffices (for block 0, the two formulas agree) and fail catastrophically on larger inputs.
Off-by-one and bounds checking
Even with correct indexing, you must guard against threads that exceed the data size. A grid of 256-thread blocks processing 1000 elements launches 4 blocks (1024 threads). The last 24 threads will read past the end of your array.
__global__ void scale(float *data, int n, float factor) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= n) return; // essential bounds check
data[idx] *= factor;
}
Race conditions in shared memory
Two threads in the same block write to the same shared memory location without a __syncthreads() barrier. The result depends on warp scheduling and changes between runs, making this bug extremely hard to reproduce.
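A minimal sketch of the pattern, assuming a single 256-thread block (the kernel name and shape are illustrative):

```cuda
__global__ void reverseBlock(float *data) {
    __shared__ float tile[256];
    int t = threadIdx.x;

    tile[t] = data[t];        // every thread writes its own slot
    // BUG: without a barrier here, thread t may read tile[255 - t]
    // before the thread that owns that slot has written it.
    __syncthreads();          // FIX: wait until all writes have landed
    data[t] = tile[255 - t];  // every thread reads another thread's slot
}
```

With the barrier removed, compute-sanitizer's racecheck tool (covered below) flags the hazard; without a tool, the bug may only surface once in thousands of runs.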
Using device pointers on the host
Dereferencing a device pointer on the CPU does not crash immediately on all systems. On some driver versions it returns zero; on others it triggers a segfault. Either way, the compiler cannot catch it.
float *d_data;
cudaMalloc(&d_data, N * sizeof(float));
// BUG: cannot dereference d_data on the host
printf("first element: %f\n", d_data[0]); // undefined behavior
// Correct: copy the value back to the host first
float first;
cudaMemcpy(&first, d_data, sizeof(float), cudaMemcpyDeviceToHost);
printf("first element: %f\n", first);
The CUDA error checking macro
Every CUDA runtime call returns a cudaError_t. Ignoring it is the GPU equivalent of ignoring errno. The standard practice is a macro that checks the return value and prints the file, line, and human-readable error string.
#include <cstdio>
#include <cstdlib>
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)
// Usage: wrap every CUDA call
CUDA_CHECK(cudaMalloc(&d_data, N * sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice));
// For kernel launches (async), check after synchronization
myKernel<<<grid, block>>>(d_data, N);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());
The do { ... } while (0) wrapper ensures the macro behaves as a single statement in if/else blocks. The cudaGetLastError() call after a kernel launch catches configuration errors (e.g., requesting too much shared memory). The cudaDeviceSynchronize() call catches runtime errors inside the kernel.
Kernel printf for quick debugging
CUDA supports printf inside device code since compute capability 2.0. The output is buffered and flushed to the host on the next synchronization point.
__global__ void debugKernel(float *data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= n) return;
// Print only from a few threads to avoid flooding the console
if (idx < 4) {
printf("Thread %d: block=%d, local=%d, value=%.4f\n",
idx, blockIdx.x, threadIdx.x, data[idx]);
}
}
⚠ printf from thousands of threads will produce megabytes of interleaved output. Always guard it with a condition. The default print buffer is 1 MB; increase it with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size) if you need more.
compute-sanitizer: catching memory errors
compute-sanitizer (formerly cuda-memcheck) is the GPU equivalent of Valgrind. It instruments your kernel at runtime and detects out-of-bounds accesses, race conditions, and uninitialized reads.
Here is a kernel with an intentional out-of-bounds write:
#include <cstdio>
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)
__global__ void buggyKernel(float *data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// BUG: no bounds check, threads beyond n write out of bounds
data[idx] = 1.0f;
}
int main() {
int n = 1000;
float *d_data;
CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
// Launch 1024 threads for 1000 elements: 24 threads write OOB
buggyKernel<<<4, 256>>>(d_data, n);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());
CUDA_CHECK(cudaFree(d_data));
return 0;
}
Compile with debug info and run under the sanitizer. Note that -G disables device code optimization, so reserve it for debugging builds; -lineinfo alone is enough to get source-line attribution when profiling optimized code.
nvcc -G -lineinfo -o buggy buggy.cu
compute-sanitizer --tool memcheck ./buggy
Output (abbreviated):
========= Invalid __global__ write of size 4 bytes
========= at 0x00000148 in buggyKernel(float*, int)
========= by thread (232,0,0) in block (3,0,0)
========= Address 0x7f2c00001f40 is out of bounds
=========
========= ERROR SUMMARY: 24 errors
The sanitizer pinpoints the exact thread, block, and instruction. The 24 errors correspond to threads 1000 through 1023.
Other useful sanitizer modes:
- --tool racecheck detects shared memory race conditions
- --tool initcheck finds reads of uninitialized device memory
- --tool synccheck validates __syncthreads() usage across divergent warps
Nsight Systems: the timeline view
Nsight Systems (nsys) gives you a system-wide timeline of CPU calls, CUDA API invocations, kernel launches, and memory transfers. It answers the question: “Where is time being spent?”
nsys profile --stats=true -o report ./my_program
nsys stats report.nsys-rep
What to look for in the timeline:
- Gaps between kernels: indicates CPU-side overhead or unnecessary synchronization
- Long cudaMemcpy calls: data transfer dominates kernel execution
- Serialized kernel launches: kernels that could overlap on different streams are running sequentially
- Small kernels with high launch overhead: the kernel runs for 2 microseconds but the launch takes 5
Nsight Systems is your first profiling step. It tells you which kernel to focus on before you dive deeper.
Nsight Compute: kernel-level metrics
Nsight Compute (ncu) profiles a single kernel invocation in extreme detail. It collects hardware counters, computes derived metrics, and compares your kernel against theoretical peak performance.
ncu --set full -o kernel_report ./my_program
ncu -i kernel_report.ncu-rep
Key profiler metrics
| Metric | What it measures | Ideal value | When it is low | What to do |
|---|---|---|---|---|
| Achieved occupancy | Fraction of max warps active per cycle | >50% for most kernels | Threads are blocked on barriers or resources | Reduce register/shared memory per thread, adjust block size |
| SM throughput | Fraction of time SMs are doing useful work | Close to 100% | Too few blocks to fill the GPU, or high tail effect | Increase parallelism, check grid dimensions |
| Memory throughput (DRAM) | Fraction of peak memory bandwidth used | Depends on kernel type | Poor access patterns, low arithmetic intensity | Coalesce accesses, use shared memory tiling |
| Warp stall: memory dependency | Cycles warps wait for memory | Low (relative to compute) | Too many global memory accesses per instruction | Increase data reuse via shared memory or registers |
| Warp stall: barrier | Cycles warps wait at __syncthreads() | Low | Unbalanced work across threads in a block | Redistribute work, reduce barrier frequency |
| L1/L2 cache hit rate | Fraction of memory requests served by cache | High for data-reuse kernels | Strided or random access patterns | Restructure access to exploit spatial locality |
| FP32/FP64 utilization | Fraction of peak FLOP/s achieved | High for compute-bound kernels | Memory bound or low ILP | Increase arithmetic intensity, use ILP techniques |
The roofline model
The roofline model is the most useful mental framework for understanding kernel performance. It plots achievable performance (FLOP/s) as a function of arithmetic intensity (FLOPs per byte of memory traffic).
graph LR
subgraph Roofline["Roofline Model Regions"]
direction TB
MB["Memory-Bound Region
Performance limited by bandwidth
Arithmetic intensity < ridge point"]
CB["Compute-Bound Region
Performance limited by peak FLOP/s
Arithmetic intensity > ridge point"]
RP["Ridge Point
Peak FLOP/s / Peak Bandwidth"]
end
MB --> RP
RP --> CB
style MB fill:#f4a261,stroke:#e76f51,color:#000
style CB fill:#2d6a4f,stroke:#1b4332,color:#fff
style RP fill:#e76f51,stroke:#d62828,color:#fff
Two ceilings define the model:
- Memory bandwidth ceiling: performance = arithmetic_intensity x peak_bandwidth (GB/s). This is a sloped line.
- Compute ceiling: performance = peak_FLOP/s. This is a horizontal line.
The ridge point is where the two lines intersect: ridge_point = peak_FLOP/s / peak_bandwidth. Kernels to the left of the ridge point are memory-bound. Kernels to the right are compute-bound.
Roofline chart for an A100 GPU
Peak FP32: 19.5 TFLOP/s. Peak HBM2e bandwidth: 2,039 GB/s (2.039 TB/s). Ridge point: 19.5 / 2.039 = 9.56 FLOP/byte.
Reading this chart:
- Kernel A (naive reduction): arithmetic intensity 2 FLOP/byte, achieves 3.5 TFLOP/s. It sits on the sloped line, meaning it is memory-bound. The roofline ceiling at 2 FLOP/byte is 2 x 2.039 = 4.08 TFLOP/s. The kernel reaches 86% of its memory-bound ceiling. To improve it, reduce memory traffic (better reuse, compression, fusion).
- Kernel B (tiled matmul): arithmetic intensity 8 FLOP/byte, achieves 12 TFLOP/s. Still to the left of the ridge point, so still memory-bound. Ceiling is 8 x 2.039 = 16.3 TFLOP/s (73% efficiency). Larger tiles or better prefetching would help.
- Kernel C (fused MLP): arithmetic intensity 40 FLOP/byte, achieves 17.5 TFLOP/s. Well past the ridge point, so compute-bound. Ceiling is 19.5 TFLOP/s (90% efficiency). To improve, increase instruction-level parallelism or use tensor cores.
Worked example 1: roofline analysis
Setup: A matrix multiplication kernel on a GPU with 20 TFLOP/s peak FP32 performance and 900 GB/s memory bandwidth. The kernel achieves 5 TFLOP/s. Profiling shows an arithmetic intensity of 15 FLOP/byte.
Step 1: find the ridge point.
Ridge point = peak FLOP/s / peak bandwidth = 20 x 10¹² / 900 x 10⁹ = 22.2 FLOP/byte.
Step 2: determine the bound.
The kernel’s arithmetic intensity (15 FLOP/byte) is less than the ridge point (22.2 FLOP/byte). The kernel is memory-bound.
Step 3: compute the theoretical maximum.
For a memory-bound kernel, the ceiling is: arithmetic_intensity x bandwidth = 15 x 900 x 10⁹ = 13.5 x 10¹² = 13.5 TFLOP/s.
Step 4: compute efficiency.
The kernel achieves 5 TFLOP/s out of a 13.5 TFLOP/s ceiling. That is 37% of the memory-bound roofline. This is poor. The kernel likely has non-coalesced memory accesses or excessive global memory traffic from redundant loads.
What to do:
- ✓ Check for coalesced access patterns (consecutive threads read consecutive addresses)
- ✓ Add shared memory tiling to increase data reuse per byte loaded
- ✓ Fuse this kernel with adjacent operations to reduce round-trips to DRAM
- ✗ Increasing FLOP/s will not help because the kernel is not compute-bound
Worked example 2: occupancy analysis
Setup: An NVIDIA GPU where each SM supports a maximum of 2048 concurrent threads, 65,536 registers, and 48 KB of shared memory. Your kernel uses blocks of 256 threads, 32 registers per thread, and 8 KB of shared memory per block.
Step 1: thread-limited occupancy.
Max blocks per SM (by threads) = floor(2048 / 256) = 8 blocks. Active threads = 8 x 256 = 2048. Thread occupancy = 2048 / 2048 = 100%.
Step 2: register-limited occupancy.
Registers per block = 256 threads x 32 registers = 8,192 registers. Max blocks per SM (by registers) = floor(65,536 / 8,192) = 8 blocks. Active threads = 8 x 256 = 2048. Register occupancy = 2048 / 2048 = 100%.
Step 3: shared-memory-limited occupancy.
Shared memory per block = 8 KB. Max blocks per SM (by shared memory) = floor(48 / 8) = 6 blocks. Active threads = 6 x 256 = 1,536. Shared memory occupancy = 1,536 / 2,048 = 75%.
Step 4: determine the binding constraint.
The minimum across all resources determines actual occupancy:
| Resource | Max blocks | Active threads | Occupancy |
|---|---|---|---|
| Threads per SM | 8 | 2,048 | 100% |
| Registers | 8 | 2,048 | 100% |
| Shared memory | 6 | 1,536 | 75% |
Shared memory is the binding constraint. Actual occupancy is 75%.
What to do:
- ✓ Reduce shared memory per block from 8 KB to 6 KB (allows 8 blocks, reaching 100%)
- ✓ Alternatively, increase block size to 384 threads with the same 8 KB shared memory (fewer blocks needed)
- ⚠ 75% occupancy is often acceptable. Measure actual performance before optimizing; higher occupancy does not always mean faster execution if the kernel is compute-bound and already hides latency well
In practice
Use the error checking macro everywhere. Wrap every cudaMalloc, cudaMemcpy, and cudaFree call. Wrap cudaGetLastError() after every kernel launch. This costs zero runtime overhead (the error is already computed; you are just reading it) and saves hours of debugging silent failures.
Profile before optimizing. New CUDA programmers often guess at bottlenecks: “shared memory will make this faster.” Sometimes it does. Sometimes the kernel is already compute-bound and adding shared memory staging just increases register pressure. Nsight Compute will tell you in seconds what intuition takes hours to discover.
The roofline model is a ceiling, not a target. A kernel at 60% of its roofline ceiling is often good enough. The last 20% of performance typically requires intricate tuning (warp-level primitives, double buffering, manual register allocation) that makes the code fragile and hard to maintain. Know the ceiling so you can make an informed decision about when to stop optimizing.
compute-sanitizer is slow. Expect 10-50x slowdown. Use it with small inputs during development. Do not run it on production-scale data.
Nsight Systems before Nsight Compute. Always start with the timeline view. If your bottleneck is a 500 ms cudaMemcpy between two 2 ms kernels, no amount of kernel optimization will help. Fix the data transfer first.
Watch for occupancy traps. Higher occupancy means more warps available to hide latency. But beyond a threshold (often 50-60%), additional warps compete for cache and shared memory, sometimes making performance worse. Profile both directions.
What comes next
The next article covers device and host function qualifiers: __device__, __host__, and __host__ __device__ functions, how the compiler generates code for each, and patterns for writing portable code that runs on both CPU and GPU.