Debugging and profiling CUDA programs
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article builds on shared memory concepts from the previous post. You should be comfortable writing kernels that use __shared__ memory, understand thread/block indexing, and know the basics of warp execution. Everything here targets CUDA Toolkit 12+ and Nsight tools.
The debugging and profiling workflow
Most CUDA bugs are silent. The kernel launches, returns no error, and writes garbage to your output buffer. The GPU does not segfault in the way a CPU program does. You need a systematic workflow: compile with debug flags, run with sanitizers, check every API return code, then profile to find the actual bottleneck.
graph LR
    A["Compile nvcc -G -lineinfo"] --> B["Run with compute-sanitizer"]
    B --> C["Check API return codes"]
    C --> D["Profile with Nsight Systems"]
    D --> E["Deep-dive with Nsight Compute"]
    E --> F["Optimize kernel"]
    F --> B
    style A fill:#2d6a4f,stroke:#1b4332,color:#fff
    style B fill:#40916c,stroke:#2d6a4f,color:#fff
    style C fill:#52b788,stroke:#40916c,color:#fff
    style D fill:#f4a261,stroke:#e76f51,color:#000
    style E fill:#e76f51,stroke:#d62828,color:#fff
    style F fill:#d62828,stroke:#9b2226,color:#fff
This loop is the core of GPU development. You will cycle through it dozens of times per kernel.
Common CUDA bugs
Wrong thread indexing
The single most common bug. You compute a global thread index, but the formula does not match your launch configuration. The kernel silently reads and writes out-of-bounds memory.
// BUG: forgot blockDim.x in the index calculation
int idx = blockIdx.x + threadIdx.x; // wrong
int idx = blockIdx.x * blockDim.x + threadIdx.x; // correct
Dropping the blockDim.x factor produces results that look correct whenever a single block suffices (for block 0, the two formulas agree) and fail catastrophically on larger inputs.
Off-by-one and bounds checking
Even with correct indexing, you must guard against threads that exceed the data size. A grid of 256-thread blocks processing 1000 elements launches 4 blocks (1024 threads). The last 24 threads will read past the end of your array.
__global__ void scale(float *data, int n, float factor) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= n) return; // essential bounds check
data[idx] *= factor;
}
Race conditions in shared memory
Two threads in the same block write to the same shared memory location without a __syncthreads() barrier. The result depends on warp scheduling and changes between runs, making this bug extremely hard to reproduce.
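A minimal sketch of the pattern, assuming a single 256-thread block (the kernel name and shape are illustrative):

```cuda
__global__ void reverseBlock(float *data) {
    __shared__ float tile[256];
    int t = threadIdx.x;

    tile[t] = data[t];        // every thread writes its own slot
    // BUG: without a barrier here, thread t may read tile[255 - t]
    // before the thread that owns that slot has written it.
    __syncthreads();          // FIX: wait until all writes have landed
    data[t] = tile[255 - t];  // every thread reads another thread's slot
}
```

With the barrier removed, compute-sanitizer's racecheck tool (covered below) flags the hazard; without a tool, the bug may only surface once in thousands of runs.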
Using device pointers on the host
Dereferencing a device pointer on the CPU does not crash immediately on all systems. On some driver versions it returns zero; on others it triggers a segfault. Either way, the compiler cannot catch it.
float *d_data;
cudaMalloc(&d_data, N * sizeof(float));
// BUG: cannot dereference d_data on the host
printf("first element: %f\n", d_data[0]); // undefined behavior
// Correct: copy the value back to the host first
float first;
cudaMemcpy(&first, d_data, sizeof(float), cudaMemcpyDeviceToHost);
printf("first element: %f\n", first);
The CUDA error checking macro
Every CUDA runtime call returns a cudaError_t. Ignoring it is the GPU equivalent of ignoring errno. The standard practice is a macro that checks the return value and prints the file, line, and human-readable error string.
#include <cstdio>
#include <cstdlib>
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)
// Usage: wrap every CUDA call
CUDA_CHECK(cudaMalloc(&d_data, N * sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice));
// For kernel launches (async), check after synchronization
myKernel<<<grid, block>>>(d_data, N);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());
The do { ... } while (0) wrapper ensures the macro behaves as a single statement in if/else blocks. The cudaGetLastError() call after a kernel launch catches configuration errors (e.g., requesting too much shared memory). The cudaDeviceSynchronize() call catches runtime errors inside the kernel.
Kernel printf for quick debugging
CUDA supports printf inside device code since compute capability 2.0. The output is buffered and flushed to the host on the next synchronization point.
__global__ void debugKernel(float *data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= n) return;
// Print only from a few threads to avoid flooding the console
if (idx < 4) {
printf("Thread %d: block=%d, local=%d, value=%.4f\n",
idx, blockIdx.x, threadIdx.x, data[idx]);
}
}
⚠ printf from thousands of threads will produce megabytes of interleaved output. Always guard it with a condition. The default print buffer is 1 MB; increase it with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size) if you need more.
compute-sanitizer: catching memory errors
compute-sanitizer (formerly cuda-memcheck) is the GPU equivalent of Valgrind. It instruments your kernel at runtime and detects out-of-bounds accesses, race conditions, and uninitialized reads.
Here is a kernel with an intentional out-of-bounds write:
#include <cstdio>
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)
__global__ void buggyKernel(float *data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// BUG: no bounds check, threads beyond n write out of bounds
data[idx] = 1.0f;
}
int main() {
int n = 1000;
float *d_data;
CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
// Launch 1024 threads for 1000 elements: 24 threads write OOB
buggyKernel<<<4, 256>>>(d_data, n);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());
CUDA_CHECK(cudaFree(d_data));
return 0;
}
Compile with debug info and run under the sanitizer. Note that -G disables device code optimization, so reserve it for debugging builds; -lineinfo alone is enough to get source-line attribution when profiling optimized code.
nvcc -G -lineinfo -o buggy buggy.cu
compute-sanitizer --tool memcheck ./buggy
Output (abbreviated):
========= Invalid __global__ write of size 4 bytes
========= at 0x00000148 in buggyKernel(float*, int)
========= by thread (232,0,0) in block (3,0,0)
========= Address 0x7f2c00001f40 is out of bounds
=========
========= ERROR SUMMARY: 24 errors
The sanitizer pinpoints the exact thread, block, and instruction. The 24 errors correspond to threads 1000 through 1023.
Other useful sanitizer modes:
- --tool racecheck detects shared memory race conditions
- --tool initcheck finds reads of uninitialized device memory
- --tool synccheck validates __syncthreads() usage across divergent warps
Nsight Systems: the timeline view
Nsight Systems (nsys) gives you a system-wide timeline of CPU calls, CUDA API invocations, kernel launches, and memory transfers. It answers the question: “Where is time being spent?”
nsys profile --stats=true -o report ./my_program
nsys stats report.nsys-rep
What to look for in the timeline:
- Gaps between kernels: indicates CPU-side overhead or unnecessary synchronization
- Long cudaMemcpy calls: data transfer dominates kernel execution
- Serialized kernel launches: kernels that could overlap on different streams are running sequentially
- Small kernels with high launch overhead: the kernel runs for 2 microseconds but the launch takes 5
Nsight Systems is your first profiling step. It tells you which kernel to focus on before you dive deeper.
Nsight Compute: kernel-level metrics
Nsight Compute (ncu) profiles a single kernel invocation in extreme detail. It collects hardware counters, computes derived metrics, and compares your kernel against theoretical peak performance.
ncu --set full -o kernel_report ./my_program
ncu -i kernel_report.ncu-rep
Key profiler metrics
| Metric | What it measures | Ideal value | When it is low | What to do |
|---|---|---|---|---|
| Achieved occupancy | Fraction of max warps active per cycle | >50% for most kernels | Threads are blocked on barriers or resources | Reduce register/shared memory per thread, adjust block size |
| SM throughput | Fraction of time SMs are doing useful work | Close to 100% | Too few blocks to fill the GPU, or high tail effect | Increase parallelism, check grid dimensions |
| Memory throughput (DRAM) | Fraction of peak memory bandwidth used | Depends on kernel type | Poor access patterns, low arithmetic intensity | Coalesce accesses, use shared memory tiling |
| Warp stall: memory dependency | Cycles warps wait for memory | Low (relative to compute) | Too many global memory accesses per instruction | Increase data reuse via shared memory or registers |
| Warp stall: barrier | Cycles warps wait at __syncthreads() | Low | Unbalanced work across threads in a block | Redistribute work, reduce barrier frequency |
| L1/L2 cache hit rate | Fraction of memory requests served by cache | High for data-reuse kernels | Strided or random access patterns | Restructure access to exploit spatial locality |
| FP32/FP64 utilization | Fraction of peak FLOP/s achieved | High for compute-bound kernels | Memory bound or low ILP | Increase arithmetic intensity, use ILP techniques |
The roofline model
The roofline model is the most useful mental framework for understanding kernel performance. It plots achievable performance (FLOP/s) as a function of arithmetic intensity (FLOPs per byte of memory traffic).
graph LR
subgraph Roofline["Roofline Model Regions"]
direction TB
MB["Memory-Bound Region
Performance limited by bandwidth
Arithmetic intensity < ridge point"]
CB["Compute-Bound Region
Performance limited by peak FLOP/s
Arithmetic intensity > ridge point"]
RP["Ridge Point
Peak FLOP/s / Peak Bandwidth"]
end
MB --> RP
RP --> CB
style MB fill:#f4a261,stroke:#e76f51,color:#000
style CB fill:#2d6a4f,stroke:#1b4332,color:#fff
style RP fill:#e76f51,stroke:#d62828,color:#fff
Two ceilings define the model:
- Memory bandwidth ceiling: performance = arithmetic_intensity x peak_bandwidth (GB/s). This is a sloped line.
- Compute ceiling: performance = peak_FLOP/s. This is a horizontal line.
The ridge point is where the two lines intersect: ridge_point = peak_FLOP/s / peak_bandwidth. Kernels to the left of the ridge point are memory-bound. Kernels to the right are compute-bound.
Roofline chart for an A100 GPU
Peak FP32: 19.5 TFLOP/s. Peak HBM2e bandwidth: 2,039 GB/s (2.039 TB/s). Ridge point: 19.5 / 2.039 = 9.56 FLOP/byte.
Reading this chart:
- Kernel A (naive reduction): arithmetic intensity 2 FLOP/byte, achieves 3.5 TFLOP/s. It sits on the sloped line, meaning it is memory-bound. The roofline ceiling at 2 FLOP/byte is 2 x 2.039 = 4.08 TFLOP/s. The kernel reaches 86% of its memory-bound ceiling. To improve it, reduce memory traffic (better reuse, compression, fusion).
- Kernel B (tiled matmul): arithmetic intensity 8 FLOP/byte, achieves 12 TFLOP/s. Still to the left of the ridge point, so still memory-bound. Ceiling is 8 x 2.039 = 16.3 TFLOP/s (73% efficiency). Larger tiles or better prefetching would help.
- Kernel C (fused MLP): arithmetic intensity 40 FLOP/byte, achieves 17.5 TFLOP/s. Well past the ridge point, so compute-bound. Ceiling is 19.5 TFLOP/s (90% efficiency). To improve, increase instruction-level parallelism or use tensor cores.
Worked example 1: roofline analysis
Setup: A matrix multiplication kernel on a GPU with 20 TFLOP/s peak FP32 performance and 900 GB/s memory bandwidth. The kernel achieves 5 TFLOP/s. Profiling shows an arithmetic intensity of 15 FLOP/byte.
Step 1: find the ridge point.
Ridge point = peak FLOP/s / peak bandwidth = 20 x 10¹² / 900 x 10⁹ = 22.2 FLOP/byte.
Step 2: determine the bound.
The kernel’s arithmetic intensity (15 FLOP/byte) is less than the ridge point (22.2 FLOP/byte). The kernel is memory-bound.
Step 3: compute the theoretical maximum.
For a memory-bound kernel, the ceiling is: arithmetic_intensity x bandwidth = 15 x 900 x 10⁹ = 13.5 x 10¹² = 13.5 TFLOP/s.
Step 4: compute efficiency.
The kernel achieves 5 TFLOP/s out of a 13.5 TFLOP/s ceiling. That is 37% of the memory-bound roofline. This is poor. The kernel likely has non-coalesced memory accesses or excessive global memory traffic from redundant loads.
What to do:
- ✓ Check for coalesced access patterns (consecutive threads read consecutive addresses)
- ✓ Add shared memory tiling to increase data reuse per byte loaded
- ✓ Fuse this kernel with adjacent operations to reduce round-trips to DRAM
- ✗ Increasing FLOP/s will not help because the kernel is not compute-bound
Worked example 2: occupancy analysis
Setup: An NVIDIA GPU where each SM supports a maximum of 2048 concurrent threads, 65,536 registers, and 48 KB of shared memory. Your kernel uses blocks of 256 threads, 32 registers per thread, and 8 KB of shared memory per block.
Step 1: thread-limited occupancy.
Max blocks per SM (by threads) = floor(2048 / 256) = 8 blocks. Active threads = 8 x 256 = 2048. Thread occupancy = 2048 / 2048 = 100%.
Step 2: register-limited occupancy.
Registers per block = 256 threads x 32 registers = 8,192 registers. Max blocks per SM (by registers) = floor(65,536 / 8,192) = 8 blocks. Active threads = 8 x 256 = 2048. Register occupancy = 2048 / 2048 = 100%.
Step 3: shared-memory-limited occupancy.
Shared memory per block = 8 KB. Max blocks per SM (by shared memory) = floor(48 / 8) = 6 blocks. Active threads = 6 x 256 = 1,536. Shared memory occupancy = 1,536 / 2,048 = 75%.
Step 4: determine the binding constraint.
The minimum across all resources determines actual occupancy:
| Resource | Max blocks | Active threads | Occupancy |
|---|---|---|---|
| Threads per SM | 8 | 2,048 | 100% |
| Registers | 8 | 2,048 | 100% |
| Shared memory | 6 | 1,536 | 75% |
Shared memory is the binding constraint. Actual occupancy is 75%.
What to do:
- ✓ Reduce shared memory per block from 8 KB to 6 KB (allows 8 blocks, reaching 100%)
- ✓ Alternatively, increase block size to 384 threads with the same 8 KB shared memory (fewer blocks needed)
- ⚠ 75% occupancy is often acceptable. Measure actual performance before optimizing; higher occupancy does not always mean faster execution if the kernel is compute-bound and already hides latency well
In practice
Use the error checking macro everywhere. Wrap every cudaMalloc, cudaMemcpy, and cudaFree call. Wrap cudaGetLastError() after every kernel launch. This costs zero runtime overhead (the error is already computed; you are just reading it) and saves hours of debugging silent failures.
Profile before optimizing. New CUDA programmers often guess at bottlenecks: “shared memory will make this faster.” Sometimes it does. Sometimes the kernel is already compute-bound and adding shared memory staging just increases register pressure. Nsight Compute will tell you in seconds what intuition takes hours to discover.
The roofline model is a ceiling, not a target. A kernel at 60% of its roofline ceiling is often good enough. The last 20% of performance typically requires intricate tuning (warp-level primitives, double buffering, manual register allocation) that makes the code fragile and hard to maintain. Know the ceiling so you can make an informed decision about when to stop optimizing.
compute-sanitizer is slow. Expect 10-50x slowdown. Use it with small inputs during development. Do not run it on production-scale data.
Nsight Systems before Nsight Compute. Always start with the timeline view. If your bottleneck is a 500 ms cudaMemcpy between two 2 ms kernels, no amount of kernel optimization will help. Fix the data transfer first.
Watch for occupancy traps. Higher occupancy means more warps available to hide latency. But beyond a threshold (often 50-60%), additional warps compete for cache and shared memory, sometimes making performance worse. Profile both directions.
What comes next
The next article covers device and host function qualifiers: __device__, __host__, and __host__ __device__ functions, how the compiler generates code for each, and patterns for writing portable code that runs on both CPU and GPU.