Your first CUDA program: kernels, threads, and grids

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

You should be comfortable reading C/C++ code. GPU programming experience is not required; this article starts from scratch.

The CUDA execution model

A CUDA program runs on two processors at the same time. The CPU is called the host. The GPU is called the device. They have separate memory spaces connected by the PCIe bus (or NVLink on high-end systems).

The programming model is straightforward:

  1. The host allocates memory on the device.
  2. The host copies input data from host memory to device memory.
  3. The host launches a kernel (a function that runs on the GPU).
  4. The device executes that kernel across thousands of threads in parallel.
  5. The host copies results back from device memory to host memory.

Every CUDA program follows this pattern. There are no exceptions. Even the most complex deep learning frameworks do exactly these five steps under the hood.

sequenceDiagram
  participant Host as Host (CPU)
  participant DevMem as Device Memory (GPU)
  participant Kernel as GPU Cores

  Host->>Host: Prepare input data
  Host->>DevMem: cudaMalloc - allocate device memory
  Host->>DevMem: cudaMemcpy - host to device
  Host->>Kernel: Launch kernel with triple angle brackets
  Kernel->>Kernel: Thousands of threads execute in parallel
  Kernel->>DevMem: Threads write results to device memory
  Host->>Host: cudaDeviceSynchronize - wait for kernel to finish
  DevMem->>Host: cudaMemcpy - device to host
  Host->>Host: Use results, free memory

The host orchestrates. The device computes. Data crosses the PCIe bus twice: once going in, once coming back. Minimizing these transfers is a recurring theme in GPU programming.

The separation of host and device memory is the single most important concept to internalize. A pointer allocated with malloc lives on the host. A pointer allocated with cudaMalloc lives on the device. Mixing them up causes segfaults or silent corruption. The compiler will not save you.

Kernel functions: __global__

A CUDA kernel is a function decorated with the __global__ specifier. It runs on the device but is called from the host. When the host invokes a kernel, the GPU launches thousands of copies of that function in parallel, each copy running on a different thread.

Here is the simplest possible kernel:

%%writefile hello.cu
#include <cstdio>

// __global__ means: runs on device, called from host
__global__ void helloKernel() {
    printf("Hello from block %d, thread %d\n",
           blockIdx.x, threadIdx.x);  // each thread prints its own ID
}

int main() {
    helloKernel<<<2, 4>>>();  // launch 2 blocks of 4 threads = 8 threads total
    cudaDeviceSynchronize();  // wait for GPU to finish before exiting
    return 0;
}

Compile and run:

nvcc hello.cu -o hello
./hello

Output (order varies because threads run in parallel):

Hello from block 0, thread 0
Hello from block 0, thread 1
Hello from block 0, thread 2
Hello from block 0, thread 3
Hello from block 1, thread 0
Hello from block 1, thread 1
Hello from block 1, thread 2
Hello from block 1, thread 3

Three things to notice. First, __global__ marks the function as a kernel. Second, the <<<2, 4>>> syntax (called triple angle brackets or chevron syntax) specifies the launch configuration: 2 blocks, 4 threads per block. Third, cudaDeviceSynchronize blocks the host until all GPU work finishes. Without it, the program might exit before the GPU prints anything.

CUDA has three function specifiers:

| Specifier | Runs on | Called from | Use case |
|---|---|---|---|
| __global__ | Device | Host (or device, with dynamic parallelism) | Kernel entry points |
| __device__ | Device | Device only | Helper functions called by kernels |
| __host__ | Host | Host only | Regular CPU functions (the default) |

You can combine __host__ and __device__ on the same function to compile it for both. This is useful for utility functions like max or clamp that both sides need.

Thread hierarchy: blocks, threads, grids

When you write <<<numBlocks, threadsPerBlock>>>, you define a two-level hierarchy:

  • A grid is the entire collection of threads launched by one kernel call.
  • The grid is divided into blocks (also called thread blocks).
  • Each block contains a fixed number of threads.

flowchart TB
  G["Grid
numBlocks = 4"]
  G --> B0["Block 0
blockDim.x = 8"]
  G --> B1["Block 1
blockDim.x = 8"]
  G --> B2["Block 2
blockDim.x = 8"]
  G --> B3["Block 3
blockDim.x = 8"]

  B0 --> T0["threadIdx.x 0 ... 7"]
  B1 --> T1["threadIdx.x 0 ... 7"]
  B2 --> T2["threadIdx.x 0 ... 7"]
  B3 --> T3["threadIdx.x 0 ... 7"]

Example launch shown above: <<<4, 8>>> means 4 blocks × 8 threads per block = 32 total threads in one grid.

Every thread knows where it sits in this hierarchy through built-in variables:

| Variable | Meaning |
|---|---|
| threadIdx.x | Thread index within its block (0 to blockDim.x - 1) |
| blockIdx.x | Block index within the grid (0 to gridDim.x - 1) |
| blockDim.x | Number of threads per block |
| gridDim.x | Number of blocks in the grid |

These are read-only. The hardware sets them before your kernel code runs. Each thread gets unique values.

Example: computing a global thread ID

Launch configuration: <<<4, 8>>> (4 blocks, 8 threads per block, 32 threads total).

For the thread at blockIdx.x = 2, threadIdx.x = 5:

globalThreadId = blockIdx.x * blockDim.x + threadIdx.x
               = 2 * 8 + 5
               = 21

This formula maps every thread to a unique integer from 0 to 31. You use this global ID to index into arrays. Thread 21 processes element 21. Thread 0 processes element 0. No coordination needed.

The formula blockIdx.x * blockDim.x + threadIdx.x appears in virtually every CUDA kernel. Memorize it.
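The mapping is worth checking by hand once. Here is the same arithmetic run on the CPU in plain Python, purely as an illustration; the variable names mirror CUDA's built-ins:

```python
# Enumerate every (blockIdx.x, threadIdx.x) pair for a <<<4, 8>>> launch
# and compute the global thread ID exactly as a kernel would.
grid_dim = 4   # number of blocks (gridDim.x)
block_dim = 8  # threads per block (blockDim.x)

ids = []
for block_idx in range(grid_dim):
    for thread_idx in range(block_dim):
        ids.append(block_idx * block_dim + thread_idx)

print(sorted(ids) == list(range(32)))  # True: each ID 0..31 appears exactly once
print(2 * block_dim + 5)               # 21, matching the worked example above
```

Every thread lands on a distinct integer, with no gaps and no collisions, which is exactly what makes the formula safe to use as an array index.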

Example: mapping threads to a 1000-element array

You have an array of 1000 elements. You choose 256 threads per block (a common choice). How many blocks?

numBlocks = ceil(1000 / 256) = 4

Launch: <<<4, 256>>>. That gives you 4 * 256 = 1024 threads. But you only have 1000 elements. Threads 1000 through 1023 have nothing to do.

The standard pattern is a bounds check inside the kernel:

__global__ void process(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (idx < n) {         // boundary guard: skip threads past array end
        data[idx] *= 2.0f;
    }
}

Threads with idx >= 1000 hit the bounds check and exit immediately. They consume negligible resources. This is the standard way to handle arrays whose size is not a multiple of the block size.
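To see the guard in action without a GPU, here is a CPU simulation of the same launch in Python (illustrative only; real CUDA threads run concurrently, not in a loop):

```python
# Simulate a <<<4, 256>>> launch over a 1000-element array.
# Threads whose index falls past the end do nothing.
n = 1000
threads_per_block = 256
num_blocks = (n + threads_per_block - 1) // threads_per_block  # 4

data = [float(i) for i in range(n)]
touched = 0
for block_idx in range(num_blocks):
    for thread_idx in range(threads_per_block):
        idx = block_idx * threads_per_block + thread_idx
        if idx < n:          # the boundary guard from the kernel above
            data[idx] *= 2.0
            touched += 1

print(touched)                                    # 1000: one thread per element
print(num_blocks * threads_per_block - touched)   # 24 threads had nothing to do
```

Exactly one thread touches each element, and the 24 surplus threads fall through the guard untouched.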

Vector addition: the complete program

Vector addition is the “hello world” of GPU computing. Add two arrays element-wise: C[i] = A[i] + B[i]. It is trivially parallel because every element is independent.

Here is the full program with proper memory management and error checking:

%%writefile vec_add.cu
#include <cstdio>
#include <cstdlib>

// Macro to check CUDA API calls for errors
#define CUDA_CHECK(call) do { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    } \
} while(0)

// Kernel: each thread adds one pair of elements
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                      // bounds check
        c[i] = a[i] + b[i];
    }
}

int main() {
    int n = 1000000;
    size_t bytes = n * sizeof(float);

    // Allocate host memory
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);

    // Initialize host arrays
    for (int i = 0; i < n; i++) {
        h_a[i] = (float)i;
        h_b[i] = (float)(i * 2);
    }

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    CUDA_CHECK(cudaMalloc(&d_a, bytes));               // allocate on GPU
    CUDA_CHECK(cudaMalloc(&d_b, bytes));
    CUDA_CHECK(cudaMalloc(&d_c, bytes));

    // Copy input data from host to device
    CUDA_CHECK(cudaMemcpy(d_a, h_a, bytes,
                          cudaMemcpyHostToDevice));     // CPU -> GPU
    CUDA_CHECK(cudaMemcpy(d_b, h_b, bytes,
                          cudaMemcpyHostToDevice));

    // Launch kernel
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    vecAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Check for kernel launch errors
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Copy results back from device to host
    CUDA_CHECK(cudaMemcpy(h_c, d_c, bytes,
                          cudaMemcpyDeviceToHost));     // GPU -> CPU

    // Verify results
    int errors = 0;
    for (int i = 0; i < n; i++) {
        float expected = (float)i + (float)(i * 2);
        if (h_c[i] != expected) {
            errors++;
            if (errors <= 5) {
                printf("Mismatch at i=%d: got %f, expected %f\n",
                       i, h_c[i], expected);
            }
        }
    }

    if (errors == 0) {
        printf("All %d elements correct.\n", n);
    } else {
        printf("Found %d errors.\n", errors);
    }

    // Free device memory
    CUDA_CHECK(cudaFree(d_a));
    CUDA_CHECK(cudaFree(d_b));
    CUDA_CHECK(cudaFree(d_c));

    // Free host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}

Compile and run:

nvcc vec_add.cu -o vec_add
./vec_add

Expected output:

All 1000000 elements correct.

Walk through the code line by line. The host allocates arrays on both sides. It copies inputs to the GPU. The kernel launch <<<numBlocks, threadsPerBlock>>> creates roughly one million threads. Each thread computes one addition. Results are copied back and verified on the CPU.

The ceiling division (n + threadsPerBlock - 1) / threadsPerBlock ensures you launch enough blocks. For n=1,000,000 and threadsPerBlock=256: (1000000 + 255) / 256 = 1000255 / 256 = 3907 (integer division). That gives 3907 blocks × 256 threads = 1,000,192 total threads. We only need 1,000,000, so 192 threads have nothing to do. The bounds check if (i < n) in the kernel ensures those extra threads exit without touching memory. This pattern applies to every kernel you write: launch slightly more threads than you need and guard with a bounds check.
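The ceiling-division trick works for any n and block size. A quick CPU check (Python, just to verify the arithmetic) confirms it always launches enough threads without ever launching a whole extra block:

```python
def num_blocks(n, threads_per_block):
    # Integer ceiling division: rounds up without any floating point.
    return (n + threads_per_block - 1) // threads_per_block

print(num_blocks(1_000_000, 256))  # 3907, as in the walkthrough above
print(num_blocks(1000, 256))       # 4

# Property check: enough threads, but never a full surplus block.
for n in range(1, 10_000):
    total = num_blocks(n, 256) * 256
    assert n <= total < n + 256
```

The invariant in the loop is the whole point: total thread count is always at least n and always less than n plus one block, so the bounds check wastes at most one block's worth of idle threads.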

Error checking matters

Every cudaMalloc, cudaMemcpy, and cudaFree call returns a cudaError_t. Ignoring it is like ignoring the return value of malloc. The CUDA_CHECK macro above is the minimum viable approach: print the error string and exit.

Two calls deserve special attention:

  • cudaGetLastError() catches errors from the kernel launch itself (like invalid launch configuration).
  • cudaDeviceSynchronize() blocks until all GPU work finishes and surfaces any runtime errors that occurred during kernel execution.

Without cudaDeviceSynchronize, the host races ahead. That is useful for overlapping CPU work with GPU work, but during development, always synchronize and check errors after every kernel launch.

Running in Google Colab

If you do not have an NVIDIA GPU locally, Google Colab provides free access. Select a GPU runtime (Runtime > Change runtime type > T4 GPU), then use %%writefile to save .cu files and !nvcc to compile:

# In a Colab cell: write the file
# (paste the vec_add.cu code into a %%writefile cell)

# In the next cell: compile and run
!nvcc vec_add.cu -o vec_add && ./vec_add

The %%writefile magic command at the top of each code block saves the cell contents to a file. This is why the CUDA code samples above include it as the first line.

Vector addition in Python with Numba

If you prefer Python, the Numba library provides a @cuda.jit decorator that compiles Python functions into CUDA kernels. Here is the same vector addition:

import numpy as np
from numba import cuda

@cuda.jit                            # compile this function as a CUDA kernel
def vec_add_kernel(a, b, c):
    i = cuda.grid(1)                 # shorthand for blockIdx.x * blockDim.x + threadIdx.x
    if i < a.size:                   # bounds check, same pattern as C
        c[i] = a[i] + b[i]

n = 1_000_000
h_a = np.arange(n, dtype=np.float32)
h_b = np.arange(n, dtype=np.float32) * 2

# Transfer to device
d_a = cuda.to_device(h_a)           # host -> device
d_b = cuda.to_device(h_b)
d_c = cuda.device_array(n, dtype=np.float32)  # allocate output on device

# Launch kernel
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vec_add_kernel[blocks, threads_per_block](d_a, d_b, d_c)  # square-bracket launch syntax

# Copy back and verify
h_c = d_c.copy_to_host()            # device -> host
expected = h_a + h_b
assert np.allclose(h_c, expected), "Mismatch!"
print(f"All {n} elements correct. First 5: {h_c[:5]}")

The structure is identical to the C version. Allocate, transfer, launch, copy back, verify. Numba uses square brackets [blocks, threads_per_block] instead of triple angle brackets, but the mental model is the same.

cuda.grid(1) is shorthand for the global thread index formula. The 1 means one-dimensional. For 2D grids, use cuda.grid(2) which returns (x, y).

One important difference: Numba handles cudaMalloc and cudaMemcpy behind the scenes. If you pass a NumPy array directly, Numba will copy it to the device automatically. Explicit transfers with cuda.to_device give you more control and avoid redundant copies.

How threads actually execute

When you launch <<<4, 256>>>, the GPU does not run 1024 threads independently. The hardware groups threads into units of 32 called warps. All 32 threads in a warp execute the same instruction at the same time (SIMT: Single Instruction, Multiple Threads).

Each block of 256 threads contains 256 / 32 = 8 warps. The GPU scheduler assigns blocks to Streaming Multiprocessors (SMs). Each SM can run several blocks simultaneously, switching between warps whenever one stalls on memory.

You do not need to manage warps directly at this stage. But knowing they exist explains two things:

  1. Thread count should be a multiple of 32. Choosing 256, 128, or 512 threads per block keeps warps full. Choosing 100 wastes 28 slots in the last warp of every block.
  2. Branch divergence has a cost. If threads in the same warp take different paths in an if statement, the warp executes both paths serially and masks out the inactive threads. This is a performance topic for later articles.
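The waste from a block size that is not a multiple of 32 is easy to quantify (Python, illustrative arithmetic only):

```python
WARP_SIZE = 32

def warp_waste(threads_per_block):
    # The hardware allocates whole warps, rounding the block size up.
    warps = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    return warps * WARP_SIZE - threads_per_block  # idle thread slots per block

print(warp_waste(256))  # 0  -- 8 full warps, nothing wasted
print(warp_waste(128))  # 0  -- 4 full warps
print(warp_waste(100))  # 28 -- the fourth warp has only 4 active threads
```

A block of 100 threads still occupies 4 warps of hardware, so 28 of its 128 allocated thread slots sit idle in every block of the grid.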

Putting it together: the mental model

The entire CUDA execution model reduces to three ideas:

  1. Host and device are separate machines with separate memory. You must explicitly move data between them.
  2. A kernel is a function that runs on every thread. Each thread figures out which piece of data it owns using blockIdx.x * blockDim.x + threadIdx.x.
  3. You choose how many blocks and threads to launch. The hardware maps them to physical cores. More threads than cores is fine because the scheduler handles it.

That is the whole model. Everything else (shared memory, streams, cooperative groups, dynamic parallelism) builds on these three ideas.

Common mistakes

⚠ Forgetting cudaDeviceSynchronize. Kernel launches are asynchronous: control returns to the host immediately. A synchronous cudaMemcpy on the default stream does implicitly wait for prior kernels, which is why sloppy code sometimes works by accident, but printf output, timing code, and error reporting do not wait. Without a sync, kernel errors surface at some unpredictable later API call, making bugs intermittent and hard to trace.

⚠ Using host pointers on the device. Passing a malloc’d pointer to a kernel causes an illegal memory access. The kernel runs on the GPU, which cannot see host memory (unless you use unified memory, covered later).

⚠ Forgetting the bounds check. If your array has 1000 elements and you launch 1024 threads, threads 1000-1023 will read and write garbage memory. This corrupts other allocations and causes wrong results that are painful to debug.

⚠ Launching too few threads. A single block of 32 threads on an A100 with 108 SMs means 107 SMs sit completely idle. You want thousands of blocks to saturate the GPU.

✓ Always check error codes. Use the CUDA_CHECK macro or equivalent. Silent failures are the norm in CUDA, not the exception.

In practice

Production CUDA code follows the same host-device pattern shown here, but with additional layers:

  • Pinned (page-locked) memory with cudaMallocHost for faster PCIe transfers. Regular malloc memory must be copied to a pinned staging buffer by the driver before DMA transfer. Pinned memory skips that step.
  • Streams for overlapping data transfers with kernel execution. While one stream copies input for batch N+1, another stream computes batch N.
  • Unified memory with cudaMallocManaged which gives you a single pointer accessible from both host and device. The driver handles page migration automatically. Convenient for prototyping, but explicit memory management usually wins on performance.
  • Error checking wrappers that log to a proper logging framework instead of calling exit. In production, you want to recover from transient GPU errors, not crash the entire service.

The vector addition example is pedagogically useful but unrealistic as a standalone GPU workload. The arithmetic intensity (one addition per two memory reads and one write) is too low to benefit from GPU parallelism at small scales. The GPU shines when you have millions of elements and more compute-heavy operations like matrix multiplies, convolutions, or reductions.
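The arithmetic-intensity claim can be made concrete with a back-of-the-envelope roofline estimate. The numbers below for an A100 are illustrative assumptions, not measurements:

```python
# Arithmetic intensity of float32 vector addition: per element,
# one add against two 4-byte loads and one 4-byte store.
flops_per_element = 1
bytes_per_element = 3 * 4  # 12 bytes of memory traffic per element

intensity = flops_per_element / bytes_per_element
print(f"{intensity:.4f} FLOP/byte")  # ~0.0833

# Rough roofline ridge point (assumed A100-class figures:
# ~19.5 TFLOP/s FP32 peak, ~1.5 TB/s HBM bandwidth):
ridge_point = 19.5e12 / 1.5e12  # ~13 FLOP/byte
print(intensity < ridge_point)  # True: firmly memory-bandwidth bound
```

At roughly 0.08 FLOP per byte against a ridge point around 13, vector addition is limited entirely by memory bandwidth; the GPU's arithmetic units are nearly idle, which is why the transfers, not the kernel, dominate its runtime.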

When profiling, use nsys (NVIDIA Nsight Systems) to see the timeline of transfers and kernel execution. The first thing you will notice is that cudaMemcpy often dominates the total runtime. Reducing host-device transfers is usually the highest-impact optimization.

What comes next

You now know how to write, compile, and run a CUDA kernel. You understand the host-device split, the launch syntax, and how threads map to data.

The next article, CUDA thread hierarchy in depth, goes deeper into the thread model: multi-dimensional grids and blocks, the warp as the unit of execution, shared memory within a block, and synchronization primitives. That is where you start writing kernels that cooperate, not just kernels where every thread works alone.
