Your first CUDA program: kernels, threads, and grids

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

You should be comfortable reading C/C++ code. GPU programming experience is not required; this article starts from scratch.

The CUDA execution model

A CUDA program runs on two processors at the same time. The CPU is called the host. The GPU is called the device. They have separate memory spaces connected by the PCIe bus (or NVLink on high-end systems).

The programming model is straightforward:

  1. The host allocates memory on the device.
  2. The host copies input data from host memory to device memory.
  3. The host launches a kernel (a function that runs on the GPU).
  4. The device executes that kernel across thousands of threads in parallel.
  5. The host copies results back from device memory to host memory.

Every CUDA program follows this pattern. There are no exceptions. Even the most complex deep learning frameworks do exactly these five steps under the hood.

sequenceDiagram
  participant Host as Host (CPU)
  participant DevMem as Device Memory (GPU)
  participant Kernel as GPU Cores

  Host->>Host: Prepare input data
  Host->>DevMem: cudaMalloc - allocate device memory
  Host->>DevMem: cudaMemcpy - host to device
  Host->>Kernel: Launch kernel with triple angle brackets
  Kernel->>Kernel: Thousands of threads execute in parallel
  Kernel->>DevMem: Threads write results to device memory
  Host->>Host: cudaDeviceSynchronize - wait for kernel to finish
  DevMem->>Host: cudaMemcpy - device to host
  Host->>Host: Use results, free memory

The host orchestrates. The device computes. Data crosses the PCIe bus twice: once going in, once coming back. Minimizing these transfers is a recurring theme in GPU programming.

The separation of host and device memory is the single most important concept to internalize. A pointer allocated with malloc lives on the host. A pointer allocated with cudaMalloc lives on the device. Mixing them up causes segfaults or silent corruption. The compiler will not save you.

Kernel functions: __global__

A CUDA kernel is a function decorated with the __global__ specifier. It runs on the device but is called from the host. When the host invokes a kernel, the GPU launches thousands of copies of that function in parallel, each copy running on a different thread.

Here is the simplest possible kernel:

%%writefile hello.cu
#include <cstdio>

// __global__ means: runs on device, called from host
__global__ void helloKernel() {
    printf("Hello from block %d, thread %d\n",
           blockIdx.x, threadIdx.x);  // each thread prints its own ID
}

int main() {
    helloKernel<<<2, 4>>>();  // launch 2 blocks of 4 threads = 8 threads total
    cudaDeviceSynchronize();  // wait for GPU to finish before exiting
    return 0;
}

Compile and run:

nvcc hello.cu -o hello
./hello

Output (order varies because threads run in parallel):

Hello from block 0, thread 0
Hello from block 0, thread 1
Hello from block 0, thread 2
Hello from block 0, thread 3
Hello from block 1, thread 0
Hello from block 1, thread 1
Hello from block 1, thread 2
Hello from block 1, thread 3

Three things to notice. First, __global__ marks the function as a kernel. Second, the <<<2, 4>>> syntax (called triple angle brackets or chevron syntax) specifies the launch configuration: 2 blocks, 4 threads per block. Third, cudaDeviceSynchronize blocks the host until all GPU work finishes. Without it, the program might exit before the GPU prints anything.

CUDA has three function specifiers:

| Specifier | Runs on | Called from | Use case |
|---|---|---|---|
| __global__ | Device | Host (or device, with dynamic parallelism) | Kernel entry points |
| __device__ | Device | Device only | Helper functions called by kernels |
| __host__ | Host | Host only | Regular CPU functions (the default) |

You can combine __host__ and __device__ on the same function to compile it for both. This is useful for utility functions like max or clamp that both sides need.

Thread hierarchy: blocks, threads, grids

When you write <<<numBlocks, threadsPerBlock>>>, you define a two-level hierarchy:

  • A grid is the entire collection of threads launched by one kernel call.
  • The grid is divided into blocks (also called thread blocks).
  • Each block contains a fixed number of threads.

flowchart TB
  G["Grid
numBlocks = 4"]
  G --> B0["Block 0
blockDim.x = 8"]
  G --> B1["Block 1
blockDim.x = 8"]
  G --> B2["Block 2
blockDim.x = 8"]
  G --> B3["Block 3
blockDim.x = 8"]

  B0 --> T0["threadIdx.x 0 ... 7"]
  B1 --> T1["threadIdx.x 0 ... 7"]
  B2 --> T2["threadIdx.x 0 ... 7"]
  B3 --> T3["threadIdx.x 0 ... 7"]

Example launch shown above: <<<4, 8>>> means 4 blocks × 8 threads per block = 32 total threads in one grid.

Every thread knows where it sits in this hierarchy through built-in variables:

| Variable | Meaning |
|---|---|
| threadIdx.x | Thread index within its block (0 to blockDim.x - 1) |
| blockIdx.x | Block index within the grid (0 to gridDim.x - 1) |
| blockDim.x | Number of threads per block |
| gridDim.x | Number of blocks in the grid |

These are read-only. The hardware sets them before your kernel code runs. Each thread gets unique values.

Example: computing a global thread ID

Launch configuration: <<<4, 8>>> (4 blocks, 8 threads per block, 32 threads total).

For the thread at blockIdx.x = 2, threadIdx.x = 5:

globalThreadId = blockIdx.x * blockDim.x + threadIdx.x
               = 2 * 8 + 5
               = 21

This formula maps every thread to a unique integer from 0 to 31. You use this global ID to index into arrays. Thread 21 processes element 21. Thread 0 processes element 0. No coordination needed.

The formula blockIdx.x * blockDim.x + threadIdx.x appears in virtually every CUDA kernel. Memorize it.
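The mapping is worth checking by hand once. Here is the same arithmetic run on the CPU in plain Python, purely as an illustration; the variable names mirror CUDA's built-ins:

```python
# Enumerate every (blockIdx.x, threadIdx.x) pair for a <<<4, 8>>> launch
# and compute the global thread ID exactly as a kernel would.
grid_dim = 4   # number of blocks (gridDim.x)
block_dim = 8  # threads per block (blockDim.x)

ids = []
for block_idx in range(grid_dim):
    for thread_idx in range(block_dim):
        ids.append(block_idx * block_dim + thread_idx)

print(sorted(ids) == list(range(32)))  # True: each ID 0..31 appears exactly once
print(2 * block_dim + 5)               # 21, matching the worked example above
```

Every thread lands on a distinct integer, with no gaps and no collisions, which is exactly what makes the formula safe to use as an array index.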

Example: mapping threads to a 1000-element array

You have an array of 1000 elements. You choose 256 threads per block (a common choice). How many blocks?

numBlocks = ceil(1000 / 256) = 4

Launch: <<<4, 256>>>. That gives you 4 * 256 = 1024 threads. But you only have 1000 elements. Threads 1000 through 1023 have nothing to do.

The standard pattern is a bounds check inside the kernel:

__global__ void process(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (idx < n) {         // boundary guard: skip threads past array end
        data[idx] *= 2.0f;
    }
}

Threads with idx >= 1000 hit the bounds check and exit immediately. They consume negligible resources. This is the standard way to handle arrays whose size is not a multiple of the block size.
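To see the guard in action without a GPU, here is a CPU simulation of the same launch in Python (illustrative only; real CUDA threads run concurrently, not in a loop):

```python
# Simulate a <<<4, 256>>> launch over a 1000-element array.
# Threads whose index falls past the end do nothing.
n = 1000
threads_per_block = 256
num_blocks = (n + threads_per_block - 1) // threads_per_block  # 4

data = [float(i) for i in range(n)]
touched = 0
for block_idx in range(num_blocks):
    for thread_idx in range(threads_per_block):
        idx = block_idx * threads_per_block + thread_idx
        if idx < n:          # the boundary guard from the kernel above
            data[idx] *= 2.0
            touched += 1

print(touched)                                    # 1000: one thread per element
print(num_blocks * threads_per_block - touched)   # 24 threads had nothing to do
```

Exactly one thread touches each element, and the 24 surplus threads fall through the guard untouched.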

Vector addition: the complete program

Vector addition is the “hello world” of GPU computing. Add two arrays element-wise: C[i] = A[i] + B[i]. It is trivially parallel because every element is independent.

Here is the full program with proper memory management and error checking:

%%writefile vec_add.cu
#include <cstdio>
#include <cstdlib>

// Macro to check CUDA API calls for errors
#define CUDA_CHECK(call) do { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    } \
} while(0)

// Kernel: each thread adds one pair of elements
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                      // bounds check
        c[i] = a[i] + b[i];
    }
}

int main() {
    int n = 1000000;
    size_t bytes = n * sizeof(float);

    // Allocate host memory
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);

    // Initialize host arrays
    for (int i = 0; i < n; i++) {
        h_a[i] = (float)i;
        h_b[i] = (float)(i * 2);
    }

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    CUDA_CHECK(cudaMalloc(&d_a, bytes));               // allocate on GPU
    CUDA_CHECK(cudaMalloc(&d_b, bytes));
    CUDA_CHECK(cudaMalloc(&d_c, bytes));

    // Copy input data from host to device
    CUDA_CHECK(cudaMemcpy(d_a, h_a, bytes,
                          cudaMemcpyHostToDevice));     // CPU -> GPU
    CUDA_CHECK(cudaMemcpy(d_b, h_b, bytes,
                          cudaMemcpyHostToDevice));

    // Launch kernel
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    vecAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Check for kernel launch errors
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Copy results back from device to host
    CUDA_CHECK(cudaMemcpy(h_c, d_c, bytes,
                          cudaMemcpyDeviceToHost));     // GPU -> CPU

    // Verify results
    int errors = 0;
    for (int i = 0; i < n; i++) {
        float expected = (float)i + (float)(i * 2);
        if (h_c[i] != expected) {
            errors++;
            if (errors <= 5) {
                printf("Mismatch at i=%d: got %f, expected %f\n",
                       i, h_c[i], expected);
            }
        }
    }

    if (errors == 0) {
        printf("All %d elements correct.\n", n);
    } else {
        printf("Found %d errors.\n", errors);
    }

    // Free device memory
    CUDA_CHECK(cudaFree(d_a));
    CUDA_CHECK(cudaFree(d_b));
    CUDA_CHECK(cudaFree(d_c));

    // Free host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}

Compile and run:

nvcc vec_add.cu -o vec_add
./vec_add

Expected output:

All 1000000 elements correct.

Walk through the code line by line. The host allocates arrays on both sides. It copies inputs to the GPU. The kernel launch <<<numBlocks, threadsPerBlock>>> creates roughly one million threads. Each thread computes one addition. Results are copied back and verified on the CPU.

The ceiling division (n + threadsPerBlock - 1) / threadsPerBlock ensures you launch enough blocks. For n=1,000,000 and threadsPerBlock=256: (1000000 + 255) / 256 = 1000255 / 256 = 3907 (integer division). That gives 3907 blocks × 256 threads = 1,000,192 total threads. We only need 1,000,000, so 192 threads have nothing to do. The bounds check if (i < n) in the kernel ensures those extra threads exit without touching memory. This pattern applies to every kernel you write: launch slightly more threads than you need and guard with a bounds check.
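The ceiling-division trick works for any n and block size. A quick CPU check (Python, just to verify the arithmetic) confirms it always launches enough threads without ever launching a whole extra block:

```python
def num_blocks(n, threads_per_block):
    # Integer ceiling division: rounds up without any floating point.
    return (n + threads_per_block - 1) // threads_per_block

print(num_blocks(1_000_000, 256))  # 3907, as in the walkthrough above
print(num_blocks(1000, 256))       # 4

# Property check: enough threads, but never a full surplus block.
for n in range(1, 10_000):
    total = num_blocks(n, 256) * 256
    assert n <= total < n + 256
```

The invariant in the loop is the whole point: total thread count is always at least n and always less than n plus one block, so the bounds check wastes at most one block's worth of idle threads.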

Error checking matters

Every cudaMalloc, cudaMemcpy, and cudaFree call returns a cudaError_t. Ignoring it is like ignoring the return value of malloc. The CUDA_CHECK macro above is the minimum viable approach: print the error string and exit.

Two calls deserve special attention:

  • cudaGetLastError() catches errors from the kernel launch itself (like invalid launch configuration).
  • cudaDeviceSynchronize() blocks until all GPU work finishes and surfaces any runtime errors that occurred during kernel execution.

Without cudaDeviceSynchronize, the host races ahead. That is useful for overlapping CPU work with GPU work, but during development, always synchronize and check errors after every kernel launch.

Running in Google Colab

If you do not have an NVIDIA GPU locally, Google Colab provides free access. Select a GPU runtime (Runtime > Change runtime type > T4 GPU), then use %%writefile to save .cu files and !nvcc to compile:

# In a Colab cell: write the file
# (paste the vec_add.cu code into a %%writefile cell)

# In the next cell: compile and run
!nvcc vec_add.cu -o vec_add && ./vec_add

The %%writefile magic command at the top of each code block saves the cell contents to a file. This is why the CUDA code samples above include it as the first line.

Vector addition in Python with Numba

If you prefer Python, the Numba library provides a @cuda.jit decorator that compiles Python functions into CUDA kernels. Here is the same vector addition:

import numpy as np
from numba import cuda

@cuda.jit                            # compile this function as a CUDA kernel
def vec_add_kernel(a, b, c):
    i = cuda.grid(1)                 # shorthand for blockIdx.x * blockDim.x + threadIdx.x
    if i < a.size:                   # bounds check, same pattern as C
        c[i] = a[i] + b[i]

n = 1_000_000
h_a = np.arange(n, dtype=np.float32)
h_b = np.arange(n, dtype=np.float32) * 2

# Transfer to device
d_a = cuda.to_device(h_a)           # host -> device
d_b = cuda.to_device(h_b)
d_c = cuda.device_array(n, dtype=np.float32)  # allocate output on device

# Launch kernel
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vec_add_kernel[blocks, threads_per_block](d_a, d_b, d_c)  # square-bracket launch syntax

# Copy back and verify
h_c = d_c.copy_to_host()            # device -> host
expected = h_a + h_b
assert np.allclose(h_c, expected), "Mismatch!"
print(f"All {n} elements correct. First 5: {h_c[:5]}")

The structure is identical to the C version. Allocate, transfer, launch, copy back, verify. Numba uses square brackets [blocks, threads_per_block] instead of triple angle brackets, but the mental model is the same.

cuda.grid(1) is shorthand for the global thread index formula. The 1 means one-dimensional. For 2D grids, use cuda.grid(2) which returns (x, y).

One important difference: Numba handles cudaMalloc and cudaMemcpy behind the scenes. If you pass a NumPy array directly, Numba will copy it to the device automatically. Explicit transfers with cuda.to_device give you more control and avoid redundant copies.

How threads actually execute

When you launch <<<4, 256>>>, the GPU does not run 1024 threads independently. The hardware groups threads into units of 32 called warps. All 32 threads in a warp execute the same instruction at the same time (SIMT: Single Instruction, Multiple Threads).

Each block of 256 threads contains 256 / 32 = 8 warps. The GPU scheduler assigns blocks to Streaming Multiprocessors (SMs). Each SM can run several blocks simultaneously, switching between warps whenever one stalls on memory.

You do not need to manage warps directly at this stage. But knowing they exist explains two things:

  1. Thread count should be a multiple of 32. Choosing 256, 128, or 512 threads per block keeps warps full. Choosing 100 wastes 28 slots in the last warp of every block.
  2. Branch divergence has a cost. If threads in the same warp take different paths in an if statement, the warp executes both paths serially and masks out the inactive threads. This is a performance topic for later articles.
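The waste from a block size that is not a multiple of 32 is easy to quantify (Python, illustrative arithmetic only):

```python
WARP_SIZE = 32

def warp_waste(threads_per_block):
    # The hardware allocates whole warps, rounding the block size up.
    warps = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    return warps * WARP_SIZE - threads_per_block  # idle thread slots per block

print(warp_waste(256))  # 0  -- 8 full warps, nothing wasted
print(warp_waste(128))  # 0  -- 4 full warps
print(warp_waste(100))  # 28 -- the fourth warp has only 4 active threads
```

A block of 100 threads still occupies 4 warps of hardware, so 28 of its 128 allocated thread slots sit idle in every block of the grid.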

Putting it together: the mental model

The entire CUDA execution model reduces to three ideas:

  1. Host and device are separate machines with separate memory. You must explicitly move data between them.
  2. A kernel is a function that runs on every thread. Each thread figures out which piece of data it owns using blockIdx.x * blockDim.x + threadIdx.x.
  3. You choose how many blocks and threads to launch. The hardware maps them to physical cores. More threads than cores is fine because the scheduler handles it.

That is the whole model. Everything else (shared memory, streams, cooperative groups, dynamic parallelism) builds on these three ideas.

Common mistakes

⚠ Forgetting cudaDeviceSynchronize. Kernel launches are asynchronous: control returns to the host immediately. A synchronous cudaMemcpy on the default stream does implicitly wait for prior kernels, which is why sloppy code sometimes works by accident, but printf output, timing code, and error reporting do not wait. Without a sync, kernel errors surface at some unpredictable later API call, making bugs intermittent and hard to trace.

⚠ Using host pointers on the device. Passing a malloc’d pointer to a kernel causes an illegal memory access. The kernel runs on the GPU, which cannot see host memory (unless you use unified memory, covered later).

⚠ Forgetting the bounds check. If your array has 1000 elements and you launch 1024 threads, threads 1000-1023 will read and write garbage memory. This corrupts other allocations and causes wrong results that are painful to debug.

⚠ Launching too few threads. A single block of 32 threads on an A100 with 108 SMs means 107 SMs sit completely idle. You want thousands of blocks to saturate the GPU.

✓ Always check error codes. Use the CUDA_CHECK macro or equivalent. Silent failures are the norm in CUDA, not the exception.

In practice

Production CUDA code follows the same host-device pattern shown here, but with additional layers:

  • Pinned (page-locked) memory with cudaMallocHost for faster PCIe transfers. Regular malloc memory must be copied to a pinned staging buffer by the driver before DMA transfer. Pinned memory skips that step.
  • Streams for overlapping data transfers with kernel execution. While one stream copies input for batch N+1, another stream computes batch N.
  • Unified memory with cudaMallocManaged which gives you a single pointer accessible from both host and device. The driver handles page migration automatically. Convenient for prototyping, but explicit memory management usually wins on performance.
  • Error checking wrappers that log to a proper logging framework instead of calling exit. In production, you want to recover from transient GPU errors, not crash the entire service.

The vector addition example is pedagogically useful but unrealistic as a standalone GPU workload. The arithmetic intensity (one addition per two memory reads and one write) is too low to benefit from GPU parallelism at small scales. The GPU shines when you have millions of elements and more compute-heavy operations like matrix multiplies, convolutions, or reductions.
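The arithmetic-intensity claim can be made concrete with a back-of-the-envelope roofline estimate. The numbers below for an A100 are illustrative assumptions, not measurements:

```python
# Arithmetic intensity of float32 vector addition: per element,
# one add against two 4-byte loads and one 4-byte store.
flops_per_element = 1
bytes_per_element = 3 * 4  # 12 bytes of memory traffic per element

intensity = flops_per_element / bytes_per_element
print(f"{intensity:.4f} FLOP/byte")  # ~0.0833

# Rough roofline ridge point (assumed A100-class figures:
# ~19.5 TFLOP/s FP32 peak, ~1.5 TB/s HBM bandwidth):
ridge_point = 19.5e12 / 1.5e12  # ~13 FLOP/byte
print(intensity < ridge_point)  # True: firmly memory-bandwidth bound
```

At roughly 0.08 FLOP per byte against a ridge point around 13, vector addition is limited entirely by memory bandwidth; the GPU's arithmetic units are nearly idle, which is why the transfers, not the kernel, dominate its runtime.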

When profiling, use nsys (NVIDIA Nsight Systems) to see the timeline of transfers and kernel execution. The first thing you will notice is that cudaMemcpy often dominates the total runtime. Reducing host-device transfers is usually the highest-impact optimization.

What comes next

You now know how to write, compile, and run a CUDA kernel. You understand the host-device split, the launch syntax, and how threads map to data.

The next article, CUDA thread hierarchy in depth, goes deeper into the thread model: multi-dimensional grids and blocks, the warp as the unit of execution, shared memory within a block, and synchronization primitives. That is where you start writing kernels that cooperate, not just kernels where every thread works alone.
