Your first CUDA program: kernels, threads, and grids
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read the following:
- Introduction to CUDA and its history for why GPUs became programmable, the CUDA platform overview, and the software stack.
You should be comfortable reading C/C++ code. GPU programming experience is not required. This article starts from scratch.
The CUDA execution model
A CUDA program runs on two processors at the same time. The CPU is called the host. The GPU is called the device. They have separate memory spaces connected by the PCIe bus (or NVLink on high-end systems).
The programming model is straightforward:
- The host allocates memory on the device.
- The host copies input data from host memory to device memory.
- The host launches a kernel (a function that runs on the GPU).
- The device executes that kernel across thousands of threads in parallel.
- The host copies results back from device memory to host memory.
Every CUDA program follows this pattern. There are no exceptions. Even the most complex deep learning frameworks do exactly these five steps under the hood.
sequenceDiagram
    participant Host as Host (CPU)
    participant DevMem as Device Memory (GPU)
    participant Kernel as GPU Cores
    Host->>Host: Prepare input data
    Host->>DevMem: cudaMalloc - allocate device memory
    Host->>DevMem: cudaMemcpy - host to device
    Host->>Kernel: Launch kernel with triple angle brackets
    Kernel->>Kernel: Thousands of threads execute in parallel
    Kernel->>DevMem: Threads write results to device memory
    Host->>Host: cudaDeviceSynchronize - wait for GPU to finish
    DevMem->>Host: cudaMemcpy - device to host
    Host->>Host: Use results, free memory
The host orchestrates. The device computes. Data crosses the PCIe bus twice: once going in, once coming back. Minimizing these transfers is a recurring theme in GPU programming.
The separation of host and device memory is the single most important concept to internalize. A pointer allocated with malloc lives on the host. A pointer allocated with cudaMalloc lives on the device. Mixing them up causes segfaults or silent corruption. The compiler will not save you.
Kernel functions: __global__
A CUDA kernel is a function decorated with the __global__ specifier. It runs on the device but is called from the host. When the host invokes a kernel, the GPU launches thousands of copies of that function in parallel, each copy running on a different thread.
Here is the simplest possible kernel:
%%writefile hello.cu
#include <cstdio>
// __global__ means: runs on device, called from host
__global__ void helloKernel() {
printf("Hello from block %d, thread %d\n",
blockIdx.x, threadIdx.x); // each thread prints its own ID
}
int main() {
helloKernel<<<2, 4>>>(); // launch 2 blocks of 4 threads = 8 threads total
cudaDeviceSynchronize(); // wait for GPU to finish before exiting
return 0;
}
Compile and run:
nvcc hello.cu -o hello
./hello
Output (order varies because threads run in parallel):
Hello from block 0, thread 0
Hello from block 0, thread 1
Hello from block 0, thread 2
Hello from block 0, thread 3
Hello from block 1, thread 0
Hello from block 1, thread 1
Hello from block 1, thread 2
Hello from block 1, thread 3
Three things to notice. First, __global__ marks the function as a kernel. Second, the <<<2, 4>>> syntax (called triple angle brackets or chevron syntax) specifies the launch configuration: 2 blocks, 4 threads per block. Third, cudaDeviceSynchronize blocks the host until all GPU work finishes. Without it, the program might exit before the GPU prints anything.
CUDA has three function specifiers:
| Specifier | Runs on | Called from | Use case |
|---|---|---|---|
| __global__ | Device | Host (or device in dynamic parallelism) | Kernel entry points |
| __device__ | Device | Device only | Helper functions called by kernels |
| __host__ | Host | Host only | Regular CPU functions (default) |
You can combine __host__ and __device__ on the same function to compile it for both. This is useful for utility functions like max or clamp that both sides need.
Thread hierarchy: blocks, threads, grids
When you write <<<numBlocks, threadsPerBlock>>>, you define a two-level hierarchy:
- A grid is the entire collection of threads launched by one kernel call.
- The grid is divided into blocks (also called thread blocks).
- Each block contains a fixed number of threads.
flowchart TB
    G["Grid numBlocks = 4"]
    G --> B0["Block 0 blockDim.x = 8"]
    G --> B1["Block 1 blockDim.x = 8"]
    G --> B2["Block 2 blockDim.x = 8"]
    G --> B3["Block 3 blockDim.x = 8"]
    B0 --> T0["threadIdx.x 0 ... 7"]
    B1 --> T1["threadIdx.x 0 ... 7"]
    B2 --> T2["threadIdx.x 0 ... 7"]
    B3 --> T3["threadIdx.x 0 ... 7"]
Example launch shown above: <<<4, 8>>> means 4 blocks × 8 threads per block = 32 total threads in one grid.
Every thread knows where it sits in this hierarchy through built-in variables:
| Variable | Meaning |
|---|---|
| threadIdx.x | Thread index within its block (0 to threadsPerBlock-1) |
| blockIdx.x | Block index within the grid (0 to numBlocks-1) |
| blockDim.x | Number of threads per block |
| gridDim.x | Number of blocks in the grid |
These are read-only. The hardware sets them before your kernel code runs. Each thread gets unique values.
Example: computing a global thread ID
Launch configuration: <<<4, 8>>> (4 blocks, 8 threads per block, 32 threads total).
For the thread at blockIdx.x = 2, threadIdx.x = 5:
globalThreadId = blockIdx.x * blockDim.x + threadIdx.x
= 2 * 8 + 5
= 21
This formula maps every thread to a unique integer from 0 to 31. You use this global ID to index into arrays. Thread 21 processes element 21. Thread 0 processes element 0. No coordination needed.
The formula blockIdx.x * blockDim.x + threadIdx.x appears in virtually every CUDA kernel. Memorize it.
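Because the formula is plain integer arithmetic, you can sanity-check it anywhere. Here is a small Python sketch that mirrors the C expression (the function name global_thread_id is ours, not a CUDA API):

```python
def global_thread_id(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Mirror of blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx

# The worked example above: block 2, thread 5, 8 threads per block.
print(global_thread_id(2, 8, 5))  # 21

# For a <<<4, 8>>> launch, the formula assigns every thread a unique
# global ID covering 0..31 with no gaps and no collisions.
ids = [global_thread_id(b, 8, t) for b in range(4) for t in range(8)]
assert sorted(ids) == list(range(32))
```

The uniqueness check is the whole point: no two threads ever compute the same ID, which is why they can partition an array without any coordination.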
Example: mapping threads to a 1000-element array
You have an array of 1000 elements. You choose 256 threads per block (a common choice). How many blocks?
numBlocks = ceil(1000 / 256) = 4
Launch: <<<4, 256>>>. That gives you 4 * 256 = 1024 threads. But you only have 1000 elements. Threads 1000 through 1023 have nothing to do.
The standard pattern is a bounds check inside the kernel:
__global__ void process(float *data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x; // global thread ID
if (idx < n) { // boundary guard: skip threads past array end
data[idx] *= 2.0f;
}
}
Threads with idx >= 1000 hit the bounds check and exit immediately. They consume negligible resources. This is the standard way to handle arrays whose size is not a multiple of the block size.
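The launch math and the bounds check can be simulated together in plain Python, no GPU required (ceil_div and the loop below are our stand-ins for the launch configuration and the thread grid):

```python
def ceil_div(n: int, d: int) -> int:
    # Integer ceiling division: same as (n + d - 1) / d in C.
    return (n + d - 1) // d

n = 1000
threads_per_block = 256
num_blocks = ceil_div(n, threads_per_block)
assert num_blocks == 4                       # ceil(1000 / 256)
total_threads = num_blocks * threads_per_block
assert total_threads == 1024                 # threads 1000..1023 are extras

# Simulate the kernel: every "thread" runs, the guard filters the extras.
data = [1.0] * total_threads                 # pretend the buffer were larger
for idx in range(total_threads):             # one loop iteration per thread
    if idx < n:                              # the bounds check
        data[idx] *= 2.0

assert all(x == 2.0 for x in data[:n])       # first 1000 elements doubled
assert all(x == 1.0 for x in data[n:])       # extras never touched memory
```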
Vector addition: the complete program
Vector addition is the “hello world” of GPU computing. Add two arrays element-wise: C[i] = A[i] + B[i]. It is trivially parallel because every element is independent.
Here is the full program with proper memory management and error checking:
%%writefile vec_add.cu
#include <cstdio>
#include <cstdlib>
// Macro to check CUDA API calls for errors
#define CUDA_CHECK(call) do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
fprintf(stderr, "CUDA error at %s:%d: %s\n", \
__FILE__, __LINE__, cudaGetErrorString(err)); \
exit(EXIT_FAILURE); \
} \
} while(0)
// Kernel: each thread adds one pair of elements
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
if (i < n) { // bounds check
c[i] = a[i] + b[i];
}
}
int main() {
int n = 1000000;
size_t bytes = n * sizeof(float);
// Allocate host memory
float *h_a = (float *)malloc(bytes);
float *h_b = (float *)malloc(bytes);
float *h_c = (float *)malloc(bytes);
// Initialize host arrays
for (int i = 0; i < n; i++) {
h_a[i] = (float)i;
h_b[i] = (float)(i * 2);
}
// Allocate device memory
float *d_a, *d_b, *d_c;
CUDA_CHECK(cudaMalloc(&d_a, bytes)); // allocate on GPU
CUDA_CHECK(cudaMalloc(&d_b, bytes));
CUDA_CHECK(cudaMalloc(&d_c, bytes));
// Copy input data from host to device
CUDA_CHECK(cudaMemcpy(d_a, h_a, bytes,
cudaMemcpyHostToDevice)); // CPU -> GPU
CUDA_CHECK(cudaMemcpy(d_b, h_b, bytes,
cudaMemcpyHostToDevice));
// Launch kernel
int threadsPerBlock = 256;
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock; // ceiling division
vecAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
// Check for kernel launch errors
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());
// Copy results back from device to host
CUDA_CHECK(cudaMemcpy(h_c, d_c, bytes,
cudaMemcpyDeviceToHost)); // GPU -> CPU
// Verify results
int errors = 0;
for (int i = 0; i < n; i++) {
float expected = (float)i + (float)(i * 2);
if (h_c[i] != expected) {
errors++;
if (errors <= 5) {
printf("Mismatch at i=%d: got %f, expected %f\n",
i, h_c[i], expected);
}
}
}
if (errors == 0) {
printf("All %d elements correct.\n", n);
} else {
printf("Found %d errors.\n", errors);
}
// Free device memory
CUDA_CHECK(cudaFree(d_a));
CUDA_CHECK(cudaFree(d_b));
CUDA_CHECK(cudaFree(d_c));
// Free host memory
free(h_a);
free(h_b);
free(h_c);
return 0;
}
Compile and run:
nvcc vec_add.cu -o vec_add
./vec_add
Expected output:
All 1000000 elements correct.
Walk through the flow. The host allocates arrays on both sides. It copies inputs to the GPU. The kernel launch <<<numBlocks, threadsPerBlock>>> creates roughly one million threads. Each thread computes one addition. Results are copied back and verified on the CPU.
The ceiling division (n + threadsPerBlock - 1) / threadsPerBlock ensures you launch enough blocks. For n=1,000,000 and threadsPerBlock=256: (1000000 + 255) / 256 = 1000255 / 256 = 3907 (integer division). That gives 3907 blocks × 256 threads = 1,000,192 total threads. We only need 1,000,000, so 192 threads have nothing to do. The bounds check if (i < n) in the kernel ensures those extra threads exit without touching memory. This pattern applies to every kernel you write: launch slightly more threads than you need and guard with a bounds check.
Error checking matters
Every cudaMalloc, cudaMemcpy, and cudaFree call returns a cudaError_t. Ignoring it is like ignoring the return value of malloc. The CUDA_CHECK macro above is the minimum viable approach: print the error string and exit.
Two calls deserve special attention:
- cudaGetLastError() catches errors from the kernel launch itself (like an invalid launch configuration).
- cudaDeviceSynchronize() blocks until all GPU work finishes and surfaces any runtime errors that occurred during kernel execution.
Without cudaDeviceSynchronize, the host races ahead. That is useful for overlapping CPU work with GPU work, but during development, always synchronize and check errors after every kernel launch.
Running in Google Colab
If you do not have an NVIDIA GPU locally, Google Colab provides free access. Select a GPU runtime (Runtime > Change runtime type > T4 GPU), then use %%writefile to save .cu files and !nvcc to compile:
# In a Colab cell: write the file
# (paste the vec_add.cu code into a %%writefile cell)
# In the next cell: compile and run
!nvcc vec_add.cu -o vec_add && ./vec_add
The %%writefile magic command at the top of each code block saves the cell contents to a file. This is why the CUDA code samples above include it as the first line.
Vector addition in Python with Numba
If you prefer Python, the Numba library provides a @cuda.jit decorator that compiles Python functions into CUDA kernels. Here is the same vector addition:
import numpy as np
from numba import cuda
@cuda.jit # compile this function as a CUDA kernel
def vec_add_kernel(a, b, c):
i = cuda.grid(1) # shorthand for blockIdx.x * blockDim.x + threadIdx.x
if i < a.size: # bounds check, same pattern as C
c[i] = a[i] + b[i]
n = 1_000_000
h_a = np.arange(n, dtype=np.float32)
h_b = np.arange(n, dtype=np.float32) * 2
# Transfer to device
d_a = cuda.to_device(h_a) # host -> device
d_b = cuda.to_device(h_b)
d_c = cuda.device_array(n, dtype=np.float32) # allocate output on device
# Launch kernel
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vec_add_kernel[blocks, threads_per_block](d_a, d_b, d_c) # square-bracket launch syntax
# Copy back and verify
h_c = d_c.copy_to_host() # device -> host
expected = h_a + h_b
assert np.allclose(h_c, expected), "Mismatch!"
print(f"All {n} elements correct. First 5: {h_c[:5]}")
The structure is identical to the C version. Allocate, transfer, launch, copy back, verify. Numba uses square brackets [blocks, threads_per_block] instead of triple angle brackets, but the mental model is the same.
cuda.grid(1) is shorthand for the global thread index formula. The 1 means one-dimensional. For 2D grids, use cuda.grid(2) which returns (x, y).
One important difference: Numba handles cudaMalloc and cudaMemcpy behind the scenes. If you pass a NumPy array directly, Numba will copy it to the device automatically. Explicit transfers with cuda.to_device give you more control and avoid redundant copies.
How threads actually execute
When you launch <<<4, 256>>>, the GPU does not run 1024 threads independently. The hardware groups threads into units of 32 called warps. All 32 threads in a warp execute the same instruction at the same time (SIMT: Single Instruction, Multiple Threads).
Each block of 256 threads contains 256 / 32 = 8 warps. The GPU scheduler assigns blocks to Streaming Multiprocessors (SMs). Each SM can run several blocks simultaneously, switching between warps whenever one stalls on memory.
You do not need to manage warps directly at this stage. But knowing they exist explains two things:
- Thread count should be a multiple of 32. Choosing 256, 128, or 512 threads per block keeps warps full. Choosing 100 wastes 28 slots in the last warp of every block.
- Branch divergence has a cost. If threads in the same warp take different paths in an if statement, the warp executes both paths serially and masks out the inactive threads. This is a performance topic for later articles.
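The warp arithmetic behind the first point is easy to check. A quick sketch (warp size 32, as on all current NVIDIA GPUs; the helper names are ours):

```python
WARP_SIZE = 32

def warps_per_block(block_dim: int) -> int:
    # Blocks are carved into warps of 32; a partial warp
    # still occupies a full warp's worth of execution slots.
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE

def idle_lanes(block_dim: int) -> int:
    # Slots in the last warp that carry no thread.
    return warps_per_block(block_dim) * WARP_SIZE - block_dim

assert warps_per_block(256) == 8   # the example from the text
assert idle_lanes(256) == 0        # multiples of 32 waste nothing
assert warps_per_block(100) == 4   # a block of 100 still costs 4 warps
assert idle_lanes(100) == 28       # 28 wasted slots in every block
```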
Putting it together: the mental model
The entire CUDA execution model reduces to three ideas:
- Host and device are separate machines with separate memory. You must explicitly move data between them.
- A kernel is a function that runs on every thread. Each thread figures out which piece of data it owns using blockIdx.x * blockDim.x + threadIdx.x.
- You choose how many blocks and threads to launch. The hardware maps them to physical cores. More threads than cores is fine because the scheduler handles it.
That is the whole model. Everything else (shared memory, streams, cooperative groups, dynamic parallelism) builds on these three ideas.
Common mistakes
⚠ Forgetting cudaDeviceSynchronize. Kernel launches are asynchronous: control returns to the host immediately. A plain cudaMemcpy on the default stream does wait for prior kernels, but cudaMemcpyAsync does not, and neither does host code that exits before the GPU finishes (as in the hello example). During development, synchronize after every launch so errors surface where they happen, not later in an unrelated call.
⚠ Using host pointers on the device. Passing a malloc’d pointer to a kernel causes an illegal memory access. The kernel runs on the GPU, which cannot see host memory (unless you use unified memory, covered later).
⚠ Forgetting the bounds check. If your array has 1000 elements and you launch 1024 threads, threads 1000-1023 will read and write garbage memory. This corrupts other allocations and causes wrong results that are painful to debug.
⚠ Launching too few threads. A single block of 32 threads on an A100 with 108 SMs means 107 SMs sit completely idle. You want thousands of blocks to saturate the GPU.
✓ Always check error codes. Use the CUDA_CHECK macro or equivalent. Silent failures are the norm in CUDA, not the exception.
In practice
Production CUDA code follows the same host-device pattern shown here, but with additional layers:
- Pinned (page-locked) memory with cudaMallocHost for faster PCIe transfers. Regular malloc memory must be copied to a pinned staging buffer by the driver before DMA transfer. Pinned memory skips that step.
- Streams for overlapping data transfers with kernel execution. While one stream copies input for batch N+1, another stream computes batch N.
- Unified memory with cudaMallocManaged, which gives you a single pointer accessible from both host and device. The driver handles page migration automatically. Convenient for prototyping, but explicit memory management usually wins on performance.
- Error checking wrappers that log to a proper logging framework instead of calling exit. In production, you want to recover from transient GPU errors, not crash the entire service.
The vector addition example is pedagogically useful but unrealistic as a standalone GPU workload. The arithmetic intensity (one addition per two memory reads and one write) is too low to benefit from GPU parallelism at small scales. The GPU shines when you have millions of elements and more compute-heavy operations like matrix multiplies, convolutions, or reductions.
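That claim about arithmetic intensity is easy to quantify. For float32 vector addition, each element costs two 4-byte reads, one 4-byte write, and a single floating-point add (a back-of-envelope model that ignores caches and overheads):

```python
n = 1_000_000
bytes_per_float = 4

flops = n                              # one add per element
bytes_moved = 3 * bytes_per_float * n  # two reads + one write per element

intensity = flops / bytes_moved        # FLOPs per byte of memory traffic
assert abs(intensity - 1 / 12) < 1e-12 # ~0.083 FLOP/byte: memory-bound
```

Modern GPUs sustain on the order of ten floating-point operations per byte of memory bandwidth, so at roughly 0.083 FLOP/byte the vector add spends nearly all its time waiting on memory, not computing.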
When profiling, use nsys (NVIDIA Nsight Systems) to see the timeline of transfers and kernel execution. The first thing you will notice is that cudaMemcpy often dominates the total runtime. Reducing host-device transfers is usually the highest-impact optimization.
What comes next
You now know how to write, compile, and run a CUDA kernel. You understand the host-device split, the launch syntax, and how threads map to data.
The next article, CUDA thread hierarchy in depth, goes deeper into the thread model: multi-dimensional grids and blocks, the warp as the unit of execution, shared memory within a block, and synchronization primitives. That is where you start writing kernels that cooperate, not just kernels where every thread works alone.