Thread hierarchy in CUDA: threads, blocks, warps, and grids

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article assumes you have read the following:

  • CUDA hello world for writing, compiling, and launching a basic kernel with <<<1, 1>>>.

You should already be comfortable with the idea that a kernel runs on the GPU and that the <<<...>>> syntax controls how many copies of that kernel run in parallel. This article explains the full structure behind those launch parameters.

The execution model at a glance

Every CUDA kernel launch creates a grid. A grid is a collection of blocks. Each block is a collection of threads. Threads within a block execute on the same hardware unit and can cooperate through shared memory and synchronization barriers. Threads in different blocks cannot cooperate directly.

Below the software abstraction, the hardware groups threads into warps of 32. A warp is the true unit of execution on an NVIDIA GPU. Every instruction is issued to all 32 threads in a warp simultaneously. Warps run on streaming multiprocessors (SMs), which are the physical compute units on the GPU die.

The relationship looks like this:

graph TB
  G["Grid (all blocks in a kernel launch)"]
  G --> B0["Block (0,0)"]
  G --> B1["Block (1,0)"]
  G --> B2["Block (2,0)"]
  G --> BN["Block (N,0) ..."]

  B0 --> W0["Warp 0: threads 0-31"]
  B0 --> W1["Warp 1: threads 32-63"]
  B0 --> W2["Warp 2: threads 64-95"]
  B0 --> WN["Warp K ..."]

  W0 --> T0["Thread 0"]
  W0 --> T1["Thread 1"]
  W0 --> T31["Thread 31"]

  style G fill:#1a1a2e,stroke:#e94560,color:#fff
  style B0 fill:#16213e,stroke:#0f3460,color:#fff
  style B1 fill:#16213e,stroke:#0f3460,color:#fff
  style B2 fill:#16213e,stroke:#0f3460,color:#fff
  style BN fill:#16213e,stroke:#0f3460,color:#fff
  style W0 fill:#0f3460,stroke:#533483,color:#fff
  style W1 fill:#0f3460,stroke:#533483,color:#fff
  style W2 fill:#0f3460,stroke:#533483,color:#fff
  style WN fill:#0f3460,stroke:#533483,color:#fff
  style T0 fill:#533483,stroke:#e94560,color:#fff
  style T1 fill:#533483,stroke:#e94560,color:#fff
  style T31 fill:#533483,stroke:#e94560,color:#fff

This hierarchy exists for a reason. It maps directly to how the hardware is organized. Understanding it is the single most important step toward writing efficient CUDA code.

Threads: the basic unit of work

A thread is the smallest unit of execution in CUDA. Each thread runs the same kernel function but operates on different data, identified by its unique index. Inside a kernel, you access this index through the built-in variables threadIdx.x, threadIdx.y, and threadIdx.z.

A thread has its own registers and local memory. It cannot see the registers of another thread. This isolation is what lets the GPU run thousands of threads without complex dependency tracking.

Think of a thread as a single worker on a factory line. It knows its position (the thread index), it has a small personal workspace (registers), and it follows the same set of instructions as every other worker.

Blocks: threads that cooperate

A block is a group of threads that can cooperate. Threads within the same block share two capabilities that threads in different blocks do not have:

  1. Shared memory. A fast, on-chip scratchpad (typically 48-164 KB depending on the GPU) visible to all threads in the block. Accessing shared memory is roughly 20-40x faster than accessing global memory.

  2. Synchronization. The __syncthreads() barrier ensures all threads in a block have reached the same point before any of them proceed. This is essential when one thread writes to shared memory and another thread reads that value.

Blocks are independent of each other. The hardware can execute them in any order, on any SM, with any number running concurrently. This independence is critical. It means a CUDA program scales automatically across GPUs with different SM counts. A GPU with 40 SMs and a GPU with 80 SMs will both run the same grid correctly. The one with 80 SMs will just process more blocks concurrently.

A block can be 1D, 2D, or 3D. The maximum number of threads per block on modern NVIDIA GPUs is 1024. You can arrange those 1024 threads in any combination of dimensions as long as the product does not exceed 1024. For example: (1024, 1, 1), (32, 32, 1), or (16, 16, 4).
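These constraints are easy to check mechanically. Here is a minimal Python sketch, using the common modern limits quoted above (1024 threads per block; per-dimension maxima of 1024, 1024, and 64) rather than values queried from hardware:

```python
# Validate a candidate block shape against common modern limits.
# These constants are the typical values; query the device for real ones.
MAX_THREADS_PER_BLOCK = 1024
MAX_DIM = (1024, 1024, 64)

def valid_block(x, y, z):
    within_dims = x <= MAX_DIM[0] and y <= MAX_DIM[1] and z <= MAX_DIM[2]
    within_total = x * y * z <= MAX_THREADS_PER_BLOCK
    return within_dims and within_total

print(valid_block(1024, 1, 1))  # True
print(valid_block(32, 32, 1))   # True
print(valid_block(16, 16, 4))   # True
print(valid_block(32, 32, 2))   # False: 2048 threads exceeds the limit
```

The same product rule applies whatever the shape: (32, 32, 2) fails not because any single dimension is too large, but because 32 * 32 * 2 = 2048 exceeds the per-block total.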

Grids: all blocks in a launch

A grid is the complete collection of blocks created by a single kernel launch. When you write:

myKernel<<<gridDim, blockDim>>>(args);

gridDim specifies how many blocks the grid contains, and blockDim specifies how many threads each block contains. Both can be 1D, 2D, or 3D.

Inside a kernel, you identify which block you are in with blockIdx.x, blockIdx.y, and blockIdx.z. The total size of the grid is available through gridDim.x, gridDim.y, and gridDim.z.

The global index of a thread in a 1D grid is:

int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;

This is the most common line you will write in CUDA. It maps a thread to the data element it processes.
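To see why this mapping works, here is a small host-side Python sketch that emulates the index computation for every thread in a hypothetical 1D launch of 4 blocks with 8 threads each:

```python
# Emulate blockIdx.x * blockDim.x + threadIdx.x for every thread
# in a hypothetical launch: 4 blocks of 8 threads, 32 elements total.
block_dim = 8
grid_dim = 4

indices = [b * block_dim + t
           for b in range(grid_dim)
           for t in range(block_dim)]

# Every element 0..31 is hit exactly once: no gaps, no overlaps.
print(indices == list(range(grid_dim * block_dim)))  # True
```

Each block covers a contiguous slice of the array, and the blocks tile the array without gaps, which is exactly the property the real kernel relies on.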

Warps: the hardware execution unit

A warp is a group of exactly 32 threads. It is not something you declare in your code. The hardware creates warps automatically by splitting each block into consecutive groups of 32 threads. Threads 0-31 of a block form warp 0, threads 32-63 form warp 1, and so on.

All 32 threads in a warp execute the same instruction at the same time in lockstep. This is SIMT (Single Instruction, Multiple Threads), NVIDIA’s variation on SIMD. When threads in a warp take different branches of an if statement, the warp must execute both branches serially, with some threads masked off during each branch. This is called warp divergence, and it is one of the most important performance considerations in CUDA programming.

A block with 256 threads produces 256 / 32 = 8 warps. A block with 100 threads produces ceil(100 / 32) = 4 warps, but the last warp has only 4 active threads. The other 28 thread slots are wasted. The hardware still allocates a full warp’s worth of resources for those 4 threads.
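The warp arithmetic above generalizes to any block size. A small Python helper (illustrative only) makes the allocation and the waste explicit:

```python
import math

def warp_stats(threads_per_block, warp_size=32):
    """Warps allocated for a block, active threads in the last warp,
    and the number of wasted thread slots."""
    warps = math.ceil(threads_per_block / warp_size)
    active_in_last = threads_per_block - (warps - 1) * warp_size
    wasted_slots = warps * warp_size - threads_per_block
    return warps, active_in_last, wasted_slots

print(warp_stats(256))  # (8, 32, 0)  -- every warp full
print(warp_stats(100))  # (4, 4, 28)  -- last warp nearly empty
print(warp_stats(33))   # (2, 1, 31)  -- pathological: ~48% of slots wasted
```

The hardware always rounds up to whole warps, so any block size that is not a multiple of 32 leaves slots allocated but idle.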

Streaming multiprocessors: the physical engine

A streaming multiprocessor (SM) is the physical unit that executes warps. Each SM contains:

  • Warp schedulers that pick a ready warp each cycle and issue its next instruction.
  • CUDA cores (INT32 and FP32 execution units) that perform arithmetic.
  • A register file (typically 64K 32-bit registers) shared across all threads resident on the SM.
  • Shared memory that all threads in a block can access.
  • Special function units (SFUs) for transcendentals like sin, cos, exp.
  • Load/store units for memory access.
  • Tensor cores (on Volta and later) for matrix multiply-accumulate operations.

graph TB
  subgraph SM["Streaming Multiprocessor (SM)"]
      direction TB
      WS["Warp Schedulers (4)"]
      subgraph Compute["Compute Units"]
          direction LR
          CC["CUDA Cores (FP32/INT32)"]
          SFU["Special Function Units"]
          TC["Tensor Cores"]
      end
      subgraph Memory["On-chip Memory"]
          direction LR
          RF["Register File (65536 x 32-bit)"]
          SM_MEM["Shared Memory (up to 164 KB)"]
          L1["L1 Cache"]
      end
      WS --> Compute
      WS --> Memory
  end

  style SM fill:#1a1a2e,stroke:#e94560,color:#fff
  style WS fill:#16213e,stroke:#0f3460,color:#fff
  style Compute fill:#0f3460,stroke:#533483,color:#fff
  style Memory fill:#0f3460,stroke:#533483,color:#fff
  style CC fill:#533483,stroke:#e94560,color:#fff
  style SFU fill:#533483,stroke:#e94560,color:#fff
  style TC fill:#533483,stroke:#e94560,color:#fff
  style RF fill:#533483,stroke:#e94560,color:#fff
  style SM_MEM fill:#533483,stroke:#e94560,color:#fff
  style L1 fill:#533483,stroke:#e94560,color:#fff

When a grid is launched, the CUDA runtime distributes blocks to SMs. Each SM can run multiple blocks concurrently, limited by the SM’s register count, shared memory capacity, and maximum thread count. When one warp stalls on a memory access, the warp scheduler switches to another ready warp with zero overhead. This is how GPUs hide memory latency: not primarily with large caches (as CPUs do), but with massive parallelism.

An NVIDIA A100 has 108 SMs. An RTX 4090 has 128 SMs. A laptop RTX 4060 has 24 SMs. The same kernel binary runs on all of them. More SMs means more blocks running concurrently means faster completion.

1D, 2D, and 3D grids and blocks

CUDA supports up to three dimensions for both grids and blocks. This is purely a convenience for indexing. The hardware does not care whether your grid is 1D, 2D, or 3D. It only cares about the total number of threads and blocks.

Use 1D when your data is a flat array. Use 2D when your data is a matrix or image. Use 3D for volumes (medical imaging, fluid simulation, voxel grids).

For a 2D block and 2D grid, the global thread coordinates are:

int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;

This maps naturally to matrix indexing, where row selects the row and col selects the column. For a row-major array of width elements per row, the linear index is then row * width + col.
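The row-major mapping can be sketched in a few lines of Python for a small hypothetical 4x6 image:

```python
# Row-major linearization for a small 4x6 "image": each (row, col)
# pair maps to a unique flat index, with rows laid out contiguously.
width, height = 6, 4

def linear_index(row, col, width):
    return row * width + col

flat = [linear_index(r, c, width)
        for r in range(height)
        for c in range(width)]

print(flat == list(range(width * height)))  # True: covers 0..23 exactly once
print(linear_index(2, 3, width))            # 15
```

Iterating col in the inner loop walks memory sequentially, which is the same access pattern that gives coalesced loads when threadIdx.x maps to col in a real kernel.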

Device property limits

Every GPU has hard limits on thread hierarchy dimensions. Exceeding any limit causes the kernel launch to fail, and the failure is silent unless you check the error code with cudaGetLastError() after the launch. The table below shows common limits for modern architectures.

Level   Dimension       Max size                    Total limit              Notes
Block   x               1024                        1024 threads per block   Product of x * y * z must not exceed 1024
Block   y               1024
Block   z               64
Grid    x               2,147,483,647 (2^31 - 1)    No fixed total limit     Grids can be enormous
Grid    y               65,535
Grid    z               65,535
Warp    -               32 threads                  Fixed                    Hardware constant since CUDA's inception
SM      Threads         1536-2048                   Per SM                   Depends on compute capability
SM      Blocks          16-32                       Per SM                   Depends on compute capability
SM      Warps           48-64                       Per SM                   Depends on compute capability
SM      Shared memory   48-164 KB                   Per SM                   Configurable L1/shared split on some architectures

These limits are not arbitrary. They reflect the physical resources on the SM. The register file has a fixed number of registers. If your kernel uses 64 registers per thread and the SM has 65536 registers, you can fit at most 1024 threads on that SM, regardless of the 2048-thread-per-SM hardware limit.
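The register arithmetic in that last sentence can be sketched directly. This Python snippet uses the figures quoted above (65,536 registers and a 2048-thread limit per SM), which are typical values, not queried from hardware:

```python
# Register-limited occupancy: how many threads fit on an SM when the
# kernel's register usage, not the thread limit, is the bottleneck.
REGS_PER_SM = 65536
HW_THREAD_LIMIT = 2048

def max_resident_threads(regs_per_thread):
    by_registers = REGS_PER_SM // regs_per_thread
    return min(by_registers, HW_THREAD_LIMIT)

print(max_resident_threads(64))   # 1024 -- registers are the bottleneck
print(max_resident_threads(32))   # 2048 -- hardware thread limit binds
print(max_resident_threads(128))  # 512  -- heavy kernels hurt occupancy
```

Halving register usage per thread can double the number of resident threads, which is why the compiler's register allocation shows up directly in occupancy numbers.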

Worked example: 2D thread indexing for an image

You need to process a 1920x1080 image. Each thread processes one pixel. You choose a 2D block size of 16x16 = 256 threads.

Grid dimensions (ceiling division):

gridDim.x = ceil(1920 / 16) = 120 blocks
gridDim.y = ceil(1080 / 16) = 68 blocks (because 1080 / 16 = 67.5, rounded up to 68)

Total blocks: 120 * 68 = 8,160. Total threads launched: 8,160 * 256 = 2,088,960. The image has 1920 * 1080 = 2,073,600 pixels. The extra 15,360 threads will compute indices outside the image and must be guarded:

__global__ void processImage(unsigned char* img, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col >= width || row >= height) return;  // boundary guard

    int idx = row * width + col;
    // process pixel at img[idx]
}

// Launch configuration
dim3 blockDim(16, 16);
dim3 gridDim((1920 + 15) / 16, (1080 + 15) / 16);  // (120, 68)
processImage<<<gridDim, blockDim>>>(d_img, 1920, 1080);

Concrete index calculation. Consider block (5, 3), thread (7, 4):

col = 5 * 16 + 7 = 87
row = 3 * 16 + 4 = 52
linear index = 52 * 1920 + 87 = 99,927

Thread (7, 4) in block (5, 3) processes the pixel at column 87, row 52. Both are within bounds, so the boundary guard passes.
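The arithmetic above can be checked with a few lines of Python:

```python
# Reproduce the index arithmetic for block (5, 3), thread (7, 4)
# with a 16x16 block on a 1920-wide image.
block_dim = 16
width = 1920

col = 5 * block_dim + 7   # 87
row = 3 * block_dim + 4   # 52
idx = row * width + col

print(col, row, idx)  # 87 52 99927
```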

Worked example: warp allocation and occupancy cost

A block has 256 threads. How many warps? 256 / 32 = 8 warps.

Now suppose 3 threads in the last warp hit the boundary guard and return early. The warp still has 32 thread slots allocated. Those 3 idle slots consume registers and occupy a scheduler slot. The GPU does not reclaim their resources.

At the block level, this is negligible: 3 wasted slots out of 256 is about 1.2%. But consider a pathological case. Suppose you launch blocks of 33 threads. Each block creates 2 warps (ceil(33/32) = 2), but the second warp has only 1 active thread. That is 31 wasted slots per block, or 31/64 = 48% waste. The cost is real: those 31 phantom threads consume registers, reducing the number of warps the SM can host, which reduces the SM’s ability to hide memory latency.

The takeaway: always choose block sizes that are multiples of 32. Common choices are 128, 256, and 512. This ensures every warp is fully occupied.

Querying device properties in CUDA C++

You do not need to memorize hardware limits. CUDA provides a runtime API to query them:

#include <cstdio>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("CUDA devices found: %d\n\n", deviceCount);

    for (int i = 0; i < deviceCount; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);

        printf("Device %d: %s\n", i, prop.name);
        printf("  Compute capability:     %d.%d\n", prop.major, prop.minor);
        printf("  SMs:                    %d\n", prop.multiProcessorCount);
        printf("  Max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
        printf("  Max threads per block:  %d\n", prop.maxThreadsPerBlock);
        printf("  Max block dimensions:   (%d, %d, %d)\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("  Max grid dimensions:    (%d, %d, %d)\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("  Warp size:              %d\n", prop.warpSize);
        printf("  Shared memory per SM:   %zu bytes\n", prop.sharedMemPerMultiprocessor);
        printf("  Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  Registers per SM:       %d\n", prop.regsPerMultiprocessor);
        printf("  Registers per block:    %d\n", prop.regsPerBlock);
        printf("\n");
    }
    return 0;
}

Compile with nvcc device_query.cu -o device_query and run it on any machine with an NVIDIA GPU. The output tells you exactly what your hardware supports. Use these numbers to tune your launch configurations.

On an A100, you would see output like:

Device 0: NVIDIA A100-SXM4-80GB
  Compute capability:     8.0
  SMs:                    108
  Max threads per SM:     2048
  Max threads per block:  1024
  Max block dimensions:   (1024, 1024, 64)
  Max grid dimensions:    (2147483647, 65535, 65535)
  Warp size:              32
  Shared memory per SM:   167936 bytes
  Shared memory per block: 49152 bytes
  Registers per SM:       65536
  Registers per block:    65536

CuPy: mapping Python to CUDA hierarchy

If you work in Python, CuPy exposes the same thread hierarchy. A CuPy RawKernel lets you write CUDA C inside Python and launch it with explicit grid and block dimensions:

import cupy as cp

kernel_code = r"""
extern "C" __global__
void add_matrices(const float* A, const float* B, float* C, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;
    int idx = row * width + col;
    C[idx] = A[idx] + B[idx];
}
"""

add_kernel = cp.RawKernel(kernel_code, "add_matrices")

width, height = 1920, 1080
A = cp.random.rand(height, width, dtype=cp.float32)
B = cp.random.rand(height, width, dtype=cp.float32)
C = cp.zeros((height, width), dtype=cp.float32)

block = (16, 16, 1)
grid = ((width + 15) // 16, (height + 15) // 16, 1)
# grid = (120, 68, 1) -- same calculation as the C++ example

# Scalar arguments must match the kernel signature; wrap them as
# 32-bit ints so they marshal as the C `int` the kernel expects.
add_kernel(grid, block, (A, B, C, cp.int32(width), cp.int32(height)))

assert cp.allclose(C, A + B)
print("✓ Matrix addition correct")

The tuple (16, 16, 1) maps directly to dim3(16, 16, 1) in CUDA C++. The grid tuple (120, 68, 1) maps to dim3(120, 68, 1). CuPy handles memory allocation and transfer behind the scenes, but the thread hierarchy is identical. The same mental model applies: each thread computes its row and col from blockIdx and threadIdx, guards against out-of-bounds access, and processes one element.

For higher-level operations, CuPy uses its own internal heuristics to choose block and grid sizes. When you write C = A + B with CuPy arrays, it launches a kernel with a configuration optimized for your GPU. The RawKernel interface gives you explicit control when the defaults are not sufficient.

In practice

Start with 256 threads per block. It is a safe default that works well on every architecture since Kepler. Exactly 8 warps, good occupancy, and enough threads for the SM to hide latency. Only deviate after profiling.

Use 2D blocks for 2D data. A 16x16 block for image or matrix operations gives natural indexing and good memory coalescing along rows. A 32x8 block is another option when you want full warp utilization along the x dimension.

Always guard boundaries. When the data size is not a perfect multiple of the block size (it almost never is), threads at the edge of the grid will compute out-of-bounds indices. The if (col >= width || row >= height) return; pattern costs almost nothing. Skipping it causes memory corruption.

Query device properties at startup. Do not hardcode SM counts or memory sizes. Use cudaGetDeviceProperties to adapt your launch configuration to the hardware you are running on. This matters when your code runs on different GPU models in a cluster.

Watch for warp-level effects. A block of 256 threads with 3 inactive threads wastes 1.2% of resources. A block of 33 threads wastes 48%. Occupancy is not just about thread counts; it is about how efficiently you use warps.

Remember the independence guarantee. Blocks can run in any order. Do not write code that assumes block 0 finishes before block 1. If you need inter-block communication, you need either multiple kernel launches or cooperative groups (an advanced feature).

What comes next

This article covered the CUDA thread hierarchy from top to bottom: grids contain blocks, blocks contain threads, and the hardware groups threads into warps of 32 that execute in lockstep on streaming multiprocessors. You learned how to compute global thread indices in 1D and 2D, how to query device limits, and why block sizes should be multiples of 32.

The lockstep execution model has a cost. When threads in a warp disagree on a branch, performance drops. The next article, Warps and divergence, explains exactly how warp divergence happens, how to measure it, and how to restructure your code to minimize it.
