Warps and warp divergence: the hidden performance trap
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read CUDA thread hierarchy, where you learned how threads are organized into blocks and grids. You need that mental model before we can zoom into the hardware unit that actually executes those threads: the warp.
What a warp is
A GPU does not schedule threads individually. It schedules them in groups of 32 called warps. Every block you launch gets partitioned into warps. A block of 256 threads becomes 8 warps. A block of 128 threads becomes 4 warps. A block of 100 threads becomes 4 warps, with the last warp containing 4 active threads and 28 idle lanes.
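The partitioning is plain ceiling division, which you can sanity-check without a GPU. A minimal Python sketch (helper names are mine, not CUDA API):

```python
WARP_SIZE = 32  # fixed on all current NVIDIA GPUs

def warps_per_block(block_size):
    """Number of warps a block is partitioned into (ceiling division)."""
    return (block_size + WARP_SIZE - 1) // WARP_SIZE

def idle_lanes(block_size):
    """Masked-off lanes in the last warp of a block."""
    return warps_per_block(block_size) * WARP_SIZE - block_size

for size in (256, 128, 100):
    print(size, warps_per_block(size), idle_lanes(size))
# 256 -> 8 warps, 0 idle; 128 -> 4 warps, 0 idle; 100 -> 4 warps, 28 idle
```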
A warp is the fundamental unit of execution on an NVIDIA GPU. All 32 threads in a warp share a single instruction pointer. At every clock cycle, the warp scheduler fetches one instruction and issues it to all 32 threads simultaneously. This is not an implementation detail you can ignore. It is the single most important hardware constraint that determines whether your kernel runs at full throughput or wastes half its cycles doing nothing.
The number 32 is baked into the hardware. It has been 32 since the G80 architecture in 2006 and remains 32 through Hopper. You will see it referenced as warpSize in CUDA code.
SIMT execution model
NVIDIA calls this execution model SIMT: Single Instruction, Multiple Threads. It is similar to SIMD on CPUs, but with a critical difference. In CPU SIMD, you explicitly pack data into wide registers and use special instructions. In SIMT, you write scalar code for a single thread, and the hardware implicitly executes that same instruction across 32 threads in lockstep.
Each thread in a warp has its own registers, its own program counter (logically; since the Volta architecture, the hardware even tracks a per-thread program counter), and its own stack. But the warp scheduler still issues one instruction at a time. All 32 threads execute that instruction, each operating on its own data.
When all 32 threads follow the same control flow path, the warp is converged. Every thread does useful work on every cycle. This is the ideal case, and it gives you the full throughput the hardware was designed for.
The problem starts when threads disagree about which path to take.
Warp divergence
Consider this kernel:
```cpp
__global__ void divergent_kernel(float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f) {
            y[i] = 0.0f;          // Branch A
        } else {
            y[i] = x[i] * 2.0f;   // Branch B
        }
    }
}
```
Suppose a warp contains 32 threads where 16 have x[i] < 0 and 16 do not. The hardware cannot execute Branch A and Branch B simultaneously because there is only one instruction pointer per warp. Instead, it serializes the two paths:
- First pass: execute Branch A. The 16 threads that took Branch A are active. The other 16 are masked off (they do nothing but still occupy their lane).
- Second pass: execute Branch B. The 16 threads that took Branch B are active. The other 16 are masked off.
- The warp reconverges at the instruction after the if/else.
This is warp divergence. The warp must execute both paths sequentially, and on each pass, some threads sit idle. The total time is the sum of both paths, not the maximum.
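The pass/mask mechanics can be modeled on the CPU. The sketch below (plain Python, my own simplified model of the hardware, not an exact one) executes each branch body once across a 32-lane warp and lets a boolean mask decide which lanes commit:

```python
def simulate_warp_ifelse(x):
    """Model SIMT serialization of `if (x[i] < 0) y[i]=0 else y[i]=2*x[i]`
    for one 32-thread warp. Each taken path runs over all lanes; the
    active mask decides which lanes commit their result."""
    assert len(x) == 32
    y = [None] * 32
    mask_a = [v < 0.0 for v in x]   # lanes taking Branch A
    passes = 0
    if any(mask_a):                 # Pass 1: Branch A body
        passes += 1
        for lane in range(32):
            if mask_a[lane]:
                y[lane] = 0.0
    if not all(mask_a):             # Pass 2: Branch B body
        passes += 1
        for lane in range(32):
            if not mask_a[lane]:
                y[lane] = x[lane] * 2.0
    return y, passes

# Converged warp: one pass. Mixed warp: two serialized passes.
_, p1 = simulate_warp_ifelse([1.0] * 32)
_, p2 = simulate_warp_ifelse([-1.0, 1.0] * 16)
print(p1, p2)  # prints: 1 2
```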
```mermaid
graph TB
subgraph Divergent["Divergent warp: 32 threads hit if/else"]
direction TB
F["Fetch: if (x[i] < 0)"]
F --> PA["Pass 1: Branch A (y[i] = 0)
Active mask: threads 0-15 ✓
Threads 16-31 masked off ✗"]
PA --> PB["Pass 2: Branch B (y[i] = x[i]*2)
Active mask: threads 16-31 ✓
Threads 0-15 masked off ✗"]
PB --> RC["Reconverge: all 32 threads active ✓"]
end
subgraph Converged["Converged warp: all threads take same path"]
direction TB
F2["Fetch: if (x[i] < 0)"]
F2 --> P1["Single pass: Branch B (y[i] = x[i]*2)
Active mask: all 32 threads ✓"]
P1 --> RC2["Continue: all 32 threads active ✓"]
end
```
Left: a divergent warp executes both branches sequentially, masking inactive threads on each pass. Right: a converged warp executes only one branch, keeping all threads active.
The cost of serialization
The penalty from divergence is straightforward to reason about. If a warp hits an if/else and half the threads take each side, the warp takes twice as long as it would if all threads agreed. If the branch has three paths (using if/else if/else), the warp serializes three times.
```mermaid
graph LR
subgraph No_Div["No divergence"]
direction TB
ND1["Cycle 1-10: All 32 threads execute Branch B"]
ND2["Total: 10 cycles"]
ND1 --> ND2
end
subgraph Half_Div["50% divergence (16/16 split)"]
direction TB
HD1["Cycle 1-10: 16 threads execute Branch A"]
HD2["Cycle 11-20: 16 threads execute Branch B"]
HD3["Total: 20 cycles"]
HD1 --> HD2 --> HD3
end
subgraph Full_Div["Worst case (32 unique paths)"]
direction TB
FD1["Cycle 1-10: Thread 0 active"]
FD2["Cycle 11-20: Thread 1 active"]
FD3["..."]
FD4["Total: 320 cycles"]
FD1 --> FD2 --> FD3 --> FD4
end
```
Cycle counts scale with the number of distinct paths taken within a warp. A 50% split roughly doubles execution time. The worst case (every thread diverges) reduces the warp to scalar execution.
The key insight: divergence is a per-warp phenomenon. If threads 0-31 all take Branch A, and threads 32-63 all take Branch B, there is zero divergence. Each warp is internally converged, even though different warps take different paths. Divergence only matters when threads within the same warp disagree.
Divergence scenarios
| Pattern | Divergence level | Performance impact | Mitigation |
|---|---|---|---|
| `if (threadIdx.x < 16)` | Mixed in the first warp of each block only | 2x slowdown in that warp's branched section | Reorganize work so warps align with branches |
| `if (threadIdx.x % 2 == 0)` | Mixed in every warp | 2x slowdown everywhere, worst pattern | Regroup work so all lanes in a warp share a parity |
| `if (blockIdx.x < gridDim.x / 2)` | 0%, divergence is between blocks | No penalty | Already optimal, warps are converged |
| `if (threadIdx.x < warpSize)` | 0%, condition aligns with warp boundary | No penalty | Warp-aligned branches are free |
| `switch` with N cases per warp | Up to N-way serialization | Up to Nx slowdown | Reduce case count, group similar work |
| `if (data[i] > threshold)` | Data-dependent, unpredictable | Varies, often 1.2-1.8x | Sort data, use branchless arithmetic |
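Per-warp divergence for index-based conditions can be checked mechanically: split thread indices into groups of 32 and count groups where the condition is not uniform. A plain-Python sketch for a 256-thread block (helper names are mine):

```python
WARP = 32

def divergent_warps(cond, n_threads=256):
    """Count warps where cond(tid) is true for some lanes and false for
    others -- i.e., warps that would serialize at this branch."""
    mixed = 0
    for base in range(0, n_threads, WARP):
        outcomes = {cond(t) for t in range(base, base + WARP)}
        mixed += (len(outcomes) == 2)
    return mixed

print(divergent_warps(lambda t: t < 16))       # 1: only the first warp mixes
print(divergent_warps(lambda t: t % 2 == 0))   # 8: every warp mixes
print(divergent_warps(lambda t: t < WARP))     # 0: aligned with a warp boundary
```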
CUDA C++: divergent vs optimized kernel
The following benchmark intentionally creates divergence by filling the input with alternating positive and negative values, so within every warp the even lanes take one side of the branch and the odd lanes take the other. This is the worst pattern because every warp diverges identically.
```cpp
#include <stdio.h>
#include <cuda_runtime.h>

// Divergent version: lanes take different paths when their inputs
// have mixed signs within a warp
__global__ void relu_divergent(float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f) {
            y[i] = 0.0f;
        } else {
            y[i] = x[i];
        }
    }
}

// Branchless version: no divergence possible
__global__ void relu_branchless(float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = x[i] * (float)(x[i] > 0.0f);
    }
}

int main() {
    int n = 1 << 24; // 16M elements
    size_t bytes = n * sizeof(float);

    float *h_x = (float*)malloc(bytes);
    // Fill with alternating positive/negative to maximize divergence
    for (int i = 0; i < n; i++) {
        h_x[i] = (i % 2 == 0) ? 1.0f : -1.0f;
    }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;

    // Warmup launches so one-time initialization cost is not timed
    relu_divergent<<<gridSize, blockSize>>>(d_x, d_y, n);
    relu_branchless<<<gridSize, blockSize>>>(d_x, d_y, n);
    cudaDeviceSynchronize();

    // Benchmark divergent kernel
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int iter = 0; iter < 100; iter++) {
        relu_divergent<<<gridSize, blockSize>>>(d_x, d_y, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms_divergent = 0.0f;
    cudaEventElapsedTime(&ms_divergent, start, stop);

    // Benchmark branchless kernel
    cudaEventRecord(start);
    for (int iter = 0; iter < 100; iter++) {
        relu_branchless<<<gridSize, blockSize>>>(d_x, d_y, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms_branchless = 0.0f;
    cudaEventElapsedTime(&ms_branchless, start, stop);

    printf("Divergent: %.3f ms (avg per launch)\n", ms_divergent / 100.0f);
    printf("Branchless: %.3f ms (avg per launch)\n", ms_branchless / 100.0f);
    printf("Speedup: %.2fx\n", ms_divergent / ms_branchless);

    cudaFree(d_x);
    cudaFree(d_y);
    free(h_x);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```
The branchless version replaces the if/else with a multiply by a boolean cast to float. The expression (float)(x[i] > 0.0f) evaluates to 1.0f or 0.0f, and the multiply produces the correct ReLU result. No branch, no divergence, no serialization.
On a typical GPU (RTX 3090, A100), the branchless kernel runs 1.3-1.6x faster for data with 50% negative values. The gap widens as the branch body grows in complexity.
Python (Numba): divergent vs branchless comparison
```python
import numpy as np
from numba import cuda
import time

@cuda.jit
def relu_divergent(x, y):
    i = cuda.grid(1)
    if i < x.shape[0]:
        if x[i] < 0.0:
            y[i] = 0.0
        else:
            y[i] = x[i]

@cuda.jit
def relu_branchless(x, y):
    i = cuda.grid(1)
    if i < x.shape[0]:
        y[i] = x[i] * float(x[i] > 0.0)

n = 1 << 24
h_x = np.empty(n, dtype=np.float32)
h_x[0::2] = 1.0   # even indices positive
h_x[1::2] = -1.0  # odd indices negative

d_x = cuda.to_device(h_x)
d_y = cuda.device_array(n, dtype=np.float32)

threads = 256
blocks = (n + threads - 1) // threads

# Warmup
relu_divergent[blocks, threads](d_x, d_y)
relu_branchless[blocks, threads](d_x, d_y)
cuda.synchronize()

# Benchmark divergent
start = time.perf_counter()
for _ in range(100):
    relu_divergent[blocks, threads](d_x, d_y)
cuda.synchronize()
t_divergent = (time.perf_counter() - start) / 100

# Benchmark branchless
start = time.perf_counter()
for _ in range(100):
    relu_branchless[blocks, threads](d_x, d_y)
cuda.synchronize()
t_branchless = (time.perf_counter() - start) / 100

print(f"Divergent: {t_divergent*1000:.3f} ms")
print(f"Branchless: {t_branchless*1000:.3f} ms")
print(f"Speedup: {t_divergent/t_branchless:.2f}x")
```
The Numba kernels follow the same pattern. The branchless version avoids the if/else entirely. Numba compiles to PTX through LLVM, so the resulting GPU code is structurally identical to the CUDA C++ version.
Measuring divergence: execution time by divergence percentage
The following chart shows measured execution time for a simple kernel where a controlled fraction of threads within each warp take a different branch. At 0% divergence, all threads agree. At 100%, every thread takes the opposite path from its neighbor.
Measured on 16M elements, 256 threads per block, RTX 3090. Divergence was controlled by setting a fraction of input values negative. The relationship is roughly linear because serialization cost scales with the fraction of threads taking each path.
The chart confirms two things. First, even 25% divergence adds measurable overhead. Second, the penalty is not a cliff. It grows proportionally. This means that partial divergence mitigation (reducing 50% divergence to 25%) gives a real, measurable speedup.
Worked example 1: divergence penalty calculation
Problem: A warp of 32 threads encounters an if/else. 16 threads take Branch A (10 cycles), 16 take Branch B (15 cycles). Calculate the total execution time with and without divergence. Show the active mask for each pass.
Without divergence (all threads take one branch):
If all 32 threads take Branch A: 10 cycles. If all take Branch B: 15 cycles. Either way, a single pass handles all threads.
With divergence (16/16 split):
| Pass | Active threads | Masked threads | Instruction | Cycles |
|---|---|---|---|---|
| 1 | Threads 0-15 (mask: 0x0000FFFF) | Threads 16-31 idle | Branch A body | 10 |
| 2 | Threads 16-31 (mask: 0xFFFF0000) | Threads 0-15 idle | Branch B body | 15 |
| Total | | | | 25 |
Without divergence: 10 or 15 cycles (best case). With divergence: 10 + 15 = 25 cycles.
The overhead is not just “double.” It is the sum of all paths. If Branch A took 10 cycles and Branch B took 100 cycles, a converged warp taking only Branch B finishes in 100 cycles, but a divergent warp takes 110 cycles. The short branch adds its full cost on top.
Utilization: During Pass 1, only 16 of 32 threads are active: 50% utilization. During Pass 2, again 50%. Averaged across the branch section, the warp achieves 50% efficiency. The hardware is doing half the useful work it could.
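The same bookkeeping generalizes to any split: total time is the sum of cycle costs over the paths actually taken, and efficiency is active-lane-cycles over total lane-cycles. A small sketch (my own helper, not a CUDA API):

```python
def serialized_cost(path_cycles, lane_paths):
    """path_cycles: cycles per path id; lane_paths: path id for each of
    32 lanes. Returns (total_cycles, utilization) under SIMT
    serialization: each distinct path runs once, inactive lanes idle."""
    assert len(lane_paths) == 32
    taken = set(lane_paths)
    total = sum(path_cycles[p] for p in taken)
    active = sum(path_cycles[p] * lane_paths.count(p) for p in taken)
    return total, active / (32 * total)

cycles = {"A": 10, "B": 15}
total, util = serialized_cost(cycles, ["A"] * 16 + ["B"] * 16)
print(total, util)  # prints: 25 0.5  (25 cycles at 50% utilization)
```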
Worked example 2: removing divergence with arithmetic
Problem: Rewrite the following kernel without any branch, using arithmetic only. Verify correctness for x = -3, 0, 5.
```cpp
if (x[i] < 0)
    y[i] = 0;
else
    y[i] = x[i];
```
Branchless rewrite:
```cpp
y[i] = x[i] * (float)(x[i] >= 0);
```
The comparison (x[i] >= 0) produces 1 (true) or 0 (false). Casting to float gives 1.0f or 0.0f. Multiplying by x[i] gives the correct result.
Verification:
| x[i] | x[i] >= 0 | (float)(x[i] >= 0) | x[i] * result | Expected | Match |
|---|---|---|---|---|---|
| -3 | false | 0.0 | -3 * 0.0 = 0.0 | 0 | ✓ |
| 0 | true | 1.0 | 0 * 1.0 = 0.0 | 0 | ✓ |
| 5 | true | 1.0 | 5 * 1.0 = 5.0 | 5 | ✓ |
This technique is called predication. Instead of branching, you compute both paths and select the result with arithmetic. The GPU executes one instruction stream with no masking. Every thread stays active on every cycle.
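The equivalence is easy to confirm in NumPy on the host (same arithmetic identity the GPU uses; this is a check, not device code):

```python
import numpy as np

x = np.array([-3.0, -0.5, 0.0, 0.25, 5.0], dtype=np.float32)

# Branchy reference: y = 0 if x < 0 else x
branchy = np.where(x < 0, np.float32(0), x)

# Predicated form: multiply by the comparison result cast to float
predicated = x * (x >= 0).astype(np.float32)

assert np.array_equal(branchy, predicated)
print(predicated)  # zeros for negatives, x unchanged otherwise
```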
Predication works well when both branches are short (a few instructions). If one branch is 200 instructions and the other is 2 instructions, predication forces every thread to execute 200 instructions. In that case, divergence with masking is cheaper because the short-branch threads simply idle rather than executing useless work.
Detecting divergence with profilers
You should not guess at divergence. NVIDIA provides tools that measure it directly.
Nsight Compute (ncu) reports the metric smsp__branch_efficiency. A value of 100% means every branch was taken uniformly within every warp. Values below 100% indicate divergence. The smsp__warps_divergent_branch metric counts the total number of divergent branches.
```bash
ncu --metrics smsp__branch_efficiency,smsp__warps_divergent_branch ./my_kernel
```
Nsight Systems (nsys) gives a timeline view. Kernels with high divergence will show longer execution times relative to their arithmetic intensity. Combined with the source-level correlation in Nsight Compute, you can pinpoint exactly which if statement is causing the divergence.
What to look for:
- ✓ Branch efficiency above 95%: divergence is negligible.
- ⚠ Branch efficiency between 80-95%: investigate the hottest branches.
- ✗ Branch efficiency below 80%: significant serialization. Restructure your data or algorithm.
Patterns to minimize divergence
1. Warp-aligned branching. If your condition depends on threadIdx.x, ensure it aligns with warp boundaries. if (threadIdx.x < 32) causes zero divergence because the first warp takes the branch entirely and subsequent warps skip it entirely. if (threadIdx.x < 20) causes divergence in the first warp because threads 0-19 take the branch and threads 20-31 do not.
2. Data sorting. If your branch depends on input data (e.g., classifying positive vs negative values), sort or partition the data before launching the kernel. Sorting groups similar values into contiguous memory regions, which means adjacent threads (same warp) are more likely to take the same path.
3. Predication (branchless arithmetic). Replace short branches with arithmetic as shown in Worked Example 2. The compiler often does this automatically for trivial branches, but complex conditions benefit from manual rewriting. Use the ternary operator or multiply-by-boolean patterns.
4. Separate kernels. If two code paths are fundamentally different (hundreds of instructions each), split them into two kernels. Use a pre-classification pass to separate elements into two arrays, then launch each kernel on its respective array. Both kernels run with zero divergence.
5. Warp-level voting. Use __all_sync(), __any_sync(), and __ballot_sync() to test whether the entire warp agrees on a condition before branching. If all threads agree, take the fast path. If they disagree, fall back to the slow path. This reduces divergence in the common case.
```cpp
unsigned mask = __ballot_sync(0xFFFFFFFF, x[i] < 0.0f);
if (mask == 0xFFFFFFFF) {
    // All threads in warp have x[i] < 0: converged fast path
    y[i] = 0.0f;
} else if (mask == 0x00000000) {
    // No threads have x[i] < 0: converged fast path
    y[i] = x[i] * 2.0f;
} else {
    // Mixed: unavoidable divergence, but rare if data is sorted
    y[i] = (x[i] < 0.0f) ? 0.0f : x[i] * 2.0f;
}
```
When divergence is acceptable
Not all divergence is a bug. Some divergence is inherent to the algorithm and the cost of eliminating it exceeds the benefit.
- ✓ Boundary checks like if (i < n) at the edge of the grid cause divergence in at most one warp per block. The cost is negligible.
- ✓ Rare branches that trigger for less than 1% of threads (error handling, edge cases) are not worth optimizing.
- ⚠ Tree traversals and graph algorithms have data-dependent, unpredictable branching. Divergence is inherent. The optimization strategy is to maximize warp-level convergence through work grouping, not to eliminate branches.
- ✗ Hot loops with per-element branching on unsorted data are the pattern that destroys performance. This is where you invest optimization effort.
The rule of thumb: measure first with Nsight Compute. If branch efficiency is above 90%, move on to other bottlenecks. Memory access patterns almost always matter more than branch divergence.
In practice
Profile, then optimize. Run ncu --set full on your kernel before rewriting any branches. Branch divergence is often blamed for performance issues that are actually caused by uncoalesced memory access or low occupancy. The profiler tells you which problem to solve first.
The compiler is smarter than you think. NVCC and LLVM (for Numba) both apply predication automatically for short branches. Check the generated PTX (nvcc --ptx) to see whether your if/else compiled to a branch instruction (@p bra) or a predicated select (selp). If the compiler already predicated it, your manual rewrite gains nothing.
Sort your data when possible. This is the highest-leverage optimization for data-dependent divergence. A single thrust::sort_by_key call before your kernel can eliminate divergence entirely. The sort itself is highly optimized and runs with full warp convergence.
Divergence compounds with nested branches. An if inside an if can produce 4-way serialization. Flatten nested branches by computing a single case index and using a lookup table or switch. The switch still diverges, but it diverges once instead of twice.
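The nested-branch blowup is the same counting argument one level deeper: the number of serialized passes is (up to) the number of distinct (outer, inner) outcome pairs present in the warp. A quick plain-Python sketch (my own helper, an upper-bound model):

```python
WARP = 32

def serialized_passes(outer, inner, lanes=range(WARP)):
    """Upper bound on control-flow paths taken within one warp for a
    branch nested inside another: each unique (outer, inner) pair
    present among the lanes serializes separately."""
    return len({(outer(t), inner(t)) for t in lanes})

# Outer branch on parity, inner branch on half of the warp:
# up to 2 x 2 = 4 serialized paths in a single warp.
print(serialized_passes(lambda t: t % 2 == 0, lambda t: t < 16))  # prints: 4
```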
Warp size may change in the future. Code that hard-codes 32 will break if NVIDIA ever changes it. Use the built-in warpSize variable in device code rather than a literal 32. In Numba, use cuda.warpsize.
What comes next
This article covered the warp as the fundamental execution unit, the SIMT model, how divergence serializes execution, and concrete patterns to avoid the penalty. You now know how to identify divergence with profilers and when to optimize it versus when to accept it.
The next bottleneck to tackle is memory. A kernel with perfect convergence can still run 10x slower than expected if it accesses memory in the wrong pattern. CUDA memory hierarchy covers global memory coalescing, shared memory, L1/L2 caches, and the register file, giving you the tools to eliminate the most common performance bottleneck in GPU programming.