Occupancy, register pressure, and performance tuning
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read the following:
- CUDA memory hierarchy for registers, shared memory, and how data placement drives kernel throughput.
- Debugging and profiling CUDA programs for Nsight Systems/Compute workflows and the roofline model.
You should be comfortable interpreting profiler output and understand that an SM has a fixed pool of registers and shared memory divided among resident blocks. Everything here targets compute capability 8.0+ (A100/Ampere) unless stated otherwise.
What occupancy actually means
Occupancy is the ratio of active warps on a streaming multiprocessor (SM) to the maximum number of warps that SM can support:
Occupancy = Active Warps per SM / Maximum Warps per SM
On an A100 (compute capability 8.0), each SM supports up to 64 warps (2048 threads). If your kernel configuration results in 32 active warps per SM, occupancy is 50%.
Occupancy is not a percentage of “how hard the GPU is working.” It is a measure of how many warps the scheduler has available to hide latency. When one warp stalls on a memory access, the scheduler switches to another active warp at zero cost. More active warps means more opportunities to overlap computation with memory latency.
Why high occupancy is not always the goal
A common misconception: maximize occupancy, maximize performance. The relationship is more nuanced than that.
Consider a kernel that is entirely compute-bound. Every warp performs arithmetic on register-resident data with no global memory stalls. The scheduler never needs to switch warps because warps never stall. In this scenario, 25% occupancy can deliver the same throughput as 100% occupancy. The extra warps just sit in the ready queue, never needed.
Conversely, a memory-bound kernel stalls frequently. Each warp waits hundreds of cycles for global memory. Here, occupancy directly controls how well the hardware hides that latency. Going from 25% to 50% occupancy can double throughput.
The rule: occupancy matters most for memory-bound kernels and matters least for compute-bound kernels. Profiling tells you which regime you are in.
SM resource allocation
Three resources determine how many blocks (and therefore warps) can be resident on an SM simultaneously: registers per thread, shared memory per block, and threads per block. The SM allocates all three to each block at launch time, and the scarcest resource becomes the bottleneck.
graph TD
SM["SM Resource Pool
65,536 registers
164 KB shared memory
2,048 max threads
32 max blocks"]
SM --> B1["Block 0
128 threads × 40 regs = 5,120 regs
8 KB shared"]
SM --> B2["Block 1
128 threads × 40 regs = 5,120 regs
8 KB shared"]
SM --> B3["Block 2
128 threads × 40 regs = 5,120 regs
8 KB shared"]
SM --> B4["Block 3
...up to 12 blocks (register-limited)"]
REMAINING["Remaining resources unused
65,536 - 12×5,120 = 4,096 regs
164 KB - 12×8 KB = 68 KB shared"]
B4 --> REMAINING
style SM fill:#2d6a4f,stroke:#1b4332,color:#fff
style B1 fill:#40916c,stroke:#2d6a4f,color:#fff
style B2 fill:#40916c,stroke:#2d6a4f,color:#fff
style B3 fill:#40916c,stroke:#2d6a4f,color:#fff
style B4 fill:#52b788,stroke:#40916c,color:#fff
style REMAINING fill:#e76f51,stroke:#d62828,color:#fff
Each block claims its full register and shared memory allocation at launch. If a block needs 5,120 registers and 8 KB of shared memory, the SM fits as many blocks as the tightest constraint allows. Leftover resources below one block’s requirement are wasted.
Occupancy limiters
| Resource | How it limits occupancy | How to reduce pressure | Tradeoff |
|---|---|---|---|
| Registers per thread | More regs per thread means fewer threads fit in the SM’s register file. 65,536 regs / (128 threads × 40 regs) = 12 blocks max. | Use __launch_bounds__, reduce local variables, use shared memory for intermediate values. | Fewer registers can cause spilling to slow local memory. |
| Shared memory per block | Each block’s shared memory allocation reduces what remains for other blocks. 164 KB / 8 KB = 20 blocks (but 32 block limit may apply first). | Reduce tile sizes, reuse shared memory buffers across phases, use dynamic shared memory sized at launch. | Less shared memory means smaller tiles and potentially more global memory traffic. |
| Threads per block | SM has a max thread count (2,048 on A100). Fewer threads per block means more blocks needed to fill the SM, but block count is also capped at 32. | Choose block sizes that divide evenly into the SM thread limit. 128 or 256 are typically good choices. | Too few threads per block wastes warp scheduler slots. Too many reduces flexibility. |
| Blocks per SM | Hardware caps the number of concurrent blocks per SM (32 on A100). Small blocks can hit this before exhausting threads or registers. | Increase threads per block so fewer blocks are needed to saturate the SM. | Larger blocks reduce scheduling flexibility and increase synchronization scope. |
Worked example: computing occupancy from first principles
Consider an A100 SM with these limits:
- Max threads per SM: 2,048
- Max blocks per SM: 32
- Register file: 65,536 registers
- Max shared memory: 164 KB (configurable, but assume 48 KB allocated for shared memory and the rest for L1)
Your kernel uses: 128 threads per block, 40 registers per thread, 8 KB shared memory per block.
Step 1: Block limit from threads. 2,048 / 128 = 16 blocks.
Step 2: Block limit from registers. Registers per block = 128 × 40 = 5,120. Total register file = 65,536. Blocks = floor(65,536 / 5,120) = 12 blocks.
Step 3: Block limit from shared memory. 48 KB available / 8 KB per block = 6 blocks.
Step 4: Block limit from hardware cap. 32 blocks (not the bottleneck here).
Step 5: Take the minimum. min(16, 12, 6, 32) = 6 blocks.
Active threads = 6 × 128 = 768. Active warps = 768 / 32 = 24. Occupancy = 24 / 64 = 37.5%.
The bottleneck is shared memory. If you can reduce shared memory per block to 4 KB, you get floor(48 / 4) = 12 blocks, now tied with the register limit. That bumps active threads to 12 × 128 = 1,536 and occupancy to 1,536 / 2,048 = 75%.
Occupancy vs threads per block
The relationship between block size and occupancy depends heavily on register usage. Higher register counts per thread reduce the number of blocks that fit on the SM.
At 32 registers per thread, occupancy hits 100% easily. At 64 registers, the register file becomes the bottleneck for most block sizes, and occupancy drops to 50% or below. The non-monotonic shape occurs because block counts must be integers: some block sizes align better with the register file than others.
Occupancy does not predict performance
The scatter below shows measured GFLOPS for several kernel variants of the same algorithm at different occupancy levels. There is no clean linear relationship.
The peak sits around 62.5% occupancy, not at 100%. Kernels at high occupancy compete for L1 cache and shared memory bandwidth, thrashing data out before it can be reused. Kernels at low occupancy leave memory latency unhidden. The sweet spot depends on the kernel’s memory access pattern and arithmetic intensity.
This is why you profile, not guess. The CUDA Occupancy Calculator gives you a starting point. Nsight Compute gives you the answer.
The CUDA Occupancy Calculator
NVIDIA ships a spreadsheet-based Occupancy Calculator (available as an Excel file in the CUDA Toolkit). You enter compute capability, registers per thread, shared memory per block, and threads per block. It outputs theoretical occupancy and shows which resource is the limiter.
The programmatic equivalent is the occupancy API:
#include <cuda_runtime.h>
#include <cstdio>
__global__ void __launch_bounds__(256, 4)
myKernel(float* __restrict__ out, const float* __restrict__ in, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= n) return;
// compute-heavy work with controlled register usage
float val = in[idx];
float result = val * val + 2.0f * val + 1.0f;
result = rsqrtf(result + 1e-6f);
result = fmaf(result, val, -result);
out[idx] = result;
}
int main() {
int blockSize = 256;
int minGridSize = 0;
int maxActiveBlocks = 0;
// Query max active blocks for this kernel and block size
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&maxActiveBlocks,
myKernel,
blockSize,
0 // dynamic shared memory bytes
);
// Query optimal block size for max occupancy
cudaOccupancyMaxPotentialBlockSize(
&minGridSize,
&blockSize,
myKernel,
0, // dynamic shared memory bytes
0 // block size limit (0 = no limit)
);
int device;
cudaGetDevice(&device);
cudaDeviceProp props;
cudaGetDeviceProperties(&props, device);
int maxWarpsPerSM = props.maxThreadsPerMultiProcessor / 32;
int activeWarps = maxActiveBlocks * (256 / 32);
float occupancy = (float)activeWarps / (float)maxWarpsPerSM * 100.0f;
printf("Max active blocks per SM: %d\n", maxActiveBlocks);
printf("Active warps per SM: %d / %d\n", activeWarps, maxWarpsPerSM);
printf("Theoretical occupancy: %.1f%%\n", occupancy);
printf("Suggested block size: %d\n", blockSize);
printf("Suggested min grid size: %d\n", minGridSize);
return 0;
}
The __launch_bounds__(256, 4) directive tells the compiler two things: the kernel will never be launched with more than 256 threads per block, and you want at least 4 blocks resident per SM. The compiler uses this information to limit register allocation.
Understanding __launch_bounds__
The __launch_bounds__ qualifier has the following signature:
__global__ void __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
kernelFunction(...);
maxThreadsPerBlock (required): The maximum number of threads per block the kernel will ever be launched with. The compiler uses this to set an upper bound on register usage. If you launch with more threads than this value, behavior is undefined.
minBlocksPerMultiprocessor (optional): The desired minimum number of blocks per SM. The compiler will try to limit register usage so that at least this many blocks can be resident. Setting this too high forces aggressive register limiting, which may cause spills.
Example: reducing register pressure
Suppose a kernel compiles to 64 registers per thread without __launch_bounds__. With 256 threads per block, each block needs 256 × 64 = 16,384 registers. The A100 register file holds 65,536 registers, so floor(65,536 / 16,384) = 4 blocks can be resident. That gives 4 × 256 = 1,024 active threads, or 50% occupancy.
Adding __launch_bounds__(256, 8) tells the compiler: “I want 8 blocks per SM.” For 8 blocks of 256 threads, the register budget per thread is floor(65,536 / (8 × 256)) = 32 registers. The compiler will try to fit the kernel into 32 registers. If it cannot, it spills excess values to local memory (which is actually global memory, accessed through the L1 cache).
The result: active threads jump from 1,024 to 2,048 (100% occupancy), but each spilled value now lives in local memory instead of a register, costing tens of cycles on an L1 hit and hundreds on a miss, versus a register's single cycle.
⚠ This is the fundamental tradeoff. You are trading register access speed (1 cycle) for occupancy. If the kernel is compute-bound and those spilled variables are in hot loops, performance will degrade despite higher occupancy. If the kernel is memory-bound, the additional warps available to hide latency may more than compensate for spill overhead.
Always verify with Nsight Compute. Check the “Local Memory” section. If l1tex__t_bytes_pipe_lsu_mem_local_op_st.sum is nonzero, you have spills.
Instruction throughput
Understanding peak instruction throughput helps you determine whether a kernel is compute-bound or memory-bound. On an A100 SM (compute capability 8.0), peak throughput per SM per clock:
| Instruction type | Throughput (ops/SM/clock) | Notes |
|---|---|---|
| FP32 (float) | 64 | 64 FP32 cores per SM (16 per processing partition) |
| FP64 (double) | 32 | Half the FP32 rate on A100 |
| INT32 | 64 | Same as FP32, can execute concurrently |
| FP16 / BF16 (Tensor Core) | 1,024 | Via Tensor Core MMA instructions (4 Tensor Cores per SM, 256 FP16 FMAs each per clock) |
| Special (sin, cos, exp, rsqrt) | 16 | Executed on SFU, 4x slower than FP32 |
A100 has 108 SMs running at ~1.41 GHz. Peak FP32 throughput = 108 × 64 × 2 (FMA counts as 2 ops) × 1.41 GHz = 19.5 TFLOPS. This matches the published spec.
If your kernel performs 100 FP32 FMA operations per element and loads 8 bytes per element from global memory, the arithmetic intensity is (100 × 2) / 8 = 25 FLOP/byte. The A100’s bandwidth is ~2 TB/s, so the bandwidth ceiling is 2000 × 25 = 50 TFLOPS, which exceeds the compute ceiling of 19.5 TFLOPS. This kernel is compute-bound, and optimizing memory access patterns will not help. You need to reduce instruction count or use Tensor Cores.
For a kernel loading 128 bytes per element with 10 FP32 FMA operations, arithmetic intensity is 20 / 128 = 0.156 FLOP/byte. The bandwidth ceiling is 2000 × 0.156 = 312 GFLOPS, well below the compute ceiling. This kernel is memory-bound. Higher occupancy, better coalescing, and caching matter here.
Memory-bound vs compute-bound: the profiling-driven loop
The optimization loop is straightforward, but it requires discipline to follow it instead of guessing.
graph TD
A["Profile kernel
with Nsight Compute"] --> B{"Memory-bound or
compute-bound?"}
B -->|Memory-bound| C["Improve memory access
Coalescing, caching,
shared memory tiling"]
B -->|Compute-bound| D["Reduce instructions
Strength reduction,
Tensor Cores, ILP"]
B -->|Latency-bound| E["Increase occupancy
Adjust block size,
reduce regs/shared"]
C --> F["Re-profile"]
D --> F
E --> F
F --> G{"Hit roofline
ceiling?"}
G -->|No| A
G -->|Yes| H["Done or change algorithm"]
style A fill:#2d6a4f,stroke:#1b4332,color:#fff
style B fill:#f4a261,stroke:#e76f51,color:#000
style C fill:#40916c,stroke:#2d6a4f,color:#fff
style D fill:#40916c,stroke:#2d6a4f,color:#fff
style E fill:#40916c,stroke:#2d6a4f,color:#fff
style F fill:#52b788,stroke:#40916c,color:#fff
style G fill:#f4a261,stroke:#e76f51,color:#000
style H fill:#d62828,stroke:#9b2226,color:#fff
Step 1: Profile. Run Nsight Compute on your kernel. Look at the Speed of Light (SOL) section. It reports achieved percentage of peak compute and peak memory bandwidth.
Step 2: Classify. If achieved memory bandwidth is close to peak but compute is low, you are memory-bound. If compute is close to peak but memory utilization is low, you are compute-bound. If neither is close to peak, you are latency-bound (not enough warps to saturate either pipe).
Step 3: Optimize the bottleneck. Only the bottleneck. Optimizing the non-bottleneck resource changes nothing.
Step 4: Re-profile. After optimization, the bottleneck may shift. A memory-bound kernel that you improved with shared memory tiling may now become compute-bound. Repeat the loop.
This is not a one-pass process. Production kernels go through this loop 5 to 15 times. Each iteration moves you closer to the roofline ceiling. The gains diminish with each pass. Know when to stop.
Key Nsight Compute metrics for occupancy tuning
These are the metrics to watch when tuning occupancy:
- sm__warps_active.avg.pct_of_peak_sustained_active: actual occupancy during kernel execution. This may differ from theoretical occupancy if blocks finish at different rates.
- launch__registers_per_thread: registers allocated per thread. Compare with your __launch_bounds__ target.
- launch__shared_mem_per_block_allocated: shared memory per block including alignment overhead.
- l1tex__t_bytes_pipe_lsu_mem_local_op_st.sum: bytes written to local memory. Nonzero means register spills.
- l1tex__t_bytes_pipe_lsu_mem_local_op_ld.sum: bytes read from local memory. Nonzero means register spills are being read back.
- sm__sass_thread_inst_executed_op_fp32_pred_on.sum: FP32 instructions actually executed (useful for arithmetic intensity calculations).
Worked example: register spilling with __launch_bounds__
Your kernel compiles to 64 registers per thread at 256 threads per block.
Without __launch_bounds__:
- Regs per block: 256 × 64 = 16,384
- Blocks per SM (regs): floor(65,536 / 16,384) = 4
- Active threads: 4 × 256 = 1,024
- Occupancy: 1,024 / 2,048 = 50%
With __launch_bounds__(256, 8):
- Target: 8 blocks per SM
- Register budget: floor(65,536 / (8 × 256)) = 32 regs/thread
- The compiler must reduce from 64 to 32 registers
- If successful: 8 × 256 = 2,048 active threads, 100% occupancy
- ⚠ Risk: 32 excess register values per thread may spill to local memory
With a more conservative __launch_bounds__(256, 6):
- Register budget: floor(65,536 / (6 × 256)) = 42 regs/thread
- The compiler reduces from 64 to 42 registers
- 6 × 256 = 1,536 active threads, 75% occupancy
- Lower spill risk: only 22 values need to be relocated, and many can be recomputed instead of spilled
The conservative approach often wins. Going from 50% to 75% occupancy provides most of the latency-hiding benefit. Going from 75% to 100% adds marginal benefit while the spill cost is real.
✓ Check l1tex__t_bytes_pipe_lsu_mem_local_op_st.sum after each change. If local memory traffic increases significantly, back off.
✗ Do not blindly set minBlocksPerMultiprocessor to its maximum. Let profiling data guide the value.
Practical guidelines for block size selection
Block size interacts with all three resource constraints. These guidelines work well as starting points:
- Start with 128 or 256 threads per block. Both divide evenly into the warp size and the SM thread limit. 256 is the most common choice in production code.
- Avoid block sizes below 64. Small blocks hit the per-SM block limit before using all available threads. A block of 32 threads uses one warp but counts as one of the 32 allowed blocks.
- Avoid block sizes above 512 unless necessary. Large blocks reduce the number of blocks per SM, which reduces flexibility. If one large block finishes early, the SM is partially idle until the next block is assigned.
- Use cudaOccupancyMaxPotentialBlockSize as a starting point. It queries the compiler's register count and returns the block size that maximizes theoretical occupancy. Then profile to see if that block size is actually optimal.
- For 2D/3D kernels, keep block dimensions as multiples of 32 in the x-dimension. This ensures coalesced memory access for row-major data.
In practice
Profile first, tune second. Before touching __launch_bounds__ or adjusting block sizes, run Nsight Compute and identify whether you are memory-bound, compute-bound, or latency-bound. Tuning the wrong resource wastes time and can make performance worse.
Occupancy is a tool, not a target. A kernel at 50% occupancy that fully utilizes its registers is often faster than the same kernel at 100% occupancy with heavy spilling. The scatter plot above is not hypothetical; this pattern shows up consistently in real workloads.
Register pressure is the most common limiter. Complex kernels with many local variables, deep call chains, or unrolled loops consume registers quickly. Use --ptxas-options=-v during compilation to see register counts without running the kernel.
Shared memory configuration matters. On Ampere and later, shared memory and L1 cache share a pool. Use cudaFuncSetAttribute to configure the split per kernel:
cudaFuncSetAttribute(
myKernel,
cudaFuncAttributePreferredSharedMemoryCarveout,
cudaSharedmemCarveoutMaxShared
);
This gives the kernel maximum shared memory at the expense of L1 cache. Use it when your kernel’s working set fits in shared memory and does not benefit from L1 caching.
Beware of the occupancy cliff. Adding one more register per thread can cost a full block's worth of warps: at 256 threads per block, going from 42 to 43 registers drops the resident block count from 6 to 5, and occupancy from 75% to 62.5%. Monitor register counts across compiler versions and code changes. A seemingly unrelated change to a helper function can add registers and silently degrade performance.
What comes next
The next article applies everything from this series to a real problem: implementing and optimizing matrix multiplication from scratch. We start with a naive kernel, measure its performance against the roofline, and iteratively apply tiling, shared memory staging, register blocking, and vectorized loads to approach cuBLAS-level throughput.