Case study: reduction and histogram at scale
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read the following:
- Parallel prefix sum and reduction for tree-based reduction, warp shuffle basics, and work-efficient parallel algorithms.
- Synchronization and atomic operations for __syncthreads(), atomicAdd, contention costs, and memory fences.
You should be comfortable writing reduction kernels within a single block. This article extends those techniques to arrays that span thousands of blocks and applies similar thinking to histogram computation.
Reduction beyond a single block
A single thread block can reduce at most 1024 elements (one per thread, assuming block size 1024). Real workloads operate on millions or billions of elements. The core question: how do you coordinate partial results across blocks when CUDA provides no global barrier within a kernel?
Three strategies exist:
- Two-pass reduction. Launch one kernel where each block reduces its chunk and writes a partial sum. Launch a second kernel to reduce those partial sums.
- Atomic global reduction. Each block reduces locally, then uses atomicAdd to accumulate into a single global result.
- Cooperative groups grid sync. Use cooperative_groups::grid_group::sync() to synchronize all blocks, enabling a single-kernel multi-level reduction. This requires opt-in launch semantics and has occupancy constraints.
The two-pass approach is the most widely used. It is simple, has no contention, and works on all hardware.
Two-pass reduction architecture
graph TD
subgraph "Pass 1: Block-Level Reduction"
I["Input Array (1M elements)"] --> B0["Block 0
256 threads
reduces elements 0..255"]
I --> B1["Block 1
256 threads
reduces elements 256..511"]
I --> B2["Block 2
256 threads
reduces elements 512..767"]
I --> BD["..."]
I --> BN["Block 3906
256 threads
reduces final chunk"]
B0 --> P["Partial sums array
(3907 elements)"]
B1 --> P
B2 --> P
BD --> P
BN --> P
end
subgraph "Pass 2: Final Reduction"
P --> F0["Block 0
256 threads
reduces partials 0..255"]
P --> F1["Block 1..15
256 threads each"]
F0 --> R["Final result
(single value)"]
F1 --> R
end
style I fill:#4dabf7,color:#fff
style P fill:#ffa94d,color:#fff
style R fill:#69db7c,color:#000
Worked example. Take N = 1,000,000 elements with block size 256.
- Pass 1: Grid size = ceil(1,000,000 / 256) = 3,907 blocks. Each block loads 256 elements, reduces them to one partial sum, and writes it to partials[blockIdx.x]. Output: 3,907 partial sums.
- Pass 2: Grid size = ceil(3,907 / 256) = 16 blocks. These 16 blocks reduce 3,907 values down to 16 partial sums. A third pass with 1 block finishes the job, or you can handle 16 values on the CPU.
Each pass shrinks the data by a factor of 256 (one element per thread). Pass 1 on N = 4 billion elements (the largest 32-bit indexable range) produces roughly 16 million partials; pass 2 reduces those to about 63,000; pass 3 to 246; pass 4 finishes. In practice you rarely need that many launches: give the second-pass kernel a grid-stride loop and a single block can walk all the partials, so two launches handle any realistic input size.
Can a single block handle 3,907 partials in one shot? Not with one element per thread: block size is at most 1024, so with 3,907 partials you need at least ceil(3,907 / 1024) = 4 blocks, or 16 blocks at 256 threads. Without a grid-stride loop, the second pass is itself a mini-reduction.
Warp reduction with __shfl_down_sync
The fastest way to reduce 32 values is through warp shuffle instructions. __shfl_down_sync(mask, val, delta) reads the val register from the thread that is delta lanes ahead. No shared memory, no barriers, just register-to-register transfers within the warp.
__device__ float warp_reduce_sum(float val) {
for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
val += __shfl_down_sync(0xFFFFFFFF, val, offset);
}
return val;
}
This performs 5 iterations (log2(32) = 5). After completion, lane 0 holds the sum of all 32 input values. The mask 0xFFFFFFFF means all 32 threads participate. Using a narrower mask is legal but requires careful reasoning about which lanes hold valid data.
Why shuffle is faster than shared memory. Shared memory reduction requires a __syncthreads() at each level plus two memory operations (write + read) per level. Shuffle replaces all of that with a single instruction per level that moves data through the warp’s register file. On an A100, a shared memory round-trip costs about 20 cycles. A shuffle costs about 1 cycle.
Block reduction: shuffle plus shared memory
Combining warp shuffle with shared memory gives the optimal single-block reduction. Each warp reduces its 32 elements via shuffle. The per-warp results go into shared memory. The first warp then reduces those per-warp values via shuffle again.
__global__ void reduce_sum_optimized(const float* __restrict__ input,
float* __restrict__ output,
int n) {
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
// Load from global memory
float val = (i < n) ? input[i] : 0.0f;
// Warp-level reduction
val = warp_reduce_sum(val);
// Per-warp results to shared memory
__shared__ float warp_sums[32]; // max 32 warps per block (1024/32)
unsigned int lane = tid % warpSize;
unsigned int warpId = tid / warpSize;
if (lane == 0) {
warp_sums[warpId] = val;
}
__syncthreads();
// First warp reduces the per-warp sums
unsigned int numWarps = blockDim.x / warpSize;
val = (tid < numWarps) ? warp_sums[tid] : 0.0f;
if (warpId == 0) {
val = warp_reduce_sum(val);
}
// Thread 0 writes block result
if (tid == 0) {
output[blockIdx.x] = val;
}
}
This kernel processes blockDim.x elements per block. For 256 threads, that is 8 warps. The first level does 8 parallel warp reductions (each 5 shuffle steps). The second level does one warp reduction on 8 values (5 more shuffle steps, with 24 lanes contributing zero). Total: 10 shuffle instructions plus one __syncthreads(). Compare to the naive shared memory approach: 8 tree levels, each requiring a __syncthreads() and shared memory loads and stores.
Complete two-pass reduction
The full solution launches the block reduction kernel twice:
void reduce_large_array(float* d_input, float* d_output, int n) {
const int BLOCK_SIZE = 256;
int gridSize = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
float* d_partials;
cudaMalloc(&d_partials, gridSize * sizeof(float));
// Pass 1: each block reduces its chunk
reduce_sum_optimized<<<gridSize, BLOCK_SIZE>>>(d_input, d_partials, n);
// Pass 2: reduce partial sums. This three-launch scheme assumes
// gridSize2 <= BLOCK_SIZE, i.e. n <= BLOCK_SIZE^3 (~16.7M elements);
// larger inputs need another pass or a grid-stride kernel.
if (gridSize > BLOCK_SIZE) {
int gridSize2 = (gridSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
float* d_partials2;
cudaMalloc(&d_partials2, gridSize2 * sizeof(float));
reduce_sum_optimized<<<gridSize2, BLOCK_SIZE>>>(d_partials, d_partials2, gridSize);
// Pass 3: gridSize2 values fit in a single block (needs gridSize2 <= BLOCK_SIZE)
reduce_sum_optimized<<<1, BLOCK_SIZE>>>(d_partials2, d_output, gridSize2);
cudaFree(d_partials2);
} else {
reduce_sum_optimized<<<1, BLOCK_SIZE>>>(d_partials, d_output, gridSize);
}
cudaFree(d_partials);
}
For 1M elements: pass 1 produces 3,907 partials. Since 3,907 > 256, pass 2 produces ceil(3,907 / 256) = 16 partials. Pass 3 reduces 16 values to 1. Three kernel launches, each trivially fast.
Grid-level atomic reduction
The alternative to multiple passes is a single kernel where each block atomically accumulates its result:
__global__ void reduce_sum_atomic(const float* __restrict__ input,
float* __restrict__ output,
int n) {
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
float val = (i < n) ? input[i] : 0.0f;
val = warp_reduce_sum(val);
__shared__ float warp_sums[32];
unsigned int lane = tid % warpSize;
unsigned int warpId = tid / warpSize;
if (lane == 0) warp_sums[warpId] = val;
__syncthreads();
unsigned int numWarps = blockDim.x / warpSize;
val = (tid < numWarps) ? warp_sums[tid] : 0.0f;
if (warpId == 0) val = warp_reduce_sum(val);
if (tid == 0) {
// *output must be zeroed (e.g., cudaMemset) before this kernel launches
atomicAdd(output, val);
}
}
This eliminates the second kernel launch. The cost is atomic contention: all blocks compete to update the same memory location. For 3,907 blocks, that is 3,907 serialized atomicAdd operations. On an A100, each atomicAdd to global memory takes roughly 100 ns. Total serialized time: ~0.4 ms. For a reduction of 1M floats that otherwise takes ~0.05 ms, the atomic overhead dominates.
Atomic reduction works well when the grid is small (a few hundred blocks or fewer). For large grids, two-pass wins.
Thrust: reduction in one line
CUDA’s Thrust library provides thrust::reduce which automatically selects the best strategy:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
thrust::device_vector<float> d_vec(h_data, h_data + n);
float result = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0f, thrust::plus<float>());
Thrust uses a multi-pass reduction internally with architecture-tuned block sizes and vectorized loads. It consistently matches or exceeds hand-written kernels for standard reductions.
Reduction throughput comparison
The naive atomic approach is severely bottlenecked by contention on a single global memory address. The shared memory tree is limited by __syncthreads() overhead and idle warps in later reduction levels. The shuffle-based two-pass approach gets close to memory bandwidth limits because the kernel is memory-bound (each element is read once, with minimal compute). Thrust matches or slightly exceeds the hand-written shuffle version through internal tuning and vectorized loads.
Histogram: counting occurrences
Histogram computation bins input values and counts how many fall into each bin. Unlike reduction, which converges to a single value, histogram produces an array of counts. The core challenge is the same: many threads updating shared counters simultaneously.
// Naive: every thread atomically updates global histogram
__global__ void histogram_naive(const int* __restrict__ input,
int* __restrict__ hist,
int n) {
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
atomicAdd(&hist[input[i]], 1);
}
}
This works but creates massive contention when many elements map to the same bin. For 256 bins and 1M uniform random values, each bin receives roughly 3,906 increments. With 3,907 blocks of 256 threads all racing on the same 256 global counters, contention is severe.
Privatized histogram
The solution is histogram privatization: each block maintains its own private copy of the histogram in shared memory. Threads within a block atomically update shared memory (fast, low contention). After all elements are processed, the block merges its private histogram into the global one.
graph TD
subgraph "Phase 1: Per-Block Private Histograms"
I["Input Array (1M elements)"] --> B0["Block 0
Private hist[256]
in shared memory"]
I --> B1["Block 1
Private hist[256]
in shared memory"]
I --> B2["Block 2
Private hist[256]
in shared memory"]
I --> BD["..."]
I --> BN["Block 3905
Private hist[256]
in shared memory"]
end
subgraph "Phase 2: Merge into Global Histogram"
B0 --> G["Global hist[256]
atomicAdd per bin"]
B1 --> G
B2 --> G
BD --> G
BN --> G
end
style I fill:#4dabf7,color:#fff
style G fill:#69db7c,color:#000
The key insight: shared memory atomics are 10-50x faster than global memory atomics. Within a block of 256 threads, at most 256 threads compete on shared memory. Across the full grid, thousands of blocks would compete on global memory. Privatization reduces the global atomic count from N (one per input element) to numBins * numBlocks (one per bin per block during the merge phase).
Worked example. 256 bins, 1M elements, block size 256.
- Naive approach: 1,000,000 atomicAdd operations to global memory. With uniform distribution, each of the 256 bins sees ~3,906 atomic updates from different blocks, creating severe contention.
- Privatized approach: 3,907 blocks, each doing up to 256 shared memory atomics internally. Merge phase: 3,907 blocks * 256 bins = ~1,000,000 global atomics, but these are spread across 256 different addresses (bins), not concentrated on hot bins. Per-bin contention during merge: ~3,907 atomics per bin, which is manageable.
The real win comes from the shared memory phase. Shared memory atomicAdd runs at ~5 ns versus ~100 ns for global memory. The 1M shared memory updates represent roughly 5 ms of aggregate atomic latency, but that latency is spread across thousands of blocks executing in parallel, so the wall-clock cost is small. The merge phase adds ~1M global atomics spread across 256 addresses; with ~3,907 serialized updates per address, it completes in roughly 0.4 ms.
__global__ void histogram_privatized(const int* __restrict__ input,
int* __restrict__ hist,
int numBins,
int n) {
extern __shared__ int s_hist[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
// Initialize private histogram
for (int j = tid; j < numBins; j += blockDim.x) {
s_hist[j] = 0;
}
__syncthreads();
// Accumulate into private histogram
if (i < n) {
atomicAdd(&s_hist[input[i]], 1);
}
__syncthreads();
// Merge private histogram into global
for (int j = tid; j < numBins; j += blockDim.x) {
if (s_hist[j] > 0) {
atomicAdd(&hist[j], s_hist[j]);
}
}
}
// Launch
int numBins = 256;
int smemSize = numBins * sizeof(int);
histogram_privatized<<<gridSize, BLOCK_SIZE, smemSize>>>(d_input, d_hist, numBins, n);
The initialization loop uses a strided pattern so all threads cooperate in zeroing the shared histogram. The merge loop similarly distributes bins across threads. The if (s_hist[j] > 0) guard skips bins with zero count, avoiding unnecessary global atomics.
Scaling: multiple elements per thread
Both the reduction and histogram kernels above process one element per thread. For large arrays, this means launching enormous grids with minimal work per thread. A better approach assigns multiple elements per thread, amortizing launch overhead and increasing arithmetic intensity.
__global__ void histogram_privatized_coarsened(const int* __restrict__ input,
int* __restrict__ hist,
int numBins,
int n) {
extern __shared__ int s_hist[];
unsigned int tid = threadIdx.x;
// Initialize private histogram
for (int j = tid; j < numBins; j += blockDim.x) {
s_hist[j] = 0;
}
__syncthreads();
// Each thread processes multiple elements (grid-stride loop)
unsigned int stride = blockDim.x * gridDim.x;
for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n; i += stride) {
atomicAdd(&s_hist[input[i]], 1);
}
__syncthreads();
// Merge
for (int j = tid; j < numBins; j += blockDim.x) {
if (s_hist[j] > 0) {
atomicAdd(&hist[j], s_hist[j]);
}
}
}
The grid-stride loop lets you launch a fixed-size grid (say, 256 blocks) regardless of input size. Each thread walks through the array with a stride equal to the total thread count. This keeps the grid small, reducing merge overhead (256 blocks * 256 bins = 65,536 global atomics instead of ~1M), while still processing all elements.
Thrust: histogram with sort and reduce
Thrust does not provide a direct histogram primitive, but you can build one from sort and reduce-by-key:
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>
thrust::device_vector<int> d_input(h_input, h_input + n);
thrust::device_vector<int> d_keys(numBins);
thrust::device_vector<int> d_counts(numBins);
thrust::sort(d_input.begin(), d_input.end());
auto end = thrust::reduce_by_key(
d_input.begin(), d_input.end(),
thrust::constant_iterator<int>(1),
d_keys.begin(),
d_counts.begin()
);
// Only bins that occur in the input appear in d_keys/d_counts;
// end.first - d_keys.begin() gives the number of distinct bins.
// A final scatter by key yields a dense numBins-length histogram.
This approach is algorithmically clean but slower than a privatized histogram for small bin counts. The sort alone is O(N log N) work, while the privatized histogram is O(N). For large bin counts (thousands or more), where shared memory cannot hold the full histogram, sort-and-reduce becomes competitive.
Histogram throughput vs bin count
Key observations:
- Naive atomic throughput is nearly flat and low. Regardless of bin count, every element requires a global atomic. More bins spread contention across more addresses, slightly improving throughput, but the fundamental bottleneck remains.
- Privatized throughput peaks at small-to-medium bin counts. With 256 bins, the shared histogram is only 1 KB. This fits easily in shared memory with room to spare for high occupancy. At 4K bins (16 KB), shared memory pressure reduces occupancy. At 16K bins (64 KB), the shared histogram consumes the entire shared memory allocation on most GPUs, forcing single-block occupancy and killing performance.
- Sort + reduce-by-key is insensitive to bin count. The sort cost dominates and depends on N, not on the number of bins. This makes it the best choice when bin counts exceed what shared memory can hold.
Choosing the right approach
| Scenario | Best approach | Why |
|---|---|---|
| Reduction, any size | Shuffle two-pass | Near bandwidth limit, no contention |
| Reduction, small array (<100K) | Single kernel + atomic | Avoids second launch overhead |
| Histogram, few bins (<1K) | Privatized shared memory | Shared atomics are fast, merge is cheap |
| Histogram, many bins (>4K) | Sort + reduce-by-key | Shared memory cannot hold the histogram |
| Histogram, skewed distribution | Privatized + coarsening | Reduces hot-bin contention through fewer blocks |
| Either, rapid prototyping | Thrust | Correct, fast, one line |
Data distribution matters for histograms. Uniform distributions spread contention evenly. Skewed distributions (e.g., Zipf) concentrate updates on a few hot bins. In the naive approach, hot bins become serialization points. Privatized histograms help because contention is limited to threads within a single block. With 256 threads and a hot bin, you have at most 256-way contention in shared memory versus grid-wide contention in global memory.
For extremely skewed data (e.g., 90% of values in one bin), even shared memory atomics can bottleneck. In those cases, consider warp-level aggregation: threads in the same warp targeting the same bin can use __ballot_sync (for a known hot bin) or __match_any_sync (for arbitrary bins, available on Volta and later) to find their peers, then have one lane issue a single atomicAdd with the popcount, reducing contention by up to 32x.
In practice
✓ Use Thrust or CUB for production reductions. thrust::reduce and cub::DeviceReduce::Sum are highly optimized, handle edge cases, and support custom operators. Writing a custom reduction kernel is valuable for learning, but production code should use library primitives.
✓ Profile atomic contention with Nsight Compute. The “Memory Workload Analysis” section shows L2 atomic throughput and contention metrics. If your kernel shows high “L2 Atomic Transactions” relative to useful work, privatization or algorithmic restructuring will help.
✓ Tune grid size for privatized histograms. Fewer blocks means fewer merge operations but less parallelism. More blocks means more parallelism but a more expensive merge. A grid-stride loop with a fixed grid of 256 to 512 blocks is a good starting point.
⚠ Watch shared memory limits with large histograms. Each SM has 48 to 164 KB of shared memory depending on architecture. A privatized histogram with 16K 4-byte bins uses 64 KB, which may exceed the available shared memory or force low occupancy. Check with cudaFuncSetAttribute for extended shared memory on Ampere and later.
⚠ Floating-point reduction is not associative. Reordering additions changes the result due to rounding. Different block sizes or grid sizes produce slightly different sums. If bitwise reproducibility matters, use Kahan summation or fixed-point accumulation.
✗ Do not use naive global atomics for large-scale histograms. The throughput difference between naive and privatized can be 5 to 8x. There is no scenario where the naive approach is preferable when bin counts fit in shared memory.
What comes next
The next article moves beyond single-GPU kernel optimization to heterogeneous processing, where you will learn to partition work across CPU and GPU, overlap host and device computation, and handle workloads that do not fit neatly into a single kernel launch pattern.