Case study: reduction and histogram at scale

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

You should be comfortable writing reduction kernels within a single block; the earlier articles in this series on synchronization, atomic operations, and parallel reduction cover that ground. This article extends those techniques to arrays that span thousands of blocks and applies the same thinking to histogram computation.

Reduction beyond a single block

A single thread block can reduce at most 1024 elements (one per thread, assuming block size 1024). Real workloads operate on millions or billions of elements. The core question: how do you coordinate partial results across blocks when CUDA provides no global barrier within a kernel?

Three strategies exist:

  1. Two-pass reduction. Launch one kernel where each block reduces its chunk and writes a partial sum. Launch a second kernel to reduce those partial sums.
  2. Atomic global reduction. Each block reduces locally, then uses atomicAdd to accumulate into a single global result.
  3. Cooperative groups grid sync. Use cooperative_groups::grid_group::sync() to synchronize all blocks, enabling a single-kernel multi-level reduction. This requires opt-in launch semantics and has occupancy constraints.

The two-pass approach is the most widely used. It is simple, has no contention, and works on all hardware.

Two-pass reduction architecture

graph TD
  subgraph "Pass 1: Block-Level Reduction"
      I["Input Array (1M elements)"] --> B0["Block 0
256 threads
reduces elements 0..255"]
      I --> B1["Block 1
256 threads
reduces elements 256..511"]
      I --> B2["Block 2
256 threads
reduces elements 512..767"]
      I --> BD["..."]
      I --> BN["Block 3906
256 threads
reduces final chunk"]
      B0 --> P["Partial sums array
(3907 elements)"]
      B1 --> P
      B2 --> P
      BD --> P
      BN --> P
  end

  subgraph "Pass 2: Final Reduction"
      P --> F0["Block 0
256 threads
reduces partials 0..255"]
      P --> F1["Block 1..15
256 threads each"]
      F0 --> R["Final result
(single value)"]
      F1 --> R
  end

  style I fill:#4dabf7,color:#fff
  style P fill:#ffa94d,color:#fff
  style R fill:#69db7c,color:#000

Worked example. Take N = 1,000,000 elements with block size 256.

  • Pass 1: Grid size = ceil(1,000,000 / 256) = 3,907 blocks. Each block loads 256 elements, reduces them to one partial sum, and writes it to partials[blockIdx.x]. Output: 3,907 partial sums.
  • Pass 2: Grid size = ceil(3,907 / 256) = 16 blocks. These 16 blocks reduce 3,907 values down to 16 partial sums. A third pass with 1 block finishes the job, or you can handle 16 values on the CPU.

With the one-element-per-thread kernel shown below, each pass shrinks the data by a factor of 256, so very large inputs need several launches: for N = 2^32 (the largest 32-bit indexable range), pass 1 produces about 16.8 million partials, pass 2 reduces those to 65,536, pass 3 to 256, and pass 4 finishes. If each thread instead processes multiple elements via a grid-stride loop, a single block can absorb any number of partials in the second pass, so two launches suffice for any input size.

Can a single block handle 3,907 partials? No. Block size is at most 1024. With 3,907 partials you need at least ceil(3,907 / 1024) = 4 blocks. Using 256-thread blocks, you need 16 blocks. The second pass is itself a mini-reduction.

Warp reduction with __shfl_down_sync

The fastest way to reduce 32 values is through warp shuffle instructions. __shfl_down_sync(mask, val, delta) reads the val register from the thread that is delta lanes ahead. No shared memory, no barriers, just register-to-register transfers within the warp.

__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    }
    return val;
}

This performs 5 iterations (log2(32) = 5). After completion, lane 0 holds the sum of all 32 input values. The mask 0xFFFFFFFF means all 32 threads participate. Using a narrower mask is legal but requires careful reasoning about which lanes hold valid data.

Why shuffle is faster than shared memory. Shared memory reduction requires a __syncthreads() at each level plus two memory operations (write + read) per level. Shuffle replaces all of that with a single instruction per level that moves data through the warp’s register file. On an A100, a shared memory round-trip costs about 20 cycles. A shuffle costs about 1 cycle.

Block reduction: shuffle plus shared memory

Combining warp shuffle with shared memory gives the optimal single-block reduction. Each warp reduces its 32 elements via shuffle. The per-warp results go into shared memory. The first warp then reduces those per-warp values via shuffle again.

__global__ void reduce_sum_optimized(const float* __restrict__ input,
                                     float* __restrict__ output,
                                     int n) {
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load from global memory
    float val = (i < n) ? input[i] : 0.0f;

    // Warp-level reduction
    val = warp_reduce_sum(val);

    // Per-warp results to shared memory
    __shared__ float warp_sums[32]; // max 32 warps per block (1024/32)
    unsigned int lane = tid % warpSize;
    unsigned int warpId = tid / warpSize;

    if (lane == 0) {
        warp_sums[warpId] = val;
    }
    __syncthreads();

    // First warp reduces the per-warp sums
    unsigned int numWarps = blockDim.x / warpSize;
    val = (tid < numWarps) ? warp_sums[tid] : 0.0f;
    if (warpId == 0) {
        val = warp_reduce_sum(val);
    }

    // Thread 0 writes block result
    if (tid == 0) {
        output[blockIdx.x] = val;
    }
}

This kernel processes blockDim.x elements per block. For 256 threads, that is 8 warps. The first level does 8 parallel warp reductions (each 5 shuffle steps). The second level does one warp reduction on 8 values (5 more shuffle steps, with 24 lanes contributing zero). Total: 10 shuffle instructions plus one __syncthreads(). Compare to the naive shared memory approach: 8 tree levels, each requiring a __syncthreads() and shared memory loads and stores.

Complete two-pass reduction

The full solution launches the block reduction kernel twice:

void reduce_large_array(float* d_input, float* d_output, int n) {
    const int BLOCK_SIZE = 256;
    int gridSize = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    float* d_partials;
    cudaMalloc(&d_partials, gridSize * sizeof(float));

    // Pass 1: each block reduces its chunk
    reduce_sum_optimized<<<gridSize, BLOCK_SIZE>>>(d_input, d_partials, n);

    // Pass 2: reduce partial sums
    // For very large gridSize, this may need another pass
    if (gridSize > BLOCK_SIZE) {
        int gridSize2 = (gridSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
        float* d_partials2;
        cudaMalloc(&d_partials2, gridSize2 * sizeof(float));
        reduce_sum_optimized<<<gridSize2, BLOCK_SIZE>>>(d_partials, d_partials2, gridSize);

        // Pass 3: a single block suffices as long as gridSize2 <= BLOCK_SIZE,
        // i.e. n <= BLOCK_SIZE^3 (~16.7M elements); larger inputs need a loop
        reduce_sum_optimized<<<1, BLOCK_SIZE>>>(d_partials2, d_output, gridSize2);
        cudaFree(d_partials2);
    } else {
        reduce_sum_optimized<<<1, BLOCK_SIZE>>>(d_partials, d_output, gridSize);
    }

    cudaFree(d_partials);
}

For 1M elements: pass 1 produces 3,907 partials. Since 3,907 > 256, pass 2 produces ceil(3,907 / 256) = 16 partials. Pass 3 reduces 16 values to 1. Three kernel launches, each trivially fast.

Grid-level atomic reduction

The alternative to multiple passes is a single kernel where each block atomically accumulates its result (the output location must be zeroed, e.g. with cudaMemset, before launch):

__global__ void reduce_sum_atomic(const float* __restrict__ input,
                                  float* __restrict__ output,
                                  int n) {
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    float val = (i < n) ? input[i] : 0.0f;
    val = warp_reduce_sum(val);

    __shared__ float warp_sums[32];
    unsigned int lane = tid % warpSize;
    unsigned int warpId = tid / warpSize;

    if (lane == 0) warp_sums[warpId] = val;
    __syncthreads();

    unsigned int numWarps = blockDim.x / warpSize;
    val = (tid < numWarps) ? warp_sums[tid] : 0.0f;
    if (warpId == 0) val = warp_reduce_sum(val);

    if (tid == 0) {
        atomicAdd(output, val);
    }
}

This eliminates the second kernel launch. The cost is atomic contention: all blocks compete to update the same memory location. For 3,907 blocks, that is 3,907 serialized atomicAdd operations. On an A100, each atomicAdd to global memory takes roughly 100 ns. Total serialized time: ~0.4 ms. For a reduction of 1M floats that otherwise takes ~0.05 ms, the atomic overhead dominates.

Atomic reduction works well when the grid is small (a few hundred blocks or fewer). For large grids, two-pass wins.

Thrust: reduction in one line

CUDA’s Thrust library provides thrust::reduce which automatically selects the best strategy:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

thrust::device_vector<float> d_vec(h_data, h_data + n);
float result = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0f, thrust::plus<float>());

Thrust uses a multi-pass reduction internally with architecture-tuned block sizes and vectorized loads. It consistently matches or exceeds hand-written kernels for standard reductions.

Reduction throughput comparison

The naive atomic approach is severely bottlenecked by contention on a single global memory address. The shared memory tree is limited by __syncthreads() overhead and idle warps in later reduction levels. The shuffle-based two-pass approach gets close to memory bandwidth limits because the kernel is memory-bound (each element is read once, with minimal compute). Thrust matches or slightly exceeds the hand-written shuffle version through internal tuning and vectorized loads.

Histogram: counting occurrences

Histogram computation bins input values and counts how many fall into each bin. Unlike reduction, which converges to a single value, histogram produces an array of counts. The core challenge is the same: many threads updating shared counters simultaneously.

// Naive: every thread atomically updates global histogram
__global__ void histogram_naive(const int* __restrict__ input,
                                int* __restrict__ hist,
                                int n) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&hist[input[i]], 1);
    }
}

This works but creates massive contention when many elements map to the same bin. For 256 bins and 1M uniform random values, each bin receives roughly 3,906 increments. With 3,907 blocks of 256 threads all racing on the same 256 global counters, contention is severe.

Privatized histogram

The solution is histogram privatization: each block maintains its own private copy of the histogram in shared memory. Threads within a block atomically update shared memory (fast, low contention). After all elements are processed, the block merges its private histogram into the global one.

graph TD
  subgraph "Phase 1: Per-Block Private Histograms"
      I["Input Array (1M elements)"] --> B0["Block 0
Private hist[256]
in shared memory"]
      I --> B1["Block 1
Private hist[256]
in shared memory"]
      I --> B2["Block 2
Private hist[256]
in shared memory"]
      I --> BD["..."]
      I --> BN["Block 3906
Private hist[256]
in shared memory"]
  end

  subgraph "Phase 2: Merge into Global Histogram"
      B0 --> G["Global hist[256]
atomicAdd per bin"]
      B1 --> G
      B2 --> G
      BD --> G
      BN --> G
  end

  style I fill:#4dabf7,color:#fff
  style G fill:#69db7c,color:#000

The key insight: shared memory atomics are 10-50x faster than global memory atomics. Within a block of 256 threads, at most 256 threads compete on shared memory. Across the full grid, thousands of blocks would compete on global memory. Privatization reduces the global atomic count from N (one per input element) to numBins * numBlocks (one per bin per block during the merge phase).

Worked example. 256 bins, 1M elements, block size 256.

  • Naive approach: 1,000,000 atomicAdd operations to global memory. With uniform distribution, each of the 256 bins sees ~3,906 atomic updates from different blocks, creating severe contention.
  • Privatized approach: 3,907 blocks, each doing up to 256 shared memory atomics internally. Merge phase: 3,907 blocks * 256 bins = ~1,000,000 global atomics, but these are spread across 256 different addresses (bins), not concentrated on hot bins. Per-bin contention during merge: ~3,907 atomics per bin, which is manageable.

The real win comes from the shared memory phase. Shared memory atomicAdd runs at ~5 ns versus ~100 ns for global memory, and the 1M shared memory updates are spread across thousands of blocks executing in parallel, so their wall-clock cost is small. The merge phase adds ~1M global atomics, but because they are spread across 256 addresses, each bin sees only ~3,907 serialized updates, completing in roughly 0.4 ms.

__global__ void histogram_privatized(const int* __restrict__ input,
                                     int* __restrict__ hist,
                                     int numBins,
                                     int n) {
    extern __shared__ int s_hist[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Initialize private histogram
    for (int j = tid; j < numBins; j += blockDim.x) {
        s_hist[j] = 0;
    }
    __syncthreads();

    // Accumulate into private histogram
    if (i < n) {
        atomicAdd(&s_hist[input[i]], 1);
    }
    __syncthreads();

    // Merge private histogram into global
    for (int j = tid; j < numBins; j += blockDim.x) {
        if (s_hist[j] > 0) {
            atomicAdd(&hist[j], s_hist[j]);
        }
    }
}

// Launch
int numBins = 256;
int smemSize = numBins * sizeof(int);
histogram_privatized<<<gridSize, BLOCK_SIZE, smemSize>>>(d_input, d_hist, numBins, n);

The initialization loop uses a strided pattern so all threads cooperate in zeroing the shared histogram. The merge loop similarly distributes bins across threads. The if (s_hist[j] > 0) guard skips bins with zero count, avoiding unnecessary global atomics.

Scaling: multiple elements per thread

Both the reduction and histogram kernels above process one element per thread. For large arrays, this means launching enormous grids with minimal work per thread. A better approach assigns multiple elements per thread, amortizing launch overhead and increasing arithmetic intensity.

__global__ void histogram_privatized_coarsened(const int* __restrict__ input,
                                               int* __restrict__ hist,
                                               int numBins,
                                               int n) {
    extern __shared__ int s_hist[];
    unsigned int tid = threadIdx.x;

    // Initialize private histogram
    for (int j = tid; j < numBins; j += blockDim.x) {
        s_hist[j] = 0;
    }
    __syncthreads();

    // Each thread processes multiple elements (grid-stride loop)
    unsigned int stride = blockDim.x * gridDim.x;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n; i += stride) {
        atomicAdd(&s_hist[input[i]], 1);
    }
    __syncthreads();

    // Merge
    for (int j = tid; j < numBins; j += blockDim.x) {
        if (s_hist[j] > 0) {
            atomicAdd(&hist[j], s_hist[j]);
        }
    }
}

The grid-stride loop lets you launch a fixed-size grid (say, 256 blocks) regardless of input size. Each thread walks through the array with a stride equal to the total thread count. This keeps the grid small, reducing merge overhead (256 blocks * 256 bins = 65,536 global atomics instead of ~1M), while still processing all elements.

Thrust: histogram with sort and reduce

Thrust does not provide a direct histogram primitive, but you can build one from sort and reduce-by-key:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>

thrust::device_vector<int> d_input(h_input, h_input + n);
thrust::device_vector<int> d_keys(numBins);
thrust::device_vector<int> d_counts(numBins);

// Sorting modifies d_input; copy it first if the original order matters
thrust::sort(d_input.begin(), d_input.end());

// Each run of equal values collapses into one (key, count) pair
auto end = thrust::reduce_by_key(
    d_input.begin(), d_input.end(),
    thrust::constant_iterator<int>(1),
    d_keys.begin(),
    d_counts.begin()
);
int numNonEmpty = end.first - d_keys.begin();
// Note: bins with zero count produce no entry, so the output is sparse --
// d_keys holds only the distinct values and d_counts their frequencies

This approach is algorithmically clean but slower than a privatized histogram for small bin counts: the sort makes several full passes over the data, while the privatized histogram reads each element once. For large bin counts (thousands or more), where shared memory cannot hold the full histogram, sort-and-reduce becomes competitive.

Histogram throughput vs bin count

Key observations:

  • Naive atomic throughput is nearly flat and low. Regardless of bin count, every element requires a global atomic. More bins spread contention across more addresses, slightly improving throughput, but the fundamental bottleneck remains.
  • Privatized throughput peaks at small-to-medium bin counts. With 256 bins, the shared histogram is only 1 KB. This fits easily in shared memory with room to spare for high occupancy. At 4K bins (16 KB), shared memory pressure reduces occupancy. At 16K bins (64 KB), the shared histogram consumes the entire shared memory allocation on most GPUs, forcing single-block occupancy and killing performance.
  • Sort + reduce-by-key is insensitive to bin count. The sort cost dominates and depends on N, not on the number of bins. This makes it the best choice when bin counts exceed what shared memory can hold.

Choosing the right approach

| Scenario | Best approach | Why |
| --- | --- | --- |
| Reduction, any size | Shuffle two-pass | Near bandwidth limit, no contention |
| Reduction, small array (<100K) | Single kernel + atomic | Avoids second launch overhead |
| Histogram, few bins (<1K) | Privatized shared memory | Shared atomics are fast, merge is cheap |
| Histogram, many bins (>4K) | Sort + reduce-by-key | Shared memory cannot hold the histogram |
| Histogram, skewed distribution | Privatized + coarsening | Reduces hot-bin contention through fewer blocks |
| Either, rapid prototyping | Thrust | Correct, fast, one line |

Data distribution matters for histograms. Uniform distributions spread contention evenly. Skewed distributions (e.g., Zipf) concentrate updates on a few hot bins. In the naive approach, hot bins become serialization points. Privatized histograms help because contention is limited to threads within a single block. With 256 threads and a hot bin, you have at most 256-way contention in shared memory versus grid-wide contention in global memory.

For extremely skewed data (e.g., 90% of values in one bin), even shared memory atomics can bottleneck. In those cases, consider warp-level vote and ballot intrinsics: threads in the same warp voting for the same bin can use __ballot_sync to count matches and issue a single atomicAdd with the popcount, reducing contention by up to 32x.

In practice

Use Thrust or CUB for production reductions. thrust::reduce and cub::DeviceReduce::Sum are highly optimized, handle edge cases, and support custom operators. Writing a custom reduction kernel is valuable for learning, but production code should use library primitives.
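For reference, CUB's device-wide reductions follow a two-phase calling pattern: a first call with a null temporary buffer queries the scratch size, and a second call runs the reduction. A minimal sketch (d_in, d_out, and n are assumed to be existing device pointers and the element count; error checking omitted):

```cuda
#include <cub/cub.cuh>

void cub_reduce(const float* d_in, float* d_out, int n) {
    void*  d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;
    // First call: d_temp_storage is null, so CUB only reports
    // the required scratch size in temp_storage_bytes
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, n);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Second call: performs the actual reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, n);
    cudaFree(d_temp_storage);
}
```

In long-running applications, cache the temporary buffer across calls instead of reallocating it for every reduction.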

Profile atomic contention with Nsight Compute. The “Memory Workload Analysis” section shows L2 atomic throughput and contention metrics. If your kernel shows high “L2 Atomic Transactions” relative to useful work, privatization or algorithmic restructuring will help.

Tune grid size for privatized histograms. Fewer blocks means fewer merge operations but less parallelism. More blocks means more parallelism but a more expensive merge. A grid-stride loop with a fixed grid of 256 to 512 blocks is a good starting point.

Watch shared memory limits with large histograms. Each SM has 48 to 164 KB of shared memory depending on architecture. A privatized histogram with 16K 4-byte bins uses 64 KB, which may exceed the available shared memory or force low occupancy. On recent architectures, kernels must opt in to dynamic shared memory allocations above the default limit via cudaFuncSetAttribute with cudaFuncAttributeMaxDynamicSharedMemorySize.

Floating-point reduction is not associative. Reordering additions changes the result due to rounding. Different block sizes or grid sizes produce slightly different sums. If bitwise reproducibility matters, use Kahan summation or fixed-point accumulation.

Do not use naive global atomics for large-scale histograms. The throughput difference between naive and privatized can be 5 to 8x. There is no scenario where the naive approach is preferable when bin counts fit in shared memory.

What comes next

The next article moves beyond single-GPU kernel optimization to heterogeneous processing, where you will learn to partition work across CPU and GPU, overlap host and device computation, and handle workloads that do not fit neatly into a single kernel launch pattern.
