
Parallel prefix sum and reduction: the core parallel primitives

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article assumes you have read CUDA synchronization and atomics, where you learned how threads coordinate through barriers and atomic operations. You need that foundation because every algorithm here depends on correct synchronization between threads within and across warps.

You should also be comfortable with shared memory and the concept of work complexity versus depth (span) complexity in parallel algorithms.

Reduction: from N values to one

Reduction collapses an array of N values into a single result using an associative binary operator. Sum, min, max, and bitwise OR are all reductions. The sequential version is trivial:

float sequential_sum(float* data, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

This runs in O(N) time with O(N) work. On a GPU with thousands of cores sitting idle, we can do much better on the time axis.

Parallel reduction with a tree

A parallel reduction organizes the computation as a binary tree. At each level, half the active values are combined in pairs. For N values, the tree has log2(N) levels. At each level, N/2^k independent additions execute in parallel.

graph TD
  subgraph "Level 0: Input (8 elements)"
      A0["2"] --- A1["4"] --- A2["6"] --- A3["8"] --- A4["1"] --- A5["3"] --- A6["5"] --- A7["7"]
  end

  subgraph "Level 1: 4 additions"
      B0["2+4=6"] --- B1["6+8=14"] --- B2["1+3=4"] --- B3["5+7=12"]
  end

  subgraph "Level 2: 2 additions"
      C0["6+14=20"] --- C1["4+12=16"]
  end

  subgraph "Level 3: 1 addition"
      D0["20+16=36"]
  end

  A0 --> B0
  A1 --> B0
  A2 --> B1
  A3 --> B1
  A4 --> B2
  A5 --> B2
  A6 --> B3
  A7 --> B3

  B0 --> C0
  B1 --> C0
  B2 --> C1
  B3 --> C1

  C0 --> D0
  C1 --> D0

The tree has depth O(log N) and performs O(N) total work, which matches the sequential algorithm. This is work-efficient: we do not perform more total operations than the sequential version, we just distribute them across processors.

Concrete trace: Start with [2, 4, 6, 8, 1, 3, 5, 7] and 8 threads (T0 through T7).

  • Step 1 (stride 1): T0 adds index 0+1 (2+4=6), T2 adds index 2+3 (6+8=14), T4 adds index 4+5 (1+3=4), T6 adds index 6+7 (5+7=12). Four additions, four threads active.
  • Step 2 (stride 2): T0 adds index 0+2 (6+14=20), T4 adds index 4+6 (4+12=16). Two additions, two threads active.
  • Step 3 (stride 4): T0 adds index 0+4 (20+16=36). One addition, one thread active.

Final result: 36. Total operations: 4 + 2 + 1 = 7 additions for 8 elements. The sequential version also performs 7 additions. The parallel version completes in 3 steps instead of 7.
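The same trace can be checked with a small host-side model. This is plain Python standing in for the per-thread logic, and `tree_reduce` is a name invented here for illustration:

```python
# CPU model of the interleaved tree reduction traced above.
# Each step doubles the stride; "thread" t is active when t is a
# multiple of 2*stride, and it adds the element one stride away.
def tree_reduce(data):
    data = list(data)                        # work on a copy
    n = len(data)
    stride = 1
    while stride < n:
        for t in range(0, n, 2 * stride):    # the active threads this step
            if t + stride < n:
                data[t] += data[t + stride]
        stride *= 2
    return data[0]

print(tree_reduce([2, 4, 6, 8, 1, 3, 5, 7]))  # 36
```

Running it on the example input reproduces the trace step by step: stride 1 yields [6, _, 14, _, 4, _, 12, _], stride 2 yields 20 and 16, stride 4 yields 36.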

Shared memory reduction kernel

The standard single-block reduction kernel loads data into shared memory, then iteratively halves the active set:

__global__ void reduce_sum(float* input, float* output, int n) {
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? input[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}

Each block produces a partial sum. A second kernel (or a single-block kernel) reduces those partial sums to the final result. The __syncthreads() inside the loop ensures all writes from the current level are visible before the next level reads them.

This kernel has a problem. In the first iteration, threads 0 through N/2-1 are active while the rest sit idle. All those idle threads still occupy warp slots. Worse, in the final iterations only one warp (or a fraction of one) is doing work while dozens of warps burn cycles on the barrier.

Warp-level reduction with shuffle

For the last 32 elements (one warp), we can replace shared memory with warp shuffle instructions. __shfl_down_sync lets a thread read a register value from another thread in the same warp without going through shared memory. No __syncthreads() barrier is needed: the _sync shuffle variants synchronize the participating lanes within the warp themselves.

__device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    }
    return val;
}

__global__ void reduce_sum_warp_shuffle(float* input, float* output, int n) {
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int lane = tid % warpSize;
    unsigned int warpId = tid / warpSize;

    float val = (i < n) ? input[i] : 0.0f;

    // First: full warp-level reduction
    val = warp_reduce_sum(val);

    // Write each warp's result to shared memory
    if (lane == 0) {
        sdata[warpId] = val;
    }
    __syncthreads();

    // First warp reduces the per-warp results
    val = (tid < blockDim.x / warpSize) ? sdata[lane] : 0.0f;
    if (warpId == 0) {
        val = warp_reduce_sum(val);
    }

    if (tid == 0) {
        output[blockIdx.x] = val;
    }
}

The shuffle version avoids shared memory bank conflicts entirely for the warp-level steps, saves shared memory capacity, and eliminates explicit barrier overhead. The 0xFFFFFFFF mask declares that all 32 lanes participate; on Volta and later, where lanes in a warp can diverge independently, this mask is what guarantees the lanes are converged for the shuffle.

The __shfl_down_sync(mask, val, offset) call returns the value of val from the thread that is offset lanes higher. Thread 0 gets thread 16’s value, thread 1 gets thread 17’s value, and so on. After 5 iterations (offsets 16, 8, 4, 2, 1), lane 0 holds the sum of all 32 values.
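The shuffle semantics are easy to model on the host. The sketch below is a plain-Python stand-in for a 32-lane warp (function names are ours), mirroring the rule that a lane whose source is out of range keeps its own value:

```python
WARP_SIZE = 32

def shfl_down(vals, offset):
    """Model of __shfl_down_sync: lane i reads the value from lane
    i + offset; lanes past the end keep their own value."""
    return [vals[i + offset] if i + offset < WARP_SIZE else vals[i]
            for i in range(WARP_SIZE)]

def warp_reduce_sum(vals):
    # Offsets 16, 8, 4, 2, 1 -- after 5 steps lane 0 holds the full sum.
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [v + s for v, s in zip(vals, shifted)]
        offset //= 2
    return vals[0]

print(warp_reduce_sum(list(range(32))))  # 496
```

Lanes above the halfway point accumulate garbage at each step, but that never matters: every subsequent step only reads lanes below the shrinking active front, and only lane 0's final value is used.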

Performance: sequential vs. parallel vs. warp shuffle

At small sizes the GPU kernel launch overhead dominates and the CPU wins. As the array grows past roughly 64K elements, the parallel versions pull ahead. The warp shuffle variant consistently outperforms the naive shared memory version by 1.5 to 1.7x because it eliminates shared memory round-trips and barriers during the final warp-level steps.

Prefix sum (scan)

Reduction collapses N values to 1. Prefix sum (scan) keeps all intermediate results. Given an array [a0, a1, a2, …, aN-1] and an associative operator +, the inclusive scan produces:

[a0, a0+a1, a0+a1+a2, ..., a0+a1+...+aN-1]

The exclusive scan shifts the result right and fills position 0 with the identity element (0 for addition):

[0, a0, a0+a1, ..., a0+a1+...+aN-2]
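Both definitions reduce to a few lines of sequential Python, handy as a reference against which to check the parallel versions (function names are ours):

```python
def inclusive_scan(a):
    """Position i holds the sum of a[0..i], inclusive."""
    out, total = [], 0
    for x in a:
        total += x
        out.append(total)
    return out

def exclusive_scan(a):
    """Position i holds the sum of a[0..i-1]; position 0 is the identity."""
    out, total = [], 0
    for x in a:
        out.append(total)   # record the total *before* adding x
        total += x
    return out

print(inclusive_scan([1, 2, 3, 4]))  # [1, 3, 6, 10]
print(exclusive_scan([1, 2, 3, 4]))  # [0, 1, 3, 6]
```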

Scan is surprisingly useful. It appears as a building block in stream compaction, radix sort, sparse matrix operations, histogram equalization, and memory allocation on the GPU. Any time you need to compute “where does my output go?” in a data-dependent parallel algorithm, scan is likely the answer.

Naive parallel scan: O(N log N) work

The simplest parallel scan (Hillis-Steele) works like this: at step k, each element i adds the element at position i - 2^k (if it exists). After log2(N) steps, every element holds its prefix sum.

For the input [1, 2, 3, 4]:

  • Step 0 (offset 1): each element adds the element 1 position back. Result: [1, 1+2, 2+3, 3+4] = [1, 3, 5, 7].
  • Step 1 (offset 2): each element adds the element 2 positions back. Result: [1, 3, 1+5, 3+7] = [1, 3, 6, 10].

The result [1, 3, 6, 10] is the inclusive prefix sum. This algorithm has O(N log N) work but only O(log N) depth. It does more total work than the sequential algorithm, but on a GPU with enough parallelism the low depth can still win for moderate N.
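A host-side sketch of Hillis-Steele (names ours). Building a fresh list each step plays the role of the double buffer a GPU implementation needs to avoid read-after-write races:

```python
def hillis_steele_scan(a):
    """Inclusive scan, O(N log N) work: at the step with offset d,
    element i adds element i - d (if it exists)."""
    a = list(a)
    d = 1
    while d < len(a):
        # New list per step = the double buffer of the GPU version.
        a = [a[i] + a[i - d] if i >= d else a[i] for i in range(len(a))]
        d *= 2
    return a

print(hillis_steele_scan([1, 2, 3, 4]))  # [1, 3, 6, 10]
```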

The problem: as N grows, the extra work (log N factor) becomes a real cost. For N = 1M, you are doing 20x the work of the sequential version. We need a work-efficient algorithm.

Blelloch work-efficient scan

Guy Blelloch’s algorithm achieves O(N) work and O(log N) depth by splitting the scan into two phases: an up-sweep (reduce) and a down-sweep (distribute).

Up-sweep phase

The up-sweep is identical to a parallel reduction. It builds a binary tree of partial sums from the leaves to the root.

graph TD
  subgraph "Input"
      I0["1"] --- I1["2"] --- I2["3"] --- I3["4"]
  end

  subgraph "Up-sweep Level 1"
      U0["1"] --- U1["1+2=3"] --- U2["3"] --- U3["3+4=7"]
  end

  subgraph "Up-sweep Level 2"
      V0["1"] --- V1["3"] --- V2["3"] --- V3["3+7=10"]
  end

  I0 --> U0
  I1 --> U1
  I2 --> U2
  I3 --> U3

  U0 --> V0
  U1 --> V1
  U2 --> V2
  U3 --> V3

After the up-sweep, the last element holds the total sum (10 for input [1, 2, 3, 4]).

Down-sweep phase

The down-sweep traverses the tree from root to leaves, distributing partial sums. It starts by replacing the root with the identity element (0 for exclusive scan), then at each level it swaps and adds.

graph TD
  subgraph "Start: replace root with 0"
      D0["1"] --- D1["3"] --- D2["3"] --- D3["0"]
  end

  subgraph "Down-sweep Level 1"
      E0["1"] --- E1["0"] --- E2["3"] --- E3["3"]
  end

  subgraph "Down-sweep Level 2 (Final exclusive scan)"
      F0["0"] --- F1["1"] --- F2["3"] --- F3["6"]
  end

  D0 --> E0
  D1 --> E1
  D2 --> E2
  D3 --> E3

  E0 --> F0
  E1 --> F1
  E2 --> F2
  E3 --> F3

Detailed trace for [1, 2, 3, 4]:

Up-sweep:

  1. Level 1 (stride 1): index 1 = arr[0] + arr[1] = 1+2 = 3. Index 3 = arr[2] + arr[3] = 3+4 = 7. Array: [1, 3, 3, 7].
  2. Level 2 (stride 2): index 3 = arr[1] + arr[3] = 3+7 = 10. Array: [1, 3, 3, 10].

Down-sweep:

  1. Set arr[3] = 0 (identity). Array: [1, 3, 3, 0].
  2. Level 1 (stride 2): temp = arr[3] = 0. arr[3] = arr[1] + arr[3] = 3+0 = 3. arr[1] = temp = 0. Array: [1, 0, 3, 3].
  3. Level 2 (stride 1): For indices 0,1: temp = arr[1] = 0. arr[1] = arr[0] + arr[1] = 1+0 = 1. arr[0] = temp = 0. For indices 2,3: temp = arr[3] = 3. arr[3] = arr[2] + arr[3] = 3+3 = 6. arr[2] = temp = 3. Array: [0, 1, 3, 6].

Exclusive scan result: [0, 1, 3, 6]

To get the inclusive scan, shift left by one and append the total sum: [1, 3, 6, 10] ✓
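The whole two-phase algorithm fits in a short host-side model (power-of-two length assumed, as in the trace above; the function name is ours):

```python
def blelloch_exclusive_scan(a):
    """Work-efficient exclusive scan; length must be a power of two."""
    a = list(a)
    n = len(a)
    # Up-sweep (reduce): build partial sums toward the root.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    # Down-sweep: clear the root, then swap-and-add back to the leaves.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            t = a[i - d]
            a[i - d] = a[i]
            a[i] += t
        d //= 2
    return a

print(blelloch_exclusive_scan([1, 2, 3, 4]))  # [0, 1, 3, 6]
```

Each inner loop body touches a disjoint pair of indices, which is exactly why every level of both sweeps parallelizes cleanly on the GPU.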

Scan algorithm comparison

Algorithm                 | Work complexity | Depth complexity | In-place                | Suitable for GPU
--------------------------|-----------------|------------------|-------------------------|-----------------------------------
Sequential scan           | O(N)            | O(N)             | ✓                       | ✗
Hillis-Steele (naive)     | O(N log N)      | O(log N)         | ✗ (needs double buffer) | ✓ (high parallelism, extra work)
Blelloch (work-efficient) | O(N)            | O(log N)         | ✓                       | ✓ (optimal work, good parallelism)
Brent-Kung                | O(N)            | O(log N)         | ✓                       | ✓ (fewer operations than Blelloch)
Merge-based (CUB/Thrust)  | O(N)            | O(log N)         | —                       | ✓ (best practical performance)

Hillis-Steele wins when you have more processors than data: it needs only log N steps versus roughly 2 log N for Blelloch's two sweeps, and every step keeps all elements busy. Blelloch wins when the data is large relative to processor count because it does less total work. Production libraries like CUB and Thrust use hybrid approaches: Blelloch within a block, with inter-block coordination through global memory.

Using Thrust for scan in CUDA C++

In production code you almost never write your own scan. Thrust provides optimized implementations:

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <iostream>

int main() {
    int h_data[] = {1, 2, 3, 4, 5, 6, 7, 8};
    int n = 8;

    thrust::device_vector<int> d_input(h_data, h_data + n);
    thrust::device_vector<int> d_output(n);

    // Inclusive scan (prefix sum)
    thrust::inclusive_scan(d_input.begin(), d_input.end(), d_output.begin());

    std::cout << "Inclusive scan: ";
    for (int i = 0; i < n; i++) {
        std::cout << d_output[i] << " ";
    }
    // Output: 1 3 6 10 15 21 28 36
    std::cout << std::endl;

    // Exclusive scan
    thrust::exclusive_scan(d_input.begin(), d_input.end(), d_output.begin());

    std::cout << "Exclusive scan: ";
    for (int i = 0; i < n; i++) {
        std::cout << d_output[i] << " ";
    }
    // Output: 0 1 3 6 10 15 21 28
    std::cout << std::endl;

    return 0;
}

Thrust automatically selects the best algorithm for your GPU and data size. Under the hood, it uses CUB’s device-wide scan, which combines multi-block coordination via the decoupled look-back technique with warp-level primitives.

Python: reduction and scan with CuPy

CuPy provides NumPy-compatible APIs that run on the GPU. Reduction is straightforward:

import cupy as cp
import time

# Reduction with cupy.sum
data = cp.random.rand(16_000_000, dtype=cp.float32)

# cupy.sum calls an optimized GPU reduction kernel internally.
# It uses a multi-pass approach: per-block reductions followed
# by a final reduction of block results.
start = time.perf_counter()
result = cp.sum(data)
cp.cuda.Device().synchronize()
elapsed = time.perf_counter() - start

print(f"Sum: {float(result):.2f}, Time: {elapsed*1000:.3f} ms")

For prefix sum, CuPy exposes cupy.cumsum which internally dispatches to Thrust’s scan:

import cupy as cp

data = cp.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=cp.int32)

# Inclusive prefix sum (uses Thrust scan under the hood)
inclusive = cp.cumsum(data)
print(f"Inclusive scan: {inclusive}")
# Output: [ 1  3  6 10 15 21 28 36]

# Exclusive scan: shift right and prepend 0
exclusive = cp.zeros_like(data)
exclusive[1:] = inclusive[:-1]
print(f"Exclusive scan: {exclusive}")
# Output: [ 0  1  3  6 10 15 21 28]

CuPy does not expose a direct exclusive scan API, so the shift approach works. For performance-critical code, consider using the RawKernel interface to call CUB’s exclusive scan directly.

Application: stream compaction

Stream compaction removes unwanted elements from an array while preserving order. Given an array and a predicate (keep or discard), produce a dense output of only the kept elements. This is used in ray tracing (removing terminated rays), particle simulations (removing dead particles), and sparse data processing.

The algorithm has three steps:

  1. Evaluate predicate: create a flag array where 1 = keep, 0 = discard.
  2. Exclusive scan on the flag array: produces the output index for each kept element.
  3. Scatter: each kept element writes to its computed output position.

Example: input [A, B, C, D, E], flags [1, 0, 1, 1, 0].

  • Exclusive scan of flags: [0, 1, 1, 2, 3].
  • Element A (flag=1) writes to index 0. Element C (flag=1) writes to index 1. Element D (flag=1) writes to index 2.
  • Output: [A, C, D]. Output count = scan[4] + flags[4] = 3+0 = 3.

Without scan, you would need atomic increments to a global counter, which serialize under contention. Scan gives every thread its output address with zero contention.
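The three steps map directly to code. Here is a sequential Python sketch (names ours; a real GPU version would run the flag, scan, and scatter phases as parallel kernels):

```python
def compact(items, keep):
    """Stream compaction via exclusive scan: flags -> output slots -> scatter."""
    # Step 1: evaluate the predicate into a 0/1 flag array.
    flags = [1 if keep(x) else 0 for x in items]
    # Step 2: exclusive scan of the flags gives each kept element its slot.
    idx, total = [], 0
    for f in flags:
        idx.append(total)
        total += f
    # Step 3: scatter -- every kept element knows its slot, no contention.
    out = [None] * total
    for x, f, i in zip(items, flags, idx):
        if f:
            out[i] = x
    return out

print(compact(["A", "B", "C", "D", "E"], lambda x: x in "ACD"))  # ['A', 'C', 'D']
```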

Application: radix sort building block

Radix sort processes integers one bit at a time, from least significant to most significant. At each bit position, it partitions elements into two groups: those with bit 0 and those with bit 1. This partition step is a scan.

For bit k of each element, compute: flag = (element >> k) & 1. Then:

  • Exclusive scan of NOT(flag) gives the destination index for 0-bit elements.
  • The 1-bit elements go after all the 0-bit elements.

This gives a stable sort for one bit position. Repeat for all 32 bits (or fewer for limited ranges) and you have a complete sort. GPU radix sort (as implemented in CUB) sorts billions of keys per second on modern hardware, making it one of the fastest sorting approaches on GPUs.
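A host-side model of the one-bit split, looped over all bits to give a complete least-significant-digit radix sort (sequential Python standing in for the parallel scan and scatter; function names are ours):

```python
def split_by_bit(keys, k):
    """Stable partition on bit k: 0-bit keys first, then 1-bit keys.
    Destination of each 0-bit key = exclusive scan of NOT(flag)."""
    flags = [(x >> k) & 1 for x in keys]
    zeros, total0 = [], 0
    for f in flags:
        zeros.append(total0)     # exclusive scan of (1 - flag)
        total0 += 1 - f
    out = [0] * len(keys)
    ones_seen = 0
    for x, f, z in zip(keys, flags, zeros):
        if f == 0:
            out[z] = x                       # 0-bit keys pack to the front
        else:
            out[total0 + ones_seen] = x      # 1-bit keys follow, in order
            ones_seen += 1
    return out

def radix_sort(keys, bits=32):
    for k in range(bits):
        keys = split_by_bit(keys, k)         # stable per bit => sorted overall
    return keys

print(radix_sort([7, 3, 5, 1, 6, 2], bits=3))  # [1, 2, 3, 5, 6, 7]
```

Because each one-bit split is stable, earlier (less significant) bits act as tiebreakers for later ones, which is why repeating the split sorts completely.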

Application: histogram equalization

Image histogram equalization enhances contrast by redistributing pixel intensities. The core step: compute a histogram of pixel values (256 bins for 8-bit images), then compute the prefix sum of the histogram. This cumulative distribution function maps old pixel values to new ones that span the full intensity range.

new_value[i] = round((cdf[i] - cdf_min) / (total_pixels - cdf_min) * 255)

The prefix sum of the histogram is exactly an inclusive scan. On a GPU, you compute the histogram with atomics, scan it with Thrust, then remap all pixels in parallel. The scan step itself is negligible compared to the histogram and remap steps, but it is the linchpin: without it, there is no mapping function.
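As a toy end-to-end check, here is the pipeline on an 8-level "image" in plain Python (the small level count and the helper name are ours; a real 8-bit implementation would use 256 bins):

```python
def equalize(pixels, levels=8):
    """Histogram equalization: histogram -> prefix sum (CDF) -> remap."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Inclusive scan of the histogram gives the CDF.
    cdf, total = [], 0
    for h in hist:
        total += h
        cdf.append(total)
    cdf_min = min(c for c in cdf if c > 0)
    n = len(pixels)
    # The remap formula from the text, with 255 replaced by levels - 1.
    remap = [round((c - cdf_min) / (n - cdf_min) * (levels - 1)) for c in cdf]
    return [remap[p] for p in pixels]

# Pixels bunched into the low quarter of the range get stretched out.
print(equalize([0, 0, 1, 1, 2, 2, 3, 3]))  # [0, 0, 2, 2, 5, 5, 7, 7]
```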

In practice

Use libraries. CUB’s DeviceReduce and DeviceScan are heavily optimized with look-back techniques, warp-level primitives, and multi-SM scheduling. They outperform hand-written kernels in virtually all cases. Thrust wraps CUB and provides a cleaner API.

Numerical precision matters. Floating-point addition is not truly associative. The parallel tree reduction computes a different value than the sequential sum because the grouping of additions differs. For large arrays of floats, the parallel result can diverge noticeably. Use double precision or compensated summation (Kahan) when accuracy matters.
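When doubles are not an option, Kahan's compensated summation is only a few lines. A plain-Python sketch of the per-thread accumulator (the same pattern drops into the sequential loop of a reduction kernel):

```python
def kahan_sum(values):
    """Compensated (Kahan) summation: carry the rounding error of each
    addition into the next one instead of discarding it."""
    total = 0.0
    c = 0.0                   # running compensation for lost low-order bits
    for v in values:
        y = v - c             # re-inject the error from the previous add
        t = total + y
        c = (t - total) - y   # what got rounded away when adding y
        total = t
    return total

data = [0.1] * 1000
print(kahan_sum(data))  # stays at (or within an ulp of) 100.0
print(sum(data))        # the naive left-to-right sum drifts slightly
```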

Reduction across blocks requires coordination. A single block can reduce at most 1024 elements (with one element per thread). For larger arrays, each block produces a partial result, then a second kernel reduces those partials. CUB’s DeviceReduce handles this automatically with a single API call.

Warp shuffle has constraints. All participating lanes must supply a consistent mask, and the optional width argument must be a power of two no larger than warpSize. On pre-Volta architectures, warp shuffle without the _sync suffix works but is deprecated. Always use __shfl_down_sync with an explicit mask.

Scan is the parallel programmer’s secret weapon. Any time you find yourself reaching for an atomic counter to assign output positions, stop and ask whether a scan would work instead. The scan version is almost always faster because it eliminates contention.

Common mistake: forgetting __syncthreads() between reduction levels in shared memory. This creates a race condition where a thread reads a value that another thread has not yet written. The bug is non-deterministic and may not manifest on small inputs.

What comes next

With reduction and scan in your toolkit, you have the two most important parallel primitives. The next article covers concurrent data structures, where these primitives combine with atomics to build lock-free queues, work-stealing pools, and dynamic memory allocators that run entirely on the GPU.
