Parallel Thinking · Part 4

Parallel decomposition: how to split work across processors

In this series (5 parts)
  1. What parallelism actually means and why it is hard
  2. Flynn's taxonomy and types of parallel hardware
  3. Amdahl's law and scalability limits
  4. Parallel decomposition: how to split work across processors
  5. CPU vs GPU: why they are built differently and when to use each

Prerequisites

This article builds on two earlier posts in the “Parallel Thinking” series:

  • Parallel thinking intro: what parallelism is, why it matters, and the vocabulary we use throughout the series.
  • Amdahl’s law: how the serial fraction of a program limits your speedup, no matter how many processors you throw at it.

You should be comfortable with the idea that parallel speedup depends on how much of the work can actually run concurrently. This article is about the practical question that follows: how do you split that work?

The core question

You have a computation. You have P processors. How do you divide the computation so that every processor stays busy and the answer arrives as fast as possible?

That question has no single answer. Different computations call for different splitting strategies. But the strategies fall into a small number of recurring patterns. Learn those patterns and you can decompose nearly any workload you will encounter in practice.

We will cover four decomposition patterns, then examine two cross-cutting concerns (load balancing and granularity), and finish with the single most important parallel primitive: the reduction.

Four decomposition patterns

Data decomposition

Split the data. Apply the same operation to every piece.

This is the simplest and most common pattern. If you need to add 1 to every element of a million-element array, give each processor a slice of the array. Every processor runs the same code on different data.

Data decomposition works well when:

  • The operation per element is uniform (same cost everywhere).
  • Elements are independent (no element needs the result of another).
  • The data is large enough to keep all processors busy.

Image processing is the textbook case. Apply a brightness filter to a 4K image by giving each processor a horizontal strip of pixels. Every pixel undergoes the same math. No pixel depends on its neighbor (for a point filter). The work is perfectly balanced because every strip has the same number of pixels.

GPU kernels are data decomposition made concrete. You write one kernel function. The hardware launches thousands of threads, each processing a different data element.
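As a sketch (in Python, with illustrative names like `parallel_map` and `add_one`; threads stand in for processors here, and real CPU-bound work would use processes or a compiled language), data decomposition is just slicing the array and mapping the same function over each slice:

```python
from concurrent.futures import ThreadPoolExecutor

def add_one(chunk):
    # The same operation, applied to every element of this slice.
    return [x + 1 for x in chunk]

def parallel_map(data, op, num_workers):
    # Carve the array into num_workers contiguous, near-equal slices.
    n = len(data)
    bounds = [(i * n // num_workers, (i + 1) * n // num_workers)
              for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map preserves slice order, so reassembly is a flat concat.
        parts = pool.map(op, (data[lo:hi] for lo, hi in bounds))
        return [x for part in parts for x in part]
```

Note that the slicing gives every worker a contiguous region, which also plays well with the memory-layout concerns discussed later in this article.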

Task decomposition

Split the work by function, not by data. Different processors do different things, potentially on the same data.

Consider rendering a web page. One thread parses HTML. Another decodes images. A third runs JavaScript. They operate on the same document but perform entirely different computations.

Task decomposition is natural when your program has distinct stages or modules that can execute independently. The challenge is that different tasks often take different amounts of time, making load balance harder than in data decomposition.
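A minimal Python sketch of the idea (the three helper functions are hypothetical stand-ins for real rendering stages): submit different functions over the same input to a pool, then collect each result.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for "parse HTML", "decode images", "run JS":
# three different computations over the same document.
def count_words(doc):
    return len(doc.split())

def count_lines(doc):
    return doc.count("\n") + 1

def longest_word(doc):
    return max(doc.split(), key=len)

def analyze(doc):
    # Task decomposition: each worker runs a *different* function,
    # all over the same input data.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(task, doc)
                   for task in (count_words, count_lines, longest_word)]
        return [f.result() for f in futures]
```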

Pipeline decomposition

Arrange the computation as a sequence of stages. Each stage runs on a different processor. Data flows from stage to stage like products on an assembly line.

A video encoder is a good example. Stage 1 reads raw frames. Stage 2 applies color correction. Stage 3 compresses. Stage 4 writes to disk. Once stage 1 finishes frame N and passes it to stage 2, stage 1 immediately starts frame N+1. All four stages run simultaneously on different frames.

Pipeline throughput is limited by the slowest stage. If compression takes 10ms but color correction takes 2ms, the pipeline produces one frame every 10ms regardless of how fast the other stages are. Balancing stage durations is the key design problem.
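The assembly-line structure can be sketched with one thread and one queue per stage. This is an illustrative Python skeleton, not production code; `DONE` is an assumed end-of-stream sentinel.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream

def stage(op, inbox, outbox):
    # Each stage loops: receive an item, transform it, send it downstream.
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # propagate shutdown to the next stage
            return
        outbox.put(op(item))

def run_pipeline(items, ops):
    # One queue between each pair of stages; one thread per stage.
    queues = [queue.Queue() for _ in range(len(ops) + 1)]
    threads = [threading.Thread(target=stage, args=(op, queues[i], queues[i + 1]))
               for i, op in enumerate(ops)]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)   # feed the first stage
    queues[0].put(DONE)
    results = []
    while (item := queues[-1].get()) is not DONE:
        results.append(item)
    for t in threads:
        t.join()
    return results
```

A call like `run_pipeline(frames, [color_correct, compress])` (stage names hypothetical) keeps both stages busy on different frames at once, exactly the assembly-line behavior described above.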

Divide and conquer

Split the problem recursively into subproblems, solve each subproblem in parallel, then combine the results.

Merge sort is the classic example. Split the array in half. Sort each half in parallel. Merge the sorted halves. Each half can itself be split again, giving you a tree of parallel tasks.

Divide and conquer maps naturally to problems with recursive structure. The parallel speedup depends on how much of the work is in the “divide” and “combine” phases versus the independent subproblems.
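A Python sketch of parallel merge sort follows. The `depth` cutoff is an assumption of this sketch rather than part of the algorithm proper: below it, recursion proceeds sequentially so tasks do not become too small to be worth a thread.

```python
import threading

def merge(left, right):
    # Combine phase: interleave two sorted lists into one.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]

def parallel_merge_sort(data, depth=2):
    # Divide: split in half; sort the halves concurrently; combine.
    if len(data) <= 1:
        return list(data)
    mid = len(data) // 2
    if depth <= 0:
        # Granularity cutoff: recurse sequentially from here down.
        return merge(parallel_merge_sort(data[:mid], 0),
                     parallel_merge_sort(data[mid:], 0))
    result = {}
    def sort_left():
        result["left"] = parallel_merge_sort(data[:mid], depth - 1)
    t = threading.Thread(target=sort_left)
    t.start()                                            # left half on a new thread
    right = parallel_merge_sort(data[mid:], depth - 1)   # right half on this one
    t.join()
    return merge(result["left"], right)
```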

graph TD
  subgraph Data Decomposition
      D1[Same Op] --> D2[Chunk 1]
      D1 --> D3[Chunk 2]
      D1 --> D4[Chunk 3]
      D1 --> D5[Chunk 4]
  end

  subgraph Task Decomposition
      T1[Input Data] --> T2[Task A]
      T1 --> T3[Task B]
      T1 --> T4[Task C]
      T2 --> T5[Merge]
      T3 --> T5
      T4 --> T5
  end

  subgraph Pipeline Decomposition
      P1[Stage 1] --> P2[Stage 2]
      P2 --> P3[Stage 3]
      P3 --> P4[Stage 4]
  end

  subgraph Divide and Conquer
      DC1[Problem] --> DC2[Left Half]
      DC1 --> DC3[Right Half]
      DC2 --> DC4[LL]
      DC2 --> DC5[LR]
      DC3 --> DC6[RL]
      DC3 --> DC7[RR]
  end

Decomposition patterns compared

| Pattern | When to use | Load balance risk | Communication overhead | GPU friendly |
| --- | --- | --- | --- | --- |
| Data decomposition | Uniform independent operations on large datasets | Low (equal chunks) | Low (no inter-element deps) | ✓ Excellent |
| Task decomposition | Distinct functional units that can run concurrently | High (tasks vary in cost) | Medium (shared data access) | ⚠ Depends on tasks |
| Pipeline decomposition | Multi-stage processing of a stream of items | Medium (slowest stage dominates) | Low (stage-to-stage only) | ⚠ Hard to map |
| Divide and conquer | Recursive problems with independent subproblems | Low (balanced splits) | Medium (combine phase) | ⚠ Recursion is awkward on GPUs |

Worked example 1: reduction tree

The problem

Sum the array {3, 1, 4, 1, 5, 9, 2, 6}. Sequentially, that is 7 additions performed one after another. Can we do better with parallel processors?

The reduction pattern

A reduction combines N values into a single result using an associative binary operator. Addition, multiplication, min, max, and bitwise OR all qualify as combining operators.

The trick is to organize the work as a binary tree. At each level, pairs of values are combined in parallel. With N elements, the tree has log₂(N) levels.
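In code, one pass per tree level looks like the following sequential Python sketch; a parallel runtime would evaluate all of a level's pairs concurrently.

```python
def tree_reduce(values, op):
    # One pass per tree level: combine adjacent pairs. All pairs within
    # a level are independent of one another.
    level = list(values)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries up unchanged
        level = nxt
    return level[0]
```

`tree_reduce([3, 1, 4, 1, 5, 9, 2, 6], lambda a, b: a + b)` returns 31, and swapping in `max` or `min` reduces the same array in the same log₂(N) levels.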

For our 8-element array:

graph TD
  L0_0["3"] --- L1_0["3+1=4"]
  L0_1["1"] --- L1_0
  L0_2["4"] --- L1_1["4+1=5"]
  L0_3["1"] --- L1_1
  L0_4["5"] --- L1_2["5+9=14"]
  L0_5["9"] --- L1_2
  L0_6["2"] --- L1_3["2+6=8"]
  L0_7["6"] --- L1_3

  L1_0 --- L2_0["4+5=9"]
  L1_1 --- L2_0
  L1_2 --- L2_1["14+8=22"]
  L1_3 --- L2_1

  L2_0 --- L3_0["9+22=31"]
  L2_1 --- L3_0

Step by step

Level 0 (input): 3, 1, 4, 1, 5, 9, 2, 6

Level 1 (4 additions in parallel):

  • 3 + 1 = 4
  • 4 + 1 = 5
  • 5 + 9 = 14
  • 2 + 6 = 8

Level 2 (2 additions in parallel):

  • 4 + 5 = 9
  • 14 + 8 = 22

Level 3 (1 addition):

  • 9 + 22 = 31

Counting operations

Total additions performed: 4 + 2 + 1 = 7. That is the same as sequential. A reduction does not reduce the total amount of work. It reduces the time by doing multiple additions simultaneously.

Sequential time: 7 steps (one addition per step).

Parallel time with unlimited processors: 3 steps (log₂(8) = 3 levels of the tree).

Speedup: 7 / 3 = 2.33x.

For larger arrays the speedup improves. With 1,048,576 elements (2²⁰), sequential time is 1,048,575 steps. Parallel time is 20 steps. That is a speedup of over 52,000x. In practice, communication costs and limited processors reduce this, but the logarithmic depth is what makes reductions so powerful.

Why reductions matter

Reductions appear everywhere:

  • Sum, mean, variance of a dataset.
  • Finding the minimum or maximum.
  • Checking if any element satisfies a condition (logical OR reduction).
  • Dot products (multiply pairwise, then sum-reduce).
  • Histogram accumulation.

Every GPU programming framework provides optimized reduction primitives. Learn this pattern once, use it constantly.

Worked example 2: load imbalance

The setup

You have 8 workers and 100 tasks of varying durations:

  • 10 tasks at 16 units = 160
  • 20 tasks at 10 units = 200
  • 30 tasks at 8 units = 240
  • 40 tasks at 5 units = 200

Total work: 800 units. With 8 workers, the ideal completion time is 800 / 8 = 100 units.

Round-robin assignment

Assign tasks in order: task 0 to worker 0, task 1 to worker 1, …, task 7 to worker 7, task 8 to worker 0, and so on. If tasks are sorted by duration (longest first), the first workers get all the heavy tasks.

Tasks sorted longest first, assigned round-robin to 8 workers:

| Worker | Tasks received (durations) | Total load |
| --- | --- | --- |
| W0 | 16, 16, 10, 10, 8, 8, 8, 8, 5, 5, 5, 5, 5 | 109 |
| W1 | 16, 16, 10, 10, 8, 8, 8, 8, 5, 5, 5, 5, 5 | 109 |
| W2 | 16, 10, 10, 10, 8, 8, 8, 8, 5, 5, 5, 5, 5 | 103 |
| W3 | 16, 10, 10, 10, 8, 8, 8, 8, 5, 5, 5, 5, 5 | 103 |
| W4 | 16, 10, 10, 10, 8, 8, 8, 5, 5, 5, 5, 5 | 95 |
| W5 | 16, 10, 10, 10, 8, 8, 8, 5, 5, 5, 5, 5 | 95 |
| W6 | 16, 10, 10, 8, 8, 8, 8, 5, 5, 5, 5, 5 | 93 |
| W7 | 16, 10, 10, 8, 8, 8, 8, 5, 5, 5, 5, 5 | 93 |

Total time = max across workers = 109 units.

The ideal was 100 units. We are 9% slower than optimal because W0 carries about 17% more work than W7.

Better assignment: longest processing time first (LPT)

A greedy heuristic: always assign the next longest task to whichever worker currently has the least total load.

Using LPT:

  1. Sort all 100 tasks by duration, longest first.
  2. For each task, find the worker with the smallest current load. Assign the task there.
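Both heuristics are easy to simulate. This Python sketch (function names are illustrative) computes the makespan, meaning the maximum worker load, for sorted round-robin and for LPT, using a min-heap to track the least-loaded worker:

```python
import heapq

def round_robin_makespan(tasks, workers):
    # Deal tasks out in order: task i goes to worker i % workers.
    loads = [0] * workers
    for i, t in enumerate(tasks):
        loads[i % workers] += t
    return max(loads)

def lpt_makespan(tasks, workers):
    # LPT: sort descending, then always give the next task to the
    # currently least-loaded worker (tracked with a min-heap).
    heap = [0] * workers
    heapq.heapify(heap)
    for t in sorted(tasks, reverse=True):
        heapq.heappush(heap, heapq.heappop(heap) + t)
    return max(heap)

# The 100-task mix from this section, already sorted longest first.
tasks = [16] * 10 + [10] * 20 + [8] * 30 + [5] * 40
```

On this mix with 8 workers, LPT lands much closer to the 100-unit ideal than sorted round-robin does.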

With LPT, the heavy tasks are spread across workers early, and the many small 5-unit tasks fill the remaining gaps at the end. The result:

| Worker | Total load |
| --- | --- |
| W0 | 98 |
| W1 | 98 |
| W2 | 99 |
| W3 | 99 |
| W4 | 100 |
| W5 | 100 |
| W6 | 103 |
| W7 | 103 |

Total time = max across workers = 103 units, within 3% of the 100-unit ideal.

LPT does not guarantee an optimal schedule. For this task mix a perfect split into eight 100-unit groups exists, but the greedy order narrowly misses it. In the worst case, LPT produces a schedule within a factor of 4/3 of optimal. That is a classic result from scheduling theory.

The lesson

Uneven work distribution is the silent killer of parallel performance. You can have 8 processors, but if the most heavily loaded one finishes late, the whole job finishes late by exactly that margin. Every other processor sits idle, waiting.

Three strategies to fight imbalance:

  1. Static partitioning with LPT: sort tasks, assign greedily. Simple and effective when task durations are known ahead of time.
  2. Dynamic scheduling: maintain a shared work queue. When a worker finishes, it grabs the next available task. This adapts to unpredictable durations but adds synchronization overhead.
  3. Work stealing: each worker has its own queue. When a worker’s queue is empty, it steals from a busy worker’s queue. This is what modern task runtimes (Intel TBB, Go’s goroutine scheduler, Tokio) use.
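Strategy 2 can be sketched with a shared `queue.Queue` in Python. The loads list merely simulates execution cost; because tasks here take no real time, one thread may drain most of the queue, whereas with real work the pull-when-free behavior is what evens out the loads.

```python
import queue
import threading

def dynamic_schedule(tasks, num_workers):
    # Dynamic scheduling: a shared queue of tasks. Each worker pulls the
    # next task when it becomes free, so fast workers simply pull more.
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    loads = [0] * num_workers

    def worker(wid):
        while True:
            try:
                t = q.get_nowait()
            except queue.Empty:
                return               # queue drained: this worker is done
            loads[wid] += t          # stand-in for actually running the task

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return loads
```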

Granularity: too fine vs too coarse

Granularity is the size of each parallel unit of work.

Too coarse: you split a million-element array into 4 chunks for 1,000 processors. 996 processors sit idle. You left performance on the table.

Too fine: you make each single element a separate task. The overhead of creating, scheduling, and synchronizing a million tasks drowns the useful work. You spend more time managing parallelism than doing computation.

The sweet spot depends on your hardware:

  • CPUs with 8 to 128 cores: chunk sizes of thousands to millions of elements work well. The overhead of a thread or task is microseconds.
  • GPUs with thousands of cores: you need tens of thousands of parallel units to saturate the hardware. But each unit should still do enough work to amortize launch overhead. Typical kernel launches process 256 to 1024 elements per thread block.

A useful rule of thumb: make the granularity as fine as the hardware can efficiently support, but no finer. Profile to find the boundary.
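To make the tradeoff concrete, here is a toy cost model (all constants are invented for illustration): each scheduled task pays a fixed overhead, and each element contributes a sliver of useful work.

```python
import math

def overhead_fraction(n_elements, chunk_size,
                      per_task_overhead_us=5.0, per_element_work_us=0.01):
    # Toy model: finer chunks mean more tasks, hence more total
    # scheduling overhead for the same amount of useful work.
    n_tasks = math.ceil(n_elements / chunk_size)
    overhead = n_tasks * per_task_overhead_us
    work = n_elements * per_element_work_us
    return overhead / (overhead + work)
```

With these invented constants, one-element tasks on a million elements spend over 99% of total time on overhead, while 10,000-element chunks push that below 5%; profiling on real hardware is what finds the boundary in practice.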

In practice

Theory gives you the patterns. Real systems add friction.

Memory layout matters as much as decomposition. A data decomposition that gives each processor a contiguous memory region will outperform one that scatters accesses across memory. On GPUs, coalesced memory access (adjacent threads reading adjacent addresses) can be 10x faster than random access. Your decomposition strategy must respect the memory hierarchy.

Communication is the hidden cost. Every time parallel tasks need to exchange data or synchronize, they stall. Reductions require log(N) synchronization steps. Pipeline stages need buffers. Task decomposition needs locks or message passing. The best decomposition minimizes the surface area where processors must communicate.

Start with data decomposition. If your problem can be framed as “apply the same operation to every element,” do that first. It is the simplest pattern, it maps directly to GPUs, and it has the lowest communication overhead. Only reach for task or pipeline decomposition when the problem genuinely requires different operations or staged processing.

Combine patterns. Real programs use multiple patterns together. A video pipeline uses pipeline decomposition across stages and data decomposition within each stage (processing multiple pixels in parallel). A machine learning training loop uses data decomposition for the forward pass, reduction for the loss computation, and data decomposition again for the backward pass.

Profile before optimizing decomposition. Measure where time actually goes. An imbalanced decomposition might not matter if 90% of the time is spent in I/O. Fix the bottleneck that exists, not the one you imagine.

What comes next

We have covered how to split work. The natural next question is: what hardware actually runs these parallel tasks, and why do CPUs and GPUs make such different tradeoffs?

The next article, CPU vs GPU: architectures for parallel work, compares the two dominant parallel processors. We will look at core counts, memory bandwidth, thread scheduling, and why a GPU can sustain 10,000 concurrent threads while a CPU struggles past 100. Understanding the hardware will let you choose the right decomposition pattern for the right machine.
