What parallelism actually means and why it is hard
In this series (5 parts)
- What parallelism actually means and why it is hard
- Flynn's taxonomy and types of parallel hardware
- Amdahl's law and scalability limits
- Parallel decomposition: how to split work across processors
- CPU vs GPU: why they are built differently and when to use each
Prerequisites
This article assumes you understand processes, threads, and how an operating system schedules work on a CPU. If those concepts are fuzzy, read Processes and threads first and come back.
The core question
You have work to do. Can you split it across multiple workers and finish faster?
Sometimes yes. Sometimes no. The difference between those two answers is the entire subject of parallel computing. This article gives you the vocabulary to reason about it: what parallelism actually is, what prevents it, and what breaks when you get it wrong.
Task parallelism vs data parallelism
There are two fundamentally different ways to do multiple things at once.
Task parallelism means different workers perform different operations. One thread parses a config file while another opens a database connection. The operations are unrelated, so they overlap in time.
Data parallelism means every worker performs the same operation on different chunks of data. You have 10 million pixels. You apply the same brightness filter to each one. 1000 GPU threads each handle 10,000 pixels. Same instruction, different data.
The distinction matters because data parallelism scales. If you double the data, you can double the workers and, in the ideal case, wall-clock time stays flat. Task parallelism does not scale the same way. You only have so many distinct tasks, and they tend to depend on each other.
GPUs are built for data parallelism. CPUs with a few cores are better suited for task parallelism. Most real systems use both.
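The data-parallel structure can be sketched in a few lines of Python. This is a sketch of the shape, not a recipe for speed: the names (`brighten`, `parallel_brighten`) are mine, and in CPython the GIL prevents threads from speeding up CPU-bound work like this — the point is how the same operation is applied to independent chunks.

```python
from concurrent.futures import ThreadPoolExecutor

def brighten(pixels, delta=10):
    """The same operation applied to every element: data parallelism."""
    return [min(255, p + delta) for p in pixels]

def parallel_brighten(pixels, workers=4):
    # Split the data into one chunk per worker; each worker runs
    # the identical operation on its own independent chunk.
    n = len(pixels)
    chunk = (n + workers - 1) // workers
    chunks = [pixels[i:i + chunk] for i in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(brighten, chunks)
    out = []
    for r in results:
        out.extend(r)
    return out
```

Because the chunks are independent, the parallel version produces exactly the same result as the sequential one, regardless of which worker finishes first.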
Independent work: the prerequisite
Parallelism requires independence. Two units of work can run in parallel only if neither needs the result of the other. This sounds obvious, but identifying true independence in real code is where the difficulty lives.
Consider adding 1000 numbers. Each pairwise addition is independent of the others, right? Not exactly. You need the partial sums before you can produce the final sum. There is hidden sequential structure even in something this simple.
Or consider rendering frames in a video. Frame 47 does not depend on frame 48. Those can render in parallel. But frame 47 might depend on the motion vectors computed from frame 46 (for predictive encoding). Suddenly the parallelism is constrained.
The rule: look at the data each task reads and writes. If task A writes something that task B reads, B depends on A. No parallelism between them.
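That rule is mechanical enough to write down. Here is a minimal sketch, with hypothetical task names, that derives dependencies from read/write sets — it checks only the read-after-write case described above:

```python
def find_dependencies(tasks):
    """tasks: ordered list of (name, reads, writes) tuples.
    Task B depends on an earlier task A when A writes
    something B reads (read-after-write)."""
    deps = {}
    for i, (name, reads, _writes) in enumerate(tasks):
        deps[name] = set()
        for prev_name, _prev_reads, prev_writes in tasks[:i]:
            if reads & prev_writes:  # B reads what A wrote
                deps[name].add(prev_name)
    return deps

tasks = [
    ("parse_config", set(),       {"config"}),
    ("open_db",      {"config"},  {"conn"}),
    ("run_query",    {"conn"},    {"rows"}),
]
print(find_dependencies(tasks))
```

Anything with an empty dependency set can start immediately; everything else must wait.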
Dependencies: what prevents parallelization
Let’s make this concrete with a worked example.
Example 1: Task dependency analysis
You have six tasks with these dependencies and durations:
| Task | Duration (ms) | Depends on |
|---|---|---|
| A | 3 | nothing |
| B | 2 | nothing |
| C | 4 | A |
| D | 2 | A, B |
| E | 3 | C |
| F | 1 | D, E |
Here is the dependency graph:
```mermaid
graph TD
    A["A (3ms)"] --> C["C (4ms)"]
    A --> D["D (2ms)"]
    B["B (2ms)"] --> D
    C --> E["E (3ms)"]
    D --> F["F (1ms)"]
    E --> F
    style A fill:#4a9eff,color:#fff
    style B fill:#4a9eff,color:#fff
    style C fill:#ff9f43,color:#fff
    style D fill:#ff9f43,color:#fff
    style E fill:#ee5a24,color:#fff
    style F fill:#ee5a24,color:#fff
```
Task dependency graph. Blue tasks have no dependencies and can start immediately. Orange tasks depend on blue. Red tasks depend on orange.
Which tasks can run in parallel?
At time 0, A and B are both ready. They share no dependencies. Run them together.
Once A finishes (t=3), C and D both become eligible. But D also needs B, which finished at t=2. So at t=3, both C and D are ready. Run them together.
E starts after C finishes (t=7). F starts after both D and E finish.
The critical path is the longest chain through the graph: A (3) -> C (4) -> E (3) -> F (1) = 11ms. No amount of parallelism can beat 11ms. That is the theoretical minimum.
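The critical path can be computed directly from the table. A sketch: each task's earliest finish time is its duration plus the latest finish among its dependencies, and the critical path is the maximum over all tasks.

```python
durations = {"A": 3, "B": 2, "C": 4, "D": 2, "E": 3, "F": 1}
depends_on = {"A": [], "B": [], "C": ["A"], "D": ["A", "B"],
              "E": ["C"], "F": ["D", "E"]}

def critical_path_length(durations, depends_on):
    finish = {}  # memoized earliest finish time per task
    def earliest_finish(task):
        if task not in finish:
            ready = max((earliest_finish(d) for d in depends_on[task]),
                        default=0)
            finish[task] = ready + durations[task]
        return finish[task]
    return max(earliest_finish(t) for t in durations)

print(critical_path_length(durations, depends_on))  # 11
```

For this graph it returns 11, the A -> C -> E -> F chain.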
With 3 workers:
| Time | Worker 1 | Worker 2 | Worker 3 |
|---|---|---|---|
| 0-2 | A | B | idle |
| 2-3 | A | idle | idle |
| 3-5 | D | C | idle |
| 5-7 | idle | C | idle |
| 7-10 | E | idle | idle |
| 10-11 | F | idle | idle |
Total wall-clock: 11ms. We hit the critical path exactly. The third worker was never needed.
With 2 workers:
| Time | Worker 1 | Worker 2 |
|---|---|---|
| 0-2 | A | B |
| 2-3 | A | idle |
| 3-5 | D | C |
| 5-7 | idle | C |
| 7-10 | E | idle |
| 10-11 | F | idle |
Total wall-clock: 11ms. Same result. Two workers are enough to saturate this graph. The critical path, not the worker count, is the bottleneck.
With 1 worker:
All tasks run sequentially: 3 + 2 + 4 + 2 + 3 + 1 = 15ms. Parallelism saved 4ms here, a 27% improvement. That improvement is bounded by how much work lives off the critical path.
The takeaway: adding more workers only helps until you saturate the critical path. After that, extra workers sit idle.
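The schedules above can be reproduced with a simple greedy list scheduler — a sketch, not an optimal scheduler: at each moment it assigns any ready task to any idle worker and advances time to the next completion.

```python
import heapq

durations = {"A": 3, "B": 2, "C": 4, "D": 2, "E": 3, "F": 1}
depends_on = {"A": [], "B": [], "C": ["A"], "D": ["A", "B"],
              "E": ["C"], "F": ["D", "E"]}

def makespan(durations, depends_on, workers):
    """Wall-clock time for a greedy schedule on `workers` workers."""
    done = set()
    running = []  # min-heap of (finish_time, task)
    remaining = set(durations)
    t = 0
    while remaining or running:
        # Start every ready task while a worker is free.
        ready = [task for task in remaining
                 if all(d in done for d in depends_on[task])]
        for task in sorted(ready):
            if len(running) >= workers:
                break
            heapq.heappush(running, (t + durations[task], task))
            remaining.discard(task)
        # Advance time to the next completion.
        t, finished = heapq.heappop(running)
        done.add(finished)
    return t

print(makespan(durations, depends_on, 1))  # 15
print(makespan(durations, depends_on, 2))  # 11
print(makespan(durations, depends_on, 3))  # 11
```

Two workers already hit the 11ms critical path; a third changes nothing, matching the tables above.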
Race conditions: when sharing goes wrong
A race condition occurs when two or more threads access shared data, at least one of them writes, and the final result depends on the order of execution. The program is correct under some orderings and incorrect under others.
The read-modify-write problem
Most variables cannot be updated atomically. Incrementing a counter, for instance, takes three steps:
- Read the current value from memory into a register
- Modify the value (add 1)
- Write the new value back to memory
If two threads perform these steps on the same variable without coordination, their operations can interleave.
```mermaid
sequenceDiagram
    participant T1 as Thread 1
    participant M as Memory (counter)
    participant T2 as Thread 2
    Note over M: counter = 0
    T1->>M: READ counter - gets 0
    T2->>M: READ counter - gets 0
    T1->>T1: Compute 0 + 1 = 1
    T2->>T2: Compute 0 + 1 = 1
    T1->>M: WRITE counter = 1
    T2->>M: WRITE counter = 1
    Note over M: counter = 1 - should be 2
```
Two threads both read counter = 0, both compute 1, both write 1. One increment is lost.
Example 2: The lost update
Two threads each increment a shared counter 100 times. The counter starts at 0. You expect the final value to be 200.
Here is what can actually happen. Each counter++ is three operations: read, add 1, write. The threads run on separate cores, and the OS can preempt or interleave at any point between those three operations.
Possible final values:
200 (lucky case): Every read-modify-write from Thread 1 completes before Thread 2’s next read-modify-write begins, or vice versa. All 200 increments are visible. This can happen, but you cannot rely on it.
2 (worst case): A lockstep interleaving — both threads read 0, both write 1, both read 1, both write 2, and so on for all 100 iterations — yields 100. But the theoretical minimum is even lower. Thread 1 reads 0 and is preempted. Thread 2 runs 99 full increments, bringing the counter to 99. Thread 1 writes 1, wiping out all of that work. Thread 2 then reads 1 to begin its final increment and is preempted. Thread 1 runs its remaining 99 increments, bringing the counter to 100. Thread 2 finally writes 2, wiping that out in turn. Final value: 2. Each thread's 100 increments survive as exactly one visible increment.
A typical run might yield 137 or 162 or 189. The exact number depends on how the scheduler interleaves the threads. Run the program ten times and you will get ten different answers.
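The worst-case interleaving is easy to reproduce deterministically if we simulate the scheduler ourselves. This sketch models each thread as a generator that yields after every read, modify, and write step, then drives the exact preemption pattern described above:

```python
def incrementer(mem, n):
    """One thread's n increments, split into read/modify/write steps.
    Yields after each step so a simulated scheduler can interleave."""
    for _ in range(n):
        tmp = mem["counter"]   # read
        yield
        tmp += 1               # modify
        yield
        mem["counter"] = tmp   # write
        yield

def step(thread, k=1):
    for _ in range(k):
        next(thread)

mem = {"counter": 0}
t1 = incrementer(mem, 100)
t2 = incrementer(mem, 100)

step(t1)        # T1 reads 0, then is preempted
step(t2, 297)   # T2 completes 99 full increments: counter = 99
step(t1, 2)     # T1 writes 1, wiping out T2's work: counter = 1
step(t2)        # T2 reads 1 for its final increment, then is preempted
step(t1, 297)   # T1 completes its remaining 99 increments: counter = 100
step(t2, 2)     # T2 writes 2, wiping out T1's work

print(mem["counter"])  # 2
```

A real OS scheduler almost never produces this pattern, but nothing forbids it — which is exactly the problem.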
This is the defining property of a race condition: the output is nondeterministic. It depends on timing, not logic.
Why this matters in practice: Race conditions are among the hardest bugs to find. They often pass tests (the lucky ordering happens to occur) and fail in production under load (when contention increases, unlucky orderings become common).
Deadlock: when everyone waits forever
A deadlock occurs when two or more threads are each waiting for a resource held by another, forming a cycle. No thread can proceed. The program hangs.
The classic scenario
Thread 1 holds Lock A and wants Lock B. Thread 2 holds Lock B and wants Lock A. Neither can proceed.
```mermaid
graph LR
    T1["Thread 1 holds Lock A"] -->|"wants"| LB["Lock B"]
    T2["Thread 2 holds Lock B"] -->|"wants"| LA["Lock A"]
    LA -->|"held by"| T1
    LB -->|"held by"| T2
    style T1 fill:#ee5a24,color:#fff
    style T2 fill:#ee5a24,color:#fff
    style LA fill:#4a9eff,color:#fff
    style LB fill:#4a9eff,color:#fff
```
Circular wait. Thread 1 holds A, wants B. Thread 2 holds B, wants A. Neither can make progress.
A concrete example with numbers:
Consider a banking system transferring money between accounts. To transfer $50 from Account X to Account Y, the code locks both accounts to ensure consistency:
```
transfer(from, to, amount):
    lock(from)
    lock(to)
    from.balance -= amount
    to.balance += amount
    unlock(to)
    unlock(from)
```
Thread 1 calls transfer(X, Y, 50). Thread 2 simultaneously calls transfer(Y, X, 30).
| Step | Thread 1 | Thread 2 |
|---|---|---|
| 1 | lock(X) ✓ | lock(Y) ✓ |
| 2 | lock(Y) … blocked | lock(X) … blocked |
Both threads are now waiting forever. The transfers never complete. The system is frozen.
Four conditions for deadlock (all must hold simultaneously):
- Mutual exclusion: Resources cannot be shared (locks are exclusive)
- Hold and wait: A thread holds one resource while waiting for another
- No preemption: Resources cannot be forcibly taken from a thread
- Circular wait: A cycle exists in the wait-for graph
Break any one of these four conditions and deadlock becomes impossible. The most common fix is to impose a global ordering on lock acquisition. If every thread always acquires Lock A before Lock B, the cycle cannot form.
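Here is a sketch of the lock-ordering fix in Python (the `Account` class and account numbers are mine). Both threads always acquire the lower-numbered account's lock first, so the circular wait from the table above cannot form:

```python
import threading

class Account:
    def __init__(self, number, balance):
        self.number = number
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Global ordering: always lock the lower-numbered account first.
    # Every thread acquires locks in the same order, so no cycle can form.
    first, second = sorted([src, dst], key=lambda a: a.number)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

x = Account(1, 100)
y = Account(2, 100)

# The scenario from the table, repeated many times in both directions.
t1 = threading.Thread(target=lambda: [transfer(x, y, 50) for _ in range(1000)])
t2 = threading.Thread(target=lambda: [transfer(y, x, 30) for _ in range(1000)])
t1.start(); t2.start()
t1.join(); t2.join()

print(x.balance + y.balance)  # 200: money conserved, and no deadlock
```

Without the `sorted` line — locking `src` then `dst` directly, as in the pseudocode above — this program can hang forever.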
The fundamental cost of coordination
Parallelism is not free. Every time threads need to communicate or synchronize, there is overhead.
Locks force threads to take turns, converting parallel execution back into sequential execution at the lock boundary. A lock protecting a shared counter means only one thread can increment at a time. If 8 threads spend 30% of their time contending on the same lock, you will not get anywhere near 8x speedup.
Cache coherence adds invisible cost. When one core writes to a memory location, every other core that has cached that location must invalidate its copy. This invalidation traffic travels over the interconnect between cores and typically costs on the order of 40 to 100 nanoseconds per invalidation on modern hardware. If threads frequently write to nearby memory addresses (false sharing), this overhead dominates.
Communication latency is the final tax. Sending data between threads on the same machine costs microseconds. Sending data between machines costs milliseconds. Sending data between data centers costs tens of milliseconds. Every synchronization point pays this cost.
Amdahl’s Law: the hard ceiling
If a fraction s of your program is inherently sequential (cannot be parallelized), then the maximum speedup with p processors is:
Speedup = 1 / (s + (1 - s) / p)
If 5% of your code is sequential and you throw 1000 processors at it:
Speedup = 1 / (0.05 + 0.95 / 1000) = 1 / 0.05095 = 19.6x
1000 processors. 19.6x speedup. That 5% sequential fraction is a hard ceiling. You will never exceed 20x no matter how many processors you add.
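The formula is one line of code, which makes the ceiling easy to explore:

```python
def amdahl_speedup(sequential_fraction, processors):
    """Maximum speedup: 1 / (s + (1 - s) / p)."""
    s = sequential_fraction
    return 1 / (s + (1 - s) / processors)

print(round(amdahl_speedup(0.05, 1000), 1))   # 19.6
print(round(amdahl_speedup(0.05, 10**9), 1))  # 20.0 - the 1/s ceiling
```

As `processors` grows without bound, the parallel term vanishes and the speedup approaches 1/s: exactly 20x for a 5% sequential fraction.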
Amdahl’s Law for different sequential fractions. Even a small sequential portion puts a hard cap on speedup.
This is why optimization of the sequential portion matters more than adding processors. Going from 10% sequential to 5% sequential doubles your maximum speedup (from 10x to 20x). No amount of hardware solves a fundamentally sequential algorithm.
In practice: real coordination bottlenecks
Database locks. A web application with 200 concurrent requests all updating the same row in a database. The database serializes those writes. Your 64-core server is effectively single-threaded for that operation.
Shared queues. Producer-consumer patterns where a single queue is the handoff point between stages. The queue’s lock becomes the bottleneck. This is why systems like Kafka partition data across multiple queues, converting one contention point into many independent ones.
Global state. A counter tracking “total requests served” that every thread increments. Atomic operations help (they eliminate the lock) but still force cache-line bouncing between cores. At extreme throughput, even atomics become a bottleneck. The solution is per-thread counters that are summed periodically.
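The per-thread-counter idea can be sketched as follows (the `ShardedCounter` name and design are mine, shown here as one possible shape): each thread increments a shard only it writes to, and reads sum across shards.

```python
import threading

class ShardedCounter:
    """One shard per thread, summed on read. Increments never contend;
    reads pay the cost of touching every shard."""
    def __init__(self):
        self._local = threading.local()
        self._shards = []
        self._lock = threading.Lock()  # taken only on a thread's first use

    def increment(self):
        shard = getattr(self._local, "shard", None)
        if shard is None:
            shard = self._local.shard = [0]
            with self._lock:
                self._shards.append(shard)
        shard[0] += 1  # uncontended: this thread owns the shard

    def value(self):
        return sum(s[0] for s in self._shards)

counter = ShardedCounter()
workers = [threading.Thread(target=lambda: [counter.increment()
                                            for _ in range(10_000)])
           for _ in range(4)]
for w in workers: w.start()
for w in workers: w.join()
print(counter.value())  # 40000 - no lost updates, no per-increment lock
```

The trade-off is that `value()` is only approximate while increments are in flight; the exact total is available once the writers have finished.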
Barrier synchronization. MapReduce-style systems where all mappers must finish before any reducer starts. The slowest mapper determines the start time of the reduce phase. One straggler (a slow disk, a noisy neighbor on a shared machine) delays the entire computation.
These bottlenecks share a pattern: a point where parallel execution must become sequential, even briefly. The goal of parallel system design is to minimize the time spent at these points. Eliminate shared state where possible. Partition data so threads work on independent subsets. Batch coordination to amortize the cost.
Summary
| Concept | What it means | Why it matters |
|---|---|---|
| Task parallelism | Different operations run simultaneously | Limited by number of distinct independent tasks |
| Data parallelism | Same operation on different data | Scales with data size; the GPU model |
| Dependency | Task B needs the result of Task A | Creates sequential chains that limit speedup |
| Critical path | Longest dependency chain | The hard floor on execution time |
| Race condition | Output depends on thread scheduling | Nondeterministic bugs that hide in testing |
| Deadlock | Circular wait on resources | System hangs; requires breaking one of four conditions |
| Amdahl’s Law | Sequential fraction caps speedup | 5% sequential means max 20x speedup, forever |
The key insight: parallelism is not about having more workers. It is about having enough independent work to keep those workers busy, and minimizing the time they spend waiting for each other.
What comes next
This article gave you the vocabulary. The next article in this series, Flynn’s taxonomy, introduces the four architectures for parallel computation: SISD, SIMD, MISD, and MIMD. That framework explains why GPUs (SIMD/SIMT) and CPUs (MIMD) solve different problems, and it sets up everything that follows about GPU programming.