How Computers Work · Part 3

CPU pipelines and instruction-level parallelism

In this series (6 parts)
  1. How a CPU actually works: cores, clocks, and execution
  2. CPU caches and memory hierarchy: why memory access speed matters
  3. CPU pipelines and instruction-level parallelism
  4. Memory models and why concurrent CPU code is hard
  5. SIMD and vectorization: parallelism on a single CPU core
  6. Processes, threads, and context switching

Prerequisites

This article assumes you understand basic CPU structure and how caches reduce memory latency. If not, read Part 1 (how a CPU actually works: cores, clocks, and execution) and Part 2 (CPU caches and the memory hierarchy) first.

Both concepts show up constantly here. Pipelining is how the CPU keeps all its hardware busy; caches are how it keeps the pipeline fed with data.

The problem: one instruction at a time is slow

A simple CPU processes one instruction through all its stages before starting the next. If each stage takes one clock cycle and there are five stages, every instruction takes five cycles. For 100 instructions, that is 500 cycles. The ALU sits idle 80% of the time, waiting for fetch or memory access to finish.

This is like a car wash with five stations (soap, scrub, rinse, dry, wax) where only one car enters the entire facility at a time. Four stations are always empty. Wasteful.

Pipelining fixes this by overlapping execution. While one instruction is being decoded, the next is already being fetched. Each stage works on a different instruction every cycle.

The five classic pipeline stages

Most introductory models use a five-stage pipeline. Real processors have more stages (Intel’s Skylake has roughly 14 to 19, depending on how you count), but the principles are identical.

Stage | Name               | What happens
IF    | Instruction Fetch  | Read the next instruction from the instruction cache using the program counter
ID    | Instruction Decode | Determine what the instruction does, read source registers
EX    | Execute            | ALU operation, address calculation, or branch resolution
MEM   | Memory Access      | Load from or store to the data cache
WB    | Write Back         | Write the result into the destination register

Each stage takes one cycle. The pipeline register between stages holds the intermediate result, passing it forward like a baton.

Pipeline in action: five instructions overlapping

gantt
  title 5-Stage Pipeline: 5 Instructions
  dateFormat X
  axisFormat %s
  section Instr 1
  IF  :i1if, 0, 1
  ID  :i1id, 1, 2
  EX  :i1ex, 2, 3
  MEM :i1mem, 3, 4
  WB  :i1wb, 4, 5
  section Instr 2
  IF  :i2if, 1, 2
  ID  :i2id, 2, 3
  EX  :i2ex, 3, 4
  MEM :i2mem, 4, 5
  WB  :i2wb, 5, 6
  section Instr 3
  IF  :i3if, 2, 3
  ID  :i3id, 3, 4
  EX  :i3ex, 4, 5
  MEM :i3mem, 5, 6
  WB  :i3wb, 6, 7
  section Instr 4
  IF  :i4if, 3, 4
  ID  :i4id, 4, 5
  EX  :i4ex, 5, 6
  MEM :i4mem, 6, 7
  WB  :i4wb, 7, 8
  section Instr 5
  IF  :i5if, 4, 5
  ID  :i5id, 5, 6
  EX  :i5ex, 6, 7
  MEM :i5mem, 7, 8
  WB  :i5wb, 8, 9

Without pipelining, five instructions take 25 cycles (5 × 5). With pipelining, they complete in 9 cycles: 5 cycles to fill the pipeline, then one instruction finishes every cycle after that. The general formula for N instructions in a K-stage pipeline is:

Cycles = K + (N - 1)

Worked example 1: pipeline throughput

Setup: 5-stage pipeline, 10 instructions.

Without pipelining: Each instruction takes 5 cycles. Total = 5 × 10 = 50 cycles.

With pipelining (ideal): Cycles = K + (N - 1) = 5 + 9 = 14 cycles. Speedup = 50 / 14 = 3.57×.

With pipelining (2 stall cycles): Stalls insert idle cycles, called bubbles, into the pipeline. With 2 stall cycles, total = 14 + 2 = 16 cycles. Speedup = 50 / 16 = 3.125×.

As N grows large, the ideal speedup approaches K (here, 5×). Stalls eat into that ceiling. A 20-stage pipeline could theoretically give 20× speedup, but deeper pipelines are more vulnerable to stalls, so the real gain is always less.

Pipeline hazards

The pipeline assumes every stage can proceed every cycle. Three things break that assumption.

Structural hazards

Two instructions need the same hardware resource in the same cycle. Example: a single-ported memory that cannot handle an instruction fetch and a data load simultaneously.

Fix: Duplicate hardware. Harvard architecture separates instruction and data caches. Most modern CPUs have separate I-cache and D-cache, so this hazard is rare in practice.

Data hazards

An instruction depends on the result of a previous instruction that has not finished yet. This is the most common hazard.

Consider:

ADD R1, R2, R3    ; R1 = R2 + R3
SUB R4, R1, R5    ; R4 = R1 - R5  (needs R1 from ADD)

The SUB needs R1, but ADD does not write R1 until its WB stage. SUB reaches ID (where it reads R1) two cycles before ADD writes the result.

Data hazard with stall (NOP insertion)

gantt
  title Data Hazard: Read-After-Write Stall
  dateFormat X
  axisFormat %s
  section ADD R1,R2,R3
  IF  :a_if, 0, 1
  ID  :a_id, 1, 2
  EX  :a_ex, 2, 3
  MEM :a_mem, 3, 4
  WB  :a_wb, 4, 5
  section NOP (bubble)
  stall :nop1, 1, 2
  stall :nop1b, 2, 3
  stall :nop1c, 3, 4
  stall :nop1d, 4, 5
  stall :nop1e, 5, 6
  section NOP (bubble)
  stall :nop2, 2, 3
  stall :nop2b, 3, 4
  stall :nop2c, 4, 5
  stall :nop2d, 5, 6
  stall :nop2e, 6, 7
  section SUB R4,R1,R5
  IF  :s_if, 3, 4
  ID  :s_id, 4, 5
  EX  :s_ex, 5, 6
  MEM :s_mem, 6, 7
  WB  :s_wb, 7, 8

Without forwarding, the pipeline inserts two NOP bubbles so that SUB reads R1 only after ADD has written it back. Two wasted cycles.

Fix: data forwarding (bypassing). The result of ADD is available at the end of EX (cycle 3). Forwarding paths route this value directly to the EX input of SUB, eliminating the stall entirely. Every modern CPU uses forwarding.

Some data hazards cannot be forwarded away. A load followed immediately by a use of the loaded value still causes a one-cycle stall, called a load-use hazard, because the data is not available until after the MEM stage.

Control hazards

Branches change the program counter. The pipeline has already fetched the next sequential instructions, which may be wrong. Those instructions must be flushed, wasting every cycle they occupied.

Hazard summary

Structural
  Cause: Two instructions need the same hardware unit simultaneously
  Detection: Hardware resource conflict detected during scheduling
  Resolution: Duplicate hardware (separate I-cache/D-cache), or stall one instruction
  Cost: 1+ stall cycles per conflict; rare on modern CPUs

Data (RAW)
  Cause: Instruction reads a register before a prior instruction writes it
  Detection: Comparators check the source registers of the ID-stage instruction against the destination registers of instructions in EX/MEM/WB
  Resolution: Data forwarding/bypassing; compiler instruction reordering; stall if forwarding is insufficient
  Cost: 0 cycles with full forwarding; 1 cycle for load-use; 2 cycles without forwarding

Control
  Cause: Branch or jump changes the PC; the pipeline has already fetched the wrong instructions
  Detection: Branch instruction detected in the ID stage
  Resolution: Branch prediction; delayed branching; speculative execution
  Cost: Pipeline flush on mispredict: (pipeline depth - 1) wasted cycles

Branch prediction

A branch instruction decides which instruction to fetch next. In a 5-stage pipeline, the branch outcome is known at the EX stage (cycle 3). That means two instructions have already entered the pipeline speculatively. If the branch goes the other way, those two instructions are flushed.

In a 15-stage pipeline, the situation is far worse. The branch might not resolve until stage 10 or later, meaning 9+ instructions could be thrown away.

Branch prediction guesses the outcome before it is known. The CPU fetches instructions along the predicted path. If the guess is correct, no cycles are lost. If wrong, the pipeline flushes and restarts from the correct path.

How branch predictors work

Static prediction: Always predict “not taken” or “backward branches taken” (loops usually jump back). Simple, cheap, roughly 60 to 70% accurate.

Dynamic prediction: Track branch history in hardware tables.

  • 1-bit predictor: Remembers last outcome. Mispredicts twice per loop (first and last iteration).
  • 2-bit saturating counter: Needs two consecutive mispredictions to change its mind. Handles loop exits better. Around 85 to 90% accuracy.
  • Two-level adaptive predictors: Track patterns of recent branches (global history) and correlate them with per-branch behavior. Modern CPUs use variants of this with thousands of table entries. Accuracy exceeds 95% on typical workloads.
  • TAGE predictors (TAgged GEometric): Used in recent Intel and AMD cores. Multiple tables indexed by different history lengths. Accuracy above 97% on SPEC benchmarks.

Worked example 2: branch misprediction cost

Setup: 15-stage pipeline running at 3 GHz. A branch misprediction flushes 14 stages.

Time per cycle: 1 / 3 GHz = 0.333 ns per cycle.

Wasted time per misprediction: 14 cycles × 0.333 ns = 4.67 ns per misprediction.

Loop scenario: 1000 iterations, branch at end of loop. The branch predictor mispredicts 5% of the time.

  • Mispredictions = 1000 × 0.05 = 50 mispredictions
  • Total wasted cycles = 50 × 14 = 700 cycles
  • Total wasted time = 700 × 0.333 ns = 233.3 ns

For context, a single main memory access takes on the order of 100 ns, so 50 mispredictions in this tight loop cost you the equivalent of roughly two DRAM round-trips. That matters when loops run millions of times; the waste scales linearly.

This is why branchless code (using conditional moves, arithmetic tricks) is sometimes faster even though it does more work per iteration. Removing the branch removes the misprediction risk entirely.

Out-of-order execution

In-order CPUs execute instructions in program order. If instruction 3 stalls waiting for data from memory, instructions 4, 5, 6 all wait too, even if they are independent.

Out-of-order (OoO) execution reorders instructions dynamically. The core idea:

  1. Fetch and decode in order. Instructions enter a queue (the reorder buffer, or ROB).
  2. Issue instructions to execution units as soon as their operands are ready, regardless of program order. This is handled by a reservation station or issue queue.
  3. Complete out of order. Results go into the ROB.
  4. Retire/commit in order. The ROB writes results to the architectural register file in program order, preserving the illusion of sequential execution.

Why OoO matters

Consider this sequence:

LOAD  R1, [addr1]    ; cache miss, takes ~100 cycles
ADD   R2, R3, R4     ; independent, uses R3 and R4
MUL   R5, R6, R7     ; independent, uses R6 and R7
SUB   R8, R1, R2     ; depends on R1 (LOAD) and R2 (ADD)

An in-order CPU stalls at the LOAD for ~100 cycles. Everything waits. An OoO CPU sees that ADD and MUL have no dependency on R1, issues them immediately, and only stalls SUB until the LOAD completes. The ADD and MUL execute “for free” during the memory latency.

This is called latency hiding. OoO execution does not make individual instructions faster; it finds useful work to fill the gaps.

The cost of OoO

The reorder buffer, reservation stations, register renaming logic, and dependency tracking hardware are expensive in both transistors and power. A Zen 4 core has a 320-entry ROB. ARM’s Cortex-A510 “little” cores are in-order to save power; the Cortex-X4 “big” cores are OoO. The choice is a power-performance tradeoff.

Superscalar execution

Pipelining lets you start one instruction per cycle. Superscalar designs start multiple instructions per cycle by duplicating execution units.

A 4-wide superscalar CPU can fetch, decode, and issue up to 4 instructions per cycle. In theory, that is 4 IPC (instructions per clock). In practice, dependencies, cache misses, and branch mispredictions keep real throughput closer to 2 to 3 IPC on general workloads.

Superscalar + OoO

Modern high-performance CPUs combine both:

  • Superscalar provides width: multiple instructions in parallel.
  • OoO provides flexibility: fill the execution slots even when instructions arrive in an inconvenient order.

Apple’s M-series cores are 8-wide decode, 6-wide superscalar execution. AMD Zen 4 is 6-wide. Intel’s P-cores decode 6 micro-ops per cycle. The wider the machine, the harder it is to keep all slots full, and the more critical branch prediction accuracy becomes.

Why pipeline depth has limits

Deeper pipelines mean higher clock frequencies (each stage does less work, so it finishes faster). Intel pushed this to the extreme with the Pentium 4’s 31-stage pipeline (Prescott, 2004), hitting 3.8 GHz. But three problems emerged:

1. Branch misprediction penalty grows linearly with depth. A 31-stage pipeline flushes up to 30 instructions on a mispredict. At 5% misprediction rate, the waste is enormous.

2. Power consumption scales with frequency. Dynamic power grows with frequency (and with the square of supply voltage, which must rise to sustain higher clocks). Each pipeline latch also consumes energy. More latches = more power, even before frequency increases.

3. Diminishing returns on IPC. The ideal speedup from pipelining asymptotically approaches the number of stages, but only for infinitely long instruction streams with no hazards. Real code has branches every 5 to 7 instructions. After roughly 15 to 20 stages, the gains from deeper pipelining are consumed by the increased penalty of hazards.

Modern designs settle around 14 to 19 stages. This balances high clock frequency with manageable mispredict penalties. The trend since 2005 has been wider (more superscalar) rather than deeper.

In practice: real bottlenecks

Textbook pipelines are clean. Real workloads hit messier problems.

Branch-heavy code: Interpreters, virtual dispatch, and highly polymorphic code defeat branch predictors. Python’s bytecode interpreter spends significant time on indirect branch mispredictions. This is one reason JIT compilers (V8, HotSpot) exist: they eliminate the interpreter’s branch-heavy dispatch loop.

Memory stalls dominate. Even with OoO execution hiding some latency, an L3 cache miss (40+ ns) stalls the pipeline for 100+ cycles. No amount of instruction reordering helps if every instruction in the ROB depends on the missing data. Cache-friendly data structures matter more than almost any micro-optimization.

Frontend bottlenecks. The decoder has limits. x86 instructions are variable-length and complex to decode. Intel and AMD use micro-op caches to bypass the decoder for hot loops. If your hot loop does not fit in the micro-op cache (~1500 to 4000 micro-ops depending on the design), decode throughput can become the bottleneck.

Compiler cooperation. The hardware does heavy lifting, but compilers help: instruction scheduling to reduce stalls, loop unrolling to reduce branch frequency, and software prefetching hints. Profile-guided optimization (PGO) reorders code so that the hot path has fewer taken branches, improving prediction accuracy and instruction cache locality.

Measurement tools. perf stat on Linux shows you pipeline metrics directly:

  • instructions / cycles gives IPC
  • branch-misses shows misprediction count
  • stalled-cycles-frontend and stalled-cycles-backend show where the pipeline is waiting

If your IPC is below 1.0 on a modern superscalar core, something is badly wrong (usually memory stalls). If branch-misses exceed 2 to 3%, your branch-heavy code is likely the bottleneck.

What comes next

Pipelining, OoO, and superscalar execution get more work done per clock. But they are all constrained by memory bandwidth and latency. The next article covers memory consistency and ordering models: what happens when multiple cores read and write shared memory, and why your lock-free code might be broken on ARM.

Continue reading: CPU memory models and consistency
