Parallel Thinking · Part 2

Flynn's taxonomy and types of parallel hardware

In this series (5 parts)
  1. What parallelism actually means and why it is hard
  2. Flynn's taxonomy and types of parallel hardware
  3. Amdahl's law and scalability limits
  4. Parallel decomposition: how to split work across processors
  5. CPU vs GPU: why they are built differently and when to use each

Prerequisites

This article builds on two earlier posts. You should be comfortable with the basics of parallel thinking and how SIMD vectorization works on a single CPU core. Both give you the foundation for classifying parallel hardware into formal categories.

Why classify parallel hardware at all

Not all parallelism works the same way. A GPU and a multi-core CPU both run things in parallel, but they do it differently. In 1966 Michael Flynn proposed a taxonomy that classifies computer architectures along two dimensions: the number of instruction streams and the number of data streams. The taxonomy is simple, with just four quadrants, yet six decades later it remains the standard vocabulary for discussing parallel hardware.

Understanding the taxonomy matters because the category of hardware you target determines how you write code. SIMD code looks different from MIMD code. GPU code follows yet another model. Knowing the categories helps you pick the right mental model before you write a single line.

The four Flynn categories

Flynn’s taxonomy splits architectures along two axes: instruction streams (one or many) and data streams (one or many). That gives four combinations.

graph TB
  subgraph Flynn["Flynn's Taxonomy"]
      direction TB
      subgraph Row1[" "]
          direction LR
          SISD["<b>SISD</b>
Single Instruction
Single Data
<i>Classic scalar CPU</i>
<i>Intel 8086</i>"]
          SIMD["<b>SIMD</b>
Single Instruction
Multiple Data
<i>SSE, AVX</i>
<i>GPU shader cores</i>"]
      end
      subgraph Row2[" "]
          direction LR
          MISD["<b>MISD</b>
Multiple Instruction
Single Data
<i>Mostly theoretical</i>
<i>Fault-tolerant pipelines</i>"]
          MIMD["<b>MIMD</b>
Multiple Instruction
Multiple Data
<i>Multi-core CPUs</i>
<i>Supercomputer clusters</i>"]
      end
  end
  style SISD fill:#1a365d,stroke:#2a4a7f,color:#e2e8f0
  style SIMD fill:#2c5282,stroke:#2a4a7f,color:#e2e8f0
  style MISD fill:#2d3748,stroke:#4a5568,color:#e2e8f0
  style MIMD fill:#2b6cb0,stroke:#2a4a7f,color:#e2e8f0

The four quadrants of Flynn’s taxonomy. Most real hardware falls into SIMD or MIMD. MISD is rare.

SISD: one instruction, one data

SISD is the simplest case. A single processor fetches one instruction, operates on one piece of data, then moves to the next instruction. This is how early computers worked and how a single CPU core behaves when running scalar code.

There is no parallelism in SISD. One thing happens at a time. A for loop that adds numbers one by one on a single core with no vectorization is SISD execution. It is the baseline against which all parallel approaches are measured.

Examples: the original Intel 8086, any modern CPU running scalar code without vector extensions.
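To make the baseline concrete, here is a small Python sketch of SISD-style execution (the function name is illustrative): one instruction stream consuming one data element per step.

```python
def sum_scalar(values):
    """SISD-style loop: one add, one element, every iteration."""
    total = 0
    for v in values:      # a single instruction stream...
        total += v        # ...touching a single data element per step
    return total

print(sum_scalar([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

Every parallel approach in the rest of this article is ultimately measured against a loop like this one.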

SIMD: one instruction, many data

SIMD applies a single instruction to multiple data elements simultaneously. One ADD instruction processes 4, 8, or 16 values packed into wide registers. The hardware has parallel ALUs that execute the same operation in lockstep.

This is the model behind Intel SSE (128-bit registers), AVX and AVX2 (256-bit), and AVX-512. It is also the conceptual starting point for understanding GPU execution, though GPUs take the idea further.

SIMD works well when every element needs the same operation. Vector addition, pixel color transforms, audio sample processing. It breaks down when elements need different operations, because SIMD cannot branch per element. If some elements need operation A and others need operation B, you have to run both and mask out the irrelevant results.
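The mask-and-merge pattern can be sketched in scalar Python (a simulation of what SIMD hardware does per lane; the function and parameter names are illustrative, not a real intrinsics API):

```python
def simd_select(data, cond, op_a, op_b):
    """Simulate SIMD branch handling: run BOTH operations on every
    lane, then use a per-lane mask to keep only the relevant result."""
    mask = [cond(x) for x in data]       # per-lane predicate
    results_a = [op_a(x) for x in data]  # all lanes execute op A
    results_b = [op_b(x) for x in data]  # all lanes execute op B
    # merge: the mask picks A where true, B where false
    return [a if m else b for m, a, b in zip(mask, results_a, results_b)]

# even elements doubled, odd elements negated
out = simd_select([1, 2, 3, 4], lambda x: x % 2 == 0,
                  lambda x: 2 * x, lambda x: -x)
print(out)  # [-1, 4, -3, 8]
```

Note that both `op_a` and `op_b` run on every element; the mask only decides which result survives. That wasted work is exactly the cost of branching under SIMD.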

MIMD: many instructions, many data

MIMD is the most flexible category. Multiple processors execute different instructions on different data simultaneously. Each processor has its own instruction stream and can branch independently.

Multi-core CPUs are MIMD. Core 0 can run a web server thread while core 1 runs a database query. They share memory but execute completely independent instruction streams. Distributed clusters are also MIMD: each node runs its own program on its own data.

MIMD is powerful because it places no constraints on what each processor does. But that flexibility comes with complexity. You need synchronization primitives (locks, barriers, message passing) to coordinate work between processors.
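A minimal Python sketch of the MIMD idea: two workers run entirely different code on different data at the same time, coordinated only at the join point (the worker names and strings are illustrative; real MIMD cores would run these as truly independent hardware threads):

```python
import threading

results = {}

def serve_request(path):      # worker 1: one instruction stream
    results["server"] = f"handled {path}"

def run_query(table):         # worker 2: a completely different stream
    results["db"] = f"scanned {table}"

# each thread has its own instruction stream and its own data: MIMD
t1 = threading.Thread(target=serve_request, args=("/index",))
t2 = threading.Thread(target=run_query, args=("users",))
t1.start(); t2.start()
t1.join(); t2.join()          # barrier-style synchronization point

print(results["server"], "|", results["db"])  # handled /index | scanned users
```

The `join` calls are the simplest possible synchronization primitive: the main thread waits until both independent streams have finished before reading their results.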

MISD: mostly theoretical

MISD applies multiple different instructions to the same data stream. This is the rarest category and exists mostly in textbooks.

The closest real example is fault-tolerant systems where multiple processors compute the same result independently and vote on the output. The Space Shuttle flight computers used this approach: four computers ran identical primary software on the same sensor data and voted on the result, while a fifth ran independently developed backup software to catch common bugs. Some researchers also classify systolic arrays and certain pipeline architectures as MISD, but the classification is debated.

For practical purposes, you will rarely encounter MISD hardware. It matters for completeness, not for daily programming.

Flynn categories at a glance

Category | Example Hardware                                | Programming Model                         | Typical Use Case                         | Flexibility
---------|-------------------------------------------------|-------------------------------------------|------------------------------------------|-----------------------------------
SISD     | Intel 8086, single-core scalar                  | Sequential code                           | Simple loops, legacy software            | None: one thing at a time
SIMD     | SSE, AVX, GPU shader units                      | Vectorized intrinsics, auto-vectorization | Image processing, matrix math, audio DSP | Low: all lanes do the same op
MIMD     | Multi-core CPUs, supercomputer clusters         | Threads, MPI, distributed systems         | Web servers, databases, simulations      | High: each core runs anything
MISD     | Space Shuttle computers, fault-tolerant systems | Redundant computation with voting         | Safety-critical systems                  | Narrow: same data, different checks

Where GPUs fit: SIMT

GPUs do not fit neatly into Flynn’s original four categories. They look like SIMD from a distance: one instruction applied to many data elements. But the execution model differs in important ways. NVIDIA coined the term SIMT (Single Instruction, Multiple Threads) to describe it.

In classic SIMD, a single instruction operates on a wide register. There is one program counter. All lanes execute the same instruction or get masked off entirely.

In SIMT, each thread has its own program counter and register state. Threads are grouped into warps (NVIDIA, 32 threads) or wavefronts (AMD, 64 threads). Within a warp, all threads execute the same instruction at the same time. But when threads diverge (take different branches), the hardware serializes the branches rather than simply masking.

graph TD
  subgraph SIMD_Exec["SIMD: Lock-step with masking"]
      direction TB
      I1["Instruction: ADD"] --> L1["Lane 0 ✓"]
      I1 --> L2["Lane 1 ✓"]
      I1 --> L3["Lane 2 ✓"]
      I1 --> L4["Lane 3 ✓"]
      I1 --> L5["Lane 4 ✗ masked"]
      I1 --> L6["Lane 5 ✗ masked"]
      I1 --> L7["Lane 6 ✗ masked"]
      I1 --> L8["Lane 7 ✗ masked"]
  end
  subgraph SIMT_Exec["SIMT: Branch divergence"]
      direction TB
      B1["Branch: threadID % 2 == 0?"] --> P1["Pass 1: Even threads run A"]
      B1 --> P2["Pass 2: Odd threads run B"]
      P1 --> CONV["Reconverge: all threads continue"]
      P2 --> CONV
  end
  style L5 fill:#742a2a,stroke:#9b2c2c,color:#feb2b2
  style L6 fill:#742a2a,stroke:#9b2c2c,color:#feb2b2
  style L7 fill:#742a2a,stroke:#9b2c2c,color:#feb2b2
  style L8 fill:#742a2a,stroke:#9b2c2c,color:#feb2b2

SIMD masks inactive lanes and executes one pass. SIMT serializes divergent branches into separate passes, then reconverges.

The key difference: SIMT threads can diverge. They pay a performance penalty for it (serialization), but the programming model lets you write code with branches, loops, and conditionals per thread. SIMD cannot do this at all without explicit mask manipulation.

This is why GPU programming feels more like writing scalar code that magically runs on thousands of threads. You write if (threadID > 100) and the hardware handles it. In pure SIMD, you would have to manually build a bitmask and apply it.

Worked example 1: SIMD vs SIMT divergence cost

Consider 8 threads executing this pseudocode:

if (threadID % 2 == 0) {
    do A;    // takes 4 cycles
} else {
    do B;    // takes 6 cycles
}

Threads 0, 2, 4, 6 are even. Threads 1, 3, 5, 7 are odd.

SIMD execution (masking approach):

SIMD runs both branches with masks. First it executes A with a mask enabling only even lanes. Then it executes B with a mask enabling only odd lanes. Each pass takes the full cycle count regardless of how many lanes are active.

  • Pass 1 (branch A): 4 cycles, lanes 0, 2, 4, 6 active, lanes 1, 3, 5, 7 masked off
  • Pass 2 (branch B): 6 cycles, lanes 1, 3, 5, 7 active, lanes 0, 2, 4, 6 masked off
  • Total: 4 + 6 = 10 cycles

SIMT execution (serialization approach):

SIMT also serializes the two paths, but through a different mechanism. The warp scheduler detects divergence at the branch point. It first executes all even threads through path A while odd threads are inactive. Then it executes all odd threads through path B while even threads are inactive. After both paths complete, threads reconverge.

  • Pass 1 (branch A): 4 cycles for even threads
  • Pass 2 (branch B): 6 cycles for odd threads
  • Total: 4 + 6 = 10 cycles

The total cost is the same: 10 cycles. Both approaches pay the sum of both branch costs. The difference shows up not in the cycle count but in the programming model: SIMD requires you to write explicit mask logic, while SIMT lets you write the if/else naturally and the hardware handles the serialization.

Comparison to no divergence: If all 8 threads took path A, both SIMD and SIMT would finish in 4 cycles. Divergence costs 6 extra cycles here. That is a 2.5x slowdown. This is why GPU programmers care deeply about warp divergence.
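The cost model from this example can be captured in a few lines of Python (a sketch; the function name and cycle counts are illustrative):

```python
def warp_cost(branch_cycles, paths_taken):
    """Divergence cost model from the worked example: a warp pays the
    cycle count of every DISTINCT path any of its threads takes."""
    return sum(branch_cycles[p] for p in set(paths_taken))

cycles = {"A": 4, "B": 6}
print(warp_cost(cycles, ["A", "B", "A", "B", "A", "B", "A", "B"]))  # 10 (diverged)
print(warp_cost(cycles, ["A"] * 8))                                 # 4 (uniform)
```

The model makes the asymmetry obvious: adding one more thread on an already-taken path is free, but the first thread on a new path costs the full cycle count of that path.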

Worked example 2: classifying real systems

Classify each system into a Flynn category and justify the choice.

System 1: Smartphone GPU (Qualcomm Adreno, Apple GPU)

Classification: SIMT (extends SIMD)

A smartphone GPU groups threads into warps or wave groups. Within each group, threads execute the same instruction simultaneously. They share an instruction stream but each has its own registers and can diverge. This is SIMT. Under Flynn’s original taxonomy, the closest fit is SIMD, but SIMT is the more precise modern label.

System 2: Supercomputer cluster (Frontier, Alps)

Classification: MIMD

Each node in the cluster runs its own program with its own instruction stream on its own data. Nodes communicate via a high-speed interconnect (Slingshot, InfiniBand). Different nodes can execute entirely different code. This is textbook MIMD. Within each node, the GPUs are SIMT and the CPUs are also MIMD at the core level. The system is hierarchically parallel.

System 3: Intel 256-bit AVX operation

Classification: SIMD

A single 256-bit AVX instruction (such as VADDPS) operates on a 256-bit register holding 8 packed float32 values. One instruction, eight data elements, all processed identically in lockstep. No per-lane branching is possible. Pure SIMD.

System 4: GPU warp of 32 threads

Classification: SIMT

A warp of 32 threads shares a single instruction issue slot but each thread maintains its own program counter and register file. When all threads follow the same path, it looks like SIMD. When threads diverge, the warp serializes branches. This hybrid behavior is why SIMT exists as a distinct category beyond Flynn’s original four.

The SPMD programming model

SIMT describes the hardware. SPMD (Single Program, Multiple Data) describes how you write code for it.

In SPMD, you write one program (a kernel) and launch it across thousands of threads. Each thread runs the same code but operates on different data based on its thread ID. The program is single. The data is multiple. But unlike SIMD, each thread can take different control flow paths through the program.

// SPMD kernel: each thread processes one element
__global__ void add(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

Every thread runs this same function. Thread 0 computes c[0] = a[0] + b[0]. Thread 1000 computes c[1000] = a[1000] + b[1000]. The if (i < n) guard means some threads do nothing (when the array length is not a multiple of the block size). This is SPMD in action.
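The kernel above can be emulated on a CPU by calling one function once per thread index. This Python sketch mirrors the CUDA code (the sequential loop stands in for the hardware's parallel launch; names are illustrative):

```python
def add_kernel(i, a, b, c, n):
    """SPMD kernel body: the same code for every thread; the thread
    index i selects which data element this 'thread' touches."""
    if i < n:                 # guard: out-of-range threads do nothing
        c[i] = a[i] + b[i]

n = 5
a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]
c = [0] * n
# emulate launching 8 threads for a 5-element array; threads 5-7 are
# filtered out by the guard, just like the CUDA version
for thread_id in range(8):
    add_kernel(thread_id, a, b, c, n)
print(c)  # [11, 22, 33, 44, 55]
```

On a GPU the loop disappears: all thread indices are launched at once and the hardware schedules them across warps.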

SPMD is the dominant programming model for GPUs. CUDA, OpenCL, Metal compute shaders, and Vulkan compute all follow SPMD. You think in terms of one thread, but thousands execute simultaneously.

The relationship between the terms:

  • Flynn’s taxonomy classifies hardware (SISD, SIMD, MIMD, MISD)
  • SIMT describes how GPU hardware executes (single instruction across a warp of threads with independent PCs)
  • SPMD describes how you program it (write one kernel, launch many threads)

These are different layers of the same system. SPMD is the programming model. SIMT is the execution model. SIMD is the closest Flynn category.

In practice

Flynn’s taxonomy gives you a vocabulary, not a recipe. Real systems mix categories at different levels:

  • A modern CPU is MIMD at the core level (each core runs independent instructions) but supports SIMD within each core (AVX vector instructions). When you write multithreaded code with vectorized inner loops, you use both simultaneously.

  • A GPU is SIMT within a warp but MIMD across warps. Different warps on different streaming multiprocessors can execute different instructions at the same time. The hierarchy matters when reasoning about performance.

  • A supercomputer is MIMD at the node level, MIMD at the core level within each node, and SIMT at the warp level within each GPU. Three levels of parallelism, three different models.

When you encounter a new piece of hardware, ask two questions: how many instruction streams does it have, and how many data streams does it process? The answer places it in Flynn’s grid and tells you what programming model to expect.

⚠ Do not confuse the programming model with the execution model. You can write SPMD code (one kernel, many threads) that executes on SIMT hardware (warp-level lockstep) using what looks like SIMD operations (packed math). All three labels are correct simultaneously, describing different layers.

Three practical rules:

  1. Match your algorithm to the hardware category. Data-parallel problems (same operation on every element) suit SIMD and SIMT. Task-parallel problems (different operations on different data) suit MIMD.

  2. Minimize divergence on SIMT hardware. Every branch where threads disagree costs serialization time. Sort work so that adjacent threads take the same path.

  3. Think hierarchically. Your program runs on MIMD cores, each with SIMD lanes, connected to SIMT GPUs. Optimize at every level, not just one.
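Rule 2 can be made concrete with a small cost model (a Python sketch; the warp size of 4 and the even/odd predicate are illustrative, not real hardware parameters):

```python
def divergence_passes(work, predicate, warp_size=4):
    """Count serialized passes across all warps: a warp needs 1 pass
    if every thread takes the same path, 2 passes if it diverges."""
    passes = 0
    for w in range(0, len(work), warp_size):
        paths = {predicate(x) for x in work[w:w + warp_size]}
        passes += len(paths)  # one pass per distinct path in the warp
    return passes

data = [1, 2, 3, 4, 5, 6, 7, 8]
pred = lambda x: x % 2 == 0
print(divergence_passes(data, pred))                    # 4: every warp diverges
print(divergence_passes(sorted(data, key=pred), pred))  # 2: each warp is uniform
```

Sorting the work so that elements taking the same path land in the same warp halves the pass count here; on real hardware the same trick turns divergent warps into uniform ones.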

What comes next

You now have the vocabulary to classify any parallel architecture you encounter. The next article covers a fundamental limit on how much speedup parallelism can deliver: Amdahl’s law and the limits of parallel speedup. It answers the question every engineer asks: if I throw more cores at this problem, how much faster will it go?
