SIMD and vectorization: parallelism on a single CPU core
In this series (6 parts)
- How a CPU actually works: cores, clocks, and execution
- CPU caches and memory hierarchy: why memory access speed matters
- CPU pipelines and instruction-level parallelism
- Memory models and why concurrent CPU code is hard
- SIMD and vectorization: parallelism on a single CPU core
- Processes, threads, and context switching
Prerequisites
This article builds on two earlier posts. You should be comfortable with how a CPU executes instructions and how pipelining overlaps instruction stages. Both are necessary to understand why SIMD exists and where it fits in the execution model.
What SIMD is
Most CPU instructions operate on one piece of data at a time. Load one float, add one float, store one float. This is called scalar execution. It works, but it wastes hardware when you are doing the same operation across thousands of elements.
SIMD stands for Single Instruction, Multiple Data. Instead of adding one pair of floats per instruction, a SIMD instruction adds 4, 8, or 16 pairs with a single instruction. The CPU has wide registers (128, 256, or 512 bits) that hold multiple values packed together. One instruction operates on the entire register at once.
Think of it like a cashier scanning groceries. Scalar execution scans one item at a time. SIMD is a conveyor belt scanner that reads 8 barcodes in one pass. The scanner fires once, but 8 items get processed. Stripped of the analogy: SIMD uses wide data paths and parallel ALUs within a single core to apply an identical operation across packed data elements simultaneously.
```mermaid
graph LR
    subgraph Scalar["Scalar: 8 ADD instructions"]
        direction TB
        S1["ADD a[0], b[0]"] --> S2["ADD a[1], b[1]"]
        S2 --> S3["ADD a[2], b[2]"]
        S3 --> S4["ADD a[3], b[3]"]
        S4 --> S5["ADD a[4], b[4]"]
        S5 --> S6["ADD a[5], b[5]"]
        S6 --> S7["ADD a[6], b[6]"]
        S7 --> S8["ADD a[7], b[7]"]
    end
    subgraph SIMD_Block["SIMD: 1 VADDPS instruction"]
        direction TB
        V1["VADDPS ymm0, ymm1, ymm2"]
        V1 --- R1["a[0]+b[0] | a[1]+b[1] | a[2]+b[2] | a[3]+b[3]"]
        R1 --- R2["a[4]+b[4] | a[5]+b[5] | a[6]+b[6] | a[7]+b[7]"]
    end
```
Scalar addition requires 8 separate ADD instructions. A single 256-bit AVX VADDPS instruction performs all 8 additions at once by packing 8 float32 values into each 256-bit register.
This is not multithreading. SIMD parallelism happens within a single core, within a single thread. The core does not spawn extra work. It simply operates on wider data.
SIMD instruction sets: SSE, AVX, AVX-512
Intel and AMD have shipped several generations of SIMD instruction sets, each wider than the last. The key difference between them is register width, which determines how many values you process per instruction.
| Instruction set | Width (bits) | float32 per op | float64 per op | Introduced | CPU family |
|---|---|---|---|---|---|
| SSE | 128 | 4 | 2 | 1999 | Pentium III |
| SSE2 | 128 | 4 | 2 | 2001 | Pentium 4 |
| AVX | 256 | 8 | 4 | 2011 | Sandy Bridge |
| AVX2 | 256 | 8 | 4 | 2013 | Haswell |
| AVX-512 | 512 | 16 | 8 | 2017 | Skylake-X |
| ARM NEON | 128 | 4 | 2 | 2009 | Cortex-A8+ |
| ARM SVE | 128-2048 | Variable | Variable | 2020 | Neoverse V1+ |
SSE gave us 128-bit registers (XMM0 through XMM7; x86-64 added XMM8 through XMM15). AVX widened these to 256-bit registers (YMM0 through YMM15). AVX-512 doubled the width again to 512-bit registers (ZMM0 through ZMM31) and added mask registers for conditional operations.
Each generation is a superset. AVX2 includes all SSE instructions. AVX-512 includes all AVX2 instructions. Code compiled for SSE runs on any x86 chip made after 1999. Code compiled for AVX-512 only runs on high-end Intel processors and recent AMD Zen 4 chips.
Worked example 1: SIMD throughput calculation
AVX-512 operates on 512 bits at once. A float32 value is 4 bytes, which is 32 bits.
How many float32 values fit in one AVX-512 register?
512 bits / 32 bits per float = 16 floats per instruction
Consider adding two arrays of 1024 float32 values element-wise.
Scalar approach: each addition processes 1 float. You need 1024 ADD instructions.
AVX-512 approach: each VADDPS instruction processes 16 floats. You need 1024 / 16 = 64 SIMD instructions.
That is 16x fewer instructions for the same work. The instruction fetch, decode, and retirement overhead drops by the same factor. The data still needs to travel through the memory hierarchy, but the compute side shrinks dramatically.
In practice, the 16x instruction reduction does not always yield 16x wall-clock speedup. Memory bandwidth, cache misses, and pipeline stalls all eat into the theoretical gain. But for data that fits in L1 cache (typically 32-64 KB), you get very close to the theoretical maximum.
Worked example 2: vectorization speedup and efficiency
A scalar loop processes 1024 float32 values, one per iteration. Using 256-bit AVX (8 floats per instruction), the ideal number of iterations drops to 1024 / 8 = 128.
Ideal speedup: 1024 / 128 = 8x
Suppose you measure the actual speedup at 5x. What is the vectorization efficiency?
Efficiency = actual speedup / ideal speedup = 5 / 8 = 62.5%
What causes the gap between 8x and 5x?
Several factors reduce real-world SIMD performance:
- Memory bandwidth saturation. The ALU can process 8 floats per cycle, but if the data is not already in L1 cache, the core stalls waiting for memory. A cache miss to main memory costs 100+ cycles.
- Tail handling. If the array length is not a multiple of the SIMD width, the last few elements require scalar fallback code. For 1024 elements with width 8, there is no tail (1024 mod 8 = 0). For 1023 elements, 1023 mod 8 = 7, so the last 7 elements need scalar processing.
- Alignment overhead. SIMD loads are fastest when data starts at a 32-byte boundary (for 256-bit AVX). Misaligned data requires extra cycles or split loads.
- Register pressure. AVX provides 16 YMM registers. Complex loops that need many intermediate values may spill to the stack, adding load/store overhead.
- Surrounding scalar code. The loop itself might be vectorized, but setup, reduction, and cleanup code still runs scalar.
Auto-vectorization: when the compiler does it
Modern compilers (GCC, Clang, MSVC, ICC) can automatically convert scalar loops into SIMD instructions. This is called auto-vectorization. You write a normal loop, and the compiler emits VADDPS instead of ADDSS if conditions are right.
Consider this C code:
```c
void add_arrays(float* c, const float* a, const float* b, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```
Compiled with gcc -O3 -mavx2 (or -O2 on GCC 12 and later, which enable the vectorizer at -O2), the compiler recognizes this as a vectorizable loop and generates AVX2 instructions. You can verify this by checking the assembly output with gcc -S or using Compiler Explorer (godbolt.org).
Compile with -fopt-info-vec (GCC) or -Rpass=loop-vectorize (Clang) to see which loops the compiler vectorized and which it skipped.
Loop conditions for vectorization
Auto-vectorization is not magic. The compiler must prove that transforming the loop is safe. It checks several conditions, and if any fail, the loop stays scalar.
No pointer aliasing. If two pointers might reference overlapping memory, the compiler cannot reorder or widen operations safely. In C, use the restrict keyword to promise the compiler that pointers do not alias:
```c
void add_arrays(float* restrict c, const float* restrict a,
                const float* restrict b, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```
Without restrict, the compiler must assume c and a might overlap. It either inserts runtime checks (slower) or gives up on vectorization entirely.
Predictable memory access. The loop should access memory in a contiguous, stride-1 pattern. Accessing a[i], a[i+1], a[i+2] is ideal. Accessing a[indices[i]] (gather operations) is much harder to vectorize. AVX2 added gather instructions, but they are significantly slower than contiguous loads.
No loop-carried data dependencies. Each iteration must be independent. This vectorizes:
```c
for (int i = 0; i < n; i++) {
    c[i] = a[i] * b[i]; // each iteration is independent
}
```
This does not:
```c
for (int i = 1; i < n; i++) {
    a[i] = a[i-1] + b[i]; // iteration i depends on iteration i-1
}
```
The second loop has a dependency chain: you cannot compute a[4] until a[3] is done. No amount of SIMD width helps here.
No complex control flow. Branches inside the loop prevent vectorization unless the compiler can convert them to masked operations. Simple if/else on data values might work with AVX-512 mask registers. Complex branching with early exits will not.
Known trip count or safe over-read. The compiler needs to know the loop count at compile time or prove that reading a few extra elements past the array boundary is safe.
In practice: when auto-vectorization works and when it does not
Auto-vectorization works well for:
- ✓ Array-to-array arithmetic (add, multiply, scale)
- ✓ Dot products and reductions (with compiler support for horizontal adds)
- ✓ Simple filtering with branchless conditions
- ✓ Image processing: per-pixel operations on contiguous buffers
- ✓ Audio processing: applying gain, mixing channels
Auto-vectorization struggles with or fails on:
- ✗ Linked list traversal (pointer chasing, no contiguous access)
- ✗ Tree operations (unpredictable branching and memory patterns)
- ✗ Hash table lookups (random access patterns)
- ✗ Loops with function calls the compiler cannot inline
- ✗ Complex struct-of-arrays vs array-of-structs mismatches
Data layout matters enormously. Consider a particle simulation with x, y, z coordinates. An array-of-structs layout puts {x, y, z} for one particle together in memory. SIMD wants to process all x values, then all y values. A struct-of-arrays layout (separate x[], y[], z[] arrays) vectorizes naturally. The array-of-structs layout forces gather/scatter operations that negate the SIMD advantage.
When auto-vectorization fails, you have three options. First, restructure your data and loops to meet the conditions above. Second, use compiler intrinsics like _mm256_add_ps() to write SIMD explicitly. Third, use libraries like Intel MKL, Eigen, or NumPy that are already hand-vectorized internally.
The intrinsics approach looks like this:
```c
#include <immintrin.h>

void add_arrays_avx(float* c, const float* a, const float* b, int n) {
    int i;
    // process 8 floats per iteration in 256-bit YMM registers
    for (i = 0; i <= n - 8; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(&c[i], vc);
    }
    // scalar tail: handle the remaining n mod 8 elements
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```
This is more verbose but gives you full control. You choose the register width, the load/store alignment, and the tail handling strategy.
Why SIMD matters before studying GPU parallelism
GPUs are SIMD taken to the extreme. An NVIDIA streaming multiprocessor (the GPU's rough analog of a CPU core) executes the same instruction across a warp of 32 threads simultaneously. This is called SIMT: Single Instruction, Multiple Threads. The mental model is the same as CPU SIMD, just at a different scale.
| Property | CPU SIMD (AVX-512) | GPU SIMT (NVIDIA) |
|---|---|---|
| Width | 16 float32 per instruction | 32 threads per warp |
| Registers per unit | 32 ZMM registers | 65,536 32-bit registers per SM |
| Units per chip | 1-2 per core, 8-64 cores | 80-144 SMs |
| Total parallelism | ~1,024 ops/cycle | ~100,000+ ops/cycle |
| Memory model | Coherent cache hierarchy | Separate device memory |
Understanding SIMD gives you the vocabulary for GPU programming. Concepts that apply directly:
- Data parallelism. Same operation on many elements. SIMD and GPU both require this.
- Memory coalescing. Contiguous access patterns are fast. Scattered access is slow. True on both CPU SIMD and GPU.
- Occupancy and utilization. Not all lanes may be active. Divergent branches waste SIMD lanes on CPU (masked operations) and waste GPU threads (warp divergence).
- Width matching. Your problem size should be a multiple of the SIMD/warp width for full utilization.
If you understand why a loop with random memory access does not vectorize on a CPU, you already understand why random memory access kills GPU kernel performance. The principles are the same. The scale differs.
Limits of SIMD
SIMD is powerful but not universal. Some problems resist vectorization fundamentally:
Irregular data structures. Graphs, trees, and hash maps have unpredictable access patterns. You cannot pack 8 tree nodes into a register and do useful work because the next 8 nodes depend on the current traversal path.
Branch-heavy code. SIMD processes all lanes uniformly. If your algorithm branches differently for each element, you end up computing both paths and masking the results. For a 50/50 branch, every lane executes both sides, roughly doubling the work. That can erase the speedup entirely.
Sequential dependencies. Prefix sums, recurrences, and state machines where each step depends on the previous one cannot be naively parallelized. Specialized parallel algorithms exist (parallel prefix sum, for example) but they require restructuring the computation.
Small data. If your array has 10 elements, AVX-512 processes them in one instruction but the setup cost (loading registers, handling the tail) may exceed the scalar cost. SIMD shines on arrays with thousands to millions of elements.
Frequency throttling. This is a hardware constraint. Many Intel CPUs reduce their clock frequency when executing AVX-512 instructions because the wider execution units draw more power. On Skylake-X, AVX-512 code might run at 3.0 GHz while scalar code runs at 4.0 GHz. If only a small fraction of your program uses SIMD, the frequency penalty can make the overall program slower. Measure before committing to AVX-512.
Key takeaways
SIMD is the first level of data parallelism you encounter in hardware. It processes multiple values per instruction using wide registers within a single core. SSE gives you 4 floats per op. AVX gives you 8. AVX-512 gives you 16. The compiler can auto-vectorize simple loops if the data access is contiguous, the iterations are independent, and pointers do not alias.
The same constraints that make SIMD work (uniform operations, contiguous data, no divergence) are exactly what GPUs demand at a larger scale. Master SIMD thinking and GPU programming becomes a scaling exercise, not a conceptual leap.
What comes next
The next article covers processes and threads at the OS level. Understanding how the operating system schedules work across cores connects SIMD (parallelism within a core) to threading (parallelism across cores) and sets the stage for GPU compute, where thousands of threads run simultaneously.