How Computers Work · Part 4

Memory models and why concurrent CPU code is hard

In this series (6 parts)
  1. How a CPU actually works: cores, clocks, and execution
  2. CPU caches and memory hierarchy: why memory access speed matters
  3. CPU pipelines and instruction-level parallelism
  4. Memory models and why concurrent CPU code is hard
  5. SIMD and vectorization: parallelism on a single CPU core
  6. Processes, threads, and context switching

Prerequisites

This article builds on two earlier posts. You should understand how a CPU pipeline executes instructions (CPU architecture overview) and how caches sit between cores and main memory (CPU caches and the memory hierarchy). The reordering problems below happen precisely because of how store buffers and caches interact with multiple cores.

What a memory model defines

A memory model is a contract. It tells you which orders of reads and writes are legal when multiple threads (or cores) access shared memory. Without this contract, you cannot reason about concurrent code. Period.

Think of it like traffic laws. Two drivers approaching an intersection need agreed-upon rules. Without them, both might proceed and collide. A memory model is the set of rules that prevents cores from “colliding” on shared data, or at least tells you exactly when collisions can happen.

Formally, a memory model answers one question: if thread A writes a value and thread B reads it, under what conditions is thread B guaranteed to see thread A’s write? The answer is more complicated than most engineers expect.

Sequential consistency: the intuitive model

Leslie Lamport defined sequential consistency in 1979. The definition is short: the result of any execution is the same as if all operations from all threads were executed in some sequential order, and each thread’s operations appear in program order.

In plain terms: imagine shuffling each thread’s instructions into one global timeline, but never reordering instructions within a single thread. Every possible shuffle is a valid execution. Nothing else is.

This is what most programmers assume. It is also what almost no modern hardware gives you by default.

Under sequential consistency, if thread A executes:

x = 1
print(y)

and thread B executes:

y = 1
print(x)

Then the possible interleavings are limited. At least one thread must see the other’s write. Both printing 0 is impossible under sequential consistency, because that would require both writes to happen “after” both reads, which contradicts program order in at least one thread.

Why hardware does not give you sequential consistency

Performance. That is the entire reason.

A modern CPU core does not write directly to the cache on every store instruction. Instead, it drops the value into a store buffer: a small queue between the core and L1 cache. The store buffer lets the core continue executing without waiting for the cache coherence protocol to finish. This is critical; a cache miss on a write can take 100+ cycles, and stalling the pipeline for every write would destroy throughput.

The problem: while a value sits in the store buffer, it is invisible to other cores. The local core can see it (store-buffer forwarding), but no other core can. This creates a window where different cores disagree on the state of memory.

x86 processors use a relatively strong model called Total Store Order (TSO). ARM and RISC-V use weaker models that allow even more reordering. On ARM, both loads and stores can be reordered with respect to each other unless you explicitly prevent it.

Store buffers and why they cause visible reordering

Let’s trace through the most famous example in memory model literature: the store buffer reordering problem.

```mermaid
graph LR
  subgraph Core_A["Core A"]
      A_Pipeline["Pipeline"] --> A_SB["Store Buffer<br/>(x=1 pending)"]
  end
  subgraph Core_B["Core B"]
      B_Pipeline["Pipeline"] --> B_SB["Store Buffer<br/>(y=1 pending)"]
  end
  A_SB -->|"drains to"| L1_A["L1 Cache A"]
  B_SB -->|"drains to"| L1_B["L1 Cache B"]
  L1_A --> SharedMem["Shared Memory<br/>x=0, y=0"]
  L1_B --> SharedMem
```

Example 1: Store buffer reordering

Setup. Shared variables x and y are both initialized to 0.

Thread A:

S1: x = 1       // store
L1: r1 = y      // load into register r1

Thread B:

S2: y = 1       // store
L2: r2 = x      // load into register r2

Question: Can both r1 and r2 be 0 after both threads complete?

Under sequential consistency, no. Under real hardware with store buffers, yes. Here is how.

Step-by-step execution:

| Step | Core A | Core B | Store Buffer A | Store Buffer B | Memory (x, y) |
|------|--------|--------|----------------|----------------|---------------|
| 1 | Executes S1: x=1 | Executes S2: y=1 | x=1 (pending) | y=1 (pending) | (0, 0) |
| 2 | Executes L1: reads y from memory | Executes L2: reads x from memory | x=1 (pending) | y=1 (pending) | (0, 0) |
| 3 | r1 = 0 ✗ | r2 = 0 ✗ | x=1 (pending) | y=1 (pending) | (0, 0) |
| 4 | — | — | drains x=1 | drains y=1 | (1, 1) |

Both stores enter the store buffers at step 1 but have not drained to shared memory. At step 2, both loads read from shared memory, which still holds 0 for both variables. The store buffers drain later, but the damage is done. Both threads read stale values.

This is not a bug in the hardware. This is the hardware working as designed. The store buffer exists to make single-threaded code fast, and reordering is the price you pay in multithreaded code.

Result: r1 = 0, r2 = 0. Both threads observed a state of memory that never existed in any single point in time. This violates sequential consistency.
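This outcome can be reproduced in portable C++ by writing the two threads with relaxed atomics, which permit exactly the reordering traced above. A minimal sketch (the harness and iteration count are illustrative, not part of the original example):

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <thread>
#include <utility>

// Run the two-thread store-buffer experiment many times and tally the
// observed (r1, r2) outcomes. With relaxed atomics, the compiler and
// hardware are free to reorder, so (0, 0) may appear on real machines;
// sequential consistency would forbid it.
std::map<std::pair<int, int>, long> run_litmus(int iterations) {
    std::map<std::pair<int, int>, long> outcomes;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = 0, r2 = 0;
        std::thread a([&] {
            x.store(1, std::memory_order_relaxed);   // S1
            r1 = y.load(std::memory_order_relaxed);  // L1
        });
        std::thread b([&] {
            y.store(1, std::memory_order_relaxed);   // S2
            r2 = x.load(std::memory_order_relaxed);  // L2
        });
        a.join();
        b.join();
        ++outcomes[{r1, r2}];  // one of (0,0), (0,1), (1,0), (1,1)
    }
    return outcomes;
}
```

On real hardware, running enough iterations will usually surface the (0, 0) outcome that sequential consistency forbids.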

Memory fences: what they do at the hardware level

A memory fence (also called a memory barrier) is an instruction that forces ordering. A full fence tells the CPU: “drain the store buffer before executing any subsequent loads.” The fence does not make the store faster. It makes the core wait.

There are different fence strengths:

  • Store fence (sfence on x86): all prior stores must be visible before any subsequent store.
  • Load fence (lfence on x86): all prior loads must complete before any subsequent load.
  • Full fence (mfence on x86): all prior loads and stores must complete before any subsequent load or store.

```mermaid
sequenceDiagram
  participant SB as Store Buffer
  participant Core as CPU Core
  participant Cache as L1 Cache

  Core->>SB: STORE x = 1
  Note over SB: x=1 queued
  Core->>Core: MFENCE issued
  SB->>Cache: Drain x = 1
  Note over SB: Buffer empty
  Core->>Cache: LOAD y - now safe
  Cache-->>Core: y value - up to date
```

The fence creates a hard boundary. No load after the fence can execute until all stores before the fence have drained from the store buffer into the cache (and therefore become visible via the coherence protocol).

Example 2: Fixing the reordering with fences

Take the same program from Example 1 and insert a full fence between each store and load:

Thread A:

S1: x = 1         // store
    MFENCE         // full memory fence
L1: r1 = y        // load

Thread B:

S2: y = 1         // store
    MFENCE         // full memory fence
L2: r2 = x        // load

What the fence forces:

  1. Thread A executes S1, placing x=1 in its store buffer.
  2. Thread A hits the MFENCE. The core stalls. It cannot execute L1 until x=1 has drained from the store buffer into the cache, making it globally visible.
  3. Once x=1 is in the cache, the coherence protocol ensures any other core reading x will see 1 (or a newer value).
  4. Thread A executes L1, reading y. If thread B’s store (y=1) has already drained, r1 = 1. If not, r1 = 0, but that means thread B has not yet passed its own fence, so thread B’s load of x will see x=1.

The key insight: at least one fence completes before the other thread’s load executes. This guarantees at least one thread sees the other’s store. The outcome r1 = 0 AND r2 = 0 is now impossible.

Cost: the core stalls while waiting for the store buffer to drain. On x86, an MFENCE can cost 30 to 100+ cycles depending on the microarchitecture and how full the store buffer is. This is why you do not put fences everywhere; you put them exactly where correctness requires them.
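The same fix can be expressed portably in C++ with std::atomic_thread_fence, which the compiler lowers to the appropriate instruction for the target (MFENCE on x86, DMB on ARM). A sketch using the labels from Example 1:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread_a() {
    x.store(1, std::memory_order_relaxed);                // S1
    std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence
    r1 = y.load(std::memory_order_relaxed);               // L1
}

void thread_b() {
    y.store(1, std::memory_order_relaxed);                // S2
    std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence
    r2 = x.load(std::memory_order_relaxed);               // L2
}
```

The two seq_cst fences are totally ordered with respect to each other, so whichever fence comes second guarantees that its thread's load sees the other thread's store; r1 = 0 and r2 = 0 together are impossible.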

Volatile, atomic, and memory order in C++

Writing raw fence instructions is fragile and non-portable. C++11 introduced std::atomic with explicit memory ordering options. The compiler translates these into the correct fence instructions for whatever architecture you target.

Here is what each memory order means:

| Memory Order | Guarantee | Performance Cost | When to Use |
|--------------|-----------|------------------|-------------|
| memory_order_relaxed | Atomicity only. No ordering guarantees between threads. | Lowest. No fences emitted on most architectures. | Counters, statistics, anything where you only need the value to not tear. |
| memory_order_acquire | All reads/writes after this load see everything that happened before the corresponding release store. | Low to moderate. Load fence on weak architectures, free on x86. | Reading a flag or pointer that “publishes” data from another thread. |
| memory_order_release | All reads/writes before this store are visible to any thread that does an acquire load of this variable. | Low to moderate. Store fence on weak architectures, free on x86. | Writing a flag or pointer after preparing data for another thread. |
| memory_order_acq_rel | Combines acquire and release on a single read-modify-write operation. | Moderate. Both fences where needed. | Compare-and-swap in lock-free data structures. |
| memory_order_seq_cst | Full sequential consistency. All seq_cst operations appear in a single total order agreed upon by all threads. | Highest. Full fence (MFENCE on x86, DMB on ARM). | Default for std::atomic. Use when you need a global ordering guarantee or when reasoning about weaker orders is too error-prone. |

A concrete example using acquire/release:

std::atomic<bool> ready{false};
int data = 0;

// Thread A (producer)
data = 42;                                    // non-atomic write
ready.store(true, std::memory_order_release); // release store

// Thread B (consumer)
while (!ready.load(std::memory_order_acquire)) {} // acquire load
assert(data == 42); // guaranteed to pass

The release store on ready guarantees that the write to data (which happened before it) is visible to any thread that performs an acquire load on ready and sees true. The acquire load “synchronizes with” the release store, establishing a happens-before relationship.

Without these annotations (or with memory_order_relaxed), thread B might see ready == true but still read data == 0, because the compiler or hardware could reorder the non-atomic write past the atomic store.

A note on volatile

In C and C++, volatile does not provide atomicity or ordering. It only prevents the compiler from optimizing away reads and writes. It was designed for memory-mapped I/O registers, not for thread synchronization. Using volatile for concurrency is a bug. Use std::atomic.

Java’s volatile is different: it provides acquire/release semantics. Do not confuse the two.

In practice: real concurrency bugs

These are not theoretical problems. Engineers hit them regularly.

Double-checked locking. The classic singleton pattern where you check a flag, acquire a lock, check again, then initialize. Without proper memory ordering, a thread can see the “initialized” flag as true while the object’s fields are still uninitialized. This bug shipped in production Java code for years before the Java Memory Model was fixed in Java 5.
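A correct version in C++11 and later can be sketched with an acquire load on the fast path and a release store that publishes the fully constructed object (the Config type and names here are illustrative):

```cpp
#include <atomic>
#include <mutex>

struct Config { int value = 42; };  // illustrative payload

std::atomic<Config*> instance{nullptr};
std::mutex init_mutex;

Config* get_instance() {
    // Fast path: acquire ensures that if we see a non-null pointer, we
    // also see the fully constructed object it points to.
    Config* p = instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(init_mutex);
        p = instance.load(std::memory_order_relaxed);  // re-check under the lock
        if (p == nullptr) {
            p = new Config();
            // Release publishes the construction before the pointer.
            instance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```

The acquire/release pair is exactly what the broken versions were missing: without it, the pointer store can become visible before the stores that initialized the object.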

Lock-free queues with torn reads. A producer writes a data payload, then updates a “tail” index. If the compiler or CPU reorders the tail update before the data write, consumers read garbage. Acquire/release semantics on the tail index fix this.

False sharing causing “impossible” performance. Two threads modify independent variables that happen to sit on the same cache line. The coherence protocol bounces the line between cores on every write. The program is correct but runs 10x to 50x slower than expected. The fix: pad your variables to separate cache lines (typically 64 bytes apart).
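The padding fix can be sketched in C++ with alignas; the 64-byte line size is an assumption that holds on current x86 parts but should be checked for your target:

```cpp
#include <cassert>
#include <cstddef>

// Force each counter onto its own cache line. alignas(64) both aligns the
// struct to a line boundary and pads sizeof up to a multiple of 64, so two
// adjacent array elements can never share a line.
struct PaddedCounter {
    alignas(64) long value = 0;
};

PaddedCounter counters[2];  // one per thread; no longer share a line

static_assert(alignof(PaddedCounter) == 64, "each counter starts a new cache line");
static_assert(sizeof(PaddedCounter) == 64, "padding fills the rest of the line");
```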

Relaxed atomics on ARM. Code that works perfectly on x86 (which has a strong memory model) breaks on ARM (which has a weak model). The x86 hardware was inserting fences you did not ask for, masking your missing memory ordering annotations. This is common when porting server code to ARM-based cloud instances.

Compiler reordering without hardware involvement. Even on a single core, the compiler can reorder memory accesses. A compiler barrier (like std::atomic_thread_fence or asm volatile("" ::: "memory")) prevents this. Compilers optimize aggressively; they will reorder, merge, or eliminate memory accesses if the language rules allow it.
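The distinction can be sketched with the standard fence functions: std::atomic_signal_fence constrains only the compiler, while std::atomic_thread_fence also emits a hardware fence where the target needs one (the publish functions below are illustrative):

```cpp
#include <atomic>

int a = 0, b = 0;

void publish_compiler_only() {
    a = 1;
    // Compiler barrier: the compiler may not move memory accesses across
    // it, but no fence instruction is emitted -- the CPU can still reorder.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    b = 1;
}

void publish_with_hardware_fence() {
    a = 1;
    // Full barrier: constrains the compiler AND emits a fence (MFENCE on
    // x86, DMB on ARM) so other cores observe the stores in order.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    b = 1;
}
```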

The cost spectrum

To put fence costs in perspective:

  • No synchronization: 1 to 4 cycles per memory operation (L1 hit).
  • Acquire/release on x86: often free at the hardware level (x86 already orders stores relative to subsequent loads), but prevents compiler reordering.
  • Acquire/release on ARM: 1 additional barrier instruction, roughly 10 to 40 cycles depending on microarchitecture.
  • Full fence (seq_cst) on x86: MFENCE or LOCK-prefixed instruction, roughly 30 to 100 cycles.
  • Full fence on ARM: DMB ISH, roughly 20 to 60 cycles, more if the store buffer is full.

The performance gap between relaxed and sequentially consistent operations is significant in tight loops. A lock-free queue using relaxed atomics where safe and acquire/release where necessary can be 2x to 5x faster than one using seq_cst everywhere.
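As a small illustration of picking the cheapest sufficient order, a statistics counter needs atomicity but no ordering, so relaxed increments are enough (a sketch; the names are illustrative):

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<long> requests_served{0};

// Relaxed is enough here: no other data is published through this counter,
// so we only need the increments not to be lost or torn.
void worker(int n) {
    for (int i = 0; i < n; ++i) {
        requests_served.fetch_add(1, std::memory_order_relaxed);
    }
}
```

Every increment still lands exactly once; what you give up is only any ordering relationship between the counter and other memory, which a pure statistic does not need.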

Why this matters before GPU memory models

GPUs run thousands of threads, not just a handful. The memory model problem does not get easier; it gets dramatically harder.

A GPU has multiple levels of memory (registers, shared memory, L1, L2, global memory) with different visibility rules. Threads within a warp execute in lockstep, but threads across warps and thread blocks can observe writes in different orders. The cost of a full fence on a GPU is far higher relative to throughput, because you are stalling thousands of threads, not just one.

Understanding CPU memory models, store buffers, and fences gives you the vocabulary to reason about GPU memory consistency. When you encounter __threadfence(), __threadfence_block(), or CUDA’s cuda::atomic with memory scopes, you will recognize the same patterns: visibility, ordering, and the tradeoff between performance and correctness.

What comes next

The next article covers vectorization and SIMD: how a single CPU core processes multiple data elements per instruction, and why this is the bridge between scalar CPU thinking and the massively parallel execution model of GPUs.
