How Computers Work · Part 1

How a CPU actually works: cores, clocks, and execution

In this series (6 parts)
  1. How a CPU actually works: cores, clocks, and execution
  2. CPU caches and memory hierarchy: why memory access speed matters
  3. CPU pipelines and instruction-level parallelism
  4. Memory models and why concurrent CPU code is hard
  5. SIMD and vectorization: parallelism on a single CPU core
  6. Processes, threads, and context switching

Prerequisites

This is the first article in the “How Computers Work” series. No prerequisites are needed beyond basic programming experience. If you have written a loop and called a function, you know enough to follow along.

The Von Neumann architecture

Every general-purpose CPU you have used follows the same basic blueprint, published by John von Neumann in 1945. The idea: store both your program and your data in the same memory, and use a single processing unit to operate on them one instruction at a time.

That is the entire concept. A CPU is a machine that reads instructions from memory, figures out what they mean, does the work, and writes results back. Over and over, billions of times per second.

The four stages of every instruction are:

  1. Fetch: read the next instruction from memory, using the program counter (PC) to know where to look.
  2. Decode: figure out what the instruction means. Is it an add? A load from memory? A branch?
  3. Execute: do the actual work in the arithmetic logic unit (ALU) or another functional unit.
  4. Writeback: store the result into a register or back to memory.

Think of it like a factory assembly line with four stations. Each instruction passes through all four stations in order. The CPU cannot skip a station, and it cannot rearrange the order for a single instruction.

graph LR
  subgraph Memory
      IM["Instruction Memory"]
      DM["Data Memory"]
  end
  subgraph CPU
      PC["Program Counter (PC)"]
      F["1. Fetch"]
      D["2. Decode"]
      E["3. Execute (ALU)"]
      W["4. Writeback"]
      RF["Register File"]
  end
  PC -->|address| IM
  IM -->|instruction bits| F
  F --> D
  D --> E
  E --> W
  W -->|result| RF
  RF -->|operands| E
  E -->|store| DM
  DM -->|load| E
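The loop in the diagram can be sketched as a toy interpreter. Everything here is invented for illustration (the instruction names, the register names, the dict-based memory); a real CPU decodes binary instruction encodings, but the fetch-decode-execute-writeback structure is the same.

```python
# Toy fetch-decode-execute-writeback loop. The instruction set, register
# names, and dict-based memory are invented for this sketch.
memory = {"x": 2, "y": 3, "z": 0}        # data memory
program = [                              # instruction memory
    ("LOAD", "r0", "x"),                 # r0 <- memory["x"]
    ("LOAD", "r1", "y"),                 # r1 <- memory["y"]
    ("ADD", "r2", "r0", "r1"),           # r2 <- r0 + r1
    ("STORE", "r2", "z"),                # memory["z"] <- r2
]
registers = {}
pc = 0                                   # program counter

while pc < len(program):
    instruction = program[pc]            # 1. Fetch (PC says where to look)
    op, args = instruction[0], instruction[1:]   # 2. Decode
    if op == "LOAD":                     # 3. Execute and 4. Writeback
        registers[args[0]] = memory[args[1]]
    elif op == "ADD":
        registers[args[0]] = registers[args[1]] + registers[args[2]]
    elif op == "STORE":
        memory[args[1]] = registers[args[0]]
    pc += 1                              # advance to the next instruction

print(memory["z"])  # -> 5
```

Note that the program is exactly the load-load-add-store sequence a compiler emits for `z = x + y`.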

The register file is a small, fast set of storage locations inside the CPU, typically 16 to 32 registers on modern architectures. Registers are where the CPU does its actual work. When you write x = a + b in code, the compiler turns that into: load a into a register, load b into a register, add them, store the result.

Memory is large but slow. Registers are tiny but fast. This gap between memory speed and CPU speed is one of the most important facts in all of computing. We will return to it.

What a clock cycle is

A CPU does not run continuously like water through a pipe. It advances in discrete ticks, driven by a crystal oscillator. Each tick is one clock cycle.

A clock cycle is the smallest unit of time a CPU recognizes. Every operation the CPU performs takes some number of clock cycles. A simple integer addition might take 1 cycle. A floating-point division might take 10 to 20 cycles. A load from main memory might take 200 or more cycles.

The clock speed (measured in GHz) tells you how many of these cycles happen per second. A 3.5 GHz processor completes 3.5 billion cycles every second. Each cycle lasts about 0.286 nanoseconds, roughly the time light travels 8.6 centimeters.

Here is what matters for you as a programmer: not all instructions are created equal. The cycle count of an instruction is the real cost, and the variation is enormous.

Operation          Typical cycles   Time at 3.5 GHz
Integer add        1                0.29 ns
Integer multiply   3                0.86 ns
FP multiply        5                1.43 ns
FP divide          10-20            2.9-5.7 ns
L1 cache hit       4                1.1 ns
L2 cache hit       12               3.4 ns
L3 cache hit       40               11.4 ns
Main memory load   200+             57+ ns

That last row is the one to stare at. A memory load costs 200 times more than an add. Your CPU can do 200 additions in the time it takes to fetch one value from RAM. This ratio dominates the performance of nearly every real program.

Worked example 1: Clock cycle math

Let us make this concrete with real numbers.

Setup: A CPU running at 3.5 GHz. We have a loop that runs 1,000 iterations.

Cycles per second:

3.5 GHz = 3.5 × 10⁹ cycles/second = 3,500,000,000 cycles/second

Time per cycle:

1 / (3.5 × 10⁹) ≈ 2.857 × 10⁻¹⁰ seconds = 0.286 ns

Scenario A: 1,000 additions (1 cycle each)

Total cycles = 1,000 × 1 = 1,000 cycles
Total time = 1,000 × 0.286 ns = 286 ns ≈ 0.000286 ms

Scenario B: 1,000 memory loads (200 cycles each)

Total cycles = 1,000 × 200 = 200,000 cycles
Total time = 200,000 × 0.286 ns = 57,200 ns ≈ 0.057 ms

The memory-bound loop takes 200 times longer than the compute-bound loop. Same number of iterations, same clock speed, vastly different performance. This is why experienced engineers obsess over memory access patterns rather than shaving off arithmetic instructions.
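The same arithmetic as a script (the 1-cycle add and 200-cycle load are the article's illustrative figures; real counts vary by microarchitecture):

```python
# Clock cycle math from Worked example 1.
CLOCK_HZ = 3.5e9                          # 3.5 GHz
ns_per_cycle = 1e9 / CLOCK_HZ             # ~0.286 ns per cycle

def loop_time_ns(iterations, cycles_per_op):
    """Total time for a loop where every iteration costs a fixed cycle count."""
    return iterations * cycles_per_op * ns_per_cycle

compute_bound = loop_time_ns(1_000, 1)    # 1,000 one-cycle additions
memory_bound = loop_time_ns(1_000, 200)   # 1,000 uncached memory loads

print(round(compute_bound))               # -> 286 (ns)
print(round(memory_bound))                # ~57,000 ns
print(round(memory_bound / compute_bound))  # -> 200
```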

Single core execution

A single CPU core executes instructions one after another in program order. Your code runs top to bottom. When you write:

a = load(x)
b = load(y)
c = a + b
store(c, z)

The CPU must do exactly that sequence. It cannot compute a + b before it knows what a and b are. This is a data dependency: the add depends on both loads finishing first.
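One way to see the cost of that chain is to compute when each instruction can start, given what it waits on. This sketch uses illustrative latencies and assumes the two independent loads can overlap, which real hardware arranges with the tricks described next:

```python
# Earliest-finish-time model of the four-instruction sequence.
# Latencies are illustrative: 200 cycles per memory load, 1 cycle each
# for the add and the store.
latency = {"load_a": 200, "load_b": 200, "add": 1, "store": 1}
depends_on = {
    "load_a": [],                 # independent
    "load_b": [],                 # independent: can overlap with load_a
    "add": ["load_a", "load_b"],  # data dependency on both loads
    "store": ["add"],             # data dependency on the add
}

finish = {}
for instr in ["load_a", "load_b", "add", "store"]:
    # An instruction starts once everything it depends on has finished.
    start = max((finish[d] for d in depends_on[instr]), default=0)
    finish[instr] = start + latency[instr]

# The two loads overlap, but the add and the store must wait their turn:
print(finish["store"])  # -> 202 cycles, not 402
```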

Modern CPUs use several tricks to squeeze more work out of a single core:

  • Pipelining: while one instruction is in the Execute stage, the next instruction can be in Decode, and a third can be in Fetch. This overlapping means we can start a new instruction every cycle, even though each instruction takes multiple cycles to finish.
  • Out-of-order execution: the CPU looks ahead in the instruction stream and executes independent instructions early. If instruction 5 does not depend on instructions 3 or 4, the CPU might run it before them.
  • Branch prediction: when the CPU hits an if statement, it guesses which path to take and starts executing speculatively. If the guess is right (modern predictors are correct 95%+ of the time), no time is wasted. If wrong, the CPU throws away the speculative work and starts over, costing 10 to 20 cycles.

These optimizations are why modern single-core performance is remarkable. A single core on a recent CPU can retire 4 to 6 instructions per cycle in favorable conditions. But they all have limits, and they all add complexity, power draw, and heat.
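The pipelining payoff is easy to model. With a 4-stage pipeline and no stalls, a new instruction enters every cycle, so N instructions finish in stages + N − 1 cycles instead of stages × N. This is a best-case sketch; real pipelines stall on hazards and mispredictions:

```python
# Idealized throughput model for a 4-stage pipeline.
def unpipelined_cycles(n_instructions, stages=4):
    # Each instruction passes through all stages before the next starts.
    return n_instructions * stages

def pipelined_cycles(n_instructions, stages=4):
    # Fill the pipeline once, then retire one instruction per cycle.
    return stages + (n_instructions - 1)

n = 1_000
print(unpipelined_cycles(n))   # -> 4000
print(pipelined_cycles(n))     # -> 1003
print(round(unpipelined_cycles(n) / pipelined_cycles(n), 2))  # ~4x speedup
```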

Multiple cores: splitting the work

Around 2005, CPU manufacturers hit a wall. They could not keep increasing clock speeds because the power consumption (and heat) grows roughly with the cube of frequency. Instead, they started putting multiple independent cores on the same chip.

A multi-core CPU is essentially multiple CPUs sharing the same memory system. Each core has its own registers, its own pipeline, and its own L1/L2 cache. They share L3 cache and main memory.

graph TB
  subgraph "Single Core Execution"
      direction LR
      SC["Core 0"]
      ST1["Task 1<br>Task 2<br>Task 3<br>Task 4<br>Task 5<br>Task 6<br>Task 7<br>Task 8"]
      SC --- ST1
      STime["Time: 8 units"]
  end

  subgraph "Multi-Core Execution (4 cores)"
      direction LR
      C0["Core 0"]
      C1["Core 1"]
      C2["Core 2"]
      C3["Core 3"]
      T0["Task 1<br>Task 2"]
      T1["Task 3<br>Task 4"]
      T2["Task 5<br>Task 6"]
      T3["Task 7<br>Task 8"]
      C0 --- T0
      C1 --- T1
      C2 --- T2
      C3 --- T3
      MTime["Time: 2 units + overhead"]
  end

The diagram shows the fundamental tradeoff. A single core runs all 8 tasks sequentially in 8 time units. Four cores can, in principle, finish in 2 time units. But there is always overhead.

Why parallelism is not free:

  • Synchronization: cores must coordinate. If core 0 produces a result that core 1 needs, core 1 has to wait. This waiting is called synchronization overhead, and it often dominates the total runtime.
  • Cache coherence: when core 0 writes to a memory location that core 1 has cached, the hardware must invalidate core 1’s copy. This coherence traffic takes time and bandwidth.
  • Amdahl’s Law: if 10% of your program is inherently sequential (cannot be parallelized), then even with infinite cores, your maximum speedup is 10x. The sequential portion becomes the bottleneck.
  • Work distribution: splitting tasks evenly is hard. If one core gets more work than others, it becomes the bottleneck while other cores sit idle.
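Amdahl's Law is one line of code. A sketch using the 10% sequential fraction from the list above:

```python
# Amdahl's Law: speedup on n cores when a fraction s of the work
# is inherently sequential.
def amdahl_speedup(sequential_fraction, cores):
    s = sequential_fraction
    return 1 / (s + (1 - s) / cores)

print(round(amdahl_speedup(0.10, 8), 2))          # -> 4.71 on 8 cores
print(round(amdahl_speedup(0.10, 64), 2))         # -> 8.77 on 64 cores
print(round(amdahl_speedup(0.10, 1_000_000), 1))  # -> 10.0: the ceiling
```

Notice how quickly the returns diminish: going from 8 cores to 64 cores buys less than 2x.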

Worked example 2: Multi-core work division

Setup: 1 million independent additions. 8 cores. 3.5 GHz clock. Each addition takes 1 cycle.

Ideal case (perfect distribution, zero overhead):

Total work = 1,000,000 cycles
Work per core = 1,000,000 / 8 = 125,000 cycles
Time per core = 125,000 × 0.286 ns = 35,750 ns ≈ 35.75 μs
Speedup = (1,000,000 × 0.286 ns) / 35,750 ns = 8× (perfect)

Realistic case: now let us add overhead. In practice, you pay for:

  • Thread creation/scheduling: ~1,000 to 10,000 ns per thread
  • Work distribution (splitting the array, assigning ranges): ~500 ns
  • Synchronization at the end (joining results): ~1,000 ns
  • Cache effects (each core loads its portion into its own cache): variable

A conservative estimate with 8 threads:

Thread overhead = 8 × 5,000 ns = 40,000 ns
Distribution + sync ≈ 1,500 ns
Effective time = 35,750 + 40,000 + 1,500 = 77,250 ns ≈ 77.25 μs
Actual speedup = 286,000 / 77,250 ≈ 3.7×

Instead of the ideal 8x, we get roughly 3.7x. For a task this small, the overhead is significant. This is why you do not parallelize a loop of 1,000 iterations; the overhead eats the benefit. Multi-core parallelism pays off when each core has enough work to amortize the coordination cost, typically millions of operations or more per thread.
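Worked example 2 as a script, using the article's overhead estimates (5,000 ns of creation/scheduling per thread, 1,500 ns for distribution plus synchronization):

```python
# Multi-core speedup with coordination overhead.
NS_PER_CYCLE = 0.286          # at 3.5 GHz
TOTAL_CYCLES = 1_000_000      # one million one-cycle additions
CORES = 8

serial_ns = TOTAL_CYCLES * NS_PER_CYCLE               # 286,000 ns
ideal_ns = (TOTAL_CYCLES / CORES) * NS_PER_CYCLE      # 35,750 ns
overhead_ns = CORES * 5_000 + 1_500                   # threads + dist/sync
actual_ns = ideal_ns + overhead_ns                    # 77,250 ns

print(round(serial_ns / ideal_ns, 1))    # -> 8.0: the perfect speedup
print(round(serial_ns / actual_ns, 1))   # -> 3.7: what overhead leaves us
```

Try raising TOTAL_CYCLES: at a billion cycles of work, the same fixed overhead barely dents the 8x.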

What the operating system does

The CPU hardware provides cores, but the operating system (OS) decides what runs on them. This layer matters more than most programmers realize.

Processes and threads: your program runs as a process. Within that process, you can create threads. Each thread is an independent stream of execution that the OS can schedule onto any available core.

The scheduler: the OS scheduler is responsible for mapping threads to cores. It makes decisions every few milliseconds (the “time slice” or “quantum,” typically 1 to 10 ms). When a thread’s time slice expires, or when it blocks waiting for I/O, the scheduler suspends it and puts another thread on that core.

Context switching: when the scheduler swaps one thread for another on a core, it must save all the registers of the old thread and load the registers of the new thread. This costs roughly 1,000 to 10,000 ns, depending on the architecture. Frequent context switches waste time and pollute the CPU caches.
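A rough model of the trade-off, using the mid-range figures above (a sketch; it ignores the cache pollution, which is often the larger cost of a switch):

```python
# Fraction of CPU time lost to context switching when a thread runs
# for a full time slice between switches.
time_slice_ns = 5_000_000     # 5 ms quantum, mid-range of 1-10 ms
switch_cost_ns = 5_000        # mid-range of 1,000-10,000 ns

overhead = switch_cost_ns / (time_slice_ns + switch_cost_ns)
print(f"{overhead:.2%}")      # ~0.10%: negligible if switches are rare

# If something forces a switch every 50 us (lock churn, oversubscription):
busy_overhead = switch_cost_ns / (50_000 + switch_cost_ns)
print(f"{busy_overhead:.2%}")  # ~9%: now the waste is real
```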

What this means for you:

  • If you create more threads than you have cores, the OS will time-slice them. You get concurrency (multiple things in progress) but not true parallelism (multiple things running simultaneously).
  • If your threads block on I/O (disk, network), the OS can schedule other threads in the meantime. This is why async I/O and event loops exist: they let one thread handle many I/O operations without blocking.
  • Thread priorities, CPU affinity, and NUMA topology all affect performance. On large servers, which core your thread runs on can change memory access latency by 2x or more.

Modern CPU specs: what the numbers mean

Here is a comparison of recent high-end CPUs:

Spec                       Intel Core i9-14900K   AMD Ryzen 9 7950X   Apple M3 Max
Cores (P+E / total)        8P + 16E = 24          16                  12P + 4E = 16
Base / boost clock (GHz)   3.2 / 6.0              4.5 / 5.7           3.0 / 4.05
L3 cache                   36 MB                  64 MB               48 MB
Memory bandwidth           89.6 GB/s (DDR5)       83.2 GB/s (DDR5)    400 GB/s (unified)
Year                       2023                   2022                2023

What these numbers tell a programmer:

  • Core counts are misleading now. Intel’s 24 cores include 8 “performance” cores and 16 “efficiency” cores. The E-cores are smaller and slower, designed for background tasks. Effective parallelism for compute-heavy work is closer to 8 threads on the P-cores.
  • Boost clocks are the maximum a single core can reach under ideal thermal conditions. Sustained all-core loads run significantly lower. The i9’s 6.0 GHz boost does not mean all 24 cores run at 6.0 GHz.
  • L3 cache is shared across all cores. AMD’s 64 MB V-Cache variants (like the 7950X3D) show that cache size directly impacts gaming and database workloads. More cache means fewer expensive trips to main memory.
  • Memory bandwidth is often the real bottleneck. Apple’s unified memory architecture at 400 GB/s is over 4x faster than Intel or AMD on paper. For memory-bound workloads (large matrix operations, data processing), this matters enormously.

Why CPUs are optimized for latency, not throughput

This is the key insight that connects CPUs to everything else you will learn in this series.

A CPU is designed to make a single thread of execution as fast as possible. Every major feature (out-of-order execution, branch prediction, deep pipelines, large caches) exists to reduce the time, the latency, for one stream of instructions.

The alternative approach is throughput: do as many operations as possible per second across many simple threads. GPUs take this path. A modern GPU has thousands of simple cores, each much weaker than a CPU core, but collectively capable of trillions of operations per second.

Why does this distinction matter?

  • Sequential code (most business logic, web servers, databases) benefits from low latency. You want each request handled as fast as possible. CPUs excel here.
  • Parallel code (matrix multiplication, image processing, machine learning training) benefits from high throughput. You have millions of independent operations and care about total time, not time per operation. GPUs excel here.

Most real programs have both sequential and parallel sections. The CPU handles the sequential parts; the GPU (or many CPU cores) handles the parallel parts. Understanding which is which, and why, is the foundation of writing fast software.
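A toy comparison makes the split concrete. All the throughput numbers below are invented round figures, not specs for any real chip:

```python
# Latency-optimized vs throughput-optimized, on two kinds of work.
cpu_core_ops_per_s = 5e9           # one fast core
gpu_core_ops_per_s = 1e9           # one simple core...
gpu_cores = 4096                   # ...but thousands of them

chain_ops = 10_000                 # one request: a sequential chain
batch_ops = 1e12                   # a trillion independent operations

# Sequential chain: only one core can work on it at a time.
print(chain_ops / cpu_core_ops_per_s)   # CPU: 2e-06 s
print(chain_ops / gpu_core_ops_per_s)   # GPU core: 1e-05 s -- 5x worse

# Independent batch: every core can work at once.
print(batch_ops / cpu_core_ops_per_s)                 # CPU: 200 s
print(batch_ops / (gpu_cores * gpu_core_ops_per_s))   # GPU: ~0.24 s
```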

In practice: real bottlenecks engineers hit

Theory is clean. Real systems are messy. Here are the bottlenecks that actually dominate CPU performance in production code:

Memory access patterns ⚠: the single biggest factor. Code that accesses memory sequentially (arrays, contiguous buffers) runs 10x to 100x faster than code that chases pointers through random locations (linked lists, hash maps with pointer chaining). The CPU prefetcher can predict sequential access; it cannot predict random access.

Branch misprediction ⚠: an if statement in a tight loop that goes both ways unpredictably (like if random() > 0.5) costs 10 to 20 cycles per misprediction. Sorted data often runs faster than unsorted data through the same code because branches become predictable.
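A simple cost model shows why sorted data can win (cycle counts are illustrative, with the penalty in the 10-20 cycle range above):

```python
# Cost of a branchy loop as a function of misprediction rate.
def loop_cycles(iterations, work_cycles, mispredict_rate, penalty=15):
    # Every iteration pays the work, plus the penalty when the guess is wrong.
    return iterations * (work_cycles + mispredict_rate * penalty)

n = 1_000_000
sorted_data = loop_cycles(n, 3, 0.001)   # branch almost always predicted
random_data = loop_cycles(n, 3, 0.5)     # coin-flip branch

print(round(sorted_data))    # ~3 million cycles
print(round(random_data))    # ~10.5 million: same code, ~3.5x slower
```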

False sharing: two threads writing to different variables that happen to live on the same cache line (64 bytes). The hardware coherence protocol bounces the cache line back and forth between cores, turning parallel code into something slower than serial code.

Lock contention: multiple threads competing for the same mutex. Under heavy contention, threads spend most of their time waiting, not computing. Lock-free data structures and partitioning help, but add complexity.

NUMA effects: on multi-socket servers, memory is physically attached to one CPU socket. Accessing memory attached to the other socket takes 1.5x to 2x longer. Software that ignores NUMA topology leaves significant performance on the table.

Thermal throttling: under sustained load, CPUs reduce their clock speed to stay within thermal limits. A benchmark that runs at 5 GHz for 10 seconds might drop to 4 GHz after a minute. Long-running production workloads rarely sustain peak boost clocks.

Key takeaways

  • A CPU fetches, decodes, executes, and writes back instructions one at a time (logically), with hardware tricks like pipelining and out-of-order execution to overlap work.
  • Clock speed tells you cycles per second. The cycle count of an operation is what determines its real cost.
  • Memory loads (200+ cycles) dwarf arithmetic (1 cycle). Memory access patterns dominate real-world performance.
  • Multi-core parallelism helps, but overhead (synchronization, cache coherence, scheduling) means you rarely get perfect linear speedup.
  • CPUs are optimized for latency: making one thread fast. GPUs are optimized for throughput: making many threads collectively fast.
  • The operating system scheduler, cache hierarchy, and memory bandwidth are just as important to performance as raw clock speed.

What comes next

Now that you understand how a CPU executes instructions and why memory access is expensive, the natural question is: what sits between the CPU and main memory? The answer is a hierarchy of caches, and understanding them is essential for writing fast code.

In the next article, CPU caches and memory hierarchy, we will cover L1/L2/L3 caches, cache lines, spatial and temporal locality, and how to write code that works with the cache instead of against it.
