CPU vs GPU: why they are built differently and when to use each
In this series (5 parts)
- What parallelism actually means and why it is hard
- Flynn's taxonomy and types of parallel hardware
- Amdahl's law and scalability limits
- Parallel decomposition: how to split work across processors
- CPU vs GPU: why they are built differently and when to use each
Prerequisites
This article assumes you have read the following:
- Introduction to parallel thinking for the core mental model of parallelism.
- Amdahl’s law for understanding serial bottlenecks and speedup limits.
- How a CPU actually works for the Von Neumann architecture, clock cycles, and instruction execution.
You do not need GPU programming experience. This article explains the hardware differences so you know when to reach for a GPU before you learn how to program one.
Two chips, two design goals
A CPU and a GPU are both silicon chips that execute instructions. They sit on the same motherboard, draw power from the same supply, and exchange data over the same PCIe bus. Yet they are built in fundamentally different ways because they solve fundamentally different problems.
A CPU is designed to minimize the latency of a single thread. It dedicates enormous die area to branch predictors, out-of-order execution engines, and large caches. The goal: finish one task as fast as physically possible.
A GPU is designed to maximize the throughput of thousands of threads. It sacrifices single-thread performance to pack thousands of tiny cores onto the die. The goal: finish a massive batch of identical tasks as fast as possible in aggregate.
Neither design is superior. They are specialized for different workloads. Most real programs need both.
Die area: where the transistors go
The most revealing way to understand the difference is to look at how each chip spends its transistor budget.
A modern CPU like the Intel i9-13900K has 24 cores (8 performance cores and 16 efficiency cores). Each core is a complex machine with a deep pipeline, large L1 and L2 caches, a branch predictor, a reorder buffer, and multiple execution units. The caches alone can consume 40-50% of the die area. The cores themselves, with all their control logic, take most of the rest.
A modern GPU like the NVIDIA A100 has 6912 CUDA cores. Each core is a simple floating-point unit with minimal control logic. The caches are small. Instead, the die area goes toward compute units and a wide memory interface.
```mermaid
graph TB
subgraph CPU_Die["CPU Die (e.g. i9-13900K)"]
direction TB
C1["Core 1
Complex OoO pipeline"]
C2["Core 2"]
C3["..."]
C4["Core 24"]
L3["L3 Cache: 36 MB
approx 40% of die area"]
CU["Control: Branch predictor,
reorder buffer, scheduler"]
end
subgraph GPU_Die["GPU Die (e.g. A100)"]
direction TB
SM1["SM 1: 64 cores"]
SM2["SM 2: 64 cores"]
SM3["SM 3: 64 cores"]
SMN["... 108 SMs total = 6912 cores"]
GM["L2 Cache: 40 MB
approx 10% of die area"]
HBM["Memory controllers:
5120-bit HBM2e interface"]
end
```
A CPU spends most of its transistor budget on caches and control logic to make a few cores run fast. A GPU spends most of its budget on thousands of simple ALUs to run many threads simultaneously.
The tradeoff is stark. A CPU core can handle complex branching, speculative execution, and irregular memory access patterns. It is a generalist. A GPU core can add two floats. It is a specialist. But there are thousands of them.
Memory bandwidth: the real bottleneck
Raw compute power means nothing if the chip cannot feed data to its cores fast enough. This is where CPUs and GPUs diverge dramatically.
The i9-13900K has dual-channel DDR5 delivering roughly 90 GB/s of memory bandwidth. That is enough to keep 24 cores busy on most workloads.
The A100 has HBM2e (High Bandwidth Memory) delivering 2,039 GB/s. That is over 20x the bandwidth. It achieves this with a 5120-bit memory bus and memory chips stacked directly on the GPU package using through-silicon vias.
Why does the GPU need so much bandwidth? Because it has 6912 cores all requesting data at the same time. Without that bandwidth, most cores would sit idle waiting for memory. The entire GPU design only works if the memory system can keep up.
CPU vs GPU: a side-by-side comparison
| Property | CPU (i9-13900K) | GPU (A100) | What this means |
|---|---|---|---|
| Cores | 24 (8P + 16E) | 6,912 CUDA cores | GPU has ~288x more cores, but each is far simpler |
| Clock speed | Up to 5.8 GHz | 1.41 GHz (boost) | CPU clocks 4x higher per core |
| Memory bandwidth | ~90 GB/s (DDR5) | 2,039 GB/s (HBM2e) | GPU has ~22x more bandwidth |
| Cache | 36 MB L3 + per-core L1/L2 | 40 MB L2, small per-SM L1 | CPU uses cache to hide latency; GPU uses parallelism instead |
| Memory latency | ~50 ns to DRAM | ~300-400 ns to HBM | GPU latency is worse; it hides it by switching threads |
| Hardware thread contexts | 2 per core (hyperthreading) | Up to 64 warps (2,048 threads) per SM | GPU massively oversubscribes threads to hide latency |
| Programming model | Sequential by default | SIMT (Single Instruction, Multiple Threads) | GPU requires explicit parallel thinking |
The latency row is critical. GPU memory is actually slower per access than CPU memory. But the GPU never waits. When one group of threads stalls on a memory request, the hardware instantly switches to another group that is ready to compute. This latency hiding through massive thread parallelism is the core trick of GPU design.
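The scale of that trick can be sketched with Little's law: to keep a memory system saturated, bandwidth x latency bytes must be in flight at all times. A minimal model using the rough figures from the table above; the 32-byte and 64-byte request sizes are illustrative assumptions, not measured values:

```python
# Little's law: requests that must be in flight to saturate a memory system.
def inflight_requests(bandwidth_gb_s, latency_ns, bytes_per_request):
    """Outstanding memory requests needed to keep the bus fully utilized."""
    inflight_bytes = bandwidth_gb_s * 1e9 * latency_ns * 1e-9
    return inflight_bytes / bytes_per_request

# GPU-like figures: 2039 GB/s, ~400 ns latency, 32-byte memory transactions
print(round(inflight_requests(2039, 400, 32)))  # ~25488 requests in flight
# CPU-like figures: 90 GB/s, ~50 ns latency, 64-byte cache lines
print(round(inflight_requests(90, 50, 64)))     # ~70 requests in flight
```

Tens of outstanding requests fit in a CPU's out-of-order window; tens of thousands require thousands of concurrent threads, which is exactly what the warp scheduler provides.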
Relative comparison
To visualize how different these chips are, here is a normalized comparison. Each metric is divided by the CPU value so CPU = 1.0 across the board.
- Memory bandwidth: 2,039 / 90 = ~22.7x
- Peak FP32 compute: ~19.5 TFLOPS / ~0.76 TFLOPS = ~25.6x
- Cache: 40 MB / 36 MB = ~1.1x
- Power (TDP): 300 W / 125 W = 2.4x

The GPU delivers 20-25x more compute and bandwidth while consuming only about 2.4x the power. Cache size is roughly similar because the two chips use their caches for different purposes.
The ratios tell the story. If your workload can use all those cores and needs all that bandwidth, the GPU wins by an order of magnitude. If your workload is serial or branch-heavy, those extra cores sit idle and the CPU wins.
When to use a GPU
Not every workload benefits from a GPU. The key criteria are:
Data parallelism. The same operation must apply independently to thousands or millions of data elements. Matrix operations, image convolutions, particle simulations, and hash computations all fit this pattern. If each element can be processed without knowing the result of its neighbors, the work is data-parallel.
High arithmetic intensity. Arithmetic intensity is the ratio of compute operations to bytes of data moved. Workloads with high arithmetic intensity keep the GPU cores busy doing math rather than waiting for memory. Low intensity workloads (like a simple memory copy) are bottlenecked by bandwidth even on a GPU.
Large datasets. The overhead of transferring data to the GPU and launching kernels only pays off when the dataset is large enough to saturate the hardware. Processing 100 elements on a GPU is slower than doing it on a CPU.
When not to use a GPU
Branchy code. GPUs execute threads in groups of 32 (called warps on NVIDIA hardware). If threads in the same warp take different branches, the GPU must execute both paths and mask the results. This is called warp divergence, and it can halve throughput or worse.
Serial bottlenecks. If your algorithm has long sequential dependency chains where step N depends on the result of step N-1, you cannot parallelize it. Amdahl’s law applies here. A serial bottleneck on a 1.4 GHz GPU core is much slower than the same bottleneck on a 5.8 GHz CPU core.
Small data. Kernel launch overhead on a GPU is typically 5-15 microseconds. Data transfer over PCIe adds more. For small problems, the overhead dominates and the CPU finishes first.
Irregular memory access. GPUs achieve peak bandwidth when threads access memory in coalesced patterns (adjacent threads reading adjacent addresses). Random access patterns waste bandwidth and stall the pipeline.
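Warp divergence is worth quantifying. A toy cost model (the cycle counts are illustrative, not measured): when any lanes of a 32-wide warp disagree on a branch, the warp executes both paths serially with the inactive lanes masked off.

```python
def warp_time(taken_lanes, t_if, t_else, warp_size=32):
    """Cycles for one warp when `taken_lanes` of `warp_size` take the if-branch."""
    if taken_lanes == warp_size:
        return t_if                # uniform: only the if path executes
    if taken_lanes == 0:
        return t_else              # uniform: only the else path executes
    return t_if + t_else           # divergent: both paths execute, serially

print(warp_time(32, 10, 10))  # 10 - all lanes agree, one path's cost
print(warp_time(16, 10, 10))  # 20 - half the throughput
print(warp_time(1, 10, 10))   # 20 - even one stray lane forces both paths
```

Note that the cost depends only on whether lanes disagree, not on how many: one divergent lane is as expensive as sixteen.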
Worked example 1: arithmetic intensity of matrix multiply
Arithmetic intensity determines whether a workload is compute-bound or memory-bound. Let’s compute it for matrix multiplication.
Consider multiplying two 1000x1000 matrices of 32-bit floats:
Total FLOPs:
Each element of the result matrix C[i][j] is the dot product of row i of the first matrix with column j of the second: 1000 multiply-add operations. A multiply-add counts as 2 FLOPs (one multiply, one add).
- Result elements: 1000 x 1000 = 1,000,000
- FLOPs per element: 1000 x 2 = 2,000
- Total FLOPs: 1,000,000 x 2,000 = 2 x 10⁹ FLOPs (2 GFLOP)
Total bytes moved:
We need to read two input matrices and write one output matrix. Each matrix is 1000 x 1000 x 4 bytes = 4 MB.
- Bytes moved: 3 x 4 MB = 12 MB = 12 x 10⁶ bytes
Arithmetic intensity:
Intensity = FLOPs / Bytes = 2 x 10⁹ / 12 x 10⁶ = ~167 FLOPs/byte
Now compare this to a simple memory copy of the same data. Copying 4 MB means 0 FLOPs and 8 MB moved (read + write). Intensity = 0 FLOPs/byte. It is purely memory-bound.
An arithmetic intensity of 167 is extremely high. The GPU will spend the vast majority of its time doing useful compute, not waiting for data. This is exactly the kind of workload where GPUs excel. The A100 can deliver 19.5 TFLOPS of FP32 compute; at 167 FLOPs/byte, it only needs ~117 GB/s of bandwidth to stay compute-bound, well within its 2,039 GB/s capacity.
This is why matrix multiplication is the canonical GPU workload. Neural network training is dominated by matrix multiplications, which is why GPUs became the default hardware for deep learning.
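The numbers above can be reproduced in a few lines. This sketch assumes, as the example does, that each matrix is moved exactly once; cache reuse in a real tiled kernel changes the bytes-moved term:

```python
def matmul_intensity(n, bytes_per_elem=4):
    """FLOPs per byte for an n x n matmul, counting each matrix moved once."""
    flops = n * n * 2 * n                      # n*n outputs, 2n FLOPs each
    bytes_moved = 3 * n * n * bytes_per_elem   # read A, read B, write C
    return flops / bytes_moved

def bandwidth_to_stay_compute_bound(peak_tflops, intensity):
    """GB/s required so the cores never starve at the given intensity."""
    return peak_tflops * 1e12 / intensity / 1e9

ai = matmul_intensity(1000)
print(round(ai))                                         # 167 FLOPs/byte
print(round(bandwidth_to_stay_compute_bound(19.5, ai)))  # 117 GB/s needed, vs 2039 available
```

Note that intensity grows linearly with n: doubling the matrix size doubles the FLOPs per byte, which is why large matrix multiplies saturate GPU compute so easily.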
Worked example 2: PCIe transfer overhead
Even when a workload is perfect for the GPU, data must travel from CPU memory to GPU memory over the PCIe bus. This transfer is not free.
Setup:
- Data size: 1 GB
- PCIe 4.0 x16 bandwidth: ~32 GB/s (theoretical peak; practical is ~25 GB/s)
- GPU kernel execution time: 5 ms
Transfer time:
Using practical bandwidth:
- Transfer to GPU: 1 GB / 25 GB/s = 40 ms
- Transfer back to CPU: 1 GB / 25 GB/s = 40 ms
- Total transfer time: 80 ms
Total execution time:
- Transfer: 80 ms
- Compute: 5 ms
- Total: 85 ms
Transfer fraction:
Transfer time / Total time = 80 / 85 = 94%
The GPU kernel runs in 5 ms, but the program spends 94% of its time just moving data. This is the PCIe bottleneck.
When is offloading not worth it?
If the CPU can process the same 1 GB in 50 ms, then running on the CPU (50 ms total, no transfer cost) beats the GPU path (85 ms). The GPU compute is 10x faster (5 ms vs 50 ms), but the transfer overhead erases the advantage.
Offloading becomes worth it only when:
- The compute time on CPU is much longer than GPU compute + transfer time.
- The data already lives on the GPU from a previous operation.
- You overlap transfer and compute using asynchronous streams.
- You process the data multiple times before copying it back.
Rule of thumb: if you copy data to the GPU, process it once, and copy it back, the arithmetic intensity of your kernel must be high enough that the compute time dominates the transfer time. For PCIe 4.0, this means the kernel needs to do substantially more work than a simple element-wise operation on the data.
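The break-even logic from this example fits in a small model. The PCIe bandwidth and timings are the assumed figures from the setup above:

```python
def gpu_total_ms(data_gb, pcie_gb_s, kernel_ms):
    """GPU path: copy in over PCIe, compute, copy out over PCIe."""
    transfer_ms = 2 * (data_gb / pcie_gb_s) * 1000.0
    return transfer_ms + kernel_ms

def offload_wins(cpu_ms, data_gb, pcie_gb_s, kernel_ms):
    """True if the GPU path (including transfers) beats the CPU path."""
    return gpu_total_ms(data_gb, pcie_gb_s, kernel_ms) < cpu_ms

print(gpu_total_ms(1, 25, 5))        # 85.0 ms, matching the worked example
print(offload_wins(50, 1, 25, 5))    # False: the 50 ms CPU path wins
print(offload_wins(200, 1, 25, 5))   # True: a 200 ms CPU hotspot is worth moving
```

The model makes the crossover explicit: for this data size, offloading pays off only when the CPU path would take longer than about 85 ms.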
The PCIe bottleneck in practice
PCIe bandwidth has improved over generations but has not kept pace with GPU compute growth.
| PCIe generation | Per-lane bandwidth | x16 bandwidth | GPU compute growth over the same period |
|---|---|---|---|
| PCIe 3.0 | ~1 GB/s | ~16 GB/s | Baseline |
| PCIe 4.0 | ~2 GB/s | ~32 GB/s | ~4x FLOPS |
| PCIe 5.0 | ~4 GB/s | ~64 GB/s | ~8x FLOPS |
Each generation doubles PCIe bandwidth, but GPU compute grows faster. This widening gap is why NVIDIA introduced NVLink (600 GB/s per GPU on the A100) for GPU-to-GPU communication and why frameworks like CUDA use pinned memory and async transfers to overlap data movement with computation.
Heterogeneous computing: CPU + GPU together
No real application runs entirely on the GPU. File I/O, network communication, memory allocation, control flow decisions, and orchestration all happen on the CPU. The GPU handles the data-parallel heavy lifting. This division of labor is called heterogeneous computing.
```mermaid
graph TD
subgraph CPU_Work["CPU (Host)"]
A["Read input data from disk"] --> B["Allocate GPU memory"]
B --> C["Copy data: Host → Device"]
C --> D["Launch GPU kernel"]
D --> H["Wait for GPU completion"]
H --> I["Copy results: Device → Host"]
I --> J["Post-process results"]
J --> K["Write output to disk"]
end
subgraph GPU_Work["GPU (Device)"]
D --> E["Kernel: parallel computation"]
E --> F["Thousands of threads execute"]
F --> G["Results written to device memory"]
G --> H
end
style A fill:#4a86c8,color:#fff
style B fill:#4a86c8,color:#fff
style C fill:#e8a838,color:#fff
style D fill:#4a86c8,color:#fff
style E fill:#50b848,color:#fff
style F fill:#50b848,color:#fff
style G fill:#50b848,color:#fff
style H fill:#4a86c8,color:#fff
style I fill:#e8a838,color:#fff
style J fill:#4a86c8,color:#fff
style K fill:#4a86c8,color:#fff
```
Blue: CPU-only work. Orange: data transfer over PCIe. Green: GPU-parallel computation. The CPU orchestrates the entire pipeline, and the GPU accelerates only the data-parallel portion.
A typical heterogeneous workflow looks like this:
- CPU reads input data from disk or network.
- CPU allocates memory on the GPU (device memory).
- CPU copies input data from host memory to device memory over PCIe.
- CPU launches a GPU kernel (a function that runs on the GPU).
- GPU executes the kernel across thousands of threads.
- CPU copies results back from device memory to host memory.
- CPU handles post-processing, output, and control decisions.
Steps 3 and 6 are the PCIe transfers from the previous example. Steps 1, 2, 4, 7 are serial CPU work. Only step 5 runs on the GPU.
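The control flow of those seven steps can be sketched as below. This is a structural sketch only: NumPy stands in for both host and device memory so it runs without a GPU, and the `to_device`/`to_host` helper names are hypothetical stand-ins for real CUDA or CuPy transfer calls.

```python
import numpy as np

def read_input(n=64):                 # step 1: CPU reads/produces input
    return np.random.rand(n, n).astype(np.float32)

def to_device(host_arr):              # steps 2-3: allocate device memory, copy over PCIe
    return host_arr.copy()            # (simulated here as a plain host copy)

def kernel(a, b):                     # steps 4-5: launch; thousands of threads compute
    return a @ b                      # the data-parallel heavy lifting

def to_host(dev_arr):                 # step 6: copy results back over PCIe
    return dev_arr.copy()

a_d = to_device(read_input())
b_d = to_device(read_input())
c_d = kernel(a_d, b_d)                # keep c_d on the device if more kernels follow
result = to_host(c_d)                 # step 7: CPU post-processes `result`
print(result.shape)
```

Even in this skeleton the key optimization is visible: `c_d` can be fed to the next kernel without ever crossing back over PCIe.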
The programmer’s job in heterogeneous computing is to:
- ✓ Identify which portions of the workload are data-parallel.
- ✓ Minimize the number of host-to-device transfers.
- ✓ Keep data on the GPU as long as possible between kernels.
- ✓ Overlap transfer and compute using asynchronous execution.
- ⚠ Avoid launching tiny kernels that do less work than the kernel launch overhead.
- ✗ Do not try to run serial or branchy logic on the GPU.
In practice
Profile before you port. Measure your CPU baseline. Identify the hotspot. If the hotspot is not data-parallel or not computationally intensive, a GPU will not help.
Data transfer is often the bottleneck, not compute. In production pipelines, the first optimization is usually reducing or eliminating PCIe transfers. Techniques include keeping data resident on the GPU across multiple kernel launches, using pinned (page-locked) host memory for faster transfers, and using CUDA streams for async overlap.
Start with libraries, not custom kernels. cuBLAS for linear algebra, cuDNN for neural networks, cuFFT for transforms, and Thrust for parallel algorithms already exist and are heavily optimized. Writing a custom kernel that beats these libraries is hard.
Batch your work. A GPU kernel launch has fixed overhead (5-15 us). Launching 10,000 tiny kernels is slower than launching one kernel that does all the work. Batch small tasks into larger ones.
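A quick model of that fixed overhead; the 10 us launch cost is in the range quoted above, and the 0.01 us of work per item is an assumed round number:

```python
def total_time_us(n_launches, overhead_us, work_per_item_us, n_items):
    """Total time to process n_items split evenly across n_launches kernels."""
    return n_launches * overhead_us + n_items * work_per_item_us

items = 10_000
print(total_time_us(items, 10, 0.01, items))  # one kernel per item: ~100100 us
print(total_time_us(1, 10, 0.01, items))      # one batched kernel:  ~110 us
```

The useful work is identical in both cases; the ~900x difference is pure launch overhead.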
Memory coalescing matters more than you think. A single uncoalesced memory access pattern can reduce effective bandwidth by 10-30x. Structure your data so adjacent threads access adjacent memory locations.
The CPU is not idle while the GPU computes. Good heterogeneous programs overlap CPU work (preparing the next batch, handling I/O, running control logic) with GPU computation. Treat the CPU and GPU as two workers on an assembly line, not as one worker with a tool.
What comes next
This article covered the hardware-level differences between CPUs and GPUs: why they exist, when each is the right tool, and how they work together. You now understand the memory bandwidth gap, the PCIe transfer cost, and the concept of arithmetic intensity.
The next step is to learn how to actually program a GPU. The next series, GPU Programming with CUDA, starts with CUDA: introduction and history, where you will learn why NVIDIA created CUDA, what problems it solved, and how the programming model maps to the hardware you just studied.