GPUs: from pixels to parallel supercomputers
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read:
- *CPU vs GPU: why they are built differently and when to use each*, which covers the hardware differences, the memory bandwidth gap, PCIe transfer cost, and arithmetic intensity.
You do not need any GPU programming experience yet. This article covers the history and architecture context you will need before writing your first CUDA kernel.
The original job: drawing triangles
GPUs exist because of one specific problem: real-time 3D rendering. In the 1990s, games like Doom (1993) and Quake (1996) needed to transform vertices, apply lighting, rasterize triangles into pixels, and do it all 30 or more times per second. CPUs could not keep up.
The solution was a dedicated chip that did nothing but graphics. Early GPUs (the term was coined by NVIDIA in 1999 with the GeForce 256) implemented a fixed-function pipeline: vertices went in one end, colored pixels came out the other, and the programmer had zero control over the intermediate steps.
The classical graphics pipeline has five major stages:
- Vertex processing: transform each vertex from 3D model space into 2D screen space using matrix multiplication.
- Primitive assembly: group vertices into triangles (or lines, or points).
- Rasterization: determine which screen pixels each triangle covers.
- Fragment processing: compute the color of each pixel using textures, lighting, and blending.
- Framebuffer output: write the final pixel color to a memory buffer, which is then sent to the display.
Every one of these stages operates on many independent data elements simultaneously. Thousands of vertices can be transformed in parallel. Millions of pixels can be shaded in parallel. This is why the GPU was designed as a massively parallel processor from day one: the workload demanded it.
graph LR
    V["Vertices (3D positions)"] --> VP["Vertex Processing"]
    VP --> PA["Primitive Assembly"]
    PA --> R["Rasterization"]
    R --> FP["Fragment Processing"]
    FP --> FB["Framebuffer Output"]
    FB --> D["Display"]
    style VP fill:#4a9eff,color:#fff
    style FP fill:#4a9eff,color:#fff
The blue stages (vertex processing, fragment processing) are the ones that eventually became programmable. They process millions of independent elements per frame, making them the natural targets for general-purpose computation.
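To make that data parallelism concrete, here is a sketch, written in CUDA (which the rest of this series introduces properly), of how vertex processing maps onto parallel hardware: one thread per vertex, every thread applying the same 4x4 matrix. The kernel and parameter names are illustrative, not from any real graphics driver.

```cuda
// Illustrative sketch: one thread transforms one vertex.
// Every thread runs the same code on different data -- exactly
// the pattern the graphics pipeline was built around.
__global__ void transform_vertices(const float4 *verts_in,
                                   float4 *verts_out,
                                   const float *m,   // 4x4 matrix, row-major
                                   int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                 // guard threads past the last vertex
    float4 v = verts_in[i];
    verts_out[i] = make_float4(
        m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w,
        m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w,
        m[8]*v.x  + m[9]*v.y  + m[10]*v.z + m[11]*v.w,
        m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w);
}
```

With a million vertices, a million threads can each do this small, identical computation at once. No vertex depends on any other.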
From fixed-function to programmable shaders
The fixed-function pipeline was fast but rigid. Game developers wanted custom lighting models, procedural textures, and effects that the hardware designers had not anticipated. The only option was to wait for the next hardware generation.
In 2001, NVIDIA’s GeForce 3 and ATI’s Radeon 8500 introduced the first programmable shaders: small programs that ran on the GPU for each vertex or each pixel. The programmer could now write custom code instead of choosing from a menu of predefined operations.
Early shader languages were primitive, essentially GPU assembly. Over the next few years, higher-level shading languages appeared: Cg (NVIDIA, 2002), HLSL (Microsoft, 2002), and GLSL (OpenGL, 2004). These looked increasingly like C, with float4 types, math intrinsics, and control flow.
The key hardware change: vertex and pixel shaders initially used separate, specialized hardware units. By 2006, NVIDIA’s GeForce 8800 (G80 architecture) unified them into a single pool of programmable cores. Any core could run any type of shader. This was the unified shader architecture, and it was the direct ancestor of CUDA.
timeline
title GPU Evolution: Fixed Function to General Purpose
1996 : 3dfx Voodoo
: Fixed-function rasterizer
: No programmability
1999 : NVIDIA GeForce 256
: Hardware T&L
: "GPU" term coined
2001 : GeForce 3 / Radeon 8500
: First programmable shaders
: Vertex and pixel programs
2006 : GeForce 8800 (G80)
: Unified shader architecture
: All cores are identical
2007 : CUDA 1.0 released
: General-purpose GPU computing
: Write C code for GPU
2009 : OpenCL 1.0
: Cross-vendor standard
: Apple-initiated specification
2020 : A100 (Ampere)
: Tensor Cores, MIG
: AI/HPC accelerator
2022 : H100 (Hopper)
: Transformer Engine
: 4th-gen Tensor Cores
The GPGPU era: abusing shaders for math
Once shaders became programmable, researchers realized they could trick the GPU into doing non-graphics computation. The technique was called GPGPU (General-Purpose computing on Graphics Processing Units), and it was a hack.
To multiply two matrices, you would encode matrix A as a texture, encode matrix B as another texture, write a fragment shader that computed the dot product for each output element, render a full-screen quad, and read back the “pixel” colors as your result matrix. It worked, but it was painful. You had to think in terms of textures, texture coordinates, and render targets instead of arrays, indices, and memory.
Despite the awkwardness, the results were compelling. Researchers reported 10x to 100x speedups on problems like molecular dynamics, fluid simulation, and signal processing. The raw compute throughput of GPUs was simply too large to ignore.
The problem was clear: GPUs had the hardware for general-purpose parallel computing, but the programming model was designed for drawing pictures. The hardware had outgrown its software interface.
CUDA: computation gets a real programming model
In November 2006, NVIDIA announced CUDA (Compute Unified Device Architecture) alongside the GeForce 8800 GTX, the first GPU to support it. CUDA 1.0, the first public release, followed in June 2007.
CUDA solved the GPGPU problem by exposing the GPU as what it actually was: a massively parallel processor. Instead of pretending to render pixels, you wrote functions (called kernels) in a C-like language. Each kernel ran on thousands of threads simultaneously. You allocated GPU memory with cudaMalloc, copied data with cudaMemcpy, and launched kernels with a special syntax.
Three decisions made CUDA successful:
- C extension, not a new language. CUDA added a handful of keywords to C/C++ (`__global__`, `__device__`, `<<<blocks, threads>>>`). Existing C programmers could start writing GPU code in hours, not weeks.
- Explicit memory management. The programmer controlled exactly when data moved between CPU and GPU. This was tedious but gave full control over the most critical performance bottleneck.
- Tight compiler and hardware co-design. NVIDIA built the compiler (`nvcc`), the driver, the profiler, and the hardware as a single system. When a new architecture shipped, the toolchain was ready on day one.
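Put together, the workflow described above looks like this minimal sketch (a preview of what later articles cover in detail; error handling is omitted and the sizes are arbitrary):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Explicit memory management: allocate on the device, copy over.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch: enough 256-thread blocks to cover all n elements.
    int threads = 256, blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // 1.0 + 2.0 = 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Compare this with the texture-and-render-target gymnastics of the GPGPU era: arrays, indices, and memory, exactly as a C programmer expects.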
The tradeoff was vendor lock-in. CUDA runs only on NVIDIA GPUs. NVIDIA decided this was worth it because tight integration produced a better developer experience, and a better developer experience attracted more developers, which attracted more customers, which funded more hardware R&D. That flywheel is still spinning.
The landscape: CUDA, OpenCL, OpenACC, HIP
CUDA is not the only GPU programming framework. Several alternatives exist, each with different tradeoffs.
OpenCL (Open Computing Language) launched in 2009, backed by Apple and later managed by the Khronos Group. It targets any parallel hardware: NVIDIA GPUs, AMD GPUs, Intel GPUs, FPGAs, and even CPUs. The cost of this generality is a more verbose API, weaker tooling, and performance that rarely matches CUDA on NVIDIA hardware.
OpenACC is a directive-based approach. You annotate existing C/C++/Fortran code with #pragma acc directives, and the compiler generates GPU code automatically. It is the easiest path from CPU to GPU, but the compiler cannot always make optimal decisions, so performance ceilings are lower.
HIP (Heterogeneous-computing Interface for Portability) is AMD’s answer to CUDA. The API is nearly identical to CUDA (often a find-and-replace conversion), and it runs on AMD GPUs natively and on NVIDIA GPUs via a translation layer. It is gaining traction in HPC, particularly at DOE labs running AMD Instinct hardware.
| Feature | CUDA | OpenCL | OpenACC | HIP |
|---|---|---|---|---|
| Vendor | NVIDIA only | Cross-vendor (Khronos) | Multi-vendor (compiler-dependent) | AMD (NVIDIA via translation) |
| Language | C/C++ extensions | C-based kernel language | Pragma directives on C/C++/Fortran | C++ extensions (CUDA-like) |
| Abstraction level | Low (explicit memory, threads) | Low (explicit memory, work-items) | High (compiler-managed) | Low (explicit memory, threads) |
| Tooling quality | ✓ Excellent (nvcc, Nsight, nvprof) | ⚠ Varies by vendor | ⚠ Limited profiling | ⚠ Growing (rocprof, ROCm) |
| Performance ceiling | ✓ Highest on NVIDIA | ⚠ Near-CUDA on NVIDIA, native on others | ⚠ Compiler-limited | ✓ Native on AMD, good on NVIDIA |
| Ecosystem/libraries | ✓ cuBLAS, cuDNN, TensorRT, Thrust | ⚠ clBLAS, limited DL libraries | ✗ Minimal | ⚠ rocBLAS, MIOpen, growing |
| Best for | NVIDIA-only production workloads | Cross-platform portability | Quick CPU-to-GPU porting | AMD GPU workloads, portable HPC |
For this series, we use CUDA. The reasons are pragmatic: CUDA has the most mature tooling, the largest community, the best documentation, and the widest library ecosystem. If you learn CUDA well, porting to HIP is straightforward. The concepts (thread hierarchies, memory types, synchronization) transfer to every GPU programming model.
What a GPU looks like from 10,000 feet
Before you write your first kernel, you need a mental model of the hardware. Not at the transistor level, but at the level that affects how you write and optimize code.
A modern NVIDIA GPU is organized into a hierarchy:
Streaming Multiprocessors (SMs) are the fundamental compute building blocks. Each SM contains a set of CUDA cores (simple floating-point and integer units), a warp scheduler (a warp is a group of 32 threads that execute the same instruction simultaneously, NVIDIA's Single-Instruction, Multiple-Thread, or SIMT, model), a register file, shared memory, and an L1 cache. When you launch a kernel, the GPU distributes blocks of threads across SMs.
CUDA cores execute one floating-point or integer operation per clock. An SM might have 64 or 128 CUDA cores depending on the architecture. These are not independent processors; they execute in lockstep groups of 32 threads called warps.
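A hedged sketch of what that lockstep grouping means in code (the kernel and output names are illustrative; `warpSize` is a CUDA built-in, equal to 32 on every NVIDIA architecture to date):

```cuda
// Illustrative: how a thread's position maps to its warp.
// Within a block, threads 0-31 form warp 0, threads 32-63 form
// warp 1, and so on. All 32 lanes of a warp execute together.
__global__ void who_am_i(int *warp_of, int *lane_of) {
    int t   = threadIdx.x;                     // position within the block
    int gid = blockIdx.x * blockDim.x + t;     // global thread index
    warp_of[gid] = t / warpSize;               // which warp in this block
    lane_of[gid] = t % warpSize;               // lane 0..31 inside the warp
}
```

Why this matters for performance is the subject of the warp divergence article later in the series.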
L2 cache sits between the SMs and global memory. It is shared across all SMs and caches frequently accessed data to reduce traffic to VRAM.
VRAM (Video RAM) is the GPU’s main memory, typically HBM (High Bandwidth Memory) on data center GPUs or GDDR on consumer cards. VRAM has vastly higher bandwidth than CPU DRAM (over 3 TB/s on the H100 vs roughly 90 GB/s for dual-channel DDR5) but also higher latency.
PCIe / NVLink connects the GPU to the CPU and to other GPUs. PCIe is the standard interface; NVLink is NVIDIA’s proprietary high-speed interconnect for multi-GPU systems.
graph TB
subgraph CPU_Side["Host (CPU)"]
CPU["CPU Cores"]
RAM["System RAM
(DDR5)"]
CPU --- RAM
end
CPU_Side ---|"PCIe / NVLink"| GPU_Side
subgraph GPU_Side["Device (GPU)"]
direction TB
subgraph SMs["Streaming Multiprocessors"]
SM1["SM 0
64-128 CUDA cores
Warp scheduler
Shared memory
L1 cache
Register file"]
SM2["SM 1"]
SM3["SM 2"]
SMn["... SM N"]
end
L2["L2 Cache
(shared across all SMs)"]
VRAM["VRAM (HBM / GDDR)
High bandwidth, high capacity"]
SMs --> L2
L2 --> VRAM
end
style SM1 fill:#4a9eff,color:#fff
style L2 fill:#f5a623,color:#fff
style VRAM fill:#7ed321,color:#fff
The key takeaway: data flows from CPU RAM across PCIe into GPU VRAM, then from VRAM through the L2 cache into the SMs where computation happens. Results flow back the same way. Every level of this hierarchy has different bandwidth and latency characteristics, and your code’s performance depends on how well it respects them.
Three generations of GPU hardware
To make the architecture concrete, here is a comparison of three NVIDIA GPUs spanning six years:
| Specification | GTX 1080 (Pascal, 2016) | A100 (Ampere, 2020) | H100 (Hopper, 2022) |
|---|---|---|---|
| CUDA cores | 2,560 | 6,912 | 16,896 |
| Memory | 8 GB GDDR5X | 80 GB HBM2e | 80 GB HBM3 |
| Memory bandwidth | 320 GB/s | 2,039 GB/s | 3,350 GB/s |
| FP32 TFLOPS | 8.9 | 19.5 | 67.0 |
| Tensor Cores | None | 432 (3rd gen) | 528 (4th gen) |
| TDP | 180 W | 400 W | 700 W |
The generational improvements are dramatic:
- CUDA cores: the H100 has 6.6x the cores of the GTX 1080 and 2.4x the A100.
- Memory bandwidth: the H100 delivers 10.5x the bandwidth of the GTX 1080 and 1.6x the A100.
- FP32 compute: the H100 achieves 7.5x the TFLOPS of the GTX 1080 and 3.4x the A100.
- Memory capacity: the A100 and H100 both offer 80 GB, a 10x jump from the GTX 1080’s 8 GB. This is what makes training large models possible; you cannot train a model that does not fit in memory.
Notice that bandwidth and compute scale faster than core count alone: across these three generations, memory bandwidth grew 10.5x and FP32 throughput 7.5x against a 6.6x increase in CUDA cores. Architectural improvements (wider memory buses, Tensor Cores, improved scheduling) contribute at least as much as brute-force transistor increases.
Which workloads belong on a GPU?
Not every problem benefits from GPU acceleration. The rule of thumb: a workload is GPU-appropriate if it has thousands of independent, identical operations on structured data with high arithmetic intensity. Let’s apply that to five concrete examples.
Sorting 1 billion integers: ✓ GPU-appropriate. Parallel sorting algorithms like radix sort decompose beautifully into independent work on sub-arrays. GPU implementations (like CUB’s radix sort) consistently outperform CPU sorts by 5x to 10x on large inputs. The key is that a billion elements provide enough parallelism to saturate the GPU.
Training a neural network: ✓ GPU-appropriate. The core operations are matrix multiplications and element-wise activation functions, both massively data-parallel. A single forward pass through a transformer layer involves multiplying matrices with millions of elements. This is the workload GPUs were practically redesigned for, and Tensor Cores exist specifically to accelerate it.
Parsing JSON: ✗ Not GPU-appropriate. JSON parsing is inherently sequential. Each token depends on what came before it (is this a quote opening or closing? Are we inside a string or a key?). The data is irregular, branching is frequent, and there is almost no arithmetic. This is a CPU workload through and through.
Computing FFT on a large signal: ✓ GPU-appropriate. The Fast Fourier Transform decomposes into butterfly operations that can be executed in parallel across frequency bins. cuFFT achieves 10x or more speedup over CPU implementations for large transforms (millions of points). Smaller transforms may not provide enough parallelism to justify the PCIe transfer cost.
Running a web server: ✗ Not GPU-appropriate. A web server handles many concurrent connections, but each request involves different logic: parsing HTTP headers, querying databases, rendering templates, serializing JSON. The operations are heterogeneous, branch-heavy, and I/O-bound. This is exactly the workload CPUs are designed for: many different tasks, each with complex control flow.
The pattern is consistent: structured data, uniform operations, high parallelism, and compute-bound arithmetic point toward the GPU. Irregular data, complex branching, sequential dependencies, and I/O-bound work point toward the CPU.
In practice
Learn CUDA first, abstract later. Directive-based tools like OpenACC are tempting because they hide complexity. But when performance matters, you need to understand what is happening underneath. Learn CUDA’s explicit model first; you can always move to higher-level abstractions once you know what they are abstracting over.
Vendor lock-in is a real tradeoff, not a fatal flaw. CUDA locks you into NVIDIA hardware. For many organizations, that is acceptable because NVIDIA’s ecosystem (libraries, documentation, tooling, community) is unmatched. If you need AMD or cross-vendor portability, HIP is the cleanest migration path from CUDA.
The GPU is not a drop-in replacement for the CPU. Porting CPU code to a GPU requires rethinking data structures, memory access patterns, and algorithm design. A naive port often runs slower than the original CPU code because it pays the PCIe transfer cost without exploiting enough parallelism to compensate.
Start with the ecosystem, not from scratch. Before writing a custom kernel, check if a library already solves your problem. cuBLAS for linear algebra, cuFFT for transforms, cuDNN for neural network primitives, Thrust for parallel algorithms, and CUB for block-level primitives cover the vast majority of common patterns.
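As an example of the library-first approach, here is a sketch that computes SAXPY (y = a*x + y) with cuBLAS instead of a hand-written kernel. It assumes the device buffers have already been filled; error checking is omitted for brevity.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    // ... fill d_x and d_y, e.g. with cudaMemcpy from host arrays ...

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 2.0f;
    // y = alpha * x + y, entirely on the GPU, no custom kernel needed.
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

Library routines like this are tuned per architecture by NVIDIA; a first custom kernel rarely comes close, which is why "check the library first" is the default advice.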
Profile, then optimize. NVIDIA provides Nsight Compute and Nsight Systems for detailed GPU profiling. Never guess where the bottleneck is. Measure occupancy, memory throughput, and compute utilization before changing code.
What comes next
This article covered the journey from fixed-function graphics chips to general-purpose GPU accelerators: why GPUs were built, how programmable shaders enabled GPGPU, why CUDA won the ecosystem war, and what a modern GPU looks like at a high level. You now have the context to understand why CUDA’s programming model is shaped the way it is.
The next article, Your first CUDA program: kernels, threads, and grids, walks you through installing the CUDA toolkit, writing a kernel that runs on the GPU, compiling it with nvcc, and understanding what happens at each step. You will go from zero to running GPU code in under an hour.