
GPUs: from pixels to parallel supercomputers

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This is the first article in the series, so there is nothing you need to read beforehand.

You do not need any GPU programming experience yet. This article covers the history and architecture context you will need before writing your first CUDA kernel.

The original job: drawing triangles

GPUs exist because of one specific problem: real-time 3D rendering. In the 1990s, games like Doom (1993) and Quake (1996) needed to transform thousands of vertices per frame, apply lighting, rasterize triangles into pixels, and do it all 30 or more times per second. CPUs could not keep up.

The solution was a dedicated chip that did nothing but graphics. Early GPUs (the term was coined by NVIDIA in 1999 with the GeForce 256) implemented a fixed-function pipeline: vertices went in one end, colored pixels came out the other, and the programmer had zero control over the intermediate steps.

The classical graphics pipeline has five major stages:

  1. Vertex processing: transform each vertex from 3D model space into 2D screen space using matrix multiplication.
  2. Primitive assembly: group vertices into triangles (or lines, or points).
  3. Rasterization: determine which screen pixels each triangle covers.
  4. Fragment processing: compute the color of each pixel using textures, lighting, and blending.
  5. Framebuffer output: write the final pixel color to a memory buffer, which is then sent to the display.

Every one of these stages operates on many independent data elements simultaneously. Thousands of vertices can be transformed in parallel. Millions of pixels can be shaded in parallel. This is why the GPU was designed as a massively parallel processor from day one: the workload demanded it.

graph LR
  V["Vertices
(3D positions)"] --> VP["Vertex
Processing"]
  VP --> PA["Primitive
Assembly"]
  PA --> R["Rasterization"]
  R --> FP["Fragment
Processing"]
  FP --> FB["Framebuffer
Output"]
  FB --> D["Display"]
  style VP fill:#4a9eff,color:#fff
  style FP fill:#4a9eff,color:#fff

The blue stages (vertex processing, fragment processing) are the ones that eventually became programmable. They process millions of independent elements per frame, making them the natural targets for general-purpose computation.

From fixed-function to programmable shaders

The fixed-function pipeline was fast but rigid. Game developers wanted custom lighting models, procedural textures, and effects that the hardware designers had not anticipated. The only option was to wait for the next hardware generation.

In 2001, NVIDIA’s GeForce 3 and ATI’s Radeon 8500 introduced the first programmable shaders: small programs that ran on the GPU for each vertex or each pixel. The programmer could now write custom code instead of choosing from a menu of predefined operations.

Early shader languages were primitive, essentially GPU assembly. Over the next few years, higher-level shading languages appeared: HLSL (Microsoft, 2002), Cg (NVIDIA, 2003), and GLSL (OpenGL, 2004). These looked increasingly like C, with float4 types, math intrinsics, and control flow.

The key hardware change: vertex and pixel shaders initially used separate, specialized hardware units. By 2006, NVIDIA’s GeForce 8800 (G80 architecture) unified them into a single pool of programmable cores. Any core could run any type of shader. This was the unified shader architecture, and it was the direct ancestor of CUDA.

timeline
  title GPU Evolution: Fixed Function to General Purpose
  1996 : 3dfx Voodoo
       : Fixed-function rasterizer
       : No programmability
  1999 : NVIDIA GeForce 256
       : Hardware T&L
       : "GPU" term coined
  2001 : GeForce 3 / Radeon 8500
       : First programmable shaders
       : Vertex and pixel programs
  2006 : GeForce 8800 (G80)
       : Unified shader architecture
       : All cores are identical
  2007 : CUDA 1.0 released
       : General-purpose GPU computing
       : Write C code for GPU
  2009 : OpenCL 1.0
       : Cross-vendor standard
       : Apple-initiated specification
  2020 : A100 (Ampere)
       : Tensor Cores, MIG
       : AI/HPC accelerator
  2022 : H100 (Hopper)
       : Transformer Engine
       : 4th-gen Tensor Cores

The GPGPU era: abusing shaders for math

Once shaders became programmable, researchers realized they could trick the GPU into doing non-graphics computation. The technique was called GPGPU (General-Purpose computing on Graphics Processing Units), and it was a hack.

To multiply two matrices, you would encode matrix A as a texture, encode matrix B as another texture, write a fragment shader that computed the dot product for each output element, render a full-screen quad, and read back the “pixel” colors as your result matrix. It worked, but it was painful. You had to think in terms of textures, texture coordinates, and render targets instead of arrays, indices, and memory.

Despite the awkwardness, the results were compelling. Researchers reported 10x to 100x speedups on problems like molecular dynamics, fluid simulation, and signal processing. The raw compute throughput of GPUs was simply too large to ignore.

The problem was clear: GPUs had the hardware for general-purpose parallel computing, but the programming model was designed for drawing pictures. The hardware had outgrown its software interface.

CUDA: computation gets a real programming model

In November 2006, NVIDIA announced CUDA (Compute Unified Device Architecture). The first public release came in June 2007, alongside the GeForce 8800 GTX.

CUDA solved the GPGPU problem by exposing the GPU as what it actually was: a massively parallel processor. Instead of pretending to render pixels, you wrote functions (called kernels) in a C-like language. Each kernel ran on thousands of threads simultaneously. You allocated GPU memory with cudaMalloc, copied data with cudaMemcpy, and launched kernels with a special syntax.

Three decisions made CUDA successful:

  1. C extension, not a new language. CUDA added a handful of keywords to C/C++ (__global__, __device__, <<<blocks, threads>>>). Existing C programmers could start writing GPU code in hours, not weeks.
  2. Explicit memory management. The programmer controlled exactly when data moved between CPU and GPU. This was tedious but gave full control over the most critical performance bottleneck.
  3. Tight compiler and hardware co-design. NVIDIA built the compiler (nvcc), the driver, the profiler, and the hardware as a single system. When a new architecture shipped, the toolchain was ready on day one.

The tradeoff was vendor lock-in. CUDA runs only on NVIDIA GPUs. NVIDIA decided this was worth it because tight integration produced a better developer experience, and a better developer experience attracted more developers, which attracted more customers, which funded more hardware R&D. That flywheel is still spinning.

The landscape: CUDA, OpenCL, OpenACC, HIP

CUDA is not the only GPU programming framework. Several alternatives exist, each with different tradeoffs.

OpenCL (Open Computing Language) launched in 2009, backed by Apple and later managed by the Khronos Group. It targets any parallel hardware: NVIDIA GPUs, AMD GPUs, Intel GPUs, FPGAs, and even CPUs. The cost of this generality is a more verbose API, weaker tooling, and performance that rarely matches CUDA on NVIDIA hardware.

OpenACC is a directive-based approach. You annotate existing C/C++/Fortran code with #pragma acc directives, and the compiler generates GPU code automatically. It is the easiest path from CPU to GPU, but the compiler cannot always make optimal decisions, so performance ceilings are lower.

HIP (Heterogeneous-computing Interface for Portability) is AMD’s answer to CUDA. The API is nearly identical to CUDA (often a find-and-replace conversion), and it runs on AMD GPUs natively and on NVIDIA GPUs via a translation layer. It is gaining traction in HPC, particularly at DOE labs running AMD Instinct hardware.

| Feature | CUDA | OpenCL | OpenACC | HIP |
|---|---|---|---|---|
| Vendor | NVIDIA only | Cross-vendor (Khronos) | Multi-vendor (compiler-dependent) | AMD (NVIDIA via translation) |
| Language | C/C++ extensions | C-based kernel language | Pragma directives on C/C++/Fortran | C++ extensions (CUDA-like) |
| Abstraction level | Low (explicit memory, threads) | Low (explicit memory, work-items) | High (compiler-managed) | Low (explicit memory, threads) |
| Tooling quality | ✓ Excellent (nvcc, Nsight, nvprof) | ⚠ Varies by vendor | ⚠ Limited profiling | ⚠ Growing (rocprof, ROCm) |
| Performance ceiling | ✓ Highest on NVIDIA | ⚠ Near-CUDA on NVIDIA, native on others | ⚠ Compiler-limited | ✓ Native on AMD, good on NVIDIA |
| Ecosystem/libraries | ✓ cuBLAS, cuDNN, TensorRT, Thrust | ⚠ clBLAS, limited DL libraries | ✗ Minimal | ⚠ rocBLAS, MIOpen, growing |
| Best for | NVIDIA-only production workloads | Cross-platform portability | Quick CPU-to-GPU porting | AMD GPU workloads, portable HPC |

For this series, we use CUDA. The reasons are pragmatic: CUDA has the most mature tooling, the largest community, the best documentation, and the widest library ecosystem. If you learn CUDA well, porting to HIP is straightforward. The concepts (thread hierarchies, memory types, synchronization) transfer to every GPU programming model.

What a GPU looks like from 10,000 feet

Before you write your first kernel, you need a mental model of the hardware. Not at the transistor level, but at the level that affects how you write and optimize code.

A modern NVIDIA GPU is organized into a hierarchy:

Streaming Multiprocessors (SMs) are the fundamental compute building blocks. Each SM contains a set of CUDA cores (simple floating-point and integer units), a warp scheduler, a register file, shared memory, and an L1 cache. (A warp is a group of 32 threads that execute the same instruction simultaneously, NVIDIA's Single-Instruction, Multiple-Thread (SIMT) execution model.) When you launch a kernel, the GPU distributes blocks of threads across SMs.

CUDA cores execute one floating-point or integer operation per clock. An SM might have 64 or 128 CUDA cores depending on the architecture. These are not independent processors; they execute in lockstep groups of 32 threads called warps.

L2 cache sits between the SMs and global memory. It is shared across all SMs and caches frequently accessed data to reduce traffic to VRAM.

VRAM (Video RAM) is the GPU’s main memory, typically HBM (High Bandwidth Memory) on data center GPUs or GDDR on consumer cards. VRAM has vastly higher bandwidth than CPU DRAM (3,350 GB/s on the H100 vs. roughly 90 GB/s for dual-channel DDR5) but also higher latency.

PCIe / NVLink connects the GPU to the CPU and to other GPUs. PCIe is the standard interface; NVLink is NVIDIA’s proprietary high-speed interconnect for multi-GPU systems.

graph TB
  subgraph CPU_Side["Host (CPU)"]
      CPU["CPU Cores"]
      RAM["System RAM
(DDR5)"]
      CPU --- RAM
  end

  CPU_Side ---|"PCIe / NVLink"| GPU_Side

  subgraph GPU_Side["Device (GPU)"]
      direction TB
      subgraph SMs["Streaming Multiprocessors"]
          SM1["SM 0
64-128 CUDA cores
Warp scheduler
Shared memory
L1 cache
Register file"]
          SM2["SM 1"]
          SM3["SM 2"]
          SMn["... SM N"]
      end
      L2["L2 Cache
(shared across all SMs)"]
      VRAM["VRAM (HBM / GDDR)
High bandwidth, high capacity"]
      SMs --> L2
      L2 --> VRAM
  end

  style SM1 fill:#4a9eff,color:#fff
  style L2 fill:#f5a623,color:#fff
  style VRAM fill:#7ed321,color:#fff

The key takeaway: data flows from CPU RAM across PCIe into GPU VRAM, then from VRAM through the L2 cache into the SMs where computation happens. Results flow back the same way. Every level of this hierarchy has different bandwidth and latency characteristics, and your code’s performance depends on how well it respects them.

Three generations of GPU hardware

To make the architecture concrete, here is a comparison of three NVIDIA GPUs spanning six years:

| Specification | GTX 1080 (Pascal, 2016) | A100 (Ampere, 2020) | H100 (Hopper, 2022) |
|---|---|---|---|
| CUDA cores | 2,560 | 6,912 | 16,896 |
| Memory | 8 GB GDDR5X | 80 GB HBM2e | 80 GB HBM3 |
| Memory bandwidth | 320 GB/s | 2,039 GB/s | 3,350 GB/s |
| FP32 TFLOPS | 8.9 | 19.5 | 67.0 |
| Tensor Cores | None | 432 (3rd gen) | 528 (4th gen) |
| TDP | 180 W | 400 W | 700 W |

The generational improvements are dramatic:

  • CUDA cores: the H100 has 6.6x the cores of the GTX 1080 and 2.4x the A100.
  • Memory bandwidth: the H100 delivers 10.5x the bandwidth of the GTX 1080 and 1.6x the A100.
  • FP32 compute: the H100 achieves 7.5x the TFLOPS of the GTX 1080 and 3.4x the A100.
  • Memory capacity: the A100 and H100 both offer 80 GB, a 10x jump from the GTX 1080’s 8 GB. This is what makes training large models possible; you cannot train a model that does not fit in memory.

Notice that bandwidth and compute scale faster than clock speed or core count alone. Architectural improvements (wider memory buses, Tensor Cores, improved scheduling) contribute as much as raw transistor counts.

The numbers make one trend clear: memory bandwidth (10.5x) has scaled faster than FP32 compute (7.5x), which in turn has scaled faster than core count (6.6x). Architecture matters at least as much as brute-force transistor increases.

Which workloads belong on a GPU?

Not every problem benefits from GPU acceleration. The rule of thumb: a workload is GPU-appropriate if it has thousands of independent, identical operations on structured data with high arithmetic intensity. Let’s apply that to five concrete examples.

Sorting 1 billion integers: ✓ GPU-appropriate. Parallel sorting algorithms like radix sort decompose beautifully into independent work on sub-arrays. GPU implementations (like CUB’s radix sort) consistently outperform CPU sorts by 5x to 10x on large inputs. The key is that a billion elements provide enough parallelism to saturate the GPU.

Training a neural network: ✓ GPU-appropriate. The core operations are matrix multiplications and element-wise activation functions, both massively data-parallel. A single forward pass through a transformer layer involves multiplying matrices with millions of elements. This is the workload GPUs were practically redesigned for, and Tensor Cores exist specifically to accelerate it.

Parsing JSON: ✗ Not GPU-appropriate. JSON parsing is inherently sequential. Each token depends on what came before it (is this a quote opening or closing? Are we inside a string or a key?). The data is irregular, branching is frequent, and there is almost no arithmetic. This is a CPU workload through and through.

Computing FFT on a large signal: ✓ GPU-appropriate. The Fast Fourier Transform decomposes into butterfly operations that can be executed in parallel across frequency bins. cuFFT achieves 10x or more speedup over CPU implementations for large transforms (millions of points). Smaller transforms may not provide enough parallelism to justify the PCIe transfer cost.

Running a web server: ✗ Not GPU-appropriate. A web server handles many concurrent connections, but each request involves different logic: parsing HTTP headers, querying databases, rendering templates, serializing JSON. The operations are heterogeneous, branch-heavy, and I/O-bound. This is exactly the workload CPUs are designed for: many different tasks, each with complex control flow.

The pattern is consistent: structured data, uniform operations, high parallelism, and compute-bound arithmetic point toward the GPU. Irregular data, complex branching, sequential dependencies, and I/O-bound work point toward the CPU.

In practice

Learn CUDA first, abstract later. Directive-based tools like OpenACC are tempting because they hide complexity. But when performance matters, you need to understand what is happening underneath. Learn CUDA’s explicit model first; you can always move to higher-level abstractions once you know what they are abstracting over.

Vendor lock-in is a real tradeoff, not a fatal flaw. CUDA locks you into NVIDIA hardware. For many organizations, that is acceptable because NVIDIA’s ecosystem (libraries, documentation, tooling, community) is unmatched. If you need AMD or cross-vendor portability, HIP is the cleanest migration path from CUDA.

The GPU is not a drop-in replacement for the CPU. Porting CPU code to a GPU requires rethinking data structures, memory access patterns, and algorithm design. A naive port often runs slower than the original CPU code because it pays the PCIe transfer cost without exploiting enough parallelism to compensate.

Start with the ecosystem, not from scratch. Before writing a custom kernel, check if a library already solves your problem. cuBLAS for linear algebra, cuFFT for transforms, cuDNN for neural network primitives, Thrust for parallel algorithms, and CUB for block-level primitives cover the vast majority of common patterns.

Profile, then optimize. NVIDIA provides Nsight Compute and Nsight Systems for detailed GPU profiling. Never guess where the bottleneck is. Measure occupancy, memory throughput, and compute utilization before changing code.

What comes next

This article covered the journey from fixed-function graphics chips to general-purpose GPU accelerators: why GPUs were built, how programmable shaders enabled GPGPU, why CUDA won the ecosystem war, and what a modern GPU looks like at a high level. You now have the context to understand why CUDA’s programming model is shaped the way it is.

The next article, Your first CUDA program: kernels, threads, and grids, walks you through installing the CUDA toolkit, writing a kernel that runs on the GPU, compiling it with nvcc, and understanding what happens at each step. You will go from zero to running GPU code in under an hour.
