GPU Programming
Posts and notes about GPU programming.
Series & Posts
1 How a CPU actually works: cores, clocks, and execution
2 CPU caches and memory hierarchy: why memory access speed matters
3 CPU pipelines and instruction-level parallelism
4 Memory models and why concurrent CPU code is hard
5 SIMD and vectorization: parallelism on a single CPU core
6 Processes, threads, and context switching
1 GPUs: from pixels to parallel supercomputers
2 Your first CUDA program: kernels, threads, and grids
3 Thread hierarchy in CUDA: threads, blocks, warps, and grids
4 Warps and warp divergence: the hidden performance trap
5 CUDA memory hierarchy: where your data lives matters
6 Memory coalescing: the most important optimization you will learn
7 Shared memory and tiling: the key to fast matrix operations
8 Debugging and profiling CUDA programs
9 Device functions, host functions, and CUDA function qualifiers
10 Synchronization and atomic operations in CUDA
11 Parallel prefix sum and reduction: the core parallel primitives
12 Concurrent data structures on the GPU
13 CUDA streams and asynchronous execution
14 CUDA events and fine-grained synchronization
15 Dynamic parallelism: kernels launching kernels
16 Unified virtual memory: one pointer for CPU and GPU
17 Multi-GPU programming and peer access
18 Memory allocation patterns and multi-dimensional arrays in CUDA
19 Texture and constant memory: specialized caches
20 Occupancy, register pressure, and performance tuning
21 Case study: matrix multiplication from naive to cuBLAS speed
22 Case study: implementing a convolution layer in CUDA
23 Case study: reduction and histogram at scale
24 Heterogeneous computing: CPU and GPU working together
25 Advanced memory patterns: pinned memory, zero-copy, and more
26 Advanced stream patterns and concurrent kernel execution
27 Performance case studies and optimization patterns
28 Where to go from here: CUDA ecosystem and next steps