GPU Programming
Posts and notes about GPU programming.
Series & Posts
1 How a CPU actually works: cores, clocks, and execution
2 CPU caches and memory hierarchy: why memory access speed matters
3 CPU pipelines and instruction-level parallelism
4 Memory models and why concurrent CPU code is hard
5 SIMD and vectorization: parallelism on a single CPU core
6 Processes, threads, and context switching
1 GPUs: from pixels to parallel supercomputers
2 Your first CUDA program: kernels, threads, and grids
3 Thread hierarchy in CUDA: threads, blocks, warps, and grids
4 Warps and warp divergence: the hidden performance trap
5 CUDA memory hierarchy: where your data lives matters
6 Memory coalescing: the most important optimization you will learn
7 Shared memory and tiling: the key to fast matrix operations
8 Debugging and profiling CUDA programs
9 Device functions, host functions, and CUDA function qualifiers
10 Synchronization and atomic operations in CUDA
11 Parallel prefix sum and reduction: the core parallel primitives
12 Concurrent data structures on the GPU
13 CUDA streams and asynchronous execution
14 CUDA events and fine-grained synchronization
15 Dynamic parallelism: kernels launching kernels
16 Unified virtual memory: one pointer for CPU and GPU
17 Multi-GPU programming and peer access
18 Memory allocation patterns and multi-dimensional arrays in CUDA
19 Texture and constant memory: specialized caches
20 Occupancy, register pressure, and performance tuning
21 Case study: matrix multiplication from naive to cuBLAS speed
22 Case study: implementing a convolution layer in CUDA
23 Case study: reduction and histogram at scale
24 Heterogeneous computing: CPU and GPU working together
25 Advanced memory patterns: pinned memory, zero-copy, and more
26 Advanced stream patterns and concurrent kernel execution
27 Performance case studies and optimization patterns
28 Where to go from here: CUDA ecosystem and next steps