Where to go from here: CUDA ecosystem and next steps

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article assumes you have worked through the rest of the series: you should be comfortable writing CUDA kernels, managing memory hierarchies, using streams, and profiling with Nsight. It surveys the broader ecosystem you will work with once those fundamentals are solid.

The ecosystem beyond raw kernels

Writing CUDA C++ kernels from scratch is rarely the right first move. NVIDIA ships a set of libraries that cover the most common computational patterns, written by teams with access to architecture-specific features that are not publicly documented. In almost every case these libraries outperform hand-written kernels unless you have domain-specific structure to exploit.

The question is not whether to use these libraries. The question is when to drop down to custom kernels and when to stay at a higher level of abstraction.

NVIDIA library landscape

| Library | Purpose | When to use | Python binding | Learning resource |
| --- | --- | --- | --- | --- |
| cuBLAS | Dense linear algebra (GEMM, GEMV, triangular solves) | Any matrix multiply or BLAS operation; always try first before writing custom GEMM | torch.mm, cupy.cublas | cuBLAS docs |
| cuFFT | Fast Fourier transforms (1D, 2D, 3D, batched) | Signal processing, spectral methods, frequency-domain convolution | cupy.fft, torch.fft | cuFFT docs |
| cuSPARSE | Sparse matrix operations (SpMV, SpMM, format conversions) | Graphs, recommender systems, scientific computing with >90% zeros | cupyx.scipy.sparse, torch.sparse | cuSPARSE docs |
| cuDNN | Neural network primitives (convolution, pooling, normalization, attention) | Any standard DL layer; Winograd/FFT convolution, fused attention | Used internally by PyTorch/TensorFlow | cuDNN dev guide |
| cuRAND | Random number generation (Philox, MTGP, Sobol) | Monte Carlo simulation, stochastic sampling, dropout | cupy.random, torch.cuda.manual_seed | cuRAND docs |
| Thrust | STL-like parallel algorithms (sort, scan, reduce) | When you need parallel primitives without writing kernels | cupy.sort, cupy.cumsum | Thrust docs |

The general rule: if your operation maps cleanly to a library call, use the library. Write a custom kernel only when you need to fuse operations, exploit problem-specific structure, or handle a pattern the library does not support.

cuBLAS, cuFFT, cuSPARSE: when to pick which

These three libraries cover the core of numerical computing on GPUs. The decision tree is straightforward:

flowchart TD
  A[Matrix operation needed] --> B{"Dense or sparse?"}
  B -->|Dense| C{"Standard BLAS?"}
  B -->|Sparse, >90% zeros| D[cuSPARSE]
  C -->|Yes: GEMM, GEMV, etc.| E[cuBLAS]
  C -->|"No: custom fused op"| F[Custom kernel]
  A --> G{"Frequency domain?"}
  G -->|Yes: FFT, spectral| H[cuFFT]
  G -->|No| B

cuBLAS handles dense GEMM with automatic selection of the best algorithm for your matrix dimensions and GPU architecture. On Ampere and Hopper GPUs, it uses Tensor Cores transparently when you operate in FP16 or TF32. There is no reason to write your own GEMM unless you are fusing it with a custom activation or need a non-standard layout.

cuFFT supports batched transforms, which are critical for throughput. A single large FFT underutilizes the GPU. Batching hundreds of smaller FFTs saturates the hardware. If your workload involves convolution of small filters, compare cuFFT-based frequency-domain convolution against cuDNN direct convolution. For filter sizes above roughly 11x11, the FFT approach often wins.
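The crossover can be sketched with a rough operation-count model. The constant below is an illustrative guess, not a measurement; only a benchmark on your hardware settles the question for a given filter size.

```python
import math

def direct_conv_flops(n: int, k: int) -> float:
    """Direct 2D convolution of an n x n image with a k x k filter:
    one multiply-add per filter tap per output pixel."""
    return n * n * k * k

def fft_conv_flops(n: int, c: float = 12.0) -> float:
    """FFT-based convolution: forward FFTs, a pointwise multiply, and an
    inverse FFT. The constant c absorbs all prefactors (illustrative guess)."""
    return c * n * n * math.log2(n) + n * n

n = 512
# Small filters favor the direct method...
assert direct_conv_flops(n, 3) < fft_conv_flops(n)
# ...large filters favor the FFT route, consistent with the ~11x11 heuristic.
assert direct_conv_flops(n, 15) > fft_conv_flops(n)
```

With these assumed constants the crossover lands near k ≈ 10, in line with the heuristic above; the real break-even point shifts with batch size, precision, and GPU generation.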

cuSPARSE matters when your data is genuinely sparse. If your matrix is 50% zeros, dense operations on cuBLAS are still faster because sparse formats introduce indexing overhead. The crossover point depends on sparsity pattern and matrix size, but as a heuristic: below 90% sparsity, benchmark cuBLAS against cuSPARSE before committing to sparse formats.
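The 90% heuristic falls out of simple storage arithmetic. A sketch comparing dense fp32 storage against CSR (a 4-byte value plus a 4-byte column index per nonzero, plus a row-pointer array):

```python
def dense_bytes(rows: int, cols: int) -> int:
    # Dense fp32: 4 bytes per element, zeros included.
    return rows * cols * 4

def csr_bytes(rows: int, cols: int, density: float) -> int:
    # CSR: 4-byte value + 4-byte column index per nonzero,
    # plus a (rows + 1)-entry array of 4-byte row offsets.
    nnz = int(rows * cols * density)
    return nnz * 8 + (rows + 1) * 4

m = n = 4096
# At 50% density CSR already stores MORE bytes than dense...
assert csr_bytes(m, n, 0.50) > dense_bytes(m, n)
# ...while at 95% sparsity (5% density) it stores far less.
assert csr_bytes(m, n, 0.05) < dense_bytes(m, n) // 2
```

Storage is a proxy for memory traffic, and SpMV is memory bound, so this model explains why dense cuBLAS keeps winning until sparsity is extreme; the irregular access pattern of CSR only makes the comparison worse for cuSPARSE.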

cuDNN: the deep learning workhorse

cuDNN is not a single algorithm. It is a heuristic search engine. When you call a convolution forward pass, cuDNN benchmarks multiple algorithms (direct, Winograd, FFT, implicit GEMM) on your specific input dimensions and selects the fastest. PyTorch exposes this via torch.backends.cudnn.benchmark = True.
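The mechanism is worth internalizing because it explains the first-iteration slowdown when benchmark mode is on. A toy version of the pattern, generic autotuning rather than cuDNN's actual interface:

```python
import time

def pick_fastest(candidates, *args):
    """Time each candidate implementation once on the real inputs and
    return the quickest. cuDNN's benchmark mode follows the same pattern,
    caching the winner per input configuration."""
    best_fn, best_t = None, float("inf")
    for fn in candidates:
        start = time.perf_counter()
        fn(*args)
        elapsed = time.perf_counter() - start
        if elapsed < best_t:
            best_fn, best_t = fn, elapsed
    return best_fn

# Two stand-in "algorithms" for the same operation.
def algo_direct(x):
    return sum(v * v for v in x)

def algo_unrolled(x):
    return sum(v * v for v in x)

winner = pick_fastest([algo_direct, algo_unrolled], list(range(1000)))
assert winner in (algo_direct, algo_unrolled)
```

The corollary: benchmark mode only pays off when input shapes are stable across iterations, because each new shape triggers a fresh search.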

The algorithms cuDNN selects are not publicly documented at the implementation level. They use Tensor Core instructions, custom memory layouts, and architecture-specific scheduling that you cannot replicate without internal knowledge of the hardware. For standard layers (Conv2D, BatchNorm, LayerNorm, multi-head attention), cuDNN is the ceiling you are benchmarking against.

Where cuDNN falls short: non-standard activation functions, custom normalization schemes, novel attention patterns, or any layer that does not map to cuDNN’s supported operation set. This is where custom kernels or Triton enter the picture.

Triton: GPU kernels in Python

Triton, developed by OpenAI, lets you write GPU kernels in Python with performance that approaches CUDA C++ for many workloads. The key abstraction: instead of thinking about individual threads, you think about blocks of data. Triton’s compiler handles thread mapping, shared memory management, and memory coalescing.

Here is the same vector addition kernel in both CUDA C++ and Triton:

CUDA C++ (kernel plus launch code):

__global__ void add_kernel(float* out, const float* a,
                           const float* b, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = a[idx] + b[idx];
}

void launch_add(float* out, const float* a,
                const float* b, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_kernel<<<blocks, threads>>>(out, a, b, n);
}

Triton:

import triton
import triton.language as tl

@triton.jit
def add_kernel(out_ptr, a_ptr, b_ptr, n,
               BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n
    a = tl.load(a_ptr + offsets, mask=mask)
    b = tl.load(b_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, a + b, mask=mask)

For vector addition, the two compile to essentially the same machine code and reach the same memory-bound throughput. The real comparison matters for complex kernels:

| Dimension | Triton | CUDA C++ |
| --- | --- | --- |
| Lines of code (fused softmax) | ~30 | ~80 |
| Development time | Hours | Days |
| Performance (GEMM, large matrices) | 90-95% of cuBLAS | 85-100% of cuBLAS (depends on skill) |
| Performance (fused attention) | 90-98% of handwritten | 100% (reference) |
| Shared memory control | Automatic | Manual |
| Warp-level primitives | Limited | Full access |
| Debugging | Python stack traces | cuda-gdb, printf |

Triton matches CUDA C++ performance when the bottleneck is memory bandwidth, not compute. Memory-bound kernels like element-wise ops, reductions, and softmax hit the same roofline in Triton because the compiler generates the same load/store patterns. Compute-bound kernels like GEMM close the gap when matrix dimensions are large enough for Triton’s autotuner to find good tile sizes.
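A toy roofline calculation makes the memory-bound claim concrete. The peak figures below are round illustrative numbers for an A100-class GPU, not exact specs:

```python
PEAK_FLOPS = 19.5e12   # FP32 throughput in flop/s, illustrative A100-class figure
PEAK_BW = 1.5e12       # HBM bandwidth in bytes/s, illustrative

def roofline_time(flops: float, bytes_moved: float) -> float:
    # Kernel time is bounded by whichever resource saturates first.
    return max(flops / PEAK_FLOPS, bytes_moved / PEAK_BW)

n = 1 << 26                 # 64M-element vector add
flops = n                   # one add per element
bytes_moved = 3 * n * 4     # read a, read b, write out (fp32)

ridge = PEAK_FLOPS / PEAK_BW     # ~13 flops/byte
intensity = flops / bytes_moved  # ~0.083 flops/byte

# Arithmetic intensity far below the ridge point => memory bound:
# the bandwidth term, not the compute term, sets the runtime.
assert intensity < ridge
assert roofline_time(flops, bytes_moved) == bytes_moved / PEAK_BW
```

Any compiler that emits the same loads and stores hits the same bandwidth ceiling, which is why Triton matches hand-written CUDA on kernels like this.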

Where CUDA C++ still wins: kernels requiring warp-level shuffles, custom shared memory layouts for bank-conflict avoidance, cooperative groups, or dynamic parallelism. If you need to squeeze the last 5% from a Tensor Core GEMM, you still drop to CUDA C++.

In practice: Start with Triton for new kernel development. Profile the result. Drop to CUDA C++ only when profiling shows the Triton kernel leaving measurable performance on the table and you can identify the specific bottleneck the Triton compiler cannot resolve.

PyTorch custom CUDA extensions

When you need a custom operation inside a PyTorch training loop, you write a CUDA extension. The integration pattern has three files:

1. The CUDA kernel (my_op_kernel.cu):

#include <torch/extension.h>
#include <cuda_runtime.h>

__global__ void my_op_forward_kernel(
    const float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = /* your computation */ input[idx];
    }
}

torch::Tensor my_op_forward(torch::Tensor input) {
    auto output = torch::empty_like(input);
    int n = input.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    my_op_forward_kernel<<<blocks, threads>>>(
        input.data_ptr<float>(),
        output.data_ptr<float>(),
        n);
    return output;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &my_op_forward, "My op forward");
}

2. Build configuration (setup.py):

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="my_op",
    ext_modules=[
        CUDAExtension(
            "my_op_cuda",
            ["my_op_kernel.cu"],
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)

3. Python integration (model.py):

import torch
from torch.autograd import Function
import my_op_cuda

class MyOp(Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return my_op_cuda.forward(input)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        # call backward kernel similarly
        return grad_output  # placeholder

# Usage in a model
x = torch.randn(1024, device="cuda")
y = MyOp.apply(x)

The key points: torch/extension.h provides tensor interop. CUDAExtension in setup.py handles nvcc compilation. torch.autograd.Function connects your kernel to PyTorch’s autograd graph. For quick iteration during development, use torch.utils.cpp_extension.load() instead of setup.py to JIT-compile the extension.

TensorRT: inference optimization

TensorRT converts trained models into optimized inference engines. It performs layer fusion, precision calibration (FP32 to FP16/INT8), kernel auto-tuning, and memory planning. A ResNet-50 model that runs at 200 images/second in PyTorch can reach 5000+ images/second under TensorRT on the same GPU.

flowchart LR
  A[PyTorch model] --> B[ONNX export]
  B --> C[TensorRT builder]
  C --> D{"Optimization steps"}
  D --> E[Layer fusion]
  D --> F[Precision calibration]
  D --> G[Kernel autotuning]
  E --> H[TensorRT engine]
  F --> H
  G --> H
  H --> I[Deployment]

The typical workflow:

  1. Export your PyTorch model to ONNX (torch.onnx.export)
  2. Parse the ONNX model with TensorRT’s builder
  3. Configure precision (FP16 for most cases, INT8 with a calibration dataset for maximum throughput)
  4. Build the engine (this takes minutes as TensorRT profiles every layer)
  5. Serialize the engine to disk for reuse

TensorRT engines are architecture-specific. An engine built for A100 will not run on T4. Rebuild for each target GPU.

⚠ TensorRT does not support all operations. Custom layers require writing TensorRT plugins, which are essentially CUDA kernels wrapped in TensorRT’s plugin interface. If your model uses non-standard ops, verify ONNX export completeness before committing to TensorRT.

NCCL: multi-GPU and multi-node communication

NCCL (NVIDIA Collective Communications Library) handles AllReduce, AllGather, Broadcast, ReduceScatter, and other collective operations across GPUs. It is the communication backbone of distributed training in PyTorch (torch.distributed) and virtually every multi-GPU framework.

NCCL automatically detects the topology (NVLink, PCIe, InfiniBand) and selects optimal communication algorithms. On DGX systems with NVLink, NCCL achieves near-theoretical bandwidth. On PCIe-connected GPUs, it uses tree and ring algorithms to maximize bus utilization.

You rarely call NCCL directly. PyTorch’s DistributedDataParallel uses it internally. But understanding NCCL matters when debugging communication bottlenecks:

  • Use NCCL_DEBUG=INFO to see topology detection and algorithm selection
  • Profile with Nsight Systems to identify communication gaps between compute phases
  • Overlap communication with computation using gradient bucketing (PyTorch does this by default)
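The standard cost model for the ring AllReduce explains why it scales well: each GPU sends and receives 2(p-1)/p times the buffer size, nearly independent of how many GPUs participate. A sketch (gradient-buffer size is an illustrative assumption):

```python
def ring_allreduce_bytes(buffer_bytes: float, p: int) -> float:
    """Bytes each GPU transfers in a ring AllReduce over p GPUs:
    (p-1)/p of the buffer in the reduce-scatter phase, plus the
    same again in the all-gather phase."""
    return 2 * (p - 1) / p * buffer_bytes

grads = 350e6 * 4  # ~350M fp32 gradients, an illustrative model size

# Per-GPU traffic approaches 2x the buffer as the ring grows, so the
# per-step communication cost barely changes with GPU count.
assert ring_allreduce_bytes(grads, 2) == grads          # 2 * (1/2) = 1x
assert ring_allreduce_bytes(grads, 64) < 2 * grads
assert ring_allreduce_bytes(grads, 64) > 1.9 * grads    # 2 * 63/64 ~ 1.97x
```

Dividing the per-GPU traffic by link bandwidth gives a lower bound on step time, which is the number to compare against the communication gaps Nsight Systems shows you.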

ROCm and HIP: AMD portability

AMD’s ROCm platform provides HIP (Heterogeneous-compute Interface for Portability), a C++ API that is source-compatible with CUDA. Most CUDA code can be converted to HIP with hipify-perl or hipify-clang, which performs a mechanical translation of API calls:

| CUDA | HIP |
| --- | --- |
| cudaMalloc | hipMalloc |
| cudaMemcpy | hipMemcpy |
| __global__ | __global__ |
| blockIdx.x | blockIdx.x |
| <<<blocks, threads>>> | <<<blocks, threads>>> |

The kernel language is nearly identical. What changes: library calls (cuBLAS becomes rocBLAS, cuDNN becomes MIOpen), profiling tools (Nsight becomes rocprof), and performance characteristics. An optimized CUDA kernel may not be optimal on AMD hardware because the warp size differs (64 on AMD vs 32 on NVIDIA) and the memory hierarchy has different latencies.
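How mechanical is the translation? A toy version of the renaming pass gives the flavor (the real hipify tools use clang tooling and cover hundreds of APIs; this handful of renames is just for illustration):

```python
import re

# A few of the one-to-one renames the real hipify tools apply.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def toy_hipify(source: str) -> str:
    # Replace whole identifiers only, so a name like cudaMallocAsync
    # is never mangled by the cudaMalloc rule.
    pattern = re.compile(r"\b(" + "|".join(CUDA_TO_HIP) + r")\b")
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)

src = "cudaMalloc(&d_a, n); kernel<<<blocks, threads>>>(d_a); cudaFree(d_a);"
assert toy_hipify(src) == \
    "hipMalloc(&d_a, n); kernel<<<blocks, threads>>>(d_a); hipFree(d_a);"
```

Note that the kernel launch syntax passes through untouched: the hard part of porting is not the renaming but the re-tuning that follows it.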

In practice: Write your kernels targeting CUDA. Use hipify to generate the AMD version. Profile separately on each platform and tune warp-level code for each architecture. The kernel logic stays the same; the tuning parameters change.

For projects that must run on both vendors from day one, consider Triton. Triton’s compiler targets both NVIDIA and AMD GPUs, and the block-level abstraction hides warp size differences.

Research directions

Three areas are actively reshaping GPU computing:

Sparse computation. Ampere introduced structured sparsity (2:4 pattern) with hardware support in Tensor Cores. Hopper extended this. The promise: 2x throughput on models pruned to 50% sparsity with minimal accuracy loss. The reality: tooling is immature and not all models tolerate structured pruning. Libraries like torch.sparse and NVIDIA’s ASP (Automatic SParsity) are evolving rapidly.

Mixed precision and narrow formats. FP8 (E4M3 and E5M2) on Hopper, FP4 on Blackwell. Each generation adds narrower number formats that double throughput at the cost of dynamic range. The programming challenge: managing scaling factors, loss scaling, and per-tensor vs per-channel quantization. The transformer-engine library from NVIDIA automates FP8 training for transformer architectures.

In-memory and near-memory processing. HBM bandwidth is growing slower than compute throughput. The gap creates a memory wall that kernel optimization alone cannot solve. Research into processing-in-memory (PIM) and near-memory compute aims to bring computation to the data. Samsung’s HBM-PIM and UPMEM’s PIM-DIMM are early commercial efforts. This is a 5-10 year horizon, but it will fundamentally change how we think about GPU kernel design.

Learning path recommendations

This series has taken you from GPU architecture through kernel development, memory optimization, synchronization, multi-GPU programming, and real-world case studies. Where you go next depends on your role.

flowchart TD
  A["Series complete"] --> B{"Your role?"}
  B -->|ML Engineer| C["Triton kernels<br/>PyTorch extensions<br/>TensorRT deployment"]
  B -->|Systems Programmer| D["Multi-GPU with NCCL<br/>Custom memory allocators<br/>Driver API"]
  B -->|Researcher| E["Novel architectures<br/>Sparse computation<br/>Mixed precision training"]
  C --> F["Build: custom fused attention kernel"]
  D --> G["Build: distributed training framework"]
  E --> H["Build: efficient sparse transformer"]

If you are an ML engineer:

Your highest-leverage skill is writing fused kernels for operations your framework does not optimize well. Learn Triton first. Build a fused attention kernel. Then learn TensorRT for inference deployment. Read the FlashAttention paper and reimplement it. Understand torch.compile and how it generates Triton kernels. Your goal: eliminate memory round-trips between operations.
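"Eliminate memory round-trips" is concrete arithmetic. For a chain of k element-wise ops on an n-element fp32 tensor, unfused execution re-reads and re-writes the tensor at every step, while a fused kernel touches global memory once. A sketch:

```python
def unfused_traffic(n: int, k_ops: int) -> int:
    # Each unfused element-wise op reads its input and writes its output.
    return k_ops * 2 * n * 4

def fused_traffic(n: int) -> int:
    # One fused kernel: read the input once, write the result once;
    # intermediates live in registers.
    return 2 * n * 4

n = 1 << 24
# Fusing a 4-op chain (say, bias + activation + scale + mask) cuts
# global-memory traffic 4x, and memory-bound runtime with it.
assert unfused_traffic(n, 4) == 4 * fused_traffic(n)
```

This is the entire economic case for FlashAttention and for torch.compile's fusion passes: the speedup is the ratio of memory traffic saved.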

Recommended reading: FlashAttention (Dao et al.), Megatron-LM (Shoeybi et al.), the Triton tutorial series on openai.com.

If you are a systems programmer:

Your path goes deeper into the hardware. Study the CUDA Driver API (not just the Runtime API). Learn how context creation, module loading, and kernel launching work at the driver level. Build a custom GPU memory allocator. Explore NCCL internals and write a simple AllReduce over PCIe. Profile real workloads with Nsight Compute and learn to read every metric in the roofline chart.

Recommended reading: CUDA C++ Programming Guide (NVIDIA), “Programming Massively Parallel Processors” (Hwu, Kirk, El Hajj), the GTC conference talks on Nsight Compute.

If you are a researcher:

Focus on the boundary between algorithms and hardware. Understand how Tensor Core MMA instructions map to matrix tiles. Study structured sparsity and how 2:4 patterns interact with pruning algorithms. Explore FP8 training and the numerical analysis behind loss scaling. Read about hardware-aware neural architecture search (NAS) and how kernel performance models guide architecture decisions.

Recommended reading: “Efficient Processing of Deep Neural Networks” (Sze et al.), “A Survey of Quantization Methods” (Gholami et al.), NVIDIA’s Hopper and Blackwell architecture whitepapers.

Series summary

This series covered 28 articles spanning GPU architecture, CUDA programming fundamentals, memory hierarchy optimization, synchronization primitives, multi-GPU programming, and performance engineering through real-world case studies. Here is the progression:

  1. Foundations (articles 1-5): GPU architecture, your first kernel, thread hierarchy, warp execution and divergence, and the memory hierarchy.
  2. Memory and synchronization (articles 6-12): memory coalescing, shared memory tiling, debugging and profiling, function qualifiers, synchronization and atomics, parallel primitives, and concurrent data structures.
  3. Execution (articles 13-17): streams, events, dynamic parallelism, unified memory, and multi-GPU programming.
  4. Performance (articles 18-22): allocation patterns, texture and constant memory, occupancy and tuning, and the matrix multiplication and convolution case studies.
  5. Advanced patterns (articles 23-27): reduction and histogram at scale, heterogeneous computing, pinned memory and zero-copy, concurrent kernel execution, and optimization case studies.
  6. Ecosystem (article 28): libraries, tools, portability, and where to go next.

The gap between knowing CUDA syntax and writing production-quality GPU code is large. It is closed by profiling, by reading other people’s kernels, and by repeatedly measuring the distance between your kernel’s throughput and the hardware’s theoretical peak. Every article in this series has pointed you toward that measurement discipline.

The GPU ecosystem moves fast. New architectures ship every two years; new libraries appear every few months. The fundamentals do not change: memory coalescing, occupancy, warp-level thinking, roofline analysis. Master those, and every new GPU generation becomes an incremental update rather than a fresh learning curve.

Build something. Profile it. Make it faster. That is the practice.
