Multi-GPU programming and peer access
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read the following:
- Unified virtual memory: one pointer for CPU and GPU, for cudaMallocManaged, page migration, prefetching, and memory hints across devices.
- CUDA streams and asynchronous execution, for overlapping transfers with compute and managing concurrent work.
You should be comfortable allocating device memory, launching kernels, and using streams. Everything here builds on those primitives but targets systems with two or more GPUs.
Why multiple GPUs
A single GPU has a ceiling. An A100 peaks at 312 TFLOPS (FP16 Tensor) and 80 GB of HBM2e. Training a 70B parameter model requires over 140 GB just for weights in FP16. That does not fit on one device. Even when the model fits, a single GPU processes one batch at a time. Four GPUs processing four batches deliver roughly four times the throughput, minus synchronization overhead.
Multi-GPU programming is not optional for large-scale workloads. It is the standard deployment model for training, inference serving, and any compute job that exceeds single-device memory or throughput limits.
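The memory arithmetic above is worth checking directly. A minimal sketch in Python (the helper names are illustrative, and the numbers are the FP16 figures quoted above):

```python
def model_memory_gb(n_params, bytes_per_param=2):
    """Memory needed just for the weights, in GB (FP16 = 2 bytes/param)."""
    return n_params * bytes_per_param / 1e9

def min_gpus(n_params, gpu_mem_gb, bytes_per_param=2):
    """Lower bound on GPUs needed to hold the weights alone (ceiling division).
    Ignores activations, optimizer state, and framework overhead."""
    need = model_memory_gb(n_params, bytes_per_param)
    return int(-(-need // gpu_mem_gb))

weights = model_memory_gb(70e9)  # 70B parameters in FP16
print(f"70B FP16 weights: {weights:.0f} GB")
print(f"A100-80GB GPUs needed (weights only): {min_gpus(70e9, 80)}")
```

Real deployments need far more than this lower bound: optimizer state and activations typically multiply the footprint several times over.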
Device enumeration
Every multi-GPU program starts with discovering what hardware is available. CUDA provides cudaGetDeviceCount and cudaGetDeviceProperties for this:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    printf("Found %d CUDA device(s)\n", deviceCount);

    for (int i = 0; i < deviceCount; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("  Device %d: %s\n", i, prop.name);
        printf("    Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("    Global memory: %.2f GB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("    SM count: %d\n", prop.multiProcessorCount);
    }
    return 0;
}
On a DGX A100 this prints eight devices. On a consumer workstation with two RTX 4090s, you get two. The device index (0 through N-1) is how CUDA identifies each GPU for every subsequent API call.
cudaSetDevice: directing traffic
By default, all CUDA calls target device 0. To work with a specific GPU, call cudaSetDevice before any allocation or kernel launch:
cudaSetDevice(0);
float *d_a0;
cudaMalloc(&d_a0, N * sizeof(float));
kernel<<<grid, block>>>(d_a0, N); // runs on GPU 0
cudaSetDevice(1);
float *d_a1;
cudaMalloc(&d_a1, N * sizeof(float));
kernel<<<grid, block>>>(d_a1, N); // runs on GPU 1
Every cudaMalloc, kernel launch, stream creation, and event creation is bound to the current device. Mixing up the active device is the most common multi-GPU bug. A kernel launched on device 0 cannot access memory allocated on device 1 unless peer access is enabled.
⚠ Always call cudaSetDevice before every group of operations targeting a specific GPU. Do not assume the current device persists across function boundaries.
Multi-GPU topology
GPUs connect to the system through different interconnects. The topology determines how fast data moves between devices:
graph TD
    CPU["CPU (Host Memory)"]
    PS["PCIe Switch"]
    NVB["NVLink Bridge"]
    CPU -->|"PCIe 4.0 32 GB/s"| PS
    PS -->|"PCIe 4.0 32 GB/s"| GPU0["GPU 0"]
    PS -->|"PCIe 4.0 32 GB/s"| GPU1["GPU 1"]
    GPU0 <-->|"NVLink 3.0 600 GB/s"| NVB
    GPU1 <-->|"NVLink 3.0 600 GB/s"| NVB
    NVB <-->|"Direct P2P"| GPU2["GPU 2"]
    NVB <-->|"Direct P2P"| GPU3["GPU 3"]
    style CPU fill:#264653,stroke:#e0e0e0,color:#e0e0e0
    style PS fill:#2a9d8f,stroke:#e0e0e0,color:#e0e0e0
    style NVB fill:#e76f51,stroke:#e0e0e0,color:#e0e0e0
    style GPU0 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
    style GPU1 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
    style GPU2 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
    style GPU3 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
The difference between NVLink and PCIe is not incremental. It is nearly 20x in bandwidth. This gap determines whether multi-GPU scaling is limited by compute or by communication.
GPU interconnect comparison
| Technology | Bandwidth (GB/s) | Latency | Typical Use | Hardware Requirement |
|---|---|---|---|---|
| PCIe 3.0 x16 | 16 | ~1 us | Consumer GPUs, older servers | Standard motherboard slot |
| PCIe 4.0 x16 | 32 | ~1 us | Workstations, mid-tier servers | PCIe 4.0 CPU and motherboard |
| PCIe 5.0 x16 | 64 | ~0.5 us | Next-gen servers | PCIe 5.0 CPU and motherboard |
| NVLink 2.0 | 300 | ~0.3 us | V100 GPU pairs | NVLink bridge connector |
| NVLink 3.0 | 600 | ~0.2 us | A100 DGX systems | NVSwitch or NVLink bridge |
| NVLink 4.0 | 900 | ~0.2 us | H100 DGX systems | NVSwitch 3.0 |
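The table's bandwidth numbers translate directly into idealized transfer times. A small estimator (hypothetical helper; it uses peak bandwidth and ignores latency and protocol overhead, so real transfers will be slower):

```python
# Peak bandwidths from the table above, in GB/s
BANDWIDTH_GBPS = {
    "pcie3": 16, "pcie4": 32, "pcie5": 64,
    "nvlink2": 300, "nvlink3": 600, "nvlink4": 900,
}

def transfer_ms(size_gb, link):
    """Idealized one-way transfer time in milliseconds at peak bandwidth."""
    return size_gb / BANDWIDTH_GBPS[link] * 1000

for link in ("pcie4", "nvlink3"):
    print(f"1 GB over {link}: {transfer_ms(1, link):.2f} ms")
```

Running this for a 1 GB buffer gives roughly 31 ms over PCIe 4.0 versus under 2 ms over NVLink 3.0, which is the gap the rest of this section keeps returning to.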
Peer access: direct GPU-to-GPU transfers
Without peer access, copying data from GPU 0 to GPU 1 goes through host memory: GPU 0 to host, then host to GPU 1. This doubles the transfer time and saturates PCIe in both directions.
Peer access allows one GPU to read from or write to another GPU’s memory directly, bypassing the host entirely. On NVLink-connected devices, this uses the full NVLink bandwidth. On PCIe, it uses the direct PCIe path between devices (if the topology supports it).
Checking and enabling peer access
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);

    printf("Peer access matrix:\n      ");
    for (int j = 0; j < deviceCount; j++) printf("GPU%d ", j);
    printf("\n");
    for (int i = 0; i < deviceCount; i++) {
        printf("GPU%d: ", i);
        for (int j = 0; j < deviceCount; j++) {
            if (i == j) {
                printf("  -  ");
                continue;
            }
            int canAccess;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("  %s  ", canAccess ? "✓" : "✗");
        }
        printf("\n");
    }

    // Enable peer access from device 0 to device 1
    int canAccess;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
        printf("Peer access enabled: GPU 0 -> GPU 1\n");
    } else {
        printf("Peer access not supported. Using host-mediated transfer.\n");
    }
    return 0;
}
cudaDeviceCanAccessPeer returns 1 if direct access is possible. cudaDeviceEnablePeerAccess must be called from the source device. The second argument is reserved and must be 0. Peer access is unidirectional: enabling GPU 0 to access GPU 1 does not automatically enable the reverse. Enable both directions if you need bidirectional transfers.
Peer-to-peer memory copy
With peer access enabled, cudaMemcpyPeer transfers data directly between devices:
// Allocate on each device
cudaSetDevice(0);
float *d_a0;
cudaMalloc(&d_a0, N * sizeof(float));
cudaSetDevice(1);
float *d_a1;
cudaMalloc(&d_a1, N * sizeof(float));
// Direct GPU 0 -> GPU 1 copy (no host staging)
cudaMemcpyPeer(d_a1, 1, d_a0, 0, N * sizeof(float));
You can also use cudaMemcpyPeerAsync with a stream for non-blocking transfers. This is essential for overlapping communication with compute in multi-GPU pipelines.
If peer access is not available, CUDA silently falls back to host-mediated transfer. The copy still works, but at PCIe bandwidth instead of NVLink bandwidth. Always check cudaDeviceCanAccessPeer and log which path you are taking.
Example A: peer access check with bandwidth comparison
Consider device 0 trying to access device 1’s memory. The logic:
- Call cudaDeviceCanAccessPeer(&canAccess, 0, 1).
- If canAccess is true, enable peer access. Transfers use NVLink at 600 GB/s (on A100).
- If canAccess is false, fall back to host-mediated copy. Transfers use PCIe at 32 GB/s (PCIe 4.0).
For a 1 GB buffer:
- NVLink path: 1 GB / 600 GB/s = 1.67 ms
- PCIe host-mediated path: 1 GB / 32 GB/s (device to host) + 1 GB / 32 GB/s (host to device) = 62.5 ms
That is a 37x difference. Peer access is not an optimization. On multi-GPU systems with NVLink, it is a requirement for acceptable performance.
Data distribution strategies
Two fundamental strategies exist for distributing work across GPUs:
Data parallel
Every GPU holds a complete copy of the model (or computation). The input data is split across GPUs. Each GPU processes its partition independently, then the results are aggregated.
flowchart TD
    D["Input Batch (4N samples)"]
    D -->|"Partition"| S0["GPU 0: N samples"]
    D -->|"Partition"| S1["GPU 1: N samples"]
    D -->|"Partition"| S2["GPU 2: N samples"]
    D -->|"Partition"| S3["GPU 3: N samples"]
    S0 -->|"Forward + Backward"| G0["Gradients 0"]
    S1 -->|"Forward + Backward"| G1["Gradients 1"]
    S2 -->|"Forward + Backward"| G2["Gradients 2"]
    S3 -->|"Forward + Backward"| G3["Gradients 3"]
    G0 --> AR["AllReduce (Gradient Aggregation)"]
    G1 --> AR
    G2 --> AR
    G3 --> AR
    AR --> U0["GPU 0: Update Weights"]
    AR --> U1["GPU 1: Update Weights"]
    AR --> U2["GPU 2: Update Weights"]
    AR --> U3["GPU 3: Update Weights"]
    style D fill:#264653,stroke:#e0e0e0,color:#e0e0e0
    style S0 fill:#2a9d8f,stroke:#e0e0e0,color:#e0e0e0
    style S1 fill:#2a9d8f,stroke:#e0e0e0,color:#e0e0e0
    style S2 fill:#2a9d8f,stroke:#e0e0e0,color:#e0e0e0
    style S3 fill:#2a9d8f,stroke:#e0e0e0,color:#e0e0e0
    style G0 fill:#e9c46a,stroke:#264653,color:#264653
    style G1 fill:#e9c46a,stroke:#264653,color:#264653
    style G2 fill:#e9c46a,stroke:#264653,color:#264653
    style G3 fill:#e9c46a,stroke:#264653,color:#264653
    style AR fill:#e76f51,stroke:#e0e0e0,color:#e0e0e0
    style U0 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
    style U1 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
    style U2 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
    style U3 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
Data parallelism is the simpler strategy. It scales well when the model fits on a single GPU and the communication cost of aggregating results is small relative to compute time.
Model parallel
The model itself is split across GPUs. Each GPU holds a subset of layers (pipeline parallelism) or a subset of each layer’s parameters (tensor parallelism). The data flows through GPUs sequentially or is distributed across partial computations.
Model parallelism is necessary when the model does not fit in a single GPU’s memory. It introduces more complex communication patterns: activations must be sent between pipeline stages, and partial matrix products must be reduced across tensor-parallel ranks.
When to use which:
- Model fits on one GPU, want more throughput: data parallel
- Model does not fit on one GPU: model parallel (or a hybrid)
- Very large models (100B+ parameters): hybrid of data, tensor, and pipeline parallelism
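The decision rules above can be sketched as a small helper. This is illustrative, not a library API; the 100B-parameter threshold is the rule of thumb from the list, not a hard boundary:

```python
def choose_strategy(model_gb, gpu_mem_gb, n_params=None):
    """Pick a parallelism strategy from the rules of thumb above."""
    if model_gb <= gpu_mem_gb:
        return "data parallel"  # model fits: replicate it, split the batch
    if n_params is not None and n_params >= 100e9:
        return "hybrid (data + tensor + pipeline)"
    return "model parallel"  # too big for one device: split the model

print(choose_strategy(14, 80))                   # small model on an A100
print(choose_strategy(140, 80, n_params=70e9))   # 70B FP16 weights
print(choose_strategy(350, 80, n_params=175e9))  # 175B-class model
```

In practice the choice also depends on interconnect bandwidth and batch size, which is why the profiling advice later in this article matters.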
Example B: data-parallel partition with timing
Scenario: 1 billion floats (4 GB total) across 4 GPUs.
- Each GPU gets 250 million floats (1 GB per GPU).
- Compute time per GPU: 5 ms (each GPU processes its partition independently).
- Aggregation (AllReduce) time: 1 ms.
- Total multi-GPU time: 5 ms (compute) + 1 ms (aggregation) = 6 ms.
Single GPU baseline: 20 ms (all 1 billion floats on one device).
Speedup: 20 ms / 6 ms = 3.33x on 4 GPUs.
Why not 4x? The 1 ms aggregation overhead. With larger compute-to-communication ratios, scaling approaches linear. With smaller ratios, communication dominates and adding GPUs provides diminishing returns.
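Example B's arithmetic generalizes to a simple scaling model. A sketch (it assumes compute divides perfectly across GPUs and communication is a fixed cost per step):

```python
def multi_gpu_speedup(single_gpu_ms, n_gpus, comm_ms):
    """Speedup when compute partitions evenly and aggregation is a fixed cost."""
    multi_ms = single_gpu_ms / n_gpus + comm_ms
    return single_gpu_ms / multi_ms

# Example B: 20 ms of single-GPU compute, 4 GPUs, 1 ms AllReduce
print(f"{multi_gpu_speedup(20, 4, 1):.2f}x")
# Communication-heavy variant: same compute, 10 ms AllReduce
print(f"{multi_gpu_speedup(20, 4, 10):.2f}x")
```

The first case reproduces the 3.33x from Example B; the second shows how a fatter communication step collapses the speedup, which is why increasing the compute-to-communication ratio (e.g. larger batches) restores scaling.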
Simple data-parallel reduction across 2 GPUs
This example partitions an array across two GPUs, computes a partial sum on each, then combines the results on the host:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void partialSum(const float *input, float *output, int N) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (idx < N) ? input[idx] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) output[blockIdx.x] = sdata[0];
}

int main() {
    const int N = 1 << 24;  // 16M elements total
    const int half = N / 2;
    const int blockSize = 256;
    const int gridSize = (half + blockSize - 1) / blockSize;

    // Allocate and initialize host data
    float *h_data = new float[N];
    for (int i = 0; i < N; i++) h_data[i] = 1.0f;

    float *d_input[2], *d_output[2];
    float *h_partial[2];
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaMalloc(&d_input[dev], half * sizeof(float));
        cudaMalloc(&d_output[dev], gridSize * sizeof(float));
        h_partial[dev] = new float[gridSize];
        // Copy this GPU's half of the data
        cudaMemcpy(d_input[dev], h_data + dev * half,
                   half * sizeof(float), cudaMemcpyHostToDevice);
    }

    // Launch reduction on both GPUs
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        partialSum<<<gridSize, blockSize, blockSize * sizeof(float)>>>(
            d_input[dev], d_output[dev], half);
    }

    // Collect partial results
    float totalSum = 0.0f;
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaMemcpy(h_partial[dev], d_output[dev],
                   gridSize * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < gridSize; i++) totalSum += h_partial[dev][i];
    }
    printf("Total sum: %.0f (expected: %d)\n", totalSum, N);

    // Cleanup
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaFree(d_input[dev]);
        cudaFree(d_output[dev]);
        delete[] h_partial[dev];
    }
    delete[] h_data;
    return 0;
}
Key points: each cudaSetDevice call switches the active device before allocation, launch, and synchronization. The kernels run concurrently on separate GPUs since they are on different devices. The host collects and combines partial results after both GPUs finish.
Multi-GPU with CuPy (Python)
CuPy provides a clean Python API for multi-GPU programming using device contexts:
import cupy as cp
import numpy as np
# Discover GPUs
n_devices = cp.cuda.runtime.getDeviceCount()
print(f"Found {n_devices} GPU(s)")
N = 1 << 24 # 16M elements
h_data = np.ones(N, dtype=np.float32)
half = N // 2
partial_sums = []
for dev_id in range(min(n_devices, 2)):
    with cp.cuda.Device(dev_id):
        # Data is allocated on the active device
        d_slice = cp.asarray(h_data[dev_id * half : (dev_id + 1) * half])
        partial = float(cp.sum(d_slice))
        partial_sums.append(partial)
        print(f"  GPU {dev_id}: partial sum = {partial}")

total = sum(partial_sums)
print(f"Total sum: {total} (expected: {N})")
The cp.cuda.Device(dev_id) context manager sets the active device for all CuPy operations inside the block. Memory allocations, kernel launches, and synchronization all target that device. When the block exits, the previous device is restored.
For peer-to-peer transfers in CuPy:
with cp.cuda.Device(0):
    src = cp.arange(1000, dtype=cp.float32)

with cp.cuda.Device(1):
    dst = cp.empty(1000, dtype=cp.float32)
    # Copy from device 0 to device 1
    dst.data.copy_from_device(src.data, src.nbytes)
    print(f"First element on GPU 1: {dst[0]}")
Synchronization across GPUs: NCCL
The NVIDIA Collective Communications Library (NCCL, pronounced “nickel”) provides optimized multi-GPU and multi-node collective operations: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.
NCCL automatically detects the topology (NVLink, PCIe, InfiniBand) and selects the optimal communication algorithm. It uses ring or tree algorithms depending on the number of GPUs and the interconnect structure.
#include <nccl.h>
#include <cuda_runtime.h>

// Initialize NCCL for 4 GPUs
int nDevices = 4;
ncclComm_t comms[4];
int devs[4] = {0, 1, 2, 3};
cudaStream_t streams[4];
ncclCommInitAll(comms, nDevices, devs);
for (int i = 0; i < nDevices; i++) {
    cudaSetDevice(i);
    cudaStreamCreate(&streams[i]);
}

// AllReduce: sum gradients across all GPUs.
// sendbuff[i] and recvbuff[i] are per-GPU device allocations of `count`
// floats, made elsewhere. Each sendbuff holds that GPU's local gradients;
// after AllReduce, each recvbuff contains the sum across all GPUs.
ncclGroupStart();
for (int i = 0; i < nDevices; i++) {
    ncclAllReduce(sendbuff[i], recvbuff[i], count,
                  ncclFloat, ncclSum, comms[i], streams[i]);
}
ncclGroupEnd();

// Synchronize all streams
for (int i = 0; i < nDevices; i++) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
}

// Cleanup
for (int i = 0; i < nDevices; i++) {
    ncclCommDestroy(comms[i]);
}
ncclGroupStart and ncclGroupEnd batch multiple collective calls into a single operation, allowing NCCL to optimize the communication pattern across all devices simultaneously.
Why NCCL instead of manual peer copies:
- NCCL uses ring and tree algorithms: rings keep every link busy and are bandwidth-optimal, while trees reduce the number of latency-bound steps from O(N) to O(log N) for N GPUs.
- It automatically uses NVLink when available and falls back to PCIe or InfiniBand.
- It handles multi-node communication over RDMA networks.
- Writing a correct, performant AllReduce by hand across 8 GPUs with mixed NVLink/PCIe topology is error-prone. NCCL solves this once.
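To see why a ring needs no central hop, here is a pure-Python simulation of ring AllReduce over per-rank gradient vectors. It illustrates the algorithm NCCL implements over real interconnects; it is not NCCL itself, and the chunked schedule (reduce-scatter, then all-gather) is the standard textbook form:

```python
def ring_allreduce(ranks):
    """Simulate ring AllReduce in place: each rank's vector becomes the
    element-wise sum across all ranks. 2*(n-1) steps total, each moving
    one chunk from every rank to its right-hand neighbor."""
    n = len(ranks)
    chunk = len(ranks[0]) // n  # assume vector length divisible by n

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the complete
    # sum of chunk (r+1) % n. Sends are snapshotted first because all
    # transfers in a step happen simultaneously on real hardware.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, list(ranks[r][span((r - step) % n)]))
                 for r in range(n)]
        for r, c, vals in sends:
            dst = ranks[(r + 1) % n]
            for i, v in enumerate(vals):
                dst[c * chunk + i] += v

    # Phase 2, all-gather: circulate the completed chunks so every rank
    # ends up with every summed chunk.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, list(ranks[r][span((r + 1 - step) % n)]))
                 for r in range(n)]
        for r, c, vals in sends:
            ranks[(r + 1) % n][span(c)] = vals
    return ranks

data = [[float(r)] * 4 for r in range(4)]  # 4 ranks, 4-element "gradients"
out = ring_allreduce(data)
print(out[0])  # every rank ends with [6.0, 6.0, 6.0, 6.0]
```

Note the bandwidth property: each rank sends 2*(n-1)/n of its data in total, independent of n, which is why rings scale to large GPU counts without saturating any single link.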
When peer access helps vs. hurts
Peer access is not always beneficial:
Peer access helps when:
- ✓ GPUs are connected via NVLink (high bandwidth, low latency).
- ✓ The workload requires frequent, large transfers between specific GPU pairs.
- ✓ You can overlap peer transfers with compute using streams.
Peer access does not help (or hurts) when:
- ✗ GPUs are on separate PCIe root complexes with no direct path. The “peer” transfer routes through the CPU and is no faster than explicit host-mediated copy.
- ✗ The transfer volume is small. Latency dominates, and the bandwidth advantage of NVLink is irrelevant for sub-kilobyte messages.
- ✗ Enabling peer access on many GPU pairs consumes address space. On systems with 8+ GPUs, enabling all 56 peer pairs can cause address space pressure on 32-bit applications.
⚠ Always benchmark your specific topology. Use nvidia-smi topo -m to print the interconnect matrix and identify which GPU pairs have NVLink vs PCIe connections.
In practice
Start with data parallelism. It is simpler to implement, debug, and scale. Use torch.nn.parallel.DistributedDataParallel (preferred over the older torch.nn.DataParallel) for PyTorch, or manual partitioning with NCCL for CUDA C++. Only move to model parallelism when the model does not fit on a single GPU.
Use NCCL for all collective operations. Do not write your own AllReduce. NCCL handles topology detection, algorithm selection, and multi-node scaling. Link with -lnccl and let it manage communication.
Profile interconnect utilization. Nsight Systems shows GPU-to-GPU transfer events on the timeline. If communication dominates, increase the batch size (more compute per synchronization step), overlap communication with compute using streams, or switch to a hybrid parallelism strategy.
Check peer access at startup. Log the peer access matrix and the interconnect topology. Silent fallback to host-mediated transfers is a common source of unexplained slowdowns. Make it visible.
Pin host memory for multi-GPU transfers. When host-mediated copies are unavoidable, use cudaMallocHost or cudaHostAlloc for pinned memory. Pinned memory achieves full PCIe bandwidth. Pageable memory does not.
What comes next
This article covered multi-GPU programming: device enumeration, peer access, NVLink vs PCIe trade-offs, data and model parallelism, and NCCL for collective operations. You now know how to partition work across GPUs and how interconnect topology determines scaling efficiency.
The next step is understanding memory allocation strategies in depth. CUDA memory allocation covers cudaMalloc vs cudaMallocManaged vs memory pools, allocation overhead, fragmentation, and strategies for managing memory across multi-GPU systems. The allocation patterns you choose directly affect both single-GPU and multi-GPU performance.