Multi-GPU programming and peer access
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read the following:
- Unified virtual memory: one pointer for CPU and GPU, for cudaMallocManaged, page migration, prefetching, and memory hints across devices.
- CUDA streams and asynchronous execution, for overlapping transfers with compute and managing concurrent work.
You should be comfortable allocating device memory, launching kernels, and using streams. Everything here builds on those primitives but targets systems with two or more GPUs.
Why multiple GPUs
A single GPU has a ceiling. An A100 peaks at 312 TFLOPS (FP16 Tensor) and 80 GB of HBM2e. Training a 70B parameter model requires over 140 GB just for weights in FP16. That does not fit on one device. Even when the model fits, a single GPU processes one batch at a time. Four GPUs processing four batches deliver roughly four times the throughput, minus synchronization overhead.
Multi-GPU programming is not optional for large-scale workloads. It is the standard deployment model for training, inference serving, and any compute job that exceeds single-device memory or throughput limits.
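The memory arithmetic above is worth checking directly. A minimal sketch in Python (the helper names are illustrative, and the numbers are the FP16 figures quoted above):

```python
def model_memory_gb(n_params, bytes_per_param=2):
    """Memory needed just for the weights, in GB (FP16 = 2 bytes/param)."""
    return n_params * bytes_per_param / 1e9

def min_gpus(n_params, gpu_mem_gb, bytes_per_param=2):
    """Lower bound on GPUs needed to hold the weights alone (ceiling division).
    Ignores activations, optimizer state, and framework overhead."""
    need = model_memory_gb(n_params, bytes_per_param)
    return int(-(-need // gpu_mem_gb))

weights = model_memory_gb(70e9)  # 70B parameters in FP16
print(f"70B FP16 weights: {weights:.0f} GB")
print(f"A100-80GB GPUs needed (weights only): {min_gpus(70e9, 80)}")
```

Real deployments need far more than this lower bound: optimizer state and activations typically multiply the footprint several times over.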
Device enumeration
Every multi-GPU program starts with discovering what hardware is available. CUDA provides cudaGetDeviceCount and cudaGetDeviceProperties for this:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    printf("Found %d CUDA device(s)\n", deviceCount);

    for (int i = 0; i < deviceCount; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("  Device %d: %s\n", i, prop.name);
        printf("    Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("    Global memory: %.2f GB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("    SM count: %d\n", prop.multiProcessorCount);
    }
    return 0;
}
On a DGX A100 this prints eight devices. On a consumer workstation with two RTX 4090s, you get two. The device index (0 through N-1) is how CUDA identifies each GPU for every subsequent API call.
cudaSetDevice: directing traffic
By default, all CUDA calls target device 0. To work with a specific GPU, call cudaSetDevice before any allocation or kernel launch:
cudaSetDevice(0);
float *d_a0;
cudaMalloc(&d_a0, N * sizeof(float));
kernel<<<grid, block>>>(d_a0, N); // runs on GPU 0
cudaSetDevice(1);
float *d_a1;
cudaMalloc(&d_a1, N * sizeof(float));
kernel<<<grid, block>>>(d_a1, N); // runs on GPU 1
Every cudaMalloc, kernel launch, stream creation, and event creation is bound to the current device. Mixing up the active device is the most common multi-GPU bug. A kernel launched on device 0 cannot access memory allocated on device 1 unless peer access is enabled.
⚠ Always call cudaSetDevice before every group of operations targeting a specific GPU. Do not assume the current device persists across function boundaries.
Multi-GPU topology
GPUs connect to the system through different interconnects. The topology determines how fast data moves between devices:
graph TD
    CPU["CPU (Host Memory)"]
    PS["PCIe Switch"]
    NVB["NVLink Bridge"]
    CPU -->|"PCIe 4.0 32 GB/s"| PS
    PS -->|"PCIe 4.0 32 GB/s"| GPU0["GPU 0"]
    PS -->|"PCIe 4.0 32 GB/s"| GPU1["GPU 1"]
    GPU0 <-->|"NVLink 3.0 600 GB/s"| NVB
    GPU1 <-->|"NVLink 3.0 600 GB/s"| NVB
    NVB <-->|"Direct P2P"| GPU2["GPU 2"]
    NVB <-->|"Direct P2P"| GPU3["GPU 3"]
    style CPU fill:#264653,stroke:#e0e0e0,color:#e0e0e0
    style PS fill:#2a9d8f,stroke:#e0e0e0,color:#e0e0e0
    style NVB fill:#e76f51,stroke:#e0e0e0,color:#e0e0e0
    style GPU0 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
    style GPU1 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
    style GPU2 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
    style GPU3 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
The difference between NVLink and PCIe is not incremental. It is nearly 20x in bandwidth. This gap determines whether multi-GPU scaling is limited by compute or by communication.
GPU interconnect comparison
| Technology | Bandwidth (GB/s) | Latency | Typical Use | Hardware Requirement |
|---|---|---|---|---|
| PCIe 3.0 x16 | 16 | ~1 us | Consumer GPUs, older servers | Standard motherboard slot |
| PCIe 4.0 x16 | 32 | ~1 us | Workstations, mid-tier servers | PCIe 4.0 CPU and motherboard |
| PCIe 5.0 x16 | 64 | ~0.5 us | Next-gen servers | PCIe 5.0 CPU and motherboard |
| NVLink 2.0 | 300 | ~0.3 us | V100 GPU pairs | NVLink bridge connector |
| NVLink 3.0 | 600 | ~0.2 us | A100 DGX systems | NVSwitch or NVLink bridge |
| NVLink 4.0 | 900 | ~0.2 us | H100 DGX systems | NVSwitch 3.0 |
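The table's bandwidth numbers translate directly into idealized transfer times. A small estimator (hypothetical helper; it uses peak bandwidth and ignores latency and protocol overhead, so real transfers will be slower):

```python
# Peak bandwidths from the table above, in GB/s
BANDWIDTH_GBPS = {
    "pcie3": 16, "pcie4": 32, "pcie5": 64,
    "nvlink2": 300, "nvlink3": 600, "nvlink4": 900,
}

def transfer_ms(size_gb, link):
    """Idealized one-way transfer time in milliseconds at peak bandwidth."""
    return size_gb / BANDWIDTH_GBPS[link] * 1000

for link in ("pcie4", "nvlink3"):
    print(f"1 GB over {link}: {transfer_ms(1, link):.2f} ms")
```

Running this for a 1 GB buffer gives roughly 31 ms over PCIe 4.0 versus under 2 ms over NVLink 3.0, which is the gap the rest of this section keeps returning to.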
Peer access: direct GPU-to-GPU transfers
Without peer access, copying data from GPU 0 to GPU 1 goes through host memory: GPU 0 to host, then host to GPU 1. This doubles the transfer time and saturates PCIe in both directions.
Peer access allows one GPU to read from or write to another GPU’s memory directly, bypassing the host entirely. On NVLink-connected devices, this uses the full NVLink bandwidth. On PCIe, it uses the direct PCIe path between devices (if the topology supports it).
Checking and enabling peer access
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);

    printf("Peer access matrix:\n      ");
    for (int j = 0; j < deviceCount; j++) printf("GPU%d ", j);
    printf("\n");
    for (int i = 0; i < deviceCount; i++) {
        printf("GPU%d: ", i);
        for (int j = 0; j < deviceCount; j++) {
            if (i == j) {
                printf("  -  ");
                continue;
            }
            int canAccess;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("  %s  ", canAccess ? "✓" : "✗");
        }
        printf("\n");
    }

    // Enable peer access from device 0 to device 1
    int canAccess;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
        printf("Peer access enabled: GPU 0 -> GPU 1\n");
    } else {
        printf("Peer access not supported. Using host-mediated transfer.\n");
    }
    return 0;
}
cudaDeviceCanAccessPeer returns 1 if direct access is possible. cudaDeviceEnablePeerAccess must be called from the source device. The second argument is reserved and must be 0. Peer access is unidirectional: enabling GPU 0 to access GPU 1 does not automatically enable the reverse. Enable both directions if you need bidirectional transfers.
Peer-to-peer memory copy
With peer access enabled, cudaMemcpyPeer transfers data directly between devices:
// Allocate on each device
cudaSetDevice(0);
float *d_a0;
cudaMalloc(&d_a0, N * sizeof(float));
cudaSetDevice(1);
float *d_a1;
cudaMalloc(&d_a1, N * sizeof(float));
// Direct GPU 0 -> GPU 1 copy (no host staging)
cudaMemcpyPeer(d_a1, 1, d_a0, 0, N * sizeof(float));
You can also use cudaMemcpyPeerAsync with a stream for non-blocking transfers. This is essential for overlapping communication with compute in multi-GPU pipelines.
If peer access is not available, CUDA silently falls back to host-mediated transfer. The copy still works, but at PCIe bandwidth instead of NVLink bandwidth. Always check cudaDeviceCanAccessPeer and log which path you are taking.
Example A: peer access check with bandwidth comparison
Consider device 0 trying to access device 1’s memory. The logic:
- Call cudaDeviceCanAccessPeer(&canAccess, 0, 1).
- If canAccess is true, enable peer access. Transfers use NVLink at 600 GB/s (on A100).
- If canAccess is false, fall back to host-mediated copy. Transfers use PCIe at 32 GB/s (PCIe 4.0).
For a 1 GB buffer:
- NVLink path: 1 GB / 600 GB/s = 1.67 ms
- PCIe host-mediated path: 1 GB / 32 GB/s (device to host) + 1 GB / 32 GB/s (host to device) = 62.5 ms
That is a 37x difference. Peer access is not an optimization. On multi-GPU systems with NVLink, it is a requirement for acceptable performance.
Data distribution strategies
Two fundamental strategies exist for distributing work across GPUs:
Data parallel
Every GPU holds a complete copy of the model (or computation). The input data is split across GPUs. Each GPU processes its partition independently, then the results are aggregated.
flowchart TD
    D["Input Batch (4N samples)"]
    D -->|"Partition"| S0["GPU 0: N samples"]
    D -->|"Partition"| S1["GPU 1: N samples"]
    D -->|"Partition"| S2["GPU 2: N samples"]
    D -->|"Partition"| S3["GPU 3: N samples"]
    S0 -->|"Forward + Backward"| G0["Gradients 0"]
    S1 -->|"Forward + Backward"| G1["Gradients 1"]
    S2 -->|"Forward + Backward"| G2["Gradients 2"]
    S3 -->|"Forward + Backward"| G3["Gradients 3"]
    G0 --> AR["AllReduce (Gradient Aggregation)"]
    G1 --> AR
    G2 --> AR
    G3 --> AR
    AR --> U0["GPU 0: Update Weights"]
    AR --> U1["GPU 1: Update Weights"]
    AR --> U2["GPU 2: Update Weights"]
    AR --> U3["GPU 3: Update Weights"]
    style D fill:#264653,stroke:#e0e0e0,color:#e0e0e0
    style S0 fill:#2a9d8f,stroke:#e0e0e0,color:#e0e0e0
    style S1 fill:#2a9d8f,stroke:#e0e0e0,color:#e0e0e0
    style S2 fill:#2a9d8f,stroke:#e0e0e0,color:#e0e0e0
    style S3 fill:#2a9d8f,stroke:#e0e0e0,color:#e0e0e0
    style G0 fill:#e9c46a,stroke:#264653,color:#264653
    style G1 fill:#e9c46a,stroke:#264653,color:#264653
    style G2 fill:#e9c46a,stroke:#264653,color:#264653
    style G3 fill:#e9c46a,stroke:#264653,color:#264653
    style AR fill:#e76f51,stroke:#e0e0e0,color:#e0e0e0
    style U0 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
    style U1 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
    style U2 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
    style U3 fill:#6a2d4f,stroke:#e0e0e0,color:#e0e0e0
Data parallelism is the simpler strategy. It scales well when the model fits on a single GPU and the communication cost of aggregating results is small relative to compute time.
Model parallel
The model itself is split across GPUs. Each GPU holds a subset of layers (pipeline parallelism) or a subset of each layer’s parameters (tensor parallelism). The data flows through GPUs sequentially or is distributed across partial computations.
Model parallelism is necessary when the model does not fit in a single GPU’s memory. It introduces more complex communication patterns: activations must be sent between pipeline stages, and partial matrix products must be reduced across tensor-parallel ranks.
When to use which:
- Model fits on one GPU, want more throughput: data parallel
- Model does not fit on one GPU: model parallel (or a hybrid)
- Very large models (100B+ parameters): hybrid of data, tensor, and pipeline parallelism
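The decision rules above can be sketched as a small helper. This is illustrative, not a library API; the 100B-parameter threshold is the rule of thumb from the list, not a hard boundary:

```python
def choose_strategy(model_gb, gpu_mem_gb, n_params=None):
    """Pick a parallelism strategy from the rules of thumb above."""
    if model_gb <= gpu_mem_gb:
        return "data parallel"  # model fits: replicate it, split the batch
    if n_params is not None and n_params >= 100e9:
        return "hybrid (data + tensor + pipeline)"
    return "model parallel"  # too big for one device: split the model

print(choose_strategy(14, 80))                   # small model on an A100
print(choose_strategy(140, 80, n_params=70e9))   # 70B FP16 weights
print(choose_strategy(350, 80, n_params=175e9))  # 175B-class model
```

In practice the choice also depends on interconnect bandwidth and batch size, which is why the profiling advice later in this article matters.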
Example B: data-parallel partition with timing
Scenario: 1 billion floats (4 GB total) across 4 GPUs.
- Each GPU gets 250 million floats (1 GB per GPU).
- Compute time per GPU: 5 ms (each GPU processes its partition independently).
- Aggregation (AllReduce) time: 1 ms.
- Total multi-GPU time: 5 ms (compute) + 1 ms (aggregation) = 6 ms.
Single GPU baseline: 20 ms (all 1 billion floats on one device).
Speedup: 20 ms / 6 ms = 3.33x on 4 GPUs.
Why not 4x? The 1 ms aggregation overhead. With larger compute-to-communication ratios, scaling approaches linear. With smaller ratios, communication dominates and adding GPUs provides diminishing returns.
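Example B's arithmetic generalizes to a simple scaling model. A sketch (it assumes compute divides perfectly across GPUs and communication is a fixed cost per step):

```python
def multi_gpu_speedup(single_gpu_ms, n_gpus, comm_ms):
    """Speedup when compute partitions evenly and aggregation is a fixed cost."""
    multi_ms = single_gpu_ms / n_gpus + comm_ms
    return single_gpu_ms / multi_ms

# Example B: 20 ms of single-GPU compute, 4 GPUs, 1 ms AllReduce
print(f"{multi_gpu_speedup(20, 4, 1):.2f}x")
# Communication-heavy variant: same compute, 10 ms AllReduce
print(f"{multi_gpu_speedup(20, 4, 10):.2f}x")
```

The first case reproduces the 3.33x from Example B; the second shows how a fatter communication step collapses the speedup, which is why increasing the compute-to-communication ratio (e.g. larger batches) restores scaling.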
Simple data-parallel reduction across 2 GPUs
This example partitions an array across two GPUs, computes a partial sum on each, then combines the results on the host:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void partialSum(const float *input, float *output, int N) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (idx < N) ? input[idx] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) output[blockIdx.x] = sdata[0];
}

int main() {
    const int N = 1 << 24;  // 16M elements total
    const int half = N / 2;
    const int blockSize = 256;
    const int gridSize = (half + blockSize - 1) / blockSize;

    // Allocate and initialize host data
    float *h_data = new float[N];
    for (int i = 0; i < N; i++) h_data[i] = 1.0f;

    float *d_input[2], *d_output[2];
    float *h_partial[2];
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaMalloc(&d_input[dev], half * sizeof(float));
        cudaMalloc(&d_output[dev], gridSize * sizeof(float));
        h_partial[dev] = new float[gridSize];
        // Copy this GPU's half of the data
        cudaMemcpy(d_input[dev], h_data + dev * half,
                   half * sizeof(float), cudaMemcpyHostToDevice);
    }

    // Launch reduction on both GPUs
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        partialSum<<<gridSize, blockSize, blockSize * sizeof(float)>>>(
            d_input[dev], d_output[dev], half);
    }

    // Collect partial results
    float totalSum = 0.0f;
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaMemcpy(h_partial[dev], d_output[dev],
                   gridSize * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < gridSize; i++) totalSum += h_partial[dev][i];
    }
    printf("Total sum: %.0f (expected: %d)\n", totalSum, N);

    // Cleanup
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaFree(d_input[dev]);
        cudaFree(d_output[dev]);
        delete[] h_partial[dev];
    }
    delete[] h_data;
    return 0;
}
Key points: each cudaSetDevice call switches the active device before allocation, launch, and synchronization. The kernels run concurrently on separate GPUs since they are on different devices. The host collects and combines partial results after both GPUs finish.
Multi-GPU with CuPy (Python)
CuPy provides a clean Python API for multi-GPU programming using device contexts:
import cupy as cp
import numpy as np
# Discover GPUs
n_devices = cp.cuda.runtime.getDeviceCount()
print(f"Found {n_devices} GPU(s)")
N = 1 << 24 # 16M elements
h_data = np.ones(N, dtype=np.float32)
half = N // 2
partial_sums = []
for dev_id in range(min(n_devices, 2)):
    with cp.cuda.Device(dev_id):
        # Data is allocated on the active device
        d_slice = cp.asarray(h_data[dev_id * half : (dev_id + 1) * half])
        partial = float(cp.sum(d_slice))
        partial_sums.append(partial)
        print(f"  GPU {dev_id}: partial sum = {partial}")

total = sum(partial_sums)
print(f"Total sum: {total} (expected: {N})")
The cp.cuda.Device(dev_id) context manager sets the active device for all CuPy operations inside the block. Memory allocations, kernel launches, and synchronization all target that device. When the block exits, the previous device is restored.
For peer-to-peer transfers in CuPy:
with cp.cuda.Device(0):
    src = cp.arange(1000, dtype=cp.float32)

with cp.cuda.Device(1):
    dst = cp.empty(1000, dtype=cp.float32)
    # Copy from device 0 to device 1
    dst.data.copy_from_device(src.data, src.nbytes)
    print(f"First element on GPU 1: {dst[0]}")
Synchronization across GPUs: NCCL
The NVIDIA Collective Communications Library (NCCL, pronounced “nickel”) provides optimized multi-GPU and multi-node collective operations: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.
NCCL automatically detects the topology (NVLink, PCIe, InfiniBand) and selects the optimal communication algorithm. It uses ring or tree algorithms depending on the number of GPUs and the interconnect structure.
#include <nccl.h>
#include <cuda_runtime.h>

// Initialize NCCL for 4 GPUs
int nDevices = 4;
ncclComm_t comms[4];
int devs[4] = {0, 1, 2, 3};
cudaStream_t streams[4];
ncclCommInitAll(comms, nDevices, devs);
for (int i = 0; i < nDevices; i++) {
    cudaSetDevice(i);
    cudaStreamCreate(&streams[i]);
}

// AllReduce: sum gradients across all GPUs.
// sendbuff[i] and recvbuff[i] are per-GPU device allocations of `count`
// floats, made elsewhere. Each sendbuff holds that GPU's local gradients;
// after AllReduce, each recvbuff contains the sum across all GPUs.
ncclGroupStart();
for (int i = 0; i < nDevices; i++) {
    ncclAllReduce(sendbuff[i], recvbuff[i], count,
                  ncclFloat, ncclSum, comms[i], streams[i]);
}
ncclGroupEnd();

// Synchronize all streams
for (int i = 0; i < nDevices; i++) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
}

// Cleanup
for (int i = 0; i < nDevices; i++) {
    ncclCommDestroy(comms[i]);
}
ncclGroupStart and ncclGroupEnd batch multiple collective calls into a single operation, allowing NCCL to optimize the communication pattern across all devices simultaneously.
Why NCCL instead of manual peer copies:
- NCCL uses ring and tree algorithms: rings keep every link busy and are bandwidth-optimal, while trees reduce the number of latency-bound steps from O(N) to O(log N) for N GPUs.
- It automatically uses NVLink when available and falls back to PCIe or InfiniBand.
- It handles multi-node communication over RDMA networks.
- Writing a correct, performant AllReduce by hand across 8 GPUs with mixed NVLink/PCIe topology is error-prone. NCCL solves this once.
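To see why a ring needs no central hop, here is a pure-Python simulation of ring AllReduce over per-rank gradient vectors. It illustrates the algorithm NCCL implements over real interconnects; it is not NCCL itself, and the chunked schedule (reduce-scatter, then all-gather) is the standard textbook form:

```python
def ring_allreduce(ranks):
    """Simulate ring AllReduce in place: each rank's vector becomes the
    element-wise sum across all ranks. 2*(n-1) steps total, each moving
    one chunk from every rank to its right-hand neighbor."""
    n = len(ranks)
    chunk = len(ranks[0]) // n  # assume vector length divisible by n

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the complete
    # sum of chunk (r+1) % n. Sends are snapshotted first because all
    # transfers in a step happen simultaneously on real hardware.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, list(ranks[r][span((r - step) % n)]))
                 for r in range(n)]
        for r, c, vals in sends:
            dst = ranks[(r + 1) % n]
            for i, v in enumerate(vals):
                dst[c * chunk + i] += v

    # Phase 2, all-gather: circulate the completed chunks so every rank
    # ends up with every summed chunk.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, list(ranks[r][span((r + 1 - step) % n)]))
                 for r in range(n)]
        for r, c, vals in sends:
            ranks[(r + 1) % n][span(c)] = vals
    return ranks

data = [[float(r)] * 4 for r in range(4)]  # 4 ranks, 4-element "gradients"
out = ring_allreduce(data)
print(out[0])  # every rank ends with [6.0, 6.0, 6.0, 6.0]
```

Note the bandwidth property: each rank sends 2*(n-1)/n of its data in total, independent of n, which is why rings scale to large GPU counts without saturating any single link.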
When peer access helps vs. hurts
Peer access is not always beneficial:
Peer access helps when:
- ✓ GPUs are connected via NVLink (high bandwidth, low latency).
- ✓ The workload requires frequent, large transfers between specific GPU pairs.
- ✓ You can overlap peer transfers with compute using streams.
Peer access does not help (or hurts) when:
- ✗ GPUs are on separate PCIe root complexes with no direct path. The “peer” transfer routes through the CPU and is no faster than explicit host-mediated copy.
- ✗ The transfer volume is small. Latency dominates, and the bandwidth advantage of NVLink is irrelevant for sub-kilobyte messages.
- ✗ Enabling peer access on many GPU pairs consumes address space. On systems with 8+ GPUs, enabling all 56 peer pairs can cause address space pressure on 32-bit applications.
⚠ Always benchmark your specific topology. Use nvidia-smi topo -m to print the interconnect matrix and identify which GPU pairs have NVLink vs PCIe connections.
In practice
Start with data parallelism. It is simpler to implement, debug, and scale. Use torch.nn.parallel.DistributedDataParallel (preferred over the older torch.nn.DataParallel) for PyTorch, or manual partitioning with NCCL for CUDA C++. Only move to model parallelism when the model does not fit on a single GPU.
Use NCCL for all collective operations. Do not write your own AllReduce. NCCL handles topology detection, algorithm selection, and multi-node scaling. Link with -lnccl and let it manage communication.
Profile interconnect utilization. Nsight Systems shows GPU-to-GPU transfer events on the timeline. If communication dominates, increase the batch size (more compute per synchronization step), overlap communication with compute using streams, or switch to a hybrid parallelism strategy.
Check peer access at startup. Log the peer access matrix and the interconnect topology. Silent fallback to host-mediated transfers is a common source of unexplained slowdowns. Make it visible.
Pin host memory for multi-GPU transfers. When host-mediated copies are unavoidable, use cudaMallocHost or cudaHostAlloc for pinned memory. Pinned memory achieves full PCIe bandwidth. Pageable memory does not.
What comes next
This article covered multi-GPU programming: device enumeration, peer access, NVLink vs PCIe trade-offs, data and model parallelism, and NCCL for collective operations. You now know how to partition work across GPUs and how interconnect topology determines scaling efficiency.
The next step is understanding memory allocation strategies in depth. CUDA memory allocation covers cudaMalloc vs cudaMallocManaged vs memory pools, allocation overhead, fragmentation, and strategies for managing memory across multi-GPU systems. The allocation patterns you choose directly affect both single-GPU and multi-GPU performance.