
Unified virtual memory: one pointer for CPU and GPU

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

You should be comfortable writing kernels that operate on device memory allocated with cudaMalloc and transferred with cudaMemcpy. This entire article is about eliminating that explicit transfer step and understanding what happens under the hood when you do.

The problem: two address spaces, two copies of everything

Traditional CUDA programming forces you to maintain parallel worlds. The CPU has its memory (allocated with malloc or new), the GPU has its memory (allocated with cudaMalloc), and you are responsible for shuttling data between them with cudaMemcpy.

// The explicit-copy workflow everyone learns first
float *h_a, *d_a;
h_a = (float*)malloc(N * sizeof(float));
cudaMalloc(&d_a, N * sizeof(float));

// Initialize on host
for (int i = 0; i < N; i++) h_a[i] = i;

// Copy host -> device
cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);

// Launch kernel
add_kernel<<<blocks, threads>>>(d_a, N);

// Copy device -> host
cudaMemcpy(h_a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);

// Clean up both
cudaFree(d_a);
free(h_a);

This works, but it has real costs. You maintain two pointers for every buffer. You choose the direction of every transfer. You synchronize to make sure the copy finished before reading the result. For a 10-line kernel, you write 20 lines of memory management. Bugs hide in the copy direction enum, in forgotten transfers, and in stale pointers.
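One common mitigation for these silent failure modes is to wrap every runtime call in an error-checking macro, so a wrong direction enum or a failed allocation surfaces immediately instead of producing corrupt results later. A minimal sketch (not part of the example above):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line context if any CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: every allocation and copy is now checked at the call site.
// CUDA_CHECK(cudaMalloc(&d_a, N * sizeof(float)));
// CUDA_CHECK(cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice));
```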

What unified memory actually is

Unified memory creates a single virtual address space that both the CPU and GPU can dereference. You call cudaMallocManaged instead of cudaMalloc, and the returned pointer works on both processors.

float *data;
cudaMallocManaged(&data, N * sizeof(float));

// Write on CPU - same pointer
for (int i = 0; i < N; i++) data[i] = i;

// Read on GPU - same pointer
add_kernel<<<blocks, threads>>>(data, N);
cudaDeviceSynchronize();

// Read result on CPU - same pointer
printf("data[0] = %f\n", data[0]);

cudaFree(data);

No cudaMemcpy. No host/device pointer pairs. The CUDA runtime manages the physical location of data transparently.

graph TB
  subgraph UVA["Unified Virtual Address Space"]
      direction TB
      PAGE1["Page 0x7f00...000
4 KB"]
      PAGE2["Page 0x7f00...001
4 KB"]
      PAGE3["Page 0x7f00...002
4 KB"]
      PAGE4["Page 0x7f00...003
4 KB"]
  end

  subgraph CPU_MEM["CPU Physical Memory"]
      CPAG1["Page copy"]
      CPAG3["Page copy"]
  end

  subgraph GPU_MEM["GPU Physical Memory (HBM)"]
      GPAG2["Page copy"]
      GPAG4["Page copy"]
  end

  PAGE1 -. "currently resides" .-> CPAG1
  PAGE2 -. "migrated on access" .-> GPAG2
  PAGE3 -. "currently resides" .-> CPAG3
  PAGE4 -. "migrated on access" .-> GPAG4

  RUNTIME["CUDA Runtime
Handles page faults
Migrates pages on demand"]

  RUNTIME --> CPU_MEM
  RUNTIME --> GPU_MEM

  style UVA fill:#1a1a2e,stroke:#16213e,color:#e0e0e0
  style CPU_MEM fill:#2d6a4f,stroke:#1b4332,color:#fff
  style GPU_MEM fill:#6a2d4f,stroke:#4a1d35,color:#fff
  style RUNTIME fill:#e76f51,stroke:#c45e43,color:#fff

The key insight: a page of managed memory exists in exactly one physical location at any moment. When the other processor touches it, the runtime migrates the page over PCIe or NVLink. This migration is what makes unified memory both powerful and potentially expensive.

How it works: demand paging and page faults

On Pascal (sm_60) and later architectures, unified memory uses true demand paging with hardware page fault support. Here is the sequence when a GPU kernel touches a page that currently lives in CPU memory:

  1. The GPU’s MMU detects the page is not mapped in GPU page tables.
  2. A page fault is raised and sent to the CUDA driver.
  3. The driver migrates the 4 KB (or 64 KB on some configurations) page from CPU memory to GPU memory over PCIe/NVLink.
  4. The GPU page table is updated.
  5. The faulting warp resumes execution.

Each fault costs roughly 10 to 20 microseconds. That sounds small until you consider scale. One gigabyte of data at 4 KB pages is 262,144 pages. If every page faults individually, the overhead is 2.6 to 5.2 seconds just for faults, not counting the actual data transfer. Compare that to a bulk cudaMemcpy of 1 GB over PCIe Gen4 x16 at ~25 GB/s: about 40 milliseconds.

This is the central tradeoff: unified memory trades convenience for potentially higher transfer latency when pages fault on demand instead of being moved in bulk.

Explicit copy vs. unified memory: the API comparison

| Aspect | cudaMalloc + cudaMemcpy | cudaMallocManaged | cudaMallocManaged + prefetch |
|---|---|---|---|
| Allocation API | cudaMalloc | cudaMallocManaged | cudaMallocManaged |
| Requires explicit copy | ✓ Yes | ✗ No | ✗ No |
| Accessible from CPU | ✗ Not the device ptr | ✓ Yes | ✓ Yes |
| Accessible from GPU | ✓ Yes | ✓ Yes | ✓ Yes |
| Peak transfer perf | ✓ Best (bulk DMA) | ⚠ Worst (page faults) | ✓ Near-best (bulk async) |
| Ease of use | ⚠ Manual | ✓ Automatic | ✓ Automatic + hint |
| Oversubscription | ✗ No | ✓ Yes (Pascal+) | ✓ Yes (Pascal+) |

Full example: vector addition three ways

Way 1: explicit cudaMemcpy

#include <cstdio>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void explicit_copy(int N) {
    size_t bytes = N * sizeof(float);
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);
    float *d_a, *d_b, *d_c;

    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    for (int i = 0; i < N; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vec_add<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("explicit: c[0] = %f\n", h_c[0]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
}

Way 2: unified memory (no prefetch)

void unified_no_prefetch(int N) {
    size_t bytes = N * sizeof(float);
    float *a, *b, *c;

    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);

    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    vec_add<<<(N + 255) / 256, 256>>>(a, b, c, N);
    cudaDeviceSynchronize();

    printf("unified (no prefetch): c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
}

Simpler code, but every page of a and b triggers a page fault when the GPU first reads it.

Way 3: unified memory with prefetch

void unified_with_prefetch(int N) {
    size_t bytes = N * sizeof(float);
    float *a, *b, *c;
    int device = 0;

    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);

    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // Prefetch all three buffers to GPU before kernel launch
    cudaMemPrefetchAsync(a, bytes, device);
    cudaMemPrefetchAsync(b, bytes, device);
    cudaMemPrefetchAsync(c, bytes, device);

    vec_add<<<(N + 255) / 256, 256>>>(a, b, c, N);
    cudaDeviceSynchronize();

    // Prefetch result back to CPU before host reads it
    cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId);
    cudaDeviceSynchronize();

    printf("unified (prefetch): c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
}

cudaMemPrefetchAsync triggers a bulk migration, similar to cudaMemcpy but without needing separate host and device pointers. The runtime moves pages in large batches, avoiding per-page fault overhead.

Prefetching: getting bulk transfer performance back

cudaMemPrefetchAsync is the tool that closes the performance gap. It takes a managed pointer, a byte count, and a destination device ID. Passing cudaCpuDeviceId moves pages to host memory.

// Prefetch to GPU 0
cudaMemPrefetchAsync(ptr, size, 0, stream);

// Prefetch back to CPU
cudaMemPrefetchAsync(ptr, size, cudaCpuDeviceId, stream);

The function is asynchronous with respect to the host. It can be issued into a stream, so prefetches and kernel launches overlap naturally. This is the recommended pattern for production code that uses managed memory: allocate with cudaMallocManaged, prefetch before compute, and let the runtime handle the mechanics.
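A sketch of that pattern, with the prefetches and the kernel issued into the same non-default stream so that stream ordering guarantees the pages arrive before the kernel runs (buffer names a, b, c and the vec_add kernel are taken from the examples above):

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// Prefetch input buffers to GPU 0. These calls return immediately on
// the host and execute in stream order on the device.
cudaMemPrefetchAsync(a, bytes, 0, stream);
cudaMemPrefetchAsync(b, bytes, 0, stream);
cudaMemPrefetchAsync(c, bytes, 0, stream);

// Launched in the same stream, the kernel waits for the prefetches.
vec_add<<<(N + 255) / 256, 256, 0, stream>>>(a, b, c, N);

// Move the result back before the host touches it.
cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId, stream);
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);
```

Because everything lives in one stream, no explicit events are needed; the stream's FIFO ordering provides the synchronization.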

Memory hints with cudaMemAdvise

For workloads where data is read by one processor and written by another, or read by multiple GPUs, cudaMemAdvise provides finer control.

// Hint: GPU 0 will mostly read this data
cudaMemAdvise(ptr, size, cudaMemAdviseSetReadMostly, 0);

// Hint: CPU is the preferred location for this data
cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

// Hint: GPU 0 will access this data (create mapping, reduce faults)
cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, 0);
| Advice flag | Effect |
|---|---|
| cudaMemAdviseSetReadMostly | Runtime may create read-only copies on multiple processors. Writes invalidate all copies and trigger migration. |
| cudaMemAdviseSetPreferredLocation | Pages migrate to the preferred location when evicted. Does not prevent access from other processors. |
| cudaMemAdviseSetAccessedBy | Creates a direct mapping so the specified processor can access pages without faulting, even if pages live elsewhere. Requires hardware support (NVLink or PCIe ATS). |

The SetReadMostly hint is particularly useful for lookup tables, model weights during inference, and any buffer that the GPU reads but never writes. The runtime duplicates the pages instead of migrating them, so both CPU and GPU can read without faults.
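As a sketch (the weights buffer, its size, and the infer_kernel launch are illustrative, not from the examples above): advise the buffer as read-mostly, then prefetch, which under this advice duplicates the pages onto the GPU rather than migrating them away from the CPU.

```cuda
// weights: a managed buffer that the GPU only ever reads.
cudaMemAdvise(weights, weight_bytes, cudaMemAdviseSetReadMostly, 0);

// With ReadMostly set, this prefetch creates a read-only copy on
// GPU 0; the CPU's copy stays resident and readable without faults.
cudaMemPrefetchAsync(weights, weight_bytes, 0);

// Both processors now read at local memory speed.
infer_kernel<<<grid, block>>>(weights, activations);
```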

Python with CuPy: transparent managed memory

CuPy abstracts GPU memory management entirely. Transfers happen automatically when you move data between NumPy and CuPy arrays.

import cupy as cp
import numpy as np
import time

N = 1 << 24  # ~16M elements

# NumPy array on CPU
a_cpu = np.ones(N, dtype=np.float32)
b_cpu = np.full(N, 2.0, dtype=np.float32)

# Transfer to GPU happens implicitly
a_gpu = cp.asarray(a_cpu)
b_gpu = cp.asarray(b_cpu)

# Compute on GPU
start = time.perf_counter()
c_gpu = a_gpu + b_gpu
cp.cuda.Stream.null.synchronize()
gpu_time = time.perf_counter() - start

# Transfer back happens implicitly
c_cpu = cp.asnumpy(c_gpu)

print(f"CuPy result: c[0] = {c_cpu[0]}")
print(f"GPU compute time: {gpu_time * 1000:.2f} ms")

Under the hood, CuPy uses cudaMalloc (not managed memory) and explicit copies, so it gets bulk transfer performance. The API hides the complexity while keeping the transfers efficient.

Performance comparison

The following chart shows typical wall-clock time for a 256 MB vector addition on a PCIe Gen4 GPU. Your numbers will vary with hardware, driver version, and data size.

The unified-without-prefetch case is roughly 7x slower than explicit copies. Prefetching brings it within 15% of the explicit approach. CuPy lands close to explicit because it uses bulk cudaMemcpy internally.

Worked examples

Example A: page fault overhead calculation

Problem. You allocate 1 GB of managed memory and access it for the first time from a GPU kernel. Each page is 4 KB. Each page fault takes 10 microseconds to service. How many faults occur and what is the total fault overhead?

Solution.

Number of pages:

$$\frac{1\ \text{GB}}{4\ \text{KB}} = \frac{1{,}073{,}741{,}824}{4{,}096} = 262{,}144\ \text{pages}$$

Total fault overhead:

$$262{,}144 \times 10\ \mu\text{s} = 2{,}621{,}440\ \mu\text{s} \approx 2.62\ \text{seconds}$$

Compare this to a bulk cudaMemcpy of 1 GB over PCIe Gen4 x16 (~25 GB/s):

$$\frac{1\ \text{GB}}{25\ \text{GB/s}} \approx 0.04\ \text{seconds} = 40\ \text{ms}$$

The fault-based migration is 65x slower than a bulk transfer. This is why prefetching matters. In practice, the GPU runtime batches nearby faults, so the real overhead is lower than 2.62 seconds, but it is still substantially worse than a bulk copy.
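The arithmetic above can be reproduced with a short script. The 10 µs fault cost and ~25 GB/s bandwidth are the assumed figures from the text, not measured values:

```python
# Check the page-fault overhead arithmetic from Example A.
total_bytes = 1 << 30         # 1 GB = 2^30 bytes = 1,073,741,824
page_bytes = 4 << 10          # 4 KB page
fault_us = 10                 # assumed cost to service one fault
pcie_bytes_per_s = 25e9       # assumed PCIe Gen4 x16 effective bandwidth

pages = total_bytes // page_bytes             # number of 4 KB pages
fault_overhead_s = pages * fault_us / 1e6     # time lost to faults alone
bulk_copy_s = total_bytes / pcie_bytes_per_s  # one bulk DMA of the same data

# pages=262144, faults=2.62s, bulk=43ms
print(f"pages={pages}, faults={fault_overhead_s:.2f}s, "
      f"bulk={bulk_copy_s * 1000:.0f}ms")
```

Note the bulk-copy figure lands at ~43 ms here because the script uses 2^30 bytes for 1 GB; the text's 40 ms uses the rounder decimal convention. The conclusion is unchanged either way.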

Example B: prefetch timeline comparison

Without prefetch:

Time -->
CPU: [init data]---idle--------------------------[read result]
GPU:               [fault][migrate][compute][fault][migrate]...
                   |<---- stalls dominate -------->|

The kernel launches, immediately faults on the first page, stalls while the page migrates, computes a few elements, faults on the next page, and so on. Warps spend most of their time stalled on page faults rather than doing arithmetic.

With prefetch:

Time -->
CPU: [init data][prefetch issued]---[read result]
GPU:            [bulk migration]====[compute]====
                |<-- overlap -->|   |<- no faults ->|

The prefetch triggers a bulk DMA transfer that moves all pages to GPU memory before the kernel starts. The kernel runs at full speed with zero page faults. The prefetch can even overlap with other work if you issue it into a non-default stream.

The performance difference is not subtle. For data-intensive kernels, prefetching can improve end-to-end time by 5x to 10x.

When unified memory costs you

Unified memory is not free. Here are the cases where it hurts:

Frequent CPU/GPU ping-pong. If you write data on the CPU, compute on the GPU, read back on the CPU, then compute again on the GPU, every transition triggers page migrations. Each round-trip moves pages back and forth over PCIe. Prefetching helps, but the fundamental bottleneck is the PCIe bandwidth.

Fine-grained sharing. If a CPU thread and a GPU kernel both need the same page at the same time, the page thrashes between processors. This is worse than explicit copies because at least with explicit copies you control exactly when the transfer happens.

Pre-Pascal hardware. GPUs before Pascal (sm_60) do not support page faults on the GPU. On Kepler and Maxwell, cudaMallocManaged works but with severe restrictions: the entire managed allocation is migrated to the GPU before any kernel launch and migrated back after. This is often slower than explicit copies because the runtime cannot be selective about which pages to move.

Oversubscription overhead. Unified memory lets you allocate more managed memory than the GPU has physical memory. Pages are evicted and re-fetched on demand. This works but can cause severe performance degradation if the working set significantly exceeds GPU memory.

Profiling complexity. Page faults are non-deterministic. The first run of a kernel is slow (cold pages), subsequent runs may be fast (pages already resident). This makes benchmarking tricky. Always use cudaMemPrefetchAsync in benchmarks to get reproducible numbers.

Pascal+ requirement

Hardware page faulting is the mechanism that makes unified memory practical. It requires:

  • GPU architecture: Pascal (sm_60) or later. This means GTX 1000 series, Tesla P100, and all subsequent generations.
  • Driver: CUDA 8.0+ for basic support. CUDA 10.0+ for cudaMemAdvise hints. CUDA 11.0+ for full oversubscription and concurrent access.
  • OS: 64-bit Linux or Windows. Linux gets the most complete support, including oversubscription and AccessedBy hints over NVLink.
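You can check these capabilities at runtime with device attributes rather than guessing from the architecture name. A minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int concurrent = 0, pageable = 0;

    // Non-zero on Pascal+ parts that support fault-driven migration
    // with concurrent CPU/GPU access to managed memory.
    cudaDeviceGetAttribute(&concurrent,
                           cudaDevAttrConcurrentManagedAccess, 0);

    // Non-zero if the GPU can also access ordinary pageable host
    // memory directly (e.g. via ATS or on integrated parts).
    cudaDeviceGetAttribute(&pageable,
                           cudaDevAttrPageableMemoryAccess, 0);

    printf("concurrentManagedAccess=%d pageableMemoryAccess=%d\n",
           concurrent, pageable);
    return 0;
}
```

If concurrentManagedAccess reports 0, assume the pre-Pascal fallback behavior described below.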

On pre-Pascal GPUs, cudaMallocManaged still compiles and runs, but the runtime falls back to a “pre-fault” model that migrates all managed data before every kernel launch. This negates most of the performance benefits.

In practice

Default to explicit copies for performance-critical paths. Unified memory is excellent for prototyping, debugging, and code that runs infrequently. For hot paths, explicit cudaMemcpy (or cudaMemcpyAsync in streams) gives you deterministic transfer timing and avoids fault overhead entirely.

Use prefetching whenever you use managed memory in production. Treat cudaMemPrefetchAsync as mandatory, not optional. Without it, you are relying on demand paging, and demand paging is slow for bulk data.

Combine with cudaMemAdvise for multi-GPU workloads. On systems with NVLink, SetAccessedBy lets a GPU access pages on another GPU without migrating them. This avoids the migration cost entirely for read-heavy access patterns.

Profile page faults with Nsight Systems. The nsys timeline shows page fault events and migration traffic. If you see a wall of fault events at kernel launch time, add prefetch calls. If faults are scattered throughout execution, check for CPU/GPU ping-pong access patterns.

Unified memory enables oversubscription. If your dataset is larger than GPU memory, unified memory lets the runtime page data in and out automatically. This is not fast, but it lets you run workloads that would otherwise require manual tiling. For research and experimentation, this convenience is significant.

What comes next

This article covered unified virtual memory: how cudaMallocManaged creates a single address space, how demand paging and page faults work under the hood, and how prefetching recovers bulk transfer performance. You now know when unified memory simplifies your code and when it costs you.

The next step is scaling beyond a single GPU. CUDA multi-GPU programming covers peer-to-peer access, multi-device memory management, kernel launches across GPUs, and how unified memory interacts with multi-GPU topologies. The concepts from this article (page migration, prefetching, memory hints) become even more important when data must move between multiple devices.
