
Advanced memory patterns: pinned memory, zero-copy, and more

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article assumes you have read the earlier parts on the CUDA memory hierarchy and on streams and asynchronous execution. You should be comfortable with cudaMalloc, cudaMemcpy, cudaMemcpyAsync, and non-default streams. This article digs deeper into the host-side memory story: how allocation strategy on the CPU side determines what the GPU can do with that memory, and at what speed.

Why host memory allocation matters

Every cudaMemcpy between host and device crosses the PCIe bus (or NVLink, on supported systems). The CUDA runtime uses DMA (Direct Memory Access) engines to perform these transfers. DMA engines operate on physical addresses. The CPU, meanwhile, gives user programs virtual addresses. Before a DMA engine can touch a buffer, the runtime must ensure every virtual page maps to a pinned physical page that will not be swapped out by the OS during the transfer.

For ordinary malloc’d memory (pageable memory), the CUDA runtime cannot hand the buffer directly to the DMA engine. Instead, it copies data into an internal pinned staging buffer, then DMA transfers from that staging buffer to the GPU. This extra copy through the staging buffer halves the effective bandwidth.

sequenceDiagram
  participant App as Application
  participant RT as CUDA Runtime
  participant SB as Staging Buffer (pinned)
  participant DMA as DMA Engine
  participant GPU as GPU Memory

  Note over App,GPU: Pageable Transfer Path
  App->>RT: cudaMemcpy - d_ptr, h_pageable, size, H2D
  RT->>SB: memcpy from pageable to staging
  SB->>DMA: DMA transfer initiated
  DMA->>GPU: Write to device memory
  DMA-->>RT: Transfer complete
  RT-->>App: Return

  Note over App,GPU: Pinned Transfer Path
  App->>RT: cudaMemcpy - d_ptr, h_pinned, size, H2D
  RT->>DMA: DMA transfer directly from pinned
  DMA->>GPU: Write to device memory
  DMA-->>RT: Transfer complete
  RT-->>App: Return

The pageable path pays twice: once for the CPU-side memcpy into the staging buffer, and once for the DMA transfer itself. The pinned path pays only once.

Pinned (page-locked) memory

Pinned memory is host memory that the OS guarantees will never be paged out to disk. You allocate it with cudaMallocHost or cudaHostAlloc:

float *h_pinned;
cudaMallocHost(&h_pinned, N * sizeof(float));

// Use h_pinned like any host pointer
for (int i = 0; i < N; i++) h_pinned[i] = (float)i;

// Transfer at full PCIe bandwidth
cudaMemcpy(d_data, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice);

// Free when done
cudaFreeHost(h_pinned);

cudaMallocHost returns a pointer that is page-locked in physical memory. The DMA engine can read from it directly, avoiding the staging buffer copy. On PCIe Gen3 x16, pageable transfers typically achieve 6 to 8 GB/s. Pinned transfers reach 12 to 13 GB/s, close to the theoretical 15.75 GB/s peak.

There is a cost. Every byte of pinned memory is a byte the OS cannot swap. If you pin 32 GB on a 64 GB machine, you have permanently consumed half the physical memory for the lifetime of that allocation. Other processes, including the OS page cache, lose access to those pages. Over-pinning leads to system-wide memory pressure and can cause OOM kills of unrelated processes.

Rule of thumb: pin only the buffers that are actively participating in GPU transfers. Free pinned allocations as soon as the transfer is complete if the data is not reused.

Benchmarking pageable vs pinned transfers

The following program measures transfer bandwidth for both allocation types across a range of buffer sizes:

#include <cstdio>
#include <cuda_runtime.h>

void benchmark_transfer(size_t bytes, bool pinned) {
    float *h_data, *d_data;

    if (pinned) {
        cudaMallocHost(&h_data, bytes);
    } else {
        h_data = (float*)malloc(bytes);
    }
    cudaMalloc(&d_data, bytes);

    // Warm up
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Timed transfer
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < 20; i++) {
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double seconds = ms / 1000.0;
    double gb = (double)bytes * 20.0 / (1024.0 * 1024.0 * 1024.0);
    double bandwidth = gb / seconds;

    printf("%s %8zu MB: %.2f GB/s (%.2f ms per transfer)\n",
           pinned ? "Pinned  " : "Pageable",
           bytes / (1024 * 1024), bandwidth, ms / 20.0);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    if (pinned) cudaFreeHost(h_data); else free(h_data);
}

int main() {
    size_t sizes[] = {1<<20, 4<<20, 16<<20, 64<<20, 256<<20, 512<<20};
    int n = sizeof(sizes) / sizeof(sizes[0]);

    printf("--- Pageable ---\n");
    for (int i = 0; i < n; i++) benchmark_transfer(sizes[i], false);

    printf("\n--- Pinned ---\n");
    for (int i = 0; i < n; i++) benchmark_transfer(sizes[i], true);

    return 0;
}

Key observations from typical results on an A100 with PCIe Gen4 x16:

  • Pinned memory consistently delivers roughly 2x the bandwidth of pageable memory at large sizes.
  • Both curves flatten above ~64 MB as the PCIe bus saturates.
  • Zero-copy bandwidth (GPU reading host memory over PCIe) is lower because each kernel memory access traverses the bus individually. Zero-copy makes sense only when data is accessed once or the working set is small.

Zero-copy memory

Zero-copy memory is pinned host memory that is mapped into the GPU’s address space. The GPU can read and write it directly without an explicit cudaMemcpy. You allocate it with cudaHostAlloc using the cudaHostAllocMapped flag:

float *h_mapped, *d_mapped;

// Allocate pinned + mapped memory
cudaHostAlloc(&h_mapped, N * sizeof(float),
              cudaHostAllocMapped);

// Get the device-visible pointer
cudaHostGetDevicePointer(&d_mapped, h_mapped, 0);

// CPU initializes data
for (int i = 0; i < N; i++) h_mapped[i] = (float)i;

// GPU kernel reads directly from host memory over PCIe
scale_kernel<<<blocks, threads>>>(d_mapped, 2.0f, N);
cudaDeviceSynchronize();

// CPU reads the result - no copy back needed
printf("Result[0] = %f\n", h_mapped[0]);

cudaFreeHost(h_mapped);

The kernel accesses d_mapped as if it were device memory, but every load and store crosses the PCIe bus. This means each access has PCIe latency (microseconds) instead of device memory latency (hundreds of nanoseconds).

Here is a kernel that demonstrates this:

__global__ void scale_kernel(float *data, float factor, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        // Each read and write here crosses PCIe
        data[idx] = data[idx] * factor;
    }
}

When zero-copy beats explicit copy

Zero-copy is not a general-purpose replacement for cudaMemcpy. It wins in specific scenarios:

Data accessed exactly once. If a kernel reads each element once and never again, zero-copy avoids the overhead of copying the entire buffer to device memory. The GPU reads data on demand over PCIe, processes it, and discards it.

Small data structures. Configuration parameters, lookup tables under a few KB, or scalar outputs are cheaper to access via zero-copy than to transfer explicitly. The overhead of a cudaMemcpy call (kernel launch latency, synchronization) can exceed the cost of a few PCIe reads.

Output-only buffers. If the GPU writes results that the CPU will read immediately, zero-copy avoids the device-to-host copy. The GPU writes directly to host memory. This is particularly effective for streaming pipelines where the CPU consumes results as fast as the GPU produces them.

CPU and GPU concurrent access. Zero-copy is the only mechanism (before unified memory) that lets both processors access the same physical buffer without explicit synchronization of copies. The programmer is still responsible for avoiding data races, but the memory management is simpler.

⚠ Zero-copy performs poorly when the GPU accesses the same data multiple times, when access patterns are random (causing many small PCIe transactions), or when the data set is large enough that bulk transfer amortizes the cudaMemcpy overhead.

Write-combining memory

Write-combining (WC) is a CPU memory type optimized for sequential writes and DMA reads. You allocate it with:

float *h_wc;
cudaHostAlloc(&h_wc, N * sizeof(float),
              cudaHostAllocWriteCombined | cudaHostAllocMapped);

Write-combining memory has three properties:

  1. CPU writes are fast. The CPU buffers writes internally and flushes them in large bursts, bypassing the cache hierarchy. Sequential write throughput can be 2 to 3x higher than normal cached memory.
  2. CPU reads are slow. Because WC memory is not cached, every CPU read goes directly to DRAM. Read latency can be 10x or more that of cached memory.
  3. GPU DMA reads are faster. Because WC memory does not pollute the CPU cache, the DMA engine does not need to snoop the cache hierarchy during transfers. This can add 10 to 15% bandwidth improvement on some systems.

The ideal use case is a producer-consumer pattern: the CPU writes data sequentially into a WC buffer, and the GPU reads it via DMA or zero-copy. Never read WC memory from the CPU. If you need to read back, use a separate standard pinned buffer for the output.

Bandwidth calculation and measurement

Effective bandwidth is the fundamental metric for memory-bound kernels:

\text{Effective Bandwidth (GB/s)} = \frac{\text{Bytes Read} + \text{Bytes Written}}{\text{Time (seconds)} \times 10^9}

For a vector addition kernel processing N floats:

\text{Bytes} = N \times 4 \times 3 = 12N \text{ bytes (2 reads + 1 write)}

If N = 256M elements and the kernel takes 2.4 ms:

\text{Bandwidth} = \frac{256 \times 10^6 \times 12}{2.4 \times 10^{-3} \times 10^9} = \frac{3.072 \times 10^9}{2.4 \times 10^{-3} \times 10^9} = 1{,}280 \text{ GB/s}

On an A100 with 2,039 GB/s peak HBM bandwidth, that is 62.8% of peak. The gap comes from imperfect coalescing, L2 cache misses, and instruction overhead.

For PCIe transfers, the theoretical peak depends on the generation:

Generation   Lane Rate          x16 Peak (each direction)
PCIe Gen3    ~1 GB/s per lane   ~15.75 GB/s
PCIe Gen4    ~2 GB/s per lane   ~31.5 GB/s
PCIe Gen5    ~4 GB/s per lane   ~63 GB/s

Measured bandwidth is always lower due to protocol overhead, TLP headers, and flow control credits. Expect 80 to 90% of theoretical in practice.

cudaMemcpy2D and cudaMemcpy3D

Real data is not always contiguous. Images are stored with row padding (pitch). 3D volumes have slice padding. Copying a rectangular sub-region of a padded 2D array with cudaMemcpy requires a loop over rows, which means N separate DMA transfers, each paying its own per-call overhead.

cudaMemcpy2D handles this in a single call:

// Copy a width x height sub-region from a pitched host buffer
// to a pitched device buffer
cudaMemcpy2D(
    d_ptr,              // destination pointer
    d_pitch,            // destination pitch (bytes per row including padding)
    h_ptr,              // source pointer
    h_pitch,            // source pitch
    width * sizeof(float),  // width of the region to copy (bytes)
    height,             // number of rows
    cudaMemcpyHostToDevice
);

The runtime packs the transfer efficiently, handling the pitch mismatch between host and device allocations. Device memory is typically allocated with cudaMallocPitch, which chooses a pitch that satisfies coalescing requirements:

float *d_data;
size_t d_pitch;
cudaMallocPitch(&d_data, &d_pitch, width * sizeof(float), height);

cudaMemcpy3D extends this to three dimensions using a cudaMemcpy3DParms structure:

cudaMemcpy3DParms params = {0};
params.srcPtr = make_cudaPitchedPtr(h_volume, h_pitch,
                                     width, height);
params.dstPtr = make_cudaPitchedPtr(d_volume, d_pitch,
                                     width, height);
params.extent = make_cudaExtent(width * sizeof(float),
                                 height, depth);
params.kind = cudaMemcpyHostToDevice;

cudaMemcpy3D(&params);

Both 2D and 3D variants have async versions (cudaMemcpy2DAsync, cudaMemcpy3DAsync) that take a stream parameter and require pinned host memory. Use them in multi-stream pipelines to overlap pitched transfers with compute.

Choosing the right allocation strategy

flowchart TD
  START["Need host memory for GPU transfers?"]
  Q1{"Transfer frequency?"}
  Q2{"Data access pattern on GPU?"}
  Q3{"CPU reads the buffer?"}
  Q4{"Is data 2D/3D with pitch?"}

  PAGEABLE["Use malloc / pageable
✓ Simple
✗ Half bandwidth"]
  PINNED["Use cudaMallocHost
✓ Full bandwidth
✓ Async transfer capable
⚠ Cannot be swapped"]
  ZEROCOPY["Use cudaHostAlloc (Mapped)
✓ No explicit copy
✓ Single-access data
⚠ PCIe latency per access"]
  WC["Use cudaHostAlloc (WC + Mapped)
✓ Fast CPU writes
✓ Better DMA throughput
✗ CPU reads are very slow"]
  PITCHED["Use cudaMallocPitch + cudaMemcpy2D
✓ Coalesced device access
✓ Single DMA call"]

  START --> Q1
  Q1 -->|Rare / prototyping| PAGEABLE
  Q1 -->|Frequent / performance critical| Q2
  Q2 -->|Bulk transfer then compute| PINNED
  Q2 -->|GPU reads each element once| Q3
  Q3 -->|Yes| ZEROCOPY
  Q3 -->|No, CPU only writes| WC
  Q1 -->|Pitched / padded data| Q4
  Q4 -->|Yes| PITCHED

  style PAGEABLE fill:#6a2d4f,stroke:#4a1d3f,color:#fff
  style PINNED fill:#2d6a4f,stroke:#1b4332,color:#fff
  style ZEROCOPY fill:#264653,stroke:#1a3340,color:#fff
  style WC fill:#e76f51,stroke:#c45a3f,color:#fff
  style PITCHED fill:#40916c,stroke:#2d6a4f,color:#fff

Worked examples

Example A: pinned bandwidth and annual data volume

Problem. A data pipeline transfers 512 MB buffers from host to GPU. With pageable memory, measured bandwidth is 6 GB/s. With pinned memory, measured bandwidth is 12 GB/s. Calculate (a) the transfer time for a single 512 MB buffer using each strategy, (b) the speedup, and (c) if the pipeline runs 100 transfers per second, how much total data moves per year?

Solution.

Transfer time for pageable:

t_{\text{pageable}} = \frac{0.512 \text{ GB}}{6 \text{ GB/s}} = 85.3 \text{ ms}

Transfer time for pinned:

t_{\text{pinned}} = \frac{0.512 \text{ GB}}{12 \text{ GB/s}} = 42.7 \text{ ms}

Speedup:

\frac{t_{\text{pageable}}}{t_{\text{pinned}}} = \frac{85.3}{42.7} = 2.0\times

Annual data volume at 100 transfers per second:

100 \;\frac{\text{transfers}}{\text{s}} \times 0.512 \;\frac{\text{GB}}{\text{transfer}} \times 86{,}400 \;\frac{\text{s}}{\text{day}} \times 365 \;\text{days} = 1{,}614{,}643{,}200 \text{ GB} \approx 1.61 \text{ exabytes/year}

The 2x bandwidth improvement from pinned memory saves 42.6 ms per transfer. Over a year at 100 Hz, that is:

100 \times 0.0426 \times 86{,}400 \times 365 = 134{,}343{,}360 \text{ seconds saved} \approx 4.26 \text{ years of CPU-thread time}

Pinned memory is not optional for high-throughput pipelines.

Example B: zero-copy break-even analysis

Problem. A GPU can access CPU memory via zero-copy at a PCIe bandwidth of 12 GB/s. A kernel has an arithmetic intensity of 2 FLOPs per byte, and the GPU has a peak compute throughput of 10 TFLOPS. (a) Is this kernel compute-bound or bandwidth-bound under zero-copy? (b) What arithmetic intensity would make zero-copy efficient (i.e., compute-bound)?

Solution.

(a) The roofline model gives the crossover point. Under zero-copy, the memory system is PCIe, not HBM. The attainable compute at 2 FLOPs/byte with PCIe bandwidth:

\text{Attainable FLOPS} = 2 \;\frac{\text{FLOPs}}{\text{byte}} \times 12 \;\frac{\text{GB}}{\text{s}} = 24 \text{ GFLOPS}

The GPU peak is 10,000 GFLOPS. At 24 GFLOPS, the kernel reaches 0.24% of peak compute. This kernel is severely bandwidth-bound under zero-copy. The PCIe bus is the bottleneck, not the SMs.

(b) For the kernel to become compute-bound, the attainable FLOPS must equal or exceed the GPU peak:

I_{\text{break-even}} = \frac{\text{Peak FLOPS}}{\text{PCIe BW}} = \frac{10{,}000 \text{ GFLOPS}}{12 \text{ GB/s}} = 833.3 \;\frac{\text{FLOPs}}{\text{byte}}

An arithmetic intensity of approximately 833 FLOPs/byte is needed before zero-copy stops being the bottleneck. Very few kernels achieve this. In practice, zero-copy is best for kernels that touch very little data (configuration reads, small lookup tables) rather than for bulk computation.

For comparison, a kernel reading from HBM at 2 TB/s with the same 2 FLOPs/byte intensity:

\text{Attainable} = 2 \times 2{,}000 = 4{,}000 \text{ GFLOPS} = 40\% \text{ of peak}

This is why explicit copy to device memory is almost always preferred for compute kernels.

Example C: pitched memory overhead

Problem. You have a 1024x768 float image. The row width is 1024 x 4 = 4,096 bytes, already a multiple of the 512-byte pitch alignment, so cudaMallocPitch returns a pitch of 4,096 bytes. Now consider a 1000x768 float image. The row width is 1000 x 4 = 4,000 bytes. With 512-byte alignment, the pitch becomes 4,096 bytes. How much memory is wasted due to pitch padding, and what is the wasted fraction?

Solution.

For the 1024x768 image, row width equals pitch. No padding waste:

\text{Padding per row} = 4{,}096 - 4{,}096 = 0 \text{ bytes}

For the 1000x768 image:

\text{Padding per row} = 4{,}096 - 4{,}000 = 96 \text{ bytes}

\text{Total padding} = 96 \times 768 = 73{,}728 \text{ bytes} = 72 \text{ KB}

\text{Useful data} = 4{,}000 \times 768 = 3{,}072{,}000 \text{ bytes} \approx 2.93 \text{ MB}

\text{Total allocated} = 4{,}096 \times 768 = 3{,}145{,}728 \text{ bytes} = 3.0 \text{ MB}

\text{Wasted fraction} = \frac{73{,}728}{3{,}145{,}728} = 2.34\%

The 2.34% overhead is a small price for guaranteed coalesced access. Without pitch alignment, threads in a warp accessing consecutive elements of a row might span two 128-byte cache lines, doubling the number of memory transactions. The padding ensures every row starts at an aligned address, and cudaMemcpy2D handles the pitch mismatch between host (4,000 byte rows) and device (4,096 byte rows) transparently.

In practice

Pin only what you transfer. Over-pinning starves the OS of physical pages. Allocate pinned buffers for your transfer pipeline and use regular malloc for everything else. Monitor system memory pressure with nvidia-smi and OS tools.

Use cudaMemcpyAsync with pinned memory for overlap. Pinned memory is a prerequisite for asynchronous transfers. If you are using streams for compute/transfer overlap (and you should be), every host buffer involved in an async transfer must be pinned. Pageable buffers silently fall back to synchronous behavior.

Benchmark your actual PCIe bandwidth. Do not trust theoretical numbers. PCIe topology (which slot, through which switch), NUMA node placement, and IOMMU settings all affect measured bandwidth. Run the CUDA bandwidthTest sample on your specific machine.

Reserve zero-copy for low-traffic host data. Configuration structs, small lookup tables, and scalar outputs are good candidates. Anything that resembles a bulk data transfer should go through explicit copy to device memory.

Use write-combining for CPU-write, GPU-read pipelines. Video capture, sensor data, and network packet buffers all fit this pattern. The CPU writes sequentially, and the GPU reads via DMA. Never read WC memory from the CPU side.

Prefer cudaMemcpy2D over manual row-by-row loops. A single 2D copy call is faster, easier to read, and works correctly with cudaMemcpy2DAsync for overlap. The same applies to cudaMemcpy3D for volumetric data.

What comes next

This article covered advanced host memory patterns: pinned memory for full PCIe bandwidth, zero-copy for direct GPU access to host memory, write-combining for CPU-write workloads, bandwidth measurement, and pitched memory transfers with cudaMemcpy2D and cudaMemcpy3D. You now know how to choose the right allocation strategy for each stage of a GPU data pipeline.

The next article moves into advanced stream patterns. CUDA advanced streams covers multi-stream pipelines with fine-grained dependencies, stream priorities, stream callbacks, and CUDA graphs for recording and replaying entire workflows. The memory patterns from this article combine with stream techniques to build production-grade transfer pipelines that saturate both the PCIe bus and the GPU simultaneously.
