Advanced memory patterns: pinned memory, zero-copy, and more
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read the following:
- CUDA memory hierarchy for global memory, shared memory, registers, and the GPU memory system.
- CUDA streams and asynchronous execution for async transfers, pinned memory basics, and overlapping compute with data movement.
You should be comfortable with cudaMalloc, cudaMemcpy, cudaMemcpyAsync, and non-default streams. This article digs deeper into the host-side memory story: how allocation strategy on the CPU side determines what the GPU can do with that memory, and at what speed.
Why host memory allocation matters
Every cudaMemcpy between host and device crosses the PCIe bus (or NVLink, on supported systems). The CUDA runtime uses DMA (Direct Memory Access) engines to perform these transfers. DMA engines operate on physical addresses. The CPU, meanwhile, gives user programs virtual addresses. Before a DMA engine can touch a buffer, the runtime must ensure every virtual page maps to a pinned physical page that will not be swapped out by the OS during the transfer.
For ordinary malloc’d memory (pageable memory), the CUDA runtime cannot hand the buffer directly to the DMA engine. Instead, it copies data into an internal pinned staging buffer, then DMA transfers from that staging buffer to the GPU. This extra hop through the staging buffer roughly halves the effective bandwidth.
sequenceDiagram
    participant App as Application
    participant RT as CUDA Runtime
    participant SB as Staging Buffer (pinned)
    participant DMA as DMA Engine
    participant GPU as GPU Memory
    Note over App,GPU: Pageable Transfer Path
    App->>RT: cudaMemcpy - d_ptr, h_pageable, size, H2D
    RT->>SB: memcpy from pageable to staging
    SB->>DMA: DMA transfer initiated
    DMA->>GPU: Write to device memory
    DMA-->>RT: Transfer complete
    RT-->>App: Return
    Note over App,GPU: Pinned Transfer Path
    App->>RT: cudaMemcpy - d_ptr, h_pinned, size, H2D
    RT->>DMA: DMA transfer directly from pinned
    DMA->>GPU: Write to device memory
    DMA-->>RT: Transfer complete
    RT-->>App: Return
The pageable path pays twice: once for the CPU-side memcpy into the staging buffer, and once for the DMA transfer itself. The pinned path pays only once.
Pinned (page-locked) memory
Pinned memory is host memory that the OS guarantees will never be paged out to disk. You allocate it with cudaMallocHost or cudaHostAlloc:
float *h_pinned;
cudaMallocHost(&h_pinned, N * sizeof(float));
// Use h_pinned like any host pointer
for (int i = 0; i < N; i++) h_pinned[i] = (float)i;
// Transfer at full PCIe bandwidth
cudaMemcpy(d_data, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice);
// Free when done
cudaFreeHost(h_pinned);
cudaMallocHost returns a pointer that is page-locked in physical memory. The DMA engine can read from it directly, avoiding the staging buffer copy. On PCIe Gen3 x16, pageable transfers typically achieve 6 to 8 GB/s. Pinned transfers reach 12 to 13 GB/s, close to the theoretical 15.75 GB/s peak.
There is a cost. Every byte of pinned memory is a byte the OS cannot swap. If you pin 32 GB on a 64 GB machine, you have permanently consumed half the physical memory for the lifetime of that allocation. Other processes, including the OS page cache, lose access to those pages. Over-pinning leads to system-wide memory pressure and can cause OOM kills of unrelated processes.
Rule of thumb: pin only the buffers that are actively participating in GPU transfers. Free pinned allocations as soon as the transfer is complete if the data is not reused.
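When a buffer comes from code you do not control (a library, a framework, or a memory-mapped file), you do not have to copy it into a cudaMallocHost allocation to get the fast path. cudaHostRegister pins an existing pageable buffer in place. A minimal sketch (registration itself is expensive, so it only pays off when the buffer is transferred many times):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;
    float *h_buf = (float*)malloc(N * sizeof(float));
    float *d_buf;
    cudaMalloc(&d_buf, N * sizeof(float));

    // Pin the existing pageable buffer in place. The driver walks
    // and locks every page, so register once and amortize the cost
    // over many transfers.
    cudaHostRegister(h_buf, N * sizeof(float), cudaHostRegisterDefault);

    // Transfers now take the direct DMA path (and are async-capable)
    cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);

    cudaHostUnregister(h_buf);   // unpin before freeing
    free(h_buf);
    cudaFree(d_buf);
    return 0;
}
```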
Benchmarking pageable vs pinned transfers
The following program measures transfer bandwidth for both allocation types across a range of buffer sizes:
#include <cstdio>
#include <cuda_runtime.h>
void benchmark_transfer(size_t bytes, bool pinned) {
float *h_data, *d_data;
if (pinned) {
cudaMallocHost(&h_data, bytes);
} else {
h_data = (float*)malloc(bytes);
}
cudaMalloc(&d_data, bytes);
// Warm up
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
// Timed transfer
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (int i = 0; i < 20; i++) {
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
double seconds = ms / 1000.0;
double gb = (double)bytes * 20.0 / (1024.0 * 1024.0 * 1024.0);
double bandwidth = gb / seconds;
printf("%s %8zu MB: %.2f GB/s (%.2f ms per transfer)\n",
pinned ? "Pinned " : "Pageable",
bytes / (1024 * 1024), bandwidth, ms / 20.0);
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaFree(d_data);
if (pinned) cudaFreeHost(h_data); else free(h_data);
}
int main() {
size_t sizes[] = {1<<20, 4<<20, 16<<20, 64<<20, 256<<20, 512<<20};
int n = sizeof(sizes) / sizeof(sizes[0]);
printf("--- Pageable ---\n");
for (int i = 0; i < n; i++) benchmark_transfer(sizes[i], false);
printf("\n--- Pinned ---\n");
for (int i = 0; i < n; i++) benchmark_transfer(sizes[i], true);
return 0;
}
Typical results on an A100 with PCIe Gen4 x16 show pinned transfers sustaining roughly double the pageable bandwidth at every size, with both saturating at large buffers. Key observations:
- Pinned memory consistently delivers roughly 2x the bandwidth of pageable memory at large sizes.
- Both curves flatten above ~64 MB as the PCIe bus saturates.
- Zero-copy bandwidth (GPU reading host memory over PCIe) is lower because each kernel memory access traverses the bus individually. Zero-copy makes sense only when data is accessed once or the working set is small.
Zero-copy memory
Zero-copy memory is pinned host memory that is mapped into the GPU’s address space. The GPU can read and write it directly without an explicit cudaMemcpy. You allocate it with cudaHostAlloc using the cudaHostAllocMapped flag:
float *h_mapped, *d_mapped;
// Allocate pinned + mapped memory
// (on pre-UVA systems, cudaSetDeviceFlags(cudaDeviceMapHost) must be
// called before any other CUDA operation for mapping to work)
cudaHostAlloc(&h_mapped, N * sizeof(float),
              cudaHostAllocMapped);
// Get the device-visible pointer (with unified virtual addressing
// this equals h_mapped, but querying it is the portable approach)
cudaHostGetDevicePointer(&d_mapped, h_mapped, 0);
// CPU initializes data
for (int i = 0; i < N; i++) h_mapped[i] = (float)i;
// GPU kernel reads directly from host memory over PCIe
scale_kernel<<<blocks, threads>>>(d_mapped, 2.0f, N);
cudaDeviceSynchronize();
// CPU reads the result - no copy back needed
printf("Result[0] = %f\n", h_mapped[0]);
cudaFreeHost(h_mapped);
The kernel accesses d_mapped as if it were device memory, but every load and store crosses the PCIe bus. This means each access has PCIe latency (microseconds) instead of device memory latency (hundreds of nanoseconds).
Here is a kernel that demonstrates this:
__global__ void scale_kernel(float *data, float factor, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
// Each read and write here crosses PCIe
data[idx] = data[idx] * factor;
}
}
When zero-copy beats explicit copy
Zero-copy is not a general-purpose replacement for cudaMemcpy. It wins in specific scenarios:
Data accessed exactly once. If a kernel reads each element once and never again, zero-copy avoids the overhead of copying the entire buffer to device memory. The GPU reads data on demand over PCIe, processes it, and discards it.
Small data structures. Configuration parameters, lookup tables under a few KB, or scalar outputs are cheaper to access via zero-copy than to transfer explicitly. The overhead of a cudaMemcpy call (kernel launch latency, synchronization) can exceed the cost of a few PCIe reads.
Output-only buffers. If the GPU writes results that the CPU will read immediately, zero-copy avoids the device-to-host copy. The GPU writes directly to host memory. This is particularly effective for streaming pipelines where the CPU consumes results as fast as the GPU produces them.
CPU and GPU concurrent access. Zero-copy is the only mechanism (before unified memory) that lets both processors access the same physical buffer without explicit synchronization of copies. The programmer is still responsible for avoiding data races, but the memory management is simpler.
⚠ Zero-copy performs poorly when the GPU accesses the same data multiple times, when access patterns are random (causing many small PCIe transactions), or when the data set is large enough that bulk transfer amortizes the cudaMemcpy overhead.
Write-combining memory
Write-combining (WC) is a CPU memory type optimized for sequential writes and DMA reads. You allocate it with:
float *h_wc;
cudaHostAlloc(&h_wc, N * sizeof(float),
cudaHostAllocWriteCombined | cudaHostAllocMapped);
Write-combining memory has three properties:
- CPU writes are fast. The CPU buffers writes internally and flushes them in large bursts, bypassing the cache hierarchy. Sequential write throughput can be 2 to 3x higher than normal cached memory.
- CPU reads are slow. Because WC memory is not cached, every CPU read goes directly to DRAM. Read latency is 10x or more compared to cached memory.
- GPU DMA reads are faster. Because WC memory does not pollute the CPU cache, the DMA engine does not need to snoop the cache hierarchy during transfers. This can add 10 to 15% bandwidth improvement on some systems.
The ideal use case is a producer-consumer pattern: the CPU writes data sequentially into a WC buffer, and the GPU reads it via DMA or zero-copy. Never read WC memory from the CPU. If you need to read back, use a separate standard pinned buffer for the output.
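A minimal sketch of that producer-consumer shape (the process kernel is hypothetical): the CPU fills a WC buffer sequentially, the GPU consumes it, and results come back through a separate standard pinned buffer so the CPU never reads WC memory.

```cuda
#include <cuda_runtime.h>

__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *h_in, *h_out, *d_in, *d_out;

    // Input: write-combined -- CPU writes sequentially, never reads
    cudaHostAlloc(&h_in, N * sizeof(float), cudaHostAllocWriteCombined);
    // Output: standard pinned -- the CPU will read this one
    cudaMallocHost(&h_out, N * sizeof(float));
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    // Sequential writes into WC memory are fast
    for (int i = 0; i < N; i++) h_in[i] = (float)i;

    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
    process<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    // Read results from h_out (cached, pinned), never from h_in

    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```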
Bandwidth calculation and measurement
Effective bandwidth is the fundamental metric for memory-bound kernels:

effective bandwidth = (bytes read + bytes written) / elapsed time

For a vector addition kernel processing N floats, each output element reads two floats and writes one, so the kernel moves 3 x N x 4 bytes. If N = 256M elements and the kernel takes 2.4 ms:

bytes moved = 3 x 256,000,000 x 4 = 3.072 GB
effective bandwidth = 3.072 GB / 0.0024 s = 1,280 GB/s

On an A100 with 2,039 GB/s peak HBM bandwidth, that is 62.8% of peak. The gap comes from imperfect coalescing, L2 cache misses, and instruction overhead.
For PCIe transfers, the theoretical peak depends on the generation:
| Generation | Lane Rate | x16 Peak (each direction) |
|---|---|---|
| PCIe Gen3 | 1 GB/s per lane | ~15.75 GB/s |
| PCIe Gen4 | 2 GB/s per lane | ~31.5 GB/s |
| PCIe Gen5 | 4 GB/s per lane | ~63 GB/s |
Measured bandwidth is always lower due to protocol overhead, TLP headers, and flow control credits. Expect 80 to 90% of theoretical in practice.
cudaMemcpy2D and cudaMemcpy3D
Real data is not always contiguous. Images are stored with row padding (pitch), and 3D volumes add slice padding. Copying a rectangular sub-region of a padded 2D array with plain cudaMemcpy requires a loop over rows, which means N separate transfers, each paying its own API call and DMA setup overhead.
cudaMemcpy2D handles this in a single call:
// Copy a width x height sub-region from a pitched host buffer
// to a pitched device buffer
cudaMemcpy2D(
d_ptr, // destination pointer
d_pitch, // destination pitch (bytes per row including padding)
h_ptr, // source pointer
h_pitch, // source pitch
width * sizeof(float), // width of the region to copy (bytes)
height, // number of rows
cudaMemcpyHostToDevice
);
The runtime packs the transfer efficiently, handling the pitch mismatch between host and device allocations. Device memory is typically allocated with cudaMallocPitch, which chooses a pitch that satisfies coalescing requirements:
float *d_data;
size_t d_pitch;
cudaMallocPitch(&d_data, &d_pitch, width * sizeof(float), height);
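A kernel reading a pitched allocation must step by the pitch in bytes, not by the logical width. The standard idiom casts to char* for the row offset. A sketch (scale_pitched is a hypothetical kernel name):

```cuda
#include <cuda_runtime.h>

__global__ void scale_pitched(float *data, size_t pitch,
                              int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Advance y rows of `pitch` bytes, then index within the row
        float *row = (float*)((char*)data + y * pitch);
        row[x] *= factor;
    }
}

int main() {
    int width = 1000, height = 768;
    float *d_data;
    size_t pitch;
    cudaMallocPitch(&d_data, &pitch, width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    scale_pitched<<<grid, block>>>(d_data, pitch, width, height, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```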
cudaMemcpy3D extends this to three dimensions using a cudaMemcpy3DParms structure:
cudaMemcpy3DParms params = {0};
params.srcPtr = make_cudaPitchedPtr(h_volume, h_pitch,
width, height);
params.dstPtr = make_cudaPitchedPtr(d_volume, d_pitch,
width, height);
params.extent = make_cudaExtent(width * sizeof(float),
height, depth);
params.kind = cudaMemcpyHostToDevice;
cudaMemcpy3D(&params);
Both 2D and 3D variants have async versions (cudaMemcpy2DAsync, cudaMemcpy3DAsync) that take a stream parameter and require pinned host memory. Use them in multi-stream pipelines to overlap pitched transfers with compute.
Choosing the right allocation strategy
flowchart TD
START["Need host memory for GPU transfers?"]
Q1{"Transfer frequency?"}
Q2{"Data access pattern on GPU?"}
Q3{"CPU reads the buffer?"}
Q4{"Is data 2D/3D with pitch?"}
PAGEABLE["Use malloc / pageable
✓ Simple
✗ Half bandwidth"]
PINNED["Use cudaMallocHost
✓ Full bandwidth
✓ Async transfer capable
⚠ Cannot be swapped"]
ZEROCOPY["Use cudaHostAlloc (Mapped)
✓ No explicit copy
✓ Single-access data
⚠ PCIe latency per access"]
WC["Use cudaHostAlloc (WC + Mapped)
✓ Fast CPU writes
✓ Better DMA throughput
✗ CPU reads are very slow"]
PITCHED["Use cudaMallocPitch + cudaMemcpy2D
✓ Coalesced device access
✓ Single DMA call"]
START --> Q1
Q1 -->|Rare / prototyping| PAGEABLE
Q1 -->|Frequent / performance critical| Q2
Q2 -->|Bulk transfer then compute| PINNED
Q2 -->|GPU reads each element once| Q3
Q3 -->|Yes| ZEROCOPY
Q3 -->|No, CPU only writes| WC
START -->|Pitched / padded data| Q4
Q4 -->|Yes| PITCHED
style PAGEABLE fill:#6a2d4f,stroke:#4a1d3f,color:#fff
style PINNED fill:#2d6a4f,stroke:#1b4332,color:#fff
style ZEROCOPY fill:#264653,stroke:#1a3340,color:#fff
style WC fill:#e76f51,stroke:#c45a3f,color:#fff
style PITCHED fill:#40916c,stroke:#2d6a4f,color:#fff
Worked examples
Example A: pinned bandwidth and annual data volume
Problem. A data pipeline transfers 512 MB buffers from host to GPU. With pageable memory, measured bandwidth is 6 GB/s. With pinned memory, measured bandwidth is 12 GB/s. Calculate (a) the transfer time for a single 512 MB buffer using each strategy, (b) the speedup, and (c) if the pipeline runs 100 transfers per second, how much total data moves per year?
Solution.
Transfer time for pageable:

t_pageable = 512 MB / 6 GB/s = 0.512 GB / 6 GB/s ≈ 85.3 ms

Transfer time for pinned:

t_pinned = 0.512 GB / 12 GB/s ≈ 42.7 ms

Speedup:

85.3 ms / 42.7 ms ≈ 2.0x

Annual data volume at 100 transfers per second:

0.512 GB x 100 transfers/s x 31,536,000 s/year ≈ 1.6 x 10^9 GB ≈ 1.6 exabytes

The 2x bandwidth improvement from pinned memory saves 42.6 ms per transfer. Over a year at 100 Hz, that is:

42.6 ms x 100/s x 31,536,000 s/year ≈ 1.3 x 10^8 seconds of accumulated bus time saved

(Note that even the pinned time implies 4.3 seconds of bus work per wall-clock second at 100 Hz, so actually sustaining this rate would require multiple links or smaller buffers.)
Pinned memory is not optional for high-throughput pipelines.
Example B: zero-copy break-even analysis
Problem. A GPU can access CPU memory via zero-copy at PCIe bandwidth of 12 GB/s. A kernel has arithmetic intensity of 2 FLOPs per byte. The GPU has a peak compute throughput of 10 TFLOPS. PCIe bandwidth is 12 GB/s. (a) Is this kernel compute-bound or bandwidth-bound under zero-copy? (b) What arithmetic intensity would make zero-copy efficient (i.e., compute-bound)?
Solution.
(a) The roofline model gives the crossover point. Under zero-copy, the memory system is PCIe, not HBM. The attainable compute at 2 FLOPs/byte with PCIe bandwidth:

attainable FLOPS = intensity x bandwidth = 2 FLOPs/byte x 12 GB/s = 24 GFLOPS

The GPU peak is 10,000 GFLOPS. At 24 GFLOPS, the kernel reaches 0.24% of peak compute. This kernel is severely bandwidth-bound under zero-copy. The PCIe bus is the bottleneck, not the SMs.
(b) For the kernel to become compute-bound, the attainable FLOPS must equal or exceed the GPU peak:

intensity x 12 GB/s ≥ 10,000 GFLOPS, so intensity ≥ 10,000 / 12 ≈ 833 FLOPs/byte

An arithmetic intensity of approximately 833 FLOPs/byte is needed before zero-copy stops being the bottleneck. Very few kernels achieve this. In practice, zero-copy is best for kernels that touch very little data (configuration reads, small lookup tables) rather than for bulk computation.
For comparison, a kernel reading from HBM at 2 TB/s with the same 2 FLOPs/byte intensity:

attainable FLOPS = 2 FLOPs/byte x 2,000 GB/s = 4,000 GFLOPS, or 40% of peak
This is why explicit copy to device memory is almost always preferred for compute kernels.
Example C: pitched memory overhead
Problem. You have a 1024x768 float image. cudaMallocPitch returns a pitch of 4,096 bytes (4 KB alignment). The actual row width is 1024 x 4 = 4,096 bytes. Now consider a 1000x768 float image. The row width is 1000 x 4 = 4,000 bytes. With 512-byte alignment, the pitch becomes 4,096 bytes. How much memory is wasted due to pitch padding, and what is the wasted fraction?
Solution.
For the 1024x768 image, row width equals pitch. No padding waste:

waste = (4,096 − 4,096) x 768 = 0 bytes

For the 1000x768 image:

padding per row = 4,096 − 4,000 = 96 bytes
total waste = 96 bytes x 768 rows = 73,728 bytes = 72 KB
total allocation = 4,096 x 768 = 3,145,728 bytes = 3 MB
wasted fraction = 73,728 / 3,145,728 ≈ 2.34%
The 2.34% overhead is a small price for guaranteed coalesced access. Without pitch alignment, threads in a warp accessing consecutive elements of a row might span two 128-byte cache lines, doubling the number of memory transactions. The padding ensures every row starts at an aligned address, and cudaMemcpy2D handles the pitch mismatch between host (4,000 byte rows) and device (4,096 byte rows) transparently.
In practice
Pin only what you transfer. Over-pinning starves the OS of physical pages. Allocate pinned buffers for your transfer pipeline and use regular malloc for everything else. Watch host memory pressure with OS tools (free, vmstat) and device memory with nvidia-smi.
Use cudaMemcpyAsync with pinned memory for overlap. Pinned memory is a prerequisite for asynchronous transfers. If you are using streams for compute/transfer overlap (and you should be), every host buffer involved in an async transfer must be pinned. Pageable buffers silently fall back to synchronous behavior.
Benchmark your actual PCIe bandwidth. Do not trust theoretical numbers. PCIe topology (which slot, through which switch), NUMA node placement, and IOMMU settings all affect measured bandwidth. Run the CUDA bandwidthTest sample on your specific machine.
Reserve zero-copy for low-traffic host data. Configuration structs, small lookup tables, and scalar outputs are good candidates. Anything that resembles a bulk data transfer should go through explicit copy to device memory.
Use write-combining for CPU-write, GPU-read pipelines. Video capture, sensor data, and network packet buffers all fit this pattern. The CPU writes sequentially, and the GPU reads via DMA. Never read WC memory from the CPU side.
Prefer cudaMemcpy2D over manual row-by-row loops. A single 2D copy call is faster, easier to read, and works correctly with cudaMemcpy2DAsync for overlap. The same applies to cudaMemcpy3D for volumetric data.
What comes next
This article covered advanced host memory patterns: pinned memory for full PCIe bandwidth, zero-copy for direct GPU access to host memory, write-combining for CPU-write workloads, bandwidth measurement, and pitched memory transfers with cudaMemcpy2D and cudaMemcpy3D. You now know how to choose the right allocation strategy for each stage of a GPU data pipeline.
The next article moves into advanced stream patterns. CUDA advanced streams covers multi-stream pipelines with fine-grained dependencies, stream priorities, stream callbacks, and CUDA graphs for recording and replaying entire workflows. The memory patterns from this article combine with stream techniques to build production-grade transfer pipelines that saturate both the PCIe bus and the GPU simultaneously.