Memory allocation patterns and multi-dimensional arrays in CUDA
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read the following:
- CUDA memory hierarchy for global memory, shared memory, registers, and the full picture of how GPU memory spaces work.
You should be comfortable calling cudaMalloc, cudaMemcpy, and launching kernels with <<<blocks, threads>>>. Everything here builds on that foundation by covering the allocation patterns you need when your data is not a flat 1D array.
cudaMalloc and cudaFree: the basics
cudaMalloc allocates a contiguous block of bytes in device global memory. cudaFree releases it. The API mirrors malloc/free on the host, but the returned pointer lives in a separate address space that the CPU cannot dereference.
float *d_data;
cudaMalloc(&d_data, N * sizeof(float));
// ... launch kernels that use d_data ...
cudaFree(d_data);
Three things to remember:
- cudaMalloc returns cudaError_t. Always check it, or at minimum wrap it in a macro that does.
- The pointer is valid only on the device. Passing it to a host function that dereferences it crashes your program.
- cudaFree is synchronous. It blocks until all preceding GPU work on the default stream completes.
For 1D arrays this is all you need. The complexity starts when you work with 2D and 3D data.
The alignment problem with 2D arrays
GPUs load data from global memory in 128-byte transactions aligned to 128-byte boundaries. When you store a 2D matrix as a flat row-major allocation (cudaMalloc(&d_mat, rows * cols * sizeof(float))), the start of each row may not be aligned to a transaction boundary. Misaligned rows force the hardware to issue extra transactions for every row access.
The solution is pitched allocation: pad each row so that every row starts on an aligned boundary.
graph LR
subgraph "Flat allocation (no pitch)"
R0F["Row 0: 400 bytes"]
R1F["Row 1: 400 bytes (misaligned start)"]
R2F["Row 2: 400 bytes (misaligned start)"]
end
subgraph "Pitched allocation (pitch = 512 bytes)"
R0P["Row 0: 400 bytes data + 112 bytes padding"]
R1P["Row 1: 400 bytes data + 112 bytes padding"]
R2P["Row 2: 400 bytes data + 112 bytes padding"]
end
style R0F fill:#e63946,stroke:#333,color:#fff
style R1F fill:#e63946,stroke:#333,color:#fff
style R2F fill:#e63946,stroke:#333,color:#fff
style R0P fill:#2d6a4f,stroke:#333,color:#fff
style R1P fill:#2d6a4f,stroke:#333,color:#fff
style R2P fill:#2d6a4f,stroke:#333,color:#fff
The padding bytes waste memory but guarantee that every row starts at an address divisible by the pitch. This makes coalesced access possible regardless of row width.
cudaMallocPitch: 2D allocation with automatic alignment
cudaMallocPitch allocates a 2D region and returns both the pointer and the pitch (the row width in bytes, including padding).
float *d_matrix;
size_t pitch;
int rows = 100;
int cols = 100;
cudaMallocPitch(&d_matrix, &pitch,
cols * sizeof(float), // requested row width in bytes
rows); // number of rows
The runtime chooses a pitch that satisfies alignment requirements for the current device. You do not control the pitch value directly.
Pitch calculation example
For a 100x100 float matrix:
- Row size = 100 * 4 = 400 bytes
- The runtime rounds up to the next aligned boundary. On most devices this is 512 bytes.
- Pitch = 512 bytes
- Padding per row = 512 - 400 = 112 bytes
- Total allocation = 512 * 100 = 51,200 bytes (versus 40,000 for a flat allocation)
The overhead is 28%, but the alignment benefit far outweighs this cost for kernels that access rows from many threads.
Indexing with pitch
You cannot use d_matrix[row * cols + col] because the pitch is larger than cols * sizeof(float). Instead, use byte-level pointer arithmetic:
__global__ void processMatrix(float *d_matrix, size_t pitch,
int rows, int cols) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (row < rows && col < cols) {
// Cast to char* for byte arithmetic, then back to float*
float *rowPtr = (float *)((char *)d_matrix + row * pitch);
float value = rowPtr[col];
// Write back using the same pointer arithmetic
rowPtr[col] = value * 2.0f;
}
}
The pattern is always the same: cast to char*, offset by row * pitch bytes, cast back to your element type, then index by column. This works because pitch is in bytes, not elements.
Copying data to/from pitched memory
Use cudaMemcpy2D to transfer between host (unpitched) and device (pitched) memory:
float h_matrix[100][100];
// ... fill h_matrix ...
// Host -> Device (host pitch = cols * sizeof(float), no padding)
cudaMemcpy2D(d_matrix, pitch, // dst, dst pitch
h_matrix, cols * sizeof(float), // src, src pitch
cols * sizeof(float), // width of data to copy (bytes)
rows, // height (rows)
cudaMemcpyHostToDevice);
// Device -> Host
cudaMemcpy2D(h_matrix, cols * sizeof(float),
d_matrix, pitch,
cols * sizeof(float), rows,
cudaMemcpyDeviceToHost);
The key detail: the source and destination pitches can differ. The host side has no padding, so its pitch equals the actual row width.
cudaMalloc3D: extending pitched allocation to three dimensions
For 3D data (volumes, time series of 2D frames), cudaMalloc3D allocates a pitched 3D extent and returns a cudaPitchedPtr:
cudaExtent extent = make_cudaExtent(
cols * sizeof(float), // width in bytes
rows, // height
depth // depth (number of 2D slices)
);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
// Access fields:
// devPitchedPtr.ptr - base pointer
// devPitchedPtr.pitch - pitch in bytes (same alignment as 2D)
// devPitchedPtr.xsize - requested width in bytes
// devPitchedPtr.ysize - requested height
Indexing in a kernel follows the same byte-arithmetic pattern, extended to three dimensions:
__global__ void process3D(cudaPitchedPtr devPtr, int cols, int rows, int depth) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;
if (x < cols && y < rows && z < depth) {
char *base = (char *)devPtr.ptr;
// Slice offset: z * pitch * rows
// Row offset within slice: y * pitch
char *slice = base + z * devPtr.pitch * rows;
float *row = (float *)(slice + y * devPtr.pitch);
float value = row[x];
row[x] = value + 1.0f;
}
}
Use cudaMemcpy3D with a cudaMemcpy3DParms struct for host-device transfers. The parameter struct is verbose but follows the same src/dst pitch logic as cudaMemcpy2D.
Dynamic shared memory
Static shared memory is declared with a compile-time size:
__shared__ float tile[32][32];
This works when the size is known at compile time. When the size depends on a runtime parameter (block size, tile size passed as an argument), you need dynamic shared memory.
Declaration and launch syntax
Declare a dynamically-sized shared array with extern __shared__:
__global__ void reductionKernel(float *input, float *output, int n) {
extern __shared__ float sdata[];
int tid = threadIdx.x;
int i = blockIdx.x * blockDim.x + threadIdx.x;
// Load from global to shared
sdata[tid] = (i < n) ? input[i] : 0.0f;
__syncthreads();
// Parallel reduction in shared memory
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
if (tid < stride) {
sdata[tid] += sdata[tid + stride];
}
__syncthreads();
}
if (tid == 0) {
output[blockIdx.x] = sdata[0];
}
}
The shared memory size is specified at kernel launch as the third argument to the execution configuration:
int blockSize = 256;
int gridSize = (n + blockSize - 1) / blockSize;
size_t sharedBytes = blockSize * sizeof(float);
reductionKernel<<<gridSize, blockSize, sharedBytes>>>(d_input, d_output, n);
The sharedBytes argument tells the runtime how many bytes to allocate for the extern __shared__ array. The array has no compile-time size; it gets its size entirely from the launch configuration.
Multiple dynamic shared arrays
Only one extern __shared__ array can exist per kernel (they all alias to the same base address). To use multiple arrays, allocate a single block and partition it manually:
__global__ void multiArrayKernel(float *data, int *indices, int n) {
extern __shared__ char sharedMem[];
float *sharedFloats = (float *)sharedMem;
int *sharedInts = (int *)(sharedMem + blockDim.x * sizeof(float));
int tid = threadIdx.x;
sharedFloats[tid] = data[blockIdx.x * blockDim.x + tid];
sharedInts[tid] = indices[blockIdx.x * blockDim.x + tid];
__syncthreads();
// ... use both arrays ...
}
// Launch with combined size
size_t sharedSize = block.x * sizeof(float) + block.x * sizeof(int);
multiArrayKernel<<<grid, block, sharedSize>>>(d_data, d_indices, n);
⚠ Watch alignment. If the byte size of sharedFloats is not a multiple of the alignment of the type that follows it, the second sub-array starts at a misaligned address. Float followed by int is safe because both are 4-byte types, but mixing element sizes (say, float followed by double) requires manual padding between the sub-arrays.
Deep copy of structures with device pointers
Consider a structure on the host that contains a pointer to dynamically allocated data:
struct Particles {
float *x;
float *y;
float *vx;
float *vy;
int count;
};
You cannot simply cudaMemcpy this structure to the device. The pointers inside it point to host memory. After the copy, the device has a struct whose pointers are invalid device-side addresses.
graph TD
subgraph "Host Memory"
HS["Particles struct
x: 0x7fff1000
y: 0x7fff2000
count: 1000"]
HX["float x[1000]"]
HY["float y[1000]"]
HS -->|"x ptr"| HX
HS -->|"y ptr"| HY
end
subgraph "Device Memory (after shallow copy)"
DS["Particles struct
x: 0x7fff1000 (INVALID)
y: 0x7fff2000 (INVALID)
count: 1000"]
end
subgraph "Device Memory (after deep copy)"
DD["Particles struct
x: 0xB000A000 (valid)
y: 0xB000B000 (valid)
count: 1000"]
DX["float x[1000]"]
DY["float y[1000]"]
DD -->|"x ptr"| DX
DD -->|"y ptr"| DY
end
style HS fill:#457b9d,stroke:#333,color:#fff
style DS fill:#e63946,stroke:#333,color:#fff
style DD fill:#2d6a4f,stroke:#333,color:#fff
style HX fill:#457b9d,stroke:#333,color:#fff
style HY fill:#457b9d,stroke:#333,color:#fff
style DX fill:#2d6a4f,stroke:#333,color:#fff
style DY fill:#2d6a4f,stroke:#333,color:#fff
The deep copy procedure requires multiple steps:
void copyParticlesToDevice(const Particles &h_p, Particles **d_pp) {
// Step 1: Allocate device memory for each array member
float *d_x, *d_y, *d_vx, *d_vy;
size_t arrayBytes = h_p.count * sizeof(float);
cudaMalloc(&d_x, arrayBytes);
cudaMalloc(&d_y, arrayBytes);
cudaMalloc(&d_vx, arrayBytes);
cudaMalloc(&d_vy, arrayBytes);
// Step 2: Copy array data from host to device
cudaMemcpy(d_x, h_p.x, arrayBytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, h_p.y, arrayBytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_vx, h_p.vx, arrayBytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_vy, h_p.vy, arrayBytes, cudaMemcpyHostToDevice);
// Step 3: Build a temporary host struct with device pointers
Particles h_temp;
h_temp.x = d_x;
h_temp.y = d_y;
h_temp.vx = d_vx;
h_temp.vy = d_vy;
h_temp.count = h_p.count;
// Step 4: Allocate device memory for the struct and copy it
cudaMalloc(d_pp, sizeof(Particles));
cudaMemcpy(*d_pp, &h_temp, sizeof(Particles), cudaMemcpyHostToDevice);
}
The kernel receives a Particles* and dereferences it naturally:
__global__ void updateParticles(Particles *p, float dt) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < p->count) {
p->x[i] += p->vx[i] * dt;
p->y[i] += p->vy[i] * dt;
}
}
Freeing requires the reverse: copy the struct back to read the device pointers, free each array, then free the struct.
This pattern is tedious but unavoidable when your data layout uses struct-of-arrays with pointer members. An alternative is to flatten the struct into separate kernel arguments, which avoids the deep copy entirely at the cost of longer argument lists.
Passing multi-dimensional arrays to kernels
CUDA kernels cannot accept C-style 2D arrays (float matrix[M][N]) as parameters because the compiler needs to know N at compile time to compute row offsets. Three practical approaches exist:
Flat pointer with manual indexing. Pass a float* and the dimensions. Index as matrix[row * cols + col]. Simple, no alignment guarantees.
Pitched pointer from cudaMallocPitch. Pass the float*, the pitch, and the dimensions. Index with byte arithmetic as shown earlier. Best memory access performance.
Array of row pointers. Allocate an array of float* on the device where each element points to the start of a row. This lets you use matrix[row][col] syntax inside the kernel but requires an extra indirection per access and a separate allocation for the pointer array. Rarely worth the convenience.
For most production code, the pitched pointer approach wins because it combines alignment with a single contiguous allocation.
Memory pools and stream-ordered allocation
Starting with CUDA 11.2, memory pools (cudaMallocAsync / cudaFreeAsync) eliminate the overhead of repeated cudaMalloc/cudaFree calls within a kernel dispatch loop.
Traditional allocation is synchronous. Every cudaMalloc call synchronizes with the device and goes through the OS memory allocator. In a loop that launches thousands of small kernels, allocation overhead can dominate wall-clock time.
// Traditional: synchronous, high overhead per iteration
for (int i = 0; i < iterations; i++) {
cudaMalloc(&d_temp, tempBytes);
kernel<<<grid, block>>>(d_input, d_temp, d_output);
cudaFree(d_temp);
}
// Pool-based: asynchronous, near-zero overhead per iteration
cudaMemPool_t pool;
cudaDeviceGetDefaultMemPool(&pool, device);
for (int i = 0; i < iterations; i++) {
cudaMallocAsync(&d_temp, tempBytes, stream);
kernel<<<grid, block, 0, stream>>>(d_input, d_temp, d_output);
cudaFreeAsync(d_temp, stream);
}
cudaStreamSynchronize(stream);
cudaMallocAsync returns immediately and records the allocation as a stream-ordered operation. The runtime reuses freed blocks from the pool without returning memory to the OS. The first call may allocate from the OS, but subsequent calls of the same size hit the pool cache.
Key properties of memory pools:
- Stream ordering. The allocation is visible only after preceding operations in the stream complete. Frees are deferred until preceding operations finish.
- Reuse. Freed blocks stay in the pool. A subsequent cudaMallocAsync of the same or smaller size reuses an existing block instead of calling into the driver.
- Cross-stream visibility. An allocation is only guaranteed valid in the stream that allocated it; using it from another stream requires event-based ordering, and access from other devices must be granted with cudaMemPoolSetAccess.
- Trim control. Call cudaMemPoolTrimTo to release unused pool memory back to the OS when memory pressure is high.
Fragmentation
GPU memory fragmentation follows the same principles as CPU heap fragmentation, but the consequences are sharper because GPU memory is typically smaller and has no swap space.
External fragmentation occurs when free memory exists but is scattered in non-contiguous blocks. A 2 GB allocation fails even though 3 GB is free across dozens of small holes. This happens when you repeatedly allocate and free buffers of different sizes.
Internal fragmentation is the waste from alignment padding (like the 112 bytes per row in pitched allocations) or from pool allocators rounding up to a size class.
Strategies to reduce fragmentation:
| Strategy | Effect |
|---|---|
| Allocate all large buffers at startup | Prevents interleaving of different-lifetime allocations |
| Use memory pools (cudaMallocAsync) | Pool reuse avoids returning memory to the OS and re-fragmenting |
| Reuse scratch buffers across kernel launches | One allocation serves many launches |
| Prefer power-of-two sizes | Reduces the number of distinct size classes the allocator must track |
| Monitor with cudaMemGetInfo | Check free/total memory before large allocations to fail gracefully |
⚠ Unlike CPU programs, GPU programs cannot defragment by moving allocations. There is no garbage collector and no compaction. Once memory is fragmented, the only fix is to free everything and reallocate.
In practice
Use cudaMallocPitch for any 2D data accessed by kernels. The alignment overhead is small compared to the coalescing benefit. Flat allocations for 2D data are a common source of subtle performance loss that does not show up as an error, only as unexpectedly low throughput.
Prefer dynamic shared memory when the tile size varies. Hardcoding __shared__ float tile[32][32] ties your kernel to one block configuration. Dynamic shared memory lets you tune block size without recompiling.
Avoid deep copies when possible. Restructure your data to pass flat arrays as separate kernel arguments. If deep copy is unavoidable, encapsulate the allocation/copy/free sequence in a helper function so every call site does not repeat the multi-step dance.
Switch to cudaMallocAsync for allocation-heavy workloads. If you profile and find that cudaMalloc/cudaFree account for measurable wall-clock time, memory pools are the fix. The API change is minimal: add Async and pass a stream.
Allocate early, free late. The simplest fragmentation prevention strategy is to allocate your working set at program startup and reuse it for the lifetime of the application. This is not always possible, but when it is, it eliminates an entire class of problems.
What comes next
This article covered the allocation patterns that go beyond cudaMalloc: pitched 2D and 3D memory for aligned access, dynamic shared memory for runtime-sized scratch space, deep copies for structures with pointer members, and memory pools for high-frequency allocation workloads.
The next article moves to specialized read-only memory spaces. Texture and constant memory in CUDA covers __constant__ memory for broadcast-style reads, texture objects for spatially coherent access with hardware interpolation, and when these specialized paths outperform plain global memory loads.