Memory allocation patterns and multi-dimensional arrays in CUDA
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read the following:
- CUDA memory hierarchy for global memory, shared memory, registers, and the full picture of how GPU memory spaces work.
You should be comfortable calling cudaMalloc, cudaMemcpy, and launching kernels with <<<blocks, threads>>>. Everything here builds on that foundation by covering the allocation patterns you need when your data is not a flat 1D array.
cudaMalloc and cudaFree: the basics
cudaMalloc allocates a contiguous block of bytes in device global memory. cudaFree releases it. The API mirrors malloc/free on the host, but the returned pointer lives in a separate address space that the CPU cannot dereference.
float *d_data;
cudaMalloc(&d_data, N * sizeof(float));
// ... launch kernels that use d_data ...
cudaFree(d_data);
Three things to remember:
- cudaMalloc returns cudaError_t. Always check it, or at minimum wrap it in a macro that does.
- The pointer is valid only on the device. Passing it to a host function that dereferences it crashes your program.
- cudaFree is synchronous. It blocks until all preceding GPU work on the default stream completes.
For 1D arrays this is all you need. The complexity starts when you work with 2D and 3D data.
The alignment problem with 2D arrays
GPUs load data from global memory in 128-byte transactions aligned to 128-byte boundaries. When you store a 2D matrix as a flat row-major allocation (cudaMalloc(&d_mat, rows * cols * sizeof(float))), the start of each row may not be aligned to a transaction boundary. Misaligned rows force the hardware to issue extra transactions for every row access.
The solution is pitched allocation: pad each row so that every row starts on an aligned boundary.
graph LR
subgraph "Flat allocation (no pitch)"
R0F["Row 0: 400 bytes"]
R1F["Row 1: 400 bytes (misaligned start)"]
R2F["Row 2: 400 bytes (misaligned start)"]
end
subgraph "Pitched allocation (pitch = 512 bytes)"
R0P["Row 0: 400 bytes data + 112 bytes padding"]
R1P["Row 1: 400 bytes data + 112 bytes padding"]
R2P["Row 2: 400 bytes data + 112 bytes padding"]
end
style R0F fill:#e63946,stroke:#333,color:#fff
style R1F fill:#e63946,stroke:#333,color:#fff
style R2F fill:#e63946,stroke:#333,color:#fff
style R0P fill:#2d6a4f,stroke:#333,color:#fff
style R1P fill:#2d6a4f,stroke:#333,color:#fff
style R2P fill:#2d6a4f,stroke:#333,color:#fff
The padding bytes waste memory but guarantee that every row starts at an address divisible by the pitch. This makes coalesced access possible regardless of row width.
cudaMallocPitch: 2D allocation with automatic alignment
cudaMallocPitch allocates a 2D region and returns both the pointer and the pitch (the row width in bytes, including padding).
float *d_matrix;
size_t pitch;
int rows = 100;
int cols = 100;
cudaMallocPitch(&d_matrix, &pitch,
cols * sizeof(float), // requested row width in bytes
rows); // number of rows
The runtime chooses a pitch that satisfies alignment requirements for the current device. You do not control the pitch value directly.
Pitch calculation example
For a 100x100 float matrix:
- Row size = 100 * 4 = 400 bytes
- The runtime rounds up to the next aligned boundary. On most devices this is 512 bytes.
- Pitch = 512 bytes
- Padding per row = 512 - 400 = 112 bytes
- Total allocation = 512 * 100 = 51,200 bytes (versus 40,000 for a flat allocation)
The overhead is 28%, but the alignment benefit far outweighs this cost for kernels that access rows from many threads.
Indexing with pitch
You cannot use d_matrix[row * cols + col] because the pitch is larger than cols * sizeof(float). Instead, use byte-level pointer arithmetic:
__global__ void processMatrix(float *d_matrix, size_t pitch,
int rows, int cols) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (row < rows && col < cols) {
// Cast to char* for byte arithmetic, then back to float*
float *rowPtr = (float *)((char *)d_matrix + row * pitch);
float value = rowPtr[col];
// Write back using the same pointer arithmetic
rowPtr[col] = value * 2.0f;
}
}
The pattern is always the same: cast to char*, offset by row * pitch bytes, cast back to your element type, then index by column. This works because pitch is in bytes, not elements.
Copying data to/from pitched memory
Use cudaMemcpy2D to transfer between host (unpitched) and device (pitched) memory:
float h_matrix[100][100];
// ... fill h_matrix ...
// Host -> Device (host pitch = cols * sizeof(float), no padding)
cudaMemcpy2D(d_matrix, pitch, // dst, dst pitch
h_matrix, cols * sizeof(float), // src, src pitch
cols * sizeof(float), // width of data to copy (bytes)
rows, // height (rows)
cudaMemcpyHostToDevice);
// Device -> Host
cudaMemcpy2D(h_matrix, cols * sizeof(float),
d_matrix, pitch,
cols * sizeof(float), rows,
cudaMemcpyDeviceToHost);
The key detail: the source and destination pitches can differ. The host side has no padding, so its pitch equals the actual row width.
cudaMalloc3D: extending pitched allocation to three dimensions
For 3D data (volumes, time series of 2D frames), cudaMalloc3D allocates a pitched 3D extent and returns a cudaPitchedPtr:
cudaExtent extent = make_cudaExtent(
cols * sizeof(float), // width in bytes
rows, // height
depth // depth (number of 2D slices)
);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
// Access fields:
// devPitchedPtr.ptr - base pointer
// devPitchedPtr.pitch - pitch in bytes (same alignment as 2D)
// devPitchedPtr.xsize - requested width in bytes
// devPitchedPtr.ysize - requested height
Indexing in a kernel follows the same byte-arithmetic pattern, extended to three dimensions:
__global__ void process3D(cudaPitchedPtr devPtr, int cols, int rows, int depth) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;
if (x < cols && y < rows && z < depth) {
char *base = (char *)devPtr.ptr;
// Slice offset: z * pitch * rows
// Row offset within slice: y * pitch
char *slice = base + z * devPtr.pitch * rows;
float *row = (float *)(slice + y * devPtr.pitch);
float value = row[x];
row[x] = value + 1.0f;
}
}
Use cudaMemcpy3D with a cudaMemcpy3DParms struct for host-device transfers. The parameter struct is verbose but follows the same src/dst pitch logic as cudaMemcpy2D.
Dynamic shared memory
Static shared memory is declared with a compile-time size:
__shared__ float tile[32][32];
This works when the size is known at compile time. When the size depends on a runtime parameter (block size, tile size passed as an argument), you need dynamic shared memory.
Declaration and launch syntax
Declare a dynamically-sized shared array with extern __shared__:
__global__ void reductionKernel(float *input, float *output, int n) {
extern __shared__ float sdata[];
int tid = threadIdx.x;
int i = blockIdx.x * blockDim.x + threadIdx.x;
// Load from global to shared
sdata[tid] = (i < n) ? input[i] : 0.0f;
__syncthreads();
// Parallel reduction in shared memory
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
if (tid < stride) {
sdata[tid] += sdata[tid + stride];
}
__syncthreads();
}
if (tid == 0) {
output[blockIdx.x] = sdata[0];
}
}
The shared memory size is specified at kernel launch as the third argument to the execution configuration:
int blockSize = 256;
int gridSize = (n + blockSize - 1) / blockSize;
size_t sharedBytes = blockSize * sizeof(float);
reductionKernel<<<gridSize, blockSize, sharedBytes>>>(d_input, d_output, n);
The sharedBytes argument tells the runtime how many bytes to allocate for the extern __shared__ array. The array has no compile-time size; it gets its size entirely from the launch configuration.
Multiple dynamic shared arrays
Only one extern __shared__ array can exist per kernel (they all alias to the same base address). To use multiple arrays, allocate a single block and partition it manually:
__global__ void multiArrayKernel(float *data, int *indices, int n) {
extern __shared__ char sharedMem[];
float *sharedFloats = (float *)sharedMem;
int *sharedInts = (int *)(sharedMem + blockDim.x * sizeof(float));
int tid = threadIdx.x;
sharedFloats[tid] = data[blockIdx.x * blockDim.x + tid];
sharedInts[tid] = indices[blockIdx.x * blockDim.x + tid];
__syncthreads();
// ... use both arrays ...
}
// Launch with combined size
size_t sharedSize = block.x * sizeof(float) + block.x * sizeof(int);
multiArrayKernel<<<grid, block, sharedSize>>>(d_data, d_indices, n);
⚠ Watch alignment. If the byte size of sharedFloats is not a multiple of the alignment of the type that follows it, the second sub-array starts at a misaligned address. Float followed by int is safe because both are 4-byte types, but mixing element sizes (say, float followed by double) requires manual padding between the sub-arrays.
Deep copy of structures with device pointers
Consider a structure on the host that contains a pointer to dynamically allocated data:
struct Particles {
float *x;
float *y;
float *vx;
float *vy;
int count;
};
You cannot simply cudaMemcpy this structure to the device. The pointers inside it point to host memory. After the copy, the device has a struct whose pointers are invalid device-side addresses.
graph TD
subgraph "Host Memory"
HS["Particles struct
x: 0x7fff1000
y: 0x7fff2000
count: 1000"]
HX["float x[1000]"]
HY["float y[1000]"]
HS -->|"x ptr"| HX
HS -->|"y ptr"| HY
end
subgraph "Device Memory (after shallow copy)"
DS["Particles struct
x: 0x7fff1000 (INVALID)
y: 0x7fff2000 (INVALID)
count: 1000"]
end
subgraph "Device Memory (after deep copy)"
DD["Particles struct
x: 0xB000A000 (valid)
y: 0xB000B000 (valid)
count: 1000"]
DX["float x[1000]"]
DY["float y[1000]"]
DD -->|"x ptr"| DX
DD -->|"y ptr"| DY
end
style HS fill:#457b9d,stroke:#333,color:#fff
style DS fill:#e63946,stroke:#333,color:#fff
style DD fill:#2d6a4f,stroke:#333,color:#fff
style HX fill:#457b9d,stroke:#333,color:#fff
style HY fill:#457b9d,stroke:#333,color:#fff
style DX fill:#2d6a4f,stroke:#333,color:#fff
style DY fill:#2d6a4f,stroke:#333,color:#fff
The deep copy procedure requires multiple steps:
void copyParticlesToDevice(const Particles &h_p, Particles **d_pp) {
// Step 1: Allocate device memory for each array member
float *d_x, *d_y, *d_vx, *d_vy;
size_t arrayBytes = h_p.count * sizeof(float);
cudaMalloc(&d_x, arrayBytes);
cudaMalloc(&d_y, arrayBytes);
cudaMalloc(&d_vx, arrayBytes);
cudaMalloc(&d_vy, arrayBytes);
// Step 2: Copy array data from host to device
cudaMemcpy(d_x, h_p.x, arrayBytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, h_p.y, arrayBytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_vx, h_p.vx, arrayBytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_vy, h_p.vy, arrayBytes, cudaMemcpyHostToDevice);
// Step 3: Build a temporary host struct with device pointers
Particles h_temp;
h_temp.x = d_x;
h_temp.y = d_y;
h_temp.vx = d_vx;
h_temp.vy = d_vy;
h_temp.count = h_p.count;
// Step 4: Allocate device memory for the struct and copy it
cudaMalloc(d_pp, sizeof(Particles));
cudaMemcpy(*d_pp, &h_temp, sizeof(Particles), cudaMemcpyHostToDevice);
}
The kernel receives a Particles* and dereferences it naturally:
__global__ void updateParticles(Particles *p, float dt) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < p->count) {
p->x[i] += p->vx[i] * dt;
p->y[i] += p->vy[i] * dt;
}
}
Freeing requires the reverse: copy the struct back to read the device pointers, free each array, then free the struct.
This pattern is tedious but unavoidable when your data layout uses struct-of-arrays with pointer members. An alternative is to flatten the struct into separate kernel arguments, which avoids the deep copy entirely at the cost of longer argument lists.
Passing multi-dimensional arrays to kernels
CUDA kernels cannot accept C-style 2D arrays (float matrix[M][N]) as parameters because the compiler needs to know N at compile time to compute row offsets. Three practical approaches exist:
Flat pointer with manual indexing. Pass a float* and the dimensions. Index as matrix[row * cols + col]. Simple, no alignment guarantees.
Pitched pointer from cudaMallocPitch. Pass the float*, the pitch, and the dimensions. Index with byte arithmetic as shown earlier. Best memory access performance.
Array of row pointers. Allocate an array of float* on the device where each element points to the start of a row. This lets you use matrix[row][col] syntax inside the kernel but requires an extra indirection per access and a separate allocation for the pointer array. Rarely worth the convenience.
For most production code, the pitched pointer approach wins because it combines alignment with a single contiguous allocation.
Memory pools and stream-ordered allocation
Starting with CUDA 11.2, memory pools (cudaMallocAsync / cudaFreeAsync) eliminate the overhead of repeated cudaMalloc/cudaFree calls within a kernel dispatch loop.
Traditional allocation is synchronous. Every cudaMalloc call synchronizes with the device and goes through the OS memory allocator. In a loop that launches thousands of small kernels, allocation overhead can dominate wall-clock time.
// Traditional: synchronous, high overhead per iteration
for (int i = 0; i < iterations; i++) {
cudaMalloc(&d_temp, tempBytes);
kernel<<<grid, block>>>(d_input, d_temp, d_output);
cudaFree(d_temp);
}
// Pool-based: asynchronous, near-zero overhead per iteration
cudaMemPool_t pool;
cudaDeviceGetDefaultMemPool(&pool, device);
for (int i = 0; i < iterations; i++) {
cudaMallocAsync(&d_temp, tempBytes, stream);
kernel<<<grid, block, 0, stream>>>(d_input, d_temp, d_output);
cudaFreeAsync(d_temp, stream);
}
cudaStreamSynchronize(stream);
cudaMallocAsync returns immediately and records the allocation as a stream-ordered operation. The runtime reuses freed blocks from the pool without returning memory to the OS. The first call may allocate from the OS, but subsequent calls of the same size hit the pool cache.
Key properties of memory pools:
- Stream ordering. The allocation is visible only after preceding operations in the stream complete. Frees are deferred until preceding operations finish.
- Reuse. Freed blocks stay in the pool. A subsequent cudaMallocAsync of the same or smaller size reuses an existing block instead of calling into the driver.
- Cross-stream visibility. An allocation is only guaranteed valid in the stream that allocated it; using it from another stream requires event-based ordering, and access from other devices must be granted with cudaMemPoolSetAccess.
- Trim control. Call cudaMemPoolTrimTo to release unused pool memory back to the OS when memory pressure is high.
Fragmentation
GPU memory fragmentation follows the same principles as CPU heap fragmentation, but the consequences are sharper because GPU memory is typically smaller and has no swap space.
External fragmentation occurs when free memory exists but is scattered in non-contiguous blocks. A 2 GB allocation fails even though 3 GB is free across dozens of small holes. This happens when you repeatedly allocate and free buffers of different sizes.
Internal fragmentation is the waste from alignment padding (like the 112 bytes per row in pitched allocations) or from pool allocators rounding up to a size class.
Strategies to reduce fragmentation:
| Strategy | Effect |
|---|---|
| Allocate all large buffers at startup | Prevents interleaving of different-lifetime allocations |
| Use memory pools (cudaMallocAsync) | Pool reuse avoids returning memory to the OS and re-fragmenting |
| Reuse scratch buffers across kernel launches | One allocation serves many launches |
| Prefer power-of-two sizes | Reduces the number of distinct size classes the allocator must track |
| Monitor with cudaMemGetInfo | Check free/total memory before large allocations to fail gracefully |
⚠ Unlike CPU programs, GPU programs cannot defragment by moving allocations. There is no garbage collector and no compaction. Once memory is fragmented, the only fix is to free everything and reallocate.
In practice
Use cudaMallocPitch for any 2D data accessed by kernels. The alignment overhead is small compared to the coalescing benefit. Flat allocations for 2D data are a common source of subtle performance loss that does not show up as an error, only as unexpectedly low throughput.
Prefer dynamic shared memory when the tile size varies. Hardcoding __shared__ float tile[32][32] ties your kernel to one block configuration. Dynamic shared memory lets you tune block size without recompiling.
Avoid deep copies when possible. Restructure your data to pass flat arrays as separate kernel arguments. If deep copy is unavoidable, encapsulate the allocation/copy/free sequence in a helper function so every call site does not repeat the multi-step dance.
Switch to cudaMallocAsync for allocation-heavy workloads. If you profile and find that cudaMalloc/cudaFree account for measurable wall-clock time, memory pools are the fix. The API change is minimal: add Async and pass a stream.
Allocate early, free late. The simplest fragmentation prevention strategy is to allocate your working set at program startup and reuse it for the lifetime of the application. This is not always possible, but when it is, it eliminates an entire class of problems.
What comes next
This article covered the allocation patterns that go beyond cudaMalloc: pitched 2D and 3D memory for aligned access, dynamic shared memory for runtime-sized scratch space, deep copies for structures with pointer members, and memory pools for high-frequency allocation workloads.
The next article moves to specialized read-only memory spaces. Texture and constant memory in CUDA covers __constant__ memory for broadcast-style reads, texture objects for spatially coherent access with hardware interpolation, and when these specialized paths outperform plain global memory loads.