
Texture and constant memory: specialized caches

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article assumes you have read the following:

  • CUDA memory hierarchy for the full picture of registers, shared memory, L1/L2 caches, constant memory, texture memory, and global memory.

You should understand that global memory latency runs 200 to 800 cycles, that shared memory provides fast block-scoped storage, and that the L1/L2 caches are hardware-managed. This article focuses on two specialized memory spaces that sit outside the standard global/shared/register model: constant memory and texture memory. Both are read-only on the device side, both have dedicated caches, and both are worth understanding even though modern GPUs have reduced the cases where they outperform plain global reads.

Constant memory: 64 KB of broadcast power

Constant memory is a 64 KB region of device memory backed by a dedicated per-SM cache. The hardware optimizes for one specific access pattern: every thread in a warp reads the same address in the same cycle.

When that pattern holds, a single cache read is broadcast to all 32 threads simultaneously. One memory transaction serves the entire warp. When the pattern does not hold (threads read different addresses), the accesses serialize. A warp reading 32 distinct constant memory addresses issues 32 sequential reads, which is dramatically slower than a single coalesced global memory load.

sequenceDiagram
  participant W as Warp (32 threads)
  participant CC as Constant Cache
  participant CM as Constant Memory (DRAM)

  Note over W: All threads request address 0x100
  W->>CC: Read 0x100 - single request
  CC-->>W: Broadcast to all 32 threads - 1 cycle
  Note over W: Result: 1 transaction, 1 cycle

  Note over W: Threads request 32 different addresses
  W->>CC: Read 0x100 - thread 0
  CC-->>W: Serve thread 0
  W->>CC: Read 0x104 - thread 1
  CC-->>W: Serve thread 1
  W->>CC: Read 0x108 - thread 2
  CC-->>W: Serve thread 2
  Note over W: ... repeats 32 times - serialized
  Note over W: Result: 32 transactions, 32 cycles

Declaring and using constant memory

Constant memory variables are declared at file scope with __constant__ and populated from the host with cudaMemcpyToSymbol:

#define KERNEL_RADIUS 2
#define KERNEL_SIZE (2 * KERNEL_RADIUS + 1)

// 5x5 convolution filter weights in constant memory
__constant__ float cFilter[KERNEL_SIZE * KERNEL_SIZE];

__global__ void convolve2D(const float* input, float* output,
                           int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int ky = -KERNEL_RADIUS; ky <= KERNEL_RADIUS; ky++) {
        for (int kx = -KERNEL_RADIUS; kx <= KERNEL_RADIUS; kx++) {
            int sx = min(max(x + kx, 0), width - 1);
            int sy = min(max(y + ky, 0), height - 1);
            int filterIdx = (ky + KERNEL_RADIUS) * KERNEL_SIZE
                          + (kx + KERNEL_RADIUS);
            sum += input[sy * width + sx] * cFilter[filterIdx];
        }
    }
    output[y * width + x] = sum;
}

// Host side
void launchConvolution(const float* d_input, float* d_output,
                       const float* h_filter, int w, int h) {
    cudaMemcpyToSymbol(cFilter, h_filter,
                       KERNEL_SIZE * KERNEL_SIZE * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((w + 15) / 16, (h + 15) / 16);
    convolve2D<<<grid, block>>>(d_input, d_output, w, h);
}

Every thread in the warp executing the inner loop reads the same cFilter[filterIdx] value at the same time. All 32 threads in the warp step through the filter weights in lockstep. This is the ideal constant memory pattern: one broadcast per weight, zero serialization.

When constant memory wins and when it loses

The rule is simple:

  • All threads read the same address: 1 transaction, broadcast to 32 threads. Faster than global memory.
  • Threads read different addresses: up to 32 serialized transactions. Slower than a single coalesced global memory load.

Lookup tables indexed by threadIdx.x are the classic antipattern. If thread 0 reads table[0] while thread 31 reads table[31], constant memory serializes all 32 reads. Global memory (or shared memory) handles this in a single coalesced transaction.
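A minimal sketch of the antipattern and its fix, assuming a 32-entry table (the kernel and variable names are illustrative):

```cuda
// The same 32-entry lookup table accessed two ways. Reading
// cTable[threadIdx.x] forces the constant cache to serialize the warp,
// while staging the table in shared memory keeps the accesses parallel.
__constant__ float cTable[32];

__global__ void lookupSerialized(float* out) {
    // Antipattern: 32 distinct constant addresses -> up to 32 serialized reads
    out[threadIdx.x] = cTable[threadIdx.x];
}

__global__ void lookupStaged(const float* __restrict__ gTable, float* out) {
    __shared__ float sTable[32];
    if (threadIdx.x < 32)
        sTable[threadIdx.x] = gTable[threadIdx.x];  // one coalesced load
    __syncthreads();
    out[threadIdx.x] = sTable[threadIdx.x];         // bank-conflict-free
}
```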

Texture memory: read-only cache with 2D spatial locality

Texture memory uses a dedicated read-only cache that is separate from the L1 cache. Its design targets spatial locality in two dimensions. When a thread reads pixel (x, y), the texture cache prefetches a 2D tile of nearby texels, anticipating that neighboring threads will read (x+1, y), (x, y+1), and other spatially close addresses.

This is fundamentally different from the L1 cache, which is optimized for 1D linear access. An image kernel that reads a 3x3 neighborhood around each pixel generates a stride-N access pattern in linear memory (where N is the image width). The L1 cache line covers consecutive bytes, so the rows above and below the current pixel likely miss. The texture cache, organized in 2D tiles, keeps all three rows hot.

graph TD
  subgraph TextureCache["Texture Cache (2D Tiled)"]
      T1["Tile (0,0)<br/>32x32 texels"]
      T2["Tile (1,0)<br/>32x32 texels"]
      T3["Tile (0,1)<br/>32x32 texels"]
      T4["Tile (1,1)<br/>32x32 texels"]
  end

  subgraph L1Cache["L1 Cache (1D Linear)"]
      L1a["Cache line 0<br/>128 bytes contiguous"]
      L1b["Cache line 1<br/>128 bytes contiguous"]
      L1c["Cache line 2<br/>128 bytes contiguous"]
      L1d["Cache line 3<br/>128 bytes contiguous"]
  end

  subgraph Access["Stencil Read Pattern: (x,y), (x+1,y), (x,y+1)"]
      P1["(x, y)"]
      P2["(x+1, y)"]
      P3["(x, y+1)"]
  end

  P1 -->|"Hit: same tile"| T1
  P2 -->|"Hit: same tile"| T1
  P3 -->|"Hit: same tile"| T1

  P1 -->|"Hit: same line"| L1a
  P2 -->|"Hit: same line"| L1a
  P3 -->|"Miss: different line"| L1c

  style T1 fill:#2d6a4f,stroke:#1b4332,color:#fff
  style T2 fill:#40916c,stroke:#2d6a4f,color:#fff
  style T3 fill:#40916c,stroke:#2d6a4f,color:#fff
  style T4 fill:#52b788,stroke:#40916c,color:#fff
  style L1a fill:#1d3557,stroke:#0d1b2a,color:#fff
  style L1b fill:#457b9d,stroke:#1d3557,color:#fff
  style L1c fill:#457b9d,stroke:#1d3557,color:#fff
  style L1d fill:#a8dadc,stroke:#457b9d,color:#000

Texture cache vs L1 cache

| Property | Texture Cache | L1 Cache |
| --- | --- | --- |
| Spatial layout | 2D tiled (typically 32x32 texels) | 1D linear (128-byte cache lines) |
| Best access pattern | 2D neighborhood reads | Linear/coalesced reads |
| Free hardware features | Interpolation, clamping, wrapping | None |
| Read-only enforcement | Yes (via texture objects) | No (read/write) |
| Separate from L1 | Yes, dedicated hardware | Shares SM resources |
| Stencil hit rate | High (neighbors in same tile) | Low (rows miss cache lines) |

Creating and using texture objects

Modern CUDA (compute capability 3.0+) uses texture objects rather than the deprecated texture references. Here is a complete example that samples a 2D image through a texture object:

#include <cuda_runtime.h>

__global__ void textureSample(cudaTextureObject_t tex, float* output,
                              int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // tex2D performs hardware interpolation if configured
    // +0.5f centers the sample on the texel
    float val = tex2D<float>(tex, x + 0.5f, y + 0.5f);

    // Read neighbors for a simple edge detection filter
    float left  = tex2D<float>(tex, x - 0.5f, y + 0.5f);
    float right = tex2D<float>(tex, x + 1.5f, y + 0.5f);
    float up    = tex2D<float>(tex, x + 0.5f, y - 0.5f);
    float down  = tex2D<float>(tex, x + 0.5f, y + 1.5f);

    // Laplacian edge detection
    output[y * width + x] = 4.0f * val - left - right - up - down;
}

cudaTextureObject_t createTextureObject(const float* h_image,
                                        int width, int height) {
    // Allocate CUDA array (hardware-optimized 2D layout)
    cudaChannelFormatDesc channelDesc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray_t cuArray;
    cudaMallocArray(&cuArray, &channelDesc, width, height);

    // Copy host image to CUDA array
    cudaMemcpy2DToArray(cuArray, 0, 0, h_image,
                        width * sizeof(float),
                        width * sizeof(float), height,
                        cudaMemcpyHostToDevice);

    // Configure texture descriptor
    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode     = cudaFilterModeLinear;  // HW interpolation
    texDesc.readMode       = cudaReadModeElementType;
    texDesc.normalizedCoords = false;

    // Configure resource descriptor
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cuArray;

    // Create texture object
    cudaTextureObject_t texObj;
    cudaCreateTextureObject(&texObj, &resDesc, &texDesc, nullptr);
    return texObj;
}

Key points about this code:

  • CUDA arrays (cudaMallocArray) store data in a hardware-optimized 2D tiled layout. You cannot access CUDA arrays with regular pointer arithmetic. They exist specifically for the texture unit.
  • tex2D<float>() reads through the texture cache and optionally performs bilinear interpolation in hardware (when filterMode is cudaFilterModeLinear). This interpolation is free: it does not consume CUDA cores.
  • Address modes (Clamp, Wrap, Mirror, Border) handle out-of-bounds coordinates in hardware. No branch, no bounds check in your kernel.

CUDA arrays and surface objects

CUDA arrays are opaque memory allocations with a tiled layout optimized for 2D/3D spatial locality. They are read-only when accessed through texture objects. If you need read-write access to a CUDA array, use surface objects:

// Write to a CUDA array through a surface object.
// computeSomething stands in for the real per-pixel computation.
__device__ float computeSomething(int x, int y) {
    return (float)(x + y);
}
__global__ void surfaceWrite(cudaSurfaceObject_t surf,
                             int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float value = computeSomething(x, y);
    surf2Dwrite(value, surf, x * sizeof(float), y);
}

Surface objects provide the same 2D locality benefits as texture objects but allow writes. They are useful for image processing pipelines where the output is also a 2D array that will be read as a texture in a subsequent pass.
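As a sketch of that ping-pong pattern (the kernel name and the host-side bookkeeping are illustrative, assuming two CUDA arrays with a texture object bound to one and a surface object bound to the other):

```cuda
// One pass of a hypothetical ping-pong pipeline: read the previous result
// through a texture object, write the new result through a surface object
// bound to the other CUDA array.
__global__ void blurPass(cudaTextureObject_t src, cudaSurfaceObject_t dst,
                         int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Clamped texture addressing handles the image borders for free
    float sum = tex2D<float>(src, x - 0.5f, y + 0.5f)
              + tex2D<float>(src, x + 0.5f, y + 0.5f)
              + tex2D<float>(src, x + 1.5f, y + 0.5f);
    surf2Dwrite(sum / 3.0f, dst, x * (int)sizeof(float), y);
}

// Host side: alternate which array is read and which is written each pass.
// for (int pass = 0; pass < numPasses; ++pass)
//     blurPass<<<grid, block>>>(tex[pass % 2], surf[(pass + 1) % 2], w, h);
```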

Specialized memory comparison

| Memory Type | Size | Read/Write | Broadcast | 2D Spatial Locality | Best Use Case |
| --- | --- | --- | --- | --- | --- |
| Constant | 64 KB | Read-only (device) | ✓ (warp-wide) | ✗ | Filter weights, physics constants, config params |
| Texture | Device DRAM | Read-only (device) | ✗ | ✓ (2D tiled cache) | Image processing, interpolation, stencil ops |
| Surface | Device DRAM | Read-write | ✗ | ✓ (2D tiled cache) | In-place image transforms, multi-pass pipelines |
| Global | 8-80 GB | Read-write | ✗ | ✗ (1D cache lines) | General-purpose data storage |
| Shared | Up to 228 KB/SM | Read-write | ✓ (same-address reads) | ✗ (programmer-managed) | Tiled algorithms, inter-thread communication |
| __ldg() | Device DRAM | Read-only (device) | ✗ | ✗ (uses texture cache, 1D lines) | Read-only global data, simple replacement for textures |

Worked example: constant memory for convolution weights

Consider a 5x5 convolution filter applied to a 1024x1024 image with a block size of 32x32 (1024 threads per block).

Setup: 25 filter weights. Each thread computes one output pixel and reads all 25 weights. A warp of 32 threads processes 32 adjacent pixels. In the inner loop, all 32 threads read the same filter weight cFilter[i] at the same iteration.

Constant memory path: All 32 threads request cFilter[0] at the same cycle. The constant cache serves one read and broadcasts to all 32 threads. Total transactions for all 25 weights across the warp: 25 transactions (one per unique weight).

Global memory path: If the weights lived in global memory as a plain float*, the hardware could still coalesce the reads since all 32 threads read the same address. On modern GPUs (Volta+), the L1 cache often handles this well. But the constant cache guarantees single-cycle broadcast with no cache line waste. Global memory loads a full 128-byte cache line (32 floats) to serve one 4-byte read, wasting 112 bytes of cache capacity. Constant memory loads exactly what is needed.

Transaction count comparison:

| Approach | Transactions per weight per warp | Cache waste per load | Total for 25 weights |
| --- | --- | --- | --- |
| Constant memory | 1 (broadcast) | 0 bytes | 25 |
| Global memory (L1 hit) | 1 (coalesced same-address) | 124 bytes | 25 |
| Global memory (L1 miss) | 1 (DRAM sector) | 28 bytes (32-byte sector) | 25 |

The transaction counts are similar, but constant memory wins on cache efficiency. The constant cache is dedicated and small, so filter weights do not compete with your image data for L1 space. In a real kernel, L1 pressure from image reads can evict the filter weights, causing repeated DRAM fetches. The constant cache prevents this entirely.
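For comparison, a sketch of the global-memory path just described (the kernel name is illustrative; KERNEL_RADIUS and KERNEL_SIZE are the macros from the earlier example): the weights arrive as a plain pointer, and the const __restrict__ hints let the compiler route the reads through the read-only cache, where they now compete with image data for cache capacity.

```cuda
// Variant of the earlier convolve2D with the filter weights in global
// memory instead of __constant__; the loop structure is unchanged.
__global__ void convolve2DGlobalWeights(const float* __restrict__ input,
                                        const float* __restrict__ weights,
                                        float* output,
                                        int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int ky = -KERNEL_RADIUS; ky <= KERNEL_RADIUS; ky++) {
        for (int kx = -KERNEL_RADIUS; kx <= KERNEL_RADIUS; kx++) {
            int sx = min(max(x + kx, 0), width - 1);
            int sy = min(max(y + ky, 0), height - 1);
            int filterIdx = (ky + KERNEL_RADIUS) * KERNEL_SIZE
                          + (kx + KERNEL_RADIUS);
            // Same-address reads across the warp, served by the L1 /
            // read-only cache rather than the dedicated constant cache
            sum += input[sy * width + sx] * weights[filterIdx];
        }
    }
    output[y * width + x] = sum;
}
```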

Worked example: texture cache hit rates for stencil access

Consider a kernel that reads three pixels per thread: (x, y), (x+1, y), and (x, y+1). The image is 2048x2048, float32.

Texture cache model: The texture cache stores data in 2D tiles, typically around 32x32 texels. A single tile covers a 32x32 pixel region.

Sequential thread mapping (thread i handles pixel (i % width, i / width) within a block): Threads within a warp process 32 horizontally adjacent pixels. For thread at (x, y):

  • (x, y): loads the tile containing this pixel.
  • (x+1, y): same tile (horizontal neighbor). ✓ Cache hit.
  • (x, y+1): same tile as long as y % 32 != 31. ✓ Cache hit for 31/32 rows.

For a warp reading 32 adjacent pixels in row y, all three reads per thread hit the same tile except at tile boundaries. Expected hit rate: the first access per tile is a miss (cold start), and subsequent accesses within the tile hit. For a 32x32 tile covering 1024 texels, one miss loads the tile and serves roughly 1024 subsequent reads. With 3 reads per thread and 32 threads per warp:

  • 96 total reads per warp.
  • Approximately 1 to 2 cache misses per warp (tile loads).
  • Hit rate: ~97-99%.

Checkerboard thread mapping (thread i handles pixel (2*(i%16), 2*(i/16)), skipping every other pixel): Now threads are spread across a wider area. Each thread still reads its three neighbors, but neighboring threads are 2 pixels apart. Within a 32x32 tile, the warp still stays inside the tile most of the time because the spread is only 2x.

  • Reads span a 32x2 pixel region horizontally (still within one tile width).
  • (x+1, y) and (x, y+1) neighbors remain within the tile.
  • Hit rate: ~95-97%, slightly lower due to reduced spatial density but still high.

The texture cache provides excellent hit rates for both patterns because its 2D tiling matches the 2D access locality. The L1 cache, organized in 128-byte linear cache lines, would suffer significantly worse hit rates for the (x, y+1) reads, which are 2048 floats (8192 bytes) away in linear memory, well beyond any cache line.

The __ldg() alternative and deprecation note

Starting with Kepler (compute capability 3.5), CUDA provides __ldg(), a built-in function that reads global memory through the read-only data cache (which is physically the same as the texture cache on most architectures):

float val = __ldg(&input[idx]);

This gives you the benefit of the texture cache path (non-polluting reads, potentially better cache behavior for read-only data) without the overhead of creating texture objects, allocating CUDA arrays, or managing texture descriptors.

On Volta and later architectures, the compiler automatically routes const __restrict__ pointer reads through the same read-only cache path. This means:

__global__ void kernel(const float* __restrict__ input, float* output) {
    // The compiler may use ldg automatically for reads from input
    output[threadIdx.x] = input[threadIdx.x] * 2.0f;
}

When __ldg() suffices vs when you still need texture objects:

| Feature | __ldg() / const __restrict__ | Texture Objects |
| --- | --- | --- |
| Read-only cache path | ✓ | ✓ |
| Hardware interpolation | ✗ | ✓ (free bilinear/trilinear) |
| Hardware address clamping/wrapping | ✗ | ✓ |
| 2D/3D spatial cache tiling | ✗ (still 1D cache line) | ✓ (CUDA array layout) |
| Setup complexity | None | Moderate (array + descriptors) |
| Works with regular pointers | ✓ | ✗ (needs CUDA array or pitched ptr) |

Practical guidance: For most modern CUDA code, __ldg() or const __restrict__ is enough. Use texture objects only when you need hardware interpolation, non-trivial address modes, or you have measured that 2D spatial cache tiling gives a significant hit rate improvement over the L1 cache for your specific access pattern.

The old texture<> reference API (bind/unbind texture references) was deprecated in CUDA 11 and removed entirely in CUDA 12.0. Always use texture objects (cudaTextureObject_t).

In practice

Constant and texture memory are specialized tools for specific access patterns. They are not general-purpose replacements for shared or global memory.

Production considerations:

  • Constant memory is for small, uniform data. Filter kernels, lookup tables read uniformly by all threads, physical constants, and configuration parameters. Never exceed 64 KB. If you need more, use global memory with __ldg().
  • Profile before switching to texture objects. On Volta+ GPUs, the L1 cache is large (up to 256 KB per SM configurable) and handles many 2D access patterns well enough. Measure with ncu to compare texture path vs global path hit rates before committing to the extra complexity.
  • Constant memory kills performance when divergent. If profiling shows high constant cache serialization (l1tex__t_sectors_pipe_lsu_mem_constant_op_ld.sum much higher than expected), switch to shared memory or a global memory lookup.
  • Texture interpolation is free compute. If your algorithm needs bilinear interpolation (image resizing, mesh sampling, volume rendering), the texture unit does it in hardware with zero core cycles. This alone justifies texture objects for graphics and image processing workloads.
  • Surface objects for ping-pong buffers. Multi-pass image filters that read the previous pass and write the current pass benefit from surface objects. Read through texture, write through surface, swap per pass.
  • CUDA arrays cannot be accessed from the host. You must copy data to/from them with cudaMemcpy2DToArray and cudaMemcpy2DFromArray. Factor the copy cost into your performance model.
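The copy-back and cleanup mentioned in the last bullets can be sketched as follows (the helper name is illustrative; texObj and cuArray are assumed to come from a creation routine like the earlier createTextureObject):

```cuda
// Copy a CUDA array's contents back to the host, then release the
// texture object and the array it views.
void readBackAndCleanup(cudaTextureObject_t texObj, cudaArray_t cuArray,
                        float* h_result, int width, int height) {
    // CUDA arrays are not host-addressable; go through the runtime API
    cudaMemcpy2DFromArray(h_result, width * sizeof(float),
                          cuArray, 0, 0,
                          width * sizeof(float), height,
                          cudaMemcpyDeviceToHost);

    // Destroy the view before freeing its backing storage
    cudaDestroyTextureObject(texObj);
    cudaFreeArray(cuArray);
}
```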

Common mistakes

| Mistake | Symptom | Fix |
| --- | --- | --- |
| Indexing constant memory with threadIdx.x | Severe serialization, kernel 10-30x slower than expected | Use shared memory or global memory for per-thread lookups |
| Exceeding 64 KB constant memory | Compilation error or silent truncation | Split data or move excess to global memory |
| Using deprecated texture references | Compilation warnings, removed in CUDA 12+ | Migrate to cudaTextureObject_t |
| Forgetting cudaDestroyTextureObject | Resource leak across kernel launches | Destroy texture objects when no longer needed |
| Assuming texture is always faster | No speedup or regression vs global reads | Profile with ncu; texture wins only for 2D spatial patterns |

What comes next

Constant and texture memory round out the GPU memory space toolkit. Each memory type has a specific access pattern where it excels and patterns where it actively hurts performance. The art of CUDA optimization is matching your data access patterns to the right memory space.

With the memory hierarchy fully covered, the next question is how to keep the GPU's compute units busy. The next article, Occupancy, register pressure, and performance tuning, covers occupancy calculations, register pressure, shared memory limits, and the systematic process of tuning kernel launch configurations for maximum throughput.
