Texture and constant memory: specialized caches
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read the following:
- CUDA memory hierarchy for the full picture of registers, shared memory, L1/L2 caches, constant memory, texture memory, and global memory.
You should understand that global memory latency runs 200 to 800 cycles, that shared memory provides fast block-scoped storage, and that the L1/L2 caches are hardware-managed. This article focuses on two specialized memory spaces that sit outside the standard global/shared/register model: constant memory and texture memory. Both are read-only on the device side, both have dedicated caches, and both are worth understanding even though modern GPUs have reduced the cases where they outperform plain global reads.
Constant memory: 64 KB of broadcast power
Constant memory is a 64 KB region of device memory backed by a dedicated per-SM cache. The hardware optimizes for one specific access pattern: every thread in a warp reads the same address in the same cycle.
When that pattern holds, a single cache read is broadcast to all 32 threads simultaneously. One memory transaction serves the entire warp. When the pattern does not hold (threads read different addresses), the accesses serialize. A warp reading 32 distinct constant memory addresses issues 32 sequential reads, which is dramatically slower than a single coalesced global memory load.
sequenceDiagram
participant W as Warp (32 threads)
participant CC as Constant Cache
participant CM as Constant Memory (DRAM)
Note over W: All threads request address 0x100
W->>CC: Read 0x100 - single request
CC-->>W: Broadcast to all 32 threads - 1 cycle
Note over W: Result: 1 transaction, 1 cycle
Note over W: Threads request 32 different addresses
W->>CC: Read 0x100 - thread 0
CC-->>W: Serve thread 0
W->>CC: Read 0x104 - thread 1
CC-->>W: Serve thread 1
W->>CC: Read 0x108 - thread 2
CC-->>W: Serve thread 2
Note over W: ... repeats 32 times - serialized
Note over W: Result: 32 transactions, 32 cycles
Declaring and using constant memory
Constant memory variables are declared at file scope with __constant__ and populated from the host with cudaMemcpyToSymbol:
#define KERNEL_RADIUS 2
#define KERNEL_SIZE (2 * KERNEL_RADIUS + 1)
// 5x5 convolution filter weights in constant memory
__constant__ float cFilter[KERNEL_SIZE * KERNEL_SIZE];
__global__ void convolve2D(const float* input, float* output,
int width, int height) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x >= width || y >= height) return;
float sum = 0.0f;
for (int ky = -KERNEL_RADIUS; ky <= KERNEL_RADIUS; ky++) {
for (int kx = -KERNEL_RADIUS; kx <= KERNEL_RADIUS; kx++) {
int sx = min(max(x + kx, 0), width - 1);
int sy = min(max(y + ky, 0), height - 1);
int filterIdx = (ky + KERNEL_RADIUS) * KERNEL_SIZE
+ (kx + KERNEL_RADIUS);
sum += input[sy * width + sx] * cFilter[filterIdx];
}
}
output[y * width + x] = sum;
}
// Host side
void launchConvolution(const float* d_input, float* d_output,
const float* h_filter, int w, int h) {
cudaMemcpyToSymbol(cFilter, h_filter,
KERNEL_SIZE * KERNEL_SIZE * sizeof(float));
dim3 block(16, 16);
dim3 grid((w + 15) / 16, (h + 15) / 16);
convolve2D<<<grid, block>>>(d_input, d_output, w, h);
}
Every thread in the warp executing the inner loop reads the same cFilter[filterIdx] value at the same time. All 32 threads in the warp step through the filter weights in lockstep. This is the ideal constant memory pattern: one broadcast per weight, zero serialization.
When constant memory wins and when it loses
The rule is simple:
- ✓ All threads read the same address: 1 transaction, broadcast to 32 threads. Faster than global memory.
- ✗ Threads read different addresses: up to 32 serialized transactions. Slower than a single coalesced global memory load.
Lookup tables indexed by threadIdx.x are the classic antipattern. If thread 0 reads table[0] while thread 31 reads table[31], constant memory serializes all 32 reads. Global memory (or shared memory) handles this in a single coalesced transaction.
Texture memory: read-only cache with 2D spatial locality
Texture memory uses a dedicated read-only cache that is separate from the L1 cache. Its design targets spatial locality in two dimensions. When a thread reads pixel (x, y), the texture cache prefetches a 2D tile of nearby texels, anticipating that neighboring threads will read (x+1, y), (x, y+1), and other spatially close addresses.
This is fundamentally different from the L1 cache, which is optimized for 1D linear access. An image kernel that reads a 3x3 neighborhood around each pixel generates a stride-N access pattern in linear memory (where N is the image width). The L1 cache line covers consecutive bytes, so the rows above and below the current pixel likely miss. The texture cache, organized in 2D tiles, keeps all three rows hot.
graph TD
subgraph TextureCache["Texture Cache (2D Tiled)"]
T1["Tile (0,0)<br/>32x32 texels"]
T2["Tile (1,0)<br/>32x32 texels"]
T3["Tile (0,1)<br/>32x32 texels"]
T4["Tile (1,1)<br/>32x32 texels"]
end
subgraph L1Cache["L1 Cache (1D Linear)"]
L1a["Cache line 0<br/>128 bytes contiguous"]
L1b["Cache line 1<br/>128 bytes contiguous"]
L1c["Cache line 2<br/>128 bytes contiguous"]
L1d["Cache line 3<br/>128 bytes contiguous"]
end
subgraph Access["Stencil Read Pattern: (x,y), (x+1,y), (x,y+1)"]
P1["(x, y)"]
P2["(x+1, y)"]
P3["(x, y+1)"]
end
P1 -->|"Hit: same tile"| T1
P2 -->|"Hit: same tile"| T1
P3 -->|"Hit: same tile"| T1
P1 -->|"Hit: same line"| L1a
P2 -->|"Hit: same line"| L1a
P3 -->|"Miss: different line"| L1c
style T1 fill:#2d6a4f,stroke:#1b4332,color:#fff
style T2 fill:#40916c,stroke:#2d6a4f,color:#fff
style T3 fill:#40916c,stroke:#2d6a4f,color:#fff
style T4 fill:#52b788,stroke:#40916c,color:#fff
style L1a fill:#1d3557,stroke:#0d1b2a,color:#fff
style L1b fill:#457b9d,stroke:#1d3557,color:#fff
style L1c fill:#457b9d,stroke:#1d3557,color:#fff
style L1d fill:#a8dadc,stroke:#457b9d,color:#000
Texture cache vs L1 cache
| Property | Texture Cache | L1 Cache |
|---|---|---|
| Spatial layout | 2D tiled (typically 32x32 texels) | 1D linear (128-byte cache lines) |
| Best access pattern | 2D neighborhood reads | Linear/coalesced reads |
| Free hardware features | Interpolation, clamping, wrapping | None |
| Read-only enforcement | Yes (via texture objects) | No (read/write) |
| Separate from L1 | Yes, dedicated hardware | Shares SM resources |
| Stencil hit rate | High (neighbors in same tile) | Low (rows miss cache lines) |
Creating and using texture objects
Modern CUDA (compute capability 3.0+) uses texture objects rather than the deprecated texture references. Here is a complete example that samples a 2D image through a texture object:
#include <cuda_runtime.h>
__global__ void textureSample(cudaTextureObject_t tex, float* output,
int width, int height) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x >= width || y >= height) return;
// tex2D performs hardware interpolation if configured
// +0.5f centers the sample on the texel
float val = tex2D<float>(tex, x + 0.5f, y + 0.5f);
// Read neighbors for a simple edge detection filter
float left = tex2D<float>(tex, x - 0.5f, y + 0.5f);
float right = tex2D<float>(tex, x + 1.5f, y + 0.5f);
float up = tex2D<float>(tex, x + 0.5f, y - 0.5f);
float down = tex2D<float>(tex, x + 0.5f, y + 1.5f);
// Laplacian edge detection
output[y * width + x] = 4.0f * val - left - right - up - down;
}
cudaTextureObject_t createTextureObject(const float* h_image,
int width, int height) {
// Allocate CUDA array (hardware-optimized 2D layout)
cudaChannelFormatDesc channelDesc =
cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
cudaArray_t cuArray;
cudaMallocArray(&cuArray, &channelDesc, width, height);
// Copy host image to CUDA array
cudaMemcpy2DToArray(cuArray, 0, 0, h_image,
width * sizeof(float),
width * sizeof(float), height,
cudaMemcpyHostToDevice);
// Configure texture descriptor
cudaTextureDesc texDesc = {};
texDesc.addressMode[0] = cudaAddressModeClamp;
texDesc.addressMode[1] = cudaAddressModeClamp;
texDesc.filterMode = cudaFilterModeLinear; // HW interpolation
texDesc.readMode = cudaReadModeElementType;
texDesc.normalizedCoords = false;
// Configure resource descriptor
cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeArray;
resDesc.res.array.array = cuArray;
// Create texture object
cudaTextureObject_t texObj;
cudaCreateTextureObject(&texObj, &resDesc, &texDesc, nullptr);
return texObj;
}
Key points about this code:
- CUDA arrays (cudaMallocArray) store data in a hardware-optimized 2D tiled layout. You cannot access CUDA arrays with regular pointer arithmetic; they exist specifically for the texture unit.
- tex2D<float>() reads through the texture cache and optionally performs bilinear interpolation in hardware (when filterMode is cudaFilterModeLinear). This interpolation is free: it does not consume CUDA cores.
- Address modes (Clamp, Wrap, Mirror, Border) handle out-of-bounds coordinates in hardware. No branch, no bounds check in your kernel.
CUDA arrays and surface objects
CUDA arrays are opaque memory allocations with a tiled layout optimized for 2D/3D spatial locality. They are read-only when accessed through texture objects. If you need read-write access to a CUDA array, use surface objects:
// Write to a CUDA array through a surface object
__global__ void surfaceWrite(cudaSurfaceObject_t surf,
int width, int height) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x >= width || y >= height) return;
float value = computeSomething(x, y); // placeholder for any per-pixel computation
surf2Dwrite(value, surf, x * sizeof(float), y);
}
Surface objects provide the same 2D locality benefits as texture objects but allow writes. They are useful for image processing pipelines where the output is also a 2D array that will be read as a texture in a subsequent pass.
Specialized memory comparison
| Memory Type | Size | Read/Write | Broadcast | 2D Spatial Locality | Best Use Case |
|---|---|---|---|---|---|
| Constant | 64 KB | Read-only (device) | ✓ (warp-wide) | ✗ | Filter weights, physics constants, config params |
| Texture | Device DRAM | Read-only (device) | ✗ | ✓ (2D tiled cache) | Image processing, interpolation, stencil ops |
| Surface | Device DRAM | Read-write | ✗ | ✓ (2D tiled cache) | In-place image transforms, multi-pass pipelines |
| Global | 8-80 GB | Read-write | ✗ | ✗ (1D cache lines) | General-purpose data storage |
| Shared | Up to 228 KB/SM | Read-write | ✗ | ✗ (programmer-managed) | Tiled algorithms, inter-thread communication |
| __ldg() | Device DRAM | Read-only (device) | ✗ | ✗ (uses texture cache) | Read-only global data, simple replacement for textures |
Worked example: constant memory for convolution weights
Consider a 5x5 convolution filter applied to a 1024x1024 image with a block size of 32x32 (1024 threads per block).
Setup: 25 filter weights. Each thread computes one output pixel and reads all 25 weights. A warp of 32 threads processes 32 adjacent pixels. In the inner loop, all 32 threads read the same filter weight cFilter[i] at the same iteration.
Constant memory path: All 32 threads request cFilter[0] at the same cycle. The constant cache serves one read and broadcasts to all 32 threads. Total transactions for all 25 weights across the warp: 25 transactions (one per unique weight).
Global memory path: If the weights lived in global memory as a plain float*, the hardware could still coalesce the reads since all 32 threads read the same address. On modern GPUs (Volta+), the L1 cache often handles this well. But the constant cache guarantees single-cycle broadcast with no cache line waste. Global memory loads a full 128-byte cache line (32 floats) to serve one 4-byte read, wasting 124 bytes of cache capacity. Constant memory loads exactly what is needed.
Transaction count comparison:
| Approach | Transactions per weight per warp | Cache waste per load | Total for 25 weights |
|---|---|---|---|
| Constant memory | 1 (broadcast) | 0 bytes | 25 |
| Global memory (L1 hit) | 1 (coalesced same-address) | 124 bytes | 25 |
| Global memory (L1 miss) | 1 (DRAM sector) | 28 bytes (32-byte sector) | 25 |
The transaction counts are similar, but constant memory wins on cache efficiency. The constant cache is dedicated and small, so filter weights do not compete with your image data for L1 space. In a real kernel, L1 pressure from image reads can evict the filter weights, causing repeated DRAM fetches. The constant cache prevents this entirely.
Worked example: texture cache hit rates for stencil access
Consider a kernel that reads three pixels per thread: (x, y), (x+1, y), and (x, y+1). The image is 2048x2048, float32.
Texture cache model: The texture cache stores data in 2D tiles, typically around 32x32 texels. A single tile covers a 32x32 pixel region.
Sequential thread mapping (thread i handles pixel (i % width, i / width) within a block): Threads within a warp process 32 horizontally adjacent pixels. For thread at (x, y):
- (x, y): loads the tile containing this pixel.
- (x+1, y): same tile (horizontal neighbor). ✓ Cache hit.
- (x, y+1): same tile as long as y % 32 != 31. ✓ Cache hit for 31/32 rows.
For a warp reading 32 adjacent pixels in row y, all three reads per thread hit the same tile except at tile boundaries. Expected hit rate: the first access per tile is a miss (cold start), and subsequent accesses within the tile hit. For a 32x32 tile covering 1024 texels, one miss loads the tile and serves roughly 1024 subsequent reads. With 3 reads per thread and 32 threads per warp:
- 96 total reads per warp.
- Approximately 1 to 2 cache misses per warp (tile loads).
- Hit rate: ~97-99%.
Checkerboard thread mapping (thread i handles pixel (2*(i%16), 2*(i/16)), skipping every other pixel): Now threads are spread across a wider area. Each thread still reads its three neighbors, but neighboring threads are 2 pixels apart. Within a 32x32 tile, the warp still stays inside the tile most of the time because the spread is only 2x.
- Reads span a 32x2 pixel region horizontally (still within one tile width).
- (x+1, y) and (x, y+1) neighbors remain within the tile.
- Hit rate: ~95-97%, slightly lower due to reduced spatial density but still high.
The texture cache provides excellent hit rates for both patterns because its 2D tiling matches the 2D access locality. The L1 cache, organized in 128-byte linear cache lines, would suffer significantly worse hit rates for the (x, y+1) reads, which are 2048 floats (8192 bytes) away in linear memory, well beyond any cache line.
The __ldg() alternative and deprecation note
Starting with Kepler (compute capability 3.5), CUDA provides __ldg(), a built-in function that reads global memory through the read-only data cache (which is physically the same as the texture cache on most architectures):
float val = __ldg(&input[idx]);
This gives you the benefit of the texture cache path (non-polluting reads, potentially better cache behavior for read-only data) without the overhead of creating texture objects, allocating CUDA arrays, or managing texture descriptors.
On Volta and later architectures, the compiler can automatically route const __restrict__ pointer reads through the same read-only cache path. This means:
__global__ void kernel(const float* __restrict__ input, float* output) {
// The compiler may use ldg automatically for reads from input
output[threadIdx.x] = input[threadIdx.x] * 2.0f;
}
When __ldg() suffices vs when you still need texture objects:
| Feature | __ldg() / const __restrict__ | Texture Objects |
|---|---|---|
| Read-only cache path | ✓ | ✓ |
| Hardware interpolation | ✗ | ✓ (free bilinear/trilinear) |
| Hardware address clamping/wrapping | ✗ | ✓ |
| 2D/3D spatial cache tiling | ✗ (still 1D cache line) | ✓ (CUDA array layout) |
| Setup complexity | None | Moderate (array + descriptors) |
| Works with regular pointers | ✓ | ✗ (needs CUDA array or pitched ptr) |
⚠ Practical guidance: For most modern CUDA code, __ldg() or const __restrict__ is enough. Use texture objects only when you need hardware interpolation, non-trivial address modes, or you have measured that 2D spatial cache tiling gives a significant hit rate improvement over the L1 cache for your specific access pattern.
The old texture<> reference API (bind/unbind texture references) was deprecated in CUDA 11 and removed in CUDA 12.0. Always use texture objects (cudaTextureObject_t).
In practice
Constant and texture memory are specialized tools for specific access patterns. They are not general-purpose replacements for shared or global memory.
Production considerations:
- Constant memory is for small, uniform data: filter kernels, lookup tables read uniformly by all threads, physical constants, and configuration parameters. Never exceed 64 KB; if you need more, use global memory with __ldg().
- Profile before switching to texture objects. On Volta+ GPUs, the L1 cache is large (up to 256 KB per SM, configurable) and handles many 2D access patterns well enough. Measure with ncu to compare texture-path and global-path hit rates before committing to the extra complexity.
- Constant memory kills performance when divergent. If profiling shows high constant cache serialization (l1tex__t_sectors_pipe_lsu_mem_constant_op_ld.sum much higher than expected), switch to shared memory or a global memory lookup.
- Texture interpolation is free compute. If your algorithm needs bilinear interpolation (image resizing, mesh sampling, volume rendering), the texture unit does it in hardware with zero core cycles. This alone justifies texture objects for graphics and image processing workloads.
- Surface objects for ping-pong buffers. Multi-pass image filters that read the previous pass and write the current pass benefit from surface objects: read through texture, write through surface, swap per pass.
- CUDA arrays cannot be accessed from the host. You must copy data to/from them with cudaMemcpy2DToArray and cudaMemcpy2DFromArray. Factor the copy cost into your performance model.
Common mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| Indexing constant memory with threadIdx.x | Severe serialization, kernel 10-30x slower than expected | Use shared memory or global memory for per-thread lookups |
| Exceeding 64 KB constant memory | Compilation error or silent truncation | Split data or move excess to global memory |
| Using deprecated texture references | Compilation warnings, removed in CUDA 12+ | Migrate to cudaTextureObject_t |
| Forgetting cudaDestroyTextureObject | Resource leak across kernel launches | Destroy texture objects when no longer needed |
| Assuming texture is always faster | No speedup or regression vs global reads | Profile with ncu; texture wins only for 2D spatial patterns |
What comes next
Constant and texture memory round out the GPU memory space toolkit. Each memory type has a specific access pattern where it excels and patterns where it actively hurts performance. The art of CUDA optimization is matching your data access patterns to the right memory space.
With the memory hierarchy fully covered, the next question is how to keep the GPU’s compute units busy. The next article, CUDA occupancy and performance tuning, covers occupancy calculations, register pressure, shared memory limits, and the systematic process of tuning kernel launch configurations for maximum throughput.