Advanced stream patterns and concurrent kernel execution

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article builds on CUDA streams and asynchronous execution and CUDA events and fine-grained synchronization. You need to understand how streams serialize work, how events create cross-stream dependencies, and how cudaMemcpyAsync enables overlap between transfers and compute. Everything here extends those primitives into patterns used in production inference servers, simulation pipelines, and multi-tenant GPU clusters.

When concurrent kernels actually overlap

Launching two kernels into separate streams does not guarantee they run simultaneously. The GPU scheduler assigns thread blocks to SMs. If kernel A consumes all available SMs, kernel B waits in the hardware queue until blocks from A retire and free resources. Concurrency requires that both kernels leave room for each other.

Three conditions must hold for two kernels to overlap on the same GPU:

  1. They are in different streams (or use the per-thread default stream with --default-stream per-thread).
  2. There is no implicit serialization through the legacy default stream.
  3. The combined resource demand (registers, shared memory, thread blocks) fits within the SM budget.

The third point is where most developers get surprised. A kernel that requests 48 KB of shared memory per block on an SM with 48 KB available will claim the entire SM. No second kernel can share that SM, regardless of how many streams you create.

SM resource sharing between concurrent kernels

Each SM has a fixed pool of resources: registers, shared memory, warp slots, and block slots. When the scheduler assigns a block from kernel A to an SM, it deducts that block’s resource usage. If enough resources remain, the scheduler can place a block from kernel B on the same SM.

Consider an A100 GPU with 108 SMs and a maximum of 32 blocks per SM. Two scenarios:

Scenario 1: Both kernels use 50% of SM resources. Kernel A launches 108 blocks, each using 64 registers per thread, 512 threads per block, and 24 KB shared memory. That consumes half the register file (64 × 512 = 32,768 of 65,536 registers) and half the shared memory on each SM. Kernel B has identical resource usage. The scheduler can co-locate one block from A and one block from B on each SM. Both kernels overlap. ✓

Scenario 2: Resources at the limit. Kernel A launches 108 blocks, each using 24 KB shared memory (50% of the 48 KB limit on many SMs). Kernel B also uses 24 KB. Together they need 48 KB, which is exactly the limit, so one block of each can co-locate only if no other resource runs out first. Registers can be that other resource: if kernel A uses 40 registers per thread with 512 threads per block, that is 20,480 registers per block, and an SM with 65,536 registers holds at most 3 such blocks. The binding constraint is whichever resource is exhausted first, so co-residency is limited and overlap is partial at best. ✓ (reduced)

Now raise kernel A’s shared memory to 32 KB per block. Kernel B also needs 32 KB. Together they need 64 KB. That exceeds the SM’s shared memory. No co-location. Kernel B’s blocks wait until A’s blocks complete. ✗
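These fit-or-not decisions come down to simple arithmetic. The sketch below is a back-of-the-envelope model, not the real scheduler (it ignores warp-slot limits and scheduling policy), using the A100-style per-SM limits from the scenarios above:

```cpp
#include <algorithm>

// Per-SM limits (A100-like numbers used in the text; adjust for your GPU)
const int REGS_PER_SM   = 65536;
const int SMEM_PER_SM   = 48 * 1024;
const int BLOCKS_PER_SM = 32;

struct KernelFootprint {
    int regsPerThread;
    int threadsPerBlock;
    int smemPerBlock;   // bytes, static + dynamic
};

// Blocks of kernel B that still fit on one SM after blocksA blocks of A
// are resident. Whichever resource runs out first sets the limit.
int coResidentBlocks(KernelFootprint a, int blocksA, KernelFootprint b) {
    int regsLeft  = REGS_PER_SM - blocksA * a.regsPerThread * a.threadsPerBlock;
    int smemLeft  = SMEM_PER_SM - blocksA * a.smemPerBlock;
    int slotsLeft = BLOCKS_PER_SM - blocksA;
    if (regsLeft < 0 || smemLeft < 0 || slotsLeft < 0) return -1;  // A alone does not fit
    int byRegs = regsLeft / (b.regsPerThread * b.threadsPerBlock);
    int bySmem = b.smemPerBlock > 0 ? smemLeft / b.smemPerBlock : slotsLeft;
    return std::min({byRegs, bySmem, slotsLeft});
}
```

With a 50% footprint (64 registers per thread, 512 threads, 24 KB), exactly one block of B fits beside one block of A; raise the shared memory to 32 KB per block and none does.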

The occupancy calculator (cudaOccupancyMaxActiveBlocksPerMultiprocessor) tells you how many blocks of a single kernel fit on one SM. For concurrent kernel planning, you need to check whether the combined resource footprint of both kernels fits.

// Query resource usage for concurrent kernel planning
cudaFuncAttributes attrA, attrB;
cudaFuncGetAttributes(&attrA, kernelA);
cudaFuncGetAttributes(&attrB, kernelB);

int regsA = attrA.numRegs;            // registers per thread
size_t smemA = attrA.sharedSizeBytes; // static shared memory per block
int regsB = attrB.numRegs;
size_t smemB = attrB.sharedSizeBytes;

printf("Kernel A: %d regs/thread, %zu bytes smem\n", regsA, smemA);
printf("Kernel B: %d regs/thread, %zu bytes smem\n", regsB, smemB);

// Check if combined static shared memory fits on one SM
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
size_t smemPerSM = prop.sharedMemPerMultiprocessor;
printf("Shared memory per SM: %zu bytes\n", smemPerSM);
printf("Combined: %zu bytes -> %s\n",
       smemA + smemB,
       (smemA + smemB <= smemPerSM) ? "can co-locate" : "cannot co-locate");

Multi-stream concurrent kernel benchmark

The following benchmark launches two kernels with tunable SM usage into separate streams and measures whether they overlap:

__global__ void busyKernel(float* out, int N, int iterations) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;
    float val = 1.0f;
    for (int i = 0; i < iterations; i++) {
        val = val * 1.001f + 0.001f;
    }
    out[idx] = val;
}

void benchmarkConcurrency(int blocksA, int blocksB, int threadsPerBlock) {
    // Size both buffers for the larger launch so every block of either
    // kernel does real work (otherwise extra blocks exit immediately)
    int N = (blocksA > blocksB ? blocksA : blocksB) * threadsPerBlock;
    float *dA, *dB;
    cudaMalloc(&dA, N * sizeof(float));
    cudaMalloc(&dB, N * sizeof(float));

    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);

    cudaEvent_t start, endA, endB, endBoth;
    cudaEventCreate(&start);
    cudaEventCreate(&endA);
    cudaEventCreate(&endB);
    cudaEventCreate(&endBoth);

    int iters = 100000;

    // Measure kernel A alone
    cudaEventRecord(start, streamA);
    busyKernel<<<blocksA, threadsPerBlock, 0, streamA>>>(dA, N, iters);
    cudaEventRecord(endA, streamA);
    cudaStreamSynchronize(streamA);
    float timeA;
    cudaEventElapsedTime(&timeA, start, endA);

    // Measure kernel B alone
    cudaEventRecord(start, streamB);
    busyKernel<<<blocksB, threadsPerBlock, 0, streamB>>>(dB, N, iters);
    cudaEventRecord(endB, streamB);
    cudaStreamSynchronize(streamB);
    float timeB;
    cudaEventElapsedTime(&timeB, start, endB);

    // Measure both concurrently
    cudaEventRecord(start, streamA);
    busyKernel<<<blocksA, threadsPerBlock, 0, streamA>>>(dA, N, iters);
    busyKernel<<<blocksB, threadsPerBlock, 0, streamB>>>(dB, N, iters);
    cudaEventRecord(endBoth, streamA);
    cudaStreamSynchronize(streamA);
    cudaStreamSynchronize(streamB);
    float timeBoth;
    cudaEventElapsedTime(&timeBoth, start, endBoth);

    printf("Kernel A alone: %.2f ms\n", timeA);
    printf("Kernel B alone: %.2f ms\n", timeB);
    printf("Both concurrent: %.2f ms\n", timeBoth);
    printf("Speedup: %.2fx (ideal 2.0x)\n", (timeA + timeB) / timeBoth);

    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    cudaEventDestroy(start);
    cudaEventDestroy(endA);
    cudaEventDestroy(endB);
    cudaEventDestroy(endBoth);
    cudaFree(dA);
    cudaFree(dB);
}

When blocksA + blocksB is less than or equal to the SM count, both kernels can run entirely in parallel and you see close to 2x speedup. When the total exceeds the SM count, the speedup drops because some blocks must wait for others to finish.
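This behavior can be captured in a crude wave model: assume each block occupies a full SM (no co-location) and all blocks run for equal time, so a kernel takes ceil(blocks / numSMs) "waves". A rough sketch, not a scheduler simulation:

```cpp
// Waves = rounds of resident blocks. Assumes one block per SM and equal
// per-block runtime -- a rough model of the benchmark above.
int waves(int blocks, int numSMs) {
    return (blocks + numSMs - 1) / numSMs;
}

double predictedSpeedup(int blocksA, int blocksB, int numSMs) {
    int tBoth = waves(blocksA + blocksB, numSMs);
    return double(waves(blocksA, numSMs) + waves(blocksB, numSMs)) / tBoth;
}
```

On a 108-SM GPU, 54 + 54 blocks predicts the ideal 2.0x; 108 + 108 blocks predicts 1.0x, matching the observation that concurrency stops helping throughput once the GPU is saturated.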

CUDA graphs: record once, replay many

Every kernel launch goes through the CUDA runtime: parameter packing, driver calls, hardware queue insertion. On a modern GPU this takes 5 to 20 microseconds per launch from the CPU side. For a pipeline of 10 small kernels, that is 50 to 200 microseconds of pure overhead per iteration. If each kernel runs for only 10 microseconds, launch overhead dominates actual compute.

CUDA graphs solve this by capturing a sequence of operations (kernel launches, memory copies, event records) into a dependency graph, then replaying the entire graph with a single API call. The driver pre-compiles the launch sequence into a hardware-level command buffer. Replay skips all the per-launch overhead.

The workflow has three phases:

  1. Capture: execute operations inside a stream capture region. The runtime records them as graph nodes instead of executing them.
  2. Instantiate: compile the graph into an executable form.
  3. Launch: submit the entire executable graph with one call.

graph TD
  A["cudaMemcpyAsync H2D"] --> B["kernelA preprocess"]
  B --> C["kernelB compute"]
  B --> D["kernelC auxiliary"]
  C --> E["kernelD reduce"]
  D --> E
  E --> F["cudaMemcpyAsync D2H"]

  style A fill:#636EFA,color:#fff
  style B fill:#EF553B,color:#fff
  style C fill:#EF553B,color:#fff
  style D fill:#AB63FA,color:#fff
  style E fill:#EF553B,color:#fff
  style F fill:#00CC96,color:#fff

This graph has six nodes. Nodes C and D are independent of each other (both depend only on B), so the runtime can schedule them concurrently. Node E depends on both C and D completing. When you call cudaGraphLaunch, the runtime submits all six operations in dependency order with a single driver call. For subsequent iterations, the same executable graph replays without re-recording.

Graph construction and launch

There are two ways to build a graph: stream capture (implicit) and explicit API construction. Stream capture is simpler and works with existing code:

cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStream_t stream;
cudaStreamCreate(&stream);

// Phase 1: Capture
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

cudaMemcpyAsync(d_input, h_input, size, cudaMemcpyHostToDevice, stream);
preprocessKernel<<<grid, block, 0, stream>>>(d_input, d_temp, N);

// Fork into two concurrent paths using events
cudaEvent_t fork;
cudaEventCreate(&fork);
cudaEventRecord(fork, stream);

cudaStream_t auxStream;
cudaStreamCreate(&auxStream);
cudaStreamWaitEvent(auxStream, fork, 0);

computeKernel<<<grid, block, 0, stream>>>(d_temp, d_output, N);
auxiliaryKernel<<<grid, block, 0, auxStream>>>(d_temp, d_aux, N);

// Join: main stream waits for auxiliary
cudaEvent_t join;
cudaEventCreate(&join);
cudaEventRecord(join, auxStream);
cudaStreamWaitEvent(stream, join, 0);

reduceKernel<<<grid, block, 0, stream>>>(d_output, d_aux, d_result, N);
cudaMemcpyAsync(h_result, d_result, resultSize, cudaMemcpyDeviceToHost, stream);

cudaStreamEndCapture(stream, &graph);

// Phase 2: Instantiate
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);  // CUDA 12+: cudaGraphInstantiate(&graphExec, graph, 0)

// Phase 3: Launch (repeat as many times as needed)
for (int iter = 0; iter < 1000; iter++) {
    cudaGraphLaunch(graphExec, stream);
}
cudaStreamSynchronize(stream);

// Cleanup
cudaGraphExecDestroy(graphExec);
cudaGraphDestroy(graph);
cudaEventDestroy(fork);
cudaEventDestroy(join);
cudaStreamDestroy(auxStream);
cudaStreamDestroy(stream);

The explicit API (cudaGraphCreate, cudaGraphAddKernelNode, cudaGraphAddDependencies) gives finer control but requires manually specifying every node and edge. Use stream capture unless you need to build graphs dynamically without executing the captured operations.

Why graphs reduce launch overhead

Without graphs, each kernel launch follows this path:

  1. CPU packs kernel parameters into a launch configuration struct.
  2. CPU calls into the CUDA driver.
  3. Driver validates parameters, allocates from the command buffer pool.
  4. Driver writes the command into the GPU’s hardware queue.
  5. GPU fetches and decodes the command.

Steps 1 through 4 take 5 to 20 microseconds per launch. For a 10-kernel pipeline, that is 50 to 200 microseconds of CPU-side overhead per iteration.

With a graph, steps 1 through 4 happen once during instantiation. Each cudaGraphLaunch call writes a single “replay” command to the hardware queue. The GPU already has the pre-compiled command sequence in device memory. Replay overhead is roughly 1 microsecond regardless of how many nodes the graph contains.

Example: graph launch overhead calculation

Problem. You have a pipeline of 10 kernels. Each kernel executes in 15 microseconds on the GPU. You run 1000 iterations. Standard per-launch overhead is 10 microseconds. Graph replay overhead is 1 microsecond.

Standard launches:

  • Per iteration: 10 kernels * (10 us launch + 15 us execute) = 250 us
  • Launch overhead per iteration: 10 * 10 = 100 us
  • Total for 1000 iterations: 1000 * 250 = 250,000 us = 250 ms
  • Total launch overhead: 1000 * 100 = 100,000 us = 100 ms (40% of wall time)

Graph approach:

  • Capture and instantiation (one-time): 10 * 10 = 100 us (plus instantiation, roughly 50 us) = 150 us
  • Per iteration: 1 us replay + 10 * 15 us execute = 151 us
  • Total for 1000 iterations: 150 + 1000 * 151 = 151,150 us = 151.15 ms
  • Total launch overhead: 150 + 1000 * 1 = 1,150 us = 1.15 ms (0.76% of wall time)

Savings: 98.85 ms of launch overhead eliminated. That is a 39.5% reduction in total wall-clock time.

The savings grow with the number of kernels per graph and the number of iterations. For inference servers running the same model thousands of times per second, graphs are not optional.
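The arithmetic above is easy to parameterize. A small helper, using the same assumed numbers (10 us per standard launch, 1 us per replay, roughly 50 us for instantiation), reproduces the worked example:

```cpp
// All times in microseconds. Assumes launch overhead and kernel execution
// serialize, matching the worked example above.
double standardTotal(int kernels, int iters, double launchUs, double execUs) {
    return iters * kernels * (launchUs + execUs);
}

double graphTotal(int kernels, int iters, double launchUs, double execUs,
                  double replayUs, double instantiateUs) {
    double oneTime = kernels * launchUs + instantiateUs;  // capture + instantiate
    return oneTime + iters * (replayUs + kernels * execUs);
}
```

For 10 kernels over 1000 iterations this gives 250,000 us standard versus 151,150 us with graphs, the 250 ms and 151.15 ms figures above.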

(The accompanying chart plots cumulative time per iteration: solid lines for total wall-clock time, launch overhead plus kernel execution; dotted lines for launch overhead alone.) By iteration 100, standard launching has accumulated 10,000 microseconds of launch overhead while graph replay has accumulated roughly 250 microseconds (the 150 microsecond one-time cost plus 1 microsecond per replay).

Graph limitations and when not to use them

Graphs are not free. Key constraints:

  • Static topology. The captured graph is fixed. If your pipeline changes shape between iterations (different number of kernels, different grid dimensions), you must re-capture and re-instantiate. This is cheaper than building from scratch but still costs more than a simple replay.
  • No host-side logic. You cannot put if statements or CPU-side decisions inside a graph. Host nodes exist (cudaGraphAddHostNode) but they execute on the CPU and add synchronization points.
  • Memory allocation. cudaMalloc inside a capture region is not supported in older CUDA versions. CUDA 11.4+ added cudaGraphAddMemAllocNode for in-graph allocation, but the memory layout is fixed at instantiation time.
  • Parameter updates. You can update kernel parameters in an existing executable graph with cudaGraphExecKernelNodeSetParams without re-instantiating. Use this when input pointers change between iterations but the graph structure stays the same.

MPS: Multi-Process Service

On a shared GPU cluster, multiple processes often submit work to the same GPU. Without MPS, each process gets exclusive access through context switching. The GPU time-slices between processes: run process A’s kernels, flush caches, switch to process B’s context, run B’s kernels, switch back. Context switches cost 25 to 50 microseconds and flush the L1 cache.

MPS (Multi-Process Service) merges multiple CUDA contexts into a single context managed by a daemon process. All client processes submit work through the MPS server, which feeds kernels into the GPU’s hardware queues directly. This enables true concurrent kernel execution across processes, just as if the kernels were launched from different streams within a single process.

Starting MPS:

# On the host machine
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d   # start MPS daemon in background

# Verify it is running
echo get_server_list | nvidia-cuda-mps-control

# Stop MPS when done
echo quit | nvidia-cuda-mps-control

With MPS enabled, two separate inference processes can share the GPU’s SMs concurrently. Process A’s model might use 40% of SMs while process B’s model uses 40%. Both execute simultaneously on different SMs. Without MPS, the GPU would alternate between them with context-switch overhead.

MPS constraints:

  • All clients must use the same GPU and the same CUDA version.
  • If any client triggers a fatal GPU error, all clients are affected.
  • MPS provides weaker memory isolation than separate contexts. On pre-Volta GPUs, MPS clients share a single GPU address space, so an out-of-bounds write in one process can corrupt another process’s GPU memory.
  • Volta and newer architectures give each MPS client its own GPU address space, restoring memory protection between clients, though a fatal error in one client still brings down the others.
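On Volta and newer, MPS can also cap each client's share of the SMs through execution resource provisioning. A sketch, with illustrative percentages and hypothetical binary names:

```shell
# Limit this client to roughly 40% of the GPU's SMs (Volta+ MPS)
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=40
./inference_server_a &      # hypothetical client process

# A second client with its own cap
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=40 ./inference_server_b &
```

Capping clients below 100% keeps each one from claiming every SM, which is exactly the co-residency condition concurrent kernels need.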

Compute preemption

Before Pascal (compute capability 6.0), a running kernel could not be interrupted. If kernel A was running on an SM, no other work could start on that SM until A’s thread blocks completed. This meant a long-running kernel could starve other streams, other processes, and even the display.

Pascal introduced compute preemption at the instruction level. The GPU can pause a running kernel mid-instruction, save its register state and shared memory, and schedule another kernel on those SMs. This enables:

  • Fair scheduling across multiple processes on shared GPUs.
  • Interactive responsiveness: a long compute kernel does not freeze the display because the scheduler can preempt it to run display-related work.
  • Debugging: breakpoints in cuda-gdb rely on preemption to pause a kernel and inspect state.

Preemption is not free. Saving and restoring register state takes 5 to 30 microseconds depending on the kernel’s register usage. The scheduler avoids preemption when possible and uses it only under resource pressure or explicit priority requests.

Priority streams

CUDA allows you to assign priorities to streams. Higher-priority streams preempt lower-priority ones at thread block boundaries (or at instruction boundaries on Pascal+ with compute preemption).

int leastPriority, greatestPriority;
cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
// leastPriority is a larger number (lower priority)
// greatestPriority is a smaller number (higher priority)
// Typically: leastPriority = 0, greatestPriority = -1

cudaStream_t highPriority, lowPriority;
cudaStreamCreateWithPriority(&highPriority, cudaStreamNonBlocking, greatestPriority);
cudaStreamCreateWithPriority(&lowPriority, cudaStreamNonBlocking, leastPriority);

// Low-priority background work
backgroundKernel<<<grid, block, 0, lowPriority>>>(d_data, N);

// High-priority latency-sensitive work
inferenceKernel<<<grid, block, 0, highPriority>>>(d_input, d_output, N);

When both streams have work queued, the scheduler gives SM resources to the high-priority stream first. As the high-priority kernel’s blocks complete and free SMs, the low-priority kernel’s blocks start filling in. If the high-priority stream gets new work while the low-priority kernel is running, the scheduler preempts low-priority blocks at the next preemption point.

This is critical for inference servers that handle both real-time requests (high priority) and batch processing (low priority) on the same GPU. The real-time path gets consistent low latency while batch work fills in gaps.

In practice

Start with profiling, not assumptions. Use Nsight Systems to visualize whether your kernels actually overlap. The timeline view shows concurrent kernel execution with color-coded bars per stream. If you see serialized execution despite separate streams, check resource usage with Nsight Compute.

Use graphs for repeated pipelines. Any pipeline that runs the same sequence of operations more than a few hundred times benefits from graph capture. Inference, iterative solvers, and simulation time-stepping are ideal candidates. Measure the improvement: if your kernels are already large (milliseconds each), launch overhead is negligible and graphs add complexity without benefit.

MPS for multi-tenant GPUs. If your cluster runs multiple inference models on the same GPU, MPS eliminates context-switch overhead. But test failure isolation: a memory error in one process kills all MPS clients. In high-reliability settings, you may prefer time-slicing with its overhead.

Priority streams for mixed workloads. Pair priority streams with MPS for the best multi-tenant latency. The real-time inference path gets high priority; background retraining gets low priority. Monitor tail latency, not just average, to ensure preemption is working.

Do not over-complicate concurrency. Two concurrent kernels that each use 50% of SMs give the same throughput as running them sequentially. Concurrency helps latency (both finish sooner) but not throughput. It helps throughput only when the concurrent kernels together utilize resources that would otherwise be idle.

What comes next

The next article covers performance case studies: real workloads analyzed end-to-end with profiling data, optimization decisions, and measured results across memory-bound, compute-bound, and latency-bound scenarios.
