CUDA streams and asynchronous execution

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article builds on debugging and profiling from the previous post. You should be comfortable launching kernels, checking errors with cudaGetLastError(), and reading Nsight Systems timelines. We will use those timelines extensively to visualize overlap.

The problem: serial execution wastes hardware

A typical CUDA workflow copies data to the GPU, runs a kernel, and copies results back. With the default API calls, each step waits for the previous one to finish. The PCIe bus sits idle during compute. The SMs sit idle during transfers. On an A100 with roughly 2 TB/s of device memory bandwidth but only about 32 GB/s of PCIe Gen4 bandwidth, this means the expensive compute hardware does nothing for most of the wall-clock time on transfer-heavy workloads.

Streams fix this. A CUDA stream is a sequence of operations that execute in order with respect to each other but can overlap with operations in other streams. Two kernels in two different streams can run concurrently. A memory transfer in one stream can overlap with a kernel in another. This is how production CUDA code achieves near-100% utilization of both the PCIe bus and the SMs simultaneously.

The default stream

Every CUDA API call that does not take a stream parameter uses the default stream (also called stream 0 or the NULL stream). Operations on the default stream are serialized with respect to all other streams on the device.

// All three operations run sequentially on the default stream
cudaMemcpy(d_input, h_input, size, cudaMemcpyHostToDevice);
myKernel<<<grid, block>>>(d_input, d_output, N);
cudaMemcpy(h_output, d_output, size, cudaMemcpyDeviceToHost);

The cudaMemcpy call is synchronous: it blocks the CPU thread until the transfer completes. The kernel launch is asynchronous (returns immediately to the CPU) but the second cudaMemcpy cannot begin until the kernel finishes because both are on the default stream.

This gives you a timeline like:

gantt
  title Sequential Execution (Default Stream)
  dateFormat X
  axisFormat %s

  section Stream 0
  H2D Transfer     :a1, 0, 30
  Kernel Compute   :a2, after a1, 20
  D2H Transfer     :a3, after a2, 30

Total wall time: 80 ms. The SMs are idle for 60 ms (75% of the time).
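The idle fraction falls directly out of the stage times. A quick sanity check using the illustrative 30/20/30 ms figures from the timeline above:

```python
def sm_utilization(h2d_ms: float, kernel_ms: float, d2h_ms: float) -> float:
    """Fraction of sequential wall time the SMs spend computing."""
    total = h2d_ms + kernel_ms + d2h_ms
    return kernel_ms / total

# 30 ms H2D + 20 ms kernel + 30 ms D2H: SMs busy only 25% of the time
print(sm_utilization(30, 20, 30))  # 0.25
```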

Non-default streams

You create a stream with cudaStreamCreate and pass it to async API calls and kernel launches:

cudaStream_t stream1;
cudaStreamCreate(&stream1);

// Async copy: does NOT block the CPU
cudaMemcpyAsync(d_input, h_input, size, cudaMemcpyHostToDevice, stream1);

// Kernel launch on stream1
myKernel<<<grid, block, 0, stream1>>>(d_input, d_output, N);

// Async copy back
cudaMemcpyAsync(h_output, d_output, size, cudaMemcpyDeviceToHost, stream1);

// CPU continues here immediately; use synchronization when you need results
cudaStreamSynchronize(stream1);

cudaStreamDestroy(stream1);

Within a single stream, operations still execute in order. The benefit comes when you use multiple streams and split your data into chunks.

Pinned (page-locked) memory

There is a critical requirement for async transfers: the host memory must be pinned. Pageable memory (allocated with malloc or new) cannot be used with cudaMemcpyAsync for true asynchronous behavior. The driver must first copy pageable data to an internal pinned staging buffer, which serializes the transfer.

float* h_data;

// Pageable: cudaMemcpyAsync will silently fall back to synchronous
h_data = (float*)malloc(size);

// Pinned: enables true async transfer via DMA
cudaMallocHost(&h_data, size);

// Alternative with more control
cudaHostAlloc(&h_data, size, cudaHostAllocDefault);

Pinned memory is physically locked in RAM and cannot be swapped to disk. The GPU DMA engine can transfer data directly without CPU involvement. The tradeoff: pinned memory reduces the amount of pageable memory available to the OS. Allocating gigabytes of pinned memory on a system with limited RAM can cause swapping of other processes. A reasonable guideline is to keep pinned allocations under 50% of system RAM.
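As a rough budgeting aid, you can derive a cap from physical RAM before allocating pinned buffers. This is a sketch, assuming a Linux host where `os.sysconf` exposes page counts; the 50% fraction is the guideline above, not a hard limit:

```python
import os

def max_pinned_bytes(fraction: float = 0.5) -> int:
    """Conservative cap on total pinned allocations: a fraction of physical RAM."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    phys_pages = os.sysconf("SC_PHYS_PAGES")
    return int(page_size * phys_pages * fraction)

budget = max_pinned_bytes()
print(f"Pinned-memory budget: {budget / 2**30:.1f} GiB")
```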

Always pair cudaMallocHost with cudaFreeHost, not regular free. Mixing them is undefined behavior.

Synchronization functions

You need explicit synchronization to know when results are ready. CUDA provides several levels:

| Function | Scope | Blocks CPU | Blocks device | When to use |
| --- | --- | --- | --- | --- |
| cudaStreamSynchronize(stream) | Single stream | ✓ | | Wait for one stream to finish |
| cudaDeviceSynchronize() | All streams | ✓ | | Wait for everything on the device |
| cudaStreamWaitEvent(stream, event) | Cross-stream | | ✓ (that stream) | Make one stream wait for another |
| cudaStreamQuery(stream) | Single stream | | | Non-blocking check if stream is done |
| cudaEventSynchronize(event) | Single event | ✓ | | Wait for a specific recorded event |

cudaDeviceSynchronize() is the sledgehammer. It waits for every stream on the device to complete. Use it sparingly because it destroys all concurrency. In production code, prefer cudaStreamSynchronize or event-based synchronization for fine-grained control.

Overlapping transfers and compute

The real payoff of streams is running a kernel on chunk N while simultaneously transferring chunk N+1. This requires:

  1. Two or more streams
  2. Pinned host memory
  3. Data split into chunks
  4. A GPU with a copy engine separate from the compute engine (every GPU since Fermi)

Here is the pattern with two streams:

gantt
  title Overlapped Execution (2 Streams, 4 Chunks)
  dateFormat X
  axisFormat %s

  section Stream 1
  H2D chunk 0    :a1, 0, 8
  Kernel chunk 0 :a2, after a1, 5
  D2H chunk 0    :a3, after a2, 8
  H2D chunk 2    :a5, after a3, 8
  Kernel chunk 2 :a6, after a5, 5
  D2H chunk 2    :a7, after a6, 8

  section Stream 2
  H2D chunk 1    :b1, after a1, 8
  Kernel chunk 1 :b2, after b1, 5
  D2H chunk 1    :b3, after b2, 8
  H2D chunk 3    :b5, after b3, 8
  Kernel chunk 3 :b6, after b5, 5
  D2H chunk 3    :b7, after b6, 8

In the overlapped timeline, the H2D transfer for chunk 1 overlaps with the kernel execution for chunk 0. The D2H transfer for chunk 0 overlaps with the kernel on chunk 1. The total wall time drops significantly compared to the sequential case.

Sequential vs overlapped: measured comparison

The following code demonstrates both approaches and uses CUDA events to measure elapsed time:

#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(1); \
    } \
} while (0)

__global__ void processKernel(float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float val = input[idx];
        // Simulate some compute work
        for (int i = 0; i < 100; i++) {
            val = sinf(val) * cosf(val) + val;
        }
        output[idx] = val;
    }
}

int main() {
    const int N = 1 << 22;        // ~4M elements
    const int SIZE = N * sizeof(float);
    const int NUM_STREAMS = 2;
    const int CHUNK = N / NUM_STREAMS;
    const int CHUNK_SIZE = CHUNK * sizeof(float);

    // Pinned host memory
    float *h_input, *h_output;
    CHECK(cudaMallocHost(&h_input, SIZE));
    CHECK(cudaMallocHost(&h_output, SIZE));

    // Initialize
    for (int i = 0; i < N; i++) h_input[i] = (float)i;

    // Device memory
    float *d_input, *d_output;
    CHECK(cudaMalloc(&d_input, SIZE));
    CHECK(cudaMalloc(&d_output, SIZE));

    // --- Sequential version ---
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpy(d_input, h_input, SIZE, cudaMemcpyHostToDevice));
    processKernel<<<(N + 255) / 256, 256>>>(d_input, d_output, N);
    CHECK(cudaMemcpy(h_output, d_output, SIZE, cudaMemcpyDeviceToHost));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float seqMs = 0;
    CHECK(cudaEventElapsedTime(&seqMs, start, stop));
    printf("Sequential: %.2f ms\n", seqMs);

    // --- Overlapped version with streams ---
    cudaStream_t streams[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; i++) {
        CHECK(cudaStreamCreate(&streams[i]));
    }

    CHECK(cudaEventRecord(start));
    for (int i = 0; i < NUM_STREAMS; i++) {
        int offset = i * CHUNK;
        CHECK(cudaMemcpyAsync(d_input + offset, h_input + offset,
                              CHUNK_SIZE, cudaMemcpyHostToDevice, streams[i]));
        processKernel<<<(CHUNK + 255) / 256, 256, 0, streams[i]>>>(
            d_input + offset, d_output + offset, CHUNK);
        CHECK(cudaMemcpyAsync(h_output + offset, d_output + offset,
                              CHUNK_SIZE, cudaMemcpyDeviceToHost, streams[i]));
    }
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float overlapMs = 0;
    CHECK(cudaEventElapsedTime(&overlapMs, start, stop));
    printf("Overlapped (%d streams): %.2f ms\n", NUM_STREAMS, overlapMs);
    printf("Speedup: %.2fx\n", seqMs / overlapMs);

    // Cleanup
    for (int i = 0; i < NUM_STREAMS; i++) {
        CHECK(cudaStreamDestroy(streams[i]));
    }
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFreeHost(h_input));
    CHECK(cudaFreeHost(h_output));
    CHECK(cudaFree(d_input));
    CHECK(cudaFree(d_output));

    return 0;
}

Compile and run:

nvcc -O2 -o stream_overlap stream_overlap.cu
./stream_overlap

On an A100, typical output for 4M elements:

Sequential: 12.45 ms
Overlapped (2 streams): 7.82 ms
Speedup: 1.59x

The speedup depends on the ratio of transfer time to compute time. When they are roughly equal, two streams approach 2x. When one dominates, the benefit shrinks.

Ping-pong buffer pattern

For continuous streaming workloads (real-time video, sensor data, inference servers), the ping-pong pattern uses two buffers and two streams in alternation. While one buffer is being processed on the GPU, the other is receiving new data from the host.

graph LR
  subgraph Frame N
  A["Buffer A
H2D Transfer"] --> B["Buffer A
Kernel Exec"] --> C["Buffer A
D2H Transfer"]
  end
  subgraph Frame N+1
  D["Buffer B
H2D Transfer"] --> E["Buffer B
Kernel Exec"] --> F["Buffer B
D2H Transfer"]
  end

  B -.->|overlaps| D
  C -.->|overlaps| E

  style A fill:#636EFA,stroke:#4850B5,color:#fff
  style B fill:#636EFA,stroke:#4850B5,color:#fff
  style C fill:#636EFA,stroke:#4850B5,color:#fff
  style D fill:#EF553B,stroke:#C44230,color:#fff
  style E fill:#EF553B,stroke:#C44230,color:#fff
  style F fill:#EF553B,stroke:#C44230,color:#fff

Here is a minimal implementation of the pattern:

#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(1); \
    } \
} while (0)

__global__ void inferenceKernel(float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = input[idx] * 2.0f + 1.0f;
    }
}

int main() {
    const int N = 1 << 20;
    const int SIZE = N * sizeof(float);
    const int NUM_FRAMES = 100;

    // Two pinned host buffers (ping and pong)
    float *h_in[2], *h_out[2];
    float *d_in[2], *d_out[2];
    cudaStream_t streams[2];

    for (int i = 0; i < 2; i++) {
        CHECK(cudaMallocHost(&h_in[i], SIZE));
        CHECK(cudaMallocHost(&h_out[i], SIZE));
        CHECK(cudaMalloc(&d_in[i], SIZE));
        CHECK(cudaMalloc(&d_out[i], SIZE));
        CHECK(cudaStreamCreate(&streams[i]));
    }

    // Initialize both input buffers
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < N; j++)
            h_in[i][j] = (float)(i * N + j);

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));

    for (int frame = 0; frame < NUM_FRAMES; frame++) {
        int buf = frame % 2;       // current buffer
        cudaStream_t s = streams[buf];

        // Wait for this stream's previous work to complete
        // before overwriting its host buffer
        CHECK(cudaStreamSynchronize(s));

        // Simulate: fill h_in[buf] with new data for this frame
        // In real code, this would be a sensor read or network recv

        CHECK(cudaMemcpyAsync(d_in[buf], h_in[buf], SIZE,
                              cudaMemcpyHostToDevice, s));
        inferenceKernel<<<(N + 255) / 256, 256, 0, s>>>(
            d_in[buf], d_out[buf], N);
        CHECK(cudaMemcpyAsync(h_out[buf], d_out[buf], SIZE,
                              cudaMemcpyDeviceToHost, s));
    }

    CHECK(cudaDeviceSynchronize());
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("Ping-pong: %d frames in %.2f ms (%.2f ms/frame)\n",
           NUM_FRAMES, ms, ms / NUM_FRAMES);

    for (int i = 0; i < 2; i++) {
        CHECK(cudaStreamDestroy(streams[i]));
        CHECK(cudaFreeHost(h_in[i]));
        CHECK(cudaFreeHost(h_out[i]));
        CHECK(cudaFree(d_in[i]));
        CHECK(cudaFree(d_out[i]));
    }
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));

    return 0;
}

The key insight: cudaStreamSynchronize(s) at the top of each iteration ensures the previous frame on that buffer is done before the CPU overwrites h_in[buf]. But the other stream is still running in parallel. This gives you continuous overlap with only two buffers.

The default stream trap

A common and frustrating pitfall: mixing the default stream with non-default streams kills concurrency.

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Stream 1: async copy + kernel
cudaMemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, s1);
kernelA<<<grid, block, 0, s1>>>(d_a, N);

// BUG: this copy uses the default stream (no stream argument)
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

// Stream 2: kernel that we hoped would overlap with stream 1
kernelB<<<grid, block, 0, s2>>>(d_b, N);

The cudaMemcpy on the default stream acts as a barrier. It waits for all prior work across all streams to finish, then blocks all subsequent work until it completes. The Nsight Systems timeline will show a flat sequential execution despite using two streams.

Fix: always use cudaMemcpyAsync with an explicit stream when you need concurrency.

// Fixed: every operation has an explicit stream
cudaMemcpyAsync(d_a, h_a, size, cudaMemcpyHostToDevice, s1);
kernelA<<<grid, block, 0, s1>>>(d_a, N);

cudaMemcpyAsync(d_b, h_b, size, cudaMemcpyHostToDevice, s2);
kernelB<<<grid, block, 0, s2>>>(d_b, N);

You can also compile with --default-stream per-thread to give each CPU thread its own default stream, avoiding the implicit synchronization. But explicit stream management is clearer and more portable.

CuPy streams in Python

CuPy provides stream support that mirrors the CUDA C++ API:

import cupy as cp
import numpy as np
import time

n = 10_000_000

# Pinned host memory via CuPy
h_input = cp.cuda.alloc_pinned_memory(n * 4)
h_input_array = np.frombuffer(h_input, dtype=np.float32, count=n)
h_input_array[:] = np.random.randn(n).astype(np.float32)

# --- Sequential ---
start = time.perf_counter()
d_arr = cp.asarray(h_input_array)          # H2D
d_result = cp.sin(d_arr) * cp.cos(d_arr)   # Compute
h_result = cp.asnumpy(d_result)             # D2H
cp.cuda.Device().synchronize()
seq_time = time.perf_counter() - start

# --- With streams ---
stream1 = cp.cuda.Stream(non_blocking=True)
stream2 = cp.cuda.Stream(non_blocking=True)

half = n // 2
start = time.perf_counter()

with stream1:
    d_a = cp.asarray(h_input_array[:half])
    r_a = cp.sin(d_a) * cp.cos(d_a)
    h_a = cp.asnumpy(r_a)

with stream2:
    d_b = cp.asarray(h_input_array[half:])
    r_b = cp.sin(d_b) * cp.cos(d_b)
    h_b = cp.asnumpy(r_b)

stream1.synchronize()
stream2.synchronize()
overlap_time = time.perf_counter() - start

print(f"Sequential: {seq_time*1000:.2f} ms")
print(f"2 streams:  {overlap_time*1000:.2f} ms")
print(f"Speedup:    {seq_time/overlap_time:.2f}x")

CuPy’s Stream context manager routes all operations within the with block to that stream. The non_blocking=True flag creates a stream that does not synchronize with the default stream, equivalent to cudaStreamNonBlocking in C++. One caveat: cp.asnumpy into a regular (pageable) NumPy array blocks by default, so the example above overlaps the host-to-device transfers with compute but not the copies back; for full overlap, copy results into pinned host buffers, as in the C++ examples.

Overlap visualization

Consider a Gantt-style view of how operations overlap as you increase the number of streams, for a workload with equal transfer and compute times.

With 1 stream, total time is 80 ms. With 2 streams, the overlapped region reduces total time to roughly 55 ms. With 4 streams, we approach roughly 42 ms. The improvement shows diminishing returns because the hardware has a limited number of copy engines: older consumer GPUs often expose a single copy engine (so H2D and D2H cannot overlap each other), while data center GPUs typically have separate engines for each direction.

Example: overlap calculation

Problem. You have 1 GB of data to process. Stage times: H2D = 30 ms, kernel = 20 ms, D2H = 30 ms. Sequential total = 80 ms. You split the data into 4 equal chunks and use 2 streams. What is the ideal overlapped time?

Analysis. Each chunk: H2D = 7.5 ms, kernel = 5 ms, D2H = 7.5 ms (= 20 ms per chunk). With 2 streams, one stream processes chunk 0 while the other starts chunk 1. The overlap depends on the hardware’s ability to run a copy engine and the compute engine simultaneously.

The ideal pipeline fills all engines. Stream 1 starts H2D for chunk 0 at t=0. Stream 2 starts H2D for chunk 1 at t=7.5 (after stream 1’s H2D finishes, since both share the H2D engine). Stream 1’s kernel starts at t=7.5 and overlaps with stream 2’s H2D.

The pipeline depth is: first chunk runs fully (20 ms), then each subsequent chunk adds only the longest stage (7.5 ms for H2D or D2H). For 4 chunks with 2 streams:

  • Startup: first chunk completes all 3 stages = 7.5 + 5 + 7.5 = 20 ms
  • Remaining 3 chunks add the bottleneck stage each = 3 * 7.5 = 22.5 ms
  • Total = 20 + 22.5 = 42.5 ms
  • Speedup = 80 / 42.5 = 1.88x

In practice, you will see slightly less than this due to stream scheduling overhead and contention on the copy engine. The rule of thumb: 2 streams captures most of the benefit, 4 streams captures nearly all of it, and beyond 4 you rarely see meaningful improvement.
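The arithmetic above generalizes to a simple model: total time equals the first chunk passing through every stage, plus one bottleneck stage per remaining chunk. A sketch of that model, assuming ideal scheduling and one copy engine per direction:

```python
def ideal_pipeline_ms(h2d_ms, kernel_ms, d2h_ms, chunks):
    """Ideal pipelined wall time when work is split into equal chunks."""
    stages = (h2d_ms / chunks, kernel_ms / chunks, d2h_ms / chunks)
    startup = sum(stages)       # first chunk runs all three stages back to back
    bottleneck = max(stages)    # each later chunk adds only the slowest stage
    return startup + (chunks - 1) * bottleneck

print(ideal_pipeline_ms(30, 20, 30, 1))  # 80.0  (sequential baseline)
print(ideal_pipeline_ms(30, 20, 30, 4))  # 42.5  (the worked example above)
```

The same function reproduces the 2-stream figure of roughly 55 ms with chunks=2, so it is a handy first estimate before profiling.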

Example: default stream trap in practice

Problem. A developer has two independent kernels and wants them to overlap:

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Prepare data for kernel B using a utility function
// that internally calls cudaMemcpy (default stream!)
prepareData(d_b, h_b, size);  // uses cudaMemcpy inside

kernelA<<<grid, block, 0, s1>>>(d_a, N);
kernelB<<<grid, block, 0, s2>>>(d_b, N);

Why no overlap? The prepareData function uses cudaMemcpy (synchronous, default stream). The default stream waits for all prior work to finish and blocks all subsequent work. Even though kernelA and kernelB are in different streams, the hidden cudaMemcpy in prepareData serializes everything.

Fix:

void prepareDataAsync(float* d, float* h, size_t size, cudaStream_t stream) {
    cudaMemcpyAsync(d, h, size, cudaMemcpyHostToDevice, stream);
}

prepareDataAsync(d_b, h_b, size, s2);  // explicit stream, no barrier
kernelA<<<grid, block, 0, s1>>>(d_a, N);
kernelB<<<grid, block, 0, s2>>>(d_b, N);

The lesson: audit every function in your pipeline for hidden default-stream calls. A single cudaMemcpy, cudaMalloc, or cudaMemset on the default stream will serialize your entire concurrent pipeline. Nsight Systems makes these barriers visible as gaps in the timeline.

In practice

Start with 2 streams. Two streams are enough to saturate one H2D copy engine and one compute engine simultaneously. Adding more helps only when you have multiple copy engines or your kernel is so short that scheduling overhead matters.

Always use pinned memory with async transfers. Forgetting this is the number one reason developers do not see any speedup from streams. The transfer falls back to synchronous and the carefully constructed pipeline collapses.

Profile with Nsight Systems, not wall clock. The timeline view shows exactly where overlap happens and where gaps exist. Wall clock timing can be misleading because it includes CPU overhead, driver latency, and JIT compilation on first launch.

Watch for false dependencies. Any operation on the default stream serializes everything. Library calls (cuBLAS, cuDNN, cuFFT) accept stream parameters. Pass your stream to every library call. Check wrapper functions for hidden synchronous calls.

Ping-pong for latency-sensitive workloads. If you are processing a continuous data feed (camera frames, network packets, audio buffers), the ping-pong pattern gives you a steady-state pipeline with minimal latency. Two buffers, two streams, alternating every frame.

What comes next

The next article covers CUDA events and synchronization: recording events, measuring kernel timing with sub-millisecond precision, inter-stream dependencies with cudaStreamWaitEvent, and building dependency graphs for complex multi-kernel pipelines.
