
CUDA events and fine-grained synchronization

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article assumes you have read CUDA streams and asynchronous execution, where you learned how streams allow overlapping kernel execution with memory transfers. You need that foundation because events only make sense in the context of streams. You should also be comfortable with basic kernel launches and the cudaMemcpyAsync API.

Events as timestamps on the GPU timeline

A CUDA event is a marker that you insert into a stream. When the GPU reaches that marker during execution, it records a timestamp from a high-resolution clock on the device. That is all an event is: a point in a stream’s execution timeline with a recorded time.

The CPU never sees GPU clock cycles directly. The GPU runs asynchronously, and its internal clock is separate from the host system clock. Events bridge this gap. You record an event before a kernel, record another after, then ask the runtime for the elapsed time between them. The runtime reads both timestamps from the GPU clock and returns the difference in milliseconds.

This is fundamentally different from wrapping a kernel launch with clock() or std::chrono on the CPU. A kernel launch returns to the CPU almost immediately (it is asynchronous). Measuring wall-clock time around a launch measures only the launch overhead, not the actual kernel execution time. To measure real GPU execution time, you need GPU-side timestamps. That means events.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

An event starts as an unrecorded marker. You place it into a stream with cudaEventRecord. When the GPU processes all prior work in that stream and reaches the event, it writes a timestamp.

The event API

Six functions cover the entire event lifecycle:

| Function | Blocks CPU? | Blocks GPU stream? | Measures time? | Use case |
| --- | --- | --- | --- | --- |
| cudaEventCreate | No | No | No | Allocate event object |
| cudaEventRecord | No | No | No | Insert timestamp marker into stream |
| cudaEventSynchronize | Yes | No | No | CPU waits until event is recorded |
| cudaEventElapsedTime | No | No | Yes | Compute ms between two recorded events |
| cudaStreamWaitEvent | No | Yes | No | Make stream B wait for event from stream A |
| cudaEventDestroy | No | No | No | Free event resources |

cudaEventRecord(event, stream) inserts the event into the specified stream; passing 0 records it in the default stream. The GPU writes the timestamp when it reaches that point in the stream's queue.

cudaEventSynchronize(event) blocks the calling CPU thread until the GPU has actually recorded the event. This is the CPU-side wait. The GPU keeps running other streams in the meantime.

cudaEventElapsedTime(&ms, start, stop) computes the elapsed time in milliseconds between two events. Both events must have been recorded and completed. The resolution is approximately 0.5 microseconds on modern GPUs.

cudaStreamWaitEvent(stream, event, 0) is the cross-stream synchronization primitive. It tells a stream to pause until a specific event, recorded in any stream, has completed. The third argument is a flags field; pass 0 (cudaEventWaitDefault) in ordinary code. The call does not block the CPU. It inserts a dependency into the GPU stream.

Accurate kernel timing with events

Here is the canonical pattern for timing a kernel:

#include <cstdio>

__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int N = 100'000'000;  // 100M elements
    const size_t bytes = N * sizeof(float);

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Fill inputs (cudaMemset writes a single byte value to every byte,
    // so this just zero-fills the buffers; contents don't matter for timing)
    cudaMemset(d_a, 0, bytes);
    cudaMemset(d_b, 0, bytes);

    // Create events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (N + threads - 1) / threads;

    // Record start, launch kernel, record stop
    cudaEventRecord(start);
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);
    cudaEventRecord(stop);

    // Wait for stop event to be recorded
    cudaEventSynchronize(stop);

    // Compute elapsed time
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel time: %.2f ms\n", ms);

    // Cleanup
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}

The sequence matters. cudaEventRecord(start) goes before the kernel launch. cudaEventRecord(stop) goes after. Both are non-blocking on the CPU. The kernel launch is also non-blocking. The only blocking call is cudaEventSynchronize(stop), which waits for the GPU to finish the kernel and record the stop event.

The GPU timeline looks like this:

gantt
  title GPU Timeline: Event Timing
  dateFormat X
  axisFormat %s

  section Stream 0
  Record start event       :done, s1, 0, 1
  vectorAdd kernel         :active, s2, 1, 8
  Record stop event        :done, s3, 8, 9

The CPU returns from all three calls (record, launch, record) in microseconds. The actual GPU work takes milliseconds. cudaEventSynchronize is where the CPU blocks until the stop timestamp is written.

Why GPU timers beat CPU timers

Consider this incorrect approach:

auto cpu_start = std::chrono::high_resolution_clock::now();
vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);
auto cpu_stop = std::chrono::high_resolution_clock::now();
float cpu_ms = std::chrono::duration<float, std::milli>(cpu_stop - cpu_start).count();

This measures the time to push a kernel launch command into the driver queue. On a warm GPU, that is 5 to 15 microseconds regardless of whether the kernel runs for 1 ms or 100 ms. The CPU does not wait for the kernel to finish unless you add cudaDeviceSynchronize() before cpu_stop.

Even with cudaDeviceSynchronize, the CPU timer measures launch overhead plus kernel execution plus synchronization overhead. The event-based timer isolates only the GPU execution, with 0.5 microsecond resolution, directly from the device clock.
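If you do need a CPU timer (say, to deliberately include host-side overhead), the synchronization has to bracket the measurement on both sides. A sketch, reusing the kernel from the timing example above:

```cuda
// CPU timing is only meaningful if the GPU is idle at t0 and drained at t1.
cudaDeviceSynchronize();  // flush any previously queued work
auto t0 = std::chrono::high_resolution_clock::now();
vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);
cudaDeviceSynchronize();  // wait for the kernel to actually finish
auto t1 = std::chrono::high_resolution_clock::now();
float wall_ms = std::chrono::duration<float, std::milli>(t1 - t0).count();
// wall_ms = launch overhead + kernel execution + synchronization overhead
```

Note that even this correct version measures more than kernel execution, which is why events remain the preferred tool.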

Three reasons to prefer events:

  1. Accuracy. Events measure GPU clock cycles. CPU timers include driver overhead, OS scheduling jitter, and synchronization latency.
  2. Non-intrusiveness. Recording an event does not force synchronization. The GPU keeps running. CPU timers require cudaDeviceSynchronize, which drains the entire GPU pipeline and destroys concurrency.
  3. Stream-specific. Events can time individual streams. CPU synchronization waits for all streams, making it impossible to isolate one stream’s performance.

Example: computing achieved throughput

Suppose the vectorAdd kernel on 100M FP32 elements reports 5.23 ms via event timing. Each element performs one addition (one FLOP). The total work is 100,000,000 FLOPs.

Achieved throughput = FLOPs / time_in_seconds
                    = 100,000,000 / 0.00523 s
                    = 1.912e10 FLOP/s
                    = 19.12 GFLOP/s

For a memory-bound kernel like vector addition, the more relevant metric is effective bandwidth:

Bytes moved = 3 * 100,000,000 * 4 bytes = 1.2 GB   (read A, read B, write C)
Bandwidth   = 1.2 GB / 0.00523 s = 229.4 GB/s

On an A100 with 2039 GB/s peak bandwidth, that is 11.3% utilization. This tells you the kernel is not saturating memory bandwidth and there may be launch overhead or other inefficiencies dominating at this problem size.
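That arithmetic is easy to script once you have an event-measured time. A minimal pure-Python helper (the function names are my own, not part of any API):

```python
def achieved_gflops(flops, ms):
    """Achieved arithmetic rate in GFLOP/s: operations divided by seconds, scaled."""
    return flops / (ms / 1000.0) / 1e9

def effective_bandwidth_gb_s(n_elements, bytes_per_element, accesses_per_element, ms):
    """Effective bandwidth in GB/s: total bytes moved divided by seconds."""
    total_bytes = accesses_per_element * n_elements * bytes_per_element
    return total_bytes / (ms / 1000.0) / 1e9

n, ms = 100_000_000, 5.23
print(f"{achieved_gflops(n, ms):.2f} GFLOP/s")              # one add per element
print(f"{effective_bandwidth_gb_s(n, 4, 3, ms):.1f} GB/s")  # read A, read B, write C
```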

Event-based stream synchronization

Events become powerful when you need one stream to wait for another without blocking the CPU. The function cudaStreamWaitEvent creates a GPU-side dependency: stream B will not execute any commands enqueued after the wait call until the specified event in stream A has been recorded.

The problem: cross-stream dependencies

Consider a pipeline where stream A computes a result that stream B needs as input. Without synchronization:

cudaStream_t streamA, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);

// Stream A: compute intermediate result
computePartA<<<grid, block, 0, streamA>>>(d_input, d_intermediate);

// Stream B: needs d_intermediate from stream A
// BUG: stream B may start before stream A finishes!
computePartB<<<grid, block, 0, streamB>>>(d_intermediate, d_output);

Streams execute independently by default. Stream B’s kernel might read d_intermediate before stream A has finished writing it. The result is a data race: sometimes correct, sometimes garbage, depending on GPU scheduling.

The fix: cudaStreamWaitEvent

cudaStream_t streamA, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);

cudaEvent_t partA_done;
cudaEventCreate(&partA_done);

// Stream A: compute intermediate result
computePartA<<<grid, block, 0, streamA>>>(d_input, d_intermediate);
cudaEventRecord(partA_done, streamA);

// Stream B: wait for stream A's event, then proceed
cudaStreamWaitEvent(streamB, partA_done, 0);
computePartB<<<grid, block, 0, streamB>>>(d_intermediate, d_output);

// Cleanup
cudaEventDestroy(partA_done);
cudaStreamDestroy(streamA);
cudaStreamDestroy(streamB);

The event partA_done is recorded in stream A after computePartA. When stream B encounters the cudaStreamWaitEvent, it stalls until the GPU has recorded partA_done. Only then does stream B proceed with computePartB.

The CPU is never blocked. Both kernel launches return immediately. The dependency is entirely GPU-side.

Visualizing the dependency

Without the event dependency, both kernels can overlap (and race):

gantt
  title Without cudaStreamWaitEvent (data race!)
  dateFormat X
  axisFormat %s

  section Stream A
  computePartA         :active, a1, 0, 6

  section Stream B
  computePartB (RACE)  :crit, b1, 1, 5

With cudaStreamWaitEvent, stream B correctly waits:

gantt
  title With cudaStreamWaitEvent (correct)
  dateFormat X
  axisFormat %s

  section Stream A
  computePartA            :active, a1, 0, 6
  Record partA_done       :done, a2, 6, 7

  section Stream B
  Wait for partA_done     :done, b0, 0, 7
  computePartB            :active, b1, 7, 13

Stream B idles until stream A’s event fires. After that, execution proceeds with the guarantee that d_intermediate is fully written.

Building dependency chains across streams

Real pipelines often have multiple stages across multiple streams. Events let you build arbitrary DAGs (directed acyclic graphs) of dependencies.

Consider a three-stage pipeline: stream A produces data, stream B transforms it, and stream C consumes it. Each stage depends on the previous:

cudaStream_t sA, sB, sC;
cudaStreamCreate(&sA);
cudaStreamCreate(&sB);
cudaStreamCreate(&sC);

cudaEvent_t evA, evB;
cudaEventCreate(&evA);
cudaEventCreate(&evB);

// Stage 1: produce data in stream A
produce<<<grid, block, 0, sA>>>(d_raw, d_data);
cudaEventRecord(evA, sA);

// Stage 2: transform in stream B (waits for A)
cudaStreamWaitEvent(sB, evA, 0);
transform<<<grid, block, 0, sB>>>(d_data, d_transformed);
cudaEventRecord(evB, sB);

// Stage 3: consume in stream C (waits for B)
cudaStreamWaitEvent(sC, evB, 0);
consume<<<grid, block, 0, sC>>>(d_transformed, d_result);

// Cleanup
cudaEventDestroy(evA);
cudaEventDestroy(evB);
cudaStreamDestroy(sA);
cudaStreamDestroy(sB);
cudaStreamDestroy(sC);

The dependency chain is: produce (A) -> transform (B) -> consume (C). Each stream can also run independent work before or after the dependency point. For example, stream A could launch another kernel after recording evA, and that kernel would run concurrently with stream B’s transform.
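For instance, stream A can keep going after the event is recorded; the extra kernel below (independentWork and d_scratch are hypothetical names for illustration) overlaps with stream B's transform:

```cuda
produce<<<grid, block, 0, sA>>>(d_raw, d_data);
cudaEventRecord(evA, sA);                            // stream B may proceed past this point
independentWork<<<grid, block, 0, sA>>>(d_scratch);  // overlaps with transform in stream B
```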

You can also create fan-out dependencies. Record one event and have multiple streams wait on it:

cudaEventRecord(evA, sA);
cudaStreamWaitEvent(sB, evA, 0);  // B waits for A
cudaStreamWaitEvent(sC, evA, 0);  // C also waits for A

And fan-in: one stream waits for events from multiple streams:

cudaEventRecord(evA, sA);
cudaEventRecord(evB, sB);
cudaStreamWaitEvent(sC, evA, 0);  // C waits for A
cudaStreamWaitEvent(sC, evB, 0);  // C also waits for B

Stream C will not proceed until both A and B have reached their respective events. This pattern is common in parallel reduction pipelines where multiple partial results must be merged.

Python timing with CuPy events

CuPy provides an event API that mirrors the CUDA C++ interface:

import cupy as cp

n = 100_000_000
a = cp.ones(n, dtype=cp.float32)
b = cp.ones(n, dtype=cp.float32)

start = cp.cuda.Event()
stop = cp.cuda.Event()

start.record()
c = a + b  # Element-wise addition on GPU
stop.record()
stop.synchronize()

elapsed_ms = cp.cuda.get_elapsed_time(start, stop)
print(f"Kernel time: {elapsed_ms:.2f} ms")

# Compute achieved GFLOPs
flops = n  # one add per element
gflops = flops / (elapsed_ms / 1000) / 1e9
print(f"Achieved: {gflops:.2f} GFLOPs")

cp.cuda.Event() creates an event. record() places it in the current stream. stop.synchronize() blocks the Python thread until the event is recorded. cp.cuda.get_elapsed_time(start, stop) returns milliseconds, exactly like cudaEventElapsedTime.

For stream-level timing in CuPy:

stream = cp.cuda.Stream()

start = cp.cuda.Event()
stop = cp.cuda.Event()

with stream:
    start.record(stream)
    c = a + b
    stop.record(stream)

stop.synchronize()
print(f"Stream time: {cp.cuda.get_elapsed_time(start, stop):.2f} ms")

Event creation flags

cudaEventCreate accepts optional flags through cudaEventCreateWithFlags:

cudaEvent_t event;
cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
| Flag | Effect |
| --- | --- |
| cudaEventDefault | Full timing support. Default behavior. |
| cudaEventDisableTiming | Cannot be used with cudaEventElapsedTime. Faster to record. |
| cudaEventBlockingSync | cudaEventSynchronize yields the CPU thread instead of spinning. |
| cudaEventInterprocess | Event can be shared across processes via IPC. |

⚠ If you only need events for synchronization (not timing), always use cudaEventDisableTiming. Recording a timing-enabled event requires the GPU to flush its instruction pipeline to get an accurate timestamp. A non-timing event is a lightweight fence with no pipeline flush. In tight loops with many event records, the difference is measurable.

The cudaEventBlockingSync flag changes how cudaEventSynchronize waits. By default, it spin-waits (burns CPU cycles checking a flag). With cudaEventBlockingSync, the thread yields to the OS scheduler. Use this when the CPU has other useful work to do, or in power-sensitive environments.
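Putting this together, a synchronization-only event used as a cross-stream fence might look like this (a sketch, reusing the two-stream setup from the earlier examples):

```cuda
cudaEvent_t fence;
cudaEventCreateWithFlags(&fence, cudaEventDisableTiming);  // no timestamp, cheap to record

computePartA<<<grid, block, 0, streamA>>>(d_input, d_intermediate);
cudaEventRecord(fence, streamA);
cudaStreamWaitEvent(streamB, fence, 0);  // GPU-side dependency only
computePartB<<<grid, block, 0, streamB>>>(d_intermediate, d_output);

cudaEventDestroy(fence);  // never pass this event to cudaEventElapsedTime
```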

Common mistakes

Recording an event without synchronizing before reading elapsed time. If you call cudaEventElapsedTime before the stop event has been recorded, you get cudaErrorNotReady. Always call cudaEventSynchronize(stop) first.

Measuring launch overhead instead of kernel time. Wrapping a kernel launch with CPU timers and no cudaDeviceSynchronize gives you 5 to 15 microseconds regardless of the kernel’s actual execution time.

Forgetting the third argument to cudaStreamWaitEvent. Pass 0 (cudaEventWaitDefault). An invalid flags value makes the call fail with cudaErrorInvalidValue rather than silently doing nothing.

Using timing-enabled events for pure synchronization. Each timing-enabled event record forces a pipeline flush. If you are recording thousands of events per frame for dependency management, switch to cudaEventDisableTiming.

Assuming event timing works across devices. An event belongs to the device it was created on. cudaStreamWaitEvent can synchronize streams across different devices, but cudaEventElapsedTime fails if the two events were recorded on different devices. Sharing an event across processes requires cudaEventInterprocess, which in turn requires cudaEventDisableTiming.

In practice

Always use events for kernel timing. CPU timers are wrong by default because kernel launches are asynchronous. Events are the only reliable way to measure GPU execution time without introducing synchronization artifacts.

Use cudaEventDisableTiming for synchronization-only events. The performance difference matters when you have many events per frame. Profile it: on Ampere GPUs, a non-timing event record is roughly 2x faster than a timing-enabled one.

Prefer cudaStreamWaitEvent over cudaDeviceSynchronize. Device synchronization drains every stream on the GPU, destroying all concurrency. cudaStreamWaitEvent is surgical: only the target stream stalls, and only until the specific event fires. Other streams keep running.

Record events at natural boundaries. Place events between computational stages, not inside tight loops. An event per kernel launch in a 1000-kernel pipeline is fine. An event per thread block is not (you would need host-side logic that defeats the purpose).

Warm up before timing. The first kernel launch in a CUDA program incurs JIT compilation and context initialization overhead. Launch the kernel once, synchronize, then time the second launch. This gives you a representative measurement.
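A sketch of that warm-up-then-average loop, reusing the events and vectorAdd kernel from the timing example (the repetition count is arbitrary):

```cuda
// Warm up: the first launch pays JIT and context-initialization costs.
vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);
cudaDeviceSynchronize();

// Time several launches and average.
const int reps = 10;
float total_ms = 0.0f;
for (int r = 0; r < reps; ++r) {
    cudaEventRecord(start);
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    total_ms += ms;
}
printf("Average kernel time: %.3f ms\n", total_ms / reps);
```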

What comes next

The next article covers dynamic parallelism: launching kernels from within kernels on the GPU, recursive decomposition of work, and when nested launches help versus when they add overhead.
