CUDA events and fine-grained synchronization
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read CUDA streams and asynchronous execution, where you learned how streams allow overlapping kernel execution with memory transfers. You need that foundation because events only make sense in the context of streams. You should also be comfortable with basic kernel launches and the cudaMemcpyAsync API.
Events as timestamps on the GPU timeline
A CUDA event is a marker that you insert into a stream. When the GPU reaches that marker during execution, it records a timestamp from a high-resolution clock on the device. That is all an event is: a point in a stream’s execution timeline with a recorded time.
The CPU never sees GPU clock cycles directly. The GPU runs asynchronously, and its internal clock is separate from the host system clock. Events bridge this gap. You record an event before a kernel, record another after, then ask the runtime for the elapsed time between them. The runtime reads both timestamps from the GPU clock and returns the difference in milliseconds.
This is fundamentally different from wrapping a kernel launch with clock() or std::chrono on the CPU. A kernel launch returns to the CPU almost immediately (it is asynchronous). Measuring wall-clock time around a launch measures only the launch overhead, not the actual kernel execution time. To measure real GPU execution time, you need GPU-side timestamps. That means events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
An event starts as an unrecorded marker. You place it into a stream with cudaEventRecord. When the GPU processes all prior work in that stream and reaches the event, it writes a timestamp.
The event API
Six functions cover the entire event lifecycle:

| Function | Blocks CPU? | Blocks GPU stream? | Measures time? | Use case |
|---|---|---|---|---|
| cudaEventCreate | No | No | No | Allocate event object |
| cudaEventRecord | No | No | No | Insert timestamp marker into stream |
| cudaEventSynchronize | Yes | No | No | CPU waits until event is recorded |
| cudaEventElapsedTime | No | No | Yes | Compute ms between two recorded events |
| cudaStreamWaitEvent | No | Yes | No | Make stream B wait for event from stream A |
| cudaEventDestroy | No | No | No | Free event resources |
cudaEventRecord(event, stream) inserts the event into the specified stream; passing 0 targets the default stream. The GPU records the timestamp when it reaches that point in the stream’s queue.
cudaEventSynchronize(event) blocks the calling CPU thread until the GPU has actually recorded the event. This is the CPU-side wait. The GPU keeps running other streams in the meantime.
cudaEventElapsedTime(&ms, start, stop) computes the elapsed time in milliseconds between two events. Both events must have been recorded and completed. The resolution is approximately 0.5 microseconds on modern GPUs.
cudaStreamWaitEvent(stream, event, 0) is the cross-stream synchronization primitive. It tells a stream to pause until a specific event, recorded in any stream, has completed. The third argument is a flags parameter; pass 0 for the default behavior. This call does not block the CPU. It inserts a dependency into the GPU stream.
Accurate kernel timing with events
Here is the canonical pattern for timing a kernel:
#include <cstdio>
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
c[i] = a[i] + b[i];
}
}
int main() {
const int N = 100'000'000; // 100M elements
const size_t bytes = N * sizeof(float);
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
// Zero-fill the inputs; the values do not affect timing.
// (cudaMemset sets individual bytes, so it cannot write an
// arbitrary float pattern like 1.0f anyway.)
cudaMemset(d_a, 0, bytes);
cudaMemset(d_b, 0, bytes);
// Create events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
int threads = 256;
int blocks = (N + threads - 1) / threads;
// Record start, launch kernel, record stop
cudaEventRecord(start);
vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);
cudaEventRecord(stop);
// Wait for stop event to be recorded
cudaEventSynchronize(stop);
// Compute elapsed time
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel time: %.2f ms\n", ms);
// Cleanup
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
return 0;
}
The sequence matters. cudaEventRecord(start) goes before the kernel launch. cudaEventRecord(stop) goes after. Both are non-blocking on the CPU. The kernel launch is also non-blocking. The only blocking call is cudaEventSynchronize(stop), which waits for the GPU to finish the kernel and record the stop event.
The GPU timeline looks like this:
gantt
    title GPU Timeline: Event Timing
    dateFormat X
    axisFormat %s
    section Stream 0
    Record start event :done, s1, 0, 1
    vectorAdd kernel   :active, s2, 1, 8
    Record stop event  :done, s3, 8, 9
The CPU returns from all three calls (record, launch, record) in microseconds. The actual GPU work takes milliseconds. cudaEventSynchronize is where the CPU blocks until the stop timestamp is written.
Why GPU timers beat CPU timers
Consider this incorrect approach:
auto cpu_start = std::chrono::high_resolution_clock::now();
vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);
auto cpu_stop = std::chrono::high_resolution_clock::now();
float cpu_ms = std::chrono::duration<float, std::milli>(cpu_stop - cpu_start).count();
This measures the time to push a kernel launch command into the driver queue. On a warm GPU, that is 5 to 15 microseconds regardless of whether the kernel runs for 1 ms or 100 ms. The CPU does not wait for the kernel to finish unless you add cudaDeviceSynchronize() before cpu_stop.
Even with cudaDeviceSynchronize, the CPU timer measures launch overhead plus kernel execution plus synchronization overhead. The event-based timer isolates only the GPU execution, with 0.5 microsecond resolution, directly from the device clock.
Three reasons to prefer events:
- Accuracy. Events measure GPU clock cycles. CPU timers include driver overhead, OS scheduling jitter, and synchronization latency.
- Non-intrusiveness. Recording an event does not force synchronization. The GPU keeps running. CPU timers require cudaDeviceSynchronize, which drains the entire GPU pipeline and destroys concurrency.
- Stream-specificity. Events can time individual streams. CPU synchronization waits for all streams, making it impossible to isolate one stream’s performance.
Example: computing achieved throughput
Suppose the vectorAdd kernel on 100M FP32 elements reports 5.23 ms via event timing. Each element performs one addition (one FLOP). The total work is 100,000,000 FLOPs.
Achieved throughput = FLOPs / time_in_seconds
                    = 100,000,000 / 0.00523 s
                    ≈ 19.12 GFLOP/s
For a memory-bound kernel like vector addition, the more relevant metric is effective bandwidth:
Bytes moved = 3 * 100,000,000 * 4 bytes = 1.2 GB (read A, read B, write C)
Bandwidth = 1.2 GB / 0.00523 s = 229.4 GB/s
On an A100 with 2039 GB/s peak bandwidth, that is 11.3% utilization. This tells you the kernel is not saturating memory bandwidth and there may be launch overhead or other inefficiencies dominating at this problem size.
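The arithmetic above folds neatly into a small helper. The sketch below is plain Python (the function name and parameter defaults are my own, chosen to match the vectorAdd example and the A100 figure used in the text):

```python
def kernel_metrics(n_elements, elapsed_ms, flops_per_elem=1,
                   bytes_per_elem=12, peak_gbps=2039.0):
    """Derive throughput figures from an event-timed kernel run.

    bytes_per_elem=12 models vectorAdd: read a, read b, write c (3 x 4 bytes).
    peak_gbps defaults to the A100 peak bandwidth quoted in the text.
    """
    seconds = elapsed_ms / 1000.0
    gflops = n_elements * flops_per_elem / seconds / 1e9
    gbps = n_elements * bytes_per_elem / seconds / 1e9
    utilization_pct = 100.0 * gbps / peak_gbps
    return gflops, gbps, utilization_pct

# The 5.23 ms measurement from the text:
gflops, gbps, util = kernel_metrics(100_000_000, 5.23)
print(f"{gflops:.2f} GFLOP/s, {gbps:.1f} GB/s, {util:.1f}% of peak")
```

Running this reproduces the numbers above: roughly 19.12 GFLOP/s, 229.4 GB/s, and 11.3% of peak bandwidth.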
Event-based stream synchronization
Events become powerful when you need one stream to wait for another without blocking the CPU. The function cudaStreamWaitEvent creates a GPU-side dependency: stream B will not execute any commands enqueued after the wait call until the specified event in stream A has been recorded.
The problem: cross-stream dependencies
Consider a pipeline where stream A computes a result that stream B needs as input. Without synchronization:
cudaStream_t streamA, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);
// Stream A: compute intermediate result
computePartA<<<grid, block, 0, streamA>>>(d_input, d_intermediate);
// Stream B: needs d_intermediate from stream A
// BUG: stream B may start before stream A finishes!
computePartB<<<grid, block, 0, streamB>>>(d_intermediate, d_output);
Streams execute independently by default. Stream B’s kernel might read d_intermediate before stream A has finished writing it. The result is a data race: sometimes correct, sometimes garbage, depending on GPU scheduling.
The fix: cudaStreamWaitEvent
cudaStream_t streamA, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);
cudaEvent_t partA_done;
cudaEventCreate(&partA_done);
// Stream A: compute intermediate result
computePartA<<<grid, block, 0, streamA>>>(d_input, d_intermediate);
cudaEventRecord(partA_done, streamA);
// Stream B: wait for stream A's event, then proceed
cudaStreamWaitEvent(streamB, partA_done, 0);
computePartB<<<grid, block, 0, streamB>>>(d_intermediate, d_output);
// Cleanup
cudaEventDestroy(partA_done);
cudaStreamDestroy(streamA);
cudaStreamDestroy(streamB);
The event partA_done is recorded in stream A after computePartA. When stream B encounters the cudaStreamWaitEvent, it stalls until the GPU has recorded partA_done. Only then does stream B proceed with computePartB.
The CPU is never blocked. Both kernel launches return immediately. The dependency is entirely GPU-side.
Visualizing the dependency
Without the event dependency, both kernels can overlap (and race):
gantt
    title Without cudaStreamWaitEvent (data race!)
    dateFormat X
    axisFormat %s
    section Stream A
    computePartA :active, a1, 0, 6
    section Stream B
    computePartB (RACE) :crit, b1, 1, 5
With cudaStreamWaitEvent, stream B correctly waits:
gantt
    title With cudaStreamWaitEvent (correct)
    dateFormat X
    axisFormat %s
    section Stream A
    computePartA :active, a1, 0, 6
    Record partA_done :done, a2, 6, 7
    section Stream B
    Wait for partA_done :done, b0, 0, 7
    computePartB :active, b1, 7, 13
Stream B idles until stream A’s event fires. After that, execution proceeds with the guarantee that d_intermediate is fully written.
Building dependency chains across streams
Real pipelines often have multiple stages across multiple streams. Events let you build arbitrary DAGs (directed acyclic graphs) of dependencies.
Consider a three-stage pipeline: stream A produces data, stream B transforms it, and stream C consumes it. Each stage depends on the previous:
cudaStream_t sA, sB, sC;
cudaStreamCreate(&sA);
cudaStreamCreate(&sB);
cudaStreamCreate(&sC);
cudaEvent_t evA, evB;
cudaEventCreate(&evA);
cudaEventCreate(&evB);
// Stage 1: produce data in stream A
produce<<<grid, block, 0, sA>>>(d_raw, d_data);
cudaEventRecord(evA, sA);
// Stage 2: transform in stream B (waits for A)
cudaStreamWaitEvent(sB, evA, 0);
transform<<<grid, block, 0, sB>>>(d_data, d_transformed);
cudaEventRecord(evB, sB);
// Stage 3: consume in stream C (waits for B)
cudaStreamWaitEvent(sC, evB, 0);
consume<<<grid, block, 0, sC>>>(d_transformed, d_result);
// Cleanup
cudaEventDestroy(evA);
cudaEventDestroy(evB);
cudaStreamDestroy(sA);
cudaStreamDestroy(sB);
cudaStreamDestroy(sC);
The dependency chain is: produce (A) -> transform (B) -> consume (C). Each stream can also run independent work before or after the dependency point. For example, stream A could launch another kernel after recording evA, and that kernel would run concurrently with stream B’s transform.
You can also create fan-out dependencies. Record one event and have multiple streams wait on it:
cudaEventRecord(evA, sA);
cudaStreamWaitEvent(sB, evA, 0); // B waits for A
cudaStreamWaitEvent(sC, evA, 0); // C also waits for A
And fan-in: one stream waits for events from multiple streams:
cudaEventRecord(evA, sA);
cudaEventRecord(evB, sB);
cudaStreamWaitEvent(sC, evA, 0); // C waits for A
cudaStreamWaitEvent(sC, evB, 0); // C also waits for B
Stream C will not proceed until both A and B have reached their respective events. This pattern is common in parallel reduction pipelines where multiple partial results must be merged.
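The record/wait semantics have a familiar CPU analogue. The sketch below is purely illustrative: it uses Python's threading module, not any CUDA API, to mimic the fan-in pattern, with two "streams" A and B setting events and a "stream" C blocking on both before merging:

```python
import threading

ev_a, ev_b = threading.Event(), threading.Event()
results = []

def stream_a():
    results.append("A done")    # produce first partial result
    ev_a.set()                  # analogue of cudaEventRecord(evA, sA)

def stream_b():
    results.append("B done")    # produce second partial result
    ev_b.set()                  # analogue of cudaEventRecord(evB, sB)

def stream_c():
    ev_a.wait()                 # analogue of cudaStreamWaitEvent(sC, evA, 0)
    ev_b.wait()                 # analogue of cudaStreamWaitEvent(sC, evB, 0)
    results.append("C merged")  # safe: both partial results exist

threads = [threading.Thread(target=f) for f in (stream_c, stream_a, stream_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # "C merged" is always last
```

One important difference: threading.Event.wait blocks a CPU thread, whereas cudaStreamWaitEvent blocks only the GPU stream and leaves the CPU free.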
Python timing with CuPy events
CuPy provides an event API that mirrors the CUDA C++ interface:
import cupy as cp
n = 100_000_000
a = cp.ones(n, dtype=cp.float32)
b = cp.ones(n, dtype=cp.float32)
start = cp.cuda.Event()
stop = cp.cuda.Event()
start.record()
c = a + b # Element-wise addition on GPU
stop.record()
stop.synchronize()
elapsed_ms = cp.cuda.get_elapsed_time(start, stop)
print(f"Kernel time: {elapsed_ms:.2f} ms")
# Compute achieved GFLOPs
flops = n # one add per element
gflops = flops / (elapsed_ms / 1000) / 1e9
print(f"Achieved: {gflops:.2f} GFLOP/s")
cp.cuda.Event() creates an event. record() places it in the current stream. stop.synchronize() blocks the Python thread until the event is recorded. cp.cuda.get_elapsed_time(start, stop) returns milliseconds, exactly like cudaEventElapsedTime.
For stream-level timing in CuPy:
stream = cp.cuda.Stream()
start = cp.cuda.Event()
stop = cp.cuda.Event()
with stream:
start.record(stream)
c = a + b
stop.record(stream)
stop.synchronize()
print(f"Stream time: {cp.cuda.get_elapsed_time(start, stop):.2f} ms")
Event creation flags
cudaEventCreate accepts optional flags through cudaEventCreateWithFlags:
cudaEvent_t event;
cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
| Flag | Effect |
|---|---|
cudaEventDefault | Full timing support. Default behavior. |
cudaEventDisableTiming | Cannot be used with cudaEventElapsedTime. Faster to record. |
cudaEventBlockingSync | cudaEventSynchronize yields the CPU thread instead of spinning. |
cudaEventInterprocess | Event can be shared across processes via IPC. |
⚠ If you only need events for synchronization (not timing), always use cudaEventDisableTiming. Recording a timing-enabled event requires the GPU to flush its instruction pipeline to get an accurate timestamp. A non-timing event is a lightweight fence with no pipeline flush. In tight loops with many event records, the difference is measurable.
The cudaEventBlockingSync flag changes how cudaEventSynchronize waits. By default, it spin-waits (burns CPU cycles checking a flag). With cudaEventBlockingSync, the thread yields to the OS scheduler. Use this when the CPU has other useful work to do, or in power-sensitive environments.
Common mistakes
Recording an event without synchronizing before reading elapsed time. If you call cudaEventElapsedTime before the stop event has been recorded, you get cudaErrorNotReady. Always call cudaEventSynchronize(stop) first.
Measuring launch overhead instead of kernel time. Wrapping a kernel launch with CPU timers and no cudaDeviceSynchronize gives you 5 to 15 microseconds regardless of the kernel’s actual execution time.
Passing nonzero flags to cudaStreamWaitEvent. Use 0 for the default behavior. The only other documented flag is cudaEventWaitExternal, which has specialized semantics during stream capture; arbitrary values are invalid.
Using timing-enabled events for pure synchronization. Each timing-enabled event record forces a pipeline flush. If you are recording thousands of events per frame for dependency management, switch to cudaEventDisableTiming.
Assuming every event operation works across devices. cudaStreamWaitEvent can make a stream on one GPU wait for an event recorded on another GPU, but cudaEventElapsedTime requires both events to belong to the same device. Sharing an event across processes requires creating it with cudaEventInterprocess, which must be combined with cudaEventDisableTiming.
In practice
Always use events for kernel timing. CPU timers are wrong by default because kernel launches are asynchronous. Events are the only reliable way to measure GPU execution time without introducing synchronization artifacts.
Use cudaEventDisableTiming for synchronization-only events. The performance difference matters when you have many events per frame. Profile it: on Ampere GPUs, a non-timing event record is roughly 2x faster than a timing-enabled one.
Prefer cudaStreamWaitEvent over cudaDeviceSynchronize. Device synchronization drains every stream on the GPU, destroying all concurrency. cudaStreamWaitEvent is surgical: only the target stream stalls, and only until the specific event fires. Other streams keep running.
Record events at natural boundaries. Place events between computational stages, not inside tight loops. An event per kernel launch in a 1000-kernel pipeline is fine. An event per thread block is not (you would need host-side logic that defeats the purpose).
Warm up before timing. The first kernel launch in a CUDA program incurs JIT compilation and context initialization overhead. Launch the kernel once, synchronize, then time the second launch. This gives you a representative measurement.
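The warm-up-then-time recipe generalizes to a small helper. This sketch is timer-agnostic and the names are my own: you pass a launch function and a synchronize function. With CUDA you would record events around the timed loop and synchronize on the stop event; the demo below falls back to time.perf_counter and a trivial CPU workload purely for illustration:

```python
import time

def time_op(launch, synchronize=lambda: None, warmup=1, iters=10):
    """Time an operation, discarding warm-up runs (JIT, context init).

    launch:      enqueues the work (a kernel launch; non-blocking is fine)
    synchronize: blocks until the work is done (e.g. event synchronize)
    Returns average milliseconds per iteration over the timed runs.
    """
    for _ in range(warmup):      # absorb one-time startup costs
        launch()
    synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        launch()
    synchronize()                # one sync at the end, not per launch
    return (time.perf_counter() - t0) * 1000.0 / iters

# CPU-side demo with a trivial workload:
ms = time_op(lambda: sum(range(10_000)))
print(f"{ms:.3f} ms per iteration")
```

Averaging over several iterations also smooths out scheduling jitter; with events, you would record the start event before the loop, the stop event after it, and divide the elapsed milliseconds by iters.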
What comes next
The next article covers dynamic parallelism: launching kernels from within kernels on the GPU, recursive decomposition of work, and when nested launches help versus when they add overhead.