Synchronization and atomic operations in CUDA
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article assumes you have read the following:
- Shared memory and tiling, for __shared__ memory, cooperative loads, and the synchronization points that tiling requires.
- Warps and warp divergence, for the SIMT execution model and how threads within a warp execute in lockstep.
You should be comfortable writing kernels that use shared memory and understand why barriers exist between load and compute phases.
The synchronization problem on a GPU
A CPU program with 8 threads can synchronize with a single mutex or barrier. The cost is manageable because there are only 8 threads. A GPU kernel launches tens of thousands of threads. Mutexes do not scale, and a barrier across all of those threads is not generally possible within a single kernel launch.
CUDA provides two categories of coordination primitives:
- Barriers that make threads wait for each other (__syncthreads()).
- Atomic operations that let threads safely update shared data without explicit locks.
Both have costs. Understanding when to use each, and when to restructure your algorithm to avoid both, separates fast GPU code from code that technically works but runs at a fraction of peak throughput.
__syncthreads(): the block-level barrier
__syncthreads() is a barrier that applies to all threads in a single block. When a thread reaches this call, it stalls until every other thread in the same block also reaches it. After the barrier, all writes to shared memory (and global memory, with caveats) made before the barrier are visible to all threads in the block.
This is the primitive that makes tiling work. Without it, thread 0 might read from shared memory before thread 31 has finished writing its element.
graph TD
subgraph "Thread Block (8 threads shown)"
direction TB
T0["Thread 0: write s[0]"]
T1["Thread 1: write s[1]"]
T2["Thread 2: write s[2]"]
T3["Thread 3: ..."]
T4["Thread 4: ..."]
T5["Thread 5: ..."]
T6["Thread 6: ..."]
T7["Thread 7: write s[7]"]
end
T0 --> B["syncthreads barrier
All threads must arrive"]
T1 --> B
T2 --> B
T3 --> B
T4 --> B
T5 --> B
T6 --> B
T7 --> B
B --> R0["Thread 0: read s[0..7] ✓"]
B --> R1["Thread 1: read s[0..7] ✓"]
B --> R7["Thread 7: read s[0..7] ✓"]
The cost of __syncthreads() itself is low on modern hardware: roughly 4-8 cycles when threads are already converged. The real cost is indirect. Threads that arrive early must stall, wasting execution slots the scheduler could have used. In practice, this matters only when the work before the barrier is highly uneven across threads.
Divergent __syncthreads(): undefined behavior
Every thread in the block must reach the same __syncthreads() call. If some threads take a branch that skips the barrier while others do not, the result is undefined behavior. The kernel may deadlock, produce wrong results, or appear to work on one GPU and fail on another.
// ⚠ UNDEFINED BEHAVIOR: divergent __syncthreads()
__global__ void bad_kernel(float* data, int n) {
int tid = threadIdx.x;
__shared__ float s[256];
s[tid] = data[tid];
if (tid < 128) {
__syncthreads(); // ✗ Only half the block calls this
s[tid] += s[tid + 128];
}
}
The fix is to move the barrier outside the conditional:
__global__ void fixed_kernel(float* data, int n) {
int tid = threadIdx.x;
__shared__ float s[256];
s[tid] = data[tid];
__syncthreads(); // ✓ All threads hit this
if (tid < 128) {
s[tid] += s[tid + 128];
}
}
The CUDA compiler does not always catch divergent barriers. compute-sanitizer --tool synccheck will detect them at runtime. Run it on every new kernel.
Global synchronization: you cannot do it in one kernel
__syncthreads() synchronizes within a block. There is no built-in primitive to synchronize across all blocks within a single kernel launch. This is a deliberate design choice: blocks execute in arbitrary order on available SMs, and there may be more blocks than SMs. If block 0 waits for block 99 and block 99 has not been scheduled yet, you have a deadlock.
If your algorithm requires a global synchronization point (e.g., iterative solvers, multi-pass algorithms), you have two options:
- Split into multiple kernel launches. A kernel launch is an implicit global barrier. All blocks from kernel A finish before any block from kernel B begins. This is the standard approach. The overhead of a kernel launch is roughly 5-10 microseconds, which is negligible for kernels that run for hundreds of microseconds or more.
- Cooperative groups (CUDA 9+). grid.sync() from the cooperative groups API provides a true grid-wide barrier, but only if you launch no more blocks than the GPU can keep resident concurrently. This limits parallelism and is rarely used outside specialized algorithms.
For the vast majority of cases, the multi-kernel approach is the right one.
Atomic operations: safe concurrent updates
Atomic operations let a thread read, modify, and write a memory location as a single indivisible operation. No other thread can see the intermediate state. The hardware guarantees that concurrent atomics to the same address are serialized.
The race condition without atomics
Consider building a histogram: each thread reads a data element and increments the corresponding bin.
// ✗ RACE CONDITION: multiple threads read-modify-write the same bin
__global__ void histogram_broken(const int* data, int* bins, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
int bin = data[idx];
bins[bin] += 1; // ✗ Not atomic: read, add, write can interleave
}
}
If threads 5 and 12 both increment bins[3], the sequence might be: thread 5 reads bins[3] as 7, thread 12 reads bins[3] as 7, thread 5 writes 8, thread 12 writes 8. One increment is lost.
The fix: atomicAdd
// ✓ CORRECT: atomicAdd serializes updates to each bin
__global__ void histogram_correct(const int* data, int* bins, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
int bin = data[idx];
atomicAdd(&bins[bin], 1); // ✓ Hardware-guaranteed atomic
}
}
atomicAdd reads the value at &bins[bin], adds 1, and writes the result back as a single atomic operation. The hardware guarantees no interleaving.
Atomic operations reference
| Operation | Function | Supported types | Scope | Performance note |
|---|---|---|---|---|
| Add | atomicAdd | int, unsigned, float, double, __half2 | Global/Shared | Native FP32 since Fermi (CC 2.0); FP64 native since Pascal (CC 6.0) |
| Subtract | atomicSub | int, unsigned | Global/Shared | Implemented as atomicAdd with negated value |
| Min | atomicMin | int, unsigned | Global/Shared | Integer only; use CAS loop for float |
| Max | atomicMax | int, unsigned | Global/Shared | Integer only; use CAS loop for float |
| Increment | atomicInc | unsigned | Global/Shared | Wraps at specified limit |
| Decrement | atomicDec | unsigned | Global/Shared | Wraps at specified limit |
| Exchange | atomicExch | int, unsigned, float | Global/Shared | Unconditional swap; fast |
| Compare-and-swap | atomicCAS | int, unsigned, unsigned long long | Global/Shared | Foundation for custom atomics |
| Bitwise AND | atomicAnd | int, unsigned | Global/Shared | Useful for bitmask operations |
| Bitwise OR | atomicOr | int, unsigned | Global/Shared | Useful for flag arrays |
| Bitwise XOR | atomicXor | int, unsigned | Global/Shared | Rarely used |
atomicCAS: building custom atomics
atomicCAS(address, expected, desired) atomically compares *address with expected. If they match, it writes desired and returns the old value. If they do not match, it leaves the value unchanged and returns the current value. This is the building block for any atomic operation the hardware does not natively support.
Here is a spinlock built with atomicCAS:
__device__ int lock = 0;
__device__ void acquire_lock() {
while (atomicCAS(&lock, 0, 1) != 0) {
// spin until we successfully swap 0 -> 1
}
}
__device__ void release_lock() {
atomicExch(&lock, 0);
}
⚠ Spinlocks on a GPU are almost always the wrong answer. Thousands of threads spinning on the same lock will destroy throughput. They appear here only to illustrate CAS mechanics. If you think you need a GPU spinlock, restructure your algorithm instead.
A more practical CAS use is atomic floating-point max:
__device__ float atomicMaxFloat(float* addr, float val) {
int* addr_as_int = (int*)addr;
int old = *addr_as_int;
int assumed;
do {
assumed = old;
old = atomicCAS(addr_as_int, assumed,
__float_as_int(fmaxf(__int_as_float(assumed), val)));
} while (assumed != old);
return __int_as_float(old);
}
Atomic contention: the performance cost
Atomics are correct but not free. When multiple threads atomically update the same address, the hardware serializes those updates. Each atomic operation to global memory costs roughly 200-600 cycles depending on contention and GPU architecture.
Example: contention math
Consider 1024 threads all calling atomicAdd on the same global counter.
- Each atomic operation takes roughly 600 cycles under full contention.
- The operations are serialized: only one completes at a time for a given address.
- Total serialized cycles: 1024 * 600 = 614,400 cycles.
- At a 1.5 GHz clock, that is roughly 410 microseconds spent just on atomic serialization.
Now compare to a two-level reduction:
- Level 1 (block-local): each block of 256 threads does a shared memory reduction using __syncthreads(). Cost: ~10 cycles per step * log2(256) = 80 cycles. All 4 blocks run in parallel.
- Level 2 (cross-block): 4 partial sums are written to global memory. A second kernel reduces those 4 values.
Total cost for level 1: ~80 cycles (parallel across blocks). Level 2: negligible (4 elements). The reduction is roughly 5000x faster than the naive atomic approach.
graph TD
subgraph "Naive: 1024 atomicAdd to one counter"
direction TB
AT1["Thread 0: atomicAdd → 600 cycles"]
AT2["Thread 1: atomicAdd → 600 cycles"]
AT3["Thread 2: atomicAdd → 600 cycles"]
ATN["...Thread 1023: atomicAdd → 600 cycles"]
AT1 --> S["Serialized: 1024 * 600 = 614,400 cycles"]
AT2 --> S
AT3 --> S
ATN --> S
end
subgraph "Two-level reduction"
direction TB
B0["Block 0: shared mem reduce
256 threads → 1 sum
approx 80 cycles"]
B1["Block 1: same"]
B2["Block 2: same"]
B3["Block 3: same"]
B0 --> F["4 partial sums → final reduce
approx 10 cycles"]
B1 --> F
B2 --> F
B3 --> F
end
The rule: atomics are appropriate when contention is low (many distinct addresses, like histogram bins spread across 256 bins) or when the number of updates is small. Atomics to a single address from thousands of threads are a performance antipattern.
Worked example: parallel reduction with __syncthreads()
Reduction (summing an array) is the canonical example of block-level synchronization. Here is a step-by-step trace with 8 threads and input values [3, 1, 4, 1, 5, 9, 2, 6].
Step-by-step trace
Initial shared memory state:
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Value | 3 | 1 | 4 | 1 | 5 | 9 | 2 | 6 |
Step 1 (stride = 4, active threads: 0, 1, 2, 3):
- Thread 0: s[0] += s[4] = 3 + 5 = 8
- Thread 1: s[1] += s[5] = 1 + 9 = 10
- Thread 2: s[2] += s[6] = 4 + 2 = 6
- Thread 3: s[3] += s[7] = 1 + 6 = 7
__syncthreads() — all threads wait.
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Value | 8 | 10 | 6 | 7 | 5 | 9 | 2 | 6 |
Step 2 (stride = 2, active threads: 0, 1):
- Thread 0: s[0] += s[2] = 8 + 6 = 14
- Thread 1: s[1] += s[3] = 10 + 7 = 17
__syncthreads() — all threads wait.
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Value | 14 | 17 | 6 | 7 | 5 | 9 | 2 | 6 |
Step 3 (stride = 1, active thread: 0):
- Thread 0: s[0] += s[1] = 14 + 17 = 31
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Value | 31 | 17 | 6 | 7 | 5 | 9 | 2 | 6 |
Final sum: s[0] = 31. Verify: 3 + 1 + 4 + 1 + 5 + 9 + 2 + 6 = 31. ✓
CUDA implementation
__global__ void block_reduce_sum(const float* input, float* output, int n) {
__shared__ float s[256];
int tid = threadIdx.x;
int gid = blockIdx.x * blockDim.x + tid;
// Load from global to shared (with bounds check)
s[tid] = (gid < n) ? input[gid] : 0.0f;
__syncthreads();
// Tree reduction in shared memory
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
if (tid < stride) {
s[tid] += s[tid + stride];
}
__syncthreads(); // ✓ All threads hit this regardless of branch
}
// Thread 0 writes the block's partial sum
if (tid == 0) {
output[blockIdx.x] = s[0];
}
}
Key observations:
- __syncthreads() is inside the loop body, outside the if. Every thread executes the barrier on every iteration.
- Active threads halve each step: 128, 64, 32, 16, 8, 4, 2, 1. Inactive threads still participate in the barrier but skip the addition.
- For the final warp (stride < 32), you can replace __syncthreads() with __syncwarp() on Volta+ for slightly better performance.
Python (Numba): atomic histogram
Numba’s CUDA support provides cuda.atomic.add as the equivalent of atomicAdd. Here is a histogram kernel:
from numba import cuda
import numpy as np
@cuda.jit
def histogram_kernel(data, bins):
idx = cuda.grid(1)
if idx < data.shape[0]:
bin_idx = data[idx]
cuda.atomic.add(bins, bin_idx, 1)
# Usage
data = np.random.randint(0, 256, size=1_000_000).astype(np.int32)
bins = np.zeros(256, dtype=np.int32)
d_data = cuda.to_device(data)
d_bins = cuda.to_device(bins)
threads_per_block = 256
blocks = (data.shape[0] + threads_per_block - 1) // threads_per_block
histogram_kernel[blocks, threads_per_block](d_data, d_bins)
result = d_bins.copy_to_host()
expected = np.bincount(data, minlength=256).astype(np.int32)
assert np.array_equal(result, expected)
print("Histogram correct ✓")
For higher throughput, use privatized histograms: each block accumulates into shared memory first, then atomically merges into the global histogram. This reduces contention on global memory atomics by a factor of (number of blocks).
Memory fences: __threadfence() and __threadfence_block()
Atomic operations guarantee that a single read-modify-write is indivisible. They do not guarantee that other (non-atomic) writes made by the same thread are visible to other threads at any particular time. That is the job of memory fences.
CUDA provides three fence levels:
- __threadfence_block(): ensures all writes by the calling thread are visible to all threads in the same block. Lightweight. Use when coordinating through shared memory.
- __threadfence(): ensures all writes by the calling thread are visible to all threads on the device (all blocks). More expensive. Use when coordinating through global memory across blocks.
- __threadfence_system(): ensures visibility to all threads on the device and to the host. Rarely needed.
A fence does not stall other threads. It only constrains the ordering of the calling thread’s own writes as perceived by other threads.
When fences matter
A common pattern is producer-consumer signaling through global memory. Block 0 writes a result to global memory and sets a flag. Block 1 polls the flag and reads the result.
__device__ int result;
__device__ volatile int flag = 0; // volatile: the compiler must re-read it on every poll
// Block 0: producer
__global__ void producer() {
result = 42;
__threadfence(); // Ensure 'result' is visible before 'flag'
flag = 1;
}
// Block 1: consumer
__global__ void consumer(int* output) {
while (flag == 0) {} // Spin until flag is set (volatile prevents hoisting the load)
__threadfence(); // Ensure we see the 'result' write
*output = result; // Reads 42 ✓
}
Without __threadfence(), the GPU memory system may reorder the writes. Block 1 could see flag == 1 but read a stale value of result. The fence prevents this by enforcing write ordering.
⚠ This pattern (spinning on a global flag across blocks) is fragile and generally not recommended. Prefer separate kernel launches for producer-consumer relationships. Fences are shown here so you understand the semantics when you encounter them in lock-free data structures or advanced cooperative algorithms.
CUDA memory consistency model
CUDA follows a relaxed consistency model. Threads within a warp see each other’s writes immediately (they share an instruction stream). Threads in the same block see each other’s writes after __syncthreads(). Threads in different blocks have no ordering guarantees unless you use:
- Atomic operations (for the specific address being atomically accessed).
- Memory fences (for ordering non-atomic writes relative to other writes).
- Kernel launch boundaries (implicit global fence).
The hierarchy:
| Scope | Visibility guarantee | Mechanism |
|---|---|---|
| Same warp | Immediate for converged threads (use __syncwarp() on Volta+ to guarantee convergence) | SIMT model |
| Same block | After __syncthreads() | Barrier |
| Cross-block | After __threadfence() + atomic signal | Fence + atomic |
| Cross-kernel | After kernel completion | Implicit barrier |
| CPU-GPU | After cudaDeviceSynchronize() or stream sync | Host API |
Understanding this hierarchy prevents a class of bugs that are nearly impossible to diagnose: kernels that produce correct results 99% of the time and silently corrupt data the other 1%.
When atomics are the right tool
Atomics are correct and performant when:
- Contention is spread across many addresses. A histogram with 256 bins and 1M data points averages ~4000 updates per bin. With 256 distinct addresses, hardware can pipeline atomics efficiently.
- The number of atomic updates is small. Writing one partial sum per block (e.g., 4096 blocks writing to 4096 slots) has no contention at all.
- You need a simple reduction across blocks. atomicAdd(&global_sum, block_partial) at the end of a block reduction is clean and fast. One atomic per block, not per thread.
Atomics are the wrong tool when:
- All threads target the same address. Restructure as a reduction.
- You need complex multi-word updates. CAS loops with retries under high contention can be slower than a redesigned algorithm.
- A reduction tree would work. Always prefer __syncthreads() with shared memory over atomics when the data fits in a block.
In practice
Synchronization bugs on GPUs are among the hardest to diagnose because they often manifest as intermittent wrong results rather than crashes. Production considerations:
- Always run compute-sanitizer --tool synccheck on new kernels. It detects divergent barriers, missing barriers, and barrier misuse. The overhead is 10-50x, so use it during development, not in production.
- Prefer two-level reductions over single-level atomics. Reduce within a block using shared memory and __syncthreads(), then merge block results with a small number of atomics or a second kernel. This pattern works for sum, min, max, and count.
- Avoid global memory fences when possible. If your algorithm requires __threadfence() to communicate between blocks, consider whether splitting into two kernels would be simpler and equally fast. The 5-10 microsecond kernel launch overhead is usually cheaper than the complexity and fragility of fence-based coordination.
- Use shared memory atomics for block-local histograms. Shared memory atomics are roughly 10x faster than global memory atomics because the latency is 20-30 cycles instead of 200-600. Build a private histogram in shared memory, then merge with global atomics at the end.
- Profile atomic throughput with Nsight Compute. Look at the l1tex__t_sectors_pipe_lsu_mem_global_op_atom metric to see how many sectors are consumed by atomic operations. If atomics dominate your kernel time, restructure.
What comes next
Reduction is a building block. The next algorithm that every GPU programmer needs is the prefix sum (scan), which computes running totals across an array. Prefix sums use the same __syncthreads() and shared memory patterns covered here, combined with a clever up-sweep / down-sweep structure.
The next article, Parallel prefix sum and reduction: the core parallel primitives, covers inclusive and exclusive scans, work-efficient algorithms, and how these primitives compose into higher-level operations like stream compaction and radix sort.