
Synchronization and atomic operations in CUDA

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article builds on the earlier parts of this series, in particular the articles on thread hierarchy and on shared memory and tiling. You should be comfortable writing kernels that use shared memory and understand why barriers exist between load and compute phases.

The synchronization problem on a GPU

A CPU program with 8 threads can synchronize with a single mutex or barrier. The cost is manageable because there are only 8 threads. A GPU kernel launches tens of thousands of threads. Mutexes do not scale. Barriers across all threads are not even possible within a single kernel launch.

CUDA provides two categories of coordination primitives:

  1. Barriers that make threads wait for each other (__syncthreads()).
  2. Atomic operations that let threads safely update shared data without explicit locks.

Both have costs. Understanding when to use each, and when to restructure your algorithm to avoid both, separates fast GPU code from code that technically works but runs at a fraction of peak throughput.

__syncthreads(): the block-level barrier

__syncthreads() is a barrier that applies to all threads in a single block. When a thread reaches this call, it stalls until every other thread in the same block also reaches it. After the barrier, all writes to shared memory (and global memory, with caveats) made before the barrier are visible to all threads in the block.

This is the primitive that makes tiling work. Without it, thread 0 might read from shared memory before thread 31 has finished writing its element.

graph TD
  subgraph "Thread Block (8 threads shown)"
      direction TB
      T0["Thread 0: write s[0]"]
      T1["Thread 1: write s[1]"]
      T2["Thread 2: write s[2]"]
      T3["Thread 3: ..."]
      T4["Thread 4: ..."]
      T5["Thread 5: ..."]
      T6["Thread 6: ..."]
      T7["Thread 7: write s[7]"]
  end

  T0 --> B["syncthreads barrier
All threads must arrive"]
  T1 --> B
  T2 --> B
  T3 --> B
  T4 --> B
  T5 --> B
  T6 --> B
  T7 --> B

  B --> R0["Thread 0: read s[0..7] ✓"]
  B --> R1["Thread 1: read s[0..7] ✓"]
  B --> R7["Thread 7: read s[0..7] ✓"]

The cost of __syncthreads() itself is low on modern hardware: roughly 4-8 cycles when threads are already converged. The real cost is indirect. Threads that arrive early must stall, wasting execution slots the scheduler could have used. In practice, this matters only when the work before the barrier is highly uneven across threads.

Divergent __syncthreads(): undefined behavior

Every thread in the block must reach the same __syncthreads() call. If some threads take a branch that skips the barrier while others do not, the result is undefined behavior. The kernel may deadlock, produce wrong results, or appear to work on one GPU and fail on another.

// ⚠ UNDEFINED BEHAVIOR: divergent __syncthreads()
__global__ void bad_kernel(float* data, int n) {
    int tid = threadIdx.x;
    __shared__ float s[256];
    s[tid] = data[tid];

    if (tid < 128) {
        __syncthreads();  // ✗ Only half the block calls this
        s[tid] += s[tid + 128];
    }
}

The fix is to move the barrier outside the conditional:

__global__ void fixed_kernel(float* data, int n) {
    int tid = threadIdx.x;
    __shared__ float s[256];
    s[tid] = data[tid];

    __syncthreads();  // ✓ All threads hit this

    if (tid < 128) {
        s[tid] += s[tid + 128];
    }
}

The CUDA compiler does not always catch divergent barriers. compute-sanitizer --tool synccheck will detect them at runtime. Run it on every new kernel.

Global synchronization: you cannot do it in one kernel

__syncthreads() synchronizes within a block. There is no built-in primitive to synchronize across all blocks within a single kernel launch. This is a deliberate design choice: blocks execute in arbitrary order on available SMs, and there may be more blocks than SMs. If block 0 waits for block 99 and block 99 has not been scheduled yet, you have a deadlock.

If your algorithm requires a global synchronization point (e.g., iterative solvers, multi-pass algorithms), you have two options:

  1. Split into multiple kernel launches. A kernel launch is an implicit global barrier. All blocks from kernel A finish before any block from kernel B begins. This is the standard approach. The overhead of a kernel launch is roughly 5-10 microseconds, which is negligible for kernels that run for hundreds of microseconds or more.
  2. Cooperative groups (CUDA 9+). grid.sync() from the cooperative groups API provides a true grid-wide barrier, but only if you launch exactly as many blocks as the GPU can run concurrently. This limits parallelism and is rarely used outside specialized algorithms.

For the vast majority of cases, the multi-kernel approach is the right one.

Atomic operations: safe concurrent updates

Atomic operations let a thread read, modify, and write a memory location as a single indivisible operation. No other thread can see the intermediate state. The hardware guarantees that concurrent atomics to the same address are serialized.

The race condition without atomics

Consider building a histogram: each thread reads a data element and increments the corresponding bin.

// ✗ RACE CONDITION: multiple threads read-modify-write the same bin
__global__ void histogram_broken(const int* data, int* bins, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int bin = data[idx];
        bins[bin] += 1;  // ✗ Not atomic: read, add, write can interleave
    }
}

If threads 5 and 12 both increment bins[3], the sequence might be: thread 5 reads bins[3] as 7, thread 12 reads bins[3] as 7, thread 5 writes 8, thread 12 writes 8. One increment is lost.
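
The interleaving is easy to reproduce on the host. This Python sketch plays out the exact schedule described above, one step at a time, with each "thread" holding its read in a local variable before writing back:

```python
# Simulate two "threads" whose read-modify-write steps interleave.
# The schedule mirrors the example: both read before either writes.
bins = {3: 7}

t5_read = bins[3]        # thread 5 reads bins[3] -> 7
t12_read = bins[3]       # thread 12 reads bins[3] -> also 7

bins[3] = t5_read + 1    # thread 5 writes 8
bins[3] = t12_read + 1   # thread 12 overwrites with 8

print(bins[3])           # 8, not 9: one increment was lost
```

On a GPU, with thousands of threads and hundreds of cycles of memory latency between the read and the write, this schedule is not a corner case; it is the common case.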

The fix: atomicAdd

// ✓ CORRECT: atomicAdd serializes updates to each bin
__global__ void histogram_correct(const int* data, int* bins, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int bin = data[idx];
        atomicAdd(&bins[bin], 1);  // ✓ Hardware-guaranteed atomic
    }
}

atomicAdd reads the value at &bins[bin], adds 1, and writes the result back as a single atomic operation. The hardware guarantees no interleaving.

Atomic operations reference

Operation        | Function   | Supported types                       | Scope         | Performance note
Add              | atomicAdd  | int, unsigned, float, double, __half2 | Global/Shared | FP32 native since Fermi; FP64 native since Pascal
Subtract         | atomicSub  | int, unsigned                         | Global/Shared | Implemented as atomicAdd with negated value
Min              | atomicMin  | int, unsigned                         | Global/Shared | Integer only; use a CAS loop for float
Max              | atomicMax  | int, unsigned                         | Global/Shared | Integer only; use a CAS loop for float
Increment        | atomicInc  | unsigned                              | Global/Shared | Wraps at a specified limit
Decrement        | atomicDec  | unsigned                              | Global/Shared | Wraps at a specified limit
Exchange         | atomicExch | int, unsigned, float                  | Global/Shared | Unconditional swap; fast
Compare-and-swap | atomicCAS  | int, unsigned, unsigned long long     | Global/Shared | Foundation for custom atomics
Bitwise AND      | atomicAnd  | int, unsigned                         | Global/Shared | Useful for bitmask operations
Bitwise OR       | atomicOr   | int, unsigned                         | Global/Shared | Useful for flag arrays
Bitwise XOR      | atomicXor  | int, unsigned                         | Global/Shared | Rarely used

atomicCAS: building custom atomics

atomicCAS(address, expected, desired) atomically compares *address with expected. If they match, it writes desired and returns the old value. If they do not match, it leaves the value unchanged and returns the current value. This is the building block for any atomic operation the hardware does not natively support.

Here is a spinlock built with atomicCAS:

__device__ int lock = 0;

__device__ void acquire_lock() {
    while (atomicCAS(&lock, 0, 1) != 0) {
        // spin until we successfully swap 0 -> 1
    }
}

__device__ void release_lock() {
    atomicExch(&lock, 0);
}

⚠ Spinlocks on a GPU are almost always the wrong answer. Thousands of threads spinning on the same lock will destroy throughput. They appear here only to illustrate CAS mechanics. If you think you need a GPU spinlock, restructure your algorithm instead.

A more practical CAS use is atomic floating-point max:

__device__ float atomicMaxFloat(float* addr, float val) {
    int* addr_as_int = (int*)addr;
    int old = *addr_as_int;
    int assumed;
    do {
        assumed = old;
        old = atomicCAS(addr_as_int, assumed,
                        __float_as_int(fmaxf(__int_as_float(assumed), val)));
    } while (assumed != old);
    return __int_as_float(old);
}
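
The retry pattern is easier to see in a host-side model. This Python sketch (the helper names are mine, not CUDA's) mimics atomicCAS and the loop above: keep proposing a new value until the CAS succeeds, i.e., until the value we read is still the value in memory:

```python
def cas(mem, addr, expected, desired):
    """Model of atomicCAS: compare, conditionally swap, return the old value."""
    old = mem[addr]
    if old == expected:
        mem[addr] = desired
    return old

def atomic_max_float(mem, addr, val):
    """Model of the CAS retry loop from the CUDA snippet above."""
    old = mem[addr]
    while True:
        assumed = old
        old = cas(mem, addr, assumed, max(assumed, val))
        if assumed == old:      # CAS succeeded: no other thread raced us
            return old

mem = {0: 1.5}
atomic_max_float(mem, 0, 3.25)  # raises the stored maximum
atomic_max_float(mem, 0, 2.0)   # loses the comparison, value unchanged
print(mem[0])                   # 3.25
```

In the single-threaded model the loop always succeeds on the first try; on a GPU, a concurrent update to the same address makes the CAS fail and the loop retry with the freshly observed value.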

Atomic contention: the performance cost

Atomics are correct but not free. When multiple threads atomically update the same address, the hardware serializes those updates. Each atomic operation to global memory costs roughly 200-600 cycles depending on contention and GPU architecture.

Example: contention math

Consider 1024 threads all calling atomicAdd on the same global counter.

  • Each atomic operation takes roughly 600 cycles under full contention.
  • The operations are serialized: only one completes at a time for a given address.
  • Total serialized cycles: 1024 * 600 = 614,400 cycles.
  • At a 1.5 GHz clock, that is roughly 410 microseconds spent just on atomic serialization.

Now compare to a two-level reduction:

  1. Level 1 (block-local): Each block of 256 threads does a shared memory reduction using __syncthreads(). Cost: ~10 cycles per step * log2(256) = 80 cycles. All 4 blocks run in parallel.
  2. Level 2 (cross-block): 4 partial sums are written to global memory. A second kernel reduces those 4 values.

Total cost for level 1: ~80 cycles (parallel across blocks). Level 2: negligible (4 elements). The reduction is several thousand times faster than the naive atomic approach.
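
The arithmetic above can be checked directly. The cycle counts are the same assumed figures used in the example, not measurements:

```python
import math

# Naive: 1024 threads all atomicAdd to one address, fully serialized
threads = 1024
cycles_per_atomic = 600            # assumed full-contention latency
clock_hz = 1.5e9

serialized_cycles = threads * cycles_per_atomic
print(serialized_cycles)                      # 614400
print(serialized_cycles / clock_hz * 1e6)     # ~409.6 microseconds

# Two-level reduction: log2(256) barrier-separated steps at ~10 cycles
# each (blocks in parallel), plus a small final merge of 4 partials
reduction_cycles = 10 * int(math.log2(256)) + 10
print(reduction_cycles)                       # 90 cycles
```

The exact cycle figures vary by architecture; what is robust is the shape of the comparison, linear serialization versus a logarithmic tree.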

graph TD
  subgraph "Naive: 1024 atomicAdd to one counter"
      direction TB
      AT1["Thread 0: atomicAdd → 600 cycles"]
      AT2["Thread 1: atomicAdd → 600 cycles"]
      AT3["Thread 2: atomicAdd → 600 cycles"]
      ATN["...Thread 1023: atomicAdd → 600 cycles"]
      AT1 --> S["Serialized: 1024 * 600 = 614,400 cycles"]
      AT2 --> S
      AT3 --> S
      ATN --> S
  end

  subgraph "Two-level reduction"
      direction TB
      B0["Block 0: shared mem reduce
256 threads → 1 sum
approx 80 cycles"]
      B1["Block 1: same"]
      B2["Block 2: same"]
      B3["Block 3: same"]
      B0 --> F["4 partial sums → final reduce
approx 10 cycles"]
      B1 --> F
      B2 --> F
      B3 --> F
  end

The rule: atomics are appropriate when contention is low (many distinct addresses, like histogram bins spread across 256 bins) or when the number of updates is small. Atomics to a single address from thousands of threads are a performance antipattern.

Worked example: parallel reduction with __syncthreads()

Reduction (summing an array) is the canonical example of block-level synchronization. Here is a step-by-step trace with 8 threads and input values [3, 1, 4, 1, 5, 9, 2, 6].

Step-by-step trace

Initial shared memory state:

Index:  0  1  2  3  4  5  6  7
Value:  3  1  4  1  5  9  2  6

Step 1 (stride = 4, active threads: 0, 1, 2, 3):

  • Thread 0: s[0] += s[4] = 3 + 5 = 8
  • Thread 1: s[1] += s[5] = 1 + 9 = 10
  • Thread 2: s[2] += s[6] = 4 + 2 = 6
  • Thread 3: s[3] += s[7] = 1 + 6 = 7

__syncthreads() — all threads wait.

Index:  0  1  2  3  4  5  6  7
Value:  8 10  6  7  5  9  2  6

Step 2 (stride = 2, active threads: 0, 1):

  • Thread 0: s[0] += s[2] = 8 + 6 = 14
  • Thread 1: s[1] += s[3] = 10 + 7 = 17

__syncthreads() — all threads wait.

Index:  0  1  2  3  4  5  6  7
Value: 14 17  6  7  5  9  2  6

Step 3 (stride = 1, active thread: 0):

  • Thread 0: s[0] += s[1] = 14 + 17 = 31

Index:  0  1  2  3  4  5  6  7
Value: 31 17  6  7  5  9  2  6

Final sum: s[0] = 31. Verify: 3 + 1 + 4 + 1 + 5 + 9 + 2 + 6 = 31. ✓
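
The same trace can be reproduced on the host. Each pass of the outer loop below corresponds to one barrier-separated step; in the CUDA kernel, __syncthreads() would sit where the loop body ends:

```python
# Reproduce the 8-thread tree reduction trace from the tables above.
s = [3, 1, 4, 1, 5, 9, 2, 6]

stride = len(s) // 2
while stride > 0:
    # Every "thread" with tid < stride adds its partner's element.
    # On the GPU, a __syncthreads() barrier follows each step.
    for tid in range(stride):
        s[tid] += s[tid + stride]
    print(s)      # stride 4: [8, 10, 6, 7, ...], stride 2: [14, 17, ...]
    stride //= 2

print(s[0])       # 31
```

Note that the host loop is sequential, so no barrier is needed; the barrier exists in the CUDA version precisely because the per-tid additions of one step run concurrently.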

CUDA implementation

__global__ void block_reduce_sum(const float* input, float* output, int n) {
    __shared__ float s[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Load from global to shared (with bounds check)
    s[tid] = (gid < n) ? input[gid] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            s[tid] += s[tid + stride];
        }
        __syncthreads();  // ✓ All threads hit this regardless of branch
    }

    // Thread 0 writes the block's partial sum
    if (tid == 0) {
        output[blockIdx.x] = s[0];
    }
}

Key observations:

  • __syncthreads() is inside the loop body, outside the if. Every thread executes the barrier on every iteration.
  • Active threads halve each step: 128, 64, 32, 16, 8, 4, 2, 1. Inactive threads still participate in the barrier but skip the addition.
  • For the final warp (stride < 32), you can replace __syncthreads() with __syncwarp() on Volta+ for slightly better performance.

Python (Numba): atomic histogram

Numba’s CUDA support provides cuda.atomic.add as the equivalent of atomicAdd. Here is a histogram kernel:

from numba import cuda
import numpy as np

@cuda.jit
def histogram_kernel(data, bins):
    idx = cuda.grid(1)
    if idx < data.shape[0]:
        bin_idx = data[idx]
        cuda.atomic.add(bins, bin_idx, 1)

# Usage
data = np.random.randint(0, 256, size=1_000_000).astype(np.int32)
bins = np.zeros(256, dtype=np.int32)

d_data = cuda.to_device(data)
d_bins = cuda.to_device(bins)

threads_per_block = 256
blocks = (data.shape[0] + threads_per_block - 1) // threads_per_block
histogram_kernel[blocks, threads_per_block](d_data, d_bins)

result = d_bins.copy_to_host()
expected = np.bincount(data, minlength=256).astype(np.int32)
assert np.array_equal(result, expected)
print("Histogram correct ✓")

For higher throughput, use privatized histograms: each block accumulates into shared memory first, then atomically merges into the global histogram. This reduces contention on global memory atomics by a factor of (number of blocks).
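
A host-side model makes the privatization structure concrete. This sketch (function and variable names are mine) does one private histogram per "block" and a single merge pass, which on a real GPU would be one global atomic per bin per block rather than one per input element:

```python
import numpy as np

def privatized_histogram(data, num_bins, block_size=256):
    """Host model of privatization: each 'block' fills a private
    histogram (shared memory on a real GPU), then merges it into
    the global histogram."""
    global_bins = np.zeros(num_bins, dtype=np.int64)
    for start in range(0, len(data), block_size):
        private = np.zeros(num_bins, dtype=np.int64)  # shared-memory stand-in
        for x in data[start:start + block_size]:
            private[x] += 1                           # cheap shared-memory atomic
        global_bins += private                        # merge: global-memory atomics
    return global_bins

rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=10_000)
result = privatized_histogram(data, 256)
assert np.array_equal(result, np.bincount(data, minlength=256))
```

The result is identical to the direct histogram; only the contention profile changes, which is why privatization is a pure performance optimization.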

Memory fences: __threadfence() and __threadfence_block()

Atomic operations guarantee that a single read-modify-write is indivisible. They do not guarantee that other (non-atomic) writes made by the same thread are visible to other threads at any particular time. That is the job of memory fences.

CUDA provides three fence levels:

  • __threadfence_block(): Ensures all writes by the calling thread are visible to all threads in the same block. Lightweight. Use when coordinating through shared memory.
  • __threadfence(): Ensures all writes by the calling thread are visible to all threads on the device (all blocks). More expensive. Use when coordinating through global memory across blocks.
  • __threadfence_system(): Ensures visibility to all threads on the device and to the host. Rarely needed.

A fence does not stall other threads. It only constrains the ordering of the calling thread’s own writes as perceived by other threads.

When fences matter

A common pattern is producer-consumer signaling through global memory. Block 0 writes a result to global memory and sets a flag. Block 1 polls the flag and reads the result.

__device__ int result;
__device__ volatile int flag = 0;  // volatile: the compiler must re-read it each iteration

// Block 0: producer
__global__ void producer() {
    result = 42;
    __threadfence();   // Ensure 'result' is visible before 'flag'
    flag = 1;
}

// Block 1: consumer
__global__ void consumer(int* output) {
    while (flag == 0) {}   // Spin until flag is set; without volatile, this
                           // load could be hoisted out of the loop
    __threadfence();       // Ensure we see the 'result' write
    *output = result;      // Reads 42 ✓
}

Without __threadfence(), the GPU memory system may reorder the writes. Block 1 could see flag == 1 but read a stale value of result. The fence prevents this by enforcing write ordering.

⚠ This pattern (spinning on a global flag across blocks) is fragile and generally not recommended. Prefer separate kernel launches for producer-consumer relationships. Fences are shown here so you understand the semantics when you encounter them in lock-free data structures or advanced cooperative algorithms.

CUDA memory consistency model

CUDA follows a relaxed consistency model. Threads within a warp see each other’s writes immediately when executing in lockstep (on Volta and later, independent thread scheduling means you should call __syncwarp() before relying on this). Threads in the same block see each other’s writes after __syncthreads(). Threads in different blocks have no ordering guarantees unless you use:

  1. Atomic operations (for the specific address being atomically accessed).
  2. Memory fences (for ordering non-atomic writes relative to other writes).
  3. Kernel launch boundaries (implicit global fence).

The hierarchy:

Scope        | Visibility guarantee                          | Mechanism
Same warp    | Immediate (lockstep execution)                | SIMT model
Same block   | After __syncthreads()                         | Barrier
Cross-block  | After __threadfence() + atomic signal         | Fence + atomic
Cross-kernel | After kernel completion                       | Implicit barrier
CPU-GPU      | After cudaDeviceSynchronize() or stream sync  | Host API

Understanding this hierarchy prevents a class of bugs that are nearly impossible to diagnose: kernels that produce correct results 99% of the time and silently corrupt data the other 1%.

When atomics are the right tool

Atomics are correct and performant when:

  • Contention is spread across many addresses. A histogram with 256 bins and 1M data points averages ~4000 updates per bin. With 256 distinct addresses, hardware can pipeline atomics efficiently.
  • The number of atomic updates is small. Writing one partial sum per block (e.g., 4096 blocks writing to 4096 slots) has no contention at all.
  • You need a simple reduction across blocks. atomicAdd(&global_sum, block_partial) at the end of a block reduction is clean and fast. One atomic per block, not per thread.

Atomics are the wrong tool when:

  • All threads target the same address. Restructure as a reduction.
  • You need complex multi-word updates. CAS loops with retries under high contention can be slower than a redesigned algorithm.
  • A reduction tree would work. Always prefer __syncthreads() with shared memory over atomics when the data fits in a block.

In practice

Synchronization bugs on GPUs are among the hardest to diagnose because they often manifest as intermittent wrong results rather than crashes. Production considerations:

  • Always run compute-sanitizer --tool synccheck on new kernels. It detects divergent barriers, missing barriers, and barrier misuse. The overhead is 10-50x, so use it during development, not in production.
  • Prefer two-level reductions over single-level atomics. Reduce within a block using shared memory and __syncthreads(), then merge block results with a small number of atomics or a second kernel. This pattern works for sum, min, max, and count.
  • Avoid global memory fences when possible. If your algorithm requires __threadfence() to communicate between blocks, consider whether splitting into two kernels would be simpler and equally fast. The 5-10 microsecond kernel launch overhead is usually cheaper than the complexity and fragility of fence-based coordination.
  • Use shared memory atomics for block-local histograms. Shared memory atomics are roughly 10x faster than global memory atomics because the latency is 20-30 cycles instead of 200-600. Build a private histogram in shared memory, then merge with global atomics at the end.
  • Profile atomic throughput with Nsight Compute. Look at the l1tex__t_sectors_pipe_lsu_mem_global_op_atom metric to see how many sectors are consumed by atomic operations. If atomics dominate your kernel time, restructure.

What comes next

Reduction is a building block. The next algorithm that every GPU programmer needs is the prefix sum (scan), which computes running totals across an array. Prefix sums use the same __syncthreads() and shared memory patterns covered here, combined with a clever up-sweep / down-sweep structure.

The next article, CUDA prefix sum and reduction patterns, covers inclusive and exclusive scans, work-efficient algorithms, and how these primitives compose into higher-level operations like stream compaction and radix sort.
