
Concurrent data structures on the GPU

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article builds directly on the earlier entries in this series covering warps, the CUDA memory hierarchy, and synchronization and atomics.

You should be comfortable with warp-level execution, the CUDA memory model, and basic atomic operations before proceeding.

Why standard data structures break at GPU scale

A CPU program typically runs a handful of threads. Contention is manageable. A GPU launches tens of thousands of threads simultaneously, all hitting the same memory bus. This changes the calculus for every data structure decision.

Consider a simple counter protected by a mutex. On a CPU with 8 threads, lock contention is low. On a GPU with 10,000+ threads, every thread spinning on the same lock creates a serialization bottleneck that destroys parallelism. Worse, GPU threads within a warp execute in lockstep. On architectures before Volta's independent thread scheduling, if one thread in a warp holds a lock while another thread in the same warp spins waiting for it, you get a deadlock: the holder cannot release because the warp cannot advance past the spinning thread.

Three properties make GPU concurrency fundamentally different from CPU concurrency:

  1. Thread count. 10,000 to 1,000,000+ concurrent threads, not 8 or 64.
  2. SIMT execution. Threads in a warp share an instruction pointer. Divergent paths serialize within the warp.
  3. Memory latency hiding. GPUs rely on massive parallelism to hide latency. Locks that serialize threads eliminate this advantage.

The result: lock-free data structures are not just preferred on GPUs. They are often the only viable option.

Lock-based approach: GPU mutex with atomicCAS

Before exploring lock-free designs, it is worth understanding why locks are problematic. The simplest GPU lock uses atomicCAS to implement a spinlock.

// GPU spinlock using atomicCAS
// WARNING: illustration only. Can deadlock if two threads in the same
// warp contend (pre-Volta SIMT), and serializes all callers.
__device__ int lock = 0;

__device__ void acquire(int* lock) {
    while (atomicCAS(lock, 0, 1) != 0) {
        // spin: keep trying until we swap 0 -> 1
    }
}

__device__ void release(int* lock) {
    __threadfence();     // make critical-section writes visible first
    atomicExch(lock, 0);
}

__global__ void critical_section_kernel(int* lock, int* shared_counter) {
    acquire(lock);
    // Only one thread at a time executes here
    *shared_counter += 1;
    release(lock);
}

atomicCAS(addr, compare, val) atomically reads *addr, compares it with compare, and writes val only if they match. It returns the old value. A return of 0 means the caller won the lock.

Example: four threads compete for a mutex

Suppose threads T0, T1, T2, T3 all call acquire() at roughly the same time:

| Step | T0 | T1 | T2 | T3 | lock value |
|------|----|----|----|----|------------|
| 1 | atomicCAS(lock,0,1) returns 0 ✓ | atomicCAS(lock,0,1) returns 1 ✗ | atomicCAS(lock,0,1) returns 1 ✗ | atomicCAS(lock,0,1) returns 1 ✗ | 1 |
| 2 | executes critical section | spins | spins | spins | 1 |
| 3 | atomicExch(lock,0) | spins | spins | spins | 0 |
| 4 | done | atomicCAS(lock,0,1) returns 0 ✓ | atomicCAS(lock,0,1) returns 1 ✗ | atomicCAS(lock,0,1) returns 1 ✗ | 1 |
| 5 | done | executes critical section | spins | spins | 1 |

Only one thread succeeds per round. The others spin, wasting cycles and memory bandwidth. With 10,000 threads, this serialization makes the GPU slower than a single CPU core. If T0 and T1 are in the same warp, the warp cannot retire T0’s critical section until T1’s spin loop also reaches a convergence point, further degrading performance.

GPU spinlocks are a last resort. Use them only when the critical section is extremely short and contention is guaranteed to be low (e.g., one thread per block contending for a global lock).

Lock-free approaches: the GPU-native path

Lock-free data structures use atomic operations (primarily atomicCAS) to make progress without ever holding a lock. At least one thread is guaranteed to make progress at any time, eliminating deadlock and reducing serialization.

Concurrent stack with atomicCAS

A lock-free stack is the simplest useful lock-free structure. It maintains a top-of-stack index that threads update atomically.

#define MAX_STACK 1048576

struct LockFreeStack {
    int data[MAX_STACK];
    int top; // index of next free slot, starts at 0
};

__device__ bool push(LockFreeStack* s, int val) {
    int old_top;
    do {
        old_top = s->top;
        if (old_top >= MAX_STACK) return false; // full
        // Claim the slot first; writing data before the CAS would let a
        // losing thread overwrite a slot another thread already owns.
    } while (atomicCAS(&s->top, old_top, old_top + 1) != old_top);
    s->data[old_top] = val;
    __threadfence(); // publish the write before other threads can pop it
    return true;
}

__device__ bool pop(LockFreeStack* s, int* val) {
    int old_top;
    do {
        old_top = s->top;
        if (old_top <= 0) return false; // empty
    } while (atomicCAS(&s->top, old_top, old_top - 1) != old_top);
    *val = s->data[old_top - 1];
    return true;
}

__global__ void stack_test(LockFreeStack* s) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    push(s, tid);

    int val;
    if (pop(s, &val)) {
        // val contains some thread's ID (not necessarily our own)
    }
}

The CAS loop is the core pattern: read the current state, compute the desired new state, attempt the swap. If another thread modified the state between the read and the CAS, the CAS fails and the loop retries. No thread ever blocks, and at least one thread always succeeds, so the structure is lock-free overall (though an individual thread can starve under contention).

One caveat: the data write and the counter CAS are separate operations, so there is a window in which a pop can claim a slot whose matching push has not yet finished writing. A production design closes this window with a per-slot ready flag, or by packing the index and a generation count into a single 64-bit atomic.

Concurrent worklist

A worklist (work queue) is essential for BFS, graph algorithms, and iterative solvers. Threads push new work items and pop items to process. The lock-free version uses two atomic counters: one for the read head and one for the write tail.

graph LR
  subgraph "GPU Worklist (circular buffer)"
      direction LR
      S0["Slot 0"]
      S1["Slot 1"]
      S2["Slot 2"]
      S3["Slot 3"]
      S4["Slot 4"]
      S5["Slot 5"]
  end

  HEAD["head (atomic)
Consumers atomicAdd
to claim read slot"]
  TAIL["tail (atomic)
Producers atomicAdd
to claim write slot"]

  HEAD -.-> S1
  TAIL -.-> S4

  style HEAD fill:#EF553B,color:#fff
  style TAIL fill:#636EFA,color:#fff

#define WL_SIZE 1048576

struct Worklist {
    int data[WL_SIZE];
    unsigned int head; // next slot to read
    unsigned int tail; // next slot to write
};

__device__ bool wl_push(Worklist* wl, int item) {
    unsigned int slot = atomicAdd(&wl->tail, 1);
    if (slot >= WL_SIZE) return false; // overflow: slot stays unused
    wl->data[slot] = item;             // slot < WL_SIZE, so no modulo needed
    return true;
}

__device__ bool wl_pop(Worklist* wl, int* item) {
    unsigned int slot = atomicAdd(&wl->head, 1);
    if (slot >= atomicAdd(&wl->tail, 0)) { // atomic read of tail
        // No items available. In practice, check with a plain
        // load before the atomicAdd to reduce contention.
        atomicSub(&wl->head, 1); // undo our claim
        return false;
    }
    *item = wl->data[slot];
    return true;
}

Each atomicAdd returns a unique slot, so no two threads read or write the same position. This eliminates locking entirely. The trade-offs: the worklist has a fixed capacity, a consumer can claim a slot whose producer has not yet written the item, and true circular wrap-around requires additional logic beyond what is shown here.

Concurrent linked list insert with atomicCAS

Linked lists on the GPU are unusual (pointer chasing is terrible for coalescing), but they appear in hash table chaining and dynamic graph algorithms. The lock-free insert uses CAS to swing the head pointer.

graph TD
  START["Thread wants to insert node N"]
  READ["Read current head pointer
old_head = *head"]
  LINK["Set N->next = old_head"]
  CAS["atomicCAS(head, old_head, N)"]
  CHECK{"CAS succeeded?"}
  DONE["N is now the new head ✓"]

  START --> READ
  READ --> LINK
  LINK --> CAS
  CAS --> CHECK
  CHECK -- "Yes" --> DONE
  CHECK -- "No: another thread
changed head" --> READ

  style DONE fill:#00CC96,color:#fff
  style CHECK fill:#FFA15A,color:#fff

struct Node {
    int key;
    Node* next;
};

// CUDA has no pointer overload of atomicCAS, so reinterpret the head
// pointer as a 64-bit integer. Using the CAS return value avoids
// re-reading *head on every retry.
__device__ void list_insert(Node** head, Node* new_node) {
    Node* old_head = *head;
    Node* assumed;
    do {
        assumed = old_head;
        new_node->next = assumed;
        old_head = (Node*)atomicCAS((unsigned long long*)head,
                                    (unsigned long long)assumed,
                                    (unsigned long long)new_node);
    } while (old_head != assumed);
}

The retry loop guarantees that every insert eventually succeeds, but under heavy contention thousands of threads may retry many times. This is why linked lists are a poor choice for hot paths on the GPU.

Example: the ABA problem

Lock-free structures using CAS are vulnerable to the ABA problem. Here is a concrete example with a CAS-based linked list.

Initial state: Head -> A -> B -> C

Step 1: Thread T0 reads head = A, A->next = B. T0 is preempted (or stalled by the scheduler).

Step 2: Thread T1 removes A (head becomes B). Thread T2 removes B (head becomes C). Thread T1 reinserts A at the head (head becomes A -> C). The pointer value of head is now the same address as before (A), but A’s next is C, not B.

Step 3: T0 resumes. Its CAS compares head against A and succeeds because head still equals A. T0 sets head to B. But B was already freed or points to garbage. The list is now corrupted.

The value at head went A -> B -> C -> A, and T0 could not distinguish the “new A” from the “old A” because CAS only compares the pointer value.

Fix: Use a tagged pointer that combines the pointer with a monotonically increasing counter. Each CAS increments the counter, so even if the pointer returns to the same value, the tag differs and the CAS fails. On 64-bit GPUs, you can pack a 48-bit pointer and a 16-bit counter into a single 64-bit atomicCAS.

GPU hash tables

Hash tables are the most practically important concurrent data structure on the GPU. They appear in deduplication, graph analytics, database joins, and neural network embedding lookups.

Open addressing with linear probing

Open addressing avoids pointers entirely, making it coalescing-friendly. Each slot holds a key-value pair. Inserts and lookups probe consecutive slots until finding an empty slot or a match.

#define HT_EMPTY 0xFFFFFFFF  // sentinel: real keys must never equal this
#define HT_CAPACITY 1048576

struct GPUHashTable {
    unsigned int keys[HT_CAPACITY];
    unsigned int values[HT_CAPACITY];
};

__device__ void ht_init(GPUHashTable* ht) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = tid; i < HT_CAPACITY; i += stride) {
        ht->keys[i] = HT_EMPTY;
        ht->values[i] = 0;
    }
}

__device__ bool ht_insert(GPUHashTable* ht, unsigned int key, unsigned int val) {
    unsigned int slot = key % HT_CAPACITY; // simple hash
    for (int i = 0; i < HT_CAPACITY; i++) {
        unsigned int prev = atomicCAS(&ht->keys[slot], HT_EMPTY, key);
        if (prev == HT_EMPTY || prev == key) {
            // Note: this write is not atomic with the key CAS, so a
            // concurrent lookup can briefly see the key with a stale value.
            ht->values[slot] = val;
            return true;
        }
        slot = (slot + 1) % HT_CAPACITY; // linear probe
    }
    return false; // table full
}

__device__ bool ht_lookup(GPUHashTable* ht, unsigned int key, unsigned int* val) {
    unsigned int slot = key % HT_CAPACITY;
    for (int i = 0; i < HT_CAPACITY; i++) {
        unsigned int k = ht->keys[slot];
        if (k == key) {
            *val = ht->values[slot];
            return true;
        }
        if (k == HT_EMPTY) return false; // not found
        slot = (slot + 1) % HT_CAPACITY;
    }
    return false;
}

__global__ void insert_kernel(GPUHashTable* ht, unsigned int* keys,
                               unsigned int* vals, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        ht_insert(ht, keys[tid], vals[tid]);
    }
}

The atomicCAS on the key slot is the critical operation. If the slot is empty (HT_EMPTY), the CAS writes our key and we own the slot. If someone else already wrote the same key, we update the value. Otherwise, we probe the next slot.

Load factor matters. At 50% load, the expected probe length for a successful lookup is about 1.5; at 90% load, it jumps to about 5.5. Keep GPU hash tables below 70% load for acceptable performance.

Cuckoo hashing

Cuckoo hashing uses two (or more) hash functions and two tables. Each key has exactly two possible locations. Inserts that find both locations occupied evict an existing key, which then reinserts itself at its alternate location. This guarantees O(1) worst-case lookups.

On the GPU, cuckoo hashing is attractive because a lookup touches at most two locations, with no probing chain. The CUDPP and NVIDIA cuCollections libraries provide production-quality GPU hash tables. The downside is that inserts can trigger eviction chains that are hard to parallelize.

Data structure comparison

| Structure | Approach | Scalability (10K+ threads) | Key limitation | Best use case |
|-----------|----------|----------------------------|----------------|---------------|
| Spinlock-guarded array | Lock-based | ✗ Poor. Serializes all threads. | Warp deadlock risk, high contention | Prototype only, never production |
| Lock-free stack | Lock-free (CAS) | Moderate. Single CAS hotspot. | Top-of-stack is contention point | Small shared buffers, per-block stacks |
| Lock-free worklist | Lock-free (atomicAdd) | ✓ Good. Two independent counters. | Fixed capacity, no wrap-around | BFS frontiers, iterative algorithms |
| Linked list (CAS) | Lock-free (CAS) | ✗ Poor. Pointer chasing kills coalescing. | ABA problem, cache-hostile | Hash table chaining (small buckets only) |
| Hash table (open addressing) | Lock-free (CAS) | ✓ Good. Distributed contention. | Probe chains degrade past 70% load | Deduplication, joins, lookups |
| Cuckoo hash table | Lock-free (CAS) | ✓ Excellent lookups. Insert chains complex. | Eviction cascades hard to parallelize | Read-heavy workloads, exact lookups |

When to use CPU structures instead

Not every data structure belongs on the GPU. Moving data to the CPU for processing is sometimes the right call.

Use CPU structures when:

  • The data structure is inherently sequential. Balanced BSTs, skip lists, and B-trees require complex rebalancing that serializes on the GPU. Build them on the CPU and copy a read-only snapshot to the GPU.
  • Contention is unavoidable. If your algorithm requires all threads to modify a single data structure in a dependent order, the GPU will serialize and a CPU with fewer, faster cores will win.
  • Dynamic allocation is needed. malloc inside a kernel (__device__ malloc) is slow and limited. If your data structure needs to grow dynamically, allocate on the CPU and transfer.
  • The working set is small. A hash table with 1,000 entries is not worth GPU acceleration. The kernel launch overhead alone exceeds the CPU processing time.

The hybrid pattern is common in production: build the data structure on the CPU, transfer a read-only copy to GPU global memory, process queries in parallel on the GPU, and transfer results back. This avoids concurrent mutation entirely while exploiting GPU throughput for the read/compute phase.

CPU: Build hash table (single-threaded, complex insertions)
      |
      v
  cudaMemcpy (host -> device)
      |
      v
GPU: 100K threads do parallel lookups (read-only, no contention)
      |
      v
  cudaMemcpy (device -> host)
      |
      v
CPU: Process results

This pattern underlies GPU-accelerated database joins, embedding table lookups in recommendation systems, and dictionary encoding in columnar databases.

In practice

Concurrent data structures on the GPU reward simplicity. The fewer CAS operations per logical operation, the better your structure will scale.

Production considerations:

  • Profile before choosing. Use Nsight Compute to measure actual CAS retry rates. If your lock-free stack has 50x retries per push, the theoretical lock-freedom is irrelevant and you need a different approach.
  • Prefer atomicAdd over atomicCAS when possible. atomicAdd is hardware-accelerated on modern GPUs and does not require a retry loop. Worklists using atomicAdd outperform CAS-based stacks by 5-10x under heavy contention.
  • Partition to reduce contention. Instead of one global hash table, give each thread block a local hash table in shared memory. Merge results in a second pass. This is the standard approach in GPU database joins.
  • Watch your load factor. GPU hash tables perform well at 50-70% load. Past 80%, probe chains grow rapidly and performance collapses. Always over-allocate.
  • Avoid linked structures. Pointer chasing on the GPU means random global memory accesses with no coalescing. Use arrays with atomic index manipulation wherever possible.
  • Test with race detectors. compute-sanitizer --tool racecheck detects data races in CUDA kernels. Run it on every concurrent data structure before deployment.

What comes next

Concurrent data structures enable GPU kernels to coordinate without CPU intervention. But maximum throughput requires overlapping data transfers with kernel execution so the GPU never stalls waiting for data.

The next article, CUDA streams and async execution, covers how to use CUDA streams to overlap memory copies with computation, launch multiple kernels concurrently, and build efficient multi-stage pipelines.
