
Dynamic parallelism: kernels launching kernels

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article assumes you have read CUDA streams and asynchronous execution, where you learned how the host overlaps kernel launches, memory transfers, and synchronization. Dynamic parallelism moves that launch capability onto the device itself, so understanding streams and sync semantics is essential before going further.

You should be comfortable writing CUDA kernels, launching grids from host code, and reasoning about parent-child relationships between work items.

The problem with host-driven launches

Standard CUDA follows a strict pattern: the host decides how much work to launch, configures the grid, and calls the kernel. The GPU executes, returns control to the host, and the host decides what to launch next. This works well when you know the workload shape ahead of time.

It falls apart when you do not.

Consider a quadtree traversal. The root has four children. Each child may or may not need further subdivision depending on the data. A sparse region needs no recursion. A dense region needs many levels. The host cannot know the branching pattern without reading back intermediate results, which means a round-trip per level: launch kernel, copy results to host, inspect, launch next kernel. Each round-trip costs 5 to 20 microseconds of latency, and for a tree with 15 levels, that adds up to hundreds of microseconds of pure overhead sitting idle.

Dynamic parallelism eliminates this round-trip. Device code launches new kernels directly on the GPU, without returning to the host.

What dynamic parallelism is

Starting with compute capability 3.5 (Kepler), CUDA allows kernels to launch other kernels from device code. A thread running on the GPU can call the standard triple-chevron launch syntax:

__global__ void child_kernel(int* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] *= 2;
    }
}

__global__ void parent_kernel(int* data, int* sizes, int num_tasks) {
    int task = blockIdx.x * blockDim.x + threadIdx.x;
    if (task < num_tasks) {
        int offset = task * 1024;
        int n = sizes[task];
        // Device-side launch: each parent thread spawns a child grid
        child_kernel<<<(n + 255) / 256, 256>>>(data + offset, n);
    }
}

The parent thread issues a launch command that gets queued on the device. The GPU runtime schedules child grids without host involvement. From the perspective of the CUDA execution model, the child grid is a fully independent grid with its own blocks, warps, and threads.

Parent-child relationship

The relationship between parent and child grids follows specific rules:

graph TD
  H["Host"] -->|"launch"| P["Parent Grid
1024 threads"]
  P -->|"thread 0 launches"| C0["Child Grid 0
32 threads"]
  P -->|"thread 1 launches"| C1["Child Grid 1
32 threads"]
  P -->|"thread 2 launches"| C2["Child Grid 2
32 threads"]
  P -->|"..."| CN["Child Grid N
32 threads"]
  C0 -->|"implicit sync"| PE["Parent thread block
completes"]
  C1 -->|"implicit sync"| PE
  C2 -->|"implicit sync"| PE
  CN -->|"implicit sync"| PE
  style H fill:#f0f0f0,stroke:#333
  style P fill:#4a90d9,stroke:#333,color:#fff
  style C0 fill:#7bc67e,stroke:#333
  style C1 fill:#7bc67e,stroke:#333
  style C2 fill:#7bc67e,stroke:#333
  style CN fill:#7bc67e,stroke:#333
  style PE fill:#d9a44a,stroke:#333

Key rules:

  • Memory visibility: The child grid sees all global memory that was visible to the parent at launch time. Shared memory is not inherited. The child gets its own shared memory allocation.
  • Implicit synchronization: When a parent thread block completes, the runtime guarantees all child grids launched by threads in that block have finished. You do not need an explicit sync for this.
  • Explicit synchronization: If a parent thread needs to read results from a child grid before the block ends, it must call cudaDeviceSynchronize() from device code. This blocks the calling thread until all child grids launched by threads in the same block have completed.
  • Stream semantics: Launches that name no stream go into the device-side default (NULL) stream, which is scoped to the thread block, so such launches are serialized with respect to each other. To let child grids run concurrently, create device-side streams with cudaStreamCreateWithFlags and the cudaStreamNonBlocking flag (the only mode supported on the device).
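A sketch of the stream rule in practice, reusing child_kernel from earlier (device-side streams must be created non-blocking; this is illustrative rather than a complete program):

```cuda
__global__ void parent_with_streams(int* data, int n) {
    // Each thread creates its own stream. Children launched into
    // different streams may execute concurrently.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    // Launches into the same stream are ordered with respect to each other.
    child_kernel<<<(n + 255) / 256, 256, 0, s>>>(data, n);

    cudaStreamDestroy(s);  // safe even if the child is still running;
                           // the stream is released once its work completes
}
```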

Enabling dynamic parallelism

Dynamic parallelism requires separate compilation and linking. The compiler needs to generate relocatable device code:

# Compile with relocatable device code
nvcc -arch=sm_70 -rdc=true -o program main.cu

# For separate compilation and linking
nvcc -arch=sm_70 -rdc=true -dc main.cu -o main.o
nvcc -arch=sm_70 -rdc=true -dlink main.o -o link.o
g++ main.o link.o -L/usr/local/cuda/lib64 -lcudart -lcudadevrt -o program  # adjust -L to your CUDA install path

The critical flag is -rdc=true (relocatable device code). Without it, the compiler inlines everything into a single compilation unit and device-side launches will fail to compile. You also need to link against cudadevrt (the CUDA device runtime library).

In CMake:

find_package(CUDAToolkit REQUIRED)
set_target_properties(my_target PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(my_target PRIVATE CUDA::cudart CUDA::cudadevrt)

Synchronization from device code

The only synchronization primitive available on the device for child grids is cudaDeviceSynchronize(). When called from a device thread, it blocks that thread until all child grids launched by any thread in the same block have completed.

__global__ void parent_with_sync(int* data, int n) {
    int tid = threadIdx.x;

    // Phase 1: each thread launches a child to process a chunk
    int chunk_size = n / blockDim.x;
    int offset = tid * chunk_size;
    child_kernel<<<(chunk_size + 255) / 256, 256>>>(data + offset, chunk_size);

    // Wait for all children launched by this block
    cudaDeviceSynchronize();

    // Phase 2: safe to read results written by child grids
    if (tid == 0) {
        int sum = 0;
        for (int i = 0; i < n; i++) sum += data[i];
        data[0] = sum;
    }
}

cudaDeviceSynchronize() on the device is expensive. It stalls the calling warp and potentially the entire SM while waiting. Use it only when you genuinely need results from child grids before proceeding. Be aware that device-side cudaDeviceSynchronize() was deprecated in CUDA 11.6 and removed in CUDA 12.0: on current toolkits, code that must consume child results has to be restructured around the implicit end-of-block guarantee, for example by moving the consuming phase into a separate kernel launched after the children.

Recursive algorithms with dynamic parallelism

Recursion is where dynamic parallelism is most natural. Consider a recursive parallel reduction:

__device__ unsigned int blocks_done = 0;

__global__ void recursive_reduce(int* data, int* result, int n) {
    extern __shared__ int sdata[];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Load into shared memory
    sdata[tid] = (idx < n) ? data[idx] : 0;
    __syncthreads();

    // Block-level reduction
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Write block result; the LAST block to finish launches the next level.
    // Launching unconditionally from block 0 would be a race: block 0 can
    // reach the launch before other blocks have written their partial sums.
    if (tid == 0) {
        result[blockIdx.x] = sdata[0];
        __threadfence();  // make result[] visible to the child grid

        unsigned int ticket = atomicInc(&blocks_done, gridDim.x - 1);
        if (ticket == gridDim.x - 1 && gridDim.x > 1) {
            int num_blocks = gridDim.x;
            int next_blocks = (num_blocks + blockDim.x - 1) / blockDim.x;
            recursive_reduce<<<next_blocks, blockDim.x,
                blockDim.x * sizeof(int)>>>(result, result, num_blocks);
        }
    }
}

Each level reduces the data by a factor of the block size. The last block to finish, identified with an atomic ticket counter, checks whether further reduction is needed and launches the next level; atomicInc wraps the counter back to zero at gridDim.x - 1, so it is automatically reset for the level that follows. No device-side cudaDeviceSynchronize() is needed: nothing after the launch reads the child's results, and the implicit end-of-block synchronization guarantees the child completes before the parent grid does. This continues until a single block can handle the remaining elements.

Recursive tree traversal: quadtree example

A more practical use case is adaptive mesh refinement or spatial subdivision. Here is a simplified quadtree construction:

struct BBox { float xmin, ymin, xmax, ymax; };

__device__ bool needs_subdivision(float* points, int count, BBox box) {
    return count > 64;  // Subdivide if more than 64 points in region
}

__global__ void build_quadtree(float* points_x, float* points_y,
                                int* indices, int count,
                                BBox box, int depth, int max_depth,
                                int* tree_nodes) {
    if (depth >= max_depth || count <= 64) {
        // Leaf node: store point count
        if (threadIdx.x == 0) {
            atomicAdd(tree_nodes, 1);
        }
        return;
    }

    // Partition points into four quadrants (simplified)
    float mx = (box.xmin + box.xmax) * 0.5f;
    float my = (box.ymin + box.ymax) * 0.5f;

    // Each thread classifies one point
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // ... partitioning logic using shared memory atomics ...
    // (this elided step produces the per-quadrant counts count_nw, count_ne,
    //  count_sw, count_se and index arrays idx_nw, idx_ne, idx_sw, idx_se
    //  used in the launches below)

    __syncthreads();

    // Thread 0 launches four children for each quadrant
    if (threadIdx.x == 0) {
        BBox nw = {box.xmin, my, mx, box.ymax};
        BBox ne = {mx, my, box.xmax, box.ymax};
        BBox sw = {box.xmin, box.ymin, mx, my};
        BBox se = {mx, box.ymin, box.xmax, my};

        int blk = 256;
        if (count_nw > 0)
            build_quadtree<<<(count_nw+blk-1)/blk, blk>>>(
                points_x, points_y, idx_nw, count_nw,
                nw, depth+1, max_depth, tree_nodes);
        if (count_ne > 0)
            build_quadtree<<<(count_ne+blk-1)/blk, blk>>>(
                points_x, points_y, idx_ne, count_ne,
                ne, depth+1, max_depth, tree_nodes);
        if (count_sw > 0)
            build_quadtree<<<(count_sw+blk-1)/blk, blk>>>(
                points_x, points_y, idx_sw, count_sw,
                sw, depth+1, max_depth, tree_nodes);
        if (count_se > 0)
            build_quadtree<<<(count_se+blk-1)/blk, blk>>>(
                points_x, points_y, idx_se, count_se,
                se, depth+1, max_depth, tree_nodes);

        // No explicit sync needed here: nothing after the launches reads the
        // children's results, and the implicit end-of-block synchronization
        // guarantees all four children complete before this block does.
    }
}

graph TD
  R["Root Region
10000 points"] --> NW["NW Quadrant
3200 points"]
  R --> NE["NE Quadrant
45 points ✓ leaf"]
  R --> SW["SW Quadrant
5100 points"]
  R --> SE["SE Quadrant
1655 points"]

  NW --> NW1["NW-NW
800 pts"]
  NW --> NW2["NW-NE
30 pts ✓ leaf"]
  NW --> NW3["NW-SW
1500 pts"]
  NW --> NW4["NW-SE
870 pts"]

  SW --> SW1["SW-NW
2000 pts"]
  SW --> SW2["SW-NE
1200 pts"]
  SW --> SW3["SW-SW
50 pts ✓ leaf"]
  SW --> SW4["SW-SE
1850 pts"]

  SE --> SE1["SE-NW
55 pts ✓ leaf"]
  SE --> SE2["SE-NE
600 pts"]
  SE --> SE3["SE-SW
500 pts"]
  SE --> SE4["SE-SE
500 pts"]

  style NE fill:#7bc67e,stroke:#333
  style NW2 fill:#7bc67e,stroke:#333
  style SW3 fill:#7bc67e,stroke:#333
  style SE1 fill:#7bc67e,stroke:#333
  style R fill:#4a90d9,stroke:#333,color:#fff

The key insight: sparse quadrants (NE with 45 points) terminate early as leaf nodes, while dense quadrants (SW with 5100 points) recurse deeper. The GPU adapts the parallelism to the data distribution without host involvement.

Dynamic vs flat parallelism

Not every problem benefits from dynamic parallelism. The launch overhead is real, and for uniform workloads, a single flat kernel is almost always faster.

| Aspect | Dynamic parallelism | Flat parallelism | Dynamic wins when… |
| --- | --- | --- | --- |
| Launch cost | ~5 to 15 us per device launch | Single host launch ~5 us | Few large child grids, not many small ones |
| Work distribution | Adapts at runtime per thread | Fixed at host launch time | Workload is irregular or data-dependent |
| Code complexity | Natural recursion, readable | Iterative workarounds, queues | Algorithm is inherently recursive |
| Memory overhead | Each child grid needs bookkeeping | Minimal overhead | Grid count stays manageable |
| Synchronization | cudaDeviceSynchronize() stalls SM | __syncthreads() within block | Cross-block dependencies exist |
| Nesting depth | Up to 24 levels | N/A (single level) | Tree depth is bounded and moderate |
| Occupancy impact | Child grids compete for SM resources | All resources planned upfront | SMs have spare capacity |
| Debugging | Harder to trace nested launches | Standard profiling works | Correctness matters more than peak perf |

Worked example: launch overhead

Consider a parent kernel with 1024 threads, each launching a child grid of 32 threads:

  • Total child grids: 1024
  • Total child threads: 1024 * 32 = 32,768
  • Estimated launch overhead: ~10 us per device-side launch
  • Total launch overhead: 1024 * 10 us = 10,240 us = 10.24 ms

Compare this to a single flat kernel of 32,768 threads:

  • Launch overhead: ~5 us (one host-side launch)
  • Execution time: depends on workload, but the launch cost is negligible

The flat kernel spends 5 microseconds on launch overhead. The dynamic version spends over 10 milliseconds just on launches before any useful work begins. That is a 2000x difference in launch cost alone.

✓ Dynamic parallelism makes sense when each child grid does substantial work (thousands of operations) that justifies the launch overhead.

✗ It does not make sense when you can predict the workload shape and launch a single grid from the host.

⚠ The break-even point is roughly when each child grid runs for at least 50 to 100 microseconds of compute. Below that, launch overhead dominates.

Worked example: nesting depth limits

CUDA supports a maximum nesting depth of 24 levels. For a binary tree recursion where each node spawns two children:

  • Depth 0: 1 grid (root)
  • Depth 1: 2 grids
  • Depth d: 2^d grids
  • Depth 24: 2^24 = 16,777,216 grids

At depth 24, over 16 million grids would be in flight simultaneously. Each pending grid requires device memory for its launch parameters, stream state, and bookkeeping. The practical limit is far lower.

Realistic constraints:

  • Device memory for pending grids: each pending launch consumes roughly 1 to 4 KB of device memory for bookkeeping. At depth 24 with 16M pending grids, that is 16 to 64 GB of overhead, exceeding most GPU memory.
  • Pending launch buffer: the device runtime has a fixed-size buffer (default 2048 pending launches, configurable via cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, N)). Once full, launches block until space is available.
  • Practical maximum: most applications stay within 6 to 10 levels of nesting. Beyond that, the overhead of managing nested grids outweighs the benefits.

// Increase the pending launch buffer if needed
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 8192);

// Check the current limit
size_t limit;
cudaDeviceGetLimit(&limit, cudaLimitDevRuntimePendingLaunchCount);
printf("Pending launch limit: %zu\n", limit);

When NOT to use dynamic parallelism

Dynamic parallelism is a tool for specific situations. Reaching for it by default is a mistake. Flat kernels are faster in the common case.

Avoid dynamic parallelism when:

  • The workload is uniform. If every thread does roughly the same amount of work, a single flat launch is simpler and faster.
  • Child grids are tiny. Launching a grid of 32 threads costs more in overhead than the work those threads perform. Batch small tasks into a single grid instead.
  • You can precompute the work distribution. If a host-side scan or prefix sum can determine which threads need extra work, encode that into a single kernel’s launch parameters.
  • You need deterministic performance. Device-side launch timing is less predictable than host-side launches. Real-time applications should avoid it.

Prefer dynamic parallelism when:

  • The algorithm is inherently recursive (tree traversals, adaptive mesh refinement, hierarchical N-body).
  • The workload is highly irregular and cannot be balanced with a fixed grid configuration.
  • Host round-trips for multi-level algorithms create unacceptable latency.
  • The alternative is complex device-side work queues that are harder to write and maintain than a clean recursive launch.

Performance tuning tips

If you decide dynamic parallelism is the right tool, these practices help:

  1. Minimize child grid count. Launch fewer, larger child grids rather than many small ones. If 32 threads in a warp all need to launch work, have one thread aggregate and launch a single grid for the warp.

  2. Increase the pending launch buffer. The default of 2048 pending launches is conservative. For deep recursion, increase it:

    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);
  3. Avoid deep nesting. Restructure algorithms to limit depth. A breadth-first approach with 3 to 4 levels often outperforms depth-first with 15 levels.

  4. Use cudaDeviceSynchronize() sparingly. Each call stalls the warp. If you do not need intermediate results, let the implicit end-of-block synchronization handle it.

  5. Profile with Nsight Compute. The launch__* metrics show device-side launch overhead. Compare against a flat baseline to quantify whether dynamic parallelism is actually helping.

  6. Reserve SM resources. Child grids compete with parent grids for SM slots. If the GPU is fully occupied by parent blocks, child grids may stall waiting for resources, creating a deadlock-like situation. Launch parent grids with fewer blocks than SMs to leave room for children.

Common pitfalls

Deadlock from resource exhaustion: if the parent grid occupies all SMs and child grids cannot start, the parent blocks on cudaDeviceSynchronize() while children wait for SM resources. Neither can make progress. Solution: launch the parent with fewer blocks than available SMs.
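One way to apply that advice is to size the parent grid from the device's SM count at startup (a host-side sketch; parent_kernel stands in for whichever parent you launch, and the half-the-SMs ratio is a starting point to tune, not a rule):

```cuda
int sm_count = 0;
cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, 0);

// Leave roughly half the SMs free so child grids have somewhere to run
// while parent blocks wait on their children.
int parent_blocks = (sm_count > 1) ? sm_count / 2 : 1;
parent_kernel<<<parent_blocks, 256>>>(/* ... */);
```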

Forgotten -rdc=true: the most common build error. Without relocatable device code, the compiler treats each file as a self-contained unit and device-side kernel references fail to resolve.

Assuming shared memory inheritance: child grids do not inherit the parent’s shared memory. Each child grid gets its own allocation. Data must be passed through global memory or kernel arguments.

Ignoring error codes: device-side launches can fail (out of memory, buffer full). Always check the return value:

child_kernel<<<grid, block>>>(args);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    // Handle launch failure on device
    printf("Device launch failed: %s\n", cudaGetErrorString(err));
}

In practice

Dynamic parallelism appears in a few well-defined domains:

  • Adaptive mesh refinement (AMR): computational fluid dynamics codes subdivide cells that contain shock fronts or turbulence. The subdivision pattern depends on the solution at each timestep. Dynamic parallelism avoids host round-trips between refinement levels.
  • Hierarchical N-body simulations: Barnes-Hut tree traversal naturally maps to recursive kernel launches. Each node in the octree decides whether to open its children based on the distance to the query point.
  • Graph algorithms on irregular graphs: BFS and SSSP on power-law graphs produce wildly varying frontier sizes. Dynamic parallelism lets each level launch a grid sized to its actual frontier rather than over-allocating for the worst case.
  • Ray tracing BVH traversal: some GPU ray tracers use dynamic parallelism for secondary ray generation, though modern approaches (like OptiX) use specialized hardware instead.

In most production CUDA code, you will not use dynamic parallelism. The majority of workloads (linear algebra, convolutions, reductions, scans) have predictable shapes and are better served by flat launches. But when you hit a genuinely irregular, recursive, or adaptive problem, dynamic parallelism turns a multi-launch host loop into a clean, self-contained GPU computation.

What comes next

The next article covers CUDA unified memory, where we look at how cudaMallocManaged lets the host and device share a single pointer, how page migration works under the hood, and when unified memory helps versus when explicit transfers are still faster. Unified memory pairs well with dynamic parallelism because child grids can access the same managed allocations without explicit data marshaling between parent and child.
