
Heterogeneous computing: CPU and GPU working together

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article builds on two earlier posts in this series: CUDA streams and asynchronous execution (part 13) and CUDA events and fine-grained synchronization (part 14).

You should be comfortable launching async kernels, using cudaStreamSynchronize, and reasoning about serial bottlenecks. Everything here extends those ideas to the full CPU+GPU system.

The heterogeneous model

A GPU is not a replacement for a CPU. It is a co-processor. The CPU handles control flow, I/O, memory management, OS interaction, and irregular workloads. The GPU handles data-parallel computation: the same operation applied to millions of elements. Production systems use both processors simultaneously, each doing what it does best.

This division is not optional. A GPU cannot open a file, parse JSON, or manage a network socket. A CPU cannot sustain 10 TFLOPS on dense matrix math. Treating the system as a single homogeneous device wastes one side or the other.

The programming model reflects this split. The CPU (host) launches work on the GPU (device), manages data movement, and coordinates synchronization. The GPU executes kernels and signals completion. The critical design question is not “CPU or GPU?” but rather “what runs where, and how do we keep both busy?”

Deciding what runs where

Not every computation benefits from GPU offloading. The decision depends on three factors: parallelism, data volume, and transfer cost.

GPU-friendly work:

  • ✓ Data-parallel operations (element-wise transforms, reductions, matrix ops)
  • ✓ Large data sets (millions of elements)
  • ✓ High arithmetic intensity (many FLOPs per byte transferred)
  • ✓ Regular memory access patterns (coalesced reads/writes)

CPU-friendly work:

  • ✓ Control-heavy logic (branching, state machines, recursion)
  • ✓ Small data sets (thousands of elements or fewer)
  • ✓ I/O-bound tasks (file reads, network calls, database queries)
  • ✓ Irregular memory access (pointer chasing, graph traversal)
  • ✓ Tasks with heavy OS interaction (thread management, signals)

The transfer cost trap. PCIe 4.0 x16 delivers roughly 25 GB/s in each direction. Transferring 1 MB takes about 40 microseconds. If your kernel runs in 10 microseconds, the transfer overhead dominates and you are better off staying on the CPU. The break-even point depends on the ratio of compute time to transfer time.

Example: offload decision analysis

Consider a task that takes 2 ms on the CPU. The GPU alternative requires 1 ms to transfer data to the device, 0.5 ms for the kernel, and 1 ms to transfer results back.

Total GPU path: 1 + 0.5 + 1 = 2.5 ms
CPU path: 2 ms

The GPU path is slower. Offloading is not worth it here.

Break-even analysis. Let T_cpu = 2 ms, T_transfer = 2 ms (1 ms each way), and T_kernel = k ms. Offloading wins when:

T_transfer + T_kernel < T_cpu
2 + k < 2
k < 0

This means for a 2 ms CPU task with 2 ms total transfer overhead, no GPU kernel time makes offloading worthwhile in isolation. The kernel would need to be instantaneous (or negative, which is impossible). The economics change when you can overlap transfers with other work, batch multiple operations, or amortize transfer cost across many kernel calls.

If the CPU task took 10 ms instead:

2 + k < 10
k < 8

Any kernel under 8 ms would justify the offload. The lesson: offloading pays off when the CPU-side alternative is expensive relative to transfer overhead.
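The break-even arithmetic is worth wrapping in a small helper so it can be reused for each candidate offload. This is an illustrative sketch, not part of any CUDA API; it ignores overlap and batching, which shift the break-even point in the GPU's favor:

```python
def should_offload(t_cpu_ms: float, t_transfer_ms: float, t_kernel_ms: float) -> bool:
    """Return True if the GPU path (transfer + kernel) beats the CPU path.

    Ignores transfer/compute overlap and amortization across batches,
    both of which make offloading more attractive than this estimate.
    """
    return t_transfer_ms + t_kernel_ms < t_cpu_ms

# The two worked examples from the text:
print(should_offload(2.0, 2.0, 0.5))   # 2.5 ms GPU path vs 2 ms CPU path
print(should_offload(10.0, 2.0, 5.0))  # 7 ms GPU path vs 10 ms CPU path
```

The first call returns False (offloading loses), the second True, matching the analysis above.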

CPU-GPU synchronization patterns

The CUDA runtime provides several synchronization mechanisms, each with different granularity and cost.

Device-level synchronization. cudaDeviceSynchronize() blocks the CPU until all previously issued work on all streams completes. Simple but coarse. It drains the entire GPU pipeline and prevents any CPU-GPU overlap.

Stream-level synchronization. cudaStreamSynchronize(stream) blocks until all work in a specific stream completes. Other streams continue executing. This is the standard tool for waiting on a specific pipeline stage.

Event-based synchronization. cudaEventRecord places a marker in a stream. cudaEventSynchronize blocks the CPU until that event completes. cudaStreamWaitEvent makes one stream wait for an event in another stream without blocking the CPU. Events are the building block for fine-grained dependency management.

Callback-based. cudaLaunchHostFunc enqueues a CPU function to run after all preceding work in a stream completes. The callback executes on a CUDA driver thread, not the launching thread. Useful for signaling completion without polling.

sequenceDiagram
  participant CPU
  participant Stream1
  participant Stream2
  participant GPU

  CPU->>Stream1: cudaMemcpyAsync - H2D
  CPU->>Stream1: kernel_A<<<>>>
  CPU->>CPU: CPU work - preprocessing batch N+1
  Stream1->>GPU: Execute H2D + kernel_A
  CPU->>Stream2: cudaMemcpyAsync - H2D, batch N+1
  CPU->>Stream1: cudaEventRecord - event1
  CPU->>Stream2: cudaStreamWaitEvent - event1
  CPU->>Stream2: kernel_B<<<>>>
  GPU->>Stream2: Execute after event1 completes
  CPU->>Stream2: cudaMemcpyAsync - D2H
  CPU->>CPU: cudaStreamSynchronize - Stream2

This pattern lets the CPU prepare the next batch while the GPU processes the current one. Events create dependencies between streams without stalling the CPU.

Overlapping CPU work with GPU kernels

The key to high utilization is keeping both processors busy. After launching an async kernel, the CPU is free to do other work until it needs the results.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

__global__ void transform_kernel(float* data, int n, float scale) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Expensive per-element transform
        float val = data[idx];
        for (int i = 0; i < 100; i++) {
            val = sinf(val) * scale + cosf(val);
        }
        data[idx] = val;
    }
}

void cpu_preprocess(float* raw, float* prepared, int n) {
    // Normalize, filter outliers, apply calibration
    for (int i = 0; i < n; i++) {
        float val = raw[i];
        if (val < -1000.0f || val > 1000.0f) val = 0.0f;
        prepared[i] = (val - 0.5f) * 2.0f;
    }
}

void cpu_postprocess(float* results, int n) {
    // Aggregate, format, write to output buffer
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += results[i];
    printf("Batch mean: %.4f\n", sum / n);
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Pinned host memory for async transfers
    float *h_raw, *h_prepared, *h_results;
    cudaMallocHost(&h_raw, bytes);
    cudaMallocHost(&h_prepared, bytes);
    cudaMallocHost(&h_results, bytes);

    float *d_data;
    cudaMalloc(&d_data, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Simulate incoming raw data
    for (int i = 0; i < N; i++) h_raw[i] = (float)i / N;

    // Stage 1: CPU preprocesses batch N
    cpu_preprocess(h_raw, h_prepared, N);

    // Stage 2: Transfer to GPU (async)
    cudaMemcpyAsync(d_data, h_prepared, bytes,
                    cudaMemcpyHostToDevice, stream);

    // Stage 3: Launch GPU kernel (async)
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    transform_kernel<<<blocks, threads, 0, stream>>>(d_data, N, 1.5f);

    // Stage 4: Transfer results back (async)
    cudaMemcpyAsync(h_results, d_data, bytes,
                    cudaMemcpyDeviceToHost, stream);

    // While GPU is working on batch N, CPU preprocesses batch N+1
    // In a real pipeline, h_raw would contain new data here
    cpu_preprocess(h_raw, h_prepared, N);

    // Wait for GPU results
    cudaStreamSynchronize(stream);

    // Stage 5: CPU postprocesses GPU results
    cpu_postprocess(h_results, N);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_raw);
    cudaFreeHost(h_prepared);
    cudaFreeHost(h_results);
    cudaFree(d_data);
    return 0;
}

The CPU does useful work (preprocessing the next batch) while the GPU executes the kernel and transfers data. Without this overlap, the CPU would idle during GPU execution, leaving half the system's capacity unused.

Programming models for heterogeneous computing

| Model | Language | Offload method | Control complexity | Performance ceiling | Use case |
|---|---|---|---|---|---|
| CUDA C++ | C/C++ | Explicit kernel launch | High | Highest | Production GPU code, libraries |
| OpenACC | C/C++/Fortran | Compiler directives | Low | Medium-high | Incremental GPU porting |
| OpenMP target | C/C++/Fortran | Compiler directives | Low-medium | Medium | Portable CPU+GPU code |
| SYCL | C++ | Queue-based kernel submission | Medium | High | Cross-vendor GPU code |
| CuPy/Numba | Python | Array API / decorators | Low | Medium | Prototyping, data science |
| Kokkos | C++ | Parallel patterns + backends | Medium | High | HPC, portability across architectures |

OpenACC: directive-based offloading

OpenACC lets you offload loops to the GPU by adding compiler directives. No explicit memory management, no kernel launches, no stream handling. The compiler generates the GPU code.

#include <cstdio>
#include <cmath>

int main() {
    const int N = 1000000;
    // static: 12 MB of arrays would overflow the stack as locals
    static float a[N], b[N], c[N];

    // Initialize on CPU
    for (int i = 0; i < N; i++) {
        a[i] = sinf(i * 0.001f);
        b[i] = cosf(i * 0.001f);
    }

    // OpenACC: compiler handles data transfer and kernel generation
    #pragma acc parallel loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
    for (int i = 0; i < N; i++) {
        c[i] = a[i] * b[i] + sqrtf(a[i] * a[i] + b[i] * b[i]);
    }

    printf("c[0] = %.4f\n", c[0]);
    return 0;
}

Compile with nvc++ -acc -Minfo=accel pipeline.cpp -o pipeline. The -Minfo=accel flag shows what the compiler offloaded.

OpenACC works well for loop-heavy Fortran and C code where manual CUDA porting would take months. It does not match hand-tuned CUDA for complex kernels. Think of it as the 80/20 tool: 80% of the performance for 20% of the effort.

OpenMP target offloading follows a similar philosophy with #pragma omp target directives. It is part of the OpenMP 5.0+ specification and supported by GCC, Clang, and vendor compilers. The syntax is similar to OpenACC but with different clauses. If your codebase already uses OpenMP for CPU threading, target offloading is a natural extension.

Amdahl’s law revisited: the serial CPU bottleneck

In a heterogeneous system, Amdahl’s law applies at the system level. The serial fraction is not just the non-parallelizable GPU code. It includes all CPU-side work that the GPU must wait for: data preparation, result processing, and synchronization overhead.

Let S be the fraction of total work that must run serially on the CPU, and let P = 1 - S be the parallelizable fraction that runs on the GPU. If the GPU provides a speedup of G on the parallel portion:

Effective speedup = 1 / (S + P/G)

Even with G = 1000 (a generous GPU speedup), if 10% of the pipeline is serial CPU work:

Speedup = 1 / (0.1 + 0.9/1000) = 1 / 0.1009 = 9.91x

The serial CPU fraction caps total system speedup at roughly 10x regardless of how fast the GPU is. This is why CPU-side optimization matters in GPU programs. A 2x improvement in CPU preprocessing can deliver more system-level speedup than a 10x improvement in GPU kernel performance when the CPU is the bottleneck.
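The formula is easy to check numerically. A two-line helper (illustrative, matching the equation above) reproduces the 9.91x figure:

```python
def system_speedup(serial_fraction: float, gpu_speedup: float) -> float:
    """Amdahl's law at the system level: serial CPU work caps total speedup."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / gpu_speedup)

print(round(system_speedup(0.10, 1000), 2))  # 9.91x for 10% serial work
print(round(system_speedup(0.05, 1000), 2))  # under 20x even at 5% serial
```

Sweeping serial_fraction shows how quickly the ceiling drops: at 1% serial work the cap is roughly 91x, at 10% roughly 10x, regardless of G.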

The curves converge quickly. Beyond 5% serial fraction, even a 1000x GPU speedup delivers under 20x total. This is why heterogeneous pipeline design focuses on minimizing CPU-side bottlenecks and overlapping CPU work with GPU execution.

Real-world pipeline design

Example: four-stage processing pipeline

Consider a pipeline with four stages per batch:

| Stage | Processor | Duration |
|---|---|---|
| Preprocess | CPU | 10 ms |
| Compute | GPU | 30 ms |
| Reduce | Either | 5 ms |
| Postprocess | CPU | 8 ms |

Single-batch latency: 10 + 30 + 5 + 8 = 53 ms (assigning Reduce to GPU since it is already there, saving a round trip).

Optimal plan for throughput: Overlap CPU and GPU stages across batches. While the GPU runs Compute for batch N, the CPU preprocesses batch N+1. While the GPU runs Reduce for batch N, the CPU postprocesses batch N-1.

gantt
  title Pipelined Execution (4 Stages, 3 Batches)
  dateFormat X
  axisFormat %s ms

  section CPU
  Preprocess B0      :a0, 0, 10
  Postprocess B0     :a3, 45, 8
  Preprocess B1      :a4, 10, 10
  Postprocess B1     :a7, 75, 8
  Preprocess B2      :a8, 20, 10

  section GPU
  Compute B0         :b1, 10, 30
  Reduce B0          :b2, 40, 5
  Compute B1         :b5, 40, 30
  Reduce B1          :b6, 70, 5
  Compute B2         :b9, 70, 30

The GPU is the bottleneck at 30 ms per batch (Compute) plus 5 ms (Reduce) = 35 ms. The CPU stages total 18 ms per batch. Once the pipeline is full, a new batch completes every 35 ms (the GPU-bound interval).

Throughput for 10 batches:

Without pipelining: 10 * 53 = 530 ms

With pipelining: The first batch takes 53 ms. Each subsequent batch adds 35 ms (the bottleneck stage duration). Total = 53 + 9 * 35 = 53 + 315 = 368 ms.

Speedup from pipelining: 530 / 368 = 1.44x

For larger batch counts the benefit grows. At 100 batches: without pipelining = 5300 ms, with pipelining = 53 + 99 * 35 = 3518 ms, giving 1.51x speedup. The asymptotic throughput is one batch per 35 ms (28.6 batches/second) vs one batch per 53 ms (18.9 batches/second) without pipelining.
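These totals follow from a one-line recurrence: the first batch pays the full latency, and every subsequent batch adds only the bottleneck interval. A quick check (illustrative helper, not a CUDA API):

```python
def pipelined_total_ms(n_batches: int, first_batch_ms: int, bottleneck_ms: int) -> int:
    """Total pipelined time: full latency for batch 0, then one
    bottleneck-stage interval per additional batch."""
    return first_batch_ms + (n_batches - 1) * bottleneck_ms

print(pipelined_total_ms(10, 53, 35))    # 368 (vs 530 unpipelined)
print(pipelined_total_ms(100, 53, 35))   # 3518 (vs 5300 unpipelined)
print(round(530 / pipelined_total_ms(10, 53, 35), 2))  # 1.44x speedup
```

As n_batches grows, the first-batch term is amortized away and throughput converges to one batch per bottleneck interval, which is why the speedup climbs from 1.44x toward the asymptotic 53/35 = 1.51x.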

Task graph with critical path

A more complex pipeline has dependencies between stages that constrain scheduling:

graph LR
  A[CPU: Load Data
5 ms] --> B[CPU: Preprocess
10 ms]
  B --> C[GPU: Transform
30 ms]
  B --> D[CPU: Metadata Extract
8 ms]
  C --> E[GPU: Reduce
5 ms]
  D --> F[CPU: Merge Results
6 ms]
  E --> F
  F --> G[CPU: Write Output
4 ms]

  style A fill:#4a90d9,color:#fff
  style B fill:#4a90d9,color:#fff
  style C fill:#50c878,color:#fff
  style D fill:#4a90d9,color:#fff
  style E fill:#50c878,color:#fff
  style F fill:#4a90d9,color:#fff
  style G fill:#4a90d9,color:#fff

Critical path: A -> B -> C -> E -> F -> G = 5 + 10 + 30 + 5 + 6 + 4 = 60 ms. The CPU metadata extraction (D, 8 ms) runs in parallel with the GPU transform (C, 30 ms) and does not affect the critical path. The total latency is bounded by the longest path through the graph, not the sum of all stages.

Python: CPU+GPU hybrid pipeline with CuPy

import cupy as cp
import numpy as np
import time

def cpu_preprocess(raw_data: np.ndarray) -> np.ndarray:
    """Normalize and filter on CPU. Handles NaN, outliers."""
    cleaned = np.nan_to_num(raw_data, nan=0.0)
    mean = cleaned.mean()
    std = cleaned.std() + 1e-8
    normalized = (cleaned - mean) / std
    # Clip outliers beyond 3 sigma
    return np.clip(normalized, -3.0, 3.0)

def gpu_transform(data_gpu: cp.ndarray) -> cp.ndarray:
    """Heavy element-wise transform on GPU."""
    result = cp.sin(data_gpu) * cp.cos(data_gpu * 0.5)
    result = cp.cumsum(result)
    return result / (cp.arange(len(result), dtype=cp.float32) + 1)

def cpu_postprocess(results: np.ndarray, batch_id: int) -> dict:
    """Aggregate results on CPU."""
    return {
        "batch": batch_id,
        "mean": float(results.mean()),
        "std": float(results.std()),
        "min": float(results.min()),
        "max": float(results.max()),
    }

def run_pipeline(n_batches: int = 10, n_elements: int = 5_000_000):
    stream = cp.cuda.Stream(non_blocking=True)
    summaries = []

    # Generate all raw batches (simulating I/O)
    raw_batches = [
        np.random.randn(n_elements).astype(np.float32)
        for _ in range(n_batches)
    ]

    start = time.perf_counter()

    prev_results_gpu = None
    prev_batch_id = None

    for i in range(n_batches):
        # CPU: preprocess current batch while GPU may still be
        # working on the previous one
        prepared = cpu_preprocess(raw_batches[i])

        # If previous GPU work is done, postprocess on CPU
        if prev_results_gpu is not None:
            stream.synchronize()
            results_cpu = cp.asnumpy(prev_results_gpu)
            summaries.append(cpu_postprocess(results_cpu, prev_batch_id))

        # Transfer to GPU and launch transform
        with stream:
            data_gpu = cp.asarray(prepared)
            prev_results_gpu = gpu_transform(data_gpu)
            prev_batch_id = i

    # Final batch postprocess
    stream.synchronize()
    results_cpu = cp.asnumpy(prev_results_gpu)
    summaries.append(cpu_postprocess(results_cpu, prev_batch_id))

    elapsed = time.perf_counter() - start
    print(f"Processed {n_batches} batches in {elapsed:.3f}s")
    print(f"Throughput: {n_batches / elapsed:.1f} batches/sec")
    return summaries

if __name__ == "__main__":
    results = run_pipeline(n_batches=10)
    for r in results[:3]:
        print(r)

The pattern is the same as the CUDA C++ version: launch GPU work, do CPU work while the GPU is busy, synchronize only when you need the results. CuPy’s stream context manager scopes all GPU operations to a non-blocking stream, enabling overlap.

In practice

Profile before you partition. Use Nsight Systems to visualize the CPU and GPU timelines side by side. Look for gaps where one processor is idle while the other works. Those gaps are your optimization targets.

Minimize synchronization points. Every cudaDeviceSynchronize or cudaStreamSynchronize is a potential bubble. Batch your synchronization: instead of syncing after every kernel, sync once after a group of dependent operations completes.

Double-buffer host memory. Use two sets of pinned host buffers. While the GPU processes data from buffer A, the CPU fills buffer B. Swap on each iteration. This eliminates the dependency between CPU preprocessing and GPU input for consecutive batches.
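The buffer-rotation logic can be sketched independently of CUDA. In the toy skeleton below, fill_buffer and consume_buffer are placeholder names standing in for CPU preprocessing into a pinned buffer and the async H2D copy plus kernel launch; in a real pipeline the consume step would be asynchronous, which is what creates the actual overlap:

```python
def run_double_buffered(batches, fill_buffer, consume_buffer):
    """Double-buffering skeleton: alternate between two buffers so the
    producer (CPU) and consumer (GPU) never contend for the same one.
    The GPU side is a synchronous placeholder here; real overlap comes
    from launching the consume step asynchronously on a stream."""
    buffers = [None, None]  # stand-ins for two pinned host buffers
    results = []
    for i, batch in enumerate(batches):
        cur = i % 2  # swap buffers each iteration
        buffers[cur] = fill_buffer(batch)             # CPU fills buffer `cur`
        results.append(consume_buffer(buffers[cur]))  # GPU consumes buffer `cur`
    return results

# Toy usage: "preprocess" doubles each element, "GPU work" sums the buffer.
out = run_double_buffered([[1, 2], [3, 4]],
                          fill_buffer=lambda b: [x * 2 for x in b],
                          consume_buffer=sum)
print(out)  # [6, 14]
```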

Assign the “either” tasks strategically. If a stage can run on CPU or GPU, assign it to whichever processor is less loaded. In the pipeline example, Reduce (5 ms) runs on the GPU because the CPU is busy with postprocessing. Moving it to the CPU would add 5 ms to the CPU path, making it the bottleneck.

Watch for hidden serialization. printf in device code serializes. Unified Memory page faults serialize. Default stream operations serialize with all other streams. Use --default-stream per-thread to avoid default stream serialization in multi-threaded host code.

Start with OpenACC or CuPy for prototyping. Validate that GPU offloading helps before investing in hand-tuned CUDA kernels. If the speedup is marginal at the prototype level, manual optimization will not change the fundamental economics.

What comes next

This article covered heterogeneous computing: the CPU-GPU execution model, offload decisions, synchronization patterns, pipeline overlap, directive-based offloading with OpenACC, and the impact of Amdahl’s law on system-level performance. The core principle is straightforward: keep both processors busy, minimize synchronization, and let each processor do what it does best.

The next article covers advanced memory access patterns, including strided access, structure-of-arrays vs array-of-structures, memory padding for bank conflict avoidance, and techniques for handling irregular data layouts on the GPU.
