Heterogeneous computing: CPU and GPU working together
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article builds on two earlier posts:
- CUDA streams and asynchronous execution for overlapping transfers with compute, pinned memory, and stream synchronization.
- Amdahl’s law and parallel scaling for understanding how serial fractions limit speedup.
You should be comfortable launching async kernels, using cudaStreamSynchronize, and reasoning about serial bottlenecks. Everything here extends those ideas to the full CPU+GPU system.
The heterogeneous model
A GPU is not a replacement for a CPU. It is a co-processor. The CPU handles control flow, I/O, memory management, OS interaction, and irregular workloads. The GPU handles data-parallel computation: the same operation applied to millions of elements. Production systems use both processors simultaneously, each doing what it does best.
This division is not optional. A GPU cannot open a file, parse JSON, or manage a network socket. A CPU cannot sustain 10 TFLOPS on dense matrix math. Treating the system as a single homogeneous device wastes one side or the other.
The programming model reflects this split. The CPU (host) launches work on the GPU (device), manages data movement, and coordinates synchronization. The GPU executes kernels and signals completion. The critical design question is not “CPU or GPU?” but rather “what runs where, and how do we keep both busy?”
Deciding what runs where
Not every computation benefits from GPU offloading. The decision depends on three factors: parallelism, data volume, and transfer cost.
GPU-friendly work:
- ✓ Data-parallel operations (element-wise transforms, reductions, matrix ops)
- ✓ Large data sets (millions of elements)
- ✓ High arithmetic intensity (many FLOPs per byte transferred)
- ✓ Regular memory access patterns (coalesced reads/writes)
CPU-friendly work:
- ✓ Control-heavy logic (branching, state machines, recursion)
- ✓ Small data sets (thousands of elements or fewer)
- ✓ I/O-bound tasks (file reads, network calls, database queries)
- ✓ Irregular memory access (pointer chasing, graph traversal)
- ✓ Tasks with heavy OS interaction (thread management, signals)
The transfer cost trap. PCIe 4.0 x16 delivers roughly 25 GB/s in each direction. Transferring 1 MB takes about 40 microseconds. If your kernel runs in 10 microseconds, the transfer overhead dominates and you are better off staying on the CPU. The break-even point depends on the ratio of compute time to transfer time.
Example: offload decision analysis
Consider a task that takes 2 ms on the CPU. The GPU alternative requires 1 ms to transfer data to the device, 0.5 ms for the kernel, and 1 ms to transfer results back.
Total GPU path: 1 + 0.5 + 1 = 2.5 ms
CPU path: 2 ms
The GPU path is slower. Offloading is not worth it here.
Break-even analysis. Let T_cpu = 2 ms, T_transfer = 2 ms (1 ms each way), and T_kernel = k ms. Offloading wins when:
T_transfer + T_kernel < T_cpu
2 + k < 2
k < 0
This means for a 2 ms CPU task with 2 ms total transfer overhead, no GPU kernel time makes offloading worthwhile in isolation. The kernel would need to be instantaneous (or negative, which is impossible). The economics change when you can overlap transfers with other work, batch multiple operations, or amortize transfer cost across many kernel calls.
If the CPU task took 10 ms instead:
2 + k < 10
k < 8
Any kernel under 8 ms would justify the offload. The lesson: offloading pays off when the CPU-side alternative is expensive relative to transfer overhead.
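The break-even arithmetic above is simple enough to capture in a small helper. A minimal sketch in pure Python (the function names are illustrative; the timings are the example values from this section, not measurements):

```python
def offload_wins(t_cpu_ms: float, t_transfer_ms: float, t_kernel_ms: float) -> bool:
    """Offloading pays off when transfer + kernel beats the CPU path."""
    return t_transfer_ms + t_kernel_ms < t_cpu_ms

def max_kernel_time(t_cpu_ms: float, t_transfer_ms: float) -> float:
    """Largest kernel time that still justifies the offload.
    Zero or negative means no kernel time can justify it."""
    return t_cpu_ms - t_transfer_ms

# The 2 ms CPU task from the example: no kernel time justifies offload
print(max_kernel_time(2.0, 2.0))     # 0.0 -> k < 0 is impossible
print(offload_wins(2.0, 2.0, 0.5))   # False

# The 10 ms CPU task: any kernel under 8 ms wins
print(max_kernel_time(10.0, 2.0))    # 8.0
print(offload_wins(10.0, 2.0, 0.5))  # True
```

In a real system you would feed this with measured transfer bandwidth and profiled kernel times, and remember that overlap and batching can shift the break-even point in the GPU's favor.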
CPU-GPU synchronization patterns
The CUDA runtime provides several synchronization mechanisms, each with different granularity and cost.
Device-level synchronization. cudaDeviceSynchronize() blocks the CPU until all previously issued work on all streams completes. Simple but coarse. It drains the entire GPU pipeline and prevents any CPU-GPU overlap.
Stream-level synchronization. cudaStreamSynchronize(stream) blocks until all work in a specific stream completes. Other streams continue executing. This is the standard tool for waiting on a specific pipeline stage.
Event-based synchronization. cudaEventRecord places a marker in a stream. cudaEventSynchronize blocks the CPU until that event completes. cudaStreamWaitEvent makes one stream wait for an event in another stream without blocking the CPU. Events are the building block for fine-grained dependency management.
Callback-based. cudaLaunchHostFunc enqueues a CPU function to run after all preceding work in a stream completes. The callback executes on a CUDA driver thread, not the launching thread. Useful for signaling completion without polling.
```mermaid
sequenceDiagram
    participant CPU
    participant Stream1
    participant Stream2
    participant GPU
    CPU->>Stream1: cudaMemcpyAsync - H2D
    CPU->>Stream1: kernel_A<<<>>>
    CPU->>CPU: CPU work - preprocessing batch N+1
    Stream1->>GPU: Execute H2D + kernel_A
    CPU->>Stream2: cudaMemcpyAsync - H2D, batch N+1
    CPU->>Stream1: cudaEventRecord - event1
    CPU->>Stream2: cudaStreamWaitEvent - event1
    CPU->>Stream2: kernel_B<<<>>>
    GPU->>Stream2: Execute after event1 completes
    CPU->>Stream2: cudaMemcpyAsync - D2H
    CPU->>CPU: cudaStreamSynchronize - Stream2
```
This pattern lets the CPU prepare the next batch while the GPU processes the current one. Events create dependencies between streams without stalling the CPU.
Overlapping CPU work with GPU kernels
The key to high utilization is keeping both processors busy. After launching an async kernel, the CPU is free to do other work until it needs the results.
```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

__global__ void transform_kernel(float* data, int n, float scale) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Expensive per-element transform
        float val = data[idx];
        for (int i = 0; i < 100; i++) {
            val = sinf(val) * scale + cosf(val);
        }
        data[idx] = val;
    }
}

void cpu_preprocess(float* raw, float* prepared, int n) {
    // Normalize, filter outliers, apply calibration
    for (int i = 0; i < n; i++) {
        float val = raw[i];
        if (val < -1000.0f || val > 1000.0f) val = 0.0f;
        prepared[i] = (val - 0.5f) * 2.0f;
    }
}

void cpu_postprocess(float* results, int n) {
    // Aggregate, format, write to output buffer
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += results[i];
    printf("Batch mean: %.4f\n", sum / n);
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Pinned host memory for async transfers
    float *h_raw, *h_prepared, *h_results;
    cudaMallocHost(&h_raw, bytes);
    cudaMallocHost(&h_prepared, bytes);
    cudaMallocHost(&h_results, bytes);

    // Second input buffer so the CPU never overwrites data
    // still in flight to the GPU
    float *h_prepared_next;
    cudaMallocHost(&h_prepared_next, bytes);

    float *d_data;
    cudaMalloc(&d_data, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Simulate incoming raw data
    for (int i = 0; i < N; i++) h_raw[i] = (float)i / N;

    // Stage 1: CPU preprocesses batch N
    cpu_preprocess(h_raw, h_prepared, N);

    // Stage 2: Transfer to GPU (async)
    cudaMemcpyAsync(d_data, h_prepared, bytes,
                    cudaMemcpyHostToDevice, stream);

    // Stage 3: Launch GPU kernel (async)
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    transform_kernel<<<blocks, threads, 0, stream>>>(d_data, N, 1.5f);

    // Stage 4: Transfer results back (async)
    cudaMemcpyAsync(h_results, d_data, bytes,
                    cudaMemcpyDeviceToHost, stream);

    // While GPU is working on batch N, CPU preprocesses batch N+1.
    // Write into the second pinned buffer: h_prepared may still be in
    // flight to the device, so overwriting it here would be a race.
    // In a real pipeline, h_raw would contain new data here.
    cpu_preprocess(h_raw, h_prepared_next, N);

    // Wait for GPU results
    cudaStreamSynchronize(stream);

    // Stage 5: CPU postprocesses GPU results
    cpu_postprocess(h_results, N);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_raw);
    cudaFreeHost(h_prepared);
    cudaFreeHost(h_prepared_next);
    cudaFreeHost(h_results);
    cudaFree(d_data);
    return 0;
}
```
The CPU does useful work (preprocessing the next batch) while the GPU executes the kernel and transfers data. Without this overlap, the CPU would idle during GPU execution, wasting half the system.
Programming models for heterogeneous computing
| Model | Language | Offload method | Control complexity | Performance ceiling | Use case |
|---|---|---|---|---|---|
| CUDA C++ | C/C++ | Explicit kernel launch | High | Highest | Production GPU code, libraries |
| OpenACC | C/C++/Fortran | Compiler directives | Low | Medium-high | Incremental GPU porting |
| OpenMP target | C/C++/Fortran | Compiler directives | Low-medium | Medium | Portable CPU+GPU code |
| SYCL | C++ | Queue-based kernel submission | Medium | High | Cross-vendor GPU code |
| CuPy/Numba | Python | Array API / decorators | Low | Medium | Prototyping, data science |
| Kokkos | C++ | Parallel patterns + backends | Medium | High | HPC, portability across architectures |
OpenACC: directive-based offloading
OpenACC lets you offload loops to the GPU by adding compiler directives. No explicit memory management, no kernel launches, no stream handling. The compiler generates the GPU code.
```cpp
#include <cstdio>
#include <cmath>

int main() {
    const int N = 1000000;
    // Static storage: 12 MB of local arrays would overflow the stack
    static float a[N], b[N], c[N];

    // Initialize on CPU
    for (int i = 0; i < N; i++) {
        a[i] = sinf(i * 0.001f);
        b[i] = cosf(i * 0.001f);
    }

    // OpenACC: compiler handles data transfer and kernel generation
    #pragma acc parallel loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
    for (int i = 0; i < N; i++) {
        c[i] = a[i] * b[i] + sqrtf(a[i] * a[i] + b[i] * b[i]);
    }

    printf("c[0] = %.4f\n", c[0]);
    return 0;
}
```
Compile with nvc++ -acc -Minfo=accel pipeline.cpp -o pipeline. The -Minfo=accel flag shows what the compiler offloaded.
OpenACC works well for loop-heavy Fortran and C code where manual CUDA porting would take months. It does not match hand-tuned CUDA for complex kernels. Think of it as the 80/20 tool: 80% of the performance for 20% of the effort.
OpenMP target offloading follows a similar philosophy with #pragma omp target directives. It is part of the OpenMP 5.0+ specification and supported by GCC, Clang, and vendor compilers. The syntax is similar to OpenACC but with different clauses. If your codebase already uses OpenMP for CPU threading, target offloading is a natural extension.
Amdahl’s law revisited: the serial CPU bottleneck
In a heterogeneous system, Amdahl’s law applies at the system level. The serial fraction is not just the non-parallelizable GPU code. It includes all CPU-side work that the GPU must wait for: data preparation, result processing, and synchronization overhead.
Let S be the fraction of total work that must run serially on the CPU, and let P = 1 - S be the parallelizable fraction that runs on the GPU. If the GPU provides a speedup of G on the parallel portion:
Effective speedup = 1 / (S + P/G)
Even with G = 1000 (a generous GPU speedup), if 10% of the pipeline is serial CPU work:
Speedup = 1 / (0.1 + 0.9/1000) = 1 / 0.1009 = 9.91x
The serial CPU fraction caps total system speedup at roughly 10x regardless of how fast the GPU is. This is why CPU-side optimization matters in GPU programs. A 2x improvement in CPU preprocessing can deliver more system-level speedup than a 10x improvement in GPU kernel performance when the CPU is the bottleneck.
The curves converge quickly. Beyond 5% serial fraction, even a 1000x GPU speedup delivers under 20x total. This is why heterogeneous pipeline design focuses on minimizing CPU-side bottlenecks and overlapping CPU work with GPU execution.
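The numbers quoted above fall directly out of the formula. A minimal sketch that reproduces them (pure Python; the function name is illustrative):

```python
def system_speedup(serial_fraction: float, gpu_speedup: float) -> float:
    """Amdahl's law at the system level: fraction S runs serially on
    the CPU, (1 - S) runs on the GPU with speedup G."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / gpu_speedup)

# 10% serial CPU work caps the system near 10x, even with G = 1000
print(round(system_speedup(0.10, 1000), 2))   # 9.91

# 5% serial: under 20x despite a 1000x GPU
print(round(system_speedup(0.05, 1000), 2))   # 19.63

# Halving the serial fraction beats a 10x faster GPU
print(round(system_speedup(0.10, 10000), 2))  # 9.99
```

The last line makes the article's point concrete: pushing the GPU from 1000x to 10000x barely moves the needle, while cutting the serial fraction from 10% to 5% nearly doubles system throughput.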
Real-world pipeline design
Example: four-stage processing pipeline
Consider a pipeline with four stages per batch:
| Stage | Processor | Duration |
|---|---|---|
| Preprocess | CPU | 10 ms |
| Compute | GPU | 30 ms |
| Reduce | Either | 5 ms |
| Postprocess | CPU | 8 ms |
Single-batch latency: 10 + 30 + 5 + 8 = 53 ms (assigning Reduce to GPU since it is already there, saving a round trip).
Optimal plan for throughput: Overlap CPU and GPU stages across batches. While the GPU runs Compute for batch N, the CPU preprocesses batch N+1. While the GPU runs Reduce for batch N, the CPU postprocesses batch N-1.
```mermaid
gantt
    title Pipelined Execution (4 Stages, 3 Batches)
    dateFormat X
    axisFormat %s ms
    section CPU
    Preprocess B0  :a0, 0, 10
    Postprocess B0 :a3, 45, 8
    Preprocess B1  :a4, 10, 10
    Postprocess B1 :a7, 75, 8
    Preprocess B2  :a8, 20, 10
    section GPU
    Compute B0 :b1, 10, 30
    Reduce B0  :b2, 40, 5
    Compute B1 :b5, 40, 30
    Reduce B1  :b6, 70, 5
    Compute B2 :b9, 70, 30
```
The GPU is the bottleneck at 30 ms per batch (Compute) plus 5 ms (Reduce) = 35 ms. The CPU stages total 18 ms per batch. Once the pipeline is full, a new batch completes every 35 ms (the GPU-bound interval).
Throughput for 10 batches:
Without pipelining: 10 * 53 = 530 ms
With pipelining: The first batch takes 53 ms. Each subsequent batch adds 35 ms (the bottleneck stage duration). Total = 53 + 9 * 35 = 53 + 315 = 368 ms.
Speedup from pipelining: 530 / 368 = 1.44x
For larger batch counts the benefit grows. At 100 batches: without pipelining = 5300 ms, with pipelining = 53 + 99 * 35 = 3518 ms, giving 1.51x speedup. The asymptotic throughput is one batch per 35 ms (28.6 batches/second) vs one batch per 53 ms (18.9 batches/second) without pipelining.
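This fill-then-steady-state arithmetic generalizes to a simple two-processor model (a sketch under the assumption that CPU and GPU stages overlap perfectly; function names are illustrative):

```python
def serial_total_ms(n_batches: int, cpu_ms: int, gpu_ms: int) -> int:
    """No overlap: every batch pays the full CPU + GPU latency."""
    return n_batches * (cpu_ms + gpu_ms)

def pipelined_total_ms(n_batches: int, cpu_ms: int, gpu_ms: int) -> int:
    """Steady-state pipelining: the first batch pays full latency,
    each later batch adds the bottleneck processor's interval."""
    latency = cpu_ms + gpu_ms
    interval = max(cpu_ms, gpu_ms)
    return latency + (n_batches - 1) * interval

# Four-stage example: CPU = 10 + 8 = 18 ms, GPU = 30 + 5 = 35 ms
print(serial_total_ms(10, 18, 35))      # 530
print(pipelined_total_ms(10, 18, 35))   # 368
print(pipelined_total_ms(100, 18, 35))  # 3518
```

The model also shows why balancing the two processors matters: the asymptotic interval is max(cpu_ms, gpu_ms), so shaving the 35 ms GPU side helps throughput while shaving the 18 ms CPU side does nothing until it becomes the bottleneck.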
Task graph with critical path
A more complex pipeline has dependencies between stages that constrain scheduling:
```mermaid
graph LR
    A[CPU: Load Data 5 ms] --> B[CPU: Preprocess 10 ms]
    B --> C[GPU: Transform 30 ms]
    B --> D[CPU: Metadata Extract 8 ms]
    C --> E[GPU: Reduce 5 ms]
    D --> F[CPU: Merge Results 6 ms]
    E --> F
    F --> G[CPU: Write Output 4 ms]
    style A fill:#4a90d9,color:#fff
    style B fill:#4a90d9,color:#fff
    style C fill:#50c878,color:#fff
    style D fill:#4a90d9,color:#fff
    style E fill:#50c878,color:#fff
    style F fill:#4a90d9,color:#fff
    style G fill:#4a90d9,color:#fff
```
Critical path: A -> B -> C -> E -> F -> G = 5 + 10 + 30 + 5 + 6 + 4 = 60 ms. The CPU metadata extraction (D, 8 ms) runs in parallel with the GPU transform (C, 30 ms) and does not affect the critical path. The total latency is bounded by the longest path through the graph, not the sum of all stages.
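Finding the critical path is a longest-path computation over the task DAG. A minimal sketch using the durations and dependencies from the graph above (memoized recursion; node names mirror the figure):

```python
# Durations in ms and dependency edges from the task graph
duration = {"A": 5, "B": 10, "C": 30, "D": 8, "E": 5, "F": 6, "G": 4}
deps = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"],
        "E": ["C"], "F": ["D", "E"], "G": ["F"]}

memo: dict[str, int] = {}

def earliest_finish(node: str) -> int:
    """Longest path from any source through `node`, inclusive.
    The sink's value is the pipeline's minimum latency."""
    if node not in memo:
        start = max((earliest_finish(d) for d in deps[node]), default=0)
        memo[node] = start + duration[node]
    return memo[node]

print(earliest_finish("G"))  # 60 -> critical path A-B-C-E-F-G
print(earliest_finish("D"))  # 23 -> D finishes well before E (50)
```

Because D finishes at 23 ms while its sibling branch through C and E finishes at 50 ms, the metadata extraction has 27 ms of slack: you could slow it down or move it without affecting latency.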
Python: CPU+GPU hybrid pipeline with CuPy
```python
import cupy as cp
import numpy as np
import time

def cpu_preprocess(raw_data: np.ndarray) -> np.ndarray:
    """Normalize and filter on CPU. Handles NaN, outliers."""
    cleaned = np.nan_to_num(raw_data, nan=0.0)
    mean = cleaned.mean()
    std = cleaned.std() + 1e-8
    normalized = (cleaned - mean) / std
    # Clip outliers beyond 3 sigma
    return np.clip(normalized, -3.0, 3.0)

def gpu_transform(data_gpu: cp.ndarray) -> cp.ndarray:
    """Heavy element-wise transform on GPU."""
    result = cp.sin(data_gpu) * cp.cos(data_gpu * 0.5)
    result = cp.cumsum(result)
    return result / (cp.arange(len(result), dtype=cp.float32) + 1)

def cpu_postprocess(results: np.ndarray, batch_id: int) -> dict:
    """Aggregate results on CPU."""
    return {
        "batch": batch_id,
        "mean": float(results.mean()),
        "std": float(results.std()),
        "min": float(results.min()),
        "max": float(results.max()),
    }

def run_pipeline(n_batches: int = 10, n_elements: int = 5_000_000):
    stream = cp.cuda.Stream(non_blocking=True)
    summaries = []
    # Generate all raw batches (simulating I/O)
    raw_batches = [
        np.random.randn(n_elements).astype(np.float32)
        for _ in range(n_batches)
    ]
    start = time.perf_counter()
    prev_results_gpu = None
    prev_batch_id = None
    for i in range(n_batches):
        # CPU: preprocess current batch while GPU may still be
        # working on the previous one
        prepared = cpu_preprocess(raw_batches[i])
        # Wait for the previous batch's GPU work, then postprocess on CPU
        if prev_results_gpu is not None:
            stream.synchronize()
            results_cpu = cp.asnumpy(prev_results_gpu)
            summaries.append(cpu_postprocess(results_cpu, prev_batch_id))
        # Transfer to GPU and launch transform
        with stream:
            data_gpu = cp.asarray(prepared)
            prev_results_gpu = gpu_transform(data_gpu)
        prev_batch_id = i
    # Final batch postprocess
    stream.synchronize()
    results_cpu = cp.asnumpy(prev_results_gpu)
    summaries.append(cpu_postprocess(results_cpu, prev_batch_id))
    elapsed = time.perf_counter() - start
    print(f"Processed {n_batches} batches in {elapsed:.3f}s")
    print(f"Throughput: {n_batches / elapsed:.1f} batches/sec")
    return summaries

if __name__ == "__main__":
    results = run_pipeline(n_batches=10)
    for r in results[:3]:
        print(r)
```
The pattern is the same as the CUDA C++ version: launch GPU work, do CPU work while the GPU is busy, synchronize only when you need the results. CuPy’s stream context manager scopes all GPU operations to a non-blocking stream, enabling overlap.
In practice
Profile before you partition. Use Nsight Systems to visualize the CPU and GPU timelines side by side. Look for gaps where one processor is idle while the other works. Those gaps are your optimization targets.
Minimize synchronization points. Every cudaDeviceSynchronize or cudaStreamSynchronize is a potential bubble. Batch your synchronization: instead of syncing after every kernel, sync once after a group of dependent operations completes.
Double-buffer host memory. Use two sets of pinned host buffers. While the GPU processes data from buffer A, the CPU fills buffer B. Swap on each iteration. This eliminates the dependency between CPU preprocessing and GPU input for consecutive batches.
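The swap logic of double buffering can be sketched without CUDA at all: fill one buffer while the other is consumed, then swap. A hypothetical CPU-only analogue (names are illustrative; in the real pattern the "process" role is an async H2D copy plus kernel on a stream, and the buffers are pinned host memory):

```python
def fill(buf: list, batch: range) -> None:
    """Stand-in for CPU preprocessing into a pinned buffer."""
    buf[:] = [x * 2 for x in batch]

def process(buf: list) -> int:
    """Stand-in for the GPU stage (here just a reduction)."""
    return sum(buf)

def run_double_buffered(batches: list) -> list:
    buf_a: list = []
    buf_b: list = []
    fill(buf_a, batches[0])              # prime buffer A with batch 0
    results = []
    for i in range(len(batches)):
        if i + 1 < len(batches):
            fill(buf_b, batches[i + 1])  # CPU fills B for batch i+1
        results.append(process(buf_a))   # "GPU" consumes A for batch i
        buf_a, buf_b = buf_b, buf_a      # swap on each iteration
    return results

print(run_double_buffered([range(3), range(4), range(5)]))  # [6, 12, 20]
```

The key property is that the producer never writes the buffer the consumer is reading, which is exactly what eliminates the race between CPU preprocessing and an in-flight cudaMemcpyAsync.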
Assign the “either” tasks strategically. If a stage can run on CPU or GPU, assign it to whichever processor is less loaded. In the pipeline example, Reduce (5 ms) runs on the GPU because the CPU is busy with postprocessing. Moving it to the CPU would add 5 ms to the CPU path, making it the bottleneck.
Watch for hidden serialization. printf in device code serializes. Unified Memory page faults serialize. Default stream operations serialize with all other streams. Use --default-stream per-thread to avoid default stream serialization in multi-threaded host code.
Start with OpenACC or CuPy for prototyping. Validate that GPU offloading helps before investing in hand-tuned CUDA kernels. If the speedup is marginal at the prototype level, manual optimization will not change the fundamental economics.
What comes next
This article covered heterogeneous computing: the CPU-GPU execution model, offload decisions, synchronization patterns, pipeline overlap, directive-based offloading with OpenACC, and the impact of Amdahl’s law on system-level performance. The core principle is straightforward: keep both processors busy, minimize synchronization, and let each processor do what it does best.
The next article covers advanced memory access patterns, including strided access, structure-of-arrays vs array-of-structures, memory padding for bank conflict avoidance, and techniques for handling irregular data layouts on the GPU.