Heterogeneous computing: CPU and GPU working together
In this series (28 parts)
- GPUs: from pixels to parallel supercomputers
- Your first CUDA program: kernels, threads, and grids
- Thread hierarchy in CUDA: threads, blocks, warps, and grids
- Warps and warp divergence: the hidden performance trap
- CUDA memory hierarchy: where your data lives matters
- Memory coalescing: the most important optimization you will learn
- Shared memory and tiling: the key to fast matrix operations
- Debugging and profiling CUDA programs
- Device functions, host functions, and CUDA function qualifiers
- Synchronization and atomic operations in CUDA
- Parallel prefix sum and reduction: the core parallel primitives
- Concurrent data structures on the GPU
- CUDA streams and asynchronous execution
- CUDA events and fine-grained synchronization
- Dynamic parallelism: kernels launching kernels
- Unified virtual memory: one pointer for CPU and GPU
- Multi-GPU programming and peer access
- Memory allocation patterns and multi-dimensional arrays in CUDA
- Texture and constant memory: specialized caches
- Occupancy, register pressure, and performance tuning
- Case study: matrix multiplication from naive to cuBLAS speed
- Case study: implementing a convolution layer in CUDA
- Case study: reduction and histogram at scale
- Heterogeneous computing: CPU and GPU working together
- Advanced memory patterns: pinned memory, zero-copy, and more
- Advanced stream patterns and concurrent kernel execution
- Performance case studies and optimization patterns
- Where to go from here: CUDA ecosystem and next steps
Prerequisites
This article builds on two earlier posts:
- CUDA streams and asynchronous execution for overlapping transfers with compute, pinned memory, and stream synchronization.
- Amdahl’s law and parallel scaling for understanding how serial fractions limit speedup.
You should be comfortable launching async kernels, using cudaStreamSynchronize, and reasoning about serial bottlenecks. Everything here extends those ideas to the full CPU+GPU system.
The heterogeneous model
A GPU is not a replacement for a CPU. It is a co-processor. The CPU handles control flow, I/O, memory management, OS interaction, and irregular workloads. The GPU handles data-parallel computation: the same operation applied to millions of elements. Production systems use both processors simultaneously, each doing what it does best.
This division is not optional. A GPU cannot open a file, parse JSON, or manage a network socket. A CPU cannot sustain 10 TFLOPS on dense matrix math. Treating the system as a single homogeneous device wastes one side or the other.
The programming model reflects this split. The CPU (host) launches work on the GPU (device), manages data movement, and coordinates synchronization. The GPU executes kernels and signals completion. The critical design question is not “CPU or GPU?” but rather “what runs where, and how do we keep both busy?”
Deciding what runs where
Not every computation benefits from GPU offloading. The decision depends on three factors: parallelism, data volume, and transfer cost.
GPU-friendly work:
- ✓ Data-parallel operations (element-wise transforms, reductions, matrix ops)
- ✓ Large data sets (millions of elements)
- ✓ High arithmetic intensity (many FLOPs per byte transferred)
- ✓ Regular memory access patterns (coalesced reads/writes)
CPU-friendly work:
- ✓ Control-heavy logic (branching, state machines, recursion)
- ✓ Small data sets (thousands of elements or fewer)
- ✓ I/O-bound tasks (file reads, network calls, database queries)
- ✓ Irregular memory access (pointer chasing, graph traversal)
- ✓ Tasks with heavy OS interaction (thread management, signals)
The transfer cost trap. PCIe 4.0 x16 delivers roughly 25 GB/s in each direction. Transferring 1 MB takes about 40 microseconds. If your kernel runs in 10 microseconds, the transfer overhead dominates and you are better off staying on the CPU. The break-even point depends on the ratio of compute time to transfer time.
Example: offload decision analysis
Consider a task that takes 2 ms on the CPU. The GPU alternative requires 1 ms to transfer data to the device, 0.5 ms for the kernel, and 1 ms to transfer results back.
Total GPU path: 1 + 0.5 + 1 = 2.5 ms
CPU path: 2 ms
The GPU path is slower. Offloading is not worth it here.
Break-even analysis. Let T_cpu = 2 ms, T_transfer = 2 ms (1 ms each way), and T_kernel = k ms. Offloading wins when:
T_transfer + T_kernel < T_cpu
2 + k < 2
k < 0
This means for a 2 ms CPU task with 2 ms total transfer overhead, no GPU kernel time makes offloading worthwhile in isolation. The kernel would need to be instantaneous (or negative, which is impossible). The economics change when you can overlap transfers with other work, batch multiple operations, or amortize transfer cost across many kernel calls.
If the CPU task took 10 ms instead:
2 + k < 10
k < 8
Any kernel under 8 ms would justify the offload. The lesson: offloading pays off when the CPU-side alternative is expensive relative to transfer overhead.
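The break-even arithmetic above is simple enough to capture in a small helper. A minimal sketch in pure Python (the function names are illustrative; the timings are the example values from this section, not measurements):

```python
def offload_wins(t_cpu_ms: float, t_transfer_ms: float, t_kernel_ms: float) -> bool:
    """Offloading pays off when transfer + kernel beats the CPU path."""
    return t_transfer_ms + t_kernel_ms < t_cpu_ms

def max_kernel_time(t_cpu_ms: float, t_transfer_ms: float) -> float:
    """Largest kernel time that still justifies the offload.
    Zero or negative means no kernel time can justify it."""
    return t_cpu_ms - t_transfer_ms

# The 2 ms CPU task from the example: no kernel time justifies offload
print(max_kernel_time(2.0, 2.0))     # 0.0 -> k < 0 is impossible
print(offload_wins(2.0, 2.0, 0.5))   # False

# The 10 ms CPU task: any kernel under 8 ms wins
print(max_kernel_time(10.0, 2.0))    # 8.0
print(offload_wins(10.0, 2.0, 0.5))  # True
```

In a real system you would feed this with measured transfer bandwidth and profiled kernel times, and remember that overlap and batching can shift the break-even point in the GPU's favor.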
CPU-GPU synchronization patterns
The CUDA runtime provides several synchronization mechanisms, each with different granularity and cost.
Device-level synchronization. cudaDeviceSynchronize() blocks the CPU until all previously issued work on all streams completes. Simple but coarse. It drains the entire GPU pipeline and prevents any CPU-GPU overlap.
Stream-level synchronization. cudaStreamSynchronize(stream) blocks until all work in a specific stream completes. Other streams continue executing. This is the standard tool for waiting on a specific pipeline stage.
Event-based synchronization. cudaEventRecord places a marker in a stream. cudaEventSynchronize blocks the CPU until that event completes. cudaStreamWaitEvent makes one stream wait for an event in another stream without blocking the CPU. Events are the building block for fine-grained dependency management.
Callback-based. cudaLaunchHostFunc enqueues a CPU function to run after all preceding work in a stream completes. The callback executes on a CUDA driver thread, not the launching thread. Useful for signaling completion without polling.
```mermaid
sequenceDiagram
    participant CPU
    participant Stream1
    participant Stream2
    participant GPU
    CPU->>Stream1: cudaMemcpyAsync - H2D
    CPU->>Stream1: kernel_A<<<>>>
    CPU->>CPU: CPU work - preprocessing batch N+1
    Stream1->>GPU: Execute H2D + kernel_A
    CPU->>Stream2: cudaMemcpyAsync - H2D, batch N+1
    CPU->>Stream1: cudaEventRecord - event1
    CPU->>Stream2: cudaStreamWaitEvent - event1
    CPU->>Stream2: kernel_B<<<>>>
    GPU->>Stream2: Execute after event1 completes
    CPU->>Stream2: cudaMemcpyAsync - D2H
    CPU->>CPU: cudaStreamSynchronize - Stream2
```
This pattern lets the CPU prepare the next batch while the GPU processes the current one. Events create dependencies between streams without stalling the CPU.
Overlapping CPU work with GPU kernels
The key to high utilization is keeping both processors busy. After launching an async kernel, the CPU is free to do other work until it needs the results.
```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

__global__ void transform_kernel(float* data, int n, float scale) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Expensive per-element transform
        float val = data[idx];
        for (int i = 0; i < 100; i++) {
            val = sinf(val) * scale + cosf(val);
        }
        data[idx] = val;
    }
}

void cpu_preprocess(float* raw, float* prepared, int n) {
    // Normalize, filter outliers, apply calibration
    for (int i = 0; i < n; i++) {
        float val = raw[i];
        if (val < -1000.0f || val > 1000.0f) val = 0.0f;
        prepared[i] = (val - 0.5f) * 2.0f;
    }
}

void cpu_postprocess(float* results, int n) {
    // Aggregate, format, write to output buffer
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += results[i];
    printf("Batch mean: %.4f\n", sum / n);
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    // Pinned host memory for async transfers
    float *h_raw, *h_prepared, *h_results;
    cudaMallocHost(&h_raw, bytes);
    cudaMallocHost(&h_prepared, bytes);
    cudaMallocHost(&h_results, bytes);

    // Second input buffer so the CPU never overwrites data
    // still in flight to the GPU
    float *h_prepared_next;
    cudaMallocHost(&h_prepared_next, bytes);

    float *d_data;
    cudaMalloc(&d_data, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Simulate incoming raw data
    for (int i = 0; i < N; i++) h_raw[i] = (float)i / N;

    // Stage 1: CPU preprocesses batch N
    cpu_preprocess(h_raw, h_prepared, N);

    // Stage 2: Transfer to GPU (async)
    cudaMemcpyAsync(d_data, h_prepared, bytes,
                    cudaMemcpyHostToDevice, stream);

    // Stage 3: Launch GPU kernel (async)
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    transform_kernel<<<blocks, threads, 0, stream>>>(d_data, N, 1.5f);

    // Stage 4: Transfer results back (async)
    cudaMemcpyAsync(h_results, d_data, bytes,
                    cudaMemcpyDeviceToHost, stream);

    // While GPU is working on batch N, CPU preprocesses batch N+1.
    // Write into the second pinned buffer: h_prepared may still be in
    // flight to the device, so overwriting it here would be a race.
    // In a real pipeline, h_raw would contain new data here.
    cpu_preprocess(h_raw, h_prepared_next, N);

    // Wait for GPU results
    cudaStreamSynchronize(stream);

    // Stage 5: CPU postprocesses GPU results
    cpu_postprocess(h_results, N);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_raw);
    cudaFreeHost(h_prepared);
    cudaFreeHost(h_prepared_next);
    cudaFreeHost(h_results);
    cudaFree(d_data);
    return 0;
}
```
The CPU does useful work (preprocessing the next batch) while the GPU executes the kernel and transfers data. Without this overlap, the CPU would idle during GPU execution, wasting half the system.
Programming models for heterogeneous computing
| Model | Language | Offload method | Control complexity | Performance ceiling | Use case |
|---|---|---|---|---|---|
| CUDA C++ | C/C++ | Explicit kernel launch | High | Highest | Production GPU code, libraries |
| OpenACC | C/C++/Fortran | Compiler directives | Low | Medium-high | Incremental GPU porting |
| OpenMP target | C/C++/Fortran | Compiler directives | Low-medium | Medium | Portable CPU+GPU code |
| SYCL | C++ | Queue-based kernel submission | Medium | High | Cross-vendor GPU code |
| CuPy/Numba | Python | Array API / decorators | Low | Medium | Prototyping, data science |
| Kokkos | C++ | Parallel patterns + backends | Medium | High | HPC, portability across architectures |
OpenACC: directive-based offloading
OpenACC lets you offload loops to the GPU by adding compiler directives. No explicit memory management, no kernel launches, no stream handling. The compiler generates the GPU code.
```cpp
#include <cstdio>
#include <cmath>

int main() {
    const int N = 1000000;
    // Static storage: 12 MB of local arrays would overflow the stack
    static float a[N], b[N], c[N];

    // Initialize on CPU
    for (int i = 0; i < N; i++) {
        a[i] = sinf(i * 0.001f);
        b[i] = cosf(i * 0.001f);
    }

    // OpenACC: compiler handles data transfer and kernel generation
    #pragma acc parallel loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
    for (int i = 0; i < N; i++) {
        c[i] = a[i] * b[i] + sqrtf(a[i] * a[i] + b[i] * b[i]);
    }

    printf("c[0] = %.4f\n", c[0]);
    return 0;
}
```
Compile with nvc++ -acc -Minfo=accel pipeline.cpp -o pipeline. The -Minfo=accel flag shows what the compiler offloaded.
OpenACC works well for loop-heavy Fortran and C code where manual CUDA porting would take months. It does not match hand-tuned CUDA for complex kernels. Think of it as the 80/20 tool: 80% of the performance for 20% of the effort.
OpenMP target offloading follows a similar philosophy with #pragma omp target directives. It is part of the OpenMP 5.0+ specification and supported by GCC, Clang, and vendor compilers. The syntax is similar to OpenACC but with different clauses. If your codebase already uses OpenMP for CPU threading, target offloading is a natural extension.
Amdahl’s law revisited: the serial CPU bottleneck
In a heterogeneous system, Amdahl’s law applies at the system level. The serial fraction is not just the non-parallelizable GPU code. It includes all CPU-side work that the GPU must wait for: data preparation, result processing, and synchronization overhead.
Let S be the fraction of total work that must run serially on the CPU, and let P = 1 - S be the parallelizable fraction that runs on the GPU. If the GPU provides a speedup of G on the parallel portion:
Effective speedup = 1 / (S + P/G)
Even with G = 1000 (a generous GPU speedup), if 10% of the pipeline is serial CPU work:
Speedup = 1 / (0.1 + 0.9/1000) = 1 / 0.1009 = 9.91x
The serial CPU fraction caps total system speedup at roughly 10x regardless of how fast the GPU is. This is why CPU-side optimization matters in GPU programs. A 2x improvement in CPU preprocessing can deliver more system-level speedup than a 10x improvement in GPU kernel performance when the CPU is the bottleneck.
The curves converge quickly. Beyond 5% serial fraction, even a 1000x GPU speedup delivers under 20x total. This is why heterogeneous pipeline design focuses on minimizing CPU-side bottlenecks and overlapping CPU work with GPU execution.
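The numbers quoted above fall directly out of the formula. A minimal sketch that reproduces them (pure Python; the function name is illustrative):

```python
def system_speedup(serial_fraction: float, gpu_speedup: float) -> float:
    """Amdahl's law at the system level: fraction S runs serially on
    the CPU, (1 - S) runs on the GPU with speedup G."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / gpu_speedup)

# 10% serial CPU work caps the system near 10x, even with G = 1000
print(round(system_speedup(0.10, 1000), 2))   # 9.91

# 5% serial: under 20x despite a 1000x GPU
print(round(system_speedup(0.05, 1000), 2))   # 19.63

# Halving the serial fraction beats a 10x faster GPU
print(round(system_speedup(0.10, 10000), 2))  # 9.99
```

The last line makes the article's point concrete: pushing the GPU from 1000x to 10000x barely moves the needle, while cutting the serial fraction from 10% to 5% nearly doubles system throughput.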
Real-world pipeline design
Example: four-stage processing pipeline
Consider a pipeline with four stages per batch:
| Stage | Processor | Duration |
|---|---|---|
| Preprocess | CPU | 10 ms |
| Compute | GPU | 30 ms |
| Reduce | Either | 5 ms |
| Postprocess | CPU | 8 ms |
Single-batch latency: 10 + 30 + 5 + 8 = 53 ms (assigning Reduce to GPU since it is already there, saving a round trip).
Optimal plan for throughput: Overlap CPU and GPU stages across batches. While the GPU runs Compute for batch N, the CPU preprocesses batch N+1. While the GPU runs Reduce for batch N, the CPU postprocesses batch N-1.
```mermaid
gantt
    title Pipelined Execution (4 Stages, 3 Batches)
    dateFormat X
    axisFormat %s ms
    section CPU
    Preprocess B0  :a0, 0, 10
    Postprocess B0 :a3, 45, 8
    Preprocess B1  :a4, 10, 10
    Postprocess B1 :a7, 75, 8
    Preprocess B2  :a8, 20, 10
    section GPU
    Compute B0 :b1, 10, 30
    Reduce B0  :b2, 40, 5
    Compute B1 :b5, 40, 30
    Reduce B1  :b6, 70, 5
    Compute B2 :b9, 70, 30
```
The GPU is the bottleneck at 30 ms per batch (Compute) plus 5 ms (Reduce) = 35 ms. The CPU stages total 18 ms per batch. Once the pipeline is full, a new batch completes every 35 ms (the GPU-bound interval).
Throughput for 10 batches:
Without pipelining: 10 * 53 = 530 ms
With pipelining: The first batch takes 53 ms. Each subsequent batch adds 35 ms (the bottleneck stage duration). Total = 53 + 9 * 35 = 53 + 315 = 368 ms.
Speedup from pipelining: 530 / 368 = 1.44x
For larger batch counts the benefit grows. At 100 batches: without pipelining = 5300 ms, with pipelining = 53 + 99 * 35 = 3518 ms, giving 1.51x speedup. The asymptotic throughput is one batch per 35 ms (28.6 batches/second) vs one batch per 53 ms (18.9 batches/second) without pipelining.
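This fill-then-steady-state arithmetic generalizes to a simple two-processor model (a sketch under the assumption that CPU and GPU stages overlap perfectly; function names are illustrative):

```python
def serial_total_ms(n_batches: int, cpu_ms: int, gpu_ms: int) -> int:
    """No overlap: every batch pays the full CPU + GPU latency."""
    return n_batches * (cpu_ms + gpu_ms)

def pipelined_total_ms(n_batches: int, cpu_ms: int, gpu_ms: int) -> int:
    """Steady-state pipelining: the first batch pays full latency,
    each later batch adds the bottleneck processor's interval."""
    latency = cpu_ms + gpu_ms
    interval = max(cpu_ms, gpu_ms)
    return latency + (n_batches - 1) * interval

# Four-stage example: CPU = 10 + 8 = 18 ms, GPU = 30 + 5 = 35 ms
print(serial_total_ms(10, 18, 35))      # 530
print(pipelined_total_ms(10, 18, 35))   # 368
print(pipelined_total_ms(100, 18, 35))  # 3518
```

The model also shows why balancing the two processors matters: the asymptotic interval is max(cpu_ms, gpu_ms), so shaving the 35 ms GPU side helps throughput while shaving the 18 ms CPU side does nothing until it becomes the bottleneck.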
Task graph with critical path
A more complex pipeline has dependencies between stages that constrain scheduling:
```mermaid
graph LR
    A[CPU: Load Data 5 ms] --> B[CPU: Preprocess 10 ms]
    B --> C[GPU: Transform 30 ms]
    B --> D[CPU: Metadata Extract 8 ms]
    C --> E[GPU: Reduce 5 ms]
    D --> F[CPU: Merge Results 6 ms]
    E --> F
    F --> G[CPU: Write Output 4 ms]
    style A fill:#4a90d9,color:#fff
    style B fill:#4a90d9,color:#fff
    style C fill:#50c878,color:#fff
    style D fill:#4a90d9,color:#fff
    style E fill:#50c878,color:#fff
    style F fill:#4a90d9,color:#fff
    style G fill:#4a90d9,color:#fff
```
Critical path: A -> B -> C -> E -> F -> G = 5 + 10 + 30 + 5 + 6 + 4 = 60 ms. The CPU metadata extraction (D, 8 ms) runs in parallel with the GPU transform (C, 30 ms) and does not affect the critical path. The total latency is bounded by the longest path through the graph, not the sum of all stages.
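Finding the critical path is a longest-path computation over the task DAG. A minimal sketch using the durations and dependencies from the graph above (memoized recursion; node names mirror the figure):

```python
# Durations in ms and dependency edges from the task graph
duration = {"A": 5, "B": 10, "C": 30, "D": 8, "E": 5, "F": 6, "G": 4}
deps = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"],
        "E": ["C"], "F": ["D", "E"], "G": ["F"]}

memo: dict[str, int] = {}

def earliest_finish(node: str) -> int:
    """Longest path from any source through `node`, inclusive.
    The sink's value is the pipeline's minimum latency."""
    if node not in memo:
        start = max((earliest_finish(d) for d in deps[node]), default=0)
        memo[node] = start + duration[node]
    return memo[node]

print(earliest_finish("G"))  # 60 -> critical path A-B-C-E-F-G
print(earliest_finish("D"))  # 23 -> D finishes well before E (50)
```

Because D finishes at 23 ms while its sibling branch through C and E finishes at 50 ms, the metadata extraction has 27 ms of slack: you could slow it down or move it without affecting latency.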
Python: CPU+GPU hybrid pipeline with CuPy
```python
import cupy as cp
import numpy as np
import time

def cpu_preprocess(raw_data: np.ndarray) -> np.ndarray:
    """Normalize and filter on CPU. Handles NaN, outliers."""
    cleaned = np.nan_to_num(raw_data, nan=0.0)
    mean = cleaned.mean()
    std = cleaned.std() + 1e-8
    normalized = (cleaned - mean) / std
    # Clip outliers beyond 3 sigma
    return np.clip(normalized, -3.0, 3.0)

def gpu_transform(data_gpu: cp.ndarray) -> cp.ndarray:
    """Heavy element-wise transform on GPU."""
    result = cp.sin(data_gpu) * cp.cos(data_gpu * 0.5)
    result = cp.cumsum(result)
    return result / (cp.arange(len(result), dtype=cp.float32) + 1)

def cpu_postprocess(results: np.ndarray, batch_id: int) -> dict:
    """Aggregate results on CPU."""
    return {
        "batch": batch_id,
        "mean": float(results.mean()),
        "std": float(results.std()),
        "min": float(results.min()),
        "max": float(results.max()),
    }

def run_pipeline(n_batches: int = 10, n_elements: int = 5_000_000):
    stream = cp.cuda.Stream(non_blocking=True)
    summaries = []
    # Generate all raw batches (simulating I/O)
    raw_batches = [
        np.random.randn(n_elements).astype(np.float32)
        for _ in range(n_batches)
    ]
    start = time.perf_counter()
    prev_results_gpu = None
    prev_batch_id = None
    for i in range(n_batches):
        # CPU: preprocess current batch while GPU may still be
        # working on the previous one
        prepared = cpu_preprocess(raw_batches[i])
        # Wait for the previous batch's GPU work, then postprocess on CPU
        if prev_results_gpu is not None:
            stream.synchronize()
            results_cpu = cp.asnumpy(prev_results_gpu)
            summaries.append(cpu_postprocess(results_cpu, prev_batch_id))
        # Transfer to GPU and launch transform
        with stream:
            data_gpu = cp.asarray(prepared)
            prev_results_gpu = gpu_transform(data_gpu)
        prev_batch_id = i
    # Final batch postprocess
    stream.synchronize()
    results_cpu = cp.asnumpy(prev_results_gpu)
    summaries.append(cpu_postprocess(results_cpu, prev_batch_id))
    elapsed = time.perf_counter() - start
    print(f"Processed {n_batches} batches in {elapsed:.3f}s")
    print(f"Throughput: {n_batches / elapsed:.1f} batches/sec")
    return summaries

if __name__ == "__main__":
    results = run_pipeline(n_batches=10)
    for r in results[:3]:
        print(r)
```
The pattern is the same as the CUDA C++ version: launch GPU work, do CPU work while the GPU is busy, synchronize only when you need the results. CuPy’s stream context manager scopes all GPU operations to a non-blocking stream, enabling overlap.
In practice
Profile before you partition. Use Nsight Systems to visualize the CPU and GPU timelines side by side. Look for gaps where one processor is idle while the other works. Those gaps are your optimization targets.
Minimize synchronization points. Every cudaDeviceSynchronize or cudaStreamSynchronize is a potential bubble. Batch your synchronization: instead of syncing after every kernel, sync once after a group of dependent operations completes.
Double-buffer host memory. Use two sets of pinned host buffers. While the GPU processes data from buffer A, the CPU fills buffer B. Swap on each iteration. This eliminates the dependency between CPU preprocessing and GPU input for consecutive batches.
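The swap logic of double buffering can be sketched without CUDA at all: fill one buffer while the other is consumed, then swap. A hypothetical CPU-only analogue (names are illustrative; in the real pattern the "process" role is an async H2D copy plus kernel on a stream, and the buffers are pinned host memory):

```python
def fill(buf: list, batch: range) -> None:
    """Stand-in for CPU preprocessing into a pinned buffer."""
    buf[:] = [x * 2 for x in batch]

def process(buf: list) -> int:
    """Stand-in for the GPU stage (here just a reduction)."""
    return sum(buf)

def run_double_buffered(batches: list) -> list:
    buf_a: list = []
    buf_b: list = []
    fill(buf_a, batches[0])              # prime buffer A with batch 0
    results = []
    for i in range(len(batches)):
        if i + 1 < len(batches):
            fill(buf_b, batches[i + 1])  # CPU fills B for batch i+1
        results.append(process(buf_a))   # "GPU" consumes A for batch i
        buf_a, buf_b = buf_b, buf_a      # swap on each iteration
    return results

print(run_double_buffered([range(3), range(4), range(5)]))  # [6, 12, 20]
```

The key property is that the producer never writes the buffer the consumer is reading, which is exactly what eliminates the race between CPU preprocessing and an in-flight cudaMemcpyAsync.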
Assign the “either” tasks strategically. If a stage can run on CPU or GPU, assign it to whichever processor is less loaded. In the pipeline example, Reduce (5 ms) runs on the GPU because the CPU is busy with postprocessing. Moving it to the CPU would add 5 ms to the CPU path, making it the bottleneck.
Watch for hidden serialization. printf in device code serializes. Unified Memory page faults serialize. Default stream operations serialize with all other streams. Use --default-stream per-thread to avoid default stream serialization in multi-threaded host code.
Start with OpenACC or CuPy for prototyping. Validate that GPU offloading helps before investing in hand-tuned CUDA kernels. If the speedup is marginal at the prototype level, manual optimization will not change the fundamental economics.
What comes next
This article covered heterogeneous computing: the CPU-GPU execution model, offload decisions, synchronization patterns, pipeline overlap, directive-based offloading with OpenACC, and the impact of Amdahl’s law on system-level performance. The core principle is straightforward: keep both processors busy, minimize synchronization, and let each processor do what it does best.
The next article covers advanced memory access patterns, including strided access, structure-of-arrays vs array-of-structures, memory padding for bank conflict avoidance, and techniques for handling irregular data layouts on the GPU.