
Device functions, host functions, and CUDA function qualifiers

In this series (28 parts)
  1. GPUs: from pixels to parallel supercomputers
  2. Your first CUDA program: kernels, threads, and grids
  3. Thread hierarchy in CUDA: threads, blocks, warps, and grids
  4. Warps and warp divergence: the hidden performance trap
  5. CUDA memory hierarchy: where your data lives matters
  6. Memory coalescing: the most important optimization you will learn
  7. Shared memory and tiling: the key to fast matrix operations
  8. Debugging and profiling CUDA programs
  9. Device functions, host functions, and CUDA function qualifiers
  10. Synchronization and atomic operations in CUDA
  11. Parallel prefix sum and reduction: the core parallel primitives
  12. Concurrent data structures on the GPU
  13. CUDA streams and asynchronous execution
  14. CUDA events and fine-grained synchronization
  15. Dynamic parallelism: kernels launching kernels
  16. Unified virtual memory: one pointer for CPU and GPU
  17. Multi-GPU programming and peer access
  18. Memory allocation patterns and multi-dimensional arrays in CUDA
  19. Texture and constant memory: specialized caches
  20. Occupancy, register pressure, and performance tuning
  21. Case study: matrix multiplication from naive to cuBLAS speed
  22. Case study: implementing a convolution layer in CUDA
  23. Case study: reduction and histogram at scale
  24. Heterogeneous computing: CPU and GPU working together
  25. Advanced memory patterns: pinned memory, zero-copy, and more
  26. Advanced stream patterns and concurrent kernel execution
  27. Performance case studies and optimization patterns
  28. Where to go from here: CUDA ecosystem and next steps

Prerequisites

This article builds on Debugging and profiling CUDA programs. You should be comfortable writing kernels, launching them with the <<<blocks, threads>>> syntax, and reading basic Nsight output. We will reference thread indexing, shared memory, and warp-level concepts from earlier posts in the series.


Why function qualifiers matter

Every function in a CUDA program runs on either the host (CPU), the device (GPU), or both. The compiler needs to know which, because it generates different machine code for each target. CUDA uses three qualifier keywords to make this explicit: __global__, __device__, and __host__. Getting them wrong causes linker errors at best and silent correctness bugs at worst.

The qualifier also determines who can call the function. A __device__ function cannot be called from host code. A __global__ kernel cannot be called from another kernel. These constraints are not arbitrary. They follow from how the GPU hardware dispatches work.

The four qualifiers

| Qualifier           | Callable from                             | Runs on                                | Automatically inlined | Recursion support | Typical use                           |
|---------------------|-------------------------------------------|----------------------------------------|-----------------------|-------------------|---------------------------------------|
| __global__          | Host (or device with dynamic parallelism) | Device                                 | No                    | No                | Kernel entry points                   |
| __device__          | Device                                    | Device                                 | Yes (by default)      | Yes (cc 2.0+)     | Helper functions called from kernels  |
| __host__            | Host                                      | Host                                   | Normal C++ rules      | Yes               | Regular CPU functions (default)       |
| __host__ __device__ | Both                                      | Both (compiler generates two versions) | Varies                | Yes on both       | Shared math utilities, struct methods |

The __host__ qualifier is the default. You never need to write it explicitly unless you combine it with __device__.

__global__: kernel entry points

A __global__ function is a kernel. It is launched from the host using the <<<gridDim, blockDim>>> syntax and runs on the device. It must return void. It cannot be called like an ordinary function from device code; with dynamic parallelism (compute capability 3.5+) a kernel can launch another kernel, at the cost of a device-side launch.

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

Every kernel launch involves a host-to-device dispatch. The CPU enqueues the kernel on a CUDA stream, the GPU scheduler picks it up, and the hardware distributes thread blocks across SMs. This is why __global__ functions are always the boundary between host and device execution.

__device__: device-only helpers

A __device__ function runs on the GPU and can only be called by other device code (kernels or other __device__ functions). The compiler typically inlines these aggressively. Small __device__ functions have zero call overhead after inlining.

__device__ float clamp(float val, float lo, float hi) {
    return fminf(fmaxf(val, lo), hi);
}

__global__ void normalize_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = clamp(data[idx], 0.0f, 1.0f);
    }
}

You cannot call clamp from main(). The compiler will reject it.

__host__ __device__: shared code

When you need the same logic on both CPU and GPU, mark it __host__ __device__. The compiler generates two copies: one for the host, one for the device. This is essential for utility functions, math helpers, and struct methods that appear in both host validation code and kernel logic.

__host__ __device__ float safe_divide(float a, float b) {
    return (b != 0.0f) ? a / b : 0.0f;
}

// Host code can call safe_divide directly
// Kernel code can also call safe_divide
__global__ void ratio_kernel(const float* num, const float* den, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = safe_divide(num[idx], den[idx]);
    }
}

A common pattern: define a __host__ __device__ operator inside a struct, then use that struct as a functor with Thrust algorithms. The same functor works in CPU fallback paths.

Use #ifdef __CUDA_ARCH__ inside __host__ __device__ functions when host and device need different implementations:

__host__ __device__ float fast_sqrt(float x) {
#ifdef __CUDA_ARCH__
    return __fsqrt_rn(x);  // Hardware intrinsic on GPU
#else
    return sqrtf(x);        // Standard library on CPU
#endif
}

Call graph and qualifier rules

The rules about which qualifier can call which are strict. The following diagram shows every valid call path.

graph TD
  H["Host code
__host__ / main()"] -->|"launch <<<>>>"| G["__global__
kernel"]
  H --> HF["__host__ function"]
  H --> HD["__host__ __device__
function"]
  G --> D["__device__
function"]
  G --> HD
  D --> D2["__device__
function"]
  D --> HD
  HF --> HD
  HF --> HF2["__host__ function"]

  style H fill:#4a90d9,color:#fff
  style G fill:#e8744f,color:#fff
  style D fill:#e8744f,color:#fff
  style D2 fill:#e8744f,color:#fff
  style HD fill:#9b59b6,color:#fff
  style HF fill:#4a90d9,color:#fff
  style HF2 fill:#4a90d9,color:#fff

Blue nodes run on the host. Orange nodes run on the device. Purple nodes compile for both. The key constraint: __global__ kernels are only launched from the host (or via dynamic parallelism from device code at compute capability 3.5+). There is no path from __device__ back to __global__ without dynamic parallelism.

Worked example: tracing a call graph

Consider this program:

__host__ __device__ float utility_C(float x) {
    return x * x + 1.0f;
}

__device__ float helper_B(float x) {
    return utility_C(x) * 2.0f;
}

__global__ void kernel_A(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = helper_B(data[idx]);
    }
}

int main() {
    // ... allocate memory ...
    kernel_A<<<grid, block>>>(d_data, n);

    // Host can also call utility_C directly
    float host_result = utility_C(3.0f);
}

Trace each call:

  1. main() is __host__ (implicit). Runs on CPU. ✓
  2. main() launches kernel_A which is __global__. Valid: host can launch kernels. ✓
  3. kernel_A calls helper_B which is __device__. Valid: __global__ can call __device__. ✓
  4. helper_B calls utility_C which is __host__ __device__. Valid: device code can call the device-compiled version. ✓
  5. main() calls utility_C directly. Valid: host code calls the host-compiled version. ✓

What would not work:

  • main() calling helper_B() directly. ✗ __device__ functions are not callable from host.
  • kernel_A calling another __global__ kernel without dynamic parallelism. ✗
  • helper_B calling a __host__-only function. ✗ No host code is reachable from the device.

Inlining and recursion

Inlining

The nvcc compiler inlines __device__ functions by default when profitable. You can force it with __forceinline__ or prevent it with __noinline__:

__device__ __forceinline__ float fast_lerp(float a, float b, float t) {
    return a + t * (b - a);
}

__device__ __noinline__ float expensive_computation(float x) {
    // Complex logic you want as a real function call for debugging
    float result = x;  // ... placeholder for the actual computation ...
    return result;
}

Inlining eliminates call overhead and lets the compiler optimize across function boundaries. For tiny helpers (clamp, lerp, index calculations), always prefer inlining. For large functions, __noinline__ can reduce register pressure, because the compiler allocates registers independently for each non-inlined function.

Recursion

Device-side recursion requires compute capability 2.0 or higher. The compiler must allocate a per-thread stack frame, which lives in local memory (physically global memory, so it is slow). Recursion depth is limited by the stack size, which defaults to 1024 bytes per thread. You can increase it:

cudaDeviceSetLimit(cudaLimitStackSize, 8192);  // 8 KB per thread

Recursion on the GPU is legal but rarely a good idea. Each level of recursion adds local memory traffic. A tree traversal that recurses 20 levels deep will thrash memory. Iterative alternatives with an explicit stack in shared memory almost always perform better.

// Legal but slow: recursive device function
__device__ int fibonacci(int n) {
    if (n <= 1) return n;
    return fibonacci(n - 1) + fibonacci(n - 2);
}

⚠ This compiles and runs, but the exponential call tree combined with per-thread stack overhead makes it impractical for large n. Use iterative approaches on the GPU.

Function pointers

CUDA supports function pointers to __device__ functions starting at compute capability 2.0, but with restrictions:

  • You cannot take the address of a __global__ function and call it as a function pointer on the device.
  • Function pointers must point to __device__ functions.
  • Virtual functions are supported in device code, but an object may only invoke its virtual functions on the side (host or device) where it was created.

typedef float (*DeviceFunc)(float);

__device__ float square(float x) { return x * x; }
__device__ float cube(float x) { return x * x * x; }

__device__ DeviceFunc ops[] = { square, cube };

__global__ void apply_op(float* data, int n, int op_idx) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = ops[op_idx](data[idx]);
    }
}

Function pointers disable inlining and introduce indirect call overhead. The compiler cannot see through the pointer to optimize the callee. Prefer templates or compile-time dispatch (via if constexpr in C++17 device code) when the set of functions is known at compile time.

Libraries: cuBLAS, cuFFT, and Thrust

Writing raw kernels is necessary for custom logic, but NVIDIA provides heavily optimized libraries for common operations. These libraries use hand-tuned __device__ code internally. You should use them instead of reinventing matrix multiply or FFT.

cuBLAS

cuBLAS implements the BLAS (Basic Linear Algebra Subprograms) interface on the GPU. It covers dense matrix multiplication, triangular solves, vector dot products, and more. The API follows Fortran BLAS conventions (column-major layout, and 1-based indices in results from functions like cublasIsamax), which surprises C programmers.

cuFFT

cuFFT provides GPU-accelerated Fast Fourier Transforms. It supports 1D, 2D, and 3D transforms, batched transforms, and real-to-complex / complex-to-complex variants. A plan-based API lets you reuse transform plans across multiple calls.

Thrust

Thrust is a C++ template library that provides GPU-accelerated algorithms with an interface modeled after the C++ STL. Sort, reduce, scan, transform: all available with a single function call. Thrust manages memory transfers and kernel launches internally.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main() {
    // Allocate and fill device vector
    thrust::device_vector<float> d_vec(1000000);
    // ... fill with data ...

    // Sort in-place on GPU
    thrust::sort(d_vec.begin(), d_vec.end());

    // Reduce (sum) all elements
    float total = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0f, thrust::plus<float>());

    // Transform: square every element (device lambdas require nvcc's
    // --extended-lambda flag)
    thrust::transform(d_vec.begin(), d_vec.end(), d_vec.begin(),
                      [] __device__ (float x) { return x * x; });

    return 0;
}

Thrust picks the backend (CUDA, OpenMP, TBB) based on the iterator type. device_vector iterators dispatch to CUDA. host_vector iterators run on the CPU. Same algorithm, different hardware.

Python: CuPy as the high-level path

CuPy provides NumPy-compatible arrays on the GPU and wraps cuBLAS, cuFFT, and Thrust-equivalent operations under the hood. You get library-level performance without writing C++.

cuBLAS GEMM through CuPy

import cupy as cp

# Create two matrices on GPU
a = cp.random.randn(1024, 1024, dtype=cp.float32)
b = cp.random.randn(1024, 1024, dtype=cp.float32)

# Matrix multiply - calls cuBLAS SGEMM internally
c = a @ b

# Explicit BLAS call if you need more control
c = cp.cublas.gemm('N', 'N', a, b)

Thrust-equivalent operations

import cupy as cp

# Create a large device array
data = cp.random.randn(1_000_000, dtype=cp.float32)

# Sort (calls Thrust or CUB sort internally)
sorted_data = cp.sort(data)

# Reduce (sum)
total = cp.sum(data)

# Dot product
x = cp.random.randn(1_000_000, dtype=cp.float32)
y = cp.random.randn(1_000_000, dtype=cp.float32)
dot = cp.dot(x, y)  # Calls cuBLAS dot internally

Every CuPy operation stays on the GPU. No host-device transfer happens unless you explicitly call .get() or convert to a NumPy array. This is the single most important performance rule in CuPy: keep data on the device.

Worked example: raw kernel vs Thrust for dot product

Computing the dot product of two 1M-element vectors is a textbook reduction. Let’s compare a raw kernel against Thrust.

Raw __global__ kernel (shared memory reduction)

__global__ void dot_product_kernel(const float* a, const float* b,
                                    float* partial_sums, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread computes one product
    sdata[tid] = (idx < n) ? a[idx] * b[idx] : 0.0f;
    __syncthreads();

    // Reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        partial_sums[blockIdx.x] = sdata[0];
    }
}

// Host: launch kernel, then reduce partial sums (second kernel or CPU)
// ~45 lines total including memory management and error checking

Thrust one-liner

#include <thrust/inner_product.h>
#include <thrust/device_vector.h>

float dot = thrust::inner_product(d_a.begin(), d_a.end(), d_b.begin(), 0.0f);
// 1 line of actual computation

The raw kernel approach requires roughly 40 to 50 lines: the kernel, host memory allocation, two kernel launches (or a host-side final reduction), and error checking. The Thrust version is one line. Performance is comparable. Thrust internally generates optimized reduction kernels, often matching or slightly beating hand-written code because NVIDIA engineers tune it for each GPU architecture.

| Approach              | Lines of code | Performance (A100, 1M elements) | Correctness risk      |
|-----------------------|---------------|---------------------------------|-----------------------|
| Raw __global__ kernel | ~45           | ~0.12 ms                        | Off-by-one, sync bugs |
| Thrust inner_product  | ~1            | ~0.11 ms                        | Minimal               |
| CuPy cp.dot()         | ~1 (Python)   | ~0.13 ms                        | Minimal               |

The raw kernel gives you full control over shared memory tiling, warp-level primitives, and fusion with adjacent operations. Use it when you need that control. For standalone reductions, sorts, and scans, prefer the library.

In practice

Use __host__ __device__ for utility functions. Struct operators, math helpers, index calculations. Defining them once avoids drift between host validation code and device kernels.

Keep __device__ functions small and inlineable. The compiler eliminates call overhead when functions are small. Large __device__ functions increase register pressure and can spill to local memory.

Do not use recursion on the GPU unless you have profiled it. The per-thread stack lives in slow local memory. Iterative solutions with explicit stacks in shared memory are faster in virtually every case.

Use Thrust and cuBLAS before writing raw kernels. They are correct, tested, and tuned per architecture. Your raw reduction kernel is unlikely to beat Thrust on the architectures Thrust was tuned for. Write custom kernels when you need to fuse operations or implement algorithms the libraries do not cover.

Avoid function pointers on the device. They block inlining and introduce indirect branches. Templates and compile-time dispatch are almost always better.

Be careful with #ifdef __CUDA_ARCH__. It is only defined during device compilation passes. Code inside that block does not exist for the host compiler. This can cause subtle issues if you accidentally guard shared state behind it.

What comes next

With functions organized across host and device, the next challenge is coordinating when threads access shared data. Synchronization and atomics covers barriers, atomic operations, memory fences, and the patterns that prevent data races in concurrent GPU code.
