Processes, threads, and context switching
In this series (6 parts)
- How a CPU actually works: cores, clocks, and execution
- CPU caches and memory hierarchy: why memory access speed matters
- CPU pipelines and instruction-level parallelism
- Memory models and why concurrent CPU code is hard
- SIMD and vectorization: parallelism on a single CPU core
- Processes, threads, and context switching
This article builds on two earlier ones: CPU architecture overview covers how a core executes instructions, and CPU caches and memory explains the memory hierarchy that makes caching tricky when multiple threads are involved. Read those first if the terms “pipeline,” “cache line,” or “L1/L2/L3” are unfamiliar.
Process vs thread: what each owns
A process is an isolated execution environment. The operating system gives each process its own virtual address space, file descriptor table, and security context. One process crashing does not corrupt another. That isolation is expensive: creating a process means duplicating page tables, allocating kernel structures, and setting up a fresh address space.
A thread lives inside a process. All threads in a process share the same address space, heap, and open files. Each thread gets its own stack and its own set of CPU registers. Threads are cheaper to create (roughly 10x faster than a new process on Linux) because there is no address space to clone.
Think of a process as an apartment. The apartment has its own walls, plumbing, and electricity meter. Threads are roommates sharing that apartment. They share the kitchen and bathroom, but each has their own bed. Sharing is efficient, but one roommate leaving the stove on can burn down the whole place. That is the tradeoff: threads share memory for speed, but a stray pointer in one thread can corrupt data used by another.
```mermaid
graph TD
    subgraph Process["Process (Virtual Address Space)"]
        direction TB
        Code["Code (Text Segment)<br/>Read-only, shared"]
        Heap["Heap<br/>malloc/new allocations<br/>Shared by all threads"]
        Libs["Shared Libraries<br/>libc, libm, etc."]
        subgraph T1["Thread 1"]
            S1["Stack 1<br/>Local vars, return addrs"]
            R1["Registers + PC"]
        end
        subgraph T2["Thread 2"]
            S2["Stack 2<br/>Local vars, return addrs"]
            R2["Registers + PC"]
        end
        subgraph T3["Thread 3"]
            S3["Stack 3<br/>Local vars, return addrs"]
            R3["Registers + PC"]
        end
    end
    Code --- Heap
    Heap --- Libs
    Libs --- T1
    Libs --- T2
    Libs --- T3
```
Key ownership breakdown:
| Resource | Process-level (shared) | Thread-level (private) |
|---|---|---|
| Virtual address space | ✓ | |
| Heap memory | ✓ | |
| File descriptors | ✓ | |
| Code segment | ✓ | |
| Stack | | ✓ |
| Register set | | ✓ |
| Program counter | | ✓ |
| Signal mask | | ✓ (per-thread on Linux) |
Context switching: what gets saved and restored
When the OS decides to stop running thread A and start running thread B on the same core, it performs a context switch. The kernel must save thread A’s state and load thread B’s state. Here is what happens in order:
- A timer interrupt fires (or thread A makes a system call, or thread A blocks on I/O).
- The CPU traps into kernel mode.
- The kernel saves thread A’s register file: general-purpose registers, program counter, stack pointer, floating-point registers, and status flags. On x86-64 that is roughly 16 general-purpose registers plus SIMD state (up to 2 KB with AVX-512).
- The kernel updates thread A’s state to “ready” (or “blocked” if it was waiting on I/O).
- The scheduler picks thread B from the run queue.
- The kernel loads thread B’s saved register state.
- If thread B belongs to a different process, the kernel also switches the page table base register (CR3 on x86), which flushes the TLB. This is the expensive case.
- The CPU returns to user mode and thread B resumes where it left off.
```mermaid
sequenceDiagram
    participant Core as CPU Core
    participant KS as Kernel Scheduler
    participant TA as Thread A
    participant TB as Thread B
    TA->>Core: Running in user mode
    Note over Core: Timer interrupt fires
    Core->>KS: Trap to kernel mode
    KS->>KS: Save Thread A registers
    KS->>KS: Update Thread A state to Ready
    KS->>KS: Pick Thread B from run queue
    KS->>KS: Load Thread B registers
    Note over KS: If different process, switch page tables and flush TLB
    KS->>Core: Return to user mode
    Core->>TB: Running in user mode
```
Switching between threads in the same process is cheaper than switching between processes. Same-process switches skip the page table swap and TLB flush. That TLB flush alone can cost thousands of cycles because subsequent memory accesses all miss the TLB and must walk the page tables.
Example 1: The cost of context switching
A context switch takes roughly 1 to 5 microseconds on modern hardware. Let’s use 3 microseconds as a middle estimate.
How many CPU cycles is that? On a 3 GHz processor, one cycle takes 1/3,000,000,000 seconds, or about 0.33 nanoseconds. A 3-microsecond context switch is 3,000 nanoseconds.
3,000 ns / 0.33 ns per cycle = ~9,000 cycles
Nine thousand cycles. And since a modern superscalar core can retire several instructions per cycle, that is tens of thousands of instructions the core could have executed instead of shuffling register state.
What if a program does 1,000 context switches per second? Each switch burns 3 microseconds, so:
1,000 switches * 3 µs = 3,000 µs = 3 ms per second
One second has 1,000 ms, so 3 ms / 1,000 ms = 0.3% of CPU time spent on context switching overhead alone. That sounds small, but this ignores the indirect costs: cache pollution and TLB misses after each switch. The incoming thread likely has a completely different working set. L1 and L2 caches are warm for thread A, cold for thread B. Measurements show the true cost including cache warmup can be 10 to 30 microseconds, pushing overhead to 1 to 3% at 1,000 switches per second.
On busy servers handling 50,000+ connections, the switch rate can be much higher. This is one reason event-driven architectures (epoll, io_uring, kqueue) exist: fewer threads means fewer context switches.
Thread scheduling: how the OS decides who runs
Linux uses the Completely Fair Scheduler (CFS) for normal threads. The core idea: track how much CPU time each thread has consumed, and always run the thread that has received the least. CFS uses a red-black tree keyed by “virtual runtime” (vruntime). The thread with the smallest vruntime is picked next.
Key scheduling concepts:
- Time slice (quantum): the maximum time a thread runs before the scheduler checks if someone else deserves the core. CFS targets a latency of about 6 ms for up to 8 runnable threads, scaling up with more threads.
- Priority / nice value: threads with lower nice values get more CPU time. CFS achieves this by making vruntime advance slower for higher-priority threads.
- Preemption: the kernel can interrupt a running thread before its time slice expires if a higher-priority thread becomes runnable.
- CPU affinity: you can pin a thread to a specific core to avoid migration costs (cache warmup on the new core).
Real-time schedulers (SCHED_FIFO, SCHED_RR) bypass CFS entirely. They always preempt normal threads and run until they voluntarily yield or a higher-priority real-time thread arrives. These are used for audio processing, industrial control, and low-latency trading.
User-space threads vs kernel threads
A kernel thread (also called a 1:1 thread) is visible to the OS scheduler. Each pthread on Linux maps to one kernel thread. The OS handles preemption, scheduling, and can run kernel threads on different cores in parallel.
A user-space thread (also called a green thread or fiber) is managed entirely in user space. A runtime library handles switching between green threads without involving the kernel. The OS sees only one (or a few) kernel threads; the runtime multiplexes many green threads onto them (M:N threading).
| Property | Kernel threads | User-space threads |
|---|---|---|
| Context switch cost | 1 to 5 µs (kernel trap) | 100 to 500 ns (no kernel trap) |
| Parallelism | True multi-core | Limited by backing kernel threads |
| Blocking I/O | Only blocks one thread | Can block the entire backing thread |
| Scheduling | OS scheduler (CFS) | Runtime scheduler |
| Stack size default | 2 to 8 MB | Often 2 to 8 KB (growable) |
| Scalability | Thousands | Millions |
Go’s goroutines are user-space threads backed by a small pool of kernel threads. The Go runtime’s scheduler (the “GMP” model) multiplexes goroutines onto OS threads and handles the problem of blocking I/O by parking the kernel thread and spinning up another. Erlang’s processes and Java’s virtual threads (Project Loom) follow similar M:N models.
The tradeoff is always the same: user-space threads are cheaper but require a cooperating runtime. If a green thread makes a blocking system call without the runtime’s knowledge, it stalls every green thread on that kernel thread.
Synchronization primitives
When threads share memory, you need tools to prevent data races. Here are the core primitives.
| Primitive | Scope | Blocking? | Typical use case | Overhead |
|---|---|---|---|---|
| Mutex | One thread at a time | Yes (spins or sleeps) | Protecting a shared data structure | Low: 25 ns uncontended on Linux |
| Spinlock | One thread at a time | Yes (busy-wait only) | Very short critical sections (< 1 µs) | Lowest if hold time is tiny; wastes CPU otherwise |
| Semaphore | N threads at a time | Yes | Limiting concurrent access to a resource pool | Moderate: kernel involvement |
| Condition variable | Signal/wait pattern | Yes (sleeps) | Producer-consumer queues | Moderate: pairs with a mutex |
| Read-write lock | Many readers OR one writer | Yes | Read-heavy shared data | Higher than mutex due to bookkeeping |
| Atomic operations | Lock-free single values | No (hardware guarantee) | Counters, flags, CAS-based algorithms | Very low: single instruction |
A mutex is the workhorse. Lock it before accessing shared data, unlock it when done. If another thread tries to lock a held mutex, it blocks (either spinning briefly then sleeping, or sleeping immediately, depending on implementation). The uncontended case is fast because modern mutexes use a futex: a user-space atomic check, falling back to a kernel sleep only on contention.
Spinlocks never sleep. The waiting thread loops checking the lock in a tight CPU loop. This is only efficient when the expected wait is shorter than the cost of a kernel sleep/wake cycle (a few microseconds). Spinlocks are common inside the kernel itself, where sleeping is not always safe.
False sharing: the silent performance killer
This one is subtle and brutal. It comes from how CPU caches work, so make sure CPU caches and memory is fresh in your mind.
Definition: False sharing occurs when two threads on different cores modify different variables that happen to reside in the same cache line. Neither thread is actually sharing data with the other, but the cache coherence protocol does not know that. It operates at cache-line granularity (64 bytes on x86). Any write to a byte in the line invalidates the entire line on all other cores.
Example 2: False sharing in action
Suppose two threads each maintain a counter. The counters are adjacent in memory:
```cpp
struct Counters {
    int64_t count_a; // 8 bytes, offset 0
    int64_t count_b; // 8 bytes, offset 8
};
```
Both fit within one 64-byte cache line. Thread A (on Core 0) increments count_a. Thread B (on Core 1) increments count_b.
Here is what happens at the cache level:
- Thread A writes `count_a`. Core 0's L1 cache takes ownership of the line (MESI state: Modified).
- Thread B wants to write `count_b`. The coherence protocol detects that Core 0 holds the line in Modified state. Core 0 must flush the line to L3 (or write it back), and Core 1 must fetch it. This takes 40 to 70 ns on typical hardware.
- Thread B writes `count_b`. Now Core 1 owns the line in Modified state.
- Thread A wants to write `count_a` again. Same story: Core 1 flushes, Core 0 fetches. Another 40 to 70 ns.
This ping-pong repeats on every write. If each thread writes millions of times per second, the cost is enormous.
Quantifying the damage: A write to an L1-resident cache line takes about 1 ns. A write that triggers a cache-line transfer between cores takes 40 to 70 ns. That is a 40x to 70x slowdown per write operation. In benchmarks, false sharing can reduce throughput by 10x or more for tight loops.
The fix: Pad the struct so each counter sits in its own cache line:
```cpp
struct Counters {
    alignas(64) int64_t count_a; // Starts at a cache line boundary
    alignas(64) int64_t count_b; // Starts at the next cache line boundary
};
```
Now count_a and count_b are in separate 64-byte lines. Both cores can write concurrently without triggering coherence traffic. The cost is 56 bytes of wasted padding per counter. For hot, frequently-written data, that trade is always worth it.
Java’s @Contended annotation and Linux kernel’s ____cacheline_aligned attribute exist specifically for this problem. If you are writing concurrent code and performance matters, check your struct layouts.
OS threads are not GPU threads
This distinction matters if you are moving into GPU programming, which is where this series heads next.
An OS thread is a heavyweight entity. It has kilobytes to megabytes of stack, a full register context, and the OS scheduler decides when it runs. Creating a thread costs microseconds. A typical server runs hundreds to a few thousand threads.
A GPU “thread” is a different animal entirely. On an NVIDIA GPU:
| Property | OS (CPU) thread | GPU thread |
|---|---|---|
| Stack size | 2 to 8 MB | 0 (no stack by default, limited local memory) |
| Register context | ~1 to 2 KB | ~256 bytes (limited register file per thread) |
| Creation cost | ~1 to 10 µs | Essentially free (launched in bulk) |
| Scheduling | OS kernel, preemptive | Hardware warp scheduler, cooperative |
| Count per chip | Hundreds | Tens of thousands simultaneously |
| Context switch | 1 to 5 µs (kernel trap) | ~0 (hardware switches warps in 1 cycle) |
| Memory model | Coherent caches | Non-coherent across SMs; explicit sync needed |
GPU threads are designed to be launched by the millions. The hardware schedules groups of 32 (a “warp” on NVIDIA, “wavefront” on AMD) and switches between warps in a single cycle to hide memory latency. There is no kernel trap, no register save/restore in software. The hardware keeps all warp contexts resident simultaneously and just switches which one feeds the execution units.
The mental model shift: CPU threads are independent workers that the OS timeshares onto a few cores. GPU threads are a massive flock that moves together in lockstep groups, designed to saturate memory bandwidth through sheer parallelism. If you try to write GPU code with a CPU-threading mindset (few threads, lots of branching, large per-thread state), performance will be terrible.
In practice: real bottlenecks with threading
Lock contention is the most common killer. A mutex that is held for 100 ns but contested by 16 threads creates a serial bottleneck. Amdahl’s law is unforgiving: if 5% of your work is serial, you cannot get more than 20x speedup no matter how many cores you throw at it.
Thread creation overhead adds up. Creating a thread per HTTP request was common in early Java servers. At 10,000 requests per second and 10 µs per thread creation, that is 100 ms of CPU time per second just for thread lifecycle management. Thread pools exist for exactly this reason. Create N threads at startup, reuse them.
Cache thrashing from oversubscription. Running 200 threads on 8 cores means constant context switches. Each switch pollutes the L1 and L2 caches with the new thread’s working set. If your working set exceeds the cache, every switch triggers a cascade of cache misses. Rule of thumb: for CPU-bound work, run one thread per core. For I/O-bound work, you can go higher because blocked threads do not compete for cache.
NUMA effects. On multi-socket systems, memory is physically attached to specific CPU sockets. A thread on socket 0 accessing memory on socket 1 pays a 50 to 100% latency penalty. Pinning threads to cores and allocating memory on the local NUMA node matters for high-performance code.
Debugging is hard. Data races are non-deterministic. A race condition that appears once per million runs in testing can appear once per hundred runs in production under different timing. Tools like ThreadSanitizer (TSan), Helgrind, and Intel Inspector can detect races, but they add 5 to 15x overhead and cannot run in production. Designing for correctness (using well-tested concurrent data structures, minimizing shared mutable state) is cheaper than debugging races after the fact.
What comes next
This article covered how the OS manages execution: processes for isolation, threads for concurrency within a process, and the synchronization tools that keep shared memory coherent. The recurring theme is cost: context switches cost cycles, locks cost contention, false sharing costs cache bandwidth.
The next series, Parallel Thinking, steps back from hardware specifics and asks: how do you decompose a problem for parallel execution? That is a different skill from knowing what a mutex does. Start with Introduction to parallel thinking to begin Series 2.