Processes, threads, and context switching
In this series (6 parts)
- How a CPU actually works: cores, clocks, and execution
- CPU caches and memory hierarchy: why memory access speed matters
- CPU pipelines and instruction-level parallelism
- Memory models and why concurrent CPU code is hard
- SIMD and vectorization: parallelism on a single CPU core
- Processes, threads, and context switching
This article builds on two earlier ones: CPU architecture overview covers how a core executes instructions, and CPU caches and memory explains the memory hierarchy that makes caching tricky when multiple threads are involved. Read those first if the terms “pipeline,” “cache line,” or “L1/L2/L3” are unfamiliar.
Process vs thread: what each owns
A process is an isolated execution environment. The operating system gives each process its own virtual address space, file descriptor table, and security context. One process crashing does not corrupt another. That isolation is expensive: creating a process means duplicating page tables, allocating kernel structures, and setting up a fresh address space.
A thread lives inside a process. All threads in a process share the same address space, heap, and open files. Each thread gets its own stack and its own set of CPU registers. Threads are cheaper to create (roughly 10x faster than a new process on Linux) because there is no address space to clone.
Think of a process as an apartment. The apartment has its own walls, plumbing, and electricity meter. Threads are roommates sharing that apartment. They share the kitchen and bathroom, but each has their own bed. Sharing is efficient, but one roommate leaving the stove on can burn down the whole place. That is the tradeoff: threads share memory for speed, but a stray pointer in one thread can corrupt data used by another.
```mermaid
graph TD
    subgraph Process["Process (Virtual Address Space)"]
        direction TB
        Code["Code (Text Segment)<br/>Read-only, shared"]
        Heap["Heap<br/>malloc/new allocations<br/>Shared by all threads"]
        Libs["Shared Libraries<br/>libc, libm, etc."]
        subgraph T1["Thread 1"]
            S1["Stack 1<br/>Local vars, return addrs"]
            R1["Registers + PC"]
        end
        subgraph T2["Thread 2"]
            S2["Stack 2<br/>Local vars, return addrs"]
            R2["Registers + PC"]
        end
        subgraph T3["Thread 3"]
            S3["Stack 3<br/>Local vars, return addrs"]
            R3["Registers + PC"]
        end
    end
    Code --- Heap
    Heap --- Libs
    Libs --- T1
    Libs --- T2
    Libs --- T3
```
Key ownership breakdown:
| Resource | Process-level (shared) | Thread-level (private) |
|---|---|---|
| Virtual address space | ✓ | |
| Heap memory | ✓ | |
| File descriptors | ✓ | |
| Code segment | ✓ | |
| Stack | | ✓ |
| Register set | | ✓ |
| Program counter | | ✓ |
| Signal mask | | ✓ (per-thread on Linux) |
Context switching: what gets saved and restored
When the OS decides to stop running thread A and start running thread B on the same core, it performs a context switch. The kernel must save thread A’s state and load thread B’s state. Here is what happens in order:
- A timer interrupt fires (or thread A makes a system call, or thread A blocks on I/O).
- The CPU traps into kernel mode.
- The kernel saves thread A’s register file: general-purpose registers, program counter, stack pointer, floating-point registers, and status flags. On x86-64 that is roughly 16 general-purpose registers plus SIMD state (up to 2 KB with AVX-512).
- The kernel updates thread A’s state to “ready” (or “blocked” if it was waiting on I/O).
- The scheduler picks thread B from the run queue.
- The kernel loads thread B’s saved register state.
- If thread B belongs to a different process, the kernel also switches the page table base register (CR3 on x86), which flushes the TLB. This is the expensive case.
- The CPU returns to user mode and thread B resumes where it left off.
```mermaid
sequenceDiagram
    participant Core as CPU Core
    participant KS as Kernel Scheduler
    participant TA as Thread A
    participant TB as Thread B
    TA->>Core: Running in user mode
    Note over Core: Timer interrupt fires
    Core->>KS: Trap to kernel mode
    KS->>KS: Save Thread A registers
    KS->>KS: Update Thread A state to Ready
    KS->>KS: Pick Thread B from run queue
    KS->>KS: Load Thread B registers
    Note over KS: If different process, switch page tables and flush TLB
    KS->>Core: Return to user mode
    Core->>TB: Running in user mode
```
Switching between threads in the same process is cheaper than switching between processes. Same-process switches skip the page table swap and TLB flush. That TLB flush alone can cost thousands of cycles because subsequent memory accesses all miss the TLB and must walk the page tables.
Example 1: The cost of context switching
A context switch takes roughly 1 to 5 microseconds on modern hardware. Let’s use 3 microseconds as a middle estimate.
How many CPU cycles is that? On a 3 GHz processor, one cycle takes 1/3,000,000,000 seconds, or about 0.33 nanoseconds. A 3-microsecond context switch is 3,000 nanoseconds.
3,000 ns / 0.33 ns per cycle = ~9,000 cycles
Nine thousand cycles. And since a modern superscalar core can retire several instructions per cycle, that is tens of thousands of instructions the core could have executed instead of shuffling register state.
What if a program does 1,000 context switches per second? Each switch burns 3 microseconds, so:
1,000 switches * 3 µs = 3,000 µs = 3 ms per second
One second has 1,000 ms, so 3 ms / 1,000 ms = 0.3% of CPU time spent on context switching overhead alone. That sounds small, but this ignores the indirect costs: cache pollution and TLB misses after each switch. The incoming thread likely has a completely different working set. L1 and L2 caches are warm for thread A, cold for thread B. Measurements show the true cost including cache warmup can be 10 to 30 microseconds, pushing overhead to 1 to 3% at 1,000 switches per second.
On busy servers handling 50,000+ connections, the switch rate can be much higher. This is one reason event-driven architectures (epoll, io_uring, kqueue) exist: fewer threads means fewer context switches.
Thread scheduling: how the OS decides who runs
Linux uses the Completely Fair Scheduler (CFS) for normal threads. The core idea: track how much CPU time each thread has consumed, and always run the thread that has received the least. CFS uses a red-black tree keyed by “virtual runtime” (vruntime). The thread with the smallest vruntime is picked next.
Key scheduling concepts:
- Time slice (quantum): the maximum time a thread runs before the scheduler checks if someone else deserves the core. CFS targets a latency of about 6 ms for up to 8 runnable threads, scaling up with more threads.
- Priority / nice value: threads with lower nice values get more CPU time. CFS achieves this by making vruntime advance slower for higher-priority threads.
- Preemption: the kernel can interrupt a running thread before its time slice expires if a higher-priority thread becomes runnable.
- CPU affinity: you can pin a thread to a specific core to avoid migration costs (cache warmup on the new core).
Real-time schedulers (SCHED_FIFO, SCHED_RR) bypass CFS entirely. They always preempt normal threads and run until they voluntarily yield or a higher-priority real-time thread arrives. These are used for audio processing, industrial control, and low-latency trading.
User-space threads vs kernel threads
A kernel thread (also called a 1:1 thread) is visible to the OS scheduler. Each pthread on Linux maps to one kernel thread. The OS handles preemption, scheduling, and can run kernel threads on different cores in parallel.
A user-space thread (also called a green thread or fiber) is managed entirely in user space. A runtime library handles switching between green threads without involving the kernel. The OS sees only one (or a few) kernel threads; the runtime multiplexes many green threads onto them (M:N threading).
| Property | Kernel threads | User-space threads |
|---|---|---|
| Context switch cost | 1 to 5 µs (kernel trap) | 100 to 500 ns (no kernel trap) |
| Parallelism | True multi-core | Limited by backing kernel threads |
| Blocking I/O | Only blocks one thread | Can block the entire backing thread |
| Scheduling | OS scheduler (CFS) | Runtime scheduler |
| Stack size default | 2 to 8 MB | Often 2 to 8 KB (growable) |
| Scalability | Thousands | Millions |
Go’s goroutines are user-space threads backed by a small pool of kernel threads. The Go runtime’s scheduler (the “GMP” model) multiplexes goroutines onto OS threads and handles the problem of blocking I/O by parking the kernel thread and spinning up another. Erlang’s processes and Java’s virtual threads (Project Loom) follow similar M:N models.
The tradeoff is always the same: user-space threads are cheaper but require a cooperating runtime. If a green thread makes a blocking system call without the runtime’s knowledge, it stalls every green thread on that kernel thread.
Synchronization primitives
When threads share memory, you need tools to prevent data races. Here are the core primitives.
| Primitive | Scope | Blocking? | Typical use case | Overhead |
|---|---|---|---|---|
| Mutex | One thread at a time | Yes (spins or sleeps) | Protecting a shared data structure | Low: 25 ns uncontended on Linux |
| Spinlock | One thread at a time | Yes (busy-wait only) | Very short critical sections (< 1 µs) | Lowest if hold time is tiny; wastes CPU otherwise |
| Semaphore | N threads at a time | Yes | Limiting concurrent access to a resource pool | Moderate: kernel involvement |
| Condition variable | Signal/wait pattern | Yes (sleeps) | Producer-consumer queues | Moderate: pairs with a mutex |
| Read-write lock | Many readers OR one writer | Yes | Read-heavy shared data | Higher than mutex due to bookkeeping |
| Atomic operations | Lock-free single values | No (hardware guarantee) | Counters, flags, CAS-based algorithms | Very low: single instruction |
A mutex is the workhorse. Lock it before accessing shared data, unlock it when done. If another thread tries to lock a held mutex, it blocks (either spinning briefly then sleeping, or sleeping immediately, depending on implementation). The uncontended case is fast because modern mutexes use a futex: a user-space atomic check, falling back to a kernel sleep only on contention.
Spinlocks never sleep. The waiting thread loops checking the lock in a tight CPU loop. This is only efficient when the expected wait is shorter than the cost of a kernel sleep/wake cycle (a few microseconds). Spinlocks are common inside the kernel itself, where sleeping is not always safe.
False sharing: the silent performance killer
This one is subtle and brutal. It comes from how CPU caches work, so make sure CPU caches and memory is fresh in your mind.
Definition: False sharing occurs when two threads on different cores modify different variables that happen to reside in the same cache line. Neither thread is actually sharing data with the other, but the cache coherence protocol does not know that. It operates at cache-line granularity (64 bytes on x86). Any write to a byte in the line invalidates the entire line on all other cores.
Example 2: False sharing in action
Suppose two threads each maintain a counter. The counters are adjacent in memory:
```cpp
struct Counters {
    int64_t count_a; // 8 bytes, offset 0
    int64_t count_b; // 8 bytes, offset 8
};
```
Both fit within one 64-byte cache line. Thread A (on Core 0) increments count_a. Thread B (on Core 1) increments count_b.
Here is what happens at the cache level:
- Thread A writes `count_a`. Core 0's L1 cache takes ownership of the line (MESI state: Modified).
- Thread B wants to write `count_b`. The coherence protocol detects that Core 0 holds the line in Modified state. Core 0 must flush the line to L3 (or write it back), and Core 1 must fetch it. This takes 40 to 70 ns on typical hardware.
- Thread B writes `count_b`. Now Core 1 owns the line in Modified state.
- Thread A wants to write `count_a` again. Same story: Core 1 flushes, Core 0 fetches. Another 40 to 70 ns.
This ping-pong repeats on every write. If each thread writes millions of times per second, the cost is enormous.
Quantifying the damage: A write to an L1-resident cache line takes about 1 ns. A write that triggers a cache-line transfer between cores takes 40 to 70 ns. That is a 40x to 70x slowdown per write operation. In benchmarks, false sharing can reduce throughput by 10x or more for tight loops.
The fix: Pad the struct so each counter sits in its own cache line:
```cpp
struct Counters {
    alignas(64) int64_t count_a; // Starts at a cache line boundary
    alignas(64) int64_t count_b; // Starts at the next cache line boundary
};
```
Now count_a and count_b are in separate 64-byte lines. Both cores can write concurrently without triggering coherence traffic. The cost is 56 bytes of wasted padding per counter. For hot, frequently-written data, that trade is always worth it.
Java’s @Contended annotation and Linux kernel’s ____cacheline_aligned attribute exist specifically for this problem. If you are writing concurrent code and performance matters, check your struct layouts.
OS threads are not GPU threads
This distinction matters if you are moving into GPU programming, which is where this series heads next.
An OS thread is a heavyweight entity. It has kilobytes to megabytes of stack, a full register context, and the OS scheduler decides when it runs. Creating a thread costs microseconds. A typical server runs hundreds to a few thousand threads.
A GPU “thread” is a different animal entirely. On an NVIDIA GPU:
| Property | OS (CPU) thread | GPU thread |
|---|---|---|
| Stack size | 2 to 8 MB | 0 (no stack by default, limited local memory) |
| Register context | ~1 to 2 KB | ~256 bytes (limited register file per thread) |
| Creation cost | ~1 to 10 µs | Essentially free (launched in bulk) |
| Scheduling | OS kernel, preemptive | Hardware warp scheduler, cooperative |
| Count per chip | Hundreds | Tens of thousands simultaneously |
| Context switch | 1 to 5 µs (kernel trap) | ~0 (hardware switches warps in 1 cycle) |
| Memory model | Coherent caches | Non-coherent across SMs; explicit sync needed |
GPU threads are designed to be launched by the millions. The hardware schedules groups of 32 (a “warp” on NVIDIA, “wavefront” on AMD) and switches between warps in a single cycle to hide memory latency. There is no kernel trap, no register save/restore in software. The hardware keeps all warp contexts resident simultaneously and just switches which one feeds the execution units.
The mental model shift: CPU threads are independent workers that the OS timeshares onto a few cores. GPU threads are a massive flock that moves together in lockstep groups, designed to saturate memory bandwidth through sheer parallelism. If you try to write GPU code with a CPU-threading mindset (few threads, lots of branching, large per-thread state), performance will be terrible.
In practice: real bottlenecks with threading
Lock contention is the most common killer. A mutex that is held for 100 ns but contested by 16 threads creates a serial bottleneck. Amdahl’s law is unforgiving: if 5% of your work is serial, you cannot get more than 20x speedup no matter how many cores you throw at it.
Thread creation overhead adds up. Creating a thread per HTTP request was common in early Java servers. At 10,000 requests per second and 10 µs per thread creation, that is 100 ms of CPU time per second just for thread lifecycle management. Thread pools exist for exactly this reason. Create N threads at startup, reuse them.
Cache thrashing from oversubscription. Running 200 threads on 8 cores means constant context switches. Each switch pollutes the L1 and L2 caches with the new thread’s working set. If your working set exceeds the cache, every switch triggers a cascade of cache misses. Rule of thumb: for CPU-bound work, run one thread per core. For I/O-bound work, you can go higher because blocked threads do not compete for cache.
NUMA effects. On multi-socket systems, memory is physically attached to specific CPU sockets. A thread on socket 0 accessing memory on socket 1 pays a 50 to 100% latency penalty. Pinning threads to cores and allocating memory on the local NUMA node matters for high-performance code.
Debugging is hard. Data races are non-deterministic. A race condition that appears once per million runs in testing can appear once per hundred runs in production under different timing. Tools like ThreadSanitizer (TSan), Helgrind, and Intel Inspector can detect races, but they add 5 to 15x overhead and cannot run in production. Designing for correctness (using well-tested concurrent data structures, minimizing shared mutable state) is cheaper than debugging races after the fact.
What comes next
This article covered how the OS manages execution: processes for isolation, threads for concurrency within a process, and the synchronization tools that keep shared memory coherent. The recurring theme is cost: context switches cost cycles, locks cost contention, false sharing costs cache bandwidth.
The next series, Parallel Thinking, steps back from hardware specifics and asks: how do you decompose a problem for parallel execution? That is a different skill from knowing what a mutex does. Start with Introduction to parallel thinking to begin Series 2.