How containers work
Containers are not virtual machines. This is the single most important thing to understand before working with Docker. A container is a regular Linux process that has been given a restricted view of the system using kernel features that have existed for over a decade. No hypervisor. No guest OS. Just isolation primitives applied to ordinary processes.
VMs vs containers
A virtual machine runs a full guest operating system on emulated hardware. A container shares the host kernel and isolates only the userspace.
graph TD
subgraph VM["Virtual Machine Stack"]
H1[Host OS + Hypervisor] --> G1[Guest OS 1]
H1 --> G2[Guest OS 2]
G1 --> A1[App A]
G2 --> A2[App B]
end
subgraph CT["Container Stack"]
H2[Host OS + Container Runtime] --> C1[App A]
H2 --> C2[App B]
end
style H1 fill:#64b5f6,stroke:#1976d2,color:#000
style H2 fill:#81c784,stroke:#388e3c,color:#000
style G1 fill:#ffb74d,stroke:#f57c00,color:#000
style G2 fill:#ffb74d,stroke:#f57c00,color:#000
style C1 fill:#ce93d8,stroke:#7b1fa2,color:#000
style C2 fill:#ce93d8,stroke:#7b1fa2,color:#000
VM stack vs container stack. Containers skip the guest OS entirely.
VMs provide strong isolation because each guest has its own kernel. Containers are lighter because they share the host kernel, but they depend on kernel-level isolation being correctly configured. Containers trade a thinner security boundary for far lower startup time and resource overhead: seconds versus minutes to start, megabytes versus gigabytes of footprint.
Linux namespaces
Namespaces are the kernel feature that gives each container its own isolated view of system resources. Modern kernels define eight namespace types; the six below do the heavy lifting for containers.
PID namespace
Each container gets its own process ID tree. The first process inside the container sees itself as PID 1, even though the host sees it with a completely different PID.
# Create a new PID namespace and run a shell inside it
sudo unshare --pid --fork --mount-proc /bin/bash
# Inside the new namespace
ps aux
# You will only see the bash process and ps itself
The --fork flag is required because PID namespaces apply to children of the calling process, not the caller itself. The --mount-proc flag remounts /proc so that ps reads from the new namespace.
Network namespace
Each container gets its own network stack: its own interfaces, routing table, iptables rules, and port space.
# Create a network namespace
sudo ip netns add testns
# Run a command inside it
sudo ip netns exec testns ip addr
# Only the loopback interface exists, and it is DOWN
# Clean up
sudo ip netns del testns
This is why two containers can both bind to port 80 without conflict. They each have their own port space.
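An empty network namespace is not useful on its own. A veth pair works like a virtual patch cable: one end stays on the host, the other moves into the namespace. A sketch of wiring one up (the interface names and the 10.0.0.0/24 addresses are arbitrary choices for this demo):

```shell
# Create a namespace and a veth pair (one cable, two ends)
sudo ip netns add demo
sudo ip link add veth-host type veth peer name veth-demo
# Move one end into the namespace
sudo ip link set veth-demo netns demo
# Assign addresses and bring both ends up
sudo ip addr add 10.0.0.1/24 dev veth-host
sudo ip link set veth-host up
sudo ip netns exec demo ip addr add 10.0.0.2/24 dev veth-demo
sudo ip netns exec demo ip link set veth-demo up
# The host and the namespace can now talk
ping -c 1 10.0.0.2
# Clean up; deleting the namespace removes the veth pair with it
sudo ip netns del demo
```

Docker performs the same wiring for every container, attaching the host end of each veth pair to the docker0 bridge.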
Mount namespace
Each container gets its own filesystem mount tree. Changes to mounts inside the container do not affect the host.
# Create a new mount namespace
sudo unshare --mount /bin/bash
# Mounts made here are invisible to the host
mount -t tmpfs none /mnt
ls /mnt # empty tmpfs, only visible in this namespace
UTS namespace
The UTS namespace isolates the hostname and domain name. Each container can have its own hostname without affecting others.
sudo unshare --uts /bin/bash
hostname container-01
hostname # shows container-01, host is unchanged
IPC namespace
Isolates System V IPC objects and POSIX message queues. Processes in different IPC namespaces cannot see each other’s shared memory segments or semaphores.
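The isolation is easy to observe with the System V shared memory tools from util-linux (ipcmk, ipcs); the 1024-byte segment size below is an arbitrary choice:

```shell
# Create a shared memory segment in the current namespace
ipcmk -M 1024
# prints the id of the new segment
# The current namespace lists it
ipcs -m
# A fresh IPC namespace starts with an empty table
sudo unshare --ipc ipcs -m
```

Clean up with ipcrm -m followed by the segment id; System V segments outlive the process that created them.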
User namespace
Maps UIDs inside the container to different UIDs on the host. A process can be root (UID 0) inside the container but map to an unprivileged user (e.g., UID 100000) on the host. This is the foundation of rootless containers.
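On most modern distributions you can try this without root, which is rather the point; the --map-root-user flag writes the UID mapping for you. A sketch (the host UID 1000 in the comment is a typical example, yours may differ):

```shell
# Enter a new user namespace, becoming "root" inside it
unshare --user --map-root-user /bin/bash
id -u
# 0
cat /proc/self/uid_map
# 0  1000  1  -> inside-UID 0 maps to host UID 1000, for a range of one UID
```

Rootless Docker and Podman extend this single mapping with UID ranges from /etc/subuid so a container can use many distinct users.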
Control groups (cgroups)
Namespaces control what a process can see. Cgroups control what a process can use. They enforce resource limits on CPU, memory, I/O, and more.
Memory limits
# Create a cgroup (cgroups v2)
sudo mkdir -p /sys/fs/cgroup/demo
# Set a 100MB memory limit
echo $((100 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/demo/memory.max
# Move the current shell into this cgroup
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs
# If the processes in this cgroup collectively exceed 100MB, the kernel OOM-kills one of them
CPU limits
# Limit to 50% of one CPU core (50ms out of every 100ms)
echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max
The format is two values, quota and period, both in microseconds. A quota of 50000 within a 100000-microsecond period means the cgroup gets at most 50% of a single core. Write max in place of the quota to remove the limit.
What happens without cgroups
Without cgroups, a single container could consume all available memory and starve every other process on the host. Cgroups are what make multi-tenant container hosts viable.
Union filesystems
Containers need a filesystem. Copying an entire OS image for every container would waste enormous amounts of disk space. Union filesystems solve this by layering read-only image layers with a thin writable layer on top.
How OverlayFS works
OverlayFS merges multiple directories into a single unified view. It uses four directories:
- lowerdir: One or more read-only layers (the image layers)
- upperdir: A writable layer where all changes are stored
- workdir: A scratch directory used internally by OverlayFS
- merged: The unified view presented to the container
# Set up an OverlayFS mount
sudo mkdir -p /data/lower /data/upper /data/work /data/merged
# Populate the lower (read-only) layer
echo "from image" | sudo tee /data/lower/config.txt
# Mount the overlay
sudo mount -t overlay overlay \
  -o lowerdir=/data/lower,upperdir=/data/upper,workdir=/data/work \
  /data/merged
# The container sees the merged view
cat /data/merged/config.txt # "from image"
# Writing goes to the upper layer (copy-on-write)
echo "modified" | sudo tee /data/merged/config.txt
cat /data/upper/config.txt # "modified"
cat /data/lower/config.txt # "from image" (unchanged)
When a container modifies a file from a lower layer, OverlayFS copies it to the upper layer first, then applies the modification. The lower layer is never touched. This is copy-on-write.
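Deletions follow the same principle: marking instead of modifying. Removing a file that originated in a lower layer creates a whiteout, a character device with device number 0/0, in the upper layer. Continuing the mount from above:

```shell
# Delete a file that originated in the read-only lower layer
sudo rm /data/merged/config.txt
# The upper layer now holds a whiteout entry that masks it
ls -l /data/upper/config.txt
# c--------- ... 0, 0 ... config.txt
# The lower layer still has the original
cat /data/lower/config.txt # "from image"
```

This is also why deleting files in a later Dockerfile RUN step never shrinks an image: the data stays in the lower layer, merely hidden behind a whiteout.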
Layers in practice
A Docker image is a stack of OverlayFS layers. Each instruction in a Dockerfile creates a new layer. When you pull an image that shares base layers with an image you already have, Docker only downloads the new layers. When you run five containers from the same image, they all share the same read-only layers and each get their own thin writable upper layer.
What Docker adds
Everything above exists in the Linux kernel without Docker. You could build a container by hand with unshare, ip netns, cgroups, and mount -t overlay. Docker wraps these primitives into a usable system.
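A rough sketch of such a hand-built container, skipping the image filesystem and cgroup steps (a real runtime would also pivot_root into an unpacked image and join a cgroup):

```shell
# Five namespaces in one command; --mount-proc also implies a new mount namespace
sudo unshare --pid --fork --uts --ipc --net --mount-proc /bin/bash
# Inside the "container":
hostname handmade    # UTS: set a private hostname, host unaffected
ps aux               # PID: only bash and ps are visible
ip addr              # net: a single loopback interface, DOWN
```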
The Docker daemon
dockerd is a long-running daemon that manages container lifecycles. It handles creating namespaces, setting up cgroups, mounting filesystems, configuring networking, and cleaning up when containers exit.
The CLI
docker run, docker build, docker ps. The CLI talks to the daemon over a Unix socket (/var/run/docker.sock). Every docker command is an API call to the daemon.
Image format (OCI)
Docker standardized a format for packaging filesystem layers and metadata into a portable image. The Open Container Initiative (OCI) now governs this spec. An OCI image is a manifest pointing to a config blob and an ordered list of layer tarballs.
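A trimmed manifest, with digests elided, looks roughly like this (the size values are illustrative; the media type strings are the ones the OCI spec defines):

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:…",
    "size": 7023
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:…",
      "size": 32654
    }
  ]
}
```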
Registry
Docker Hub and other registries are HTTP APIs for storing and distributing images. docker pull downloads layers from a registry. docker push uploads them. The protocol supports content-addressable storage, so layers are deduplicated by their SHA256 digest.
Putting it together
When you run docker run -it ubuntu bash:
- The daemon pulls the ubuntu image layers from the registry (if not cached)
- It stacks the layers using OverlayFS and adds a writable upper layer
- It creates new namespaces (pid, net, mnt, uts, ipc, user)
- It sets up cgroups with the configured resource limits
- It creates a virtual network interface and attaches it to a bridge
- It starts bash as PID 1 inside the new namespaces
The result looks like an isolated machine. It is still just a process on the host.
What comes next
Now that you understand the kernel primitives behind containers, the next article covers Docker fundamentals: installing Docker, building images with Dockerfiles, managing containers, and working with volumes and networks.