Kubernetes architecture
In this series (14 parts)
- Why Kubernetes exists
- Kubernetes architecture
- Core Kubernetes objects
- Kubernetes networking
- Storage in Kubernetes
- Kubernetes configuration and secrets
- Resource management and autoscaling
- Kubernetes workload types
- Kubernetes observability
- Kubernetes security
- Helm and package management
- GitOps with ArgoCD
- Kubernetes cluster operations
- Service mesh concepts
Kubernetes splits into two layers. The control plane makes decisions. The nodes execute them. Every cluster operation, from scheduling a pod to scaling a deployment, flows through this architecture.
The control plane
The control plane runs on one or more dedicated machines (often three for high availability). It consists of five components.
```mermaid
graph TD
    subgraph ControlPlane["Control Plane"]
        API["API Server"]
        ETCD["etcd"]
        SCHED["Scheduler"]
        CM["Controller Manager"]
        CCM["Cloud Controller Manager"]
    end
    subgraph Node1["Worker Node 1"]
        KL1["kubelet"]
        KP1["kube-proxy"]
        CR1["Container Runtime"]
        P1["Pod A"]
        P2["Pod B"]
    end
    subgraph Node2["Worker Node 2"]
        KL2["kubelet"]
        KP2["kube-proxy"]
        CR2["Container Runtime"]
        P3["Pod C"]
    end
    API --> ETCD
    SCHED --> API
    CM --> API
    CCM --> API
    KL1 --> API
    KL2 --> API
    KP1 --> API
    KP2 --> API
    KL1 --> CR1
    KL2 --> CR2
    CR1 --> P1
    CR1 --> P2
    CR2 --> P3
```
The API server is the single point of communication. Every component talks to it, never to each other directly.
API server
The API server (kube-apiserver) is the front door. Every kubectl command, every controller action, and every kubelet status report goes through it. It validates requests, authenticates callers, and persists state to etcd.
Key behaviors:
- Serves a RESTful API over HTTPS.
- Supports watch streams so controllers can react to changes in real time.
- Runs admission controllers that enforce policies before objects are persisted.
- Is horizontally scalable. Multiple instances can run behind a load balancer.
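Every object the API server stores maps to a REST path. A sketch, assuming a Pod named web in the default namespace (the name and image are illustrative):

```yaml
# Served by the API server at: /api/v1/namespaces/default/pods/web
apiVersion: v1
kind: Pod
metadata:
  name: web
  namespace: default
spec:
  containers:
    - name: web
      image: nginx:1.27
```

kubectl translates every command into HTTPS calls against paths like this one, and a watch on the same paths is how controllers see changes in real time.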
etcd
etcd is a distributed key-value store that holds all cluster state. Every object you create (Pods, Services, ConfigMaps) is serialized and stored here.
Important properties:
- Uses the Raft consensus protocol for strong consistency.
- Requires a quorum (majority of nodes agree) for writes.
- Should be backed up regularly. Losing etcd means losing the cluster state.
- Only the API server communicates with etcd directly.
A three-node etcd cluster tolerates one node failure. A five-node cluster tolerates two. Running etcd on fast SSDs is critical because write latency directly affects API server response time.
Scheduler
The scheduler (kube-scheduler) watches for newly created Pods that have no node assignment. It evaluates each node against the pod’s requirements and picks the best fit.
The scheduling process has two phases:
- Filtering. Eliminate nodes that cannot run the pod. Reasons include insufficient CPU, memory, or disk; taints that the pod does not tolerate; node selectors that do not match.
- Scoring. Rank the remaining nodes. Factors include resource balance, affinity rules, and data locality. The node with the highest score wins.
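To make the filtering inputs concrete, here is a sketch of a pod spec that constrains scheduling (label values and resource figures are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: example/app:1.0
      resources:
        requests:
          cpu: "500m"       # filtering: nodes without 500m of free CPU are eliminated
          memory: "256Mi"
  nodeSelector:
    disktype: ssd           # filtering: only nodes labeled disktype=ssd pass
  tolerations:
    - key: "dedicated"      # filtering: allows nodes tainted dedicated=batch:NoSchedule
      operator: "Equal"
      value: "batch"
      effect: "NoSchedule"
```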
Controller manager
The controller manager (kube-controller-manager) runs a collection of controllers, each responsible for one type of reconciliation:
| Controller | Watches | Acts on |
|---|---|---|
| Deployment | Deployment objects | Creates/updates ReplicaSets |
| ReplicaSet | ReplicaSet objects | Creates/deletes Pods |
| Node | Node heartbeats | Marks nodes as NotReady |
| Job | Job objects | Creates Pods, tracks completions |
| Endpoint | Services and Pods | Updates endpoint lists |
Each controller runs an independent reconciliation loop. They share a process but operate on different object types.
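The first two rows of the table chain together from a single manifest. A sketch (names illustrative): the Deployment controller turns this object into a ReplicaSet, and the ReplicaSet controller turns that into three Pods:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3          # the ReplicaSet controller keeps exactly 3 Pods running
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27
```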
Cloud controller manager
The cloud controller manager (cloud-controller-manager) integrates with the underlying cloud provider. It handles:
- Creating load balancers when you create a Service of type LoadBalancer.
- Mapping Kubernetes nodes to cloud instances.
- Managing cloud routes for pod networking.
This component exists only in clusters that run on a cloud provider. On bare metal, you omit it; a project such as MetalLB can fill the gap for LoadBalancer support.
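For example, creating a Service like the following sketch (name and ports illustrative) is what triggers the cloud controller manager to provision an external load balancer:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer   # cloud-controller-manager provisions a cloud LB for this Service
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```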
Node components
Every worker node runs three components.
kubelet
The kubelet is an agent that runs on each node. It receives pod specifications from the API server and ensures the described containers are running and healthy.
Responsibilities:
- Pulls container images.
- Starts, stops, and restarts containers via the container runtime.
- Reports node status and pod status back to the API server.
- Executes liveness and readiness probes.
- Mounts volumes into containers.
The kubelet does not manage containers that were not created by Kubernetes. It only cares about pods assigned to its node.
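The probes the kubelet executes are declared per container. A sketch, assuming an app that serves /healthz and /ready endpoints (paths and port are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: example/app:1.0
      livenessProbe:        # on failure, the kubelet restarts the container
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
      readinessProbe:       # on failure, the kubelet marks the pod unready
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
```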
kube-proxy
kube-proxy maintains network rules on the node. When you create a Service, kube-proxy programs iptables (or IPVS) rules so that traffic to the Service IP reaches the correct pods.
It operates in one of three modes:
- iptables (default): Programs netfilter rules. Works well up to a few thousand services.
- IPVS: Uses kernel-level load balancing. Better performance at scale.
- nftables: Newer alternative to iptables with a cleaner rule set.
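The mode is selected in kube-proxy's configuration. A minimal sketch using the kubeproxy.config.k8s.io/v1alpha1 API (in a kubeadm cluster this lives in the kube-proxy ConfigMap in kube-system):

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"          # or "iptables" (default), or "nftables" on newer clusters
ipvs:
  scheduler: "rr"     # round-robin; IPVS supports other balancing algorithms too
```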
Container runtime
The container runtime is the software that actually runs containers. Kubernetes talks to it through the Container Runtime Interface (CRI).
Common runtimes:
- containerd: The default in most distributions. Lightweight and stable.
- CRI-O: Built specifically for Kubernetes. Common in OpenShift.
Docker was removed as a supported runtime in Kubernetes 1.24, when the dockershim adapter was dropped. Images built with Docker still work because Docker produces OCI-compliant images. Only the runtime interface changed.
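The kubelet is pointed at a runtime through its CRI socket. A sketch assuming containerd's default socket path (in recent releases this is a KubeletConfiguration field; older clusters pass it as a kubelet flag instead):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerRuntimeEndpoint: "unix:///run/containerd/containerd.sock"
```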
How a pod gets scheduled
Bringing it all together, here is the sequence when you run kubectl apply -f deployment.yaml:
```mermaid
sequenceDiagram
    participant User
    participant API as API Server
    participant etcd
    participant CM as Controller Manager
    participant Sched as Scheduler
    participant KL as kubelet
    participant CR as Container Runtime
    User->>API: kubectl apply (Deployment)
    API->>etcd: Store Deployment object
    API-->>CM: Watch event: new Deployment
    CM->>API: Create ReplicaSet
    API->>etcd: Store ReplicaSet
    CM->>API: Create Pod (unscheduled)
    API->>etcd: Store Pod (nodeName empty)
    API-->>Sched: Watch event: unscheduled Pod
    Sched->>Sched: Filter and score nodes
    Sched->>API: Bind Pod to Node
    API->>etcd: Update Pod (nodeName set)
    API-->>KL: Watch event: Pod assigned
    KL->>CR: Pull image, start container
    KL->>API: Report Pod status: Running
    API->>etcd: Update Pod status
```
From kubectl apply to a running container, six components collaborate. The API server brokers every interaction.
Let’s walk through each step:
- User submits a Deployment. The API server validates it, runs admission controllers, and stores it in etcd.
- Deployment controller reacts. It sees the new Deployment via a watch stream and creates a ReplicaSet with the desired replica count.
- ReplicaSet controller reacts. It sees the new ReplicaSet and creates individual Pod objects. These pods have no nodeName yet.
- Scheduler picks a node. It filters out unsuitable nodes, scores the rest, and binds the pod to the best one by setting nodeName.
- kubelet starts the container. The kubelet on the assigned node sees the pod, pulls the image, and starts the container through the runtime.
- Status flows back. The kubelet reports pod status to the API server, which stores it in etcd.
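The scheduling step is visible on the object itself: binding simply fills in spec.nodeName. An abbreviated sketch (pod and node names illustrative):

```yaml
# Before scheduling: nodeName is absent; the scheduler watches for pods in this state
apiVersion: v1
kind: Pod
metadata:
  name: web-7d4b9-x2k8f
spec:
  containers:
    - name: web
      image: nginx:1.27
---
# After binding: the scheduler has set nodeName
apiVersion: v1
kind: Pod
metadata:
  name: web-7d4b9-x2k8f
spec:
  nodeName: worker-01   # the kubelet on worker-01 now starts the containers
  containers:
    - name: web
      image: nginx:1.27
```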
Control plane high availability
In production, you run multiple replicas of each control plane component:
```yaml
# Example: three control plane nodes in a kubeadm cluster
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "k8s-api.example.com:6443"
etcd:
  local:
    extraArgs:
      listen-client-urls: "https://0.0.0.0:2379"
      advertise-client-urls: "https://10.0.1.10:2379"
apiServer:
  certSANs:
    - "k8s-api.example.com"
    - "10.0.1.10"
    - "10.0.1.11"
    - "10.0.1.12"
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"
```
Three API server instances sit behind a load balancer. Three etcd nodes form a quorum. The scheduler and controller manager use leader election so only one instance is active at a time, with the others on standby.
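Leader election shows up as Lease objects in kube-system. A sketch of roughly what kubectl -n kube-system get lease kube-scheduler -o yaml returns (the holder identity and timestamp are illustrative):

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  holderIdentity: "cp-node-1_8a1f"   # the currently active scheduler instance
  leaseDurationSeconds: 15
  renewTime: "2024-01-01T12:00:00.000000Z"
```

If the holder stops renewing the lease, a standby instance takes over within the lease duration.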
Inspecting the architecture
You can verify what is running on your cluster:
```bash
# List control plane pods
kubectl get pods -n kube-system

# Check component health (componentstatuses is deprecated but still answers)
kubectl get componentstatuses

# View node details
kubectl describe node worker-01

# Check kubelet status (run on the node itself)
systemctl status kubelet
```
The kube-system namespace contains all control plane pods in a kubeadm-managed cluster. Managed services like EKS and GKE hide the control plane entirely.
What comes next
You now know what each component does and how they collaborate. The next article covers core Kubernetes objects: Pods, Deployments, Services, and the other building blocks you will use daily.