
Kubernetes architecture

In this series (14 parts)
  1. Why Kubernetes exists
  2. Kubernetes architecture
  3. Core Kubernetes objects
  4. Kubernetes networking
  5. Storage in Kubernetes
  6. Kubernetes configuration and secrets
  7. Resource management and autoscaling
  8. Kubernetes workload types
  9. Kubernetes observability
  10. Kubernetes security
  11. Helm and package management
  12. GitOps with ArgoCD
  13. Kubernetes cluster operations
  14. Service mesh concepts

Kubernetes splits into two layers. The control plane makes decisions. The nodes execute them. Every cluster operation, from scheduling a pod to scaling a deployment, flows through this architecture.

The control plane

The control plane runs on one or more dedicated machines (often three for high availability). It consists of five components.

graph TD
  subgraph ControlPlane["Control Plane"]
      API["API Server"]
      ETCD["etcd"]
      SCHED["Scheduler"]
      CM["Controller Manager"]
      CCM["Cloud Controller Manager"]
  end

  subgraph Node1["Worker Node 1"]
      KL1["kubelet"]
      KP1["kube-proxy"]
      CR1["Container Runtime"]
      P1["Pod A"]
      P2["Pod B"]
  end

  subgraph Node2["Worker Node 2"]
      KL2["kubelet"]
      KP2["kube-proxy"]
      CR2["Container Runtime"]
      P3["Pod C"]
  end

  API --> ETCD
  SCHED --> API
  CM --> API
  CCM --> API
  KL1 --> API
  KL2 --> API
  KP1 --> API
  KP2 --> API
  KL1 --> CR1
  KL2 --> CR2
  CR1 --> P1
  CR1 --> P2
  CR2 --> P3

The API server is the single point of communication. All components talk to it; none of them talk to each other directly.

API server

The API server (kube-apiserver) is the front door. Every kubectl command, every controller action, and every kubelet status report goes through it. It validates requests, authenticates callers, and persists state to etcd.

Key behaviors:

  • Serves a RESTful API over HTTPS.
  • Supports watch streams so controllers can react to changes in real time.
  • Runs admission controllers that enforce policies before objects are persisted.
  • Is horizontally scalable. Multiple instances can run behind a load balancer.
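The watch mechanism is worth pausing on, because every controller in the rest of this article depends on it. A minimal sketch of the pattern, with a plain queue standing in for the API server's HTTP stream and event names mirroring the real ADDED/MODIFIED/DELETED types (the handler and sentinel are illustrative, not the real client API):

```python
import queue

def watch(events: "queue.Queue", handle):
    """Consume a stream of (event_type, object) pairs until the stream closes."""
    while True:
        ev = events.get()
        if ev is None:            # sentinel: stream closed
            break
        etype, obj = ev
        handle(etype, obj)        # controller reacts to each change

# Simulated stream: a pod is created, then updated.
stream = queue.Queue()
stream.put(("ADDED", "pod-a"))
stream.put(("MODIFIED", "pod-a"))
stream.put(None)

seen = []
watch(stream, lambda t, o: seen.append((t, o)))
```

The real client library also tracks a resourceVersion so a reconnecting watcher can resume where it left off rather than re-listing everything.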

etcd

etcd is a distributed key-value store that holds all cluster state. Every object you create (Pods, Services, ConfigMaps) is serialized and stored here.

Important properties:

  • Uses the Raft consensus protocol for strong consistency.
  • Requires a quorum (majority of nodes agree) for writes.
  • Should be backed up regularly. Losing etcd means losing the cluster state.
  • Only the API server communicates with etcd directly.

A three-node etcd cluster tolerates one node failure. A five-node cluster tolerates two. Running etcd on fast SSDs is critical because write latency directly affects API server response time.
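The quorum arithmetic behind those numbers is simple majority math, sketched here:

```python
# A write commits only when a majority (quorum) of members agree,
# so a cluster of n members tolerates losing n - quorum of them.
def quorum(n: int) -> int:
    return n // 2 + 1

def failures_tolerated(n: int) -> int:
    return n - quorum(n)

for n in (1, 3, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

This is also why even-sized clusters are avoided: four members need a quorum of three, so they tolerate only one failure, the same as three members, while adding more write overhead.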

Scheduler

The scheduler (kube-scheduler) watches for newly created Pods that have no node assignment. It evaluates each node against the pod’s requirements and picks the best fit.

The scheduling process has two phases:

  1. Filtering. Eliminate nodes that cannot run the pod. Reasons include insufficient CPU, memory, or disk; taints that the pod does not tolerate; node selectors that do not match.
  2. Scoring. Rank the remaining nodes. Factors include resource balance, affinity rules, and data locality. The node with the highest score wins.
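The two phases can be sketched in a few lines. This is a toy model, not the real scheduler: the Node and Pod fields (cpu_free, node_selector) are simplified stand-ins for the actual API types, and the scoring function here only balances leftover resources.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    cpu_free: float                       # cores available
    mem_free: int                         # MiB available
    labels: dict = field(default_factory=dict)

@dataclass
class Pod:
    cpu_req: float
    mem_req: int
    node_selector: dict = field(default_factory=dict)

def filter_nodes(pod, nodes):
    # Phase 1: drop nodes that cannot run the pod at all.
    return [
        n for n in nodes
        if n.cpu_free >= pod.cpu_req
        and n.mem_free >= pod.mem_req
        and all(n.labels.get(k) == v for k, v in pod.node_selector.items())
    ]

def score(pod, node):
    # Phase 2: prefer nodes left with the most headroom after placement.
    cpu_left = (node.cpu_free - pod.cpu_req) / node.cpu_free
    mem_left = (node.mem_free - pod.mem_req) / node.mem_free
    return cpu_left + mem_left

def schedule(pod, nodes):
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None                       # pod stays Pending
    return max(feasible, key=lambda n: score(pod, n)).name
```

If filtering leaves no nodes, the pod stays Pending, which is exactly what you see in kubectl when a cluster is out of capacity.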

Controller manager

The controller manager (kube-controller-manager) runs a collection of controllers, each responsible for one type of reconciliation:

Controller  | Watches             | Acts on
Deployment  | Deployment objects  | Creates/updates ReplicaSets
ReplicaSet  | ReplicaSet objects  | Creates/deletes Pods
Node        | Node heartbeats     | Marks nodes as NotReady
Job         | Job objects         | Creates Pods, tracks completions
Endpoint    | Services and Pods   | Updates endpoint lists

Each controller runs an independent reconciliation loop. They share a process but operate on different object types.
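Every one of those loops follows the same shape: compare desired state to observed state and take a corrective step. A sketch of a ReplicaSet-style loop, with the cluster reduced to a list of pod names (the function and its return format are illustrative, not the real controller code):

```python
def reconcile_replicaset(desired_replicas, pods):
    """Return the actions needed to converge observed pods toward the desired count."""
    diff = desired_replicas - len(pods)
    if diff > 0:
        # Too few pods: create the missing ones.
        return [("create-pod", None)] * diff
    if diff < 0:
        # Too many pods: delete the surplus.
        return [("delete-pod", name) for name in pods[:-diff]]
    return []                             # converged, nothing to do
```

The key property is that the loop is level-triggered, not edge-triggered: it acts on the current difference between desired and observed state, so a missed event is repaired on the next pass rather than lost forever.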

Cloud controller manager

The cloud controller manager (cloud-controller-manager) integrates with the underlying cloud provider. It handles:

  • Creating load balancers when you create a LoadBalancer Service.
  • Mapping Kubernetes nodes to cloud instances.
  • Managing cloud routes for pod networking.

This component only runs in clusters hosted on a cloud provider. On bare metal you omit it; a tool such as MetalLB can provide LoadBalancer support in its place.

Node components

Every worker node runs three components.

kubelet

The kubelet is an agent that runs on each node. It receives pod specifications from the API server and ensures the described containers are running and healthy.

Responsibilities:

  • Pulls container images.
  • Starts, stops, and restarts containers via the container runtime.
  • Reports node status and pod status back to the API server.
  • Executes liveness and readiness probes.
  • Mounts volumes into containers.

The kubelet does not manage containers that were not created by Kubernetes. It only cares about pods assigned to its node.

kube-proxy

kube-proxy maintains network rules on the node. When you create a Service, kube-proxy programs iptables (or IPVS) rules so that traffic to the Service IP reaches the correct pods.

It operates in one of three modes:

  • iptables (default): Programs netfilter rules. Works well up to a few thousand services.
  • IPVS: Uses kernel-level load balancing. Better performance at scale.
  • nftables: Newer alternative to iptables with a cleaner rule set.
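Conceptually, all three modes implement the same mapping: traffic addressed to a Service's virtual IP is redirected to one of the backing pod IPs. A userspace sketch of that idea, with made-up IPs and a random backend choice mirroring iptables mode (the real work happens in the kernel, not in code like this):

```python
import random

# Illustrative Service VIP -> pod endpoints mapping.
endpoints = {
    "10.96.0.10:80": ["10.244.1.5:8080", "10.244.2.7:8080"],
}

def route(dest, rng=random):
    """Rewrite a destination the way a DNAT rule would."""
    backends = endpoints.get(dest)
    if not backends:
        return dest                       # not a Service IP: pass through
    return rng.choice(backends)           # iptables mode: random backend per connection
```

In iptables mode this choice is encoded as a chain of probabilistic DNAT rules, which is why rule count grows with service and endpoint count, and why IPVS scales better.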

Container runtime

The container runtime is the software that actually runs containers. Kubernetes talks to it through the Container Runtime Interface (CRI).

Common runtimes:

  • containerd: The default in most distributions. Lightweight and stable.
  • CRI-O: Built specifically for Kubernetes. Common in OpenShift.

Docker was removed as a direct runtime in Kubernetes 1.24 (the dockershim removal). Images built with Docker still work because Docker produces OCI-compliant images; only the runtime interface changed.

How a pod gets scheduled

Bringing it all together, here is the sequence when you run kubectl apply -f deployment.yaml:

sequenceDiagram
  participant User
  participant API as API Server
  participant etcd
  participant CM as Controller Manager
  participant Sched as Scheduler
  participant KL as kubelet
  participant CR as Container Runtime

  User->>API: kubectl apply (Deployment)
  API->>etcd: Store Deployment object
  API-->>CM: Watch event: new Deployment
  CM->>API: Create ReplicaSet
  API->>etcd: Store ReplicaSet
  CM->>API: Create Pod (unscheduled)
  API->>etcd: Store Pod (nodeName empty)
  API-->>Sched: Watch event: unscheduled Pod
  Sched->>Sched: Filter and score nodes
  Sched->>API: Bind Pod to Node
  API->>etcd: Update Pod (nodeName set)
  API-->>KL: Watch event: Pod assigned
  KL->>CR: Pull image, start container
  KL->>API: Report Pod status: Running
  API->>etcd: Update Pod status

From kubectl apply to a running container, six components collaborate. The API server brokers every interaction.

Let’s walk through each step:

  1. User submits a Deployment. The API server validates it, runs admission controllers, and stores it in etcd.
  2. Deployment controller reacts. It sees the new Deployment via a watch stream and creates a ReplicaSet with the desired replica count.
  3. ReplicaSet controller reacts. It sees the new ReplicaSet and creates individual Pod objects. These pods have no nodeName yet.
  4. Scheduler picks a node. It filters out unsuitable nodes, scores the rest, and binds the pod to the best one by setting nodeName.
  5. kubelet starts the container. The kubelet on the assigned node sees the pod, pulls the image, and starts the container through the runtime.
  6. Status flows back. The kubelet reports pod status to the API server, which stores it in etcd.

Control plane high availability

In production, you run multiple replicas of each control plane component:

# Example: three control plane nodes in a kubeadm cluster
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "k8s-api.example.com:6443"
etcd:
  local:
    extraArgs:
      listen-client-urls: "https://0.0.0.0:2379"
      advertise-client-urls: "https://10.0.1.10:2379"
apiServer:
  certSANs:
    - "k8s-api.example.com"
    - "10.0.1.10"
    - "10.0.1.11"
    - "10.0.1.12"
networking:
  podSubnet: "10.244.0.0/16"
  serviceSubnet: "10.96.0.0/12"

Three API server instances sit behind a load balancer. Three etcd nodes form a quorum. The scheduler and controller manager use leader election so only one instance is active at a time, with the others on standby.
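That leader election is lease-based: whoever holds an unexpired lease is active, and standbys keep trying to take it over. A simplified sketch of the mechanism (in a real cluster the Lease object lives in the API server, and this class is an illustration, not the client-go implementation):

```python
import time

class Lease:
    def __init__(self, duration=15.0):
        self.holder = None
        self.renewed_at = 0.0
        self.duration = duration

    def try_acquire(self, candidate, now=None):
        """Acquire or renew the lease; return True if candidate is now the leader."""
        now = time.monotonic() if now is None else now
        expired = now - self.renewed_at > self.duration
        if self.holder is None or expired or self.holder == candidate:
            self.holder = candidate
            self.renewed_at = now
            return True
        return False                      # someone else holds a live lease
```

The active instance renews well before the lease expires; if it crashes and stops renewing, a standby acquires the lease after the duration elapses and takes over.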

Inspecting the architecture

You can verify what is running on your cluster:

# List control plane pods
kubectl get pods -n kube-system

# Check component health (componentstatuses is deprecated since v1.19)
kubectl get componentstatuses

# View node details
kubectl describe node worker-01

# Check kubelet status on a node
systemctl status kubelet

The kube-system namespace contains all control plane pods in a kubeadm-managed cluster. Managed services like EKS and GKE hide the control plane entirely.

What comes next

You now know what each component does and how they collaborate. The next article covers core Kubernetes objects: Pods, Deployments, Services, and the other building blocks you will use daily.
