Kubernetes cluster operations

In this series (14 parts)
  1. Why Kubernetes exists
  2. Kubernetes architecture
  3. Core Kubernetes objects
  4. Kubernetes networking
  5. Storage in Kubernetes
  6. Kubernetes configuration and secrets
  7. Resource management and autoscaling
  8. Kubernetes workload types
  9. Kubernetes observability
  10. Kubernetes security
  11. Helm and package management
  12. GitOps with ArgoCD
  13. Kubernetes cluster operations
  14. Service mesh concepts

Running a cluster is not a one-time event. You upgrade control planes, rotate certificates, drain nodes, recover from etcd failures, and scale capacity. This post covers the operational tasks that keep production clusters healthy.

Upgrading a cluster

Kubernetes enforces a version skew policy: kubelets may be up to three minor versions older than the API server (two in releases before v1.28), but never newer. That means you upgrade the control plane first, then the worker nodes.

sequenceDiagram
  participant Op as Operator
  participant CP1 as Control Plane 1
  participant CP2 as Control Plane 2
  participant CP3 as Control Plane 3
  participant W as Worker Nodes

  Op->>CP1: Upgrade to v1.30
  CP1-->>Op: Ready
  Op->>CP2: Upgrade to v1.30
  CP2-->>Op: Ready
  Op->>CP3: Upgrade to v1.30
  CP3-->>Op: Ready
  Op->>W: Cordon node-1
  Op->>W: Drain node-1
  Op->>W: Upgrade kubelet on node-1
  Op->>W: Uncordon node-1
  Note over W: Repeat for each worker

Upgrade sequence: control plane nodes first, then workers one at a time.

For a kubeadm-managed cluster, the control plane upgrade looks like this:

# Check available versions
sudo apt-cache madison kubeadm

# Upgrade kubeadm itself
sudo apt-get update
sudo apt-get install -y kubeadm=1.30.0-1.1

# Verify the upgrade plan
sudo kubeadm upgrade plan

# Apply the upgrade on the first control plane node
sudo kubeadm upgrade apply v1.30.0

# Upgrade kubelet and kubectl
sudo apt-get install -y kubelet=1.30.0-1.1 kubectl=1.30.0-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet

On additional control plane nodes, replace upgrade apply with upgrade node. Never skip minor versions. Going from 1.28 to 1.30 requires upgrading through 1.29 first.
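Worker nodes follow the same package dance, wrapped in the cordon/drain pattern described in the next section. A sketch for one kubeadm-managed worker (the node name and package version are placeholders):

# From a machine with cluster access: evict workloads first
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# On the worker node itself
sudo apt-get update
sudo apt-get install -y kubeadm=1.30.0-1.1
sudo kubeadm upgrade node
sudo apt-get install -y kubelet=1.30.0-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# Back on the admin machine: allow scheduling again
kubectl uncordon node-1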

Node draining and cordoning

Before upgrading or decommissioning a node, you cordon it to prevent new pod scheduling, then drain existing pods.

# Mark node as unschedulable
kubectl cordon node-1

# Evict all pods, respecting PodDisruptionBudgets
kubectl drain node-1 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=300s

# After maintenance, allow scheduling again
kubectl uncordon node-1

The drain command respects PodDisruptionBudgets (PDBs). A PDB guarantees minimum availability during voluntary disruptions like drains and rolling updates.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web

This PDB ensures at least two web pods remain available. If evicting a pod during a drain would violate the constraint, the eviction is refused and the drain retries until the pods reschedule elsewhere. maxUnavailable: 1 is an alternative with similar behavior at three replicas, but note the difference: maxUnavailable scales with replica count, while minAvailable: 2 is an absolute floor.

Set PDBs for every stateful or critical workload. Without them, a drain can evict all replicas simultaneously.
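Before a planned drain, you can check how much disruption budget remains (names here match the example PDB above):

kubectl -n production get pdb web-pdb

# The ALLOWED DISRUPTIONS column shows how many pods can be
# evicted right now; 0 means a drain touching these pods will block.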

etcd backup and restore

etcd holds all cluster state. Losing it means losing the cluster. Back it up regularly.

# Snapshot etcd
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db \
  --write-table

Automate this with a CronJob that runs daily and pushes snapshots to object storage. Schedule the job on control plane nodes using nodeSelector and appropriate tolerations, then upload the snapshot to S3 or GCS.
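A minimal sketch of such a CronJob, assuming a hypothetical backup image that bundles etcdctl and an S3 client, and a hypothetical bucket name; the host paths match the kubeadm defaults used above:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"  # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          hostNetwork: true
          restartPolicy: OnFailure
          containers:
            - name: backup
              # Hypothetical image containing etcdctl and the aws CLI
              image: registry.example.com/etcd-backup:latest
              command:
                - /bin/sh
                - -c
                - |
                  ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F).db \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/server.crt \
                    --key=/etc/kubernetes/pki/etcd/server.key
                  aws s3 cp /backup/etcd-$(date +%F).db s3://my-cluster-backups/  # hypothetical bucket
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd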

To restore from a snapshot, stop the API server and etcd first. On a kubeadm cluster both run as static pods, so you stop them by moving their manifests out of /etc/kubernetes/manifests and waiting for the containers to exit:

# Stop the static pods by moving their manifests aside
sudo mkdir -p /etc/kubernetes/manifests-stopped
sudo mv /etc/kubernetes/manifests/etcd.yaml \
        /etc/kubernetes/manifests/kube-apiserver.yaml \
        /etc/kubernetes/manifests-stopped/

# Restore the snapshot to a new data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored

# Replace the old data directory
sudo mv /var/lib/etcd /var/lib/etcd-old
sudo mv /var/lib/etcd-restored /var/lib/etcd

# Move the manifests back; the kubelet restarts etcd and the API server
sudo mv /etc/kubernetes/manifests-stopped/*.yaml /etc/kubernetes/manifests/

Test your restore procedure in a staging environment. An untested backup is not a backup.
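After a restore, a quick sanity check confirms etcd and the API server came back healthy (same certificate flags as the snapshot commands above):

# etcd health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Cluster state is being served again
kubectl get nodes
kubectl get pods -A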

Certificate rotation

Kubernetes components communicate over TLS. Certificates issued by kubeadm expire after one year by default. Check expiry dates:

sudo kubeadm certs check-expiration

Renew all certificates at once:

sudo kubeadm certs renew all

After renewal, restart the control plane static pods so they pick up the new certificates. Restarting the kubelet alone does not restart static pods; instead, move the manifests out of /etc/kubernetes/manifests, wait for the containers to stop, and move them back:

sudo mkdir -p /tmp/manifests
sudo mv /etc/kubernetes/manifests/*.yaml /tmp/manifests/
sleep 20
sudo mv /tmp/manifests/*.yaml /etc/kubernetes/manifests/

Plan certificate rotation into your upgrade cycle. Running kubeadm upgrade apply automatically renews certificates. If you skip upgrades for months, set calendar reminders.

Cluster autoscaler

The Cluster Autoscaler adjusts node count based on pending pods. When pods cannot be scheduled, it provisions new nodes. When nodes sit underutilized, it removes them.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m

Key flags to tune: --scale-down-delay-after-add prevents thrashing by waiting after a scale-up. --scale-down-unneeded-time sets how long a node must be idle before removal. --expander=least-waste picks the node group that wastes the fewest resources.

The autoscaler respects PDBs during scale-down and skips nodes running pods that use local storage unless --skip-nodes-with-local-storage=false is set.
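When scaling stalls, the autoscaler's status ConfigMap and logs are the first places to look:

# Per-node-group health and scale-up/scale-down state
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml

# Recent scaling decisions and the reasons behind them
kubectl -n kube-system logs deployment/cluster-autoscaler | grep -i scale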

Multi-cluster patterns

Single clusters have blast radius limits, region constraints, and scaling ceilings. Multi-cluster architectures solve these at the cost of complexity.

Common patterns include fleet management with Argo CD’s ApplicationSet and service mesh federation for cross-cluster service discovery. A management cluster typically controls workload clusters:

# Register a workload cluster with Argo CD
argocd cluster add workload-cluster-1 \
  --kubeconfig=/path/to/workload-1-kubeconfig

# Deploy an ApplicationSet across clusters
cat <<EOF | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: web-app
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: production
  template:
    metadata:
      name: "web-app-{{name}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/org/web-app
        targetRevision: main
        path: k8s/overlays/production
      destination:
        server: "{{server}}"
        namespace: web-app
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
EOF

This ApplicationSet deploys to every cluster labeled env: production. Adding a new cluster with that label automatically triggers deployment. For DNS-based multi-cluster routing, external-dns with weighted records distributes traffic across regions.
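As a sketch of the DNS approach, external-dns on AWS Route 53 expresses weighted routing through annotations; the hostname, identifier, and weight below are illustrative, with one set-identifier per cluster:

apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    external-dns.alpha.kubernetes.io/hostname: web.example.com
    external-dns.alpha.kubernetes.io/set-identifier: us-east-1  # unique per cluster
    external-dns.alpha.kubernetes.io/aws-weight: "100"          # relative traffic share
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 80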

What comes next

Operational maturity grows with observability and traffic management. The next post covers service mesh concepts, where tools like Istio and Linkerd add mutual TLS, traffic shaping, and fine-grained observability without changing application code.
