
Resource management and autoscaling

In this series (14 parts)
  1. Why Kubernetes exists
  2. Kubernetes architecture
  3. Core Kubernetes objects
  4. Kubernetes networking
  5. Storage in Kubernetes
  6. Kubernetes configuration and secrets
  7. Resource management and autoscaling
  8. Kubernetes workload types
  9. Kubernetes observability
  10. Kubernetes security
  11. Helm and package management
  12. GitOps with ArgoCD
  13. Kubernetes cluster operations
  14. Service mesh concepts

Without resource limits, a single runaway pod can starve an entire node. Without autoscaling, you either over-provision (wasting money) or under-provision (degrading performance). Kubernetes gives you tools for both problems.

Requests vs limits

Every container can declare two resource boundaries.

Requests are what the scheduler uses to place pods. A pod requesting 200m CPU will only be scheduled on a node with at least 200m unreserved. Requests are a guaranteed reservation: that capacity is set aside for the pod whether or not it uses it.

Limits are the maximum a container can consume. If it exceeds the memory limit, it is killed (OOMKilled). If it exceeds the CPU limit, it is throttled.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: myregistry/api:3.2.0
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          ports:
            - containerPort: 8080

CPU is measured in millicores. 200m is 0.2 of a CPU core. Memory uses standard units: Mi (mebibytes), Gi (gibibytes).
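These quantities can be written in several equivalent ways. A quick reference using standard Kubernetes quantity notation:

```yaml
resources:
  requests:
    cpu: "0.2"        # equivalent to "200m" (200 millicores)
    memory: "256Mi"   # 256 * 1024^2 bytes; "256M" would be 256 * 1000^2 bytes
```

Note that Mi and Gi are binary units (powers of 1024) while M and G are decimal; confusing the two silently changes how much memory you request.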

The gap between requests and limits

Setting limits much higher than requests allows bursting. A pod requesting 200m CPU can burst to 500m if the node has spare capacity. This works well for bursty workloads. But if every pod on a node bursts simultaneously, the node becomes overcommitted and pods get throttled or killed.

A common guideline: set the memory limit equal to or close to the request (memory overcommit causes OOMKills), and allow CPU limits to be 2-3x the request (CPU overcommit causes throttling, which is recoverable).
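Applying that guideline to the api-server deployment above, the resources block would look something like this (the exact numbers are illustrative):

```yaml
resources:
  requests:
    cpu: "200m"
    memory: "512Mi"
  limits:
    cpu: "500m"      # ~2.5x the request: CPU contention only throttles
    memory: "512Mi"  # equal to the request: memory overcommit risks OOMKills
```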

QoS classes

Kubernetes assigns a Quality of Service class to each pod based on its resource configuration. QoS determines eviction order when a node runs out of resources.

When a node runs low on resources, BestEffort pods are evicted first, then Burstable, and Guaranteed pods are evicted only as a last resort.

| QoS class | Condition | Eviction priority |
| --- | --- | --- |
| Guaranteed | Every container has equal requests and limits for CPU and memory | Last (lowest priority for eviction) |
| Burstable | At least one container has requests or limits set, but they are not all equal | Middle |
| BestEffort | No requests or limits set on any container | First (highest priority for eviction) |

Example of a Guaranteed pod:

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

Requests equal limits for both CPU and memory. This pod gets the strongest protection against eviction.
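For contrast, the api-server deployment earlier in this article is Burstable, because its requests and limits differ:

```yaml
resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "500m"      # higher than the request, so the pod is Burstable
    memory: "512Mi"
```

A pod whose containers set no resources at all would be BestEffort. You can check the assigned class with kubectl get pod <name> -o jsonpath='{.status.qosClass}'.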

LimitRange

A LimitRange sets default, minimum, and maximum resource values for a namespace. Pods created without resource specs get the defaults injected at admission instead of running unbounded.

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:
        cpu: "200m"
        memory: "256Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "4Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
    - type: Pod
      max:
        cpu: "4"
        memory: "8Gi"

If a developer deploys a pod without resource specs, it automatically gets 100m CPU and 128Mi memory as requests, and 200m CPU and 256Mi memory as limits.
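For example, this minimal pod (a hypothetical nginx pod, for illustration) would be mutated on admission:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: defaults-demo
  namespace: production
spec:
  containers:
    - name: app
      image: nginx:1.25
      # No resources block: the LimitRange above injects requests of
      # 100m CPU / 128Mi memory and limits of 200m CPU / 256Mi memory.
```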

ResourceQuota

ResourceQuota limits total resource consumption across all pods in a namespace. This prevents one team from consuming the entire cluster.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "100"
    services: "20"
    persistentvolumeclaims: "30"

When the quota is reached, new pods are rejected until existing pods are deleted or scaled down. Note that once a quota constrains requests or limits, every new pod in the namespace must declare those values, either explicitly or via a LimitRange default, or it is rejected outright.

Horizontal Pod Autoscaler (HPA)

HPA adjusts the number of pod replicas based on observed metrics. The most common trigger is CPU utilization.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120

The behavior section is important. Without it, HPA can thrash, scaling up and down rapidly. Under the hood, HPA computes desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric); the stabilization window makes it act on the most conservative recommendation seen over the window rather than on an instantaneous spike.

HPA with custom metrics

CPU is a coarse signal. Custom metrics give finer control. For example, scaling based on HTTP requests per second:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"

This requires a metrics adapter (like Prometheus Adapter) that exposes custom metrics to the Kubernetes metrics API.
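As a sketch of what that wiring looks like with Prometheus Adapter (rule syntax per its configuration format; the metric name and labels are assumptions about how your application is instrumented):

```yaml
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

This turns the raw http_requests_total counter into the http_requests_per_second rate that the HPA above consumes.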

HPA scaling simulation

Here is how replica count changes under a traffic spike:

At 10:00, three replicas handle baseline traffic. CPU spikes at 10:10 and HPA begins adding pods. By 10:25, it hits the maximum of 20. After traffic subsides, scale-down is slower due to the stabilization window. This asymmetry is intentional: scaling up fast protects availability while scaling down slowly prevents premature reduction.

Vertical Pod Autoscaler (VPA)

VPA adjusts CPU and memory requests for individual pods based on historical usage. It is useful when you do not know the right resource values.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "4Gi"

VPA modes:

| Mode | Behavior |
| --- | --- |
| Off | Only provides recommendations. Does not change pods. |
| Initial | Sets resources on pod creation. Does not update running pods. |
| Auto | Evicts and recreates pods with updated resources. |

Caution: VPA and HPA should not target the same metric. If HPA scales on CPU and VPA adjusts CPU requests, they can conflict. Use VPA for memory and HPA for CPU, or use VPA in Off mode for recommendations only.
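One way to enforce that split is the containerPolicies field controlledResources, which restricts what VPA manages (a sketch; HPA would scale on CPU as before):

```yaml
resourcePolicy:
  containerPolicies:
    - containerName: api
      controlledResources: ["memory"]  # VPA touches only memory requests
      minAllowed:
        memory: "128Mi"
      maxAllowed:
        memory: "4Gi"
```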

KEDA (Kubernetes Event-Driven Autoscaling)

KEDA extends HPA with event-driven triggers. It can scale based on queue depth, database row count, cron schedules, and dozens of other sources.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 1
  maxReplicaCount: 50
  pollingInterval: 15
  cooldownPeriod: 120
  triggers:
    - type: rabbitmq
      metadata:
        host: "amqp://rabbitmq.production.svc.cluster.local:5672"
        queueName: orders
        queueLength: "10"
    - type: cron
      metadata:
        timezone: "America/New_York"
        start: "0 8 * * 1-5"
        end: "0 20 * * 1-5"
        desiredReplicas: "5"

This configuration scales the order processor based on RabbitMQ queue depth (one replica per 10 messages) and maintains at least 5 replicas during business hours.

KEDA can also scale to zero, which HPA cannot. When the queue is empty outside business hours, the deployment scales to zero pods. When a message arrives, KEDA spins up a pod to process it.
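Scale-to-zero is controlled by the minimum replica count on the same ScaledObject:

```yaml
spec:
  minReplicaCount: 0   # allow KEDA to deactivate the workload entirely
  maxReplicaCount: 50
  cooldownPeriod: 120  # seconds after the last active trigger before scaling to zero
```

The cron trigger above still keeps the floor at 5 during business hours, since KEDA takes the highest value across active triggers.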

Monitoring resource usage

Before tuning, measure:

# Current resource usage per pod
kubectl top pods -n production

# Node-level resource usage
kubectl top nodes

# Check HPA status
kubectl get hpa -n production

# Detailed HPA events
kubectl describe hpa api-server-hpa -n production

# VPA recommendations
kubectl describe vpa api-server-vpa -n production

What comes next

Resource management and autoscaling keep your cluster efficient and responsive. This wraps up the core Kubernetes series. From here you can explore advanced topics: service meshes for traffic management, GitOps with ArgoCD for deployment automation, and observability with Prometheus and Grafana for monitoring everything we have built.
