Resource management and autoscaling
In this series (14 parts)
- Why Kubernetes exists
- Kubernetes architecture
- Core Kubernetes objects
- Kubernetes networking
- Storage in Kubernetes
- Kubernetes configuration and secrets
- Resource management and autoscaling
- Kubernetes workload types
- Kubernetes observability
- Kubernetes security
- Helm and package management
- GitOps with ArgoCD
- Kubernetes cluster operations
- Service mesh concepts
Without resource limits, a single runaway pod can starve an entire node. Without autoscaling, you either over-provision (wasting money) or under-provision (degrading performance). Kubernetes gives you tools for both problems.
Requests vs limits
Every container can declare two resource boundaries.
Requests are what the scheduler uses to place pods. A pod requesting 200m CPU will only be scheduled on a node with at least 200m unreserved. Requests are a reservation, not a cap: the pod is guaranteed that much capacity but may use more if it is available.
Limits are the maximum a container can consume. If it exceeds its memory limit, it is killed (OOMKilled). If it tries to exceed its CPU limit, it is throttled.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: myregistry/api:3.2.0
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          ports:
            - containerPort: 8080
```
CPU is measured in millicores: 1000m is one full core, so 200m is 0.2 of a core. Memory uses binary units: Mi (mebibytes) and Gi (gibibytes).
The gap between requests and limits
Setting limits much higher than requests allows bursting. A pod requesting 200m CPU can burst to 500m if the node has spare capacity. This works well for bursty workloads. But if every pod on a node bursts simultaneously, the node becomes overcommitted and pods get throttled or killed.
A common guideline: set the memory limit equal to or close to the request (memory overcommit causes OOMKills), and allow CPU limits to be 2-3x the request (CPU overcommit causes throttling, which is recoverable).
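Applied to the earlier Deployment, that guideline might look like this (illustrative values, not a recommendation for any particular workload):

```yaml
resources:
  requests:
    cpu: "200m"
    memory: "512Mi"
  limits:
    cpu: "500m"      # ~2.5x the request: bursting allowed, throttling is recoverable
    memory: "512Mi"  # equal to the request: avoids OOMKills from memory overcommit
```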
QoS classes
Kubernetes assigns a Quality of Service class to each pod based on its resource configuration. QoS determines eviction order when a node runs out of resources.
```mermaid
graph TD
    E["Node under memory pressure"] --> BF["BestEffort pods evicted first"]
    BF --> B["Burstable pods evicted next"]
    B --> G["Guaranteed pods evicted last"]
```
When a node runs low on resources, BestEffort pods are evicted first, then Burstable, and Guaranteed pods are evicted only as a last resort.
| QoS class | Condition | Eviction priority |
|---|---|---|
| Guaranteed | Every container has equal requests and limits for CPU and memory | Last (lowest priority for eviction) |
| Burstable | At least one container has requests or limits set, but they are not equal | Middle |
| BestEffort | No requests or limits set on any container | First (highest priority for eviction) |
Example of a Guaranteed pod:
```yaml
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```
Requests equal limits for both CPU and memory. This pod gets the strongest protection against eviction.
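For contrast, a Burstable pod sets requests below limits (a sketch with illustrative values):

```yaml
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "400m"     # requests != limits, so this pod is Burstable
    memory: "256Mi"
```

Omitting requests and limits entirely yields BestEffort, the first class to be evicted under pressure.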
LimitRange
A LimitRange sets default and maximum resource values for a namespace. It prevents pods from being created without resource specs.
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:
        cpu: "200m"
        memory: "256Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "4Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
    - type: Pod
      max:
        cpu: "4"
        memory: "8Gi"
```
If a developer deploys a pod without resource specs, it automatically gets 100m CPU and 128Mi memory as requests, and 200m CPU and 256Mi memory as limits.
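For instance, a container submitted with no resources block at all is admitted as though it had declared:

```yaml
resources:
  requests:
    cpu: "100m"    # filled in from defaultRequest
    memory: "128Mi"
  limits:
    cpu: "200m"    # filled in from default
    memory: "256Mi"
```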
ResourceQuota
ResourceQuota limits total resource consumption across all pods in a namespace. This prevents one team from consuming the entire cluster.
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "100"
    services: "20"
    persistentvolumeclaims: "30"
```
When the quota is reached, new pods are rejected until existing pods are deleted or scaled down.
Horizontal Pod Autoscaler (HPA)
HPA adjusts the number of pod replicas based on observed metrics. The most common trigger is CPU utilization.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
```
The behavior section is important. Without it, HPA can thrash, scaling up and down rapidly. The stabilization window makes HPA consider all the recommendations computed over the preceding window and act on the most conservative one, so a brief spike or dip does not trigger an immediate change.
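The replica count itself comes from HPA's core scaling formula: the desired replica count is the current count scaled by the ratio of the observed metric to the target, rounded up. A quick sketch of the arithmetic:

```shell
# desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)
# Example: 3 replicas averaging 90% CPU against a 70% target:
awk 'BEGIN { r = 3 * 90 / 70; v = (r == int(r)) ? r : int(r) + 1; print v }'
# prints 4
```

Rounding up means HPA always errs on the side of more capacity when the metric sits above target.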
HPA with custom metrics
CPU is a coarse signal. Custom metrics give finer control. For example, scaling based on HTTP requests per second:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
```
This requires a metrics adapter (like Prometheus Adapter) that exposes custom metrics to the Kubernetes metrics API.
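As a sketch of what that adapter wiring can look like with Prometheus Adapter (the series query and metric name here are assumptions for illustration), a rule that derives `http_requests_per_second` from a request counter:

```yaml
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```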
HPA scaling simulation
Consider how the replica count changes under a traffic spike.
At 10:00, three replicas handle baseline traffic. CPU spikes at 10:10 and HPA begins adding pods. By 10:25, it hits the maximum of 20. After traffic subsides, scale-down is slower due to the stabilization window. This asymmetry is intentional: scaling up fast protects availability while scaling down slowly prevents premature reduction.
Vertical Pod Autoscaler (VPA)
VPA adjusts CPU and memory requests for individual pods based on historical usage. It is useful when you do not know the right resource values.
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "4Gi"
```
VPA modes:
| Mode | Behavior |
|---|---|
| Off | Only provides recommendations. Does not change pods. |
| Initial | Sets resources on pod creation. Does not update running pods. |
| Auto | Evicts and recreates pods with updated resources. |
Caution: VPA and HPA should not target the same metric. If HPA scales on CPU and VPA adjusts CPU requests, they can conflict. Use VPA for memory and HPA for CPU, or use VPA in Off mode for recommendations only.
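One way to enforce that split is VPA's `controlledResources` field, restricting VPA to memory while HPA owns CPU (a sketch based on the earlier VPA spec):

```yaml
resourcePolicy:
  containerPolicies:
    - containerName: api
      controlledResources: ["memory"]  # VPA leaves CPU requests alone
```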
KEDA (Kubernetes Event-Driven Autoscaling)
KEDA extends HPA with event-driven triggers. It can scale based on queue depth, database row count, cron schedules, and dozens of other sources.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 1
  maxReplicaCount: 50
  pollingInterval: 15
  cooldownPeriod: 120
  triggers:
    - type: rabbitmq
      metadata:
        host: "amqp://rabbitmq.production.svc.cluster.local:5672"
        queueName: orders
        queueLength: "10"
    - type: cron
      metadata:
        timezone: "America/New_York"
        start: "0 8 * * 1-5"
        end: "0 20 * * 1-5"
        desiredReplicas: "5"
```
This configuration scales the order processor based on RabbitMQ queue depth (one replica per 10 messages) and maintains at least 5 replicas during business hours.
KEDA can also scale to zero, which HPA cannot. When the queue is empty outside business hours, the deployment scales to zero pods. When a message arrives, KEDA spins up a pod to process it.
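Enabling that behavior is a matter of lowering the floor in the ScaledObject spec:

```yaml
minReplicaCount: 0   # drop to zero pods when every trigger is idle
maxReplicaCount: 50
```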
Monitoring resource usage
Before tuning, measure:
```shell
# Current resource usage per pod
kubectl top pods -n production

# Node-level resource usage
kubectl top nodes

# Check HPA status
kubectl get hpa -n production

# Detailed HPA events
kubectl describe hpa api-server-hpa -n production

# VPA recommendations
kubectl describe vpa api-server-vpa -n production
```
What comes next
Resource management and autoscaling keep your cluster efficient and responsive. This wraps up the core Kubernetes series. From here you can explore advanced topics: service meshes for traffic management, GitOps with ArgoCD for deployment automation, and observability with Prometheus and Grafana for monitoring everything we have built.