Chaos engineering
In this series (11 parts)
- What SRE is
- Reliability fundamentals
- SLIs, SLOs, and error budgets in practice
- Toil reduction and automation
- Capacity planning
- Performance testing and load testing
- Chaos engineering
- Incident response in practice
- Postmortems and learning from failure
- Production readiness reviews
- Reliability patterns for services
Your load tests pass. Your capacity plan looks solid. But what happens when an availability zone goes down? When a downstream API starts returning 500s? When DNS resolution fails for 30 seconds?
Chaos engineering answers these questions by breaking things on purpose, in a controlled way, before they break on their own.
The origin story
In 2010, Netflix migrated to AWS and faced a new reality. Cloud infrastructure fails. Instances disappear. Networks partition. Instead of hoping failures would not happen, they built Chaos Monkey, a tool that randomly terminated production instances during business hours.
The idea was counterintuitive. Break things on purpose to build confidence that the system survives real failures. It worked. Netflix engineers designed services to tolerate instance loss because they knew Chaos Monkey would test them.
This evolved into the Simian Army: Chaos Monkey for instance failures, Latency Monkey for network delays, Chaos Gorilla for availability zone outages. The underlying principle became a discipline: chaos engineering.
Chaos engineering is hypothesis testing
Chaos engineering is not random destruction. It follows the scientific method.
You start with a hypothesis about how your system should behave during a specific failure. You design an experiment to test that hypothesis. You observe the results. You either confirm the hypothesis or discover a weakness to fix.
The difference between chaos engineering and breaking things randomly is the hypothesis. Without it, you are just causing outages.
The steady-state hypothesis
Before you can test resilience, you need to define what “normal” looks like. This is your steady-state hypothesis.
Pick metrics that represent normal system behavior:
- Request success rate above 99.9%
- P95 latency below 300ms
- Order processing rate between 50 and 70 per minute
- Error rate below 0.1%
Your experiment’s goal is to verify that these metrics stay within acceptable ranges during the injected failure. If they do, the system is resilient to that failure mode. If they do not, you found something to fix.
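One way to make the steady-state hypothesis concrete is to encode it as explicit numeric bounds that a script checks before, during, and after an experiment. This is a minimal sketch; the metric names and the way you fetch them are illustrative assumptions, not tied to any particular monitoring stack:

```python
# Sketch: encode the steady-state hypothesis as explicit bounds.
# Metric names and thresholds are illustrative assumptions.

STEADY_STATE = {
    "success_rate":      (0.999, 1.0),    # request success rate >= 99.9%
    "p95_latency_ms":    (0.0,   300.0),  # P95 latency below 300 ms
    "orders_per_minute": (50.0,  70.0),   # order processing rate
    "error_rate":        (0.0,   0.001),  # error rate below 0.1%
}

def check_steady_state(metrics: dict) -> list:
    """Return the names of metrics that are outside their steady-state range."""
    violations = []
    for name, (low, high) in STEADY_STATE.items():
        value = metrics.get(name)
        if value is None or not (low <= value <= high):
            violations.append(name)
    return violations

# A healthy snapshot stays within every range.
healthy = {"success_rate": 0.9995, "p95_latency_ms": 210.0,
           "orders_per_minute": 62.0, "error_rate": 0.0004}
print(check_steady_state(healthy))  # []
```

An empty violation list means the system is within steady state; any entries name exactly which part of the hypothesis failed.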
Experiment design
Every chaos experiment has five components.
1. Hypothesis
State what you expect to happen. Be specific.
“When we terminate 1 of 3 instances in the checkout service, the overall error rate will stay below 0.5% and P95 latency will stay below 500ms because the load balancer will route traffic to the remaining instances.”
A vague hypothesis like “the system should be fine” gives you nothing to measure against.
2. Method
Define exactly what failure you will inject. Common injection types:
| Failure type | What it tests | Example |
|---|---|---|
| Instance termination | Redundancy and failover | Kill 1 of N pods |
| Network latency | Timeout handling | Add 500ms delay to API calls |
| Network partition | Split-brain handling | Block traffic between services |
| Dependency failure | Circuit breaker behavior | Return 500s from a downstream API |
| Resource exhaustion | Graceful degradation | Fill disk to 95% capacity |
| DNS failure | Fallback resolution | Block DNS for a service |
3. Blast radius
Define the scope of impact. Start as small as possible.
- Smallest: a single pod in a non-critical service in staging
- Small: a single pod in production during low-traffic hours
- Medium: multiple pods in one availability zone
- Large: an entire availability zone (only after smaller experiments pass)
Never start with a large blast radius. Expand gradually as confidence grows.
4. Abort conditions
Define when to stop the experiment immediately. Examples:
- Error rate exceeds 5%
- P99 latency exceeds 10 seconds
- More than 100 failed requests in 60 seconds
- Any customer-facing impact detected
Abort conditions are non-negotiable. If any condition triggers, stop the experiment, restore the system, and analyze what happened.
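Because abort conditions must trigger a stop with no human deliberation, they work best as machine-checkable rules. A sketch of an evaluator, using the example thresholds above (metric names are illustrative):

```python
# Sketch: evaluate abort conditions against a live metrics snapshot.
# Thresholds mirror the examples above; metric names are illustrative.

ABORT_CONDITIONS = [
    ("error rate exceeds 5%",
     lambda m: m["error_rate"] > 0.05),
    ("P99 latency exceeds 10 seconds",
     lambda m: m["p99_latency_s"] > 10.0),
    ("more than 100 failed requests in 60 seconds",
     lambda m: m["failed_last_60s"] > 100),
]

def should_abort(metrics: dict):
    """Return the first triggered abort condition, or None to keep going."""
    for description, triggered in ABORT_CONDITIONS:
        if triggered(metrics):
            return description
    return None

snapshot = {"error_rate": 0.002, "p99_latency_s": 1.4, "failed_last_60s": 12}
print(should_abort(snapshot))  # None
```

In practice this check runs in a loop for the duration of the experiment, and a non-`None` result immediately halts the injection and triggers restoration.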
5. Observability
You cannot run chaos experiments without strong monitoring. Before running any experiment, verify that you can see:
- Real-time error rates per service
- Latency percentiles (P50, P95, P99)
- Resource utilization (CPU, memory, network)
- Dependency health checks
- Alert pipeline status (confirm that alerts actually fire)
If you cannot observe the impact, do not run the experiment. Blind chaos is just an outage.
The chaos experiment workflow
```mermaid
flowchart TD
    A[Define steady state] --> B[Form hypothesis]
    B --> C[Design experiment]
    C --> D{Blast radius acceptable?}
    D -->|No| C
    D -->|Yes| E[Run experiment]
    E --> F{Abort conditions triggered?}
    F -->|Yes| G[Stop and restore]
    G --> H[Analyze results]
    F -->|No| I{Experiment complete?}
    I -->|No| E
    I -->|Yes| H
    H --> J{Hypothesis confirmed?}
    J -->|Yes| K[Document and expand scope]
    J -->|No| L[Fix weakness]
    L --> M[Re-run experiment]
    K --> N[Design next experiment]
```
The chaos experiment lifecycle. Every experiment loops through hypothesis, execution, and analysis.
The workflow is iterative. A confirmed hypothesis leads to expanding scope or designing new experiments. A failed hypothesis leads to fixing the weakness and re-testing. Over time, you build a library of validated resilience properties.
Tools for chaos engineering
Several tools automate failure injection and experiment management.
Chaos Monkey
Netflix’s original tool. Randomly terminates instances in production. Simple, focused, and battle-tested. Best for teams that run on AWS and want to start with basic instance failure testing.
Chaos Monkey is intentionally limited in scope. It does one thing well: kill instances. For more complex failure modes, you need additional tools.
Gremlin
A commercial chaos engineering platform. It provides a wide range of failure injections: CPU stress, memory pressure, network manipulation, process killing, disk filling, and more. Gremlin includes a web UI for designing experiments, scheduling runs, and tracking results.
The main advantage is ease of use. Teams can run experiments without writing custom tooling. The trade-off is cost and the dependency on a third-party platform.
LitmusChaos
An open-source, Kubernetes-native chaos engineering framework. It defines experiments as ChaosEngine custom resources. Experiments run as Kubernetes jobs, making them easy to integrate with GitOps workflows.
LitmusChaos is the best fit for teams running on Kubernetes who want to keep everything in their existing toolchain. The experiment catalog covers pod failures, node drains, network chaos, and more.
Choosing a tool
| Factor | Chaos Monkey | Gremlin | LitmusChaos |
|---|---|---|---|
| Cost | Free | Paid | Free |
| Platform | AWS | Multi-cloud | Kubernetes |
| Failure types | Instance only | Comprehensive | Kubernetes-focused |
| Setup effort | Low | Low | Medium |
| Best for | Getting started | Enterprise teams | K8s-native teams |
Running your first experiment
Start simple. Here is a step-by-step guide for your first chaos experiment.
Step 1: Pick a non-critical service. Choose a service that is redundant and not on the critical payment path. A product catalog service with 3 replicas is a good candidate.
Step 2: Define steady state. Measure the service for 10 minutes under normal traffic. Record P95 latency, error rate, and throughput. These are your baseline numbers.
Step 3: Write the hypothesis. “Terminating 1 of 3 pods in the catalog service will cause a brief latency increase (under 1 second) but the error rate will stay below 0.5% and throughput will recover within 60 seconds.”
Step 4: Set abort conditions. Error rate above 2%. P99 latency above 5 seconds. Any 5xx errors lasting more than 2 minutes.
Step 5: Run during business hours. This sounds counterintuitive, but running during business hours means engineers are available to respond. Weeknight experiments with skeleton staff are riskier.
Step 6: Kill the pod.
```shell
kubectl delete pod catalog-service-abc123 --namespace production
```
Step 7: Observe. Watch your dashboards. Did the load balancer route traffic to the remaining pods? Did latency spike? Did Kubernetes schedule a replacement pod? How long did recovery take?
Step 8: Document results. Write up what happened. Did the hypothesis hold? If not, what failed? What needs to change?
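The "recovered within 60 seconds" part of the hypothesis is easy to leave fuzzy. One way to pin it down is to record throughput samples after the pod is killed and compute the time until throughput returns to (and stays within) a tolerance band around the baseline. A sketch, assuming samples are collected every few seconds:

```python
def recovery_time(samples, baseline, tolerance=0.1):
    """Seconds until throughput returns to within tolerance of baseline
    and stays there for the rest of the samples; None if it never does.

    samples: list of (seconds_since_kill, throughput) tuples, in order.
    """
    low = baseline * (1 - tolerance)
    recovered_at = None
    for t, value in samples:
        if value >= low:
            if recovered_at is None:
                recovered_at = t
        else:
            recovered_at = None  # dipped below the band again; not yet recovered
    return recovered_at

# Throughput dips after the pod is killed, then recovers.
samples = [(0, 60), (5, 41), (10, 44), (15, 55), (20, 58), (25, 61)]
print(recovery_time(samples, baseline=60))  # 15
```

Here the service recovered 15 seconds after the kill, well inside the 60-second bound in the hypothesis, so this part of the experiment passes.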
Building a chaos engineering culture
Tools are the easy part. Culture is harder.
Teams resist chaos engineering because it feels dangerous. Overcome this by starting small, celebrating what you learn (especially when experiments reveal weaknesses), and framing chaos engineering as proactive reliability investment.
A good progression:
- Run experiments in staging for the first month
- Move to production with single-pod experiments in month two
- Expand to multi-pod and dependency experiments in month three
- Schedule regular automated experiments by month four
Share results widely. When a chaos experiment finds a weakness that gets fixed before it causes a real outage, that is a win worth publicizing. It builds organizational support for continued investment.
Common failure injection patterns
Once you have the basics running, explore these patterns:
Clock skew. Set the system clock forward or backward. Tests time-dependent logic like cache TTLs, certificate validation, and session expiry.
Certificate expiry. Replace a valid TLS certificate with an expired one. Tests whether services fail open (insecure) or fail closed (unavailable).
Dependency slowdown. Add latency to a downstream service instead of killing it. A slow dependency is often worse than a dead one because it ties up threads and connections.
Data corruption. Inject malformed responses from a dependency. Tests input validation and error handling paths that rarely execute in normal operation.
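Clock skew in particular rewards making the time source injectable, so the same logic that runs in production can be tested under a skewed clock. A minimal sketch (the cache class is hypothetical, built only to illustrate the pattern):

```python
class TTLCache:
    """Cache whose time source is injected, so tests can skew the clock."""
    def __init__(self, ttl_s, clock):
        self.ttl_s = ttl_s
        self.clock = clock        # injected time source, e.g. time.monotonic
        self.store = {}           # key -> (value, stored_at)

    def put(self, key, value):
        self.store[key] = (value, self.clock())

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self.clock() - stored_at > self.ttl_s:
            return None           # expired (skew can also make the delta negative)
        return value

# A mutable fake clock stands in for the system clock.
now = [1000.0]
cache = TTLCache(ttl_s=60, clock=lambda: now[0])
cache.put("session", "abc")

now[0] += 30          # 30 seconds later: entry still valid
print(cache.get("session"))  # abc

now[0] += 3600        # clock jumps forward an hour: entry expires early
print(cache.get("session"))  # None
```

The same fake-clock technique works for rehearsing certificate validity windows and session expiry before injecting real clock skew on a host.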
What comes next
Chaos experiments reveal weaknesses. Some of those weaknesses will cause real incidents before you fix them. In Incident response in practice, you will learn how to detect, manage, and learn from incidents with clear roles, communication cadences, and blameless postmortems.