Chaos engineering
In this series (11 parts)
- What SRE is
- Reliability fundamentals
- SLIs, SLOs, and error budgets in practice
- Toil reduction and automation
- Capacity planning
- Performance testing and load testing
- Chaos engineering
- Incident response in practice
- Postmortems and learning from failure
- Production readiness reviews
- Reliability patterns for services
Your load tests pass. Your capacity plan looks solid. But what happens when an availability zone goes down? When a downstream API starts returning 500s? When DNS resolution fails for 30 seconds?
Chaos engineering answers these questions by breaking things on purpose, in a controlled way, before they break on their own.
The origin story
In 2010, Netflix migrated to AWS and faced a new reality. Cloud infrastructure fails. Instances disappear. Networks partition. Instead of hoping failures would not happen, they built Chaos Monkey, a tool that randomly terminated production instances during business hours.
The idea was counterintuitive. Break things on purpose to build confidence that the system survives real failures. It worked. Netflix engineers designed services to tolerate instance loss because they knew Chaos Monkey would test them.
This evolved into the Simian Army: Chaos Monkey for instance failures, Latency Monkey for network delays, Chaos Gorilla for availability zone outages. The underlying principle became a discipline: chaos engineering.
Chaos engineering is hypothesis testing
Chaos engineering is not random destruction. It follows the scientific method.
You start with a hypothesis about how your system should behave during a specific failure. You design an experiment to test that hypothesis. You observe the results. You either confirm the hypothesis or discover a weakness to fix.
The difference between chaos engineering and breaking things randomly is the hypothesis. Without it, you are just causing outages.
The steady-state hypothesis
Before you can test resilience, you need to define what “normal” looks like. This is your steady-state hypothesis.
Pick metrics that represent normal system behavior:
- Request success rate above 99.9%
- P95 latency below 300ms
- Order processing rate between 50 and 70 per minute
- Error rate below 0.1%
Your experiment’s goal is to verify that these metrics stay within acceptable ranges during the injected failure. If they do, the system is resilient to that failure mode. If they do not, you found something to fix.
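One way to make the steady-state hypothesis concrete is to encode it as explicit numeric bounds that a script checks before, during, and after an experiment. This is a minimal sketch; the metric names and the way you fetch them are illustrative assumptions, not tied to any particular monitoring stack:

```python
# Sketch: encode the steady-state hypothesis as explicit bounds.
# Metric names and thresholds are illustrative assumptions.

STEADY_STATE = {
    "success_rate":      (0.999, 1.0),    # request success rate >= 99.9%
    "p95_latency_ms":    (0.0,   300.0),  # P95 latency below 300 ms
    "orders_per_minute": (50.0,  70.0),   # order processing rate
    "error_rate":        (0.0,   0.001),  # error rate below 0.1%
}

def check_steady_state(metrics: dict) -> list:
    """Return the names of metrics that are outside their steady-state range."""
    violations = []
    for name, (low, high) in STEADY_STATE.items():
        value = metrics.get(name)
        if value is None or not (low <= value <= high):
            violations.append(name)
    return violations

# A healthy snapshot stays within every range.
healthy = {"success_rate": 0.9995, "p95_latency_ms": 210.0,
           "orders_per_minute": 62.0, "error_rate": 0.0004}
print(check_steady_state(healthy))  # []
```

An empty violation list means the system is within steady state; any entries name exactly which part of the hypothesis failed.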
Experiment design
Every chaos experiment has five components.
1. Hypothesis
State what you expect to happen. Be specific.
“When we terminate 1 of 3 instances in the checkout service, the overall error rate will stay below 0.5% and P95 latency will stay below 500ms because the load balancer will route traffic to the remaining instances.”
A vague hypothesis like “the system should be fine” gives you nothing to measure against.
2. Method
Define exactly what failure you will inject. Common injection types:
| Failure type | What it tests | Example |
|---|---|---|
| Instance termination | Redundancy and failover | Kill 1 of N pods |
| Network latency | Timeout handling | Add 500ms delay to API calls |
| Network partition | Split-brain handling | Block traffic between services |
| Dependency failure | Circuit breaker behavior | Return 500s from a downstream API |
| Resource exhaustion | Graceful degradation | Fill disk to 95% capacity |
| DNS failure | Fallback resolution | Block DNS for a service |
3. Blast radius
Define the scope of impact. Start as small as possible.
- Smallest: a single pod in a non-critical service in staging
- Small: a single pod in production during low-traffic hours
- Medium: multiple pods in one availability zone
- Large: an entire availability zone (only after smaller experiments pass)
Never start with a large blast radius. Expand gradually as confidence grows.
4. Abort conditions
Define when to stop the experiment immediately. Examples:
- Error rate exceeds 5%
- P99 latency exceeds 10 seconds
- More than 100 failed requests in 60 seconds
- Any customer-facing impact detected
Abort conditions are non-negotiable. If any condition triggers, stop the experiment, restore the system, and analyze what happened.
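Because abort conditions must trigger a stop with no human deliberation, they work best as machine-checkable rules. A sketch of an evaluator, using the example thresholds above (metric names are illustrative):

```python
# Sketch: evaluate abort conditions against a live metrics snapshot.
# Thresholds mirror the examples above; metric names are illustrative.

ABORT_CONDITIONS = [
    ("error rate exceeds 5%",
     lambda m: m["error_rate"] > 0.05),
    ("P99 latency exceeds 10 seconds",
     lambda m: m["p99_latency_s"] > 10.0),
    ("more than 100 failed requests in 60 seconds",
     lambda m: m["failed_last_60s"] > 100),
]

def should_abort(metrics: dict):
    """Return the first triggered abort condition, or None to keep going."""
    for description, triggered in ABORT_CONDITIONS:
        if triggered(metrics):
            return description
    return None

snapshot = {"error_rate": 0.002, "p99_latency_s": 1.4, "failed_last_60s": 12}
print(should_abort(snapshot))  # None
```

In practice this check runs in a loop for the duration of the experiment, and a non-`None` result immediately halts the injection and triggers restoration.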
5. Observability
You cannot run chaos experiments without strong monitoring. Before running any experiment, verify that you can see:
- Real-time error rates per service
- Latency percentiles (P50, P95, P99)
- Resource utilization (CPU, memory, network)
- Dependency health checks
- Alert pipeline status (confirm that alerts actually fire)
If you cannot observe the impact, do not run the experiment. Blind chaos is just an outage.
The chaos experiment workflow
```mermaid
flowchart TD
    A[Define steady state] --> B[Form hypothesis]
    B --> C[Design experiment]
    C --> D{Blast radius acceptable?}
    D -->|No| C
    D -->|Yes| E[Run experiment]
    E --> F{Abort conditions triggered?}
    F -->|Yes| G[Stop and restore]
    G --> H[Analyze results]
    F -->|No| I{Experiment complete?}
    I -->|No| E
    I -->|Yes| H
    H --> J{Hypothesis confirmed?}
    J -->|Yes| K[Document and expand scope]
    J -->|No| L[Fix weakness]
    L --> M[Re-run experiment]
    K --> N[Design next experiment]
```
The chaos experiment lifecycle. Every experiment loops through hypothesis, execution, and analysis.
The workflow is iterative. A confirmed hypothesis leads to expanding scope or designing new experiments. A failed hypothesis leads to fixing the weakness and re-testing. Over time, you build a library of validated resilience properties.
Tools for chaos engineering
Several tools automate failure injection and experiment management.
Chaos Monkey
Netflix’s original tool. Randomly terminates instances in production. Simple, focused, and battle-tested. Best for teams that run on AWS and want to start with basic instance failure testing.
Chaos Monkey is intentionally limited in scope. It does one thing well: kill instances. For more complex failure modes, you need additional tools.
Gremlin
A commercial chaos engineering platform. It provides a wide range of failure injections: CPU stress, memory pressure, network manipulation, process killing, disk filling, and more. Gremlin includes a web UI for designing experiments, scheduling runs, and tracking results.
The main advantage is ease of use. Teams can run experiments without writing custom tooling. The trade-off is cost and the dependency on a third-party platform.
LitmusChaos
An open-source, Kubernetes-native chaos engineering framework. It defines experiments as ChaosEngine custom resources. Experiments run as Kubernetes jobs, making them easy to integrate with GitOps workflows.
LitmusChaos is the best fit for teams running on Kubernetes who want to keep everything in their existing toolchain. The experiment catalog covers pod failures, node drains, network chaos, and more.
Choosing a tool
| Factor | Chaos Monkey | Gremlin | LitmusChaos |
|---|---|---|---|
| Cost | Free | Paid | Free |
| Platform | AWS | Multi-cloud | Kubernetes |
| Failure types | Instance only | Comprehensive | Kubernetes-focused |
| Setup effort | Low | Low | Medium |
| Best for | Getting started | Enterprise teams | K8s-native teams |
Running your first experiment
Start simple. Here is a step-by-step guide for your first chaos experiment.
Step 1: Pick a non-critical service. Choose a service that is redundant and not on the critical payment path. A product catalog service with 3 replicas is a good candidate.
Step 2: Define steady state. Measure the service for 10 minutes under normal traffic. Record P95 latency, error rate, and throughput. These are your baseline numbers.
Step 3: Write the hypothesis. “Terminating 1 of 3 pods in the catalog service will cause a brief latency increase (under 1 second) but the error rate will stay below 0.5% and throughput will recover within 60 seconds.”
Step 4: Set abort conditions. Error rate above 2%. P99 latency above 5 seconds. Any 5xx errors lasting more than 2 minutes.
Step 5: Run during business hours. This sounds counterintuitive, but running during business hours means engineers are available to respond. Weeknight experiments with skeleton staff are riskier.
Step 6: Kill the pod.
```shell
kubectl delete pod catalog-service-abc123 --namespace production
```
Step 7: Observe. Watch your dashboards. Did the load balancer route traffic to the remaining pods? Did latency spike? Did Kubernetes schedule a replacement pod? How long did recovery take?
Step 8: Document results. Write up what happened. Did the hypothesis hold? If not, what failed? What needs to change?
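The "recovered within 60 seconds" part of the hypothesis is easy to leave fuzzy. One way to pin it down is to record throughput samples after the pod is killed and compute the time until throughput returns to (and stays within) a tolerance band around the baseline. A sketch, assuming samples are collected every few seconds:

```python
def recovery_time(samples, baseline, tolerance=0.1):
    """Seconds until throughput returns to within tolerance of baseline
    and stays there for the rest of the samples; None if it never does.

    samples: list of (seconds_since_kill, throughput) tuples, in order.
    """
    low = baseline * (1 - tolerance)
    recovered_at = None
    for t, value in samples:
        if value >= low:
            if recovered_at is None:
                recovered_at = t
        else:
            recovered_at = None  # dipped below the band again; not yet recovered
    return recovered_at

# Throughput dips after the pod is killed, then recovers.
samples = [(0, 60), (5, 41), (10, 44), (15, 55), (20, 58), (25, 61)]
print(recovery_time(samples, baseline=60))  # 15
```

Here the service recovered 15 seconds after the kill, well inside the 60-second bound in the hypothesis, so this part of the experiment passes.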
Building a chaos engineering culture
Tools are the easy part. Culture is harder.
Teams resist chaos engineering because it feels dangerous. Overcome this by starting small, celebrating what you learn (especially when experiments reveal weaknesses), and framing chaos engineering as proactive reliability investment.
A good progression:
- Run experiments in staging for the first month
- Move to production with single-pod experiments in month two
- Expand to multi-pod and dependency experiments in month three
- Schedule regular automated experiments by month four
Share results widely. When a chaos experiment finds a weakness that gets fixed before it causes a real outage, that is a win worth publicizing. It builds organizational support for continued investment.
Common failure injection patterns
Once you have the basics running, explore these patterns:
Clock skew. Set the system clock forward or backward. Tests time-dependent logic like cache TTLs, certificate validation, and session expiry.
Certificate expiry. Replace a valid TLS certificate with an expired one. Tests whether services fail open (insecure) or fail closed (unavailable).
Dependency slowdown. Add latency to a downstream service instead of killing it. A slow dependency is often worse than a dead one because it ties up threads and connections.
Data corruption. Inject malformed responses from a dependency. Tests input validation and error handling paths that rarely execute in normal operation.
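Clock skew in particular rewards making the time source injectable, so the same logic that runs in production can be tested under a skewed clock. A minimal sketch (the cache class is hypothetical, built only to illustrate the pattern):

```python
class TTLCache:
    """Cache whose time source is injected, so tests can skew the clock."""
    def __init__(self, ttl_s, clock):
        self.ttl_s = ttl_s
        self.clock = clock        # injected time source, e.g. time.monotonic
        self.store = {}           # key -> (value, stored_at)

    def put(self, key, value):
        self.store[key] = (value, self.clock())

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self.clock() - stored_at > self.ttl_s:
            return None           # expired (skew can also make the delta negative)
        return value

# A mutable fake clock stands in for the system clock.
now = [1000.0]
cache = TTLCache(ttl_s=60, clock=lambda: now[0])
cache.put("session", "abc")

now[0] += 30          # 30 seconds later: entry still valid
print(cache.get("session"))  # abc

now[0] += 3600        # clock jumps forward an hour: entry expires early
print(cache.get("session"))  # None
```

The same fake-clock technique works for rehearsing certificate validity windows and session expiry before injecting real clock skew on a host.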
What comes next
Chaos experiments reveal weaknesses. Some of those weaknesses will cause real incidents before you fix them. In Incident response in practice, you will learn how to detect, manage, and learn from incidents with clear roles, communication cadences, and blameless postmortems.