Performance testing and load testing

In this series (11 parts)
  1. What SRE is
  2. Reliability fundamentals
  3. SLIs, SLOs, and error budgets in practice
  4. Toil reduction and automation
  5. Capacity planning
  6. Performance testing and load testing
  7. Chaos engineering
  8. Incident response in practice
  9. Postmortems and learning from failure
  10. Production readiness reviews
  11. Reliability patterns for services

You built a capacity plan. It says you need to handle 3,000 requests per second. But can your system actually do that? Capacity models are predictions. Load tests are proof.

Performance testing puts your system under controlled pressure and measures how it responds. Without it, your first real load test happens in production. During an outage. With users watching.

Types of performance tests

Four types of tests cover different failure modes. Each answers a different question about your system.

graph LR
  A[Performance Testing] --> B[Load Test]
  A --> C[Stress Test]
  A --> D[Soak Test]
  A --> E[Spike Test]
  B --> B1["Steady expected traffic<br/>Duration: 30-60 min"]
  C --> C1["Beyond capacity limits<br/>Duration: until failure"]
  D --> D1["Normal load, long duration<br/>Duration: 4-24 hours"]
  E --> E1["Sudden traffic burst<br/>Duration: 5-10 min spikes"]

Four types of performance tests, each targeting a different failure mode.

Load testing

A load test simulates your expected peak traffic and sustains it for 30 to 60 minutes. The goal is to confirm your system handles normal peak load without degradation.

You are not trying to break anything. You are validating that response times stay within SLOs, error rates remain below thresholds, and resource utilization stays in safe ranges.

If your capacity plan says peak traffic is 2,000 RPS, run a load test at 2,000 RPS for 45 minutes. If P99 latency stays under your SLO target and the error rate stays below 0.1%, you pass.
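In k6 those pass criteria can be expressed directly as test configuration. A sketch of what that might look like (the 300ms P99 target is a placeholder; substitute your own SLO):

```javascript
// Hypothetical configuration for a 2,000 RPS load test.
// In a k6 script this object would be exported as `options`.
// The 300ms P99 target is a placeholder -- use your own SLO.
const options = {
  scenarios: {
    steady_peak: {
      executor: 'constant-arrival-rate',
      rate: 2000,           // 2,000 requests per second
      timeUnit: '1s',
      duration: '45m',      // sustain for 45 minutes
      preAllocatedVUs: 500, // VUs available to maintain the rate
    },
  },
  thresholds: {
    http_req_duration: ['p(99)<300'], // P99 under the SLO target
    http_req_failed: ['rate<0.001'],  // error rate below 0.1%
  },
};
```

The constant-arrival-rate executor holds throughput fixed regardless of how slow the system gets, which is what a capacity-validation test needs.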

Stress testing

A stress test pushes traffic beyond your known limits. Start at your expected peak and ramp up until the system degrades. The goal is to find the breaking point and understand how the system fails.

Good systems degrade gracefully. They shed load, return errors to new requests while completing in-flight requests, and recover automatically when traffic drops. Bad systems crash hard and need manual intervention.

Stress tests answer: what is the absolute ceiling? What breaks first? How does the system recover?
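Graceful degradation under overload is usually implemented as load shedding. A minimal sketch (the in-flight limit of 100 is an assumed value; real services derive it from stress-test results):

```javascript
// Minimal load-shedding sketch: reject new work once in-flight
// requests exceed a limit found through stress testing.
// MAX_IN_FLIGHT of 100 is an assumed placeholder value.
const MAX_IN_FLIGHT = 100;
let inFlight = 0;

async function handleWithShedding(handler) {
  if (inFlight >= MAX_IN_FLIGHT) {
    // Shed load: fail fast instead of queueing until timeout.
    return { status: 503, body: 'overloaded, retry later' };
  }
  inFlight += 1;
  try {
    return await handler(); // complete in-flight work normally
  } finally {
    inFlight -= 1;
  }
}
```

Fast 503s under overload let clients back off and retry, and the service recovers on its own once traffic drops, which is exactly the behavior a stress test should confirm.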

Soak testing

A soak test runs normal traffic for an extended period, typically 4 to 24 hours. The goal is to find slow leaks. Memory leaks that accumulate over hours. Connection pool exhaustion that takes time to manifest. Disk usage that grows gradually.

These bugs hide during short tests. A service that looks healthy at 30 minutes might OOM-kill at hour 6 because of a memory leak in a rarely-triggered code path.

Spike testing

A spike test sends sudden traffic bursts. Traffic jumps from baseline to 5x or 10x in seconds, holds for a few minutes, then drops back. The goal is to test auto-scaling response time and queue behavior under sudden load.

Spike tests reveal whether your auto-scaler reacts fast enough and whether request queues handle the burst without dropping messages.
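A spike profile might look like the following in k6's staged configuration (the 50-to-500 VU jump is an assumed example; scale it to your own baseline):

```javascript
// Hypothetical k6 stage profile for a spike test: a baseline of
// 50 VUs, a sudden 10x burst, then back to baseline.
// In a k6 script this array goes inside `options.stages`.
const spikeStages = [
  { duration: '2m', target: 50 },   // establish baseline
  { duration: '10s', target: 500 }, // 10x burst within seconds
  { duration: '5m', target: 500 },  // hold the spike
  { duration: '10s', target: 50 },  // drop back
  { duration: '2m', target: 50 },   // observe recovery
];
```

The short ramp durations are the point: a 10-second jump gives the auto-scaler no warning, which is exactly the condition you want to test.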

Key metrics to watch

During any performance test, track these metrics:

| Metric | What it tells you | Alert threshold |
| --- | --- | --- |
| P50 latency | Typical user experience | Service-specific |
| P95 latency | Experience for slower requests | 2x to 3x of P50 |
| P99 latency | Worst-case user experience | SLO boundary |
| Error rate | System reliability under load | Below 0.1% for load tests |
| Throughput (RPS) | Actual requests processed | Match target load |
| CPU utilization | Compute saturation | Below 70% at peak |
| Memory utilization | Memory pressure | Below 80% at peak |
| DB connection pool | Database bottleneck | Below 80% of max connections |

P50 tells you the median. P99 tells you the tail. The gap between them reveals how consistent your system is under load. A service with P50 of 50ms and P99 of 2,000ms has a serious consistency problem even if the median looks fine.
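To make the gap concrete, here is one way to compute percentiles from raw latency samples (a nearest-rank sketch; load testing tools compute this internally, often with streaming estimators):

```javascript
// Nearest-rank percentile over raw latency samples (milliseconds).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// 97 fast requests plus a few slow outliers: the median looks
// fine while the tail tells a very different story.
const latencies = Array.from({ length: 97 }, () => 50)
  .concat([1800, 1900, 2000]);

console.log(percentile(latencies, 50)); // 50
console.log(percentile(latencies, 99)); // 1900
```

Three slow requests out of a hundred barely move the median, but they define the P99, which is why tail latency needs its own threshold.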

Practical k6 example

k6 is an open-source load testing tool built for developer workflows. Tests are written in JavaScript, run from the command line, and integrate with CI/CD pipelines.

Here is a realistic user journey test that simulates a user logging in, browsing products, and completing a checkout:

import http from 'k6/http';
import { check, sleep, group } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // ramp up to 100 users
    { duration: '5m', target: 100 },   // hold at 100 users
    { duration: '2m', target: 300 },   // ramp to 300 users
    { duration: '5m', target: 300 },   // hold at 300 users
    { duration: '2m', target: 0 },     // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500', 'p(99)<1500'],
    http_req_failed: ['rate<0.01'],
  },
};

const BASE_URL = __ENV.BASE_URL || 'https://staging.example.com';

export default function () {
  group('01_login', function () {
    const loginRes = http.post(`${BASE_URL}/api/auth/login`, JSON.stringify({
      email: 'loadtest@example.com',
      password: 'test-password',
    }), { headers: { 'Content-Type': 'application/json' } });

    check(loginRes, {
      'login returns 200': (r) => r.status === 200,
      'login returns token': (r) => r.json('token') !== undefined,
    });

    sleep(1);
  });

  group('02_browse_products', function () {
    const catalogRes = http.get(`${BASE_URL}/api/products?page=1&limit=20`);

    check(catalogRes, {
      'catalog returns 200': (r) => r.status === 200,
      'catalog has products': (r) => r.json('items').length > 0,
    });

    sleep(2);

    const productRes = http.get(`${BASE_URL}/api/products/prod-001`);

    check(productRes, {
      'product detail returns 200': (r) => r.status === 200,
    });

    sleep(1);
  });

  group('03_checkout', function () {
    const cartRes = http.post(`${BASE_URL}/api/cart/items`, JSON.stringify({
      productId: 'prod-001',
      quantity: 1,
    }), { headers: { 'Content-Type': 'application/json' } });

    check(cartRes, {
      'add to cart returns 200': (r) => r.status === 200,
    });

    sleep(1);

    const checkoutRes = http.post(`${BASE_URL}/api/checkout`, JSON.stringify({
      paymentMethod: 'card',
      shippingAddress: { zip: '94105' },
    }), { headers: { 'Content-Type': 'application/json' } });

    check(checkoutRes, {
      'checkout returns 200': (r) => r.status === 200,
      'order id returned': (r) => r.json('orderId') !== undefined,
    });

    sleep(2);
  });
}

Run it with:

k6 run --env BASE_URL=https://staging.example.com load-test.js

The stages configuration creates a stepped load pattern. It ramps up gradually, holds steady, pushes higher, holds again, then ramps down. This mimics real traffic patterns better than a flat load.

The thresholds block defines pass/fail criteria. If P95 latency exceeds 500ms or the error rate exceeds 1%, the test fails. This integrates cleanly with CI pipelines that gate deploys on test results.

Locust for Python teams

Locust is a Python-based load testing framework. If your team writes Python and wants to define user behavior in a familiar language, Locust is a solid choice.

Locust uses a concept of “users” that each run a task set. It provides a web UI for real-time monitoring and supports distributed load generation across multiple machines.

The trade-off compared to k6: Locust generates less raw throughput per machine because Python’s GIL limits concurrency. For high-RPS tests, you need more Locust worker nodes. k6, written in Go, handles higher throughput per instance.

Identifying bottlenecks from test results

A load test that fails is more valuable than one that passes. Failures point directly at what to fix.

CPU saturation

Symptoms: latency climbs linearly with load, CPU utilization hits 90%+ on application servers. Fix by optimizing hot code paths, adding caching, or scaling horizontally.

Memory pressure

Symptoms: increasing garbage collection pauses, eventually OOM kills. Common in soak tests. Fix by profiling memory allocation, fixing leaks, and right-sizing instance memory.

Database connections

Symptoms: sudden latency spike when connection pool exhausts, errors about connection timeouts. Fix by tuning pool size, adding read replicas for read-heavy workloads, or optimizing slow queries.
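For pool sizing, a widely cited starting point (from the PostgreSQL community and HikariCP's pool-sizing guidance) is connections = (core count × 2) + effective spindle count. A small helper to make the arithmetic explicit; treat it as a starting value to validate with a load test, not a law:

```javascript
// Rule-of-thumb connection pool sizing:
// (cores * 2) + effective spindle count.
// A starting point to validate under load, not a fixed rule.
function suggestedPoolSize(coreCount, effectiveSpindleCount) {
  return coreCount * 2 + effectiveSpindleCount;
}

// An 8-core database server on SSD storage (spindle count ~1):
console.log(suggestedPoolSize(8, 1)); // 17
```

Counterintuitively, the formula often suggests a much smaller pool than teams run with; oversized pools add contention inside the database rather than throughput.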

Network throughput

Symptoms: latency increases but CPU and memory remain low. Often caused by bandwidth limits, DNS resolution delays, or TLS handshake overhead. Fix by enabling connection reuse, moving to HTTP/2, or increasing network allocation.

Testing without hitting production

Never run load tests against production unless you have traffic isolation and a kill switch. Here is a safer progression.

Staging environments are the starting point. Mirror your production infrastructure at a smaller scale. Validate functionality and relative performance. Staging results do not predict exact production numbers because the infrastructure differs, but they catch regressions.

Traffic shadowing (also called traffic mirroring) copies production traffic to a test environment. The test environment processes real request patterns without serving responses to users. This gives you realistic load profiles without risk.

Isolated production testing is the most accurate but riskiest approach. Some teams carve out a subset of production infrastructure, route synthetic traffic to it, and monitor for degradation. This requires strong traffic isolation and automatic rollback.

Start with staging. Graduate to shadowing. Only run production tests after you have solid observability and automated safety rails.

Integrating load tests into CI/CD

Load tests belong in your deployment pipeline, not just in quarterly reviews. A lightweight load test after every deploy catches performance regressions before they reach users.

Keep CI load tests short. A 5-minute test at moderate load with strict latency thresholds catches most regressions. Reserve longer tests for scheduled runs.

# Example CI step
performance-test:
  stage: verify
  script:
    - k6 run --duration 5m --vus 50 smoke-test.js
  allow_failure: false

The --vus 50 flag runs 50 virtual users. Enough to detect regressions, not enough to require dedicated load generation infrastructure.

What comes next

Load tests validate that your system handles expected traffic. But what happens when things go wrong in unexpected ways? In Chaos engineering, you will learn how to intentionally inject failures to test your system’s resilience before real failures find the weak spots for you.
