
Distributed tracing

In this series (10 parts)
  1. The three pillars of observability
  2. Structured logging
  3. Metrics and Prometheus
  4. Grafana and dashboards
  5. Distributed tracing
  6. Log aggregation pipelines
  7. Alerting design
  8. SLIs, SLOs, and error budgets
  9. Real User Monitoring and synthetic testing
  10. On-call tooling and runbooks

A monolith has one call stack. When something is slow, a profiler shows you exactly where. Microservices scatter a single request across dozens of processes, each with its own logs and metrics. Without distributed tracing, you are left guessing which service caused the slowdown.

Distributed tracing reconstructs the full journey of a request by propagating a trace context across service boundaries. Each step is a span. Spans assemble into a trace tree that shows timing, dependencies, and errors in one view.

Core concepts

Trace

A trace represents one end-to-end request. It has a globally unique trace_id generated at the entry point (usually the API gateway or load balancer).

Span

A span represents a single unit of work within a trace. Each span records:

  • trace_id: links to the parent trace.
  • span_id: unique identifier for this span.
  • parent_span_id: the span that created this one.
  • operation: a human-readable name like POST /checkout or SELECT orders.
  • start_time and duration: when the work started and how long it took.
  • status: OK, ERROR, or UNSET.
  • attributes: key-value metadata like http.method, db.statement, rpc.service.
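Concretely, a finished span carries a record along these lines. This is a Python sketch with illustrative values; the actual wire format (OTLP) is protobuf, but the logical fields are the same:

```python
# An illustrative span record. Field names follow the concepts above;
# the timestamp and IDs are made up for the example.
span = {
    "trace_id": "7f2a1b3c4d5e6f00a1b2c3d4e5f60718",  # 16 bytes, hex
    "span_id": "3c4d5e6f00a1b2c3",                    # 8 bytes, hex
    "parent_span_id": "1b3c4d5e6f00a1b2",
    "operation": "POST /checkout",
    "start_time": "2024-01-15T10:00:00.000Z",
    "duration_ms": 540,
    "status": "OK",
    "attributes": {"http.method": "POST", "http.status_code": 200},
}
```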

Context propagation

For spans across different services to belong to the same trace, the trace context must travel with the request. The W3C Trace Context standard defines two HTTP headers:

traceparent: 00-<trace_id>-<parent_span_id>-<flags>
tracestate: vendor1=value1,vendor2=value2

Every HTTP client and server in the request chain must read incoming context, create a child span, and forward the updated context to the next service.
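The traceparent format is simple enough to sketch with the standard library alone. In practice you would use your tracing library's propagator rather than hand-rolling this; the sketch only shows what travels on the wire:

```python
import secrets

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-trace_id-parent_id-flags."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "sampled": flags == "01"}

# A service receiving this header creates a child span with a fresh span_id
# and forwards 00-<same trace_id>-<new span_id>-<flags> downstream.
trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex chars
span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex chars
header = make_traceparent(trace_id, span_id)
```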

Why logs alone fail

Consider a checkout flow that calls five services:

sequenceDiagram
  participant User
  participant Gateway
  participant Auth
  participant Cart
  participant Inventory
  participant Payment

  User->>Gateway: POST /checkout
  Gateway->>Auth: Validate token (15ms)
  Auth-->>Gateway: OK
  Gateway->>Cart: Get cart items (25ms)
  Cart->>Inventory: Reserve stock (120ms)
  Inventory-->>Cart: Reserved
  Cart-->>Gateway: Items ready
  Gateway->>Payment: Charge card (380ms)
  Payment-->>Gateway: Charged
  Gateway-->>User: 200 OK (total: 540ms)

A checkout request spanning five services. The payment service dominates total latency.

Logs from each service show their local processing time. But no single service log tells you that the payment call took 380ms out of 540ms total. Only a trace view assembles the full picture.

With a trace, you click on the slow request and immediately see:

Trace: 7f2a1b3c4d5e6f00
└── gateway: POST /checkout          0ms - 540ms
    ├── auth: validate-token         5ms - 20ms
    ├── cart: get-items              25ms - 150ms
    │   └── inventory: reserve       30ms - 150ms
    └── payment: charge-card         160ms - 540ms   [SLOW]
        └── stripe-api: POST         180ms - 530ms   [SLOW]

The bottleneck is the Stripe API call inside the payment service. No amount of log searching would reveal this as quickly.
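The same conclusion can be reached mechanically. A sketch, using a hypothetical span list based on the checkout flow above (times in ms, illustrative): the bottleneck candidate is the slowest leaf span, i.e. a span with no children of its own:

```python
# Hypothetical spans for the checkout trace (start/end in ms from trace start).
spans = [
    {"name": "gateway",    "parent": None,      "start": 0,   "end": 540},
    {"name": "auth",       "parent": "gateway", "start": 5,   "end": 20},
    {"name": "cart",       "parent": "gateway", "start": 25,  "end": 150},
    {"name": "inventory",  "parent": "cart",    "start": 30,  "end": 150},
    {"name": "payment",    "parent": "gateway", "start": 160, "end": 540},
    {"name": "stripe-api", "parent": "payment", "start": 180, "end": 530},
]

# Leaf spans are those that never appear as anyone's parent; the slowest
# leaf is where the time is actually being spent.
parents = {s["parent"] for s in spans}
leaves = [s for s in spans if s["name"] not in parents]
bottleneck = max(leaves, key=lambda s: s["end"] - s["start"])
print(bottleneck["name"], bottleneck["end"] - bottleneck["start"])
# -> stripe-api 350
```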

OpenTelemetry instrumentation

OpenTelemetry provides automatic and manual instrumentation. Automatic instrumentation patches common libraries (HTTP clients, database drivers, gRPC) to create spans without code changes.

Automatic instrumentation (Python example)

pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
opentelemetry-instrument \
  --service_name order-api \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  python app.py

This automatically creates spans for Flask/FastAPI routes, SQLAlchemy queries, requests/httpx calls, and more.

Manual instrumentation

For custom business logic, create spans explicitly:

from opentelemetry import trace

tracer = trace.get_tracer("order-api")

def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("validate_inventory"):
            check_inventory(order_id)

        with tracer.start_as_current_span("charge_payment") as payment_span:
            result = charge_card(order_id)
            payment_span.set_attribute("payment.status", result.status)

Each with block creates a child span nested under the parent. The OTel SDK handles context propagation automatically within a process.

Span events and attributes

Spans can carry events (timestamped annotations) and attributes (key-value metadata).

span.add_event("cache_miss", {"cache.key": "user:9281"})
span.set_attribute("http.status_code", 200)
span.set_attribute("db.rows_affected", 3)
span.set_status(trace.StatusCode.OK)

Use attributes for data you want to filter and search by. Use events for notable moments within a span’s lifetime.

Avoid adding high-cardinality data to span attributes in systems that index them. User IDs are fine. Full request bodies are not.
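One way to enforce this is a small guard in front of set_attribute. This is a hypothetical helper, not part of the OTel API; the 256-character limit is an assumption you would tune for your backend:

```python
MAX_ATTR_LEN = 256  # assumed limit, not an OTel default

def safe_attributes(attrs: dict) -> dict:
    """Keep only primitive attribute values and truncate long strings,
    so bulky payloads never reach the span (hypothetical helper)."""
    out = {}
    for key, value in attrs.items():
        if isinstance(value, (bool, int, float)):
            out[key] = value
        elif isinstance(value, str):
            out[key] = value[:MAX_ATTR_LEN]
        # dicts, lists, and raw request bodies are dropped entirely
    return out

print(safe_attributes({"user.id": "u-9281", "body": {"card": "..."}}))
# -> {'user.id': 'u-9281'}
```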

Head sampling vs tail sampling

Tracing every request is expensive. A service handling 10,000 requests per second generates enormous trace volume. Sampling reduces this to a manageable level.

Head sampling decides at the start of a trace whether to record it. It is simple to implement, but the decision is made before the outcome is known: you might drop the one slow request that matters.

# OTel Collector: head-sample 10% of traces
processors:
  probabilistic_sampler:
    sampling_percentage: 10
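For head sampling to produce complete traces, every service must make the same keep-or-drop decision for the same trace_id. A sketch of a deterministic decision, similar in spirit to the SDK's trace-id-ratio samplers (the exact algorithm varies by SDK; this is illustrative):

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic head-sampling decision: compare the low 64 bits of
    the trace id against ratio * 2**64. Same trace_id, same answer,
    in every service -- so traces are either fully kept or fully dropped."""
    low64 = int(trace_id[-16:], 16)
    return low64 < ratio * (1 << 64)

tid = "7f2a1b3c4d5e6f00a1b2c3d4e5f60718"
print(head_sample(tid, 1.0), head_sample(tid, 0.0))
# -> True False
```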

Tail sampling waits until a trace is complete, then decides based on the outcome. Keep all traces with errors, all traces above a latency threshold, and a small percentage of everything else.

# OTel Collector tail sampling
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 1000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

Tail sampling yields better data quality, but it requires buffering complete traces in the collector, which increases memory usage and operational complexity.
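The policy logic itself reduces to a short decision function. A sketch, assuming a trace is a list of finished spans with illustrative field names (the real collector evaluates its configured policies over buffered OTLP data):

```python
import hashlib

def tail_sample(trace, latency_threshold_ms=1000, baseline=0.05):
    """Sketch of the three policies above: keep errored traces, keep
    slow traces, otherwise keep a deterministic ~5% baseline
    (hash-based here; the collector's baseline is probabilistic)."""
    if any(span["status"] == "ERROR" for span in trace):
        return True
    duration_ms = max(s["end"] for s in trace) - min(s["start"] for s in trace)
    if duration_ms > latency_threshold_ms:
        return True
    digest = hashlib.sha256(trace[0]["trace_id"].encode()).digest()
    return int.from_bytes(digest[:8], "big") < baseline * (1 << 64)

# A 2.5s trace exceeds the 1s threshold and is kept.
slow = [{"trace_id": "abc123", "status": "OK", "start": 0, "end": 2500}]
print(tail_sample(slow))  # -> True
```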

Trace backends: Jaeger and Tempo

Jaeger is a mature, CNCF-graduated tracing backend. It supports Elasticsearch or Cassandra for storage, has a built-in UI, and supports OpenTelemetry natively.

Grafana Tempo is a newer alternative that stores traces in object storage (S3, GCS). It is designed to be cost-effective at scale and integrates tightly with Grafana for trace-to-log and trace-to-metric navigation.

Both accept OTLP (OpenTelemetry Protocol) data from the OTel Collector. The choice often comes down to existing infrastructure: if you already run Elasticsearch, Jaeger fits. If you want to minimize storage costs and use Grafana, Tempo fits.

Connecting traces to logs and metrics

The real power of tracing emerges when you can pivot between pillars:

  1. A Grafana dashboard shows a P99 latency spike on the payment service.
  2. You click the spike. A data link opens Tempo filtered by service=payment-api and the spike’s time range.
  3. You find a slow trace and click a span with an error status.
  4. A link opens Loki filtered by trace_id=7f2a1b3c4d5e6f00, showing the exact error log.

This workflow requires three things: shared trace_id across all telemetry, data source configuration in Grafana, and data links between panels.
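As a sketch, Grafana data-source provisioning for this pivot might look like the following. Key names such as tracesToLogsV2 and derivedFields follow recent Grafana versions and may differ in yours; the URLs and uids are assumptions for the example:

```yaml
# Illustrative provisioning; verify field names against your Grafana version.
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki      # clicking a span jumps to matching logs
        filterByTraceID: true
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: trace_id         # clicking a trace_id in a log opens Tempo
          matcherRegex: 'trace_id=(\w+)'
          url: '$${__value.raw}'
          datasourceUid: tempo
```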

What comes next

Tracing shows the path of individual requests. Log aggregation pipelines explains how to get all those correlated logs into a searchable system so the trace-to-log jump actually works.
