Distributed tracing
A monolith has one call stack. When something is slow, a profiler shows you exactly where. Microservices scatter a single request across dozens of processes, each with its own logs and metrics. Without distributed tracing, you are left guessing which service caused the slowdown.
Distributed tracing reconstructs the full journey of a request by propagating a trace context across service boundaries. Each step is a span. Spans assemble into a trace tree that shows timing, dependencies, and errors in one view.
Core concepts
Trace
A trace represents one end-to-end request. It has a globally unique trace_id generated at the entry point (usually the API gateway or load balancer).
Span
A span represents a single unit of work within a trace. Each span records:
- trace_id: links to the parent trace.
- span_id: unique identifier for this span.
- parent_span_id: the span that created this one.
- operation: a human-readable name like POST /checkout or SELECT orders.
- start_time and duration: when the work started and how long it took.
- status: OK, ERROR, or UNSET.
- attributes: key-value metadata like http.method, db.statement, rpc.service.
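These fields fit in a small record. A minimal sketch of the data model (illustrative only; the OpenTelemetry SDK's internal representation adds clocks, events, and links):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """Illustrative span record, not the real SDK type."""
    trace_id: str                     # links to the parent trace
    span_id: str                      # unique within the trace
    parent_span_id: Optional[str]     # None for the root span
    operation: str                    # e.g. "POST /checkout"
    start_time_ms: int
    duration_ms: int
    status: str = "UNSET"             # OK, ERROR, or UNSET
    attributes: dict = field(default_factory=dict)

root = Span("7f2a1b3c4d5e6f00", "a1b2c3d4", None, "POST /checkout", 0, 540)
child = Span(root.trace_id, "e5f6a7b8", root.span_id, "charge-card", 160, 380)
```

The parent/child links (same trace_id, parent_span_id pointing at the parent's span_id) are what let a backend reassemble the tree.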
Context propagation
For spans across different services to belong to the same trace, the trace context must travel with the request. The W3C Trace Context standard defines two HTTP headers:
traceparent: 00-<trace_id>-<parent_span_id>-<flags>
tracestate: vendor1=value1,vendor2=value2
Every HTTP client and server in the request chain must read incoming context, create a child span, and forward the updated context to the next service.
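In practice OTel propagators handle this, but the mechanics fit in a few lines. A stdlib-only sketch of the read/create-child/forward cycle (the header format is real W3C Trace Context; the helper names are mine):

```python
import secrets

def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "span_id": parent_span_id, "flags": flags}

def forward_context(incoming: str) -> str:
    """Mint a child span id and build the header for the next hop.
    The trace_id is preserved; only the span id changes."""
    ctx = parse_traceparent(incoming)
    child_span_id = secrets.token_hex(8)  # 16 hex chars
    return f"00-{ctx['trace_id']}-{child_span_id}-{ctx['flags']}"

incoming = "00-7f2a1b3c4d5e6f001122334455667788-00f067aa0ba902b7-01"
outgoing = forward_context(incoming)
```

Each service repeats this: the trace_id rides through unchanged while the parent-id field always names the span that made the outbound call.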
Why logs alone fail
Consider a checkout flow that calls five services:
sequenceDiagram
    participant User
    participant Gateway
    participant Auth
    participant Cart
    participant Inventory
    participant Payment
    User->>Gateway: POST /checkout
    Gateway->>Auth: Validate token (15ms)
    Auth-->>Gateway: OK
    Gateway->>Cart: Get cart items (25ms)
    Cart->>Inventory: Reserve stock (120ms)
    Inventory-->>Cart: Reserved
    Cart-->>Gateway: Items ready
    Gateway->>Payment: Charge card (380ms)
    Payment-->>Gateway: Charged
    Gateway-->>User: 200 OK (total: 540ms)
A checkout request spanning five services. The payment service dominates total latency.
Logs from each service show their local processing time. But no single service log tells you that the payment call took 380ms out of 540ms total. Only a trace view assembles the full picture.
With a trace, you click on the slow request and immediately see:
Trace: 7f2a1b3c4d5e6f00
├── gateway: POST /checkout 0ms - 540ms
│ ├── auth: validate-token 5ms - 20ms
│ ├── cart: get-items 25ms - 50ms
│ │ └── inventory: reserve 30ms - 150ms
│ └── payment: charge-card 160ms - 540ms [SLOW]
│ └── stripe-api: POST 180ms - 530ms [SLOW]
The bottleneck is the Stripe API call inside the payment service. No amount of log searching would reveal this as quickly.
OpenTelemetry instrumentation
OpenTelemetry provides automatic and manual instrumentation. Automatic instrumentation patches common libraries (HTTP clients, database drivers, gRPC) to create spans without code changes.
Automatic instrumentation (Python example)
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
opentelemetry-instrument \
--service_name order-api \
--exporter_otlp_endpoint http://otel-collector:4317 \
python app.py
This automatically creates spans for Flask/FastAPI routes, SQLAlchemy queries, requests/httpx calls, and more.
Manual instrumentation
For custom business logic, create spans explicitly:
from opentelemetry import trace

tracer = trace.get_tracer("order-api")

def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("validate_inventory"):
            check_inventory(order_id)

        with tracer.start_as_current_span("charge_payment"):
            result = charge_card(order_id)
            span.set_attribute("payment.status", result.status)
Each with block creates a child span nested under the parent. The OTel SDK handles context propagation automatically within a process.
Span events and attributes
Spans can carry events (timestamped annotations) and attributes (key-value metadata).
span.add_event("cache_miss", {"cache.key": "user:9281"})
span.set_attribute("http.status_code", 200)
span.set_attribute("db.rows_affected", 3)
span.set_status(trace.StatusCode.OK)
Use attributes for data you want to filter and search by. Use events for notable moments within a span’s lifetime.
Avoid adding high-cardinality data to span attributes in systems that index them. User IDs are fine. Full request bodies are not.
Head sampling vs tail sampling
Tracing every request is expensive. A service handling 10,000 requests per second generates enormous trace volume. Sampling reduces this to a manageable level.
Head sampling decides at the start of a trace whether to record it. It is simple and cheap, but the decision is made before the outcome is known: you might drop the one slow or failed request that matters.
# OTel Collector config: sample 10% of traces
processors:
  probabilistic_sampler:
    sampling_percentage: 10
Tail sampling waits until a trace is complete, then decides based on the outcome. Keep all traces with errors, all traces above a latency threshold, and a small percentage of everything else.
# OTel Collector tail sampling
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 1000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
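The policies above boil down to a small decision function evaluated once a trace is complete. A sketch of the same logic (names and the trace dict shape are mine; the real collector evaluates policies in parallel over full span sets):

```python
import random

def tail_sample(trace: dict, latency_threshold_ms: int = 1000,
                baseline_rate: float = 0.05) -> bool:
    """Decide after the trace is complete, when the outcome is known."""
    if trace["status"] == "ERROR":
        return True                      # keep every failed trace
    if trace["duration_ms"] > latency_threshold_ms:
        return True                      # keep every slow trace
    return random.random() < baseline_rate  # small sample of the rest

tail_sample({"status": "ERROR", "duration_ms": 40})   # always kept
tail_sample({"status": "OK", "duration_ms": 2500})    # always kept
```

The decision_wait window exists because this function cannot run until all spans of the trace have arrived at the collector.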
Tail sampling gives strictly better data quality, but it requires buffering complete traces in the collector until the decision can be made, which increases memory usage and operational complexity.
Trace backends: Jaeger and Tempo
Jaeger is a mature, CNCF-graduated tracing backend. It supports Elasticsearch or Cassandra for storage, has a built-in UI, and supports OpenTelemetry natively.
Grafana Tempo is a newer alternative that stores traces in object storage (S3, GCS). It is designed to be cost-effective at scale and integrates tightly with Grafana for trace-to-log and trace-to-metric navigation.
Both accept OTLP (OpenTelemetry Protocol) data from the OTel Collector. The choice often comes down to existing infrastructure: if you already run Elasticsearch, Jaeger fits. If you want to minimize storage costs and use Grafana, Tempo fits.
Connecting traces to logs and metrics
The real power of tracing emerges when you can pivot between pillars:
- A Grafana dashboard shows a P99 latency spike on the payment service.
- You click the spike. A data link opens Tempo filtered by service=payment-api and the spike's time range.
- You find a slow trace and click a span with an error status.
- A link opens Loki filtered by trace_id=7f2a1b3c4d5e6f00, showing the exact error log.
This workflow requires three things: shared trace_id across all telemetry, data source configuration in Grafana, and data links between panels.
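The shared trace_id is the glue. OTel's logging instrumentation injects it into log records automatically; a stdlib-only sketch of the same idea using a logging.Filter (the attribute name trace_id and the fixed id are my choices — a real service would read the id from the current span context):

```python
import logging

class TraceIdFilter(logging.Filter):
    """Stamp every log record with the current trace id so a log
    backend like Loki can be queried by trace_id."""
    def __init__(self, trace_id: str):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id
        return True

logger = logging.getLogger("payment-api")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceIdFilter("7f2a1b3c4d5e6f00"))
logger.error("charge failed")
```

With every log line carrying the id, the Tempo-to-Loki jump is a simple label filter rather than a timestamp-correlation guessing game.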
What comes next
Tracing shows the path of individual requests. Log aggregation pipelines explains how to get all those correlated logs into a searchable system so the trace-to-log jump actually works.