The three pillars of observability
Monitoring tells you when something is broken. Observability tells you why. The difference matters because modern distributed systems fail in ways nobody predicted at design time. You cannot pre-build a dashboard for every failure mode.
Observability rests on three pillars: logs, metrics, and traces. Each one captures a different slice of reality. None is sufficient alone.
Monitoring vs observability
Monitoring is asking known questions. “Is CPU above 90%?” or “Are more than 1% of requests returning 500?” You define the questions in advance, wire up checks, and get alerts when thresholds are crossed.
Observability is the ability to ask new questions without shipping new code. When a user reports “checkout is slow for customers in Germany,” you need to slice data by region, payment provider, and request path. If your system only exposes pre-aggregated counters, you cannot answer that question.
A system is observable when its internal state can be inferred from its external outputs. The three pillars are those outputs.
Pillar 1: Logs
A log is an immutable, timestamped record of a discrete event. Every time a request arrives, a database query runs, or an error occurs, the application writes a log line.
{
  "timestamp": "2026-04-20T14:23:01.442Z",
  "level": "error",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "message": "charge failed",
  "customer_id": "cust_9281",
  "error_code": "card_declined",
  "latency_ms": 340
}
Logs are rich in context. They carry arbitrary key-value pairs that describe exactly what happened. The cost is volume: a busy service can produce gigabytes per hour. Structured logging and sampling strategies keep this manageable.
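As a sketch, structured logging can be as simple as serializing a dict. The `log` helper below is illustrative, not a real library API; production services would use a logging framework, but the shape of the output is the same: arbitrary key-value context travels with each event.

```python
import json
import time

def log(level, message, **fields):
    """Emit one structured log line as JSON (a sketch; real services use a logging library)."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": level,
        "message": message,
        **fields,  # arbitrary key-value context: trace_id, customer_id, ...
    }
    return json.dumps(record)

line = log("error", "charge failed",
           service="payment-api", trace_id="abc123def456",
           error_code="card_declined", latency_ms=340)
print(line)
```

Because every field is a first-class key rather than text baked into a message string, the log backend can filter on trace_id or error_code directly.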
Pillar 2: Metrics
A metric is a numeric measurement aggregated over time. Instead of recording every event, metrics summarize: “In the last minute, 12,400 requests arrived and 37 returned errors.”
Metrics are cheap to store because they compress naturally. A counter that increments 10,000 times per second still occupies a single time series. This makes metrics ideal for dashboards and alerting.
The four standard metric types in Prometheus are:
- Counter increments monotonically. Request count, bytes sent, errors total.
- Gauge goes up and down. CPU usage, queue depth, active connections.
- Histogram buckets observations into configurable ranges. Request duration, response size.
- Summary calculates quantiles client-side. Similar to histogram but with different trade-offs.
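To make the histogram type concrete, here is a toy bucketing implementation in pure Python. It is not the Prometheus client library, just a sketch of the cumulative-bucket model that Prometheus histograms expose: each observation lands in the first bucket whose upper bound it does not exceed, and buckets are reported cumulatively.

```python
from bisect import bisect_left

class Histogram:
    """Toy Prometheus-style histogram: bucketed counts plus a running sum and count."""
    def __init__(self, buckets):
        self.bounds = sorted(buckets)               # upper bounds (le), e.g. seconds
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket

    def observe(self, value):
        # bisect_left gives "less than or equal" semantics for exact boundary values
        self.counts[bisect_left(self.bounds, value)] += 1

    def cumulative(self):
        """Return {le: count} with cumulative counts, as Prometheus exposes them."""
        out, running = {}, 0
        for bound, c in zip(self.bounds + [float("inf")], self.counts):
            running += c
            out[bound] = running
        return out

h = Histogram(buckets=[0.1, 0.5, 1.0])
for latency in [0.05, 0.2, 0.3, 0.7, 2.0]:
    h.observe(latency)
```

Five observations collapse into four bucket counters, and a million observations would occupy exactly the same space, which is why histograms scale where per-event storage does not.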
Pillar 3: Traces
A trace follows a single request as it crosses service boundaries. Each step in the journey is a span. Spans nest to form a tree that shows exactly where time was spent.
Trace: abc123def456
├── api-gateway 0ms - 420ms
│   ├── auth-service 12ms - 45ms
│   ├── product-api 50ms - 180ms
│   │   └── postgres 60ms - 170ms
│   └── payment-api 200ms - 410ms
│       └── stripe 220ms - 400ms
When a request is slow, the trace shows which service and which downstream call is responsible. Logs tell you what happened inside that service. Metrics tell you if the problem is widespread or isolated.
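The span tree above can be modeled directly as data. The `Span` class and `slowest_leaf` helper below are illustrative names, not an OpenTelemetry API, but they show how nesting pinpoints where time went.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step of a trace: a named operation with start/end times (ms) and child spans."""
    name: str
    start: int
    end: int
    children: list = field(default_factory=list)

    def self_time(self):
        """Time spent in this span minus time attributed to child spans."""
        return (self.end - self.start) - sum(c.end - c.start for c in self.children)

# the example trace from above, rebuilt as nested spans
trace = Span("api-gateway", 0, 420, [
    Span("auth-service", 12, 45),
    Span("product-api", 50, 180, [Span("postgres", 60, 170)]),
    Span("payment-api", 200, 410, [Span("stripe", 220, 400)]),
])

def slowest_leaf(span):
    """Walk the tree and return the leaf span with the longest duration."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children),
               key=lambda s: s.end - s.start)
```

Walking the tree immediately identifies the stripe call as the dominant cost of this request, which is exactly the question a trace viewer answers visually.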
How the three pillars connect
The key that ties everything together is context propagation. A trace_id generated at the entry point flows through every service and gets embedded in every log line. Metrics link back to traces through exemplars rather than labels, which keeps their cardinality bounded.
flowchart LR
    User -->|HTTP request| Gateway
    Gateway -->|trace_id=abc123| AuthSvc[Auth Service]
    Gateway -->|trace_id=abc123| ProductAPI[Product API]
    Gateway -->|trace_id=abc123| PaymentAPI[Payment API]
    subgraph Pillars["Three Pillars per Service"]
        direction TB
        L["Logs: structured events with trace_id"]
        M["Metrics: counters, histograms, gauges"]
        T["Traces: spans with timing + parent refs"]
    end
    AuthSvc --> Pillars
    ProductAPI --> Pillars
    PaymentAPI --> Pillars
    Pillars --> Backend["Observability Backend<br/>Loki / Prometheus / Tempo"]
Each service emits all three signals; logs and traces carry the same trace_id, and the observability backend correlates them.
When you get a metric alert (“P99 latency spike on payment-api”), you pivot to traces filtered by that time window, find the slow spans, then jump to logs for those specific trace IDs. This workflow is only possible when all three pillars share correlation identifiers.
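That pivot can be sketched over in-memory records. The record shapes and helper names below are hypothetical stand-ins for the query APIs a real backend (Tempo for traces, Loki for logs) would expose, but the workflow is the same: filter traces by window and latency, then fetch logs by trace ID.

```python
# Illustrative in-memory stand-ins for a trace store and a log store.
traces = [
    {"trace_id": "abc123", "service": "payment-api", "duration_ms": 2100, "ts": 1005},
    {"trace_id": "def789", "service": "payment-api", "duration_ms": 180,  "ts": 1010},
]
logs = [
    {"trace_id": "abc123", "level": "error", "message": "charge failed"},
    {"trace_id": "def789", "level": "info",  "message": "charge ok"},
]

def slow_trace_ids(traces, service, window, threshold_ms):
    """Step 1: traces for one service inside the alert window that exceed the threshold."""
    lo, hi = window
    return {t["trace_id"] for t in traces
            if t["service"] == service
            and lo <= t["ts"] <= hi
            and t["duration_ms"] > threshold_ms}

def logs_for(logs, trace_ids):
    """Step 2: jump from suspect trace IDs to the matching log lines."""
    return [entry for entry in logs if entry["trace_id"] in trace_ids]

suspects = slow_trace_ids(traces, "payment-api", window=(1000, 1060), threshold_ms=1000)
evidence = logs_for(logs, suspects)
```

Without the shared trace_id, step 2 would be a full-text search across every log line in the window instead of an exact lookup.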
Cardinality: the hidden cost
Every unique combination of label values creates a new time series. If you add a user_id label to a metric and you have 10 million users, you just created 10 million time series. Your Prometheus instance will run out of memory.
High cardinality is fine for logs and traces. They are events, not aggregations. But metrics demand discipline:
- Use bounded labels: HTTP method, status code class (2xx, 4xx, 5xx), service name.
- Never put user IDs, request IDs, or IP addresses in metric labels.
- Move high-cardinality debugging data to logs and traces.
A practical rule: if a label can take more than a few hundred distinct values, it does not belong on a metric.
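The arithmetic behind that rule is just a product of label cardinalities, which a few lines make vivid (the label sets below are illustrative):

```python
def series_count(label_values):
    """Number of time series one metric can produce: the product of label cardinalities."""
    n = 1
    for values in label_values.values():
        n *= len(values)
    return n

bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status_class": ["2xx", "4xx", "5xx"],
    "service": ["payment-api", "product-api", "auth-service"],
}

# 4 * 3 * 3 = 36 series: trivially cheap
ok = series_count(bounded)

# one user_id label with 10 million values multiplies everything by 10 million
exploded = series_count({**bounded, "user_id": range(10_000_000)})
```

The bounded label set costs 36 series; adding user_id turns it into 360 million, which is the memory blow-up described above.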
OpenTelemetry as a unifying framework
OpenTelemetry (OTel) is a vendor-neutral standard for generating, collecting, and exporting telemetry data. It provides:
- APIs and SDKs for instrumenting applications in most languages.
- A common data model for logs, metrics, and traces with shared context.
- The OTel Collector that receives, processes, and exports telemetry to any backend.
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
The collector decouples instrumentation from backends. You instrument once with OTel, then route data to Prometheus, Loki, Tempo, Datadog, or whatever combination your organization uses. Switching backends later requires only a config change, not a code change.
Choosing what to instrument first
Start with the signals that answer the most urgent questions:
- RED metrics on every service boundary. Rate, Errors, Duration. These power SLO dashboards and alerts.
- Distributed traces across critical paths. Checkout flow, login flow, API gateway to database.
- Structured logs on error paths. Every catch block should log the error with context.
Avoid instrumenting everything on day one. Telemetry has storage and compute costs. Instrument the critical path, learn what questions you actually ask, then expand coverage.
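As a sketch, RED numbers fall out of raw request records with a few lines of Python. The field names here are illustrative, and real instrumentation would use a metrics library rather than batch computation, but it shows exactly what each letter measures.

```python
def red(requests, window_s=60):
    """Compute Rate, Errors, Duration from request records observed in one window."""
    durations = sorted(r["duration_ms"] for r in requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    # rough p99: index into the sorted durations (a real histogram would interpolate)
    p99_index = min(len(durations) - 1, int(len(durations) * 0.99))
    return {
        "rate_rps": len(requests) / window_s,   # Rate: requests per second
        "error_ratio": errors / len(requests),  # Errors: fraction of 5xx responses
        "p99_ms": durations[p99_index],         # Duration: tail latency
    }

# illustrative request records for one 60-second window
requests = [
    {"status": 200, "duration_ms": 40},
    {"status": 200, "duration_ms": 55},
    {"status": 500, "duration_ms": 900},
    {"status": 200, "duration_ms": 60},
]
summary = red(requests)
```

These three numbers per service boundary are usually enough to tell you whether anything is wrong; the traces and logs then tell you why.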
What comes next
With the three pillars understood, the next articles dive into each one. Structured logging covers how to produce logs that are actually queryable. Metrics and Prometheus explains the scrape model and PromQL. Distributed tracing shows how spans propagate across service boundaries.