Observability: logging, metrics, tracing
In this series (20 parts)
- What is system design and why it matters
- Estimations and back-of-envelope calculations
- Scalability: vertical vs horizontal scaling
- CAP theorem and distributed system tradeoffs
- Consistency models
- Load balancing
- Caching: strategies and patterns
- Content Delivery Networks
- Databases: SQL vs NoSQL and when to use each
- Database replication
- Database sharding and partitioning
- Consistent hashing
- Message queues and event streaming
- API design: REST, GraphQL, gRPC
- Rate limiting and throttling
- Proxies: forward and reverse
- Networking concepts for system design
- Reliability patterns: timeouts, retries, circuit breakers
- Observability: logging, metrics, tracing
- Security in system design
It is 2 a.m. and your pager fires. Latency on the checkout service spiked to 4 seconds. You open your dashboard: CPU is fine, memory is fine, disk is fine. The four golden signals look normal at the aggregate level. But somewhere between the API gateway and the payment provider, something went wrong. Without the right observability in place, you are blind. You will spend the next two hours grepping through unstructured logs across 14 services, guessing at causality, while customers abandon their carts.
Observability is the measure of how well you can understand the internal state of a system by examining its external outputs. The term comes from control theory, but in distributed systems it boils down to a practical question: when something breaks, can you figure out why without deploying new code? If you have followed the reliability patterns discussion, you know how to build systems that tolerate failure. This article is about seeing those failures clearly when they happen.
The three pillars
Observability rests on three complementary signals: logs, metrics, and traces. Each answers a different question. Logs tell you what happened. Metrics tell you how much. Traces tell you where.
None of these signals is sufficient on its own. A metric can tell you that p99 latency jumped from 120 ms to 900 ms, but it cannot tell you which request suffered or why. A log entry can tell you that a database query timed out, but it cannot tell you how often that timeout occurs across the fleet. A trace can show you the full path of a single request, but it cannot tell you the overall error rate for the service. You need all three working together. The power comes from correlation: linking a metric anomaly to a set of traces, then drilling into the logs for those specific trace IDs.
Structured logging
Most engineers start with logging. It is the oldest observability signal and the most intuitive. You print something, you read it later. The problem is that most logging is done wrong.
Unstructured logs look like this: "Failed to process order 4821 for user john@example.com". A human can read it. A machine cannot parse it reliably. When you have 50,000 log lines per second across 200 service instances, human reading stops being an option.
Structured logging means emitting every log entry as a typed, machine-parseable record. JSON is the most common format. Each field has a name, a type, and a consistent meaning across services.
```json
{
  "timestamp": "2025-03-15T02:14:33.482Z",
  "level": "ERROR",
  "service": "order-service",
  "traceId": "abc123def456",
  "spanId": "span-789",
  "userId": "u-10042",
  "orderId": "ord-4821",
  "message": "payment gateway timeout after 3000ms",
  "durationMs": 3000,
  "retryCount": 2
}
```
This log entry is instantly searchable. You can query all errors for orderId=ord-4821, or all payment timeouts longer than 2000 ms in the last hour, or all logs sharing traceId=abc123def456. The trace ID is the critical piece. It ties this log entry to a distributed trace and to every other log emitted during the same request.
There are rules that save you pain. Always include a timestamp in ISO 8601 with millisecond precision. Always include the service name and instance ID. Always include the trace ID if one exists. Use consistent field names across every service in the organization. Log at the boundary: when a request enters your service, when it leaves, and when something unexpected happens in between. Do not log inside tight loops. A service handling 10,000 RPS that logs 5 entries per request produces 50,000 log entries per second. At an average of 500 bytes per entry, that is 25 MB/s of log data from a single service. Multiply by 40 services and you are shipping a gigabyte of logs every second. Storage, indexing, and query costs scale with volume. Be deliberate.
Log levels matter. DEBUG for development. INFO for routine lifecycle events. WARN for recoverable anomalies. ERROR for failures that affect user experience. FATAL for process-ending events. In production, most services should run at INFO. The ability to dynamically raise a single service instance to DEBUG without redeploying is a capability worth investing in early.
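The ideas above can be condensed into a tiny logging helper. This is a minimal sketch, not a production logger (real systems use an established library and ship entries to an aggregator); the helper name and keyword arguments are illustrative, and the field names simply mirror the JSON example earlier.

```python
import json
import sys
from datetime import datetime, timezone

def log(level, message, *, service, trace_id=None, **fields):
    """Emit one structured log entry as a single JSON line on stdout.
    Any consistent, organization-wide field schema works; the point is
    that every entry is machine-parseable and queryable."""
    entry = {
        # ISO 8601 with millisecond precision, always in UTC
        "timestamp": datetime.now(timezone.utc)
            .isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "level": level,
        "service": service,
        "message": message,
        **fields,
    }
    if trace_id is not None:  # ties this entry to the distributed trace
        entry["traceId"] = trace_id
    sys.stdout.write(json.dumps(entry) + "\n")
    return entry

entry = log("ERROR", "payment gateway timeout after 3000ms",
            service="order-service", trace_id="abc123def456",
            orderId="ord-4821", durationMs=3000, retryCount=2)
```

Because every entry is one JSON object per line, a log aggregator can index each field and answer queries like "all errors for orderId=ord-4821" without regex gymnastics.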
Time-series metrics
Metrics are numerical measurements collected at regular intervals. They answer questions about aggregates and trends. Unlike logs, which capture individual events, metrics compress reality into statistical summaries. That compression is what makes them cheap to store and fast to query.
There are three fundamental metric types.
Counters are monotonically increasing values. They only go up (or reset to zero on process restart). Total HTTP requests served, total bytes sent, total errors encountered. You never look at the raw counter value. You look at the rate of change. A counter that increases by 500 in one second tells you the rate is 500 per second. Counters are the workhorse of operational monitoring.
Gauges are point-in-time values that can go up or down. Current CPU usage, current memory consumption, number of active connections, queue depth. A gauge that reads 85% tells you something right now. Gauges are fragile because you only see the value at the moment of collection. If your scrape interval is 15 seconds and a memory spike lasted 3 seconds, you might miss it entirely.
Histograms (and their cousin, summaries) capture the distribution of values. Request latency is the classic use case. Knowing that average latency is 150 ms tells you almost nothing. Knowing that p50 is 80 ms, p95 is 200 ms, and p99 is 1,400 ms tells you a very different story. Histograms work by sorting observations into predefined buckets. A histogram with buckets at 50 ms, 100 ms, 250 ms, 500 ms, 1,000 ms, and 5,000 ms lets you compute approximate percentiles without storing every individual measurement.
The standard approach in modern systems is the pull model popularized by Prometheus. Each service exposes a /metrics endpoint. A central collector scrapes that endpoint every 15 or 30 seconds. The collected data points become a time series: a sequence of {timestamp, value} pairs identified by a metric name and a set of labels. For example, http_requests_total{method="POST", path="/api/orders", status="500"} is a distinct time series from http_requests_total{method="GET", path="/api/users", status="200"}.
Labels give you dimensionality. They also give you cardinality problems. If you add a userId label to a metric and you have 10 million users, you just created 10 million time series. Most time-series databases fall over well before that point. Keep label cardinality bounded. Use labels for dimensions with tens or hundreds of values, not thousands.
The four golden signals
Google’s SRE book codified four signals that every service should track: latency (how long requests take), traffic (how many requests arrive), errors (how many requests fail), and saturation (how full your resources are). If you instrument nothing else, instrument these four. They cover the vast majority of production incidents you will encounter.
Distributed tracing
In a monolith, a stack trace tells you the full path of a request. In a microservices architecture, a single user action might touch 8 different services across 3 data centers. A stack trace in any one service shows only a fragment. Distributed tracing reconstructs the full picture.
A trace represents the complete journey of a single request through the system. It has a globally unique trace ID. Every service that participates in handling the request creates one or more spans. A span represents a unit of work: an HTTP call, a database query, a message publish. Each span records its start time, duration, status, and a reference to its parent span. The parent references link the spans into a tree that shows exactly how time was spent.
Context propagation is the mechanism that makes this work. When Service A calls Service B, it injects the trace ID and its own span ID into the outgoing request headers (typically using the W3C traceparent header or a B3 header). Service B extracts those IDs, creates a child span, and continues propagating. This works across HTTP, gRPC, and message queues. The key requirement is that every service in the call chain participates in propagation. One uninstrumented service breaks the trace.
```mermaid
sequenceDiagram
    participant G as API Gateway
    participant O as Order Service
    participant P as Payment Service
    participant I as Inventory Service
    participant N as Notification Service
    G->>O: POST /checkout<br/>traceId: t-100, spanId: s-1
    O->>P: chargeCard<br/>traceId: t-100, spanId: s-2, parent: s-1
    P-->>O: 200 OK (340ms)
    O->>I: reserveStock<br/>traceId: t-100, spanId: s-3, parent: s-1
    I-->>O: 200 OK (85ms)
    O->>N: sendConfirmation<br/>traceId: t-100, spanId: s-4, parent: s-1
    N-->>O: 202 Accepted (12ms)
    O-->>G: 200 OK (510ms)
```
A single checkout request produces trace t-100. The trace ID propagates through every service call. Each service creates a child span with timing data. The Payment Service took 340 ms, revealing it as the latency bottleneck.
Sampling is essential. Recording 100% of traces at 10,000 RPS means storing millions of spans per minute. Most tracing systems use head-based sampling (decide at the entry point whether to trace) or tail-based sampling (decide after the trace completes, keeping interesting traces like errors or slow requests). A 1% head-based sample rate is common in high-traffic systems. Tail-based sampling is more expensive to operate but captures the traces you actually want to investigate.
The OpenTelemetry project has emerged as the industry standard for instrumentation. It provides vendor-neutral APIs and SDKs for generating traces, metrics, and logs. Instrument once with OpenTelemetry, then export to whatever backend you choose: Jaeger, Zipkin, Grafana Tempo, Datadog, or any other compatible system.
SLIs, SLOs, and SLAs
Observability data is useless without a framework for deciding what is acceptable. That framework is the SLI/SLO/SLA hierarchy.
A Service Level Indicator (SLI) is a quantitative measure of some aspect of service quality. The most common SLIs are request latency, error rate, throughput, and availability. An SLI must be precise. “Latency” is not an SLI. “The proportion of requests served in under 300 ms, measured at the load balancer” is an SLI.
A Service Level Objective (SLO) is a target value for an SLI. For example: “99.9% of requests will complete in under 300 ms, measured over a rolling 30-day window.” That 99.9% target means you have an error budget of 0.1%. Over 30 days, that is roughly 43 minutes of allowed downtime or degraded performance. The error budget is the most powerful concept here. It transforms reliability from a vague aspiration into a concrete, spendable resource. When the budget is healthy, teams can ship faster and take more risks. When the budget burns down, teams slow deployments and focus on stability.
A Service Level Agreement (SLA) is a contract with consequences. It is the business wrapper around an SLO. If you promise 99.95% availability in your SLA and deliver 99.90%, there are financial penalties. SLAs are typically less aggressive than internal SLOs. If your SLA promises 99.9%, your internal SLO should target 99.95% to give yourself margin.
The relationship between observability and SLOs is direct. Your metrics pipeline measures SLIs in real time. Your alerting rules fire when the error budget burn rate exceeds a threshold. Your traces and logs provide the detail needed to diagnose what is consuming the budget. Without solid observability, SLOs become aspirational numbers on a slide deck rather than operational tools.
Connecting the signals
The real value of observability emerges when you can move fluidly between pillars. A practical workflow looks like this: an alert fires because the error rate SLI exceeded the burn rate threshold. You open the metrics dashboard and see that 5xx errors spiked 12 minutes ago on the order service. You filter traces for that time window and find that 40% of them show a 3-second timeout on the payment service span. You pull the trace ID from one of those traces, search for it in the log aggregator, and find the structured log entry showing the exact error message, retry count, and upstream response code.
That entire investigation took 4 minutes. Without structured logs, you would have grepped raw text files across dozens of instances. Without traces, you would not have known which downstream service was responsible. Without metrics, you would not have known the error started 12 minutes ago or that it affected 40% of requests.
Building this kind of observability requires investment. You need consistent instrumentation across every service. You need a log aggregator that can handle your volume. You need a metrics backend with enough retention for trend analysis. You need a tracing backend with reasonable sampling and fast query performance. And you need your networking layer instrumented so that service-to-service calls carry context headers correctly.
The cost is real. At scale, observability infrastructure can consume 5 to 10 percent of total infrastructure spend. But the alternative is flying blind. Every minute of downtime you cannot diagnose is revenue lost, trust eroded, and engineers burning out in war rooms.
What comes next
Observability gives you the eyes to see what your system is doing. But seeing is not enough. You also need to protect your system from malicious actors and ensure that the data flowing through your traces and logs is itself secure. The next article covers security in system design, including authentication, authorization, encryption, and the threat models that shape how you build and operate production systems.