Structured logging
In this series (10 parts)
A log line like `Payment failed for user 9281` is readable by humans. It is nearly useless to machines. Extracting the user ID requires a regex that breaks every time the message format changes.
Structured logging solves this by emitting events as key-value pairs, typically serialized as JSON. Every field is explicitly named, typed, and queryable without parsing heuristics.
Why plain text fails
Consider a typical unstructured log:
```
2026-04-20 14:23:01 ERROR [payment-api] Charge failed for customer cust_9281 - card_declined (340ms)
```
To answer “how many card_declined errors happened in the last hour,” you need to:
- Write a regex to extract `card_declined` from the free-text message.
- Hope nobody changes the message format in a future commit.
- Handle edge cases where the error code appears in a different position.
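To make the fragility concrete, here is a sketch of that regex approach against the hypothetical log line above. One harmless rewording of the message and the extraction silently returns nothing:

```python
import re

LINE = "2026-04-20 14:23:01 ERROR [payment-api] Charge failed for customer cust_9281 - card_declined (340ms)"

# Fragile: this pattern is coupled to one exact message layout.
PATTERN = re.compile(r"for customer (\S+) - (\w+) \((\d+)ms\)")

match = PATTERN.search(LINE)
if match:
    customer_id, error_code, latency_ms = match.groups()
    print(customer_id, error_code, latency_ms)  # cust_9281 card_declined 340

# A future commit rewords the message; the event becomes invisible to the query.
REWORDED = "2026-04-20 14:23:01 ERROR [payment-api] card_declined while charging cust_9281 (340ms)"
print(PATTERN.search(REWORDED))  # None
```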
Now multiply this by 50 services, each with their own log format conventions. The regex approach does not scale.
The structured alternative
The same event as structured JSON:
```json
{
  "timestamp": "2026-04-20T14:23:01.442Z",
  "level": "error",
  "logger": "payment-api",
  "message": "charge failed",
  "trace_id": "abc123def456",
  "span_id": "span_789",
  "customer_id": "cust_9281",
  "error_code": "card_declined",
  "payment_method": "credit_card",
  "latency_ms": 340,
  "environment": "production"
}
```
Now the query is trivial: filter where error_code = "card_declined" and timestamp is within the last hour. No regex. No guessing. Works in Loki, Elasticsearch, CloudWatch, or any log backend that understands JSON.
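In any log backend this becomes a field filter; the same logic can be sketched in plain Python over parsed JSON lines (the sample events and the fixed "now" are illustrative):

```python
import json
from datetime import datetime, timedelta, timezone

raw_lines = [
    '{"timestamp": "2026-04-20T14:23:01.442Z", "level": "error", "error_code": "card_declined"}',
    '{"timestamp": "2026-04-20T13:01:00.000Z", "level": "error", "error_code": "card_declined"}',
    '{"timestamp": "2026-04-20T14:20:15.000Z", "level": "info", "error_code": null}',
]

now = datetime(2026, 4, 20, 14, 30, tzinfo=timezone.utc)  # fixed "now" for the example
cutoff = now - timedelta(hours=1)

events = [json.loads(line) for line in raw_lines]
declines = [
    e for e in events
    if e["error_code"] == "card_declined"
    and datetime.fromisoformat(e["timestamp"].replace("Z", "+00:00")) >= cutoff
]
print(len(declines))  # 1 -- the 13:01 event falls outside the window
```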
Essential fields
Every structured log event should include a baseline set of fields:
| Field | Purpose |
|---|---|
| timestamp | ISO 8601 with millisecond precision. Let the logging framework set this. |
| level | Severity: debug, info, warn, error, fatal. |
| service | Which service emitted the log. Critical in microservice architectures. |
| trace_id | Links the log to a distributed trace. |
| span_id | Links to the specific span within the trace. |
| message | Short, stable description of the event. Not a template with interpolated values. |
Beyond the baseline, add context fields relevant to the event. For an HTTP request: method, path, status_code, latency_ms. For a database query: query_name, rows_affected, duration_ms.
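Putting the baseline and context fields together, here is a minimal stdlib-only sketch of an event emitter (the service name and field values are hypothetical; a real logging library would handle this for you):

```python
import json
import sys
from datetime import datetime, timezone

def log_event(level: str, message: str, **context) -> dict:
    """Emit one structured event: baseline fields plus call-site context."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "level": level,
        "service": "order-api",  # hypothetical service name
        "message": message,      # short and stable; values go in fields below
        **context,               # event-specific context fields
    }
    print(json.dumps(event), file=sys.stdout)
    return event

event = log_event("info", "request completed",
                  method="GET", path="/orders/42", status_code=200, latency_ms=18)
```

Note that the message stays constant across events; everything variable lives in named fields, which is what keeps the events queryable.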
Log levels and when to use them
Log levels filter noise. Use them consistently across services:
- DEBUG: Detailed diagnostic info. Disabled in production unless temporarily enabled for investigation.
- INFO: Normal operations. Request received, job completed, cache refreshed.
- WARN: Something unexpected that the system handled. Retry succeeded, fallback activated, deprecated API called.
- ERROR: An operation failed. The user or caller received an error response.
- FATAL: The process cannot continue and is shutting down.
A common mistake is logging expected conditions as errors. A 404 Not Found response is not an error if the resource genuinely does not exist. Reserve error level for conditions that indicate something is broken.
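One way to enforce that distinction is a small helper that maps response outcomes to levels; the mapping below is a sketch of the conventions above, not a universal rule:

```python
def level_for_response(status_code: int) -> str:
    """Pick a log level for an HTTP response outcome."""
    if status_code >= 500:
        return "error"  # server-side failure: something on our side is broken
    if status_code >= 400:
        return "info"   # expected client outcome (404 on a missing resource, 400 on bad input)
    return "info"       # normal operation

print(level_for_response(503))  # error
print(level_for_response(404))  # info
```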
Correlation IDs
Correlation IDs tie together all the events that belong to a single logical operation. The most important correlation ID is the trace_id from distributed tracing.
A request enters at the API gateway, which generates a trace ID and passes it downstream via the traceparent HTTP header (W3C Trace Context standard). Every service extracts the trace ID and attaches it to every log line it writes.
```python
import uuid

import structlog

def middleware(request, call_next):
    # Reuse an incoming ID (a simplified x-trace-id header here, rather than
    # the full W3C traceparent format) or mint a new one.
    trace_id = request.headers.get("x-trace-id", str(uuid.uuid4()))
    structlog.contextvars.bind_contextvars(trace_id=trace_id)
    try:
        # Every log line emitted while handling this request carries trace_id.
        return call_next(request)
    finally:
        # Unbind even if the handler raises, so the ID never leaks between requests.
        structlog.contextvars.unbind_contextvars("trace_id")
```
When investigating a slow request, you search logs by trace_id and see every event across every service in chronological order.
Logging in practice: a web service example
Here is what structured logging looks like for a typical API request lifecycle:
```json
{"timestamp":"2026-04-20T14:23:01.100Z","level":"info","service":"order-api","trace_id":"t1","message":"request started","method":"POST","path":"/orders","client_ip":"10.0.1.42"}
{"timestamp":"2026-04-20T14:23:01.150Z","level":"info","service":"order-api","trace_id":"t1","message":"inventory check passed","sku":"WIDGET-42","quantity":3}
{"timestamp":"2026-04-20T14:23:01.300Z","level":"info","service":"order-api","trace_id":"t1","message":"payment charged","amount_cents":4500,"currency":"USD","provider":"stripe"}
{"timestamp":"2026-04-20T14:23:01.420Z","level":"info","service":"order-api","trace_id":"t1","message":"request completed","status_code":201,"latency_ms":320}
```
Four events, one request. Each one is independently queryable. You can answer “what is the average latency of POST /orders?” or “which SKUs are ordered most frequently?” without touching application code.
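As a sketch of the latency question, the same filter-and-aggregate you would write in a log backend works over parsed events in plain Python (the extra sample events are invented for illustration):

```python
import json

log_lines = [
    '{"trace_id":"t1","message":"request completed","method":"POST","path":"/orders","status_code":201,"latency_ms":320}',
    '{"trace_id":"t2","message":"request completed","method":"POST","path":"/orders","status_code":201,"latency_ms":280}',
    '{"trace_id":"t3","message":"inventory check passed","sku":"WIDGET-42","quantity":3}',
]

events = [json.loads(line) for line in log_lines]

# Filter to the completion events for this route, then aggregate a field.
completed = [e for e in events
             if e.get("message") == "request completed"
             and e.get("method") == "POST" and e.get("path") == "/orders"]
avg_latency = sum(e["latency_ms"] for e in completed) / len(completed)
print(avg_latency)  # 300.0
```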
What not to log
Structured logging makes it easy to add fields. That ease creates risk:
Never log these:
- Passwords, API keys, tokens, or secrets of any kind.
- Full credit card numbers, SSNs, or government IDs.
- Personal health information.
- Raw request bodies that may contain user-submitted PII.
Mitigations:
- Allowlist fields rather than blocklist. Only log fields you explicitly choose.
- Redact or hash sensitive values before they reach the logger.
- Use a log pipeline processor (Fluent Bit, Vector) to strip fields matching patterns like `password`, `secret`, `token`.
In application code, the same idea is a small sanitizer applied to each event before it reaches the logger:

```python
REDACT_FIELDS = {"password", "credit_card", "ssn", "authorization"}

def sanitize(event: dict) -> dict:
    # Replace sensitive values wholesale; keys are matched exactly here,
    # so nested dicts and case variants need extra handling in practice.
    return {
        k: "***REDACTED***" if k in REDACT_FIELDS else v
        for k, v in event.items()
    }
```
Log sampling
A service handling 50,000 requests per second produces enormous log volume. Not every request needs full logging.
Head sampling decides at the start of a request whether to log it. For example, log 10% of successful requests but 100% of errors.
```python
import random

def should_log_request(status_code: int) -> bool:
    # Always keep errors; sample 10% of successful requests.
    if status_code >= 400:
        return True
    return random.random() < 0.1
```
Dynamic sampling adjusts the rate based on current conditions. During normal operation, sample at 1%. When error rates spike, automatically increase to 100%.
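A minimal sketch of dynamic sampling, tracking the recent error rate in a sliding window; the thresholds and window size are illustrative, not tuned values:

```python
import random
from collections import deque

class DynamicSampler:
    """Raise the sampling rate for successes when the recent error rate spikes."""

    def __init__(self, base_rate=0.01, window=1000, error_threshold=0.05):
        self.base_rate = base_rate
        self.error_threshold = error_threshold
        self.recent = deque(maxlen=window)  # True = request errored

    def should_log(self, status_code: int) -> bool:
        is_error = status_code >= 400
        self.recent.append(is_error)
        if is_error:
            return True  # errors are always logged at full volume
        error_rate = sum(self.recent) / len(self.recent)
        if error_rate > self.error_threshold:
            return True  # incident in progress: log everything
        return random.random() < self.base_rate

sampler = DynamicSampler()
```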
The key rule: always log errors at full volume. You never want to miss the one log line that explains a production incident.
Choosing a logging library
Most languages have mature structured logging libraries:
| Language | Library | Notes |
|---|---|---|
| Python | structlog | Context variables, processor pipeline, JSON output |
| Go | slog (stdlib) | Built into Go 1.21+, structured by design |
| Java | Logback + logstash-encoder | JSON encoder for the standard SLF4J facade |
| Node.js | pino | Fast JSON logger, low overhead |
| Rust | tracing + tracing-subscriber | Structured spans and events |
The library matters less than the discipline. Pick one, configure JSON output, enforce it across all services, and add it to your service template so new services start with structured logging from day one.
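If no dedicated library is available, the stdlib `logging` module can be bent into shape with a custom formatter. This is a sketch, not a replacement for the libraries above; the trick of diffing against a default `LogRecord` to recover `extra=` fields is one common approach, not an official API:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Minimal JSON formatter for the stdlib logger."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything not on a default LogRecord arrived via `extra={...}`.
        default_attrs = logging.makeLogRecord({}).__dict__
        for key, value in record.__dict__.items():
            if key not in default_attrs:
                event[key] = value
        return json.dumps(event)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("request completed", extra={"status_code": 201, "latency_ms": 320})
```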
What comes next
Producing structured logs is the first step. Getting them into a searchable system is the next. *Log aggregation pipelines* covers how to ship, parse, enrich, and store logs using the ELK stack and Loki.