Log aggregation pipelines
A single container running on Kubernetes writes logs to stdout. When that container crashes, the logs disappear with it. Multiply by 200 pods across 30 services and you have a system where investigating any issue requires SSH-ing into nodes and hoping the relevant log lines still exist.
Log aggregation solves this by continuously shipping logs from every source to a centralized, searchable store. The pipeline has four stages: collection, parsing, enrichment, and storage.
Pipeline architecture
```mermaid
flowchart LR
    App1[App Pod 1] -->|stdout| Agent1[Fluent Bit<br/>DaemonSet]
    App2[App Pod 2] -->|stdout| Agent1
    App3[App Pod 3] -->|stdout| Agent2[Fluent Bit<br/>DaemonSet]
    App4[App Pod 4] -->|stdout| Agent2
    Agent1 --> Aggregator[Fluentd / Vector<br/>Aggregator]
    Agent2 --> Aggregator
    Aggregator -->|parsed + enriched| Store[(Elasticsearch<br/>or Loki)]
    Store --> UI[Kibana / Grafana]
```
Lightweight agents on each node collect logs. An aggregator layer handles parsing and enrichment before forwarding to storage.
This two-tier architecture separates concerns. The node agent (Fluent Bit) is lightweight and handles collection. The aggregator (Fluentd or Vector) handles the heavy lifting of parsing, filtering, and routing.
Collection: getting logs off the machine
Fluent Bit
Fluent Bit is a lightweight log processor designed for edge and container environments. It runs as a DaemonSet in Kubernetes, tailing log files from each node.
```yaml
# fluent-bit-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush             5
        Log_Level         info
        Parsers_File      parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            cri
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     10MB

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        K8S-Logging.Parser  On

    [OUTPUT]
        Name   forward
        Match  *
        Host   fluentd-aggregator
        Port   24224
```
The Kubernetes filter enriches each log line with pod name, namespace, labels, and container metadata. This means you can filter logs by `namespace=production` or `app=order-api` without any application-level changes.
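After the filter runs, an enriched record might look like the following (the pod and node names are illustrative; the `kubernetes` key and its subfields follow Fluent Bit's kubernetes filter output):

```json
{
  "log": "charge failed",
  "stream": "stdout",
  "kubernetes": {
    "pod_name": "order-api-7d9f8b6c4-x2lkq",
    "namespace_name": "production",
    "container_name": "order-api",
    "labels": { "app": "order-api" },
    "host": "node-3"
  }
}
```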
Filebeat (ELK alternative)
If you run the Elastic stack, Filebeat is the native log shipper. It ships directly to Elasticsearch or Logstash:
```yaml
# filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}

output.logstash:
  hosts: ["logstash:5044"]
```
Parsing and enrichment
Raw container logs arrive as single strings. Parsing extracts structure.
JSON parsing
If your application emits structured JSON (as covered in the structured logging article), parsing is straightforward. The aggregator detects JSON and extracts fields automatically.
```
# Raw log line from container runtime
{"log":"{\"timestamp\":\"2026-04-20T14:23:01Z\",\"level\":\"error\",\"service\":\"order-api\",\"message\":\"charge failed\",\"error_code\":\"card_declined\"}\n","stream":"stdout","time":"2026-04-20T14:23:01.442Z"}
```
The pipeline unwraps the container runtime envelope, parses the inner JSON, and produces a flat document with level, service, message, and error_code as searchable fields.
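In practice this is what the aggregator's JSON parser does for you, but the logic is simple enough to sketch. A minimal unwrap step, assuming the Docker-style JSON envelope shown above, might look like:

```python
import json

def unwrap(raw_line):
    """Unwrap the container runtime envelope and parse the inner JSON payload.

    Returns a flat dict of searchable fields; non-JSON payloads fall back
    to a plain "message" field instead of failing the pipeline.
    """
    envelope = json.loads(raw_line)
    inner = envelope["log"].strip()
    try:
        doc = json.loads(inner)
    except json.JSONDecodeError:
        doc = {"message": inner}
    doc["stream"] = envelope.get("stream")
    return doc

raw = ('{"log":"{\\"level\\":\\"error\\",\\"service\\":\\"order-api\\","'
       '"stream":"stdout"}')  # truncated for brevity; see the full line above
```

The fallback branch matters: a pipeline that drops or rejects unparseable lines silently loses exactly the logs you need during an incident.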
Grok parsing for legacy apps
Not every application produces JSON. Legacy apps write unstructured text. Grok patterns extract fields using named regex groups:
```
# Logstash grok filter
filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:service}\] %{GREEDYDATA:message}"
    }
  }
}
```
Grok is fragile. Any change to the log format breaks the pattern. This is why structured logging is worth the upfront investment.
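Under the hood, grok patterns compile down to named regex groups. A rough Python equivalent of the pattern above (the regex is a simplified stand-in, not Logstash's actual TIMESTAMP_ISO8601 definition) makes the fragility concrete:

```python
import re

# Simplified equivalent of the grok pattern: timestamp, level, [service], message.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\S*)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<service>[^\]]+)\]\s+"
    r"(?P<message>.*)"
)

line = "2026-04-20T14:23:01Z ERROR [order-api] charge failed: card_declined"
match = LOG_PATTERN.match(line)
fields = match.groupdict() if match else {}
```

If a developer swaps the bracket style or reorders the level and timestamp, `match` is `None` and every field silently disappears from search.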
Enrichment
Beyond parsing, the aggregator can add fields that help with querying:
- GeoIP: Map client IP addresses to countries and cities.
- Service catalog: Add team ownership based on service name.
- Environment tagging: Add `env=production` or `env=staging` based on Kubernetes namespace.
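An enrichment step like the last two bullets is a small lookup applied per record. A sketch, assuming a hypothetical in-memory service catalog (a real deployment would source this from a CMDB or service registry):

```python
# Illustrative service-to-team mapping; in production this would come
# from a service catalog, not a hard-coded dict.
CATALOG = {"order-api": "payments-team"}

def enrich(record):
    """Tag a parsed log record with env and team fields before indexing."""
    namespace = record.get("kubernetes", {}).get("namespace_name", "")
    record["env"] = "production" if namespace == "production" else "staging"
    record["team"] = CATALOG.get(record.get("service"), "unknown")
    return record
```

Because enrichment runs in the aggregator, every service gets these fields without touching application code.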
The ELK stack
ELK stands for Elasticsearch, Logstash, and Kibana. It is the most established log aggregation stack.
Elasticsearch stores and indexes logs. It supports full-text search, structured queries, and aggregations. Each log event becomes a JSON document indexed by timestamp.
Logstash is the aggregation and processing layer. It receives logs, applies filters (grok, mutate, geoip), and sends them to Elasticsearch.
Kibana is the visualization layer. It provides a search interface, dashboards, and the Discover view for ad-hoc log exploration.
ELK is powerful but operationally heavy. Elasticsearch clusters require careful sizing, shard management, and index lifecycle policies. Storage costs scale linearly with log volume.
Loki: the lightweight alternative
Grafana Loki takes a different approach. Instead of indexing log content (like Elasticsearch does), Loki indexes only labels (like Prometheus does). The log content itself is stored compressed in object storage.
```yaml
# Loki configuration
auth_enabled: false

server:
  http_listen_port: 3100

common:
  ring:
    kvstore:
      store: inmemory
  replication_factor: 1
  path_prefix: /loki

schema_config:
  configs:
    - from: 2026-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  aws:
    s3: s3://us-east-1/loki-logs
    bucketnames: loki-logs
    region: us-east-1
```
Querying Loki uses LogQL, which is syntactically similar to PromQL:
```
{service="order-api", level="error"} |= "card_declined"

{namespace="production"} | json | latency_ms > 1000

sum(rate({service="order-api"} | json | level="error" [5m])) by (error_code)
```
Loki is significantly cheaper to run than Elasticsearch because object storage (S3/GCS) costs a fraction of SSD-backed Elasticsearch nodes. The trade-off is that full-text search is slower since Loki must scan log content rather than consult an index.
Retention and cost management
Log storage grows continuously. Without lifecycle management, costs compound:
Index lifecycle policies (Elasticsearch)
```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```
Tiered retention
Not all logs deserve the same retention:
| Log type | Retention | Rationale |
|---|---|---|
| Error logs | 90 days | Needed for root cause analysis of recurring issues |
| Access logs | 30 days | Useful for recent debugging, expensive at volume |
| Debug logs | 7 days | Only needed during active investigation |
| Audit logs | 1 year+ | Compliance requirement, low volume |
Cost reduction strategies
- Sample verbose logs. Keep 100% of errors, 10% of info-level access logs.
- Drop noisy fields. Remove full request headers and bodies before indexing.
- Compress aggressively. Loki with object storage achieves 10-20x compression.
- Use cold storage tiers. Move old indices to S3 Glacier or equivalent.
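The first strategy, level-based sampling, is typically a few lines in the aggregator. A sketch with illustrative rates (keep everything at error and warn, 10% of info, drop debug):

```python
import random

# Illustrative per-level sampling rates; tune per service and environment.
SAMPLE_RATES = {"error": 1.0, "warn": 1.0, "info": 0.1, "debug": 0.0}

def should_keep(record):
    """Decide whether to forward a log record downstream.

    Unknown levels default to keeping the record, so sampling can never
    silently drop a log type you have not explicitly configured.
    """
    rate = SAMPLE_RATES.get(record.get("level", "info"), 1.0)
    return random.random() < rate
```

The safe default for unknown levels is deliberate: sampling should only ever drop what you have opted in to dropping.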
What comes next
With logs flowing into a central store, you need to know when something goes wrong. Alerting design covers how to build alerts that are actionable and not noisy, using both metric-based and log-based alert rules.