
Log aggregation pipelines

In this series (10 parts)
  1. The three pillars of observability
  2. Structured logging
  3. Metrics and Prometheus
  4. Grafana and dashboards
  5. Distributed tracing
  6. Log aggregation pipelines
  7. Alerting design
  8. SLIs, SLOs, and error budgets
  9. Real User Monitoring and synthetic testing
  10. On-call tooling and runbooks

A single container running on Kubernetes writes logs to stdout. When that container crashes, the logs disappear with it. Multiply by 200 pods across 30 services and you have a system where investigating any issue requires SSH-ing into nodes and hoping the relevant log lines still exist.

Log aggregation solves this by continuously shipping logs from every source to a centralized, searchable store. The pipeline has four stages: collection, parsing, enrichment, and storage.

Pipeline architecture

flowchart LR
  App1[App Pod 1] -->|stdout| Agent1[Fluent Bit<br/>DaemonSet]
  App2[App Pod 2] -->|stdout| Agent1
  App3[App Pod 3] -->|stdout| Agent2[Fluent Bit<br/>DaemonSet]
  App4[App Pod 4] -->|stdout| Agent2

  Agent1 --> Aggregator[Fluentd / Vector<br/>Aggregator]
  Agent2 --> Aggregator

  Aggregator -->|parsed + enriched| Store[(Elasticsearch<br/>or Loki)]
  Store --> UI[Kibana / Grafana]

Lightweight agents on each node collect logs. An aggregator layer handles parsing and enrichment before forwarding to storage.

This two-tier architecture separates concerns. The node agent (Fluent Bit) is lightweight and handles collection. The aggregator (Fluentd or Vector) handles the heavy lifting of parsing, filtering, and routing.
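The Fluent Bit output above forwards to a fluentd-aggregator service on port 24224. A minimal Fluentd configuration for that aggregator might look like this (a sketch, not a production config; the Elasticsearch host name and buffer path are assumptions, and the elasticsearch output requires the fluent-plugin-elasticsearch plugin):

```
# fluentd-aggregator.conf (illustrative sketch)
<source>
  @type forward        # receives records from the Fluent Bit forward output
  port 24224
  bind 0.0.0.0
</source>

<match kube.**>
  @type elasticsearch  # assumes fluent-plugin-elasticsearch is installed
  host elasticsearch   # hypothetical service name
  port 9200
  logstash_format true # writes daily logstash-YYYY.MM.DD indices
  <buffer>
    @type file
    path /var/log/fluentd-buffer
    flush_interval 10s
  </buffer>
</match>
```

The file buffer matters here: if Elasticsearch is briefly unavailable, the aggregator spools records to disk instead of dropping them.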

Collection: getting logs off the machine

Fluent Bit

Fluent Bit is a lightweight log processor designed for edge and container environments. It runs as a DaemonSet in Kubernetes, tailing log files from each node.

# fluent-bit-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Log_Level    info
        Parsers_File parsers.conf

    [INPUT]
        Name             tail
        Path             /var/log/containers/*.log
        Parser           cri
        Tag              kube.*
        Refresh_Interval 5
        Mem_Buf_Limit    10MB

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        K8S-Logging.Parser  On

    [OUTPUT]
        Name          forward
        Match         *
        Host          fluentd-aggregator
        Port          24224

The Kubernetes filter enriches each log line with pod name, namespace, labels, and container metadata. This means you can filter logs by namespace=production or app=order-api without any application-level changes.
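After the filter runs, each record carries metadata shaped roughly like this (field values are illustrative):

```
{
  "log": "charge failed",
  "stream": "stdout",
  "kubernetes": {
    "pod_name": "order-api-7d9f6b5c4-x2k8p",
    "namespace_name": "production",
    "labels": { "app": "order-api" },
    "container_name": "order-api",
    "host": "node-3"
  }
}
```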

Filebeat (ELK alternative)

If you run the Elastic stack, Filebeat is the native log shipper. It ships directly to Elasticsearch or Logstash:

# filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}

output.logstash:
  hosts: ["logstash:5044"]
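On the receiving side, a minimal Logstash pipeline listens on the beats port referenced above (a sketch; the Elasticsearch host and index pattern are assumptions):

```
# logstash.conf
input {
  beats {
    port => 5044
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
```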

Parsing and enrichment

Raw container logs arrive as single strings. Parsing extracts structure.

JSON parsing

If your application emits structured JSON (as covered in the structured logging article), parsing is straightforward. The aggregator detects JSON and extracts fields automatically.

# Raw log line from container runtime
{"log":"{\"timestamp\":\"2026-04-20T14:23:01Z\",\"level\":\"error\",\"service\":\"order-api\",\"message\":\"charge failed\",\"error_code\":\"card_declined\"}\n","stream":"stdout","time":"2026-04-20T14:23:01.442Z"}

The pipeline unwraps the container runtime envelope, parses the inner JSON, and produces a flat document with level, service, message, and error_code as searchable fields.
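The unwrapping step can be sketched in a few lines of Python (a simplified stand-in for what the aggregator does; real pipelines also handle partial lines and non-JSON payloads):

```python
import json

# Raw line as emitted by the container runtime (json-file log driver format).
raw = ('{"log":"{\\"timestamp\\":\\"2026-04-20T14:23:01Z\\",\\"level\\":\\"error\\",'
       '\\"service\\":\\"order-api\\",\\"message\\":\\"charge failed\\",'
       '\\"error_code\\":\\"card_declined\\"}\\n","stream":"stdout",'
       '"time":"2026-04-20T14:23:01.442Z"}')

envelope = json.loads(raw)            # outer container-runtime envelope
event = json.loads(envelope["log"])   # inner application JSON
event["stream"] = envelope["stream"]  # keep useful envelope fields

print(event["level"], event["error_code"])  # error card_declined
```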

Grok parsing for legacy apps

Not every application produces JSON. Legacy apps write unstructured text. Grok patterns extract fields using named regex groups:

# Logstash grok filter
filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:service}\] %{GREEDYDATA:message}"
    }
    # Without overwrite, grok appends the capture to the existing
    # message field, turning it into an array instead of replacing it.
    overwrite => ["message"]
  }
}
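A rough Python equivalent shows what the pattern extracts from a hypothetical log line (the line format and the simplified timestamp match are assumptions):

```python
import re

# Named-group regex approximating the grok pattern above.
pattern = re.compile(
    r"(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+"
    r"\[(?P<service>[^\]]+)\]\s+(?P<message>.*)"
)

line = "2026-04-20T14:23:01Z ERROR [order-api] charge failed: card_declined"
fields = pattern.match(line).groupdict()
print(fields["level"], fields["service"])  # ERROR order-api
```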

Grok is fragile. Any change to the log format breaks the pattern. This is why structured logging is worth the upfront investment.

Enrichment

Beyond parsing, the aggregator can add fields that help with querying:

  • GeoIP: Map client IP addresses to countries and cities.
  • Service catalog: Add team ownership based on service name.
  • Environment tagging: Add env=production or env=staging based on Kubernetes namespace.
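As one illustration, environment tagging in Fluentd can use a record_transformer filter keyed on the Kubernetes namespace (a sketch; assumes the kubernetes metadata filter has already run, so records carry a kubernetes.namespace_name field):

```
<filter kube.**>
  @type record_transformer
  enable_ruby true
  <record>
    # production namespace -> env=production, everything else -> env=staging
    env ${record.dig("kubernetes", "namespace_name") == "production" ? "production" : "staging"}
  </record>
</filter>
```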

The ELK stack

ELK stands for Elasticsearch, Logstash, Kibana. It is the most established log aggregation stack.

Elasticsearch stores and indexes logs. It supports full-text search, structured queries, and aggregations. Each log event becomes a JSON document indexed by timestamp.

Logstash is the aggregation and processing layer. It receives logs, applies filters (grok, mutate, geoip), and sends them to Elasticsearch.

Kibana is the visualization layer. It provides a search interface, dashboards, and the Discover view for ad-hoc log exploration.
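A structured query against the store looks like this in the Elasticsearch query DSL (the logs-* index pattern is an assumption):

```
GET /logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "level": "error" } },
        { "term":  { "service": "order-api" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```

Because level and service were extracted at parse time, these are cheap exact-match filters rather than full-text searches.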

ELK is powerful but operationally heavy. Elasticsearch clusters require careful sizing, shard management, and index lifecycle policies. Storage costs scale linearly with log volume.

Loki: the lightweight alternative

Grafana Loki takes a different approach. Instead of indexing log content (like Elasticsearch does), Loki indexes only labels (like Prometheus does). The log content itself is stored compressed in object storage.

# Loki configuration
auth_enabled: false
server:
  http_listen_port: 3100

common:
  ring:
    kvstore:
      store: inmemory
  replication_factor: 1
  path_prefix: /loki

schema_config:
  configs:
    - from: 2026-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  aws:
    bucketnames: loki-logs
    region: us-east-1

Querying Loki uses LogQL, which is syntactically similar to PromQL:

{service="order-api", level="error"} |= "card_declined"

{namespace="production"} | json | latency_ms > 1000

sum(rate({service="order-api"} | json | level="error" [5m])) by (error_code)

Loki is significantly cheaper to run than Elasticsearch because object storage (S3/GCS) costs a fraction of SSD-backed Elasticsearch nodes. The trade-off is that full-text search is slower since Loki must scan log content rather than consult an index.

Retention and cost management

Log storage grows continuously. Without lifecycle management, costs compound:

Index lifecycle policies (Elasticsearch)

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}

Tiered retention

Not all logs deserve the same retention:

Log type      Retention   Rationale
Error logs    90 days     Needed for root cause analysis of recurring issues
Access logs   30 days     Useful for recent debugging, expensive at volume
Debug logs    7 days      Only needed during active investigation
Audit logs    1 year+     Compliance requirement, low volume

Cost reduction strategies

  1. Sample verbose logs. Keep 100% of errors, 10% of info-level access logs.
  2. Drop noisy fields. Remove full request headers and bodies before indexing.
  3. Compress aggressively. Loki with object storage achieves 10-20x compression.
  4. Use cold storage tiers. Move old indices to S3 Glacier or equivalent.
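Sampling can live in the aggregator tier. With Vector, for example, a sample transform can pass every error through while keeping roughly one in ten of everything else (a sketch; the transform and input names are hypothetical):

```yaml
transforms:
  sample_access:
    type: sample
    inputs: ["parsed_logs"]  # hypothetical upstream transform
    rate: 10                 # keep ~1 in 10 matching events
    exclude:                 # events matching this bypass sampling entirely
      type: vrl
      source: .level == "error"
```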

What comes next

With logs flowing into a central store, you need to know when something goes wrong. Alerting design covers how to build alerts that are actionable and not noisy, using both metric-based and log-based alert rules.
