Kubernetes observability

In this series (14 parts)
  1. Why Kubernetes exists
  2. Kubernetes architecture
  3. Core Kubernetes objects
  4. Kubernetes networking
  5. Storage in Kubernetes
  6. Kubernetes configuration and secrets
  7. Resource management and autoscaling
  8. Kubernetes workload types
  9. Kubernetes observability
  10. Kubernetes security
  11. Helm and package management
  12. GitOps with ArgoCD
  13. Kubernetes cluster operations
  14. Service mesh concepts

A pod that restarts silently at 3 a.m. is worse than a pod that crashes loudly. Silent failures compound. By the time a user reports a problem, the root cause is buried under hours of cascading effects. Observability is the practice of making cluster behavior visible so you can diagnose issues before they become outages.

Kubernetes provides built-in tools for basic debugging. For production systems, you need external instrumentation: health probes, metrics pipelines, dashboards, and distributed traces. This article covers the full stack from kubectl debugging to OpenTelemetry collection.

Debugging with kubectl

Three commands handle most day-to-day debugging.

kubectl logs streams stdout and stderr from a container. Add -f to follow in real time, --previous to read logs from the last terminated container, and -c to target a specific container in a multi-container pod.

kubectl logs deploy/api-server --tail=100
kubectl logs pod/api-server-7d4b8c -c sidecar --previous

kubectl describe shows the full lifecycle of a resource. For pods, the Events section at the bottom reveals scheduling decisions, image pull errors, probe failures, and OOM kills. This is the first place to look when a pod is stuck in Pending or CrashLoopBackOff.

kubectl describe pod api-server-7d4b8c

kubectl get events lists cluster-wide events sorted by time. Pipe it through grep to filter by resource name or reason.

kubectl get events --sort-by='.lastTimestamp' -n production
kubectl get events -n production | grep -i api-server

These tools are reactive. They tell you what already happened. Probes and metrics let you detect problems as they develop.

Health probes

Kubernetes supports three probe types. Each serves a different purpose, and misconfiguring them is one of the most common causes of unnecessary downtime.

Liveness probes tell the kubelet whether a container is alive. If the probe fails, the kubelet kills the container and restarts it. Use liveness probes for deadlock detection, not for dependency checks. A liveness probe that calls an external database will restart your pod every time the database is slow.

Readiness probes tell the kubelet whether a container is ready to accept traffic. If the probe fails, the pod is removed from Service endpoints. The container keeps running. Use readiness probes to drain traffic during deployments or when a dependency is temporarily unavailable.

Startup probes run only during container initialization. They disable liveness and readiness checks until the startup probe succeeds. Use them for slow-starting applications like Java services that need 30+ seconds to initialize.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: api-server:1.4.0
          ports:
            - containerPort: 8080
          startupProbe:
            httpGet:
              path: /healthz
              port: 8080
            failureThreshold: 30
            periodSeconds: 2
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 2

The startup probe gives the container up to 60 seconds (30 attempts * 2 seconds) to start. Once it passes, the liveness probe takes over with a 10-second interval. The readiness probe checks a separate /ready endpoint that can return failure when downstream services are unreachable, pulling the pod from the load balancer without killing it.

Prometheus metrics collection

Probes answer a binary question: is this container healthy? Prometheus metrics answer quantitative questions: how many requests per second, what is the p99 latency, how much memory is allocated.

Prometheus scrapes HTTP endpoints that expose metrics in a text format. In Kubernetes, the Prometheus Operator simplifies configuration through custom resources. A ServiceMonitor tells Prometheus which Services to scrape and how.
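The exposition format is plain text: one sample per line, with labels in braces. A hypothetical /metrics response from an HTTP service might look like this (metric names here are illustrative, not a fixed standard):

```
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/orders",status="200"} 10423
http_requests_total{method="POST",path="/orders",status="500"} 17
# HELP http_request_duration_seconds Request latency in seconds.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 9871
http_request_duration_seconds_bucket{le="0.5"} 10390
http_request_duration_seconds_bucket{le="+Inf"} 10440
http_request_duration_seconds_sum 512.3
http_request_duration_seconds_count 10440
```

Client libraries for most languages generate this format for you; you rarely write it by hand.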

flowchart LR
  A[Application Pods] -->|/metrics| B[Prometheus]
  B -->|PromQL queries| C[Grafana]
  B -->|alerting rules| D[Alertmanager]
  D -->|notifications| E[Slack / PagerDuty]
  F[ServiceMonitor CRD] -->|configures scrape targets| B
  G[kube-state-metrics] -->|cluster state| B
  H[node-exporter] -->|host metrics| B

Observability pipeline: Prometheus scrapes application pods and cluster exporters, Grafana queries Prometheus for dashboards, and Alertmanager routes alerts to notification channels.

First, deploy the Prometheus Operator using its Helm chart. Then create a ServiceMonitor that matches your application’s Service by label.
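Assuming the community chart repository and a dedicated monitoring namespace (both conventional choices, not requirements), the installation looks roughly like this:

```
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
```

The release name you choose here matters later: it appears in the labels that the operator uses to discover ServiceMonitors.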

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-server-monitor
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: api-server
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - production

The release: kube-prometheus-stack label is critical. The Prometheus Operator uses a label selector to discover ServiceMonitors. If this label does not match, your ServiceMonitor is silently ignored. Check the serviceMonitorSelector field on the Prometheus custom resource if scraping is not working.
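To see which labels your Prometheus instance actually selects on (assuming the operator was installed in a monitoring namespace), you can read the selector straight off the Prometheus custom resource:

```
kubectl get prometheus -n monitoring \
  -o jsonpath='{.items[0].spec.serviceMonitorSelector}'
```

Whatever labels this prints must appear on every ServiceMonitor you want discovered.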

Two exporters provide cluster-level metrics out of the box. kube-state-metrics exposes the state of Kubernetes objects: deployment replica counts, pod phases, node conditions. node-exporter exposes host-level metrics: CPU, memory, disk, and network. Both are included in the kube-prometheus-stack Helm chart.
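For example, these PromQL queries use metric names exposed by the two exporters; the namespace label value and thresholds are illustrative:

```
# Deployments running fewer available replicas than desired (kube-state-metrics)
kube_deployment_spec_replicas{namespace="production"}
  - kube_deployment_status_replicas_available{namespace="production"} > 0

# Nodes with less than 10% of memory available (node-exporter)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
```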

Grafana dashboards

Grafana connects to Prometheus as a data source and renders PromQL queries as time-series panels. The kube-prometheus-stack ships preconfigured dashboards for node health, pod resource usage, and namespace-level metrics.

For application dashboards, focus on the four golden signals: latency, traffic, errors, and saturation. A practical starting dashboard includes:

  • Request rate per service and endpoint, broken down by HTTP status code.
  • Latency histograms showing p50, p90, and p99 response times.
  • Error rate as a percentage of total requests.
  • Pod resource utilization comparing requests, limits, and actual usage.
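Assuming the application exposes a counter like http_requests_total and a histogram like http_request_duration_seconds (the names are illustrative), the four signals map to PromQL along these lines:

```
# Traffic: requests per second, broken down by status code
sum by (status) (rate(http_requests_total[5m]))

# Latency: p99 computed from histogram buckets
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Errors: 5xx responses as a fraction of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Saturation: working-set memory against the configured limit
container_memory_working_set_bytes
  / on (namespace, pod, container)
  kube_pod_container_resource_limits{resource="memory"}
```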

Set Grafana alerts on the same PromQL expressions that power your dashboards. A common rule fires when the error rate exceeds 1% of total traffic over a 5-minute window. Route alerts through Alertmanager to deduplicate, group, and silence notifications.
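With the Prometheus Operator, that error-rate rule can also live in a PrometheusRule resource so it is versioned alongside your manifests. This is a sketch: the metric names are illustrative, and the release label must match your operator's ruleSelector:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-server-alerts
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: api-server.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.01
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "Error rate above 1% of traffic for 5 minutes"
```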

OpenTelemetry in Kubernetes

Metrics and logs tell you that something is wrong. Distributed tracing tells you where. OpenTelemetry (OTel) is the vendor-neutral standard for collecting traces, metrics, and logs from applications.

The OpenTelemetry Collector runs as a DaemonSet or Deployment in your cluster. Applications send telemetry to the Collector using OTLP (OpenTelemetry Protocol). The Collector processes, batches, and exports the data to backends like Jaeger, Tempo, or a managed service.

A typical setup uses the DaemonSet mode so every node has a local Collector. Applications send traces to localhost:4317, avoiding cross-node network hops. The DaemonSet Collectors then forward data to a central gateway Collector that handles export.
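A minimal Collector configuration for the DaemonSet tier in this pattern might look like the following; the gateway Service address is a placeholder for whatever your central Collector is named:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  # Batch spans before export to reduce outbound requests
  batch:
    timeout: 5s
exporters:
  otlp:
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

The gateway Collector would use a similar pipeline but swap the otlp exporter for your tracing backend, such as Jaeger or Tempo.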

Instrument your applications using the OTel SDK for your language. For HTTP services, auto-instrumentation libraries handle span creation for inbound and outbound requests. Add manual spans for business-critical operations like payment processing or inventory checks. Propagate trace context through HTTP headers so spans from different services join into a single trace.

The combination of Prometheus metrics and OTel traces gives you a powerful debugging workflow. Start with a metric anomaly on your Grafana dashboard. Use the exemplar feature to jump from a metric data point directly to a trace. Follow the trace across services to find the slow span. Pull up the logs for that span’s trace ID to read the full context.

What comes next

You can see what your cluster is doing. Now you need to lock it down. The next article covers Kubernetes security: RBAC policies, PodSecurity Standards, network policies, and admission controllers that prevent misconfigurations from reaching production.
