
Grafana and dashboards

In this series (10 parts)
  1. The three pillars of observability
  2. Structured logging
  3. Metrics and Prometheus
  4. Grafana and dashboards
  5. Distributed tracing
  6. Log aggregation pipelines
  7. Alerting design
  8. SLIs, SLOs, and error budgets
  9. Real User Monitoring and synthetic testing
  10. On-call tooling and runbooks

A metric that lives in Prometheus but never appears on a dashboard might as well not exist. Grafana is the visualization layer that turns time series into panels, alerts, and shared context for your team.

The goal is not to build beautiful dashboards. The goal is to build dashboards that answer the right questions fast enough to matter during an incident.

Connecting Prometheus as a data source

Grafana supports dozens of data sources. Prometheus is the most common for infrastructure and application metrics.

# grafana-datasource.yaml (provisioning)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: 15s
      httpMethod: POST
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ["service"]
      tracesToMetrics:
        datasourceUid: prometheus

Provisioning data sources as YAML means your Grafana configuration is version-controlled and reproducible.
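With Docker, the provisioning directory is mounted into the container. A minimal sketch, assuming a Compose setup (the image tag and local paths are illustrative, not from the original):

```yaml
# docker-compose.yml (illustrative)
services:
  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    volumes:
      # Grafana scans this tree on startup:
      # provisioning/datasources/, provisioning/dashboards/, ...
      - ./provisioning:/etc/grafana/provisioning
```

Grafana reads provisioning files from /etc/grafana/provisioning at startup, so the YAML above would land in ./provisioning/datasources/ on the host.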

Panel types that matter

Grafana offers many panel types. In practice, four cover 90% of use cases:

Time series is the default. It plots metrics over time and is the right choice for rates, latencies, and resource usage.

Stat shows a single current value with optional thresholds. Use for high-level indicators: current error rate, active users, deployment count.

Table displays tabular data. Useful for top-N queries like “top 10 endpoints by error count.”

Heatmap visualizes histogram distributions over time. Each cell shows how many observations fell into a bucket during a time window. This is the best way to see latency distribution shifts.

Building a RED dashboard

The RED method (Rate, Errors, Duration) gives every service a standard three-panel layout:

Panel 1: Request Rate

sum(rate(http_requests_total{service="$service"}[5m])) by (path)

Panel 2: Error Rate (%)

sum(rate(http_requests_total{service="$service",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="$service"}[5m]))
* 100

Panel 3: Latency (P50, P90, P99)

histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le)
) * 1000

Repeat the histogram quantile query with 0.90 and 0.50 to overlay multiple percentile lines.
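To build intuition for what histogram_quantile computes: it finds the bucket where the target rank falls and interpolates linearly within it. A minimal Python sketch of that estimation (not Prometheus's actual implementation):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: sorted list of (upper_bound, cumulative_count),
    ending with (inf, total) like Prometheus's +Inf bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                # Quantile falls in the +Inf bucket: return the upper
                # bound of the last finite bucket instead.
                return prev_le
            # Linear interpolation within the bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# 100 requests: 50 under 100ms, 90 under 500ms, all under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
p50 = histogram_quantile(0.50, buckets)  # 0.1
p90 = histogram_quantile(0.90, buckets)  # 0.5
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in.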

The USE method for infrastructure

RED works for services. The USE method (Utilization, Saturation, Errors) works for resources like CPU, memory, disk, and network:

Utilization: What fraction of the resource is in use?

# CPU utilization per node
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

Saturation: How much extra work is queued?

# CPU saturation (load average vs CPU count)
node_load1 / count without(cpu) (node_cpu_seconds_total{mode="idle"})

Errors: How many error events has the resource produced?

# Network interface receive errors
rate(node_network_receive_errs_total{device!="lo"}[5m])

A USE dashboard per node type (application servers, database servers, cache nodes) gives you infrastructure visibility.

Template variables

Variables make dashboards reusable across services, environments, and clusters. Instead of hardcoding service="order-api", use a variable $service populated from a query.

# Variable: service
# Query: label_values(http_requests_total, service)

This creates a dropdown at the top of the dashboard listing every service. Selecting a service reloads all panels with filtered data.

Common variables:

Variable     Query                                         Purpose
$service     label_values(http_requests_total, service)    Filter by service
$namespace   label_values(kube_pod_info, namespace)        Filter by K8s namespace
$instance    label_values(up{job="$service"}, instance)    Filter by instance
$interval    Custom: 1m,5m,15m,1h                          Control aggregation window

Chain variables so selecting a namespace filters the service dropdown to services in that namespace.
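A chained pair might look like this, in the same variable/query notation as above (the label names assume your metrics carry a namespace label):

```text
# Variable: namespace
# Query: label_values(kube_pod_info, namespace)

# Variable: service (refreshes when $namespace changes)
# Query: label_values(http_requests_total{namespace="$namespace"}, service)
```

Because the second query references $namespace, Grafana re-runs it whenever the namespace selection changes.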

Alerting in Grafana

Grafana supports native alerting that evaluates queries and sends notifications. For Prometheus-based setups, you have two options:

  1. Prometheus Alertmanager: Define alert rules in Prometheus, route through Alertmanager. This is the standard approach.
  2. Grafana Unified Alerting: Define alert rules directly in Grafana. Useful when you have multiple data sources.

For most teams, Prometheus Alertmanager is preferable because alert rules live in version control alongside recording rules. Grafana alerting is useful for multi-source alerts (e.g., “error rate is high AND deployment happened in the last 10 minutes”).
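The version-controlled Prometheus side might look like this sketch (service name, threshold, and durations are illustrative):

```yaml
# alerts.yml (loaded via rule_files in prometheus.yml)
groups:
  - name: order-api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="order-api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="order-api"}[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "order-api 5xx rate above 5% for 10 minutes"
```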

Dashboard as code

Clicking through the Grafana UI to build dashboards is fine for prototyping. For production dashboards, treat them as code.

Grafonnet is a Jsonnet library for generating Grafana dashboard JSON:

local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;

dashboard.new(
  'Order API - RED',
  tags=['generated', 'red'],
  schemaVersion=30,
)
.addPanel(
  graphPanel.new(
    'Request Rate',
    datasource='Prometheus',
  )
  .addTarget(
    prometheus.target(
      'sum(rate(http_requests_total{service="order-api"}[5m])) by (path)',
      legendFormat='{{ path }}',
    )
  ),
  gridPos={ h: 8, w: 12, x: 0, y: 0 },
)

Run jsonnet to compile to JSON, then provision via the Grafana API or a ConfigMap.
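That pipeline might look like this, assuming a vendored grafonnet library and an API token in the environment (paths and names are placeholders):

```shell
# Compile Jsonnet to dashboard JSON
jsonnet -J vendor dashboards/order-api-red.jsonnet > order-api-red.json

# Push via the Grafana HTTP API; the payload wraps the dashboard JSON
curl -s -X POST "http://grafana:3000/api/dashboards/db" \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"dashboard\": $(cat order-api-red.json), \"overwrite\": true}"
```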

Terraform can also manage Grafana dashboards using the grafana_dashboard resource. This fits well if your infrastructure is already Terraform-managed.
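With the grafana Terraform provider, the resource is a thin wrapper around the compiled JSON (provider configuration omitted; names are illustrative):

```hcl
# Requires the grafana/grafana Terraform provider
resource "grafana_dashboard" "order_api_red" {
  # Compiled JSON from the Jsonnet step, checked into the repo
  config_json = file("${path.module}/dashboards/order-api-red.json")
}
```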

The key benefit of dashboard-as-code is review. Changes go through pull requests, and you can see exactly what a dashboard looked like at any point in history.

Dashboard design principles

Effective dashboards follow a few rules:

  1. Top row answers “is everything okay?” A row of stat panels showing aggregate error rate, request rate, and P99 latency across all services. Green means fine. Red means dig deeper.

  2. Drill down by row. Second row breaks down by service. Third row breaks down by endpoint within a service.

  3. Time range consistency. All panels should use the dashboard time picker, not hardcoded ranges. This lets the on-call engineer zoom in to the incident window.

  4. Link to traces and logs. Grafana supports data links. Clicking a spike on a metrics panel can open Tempo filtered by the same service and time range.

  5. Avoid vanity dashboards. If a panel has not been looked at during an incident or a planning meeting in the last 90 days, remove it. Unused panels slow load times and dilute attention.

RED panels are consulted in nearly every incident. Infrastructure panels are useful but secondary.

What comes next

Dashboards show you the current state of your system. Next in the series is distributed tracing, which follows a single request across service boundaries to show where the time actually went.
