Metrics and Prometheus
Metrics tell you what is happening across your entire system right now. Unlike logs, which record individual events, metrics aggregate: a counter that incremented 50,000 times in the last minute is stored as a single time-series data point. This compression is what makes metrics cheap to store and fast to query.
Prometheus is the dominant open-source metrics system. It uses a pull model and a powerful query language called PromQL, and it integrates tightly with Kubernetes and Grafana.
The scrape model
Prometheus pulls metrics from your services by making HTTP GET requests to a /metrics endpoint. Each service exposes its current metric values in a simple text format.
# HELP http_requests_total Total HTTP requests received.
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/orders",status="200"} 142857
http_requests_total{method="POST",path="/api/orders",status="201"} 8432
http_requests_total{method="POST",path="/api/orders",status="500"} 17
# HELP http_request_duration_seconds Request duration in seconds.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.01"} 98234
http_request_duration_seconds_bucket{le="0.05"} 130000
http_request_duration_seconds_bucket{le="0.1"} 138500
http_request_duration_seconds_bucket{le="0.5"} 141900
http_request_duration_seconds_bucket{le="1.0"} 142800
http_request_duration_seconds_bucket{le="+Inf"} 142857
http_request_duration_seconds_sum 4821.3
http_request_duration_seconds_count 142857
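To make the format concrete, here is a minimal sketch of a parser for the subset shown above (HELP/TYPE comments plus samples with optional labels). The real exposition format also covers escaping, timestamps, and exemplars, so treat this as illustrative, not a substitute for a client library.

```python
import re

# Matches one sample line: metric name, optional {k="v",...} labels, value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'             # optional label set
    r'\s+(?P<value>\S+)$'                     # sample value
)

def parse_exposition(text):
    """Return a list of (name, labels_dict, float_value) tuples.
    Handles only the simple subset above; label values containing
    commas or escaped quotes would need the full grammar."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):  # skip HELP/TYPE comments
            continue
        m = SAMPLE_RE.match(line)
        if not m:
            continue
        labels = {}
        if m.group('labels'):
            for pair in m.group('labels').split(','):
                k, _, v = pair.partition('=')
                labels[k.strip()] = v.strip().strip('"')
        samples.append((m.group('name'), labels, float(m.group('value'))))
    return samples

page = '''
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/orders",status="200"} 142857
http_request_duration_seconds_sum 4821.3
'''
print(parse_exposition(page))
```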
The scrape interval (typically 15 or 30 seconds) determines resolution. Prometheus stores the scraped values as time series data points.
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "order-api"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
Metric types
Counter
Monotonically increasing value. Resets to zero only on process restart. Use rate() or increase() to get meaningful values.
http_requests_total
errors_total
bytes_sent_total
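The reset behavior is why you never graph a raw counter. A minimal sketch of what increase() and rate() compute, assuming evenly spaced (timestamp, value) samples; real PromQL additionally extrapolates to the window boundaries, which this omits:

```python
def counter_increase(samples):
    """Total increase across ordered (timestamp, value) samples,
    compensating for resets: a drop means the process restarted,
    so the new value counts from zero."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        total += curr - prev if curr >= prev else curr
    return total

def counter_rate(samples):
    """Average per-second rate over the window, like a simplified rate()."""
    span = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / span if span > 0 else 0.0

# A reset happens between t=30 and t=45 (value drops from 1000 to 50).
window = [(0, 400), (15, 700), (30, 1000), (45, 50), (60, 350)]
print(counter_increase(window))  # 950
print(counter_rate(window))      # ~15.83 per second
```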
Gauge
A value that goes up and down. Represents a current snapshot.
temperature_celsius
queue_depth
active_connections
memory_usage_bytes
Histogram
Distributes observations into configurable buckets. Essential for latency measurement because averages hide outliers.
http_request_duration_seconds_bucket{le="0.05"} -- requests under 50ms
http_request_duration_seconds_bucket{le="0.1"} -- requests under 100ms
http_request_duration_seconds_bucket{le="+Inf"} -- all requests
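Quantiles are recovered from these cumulative buckets by linear interpolation inside the bucket where the target rank falls. A sketch of that calculation, using the bucket counts from the sample page above (the PromQL built-in handles more edge cases, such as quantiles in the +Inf bucket across many series):

```python
import math

def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count).
    Mirrors PromQL's linear interpolation within the target bucket."""
    total = buckets[-1][1]            # count in the +Inf bucket
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):     # rank falls in the +Inf bucket:
                return prev_bound     # return the highest finite bound
            width = bound - prev_bound
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + width * frac
        prev_bound, prev_count = bound, count

buckets = [(0.01, 98234), (0.05, 130000), (0.1, 138500),
           (0.5, 141900), (1.0, 142800), (math.inf, 142857)]
print(histogram_quantile(0.99, buckets))  # ~0.44 seconds
```

Note how coarse the answer is: the true P99 lies somewhere in the 100ms-500ms bucket, and interpolation can only estimate where. Bucket boundaries should be chosen around the latencies you care about.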
Summary
Calculates quantiles on the client side. Less flexible than histograms because you cannot aggregate summaries across instances. Prefer histograms unless you have a specific reason to use summaries.
PromQL basics
PromQL selects and transforms time series. Start with a metric name and optional label filters:
http_requests_total{service="order-api", status=~"5.."}
This selects all time series for http_requests_total where service is order-api and status matches the regex 5.. (any 5xx status).
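The four matcher operators (=, !=, =~, !~) can be sketched as a filter over label sets. One detail worth knowing: PromQL fully anchors regex matchers, so "5.." matches "503" but not "1503" (hence fullmatch below). This is an illustrative model, not the real engine:

```python
import re

def matches(labels, matchers):
    """labels: one series' label dict. matchers: list of
    (name, op, value) with op in {'=', '!=', '=~', '!~'}.
    A label absent from the series behaves as the empty string,
    matching PromQL semantics."""
    for name, op, value in matchers:
        actual = labels.get(name, "")
        if op == '=' and actual != value: return False
        if op == '!=' and actual == value: return False
        if op == '=~' and not re.fullmatch(value, actual): return False
        if op == '!~' and re.fullmatch(value, actual): return False
    return True

series = [
    {"service": "order-api", "status": "200"},
    {"service": "order-api", "status": "503"},
    {"service": "cart-api",  "status": "500"},
]
sel = [("service", "=", "order-api"), ("status", "=~", "5..")]
print([s for s in series if matches(s, sel)])
# only the order-api series with status 503 survives
```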
Rate and increase
Counters are cumulative. To get per-second request rate over the last 5 minutes:
rate(http_requests_total[5m])
To get the total increase over the last hour:
increase(http_requests_total[1h])
Aggregation
Sum the request rate across all instances of a service:
sum(rate(http_requests_total[5m])) by (service)
Get the top 5 endpoints by error rate:
topk(5,
sum(rate(http_requests_total{status=~"5.."}[5m])) by (path)
)
Histogram quantiles
P99 latency from a histogram:
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
P50 (median) latency:
histogram_quantile(0.50,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
The RED method
RED stands for Rate, Errors, Duration. It gives you three queries per service that answer the most important questions:
# Rate: requests per second
sum(rate(http_requests_total{service="order-api"}[5m]))
# Errors: error percentage
sum(rate(http_requests_total{service="order-api",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="order-api"}[5m]))
* 100
# Duration: P99 latency in milliseconds
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="order-api"}[5m])) by (le)
) * 1000
Every service should expose these three signals. They map directly to user experience: are requests flowing, are they succeeding, and are they fast?
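To show what exposing these signals takes, here is a stdlib-only sketch of a /metrics endpoint serving the counter and histogram that the RED queries consume. All names are illustrative; a real service would normally use an official client library (for Python, prometheus_client) rather than hand-rolling this:

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-process metric state; a client library would manage this for you.
REQUESTS = Counter()          # (method, path, status) -> count
LATENCY_BUCKETS = [0.01, 0.05, 0.1, 0.5, 1.0]
latency_counts = Counter()    # bucket upper bound -> cumulative count
latency_sum = 0.0
latency_total = 0

def observe(method, path, status, seconds):
    """Record one request in the counter and the latency histogram."""
    global latency_sum, latency_total
    REQUESTS[(method, path, status)] += 1
    for le in LATENCY_BUCKETS:
        if seconds <= le:     # buckets are cumulative
            latency_counts[le] += 1
    latency_sum += seconds
    latency_total += 1

def render_metrics():
    """Render current state in the Prometheus text exposition format."""
    lines = ["# TYPE http_requests_total counter"]
    for (m, p, s), n in sorted(REQUESTS.items()):
        lines.append(f'http_requests_total{{method="{m}",path="{p}",status="{s}"}} {n}')
    lines.append("# TYPE http_request_duration_seconds histogram")
    for le in LATENCY_BUCKETS:
        lines.append(f'http_request_duration_seconds_bucket{{le="{le}"}} {latency_counts[le]}')
    lines.append(f'http_request_duration_seconds_bucket{{le="+Inf"}} {latency_total}')
    lines.append(f'http_request_duration_seconds_sum {latency_sum}')
    lines.append(f'http_request_duration_seconds_count {latency_total}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    observe("GET", "/api/orders", "200", 0.021)
    observe("POST", "/api/orders", "500", 0.3)
    print(render_metrics())
    # HTTPServer(("", 8000), MetricsHandler).serve_forever()
```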
Recording rules
Complex PromQL queries evaluated on every dashboard load waste CPU. Recording rules pre-compute results and store them as new time series.
# recording-rules.yml
groups:
  - name: red_metrics
    interval: 30s
    rules:
      - record: service:http_request_rate:5m
        expr: sum(rate(http_requests_total[5m])) by (service)
      - record: service:http_error_ratio:5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
      - record: service:http_latency_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )
Recording rules also make alerting rules simpler. Alert on service:http_error_ratio:5m > 0.01 instead of repeating the full expression.
Alerting rules
Alerting rules fire when a PromQL expression is true for a specified duration:
# alerting-rules.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        expr: service:http_error_ratio:5m > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
          runbook: "https://wiki.example.com/runbooks/high-error-rate"
      - alert: HighLatency
        expr: service:http_latency_p99:5m > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 1s on {{ $labels.service }}"
The for clause prevents flapping. The alert only fires after the condition has been true continuously for the specified duration.
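The resulting state machine (inactive, pending, firing) can be sketched with a few lines. This is a simplification that assumes fixed evaluation intervals and ignores notification handling, but it captures why a brief flap never pages anyone:

```python
def alert_states(evals, for_seconds, interval):
    """Simplified 'for' clause. Each element of evals is whether the
    alert expression was true at that evaluation. The alert is
    'pending' while the condition holds but hasn't held for
    for_seconds yet, 'firing' once it has, and 'inactive' the
    moment the condition clears (which resets the timer)."""
    states, held = [], 0.0
    for condition in evals:
        if not condition:
            held = 0.0
            states.append("inactive")
            continue
        held += interval
        states.append("firing" if held >= for_seconds else "pending")
    return states

# 15s evaluation interval, for: 45s. The flap at evaluation 3
# resets the timer, so the alert only fires on the sixth evaluation.
print(alert_states([True, True, False, True, True, True, True],
                   for_seconds=45, interval=15))
# → ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing', 'firing']
```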
Federation
In large deployments, a single Prometheus server cannot scrape everything. Federation allows a global Prometheus to scrape pre-aggregated metrics from regional ones.
# global prometheus config
scrape_configs:
  - job_name: "federate-region-us"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{__name__=~"service:.+"}'
    static_configs:
      - targets: ["prometheus-us:9090"]
  - job_name: "federate-region-eu"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{__name__=~"service:.+"}'
    static_configs:
      - targets: ["prometheus-eu:9090"]
Federate only recording rules (pre-aggregated metrics), not raw metrics. This keeps the global instance lightweight.
For larger scale, consider Thanos or Cortex, which provide long-term storage, global querying, and horizontal scaling on top of Prometheus.
What comes next
Metrics are only useful when visualized. The next part, Grafana and dashboards, covers how to build effective dashboards using Prometheus as a data source, including variables, alert panels, and dashboard-as-code workflows.