Real User Monitoring and synthetic testing

In this series (10 parts)
  1. The three pillars of observability
  2. Structured logging
  3. Metrics and Prometheus
  4. Grafana and dashboards
  5. Distributed tracing
  6. Log aggregation pipelines
  7. Alerting design
  8. SLIs, SLOs, and error budgets
  9. Real User Monitoring and synthetic testing
  10. On-call tooling and runbooks

Your backend SLOs say 99.9% of API responses return in under 300ms. Your users in Southeast Asia experience 3-second page loads. The disconnect is the last mile: DNS resolution, TLS negotiation, content download, JavaScript parsing, and rendering. Backend metrics do not capture any of this.

Real User Monitoring (RUM) and synthetic testing close the gap by measuring performance from the user’s perspective.

Real User Monitoring

RUM collects performance data from actual user sessions using a JavaScript agent embedded in the page. Every page load, route transition, and user interaction generates telemetry sent to a collection endpoint.

What RUM captures

The browser Performance API exposes detailed timing data:

// Simplified RUM collection
const navigation = performance.getEntriesByType("navigation")[0];

const metrics = {
  dns_ms: navigation.domainLookupEnd - navigation.domainLookupStart,
  tcp_ms: navigation.connectEnd - navigation.connectStart,
  tls_ms: navigation.secureConnectionStart > 0
    ? navigation.connectEnd - navigation.secureConnectionStart
    : 0,
  ttfb_ms: navigation.responseStart - navigation.requestStart,
  download_ms: navigation.responseEnd - navigation.responseStart,
  dom_interactive_ms: navigation.domInteractive - navigation.fetchStart,
  dom_complete_ms: navigation.domComplete - navigation.fetchStart,
  page_load_ms: navigation.loadEventEnd - navigation.fetchStart,
};

// Send to collection endpoint
navigator.sendBeacon("/rum/collect", JSON.stringify({
  url: window.location.href,
  user_agent: navigator.userAgent,
  connection: navigator.connection?.effectiveType,
  ...metrics,
  timestamp: new Date().toISOString(),
}));

This breakdown reveals where time is spent. A high ttfb_ms points to backend or CDN issues. A large gap between dom_interactive_ms and dom_complete_ms points to heavy JavaScript execution.
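As a rough illustration, the timing breakdown can drive an automated triage step. This helper is hypothetical, not part of the Performance API; it only ranks the phases collected above:

```javascript
// Hypothetical triage helper: given the metrics object collected above,
// return the name of the phase that consumes the most time.
function dominantPhase(metrics) {
  const phases = {
    dns: metrics.dns_ms,
    tcp_tls: metrics.tcp_ms + metrics.tls_ms,
    backend_ttfb: metrics.ttfb_ms,
    download: metrics.download_ms,
    // Time between DOM interactive and DOM complete approximates
    // script execution and subresource loading.
    scripting: metrics.dom_complete_ms - metrics.dom_interactive_ms,
  };
  // Sort descending by duration and return the slowest phase's name.
  return Object.entries(phases).sort((a, b) => b[1] - a[1])[0][0];
}
```

Tagging each beacon with its dominant phase makes it easy to count, per page or per region, how often the bottleneck is the backend versus the client.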

Core Web Vitals

Google’s Core Web Vitals are three metrics that capture user experience:

Largest Contentful Paint (LCP): How long until the largest visible element renders. Target: under 2.5 seconds.

Interaction to Next Paint (INP): How long the browser takes to respond to user interactions. Target: under 200ms.

Cumulative Layout Shift (CLS): How much the page layout shifts unexpectedly during loading. Target: under 0.1.

// Collecting Core Web Vitals using web-vitals library
import { onLCP, onINP, onCLS } from "web-vitals";

function sendMetric(metric) {
  navigator.sendBeacon("/rum/vitals", JSON.stringify({
    name: metric.name,
    value: metric.value,
    rating: metric.rating, // "good", "needs-improvement", "poor"
    url: window.location.href,
    timestamp: new Date().toISOString(),
  }));
}

onLCP(sendMetric);
onINP(sendMetric);
onCLS(sendMetric);

Analyzing RUM data

RUM data is high-volume. Aggregate by percentiles, not averages:

# P75 LCP by page (from RUM metrics exported to Prometheus)
histogram_quantile(0.75,
  sum(rate(rum_lcp_seconds_bucket[1h])) by (le, page)
)

Segment by meaningful dimensions:

  • Geography: Users in Mumbai see different performance than users in New York.
  • Device type: Mobile users on slow networks have fundamentally different experiences.
  • Connection type: 4G vs WiFi vs 3G.
  • Page/route: The homepage and the checkout page have different performance profiles.

It is common to find that users in Southeast Asia and Africa exceed the 2.5-second LCP target while other regions do not. Backend latency alone does not explain this: CDN coverage, asset size, and connection quality all matter.
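If RUM metrics are not exported to Prometheus, the same percentile-first, segmented aggregation can be sketched in the collection service itself. A minimal example; the lcp_ms field and segment key are illustrative:

```javascript
// Compute a percentile (e.g. 0.75 for P75) over raw sample values.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor(p * sorted.length));
  return sorted[idx];
}

// Group samples by a dimension (geography, device, page) before
// aggregating, so one fast segment cannot mask a slow one.
function p75BySegment(samples, key) {
  const groups = {};
  for (const s of samples) {
    (groups[s[key]] ??= []).push(s.lcp_ms);
  }
  return Object.fromEntries(
    Object.entries(groups).map(([seg, vals]) => [seg, percentile(vals, 0.75)])
  );
}
```

Computing the percentile per segment, rather than over the whole population, is what surfaces the regional gap described above.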

Synthetic testing

Synthetic tests probe your application from external locations on a schedule. Unlike RUM, which depends on real user traffic, synthetic tests run continuously even at 3 AM when no users are active.

Types of synthetic tests

Uptime checks: Simple HTTP requests to verify the service is responding.

# Prometheus Blackbox Exporter config
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      fail_if_body_not_matches_regexp:
        - "OK"

  http_checkout_flow:
    prober: http
    timeout: 15s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"test": true}'
      valid_status_codes: [200, 201]

# Prometheus scrape config for Blackbox Exporter
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
          - https://checkout.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Browser-based checks: Headless Chrome navigates through critical user flows (login, search, checkout) and measures performance.

// Playwright-based synthetic check
const { chromium } = require("playwright");

async function checkCheckoutFlow() {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();

    const start = Date.now();

    await page.goto("https://example.com/products");
    await page.click('[data-test="add-to-cart"]');
    await page.click('[data-test="checkout-button"]');
    await page.waitForSelector('[data-test="order-confirmation"]');

    const duration = Date.now() - start;

    // Report metric
    console.log(`checkout_flow_duration_ms=${duration}`);
  } finally {
    // Close the browser even when a step fails, so scheduled probes
    // do not leak headless Chrome processes.
    await browser.close();
  }
}

API checks: Test individual API endpoints for correct responses and acceptable latency.
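A minimal API check can be a plain HTTP request with a latency budget. This sketch assumes Node 18+ for the global fetch; the endpoint and thresholds are examples, not recommendations:

```javascript
// Hypothetical API synthetic check: verify both status code and latency.
async function checkApi(url, { timeoutMs = 5000, maxLatencyMs = 1000 } = {}) {
  const start = Date.now();
  // Abort the request entirely if it exceeds the timeout.
  const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
  const latency = Date.now() - start;
  return {
    ok: res.status === 200 && latency <= maxLatencyMs,
    status: res.status,
    latency_ms: latency,
  };
}
```

Returning status and latency separately lets the check distinguish "down" from "up but too slow", the distinction the next section turns on.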

Uptime vs performance monitoring

Uptime monitoring answers: “Is the service reachable?” Performance monitoring answers: “Is the service fast enough?”

Both matter. A service that returns 200 OK in 15 seconds is “up” but broken from the user’s perspective. Run both:

  1. Uptime probes every 30 seconds from multiple regions. Alert if any region sees failures for 2 consecutive checks.
  2. Performance probes every 5 minutes that measure full page load or API response time. Alert if P50 exceeds threshold.
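The consecutive-failure rule in step 1 amounts to a small per-region counter. A sketch, with the alert sink left abstract:

```javascript
// Track consecutive failures per region; fire exactly once when the
// streak reaches the threshold, and reset the streak on success.
function makeFailureTracker(threshold = 2, onAlert = console.error) {
  const streaks = {};
  return function record(region, success) {
    streaks[region] = success ? 0 : (streaks[region] ?? 0) + 1;
    if (streaks[region] === threshold) {
      onAlert(`uptime check failing in ${region} (${threshold} consecutive failures)`);
    }
  };
}
```

Requiring two consecutive failures suppresses one-off network blips without adding more than a single probe interval of alerting delay.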

Multi-region probing

Run synthetic checks from multiple geographic locations. This catches:

  • CDN misconfigurations affecting specific regions.
  • DNS propagation delays.
  • Routing issues with specific ISPs or cloud regions.
  • Regional infrastructure failures.

# Synthetic check locations
probes:
  - region: us-east-1
    target: https://api.example.com/health
    interval: 30s
  - region: eu-west-1
    target: https://api.example.com/health
    interval: 30s
  - region: ap-southeast-1
    target: https://api.example.com/health
    interval: 30s

If US-East shows healthy but AP-Southeast shows degraded, the problem is likely CDN or routing, not the origin server.
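That triage rule (all regions failing suggests the origin, a subset failing suggests CDN or routing) can be expressed directly; the region names here are illustrative:

```javascript
// Classify a multi-region probe result: results maps region -> healthy.
function classifyOutage(results) {
  const failing = Object.entries(results)
    .filter(([, healthy]) => !healthy)
    .map(([region]) => region);
  if (failing.length === 0) return { verdict: "healthy", failing };
  if (failing.length === Object.keys(results).length) {
    // Every vantage point sees a failure: the common dependency is the origin.
    return { verdict: "likely origin failure", failing };
  }
  // Only some regions fail: suspect the path, not the origin.
  return { verdict: "likely CDN or routing issue", failing };
}
```

The verdict is a starting hypothesis for an on-call engineer, not a diagnosis; a partial origin failure behind a load balancer can mimic the regional pattern.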

Combining RUM and synthetic data

RUM and synthetic testing complement each other:

Dimension       RUM                          Synthetic
Coverage        Only when users are active   24/7
Realism         Real devices, real networks  Controlled environment
Variability     High (diverse users)         Low (consistent)
Alerting speed  Depends on traffic           Immediate
Debugging       Hard to reproduce            Easy to reproduce
Use synthetic tests as your early warning system. Use RUM as your ground truth. When synthetic tests show degradation, check RUM to see if real users are affected. When RUM shows regional issues, add synthetic probes in that region to isolate the cause.

What comes next

Monitoring detects problems. People fix them. On-call tooling and runbooks covers how to structure runbooks, link them to alerts, and build on-call processes that actually work at 3 AM.
