Real User Monitoring and synthetic testing
Your backend SLOs say 99.9% of API responses return in under 300ms. Your users in Southeast Asia experience 3-second page loads. The disconnect is the last mile: DNS resolution, TLS negotiation, content download, JavaScript parsing, and rendering. Backend metrics do not capture any of this.
Real User Monitoring (RUM) and synthetic testing close the gap by measuring performance from the user’s perspective.
Real User Monitoring
RUM collects performance data from actual user sessions using a JavaScript agent embedded in the page. Every page load, route transition, and user interaction generates telemetry sent to a collection endpoint.
What RUM captures
The browser Performance API exposes detailed timing data:
```javascript
// Simplified RUM collection
const navigation = performance.getEntriesByType("navigation")[0];

const metrics = {
  dns_ms: navigation.domainLookupEnd - navigation.domainLookupStart,
  tcp_ms: navigation.connectEnd - navigation.connectStart,
  tls_ms: navigation.secureConnectionStart > 0
    ? navigation.connectEnd - navigation.secureConnectionStart
    : 0,
  ttfb_ms: navigation.responseStart - navigation.requestStart,
  download_ms: navigation.responseEnd - navigation.responseStart,
  dom_interactive_ms: navigation.domInteractive - navigation.fetchStart,
  dom_complete_ms: navigation.domComplete - navigation.fetchStart,
  page_load_ms: navigation.loadEventEnd - navigation.fetchStart,
};

// Send to collection endpoint
navigator.sendBeacon("/rum/collect", JSON.stringify({
  url: window.location.href,
  user_agent: navigator.userAgent,
  connection: navigator.connection?.effectiveType,
  ...metrics,
  timestamp: new Date().toISOString(),
}));
```
This reveals where time is spent. A high ttfb_ms points to backend or CDN issues. A large gap between dom_interactive_ms and dom_complete_ms points to heavy subresource loading and JavaScript execution.
Core Web Vitals
Google’s Core Web Vitals are three metrics that capture user experience:
- Largest Contentful Paint (LCP): How long until the largest visible element renders. Target: under 2.5 seconds.
- Interaction to Next Paint (INP): How long the browser takes to respond to user interactions. Target: under 200ms.
- Cumulative Layout Shift (CLS): How much the page layout shifts unexpectedly during loading. Target: under 0.1.
```javascript
// Collecting Core Web Vitals using web-vitals library
import { onLCP, onINP, onCLS } from "web-vitals";

function sendMetric(metric) {
  navigator.sendBeacon("/rum/vitals", JSON.stringify({
    name: metric.name,
    value: metric.value,
    rating: metric.rating, // "good", "needs-improvement", "poor"
    url: window.location.href,
    timestamp: new Date().toISOString(),
  }));
}

onLCP(sendMetric);
onINP(sendMetric);
onCLS(sendMetric);
```
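On the server side, the beacon arrives as an opaque request body. A minimal sketch of validating one of these payloads before storage; the field names mirror the beacon above, the accepted ratings are the three strings the web-vitals library emits, and the function itself is a hypothetical helper, not part of any library:

```javascript
// Validate and normalize a web-vitals beacon payload before storage.
// Field names mirror the sendMetric() beacon; anything unexpected is dropped.
const VALID_NAMES = new Set(["LCP", "INP", "CLS"]);
const VALID_RATINGS = new Set(["good", "needs-improvement", "poor"]);

function parseVitalsBeacon(body) {
  let data;
  try {
    data = JSON.parse(body);
  } catch {
    return null; // malformed JSON: drop silently, beacons are fire-and-forget
  }
  if (!VALID_NAMES.has(data.name)) return null;
  if (!VALID_RATINGS.has(data.rating)) return null;
  if (typeof data.value !== "number" || !Number.isFinite(data.value)) return null;
  return {
    name: data.name,
    value: data.value,
    rating: data.rating,
    url: String(data.url ?? ""),
    timestamp: String(data.timestamp ?? ""),
  };
}
```

Dropping bad payloads silently is deliberate: the client fires beacons without waiting for a response, so an error status buys nothing, and a public collection endpoint will see plenty of junk.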
Analyzing RUM data
RUM data is high-volume. Aggregate by percentiles, not averages:
```promql
# P75 LCP by page (from RUM metrics exported to Prometheus)
histogram_quantile(0.75,
  sum(rate(rum_lcp_seconds_bucket[1h])) by (le, page)
)
```
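To see why averages mislead, consider a small hypothetical sample of LCP values (in seconds) where most users are fast but a quarter are slow:

```javascript
// Nearest-rank percentile over a hypothetical LCP sample, to contrast with the mean.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// Seven fast users, three slow ones — a realistic heavy tail.
const lcpSamples = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.0, 5.0, 6.0];
const mean = lcpSamples.reduce((a, b) => a + b, 0) / lcpSamples.length;

console.log(mean);                       // 2.2 — comfortably under the 2.5s target
console.log(percentile(lcpSamples, 75)); // 4 — a quarter of users are far past it
```

The mean says the page is fine; P75 says a quarter of your users are having a bad time. That is why the query above aggregates at P75 rather than averaging.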
Segment by meaningful dimensions:
- Geography: Users in Mumbai see different performance than users in New York.
- Device type: Mobile users on slow networks have fundamentally different experiences.
- Connection type: 4G vs WiFi vs 3G.
- Page/route: The homepage and the checkout page have different performance profiles.
If P75 LCP for users in Southeast Asia and Africa exceeds the 2.5s target while other regions pass, backend latency alone does not explain it: CDN coverage, asset size, and connection quality all matter.
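A regional regression like this can open a ticket automatically. A sketch of a Prometheus alerting rule, assuming the RUM pipeline exports rum_lcp_seconds_bucket with a region label (the metric name matches the earlier query; the label, duration, and severity are illustrative):

```yaml
groups:
  - name: rum-alerts
    rules:
      - alert: LCPBudgetExceeded
        # P75 LCP above the 2.5s "good" threshold in any region
        expr: |
          histogram_quantile(0.75,
            sum(rate(rum_lcp_seconds_bucket[1h])) by (le, region)
          ) > 2.5
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "P75 LCP above 2.5s for region {{ $labels.region }}"
```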
Synthetic testing
Synthetic tests probe your application from external locations on a schedule. Unlike RUM, which depends on real user traffic, synthetic tests run continuously even at 3 AM when no users are active.
Types of synthetic tests
Uptime checks: Simple HTTP requests to verify the service is responding.
```yaml
# Prometheus Blackbox Exporter config
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      fail_if_body_not_matches_regexp:
        - "OK"
  http_checkout_flow:
    prober: http
    timeout: 15s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"test": true}'
      valid_status_codes: [200, 201]
```
```yaml
# Prometheus scrape config for Blackbox Exporter
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
          - https://checkout.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```
Browser-based checks: Headless Chrome navigates through critical user flows (login, search, checkout) and measures performance.
```javascript
// Playwright-based synthetic check
const { chromium } = require("playwright");

async function checkCheckoutFlow() {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    const start = Date.now();

    await page.goto("https://example.com/products");
    await page.click('[data-test="add-to-cart"]');
    await page.click('[data-test="checkout-button"]');
    await page.waitForSelector('[data-test="order-confirmation"]');

    const duration = Date.now() - start;
    // Report metric
    console.log(`checkout_flow_duration_ms=${duration}`);
  } finally {
    // Always close the browser, even when a step fails or times out,
    // or the check runner leaks a Chromium process per failed run
    await browser.close();
  }
}
```
API checks: Test individual API endpoints for correct responses and acceptable latency.
Uptime vs performance monitoring
Uptime monitoring answers: “Is the service reachable?” Performance monitoring answers: “Is the service fast enough?”
Both matter. A service that returns 200 OK in 15 seconds is “up” but broken from the user’s perspective. Run both:
- Uptime probes every 30 seconds from multiple regions. Alert if any region sees failures for 2 consecutive checks.
- Performance probes every 5 minutes that measure full page load or API response time. Alert if P50 exceeds threshold.
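With the Blackbox Exporter setup above, the uptime half of this is a rule on the probe_success metric it emits. With 30-second probes, `for: 1m` approximates two consecutive failed checks; the severity label is illustrative:

```yaml
groups:
  - name: blackbox-alerts
    rules:
      - alert: SyntheticProbeFailing
        expr: probe_success{job="blackbox-http"} == 0
        for: 1m  # ~2 consecutive failures at a 30s probe interval
        labels:
          severity: page
        annotations:
          summary: "Synthetic probe failing for {{ $labels.instance }}"
```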
Multi-region probing
Run synthetic checks from multiple geographic locations. This catches:
- CDN misconfigurations affecting specific regions.
- DNS propagation delays.
- Routing issues with specific ISPs or cloud regions.
- Regional infrastructure failures.
```yaml
# Synthetic check locations
probes:
  - region: us-east-1
    target: https://api.example.com/health
    interval: 30s
  - region: eu-west-1
    target: https://api.example.com/health
    interval: 30s
  - region: ap-southeast-1
    target: https://api.example.com/health
    interval: 30s
```
If US-East shows healthy but AP-Southeast shows degraded, the problem is likely CDN or routing, not the origin server.
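One way to spot such a divergence is to graph total probe time side by side per region. A sketch assuming each regional Prometheus attaches a region external label to the Blackbox Exporter's probe_duration_seconds gauge:

```promql
# Average total probe time per region over the last 15 minutes
# (assumes a `region` external label on each probing Prometheus)
avg by (region, instance) (
  avg_over_time(probe_duration_seconds{job="blackbox-http"}[15m])
)
```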
Combining RUM and synthetic data
RUM and synthetic testing complement each other:
| Dimension | RUM | Synthetic |
|---|---|---|
| Coverage | Only when users are active | 24/7 |
| Realism | Real devices, real networks | Controlled environment |
| Variability | High (diverse users) | Low (consistent) |
| Alerting speed | Depends on traffic | Immediate |
| Debugging | Hard to reproduce | Easy to reproduce |
Use synthetic tests as your early warning system. Use RUM as your ground truth. When synthetic tests show degradation, check RUM to see if real users are affected. When RUM shows regional issues, add synthetic probes in that region to isolate the cause.
What comes next
Monitoring detects problems. People fix them. On-call tooling and runbooks covers how to structure runbooks, link them to alerts, and build on-call processes that actually work at 3 AM.