
Monitoring and observability for frontends

In this series (12 parts)
  1. What frontend system design covers
  2. Rendering strategies: CSR, SSR, SSG, ISR
  3. Performance fundamentals: Core Web Vitals
  4. Loading performance and resource optimization
  5. State management at scale
  6. Component architecture and design systems
  7. Client-side caching and offline support
  8. Real-time on the frontend
  9. Frontend security
  10. Scalability for frontend systems
  11. Accessibility as a system design concern
  12. Monitoring and observability for frontends

Prerequisite: Accessibility as a system design concern.

You cannot improve what you cannot measure. Backend teams have mature observability practices: structured logs, distributed tracing, metric dashboards. Frontend observability is younger but equally important. Your server can return a 200 in 50ms while the user stares at a blank screen for 4 seconds because a render-blocking script failed to load from a CDN edge in Southeast Asia.

This article covers the tools and patterns for understanding what real users experience in production: Real User Monitoring, synthetic testing, error tracking, and how to build dashboards that surface problems before support tickets arrive.


Real User Monitoring (RUM)

RUM collects performance and behavioral data from actual user sessions. Every page load, every interaction, every error. The browser provides most of this data through standard APIs.
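Much of this comes from the standard Navigation Timing API. A small sketch (the helper name is illustrative) that derives basic load milestones from a navigation entry:

```javascript
// Pull basic load timings from a PerformanceNavigationTiming entry.
// All values are milliseconds relative to the navigation start.
function extractNavTimings(entry) {
  return {
    ttfb: entry.responseStart - entry.startTime, // time to first byte
    domContentLoaded: entry.domContentLoadedEventEnd - entry.startTime,
    load: entry.loadEventEnd - entry.startTime,
  };
}

// In the browser (guarded so the module also loads outside one):
if (typeof performance !== 'undefined' && performance.getEntriesByType) {
  const [nav] = performance.getEntriesByType('navigation');
  if (nav) console.log(extractNavTimings(nav));
}
```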

Core Web Vitals

Google’s Core Web Vitals are the baseline metrics every frontend should track:

  • Largest Contentful Paint (LCP): how long until the largest visible element renders. Target: under 2.5 seconds.
  • Interaction to Next Paint (INP): responsiveness to user input. Target: under 200ms.
  • Cumulative Layout Shift (CLS): visual stability. Target: under 0.1.

Collecting all three takes a few lines with the web-vitals library:

import { onLCP, onINP, onCLS } from 'web-vitals';

onLCP((metric) => sendToAnalytics('LCP', metric));
onINP((metric) => sendToAnalytics('INP', metric));
onCLS((metric) => sendToAnalytics('CLS', metric));

The web-vitals library handles the browser API complexity. It normalizes the data and provides attribution information that tells you which element caused a poor score.
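The boundaries behind those targets can be made explicit with a small helper. Note that web-vitals already exposes a metric.rating field with the same buckets; this sketch just shows the published thresholds:

```javascript
// Google's published rating bands for the Core Web Vitals.
const THRESHOLDS = {
  LCP: { good: 2500, poor: 4000 }, // milliseconds
  INP: { good: 200, poor: 500 },   // milliseconds
  CLS: { good: 0.1, poor: 0.25 },  // unitless
};

function rateMetric(name, value) {
  const t = THRESHOLDS[name];
  if (!t) return 'unknown';
  if (value <= t.good) return 'good';
  if (value <= t.poor) return 'needs-improvement';
  return 'poor';
}
```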

Beyond Core Web Vitals

Core Web Vitals are a starting point. For deeper insight, also track:

  • Time to First Byte (TTFB): server and network latency.
  • First Contentful Paint (FCP): when the first content pixel renders.
  • Long tasks: JavaScript tasks that block the main thread for more than 50ms.
  • Resource timing: load time for individual scripts, stylesheets, images, and fonts.

Target thresholds for key frontend performance metrics. Values beyond these indicate a degraded user experience.

  Metric               Good target
  LCP                  ≤ 2.5 s
  INP                  ≤ 200 ms
  CLS                  ≤ 0.1
  FCP                  ≤ 1.8 s
  TTFB                 ≤ 800 ms
  Long task duration   ≤ 50 ms
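Long tasks can be observed directly with a PerformanceObserver. A sketch (the summarizer name is illustrative; the 'longtask' entry type is standard but not supported in every browser, hence the guard):

```javascript
// Only the portion of each task beyond 50ms blocks perceived responsiveness,
// so sum the overage across observed long tasks.
function totalBlockingTime(entries) {
  return entries.reduce((sum, e) => sum + Math.max(0, e.duration - 50), 0);
}

if (typeof PerformanceObserver !== 'undefined') {
  try {
    const observer = new PerformanceObserver((list) => {
      const tbt = totalBlockingTime(list.getEntries());
      if (tbt > 0) console.log(`main thread blocked ~${tbt}ms`);
    });
    observer.observe({ type: 'longtask', buffered: true });
  } catch {
    // 'longtask' is not supported in this environment
  }
}
```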


The RUM data pipeline

Collecting metrics in the browser is only half the problem. You need a pipeline to ingest, process, store, and visualize them.

flowchart LR
  A[Browser] -->|Beacon API| B[Collector Endpoint]
  B --> C[Stream Processing]
  C --> D[Time-Series DB]
  C --> E[Data Warehouse]
  D --> F[Real-Time Dashboards]
  E --> G[Long-Term Analysis]
  F --> H[Alerts]

RUM data flows from the browser through a collector, gets processed into time-series and warehouse storage, and surfaces through dashboards and alerts.

Sending data from the browser

Use the Beacon API or fetch with keepalive: true to send data reliably, even as the page unloads:

function sendToAnalytics(name, metric) {
  const body = JSON.stringify({
    name,
    value: metric.value,
    id: metric.id,
    page: location.pathname,
    connection: navigator.connection?.effectiveType,
    timestamp: Date.now(),
  });

  if (navigator.sendBeacon) {
    navigator.sendBeacon('/analytics', body);
  } else {
    fetch('/analytics', { body, method: 'POST', keepalive: true });
  }
}

Include contextual dimensions: page path, connection type, device category, geographic region. These let you slice metrics to find that LCP is great in the US but terrible in India because a third-party script loads slowly from a single origin.
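A sketch of collecting those dimensions (the device classification is deliberately crude; real RUM SDKs use richer signals such as client hints):

```javascript
// Rough device bucket from the user agent string.
function deviceCategory(userAgent) {
  if (/iPad|Tablet/i.test(userAgent)) return 'tablet';
  if (/Mobi/i.test(userAgent)) return 'mobile';
  return 'desktop';
}

// Dimensions attached to every beacon, gathered in the browser.
function collectContext() {
  return {
    page: location.pathname,
    device: deviceCategory(navigator.userAgent),
    connection: navigator.connection?.effectiveType, // e.g. '4g', '3g'
    language: navigator.language,
  };
}
```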

Sampling

At scale, sending every metric from every session is expensive. Sample at 10 to 25% for performance metrics. Keep 100% sampling for errors; you do not want to miss a crash that affects 0.1% of users.
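Sampling should be decided per session, not per beacon, so that a session's metrics are either all present or all absent. Hashing the session ID makes the decision deterministic, as in this sketch:

```javascript
// Hash a session ID into [0, 1) and compare against the sampling rate.
// A given session is then always in or always out of the sample.
function hashToUnitInterval(str) {
  let h = 0;
  for (let i = 0; i < str.length; i++) {
    h = (h * 31 + str.charCodeAt(i)) >>> 0; // unsigned 32-bit rolling hash
  }
  return h / 0x100000000;
}

function shouldSample(sessionId, rate) {
  return hashToUnitInterval(sessionId) < rate;
}

// 10-25% for performance metrics, 100% for errors:
// if (shouldSample(sessionId, 0.1)) sendToAnalytics(name, metric);
```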


Synthetic monitoring

RUM tells you what happened. Synthetic monitoring tells you what is happening right now, even when no real users are active.

Synthetic tests run automated browsers (Playwright, Puppeteer) on a schedule from multiple locations. They load your pages and measure performance metrics, check for errors, and verify critical user flows.

What to monitor synthetically

  • Homepage load time from 5+ geographic locations.
  • Critical user flows: login, checkout, signup. If these break at 3 AM, you want to know before users wake up.
  • Third-party script availability: ad tags, analytics, chat widgets. These are the most common source of frontend failures.
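A synthetic check can be sketched as a Playwright script that loads a page and compares its timings against budgets. This assumes the playwright package is installed, and the budget numbers are illustrative:

```javascript
// Pure part: compare measured timings against budgets.
function checkBudgets(timings, budgets) {
  return Object.entries(budgets)
    .filter(([name, limit]) => timings[name] > limit)
    .map(([name, limit]) => `${name} ${timings[name]}ms exceeds ${limit}ms`);
}

// Browser part: load the page and read Navigation Timing.
async function runSyntheticCheck(url) {
  const { chromium } = await import('playwright');
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'load' });
    const timings = await page.evaluate(() => {
      const [nav] = performance.getEntriesByType('navigation');
      return { ttfb: nav.responseStart, load: nav.loadEventEnd };
    });
    const failures = checkBudgets(timings, { ttfb: 800, load: 5000 });
    if (failures.length) throw new Error(failures.join('; '));
  } finally {
    await browser.close();
  }
}
```

Run it on a schedule from several regions via your scheduler of choice, and wire a thrown error into your alerting.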

Synthetic vs RUM

  Aspect                RUM                          Synthetic
  Data source           Real user sessions           Automated browsers
  Coverage              All pages users visit        Pages you configure
  Conditions            Varied (devices, networks)   Controlled
  Detects regressions   After users are affected     Before users are affected
  Cost                  Scales with traffic          Fixed per test

Use both. Synthetic monitoring is your smoke detector. RUM is your full diagnostic report.


Error tracking

JavaScript errors in production are invisible unless you capture them. Unhandled exceptions, promise rejections, network failures, and resource loading errors all need to be collected, grouped, and prioritized.

What to capture

window.addEventListener('error', (event) => {
  reportError({
    message: event.message,
    filename: event.filename,
    lineno: event.lineno,
    colno: event.colno,
    stack: event.error?.stack,
  });
});

window.addEventListener('unhandledrejection', (event) => {
  reportError({
    message: event.reason?.message || String(event.reason),
    stack: event.reason?.stack,
  });
});

Source maps

Production JavaScript is minified. A stack trace pointing to app.min.js:1:45023 is useless. Upload source maps to your error tracking service (Sentry, Datadog, Bugsnag) so errors are mapped back to original filenames and line numbers.

Keep source maps private. Do not serve them to users (they expose your source code). Upload them to the error tracking service via their API during your CI/CD pipeline.
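In CI this is typically one step after the production build. A sketch in GitHub Actions syntax, assuming Sentry's sentry-cli (check your service's documentation for the exact command):

```yaml
- name: Upload source maps to Sentry
  run: |
    npx sentry-cli sourcemaps upload --release "$GITHUB_SHA" ./dist
    rm -f ./dist/*.map   # never deploy the maps themselves
  env:
    SENTRY_AUTH_TOKEN: ${{ secrets.SENTRY_AUTH_TOKEN }}
```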

Error grouping and prioritization

A single error can fire thousands of times across user sessions. Good error tracking groups identical errors by stack trace and lets you prioritize by:

  • User impact: how many unique users are affected.
  • Frequency: how often it fires.
  • Newness: errors introduced in the latest deploy need immediate attention.
  • Page: an error on the checkout page is more urgent than one on the settings page.

flowchart TD
  A[JS Error in Browser] --> B[Error Handler Captures]
  B --> C[Attach Context]
  C --> D[Send to Error Service]
  D --> E[Source Map Lookup]
  E --> F[Group by Stack Trace]
  F --> G[Prioritize by Impact]
  G --> H{Exceeds Threshold?}
  H -->|Yes| I[Alert On-Call Engineer]
  H -->|No| J[Queue for Review]

Error tracking pipeline. Raw browser errors are enriched with context, de-minified with source maps, grouped, and prioritized before alerting.
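Grouping by stack trace amounts to computing a fingerprint. A sketch (hypothetical helper; real services normalize far more aggressively): stripping line:column offsets and URLs keeps the fingerprint stable across builds, where minified offsets shift.

```javascript
// Reduce a stack trace to a build-independent fingerprint string.
function fingerprint(error) {
  return (error.stack || error.message || '')
    .split('\n')
    .slice(0, 5) // the top frames identify the error
    .map((line) =>
      line.replace(/:\d+:\d+/g, '').replace(/https?:\/\/[^\s)]+/g, '<url>')
    )
    .join('|');
}

// Two hits of the same bug at different minified offsets collapse into one group:
const a = { stack: 'TypeError: x is undefined\n  at render (https://cdn.example.com/app.min.js:1:45023)' };
const b = { stack: 'TypeError: x is undefined\n  at render (https://cdn.example.com/app.min.js:1:45188)' };
// fingerprint(a) === fingerprint(b)
```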


Web analytics vs product analytics

These serve different audiences and answer different questions.

Web analytics (Google Analytics, Plausible) answers: how many users visited, where did they come from, which pages did they view, what is the bounce rate. This is marketing and SEO data.

Product analytics (Amplitude, Mixpanel, PostHog) answers: which features do users engage with, where do they drop off in a funnel, what is the retention rate by cohort. This is product and engineering data.

For system design purposes, product analytics matters more. It tells you which features are worth optimizing and which are dead code. If a feature has 0.1% usage, spending a sprint optimizing its performance is a poor investment.

Implementation pattern

Define events in a typed analytics module. Do not scatter track() calls throughout your codebase:

// analytics.ts
// analyticsProvider is whatever SDK you use (Amplitude, Mixpanel, PostHog),
// hidden behind a minimal interface so it can be swapped out later.
interface AnalyticsProvider {
  track(name: string, properties: Record<string, unknown>): void;
}

declare const analyticsProvider: AnalyticsProvider;

type Event =
  | { name: 'checkout_started'; properties: { cart_size: number } }
  | { name: 'checkout_completed'; properties: { total: number; payment_method: string } }
  | { name: 'search_performed'; properties: { query: string; result_count: number } };

export function track(event: Event) {
  analyticsProvider.track(event.name, event.properties);
}

This gives you type safety, a single place to audit what you track, and makes it easy to swap providers.


Performance dashboards

A dashboard that shows everything shows nothing. Design dashboards for specific audiences.

The engineering dashboard

  • Core Web Vitals (p50, p75, p95) trended over time.
  • Error rate by page.
  • Largest bundle sizes.
  • Deploy markers overlaid on metric charts.

Deploy markers are critical. When LCP spikes, the first question is “did we deploy something?” If the spike aligns with a deploy marker, you have your answer.

The product dashboard

  • Feature usage rates.
  • Funnel completion rates.
  • Session duration by cohort.
  • Engagement metrics for A/B test variants.

Alerting

Alert on p75 thresholds, not averages. Averages hide tail latency. If your p75 LCP crosses 2.5 seconds, something is wrong for a meaningful percentage of users.
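To see why, compare the mean with p75 on a skewed sample. A sketch using the nearest-rank method (the sample values are illustrative):

```javascript
// Nearest-rank percentile over a window of samples.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Seven fast loads and three slow ones (LCP in ms):
const lcpSamples = [800, 800, 800, 800, 800, 800, 800, 5000, 5000, 5000];
const mean = lcpSamples.reduce((a, b) => a + b, 0) / lcpSamples.length;
// mean = 2060ms -- looks acceptable against a 2.5s target
// percentile(lcpSamples, 75) = 5000ms -- a quarter of users have a bad experience
```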

Set up alerts for:

  • Core Web Vitals p75 exceeding good thresholds.
  • Error rate exceeding baseline by 2x after a deploy.
  • Synthetic check failures.
  • Client-side JavaScript crash rate exceeding 1%.

A deploy on Thursday caused LCP p75 to spike above the 2.5s threshold. Deploy markers make regressions immediately visible.


Closing the loop

Monitoring without action is just data hoarding. The value comes from the feedback loop:

  1. Detect: dashboards and alerts surface a regression.
  2. Diagnose: drill into the metric by page, device, region, and deploy version to find the cause.
  3. Fix: ship a fix and verify the metric recovers.
  4. Prevent: add a test, a budget, or a lint rule that catches the same class of problem before it ships again.

This loop is what separates teams that ship with confidence from teams that deploy and pray.


What comes next

This article concludes the Frontend System Design series. You now have a foundation that spans component architecture, state management, data fetching, rendering strategies, performance, caching, real-time communication, security, scalability, accessibility, and observability. Each topic connects to the broader system design fundamentals covered in earlier series. The next step is to practice: take a product you use daily and design its frontend architecture from scratch, applying the patterns and trade-offs discussed across these articles.
