Design a notification system
In this series (18 parts)
- Design a URL shortener
- Design a key-value store
- Design a rate limiter
- Design a web crawler
- Design a notification system
- Design a news feed
- Design a chat application
- Design a video streaming platform
- Design a music streaming service
- Design a ride-sharing service
- Design a food delivery platform
- Design a hotel booking platform
- Design a search engine
- Design a distributed message queue
- Design a code deployment system
- Design a payments platform
- Design an ad click aggregation system
- Design a distributed cache
Sending one notification is easy. Sending the right notification to each of 10 million users, across the right channel, at the right time, without duplicates, is a systems problem. This case study walks through designing a notification platform that handles push, SMS, and email at scale. We will work through requirements, capacity math, architecture, and the tricky deep dives that separate toy implementations from production systems.
Before reading this, you should be comfortable with the high-level design concepts covered in the notification systems HLD article and the class-level design in the LLD notification system article.
Requirements
Functional requirements
- Send notifications through three channels: mobile push (iOS and Android), SMS, and email.
- Support both targeted notifications (single user) and broadcast notifications (all users matching a segment).
- Users can set per-channel preferences: opt in, opt out, or digest mode.
- Device registration and token management for push notifications.
- Template-based notification rendering with variable substitution.
- Delivery tracking: sent, delivered, failed, retried.
Non-functional requirements
- 10 million daily active users (DAU).
- Average 5 notifications per user per day across all channels: 50 million notifications/day.
- Peak QPS of roughly 3,000 notification dispatches per second (assuming 5x average load during peak hours).
- At-least-once delivery guarantee for critical notifications (password resets, payment confirmations).
- End-to-end latency under 30 seconds for push, under 5 minutes for email and SMS.
- 99.9% availability for the ingestion API.
Capacity estimation
QPS: 50M notifications/day divided by 86,400 seconds gives ~580 average QPS. With a 5x peak factor, we target 3,000 QPS sustained.
Storage: Each notification record is roughly 500 bytes (ID, user ID, channel, template ID, status, timestamps, payload reference). At 50M/day, that is 25 GB/day of raw notification records. With 90-day retention, we need ~2.25 TB for the notification log.
Device tokens: 10M users with an average of 2 devices each means 20M device token records. At ~200 bytes per record, that is 4 GB, which easily fits in a single database.
Bandwidth: If the average notification payload is 1 KB (including template rendering), peak outbound bandwidth is 3,000 * 1 KB = 3 MB/s. This is modest; the bottleneck will be external provider rate limits, not network bandwidth.
These numbers show that the notification log dominates storage. The device registry and preference store are small enough to fit on a single database instance with replication. The notification log, at 2.25 TB, needs a sharded data store partitioned by user ID or by time range.
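The back-of-envelope numbers above can be reproduced with a few lines of arithmetic:

```python
# Capacity estimation from the stated assumptions: 10M DAU, 5 notifications
# per user per day, 500-byte records, 90-day retention, 5x peak factor.
DAU = 10_000_000
NOTIFS_PER_USER_PER_DAY = 5
SECONDS_PER_DAY = 86_400
RECORD_BYTES = 500
RETENTION_DAYS = 90
PEAK_FACTOR = 5

daily_notifications = DAU * NOTIFS_PER_USER_PER_DAY           # 50M/day
avg_qps = daily_notifications / SECONDS_PER_DAY               # ~580
peak_qps = avg_qps * PEAK_FACTOR                              # ~2,900, target 3,000

daily_storage_gb = daily_notifications * RECORD_BYTES / 1e9   # 25 GB/day
retention_tb = daily_storage_gb * RETENTION_DAYS / 1000       # 2.25 TB

print(round(avg_qps), round(peak_qps), daily_storage_gb, retention_tb)
```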
High-level architecture
The system breaks into five layers: ingestion, fan-out, channel routing, delivery, and feedback processing.
graph TD
API["Notification API"] --> VAL["Validator + Rate Limiter"]
VAL --> Q1["Message Queue"]
Q1 --> FO["Fan-out Service"]
FO --> PQ["Priority Queue"]
PQ --> CR["Channel Router"]
CR --> PUSH["Push Worker Pool"]
CR --> SMS["SMS Worker Pool"]
CR --> EMAIL["Email Worker Pool"]
PUSH --> APNS["APNs / FCM"]
SMS --> TWILIO["SMS Provider"]
EMAIL --> SES["Email Provider"]
APNS --> FB["Feedback Processor"]
TWILIO --> FB
SES --> FB
FB --> DB["Notification DB"]
FO --> PREF["Preference Service"]
FO --> TOKEN["Device Registry"]
High-level architecture of the notification system. Events flow from the API through fan-out, channel routing, and provider-specific workers.
The API accepts notification requests from internal services. Each request specifies a notification type, a target (single user ID, user segment, or broadcast), and a payload. The validator checks required fields and applies rate limiting so a buggy upstream service cannot flood the queue.
The fan-out service is the core of the system. It resolves “send to segment X” into individual user-level notification tasks. For each user, it consults the preference service and the device registry to determine which channels to use and which device tokens to target.
Channel-specific workers handle the last mile. Push workers batch requests to APNs and FCM. SMS and email workers call third-party APIs with appropriate rate limiting. The feedback processor listens for delivery receipts, bounces, and token invalidation signals, then updates the notification database.
Deep dive 1: fan-out for millions of subscribers
The hardest part of this system is fan-out. When a notification targets “all users who follow account X” and that account has 5 million followers, you need to generate 5 million individual notification tasks quickly and reliably. A naive approach, iterating through 5 million rows and sending one message per iteration, would take hours. The design must parallelize fan-out while maintaining correctness.
Push vs pull fan-out
There are two strategies. In push fan-out (fan-out on write), you generate all individual tasks at send time and enqueue them. In pull fan-out (fan-out on read), you store one event and let each user’s device query for new notifications.
Push fan-out gives lower latency for the user: the notification is already waiting in their queue. But it is expensive for high-follower accounts and wastes work if many users never open the app. Pull fan-out is cheaper on write but shifts the cost to read time, which hurts latency and requires maintaining a per-user inbox.
The practical answer is a hybrid. Use push fan-out for normal accounts (under 100K followers). For celebrity accounts, use pull fan-out: store the event once and merge it into the user’s notification feed at read time. This is similar to how Twitter handles the home timeline problem.
The threshold is configurable. If operational data shows that accounts with 50K+ followers cause fan-out backlogs, lower the threshold. The key metric is the 99th percentile fan-out latency: if it exceeds your SLA, more accounts need pull-based handling.
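The hybrid decision reduces to one comparison at send time. A minimal sketch, where `FANOUT_THRESHOLD` is a hypothetical config value you would tune from p99 fan-out latency:

```python
# Hybrid fan-out: push (fan-out on write) for normal accounts,
# pull (fan-out on read) for celebrity accounts.
FANOUT_THRESHOLD = 100_000  # hypothetical starting point; tune from p99 latency

def choose_fanout_strategy(follower_count: int) -> str:
    """Return 'push' or 'pull' based on the account's follower count."""
    return "push" if follower_count < FANOUT_THRESHOLD else "pull"
```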
Fan-out sequence
sequenceDiagram
participant API as Notification API
participant Q as Message Queue
participant FO as Fan-out Service
participant SEG as Segment Resolver
participant PREF as Preference Service
participant PQ as Priority Queue
API->>Q: Publish notification event
Q->>FO: Consume event
FO->>SEG: Resolve target segment
SEG-->>FO: Return 5M user IDs
loop Batch of 1000 users
FO->>PREF: Fetch preferences for batch
PREF-->>FO: Channel preferences per user
FO->>PQ: Enqueue per-user tasks
end
Note over FO,PQ: Fan-out completes in batches to avoid memory pressure
Sequence diagram showing how a single broadcast event fans out into millions of per-user notification tasks.
The fan-out service processes users in batches of 1,000. For each batch, it fetches preferences, filters out users who have opted out of this notification type, and enqueues individual tasks onto the priority queue. Batching is critical. Loading 5 million user preferences into memory at once would kill the service.
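The batched loop can be sketched as follows, assuming hypothetical `fetch_prefs` and `enqueue` callables standing in for the preference service and priority queue:

```python
from typing import Callable, Iterable, Iterator

BATCH_SIZE = 1_000

def batches(user_ids: Iterable[str], size: int = BATCH_SIZE) -> Iterator[list]:
    """Yield user IDs in fixed-size batches so 5M rows never sit in memory."""
    batch = []
    for uid in user_ids:
        batch.append(uid)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def fan_out(event_id: str, user_ids: Iterable[str],
            fetch_prefs: Callable, enqueue: Callable) -> int:
    """Per batch: bulk-fetch preferences, drop opted-out users, enqueue tasks."""
    enqueued = 0
    for batch in batches(user_ids):
        prefs = fetch_prefs(batch)  # one bulk call per batch, not per user
        for uid in batch:
            for channel in prefs.get(uid, []):  # empty list => fully opted out
                enqueue({"event_id": event_id, "user_id": uid, "channel": channel})
                enqueued += 1
    return enqueued
```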
The priority queue has multiple priority levels. Password resets and security alerts go to the high-priority lane. Marketing notifications go to the low-priority lane. This prevents a large broadcast from starving time-sensitive notifications.
For message queues, we use a partitioned topic (Kafka or similar). Each partition is consumed by one fan-out worker, giving us horizontal scalability. At 3,000 QPS with 10 partitions, each worker handles 300 messages per second, which is comfortable.
Failure handling during fan-out
If the fan-out service crashes midway through processing 5 million users, we need to resume without duplicates. The solution is checkpoint-based processing. The service records its progress (last processed user offset) in the notification database. On restart, it reads the checkpoint and continues from where it stopped.
Downstream workers must be idempotent. Each notification task carries a deduplication key (notification ID + user ID + channel). Workers check this key before dispatching. This is where reliability patterns like idempotency keys become essential.
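A minimal sketch of the idempotency check, where the `seen` set stands in for a fast shared store such as Redis (an assumption, not the article's prescribed store):

```python
# Idempotent dispatch: skip any task whose deduplication key was already sent.

def dedup_key(notification_id: str, user_id: str, channel: str) -> str:
    """Deduplication key: notification ID + user ID + channel."""
    return f"{notification_id}:{user_id}:{channel}"

def dispatch_once(task: dict, seen: set, send) -> bool:
    """Send only if this (notification, user, channel) has not been sent."""
    key = dedup_key(task["notification_id"], task["user_id"], task["channel"])
    if key in seen:
        return False          # duplicate from a fan-out replay; drop it
    seen.add(key)
    send(task)
    return True
```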
The checkpoint granularity matters. Checkpointing after every single user adds database write overhead. Checkpointing after every batch of 1,000 means you might re-process up to 999 users on restart. Since downstream workers are idempotent, the re-processing is safe, just slightly wasteful. The batch-level checkpoint is the right trade-off for throughput.
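Batch-level checkpointing looks roughly like this, with a dict standing in for the checkpoint table in the notification database and `process_batch` as a hypothetical downstream handler:

```python
BATCH_SIZE = 1_000

def fan_out_with_checkpoint(event_id: str, user_ids: list,
                            process_batch, checkpoints: dict) -> None:
    """Fan out user_ids in batches, recording one checkpoint per batch.

    On restart, the loop resumes from the stored offset. Up to BATCH_SIZE - 1
    users may be re-processed, which is safe because downstream workers
    deduplicate by idempotency key.
    """
    start = checkpoints.get(event_id, 0)            # resume point after a crash
    for offset in range(start, len(user_ids), BATCH_SIZE):
        process_batch(user_ids[offset:offset + BATCH_SIZE])
        checkpoints[event_id] = offset + BATCH_SIZE  # one write per batch
```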
Deep dive 2: device registration pipeline
Push notifications require accurate device token management. A stale or invalid token means a wasted API call to APNs/FCM and a silent delivery failure. At 20 million device records, token hygiene is a real operational concern.
Registration flow
When a user installs the app or grants notification permission, the app receives a device token from the OS. The app sends this token to our backend along with the user ID, device ID, platform (iOS/Android), and app version.
graph TD
APP["Mobile App"] -->|Register token| GW["API Gateway"]
GW --> AUTH["Auth Service"]
AUTH -->|Validate session| GW
GW --> DR["Device Registry Service"]
DR --> CACHE["Token Cache"]
DR --> DB["Device DB"]
DB --> CLEANUP["Stale Token Cleaner"]
APNS["APNs Feedback"] --> FP["Feedback Processor"]
FCM["FCM Feedback"] --> FP
FP --> DB
CLEANUP -->|Delete expired| DB
Device registration pipeline. Tokens flow in from mobile apps and get pruned by feedback signals and periodic cleanup.
The device registry service upserts the token. If the device ID already exists, it updates the token (tokens change on reinstall). If the user ID is new for that device, it creates a new record. The service also maintains a cache of active tokens so the fan-out service does not need to hit the database for every notification.
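The upsert logic can be sketched with an in-memory dict standing in for the device DB, keyed by (user ID, device ID) since tokens change on reinstall:

```python
# Device registry upsert sketch. The dict is a stand-in for the Device DB.

def upsert_token(registry: dict, user_id: str, device_id: str,
                 token: str, platform: str) -> dict:
    """Update the token for a known (user, device) pair, else insert a record."""
    key = (user_id, device_id)
    if key in registry:
        registry[key]["token"] = token        # reinstall or token rotation
    else:
        registry[key] = {"token": token, "platform": platform, "active": True}
    return registry[key]
```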
Token invalidation
Tokens become invalid for several reasons: app uninstall, OS update, token rotation by the platform. APNs provides a feedback service that returns a list of invalid tokens. FCM returns canonical registration IDs when a token has been replaced.
The feedback processor runs continuously. For each invalid token signal, it marks the token as inactive in the database and removes it from the cache. We do not delete tokens immediately; instead, we mark them inactive with a timestamp and run a cleanup job weekly. This gives us a window to debug delivery issues.
Multi-device handling
A user with three devices (phone, tablet, watch) should receive a push notification on all of them. The fan-out service queries the device registry for all active tokens for a given user ID and creates one delivery task per token. This is where the 2x multiplier in our capacity estimation comes from.
The deduplication key includes the device token, so delivering to three devices produces three distinct tasks. However, we still need to avoid showing the same notification three times in the in-app notification center. The in-app feed deduplicates by notification ID, showing one entry regardless of how many devices received the push.
There is a subtle problem with multi-device delivery and user activity. If the user has already read the notification on their phone, the push to their tablet is noise. Some systems track read status in real time and suppress pushes to other devices once one device acknowledges. This requires a fast shared state store (Redis, for example) and adds latency to the delivery path. Most systems skip this optimization and accept the minor redundancy.
Deep dive 3: delivery guarantees and retries
Different notification types need different guarantees. A marketing push about a flash sale can be dropped if delivery fails twice. A password reset email must eventually arrive.
At-least-once delivery
For critical notifications, the system uses at-least-once delivery. The worker sends the notification to the provider and waits for an acknowledgment. If the provider returns an error or times out, the worker puts the task back on the queue with an incremented retry count and exponential backoff.
The retry schedule is: 1 second, 5 seconds, 30 seconds, 2 minutes, 10 minutes. After 5 retries, the task moves to a dead-letter queue for manual inspection. For email and SMS, the provider usually handles retries internally, so our retry logic focuses on transient failures like network errors and rate limiting.
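The retry-or-dead-letter decision is a small piece of worker logic. A sketch with hypothetical `requeue` and `dead_letter` callables:

```python
# Retry schedule from the text: 1s, 5s, 30s, 2m, 10m, then dead-letter.
RETRY_DELAYS_SECONDS = [1, 5, 30, 120, 600]
MAX_RETRIES = len(RETRY_DELAYS_SECONDS)

def handle_failure(task: dict, requeue, dead_letter) -> None:
    """Re-enqueue a failed task with backoff, or dead-letter after 5 retries."""
    retries = task.get("retries", 0)
    if retries >= MAX_RETRIES:
        dead_letter(task)                    # manual inspection lane
        return
    requeue({**task, "retries": retries + 1},
            delay_seconds=RETRY_DELAYS_SECONDS[retries])
```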
Rate limiting per provider
Each provider imposes rate limits. APNs allows bursts but will throttle if you exceed sustained limits. Twilio has per-second and per-day caps depending on your plan. SES has sending quotas.
The channel workers use token-bucket rate limiters tuned to each provider’s limits. When the bucket is empty, the worker backs off and retries from the queue. This is important during large broadcasts: a 5-million-user push can easily exceed provider limits if you do not throttle.
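A minimal token-bucket sketch; the rate and capacity values in the test are hypothetical, and real values come from each provider's published limits:

```python
import time

class TokenBucket:
    """Token bucket: refill at `rate` tokens/second, burst up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = None):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def allow(self, now: float = None) -> bool:
        """Consume one token if available; otherwise signal the caller to back off."""
        now = time.monotonic() if now is None else now
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # bucket empty: back off and retry from the queue
```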
Exactly-once display
Even with at-least-once delivery, we want exactly-once display for the user. The mobile app deduplicates by notification ID before showing a notification. The server-side notification log stores a status of delivered once the provider confirms receipt. If a retry produces a second delivery, the app ignores the duplicate.
For in-app notifications (displayed in the notification center within the app), the server stores each notification once and serves it via a paginated feed API. The feed is the source of truth for what the user sees. Push, SMS, and email are delivery channels that alert the user to check the feed; they are not the canonical notification store.
Unsubscribe and preference management
User preferences add a filter layer between fan-out and delivery. The preference service stores per-user, per-notification-type, per-channel settings. A user might want push notifications for direct messages but email-only for marketing updates, and no SMS at all.
The preference schema looks like this: (user_id, notification_type, channel, enabled, mode). The mode field supports values like immediate, digest, and muted. Digest mode collects notifications over a time window (hourly or daily) and sends a single summary.
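The fan-out filter over this schema is a straightforward lookup. A sketch with a hypothetical in-memory table (sample rows are illustrative, not from the article):

```python
from enum import Enum

class Mode(str, Enum):
    IMMEDIATE = "immediate"
    DIGEST = "digest"
    MUTED = "muted"

# (user_id, notification_type, channel) -> (enabled, mode)
PREFS = {
    ("u1", "direct_message", "push"):  (True,  Mode.IMMEDIATE),
    ("u1", "marketing",      "email"): (True,  Mode.DIGEST),
    ("u1", "marketing",      "sms"):   (False, Mode.MUTED),
}

def should_send_now(user_id: str, notification_type: str, channel: str) -> bool:
    """Send immediately only if the channel is enabled and not in digest mode."""
    enabled, mode = PREFS.get((user_id, notification_type, channel),
                              (False, Mode.MUTED))  # default: do not send
    return enabled and mode is Mode.IMMEDIATE
```

Digest-mode rows are skipped here and picked up by the digest job that flushes the window on its own schedule.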
During fan-out, the preference lookup is the most latency-sensitive step. We cache hot preferences in Redis with a TTL of 5 minutes. Cache invalidation happens on preference update: the preference service publishes a change event, and the cache listener evicts the stale entry. The 5-minute TTL is a safety net in case the event is lost.
Unsubscribe links in emails must comply with CAN-SPAM and GDPR. Every email includes a one-click unsubscribe link that hits the preference service directly. CAN-SPAM allows up to 10 business days to honor an opt-out, but we process it in real time. The preference update propagates to the cache within seconds, preventing further emails of that type.
Trade-offs and alternative approaches
Single queue vs per-channel queues. We use a single priority queue feeding into channel-specific workers. An alternative is separate queues per channel. Per-channel queues simplify scaling (you can independently scale SMS workers) but add complexity to the fan-out service, which now writes to multiple queues. We chose the unified queue because the channel router is lightweight and the priority semantics are easier to manage centrally.
Sync vs async notification API. Our API is fully asynchronous: it accepts the request, validates it, and returns a tracking ID immediately. The caller does not wait for delivery. An alternative is a synchronous API for single-user, high-priority notifications (return success only after provider acknowledgment). We avoided this because it couples the caller’s latency to the provider’s latency, and APNs/FCM can spike to 500ms+ during outages.
In-house vs managed service. AWS SNS, Google Cloud Pub/Sub, and Firebase Cloud Messaging offer managed fan-out and delivery. For a startup, using a managed service and building only the preference and template layers makes sense. At scale, the cost savings and control benefits of an in-house system justify the engineering investment.
Polling-based preference lookup vs event-driven cache. We fetch preferences during fan-out. An alternative is maintaining a local cache updated by preference-change events. The cache approach is faster but introduces eventual consistency: a user who just opted out might still receive one notification before the cache updates. For most notification types, this brief inconsistency is acceptable.
What real systems actually do
Twitter/X uses a hybrid fan-out model. Regular users get push fan-out for home timeline and notifications. Celebrity accounts with millions of followers use pull-based fan-out to avoid overwhelming the write path.
Facebook runs a notification system called Foxtrot that processes billions of notifications daily. It uses a multi-stage pipeline with aggressive deduplication. If you get 15 likes on a post in 30 seconds, you see one notification (“15 people liked your post”), not 15 separate ones.
Slack batches notifications aggressively. If a channel has rapid-fire messages, Slack collapses push notifications rather than sending one per message. The mobile app fetches full message history when opened, so the push notification is just a signal to check the app.
WhatsApp relies on persistent connections (XMPP-based) rather than platform push services for online users, falling back to APNs/FCM only when the app is in the background. This gives them lower latency and avoids platform-imposed payload size limits.
Most production systems do not implement exactly-once delivery at the infrastructure level. They implement at-least-once with application-level deduplication. The cost of true exactly-once (consensus protocols, distributed transactions) is not worth it when the client can deduplicate cheaply.
What comes next
This case study covered the core architecture. There are several dimensions we did not explore in depth.
Notification aggregation and collapsing. When a user receives 50 likes in a minute, you should not send 50 push notifications. Designing the aggregation window, the collapsing logic, and the “someone and 49 others liked your post” template is a deep problem.
A/B testing notification copy. Which subject line gets higher open rates? Integrating an experimentation framework into the template layer lets you run controlled tests on notification content.
Timezone-aware scheduling. A promotional notification sent at 3 AM gets ignored or, worse, annoys the user into disabling notifications. Scheduling delivery windows based on the user’s timezone and historical engagement patterns improves open rates significantly.
Cost optimization. SMS is expensive (fractions of a cent per message adds up at 50 million messages/day). Routing low-priority notifications to cheaper channels (push instead of SMS, in-app instead of email) can save meaningful money.
Each of these topics builds on the architecture described here. The fan-out pipeline, preference system, and delivery workers remain the foundation. The deep dives we covered, fan-out strategy, device registration hygiene, and delivery guarantees, are the problems you will face first. Get those right, and the extensions become tractable.
A final thought: notification systems are one of the few services where getting the design wrong has an immediate, visible impact on user experience. A missed password reset email locks a user out. A flood of duplicate push notifications triggers an uninstall. The engineering investment in reliability, deduplication, and preference management pays for itself in user retention.