Nov 8, 2025 · 16 min read · System Design

Email and notification delivery


Every product sends notifications. Password resets, order confirmations, weekly digests, alert thresholds. The moment you treat notification delivery as an afterthought, you get duplicate emails, messages landing in spam folders, and angry support tickets from users who never received their verification link.

This article covers the full pipeline: how transactional email works at the protocol level, how DNS records prove you are who you say you are, how to handle bounces without destroying your sender reputation, and how to make the whole system idempotent so a retry never means a duplicate inbox hit.

For a broader look at notification infrastructure at scale, see notification systems.

Transactional vs marketing email

The distinction matters because email providers treat them differently. Transactional emails are triggered by a user action: a password reset, a receipt, a shipping update. Marketing emails are bulk sends: newsletters, promotions, re-engagement campaigns.

Best practice is to send them from separate subdomains. Your transactional subdomain (tx.example.com) builds a reputation based on high open rates and low complaint rates. Your marketing subdomain (mail.example.com) will naturally have higher unsubscribe and complaint rates. Mixing them on the same domain drags transactional deliverability down.

Most teams use a dedicated ESP (email service provider) like SendGrid, Postmark, or Amazon SES. You can still architect the system well or poorly regardless of which provider sits at the bottom of the stack.

SMTP: the protocol under every email

SMTP (Simple Mail Transfer Protocol) has been around since 1982. The core flow is simple: your server looks up the recipient domain’s MX records in DNS, opens a TCP connection to one of the listed mail servers, upgrades the connection to TLS with the STARTTLS command, then sends envelope metadata (sender, recipient) followed by the message body.

sequenceDiagram
  participant App as Application
  participant Q as Message Queue
  participant Worker as Email Worker
  participant ESP as Email Service Provider
  participant MTA as Recipient MTA
  participant Inbox as User Inbox

  App->>Q: enqueue email job (idempotency key)
  Q->>Worker: dequeue job
  Worker->>Worker: render template with context
  Worker->>Worker: check idempotency store
  Worker->>ESP: send via API (SMTP relay)
  ESP->>MTA: SMTP handshake + TLS
  MTA->>MTA: check SPF, DKIM, DMARC
  MTA->>Inbox: deliver to mailbox
  ESP-->>Worker: delivery status (accepted/rejected)
  Worker->>Q: ack message

Full transactional email pipeline from application trigger to inbox delivery.

In practice you rarely speak raw SMTP yourself. Your ESP exposes an HTTP API that accepts the message and handles SMTP delivery, retries, and connection pooling on your behalf. But understanding the protocol helps you debug deliverability issues.

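Key SMTP response codes to know: 250 means accepted, 421 means the service is temporarily unavailable (retry later), 550 means the mailbox is unavailable, most often because it does not exist (hard bounce), and 552 means the mailbox has exceeded its storage quota (mailbox full). The first digit carries the class: 4xx replies are transient and 5xx replies are permanent. A full mailbox is the usual exception: 552 is formally a permanent failure, but it is commonly handled as a soft bounce because the user may free up space.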
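
To make the mapping concrete, here is a minimal sketch of how a worker might turn a reply code into an action. The function name and the returned labels are hypothetical, and treating 552 as a soft bounce is a policy choice taken from the list above, not something the protocol requires.

```javascript
// Hypothetical helper: map an SMTP reply code to a delivery action.
// The first digit carries the class: 2xx success, 4xx transient, 5xx permanent.
function classifySmtpCode(code) {
  if (!Number.isInteger(code) || code < 200 || code > 599) {
    return "unknown";
  }
  const replyClass = Math.floor(code / 100);
  if (replyClass === 2) return "delivered";
  if (replyClass === 4) return "retry";
  if (replyClass === 5) {
    // Assumption: follow the article and treat 552 (mailbox full) as a soft bounce.
    return code === 552 ? "soft_bounce" : "hard_bounce";
  }
  return "unknown"; // 3xx is an intermediate reply, not a final outcome
}

for (const code of [250, 421, 550, 552]) {
  console.log(code, classifySmtpCode(code));
}
```

When you go through an ESP's HTTP API you usually see the provider's own event types rather than raw codes, so a table like this mostly helps when reading bounce logs.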

DNS authentication: SPF, DKIM, and DMARC

If SMTP is the delivery truck, DNS authentication records are the credentials the driver shows at the gate. Without them, receiving servers have no way to verify that your email actually came from you.

SPF (Sender Policy Framework)

SPF is a DNS TXT record that lists which IP addresses are authorized to send email on behalf of your domain. When a receiving server gets an email whose envelope sender (the MAIL FROM, or Return-Path, address) is at tx.example.com, it looks up that domain's SPF record and checks whether the connecting IP is in the list.

A typical SPF record:

v=spf1 include:sendgrid.net include:_spf.google.com ~all

The include mechanisms pull in the IP ranges of your ESP and your corporate email provider. The ~all at the end means “soft fail anything not listed,” which flags but does not reject. Use -all (hard fail) once you are confident your SPF record is complete.
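
If it helps to see the structure, the record above can be split into its parts with a few lines of code. This is an illustrative parser only: `parseSpf` is a hypothetical name, it handles mechanisms with an optional qualifier, and it does not resolve includes or handle modifiers such as redirect=.

```javascript
// Qualifier prefixes: "+" pass (the default), "-" fail, "~" softfail, "?" neutral.
const SPF_QUALIFIERS = new Map([
  ["+", "pass"],
  ["-", "fail"],
  ["~", "softfail"],
  ["?", "neutral"],
]);

// Split an SPF record into its version tag and a list of mechanisms.
function parseSpf(record) {
  const [version, ...terms] = record.trim().split(/\s+/);
  if (version !== "v=spf1") {
    throw new Error("not an SPF record");
  }
  const mechanisms = terms.map((term) => {
    const hasQualifier = SPF_QUALIFIERS.has(term[0]);
    const result = SPF_QUALIFIERS.get(hasQualifier ? term[0] : "+");
    const body = hasQualifier ? term.slice(1) : term;
    // Split on the first ":" only, so IPv6 values stay intact.
    const idx = body.indexOf(":");
    const name = idx === -1 ? body : body.slice(0, idx);
    const value = idx === -1 ? null : body.slice(idx + 1);
    return { result, name, value };
  });
  return { version, mechanisms };
}

const parsed = parseSpf("v=spf1 include:sendgrid.net include:_spf.google.com ~all");
console.log(parsed.mechanisms);
```

The last entry comes back as an `all` mechanism with a `softfail` result, which is the ~all behavior described above.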

DKIM (DomainKeys Identified Mail)

DKIM attaches a cryptographic signature to every outgoing email. Your ESP signs the message headers and body with a private key. The public key lives in a DNS TXT record. The receiving server fetches the public key, verifies the signature, and confirms the message was not tampered with in transit.

The selector mechanism lets you rotate keys without downtime. Your DKIM-Signature header names a selector such as s1, and the receiving server looks up the public key at s1._domainkey.tx.example.com. To rotate, you publish the new key under s2, switch signing to it, and only then retire s1.
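
As a rough illustration of the lookup, the sketch below reads the s= (selector) and d= (signing domain) tags from a DKIM-Signature header value and builds the DNS name a receiver would query. The function names are hypothetical, the sample header is made up, and its bh= and b= values are placeholders rather than real hashes or signatures.

```javascript
// Parse the "tag=value; tag=value" list found in a DKIM-Signature header value.
function parseDkimTags(headerValue) {
  const tags = {};
  for (const part of headerValue.split(";")) {
    // Split on the first "=" only; base64 values may contain "=" padding.
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    tags[part.slice(0, idx).trim()] = part.slice(idx + 1).trim();
  }
  return tags;
}

// The public key is published at <selector>._domainkey.<signing domain>.
function dkimRecordName(headerValue) {
  const { s: selector, d: domain } = parseDkimTags(headerValue);
  if (!selector || !domain) {
    throw new Error("missing s= or d= tag");
  }
  return `${selector}._domainkey.${domain}`;
}

// Made-up header value; bh= and b= are placeholders.
const sampleHeader =
  "v=1; a=rsa-sha256; d=tx.example.com; s=s1; h=from:to:subject; bh=PLACEHOLDER; b=PLACEHOLDER";
console.log(dkimRecordName(sampleHeader)); // s1._domainkey.tx.example.com
```

During a rotation, messages signed before the switch still carry s=s1, which is why the old record has to stay published until they have all been delivered.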

DMARC (Domain-based Message Authentication, Reporting, and Conformance)

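DMARC ties SPF and DKIM together with a policy. A message passes DMARC when at least one of SPF or DKIM passes and is aligned, meaning the domain it authenticated matches the domain in the visible From header. When neither does, the policy tells receiving servers what to do: none (just report), quarantine (send to spam), or reject (refuse the message during the SMTP conversation).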

Start with p=none and monitor the aggregate reports. Once you see that legitimate mail consistently passes, move to p=quarantine, then p=reject.

Each authentication layer measurably improves deliverability. DMARC with p=reject gives receiving servers the highest confidence.
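
The pass/fail rule is easier to see as code. The sketch below is a simplified model of the decision, not what any particular receiver runs: real mailbox providers combine the published policy with their own reputation signals. The function name, the returned labels, and the reporting address in the sample record are all hypothetical.

```javascript
// A starting-point DMARC record, published as a TXT record at _dmarc.example.com.
// The rua address is a made-up example.
const sampleDmarcRecord = "v=DMARC1; p=none; rua=mailto:dmarc-reports@example.com";

// Simplified model: DMARC passes if SPF or DKIM passes AND is aligned
// with the From header domain. Otherwise the published policy applies.
function dmarcDisposition({ spfPass, spfAligned, dkimPass, dkimAligned, policy }) {
  const passes = (spfPass && spfAligned) || (dkimPass && dkimAligned);
  if (passes) return "deliver";
  switch (policy) {
    case "none":
      return "deliver"; // report only, no enforcement
    case "quarantine":
      return "spam";
    case "reject":
      return "reject";
    default:
      throw new Error(`unknown policy: ${policy}`);
  }
}

// SPF passes but for an unaligned domain; DKIM is aligned and passes.
console.log(
  dmarcDisposition({
    spfPass: true,
    spfAligned: false,
    dkimPass: true,
    dkimAligned: true,
    policy: "reject",
  })
); // deliver
```

The example shows why DKIM matters when an ESP uses its own Return-Path domain: SPF can pass without being aligned, and the aligned DKIM signature is what carries the message through.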

Bounce handling

Bounces fall into two categories. Hard bounces mean the address is permanently invalid: the mailbox does not exist, the domain does not resolve, or the server explicitly rejects your mail. Soft bounces are temporary: the mailbox is full, the server is overloaded, or there is a transient network issue.

Your bounce handling strategy directly affects sender reputation:

Hard bounces: remove the address from your send list immediately. Continuing to send to hard-bounced addresses tells ISPs you do not maintain your list, which tanks your reputation score.
Soft bounces: retry with exponential backoff. After three to five consecutive soft bounces over several days, treat the address as a hard bounce.
Complaint tracking: when a user marks your email as spam, the ISP sends a feedback loop report. Process these immediately and suppress the address.

Most ESPs handle bounce processing via webhooks. Your system receives a webhook event for each bounce or complaint, and you update your suppression list accordingly.
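
A minimal sketch of that webhook handler follows. The event shape (`type`, `email`) and the event names are assumptions for illustration; every ESP defines its own payload. The in-memory Set and Map stand in for a real suppression table, and the limit of three uses the low end of the range mentioned above.

```javascript
// Assumption: three consecutive soft bounces promote an address to suppressed.
const SOFT_BOUNCE_LIMIT = 3;

function createBounceHandler() {
  const suppressed = new Set(); // stand-in for a suppression table
  const softBounceStreak = new Map(); // email -> consecutive soft bounces

  function handle(event) {
    const email = event.email.toLowerCase();
    switch (event.type) {
      case "hard_bounce":
      case "complaint":
        // Permanent failure or spam complaint: suppress immediately.
        suppressed.add(email);
        softBounceStreak.delete(email);
        break;
      case "soft_bounce": {
        const streak = (softBounceStreak.get(email) || 0) + 1;
        softBounceStreak.set(email, streak);
        if (streak >= SOFT_BOUNCE_LIMIT) suppressed.add(email);
        break;
      }
      case "delivered":
        // A successful delivery breaks the streak.
        softBounceStreak.delete(email);
        break;
      default:
        break; // event types this sketch does not model
    }
  }

  return {
    handle,
    isSuppressed: (email) => suppressed.has(email.toLowerCase()),
  };
}

const bounces = createBounceHandler();
bounces.handle({ type: "hard_bounce", email: "Gone@example.com" });
console.log(bounces.isSuppressed("gone@example.com")); // true
```

The email worker would call `isSuppressed` before every send, so a suppressed address never reaches the ESP again.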

Template rendering server-side

Never let the client assemble notification content. All template rendering should happen server-side for three reasons: security (you control what gets interpolated), consistency (every channel gets the same data), and auditability (you can log exactly what was sent).

A typical approach uses a template engine like Handlebars, MJML, or React Email. Templates live in version control alongside your application code. The rendering step takes a template identifier and a context object, produces the final HTML and plain-text versions, and passes them to the delivery layer.

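// Sketch: loadTemplate, render, formatCurrency, and enqueueEmail are
// application-defined helpers, not a specific library's API.
const template = loadTemplate("order-confirmation");
const context = {
  userName: order.user.name,
  orderNumber: order.id,
  items: order.lineItems,
  total: formatCurrency(order.total),
};
const { html, text } = render(template, context);
const subject = `Your order ${order.id} is confirmed`;
// Deterministic key: a retry for the same order reuses it.
const idempotencyKey = `order_confirmation:order_${order.id}`;
await enqueueEmail({ to: order.user.email, subject, html, text, idempotencyKey });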

Keep templates simple. Complex conditional logic in templates is a maintenance nightmare. If you need branching, do it in the service layer and select the appropriate template.

Idempotent notification delivery

The worst user experience is receiving the same notification twice. Or five times. This happens when a worker crashes after sending but before acknowledging the queue message, or when a retry fires because the ESP response timed out even though the email was actually sent.

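The fix is an idempotency key. Every notification job gets a deterministic key derived from the event that triggered it. Before sending, the worker checks a store (Redis or a database table) for the key. If it exists, the notification was already sent; skip it. If not, record the key and send the notification. The check and the write must be a single atomic operation (for example INSERT ... ON CONFLICT DO NOTHING in PostgreSQL, or SET with the NX option in Redis); otherwise two workers can both pass the check and both send.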

CREATE TABLE notification_idempotency (
    idempotency_key TEXT PRIMARY KEY,
    channel TEXT NOT NULL,
    sent_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

The idempotency key should encode the event type and entity: order_confirmation:order_12345 or password_reset:user_789:req_abc. This way, a new password reset request for the same user generates a different key and correctly sends a new email.

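Set a TTL on these keys. You do not need to remember that you sent an order confirmation six months ago. A 72-hour window covers any reasonable retry scenario. Redis supports per-key expiry directly; with a database table like the one above, a scheduled job that deletes rows whose sent_at is older than the window does the same thing.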
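
Here is a small sketch of the claim-then-send pattern with a TTL. An in-memory Map stands in for Redis or the table, which only works inside a single process; `buildKey`, `createIdempotencyStore`, and `sendOnce` are hypothetical names, and the clock is injectable so the expiry can be exercised without waiting.

```javascript
const IDEMPOTENCY_TTL_MS = 72 * 60 * 60 * 1000; // the 72-hour window from the text

// Build a deterministic key such as "password_reset:user_789:req_abc".
function buildKey(eventType, ...ids) {
  return [eventType, ...ids].join(":");
}

function createIdempotencyStore(now = () => Date.now()) {
  const expiresAt = new Map(); // key -> expiry timestamp in ms
  return {
    // Returns true only for the first caller within the TTL window.
    // Check and write happen in one synchronous step, which mirrors
    // an atomic SET NX or INSERT ... ON CONFLICT DO NOTHING.
    claim(key) {
      const current = now();
      const expiry = expiresAt.get(key);
      if (expiry !== undefined && expiry > current) return false;
      expiresAt.set(key, current + IDEMPOTENCY_TTL_MS);
      return true;
    },
  };
}

// Send only if this worker won the claim.
function sendOnce(store, key, send) {
  if (!store.claim(key)) return "skipped";
  send();
  return "sent";
}

const idempotencyStore = createIdempotencyStore();
const resetKey = buildKey("password_reset", "user_789", "req_abc");
console.log(sendOnce(idempotencyStore, resetKey, () => {})); // sent
console.log(sendOnce(idempotencyStore, resetKey, () => {})); // skipped
```

Claiming before sending trades one failure mode for another: a worker that crashes between the claim and the send loses that notification instead of duplicating it. Which side to err on depends on the message type.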

Multi-channel notification routing

Most products eventually support more than email: push notifications, SMS, in-app messages, Slack. The architecture stays the same. A notification service accepts an event, determines which channels the user has enabled, renders channel-specific content, and enqueues a delivery job per channel.

flowchart LR
  Event["Event Bus"] --> NS["Notification Service"]
  NS --> Prefs["User Preferences"]
  NS --> Router["Channel Router"]
  Router --> EmailQ["Email Queue"]
  Router --> PushQ["Push Queue"]
  Router --> SMSQ["SMS Queue"]
  Router --> InAppQ["In-App Queue"]
  EmailQ --> EmailW["Email Worker"]
  PushQ --> PushW["Push Worker"]
  SMSQ --> SMSW["SMS Worker"]
  InAppQ --> InAppW["In-App Worker"]

Multi-channel notification routing. Each channel has its own queue and worker pool for independent scaling and failure isolation.

User preferences matter. Let users choose which channels they want for each notification category. Store these preferences in a fast lookup (Redis hash or a denormalized table). Never send a push notification to a user who explicitly turned them off.
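
The routing step can be sketched as a pure function that turns an event and a preference map into one job per enabled channel. The preference shape (category mapped to a list of channels), the queue names, and the per-channel idempotency key format are assumptions for illustration. A category with no stored preference produces no jobs here; a real system would likely carve out mandatory messages such as password resets.

```javascript
const ALL_CHANNELS = ["email", "push", "sms", "in_app"];

// preferences: { [category]: string[] } listing channels the user enabled.
function routeNotification(event, preferences) {
  const enabled = preferences[event.category] || [];
  return ALL_CHANNELS.filter((channel) => enabled.includes(channel)).map(
    (channel) => ({
      queue: `${channel}_queue`,
      userId: event.userId,
      // One key per channel so a retry on one channel cannot block another.
      idempotencyKey: `${event.id}:${channel}`,
    })
  );
}

const userPreferences = {
  order_updates: ["email", "push"],
  marketing: [],
};

const jobs = routeNotification(
  { id: "order_confirmation:order_12345", userId: "user_789", category: "order_updates" },
  userPreferences
);
console.log(jobs.map((job) => job.queue)); // [ 'email_queue', 'push_queue' ]
```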

Rate limiting and batching

Some events generate a burst of notifications. A deploy that triggers 500 monitoring alerts should not result in 500 individual emails. Implement notification batching: collect events within a time window (30 seconds to 5 minutes) and merge them into a single digest.
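
One way to sketch the batching step: group each user's events into fixed windows that open at the first event and close after the window length. The event shape is hypothetical, and the function assumes events arrive sorted by timestamp. A production version would flush batches on a timer rather than process a finished list.

```javascript
// events: [{ userId, timestamp (ms), message }], assumed sorted by timestamp.
function batchEvents(events, windowMs) {
  const openBatch = new Map(); // userId -> batch currently collecting events
  const batches = [];
  for (const event of events) {
    let batch = openBatch.get(event.userId);
    // Start a new batch when none is open or the window has elapsed.
    if (!batch || event.timestamp - batch.windowStart >= windowMs) {
      batch = { userId: event.userId, windowStart: event.timestamp, messages: [] };
      openBatch.set(event.userId, batch);
      batches.push(batch);
    }
    batch.messages.push(event.message);
  }
  return batches;
}

const alertEvents = [
  { userId: "user_789", timestamp: 0, message: "cpu high on web-1" },
  { userId: "user_789", timestamp: 10000, message: "cpu high on web-2" },
  { userId: "user_789", timestamp: 29000, message: "cpu high on web-3" },
  { userId: "user_789", timestamp: 31000, message: "disk full on db-1" },
];

const digests = batchEvents(alertEvents, 30000);
console.log(digests.map((digest) => digest.messages.length)); // [ 3, 1 ]
```

Each batch becomes one digest email, so the four alerts above turn into two messages instead of four.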

Rate limiting protects your ESP quota and your sender reputation. Apply limits at multiple levels:

Per-user: no more than N notifications per hour per channel. This catches runaway loops.
Per-template: cap the send rate for marketing templates separately from transactional ones.
Global: stay within your ESP’s rate limit to avoid 429 responses and temporary suspensions.
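
The per-user limit can be sketched as a fixed-window counter keyed by user and channel. This is one of several possible algorithms (token bucket and sliding window are common alternatives), the names are hypothetical, and the in-memory Map would need to be a shared store such as Redis once you run more than one worker.

```javascript
// Fixed-window limiter: at most `limit` sends per `windowMs` per user and channel.
function createRateLimiter(limit, windowMs, now = () => Date.now()) {
  const windows = new Map(); // "userId:channel" -> { start, count }
  return function allow(userId, channel) {
    const key = `${userId}:${channel}`;
    const current = now();
    let entry = windows.get(key);
    // Open a fresh window if none exists or the old one has elapsed.
    if (!entry || current - entry.start >= windowMs) {
      entry = { start: current, count: 0 };
      windows.set(key, entry);
    }
    if (entry.count >= limit) return false;
    entry.count += 1;
    return true;
  };
}

// Example: no more than 2 emails per hour per user.
const allowSend = createRateLimiter(2, 60 * 60 * 1000);
console.log(allowSend("user_789", "email")); // true
console.log(allowSend("user_789", "email")); // true
console.log(allowSend("user_789", "email")); // false
```

What to do with a rejected send is a separate decision: drop it, delay it, or fold it into the next digest.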

Observability for notifications

Track these metrics per channel:

Metric	Why it matters
Enqueue rate	Detects upstream spikes or drops
Delivery latency (p50, p99)	Measures time from event to delivery
Delivery success rate	Catches ESP outages or authentication failures
Bounce rate	Rising bounces signal list hygiene problems
Open rate (email)	Declining opens may indicate spam folder placement
Unsubscribe rate	Measures content relevance

Alert on sudden changes. A spike in bounce rate after a deploy likely means a code change is generating invalid addresses. A drop in delivery success rate might mean your DKIM public key was removed from DNS or rotated without updating the signer.
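
A "sudden change" check can be as simple as comparing the current window against a baseline. The thresholds below (double the baseline, at least a 2% bounce rate, at least 100 sends) are illustrative assumptions, not recommended values, and the function names are hypothetical.

```javascript
function bounceRate({ sent, bounced }) {
  return sent === 0 ? 0 : bounced / sent;
}

// Flag a spike when the current rate is well above both the baseline
// and an absolute floor, and the sample is large enough to trust.
function isBounceSpike(baseline, current, options = {}) {
  const { ratio = 2, minRate = 0.02, minSent = 100 } = options;
  if (current.sent < minSent) return false;
  const currentRate = bounceRate(current);
  return currentRate > minRate && currentRate > bounceRate(baseline) * ratio;
}

const lastWeek = { sent: 10000, bounced: 100 }; // 1% baseline
const lastHour = { sent: 1000, bounced: 50 }; // 5% after a deploy
console.log(isBounceSpike(lastWeek, lastHour)); // true
```

The minimum-sample guard matters on low-volume channels, where a handful of bounces would otherwise swing the rate and page someone for nothing.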

Common pitfalls

Sending from a no-reply address. Users reply to transactional emails more often than you think. Use a monitored address or at least route replies to a ticketing system.

Not including a plain-text version. Some email clients and spam filters penalize HTML-only emails. Always send a multipart message with both HTML and plain text.

Ignoring time zones. Sending a marketing email at 3 AM in the user’s time zone destroys engagement. For transactional email this matters less, but for digests and summaries, schedule delivery in the user’s local morning.

Hardcoding ESP credentials. Use environment variables and secret management. When you rotate keys or switch providers, you do not want to redeploy every service.

What comes next

With email and notification delivery in place, the next article covers webhooks: how to push events to external systems reliably, sign payloads for verification, and handle the retry and fan-out challenges that come with outbound event delivery.
