Webhooks: design and security

Webhooks invert the typical request pattern. Instead of a consumer polling your API for changes, your system pushes events to the consumer’s endpoint the moment something happens. This sounds simple. In practice, building a reliable webhook system means solving delivery guarantees, signature verification, retry logic, fan-out, and failure observability.

For foundational patterns on building resilient delivery systems, see reliability patterns.

What webhooks actually are

A webhook is an HTTP POST request your system sends to a URL the consumer has registered. The payload describes an event: an order was placed, a payment succeeded, a user updated their profile. The consumer’s server receives the POST, processes the event, and returns a 2xx status code to acknowledge receipt.

That is the entire contract. The simplicity is the appeal, but it is also the source of every problem. HTTP is unreliable. Consumer endpoints go down. Networks partition. Responses time out. Your system must handle all of this gracefully.

Signed webhook delivery

The consumer needs to verify that an incoming POST actually came from your system and was not tampered with in transit. The standard approach is HMAC-SHA256 signing.

When a consumer registers a webhook endpoint, your system generates a shared secret. On every delivery, you compute an HMAC signature over the raw request body using that secret, then include the signature in a header.

sequenceDiagram
  participant App as Your Application
  participant Q as Webhook Queue
  participant Worker as Webhook Worker
  participant Consumer as Consumer Endpoint

  App->>Q: enqueue event (order.completed)
  Q->>Worker: dequeue event
  Worker->>Worker: serialize payload to JSON
  Worker->>Worker: compute HMAC-SHA256(secret, body)
  Worker->>Consumer: POST /webhook<br/>X-Signature: sha256=abc123<br/>X-Webhook-ID: evt_456<br/>X-Timestamp: 1714000000
  Consumer->>Consumer: recompute HMAC with shared secret
  Consumer->>Consumer: compare signatures (constant-time)
  Consumer->>Consumer: check timestamp freshness
  Consumer-->>Worker: 200 OK
  Worker->>Q: ack message

Signed webhook delivery flow. The consumer recomputes the HMAC and compares signatures using constant-time comparison to prevent timing attacks.
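The worker side of this flow can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the header names follow the diagram, and the sketch assumes the timestamp is signed together with the body (joined by a dot) so that the timestamp header cannot be altered without invalidating the signature.

```python
import hashlib
import hmac
import json
import time
from typing import Dict, Optional, Tuple


def sign_payload(secret: bytes, body: bytes, timestamp: int) -> str:
    """Return the X-Signature header value for one delivery."""
    # Assumption: sign "<timestamp>.<body>" rather than the body alone, so a
    # captured request cannot be replayed with a rewritten X-Timestamp.
    signed = str(timestamp).encode("ascii") + b"." + body
    digest = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def build_delivery(secret: bytes, event: dict,
                   now: Optional[int] = None) -> Tuple[Dict[str, str], bytes]:
    """Serialize an event once and build the headers that accompany it."""
    timestamp = int(time.time()) if now is None else now
    # Serialize exactly once: the bytes that are signed must be the bytes sent.
    body = json.dumps(event, separators=(",", ":")).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Signature": sign_payload(secret, body, timestamp),
        "X-Webhook-ID": event["id"],
        "X-Timestamp": str(timestamp),
    }
    return headers, body
```

The returned headers and body can be handed to whatever HTTP client the worker uses. The important property is that the body is never re-serialized between signing and sending.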

The consumer verification steps are critical:

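  1. Recompute the signature using the raw body bytes (not a re-serialized version) and the shared secret.
  2. Constant-time comparison to prevent timing attacks that could reveal the expected signature byte by byte and let an attacker forge one.
  3. Timestamp validation to reject replayed requests. If the timestamp is older than five minutes, reject the delivery. This prevents an attacker who captured a valid signed payload from replaying it later. The check only works if the timestamp is covered by the signature (for example, by signing the timestamp together with the body); otherwise the attacker can send the old body with a fresh timestamp.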

Include a webhook ID header (X-Webhook-ID or similar) so the consumer can deduplicate retries.
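On the consumer side, the three verification steps might look like the sketch below. It assumes the sender signs the timestamp and body joined by a dot and uses the header names from the diagram; `verify_delivery` and `MAX_AGE_SECONDS` are illustrative names, and a real framework would normally provide case-insensitive header lookup.

```python
import hashlib
import hmac
import time
from typing import Mapping, Optional

MAX_AGE_SECONDS = 5 * 60  # five-minute freshness window


def verify_delivery(secret: bytes, headers: Mapping[str, str], raw_body: bytes,
                    now: Optional[float] = None) -> bool:
    """Return True only if the signature matches and the timestamp is fresh."""
    try:
        timestamp = int(headers["X-Timestamp"])
        received = headers["X-Signature"]
    except (KeyError, ValueError):
        return False  # missing or malformed headers

    # Step 1: recompute the signature over the raw bytes as received.
    signed = str(timestamp).encode("ascii") + b"." + raw_body
    expected = "sha256=" + hmac.new(secret, signed, hashlib.sha256).hexdigest()

    # Step 2: constant-time comparison.
    if not hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8")):
        return False

    # Step 3: reject timestamps outside the freshness window.
    current = time.time() if now is None else now
    return abs(current - timestamp) <= MAX_AGE_SECONDS
```

The function must be given the request body exactly as it arrived on the wire. Parsing the JSON and serializing it again can change whitespace or key order and break the signature.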

Delivery guarantees

Webhooks provide at-least-once delivery. You cannot guarantee exactly-once over HTTP because you cannot distinguish between “the consumer processed the event but the response was lost” and “the consumer never received the event.” Both look like a timeout to your worker.

This means:

  • Your system must retry on failure.
  • Consumers must handle duplicate deliveries (idempotency on their end, typically using the webhook ID).
  • Your documentation must clearly state the at-least-once guarantee so consumers build accordingly.
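A consumer can absorb duplicates by remembering which webhook IDs it has already processed. The sketch below uses an in-memory dictionary purely for illustration; `ProcessedEvents` and `handle_delivery` are hypothetical names, and a production consumer would more likely rely on a database unique constraint so the check survives restarts and concurrent workers. The 72-hour retention is an assumption, chosen to outlast a sender's retry window.

```python
import time
from typing import Callable, Dict, Optional


class ProcessedEvents:
    """Remembers webhook IDs that were processed successfully."""

    def __init__(self, ttl_seconds: float = 72 * 3600) -> None:
        self._ttl = ttl_seconds
        self._seen: Dict[str, float] = {}

    def contains(self, webhook_id: str, now: float) -> bool:
        seen_at = self._seen.get(webhook_id)
        return seen_at is not None and now - seen_at < self._ttl

    def add(self, webhook_id: str, now: float) -> None:
        self._seen[webhook_id] = now


def handle_delivery(store: ProcessedEvents, webhook_id: str,
                    process: Callable[[], None],
                    now: Optional[float] = None) -> int:
    """Return the HTTP status the consumer should answer with."""
    current = time.time() if now is None else now
    if store.contains(webhook_id, current):
        return 200  # duplicate: acknowledge without processing again
    try:
        process()
    except Exception:
        return 500  # not recorded, so the sender's retry is processed normally
    store.add(webhook_id, current)
    return 200
```

Recording the ID only after processing succeeds means a crash mid-processing leads to a retry rather than a silently dropped event.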

Retry logic

When a delivery fails (network error, timeout, non-2xx response), retry with exponential backoff. A common schedule:

Attempt   Delay        Cumulative time
1         Immediate    0
2         1 minute     1 min
3         5 minutes    6 min
4         30 minutes   36 min
5         2 hours      2h 36m
6         8 hours      10h 36m
7         24 hours     ~35 hours

After the final attempt, mark the delivery as failed and notify the consumer (via email or dashboard alert). Do not retry indefinitely; you will overwhelm a consumer that is having persistent issues.
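The schedule above can be expressed as a lookup table. This is a sketch with hypothetical names (`RETRY_DELAYS`, `delay_before_attempt`, `cumulative_time`); returning `None` is one possible way to signal that the delivery should be marked as failed.

```python
from datetime import timedelta
from typing import List, Optional

# Delay to wait before each attempt, indexed by attempt number minus one.
RETRY_DELAYS: List[timedelta] = [
    timedelta(0),
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=8),
    timedelta(hours=24),
]


def delay_before_attempt(attempt: int) -> Optional[timedelta]:
    """Delay before the given 1-based attempt, or None once retries are exhausted."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if attempt > len(RETRY_DELAYS):
        return None  # caller marks the delivery as failed and notifies the consumer
    return RETRY_DELAYS[attempt - 1]


def cumulative_time(attempt: int) -> timedelta:
    """Total time elapsed since the event when the given attempt fires."""
    return sum(RETRY_DELAYS[:attempt], timedelta(0))
```

For the seventh attempt, `cumulative_time(7)` evaluates to 34 hours 36 minutes, which matches the roughly 35 hours in the table.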

Exponential backoff spreads retries over roughly 35 hours before giving up. The log scale shows how quickly the gaps between retries grow.

Add jitter to prevent thundering herd. If 1,000 webhook deliveries fail at the same time (consumer outage), their retries should not all fire at exactly the same moment. Add a random delay of up to 20% of the base interval.
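A minimal sketch of that jitter rule follows. The name `with_jitter` and the optional `rng` parameter (useful for deterministic tests) are assumptions of this example.

```python
import random
from typing import Optional


def with_jitter(base_seconds: float, max_fraction: float = 0.20,
                rng: Optional[random.Random] = None) -> float:
    """Return the base delay plus a random extra of up to max_fraction of it."""
    extra_limit = base_seconds * max_fraction
    if rng is None:
        extra = random.uniform(0.0, extra_limit)
    else:
        extra = rng.uniform(0.0, extra_limit)
    return base_seconds + extra
```

With a five-minute base interval, each retry lands somewhere between 300 and 360 seconds, so a batch of simultaneous failures is spread across a one-minute window instead of a single instant.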

Fan-out to multiple endpoints

A single event often needs to reach multiple consumers. A marketplace might send order.completed to the seller’s system, a fulfillment partner, and an analytics pipeline. Each consumer registers their own endpoint.

The fan-out architecture creates a separate delivery job per consumer per event. This is important because:

  • Each consumer has independent retry state. One consumer being down should not delay delivery to others.
  • Each consumer has its own signing secret. Sharing secrets across consumers is a security failure.
  • Rate limits are per-consumer. A slow consumer should not back-pressure delivery to fast ones.

flowchart TD
  Event["order.completed event"] --> Fanout["Fan-out Service"]
  Fanout --> Job1["Delivery Job: Seller Webhook"]
  Fanout --> Job2["Delivery Job: Fulfillment Webhook"]
  Fanout --> Job3["Delivery Job: Analytics Webhook"]
  Job1 --> Q["Delivery Queue"]
  Job2 --> Q
  Job3 --> Q
  Q --> W1["Worker Pool"]
  W1 --> EP1["seller.com/webhook"]
  W1 --> EP2["fulfillment.co/webhook"]
  W1 --> EP3["analytics.io/webhook"]

Fan-out creates independent delivery jobs per consumer. Each job tracks its own retry state and signing credentials.
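The fan-out step itself is small. The sketch below assumes hypothetical `Endpoint` and `DeliveryJob` records and creates one job per subscribed endpoint; enqueueing the jobs and looking up the per-endpoint secret at send time are left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Endpoint:
    endpoint_id: str
    url: str
    secret: bytes                 # per-consumer signing secret, never shared
    event_types: FrozenSet[str]   # event types this consumer subscribed to


@dataclass
class DeliveryJob:
    event_id: str
    endpoint_id: str
    url: str
    attempt: int = 1              # retry state lives on the job, not the event


def fan_out(event: dict, endpoints: List[Endpoint]) -> List[DeliveryJob]:
    """Create one independent delivery job per endpoint subscribed to the event."""
    return [
        DeliveryJob(event_id=event["id"], endpoint_id=e.endpoint_id, url=e.url)
        for e in endpoints
        if event["type"] in e.event_types
    ]
```

Because each job carries its own attempt counter, a failing fulfillment endpoint can move through its retry schedule while the seller's job completes on the first try.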

Endpoint registration and validation

When a consumer registers a webhook URL, validate it immediately:

  1. URL format check: must be HTTPS. Never deliver webhooks over plain HTTP; the payload and signature would be visible to network observers.
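  2. DNS resolution: verify the domain resolves. Reject private, loopback, and link-local ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 127.0.0.0/8, 169.254.0.0/16) to prevent SSRF (server-side request forgery). Repeat the check at delivery time, because a DNS record can change after registration.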
  3. Verification challenge: send a test request with a challenge token. The consumer must echo the token back to prove they control the endpoint. This prevents attackers from registering arbitrary URLs and weaponizing your webhook system as a request cannon.
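The first two checks can be sketched with the standard library. To keep the example free of network calls, the resolved addresses are passed in by the caller (for instance from `socket.getaddrinfo`); `validate_endpoint` and `is_public_address` are hypothetical names, and the challenge step is not shown.

```python
import ipaddress
from typing import Iterable
from urllib.parse import urlsplit


def is_public_address(address: str) -> bool:
    """True only for globally routable, non-multicast IP addresses."""
    ip = ipaddress.ip_address(address)
    # is_global is False for private, loopback, link-local and reserved ranges.
    return ip.is_global and not ip.is_multicast


def validate_endpoint(url: str, resolved_addresses: Iterable[str]) -> None:
    """Raise ValueError if the URL or any address it resolves to is unacceptable."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("webhook URL must use https")
    if not parts.hostname:
        raise ValueError("webhook URL has no host")
    addresses = list(resolved_addresses)
    if not addresses:
        raise ValueError("host does not resolve")
    for address in addresses:
        # Every resolved address must pass; one internal address is enough to reject.
        if not is_public_address(address):
            raise ValueError(f"{address} is not a public address")
```

Running the same function again just before each delivery, against the addresses the worker is about to connect to, narrows the window for DNS changes made after registration.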

Store the endpoint, the signing secret, the event types the consumer subscribed to, and the endpoint’s current health status.

Circuit breaker for unhealthy endpoints

If a consumer endpoint fails consistently, stop hammering it. Implement a circuit breaker:

  • Closed: deliveries flow normally.
  • Open: after N consecutive failures (or a failure rate above a threshold), stop sending. Queue events but do not attempt delivery.
  • Half-open: after a cooldown period, attempt one delivery. If it succeeds, close the circuit. If it fails, reopen.

Notify the consumer when their endpoint enters the open state. Give them a dashboard to see queued events and manually trigger redelivery once they have fixed their endpoint.
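The three states map onto a small state machine. The sketch below trips on consecutive failures only; the class name, the threshold of five failures, and the five-minute cooldown are assumptions for illustration. One instance would exist per endpoint.

```python
import time
from typing import Optional


class CircuitBreaker:
    """Per-endpoint breaker with closed, open and half_open states."""

    def __init__(self, failure_threshold: int = 5,
                 cooldown_seconds: float = 300.0) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.state = "closed"
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def allow_delivery(self, now: Optional[float] = None) -> bool:
        current = time.monotonic() if now is None else now
        if (self.state == "open" and self.opened_at is not None
                and current - self.opened_at >= self.cooldown_seconds):
            self.state = "half_open"  # cooldown elapsed: allow a single probe
            return True
        # While half_open, the probe is in flight and further sends are held back.
        return self.state == "closed"

    def record_success(self) -> None:
        self.state = "closed"
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self, now: Optional[float] = None) -> None:
        current = time.monotonic() if now is None else now
        self.consecutive_failures += 1
        if (self.state == "half_open"
                or self.consecutive_failures >= self.failure_threshold):
            self.state = "open"       # trip, or reopen after a failed probe
            self.opened_at = current
```

A worker would call `allow_delivery()` before each send and leave the event queued when it returns `False`. State transitions are also the natural place to trigger the consumer notification.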

Event schema design

Design webhook payloads as thin notifications, not full data dumps. Include the event type, a reference ID, and a timestamp. Let the consumer call your API for the full resource if they need it.

{
  "id": "evt_a1b2c3",
  "type": "order.completed",
  "created_at": "2026-04-20T14:30:00Z",
  "data": {
    "order_id": "ord_xyz789",
    "total_cents": 4999,
    "currency": "usd"
  }
}

This approach has trade-offs. Thin payloads mean the consumer might need to make an API call to get full details, adding latency. Fat payloads risk sending stale data if the resource changed between event creation and delivery. The thin approach is more common because it avoids stale data and keeps payloads small.

Version your webhook schema from day one. Include a version field or use versioned event types (order.completed.v2). Breaking changes in webhook payloads cause production incidents for your consumers.
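One way to carry the version is a top-level field in the envelope. The helper below is a sketch: `build_event` and the `version` key are assumptions of this example, not part of the payload shown earlier, and the timestamp must be timezone-aware so it can be rendered in UTC.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_event(event_id: str, event_type: str, data: Dict[str, Any],
                schema_version: int = 1,
                created_at: Optional[datetime] = None) -> str:
    """Serialize a thin, versioned event envelope to JSON."""
    created = created_at if created_at is not None else datetime.now(timezone.utc)
    created_utc = created.astimezone(timezone.utc)
    envelope = {
        "id": event_id,
        "type": event_type,
        "version": schema_version,  # bumped only on breaking payload changes
        "created_at": created_utc.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "data": data,
    }
    return json.dumps(envelope)
```

Consumers can then branch on `version` and keep handling older payloads while they migrate.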

Debugging webhook failures

Webhook failures are notoriously hard to debug because you do not control the consumer’s server. Build these tools:

  1. Delivery logs: store every delivery attempt with the request headers, body, response status, response body (first 1 KB), and latency. Make these searchable by event ID, endpoint, and time range.
  2. Replay: let consumers trigger redelivery of any event from the last 30 days. This is the most requested feature in every webhook system.
  3. CLI testing tool: provide a CLI or webhook testing endpoint that lets developers inspect delivered payloads during integration development. Tools like webhook.site fill this role during initial integration, but you should offer something native.
  4. Event catalog: document every event type, its schema, and when it fires. This is your contract with consumers.
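A delivery log entry might be modeled as follows. The record and the in-memory search are illustrative only (`DeliveryAttempt` and `search_attempts` are hypothetical names); a real system would store these rows in a database with indexes on event ID, endpoint, and time.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MAX_RESPONSE_BODY_BYTES = 1024  # keep only the first 1 KB of the response


@dataclass
class DeliveryAttempt:
    event_id: str
    endpoint_id: str
    attempted_at: float               # Unix time of the attempt
    request_headers: Dict[str, str]
    request_body: bytes
    response_status: Optional[int]    # None for network errors and timeouts
    response_body: bytes
    latency_ms: float

    def __post_init__(self) -> None:
        # Truncate large consumer responses before they are stored.
        self.response_body = self.response_body[:MAX_RESPONSE_BODY_BYTES]


def search_attempts(log: List[DeliveryAttempt],
                    event_id: Optional[str] = None,
                    endpoint_id: Optional[str] = None,
                    since: Optional[float] = None,
                    until: Optional[float] = None) -> List[DeliveryAttempt]:
    """Filter attempts by event ID, endpoint and time range; None means no filter."""
    return [
        a for a in log
        if (event_id is None or a.event_id == event_id)
        and (endpoint_id is None or a.endpoint_id == endpoint_id)
        and (since is None or a.attempted_at >= since)
        and (until is None or a.attempted_at <= until)
    ]
```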

Security checklist

Concern                 Mitigation
Payload tampering       HMAC-SHA256 signature verification
Replay attacks          Timestamp header with 5-minute freshness window
SSRF via registration   Block private IPs, require HTTPS, challenge verification
Secret leakage          Per-consumer secrets, rotatable without downtime
Denial of service       Rate limit deliveries per endpoint, circuit breaker
Data exposure           Thin payloads, require consumer to authenticate API calls for full data

Monitoring and alerting

Track these metrics:

  • Delivery success rate per consumer and globally. Alert when it drops below 95%.
  • Delivery latency (p50, p99). Slow consumers affect your worker pool.
  • Queue depth. Growing depth means consumers are failing or your workers cannot keep up.
  • Circuit breaker state changes. Every state transition should be logged and alerted on.
  • Retry rate. A spike in retries often signals a widespread consumer issue or a payload schema change that broke consumers.
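The first two metrics reduce to simple arithmetic over a window of recent deliveries. The sketch below uses hypothetical helper names and the nearest-rank method for percentiles. Treating an empty window as a 100% success rate is an assumption, made so that idle consumers do not trigger alerts.

```python
import math
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of successful deliveries in the window (1.0 when empty)."""
    if not outcomes:
        return 1.0
    return sum(outcomes) / len(outcomes)


def should_alert(outcomes: Sequence[bool], threshold: float = 0.95) -> bool:
    """True when the success rate has dropped below the alert threshold."""
    return success_rate(outcomes) < threshold


def percentile(latencies_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, for example pct=50 for p50 or pct=99 for p99."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

In practice these numbers usually come from a metrics system rather than hand-rolled code, but the definitions are worth pinning down so that per-consumer and global dashboards agree.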

What comes next

With webhook delivery in place, the next article covers payments integration: how to integrate with payment processors like Stripe, use idempotency keys for safe retries, and handle the webhook-driven asynchronous flow that powers modern payment systems.
