Webhooks: design and security
Webhooks invert the typical request pattern. Instead of a consumer polling your API for changes, your system pushes events to the consumer’s endpoint the moment something happens. This sounds simple. In practice, building a reliable webhook system means solving delivery guarantees, signature verification, retry logic, fan-out, and failure observability.
For foundational patterns on building resilient delivery systems, see reliability patterns.
What webhooks actually are
A webhook is an HTTP POST request your system sends to a URL the consumer has registered. The payload describes an event: an order was placed, a payment succeeded, a user updated their profile. The consumer’s server receives the POST, processes the event, and returns a 2xx status code to acknowledge receipt.
That is the entire contract. The simplicity is the appeal, but it is also the source of every problem. HTTP is unreliable. Consumer endpoints go down. Networks partition. Responses time out. Your system must handle all of this gracefully.
Signed webhook delivery
The consumer needs to verify that an incoming POST actually came from your system and was not tampered with in transit. The standard approach is HMAC-SHA256 signing.
When a consumer registers a webhook endpoint, your system generates a shared secret. On every delivery, you compute an HMAC signature over the raw request body using that secret, then include the signature in a header.
sequenceDiagram
    participant App as Your Application
    participant Q as Webhook Queue
    participant Worker as Webhook Worker
    participant Consumer as Consumer Endpoint
    App->>Q: enqueue event (order.completed)
    Q->>Worker: dequeue event
    Worker->>Worker: serialize payload to JSON
    Worker->>Worker: compute HMAC-SHA256(secret, body)
    Worker->>Consumer: POST /webhook<br/>X-Signature: sha256=abc123<br/>X-Webhook-ID: evt_456<br/>X-Timestamp: 1714000000
    Consumer->>Consumer: recompute HMAC with shared secret
    Consumer->>Consumer: compare signatures (constant-time)
    Consumer->>Consumer: check timestamp freshness
    Consumer-->>Worker: 200 OK
    Worker->>Q: ack message
Signed webhook delivery flow. The consumer recomputes the HMAC and compares signatures using constant-time comparison to prevent timing attacks.
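The sender side of this flow can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the header names are taken from the diagram, and the sketch makes one assumption beyond the diagram by signing the timestamp together with the body (`timestamp.body`) so that the timestamp header cannot be altered without invalidating the signature.

```python
import hashlib
import hmac
import json
import time
from typing import Dict, Optional, Tuple


def sign_payload(secret: bytes, timestamp: int, body: bytes) -> str:
    """Return the X-Signature value for one delivery.

    Assumption: the signed content is "<timestamp>.<raw body>", so the
    timestamp is covered by the HMAC as well as the body.
    """
    signed_content = str(timestamp).encode("ascii") + b"." + body
    digest = hmac.new(secret, signed_content, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def build_delivery(secret: bytes, event: dict,
                   now: Optional[int] = None) -> Tuple[bytes, Dict[str, str]]:
    """Serialize the event once and sign exactly the bytes that will be sent."""
    body = json.dumps(event, separators=(",", ":")).encode("utf-8")
    timestamp = int(time.time()) if now is None else now
    headers = {
        "Content-Type": "application/json",
        "X-Signature": sign_payload(secret, timestamp, body),
        "X-Webhook-ID": event["id"],
        "X-Timestamp": str(timestamp),
    }
    return body, headers
```

The important design point is that the body is serialized exactly once: the bytes that are signed are the bytes that go on the wire.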
The consumer verification steps are critical:
- Recompute the signature using the raw body bytes (not a re-serialized version) and the shared secret.
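- Constant-time comparison to prevent timing attacks that could let an attacker guess the expected signature byte by byte.
- Timestamp validation to reject replayed requests. If the timestamp is older than five minutes, reject the delivery. This prevents an attacker who captured a valid signed payload from replaying it later, but only if the timestamp is covered by the signature (for example, by signing the timestamp together with the body); an unsigned timestamp header can simply be rewritten.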
Include a webhook ID header (X-Webhook-ID or similar) so the consumer can deduplicate retries.
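The consumer-side checks can be sketched as follows. This is an illustrative example under stated assumptions: the function name is hypothetical, the headers are assumed to arrive in a plain dictionary with the names used in the diagram, and the sender is assumed to sign `timestamp.body` rather than the body alone. Real frameworks usually expose headers case-insensitively, which this sketch does not handle.

```python
import hashlib
import hmac
import time
from typing import Mapping, Optional

MAX_AGE_SECONDS = 5 * 60  # five-minute freshness window


def verify_webhook(secret: bytes, raw_body: bytes, headers: Mapping[str, str],
                   now: Optional[float] = None) -> bool:
    """Return True only if the signature matches and the timestamp is fresh."""
    timestamp = headers.get("X-Timestamp", "")
    received = headers.get("X-Signature", "")
    if not (timestamp.isascii() and timestamp.isdigit()):
        return False

    # Recompute over the raw bytes as received, never a re-serialized body.
    signed_content = timestamp.encode("ascii") + b"." + raw_body
    expected = "sha256=" + hmac.new(secret, signed_content, hashlib.sha256).hexdigest()

    # hmac.compare_digest runs in time independent of where the inputs differ.
    if not hmac.compare_digest(expected.encode(), received.encode()):
        return False

    current = time.time() if now is None else now
    return abs(current - int(timestamp)) <= MAX_AGE_SECONDS
```

The signature is checked before the timestamp is trusted, so a forged timestamp never influences the freshness decision.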
Delivery guarantees
Webhooks provide at-least-once delivery. You cannot guarantee exactly-once over HTTP because you cannot distinguish between “the consumer processed the event but the response was lost” and “the consumer never received the event.” Both look like a timeout to your worker.
This means:
- Your system must retry on failure.
- Consumers must handle duplicate deliveries (idempotency on their end, typically using the webhook ID).
- Your documentation must clearly state the at-least-once guarantee so consumers build accordingly.
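On the consumer side, deduplication by webhook ID might look like the sketch below. The class name is hypothetical, and the in-memory set is an assumption made to keep the example self-contained; a real consumer would more likely rely on a unique constraint in its database so the check survives restarts.

```python
from typing import Callable, Set


class IdempotentHandler:
    """Wraps an event handler so each webhook ID is processed at most once."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()  # assumption: stands in for durable storage

    def handle(self, webhook_id: str, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        if webhook_id in self._seen:
            return False
        self._handler(event)
        # Record the ID only after the handler succeeds, so that a crash
        # mid-processing leads to a retry rather than a silently dropped event.
        self._seen.add(webhook_id)
        return True
```

In both cases the consumer should still answer with a 2xx status, since a duplicate is a successful delivery from the sender's point of view.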
Retry logic
When a delivery fails (network error, timeout, non-2xx response), retry with exponential backoff. A common schedule:
| Attempt | Delay | Cumulative time |
|---|---|---|
| 1 | Immediate | 0 |
| 2 | 1 minute | 1 min |
| 3 | 5 minutes | 6 min |
| 4 | 30 minutes | 36 min |
| 5 | 2 hours | 2h 36m |
| 6 | 8 hours | 10h 36m |
| 7 | 24 hours | 34h 36m (~35 hours) |
After the final attempt, mark the delivery as failed and notify the consumer (via email or dashboard alert). Do not retry indefinitely; you will overwhelm a consumer that is having persistent issues.
Exponential backoff spreads retries over roughly 35 hours before giving up. The log scale shows how quickly the gaps between retries grow.
Add jitter to prevent thundering herd. If 1,000 webhook deliveries fail at the same time (consumer outage), their retries should not all fire at exactly the same moment. Add a random delay of up to 20% of the base interval.
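The schedule and the jitter rule can be combined into one small function. This is a sketch only: the constant and function names are hypothetical, and the delays are the ones from the table above, expressed in seconds.

```python
import random
from typing import Optional

# Delay before each attempt, taken from the schedule table (attempt 1 is immediate).
RETRY_DELAYS_SECONDS = [0, 60, 5 * 60, 30 * 60, 2 * 3600, 8 * 3600, 24 * 3600]
JITTER_FRACTION = 0.20  # random extra delay of up to 20% of the base interval


def next_delay(attempt: int, rng: Optional[random.Random] = None) -> Optional[float]:
    """Return seconds to wait before the given 1-based attempt.

    Returns None when the attempt number is outside the schedule, which the
    caller should treat as "give up and mark the delivery as failed".
    """
    if attempt < 1 or attempt > len(RETRY_DELAYS_SECONDS):
        return None
    base = RETRY_DELAYS_SECONDS[attempt - 1]
    source = rng if rng is not None else random
    return base + source.uniform(0, base * JITTER_FRACTION)
```

For example, `next_delay(3)` returns a value between 300 and 360 seconds, so a thousand deliveries that failed together spread their third attempts across a one-minute window.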
Fan-out to multiple endpoints
A single event often needs to reach multiple consumers. A marketplace might send order.completed to the seller’s system, a fulfillment partner, and an analytics pipeline. Each consumer registers their own endpoint.
The fan-out architecture creates a separate delivery job per consumer per event. This is important because:
- Each consumer has independent retry state. One consumer being down should not delay delivery to others.
- Each consumer has its own signing secret. Sharing secrets across consumers is a security failure.
- Rate limits are per-consumer. A slow consumer should not back-pressure delivery to fast ones.
flowchart TD
    Event["order.completed event"] --> Fanout["Fan-out Service"]
    Fanout --> Job1["Delivery Job: Seller Webhook"]
    Fanout --> Job2["Delivery Job: Fulfillment Webhook"]
    Fanout --> Job3["Delivery Job: Analytics Webhook"]
    Job1 --> Q["Delivery Queue"]
    Job2 --> Q
    Job3 --> Q
    Q --> W1["Worker Pool"]
    W1 --> EP1["seller.com/webhook"]
    W1 --> EP2["fulfillment.co/webhook"]
    W1 --> EP3["analytics.io/webhook"]
Fan-out creates independent delivery jobs per consumer. Each job tracks its own retry state and signing credentials.
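A minimal sketch of the fan-out step, assuming subscriptions are already loaded in memory. The `Subscription` and `DeliveryJob` types and the `fan_out` function are hypothetical names introduced for illustration; a real system would persist the jobs to the delivery queue instead of returning a list.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Subscription:
    consumer_id: str
    url: str
    secret: bytes                 # per-consumer signing secret, never shared
    event_types: FrozenSet[str]   # event types this consumer subscribed to


@dataclass
class DeliveryJob:
    job_id: str
    event_id: str
    consumer_id: str
    url: str
    attempt: int = 1              # retry state lives on the job, per consumer


def fan_out(event: dict, subscriptions: Iterable[Subscription]) -> List[DeliveryJob]:
    """Create one independent delivery job per subscribed consumer."""
    return [
        DeliveryJob(
            job_id=f"{event['id']}:{sub.consumer_id}",
            event_id=event["id"],
            consumer_id=sub.consumer_id,
            url=sub.url,
        )
        for sub in subscriptions
        if event["type"] in sub.event_types
    ]
```

The job carries the consumer ID rather than the secret itself, so the worker looks up the right signing secret at send time and a rotated secret takes effect on the next attempt.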
Endpoint registration and validation
When a consumer registers a webhook URL, validate it immediately:
- URL format check: must be HTTPS. Never deliver webhooks over plain HTTP; the payload and signature would be visible to network observers.
- DNS resolution: verify the domain resolves. Reject private, loopback, and link-local ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 127.0.0.0/8, 169.254.0.0/16) to prevent SSRF (server-side request forgery).
- Verification challenge: send a test request with a challenge token. The consumer must echo the token back to prove they control the endpoint. This prevents attackers from registering arbitrary URLs and weaponizing your webhook system as a request cannon.
Store the endpoint, the signing secret, the event types the consumer subscribed to, and the endpoint’s current health status.
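The first two registration checks can be sketched with the Python standard library. The function names are hypothetical, and the resolver is passed in as a parameter so the example can be exercised without network access. The verification challenge is left out because it requires an actual HTTP round trip.

```python
import ipaddress
import socket
from typing import Callable, List, Optional
from urllib.parse import urlsplit


def is_public_address(address: str) -> bool:
    """True if the IP is not private, loopback, link-local, or otherwise special."""
    ip = ipaddress.ip_address(address)
    return not (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_multicast or ip.is_reserved or ip.is_unspecified)


def resolve_host(hostname: str) -> List[str]:
    """Resolve a hostname to all of its addresses (IPv4 and IPv6)."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    # Drop any IPv6 scope suffix such as "%eth0" before parsing.
    return sorted({info[4][0].split("%")[0] for info in infos})


def validate_webhook_url(
    url: str, resolve: Optional[Callable[[str], List[str]]] = None
) -> List[str]:
    """Return a list of problems; an empty list means the URL passed."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return ["URL is malformed"]
    if parts.scheme != "https":
        return ["URL must use https"]
    if not parts.hostname:
        return ["URL has no host"]
    try:
        addresses = (resolve or resolve_host)(parts.hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return ["host does not resolve"]
    if not addresses:
        return ["host does not resolve"]
    # Every resolved address must be public, not just the first one.
    return [f"address {a} is not publicly routable"
            for a in addresses if not is_public_address(a)]
```

Checking every resolved address matters because a hostname can resolve to a mix of public and internal IPs.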
Circuit breaker for unhealthy endpoints
If a consumer endpoint fails consistently, stop hammering it. Implement a circuit breaker:
- Closed: deliveries flow normally.
- Open: after N consecutive failures (or a failure rate above a threshold), stop sending. Queue events but do not attempt delivery.
- Half-open: after a cooldown period, attempt one delivery. If it succeeds, close the circuit. If it fails, reopen.
Notify the consumer when their endpoint enters the open state. Give them a dashboard to see queued events and manually trigger redelivery once they have fixed their endpoint.
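The three states can be captured in a small per-endpoint class. This is a sketch under assumptions: the class name, the default threshold of five consecutive failures, and the five-minute cooldown are illustrative values, not recommendations from any particular system. The clock is injectable so the behavior can be tested deterministically.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Per-endpoint breaker: closed -> open -> half_open -> closed or open."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0
        self.state = "closed"

    def allow_delivery(self) -> bool:
        """Return True if the worker may attempt a delivery right now."""
        if self.state == "closed":
            return True
        if self.state == "open" and self._clock() - self._opened_at >= self._cooldown:
            self.state = "half_open"  # let exactly one probe through
            return True
        return False  # still cooling down, or a probe is already in flight

    def record_success(self) -> None:
        self._failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        self._failures += 1
        # A failed probe reopens immediately; otherwise wait for N in a row.
        if self.state == "half_open" or self._failures >= self._threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

A worker would call `allow_delivery()` before each attempt and leave the event queued when it returns False, then report the outcome with `record_success()` or `record_failure()`.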
Event schema design
Design webhook payloads as thin notifications, not full data dumps. Include the event type, a reference ID, and a timestamp. Let the consumer call your API for the full resource if they need it.
{
  "id": "evt_a1b2c3",
  "type": "order.completed",
  "created_at": "2026-04-20T14:30:00Z",
  "data": {
    "order_id": "ord_xyz789",
    "total_cents": 4999,
    "currency": "usd"
  }
}
This approach has trade-offs. Thin payloads mean the consumer might need to make an API call to get full details, adding latency. Fat payloads risk sending stale data if the resource changed between event creation and delivery. The thin approach is more common because it avoids stale data and keeps payloads small.
Version your webhook schema from day one. Include a version field or use versioned event types (order.completed.v2). Breaking changes in webhook payloads cause production incidents for your consumers.
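One way to keep the envelope consistent is to build every event through a single helper. The sketch below is illustrative: `build_event` is a hypothetical name, and the integer `version` field is one of the two versioning options mentioned above, not a fixed convention.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def build_event(event_type: str, data: dict, version: int = 1,
                now: Optional[datetime] = None) -> dict:
    """Build a thin event envelope with an explicit schema version.

    `now` must be timezone-aware; it is converted to UTC before formatting.
    """
    created = now if now is not None else datetime.now(timezone.utc)
    return {
        "id": f"evt_{uuid.uuid4().hex[:12]}",
        "type": event_type,
        "version": version,  # assumption: a plain integer schema version
        "created_at": created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "data": data,
    }
```

Consumers can then branch on the `type` and `version` pair, and a breaking change ships as a new version while the old one keeps its shape.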
Debugging webhook failures
Webhook failures are notoriously hard to debug because you do not control the consumer’s server. Build these tools:
- Delivery logs: store every delivery attempt with the request headers, body, response status, response body (first 1 KB), and latency. Make these searchable by event ID, endpoint, and time range.
- Replay: let consumers trigger redelivery of any event from the last 30 days. This is the most requested feature in every webhook system.
- CLI testing tool: provide a CLI or webhook testing endpoint that lets developers inspect delivered payloads during integration development. Tools like webhook.site fill this role during initial integration, but you should offer something native.
- Event catalog: document every event type, its schema, and when it fires. This is your contract with consumers.
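The delivery log described in the first bullet could be modeled roughly as follows. All names here are hypothetical, and the in-memory list stands in for whatever searchable store the system actually uses.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional

MAX_RESPONSE_BYTES = 1024  # keep only the first 1 KB of the response body


@dataclass(frozen=True)
class DeliveryAttempt:
    event_id: str
    endpoint: str
    attempted_at: float               # Unix timestamp of the attempt
    request_headers: Dict[str, str]
    request_body: bytes
    response_status: Optional[int]    # None for network errors and timeouts
    response_body: bytes
    latency_ms: float


def record_attempt(log: List[DeliveryAttempt], attempt: DeliveryAttempt) -> DeliveryAttempt:
    """Store an attempt, truncating the response body to the size limit."""
    trimmed = replace(attempt, response_body=attempt.response_body[:MAX_RESPONSE_BYTES])
    log.append(trimmed)
    return trimmed


def search(log: List[DeliveryAttempt], event_id: Optional[str] = None,
           endpoint: Optional[str] = None, since: Optional[float] = None,
           until: Optional[float] = None) -> List[DeliveryAttempt]:
    """Filter attempts by event ID, endpoint, and time range (all optional)."""
    return [
        a for a in log
        if (event_id is None or a.event_id == event_id)
        and (endpoint is None or a.endpoint == endpoint)
        and (since is None or a.attempted_at >= since)
        and (until is None or a.attempted_at <= until)
    ]
```

Storing the exact request body alongside each attempt is also what makes replay straightforward, since the original payload can be re-signed and sent again.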
Security checklist
| Concern | Mitigation |
|---|---|
| Payload tampering | HMAC-SHA256 signature verification |
| Replay attacks | Timestamp header with 5-minute freshness window |
| SSRF via registration | Block private IPs, require HTTPS, challenge verification |
| Secret leakage | Per-consumer secrets, rotatable without downtime |
| Denial of service | Rate limit deliveries per endpoint, circuit breaker |
| Data exposure | Thin payloads, require consumer to authenticate API calls for full data |
Monitoring and alerting
Track these metrics:
- Delivery success rate per consumer and globally. Alert when it drops below 95%.
- Delivery latency (p50, p99). Slow consumers affect your worker pool.
- Queue depth. Growing depth means consumers are failing or your workers cannot keep up.
- Circuit breaker state changes. Every state transition should be logged and alerted on.
- Retry rate. A spike in retries often signals a widespread consumer issue or a payload schema change that broke consumers.
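As a rough illustration of the first metric, success rates can be computed from a stream of attempt outcomes. The function names are hypothetical and the input shape (pairs of consumer ID and success flag) is an assumption; in practice these numbers would come from a metrics system rather than an in-process calculation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

ALERT_THRESHOLD = 0.95  # alert when the success rate drops below 95%


def success_rates(attempts: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each consumer ID to its fraction of successful deliveries."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for consumer_id, succeeded in attempts:
        totals[consumer_id] += 1
        if succeeded:
            successes[consumer_id] += 1
    return {consumer: successes[consumer] / totals[consumer] for consumer in totals}


def consumers_below_threshold(attempts: Iterable[Tuple[str, bool]],
                              threshold: float = ALERT_THRESHOLD) -> List[str]:
    """Return the consumers whose success rate is below the alert threshold."""
    rates = success_rates(attempts)
    return sorted(consumer for consumer, rate in rates.items() if rate < threshold)
```

Computing the rate per consumer keeps one badly failing endpoint from hiding inside a healthy global average.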
What comes next
With webhook delivery in place, the next article covers payments integration: how to integrate with payment processors like Stripe, use idempotency keys for safe retries, and handle the webhook-driven asynchronous flow that powers modern payment systems.