Design a payments platform

Money movement is unforgiving. A bug in a social feed shows a stale post. A bug in a payments platform charges someone twice. The tolerance for error is zero, and every design choice reflects that constraint.

This case study builds a platform that accepts charges, tracks them through a state machine, records every monetary movement in a double-entry ledger, and reconciles with external processors. If you have not worked through payments integration and reliability patterns, start there.

1. Requirements

Functional

  • Accept payment requests (card, bank transfer, wallet) via API.
  • Process charges through external payment processors (Stripe, Adyen).
  • Support refunds, partial refunds, and disputes.
  • Maintain an immutable ledger of all monetary movements.
  • Expose payment status via API and webhooks.
  • Provide a merchant dashboard for transaction history and reporting.

Non-functional

| Metric | Target |
| --- | --- |
| DAU | 5M buyers, 200K merchants |
| Peak charge QPS | 3,000 |
| Charge latency (p99) | 2 seconds |
| Availability | 99.99% |
| Idempotency | Every write endpoint |
| Data retention | 7 years (regulatory) |

2. Capacity estimation

Storage. Each payment record averages 2 KB. At 100M transactions per month, that is 200 GB/month raw. Ledger entries double it (two rows per transaction). With indexes and audit logs, budget 1 TB/month, or roughly 12 TB/year.

Bandwidth. At peak 3,000 QPS with 2 KB payloads: 6 MB/s inbound. Webhook fanout to merchants adds another 10 MB/s outbound at peak.

QPS breakdown. Charges dominate API traffic at 60%. Status checks account for 30%. Refunds and disputes make up the remaining 10%. These percentages cover API traffic only; once merchant dashboard queries are included, the overall read-to-write ratio is approximately 4:1.
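The arithmetic behind these estimates can be checked with a few lines of Python. This is a rough sketch: decimal units, a 30-day month, and the reading of "ledger entries double it" as twice the raw payment volume are assumptions, not figures from the requirements.

```python
# Back-of-the-envelope check of the capacity estimates.
# Assumptions: decimal units (1 GB = 10**9 bytes) and a 30-day month.
KB, MB, GB, TB = 10**3, 10**6, 10**9, 10**12

record_bytes = 2 * KB          # average payment record
tx_per_month = 100_000_000     # 100M transactions per month
peak_qps = 3_000               # peak charge QPS from the requirements

raw_per_month = record_bytes * tx_per_month   # payment rows only
with_ledger = 2 * raw_per_month               # assumed meaning of "ledger entries double it"
budget_per_year = 12 * (1 * TB)               # 1 TB/month budget incl. indexes and audit logs

peak_inbound = peak_qps * record_bytes        # bytes per second at peak
avg_tps = tx_per_month / (30 * 24 * 3600)     # average transactions per second

print(f"raw payment rows : {raw_per_month / GB:.0f} GB/month")
print(f"with ledger rows : {with_ledger / GB:.0f} GB/month")
print(f"yearly budget    : {budget_per_year / TB:.0f} TB")
print(f"peak inbound     : {peak_inbound / MB:.0f} MB/s")
print(f"average rate     : {avg_tps:.1f} TPS")
```

The average rate works out to under 40 transactions per second, so the 3,000 QPS peak target is sized for bursts rather than steady-state load.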

3. High-level architecture

graph TD
  Client[Client / Merchant API] -->|HTTPS| Gateway[API Gateway]
  Gateway --> PaySvc[Payment Service]
  PaySvc --> Idempotency[(Idempotency Store<br/>Redis)]
  PaySvc --> Ledger[Ledger Service]
  Ledger --> LedgerDB[(Ledger DB<br/>PostgreSQL)]
  PaySvc --> PSP[PSP Adapter<br/>Stripe / Adyen]
  PSP -->|async callback| Webhook[Webhook Ingester]
  Webhook --> Queue[Message Queue]
  Queue --> StateMachine[State Machine Worker]
  StateMachine --> PaySvc
  StateMachine --> NotifySvc[Notification Service]
  NotifySvc -->|webhooks| MerchantAPI[Merchant Webhook Endpoint]
  PaySvc --> PayDB[(Payment DB<br/>PostgreSQL)]

High-level architecture. The payment service orchestrates charges while the ledger service records every monetary movement independently.

The API gateway handles authentication and rate limiting. The payment service owns the charge lifecycle. The ledger service is a separate bounded context that only appends entries. The PSP adapter normalizes interactions with external processors.

4. Deep dives

4.1 Charge flow and idempotency

Every charge request carries a client-supplied idempotency key. The payment service checks Redis for a matching key before doing anything else. If the key exists and the previous request succeeded, it returns the cached response. If it exists but is still processing, it returns a 409 Conflict. This prevents double charges even when clients retry aggressively.

sequenceDiagram
  participant C as Client
  participant G as API Gateway
  participant P as Payment Service
  participant R as Redis (Idempotency)
  participant DB as Payment DB
  participant PSP as PSP Adapter

  C->>G: POST /charges (idempotency-key: abc123)
  G->>P: Forward request
  P->>R: SET abc123 status=processing NX EX 86400
  R-->>P: OK (key was absent)
  P->>DB: INSERT payment (status=pending)
  P->>PSP: Create charge
  PSP-->>P: processor_id: txn_xyz
  P->>DB: UPDATE payment (status=processing, processor_id)
  P->>R: SET abc123 status=completed result={...}
  P-->>C: 201 Created (payment_id, status)

Charge flow with idempotency. Redis acts as a distributed lock that prevents duplicate processor calls even under concurrent retries.

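The idempotency key has a 24-hour TTL in Redis. After expiry Redis no longer recognizes the key, so a late retry is not short-circuited there; the unique constraint on idempotency_key in the payment table (section 4.2) is the durable backstop against a second payment for the same key. To handle races between concurrent requests with the same key, the payment service uses a compare-and-swap pattern on the Redis entry: a request claims the key only if it is absent and updates it only if it is still in the expected state, so only one request ever reaches the PSP.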

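If the PSP definitively rejects the charge (for example, a decline), the payment transitions to failed and the idempotency entry stores the failure. Retries with the same key return that stored failure rather than re-attempting the charge, so clients must generate a new key to try again. Timeouts and other ambiguous outcomes are handled separately (section 6).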
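The claim-then-process logic can be sketched in Python. This is a minimal illustration, not the platform's implementation: `InMemoryIdempotencyStore` is a hypothetical stand-in for Redis whose `set_if_absent` mimics an atomic `SET ... NX`, the TTL is omitted, and the 402 status for a stored failure is an assumption.

```python
from __future__ import annotations

import threading
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Entry:
    status: str                      # "processing", "completed", or "failed"
    code: int = 0                    # HTTP status to replay
    response: Optional[dict] = None  # cached response body


class InMemoryIdempotencyStore:
    """Hypothetical stand-in for Redis (no TTL handling in this sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: dict[str, Entry] = {}

    def set_if_absent(self, key: str, entry: Entry) -> bool:
        # Atomic claim: succeeds for exactly one caller per key.
        with self._lock:
            if key in self._entries:
                return False
            self._entries[key] = entry
            return True

    def put(self, key: str, entry: Entry) -> None:
        with self._lock:
            self._entries[key] = entry

    def get(self, key: str) -> Entry:
        with self._lock:
            return self._entries[key]


def handle_charge(store: InMemoryIdempotencyStore, key: str,
                  call_psp: Callable[[], dict]) -> tuple[int, dict]:
    if store.set_if_absent(key, Entry("processing")):
        try:
            entry = Entry("completed", 201, call_psp())
        except Exception as exc:  # treated as a definitive failure here
            entry = Entry("failed", 402, {"error": str(exc)})
        store.put(key, entry)
        return entry.code, entry.response
    existing = store.get(key)
    if existing.status == "processing":
        return 409, {"error": "request with this key is still in progress"}
    # Completed or failed: replay the stored outcome without calling the PSP.
    return existing.code, existing.response
```

A second call with the same key returns the stored result and never invokes `call_psp` again, which is the property that prevents double charges.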

4.2 Double-entry ledger

Every monetary movement creates exactly two ledger entries: a debit and a credit. A charge debits the buyer’s liability account and credits the merchant’s receivable account. A refund reverses those entries. This structure makes reconciliation straightforward because the sum of all debits always equals the sum of all credits.

erDiagram
  PAYMENT {
      uuid payment_id PK
      uuid merchant_id FK
      string idempotency_key UK
      string status
      int amount_cents
      string currency
      string processor_id
      timestamp created_at
      timestamp updated_at
  }
  LEDGER_ENTRY {
      uuid entry_id PK
      uuid payment_id FK
      uuid account_id FK
      string entry_type "debit or credit"
      int amount_cents
      string currency
      timestamp created_at
  }
  ACCOUNT {
      uuid account_id PK
      string account_type "liability, receivable, revenue, escrow"
      uuid owner_id
      string currency
  }
  LEDGER_TRANSACTION {
      uuid transaction_id PK
      uuid payment_id FK
      string transaction_type "charge, refund, payout, fee"
      timestamp created_at
  }

  PAYMENT ||--o{ LEDGER_TRANSACTION : triggers
  LEDGER_TRANSACTION ||--|{ LEDGER_ENTRY : contains
  LEDGER_ENTRY }o--|| ACCOUNT : affects

Ledger data model. Every ledger transaction contains balanced debit/credit entries that sum to zero.

The ledger service enforces a critical invariant: no transaction commits unless its entries balance. This check runs inside a database transaction. If the sum of debits minus credits is nonzero, the transaction rolls back.
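A minimal sketch of that invariant, using Python's built-in sqlite3 in place of PostgreSQL so the example runs anywhere. The table layout is a trimmed version of the data model, and `post_transaction` and the account names in the usage are hypothetical.

```python
import sqlite3


class UnbalancedTransaction(Exception):
    pass


def open_ledger() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE ledger_entry (
               entry_id       INTEGER PRIMARY KEY,
               transaction_id TEXT NOT NULL,
               account_id     TEXT NOT NULL,
               entry_type     TEXT NOT NULL CHECK (entry_type IN ('debit', 'credit')),
               amount_cents   INTEGER NOT NULL CHECK (amount_cents > 0)
           )"""
    )
    return conn


def post_transaction(conn, transaction_id, entries):
    """entries: list of (account_id, 'debit' | 'credit', amount_cents)."""
    # 'with conn' commits on success and rolls back if an exception escapes.
    with conn:
        conn.executemany(
            "INSERT INTO ledger_entry (transaction_id, account_id, entry_type, amount_cents)"
            " VALUES (?, ?, ?, ?)",
            [(transaction_id, acc, typ, amt) for acc, typ, amt in entries],
        )
        (imbalance,) = conn.execute(
            "SELECT SUM(CASE entry_type WHEN 'debit' THEN amount_cents"
            "                ELSE -amount_cents END)"
            " FROM ledger_entry WHERE transaction_id = ?",
            (transaction_id,),
        ).fetchone()
        # SUM over zero rows is NULL (None), so an empty transaction is rejected too.
        if imbalance != 0:
            raise UnbalancedTransaction(f"{transaction_id}: debits - credits = {imbalance}")
```

Calling `post_transaction(conn, "t1", [("buyer", "debit", 5000), ("merchant", "credit", 5000)])` commits two rows, while a 5000 debit against a 4900 credit raises and leaves the table unchanged.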

Ledger entries are append-only. Corrections happen through reversal entries, never updates. This gives auditors a complete history and satisfies regulatory retention requirements.

Account balances are derived by summing entries rather than stored as a mutable field. For performance, the system maintains a materialized balance that updates asynchronously, but the source of truth is always the entry log.
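The relationship between the entry log and a materialized balance can be sketched as follows. The sign convention (credits add, debits subtract) and the `MaterializedBalances` class are assumptions for illustration; a real system would persist the cursor alongside the balances.

```python
from collections import defaultdict

# Assumed sign convention: a credit increases an account balance, a debit decreases it.
SIGN = {"credit": 1, "debit": -1}


def balance_from_log(log, account_id):
    """Source of truth: sum every entry for the account."""
    return sum(SIGN[typ] * amt for acc, typ, amt in log if acc == account_id)


class MaterializedBalances:
    """Cached balances that lag the log until catch_up() runs."""

    def __init__(self):
        self.balances = defaultdict(int)
        self.applied = 0  # number of log entries already folded in

    def catch_up(self, log):
        # Apply only the entries appended since the last run.
        for acc, typ, amt in log[self.applied:]:
            self.balances[acc] += SIGN[typ] * amt
        self.applied = len(log)


log = [("merchant", "credit", 5000), ("buyer", "debit", 5000)]
view = MaterializedBalances()
view.catch_up(log)

log += [("merchant", "debit", 2000), ("buyer", "credit", 2000)]  # partial refund
stale = view.balances["merchant"]         # still reflects only the charge
view.catch_up(log)
fresh = view.balances["merchant"]         # now matches the log
```

Between the refund being appended and `catch_up` running, the cached value is stale, which is why anything that needs an exact figure reads the log.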

4.3 Payment state machine

Payments move through a well-defined set of states. The state machine worker consumes events from a message queue and applies transitions. Invalid transitions are rejected and logged for investigation.

stateDiagram-v2
  [*] --> Pending: charge created
  Pending --> Processing: PSP accepted
  Processing --> Succeeded: PSP confirmed
  Processing --> Failed: PSP declined
  Pending --> Failed: validation error
  Succeeded --> RefundPending: refund requested
  RefundPending --> Refunded: refund confirmed
  RefundPending --> RefundFailed: refund declined
  Succeeded --> Disputed: chargeback opened
  Disputed --> Succeeded: dispute won
  Disputed --> Refunded: dispute lost
  Failed --> [*]
  Refunded --> [*]

Payment state machine. Each transition triggers a ledger entry and a merchant webhook notification.
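One way to encode the diagram is a lookup table keyed by (current state, event). The event names below are hypothetical labels derived from the diagram's edge text; the worker's real event schema is not specified here.

```python
class InvalidTransition(Exception):
    pass


# (current state, event) -> next state, mirroring the edges in the state diagram.
TRANSITIONS = {
    ("pending", "psp_accepted"): "processing",
    ("pending", "validation_error"): "failed",
    ("processing", "psp_confirmed"): "succeeded",
    ("processing", "psp_declined"): "failed",
    ("succeeded", "refund_requested"): "refund_pending",
    ("succeeded", "chargeback_opened"): "disputed",
    ("refund_pending", "refund_confirmed"): "refunded",
    ("refund_pending", "refund_declined"): "refund_failed",
    ("disputed", "dispute_won"): "succeeded",
    ("disputed", "dispute_lost"): "refunded",
}


def apply_event(state: str, event: str) -> str:
    """Return the next state, or raise so the caller can log the rejection."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidTransition(f"{event!r} is not allowed in state {state!r}") from None
```

Keeping the transitions in data rather than in branching code makes it easy to audit the table against the diagram and to reject anything not listed.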

Every state transition performs three actions atomically:

  1. Update the payment record in the database.
  2. Create the corresponding ledger entries.
  3. Enqueue a webhook event for the merchant.

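Steps 1 and 2 happen in a single database transaction, which assumes the payment and ledger tables are reachable from one transaction; with the separate ledger database chosen in section 7, the ledger write is eventually consistent and reconciliation is the backstop. Step 3 uses the transactional outbox pattern: the webhook event is written to an outbox table in the same transaction as the payment update, then a background worker reads the outbox and publishes to the notification service. This avoids the dual-write problem where the database commits but the queue publish fails.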
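The outbox half of this can be sketched with sqlite3 standing in for PostgreSQL. The table and function names are hypothetical, and the ledger write (step 2) is left out to keep the example short.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE payment (payment_id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            payload   TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def transition(conn, payment_id, new_status):
    # The state change and the outbox row commit together or not at all.
    with conn:
        conn.execute(
            "UPDATE payment SET status = ? WHERE payment_id = ?",
            (new_status, payment_id),
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"payment_id": payment_id, "status": new_status}),),
        )


def relay_once(conn, publish):
    """Background worker pass: publish pending rows in order, then mark them."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If the worker crashes after publishing but before marking the row, the event is published again on the next pass, so delivery is at-least-once and the consumer has to tolerate duplicates.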

5. Reconciliation

Reconciliation runs as a daily batch job that compares internal ledger state against PSP settlement reports. The job identifies three categories of discrepancies:

  • Missing internally. The PSP processed a charge we have no record of. Likely a dropped callback.
  • Missing externally. We recorded a charge the PSP has no record of. Likely a phantom write.
  • Amount mismatch. Both sides have the record but amounts differ. Usually a currency conversion issue.

Each discrepancy generates an alert. The system auto-resolves missing callbacks by fetching the charge status from the PSP API. Amount mismatches and phantom writes require manual review.
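The three categories fall out of simple set operations once both sides are keyed by the processor's transaction ID. This sketch assumes each side has already been reduced to a mapping of processor ID to amount in cents; the function name and the sample IDs in the usage are hypothetical.

```python
def reconcile(internal: dict, psp: dict) -> dict:
    """Compare {processor_id: amount_cents} from the ledger and the PSP report."""
    report = {
        "missing_internally": sorted(psp.keys() - internal.keys()),
        "missing_externally": sorted(internal.keys() - psp.keys()),
        "amount_mismatch": [],
    }
    for pid in sorted(internal.keys() & psp.keys()):
        if internal[pid] != psp[pid]:
            # Keep both amounts so the reviewer can see the difference.
            report["amount_mismatch"].append((pid, internal[pid], psp[pid]))
    return report


ledger_side = {"txn_a": 5000, "txn_b": 1200, "txn_c": 800}
psp_side = {"txn_a": 5000, "txn_b": 1250, "txn_d": 300}
result = reconcile(ledger_side, psp_side)
```

Here `txn_d` would be re-fetched from the PSP automatically, while `txn_c` and the `txn_b` mismatch would go to manual review.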

6. Failure handling

Failures in a payments platform fall into three buckets.

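Transient PSP failures. The processor returns a 500 or times out. The state machine worker retries with exponential backoff, capped at 3 attempts, reusing the same PSP-side idempotency key on every attempt so a retry cannot create a second charge. If the retries end in an explicit error, the payment moves to failed and the merchant receives a webhook. If they end in a timeout, the outcome is still ambiguous and falls under the next case.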
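A minimal sketch of the capped retry loop. `TransientPSPError` is a hypothetical exception type, the 0.5-second base delay is an assumption, and the wrapped call is assumed to send the same idempotency key to the PSP on every attempt.

```python
import time


class TransientPSPError(Exception):
    """Hypothetical marker for a retryable PSP failure (HTTP 500, timeout)."""


def call_with_backoff(call, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run call(), retrying transient failures with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientPSPError:
            if attempt == max_attempts:
                raise  # retries exhausted; the caller decides failed vs. unknown
            # Delay doubles each time: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

With the defaults, a call that fails twice and then succeeds waits 0.5 s and then 1.0 s before returning. The `sleep` parameter is injectable so the loop can be tested without real delays.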

Network partitions. If the payment service cannot reach the PSP, it must not assume the charge failed. The PSP might have processed it. The service marks the payment as unknown, an extra state not drawn in the diagram in section 4.3, and a background reconciler polls the PSP API to resolve the final state. This prevents both double charges and lost charges.

Database failures. If the ledger write fails after a successful PSP charge, the system has taken money without recording it. Write ordering limits the damage: the PSP call only happens after the initial pending record is committed, so every charge sent to the PSP has a durable internal record. If the subsequent ledger write fails, the reconciliation job catches the mismatch and creates the missing entries.

7. Trade-offs and alternatives

Synchronous vs. async processing. A synchronous model simplifies the code path but creates tight coupling with the PSP. If the PSP takes 5 seconds to respond, the client waits 5 seconds. The async model adds complexity (state machine, queues, workers) but decouples the charge acceptance from the charge resolution. Most production systems use a hybrid: synchronous for the initial PSP call, async for everything after.

Single database vs. separate ledger. Keeping payments and ledger entries in the same database simplifies consistency. Separating them lets each scale independently and enforces domain boundaries. The trade-off is that cross-service consistency requires either distributed transactions (expensive) or eventual consistency with reconciliation (complex but practical). We chose separation with eventual consistency.

Idempotency key ownership. Some platforms generate keys server-side. Client-supplied keys give merchants control over retry behavior and are the industry standard (Stripe, Adyen, PayPal all use this approach).

Balance computation. Computing balances by summing entries is correct but slow at scale. Materialized balances with periodic snapshots provide O(1) reads at the cost of async lag. For merchant dashboards, a 30-second delay is acceptable. For payout calculations, the system reads from the entry log directly.

8. What real systems actually do

Stripe has described an internal double-entry ledger for tracking money movement. Its API accepts an optional idempotency key on POST requests, which lets clients retry safely without creating duplicate objects. State transitions happen through an internal event system.

PayPal runs a large reconciliation pipeline over a volume it reports in the billions of payment transactions per year. Their ledger system uses a combination of real-time and batch processing.

Square separates the payment gateway from the ledger system. The gateway handles real-time charge flow while the ledger provides the financial record.

All three build idempotency into the API layer, use some form of double-entry accounting, and run reconciliation as a continuous process rather than a one-time check.

9. What comes next

This design handles the core charge lifecycle, but production systems need more:

  • Multi-currency support. Currency conversion adds exchange rate tracking, conversion ledger entries, and exposure to FX risk.
  • Payout scheduling. Aggregating merchant balances and initiating bank transfers on a schedule (T+2, weekly, monthly).
  • Fraud detection. Real-time scoring of charges before sending them to the PSP. This sits between the API gateway and the payment service.
  • PCI compliance. Raw card data should never touch your servers. Tokenize through the PSP or a vault service to shrink the PCI DSS scope.
  • Rate limiting per merchant. Prevents a single merchant from consuming all processing capacity during flash sales.

The ledger and state machine patterns from this design carry over directly into these extensions. Get the core right and the rest is incremental.
