Design a payments platform
Money movement is unforgiving. A bug in a social feed shows a stale post. A bug in a payments platform charges someone twice. The tolerance for error is zero, and every design choice reflects that constraint.
This case study builds a platform that accepts charges, tracks them through a state machine, records every monetary movement in a double-entry ledger, and reconciles with external processors. If you have not worked through payments integration and reliability patterns, start there.
1. Requirements
Functional
- Accept payment requests (card, bank transfer, wallet) via API.
- Process charges through external payment processors (Stripe, Adyen).
- Support refunds, partial refunds, and disputes.
- Maintain an immutable ledger of all monetary movements.
- Expose payment status via API and webhooks.
- Provide a merchant dashboard for transaction history and reporting.
Non-functional
| Metric | Target |
|---|---|
| DAU | 5M buyers, 200K merchants |
| Peak charge QPS | 3,000 |
| Charge latency (p99) | 2 seconds |
| Availability | 99.99% |
| Idempotency | Every write endpoint |
| Data retention | 7 years (regulatory) |
2. Capacity estimation
Storage. Each payment record averages 2 KB. At 100M transactions per month, that is 200 GB/month raw. Ledger entries double it (two rows per transaction). With indexes and audit logs, budget 1 TB/month, or roughly 12 TB/year.
Bandwidth. At peak 3,000 QPS with 2 KB payloads: 6 MB/s inbound. Webhook fanout to merchants adds another 10 MB/s outbound at peak.
QPS breakdown. Charges dominate at 60% of traffic. Status checks account for 30%. Refunds and disputes make up the remaining 10%. The read-to-write ratio is approximately 4:1 when including dashboard queries.
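The storage and bandwidth figures above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: it uses decimal units (1 KB = 1,000 bytes) and assumes a 30-day month, neither of which the estimate states explicitly.

```python
# Inputs taken from the targets and estimates above.
RECORD_BYTES = 2_000                  # 2 KB per payment record (decimal units)
TXNS_PER_MONTH = 100_000_000
PEAK_QPS = 3_000
SECONDS_PER_MONTH = 30 * 24 * 3600    # assumption: 30-day month

raw_gb_per_month = RECORD_BYTES * TXNS_PER_MONTH / 1e9
inbound_mb_per_s = PEAK_QPS * RECORD_BYTES / 1e6
average_tps = TXNS_PER_MONTH / SECONDS_PER_MONTH

print(f"raw storage:   {raw_gb_per_month:.0f} GB/month")
print(f"peak inbound:  {inbound_mb_per_s:.0f} MB/s")
print(f"average rate:  {average_tps:.1f} transactions/s")
```

The average rate comes out far below the 3,000 QPS peak target, so the peak figure is best read as burst headroom rather than sustained load.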
3. High-level architecture
graph TD
Client[Client / Merchant API] -->|HTTPS| Gateway[API Gateway]
Gateway --> PaySvc[Payment Service]
PaySvc --> Idempotency[(Idempotency Store<br/>Redis)]
PaySvc --> Ledger[Ledger Service]
Ledger --> LedgerDB[(Ledger DB<br/>PostgreSQL)]
PaySvc --> PSP[PSP Adapter<br/>Stripe / Adyen]
PSP -->|async callback| Webhook[Webhook Ingester]
Webhook --> Queue[Message Queue]
Queue --> StateMachine[State Machine Worker]
StateMachine --> PaySvc
StateMachine --> NotifySvc[Notification Service]
NotifySvc -->|webhooks| MerchantAPI[Merchant Webhook Endpoint]
PaySvc --> PayDB[(Payment DB<br/>PostgreSQL)]
High-level architecture. The payment service orchestrates charges while the ledger service records every monetary movement independently.
The API gateway handles authentication and rate limiting. The payment service owns the charge lifecycle. The ledger service is a separate bounded context that only appends entries. The PSP adapter normalizes interactions with external processors.
4. Deep dives
4.1 Charge flow and idempotency
Every charge request carries a client-supplied idempotency key. The payment service checks Redis for a matching key before doing anything else. If the key exists and the previous request succeeded, it returns the cached response. If it exists but is still processing, it returns a 409 Conflict. This prevents double charges even when clients retry aggressively.
sequenceDiagram
participant C as Client
participant G as API Gateway
participant P as Payment Service
participant R as Redis (Idempotency)
participant DB as Payment DB
participant PSP as PSP Adapter
C->>G: POST /charges (idempotency-key: abc123)
G->>P: Forward request
P->>R: SET abc123 status=processing NX TTL=24h
R-->>P: OK (key was absent)
P->>DB: INSERT payment (status=pending)
P->>PSP: Create charge
PSP-->>P: processor_id: txn_xyz
P->>DB: UPDATE payment (status=processing, processor_id)
P->>R: SET abc123 status=completed result={...}
P-->>C: 201 Created (payment_id, status)
Charge flow with idempotency. Redis acts as a distributed lock that prevents duplicate processor calls even under concurrent retries.
The idempotency key has a 24-hour TTL. After expiry, the same key can be reused. The payment service uses a compare-and-swap pattern on the Redis entry to handle race conditions between concurrent requests with the same key.
If the PSP call fails, the payment transitions to failed and the idempotency entry stores the failure. Retries with the same key return the failure response rather than re-attempting the charge. Clients must generate a new key to retry.
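The check-then-claim logic described in this section can be sketched as follows. This is a minimal illustration, not the platform's actual code: `InMemoryIdempotencyStore` is a hypothetical stand-in for Redis whose `set_if_absent` mimics an atomic `SET ... NX`, the TTL is omitted, and the 402 status for a failed charge is an assumption.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Entry:
    status: str                      # "processing", "completed", or "failed"
    http_status: int = 0
    body: Optional[dict] = None


class InMemoryIdempotencyStore:
    """Hypothetical stand-in for Redis; no TTL handling."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries = {}

    def set_if_absent(self, key: str, entry: Entry) -> bool:
        # Atomic claim, analogous to Redis SET with the NX option.
        with self._lock:
            if key in self._entries:
                return False
            self._entries[key] = entry
            return True

    def get(self, key: str) -> Optional[Entry]:
        with self._lock:
            return self._entries.get(key)

    def put(self, key: str, entry: Entry) -> None:
        with self._lock:
            self._entries[key] = entry


def handle_charge(store: InMemoryIdempotencyStore, key: str,
                  call_psp: Callable[[], dict]):
    # Only one request per key can win the claim.
    if not store.set_if_absent(key, Entry("processing")):
        existing = store.get(key)
        if existing.status == "processing":
            return 409, {"error": "a request with this key is still in progress"}
        # Completed or failed: replay the stored outcome, never call the PSP again.
        return existing.http_status, existing.body
    try:
        entry = Entry("completed", 201, call_psp())
    except Exception as exc:  # sketch only; real code would catch PSP-specific errors
        entry = Entry("failed", 402, {"status": "failed", "reason": str(exc)})
    store.put(key, entry)
    return entry.http_status, entry.body
```

Calling `handle_charge` twice with the same key invokes `call_psp` once; the second call replays the stored response, including a stored failure.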
4.2 Double-entry ledger
Every monetary movement creates exactly two ledger entries: a debit and a credit. A charge debits the buyer’s liability account and credits the merchant’s receivable account. A refund reverses those entries. This structure makes reconciliation straightforward because the sum of all debits always equals the sum of all credits.
erDiagram
PAYMENT {
uuid payment_id PK
uuid merchant_id FK
string idempotency_key UK
string status
int amount_cents
string currency
string processor_id
timestamp created_at
timestamp updated_at
}
LEDGER_ENTRY {
uuid entry_id PK
uuid payment_id FK
uuid account_id FK
string entry_type "debit or credit"
int amount_cents
string currency
timestamp created_at
}
ACCOUNT {
uuid account_id PK
string account_type "liability, receivable, revenue, escrow"
uuid owner_id
string currency
}
LEDGER_TRANSACTION {
uuid transaction_id PK
uuid payment_id FK
string transaction_type "charge, refund, payout, fee"
timestamp created_at
}
PAYMENT ||--o{ LEDGER_TRANSACTION : triggers
LEDGER_TRANSACTION ||--|{ LEDGER_ENTRY : contains
LEDGER_ENTRY }o--|| ACCOUNT : affects
Ledger data model. Every ledger transaction contains balanced debit/credit entries that sum to zero.
The ledger service enforces a critical invariant: no transaction commits unless its entries balance. This check runs inside a database transaction. If the sum of debits minus credits is nonzero, the transaction rolls back.
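One way to sketch this invariant is to insert the entries and verify the balance inside the same database transaction. The example below uses SQLite from the Python standard library purely so it runs anywhere; the design above uses PostgreSQL, and the table layout and the `post_transaction` helper are simplified assumptions.

```python
import sqlite3


class UnbalancedTransaction(Exception):
    pass


def open_ledger() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE ledger_entry (
               entry_id INTEGER PRIMARY KEY,
               transaction_id TEXT NOT NULL,
               account_id TEXT NOT NULL,
               entry_type TEXT NOT NULL CHECK (entry_type IN ('debit', 'credit')),
               amount_cents INTEGER NOT NULL CHECK (amount_cents > 0)
           )"""
    )
    return conn


def post_transaction(conn: sqlite3.Connection, transaction_id: str, entries) -> None:
    """Insert (account_id, entry_type, amount_cents) rows, or roll back all of them."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany(
            "INSERT INTO ledger_entry "
            "(transaction_id, account_id, entry_type, amount_cents) "
            "VALUES (?, ?, ?, ?)",
            [(transaction_id, acct, kind, cents) for acct, kind, cents in entries],
        )
        (net,) = conn.execute(
            "SELECT COALESCE(SUM(CASE entry_type WHEN 'debit' THEN amount_cents "
            "ELSE -amount_cents END), 0) "
            "FROM ledger_entry WHERE transaction_id = ?",
            (transaction_id,),
        ).fetchone()
        if net != 0:
            # Raising here aborts the transaction, so no partial entries persist.
            raise UnbalancedTransaction(f"{transaction_id}: debits - credits = {net}")
```

A balanced charge such as `[("buyer", "debit", 1000), ("merchant", "credit", 1000)]` commits; changing either amount raises `UnbalancedTransaction` and leaves the table untouched.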
Ledger entries are append-only. Corrections happen through reversal entries, never updates. This gives auditors a complete history and satisfies regulatory retention requirements.
Account balances are derived by summing entries rather than stored as a mutable field. For performance, the system maintains a materialized balance that updates asynchronously, but the source of truth is always the entry log.
4.3 Payment state machine
Payments move through a well-defined set of states. The state machine worker consumes events from a message queue and applies transitions. Invalid transitions are rejected and logged for investigation.
stateDiagram-v2
[*] --> Pending: charge created
Pending --> Processing: PSP accepted
Processing --> Succeeded: PSP confirmed
Processing --> Failed: PSP declined
Pending --> Failed: validation error
Succeeded --> RefundPending: refund requested
RefundPending --> Refunded: refund confirmed
RefundPending --> RefundFailed: refund declined
Succeeded --> Disputed: chargeback opened
Disputed --> Succeeded: dispute won
Disputed --> Refunded: dispute lost
Failed --> [*]
Refunded --> [*]
Payment state machine. Each transition triggers a ledger entry and a merchant webhook notification.
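The transitions in the diagram map naturally onto a lookup table. The sketch below encodes exactly the edges shown; the names `State`, `TRANSITIONS`, and `apply_transition` are illustrative, and treating `RefundFailed` as having no outgoing edges simply mirrors the diagram.

```python
from enum import Enum


class State(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    REFUND_PENDING = "refund_pending"
    REFUNDED = "refunded"
    REFUND_FAILED = "refund_failed"
    DISPUTED = "disputed"


# Allowed target states for each current state, taken from the diagram.
TRANSITIONS = {
    State.PENDING: {State.PROCESSING, State.FAILED},
    State.PROCESSING: {State.SUCCEEDED, State.FAILED},
    State.SUCCEEDED: {State.REFUND_PENDING, State.DISPUTED},
    State.REFUND_PENDING: {State.REFUNDED, State.REFUND_FAILED},
    State.DISPUTED: {State.SUCCEEDED, State.REFUNDED},
    State.FAILED: set(),
    State.REFUNDED: set(),
    State.REFUND_FAILED: set(),
}


class InvalidTransition(Exception):
    pass


def apply_transition(current: State, target: State) -> State:
    """Return the new state, or raise so the caller can log and reject the event."""
    if target not in TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

A worker would call `apply_transition(payment.status, event.target)` before touching the database, so an out-of-order event such as `failed -> succeeded` is rejected up front.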
Every state transition performs three actions atomically:
- Update the payment record in the database.
- Create the corresponding ledger entries.
- Enqueue a webhook event for the merchant.
The first two actions happen in a single database transaction. The third uses the transactional outbox pattern: the webhook event is written to an outbox table in the same transaction, then a background worker reads the outbox and publishes to the notification service. This avoids the dual-write problem where the database commits but the queue publish fails.
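A minimal sketch of the outbox pattern follows. It uses SQLite from the standard library so it is runnable as-is, omits the ledger write for brevity, and assumes hypothetical `payment` and `outbox` tables; `publish` stands in for whatever client hands events to the notification service.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE payment (payment_id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def transition(conn: sqlite3.Connection, payment_id: str, new_status: str) -> None:
    # The state change and the outbox row commit together or not at all.
    with conn:
        conn.execute(
            "UPDATE payment SET status = ? WHERE payment_id = ?",
            (new_status, payment_id),
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"payment_id": payment_id, "status": new_status}),),
        )


def relay_outbox(conn: sqlite3.Connection, publish) -> int:
    """Publish unpublished rows in order; returns how many were delivered."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    delivered = 0
    for row_id, payload in rows:
        publish(json.loads(payload))  # if this raises, the row stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        delivered += 1
    return delivered
```

If the relay crashes after `publish` but before marking the row, the event is sent again on the next pass, so delivery is at-least-once and consumers need to tolerate duplicates.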
5. Reconciliation
Reconciliation runs as a daily batch job that compares internal ledger state against PSP settlement reports. The job identifies three categories of discrepancies:
- Missing internally. The PSP processed a charge we have no record of. Likely a dropped callback.
- Missing externally. We recorded a charge the PSP has no record of. Likely a phantom write.
- Amount mismatch. Both sides have the record but amounts differ. Usually a currency conversion issue.
Each discrepancy generates an alert. The system auto-resolves missing callbacks by fetching the charge status from the PSP API. Amount mismatches and phantom writes require manual review.
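The three discrepancy categories can be expressed as a comparison of two maps keyed by processor ID. This is a simplified sketch: real settlement reports carry more fields, and the `Discrepancy` type and `reconcile` function are illustrative names rather than part of any PSP SDK.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Discrepancy:
    processor_id: str
    kind: str  # "missing_internally", "missing_externally", or "amount_mismatch"


def reconcile(internal: Dict[str, int], external: Dict[str, int]) -> List[Discrepancy]:
    """Compare amount_cents per processor_id between our ledger and the PSP report."""
    found = []
    for pid in sorted(internal.keys() | external.keys()):
        if pid not in internal:
            found.append(Discrepancy(pid, "missing_internally"))   # dropped callback
        elif pid not in external:
            found.append(Discrepancy(pid, "missing_externally"))   # phantom write
        elif internal[pid] != external[pid]:
            found.append(Discrepancy(pid, "amount_mismatch"))
    return found


if __name__ == "__main__":
    ours = {"txn_a": 1000, "txn_b": 2500, "txn_c": 400}
    theirs = {"txn_a": 1000, "txn_b": 2450, "txn_d": 900}
    for d in reconcile(ours, theirs):
        print(d.processor_id, d.kind)
```

Only the `missing_internally` results would feed the automatic path that re-fetches charge status from the PSP; the other two kinds go to a manual review queue.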
6. Failure handling
Failures in a payments platform fall into three buckets.
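Transient PSP failures. The processor returns a 500 or times out. The state machine worker retries with exponential backoff, capped at 3 attempts, reusing the same idempotency key on each attempt so that a request that timed out but actually succeeded is not charged again. After exhausting retries, the payment moves to failed and the merchant receives a webhook.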
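Retry with capped exponential backoff can be sketched as below. The base delay of 0.5 seconds and the added jitter are assumptions, not values from this design, and `TransientPSPError` is a hypothetical exception type for 5xx responses and timeouts. The sketch also assumes the wrapped call sends the same idempotency key to the PSP on every attempt.

```python
import random
import time


class TransientPSPError(Exception):
    """Hypothetical error raised for PSP 5xx responses and timeouts."""


def call_with_backoff(call, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run call(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientPSPError:
            if attempt == max_attempts:
                raise  # caller moves the payment to failed and notifies the merchant
            delay = base_delay * 2 ** (attempt - 1)      # 0.5s, 1s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter is an assumption
```

Passing `sleep` as a parameter keeps the function testable without real waiting; a worker would use the default `time.sleep` or reschedule the message with a delay instead.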
Network partitions. If the payment service cannot reach the PSP, it must not assume the charge failed. The PSP might have processed it. The service marks the payment as unknown and a background reconciler polls the PSP API to resolve the final state. This prevents both double charges and lost charges.
Database failures. If the ledger write fails after a successful PSP charge, the system has taken money without recording it in the ledger. Write ordering limits the damage: the PSP call only happens after the initial pending record is committed, so every charge sent to the PSP has an internal payment record to reconcile against. If the subsequent ledger write fails, the reconciliation job catches the mismatch and creates the missing entries.
7. Trade-offs and alternatives
Synchronous vs. async processing. A synchronous model simplifies the code path but creates tight coupling with the PSP. If the PSP takes 5 seconds to respond, the client waits 5 seconds. The async model adds complexity (state machine, queues, workers) but decouples the charge acceptance from the charge resolution. Most production systems use a hybrid: synchronous for the initial PSP call, async for everything after.
Single database vs. separate ledger. Keeping payments and ledger entries in the same database simplifies consistency. Separating them lets each scale independently and enforces domain boundaries. The trade-off is that cross-service consistency requires either distributed transactions (expensive) or eventual consistency with reconciliation (complex but practical). We chose separation with eventual consistency.
Idempotency key ownership. Some platforms generate keys server-side. Client-supplied keys give merchants control over retry behavior and are the industry standard (Stripe, Adyen, PayPal all use this approach).
Balance computation. Computing balances by summing entries is correct but slow at scale. Materialized balances with periodic snapshots provide O(1) reads at the cost of async lag. For merchant dashboards, a 30-second delay is acceptable. For payout calculations, the system reads from the entry log directly.
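The snapshot approach can be illustrated with a small sketch: a snapshot stores the balance up to some log position, and a read adds only the entries after it. The `seq` field (a monotonically increasing log position) and the sign convention (credits increase the balance, debits decrease it) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Entry:
    seq: int            # assumed monotonically increasing position in the entry log
    entry_type: str     # "debit" or "credit"
    amount_cents: int


@dataclass(frozen=True)
class Snapshot:
    last_seq: int
    balance_cents: int


def signed(entry: Entry) -> int:
    # Assumed convention: credits increase the balance, debits decrease it.
    return entry.amount_cents if entry.entry_type == "credit" else -entry.amount_cents


def balance_from_log(entries: Iterable[Entry]) -> int:
    """Source of truth: sum every entry for the account."""
    return sum(signed(e) for e in entries)


def take_snapshot(entries: List[Entry]) -> Snapshot:
    last_seq = max((e.seq for e in entries), default=0)
    return Snapshot(last_seq, balance_from_log(entries))


def balance_from_snapshot(snapshot: Snapshot, entries: Iterable[Entry]) -> int:
    """Fast path: snapshot balance plus entries appended after the snapshot."""
    tail = (e for e in entries if e.seq > snapshot.last_seq)
    return snapshot.balance_cents + sum(signed(e) for e in tail)
```

Because the log is append-only, the fast path and the full sum always agree once the tail is included, which gives a cheap consistency check to run before payouts.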
8. What real systems actually do
Stripe uses a double-entry ledger backed by a custom database. Idempotency keys are accepted on all POST requests, and clients opt in by sending one. State transitions happen through an internal event system.
PayPal runs a massive reconciliation pipeline that processes tens of millions of transactions daily. Their ledger system uses a combination of real-time and batch processing.
Square separates the payment gateway from the ledger system. The gateway handles real-time charge flow while the ledger provides the financial record.
All three enforce idempotency at the API layer, use some form of double-entry accounting, and run reconciliation as a continuous process rather than a one-time check.
9. What comes next
This design handles the core charge lifecycle, but production systems need more:
- Multi-currency support. Currency conversion adds exchange rate tracking, conversion ledger entries, and exposure to FX risk.
- Payout scheduling. Aggregating merchant balances and initiating bank transfers on a schedule (T+2, weekly, monthly).
- Fraud detection. Real-time scoring of charges before sending them to the PSP. This sits between the API gateway and the payment service.
- PCI compliance. Card data never touches your servers. Use tokenization through the PSP or a vault service.
- Rate limiting per merchant. Prevents a single merchant from consuming all processing capacity during flash sales.
The ledger and state machine patterns from this design carry over directly into these extensions. Get the core right and the rest is incremental.