High Level Design · Part 12

Multi-region and global systems

In this series (12 parts)
  1. Monolith vs microservices
  2. Microservice communication patterns
  3. Service discovery and registration
  4. Event-driven architecture
  5. Distributed data patterns
  6. Caching architecture patterns
  7. Search architecture
  8. Storage systems at scale
  9. Notification systems
  10. Real-time systems architecture
  11. Batch and stream processing
  12. Multi-region and global systems

Your service runs in a single data center in Virginia. A user in Tokyo experiences 180ms of network latency on every request before your server even starts processing. A user in Frankfurt sees 90ms. When that Virginia data center has an outage (and it will), every user on the planet loses access simultaneously.

Multi-region architecture solves latency, availability, and data sovereignty problems, but it introduces a fundamentally harder set of trade-offs. Data must be replicated across regions, conflicts must be resolved when two regions accept writes to the same record, and your deployment pipeline must coordinate releases across continents. The complexity is real, but for any service with a global user base, it is unavoidable.

Active-passive vs active-active

The two foundational patterns for multi-region deployment are active-passive and active-active. Every global architecture is a variation of one of these.

In active-passive, one region (the primary) handles all write traffic and serves reads. One or more secondary regions receive replicated data and sit idle, ready to take over if the primary fails. Failover can be manual (an engineer flips the switch) or automatic (a health check triggers promotion). The secondary region is your insurance policy, not a performance optimization.

Active-passive is simpler to reason about because there is a single source of truth. Database replication flows in one direction. There are no write conflicts. The downside is that users far from the primary region still suffer high latency, and failover (especially automatic) is risky: promoting a replica that has not fully caught up means losing the most recent writes.

In active-active, multiple regions accept both reads and writes simultaneously. Users are routed to their nearest region, so latency is minimized. If one region goes down, the others absorb its traffic without a failover event. There is no single point of failure at the regional level.

The catch is data consistency. When two users in different regions update the same record at the same time, you have a write conflict. Resolving these conflicts correctly is the central challenge of active-active systems. We will cover resolution strategies shortly.

graph TD
  subgraph Active-Passive
      direction TB
      U1P["Users"] --> P["Primary Region (US East)"]
      P -->|Async Replication| S["Standby Region (EU West)"]
      U1P -.->|Failover| S
  end

Active-passive: all traffic goes to the primary, standby waits for failover.

graph TD
  subgraph Active-Active
      direction TB
      U1["US Users"] --> R1["US Region"]
      U2["EU Users"] --> R2["EU Region"]
      U3["AP Users"] --> R3["AP Region"]
      R1 <-->|Bi-directional Replication| R2
      R2 <-->|Bi-directional Replication| R3
      R1 <-->|Bi-directional Replication| R3
  end

Active-active: all regions serve traffic, data replicates bidirectionally.

Latency-based routing

Getting users to the right region is the job of your load balancing and DNS infrastructure. Latency-based DNS routing (offered by Route 53, Cloudflare, and others) resolves a domain name to the IP address of the region with the lowest measured latency for that user’s location.

This is not the same as geographic routing. A user in Mexico might have lower latency to the US West region than to US East, even though US East is technically closer as the crow flies. Latency-based routing uses actual network measurements, not geography.
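The selection logic can be sketched in a few lines. This is a simplified model, not the actual algorithm Route 53 or Cloudflare uses; the region names and latency figures are illustrative.

```python
# Sketch: pick the region with the lowest measured round-trip latency
# for a client. Real DNS providers maintain latency tables per resolver
# network; here we assume the measurements are already available.

def pick_region(latencies_ms: dict[str, float]) -> str:
    """Return the region with the lowest measured latency."""
    return min(latencies_ms, key=latencies_ms.get)

# A user in Mexico: US West wins despite US East being geographically closer.
measured = {"us-east-1": 85.0, "us-west-2": 60.0, "eu-west-1": 140.0}
print(pick_region(measured))  # us-west-2
```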

For active-passive, DNS routes all traffic to the primary. On failover, you update DNS to point to the secondary. DNS TTLs mean this switch is not instant; clients cache the old IP for minutes. Lowering TTLs in advance (before a planned failover) helps, but unplanned failovers suffer from stale DNS.

For active-active, DNS routes each user to their nearest region continuously. If one region goes down, the health check removes it from DNS, and traffic shifts to remaining regions. The shift is gradual as DNS caches expire, but the remaining regions must have enough capacity to absorb the extra load.

Data sovereignty

Data sovereignty laws require that certain data physically resides within specific national boundaries. GDPR (EU), LGPD (Brazil), PIPL (China), and many others impose restrictions on where personal data can be stored and processed.

This is not just a compliance checkbox. It constrains your architecture. If EU user data must stay in the EU, you cannot replicate it to a US region for failover. You need region-pinned data: certain records are tagged with a jurisdiction, and the system ensures they only exist in permitted regions.

Implementation typically involves a routing layer that inspects the user’s jurisdiction and directs their requests to the appropriate region. The user’s data is partitioned by region at the database level. Cross-region queries that might touch data from multiple jurisdictions require careful handling: either the query is decomposed and executed in each region separately, or the data is anonymized/aggregated before crossing borders.
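A minimal sketch of such a routing layer, assuming a hypothetical jurisdiction-to-region mapping (the region names and fallback policy are illustrative, not a real compliance configuration):

```python
# Sketch: pin a user's writes to regions permitted by their jurisdiction.
# The mapping and fallback policy here are hypothetical examples.

PERMITTED_REGIONS = {
    "EU": {"eu-west-1", "eu-central-1"},   # e.g. GDPR: EU data stays in the EU
    "BR": {"sa-east-1"},                   # e.g. LGPD
    "US": {"us-east-1", "us-west-2"},
}

def route_write(user_jurisdiction: str, preferred_region: str) -> str:
    """Return a region allowed to store this user's data.

    If the user's nearest region is not permitted, fall back to a
    deterministic choice inside their jurisdiction.
    """
    allowed = PERMITTED_REGIONS[user_jurisdiction]
    if preferred_region in allowed:
        return preferred_region
    return sorted(allowed)[0]  # deterministic fallback within jurisdiction

# An EU user routed toward a US region gets redirected to an EU region.
print(route_write("EU", "us-east-1"))
```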

A common pattern is to separate data into two tiers: user-identifying data (pinned to the user’s jurisdiction) and anonymized analytics data (replicated globally). Your analytics pipeline can aggregate across regions as long as individual user records stay put.

Conflict resolution across regions

In an active-active system, two regions can accept conflicting writes to the same data. User A in New York updates their profile name to “Alice” at the exact moment user B (an admin in London) updates the same field to “Alice Smith.” Both writes succeed locally. When replication carries each write to the other region, you have a conflict.

There are several resolution strategies, each with trade-offs.

Last-writer-wins (LWW) uses timestamps to pick the most recent write. It is simple and requires no application logic, but it silently discards one write. If clocks across regions are skewed (which they always are, slightly), LWW can discard the objectively “later” write. This is acceptable for low-stakes data (display name, profile bio) but dangerous for financial data.
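The mechanics of LWW fit in one function. This sketch tags each write with a timestamp and a region id (the field names are illustrative); tie-breaking on region id keeps all replicas deterministic when timestamps collide.

```python
# Sketch: last-writer-wins merge. The write with the later timestamp
# survives; the other is silently discarded — which is exactly the risk
# described above when clocks are skewed.

def lww_merge(a: dict, b: dict) -> dict:
    """Keep the write with the later timestamp; tie-break on region id."""
    return max(a, b, key=lambda w: (w["ts"], w["region"]))

w1 = {"value": "Alice",       "ts": 1700000000.0, "region": "us-east-1"}
w2 = {"value": "Alice Smith", "ts": 1700000000.5, "region": "eu-west-1"}
print(lww_merge(w1, w2)["value"])  # Alice Smith — w1's update is lost
```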

Merge functions apply domain-specific logic. For a shopping cart, the merge function takes the union of items from both versions. For a counter, it sums the increments from both regions. Merge functions preserve intent but require careful design for each data type.
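Both examples from the paragraph above can be sketched directly (the data shapes are simplified for illustration):

```python
# Sketch: domain-specific merge functions for the two examples above.

def merge_carts(a: set[str], b: set[str]) -> set[str]:
    """Shopping cart: the union preserves both regions' additions."""
    return a | b

def merge_counters(base: int, inc_a: int, inc_b: int) -> int:
    """Counter: apply both regions' increments to the shared base value."""
    return base + inc_a + inc_b

print(merge_carts({"book", "pen"}, {"pen", "mug"}))  # union of three items
print(merge_counters(10, 3, 5))                      # 18
```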

Conflict-free replicated data types (CRDTs) are data structures designed to merge automatically without coordination. Counters, sets, and registers all have CRDT variants that guarantee convergence regardless of the order operations arrive. CRDTs are elegant but limited to specific data structures; not every problem maps to a CRDT naturally.
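A grow-only counter (G-Counter) is the classic introductory CRDT and shows why convergence is guaranteed: each region increments only its own slot, and merging takes the per-region maximum, which is commutative, associative, and idempotent. This is a minimal sketch, not a production implementation.

```python
# Sketch: a G-Counter CRDT. Merging in any order, any number of times,
# yields the same result, so replicas converge without coordination.

class GCounter:
    def __init__(self) -> None:
        self.counts: dict[str, int] = {}

    def increment(self, region: str, n: int = 1) -> None:
        # Each region only ever bumps its own slot.
        self.counts[region] = self.counts.get(region, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Per-region max: commutative, associative, idempotent.
        for region, n in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())

us, eu = GCounter(), GCounter()
us.increment("us-east-1", 3)
eu.increment("eu-west-1", 2)
us.merge(eu)
eu.merge(us)
print(us.value(), eu.value())  # 5 5 — both replicas converge
```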

Application-level resolution presents both versions to the user and asks them to choose. This is what Google Docs does when it detects conflicting edits that its operational transformation algorithm cannot auto-merge. It is the most correct approach but the worst user experience for automated systems.

The best strategy depends on the data type. Use LWW for idempotent, low-stakes updates. Use merge functions or CRDTs for data with clear algebraic properties. Reserve application-level resolution for cases where silent data loss is unacceptable.

Disaster recovery: RTO and RPO

Two metrics define your disaster recovery posture: Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

RTO is the maximum acceptable time between failure and full recovery. If your RTO is 1 hour, you have 1 hour to restore service after a regional outage. RTO drives your failover architecture: manual failover might achieve a 2-hour RTO, automatic failover can hit minutes, and active-active can achieve near-zero RTO (because there is no failover; traffic just shifts).

RPO is the maximum acceptable data loss measured in time. If your RPO is 5 minutes, you can tolerate losing the last 5 minutes of writes. RPO drives your replication strategy: synchronous replication gives RPO of zero (no data loss) but adds latency to every write. Asynchronous replication is faster but means the replica lags behind, and any data not yet replicated is lost during failover.
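With asynchronous replication, your *current* RPO exposure is simply how far the replica lags behind the primary: any writes newer than the replica's last applied position would be lost on failover. A sketch of that check, with illustrative timestamps:

```python
# Sketch: estimating RPO exposure from replication lag. Timestamps are
# illustrative; in practice they come from the primary's clock and the
# replica's last applied log position.

def rpo_exposure_seconds(now: float, last_replicated_at: float) -> float:
    """Seconds of writes that would be lost if the primary failed now."""
    return max(0.0, now - last_replicated_at)

def within_rpo(now: float, last_replicated_at: float, rpo_s: float) -> bool:
    """True if current replication lag still satisfies the RPO target."""
    return rpo_exposure_seconds(now, last_replicated_at) <= rpo_s

# Replica is 90 seconds behind; a 5-minute (300 s) RPO target still holds.
print(within_rpo(now=1_700_000_090.0,
                 last_replicated_at=1_700_000_000.0,
                 rpo_s=300.0))  # True
```

Alerting on this value before it breaches the target is what turns an RPO from a document into an operational guarantee.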

The CAP theorem is directly relevant here. During a network partition between regions, you must choose between availability (continue accepting writes in both regions, risking conflicts) and consistency (reject writes in the partitioned region until connectivity is restored). Active-active systems choose availability. Active-passive systems that require synchronous replication choose consistency.

Testing multi-region systems

You cannot validate a multi-region architecture without testing failures. Chaos engineering practices (popularized by Netflix’s Chaos Monkey) deliberately inject failures: killing instances, blocking network traffic between regions, introducing latency spikes.

Key scenarios to test include complete region failure (does traffic shift to surviving regions within your RTO?), replication lag spikes (does the system degrade gracefully when cross-region replication falls behind?), split-brain conditions (what happens when regions cannot communicate but both continue accepting writes?), and failback after recovery (when the failed region comes back online, does data reconciliation work correctly?).

Runbooks document the exact steps for each failure scenario. An engineer woken at 3am should not have to reason about multi-region failover from first principles. The runbook tells them which commands to run, which dashboards to check, and which stakeholders to notify.

Cost considerations

Multi-region deployments multiply infrastructure costs. You are running the same compute, storage, and networking in multiple locations. Cross-region data transfer is the hidden cost multiplier: AWS charges per gigabyte for data moving between regions, and replication traffic adds up fast.

Strategies to manage cost include running the secondary region at reduced capacity (scaling up only during failover), using reserved instances in the primary and spot/preemptible instances in the secondary, compressing replication traffic, and being selective about which data replicates (replicate the database, but regenerate caches locally in each region).

The cost question is really a risk question. How much revenue do you lose per minute of downtime? If the answer is $50,000 per minute, the cost of a second region is easy to justify. If downtime costs $50 per minute, the economics are different.

graph TD
  DNS["DNS (Latency-based Routing)"] --> R1
  DNS --> R2
  
  subgraph R1["US East Region"]
      LB1["Load Balancer"] --> APP1["App Servers"]
      APP1 --> DB1["Database Primary"]
      APP1 --> C1["Cache"]
  end
  
  subgraph R2["EU West Region"]
      LB2["Load Balancer"] --> APP2["App Servers"]
      APP2 --> DB2["Database Primary"]
      APP2 --> C2["Cache"]
  end
  
  DB1 <-->|Bi-directional Async Replication| DB2

Active-active multi-region architecture with latency-based DNS routing and bidirectional database replication.

What comes next

This article closes out the High Level Design series on infrastructure patterns. With search, storage, notifications, real-time systems, data processing, and multi-region architecture covered, you have the building blocks to design systems that serve millions of users across the globe. The next step is applying these patterns to concrete design problems: designing a chat system, a URL shortener, a video streaming platform, or a social media feed from the ground up.
