Distributed data patterns
In this series (12 parts)
- Monolith vs microservices
- Microservice communication patterns
- Service discovery and registration
- Event-driven architecture
- Distributed data patterns
- Caching architecture patterns
- Search architecture
- Storage systems at scale
- Notification systems
- Real-time systems architecture
- Batch and stream processing
- Multi-region and global systems
The hardest part of microservices is not the services; it is the data. When you split a monolith, the database does not split itself. You end up with either a shared database that couples everything back together or independent databases that force you to rethink how data flows between services. Getting data ownership right is the difference between a genuinely decoupled architecture and a distributed monolith with extra network hops.
Data ownership per service
The foundational rule of microservice data architecture: each service owns its data and exposes it only through its API. No other service reads from or writes to another service’s database directly. This is the “database per service” pattern, and it exists for good reason.
```mermaid
graph TD
    OS["Order Service"] --> ODB[("Orders DB")]
    IS["Inventory Service"] --> IDB[("Inventory DB")]
    PS["Payment Service"] --> PDB[("Payments DB")]
    US["User Service"] --> UDB[("Users DB")]
    OS -->|"API call"| IS
    OS -->|"API call"| PS
    OS -->|"API call"| US
    style ODB fill:#636EFA,color:#fff
    style IDB fill:#EF553B,color:#fff
    style PDB fill:#00CC96,color:#fff
    style UDB fill:#AB63FA,color:#fff
```
Database per service: each service owns its data store. Cross-service data access happens through APIs, never through shared tables.
When the order service needs user information, it calls the user service’s API. It does not join against the users table. This feels inefficient compared to a monolith’s single SQL join, and it is. The tradeoff is that each team can evolve their schema independently, choose the database technology that fits their workload (relational for payments, document store for product catalog, time-series for metrics), and deploy without coordinating schema migrations across teams.
“Database per service” does not mean one physical database server per service. Multiple services can share a database cluster as long as they use separate schemas with no cross-schema queries. The boundary is logical, not physical.
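To make the access rule concrete, here is a minimal Python sketch of the order service composing an order-detail view through API calls rather than SQL joins. The service functions and field names are illustrative assumptions (in production these would be HTTP or gRPC clients), not part of any real system described above.

```python
# Stand-ins for remote service APIs; in production these are HTTP/gRPC calls
# to the owning service, never direct queries against its tables.
def user_service_get(user_id: int) -> dict:
    users = {42: {"user_id": 42, "name": "Ada"}}
    return users[user_id]

def product_service_get(product_id: int) -> dict:
    products = {7: {"product_id": 7, "title": "Mechanical Keyboard"}}
    return products[product_id]

# The order service owns only its own table.
ORDERS = {1001: {"order_id": 1001, "user_id": 42, "product_id": 7, "quantity": 1}}

def get_order_detail(order_id: int) -> dict:
    """Compose the view the UI needs by calling the owning services."""
    order = ORDERS[order_id]
    user = user_service_get(order["user_id"])            # API call, not a join
    product = product_service_get(order["product_id"])   # API call, not a join
    return {
        "order_id": order["order_id"],
        "customer_name": user["name"],
        "product_title": product["title"],
        "quantity": order["quantity"],
    }
```

The extra calls are the cost of the boundary; the next sections show how denormalization and events reduce that cost for read-heavy paths.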
The shared database anti-pattern
The shared database anti-pattern is what happens when teams skip the hard work of defining service boundaries. Multiple services read from and write to the same tables, creating implicit coupling that defeats the purpose of having separate services.
```mermaid
graph TD
    OS["Order Service"] --> DB[("Shared Database")]
    IS["Inventory Service"] --> DB
    PS["Payment Service"] --> DB
    US["User Service"] --> DB
    style DB fill:#EF553B,color:#fff
```
Shared database anti-pattern: all services couple to the same schema. A column rename breaks everyone.
The problems accumulate quickly. A schema change by one team breaks queries in other services. Performance issues in one service’s queries affect all services sharing the database. You cannot independently scale the database for different workloads. Lock contention between services causes unpredictable latency spikes. And the most insidious problem: teams start depending on each other’s tables to avoid the “overhead” of API calls, creating invisible dependencies that only surface during incidents.
If you find yourself in a shared database situation (and many teams do, especially those migrating from a monolith), the path out starts with identifying which tables belong to which service, creating APIs to replace direct table access, and gradually moving tables to service-owned schemas. The strangler fig pattern applies here too.
Data denormalization at scale
In a monolith, you normalize data to eliminate redundancy and enforce consistency through foreign keys. In a distributed system, normalization across service boundaries means every read requires multiple API calls. Denormalization (storing copies of data where they are needed) trades storage for performance and autonomy.
Suppose the order service needs to display order details including the customer name and product title. In a monolith, that is a three-table join. In microservices, it requires calls to the user service and product service for every order detail page. With denormalization, the order service stores a copy of the customer name and product title alongside the order data.
```mermaid
graph TD
    subgraph OrderService["Order Service"]
        OT["orders table<br/>order_id, customer_name,<br/>product_title, quantity, total"]
    end
    subgraph UserService["User Service"]
        UT["users table<br/>user_id, name, email"]
    end
    subgraph ProductService["Product Service"]
        PT["products table<br/>product_id, title, price"]
    end
    UserService -->|"UserUpdated event"| OrderService
    ProductService -->|"ProductUpdated event"| OrderService
```
Denormalized data: the order service keeps copies of user and product data, updated through events.
The trade-off is clear. Reads become fast and self-contained: one database query returns everything the UI needs. But writes become complex: when a user changes their name, the user service publishes a UserUpdated event, and every service that stores a copy must update it. The data is eventually consistent; there is a window where the order service shows the old name while the user service shows the new one.
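A hedged sketch of the consumer side: the order service handles a UserUpdated event by rewriting the customer_name it stores locally. The event shape and field names are assumptions; in a real system this is a single UPDATE statement run by a consumer that tracks its offset so events are neither lost nor double-applied.

```python
# In-memory stand-in for the order service's denormalized table.
ORDERS = [
    {"order_id": 1, "user_id": 42, "customer_name": "Ada L.", "total": 99.0},
    {"order_id": 2, "user_id": 7,  "customer_name": "Grace H.", "total": 25.0},
]

def handle_user_updated(event: dict) -> int:
    """Apply a UserUpdated event to the denormalized copies.

    Returns the number of rows touched. The handler is idempotent:
    re-delivering the same event rewrites the same value.
    """
    updated = 0
    for order in ORDERS:
        if order["user_id"] == event["user_id"]:
            order["customer_name"] = event["name"]
            updated += 1
    return updated
```

Until this handler runs, readers of the orders table see the old name; that window is the eventual-consistency gap described above.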
Denormalize aggressively for read-heavy data that changes infrequently (product titles, user display names). Keep real-time API calls for data that must be current (account balance, inventory count).
Change data capture (CDC)
Change data capture reads the database’s transaction log and streams every insert, update, and delete as an event. Instead of modifying application code to publish events (which developers forget or get wrong), CDC captures changes at the database level, guaranteeing that every write becomes an event.
```mermaid
graph LR
    App["Application"] -->|"writes"| SrcDB[("Source DB")]
    SrcDB -->|"transaction log"| CDC["CDC Connector<br/>(Debezium)"]
    CDC -->|"change events"| Kafka["Kafka"]
    Kafka --> Search["Search Index<br/>(Elasticsearch)"]
    Kafka --> Cache["Cache<br/>(Redis)"]
    Kafka --> Analytics["Analytics DB<br/>(ClickHouse)"]
    Kafka --> DW["Data Warehouse<br/>(Snowflake)"]
```
CDC pipeline: Debezium reads the transaction log and publishes change events to Kafka. Multiple consumers build derived data stores.
Debezium is the dominant open-source CDC tool. It connects to PostgreSQL’s WAL, MySQL’s binlog, or MongoDB’s oplog and publishes structured change events to Kafka. Each event contains the before and after state of the row, the operation type, and metadata like the transaction ID and timestamp.
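A minimal sketch of what a downstream consumer does with such an event. The envelope fields ("before", "after", "op") follow Debezium's documented change-event format; the dict standing in for a search index, and the sample values, are assumptions for illustration.

```python
import json

# A Debezium-style change event for an UPDATE on a products row.
SAMPLE_EVENT = json.dumps({
    "payload": {
        "before": {"id": 5, "title": "Old Title"},
        "after": {"id": 5, "title": "New Title"},
        "op": "u",            # c = create, u = update, d = delete, r = snapshot read
        "ts_ms": 1700000000000,
    }
})

def apply_change(raw: str, index: dict) -> None:
    """Apply one change event to a derived store (here a dict stands in
    for a search index keyed by row id)."""
    payload = json.loads(raw)["payload"]
    op = payload["op"]
    if op in ("c", "u", "r"):
        row = payload["after"]
        index[row["id"]] = row
    elif op == "d":
        index.pop(payload["before"]["id"], None)
```

Because every consumer sees the same ordered stream, the search index, cache, and warehouse in the diagram all converge on the same state without the source application knowing they exist.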
CDC solves several problems simultaneously. It replaces the outbox pattern with a simpler approach: just write to the database, and CDC will capture it. It feeds search indices, caches, and analytics systems without modifying the source application. It enables real-time data pipelines that replace batch ETL jobs running overnight.
The operational considerations are real. CDC adds a dependency on the database’s transaction log, which must be configured for logical replication (PostgreSQL) or row-based binlog (MySQL). The CDC connector must track its position in the log to resume after restarts. Kafka needs enough retention to handle consumer downtime. And schema changes in the source database propagate through the entire pipeline, requiring schema registry support (Confluent Schema Registry or Apicurio).
Read replicas for analytics
Production databases are optimized for transactional workloads: fast inserts, updates, and point queries. Analytical queries (aggregations across millions of rows, full-table scans for reporting) compete for the same resources and can degrade the production workload.
Database replication provides a clean separation. Read replicas receive a stream of changes from the primary and serve read traffic. Analytical queries run against replicas, leaving the primary free for transactional workloads.
```mermaid
graph TD
    App["Application"] -->|"reads + writes"| Primary[("Primary DB")]
    Primary -->|"replication stream"| R1[("Read Replica 1<br/>API reads")]
    Primary -->|"replication stream"| R2[("Read Replica 2<br/>Analytics")]
    Primary -->|"replication stream"| R3[("Read Replica 3<br/>Reporting")]
    API["API Queries"] --> R1
    Dash["Dashboard"] --> R2
    Report["Weekly Reports"] --> R3
```
Read replicas serve different workloads. Analytics and reporting queries never touch the primary.
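The routing in the diagram can be sketched as a small policy layer. The connection strings and workload names below are illustrative assumptions; the one design choice worth copying is that unknown workloads fall back to the primary, so correctness never depends on replica lag.

```python
# Map each workload to the connection it should use. Hypothetical DSNs.
ROUTES = {
    "write": "postgres://primary:5432/app",
    "api_read": "postgres://replica1:5432/app",
    "analytics": "postgres://replica2:5432/app",
    "reporting": "postgres://replica3:5432/app",
}

def route(workload: str) -> str:
    """Pick a connection string for a workload.

    Unrecognized workloads go to the primary: a slow query on the primary
    is a performance bug, but a stale read on a replica can be a
    correctness bug.
    """
    return ROUTES.get(workload, ROUTES["write"])
```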
For heavier analytical workloads, CDC to a columnar database (ClickHouse, BigQuery, Redshift) provides better performance than replicas. Columnar stores compress data efficiently and execute analytical queries orders of magnitude faster than row-oriented databases. The trade-off is higher latency: CDC to a warehouse might introduce seconds to minutes of delay compared to near-real-time replica lag.
Data mesh principles
As organizations scale to dozens or hundreds of services, centralized data teams become bottlenecks. The data mesh concept applies microservice principles to analytical data: each domain team owns and publishes their data as a product, with discoverable schemas, SLAs, and quality guarantees.
The order team does not just own the order service; they own the “orders” data product that other teams consume for analytics. They are responsible for its schema documentation, data quality, freshness guarantees, and access control. A central platform team provides the infrastructure (Kafka, schema registry, data catalog) but does not own the data itself.
This is conceptual rather than a specific technology choice, but it influences how you design CDC pipelines, database sharding strategies, and event schemas. Teams that publish data products think more carefully about their event contracts because they know other teams depend on them.
Practical migration strategy
If you are migrating from a shared database to owned databases, here is a practical sequence. First, identify the domain boundaries and assign table ownership. Second, create service APIs that encapsulate access to owned tables. Third, migrate read paths first: other services call the API instead of querying the table directly. Fourth, migrate write paths: other services send commands instead of inserting into the table directly. Fifth, physically move tables to service-owned schemas or databases. Sixth, set up CDC for denormalized copies and analytics.
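Step three, migrating a read path, is where most of the risk lives, so it is usually gated behind a flag. Here is a hedged sketch under assumed names (the flag, the service client, and the fallback policy are all hypothetical): the caller switches from the shared table to the owning service's API, keeping the old path as a fallback until the new one is proven.

```python
USE_USER_API = True  # rollout flag: per-tenant or percentage-based in practice

def query_shared_users_table(user_id: int) -> dict:
    """Legacy path: direct read from the shared table (to be retired)."""
    return {"user_id": user_id, "name": "from shared table"}

def user_service_api(user_id: int) -> dict:
    """New path: call the user service, which owns the data."""
    return {"user_id": user_id, "name": "from user service"}

def get_user(user_id: int) -> dict:
    if USE_USER_API:
        try:
            return user_service_api(user_id)
        except Exception:
            # Fallback keeps the migration reversible; delete it, and the
            # table access, once the API path is proven (step five).
            return query_shared_users_table(user_id)
    return query_shared_users_table(user_id)
```

Write paths follow the same shape in step four, with commands to the owning service replacing direct inserts.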
This takes months or years for a large monolith. That timeline is normal. The goal is incremental progress where each step delivers value, not a big-bang migration.
What comes next
With data flowing between services through events and CDC, caching becomes critical for performance. The next article covers caching architecture patterns, from in-process caches to distributed Redis clusters to geographic cache hierarchies. Continue with caching architecture patterns.