Data modeling for system design
In this series (15 parts)
- Introduction to low level design
- SOLID principles
- Design patterns: Creational
- Design patterns: Structural
- Design patterns: Behavioral
- Designing a parking lot
- Designing a library management system
- Designing an elevator system
- Designing a hotel booking system
- Designing a ride-sharing model
- Designing a rate limiter
- Designing a logging framework
- Designing a notification system
- API design and contract-first development
- Data modeling for system design
The data model is the foundation everything else sits on. Get it wrong and you will spend months working around awkward joins, missing indexes, and denormalization hacks. Get it right and your queries are fast, your schema evolves cleanly, and your application code stays simple.
This article covers the low-level decisions you face when designing a data model: how to normalize, when to denormalize, how to index, how to model for NoSQL, and how to handle time-series data. For a broader look at database technologies and their trade-offs, see databases overview. For how data distribution affects modeling, see database sharding.
Starting with entities and relationships
Every data model starts by identifying the nouns in your domain and the relationships between them. For an e-commerce system, the core entities are User, Product, Order, OrderItem, and Payment.
erDiagram
USER ||--o{ ORDER : places
USER ||--o{ ADDRESS : has
ORDER ||--|{ ORDER_ITEM : contains
ORDER ||--|| PAYMENT : "paid by"
ORDER }o--|| ADDRESS : "shipped to"
PRODUCT ||--o{ ORDER_ITEM : "appears in"
PRODUCT }o--|| CATEGORY : "belongs to"
USER {
uuid id PK
string email UK
string name
string password_hash
timestamp created_at
}
ORDER {
uuid id PK
uuid user_id FK
uuid address_id FK
string status
decimal total_amount
timestamp created_at
}
ORDER_ITEM {
uuid id PK
uuid order_id FK
uuid product_id FK
int quantity
decimal unit_price
decimal subtotal
}
PRODUCT {
uuid id PK
uuid category_id FK
string name
string description
decimal price
int stock_count
}
PAYMENT {
uuid id PK
uuid order_id FK
string method
string status
decimal amount
timestamp processed_at
}
ADDRESS {
uuid id PK
uuid user_id FK
string street
string city
string country
string postal_code
}
CATEGORY {
uuid id PK
string name
string slug UK
}
ER diagram showing a normalized e-commerce schema with users, orders, products, and payments.
This is a fully normalized schema. Every fact is stored once. The user’s name lives in the USER table. The product’s price lives in the PRODUCT table. The ORDER_ITEM records unit_price separately because the product price can change after the order is placed.
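The snapshot logic is easiest to see as DDL. A sketch of the ORDER_ITEM table in PostgreSQL syntax, with types and constraints assumed from the diagram:

```sql
-- Sketch of ORDER_ITEM from the diagram above (PostgreSQL syntax).
-- unit_price is copied from PRODUCT at order time, so later price
-- changes do not rewrite order history.
CREATE TABLE order_item (
    id         UUID PRIMARY KEY,
    order_id   UUID NOT NULL REFERENCES "order"(id),
    product_id UUID NOT NULL REFERENCES product(id),
    quantity   INTEGER NOT NULL CHECK (quantity > 0),
    unit_price DECIMAL(10, 2) NOT NULL,  -- snapshot of product.price
    subtotal   DECIMAL(10, 2) NOT NULL   -- quantity * unit_price
);
```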
Normalization: the three forms that matter
Normalization eliminates redundancy. Each normal form adds a constraint:
First Normal Form (1NF): Every column holds a single atomic value. No arrays, no comma-separated lists. If a user has multiple phone numbers, that is a separate table, not a phone_numbers text column.
Second Normal Form (2NF): Every non-key column depends on the entire primary key. If your ORDER_ITEM table has a composite key of (order_id, product_id), then product_name violates 2NF because it depends only on product_id. Move it to the PRODUCT table.
Third Normal Form (3NF): No transitive dependencies. If ORDER has user_id and user_email, the email depends on the user, not the order. Remove user_email from ORDER and join through USER.
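The 1NF fix for multiple phone numbers looks like this as DDL — a sketch where the table and column names are illustrative, not part of the schema above:

```sql
-- 1NF: one atomic value per column. Multiple phone numbers become
-- rows in a child table, not a comma-separated string on USER.
CREATE TABLE user_phone (
    id      UUID PRIMARY KEY,
    user_id UUID NOT NULL REFERENCES "user"(id),
    phone   VARCHAR(32) NOT NULL,
    label   VARCHAR(16),          -- e.g. 'mobile', 'work'
    UNIQUE (user_id, phone)       -- each number stored once per user
);
```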
The rule of thumb: normalize until it hurts, then denormalize where the pain is worst.
When to denormalize
Normalization optimizes for write correctness. Denormalization optimizes for read performance. You trade storage space and update complexity for faster queries.
Common denormalization patterns:
Materialized aggregates. Instead of computing total_orders and total_spent from the ORDER table on every request, store them as columns on USER. Update them when an order is placed.
ALTER TABLE "user" ADD COLUMN total_orders INTEGER DEFAULT 0;
ALTER TABLE "user" ADD COLUMN total_spent DECIMAL DEFAULT 0;
-- Update on order completion
UPDATE "user"
SET total_orders = total_orders + 1,
total_spent = total_spent + :orderAmount
WHERE id = :userId;
Embedded references. Store the product name and price on ORDER_ITEM even though they exist in PRODUCT. This captures the point-in-time value and avoids a join when displaying order history.
Pre-joined views. Create a denormalized table or materialized view for a dashboard query that would otherwise join five tables.
erDiagram
ORDER_ITEM_DENORMALIZED {
uuid id PK
uuid order_id FK
uuid product_id FK
string product_name "denormalized from PRODUCT"
string category_name "denormalized from CATEGORY"
int quantity
decimal unit_price "snapshot at order time"
decimal subtotal
}
ORDER_SUMMARY {
uuid order_id PK
uuid user_id FK
string user_email "denormalized from USER"
string shipping_city "denormalized from ADDRESS"
int item_count "materialized aggregate"
decimal total_amount
string status
timestamp created_at
}
ER diagram showing denormalized schemas optimized for read-heavy query patterns.
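In PostgreSQL, the ORDER_SUMMARY shape above could be maintained as a materialized view rather than a hand-updated table — a sketch assuming the normalized tables from earlier:

```sql
CREATE MATERIALIZED VIEW order_summary AS
SELECT o.id AS order_id,
       o.user_id,
       u.email AS user_email,
       a.city AS shipping_city,
       COUNT(oi.id) AS item_count,
       o.total_amount,
       o.status,
       o.created_at
FROM "order" o
JOIN "user" u ON u.id = o.user_id
JOIN address a ON a.id = o.address_id
JOIN order_item oi ON oi.order_id = o.id
GROUP BY o.id, o.user_id, u.email, a.city,
         o.total_amount, o.status, o.created_at;

-- CONCURRENTLY requires a unique index, and refreshes without blocking readers
CREATE UNIQUE INDEX idx_order_summary_pk ON order_summary (order_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY order_summary;
```

The trade-off is freshness: the view is only as current as its last refresh, which makes it a fit for dashboards and a poor fit for anything that must reflect the latest write.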
The cost of denormalization
Every denormalized field creates a maintenance burden. When a product name changes, you need to update every ORDER_ITEM that references it, or accept that historical records show the old name (which is often the correct behavior for orders). When a user’s email changes, you need to update ORDER_SUMMARY, or accept staleness.
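When you do choose to keep the copies in sync, every write must touch every location. One approach is to wrap both writes in a single transaction, assuming the ORDER_SUMMARY table from the diagram above:

```sql
-- Keep the denormalized email in sync with the source of truth.
-- Both updates succeed or fail together.
BEGIN;
UPDATE "user"         SET email = :newEmail      WHERE id = :userId;
UPDATE order_summary  SET user_email = :newEmail WHERE user_id = :userId;
COMMIT;
```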
The decision framework:
| Factor | Normalize | Denormalize |
|---|---|---|
| Write-heavy workload | Preferred | Risky |
| Read-heavy workload | Expensive joins | Preferred |
| Data changes frequently | Preferred | Update anomalies |
| Data rarely changes | Either | Preferred |
| Strict consistency required | Preferred | Complex |
| Eventual consistency acceptable | Either | Preferred |
Query time vs normalization
As normalization level increases, write operations become simpler but read queries involving joins become more expensive. The relationship is not linear; it depends heavily on data volume and query patterns.
The sweet spot for most applications is 3NF with targeted denormalization for the most frequently read paths.
Indexing strategy
Indexes make reads fast and writes slow. Every index must earn its place. Start with these rules:
Primary keys get an index automatically. Use auto-incrementing integers or UUIDs, but note the trade-off: sequential keys append to the end of the index, while random UUIDs scatter inserts across it, so time-ordered variants such as UUIDv7 behave better on write-heavy tables.
Foreign keys should always be indexed. Without an index on ORDER.user_id, a query like “find all orders for user X” does a full table scan.
Columns in WHERE clauses are candidates for indexes. If you frequently query WHERE status = 'pending' AND created_at > ?, a composite index on (status, created_at) is appropriate.
Column order in composite indexes matters. The index on (status, created_at) supports queries filtering on status alone, or status and created_at together. It does not help queries filtering only on created_at. Put columns used in equality conditions before columns used in range conditions; among the equality columns, lead with the most selective one.
-- Good: supports WHERE status = ? AND created_at > ?
CREATE INDEX idx_order_status_created ON "order" (status, created_at);
-- Good: supports WHERE user_id = ? ORDER BY created_at DESC
CREATE INDEX idx_order_user_created ON "order" (user_id, created_at DESC);
-- Covering index: query answered entirely from index, no table lookup
CREATE INDEX idx_order_summary ON "order" (user_id, status, total_amount)
INCLUDE (created_at);
Do not index everything. Each index adds overhead to INSERT, UPDATE, and DELETE operations. On a write-heavy table, three unnecessary indexes can double your write latency.
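PostgreSQL tracks index usage, which gives you one way to audit for dead weight (the columns below come from the built-in pg_stat_user_indexes view):

```sql
-- Indexes that have never been scanned since statistics were last reset.
-- Candidates for removal, after checking they are not enforcing uniqueness.
SELECT schemaname, relname AS table_name, indexrelname AS index_name
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname;
```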
Document modeling for NoSQL
When using a document store like MongoDB or DynamoDB, the modeling philosophy inverts. Instead of normalizing and joining, you embed related data in a single document to minimize read operations.
The access pattern drives the model. Ask: “What queries does my application run?” Then design documents that answer those queries in a single read.
{
  "_id": "order-789",
  "userId": "user-456",
  "userEmail": "jane@example.com",
  "status": "shipped",
  "createdAt": "2026-04-20T10:30:00Z",
  "shippingAddress": {
    "street": "123 Main St",
    "city": "Portland",
    "country": "US",
    "postalCode": "97201"
  },
  "items": [
    {
      "productId": "prod-001",
      "productName": "Mechanical Keyboard",
      "quantity": 1,
      "unitPrice": 149.99,
      "subtotal": 149.99
    },
    {
      "productId": "prod-042",
      "productName": "USB-C Cable",
      "quantity": 2,
      "unitPrice": 12.99,
      "subtotal": 25.98
    }
  ],
  "payment": {
    "method": "credit_card",
    "status": "captured",
    "amount": 175.97
  }
}
This single document answers: “Show me order 789 with all its items, shipping address, and payment status.” In a normalized relational schema, that is a five-table join.
The trade-off: if you need to query across orders by product, you need a secondary index on items.productId. And if product names change, you need to decide whether to update all historical documents or accept point-in-time snapshots.
Embedding vs referencing. Embed when the related data is always accessed together and does not grow unboundedly. Reference (store an ID and look up separately) when the related data is shared across many documents or grows without limit (like comments on a post).
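The referencing side of that rule looks like this in practice: each comment is its own document that points back to the post by ID, instead of living in an unbounded array inside the post (the collection shapes here are hypothetical, for illustration only):

```json
{
  "_id": "comment-9001",
  "postId": "post-123",
  "author": "user-456",
  "body": "Great overview.",
  "createdAt": "2026-04-21T09:15:00Z"
}
```

Reading a post's comments becomes a query on postId (backed by an index on that field), and the post document stays bounded in size no matter how many comments accumulate.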
Time-series data modeling
Time-series data has unique characteristics: it is append-only, ordered by time, and queried by time ranges. Examples include metrics, logs, IoT sensor readings, and financial transactions.
Relational approach. A narrow table with a timestamp index works for moderate volumes.
CREATE TABLE metric (
    id BIGSERIAL,
    metric_name VARCHAR(255) NOT NULL,
    value DOUBLE PRECISION NOT NULL,
    tags JSONB,
    recorded_at TIMESTAMP NOT NULL,
    PRIMARY KEY (id, recorded_at)  -- the partition key must be part of the primary key
) PARTITION BY RANGE (recorded_at);
CREATE INDEX idx_metric_name_time ON metric (metric_name, recorded_at DESC);
-- Partition by time for automatic data lifecycle management
CREATE TABLE metric_2026_04 PARTITION OF metric
FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
Partitioning is essential. Without it, queries on recent data scan the entire table. With monthly partitions, a query for today’s data only touches the current partition. Old partitions can be dropped when the retention period expires, which is much faster than deleting rows.
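Expiring old data then becomes a metadata operation on the partition rather than a row-by-row DELETE — a sketch in PostgreSQL syntax, with an illustrative partition name:

```sql
-- Detach first so the drop does not block concurrent queries,
-- then drop the detached table. Both steps are near-instant.
ALTER TABLE metric DETACH PARTITION metric_2025_04;
DROP TABLE metric_2025_04;
```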
Bucketing for NoSQL. In DynamoDB or Cassandra, store time-series data in buckets. Each bucket holds one hour or one day of data for a specific metric.
{
  "pk": "metric:cpu_usage:2026-04-20T14",
  "readings": [
    { "t": "14:00:01", "v": 42.5 },
    { "t": "14:00:06", "v": 43.1 },
    { "t": "14:00:11", "v": 41.8 }
  ]
}
This keeps each document bounded in size and makes range queries efficient: to read one hour of data, you fetch one document instead of thousands of individual records.
Schema evolution
Schemas change. Columns get added, types get altered, tables get split. Plan for this from the start.
Additive changes are safe. Adding a nullable column, adding a new table, adding an index. These do not break existing queries.
Destructive changes need migration. Removing a column, renaming a column, changing a type. Use a migration tool (Flyway, Alembic, Knex migrations) to version your schema changes.
The expand-contract pattern. For renaming a column: first add the new column (expand), deploy code that writes to both, backfill the new column, deploy code that reads from the new column, then drop the old column (contract). This avoids downtime.
-- Step 1: Expand
ALTER TABLE "user" ADD COLUMN display_name VARCHAR(255);
-- Step 2: Backfill
UPDATE "user" SET display_name = name WHERE display_name IS NULL;
-- Step 3: Application reads from display_name
-- Step 4: Contract (after all services are updated)
ALTER TABLE "user" DROP COLUMN name;
What comes next
Data modeling is where the abstract meets the concrete. The schemas, indexes, and access patterns you design here determine how well your system performs under load. As you build more complex systems, revisit these foundations. Every slow query, every migration headache, and every scaling bottleneck traces back to a data modeling decision. Apply these patterns deliberately and your systems will be easier to build, operate, and evolve.