Data modeling for system design
In this series (15 parts)
- Introduction to low level design
- SOLID principles
- Design patterns: Creational
- Design patterns: Structural
- Design patterns: Behavioral
- Designing a parking lot
- Designing a library management system
- Designing an elevator system
- Designing a hotel booking system
- Designing a ride-sharing model
- Designing a rate limiter
- Designing a logging framework
- Designing a notification system
- API design and contract-first development
- Data modeling for system design
The data model is the foundation everything else sits on. Get it wrong and you will spend months working around awkward joins, missing indexes, and denormalization hacks. Get it right and your queries are fast, your schema evolves cleanly, and your application code stays simple.
This article covers the low-level decisions you face when designing a data model: how to normalize, when to denormalize, how to index, how to model for NoSQL, and how to handle time-series data. For a broader look at database technologies and their trade-offs, see databases overview. For how data distribution affects modeling, see database sharding.
Starting with entities and relationships
Every data model starts by identifying the nouns in your domain and the relationships between them. For an e-commerce system, the core entities are User, Product, Order, OrderItem, and Payment.
erDiagram
USER ||--o{ ORDER : places
USER ||--o{ ADDRESS : has
ORDER ||--|{ ORDER_ITEM : contains
ORDER ||--|| PAYMENT : "paid by"
ORDER }o--|| ADDRESS : "shipped to"
PRODUCT ||--o{ ORDER_ITEM : "appears in"
PRODUCT }o--|| CATEGORY : "belongs to"
USER {
uuid id PK
string email UK
string name
string password_hash
timestamp created_at
}
ORDER {
uuid id PK
uuid user_id FK
uuid address_id FK
string status
decimal total_amount
timestamp created_at
}
ORDER_ITEM {
uuid id PK
uuid order_id FK
uuid product_id FK
int quantity
decimal unit_price
decimal subtotal
}
PRODUCT {
uuid id PK
uuid category_id FK
string name
string description
decimal price
int stock_count
}
PAYMENT {
uuid id PK
uuid order_id FK
string method
string status
decimal amount
timestamp processed_at
}
ADDRESS {
uuid id PK
uuid user_id FK
string street
string city
string country
string postal_code
}
CATEGORY {
uuid id PK
string name
string slug UK
}
ER diagram showing a normalized e-commerce schema with users, orders, products, and payments.
This is a fully normalized schema. Every fact is stored once. The user’s name lives in the USER table. The product’s price lives in the PRODUCT table. The ORDER_ITEM records unit_price separately because the product price can change after the order is placed.
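The snapshot logic is easiest to see as DDL. A sketch of the ORDER_ITEM table in PostgreSQL syntax, with types and constraints assumed from the diagram:

```sql
-- Sketch of ORDER_ITEM from the diagram above (PostgreSQL syntax).
-- unit_price is copied from PRODUCT at order time, so later price
-- changes do not rewrite order history.
CREATE TABLE order_item (
    id         UUID PRIMARY KEY,
    order_id   UUID NOT NULL REFERENCES "order"(id),
    product_id UUID NOT NULL REFERENCES product(id),
    quantity   INTEGER NOT NULL CHECK (quantity > 0),
    unit_price DECIMAL(10, 2) NOT NULL,  -- snapshot of product.price
    subtotal   DECIMAL(10, 2) NOT NULL   -- quantity * unit_price
);
```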
Normalization: the three forms that matter
Normalization eliminates redundancy. Each normal form adds a constraint:
First Normal Form (1NF): Every column holds a single atomic value. No arrays, no comma-separated lists. If a user has multiple phone numbers, that is a separate table, not a phone_numbers text column.
Second Normal Form (2NF): Every non-key column depends on the entire primary key. If your ORDER_ITEM table has a composite key of (order_id, product_id), then product_name violates 2NF because it depends only on product_id. Move it to the PRODUCT table.
Third Normal Form (3NF): No transitive dependencies. If ORDER has user_id and user_email, the email depends on the user, not the order. Remove user_email from ORDER and join through USER.
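The 1NF fix for multiple phone numbers looks like this as DDL — a sketch where the table and column names are illustrative, not part of the schema above:

```sql
-- 1NF: one atomic value per column. Multiple phone numbers become
-- rows in a child table, not a comma-separated string on USER.
CREATE TABLE user_phone (
    id      UUID PRIMARY KEY,
    user_id UUID NOT NULL REFERENCES "user"(id),
    phone   VARCHAR(32) NOT NULL,
    label   VARCHAR(16),          -- e.g. 'mobile', 'work'
    UNIQUE (user_id, phone)       -- each number stored once per user
);
```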
The rule of thumb: normalize until it hurts, then denormalize where the pain is worst.
When to denormalize
Normalization optimizes for write correctness. Denormalization optimizes for read performance. You trade storage space and update complexity for faster queries.
Common denormalization patterns:
Materialized aggregates. Instead of computing total_orders and total_spent from the ORDER table on every request, store them as columns on USER. Update them when an order is placed.
ALTER TABLE "user" ADD COLUMN total_orders INTEGER DEFAULT 0;
ALTER TABLE "user" ADD COLUMN total_spent DECIMAL DEFAULT 0;
-- Update on order completion
UPDATE "user"
SET total_orders = total_orders + 1,
total_spent = total_spent + :orderAmount
WHERE id = :userId;
Embedded references. Store the product name and price on ORDER_ITEM even though they exist in PRODUCT. This captures the point-in-time value and avoids a join when displaying order history.
Pre-joined views. Create a denormalized table or materialized view for a dashboard query that would otherwise join five tables.
erDiagram
ORDER_ITEM_DENORMALIZED {
uuid id PK
uuid order_id FK
uuid product_id FK
string product_name "denormalized from PRODUCT"
string category_name "denormalized from CATEGORY"
int quantity
decimal unit_price "snapshot at order time"
decimal subtotal
}
ORDER_SUMMARY {
uuid order_id PK
uuid user_id FK
string user_email "denormalized from USER"
string shipping_city "denormalized from ADDRESS"
int item_count "materialized aggregate"
decimal total_amount
string status
timestamp created_at
}
ER diagram showing denormalized schemas optimized for read-heavy query patterns.
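In PostgreSQL, the ORDER_SUMMARY shape above could be maintained as a materialized view rather than a hand-updated table — a sketch assuming the normalized tables from earlier:

```sql
CREATE MATERIALIZED VIEW order_summary AS
SELECT o.id AS order_id,
       o.user_id,
       u.email AS user_email,
       a.city AS shipping_city,
       COUNT(oi.id) AS item_count,
       o.total_amount,
       o.status,
       o.created_at
FROM "order" o
JOIN "user" u ON u.id = o.user_id
JOIN address a ON a.id = o.address_id
JOIN order_item oi ON oi.order_id = o.id
GROUP BY o.id, o.user_id, u.email, a.city,
         o.total_amount, o.status, o.created_at;

-- CONCURRENTLY requires a unique index, and refreshes without blocking readers
CREATE UNIQUE INDEX idx_order_summary_pk ON order_summary (order_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY order_summary;
```

The trade-off is freshness: the view is only as current as its last refresh, which makes it a fit for dashboards and a poor fit for anything that must reflect the latest write.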
The cost of denormalization
Every denormalized field creates a maintenance burden. When a product name changes, you need to update every ORDER_ITEM that references it, or accept that historical records show the old name (which is often the correct behavior for orders). When a user’s email changes, you need to update ORDER_SUMMARY, or accept staleness.
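When you do choose to keep the copies in sync, every write must touch every location. One approach is to wrap both writes in a single transaction, assuming the ORDER_SUMMARY table from the diagram above:

```sql
-- Keep the denormalized email in sync with the source of truth.
-- Both updates succeed or fail together.
BEGIN;
UPDATE "user"         SET email = :newEmail      WHERE id = :userId;
UPDATE order_summary  SET user_email = :newEmail WHERE user_id = :userId;
COMMIT;
```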
The decision framework:
| Factor | Normalize | Denormalize |
|---|---|---|
| Write-heavy workload | Preferred | Risky |
| Read-heavy workload | Expensive joins | Preferred |
| Data changes frequently | Preferred | Update anomalies |
| Data rarely changes | Either | Preferred |
| Strict consistency required | Preferred | Complex |
| Eventual consistency acceptable | Either | Preferred |
Query time vs normalization
As normalization level increases, write operations become simpler but read queries involving joins become more expensive. The relationship is not linear; it depends heavily on data volume and query patterns.
The sweet spot for most applications is 3NF with targeted denormalization for the most frequently read paths.
Indexing strategy
Indexes make reads fast and writes slow. Every index must earn its place. Start with these rules:
Primary keys get an index automatically. Use auto-incrementing integers or UUIDs, but note the trade-off: sequential keys append to the end of the index, while random UUIDs scatter inserts across it, so time-ordered variants such as UUIDv7 behave better on write-heavy tables.
Foreign keys should always be indexed. Without an index on ORDER.user_id, a query like “find all orders for user X” does a full table scan.
Columns in WHERE clauses are candidates for indexes. If you frequently query WHERE status = 'pending' AND created_at > ?, a composite index on (status, created_at) is appropriate.
Column order in composite indexes matters. The index on (status, created_at) supports queries filtering on status alone, or status and created_at together. It does not help queries filtering only on created_at. Put columns used in equality conditions before columns used in range conditions; among the equality columns, lead with the most selective one.
-- Good: supports WHERE status = ? AND created_at > ?
CREATE INDEX idx_order_status_created ON "order" (status, created_at);
-- Good: supports WHERE user_id = ? ORDER BY created_at DESC
CREATE INDEX idx_order_user_created ON "order" (user_id, created_at DESC);
-- Covering index: query answered entirely from index, no table lookup
CREATE INDEX idx_order_summary ON "order" (user_id, status, total_amount)
INCLUDE (created_at);
Do not index everything. Each index adds overhead to INSERT, UPDATE, and DELETE operations. On a write-heavy table, three unnecessary indexes can double your write latency.
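PostgreSQL tracks index usage, which gives you one way to audit for dead weight (the columns below come from the built-in pg_stat_user_indexes view):

```sql
-- Indexes that have never been scanned since statistics were last reset.
-- Candidates for removal, after checking they are not enforcing uniqueness.
SELECT schemaname, relname AS table_name, indexrelname AS index_name
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname;
```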
Document modeling for NoSQL
When using a document store like MongoDB or DynamoDB, the modeling philosophy inverts. Instead of normalizing and joining, you embed related data in a single document to minimize read operations.
The access pattern drives the model. Ask: “What queries does my application run?” Then design documents that answer those queries in a single read.
{
  "_id": "order-789",
  "userId": "user-456",
  "userEmail": "jane@example.com",
  "status": "shipped",
  "createdAt": "2026-04-20T10:30:00Z",
  "shippingAddress": {
    "street": "123 Main St",
    "city": "Portland",
    "country": "US",
    "postalCode": "97201"
  },
  "items": [
    {
      "productId": "prod-001",
      "productName": "Mechanical Keyboard",
      "quantity": 1,
      "unitPrice": 149.99,
      "subtotal": 149.99
    },
    {
      "productId": "prod-042",
      "productName": "USB-C Cable",
      "quantity": 2,
      "unitPrice": 12.99,
      "subtotal": 25.98
    }
  ],
  "payment": {
    "method": "credit_card",
    "status": "captured",
    "amount": 175.97
  }
}
This single document answers: “Show me order 789 with all its items, shipping address, and payment status.” In a normalized relational schema, that is a five-table join.
The trade-off: if you need to query across orders by product, you need a secondary index on items.productId. And if product names change, you need to decide whether to update all historical documents or accept point-in-time snapshots.
Embedding vs referencing. Embed when the related data is always accessed together and does not grow unboundedly. Reference (store an ID and look up separately) when the related data is shared across many documents or grows without limit (like comments on a post).
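The referencing side of that rule looks like this in practice: each comment is its own document that points back to the post by ID, instead of living in an unbounded array inside the post (the collection shapes here are hypothetical, for illustration only):

```json
{
  "_id": "comment-9001",
  "postId": "post-123",
  "author": "user-456",
  "body": "Great overview.",
  "createdAt": "2026-04-21T09:15:00Z"
}
```

Reading a post's comments becomes a query on postId (backed by an index on that field), and the post document stays bounded in size no matter how many comments accumulate.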
Time-series data modeling
Time-series data has unique characteristics: it is append-only, ordered by time, and queried by time ranges. Examples include metrics, logs, IoT sensor readings, and financial transactions.
Relational approach. A narrow table with a timestamp index works for moderate volumes.
CREATE TABLE metric (
    id BIGSERIAL,
    metric_name VARCHAR(255) NOT NULL,
    value DOUBLE PRECISION NOT NULL,
    tags JSONB,
    recorded_at TIMESTAMP NOT NULL,
    PRIMARY KEY (id, recorded_at)  -- the partition key must be part of the primary key
) PARTITION BY RANGE (recorded_at);
CREATE INDEX idx_metric_name_time ON metric (metric_name, recorded_at DESC);
-- Partition by time for automatic data lifecycle management
CREATE TABLE metric_2026_04 PARTITION OF metric
FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
Partitioning is essential. Without it, queries on recent data scan the entire table. With monthly partitions, a query for today’s data only touches the current partition. Old partitions can be dropped when the retention period expires, which is much faster than deleting rows.
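Expiring old data then becomes a metadata operation on the partition rather than a row-by-row DELETE — a sketch in PostgreSQL syntax, with an illustrative partition name:

```sql
-- Detach first so the drop does not block concurrent queries,
-- then drop the detached table. Both steps are near-instant.
ALTER TABLE metric DETACH PARTITION metric_2025_04;
DROP TABLE metric_2025_04;
```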
Bucketing for NoSQL. In DynamoDB or Cassandra, store time-series data in buckets. Each bucket holds one hour or one day of data for a specific metric.
{
  "pk": "metric:cpu_usage:2026-04-20T14",
  "readings": [
    { "t": "14:00:01", "v": 42.5 },
    { "t": "14:00:06", "v": 43.1 },
    { "t": "14:00:11", "v": 41.8 }
  ]
}
This keeps each document bounded in size and makes range queries efficient: to read one hour of data, you fetch one document instead of thousands of individual records.
Schema evolution
Schemas change. Columns get added, types get altered, tables get split. Plan for this from the start.
Additive changes are safe. Adding a nullable column, adding a new table, adding an index. These do not break existing queries.
Destructive changes need migration. Removing a column, renaming a column, changing a type. Use a migration tool (Flyway, Alembic, Knex migrations) to version your schema changes.
The expand-contract pattern. For renaming a column: first add the new column (expand), deploy code that writes to both, backfill the new column, deploy code that reads from the new column, then drop the old column (contract). This avoids downtime.
-- Step 1: Expand
ALTER TABLE "user" ADD COLUMN display_name VARCHAR(255);
-- Step 2: Backfill
UPDATE "user" SET display_name = name WHERE display_name IS NULL;
-- Step 3: Application reads from display_name
-- Step 4: Contract (after all services are updated)
ALTER TABLE "user" DROP COLUMN name;
What comes next
Data modeling is where the abstract meets the concrete. The schemas, indexes, and access patterns you design here determine how well your system performs under load. As you build more complex systems, revisit these foundations. Every slow query, every migration headache, and every scaling bottleneck traces back to a data modeling decision. Apply these patterns deliberately and your systems will be easier to build, operate, and evolve.