
Database design for backend systems

In this series (15 parts)
  1. Backend system design scope
  2. Designing RESTful APIs
  3. Authentication and session management
  4. Database design for backend systems
  5. Caching in backend systems
  6. Background jobs and task queues
  7. File upload and storage
  8. Search integration
  9. Email and notification delivery
  10. Webhooks: design and security
  11. Payments integration
  12. Multi-tenancy patterns
  13. Backend for Frontend (BFF) pattern
  14. GraphQL server design
  15. gRPC and internal service APIs

A backend system is only as good as its data layer. Slow queries, missing indexes, and schema mismatches cause more production incidents than application bugs. This article covers the design decisions that make a database fast and reliable.

Schema design process

Schema design starts with understanding the data model, not the queries. Queries change as features evolve. A well-normalized schema adapts. A denormalized schema optimized for one query pattern breaks when requirements shift.

Step 1: Identify entities and relationships

Map the domain into entities (tables) and relationships (foreign keys). For a project management tool:

erDiagram
  WORKSPACE {
      uuid id PK
      string name
      string slug
      timestamp created_at
  }
  PROJECT {
      uuid id PK
      uuid workspace_id FK
      string name
      string status
      timestamp created_at
  }
  TASK {
      uuid id PK
      uuid project_id FK
      uuid assignee_id FK
      string title
      text description
      string status
      string priority
      timestamp due_date
      timestamp created_at
  }
  MEMBER {
      uuid id PK
      uuid workspace_id FK
      uuid user_id FK
      string role
      timestamp joined_at
  }
  COMMENT {
      uuid id PK
      uuid task_id FK
      uuid author_id FK
      text body
      timestamp created_at
  }

  WORKSPACE ||--|{ PROJECT : contains
  WORKSPACE ||--|{ MEMBER : has
  PROJECT ||--|{ TASK : contains
  TASK ||--o{ COMMENT : has

Entity-relationship diagram for a project management backend.

Step 2: Normalize to 3NF

Start with Third Normal Form:

  • Every column holds a single, atomic value, with no repeating groups (1NF).
  • Every non-key column depends on the whole primary key, not just part of it (2NF).
  • No non-key column depends on another non-key column (3NF).

Normalization eliminates data duplication and update anomalies. You can always denormalize later for specific read patterns, but starting denormalized makes future changes painful.
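The payoff shows up the first time shared data changes. A minimal sketch (SQLite for brevity; the tables are trimmed-down versions of the ER diagram above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- 3NF: tasks reference the member; the name lives in exactly one place
    CREATE TABLE tasks (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        assignee_id INTEGER REFERENCES members(id)
    );
""")
conn.execute("INSERT INTO members (id, name) VALUES (1, 'Dana')")
conn.executemany(
    "INSERT INTO tasks (title, assignee_id) VALUES (?, 1)",
    [("Write spec",), ("Review PR",)],
)

# A rename touches one row, and every task sees it: no update anomaly
conn.execute("UPDATE members SET name = 'Dana K.' WHERE id = 1")
rows = conn.execute(
    "SELECT t.title, m.name FROM tasks t JOIN members m ON m.id = t.assignee_id"
).fetchall()
print(rows)  # every task reflects the new name
```

If tasks stored assignee_name directly, the rename would have to update every task row, and missing one would leave contradictory data.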

Step 3: Choose primary keys

Use UUIDs (v4 or v7) for primary keys in distributed systems. Auto-incrementing integers work for single-database setups and are smaller, but they leak information (competitors can estimate your growth rate by watching IDs) and cause problems when merging data from multiple sources.

UUID v7 is time-ordered, which means inserts go to the end of the B-tree index rather than random positions. This gives you the uniqueness of UUIDs with the insert performance of sequential IDs.
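Not every driver or database generates UUID v7 natively yet, but the layout is simple. A minimal sketch of the RFC 9562 format (use a maintained library in production rather than this):

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Sketch of RFC 9562 UUID v7: a 48-bit Unix-millisecond timestamp,
    then version and variant bits, with the remaining bits random."""
    ts_ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 76 used

    value = ts_ms << 80               # timestamp fills the top 48 bits
    value |= rand & ((1 << 76) - 1)   # randomness in the low 76 bits
    value &= ~(0xF << 76)
    value |= 0x7 << 76                # version nibble = 7
    value &= ~(0x3 << 62)
    value |= 0x2 << 62                # variant bits = 10
    return uuid.UUID(int=value)

print(uuid7())  # time-ordered: later calls sort after earlier ones
```

Because the most significant 48 bits are a millisecond timestamp, values generated later compare greater, so B-tree inserts stay append-mostly.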

Step 4: Add audit columns

Every table should have:

  • created_at TIMESTAMPTZ DEFAULT now()
  • updated_at TIMESTAMPTZ DEFAULT now()

For soft deletes, add deleted_at TIMESTAMPTZ. Pair it with a partial index whose predicate is WHERE deleted_at IS NULL to keep queries on active records fast.

Indexing strategy

Indexes are the single biggest lever for query performance. But indexes are not free: they slow down writes and consume storage. The goal is to index exactly what you need.

Composite indexes

A composite index covers multiple columns. Column order matters. An index on (project_id, status, created_at) supports:

  • WHERE project_id = ? (uses first column)
  • WHERE project_id = ? AND status = ? (uses first two columns)
  • WHERE project_id = ? AND status = ? ORDER BY created_at (uses all three columns)

It does not support:

  • WHERE status = ? (skips the first column)
  • WHERE created_at > ? (skips the first two columns)

The leftmost prefix rule: a composite index can serve any query that filters on a contiguous leftmost prefix of its columns.
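You can watch the leftmost prefix rule in a query plan. A sketch using SQLite's EXPLAIN QUERY PLAN (the plan strings in the comments are typical output; PostgreSQL's EXPLAIN shows the same distinction as Index Scan versus Seq Scan):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tasks (
        id INTEGER PRIMARY KEY,
        project_id TEXT, status TEXT, created_at TEXT, title TEXT
    );
    CREATE INDEX idx_tasks ON tasks (project_id, status, created_at);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry a human-readable detail in the last column
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)

# Leading prefix of (project_id, status, created_at): the index is used
print(plan("SELECT * FROM tasks WHERE project_id = 'p1' AND status = 'open'"))
# typically: SEARCH tasks USING INDEX idx_tasks (project_id=? AND status=?)

# Skips the first column: the index cannot help, so the table is scanned
print(plan("SELECT * FROM tasks WHERE status = 'open'"))
# typically: SCAN tasks
```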

Covering indexes

A covering index includes all columns the query needs, so the database can answer the query from the index alone without reading the table (an “index-only scan”). In PostgreSQL:

CREATE INDEX idx_tasks_list ON tasks (project_id, status)
    INCLUDE (title, assignee_id, due_date);

This index covers the common “list tasks in project” query without touching the heap. The INCLUDE columns are stored in the index but not used for sorting or filtering.
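SQLite has no INCLUDE clause, but the same effect appears whenever every column a query reads is in the index: the plan reports a covering index. A sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tasks (
        id INTEGER PRIMARY KEY,
        project_id TEXT, status TEXT, title TEXT, description TEXT
    );
    -- Every column the query below touches lives in the index itself
    CREATE INDEX idx_tasks_list ON tasks (project_id, status, title);
""")

rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT title FROM tasks WHERE project_id = ? AND status = ?",
    ("p1", "open"),
).fetchall()
detail = rows[0][-1]
print(detail)  # typically: SEARCH tasks USING COVERING INDEX idx_tasks_list (...)
```

PostgreSQL reports the equivalent as an Index Only Scan in EXPLAIN output.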

Partial indexes

A partial index covers only rows that match a condition. This is powerful for tables where queries usually target a subset of rows:

CREATE INDEX idx_tasks_active ON tasks (project_id, due_date)
    WHERE deleted_at IS NULL AND status != 'completed';

This index is smaller than a full index and faster to scan because it only includes active, incomplete tasks.

Index performance impact

The difference between no index and a covering index can be two orders of magnitude. On a table with a million rows, that is the difference between 450ms and 3ms.

Query planning basics

The query planner decides how to execute a query: which indexes to use, what join strategy to pick, whether to sort in memory or on disk. Understanding the planner’s output helps you write queries that perform well.

Reading EXPLAIN output

EXPLAIN ANALYZE
SELECT id, title, status
FROM tasks
WHERE project_id = 'abc-123' AND status = 'in_progress'
ORDER BY created_at DESC
LIMIT 20;

Key things to look for:

  • Seq Scan: the database is reading the entire table. Usually bad for large tables.
  • Index Scan or Index Only Scan: the database is using an index. Good.
  • Sort: an explicit sort operation. If the index already provides the order, this disappears.
  • Rows: the planner’s estimate vs actual rows. A large mismatch means stale statistics; run ANALYZE.

Common planner pitfalls

Functions on indexed columns: WHERE LOWER(email) = 'john@example.com' cannot use a standard index on email. Create a functional index: CREATE INDEX idx_email_lower ON users (LOWER(email)).

Type mismatches: comparing a text column to an integer parameter may prevent index usage. Ensure types match.

OR conditions: WHERE status = 'active' OR status = 'pending' may not use an index efficiently. Rewrite as WHERE status IN ('active', 'pending').
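The functional-index fix is easy to verify in a plan. A sketch using SQLite, which supports the same expression-index syntax:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    -- Index the expression the query actually filters on
    CREATE INDEX idx_email_lower ON users (LOWER(email));
""")

rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE LOWER(email) = ?",
    ("john@example.com",),
).fetchall()
print(rows[0][-1])  # the plan names idx_email_lower rather than scanning users
```

Without the expression index, the same predicate forces a full scan, because a plain index on email stores raw values, not lowercased ones.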

Connection pooling

Every database connection consumes memory (roughly 5 to 10 MB per connection in PostgreSQL). With 100 application instances each opening 20 connections, you have 2,000 connections, which may exceed your database’s capacity.

Connection pooling solves this. A pooler (PgBouncer, pgcat, or your framework’s built-in pool) maintains a smaller number of actual database connections and multiplexes application requests across them.

graph LR
  A1["App Instance 1<br/>20 connections"] --> P["Connection Pooler<br/>50 actual connections"]
  A2["App Instance 2<br/>20 connections"] --> P
  A3["App Instance 3<br/>20 connections"] --> P
  A4["App Instance 4<br/>20 connections"] --> P
  P --> DB["PostgreSQL<br/>50 connections"]

  style P fill:#f39c12,color:#fff
  style DB fill:#3498db,color:#fff

Connection pooling reduces 80 application connections to 50 database connections.

Pool sizing

A good starting point for pool size is:

pool_size = (2 * CPU cores) + number_of_disks

This formula comes from the PostgreSQL wiki. For a database server with 8 cores and 2 SSDs, start with 18 connections. Profile and adjust from there.

Setting the pool too large causes contention on database locks and CPU. Setting it too small causes application threads to wait for a connection. Monitor connection wait time as a key metric.
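Mechanically, a pool is just a bounded queue of open connections. A toy sketch (real poolers such as PgBouncer add transaction-level multiplexing, health checks, and timeouts):

```python
import queue
import sqlite3
from contextlib import contextmanager

class ConnectionPool:
    """Toy pool: hand out at most pool_size connections, block when empty."""
    def __init__(self, pool_size, connect):
        self._conns = queue.Queue(maxsize=pool_size)
        for _ in range(pool_size):
            self._conns.put(connect())

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._conns.get(timeout=timeout)  # waits if pool is exhausted
        try:
            yield conn
        finally:
            self._conns.put(conn)                # always return to the pool

# (2 * cores) + disks rule of thumb: 8 cores, 2 disks -> 18 connections
pool = ConnectionPool(18, lambda: sqlite3.connect(":memory:"))
with pool.connection() as conn:
    print(conn.execute("SELECT 1").fetchone())  # (1,)
```

The get() timeout is the "connection wait time" worth monitoring: if threads routinely hit it, the pool is too small or queries are holding connections too long.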

Read replicas for reporting

Database replication creates copies of your primary database that serve read queries. This is useful for two scenarios:

Reporting and analytics: heavy queries that scan large portions of the data should not compete with transactional queries on the primary. Route them to a read replica.

Read scaling: if your workload is read-heavy (90%+ reads), replicas let you distribute read traffic across multiple servers.

Replication lag

Replicas are eventually consistent. A write to the primary may take milliseconds to seconds to appear on the replica. This means:

  • Never read from a replica immediately after writing. The data may not be there yet.
  • Use the primary for reads that follow writes in the same user flow (read-your-writes consistency).
  • Use replicas for dashboards, reports, and search indexing where slight staleness is acceptable.

# Route critical reads to the primary, reporting reads to a replica.
# primary_pool and replica_pool are connection pools configured elsewhere.
def get_task(task_id, use_replica=False):
    db = replica_pool if use_replica else primary_pool
    return db.execute("SELECT * FROM tasks WHERE id = %s", [task_id])

def get_project_report(project_id):
    # Heavy aggregate for a dashboard; slight staleness is acceptable
    return replica_pool.execute(
        "SELECT status, COUNT(*) FROM tasks WHERE project_id = %s GROUP BY status",
        [project_id],
    )

Schema migration practices

Schema changes in production require care:

  1. Always use migrations: tools like Flyway, Alembic, or Prisma Migrate. Never run DDL manually.
  2. Make changes backward-compatible: add columns as nullable or with defaults. Remove columns in a separate deployment after all code that references them is gone.
  3. Avoid long locks: in PostgreSQL 11+, ALTER TABLE ... ADD COLUMN with a constant default is instant, but a volatile default (such as random()) still rewrites the whole table. Likewise, build indexes on large tables with CREATE INDEX CONCURRENTLY so writes are not blocked for the duration.
  4. Test migrations on production-sized data: a migration that takes 1 second on your dev database may take 30 minutes on production.
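Point 4 pairs naturally with batching: instead of one giant UPDATE that holds locks for the whole backfill, commit in small chunks. A sketch (SQLite stands in for the production database; the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, priority TEXT)")
conn.executemany("INSERT INTO tasks (priority) VALUES (?)", [(None,)] * 5000)

# An earlier migration added `priority` as nullable; now backfill in
# batches so each transaction commits quickly and holds locks briefly.
BATCH = 1000
while True:
    cur = conn.execute(
        "UPDATE tasks SET priority = 'medium' WHERE id IN "
        "(SELECT id FROM tasks WHERE priority IS NULL LIMIT ?)",
        (BATCH,),
    )
    conn.commit()
    if cur.rowcount == 0:
        break  # done; a later migration can add the NOT NULL constraint

remaining = conn.execute(
    "SELECT COUNT(*) FROM tasks WHERE priority IS NULL"
).fetchone()[0]
print(remaining)  # 0
```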

What comes next

The next article covers caching in backend systems: what to cache, Redis data structures, cache-aside patterns, distributed locks, and cache warming strategies. Caching is the natural complement to good database design; it reduces load on the patterns we just covered.
