Database design for backend systems
In this series (15 parts)
- Backend system design scope
- Designing RESTful APIs
- Authentication and session management
- Database design for backend systems
- Caching in backend systems
- Background jobs and task queues
- File upload and storage
- Search integration
- Email and notification delivery
- Webhooks: design and security
- Payments integration
- Multi-tenancy patterns
- Backend for Frontend (BFF) pattern
- GraphQL server design
- gRPC and internal service APIs
A backend system is only as good as its data layer. Slow queries, missing indexes, and schema mismatches cause more production incidents than application bugs. This article covers the design decisions that make a database fast and reliable.
Schema design process
Schema design starts with understanding the data model, not the queries. Queries change as features evolve. A well-normalized schema adapts. A denormalized schema optimized for one query pattern breaks when requirements shift.
Step 1: Identify entities and relationships
Map the domain into entities (tables) and relationships (foreign keys). For a project management tool:
```mermaid
erDiagram
    WORKSPACE {
        uuid id PK
        string name
        string slug
        timestamp created_at
    }
    PROJECT {
        uuid id PK
        uuid workspace_id FK
        string name
        string status
        timestamp created_at
    }
    TASK {
        uuid id PK
        uuid project_id FK
        uuid assignee_id FK
        string title
        text description
        string status
        string priority
        timestamp due_date
        timestamp created_at
    }
    MEMBER {
        uuid id PK
        uuid workspace_id FK
        uuid user_id FK
        string role
        timestamp joined_at
    }
    COMMENT {
        uuid id PK
        uuid task_id FK
        uuid author_id FK
        text body
        timestamp created_at
    }
    WORKSPACE ||--|{ PROJECT : contains
    WORKSPACE ||--|{ MEMBER : has
    PROJECT ||--|{ TASK : contains
    TASK ||--o{ COMMENT : has
```
Entity-relationship diagram for a project management backend.
Step 2: Normalize to 3NF
Start with Third Normal Form:
- Every column depends on the primary key (1NF).
- Every non-key column depends on the whole key, not part of it (2NF).
- No non-key column depends on another non-key column (3NF).
Normalization eliminates data duplication and update anomalies. You can always denormalize later for specific read patterns, but starting denormalized makes future changes painful.
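As a concrete illustration of the 3NF rule, consider storing a project's name directly on each task: the name would depend on `project_id`, a non-key column, which is exactly the transitive dependency 3NF forbids. A minimal sketch in SQLite (table contents are made up):

```python
import sqlite3

# Normalized schema: project_name lives only in projects, so renaming
# a project is a one-row update instead of touching every task row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE tasks (
        id TEXT PRIMARY KEY,
        project_id TEXT NOT NULL REFERENCES projects(id),
        title TEXT NOT NULL
    );
    INSERT INTO projects VALUES ('p1', 'Website redesign');
    INSERT INTO tasks VALUES ('t1', 'p1', 'Draft homepage');
""")

# One UPDATE fixes the name everywhere; a denormalized schema would
# have to update every task that duplicated it.
conn.execute("UPDATE projects SET name = 'Site redesign' WHERE id = 'p1'")
row = conn.execute("""
    SELECT t.title, p.name
    FROM tasks t JOIN projects p ON p.id = t.project_id
""").fetchone()
print(row)  # ('Draft homepage', 'Site redesign')
```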
Step 3: Choose primary keys
Use UUIDs (v4 or v7) for primary keys in distributed systems. Auto-incrementing integers work for single-database setups and are smaller, but they leak information (competitors can estimate your growth rate by watching IDs) and cause problems when merging data from multiple sources.
UUID v7 is time-ordered, which means inserts go to the end of the B-tree index rather than random positions. This gives you the uniqueness of UUIDs with the insert performance of sequential IDs.
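The time-ordering comes from the layout: the top 48 bits are a millisecond Unix timestamp, followed by the version and variant bits and random bits. A sketch of a UUID v7 generator (the Python standard library only gains `uuid.uuid7` in recent versions, so this builds the bit layout by hand):

```python
import os
import time
import uuid

def uuid7(ts_ms=None):
    """Sketch of a UUID v7: 48-bit ms timestamp, version 7,
    RFC 4122 variant, 74 random bits."""
    if ts_ms is None:
        ts_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = ts_ms << 80                       # timestamp in top 48 bits
    value |= 0x7 << 76                        # version = 7
    value |= (rand >> 68) << 64               # 12 random bits (rand_a)
    value |= 0x2 << 62                        # variant = 10 (RFC 4122)
    value |= rand & ((1 << 62) - 1)           # 62 random bits (rand_b)
    return uuid.UUID(int=value)

# Later timestamps sort after earlier ones, so B-tree inserts append
# to the right-hand edge of the index instead of landing randomly:
assert uuid7(1_000).bytes < uuid7(2_000).bytes
```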
Step 4: Add audit columns
Every table should have:
- `created_at TIMESTAMPTZ DEFAULT now()`
- `updated_at TIMESTAMPTZ DEFAULT now()`
For soft deletes, add a `deleted_at TIMESTAMPTZ` column. Use a partial index with `WHERE deleted_at IS NULL` to keep queries on active records fast.
Indexing strategy
Indexes are the single biggest lever for query performance. But indexes are not free: they slow down writes and consume storage. The goal is to index exactly what you need.
Composite indexes
A composite index covers multiple columns. Column order matters. An index on (project_id, status, created_at) supports:
- `WHERE project_id = ?` (uses the first column)
- `WHERE project_id = ? AND status = ?` (uses the first two columns)
- `WHERE project_id = ? AND status = ? ORDER BY created_at` (uses all three columns)
It does not support:
- `WHERE status = ?` (skips the first column)
- `WHERE created_at > ?` (skips the first two columns)
The leftmost prefix rule: a composite index is useful for any query that uses a leading subset of its columns.
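The rule is easy to observe in any SQL engine's plan output. A small SQLite demonstration (table and index names are made up for the demo; PostgreSQL's planner applies the same prefix rule):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tasks (
        id INTEGER PRIMARY KEY,
        project_id TEXT, status TEXT, created_at TEXT
    );
    CREATE INDEX idx_tasks ON tasks (project_id, status, created_at);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the planner's textual detail last
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

# Leading column present: the composite index is used.
uses = plan("SELECT id FROM tasks "
            "WHERE project_id = 'p1' AND status = 'open'")
# Leading column skipped: the planner falls back to a full table scan.
skips = plan("SELECT id FROM tasks WHERE status = 'open'")
print(uses)
print(skips)
```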
Covering indexes
A covering index includes all columns the query needs, so the database can answer the query from the index alone without reading the table (an “index-only scan”). In PostgreSQL:
```sql
CREATE INDEX idx_tasks_list ON tasks (project_id, status)
INCLUDE (title, assignee_id, due_date);
```
This index covers the common “list tasks in project” query without touching the heap. The INCLUDE columns are stored in the index but not used for sorting or filtering.
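SQLite has no `INCLUDE` clause, but the same index-only effect appears whenever every column a query touches is in the index; the planner then reports a covering-index scan. A sketch (schema is a minimal stand-in for the tasks table above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tasks (
        id INTEGER PRIMARY KEY, project_id TEXT, status TEXT,
        title TEXT, assignee_id TEXT, due_date TEXT
    );
    CREATE INDEX idx_tasks_list
        ON tasks (project_id, status, title, assignee_id, due_date);
""")

# Every selected and filtered column is in the index, so the query is
# answered from the index alone, without touching the table.
detail = " ".join(
    row[-1] for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT title, assignee_id, due_date FROM tasks "
        "WHERE project_id = 'p1' AND status = 'open'"
    )
)
print(detail)
```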
Partial indexes
A partial index covers only rows that match a condition. This is powerful for tables where queries usually target a subset of rows:
```sql
CREATE INDEX idx_tasks_active ON tasks (project_id, due_date)
WHERE deleted_at IS NULL AND status != 'completed';
```
This index is smaller than a full index and faster to scan because it only includes active, incomplete tasks.
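SQLite supports the same partial-index syntax, and its planner only considers the index when the query's `WHERE` clause implies the index's `WHERE` clause. A sketch (names mirror the example above but the schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tasks (
        id INTEGER PRIMARY KEY, project_id TEXT,
        due_date TEXT, status TEXT, deleted_at TEXT
    );
    CREATE INDEX idx_tasks_active ON tasks (project_id, due_date)
        WHERE deleted_at IS NULL AND status != 'completed';
""")

# The query repeats the index's predicate, so the partial index applies.
detail = " ".join(
    row[-1] for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT id FROM tasks "
        "WHERE project_id = 'p1' AND deleted_at IS NULL "
        "AND status != 'completed'"
    )
)
print(detail)
```

Drop either predicate from the query and the planner can no longer prove the index covers the rows it needs, so it falls back to other access paths.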
Index performance impact
The difference between no index and a covering index can be two orders of magnitude. On a table with a million rows, that is the difference between 450ms and 3ms.
Query planning basics
The query planner decides how to execute a query: which indexes to use, what join strategy to pick, whether to sort in memory or on disk. Understanding the planner’s output helps you write queries that perform well.
Reading EXPLAIN output
```sql
EXPLAIN ANALYZE
SELECT id, title, status
FROM tasks
WHERE project_id = 'abc-123' AND status = 'in_progress'
ORDER BY created_at DESC
LIMIT 20;
```
Key things to look for:
- Seq Scan: the database is reading the entire table. Usually bad for large tables.
- Index Scan or Index Only Scan: the database is using an index. Good.
- Sort: an explicit sort operation. If the index already provides the order, this disappears.
- Rows: the planner’s estimate vs actual rows. A large mismatch means stale statistics; run `ANALYZE`.
Common planner pitfalls
Functions on indexed columns: `WHERE LOWER(email) = 'john@example.com'` cannot use a standard index on `email`. Create a functional index: `CREATE INDEX idx_email_lower ON users (LOWER(email))`.
Type mismatches: comparing a text column to an integer parameter may prevent index usage. Ensure types match.
OR conditions: `WHERE status = 'active' OR status = 'pending'` may not use an index efficiently. Rewrite as `WHERE status IN ('active', 'pending')`.
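The function-on-column pitfall and its fix are easy to reproduce; SQLite also supports expression indexes, so the sketch below mirrors the PostgreSQL example (schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX idx_email ON users (email)")

def plan(sql):
    return " ".join(r[-1] for r in
                    conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT id FROM users WHERE LOWER(email) = 'john@example.com'"

# The function wrapped around the column hides it from the plain index,
# so the planner scans the whole table.
before = plan(query)

# An expression index on LOWER(email) matches the query's expression
# exactly, turning the scan into an index search.
conn.execute("CREATE INDEX idx_email_lower ON users (LOWER(email))")
after = plan(query)
print(before)
print(after)
```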
Connection pooling
Every database connection consumes memory (roughly 5 to 10 MB per connection in PostgreSQL). With 100 application instances each opening 20 connections, you have 2,000 connections, which may exceed your database’s capacity.
Connection pooling solves this. A pooler (PgBouncer, pgcat, or your framework’s built-in pool) maintains a smaller number of actual database connections and multiplexes application requests across them.
```mermaid
graph LR
    A1["App Instance 1<br/>20 connections"] --> P["Connection Pooler<br/>50 actual connections"]
    A2["App Instance 2<br/>20 connections"] --> P
    A3["App Instance 3<br/>20 connections"] --> P
    A4["App Instance 4<br/>20 connections"] --> P
    P --> DB["PostgreSQL<br/>50 connections"]
    style P fill:#f39c12,color:#fff
    style DB fill:#3498db,color:#fff
```
Connection pooling reduces 80 application connections to 50 database connections.
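The multiplexing idea fits in a few lines: callers borrow a connection from a fixed-size pool and block until one frees up. A minimal in-process sketch (real poolers like PgBouncer sit between processes and add transaction-level pooling; class and names here are my own):

```python
import queue
import sqlite3
import threading

class ConnectionPool:
    """Fixed pool of connections shared by many callers."""
    def __init__(self, size, factory):
        self._conns = queue.Queue()
        for _ in range(size):
            self._conns.put(factory())

    def run(self, sql, params=()):
        conn = self._conns.get()        # block until a connection is free
        try:
            return conn.execute(sql, params).fetchall()
        finally:
            self._conns.put(conn)       # hand it back to the pool

pool = ConnectionPool(
    size=4,
    factory=lambda: sqlite3.connect(":memory:", check_same_thread=False),
)

# 20 concurrent "app requests" share only 4 database connections.
results = []
threads = [
    threading.Thread(target=lambda: results.append(pool.run("SELECT 1")))
    for _ in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 20
```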
Pool sizing
A good starting point for pool size is:
pool_size = (2 * CPU cores) + number_of_disks
This formula comes from the PostgreSQL wiki. For a database server with 8 cores and 2 SSDs, start with 18 connections. Profile and adjust from there.
Setting the pool too large causes contention on database locks and CPU. Setting it too small causes application threads to wait for a connection. Monitor connection wait time as a key metric.
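The wiki heuristic as a tiny helper (the function name is my own):

```python
def suggested_pool_size(cpu_cores, effective_disks):
    """Starting-point pool size from the PostgreSQL wiki heuristic:
    (2 * cores) + disks. Profile and adjust from there."""
    return 2 * cpu_cores + effective_disks

print(suggested_pool_size(8, 2))  # 18
```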
Read replicas for reporting
Database replication creates copies of your primary database that serve read queries. This is useful for two scenarios:
Reporting and analytics: heavy queries that scan large portions of the data should not compete with transactional queries on the primary. Route them to a read replica.
Read scaling: if your workload is read-heavy (90%+ reads), replicas let you distribute read traffic across multiple servers.
Replication lag
Replicas are eventually consistent. A write to the primary may take milliseconds to seconds to appear on the replica. This means:
- Never read from a replica immediately after writing. The data may not be there yet.
- Use the primary for reads that follow writes in the same user flow (read-your-writes consistency).
- Use replicas for dashboards, reports, and search indexing where slight staleness is acceptable.
```python
# Route critical reads to primary, reporting reads to replica
def get_task(task_id, use_replica=False):
    db = replica_pool if use_replica else primary_pool
    return db.execute("SELECT * FROM tasks WHERE id = %s", [task_id])

def get_project_report(project_id):
    # Reporting tolerates staleness; the stats helper follows the
    # same pattern and targets the replica.
    return get_task_stats(project_id, use_replica=True)
```
Schema migration practices
Schema changes in production require care:
- Always use migrations: tools like Flyway, Alembic, or Prisma Migrate. Never run DDL manually.
- Make changes backward-compatible: add columns as nullable or with defaults. Remove columns in a separate deployment after all code that references them is gone.
- Avoid long locks:
- Avoid long locks: `ALTER TABLE ... ADD COLUMN` with a default is instant in PostgreSQL 11+ as long as the default is non-volatile; a volatile default such as `gen_random_uuid()` still rewrites the whole table under an exclusive lock.
- Test migrations on production-sized data: a migration that takes 1 second on your dev database may take 30 minutes on production.
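The backward-compatible pattern above is often called expand/contract. A sketch using SQLite for brevity (a real deployment would split these steps into separate migrations with a tool like Alembic or Flyway):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO tasks (title) VALUES ('a'), ('b')")

# Deployment 1 (expand): add the column as nullable so code that has
# not been redeployed yet keeps working.
conn.execute("ALTER TABLE tasks ADD COLUMN priority TEXT")

# Backfill separately from the schema change, in batches on large
# tables, to avoid holding a long lock.
conn.execute("UPDATE tasks SET priority = 'normal' WHERE priority IS NULL")

# Deployment 2 (contract) would enforce NOT NULL or drop old columns
# only after every code path writes the new column.
rows = conn.execute("SELECT priority FROM tasks").fetchall()
print(rows)  # [('normal',), ('normal',)]
```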
What comes next
The next article covers caching in backend systems: what to cache, Redis data structures, cache-aside patterns, distributed locks, and cache warming strategies. Caching is the natural complement to good database design; it reduces load on the patterns we just covered.