Databases: SQL vs NoSQL and when to use each

In this series (20 parts)
  1. What is system design and why it matters
  2. Estimations and back-of-envelope calculations
  3. Scalability: vertical vs horizontal scaling
  4. CAP theorem and distributed system tradeoffs
  5. Consistency models
  6. Load balancing
  7. Caching: strategies and patterns
  8. Content Delivery Networks
  9. Databases: SQL vs NoSQL and when to use each
  10. Database replication
  11. Database sharding and partitioning
  12. Consistent hashing
  13. Message queues and event streaming
  14. API design: REST, GraphQL, gRPC
  15. Rate limiting and throttling
  16. Proxies: forward and reverse
  17. Networking concepts for system design
  18. Reliability patterns: timeouts, retries, circuit breakers
  19. Observability: logging, metrics, tracing
  20. Security in system design

Prerequisites: CDN.

Pick the wrong database and you spend the next three years migrating. Pick the right one and nobody notices because everything just works. This article breaks down how relational and non-relational databases actually differ under the hood, when each category shines, and how to reason about the decision when you are staring at a whiteboard during a system design review.

The relational model

Edgar Codd published his relational model paper in 1970. Over fifty years later, relational databases still power the majority of production systems. The core idea is simple: data lives in tables (relations), each row is a tuple, and you query by describing what you want rather than how to get it.

A relational database enforces a schema up front. Every row in a users table has the same columns. You join across tables using foreign keys. This structure is what makes SQL databases so powerful for complex queries: the optimizer knows the shape of your data before a single row is read.
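A minimal sketch of that idea, using Python's built-in sqlite3 (the table and column names here are illustrative, not from any particular system): the schema is declared up front, a foreign key links the tables, and the query describes the result rather than the retrieval steps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    );
    CREATE TABLE orders (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        total   REAL NOT NULL
    );
""")
conn.execute("INSERT INTO users VALUES (1, 'ada@example.com')")
conn.execute("INSERT INTO orders VALUES (10, 1, 42.5)")

# Declarative query: you state what you want; the optimizer picks the plan.
row = conn.execute("""
    SELECT u.email, o.total
    FROM orders o JOIN users u ON u.id = o.user_id
""").fetchone()
print(row)  # ('ada@example.com', 42.5)
```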

PostgreSQL, MySQL, SQL Server, and Oracle are the big names. They all speak SQL, but they differ in storage engines, concurrency control, and extension ecosystems. PostgreSQL alone handles JSONB columns, full-text search, geospatial queries, and time-series workloads through extensions. It is far more versatile than people give it credit for.

The relational model works best when your data has clear relationships, you need multi-table transactions, and your query patterns are diverse. An e-commerce platform where orders reference customers who reference addresses who have payment methods is a textbook fit. The schema enforces data integrity at the database layer, so your application code does not have to.

ACID: the transactional guarantee

Relational databases provide ACID transactions. This acronym is worth understanding precisely because it tells you exactly what you give up when you choose a system that does not support it.

Atomicity means a transaction either fully completes or fully rolls back. Transfer $500 from account A to account B: both the debit and credit happen, or neither does. There is no state where the money vanishes.

Consistency means a transaction moves the database from one valid state to another. If you have a constraint that account balances cannot go negative, the database rejects any transaction that would violate it.

Isolation means concurrent transactions behave as if they ran one after another (at the strongest isolation level, serializable; most databases default to weaker levels). Two users buying the last concert ticket at the same time will not both succeed. The database serializes conflicting operations.

Durability means once a transaction commits, it stays committed even if the server loses power one millisecond later. The data is on disk (or replicated) before the client gets an acknowledgment.

ACID is expensive. Coordinating these guarantees across distributed nodes is even more expensive. This cost is the primary reason NoSQL databases exist. Many workloads simply do not need all four properties on every operation.
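The money-transfer example above can be sketched with SQLite. This is illustrative only (the account names and the non-negative-balance rule are made up): the transaction either commits both the debit and the credit, or the constraint violation rolls the whole thing back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("A", 1000), ("B", 200)])
conn.commit()

def transfer(amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'A'",
                         (amount,))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'B'",
                         (amount,))
    except sqlite3.IntegrityError:
        pass  # CHECK constraint fired (consistency); the debit was rolled back (atomicity)

transfer(500)   # succeeds: A=500, B=700
transfer(9999)  # would drive A negative; the entire transaction rolls back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'A': 500, 'B': 700}
```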

BASE: the alternative philosophy

NoSQL systems typically operate under BASE semantics: Basically Available, Soft state, Eventually consistent. The database prioritizes availability and partition tolerance over immediate consistency. You can read more about why this tradeoff exists in the CAP theorem and consistency models articles.

BASE means your read might return stale data for a brief window after a write. For a social media feed or a product catalog, that is perfectly acceptable. For a bank ledger, it is not. The art of database selection is matching the guarantee level to the actual requirements of each data domain in your system.

Document stores

MongoDB, CouchDB, and Amazon DocumentDB store data as JSON-like documents. Each document is self-contained: a single order document might embed the line items, shipping address, and payment details rather than normalizing them across four tables.

Document stores excel when your access pattern is “fetch everything about entity X in one read.” A content management system where each article has a different set of metadata fields fits naturally. You do not need to ALTER TABLE when one article has a video embed and another has a poll.

The tradeoff is joins. If you frequently need to query across document boundaries (“find all orders for customers in Texas who bought product Y”), you either denormalize aggressively or you perform multiple queries and join in application code. Neither is free.

MongoDB supports transactions across multiple documents as of version 4.0, but the performance overhead is noticeable. If you find yourself wrapping most operations in multi-document transactions, a relational database would likely serve you better.

A single MongoDB document can hold up to 16 MB. That sounds like a lot, but embedded arrays grow. A naive design that embeds all comments inside a blog post document will hit that limit on a popular post. Design your document boundaries around the queries you actually run.
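The two document designs can be contrasted with plain Python dicts standing in for MongoDB documents (field names are illustrative): one embeds comments and grows without bound, the other keeps comments in their own collection and references the post by id.

```python
# Naive: comments embedded in the post document. The array grows without
# bound on a popular post and eventually hits the 16 MB document limit.
post_embedded = {
    "_id": "post-1",
    "title": "Hello",
    "comments": [{"author": "ada", "text": "Nice!"}],  # unbounded growth
}

# Safer: comments live in their own collection, referencing the post by id.
post = {"_id": "post-1", "title": "Hello"}
comments = [
    {"_id": "c-1", "post_id": "post-1", "author": "ada", "text": "Nice!"},
    {"_id": "c-2", "post_id": "post-1", "author": "bob", "text": "+1"},
]

# "All comments for post-1" becomes a filtered query instead of one huge read.
for_post = [c for c in comments if c["post_id"] == "post-1"]
print(len(for_post))  # 2
```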

Key-value stores

Redis, Memcached, Amazon DynamoDB, and Riak are key-value stores at their core. You store a value (a string, a blob, a JSON object) under a key, and you retrieve it by that key. That is essentially the entire API surface, though Redis layers richer data structures (lists, sets, sorted sets) on top of the same contract.

The simplicity is the point. Redis serves a GET request in under 1 millisecond from memory. DynamoDB guarantees single-digit millisecond reads at any scale. When your access pattern is “look up this specific thing by its identifier,” nothing beats a key-value store.

Session storage, caching layers, feature flags, rate-limiting counters, shopping carts. These are all key-value problems. You know the key at query time. You do not need to scan or filter. You need speed.
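One of those problems, a fixed-window rate limiter, can be sketched over an in-memory dict. In production the dict would be Redis (an INCR plus an EXPIRE per window); the limit, window, and key format below are illustrative.

```python
import time

store = {}  # key -> (window_start, count); Redis would hold this in production
LIMIT, WINDOW = 5, 60  # 5 requests per 60-second window

def allow(user_id, now=None):
    now = time.time() if now is None else now
    key = f"ratelimit:{user_id}"
    window_start, count = store.get(key, (now, 0))
    if now - window_start >= WINDOW:   # window expired; start a fresh counter
        window_start, count = now, 0
    if count >= LIMIT:
        return False
    store[key] = (window_start, count + 1)
    return True

results = [allow("u1", now=100.0) for _ in range(6)]
print(results)  # [True, True, True, True, True, False]
```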

DynamoDB adds a sort key that turns it into something closer to a wide-column store for certain access patterns. You can query all items with the same partition key, sorted by the sort key. This lets you model one-to-many relationships (all orders for customer X, sorted by date) without a full table scan. At scale, DynamoDB handles 10+ million requests per second with consistent single-digit millisecond latency.
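The partition-key-plus-sort-key model can be sketched as a dict of sorted lists. The key formats ("CUSTOMER#...", ISO dates as sort keys) are illustrative conventions, not DynamoDB requirements.

```python
from bisect import insort

table = {}  # partition key -> list of (sort_key, item), kept in sort-key order

def put(pk, sk, item):
    insort(table.setdefault(pk, []), (sk, item))

def query(pk, sk_prefix=""):
    # All items in one partition whose sort key starts with the prefix,
    # returned in sort-key order -- no full table scan.
    return [item for sk, item in table.get(pk, []) if sk.startswith(sk_prefix)]

put("CUSTOMER#42", "ORDER#2024-01-15", {"total": 30})
put("CUSTOMER#42", "ORDER#2024-03-02", {"total": 75})
put("CUSTOMER#7",  "ORDER#2024-02-01", {"total": 12})

# All orders for customer 42 in 2024, sorted by date:
print(query("CUSTOMER#42", "ORDER#2024"))  # [{'total': 30}, {'total': 75}]
```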

Wide-column stores

Apache Cassandra and HBase organize data into rows and column families, but unlike relational databases, each row can have a different set of columns. Think of it as a two-dimensional key-value store: you look up a row by its partition key, then access specific columns within that row.

Cassandra was designed at Facebook to handle inbox search across hundreds of millions of users. It provides linear horizontal scalability: double the nodes, double the throughput. A 100-node Cassandra cluster can handle over 1 million writes per second.

The catch is that Cassandra requires you to model your data around your queries, not around your entities. You design a table for each query pattern. Need to look up users by email? One table. Need to look up users by city? A different table with the same data arranged differently. This denormalization feels wrong if you come from a relational background, but it is how you get O(1) reads at massive scale.
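Query-first modeling can be sketched in a few lines: the same user rows stored twice, each copy keyed by the query it serves. The data below is illustrative.

```python
users = [
    {"email": "ada@example.com", "city": "Austin", "name": "Ada"},
    {"email": "bob@example.com", "city": "Austin", "name": "Bob"},
]

# "Table" 1: partitioned by email -> O(1) lookup by email.
users_by_email = {u["email"]: u for u in users}

# "Table" 2: the same rows, partitioned by city for the other query.
users_by_city = {}
for u in users:
    users_by_city.setdefault(u["city"], []).append(u)

print(users_by_email["ada@example.com"]["name"])  # Ada
print(len(users_by_city["Austin"]))               # 2
```

The duplication is deliberate: writes pay twice so that each read hits exactly one partition.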

Wide-column stores fit time-series data, IoT telemetry, activity logs, and any workload with high write throughput and predictable read patterns. Netflix uses Cassandra to store hundreds of petabytes of viewing history. Discord uses it for billions of messages.

Graph databases

Neo4j, Amazon Neptune, and JanusGraph store data as nodes and edges. When the relationships between entities are the primary thing you query, a graph database outperforms everything else by orders of magnitude.

Consider a social network. “Find all friends of friends who also like hiking and live within 50 miles” requires multiple self-joins in SQL that scale poorly as the graph grows. In Neo4j, this is a traversal that follows edges directly. A three-hop traversal over millions of nodes completes in milliseconds.

Graph databases are the right tool for social networks, recommendation engines, fraud detection (finding rings of suspicious accounts), knowledge graphs, and network topology analysis. They are the wrong tool for tabular analytics, time-series ingestion, or anything where relationships are not the primary query axis.

The database decision tree

When you face a new data domain, start with the access pattern, not the technology.

flowchart TD
  A["What is your primary access pattern?"] --> B["Lookup by key"]
  A --> C["Complex queries with joins"]
  A --> D["Traverse relationships"]
  A --> E["Write-heavy time-series"]
  B --> F["Key-Value Store (Redis, DynamoDB)"]
  C --> G{"Need strict ACID?"}
  G -->|Yes| H["Relational DB (PostgreSQL, MySQL)"]
  G -->|No| I["Document Store (MongoDB)"]
  D --> J["Graph DB (Neo4j, Neptune)"]
  E --> K["Wide-Column Store (Cassandra, HBase)"]

Decision tree for selecting a database type based on primary access pattern. Start with the query shape, not the brand name.

This diagram is a starting point, not a rulebook. Many real systems have multiple access patterns, which leads us to polyglot persistence.

Comparing database types

Type        | Examples           | Data model                    | Strengths                                | Weaknesses                             | Sweet spot
Relational  | PostgreSQL, MySQL  | Tables with rows and columns  | ACID, complex joins, flexible queries    | Horizontal scaling is hard             | Transactions, structured data
Document    | MongoDB, CouchDB   | JSON-like documents           | Schema flexibility, fast single-entity reads | Cross-document joins are expensive | Content systems, catalogs
Key-Value   | Redis, DynamoDB    | Key mapped to value           | Sub-millisecond reads, simple API        | No complex queries                     | Caching, sessions, counters
Wide-Column | Cassandra, HBase   | Row key plus column families  | Massive write throughput, linear scaling | Query patterns must be known upfront   | Time-series, logs, IoT
Graph       | Neo4j, Neptune     | Nodes and edges               | Relationship traversals in milliseconds  | Poor for non-graph queries             | Social, fraud, recommendations

Polyglot persistence

Most production systems at scale use more than one database. This is called polyglot persistence, and it is not a sign of over-engineering. It is a recognition that different data domains have different access patterns.

An e-commerce platform might use PostgreSQL for the order and payment pipeline (ACID matters), Redis for session management and caching (speed matters), Elasticsearch for product search (full-text ranking matters), and Cassandra for clickstream analytics (write throughput matters). Each database handles the workload it was designed for.

The cost of polyglot persistence is operational complexity. Each database has its own backup strategy, monitoring stack, failure modes, and scaling knobs. A team of three engineers probably should not run five different databases. Start with PostgreSQL. It handles an astonishing range of workloads. Add specialized stores only when PostgreSQL becomes a measurable bottleneck for a specific access pattern.

Choosing based on access patterns

The most common mistake in database selection is choosing based on hype or familiarity. “We use MongoDB because our last company used MongoDB” is not an engineering decision. Here is a more principled approach.

Step 1: List your access patterns. Write down every query your application needs to run. “Get user by ID.” “Find all orders for a customer in the last 30 days.” “Search products by keyword.” “Count events per minute for the last hour.” Be specific.

Step 2: Classify each pattern. Point lookups by key? Range scans? Full-text search? Graph traversals? Aggregations over large datasets? Each pattern type maps naturally to a database category.

Step 3: Identify consistency requirements. Which operations absolutely need ACID? Financial transactions, inventory decrements, and user registration usually do. Feed generation, analytics, and recommendations usually do not. See consistency models for the spectrum between strong and eventual consistency.

Step 4: Estimate scale. 1,000 reads per second is a different problem than 1,000,000 reads per second. A single PostgreSQL instance handles 10,000 to 50,000 simple queries per second depending on hardware. That covers most startups for years. Do not pre-optimize for scale you do not have.

Step 5: Consider operational cost. Can your team operate this database? Do you have runbooks for failover? Managed services like Amazon RDS, DynamoDB, and Cloud Spanner shift operational burden to the cloud provider, but they add cost and reduce control.

If most of your patterns point to one database type, use that. If you have two or three clearly distinct domains, consider polyglot persistence. If you are unsure, start relational. You can always add a specialized store later. Migrating away from a poor initial choice is far harder.

Replication and sharding

Once you have chosen a database, you need to make it survive hardware failures and handle growing load. These are separate problems with separate solutions.

Replication copies your data across multiple nodes so that if one machine dies, another can serve reads (and potentially writes). Leader-follower replication is the most common pattern. The leader handles writes and streams changes to followers, which serve reads. This gives you fault tolerance and read scalability.

Sharding (also called partitioning) splits your data across multiple nodes so that no single machine has to store or query the entire dataset. You pick a shard key, and each row goes to the node responsible for that key range.
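The routing step can be sketched with a stable hash of the shard key (the node count and key format are illustrative; real systems often use consistent hashing instead, so that adding a node does not remap most keys):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key):
    # md5 gives a hash that is stable across processes and restarts,
    # unlike Python's built-in hash().
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

shard = shard_for("user:42")
print(0 <= shard < NUM_SHARDS)  # True
```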

Both topics have enough depth to warrant their own articles. Read about database replication and database sharding next.

Common mistakes

Choosing NoSQL to avoid learning SQL. SQL is a decades-old, battle-tested query language. If your data is relational, fighting that structure by stuffing it into documents creates more problems than it solves.

Denormalizing prematurely. Denormalization speeds up reads at the cost of write complexity and data consistency. Measure first. PostgreSQL with proper indexes handles far more than most engineers expect.

Ignoring the operational burden. Running a 12-node Cassandra cluster in production is a full-time job. If you do not have the team to operate it, use a managed service or pick a simpler architecture.

Treating the database as a dumb store. Modern databases have sophisticated query optimizers, materialized views, change data capture, stored procedures, and extension systems. Learn what your database can do before adding application-level complexity that duplicates built-in features.

What comes next

You have chosen a database. Now you need it to survive a server failure at 3 AM without losing data or waking anyone up. The next article covers database replication: leader-follower, leader-leader, quorum-based approaches, and how replication lag affects your consistency guarantees.
