High Level Design · Part 8

Storage systems at scale

In this series (12 parts)
  1. Monolith vs microservices
  2. Microservice communication patterns
  3. Service discovery and registration
  4. Event-driven architecture
  5. Distributed data patterns
  6. Caching architecture patterns
  7. Search architecture
  8. Storage systems at scale
  9. Notification systems
  10. Real-time systems architecture
  11. Batch and stream processing
  12. Multi-region and global systems

Every system you design eventually asks the same question: where does the data live? A user uploads a profile photo, a pipeline produces a 4GB Parquet file, a logging service writes 200MB per second. Each of these has different access patterns, durability requirements, and cost sensitivities. Choosing the wrong storage layer means you either overpay or underperform.

Storage is not one problem. It is three distinct problems that happen to share the word “storage.”

Block, object, and file storage

Block storage exposes raw disk blocks to an operating system. Think of it as a virtual hard drive. The storage system has no understanding of files or directories; it just reads and writes fixed-size blocks (typically 4KB to 64KB). The operating system layers a filesystem on top. Amazon EBS, Google Persistent Disk, and Azure Managed Disks are cloud block storage. You attach them to a single compute instance, format them with ext4 or xfs, and use them like a local disk. Block storage gives you the lowest latency and highest IOPS because there is no metadata abstraction between you and the bytes.

File storage adds a POSIX-compatible filesystem layer that supports directories, permissions, and file locking. Multiple clients can mount the same filesystem simultaneously. NFS, Amazon EFS, and Azure Files are examples. File storage works well when multiple servers need shared access to the same files, like a pool of web servers reading the same static assets.

Object storage throws away the filesystem metaphor entirely. You store blobs (objects) in flat namespaces (buckets). Each object gets a unique key, some metadata, and the blob itself. There are no directories, no rename operations, no append writes. You PUT an object and GET it back. Amazon S3, Google Cloud Storage, and Azure Blob Storage are the dominant implementations. Object storage is optimized for massive scale, high durability, and low cost per gigabyte, at the expense of higher latency per operation and no partial update support.

The decision framework is straightforward. If you need a database disk or boot volume, use block storage. If multiple servers need to read and write shared files with POSIX semantics, use file storage. For everything else (images, videos, logs, backups, data lake files), object storage wins on cost, durability, and scalability.

HDFS concepts

The Hadoop Distributed File System (HDFS) was designed for a specific workload: storing very large files (gigabytes to terabytes) and reading them sequentially at high throughput. It was built for batch analytics, not low-latency random access.

HDFS splits every file into fixed-size blocks (128MB by default, much larger than OS-level blocks). Each block is replicated across multiple DataNodes (three copies by default). A central NameNode holds all metadata: which files exist, which blocks compose each file, and which DataNodes store each block.
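To make those numbers concrete, here is a small Python sketch (the 1GB file size is illustrative) computing how a file maps to blocks and raw disk usage under the defaults:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128MB HDFS default block size
REPLICATION = 3                  # default replication factor

def hdfs_footprint(file_size_bytes):
    """Return (block_count, replica_count, raw_bytes) for one file."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    replicas = blocks * REPLICATION
    # Blocks are not padded: the last block only occupies the bytes it holds,
    # so raw usage is the file size times the replication factor.
    raw = file_size_bytes * REPLICATION
    return blocks, replicas, raw

# A 1GB file: 8 blocks, 24 block replicas, 3GB of raw disk consumed.
print(hdfs_footprint(1024 ** 3))
```

Note the last block of a file occupies only its actual size on disk; the 128MB figure caps a block, it does not pad one.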

When a client writes a file, the NameNode assigns block locations and the client streams data directly to DataNodes. The first DataNode forwards the data to the second, which forwards to the third, forming a replication pipeline. This design keeps the NameNode out of the data path entirely; it only handles metadata.

The NameNode is the single point of failure in HDFS. Early Hadoop clusters only mitigated this with a Secondary NameNode, which periodically merged the edit log into a checkpoint but offered no failover. Modern deployments run an active/standby NameNode pair that shares edit logs (typically through a Quorum Journal Manager), with automatic failover coordinated by ZooKeeper.

graph TD
  Client["Client"] -->|1. File metadata request| NN["NameNode"]
  NN -->|2. Block locations| Client
  Client -->|3. Write block data| DN1["DataNode 1"]
  DN1 -->|4. Replicate| DN2["DataNode 2"]
  DN2 -->|5. Replicate| DN3["DataNode 3"]
  DN3 -->|6. Ack| DN2
  DN2 -->|7. Ack| DN1
  DN1 -->|8. Ack| Client

HDFS write path showing the replication pipeline from client through three DataNodes.

HDFS works well for batch processing workloads like MapReduce and Spark, where you scan entire datasets sequentially. It is a poor choice for serving individual small files to users because the 128MB block size and NameNode metadata overhead make small file access inefficient. The “small files problem” is one of the most common HDFS pitfalls: millions of tiny files consume NameNode memory (each file, directory, and block takes about 150 bytes of heap) without using block capacity efficiently.
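Using the roughly 150 bytes per metadata object cited above (the constant is approximate and varies by Hadoop version), a back-of-envelope calculation shows why small files hurt:

```python
BYTES_PER_OBJECT = 150  # approximate NameNode heap cost per file/directory/block object

def namenode_heap_gb(num_files, blocks_per_file=1):
    # Each file costs one file object plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1024 ** 3

# 10 million single-block files consume roughly 2.8 GB of NameNode heap,
# whether each file is 1KB or 100MB.
print(round(namenode_heap_gb(10_000_000), 1))
```

The heap cost depends on the number of files and blocks, not their size, which is exactly why consolidating small files (e.g. into sequence files or larger Parquet files) is the standard remedy.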

S3-style object store architecture

Amazon S3 and its equivalents store trillions of objects across millions of physical drives. Understanding the architecture explains why object storage behaves the way it does.

At the highest level, an object store has two planes: the metadata plane and the data plane. The metadata plane maps bucket names and object keys to physical storage locations. The data plane handles the actual bytes.

When you PUT an object, the request hits a front-end load balancer, which routes it to an API server. The API server authenticates the request, checks bucket policies, and contacts the metadata service to allocate space. The metadata service chooses which storage nodes will hold the object’s data chunks. The API server streams the object data to those storage nodes, which write the chunks to local disks. Once all replicas (or erasure-coded fragments) are durably written, the metadata service records the mapping and the API server returns a success response.

GET requests follow a similar path in reverse. The API server queries the metadata service for chunk locations, fetches chunks from storage nodes in parallel, assembles them, and streams the response back to the client.
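The split between metadata plane and data plane can be sketched in a toy in-memory model (not any real implementation): the metadata service maps keys to chunk locations, storage nodes hold only raw bytes, and the mapping is recorded only after the chunks are written.

```python
import hashlib

class ToyObjectStore:
    def __init__(self, num_nodes=3):
        self.nodes = [dict() for _ in range(num_nodes)]  # data plane: node -> chunk_id -> bytes
        self.metadata = {}                               # metadata plane: (bucket, key) -> locations

    def put(self, bucket, key, data, chunk_size=4):
        locations = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            chunk_id = hashlib.sha256(chunk).hexdigest()
            node = (i // chunk_size) % len(self.nodes)   # spread chunks across storage nodes
            self.nodes[node][chunk_id] = chunk           # write chunk data first...
            locations.append((node, chunk_id))
        self.metadata[(bucket, key)] = locations         # ...record the mapping only once durable

    def get(self, bucket, key):
        # Fetch chunks by location and reassemble, mirroring the GET path described above.
        return b"".join(self.nodes[n][c] for n, c in self.metadata[(bucket, key)])

store = ToyObjectStore()
store.put("photos", "cat.jpg", b"hello object world")
print(store.get("photos", "cat.jpg"))  # b'hello object world'
```

The ordering matters: because the metadata entry is written last, a crash mid-upload leaves orphaned chunks (garbage to collect later) rather than a key that points at missing data.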

graph TD
  C["Client"] -->|PUT /bucket/key| LB["Load Balancer"]
  LB --> API["API Server"]
  API -->|Auth + Policy Check| Auth["IAM / Policy"]
  API -->|Allocate storage| MS["Metadata Service"]
  MS -->|Chunk locations| API
  API -->|Write chunks| SN1["Storage Node 1"]
  API -->|Write chunks| SN2["Storage Node 2"]
  API -->|Write chunks| SN3["Storage Node 3"]
  SN1 -->|Ack| API
  SN2 -->|Ack| API
  SN3 -->|Ack| API
  API -->|Record mapping| MS
  API -->|200 OK| C

Object upload path through load balancer, API server, metadata service, and storage nodes.

Durability is the defining feature. S3 promises 99.999999999% (eleven nines) annual durability. This is achieved through a combination of erasure coding (splitting each object into data and parity fragments so it survives multiple simultaneous drive failures) and continuous background integrity checking that detects and repairs bit rot.
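The trade-off between replication and erasure coding comes down to storage overhead versus failure tolerance. A quick comparison (the 10+4 scheme is illustrative; real deployments vary):

```python
def storage_overhead(data_units, redundancy_units):
    """Extra raw storage as a fraction of the logical data size."""
    return redundancy_units / data_units

# 3x replication: 1 data copy + 2 extra copies -> 200% overhead, survives 2 lost copies.
print(storage_overhead(1, 2))    # 2.0

# 10+4 erasure coding: 10 data + 4 parity fragments -> 40% overhead, survives 4 lost fragments.
print(storage_overhead(10, 4))   # 0.4
```

Erasure coding tolerates more simultaneous failures at a fraction of the raw storage cost, which is why it dominates at object-store scale; the price is CPU for encoding and higher read latency when fragments must be reconstructed.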

Blob metadata service

The metadata service is the brain of an object store. It must handle billions of keys with low latency, survive node failures without data loss, and support operations like listing objects in a bucket (which can contain billions of keys).

Internally, metadata is typically stored in a distributed key-value store or a sharded database. The key space is partitioned by bucket and object key so that requests for different buckets hit different partitions. This is why listing a bucket with billions of keys is a paginated, expensive operation and why lexicographic key distribution matters for performance: keys with a common prefix may land on the same partition, creating hot spots.

Modern object stores use a tiered metadata cache. Hot metadata (recently accessed objects, bucket configurations) lives in memory. Warm metadata sits in fast SSDs. Cold metadata for rarely accessed objects can be fetched from slower storage. The metadata path is the bottleneck in most object store designs, which is why high request rates against a single prefix can cause throttling.
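A common mitigation is to salt keys with a short hash prefix so lexicographically adjacent names spread across metadata partitions. A sketch (the hash length and key layout are illustrative choices, not any store's requirement):

```python
import hashlib

def salted_key(natural_key, prefix_len=4):
    """Prepend a short deterministic hash so sequential keys stop sharing a prefix."""
    salt = hashlib.md5(natural_key.encode()).hexdigest()[:prefix_len]
    return f"{salt}/{natural_key}"

# Sequential upload names get different leading prefixes and thus
# spread across partitions instead of hammering one:
print(salted_key("logs/2024-01-01/part-0000"))
print(salted_key("logs/2024-01-01/part-0001"))
```

The cost is that prefix-based listing (e.g. "all objects under logs/2024-01-01/") no longer works directly; you trade list ergonomics for write throughput.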

Lifecycle policies

Data has a lifecycle. A user-uploaded photo is accessed frequently in the first week, occasionally for the next year, and almost never after that. Paying hot-storage prices for cold data wastes money. Paying cold-storage prices for hot data wastes latency.

Lifecycle policies automate data movement between storage tiers. A typical policy might look like: keep objects in standard storage for 30 days, transition to infrequent access storage at day 30, move to glacier-style archive at day 90, and delete at day 365.
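That example policy, expressed in the rule structure S3's lifecycle API expects (the rule ID, prefix, and bucket name in the comment are placeholders; other object stores use similar shapes):

```python
# One lifecycle rule: standard -> infrequent access at day 30 -> archive at day 90 -> delete at day 365.
lifecycle_rule = {
    "ID": "log-retention",                 # placeholder rule name
    "Filter": {"Prefix": "logs/"},         # apply only to objects under logs/
    "Status": "Enabled",
    "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"},
    ],
    "Expiration": {"Days": 365},           # delete after one year
}

# With boto3 this would be applied via:
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="my-bucket",
#       LifecycleConfiguration={"Rules": [lifecycle_rule]})
print([t["StorageClass"] for t in lifecycle_rule["Transitions"]])
```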

These transitions are not free. Moving an object between tiers involves rewriting metadata and sometimes physically relocating data. There is usually a minimum storage duration (for example, 30 days for infrequent access) to prevent thrashing. Retrieve costs also vary by tier: pulling data out of archive storage can take minutes to hours and costs significantly more per gigabyte than standard retrieval.

The key design decision is understanding your access patterns before choosing a tier. Logs are a textbook case for lifecycle policies: they are written constantly, queried for debugging within the first few days, occasionally searched for compliance in the first year, and archived for legal retention after that.

Replication across regions

Storing data in a single region works until it doesn’t. A regional outage takes down your entire storage layer. Users on the other side of the planet see high latency. Regulatory requirements may demand that data physically resides in specific countries.

Cross-region replication (CRR) copies objects from a source bucket to a destination bucket in another region. In S3, this is asynchronous: writes land in the source region first, then replicate to the destination within seconds to minutes. The replication lag means the destination bucket is eventually consistent with the source.

Bi-directional replication (both regions accept writes) introduces conflict resolution challenges. If the same key is written simultaneously in two regions, one write must win. Most object stores use last-writer-wins based on timestamps, which can cause data loss if clocks are skewed. For use cases where this matters, you need application-level conflict resolution or a single-writer architecture where each key is owned by one region.
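Last-writer-wins can be sketched in a few lines; note how a skewed clock silently discards the logical loser (the timestamps and region names are illustrative):

```python
def resolve_lww(writes):
    """Pick the surviving write by highest timestamp; ties broken by region name."""
    return max(writes, key=lambda w: (w["ts"], w["region"]))

conflicting = [
    {"region": "us-east", "ts": 1700000000.120, "value": b"v-us"},
    {"region": "eu-west", "ts": 1700000000.050, "value": b"v-eu"},  # actually written later,
]                                                                   # but its clock runs behind

print(resolve_lww(conflicting)["value"])  # b'v-us' -- the eu-west write is silently lost
```

This is why LWW is only acceptable when concurrent writes to the same key are rare or the data is low-value; otherwise you need vector clocks, application-level merges, or the single-writer-per-key design mentioned above.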

graph LR
  subgraph US East
      B1["Source Bucket"]
  end
  subgraph EU West
      B2["Replica Bucket"]
  end
  subgraph AP Southeast
      B3["Replica Bucket"]
  end
  B1 -->|Async CRR| B2
  B1 -->|Async CRR| B3
  U1["US User"] --> B1
  U2["EU User"] --> B2
  U3["AP User"] --> B3

Cross-region replication from a primary bucket in US East to replicas in EU West and AP Southeast, with users routed to their nearest region.

Multi-region storage design ties directly into multi-region system architecture, where you need to decide between active-passive and active-active configurations at the application level. Storage replication is just one piece of that larger puzzle.

Choosing the right storage for your design

When you encounter storage in a system design interview or a production architecture, ask three questions. What are the access patterns (sequential scan vs random read vs write-heavy)? What latency and throughput do you need? How much are you willing to pay per gigabyte per month?

User-facing assets (profile pictures, thumbnails) belong in object storage behind a CDN. Database volumes need block storage with provisioned IOPS. Shared configuration files across a fleet of servers fit file storage. Analytics datasets that get processed in bulk jobs belong in HDFS or object storage with a columnar format like Parquet.

The most common mistake is using one storage type for everything. Block storage for images is expensive. Object storage for a database is unusable. Match the tool to the access pattern.

What comes next

Storage holds your data. The next article tackles how to tell users something happened with that data, covering notification systems, push delivery, email pipelines, and the fan-out problem that makes notifications one of the trickiest services to build correctly.
