Design a search engine
In this series (18 parts)
- Design a URL shortener
- Design a key-value store
- Design a rate limiter
- Design a web crawler
- Design a notification system
- Design a news feed
- Design a chat application
- Design a video streaming platform
- Design a music streaming service
- Design a ride-sharing service
- Design a food delivery platform
- Design a hotel booking platform
- Design a search engine
- Design a distributed message queue
- Design a code deployment system
- Design a payments platform
- Design an ad click aggregation system
- Design a distributed cache
The core problem
A search engine accepts a short text query and returns the most relevant pages from a corpus of billions of documents in under 200 ms. That single sentence hides three hard problems: acquiring the documents (web crawler), organizing them for fast lookup (inverted index), and deciding which results matter most (ranking). If you have not reviewed the search architecture overview yet, start there for the vocabulary we build on here.
1. Requirements
Functional
- Users submit a text query (typically 1 to 10 tokens) and receive a ranked list of results.
- Each result includes a title, URL, and short snippet with query terms highlighted.
- Support Boolean operators (AND, OR, NOT) and phrase matching.
- Autocomplete suggestions appear after the second keystroke.
- Index freshness: new or updated pages appear in results within 4 hours.
Non-functional
- Scale: 1 billion unique pages indexed. 100K queries per second at peak.
- Latency: p50 under 100 ms, p99 under 300 ms for query serving.
- Availability: 99.99% uptime. A partial index is acceptable during failures; zero results is not.
- DAU: 500 million daily active users.
- Storage: index must fit across a distributed cluster, not a single machine.
2. Capacity estimation
Corpus and storage
| Metric | Value |
|---|---|
| Indexed pages | 1 billion |
| Average compressed page size | 50 KB |
| Raw document store | 1 B x 50 KB = 50 TB |
| Inverted index (approx 30% of raw) | ~15 TB |
| Replicas (3x for availability) | ~45 TB index storage |
Query traffic
| Metric | Value |
|---|---|
| DAU | 500 M |
| Queries per user per day | 5 |
| Daily queries | 2.5 B |
| Average QPS | ~29 K |
| Peak QPS (3.5x average) | ~100 K |
Bandwidth
Each query response averages 10 KB (10 results with snippets). At 100K QPS peak that is 1 GB/s outbound, comfortable for a fleet of frontend servers behind a load balancer.
3. High-level architecture
```mermaid
graph LR
    U[User] -->|query| FE[Frontend API]
    FE --> QP[Query Processor]
    QP --> IS1[Index Shard 1]
    QP --> IS2[Index Shard 2]
    QP --> ISN[Index Shard N]
    IS1 --> AG[Aggregator]
    IS2 --> AG
    ISN --> AG
    AG --> RK[Ranker]
    RK --> CA[Cache Layer]
    CA --> FE
    CR[Crawler] --> PP[Parser / Extractor]
    PP --> IDX[Indexer]
    IDX --> IS1
    IDX --> IS2
    IDX --> ISN
```
Figure 1. High-level architecture showing the query serving path (left to right) and the indexing pipeline (bottom).
Two distinct data paths exist. The query path flows from user to frontend, fans out across index shards, aggregates, ranks, and returns. The indexing path runs asynchronously: the crawler fetches pages, the parser extracts text and metadata, and the indexer builds or updates the inverted index shards.
4. Deep dive: Indexing pipeline
4.1 Crawling
The web crawler operates as a distributed fleet of workers pulling URLs from a priority queue. Politeness rules throttle requests per domain. A URL frontier deduplicates and prioritizes URLs by freshness signals and page importance.
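The frontier's core mechanics (dedup, priority ordering, per-domain politeness) can be sketched in a few lines. This is a single-process toy, not the distributed fleet described above; the one-second politeness window and the URL-splitting domain extraction are illustrative assumptions:

```python
import heapq
import time

class URLFrontier:
    """Priority queue of URLs with per-domain politeness delays.
    Lower priority values pop first (min-heap convention)."""

    def __init__(self, politeness_delay=1.0):
        self.heap = []        # (priority, domain, url)
        self.seen = set()     # dedup on exact URL string
        self.next_fetch = {}  # domain -> earliest allowed fetch time
        self.delay = politeness_delay

    def add(self, url, priority=0.0):
        if url in self.seen:
            return False
        self.seen.add(url)
        domain = url.split("/")[2]  # crude domain extraction, for the sketch
        heapq.heappush(self.heap, (priority, domain, url))
        return True

    def pop(self, now=None):
        """Return the best URL whose domain is currently polite to fetch."""
        now = time.monotonic() if now is None else now
        deferred, url = [], None
        while self.heap:
            prio, domain, candidate = heapq.heappop(self.heap)
            if self.next_fetch.get(domain, 0.0) <= now:
                self.next_fetch[domain] = now + self.delay
                url = candidate
                break
            deferred.append((prio, domain, candidate))  # domain throttled
        for item in deferred:
            heapq.heappush(self.heap, item)
        return url
```

A real frontier also persists its state and partitions domains across workers; this sketch only shows the ordering and throttling logic.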
4.2 Parsing and extraction
Raw HTML goes through content extraction: strip tags, decode entities, detect language. The output is a clean document record containing title, body text, outbound links, and metadata (last modified, content type).
4.3 Inverted index construction
The inverted index maps every unique term to a posting list. Each posting contains a document ID, term frequency, and position offsets for phrase matching.
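A minimal in-memory sketch of such a positional index follows; the whitespace tokenization is a simplifying assumption, since real pipelines use proper tokenizers:

```python
from collections import defaultdict

def build_index(docs):
    """Build a positional inverted index: term -> list of postings.
    Each posting is (doc_id, term_frequency, positions)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, token in enumerate(text.lower().split()):
            positions[token].append(pos)
        for term, plist in positions.items():
            index[term].append((doc_id, len(plist), plist))
    return index
```

The position offsets are what make phrase queries possible: two terms form a phrase match in a document when some position of the second term is exactly one greater than a position of the first.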
For a corpus of 1 billion documents with an average of 500 unique terms per page, the term dictionary holds roughly 10 million unique terms. Each posting list can span thousands to millions of entries for common words.
Consider the term “database”. Its posting list might contain 50 million document IDs. Storing these as raw 64-bit integers would consume 400 MB for a single term. In practice we use delta encoding and variable-byte compression to shrink posting lists by 4 to 8x.
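A sketch of that compression scheme, delta-encoding the sorted doc IDs and packing each gap into 7-bit groups; the terminator-bit convention used here is one common variant:

```python
def vb_encode(doc_ids):
    """Delta-encode sorted doc IDs, then variable-byte encode each gap.
    Each gap is split into 7-bit groups; the final byte of a gap has
    its high bit set as a terminator."""
    out, prev = bytearray(), 0
    for n in doc_ids:
        gap, prev = n - prev, n
        chunk = []
        while True:
            chunk.insert(0, gap & 0x7F)  # most-significant groups first
            gap >>= 7
            if gap == 0:
                break
        chunk[-1] |= 0x80  # mark the last byte of this gap
        out.extend(chunk)
    return bytes(out)

def vb_decode(data):
    """Invert vb_encode: rebuild gaps, then prefix-sum back to doc IDs."""
    ids, n, prev = [], 0, 0
    for byte in data:
        if byte & 0x80:           # terminator: gap is complete
            n = (n << 7) | (byte & 0x7F)
            prev += n
            ids.append(prev)
            n = 0
        else:
            n = (n << 7) | byte
    return ids
```

Because gaps between consecutive doc IDs are small for common terms, most gaps fit in one or two bytes instead of eight, which is where the 4 to 8x shrinkage comes from.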
The lookup flow for a two-term query like “distributed database” works as follows:
1. Look up “distributed” in the term dictionary (hash map or trie). Retrieve its posting list.
2. Look up “database”. Retrieve its posting list.
3. Intersect the two lists using a merge join on sorted document IDs.
4. For each matching document, compute BM25 using term frequencies stored in the postings.
5. Return the top K documents by score.
Step 3 dominates cost. Skip pointers embedded in the posting list let us jump ahead when IDs diverge, reducing intersection time from O(n) to O(sqrt(n)) in practice.
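A sketch of the intersection with skip pointers placed every sqrt(n) entries; the placement interval is the textbook heuristic, and production indexes precompute the pointers inside the posting format rather than deriving them on the fly:

```python
import math

def intersect_with_skips(p1, p2):
    """Merge-join two sorted posting lists of doc IDs, taking a skip of
    sqrt(len) entries whenever the skipped-to value cannot overshoot."""
    skip1 = max(1, math.isqrt(len(p1)))
    skip2 = max(1, math.isqrt(len(p2)))
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # safe to skip: the target at the skip pointer is still <= p2[j]
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return result
```

The skip is only taken when the value at the pointer is still no greater than the other list's current ID, so no match can be jumped over.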
Index sharding: We partition by document ID range. Each shard holds the full inverted index for its slice of documents. This keeps posting lists local and avoids cross-shard reads during index builds.
```mermaid
graph TD
    CW[Crawler Workers] --> MQ[Message Queue]
    MQ --> PE[Parser / Extractor]
    PE --> TK[Tokenizer]
    TK --> IB[Index Builder]
    IB --> SS1[Shard Store 1]
    IB --> SS2[Shard Store 2]
    IB --> SSN[Shard Store N]
    PE --> DS[Document Store]
    DS --> SG[Snippet Generator]
    IB --> ML[Merge Layer]
    ML --> SS1
    ML --> SS2
    ML --> SSN
```
Figure 2. Indexing pipeline from crawl to shard storage. The merge layer applies incremental updates without rebuilding entire shards.
4.4 Incremental updates
Rebuilding the entire index for every crawl cycle is impractical at this scale. Instead we use a two-tier approach:
- Real-time tier: New or updated documents go into a small in-memory index that gets merged into queries alongside the main index.
- Batch tier: Periodically (every few hours) the real-time segments compact into the main on-disk index through a merge sort on posting lists.
This meets the 4-hour freshness target without the cost of a full rebuild.
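The compaction step reduces to a per-term merge in which real-time postings win over stale main-index postings. A toy in-memory version of that merge:

```python
def merge_segments(main, realtime):
    """Compact a real-time segment into the main segment for one term.
    Both inputs are lists of (doc_id, term_frequency) sorted by doc_id;
    real-time entries overwrite stale main entries for the same doc."""
    merged = dict(main)
    merged.update(realtime)  # newer postings win
    return sorted(merged.items())
```

The on-disk version streams both sorted lists instead of materializing a dict, but the overwrite-on-conflict rule is the same.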
5. Deep dive: Query serving
5.1 Query processing
The query processor tokenizes, normalizes (lowercasing, stemming), and expands the query (synonym injection). It also rewrites Boolean expressions into a canonical form for the index lookup.
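A toy version of that pipeline; the stopword list, synonym table, and suffix-stripping rules below are stand-ins, since real systems use a proper stemmer such as Porter's:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "in"}   # illustrative subset
SYNONYMS = {"db": "database"}                # illustrative expansion table

def process_query(raw):
    """Tokenize, lowercase, expand synonyms, and apply crude
    suffix-stripping as a stand-in for real stemming."""
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    out = []
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        tok = SYNONYMS.get(tok, tok)
        for suffix in ("ing", "es", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        out.append(tok)
    return out
```

Stemming maps inflected forms to a shared stem (note that “databases” becomes “databas”, a stem, not a word), so the query matches documents regardless of inflection.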
5.2 Fan-out and scatter-gather
The processed query fans out to all N index shards in parallel. Each shard intersects the posting lists for the query terms, computes a preliminary score (BM25), and returns the top K candidates. The aggregator merges these N lists and passes the top results to the ranker.
With 1000 shards and a 10 ms per-shard lookup, the scatter-gather step takes about 15 ms including network overhead, thanks to parallelism.
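The fan-out itself is straightforward to sketch with a thread pool, merging per-shard top-K lists with a heap. The shard lookup here is a stand-in that scores by summed term frequency rather than full BM25:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard_index, terms, k):
    """Toy per-shard lookup: intersect the query terms' postings and
    score each doc by summed term frequency. shard_index maps
    term -> [(doc_id, tf), ...]."""
    docs = None
    for term in terms:
        postings = dict(shard_index.get(term, []))
        docs = postings if docs is None else {
            d: docs[d] + tf for d, tf in postings.items() if d in docs
        }
    return heapq.nlargest(k, ((score, d) for d, score in (docs or {}).items()))

def scatter_gather(shards, terms, k=10):
    """Fan the query out to every shard in parallel, then merge the
    per-shard top-K lists into a global top-K."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: search_shard(s, terms, k), shards)
        return heapq.nlargest(k, (hit for part in partials for hit in part))
```

Because every shard returns at most K candidates, the aggregator merges N small lists instead of N full result sets, which is what keeps the gather step cheap.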
5.3 Ranking
Ranking happens in two stages:
- L1 (retrieval): BM25 on the inverted index. Fast, runs on every shard. Returns top 100 candidates per shard.
- L2 (re-ranking): A lightweight ML model scores candidates using features like PageRank, click-through rate, freshness, and domain authority. Runs on the aggregator over the merged candidate set (a few thousand documents).
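The L1 stage's BM25 score for a single term can be written directly from the standard formula; k1 = 1.2 and b = 0.75 are the conventional defaults:

```python
import math

def bm25(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 term score: an IDF weight times a saturated,
    length-normalized term-frequency component."""
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm
```

A document's full score is the sum over query terms. The saturation in the TF component is why stuffing a page with repeats of a term yields diminishing returns, and the IDF factor is why rare terms dominate the score.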
PageRank assigns each page a score based on the link graph. Pages linked by many high-authority pages score higher. Computing PageRank over 1 billion nodes requires iterative MapReduce jobs that run offline and publish scores to the ranking service.
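A toy power-iteration version shows the core computation those offline jobs perform at scale; 0.85 is the conventional damping factor, and dangling-node mass is redistributed uniformly here:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {node: [outlinks]}.
    Scores sum to 1 on every iteration."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # mass held by pages with no outlinks, spread uniformly
        dangling = sum(rank[u] for u in nodes if not links[u])
        nxt = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            if links[u]:
                share = damping * rank[u] / len(links[u])
                for v in links[u]:
                    nxt[v] += share
        rank = nxt
    return rank
```

At 1 billion nodes the same update runs as a distributed join between the rank vector and the link graph, iterated until the scores converge.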
```mermaid
graph TD
    LG[Link Graph Store] --> PR[PageRank Compute]
    PR --> FS[Feature Store]
    CTR[Click-Through Logs] --> FP[Feature Pipeline]
    FP --> FS
    FM[Freshness Metadata] --> FS
    FS --> L2[L2 Re-Ranker]
    BM[BM25 Candidates] --> L2
    L2 --> TOP[Final Top-10 Results]
```
Figure 3. Ranking feature pipeline. Offline signals like PageRank and click-through rate feed into a feature store consumed by the L2 re-ranker at query time.
The L2 model itself is typically a gradient-boosted decision tree or a small neural network. Training happens offline on click logs. The model takes 50 to 200 features per candidate and produces a relevance score in under 1 ms per document. With 2000 candidates from the merge step, L2 adds roughly 10 to 15 ms to total query latency.
```mermaid
graph LR
    QP[Query Processor] --> S1[Shard 1: BM25 Top-K]
    QP --> S2[Shard 2: BM25 Top-K]
    QP --> SN[Shard N: BM25 Top-K]
    S1 --> MG[Merge / Aggregate]
    S2 --> MG
    SN --> MG
    MG --> L2[L2 Re-Ranker]
    L2 --> SN2[Snippet Generation]
    SN2 --> CACHE[Result Cache]
    CACHE --> RESP[Response to User]
```
Figure 4. Query serving path with two-stage ranking. L1 scoring happens on each shard; L2 re-ranking runs centrally on the merged candidates.
5.4 Caching
Popular queries repeat frequently. A caching layer in front of the aggregator stores serialized result pages keyed by normalized query string. At 100K QPS with a 30% cache hit rate, the index shards only need to handle 70K QPS, a meaningful reduction.
Cache invalidation ties into the indexing pipeline: when a document in a cached result set gets re-indexed, we evict the affected cache entries.
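That eviction scheme needs a reverse map from document ID to the cached queries whose result pages contain it. A sketch combining it with an LRU cache; the capacity is an arbitrary placeholder:

```python
from collections import OrderedDict, defaultdict

class ResultCache:
    """LRU cache keyed by normalized query string, with a reverse map
    from doc_id to cache keys so re-indexing a document can evict
    every affected entry."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.entries = OrderedDict()          # query -> (doc_ids, page)
        self.doc_to_queries = defaultdict(set)

    def put(self, query, doc_ids, page):
        self.entries[query] = (doc_ids, page)
        self.entries.move_to_end(query)
        for d in doc_ids:
            self.doc_to_queries[d].add(query)
        while len(self.entries) > self.capacity:
            old_q, (old_docs, _) = self.entries.popitem(last=False)  # LRU out
            for d in old_docs:
                self.doc_to_queries[d].discard(old_q)

    def get(self, query):
        if query not in self.entries:
            return None
        self.entries.move_to_end(query)       # refresh recency
        return self.entries[query][1]

    def invalidate_doc(self, doc_id):
        """Evict every cached result page that contains this document."""
        for q in self.doc_to_queries.pop(doc_id, set()):
            self.entries.pop(q, None)
```

The indexing pipeline calls `invalidate_doc` whenever it re-indexes a page, which is how cache contents stay consistent with the 4-hour freshness target.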
6. Latency budget
Adding up the numbers from the deep dives: scatter-gather across the index shards takes about 15 ms, L2 re-ranking adds 10 to 15 ms, and on-the-fly snippet generation costs another 5 to 10 ms. With query processing, cache lookup, and network overhead on top, the happy path sums to roughly 48 ms. That leaves headroom for the p99 target of 300 ms even when individual shards are slow or retries are needed.
7. Trade-offs and alternatives
Document-partitioned vs. term-partitioned index
We chose document partitioning: each shard holds a subset of documents with their full posting lists. The alternative is term partitioning, where each shard owns a subset of terms across all documents.
| Approach | Pros | Cons |
|---|---|---|
| Document-partitioned | Simple shard management, easy rebalancing | Every query fans out to all shards |
| Term-partitioned | Queries touch fewer shards | Harder to rebalance, correlated failures on hot terms |
Document partitioning wins for general web search because queries are cheap to fan out and shard independence simplifies operations.
Exact vs. approximate retrieval
For very large corpora, approximate nearest neighbor (ANN) techniques can speed up retrieval at the cost of recall. Traditional inverted indexes give exact Boolean matching. Modern systems often combine both: inverted index for keyword matching plus a vector index for semantic similarity.
Pre-computed vs. on-the-fly snippets
Pre-computing snippets saves query-time CPU but increases storage and complicates freshness. Most search engines generate snippets on the fly from a compressed document store, accepting the 5 to 10 ms cost.
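On-the-fly generation reduces to finding the token window densest in query terms and highlighting the hits. A toy sketch, where the window size and the `<b>` markup are illustrative choices:

```python
def make_snippet(text, terms, window=8):
    """Pick the token window with the most query-term hits and wrap
    each hit in <b> tags. `terms` is a set of lowercase query tokens."""
    tokens = text.split()
    lowered = [t.lower().strip(".,") for t in tokens]
    hits = [i for i, t in enumerate(lowered) if t in terms]
    if not hits:
        return " ".join(tokens[:window])      # fall back to the opening
    # choose the hit whose following window covers the most other hits
    best = max(hits, key=lambda i: sum(1 for j in hits if i <= j < i + window))
    out = []
    for i in range(best, min(best + window, len(tokens))):
        out.append(f"<b>{tokens[i]}</b>" if lowered[i] in terms else tokens[i])
    return " ".join(out)
```

In production this runs against the compressed document store per result, which is the 5 to 10 ms cost cited above.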
8. What real systems actually do
Google uses a document-partitioned index spread across multiple data centers. The index is split into tiers: a smaller “hot” tier for frequently accessed documents and a larger “cold” tier for the long tail. Caffeine (their indexing system) processes incremental updates continuously rather than doing periodic full rebuilds. Google also runs a dedicated “supplemental” index for rarely accessed pages, keeping the primary serving index lean.
Bing similarly uses tiered indexing with a focus on reducing tail latency through speculative execution: if a shard is slow, the query re-dispatches to a replica before the timeout. Bing invests heavily in hardware acceleration, using FPGAs for ranking model inference to keep p99 latency under control.
Elasticsearch, used by many smaller search applications, stores inverted indexes as immutable Lucene segments that get periodically merged. It uses document partitioning and supports both keyword and vector search. Segment merging is the Achilles heel: large merges can spike latency, so production deployments tune merge policies carefully.
Yandex takes a different approach to freshness. Their “Mercury” system maintains a small real-time index in RAM that handles the most recent documents, merged into the base index every few minutes rather than hours.
All four systems share common patterns: aggressive caching, tiered storage, two-stage ranking, and document-based sharding. The differences lie in how they handle tail latency and freshness trade-offs.
9. What comes next
This design handles the core search loop: crawl, index, serve. Several extensions are worth exploring:
- Personalization: Adjust ranking based on user history and location. Requires a user profile service and privacy-aware feature pipeline.
- Vertical search: Images, videos, news each need specialized parsers, indexes, and ranking signals.
- Semantic search: Dense vector embeddings enable “meaning” based retrieval alongside keyword matching. This adds a vector index layer and an embedding inference service.
- Spell correction and query understanding: NLP models that rewrite queries before they hit the index improve recall significantly.
- Federation: Merging results from multiple specialized indexes (web, images, knowledge graph) into a single result page.
- Geo-aware serving: Route queries to the nearest data center and bias results by geographic relevance. Reduces latency and improves result quality for local intent queries.
The inverted index and scatter-gather pattern remain at the center of every modern search system. If you want to go deeper on the crawling side, revisit the web crawler case study. For the infrastructure that ties everything together, the search architecture overview covers the full picture.