Design a search engine
In this series (18 parts)
- Design a URL shortener
- Design a key-value store
- Design a rate limiter
- Design a web crawler
- Design a notification system
- Design a news feed
- Design a chat application
- Design a video streaming platform
- Design a music streaming service
- Design a ride-sharing service
- Design a food delivery platform
- Design a hotel booking platform
- Design a search engine
- Design a distributed message queue
- Design a code deployment system
- Design a payments platform
- Design an ad click aggregation system
- Design a distributed cache
The core problem
A search engine accepts a short text query and returns the most relevant pages from a corpus of billions of documents in under 200 ms. That single sentence hides three hard problems: acquiring the documents (web crawler), organizing them for fast lookup (inverted index), and deciding which results matter most (ranking). If you have not reviewed the search architecture overview yet, start there for the vocabulary we build on here.
1. Requirements
Functional
- Users submit a text query (typically 1 to 10 tokens) and receive a ranked list of results.
- Each result includes a title, URL, and short snippet with query terms highlighted.
- Support Boolean operators (AND, OR, NOT) and phrase matching.
- Autocomplete suggestions appear after the second keystroke.
- Index freshness: new or updated pages appear in results within 4 hours.
Non-functional
- Scale: 1 billion unique pages indexed. 100K queries per second at peak.
- Latency: p50 under 100 ms, p99 under 300 ms for query serving.
- Availability: 99.99% uptime. A partial index is acceptable during failures; zero results is not.
- DAU: 500 million daily active users.
- Storage: index must fit across a distributed cluster, not a single machine.
2. Capacity estimation
Corpus and storage
| Metric | Value |
|---|---|
| Indexed pages | 1 billion |
| Average compressed page size | 50 KB |
| Raw document store | 1 B x 50 KB = 50 TB |
| Inverted index (approx 30% of raw) | ~15 TB |
| Replicas (3x for availability) | ~45 TB index storage |
Query traffic
| Metric | Value |
|---|---|
| DAU | 500 M |
| Queries per user per day | 5 |
| Daily queries | 2.5 B |
| Average QPS | ~29 K |
| Peak QPS (3.5x average) | ~100 K |
Bandwidth
Each query response averages 10 KB (10 results with snippets). At 100K QPS peak that is 1 GB/s outbound, comfortable for a fleet of frontend servers behind a load balancer.
3. High-level architecture
```mermaid
graph LR
    U[User] -->|query| FE[Frontend API]
    FE --> QP[Query Processor]
    QP --> IS1[Index Shard 1]
    QP --> IS2[Index Shard 2]
    QP --> ISN[Index Shard N]
    IS1 --> AG[Aggregator]
    IS2 --> AG
    ISN --> AG
    AG --> RK[Ranker]
    RK --> CA[Cache Layer]
    CA --> FE
    CR[Crawler] --> PP[Parser / Extractor]
    PP --> IDX[Indexer]
    IDX --> IS1
    IDX --> IS2
    IDX --> ISN
```
Figure 1. High-level architecture showing the query serving path (left to right) and the indexing pipeline (bottom).
Two distinct data paths exist. The query path flows from user to frontend, fans out across index shards, aggregates, ranks, and returns. The indexing path runs asynchronously: the crawler fetches pages, the parser extracts text and metadata, and the indexer builds or updates the inverted index shards.
4. Deep dive: Indexing pipeline
4.1 Crawling
The web crawler operates as a distributed fleet of workers pulling URLs from a priority queue. Politeness rules throttle requests per domain. A URL frontier deduplicates and prioritizes URLs by freshness signals and page importance.
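The frontier's core mechanics (dedup, priority ordering, per-domain politeness) can be sketched in a few lines. This is a single-process toy, not the distributed fleet described above; the one-second politeness window and the URL-splitting domain extraction are illustrative assumptions:

```python
import heapq
import time

class URLFrontier:
    """Priority queue of URLs with per-domain politeness delays.
    Lower priority values pop first (min-heap convention)."""

    def __init__(self, politeness_delay=1.0):
        self.heap = []        # (priority, domain, url)
        self.seen = set()     # dedup on exact URL string
        self.next_fetch = {}  # domain -> earliest allowed fetch time
        self.delay = politeness_delay

    def add(self, url, priority=0.0):
        if url in self.seen:
            return False
        self.seen.add(url)
        domain = url.split("/")[2]  # crude domain extraction, for the sketch
        heapq.heappush(self.heap, (priority, domain, url))
        return True

    def pop(self, now=None):
        """Return the best URL whose domain is currently polite to fetch."""
        now = time.monotonic() if now is None else now
        deferred, url = [], None
        while self.heap:
            prio, domain, candidate = heapq.heappop(self.heap)
            if self.next_fetch.get(domain, 0.0) <= now:
                self.next_fetch[domain] = now + self.delay
                url = candidate
                break
            deferred.append((prio, domain, candidate))  # domain throttled
        for item in deferred:
            heapq.heappush(self.heap, item)
        return url
```

A real frontier also persists its state and partitions domains across workers; this sketch only shows the ordering and throttling logic.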
4.2 Parsing and extraction
Raw HTML goes through content extraction: strip tags, decode entities, detect language. The output is a clean document record containing title, body text, outbound links, and metadata (last modified, content type).
4.3 Inverted index construction
The inverted index maps every unique term to a posting list. Each posting contains a document ID, term frequency, and position offsets for phrase matching.
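A minimal in-memory sketch of such a positional index follows; the whitespace tokenization is a simplifying assumption, since real pipelines use proper tokenizers:

```python
from collections import defaultdict

def build_index(docs):
    """Build a positional inverted index: term -> list of postings.
    Each posting is (doc_id, term_frequency, positions)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, token in enumerate(text.lower().split()):
            positions[token].append(pos)
        for term, plist in positions.items():
            index[term].append((doc_id, len(plist), plist))
    return index
```

The position offsets are what make phrase queries possible: two terms form a phrase match in a document when some position of the second term is exactly one greater than a position of the first.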
For a corpus of 1 billion documents with an average of 500 unique terms per page, the term dictionary holds roughly 10 million unique terms. Each posting list can span thousands to millions of entries for common words.
Consider the term “database”. Its posting list might contain 50 million document IDs. Storing these as raw 64-bit integers would consume 400 MB for a single term. In practice we use delta encoding and variable-byte compression to shrink posting lists by 4 to 8x.
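A sketch of that compression scheme, delta-encoding the sorted doc IDs and packing each gap into 7-bit groups; the terminator-bit convention used here is one common variant:

```python
def vb_encode(doc_ids):
    """Delta-encode sorted doc IDs, then variable-byte encode each gap.
    Each gap is split into 7-bit groups; the final byte of a gap has
    its high bit set as a terminator."""
    out, prev = bytearray(), 0
    for n in doc_ids:
        gap, prev = n - prev, n
        chunk = []
        while True:
            chunk.insert(0, gap & 0x7F)  # most-significant groups first
            gap >>= 7
            if gap == 0:
                break
        chunk[-1] |= 0x80  # mark the last byte of this gap
        out.extend(chunk)
    return bytes(out)

def vb_decode(data):
    """Invert vb_encode: rebuild gaps, then prefix-sum back to doc IDs."""
    ids, n, prev = [], 0, 0
    for byte in data:
        if byte & 0x80:           # terminator: gap is complete
            n = (n << 7) | (byte & 0x7F)
            prev += n
            ids.append(prev)
            n = 0
        else:
            n = (n << 7) | byte
    return ids
```

Because gaps between consecutive doc IDs are small for common terms, most gaps fit in one or two bytes instead of eight, which is where the 4 to 8x shrinkage comes from.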
The lookup flow for a two-term query like “distributed database” works as follows:
1. Look up “distributed” in the term dictionary (hash map or trie). Retrieve its posting list.
2. Look up “database”. Retrieve its posting list.
3. Intersect the two lists using a merge join on sorted document IDs.
4. For each matching document, compute BM25 using term frequencies stored in the postings.
5. Return the top K documents by score.
Step 3 dominates cost. Skip pointers embedded in the posting list let us jump ahead when IDs diverge, reducing intersection time from O(n) to O(sqrt(n)) in practice.
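A sketch of the intersection with skip pointers placed every sqrt(n) entries; the placement interval is the textbook heuristic, and production indexes precompute the pointers inside the posting format rather than deriving them on the fly:

```python
import math

def intersect_with_skips(p1, p2):
    """Merge-join two sorted posting lists of doc IDs, taking a skip of
    sqrt(len) entries whenever the skipped-to value cannot overshoot."""
    skip1 = max(1, math.isqrt(len(p1)))
    skip2 = max(1, math.isqrt(len(p2)))
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # safe to skip: the target at the skip pointer is still <= p2[j]
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return result
```

The skip is only taken when the value at the pointer is still no greater than the other list's current ID, so no match can be jumped over.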
Index sharding: We partition by document ID range. Each shard holds the full inverted index for its slice of documents. This keeps posting lists local and avoids cross-shard reads during index builds.
```mermaid
graph TD
    CW[Crawler Workers] --> MQ[Message Queue]
    MQ --> PE[Parser / Extractor]
    PE --> TK[Tokenizer]
    TK --> IB[Index Builder]
    IB --> SS1[Shard Store 1]
    IB --> SS2[Shard Store 2]
    IB --> SSN[Shard Store N]
    PE --> DS[Document Store]
    DS --> SG[Snippet Generator]
    IB --> ML[Merge Layer]
    ML --> SS1
    ML --> SS2
    ML --> SSN
```
Figure 2. Indexing pipeline from crawl to shard storage. The merge layer applies incremental updates without rebuilding entire shards.
4.4 Incremental updates
Rebuilding the entire index for every crawl cycle is impractical at this scale. Instead we use a two-tier approach:
- Real-time tier: New or updated documents go into a small in-memory index that gets merged into queries alongside the main index.
- Batch tier: Periodically (every few hours) the real-time segments compact into the main on-disk index through a merge sort on posting lists.
This meets the 4-hour freshness target without the cost of a full rebuild.
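The compaction step reduces to a per-term merge in which real-time postings win over stale main-index postings. A toy in-memory version of that merge:

```python
def merge_segments(main, realtime):
    """Compact a real-time segment into the main segment for one term.
    Both inputs are lists of (doc_id, term_frequency) sorted by doc_id;
    real-time entries overwrite stale main entries for the same doc."""
    merged = dict(main)
    merged.update(realtime)  # newer postings win
    return sorted(merged.items())
```

The on-disk version streams both sorted lists instead of materializing a dict, but the overwrite-on-conflict rule is the same.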
5. Deep dive: Query serving
5.1 Query processing
The query processor tokenizes, normalizes (lowercasing, stemming), and expands the query (synonym injection). It also rewrites Boolean expressions into a canonical form for the index lookup.
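A toy version of that pipeline; the stopword list, synonym table, and suffix-stripping rules below are stand-ins, since real systems use a proper stemmer such as Porter's:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "in"}   # illustrative subset
SYNONYMS = {"db": "database"}                # illustrative expansion table

def process_query(raw):
    """Tokenize, lowercase, expand synonyms, and apply crude
    suffix-stripping as a stand-in for real stemming."""
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    out = []
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        tok = SYNONYMS.get(tok, tok)
        for suffix in ("ing", "es", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        out.append(tok)
    return out
```

Stemming maps inflected forms to a shared stem (note that “databases” becomes “databas”, a stem, not a word), so the query matches documents regardless of inflection.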
5.2 Fan-out and scatter-gather
The processed query fans out to all N index shards in parallel. Each shard intersects the posting lists for the query terms, computes a preliminary score (BM25), and returns the top K candidates. The aggregator merges these N lists and passes the top results to the ranker.
With 1000 shards and a 10 ms per-shard lookup, the scatter-gather step takes about 15 ms including network overhead, thanks to parallelism.
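The fan-out itself is straightforward to sketch with a thread pool, merging per-shard top-K lists with a heap. The shard lookup here is a stand-in that scores by summed term frequency rather than full BM25:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard_index, terms, k):
    """Toy per-shard lookup: intersect the query terms' postings and
    score each doc by summed term frequency. shard_index maps
    term -> [(doc_id, tf), ...]."""
    docs = None
    for term in terms:
        postings = dict(shard_index.get(term, []))
        docs = postings if docs is None else {
            d: docs[d] + tf for d, tf in postings.items() if d in docs
        }
    return heapq.nlargest(k, ((score, d) for d, score in (docs or {}).items()))

def scatter_gather(shards, terms, k=10):
    """Fan the query out to every shard in parallel, then merge the
    per-shard top-K lists into a global top-K."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: search_shard(s, terms, k), shards)
        return heapq.nlargest(k, (hit for part in partials for hit in part))
```

Because every shard returns at most K candidates, the aggregator merges N small lists instead of N full result sets, which is what keeps the gather step cheap.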
5.3 Ranking
Ranking happens in two stages:
- L1 (retrieval): BM25 on the inverted index. Fast, runs on every shard. Returns top 100 candidates per shard.
- L2 (re-ranking): A lightweight ML model scores candidates using features like PageRank, click-through rate, freshness, and domain authority. Runs on the aggregator over the merged candidate set (a few thousand documents).
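The L1 stage's BM25 score for a single term can be written directly from the standard formula; k1 = 1.2 and b = 0.75 are the conventional defaults:

```python
import math

def bm25(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 term score: an IDF weight times a saturated,
    length-normalized term-frequency component."""
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm
```

A document's full score is the sum over query terms. The saturation in the TF component is why stuffing a page with repeats of a term yields diminishing returns, and the IDF factor is why rare terms dominate the score.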
PageRank assigns each page a score based on the link graph. Pages linked by many high-authority pages score higher. Computing PageRank over 1 billion nodes requires iterative MapReduce jobs that run offline and publish scores to the ranking service.
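A toy power-iteration version shows the core computation those offline jobs perform at scale; 0.85 is the conventional damping factor, and dangling-node mass is redistributed uniformly here:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {node: [outlinks]}.
    Scores sum to 1 on every iteration."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # mass held by pages with no outlinks, spread uniformly
        dangling = sum(rank[u] for u in nodes if not links[u])
        nxt = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            if links[u]:
                share = damping * rank[u] / len(links[u])
                for v in links[u]:
                    nxt[v] += share
        rank = nxt
    return rank
```

At 1 billion nodes the same update runs as a distributed join between the rank vector and the link graph, iterated until the scores converge.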
```mermaid
graph TD
    LG[Link Graph Store] --> PR[PageRank Compute]
    PR --> FS[Feature Store]
    CTR[Click-Through Logs] --> FP[Feature Pipeline]
    FP --> FS
    FM[Freshness Metadata] --> FS
    FS --> L2[L2 Re-Ranker]
    BM[BM25 Candidates] --> L2
    L2 --> TOP[Final Top-10 Results]
```
Figure 3. Ranking feature pipeline. Offline signals like PageRank and click-through rate feed into a feature store consumed by the L2 re-ranker at query time.
The L2 model itself is typically a gradient-boosted decision tree or a small neural network. Training happens offline on click logs. The model takes 50 to 200 features per candidate and produces a relevance score in under 1 ms per document. With 2000 candidates from the merge step, L2 adds roughly 10 to 15 ms to total query latency.
```mermaid
graph LR
    QP[Query Processor] --> S1[Shard 1: BM25 Top-K]
    QP --> S2[Shard 2: BM25 Top-K]
    QP --> SN[Shard N: BM25 Top-K]
    S1 --> MG[Merge / Aggregate]
    S2 --> MG
    SN --> MG
    MG --> L2[L2 Re-Ranker]
    L2 --> SN2[Snippet Generation]
    SN2 --> CACHE[Result Cache]
    CACHE --> RESP[Response to User]
```
Figure 4. Query serving path with two-stage ranking. L1 scoring happens on each shard; L2 re-ranking runs centrally on the merged candidates.
5.4 Caching
Popular queries repeat frequently. A caching layer in front of the aggregator stores serialized result pages keyed by normalized query string. At 100K QPS with a 30% cache hit rate, the index shards only need to handle 70K QPS, a meaningful reduction.
Cache invalidation ties into the indexing pipeline: when a document in a cached result set gets re-indexed, we evict the affected cache entries.
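That eviction scheme needs a reverse map from document ID to the cached queries whose result pages contain it. A sketch combining it with an LRU cache; the capacity is an arbitrary placeholder:

```python
from collections import OrderedDict, defaultdict

class ResultCache:
    """LRU cache keyed by normalized query string, with a reverse map
    from doc_id to cache keys so re-indexing a document can evict
    every affected entry."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.entries = OrderedDict()          # query -> (doc_ids, page)
        self.doc_to_queries = defaultdict(set)

    def put(self, query, doc_ids, page):
        self.entries[query] = (doc_ids, page)
        self.entries.move_to_end(query)
        for d in doc_ids:
            self.doc_to_queries[d].add(query)
        while len(self.entries) > self.capacity:
            old_q, (old_docs, _) = self.entries.popitem(last=False)  # LRU out
            for d in old_docs:
                self.doc_to_queries[d].discard(old_q)

    def get(self, query):
        if query not in self.entries:
            return None
        self.entries.move_to_end(query)       # refresh recency
        return self.entries[query][1]

    def invalidate_doc(self, doc_id):
        """Evict every cached result page that contains this document."""
        for q in self.doc_to_queries.pop(doc_id, set()):
            self.entries.pop(q, None)
```

The indexing pipeline calls `invalidate_doc` whenever it re-indexes a page, which is how cache contents stay consistent with the 4-hour freshness target.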
6. Latency budget
Adding up the numbers from the deep dives: scatter-gather across the index shards takes about 15 ms, L2 re-ranking adds 10 to 15 ms, and on-the-fly snippet generation costs another 5 to 10 ms. With query processing, cache lookup, and network overhead on top, the happy path sums to roughly 48 ms. That leaves headroom for the p99 target of 300 ms even when individual shards are slow or retries are needed.
7. Trade-offs and alternatives
Document-partitioned vs. term-partitioned index
We chose document partitioning: each shard holds a subset of documents with their full posting lists. The alternative is term partitioning, where each shard owns a subset of terms across all documents.
| Approach | Pros | Cons |
|---|---|---|
| Document-partitioned | Simple shard management, easy rebalancing | Every query fans out to all shards |
| Term-partitioned | Queries touch fewer shards | Harder to rebalance, correlated failures on hot terms |
Document partitioning wins for general web search because queries are cheap to fan out and shard independence simplifies operations.
Exact vs. approximate retrieval
For very large corpora, approximate nearest neighbor (ANN) techniques can speed up retrieval at the cost of recall. Traditional inverted indexes give exact Boolean matching. Modern systems often combine both: inverted index for keyword matching plus a vector index for semantic similarity.
Pre-computed vs. on-the-fly snippets
Pre-computing snippets saves query-time CPU but increases storage and complicates freshness. Most search engines generate snippets on the fly from a compressed document store, accepting the 5 to 10 ms cost.
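On-the-fly generation reduces to finding the token window densest in query terms and highlighting the hits. A toy sketch, where the window size and the `<b>` markup are illustrative choices:

```python
def make_snippet(text, terms, window=8):
    """Pick the token window with the most query-term hits and wrap
    each hit in <b> tags. `terms` is a set of lowercase query tokens."""
    tokens = text.split()
    lowered = [t.lower().strip(".,") for t in tokens]
    hits = [i for i, t in enumerate(lowered) if t in terms]
    if not hits:
        return " ".join(tokens[:window])      # fall back to the opening
    # choose the hit whose following window covers the most other hits
    best = max(hits, key=lambda i: sum(1 for j in hits if i <= j < i + window))
    out = []
    for i in range(best, min(best + window, len(tokens))):
        out.append(f"<b>{tokens[i]}</b>" if lowered[i] in terms else tokens[i])
    return " ".join(out)
```

In production this runs against the compressed document store per result, which is the 5 to 10 ms cost cited above.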
8. What real systems actually do
Google uses a document-partitioned index spread across multiple data centers. The index is split into tiers: a smaller “hot” tier for frequently accessed documents and a larger “cold” tier for the long tail. Caffeine (their indexing system) processes incremental updates continuously rather than doing periodic full rebuilds. Google also runs a dedicated “supplemental” index for rarely accessed pages, keeping the primary serving index lean.
Bing similarly uses tiered indexing with a focus on reducing tail latency through speculative execution: if a shard is slow, the query re-dispatches to a replica before the timeout. Bing invests heavily in hardware acceleration, using FPGAs for ranking model inference to keep p99 latency under control.
Elasticsearch, used by many smaller search applications, stores inverted indexes as immutable Lucene segments that get periodically merged. It uses document partitioning and supports both keyword and vector search. Segment merging is the Achilles heel: large merges can spike latency, so production deployments tune merge policies carefully.
Yandex takes a different approach to freshness. Their “Mercury” system maintains a small real-time index in RAM that handles the most recent documents, merged into the base index every few minutes rather than hours.
All four systems share common patterns: aggressive caching, tiered storage, two-stage ranking, and document-based sharding. The differences lie in how they handle tail latency and freshness trade-offs.
9. What comes next
This design handles the core search loop: crawl, index, serve. Several extensions are worth exploring:
- Personalization: Adjust ranking based on user history and location. Requires a user profile service and privacy-aware feature pipeline.
- Vertical search: Images, videos, news each need specialized parsers, indexes, and ranking signals.
- Semantic search: Dense vector embeddings enable “meaning” based retrieval alongside keyword matching. This adds a vector index layer and an embedding inference service.
- Spell correction and query understanding: NLP models that rewrite queries before they hit the index improve recall significantly.
- Federation: Merging results from multiple specialized indexes (web, images, knowledge graph) into a single result page.
- Geo-aware serving: Route queries to the nearest data center and bias results by geographic relevance. Reduces latency and improves result quality for local intent queries.
The inverted index and scatter-gather pattern remain at the center of every modern search system. If you want to go deeper on the crawling side, revisit the web crawler case study. For the infrastructure that ties everything together, the search architecture overview covers the full picture.