High Level Design · Part 7

Search architecture

In this series (12 parts)
  1. Monolith vs microservices
  2. Microservice communication patterns
  3. Service discovery and registration
  4. Event-driven architecture
  5. Distributed data patterns
  6. Caching architecture patterns
  7. Search architecture
  8. Storage systems at scale
  9. Notification systems
  10. Real-time systems architecture
  11. Batch and stream processing
  12. Multi-region and global systems

Your PostgreSQL database has a LIKE '%wireless headphones%' query somewhere in production. It works fine with 50,000 products. At 50 million, it takes 12 seconds and pins a CPU core. Adding a trigram index helps, but now you need typo tolerance, synonym matching, faceted filtering, and relevance ranking. Your relational database was designed for transactional consistency, not for scanning every document in a corpus and ranking results by how well they match a vague human query.

Search is a fundamentally different workload. It reads aggressively, tolerates slight staleness, and rewards specialized data structures that trade write speed for query speed. That trade-off is why dedicated search engines exist.

Why your primary database is the wrong tool

Relational databases store data in rows optimized for point lookups and range scans on indexed columns. When you search for “wireless headphones,” the database has to scan text fields or rely on B-tree indexes that only support prefix matching. A query like WHERE name LIKE '%headphones%' cannot use a standard index at all because the wildcard appears at the start.

Full-text search extensions (PostgreSQL’s tsvector, MySQL’s FULLTEXT index) improve things, but they still share resources with your transactional workload. A heavy search query competes with your checkout flow for CPU, memory, and I/O. Scaling search independently becomes impossible when it lives inside your primary database.

Dedicated search engines solve both problems. They use inverted indexes purpose-built for text matching, and they run on separate infrastructure that scales horizontally without touching your transactional path.

The inverted index

An inverted index flips the relationship between documents and terms. Instead of storing “Document 1 contains words [wireless, headphones, bluetooth],” it stores “the word ‘wireless’ appears in documents [1, 47, 332, 8891].”

Think of it like the index at the back of a textbook. You look up a term, and it tells you which pages mention it. A search engine builds this index for every token in every document.

The process starts with analysis. Raw text passes through a pipeline of transformations: lowercasing, removing punctuation, splitting on whitespace (tokenization), removing stop words like “the” and “is,” and applying stemming so that “running,” “runs,” and “ran” all reduce to “run.” The resulting tokens are stored in the inverted index along with metadata like term frequency and position within the document.
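The pipeline can be sketched in a few lines of Python. The stop-word list and suffix-stripping stemmer below are toy stand-ins for what a real analyzer does (production engines use algorithms like Porter stemming and much larger stop-word lists):

```python
import re

# Toy stop-word list; real analyzers ship far larger ones.
STOP_WORDS = {"the", "is", "a", "an", "and", "are"}

def stem(token: str) -> str:
    # Crude suffix stripping, for illustration only.
    for suffix in ("ning", "ing", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def analyze(text: str) -> list[str]:
    # Lowercase, strip punctuation, tokenize on whitespace,
    # drop stop words, then stem each remaining token.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [stem(t) for t in text.split() if t not in STOP_WORDS]

print(analyze("The Wireless Headphones are running"))
# → ['wireless', 'headphone', 'run']
```

The same function runs at both index time and query time, which is what guarantees that queries and documents meet in the same token space.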

When a query arrives, it goes through the same analysis pipeline. The engine looks up each query term in the inverted index, retrieves the matching document sets, and intersects or unions them depending on the query logic. Because the index maps terms to document IDs directly, this lookup is extremely fast even across billions of documents.

graph LR
  subgraph Analysis Pipeline
      D1["Raw Document"] --> T["Tokenizer"]
      T --> L["Lowercase"]
      L --> SW["Stop Word Removal"]
      SW --> ST["Stemmer"]
  end
  ST --> II["Inverted Index"]
  subgraph Inverted Index Structure
      II --> E1["'wireless' → doc 1, 47, 332"]
      II --> E2["'headphone' → doc 1, 47, 890"]
      II --> E3["'bluetooth' → doc 1, 332, 891"]
  end

Analysis pipeline transforming raw text into inverted index entries.
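A minimal inverted index with AND-style intersection makes the lookup concrete. The document IDs mirror the diagram above, and tokens are assumed to be pre-analyzed:

```python
from collections import defaultdict

def build_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    # Map each term to the set of document IDs that contain it.
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index

def search_and(index: dict[str, set[int]], terms: list[str]) -> set[int]:
    # AND semantics: intersect the posting sets of every query term.
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: ["wireless", "headphone", "bluetooth"],
    47: ["wireless", "headphone"],
    332: ["wireless", "bluetooth"],
}
index = build_index(docs)
print(sorted(search_and(index, ["wireless", "headphone"])))  # → [1, 47]
```

Real engines store postings as compressed, sorted lists on disk rather than Python sets, but the term-to-documents mapping is the same idea.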

Elasticsearch and OpenSearch architecture

Elasticsearch and its open-source fork OpenSearch are the most widely deployed search engines in production systems. Both are built on top of Apache Lucene, which provides the low-level inverted index and query execution. Elasticsearch adds distributed coordination, a REST API, and cluster management on top.

A cluster consists of multiple nodes, each a Java process typically running on its own machine. The master-eligible nodes elect a single master (OpenSearch calls it the cluster manager) that owns cluster state: which indexes exist, how they are sharded, and where each shard lives. Other nodes can act as coordinators that route client requests.

An index in Elasticsearch is a logical namespace, roughly analogous to a database table. Each index is split into shards, and each shard is a self-contained Lucene index. Sharding lets you distribute data across nodes so that a single index can grow beyond what one machine can hold.

Each shard has one primary copy and zero or more replica copies. Writes go to the primary, which forwards to replicas. Reads can be served by any copy, which lets you scale read throughput by adding replicas.

graph TD
  Client["Client"] --> Coord["Coordinator Node"]
  Coord --> N1["Node 1"]
  Coord --> N2["Node 2"]
  Coord --> N3["Node 3"]
  N1 --> P1["Shard 1 Primary"]
  N1 --> R2["Shard 2 Replica"]
  N2 --> P2["Shard 2 Primary"]
  N2 --> R3["Shard 3 Replica"]
  N3 --> P3["Shard 3 Primary"]
  N3 --> R1["Shard 1 Replica"]

Elasticsearch cluster with three nodes, three primary shards, and their replicas distributed for fault tolerance.

When a search query arrives, the coordinator node routes it to one copy of every shard (primary or replica). Each shard executes the query locally against its Lucene index, scores the results, and returns the top N hits. The coordinator merges results from all shards, re-ranks them, and returns the final response to the client. This scatter-gather pattern is the backbone of distributed search.
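The coordinator's merge step is a straightforward top-N merge. A sketch, assuming each shard has already returned its local top hits as (score, doc_id) pairs:

```python
import heapq

def merge_shard_hits(
    shard_hits: list[list[tuple[float, str]]], n: int
) -> list[tuple[float, str]]:
    # Each shard returns its local top-n hits; the coordinator flattens
    # all partial result lists and keeps the global top-n by score.
    all_hits = [hit for hits in shard_hits for hit in hits]
    return heapq.nlargest(n, all_hits, key=lambda hit: hit[0])

shard_1 = [(9.1, "doc3"), (4.2, "doc8")]
shard_2 = [(7.5, "doc1"), (6.0, "doc2")]
shard_3 = [(8.0, "doc5")]
print(merge_shard_hits([shard_1, shard_2, shard_3], 3))
# → [(9.1, 'doc3'), (8.0, 'doc5'), (7.5, 'doc1')]
```

Note why each shard must return a full top N rather than N divided by the shard count: the coordinator cannot know in advance how the best hits are distributed across shards.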

The indexing pipeline

Documents rarely go straight from your application into the search engine. A robust indexing pipeline sits between your primary data store and your search cluster, transforming and enriching data before it becomes searchable.

The typical flow looks like this: a change happens in your primary database (a new product is created, a description is updated). A change data capture (CDC) system or application-level event emits that change into a message queue. An indexing worker consumes the event, enriches the document (resolving category names, computing derived fields, attaching image URLs), and sends the final document to Elasticsearch via its bulk indexing API.

graph LR
  DB["Primary Database"] -->|CDC / Events| MQ["Message Queue"]
  MQ --> IW["Indexing Workers"]
  IW -->|Enrich & Transform| IW
  IW -->|Bulk Index API| ES["Elasticsearch Cluster"]
  ES --> U["Search Queries from Users"]

Indexing pipeline from primary database through message queue and enrichment workers into the search cluster.

Bulk indexing matters. Sending documents one at a time creates enormous overhead from HTTP round trips and small Lucene segment flushes. The bulk API lets you batch hundreds or thousands of documents into a single request, which Lucene can merge into larger, more efficient segments.
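The bulk API body is newline-delimited JSON: an action line followed by the document source, repeated per document. A sketch of building one (the index name "products" is hypothetical; in practice you would POST this body to the _bulk endpoint or, better, use a client library's bulk helper):

```python
import json

def bulk_body(index_name: str, docs: list[tuple[str, dict]]) -> str:
    # Build the newline-delimited JSON body the _bulk endpoint expects:
    # one action metadata line, then one source line, per document.
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline

body = bulk_body("products", [("1", {"name": "wireless headphones"})])
```

Batch sizes of a few hundred to a few thousand documents (or a few megabytes per request) are a common starting point; measure and tune for your document sizes.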

This architecture means search results are eventually consistent with the primary database. A product updated in PostgreSQL might take a few seconds to appear in search results. For most use cases, this delay is acceptable. Users do not expect sub-second freshness from a search box.

For cases where near-real-time indexing matters, Elasticsearch supports a refresh interval (default one second) that controls how quickly newly indexed documents become searchable. You can tune this down, but shorter intervals increase indexing overhead.

Relevance ranking

Returning matching documents is easy. Returning them in the right order is the hard part. Relevance ranking determines which of 50,000 matching documents appears first.

The classical model is TF-IDF (term frequency, inverse document frequency). Term frequency measures how often a query term appears in a document: a product description mentioning “wireless” five times is probably more about wireless than one mentioning it once. Inverse document frequency measures how rare the term is across all documents: “the” appears everywhere and carries no signal, while “bluetooth” is more discriminating.

Elasticsearch uses BM25 by default, an evolution of TF-IDF that adds diminishing returns for term frequency (the sixth mention of “wireless” adds less signal than the second) and document length normalization (a 10-word title matching two query terms is more relevant than a 5,000-word description matching the same two terms).
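A minimal Lucene-style BM25 scorer makes both adjustments concrete; k1 and b below are the standard defaults:

```python
import math

def bm25(tf: int, doc_len: int, avg_doc_len: float,
         n_docs: int, doc_freq: int, k1: float = 1.2, b: float = 0.75) -> float:
    # Rarer terms get a larger idf; longer-than-average documents are
    # penalized through the length-normalization factor.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# Diminishing returns: each extra occurrence of a term adds less score.
gain_low = bm25(2, 100, 100.0, 1_000_000, 50) - bm25(1, 100, 100.0, 1_000_000, 50)
gain_high = bm25(6, 100, 100.0, 1_000_000, 50) - bm25(5, 100, 100.0, 1_000_000, 50)
print(gain_high < gain_low)  # → True
```

The same function also shows the idf effect: holding everything else fixed, a term appearing in 5 documents scores higher than one appearing in 5,000.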

Beyond text matching, production search systems layer on additional signals. Popularity (how many times a product was purchased), recency (newer content ranks higher), personalization (your past behavior influences your results), and business rules (promoted listings) all feed into the final score. These signals are often combined using Elasticsearch’s function score query, which lets you multiply or add custom scoring functions on top of BM25.

Analyzers and language handling

The analysis chain is where most search quality problems live. Choosing the wrong analyzer means perfectly good documents never match user queries.

A standard analyzer tokenizes on whitespace and punctuation, lowercases everything, and removes a small set of stop words. This works for English prose but fails for product codes like “XR-7700” (the hyphen splits it into two tokens) or CamelCase identifiers.

Custom analyzers let you chain together character filters (stripping HTML tags, normalizing unicode), tokenizers (splitting on whitespace, on word boundaries, or using n-grams), and token filters (stemming, synonym expansion, phonetic matching). Synonym expansion is particularly powerful: mapping “TV” to “television,” “laptop” to “notebook,” and “phone” to “smartphone” dramatically improves recall.
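Synonym expansion as a token filter can be sketched like this; the synonym map is a hypothetical application-level stand-in, since in Elasticsearch this logic lives in a synonym token filter configured on the analyzer:

```python
# Hypothetical synonym map for illustration; production setups load
# these from a maintained synonyms file in the analyzer configuration.
SYNONYMS = {"tv": ["television"], "laptop": ["notebook"], "phone": ["smartphone"]}

def expand_synonyms(tokens: list[str]) -> list[str]:
    # Emit each token followed by its configured synonyms, so a query
    # for "tv" also matches documents that only say "television".
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(SYNONYMS.get(token, []))
    return expanded

print(expand_synonyms(["cheap", "tv"]))  # → ['cheap', 'tv', 'television']
```

Whether expansion runs at index time (bigger index, cheaper queries) or query time (smaller index, synonyms updatable without reindexing) is a deliberate trade-off.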

For multi-language support, you typically create separate indexes per language, each with its own analyzer. French stemming rules are different from English. Japanese text does not have whitespace between words and requires a morphological analyzer like Kuromoji.

Sharding and scaling

Search clusters grow along two axes: more shards increase write throughput and total data capacity, while more replicas increase read throughput and fault tolerance.

The shard count for an index is fixed at creation time (OpenSearch recently relaxed this, but re-sharding remains expensive). Choosing too few shards means you hit a ceiling on index size. Choosing too many creates overhead, because each shard consumes memory for its Lucene segment metadata, and the coordinator must fan out to more shards per query.

A good starting heuristic: aim for shards between 10GB and 50GB each. If your index will reach 200GB, start with 4 to 8 shards. Monitor shard sizes and query latency, then adjust for your next index version.
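The arithmetic behind that heuristic, assuming a 30GB midpoint target inside the 10-50GB band:

```python
import math

def suggest_shard_count(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    # Divide projected index size by the target shard size and round up;
    # 30GB is an assumed midpoint of the 10-50GB heuristic band.
    return max(1, math.ceil(index_size_gb / target_shard_gb))

print(suggest_shard_count(200))  # → 7
```

A 200GB index at a 30GB target yields 7 shards, squarely inside the 4-to-8 range; a small index collapses to a single shard.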

Hot-warm-cold architectures help manage cost. Recent data lives on fast SSD nodes (hot tier) for low-latency queries. Older data migrates to cheaper spinning-disk nodes (warm tier) where it is still searchable but slower. Very old data moves to frozen or cold storage where it is barely searchable but takes minimal resources. Index lifecycle management (ILM) policies automate these transitions.

Common failure modes

Search clusters fail in predictable ways. The most common is the “split brain” problem, where a network partition lets two nodes each believe they are the elected master and accept conflicting cluster-state updates. Modern Elasticsearch versions mitigate this with quorum-based voting, but you still need an odd number of master-eligible nodes (three is the practical minimum for production).

Mapping explosions happen when dynamically typed fields create thousands of unique field names (for example, indexing arbitrary JSON without a schema). Each unique field consumes cluster state memory. Set dynamic: strict or dynamic: false on indexes where you control the schema.

Slow queries with leading wildcards (*headphones) bypass the inverted index entirely and force a scan of every term in the index. These queries should be blocked or routed to a separate cluster to avoid impacting other searches.

What comes next

Search gets data into the hands of users, but the data itself needs a home. The next article covers storage systems at scale: block vs object vs file storage, how S3-style object stores work internally, and how to manage data lifecycle across regions.
