Design a web crawler

In this series (18 parts)
  1. Design a URL shortener
  2. Design a key-value store
  3. Design a rate limiter
  4. Design a web crawler
  5. Design a notification system
  6. Design a news feed
  7. Design a chat application
  8. Design a video streaming platform
  9. Design a music streaming service
  10. Design a ride-sharing service
  11. Design a food delivery platform
  12. Design a hotel booking platform
  13. Design a search engine
  14. Design a distributed message queue
  15. Design a code deployment system
  16. Design a payments platform
  17. Design an ad click aggregation system
  18. Design a distributed cache

A web crawler starts with a handful of seed URLs and systematically fetches, parses, and follows links across the internet. Search engines, archival services, and SEO tools all depend on crawlers. The hard problems are not fetching a single page; they are doing it billions of times without hammering hosts, re-crawling stale content, or wasting storage on duplicates.

Requirements

Functional:

  1. Given a set of seed URLs, the crawler discovers and downloads web pages by following hyperlinks.
  2. Downloaded pages are stored for downstream processing (indexing, analysis).
  3. The crawler respects robots.txt rules and per-host crawl delays.
  4. Duplicate URLs and duplicate content are detected and skipped.
  5. The system supports configurable crawl priorities (news sites refreshed more often than static pages).

Non-functional:

  • Scale: Crawl 1 billion pages per day.
  • Politeness: No more than 1 request per second to any single host.
  • Fault tolerance: A crashed worker resumes without re-crawling already fetched pages.
  • Extensibility: Pluggable modules for content extraction, link filtering, and storage backends.

Capacity estimation

Start with the target of 1 billion pages per day and work through the back-of-the-envelope estimates:

| Metric | Value |
| --- | --- |
| Pages per day | 1,000,000,000 |
| Pages per second | ~11,600 |
| Avg page size (compressed) | 100 KB |
| Daily storage | 100 TB |
| Daily bandwidth | ~9.3 Gbps sustained |
| Unique hosts (estimated) | ~10 million |
| Metadata per URL | ~500 bytes |
| URL metadata storage | ~500 GB |

At 11,600 pages per second, we need hundreds of crawler workers running in parallel. Each worker handles roughly 50 to 100 fetches per second depending on network latency and page size.
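The figures in the table fall out of a few lines of arithmetic (using the table's rough assumptions of 100 KB per page and 500 bytes of metadata per URL, not measured values):

```python
# Back-of-envelope numbers for a 1-billion-pages/day crawl.
PAGES_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400
AVG_PAGE_BYTES = 100_000      # 100 KB compressed (assumed average)
URL_METADATA_BYTES = 500      # assumed per-URL metadata footprint

pages_per_second = PAGES_PER_DAY / SECONDS_PER_DAY                            # ~11,600
daily_storage_tb = PAGES_PER_DAY * AVG_PAGE_BYTES / 1e12                      # 100 TB
bandwidth_gbps = PAGES_PER_DAY * AVG_PAGE_BYTES * 8 / SECONDS_PER_DAY / 1e9   # ~9.3
metadata_gb = PAGES_PER_DAY * URL_METADATA_BYTES / 1e9                        # 500 GB

print(f"{pages_per_second:,.0f} pages/s, {daily_storage_tb:.0f} TB/day, "
      f"{bandwidth_gbps:.1f} Gbps sustained, {metadata_gb:.0f} GB metadata")
```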

High-level architecture

graph TD
  Seeds["Seed URLs"] --> Frontier["URL Frontier"]
  Frontier --> Fetcher["Fetcher Workers"]
  Fetcher --> DNS["DNS Resolver Cache"]
  DNS --> Fetcher
  Fetcher --> Parser["HTML Parser"]
  Parser --> Dedup["Content Dedup"]
  Dedup --> Store["Page Store"]
  Parser --> LinkExtract["Link Extractor"]
  LinkExtract --> URLFilter["URL Filter + Dedup"]
  URLFilter --> Frontier
  Fetcher --> Robots["robots.txt Cache"]
  Robots --> Fetcher
  Store --> Downstream["Indexer / Analytics"]

High-level architecture of a distributed web crawler. The URL frontier feeds fetcher workers, which download pages, extract links, and loop discovered URLs back into the frontier.

The loop is the defining feature. Pages produce links, links become URLs in the frontier, and the frontier feeds fetchers. The system is a producer-consumer pipeline where the output of one stage feeds the input of an earlier stage.

Message queues sit at the heart of this pipeline. The URL frontier is essentially a priority queue with politeness constraints layered on top.

Deep dive: URL frontier

The URL frontier is the most critical component. A naive queue does not work because we need two things simultaneously: priority ordering and per-host rate limiting.

Two-level queue structure

The frontier splits into two layers:

  1. Front queues (priority): Multiple queues, each assigned a priority level. A prioritizer examines each URL and routes it to the appropriate queue based on page importance, freshness, and domain authority.
  2. Back queues (politeness): Multiple queues, one per host. A selector maps each URL to the queue for its host. Each back queue enforces a minimum delay between consecutive fetches to the same host.

The fetcher pulls from back queues. A router sits between front and back queues, draining high-priority URLs first and distributing them to the correct per-host queue.

graph LR
  subgraph Front Queues
      F1["Priority 1 (High)"]
      F2["Priority 2 (Medium)"]
      F3["Priority 3 (Low)"]
  end
  subgraph Router
      R["Priority Selector + Host Mapper"]
  end
  subgraph Back Queues
      B1["host: example.com"]
      B2["host: news.org"]
      B3["host: blog.dev"]
  end
  F1 --> R
  F2 --> R
  F3 --> R
  R --> B1
  R --> B2
  R --> B3
  B1 --> Fetcher["Fetcher Pool"]
  B2 --> Fetcher
  B3 --> Fetcher

The two-level frontier design. Front queues handle priority; back queues enforce per-host politeness.

Politeness enforcement

Each back queue tracks the last fetch timestamp for its host. When a fetcher requests work, the queue checks if enough time has passed since the last request. If not, the URL stays in the queue and the fetcher picks from a different host.

The crawl delay comes from two sources: the robots.txt Crawl-delay directive (when present) and a default minimum of 1 second. The system always uses the stricter of the two.
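A back queue with this behavior can be sketched as follows. This is illustrative only: the class name, fields, and the in-memory `deque` are assumptions, and a production frontier would persist this state rather than hold it in process memory.

```python
import time
from collections import deque

class BackQueue:
    """Per-host queue that enforces a minimum delay between fetches."""

    DEFAULT_DELAY = 1.0  # politeness floor: at most 1 request/second per host

    def __init__(self, host, robots_crawl_delay=None):
        self.host = host
        self.urls = deque()
        # Use the stricter (larger) of robots.txt Crawl-delay and the default.
        self.delay = max(self.DEFAULT_DELAY, robots_crawl_delay or 0.0)
        self.last_fetch = 0.0  # timestamp of the most recent fetch

    def push(self, url):
        self.urls.append(url)

    def ready(self, now=None):
        now = time.monotonic() if now is None else now
        return bool(self.urls) and (now - self.last_fetch) >= self.delay

    def pop(self, now=None):
        """Return the next URL if the host's delay has elapsed, else None."""
        now = time.monotonic() if now is None else now
        if not self.ready(now):
            return None  # fetcher should move on to a different host's queue
        self.last_fetch = now
        return self.urls.popleft()
```

When `pop` returns `None`, the fetcher simply tries another host's queue, which is what keeps the worker busy without ever violating a single host's rate limit.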

Priority scoring

Priority depends on several signals:

  • PageRank or domain authority: Well-linked pages get higher priority.
  • Update frequency: Pages that change often (news homepages) are re-crawled sooner.
  • Depth from seed: URLs closer to seed pages tend to be more important.
  • Content type: HTML pages rank higher than PDFs or images in a general-purpose crawler.
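One way these signals could combine into a front-queue assignment is sketched below. The field names, weights, and number of queues are hypothetical choices for illustration, not values from a real crawler:

```python
def priority_score(url_meta):
    """Combine ranking signals into a single priority in [0, 1]."""
    authority = url_meta.get("domain_authority", 0.0)     # 0..1, link-based
    freshness = url_meta.get("change_rate", 0.0)          # 0..1, update frequency
    depth_penalty = 1.0 / (1 + url_meta.get("depth", 0))  # closer to seed = higher
    type_boost = 1.0 if url_meta.get("content_type") == "text/html" else 0.5

    # Weights are illustrative; real crawlers tune these empirically.
    score = (0.4 * authority + 0.3 * freshness + 0.3 * depth_penalty) * type_boost
    return min(score, 1.0)

def to_front_queue(score, num_queues=3):
    """Map a score to a front-queue index (0 = highest priority)."""
    return min(int((1.0 - score) * num_queues), num_queues - 1)
```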

Deep dive: deduplication

At 1 billion pages per day, fetching the same page twice wastes bandwidth and storage. We need dedup at two levels: URL dedup (have we seen this URL?) and content dedup (is this page identical to one we already stored?).

URL deduplication with Bloom filters

Storing every seen URL in a hash set works for small crawls but fails at scale. One billion URLs at 100 bytes each is 100 GB of memory. A Bloom filter compresses this dramatically.

A Bloom filter is a probabilistic data structure that answers “have I seen this before?” with zero false negatives and a configurable false positive rate. With 1 billion entries and a 1% false positive rate, the filter needs roughly 1.2 GB of memory. That fits comfortably on a single machine.

The trade-off: a small percentage of unseen URLs will be incorrectly marked as “already crawled” and skipped. At 1% false positive rate, that means roughly 10 million pages per day are missed. For most use cases, this is acceptable. If it is not, you can chain two Bloom filters or fall back to a disk-backed hash set for uncertain cases.
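A minimal Bloom filter, plus the sizing arithmetic behind the 1.2 GB figure, looks roughly like this. The class is an illustrative sketch (production crawlers use sharded, persistent filters), but the sizing formulas are the standard ones:

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter for URL dedup."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing: m bits and k hash functions for the target error rate.
        self.m = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, url):
        # Derive k bit positions from two hashes (Kirsch-Mitzenmacher trick).
        digest = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

# Sizing check: 1 billion URLs at a 1% false positive rate.
m_bits = -1_000_000_000 * math.log(0.01) / math.log(2) ** 2
print(f"{m_bits / 8 / 1e9:.2f} GB")  # ~1.20 GB
```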

Content deduplication with SimHash

Two different URLs can serve identical content (mirrors, syndication, URL parameters that do not change the page). Storing both wastes space.

SimHash computes a fingerprint of each page’s content. Pages with identical or near-identical SimHash values are considered duplicates. Unlike a cryptographic hash, SimHash preserves locality: similar documents produce similar hashes, allowing near-duplicate detection.

The workflow:

  1. Fetcher downloads a page.
  2. Content dedup module computes the SimHash of the page body.
  3. The fingerprint is compared against a database of stored fingerprints.
  4. If a match is found, the page is discarded (or stored as a reference to the original).
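A bare-bones 64-bit SimHash captures the idea. This sketch tokenizes on whitespace and weights every token equally; real implementations use shingles, token weighting, and an indexed fingerprint store for fast lookup:

```python
import hashlib

def simhash(text, bits=64):
    """Compute a SimHash fingerprint over whitespace-separated tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode()).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1/-1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint where the positive votes win.
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

def near_duplicate(a, b, threshold=3):
    """Pages within `threshold` differing bits are treated as duplicates."""
    return hamming_distance(simhash(a), simhash(b)) <= threshold
```

The threshold of 3 differing bits is a commonly cited choice for 64-bit fingerprints, but it is a tunable parameter, not a fixed property of the algorithm.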

Deep dive: distributed coordination

A single machine cannot crawl 1 billion pages per day. We distribute the work across hundreds of crawler nodes.

Partitioning by host

Each crawler node is responsible for a subset of hosts. Using consistent hashing, we map each host to a node. This gives us two benefits:

  1. Natural politeness: Since all URLs for a given host land on the same node, per-host rate limiting is local. No distributed locks needed.
  2. DNS caching: Each node caches DNS resolutions for its assigned hosts, reducing DNS lookup overhead.

When a node fails, consistent hashing redistributes its hosts to neighboring nodes. The new owners resume crawling from the frontier state stored in a shared datastore.
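A host-to-node ring with virtual nodes could be sketched as below. The class and node names are placeholders; the point is that `node_for` is deterministic, so every URL for a host lands on the same worker, and removing a failed node only remaps that node's hosts:

```python
import bisect
import hashlib

class HostRing:
    """Consistent-hash ring mapping hosts to crawler nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes     # virtual nodes per physical node, for balance
        self.ring = []           # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        # Only this node's positions vanish; other hosts keep their owners.
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def node_for(self, host):
        """All URLs for `host` map to one node, so politeness stays local."""
        h = self._hash(host)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]
```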

sequenceDiagram
  participant F as Frontier Store
  participant C as Coordinator
  participant W1 as Worker 1
  participant W2 as Worker 2
  participant DNS as DNS Cache
  participant Web as Target Host

  C->>F: Pull next batch for Worker 1
  F-->>C: URLs for hosts A, B
  C->>W1: Assign URL batch
  W1->>DNS: Resolve host A
  DNS-->>W1: IP address
  W1->>Web: GET /page (host A)
  Web-->>W1: HTML response
  W1->>W1: Parse + extract links
  W1->>F: Push new URLs to frontier
  W1->>C: Report completion

  C->>F: Pull next batch for Worker 2
  F-->>C: URLs for hosts C, D
  C->>W2: Assign URL batch
  W2->>DNS: Resolve host C
  DNS-->>W2: IP address
  W2->>Web: GET /page (host C)
  Web-->>W2: HTML response
  W2->>F: Push new URLs to frontier
  W2->>C: Report completion

Crawl loop sequence. The coordinator assigns URL batches to workers, each worker fetches pages, extracts links, and pushes discovered URLs back into the frontier.

Checkpoint and recovery

Each worker periodically checkpoints its progress: the set of URLs it has fetched and the current state of its back queues. On crash, the replacement worker loads the checkpoint and resumes. URLs that were in-flight at crash time are re-fetched, but content dedup catches any resulting duplicates.

Checkpoints are stored in a distributed key-value store. The overhead is small: a worker processing 100 URLs per second generates roughly 50 KB of checkpoint data per second.
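The checkpoint-and-resume cycle can be sketched as follows. A plain dict stands in for the distributed key-value store, and the class name, state layout, and 30-second interval are illustrative assumptions:

```python
import json
import time

class CheckpointingWorker:
    """Worker that periodically snapshots its progress to a key-value store."""

    def __init__(self, worker_id, kv_store, interval=30.0):
        self.worker_id = worker_id
        self.kv = kv_store
        self.interval = interval
        self.fetched = []        # URLs completed since startup
        self.back_queues = {}    # host -> pending URLs
        self._last_checkpoint = time.monotonic()

    def record_fetch(self, url):
        self.fetched.append(url)
        if time.monotonic() - self._last_checkpoint >= self.interval:
            self.checkpoint()

    def checkpoint(self):
        state = {"fetched": self.fetched, "back_queues": self.back_queues}
        self.kv[f"checkpoint:{self.worker_id}"] = json.dumps(state)
        self._last_checkpoint = time.monotonic()

    @classmethod
    def resume(cls, worker_id, kv_store):
        """A replacement worker reloads the crashed worker's last checkpoint."""
        worker = cls(worker_id, kv_store)
        raw = kv_store.get(f"checkpoint:{worker_id}")
        if raw:
            state = json.loads(raw)
            worker.fetched = state["fetched"]
            worker.back_queues = state["back_queues"]
        return worker
```

Anything fetched after the last checkpoint is lost from the snapshot and gets re-fetched on resume, which is exactly the gap that content dedup covers.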

Trade-offs and alternatives

| Decision | Option A | Option B |
| --- | --- | --- |
| URL dedup | Bloom filter (fast, small, lossy) | Distributed hash table (exact, more memory) |
| Content dedup | SimHash (near-duplicate detection) | MD5/SHA hash (exact match only) |
| Frontier storage | In-memory with disk spill | Fully disk-based (RocksDB) |
| Host partitioning | Consistent hashing (automatic rebalancing) | Static assignment (simpler, manual rebalancing) |
| Crawl scheduling | Priority queues | Simple FIFO (fair but not optimal) |

Bloom filter vs. exact dedup: The false positive cost of a Bloom filter is skipping a few million pages out of billions. For most crawlers, that is an excellent trade-off. Archival crawlers that cannot afford any misses use a distributed hash set backed by persistent storage.

BFS vs. DFS crawl order: BFS (breadth-first) is standard because it discovers important pages early. DFS risks getting stuck deep in a single site. The priority-based frontier is a weighted BFS.

Single-machine vs. distributed frontier: A centralized frontier is simpler but becomes a bottleneck at scale. Partitioning the frontier by host lets each worker manage its own subset, removing the central coordinator for URL routing.

What real systems actually do

Googlebot uses a distributed crawl infrastructure with thousands of machines. It respects robots.txt, adapts crawl rate per host based on server responsiveness, and uses a sophisticated scheduler that predicts when pages will change.

Apache Nutch is an open-source crawler built on Hadoop. It stores the frontier and fetched content in HDFS and runs the crawl loop as MapReduce jobs. Nutch is batch-oriented: it crawls in rounds rather than continuously.

Scrapy (Python) is a single-machine framework popular for targeted crawls. It handles politeness, robots.txt, and dedup but does not distribute across nodes out of the box. For distributed Scrapy, teams add a shared frontier (Redis-based) via Scrapy-Redis.

Common Crawl is a non-profit that crawls the entire web monthly and publishes the data openly. Their architecture processes petabytes of data per crawl using cloud infrastructure.

All production crawlers share the same core ideas: a frontier with prioritization, per-host politeness, dedup at the URL and content level, and distributed workers. The differences are in implementation details and scale.

Performance characteristics

Throughput scales near-linearly up to around 600 workers. Beyond that, contention on the shared frontier and DNS resolution overhead cause sub-linear returns. Partitioning the frontier by host pushes the knee of this curve further out.

What comes next

This design handles the core crawl loop. Production systems layer additional concerns on top:

  • Recrawl scheduling: Predicting when pages will change and scheduling re-fetches accordingly, rather than blindly re-crawling everything on a fixed interval.
  • Trap detection: Infinite URL spaces generated by calendars, session IDs, or query parameters. Detection heuristics cap the number of URLs per host-path prefix.
  • Rendering: JavaScript-heavy pages require a headless browser to produce the final HTML. This is 10 to 100x slower than a simple HTTP fetch and needs a separate rendering pipeline.
  • Legal compliance: Respecting robots.txt is the baseline. Crawlers also need to handle copyright takedown requests, GDPR data deletion, and regional access restrictions.
  • Monitoring: Tracking crawl coverage, error rates per host, fetch latency percentiles, and frontier growth over time. Without good observability, a crawler at this scale is flying blind.

Each of these deserves a dedicated treatment. The architecture described here provides the foundation: a prioritized, polite, deduplicated crawl loop distributed across hundreds of machines.
