Design a Web Crawler - How Google Indexes the Internet
A web crawler systematically browses the internet to download and index web pages. It's the foundation of every search engine, and designing one teaches you about distributed systems, graph traversal, and handling the messiness of the real web.
Step 1 - Requirements Gathering
Functional Requirements
Crawl web pages starting from a set of seed URLs
Extract and follow hyperlinks to discover new pages
Store downloaded content for indexing and processing
Respect robots.txt and crawl politeness rules
Non-Functional Requirements
Scale - crawl 1 billion pages per month
Politeness - never overwhelm a single web server
Freshness - re-crawl important pages regularly
Robustness - handle malformed HTML, infinite loops, and spider traps
The crawler operates as a continuous loop: fetch a page, extract new URLs, deduplicate, and enqueue for future crawling.
Step 3 - The Crawl Loop
BFS vs DFS Crawling
BFS (Breadth-First) - preferred for web crawling. Discovers a broad set of domains quickly and avoids getting trapped deep in a single site
DFS (Depth-First) - risks going infinitely deep into dynamically generated pages before discovering other domains
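The BFS crawl loop can be sketched in a few lines. This is a minimal, single-threaded illustration; the `fetch` and `extract_links` callables are hypothetical placeholders supplied by the caller, and a real crawler would add politeness, retries, and a persistent frontier:

```python
from collections import deque

def bfs_crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Breadth-first crawl: broad domain coverage before depth.

    `fetch(url)` and `extract_links(html, base_url)` are assumed
    caller-supplied functions, not part of any real library.
    """
    frontier = deque(seed_urls)   # FIFO queue gives BFS order
    seen = set(seed_urls)         # URL-level deduplication
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        pages.append((url, html))
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Swapping the `deque` for a stack (`pop()` instead of `popleft()`) would turn this into DFS, which is exactly the behaviour the bullet above warns against.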
URL Frontier Design
The frontier is not a simple queue - it's a sophisticated priority system with two sub-components:
Priority Queue (Front) - ranks URLs by importance. Factors include PageRank, update frequency, and domain authority
Politeness Queue (Back) - ensures each domain has its own FIFO queue. A scheduler picks from different domain queues to avoid hammering any single server
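A sketch of the back (politeness) queues: one FIFO per domain, with a simple round-robin scheduler so no single server is hit repeatedly. This is illustrative only; a production frontier would also track per-domain next-allowed-fetch times and persist queues to disk:

```python
import collections
from urllib.parse import urlparse

class PolitenessFrontier:
    """One FIFO queue per domain; the scheduler rotates across
    domains so consecutive fetches hit different servers."""

    def __init__(self):
        self.queues = collections.OrderedDict()  # domain -> deque of URLs

    def add(self, url):
        domain = urlparse(url).netloc
        self.queues.setdefault(domain, collections.deque()).append(url)

    def next_url(self):
        if not self.queues:
            return None
        # Take the least recently served domain, then rotate it to the back.
        domain, queue = next(iter(self.queues.items()))
        url = queue.popleft()
        del self.queues[domain]
        if queue:
            self.queues[domain] = queue  # re-insert at the end of the rotation
        return url
```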
Cache robots.txt per domain with a TTL (24 hours). Always honour Disallow rules and Crawl-delay directives.
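One way to sketch this cache uses Python's standard `urllib.robotparser`. The `fetch_text` callable that retrieves a domain's robots.txt is an assumption supplied by the caller, and the `"mybot"` user-agent is a placeholder:

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Per-domain robots.txt cache with a TTL (default 24 hours)."""

    def __init__(self, fetch_text, ttl=24 * 3600):
        self.fetch_text = fetch_text  # hypothetical: domain -> robots.txt text
        self.ttl = ttl
        self.cache = {}               # domain -> (parser, fetched_at)

    def _parser_for(self, domain):
        entry = self.cache.get(domain)
        if entry is None or time.time() - entry[1] > self.ttl:
            parser = RobotFileParser()
            parser.parse(self.fetch_text(domain).splitlines())
            entry = (parser, time.time())
            self.cache[domain] = entry
        return entry[0]

    def allowed(self, url, agent="mybot"):
        """Honour Disallow rules for this agent."""
        return self._parser_for(urlparse(url).netloc).can_fetch(agent, url)

    def crawl_delay(self, url, agent="mybot"):
        """Honour Crawl-delay directives (None if unspecified)."""
        return self._parser_for(urlparse(url).netloc).crawl_delay(agent)
```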
Crawl Rate Limiting
Even without explicit Crawl-delay, maintain politeness:
Maximum 1 request per second per domain (configurable)
Use separate download queues per domain
Track response times - if a server slows down, back off automatically
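These three rules can be combined into a per-domain limiter. The slowdown threshold (2 s), backoff factor (2x), and 60-second cap below are illustrative assumptions, not canonical values:

```python
import time

class DomainRateLimiter:
    """At most one request per `base_delay` seconds per domain,
    backing off automatically when the server responds slowly."""

    def __init__(self, base_delay=1.0):
        self.base_delay = base_delay
        self.delay = {}       # domain -> current inter-request delay
        self.last_fetch = {}  # domain -> timestamp of last request

    def wait_time(self, domain, now=None):
        """Seconds the crawler must still wait before hitting `domain`."""
        now = time.time() if now is None else now
        delay = self.delay.get(domain, self.base_delay)
        elapsed = now - self.last_fetch.get(domain, 0.0)
        return max(0.0, delay - elapsed)

    def record(self, domain, response_time, now=None):
        """Record a completed fetch and adapt the delay."""
        now = time.time() if now is None else now
        self.last_fetch[domain] = now
        current = self.delay.get(domain, self.base_delay)
        if response_time > 2.0:
            # Server is struggling: double the delay, capped at 60 s.
            self.delay[domain] = min(current * 2, 60.0)
        else:
            # Server is healthy: decay back toward the base rate.
            self.delay[domain] = max(self.base_delay, current * 0.9)
```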
🤯
Google's web crawler (Googlebot) crawls hundreds of billions of pages. To be polite, it monitors server response times and reduces its crawl rate if a site starts responding slowly - effectively letting the website "push back" against the crawler.
Step 5 - Deduplication
URL Deduplication (Bloom Filter)
With billions of URLs, checking if a URL has been seen before requires an efficient data structure. A Bloom filter provides O(1) lookups with a small memory footprint:
Space: ~1.2 GB for 1 billion URLs at a 1% false-positive rate
Trade-off: may occasionally re-crawl a page (false positive) but never misses a new URL (no false negatives)
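A minimal Bloom filter illustrating the standard sizing formulas, m = -n·ln(p)/(ln 2)^2 bits and k = (m/n)·ln 2 hash functions. Real crawlers would use a tuned library; this sketch derives its k probe positions from one SHA-256 digest via double hashing:

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter: k hash probes into an m-bit array."""

    def __init__(self, n_items, fp_rate):
        # Optimal sizing for the target false-positive rate.
        self.m = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        # Double hashing: k positions from two 64-bit hashes.
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))
```

At a 1% false-positive rate the formula gives roughly 9.6 bits per URL, which is where the ~1.2 GB figure for 1 billion URLs comes from.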
Normalise URLs before checking: remove fragments (#), sort query parameters, lowercase the hostname, and resolve relative paths.
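These normalisation steps can be sketched with the standard `urllib.parse` module:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, urljoin

def normalise_url(url, base=None):
    """Canonicalise a URL before the seen-check: resolve relative
    paths, drop the fragment, lowercase scheme and host, and sort
    query parameters into a stable order."""
    if base:
        url = urljoin(base, url)                 # resolve relative paths
    scheme, netloc, path, query, _fragment = urlsplit(url)
    query = urlencode(sorted(parse_qsl(query)))  # stable parameter order
    return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))
```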
Content Deduplication (SimHash)
Different URLs can serve identical content (mirrors, copied articles). SimHash generates a fingerprint of page content - pages with similar fingerprints (small Hamming distance) are likely duplicates.
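A toy SimHash over word features. Production systems use shingles and weighted features, but the fingerprint-and-Hamming-distance mechanics are the same:

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash: each word votes +1/-1 on every bit position
    according to its own hash; the sign of each tally becomes the
    corresponding fingerprint bit."""
    v = [0] * bits
    for word in text.split():
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Near-duplicate pages end up with fingerprints a small Hamming distance apart (a threshold of about 3 bits is commonly cited), while unrelated pages differ in roughly half their bits.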
Spider Traps
Spider traps are URL structures that generate an unbounded number of pages, for example:
Calendar pages generating URLs for every future date
Session IDs in URLs creating unique URLs for identical content
Defences: maximum URL length, maximum crawl depth per domain, pattern detection for repetitive URL structures, and domain-level page count limits.
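Most of these defences reduce to cheap checks on the URL itself. The limits below are illustrative assumptions, not canonical values:

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 2048   # illustrative limit on total URL length
MAX_DEPTH = 15          # illustrative limit on path segments

def looks_like_trap(url):
    """Cheap heuristics for spider-trap URLs."""
    if len(url) > MAX_URL_LENGTH:
        return True
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    # Repetitive structures like /a/b/a/b/a/b are a classic trap signature.
    if len(segments) - len(set(segments)) >= 3:
        return True
    return False
```

Domain-level page count limits need shared state across workers, so they live in the frontier rather than in a per-URL check like this one.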
🤔
Think about it: If you need to keep a search index fresh, how would you decide which pages to re-crawl first? Consider a news site that updates hourly vs a Wikipedia article that changes monthly. What signals would you use to prioritise re-crawl scheduling?
Step 7 - Bottlenecks and Optimisations
| Bottleneck | Solution |
|-----------|----------|
| DNS resolution latency | Local DNS cache, batch pre-resolution |
| Slow/unresponsive servers | Timeouts (30s), async I/O, skip after retries |
| Storage for crawled content | Compress pages, store only changed content |
| URL frontier memory | Disk-backed priority queue (RocksDB) |
| Duplicate content | SimHash fingerprinting, canonical URL detection |
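The timeout and retry mitigations from the table can be sketched as an async wrapper around a hypothetical `fetch` coroutine (the coroutine itself, e.g. an HTTP client call, is assumed to be supplied by the caller):

```python
import asyncio

async def fetch_with_retries(fetch, url, timeout=30.0, retries=3):
    """Apply a hard timeout per attempt, retry with exponential
    backoff, and skip the URL entirely after `retries` failures."""
    for attempt in range(retries):
        try:
            return await asyncio.wait_for(fetch(url), timeout)
        except (asyncio.TimeoutError, OSError):
            if attempt < retries - 1:
                await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    return None  # give up: log the URL and move on
```

Because the wait happens inside the event loop, thousands of slow downloads can be in flight concurrently without tying up threads - which is the point of the async I/O row in the table.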
🤔
Think about it: Search engines must balance crawl freshness (re-crawling pages frequently) against discovery (finding new pages). If your crawler has a fixed budget of 1 million pages per day, how would you split that budget between re-crawling existing pages and discovering new ones?