🕷️
AI 匠心 • Advanced • ⏱️ 22 min read

Design a Web Crawler - How Google Indexes the Internet

A web crawler systematically browses the internet to download and index web pages. It's the foundation of every search engine, and designing one teaches you about distributed systems, graph traversal, and handling the messiness of the real web.

Step 1 - Requirements Gathering

Functional Requirements

  • Crawl web pages starting from a set of seed URLs
  • Extract and follow hyperlinks to discover new pages
  • Store downloaded content for indexing and processing
  • Respect robots.txt and crawl politeness rules

Non-Functional Requirements

  • Scale - crawl 1 billion pages per month
  • Politeness - never overwhelm a single web server
  • Freshness - re-crawl important pages regularly
  • Robustness - handle malformed HTML, infinite loops, and spider traps

Step 2 - High-Level Architecture

Seed URLs → URL Frontier (priority queue)
                  ↓
            URL Scheduler (politeness + priority)
                  ↓
         Fetcher (HTTP client pool)
                  ↓
         HTML Parser → Extract URLs → URL Filter
              ↓                           ↓
       Content Store              URL Dedup (Bloom Filter)
       (S3 / HDFS)                      ↓
                                 URL Frontier (loop)
Web crawler architecture showing the crawl loop from URL frontier through fetcher, parser, URL extraction, deduplication, and back to the frontier
The crawler operates as a continuous loop: fetch a page, extract new URLs, deduplicate, and enqueue for future crawling

Step 3 - The Crawl Loop

BFS vs DFS Crawling

  • BFS (Breadth-First) - preferred for web crawling. Discovers a broad set of domains quickly and avoids getting trapped deep in a single site
  • DFS (Depth-First) - risks going infinitely deep into dynamically generated pages before discovering other domains
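The BFS crawl loop can be sketched with a FIFO frontier. This is a minimal, single-machine illustration; the `fetch_links` callable is a stand-in for the real fetch-and-parse stage, injected here so the loop itself stays testable:

```python
from collections import deque

def bfs_crawl(seeds, fetch_links, max_pages):
    """Breadth-first crawl: a FIFO frontier spreads across many pages
    and domains before going deep into any single site."""
    frontier = deque(seeds)
    seen = set(seeds)          # simple dedup; a real crawler uses a Bloom filter
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()           # FIFO pop => breadth-first order
        crawled.append(url)
        for link in fetch_links(url):      # fetch page, extract hrefs
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled
```

Swapping the `deque` for a stack (`pop()` instead of `popleft()`) would turn this into the DFS variant the bullet above warns against.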

URL Frontier Design

The frontier is not a simple queue - it's a sophisticated priority system with two sub-components:

  1. Priority Queue (Front) - ranks URLs by importance. Factors include PageRank, update frequency, and domain authority
  2. Politeness Queue (Back) - ensures each domain has its own FIFO queue. A scheduler picks from different domain queues to avoid hammering any single server
Front (Priority):       Back (Politeness):
┌──────────────┐       ┌─── bbc.co.uk queue ───┐
│ High Priority │ ───→  │ url1 → url4 → url7    │
│ Med Priority  │ ───→  ├─── github.com queue ──┤
│ Low Priority  │ ───→  │ url2 → url5 → url8    │
└──────────────┘       ├─── wikipedia.org queue ┤
                        │ url3 → url6 → url9    │
                        └───────────────────────┘
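The two-level structure above can be sketched as a priority heap of domains over per-domain FIFO queues. This is a simplified model (a real frontier is disk-backed and also enforces per-domain delays); the class name and fixed per-domain priority are illustrative choices:

```python
import heapq
from collections import defaultdict, deque

class URLFrontier:
    """Two-level frontier sketch: a priority heap picks which domain to
    serve next (front); each domain keeps its own FIFO queue (back)."""
    def __init__(self):
        self.domain_queues = defaultdict(deque)  # back: politeness queues
        self.heap = []                           # front: (-priority, domain)

    def push(self, url, domain, priority):
        if not self.domain_queues[domain]:
            # domain enters the heap only when its queue goes non-empty
            heapq.heappush(self.heap, (-priority, domain))
        self.domain_queues[domain].append(url)

    def pop(self):
        neg_prio, domain = heapq.heappop(self.heap)
        url = self.domain_queues[domain].popleft()
        if self.domain_queues[domain]:           # domain still has work
            heapq.heappush(self.heap, (neg_prio, domain))
        return url
```

Because each domain appears in the heap at most once, a fetcher pool draining this structure never pulls two URLs for the same domain concurrently.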
🧠 Quiz

Why does a web crawler use BFS rather than DFS for general-purpose crawling?

Step 4 - Politeness and Robots.txt

Respecting Robots.txt

Before crawling any domain, fetch and cache its robots.txt file:

# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2

Cache robots.txt per domain with a TTL (24 hours). Always honour Disallow rules and Crawl-delay directives.
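Python's standard library can parse the example file above directly. The sketch below feeds `urllib.robotparser` the same rules without a network fetch; the `MyCrawler` user-agent string and example URLs are placeholders:

```python
from urllib import robotparser

# Parse the example robots.txt shown above, line by line.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

allowed_admin = rp.can_fetch("MyCrawler", "https://example.com/admin/users")
allowed_blog = rp.can_fetch("MyCrawler", "https://example.com/blog/post")
delay = rp.crawl_delay("MyCrawler")   # honoured Crawl-delay directive
```

In production the parsed object would be cached per domain alongside its TTL, and re-fetched when the cache entry expires.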

Crawl Rate Limiting

Even without explicit Crawl-delay, maintain politeness:

  • Maximum 1 request per second per domain (configurable)
  • Use separate download queues per domain
  • Track response times - if a server slows down, back off automatically
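A per-domain rate limiter ties these rules together. This sketch tracks the last fetch time for each domain and sleeps until the minimum delay has elapsed; the clock and sleep functions are injected so the behaviour is testable (class and parameter names are illustrative):

```python
import time
from collections import defaultdict

class PolitenessLimiter:
    """Per-domain rate limiting sketch: wait() blocks until at least
    min_delay seconds have passed since the domain's last fetch."""
    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.last = defaultdict(lambda: float("-inf"))  # last fetch per domain
        self.clock, self.sleep = clock, sleep

    def wait(self, domain):
        elapsed = self.clock() - self.last[domain]
        if elapsed < self.min_delay:
            self.sleep(self.min_delay - elapsed)        # back off
        self.last[domain] = self.clock()
```

The adaptive back-off from the last bullet would extend this by raising `min_delay` for a domain whenever its response times degrade.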
🤯
Google's web crawler (Googlebot) crawls hundreds of billions of pages. To be polite, it monitors server response times and reduces its crawl rate if a site starts responding slowly - effectively letting the website "push back" against the crawler.

Step 5 - Deduplication

URL Deduplication (Bloom Filter)

With billions of URLs, checking if a URL has been seen before requires an efficient data structure. A Bloom filter provides O(1) lookups with a small memory footprint:

  • Space: ~1 GB for 1 billion URLs at 1% false positive rate
  • Trade-off: may occasionally re-crawl a page (false positive) but never misses a new URL (no false negatives)
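A minimal Bloom filter illustrates both properties. This sketch derives its k bit positions from salted SHA-256 hashes; sizing (`m_bits`, `k`) is illustrative rather than tuned for the 1-billion-URL figure above:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions per item in an m-bit array.
    May report false positives, never false negatives."""
    def __init__(self, m_bits=1 << 20, k=7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        for i in range(self.k):               # k independent salted hashes
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # all k bits set => "probably seen"; any bit clear => definitely new
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

An exact set would need to store every URL string; the Bloom filter stores only bits, which is what makes the ~1 GB figure for a billion URLs possible.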

Normalise URLs before checking: remove fragments (#), sort query parameters, lowercase the hostname, and resolve relative paths.
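The normalisation steps (minus relative-path resolution, which needs the referring page's URL) can be written with `urllib.parse`. The function name is illustrative:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalise_url(url):
    """Canonicalise a URL before the dedup check: drop the fragment,
    lowercase scheme and host, and sort query parameters."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))   # stable param order
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", query, ""))   # "" drops #fragment
```

Without this step, `page?a=1&b=2` and `page?b=2&a=1` would occupy two Bloom filter entries and be crawled twice.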

Content Deduplication (SimHash)

Different URLs can serve identical content (mirrors, copied articles). SimHash generates a fingerprint of page content - pages with similar fingerprints (small Hamming distance) are likely duplicates:

Page A content → SimHash → 10110010...
Page B content → SimHash → 10110011...
Hamming distance = 1 → likely duplicate, skip storing
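The fingerprinting above can be sketched in a few lines. This is a toy 64-bit SimHash over whitespace tokens with equal weights (real implementations use shingles and TF-IDF weighting), but it shows the core idea: each token votes on every bit, and the sign of each column becomes the fingerprint:

```python
import hashlib

def simhash(text, bits=64):
    """Toy SimHash: each token votes +1/-1 on every bit position;
    the sign of each column gives the final fingerprint."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Because near-identical pages share almost all tokens, their vote totals (and hence fingerprints) differ in only a few bit positions.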
🧠 Quiz

A Bloom filter reports that a URL has been seen before. What can you conclude?

Step 6 - Distributed Crawling

Scaling Horizontally

A single machine cannot crawl a billion pages monthly. Distribute the work:

  • Consistent hashing on domain names assigns each domain to a specific crawler node. This ensures politeness - only one node crawls a given domain
  • URL frontier is partitioned across nodes based on the hash ring
  • Coordination via a lightweight service that manages node assignments and rebalances when nodes join or leave
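A consistent hash ring with virtual nodes can be sketched as follows. The node names and vnode count are illustrative; the key property is that the same domain always maps to the same crawler node, and removing a node only reassigns that node's domains:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing sketch: each crawler node owns several virtual
    points on a ring; a domain maps to the first point clockwise."""
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, domain):
        # first ring point clockwise from the domain's hash (wrap around)
        idx = bisect.bisect(self.keys, self._hash(domain)) % len(self.keys)
        return self.ring[idx][1]
```

Virtual nodes (`vnodes`) smooth out the load distribution; without them a single node could own a disproportionate arc of the ring.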

DNS Resolution Caching

DNS lookups add 10-100ms per request. At scale, this is a major bottleneck:

  • Maintain a local DNS cache per crawler node
  • Batch DNS pre-resolution for queued URLs
  • Use a dedicated DNS resolver to avoid overwhelming public DNS servers
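A per-node DNS cache with a TTL is a small amount of code. The resolver is injected here so the sketch is testable; in a real crawler it would wrap `socket.gethostbyname` or a dedicated resolver:

```python
import time

class DNSCache:
    """DNS cache sketch: serve repeated lookups for a host from memory
    until the TTL expires, then hit the real resolver again."""
    def __init__(self, resolver, ttl=300, clock=time.monotonic):
        self.resolver, self.ttl, self.clock = resolver, ttl, clock
        self.cache = {}   # host -> (ip, expires_at)

    def resolve(self, host):
        entry = self.cache.get(host)
        if entry and entry[1] > self.clock():
            return entry[0]                       # cache hit: no network I/O
        ip = self.resolver(host)                  # miss or expired: real lookup
        self.cache[host] = (ip, self.clock() + self.ttl)
        return ip
```

Since a crawl batch for one domain issues many requests to the same host, the hit rate of this cache is typically very high.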

Spider Trap Detection

Malicious or poorly built sites can create infinite URL spaces:

  • /page/1/page/1/page/1/... - repeating path patterns
  • Calendar pages generating URLs for every future date
  • Session IDs in URLs creating unique URLs for identical content

Defences: maximum URL length, maximum crawl depth per domain, pattern detection for repetitive URL structures, and domain-level page count limits.
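Three of those defences fit in one small filter. The thresholds below are illustrative, not tuned values:

```python
from urllib.parse import urlsplit

MAX_URL_LEN = 2000          # illustrative thresholds
MAX_DEPTH = 20
MAX_SEGMENT_REPEATS = 3

def looks_like_trap(url):
    """Heuristic spider-trap filter: very long URLs, very deep paths,
    or a path segment that repeats many times are all red flags."""
    if len(url) > MAX_URL_LEN:
        return True
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    for seg in set(segments):
        if segments.count(seg) > MAX_SEGMENT_REPEATS:
            return True
    return False
```

URLs rejected here never enter the frontier, which caps how much budget a trap can burn; the domain-level page count limit would be enforced separately in the scheduler.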

🤔
Think about it: If you need to keep a search index fresh, how would you decide which pages to re-crawl first? Consider a news site that updates hourly vs a Wikipedia article that changes monthly. What signals would you use to prioritise re-crawl scheduling?

Step 7 - Bottlenecks and Optimisations

| Bottleneck | Solution |
|-----------|----------|
| DNS resolution latency | Local DNS cache, batch pre-resolution |
| Slow/unresponsive servers | Timeouts (30s), async I/O, skip after retries |
| Storage for crawled content | Compress pages, store only changed content |
| URL frontier memory | Disk-backed priority queue (RocksDB) |
| Duplicate content | SimHash fingerprinting, canonical URL detection |

🤔
Think about it: Search engines must balance crawl freshness (re-crawling pages frequently) against discovery (finding new pages). If your crawler has a fixed budget of 1 million pages per day, how would you split that budget between re-crawling existing pages and discovering new ones?

📚 Further Reading

  • Web Crawler - System Design Interview Vol. 1, Ch. 9 - Alex Xu's detailed web crawler design walkthrough
  • Mercator: A Scalable Web Crawler - foundational paper on scalable crawler architecture
  • Bloom Filters by Example - interactive tutorial on how Bloom filters work
Lesson 9 of 10