🕷️ Design a Web Crawler - How Google Indexes the Internet

AI Craft • Advanced • ⏱️ 22 min read

A web crawler systematically browses the internet to download and index web pages. It's the foundation of every search engine, and designing one teaches you about distributed systems, graph traversal, and handling the messiness of the real web.

Step 1 - Requirements Gathering

Functional Requirements

  • Crawl web pages starting from a set of seed URLs
  • Extract and follow hyperlinks to discover new pages
  • Store downloaded content for indexing and processing
  • Respect robots.txt and crawl politeness rules

Non-Functional Requirements

  • Scale - crawl 1 billion pages per month
  • Politeness - never overwhelm a single web server
  • Freshness - re-crawl important pages regularly
  • Robustness - handle malformed HTML, infinite loops, and spider traps

Step 2 - High-Level Architecture

Seed URLs → URL Frontier (priority queue)
                  ↓
            URL Scheduler (politeness + priority)
                  ↓
         Fetcher (HTTP client pool)
                  ↓
         HTML Parser → Extract URLs → URL Filter
              ↓                           ↓
       Content Store              URL Dedup (Bloom Filter)
       (S3 / HDFS)                      ↓
                                 URL Frontier (loop)
Web crawler architecture showing the crawl loop from URL frontier through fetcher, parser, URL extraction, deduplication, and back to the frontier
The crawler operates as a continuous loop: fetch a page, extract new URLs, deduplicate, and enqueue for future crawling

Step 3 - The Crawl Loop

BFS vs DFS Crawling

  • BFS (Breadth-First) - preferred for web crawling. Discovers a broad set of domains quickly and avoids getting trapped deep in a single site
  • DFS (Depth-First) - risks going infinitely deep into dynamically generated pages before discovering other domains
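The BFS crawl loop above can be sketched with a plain FIFO queue. Here `fetch` and `extract_links` are hypothetical stand-ins for a real HTTP client and HTML parser, injected so the loop itself stays testable:

```python
from collections import deque

def bfs_crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Breadth-first crawl: a FIFO frontier spreads across many sites
    instead of burrowing deep into one. `fetch` and `extract_links`
    are placeholders for the real HTTP client and HTML parser."""
    frontier = deque(seed_urls)   # FIFO queue => breadth-first order
    seen = set(seed_urls)         # simple in-memory dedup
    pages = []
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:          # fetch failed; skip this URL
            continue
        pages.append(url)
        for link in extract_links(html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Swapping the `deque` for a stack (`pop()` instead of `popleft()`) would turn this into DFS, with exactly the trap-prone behaviour described above.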

URL Frontier Design

The frontier is not a simple queue - it's a sophisticated priority system with two sub-components:

  1. Priority Queue (Front) - ranks URLs by importance. Factors include PageRank, update frequency, and domain authority
  2. Politeness Queue (Back) - ensures each domain has its own FIFO queue. A scheduler picks from different domain queues to avoid hammering any single server
Front (Priority):        Back (Politeness):
┌───────────────┐        ┌─── bbc.co.uk queue ────┐
│ High Priority │ ───→   │ url1 → url4 → url7     │
│ Med Priority  │ ───→   ├─── github.com queue ───┤
│ Low Priority  │ ───→   │ url2 → url5 → url8     │
└───────────────┘        ├─ wikipedia.org queue ──┤
                         │ url3 → url6 → url9     │
                         └────────────────────────┘
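A minimal sketch of this two-level frontier: a priority heap at the front, one FIFO queue per domain at the back, and a round-robin scheduler over domains. A production frontier would also track a next-allowed-fetch timestamp per domain; that detail is omitted here:

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlparse

class Frontier:
    """Two-level URL frontier sketch: priority heap in front,
    per-domain FIFO queues in back, round-robin over domains."""
    def __init__(self):
        self._front = []                 # (priority, seq, url) min-heap
        self._back = defaultdict(deque)  # domain -> FIFO of its URLs
        self._domains = deque()          # round-robin rotation of domains
        self._seq = 0                    # tie-breaker for equal priorities

    def add(self, url, priority=1):
        heapq.heappush(self._front, (priority, self._seq, url))
        self._seq += 1

    def _refill(self):
        # Drain the front heap into per-domain back queues.
        while self._front:
            _, _, url = heapq.heappop(self._front)
            domain = urlparse(url).netloc
            if not self._back[domain]:       # domain newly has work
                self._domains.append(domain)
            self._back[domain].append(url)

    def next_url(self):
        """Pick from a different domain each call (politeness)."""
        self._refill()
        if not self._domains:
            return None
        domain = self._domains.popleft()
        queue = self._back[domain]
        url = queue.popleft()
        if queue:                            # domain still has work:
            self._domains.append(domain)     # rotate it to the back
        return url
```

Because lower numbers pop first from the heap, priority 1 here means "most important"; the rotation guarantees that two URLs from the same domain are never dequeued back-to-back while other domains have work.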
🧠 Quick Check

Why does a web crawler use BFS rather than DFS for general-purpose crawling?

Step 4 - Politeness and Robots.txt

Respecting Robots.txt

Before crawling any domain, fetch and cache its robots.txt file:

# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2

Cache robots.txt per domain with a TTL (24 hours). Always honour Disallow rules and Crawl-delay directives.
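Python's standard library can parse and query rules like the example above via `urllib.robotparser`. The snippet parses the rules inline; a real crawler would fetch `https://<domain>/robots.txt` and cache the parsed result per domain:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Paths outside the Disallow prefixes are permitted.
allowed = parser.can_fetch("MyCrawler", "https://example.com/index.html")   # True
# Anything under /admin/ is blocked by the Disallow rule.
blocked = parser.can_fetch("MyCrawler", "https://example.com/admin/users")  # False
# Crawl-delay is exposed directly, in seconds.
delay = parser.crawl_delay("MyCrawler")                                     # 2
```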

Crawl Rate Limiting

Even without explicit Crawl-delay, maintain politeness:

  • Maximum 1 request per second per domain (configurable)
  • Use separate download queues per domain
  • Track response times - if a server slows down, back off automatically
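A minimal per-domain rate limiter enforcing the 1-request-per-second rule might look like this. The clock is injectable so the behaviour can be tested without sleeping; automatic back-off on slow responses is left out of this sketch:

```python
import time
from collections import defaultdict

class DomainRateLimiter:
    """Per-domain politeness gate: enforces a minimum interval between
    requests to the same domain. A production crawler would also widen
    the interval when a server's response times climb (back-off)."""
    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock                          # injectable for tests
        self._last = defaultdict(lambda: float("-inf"))

    def wait_time(self, domain):
        """Seconds to wait before `domain` may be fetched again."""
        elapsed = self.clock() - self._last[domain]
        return max(0.0, self.min_interval - elapsed)

    def record(self, domain):
        """Call after each request to `domain`."""
        self._last[domain] = self.clock()
```

The scheduler consults `wait_time` before handing a URL to a fetcher thread; domains that are not yet allowed are simply skipped in the rotation.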
🤯
Google's web crawler (Googlebot) crawls hundreds of billions of pages. To be polite, it monitors server response times and reduces its crawl rate if a site starts responding slowly - effectively letting the website "push back" against the crawler.

Step 5 - Deduplication

URL Deduplication (Bloom Filter)

With billions of URLs, checking if a URL has been seen before requires an efficient data structure. A Bloom filter provides O(1) lookups with a small memory footprint:

  • Space: ~1 GB for 1 billion URLs at 1% false positive rate
  • Trade-off: may occasionally re-crawl a page (false positive) but never misses a new URL (no false negatives)

Normalise URLs before checking: remove fragments (#), sort query parameters, lowercase the hostname, and resolve relative paths.
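Those normalisation steps can be combined into one helper built on the standard library's `urllib.parse`. Dropping a trailing slash (a side effect of `normpath`) is a design choice here, not a rule from the text:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, urljoin

def normalise_url(url, base=None):
    """Canonicalise a URL before the dedup check: resolve relative
    paths, lowercase scheme and host, drop the fragment, collapse
    `..` segments, and sort query parameters for a stable form."""
    if base:
        url = urljoin(base, url)                    # resolve relative URLs
    scheme, netloc, path, query, _ = urlsplit(url)  # 5th field = fragment, discarded
    path = posixpath.normpath(path) if path else "/"
    if path == ".":                                 # defensive: empty-ish path
        path = "/"
    query = urlencode(sorted(parse_qsl(query)))     # stable parameter order
    return urlunsplit((scheme.lower(), netloc.lower(), path, query, ""))
```

Only the normalised form is fed to the Bloom filter, so `http://example.com/b?a=2&z=1` and `HTTP://Example.COM/a/../b?z=1&a=2#top` count as one URL.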

Content Deduplication (SimHash)

Different URLs can serve identical content (mirrors, copied articles). SimHash generates a fingerprint of page content - pages with similar fingerprints (small Hamming distance) are likely duplicates:

Page A content → SimHash → 10110010...
Page B content → SimHash → 10110011...
Hamming distance = 1 → likely duplicate, skip storing
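A toy version of the idea: hash each token, let every token vote +1/-1 on each bit position, and keep the sign as the fingerprint bit. This uses MD5 as an arbitrary token hash; real implementations weight tokens (e.g. by TF-IDF) rather than counting raw occurrences:

```python
import hashlib

def simhash(text, bits=64):
    """Toy SimHash: each token's hash votes on every bit position;
    the fingerprint keeps the sign of each bit's vote total. Pages
    with similar token sets get fingerprints that differ in few bits."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

In a dedup pipeline you would store fingerprints and flag any new page whose distance to an existing fingerprint falls below a small threshold (commonly around 3 bits for 64-bit SimHash).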
🧠 Quick Check

A Bloom filter reports that a URL has been seen before. What can you conclude?

Step 6 - Distributed Crawling

Scaling Horizontally

A single machine cannot crawl a billion pages monthly. Distribute the work:

  • Consistent hashing on domain names assigns each domain to a specific crawler node. This ensures politeness - only one node crawls a given domain
  • URL frontier is partitioned across nodes based on the hash ring
  • Coordination via a lightweight service that manages node assignments and rebalances when nodes join or leave
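A sketch of the domain-to-node assignment with consistent hashing. Each node contributes many virtual points on the ring, so load spreads evenly and removing a node only remaps the domains that were on it; node names and the virtual-node count are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps each domain to exactly one crawler node (the politeness
    invariant). Virtual nodes smooth out the load distribution."""
    def __init__(self, nodes, vnodes=100):
        self._ring = []                      # sorted (hash, node) points
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        # SHA-1 truncated to 64 bits: an arbitrary but stable ring hash.
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def node_for(self, domain):
        """First virtual point clockwise from the domain's hash."""
        idx = bisect.bisect(self._keys, self._hash(domain)) % len(self._keys)
        return self._ring[idx][1]
```

Because assignment depends only on the hash ring, any node (or the coordinator) can answer "who owns this domain?" without shared mutable state.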

DNS Resolution Caching

DNS lookups add 10-100ms per request. At scale, this is a major bottleneck:

  • Maintain a local DNS cache per crawler node
  • Batch DNS pre-resolution for queued URLs
  • Use a dedicated DNS resolver to avoid overwhelming public DNS servers
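The local cache from the first bullet can be as simple as a dict of `host -> (ip, expiry)`. The resolver function and clock are injectable so the cache logic can be exercised without real network lookups:

```python
import socket
import time

class CachingResolver:
    """Per-node DNS cache with a TTL. `resolve_fn` defaults to a real
    lookup but can be swapped for a stub; a fixed TTL stands in for
    honouring the DNS record's own TTL."""
    def __init__(self, ttl=300, resolve_fn=socket.gethostbyname,
                 clock=time.monotonic):
        self.ttl = ttl
        self.resolve_fn = resolve_fn
        self.clock = clock
        self._cache = {}                     # host -> (ip, expiry)

    def resolve(self, host):
        entry = self._cache.get(host)
        if entry and entry[1] > self.clock():
            return entry[0]                  # cache hit: no network I/O
        ip = self.resolve_fn(host)           # miss or expired: real lookup
        self._cache[host] = (ip, self.clock() + self.ttl)
        return ip
```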

Spider Trap Detection

Malicious or poorly built sites can create infinite URL spaces:

  • /page/1/page/1/page/1/... - repeating path patterns
  • Calendar pages generating URLs for every future date
  • Session IDs in URLs creating unique URLs for identical content

Defences: maximum URL length, maximum crawl depth per domain, pattern detection for repetitive URL structures, and domain-level page count limits.
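The URL-level defences can be combined into a single filter applied before enqueueing. The thresholds below are illustrative, not tuned values; domain-level page counting would live in the frontier rather than here:

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2000   # illustrative thresholds, not tuned
MAX_DEPTH = 15
MAX_SEGMENT_REPEATS = 3

def looks_like_trap(url):
    """Heuristic spider-trap filter: reject very long URLs, very deep
    paths, and paths repeating the same segment many times
    (e.g. /page/1/page/1/page/1/...)."""
    if len(url) > MAX_URL_LENGTH:
        return True
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    for seg in set(segments):
        if segments.count(seg) > MAX_SEGMENT_REPEATS:
            return True
    return False
```

Heuristics like these inevitably admit some traps and reject some legitimate URLs; the domain-level page budget is the backstop that bounds the damage either way.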

🤔 Think about it: If you need to keep a search index fresh, how would you decide which pages to re-crawl first? Consider a news site that updates hourly versus a Wikipedia article that changes monthly. What signals would you use to prioritise re-crawl scheduling?

Step 7 - Bottlenecks and Optimisations

| Bottleneck | Solution |
|------------|----------|
| DNS resolution latency | Local DNS cache, batch pre-resolution |
| Slow/unresponsive servers | Timeouts (30 s), async I/O, skip after retries |
| Storage for crawled content | Compress pages, store only changed content |
| URL frontier memory | Disk-backed priority queue (RocksDB) |
| Duplicate content | SimHash fingerprinting, canonical URL detection |

🤔 Think about it: Search engines must balance crawl freshness (re-crawling pages frequently) against discovery (finding new pages). If your crawler has a fixed budget of 1 million pages per day, how would you split that budget between re-crawling existing pages and discovering new ones?

📚 Further Reading

  • Web Crawler - System Design Interview Vol. 1, Ch. 9 - Alex Xu's detailed web crawler design walkthrough
  • Mercator: A Scalable Web Crawler - foundational paper on scalable crawler architecture
  • Bloom Filters by Example - interactive tutorial on how Bloom filters work