Web crawler system design

Dilip Kumar
Jul 14, 2024

Browse the World Wide Web and build a search index, with a target of 15 billion pages per month at the scale of Google Search. The following additional considerations add complexity to crawling the web:

  1. Don’t overload DNS servers.
  2. Be polite while making HTTP calls to target servers.
  3. Respect each target server’s robots.txt policy.
  4. Handle duplicate URLs.
  5. Handle duplicate content.
  6. Prioritize crawling high-value websites based on popularity and other factors.
  7. Handle crawler traps created by spam sites full of small pages stuffed with links.
  8. Design for rescanning websites to pick up the latest content.
  9. Clean up websites that no longer exist. Pages are retired at a rapid pace: during the course of one year, about 80% of pages disappear. Links are retired at about the same pace, with 80% disappearing in the span of a year.
  10. Crawl websites in order of their freshness.
  11. Discover new pages. New pages are created at a rate of 8% per week, and new links at a rate of 25% per week, which is significantly faster than the rate of new page creation.
  12. Extend the design to handle PDF, audio, video, etc.

Single Machine Design

Apply a simple BFS graph traversal (BFS is preferred over DFS) to crawl the entire web.

from collections import deque

seed_urls = [...]            # initial url set
queue = deque(seed_urls)
visited = set(seed_urls)     # mark seeds so they are not re-enqueued

while queue:
    url = queue.popleft()
    # Download the page and run the indexing
    page = download(url)
    # Extract urls for the next scan
    for next_url in extract_url(page):
        if next_url not in visited:
            visited.add(next_url)
            queue.append(next_url)

Time complexity: O(n) * t, where n is the number of pages on the web (n = 15 billion) and t is the time to download a page and extract its URLs.
Average time t = 2 seconds
Total time = 15 billion * 2 seconds ≈ 951 years
This is not acceptable, so we need to optimize the design.

Distributed system design

Based on the requirements and the complexities of scaling the crawler, we can come up with the following distributed system design.

Sharding of data

All URLs for the same domain should be kept together for future processing. Therefore we can shard every layer based on hostname, as sketched below.
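
A minimal sketch of hostname-based shard routing, assuming a fixed shard count (NUM_SHARDS is an illustrative value, not part of the original design):

from urllib.parse import urlparse
import hashlib

NUM_SHARDS = 64  # assumed shard count, purely illustrative

def shard_for(url: str) -> int:
    """Route every url of the same hostname to the same shard."""
    hostname = urlparse(url).hostname or ""
    digest = hashlib.md5(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Both urls of the same host land on the same shard.
assert shard_for("https://example.com/a") == shard_for("https://example.com/b")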

Database

The database can easily leverage hostname-based sharding. We can use a relational database, for example Spanner, to store the data.

Cache

The cache can also leverage hostname-based sharding. For the cache, we can go with Redis.

Queue

Kafka natively requires us to declare the number of partitions up front. Since we don’t know the upper bound on URLs (the set keeps growing), Kafka is a poor fit for queue storage here. Instead we can go with a Spanner-based queue with advanced configuration, or any alternative solution that supports dynamic partitioning.

URL Frontier

The URL Frontier stores all the URLs that need to be processed. It can be initialized as below.

  1. Admins can pick popular URLs to seed this queue.
  2. The company can buy lists of new websites and keep populating this queue at regular intervals.
  3. Website owners can also submit requests to scan their websites; on request, the URL is added to this queue.

Publishers to this queue will extract the hostname and write the payload in the following format.

{
  hostname: xxxx,
  publishedTimestamp: xxxx,
  payload: {
    url: xxx,
    hostname: xxxx
  }
}

Data will be sharded based on hostname.
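
A minimal publisher sketch that builds this payload; queue_client and its send method are hypothetical stand-ins for whichever queue backend is chosen:

import json
import time
from urllib.parse import urlparse

def build_frontier_message(url: str) -> dict:
    """Wrap a url in the frontier payload format shown above."""
    hostname = urlparse(url).hostname or ""
    return {
        "hostname": hostname,                   # also used as the shard key
        "publishedTimestamp": int(time.time()),
        "payload": {"url": url, "hostname": hostname},
    }

def publish(queue_client, url: str) -> None:
    # queue_client is a hypothetical wrapper over the chosen queue backend
    message = build_frontier_message(url)
    queue_client.send(key=message["hostname"], value=json.dumps(message))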

DNS Resolver

Since we are crawling the entire internet, we should not hit DNS for every URL. Instead, we maintain a local DNS cache. This subscriber first checks the cache; if the entry doesn’t exist, it makes the DNS call and updates the cache. The following is the schema for the cached data.

Hostname | IPAddress
xxxx | xxxxxx
xxxx | xxxxxx
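
A minimal sketch of the cache-then-resolve flow, using an in-memory dict as a stand-in for the distributed cache:

import socket

dns_cache: dict[str, str] = {}  # hostname -> ip address (local cache)

def resolve(hostname: str) -> str:
    """Check the local cache first; fall back to a real DNS lookup."""
    ip = dns_cache.get(hostname)
    if ip is None:
        ip = socket.gethostbyname(hostname)  # actual DNS resolution
        dns_cache[hostname] = ip
    return ip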

Robots txt fetcher

To respect each website’s crawl policy, we should first download robots.txt and store it for future use. Since this file is small, we can parse it and keep it in a distributed cache. The following is the flow for this queue’s subscriber to handle robots.txt.

  1. Check whether robots.txt exists in our cache.
  2. If it doesn’t exist, download it and update the cache.
  3. If it does exist, first make a HEAD request for robots.txt to find its last-modified timestamp.
  4. If the timestamp matches the cached data, skip.
  5. If it differs, download and update the cache.

The following is the schema for this cache.

Hostname | Robots.txt | LastUpdateTimestamp
xxxx | xxxxx | xxxxx

Please note: a site can have a separate robots.txt for each domain and subdomain, and we need to store each one separately. The Hostname column holds the domain or subdomain value and is also the primary key for this table.
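
A minimal sketch of the steps above, assuming the requests library and an in-memory dict as a stand-in for the distributed cache:

import requests

robots_cache: dict[str, dict] = {}  # hostname -> {"body": ..., "last_modified": ...}

def refresh_robots(hostname: str) -> str:
    """Fetch robots.txt only when it is missing or has changed upstream."""
    robots_url = f"https://{hostname}/robots.txt"
    cached = robots_cache.get(hostname)
    if cached is not None:
        head = requests.head(robots_url, timeout=5)
        if head.headers.get("Last-Modified") == cached["last_modified"]:
            return cached["body"]          # unchanged, reuse cached copy
    resp = requests.get(robots_url, timeout=5)
    robots_cache[hostname] = {
        "body": resp.text,
        "last_modified": resp.headers.get("Last-Modified"),
    }
    return resp.text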

Http fetcher

The HTTP fetcher subscriber takes care of downloading content from the target web server. To download, it uses the IP address resolved during the DNS resolver phase.

How to handle freshness?

  1. Query the website metadata table to see whether the URL exists.
  2. If it doesn’t exist, download from the target server and update both the content and metadata storage.
  3. If it does exist, first make a HEAD call to get the last-refresh timestamp.
  4. If the timestamp is newer, download the content; otherwise skip it.
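
A minimal sketch of this freshness check, assuming the requests library; cached_timestamp stands in for the value read from the metadata table:

import requests

def fetch_if_newer(url: str, cached_timestamp: str | None) -> bytes | None:
    """Download the page only if the server reports newer content."""
    if cached_timestamp is not None:
        head = requests.head(url, timeout=10, allow_redirects=True)
        if head.headers.get("Last-Modified") == cached_timestamp:
            return None                     # content unchanged, skip download
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.content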

How to handle politeness?

Since the queue is partitioned by hostname, URLs from the same hostname are kept together. We need a smart queue handler that reads messages from a hostname-based partition and assigns only a mini-batch to a handler at a time, so no single host gets hammered.
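
A minimal per-host rate-limiting sketch the handler could apply before each fetch; the one-second delay is an assumed politeness interval:

import time
from collections import defaultdict

CRAWL_DELAY_SECONDS = 1.0   # assumed per-host politeness interval

last_hit: dict[str, float] = defaultdict(float)  # hostname -> last request time

def wait_for_slot(hostname: str) -> None:
    """Block until it is polite to hit this host again."""
    elapsed = time.monotonic() - last_hit[hostname]
    if elapsed < CRAWL_DELAY_SECONDS:
        time.sleep(CRAWL_DELAY_SECONDS - elapsed)
    last_hit[hostname] = time.monotonic()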

How to store content and metadata?

Once content is downloaded from the target server, the fetcher does the following.

  1. Upload the content to object storage. We can use either Google Cloud Storage or S3; it returns a URL to access the file content.
  2. Write the website metadata, including the file content URL, in the following schema format.

Hostname | Url | ContentUrl | PageFreshnessTimestamp | Size | Type
xxxx | xxx | xxx | xxxxx | xxx | xxx

We shard it based on hostname and use Url as the primary key for this table. We can use Spanner or any other relational database for storage.
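
A minimal sketch of the upload-then-record step; blob_client and metadata_db (and their upload/upsert methods) are hypothetical wrappers over the object store and the metadata database:

import hashlib
from datetime import datetime, timezone

def store_page(blob_client, metadata_db, hostname: str, url: str,
               content: bytes, content_type: str) -> None:
    """Upload raw content to object storage, then record its metadata row."""
    # Object name derived from the url hash; blob_client is a hypothetical
    # wrapper over GCS / S3 that returns an access url for the stored object.
    object_name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    content_url = blob_client.upload(object_name, content)

    metadata_db.upsert("WebsiteMetadata", {
        "Hostname": hostname,                 # shard key
        "Url": url,                           # primary key
        "ContentUrl": content_url,
        "PageFreshnessTimestamp": datetime.now(timezone.utc).isoformat(),
        "Size": len(content),
        "Type": content_type,
    })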

Link Extractor

The link extractor explores more links to increase the crawl scope. It skips links that the corresponding robots.txt file does not allow us to crawl.
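
A minimal sketch using Python’s standard html.parser and robotparser modules; the function names here are illustrative, not part of the original design:

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_allowed_links(page_url: str, html: str, robots_txt: str) -> list[str]:
    """Return absolute links that the host's robots.txt allows us to crawl."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    parser = LinkParser()
    parser.feed(html)
    absolute = (urljoin(page_url, href) for href in parser.links)
    return [link for link in absolute if rp.can_fetch("*", link)]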

Custom Url filter

The crawler must make sure not to scan sites related to terrorism, porn, drugs, etc. It internally maintains a CustomUrlFilter list in the Redis cache and drops links marked as blocked.
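
A minimal sketch of the blocklist check with redis-py; the endpoint and the set key name are assumptions for illustration:

import redis
from urllib.parse import urlparse

r = redis.Redis(host="localhost", port=6379)  # assumed Redis endpoint

BLOCKLIST_KEY = "crawler:blocked_hosts"       # hypothetical set of blocked hostnames

def is_blocked(url: str) -> bool:
    """Drop urls whose hostname is on the custom blocklist."""
    hostname = urlparse(url).hostname or ""
    return bool(r.sismember(BLOCKLIST_KEY, hostname))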

Duplicate Urls

This stage looks up the URLs cache and drops links that have already been sent for processing. It acts as a checkpoint to make sure we do not process duplicate URLs. The following is the schema for this table.

Url | ProcessingStatus | FreshnessTimestamp
xx | xxx | xxxx
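
A minimal sketch of an atomic "claim" check backed by a Redis set (the key name is an assumption); only the first writer wins, so duplicates are dropped:

import redis

r = redis.Redis(host="localhost", port=6379)  # assumed Redis endpoint

def claim_url(url: str) -> bool:
    """Atomically mark a url as seen; returns False if it was already claimed."""
    # SADD returns 1 only for the first writer, so duplicates are dropped.
    return r.sadd("crawler:seen_urls", url) == 1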

Url prioritizer

We have billions of URLs to scan, so it becomes critical that the system spends more time scanning popular websites. We also need to ignore the spam or small files created to trap the crawler. We can maintain the following table to rank the URLs.

Url | Rank
xx | 111

This data will be updated externally based on the following properties.

  1. Number of visitors.
  2. Number of links.
  3. Most recent visit timestamp.
  4. Whether it is flagged by any user.
  5. Feedback.

Based on these signals, we assign a rank to each URL, which the URL Frontier uses to decide which URL to pick next.
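
A toy sketch of turning those signals into a rank and using it as a priority; the weights and the scoring formula are assumptions, purely for illustration:

import heapq

def rank(num_visitors: int, num_links: int, days_since_visit: int,
         is_flagged: bool) -> float:
    """Toy scoring over the signals above; the weights are assumed values."""
    score = 0.5 * num_visitors + 0.3 * num_links - 0.1 * days_since_visit
    return 0.0 if is_flagged else score

# Max-heap keyed on rank: the frontier pops the highest-ranked url first.
frontier: list[tuple[float, str]] = []
heapq.heappush(frontier, (-rank(1000, 50, 2, False), "https://example.com"))
heapq.heappush(frontier, (-rank(10, 1, 300, False), "https://tiny-spam.example"))
_, next_url = heapq.heappop(frontier)   # -> "https://example.com"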

Process Url Checks

We don’t want the crawler to get trapped in an infinite loop processing the same URL again and again. At the same time, we want to reprocess a URL when fresh content is published. We can take the following approach.

  1. Make a HEAD HTTP call to get the freshness timestamp.
  2. If the content is newer, enqueue the URL for processing again.
  3. If the freshness matches what is in the database, re-enqueue to the same queue with delayed delivery of ~x days. This ensures we do not lose the URL; instead we add a delay before retrying to see whether it needs to be rescanned.
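
A minimal sketch of this check, assuming the requests library; queue_client and its enqueue/delay_days parameters are hypothetical, and the backoff value stands in for the "~x days" above:

import requests

RECHECK_DELAY_DAYS = 7   # assumed backoff before rechecking an unchanged url

def recheck_url(queue_client, url: str, stored_timestamp: str | None) -> None:
    """Re-enqueue immediately if content changed, otherwise retry after a delay."""
    head = requests.head(url, timeout=10, allow_redirects=True)
    last_modified = head.headers.get("Last-Modified")
    if last_modified != stored_timestamp:
        queue_client.enqueue(url)                                 # fresh content, process now
    else:
        queue_client.enqueue(url, delay_days=RECHECK_DELAY_DAYS)  # hypothetical delayed delivery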

Recrawling

The internal checkpoint for rescanning a URL is not sufficient, and not every website’s data changes every few seconds. Therefore we have to rescan all the URLs periodically to crawl them again. We can take the following approach.

  1. Build a parallel pipeline dedicated to recrawling.
  2. Build an event-sourcing pipeline that reads the Urls table and initializes the URL Frontier queue with these URLs.

This will trigger scanning again.
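
A thin sketch of that pipeline; metadata_db and queue_client (and their scan/enqueue methods) are the same hypothetical clients used in earlier sketches:

def recrawl_all(metadata_db, queue_client, batch_size: int = 1000) -> None:
    """Read every stored url in batches and push it back onto the URL Frontier."""
    for row in metadata_db.scan("WebsiteMetadata", batch_size=batch_size):
        queue_client.enqueue(row["Url"])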

Scale Workers

  1. Number of URLs in 4 weeks: 15 billion
  2. Number of URLs per second: 15 billion / (4 × 7 × 24 × 3,600 s) ≈ 6,200 per sec
  3. Concurrent threads per server: 100
  4. Number of messages a server can process concurrently: 100
  5. Number of app servers: 6,200 / 100 ≈ 62

Reference

http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf

Happy system designing :-)
