Web crawler system design
Browse the world wide web and build an index, with a target of 15 billion pages per month at the scale of Google search. The following additional considerations add complexity to crawling the web.
- Don’t overload the DNS server.
- Be polite while making HTTP calls to target servers.
- Respect each target server’s robots.txt policy.
- Handle duplicate URLs.
- Handle duplicate content.
- Prioritize crawling high-value websites based on popularity and other factors.
- Handle crawler traps created by spam sites full of small pages stuffed with links.
- Design for rescanning websites to pick up the latest content.
- Clean up websites that no longer exist. Pages are retired at a rapid pace: over the course of one year, 80% of pages disappear. Links are retired at about the same pace, with 80% disappearing within a year.
- Crawl websites in order of their freshness.
- How do we discover new pages? New pages are created at a rate of 8% per week. New links are created at a rate of 25% per week, significantly faster than the rate of new page creation.
- Extend the design to handle PDF, audio, video, etc.
Single Machine Design
Apply a simple BFS graph traversal algorithm (BFS is preferred over DFS) to crawl the entire web.
from collections import deque

# download() and extract_url() are the page-fetch and link-extraction helpers
queue = deque([...])    # initial seed URL set
visited = set()         # URLs already enqueued

while queue:
    url = queue.popleft()
    # Download the page and run indexing on it
    page = download(url)
    # Extract URLs for the next scan
    for next_url in extract_url(page):
        if next_url not in visited:
            visited.add(next_url)
            queue.append(next_url)
Time complexity: O(n × t), where n is the number of pages on the Web (n = 15 billion) and t is the time to download a page and extract its URLs.
Average time t = 2 seconds
Total time = 15 billion × 2 seconds ≈ 951 years
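A quick sanity check of that estimate in Python, assuming the 2-second average and fully serial processing:

# Back-of-the-envelope check of the single-machine estimate
pages = 15_000_000_000              # ~15 billion pages
seconds_per_page = 2                # assumed average download + extract time
total_seconds = pages * seconds_per_page
years = total_seconds / (365 * 24 * 3600)
print(f"{years:,.0f} years")        # -> roughly 951 years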
This is not acceptable, so we need to optimize the design.
Distributed system design
Based on the requirements and the complexities of scaling the crawler, we can come up with the following distributed system design.
Sharding of data
All URLs for the same domain can be kept together for future processing. Therefore we can shard all layers based on hostname.
Database
The database can easily leverage hostname-based sharding. We can use a relational database, for example Spanner, to store the data.
Cache
The cache can also leverage hostname-based sharding. For the cache, we can go with Redis.
Queue
Kafka requires us to declare the number of partitions up front. In our case we don’t know the upper bound on the number of URLs, which keeps growing, so we can’t use Kafka for queue storage. Instead we can go with a Spanner-based queue with advanced configuration, or any alternative solution that supports dynamic partitioning.
URL Frontier
The URL Frontier stores all the URLs that need to be processed. It can be initialized as below.
- An admin can pick popular URLs to seed this queue.
- The company can buy lists of new websites and keep populating this queue at regular intervals.
- Website owners can also submit a request to scan their website; on request, it will be added to this queue.
The publisher to this queue will extract the hostname and write the payload in the following format.
{
  hostname: xxxx,
  publishedTimestamp: xxxx,
  payload: {
    url: xxx,
    hostname: xxxx
  }
}
Data will be sharded based on hostname.
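A minimal sketch of the frontier publisher, assuming a hypothetical queue_client whose publish(key, value) call routes messages with the same key to the same partition:

import json
import time
from urllib.parse import urlparse

def publish_to_frontier(queue_client, url):
    """Extract the hostname and publish the frontier payload, keyed by hostname."""
    hostname = urlparse(url).hostname
    message = {
        "hostname": hostname,
        "publishedTimestamp": int(time.time()),
        "payload": {"url": url, "hostname": hostname},
    }
    # Keying by hostname keeps all URLs of one domain in the same partition
    queue_client.publish(key=hostname, value=json.dumps(message))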
DNS Resolver
Since we are trying to crawl the entire internet, we should not hit DNS for every URL. Instead, we will maintain a local DNS cache. The subscriber will first check the cache; if the entry doesn’t exist, it will make a call to DNS and update the cache. The cached data will have the following schema.
Hostname | IPAddress
xxxx     | xxxxxx
xxxx     | xxxxxx
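A minimal sketch of the cache-first resolver; a plain dict stands in for the distributed cache here, and socket.gethostbyname stands in for the real DNS call:

import socket

# In the real design this would be a shared cache (e.g. Redis);
# a plain dict keeps the sketch self-contained.
dns_cache = {}

def resolve(hostname):
    """Return the IP address for a hostname, consulting the local cache first."""
    ip = dns_cache.get(hostname)
    if ip is None:
        ip = socket.gethostbyname(hostname)   # fall back to a real DNS lookup
        dns_cache[hostname] = ip
    return ip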
Robots txt fetcher
To respect the privacy of a website, we should first download its robots.txt and store it for future use. Since this file is small, we can parse it and store it in a distributed cache. The subscriber for this queue will handle robots.txt as follows.
- Check whether robots.txt exists in our cache.
- If it doesn’t exist, download it and update the cache.
- If it does exist, first make a HEAD request for robots.txt to find its last-modified timestamp.
- If the timestamp is the same as the cached data, skip it.
- If it is different, download the file and update the cache.
The schema for this cache:
Hostname | Robots.txt | LastUpdateTimestamp
xxxx     | xxxxx      | xxxxx
Please note: a site can have a separate robots.txt for each domain and subdomain, and we need to store each one separately. The Hostname column will hold the domain or subdomain value and will also be the primary key for this table.
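A minimal sketch of this flow, assuming the cache is exposed as a dict-like store keyed by hostname and that the site serves a Last-Modified header; field names are illustrative:

import requests

def refresh_robots(cache, hostname):
    """Fetch or refresh robots.txt for a hostname in the cache."""
    robots_url = f"https://{hostname}/robots.txt"
    cached = cache.get(hostname)

    if cached:
        # HEAD request first: skip the download if nothing changed
        head = requests.head(robots_url, timeout=5)
        if head.headers.get("Last-Modified") == cached.get("last_modified"):
            return cached

    # Download and update the cache
    resp = requests.get(robots_url, timeout=5)
    cache[hostname] = {
        "robots_txt": resp.text,
        "last_modified": resp.headers.get("Last-Modified", ""),
    }
    return cache[hostname]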
Http fetcher
The HTTP fetcher subscriber takes care of downloading content from the target web server. To download, it uses the IP address resolved during the DNS resolver phase.
How to handle freshness?
- Query the Website metadata table to see whether the URL already exists.
- If it doesn’t exist, download the page from the target server and update both the content and metadata storage.
- If it does exist, first make a HEAD call to get the last refresh timestamp.
- If the timestamp is newer, download the content; otherwise skip it (see the sketch below).
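A minimal sketch of that freshness check, assuming a dict-like view of the Website metadata table and that the server reports a Last-Modified header:

import requests

def fetch_if_fresh(metadata_store, url):
    """Download `url` only if the server reports newer content than our metadata."""
    record = metadata_store.get(url)
    if record is None:
        return requests.get(url, timeout=10)        # first visit: always download

    head = requests.head(url, timeout=10)
    last_modified = head.headers.get("Last-Modified")
    if last_modified and last_modified != record["page_freshness"]:
        return requests.get(url, timeout=10)        # content changed: re-download
    return None                                     # unchanged: skip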
How to handle politeness?
Since the queue is partitioned by hostname, URLs from the same hostname are kept together. We need a smart queue handler that reads messages from a hostname-based partition and assigns only a mini-batch to each handler, so that a single host is never hit by many workers at once.
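A minimal sketch of the per-host politeness control; the 1-second minimum delay is an assumed default (a real crawler would also honor any Crawl-delay from robots.txt):

import time
from collections import defaultdict

class PoliteScheduler:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.last_fetch = defaultdict(float)   # hostname -> last fetch time

    def acquire(self, hostname):
        """Block until it is polite to hit `hostname` again."""
        wait = self.last_fetch[hostname] + self.min_delay - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self.last_fetch[hostname] = time.monotonic()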
How to store content and metadata?
Once content is downloaded from the target server, the fetcher will do the following.
- Upload the content to object storage. We can use Google Cloud Storage or S3; the upload returns a URL for accessing the file content.
- Write the website metadata, including the content URL, in the following schema format.
Hostname | Url | ContentUrl | PageFreshnessTimestamp | Size | Type
xxxx     | xxx | xxx        | xxxxx                  | xxx  | xxx
We will shard it based on hostname and use Url as the primary key for this table. We can use Spanner or any other relational database for storage.
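A minimal sketch of the store step, assuming an S3 bucket named crawler-content and a hypothetical metadata_table client for the Website metadata table:

import hashlib
import time
import boto3

s3 = boto3.client("s3")

def store_page(metadata_table, hostname, url, content, content_type):
    """Upload page content to object storage and record its metadata."""
    key = hashlib.sha256(url.encode()).hexdigest()
    s3.put_object(Bucket="crawler-content", Key=key, Body=content)
    content_url = f"s3://crawler-content/{key}"

    metadata_table.insert({
        "hostname": hostname,
        "url": url,                      # primary key
        "content_url": content_url,
        "page_freshness_timestamp": int(time.time()),
        "size": len(content),
        "type": content_type,
    })
    return content_url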
Link Extractor
The link extractor explores more links to widen the crawler’s coverage. Links are skipped if the corresponding robots.txt file doesn’t allow them to be crawled.
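A minimal sketch of link extraction with a robots.txt check, using only the Python standard library; the wildcard user agent is an assumption:

from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.robotparser

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_allowed_links(base_url, html, robots_txt):
    """Return absolute links from `html`, skipping those robots.txt disallows."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())

    collector = LinkCollector()
    collector.feed(html)

    absolute = (urljoin(base_url, href) for href in collector.links)
    return [link for link in absolute if rules.can_fetch("*", link)]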
Custom Url filter
The crawler needs to make sure it does not scan sites related to terrorism, porn, drugs, etc. It internally maintains a CustomUrlFilter list in the Redis cache and drops any link marked as blocked.
Duplicate Urls
This stage looks up each link in the Urls cache and drops it if it has already been sent for processing, acting as a checkpoint so that duplicate URLs are not processed (see the sketch below). The schema for this table:
Url | ProcessingStatus | FreshnessTimestamp
xx  | xxx              | xxxx
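A minimal sketch of the duplicate-URL checkpoint, assuming the Urls cache is backed by Redis; SET with NX makes the claim atomic, so two workers cannot enqueue the same URL:

import redis

r = redis.Redis()   # the Urls cache; connection details are assumptions

def try_claim_url(url):
    """Atomically mark a URL as queued; return False if it was already seen."""
    return r.set(f"url:{url}", "QUEUED", nx=True) is not None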
Url prioritizer
We have billions of URLs to scan, so it becomes critical to make sure the system spends more time scanning popular websites. It also needs to ignore spam or small files created to trap the crawler. We can maintain the following table to rank URLs.
Url | Rank
xx  | 111
This data will be updated externally based on the following properties.
- Number of visitors.
- Number of links.
- Most recent visit timestamp.
- Whether it has been flagged by any user.
- Feedback.
Based on these, we will assign a rank to each URL, which will be used to set its priority in the URL Frontier when picking the next URL.
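A minimal in-memory sketch of rank-based ordering; in the real design the rank lives in this table and the ordering is applied inside the URL Frontier:

import heapq

class UrlPrioritizer:
    """Pick the highest-ranked URL first."""

    def __init__(self):
        self._heap = []

    def add(self, url, rank):
        # heapq is a min-heap, so negate the rank to pop the highest rank first
        heapq.heappush(self._heap, (-rank, url))

    def next_url(self):
        if not self._heap:
            return None
        _, url = heapq.heappop(self._heap)
        return url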
Process Url Checks
We don’t want the crawler to get trapped in an infinite loop, processing the same URL again and again. At the same time, we would like to reprocess a URL when fresh content is published. We can take the following approach.
- Make a HEAD HTTP call to get the freshness timestamp.
- If the content is newer, enqueue the URL for processing again.
- If the freshness is the same as in the database, re-enqueue to the same queue with delayed delivery of ~x days. This makes sure we don’t lose the URL; instead we add a delay before retrying to see whether it needs to be rescanned.
Recrawling
The internal checkpoint for rescanning a URL is not sufficient, and not every website changes every few seconds. Therefore we will have to rescan all the URLs to crawl them again. We can take the following approach.
- Build a parallel pipeline dedicated to recrawling.
- Build an event-sourcing pipeline that reads the Urls table and initializes the URL Frontier queue with these URLs. This will trigger scanning again.
Scale Workers
- Number of URLs in 4 weeks: 15 billion
- Number of URLs per second: ~6200
- Concurrent threads processed by one app server: 100
- Number of messages one server can process concurrently: 100
- Number of app servers: 6200 / 100 ≈ 62
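The same numbers as a quick calculation:

# Back-of-the-envelope check of the worker count
pages_per_month = 15_000_000_000
seconds_in_4_weeks = 4 * 7 * 24 * 3600          # 2,419,200 seconds

urls_per_second = pages_per_month / seconds_in_4_weeks
print(round(urls_per_second))                   # ~6200 URLs per second

threads_per_server = 100                        # concurrent downloads per app server
print(round(urls_per_second / threads_per_server))   # ~62 app servers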
Reference
http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf
Happy system designing :-)