Monday, June 29, 2015

Design a web crawler

4. Pirate. Design Wikipedia crawler.
                  followup 1: No global status.
                  followup 2: deal with machine failure
                  followup 3: make the wiki site unaware of this crawler.
1. distributed bfs
2. consistent hashing
3. assign task with a random delay

Algo Ramblings: Design a web crawler
Crawler basic algorithm
  1. Remove a URL from the unvisited URL list
  2. Determine the IP Address of its host name
  3. Download the corresponding document
  4. Extract any links contained in it.
  5. If the URL is new, add it to the list of unvisited URLs
  6. Process the downloaded document
  7. Back to step 1
Issues on crawler

General architecture
What pages should the crawler download ?
How should the crawler refresh pages ?
How should the load on the visited web sites be minimized ?
How should the crawling process be parallelized ?
the internet could be abstracted as a directed graph, with each page as a node and hyperlinks as an edge.

How to crawl?
BFS is normally used.

Decide what you want to crawl?
Scope: crawl whole internet or just company web site.
Having a few websites in your crawler’s most frequent list to crawl, such as some authoritative news website, etc

You should have lots of fetchers living on many host classes. Use machine learning to predict which websites are most likely to have frequent update, put those into the fetchers’ priority queue.

Hot to track what your fetchers have crawled?
You don’t want your fetchers to crawl some website over and over again while other websites don’t get crawled at all. There are many ways to achieve this. For example, your scheduler should generate non-duplicate jobs for fetchers. Or,
your fetchers could keep track off the last visited time for each URL. Note, due to the scalability, a consistent hashmap is needed.

Respect the standard?
If a crawler sees a robots.txt on a site, it should skip crawling it.
You need to have some ways to execute those javascript and get the URL.

How to rank:
Page rank algorithm

Forget about the fact that you’re dealing with billions of pages. How would you design this system if it were just a small number of pages? You should have an understanding of how you would solve the simple, small case in order to understand how you would solve the bigger case.

2. Now, think about the issues that occur with billions of pages. Most likely you can’t fit
the data on one machine. How will you di
Data Structure
We use priority search based graph traversal for our web crawling. Its basically breadth first search, but we use a priority queue instead of a queue to determine which URL to crawl next.
==> BFS+score, so we will crawl important pages/sites more timely.

We assign an importance score to each URL which we discover and then crawl them accordingly. 
We use Redis sorted sets to store the priority associated with each URL and hashes to store the visited status of the discovered URLs. This, of course, comes with a large memory footprint.

A possible alternative would be to use Bloom filters.
==> False positive actually means that the bloom filter will tell you that you have already visited a URL but you actually haven't. The consequent of this is that you will miss crawling web pages rather than crawling the same page twice.

Since one of the tasks we do is in building price histories of products, recrawling of pages needs to be done intelligently.
We want to recrawl products that frequently change prices frequently (duh) as compared to pages that don’t really change their price. 
Hence a brute-force complete recrawl doesn’t really make sense.

We use a power law distribution (some pages are crawled more frequently than other pages) for recrawling products, with pages ranked based on their importance (using signals ranging from previous price history changes to how many reviews they have).

Another challenge is in page discovery (eg: if a new product has been launched, how quickly can we have it in our system).
===> if the crawler is built for companies site, we can provide api so client can add or manually crawl new added pages.

Check resource from
如果让你来设计一个最基本的Web Crawler,该如何设计?需要考虑的因素有哪些?


这个问题是面试中常见的设计类问题。实际上如果你没有做过相关的设计,想要回答出一个让面试官满意的结果其实并不是很容易。该问题并不局限于你在去面试搜索引擎公司时可能会问到。这里,我们从Junior Level和Senior Level两个角度来解答这个问题。

1. 如何抽象整个互联网

Junior: 抽象为一个无向图,网页为节点,网页中的链接为有向边。

Senior: 同上。

2. 抓取算法
Junior: 采用BFS的方法,维护一个队列,抓取到一个网页以后,分析网页的链接,扔到队列里。
Senior: 采用优先队列调度,区别于单纯的BFS,对于每个网页设定一定的抓取权重,优先抓取权重较高的网页。对于权重的设定,考虑的因素有:1. 是否属于一个比较热门的网站 2. 链接长度 3. link到该网页的网页的权重 4. 该网页被指向的次数 等等。

3. 网络模型
Junior: 多线程抓取。
Senior: 分别考虑单机抓取和分布式抓取的情况。对于Windows的单机,可以使用IOCP完成端口进行异步抓取,该种网络访问的方式可以最大程度的利用闲散资源。因为网络访问是需要等待的,如果简单的同时开多个线程,计算机用于线程间切换的耗费会非常大,这种用于处理抓取结果的时间就会非常少。IOCP可以做到使用几个线程就完成几十个线程同步抓取的效果。对于多机的抓取,需要考虑机器的分布,如抓取亚洲的站点,则用在亚洲范围内的计算机等等。
4. 实时性
Junior: 无需回答
Senior: 新闻网页的抓取一般来说是利用单独的爬虫来完成。新闻网页抓取的爬虫的权重设置与普通爬虫会有所区别。首先需要进行新闻源的筛选,这里有两种方式,一种是人工设置新闻源,如新浪首页,第二种方式是通过机器学习的方法。新闻源可以定义链接数非常多,链接内容经常变化的网页。从新闻源网页出发往下抓取给定层级限制的网页所得到,再根据网页中的时间戳信息判断,就可以加入新闻网页。
5. 网页更新
Junior: 无需回答。
Senior: 网页如果被抓下来以后,有的网页会持续变化,有的不会。这里就需要对网页的抓取设置一些生命力信息。当一个新的网页链接被发现以后,他的生命力时间戳信息应该是被发现的时间,表示马上需要被抓取,当一个网页被抓取之后,他的生命力时间戳信息可以被设置为x分钟以后,那么,等到x分钟以后,这个网页就可以根据这个时间戳来判断出,他需要被马上再抓取一次了。一个网页被第二次抓取以后,需要和之前的内容进行对比,
Keep in mind a production web crawler could be very sophisticated and normally takes a few teams weeks/months to develop. The interviewer would not expect you to cover all the detail, but you should be able to mention some key design perspectives.

How to abstract the internet?
You should quickly realize the internet could be abstracted as a directed graph, with each page as a node and hyperlinks as an edge.

How to crawl?

BFS is normally used. However, DFS is also used in some situation, such as if your crawler has already established a connection with the website, it might just DFS all the URLs within this website to save some handshaking overhead.

Decide what you want to crawl?

Internet is huge thus your graph is huge. It is almost impossible to crawl the entire internet since it is keep growing every sec. Google probably has 60 trillion while Bing has 30 trillion, roughly speaking.
Some strategy used in practice includes

a) Having a few websites in your crawler’s most frequent list to crawl, such as some authoritative news website, etc

b) You should have lots of fetchers living on many host classes. Use machine learning to predict which websites are most likely to have frequent update, put those into the fetchers’ priority queue.

Hot to track what your fetchers have crawled?
You don’t want your fetchers to crawl some website over and over again while other websites don’t get crawled at all. There are many ways to achieve this. For example, your scheduler should generate non-duplicate jobs for fetchers. Or, your fetchers could keep track off the last visited time for each URL. Note, due to the scalability, a consistent hashmap is needed.

Respect the standard?
If a crawler sees a robots.txt on a site, it should skip crawling it. However, if this happens a lot for a fetcher, it is doing less work on the run. Some optimization could be let the fetcher pick some other sites to crawl to reduce the overhead of spawning a fetcher.

Sometime, URL on a website is not easy to extract. For example, some URLs are generated by javascript. You need to have some ways to execute those javascript and get the URL.


Robustness: avoid  fetching an infinite number of pages in a particular domain.
Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them.

 Performance and efficiency: CPU, storage and network bandwidth.
 Quality:be biased towards fetching “useful” pages first.
 Freshness: continuous mode: it should obtain fresh copies of previously fetched pages.
 Extensible:  e.g. new data formats, new fetch protocols.

begin with a seed set

Check already fetched -- The simplest implementation for this would
use a simple fingerprint such as a checksum.

A more sophisticated test would use shingles
URL filter is used to determine whether the extracted URL should be excluded from the frontier based on one of several tests.

Many hosts on theWeb place certain portions of their websites off-limits to crawling, under a standard known as the Robots Exclusion Protocol. This is done by placing a file with the name robots.txt at the root of the URL hierarchy at the site. maintaining a cache of robots.txt files

URL should be normalized in the following sense: often the HTML encoding of a link from a web page p indicates the target of that link relative to the page p.

Finally, the URL is checked for duplicate elimination. When the URL is added to the frontier, it is assigned a priority based on which it is eventually removed from the frontier for fetching.

Certain housekeeping tasks are typically performed by a dedicated thread. 
This thread is generally quiescent except that it wakes up once every few seconds to log crawl progress statistics (URLs crawled, frontier size, etc.), decide whether to terminate the crawl, or (once every few hours of crawling) checkpoint the crawl. 

In checkpointing, a snapshot of the crawler’s state (say the URL frontier) is committed to disk. In the event of a catastrophic crawler
failure, the crawl is restarted from the most recent checkpoint.

Documents change over time and so, in the context of continuous crawling, we must be able to delete their outdated fingerprints/shingles from the content-seen set(s). In order to do so, it is necessary to save the fingerprint/shingle of the document in the URL frontier, along with the URL itself.

DNS resolution is a well-known bottleneck in web crawling
. DNS resolution may entail multiple requests and round-trips across the internet, requiring seconds and sometimes even longer.

DNS cache, are generally synchronous. 
This means that once a request is made to the Domain Name Service, other crawler threads at that node are blocked until the first request is completed. To circumvent this, most web crawlers implement their own DNS resolver as a component of the crawler. 
Thread executing the resolver code sends a message to the DNS server and then performs a timed wait: it resumes either when being signaled by another thread or when a set time quantum expires. A single, separate DNS thread listens on the standard DNS port (port 53) for incoming response packets from the name service. 

Upon receiving a response, it signals the appropriate crawler thread (in this case, i) and hands it the response packet if i has not yet resumed because its time quantum has expired. A crawler thread that resumes because its wait time quantum has expired retries for a fixed number of attempts, sending out a new message to the DNS server and performing
a timed wait each time; the designers of Mercator recommend of the order
of five attempts. The time quantum of the wait increases exponentially with
each of these attempts; Mercator started with one second and ended with
roughly 90 seconds, in consideration of the fact that there are host names
that take tens of seconds to resolve partitioning by terms, also known as global index organization, and partitioning by documents,

 The URL frontier
 high-quality pages that change frequently should be
prioritized for frequent crawling. Thus, the priority of a page should be a
function of both its change rate and its quality (using some reasonable quality
estimate). The combination is necessary because a large number of spam
pages change completely on every fetch

The second consideration is politeness: we must avoid repeated fetch requests
to a host within a short time span.
A common heuristic is to insert a
gap between successive fetch requests to a host that is an order of magnitude
larger than the time taken for the most recent fetch from that host.
Its goals are to ensure that (i) only one connection is open at a time to any
host; (ii) a waiting time of a few seconds occurs between successive requests
to a host and (iii) high-priority pages are crawled preferentially.

In the former, the dictionary of index terms is partitioned into subsets, each subset residing at a node.
Along with the terms at a node, we keep the postings for those terms. A
query is routed to the nodes corresponding to its query terms. In principle,
this allows greater concurrency since a stream of queries with different query
terms would hit different sets of machines.
In practice, partitioning indexes by vocabulary terms turns out to be nontrivial.
Multi-word queries require the sending of long postings lists between
sets of nodes for merging, and the cost of this can outweigh the greater concurrency.
Load balancing the partition is governed not by an a priori analysis
of relative term frequencies, but rather by the distribution of query terms
and their co-occurrences, which can drift with time or exhibit sudden bursts.
Achieving good partitions is a function of the co-occurrences of query terms
and entails the clustering of terms to optimize objectives that are not easy to
quantify. Finally, this strategy makes implementation of dynamic indexing
more difficult.
A more common implementation is to partition by documents: each node
contains the index for a subset of all documents. Each query is distributed to
all nodes, with the results fromvarious nodes being merged before presentation
to the user. This strategy trades more local disk seeks for less inter-node
communication. One difficulty in this approach is that global statistics used
in scoring – such as idf – must be computed across the entire document collection
even though the index at any single node only contains a subset of
the documents. These are computed by distributed “background” processes
that periodically refresh the node indexes with fresh global statistics
slaveBloom Filter放到master的内存里,而被访问过的url放到运行在master上的Redis里,这样保证所有操作都是O(1)

Master:读写task list这种工作需要统一交给一个master机器来做,这样能保持数据的一致性,也更便于管理,所以也就需要这个机器读写速度快,稳定。
Slave:而slave做的就是爬数据,把爬到的东西传回到master里面。 多一个,少一个,甚至坏一个都不太影响整个爬虫的工作。这样的机器运算速度不需要特别高。

-       一个是list, 那种汇聚了许多新闻超链接的页面
-       还有一种是page, 也就是普通的页面

2. Multiple Page Website Crawler
爬虫首先要找到那个list 页面,从这个页面开始搜集page。这个就产生了最基本的爬多个页面网站的爬虫架构:

上图是一个很基础的爬虫结构。值得注意的是对于每个网站来说,list crawler 只有一个,但是news crawler有许多个,而且news crawler 可以复用。
scheduler 就算是一个master,在控制整个爬虫的工作。但是,这也有一个问题——复用。这些list crawler会有浪费。比如说新浪没有啥新闻,list crawler就会闲置在那里;而163有很多新闻,crawler却不够使,这是浪费资源。


3. Master-Slave Crawler

这种结构没有分是list 还是 page crawler。 由一个master 的负责读写,其他crawler 来爬。这样就很好解决了一个资源浪费的问题。
这样就很好解决了一个资源浪费的问题。在这幅图里面画了task table的架构。这里面有一些列还是值得思考的。
- Priority
- State
还有state在这里包含 done, working, new 这几个状态。这个可以用于管理,这个也可以用来做verification。比如说一个爬虫的状态是done了。那就应该是爬过了。但是如果没有得到应有的数据,及link里面的内容。我就可以拿它来查是不是那个爬虫坏了。
- Timestamp


然后就应该说说那个scheduler 这个可以是个不过分的考点。考多线程的锁。这是第一种加锁的方式:睡眠。

1) Sleep
当有任务的时候爬,没任务的时候睡。这个方式的缺点在于睡多久。课件里面给的是睡一秒。但是外一这个一秒有任务了。爬虫还是不能马上开始爬。但是睡眠还是多线程一个很常见的处理方式。这个图很清晰。 按照视频的说法是把他理解并背诵
2) Conditional Variable

1.    自己观察
2.    可以排队
3.    有人通知

那个Cond_wait 的意思是: 没有任务的话就等着, 内部释放锁。
那个Cond_signal 就是去唤醒一个等着的线程
这是具体那两个conditional variable 的具体实现。


3) Semaphore

这个在视频里面也有一个很好的比喻。比如说我能处理有10 个新任务,我就有10个令牌。如果没有令牌就意味着没有新任务。这个爬虫就开始等着。当爬虫爬完页面, 就把令牌放回。
- Advantage

然后呢。视频里面对于这个生产者消费者模型,借着爬虫又详细的讲了一下。而且是对于这种处理大量数据的情况做了很经典的讲解。说到大数据。我们一台机器上肯定放不下所有的爬虫 肯定我们就要把他们部署在多个机器上。每个crawler负责爬页面获取任务 并把得到的页面发回给connector。由这个connector负责往table 里面读写。但是这样每个crawler就需要一个connector有点浪费。

所以, 为了减少connector 就按照crawler 的工作内容: 拿任务,送回页面 分成了两类 senderreceiver这样一组爬虫结构 就需要两个大的控制结构就好。


3. 设计实现
3.1 Master:
3.2 数据结构
对于一个CrawlManager,它主要管理两个数据结构UrlToDo,和UrlDone,前者可以抽象成一个链表,栈或者有优先级的队列,后者对外的表现是一个Set,做去重的工作。当定义出ADT(abstract data type)以后,则可以考虑出怎么样的去实现这个数据结构。这样的设计方法其实和实现一个数据结构是一样的,只不过当我们实现数据结构的时候,操作的对象是内存中的数组和列表,而在这个项目中,我们操作的对象是各种存储中提供给我们的功能,例如Redis中的List、Set,关系型数据库中的表等等。

4. 后记
Read full article from Algo Ramblings: Design a web crawler

No comments:

Post a Comment


Review (554) System Design (293) System Design - Review (189) Java (178) Coding (75) Interview-System Design (65) Interview (60) Book Notes (59) Coding - Review (59) to-do (45) Knowledge (39) Linux (39) Interview-Java (35) Knowledge - Review (32) Database (30) Design Patterns (29) Product Architecture (28) Big Data (27) Soft Skills (27) Miscs (25) MultiThread (25) Concurrency (24) Cracking Code Interview (24) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Distributed (20) Interview Q&A (20) OOD Design (20) System Design - Practice (19) Security (17) Algorithm (15) How to Ace Interview (15) Brain Teaser (14) Google (13) Linux - Shell (13) Spark (13) Spring (13) Code Quality (12) How to (12) Interview-Database (12) Interview-Operating System (12) Redis (12) Tools (12) Architecture Principles (11) Company - LinkedIn (11) Testing (11) Resource (10) Solr (10) Amazon (9) Cache (9) Search (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Company - Uber (8) Interview - MultiThread (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Scalability (8) Cassandra (7) Git (7) Interview Corner (7) JVM (7) Java Basics (7) Machine Learning (7) NoSQL (7) C++ (6) Design (6) File System (6) Highscalability (6) How to Better (6) Kafka (6) Network (6) Restful (6) Trouble Shooting (6) CareerCup (5) Code Review (5) Company - Facebook (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Be Architect (4) Big Fata (4) C (4) Company Product Architecture (4) Data structures (4) Design Principles (4) Facebook (4) GeeksforGeeks (4) Generics (4) Google Interview (4) Hardware (4) JDK8 (4) Optimization (4) Product + Framework (4) Shopping System (4) Source Code (4) Web Service (4) node.js (4) Back-of-Envelope (3) Company - Pinterest (3) Company - Twiiter (3) Company - Twitter (3) Consistent Hash (3) GOF (3) Game Design (3) GeoHash (3) Growth (3) Guava (3) Interview-Big Data (3) Interview-Linux (3) Interview-Network (3) Java EE Patterns (3) Javarevisited (3) Map Reduce (3) Math - Probabilities (3) Performance (3) Puzzles (3) Python (3) Resource-System Desgin (3) Scala (3) UML (3) geeksquiz (3) AI (2) API Design (2) AngularJS (2) Behavior Question (2) Bugs (2) Coding Interview (2) Company - Netflix (2) Crawler (2) Cross Data Center (2) Data Structure Design (2) Database-Shard (2) Debugging (2) Docker (2) Elasticsearch (2) Garbage Collection (2) Go (2) Hadoop (2) Html (2) Interview - Soft Skills (2) Interview-Miscs (2) Interview-Web (2) JDK (2) Logging (2) POI (2) Papers (2) Programming (2) Project Practice (2) Random (2) Software Desgin (2) System Design - Feed (2) Thread Synchronization (2) Video (2) ZooKeeper (2) reddit (2) Ads (1) Advanced data structures (1) Algorithm - Review (1) Android (1) Approximate Algorithms (1) Base X (1) Bash (1) Books (1) C# (1) CSS (1) Chrome (1) Client-Side (1) Cloud (1) CodingHorror (1) Company - Yelp (1) Counter (1) DSL (1) Dead Lock (1) Difficult Puzzles (1) Distributed ALgorithm (1) Eclipse (1) Facebook Interview (1) Function Design (1) Functional (1) GoLang (1) How to Solve Problems (1) ID Generation (1) IO (1) Important (1) Internals (1) Interview - Dropbox (1) Interview - Project Experience (1) Interview Tips (1) Interview-Brain Teaser (1) Interview-How (1) Interview-Mics (1) Interview-Process (1) Jeff Dean (1) Joda (1) LeetCode - Review (1) Library (1) LinkedIn (1) LintCode (1) Mac (1) Micro-Services (1) Mini System (1) MySQL (1) Nigix (1) NonBlock (1) Process (1) Productivity (1) Program Output (1) Programcreek (1) Quora (1) RPC (1) Raft (1) RateLimiter (1) Reactive (1) Reading (1) Reading Code (1) Refactoring (1) Resource-Java (1) Resource-System Design (1) Resume (1) SQL (1) Sampling (1) Shuffle (1) Slide Window (1) Spotify (1) Stability (1) Storm (1) Summary (1) System Design - TODO (1) Tic Tac Toe (1) Time Management (1) Web Tools (1) algolist (1) corejavainterviewquestions (1) martin fowler (1) mitbbs (1)

Popular Posts