Interview question, from http://www.1point3acres.com/bbs/thread-148865-1-1.html:
4. Pirate. Design Wikipedia crawler.
followup 1: No global status.
followup 2: deal with machine failure.
followup 3: make the wiki site unaware of this crawler.
Answers:
1. distributed BFS
2. consistent hashing
3. assign tasks with a random delay
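For answer 2, here is a minimal sketch of consistent hashing in Java, assuming each crawler machine owns arcs of a hash ring and each URL is routed to the first node clockwise; the node names, virtual-node count, and hash function are illustrative, not part of the original answer:

import java.util.SortedMap;
import java.util.TreeMap;

// Consistent hashing sketch: URLs map to crawler nodes on a hash ring,
// so adding or removing a machine only remaps a small fraction of URLs.
public class ConsistentHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private static final int VIRTUAL_NODES = 100; // replicas smooth the load

    public void addNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // Pick the first node clockwise from the URL's position on the ring.
    public String nodeFor(String url) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(url));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private int hash(String key) {
        // Illustrative only; a real ring would use a stronger hash (MD5, Murmur).
        return key.hashCode() & 0x7fffffff;
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("crawler-1");
        ring.addNode("crawler-2");
        ring.addNode("crawler-3");
        System.out.println(ring.nodeFor("https://en.wikipedia.org/wiki/Breadth-first_search"));
    }
}

This also connects to followup 2: when a machine dies, only the URLs on its arcs move to neighboring nodes, instead of reshuffling the whole assignment.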
[NineChap System Design] Class 4.1: Crawler - Shuatiblog.com
KISS – Keep It Simple, Sweetie.
In Today's lecture:
- write a crawler
- thread-safe consumer & producer (see the sketch after this list)
- GFS, BigTable and MapReduce
- Top 10 keyword/anagram using MapReduce
- Log analysis
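For the thread-safe consumer & producer topic, here is a minimal sketch using java.util.concurrent.BlockingQueue; the queue capacity and page strings are placeholders:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Producer-consumer sketch: a crawler thread produces pages, a parser consumes them.
// BlockingQueue does the locking: put() blocks when full, take() blocks when empty.
public class ProducerConsumerDemo {
    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) {
                    queue.put("page-" + i); // blocks if the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) {
                    System.out.println("parsed " + queue.take()); // blocks if empty
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
    }
}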
News Aggregator App
- Info Collection: crawler
- Info Retrieval: rank, search and recommend.
They are, in fact, all related to sorting.
What is a socket? Where is it?
It sits in between the Application Layer (HTTP, FTP, DNS) and the Transport Layer (UDP, TCP).
Remember that a socket is like a cellphone: it is an abstraction layer that hides the complexity of the lower layers, making it easier to send data at the application layer.
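Here is a minimal sketch of speaking HTTP through a socket, assuming a plain GET request; example.com and the request line are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Socket sketch: open a TCP connection and write an HTTP request on top of it.
// The socket hides the transport-layer details; we just write and read bytes.
public class SocketDemo {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("example.com", 80);
             PrintWriter out = new PrintWriter(socket.getOutputStream());
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.print("GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n");
            out.flush();
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}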
Crawl more websites
Simple design
- use crawlers to find out all list pages
- send the new lists to a Scheduler
- the Scheduler will use crawlers again to crawl the pages.
This design is bad because there is crawler waste. How can we reuse these crawlers?
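A toy sketch of this simple design, to make the waste concrete (all names are illustrative): list crawling and page crawling are separate stages, so the crawlers in one stage idle while the other runs.

import java.util.ArrayDeque;
import java.util.Queue;

// Simple-design sketch: two dedicated crawler roles, one per page type.
public class SimpleDesignDemo {
    public static void main(String[] args) {
        Queue<String> listPages = new ArrayDeque<>();
        Queue<String> contentPages = new ArrayDeque<>();
        listPages.add("http://abc.com/news/list");

        // Stage 1 -- list crawlers discover content pages from list pages.
        while (!listPages.isEmpty()) {
            String list = listPages.poll();
            contentPages.add(list + "/article-1"); // pretend extraction
            contentPages.add(list + "/article-2");
        }
        // From here on, the list crawlers have nothing left to do.

        // Stage 2 -- content crawlers fetch the actual articles.
        while (!contentPages.isEmpty()) {
            System.out.println("crawling " + contentPages.poll());
        }
    }
}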
Advanced design
Design a crawler that can crawl both list pages and content pages.
Look at our crawler: the text extraction logic and the regex for abc.com and bfe.com are totally different. However, they both share the same crawling techniques.
So we pass in all the info a crawler task needs, like:
- we give higher priority to list pages than to content pages; otherwise, your content gets out of date soon.
- type includes both the list/content kind and the source info.
- status can be done, working, or new.
- timestamps help us make sure each crawler task runs every hour (let's say).
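A sketch of the task record these fields suggest; the names mirror the list above and are assumptions, not a fixed schema (createdTime anticipates the note below about task creation time):

// Crawler-task sketch with the fields described above.
public class CrawlerTask {
    enum Type { LIST, CONTENT }
    enum Status { NEW, WORKING, DONE }

    String url;
    String source;      // e.g. "abc.com" -- selects the extraction regex
    Type type;          // list pages get higher priority than content pages
    Status status = Status.NEW;
    int priority;       // larger means more urgent
    long availableTime; // epoch millis; do not run before this time
    long createdTime = System.currentTimeMillis();
}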
So when the scheduler picks the next crawler task to run, it chooses based on priority. However, if the timestamp (availableTime) has not been reached yet, the job won't be executed.
If your crawler runs until endTime and hasn't finished, force-finish it. We should also add the task creation time to the info.
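A sketch of that scheduling rule, assuming the CrawlerTask sketched above: pick by priority, skip tasks whose availableTime is still in the future, and force-finish a run at endTime (crawlOneStep is a hypothetical stub):

import java.util.Comparator;
import java.util.PriorityQueue;

// Scheduler sketch: hand out the highest-priority task that is already available.
public class Scheduler {
    private final PriorityQueue<CrawlerTask> tasks = new PriorityQueue<>(
            Comparator.comparingInt((CrawlerTask t) -> t.priority).reversed());

    public void submit(CrawlerTask task) {
        tasks.add(task);
    }

    // Return the next runnable task, or null if nothing is available yet.
    public CrawlerTask pickNext(long now) {
        PriorityQueue<CrawlerTask> notReady = new PriorityQueue<>(tasks.comparator());
        CrawlerTask picked = null;
        while (!tasks.isEmpty()) {
            CrawlerTask t = tasks.poll();
            if (t.availableTime <= now) { picked = t; break; }
            notReady.add(t); // availableTime not reached; keep for later
        }
        tasks.addAll(notReady);
        return picked;
    }

    // Worker side: run the task, but force-finish when endTime is reached.
    public void run(CrawlerTask t, long endTime) {
        t.status = CrawlerTask.Status.WORKING;
        while (System.currentTimeMillis() < endTime && crawlOneStep(t)) {
            // keep crawling until the task is done or time runs out
        }
        t.status = CrawlerTask.Status.DONE; // force-finish even if unfinished
    }

    // Hypothetical stub: one unit of crawling; returns false when the task is done.
    private boolean crawlOneStep(CrawlerTask t) {
        return false;
    }
}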
Read full article from [NineChap System Design] Class 4.1: Crawler - Shuatiblog.com