Interview question, from http://www.1point3acres.com/bbs/thread-148865-1-1.html:
4. Pirate. Design Wikipedia crawler.
followup 1: No global status.
followup 2: deal with machine failure.
followup 3: make the wiki site unaware of this crawler.
Answers:
1. distributed BFS
2. consistent hashing
3. assign tasks with a random delay
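For answer 2, here is a minimal sketch of consistent hashing in Java, assuming each crawler machine owns arcs of a hash ring and each URL is routed to the first node clockwise; the node names, virtual-node count, and hash function are illustrative, not part of the original answer:

import java.util.SortedMap;
import java.util.TreeMap;

// Consistent hashing sketch: URLs map to crawler nodes on a hash ring,
// so adding or removing a machine only remaps a small fraction of URLs.
public class ConsistentHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private static final int VIRTUAL_NODES = 100; // replicas smooth the load

    public void addNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // Pick the first node clockwise from the URL's position on the ring.
    public String nodeFor(String url) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(url));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private int hash(String key) {
        // Illustrative only; a real ring would use a stronger hash (MD5, Murmur).
        return key.hashCode() & 0x7fffffff;
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("crawler-1");
        ring.addNode("crawler-2");
        ring.addNode("crawler-3");
        System.out.println(ring.nodeFor("https://en.wikipedia.org/wiki/Breadth-first_search"));
    }
}

This also connects to followup 2: when a machine dies, only the URLs on its arcs move to neighboring nodes, instead of reshuffling the whole assignment.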
[NineChap System Design] Class 4.1: Crawler - Shuatiblog.com
KISS – Keep It Simple, Sweetie.
In Today's lecture:
- write a crawler
- thread-safe consumer & producer (see the sketch after this list)
- GFS, BigTable and MapReduce
- Top 10 keyword/anagram using MapReduce
- Log analysis
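For the thread-safe consumer & producer topic, here is a minimal sketch using java.util.concurrent.BlockingQueue; the queue capacity and page strings are placeholders:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Producer-consumer sketch: a crawler thread produces pages, a parser consumes them.
// BlockingQueue does the locking: put() blocks when full, take() blocks when empty.
public class ProducerConsumerDemo {
    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) {
                    queue.put("page-" + i); // blocks if the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) {
                    System.out.println("parsed " + queue.take()); // blocks if empty
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
    }
}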
News Aggregator App
- Info Collection: crawler
- Info Retrieval: rank, search and recommend.
They are, in fact, all related to sorting.
What is a socket? Where is it?
It sits in between the Application Layer (HTTP, FTP, DNS) and the Transport Layer (UDP, TCP).
Remember that a socket is like a cellphone: it is an abstraction layer that hides the complexity of the lower layers, making it easier to send data at the application layer.
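Here is a minimal sketch of speaking HTTP through a socket, assuming a plain GET request; example.com and the request line are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Socket sketch: open a TCP connection and write an HTTP request on top of it.
// The socket hides the transport-layer details; we just write and read bytes.
public class SocketDemo {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("example.com", 80);
             PrintWriter out = new PrintWriter(socket.getOutputStream());
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.print("GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n");
            out.flush();
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}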
Crawl more websites
Simple design
- use crawlers to find out all list pages
- send the new lists to a Scheduler
- the Scheduler will use crawlers again to crawl the pages.
This design is bad because there is crawler waste. How can we reuse these crawlers?
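A toy sketch of this simple design, to make the waste concrete (all names are illustrative): list crawling and page crawling are separate stages, so the crawlers in one stage idle while the other runs.

import java.util.ArrayDeque;
import java.util.Queue;

// Simple-design sketch: two dedicated crawler roles, one per page type.
public class SimpleDesignDemo {
    public static void main(String[] args) {
        Queue<String> listPages = new ArrayDeque<>();
        Queue<String> contentPages = new ArrayDeque<>();
        listPages.add("http://abc.com/news/list");

        // Stage 1 -- list crawlers discover content pages from list pages.
        while (!listPages.isEmpty()) {
            String list = listPages.poll();
            contentPages.add(list + "/article-1"); // pretend extraction
            contentPages.add(list + "/article-2");
        }
        // From here on, the list crawlers have nothing left to do.

        // Stage 2 -- content crawlers fetch the actual articles.
        while (!contentPages.isEmpty()) {
            System.out.println("crawling " + contentPages.poll());
        }
    }
}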
Advanced design
Design a crawler that can crawl both list pages and content pages.
Look at our crawler: the text extraction logic and the regex for abc.com and bfe.com are totally different. However, they both share the same crawling techniques.
So we pass in all the info a crawler task needs, like:
- we give higher priority to list pages than to content pages; otherwise, your content gets out of date soon.
- type includes both the list/content kind and the source info.
- status can be done, working, or new.
- timestamps help us make sure each crawler task runs every hour (let's say).
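A sketch of the task record these fields suggest; the names mirror the list above and are assumptions, not a fixed schema (createdTime anticipates the note below about task creation time):

// Crawler-task sketch with the fields described above.
public class CrawlerTask {
    enum Type { LIST, CONTENT }
    enum Status { NEW, WORKING, DONE }

    String url;
    String source;      // e.g. "abc.com" -- selects the extraction regex
    Type type;          // list pages get higher priority than content pages
    Status status = Status.NEW;
    int priority;       // larger means more urgent
    long availableTime; // epoch millis; do not run before this time
    long createdTime = System.currentTimeMillis();
}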
So when the scheduler picks the next crawler task to run, it chooses based on priority. However, if the timestamp (availableTime) has not been reached yet, the job won't be executed.
If your crawler runs until endTime and hasn't finished, force-finish it. We should also add the task creation time to the info.
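A sketch of that scheduling rule, assuming the CrawlerTask sketched above: pick by priority, skip tasks whose availableTime is still in the future, and force-finish a run at endTime (crawlOneStep is a hypothetical stub):

import java.util.Comparator;
import java.util.PriorityQueue;

// Scheduler sketch: hand out the highest-priority task that is already available.
public class Scheduler {
    private final PriorityQueue<CrawlerTask> tasks = new PriorityQueue<>(
            Comparator.comparingInt((CrawlerTask t) -> t.priority).reversed());

    public void submit(CrawlerTask task) {
        tasks.add(task);
    }

    // Return the next runnable task, or null if nothing is available yet.
    public CrawlerTask pickNext(long now) {
        PriorityQueue<CrawlerTask> notReady = new PriorityQueue<>(tasks.comparator());
        CrawlerTask picked = null;
        while (!tasks.isEmpty()) {
            CrawlerTask t = tasks.poll();
            if (t.availableTime <= now) { picked = t; break; }
            notReady.add(t); // availableTime not reached; keep for later
        }
        tasks.addAll(notReady);
        return picked;
    }

    // Worker side: run the task, but force-finish when endTime is reached.
    public void run(CrawlerTask t, long endTime) {
        t.status = CrawlerTask.Status.WORKING;
        while (System.currentTimeMillis() < endTime && crawlOneStep(t)) {
            // keep crawling until the task is done or time runs out
        }
        t.status = CrawlerTask.Status.DONE; // force-finish even if unfinished
    }

    // Hypothetical stub: one unit of crawling; returns false when the task is done.
    private boolean crawlOneStep(CrawlerTask t) {
        return false;
    }
}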
Read full article from [NineChap System Design] Class 4.1: Crawler - Shuatiblog.com