Wednesday, September 2, 2015

[NineChap System Design] Class 4.1: Crawler - Shuatiblog.com



http://www.1point3acres.com/bbs/thread-148865-1-1.html
4. Pirate. Design a Wikipedia crawler.
                  followup 1: no global status.
                  followup 2: deal with machine failure.
                  followup 3: make the wiki site unaware of this crawler.
Answers:
1. distributed BFS
2. consistent hashing
3. assign tasks with a random delay
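The three answers line up with the three followups. Answer 2, consistent hashing, lets every machine compute locally which crawler owns a URL, so no global status is needed, and when a machine fails only that machine's URLs get re-assigned. A minimal sketch in Python (class, method, and machine names are hypothetical, not from the lecture):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps URLs to crawler machines via consistent hashing (illustrative sketch)."""

    def __init__(self, machines, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted list of (hash, machine) virtual nodes
        for m in machines:
            self.add(m)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, machine):
        # Each machine gets many virtual nodes for a more even spread.
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._hash(f"{machine}#{i}"), machine))

    def remove(self, machine):
        # Machine failure: drop its virtual nodes; other URLs keep their owners.
        self.ring = [(h, m) for h, m in self.ring if m != machine]

    def lookup(self, url):
        # Owner is the first virtual node clockwise from the URL's hash.
        h = self._hash(url)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["crawler-1", "crawler-2", "crawler-3"])
owner = ring.lookup("https://en.wikipedia.org/wiki/BFS")
ring.remove("crawler-2")  # simulate a machine failure
```

The key property: after `remove`, only URLs that belonged to the failed machine move; all other assignments are unchanged.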

KISS – Keep It Simple, Sweetie.
In Today's lecture:
  1. write a crawler
  2. thread-safe consumer & producer
  3. GFS, BigTable and MapReduce
  4. Top 10 keyword/anagram using MapReduce
  5. Log analysis
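Topic 2, the thread-safe consumer & producer, can be sketched with Python's standard `queue.Queue`, which does the locking internally (a hedged sketch of the pattern; the lecture's own code may differ, and the sentinel-based shutdown is one choice among several):

```python
import queue
import threading

task_queue = queue.Queue(maxsize=100)  # bounded: producers block when the queue is full
results = []
DONE = object()  # sentinel telling consumers to stop

def producer(urls):
    for url in urls:
        task_queue.put(url)  # thread-safe; blocks if the queue is full
    task_queue.put(DONE)

def consumer():
    while True:
        url = task_queue.get()  # blocks until an item is available
        if url is DONE:
            task_queue.put(DONE)  # re-post the sentinel for the other consumers
            break
        results.append(f"crawled {url}")

threads = [threading.Thread(target=consumer) for _ in range(3)]
for t in threads:
    t.start()
producer([f"page{i}" for i in range(10)])
for t in threads:
    t.join()
```

The bounded queue also gives back-pressure: a fast producer cannot outrun slow consumers without eventually blocking.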

News Aggregator App

  1. Info Collection
    crawler
  2. Info retrieval: rank, search and recommend.
    They are, in fact, all related to sorting.
What is a socket? Where is it?

A socket is like the cellphone in the Call Center example.
It sits between the Application layer (HTTP, FTP, DNS) and the Transport layer (UDP, TCP).
Remember that a socket is like a cellphone: it is an abstraction layer that hides the complexity of the lower layers, making it easier to send data at the application layer.
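The cellphone analogy can be made concrete with Python's `socket` module. The sketch below runs a tiny echo server on a loopback port (the names and the upper-casing behavior are illustrative only); note that the application code just dials, talks, and hangs up, while the socket hides TCP handshakes and retransmission:

```python
import socket
import threading

def echo_server(server_sock):
    # Accept one connection, echo the data back upper-cased, then close.
    conn, _ = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())

# Server side: bind to an ephemeral loopback port and listen.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

# Client side: the "cellphone" call.
client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"hello crawler")
chunks = []
while True:  # read until the server closes the connection
    data = client.recv(1024)
    if not data:
        break
    chunks.append(data)
reply = b"".join(chunks)
client.close()
server.close()
```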

Crawl more websites

Simple design

  1. use crawlers to find out all list pages
  2. send the new lists to a Scheduler
  3. the Scheduler uses crawlers again to crawl the pages
This design is bad because crawlers sit idle and are wasted between steps. How can we reuse these crawlers?

Adv design

Design crawler that can crawl both list and pages information.
Look at our crawler: the text-extraction logic and the Regex for abc.com and bfe.com are totally different. However, they both share the same crawling techniques.
So we pass in all the info a crawler task needs. For example:
  1. We give list pages higher priority than content pages. Otherwise, your content gets out of date soon.
  2. Type includes both list/content and source info.
  3. Status can be done, working, or new.
  4. The timestamp helps us make sure each crawler task runs every hour (let's say).
So when the Scheduler picks the next crawler task to run, it chooses based on priority. However, if the timestamp (availableTime) is not yet reached, the job won't be executed.
If your crawler runs until endTime and hasn't finished, force-finish it. We should also add the task-created time to the info.
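The task record and the Scheduler's pick-by-priority-but-respect-availableTime rule can be sketched with a priority heap. The field names below are hypothetical; the text only says a task carries priority, type (list/content plus source), status, and timestamps:

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class CrawlerTask:
    priority: int        # smaller = more urgent; list pages beat content pages
    available_time: float  # don't run before this timestamp
    url: str = field(compare=False)
    task_type: str = field(compare=False)  # "list" or "content"
    status: str = field(compare=False, default="new")  # new / working / done
    created_time: float = field(compare=False, default_factory=time.time)

def pick_next(tasks):
    """Pop the highest-priority task whose availableTime has arrived."""
    now = time.time()
    skipped = []
    chosen = None
    while tasks:
        task = heapq.heappop(tasks)
        if task.available_time <= now:
            chosen = task
            chosen.status = "working"
            break
        skipped.append(task)  # not yet runnable; put it back afterwards
    for t in skipped:
        heapq.heappush(tasks, t)
    return chosen

tasks = []
heapq.heappush(tasks, CrawlerTask(0, 0, "abc.com/list", "list"))
heapq.heappush(tasks, CrawlerTask(1, 0, "abc.com/article/1", "content"))
heapq.heappush(tasks, CrawlerTask(0, time.time() + 3600, "bfe.com/list", "list"))
next_task = pick_next(tasks)
```

Here `abc.com/list` wins: it has top priority and its availableTime has passed, while `bfe.com/list` is equally urgent but not runnable for another hour, so it is skipped and pushed back.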
