Saturday, July 25, 2015

Cracking the coding interview--Q12.5 - - 博客频道 - CSDN.NET



Cracking the coding interview--Q12.5 - - 博客频道 - CSDN.NET

If you were designing a web crawler, how would you avoid getting into infinite loops?

译文:

如果你正在设计一个网络爬虫,何如避免进入无限循环?

解答

首先了解什么是网络爬虫,http://blog.csdn.net/navyifanr/article/details/21442653

网络爬虫的通用框架如图:

    网络爬虫的基本工作流程如下:

    1.首先选取一部分精心挑选的种子URL;

    2.将这些URL放入待抓取URL队列;

    3.从待抓取URL队列中取出待抓取在URL,解析DNS,并且得到主机的ip,并将URL对应的网页下载下来,存储进已下载网页库中。此外,将这些URL放进已抓取URL队列。

    4.分析已抓取URL队列中的URL,分析其中的其他URL,并且将URL放入待抓取URL队列,从而进入下一个循环。

    由上可知,当我们重新去解析一个已经解析过的网页时, 就会陷入无限循环。这意味着我们会重新访问那个网页的所有链接, 然后不久后又会访问到这个网页。最简单的例子就是,网页A包含了网页B的链接, 而网页B又包含了网页A的链接,那它们之间就会形成一个闭环。

         那么我们怎样防止访问已经访问过的页面呢,可以通过设置一个标志。 整个互联网就是一个图结构,我们通常使用DFS(深度优先搜索)和BFS(广度优先搜索) 进行遍历。所以,像遍历一个简单的图一样,将访问过的结点标记一下即可。


Read full article from Cracking the coding interview--Q12.5 - - 博客频道 - CSDN.NET

Labels

Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

Popular Posts