Massive Technical Interviews Tips: How Nutch2 Works

Sunday, January 18, 2015

How Nutch2 Works

Benefits of Nutch
http://blog.commoncrawl.org/2014/02/common-crawl-move-to-nutch/
Nutch runs entirely as a small number of Hadoop MapReduce jobs that delegate most of the core work (fetching pages, filtering and normalizing URLs, and parsing responses) to plug-ins.

The plug-in architecture of Nutch allowed the Common Crawl team to isolate most of the customizations needed for their own processes into plug-ins, without changing the Nutch code itself. This makes it much easier to merge in changes from the larger Nutch community, which in turn simplifies maintenance.
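To make the plug-in idea concrete, here is a minimal sketch in Python of a core routine that only knows how to call plug-ins, while the URL normalizing and filtering logic lives in the plug-ins themselves. This is not Nutch's actual Java plug-in API: the names `lowercase_host`, `http_only`, and `process` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, Optional
from urllib.parse import urlsplit, urlunsplit

# Hypothetical plug-in signatures (NOT Nutch's real interfaces):
# a normalizer rewrites a URL; a filter returns the URL to keep it, or None to drop it.
Normalizer = Callable[[str], str]
UrlFilter = Callable[[str], Optional[str]]


def lowercase_host(url: str) -> str:
    """Normalizer plug-in: lowercase the scheme and network location, drop the fragment."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def http_only(url: str) -> Optional[str]:
    """Filter plug-in: keep only http and https URLs."""
    return url if urlsplit(url).scheme in ("http", "https") else None


def process(
    url: str, normalizers: Iterable[Normalizer], filters: Iterable[UrlFilter]
) -> Optional[str]:
    """Core routine: it contains no URL policy of its own, it just runs the plug-ins."""
    for normalize in normalizers:
        url = normalize(url)
    for keep in filters:
        result = keep(url)
        if result is None:
            return None  # a filter rejected the URL
        url = result
    return url


if __name__ == "__main__":
    print(process("HTTP://Example.COM/a#top", [lowercase_host], [http_only]))
    print(process("ftp://example.com/file", [lowercase_host], [http_only]))
```

Because `process` never changes when a new normalizer or filter is added, site-specific behavior can be swapped in and out without touching the core, which is the maintenance benefit described above.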

http://stackoverflow.com/questions/11696422/what-is-going-on-inside-of-nutch-2
Nutch 2 runs on Hadoop MapReduce, and a lot of effort has gone into making that framework as efficient as possible.

It can handle computers breaking down in the middle of a job by reassigning their work to other slaves. It can also handle some slaves being faster than others.

The Master may decide to run a slave's task on its own machine instead of sending it out to a slave if that will be more efficient. The communication layer between the Master and the slaves is highly developed.
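The reassignment behavior can be illustrated with a toy sketch in Python. It is a sequential simulation under simplifying assumptions, not how Hadoop schedules tasks: workers (the "slaves" above) are plain functions, a crash is modeled as a `RuntimeError`, and the names `run_with_reassignment`, `broken_worker`, and `healthy_worker` are hypothetical.

```python
from typing import Callable, Dict, List


def run_with_reassignment(
    tasks: Dict[str, str], workers: List[Callable[[str], str]]
) -> Dict[str, str]:
    """Run each task on a worker; if that worker fails, hand the task to the next one."""
    results: Dict[str, str] = {}
    for task_id, payload in tasks.items():
        for worker in workers:
            try:
                results[task_id] = worker(payload)
                break  # task finished, move on to the next task
            except RuntimeError:
                continue  # this worker failed: reassign the task
        else:
            # The inner loop ended without a break: no worker could finish the task.
            raise RuntimeError(f"task {task_id!r} failed on every worker")
    return results


def broken_worker(payload: str) -> str:
    """Simulates a machine that breaks down in the middle of the job."""
    raise RuntimeError("worker crashed")


def healthy_worker(payload: str) -> str:
    """Simulates a machine that completes its task."""
    return payload.upper()


if __name__ == "__main__":
    print(run_with_reassignment({"t1": "a", "t2": "b"}, [broken_worker, healthy_worker]))
```

The caller only sees the finished results; which worker ended up running each task is hidden, which is what lets job code stay simple.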

MapReduce lets you write simple code:

Define a Mapper, an optional Partitioner, and a Reducer.
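As a rough illustration of those three pieces, the sketch below runs a word count in a single Python process. It is a simulation of the programming model only and does not use Hadoop's Java API; `mapper`, `partitioner`, `reducer`, and `run_job` are hypothetical names, and the choice of a CRC32-based partitioner is an assumption made so that the same key always lands in the same partition.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word in the input line."""
    for word in line.lower().split():
        yield word, 1


def partitioner(key: str, num_partitions: int) -> int:
    """Choose which reducer partition receives a key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values that were emitted for one key."""
    return key, sum(values)


def run_job(lines: Iterable[str], num_partitions: int = 2) -> Dict[str, int]:
    """Single-process stand-in for the framework: map, shuffle by partition, reduce."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for line in lines:
        for key, value in mapper(line):
            partitions[partitioner(key, num_partitions)][key].append(value)

    output: Dict[str, int] = {}
    for partition in partitions:
        for key in sorted(partition):  # reducers see their keys in sorted order
            out_key, total = reducer(key, partition[key])
            output[out_key] = total
    return output


if __name__ == "__main__":
    print(run_job(["nutch crawls pages", "nutch parses pages"]))
```

Only the three small functions carry job-specific logic; everything in `run_job` stands in for what the framework would handle in a real cluster, such as distribution and grouping.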
