How to detect duplicate documents in billions of urls | Runhe Tian Coding Practice
Based on the above two observations, we can derive the following algorithm:
- Iterate through the pages and compute the hash value of each one.
- Check if the hash value is in the hash table. If it is, throw out the url as a duplicate. If it is not, keep the url and insert it into the hash table.
- How much space does each page take up in the hash table?
- Each page hashes to a four byte value.
- Each url is an average of 30 characters, so that's another 30 bytes at least.
- Each url takes up roughly 34 bytes.
- 34 bytes * 1 billion urls is roughly 31.6 gigabytes. We're going to have trouble holding all of that in memory!
- We could split this up into files. We'll have to deal with the file loading / unloading—ugh.
- We could hash to disk. Size wouldn't be a problem, but access time might: a hash table on disk would require a random-access read for each check and a write to store each viewed url, which could take milliseconds waiting on seek and rotational latency. Elevator algorithms could eliminate random bouncing from track to track.
- Or, we could split this up across machines, and deal with network latency. Let's go with this solution, and assume we have n machines.
- First, we hash the document to get a hash value v.
- v % n tells us which machine this document's hash table can be found on.
- v / n is the value to look up in the hash table located on that machine (the whole scheme is sketched in code below).
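A minimal sketch of this scheme in Python. It is illustrative only: the machine count n, the 32-bit MD5-based hash, and the in-process list of dictionaries standing in for per-machine hash tables are assumptions made here for concreteness, not details from the article.

```python
import hashlib

N_MACHINES = 4  # assumed cluster size; the article calls this n

# Stand-ins for the n per-machine hash tables. In a real deployment each
# dictionary would live on a different machine behind an RPC or key-value store.
machine_tables = [{} for _ in range(N_MACHINES)]

def url_hash(url: str) -> int:
    """Hash a url to a four-byte (32-bit) value, as assumed in the text."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def is_duplicate(url: str) -> bool:
    """Return True if this url hashes to a value we have already seen;
    otherwise record it and return False."""
    v = url_hash(url)
    machine_id = v % N_MACHINES  # which machine's table to consult
    key = v // N_MACHINES        # slot within that machine's table
    table = machine_tables[machine_id]
    if key in table:
        return True   # throw out the url as a duplicate
    table[key] = url  # keep the url and remember its hash
    return False

# Usage: keep only urls whose hash has not been seen before.
unique_urls = [u for u in ["http://a.com", "http://b.com", "http://a.com"]
               if not is_duplicate(u)]
print(unique_urls)  # ['http://a.com', 'http://b.com']
```

Storing v // N_MACHINES instead of v mirrors the v % n / v / n split above. Note that two different urls can collide on the same 32-bit hash, so a production system would compare the stored url (or the page contents) before declaring a duplicate.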
https://en.wikipedia.org/wiki/Power_of_10
| Name | Power of 10 | Value | SI symbol | SI prefix |
|------|-------------|-------|-----------|-----------|
| Million | 6 | 1,000,000 | M | mega |
| Billion (Milliard) | 9 | 1,000,000,000 | G | giga |
| Trillion (Billion) | 12 | 1,000,000,000,000 | T | tera |
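As a quick check tying this table to the memory estimate above (plain arithmetic, not from the original article):

```python
# 34 bytes per url (4-byte hash + ~30-byte url) for one billion (10^9) urls.
total_bytes = 34 * 10**9      # 34 GB in decimal units, since giga = 10^9
print(total_bytes / 2**30)    # ~31.66 GiB, the "31.6 gigabytes" quoted above
```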
Read full article from How to detect duplicate documents in billions of urls | Runhe Tian Coding Practice