How to detect duplicate documents in billions of urls | Runhe Tian Coding Practice
Based on the above two observations, we can derive the following algorithm:
- Iterate through the pages and compute the hash value of each one.
- Check if the hash value is in the hash table. If it is, throw out the url as a duplicate. If it is not, keep the url and insert it into the hash table.
- How much space does each page take up in the hash table?
- Each page hashes to a four byte value.
- Each url is an average of 30 characters, so that's another 30 bytes at least.
- Each url takes up roughly 34 bytes.
- 34 bytes * 1 billion urls is roughly 31.6 gigabytes. We're going to have trouble holding all of that in memory!
- We could split this up into files. We'll have to deal with the file loading / unloading—ugh.
- We could hash to disk. Size wouldn't be a problem, but access time might: a hash table on disk would require a random-access read for each check and a write to store each viewed url, which could take milliseconds waiting on seek and rotational latency. Elevator algorithms could eliminate random bouncing from track to track.
- Or, we could split this up across machines, and deal with network latency. Let's go with this solution, and assume we have n machines.
- First, we hash the document to get a hash value v.
- v % n tells us which machine this document's hash table can be found on.
- v / n is the value to look up in the hash table located on that machine (the whole scheme is sketched in code below).
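A minimal sketch of this scheme in Python. It is illustrative only: the machine count n, the 32-bit MD5-based hash, and the in-process list of dictionaries standing in for per-machine hash tables are assumptions made here for concreteness, not details from the article.

```python
import hashlib

N_MACHINES = 4  # assumed cluster size; the article calls this n

# Stand-ins for the n per-machine hash tables. In a real deployment each
# dictionary would live on a different machine behind an RPC or key-value store.
machine_tables = [{} for _ in range(N_MACHINES)]

def url_hash(url: str) -> int:
    """Hash a url to a four-byte (32-bit) value, as assumed in the text."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def is_duplicate(url: str) -> bool:
    """Return True if this url hashes to a value we have already seen;
    otherwise record it and return False."""
    v = url_hash(url)
    machine_id = v % N_MACHINES  # which machine's table to consult
    key = v // N_MACHINES        # slot within that machine's table
    table = machine_tables[machine_id]
    if key in table:
        return True   # throw out the url as a duplicate
    table[key] = url  # keep the url and remember its hash
    return False

# Usage: keep only urls whose hash has not been seen before.
unique_urls = [u for u in ["http://a.com", "http://b.com", "http://a.com"]
               if not is_duplicate(u)]
print(unique_urls)  # ['http://a.com', 'http://b.com']
```

Storing v // N_MACHINES instead of v mirrors the v % n / v / n split above. Note that two different urls can collide on the same 32-bit hash, so a production system would compare the stored url (or the page contents) before declaring a duplicate.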
https://en.wikipedia.org/wiki/Power_of_10
| Name | Power of 10 | Value | SI symbol | SI prefix |
|------|-------------|-------|-----------|-----------|
| Million | 6 | 1,000,000 | M | mega |
| Billion (Milliard) | 9 | 1,000,000,000 | G | giga |
| Trillion (Billion) | 12 | 1,000,000,000,000 | T | tera |
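As a quick check tying this table to the memory estimate above (plain arithmetic, not from the original article):

```python
# 34 bytes per url (4-byte hash + ~30-byte url) for one billion (10^9) urls.
total_bytes = 34 * 10**9      # 34 GB in decimal units, since giga = 10^9
print(total_bytes / 2**30)    # ~31.66 GiB, the "31.6 gigabytes" quoted above
```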
Read full article from How to detect duplicate documents in billions of urls | Runhe Tian Coding Practice