Tuesday, December 23, 2014

How to detect duplicate documents in billions of urls | Runhe Tian Coding Practice



How to detect duplicate documents in billions of urls | Runhe Tian Coding Practice
Based on the above two observations we can derive an algorithm which is as follows:
  1. Iterate through the pages and compute the hash table of each one.
  2. Check if the hash value is in the hash table. If it is, throw out the url as a duplicate. If it is not, then keep the url and insert it in into the hash table.
This algorithm will provide us a list of unique urls. But wait, can this fit on one computer?
  • How much space does each page take up in the hash table?
    • Each page hashes to a four byte value.
    • Each url is an average of 30 characters, so that's another 30 bytes at least.
    • Each url takes up roughly 34 bytes.
  • 34 bytes * 1 billion = 31.6 gigabytes. We're going to have trouble holding that all in memory!
What do we do?
  • We could split this up into files. We'll have to deal with the file loading / unloading—ugh.
  • We could hash to disk. Size wouldn't be a problem, but access time might. A hash table on disk would require a random access read for each check and write to store a viewed url. This could take msecs waiting for seek and rotational latencies. Elevator algorithms could elimate random bouncing from track to track.
  • Or, we could split this up across machines, and deal with network latency. Let's go with this solution, and assume we have n machines.
    • First, we hash the document to get a hash value v
    • v\%n tells us which machine this document's hash table can be found on.
    • v / n is the value in the hash table that is located on its machine.

https://en.wikipedia.org/wiki/Power_of_10
Million61,000,000Mmega
Billion (Milliard)91,000,000,000Ggiga
Trillion (Billion)121,000,000,000,000Ttera

Read full article from How to detect duplicate documents in billions of urls | Runhe Tian Coding Practice

Labels

Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

Popular Posts