[CareerCup] 10.6 Find Duplicate URLs 找重复的URL链接 - Grandyang - 博客园
10.6 You have 10 billion URLs. How do you detect the duplicate documents? In this case, assume that "duplicate" means that the URLs are identical.
https://jixiangsanbao.wordpress.com/2014/10/04/scalability-and-memory-limits-6/
First, we compute how much memory is needed to store 10 billion URLs.
Assume that each URL is 100 characters long and each character takes 4 bytes, so each URL takes 100 * 4 = 400 bytes.
10 billion URLs then take about 10 * 2^30 * 100 * 4 bytes ≈ 4 * 2^40 bytes, which is roughly 4 TB.
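As a quick sanity check on that estimate (treating "10 billion" as 10 * 2^30 and keeping the 100-character, 4-bytes-per-character assumption):

```python
# Back-of-the-envelope size estimate for 10 billion URLs.
urls = 10 * 2**30            # "10 billion", approximated in binary units
bytes_per_url = 100 * 4      # 100 characters at 4 bytes each
total = urls * bytes_per_url
print(total, total / 2**40)  # ~3.9 * 2^40 bytes, i.e. roughly 4 TB
```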
Assume that we only have 4 GB of memory. We can divide the URLs into 4000 files; to decide which file a URL goes to, we use the following hash function:
fileId = hash(u) % 4000, where u is the URL string and hash() is the sum of the ASCII codes of all characters in u.
This function guarantees that URLs with the same hash value end up in the same file, and the average file size is 1 GB (4 TB / 4000).
Then we can use a hash map to detect duplicate URLs file by file.
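A minimal sketch of this two-pass scheme in Python, using the character-sum hash described above; the directory layout, file names, and the 4000-way split are assumptions for illustration, not a prescribed implementation:

```python
import os

NUM_FILES = 4000

def url_hash(u: str) -> int:
    # Hash described above: the sum of the character codes of the URL.
    return sum(ord(c) for c in u)

def partition(input_path: str, out_dir: str) -> None:
    """Pass 1: scatter URLs into NUM_FILES buckets so identical URLs share a file."""
    os.makedirs(out_dir, exist_ok=True)
    outs = [open(os.path.join(out_dir, f"{i}.txt"), "w") for i in range(NUM_FILES)]
    with open(input_path) as f:
        for line in f:
            u = line.strip()
            if u:
                outs[url_hash(u) % NUM_FILES].write(u + "\n")
    for o in outs:
        o.close()

def duplicates(out_dir: str):
    """Pass 2: each ~1 GB bucket fits in memory, so deduplicate it with a set."""
    for i in range(NUM_FILES):
        seen = set()
        with open(os.path.join(out_dir, f"{i}.txt")) as f:
            for line in f:
                u = line.rstrip("\n")
                if u in seen:
                    yield u          # duplicate URL
                else:
                    seen.add(u)
```

One practical caveat: keeping 4000 output files open at once can exceed the default per-process file-descriptor limit on some systems, so a real implementation might buffer writes per bucket or raise that limit.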
If we use 4000 machines instead of 4000 files, the pro is that we can process the buckets in parallel, but the con is that we cannot assume every machine is failure-free.
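If the 4000 buckets are machines rather than files, the same hash chooses the destination. A hypothetical routing sketch follows; the `send` transport and machine numbering are placeholders, not something specified by the text:

```python
NUM_MACHINES = 4000

def machine_for(url: str) -> int:
    # Same hash as the file-based scheme: character-code sum mod 4000.
    return sum(ord(c) for c in url) % NUM_MACHINES

def route(urls, send):
    """Send each URL to the machine chosen by its hash.

    `send(machine_id, url)` stands in for whatever RPC or queue is used;
    each machine then deduplicates its own shard with an in-memory set,
    exactly as in the single-machine, file-based version.
    """
    for u in urls:
        send(machine_for(u), u)
```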
This problem asks us to find duplicates among 10 billion URLs, and the sheer volume of data is daunting. If each URL averages 100 characters and each character takes 4 bytes, the whole set occupies about 4 TB, which is far too much to load into memory. If we could load all of the data into memory, finding duplicates would be easy: a hash table mapping each URL to whether it has already been seen would do the job. So how do we handle data at this scale? There are two approaches:
1. Disk storage
Store all the data on a single machine. We can split the 4 TB of data into 4000 chunks of about 1 GB each: each URL u is written to file x.txt, where x = hash(u) % 4000, so URLs with the same hash value all land in the same file. Then we load each file into memory and look for duplicates within it.
2. Multiple machines
The other approach is to use multiple machines: instead of writing a URL to file x.txt, we send it to machine x. This has pros and cons. The pro is parallelism: all 4000 chunks can be processed at the same time. The con is that we would need 4000 machines, which is often unrealistic, and we would also have to handle machine failures.
https://github.com/filipegoncalves/interview-questions/blob/master/systems_design/DuplicateURLs.md
Finding duplicate data based on a key (the URL) is usually achieved with sorting. If we sort the data and write the result to a file, the duplicates will be next to each other. However, this might not be feasible in this case for a number of reasons.
First off, how much space will 10 billion URLs take? If each URL uses 100 bytes, we have 10^12 bytes, which is roughly 1 TB. We can't sort that in memory, so we will have to run an external sort algorithm. This won't be too complicated, though: a commodity machine today can probably handle at least 512 GB of data on disk, so we can split the data into two equal halves, run an external merge sort on each of two machines, and then do a final merge step (over the network) to write the final results to a file (possibly distributed between the two machines as well). However, sorting 512 GB of data and then merging it with another 512 GB is still a considerable computation.
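To make the sort-based idea concrete, here is a minimal single-machine sketch: sort each chunk that fits in memory into a run file, then stream a k-way merge over the runs and report equal neighbours. The chunk size and temporary files are illustrative assumptions, not part of the original solution:

```python
import heapq
import os
import tempfile

def external_sort_duplicates(input_path: str, chunk_lines: int = 5_000_000):
    """Sort-based duplicate detection: sorted runs on disk, then a k-way merge."""
    runs = []
    with open(input_path) as f:
        while True:
            chunk = [line.rstrip("\n") for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort()
            run = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
            run.write("\n".join(chunk) + "\n")
            run.close()
            runs.append(run.name)

    # In the merged stream, equal URLs are adjacent, so duplicates are easy to spot.
    files = [open(r) for r in runs]
    prev = None
    for line in heapq.merge(*files):
        url = line.rstrip("\n")
        if url == prev:
            yield url                    # duplicate URL
        prev = url

    for fh in files:
        fh.close()
    for r in runs:
        os.remove(r)
```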
Another possibility to perform sorting is to use MapReduce. Google's implementation of MapReduce makes it very easy to sort huge amounts of data because intermediate key value pairs are handed to the reducers in order, so the reducers just output the intermediate keys (map and reduce will basically be the identity functions). But using a beast like MapReduce to do this seems like overkill.
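The text notes that map and reduce can essentially be identity functions, with the framework's sorted shuffle doing the real work; the sketch below is a slight variant in which the reducer flags any key it receives more than once. The `run` driver merely simulates the shuffle in-process and is not a real MapReduce API:

```python
from collections import defaultdict

def map_fn(url: str):
    # Emit the URL itself as the key; the shuffle groups identical keys.
    yield url, 1

def reduce_fn(url: str, counts):
    # A URL that arrives with more than one value is a duplicate.
    if sum(counts) > 1:
        yield url

def run(urls):
    """Toy in-process driver standing in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for u in urls:
        for key, value in map_fn(u):
            groups[key].append(value)
    for key in sorted(groups):                # reducers see keys in sorted order
        yield from reduce_fn(key, groups[key])

# Example: list(run(["a.com", "b.com", "a.com"])) -> ["a.com"]
```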
What if we hash the URLs and sort based on the hashes? We know that there are 10 billion URLs, so there are at most 10 billion unique URLs. We therefore need a hash function that converts a textual URL into an integer in the interval [1 .. 10 billion]. If we do that, we can sort pretty efficiently in linear time with the help of a bitvector: if every hashed URL is a number in [1 .. 10 billion], we need a bitvector of 10 billion bits, which is roughly 10 * 2^30 / 2^3 bytes ≈ 1280 MB, or about 1.25 GB. This fits in memory on a single machine!
So we do as follows: we create a bitvector of 10 billion bits, consuming about 1.25 GB of memory. We read the input file and hash each URL into a known integer in the interval [1 .. 10 billion]; let that integer be N. We check whether bit N is already set in the bitvector. If it is, we report the URL as a duplicate. If it isn't, we set it.
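A sketch of the bitvector bookkeeping, assuming, as the text does, a deterministic and collision-free `perfect_hash` that maps every URL into the interval [1 .. 10 billion]; that hash is a placeholder here, not something a standard library provides:

```python
NUM_SLOTS = 10 * 10**9 + 1                  # bits indexed 1 .. 10 billion (bit 0 unused)

class BitVector:
    def __init__(self, nbits: int):
        # ~1.25 GB of memory for 10 billion bits
        self.bits = bytearray((nbits + 7) // 8)

    def test_and_set(self, n: int) -> bool:
        """Return True if bit n was already set, then set it."""
        byte, mask = n >> 3, 1 << (n & 7)
        seen = bool(self.bits[byte] & mask)
        self.bits[byte] |= mask
        return seen

def report_duplicates(urls, perfect_hash):
    seen = BitVector(NUM_SLOTS)
    for u in urls:
        if seen.test_and_set(perfect_hash(u)):   # assumed collision-free
            yield u                              # duplicate URL
```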
Of course, this assumes that we have a hash function that is deterministic and collision-free. But in the context of this specific problem, we can achieve collision-free hashing, because it is one of those rare cases where the hashed key space is at least as large as the hash input key space.