Massive Technical Interviews Tips: Big Data Sort

Wednesday, November 11, 2015

http://stackoverflow.com/questions/1152732/how-does-the-mapreduce-sort-algorithm-work

"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."

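So the trick is in how map output keys are assigned to reducers. Instead of hashing keys, the partitioner splits the key space into N contiguous ranges, so every key sent to reducer i is smaller than every key sent to reducer i+1. Each reducer then only has to sort its own range locally, and concatenating the reducer outputs in reducer order gives a globally sorted result with no final merge step.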
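To make the partitioning rule concrete, here is a minimal single-process sketch in Python. It is not Hadoop's TeraSort code. The function names (`sample_split_points`, `partition`, `range_partitioned_sort`), the sample size, and the in-memory "reducers" are all assumptions made for illustration. It only shows the idea of sampling N − 1 split points and routing each key to the reducer whose range contains it.

```python
import bisect
import random


def sample_split_points(keys, num_reducers, sample_size=1000, seed=0):
    """Pick num_reducers - 1 sorted split points from a random sample of keys.

    Hypothetical helper: a real job would sample the input splits instead
    of holding every key in memory.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    # Evenly spaced elements of the sorted sample become the range boundaries.
    return [sample[len(sample) * i // num_reducers]
            for i in range(1, num_reducers)]


def partition(key, split_points):
    """Return reducer index i with split_points[i-1] <= key < split_points[i]."""
    return bisect.bisect_right(split_points, key)


def range_partitioned_sort(keys, num_reducers):
    """Simulate the job: route keys by range, sort each bucket, concatenate."""
    split_points = sample_split_points(keys, num_reducers)
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:  # "map" side: choose a reducer for each key
        buckets[partition(key, split_points)].append(key)
    # "reduce" side: each reducer sorts only its own key range.
    sorted_buckets = [sorted(bucket) for bucket in buckets]
    # Bucket i holds only keys below bucket i+1, so concatenation is sorted.
    return [key for bucket in sorted_buckets for key in bucket]


if __name__ == "__main__":
    data = random.Random(42).sample(range(1_000_000), 10_000)
    result = range_partitioned_sort(data, num_reducers=4)
    print(result == sorted(data))
```

Sampling matters for load balance rather than correctness: any sorted list of split points yields a correctly sorted output, but split points drawn from a representative sample keep the reducer ranges roughly equal in size.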

http://www.slideshare.net/tungld/terasort
http://search-hadoop.com/c/MapReduce:hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/terasort/TeraSort.java%7C%7C+%2522done+%2522merge+sort%2522
