Wednesday, November 11, 2015

Map Reduce Median



https://www.quora.com/Is-there-a-fast-algorithm-to-run-on-the-MapReduce-framework-to-find-the-median-from-a-huge-integer-set
You could use MapReduce to sort the data and then some postprocessing outside of the MapReduce framework to find the median in the sorted data.  
This isn't asymptotically optimal - it's O(nlogn).

http://www.1point3acres.com/bbs/thread-14930-1-1.html
http://stackoverflow.com/questions/10109514/computing-median-in-map-reduce
Related:
https://github.com/rajdeepd/hadoop-samples
public class MedianTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

          private Text yearText = new Text();
          private final static IntWritable tempWritable = new IntWritable(0);

          protected void map(LongWritable key, Text value, Context context)
              throws java.io.IOException, InterruptedException {
        String[] line = value.toString().split(";");
            String year = line[0];
            yearText.set(year);
            int temp = Integer.parseInt(line[1]);
            tempWritable.set(temp);
            context.write(yearText,tempWritable);
          }
}              
public class MedianTemperatureReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {
        ArrayList<Integer> temperatureList = new ArrayList<Integer>();
       
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws java.io.IOException, InterruptedException {
                int median = 0;
                for (IntWritable value : values) {
                        temperatureList.add(value.get());
                }
                Collections.sort(temperatureList);
                int size  = temperatureList.size();

                if(size%2 == 0){
                        int half = size/2;

                        median  = temperatureList.get(half);
                }else {
                        int half = (size + 1)/2;
                        median = temperatureList.get(half -1);
                }
                context.write(key, new IntWritable(median));
        }

}

Labels

Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

Popular Posts