Friday, July 31, 2015

Hadoop Map Reduce Miscs



Word Count:
Spark:
val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Hadoop MapReduce:
http://wiki.apache.org/hadoop/WordCount
Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum.
As an optimization, the reducer is also used as a combiner on the map outputs. This reduces the amount of data sent across the network by combining the counts for each word within a map task's output into a single record.
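import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Emits (word, 1) for every whitespace-separated token in the input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts for one word and emits (word, sum).
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        // Reuse the reducer as a combiner, as described above.
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Exit non-zero if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}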
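
To see why the combiner helps, the map, combine, and reduce steps described above can be simulated in plain Java without a cluster. This is only a rough sketch under stated assumptions: the class WordCountSketch and its methods map, sum, and shuffle are hypothetical names, not Hadoop APIs, and each input string stands in for the whole input of one map task.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local simulation; none of these names are Hadoop APIs.
final class WordCountSketch {

    // Map step: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(split);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Combine/reduce step: sum the counts per word. One function can play
    // both roles because addition is associative and commutative.
    static Map<String, Integer> sum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Collects the records that would cross the network to the reducers.
    // Assumption: each string is the entire input of one map task.
    static List<Map.Entry<String, Integer>> shuffle(List<String> splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> sent = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapOutput = map(split);
            if (useCombiner) {
                // Pre-aggregate locally: at most one record per distinct word.
                sent.addAll(sum(mapOutput).entrySet());
            } else {
                sent.addAll(mapOutput);
            }
        }
        return sent;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("to be or not to be", "to do or not to do");
        System.out.println(shuffle(splits, false).size()); // 12 records without a combiner
        System.out.println(shuffle(splits, true).size());  // 8 records with a combiner
        System.out.println(sum(shuffle(splits, true)));    // {be=2, do=2, not=2, or=2, to=4}
    }
}
```

The final counts are the same with or without the combiner; only the number of shuffled records changes. In a real job the framework decides whether and how often to run the combiner, so the reducer must not depend on it having run.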

Line Count:
SparkConf sparkConf = new SparkConf().setAppName("File Copy");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

// Read the source file
JavaRDD<String> input = sparkContext.textFile(args[0]);

// Gets the number of entries in the RDD
long count = input.count();
