Tuesday, December 29, 2015

AWS Elastic MapReduce (EMR)



public class AmazonElasticMapReduceClient extends AmazonWebServiceClient
        implements AmazonElasticMapReduce

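// The no-argument constructor uses the SDK's default credential provider chain.
AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient();
emr.setRegion(Region.getRegion(Regions.US_WEST_1));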

ScriptBootstrapActionConfig script = new ScriptBootstrapActionConfig().withPath(scriptPath + "xxxx.sh");
BootstrapActionConfig installJava =
        new BootstrapActionConfig().withName("xxx").withScriptBootstrapAction(script);

https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-emr/src/main/java/com/amazonaws/services/elasticmapreduce/util/StepFactory.java

AWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(credentials);

StepFactory stepFactory = new StepFactory();

StepConfig enableDebugging = new StepConfig()
    .withName("Enable Debugging")
    .withActionOnFailure("TERMINATE_JOB_FLOW")
    .withHadoopJarStep(stepFactory.newEnableDebuggingStep());

StepConfig installHive = new StepConfig()
    .withName("Install Hive")
    .withActionOnFailure("TERMINATE_JOB_FLOW")
    .withHadoopJarStep(stepFactory.newInstallHiveStep());

RunJobFlowRequest request = new RunJobFlowRequest()
    .withName("Hive Interactive")
    .withSteps(enableDebugging, installHive)
    .withLogUri("s3://log-bucket/")
    .withInstances(new JobFlowInstancesConfig()
        .withEc2KeyName("keypair")
        .withHadoopVersion("0.20")
        .withInstanceCount(5)
        .withKeepJobFlowAliveWhenNoSteps(true)
        .withMasterInstanceType("m1.small")
        .withSlaveInstanceType("m1.small"));

RunJobFlowResult result = emr.runJobFlow(request);

Programming Elastic MapReduce: Using AWS Services to Build an End-to-End Application
The data stored in S3 is highly durable and is stored in multiple facilities and multiple devices within a facility.

Amazon Elastic Compute Cloud (EC2)
Amazon EC2 makes it possible to run multiple instances of virtual machines on demand inside any one of the AWS regions. The beauty of this service is that you can start as many or as few instances as you need without having to buy or rent physical hardware like in traditional hosting services. In the case of Amazon EMR, this means we can scale the size of our Hadoop cluster to any size we need without thinking about new hardware purchases and capacity planning.

Glacier is intended for long-term storage of data due to the high latency involved in the storage and retrieval of data. A request to retrieve data from Glacier may take several hours for Amazon to fulfill.

Amazon EMR is an AWS service that allows users to launch and use resizable Hadoop clusters inside of Amazon’s infrastructure.

Task nodes are optional, but they are one of the key areas where the capacity of the Amazon EMR cluster can be expanded or shrunk without affecting the stability of the cluster.

CloudWatch allows you to monitor the health and progress of Job Flows. It also allows you to set alarms when metrics are outside of normal execution parameters.

The Import and Export service for S3 allows you to prepare portable storage devices that you can ship to Amazon to import your data into S3.

Create and start an Amazon Linux EC2 instance on which to run a Bash script.

MOVING DATA TO S3 STORAGE
Data in S3 is stored in buckets. An S3 bucket is a container for the objects, files, and directories of information that you store in it. S3 bucket names need to be globally unique, so choose your bucket name wisely.
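
Before picking a name, it can help to check it locally against the S3 naming rules. The sketch below is a minimal check, assuming the commonly documented rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; starts and ends with a letter or digit; no adjacent dots; not shaped like an IP address). The class name BucketNames is hypothetical, and a local check cannot tell you whether the name is already taken; only the create-bucket call can.

```java
import java.util.regex.Pattern;

public class BucketNames {
    // 3-63 chars: lowercase letters, digits, dots, hyphens;
    // first and last character must be a letter or digit.
    private static final Pattern ALLOWED =
            Pattern.compile("^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$");

    // Names formatted like an IPv4 address (e.g. 192.168.5.4) are rejected.
    private static final Pattern IP_LIKE =
            Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");

    /** Returns true if the name passes the local syntax checks above. */
    public static boolean isValid(String name) {
        if (name == null) {
            return false;
        }
        return ALLOWED.matcher(name).matches()
                && !name.contains("..")
                && !IP_LIKE.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("program-emr"));  // syntactically fine
        System.out.println(isValid("Program_EMR"));  // uppercase and underscore
    }
}
```

This covers syntax only and is an assumed subset of the rules; the S3 documentation is the authority on the full list.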

s3cmd --configure
s3cmd mb s3://program-emr
s3cmd put sample-syslog.log s3://program-emr
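
After the upload, the file should be addressable as s3://program-emr/sample-syslog.log: the bucket is the host part of the URI and the object key is the path. A minimal sketch of splitting such a URI with only the JDK follows; the class name S3Location is hypothetical and is not part of the AWS SDK.

```java
import java.net.URI;

public class S3Location {
    public final String bucket;
    public final String key;

    private S3Location(String bucket, String key) {
        this.bucket = bucket;
        this.key = key;
    }

    /** Splits an s3://bucket/key URI into its bucket and key parts. */
    public static S3Location parse(String s3Uri) {
        URI uri = URI.create(s3Uri);
        if (!"s3".equals(uri.getScheme()) || uri.getHost() == null) {
            throw new IllegalArgumentException("Not an S3 URI: " + s3Uri);
        }
        // The path keeps its leading slash; the object key does not have one.
        String path = uri.getPath();
        String key = path.startsWith("/") ? path.substring(1) : path;
        return new S3Location(uri.getHost(), key);
    }

    public static void main(String[] args) {
        S3Location loc = parse("s3://program-emr/sample-syslog.log");
        System.out.println(loc.bucket + " / " + loc.key);
    }
}
```

A URI with no path, such as s3://program-emr, yields an empty key, which refers to the bucket itself.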

RUNNING OUR JOB FLOW WITH DEBUGGING
When creating a new Job Flow, we have the option to enable logging and debugging.

When logging is enabled, the logs of each Job Flow are written to an S3 location that is chosen on Job Flow creation. If debugging is also enabled, Amazon EMR creates indexes of the logfiles’ contents, which enables the Debug view of steps and tasks on the Amazon EMR Management Console to review a Job Flow run.

Enabling Job Flow logging and debugging is a great idea in development and testing. However, leaving logging and debugging turned on for production Job Flows can use up a significant amount of S3 storage for the logfile and SimpleDB indexes.
