Thursday, July 23, 2015

AWS Miscs



http://docs.aws.amazon.com/cli/latest/userguide/installing.html
$ pip install --upgrade --user awscli
The --upgrade option tells pip to upgrade any requirements that are already installed. The --user option tells pip to install the program to a subdirectory of your user directory to avoid modifying libraries used by your operating system.
$ aws --version
To upgrade to the latest version, run the install command again:
$ pip install --upgrade --user awscli
If you need to uninstall the AWS CLI, use pip uninstall.
$ pip uninstall awscli

http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
aws configure

For example, a one-way sync that downloads the contents of a bucket to the current directory:
$ aws s3 sync s3://mybucket .
Output:
download: s3://mybucket/test.txt to test.txt
download: s3://mybucket/test2.txt to test2.txt
This will download all of your files (one-way sync). It will not delete any existing files in your current directory (unless you specify --delete), and it won't change or delete any files on S3.
http://docs.aws.amazon.com/cli/latest/userguide/using-s3-commands.html
aws s3 sync <source> <target> [--options]
The following example synchronizes the contents of an Amazon S3 folder named path in my-bucket with the current working directory. s3 sync updates any files that have a different size or modified time than files with the same name at the destination. The output displays the specific operations performed during the sync. Notice that the operation recursively synchronizes the subdirectory MySubdirectory and its contents with s3://my-bucket/path/MySubdirectory.
$ aws s3 sync . s3://my-bucket/path
upload: MySubdirectory\MyFile3.txt to s3://my-bucket/path/MySubdirectory/MyFile3.txt
upload: MyFile2.txt to s3://my-bucket/path/MyFile2.txt
upload: MyFile1.txt to s3://my-bucket/path/MyFile1.txt
Normally, sync only copies missing or outdated files or objects between the source and target. However, you can supply the --delete option to remove files or objects from the target that are not present in the source, e.g. aws s3 sync . s3://my-bucket/path --delete.


http://briansjavablog.blogspot.com/2016/05/spring-boot-angular-amazon-web-services.html
  • EC2 - Amazon's Elastic Compute Cloud provides on-demand virtual server instances that can be quickly provisioned with the operating system and software stack of your choice. We'll be using Amazon's own Linux machine image to deploy our application.
  • Relational Database Service - Amazon's database-as-a-service lets developers provision Amazon-managed database instances in the cloud. A number of common database platforms are supported, but we'll be using a MySQL instance.
  • S3 Storage - Amazon's Simple Storage Service provides simple key-value data storage.
http://docs.aws.amazon.com/machine-learning/latest/dg/amazon-machine-learning-key-concepts.html
http://docs.aws.amazon.com//ElasticMapReduce/latest/ReleaseGuide/emr-hbase-create.html

http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/WhatIs.html
With ElastiCache, you can quickly deploy your cache environment, without having to provision hardware or install software. You can choose from Memcached or Redis protocol-compliant cache engine software, and let ElastiCache perform software upgrades and patch management for you. For enhanced security, ElastiCache can be run in the Amazon Virtual Private Cloud (Amazon VPC) environment, giving you complete control over network access to your clusters. With just a few clicks in the AWS Management Console, you can add or remove resources such as nodes, clusters, or read replicas to your ElastiCache environment to meet your business needs and application requirements.

The ElastiCache Auto Discovery feature for Memcached lets your applications identify all of the nodes in a cache cluster and connect to them, rather than having to maintain a list of available host names and port numbers. In this way, your applications are effectively insulated from changes to node membership in a cluster.
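A minimal sketch of Auto Discovery, assuming the AWS ElastiCache Cluster Client (a spymemcached-compatible Java client) is on the classpath; the configuration endpoint hostname below is hypothetical. Pointing the client at the cluster's configuration endpoint lets it discover and connect to all cache nodes:

import java.io.IOException;
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

// Connect to the cluster's configuration endpoint (hypothetical hostname);
// the ElastiCache Cluster Client discovers every cache node from it.
MemcachedClient client = new MemcachedClient(
        new InetSocketAddress("mycluster.cfg.use1.cache.amazonaws.com", 11211));
client.set("greeting", 3600, "hello"); // key, TTL in seconds, value
Object value = client.get("greeting");
client.shutdown();

Because the client tracks node membership itself, adding or removing nodes in the cluster does not require a configuration change in the application.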
ElastiCache has multiple features to enhance reliability for critical production deployments:
  • Automatic detection and recovery from cache node failures.
  • Automatic failover (Multi-AZ) of a failed primary cluster to a read replica in Redis replication groups.
  • Flexible Availability Zone placement of nodes and clusters.
https://github.com/aws/aws-sdk-java/issues/269
@ferozed Are you perhaps using the streaming API to read S3 objects but forgetting to close the S3Object after you consume the content?
Each client in the AWS SDK for Java (including the Amazon S3 client) currently maintains its own HTTP connection pool. You can tune the maximum size of the HTTP connection pool through the ClientConfiguration class that can be passed into client object constructors.
We recommend sharing client objects because of the expense and overhead of having too many HTTP connection pools that aren't being utilized effectively. You should see better performance when you share client objects across threads like this.
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

final ObjectMetadata meta = new ObjectMetadata();
meta.setContentType("text/csv");
meta.setContentLength(contentLength); // length must be known up front (e.g. the file or byte-array size), or the SDK has to buffer the stream in memory
final PutObjectRequest req = new PutObjectRequest(bucketName, objPath, stream, meta);
s3Client.putObject(req);
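Following the recommendations above, a minimal sketch (AWS SDK for Java 1.x, default credentials, hypothetical bucket and key) that tunes the connection pool once through ClientConfiguration, shares a single client across threads, and closes the S3Object after consuming its stream so the pooled connection is released:

import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3Object;
import java.io.BufferedReader;
import java.io.InputStreamReader;

// One shared client with a tuned HTTP connection pool.
ClientConfiguration config = new ClientConfiguration().withMaxConnections(50);
AmazonS3 s3 = new AmazonS3Client(config);

// try-with-resources closes the S3Object (and its HTTP connection)
// once the content has been consumed.
try (S3Object obj = s3.getObject("my-bucket", "path/key.txt");
     BufferedReader reader = new BufferedReader(new InputStreamReader(obj.getObjectContent()))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}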
http://support.rightscale.com/09-Clouds/AWS/FAQs/FAQ_0025_-_Can_I_use_Network_Time_Protocol_%28NTP%29_on_my_RightScale_servers%3F/index.html
http://serverfault.com/questions/513326/wrong-time-on-my-ubuntu-amazon-ec2-instances
By default, Amazon configures Xen on EC2 so that instances sync their hardware clock to UTC from the hypervisor. However, a (fairly rare) condition can cause the time on your instance to drift. On Linux instances you can fix this by decoupling the instance clock from the hypervisor (setting the xen.independent_wallclock sysctl to 1) and then running an NTP client.

If NTP runs at 10:59:55 and fixes the time to 11:00:13, does cron run the 11:00:00 and *:00:00 tasks, or are they skipped?
Alternately, if NTP runs at 11:00:00 and fixes the time to 10:59:48, do these tasks run twice?
If you run an NTP daemon you will not have this problem: it speeds up or slows down your clock so it gets synchronized gradually over time rather than jumping.
SNS - Simple Notification Service
http://docs.aws.amazon.com/sns/latest/dg/GettingStarted.html
Create topic - topic ARN
Subscribe to a Topic
To receive messages published to a topic, you have to subscribe an endpoint to that topic. An endpoint is a mobile app, web server, email address, or an Amazon SQS queue that can receive notification messages from Amazon SNS. Once you subscribe an endpoint to a topic and the subscription is confirmed, the endpoint will receive all messages published to that topic.

Publish to a Topic
Publishers send messages to topics. Once a new message is published, Amazon SNS attempts to deliver that message to every endpoint that is subscribed to the topic.
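As a rough sketch of the whole create-subscribe-publish flow with the AWS SDK for Java 1.x (the topic name and email address are made up):

import com.amazonaws.services.sns.AmazonSNSClient;
import com.amazonaws.services.sns.model.CreateTopicResult;

AmazonSNSClient sns = new AmazonSNSClient(); // default credentials chain
CreateTopicResult topic = sns.createTopic("MyTopic");            // returns the topic ARN
sns.subscribe(topic.getTopicArn(), "email", "user@example.com"); // endpoint must confirm the subscription
sns.publish(topic.getTopicArn(), "Hello from SNS");              // delivered to all confirmed subscribers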

Push notification services, such as APNS and GCM, maintain a connection with each app and associated mobile device registered to use their service. When an app and mobile device register, the push notification service returns a device token. Amazon SNS uses the device token to create a mobile endpoint, to which it can send direct push notification messages. In order for Amazon SNS to communicate with the different push notification services, you submit your push notification service credentials to Amazon SNS to be used on your behalf.
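A hedged sketch of that flow in the AWS SDK for Java 1.x; the platform application ARN and device token below stand in for the values you get from registering your APNS/GCM credentials and from the push service, respectively:

import com.amazonaws.services.sns.AmazonSNSClient;
import com.amazonaws.services.sns.model.CreatePlatformEndpointRequest;
import com.amazonaws.services.sns.model.CreatePlatformEndpointResult;
import com.amazonaws.services.sns.model.PublishRequest;

AmazonSNSClient sns = new AmazonSNSClient();
// Register the device token as a mobile endpoint (ARN and token are placeholders).
CreatePlatformEndpointResult endpoint = sns.createPlatformEndpoint(
        new CreatePlatformEndpointRequest()
                .withPlatformApplicationArn("arn:aws:sns:us-east-1:123456789012:app/GCM/MyApp")
                .withToken(deviceToken));
// Send a direct push notification to that one device.
sns.publish(new PublishRequest()
        .withTargetArn(endpoint.getEndpointArn())
        .withMessage("direct push message"));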

http://www.telerik.com/forums/push-notification-to-specific-users-
http://www.codefoster.com/pushusers/


http://www.shuatiblog.com/blog/2014/08/05/AWS-explained/
SimpleDB
AWS’s always-available replacement for RDBMSs. Specifically, SimpleDB is their hosted, replicated key-value store that is always available and accessible as a web service.
S3
(a.k.a. Simple Storage Service) AWS’s always-available file storage solution, accessible as a web service.
SQS
(a.k.a. Simple Queue Service) AWS’s always-available queueing service, accessible as a web service.
ELB
(a.k.a. Elastic Load Balancer) AWS’s always-available load balancing service, accessible as a web service.
EC2
(a.k.a. Elastic Compute Cloud) AWS’s on-demand server offering, accessible as a web service.
CloudFront
AWS’s CDN (a.k.a. Content Delivery Network) offering, accessible as a web service.

CloudWatch provides monitoring and alerting for your AWS resources. CloudWatch also offers basic features for managing application log data on your EC2 instances.

Amazon ElastiCache
https://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/WhatIs.html
http://www.sitepoint.com/amazon-elasticache-cache-on-steroids/

Amazon ElastiCache for Redis
http://www.slideshare.net/AmazonWebServices/aws-webinar
http://makeandbuild.com/blog/post/combining-redis-and-elasticache-for-a-better-solutin

AWS ElastiCache with Spring
http://stackoverflow.com/questions/21338270/aws-elasticache-with-spring
http://stackoverflow.com/questions/31982564/spring-with-elasticache-not-able-to-instantiate-amazonelasticacheclient-error
https://aws.amazon.com/elasticache/
  • Memcached - a widely adopted memory object caching system. ElastiCache is protocol compliant with Memcached, so popular tools that you use today with existing Memcached environments will work seamlessly with the service.
  • Redis – a popular open-source in-memory key-value store that supports data structures such as sorted sets and lists. ElastiCache supports Master / Slave replication and Multi-AZ which can be used to achieve cross AZ redundancy.
Amazon ElastiCache automatically detects and replaces failed nodes, reducing the overhead of self-managed infrastructure and providing a resilient system that mitigates the risk of overloaded databases, which slow website and application load times.


How to configure eviction (time to live) on Amazon AWS Elasticache Redis + Spring Data
Caching with Spring Data Redis
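A minimal sketch of the eviction/TTL question above, assuming Spring Data Redis 1.x and a hypothetical ElastiCache Redis endpoint: point a Jedis connection factory at the cluster endpoint and set a default expiration on the cache manager so cached entries are evicted after the given number of seconds.

import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.cache.RedisCacheManager;
import org.springframework.data.redis.connection.jedis.JedisConnectionFactory;
import org.springframework.data.redis.core.RedisTemplate;

@Configuration
@EnableCaching
public class CacheConfig {

    @Bean
    public JedisConnectionFactory redisConnectionFactory() {
        JedisConnectionFactory factory = new JedisConnectionFactory();
        factory.setHostName("my-cluster.xxxxxx.0001.use1.cache.amazonaws.com"); // hypothetical endpoint
        factory.setPort(6379);
        return factory;
    }

    @Bean
    public RedisTemplate<String, Object> redisTemplate() {
        RedisTemplate<String, Object> template = new RedisTemplate<>();
        template.setConnectionFactory(redisConnectionFactory());
        return template;
    }

    @Bean
    public CacheManager cacheManager() {
        RedisCacheManager cacheManager = new RedisCacheManager(redisTemplate());
        cacheManager.setDefaultExpiration(600); // evict cached entries after 600 seconds
        return cacheManager;
    }
}

With this in place, methods annotated with @Cacheable are cached in ElastiCache and expire after the default TTL.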


CloudSearch
http://www.searchtechnologies.com/amazon-cloudsearch-vs-solr-cloud
https://github.com/tahseen/amazon-cloudsearch-client-java
https://github.com/aws/aws-sdk-java/tree/master/aws-java-sdk-cloudsearch/src/main/java/com/amazonaws


AWS CloudTrail
AWS CloudTrail is a web service that records AWS API calls for your account and delivers log files to you.

Amazon Route 53 is a highly scalable DNS service that allows you to manage your DNS records by creating a hosted zone for every domain you would like to manage.

Amazon Virtual Private Cloud (Amazon VPC) allows you to extend your corporate network into a private cloud contained within AWS. Amazon VPC uses IPsec tunnel mode to create a secure connection between a gateway in your data center and a gateway in AWS.

EC2 Plug-in
elasticsearch comes with an EC2 plug-in. Normally nodes discover their environment, i.e. other elasticsearch nodes with the same cluster name, via multicast. But in AWS (EC2) we don't have multicast. The solution is to use security groups or tags for discovery. You can also restrict discovery to specific Availability Zones. With this plug-in you can also tell elasticsearch to store the index in S3.

Amazon CloudFront is a content delivery web service.

A Comprehensive Guide to Building a Scalable Web App on Amazon Web Services - Part 1

command line:
aws configure
aws s3 ls
aws s3 ls s3://bucket-name/path

s3
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219
Apache Spark comes with built-in functionality to pull data from S3 as it would from HDFS, using the SparkContext's textFile method. textFile allows for glob syntax, which lets you pull hierarchical data, as in textFile("s3n://bucket/2015/*/*").

S3 isn't a file system; it is a key-value store. The keys 2015/05/01 and 2015/05/02 do not live in the "same place". They just happen to share a similar prefix: 2015/05.
This causes issues when using Apache Spark's textFile, since it assumes that anything being put through it behaves like HDFS.
Originally we were pulling the data using SparkContext's textFile method, as in sc.textFile("s3n://bucket/events/*/*/*/*/*/*"). This worked fine at first, but as the dataset grew we noticed that there would always be a large period of inactivity between jobs.



The solution is quite simple: do not use textFile. Instead, use the AmazonS3Client to manually list every key (optionally with a prefix), then parallelize the data pulling using SparkContext's parallelize method and said AmazonS3Client.
https://gist.githubusercontent.com/pjrt/f1cad93b154ac8958e65/raw/7b0b764408f145f51477dc05ef1a99e8448bce6d/S3Puller.scala
import java.io.InputStream
import scala.collection.JavaConverters._
import scala.io.Source
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ListObjectsRequest

val s3 = new AmazonS3Client(new BasicAWSCredentials(key, secret))
val request = new ListObjectsRequest()
request.setBucketName(bucket)
request.setPrefix(prefix)
request.setMaxKeys(pageLength)

// Note: listObjects returns a truncated page if there are more than pageLength keys; loop with s3.listNextBatchOfObjects to fetch the rest.
val objs = s3.listObjects(request)
sc.parallelize(objs.getObjectSummaries.asScala.map(_.getKey).toList)
  .flatMap { key => Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines }

Above, we get all of the keys for a bucket and prefix (events), parallelize the keys (hand them out to the workers/partitions), and have each worker pull the data for its keys.
http://docs.aws.amazon.com/cli/latest/userguide/using-s3-commands.html
aws s3 cp MyFile.txt s3://my-bucket/path/

// Move all .jpg files in s3://my-bucket/path to ./MyDirectory
$ aws s3 mv s3://my-bucket/path ./MyDirectory --exclude '*' --include '*.jpg' --recursive

// List the contents of my-bucket
$ aws s3 ls s3://my-bucket

// List the contents of path in my-bucket
$ aws s3 ls s3://my-bucket/path

// Delete s3://my-bucket/path/MyFile.txt
$ aws s3 rm s3://my-bucket/path/MyFile.txt

// Delete s3://my-bucket/path and all of its contents
$ aws s3 rm s3://my-bucket/path --recursive

http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
--profile – name of a profile to use, or "default" to use the default profile.
--region – AWS region to call.
--output – output format.
--endpoint-url – The endpoint to make the call against. The endpoint can be the address of a proxy or an endpoint URL for the in-use AWS region. Specifying an endpoint is not required for normal use as the AWS CLI determines which endpoint to call based on the in-use region.

http://dockone.io/article/831
Every AWS Region has multiple Availability Zones (AZs); Availability Zones are independent of one another, so a failure in one does not affect the others.
  1. Deploy stateless services across multiple Availability Zones, avoid single points of failure, and add finer-grained monitoring (including of the AWS services you use).
  2. Enable Cross-Zone Load Balancing on the ELB so that traffic is split evenly by actual instance count; by default traffic is split evenly per Availability Zone.
  3. Add fault-tolerance testing, along the lines of Netflix Chaos Monkey, to evaluate the impact on your service when each AWS service you use fails.
  4. Fail over across regions, for example: if Tokyo is unavailable, switch user traffic to Singapore.
  5. Buy an AWS Support Plan, so that you are not stuck with opaque information and nobody to help when AWS has an outage.

Auto Scaling

Auto Scaling can be considered a core AWS feature: it automatically adjusts your elastic compute capacity according to your business needs and policies.
  1. Availability and stability come from periodic health checks and automatic instance replacement, including EC2's own status checks, ELB health checks, and even custom monitoring fed back to the Auto Scaling service through its API.
  2. Scaling policies come in two kinds: simple policies and step policies.
    a. A simple policy adds or removes capacity based on a single rule, for example: when average CPU exceeds 70%, add two machines. CloudWatch monitors the metric automatically and fires an alarm when the threshold is reached, triggering the scaling action. Note that a new scaling action can only start after any in-flight scaling action has completed; only then is the alarm answered. The adjustment can be a fixed count or a percentage, and the data CloudWatch monitors can also be custom metrics fed in through its API (see the sketch after this list).
    b. A step policy, which AWS did not support early on, contains a set of rules, for example: when CPU is at 40%-50% add one machine, at 50%-70% add two, above 70% add four, and so on. A step policy keeps responding to alarms even while a scaling action is already under way; there is a warm-up period during which a new machine's metrics are not counted. Compared with simple policies, step scaling holds no lock, keeps aggregating metrics, and triggers promptly, so it is the recommended choice.
  1. Add machines early, remove machines slowly. Plan ahead as load begins to climb: new machines may fail to start, and machine boot time plus service readiness takes a while, so acting early avoids capacity falling behind. When removing machines, go slowly and steadily; otherwise connections are cut off abruptly, and if the load is still there, shrinking too fast just forces you to add machines again, making scaling too frequent and hurting cost and stability.
  2. For machine CPU data, prefer CloudWatch's numbers; data collected on the host itself is not necessarily accurate.
  3. The application itself should record enough performance data to its logs to make later analysis easier.
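For the custom-metric point above, a minimal sketch with the AWS SDK for Java 1.x (the namespace and metric name are made up): publish your own datapoint to CloudWatch, where it can drive alarms and scaling policies like any built-in metric.

import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient;
import com.amazonaws.services.cloudwatch.model.MetricDatum;
import com.amazonaws.services.cloudwatch.model.PutMetricDataRequest;
import com.amazonaws.services.cloudwatch.model.StandardUnit;

AmazonCloudWatchClient cloudWatch = new AmazonCloudWatchClient();
cloudWatch.putMetricData(new PutMetricDataRequest()
        .withNamespace("MyApp")                   // hypothetical custom namespace
        .withMetricData(new MetricDatum()
                .withMetricName("ActiveSessions") // hypothetical metric
                .withUnit(StandardUnit.Count)
                .withValue(42.0)));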

