Massive Technical Interviews Tips: Elasticsearch Misc

https://www.elastic.co/guide/en/elasticsearch/reference/5.4/allocation-awareness.html

If Elasticsearch is aware of the physical configuration of your hardware, it can ensure that the primary shard and its replica shards are spread across different physical servers, racks, or zones, to minimise the risk of losing all shard copies at the same time.

As an example, let’s assume we have several racks. When we start a node, we can tell it which rack it is in by assigning it an arbitrary metadata attribute called rack_id — we could use any attribute name. For example:

./bin/elasticsearch -Enode.attr.rack_id=rack_one

This setting could also be specified in the elasticsearch.yml config file.

Now we need to set up shard allocation awareness by telling Elasticsearch which attributes to use. This can be configured in the elasticsearch.yml file on all master-eligible nodes, or it can be set (and changed) with the cluster-update-settings API.

For our example, we’ll set the value in the config file:

cluster.routing.allocation.awareness.attributes: rack_id
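
The same setting can also go through the cluster-update-settings API mentioned above. A minimal Python sketch of the request body for PUT /_cluster/settings follows; the helper name awareness_settings_body is made up for illustration, and sending the request (with curl or any HTTP client) is left out.

```python
import json


def awareness_settings_body(attributes, persistent=True):
    """Build the JSON body for PUT /_cluster/settings.

    `attributes` is a list of node attribute names such as ["rack_id"].
    Persistent settings survive a full cluster restart; transient ones do not.
    """
    scope = "persistent" if persistent else "transient"
    key = "cluster.routing.allocation.awareness.attributes"
    return {scope: {key: ",".join(attributes)}}


# Body to send with: curl -XPUT 'localhost:9200/_cluster/settings' -d '<body>'
print(json.dumps(awareness_settings_body(["rack_id"]), indent=2))
```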

With this config in place, let’s say we start two nodes with node.attr.rack_id set to rack_one, and we create an index with 5 primary shards and 1 replica of each primary. All primaries and replicas are allocated across the two nodes.

Now, if we start two more nodes with node.attr.rack_id set to rack_two, Elasticsearch will move shards across to the new nodes, ensuring (if possible) that no two copies of the same shard are in the same rack. However, if rack_two were to fail, taking down both of its nodes, Elasticsearch would still allocate the lost shard copies to nodes in rack_one.
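
To make the behaviour above concrete, here is a toy model in Python. It is not Elasticsearch's allocator, only a greedy sketch under simple assumptions: each copy of a shard prefers a rack that holds no other copy of that shard, and falls back to any other node when no such rack is left. The function name allocate and the node names are hypothetical.

```python
def allocate(node_racks, n_shards, n_replicas):
    """Greedy toy allocation: returns {shard: [node, ...]}.

    node_racks maps node name -> rack_id value (hypothetical names).
    """
    load = {node: 0 for node in node_racks}
    allocation = {}
    for shard in range(n_shards):
        chosen = []
        for _ in range(1 + n_replicas):
            used_racks = {node_racks[n] for n in chosen}
            # Two copies of one shard never share a node.
            free = [n for n in node_racks if n not in chosen]
            # Awareness: prefer racks that hold no copy of this shard yet.
            preferred = [n for n in free if node_racks[n] not in used_racks]
            candidates = preferred or free
            if not candidates:
                break  # copy stays unassigned
            node = min(candidates, key=lambda n: (load[n], n))
            chosen.append(node)
            load[node] += 1
        allocation[shard] = chosen
    return allocation


two_racks = {"n1": "rack_one", "n2": "rack_one", "n3": "rack_two", "n4": "rack_two"}
for shard, nodes in allocate(two_racks, n_shards=5, n_replicas=1).items():
    print(shard, [(n, two_racks[n]) for n in nodes])
```

With only rack_one available, the fallback branch still places both copies, which mirrors the "will still allocate the lost shard copies to nodes in rack_one" case.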

Multiple awareness attributes can be specified, in which case the combination of values from each attribute is considered to be a separate value.

cluster.routing.allocation.awareness.attributes: rack_id,zone

https://qbox.io/blog/dynamic-index-creation-in-elasticsearch

Index templating is one of the most useful features of Elasticsearch. It comes in handy when we need to create indices with similar names and common index settings.

Consider a case in which we need to create weekly indices named company-01, company-02, and so on, all with the same settings.

curl -XPUT 'localhost:9200/_template/testindextemplate' -d '{
  "template": "company-*",
  "order": 0,
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "analysis": {
      "analyzer": {
        "analyzer-name": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "employeeinfo": {
      "properties": {
        "age": {
          "type": "long"
        },
        "experienceInYears": {
          "type": "long"
        },
        "name": {
          "type": "string",
          "analyzer": "analyzer-name"
        }
      }
    }
  }
}'
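
To illustrate how a new index name picks up a template, here is a small Python sketch. It assumes the legacy template behaviour: the index name is matched against each template's "template" wildcard pattern, and matching templates are applied in ascending "order" so that higher orders override lower ones. The second template and the function name settings_for are hypothetical, and the merge is shallow for brevity.

```python
from fnmatch import fnmatchcase

# First entry mirrors the template above; the second is a made-up catch-all.
TEMPLATES = [
    {"template": "company-*", "order": 0,
     "settings": {"number_of_shards": 1, "number_of_replicas": 1}},
    {"template": "*", "order": -1,
     "settings": {"number_of_replicas": 0}},
]


def settings_for(index_name, templates):
    """Merge the settings of every template whose pattern matches the name."""
    matching = [t for t in templates if fnmatchcase(index_name, t["template"])]
    merged = {}
    for template in sorted(matching, key=lambda t: t.get("order", 0)):
        merged.update(template.get("settings", {}))  # higher order wins
    return merged


print(settings_for("company-01", TEMPLATES))
print(settings_for("blog", TEMPLATES))
```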

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html

http://haptik.ai/tech/setup-auto-scaling-for-aws-elasticsearch/
AWS Elasticsearch is Elasticsearch + Kibana provided as a service. AWS manages the nodes and you get an endpoint through which you can access the Elasticsearch cluster.

Auto-scaling is driven by a monitored metric: once that metric reaches a specific threshold value, a new server is spun up to absorb the extra load. AWS also lets us define custom metrics and set alarms on their values.
http://bit-clouded-tech.blogspot.com/2014/11/elasticsearch-on-aws-with-autoscaling.html
One of the most powerful features of Elasticsearch is its ability to scale horizontally in many different ways: routing, sharding, and time- or pattern-based index creation and querying.
http://stackoverflow.com/questions/18010752/how-to-setup-elasticsearch-cluster-with-auto-scaling-on-amazon-ec2

Auto scaling doesn't make a lot of sense with Elasticsearch.

Moving and re-allocating shards is not a light process, especially if you have a lot of data. It stresses IO and the network, and can badly degrade the performance of Elasticsearch.

https://www.quora.com/Is-it-possible-to-do-auto-scaling-for-ElasticSearch-in-AWS-at-peak-load
I wouldn't recommend auto-scaling Elasticsearch unless you really have a good sense of your peak capacity. Replicating and sharding are themselves pretty resource-intensive tasks and would degrade performance. Also, the number of shards cannot be changed without reindexing, which adds another resource-intensive overhead.

There have been some workarounds suggested (such as in the Stack Overflow post below), but I am not aware of any that have been endorsed by the dev team. I would not suggest this for a production system.

How to setup ElasticSearch cluster with auto-scaling on Amazon EC2?

https://serverfault.com/questions/614703/is-it-possible-to-do-autoscaling-for-elasticsearch-in-aws-at-peak-load

This has been answered pretty well on Stack Overflow.

The full answer is worth reading but here are the key points:

Moving and re-allocating shards is resource intensive, so adding or removing a server on the fly can put a load on the system.
You should already have 2 nodes for Elasticsearch. It performs better that way and keeps the data safer.
You can't adjust the number of shards upwards or downwards when removing or adding servers. This means that when you go down from 2 servers to 1, you suddenly have a lot of unallocated shards.
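
The last point is simple arithmetic, sketched here in Python. It relies on one rule: two copies of the same shard (a primary and its replica, or two replicas) are never placed on the same node. The function name unassigned_copies is made up for illustration.

```python
def unassigned_copies(primaries, replicas, nodes):
    """Number of shard copies that cannot be placed on `nodes` nodes.

    Each shard has 1 + replicas copies, and no node may hold two copies
    of the same shard, so at most `nodes` copies per shard can be placed.
    """
    copies_per_shard = 1 + replicas
    placeable = min(copies_per_shard, nodes)
    return primaries * (copies_per_shard - placeable)


# 5 primaries with 1 replica each:
print(unassigned_copies(5, 1, nodes=2))  # everything fits on 2 nodes
print(unassigned_copies(5, 1, nodes=1))  # the 5 replicas have nowhere to go
```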

That said, I actually have my ES servers behind an Auto-Scale. But I have it set to always keep the same number of servers; it's only there to ensure that there are two servers on hand at all times, not to scale up or down.

http://engineering.ifttt.com/data/2015/10/14/data-infrastructure/

Start with a cluster from day one so that you can scale up easily, but make sure that you identify your bottleneck before blindly throwing nodes at a performance issue. For example, in Elasticsearch, if your shards are really large, adding more nodes may not help much for speeding up queries. You have to reduce the shard size to see improvement.

Similar to the above, create time-based indexes (e.g., hourly) in Elasticsearch. This way, if you query Elasticsearch to find all API errors in the last hour, it can find the answer by looking at a single index, which is more efficient.
Rather than pushing individual events to Elasticsearch, push events in batches (based on a time duration and/or number of events). This helps limit IO.
Depending on the type of data and the queries you are running, it is important to tune the number of nodes, number of shards, maximum size of each shard, and replication factor in Elasticsearch.
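
The hourly-index and batching tips can be sketched together in Python. Everything named here is an assumption for illustration: the api-events prefix, the event fields, and the helper names. The body follows the newline-delimited layout of the _bulk API (one action line, then one source line per event); older Elasticsearch versions also expect a _type in the action line, which is omitted here.

```python
import json
from datetime import datetime, timezone


def hourly_index(prefix, ts):
    """Name of the hourly index that an event at time `ts` belongs to."""
    return f"{prefix}-{ts.strftime('%Y.%m.%d.%H')}"


def batches(events, size):
    """Split a list of events into chunks of at most `size` events."""
    for start in range(0, len(events), size):
        yield events[start:start + size]


def bulk_body(events, prefix="api-events"):
    """Build a newline-delimited _bulk request body for one batch."""
    lines = []
    for event in events:
        ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        lines.append(json.dumps({"index": {"_index": hourly_index(prefix, ts)}}))
        lines.append(json.dumps(event))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


events = [{"timestamp": 1462194000 + i, "status": 500} for i in range(5)]
for batch in batches(events, size=2):
    print(bulk_body(batch), end="")
```

Each batch would be sent as one POST to /_bulk instead of one request per event.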

Elasticsearch Server - Third Edition
One index can store many objects serving different purposes. The document type lets us easily differentiate between the objects in a single index.

Replicas: increase query throughput or achieve high availability
GATEWAY
The gateway persists the cluster state and indexed data across full cluster restarts. By default, every node stores this information locally, and it is synchronized among nodes.

While indexing, replicas are only used as an additional place to store the data. When executing a query, by default, Elasticsearch will try to balance the load among the shard and its replicas so that they are evenly stressed.

SCATTER -> GATHER
The node that receives the query forwards it to all nodes holding shards of the given index and asks for minimal information about the matching documents (by default, only the identifier and score are returned). If routing is used, the query goes only to the shard selected by the routing value. This is called the scatter phase. After receiving this information, the aggregator node (the node that received the client request) sorts the results and sends a second request to fetch the documents needed to build the results list (everything other than the document identifier and score). This second request goes only to the relevant shards, the ones containing the needed documents. This is called the gather phase. After this phase is executed, the results are returned to the client.
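
The two phases can be mimicked with an in-memory toy in Python. This is only a sketch of the idea, not how Elasticsearch is implemented: shards are plain dicts, the scoring function is supplied by the caller, and the names scatter, gather, and search are invented for the example.

```python
import heapq


def scatter(shards, score, size):
    """Scatter phase: every shard returns only (score, doc_id, shard_id)."""
    partial = []
    for shard_id, docs in shards.items():
        hits = [(score(doc), doc_id, shard_id) for doc_id, doc in docs.items()]
        hits = [hit for hit in hits if hit[0] > 0]       # keep matches only
        partial.extend(heapq.nlargest(size, hits))       # per-shard top hits
    return partial


def gather(shards, partial, size):
    """Gather phase: sort globally, then fetch full docs from relevant shards."""
    top = heapq.nlargest(size, partial)
    return [shards[shard_id][doc_id] for _, doc_id, shard_id in top]


def search(shards, score, size):
    return gather(shards, scatter(shards, score, size), size)


shards = {
    0: {"a": {"title": "es routing"}, "b": {"title": "kibana"}},
    1: {"c": {"title": "es es"}},
}
count_es = lambda doc: doc["title"].split().count("es")
print(search(shards, count_es, size=2))
```

Note that the full documents are read only in gather, and only from the shards that own a top hit.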

Routing
Routing can control which shard your documents and queries will be forwarded to.
"_routing" : {
"required" : true
}
post/1?routing=12
_search?routing=12,6654&q=userId:12+AND+section:6654
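
Why routing narrows a search to one shard: the target shard is derived from a hash of the routing value modulo the number of primary shards. The Python sketch below shows the idea. zlib.crc32 is only a stand-in (Elasticsearch uses its own murmur3-based hash), so the shard numbers here will not match a real cluster, and the function names are hypothetical.

```python
import zlib


def shard_for(routing, number_of_primary_shards):
    """Pick a shard from a routing value (stand-in hash, not Elasticsearch's)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


def shards_for_search(routing_values, number_of_primary_shards):
    """A search with routing=12,6654 only visits the shards those values map to."""
    return {shard_for(value, number_of_primary_shards) for value in routing_values}


print(shard_for("12", 5))
print(shards_for_search(["12", "6654"], 5))
```

The same value always maps to the same shard, which is also why the number of primary shards cannot change without reindexing.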

Elasticsearch allows us to control the write consistency to prevent writes happening when they should not.
action.write_consistency: quorum   # accepted values: one, quorum, all
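
As a rough illustration of what the three values mean, here is a Python sketch of how many shard copies must be active before a write proceeds. It follows the rule described in older Elasticsearch documentation, quorum = int((primary + number_of_replicas) / 2) + 1, enforced only when there is more than one replica; treat it as an assumption to check against the version you run. The function name is made up.

```python
def required_active_copies(consistency, number_of_replicas):
    """Active shard copies needed for a write under the given consistency."""
    total = 1 + number_of_replicas  # the primary plus its replicas
    if consistency == "one":
        return 1
    if consistency == "all":
        return total
    if consistency == "quorum":
        # Assumption from older docs: quorum only applies with > 1 replica,
        # otherwise a single-node setup could never accept writes.
        if number_of_replicas > 1:
            return total // 2 + 1
        return 1
    raise ValueError(f"unknown consistency level: {consistency}")


for level in ("one", "quorum", "all"):
    print(level, required_active_copies(level, number_of_replicas=2))
```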

If we index a document into an index that does not exist, Elasticsearch automatically creates the index for us. This can be turned off, or limited to certain index name patterns:
action.auto_create_index: false
action.auto_create_index: +logs*,-*
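
A Python sketch of how a pattern list such as +logs*,-* could be evaluated. The assumptions: patterns are checked left to right, the first match decides, a leading - forbids and a leading + (or no prefix) allows, and a name that matches nothing is not auto-created. fnmatch is a stand-in for Elasticsearch's simple * wildcard matching, and may_auto_create is a hypothetical name.

```python
from fnmatch import fnmatchcase


def may_auto_create(index_name, setting):
    """Decide whether `index_name` may be created automatically.

    `setting` is either a bool or a comma-separated pattern list.
    """
    if isinstance(setting, bool):
        return setting
    for pattern in setting.split(","):
        pattern = pattern.strip()
        allow = not pattern.startswith("-")
        glob = pattern[1:] if pattern[:1] in ("+", "-") else pattern
        if fnmatchcase(index_name, glob):
            return allow  # first matching pattern decides
    return False


print(may_auto_create("logs-2016.05.02", "+logs*,-*"))
print(may_auto_create("blog", "+logs*,-*"))
```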
curl -XPUT http://localhost:9200/blog/ -d '{
"settings" : {
"number_of_shards" : 1,
"number_of_replicas" : 2
}
}'
curl -XDELETE http://localhost:9200/blog
Elasticsearch uses an automatic type-determining algorithm: as already mentioned, it can try to guess the schema for our documents by looking at the JSON that each document is built from. This dynamic mapping can be disabled for an index:
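
A heavily simplified Python sketch of the guessing idea: look at each JSON value and pick a field type. The real rules are richer and vary between versions (for example, which date formats are detected and whether floats map to double or float), so the mapping below is an assumption for illustration only, using the type names from the mappings in these notes.

```python
import json
from datetime import datetime


def guess_type(value):
    """Guess a field type from a single parsed JSON value (simplified)."""
    if isinstance(value, bool):      # check bool before int: True is an int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)   # stand-in for date detection
            return "date"
        except ValueError:
            return "string"
    return "unknown"


def guess_mapping(document):
    """Guess a type for every top-level field of a JSON document."""
    return {field: guess_type(value) for field, value in document.items()}


doc = json.loads('{"id": 1, "name": "Test document", "published": "2016-05-02"}')
print(guess_mapping(doc))
```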

curl -XPUT 'localhost:9200/sites' -d '{
"index.mapper.dynamic": false
}'

curl -XGET 'localhost:9200/users/_mapping?pretty'
curl -XPOST 'http://localhost:9200/posts' -d @posts.json
where posts.json contains:
{
"mappings": {
"post": {
"properties": {
"id": { "type":"long" },
"name": { "type":"string" },
"published": { "type":"date" },
"contents": { "type":"string" }
}
}
}
}
MULTI FIELDS

Monday, May 2, 2016
