Elasticsearch: The Definitive Guide
Heap: Sizing and Swapping
./bin/elasticsearch -Xmx10g -Xms10g
Ensure that the min (Xms) and max (Xmx) sizes are the same to prevent the heap from resizing at runtime, a very costly process.
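Rather than passing JVM flags directly, the startup script also honors the ES_HEAP_SIZE environment variable, which sets both values at once (a minimal sketch; 10g is just an example value):
export ES_HEAP_SIZE=10g
./bin/elasticsearch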
Give Half Your Memory to Lucene
Heap is definitely important to Elasticsearch. It is used by many in-memory data structures to provide fast operation. But with that said, there is another major user of memory that is off heap: Lucene.
Lucene is designed to leverage the underlying OS for caching in-memory data structures. Lucene segments are stored in individual files. Because segments are immutable, these files never change. This makes them very cache friendly, and the underlying OS will happily keep hot segments resident in memory for faster access.
Lucene’s performance relies on this interaction with the OS. But if you give all available memory to Elasticsearch’s heap, there won’t be any left over for Lucene. This can seriously impact the performance of full-text search.
The standard recommendation is to give 50% of the available memory to Elasticsearch heap, while leaving the other 50% free. It won’t go unused; Lucene will happily gobble up whatever is left over.
Don’t Cross 32 GB!
The JVM uses a trick to compress object pointers when heaps are less than ~32 GB.
In Java, all objects are allocated on the heap and referenced by a pointer. Ordinary object pointers (OOP) point at these objects, and are traditionally the size of the CPU’s native word: either 32 bits or 64 bits, depending on the processor. The pointer references the exact byte location of the value.
For 32-bit systems, this means the maximum heap size is 4 GB. For 64-bit systems, the heap size can get much larger, but the overhead of 64-bit pointers means there is more wasted space simply because the pointer is larger. And worse than wasted space, the larger pointers eat up more bandwidth when moving values between main memory and various caches (LLC, L1, and so forth).
Java uses a trick called compressed oops to get around this problem. Instead of pointing at exact byte locations in memory, the pointers reference object offsets. This means a 32-bit pointer can reference four billion objects, rather than four billion bytes. Ultimately, this means the heap can grow to around 32 GB of physical size while still using a 32-bit pointer.
Once you cross that magical ~30–32 GB boundary, the pointers switch back to ordinary object pointers. The size of each pointer grows, more CPU-memory bandwidth is used, and you effectively lose memory. In fact, it takes until around 40–50 GB of allocated heap before you have the same effective memory of a 32 GB heap using compressed oops.
The moral of the story is this: even when you have memory to spare, try to avoid crossing the 32 GB heap boundary. It wastes memory, reduces CPU performance, and makes the GC struggle with large heaps.
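You can check whether a given heap size still gets compressed oops by asking the JVM directly; the exact output varies by JVM version, but the UseCompressedOops flag tells you the answer:
java -Xmx32600m -XX:+PrintFlagsFinal -version | grep UseCompressedOops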
Swapping Is the Death of Performance
sudo swapoff -a
To disable it permanently, you’ll likely need to edit your /etc/fstab. Consult the documentation for your OS.
If disabling swap completely is not an option, you can try to lower swappiness. This value controls how aggressively the OS tries to swap memory. This prevents swapping under normal circumstances, but still allows the OS to swap under emergency memory situations.
For most Linux systems, this is configured using the sysctl value:
vm.swappiness = 1
A swappiness of 1 is better than 0, since on some kernel versions a swappiness of 0 can invoke the OOM-killer.
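A sketch of applying the value immediately and persisting it across reboots (the sysctl.conf location may differ by distribution):
sudo sysctl -w vm.swappiness=1
echo "vm.swappiness = 1" | sudo tee -a /etc/sysctl.conf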
If neither approach is possible, you should enable mlockall. This allows the JVM to lock its memory and prevent it from being swapped by the OS. In your elasticsearch.yml, set this:
bootstrap.mlockall: true
File Descriptors and MMap
Lucene uses a very large number of files. At the same time, Elasticsearch uses a large number of sockets to communicate between nodes and HTTP clients. All of this requires available file descriptors.
Many modern Linux distributions ship with a paltry 1,024 file descriptors allowed per process.
You should increase your file descriptor count to something very large, such as 64,000.
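How you raise the limit is OS specific; on many Linux systems it is a combination of ulimit and /etc/security/limits.conf (a sketch, assuming the node runs as the elasticsearch user):
ulimit -n 64000
# persistently, in /etc/security/limits.conf:
elasticsearch  -  nofile  64000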
You can check the max_file_descriptors for each node with the process info API:
GET /_nodes/process
Elasticsearch also uses a mix of NioFS and MMapFS for the various files. Ensure that you configure the maximum map count so that there is ample virtual memory available for mmapped files. This can be set temporarily:
sysctl -w vm.max_map_count=262144
Changing Settings Dynamically
PUT /_cluster/settings
{
    "persistent" : {
        "discovery.zen.minimum_master_nodes" : 2
    },
    "transient" : {
        "indices.store.throttle.max_bytes_per_sec" : "50mb"
    }
}
Indexing Performance Tips
Test Performance Scientifically
Using and Sizing Bulk Requests
Storage
Use SSDs.
Use RAID 0. Striped RAID will increase disk I/O, at the obvious expense of potential failure if a drive dies. Don’t use mirrored or parity RAIDs, since replicas provide that functionality.
Alternatively, use multiple drives and allow Elasticsearch to stripe data across them via multiple path.data directories.
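In elasticsearch.yml this looks something like the following (the mount points are hypothetical):
path.data: /mnt/disk1,/mnt/disk2,/mnt/disk3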
Do not use remote-mounted storage, such as NFS or SMB/CIFS. The latency introduced here is antithetical to performance.
If you are on EC2, beware of EBS. Even the SSD-backed EBS options are often slower than local instance storage.
Throttling
Segments and Merging
But sometimes merging falls behind the ingestion rate. If this happens, Elasticsearch will automatically throttle indexing requests to a single thread. This prevents a segment explosion problem, in which hundreds of segments are generated before they can be merged. Elasticsearch will log INFO-level messages stating “now throttling indexing” when it detects merging falling behind indexing.
Elasticsearch defaults here are conservative: you don’t want search performance to be impacted by background merging. But sometimes (especially on SSDs, or during a large bulk import) the default throttle limit is too low, and you may want to raise indices.store.throttle.max_bytes_per_sec.
If you don’t need near real-time accuracy on your search results, consider dropping the index.refresh_interval of each index to 30s. If you are doing a large import, you can disable refreshes by setting this value to -1 for the duration of the import. Don’t forget to reenable it when you are finished!
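Both changes can be made on a live index through the update-settings API (the index name is illustrative):
PUT /my_logs/_settings
{ "refresh_interval": "30s" }

PUT /my_logs/_settings
{ "refresh_interval": -1 }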
If you are doing a large bulk import, consider disabling replicas by setting index.number_of_replicas: 0. When documents are replicated, the entire document is sent to the replica node and the indexing process is repeated verbatim. This means each replica will perform the analysis, indexing, and potentially merging process.
In contrast, if you index with zero replicas and then enable replicas when ingestion is finished, the recovery process is essentially a byte-for-byte network transfer. This is much more efficient than duplicating the indexing process.
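A sketch of that sequence with the update-settings API (the index name is illustrative):
PUT /my_index/_settings
{ "number_of_replicas": 0 }

PUT /my_index/_settings
{ "number_of_replicas": 1 }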
If you don’t have a natural ID for each document, use Elasticsearch’s auto-ID functionality. It is optimized to avoid version lookups, since the autogenerated ID is unique.
If you are using your own ID, try to pick an ID that is friendly to Lucene. Examples include zero-padded sequential IDs, UUID-1, and nanotime; these IDs have consistent, sequential patterns that compress well. In contrast, IDs such as UUID-4 are essentially random, which offer poor compression and slow down Lucene.
Rolling Restarts
If you shut down a single node for maintenance, the cluster will immediately recognize the loss of a node and begin rebalancing. This can be irritating if you know the node maintenance is short term, since the rebalancing of very large shards can take some time.
PUT /_cluster/settings
{
"transient" : {
"cluster.routing.allocation.enable" : "none"
}
}
Shut down a single node, preferably using the shutdown API on that particular machine:
POST /_cluster/nodes/_local/_shutdown
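Once the node has restarted and rejoined the cluster, reenable shard allocation by setting the same setting back to all:
PUT /_cluster/settings
{
    "transient" : {
        "cluster.routing.allocation.enable" : "all"
    }
}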
To back up your cluster, you can use the snapshot API. This will take the current state and data in your cluster and save it to a shared repository. This backup process is “smart.” Your first snapshot will be a complete copy of data, but all subsequent snapshots will save the delta between the existing snapshots and the new data. Data is incrementally added and deleted as you snapshot data over time. This means subsequent backups will be substantially faster since they are transmitting far less data.
PUT _snapshot/my_backup
{
"type": "fs",
"settings": {
"location": "/mount/backups/my_backup",
"max_snapshot_bytes_per_sec" : "50mb",
"max_restore_bytes_per_sec" : "50mb"
}
}
PUT _snapshot/my_backup/snapshot_1
PUT _snapshot/my_backup/snapshot_1?wait_for_completion=true
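You can list the snapshots in a repository, or inspect a single one, with a simple GET:
GET /_snapshot/my_backup/_all
GET /_snapshot/my_backup/snapshot_1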
Restoring from a Snapshot
POST _snapshot/my_backup/snapshot_1/_restore
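The restore call can also take a body to restore only selected indices, or to rename them on the way back in (a sketch; the index name and patterns are illustrative):
POST /_snapshot/my_backup/snapshot_1/_restore
{
    "indices": "index_1",
    "rename_pattern": "index_(.+)",
    "rename_replacement": "restored_index_$1"
}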
Clusters Are Living, Breathing Creatures
Elasticsearch works hard to make clusters self-sufficient and just work. But a cluster still requires routine care and feeding, such as routine backups and upgrades.
Upgrading should be a routine process, rather than a once-yearly fiasco that requires countless hours of precise planning.
Take frequent snapshots of your cluster, and periodically test those snapshots by performing a real recovery!
GET /megacorp/employee/_search
GET /megacorp/employee/_search?q=last_name:Smith
Query DSL
Phrase Search
GET /megacorp/employee/_search
{
"query" : {
"match_phrase" : {
"about" : "rock climbing"
}
},
"highlight": {
"fields" : {
"about" : {}
}
}
}
GET /megacorp/employee/_search
{
"query" : {
"filtered" : {
"filter" : {
"range" : {
"age" : { "gt" : 30 }
}
},
"query" : {
"match" : {
"last_name" : "smith"
}
}
}
}
}
Aggregations
Aggregations allow hierarchical rollups too.
"aggs" : {
"all_interests" : {
"terms" : { "field" : "interests" },
"aggs" : {
"avg_age" : {
"avg" : { "field" : "age" }
}
}
}
}
Cluster
As nodes are added to or removed from the cluster, the cluster reorganizes itself to spread the data evenly.
One node in the cluster is elected to be the master node, which is in charge of managing cluster-wide changes like creating or deleting an index, or adding or removing a node from the cluster. The master node does not need to be involved in document-level changes or searches, which means that having just one master node will not become a bottleneck as traffic grows.
GET /_cluster/health
Add an Index
To add data to Elasticsearch, we need an index—a place to store related data. In reality, an index is just a logical namespace that points to one or more physical shards.
A shard is a low-level worker unit that holds just a slice of all the data in the index
As your cluster grows or shrinks, Elasticsearch will automatically migrate shards between nodes so that the cluster remains balanced.
A shard can be either a primary shard or a replica shard.
A replica shard is just a copy of a primary shard. Replicas are used to provide redundant copies of your data to protect against hardware failure, and to serve read requests like searching or retrieving a document.
The number of primary shards in an index is fixed at the time that an index is created, but the number of replica shards can be changed at any time.
By default, indices are assigned five primary shards
PUT /blogs
{
"settings" : {
"number_of_shards" : 3,
"number_of_replicas" : 1
}
}
A cluster health of yellow means that all primary shards are up and running (the cluster is capable of serving any request successfully) but not all replica shards are active
increase the number of replicas from the default of 1 to 2:
PUT /blogs/_settings
{
"number_of_replicas" : 2
}
all data in every field is indexed by default.
it can use all of those inverted indices in the same query, to return results at breathtaking speed.
Document Metadata
_index
_type
_id
PUT /website/blog/123
Autogenerating IDs
instead of using the PUT verb (“store this document at this URL”), we use the POST verb (“store this document under this URL”).
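For example, to let Elasticsearch generate the ID (the field values here are illustrative):
POST /website/blog/
{
    "title": "My second blog entry",
    "text":  "Still trying this out...",
    "date":  "2014/01/01"
}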
GET /website/blog/123?pretty - _source
GET /website/blog/123?_source=title,text
Checking Whether a Document Exists
curl -i -XHEAD http://localhost:9200/website/blog/123
HEAD requests don’t return a body, just HTTP headers
If we already have an _id that we want to use, then we have to tell Elasticsearch that it should accept our index request only if a document with the same _index, _type, and _id doesn’t exist already.
PUT /website/blog/123?op_type=create
PUT /website/blog/123/_create
Deleting a Document
Even though the document doesn’t exist (found is false), the _version number has still been incremented. This is part of the internal bookkeeping, which ensures that changes are applied in the correct order across multiple nodes.
Optimistic Concurrency Control - _version
If an older version of a document arrives after a new version, it can simply be ignored.
PUT /website/blog/1?version=1
Using Versions from an External System
A common setup is to use some other database as the primary data store and Elasticsearch to make the data searchable, which means that all changes to the primary database need to be copied across to Elasticsearch as they happen.
The way external version numbers are handled is a bit different from the internal version numbers we discussed previously. Instead of checking that the current _version is the same as the one specified in the request, Elasticsearch checks that the current _version is less than the specified version. If the request succeeds, the external version number is stored as the document’s new _version.
PUT /website/blog/2?version=5&version_type=external
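The complete request carries an ordinary document body along with the external version (the field values are illustrative):
PUT /website/blog/2?version=5&version_type=external
{
    "title": "My first external blog entry",
    "text":  "Starting to get the hang of this..."
}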
Partial Updates to Documents
POST /website/blog/1/_update
{
"doc" : {
"tags" : [ "testing" ],
"views": 0
}
}
"script" : "ctx._source.views+=1"
{
"script" : "ctx._source.tags+=new_tag",
"params" : {
"new_tag" : "search"
}
}
We can even choose to delete a document based on its contents, by setting ctx.op to delete:
POST /website/blog/1/_update
{
"script" : "ctx.op = ctx._source.views == count ? 'delete' : 'none'",
"params" : {
"count": 1
}
}
Updating a Document That May Not Yet Exist - upsert
POST /website/pageviews/1/_update
{
    "script" : "ctx._source.views+=1",
    "upsert": {
        "views": 1
    }
}
POST /website/pageviews/1/_update?retry_on_conflict=5
Retrieving Multiple Documents - _mget
Cheaper in Bulk - _bulk
Every line must end with a newline character (\n), including the last line. These are used as markers to allow for efficient line separation.
create
Create a document only if the document does not already exist.
index
Create a new document or replace an existing document
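A small bulk request illustrating the format, mixing the action types above (index, type, and document values are illustrative):
POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }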
Routing a Document to a Shard
shard = hash(routing) % number_of_primary_shards
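For example, if a document’s routing value hashes to 23 (a made-up number) and the index has the default of 5 primary shards, the document lands on shard 23 % 5 = 3. This is also why the number of primary shards can never be changed after index creation: a different shard count would route the same document to a different shard.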
Create, index, and delete requests are write operations, which must be successfully completed on the primary shard before they can be copied to any associated replica shards
replication=sync/async
consistency=one,all,quorum
By default, the primary shard requires a quorum, or majority, of shard copies (where a shard copy can be a primary or a replica shard) to be available before even attempting a write operation.
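The quorum size is calculated from the index settings as follows; for example, with 1 primary and 2 replicas the quorum is int(3/2) + 1 = 2 copies:
quorum = int( (primary + number_of_replicas) / 2 ) + 1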
timeout=100 (milliseconds) or timeout=30s
DOCUMENT-BASED REPLICATION
When a primary shard forwards changes to its replica shards, it doesn’t forward the update request. Instead it forwards the new version of the full document. Remember that these changes are forwarded to the replica shards asynchronously, and there is no guarantee that they will arrive in the same order that they were sent. If Elasticsearch forwarded just the change, it is possible that changes would be applied in the wrong order, resulting in a corrupt document.
Multidocument Patterns
It breaks up the multidocument request into a multidocument request per shard, and forwards these in parallel to each participating node.
Why the Funny Format?
Elasticsearch reaches up into the networking buffer, where the raw request has been received, and reads the data directly. It uses the newline characters to identify and parse just the small action/metadata lines in order to decide which shard should handle each request.
These raw requests are forwarded directly to the correct shard. There is no redundant copying of data, no wasted data structures. The entire request process is handled in the smallest amount of memory possible.
GET /_search?timeout=10ms
Elasticsearch will return any results that it has managed to gather from each shard before the requests timed out.
/gb,us/_search
/g*,u*/_search
/gb/user/_search
/gb,us/user,tweet/_search
/_all/user,tweet/_search
Search Lite
+name:john +tweet:mary
The _all Field
When you index a document, Elasticsearch takes the string values of all of its fields and concatenates them into one big string, which it indexes as the special _all field.
+name:(mary john) +date:>2014-09-10 +(aggregations geo)
Mapping and Analysis
GET /gb/_mapping/tweet
Exact Values Versus Full Text
Character filters
First, the string is passed through any character filters in turn. Their job is to tidy up the string before tokenization. A character filter could be used to strip out HTML, or to convert & characters to the word and.
Tokenizer
Next, the string is tokenized into individual terms by a tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.
Token filters
Last, each term is passed through any token filters in turn, which can change terms (for example, lowercasing Quick), remove terms (for example, stopwords such as a, and, the) or add terms (for example, synonyms like jump and leap).
GET /_analyze?analyzer=standard
Text to analyze
How Inner Objects are Indexed
Lucene doesn’t understand inner objects. A Lucene document consists of a flat list of key-value pairs.
"tweet": [elasticsearch, flexible, very],
"user.id": [@johnsmith],
Arrays of Inner Objects
Full-Body Search
A filter asks a yes|no question of every document and is used for fields that contain exact values:
A query is similar to a filter, but also asks the question: How well does this document match?
The output from most filter clauses—a simple list of the documents that match the filter—is quick to calculate and easy to cache in memory, using only 1 bit per document. These cached filters can be reused efficiently for subsequent requests.
Queries have to not only find matching documents, but also calculate how relevant each document is, which typically makes queries heavier than filters. Also, query results are not cacheable.
The goal of filters is to reduce the number of documents that have to be examined by the query.
use query clauses for full-text search or for any condition that should affect the relevance score, and use filter clauses for everything else.
GET /_search
{
"query": {
"filtered": {
"query": { "match": { "email": "business opportunity" }},
"filter": { "term": { "folder": "inbox" }}
}
}
}
GET /gb/tweet/_validate/query?explain
Sorting on Multivalue Fields
For numbers and dates, you can reduce a multivalue field to a single value by using the min, max, avg, or sum sort modes.
"sort": {
"dates": {
"order": "asc",
"mode": "min"
}
}
GET /_search?sort=date:desc&sort=_score&q=search
String Sorting and Multifields
But storing the same string twice in the _source field is a waste of space. What we really want to do is to pass in a single field but to index it in two different ways.
"tweet": {
"type": "string",
"analyzer": "english",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
we can use the tweet field for search and the tweet.raw field for sorting:
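For example, a sketch of such a request, querying the analyzed field and sorting on the not_analyzed one:
GET /_search
{
    "query": {
        "match": { "tweet": "elasticsearch" }
    },
    "sort": "tweet.raw"
}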
TF/IDF
Term frequency
How often does the term appear in the field? The more often, the more relevant. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.
Inverse document frequency
How often does each term appear in the index? The more often, the less relevant. Terms that appear in many documents have a lower weight than more-uncommon terms.
Field-length norm
How long is the field? The longer it is, the less likely it is that words in the field will be relevant. A term appearing in a short title field carries more weight than the same term appearing in a long content field.
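Roughly, Lucene computes these three factors with the following formulas (simplified; the practical scoring function also includes query normalization and boosts):
tf(t in d)  = sqrt(frequency)
idf(t)      = 1 + log( numDocs / (docFreq + 1) )
norm(d)     = 1 / sqrt(numFieldTerms)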
term and document frequencies are calculated per shard, rather than per index.
Understanding Why a Document Matched
GET /us/tweet/12/_explain { QUERY}
format=yaml
Fielddata
The inverted index, which performs very well when searching, is not the ideal structure for sorting on field values:
To make sorting efficient, Elasticsearch loads all the values for the field that you want to sort on into memory. This is referred to as fielddata.
query then fetch
Results from multiple shards must be combined into a single sorted list before the search API can return a “page” of results.
the query is broadcast to a shard copy (a primary or replica shard) of every shard in the index. Each shard executes the search locally and builds a priority queue of matching documents.
Each shard returns the doc IDs and sort values of all the docs in its priority queue to the coordinating node, Node 3, which merges these values into its own priority queue to produce a globally sorted list of results.
A coordinating node will round-robin through all shard copies on subsequent requests in order to spread the load.
Each shard executes the query locally and builds a sorted priority queue of length from + size—in other words, enough results to satisfy the global search request all by itself. It returns a lightweight list of results to the coordinating node, which contains just the doc IDs and any values required for sorting, such as the _score.
The coordinating node merges these shard-level results into its own sorted priority queue, which represents the globally sorted result set.
The coordinating node identifies which documents need to be fetched and issues a multi GET request to the relevant shards.
Each shard loads the documents and enriches them, if required, and then returns the documents to the coordinating node.
Once all documents have been fetched, the coordinating node returns the results to the client.
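This is also why deep pagination is expensive: asking for from = 10000 with size = 10 on an index with 5 primary shards means each shard builds a priority queue of 10,010 entries, and the coordinating node sorts 50,050 of them only to discard 50,040.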
preference
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-preference.html
The preference parameter allows you to control which shards or nodes are used to handle the search request. It accepts values such as _primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, and _shards:2,3.
routing
you can specify one or more routing values to limit the search to just those shards:
GET /_search?routing=user_1,user2
search_type - count, query_and_fetch, scan
A scrolled search takes a snapshot in time. It doesn’t see any changes that are made to the index after the initial search request has been made. It does this by keeping the old data files around, so that it can preserve its “view” on what the index looked like at the time it started.
Scan instructs Elasticsearch to do no sorting, but to just return the next batch of results from every shard that still has results to return.
GET /old_index/_search?search_type=scan&scroll=1m
The scroll=1m value tells Elasticsearch how long it should keep the scroll open.
When scanning, the size is applied to each shard, so you will get back a maximum of size * number_of_primary_shards documents in each batch.
The scroll request also returns a new _scroll_id. Every time we make the next scroll request, we must pass the _scroll_id returned by the previous scroll request.
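The follow-up request looks like this, with the scroll ID sent as the request body (shown here as a placeholder):
GET /_search/scroll?scroll=1m
<the _scroll_id returned by the previous request>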
config/elasticsearch.yml
action.auto_create_index: false
Configuring Index Settings
Configuring Analyzers
GET /spanish_docs/_analyze?analyzer=es_std
El veloz zorro marrón
PUT /spanish_docs
{
"settings": {
"analysis": {
"analyzer": {
"es_std": {
"type": "standard",
"stopwords": "_spanish_"
}
}
}
}
}
Character filters
Character filters are used to “tidy up” a string before it is tokenized. For instance, if our text is in HTML format, it will contain HTML tags like <p> or <div> that we don’t want to be indexed. We can use the html_strip character filter to remove all HTML tags and to convert HTML entities like &Aacute; into the corresponding Unicode character Á.
An analyzer may have zero or more character filters.
Tokenizers
An analyzer must have a single tokenizer. The tokenizer breaks up the string into individual terms or tokens. The standard tokenizer, which is used in the standard analyzer, breaks up a string into individual terms on word boundaries, and removes most punctuation, but other tokenizers exist that have different behavior.
For instance, the keyword tokenizer outputs exactly the same string as it received, without any tokenization. The whitespace tokenizer splits text on whitespace only. The pattern tokenizer can be used to split text on a matching regular expression.
Token filters
Token filters may change, add, or remove tokens. We have already mentioned the lowercase and stop token filters, but there are many more available in Elasticsearch. Stemming token filters “stem” words to their root form. The ascii_folding filter removes diacritics, converting a term like "très" into "tres". The ngram and edge_ngram token filters can produce tokens suitable for partial matching or autocomplete.
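Putting the three building blocks together, a custom analyzer is defined under the index’s analysis settings (a sketch; the index and analyzer names are illustrative, and the filter choices are just one possible combination):
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "html_strip" ],
                    "tokenizer":   "standard",
                    "filter":      [ "lowercase", "stop" ]
                }
            }
        }
    }
}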
An index may have several types, each with its own mapping, and documents of any of these types may be stored in the same index.
the type name of each document is stored with the document in a metadata field called _type. When we search for documents of a particular type, Elasticsearch simply uses a filter on the _type field to restrict results to documents of that type.
Lucene also has no concept of mappings. Mappings are the layer that Elasticsearch uses to map complex JSON documents into the simple flat documents that Lucene expects to receive.
It will use the analyzer for the first title field that it finds, which will be correct for some docs and incorrect for the others.
If the shard grows too big, you have two options: upgrading the hardware to scale up vertically, or rebuilding the entire Elasticsearch index with more shards, to scale out horizontally to more machines of the same kind.