Thursday, October 8, 2015

Solr vs. Elasticsearch



-- I have already used Solr for several years; here I just want to learn more about Elasticsearch, how the two differ, and what advantages Elasticsearch has over Solr.
-- Solr's advantages are ignored here, as I already know about them.
 Solr vs. Elasticsearch – Why SearchBlox uses Elasticsearch!
The choice to use Elasticsearch rather than Solr comes from features such as distributed searches and improved scale. 
Elasticsearch is a REST based search engine powered by the Lucene library. Major features include:
• Hit highlighting
• Faceted search
• Full-text search
• Database integration
• Rich document handling
• Dynamic clustering
• Distributed search and index replication
Solr
Solr refers to the main logical data structure as The Collection, which is composed of many Shards.
Each Shard can have exact copies, called Replicas.
You must develop a custom search component to index different document types.

ElasticSearch
An Index is the term used for the top logical data structure, which can have multiple Shards.
Each Shard and Replica is itself a Lucene index.

Allows multiple document types in a single Index, which allows you to index different index structures in one place.

Different document types can be separated out and targeted individually when querying.

Configuration
Solr requires a schema.xml file to define its index structure, fields, and types. ElasticSearch is schemaless, which means you can start indexing documents without first defining a schema. You can still use mappings to define your index structure, which ElasticSearch applies when new indices are created. It will also attempt to create a field when a previously unseen field appears in a document being indexed; this behavior is optional and can be turned off. In ElasticSearch, configuration lives in the elasticsearch.yml file; in Solr, it is defined in the solrconfig.xml file.

Cluster Management
Zen Discovery is the term used for cluster management by ElasticSearch; it detects nodes and elects the master node for the cluster. There is also a plugin that lets ElasticSearch use Apache Zookeeper instead of its own Zen Discovery. Solr, on the other hand, uses an Apache Zookeeper ensemble. Zookeeper stores all configuration files and keeps track of node and cluster states. A new node must be pointed at a specific Zookeeper ensemble to join the cluster.

SearchBlox has opted to go with ElasticSearch mainly because of ease of use, especially when it comes to configuration and cluster management.

There is even an ES plugin that allows you to use Solr clients/tools with ElasticSearch!
https://stackoverflow.com/questions/10213009/solr-vs-elasticsearch/12131150
Percolation is an exciting and innovative feature that singlehandedly blows Solr right out of the water.
  • ElasticSearch is distributed. No separate project required. Replicas are near real-time too, which is called "Push replication".
  • ElasticSearch fully supports the near real-time search of Apache Lucene.
  • Handling multitenancy does not require a special configuration, whereas with Solr a more advanced setup is necessary.
  • ElasticSearch introduces the concept of the Gateway, which makes full backups easier.
  • Design: People love Solr. The Java API is somewhat verbose, but people like how it's put together. Solr code is unfortunately not always very pretty. Also, ES has sharding, real-time replication, and document routing built-in. While some of this exists in Solr, too, it feels a bit like an after-thought.
http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/
Elasticsearch percolation is similar to webhooks. The idea is to have Elasticsearch notify your application when new content matches your filters instead of having to constantly poll the search engine to check for new updates.
The new workflow looks like this:
  1. register specific query (percolation) in Elasticsearch
  2. index new content (passing a flag to trigger percolation)
  3. the response to the indexing operation will contain the matched percolations
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html
Create an index with a mapping for the field message:
curl -XPUT 'localhost:9200/my-index' -d '{
  "mappings": {
    "my-type": {
      "properties": {
        "message": {
          "type": "string"
        }
      }
    }
  }
}'
Register a query in the percolator:
curl -XPUT 'localhost:9200/my-index/.percolator/1' -d '{
    "query" : {
        "match" : {
            "message" : "bonsai tree"
        }
    }
}'
Match a document to the registered percolator queries:
curl -XGET 'localhost:9200/my-index/my-type/_percolate' -d '{
    "doc" : {
        "message" : "A new bonsai tree in the office"
    }
}'
https://www.elastic.co/blog/percolator
http://qbox.io/blog/elasticsesarch-percolator
First you store queries into an index and then—through the Percolate API—you define documents in order to retrieve these queries.

The Percolate API is a commonly-used utility in Elasticsearch for alerting and monitoring documents. A good way to think about the main function of Percolate is "search in reverse." 

Percolate works in the opposite way, running your documents up against registered queries (percolators) for matches.

Since release 1.0.0, distributed Percolation has done away with these concerns, dropping the previous _percolator index shard restriction in favor of a .percolator type in an index.
The .percolator type gives users a distributed Percolator API environment with full shard distribution. You can now configure the number of shards necessary for your Percolator queries, changing from a restricted single-shard execution to a parallelized execution across all shards within that index. Multiple shards mean support for routing and preference, just like the other Search APIs (except the Explain API).
http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/
Its maturity translates to rich functionality beyond vanilla text indexing and searching, such as faceting, grouping (aka field collapsing), powerful filtering, pluggable document processing, pluggable search chain components, language detection, etc.

Elasticsearch, lending itself to easier scaling, attracts use cases demanding larger clusters with more data and more nodes.
Elasticsearch is more dynamic – data can easily move around the cluster as its nodes come and go, and this can impact stability and performance of the cluster.
While Solr has traditionally been more geared toward text search, Elasticsearch is aiming to handle analytical types of queries, too, and such queries come at a price.

Elasticsearch also has the ability to more easily grow and shrink clusters, create indices more dynamically, shard them on the fly, route documents and queries, etc., etc.

  • Elasticsearch dominates the open-source log management use case — lots of organizations index their logs in Elasticsearch to make them searchable.  While Solr can now be used for this, too (see Solr for Indexing and Searching Logs and Tuning Solr for Logs), it just missed the mindshare boat on this one.
Solr is still much more text-search-oriented.  On the other hand, Elasticsearch is often used for filtering and grouping – the analytical query workload – and not necessarily for text search.  Elasticsearch developers are putting a lot of effort into making such queries more efficient (lowering the memory footprint and CPU usage) at both the Lucene and Elasticsearch levels.  As such, at this point in time, Elasticsearch is a better choice for applications that need to do not just text search, but also complex search-time aggregations.
Elasticsearch is a bit easier to get started with – a single download and a single command gets everything running.  Solr has traditionally required a bit more work and knowledge, but Solr has recently made great strides to eliminate this and now just has to work on changing its reputation.

Operationally speaking, Elasticsearch is a bit simpler to work with – it has just a single process.  Solr, in its Elasticsearch-like fully distributed deployment mode known as SolrCloud, depends on Apache ZooKeeper.  ZooKeeper is super mature, super widely used, etc. etc., but it’s still another moving part.  That said, if you are using Hadoop, HBase, Spark, Kafka, or a number of other newer distributed software, you are likely already running ZooKeeper somewhere in your organization.
While Elasticsearch has a built-in ZooKeeper-like component called Zen, ZooKeeper is better at preventing the dreaded split-brain problem sometimes seen in Elasticsearch clusters.  To be fair, Elasticsearch developers are aware of this problem and are working on improving this aspect of Elasticsearch.
If you love monitoring and metrics, with Elasticsearch you’ll be in heaven.
http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/
In Solr you need the schema.xml file to define your index structure – to define fields and their types. Of course, you can have all fields defined as dynamic fields and create them on the fly, but you still need at least some degree of index configuration. In most cases, though, you’ll create a schema.xml to match your data structure.
ElasticSearch is a bit different – it can be called schemaless. What exactly does this mean, you may ask. In short, it means one can launch ElasticSearch and start sending documents to it in order to have them indexed without creating any sort of index schema, and ElasticSearch will try to guess field types. It is not always 100% accurate, at least compared to manually created index mappings, but it works quite well. Of course, you can also define the index structure (so-called mappings) and then create the index with those mappings, or even create the mappings files for each type that will exist in the index and let ElasticSearch use them when a new index is created. Sounds pretty cool, right? In addition to that, when a new, previously unseen field is found in a document being indexed, ElasticSearch will try to create that field and will try to guess its type. As you may imagine, this behavior can be turned off.
Let’s talk about the actual configuration of Solr and ElasticSearch for a bit. In Solr, the configuration of all components, search handlers, and index-specific things such as the merge factor, buffers, caches, etc. is defined in the solrconfig.xml file. After each change you need to restart the Solr node or reload the core. All configs in ElasticSearch are written to the elasticsearch.yml file, which is just another configuration file. However, that’s not the only way to store and change ElasticSearch settings. Many settings exposed by ElasticSearch (not all, though) can be changed on the live cluster – for example, you can change how your shards and replicas are placed inside your cluster, and ElasticSearch nodes don’t need to be restarted.  Learn more about this in ElasticSearch Shard Placement Control.
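As a sketch of what changing settings "on the live cluster" looks like (assuming a local Elasticsearch 1.x node on the default port; the allocation setting shown is one of the documented dynamic cluster settings of that era):

```shell
# Disable shard allocation cluster-wide on a live cluster -- no restart needed.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.enable" : "none"
  }
}'

# Re-enable allocation the same way:
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.enable" : "all"
  }
}'
```

"transient" settings are lost on a full cluster restart; use "persistent" instead to keep them.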

Discovery and Cluster Management

Solr and ElasticSearch take different approaches to cluster node discovery and to cluster management in general.  The main purpose of discovery is to monitor nodes’ states, choose master nodes, and in some cases also store shared configuration files.
By default ElasticSearch uses so-called Zen Discovery, which has two methods of node discovery: multicast and unicast.  With multicast, a single node sends a multicast request and all nodes that receive that request respond to it; so if your nodes can see each other at the network layer, the multicast method lets them form a cluster. Unicast, on the other hand, depends on a list of hosts that should be pinged in order to form the cluster. In addition to that, the Zen Discovery module is also responsible for electing the master node for the cluster and for fault detection. Fault detection is done in two ways – the master node pings all the other nodes to see if they are healthy, and the nodes ping the master to see if it is still working as it should.  We should note that there is an ElasticSearch plugin that makes ElasticSearch use Apache Zookeeper instead of its own Zen Discovery.
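For illustration, switching Zen Discovery from multicast to unicast was a couple of lines in elasticsearch.yml (a sketch; the host names are placeholders):

```yaml
# elasticsearch.yml -- use a fixed list of hosts instead of multicast
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es-node1:9300", "es-node2:9300"]
```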
Apache Solr uses a different approach to handling the search cluster. Solr uses an Apache Zookeeper ensemble – basically one or more Zookeeper instances running together. Zookeeper is used to store the configuration files and for monitoring – keeping track of the status of all nodes and of the overall cluster state. In order for a new node to join an existing cluster, Solr needs to know which Zookeeper ensemble to connect to.
There is one thing worth noting when it comes to cluster handling – the split-brain situation. Imagine a situation where your cluster is divided in half, so half of your nodes don’t see the other half, for example because of a network failure. In such cases ElasticSearch will try to elect a new master in the cluster part that doesn’t have one, and this will lead to two independent clusters running at the same time. This can be limited with a small degree of configuration, but it can still happen. On the other hand, Solr 4.0 is immune to split-brain situations, because it uses Zookeeper, which prevents them. If half of your Solr cluster is disconnected, it won’t be visible to Zookeeper, and thus data and queries won’t be forwarded there.

API

Those of you familiar with Solr know that in order to get search results from it you need to query one of the defined request handlers and pass in the parameters that define your query criteria. Depending on which query parser you choose to use, these parameters will be different, but the method is still the same – an HTTP GET request is sent to Solr in order to fetch search results. The good thing is that you are not limited to a single response format – you may choose to get results in XML, in JSON, in the JavaBin format, and in several other formats that have response writers developed for them.  You can thus choose the format that is most convenient for you and your search application.  Of course, the Solr API is not only about querying – you can also get statistics about different search components or control Solr behavior, such as collection creation, for example.
And what about ElasticSearch?  ElasticSearch exposes a REST API which can be accessed using the HTTP GET, DELETE, POST, and PUT methods. Its API allows one not only to query or delete documents, but also to create indices, manage them, control analysis, and get all the metrics describing the current state and configuration of ElasticSearch. If you need to know anything about ElasticSearch, you can get it through the REST API (we use it in our Scalable Performance Monitoring for ElasticSearch, too!). If you are used to Solr, there is one thing that may seem strange in the beginning – the only format ElasticSearch can respond in is JSON; there is no XML response, for example. Another big difference between ElasticSearch and Solr is querying. While with Solr all query parameters are passed in as URL parameters, in ElasticSearch queries are structured in a JSON representation. Queries structured as JSON objects give one a lot of control over how ElasticSearch should understand the query and thus what results to return.
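To make the contrast concrete, here is roughly the same search expressed both ways (a sketch assuming default local ports and placeholder index/collection names):

```shell
# Solr: criteria are URL parameters of a GET request
curl 'localhost:8983/solr/collection1/select?q=message:bonsai&wt=json'

# Elasticsearch: criteria are a structured JSON body
curl -XGET 'localhost:9200/my-index/_search' -d '{
  "query" : { "match" : { "message" : "bonsai" } }
}'
```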

Data Handling

Of course, both Solr and ElasticSearch leverage Lucene’s near real-time capabilities.  This makes it possible for queries to match documents right after they’ve been indexed. In addition to that, both Solr (since 4.0) and ElasticSearch (since 0.15) allow versioning of documents in the index.  This feature lets them support optimistic locking and thus prevent updates from being overwritten. Let’s look at how distributed indexing is done in Solr vs. ElasticSearch.
Let’s start with ElasticSearch this time. In order to add a document to the index in a distributed environment, ElasticSearch needs to choose which shard each document should be sent to. By default a document is placed in a shard that is calculated as a hash of the document’s identifier. Because this default behavior is not always desired, one can control and alter it by using a feature called routing.  This is controlled via the routing parameter, which can take any value you would like it to have. Imagine that you have a single logical index divided into multiple shards and you index multiple users’ data in it.  On the search side you know queries are narrowed mostly to a single user’s data. With the use of the routing parameter you can index all documents belonging to a single user within a single shard by using the same routing value for all his/her documents.  On the search side you can then use the same routing value when querying. This results in a single shard being queried instead of the query being spread across all shards in the index, which would be more expensive and slower. In case each index shard contains multiple users’ data, we could additionally use a filter to limit matches to only one user’s documents.
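A rough sketch of that routing idea (index, type, and routing values here are made up; assumes a local Elasticsearch 1.x-era cluster):

```shell
# Index two documents for the same user into the same shard via routing:
curl -XPUT 'localhost:9200/my-index/doc/1?routing=user42' -d '{"user":"user42","body":"first note"}'
curl -XPUT 'localhost:9200/my-index/doc/2?routing=user42' -d '{"user":"user42","body":"second note"}'

# Query with the same routing value -- only the shard that value maps to is searched:
curl -XGET 'localhost:9200/my-index/_search?routing=user42' -d '{
  "query" : { "match" : { "body" : "note" } }
}'
```

If a shard holds several users' data, add a filter on the user field to exclude the other users' matches.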
In order to forward a document to the proper shard, Solr uses the Murmur hashing algorithm, which calculates the hash for a given document on the basis of its unique identifier. This part is similar to the default ElasticSearch behavior.  However, Solr doesn’t yet let you specify explicitly which shard a document should be sent to – there is no document and query routing equivalent in Solr yet.
Of course, both Solr and ElasticSearch allow one to configure replicas of indices (ElasticSearch) or collections (Solr). This is crucial because replicas enable the creation of highly available clusters – even if some nodes are down, for example because of hardware failure or maintenance, the cluster and the data within it can remain available. Without replicas, if one node is lost, you lose (access to) the data that was on the missing node. If you have replicas present in your configuration, both search engines will automatically copy documents to the replicas, so you don’t need to worry about data loss.
Solr vs. ElasticSearch: Part 2 – Data Handling
To index data in ElasticSearch you need to prepare your data in JSON format. Solr also allows that, but in addition it lets you use other formats like the default XML or CSV. Importantly, indexing data in different formats has different performance characteristics, and some formats come with limitations. For example, indexing documents in CSV format is considered the fastest, but you can’t use field value boosting with that format.

More About ElasticSearch
ElasticSearch supports two additional things that Solr does not – nested documents and multiple document types inside a single index.

The nested documents functionality lets you create more than a flat document structure. For example, imagine you index documents that are bound to some group of users. In addition to document contents, you would like to store which users can access each document.  And this is where we run into a little problem – this data changes over time. If you were to store document content and users inside a single index document, you would have to reindex the whole document every time the list of users who can access it changes in any way. Luckily, with ElasticSearch you don’t have to do that – you can use nested documents and then use the appropriate queries for matching. In this example, a nested document would hold a list of users with document access rights. Internally, nested documents are indexed as separate index documents stored inside the same index. ElasticSearch ensures they are indexed in a way that allows it to use fast join operations to retrieve them. In addition to that, these documents are not shown when using standard queries – you have to use a nested query to get them, a very handy feature.
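A minimal sketch of the access-rights example (index and field names are invented; Elasticsearch 1.x-era syntax):

```shell
# Map "acl" as a nested field holding the users who may access each document:
curl -XPUT 'localhost:9200/docs' -d '{
  "mappings" : {
    "doc" : {
      "properties" : {
        "title" : { "type" : "string" },
        "acl"   : {
          "type" : "nested",
          "properties" : { "user" : { "type" : "string" } }
        }
      }
    }
  }
}'

# Nested documents are only reachable through a nested query:
curl -XGET 'localhost:9200/docs/_search' -d '{
  "query" : {
    "nested" : {
      "path"  : "acl",
      "query" : { "term" : { "acl.user" : "user42" } }
    }
  }
}'
```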
Multiple types of documents per index allow just what the name says – you can index different types of documents inside the same index.  This is not possible with Solr, as you have only one schema in Solr per core.  In ElasticSearch you can filter, query, or facet on document types. You can make queries against all document types or just choose a single document type (both with Java API and REST).

Solr
Remember that you need to have your configuration pushed into ZooKeeper ensemble in order to create a collection with a new configuration.
When it comes to Solr, there is additional functionality that is in the early stages of work, although it’s functional – the ability to split your shards. After applying the patch available in SOLR-3755 you can use a SPLIT action to split your index and write it to two separate cores. If you look at the mentioned JIRA issue, you’ll see that once this is committed, Solr will have the ability not only to create new replicas, but also to dynamically re-shard its indices.

ElasticSearch
During creation you can specify the number of shards an index should have, and you can decrease and increase the number of replicas with nothing more than a single API call. You cannot change the number of shards yet.  Of course, you can also define mappings and analyzers during index creation, so you have all the control you need to index a new type of data into your cluster.
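For example, something like the following (the index name is a placeholder; note the shard count is fixed at creation time, while replicas can be changed live):

```shell
# Create an index with 5 shards and 1 replica per shard:
curl -XPUT 'localhost:9200/my-index' -d '{
  "settings" : {
    "number_of_shards"   : 5,
    "number_of_replicas" : 1
  }
}'

# Later, raise the replica count on the live index with one call:
curl -XPUT 'localhost:9200/my-index/_settings' -d '{
  "index" : { "number_of_replicas" : 2 }
}'
```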


Partial Document Updates
Solr
# Atomic update (Solr 4.0+): set the price field of document 1 to 100
curl 'localhost:8983/solr/update' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'

ElasticSearch
In the case of ElasticSearch you need to have the _source field enabled for the partial update functionality to work. _source is a special ElasticSearch field that stores the original JSON document.  This functionality doesn’t have add/set/delete commands, but instead lets you use a script to modify a document.

curl -XPOST 'localhost:9200/sematext/doc/1/_update' -d '{
    "script" : "ctx._source.price = price",
    "params" : {
        "price" : 100
    }
}'

Multilingual Data Handling - solr
Analysis Chain Definition
Both Apache Solr and ElasticSearch allow you to define a custom analysis chain by specifying your own analyzer/tokenizer and a list of filters that should be used to process your data.

ElasticSearch allows one to specify the analyzer per document and per query. So, if you need to use a different analyzer for each document in the index you can do that in ElasticSearch. The same applies to queries – each query can use a different analyzer.
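As a sketch, a per-query analyzer override can be passed inside a match query like this (index and field names are placeholders; "whitespace" is one of the built-in analyzers):

```shell
# Analyze this query's text with the whitespace analyzer instead of the
# analyzer configured for the field:
curl -XGET 'localhost:9200/my-index/_search' -d '{
  "query" : {
    "match" : {
      "message" : {
        "query"    : "Bonsai-Tree",
        "analyzer" : "whitespace"
      }
    }
  }
}'
```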

Results Grouping - solr and elastic
https://github.com/elastic/elasticsearch/pull/6124

Prospective Search - elasticsearch
Solr vs ElasticSearch: Part 3 – Searching
An ElasticSearch query is more structured, allowing for more precise control over what you are trying to get – similar to Lucene queries. Solr, on the other hand, uses a query parser to parse your query out of the textual value of the “q” URL parameter (n.b. you can use a query parser in ElasticSearch too).

Solr does make simple queries with boosting and the extended dismax parser very easy to write, although that comes at a price. If you want a higher degree of control over your query, you are (in most cases) forced to use local params which, while powerful, can be quite hard for users not familiar with their cryptic syntax.

The structured JSON way of querying ElasticSearch is a better fit in that case and feels more intuitive.


Solr vs ElasticSearch: Part 5 – Management API Capabilities
Settings API

ElasticSearch
ElasticSearch allows us to modify most configuration values dynamically. For example, you can clear your caches (or just a specific type of cache), and you can move shards and replicas to specific nodes in your cluster. In addition, you are also allowed to update mappings (to some extent), define warming queries (since version 0.20), etc. You can even shut down a single node or a whole cluster with a single HTTP call. Of course, this is just a sample and doesn’t cover all the possibilities exposed by ElasticSearch.
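For instance, the cache-clearing calls mentioned above look roughly like this (assuming a local 1.x-era node; fielddata is a documented cache-type parameter):

```shell
# Clear all caches on all indices:
curl -XPOST 'localhost:9200/_cache/clear'

# Clear only the field-data cache:
curl -XPOST 'localhost:9200/_cache/clear?fielddata=true'
```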
Apache Solr
In the case of Apache Solr we do not (yet) have the possibility of changing configuration values (like warming queries) with API calls.

Apache Solr
If you’ve used Apache Solr you have probably come across the debugQuery parameter and the explainOther parameter. Those two allow you to see the detailed score calculation for a given query, both for documents found in the results (the debugQuery parameter) and for explicitly specified ones (explainOther). In addition, we can also see how the analysis process is done with the use of the analysis handler or the analysis page of the Solr administration panel provided with Solr.

ElasticSearch
In the case of ElasticSearch we can create and delete indices by running a simple HTTP command (PUT or DELETE method) with the index name we are interested in. In addition to that, with a simple API call we can increase and decrease the number of replicas without shutting down nodes or creating new ones. With the newer ElasticSearch versions we can even manipulate shard placement with the cluster reroute API. With that API we can move shards between nodes, cancel the shard allocation process, and force shard allocation – all on a live cluster.

ElasticSearch
ElasticSearch exposes three separate REST endpoints to analyze our queries and documents and to explain document scores. The Analyze API allows us to test our analyzer on a specified text to see how it is processed, similar to the analysis page functionality of Solr. The Explain API provides information about the score calculation for a given document. Finally, the Validate API can check a query to see if it is valid and how expensive it may be.
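Request sketches for the three endpoints (a local 1.x-era node and placeholder index/type names assumed):

```shell
# 1. Analyze API -- how would this text be tokenized by the standard analyzer?
curl -XGET 'localhost:9200/_analyze?analyzer=standard' -d 'A new bonsai tree'

# 2. Validate API -- is this query well-formed, and why / why not?
curl -XGET 'localhost:9200/my-index/_validate/query?explain' -d '{
  "query" : { "match" : { "message" : "bonsai tree" } }
}'

# 3. Explain API -- why did document 1 get its score for this query?
curl -XGET 'localhost:9200/my-index/my-type/1/_explain' -d '{
  "query" : { "match" : { "message" : "bonsai tree" } }
}'
```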


http://solr-vs-elasticsearch.com/

