Massive Technical Interviews Tips: Solr Misc Part 6

Friday, October 6, 2017

Solr Misc Part 6

http://www.majiang.life/blog/deep-dive-on-elasticsearch-doc-values/
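Search uses the inverted index to find documents; aggregations collect and aggregate the data stored in doc values.

Doc values are generated at index time, alongside the inverted index. Like the inverted index, they are built per segment and are immutable. They are also serialized to disk like the inverted index, which helps both performance and scalability.

Because doc values persist the data structure to disk through serialization, they can rely on the operating system's memory (the file system cache) rather than the JVM heap. When the working set is much smaller than the memory available to the OS, the OS keeps the doc values resident in memory, so access is very fast. When the working set is much larger than available memory, the OS pages doc values to and from disk as needed. That is clearly much slower than keeping everything in memory, but the size of the data is no longer bounded by the server's RAM. If the same structure lived on the JVM heap, running out of memory would simply crash the process with an OutOfMemoryError.

Doc values are essentially a serialized column-oriented store, a structure well suited to aggregations, sorting, and scripting. This layout also compresses well, especially for numeric types, which reduces disk usage and speeds up access.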
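To make the split between the two structures concrete, here is a minimal in-memory sketch. It is only a toy model, not how Lucene stores anything on disk; the documents and the field names `color` and `price` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy documents; field names are invented for this sketch.
docs = [
    {"color": "red", "price": 10},
    {"color": "blue", "price": 25},
    {"color": "red", "price": 40},
]

# Inverted index: term -> list of doc IDs. Used to find matching documents.
inverted = defaultdict(list)
for doc_id, doc in enumerate(docs):
    inverted[doc["color"]].append(doc_id)

# Doc-values-style column: position = doc ID, value = field value.
# All values of one field sit together, which is what makes
# aggregation and sorting cheap.
price_column = [doc["price"] for doc in docs]

# Search step: which documents match color:red?
matching_ids = inverted["red"]

# Aggregation step: read each matching document's value from the column.
total = sum(price_column[doc_id] for doc_id in matching_ids)

print(matching_ids, total)  # [0, 2] 50
```

The search step never touches `price_column`, and the aggregation step never touches `inverted`, which mirrors the division of labour described above.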

http://lucene.472066.n3.nabble.com/What-is-cluster-overseer-at-SolrCloud-td4058390.html
The Overseer's main responsibility is to write the clusterstate.json file based on what individual nodes publish to ZooKeeper. It also does other things, like assigning shard and node names. If the Overseer dies, another Overseer is elected and starts processing the work queue where the dead Overseer left off.

You can see which node is the Overseer by going to the Cloud view in the admin UI. Click the Tree tab. Under /overseer_elect, click on the leader node. Part of its ID should tell you which node is acting as the Overseer.

http://lucene.apache.org/solr/guide/7_0/solrcloud-autoscaling-overview.html#solrcloud-autoscaling-overview

http://lucene.apache.org/solr/guide/7_0/cross-data-center-replication-cdcr.html
http://lucene.apache.org/solr/guide/7_0/rule-based-replica-placement.html

Examples of what replica placement rules can express:
Don’t assign more than 1 replica of this collection to a host.
Assign all replicas to nodes with more than 100GB of free disk space, or assign replicas where there is more disk space.
Do not assign any replica on a given host because I want to run an overseer there.
Assign only one replica of a shard in a rack.
Assign replicas only to nodes hosting fewer than 5 cores.
Assign replicas to the nodes hosting the fewest cores.

shard: this is the name of a shard or a wildcard (* means all shards). If shard is not specified, then the rule applies to the entire collection.
replica: this can be a number or a wildcard (* means any number from zero to infinity).
tag: this is an attribute of a node in the cluster that can be used in a rule, e.g., “freedisk”, “cores”, “rack”, “dc”, etc. The tag name can be a custom string. If creating a custom tag, a snitch is responsible for providing tags and values.
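As a rough illustration of how these conditions fit together, the sketch below splits a rule string into its conditions. It assumes the comma-separated `name:value` form with an optional `<`, `>` or `!` operator prefix and an optional `~` suffix; `parse_rule` is a hypothetical helper written for this note, not Solr's own parser.

```python
def parse_rule(rule: str) -> dict:
    """Split a rule such as 'shard:*,replica:<2,node:*' into conditions.

    Hypothetical helper for illustration only; not part of Solr.
    """
    conditions = {}
    for part in rule.split(","):
        name, _, raw = part.strip().partition(":")
        # A trailing '~' marks the condition as fuzzy (best effort).
        fuzzy = raw.endswith("~")
        if fuzzy:
            raw = raw[:-1]
        # A leading '<', '>' or '!' is an operator; otherwise plain equality.
        if raw[:1] in ("<", ">", "!"):
            op, value = raw[0], raw[1:]
        else:
            op, value = "=", raw
        conditions[name] = {"op": op, "value": value, "fuzzy": fuzzy}
    return conditions


rule = parse_rule("shard:*,replica:<2,node:*")
print(rule["replica"])  # {'op': '<', 'value': '2', 'fuzzy': False}
```

Read this way, the rule says: for every shard, keep fewer than 2 replicas on any one node.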

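The trailing ~ is the fuzzy operator: Solr first tries to satisfy the condition strictly and, if it cannot, falls back to the next best match. For example, with the rule freedisk:>200~, Solr will try to assign replicas of this collection to nodes with more than 200GB of free disk space. If that is not possible, the node with the most free disk space is chosen instead.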
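The difference between a strict and a fuzzy condition can be sketched as follows. This is a simplified model under assumed inputs: the node names and disk figures are invented, and `choose_node` is a hypothetical function, not a Solr API.

```python
def choose_node(free_disk_gb: dict, threshold: int, fuzzy: bool) -> str:
    """Pick a node for a 'freedisk:>threshold' condition.

    Hypothetical sketch: strict mode fails when nothing matches,
    fuzzy mode falls back to the node with the most free disk.
    """
    matching = [n for n, gb in free_disk_gb.items() if gb > threshold]
    if matching:
        pool = matching
    elif fuzzy:
        pool = list(free_disk_gb)  # best effort: consider every node
    else:
        raise ValueError("no node satisfies the rule")
    return max(pool, key=lambda n: free_disk_gb[n])


# Invented cluster: no node has more than 200GB free.
nodes = {"node1": 150, "node2": 180, "node3": 90}
print(choose_node(nodes, 200, fuzzy=True))  # node2
```

With `fuzzy=False` the same call raises `ValueError`, which corresponds to the strict rule being unsatisfiable.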

Nodes are sorted before assignment, and the rules themselves determine the sort order. This ensures that even if many nodes match the rules, the best nodes are picked first. For example, if there is a rule such as freedisk:>20, nodes are sorted by free disk space in descending order and the node with the most disk space is picked first. Or, if the rule is cores:<5, nodes are sorted by number of cores in ascending order and the node with the fewest cores is picked first.
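A small sketch of that ordering, assuming each node reports its tag values as a plain dictionary. The node names and numbers are invented, and `sort_nodes` is a hypothetical helper rather than anything Solr exposes.

```python
def sort_nodes(nodes: dict, tag: str, op: str) -> list:
    """Order node names by preference for a single condition.

    '>' prefers larger tag values first, '<' prefers smaller ones first.
    Hypothetical helper for illustration only.
    """
    return sorted(nodes, key=lambda n: nodes[n][tag], reverse=(op == ">"))


# Invented tag values for three nodes.
nodes = {
    "node1": {"freedisk": 50, "cores": 1},
    "node2": {"freedisk": 120, "cores": 3},
    "node3": {"freedisk": 30, "cores": 4},
}

by_disk = sort_nodes(nodes, "freedisk", ">")   # rule freedisk:>20
by_cores = sort_nodes(nodes, "cores", "<")     # rule cores:<5
print(by_disk)   # ['node2', 'node1', 'node3']
print(by_cores)  # ['node1', 'node2', 'node3']
```

The first entry of each list is the node that would be picked first under that rule.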
