Massive Technical Interviews Tips: Solr Misc Part 5

https://wiki.apache.org/solr/SolrConfigXml

Inside requestHandlers it is possible to configure and specify the shard handler used for distributed search, it is possible to plug in custom shard handlers as well as configure the provided solr version.

To configure the standard handler one needs to provide configuration like so

<requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- other params go here -->
     <shardHandlerFactory class="HttpShardHandlerFactory">
        <int name="socketTimeout">1000</int>
        <int name="connTimeout">5000</int>
      </shardHandlerFactory>
  </requestHandler>

The parameters that can be specified are as follows:

socketTimeout. default: 0 (use OS default) - The amount of time in ms that a socket is allowed to wait for
connTimeout. default: 0 (use OS default) - The amount of time in ms that is accepted for binding / connection a socket
maxConnectionsPerHost. default: 20 - The maximum number of connections that is made to each individual shard in a distributed search
corePoolSize. default: 0 - The retained lowest limit on the number of threads used in coordinating distributed search
maximumPoolSize. default: Integer.MAX_VALUE - The maximum number of threads used for coordinating distributed search
maxThreadIdleTime. default: 5 seconds - The amount of time to wait for before threads are scaled back in response to a reduction in load
sizeOfQueue. default: -1 - If specified the thread pool will use a backing queue instead of a direct handoff buffer. This may seem difficult to grasp, essentially high throughput systems will want to configure this to be a direct hand off (with -1). Systems that desire better latency will want to configure a reasonable size of queue to handle variations in requests.
fairnessPolicy. default: false - Chooses in the JVM specifics dealing with fair policy queuing, if enabled distributed searches will be handled in a First in First out fashion at a cost to throughput. If disabled throughput will be favoured over latency.

https://www.cvedetails.com/vulnerability-list/vendor_id-45/product_id-18263/Apache-Solr.html
Apache » Solr : Security Vulnerabilities

https://lucene.apache.org/solr/guide/6_6/collections-api.html

Either of these commands would cause all the active replicas that had the "preferredLeader" property set and were not already the preferred leader to become leaders.

http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=collection1
http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=collection1&maxAtOnce=5&maxWaitSeconds=30

/admin/collections?action=ADDREPLICAPROP&collection=collectionName&shard=shardName&replica=replicaName&property= preferredLeader&property.value=true

http://localhost:8983/solr/admin/collections?action=ADDREPLICAPROP&shard=shard1&collection=collection1&replica=core_node1&property=preferredLeader&property.value=true

Either of these commands would put the "preferredLeader" property on one replica in every shard in the "collection1" collection.

http://localhost:8983/solr/admin/collections?action=BALANCESHARDUNIQUE&collection=collection1&property=preferredLeader

http://localhost:8983/solr/admin/collections?action=BALANCESHARDUNIQUE&collection=collection1&property=property.preferredLeader

https://github.com/apache/lucene-solr/blob/b32dcbbe42bd3360f6d1cfa65495f7007034d0a9/solr/solr-ref-guide/src/pagination-of-results.adoc
CursorMark
public CursorMark(IndexSchema schema, SortSpec sortSpec) {
PagingFieldCollector

https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
https://issues.apache.org/jira/browse/SOLR-5463
lucene searchAfter
http://yonik.com/solr/paging-and-deep-paging/

sort must include a tie-breaker sort on the id field. This prevents tie-breaking by internal lucene document id (which can change).
start must be 0 for all calls including a cursorMark.
pass cursorMark=* for the first request.
Solr will return a nextCursorMark in the response. Simply use this value for cursorMark on the next call to continue paging through the results.

You know you have reached the end of a result set when you do not get back the full number of rows requested, or when the nextCursorMark returned is the same as the cursorMark you sent

(at which point, no documents will be in the returned list).

Although start must always be 0, you can vary the number of rows for every call to vary the page size.
You can re-use cursorMark values, changing other things like what stored fields are returned or what fields are faceted.
A client can efficiently go back pages by remembering previous cursorMarks and re-submitting them

https://stackoverflow.com/questions/28185133/cursormark-is-stateless-and-how-it-solves-deep-paging

Cursors in Solr are a logical concept, that doesn't involve caching any state information on the server. Instead the sort values of the last document returned to the client are used to compute a "mark" representing a logical point in the ordered space of sort values. That "mark" can be specified in the parameters of subsequent requests to tell Solr where to continue.

This is elaborated further in an explanation of the constraints on using cursorMark...

Cursor mark values are computed based on the sort values of each document in the result, which means multiple documents with identical sort values will produce identical Cursor mark values if one of them is the last document on a page of results. In that situation, the subsequent request using that cursorMark would not know which of the documents with the identical mark values should be skipped. Requiring that the uniqueKey field be used as a clause in the sort criteria guarantees that a deterministic ordering will be returned, and that every cursorMark value will identify a unique point in the sequence of documents.

https://wiki.apache.org/solr/ShawnHeisey
https://wiki.apache.org/lucene-java/JavaBugs

In principle, each segment contains a part of the index. New files get created when you add documents and you can completely ignore them. Only if you have problems with too many open file handles you my merge some of them together with the index OPTIMIZE command.

And yes, deleting one of the files will corrupt the index.

The tlog files are transaction logs where every index changing transaction (ADD, UPDATE, DELETE) is written down. If anything happens to your Solr server while there's an open segment currently undergoing a some transactions, the segment file will be corrupt. Solr then uses the tlog to rewind the already transmitted transactions and restore the failed segment to its best guess.

https://lucene.apache.org/core/3_0_3/fileformats.html#Segments
http://people.apache.org/~doronc/site/fileformats.pdf

In Lucene-Solr 4.5 and later, docValues are mostly disk-based to avoid the requirement for large heap allocations in Solr. If you use the field cache in sort, stats, and other queries, make those fields docValues

https://www.slideshare.net/LucidImagination/column-stride-fields-aka-docvalues?next_slideshow=1

https://www.cnblogs.com/o-andy-o/p/5830245.html

无论是聚类排序关联等，首先都需要获得文档中某个字段的值，通过docID去获得整个document，然后再去获得字段值，term转换得到最终值，FieldCache一开始就缓存了所有文档的某个特定域(所有数值类型以及不分词的stringField)的值到内存，便于随机存取该域值！

1. 常驻内存，大小是所有文档个数特定域类型大小

2. 初始加载过程耗时，需要遍历倒排索引及类型转换

docID->fieldvalue

建索引时，建立了document到field value的面向列的正排索引数据结构，直接通过已知的docID定位到字段值，从而无需加载document，亦不需要term转换，遍历term找寻doc等的过程

优点：大约节省三分之一的内存!

缺点：由于是硬盘读取，而非内存模式，对于大批量的使用下，优势明显，速度更优；小量情况下没有内存快！总体会慢15-20%

FieldCache is gone (moved to a dedicated UninvertingReader in the misc module). This means when you intend to sort on a field, you should index that field using doc values, which is much faster and less heap consuming than FieldCache.

LUCENE-5666：Change uninverted access (sorting, faceting, grouping, etc) to use the DocValues API instead of FieldCache

https://medium.com/@sarkaramrit2/frequent-out-of-memory-errors-in-apache-solr-36499f84c98a
Solr Caches — QueryResultCache, DocumentCache, FilterCache, FieldCache
Sorting/Faceting/Grouping on fields which are not DocValues

With current scenarios asking for Near-Real Time Search, a new searcher is opened via commits in relatively short intervals. If we have autowarmCount a sizeable number (in this case 128), imagine the time it will take for each searcher to get opened, as it will be loading documents for the cache again and again. With respect to heap memory, a part of it allocated is already filled up.

https://lucene.apache.org/solr/guide/7_4/query-settings-in-solrconfig.html#QuerySettingsinSolrConfig-Caches

queryResultCache

This cache holds the results of previous searches: ordered lists of document IDs (DocList) based on a query, a sort, and the range of documents requested.

documentCache

This cache holds Lucene Document objects (the stored fields for each document). Since Lucene internal document IDs are transient, this cache is not auto-warmed. The size for the documentCache should always be greater than max_results times the max_concurrent_queries, to ensure that Solr does not need to refetch a document during a request. The more fields you store in your documents, the higher the memory usage of this cache will be.

<documentCache class="solr.LRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>

https://stackoverflow.com/questions/10722145/solr-how-do-i-construct-a-query-that-requires-a-not-null-location-field/10725584

The canonical way is this:

fieldName:[* TO *]

Using a '' on the left side as Paige Cook suggested will probably work too but I don't trust it as much as I do the above. And since this is to a Location field, you will probably have to do this against one of the two underlying actual fields versus this logical composite field. They start with fieldName and end with some sort of numeric suffix; look in the Schema Browser to see what the actual name is.

An important thing to be aware of here is that a query like this is expensive for Solr to do as it much do a full index scan on this field. If you have many distinct location field values (thousands on up?), then this is a big deal. If you do this in a filter query, then it will be cached and it's perhaps moot. If you want this query to run fast, then at index time you should index a boolean field to indicate if there is a value in this field or not.

https://stackoverflow.com/questions/22079622/how-to-clear-zookeeper-corruption

The Zookeeper database isn't corrupt, but zookeeper has a limit on the maximum response size, and listing 200k children of a znode exceeds this max response size.

To work around this, you can set jute.maxbuffer to a large value to let you list and delete the nodes under queue. You need to update this setting on all the servers, and the client you are using to clean up.

There is an open bug to fix this, ZOOKEEPER-1162 .

-Djute.maxbuffer
http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#Unsafe+Options

jute.maxbuffer:

(Java system property: jute.maxbuffer)

This option can only be set as a Java system property. There is no zookeeper prefix on it. It specifies the maximum size of the data that can be stored in a znode. The default is 0xfffff, or just under 1M. If this option is changed, the system property must be set on all servers and clients otherwise problems will arise. This is really a sanity check. ZooKeeper is designed to store data on the order of kilobytes in size.

https://stackoverflow.com/questions/34756953/how-to-run-zookeepers-zkcli-sh-commands-from-bash

zkCli.sh has been support process commands after 3.4.7.https://issues.apache.org/jira/browse/ZOOKEEPER-1897

such as:

./zkCli.sh -server xxxxx:2181 get /test

http://lucene.472066.n3.nabble.com/clusterstate-stores-IP-address-instead-of-hostname-now-td4064675.html
https://wiki.apache.org/solr/SolrCloud#SolrCloud_Instance_Params

host

Defaults to the first local host address found

If the wrong host address is found automatically, you can over ride the host address with this param.

That changed in 4.1. If you want real hostnames, include the host
parameter on each Solr instance on your startup commandline
(-Dhost=server1) or in solr.xml. I think solr.xml is better, but do
https://lucene.apache.org/solr/guide/7_1/using-solrj.html

final SolrClient client = getSolrClient();

final SolrRequest request = new CollectionAdminRequest.ClusterStatus();

final NamedList<Object> response = client.request(request);
final NamedList<Object> cluster = (NamedList<Object>) response.get("cluster");
final List<String> liveNodes = (List<String>) cluster.get("live_nodes");

assertEquals(NUM_LIVE_NODES, liveNodes.size());

https://stackoverflow.com/questions/34153683/how-can-i-delete-a-data-node-which-is-not-empty-in-zookeeper
The zkCli provides the rmr (deprecated) or deleteall command for this purpose. It will recursively delete all nodes under the path.

        delete path [version]

https://lucene.apache.org/solr/guide/7_1/solrcloud-recoveries-and-write-tolerance.html
Solr supports the optional min_rf parameter on update requests that cause the server to return the achieved replication factor for an update request in the response. For the example scenario described above, if the client application included min_rf >= 1, then Solr would return rf=1 in the Solr response header because the request only succeeded on the leader. The update request will still be accepted as the min_rf parameter only tells Solr that the client application wishes to know what the achieved replication factor was for the update request. In other words, min_rf does not mean Solr will enforce a minimum replication factor as Solr does not support rolling back updates that succeed on a subset of replicas.

https://community.hortonworks.com/articles/7081/best-practice-chroot-your-solr-cloud-in-zookeeper.html
create /solr []
ls /solr
/solr start -c -z lake02:2181,lake03:2181,lake04:2181/solr

https://zookeeper.apache.org/doc/r3.3.3/zookeeperStarted.html

[zkshell: 9] create /zk_test my_data

[zkshell: 12] get /zk_test
my_data
cZxid = 5
ctime = Fri Jun 05 13:57:06 PDT 2009
mZxid = 5
mtime = Fri Jun 05 13:57:06 PDT 2009
pZxid = 5
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0
dataLength = 7
numChildren = 0

https://stackoverflow.com/questions/9605629/how-to-set-responsewriter-to-non-default-in-solr

searchquery.setParam("wt", "json");
QueryResponse response = server.query(searchquery)

But I'm still getting wt=javabin encoded and wt=json also appended, such that the query now looks like this:

webapp/solr path=/select params={wt=javabin&wt=json}

SolrJ only supports the javabin and xml formats (configurable with CommonHttpSolrServer.setParser).

But why would you want to use JSON? Javabin is by far the format which provides the best decompression speed, its drawback being that is is not human-readable, on the contrary to JSON and XML (which is not a problem in this case since SolrJ is parsing the result for you).

https://stackoverflow.com/questions/47327293/understanding-solr-field-cache

When you're using docValues, the field cache isn't used. Since docValues isn't implemented for TextFields yet, the filtering hasn't been applied like you think it would, so the values used for sorting isn't lowercased as you'd assume they'd be.

When you tell Solr to explicitly use the FieldCache, you're saying "don't use the docValues, even if they're available - use the old FieldCache implementation instead".

The correct solution would be to disable docValues for the Text field.

It was resolved by specifying useFieldCache=true in the query.

https://lucene.apache.org/solr/guide/6_6/mbean-request-handler.html

To return information and statistics for the fieldCache only:

http://localhost:8983/solr/techproducts/admin/mbeans?key=fieldCache&stats=true

To return information and statistics about the CACHE category only, formatted in JSON:

http://localhost:8983/solr/techproducts/admin/mbeans?stats=true&cat=CACHE&indent=true&wt=json

To return information for everything, and statistics for everything except the fieldCache:

http://localhost:8983/solr/techproducts/admin/mbeans?stats=true&f.fieldCache.stats=false

https://blog.cloudera.com/blog/2017/06/apache-solr-memory-tuning-for-production/
Except for field cache, other caches are per core. A Solr core is a slice of index.

Direct Memory

Solr uses direct memory to cache data read from disks, mostly index, to improve performance. Direct memory doesn’t cause any JVM GC overhead. Direct memory size can be set using CM (Cloudera Manager->Solr configuration->direct memory). As rule of thumb, the minimum size of direct memory recommended is 8G for a production system if docValues is not used in schema and 12-16G if docValues is used. A related configuration is block cache slab count (Cloudera Manager->Solr configuration->slab count) which needs to match direct memory size.

1

Slab count = direct memory size * 0.7 / 128M

https://www.slideshare.net/ravikgiitk/solrcloud-leader-election
SolrCloud Leader Election

• Shards: to scale : particular collection of

documents, the collection can be divided in

multiple shards.

• Shard replica: to failover correction(high

availability), load balancing : each of the shard

can be replicated to multiple shard replica

• Collection – multiple shards – multiple replica

• How a request is served?

– Types of request:

• Read – search query, no consistency issue between

replica

• Write – index a document, consistency issue, should

have single source for write – Hence leader

– The new node registers itself with Zok

– And creates znodes:

• session – with timeout, updated by the client node

regulary

• ephemaral node

• sequence node: when created gets a unique seq. no

assigned and suffixed to its name

– the clusterstate.json file gets updated (by

overseer)

• Based on seq. flag – leader gets selected

• The one having the lowest seq. no.

Leader dies

• When the leader dies, znode having the

lowest sequence no.

• all znodes are being watched by ZoK

• Znode having the next sequence no. is elected

as the leader

• New leader candidate starts sync process with

each replica, if everyone has same version.

Then it registers as leader active

• Old leader might have sent docs to some

replicas and not all.

• And if a replica is far too behind, its tries to

replay log or ask for full replication

http://shal.in/post/127561227271/how-to-make-apache-solr-listen-on-a-specific-ip

Someone asked me how to ensure that Solr is exposed exclusively on a server’s internal IP address so I thought this bit of information would be useful more generally.

Change SOLR_HOST to 192.168.1.1

solrj
http://localhost:8983/solr/admin/collections?action=clusterstatus

tZkStateReader
ClusterState clientClusterState = searchClient.getZkStateReader().getClusterState();

https://www.slideshare.net/shalinmangar/scaling-solrcloud-to-a-large-number-of-collections-fifth-elephant-2014
Goals of solr cloud
scalability, performance, high-availability, simplicity, and elasticity

http://lucene.472066.n3.nabble.com/SolrCloud-on-5-2-1-cluster-state-td4221041.html
Yes. The older-style ZK entity was all-in-one in /clusterstate.json. Recently we've moved to a per-collection state.json instead, to avoid the "thundering herd" problem. In that state, /clusterstate.json is completely ignored and, as you see, not updated.

A separate clusterstate for each
collection was one of the big new features in Solr 5.0:

https://issues.apache.org/jira/browse/SOLR-5473

https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeper+SSL+User+Guide

https://lucene.apache.org/solr/guide/6_6/enabling-ssl.html

cURL isn’t capable of using JKS formatted keystores, so the JKS keystore needs to be converted to PEM format, which cURL understands.

The urlScheme cluster-wide property needs to be set to https before any Solr node starts up. The example below uses the zkcli tool that comes with the binary Solr distribution to do this:

*nix command

server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd clusterprop -name urlScheme -val https

https://lucene.apache.org/solr/guide/6_6/making-and-restoring-backups.html
https://lucene.apache.org/solr/guide/6_6/collections-api.html#CollectionsAPI-backup

/admin/collections?action=RESTORE&name=myBackupName&location=/path/to/my/shared/drive&collection=myRestoredCollectionName

The restore operation will create a collection with the specified name in the collection parameter. You cannot restore into the same collection the backup was taken from and the target collection should not be present at the time the API is called as Solr will create it for you.

The collection created will be of the same number of shards and replicas as the original collection, preserving routing information, etc. Optionally, you can override some parameters documented below. While restoring, if a configSet with the same name exists in ZooKeeper then Solr will reuse that, or else it will upload the backed up configSet in ZooKeeper and use that.

You can use the collection alias API to make sure client’s don’t need to change the endpoint to query or index against the newly restored collection

https://issues.apache.org/jira/browse/LUCENE-6966
https://stackoverflow.com/questions/36604551/adding-encryption-to-solr-lucene-indexes

More importantly, with file-level encryption, data would reside in an unencrypted form in memory which is not acceptable to our security team and, therefore, a non-starter for us.

This speaks volumes. You should fire your security team! You are wasting your time worrying about this: if you are using lucene, your data will be in memory, in plaintext, in ways you cannot control, and there is nothing you can do about that!

Trying to guarantee anything better than "at rest" is serious business, sounds like your team is over their head.

So you should consider to encrypt the storage Solr is using on OS level. This should be transparent for Solr. But if someone comes into your system, he should not be able to copy the Solr data.

This is also the conclusion the article Encrypting Solr/Lucene indexes from Erick Erickson of Lucidwors draws in the end

The short form is that this is one of those ideas that doesn't stand up to scrutiny. If you're concerned about security at this level, it's probably best to consider other options, from securing your communications channels to using an encrypting file system to physically divorcing your system from public networks. Of course, you should never, ever, let your working Solr installation be accessible directly from the outside world, just consider the following: http://server:port/solr/update?stream.body=<delete><query>*:*</query></delete>!

https://en.wikipedia.org/wiki/Data_at_rest
Data at rest is used as a complement to the terms data in use and data in transit which together define the three states of digital data
https://en.wikipedia.org/wiki/Data_at_rest#/media/File:3_states_of_data.jpg

https://community.hortonworks.com/questions/58306/does-apache-solr-521-support-for-encryption-of-the.html
if you are storing the Solr indexes in HDFS, then HDFS Encryption can be used. If they are stored on local disk, then an OS-level storage encryption, such as dm-crypt + LUKS for Linux, would be required.
http://lucene.472066.n3.nabble.com/DistributedUpdateProcessorFactory-was-explicitly-disabled-from-this-updateRequestProcessorChain-td4319154.html
Let's make a quick differentiation between PRE and POST processors in a Solr Cloud atchitecture :

"In a single node, stand-alone Solr, each update is run through all the update processors in a chain exactly once. But the behavior of update request processors in SolrCloud deserves special consideration. " cit. wiki

PRE PROCESSORS
All the processors defined BEFORE the distributedUpdateProcessor happen ONLY on the first node that receive the update ( regardless if it is a leader or a replica ).

POST PROCESSORS
The distributedUpdateProcessor will forward the update request to the the correct leader ( or multiple leaders if the request involves more shards), the leader will then forward to the replicas.
The leaders and replicas at this point will execute all the update request processors defined AFTER the distributedUpdateProcessor.

" Pre-processors and Atomic Updates
Because DistributedUpdateProcessor is responsible for processing Atomic Updates into full documents on the leader node, this means that pre-processors which are executed only on the forwarding nodes can only operate on the partial document. If you have a processor which must process a full document then the only choice is to specify it as a post-processor." wiki

In your example, your chain is definitely messed up, the order is important and you want your heavy processing to happen only on the first node.

For better info and clarification:
https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode ( you can find here a working alternative to your chain)

https://lucene.apache.org/solr/guide/6_6/schemaless-mode.html

http://blog.comperiosearch.com/blog/2015/01/16/custom-solr-update-request-processors/

https://lucene.apache.org/solr/guide/6_6/running-solr.html

https://lucene.apache.org/solr/guide/6_6/solr-control-script-reference.html
bin/solr start -a "-Xdebug -Xrunjdwp:transport=dt_socket, server=y,suspend=n,address=1044"

http://help.eclipse.org/kepler/index.jsp?topic=%2Forg.eclipse.jdt.doc.user%2Fconcepts%2Fcremdbug.htm

You must ensure you enter the correct hostname i.e. the name of the remote computer where you are currently running your code. The hostname can also be the IP address of the remote machine, for example using 127.0.0.1 instead of localhost.

http://opensourceconnections.com/blog/2015/04/30/debugging-solr-5-in-intellij/

$~/workspace/lucene-solr/solr> ./bin/solr -f -s /home/doug/workspace/statedecoded/solr_home/

(-f runs in the foreground, not as a daemon)

cd /<source_root>/solr-5.5.0/solr
ant ivy-bootstrap
ant server

https://github.com/apache/lucene-solr/tree/branch_5_5/solr

  bin/solr create -c sample

Solr includes a few examples to help you get started. To run a specific example, do:

  bin/solr -e <EXAMPLE> where <EXAMPLE> is one of:

    cloud        : SolrCloud example
    dih          : Data Import Handler (rdbms, mail, rss, tika)
    schemaless   : Schema-less example (schema is inferred from data during indexing)
    techproducts : Kitchen sink example providing comprehensive examples of Solr features

 bin/post -c <collection_name> example/exampledocs/*.xml

server/
  A self-contained Solr instance, complete with a sample
  configuration and documents to index. Please see: bin/solr start -help
  for more information about starting a Solr server.

https://wiki.apache.org/solr/HowToContribute

'ant server' will build a runnable Solr (note, 'ant example' is the target pre 5.x)

'ant dist' will build the libraries you might want to link against from, say, a SolrJ program.

'ant package' will build files like 'package/solr-5.2.0-SNAPSHOT-src.tgz' which are standard distributions you can install somewhere else just like an official release.

https://www.elastic.co/blog/lucene-points-6.0
Coming in Lucene's next major release (6.0) is a new feature called dimensional points, using the k-d tree geo-spatial data structure to offer fast single- and multi-dimensional numeric range and geo-spatial point-in-shape filtering.

This feature replaces the now deprecated numeric fields and numeric range query since it has better overall performance and is more general, allowing up to 8 dimensions (versus 1) and up to 16 bytes (versus the 8 byte limit today) per dimension.

At index time, they are built by recursively partitioning the full space of N-dimensional points to be indexed into smaller and smaller rectangular cells, splitting equally along the widest ranging dimension at each step of the recursion. However, unlike an ordinary k-d tree, a block k-d tree stops recursing once there are fewer than a pre-specified (1024 in our case, by default) number of points in the cell.

At that point, all points within that cell are written into one leaf block on disk and the starting file-pointer for that block is saved into an in-heap binary tree structure. In the 1D case, this is simply a full sort of all values, divided into adjacent leaf blocks. There are k-d tree variants that can support removing values, and rebalancing, but Lucene does not need these operations because of its write-once per-segment design.

https://www.elastic.co/blog/searching-numb3rs-in-5.0

The current data-structure that underlies dimensional points is called a block KD tree, which in the case of a single dimension is a simple binary search tree that stores blocks of values on the leaves rather than individual values. Lucene currently defaults to having between 512 and 1024 values on the leaves.

In the case of Lucene, the fact that files are write-once makes the writing easy since we can build a perfectly balanced tree and then not worry about rebalancing since the tree will never be modified. Merging is easy too since you can get a sorted iterator over the values that are stored in a segment, then merge these iterators into a sorted iterator on the fly and build the tree of the merged segment from this sorted iterator.

While integer types are easy to compress based on the number of bits that they actually use (for instance if all integers are between 2 and 200, they can be encoded on one byte), the same is not true for floating-point data. In particular, the fact that they cannot represent decimals accurately makes them use all available bits in order to be as close as possible to the value that needs to be represented. This makes compression hard since values in a given segment rarely have a pattern that can be leveraged for compression.

But do you actually need the accuracy of a float or a double for your data, or would you happily trade some precision for reduced disk usage and faster queries? For those to whom this sounds like a logical trade-off, we introduced two new field types called half_float and scaled_float.

Half floats work the same way as floats and doubles, except that they use fewer bits for the mantissa and the exponent, allowing them to be stored on 16 bits in total, rather than 32 in the case of floats and 64 in the case of doubles. However beware that this storage reduction comes at a cost: they only have 3.3 significant decimal digits and the maximum value is 65504.

We suspect the new scaled_float field will be even more useful in practice: it takes a scaling factor and for every value, it internally stores the long that is closest to the product of the value and the scaling factor. For instance, with a scaling factor of 100, 0.123 would internally be indexed as 12 and all queries and aggregations would behave as if the value was actually 0.12. This helps because even though values are decimal, they are internally encoded as integers, which Lucene can compress more efficiently

https://en.wikipedia.org/wiki/K-d_tree

The k-d tree is a binary tree in which every node is a k-dimensional point. Every non-leaf node can be thought of as implicitly generating a splitting hyperplane that divides the space into two parts, known as half-spaces

http://pointclouds.org/documentation/tutorials/kdtree_search.php

Old Trie Field

http://mentaldetritus.blogspot.com/2013/01/grokking-solr-trie-fields.html

Solr uses this idea of prefixes to index numbers so that it can perform range queries efficiently. Just like we can organize words into tries, we can also organize numbers into tries.

Let's say I want to index the integer 3735928559. For clarity, let's rewrite that in hexadecimal, 0xDEADBEEF. When we index this using a TrieIntField, Solr stores the integer four times at different levels of precision.

0xDE000000

0xDEAD0000

0xDEADBE00

0xDEADBEEF

What Solr is doing here is constructing numbers with different length prefixes. This would be equivalent to a trie with this structure.

DE--AD--BE--EF

The reason that this allows for fast range queries is because of what the prefixes represent. The prefixes represent a range of values. It might be better to think of them indexed like this, instead.

0xDExxxxxx

0xDEADxxxx

0xDEADBExx

0xDEADBEEF

The precision step lets you tune your index, trading range query speed for index size. A smaller precision step will result in a larger index and faster range queries. A larger precision step will result in a smaller index and slower range queries.

In the example above, I was using a precision step of 8, the default. What the precision step means is how many bits get pruned off the end of the number. Let's see what would happen if we indexed 0xDEADBEEF with a precision step of 12.

0xDExxxxxx

0xDEADBxxx

0xDEADBEEF

As you can see, compared to the default precision step of 8, a precision step of 4 doubled the number of entries in the index. The way it speeds up range searches is by allowing better granularity. If I wanted to search for the documents matching the range 0xDEADBEE0 to 0xDEADBEEF with the default precision step, I would have to check all 16 records in the index and merge the results. With the precision step of 4, I can check the one record for 0xDEADBEEx and get the results I want.

http://blog-archive.griddynamics.com/2014/10/numeric-range-queries-in-lucenesolr.html

Lets assume we want to find all records with term values between “423” and “642”. Naive algorithm here would be to expand the range to separate values: 423 OR 445 OR ... 641 OR 642 (Note: I omitted values which were not indexed to simplify description). But as we use special type of field, instead of selecting all terms in lowermost row, query is optimized to only match on labelled term values (elements with gray fill on the diagram) with lower precision, where applicable. It is enough to select “5” to match all records starting with “5” (“521”, “522”) or “44” for “445”, “446”, “448”. Query is therefore simplified to match all records containing the following terms: “423”, “44”, “5”, “63”, “641”, or “642”.

So, instead of doing search by every value in the requested range, algorithm uses grouped values wherever possible.

Index time

During index time all terms for value and its quotients are need to be produced. Solr index flow is as simple as: DirectUpdateHandler->IndexWriter#updateDocuments->DocumentWriter->DocumentConsumer#processDocument(fieldInfos)->DocFieldProcessor->DocFieldProcessorPerField->DocInverterPerField

So, for incoming document we get all its fields and process values of each field by DocInverterPerField. Basically it gets a token stream (Lucene's TokenStreams are explained by Mike McCandles) which produces the sequence of tokens to be indexed for a document's fields.

Search time

Again we have FieldType here which creates a query. TrieIntField creates a NumericRangeQuery.

NumericRangeQuery extends MultiTermQuery.

Underneath it uses utility class method NumericUtils.splitXXXRange() to calculate values to be matched.

https://www.slideshare.net/VadimKirilchuk/numeric-rangequeries

Solr.IntField

Store and index the text value verbatim and hence don't correctly support range queries, since the lexicographic ordering isn't equal to the numeric ordering

[1,10],11,2,3,4,5,6,7,8,9

Interesting, but “sort by” works fine.. Clever comparator knows that values are ints!

solr.SortableIntField

ortable”, in fact, refer to the notion of making the numbers have correctly sorted order. It’s not about “sort by” actually!

● Processed and compared as strings!!! tricky string encoding: NumberUtils.int2sortableStr(...)

● Deprecated and will be removed in 5.X

● What should i use then?

TrieIntField

NumericTokenStream is where half of magic happens!

● precision step = 1 ● value = 11

Algorithm requires to index all 32/precisionStep terms

So, for “11” we have 11, 10, 8, 8, 0, 0, 0, 0, 0....0

Sub-classes of FieldType could override #getRangeQuery(...) to provide their own range query implementation.

If not, then likely you will have:

MultiTermQuery rangeQuery = TermRangeQuery. newStringRange(...)

TrieField overrides it. And here comes...

● Decimal example, precisionStep = ten ● q = price:[423 TO 642]

Multi-Tenant
https://www.slideshare.net/lucidworks/efficient-scalable-search-in-a-multi-tenant-environment-harry-hight
• Massive scale - shards have to be left offline until needed

Incremental Search
• Calculating the full result set is time consuming
• Query cache usually cold due to unload
• Shards load takes 7me
• Users want to review a subset before exporting
• Shards and results are date sorted
• Search shards sequentially, and return partial results as available
• Creates a streaming interface

Pinned Shards
• Incremental search starts with the most recent data
• `Pin` shards for most recent data
• Subset of shards to be kept loaded at all 7mes
• Shards already loaded for the beginning of the stream
• User doesn’t see the load times for the rest since it happens while they review initial results
• Allows query caches to be more effective
• User sees results in seconds rather than minutes

What if each user has a different view of a document?
• User 1 has permission to view the red
• User 2 has permission to view green
• User 3 has permission to view everything

• Post process each document
• Ends up being horribly slow
• Ties applica7on logic to backend
• Generate a unique document for each view
• 1000s of unique views makes for an unmanageable index
• Trillions of documents is a whole different problem!
• Dynamic fields
• text_view1:value1, text_view2:value2, text_view3:”value1 value2”
• Solr doesn’t have a max number of fields, but string interning becomes an issue
• Mangle field values
• text:”view1_value1 view2_value2 view3_value1 view3_value2”
• Works pre^y well
https://www.slideshare.net/lucidworks/multitenant-log-analytics-saas-service-using-solr-presented-by-chirag-gupta-srivatsan-parthasarathy-microsoft

https://www.elastic.co/blog/found-multi-tenancy
there is a base cost of memory for each shard and that this base cost is constant, even if the shard contains no documents.

https://lucidworks.com/2015/09/23/bloomberg-scales-apache-solr-multi-tenant-environment/
Basic security always starts with different users having access to subsets of the documents, but gets more interesting when users only have access to a subset of the data within a given document, and their search results must reflect that restriction to avoid revealing information

https://sematext.com/blog/2015/09/29/solrcloud-large-tenants-and-routing/
Using routing to handle such data is one of the solutions, and it allows one to efficiently divide the clients and put them into dedicated shards while still using all the goodness of SolrCloud

how to deal with some of the problems that come up with this solution: the different number of documents in shards and the uneven load.

it is usually best to avoid per/tenant collection creation. Having hundreds or thousands of collections inside a single SolrCloud cluster will most likely cause maintenance headaches and can stress the SolrCloud and ZooKeeper nodes that work together

This is not, however, a perfect solution. Very large tenant data will end up in a single shard. This means that some shards may become very large and the queries for such tenants will be slower, even though Solr doesn’t have to aggregate data from multiple shards. For such use cases — when you know the size of your tenants — you can use a modified routing approach in Solr, one that will place the data of such tenants in multiple shards.

Composite Hash Routing

Solr allows us to modify the routing value slightly and provide information on how many bits from the routing key to use (this is possible for the default, compositeId router). For example, instead of providing the routing value of user1!12345, we could use user1/2!12345. The /2 part indicates how many bits to use in the composite hash. In this case we would use 2 bits, which means that the data would go to 1/4 of the shards. If we would set it to /3 the data would go to 1/8 of the shards, and so on.

So instead of providing the _route_ request parameter and setting it to user1!, we’ll provide the composite hash, meaning the _route_ parameter value should be user1/2!.

The nice thing about the provided routing solutions is that they can be combined to work together. For small shards you can just provide a routing value equal to the user identifier; for medium tenants you can send data to more than a single shard; and, for large tenants, you can send them to multiple shards. The best thing about these strategies is that they are done automatically, and one only needs to care about the routing values during indexing and querying.

https://lucene.apache.org/solr/guide/6_6/streaming-expressions.html

Many of the provided streaming functions are designed to work with entire result sets rather then the top N results like normal search. This is supported by the /export handler.

Some streaming functions act as stream sources to originate the stream flow. Other streaming functions act as stream decorators to wrap other stream functions and perform operations on the stream of tuples. Many streams functions can be parallelized across a worker collection.

curl --data-urlencode 'expr=search(enron_emails,
                                   q="from:1800flowers*",
                                   fl="from, to",
                                   sort="from asc",
                                   qt="/export")' http://localhost:8983/solr/enron_emails/stream

StreamFactory streamFactory = new StreamFactory().withCollectionZkHost("collection1", zkServer.getZkAddress())
    .withStreamFunction("search", CloudSolrStream.class)
    .withStreamFunction("unique", UniqueStream.class)
    .withStreamFunction("top", RankStream.class)
    .withStreamFunction("group", ReducerStream.class)
    .withStreamFunction("parallel", ParallelStream.class);

ParallelStream pstream = (ParallelStream)streamFactory.constructStream("parallel(collection1, group(search(collection1, q=\"*:*\", fl=\"id,a_s,a_i,a_f\", sort=\"a_s asc,a_f asc\", partitionKeys=\"a_s\"), by=\"a_s asc\"), workers=\"2\", zkHost=\""+zkHost+"\", sort=\"a_s asc\")");

partitionKeys: Comma delimited list of keys to partition the search results by. To be used with the parallel function for parallelizing operations across worker nodes.

jdbc(
    connection="jdbc:hsqldb:mem:.",
    sql="select NAME, ADDRESS, EMAIL, AGE from PEOPLE where AGE > 25 order by AGE, NAME DESC",
    sort="AGE asc, NAME desc",
    driver="org.hsqldb.jdbcDriver"
)

echo("Hello world")

facet(collection1,
      q="*:*",
      buckets="year_i, month_i, day_i",
      bucketSorts="year_i desc, month_i desc, day_i desc",
      bucketSizeLimit=100,
      sum(a_i),
      sum(a_f),
      min(a_i),
      min(a_f),
      max(a_i),
      max(a_f),
      avg(a_i),
      avg(a_f),
      count(*))

random(baskets,
       q="productID:productX",
       rows="100",
       fl="basketID")

shortestPath(collection,
             from="john@company.com",
             to="jane@company.com",
             edge="from_address=to_address",
             threads="6",
             partitionSize="300",
             fq="limiting query",
             maxDepth="4")

http://solr.pl/en/2012/02/20/simple-photo-search/
Use tika to extract meta data from images
https://hortonworks.com/hadoop-tutorial/indexing-and-searching-text-within-images-with-apache-solr/

Download Leptonica, an image processing library

wget http://www.leptonica.org/source/leptonica-1.69.tar.gz

Download Tesseract, an Optical Character Recognition engine

wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz

https://sematext.com/blog/2017/06/19/solr-vs-elasticsearch-differences/

https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html

In most cases, when running in SolrCloud mode, indexing client applications should not send explicit commit requests. Rather, you should configure auto commits with openSearcher=false and auto soft-commits to make recent updates visible in search requests. This ensures that auto commits occur on a regular schedule in the cluster.

To enforce a policy where client applications should not send explicit commits, you should update all client applications that index data into SolrCloud. However, that is not always feasible, so Solr provides the IgnoreCommitOptimizeUpdateProcessorFactory, which allows you to ignore explicit commits and/or optimize requests from client applications without having refactor your client application code.

To activate this request processor you’ll need to add the following to your solrconfig.xml:

<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">200</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Avoiding Distributed Deadlock

Each shard serves top-level query requests and then makes sub-requests to all of the other shards. Care should be taken to ensure that the max number of threads serving HTTP requests is greater than the possible number of requests from both top-level clients and other shards. If this is not the case, the configuration may result in a distributed deadlock.

For example, a deadlock might occur in the case of two shards, each with just a single thread to service HTTP requests. Both threads could receive a top-level request concurrently, and make sub-requests to each other. Because there are no more remaining threads to service requests, the incoming requests will be blocked until the other pending requests are finished, but they will not finish since they are waiting for the sub-requests. By ensuring that Solr is configured to handle a sufficient number of threads, you can avoid deadlock situations like this.

Prefer Local Shards

Solr allows you to pass an optional boolean parameter named preferLocalShards to indicate that a distributed query should prefer local replicas of a shard when available. In other words, if a query includes preferLocalShards=true, then the query controller will look for local replicas to service the query instead of selecting replicas at random from across the cluster. This is useful when a query requests many fields or large fields to be returned per document because it avoids moving large amounts of data over the network when it is available locally. In addition, this feature can be useful for minimizing the impact of a problematic replica with degraded performance, as it reduces the likelihood that the degraded replica will be hit by other healthy replicas.

Lastly, it follows that the value of this feature diminishes as the number of shards in a collection increases because the query controller will have to direct the query to non-local replicas for most of the shards. In other words, this feature is mostly useful for optimizing queries directed towards collections with a small number of shards and many replicas. Also, this option should only be used if you are load balancing requests across all nodes that host replicas for the collection you are querying, as Solr’s CloudSolrClient will do. If not load-balancing, this feature can introduce a hotspot in the cluster since queries won’t be evenly distributed across the cluster.

Prefer Local Shards

If an update fails because cores are reloading schemas and some have finished but others have not, the leader tells the nodes that the update failed and starts the recovery procedure.

Behind the scenes, this means that Solr has accepted updates that are only on one of the nodes (the current leader). Solr supports the optional min_rf parameter on update requests that cause the server to return the achieved replication factor for an update request in the response. For the example scenario described above, if the client application included min_rf >= 1, then Solr would return rf=1 in the Solr response header because the request only succeeded on the leader. The update request will still be accepted as the min_rf parameter only tells Solr that the client application wishes to know what the achieved replication factor was for the update request. In other words, min_rf does not mean Solr will enforce a minimum replication factor as Solr does not support rolling back updates that succeed on a subset of replicas.

On the client side, if the achieved replication factor is less than the acceptable level, then the client application can take additional measures to handle the degraded state. For instance, a client application may want to keep a log of which update requests were sent while the state of the collection was degraded and then resend the updates once the problem has been resolved. In short, min_rf is an optional mechanism for a client application to be warned that an update request was accepted while the collection is in a degraded state.

If shards.tolerant=true then partial results may be returned. If the returned response does not contain results from all the appropriate shards then the response header contains a special flag called partialResults.

https://doc.lucidworks.com/fusion/3.0/Indexing_Data/Time-Based-Partitioning.html

Once a collection is configured for time-base partitioning, Fusion automatically ages out old partitions and creates new ones, using the configured partition sizes, expiration intervals, and so on. No manual maintenance is needed.

https://lucene.apache.org/solr/guide/6_6/config-api.html

Every core watches the ZooKeeper directory for the configset being used with that core. In standalone mode, however, there is no watch (because ZooKeeper is not running). If there are multiple cores in the same node using the same configset, only one ZooKeeper watch is used. For instance, if the configset 'myconf' is used by a core, the node would watch /configs/myconf. Every write operation performed through the API would 'touch' the directory (sets an empty byte[] to trigger watches) and all watchers are notified. Every core would check if the Schema file, solrconfig.xml or configoverlay.json is modified by comparing the znode versions and if modified, the core is reloaded.

SolrCore#addConfListener(Runnable listener)

to get notified for config changes. This is not very useful if the files modified result in core reloads (i.e., configoverlay.xml or Schema). Components can use this to reload the files they are interested in.

<requestHandler name="/my_handler" class="solr.SearchHandler" useParams="my_handler_params"/>

https://github.com/apache/lucene-solr/tree/master/solr/example/films

Monday, September 18, 2017

Solr Misc Part 5

queryResultCache

documentCache

Direct Memory

Index time

Search time

Composite Hash Routing

Avoiding Distributed Deadlock

Prefer Local Shards

Prefer Local Shards

Labels

Popular Posts