Thursday, September 8, 2016

Solr Misc Part 2



org.apache.solr.client.solrj.request.QueryRequest.getPath()
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/spelling/suggest/SuggesterParams.java

https://cwiki.apache.org/confluence/display/solr/Using+SolrJ
To choose a different request handler, there is a specific method available in SolrJ version 4.0 and later:
query.setRequestHandler("/spellCheckCompRH");
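For context, here is a minimal SolrJ sketch of pointing a query at a custom handler (it assumes a local Solr with a "techproducts" core and a /spellCheckCompRH handler already configured in solrconfig.xml; adjust names to your setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RequestHandlerExample {
    public static void main(String[] args) throws Exception {
        // Assumed base URL and core name; adjust to your setup.
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/techproducts");
        SolrQuery query = new SolrQuery("memroy");      // deliberately misspelled to exercise spellcheck
        query.setRequestHandler("/spellCheckCompRH");   // use this handler instead of the default /select
        QueryResponse response = client.query(query);
        System.out.println(response.getSpellCheckResponse()); // null if the handler has no spellcheck component
        client.close();
    }
}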

https://cwiki.apache.org/confluence/display/solr/Suggester
To be used as the basis for a suggestion, the field must be stored. You may want to use copyField rules to create a special 'suggest' field comprised of terms from other fields in documents. In any event, you likely want a minimal amount of analysis on the field, so an additional option is to create a field type in your schema that only uses basic tokenizers or filters.

Context Filtering

Context filtering lets you filter suggestions by a separate context field, such as category, department or any other token. The AnalyzingInfixLookupFactory and BlendedInfixLookupFactory currently support this feature, when backed by DocumentDictionaryFactory.
http://stackoverflow.com/questions/7712606/solr-suggester-multiple-field-autocomplete
Use copyField to combine multiple fields into a single field and use that field in the suggester -
Schema -
<copyField source="name" dest="spell" />
<copyField source="other_name" dest="spell" />
suggester -
<str name="field">spell</str>
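A fuller solrconfig.xml sketch around that combined field might look like the following (the component/handler names, lookup implementation and analyzer type are illustrative, not from the original answer; note the 'spell' field must be stored for DocumentDictionaryFactory to read it):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">spell</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">mySuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>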

buildOnStartup
If true then the lookup data structure will be built when Solr starts or when the core is reloaded. If this parameter is not specified, the suggester will check if the lookup data structure is present on disk and build it if not found. Setting this to true could lead to the core taking longer to load (or reload), as the suggester data structure needs to be built, which can sometimes take a long time. It's usually preferable to leave this set to 'false' and build suggesters manually by issuing requests with suggest.build=true.
Context filtering (suggest.cfq) is currently only supported by AnalyzingInfixLookupFactory and BlendedInfixLookupFactory, and only when backed by a Document*Dictionary. All other implementations will return unfiltered matches as if filtering was not requested.
DocumentExpressionDictionaryFactory
This dictionary implementation is the same as the DocumentDictionaryFactory but allows users to specify an arbitrary expression into the 'weightExpression' tag.
This dictionary implementation takes the following parameters in addition to parameters described for the Suggester generally and for the lookup implementation:
  • payloadField: The payloadField should be a field that is stored. This field is optional.
  • weightExpression: An arbitrary expression used for scoring the suggestions. The fields used must be numeric fields. This field is required.
  • contextField: Field to be used for context filtering. Note that only some lookup implementations support filtering.
<str name="weightExpression">((price * 2) + ln(popularity))</str>
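Putting that together, a suggester entry using this dictionary might look roughly like this (the name, lookup implementation and field choices are assumptions for illustration):

<lst name="suggester">
  <str name="name">expressionSuggester</str>
  <str name="lookupImpl">FuzzyLookupFactory</str>
  <str name="dictionaryImpl">DocumentExpressionDictionaryFactory</str>
  <str name="field">name</str>
  <!-- price and popularity must be numeric fields -->
  <str name="weightExpression">((price * 2) + ln(popularity))</str>
  <str name="suggestAnalyzerFieldType">text_general</str>
</lst>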
https://lucidworks.com/blog/2015/03/04/solr-suggester/
There are two different “styles” of suggester: FST-based suggesters and AnalyzingInfix suggesters.
these suggesters suggest whole fields! This is radically different than term-based suggestions that consider terms in isolation.

Both styles of suggesters have to be “built”. In the FST-based suggesters, the result is a binary blob that can optionally be stored on disk. The AnalyzingInfix suggesters have an underlying Lucene index, which can also optionally be stored on disk. You build them either automatically or by an explicit command (browser URL, curl command or the like). The result can be stored to disk, in which case it will be re-used until the suggester is built again, even if documents are added/updated/removed from the “regular” index.
Note the sweet spot here. These suggesters are very powerful. They’re very flexible. But TANSTAAFL, There Ain’t No Such Thing As A Free Lunch. These particular suggesters are, IMO, not suitable for use in any large corpus where the suggestions have to be available in Near Real Time (NRT). The underlying documents can be updated NRT, but there’ll be a lag before suggestions from the new documents show up, i.e. until you rebuild the suggester. And building these is expensive on large indexes.

In particular, any version that uses a “DocumentDictionaryFactory” reads the raw data from the field’s stored data when building the suggester! That means that if you’ve added 1M docs to your index and start a build, each and every document must:
  • Be read from disk
  • Be decompressed
  • Be incorporated into the suggester’s data structures.
  • A consequence of this is that the field specified in the configs must have stored=”true” set in your schema.
  • The FuzzyLookupFactory that creates suggestions for misspelled words in fields.
  • The AnalyzingInfixLookupFactory that matches places other than from the beginnings of fields.

  • Build the suggester (Set the “storeDir” or “indexPath” parameter if desired). Issue …/suggesthandler?suggest.build=true. Until you do this step, no suggestions are returned and you’ll see messages and/or stack traces in the logs.
  • Ask for suggestions. As you can see above, the suggester is just a searchComponent, and we define it in a request handler. Simply issue “…/suggesthandler?suggest.q=whatever“.
The AnalyzingInfixSuggester highlights the response as well as gets terms “in the middle”, but doesn’t quite do spelling correction. The Fuzzy suggester returned suggestions for misspelled “enrgy” whereas Infix returned no suggestions. Fuzzy also assumes that what you’re sending as the suggest.q parameter is the beginning of the suggestion.

  • weightField: This allows you to alter the importance of the terms based on another field in the doc.
  • threshold: A percentage of the documents a term must appear in. This can be useful for reducing the number of garbage returns due to misspellings if you haven’t scrubbed the input.
It takes a little bit of care, but it’s perfectly reasonable to have multiple suggesters configured in the same Solr instance. In this example, both of my suggesters were defined as separate request handlers in solrconfig.xml, giving quite a bit of flexibility in what suggestions are returned by choosing one or the other request handler.
Build the suggester:
Issue http://localhost:8983/solr/index_name/suggest?suggest.build=true.

Two fields are needed: one to store the suggestion text and another to store the weight of that suggestion. The suggestion field should be a text type and the weight field should be a float type.

  • The first uses the FuzzyLookupFactory: an FST-based (Finite State Transducer) suggester which will match terms starting with the provided characters while accounting for potential misspellings. This lookup implementation will not find terms where the provided characters are in the middle.
  • The second uses the AnalyzingInfixLookupFactory: which will look inside the terms for matches. Also the results will have <b> highlights around the provided terms inside the suggestions.
Using a combination of methods, we can get more complete results


https://wiki.apache.org/solr/UniqueKey
  • It is strongly advised to use one of the un-analyzed types (e.g. string) for textual unique keys. While using a solr.TextField with analysis does not produce errors, it also won't do what you expect, namely use the output from the analysis chain as the unique key. The raw input before analysis is still used which leads to duplicate documents (e.g. docs with unique keys of 'id1' and 'ID1' will be two unique docs even if you have a LowercaseFilter in an analysis chain for the unique key). Any normalization of the unique key should be done on the client side before ingestion.

http://robotlibrarian.billdueber.com/2009/03/a-plea-use-solr-to-normalize-your-data/

http://www.techsquids.com/bd/solr-multithreaded-concurrent-atomic-updates-problem/


http://lucene.472066.n3.nabble.com/Grouping-performance-problem-td3995245.html
If you don't need the number of groups, you can try leaving out the group.ngroups=true param. In this case Solr apparently skips calculating all groups and delivers results much faster. At least for our application the difference in performance with/without group.ngroups=true is significant (we use Solr 3.6). 
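For reference, a grouped request with the ngroups count looks like this (the field name is a placeholder):
&q=solr&group=true&group.field=category_s&group.ngroups=true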
https://support.lucidworks.com/hc/en-us/articles/221618187-What-is-Managed-Schema-
SolrCloud users who need to modify any setting, for example adding a new field, have to use the following four steps for their change to take effect:
  1. Download their previous configuration set from ZooKeeper
  2. Make changes to their configuration set locally adding fields, modifying synonyms etc.
  3. Upload the configuration set to ZooKeeper again
  4. Call a Collection RELOAD command for the collection to notice the changes you just made.

Solr now has REST like APIs to make all of these changes easier for you. So you could use the Config APIs ( https://cwiki.apache.org/confluence/display/solr/Config+API ) to make changes to the solrconfig.xml file , the Schema APIs to make changes to the schema ( https://cwiki.apache.org/confluence/display/solr/Schema+API ) . You can add fields from the Solr Admin UI if you would like too!
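For example, adding a field with the Schema API is a single POST (the collection and field names here are just placeholders):

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field": { "name":"publisher_s", "type":"string", "stored":true }
}' http://localhost:8983/solr/mycollection/schema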


To make all this consistent we decided that the Schema APIs should be enabled by default. This does NOT enable schemaless mode; it only enables one to use the Schema APIs. To enable it yourself you can follow the instructions here - https://cwiki.apache.org/confluence/display/solr/Managed+Schema+Definition+in+SolrConfig .
All the example configurations that we ship will, starting with Solr 5.5, use the ManagedIndexSchemaFactory instead of the ClassicIndexSchemaFactory. To make this change more apparent, the schema.xml file has been renamed to managed-schema. So don’t be surprised!

If you don’t like this behaviour and want to hand-edit your files, then the ClassicIndexSchemaFactory is not going anywhere. You can specify it explicitly in your solrconfig.xml file and rename managed-schema to schema.xml (when Solr is not running) to get back the old behaviour.
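The explicit declaration in solrconfig.xml is a single line:
<schemaFactory class="ClassicIndexSchemaFactory"/>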

With Solr 6.0 if you haven’t specified a schemaFactory in the solrconfig.xml file then ManagedIndexSchemaFactory will be used as opposed to ClassicIndexSchemaFactory and the schema file will be automatically renamed from schema.xml to managed-schema

https://cwiki.apache.org/confluence/display/solr/Schema+Factory+Definition+in+SolrConfig
 Other features such as Solr's Schemaless Mode also work via Schema modifications made programmatically at run time.

When a <schemaFactory/> is not explicitly declared in a solrconfig.xml file, Solr implicitly uses a ManagedIndexSchemaFactory, which is by default "mutable" and keeps schema information in a managed-schema file.

 <schemaFactory class="ManagedIndexSchemaFactory">
   <bool name="mutable">true</bool>
   <str name="managedSchemaResourceName">managed-schema</str>
 </schemaFactory>

https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode
bin/solr start -e schemaless
This will launch a Solr server, and automatically create a collection (named "gettingstarted") that contains only three fields in the initial schema: id, _version_, and _text_.
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
https://cwiki.apache.org/confluence/display/solr/Schema+API

Autocompletion
http://alexbenedetti.blogspot.com/2015/07/solr-you-complete-me.html
The DocumentDictionary uses the Lucene Index to provide the list of possible suggestions, and specifically a field is set to be the source for these terms.
Building a suggester is the process of:
  • retrieving the terms (source for the suggestions) from the dictionary
  • building the data structures that the Suggester requires for the lookup at query time
  • storing the data structures in memory/disk
The produced data structure will be stored in memory in the first place.
It is suggested to additionally store the built data structures on disk; this way they will be available without rebuilding when they are no longer in memory.
For example, when you start up Solr, the data will be loaded from disk into memory without any rebuilding being necessary.
The parameter that controls this location is:
"storeDir" for the FuzzyLookup
"indexPath" for the AnalyzingInfixLookup

The built data structures will be later used by the suggester lookup strategy, at query time.
In detail, for the DocumentDictionary during the building process, for ALL the documents in the index:
  • the stored content of the configured field is read from the disk ( stored="true" is required for the field to have the Suggester working)
  • the compressed content is decompressed ( remember that Solr stores the plain content of a field applying a compression algorithm [3] )
  • the suggester data structure is built
"for ALL the documents" -> no delta dictionary building is happening 

A good consideration at this point would be to introduce a delta approach in the dictionary building.
http://blog.trifork.com/2012/02/15/different-ways-to-make-auto-suggestions-with-solr/
<lst name="suggester">
  <str name="name">AnalyzingSuggester</str>
  <str name="lookupImpl">AnalyzingLookupFactory</str> 
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">title</str>
  <str name="weightField">price</str>
  <str name="suggestAnalyzerFieldType">text_en</str>
</lst>
Building
For each Document, the stored content from the field is analyzed according to the suggestAnalyzerFieldType.
The tokens produced are added to the Index FST.
Lookup strategy
The query is analyzed; the tokens produced are added to the query FST.
An intersection happens between the Index FST and the query FST.
The suggestions are identified starting at the beginning of the field content.

AnalyzingInfixLookupFactory

<lst name="suggester">
  <str name="name">AnalyzingInfixSuggester</str>
  <str name="lookupImpl">AnalyzingInfixLookupFactory</str> 
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">title</str>
  <str name="weightField">price</str>
  <str name="suggestAnalyzerFieldType">text_en</str>
</lst>

http://blog.trifork.com/2012/02/15/different-ways-to-make-auto-suggestions-with-solr/

https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble

http://stackoverflow.com/questions/25088269/structr-badmessage-400-unknown-version-for-httpchanneloverhttp
this message usually occurs if you have a whitespace (or other characters which have to be URL-encoded properly) in the URL, e.g.
curl "http://0.0.0.0:8082/structr/rest/users?name=A B"
Correct:
curl "http://0.0.0.0:8082/structr/rest/users?name=A%20B"

Join query
https://lucidworks.com/blog/2012/06/20/solr-and-joins/

https://community.hortonworks.com/articles/49790/joining-collections-in-solr-part-i.html
To demonstrate, let's say we have two collections: Sales, which contains the amount of sales by region, and People, which has people categorized by their region along with a flag indicating whether they are a manager. Our goal is to find all of the sales by managers. To do this, we will join the collections using region as our join key, and also filter the people data by whether they are a manager or not.
  1. curl "http://127.0.0.1:8983/solr/admin/cores?action=CREATE&name=people&instanceDir=/Users/ccasano/Applications/solr/solr-5.2.1/server/solr/people&configSet=basic_configs"
  2. http://localhost:8983/solr/sales/select?q=*:*&fq={!join from=region_s to=region_s fromIndex=people}mgr_s:yes
Find all product docs matching "ipod", then join them against (manufacturer) docs and return the list of manufacturers that make those products
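Against the techproducts example data that join looks something like this (assuming product docs carry manu_id_s and manufacturer docs expose compName_s):
http://localhost:8983/solr/techproducts/select?q={!join from=manu_id_s to=id}ipod&fl=id,compName_s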

TODO:
https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-STATUS
http://localhost:8983/solr/admin/cores?action=STATUS&core=core0
https://cwiki.apache.org/confluence/display/solr/Collections+API
/admin/collections?action=CLUSTERSTATUS - Get cluster status 
http://lucene.472066.n3.nabble.com/SolrCloud-and-split-brain-td3989857.html#a3989868

http://alexbenedetti.blogspot.co.uk/2015/07/solr-document-classification-part-1.html
The Classification Update Request Processor is a simple processor that will automatically classify a document (the classification is based on the latest index available), adding a new field containing the class before the document is indexed. 
After an initial valuable index has been built with human-assigned labels on the documents, this Update Request Processor makes it possible to ingest documents with automatically assigned classes.

http://stackoverflow.com/questions/39922233/whether-replicationfactor-2-makes-sense-in-solrcloud/39926368
replicationFactor will not affect whether a split brain situation arises or not. The cluster details are stored in ZooKeeper. As long as you have a working ZooKeeper ensemble Solr will not have this issue. This means you should make sure you have 2*F+1 ZooKeeper nodes (minimum 3).
http://lucene.472066.n3.nabble.com/Whether-replicationFactor-2-makes-sense-td4300204.html#a4300257
The Solr replicationFactor has nothing to do with quorum; having 2 is the same as having 3. Solr uses ZooKeeper's quorum sensing to ensure that all Solr nodes have a consistent picture of the cluster, and Solr will refuse to index data if _ZooKeeper_ loses quorum.

But whether Solr has 2 or 3 replicas is not relevant. Solr indexes data through the leader of each shard, and that keeps all replicas consistent.

As far as other impacts, adding a replica will have an impact on indexing throughput; you'll have to see whether that makes any difference in your situation. This is usually on the order of 10% or so, YMMV. And this only applies to the first replica you add, i.e. going from leader-only to 2 replicas costs, say, 10% in throughput, but adding yet another replica does NOT add another 10%, since the leader->replica updates are done in parallel.

The other thing replicas gain you is the ability to serve more queries, since only a single replica of each shard is queried for any given request.
Replicas also make the system more robust and resilient to temporary network failures.
http://lucene.472066.n3.nabble.com/Does-soft-commit-re-opens-searchers-in-disk-td4248350.html
If you have already done a soft commit and that opened a new searcher, then the document will be visible from that point on. The results returned by that searcher cannot be changed by the hard commit (whatever that is doing under the hood, the segment that has that document in it must still be visible to the searcher). I don't know exactly how the soft commit stores its segment, but there must be some kind of reference counting like there is for disk segments, since the searcher has that "segment" open (regardless of whether that segment is in RAM or on disk).

1. Everything visible prior to a soft commit remains visible after soft commit (regardless of <openSearcher>false</openSearcher>).

openSearcher is only relevant for hard commits; it is meaningless with soft commits.

A soft commit inherently *has* to open a new searcher; a soft commit with openSearcher=false (Erick correctly points out such a thing doesn't exist) would be a pointless no-op, nothing visible and nothing written to disk!
https://forums.manning.com/posts/list/31850.page
re: openSearcher=false and soft commit ... I think these two are somewhat orthogonal in that you use openSearcher=false with auto-commit (hard commits) as a strategy to flush documents to durable storage w/o paying the cost of opening a new searcher after an auto-commit. This is useful because it allows you to keep Solr's update log (tlog) small and not affect search performance until you decide it is time to open the searcher, which makes documents visible. In other words, openSearcher=false, de-couples index writing tasks from index reading tasks when doing commits. 

Soft-commits are all about how soon does a new document need to be visible in search results. A new searcher is opened when you do a soft-commit. When opening a new searcher after a soft-commit, Solr will still execute the warming queries configured in solrconfig.xml, so be careful with configuring queries that take a long time to execute. It will also try to warm caches, so same advice -- keep the autowarm counts low for your caches if you need NRT. Put simply, your new searcher cannot take longer to warm-up than your soft-commit frequency. Also, you probably want useColdSearcher=true when doing NRT. 
https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

  • openSearcher: A boolean sub-property of <autoCommit> that governs whether the newly-committed data is made visible to subsequent searches.

  • Soft commit: A less-expensive operation than hard-commit (openSearcher=true) that also makes documents visible to search. Soft commits do not truncate the transaction log however.


    • Note: tlogs are “rolled over” automatically on hard commit (openSearcher true or false). The old one is closed and a new one is opened. 

      Hard commits are about durability, soft commits are about visibility. There are really two flavors here, openSearcher=true and openSearcher=false. First we’ll talk about what happens in both cases. If openSearcher=true or openSearcher=false, the following consequences are most important:
      • The tlog is truncated: A new tlog is started. Old tlogs will be deleted if there are more than 100 documents in newer, closed tlogs.
      • The current index segment is closed and flushed.
      • Background segment merges may be initiated.
      The above happens on all hard commits. That leaves the openSearcher setting
      •  openSearcher=true: The Solr/Lucene searchers are re-opened and  all caches are invalidated. Autowarming is done etc. This used to be the only way you could see newly-added documents.
      • openSearcher=false: Nothing further happens other than the four points above. To search the docs, a soft commit is necessary.
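      A typical solrconfig.xml setup reflecting this split pairs durable hard commits with openSearcher=false and more frequent soft commits for visibility; a sketch (the times are only illustrative):
      <updateHandler class="solr.DirectUpdateHandler2">
        <autoCommit>
          <maxTime>60000</maxTime>           <!-- hard commit every 60s: rolls the tlog, flushes segments -->
          <openSearcher>false</openSearcher> <!-- nothing becomes visible here -->
        </autoCommit>
        <autoSoftCommit>
          <maxTime>5000</maxTime>            <!-- soft commit every 5s: opens a new searcher, docs become visible -->
        </autoSoftCommit>
      </updateHandler>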
      https://cwiki.apache.org/confluence/display/solr/Managed+Resources
      To begin, you need to define a field type that uses the ManagedStopFilterFactory , such as:
      <fieldType name="managed_en" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.ManagedStopFilterFactory"
                  managed="english" />
        </analyzer>
      </fieldType>
      There are two important things to notice about this field type definition. First, the filter implementation class is  solr.ManagedStopFilterFactory . This is a special implementation of the StopFilterFactory that uses a set of stop words that are managed from a REST API. Second, the  managed=”english”  attribute gives a name to the set of managed stop words, in this case indicating the stop words are for English text.
      http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english
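      Stop words in that managed set can be added through the same REST endpoint, for example (the words are placeholders; changes are picked up after the core/collection is reloaded):
      curl -X PUT -H 'Content-type:application/json' --data-binary '["the","an","a"]' "http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english"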
      https://cwiki.apache.org/confluence/display/solr/Request+Parameters+API
      the parameters are stored in a file named params.json. This file is kept in ZooKeeper or in the conf directory of a standalone Solr instance.
      <requestHandler name="/my_handler" class="solr.SearchHandler" useParams="my_handler_params"/>
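      A parameter set such as the my_handler_params referenced above could be created along these lines (the actual parameter values are placeholders):
      curl http://localhost:8983/solr/techproducts/config/params -H 'Content-type:application/json' -d '{
        "set": {
          "my_handler_params": {
            "rows": 10,
            "fq": "inStock:true"
          }
        }
      }'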
      https://cwiki.apache.org/confluence/display/solr/Config+API


        When using this API, solrconfig.xml is not changed. Instead, all edited configuration is stored in a file called configoverlay.json. The values in configoverlay.json override the values in solrconfig.xml.
        • /config: retrieve or modify the config. GET to retrieve and POST for executing commands
        • /config/overlay: retrieve the details in the configoverlay.json alone
        • /config/params : allows creating parameter sets that can override or take the place of parameters defined in solrconfig.xml. See the Request Parameters API section for more details.
        updateHandler.autoCommit.maxTime
        updateHandler.autoSoftCommit.maxTime
        updateHandler.autoCommit.openSearcher

        requestDispatcher.requestParsers.multipartUploadLimitInKB
        requestDispatcher.requestParsers.formdataUploadLimitInKB
        {
          "set-property": {
            "updateHandler.autoCommit.maxTime":15000,
            "updateHandler.autoCommit.openSearcher":false
          }
        }
        curl http://localhost:8983/solr/gettingstarted/config/overlay?omitHeader=true

        Every core watches the ZooKeeper directory for the configset being used with that core. In standalone mode, however, there is no watch (because ZooKeeper is not running). If there are multiple cores in the same node using the same configset, only one ZooKeeper watch is used. For instance, if the configset 'myconf' is used by a core, the node would watch /configs/myconf. Every write operation performed through the API would 'touch' the directory (sets an empty byte[] to trigger watches) and all watchers are notified. Every core would check if the Schema file, solrconfig.xml or configoverlay.json is modified by comparing the znode versions and if modified, the core is reloaded.


        JSON API:
        • We don’t need to pass the Content-Type for indexing or for querying when we’re using JSON since Solr is now smart enough to auto-detect it when Curl is the client.
        https://cwiki.apache.org/confluence/display/solr/JSON+Request+API
        curl http://localhost:8983/solr/techproducts/query -d '
        {
          "query" : "memory",
          "filter" : "inStock:true"
        }'
        It may sometimes be more convenient to pass the JSON body as a request parameter rather than in the actual body of the HTTP request. Solr treats a json parameter the same as a JSON body.
        curl http://localhost:8983/solr/techproducts/query -d 'json={"query":"memory"}'

        Multiple json parameters in a single request are merged before being interpreted.
        • Single-valued elements are overwritten by the last value.
        • Multi-valued elements like fields and filter are appended.
        • Parameters of the form json.<path>=<json_value> are merged in the appropriate place in the hierarchy. For example a json.facet parameter is the same as “facet” within the JSON body.
        • A JSON body, or straight json parameters are always parsed first, meaning that other request parameters come after, and overwrite single valued elements.
        {
          query : "memory",
          limit : 10,
          filter : "inStock:true"
        }
        {
          facet: {
            avg_price: "avg(price)",
            top_cats: {
              terms: {
                field: "cat",
                limit:5
              }
            }
          }
        }

        Because we didn’t pollute the root body of the JSON request with the normal Solr request parameters (they are all contained in the params block), we now have the ability to validate requests and return an error for unknown JSON keys.
        curl http://localhost:8983/solr/techproducts/query -d '
        {
          query : "memory",
          fulter : "inStock:true"  // oops, we misspelled "filter"
        }'
        And we get an error back containing the error string:
        "Unknown top-level key in JSON request : fulter"

        Parameter Substitution / Macro Expansion

        Of course request templating via parameter substitution works fully with JSON request bodies or parameters as well.
        Example:
        {
          query:"${FIELD}:${TERM}",
          limit:${HOWMANY}
        }
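        The macro values are then supplied as ordinary request parameters, for example (the field and term values are just examples):
        curl http://localhost:8983/solr/techproducts/query -d 'json={query:"${FIELD}:${TERM}", limit:${HOWMANY}}&FIELD=name&TERM=memory&HOWMANY=10'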
        https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/search/facet/FacetField.java
        http://yonik.com/solr-subfacets/
          top_genres:{
            type: terms,
            field: genre,
            limit: 5,
            facet:{
              top_authors:{
                type: terms,
                field: author,
                limit: 4
              }
            }
          }

        If instead, we wanted to find top authors by total revenue (assuming we had a “sales” field), then we could simply change the author facet from the previous example as follows:
              top_authors:{ 
                type: terms,
                field: author,
                limit: 7,
                sort: "revenue desc",
                facet:{
                  revenue: "sum(sales)"
                }
              }
        http://yonik.com/json-facet-api/
        &facet=true
        &facet.range={!key=age_ranges}age
        &f.age.facet.range.start=0
        &f.age.facet.range.end=100
        &f.age.facet.range.gap=10
        &facet.range={!key=price_ranges}price
        &f.price.facet.range.start=0
        &f.price.facet.range.end=1000
        &f.price.facet.range.gap=50
        
        And here is the equivalent faceting command in the new JSON Faceting API:
        {
          age_ranges: {
            type : range,
            field : age,
            start : 0,
            end : 100,
            gap : 10
          }
          ,
          price_ranges: {
            type : range,
            field : price,
            start : 0,
            end : 1000,
            gap : 50 
          }
        }
        {  // this is a single-line comment, which can help add clarity to large JSON commands
             /* traditional C-style comments are also supported */
          x : "avg(price)" // Simple strings can occur unquoted
          y : 'unique(manu)' // Strings can also use single quotes (easier to embed in another String)
        }
        There are two types of facets, one that breaks up the domain into multiple buckets, and aggregations / facet functions that provide information about the set of documents belonging to each bucket.
        Faceting can be nested! Any bucket produced by faceting can further be broken down into multiple buckets by a sub-facet.
        Statistics are now fully integrated into faceting. Since we start off with a single facet bucket with a domain defined by the main query and filters, we can even ask for statistics for this top level bucket, before breaking up into further buckets via faceting. Example:
        json.facet={
          x : "avg(price)",           // the average of the price field will appear under "x"
          y : "unique(manufacturer)"  // the number of unique manufacturers will appear under "y"
        }

        The general form of the JSON facet commands are:
        <facet_name> : { <facet_type> : <facet_parameter(s)> }
        Example: top_authors : { terms : { field : authors, limit : 5 } }
        After Solr 5.2, a flatter structure with a “type” field may also be used:
        <facet_name> : { "type" : <facet_type> , <other_facet_parameter(s)> }
        Example: top_authors : { type : terms, field : authors, limit : 5 }

        The terms facet, or field facet, produces buckets from the unique values of a field. The field needs to be indexed or have docValues.

        The simplest form of the terms facet

        {
          top_genres : { terms : genre_field }
        }

        An expanded form allows for more parameters:

        {
          top_genres : {
            type : terms,
            field : genre_field,
            limit : 3,
            mincount : 2
          }
        }

        The query facet produces a single bucket that matches the specified query.

        An example of the simplest form of the query facet

        {
          high_popularity : { query : "popularity:[8 TO 10]" }
        }

        An expanded form allows for more parameters (or sub-facets / facet functions):

        {
          high_popularity : {
            type : query,
            q : "popularity:[8 TO 10]",
            facet : { average_price : "avg(price)" }
          }
        }

        Example response:

        "high_popularity" : {
          "count" : 147,
          "average_price" : 74.25
        }

        The range facet produces multiple range buckets over numeric fields or date fields.

        Range facet example:

        {
          prices : {
            type : range,
            field : price,
            start : 0,
            end : 100,
            gap : 20
          }
        }

        http://yonik.com/multi-select-faceting/
        &q="running shorts"
        &fq={!tag=COLOR}color:Blue
        &json.facet={
              sizes:{type:terms, field:size},
              colors:{type:terms, field:color, domain:{excludeTags:COLOR} },
              brands:{type:terms, field:brand, domain:{excludeTags:BRAND} }
        }
        Tagging and excluding filters with excludeTags
        Solr filter queries (fq parameters) can be tagged with arbitrary strings using the localParams {!tag=mystring} syntax.
        Example: fq={!tag=COLOR}color:Blue
        • Multiple filters can be tagged with the same tag. Example:
          fq={!tag=foo}one_filter&fq={!tag=foo}another_filter
        • A single filter may be tagged with multiple tags. Example:
          fq={!tag=tag1,tag2,tag3}my_field:my_filter
        During faceting, the facet domain may be changed to exclude filters that match certain tags via the excludeTags keyword. It’s as if the filter was never specified for that specific facet. This is useful for implementing multi-select faceting
        Example: colors:{type:terms, field:color, domain:{excludeTags:COLOR}}
        • excludeTags can be multi-valued comma-separated string. Example: excludeTags:"tag1,tag2"
        • excludeTags can be a JSON array of tags. Example: excludeTags:["tag1","tag2"]
        • One can exclude tags that are not used in the current request. This makes constructing requests simpler since you don’t need to worry about changing the faceting part of the request based on what filters have been applied.

        stats-facet
        http://yonik.com/json-facet-api/
        http://yonik.com/solr-facet-functions/
        http://localhost:8983/solr/query?q=*:*&
           json.facet={x:'avg(price)'}
        
        http://yonik.com/solr-count-distinct/

        https://cwiki.apache.org/confluence/display/solr/The+Stats+Component

        https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-CombiningStatsComponentWithPivots
        stats=true
        stats.field={!tag=piv1,piv2 min=true max=true}price
        stats.field={!tag=piv2 mean=true}popularity
        facet=true
        facet.pivot={!stats=piv1}cat,inStock
        facet.pivot={!stats=piv2}manu,inStock
        facet=true
        facet.query={!tag=q1}manufacturedate_dt:[2006-01-01T00:00:00Z TO NOW]
        facet.query={!tag=q1}price:[0 TO 100]
        facet.pivot={!query=q1}cat,inStock

        facet=true
        facet.range={!tag=r1}manufacturedate_dt
        facet.range.start=2006-01-01T00:00:00Z
        facet.range.end=NOW/YEAR
        facet.range.gap=+1YEAR
        facet.pivot={!range=r1}cat,inStock

        An expanded form allows for more parameters and a facet command block to specify sub-facets (either nested facets or metrics): 
        {
          high_popularity : { query : {
            q : "popularity:[8 TO 10]",
            facet : { average_price : "avg(price)" }
          }}
        }
        Example response:
        "high_popularity" : {
          "count" : 36,
          "average_price" : 36.75
        }

        https://lucidworks.com/blog/2015/01/29/you-got-stats-in-my-facets/
        http://localhost:8983/solr/books/select?q=Crime&stats=true&stats.field=price&stats.facet=author
        But this stats.facet approach has always been plagued with problems:
        • Completely different code from FacetComponent that was hard to maintain, and doesn't support distributed search (see EDIT#1 below)
        • Always returns every term from the stats.facet field, w/o any support for facet.limit, facet.sort, etc…
        • Lots of problems with multivalued facet fields and/or non-string facet fields.
        One of the new features available in Solr 5.0 will be the ability to “link” a stats.field to a facet.pivot param — this inverts the relationship that stats.facet used to offer (nesting the stats under the facets so to speak, instead of putting the facets under the stats) so that the FacetComponent does all the heavy lifting of determining the facet constraints, and delegates to the StatsComponent only as needed to compute stats over the subset of documents for each constraint.
        http://localhost:8983/solr/techproducts/select?q=crime&facet=true&stats=true&stats.field={!tag=t1}price&facet.pivot={!stats=t1}author
        "facet_pivot":{
          "author":[{
              "field":"author",
              "value":"Kaiser Soze",
              "count":42,
              "stats":{
                "stats_fields":{
                  "price":{
                    "min":12.95,
                    "max":29.95,
                    ...}}}},
            {
              "field":"author",
              "value":"James Moriarty",
              "count":37,
              "stats":{
                "stats_fields":{
                  "price":{
                    "min":19.95,
                    "max":39.95,
            ...
        The linkage mechanism is via a tag Local Param specified on the stats.field. This allows multiple facet.pivot params to refer to the same stats.field, or a single facet.pivot to refer to multiple different stats.field params over different fields/functions that all use the same tag, etc. And because this functionality is built on top of Pivot Facets, multiple levels of Pivots can be computed, and the stats will be computed at each level

        Although the stats.facet parameter is no longer recommended, sets of stats.field parameters can be referenced by 'tag' when using Pivot Faceting to compute multiple statistics at every level (i.e.: field) in the tree of pivot constraints.

        facet.pivot.mincount

        The facet.pivot.mincount parameter defines the minimum number of documents that need to match in order for the facet to be included in results. The default is 1.

        https://issues.apache.org/jira/browse/SOLR-6349


        docvalues
        https://support.lucidworks.com/hc/en-us/articles/201839163-When-to-use-DocValues-in-Solr
        Answer - Any field which is used for sorting, faceting, or as part of a custom scoring function should use the DocValues feature, enabled in the schema.xml file.
        Here is an example rating field on which the sort option is provided to the user, with DocValues enabled to make it more efficient.
        <field name="rating" type="tint" indexed="true" docValues="true" />

        http://www.slideshare.net/lucidworks/realtime-analytics-with-solr-presented-by-yonik-seeley-cloudera
        Indexing All tables into a Single Big Index
        Disabled Soft-commit and updates (overwrite=false), as each call to addDocument calls updateDocument under the hood


        java.lang.OutOfMemoryError:
        Java heap Space
        Took a Heap dump and realized that it is due to Field Cache
        Found a Solution : Doc values and never had this issue again till date

        Doc Values (Life Saver)
        Disk-based Field Data a.k.a. Doc Values
        Document to value mapping built at index time
        Store Field values on Disk (or Memory) in a column stride fashion
        Better compression than Field Cache
        Replacement of Field Cache (not completely)
        Quite suitable for Custom Scoring, Sorting and Faceting

        Scaling and Making Search Faster…
        3 Level Partitioning, by Month, Country and Table name
        Each Month has its own Cluster and a Cluster Manager.
        Latency and Throughput are tradeoff, you can’t have both at the same time.

        External Caching
        In distributed search, for a repeated query request, all Solr servers need to be hit, even though the result is served from Solr's cache. This increases search latency while lowering throughput.
        Solution: cache the most frequently accessed query results in the app layer (LRU-based eviction).
        We use Redis for caching all complex aggregation queries' results, once they have been fetched from multiple Solr servers.

        Always use Filter Query (fq) wherever it is possible as that will improve the performance due to Filter cache.
        Keep your JVM heap size at a lower value (proportional to the machine's RAM), leaving enough RAM for the kernel, as a bigger heap will lead to frequent GC. A 4GB to 8GB heap allocation is quite a good range, but we use 12GB/16GB.

        Don't use Soft Commit if you don't need it, especially in batch loading.
        Always explore tuning Solr for high performance, e.g. ramBufferSize, MergeFactor, and HttpShardHandler's various configurations.
        Use hash in Redis to minimize the memory usage.
        http://www.slideshare.net/lucidworks/realtime-analytics-with-solr-presented-by-yonik-seeley-cloudera
        Columnar Storage (DocValues)

        • Fast linear scan
        • Read only the data you need
        • Fast random access
        • docid -> value(s)
        • High degree of locality
        • Compressed
        • prefix, delta, table, gcd, etc.
        • Mostly "off-heap"
        • Memory mapped from index
        • Row vs column configurable per field!
        http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
        Lucene provides a RAM-resident FieldCache built from the inverted index once the FieldCache for a specific field is requested the first time or during index reopen. Internally we call this process un-inverting the field, since the inverted index is a value-to-document mapping and the FieldCache is a document-to-value data structure. For simplicity, think of an array indexed by Lucene's internal document ID. When the FieldCache is loaded, Lucene iterates all terms in a field, parses the terms' values and fills the array slots based on the document IDs associated with the term. Figure 1 illustrates the process.
        Figure 1. Un-inverting a field to FieldCache
        The FieldCache serves its purpose very well, since accessing a value is basically a constant-time array lookup.
        IndexDocValues instead provides a document-to-value mapping built at index time. IndexDocValues allows you to do all the work during document indexing, with a lot more control over your data. Each ordinary Lucene Field accepts a typed value (long, double or byte array) which is stored in a column-based fashion.
        https://cwiki.apache.org/confluence/display/solr/DocValues
        DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.

        If you have already indexed data into your Solr index, you will need to completely re-index your content after changing your field definitions in schema.xml in order to successfully use docValues.
        There is an additional configuration option available, which is to modify the docValuesFormat used by the field type. The default implementation employs a mixture of loading some things into memory and keeping some on disk.

        If docValues="true" for a field, then DocValues will automatically be used any time the field is used for sorting, faceting or Function Queries.

        Note that returning DocValues along with "regular" stored fields at query time has performance implications that stored fields may not because DocValues are column-oriented and may therefore incur additional cost to retrieve for each returned document. Also note that while returning non-stored fields from DocValues, the values of a multi-valued field are returned in sorted order (and not insertion order). If you require the multi-valued fields to be returned in the original insertion order, then make your multi-valued field as stored (such a change requires re-indexing).
        In cases where the query is returning only docValues fields performance may improve since returning stored fields requires disk reads and decompression whereas returning docValues fields in the fl list only requires memory access.
        When retrieving fields from their docValues form, two important differences between regular stored fields and docValues fields must be understood:
        1. Order is not preserved. For simply retrieving stored fields, the insertion order is the return order. For docValues, it is the sorted order.
        2. Multiple identical entries are collapsed into a single value. Thus if I insert values 4, 5, 2, 4, 1, my return will be 1, 2, 4, 5.
        http://mozhenghua.iteye.com/blog/2275932
        docValues is a way of recording a document's field values that is very efficient in scenarios where field values have to be fetched by docid, for example when sorting results or computing facets.

        Why use docValues?

        This form is more efficient and uses less memory than the forward lookup implemented with the fieldCache in older versions. The inverted index splits a field into a list of terms, with each term mapping to a list of docids; this structure makes queries very fast because the docids for a term are readily available. However, using it for faceting, sorting or highlighting requires looking up field values by docid, which is much less efficient. Before Lucene 4.0, the fieldCache would pre-load the values from the inverted index into memory when the instance started; the problem is that with many documents this preloading takes a long time and occupies precious memory.
        After Lucene 4.0 the index gained a new mechanism, docValues, which can be understood as a forward index with column-oriented storage.

        What is the difference between docValues and a field's stored value (field attribute stored="true")?

        Both docValues and the values stored with stored="true" on a document are forward indexes, but there are differences:
        • Storage layout:
        DocValues uses column-oriented storage while stored=true uses row-oriented storage; if you fetch a field's values by docid, the docValues storage structure is certainly more efficient.
        • Whether the value is analyzed:
        the stored=true form is not analyzed and saves the original field value, whereas the values saved in docValues go through analysis.


        If you need to run facet, group, highlight and similar queries on the index, use docValues as much as possible, so you don't have to worry about the memory overhead.
        For example: since Solr 4.0 a _version_ field needs to be defined in the schema to support atomic operations on documents; to save memory, you can add docValues:
        <field name="_version_"
        type="long" indexed="true" stored="true" docValues="true"/>

        <field name="manu_exact"
        type="string" indexed="false" stored="false"
        docValues="true" />

        In addition, the docValues implementation strategy can be configured via the fieldType's docValuesFormat attribute:
        <fieldType name="string_in_mem_dv"
        class="solr.StrField" docValues="true"
        docValuesFormat="Memory" />

        http://solr1:8080/solr/admin/collections?action=CREATE&name=catcollection&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=conf1&router.field=cat
        ./bin/zkCli.sh -server zoo1,zoo2,zoo3
        get /clusterstate.json
        "router":{
        "field":"cat",
        "name":"compositeId"
        }
        To send a query to a particular shard, we have to use the shard.keys parameter in our query.
        http://solr1:8080/solr/catcollection/select?q=martin&fl=cat,name,description&shard.keys=book!

        SOLR_OPTS="-Dsolr.solr.home=/home/ubuntu/solr-cores -Dhost=solr5 -Dport=8080 -DhostContext=solr -DzkClientTimeout=20000 -DzkHost=zoo1:2181,zoo2:2181,zoo3:2181"
        http://solr1:8080/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1
        In addition to splitting shard1 into two sub shards, SolrCloud makes the parent shard, shard1, inactive. This information is available in the ZooKeeper servers
        Only an inactive shard can be deleted.

        In order to move a shard to a new node, we need to add the node as a replica. Once the replication on the node is over and the node becomes active, we can simply shut down the old node and remove it from the cluster.

        http://solr1:8080/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard2&replica=core_node3
        http://solr1:8080/solr/admin/collections?action=SPLITSHARD&collection=catcollection&split.key=books!

        http://solr1:8080/solr/admin/collections?action=SPLITSHARD&collection=catcollection&split.key=books!&async=1111
        http://solr1:8080/solr/admin/collections?action=REQUESTSTATUS&requestid=1111

        http://solr1:8080/solr/admin/collections?action=MIGRATE&collection=catcollection&split.key=currency!&target.collection=mycollection&forward.timeout=120

        http://stackoverflow.com/questions/18636451/solr-4-4-0-atomic-update-always-tries-to-create-missing-documents
        I don't think there is an option to ignore an update if the document does not exist.
        Solr internally still deletes and recreates the documents; that's why you need to have the fields stored for them to be updatable.
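        For reference, an atomic update request looks roughly like this (collection and field names are placeholders; only the named fields are modified, but the other fields must be stored so Solr can rebuild the document):
        curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/mycollection/update?commit=true' --data-binary '[
          { "id": "doc1",
            "price":      { "set": 99.99 },
            "popularity": { "inc": 1 },
            "cat":        { "add": "sale" } }
        ]'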

        https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
        <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
          <str name="versionField">my_version_l</str>
        </processor>
        The _version_ field is by default stored in the inverted index (indexed="true"). However, for some systems with a very large number of documents, the increase in FieldCache memory requirements may be too costly. A solution can be to declare the _version_ field as DocValues:
        Sample field definition
        <field name="_version_" type="long" indexed="false" stored="true" required="true" docValues="true"/>

        http://m.blog.csdn.net/article/details?id=51025580
        If you created a collection and defined the "implicit" router at creation time, you can additionally define a router.field parameter so that each document's value in that field determines which shard the document belongs to. If the specified field is missing from a document, the document will be rejected. You can also use the _route_ parameter to name a specific shard.

        https://cwiki.apache.org/confluence/display/solr/Advanced+Distributed+Request+Options
        This parameter allows you to specify a collection or a number of collections on which the query should be executed. This allows you to query multiple collections at once, and all the features of Solr which work in a distributed manner can work across collections. 

        The _route_ Parameter

        This parameter can be used to specify a route key which is used to figure out the corresponding shards. For example, if you have a document with a unique key "abc!123", then specifying the route key as "_route_=abc!" (notice the trailing '!' character) will route the request to the shard which hosts that doc. You can specify multiple such route keys separated by comma.
        https://cwiki.apache.org/confluence/display/solr/Collections+API
        Create a Shard
        Shards can only be created with this API for collections that use the 'implicit' router. Use SPLITSHARD for collections using the 'compositeId' router. A new shard with a name can be created for an existing 'implicit' collection.

        http://www.cnblogs.com/rcfeng/p/4287031.html
        In other words, when creating collections (if the numShards parameter is specified, the router automatically switches to router="compositeId"), a collection using the compositeId approach cannot add shards dynamically, while one using the implicit approach can. More specifically:
        • ImplicitDocRouter has nothing to do with the uniqueKey; the slice is determined by a _route_ parameter added to the request parameters or to the SolrInputDocument (_shard_ is deprecated; alternatively the router.field parameter can be specified) (router="implicit").
        • CompositeIdRouter determines the slice from the hash of the uniqueKey (specifying the numShards parameter automatically switches to router="compositeId").
        In summary, because CompositeIdRouter fixes the hash ranges at creation time through the numShards parameter, at update time the shard a document goes to is decided by which hash range the hash of its unique id falls into. ImplicitDocRouter does not fix a hash range per shard at creation time; instead a _route_ field holding the name of the target shard is added to each document being updated, and that shard name decides where the document is sent. Clearly, ImplicitDocRouter is comparatively more flexible.
        https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
        Solr offers the ability to specify the router implementation used by a collection by specifying the router.name parameter when creating your collection. If you use the (default) "compositeId" router, you can send documents with a prefix in the document ID which will be used to calculate the hash Solr uses to determine the shard a document is sent to for indexing. The prefix can be anything you'd like it to be (it doesn't have to be the shard name, for example), but it must be consistent so Solr behaves consistently. For example, if you wanted to co-locate documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM", for example, with a document with the ID "12345", you would insert the prefix into the document id field: "IBM!12345". The exclamation mark ('!') is critical here, as it distinguishes the prefix used to determine which shard to direct the document to.
        Then at query time, you include the prefix(es) into your query with the _route_ parameter (i.e., q=solr&_route_=IBM!) to direct queries to specific shards. In some situations, this may improve query performance because it overcomes network latency when querying all the shards.
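        A quick sketch of both sides of that (the collection name and fields are placeholders):
        # index with a compositeId route prefix
        curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/mycollection/update?commit=true' --data-binary '[{ "id": "IBM!12345", "name_s": "some IBM document" }]'
        # query only the shard(s) that hold the IBM! prefix
        curl 'http://localhost:8983/solr/mycollection/select?q=solr&_route_=IBM!'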

        If you created the collection and defined the "implicit" router at the time of creation, you can additionally define a router.field parameter to use a field from each document to identify a shard where the document belongs. If the field specified is missing in the document, however, the document will be rejected. You could also use the _route_ parameter to name a specific shard.

        Certain Solr features such as grouping’s ngroups feature and joins require documents to be co-located in the same core or vm. For example to take advantage of the ngroups feature in grouping, documents need to be co-located by the grouping key. Document routing will do this automatically if the grouping key is used as the shard key.
        Setting Up The CompositeId Router
        The Solr Cloud compositeId router is used by default when a collection is created with the “numShards” parameter. If the numShards parameter is not supplied at collection creation time then the “implicit” document router is assigned to the collection. 
        https://sematext.com/blog/2015/09/29/solrcloud-large-tenants-and-routing/

        https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
        IgnoreCommitOptimizeUpdateProcessorFactory
        In most cases, when running in SolrCloud mode, indexing client applications should not send explicit commit requests. Rather, you should configure auto commits with openSearcher=false and auto soft-commits to make recent updates visible in search requests. This ensures that auto commits occur on a regular schedule in the cluster. To enforce a policy where client applications should not send explicit commits, you should update all client applications that index data into SolrCloud. However, that is not always feasible, so Solr provides the IgnoreCommitOptimizeUpdateProcessorFactory, which allows you to ignore explicit commits and/or optimize requests from client applications without having to refactor your client application code.
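        A chain along the lines of the reference guide example; the statusCode here tells Solr to answer with a normal success response instead of an error when it swallows a client commit:
        <updateRequestProcessorChain name="ignore-commit-from-client" default="true">
          <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
            <int name="statusCode">200</int>
          </processor>
          <processor class="solr.LogUpdateProcessorFactory" />
          <processor class="solr.DistributedUpdateProcessorFactory" />
          <processor class="solr.RunUpdateProcessorFactory" />
        </updateRequestProcessorChain>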

        memory calculation:
        http://docs.alfresco.com/4.1/concepts/solrnodes-memory.html

        http://yonik.com/solr-6-1/
        https://issues.apache.org/jira/browse/SOLR-8888
        https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions

        https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462
        https://cwiki.apache.org/confluence/display/solr/Major+Changes+from+Solr+5+to+Solr+6
        Introduced in Solr 5, Streaming Expressions allow querying Solr and getting results as a stream of data, sorted and aggregated as requested.

        publish/subscribe
        The topic function provides publish/subscribe messaging capabilities built on top of SolrCloud. The topic function allows users to subscribe to a query. The function then provides one-time delivery of new or updated documents that match the topic query. The initial call to the topic function establishes the checkpoints for the specific topic ID. Subsequent calls to the same topic ID will return documents added or updated after the initial checkpoint. Each run of the topic query updates the checkpoints for the topic ID. Setting the initialCheckpoint parameter to 0 will cause the topic to process all documents in the index that match the topic query.

        https://medium.com/@alisazhila/solr-s-nesting-on-solr-s-capabilities-to-handle-deeply-nested-document-structures-50eeaaa4347a
        Solr has extended its set of features for nested document handling including faceting of nested documents and schemaless support of nested data structures.

        http://yonik.com/solr-nested-objects/
        Lucene has a flat object model and does not really support “nesting” of documents in the index.
        Lucene *does* support adding a list of documents atomically and contiguously (i.e. a virtual “block”), and this is the feature used by Solr to implement “nested objects”.
        When you add a parent document with 3 children, these appear in the index contiguously as
        child1, child2, child3, parent
        
        There are no schema requirements except that the _root_ field must exist (but that is there by default in all our schemas).
        Any document can have nested child documents.

        https://dzone.com/articles/using-solr-49-new
        Using Solr 4.9 new ChildDocTransformerFactory
        Now, we can use the below BlockJoinQuery to find the parent of all the documents containing the words I am child:


        q={!parent which="cat:PARENT"}name:(I am +child)
        Now, let's get the parent document along with the other child documents, but without the grandchild.
        We do that by adding a fields parameter that looks like:
        fl=id,name,[child parentFilter=cat:PARENT childFilter=cat:CHILD]

        The above contains the regular fields request - id and name - plus the ChildDocTransformer, which enables Solr to return the given parent together with its nested children.
        Note that we've added an optional parameter - childFilter=cat:CHILD - which filters the GRANDCHILDREN out of the response.
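        Putting the two together, a full request might look like this (assuming a core named collection1 on localhost; the parameter values would be URL-encoded in practice):

            http://localhost:8983/solr/collection1/select?q={!parent which="cat:PARENT"}name:(I am +child)&fl=id,name,[child parentFilter=cat:PARENT childFilter=cat:CHILD]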


        bin/solr status

        https://support.lucidworks.com/hc/en-us/articles/20342143-Recovery-times-while-restarting-a-SolrCloud-node
        Recovery time depends on these factors -
        1. Your transaction log size - So each replica in a collection writes to a transaction log. This gets rolled over on a (hard) commit. So if you restart between commits then that portion needs to be replayed.
        2. When a replica is in recovery, it starts writing all updates sent by the leader in this period to the transaction log. So if the node was in recovery when you restarted then this could be a reason why it took long.
        Thus if your transaction log on disk is large then that is probably the reason for a slow recovery time. A short auto (hard) commit setting will ensure a small transaction log and faster recovery, but be careful not to make it too short as that can effect your indexing performance
        3. When a replica tries to come up after the restart, it asks the shard leader if any document updates have taken place. If the number is less than 100 then SolrCloud does a PeerSync and gets all the updated documents, but when it's more than 100 it does an index replication from the leader. Full replication will kick in if the index has changed because of merges, optimize, expungeDeletes etc). If a full replication is kicking in then this could be a major reason for recovery time. You could grep for "fullCopy=true" to see if that is happening
        4. At the time of restarting a node if your overseer queue already has a lot of pending requests queued up and with the node restart theOverseer would have to process these state changes in addition, causing a pile up in the Overseer queue and thus it might take time to process the events - hence the slow recovery. You can look at your overseer queue size here - http://host:port/solr/#/~cloud?view=tree and click on /overseer -> queue and see it's children_count number.
        5. As a general advice for restarting nodes in a SolrCloud cluster, its advised to restart one node at a time, waiting for a node to completely come up ( all shards show up as active in the Cloud dashboard ) . Also you should bounce the Overseer as the last node as this causes minimum switching of the overseer


        Note that the documents sent via PeerSync are not streamed; they are sent as a POST request. So setting numRecordsToKeep too high will cause the request to fail. The default POST size limit for Jetty is 20MB ( http://wiki.eclipse.org/Jetty/Howto/Configure_Form_Size#Changing_the_Maximum_Form_Size_for_All_Apps_in_the_JVM ), so one needs to make sure the numRecordsToKeep count multiplied by document size doesn't exceed that.
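        The PeerSync threshold relates to the numRecordsToKeep setting on the update log; a hedged solrconfig.xml sketch (the values below are illustrative, the defaults are much lower):

            <updateLog>
              <str name="dir">${solr.ulog.dir:}</str>
              <int name="numRecordsToKeep">500</int>
              <int name="maxNumLogsToKeep">20</int>
            </updateLog>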
          https://hub.docker.com/_/solr/
          https://github.com/docker-solr/docker-solr

          TTL
          https://lucidworks.com/blog/2014/05/07/document-expiration/
          The DocExpirationUpdateProcessorFactory provides two features related to the “expiration” of documents which can be used individually, or in combination:
          • Periodically delete documents from the index based on an expiration field
          • Computing expiration field values for documents from a “time to live” (TTL)
          While the basic logic of “timer goes off, delete docs with expiration prior to NOW” was fairly simple and straightforward to add, a key aspect of making this work well was in a related issue (SOLR-5783) to ensure that the openSearcher=true doesn’t do anything unless there really are changes in the index. This means that you can configure autoDeletePeriodSeconds to very small values, and still rest easy that your search caches won’t get blown away every few seconds for no reason. The openSearcher=true soft commits will only affect things if there really are changes in the index.
          The second feature implemented by this factory (and the key reason it’s implemented as an UpdateProcessorFactory) is to use “TTL” (Time To Live) values associated with documents to automatically generate an expiration date value to put in the expirationFieldName when documents are indexed.
          By default, the DocExpirationUpdateProcessorFactory will look for a _ttl_ request parameter on update requests, as well as a _ttl_ field in each doc that is indexed in that request. If either exist, they will be parsed as Date Math Expressions relative to NOW and used to populate the expirationFieldName. The per-document _ttl_ field based values override the per-request _ttl_ parameter.
          Both the request parameter and field names used for specifying TTL values can be overridden by configuring ttlParamName & ttlFieldName on the DocExpirationUpdateProcessorFactory. They can also be completely disabled by configuring them as null. It’s also possible to use the TTL computation feature to generate expiration dates on documents, without using the auto-deletion feature, simply by not configuring the autoDeletePeriodSeconds option (so that the timer will never run).
          This sort of configuration may be handy if you only want to logically hide documents for search clients based on a per-document TTL using something like: fq=-press_release_expiration_date:[* TO NOW/DAY], but still retain the documents in the index for other search clients.
          • FirstFieldValueUpdateProcessorFactory is configured on the expire_at_dt field - this means that if a document is added with an explicit value in the expire_at_dt field, that value will be used instead of any value that might be added by the DocExpirationUpdateProcessorFactory using the _ttl_ request param
          https://lucene.apache.org/solr/6_2_0/solr-core/org/apache/solr/update/processor/DocExpirationUpdateProcessorFactory.html
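          A hedged solrconfig.xml sketch of such a chain (the chain name, period and expiration field name are illustrative; the parameter names come from the documentation quoted above):

            <updateRequestProcessorChain name="expire-docs" default="true">
              <processor class="org.apache.solr.update.processor.DocExpirationUpdateProcessorFactory">
                <int name="autoDeletePeriodSeconds">300</int>
                <str name="ttlFieldName">_ttl_</str>
                <str name="ttlParamName">_ttl_</str>
                <str name="expirationFieldName">expire_at_dt</str>
              </processor>
              <processor class="solr.LogUpdateProcessorFactory" />
              <processor class="solr.RunUpdateProcessorFactory" />
            </updateRequestProcessorChain>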

          https://issues.apache.org/jira/browse/SOLR-5795
          http://stackoverflow.com/questions/17806821/solr-composite-unique-key-from-existing-fields-in-schema
          https://wiki.apache.org/solr/Deduplication
              <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
                <bool name="enabled">true</bool>
                <bool name="overwriteDupes">false</bool>
                <str name="signatureField">id</str>
                <str name="fields">name,features,cat</str>
                <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
              </processor>

          https://lawlesst.github.io/notebook/solr-etags.html

          http://stackoverflow.com/questions/4800559/how-to-transform-simpleorderedmap-into-json-string-or-json-object
          How to transform SimpleOrderedMap into JSON string or JSON object?
          In short: no, because JSON doesn't have a primitive ordered map type. A common workaround is to flatten the entries into a JSON array of alternating keys and values (SimpleOrderedMap iterates over Map.Entry objects):
          // produces [key1, value1, key2, value2, ...] and preserves insertion order
          JSONArray jarray = new JSONArray();
          for (Map.Entry<String, Object> e : simpleOrderedMap) {
              jarray.put(e.getKey());
              jarray.put(e.getValue());
          }
