org.apache.solr.client.solrj.request.QueryRequest.getPath()
https://cwiki.apache.org/confluence/display/solr/Suggester
https://wiki.apache.org/solr/UniqueKey
https://cwiki.apache.org/confluence/display/solr/Collections+API
openSearcher: A boolean sub-property of <autoCommit> that governs whether the newly-committed data is made visible to subsequent searches.
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/spelling/suggest/SuggesterParams.java
https://cwiki.apache.org/confluence/display/solr/Using+SolrJ
To choose a different request handler, there is a specific method available in SolrJ version 4.0 and later:
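As a reminder, a minimal SolrJ sketch of that method (the /suggest handler name, the techproducts core URL and the suggest.q value are made-up placeholders; client construction shown in the SolrJ 6.x Builder style):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RequestHandlerExample {
    public static void main(String[] args) throws Exception {
        // hypothetical core URL and handler name
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts").build();
        SolrQuery query = new SolrQuery();
        query.setRequestHandler("/suggest"); // route the request to a non-default handler instead of /select
        query.set("suggest.q", "memo");
        QueryResponse rsp = client.query(query);
        System.out.println(rsp.getResponse()); // raw response, including the suggest section
        client.close();
    }
}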
To be used as the basis for a suggestion, the field must be stored. You may want to use copyField rules to create a special 'suggest' field comprised of terms from other fields in documents. In any event, you likely want a minimal amount of analysis on the field, so an additional option is to create a field type in your schema that only uses basic tokenizers or filters.
Context Filtering
Context filtering lets you filter suggestions by a separate context field, such as category, department or any other token. The AnalyzingInfixLookupFactory and BlendedInfixLookupFactory currently support this feature, when backed by DocumentDictionaryFactory.
http://stackoverflow.com/questions/7712606/solr-suggester-multiple-field-autocomplete
Use copyField to combine multiple fields into a single field and use that field in the suggester -
Schema -
<copyField source="name" dest="spell" />
<copyField source="other_name" dest="spell" />
suggester -
<str name="field">spell</str>
buildOnStartup: If true then the lookup data structure will be built when Solr starts or when the core is reloaded. If this parameter is not specified, the suggester will check if the lookup data structure is present on disk and build it if not found. Setting this to true could lead to the core taking longer to load (or reload) as the suggester data structure needs to be built, which can sometimes take a long time. It's usually preferred to have this setting set to 'false' and build suggesters manually by issuing requests with suggest.build=true.
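For reference, a hedged SolrJ sketch of a manual build followed by a suggestion request with a context filter. The /suggest handler, the mySuggester dictionary name and the category field are assumptions, not taken from the notes above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class SuggesterBuildExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts").build();

        // one-off manual build, instead of buildOnStartup=true
        SolrQuery build = new SolrQuery();
        build.setRequestHandler("/suggest");
        build.set("suggest.dictionary", "mySuggester");
        build.set("suggest.build", true);
        client.query(build);

        // ask for suggestions; suggest.cfq only works with the (Analyzing|Blended)Infix lookups
        SolrQuery q = new SolrQuery();
        q.setRequestHandler("/suggest");
        q.set("suggest.dictionary", "mySuggester");
        q.set("suggest.q", "mem");
        q.set("suggest.cfq", "category:books");
        System.out.println(client.query(q).getResponse());
        client.close();
    }
}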
Context filtering (suggest.cfq) is currently only supported by AnalyzingInfixLookupFactory and BlendedInfixLookupFactory, and only when backed by a Document*Dictionary. All other implementations will return unfiltered matches as if filtering was not requested.
DocumentExpressionDictionaryFactory
This dictionary implementation is the same as the DocumentDictionaryFactory but allows users to specify an arbitrary expression into the 'weightExpression' tag.
This dictionary implementation takes the following parameters in addition to parameters described for the Suggester generally and for the lookup implementation:
- payloadField: The payloadField should be a field that is stored. This field is optional.
- weightExpression: An arbitrary expression used for scoring the suggestions. The fields used must be numeric fields. This field is required.
- contextField: Field to be used for context filtering. Note that only some lookup implementations support filtering.
<str name="weightExpression">((price * 2) + ln(popularity))</str>
https://lucidworks.com/blog/2015/03/04/solr-suggester/
There are two different “styles” of suggester: FST-based suggesters and AnalyzingInfix suggesters.
these suggesters suggest whole fields! This is radically different than term-based suggestions that consider terms in isolation.
Both styles of suggesters have to be “built”. In the FST-based suggesters, the result is a binary blob that can optionally be stored on disk. The AnalyzingInfix suggesters have to have their underlying Lucene index, which can also be optionally stored on disk. You build them either automatically or by an explicit command (browser URL, curl command or the like). The result can be stored to disk, in which case they'll be re-used until they're built again, even if documents are added/updated/removed from the “regular” index.
Note the sweet spot here. These suggesters are very powerful. They’re very flexible. But TANSTAAFL, There Ain’t No Such Thing As A Free Lunch. These particular suggesters are, IMO, not suitable for use in any large corpus where the suggestions have to be available in Near Real Time (NRT). The underlying documents can be updated NRT, but there’ll be a lag before suggestions from the new documents show up, i.e. until you rebuild the suggester. And building these is expensive on large indexes.
In particular, any version that uses a “DocumentDictionaryFactory” reads the raw data from the field’s stored data when building the suggester! That means that if you’ve added 1M docs to your index and start a build, each and every document must:
- Be read from disk
- Be decompressed
- Be incorporated into the suggester’s data structures.
- A consequence of this is that the field specified in the configs must have stored=”true” set in your schema.
- The FuzzyLookupFactory that creates suggestions for misspelled words in fields.
- The AnalyzingInfixLookupFactory that matches places other than from the beginnings of fields.
- Build the suggester (Set the “storeDir” or “indexPath” parameter if desired). Issue …/suggesthandler?suggest.build=true. Until you do this step, no suggestions are returned and you’ll see messages and/or stack traces in the logs.
- Ask for suggestions. As you can see above, the suggester is just a searchComponent, and we define it in a request handler. Simply issue “…/suggesthandler?suggest.q=whatever“.
- weightField: This allows you to alter the importance of the terms based on another field in the doc.
- threshold: A percentage of the documents a term must appear in. This can be useful for reducing the number of garbage returns due to misspellings if you haven’t scrubbed the input.
It takes a little bit of care, but it’s perfectly reasonable to have multiple suggesters configured in the same Solr instance. In this example, both of my suggesters were defined as separate request handlers in solrconfig.xml, giving quite a bit of flexibility in what suggestions are returned by choosing one or the other request handler.
One to store the suggestion text and another to store the weight of that suggestion. The suggestion field should be a text type and the weight field should be a float type
- The first uses the FuzzyLookupFactory: an FST-based suggester (Finite State Transducer) which will match terms starting with the provided characters while accounting for potential misspellings. This lookup implementation will not find terms where the provided characters are in the middle.
- The second uses the AnalyzingInfixLookupFactory: which will look inside the terms for matches. Also the results will have <b> highlights around the provided terms inside the suggestions.
Using a combination of methods, we can get more complete results
- It is strongly advised to use one of the un-analyzed types (e.g. string) for textual unique keys. While using a solr.TextField with analysis does not produce errors, it also won't do what you expect, namely use the output from the analysis chain as the unique key. The raw input before analysis is still used which leads to duplicate documents (e.g. docs with unique keys of 'id1' and 'ID1' will be two unique docs even if you have a LowercaseFilter in an analysis chain for the unique key). Any normalization of the unique key should be done on the client side before ingestion.
http://robotlibrarian.billdueber.com/2009/03/a-plea-use-solr-to-normalize-your-data/
http://www.techsquids.com/bd/solr-multithreaded-concurrent-atomic-updates-problem/
http://lucene.472066.n3.nabble.com/Grouping-performance-problem-td3995245.html
if you don't need the number of groups, you can try leaving out the group.ngroups=true param. In this case Solr apparently skips calculating all groups and delivers results much faster. At least for our application the difference in performance with/without group.ngroups=true is significant (have to say, we use Solr 3.6).
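A small SolrJ sketch of the same idea, with made-up core and field names: grouping is requested but group.ngroups is deliberately left out so Solr can skip counting the total number of groups:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class GroupingExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();
        SolrQuery q = new SolrQuery("laptop");
        q.set("group", true);
        q.set("group.field", "brand_s");
        q.set("group.limit", 3);
        // group.ngroups=true is deliberately left out: counting every group is the expensive part
        System.out.println(client.query(q).getGroupResponse().getValues());
        client.close();
    }
}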
https://support.lucidworks.com/hc/en-us/articles/221618187-What-is-Managed-Schema-
SolrCloud users who need to modify any setting, for example adding a new field, have to use the following four steps for their change to take effect:
- Download their previous configuration set from ZooKeeper
- Make changes to their configuration set locally adding fields, modifying synonyms etc.
- Upload the configuration set to ZooKeeper again
- Call a Collection RELOAD command for the collection to notice the changes you just made.
Solr now has REST like APIs to make all of these changes easier for you. So you could use the Config APIs ( https://cwiki.apache.org/confluence/display/solr/Config+API ) to make changes to the solrconfig.xml file , the Schema APIs to make changes to the schema ( https://cwiki.apache.org/confluence/display/solr/Schema+API ) . You can add fields from the Solr Admin UI if you would like too!
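As a hedged illustration of the Schema API route from SolrJ (the gettingstarted core and publisher_s field are invented; SchemaRequest is available in SolrJ 5.3+):

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class AddFieldExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/gettingstarted").build();
        Map<String, Object> field = new LinkedHashMap<>();
        field.put("name", "publisher_s");
        field.put("type", "string");
        field.put("stored", true);
        // POSTs an "add-field" command to the /schema endpoint; with a managed schema there is
        // no manual download/edit/upload of the config set and no collection RELOAD needed
        new SchemaRequest.AddField(field).process(client);
        client.close();
    }
}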
To make all this consistent we decided that the Schema APIs should be enabled by default. This does NOT enable schemaless mode, it only enables one to use the Schema APIs. To enable it yourself you can follow the instructions here - https://cwiki.apache.org/confluence/display/solr/Managed+Schema+Definition+in+SolrConfig .
In all the example configurations that we ship, starting with Solr 5.5, the ManagedIndexSchemaFactory is used instead of the ClassicIndexSchemaFactory. To make this change more apparent the schema.xml file has been renamed to managed-schema. So don't be surprised!
If you don’t like this behaviour and want to hand edit your files, then the ClassicIndexSchemaFactory is not going anywhere . You can specify it explicitly in your solrconfig.xml file and rename managed-schema to schema.xml ( when Solr is not running ) to get back the old behaviour.
With Solr 6.0 if you haven’t specified a schemaFactory in the solrconfig.xml file then ManagedIndexSchemaFactory will be used as opposed to ClassicIndexSchemaFactory and the schema file will be automatically renamed from schema.xml to managed-schema
https://cwiki.apache.org/confluence/display/solr/Schema+Factory+Definition+in+SolrConfig
Other features such as Solr's Schemaless Mode also work via Schema modifications made programatically at run time.
When a <schemaFactory/> is not explicitly declared in a solrconfig.xml file, Solr implicitly uses a ManagedIndexSchemaFactory, which is by default "mutable" and keeps schema information in a managed-schema file.
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode
This will launch a Solr server, and automatically create a collection (named "gettingstarted") that contains only three fields in the initial schema: id, _version_, and _text_.
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
https://cwiki.apache.org/confluence/display/solr/Schema+API
Autocompletion
http://alexbenedetti.blogspot.com/2015/07/solr-you-complete-me.html
The DocumentDictionary uses the Lucene Index to provide the list of possible suggestions, and specifically a field is set to be the source for these terms.
Building a suggester is the process of :
- retrieving the terms (source for the suggestions) from the dictionary
- build the data structures that the Suggester requires for the lookup at query time
- Store the data structures in memory/disk
It is suggested to additionally store the built data structures on disk; this way they will be available without rebuilding when they are no longer in memory.
For example when you start up Solr, the data will be loaded from disk to the memory without any rebuilding to be necessary.
This parameter is:
- “storeDir” for the FuzzyLookup
- “indexPath” for the AnalyzingInfixLookup
The built data structures will be later used by the suggester lookup strategy, at query time.
In detail, for the DocumentDictionary during the building process, for ALL the documents in the index:
- the stored content of the configured field is read from the disk ( stored="true" is required for the field to have the Suggester working)
- the compressed content is decompressed ( remember that Solr stores the plain content of a field applying a compression algorithm [3] )
- the suggester data structure is built
"for ALL the documents" -> no delta dictionary building is happening
http://blog.trifork.com/2012/02/15/different-ways-to-make-auto-suggestions-with-solr/
<lst name="suggester">
<str name="name">AnalyzingSuggester</str>
<str name="lookupImpl">AnalyzingLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">title</str>
<str name="weightField">price</str>
<str name="suggestAnalyzerFieldType">text_en</str>
</lst>
AnalyzingInfixLookupFactory
<lst name="suggester"><str name="name">AnalyzingInfixSuggester</str>
<str name="lookupImpl">AnalyzingInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">title</str>
<str name="weightField">price</str>
<str name="suggestAnalyzerFieldType">text_en</str>
</lst>
http://blog.trifork.com/2012/02/15/different-ways-to-make-auto-suggestions-with-solr/
https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
http://stackoverflow.com/questions/25088269/structr-badmessage-400-unknown-version-for-httpchanneloverhttp
this message usually occurs if you have a whitespace (or other characters which have to be URL-encoded properly) in the URL, e.g.
curl "http://0.0.0.0:8082/structr/rest/users?name=A B"
Correct:
curl "http://0.0.0.0:8082/structr/rest/users?name=A%20B"
Join query
https://lucidworks.com/blog/2012/06/20/solr-and-joins/
https://community.hortonworks.com/articles/49790/joining-collections-in-solr-part-i.html
To demonstrate, let's say we have two collections: Sales, which contains the amount of sales by region, and People, which has people categorized by their region along with a flag indicating whether they are a manager. Let's say our goal is to find all of the sales by managers. To do this, we will join the collections using region as our join key, and also filter the people data by whether they are a manager or not.
- curl "http://127.0.0.1:8983/solr/admin/cores?action=CREATE&name=people&instanceDir=/Users/ccasano/Applications/solr/solr-5.2.1/server/solr/people&configSet=basic_configs"
Find all product docs matching "ipod", then join them against (manufacturer) docs and return the list of manufacturers that make those products
- http://localhost:8983/solr/sales/select?q=*:*&fq={!join from=region_s to=region_s fromIndex=people}mgr_s:yes
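The same join issued from SolrJ, as a sketch; the sales/people collections and region_s/mgr_s fields follow the example above, the core URL is assumed:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class JoinExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/sales").build();
        SolrQuery q = new SolrQuery("*:*");
        // keep only sales docs whose region_s matches a 'people' doc with mgr_s:yes
        q.addFilterQuery("{!join from=region_s to=region_s fromIndex=people}mgr_s:yes");
        System.out.println(client.query(q).getResults());
        client.close();
    }
}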
TODO:
https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-STATUS
http://localhost:8983/solr/admin/cores?action=STATUS&core=core0
/admin/collections?action=CLUSTERSTATUS : Get cluster status
http://lucene.472066.n3.nabble.com/SolrCloud-and-split-brain-td3989857.html#a3989868
http://alexbenedetti.blogspot.co.uk/2015/07/solr-document-classification-part-1.html
The Classification Update Request Processor is a simple processor that will automatically classify a document (the classification is based on the latest index available), adding a new field containing the class before the document is indexed.
After an initial index has been built with human-assigned labels on the documents, this Update Request Processor makes it possible to ingest documents with automatically assigned classes.
http://stackoverflow.com/questions/39922233/whether-replicationfactor-2-makes-sense-in-solrcloud/39926368
replicationFactor will not affect whether a split brain situation arises or not. The cluster details are stored in ZooKeeper. As long as you have a working ZooKeeper ensemble Solr will not have this issue. This means you should make sure you have 2xF+1 ZooKeeper nodes (minimum 3).
http://lucene.472066.n3.nabble.com/Whether-replicationFactor-2-makes-sense-td4300204.html#a4300257
The Solr replicationFactor has nothing to do with quorum. Having 2 is the same as 3. Solr uses Zookeeper's quorum sensing to ensure that all Solr nodes have a consistent picture of the cluster. Solr will refuse to index data if _Zookeeper_ loses quorum. But whether Solr has 2 or 3 replicas is not relevant. Solr indexes data through the leader of each shard, and that keeps all replicas consistent.
As far as other impacts, adding a replica will have an impact on indexing throughput; you'll have to see whether that makes any difference in your situation. This is usually on the order of 10% or so, YMMV. And this is only on the first replica you add, i.e. going from leader-only to 2 replicas costs, say, 10% on throughput, but adding yet another replica does NOT add another 10% since the leader->replica updates are done in parallel.
The other thing replicas gain you is the ability to serve more queries, since you only query a single replica for each shard.
It will make our system more robust and resilient to temporary network failure issues.
http://lucene.472066.n3.nabble.com/Does-soft-commit-re-opens-searchers-in-disk-td4248350.html
If you have already done a soft commit and that opened a new searcher, then the document will be visible from that point on. The results returned by that searcher cannot be changed by the hard commit (whatever that is doing under the hood, the segment that has that document in it must still be visible to the searcher). I don't know exactly how the soft commit stores its segment, but there must be some kind of reference counting like there is for disk segments, since the searcher has that "segment" open (regardless of whether that segment is in RAM or on disk).
1. Everything visible prior to a soft commit remains visible after the soft commit (regardless of <openSearcher>false</openSearcher>). openSearcher is only relevant for hard commits; it is meaningless with soft commits.
A soft commit inherently *has* to open a new searcher; a soft commit with openSearcher=false (Erick correctly points out such a thing doesn't exist) would be a pointless no-op, nothing visible and nothing written to disk!
https://forums.manning.com/posts/list/31850.page
re: openSearcher=false and soft commit ... I think these two are somewhat orthogonal in that you use openSearcher=false with auto-commit (hard commits) as a strategy to flush documents to durable storage w/o paying the cost of opening a new searcher after an auto-commit. This is useful because it allows you to keep Solr's update log (tlog) small and not affect search performance until you decide it is time to open the searcher, which makes documents visible. In other words, openSearcher=false, de-couples index writing tasks from index reading tasks when doing commits.
Soft-commits are all about how soon does a new document need to be visible in search results. A new searcher is opened when you do a soft-commit. When opening a new searcher after a soft-commit, Solr will still execute the warming queries configured in solrconfig.xml, so be careful with configuring queries that take a long time to execute. It will also try to warm caches, so same advice -- keep the autowarm counts low for your caches if you need NRT. Put simply, your new searcher cannot take longer to warm-up than your soft-commit frequency. Also, you probably want useColdSearcher=true when doing NRT.
https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
Note: tlogs are “rolled over” automatically on hard commit (openSearcher true or false). The old one is closed and a new one is opened.
Hard commits are about durability, soft commits are about visibility. There are really two flavors here, openSearcher=true and openSearcher=false. First we’ll talk about what happens in both cases. If openSearcher=true or openSearcher=false, the following consequences are most important:
- The tlog is truncated: A new tlog is started. Old tlogs will be deleted if there are more than 100 documents in newer, closed tlogs.
- The current index segment is closed and flushed.
- Background segment merges may be initiated.
The above happens on all hard commits. That leaves the openSearcher setting:
- openSearcher=true: The Solr/Lucene searchers are re-opened and all caches are invalidated. Autowarming is done etc. This used to be the only way you could see newly-added documents.
- openSearcher=false: Nothing further happens other than the four points above. To search the docs, a soft commit is necessary.
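To make the two flavors concrete, a hedged SolrJ sketch (core URL and document id are invented). In practice you would normally rely on autoCommit/autoSoftCommit in solrconfig.xml rather than explicit commits; this only illustrates the hard/soft distinction:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts").build();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "commit-demo-1");
        client.add(doc);

        // explicit hard commit (softCommit=false): durability, tlog rolled over, segments flushed
        // signature is commit(waitFlush, waitSearcher, softCommit)
        client.commit(true, true, false);

        // soft commit (softCommit=true): visibility, opens a new searcher cheaply
        client.commit(true, true, true);
        client.close();
    }
}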
To begin, you need to define a field type that uses the ManagedStopFilterFactory. There are two important things to notice about this field type definition. First, the filter implementation class is solr.ManagedStopFilterFactory. This is a special implementation of the StopFilterFactory that uses a set of stop words that are managed from a REST API. Second, the managed="english" attribute gives a name to the set of managed stop words, in this case indicating the stop words are for English text.
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/english
https://cwiki.apache.org/confluence/display/solr/Request+Parameters+API
The parameters are stored in a file named params.json. This file is kept in ZooKeeper or in the conf directory of a standalone Solr instance.
<requestHandler name="/my_handler" class="solr.SearchHandler" useParams="my_handler_params"/>
When using this API, solrconfig.xml is not changed. Instead, all edited configuration is stored in a file called configoverlay.json. The values in configoverlay.json override the values in solrconfig.xml.
- /config: retrieve or modify the config. GET to retrieve and POST for executing commands
- /config/overlay: retrieve the details in the configoverlay.json alone
- /config/params: allows creating parameter sets that can override or take the place of parameters defined in solrconfig.xml. See the Request Parameters API section for more details.
updateHandler.autoSoftCommit.maxTime
updateHandler.autoCommit.openSearcher
requestDispatcher.requestParsers.multipartUploadLimitInKB
requestDispatcher.requestParsers.formdataUploadLimitInKB
{
  "set-property": {
    "updateHandler.autoCommit.maxTime": 15000,
    "updateHandler.autoCommit.openSearcher": false
  }
}
Every core watches the ZooKeeper directory for the configset being used with that core. In standalone mode, however, there is no watch (because ZooKeeper is not running). If there are multiple cores in the same node using the same configset, only one ZooKeeper watch is used. For instance, if the configset 'myconf' is used by a core, the node would watch /configs/myconf. Every write operation performed through the API would 'touch' the directory (sets an empty byte[] to trigger watches) and all watchers are notified. Every core would check if the Schema file, solrconfig.xml or configoverlay.json is modified by comparing the znode versions and, if modified, the core is reloaded.
JSON API:
- We don’t need to pass the Content-Type for indexing or for querying when we’re using JSON since Solr is now smart enough to auto-detect it when Curl is the client.
curl http://localhost:8983/solr/techproducts/query -d '{ "query": "memory", "filter": "inStock:true" }'
It may sometimes be more convenient to pass the JSON body as a request parameter rather than in the actual body of the HTTP request. Solr treats a json parameter the same as a JSON body. Multiple json parameters in a single request are merged before being interpreted.
- Single-valued elements are overwritten by the last value.
- Multi-valued elements like fields and filter are appended.
- Parameters of the form json.<path>=<json_value> are merged in the appropriate place in the hierarchy. For example a json.facet parameter is the same as “facet” within the JSON body.
- A JSON body, or straight json parameters, are always parsed first, meaning that other request parameters come after, and overwrite single valued elements.
curl 'http://localhost:8983/solr/techproducts/query?json.limit=5&json.filter="cat:electronics"' -d '
{
  query: "memory",
  limit: 10,
  filter: "inStock:true"
}'
{
  facet: {
    avg_price: "avg(price)",
    top_cats: {
      terms: {
        field: "cat",
        limit: 5
      }
    }
  }
}
Because we didn’t pollute the root body of the JSON request with the normal Solr request parameters (they are all contained in the params block), we now have the ability to validate requests and return an error for unknown JSON keys; such a request gets an error back containing the error string.
Parameter Substitution / Macro Expansion
Of course request templating via parameter substitution (macro expansion) works fully with JSON request bodies or parameters as well.
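A hedged sketch of parameter substitution from SolrJ (Solr 5.1+ macro expansion; the price field, the values and the core URL are invented): the ${...} placeholders in the query are filled in from the other request parameters:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class MacroExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts").build();
        SolrQuery q = new SolrQuery();
        q.setQuery("price:[${low} TO ${high}]"); // template with ${...} macros
        q.set("low", "100");                     // values substituted by Solr at request time
        q.set("high", "200");
        System.out.println(client.query(q).getResults().getNumFound());
        client.close();
    }
}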
http://yonik.com/solr-subfacets/
top_genres:{ type: terms, field: genre, limit: 5, facet:{ top_authors:{ type: terms, field: author, limit: 4 } } }
If instead, we wanted to find top authors by total revenue (assuming we had a “sales” field), then we could simply change the author facet from the previous example as follows:
top_authors:{ type: terms, field: author, limit: 7, sort: "revenue desc", facet:{ revenue: "sum(sales)" } }
http://yonik.com/json-facet-api/
&facet=true
&facet.range={!key=age_ranges}age
&f.age.facet.range.start=0
&f.age.facet.range.end=100
&f.age.facet.range.gap=10
&facet.range={!key=price_ranges}price
&f.price.facet.range.start=0
&f.price.facet.range.end=1000
&f.price.facet.range.gap=50
And here is the equivalent faceting command in the new JSON Faceting API:
{
  age_ranges: { type: range, field: age, start: 0, end: 100, gap: 10 },
  price_ranges: { type: range, field: price, start: 0, end: 1000, gap: 50 }
}
{
  // this is a single-line comment, which can help add clarity to large JSON commands
  /* traditional C-style comments are also supported */
  x : "avg(price)",   // Simple strings can occur unquoted
  y : 'unique(manu)'  // Strings can also use single quotes (easier to embed in another String)
}
There are two types of facets, one that breaks up the domain into multiple buckets, and aggregations / facet functions that provide information about the set of documents belonging to each bucket.
Faceting can be nested! Any bucket produced by faceting can further be broken down into multiple buckets by a sub-facet.
Statistics are now fully integrated into faceting. Since we start off with a single facet bucket with a domain defined by the main query and filters, we can even ask for statistics for this top level bucket, before breaking up into further buckets via faceting. Example:
json.facet={
  x : "avg(price)",          // the average of the price field will appear under "x"
  y : "unique(manufacturer)" // the number of unique manufacturers will appear under "y"
}
The general form of the JSON facet commands are:
Example:
<facet_name> : { <facet_type> : <facet_parameter(s)> }
Example:
top_authors : { terms : { field : authors, limit : 5 } }
After Solr 5.2, a flatter structure with a “type” field may also be used:
Example:
<facet_name> : { "type" : <facet_type> , <other_facet_parameter(s)> }
Example:
top_authors : { type : terms, field : authors, limit : 5 }
The terms facet, or field facet, produces buckets from the unique values of a field. The field needs to be indexed or have docValues.
The simplest form of the terms facet
{ top_genres : { terms : genre_field } }
An expanded form allows for more parameters:
{ top_genres : { type : terms, field : genre_field, limit : 3, mincount : 2 } }
The query facet produces a single bucket that matches the specified query.
An expanded form allows for more parameters (or sub-facets / facet functions).
&q= "running shorts" &fq={!tag=COLOR}color:Blue &json.facet={ sizes:{type:terms, field:size}, colors:{type:terms, field:color, domain:{excludeTags:COLOR} }, brands:{type:terms, field:brand, domain:{excludeTags:BRAND} } } |
- Solr filter queries (fq parameters) can be tagged with arbitrary strings using the localParams {!tag=mystring} syntax. Example: fq={!tag=COLOR}color:Blue
- Multiple filters can be tagged with the same tag. Example: fq={!tag=foo}one_filter&fq={!tag=foo}another_filter
- A single filter may be tagged with multiple tags. Example: fq={!tag=tag1,tag2,tag3}my_field:my_filter
- During faceting, the facet domain may be changed to exclude filters that match certain tags via the excludeTags keyword. It’s as if the filter was never specified for that specific facet. This is useful for implementing multi-select faceting. Example: colors:{type:terms, field:color, domain:{excludeTags:COLOR}}
- excludeTags can be a multi-valued comma-separated string. Example: excludeTags:"tag1,tag2"
- excludeTags can be a JSON array of tags. Example: excludeTags:["tag1","tag2"]
- One can exclude tags that are not used in the current request. This makes constructing requests simpler since you don’t need to worry about changing the faceting part of the request based on what filters have been applied.
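A hedged SolrJ version of the multi-select pattern above (the core URL is invented; field names follow the example): the COLOR filter is tagged and then excluded from the colors facet domain:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class MultiSelectFacetExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();
        SolrQuery q = new SolrQuery("running shorts");
        q.addFilterQuery("{!tag=COLOR}color:Blue");
        // json.facet is just another request parameter as far as SolrJ is concerned
        q.add("json.facet",
              "{sizes:{type:terms, field:size},"
            + " colors:{type:terms, field:color, domain:{excludeTags:COLOR}}}");
        System.out.println(client.query(q).getResponse().get("facets"));
        client.close();
    }
}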
stats-facet
http://yonik.com/json-facet-api/
http://yonik.com/solr-facet-functions/
http://localhost:8983/solr/query?q=*:*&json.facet={x:'avg(price)'}
$ curl http://localhost:8983/solr/query -d 'q=*:*&json.facet={
  categories:{
    type : terms,
    field : cat,
    sort : "x desc",    // can also use sort:{x:desc}
    facet:{
      x : "avg(price)",
      y : "sum(price)"
    }
  }
}'
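Roughly the SolrJ equivalent of the curl request above, as a sketch (collection URL assumed); json.facet is passed as an ordinary parameter:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class FacetFunctionExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts").build();
        SolrQuery q = new SolrQuery("*:*");
        // buckets of cat, sorted by the avg(price) metric computed inside each bucket
        q.add("json.facet",
              "{categories:{type:terms, field:cat, sort:\"x desc\","
            + " facet:{x:\"avg(price)\", y:\"sum(price)\"}}}");
        System.out.println(client.query(q).getResponse().get("facets"));
        client.close();
    }
}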
https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-CombiningStatsComponentWithPivots
stats=true
stats.field={!tag=piv1,piv2 min=true max=true}price
stats.field={!tag=piv2 mean=true}popularity
facet=true
facet.pivot={!stats=piv1}cat,inStock
facet.pivot={!stats=piv2}manu,inStock

facet=true
facet.query={!tag=q1}manufacturedate_dt:[2006-01-01T00:00:00Z TO NOW]
facet.query={!tag=q1}price:[0 TO 100]
facet.pivot={!query=q1}cat,inStock

facet=true
facet.range={!tag=r1}manufacturedate_dt
facet.range.start=2006-01-01T00:00:00Z
facet.range.end=NOW/YEAR
facet.range.gap=+1YEAR
facet.pivot={!range=r1}cat,inStock
An expanded form allows for more parameters and a facet command block to specify sub-facets (either nested facets or metrics).
https://lucidworks.com/blog/2015/01/29/you-got-stats-in-my-facets/
http://localhost:8983/solr/books/select?q=Crime&stats=true&stats.field=price&stats.facet=author
But this stats.facet approach has always been plagued with problems:
- Completely different code from FacetComponent that was hard to maintain, and doesn’t support distributed search (see EDIT#1 below)
- Always returns every term from the stats.facet field, w/o any support for facet.limit, facet.sort, etc…
- Lots of problems with multivalued facet fields and/or non string facet fields.
The replacement links a stats.field to a facet.pivot param: this inverts the relationship that stats.facet used to offer (nesting the stats under the facets so to speak, instead of putting the facets under the stats) so that the FacetComponent does all the heavy lifting of determining the facet constraints, and delegates to the StatsComponent only as needed to compute stats over the subset of documents for each constraint.
http://localhost:8983/solr/techproducts/select?q=crime&facet=true&stats=true&stats.field={!tag=t1}price&facet.pivot={!stats=t1}author
"facet_pivot":{
"author":[{
"field":"author",
"value":"Kaiser Soze",
"count":42,
"stats":{
"stats_fields":{
"price":{
"min":12.95,
"max":29.95,
...}}}},
{
"field":"author",
"value":"James Moriarty",
"count":37,
"stats":{
"stats_fields":{
"price":{
"min":19.95,
"max":39.95,
...
The linkage mechanism is via a tag Local Param specified on the stats.field. This allows multiple facet.pivot params to refer to the same stats.field, or a single facet.pivot to refer to multiple different stats.field params over different fields/functions that all use the same tag, etc. And because this functionality is built on top of Pivot Facets, multiple levels of Pivots can be computed, and the stats will be computed at each level.
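A hedged SolrJ sketch of the tag-linked stats.field / facet.pivot request shown above (the techproducts core, author and price fields follow the example; getFieldStatsInfo() is the SolrJ accessor for the per-bucket stats added alongside pivot stats):

import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.PivotField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PivotStatsExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts").build();
        SolrQuery q = new SolrQuery("crime");
        q.set("stats", true);
        q.set("stats.field", "{!tag=t1}price");
        q.set("facet", true);
        q.set("facet.pivot", "{!stats=t1}author");
        QueryResponse rsp = client.query(q);
        for (Map.Entry<String, List<PivotField>> pivot : rsp.getFacetPivot()) {
            for (PivotField pf : pivot.getValue()) {
                // per-author price statistics computed by the StatsComponent
                System.out.println(pf.getValue() + " count=" + pf.getCount()
                        + " stats=" + pf.getFieldStatsInfo());
            }
        }
        client.close();
    }
}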
Although the stats.facet parameter is no longer recommended, sets of stats.field parameters can be referenced by 'tag' when using Pivot Faceting to compute multiple statistics at every level (i.e.: field) in the tree of pivot constraints.
facet.pivot.mincount
The facet.pivot.mincount parameter defines the minimum number of documents that need to match in order for the facet to be included in results. The default is 1.
docvalues
https://support.lucidworks.com/hc/en-us/articles/201839163-When-to-use-DocValues-in-Solr
Answer - For any field which is used for sorting, faceting, or as part of a custom scoring function, one should use the DocValues feature by enabling it in the schema.xml file.
Here is an example rating field on which the sort option is provided to the user, with DocValues enabled to make it more efficient.
<field name="rating" type="tint" indexed="true" docValues="true" />
Indexing All tables into a Single Big Index
Disabled Soft-commit and updates (overwrite=false), as each call to addDocument calls updateDocument under the hood
java.lang.OutOfMemoryError:
Java heap Space
Took a Heap dump and realized that it is due to Field Cache
Found a Solution : Doc values and never had this issue again till date
Doc Values (Life Saver)
Disk based Field Data a.k.a. Doc values
Document to value mapping built at index time
Store Field values on Disk (or Memory) in a column stride fashion
Better compression than Field Cache
Replacement of Field Cache (not completely)
Quite suitable for Custom Scoring, Sorting and Faceting
Scaling and Making Search Faster…
3 Level Partitioning, by Month, Country and Table name
Each Month has its own Cluster and a Cluster Manager.
Latency and Throughput are a tradeoff; you can't have both at the same time.
External Caching
In distributed search, for a repeated query request, all Solr servers need to be hit, even though the result is served from Solr's cache. This increases search latency while lowering throughput.
Solution: cache most frequently accessed query results in app layer (LRU based eviction)
We use Redis for Caching
All complex aggregation queries’ results, once fetched from multiple Solr servers
Always use Filter Query (fq) wherever it is possible as that will improve the performance due to Filter cache.
Keep your JVM heap size at a lower value (proportional to the machine's RAM), leaving enough RAM for the kernel, as a bigger heap will lead to frequent GC. A 4GB to 8GB heap allocation is quite a good range, but we use 12GB/16GB.
Don’t use Soft Commit if you don’t need it. Specially in Batch Loading
Always explore tuning of Solr for High performance, like ramBufferSize, MergeFactor, HttpShardHandler’s various configurations.
Use hash in Redis to minimize the memory usage.
http://www.slideshare.net/lucidworks/realtime-analytics-with-solr-presented-by-yonik-seeley-cloudera
Columnar Storage (DocValues)
• Fast linear scan
• Read only the data you need
• Fast random access
• docid -> value(s)
• High degree of locality
• Compressed
• prefix, delta, table, gcd, etc
• Mostly "Off-Heap"
• Memory mapped from index
• Row vs Column configurable per field!
http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
Lucene provides a RAM resident FieldCache built from the inverted index once the FieldCache for a specific field is requested the first time or during index reopen. Internally we call this process un-inverting the field, since the inverted index is a value to document mapping and FieldCache is a document to value datastructure. For simplicity, think of an array indexed by Lucene's internal document ID. When the FieldCache is loaded, Lucene iterates all terms in a field, parses the terms' values and fills the array's slots based on the document IDs associated with the term. Figure 1 illustrates the process.
Figure 1. Un-inverting a field to FieldCache
FieldCache serves very well for its purpose since accessing a value is basically a constant time array lookup.
IndexDocValues, by contrast, provides a document to value mapping built at index time. IndexDocValues allows you to do all the work during document indexing with a lot more control over your data. Each ordinary Lucene Field accepts a typed value (long, double or byte array) which is stored in a column based fashion.
https://cwiki.apache.org/confluence/display/solr/DocValues
DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
If you have already indexed data into your Solr index, you will need to completely re-index your content after changing your field definitions in schema.xml in order to successfully use docValues. It is also possible to change the docValuesFormat used by the field type. The default implementation employs a mixture of loading some things into memory and keeping some on disk.
If docValues="true" for a field, then DocValues will automatically be used any time the field is used for sorting, faceting or Function Queries.
Note that returning DocValues along with "regular" stored fields at query time has performance implications that stored fields may not because DocValues are column-oriented and may therefore incur additional cost to retrieve for each returned document. Also note that while returning non-stored fields from DocValues, the values of a multi-valued field are returned in sorted order (and not insertion order). If you require the multi-valued fields to be returned in the original insertion order, then make your multi-valued field as stored (such a change requires re-indexing).
In cases where the query is returning only docValues fields performance may improve since returning stored fields requires disk reads and decompression whereas returning docValues fields in the fl list only requires memory access.
When retrieving fields from their docValues form, two important differences between regular stored fields and docValues fields must be understood:
- Order is not preserved. For simply retrieving stored fields, the insertion order is the return order. For docValues, it is the sorted order.
- Multiple identical entries are collapsed into a single value. Thus if I insert values 4, 5, 2, 4, 1, my return will be 1, 2, 4, 5.
docValues are a form of recording a document's field values that is very efficient in scenarios where field values must be fetched by docid, such as result sorting and Facet queries.
Why use docValues?
This form is more efficient and saves more memory than the old approach of using the fieldCache for forward lookups. The inverted index splits a field's content into a list of terms, with each term mapping to a list of docids; this structure makes queries very fast, because the docids for a term are readily available. But using it for statistics, sorting or highlighting, where you need to go from a docid to the field value, is far less efficient. Before Lucene 4.0, the fieldCache would pre-load the values from the inverted index into memory at instance startup; the problem is that with many documents this pre-loading takes a long time and occupies precious memory.
Lucene 4.0 introduced the new docValues mechanism, which can be understood as a forward index, stored column-oriented.
What is the difference between DocValues and a field's stored value (stored="true")?
Both docValues and a document's stored=true values are forward indexes, but they differ:
- Storage layout: DocValues are stored column-oriented while stored=true is row-oriented; fetching a column's values by docid is more efficient with the docValues layout.
- Analysis: with stored=true the original field value is saved without analysis, whereas the values kept in docValues go through analysis.
If you need facet, group or highlight queries on the index, prefer docValues, so you do not have to worry about memory overhead.
For example: Solr 4.0 and later require a _version_ field in the schema to support atomic operations on documents; to save memory, docValues can be enabled on it:
<field name="_version_" type="long" indexed="true" stored="true" docValues="true"/>
<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true"/>
You can also choose the docValues implementation strategy via the fieldType's docValuesFormat attribute:
<fieldType name="string_in_mem_dv" class="solr.StrField" docValues="true" docValuesFormat="Memory"/>
http://solr1:8080/solr/admin/collections?action=CREATE&name=catcollection&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=conf1&router.field=cat
./bin/zkCli.sh -server zoo1,zoo2,zoo3
get /clusterstate.json
"router":{
"field":"cat",
"name":"compositeId"
}
To send a query to a particular shard, we have to use the shard.keys parameter in our query.
http://solr1:8080/solr/catcollection/select?q=martin&fl=cat,name,description&shard.keys=book!
http://solr1:8080/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1
In addition to splitting shard1 into two sub shards, SolrCloud makes the parent shard, shard1, inactive. This information is available in the ZooKeeper servers
Only an inactive shard can be deleted.
In order to move a shard to a new node, we need to add the node as a replica. Once the replication on the node is over and the node becomes active, we can simply shut down the old node and remove it from the cluster.
http://solr1:8080/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard2&replica=core_node3
http://solr1:8080/solr/admin/collections?action=SPLITSHARD&collection=catcollection&split.key=books!
http://solr1:8080/solr/admin/collections?action=SPLITSHARD&collection=catcollection&split.key=books!&async=1111
http://solr1:8080/solr/admin/collections?action=REQUESTSTATUS&requestid=1111
http://solr1:8080/solr/admin/collections?action=MIGRATE&collection=catcollection&split.key=currency!&target.collection=mycollection&forward.timeout=120
I don't think there is an option to ignore an update if the document does not exist.
Solr internally still deletes and recreates the documents; that's why you need to have the fields stored to be updatable.
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
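A hedged SolrJ sketch of an atomic update (id and field names are invented). Note the point above: Solr still rewrites the whole document internally, so the other fields must be stored (or have docValues), and there is no "skip if the document does not exist" option:

import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts").build();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "book-123");                                       // which doc to patch
        doc.addField("price", Collections.singletonMap("set", 19.95));        // replace the value
        doc.addField("tags", Collections.singletonMap("add", "bestseller"));  // append to a multi-valued field
        client.add(doc);   // if the id does not exist, Solr creates the doc rather than ignoring the update
        client.commit();
        client.close();
    }
}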
<processor class="solr.DocBasedVersionConstraintsProcessorFactory">
  <str name="versionField">my_version_l</str>
</processor>
The _version_ field is by default stored in the inverted index (indexed="true"). However, for some systems with a very large number of documents, the increase in FieldCache memory requirements may be too costly. A solution can be to declare the _version_ field as DocValues.
http://m.blog.csdn.net/article/details?id=51025580
If you created a collection and defined the "implicit" router at creation time, you can additionally define a router.field parameter, so that each document's value in that field determines which shard the document belongs to. If that field is missing from a document, the document will be rejected. You can also use the _route_ parameter to name a specific shard.
This parameter allows you to specify a collection or a number of collections on which the query should be executed. This allows you to query multiple collections at once and all the feature of Solr which work in a distributed manner can work across collections.
The _route_ Parameter
This parameter can be used to specify a route key which is used to figure out the corresponding shards. For example, if you have a document with a unique key "abc!123", then specifying the route key as "_route_=abc!" (notice the trailing '!' character) will route the request to the shard which hosts that doc. You can specify multiple such route keys separated by comma.
Create a Shard
Shards can only created with this API for collections that use the 'implicit' router. Use SPLITSHARD for collections using the 'compositeId' router. A new shard with a name can be created for an existing 'implicit' collection.
In other words, when creating collections (supplying the numShards parameter automatically switches to router="compositeId"): with the compositeId approach you cannot dynamically add shards, while with the implicit approach you can. More specifically:
- ImplicitDocRouter is independent of the uniqueKey; the target slice is taken from a _route_ parameter added to the request or to the SolrInputDocument (_shard_ is deprecated; a field parameter can also be specified) (router="implicit").
- CompositeIdRouter derives the slice from the hash of the uniqueKey (specifying the numShards parameter automatically switches to router="compositeId").
In summary, because CompositeIdRouter fixes the hash ranges when the collection is created with numShards, at update time a document is sent to whichever shard's hash range contains the hash of its unique id. ImplicitDocRouter does not assign hash ranges at creation time; instead, each document to be updated carries a _route_ field holding the name of the target shard, and that name decides where it is sent. Clearly, ImplicitDocRouter is the more flexible of the two.
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
Solr offers the ability to specify the router implementation used by a collection by specifying the router.name parameter when creating your collection. If you use the (default) "compositeId" router, you can send documents with a prefix in the document ID which will be used to calculate the hash Solr uses to determine the shard a document is sent to for indexing. The prefix can be anything you'd like it to be (it doesn't have to be the shard name, for example), but it must be consistent so Solr behaves consistently. For example, if you wanted to co-locate documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM", for example, with a document with the ID "12345", you would insert the prefix into the document id field: "IBM!12345". The exclamation mark ('!') is critical here, as it distinguishes the prefix used to determine which shard to direct the document to.
Then at query time, you include the prefix(es) into your query with the _route_ parameter (i.e., q=solr&_route_=IBM!) to direct queries to specific shards. In some situations, this may improve query performance because it overcomes network latency when querying all the shards.
If you created the collection and defined the "implicit" router at the time of creation, you can additionally define a router.field parameter to use a field from each document to identify a shard where the document belongs. If the field specified is missing in the document, however, the document will be rejected. You could also use the _route_ parameter to name a specific shard.
Certain Solr features such as grouping’s ngroups feature and joins require documents to be co-located in the same core or vm. For example to take advantage of the ngroups feature in grouping, documents need to be co-located by the grouping key. Document routing will do this automatically if the grouping key is used as the shard key.
Setting Up The CompositeId Router
The Solr Cloud compositeId router is used by default when a collection is created with the “numShards” parameter. If the numShards parameter is not supplied at collection creation time then the “implicit” document router is assigned to the collection.
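A hedged SolrJ sketch of the compositeId pattern just described (the mycollection URL and customer_s field are invented; the IBM!12345 id follows the example):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RoutingExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();

        // the "IBM!" prefix is hashed to pick the shard, co-locating this customer's docs
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "IBM!12345");
        doc.addField("customer_s", "IBM");
        client.add(doc);
        client.commit();

        // query only the shard(s) that own the IBM! hash range
        SolrQuery q = new SolrQuery("solr");
        q.set("_route_", "IBM!");
        System.out.println(client.query(q).getResults().getNumFound());
        client.close();
    }
}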
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
IgnoreCommitOptimizeUpdateProcessorFactory
In most cases, when running in SolrCloud mode, indexing client applications should not send explicit commit requests. Rather, you should configure auto commits with openSearcher=false and auto soft-commits to make recent updates visible in search requests. This ensures that auto commits occur on a regular schedule in the cluster. To enforce a policy where client applications should not send explicit commits, you should update all client applications that index data into SolrCloud. However, that is not always feasible, so Solr provides the IgnoreCommitOptimizeUpdateProcessorFactory, which allows you to ignore explicit commits and/or optimize requests from client applications without having to refactor your client application code.
memory calculation:
http://docs.alfresco.com/4.1/concepts/solrnodes-memory.html
http://yonik.com/solr-6-1/
https://issues.apache.org/jira/browse/SOLR-8888
https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462
https://cwiki.apache.org/confluence/display/solr/Major+Changes+from+Solr+5+to+Solr+6
Introduced in Solr 5, Streaming Expressions allow querying Solr and getting results as a stream of data, sorted and aggregated as requested.
publish/subscribe
The topic function provides publish/subscribe messaging capabilities built on top of SolrCloud. The topic function allows users to subscribe to a query. The function then provides one-time delivery of new or updated documents that match the topic query. The initial call to the topic function establishes the checkpoints for the specific topic ID. Subsequent calls to the same topic ID will return documents added or updated after the initial checkpoint. Each run of the topic query updates the checkpoints for the topic ID. Setting the initialCheckpoint parameter to 0 will cause the topic to process all documents in the index that match the topic query.
https://medium.com/@alisazhila/solr-s-nesting-on-solr-s-capabilities-to-handle-deeply-nested-document-structures-50eeaaa4347a
Solr has extended its set of features for nested document handling including faceting of nested documents and schemaless support of nested data structures.
http://yonik.com/solr-nested-objects/
Lucene has a flat object model and does not really support “nesting” of documents in the index.
Lucene *does* support adding a list of documents atomically and contiguously (i.e. a virtual “block”), and this is the feature used by Solr to implement “nested objects”.
When you add a parent document with 3 children, these appear in the index contiguously as
child1, child2, child3, parent
There are no schema requirements except that the _root_ field must exist (but that is there by default in all our schemas).
Any document can have nested child documents.
Using Solr 4.9 new ChildDocTransformerFactory
Now, we can use the BlockJoinQuery below to find the parent of all the documents containing the words “I am child”:
q={!parent which="cat:PARENT"}name:(I am +child)
Now, let's get the parent document along with the other child document, but without the grandchild. We do that by adding a fields parameter that looks like:
fl=id,name,[child parentFilter=cat:PARENT childFilter=cat:CHILD]
The above contains a regular fields request (id and name) plus the ChildDocTransformer, which enables Solr to return the given parent together with its nested children.
Note that we've added an optional parameter, childFilter=cat:CHILD, which filters the GRANDCHILDREN out of the response.
bin/solr status
https://support.lucidworks.com/hc/en-us/articles/20342143-Recovery-times-while-restarting-a-SolrCloud-node
Recovery time depends on these factors:
1. Your transaction log size - Each replica in a collection writes to a transaction log, which gets rolled over on a (hard) commit. If you restart between commits, that uncommitted portion needs to be replayed.
2. When a replica is in recovery, it starts writing all updates sent by the leader in this period to the transaction log. So if the node was already in recovery when you restarted it, this could be a reason why recovery took long.
Thus if your transaction log on disk is large, then that is probably the reason for a slow recovery time. A short auto (hard) commit setting will ensure a small transaction log and faster recovery, but be careful not to make it too short, as that can affect your indexing performance.
3. When a replica tries to come up after the restart, it asks the shard leader if any document updates have taken place. If the number is less than 100, SolrCloud does a PeerSync and gets all the updated documents, but when it's more than 100 it does an index replication from the leader. Full replication will also kick in if the index has changed because of merges, optimize, expungeDeletes, etc. If a full replication is kicking in, this could be a major reason for a long recovery time. You could grep for "fullCopy=true" to see if that is happening.
4. At the time of restarting a node, if your overseer queue already has a lot of pending requests queued up, then with the node restart the Overseer would have to process these state changes in addition, causing a pile-up in the Overseer queue, and thus it might take time to process the events - hence the slow recovery. You can look at your overseer queue size at http://host:port/solr/#/~cloud?view=tree - click on /overseer -> queue and see its children_count number.
5. As general advice for restarting nodes in a SolrCloud cluster, it's advised to restart one node at a time, waiting for each node to completely come up (all shards show up as active in the Cloud dashboard). Also, you should bounce the Overseer node last, as this causes minimum switching of the overseer.
Note that documents sent via PeerSync are not streamed; they are sent as a POST request. So setting the number too high will cause the request to fail. The default POST size limit for Jetty is 20MB ( http://wiki.eclipse.org/Jetty/Howto/Configure_Form_Size#Changing_the_Maximum_Form_Size_for_All_Apps_in_the_JVM ), so one needs to make sure the numRecordsToKeep count multiplied by document size doesn't exceed that.
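The PeerSync window is tied to the update log's numRecordsToKeep setting (default 100); a sketch of raising it in solrconfig.xml (500 is only an example value, subject to the POST-size caveat above):
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numRecordsToKeep">500</int>
</updateLog>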
https://github.com/docker-solr/docker-solr
TTL
https://lucidworks.com/blog/2014/05/07/document-expiration/
The DocExpirationUpdateProcessorFactory provides two features related to the “expiration” of documents, which can be used individually or in combination:
- Periodically deleting documents from the index based on an expiration field
- Computing expiration field values for documents from a “time to live” (TTL)
While the basic logic of “timer goes off, delete docs with expiration prior to NOW” was fairly simple and straightforward to add, a key aspect of making this work well was handled in a related issue (SOLR-5783): ensuring that the openSearcher=true soft commit doesn’t do anything unless there really are changes in the index. This means that you can configure autoDeletePeriodSeconds to very small values, and still rest easy that your search caches won’t get blown away every few seconds for no reason. The openSearcher=true soft commits will only affect things if there really are changes in the index.
The second feature implemented by this factory (and the key reason it’s implemented as an UpdateProcessorFactory) is to use “TTL” (Time To Live) values associated with documents to automatically generate an expiration date value to put in the expirationFieldName when documents are indexed.
By default, the DocExpirationUpdateProcessorFactory will look for a _ttl_ request parameter on update requests, as well as a _ttl_ field in each doc that is indexed in that request. If either exists, they will be parsed as Date Math Expressions relative to NOW and used to populate the expirationFieldName. The per-document _ttl_ field values override the per-request _ttl_ parameter.
Both the request parameter and field names used for specifying TTL values can be overridden by configuring ttlParamName and ttlFieldName on the DocExpirationUpdateProcessorFactory. They can also be completely disabled by configuring them as null. It’s also possible to use the TTL computation feature to generate expiration dates on documents without using the auto-deletion feature, simply by not configuring the autoDeletePeriodSeconds option (so that the timer will never run).
This sort of configuration may be handy if you only want to logically hide documents from search clients based on a per-document TTL using something like:
fq=-press_release_expiration_date:[* TO NOW/DAY]
but still retain the documents in the index for other search clients.
- A FirstFieldValueUpdateProcessorFactory is configured on the expire_at_dt field: this means that if a document is added with an explicit value in the expire_at_dt field, it will be used instead of any value that might be added by the DocExpirationUpdateProcessorFactory using the _ttl_ request param.
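A configuration sketch along these lines (the chain name, field names, and the 30-second period are illustrative, not taken verbatim from the blog post):
<updateRequestProcessorChain name="expire-documents" default="true">
  <processor class="org.apache.solr.update.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">30</int>
    <str name="ttlFieldName">_ttl_</str>
    <str name="ttlParamName">_ttl_</str>
    <str name="expirationFieldName">expire_at_dt</str>
  </processor>
  <!-- keep an explicitly supplied expire_at_dt value over one computed from _ttl_ -->
  <processor class="org.apache.solr.update.processor.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">expire_at_dt</str>
  </processor>
  <processor class="org.apache.solr.update.processor.LogUpdateProcessorFactory" />
  <processor class="org.apache.solr.update.processor.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>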
https://issues.apache.org/jira/browse/SOLR-5795
http://stackoverflow.com/questions/17806821/solr-composite-unique-key-from-existing-fields-in-schema
https://wiki.apache.org/solr/Deduplication
<processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory"> <bool name="enabled">true</bool> <bool name="overwriteDupes">false</bool> <str name="signatureField">id</str> <str name="fields">name,features,cat</str> <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str> </processor>
https://lawlesst.github.io/notebook/solr-etags.html
http://stackoverflow.com/questions/4800559/how-to-transform-simpleorderedmap-into-json-string-or-json-object
How to transform SimpleOrderedMap into JSON string or JSON object?
In short: no, because JSON doesn't have a primitive ordered map type.
// SimpleOrderedMap extends NamedList, which is Iterable<Map.Entry<String, T>>,
// so the ordered key/value pairs can be flattened into a JSONArray:
JSONArray jarray = new JSONArray();
for (Map.Entry<String, Object> e : simpleOrderedMap) {
    jarray.put(e.getKey());
    jarray.put(e.getValue());
}