http://massivetechinterview.blogspot.com/2017/09/solr-misc-part-5.html
https://issues.apache.org/jira/browse/SOLR-10379
http://blog.csdn.net/zteny/article/details/60633374
https://stackoverflow.com/questions/5516503/searching-names-with-apache-solr/35764572
http://www.lucenetutorial.com/lucene-query-syntax.html
Fuzzy search
https://issues.apache.org/jira/browse/SOLR-629
http://lucene.472066.n3.nabble.com/optimal-maxWarmingSearchers-in-solr-cloud-td4046164.html
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/test/org/apache/solr/client/solrj/request/SchemaTest.java
http://opensourceconnections.com/blog/2017/01/23/our-solution-to-solr-multiterm-synonyms/
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
https://www.slideshare.net/shalinmangar/parallel-sql-and-streaming-expressions-in-apache-solr-6
https://cwiki.apache.org/confluence/display/solr/Other+Parsers
https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents
Solr Block Join - Nested Documents
https://blog.griddynamics.com/introduction-to-block-join-faceting-in-solr
The screenshot above is taken from an online retailer’s website. According to the graphic, a dress can be blue, pink or red, and only sizes XS and S are available in blue. However, for merchandisers and customers this dress is considered a single product, not many similar variations. When a customer navigates the site, she should see all SKUs belonging to the same product as a single product, not as multiple products. This means that for facet calculations, our facet counts should represent products, not SKUs. Thus, we need to find a way to aggregate SKU-level facets into product ones.
Getting back to the technology, this means we should carefully support our catalog structure when searching and faceting products. The problem of searching structured data is already addressed in Solr with a powerful, high performance and robust solution: Block Join Query.
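A hedged sketch of the kind of query Block Join enables here. The field names (doc_type, color, size) are hypothetical; the post's model is SKUs indexed as child documents of product documents:
q={!parent which="doc_type:product"}color:blue AND size:XS
This returns the parent product documents that have at least one child SKU matching the child query, which is the shape you need before aggregating SKU-level facets up to the product level.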
https://blog.griddynamics.com/high-performance-join-in-solr-with-blockjoinquery
You can see that Join almost never ran for less than a second, and the CPU saturated with 100 requests per minute. Adding more queries harmed latency.
The whole index was cached in RAM via memory-mapped-file magic
Search now takes only a few tens of milliseconds and survives with 6K requests per minute (100 qps). And you see plenty of free CPU!
https://wiki.apache.org/solr/SolrCaching#Types_of_Caches_and_Example_Configuration
https://issues.apache.org/jira/browse/LUCENE-7452
https://issues.apache.org/jira/browse/SOLR-6096
https://issues.apache.org/jira/browse/SOLR-7888
https://issues.apache.org/jira/browse/SOLR-7963
SOLR-7888 has introduced a new parameter for filtering suggester queries
https://www.garysieling.com/blog/list-solr-functions
https://wiki.apache.org/solr/CoreQueryParameters
http://blog.florian-hopf.de/2014/03/prefix-and-suffix-matches-in-solr.html
One approach that is quite popular when doing prefix or suffix matches is to use wildcards when querying. This can be done programmatically but you need to take care that any user input is then escaped correctly. Suppose you have the term dumpling in the index and a user enters the term dump. If you want to make sure that the query term matches the document in the index you can just add a wildcard to the user query in the code of your application so the resulting query then would be dump*
Generally you should be careful when doing too much magic like this: if a user is in fact looking for documents containing the word dump she might not be interested in documents containing dumpling. You need to decide for yourself if you would like to have only matches the user is interested in (precision) or show the user as many probable matches as possible (recall). This heavily depends on the use cases for your application.
You can increase the user experience a bit by boosting exact matches for your term. You need to create a more complicated query but this way documents containing an exact match will score higher:
dump^2 OR dump*
When creating a query like this you should also take care that the user can't add terms that will make the query invalid. The SolrJ method escapeQueryChars of the class ClientUtils can be used to escape the user input.
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="back"/>
You can't use the EdgeNGramFilterFactory anymore for suffix ngrams. But fortunately the stack trace also advices us how to fix the problem. We have to combine it with ReverseStringFilter:
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
<filter class="solr.ReverseStringFilterFactory"/>
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-SynonymGraphFilter
Synonym Filter has been deprecated in favor of Synonym Graph Filter, which is required for multi-term synonym support.
Add ManagedSynonymGraphFilterFactory, deprecate ManagedSynonymFilterFactory
Word Delimiter Filter has been deprecated in favor of Word Delimiter Graph Filter, which is required to produce a correct token graph so that e.g. phrase queries can work correctly.
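A hedged example of wiring the graph filter in at query time (synonyms.txt is assumed to exist in the config set). The graph filters are intended for query-time analysis; for index-time use they need to be followed by FlattenGraphFilterFactory:
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
</analyzer>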
A DocValues field is a column-oriented field, and a segment has only one DocValues file. In other words, a field marked with DocValues additionally stores a document-to-value mapping at index time. The file holding this mapping is called DocValues data (.dvd for short), and the corresponding metadata file is .dvm (DocValues Metadata).
http://mozhenghua.iteye.com/blog/2275932
docValues and a document's stored=true values are both forward indexes, but they differ:
- Storage layout: DocValues is column-oriented storage while stored=true is row-oriented, so fetching a column of values by field is definitely more efficient with the docValues structure.
- Analysis: stored=true keeps the original field value without analysis, whereas the values kept in docValues go through analysis.
If you need faceting, grouping, highlighting and similar operations on the index, prefer docValues so you don't have to worry about memory overhead.
For example: Solr 4.0 and later require a _version_ field in the schema to support atomic operations on documents; to save memory you can add docValues:
<field name="_version_" type="long" indexed="true" stored="true" docValues="true"/>
https://www.slideshare.net/lucidworks/search-analytics-component-presented-by-steven-bower-bloomberg-lp
http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
Lucene provides a RAM resident FieldCache built from the inverted index once the FieldCache for a specific field is requested the first time or during index reopen. Internally we call this process un-inverting the field since the inverted index is a value to document mapping and FieldCache is a document to value datastructure. For simplicity think of an array indexed by Lucene’s internal documents ID. When the FieldCache is loaded Lucene iterates all terms in a field, parses the terms values and fills the arrays slots based on the document IDs associated with the term.
https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/
values of DocValue fields are densely packed into columns instead of sparsely stored like they are with stored fields.
row-oriented (stored fields)
{ 'doc1': {'A':1, 'B':2, 'C':3}, 'doc2': {'A':2, 'B':3, 'C':4}, 'doc3': {'A':4, 'B':3, 'C':2} }
column-oriented (docValues)
{ 'A': {'doc1':1, 'doc2':2, 'doc3':4}, 'B': {'doc1':2, 'doc2':3, 'doc3':3}, 'C': {'doc1':3, 'doc2':4, 'doc3':2} }
When Solr/Lucene returns a set of document ids from a query, it will then use the row-oriented (aka, stored fields) view of the documents to retrieve the actual field values. This requires a very few number of seeks since all of the field data will be stored close together in the fields data file.
However, for faceting/sorting/grouping Lucene needs to iterate over every document to collect the field values. Traditionally, this is achieved by uninverting the term index. This performs very well actually, since the field values are already grouped (by nature of the index), but it is relatively slow to load and is maintained in memory.
https://wiki.apache.org/solr/DocValues
- What docvalues are:
- NRT-compatible: These are per-segment datastructures built at index-time and designed to be efficient for the use case where data is changing rapidly.
- Basic query/filter support: You can do basic term, range, etc queries on docvalues fields without also indexing them, but these are constant-score only and typically slower. If you care about performance and scoring, index the field too.
- Better compression than fieldcache: Docvalues fields compress better than fieldcache, and "insanity" is impossible.
- Able to store data outside of heap memory: You can specify a different docValuesFormat on the fieldType (docValuesFormat="Disk") to only load minimal data on the heap, keeping other data structures on disk.
- What docvalues are not:
- Not a replacement for stored fields: These are unrelated to stored fields in every way and instead datastructures for search (sort/facet/group/join/scoring).
- Not a huge improvement for a static index: If you have a completely static index, docvalues won't seem very interesting to you. On the other hand if you are fighting the fieldcache, read on.
https://lucene.apache.org/solr/guide/6_6/docvalues.html
The standard way that Solr builds the index is with an inverted index. This style builds a list of terms found in all the documents in the index and next to each term is a list of documents that the term appears in (as well as how many times the term appears in that document). This makes search very fast - since users search by terms, having a ready list of term-to-document values makes the query process faster.
In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" />
There is an additional configuration option available, which is to modify the docValuesFormat used by the field type. The default implementation employs a mixture of loading some things into memory and keeping some on disk.
<fieldType name="string_in_mem_dv" class="solr.StrField" docValues="true" docValuesFormat="Memory" />
If docValues="true" for a field, then DocValues will automatically be used any time the field is used for sorting, faceting or function queries.
When useDocValuesAsStored="false", non-stored DocValues fields can still be explicitly requested by name in the fl param, but will not match glob patterns ("*"). Note that returning DocValues along with "regular" stored fields at query time has performance implications that stored fields alone may not, because DocValues are column-oriented and may therefore incur additional cost to retrieve for each returned document. Also note that when returning non-stored fields from DocValues, the values of a multi-valued field are returned in sorted order (not insertion order). If you require the multi-valued fields to be returned in the original insertion order, then make your multi-valued field stored (such a change requires re-indexing).
https://wiki.apache.org/solr/SchemaXml
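A quick hedged illustration, reusing the manu_exact example above (the techproducts collection name is an assumption): even though the field is neither indexed nor stored, faceting should work because the docValues structure is consulted.
http://localhost:8983/solr/techproducts/select?q=*:*&rows=0&facet=true&facet.field=manu_exact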
https://cwiki.apache.org/confluence/display/solr/Field+Properties+by+Use+Case
| Use Case | indexed | stored | multiValued | omitNorms | termVectors | termPositions | docValues |
|---|---|---|---|---|---|---|---|
| search within field | true | | | | | | |
| retrieve contents | | true 8 | | | | | true 8 |
| use as unique key | true | | false | | | | |
| sort on field | true 7 | | false | true 1 | | | true 7 |
| highlighting | true 4 | true | | | true 2 | true 3 | |
| faceting 5 | true 7 | | | | | | true 7 |
| add multiple values, maintaining order | | | true | | | | |
| field length affects doc score | | | | false | | | |
| MoreLikeThis 5 | | | | | true 6 | | |
Notes:
1 Recommended but not necessary.
2 Will be used if present, but not necessary.
3 (if termVectors=true)
4 A tokenizer must be defined for the field, but it doesn't need to be indexed.
5 Described in Understanding Analyzers, Tokenizers, and Filters.
6 Term vectors are not mandatory here. If not true, then a stored field is analyzed. So term vectors are recommended, but only required if stored=false.
7 For most field types, either indexed or docValues must be true, but both are not required. DocValues can be more efficient in many cases. For [Int/Long/Float/Double/Date]PointFields, docValues=true is required.
https://cwiki.apache.org/confluence/display/solr/Defining+Fields
https://cwiki.apache.org/confluence/display/solr/Field+Type+Definitions+and+Properties
| Property | Description | Values | Implicit Default |
|---|---|---|---|
| autoGeneratePhraseQueries | For text fields. If true, Solr automatically generates phrase queries for adjacent terms. If false, terms must be enclosed in double-quotes to be treated as phrases. | | |
| docValues | If true, the value of the field will be put in a column-oriented DocValues structure. | true or false | false |
| useDocValuesAsStored | If the field has docValues enabled, setting this to true would allow the field to be returned as if it were a stored field (even if it has stored=false) when matching "*" in an fl parameter. | true or false | true |
| large | Large fields are always lazy loaded and will only take up space in the document cache if the actual value is < 512KB. This option requires stored="true" and multiValued="false". It's intended for fields that might have very large values so that they don't get cached in memory. | | |
Although they have been deprecated for quite some time, Solr still has support for Schema based configuration of a <defaultSearchField/> (which is superseded by the df parameter) and <solrQueryParser defaultOperator="OR"/> (which is superseded by the q.op parameter).
* SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField.
autoGeneratePhraseQueries="true" (the default) causes the query parser to
generate phrase queries if multiple tokens are generated from a single
non-quoted analysis string. For example WordDelimiterFilter splitting text:pdp-11
will cause the parser to generate text:"pdp 11" rather than (text:PDP OR text:11).
Note that autoGeneratePhraseQueries="true" tends to not work well for non whitespace
delimited languages. (yonik)
with a ton of useful, though back and forth, commentary here: <https://issues.apache.org/jira/browse/SOLR-2015>
http://signaldump.org/solr/qpod/33443/solr-elevate-with-complex-query-specifying-field-names
The query elevation component matches queries exactly with entries in elevate.xml. You can nominate a query field type that is used to process the query before matching, but that won't help you when your queries have explicit boosts.
Are you using the eDismax query parser? If so, you can separate your boosts from the actual query, using the "qf" edismax configuration parameter, which specifies which fields to query, and their boosts:
q=test
qf=ean name^10.00 persartnr^5.00 persartnrdirect shortdescription
That way your query isn't polluted with boosts (or fields for that matter), and an entry in elevate.xml will match.
http://lucene.472066.n3.nabble.com/Problems-with-elevation-component-configuration-td3993204.html
Thanks Chris, but actually, it turns out that "query text" from elevate.xml has to match the query (q=...). So in this case, elevation works only for http://localhost:8080/solr/elevate?q=brain, but not for http://localhost:8080/solr/elevate?q=indexingabstract:brain type of queries.
This could be solved by using DisMax query parser (http://localhost:8080/solr/elevate?q=brain&qf=indexingabstract&defType=dismax), but we have way more complicated queries which cannot be reduced just to q=searchterm&...
right ... query elevation by default is based on the raw query string.
: but we have way more complicated queries which cannot be reduced just to
: q=searchterm&...
what would you want/expect QEC to do in that type of situation? how would
it know what part of a complex query should/shouldn't be used for
elevation?
FWIW: one thing you can do is configure a "queryFieldType" in your
QueryElevationComponent .. if specified, it will use the analyzer for
that field type to process the raw query string before doing a lookup in
the QEC data -- so for example: you could use it to lowercase the input,
or strip out unwanted whitespace or punctuation.
it might not help for really complicated queries, but it would let you
easily deal with things like extra whitespace you want to ignore.
I think it would also be fairly easy to make QEC support an "elevate.q"
param similar to how there is a "spellcheck.q" param and a "hl.q" param to
let the client specify an alternate, simplified, string for the feature to
use
https://cwiki.apache.org/confluence/display/solr/Other+Parsers
https://issues.apache.org/jira/browse/SOLR-418
Absolute boosting
Absolute boosting enables a document to be consistently displayed at a given position in the result set when a user searches with a specific query. It also prevents individual documents from being displayed when a user searches with a specific query.
Under boosting, they have:
Boosting may be applied in two ways:
- Query independent (document boosting). This is used to boost high quality pages for all queries that match the document
- Query dependent (query boosting). In this case specific documents may be boosted for given queries
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/component/QueryElevationComponent.java
For debugging it may be useful to see results with and without the elevated docs. To hide the elevated results, use enableElevation=false:
http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&enableElevation=false
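For reference, a hedged sketch of what the matching elevate.xml entry looks like (the doc ids are the ones the techproducts example ships with, to the best of my recollection):
<elevate>
  <query text="ipod">
    <doc id="MA147LL/A" />
    <doc id="IW-02" exclude="true" />
  </query>
</elevate>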
Query nesting
q=reviewed+AND+book+AND+_query_:"{!dismax qf=title pf=title^10 v=$qq}"&qq=reviewed+book
https://wiki.apache.org/solr/FunctionQuery#strdist
Calculate the distance between two strings. Uses the Lucene spell checker StringDistance interface and supports all of the implementations available in that package, plus allows applications to plug in their own via Solr's resource loading capabilities.
- Signature: strdist(s1, s2, {jw|edit|ngram|FQN}[, ngram size])
- Example: strdist("SOLR",id,edit)
The third argument is the name of the distance measure to use. The abbreviations stand for:
- jw - Jaro-Winkler
- edit - Levenshtein or Edit distance
- ngram - The NGramDistance, if specified, can optionally pass in the ngram size too. Default is 2.
- FQN - Fully Qualified class Name for an implementation of the StringDistance interface. Must have a no-arg constructor.
This function returns a float between 0 and 1 based on how similar the specified strings are to one another. A value of 1 means the specified strings are identical and 0 means the strings are maximally different.
query?q=title:iPhone+4S+Battery+Replacement&fl=*,score,lev_dist:strdist("iPhone 4S Battery Replacement",title_raw,edit)
http://lucene.apache.org/core/6_5_0/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html
Performs potentially multiple passes over Query text to parse any nested logic in PhraseQueries. - First pass takes any PhraseQuery content between quotes and stores for subsequent pass. All other query content is parsed as normal - Second pass parses any stored PhraseQuery content, checking all embedded clauses are referring to the same field and therefore can be rewritten as Span queries. All PhraseQuery clauses are expressed as ComplexPhraseQuery objects
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser
The ComplexPhraseQParser provides support for wildcards, ORs, etc., inside phrase queries using Lucene's ComplexPhraseQueryParser. Under the covers, this query parser makes use of the Span group of queries, e.g., spanNear, spanOr, etc., and is subject to the same limitations as that family of parsers.
inOrder | Set to true to force phrase queries to match terms in the order specified. Default: true |
df | The default search field. |
A mix of ordered and unordered complex phrase queries:
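Something like the following (an illustrative sketch; manu and name are techproducts example fields):
+_query_:"{!complexphrase inOrder=true}manu:\"a* c*\"" +_query_:"{!complexphrase inOrder=false df=name}\"bla* pla*\""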
Performance is sensitive to the number of unique terms that are associated with a pattern. For instance, searching for "a*" will form a large OR clause (technically a SpanOr with many terms) for all of the terms in your index for the indicated field that start with the single letter 'a'. It may be prudent to restrict wildcards to at least two or preferably three letters as a prefix. Allowing very short prefixes may result in too many low-quality documents being returned.
Notice that it also supports leading wildcards "*a" as well with consequent performance implications. Applying ReversedWildcardFilterFactory in index-time analysis is usually a good idea.
You may need to increase MaxBooleanClauses in solrconfig.xml as a result of the term expansion above:
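For instance (the value is illustrative; the stock solrconfig.xml default is 1024):
<maxBooleanClauses>4096</maxBooleanClauses>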
It is recommended not to use stopword elimination with this query parser. Let's say we add the, up, to to stopwords.txt for your collection, and index a document containing the text "Stores up to 15,000 songs, 25,000 photos, or 150 hours of video" in a field named "features".
While the query below does not use this parser:
the document is returned. The next query that does use the Complex Phrase Query Parser, as in this query:
does not return that document because SpanNearQuery has no good way to handle stopwords in a way analogous to PhraseQuery. If you must remove stopwords for your use case, use a custom filter factory or perhaps a customized synonyms filter that reduces given stopwords to some impossible token.
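Roughly, the two queries being contrasted would look like this (a hedged reconstruction of the ref guide's example, not a verbatim quote):
q=features:"Stores up to 15,000 songs, 25,000 photos, or 150 hours of video"   (document is returned)
q={!complexphrase} features:"Stores up to 15,000 songs, 25,000 photos, or 150 hours of video"   (document is not returned)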
https://stackoverflow.com/questions/2589086/lucene-fuzzy-match-on-phrase-instead-of-single-word
I'm trying to do a fuzzy match on the Phrase "Grand Prarie" (deliberately misspelled) using Apache Lucene. Part of my issue is that the
~
operator only does fuzzy matches on single word terms and behaves as a proximity match for phrases.
Is there a way to do a fuzzy match on a phrase with lucene?
There's no direct support for a fuzzy phrase, but you can simulate it by explicitly enumerating the fuzzy terms and then adding them to a MultiPhraseQuery. The resulting query would look like:
<MultiPhraseQuery: "grand (prarie prairie)">
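A minimal Lucene sketch of that idea, assuming Lucene 6.x where MultiPhraseQuery is built via its Builder; the field name "city" and the enumerated fuzzy expansions are illustrative:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiPhraseQuery;
import org.apache.lucene.search.Query;

public class FuzzyPhraseExample {
    public static Query buildFuzzyPhrase() {
        MultiPhraseQuery.Builder builder = new MultiPhraseQuery.Builder();
        // first position: the correctly spelled term
        builder.add(new Term("city", "grand"));
        // second position: all plausible fuzzy expansions of "prarie"
        builder.add(new Term[] {
            new Term("city", "prarie"),
            new Term("city", "prairie")
        });
        // toString() renders roughly as: city:"grand (prarie prairie)"
        return builder.build();
    }
}
In practice you would first enumerate the expansions with a FuzzyTermsEnum (or a preliminary FuzzyQuery) and feed whatever terms it finds into the second position.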
https://www.slideshare.net/basistech/simple-fuzzynamematchinginsolr-googleslides
Edit distance
https://stackoverflow.com/questions/21607413/edit-distance-similarity-in-lucene-solr
LocalParams
https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries
If a local parameter value appears without a name, it is given the implicit name of "type". This allows short-form representation for the type of query parser to use when parsing a query string. Thus
q={!dismax qf=myfield}solr rocks
is equivalent to:
q={!type=dismax qf=myfield}solr rocks
If no "type" is specified (either explicitly or implicitly) then the lucene parser is used by default. Thus
fq={!df=summary}solr rocks
is equivalent to:
fq={!type=lucene df=summary}solr rocks
A special key of v within local parameters is an alternate way to specify the value of that parameter.
q={!dismax qf=myfield}solr rocks
is equivalent to:
q={!type=dismax qf=myfield v='solr rocks'}
Parameter dereferencing or indirection lets you use the value of another argument rather than specifying it directly. This can be used to simplify queries, decouple user input from query parameters, or decouple front-end GUI parameters from defaults set in solrconfig.xml.
q={!dismax qf=myfield}solr rocks
is equivalent to:
q={!type=dismax qf=myfield v=$qq}&qq=solr rocks
Search for any word that starts with "foo" in the title field.
title:foo*
Search for any word that starts with "foo" and ends with bar in the title field.
title:foo*bar
Note that Lucene doesn't support using a * symbol as the first character of a search.
Search for "foo bar" within 4 words from each other.
"foo bar"~4
Note that for proximity searches, exact matches are proximity zero, and word transpositions (bar foo) are proximity 1.
A query such as "foo bar"~10000000 is an interesting alternative to foo AND bar.
Whilst both queries are effectively equivalent with respect to the documents that are returned, the proximity query assigns a higher score to documents for which the terms foo and bar are closer together.
The trade-off is that the proximity query is slower to perform and requires more CPU.
https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser
roam~
This search will match terms like roams, foam, & foams. It will also match the word "roam" itself.
An optional distance parameter specifies the maximum number of edits allowed, between 0 and 2, defaulting to 2. For example:
roam~1
This will match terms like roams & foam - but not foams since it has an edit distance of "2".
https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
https://stackoverflow.com/questions/30909106/fuzzy-search-not-working-with-dismax-query-parser
DisMax, by design, does not support all lucene query syntax in its query parameter. From the documentation:
This query parser supports an extremely simplified subset of the Lucene QueryParser syntax. Quotes can be used to group phrases, and +/- can be used to denote mandatory and optional clauses ... but all other Lucene query parser special characters are escaped to simplify the user experience.
Fuzzy queries are one of the things that are not supported. There is a request to add it to the qf parameter, if you'd care to take a look, but it has not been implemented.
One good solution would be to go to the edismax query parser instead. Its query parameter supports full lucene query parser syntax:
http://localhost:8983/solr/simple/select?q=test~1&defType=edismax&qf=fullText
http://developer4life.blogspot.com/2013/02/solr-and-lucene-fuzzy-search-closer-look.html
The Levenshtein distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Then, given an unknown query, how does Lucene find all the terms in the index that are at a distance <= the specified required similarity? Well... this depends on the Solr/Lucene version you are using.
You can take a look at the warning that appears at Lucene 3.2.0 Javadoc:
Warning: this query is not very scalable with its default prefix length of 0 - in this case, *every* term will be enumerated and cause an edit score calculation.
Moreover, prior to the 4.0 release, Lucene computed this distance for each query against EACH term in the index. You really don't want to use this. So my advice to you is to upgrade - the faster the better.
Lucene 4.0 fuzzy search took a very different approach. The search now works with FuzzyQuery. The underlying implementation has changed drastically in 4.0, which led to significant complexity improvements. The current implementation uses a Levenshtein Automaton. This automaton is based on the work of Klaus U. Schulz and Stoyan Mihov, "Fast string correction with Levenshtein automata". To make a very long story short, this paper shows how to recognize the set of all words V in an index where the Levenshtein distance between V and the query does not exceed a distance d, which is exactly what one wants with Fuzzy Search. For a deeper look see here and here.
http://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Fuzzy%20Searches
Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. For example to search for a term similar in spelling to "roam" use the fuzzy search:
roam~
This search will find terms like foam and roams.
Starting with Lucene 1.9 an additional (optional) parameter can specify the required similarity. The value is between 0 and 1, with a value closer to 1 only terms with a higher similarity will be matched. For example:
roam~0.8
The default that is used if the parameter is not given is 0.5.
https://stackoverflow.com/questions/1752301/how-to-configure-solr-to-use-levenshtein-approximate-string-matching
just append the ~ character to all terms that you want to fuzzy match on the way in to solr. If you are using the default set up, this will give you fuzzy match.
term:"apple"~2
If you put quotes around apple, I think it becomes a phrase query, so the ~2 means proximity search, instead of edit distance.
Typically this is done with the SpellCheckComponent, which internally uses the Lucene SpellChecker by default, which implements Levenshtein.
The wiki really explains very well how it works, how to configure it and what options are available, no point repeating it here.
Or you could just use Lucene's fuzzy search operator.
Another option is using a phonetic filter instead of Levenshtein.
https://stackoverflow.com/questions/18629373/solr-xml-parser-exception
Caused by: org.apache.solr.common.SolrException: org.xml.sax.SAXParseException;systemId: solrres:/solrconfig.xml; lineNumber: 813; columnNumber: 19; The content of elements must consist of well-formed character data or markup.
at org.apache.solr.core.Config.<init>(Config.java:148)
at org.apache.solr.core.Config.<init>(Config.java:86)
at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:120)
at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:589)
... 11 more
Caused by: org.xml.sax.SAXParseException; systemId: solrres:/solrconfig.xml; lineNumber: 813; columnNumber: 19; The content of elements must consist of well-formed character data or markup.
As < and > are XML special characters, parsing would fail. You would need to use &gt; and &lt; for > and < respectively in the Solr config XML file.
e.g. <str name="mm">4 &lt; 100%</str>
https://lucene.apache.org/solr/guide/6_6/velocity-response-writer.html#velocity-response-writer
http://localhost:8983/solr/techproducts/browse
https://cwiki.apache.org/confluence/display/solr/MoreLikeThis
The
MoreLikeThis
search component enables users to query for documents similar to a document in their result list. It does this by using terms from the original document to find similar documents in the index.
There are three ways to use MoreLikeThis. The first, and most common, is to use it as a request handler. In this case, you would send text to the MoreLikeThis request handler as needed (as in when a user clicked on a "similar documents" link). The second is to use it as a search component. This is less desirable since it performs the MoreLikeThis analysis on every document returned. This may slow search results. The final approach is to use it as a request handler but with externally supplied text. This case, also referred to as the MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input document.
https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
http://blog.thedigitalgroup.com/vijaym/beider-morse-phonetic-matching-in-solr/
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
</analyzer>
RuleType
- APPROX : Approximate rules, which will lead to the largest number of phonetic interpretations.
- EXACT : Exact rules, which will lead to a minimum number of phonetic interpretations.
NameType
- Supported types of names. Unless you are matching particular family names, use GENERIC. The GENERIC NameType should work reasonably well for non-name words. The other encodings are specifically tuned to family names, and may not work well at all for general text.
- ASHKENAZI (ash) : Ashkenazi family names.
- GENERIC (gen) : Generic names and words.
- SEPHARDIC (sep) : Sephardic family names.
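A hedged schema sketch putting that analyzer to work (the type and field names are made up for illustration):
<fieldType name="text_bm" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
  </analyzer>
</fieldType>
<field name="name_phonetic" type="text_bm" indexed="true" stored="false"/>
<copyField source="name" dest="name_phonetic"/>
Queries against name_phonetic then match on the Beider-Morse encodings rather than the literal spelling.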
./solr zk -h
./solr zk mkroot /yourroot
ln -s /etc/default/solr.in.sh /var/solr/solr.in.sh
service solr start
curl -d "action=clusterstatus&wt=json" http://localhost:8983/solr/admin/collections | jq
http://solr.pl/en/2011/02/28/sorting-by-function-value-in-solr-solr-1297/
| Query | Documents | First query time | Subsequent query time |
|---|---|---|---|
| q=*:*&sort=opis_sort+desc | 200.000 | 267ms | 0ms |
| q=*:*&sort=sum(geo_x,geo_y)+desc | 200.000 | 823ms | 0ms |
| q=opis:ala&sort=opis_sort+desc | 200.000 | 266ms | 1ms |
| q=opis:ala&sort=sum(geo_x,geo_y)+desc | 200.000 | 810ms | 1ms |
Above test shows that sorting using the sort function is much slower than the default sort order (which you’d expect). Sorting on the basis of function value is also slower than sorting with the use of string based field, but the difference is not as significant as in the previous case.
https://home.apache.org/~ctargett/RefGuidePOC/jekyll-full/function-queries.html
if(termfreq(cat,'electronics'),popularity,42): This function checks each document to see if it contains the term "electronics" in the cat field. If it does, then the value of the popularity field is returned, otherwise the value of 42 is returned.
query(subquery, default)
q=product(popularity, query({!dismax v='solr rocks'})): returns the product of the popularity and the score of the DisMax query.
q=product(popularity, query($qq))&qq={!dismax}solr rocks: equivalent to the previous query, using parameter de-referencing.
q=product(popularity, query($qq,0.1))&qq={!dismax}solr rocks: specifies a default score of 0.1 for documents that don't match the DisMax query.
and(not(exists(popularity)),exists(price)): returns true for any document which has a value in the price field, but does not have a value in the popularity field.
https://stackoverflow.com/questions/22107683/solr-boost-query-syntax
The problem is the way you are trying to nest queries inside of each other w/o any sort of quoting -- the parser has no indication that the "b" param is
if(exists(query({!v='user_type:ADMIN'})),10,1)
it thinks it's "if(exists(query({!v='user_type:ADMIN'"
and the rest is confusing it.
If you quote the "b" param to the boost parser, then it should work...
http://localhost:8983/solr/select?q={!boost b="if(exists(query({!v='foo_s:ADMIN'})),10,1)"}id:1
...or you could use variable dereferencing; either of these should work...
http://localhost:8983/solr/select?q={!boost b=$b}id:1&b=if(exists(query({!v='foo_s:ADMIN'})),10,1)
http://localhost:8983/solr/select?q={!boost b=if(exists(query($nestedq)),10,1)}id:1&nestedq=foo_s:ADMIN
Nested function query must use $param or {!v=value}
http://lucene.472066.n3.nabble.com/Nested-function-query-must-use-td4038037.html
You have to do exactly what the error message tells you:
rewrite:
query(id:3)
as:
query({!v='id:3'})
The correct syntax is:
http://localhost:8983/solr/articles.0/select/?q={!func}query({!query v='hello'})&fl=Document.title,score&debugQuery=on
https://lucidworks.com/2009/03/31/nested-queries-in-solr/
To embed a query of another type in a Lucene/Solr query string, simply use the magic field name _query_. The following example embeds a lucene query type:poems into another lucene query:
text:"roses are red" AND _query_:"type:poems"
Now of course this isn't too useful on its own, but it becomes very powerful in conjunction with the query parser framework and local params, which allow us to change the types of queries. The following example embeds a DisMax query in a normal lucene query:
text:hi AND _query_:"{!dismax qf=title pf=title}how now brown cow"
And we can further use parameter dereferencing in the local params syntax to make it easier for the front-end to compose the request:
&q=text:hi AND _query_:"{!dismax qf=title pf=title v=$qq}"&qq=how now brown cow
Nested Queries in Function Query Syntax
q=how now brown cow&bq={!query v=$datefunc}
And the defaults for the handler in solrconfig.xml would contain the actual definition of datefunc as a function query:
<lst name="defaults"> <str name="datefunc">{!func}recip(rord(date),1,1000,1000)</str> [...]
- Use in a parameter that is explicitly for specifying functions, such as the eDisMax query parser's boost param, or the DisMax query parser's bf (boost function) parameter. (Note that the bf parameter actually takes a list of function queries separated by white space and each with an optional boost. Make sure you eliminate any internal white space in single function queries when using bf.) For example:
- Introduce a function query inline in the lucene QParser with the _val_ keyword. For example:
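Hedged examples of each style (field names are the ones the stock ref-guide samples tend to use, so treat them as placeholders):
bf=ord(popularity)^0.5 recip(rord(price),1,1000,1000)
q=_val_:mynumericfield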
https://stackoverflow.com/questions/17654266/solr-autocommit-vs-autosoftcommit
You have openSearcher=false for hard commits. Which means that even though the commit happened, the searcher has not been restarted and cannot see the changes. Try changing that setting and you will not need soft commit.
SoftCommit does reopen the searcher. So if you have both sections, soft commit shows new changes (even if they are not hard-committed) and - as configured - hard commit saves them to disk, but does not change visibility.
This lets you set soft commit to 1 second so documents show up quickly, while hard commits happen less frequently.
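A minimal solrconfig.xml sketch of that pattern (the interval values are illustrative, not a recommendation):
<autoCommit>
  <maxTime>60000</maxTime>           <!-- hard commit every minute: flush to disk, truncate the update log -->
  <openSearcher>false</openSearcher> <!-- but don't open a new searcher -->
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>            <!-- soft commit every second: new docs become visible -->
</autoSoftCommit>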
Broadly speaking, you want it to be a smallish value as
background warming can be expensive. So how often are you doing a "hard"
commit and does it take longer to warm-up your searchers?
A few things to consider are:
auto-commits - can use openSearcher=false to avoid opening a new searcher
when doing large batch updates. This allows you to auto-commit more
frequently so that your update log doesn't get too big w/o paying the price
of warming a new searcher on every auto-commit.
new searcher warming queries - how many of these do you have and how long
do they take to warm up? You can get searcher warm-up time from the
Searcher MBean in the admin console.
cache auto-warming - again, how much of your existing caches are you
auto-warming? Keep a close eye on the filterCache and its autowarmCount.
Warm-up times are also available from the admin console.
https://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F
Whenever a commit happens in Solr, a new "searcher" (with new caches) is opened, "warmed" up according to your SolrConfigXml settings, and then put in place. The previous searcher is not closed until the "warming" searcher is ready. If multiple commits happen in rapid succession -- before the warming searcher from the first commit has had enough time to warm up -- then there can be multiple searchers all competing for resources at the same time, even though one of them will be thrown away as soon as the next one is ready.
maxWarmingSearchers is a setting in SolrConfigXml that helps you put a safety valve on the number of overlapping warming searchers that can exist at one time. If you see this error it means Solr prevented a commit from resulting in a new searcher being opened because there were already X warming searchers open.
If you encounter this error a lot, you can (in theory) increase the number in your maxWarmingSearchers, but that is risky to do unless you are confident you have the system resources (RAM, CPU, etc...) to do it safely. A more correct way to deal with the situation is to reduce how frequently you send commits.
If you only encounter this error infrequently because of fluke situations, you'll probably be ok just ignoring it.
Why doesn't my index directory get smaller (immediately) when i delete documents? force a merge? optimize?
Because of the "inverted index" data structure, deleting documents only annotates them as deleted for the purpose of searching. The space used by those documents will be reclaimed when the segments they are in are merged.
When segments are merged (either because of the Merge Policy as documents are added, or explicitly because of a forced merge or optimize command), Solr attempts to delete old segment files, but on some filesystems (notably Microsoft Windows) it is not possible to delete a file while the file is open for reading (which is usually the case since Solr is still serving requests against the old segments until the new Searcher is ready and has its caches warmed). When this happens, the older segment files are left on disk, and Solr will re-attempt to delete them later the next time a merge happens.
http://blog.csdn.net/a550246215/article/details/52402232
Conclusion: when spring-data-solr saves, if no transaction management has been configured it calls solrClient.commit() directly (a hard commit).
Workarounds:
Option 1: configure transaction management.
Option 2: drop spring-data-solr in favour of plain solrj and call solrClient.addBeans(args) directly (only add, without committing; the commit decision is left to the Solr server), configuring the commit policy in solrconfig.xml.
https://support.datastax.com/hc/en-us/articles/207690673-FAQ-Solr-logging-PERFORMANCE-WARNING-Overlapping-onDeckSearchers-and-its-meaning
WARN [commitScheduler-4-thread-1] 2015-12-16 08:46:42,488 SolrCore.java:1712 - [mykeyspace.my_search_table] PERFORMANCE WARNING: Overlapping onDeckSearchers=2
When a commit is issued to a Solr core, it makes index changes visible to new search requests. Commits may come from an application or an auto-commit. A "normal" commit in DSE more often than not comes from an auto commit, which is, as outlined here, configured in the solr config file.
Each time a commit is issued a new searcher object is created. When there are too many searcher objects this warning will be observed.
Also if the configuration is such that a searcher has pre-warming queries, this can delay the start time meaning that the searcher is still starting up when a new commit comes in.
Decrease overlapping commits
Find out if there are commits being issued from the application. These might overlap with the auto commits. Ideally one would tune the auto commit settings to suit the application, negating the need for anything but auto commits.
Reduce auto warm count (if not using SolrFilterCache)
If you are not using the default SolrFilterCache, disable or reduce the autowarmCount setting for the given filter cache you are using. This setting controls the amount of objects populated from an older cache.
Increase the max searchers
One can increase the following setting in solrconfig.xml if required:
<maxWarmingSearchers>16</maxWarmingSearchers>
Note: Having too high a number here can place more load on the node and have a negative impact on performance. It is normally recommended to keep this at 50% of the number of CPUs at most. In most cases the default of 2 should be sufficient anyway.
http://lucene.472066.n3.nabble.com/Result-Grouping-vs-Collapsing-Query-Parser-Can-one-be-deprecated-td4302127.html
1) Collapse does not directly support faceting. It simply collapses the
results and the faceting components compute facets on the collapsed result
set. Grouping has direct support for faceting, which can be slow, but it
has options other than just computing facets on the collapsed result set.
2) Originally collapse only supported selecting group heads with min/max
value of a numeric field. It did not support using the sort parameter for
selecting the group head. Recently the sort parameter was added to
collapse, but this likely is not nearly as fast as using the min/max for
selecting group heads.
The collapse query parser filters search results so that only one document is returned out of all of those for a given field's value. Said differently, it collapses search results to one document per group of those with the same field value. This query parser is a special type called post-filter, which can only be used as a filter query because it needs to see the results of all other filter queries and the main query.
For min, the document with the smallest value is chosen, and for max, the largest. If your function query needs to be computed based on the document's score, refer to that via cscore().
If, and only if, your documents are partitioned into separate shards by manufacturer name would you get the correct group count, because each group would be guaranteed to only exist on one shard.
/select?q=*:*&fq={!collapse field=fieldToCollapseOn max=sum(field1, field2)}
/select?q=*:*&fq={!collapse field=fieldToCollapseOn max=sum(product(field1,
1000000), cscore())}
we are effectively sorting by field1 desc and then the cscore desc (a special “collapse score” function for getting the score of a document before collapsing)
SchemaRequest.SchemaVersion schemaVersionRequest = new SchemaRequest.SchemaVersion();
SchemaResponse.SchemaVersionResponse schemaVersionResponse = schemaVersionRequest.process(getSolrClient());
https://github.com/apache/lucene-solr/blob/master/lucene/suggest/src/java/org/apache/lucene/search/suggest/DocumentDictionary.java
<str name="weightField">price</str>
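A hedged example of where weightField fits, in a SuggestComponent configuration (the field names name and price, and the suggester name, are illustrative):
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">name</str>
    <str name="weightField">price</str>
    <str name="suggestAnalyzerFieldType">string</str>
  </lst>
</searchComponent>
DocumentDictionary reads the suggestion terms from the "field" and ranks them by the numeric "weightField".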
http://blog.trifork.com/2009/10/20/result-grouping-field-collapsing-with-solr/
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
The Collapsing query parser and the Expand component combine to form an approach to grouping documents for field collapsing in search results. The Collapsing query parser groups documents (collapsing the result set) according to your parameters, while the Expand component provides access to documents in the collapsed group for use in results display or other processing by a client application. Collapse & Expand can together do what the older Result Grouping (group=true) does for most use-cases but not all. Generally, you should prefer Collapse & Expand.
In order to use these features with SolrCloud, the documents must be located on the same shard. To ensure document co-location, you can define the router.name parameter as compositeId when creating the collection.
The CollapsingQParser is really a post filter that provides more performant field collapsing than Solr's standard approach when the number of distinct groups in the result set is high. This parser collapses the result set to a single document per group before it forwards the result set to the rest of the search components. So all downstream components (faceting, highlighting, etc...) will work with the collapsed result set.
q=foo&fq={!collapse field=ISBN}&expand=true
http://opensourceconnections.com/blog/2016/01/22/solr-vs-elasticsearch-relevance-part-two/
This verbosity pays off. It’s much easier for the uninitiated to look at the JSON and guess what’s happening. It’s clear that there’s a query, of type “multi_match”, being passed a query string “dog catcher law.” You can see clearly the fields being searched. Without much knowledge, you could make guesses about what minimum_should_match or most_fields might mean.
It’s also helpful that Elasticsearch always scopes the parameters to the current query. There’s no “local” vs “global” parameters. There’s just the current JSON query object and its arguments. To appreciate this point, you have to appreciate an annoying Solr localparams quirk. Solr localparams inherit the global query parameters. For example, let’s say you use the following query parameters
q=dog catcher law&defType=edismax&q.op=AND&bq={!edismax mm=50% tie=1 qf='catch_line text'}cat
(search for dog catcher law, boost (bq) by a ‘cat’ query). Well, your scoped local params query unintuitively receives the outside parameter q.op=AND. More frustratingly, with this query you’ll get a deeply befuddling “Infinite Recursion” error from Solr. Why? Because, hey, guess what: your local params query in bq also inherits the bq from the outside – aka itself! So in reality this query is bq={!edismax mm=50% tie=1 q.op=AND bq='{!edismax mm=50% tie=1 q.op=AND bq='...' qf='catch_line text'} qf='catch_line text'}. Solr keeps filling in that ‘bq’ from the outside bq, and therefore reports the not-so-intuitive: org.apache.solr.search.SyntaxError: Infinite Recursion detected parsing query ‘dog catcher law’
To avoid accepting the external arguments, you need to be explicit in your local params query. Here we set no bq and change q.op to OR. Elasticsearch, on the other hand, focusses on the common use cases.
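A hedged reconstruction of that explicit form, overriding the inherited parameters inside the local-params query:
q=dog catcher law&defType=edismax&q.op=AND&bq={!edismax mm=50% tie=1 q.op=OR bq='' qf='catch_line text'}cat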
Well, Solr’s desire for terseness has created features like parameter substitution and dereferencing. These features let you reuse parts of queries in a fairly readable fashion. Moreover, Solr’s function query syntax gives you an extremely powerful function, query(), that lets you combine relevance scoring and math more seamlessly than the Elasticsearch equivalents.
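For instance, a hedged illustration of dereferencing and the query() function (the userq parameter name and the popularity field are invented for the example):
q={!edismax qf='catch_line text' v=$userq}&userq=dog catcher law
q={!func}sum(query($userq),field(popularity))&userq={!edismax qf='catch_line text'}dog catcher law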
Performance Problems with "Deep Paging"
In some situations, the results of a Solr search are not destined for a simple paginated user interface. When you wish to fetch a very large number of sorted results from Solr to feed into an external system, using very large values for the start or rows parameters can be very inefficient. Pagination using start and rows not only requires Solr to compute (and sort) in memory all of the matching documents that should be fetched for the current page, but also all of the documents that would have appeared on previous pages. So while a request for start=0&rows=1000000 may be obviously inefficient because it requires Solr to maintain & sort in memory a set of 1 million documents, likewise a request for start=999000&rows=1000 is equally inefficient for the same reasons. Solr can't compute which matching document is the 999001st result in sorted order without first determining what the first 999000 matching sorted results are. If the index is distributed, which is common when running in SolrCloud mode, then 1 million documents are retrieved from each shard. For a ten shard index, ten million entries must be retrieved and sorted to figure out the 1000 documents that match those query parameters. Cursors (the cursorMark parameter) avoid this cost: instead, the sort values of the last document returned to the client are used to compute a "mark" representing a logical point in the ordered space of sort values. That "mark" can be specified in the parameters of subsequent requests to tell Solr where to continue.
- cursorMark and start are mutually exclusive parameters: your requests must either not include a start parameter, or it must be specified with a value of "0".
- sort clauses must include the uniqueKey field (either "asc" or "desc"). If id is your uniqueKey field, then sort params like id asc and name asc, id desc would both work fine, but name asc by itself would not.
- Sorts including Date Math based functions that involve calculations relative to NOW will cause confusing results, since every document will get a new sort value on every subsequent request. This can easily result in cursors that never end, and constantly return the same documents over and over – even if the documents are never updated. In this situation, choose & re-use a fixed value for the NOW request param in all of your cursor requests.
Cursor mark values are computed based on the sort values of each document in the result, which means multiple documents with identical sort values will produce identical cursorMark values if one of them is the last document on a page of results. In that situation, the subsequent request using that cursorMark would not know which of the documents with the identical mark values should be skipped. Requiring that the uniqueKey field be used as a clause in the sort criteria guarantees that a deterministic ordering will be returned, and that every cursorMark value will identify a unique point in the sequence of documents.
For example, consider this sequence of requests and index changes:
- A client requests q=*:*&rows=5&start=0&sort=name asc, id asc&cursorMark=*
  - Documents with the ids 1-5 will be returned to the client in order
- Document id 3 is deleted
- The client requests 5 more documents using the nextCursorMark from the previous response
  - Documents 6-10 will be returned -- the deletion of a document that's already been returned doesn't affect the relative position of the cursor
- 3 new documents are now added with the ids 90, 91, and 92; all three documents have a name of A
- The client requests 5 more documents using the nextCursorMark from the previous response
  - Documents 11-15 will be returned -- the addition of new documents with sort values already past does not affect the relative position of the cursor
- Document id 1 is updated to change its 'name' to Q
- Document id 17 is updated to change its 'name' to A
- The client requests 5 more documents using the nextCursorMark from the previous response
  - The resulting documents are 16,1,18,19,20 in that order
  - Because the sort value of document 1 changed so that it is after the cursor position, the document is returned to the client twice
  - Because the sort value of document 17 changed so that it is before the cursor position, the document has been "skipped" and will not be returned to the client as the cursor continues to progress
In a nutshell: When fetching all results matching a query using cursorMark, the only way index modifications can result in a document being skipped, or returned twice, is if the sort value of the document changes.
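A hedged SolrJ sketch of the cursor loop (the collection name, page size, and sort field are illustrative; the sort includes the uniqueKey id as required):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkExample {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts").build()) {
      SolrQuery query = new SolrQuery("*:*");
      query.setRows(500);
      // The sort must include the uniqueKey field so every cursor mark is unambiguous.
      query.setSort(SolrQuery.SortClause.asc("id"));
      String cursorMark = CursorMarkParams.CURSOR_MARK_START;
      boolean done = false;
      while (!done) {
        query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
        QueryResponse response = client.query(query);
        for (SolrDocument doc : response.getResults()) {
          // feed each document into the external system here
        }
        String nextCursorMark = response.getNextCursorMark();
        // When the mark stops changing, every matching document has been seen.
        done = cursorMark.equals(nextCursorMark);
        cursorMark = nextCursorMark;
      }
    }
  }
}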
https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
https://www.slideshare.net/shalinmangar/parallel-sql-and-streaming-expressions-in-apache-solr-6
https://cwiki.apache.org/confluence/display/solr/Other+Parsers
A common mistake is to try to filter parents with a which filter, as in this bad example:
q={!parent which="title:join"}comments:SolrCloud
Instead, you should use a sibling mandatory clause as a filter:
q=+title:join +{!parent which="content_type:parentDocument"}comments:SolrCloud
Block Join Parent Query Parser
This parser takes a query that matches child documents and returns their parents. The syntax for this parser is similar: q={!parent which=<allParents>}<someChildren>. The parameter allParents is a filter that matches only parent documents; here you would define the field and value that you used to identify all parent documents. The parameter someChildren is a query that matches some or all of the child documents. Note that the query for someChildren should match only child documents, or you may get the exception "Child query must not match same docs with parent filter. Combine them as must clauses (+) to find a problem doc." (in older versions it reads "child query must only match non-parent docs"). As it says, you can search for q=+(parentFilter) +(someChildren) to find the problem document.
Again using the example documents above, we can construct a query such as
q={!parent which="content_type:parentDocument"}comments:SolrCloud
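A hedged SolrJ sketch of how such a parent/child block might be indexed (field values follow the content_type:parentDocument convention used in the query above; the ids are invented):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BlockIndexingExample {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts").build()) {
      SolrInputDocument parent = new SolrInputDocument();
      parent.addField("id", "book-1");
      parent.addField("title", "join");
      parent.addField("content_type", "parentDocument");

      SolrInputDocument comment = new SolrInputDocument();
      comment.addField("id", "book-1-comment-1");
      comment.addField("comments", "SolrCloud");

      // Children are attached to the parent so the whole block is indexed
      // contiguously, which is what BlockJoin relies on.
      parent.addChildDocument(comment);

      client.add(parent);
      client.commit();
    }
  }
}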
https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents
This transformer returns all descendant documents of each parent document matching your query in a flat list nested inside the matching parent document. This is useful when you have indexed nested child documents and want to retrieve the child documents for the relevant parent documents for any type of search query.
Note that this transformer can be used even though the query itself is not a Block Join query.
When using this transformer, the parentFilter parameter must be specified, and works the same as in all Block Join Queries. Additional optional parameters are:
- childFilter - query to filter which child documents should be included; this can be particularly useful when you have multiple levels of hierarchical documents (default: all children)
- limit - the maximum number of child documents to be returned per parent document (default: 10)
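A hedged example request that combines the block join query above with the [child] transformer (the limit value is arbitrary):
q={!parent which="content_type:parentDocument"}comments:SolrCloud&fl=id,[child parentFilter=content_type:parentDocument childFilter=comments:SolrCloud limit=5]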
Solr Block Join - Nested Documents
https://blog.griddynamics.com/introduction-to-block-join-faceting-in-solr
The screenshot above is taken from an online retailer’s website. According to the graphic, a dress can be blue, pink or red, and only sizes XS and S are available in blue. However, for merchandisers and customers this dress is considered a single product, not many similar variations. When a customer navigates the site, she should see all SKUs belonging to the same product as a single product, not as multiple products. This means that for facet calculations, our facet counts should represent products, not SKUs. Thus, we need to find a way to aggregate SKU-level facets into product ones.
A common solution is to propagate properties from the SKU level to the product level and produce a single product document with multivalued fields aggregated from the SKUs. With this approach, our aggregated product looks like this:
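The original post shows the aggregated document at this point; roughly, it would look like this (illustrative values based on the dress example):
Product_1: COLOR = [Blue, Pink, Red], SIZE = [XS, S, M]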
However, this approach creates the possibility of false positive matches with regards to combinations of SKU-level fields. For example, if a customer filters by color ‘Blue’ and size ‘M’, Product_1 will be considered a valid match, even though there is no SKU in the original catalog which is both 'Blue' and 'M'. This happens because when we are aggregating values from the SKU level, we are losing information about what value comes from what SKU.
Getting back to the technology, this means we should carefully support our catalog structure when searching and faceting products. The problem of searching structured data is already addressed in Solr with a powerful, high performance and robust solution: Block Join Query.
https://blog.griddynamics.com/high-performance-join-in-solr-with-blockjoinquery
A Join Query looks like:
q=text_all:(patient OR autumn OR helen)&fl=id,score&sort=score desc&fq={!join from=join_id to=id}acl:[1303 TO 1309]
You can see that Join almost never ran for less than a second, and the CPU saturated with 100 requests per minute. Adding more queries harmed latency.
The whole index was cached in RAM via memory-mapped file magic.
The BlockJoin version of the same query:
q=text_all:(patient OR autumn OR helen)&fl=id,score&sort=score desc&fq={!parent which=kind:body}acl:[1303 TO 1309]
Search now takes only a few tens of milliseconds and survives with 6K requests per minute (100 qps). And you see plenty of free CPU!
We can check where Join uses so much CPU power with jstack:
java.lang.Thread.State: RUNNABLE
at
o.a.l.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docFreq(BlockTreeTermsReader.java:2098)
at
o.a.s.search.JoinQuery$JoinQueryWeight.getDocSet(JoinQParserPlugin.java:338)
Let’s explain how a 55GB index can ever be cached in just 8GB RAM. You should know that not all files in your index are equally valuable. (In other words, tune your schema wisely.) In my index the frq file is 7.7GB and the tim file is only 427MB, and it’s almost all that’s needed for these queries. Of course, a file which stores primary key values is also read, but it doesn’t seem significant.
BlockJoin is the most efficient way to do the join operation, but it doesn’t mean you need to get rid of your solution based on the other (slow) Join. The place for Join is frequent child updates -- and small indexes, of course.
Solr supports the search of hierarchical documents using BlockJoinQuery (BJQ). Using this query requires a special way of indexing documents, based on their positioning in the index. All documents belonging to the same hierarchy have to be indexed together, starting with child documents followed by their parent documents.
BJQ works as a bridge between levels of the document hierarchy; e.g. it transforms matches on child documents to matches on parent documents. When we search using BJQ, we provide a child query and a parent filter as parameters. A child query represents what we are looking for among child documents, and a parent filter tells BJQ how to distinguish parent documents from child documents in the index.
For each matched child document, BJQ scans ahead in the index until it finds the nearest parent document, which is sent into the collector chain instead of the child document. This trick of relying on relative document positioning in the index, or “index-time join,” is the secret behind BJQ’s high performance.
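A hedged sketch of the Lucene-level query behind the {!parent} parser (field names follow the content_type:parentDocument convention from above; the child term assumes lowercased index terms):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.BitSetProducer;
import org.apache.lucene.search.join.QueryBitSetProducer;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.search.join.ToParentBlockJoinQuery;

public class BlockJoinQueryExample {
  public static Query parentsWithMatchingChildren() {
    // Tells BJQ which documents in each block are the parents.
    BitSetProducer parentsFilter =
        new QueryBitSetProducer(new TermQuery(new Term("content_type", "parentDocument")));
    // A query that matches only child documents.
    Query childQuery = new TermQuery(new Term("comments", "solrcloud"));
    // Child matches are rewritten to their nearest following parent document.
    return new ToParentBlockJoinQuery(childQuery, parentsFilter, ScoreMode.None);
  }
}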
We consider each hierarchy of matched documents separately. As we are using BJQ, each hierarchy is represented in the Solr index as a document block, or DocSet slice as we call it.
First, we calculate facets based on matched SKUs from our block. Then we aggregate obtained SKU counts into Product-level facet counts, increasing the product level facet count by only one for every matched block, irrespective of the number of matched SKUs within the block. For example, if we are searching by COLOR:Blue, even though two Blue SKUs were found within a block, aggregated product-level counts will be increased only by one.
This solution is implemented inside the BlockJoinFacetComponent, which extends the standard Solr SearchComponent. BlockJoinFacetComponent validates the query and injects a special BlockJoinFacetCollector into the Solr post-filter collectors chain. When BlockJoinFacetCollector is invoked, it receives a parent document, since the BlockJoinFacetComponent ensures that only ToParentBlockJoinQuery is allowed as a top level query.
Use Lucene’s MMapDirectory on 64bit platforms, please!
https://issues.apache.org/jira/browse/LUCENE-7452
java.lang.IllegalStateException: Child query must not match same docs with parent filter. Combine them as must clauses (+) to find a problem doc. docId=23, class org.apache.lucene.search.DisjunctionSumScorer
java.lang.IllegalStateException: Parent query must not match any docs beside parent filter. Combine them as must (+) and must-not (-) clauses to find a problem doc. docID=12
https://issues.apache.org/jira/browse/SOLR-7963
suggest.cfq=ctx1 OR ctx2
The implementation uses the Solr StandardQueryParser for parsing the cfq param.
This card also allows passing in local param queries such as
suggest.cfq={!terms f=contextx}ctx1,ctx2
https://www.garysieling.com/blog/list-solr-functions
https://wiki.apache.org/solr/CoreQueryParameters
echoParams
The echoParams parameter tells Solr what kinds of Request parameters should be included in the response for debugging purposes, legal values include:
- none - don't include any request parameters for debugging
- explicit - include the parameters explicitly specified by the client in the request
- all - include all parameters involved in this request, either specified explicitly by the client, or implicit because of the request handler configuration.
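For example, a request that echoes back every parameter involved (illustrative):
/select?q=*:*&echoParams=all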
TZ
The TZ parameter can be specified to override the default TimeZone (UTC) used for the purposes of adding and rounding in date math. The local rules for the specified TimeZone (including the start/end of DST if any) determine when each arbitrary day starts -- which affects not only rounding/adding of DAYs, but also cascades to rounding of HOUR, MIN, MONTH, YEAR as well.
For example "2013-03-10T12:34:56Z/YEAR" using the default TZ would be 2013-01-01T00:00:00Z but with TZ=America/Los_Angeles, the result is 2013-01-01T08:00:00Z. Likewise, 2013-03-10T08:00:00Z+1DAY evaluates to 2013-03-11T08:00:00Z by default, but with TZ=America/Los_Angeles the local DST rules result in 2013-03-11T07:00:00Z
qt
If a request uses the /select URL, and no SolrRequestHandler has been configured with /select as its name, then Solr uses the qt parameter to determine which Query Handler should be used to process the request. Valid values are any of the names specified by <requestHandler ... /> declarations in solrconfig.xml
"qt" doesn't really have a default, but the default request handler to dispatch to is "/select".
Jira
http://lucene.472066.n3.nabble.com/both-way-synonyms-with-ManagedSynonymFilterFactory-td4256592.html
Think the issue here is that when the SynonymFilter is created based on the managed map, option “expand” is always set to “false”, while the default for file-based synonym dictionary is “true”.
So with expand=false, what happens is that the input word (e.g. “mb”) is *replaced* with the synonym “megabytes”. Confusingly enough, when synonyms are applied both on index and query side, your document will contain “megabytes” instead of “mb”, but when you query for “mb”, the same happens on query side, so you will actually match :-)
I think what we need is to switch default to expand=true, and make it configurable also in the managed factory.
https://issues.apache.org/jira/browse/SOLR-8737
We can use the Solr Analysis admin UI to test whether synonyms work.
Calls the standard query parser and defines query input strings when the q parameter is not used.