http://stackoverflow.com/questions/13306272/removing-solr-duplicate-values-into-multivalued-field
https://issues.apache.org/jira/browse/SOLR-5403
https://wiki.apache.org/solr/UpdateXmlMessages
https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents#TransformingResultDocuments-[child]-ChildDocTransformerFactory
https://wiki.apache.org/solr/SolrRelevancyFAQ
https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
The qf (Query Fields) Parameter
https://wiki.apache.org/solr/QueryElevationComponent
The uniqueKey field must currently be of type string for the QueryElevationComponent to operate properly.
https://medium.com/@kaismh/automated-solution-for-query-elevation-using-solr-a15ce88b2762
http://lucene.472066.n3.nabble.com/Programmatically-upload-configuration-into-ZooKeeper-td4104924.html
Programmatically upload configuration into ZooKeeper
ZkStateReader zkStateReader = cloudSolrServer.getZkStateReader();
SolrZkClient zkClient = zkStateReader.getZkClient();
File jsonFile = new File(updateClusterstateJson);
if (!jsonFile.isFile()) {
  System.err.println(jsonFile.getAbsolutePath() + " not found.");
  return;
}
byte[] clusterstateJson = readFile(jsonFile);
// validate that what the user is passing in is valid JSON
InputStreamReader bytesReader = new InputStreamReader(new ByteArrayInputStream(clusterstateJson), "UTF-8");
JSONParser parser = new JSONParser(bytesReader);
parser.toString();
zkClient.setData("/clusterstate.json", clusterstateJson, true);
http://www.programcreek.com/java-api-examples/index.php?class=org.apache.solr.common.cloud.SolrZkClient&method=getData
numShards=3&collection.configName=configName&replicationFactor=2&router.field=routerField&maxShardsPerNode=2
https://www.linkedin.com/pulse/automated-solution-query-elevation-using-solr-kais-hassan-phd
https://issues.apache.org/jira/browse/SOLR-6092
Suggester
http://brandnewuser.iteye.com/blog/2297834
https://issues.apache.org/jira/browse/SOLR-9637
http://lucene.472066.n3.nabble.com/Duplicate-suggestions-td4212639.html
If you have duplicated field values across your docs, you will see duplicate suggestions.
Do you have any intermediate API in your application? In that case you can modify the API to collect the suggestions in a Collection that prevents duplicates, and return those.
If you want it directly from Solr, I assume it is a "bug": I think the suggestions should return no duplicates by default, because the only information returned is the field value and not the document id. Anyway, it could be a nice parameter for getting better suggestions (sending an avoidDuplicate parameter to the suggester).
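The intermediate-API workaround suggested in the thread above can be as simple as collecting the suggestions into an order-preserving set. A minimal sketch (SuggestionDedup and dedupe are made-up names, not Solr API):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

class SuggestionDedup {
    // Drop duplicate suggestion strings while preserving the order
    // in which the suggester returned them.
    static List<String> dedupe(List<String> suggestions) {
        return new ArrayList<>(new LinkedHashSet<>(suggestions));
    }
}
```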
http://stackoverflow.com/questions/32045700/solr-suggester-in-solrcloud-mode
https://wiki.apache.org/solr/SpellCheckComponent#Distributed_Search_Support
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/params/ShardParams.java
/** The requested URL for this shard */
public static final String SHARD_URL = "shard.url";
/** The Request Handler for shard requests */
public static final String SHARDS_QT = "shards.qt";
/** Request detailed match info for each shard (true/false) */
public static final String SHARDS_INFO = "shards.info";
/** Should things fail if there is an error? (true/false) */
public static final String SHARDS_TOLERANT = "shards.tolerant";
/** Force a single-pass distributed query? (true/false) */
public static final String DISTRIB_SINGLE_PASS = "distrib.singlePass";
https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
http://signaldump.org/solr/qpod/49836/cloudsolrclient-does-not-distribute-suggest-build-true
https://wiki.apache.org/solr/DistributedSearch
http://stackoverflow.com/questions/36079395/how-to-configure-multiple-contextfields-in-single-solr-suggester
https://issues.apache.org/jira/browse/SOLR-7963
suggest.q=c&suggest.cfq=memory
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/spelling/suggest/SolrSuggester.java
Analyzer contextFilterQueryAnalyzer = new TokenizerChain(new StandardTokenizerFactory(Collections.EMPTY_MAP), null);
https://github.com/apache/lucene-solr/blob/53981795fd73e85aae1892c3c72344af7c57083a/solr/core/src/test-files/solr/collection1/conf/solrconfig-suggestercomponent.xml
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
In order to use these features with SolrCloud, the documents must be located on the same shard. To ensure document co-location, you can define the router.name parameter as compositeId when creating the collection.
Example collapsing filter query:
fq={!collapse field=group_field}
https://cwiki.apache.org/confluence/display/solr/Result+Clustering
https://developer.s24.com/blog/a-utility-library-for-working-with-solrs-namedlist.html
- seems not really useful
- change default to: application/json
REST API
https://lucidworks.com/blog/2014/03/31/introducing-solrs-restmanager-and-managed-stop-words-and-synonyms/
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/rest/RestManager.java
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/rest/ManagedResource.java
https://cwiki.apache.org/confluence/display/solr/Managed+Resources
https://prismoskills.appspot.com/lessons/Solr/Chapter_20_-_Field_types_-_schema.xml.jsp
https://www.triquanta.nl/blog/going-dutch-stemming-apache-solr
https://cwiki.apache.org/confluence/display/solr/Language+Analysis
TODO https://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/
https://support.lucidworks.com/hc/en-us/articles/205359448-Injecting-multi-word-phrase-synonyms-at-query-time-with-Solr-SynonymFilterFactory
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
Synonym Filter
https://github.com/haberman/fast-recs-collate/blob/master/lookup3.c
http://burtleburtle.net/bob/hash/doobs.html
https://yonik.wordpress.com/tag/lookup3/
http://blog.reverberate.org/2012/01/state-of-hash-functions-2012.html
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/Hash.java
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/processor/SignatureUpdateProcessorFactory.java
https://github.com/apache/lucene-solr/blob/53981795fd73e85aae1892c3c72344af7c57083a/lucene/core/src/java/org/apache/lucene/util/StringHelper.java
https://lawlesst.github.io/notebook/solr-etags.html
Prefix search
http://stackoverflow.com/questions/7496405/how-to-configure-solr-so-users-can-make-prefix-search-by-default
There are several ways to do this, but performance-wise you might want to use EdgeNGramFilterFactory.
http://blog.florian-hopf.de/2014/03/prefix-and-suffix-matches-in-solr.html
https://github.com/sunspot/sunspot/wiki/Matching-substrings-in-fulltext-search
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-EdgeN-GramFilter
Edge N-Gram Filter
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-ShingleFilter
https://issues.apache.org/jira/browse/SOLR-1321
https://issues.apache.org/jira/browse/LUCENE-1398
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
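To illustrate what the Edge N-Gram filter does at index time for prefix search, here is a rough sketch of edge n-gram generation (EdgeNGrams and edgeNGrams are illustrative names, not the Lucene implementation):

```java
import java.util.ArrayList;
import java.util.List;

class EdgeNGrams {
    // Emit the leading substrings of a term from minGram to maxGram
    // characters; indexing these makes plain term queries behave
    // like prefix queries.
    static List<String> edgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= Math.min(maxGram, term.length()); n++) {
            grams.add(term.substring(0, n));
        }
        return grams;
    }
}
```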
Suffix search
http://stackoverflow.com/questions/19995804/search-suffix-word-solr-4-5-1
http://stackoverflow.com/questions/39517891/solr-find-documents-that-contain-a-field-value-that-a-query-string-starts-or-en
Another variation is
https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production
$ bin/solr start -Dsolr.autoSoftCommit.maxTime=10000
https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud
$ bin/solr restart -c -p 8983 -s example/cloud/node1/solr
http://stackoverflow.com/questions/34253178/solr-doesnt-overwrite-duplicated-uniquekey-entries
https://issues.apache.org/jira/browse/SOLR-6096
https://issues.apache.org/jira/browse/SOLR-5403
<updateRequestProcessorChain name="distinct-values" default="true">
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.UniqFieldsUpdateProcessorFactory">
<str name="fieldName">field1</str>
<str name="fieldName">field2</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
http://yonik.com/solr/atomic-updates/
Atomic Updates with SolrJ
set – set or replace a particular value, or remove the value if null is specified as the new value
add – adds an additional value to a list
remove – removes a value (or a list of values) from a list
removeregex – removes from a list that match the given Java regular expression
inc – increments a numeric value by a specific amount (use a negative value to decrement)
// create the document
SolrInputDocument sdoc = new SolrInputDocument();
sdoc.addField("id", "book1");

// add the map as the field value
Map<String,Object> fieldModifier = new HashMap<>(1);
fieldModifier.put("add", "Cyberpunk");
sdoc.addField("cat", fieldModifier);

// send it to the solr server
client.add(sdoc);

// shutdown client before we exit
client.close();
http://lucene.472066.n3.nabble.com/SolrJ-atomic-updates-td4020438.html
Map<String,Object> editTags = new HashMap<>();
editTags.put("set", new String[]{"tag1","tag2","tag3"});
doc = new SolrInputDocument();
doc.addField("id", "unique");
doc.addField("tags_ss", editTags);
server.add(doc);
server.commit(true, true);
resp = server.query(q);
System.out.println(resp.getResults().get(0).getFirstValue("tags_ss"));
prints "tag1"
An ArrayList<String> as a value works the same way as a String[].
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
Cursors in Solr are a logical concept that doesn't involve caching any state information on the server. Instead, the sort values of the last document returned to the client are used to compute a "mark" representing a logical point in the ordered space of sort values.
cursorMark and start are mutually exclusive parameters:
- Your requests must either not include a start parameter, or it must be specified with a value of "0".
sort clauses must include the uniqueKey field (either "asc" or "desc"):
- If id is your uniqueKey field, then sort params like id asc and name asc, id desc would both work fine, but name asc by itself would not.
Sorts including Date Math based functions that involve calculations relative to NOW will cause confusing results, since every document will get a new sort value on every subsequent request. This can easily result in cursors that never end, and constantly return the same documents over and over – even if the documents are never updated. In this situation, choose & re-use a fixed value for the NOW request param in all of your cursor requests.
Cursor mark values are computed based on the sort values of each document in the result, which means multiple documents with identical sort values will produce identical Cursor mark values if one of them is the last document on a page of results. In that situation, the subsequent request using that cursorMark would not know which of the documents with the identical mark values should be skipped. Requiring that the uniqueKey field be used as a clause in the sort criteria guarantees that a deterministic ordering will be returned, and that every cursorMark value will identify a unique point in the sequence of documents.
SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
  q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
  QueryResponse rsp = solrServer.query(q);
  String nextCursorMark = rsp.getNextCursorMark();
  doCustomProcessingOfResults(rsp);
  if (cursorMark.equals(nextCursorMark)) {
    done = true;
  }
  cursorMark = nextCursorMark;
}
Unlike basic pagination, Cursor pagination does not rely on using an absolute "offset" into the completed sorted list of matching documents. Instead, the cursorMark specified in a request encapsulates information about the relative position of the last document returned, based on the absolute sort values of that document. This means that the impact of index modifications is much smaller when using a cursor compared to basic pagination.
- The client requests 5 more documents using the nextCursorMark from the previous response
  - Documents 6-10 will be returned -- the deletion of a document that's already been returned doesn't affect the relative position of the cursor
- 3 new documents are now added with the ids 90, 91, and 92; all three documents have a name of A
- The client requests 5 more documents using the nextCursorMark from the previous response
  - Documents 11-15 will be returned -- the addition of new documents with sort values already past does not affect the relative position of the cursor
- Document id 1 is updated to change its 'name' to Q
- Document id 17 is updated to change its 'name' to A
- The client requests 5 more documents using the nextCursorMark from the previous response
  - The resulting documents are 16,1,18,19,20 in that order
  - Because the sort value of document 1 changed so that it is after the cursor position, the document is returned to the client twice
  - Because the sort value of document 17 changed so that it is before the cursor position, the document has been "skipped" and will not be returned to the client as the cursor continues to progress
When fetching all results matching a query using cursorMark, the only way index modifications can result in a document being skipped, or returned twice, is if the sort value of the document changes.
One way to ensure that a document will never be returned more than once is to use the uniqueKey field as the primary (and therefore: only significant) sort criterion.
In this situation, you will be guaranteed that each document is only returned once, no matter how it may be modified during the use of the cursor.
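A toy in-memory sketch of why a mark is robust to deletions, assuming the uniqueKey (id) is the only sort criterion. CursorSim and page are made-up names, not Solr code:

```java
import java.util.List;
import java.util.SortedSet;
import java.util.stream.Collectors;

class CursorSim {
    // A "mark" is the sort value (here: the id) of the last document
    // returned. Each page returns the docs whose sort value comes
    // strictly after the mark, in sort order -- no absolute offset.
    static List<Integer> page(SortedSet<Integer> index, Integer mark, int rows) {
        return index.stream()
                    .filter(id -> mark == null || id > mark)
                    .limit(rows)
                    .collect(Collectors.toList());
    }
}
```

Deleting an already-returned document would shift an offset-based start=5 page, but the mark-based page is unaffected.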
"Tailing" a Cursor
The most common example of how this can be useful is when you have a "timestamp" field recording when a document has been added/updated in your index. Client applications can continuously poll a cursor using sort=timestamp asc, id asc for documents matching a query, and always be notified when a document is added or updated matching the request criteria. Another common example is when you have uniqueKey values that always increase as new documents are created: you can continuously poll a cursor using sort=id asc to be notified about new documents.
http://stackoverflow.com/questions/4628571/solr-date-field-tdate-vs-date
Trie fields make range queries faster by precomputing certain range results and storing them as a single record in the index. For clarity, my example will use integers in base ten. The same concept applies to all trie types. This includes dates, since a date can be represented as the number of seconds since, say, 1970.
Let's say we index the number 12345678. We can tokenize this into the following tokens:
12345678
123456xx
1234xxxx
12xxxxxx
The 12345678 token represents the actual integer value. The tokens with the x digits represent ranges. 123456xx represents the range 12345600 to 12345699, and matches all the documents that contain a token in that range.
Notice how each token in the list has successively more x digits. This is controlled by the precision step. In my example, you could say that I was using a precision step of 2, since I trim 2 digits to create each extra token. If I were to use a precision step of 3, I would get these tokens:
12345678
12345xxx
12xxxxxx
A precision step of 4:
12345678
1234xxxx
A precision step of 1:
12345678
1234567x
123456xx
12345xxx
1234xxxx
123xxxxx
12xxxxxx
1xxxxxxx
It's easy to see how a smaller precision step results in more tokens and increases the size of the index. However, it also speeds up range queries.
Without the trie field, if I wanted to query a range from 1250 to 1275, Lucene would have to fetch 25 entries (1250, 1251, 1252, ..., 1275) and combine search results. With a trie field (and precision step of 1), we could get away with fetching 8 entries (125x, 126x, 1270, 1271, 1272, 1273, 1274, 1275), because 125x is a precomputed aggregation of 1250-1259. If I were to use a precision step larger than 1, the query would go back to fetching all 25 individual entries.
Note: In reality, the precision step refers to the number of bits trimmed for each token. If you were to write your numbers in hexadecimal, a precision step of 4 would trim one hex digit for each token. A precision step of 8 would trim two hex digits.
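The token generation described above can be sketched as follows, using decimal digits for clarity as in the example (real Trie fields trim bits per level, and TrieTokens/tokens are illustrative names, not Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

class TrieTokens {
    // Generate trie-style range tokens for a decimal number:
    // trim `precisionStep` digits per level and pad with 'x'.
    static List<String> tokens(String value, int precisionStep) {
        List<String> result = new ArrayList<>();
        result.add(value); // the exact value itself
        int kept = value.length() - precisionStep;
        while (kept > 0) {
            StringBuilder sb = new StringBuilder(value.substring(0, kept));
            for (int i = kept; i < value.length(); i++) {
                sb.append('x'); // each 'x' widens the covered range
            }
            result.add(sb.toString());
            kept -= precisionStep;
        }
        return result;
    }
}
```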
A common mistake is to try to filter parents with a which filter, as in this bad example:
q={!parent which="title:join"}comments:SolrCloud
Instead, you should use a sibling mandatory clause as a filter:
q=+title:join +{!parent which="content_type:parentDocument"}comments:SolrCloud
This transformer returns all descendant documents of each parent document matching your query in a flat list nested inside the matching parent document. This is useful when you have indexed nested child documents and want to retrieve the child documents for the relevant parent documents for any type of search query.
Note that this transformer can be used even though the query itself is not a Block Join query.
When using this transformer, the parentFilter parameter must be specified, and works the same as in all Block Join Queries. Additional optional parameters are:
- childFilter - query to filter which child documents should be included; this can be particularly useful when you have multiple levels of hierarchical documents (default: all children)
- limit - the maximum number of child documents to be returned per parent document (default: 10)
https://wiki.apache.org/solr/SchemaDesign
Solr provides one table. Storing a set of database tables in an index generally requires denormalizing some of the tables. Attempts to avoid denormalizing usually fail.
<!-- points to the root document of a block of nested documents. Required for nested document support, may be removed otherwise -->
<field name="_root_" type="string" indexed="true" stored="false"/>
is needed for block-join support.
You can use this when you have relationships between entities and you don't want to flatten your docs, for example, one Class doc, contains many Student docs, and you want to be able to query in a more similar way as you would do it in a DB.
http://blog-archive.griddynamics.com/2013/09/solr-block-join-support.html
http://stackoverflow.com/questions/39785462/nested-documents-with-spring-data-solr
The default MappingSolrConverter does not yet support nested documents. However you can switch to SolrJConverter, which uses the native mapping.

@Bean
public SolrTemplate solrTemplate(SolrClient client) {
    SolrTemplate template = new SolrTemplate(client);
    template.setSolrConverter(new SolrJConverter());
    return template;
}
http://stackoverflow.com/questions/37241489/solrj-6-0-0-insertion-of-a-bean-object-which-associate-list-of-bean-object-is-g/37243756#37243756
So as of now I am applying the @Field annotation at the field level rather than at the setter:
@Field (child = true)
private Collection<Technology2> technologies2;
Nested Objects in Solr
http://stackoverflow.com/questions/40490606/solr-querying-on-nested-child-documents
I think the best way to achieve your goal is through Block Join Query Parser capabilities.
What you need to change a little bit is to introduce some marker for parent documents. In Solr glossary it will be needed for the "parent filter". So assuming each parent document will have content_type:parentDocument (just for the sake of example), you'll be able to find all your parent documents with a BJQ (block-join query) like:
{!parent which="content_type:parentDocument"}(+place:blr +street:bakery)
Please bear in mind you need to index your parent-children documents together (in the same block) as described on the Solr wiki.
https://blog.griddynamics.com/how-to-use-block-join-to-improve-search-efficiency-with-nested-documents-in-solr
SolrInputDocument has methods — getChildDocuments() and addChildDocument() — for nesting child documents into a parent document.
http://yonik.com/solr-nested-objects/
Now if we search for color:RED AND size:M, it would incorrectly match our document!
Lucene has a flat object model and does not really support “nesting” of documents in the index.
Lucene *does* support adding a list of documents atomically and contiguously (i.e. a virtual “block”), and this is the feature used by Solr to implement “nested objects”.
When you add a parent document with 3 children, these appear in the index contiguously as
child1, child2, child3, parent
There is no Lucene-level information that links parent and child, or distinguishes this parent/child block from the other documents in the index that come before or after. Successfully using parent/child relationships relies on more information being provided at query time.
All children of a parent document must be indexed together with the parent document. One cannot update any document (parent or child) individually. The entire block needs to be re-indexed if any changes need to be made.
There are no schema requirements except that the _root_ field must exist (but that is there by default in all our schemas).
“Block Join” refers to the set of related query technologies to efficiently map from parents to children or vice versa at query time. The locality of children and parents can be used to both speed up query operations and lower memory requirements compared to other join methods.
We can see that these are really just indexed as 3 documents, all visible by default:
$ curl http://localhost:8983/solr/demo/query -d '
q=cat_s:(fantasy OR sci-fi)&
fl=id,[child parentFilter=type_s:book]'
Child Doc Transformer Parameters:
- parentFilter – identifies all of the parents. See the section on The Parent Filter for more info.
- childFilter – optional query to filter which child documents should be included.
- limit – maximum number of child documents to return per parent (defaults to 10)
The main query gives us a document list of reviews by author_s:yonik.
If we want to facet on the book genre (cat_s field) then we need to switch the domain from the children (type_s:reviews) to the parents (type_s:books).
Faceting on Parent
$ curl http://localhost:8983/solr/demo/query -d 'q=author_s:yonik&fl=id,comment_t&
json.facet={ genres : { type: terms, field: cat_s, domain: { blockParent : "type_s:book" } } }'
Faceting on Children
$ curl http://localhost:8983/solr/demo/query -d 'q=cat_s:(sci-fi OR fantasy)&fl=id,title_t&
json.facet={ top_reviewers : { type: terms, field: author_s, domain: { blockChildren : "type_s:book" } } }'
http://stackoverflow.com/questions/12101382/is-there-any-meaningful-performance-difference-between-integer-and-string-values
https://wiki.apache.org/solr/SolrFacetingOverview
- They are often not mapped into lower case
- Human-readable punctuation is often not removed (other than double-quotes)
- There is often no need to store them, since stored values would look much like indexed values and the faceting mechanism is used for value retrieval.
As an example, if I had an "author" field with a list of authors, such as:
- Schildt, Herbert; Wolpert, Lewis; Davies, P.
I might want to index the same data differently in three different fields (perhaps using the Solr copyField directive):
- For searching: Tokenized, case-folded, punctuation-stripped:
- schildt / herbert / wolpert / lewis / davies / p
- For sorting: Untokenized, case-folded, punctuation-stripped:
- schildt herbert wolpert lewis davies p
- For faceting: Primary author only, using a solr.StringField:
- Schildt, Herbert
http://stackoverflow.com/questions/14323269/should-i-prefer-integers-or-strings-in-my-solr-schema-if-a-field-will-fit-either
Will you ever query on a range? So if your 1...4 is really marking statuses of say Bad to Great, would you ever query on records from 1-2? This is the only thing of where you may need them to be ints (and, since you only have 4, it's not that big of a deal).
My rule in data storage is that if the int will never be used as an int, then store it as a string. It may require more space, etc. but you can do more string manipulations, etc. And the memory requirements of 11m records may not matter if that one field is a string or int (11m is a lot of records, but not a heavy load for Solr/Lucene).
Unless you need to perform range queries (numeric fields have special support for this) or sorting (the int field cache is more memory-efficient than the String field cache), they should be roughly equivalent.
I think I have found an issue with using the long integer for uniqueKey — document routing using the ! notation will not work with a long integer uniqueKey :(
http://lucene.472066.n3.nabble.com/numberic-or-string-type-for-non-sortable-field-td2606353.html
the reason i suggested that using ints might (marginally) be better is because of the FieldCache and the fieldValueCache -- the int representation uses less memory than if it was holding strings representing the same ints.
worrying about that is really a premature optimization though -- model your data in the way that makes the most sense -- if your ids are inherently ints, model them as ints until you come up with a reason to model them otherwise and move on to the next problem.
https://lucene.apache.org/solr/4_2_0/solr-core/org/apache/solr/schema/UUIDField.html
This FieldType accepts UUID string values, as well as the special value of "NEW" which triggers generation of a new random UUID.
https://wiki.apache.org/solr/UniqueKey
NOTE: Configuring a UUIDField instance with a default value of "NEW" is not advisable for most users when using SolrCloud (and not possible if the UUID value is configured as the unique key field) since the result will be that each replica of each document will get a unique UUID value. Using UUIDUpdateProcessorFactory to generate UUID values when documents are added is recommended instead.
- UUID is short for Universal Unique IDentifier. The UUID standard RFC-4122 includes several types of UUID with different input formats. There is a UUID field type (called UUIDField) in Solr 1.4 which implements version 4. Fields are defined in the schema.xml file with:
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
- In Solr 4, this field must be populated via solr.UUIDUpdateProcessorFactory:
<field name="id" type="uuid" indexed="true" stored="true" required="true"/>
<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
- Due to low level changes to support SolrCloud, the uniqueKey field can no longer be populated via <copyField/> or <field default=...> in the schema.xml. Users wishing to have Solr automatically generate a uniqueKey value when adding documents should instead use an instance of solr.UUIDUpdateProcessorFactory in their update processor chain.
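Another option is to assign the id in the indexing client: java.util.UUID produces the same kind of random (version 4) UUID that UUIDField's "NEW" value generates, without the SolrCloud replica-divergence problem. A minimal sketch (ClientSideUuid is a made-up helper name):

```java
import java.util.UUID;

class ClientSideUuid {
    // Generate a random (version 4) UUID string, suitable as a Solr
    // uniqueKey value assigned by the client before the document is sent.
    static String newId() {
        return UUID.randomUUID().toString();
    }
}
```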
By taking advantage of variable de-referencing in Local Params, we can specify an "appends" fq filter that delegates to a custom parameter name of our choosing. We can then specify that custom param name in our "defaults", and still allow clients to override it as needed.
http://stackoverflow.com/questions/22017616/stronger-boosting-by-date-in-solr
http://www.slideshare.net/lucenerevolution/potter-timothy-boosting-documents-in-solr
http://stackoverflow.com/questions/29470458/solr-external-file-field-performance-issue
As described in the SO question "Relevancy boosting very slow in Solr", the key=value pairs the external file consists of should be sorted by that key. This is also stated in the javadoc of ExternalFileField.
https://lucidworks.com/blog/2011/12/14/options-to-tune-documents-relevance-in-solr/
The external file may be sorted or unsorted by the key field, but it will be substantially slower (untested) if it isn't sorted.
Boost Queries
Sometimes it is necessary to boost some documents regardless of the user query. A typical example of boost queries is boosting sponsored documents. The user searches for "car rental", but the application has some sponsored document that should be boosted. A good way of doing this is by using boost queries. A boost query is a query that will be executed in the background after a user query, and that will boost the documents that match it.
For this example, the boost query (specified by the “bq” parameter) would be something like:
bq=sponsored:true
The boost query won't determine which documents are considered a hit and which are not; it will just influence the score of the result.
Boost Functions
Boost Functions are very similar to boost queries; in fact, they can achieve the same goals. The difference between boost functions and boost queries is that the boost function is an arbitrary function instead of a query (see http://lucidworks.lucidimagination.com/display/solr/Function+Queries). A typical example of boost functions is boosting those documents that are more recent than others. Imagine a forum search application, where the user is searching for forum entries with the text "foo bar". The application should display all the forum entries that talk about "foo bar", but usually the most recent entries are more important (most users will want to see updated entries, not historical ones). The boost function will be executed in the background after each user query, and will boost some documents in some way.
For this example, a boost function (specified by the “bf” parameter) could be something like:
bf=recip(ms(NOW,publicationDate),3.16e-11,1,1)
The boost Parameter
The “boost” parameter is very similar to the “bf” parameter, but instead of adding its result to the final score, it will multiply it. This is only available in the “Extended Dismax Query Parser” or the “Lucid Query Parser”.
- Prefer multiplicative boosting to additive boosting.
- Be careful not to confuse queries with functions.
https://wiki.apache.org/solr/FunctionQuery
ord
ord(myfield) returns the ordinal of the indexed field value within the indexed list of terms for that field in lucene index order (lexicographically ordered by unicode value), starting at 1. In other words, for a given field, all values are ordered lexicographically; this function then returns the offset of a particular value in that ordering. The field must have a maximum of one value per document (not multiValued). 0 is returned for documents without a value in the field.
- Example: If there were only three values for a particular field: "apple","banana","pear", then ord("apple")=1, ord("banana")=2, ord("pear")=3
- Example Syntax: ord(myIndexedField)
- Example SolrQuerySyntax: _val_:"ord(myIndexedField)"
WARNING: as of Solr 1.4, ord() and rord() can cause excess memory use since they must use a FieldCache entry at the top level reader, while sorting and function queries now use entries at the segment level. Hence sorting or using a different function query, in addition to ord()/rord() will double memory use.
WARNING: ord() depends on the position in an index and can thus change when other documents are inserted or deleted, or if a MultiSearcher is used.
rord
The reverse ordering of what ord provides.
- Example Syntax: rord(myIndexedField)
- Example: rord(myDateField) is a metric for how old a document is: the youngest document will return 1, the oldest document will return the total number of documents.
WARNING: as of Solr 1.4, ord() and rord() can cause excess memory use since they must use a FieldCache entry at the top level reader, while sorting and function queries now use entries at the segment level. Hence sorting or using a different function query, in addition to ord()/rord() will double memory use.
ms
Returns the milliseconds of difference between its arguments.
Dates are relative to the Unix or POSIX time epoch, midnight, January 1, 1970 UTC.
Arguments may be numerically indexed date fields such as TrieDate (recommended field type for dates since Solr 1.4), or date math (examples in SolrQuerySyntax) based on a constant date or NOW.
ms()
- Equivalent to ms(NOW), number of milliseconds since the epoch.
ms(a)
- Returns the number of milliseconds since the epoch that the argument represents.
- Example: ms(NOW/DAY)
- Example: ms(2000-01-01T00:00:00Z)
- Example: ms(mydatefield)
Note that this number can be negative for dates from before the epoch.
ms(a,b)
- Returns the number of milliseconds that b occurs before a (i.e. a - b). Note that this offers higher precision than sub(a,b) because the arguments are not converted to floating point numbers before subtraction.
- Example: ms(NOW,mydatefield)
- Example: ms(mydatefield,2000-01-01T00:00:00Z)
- Example: ms(datefield1,datefield2)
ms(foofield) currently (Chris Harris, 4/16/2010) returns the value 0 for docs with nonexistent foofield. Should this behavior be relied on?
Date Boosting
Boosting more recent content is a common use case. One way is to use a recip function in conjunction with ms.
There are approximately 3.16e10 milliseconds in a year, so one can scale dates to fractions of a year with the inverse, or 3.16e-11. Thus the function recip(ms(NOW,mydatefield),3.16e-11,1,1) will yield values near 1 for very recent documents, 1/2 for documents a year old, 1/3 for documents two years old, etc. Be careful to not use this function for dates more than one year in the future or the values will be negative.
Consider using reduced precision to prevent excessive memory consumption. You would instead use recip(ms(NOW/HOUR,mydatefield),3.16e-11,1,1). See this thread for more information.
The most effective way to use such a boost is to multiply it with the relevancy score, rather than add it in. One way to do this is with the boost query parser.
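To see the shape of this boost curve, recall that recip(x,m,a,b) simply computes b/(m*x+a). A minimal, Solr-free sketch of the arithmetic (constants taken from the text above):

```java
public class RecipBoost {
    // approx. milliseconds in a year, as used in the text
    static final double MS_PER_YEAR = 3.16e10;

    // recip(x, m, a, b) = b / (m*x + a), the function Solr evaluates
    static double recip(double x, double m, double a, double b) {
        return b / (m * x + a);
    }

    public static void main(String[] args) {
        // document age in milliseconds -> boost value
        for (int years = 0; years <= 3; years++) {
            double ageMs = years * MS_PER_YEAR;
            System.out.printf("%d year(s) old -> %.3f%n",
                    years, recip(ageMs, 3.16e-11, 1, 1));
        }
        // values come out near 1, 1/2, 1/3, 1/4 as the text describes
    }
}
```

This also makes the warning above concrete: a date one year in the future gives x = -3.16e10, so m*x+a is near zero and the result blows up or goes negative.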
How can I make exact-case matches score higher
Example: a query of "Penguin" should score documents containing "Penguin" higher than docs containing "penguin".
The general strategy is to index the content twice, using different fields with different fieldTypes (and different analyzers associated with those fieldTypes). One analyzer will contain a lowercase filter for case-insensitive matches, and one will preserve case for exact-case matches.
Use copyField commands in the schema to index a single input field multiple times.
Once the content is indexed into multiple fields that are analyzed differently, query across both fields.
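A minimal schema sketch of this double-indexing strategy (the field and type names are my own, not from the source):

```
<!-- case-preserving type: tokenize only -->
<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
<!-- case-insensitive type: tokenize + lowercase -->
<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="content" type="text_general" indexed="true" stored="true"/>
<field name="content_exact" type="text_exact" indexed="true" stored="false"/>
<copyField source="content" dest="content_exact"/>
```

Then query across both fields, weighting the exact-case field higher, e.g. qf=content content_exact^5 with dismax.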
How can I make queries of "spiderman" and "spider man" match "Spider-Man"
WordDelimiterFilter can be used in the analyzer for the field being queried to match words with intra-word delimiters such as dashes or case changes.
How can I search for one term near another term (say, "batman" and "movie")
The dismax handler can easily create sloppy phrase queries with the pf (phrase fields) and ps (phrase slop) parameters:
q=batman movie&pf=text&ps=100
The dismax handler also allows users to explicitly specify a phrase query with double quotes, and the qs(query slop) parameter can be used to add slop to any explicit phrase queries:
q="batman movie"&qs=100
How can I change the score of a document based on the *value* of a field (say, "popularity")
defType=dismax&qf=text&q=supervillians&bf=sqrt(popularity)
the explainOther parameter can be used to specify other documents you want detailed scoring info for.
q=supervillians&debugQuery=on&explainOther=id:juggernaut
There are a number of ways to boost using functionqueries. One way is: recip + linear, where recip computes an age-based score, and linear is used to boost it.
The dismax query parser provides an easy way to apply boost functions. For example:
Another way is recip + ms:
or
The qf (Query Fields) Parameter
The qf parameter introduces a list of fields, each of which is assigned a boost factor to increase or decrease that particular field's importance in the query. For example, the query below:
qf="fieldOne^2.3 fieldTwo fieldThree^0.4"
assigns fieldOne a boost of 2.3, leaves fieldTwo with the default boost (because no boost factor is specified), and gives fieldThree a boost of 0.4. These boost factors make matches in fieldOne much more significant than matches in fieldTwo, which in turn are much more significant than matches in fieldThree.
The bq (Boost Query) Parameter
The bq parameter specifies an additional, optional query clause that will be added to the user's main query to influence the score. For example, if you wanted to add a relevancy boost for recent documents:
You can specify multiple bq parameters. If you want your query to be parsed as separate clauses with separate boosts, use multiple bq parameters.
The bf (Boost Functions) Parameter
The bf parameter specifies functions (with optional boosts) that will be used to construct FunctionQueries which will be added to the user's main query as optional clauses that will influence the score. Any function supported natively by Solr can be used, along with a boost value. For example:
Specifying functions with the bf parameter is essentially just shorthand for using the bq param combined with the {!func} parser.
For example, if you want to show the most recent documents first, you could use either of the following:
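The two equivalent forms, using the recency function from earlier in these notes, are roughly:

```
bf=recip(ms(NOW,mydatefield),3.16e-11,1,1)
bq={!func}recip(ms(NOW,mydatefield),3.16e-11,1,1)
```

(bf is additive by default; for a multiplicative version use the boost parameter or the {!boost} parser instead.)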
The uniqueKey field must currently be of type string for the QueryElevationComponent to operate properly.
https://medium.com/@kaismh/automated-solution-for-query-elevation-using-solr-a15ce88b2762
- Configurable FieldType for query: you can create any custom pipeline for analyzing the query text, so you can lowercase the text, use a stemmer and etc.
- Force elevation even on sorted results: you can either choose to respect or ignore the sort parameter.
- Enable/Disable elevation parameter: you can choose to disable elevation via a query parameter, very useful for testing.
http://lucene.472066.n3.nabble.com/Programmatically-upload-configuration-into-ZooKeeper-td4104924.html
Programmatically upload configuration into ZooKeeper
// Upload a local JSON file into ZooKeeper via SolrJ.
// readFile(...) is the poster's own helper; java.nio.file.Files.readAllBytes
// would work equally well.
ZkStateReader zkStateReader = cloudSolrServer.getZkStateReader();
SolrZkClient zkClient = zkStateReader.getZkClient();

File jsonFile = new File(updateClusterstateJson);
if (!jsonFile.isFile()) {
    System.err.println(jsonFile.getAbsolutePath() + " not found.");
    return;
}
byte[] clusterstateJson = readFile(jsonFile);

// validate that what the user is passing in is valid JSON
InputStreamReader bytesReader = new InputStreamReader(
        new ByteArrayInputStream(clusterstateJson), "UTF-8");
JSONParser parser = new JSONParser(bytesReader);
parser.toString();

zkClient.setData("/clusterstate.json", clusterstateJson, true);
http://www.programcreek.com/java-api-examples/index.php?class=org.apache.solr.common.cloud.SolrZkClient&method=getData
String path = ZkStateReader.COLLECTIONS_ZKNODE + "/" + collection;
byte[] data = zkClient.getData(path, null, null, true);
numShards=3&collection.configName=configName&replicationFactor=2&router.field=routerField&maxShardsPerNode=2
https://www.linkedin.com/pulse/automated-solution-query-elevation-using-solr-kais-hassan-phd
https://issues.apache.org/jira/browse/SOLR-6092
Suggester
FuzzyLookupFactory
This is a suggester which is an extension of the AnalyzingSuggester but is fuzzy in nature. The similarity is measured by the Levenshtein algorithm.
- AnalyzingLookupFactory (default, finds matches based on prefix)
- FuzzyLookupFactory (finds matches with misspellings),
- AnalyzingInfixLookupFactory (finds matches anywhere in the text),
- BlendedInfixLookupFactory (combines matches based on prefix and infix lookup)
You need to choose the one which fulfills your requirements. The second important parameter is dictionaryImpl, which represents how indexed suggestions are stored. And again, you can choose between a couple of implementations, e.g. DocumentDictionaryFactory (stores terms, weights, and optional payload) or HighFrequencyDictionaryFactory (works when very common terms overwhelm others; you can set up a proper threshold).
Context filtering lets you filter suggestions by a separate context field, such as category, department or any other token. The AnalyzingInfixLookupFactory and BlendedInfixLookupFactory currently support this feature, when backed by DocumentDictionaryFactory.
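A solrconfig.xml sketch combining these choices (the field names and suggester name are assumptions, not from the source):

```
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <!-- infix lookup supports contextField filtering -->
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="weightField">popularity</str>
    <str name="contextField">category</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>
```

With this in place, suggest.cfq filters on values of the category field.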
■ It should return ranked suggestions ordered by term frequency, as there is little benefit to suggesting rare terms that occur in only a few documents in your index, especially when the user has typed only a few characters.
https://cwiki.apache.org/confluence/display/solr/Suggester
https://lucene.apache.org/core/6_1_0/suggest/org/apache/lucene/search/suggest/FileDictionary.html
https://issues.apache.org/jira/browse/LUCENE-6336
blenderType: used to calculate weight coefficient using the position of the first matching word. Can be one of:
- position_linear: weightFieldValue*(1 - 0.10*position): Matches to the start will be given a higher score (Default)
- position_reciprocal: weightFieldValue/(1+position): Matches to the end will be given a higher score.
- exponent: an optional configuration variable for the position_reciprocal blenderType used to control how fast the score will increase or decrease. Default 2.0.
https://lucene.apache.org/core/6_1_0/suggest/org/apache/lucene/search/suggest/FileDictionary.html
https://issues.apache.org/jira/browse/SOLR-9637
http://lucene.472066.n3.nabble.com/Duplicate-suggestions-td4212639.html
If you have duplicated field values across your docs, you will see duplicate suggestions.
Do you have any intermediate API in your application? In that case you can modify the API to use a Collection that prevents duplicates when collecting and returning the suggestions.
If you want it directly from Solr, I assume it is a "bug". I think the suggestions should by default contain no duplicates, because the only information returned is the field value and not the document id. In any case it could be a nice parameter for getting better suggestions (e.g. sending an avoidDuplicates parameter to the suggester).
http://stackoverflow.com/questions/32045700/solr-suggester-in-solrcloud-mode
https://wiki.apache.org/solr/SpellCheckComponent#Distributed_Search_Support
It was not working because Solr was running in SolrCloud mode. There are two ways to perform suggestions in SolrCloud mode:
- Use the distrib=false parameter. This will fetch the data from only one shard which you are accessing in the command. You can add the following into Component definition itself.
<bool name="distrib">false</bool>
- Use the shards and shards.qt parameters for searching all the shards. The shards parameter contains a comma-separated list of all the shards which you want to include in the query. The shards.qt parameter defines the request handler you want to access.
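For the first option, distrib=false can be baked into the request handler defaults so clients don't have to pass it; a sketch (handler and suggester names are assumptions):

```
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">mySuggester</str>
    <!-- serve suggestions from the local shard only -->
    <bool name="distrib">false</bool>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
```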
http://stackoverflow.com/questions/40905965/how-to-optimize-documentdictionary-build-on-solr-cloud-suggester
shards.qt: Signals Solr that requests to shards should be sent to a request handler given by this parameter. Use shards.qt=/spell when making the request if your request handler is "/spell".
shards: shards=solr-shard1:8983/solr,solr-shard2:8983/solr Distributed Search
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/params/ShardParams.java
/** The requested URL for this shard */
public static final String SHARD_URL = "shard.url";
/** The Request Handler for shard requests */
public static final String SHARDS_QT = "shards.qt";
/** Request detailed match info for each shard (true/false) */
public static final String SHARDS_INFO = "shards.info";
/** Should things fail if there is an error? (true/false) */
public static final String SHARDS_TOLERANT = "shards.tolerant";
/** Force a single-pass distributed query? (true/false) */
public static final String DISTRIB_SINGLE_PASS = "distrib.singlePass";
https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
If, on the other hand, you wanted to search just one shard, you can specify that shard by its logical ID, as in:
If you want to search a group of shard Ids, you can specify them together:
In both of the above examples, the shard Id(s) will be used to pick a random replica of that shard.
Alternatively, you can specify the explicit replicas you wish to use in place of a shard Id:
Or you can specify a list of replicas to choose from for a single shard (for load balancing purposes) by using the pipe symbol (|):
When using the new Suggester component (with AnalyzingInfixSuggester) in Solr trunk with solrj, the suggest.build command seems to be executed only on one of the solr cloud nodes.
I had to add shards.qt=/suggest and shards=host1:port2/solr/mycollection,host2:port2/solr/mycollection... to distribute the build command on all nodes.
Given that we are using SolrCloud, I would have expected the build command to behave like a cloud update and be sent to all nodes without the need of specifying shards and shards.qt
https://wiki.apache.org/solr/DistributedSearch
The presence of the shards parameter in a request will cause that request to be distributed across all shards in the list. The syntax of shards is host:port/base_url[,host:port/base_url]* A sharded request will go to the standard request handler (not necessarily the original); this can be overridden via shards.qt. Since SOLR-3134 it is possible to obtain numFound, maxScore and time per shard in a distributed search query. Use shards.info=true to enable this feature. The shards.tolerant=true parameter includes error information if available. (SolrCloud can handle this for you in a more transparent way).
Distributed Deadlock
Each shard may also serve top-level query requests and then make sub-requests to all of the other shards. In this configuration, care should be taken to ensure that the max number of threads serving HTTP requests in the servlet container is greater than the possible number of requests from both top-level clients and other shards (the solr example server is already configured correctly). If this is not the case, a distributed deadlock is possible.
Consider the simplest case of two shards, each with just a single thread to service HTTP requests. Both threads could receive a top-level request concurrently, and make sub-requests to each other. Because there are no more remaining threads to service requests, the servlet containers will block the incoming requests until the other pending requests are finished (but they won't finish since they are waiting for the sub-requests).
https://lucidworks.com/blog/2017/01/09/context-filtering-with-solr-suggesters/
http://stackoverflow.com/questions/36079395/how-to-configure-multiple-contextfields-in-single-solr-suggester
You have to create a new field in your schema.xml as the context field. This field should have multiValued="true":
<field name="context_field" type="text_suggest" multiValued="true" indexed="true" stored="true"/>
Then you have to index this context_field as a JSON list:
"context_field" : ["some document type", "some department type"]
After indexing, you can suggest like this:
suggest.q=b&suggest.cfq=context_documentType AND context_departmentType
https://issues.apache.org/jira/browse/SOLR-7963
suggest.q=c&suggest.cfq=memory
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/spelling/suggest/SolrSuggester.java
Analyzer contextFilterQueryAnalyzer = new TokenizerChain(new StandardTokenizerFactory(Collections.EMPTY_MAP), null);
https://github.com/apache/lucene-solr/blob/53981795fd73e85aae1892c3c72344af7c57083a/solr/core/src/test-files/solr/collection1/conf/solrconfig-suggestercomponent.xml
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
In order to use these features with SolrCloud, the documents must be located on the same shard. To ensure document co-location, you can define the router.name parameter as compositeId when creating the collection.
The CollapsingQParser is really a post filter that provides more performant field collapsing than Solr's standard approach when the number of distinct groups in the result set is high:
fq={!collapse field=group_field}
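A typical collapse-then-expand request looks roughly like the following (the field name is an assumption):

```
q=memory&fq={!collapse field=product_group}&expand=true&expand.rows=3
```

The collapse filter keeps one representative document per group; expand=true adds an expanded section to the response with up to expand.rows collapsed siblings per group.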
https://cwiki.apache.org/confluence/display/solr/Result+Clustering
The clustering (or cluster analysis) plugin attempts to automatically discover groups of related search hits (documents) and assign human-readable labels to these groups. By default in Solr, the clustering algorithm is applied to the search result of each single query; this is called on-line clustering.
https://developer.s24.com/blog/a-utility-library-for-working-with-solrs-namedlist.html
- seems not really useful
- change the default content type of solr.JSONResponseWriter from "text/plain; charset=UTF-8" to "application/json"
Rest api
https://lucidworks.com/blog/2014/03/31/introducing-solrs-restmanager-and-managed-stop-words-and-synonyms/
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/rest/RestManager.java
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/rest/ManagedResource.java
https://cwiki.apache.org/confluence/display/solr/Managed+Resources
Changes made to managed resources via this REST API are not applied to the active Solr components until the Solr collection (or Solr core in single-server mode) is reloaded. For example, after adding or deleting a stop word, you must reload the core/collection before the change becomes active.
This approach is required when running in distributed mode so that we are assured changes are applied to all cores in a collection at the same time so that behavior is consistent and predictable. It goes without saying that you don’t want one of your replicas working with a different set of stop words or synonyms than the others.
Field types
https://prismoskills.appspot.com/lessons/Solr/Chapter_20_-_Field_types_-_schema.xml.jsp
https://www.triquanta.nl/blog/going-dutch-stemming-apache-solr
https://cwiki.apache.org/confluence/display/solr/Language+Analysis
Protects words from being modified by stemmers. A customized protected word list may be specified with the "protected" attribute in the schema. Any words in the protected word list will not be modified by any stemmer in Solr.
TODO
https://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/
https://support.lucidworks.com/hc/en-us/articles/205359448-Injecting-multi-word-phrase-synonyms-at-query-time-with-Solr-SynonymFilterFactory
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
Keep in mind that while the SynonymFilter will happily work with synonyms containing multiple words (ie: "sea biscuit, sea biscit, seabiscuit"), the recommended approach for dealing with synonyms like this is to expand the synonym when indexing. This is because there are two potential issues that can arise at query time:
- The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" separately, and will not know that they match a synonym.
- Phrase searching (ie: "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)", which would not match the simple case of "seabiscuit" occurring in a document.
Even when you aren't worried about multi-word synonyms, idf differences still make index time synonyms a good idea. Consider the following scenario:
- An index with a "text" field, which at query time uses the SynonymFilter with the synonym TV, Televesion and expand="true"
- Many thousands of documents containing the term "text:TV"
- A few hundred documents containing the term "text:Television"
A query for text:TV will expand into (text:TV text:Television) and the lower docFreq for text:Television will give the documents that match "Television" a much higher score than comparable docs that match "TV", which may be somewhat counter-intuitive to the client. Index-time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.
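Following the index-time-expansion advice above, a fieldType sketch with synonyms applied only on the index side (the type name and synonyms file name are assumptions):

```
<fieldType name="text_syn" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- expand synonyms at index time so idf is equalized across variants -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```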
Synonym Filter
synonyms: (required) The path of a file that contains a list of synonyms, one per line. In the (default) solr format - see the format argument below for alternatives - blank lines and lines that begin with "#" are ignored. This may be an absolute path, or a path relative to the Solr config directory. There are two ways to specify synonym mappings:
- A comma-separated list of words. If the token matches any of the words, then all the words in the list are substituted, which will include the original token.
- Two comma-separated lists of words with the symbol "=>" between them. If the token matches any word on the left, then the list on the right is substituted. The original token will not be included unless it is also in the list on the right.
ignoreCase: (optional; default: false) If true, synonyms will be matched case-insensitively.
expand: (optional; default: true) If true, a synonym will be expanded to all equivalent synonyms. If false, all equivalent synonyms will be reduced to the first in the list.
https://github.com/haberman/fast-recs-collate/blob/master/lookup3.c
http://burtleburtle.net/bob/hash/doobs.html
https://yonik.wordpress.com/tag/lookup3/
http://blog.reverberate.org/2012/01/state-of-hash-functions-2012.html
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/Hash.java
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/processor/SignatureUpdateProcessorFactory.java
https://github.com/apache/lucene-solr/blob/53981795fd73e85aae1892c3c72344af7c57083a/lucene/core/src/java/org/apache/lucene/util/StringHelper.java
https://lawlesst.github.io/notebook/solr-etags.html
Prefix search
http://stackoverflow.com/questions/7496405/how-to-configure-solr-so-users-can-make-prefix-search-by-default
There are several ways to do this, but performance-wise you might want to use EdgeNGramFilterFactory
http://blog.florian-hopf.de/2014/03/prefix-and-suffix-matches-in-solr.html
https://github.com/sunspot/sunspot/wiki/Matching-substrings-in-fulltext-search
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-EdgeN-GramFilter
Edge N-Gram Filter
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "s", "sc", "sco", "scor"
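A prefix-search fieldType built on this filter might look like the sketch below; minGramSize=1 and maxGramSize=4 are chosen to match the example output above, and the type name is an assumption:

```
<fieldType name="text_prefix" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="4"/>
  </analyzer>
  <!-- grams are produced at index time only; the query side stays un-grammed -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Keeping the query analyzer un-grammed is the usual design choice: the user's typed prefix is matched directly against the indexed grams.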
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-ShingleFilter
minShingleSize: (integer, default 2) The minimum number of tokens per shingle.
maxShingleSize: (integer, must be >= 2, default 2) The maximum number of tokens per shingle.
In: "To be, or what?"
Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)
https://issues.apache.org/jira/browse/SOLR-1321
https://issues.apache.org/jira/browse/LUCENE-1398
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
This filter reverses tokens to provide faster leading wildcard and prefix queries. Tokens without wildcards are not reversed.
Factory class:
solr.ReversedWildcardFilterFactory
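An index-time analyzer sketch using this factory; the attribute values below are commonly cited example settings, so treat them as assumptions rather than required values:

```
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- index both the original and the reversed token -->
  <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
          maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
</analyzer>
```

With withOriginal="true" both forms are indexed, so ordinary queries keep working while leading-wildcard queries run against the reversed tokens.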
Suffix search
http://stackoverflow.com/questions/19995804/search-suffix-word-solr-4-5-1
http://stackoverflow.com/questions/39517891/solr-find-documents-that-contain-a-field-value-that-a-query-string-starts-or-en
From Solr 4.3+ we have to combine it with ReverseStringFilter
http://khaidoan.wikidot.com/solr
What are the costs of n-gram analysis?
There is a high price to be paid for n-gramming. Recall that in the earlier example, Tonight was split into 15 substring terms, whereas typical analysis would probably leave only one. This translates to greater index sizes, and thus a longer time to index.
Note the ten-fold increase in indexing time for the artist name, and a five-fold increase in disk space. Remember that this is just one field!
Given these costs, n-gramming, if used at all, is generally only done on a field or two of small size where there is a clear requirement for substring matches.
EdgeNGramTokenizerFactory and EdgeNGramFilterFactory, which emit n-grams that are adjacent to either the start or end of the input text.
https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production
$ bin/solr start -Dsolr.autoSoftCommit.maxTime=10000
https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud
$ bin/solr restart -c -p 8983 -s example/cloud/node1/solr
http://stackoverflow.com/questions/34253178/solr-doesnt-overwrite-duplicated-uniquekey-entries
https://issues.apache.org/jira/browse/SOLR-6096
Given the test, I suppose that update with children works fine if you specify
<add overwrite="true"><doc>...</doc></add>
<doc childfree="true">