Friday, December 30, 2016

Solr Misc Part 3



http://stackoverflow.com/questions/13306272/removing-solr-duplicate-values-into-multivalued-field
https://issues.apache.org/jira/browse/SOLR-5403
<updateRequestProcessorChain name="distinct-values" default="true">
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.UniqFieldsUpdateProcessorFactory">
        <str name="fieldName">field1</str>
        <str name="fieldName">field2</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>  
http://yonik.com/solr/atomic-updates/
Atomic Updates with SolrJ
  • set – set or replace a particular value, or remove the value if null is specified as the new value
  • add – adds an additional value to a list
  • remove – removes a value (or a list of values) from a list
  • removeregex – removes all values that match the given Java regular expression from a list
  • inc – increments a numeric value by a specific amount (use a negative value to decrement)

HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr");
// create the document
SolrInputDocument sdoc = new SolrInputDocument();
sdoc.addField("id","book1");
Map<String,Object> fieldModifier = new HashMap<>(1);
fieldModifier.put("add","Cyberpunk");
sdoc.addField("cat", fieldModifier);  // add the map as the field value
client.add( sdoc );  // send it to the solr server
client.close();  // shutdown client before we exit

http://lucene.472066.n3.nabble.com/SolrJ-atomic-updates-td4020438.html
    HashMap<String, Object> editTags = new HashMap<>(); 
    editTags.put("set", new String[]{"tag1","tag2","tag3"}); 
    doc = new SolrInputDocument(); 
    doc.addField("id", "unique"); 
    doc.addField("tags_ss", editTags); 
    server.add(doc); 
    server.commit(true, true); 
    resp = server.query(q); 
    System.out.println(resp.getResults().get(0).getFirstValue("tags_ss")); 

prints "tag1" 

ArrayList<String> as a value works the same way as String[]. 
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
Cursors in Solr are a logical concept that doesn't involve caching any state information on the server.  Instead the sort values of the last document returned to the client are used to compute a "mark" representing a logical point in the ordered space of sort values.
  1. cursorMark and start are mutually exclusive parameters
    • Your requests must either not include a start parameter, or it must be specified with a value of "0".
  2. sort clauses must include the uniqueKey field (either "asc" or "desc")
    • If id is your uniqueKey field, then sort params like id asc and name asc, id desc would both work fine, but name asc by itself would not
  3. Sorts including Date Math based functions that involve calculations relative to NOW will cause confusing results, since every document will get a new sort value on every subsequent request.  This can easily result in cursors that never end, and constantly return the same documents over and over – even if the documents are never updated.  In this situation, choose & re-use a fixed value for the NOW request param in all of your cursor requests.
Cursor mark values are computed based on the sort values of each document in the result, which means multiple documents with identical sort values will produce identical Cursor mark values if one of them is the last document on a page of results.  In that situation, the subsequent request using that cursorMark would not know which of the documents with the identical mark values should be skipped. Requiring that the uniqueKey field be used as a clause in the sort criteria guarantees that a deterministic ordering will be returned, and that every cursorMark value will identify a unique point in the sequence of documents.
SolrQuery q = (new SolrQuery(some_query)).setRows(r).setSort(SortClause.asc("id"));
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (! done) {
  q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
  QueryResponse rsp = solrServer.query(q);
  String nextCursorMark = rsp.getNextCursorMark();
  doCustomProcessingOfResults(rsp);
  if (cursorMark.equals(nextCursorMark)) {
    done = true;
  }
  cursorMark = nextCursorMark;
}
the cursor is stateless from Solr's perspective,
Unlike basic pagination, Cursor pagination does not rely on using an absolute "offset" into the completed sorted list of matching documents.  Instead, the cursorMark specified in a request encapsulates information about the relative position of the last document returned, based on the absolute sort values of that document.  This means that the impact of index modifications is much smaller when using a cursor compared to basic pagination.

  • The client requests 5 more documents using the nextCursorMark from the previous response
    • Documents 6-10 will be returned -- the deletion of a document that's already been returned doesn't affect the relative position of the cursor
  • 3 new documents are now added with the ids 90, 91, and 92; All three documents have a name of A
  • The client requests 5 more documents using the nextCursorMark from the previous response
    • Documents 11-15 will be returned -- the addition of new documents with sort values already past does not affect the relative position of the cursor
  • Document id 1 is updated to change its 'name' to Q
  • Document id 17 is updated to change its 'name' to A
  • The client requests 5 more documents using the nextCursorMark from the previous response
    • The resulting documents are 16,1,18,19,20 in that order
    • Because the sort value of document 1 changed so that it is after the cursor position, the document is returned to the client twice
    • Because the sort value of document 17 changed so that it is before the cursor position, the document has been "skipped" and will not be returned to the client as the cursor continues to progress
When fetching all results matching a query using cursorMark, the only way index modifications can result in a document being skipped, or returned twice, is if the sort value of the document changes. 

One way to ensure that a document will never be returned more than once, is to use the uniqueKey field as the primary (and therefore: only significant) sort criterion. 
In this situation, you will be guaranteed that each document is only returned once, no matter how it may be modified during the use of the cursor.

"Tailing" a Cursor

The most common example of how this can be useful is when you have a "timestamp" field recording when a document has been added/updated in your index.  Client applications can continuously poll a cursor using sort=timestamp asc, id asc for documents matching a query, and always be notified when a document is added or updated matching the request criteria.  Another common example is when you have uniqueKey values that always increase as new documents are created, and you can continuously poll a cursor using sort=id asc to be notified about new documents.
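A minimal SolrJ sketch of such a tailing loop (my own illustration, not from the article; timestamp_dt and processNewDocuments are assumed names, and solrServer is the same client used in the loop above): when the returned cursorMark stops advancing, wait and re-issue the same mark to pick up newly added or updated documents.
SolrQuery q = new SolrQuery("*:*")
    .setRows(100)
    .setSort(SortClause.asc("timestamp_dt"))   // assumed timestamp field
    .addSort(SortClause.asc("id"));            // uniqueKey as tie-breaker
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
while (true) {
  q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
  QueryResponse rsp = solrServer.query(q);
  processNewDocuments(rsp);                    // hypothetical callback
  String next = rsp.getNextCursorMark();
  if (cursorMark.equals(next)) {
    Thread.sleep(5000);                        // nothing new yet, poll again later
  }
  cursorMark = next;
}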
http://stackoverflow.com/questions/4628571/solr-date-field-tdate-vs-date
Trie fields make range queries faster by precomputing certain range results and storing them as a single record in the index. For clarity, my example will use integers in base ten. The same concept applies to all trie types. This includes dates, since a date can be represented as the number of seconds since, say, 1970.
Let's say we index the number 12345678. We can tokenize this into the following tokens.
12345678
123456xx
1234xxxx
12xxxxxx
The 12345678 token represents the actual integer value. The tokens with the x digits represent ranges. 123456xx represents the range 12345600 to 12345699, and matches all the documents that contain a token in that range.
Notice how each token in the list has successively more x digits. This is controlled by the precision step. In my example, you could say that I was using a precision step of 2, since I trim 2 digits to create each extra token. If I were to use a precision step of 3, I would get these tokens.
12345678
12345xxx
12xxxxxx
A precision step of 4:
12345678
1234xxxx
A precision step of 1:
12345678
1234567x
123456xx
12345xxx
1234xxxx
123xxxxx
12xxxxxx
1xxxxxxx
It's easy to see how a smaller precision step results in more tokens and increases the size of the index. However, it also speeds up range queries.
Without the trie field, if I wanted to query a range from 1250 to 1275, Lucene would have to fetch 26 entries (1250, 1251, 1252, ..., 1275) and combine search results. With a trie field (and precision step of 1), we could get away with fetching 8 entries (125x, 126x, 1270, 1271, 1272, 1273, 1274, 1275), because 125x is a precomputed aggregation of 1250 - 1259. If I were to use a precision step larger than 1, the query would go back to fetching all 26 individual entries.
Note: In reality, the precision step refers to the number of bits trimmed for each token. If you were to write your numbers in hexadecimal, a precision step of 4 would trim one hex digit for each token. A precision step of 8 would trim two hex digits.
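A toy sketch (my own, in base 10) of the token generation described above; the real Trie* fields trim bits rather than decimal digits, but the idea is the same:
import java.util.ArrayList;
import java.util.List;

class TrieTokenDemo {
  // Emit the exact value plus one "range" token per precision step,
  // padding the trimmed digits with 'x' to show the range each token covers.
  static List<String> trieTokens(String number, int precisionStep) {
    List<String> tokens = new ArrayList<>();
    tokens.add(number);
    for (int trimmed = precisionStep; trimmed < number.length(); trimmed += precisionStep) {
      String prefix = number.substring(0, number.length() - trimmed);
      StringBuilder padded = new StringBuilder(prefix);
      for (int i = 0; i < trimmed; i++) {
        padded.append('x');
      }
      tokens.add(padded.toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(trieTokens("12345678", 2));
    // prints [12345678, 123456xx, 1234xxxx, 12xxxxxx]
  }
}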
https://wiki.apache.org/solr/UpdateXmlMessages
  • waitFlush = "true" | "false" — default is true — block until index changes are flushed to disk. At least in Solr 1.4 and later (perhaps earlier as well), this command has no effect. In Solr 4.0 it will be removed.
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
 A common mistake is to try to filter parents with a which filter, as in this bad example:
q={!parent which="title:join"}comments:SolrCloud 
Instead, you should use a sibling mandatory clause as a filter:
q= +title:join +{!parent which="content_type:parentDocument"}comments:SolrCloud
https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents#TransformingResultDocuments-[child]-ChildDocTransformerFactory
This transformer returns all descendant documents of each parent document matching your query in a flat list nested inside the matching parent document. This is useful when you have indexed nested child documents and want to retrieve the child documents for the relevant parent documents for any type of search query.
fl=id,[child parentFilter=doc_type:book childFilter=doc_type:chapter limit=100]
Note that this transformer can be used even though the query itself is not a Block Join query.
When using this transformer, the parentFilter parameter must be specified, and works the same as in all Block Join Queries. Additional optional parameters are:
  • childFilter - query to filter which child documents should be included, this can be particularly useful when you have multiple levels of hierarchical documents (default: all children)
  • limit - the maximum number of child documents to be returned per parent document (default: 10)

https://wiki.apache.org/solr/SchemaDesign
Solr provides one table. Storing a set of database tables in an index generally requires denormalizing some of the tables. Attempts to avoid denormalizing usually fail.

Solr block-join

<!-- points to the root document of a block of nested documents. Required for nested document support, may be removed otherwise -->
<field name="_root_" type="string" indexed="true" stored="false"/>
is needed for block-join support.
You can use this when you have relationships between entities and you don't want to flatten your docs, for example, one Class doc, contains many Student docs, and you want to be able to query in a more similar way as you would do it in a DB.
http://blog-archive.griddynamics.com/2013/09/solr-block-join-support.html


http://stackoverflow.com/questions/39785462/nested-documents-with-spring-data-solr
The default MappingSolrConverter does not yet support nested documents. However you can switch to SolrJConverter which uses the native mapping.
@Bean
public SolrTemplate solrTemplate(SolrClient client) {

    SolrTemplate template = new SolrTemplate(client);
    template.setSolrConverter(new SolrJConverter());
    return template;
}
http://stackoverflow.com/questions/37241489/solrj-6-0-0-insertion-of-a-bean-object-which-associate-list-of-bean-object-is-g/37243756#37243756
So as of now I am applying the @Field annotation at the field level rather than on the setter:
 @Field (child = true)
    private Collection<Technology2> technologies2;
Nested Objects in Solr
http://stackoverflow.com/questions/40490606/solr-querying-on-nested-child-documents
I think the best way to achieve your goal is through Block Join Parser capabilities.
What you need to change a little bit is to introduce some marker for parent documents. In Solr terms it will be needed for the "parent filter". So assuming each parent document will have content_type:parentDocument (just for sake of example) you'll be able to find all your parent documents with BJQ (block-join query) like:
{!parent which="content_type:parentDocument"}(+place:blr +street:bakery)
Please bear in mind you need to index your parent-children documents together (in the same block) as described on Solr wiki 

https://blog.griddynamics.com/how-to-use-block-join-to-improve-search-efficiency-with-nested-documents-in-solr
SolrInputDocument has methods — getChildDocuments() and addChildDocument() — for nesting child documents into a parent document.
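A minimal SolrJ sketch of that (my own illustration; field names such as doc_type and title_t are made up): the parent and its children are sent, and indexed, together as one block.
SolrInputDocument book = new SolrInputDocument();
book.addField("id", "book1");
book.addField("doc_type", "book");
book.addField("title_t", "Some Book");

SolrInputDocument chapter1 = new SolrInputDocument();
chapter1.addField("id", "book1_ch1");
chapter1.addField("doc_type", "chapter");
chapter1.addField("title_t", "Chapter One");

SolrInputDocument chapter2 = new SolrInputDocument();
chapter2.addField("id", "book1_ch2");
chapter2.addField("doc_type", "chapter");
chapter2.addField("title_t", "Chapter Two");

book.addChildDocument(chapter1);   // nest the children into the parent
book.addChildDocument(chapter2);

client.add(book);    // the whole block is indexed together
client.commit();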

http://yonik.com/solr-nested-objects/
Now if we search for color:RED AND size:M , it would incorrectly match our document!
Lucene has a flat object model and does not really support “nesting” of documents in the index.
Lucene *does* support adding a list of documents atomically and contiguously (i.e. a virtual “block”), and this is the feature used by Solr to implement “nested objects”.
When you add a parent document with 3 children, these appear in the index contiguously as
child1, child2, child3, parent
There is no Lucene-level information that links parent and child, or distinguishes this parent/child block from the other documents in the index that come before or after. Successfully using parent/child relationships relies on more information being provided at query time.
All children of a parent document must be indexed together with the parent document. One cannot update any document (parent or child) individually. The entire block needs to be re-indexed if any changes need to be made.
There are no schema requirements except that the _root_ field must exist (but that is there by default in all our schemas).

“Block Join” refers to the set of related query technologies to efficiently map from parents to children or vice versa at query time. The locality of children and parents can be used to both speed up query operations and lower memory requirements compared to other join methods.
we can see that these are really just indexed as 3 documents, all visible by default

$ curl http://localhost:8983/solr/demo/query -d '
q=cat_s:(fantasy OR sci-fi)&
fl=id,[child parentFilter=type_s:book]'
Child Doc Transformer Parameters:
  • parentFilter – identifies all of the parents. See the section on The Parent Filter for more info.
  • childFilter – optional query to filter which child documents should be included.
  • limit – maximum number of child documents to return per parent (defaults to 10)
The main query gives us a document list of reviews by author_s:yonik
If we want to facet on the book genre (cat_s field) then we need to switch the domain from the children (type_s:reviews) to the parents (type_s:books).
Faceting on Parent
$ curl http://localhost:8983/solr/demo/query -d '
q=author_s:yonik&fl=id,comment_t&
json.facet={
  genres : {
    type: terms,
    field: cat_s,
    domain: { blockParent : "type_s:book" }
  }
}'
Faceting on Children
$ curl http://localhost:8983/solr/demo/query -d '
q=cat_s:(sci-fi OR fantasy)&fl=id,title_t&
json.facet={
  top_reviewers : {
    type: terms,
    field: author_s,
    domain: { blockChildren : "type_s:book" }
  }
}'

http://stackoverflow.com/questions/12101382/is-there-any-meaningful-performance-difference-between-integer-and-string-values
https://wiki.apache.org/solr/SolrFacetingOverview
  • They are often not mapped into lower case
  • Human-readable punctuation is often not removed (other than double-quotes)
  • There is often no need to store them, since stored values would look much like indexed values and the faceting mechanism is used for value retrieval.
As an example, if I had an "author" field with a list of authors, such as:
  • Schildt, Herbert; Wolpert, Lewis; Davies, P.
I might want to index the same data differently in three different fields (perhaps using the Solr copyField directive):
  • For searching: Tokenized, case-folded, punctuation-stripped:
    • schildt / herbert / wolpert / lewis / davies / p
  • For sorting: Untokenized, case-folded, punctuation-stripped:
    • schildt herbert wolpert lewis davies p
  • For faceting: Primary author only, using a solr.StringField:
    • Schildt, Herbert

http://stackoverflow.com/questions/14323269/should-i-prefer-integers-or-strings-in-my-solr-schema-if-a-field-will-fit-either
Will you ever query on a range? So if your 1...4 is really marking statuses of say Bad to Great, would you ever query on records from 1-2? This is the only thing of where you may need them to be ints (and, since you only have 4, it's not that big of a deal).
My rule in data storage is that if the int will never be used as an int, then store it as a string. It may require more space, etc. but you can do more string manipulations, etc. And the memory requirements of 11m records may not matter if that one field is a string or int (11m is a lot of records, but not a heavy load for Solr/Lucene).
Unless you need to perform range queries (numeric fields have special support for this) or sorting (the int field cache is more memory-efficient than the String field cache), they should be roughly equivalent.
I think I have found an issue with using the long integer for uniqueKey — document routing using the ! notation will not work with a long integer uniqueKey :(

http://lucene.472066.n3.nabble.com/numberic-or-string-type-for-non-sortable-field-td2606353.html
the reason i suggested that using ints might (marginally) be better is 
because of the FieldCache and the fieldValueCache -- the int 
representation uses less memory than if it was holding strings 
representing the same ints. 

worrying about that is really a premature optimization though -- model 
your data in the way that makes the most sense -- if your ids are 
inherently ints, model them as ints until you come up with a reason to 
model them otherwise and move on to the next problem. 
https://lucene.apache.org/solr/4_2_0/solr-core/org/apache/solr/schema/UUIDField.html
This FieldType accepts UUID string values, as well as the special value of "NEW" which triggers generation of a new random UUID.
NOTE: Configuring a UUIDField instance with a default value of "NEW" is not advisable for most users when using SolrCloud (and not possible if the UUID value is configured as the unique key field) since the result will be that each replica of each document will get a unique UUID value. Using UUIDUpdateProcessorFactory to generate UUID values when documents are added is recommended instead.
https://wiki.apache.org/solr/UniqueKey
  • UUID is short for Universal Unique IDentifier. The UUID standard RFC-4122 includes several types of UUID with different input formats. There is a UUID field type (called UUIDField) in Solr 1.4 which implements version 4. Fields are defined in the schema.xml file with:
     <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
    in Solr 4, this field must be populated via solr.UUIDUpdateProcessorFactory.
     <field name="id" type="uuid" indexed="true" stored="true" required="true"/>
      <updateRequestProcessorChain name="uuid">
        <processor class="solr.UUIDUpdateProcessorFactory">
          <str name="fieldName">id</str>
        </processor>
        <processor class="solr.RunUpdateProcessorFactory" />
      </updateRequestProcessorChain>
  • Due to low level changes to support SolrCloud, the uniqueKey field can no longer be populated via <copyField/> or <field default=...> in the schema.xml. Users wishing to have Solr automatically generate a uniqueKey value when adding documents should instead use an instance of solr.UUIDUpdateProcessorFactory in their update processor chain.
https://lucidworks.com/blog/2013/02/20/custom-solr-request-params/
By taking advantage of variable de-referencing in Local Params, we can specify an “appends” fq filter that delegates to a custom parameter name of our choosing. We can then specify that custom param name in our “defaults”, and still allow clients to override it as needed.
http://stackoverflow.com/questions/22017616/stronger-boosting-by-date-in-solr

http://www.slideshare.net/lucenerevolution/potter-timothy-boosting-documents-in-solr

http://stackoverflow.com/questions/29470458/solr-external-file-field-performance-issue
As described in the SO question "Relevancy boosting very slow in Solr", the key=value pairs the external file consists of should be sorted by that key. This is also stated in the java doc of the ExternalFileField
The external file may be sorted or unsorted by the key field, but it will be substantially slower (untested) if it isn't sorted.
https://lucidworks.com/blog/2011/12/14/options-to-tune-documents-relevance-in-solr/

Boost Queries

Sometimes it is necessary to boost some documents regardless of the user query. A typical example of boost queries is boosting sponsored documents. The user searches for “car rental”, but the application has some sponsored document that should be boosted. A good way of doing this is by using boost queries. A boost query is a query that will be executed in the background after a user query, and that will boost the documents that matched it.
For this example, the boost query (specified by the “bq” parameter) would be something like:
bq=sponsored:true
The boost query won’t determine which documents are considered a hit and which are not, but it will just influence the score of the result.

Boost Functions

Boost Functions are very similar to boost queries; in fact, they can achieve the same goals. The difference between boost functions and boost queries is that the boost function is an arbitrary function instead of a query (see http://lucidworks.lucidimagination.com/display/solr/Function+Queries). A typical example of boost functions is boosting those documents that are more recent than others. Imagine a forum search application, where the user is searching for forum entries with the text “foo bar”. The application should display all the forum entries that talk about “foo bar” but usually the most recent entries are more important (most users will want to see updated entries, and not historical). The boost function will be executed in the background after each user query, and will boost some documents in some way.
For this example, a boost function (specified by the “bf” parameter) could be something like:
bf=recip(ms(NOW,publicationDate),3.16e-11,1,1)

The boost Parameter

The “boost” parameter is very similar to the “bf” parameter, but instead of adding its result to the final score, it will multiply it. This is only available in the “Extended Dismax Query Parser” or the “Lucid Query Parser”.
  1. Prefer multiplicative boosting to additive boosting.
  2. Be careful not to confuse queries with functions.
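A small SolrJ sketch contrasting the two (my own example, assuming edismax and an illustrative publish_date field): bq and bf add their score to the text relevancy score, while boost multiplies it.
// Additive: the bf result is added to the relevancy score.
SolrQuery additive = new SolrQuery("foo bar");
additive.set("defType", "edismax");
additive.set("qf", "text");
additive.set("bf", "recip(ms(NOW/HOUR,publish_date),3.16e-11,1,1)");

// Multiplicative: the boost function scales the relevancy score instead.
SolrQuery multiplicative = new SolrQuery("foo bar");
multiplicative.set("defType", "edismax");
multiplicative.set("qf", "text");
multiplicative.set("boost", "recip(ms(NOW/HOUR,publish_date),3.16e-11,1,1)");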
https://wiki.apache.org/solr/FunctionQuery

ord

ord(myfield) returns the ordinal of the indexed field value within the indexed list of terms for that field in lucene index order (lexicographically ordered by unicode value), starting at 1. In other words, for a given field, all values are ordered lexicographically; this function then returns the offset of a particular value in that ordering. The field must have a maximum of one value per document (not multiValued). 0 is returned for documents without a value in the field.
  • Example: If there were only three values for a particular field: "apple","banana","pear", then ord("apple")=1, ord("banana")=2, ord("pear")=3
    Example Syntax: ord(myIndexedField)
  • Example SolrQuerySyntax: _val_:"ord(myIndexedField)"
WARNING: as of Solr 1.4, ord() and rord() can cause excess memory use since they must use a FieldCache entry at the top level reader, while sorting and function queries now use entries at the segment level. Hence sorting or using a different function query, in addition to ord()/rord() will double memory use.
WARNING: ord() depends on the position in an index and can thus change when other documents are inserted or deleted, or if a MultiSearcher is used.

rord

The reverse ordering of what ord provides.
  • Example Syntax: rord(myIndexedField)
  • Example: rord(myDateField) is a metric for how old a document is: the youngest document will return 1, the oldest document will return the total number of documents.


ms
Returns milliseconds of difference between its arguments.
Dates are relative to the Unix or POSIX time epoch, midnight, January 1, 1970 UTC.
Arguments may be numerically indexed date fields such as TrieDate (recommended field type for dates since Solr 1.4), or date math (examples in SolrQuerySyntax) based on a constant date or NOW.
ms()
  • Equivalent to ms(NOW), number of milliseconds since the epoch.
ms(a)
  • Returns the number of milliseconds since the epoch that the argument represents.
    Example: ms(NOW/DAY)
  • Example: ms(2000-01-01T00:00:00Z)
  • Example: ms(mydatefield)
Note that this number can be negative for dates from before the epoch.
ms(a,b)
  • Returns the number of milliseconds that b occurs before a (i.e. a - b). Note that this offers higher precision than sub(a,b) because the arguments are not converted to floating point numbers before subtraction.
  • Example: ms(NOW,mydatefield)
  • Example: ms(mydatefield,2000-01-01T00:00:00Z)
  • Example: ms(datefield1,datefield2)
ms(foofield) currently (Chris Harris, 4/16/2010) returns the value 0 for docs with nonexistent foofield. Should this behavior be relied on?

Date Boosting


Boosting more recent content is a common use case. One way is to use a recip function in conjunction with ms.
There are approximately 3.16e10 milliseconds in a year, so one can scale dates to fractions of a year with the inverse, or 3.16e-11. Thus the function recip(ms(NOW,mydatefield),3.16e-11,1,1) will yield values near 1 for very recent documents, 1/2 for documents a year old, 1/3 for documents two years old, etc. Be careful to not use this function for dates more than one year in the future or the values will be negative.
Consider using reduced precision to prevent excessive memory consumption. You would instead use recip(ms(NOW/HOUR,mydatefield),3.16e-11,1,1). See this thread for more information.
The most effective way to use such a boost is to multiply it with the relevancy score, rather than add it in. One way to do this is with the boost query parser.
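For example (the field name is illustrative), the multiplicative form can be written with the boost query parser as:
q={!boost b=recip(ms(NOW/HOUR,mydatefield),3.16e-11,1,1)}foo bar
Since recip(x,m,a,b) computes a/(m*x+b), a document exactly one year old (x roughly 3.16e10 ms) gets about 1/(3.16e-11*3.16e10 + 1) = 1/2, matching the values described above.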
https://wiki.apache.org/solr/SolrRelevancyFAQ

How can I make exact-case matches score higher

Example: a query of "Penguin" should score documents containing "Penguin" higher than docs containing "penguin".
The general strategy is to index the content twice, using different fields with different fieldTypes (and different analyzers associated with those fieldTypes). One analyzer will contain a lowercase filter for case-insensitive matches, and one will preserve case for exact-case matches.
Use copyField commands in the schema to index a single input field multiple times.


Once the content is indexed into multiple fields that are analyzed differently, query across both fields.

How can I make queries of "spiderman" and "spider man" match "Spider-Man"



WordDelimiterFilter can be used in the analyzer for the field being queried to match words with intra-word delimiters such as dashes or case changes.

How can I search for one term near another term (say, "batman" and "movie")

The dismax handler can easily create sloppy phrase queries with the pf (phrase fields) and ps (phrase slop) parameters:
q=batman movie&pf=text&ps=100
The dismax handler also allows users to explicitly specify a phrase query with double quotes, and the qs(query slop) parameter can be used to add slop to any explicit phrase queries:


q="batman movie"&qs=100
How can I change the score of a document based on the *value* of a field (say, "popularity")

defType=dismax&qf=text&q=supervillians&bf=sqrt(popularity)

the explainOther parameter can be used to specify other documents you want detailed scoring info for.
q=supervillians&debugQuery=on&explainOther=id:juggernaut

There are a number of ways to boost using function queries. One way is recip + linear, where recip computes an age-based score, and linear is used to boost it. The dismax query parser provides an easy way to apply boost functions. Another way is recip + ms, as in the date-boosting example above.
https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
The qf (Query Fields) Parameter
The qf parameter introduces a list of fields, each of which is assigned a boost factor to increase or decrease that particular field's importance in the query. For example, the query below:
qf="fieldOne^2.3 fieldTwo fieldThree^0.4"
assigns fieldOne a boost of 2.3, leaves fieldTwo with the default boost (because no boost factor is specified), and fieldThree a boost of 0.4. These boost factors make matches in fieldOne much more significant than matches in fieldTwo, which in turn are much more significant than matches in fieldThree.

The bq (Boost Query) Parameter

The bq parameter specifies an additional, optional, query clause that will be added to the user's main query to influence the score. For example, if you wanted to add a relevancy boost for recent documents:
q=cheese
bq=date:[NOW/DAY-1YEAR TO NOW/DAY]
You can specify multiple bq parameters. If you want your query to be parsed as separate clauses with separate boosts, use multiple bq parameters.

The bf (Boost Functions) Parameter

The bf parameter specifies functions (with optional boosts) that will be used to construct FunctionQueries which will be added to the user's main query as optional clauses that will influence the score. Any function supported natively by Solr can be used, along with a boost value. For example:
recip(rord(myfield),1,2,3)^1.5
Specifying functions with the bf parameter is essentially just shorthand for using the bq param combined with the {!func} parser.
For example, if you want to show the most recent documents first, you could use either of the following:
bf=recip(rord(creationDate),1,1000,1000)
  ...or...
bq={!func}recip(rord(creationDate),1,1000,1000)

https://wiki.apache.org/solr/QueryElevationComponent
The uniqueKey field must currently be of type string for the QueryElevationComponent to operate properly.
https://medium.com/@kaismh/automated-solution-for-query-elevation-using-solr-a15ce88b2762
  • Configurable FieldType for query: you can create any custom pipeline for analyzing the query text, so you can lowercase the text, use a stemmer, etc.
  • Force elevation even on sorted results: you can either choose to respect or ignore the sort parameter.
  • Enable/Disable elevation parameter: you can choose to disable elevation via a query parameter, very useful for testing.

http://lucene.472066.n3.nabble.com/Programmatically-upload-configuration-into-ZooKeeper-td4104924.html
Programmatically upload configuration into ZooKeeper
       ZkStateReader zkStateReader = cloudSolrServer.getZkStateReader(); 
       SolrZkClient zkClient = zkStateReader.getZkClient(); 

       File jsonFile = new File(updateClusterstateJson); 
       if (!jsonFile.isFile()) { 
           System.err.println(jsonFile.getAbsolutePath()+" not found."); 
           return; 
       } 

       byte[] clusterstateJson = readFile(jsonFile); 

       // validate that the JSON the user is passing in is well-formed 
       InputStreamReader bytesReader =
           new InputStreamReader(new ByteArrayInputStream(clusterstateJson), "UTF-8"); 
       JSONParser parser = new JSONParser(bytesReader); 
       parser.toString(); 

       zkClient.setData("/clusterstate.json", clusterstateJson, true); 
http://www.programcreek.com/java-api-examples/index.php?class=org.apache.solr.common.cloud.SolrZkClient&method=getData
    String path = ZkStateReader.COLLECTIONS_ZKNODE + "/" + collection; 
    byte[] data = zkClient.getData(path, null, null, true); 


numShards=3&collection.configName=configName&replicationFactor=2&router.field=routerField&maxShardsPerNode=2

https://www.linkedin.com/pulse/automated-solution-query-elevation-using-solr-kais-hassan-phd
https://issues.apache.org/jira/browse/SOLR-6092

Suggestor
FuzzyLookupFactory
This is a suggester which is an extension of the AnalyzingSuggester but is fuzzy in nature. The similarity is measured by the Levenshtein algorithm.
  • AnalyzingLookupFactory (default, finds matches based on prefix)
  • FuzzyLookupFactory (finds matches with misspellings),
  • AnalyzingInfixLookupFactory (finds matches anywhere in the text),
  • BlendedInfixLookupFactory (combines matches based on prefix and infix lookup)
You need to choose the one which fulfills your requirements. The second important parameter is dictionaryImpl which represents how indexed suggestions are stored. And again, you can choose between a couple of implementations, e.g. DocumentDictionaryFactory (stores terms, weights, and optional payload) or HighFrequencyDictionaryFactory (works when very common terms overwhelm others, you can set up proper threshold).
Context filtering lets you filter suggestions by a separate context field, such as category, department or any other token. The AnalyzingInfixLookupFactory and BlendedInfixLookupFactory currently support this feature, when backed by DocumentDictionaryFactory.
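A short SolrJ sketch of issuing a suggester request (my own illustration; the /suggest handler name, the mySuggester dictionary, and the cfq value are assumptions taken from a typical configuration):
SolrQuery sq = new SolrQuery();
sq.setRequestHandler("/suggest");            // assumed handler name
sq.set("suggest", true);
sq.set("suggest.dictionary", "mySuggester"); // assumed suggester name
sq.set("suggest.q", "solr cl");
sq.set("suggest.cfq", "memory");             // context filter, if the lookup supports it
QueryResponse rsp = client.query(sq);
System.out.println(rsp.getResponse().get("suggest"));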

http://brandnewuser.iteye.com/blog/2297834
■ It should return ranked suggestions ordered by term frequency, as there is little benefit to suggesting rare terms that occur in only a few documents in your index, especially when the user has typed only a few characters.
https://cwiki.apache.org/confluence/display/solr/Suggester
blenderType: used to calculate weight coefficient using the position of the first matching word. Can be one of:
  • position_linear: weightFieldValue*(1 - 0.10*position): Matches to the start will be given a higher score (Default)
  • position_reciprocal: weightFieldValue/(1+position): Matches to the start will be given a higher score, and the weight decays faster with position than with position_linear.
    • exponent: an optional configuration variable for the position_reciprocal blenderType used to control how fast the score will increase or decrease. Default 2.0.
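As a worked example using the formulas as given (and ignoring the optional exponent): with weightFieldValue=100, a first match at position 0, 1, or 2 scores 100, 90, 80 under position_linear, and 100, 50, about 33 under position_reciprocal.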
https://github.com/apache/lucene-solr/blob/master/solr/core/src/test-files/solr/collection1/conf/solrconfig-phrasesuggest.xml
https://lucene.apache.org/core/6_1_0/suggest/org/apache/lucene/search/suggest/FileDictionary.html

https://issues.apache.org/jira/browse/LUCENE-6336
https://issues.apache.org/jira/browse/SOLR-9637
http://lucene.472066.n3.nabble.com/Duplicate-suggestions-td4212639.html
If you have duplicated field values across your docs, you will see duplicate suggestions. 

Do you have any intermediate API in your application? In that case you can 
modify the API to use a Collection that prevents duplicates when collecting and 
returning the suggestions. 

If you want it directly from Solr, I assume it is a "bug". 
I think the suggestions should return no duplicates by default (because 
the only information returned is the field value and not the document id). 
Anyway, it could be a nice parameter for getting better suggestions (sending an 
avoidDuplicates parameter to the suggester). 

http://stackoverflow.com/questions/32045700/solr-suggester-in-solrcloud-mode
https://wiki.apache.org/solr/SpellCheckComponent#Distributed_Search_Support
It was not working because Solr was running in SolrCloud mode. There are two ways to perform suggestion in SolrCloud mode:
  • Use the distrib=false parameter. This will fetch the data from only one shard which you are accessing in the command. You can add the following into the Component definition itself.
    <bool name="distrib">false</bool> 
    
  • Use the shards and shards.qt parameters for searching all the shards. The shards parameter will contain a comma separated list of all the shards which you want to include in the query. The shards.qt parameter will define the request handler you want to access.
shards.qt: Signals Solr that requests to shards should be sent to a request handler given by this parameter. Use shards.qt=/spell when making the request if your request handler is "/spell".
shards: shards=solr-shard1:8983/solr,solr-shard2:8983/solr Distributed Search
http://stackoverflow.com/questions/40905965/how-to-optimize-documentdictionary-build-on-solr-cloud-suggester
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/params/ShardParams.java
/** The requested URL for this shard */
public static final String SHARD_URL = "shard.url";

/** The Request Handler for shard requests */
public static final String SHARDS_QT = "shards.qt";

/** Request detailed match info for each shard (true/false) */
public static final String SHARDS_INFO = "shards.info";

/** Should things fail if there is an error? (true/false) */
public static final String SHARDS_TOLERANT = "shards.tolerant";

/** Force a single-pass distributed query? (true/false) */
public static final String DISTRIB_SINGLE_PASS = "distrib.singlePass";

https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
If, on the other hand, you wanted to search just one shard, you can specify that shard by its logical ID, as in:
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1
If you want to search a group of shard Ids, you can specify them together:
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=shard1,shard2
In both of the above examples, the shard Id(s) will be used to pick a random replica of that shard.
Alternatively, you can specify the explicit replicas you wish to use in place of the shard Ids:
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=localhost:7574/solr/gettingstarted,localhost:8983/solr/gettingstarted
Or you can specify a list of replicas to choose from for a single shard (for load balancing purposes) by using the pipe symbol (|):
http://localhost:8983/solr/gettingstarted/select?q=*:*&shards=localhost:7574/solr/gettingstarted|localhost:7500/solr/gettingstarted
http://signaldump.org/solr/qpod/49836/cloudsolrclient-does-not-distribute-suggest-build-true
When using the new Suggester component (with AnalyzingInfixSuggester) in
Solr trunk with solrj, the suggest.build command seems to be executed only
on one of the solr cloud nodes.
I had to add shards.qt=/suggest and
shards=host1:port2/solr/mycollection,host2:port2/solr/mycollection... to
distribute the build command on all nodes.
Given that we are using SolrCloud, I would have expected the build command
to behave like a cloud update and be sent to all nodes without the need of
specifying shards and shards.qt

https://wiki.apache.org/solr/DistributedSearch
The presence of the shards parameter in a request will cause that request to be distributed across all shards in the list. The syntax of shards is host:port/base_url[,host:port/base_url]*. A sharded request will go to the standard request handler (not necessarily the original); this can be overridden via shards.qt. Since SOLR-3134 it is possible to obtain numFound, maxScore and time per shard in a distributed search query. Use shards.info=true to enable this feature. The shards.tolerant=true parameter includes error information if available. (SolrCloud can handle this for you in a more transparent way).

Distributed Deadlock

Each shard may also serve top-level query requests and then make sub-requests to all of the other shards. In this configuration, care should be taken to ensure that the max number of threads serving HTTP requests in the servlet container is greater than the possible number of requests from both top-level clients and other shards (the solr example server is already configured correctly). If this is not the case, a distributed deadlock is possible.


Consider the simplest case of two shards, each with just a single thread to service HTTP requests. Both threads could receive a top-level request concurrently, and make sub-requests to each other. Because there are no more remaining threads to service requests, the servlet containers will block the incoming requests until the other pending requests are finished (but they won't finish since they are waiting for the sub-requests).
https://lucidworks.com/blog/2017/01/09/context-filtering-with-solr-suggesters/
http://stackoverflow.com/questions/36079395/how-to-configure-multiple-contextfields-in-single-solr-suggester
You have to create a new field in your schema.xml as context_field. This field should have multiValued=true
<field name="context_field" type="text_suggest" multiValued="true" indexed="true" stored="true"/>
Then you have to create this context_field as a list in json for indexing in solr.
"context_field" : ["some document type", "some department type"]
After indexing you can suggest like this:
suggest.q=b&suggest.cfq=context_documentType AND context_departmentType

https://issues.apache.org/jira/browse/SOLR-7963
suggest.q=c&suggest.cfq=memory
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/spelling/suggest/SolrSuggester.java
Analyzer contextFilterQueryAnalyzer = new TokenizerChain(new StandardTokenizerFactory(Collections.EMPTY_MAP), null);
https://github.com/apache/lucene-solr/blob/53981795fd73e85aae1892c3c72344af7c57083a/solr/core/src/test-files/solr/collection1/conf/solrconfig-suggestercomponent.xml

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
In order to use these features with SolrCloud, the documents must be located on the same shard. To ensure document co-location, you can define the router.name parameter as compositeId when creating the collection

The CollapsingQParser is really a post filter that provides more performant field collapsing than Solr's standard approach when the number of distinct groups in the result set is high.

fq={!collapse field=group_field}
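To also get back the documents that were collapsed away, the expand component can be enabled alongside the collapse filter (my own example; expand.rows defaults to 5):
fq={!collapse field=group_field}&expand=true&expand.rows=5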

https://cwiki.apache.org/confluence/display/solr/Result+Clustering
The clustering (or cluster analysis) plugin attempts to automatically discover groups of related search hits (documents) and assign human-readable labels to these groups. By default in Solr, the clustering algorithm is applied to the search result of each single query—this is called an on-line clustering.


https://developer.s24.com/blog/a-utility-library-for-working-with-solrs-namedlist.html
- seems not really useful

- change the default content type of solr.JSONResponseWriter from text/plain; charset=UTF-8 to application/json

REST API
https://lucidworks.com/blog/2014/03/31/introducing-solrs-restmanager-and-managed-stop-words-and-synonyms/

https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/rest/RestManager.java
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/rest/ManagedResource.java

https://cwiki.apache.org/confluence/display/solr/Managed+Resources
Changes made to managed resources via this REST API are not applied to the active Solr components until the Solr collection (or Solr core in single server mode) is reloaded. For example, after adding or deleting a stop word, you must reload the core/collection before changes become active.
This approach is required when running in distributed mode so that we are assured changes are applied to all cores in a collection at the same time so that behavior is consistent and predictable. It goes without saying that you don’t want one of your replicas working with a different set of stop words or synonyms  than the others.
Field types
https://prismoskills.appspot.com/lessons/Solr/Chapter_20_-_Field_types_-_schema.xml.jsp
https://www.triquanta.nl/blog/going-dutch-stemming-apache-solr

https://cwiki.apache.org/confluence/display/solr/Language+Analysis
Protects words from being modified by stemmers. A customized protected word list may be specified with the "protected" attribute in the schema. Any words in the protected word list will not be modified by any stemmer in Solr.


TODO: https://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/
https://support.lucidworks.com/hc/en-us/articles/205359448-Injecting-multi-word-phrase-synonyms-at-query-time-with-Solr-SynonymFilterFactory
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
Keep in mind that while the SynonymFilter will happily work with synonyms containing multiple words (ie: "sea biscuit, sea biscit, seabiscuit"), the recommended approach for dealing with synonyms like this is to expand the synonym when indexing. This is because there are two potential issues that can arise at query time:
  1. The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit, the analyzer will be given the words "sea" and "biscit" separately, and will not know that they match a synonym.
  2. Phrase searching (ie: "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would not match the simple case of "seabiscuit" occurring in a document
Even when you aren't worried about multi-word synonyms, idf differences still make index time synonyms a good idea. Consider the following scenario:
  • An index with a "text" field, which at query time uses the SynonymFilter with the synonym TV, Television and expand="true"
  • Many thousands of documents containing the term "text:TV"
  • A few hundred documents containing the term "text:Television"


A query for text:TV will expand into (text:TV text:Television) and the lower docFreq for text:Television will give the documents that match "Television" a much higher score than docs that match "TV" comparably -- which may be somewhat counter-intuitive to the client. Index time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.

Synonym Filter
synonyms: (required) The path of a file that contains a list of synonyms, one per line. In the (default) solr format - see the format argument below for alternatives - blank lines and lines that begin with "#" are ignored. This may be an absolute path, or path relative to the Solr config directory.  There are two ways to specify synonym mappings:
  • A comma-separated list of words. If the token matches any of the words, then all the words in the list are substituted, which will include the original token.
  • Two comma-separated lists of words with the symbol "=>" between them. If the token matches any word on the left, then the list on the right is substituted. The original token will not be included unless it is also in the list on the right.
ignoreCase: (optional; default: false) If true, synonyms will be matched case-insensitively.
expand: (optional; default: true) If true, a synonym will be expanded to all equivalent synonyms. If false, all equivalent synonyms will be reduced to the first in the list.
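To illustrate the two mapping formats (hypothetical entries in the solr synonyms file format described above):
# equivalent list: with expand="true" a match on any of these is expanded to all of them
TV, Television
# explicit mapping: tokens matching the left side are replaced by the list on the right
i-pod, i pod => ipod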



https://github.com/haberman/fast-recs-collate/blob/master/lookup3.c
http://burtleburtle.net/bob/hash/doobs.html
https://yonik.wordpress.com/tag/lookup3/
http://blog.reverberate.org/2012/01/state-of-hash-functions-2012.html

https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/Hash.java
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/processor/SignatureUpdateProcessorFactory.java
https://github.com/apache/lucene-solr/blob/53981795fd73e85aae1892c3c72344af7c57083a/lucene/core/src/java/org/apache/lucene/util/StringHelper.java
https://lawlesst.github.io/notebook/solr-etags.html

Prefix search
http://stackoverflow.com/questions/7496405/how-to-configure-solr-so-users-can-make-prefix-search-by-default
There are several ways to do this, but performance wise you might want to use EdgeNGramFilterFactory
http://blog.florian-hopf.de/2014/03/prefix-and-suffix-matches-in-solr.html
https://github.com/sunspot/sunspot/wiki/Matching-substrings-in-fulltext-search

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-EdgeN-GramFilter
Edge N-Gram Filter
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "s", "sc", "sco", "scor"

https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-ShingleFilter
minShingleSize: (integer, default 2) The minimum number of tokens per shingle.
maxShingleSize: (integer, must be >= 2, default 2) The maximum number of tokens per shingle.
In: "To be, or what?"
Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)

https://issues.apache.org/jira/browse/SOLR-1321
https://issues.apache.org/jira/browse/LUCENE-1398

http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/

This filter reverses tokens to provide faster leading wildcard and prefix queries. Tokens without wildcards are not reversed.
Factory class: solr.ReversedWildcardFilterFactory


Suffix search
http://stackoverflow.com/questions/19995804/search-suffix-word-solr-4-5-1
http://stackoverflow.com/questions/39517891/solr-find-documents-that-contain-a-field-value-that-a-query-string-starts-or-en
From Solr 4.3+ we have to combine it with ReverseStringFilter
http://khaidoan.wikidot.com/solr
What are the costs of n-gram analysis?
There is a high price to be paid for n-gramming. Recall that in the earlier example, Tonight was split into 15 substring terms, whereas typical analysis would probably leave only one. This translates to greater index sizes, and thus a longer time to index.
Note the ten-fold increase in indexing time for the artist name, and a five-fold increase in disk space. Remember that this is just one field!
Given these costs, n-gramming, if used at all, is generally only done on a field or two of small size where there is a clear requirement for substring matches.

Another variation is EdgeNGramTokenizerFactory and EdgeNGramFilterFactory, which emit n-grams that are adjacent to either the start or end of the input text.

https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production
$ bin/solr start -Dsolr.autoSoftCommit.maxTime=10000

https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud
$ bin/solr restart -c -p 8983 -s example/cloud/node1/solr

http://stackoverflow.com/questions/34253178/solr-doesnt-overwrite-duplicated-uniquekey-entries
https://issues.apache.org/jira/browse/SOLR-6096
Given the test, I suppose that update with children works fine if you specify
<add overwrite = "true"><doc>...</doc> </add>
<doc childfree="true">

