Massive Technical Interviews Tips: May 2017

Thursday, May 25, 2017

How to Respond to Recruiters

https://www.themuse.com/advice/5-email-templates-to-respond-to-recruiters-no-matter-where-you-are-in-your-search

Thanks for reaching out! This certainly sounds like an interesting job, and I appreciate your consideration.

I really love the work I’m doing for [Your Company] and am not in the market for a new opportunity at the moment. That said, if I find myself looking to make a change in the future, I’ll be sure to get in touch.

Thanks for getting in touch!

I’m pretty happy in my current role with [Your Company] and am not actively looking to change jobs, but I’d be open to discussing this role, as I never turn down a chance to chat about [compelling trait about the job description, e.g., software development or sales enablement]. Would it be possible for us to connect sometime next week? I should be available for a quick call on [dates and times that’ll work with your schedule].

Moving forward, you can reach me directly here: [your email address and/or phone number].

Looking forward to speaking with you!

This sounds like a really interesting opportunity—thanks for thinking of me!

As you probably saw on my profile, I have [X years] of experience in the [industry or job function, e.g., digital marketing or project management] space, and am particularly interested in opportunities that allow me to [relevant job duty/deliverable, e.g. leverage my creativity in a design-focused role or build new programs from the ground up]. Based on the information you’ve shared, it sounds like the role certainly could be a great fit!

I’d love to schedule a time for us to discuss how my skills and experience could benefit the team; would it be possible for us to connect sometime this week? I’ve included my availability below:

[dates/times]

You can reach me directly at [your e-mail address and/or phone number]. Looking forward to connecting!

Thanks for getting in touch! Based on what you’ve shared about this role, I’d be eager to learn more.

It sounds like you’re looking for an [job title] with [relevant skills/experience] expertise and a talent for developing [insert outcomes, e.g., unique and compelling marketing campaigns across a variety of digital channels]—that’s me!

As someone with [X years of experience] in the industry, I know what it takes to deliver [deliverables based on job description, e.g., flawlessly executed e-mail campaigns from start to finish]. In my current role at [Your Current Company], I [description of relevant experience and tangible results based on job description, i.e., guide the production and execution of 25 unique monthly email campaigns and have grown new lead generation by 50% in just six months].

I’d love to schedule a time for us to discuss how my skills and experience could benefit the [Company Name] team; would it be possible for us to connect sometime this week? I’ve included my availability below:

By the way, I noticed you’re a Chico State alum, too. It’s always great to hear from a fellow Wildcat!” or “it looks like you’re also connected with [Name of Mutual Acquaintance]. I used to work with her at [Company Name]!”

http://lifehacker.com/5905427/how-to-follow-up-on-a-job-interview-without-being-annoying

I usually confine it to email and make it a quick note - thank them again for the interview and ask if there's been an update/any movement on the position. If they respond, you can usually get a feel for whether you're annoying the shit out of them.

I just wanted to follow up in regards to my interview on [date — or "last week"]. Do you have an update, or do you need any further information from me?

https://www.quora.com/How-do-I-ask-the-HR-person-by-mail-whether-I-am-selected-or-not-after-a-few-days-of-interview

Subject : Following up on the interview for position of _________

Dear Hiring Manager,

In reference to my interview for the position of ______________ dated ____________, I am writing to inquire about updates on the progress of your hiring decision and the status of my job application. In review of the opportunity of interview and my skills, I am eager to work with your company.

I would appreciate any feedback for the same. Again, I would like to thank you for your time and consideration and look forward to hearing back from you soon. In case you require any clarifications about my candidature, I can be reached at 123456 or via email at abc@mail.com

Please recall that I had the fortune of interviewed by your company on …. for the position. I believe that overall it went well and as I said in the interview, I would love to work for your company.

If it is not early, can I be advised if you have decided anything - hopefully positive - on my candidature? A positive and early response would help me to get prepared for the new phase of my life.

Looking forward to working with you and with regards

http://www.impactinterview.com/2017/02/how-to-ask-interview-status-2-sample-emails/

Following up for the position of [position name], I’d like to inquire about the progress of your hiring decision and the status of my job application. I am very eager to work with your company.

Thanks for your time and consideration, and I look forward to hear back from you soon.

I enjoyed meeting you last week and wanted to share how excited I am about this opportunity. Is there anything else I can forward along to make your hiring decision easier?

https://workplace.stackexchange.com/questions/46323/is-it-okay-to-tell-the-preferred-company-that-you-have-received-other-job-offers

I was very excited about the Designer position after talking with Mr. Senior Manager the other day. It seems like a great fit for both of us and I am eager to join your team!

However, I have received other job offers which are competitive and I am considering them.

I would rather work for your company; can you tell me what the status regarding my application?

Am I a serious candidate for the position?

(If so) Is there a way to move the hiring process along?

I am happy to talk with you again this week (or sooner) if that would help.

Thanks for your consideration! I look forward to talking with you again soon!

https://www.quora.com/Does-Apple-grant-or-replenish-RSU-to-its-employees-senior-engineers-in-the-annual-review-I-am-looking-for-statistics-on-whether-an-individual-group-received-it-every-year-and-what-percent-the-grant-was
There isn't a better answer -- it really is "it depends". To help you understand why, here's how equity grants work at most SV companies: each comp period (e.g. once a year) there's a stock (equity) pool allocation given to managers. There is usually guidance on how to allocate that stock pool -- e.g. "give stock to 50% of the eligible employees and focus on those with the greatest future potential, and those with low equity who are strong performers" Managers then allocate stock to employees, and that allocation is reviewed by their manager (often with HR partner consultation) to ensure allocations are taking place equitably (that is, using the same criteria / rules).

I would add that average performance will result in below average equity allocation -- in my experience, the guide rules encourage giving more stock to fewer people to focus on above-average performance. That is going to depend on how good you are, what contributions you make, how well you get along with your manager, the group you are in, and other factors. You need to focus on your personal contribution and be smart about ensuring you're in the best place in the organization to deliver the best contribution.

https://www.quora.com/What-kind-of-pay-raises-and-stock-refresh-does-one-get-in-software-hardware-engineering-at-Apple

My very anti-climatic answer to this question is that the average Apple engineer should expect that they will get their compensation half in cash (salary plus bonus) and half in stock, in the form of RSUs on a four year vesting schedule (1/4 after a year, then 1/8th every six months thereafter). The flow of RSU income is maintained by yearly grants at comp review time.

(Cheap shot) now that Apple actually has to compete on compensation again (haw haw haw), you should also expect the total compensation package to be in the Silicon Valley average ($300k+ for a senior engineer, after four years when the vesting schedule is in full force, in the form of $150k base and the rest in stock and bonuses per the above).

Sunday, May 14, 2017

Solr Misc Part 4

http://massivetechinterview.blogspot.com/2017/09/solr-misc-part-5.html
http://blog.florian-hopf.de/2014/03/prefix-and-suffix-matches-in-solr.html

One approach that is quite popular when doing prefix or suffix matches is to use wildcards when querying. This can be done programmatically but you need to take care that any user input is then escaped correctly. Suppose you have the term dumpling in the index and a user enters the term dump. If you want to make sure that the query term matches the document in the index you can just add a wildcard to the user query in the code of your application so the resulting query then would be dump*

Generally you should be careful when doing too much magic like this: if a user is in fact looking for documents containing the word dump she might not be interested in documents containing dumpling. You need to decide for yourself if you would like to have only matches the user is interested in (precision) or show the user as many probable matches as possible (recall). This heavily depends on the use cases for your application.

You can increase the user experience a bit by boosting exact matches for your term. You need to create a more complicated query but this way documents containing an exact match will score higher:

dump^2 OR dump*

When creating a query like this you should also take care that the user can't add terms that will make the query invalid. The SolrJ method escapeQueryChars of the class ClientUtils can be used to escape the user input.

            <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>

            <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="back"/>

You can't use the EdgeNGramFilterFactory anymore for suffix ngrams. But fortunately the stack trace also advices us how to fix the problem. We have to combine it with ReverseStringFilter:

        <filter class="solr.ReverseStringFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
        <filter class="solr.ReverseStringFilterFactory"/>

https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-SynonymGraphFilter

Synonym Filter has been deprecated in favor of Synonym Graph Filter, which is required for multi-term synonym support.

https://issues.apache.org/jira/browse/SOLR-10379
Add ManagedSynonymGraphFilterFactory, deprecate ManagedSynonymFilterFactory

Word Delimiter Filter has been deprecated in favor of Word Delimiter Graph Filter, which is required to produce a correct token graph so that e.g. phrase queries can work correctly.

http://blog.csdn.net/zteny/article/details/60633374
DocValues字段是一个面向列存储的字段，一个Segment只有一个DocValues文件。也就是被DocValues标记的字段在建索引时会额外存储文件到值的映射关系，存储这个映射的文件叫DocValues data，简称.dvd。对应的元数据文件叫.dvm:DocValues Metadata。
http://mozhenghua.iteye.com/blog/2275932

docValues和document的stored=ture存储的值，都是正排索引，单也是有区别的：

l 存储方式：

DocValues是面向列的存储方式，stored=true是面向行的存储方式，如果通过fieldid取列的值可定是用docValues的存储结构更高效。

l 是否分词：

Stored=true的存储方式是不会分词的，会将字段原值进行保存，而docValues的保存的值会进行分词。

如果在索引上要进行facet，gourp，highlight等查询尽量使用docValue，这样不用为内存开销烦恼了。

例如：solr4.0之后都会需要在schema中设置一个_version_字段来实现对文档的原子操作，为了节省内存，可以加上docValues：

<field name="_version_"

type="long" indexed="true" stored="true" docValues="true"/>

https://www.slideshare.net/lucidworks/search-analytics-component-presented-by-steven-bower-bloomberg-lp
http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
Lucene provides a RAM resident FieldCache built from the inverted index once the FieldCache for a specific field is requested the first time or during index reopen. Internally we call this process un-inverting the field since the inverted index is a value to document mapping and FieldCache is a document to value datastructure. For simplicity think of an array indexed by Lucene’s internal documents ID. When the FieldCache is loaded Lucene iterates all terms in a field, parses the terms values and fills the arrays slots based on the document IDs associated with the term.
https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/
values of DocValue fields are densely packed into columns instead of sparsely stored like they are with stored fields.

row-oriented (stored fields)

{
  'doc1': {'A':1, 'B':2, 'C':3},
  'doc2': {'A':2, 'B':3, 'C':4},
  'doc3': {'A':4, 'B':3, 'C':2}
}

column-oriented (docValues)

{
  'A': {'doc1':1, 'doc2':2, 'doc3':4},
  'B': {'doc1':2, 'doc2':3, 'doc3':3},
  'C': {'doc1':3, 'doc2':4, 'doc3':2}
}

When Solr/Lucene returns a set of document ids from a query, it will then use the row-oriented (aka, stored fields) view of the documents to retrieve the actual field values. This requires a very few number of seeks since all of the field data will be stored close together in the fields data file.

However, for faceting/sorting/grouping Lucene needs to iterate over every document to collect the field values. Traditionally, this is achieved by uninverting the term index. This performs very well actually, since the field values are already grouped (by nature of the index), but it is relatively slow to load and is maintained in memor

https://wiki.apache.org/solr/DocValues

What docvalues are:
- NRT-compatible: These are per-segment datastructures built at index-time and designed to be efficient for the use case where data is changing rapidly.
- Basic query/filter support: You can do basic term, range, etc queries on docvalues fields without also indexing them, but these are constant-score only and typically slower. If you care about performance and scoring, index the field too.
- Better compression than fieldcache: Docvalues fields compress better than fieldcache, and "insanity" is impossible.
- Able to store data outside of heap memory: You can specify a different docValuesFormat on the fieldType (docValuesFormat="Disk") to only load minimal data on the heap, keeping other data structures on disk.
What docvalues are not:
- Not a replacement for stored fields: These are unrelated to stored fields in every way and instead datastructures for search (sort/facet/group/join/scoring).
- Not a huge improvement for a static index: If you have a completely static index, docvalues won't seem very interesting to you. On the other hand if you are fighting the fieldcache, read on.

https://stackoverflow.com/questions/28960088/lucene-fields-vs-docvalues
https://lucene.apache.org/solr/guide/6_6/docvalues.html

The standard way that Solr builds the index is with an inverted index. This style builds a list of terms found in all the documents in the index and next to each term is a list of documents that the term appears in (as well as how many times the term appears in that document). This makes search very fast - since users search by terms, having a ready list of term-to-document values makes the query process faster.

In Lucene 4.0, a new approach was introduced. DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.

<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" />

There is an additional configuration option available, which is to modify the docValuesFormat used by the field type. The default implementation employs a mixture of loading some things into memory and keeping some on disk.

<fieldType name="string_in_mem_dv" class="solr.StrField" docValues="true" docValuesFormat="Memory" />

If docValues="true" for a field, then DocValues will automatically be used any time the field is used for sorting, faceting or function queries.

When useDocValuesAsStored="false", non-stored DocValues fields can still be explicitly requested by name in the fl param, but will not match glob patterns ("*"). Note that returning DocValues along with "regular" stored fields at query time has performance implications that stored fields may not because DocValues are column-oriented and may therefore incur additional cost to retrieve for each returned document. Also note that while returning non-stored fields from DocValues, the values of a multi-valued field are returned in sorted order (and not insertion order). If you require the multi-valued fields to be returned in the original insertion order, then make your multi-valued field as stored (such a change requires re-indexing).

https://wiki.apache.org/solr/SchemaXml
https://cwiki.apache.org/confluence/display/solr/Field+Properties+by+Use+Case

Use Case	indexed	stored	multiValued	omitNorms	termVectors	termPositions	docValues
search within field	true
retrieve contents		true⁸					true⁸
use as unique key	true		false
sort on field	true⁷		false	true ¹			true⁷
highlighting	true ⁴	true			true²	true ³
faceting ⁵	true⁷						true⁷
add multiple values, maintaining order			true
field length affects doc score				false
MoreLikeThis ⁵					true ⁶

Notes:

¹ Recommended but not necessary.
² Will be used if present, but not necessary.
³ (if termVectors=true)
⁴ A tokenizer must be defined for the field, but it doesn't need to be indexed.
⁵ Described in Understanding Analyzers, Tokenizers, and Filters.
⁶ Term vectors are not mandatory here. If not true, then a stored field is analyzed. So term vectors are recommended, but only required if stored=false.
⁷ For most field types, either indexed or docValues must be true, but both are not required. DocValues can be more efficient in many cases. For [Int/Long/Float/Double/Date]PointFields, docValues=true is required.

https://cwiki.apache.org/confluence/display/solr/Defining+Fields

https://cwiki.apache.org/confluence/display/solr/Field+Type+Definitions+and+Properties

autoGeneratePhraseQueries

For text fields. If true, Solr automatically generates phrase queries for adjacent terms. If false, terms must be enclosed in double-quotes to be treated as phrases.

docValues

If true, the value of the field will be put in a column-oriented DocValues structure.

true or false

false

useDocValuesAsStored	If the field has docValues enabled, setting this to true would allow the field to be returned as if it were a stored field (even if it has `stored=false`) when matching "`*`" in an fl parameter.	true or false	true
large	Large fields are always lazy loaded and will only take up space in the document cache if the actual value is < 512KB. This option requires `stored="true"` and `multiValued="false"`. It's intended for fields that might have very large values so that they don't get cached in memory.

Although they have been deprecated for quite some time, Solr still has support for Schema based configuration of a <defaultSearchField/> (which is superseded by the df parameter) and <solrQueryParser defaultOperator="OR"/> (which is superseded by the q.op parameter.

http://lucene.472066.n3.nabble.com/autoGeneratePhraseQueries-sort-of-silently-set-to-false-td3770638.html

* SOLR-2015: Add a boolean attribute autoGeneratePhraseQueries to TextField.
autoGeneratePhraseQueries="true" (the default) causes the query parser to
generate phrase queries if multiple tokens are generated from a single
non-quoted analysis string. For example WordDelimiterFilter splitting text:pdp-11
will cause the parser to generate text:"pdp 11" rather than (text:PDP OR text:11).
Note that autoGeneratePhraseQueries="true" tends to not work well for non whitespace
delimited languages. (yonik)

with a ton of useful, though back and forth, commentary here: <https://issues.apache.org/jira/browse/SOLR-2015>

http://signaldump.org/solr/qpod/33443/solr-elevate-with-complex-query-specifying-field-names

The query elevation component matches queries exactly with entries in
elevate.xml. You can nominate a query field type that is used to process
the query before matching, but that won't help you when your queries have
explicit boosts.

Are you using the eDismax query parser? If so, you can separate your boosts
from the actual query, using the "qf" edismax configuration parameter,
which specifies which fields to query, and their boosts:

q=test
qf=ean name^10.00 persartnr^5.00 persartnrdirect shortdescription

That way your query isn't polluted with boosts (or fields for that matter),
and an entry in elevate.xml with will match.

http://lucene.472066.n3.nabble.com/Problems-with-elevation-component-configuration-td3993204.html
Thanks Chris, but actually, it turns out that "query text" from elevate.xml has to match the query (q=...). So in this case, elevation works only for http://localhost:8080/solr/elevate?q=brain, but not for http://localhost:8080/solr/elevate?q=indexingabstract:brain type of queries.
This could be solved by using DisMax query parser (http://localhost:8080/solr/elevate?q=brain&qf=indexingabstract&defType=dismax), but we have way more complicated queries which cannot be reduced just to q=searchterm&...

right ... query elevation by default is based on the raw query string.

: but we have way more complicated queries which cannot be reduced just to
: q=searchterm&...

what would you want/exect QEC to do in that type of situation? how would
it know what part of a complex query should/shouldn't be used for
elevation?

FWIW: ne thing you can do is configure a "queryFieldType" on in your
QueryElevationComponent .. if specific, i will use the analyzer for
that field type to process the raw query string before doing a lookup in
the QEC data -- so for example: you could use it to lowercase the input,
or strip out unwanted whitespace or punctuation.

it night not help for really complicated queries, but it would let you
easily deal with things like extra whitespace you want to ignore.

I think it would also be fairly easy to make QEC support an "elevate.q"
param similar to how there is a "spellcheck.q" param and a "hl.q" param to
let the client specify an alternate, simplified, string for the feature to
use
https://cwiki.apache.org/confluence/display/solr/Other+Parsers
https://issues.apache.org/jira/browse/SOLR-418

Absolute boosting

Absolute boosting enables a document to be consistently displayed at a given position in the result set when a user searches with a specific query. It also prevents individual documents from being displayed when a user searches with a specific query.

Under boosting, they have:

Boosting may be applied in two ways:

Query independent (document boosting). This is used to boost high quality pages for all queries that match the document
Query dependant (query boosting). In this case specific documents may be boosted for given queries

https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/component/QueryElevationComponent.java
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/component/QueryElevationComponent.java

For debugging it may be useful to see results with and without the elevated docs. To hide results, use enableElevation=false:

http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&enableElevation=true

http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&enableElevation=false

Query nesting
q=reviewed+AND+book+AND+_ query_:"{!dismax qf=title pf=title^10 v=$qq}"&qq=reviewed+book
https://wiki.apache.org/solr/FunctionQuery#strdist

Calculate the distance between two strings. Uses the Lucene spell checker StringDistance interface and supports all of the implementations available in that package, plus allows applications to plug in their own via Solr's resource loading capabilities.

Signature: strdist(s1, s2, {jw|edit|ngram|FQN}[, ngram size])
Example: strdist("SOLR",id,edit)

The third argument is the name of the distance measure to use. The abbreviations stand for:

jw - Jaro-Winkler
edit - Levenstein or Edit distance
ngram - The NGramDistance, if specified, can optionally pass in the ngram size too. Default is 2.
FQN - Fully Qualified class Name for an implementation of the StringDistance interface. Must have a no-arg constructor.

This function returns a float between 0 and 1 based on how similar the specified strings are to one another. Returning a value of 1 means the specified strings are identical and 0 means the string are maximally different.

query?q=title:iPhone+4S+Battery+Replacement&fl=*,score,lev_dist:strdist("iPhone 4S Battery Replacement",title_raw,edit)

http://lucene.apache.org/core/6_5_0/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html

Performs potentially multiple passes over Query text to parse any nested logic in PhraseQueries. - First pass takes any PhraseQuery content between quotes and stores for subsequent pass. All other query content is parsed as normal - Second pass parses any stored PhraseQuery content, checking all embedded clauses are referring to the same field and therefore can be rewritten as Span queries. All PhraseQuery clauses are expressed as ComplexPhraseQuery objects

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser

The ComplexPhraseQParser provides support for wildcards, ORs, etc., inside phrase queries using Lucene's ComplexPhraseQueryParser . Under the covers, this query parser makes use of the Span group of queries, e.g., spanNear, spanOr, etc., and is subject to the same limitations as that family or parsers.

`inOrder`	Set to true to force phrase queries to match terms in the order specified. Default: true
`df`	The default search field.

{!complexphrase inOrder=true}name:"Jo* Smith"

{!complexphrase inOrder=false}name:"(john jon jonathan~) peters*" 

A mix of ordered and unordered complex phrase queries:

+_query_:"{!complexphrase inOrder=true}manu:\"a* c*\"" +_query_:"{!complexphrase inOrder=false df=name}\"bla* pla*\""

Performance is sensitive to the number of unique terms that are associated with a pattern. For instance, searching for "a*" will form a large OR clause (technically a SpanOr with many terms) for all of the terms in your index for the indicated field that start with the single letter 'a'. It may be prudent to restrict wildcards to at least two or preferably three letters as a prefix. Allowing very short prefixes may result in to many low-quality documents being returned.

Notice that it also supports leading wildcards "*a" as well with consequent performance implications. Applying ReversedWildcardFilterFactory in index-time analysis is usually a good idea.

You may need to increase MaxBooleanClauses in solrconfig.xml as a result of the term expansion above:

<maxBooleanClauses>4096</maxBooleanClauses>

It is recommended not to use stopword elimination with this query parser. Lets say we add the, up, to to stopwords.txt for your collection, and index a document containing the text "Stores up to 15,000 songs, 25,00 photos, or 150 yours of video" in a field named "features".

While the query below does not use this parser:

q=features:"Stores up to 15,000"

the document is returned. The next query that does use the Complex Phrase Query Parser, as in this query:

q=features:"sto* up to 15*"&defType=complexphrase

does not return that document because SpanNearQuery has no good way to handle stopwords in a way analogous to PhraseQuery. If you must remove stopwords for your use case, use a custom filter factory or perhaps a customized synonyms filter that reduces given stopwords to some impossible token.

https://stackoverflow.com/questions/2589086/lucene-fuzzy-match-on-phrase-instead-of-single-word

I'm trying to do a fuzzy match on the Phrase "Grand Prarie" (deliberately misspelled) using Apache Lucene. Part of my issue is that the ~ operator only does fuzzy matches on single word terms and behaves as a proximity match for phrases.

Is there a way to do a fuzzy match on a phrase with lucene?

There's no direct support for a fuzzy phrase, but you can simulate it by explicitly enumerating the fuzzy terms and then adding them to a MultiPhraseQuery. The resulting query would look like:

<MultiPhraseQuery: "grand (prarie prairie)">

https://stackoverflow.com/questions/5516503/searching-names-with-apache-solr/35764572
https://www.slideshare.net/basistech/simple-fuzzynamematchinginsolr-googleslides

Edit distance
https://stackoverflow.com/questions/21607413/edit-distance-similarity-in-lucene-solr

LocalParams
https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries

If a local parameter value appears without a name, it is given the implicit name of "type". This allows short-form representation for the type of query parser to use when parsing a query string. Thus

q={!dismax qf=myfield}solr rocks

is equivalent to:

q={!type=dismax qf=myfield}solr rocks

If no "type" is specified (either explicitly or implicitly) then the lucene parser is used by default. Thus

fq={!df=summary}solr rocks

is equivilent to:

fq={!type=lucene df=summary}solr rocks

A special key of v within local parameters is an alternate way to specify the value of that parameter.

q={!dismax qf=myfield}solr rocks

is equivalent to

q={!type=dismax qf=myfield v='solr rocks'}

Parameter dereferencing or indirection lets you use the value of another argument rather than specifying it directly. This can be used to simplify queries, decouple user input from query parameters, or decouple front-end GUI parameters from defaults set in solrconfig.xml.

q={!dismax qf=myfield}solr rocks

is equivalent to:

q={!type=dismax qf=myfield v=$qq}&qq=solr rocks

http://www.lucenetutorial.com/lucene-query-syntax.html

Search for any word that starts with "foo" in the title field.

title:foo*

Search for any word that starts with "foo" and ends with bar in the title field.

title:foo*bar

Note that Lucene doesn't support using a * symbol as the first character of a search.

Search for "foo bar" within 4 words from each other.

"foo bar"~4

Note that for proximity searches, exact matches are proximity zero, and word transpositions (bar foo) are proximity 1.

A query such as "foo bar"~10000000 is an interesting alternative to foo AND bar.

Whilst both queries are effectively equivalent with respect to the documents that are returned, the proximity query assigns a higher score to documents for which the terms foo and bar are closer together.

The trade-off, is that the proximity query is slower to perform and requires more CPU.

Fuzzy search
https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser

roam~

This search will match terms like roams, foam, & foams. It will also match the word "roam" itself.

An optional distance parameter specifies the maximum number of edits allowed, between 0 and 2, defaulting to 2. For example:

roam~1

This will match terms like roams & foam - but not foams since it has an edit distance of "2".

https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
https://stackoverflow.com/questions/30909106/fuzzy-search-not-working-with-dismax-query-parser

DisMax, by design, does not support all lucene query syntax in it's query parameter. From the documentation:

This query parser supports an extremely simplified subset of the Lucene QueryParser syntax. Quotes can be used to group phrases, and +/- can be used to denote mandatory and optional clauses ... but all other Lucene query parser special characters are escaped to simplify the user experience.

Fuzzy queries are one of the things that are not supported. There is a request to add it to the qfparameter, if you'd care to take a look, but it has not been implemented.

One good solution would be to go to the edismax query parser, instead. It's query parameter supports full lucene query parser syntax:

http://localhost:8983/solr/simple/select?q=test~1&defType=edismax&qf=fullText

https://issues.apache.org/jira/browse/SOLR-629

For pf, pf2 anf pf3 the ~ is already supported, but has the meaning of slop, not fuzzy. E.g. pf=body~15^5 is equivalent to pf=body^5&ps=15

http://developer4life.blogspot.com/2013/02/solr-and-lucene-fuzzy-search-closer-look.html

The Levenshtein distance between two words is the minimal number of insertions, deletions or substitutions that are needed to transform one word into the other. Than given an unknown query, how does Lucene finds all the terms in the index that are at distance <= than the specified required similarity? Well... this depends on the Solr/Lucene version you are using.

You can take a look at the warning that appears at Lucene 3.2.0 Javadoc:

Warning: this query is not very scalable with its default prefix length of 0 - in this case, *every* term will be enumerated and cause an edit score calculation.

Moreover, prior to 4.0 release Lucene implementation to compute this distance was done for each query for EACH term in the index. You really don't want to use this. So my advice to you is to upgrade - the faster the better.

The Lucene 4.0 Fuzzy took a very different approach. The search now works with FuzzyQuery. The underlying implementation has changed in 4.0 drastically, which lead to significant complexity improvements. Current implementation uses the Levenshtein Automata. This automaton is based on the work of Klaus U. Schulz and Stoyan Mihov "Fast string correction with Levenshtein automata". To make a very long story short this paper shows how to recognize the set of all words V in an index where the Levenshtein distance between V and the query does not exceed a distance d, which is exactly what one wants with Fuzzy Search. For a deeper look see here and here.

http://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Fuzzy%20Searches

Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. For example to search for a term similar in spelling to "roam" use the fuzzy search:

roam~

This search will find terms like foam and roams.

Starting with Lucene 1.9 an additional (optional) parameter can specify the required similarity. The value is between 0 and 1, with a value closer to 1 only terms with a higher similarity will be matched. For example:

roam~0.8

The default that is used if the parameter is not given is 0.5.

https://stackoverflow.com/questions/1752301/how-to-configure-solr-to-use-levenshtein-approximate-string-matching
just append the ~ character to all terms that you want to fuzzy match on the way in to solr. If you are using the default set up, this will give you fuzzy match.

term:"apple"~2

If you put quotes around apple, I think it becomes a phrase query, so the ~2 means proximity search, instead of edit distance.

Typically this is done with the SpellCheckComponent, which internally uses the Lucene SpellChecker by default, which implements Levenshtein.

The wiki really explains very well how it works, how to configure it and what options are available, no point repeating it here.

Or you could just use Lucene's fuzzy search operator.

Another option is using a phonetic filter instead of Levenshtein.

https://stackoverflow.com/questions/18629373/solr-xml-parser-exception

Caused by: org.apache.solr.common.SolrException: org.xml.sax.SAXParseException;systemId: solrres:/solrconfig.xml; lineNumber: 813; columnNumber: 19; The content of elements must consist of well-formed character data or markup.
  at org.apache.solr.core.Config.<init>(Config.java:148)
  at org.apache.solr.core.Config.<init>(Config.java:86)
  at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:120)
  at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:589)
  ... 11 more
used by: org.xml.sax.SAXParseException; systemId: solrres:/solrconfig.xml; lineNumber: 813; columnNumber: 19; The content of elements must consist of well-formed character data or markup.

As < > are xml characters, parsing would fail.
You would need to use > and < for > & < respectively in the Solr Config xml file.

e.g. <str name="mm">4 < 100%</str>

https://lucene.apache.org/solr/guide/6_6/velocity-response-writer.html#velocity-response-writer
http://localhost:8983/solr/techproducts/browse

https://cwiki.apache.org/confluence/display/solr/MoreLikeThis

The MoreLikeThis search component enables users to query for documents similar to a document in their result list. It does this by using terms from the original document to find similar documents in the index.

There are three ways to use MoreLikeThis. The first, and most common, is to use it as a request handler. In this case, you would send text to the MoreLikeThis request handler as needed (as in when a user clicked on a "similar documents" link). The second is to use it as a search component. This is less desirable since it performs the MoreLikeThis analysis on every document returned. This may slow search results. The final approach is to use it as a request handler but with externally supplied text. This case, also referred to as the MoreLikeThisHandler, will supply information about similar documents in the index based on the text of the input document.

https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
http://blog.thedigitalgroup.com/vijaym/beider-morse-phonetic-matching-in-solr/

<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto">
</filter>
</analyzer>

RuleType

APPROX : Approximate rules, which will lead to the largest number of phonetic interpretations.
EXACT : Exact rules, which will lead to a minimum number of phonetic interpretations.

NameType

Supported types of names. Unless you are matching particular family names, use GENERIC. The GENERIC NameType should work reasonably well for non-name words. The other encodings are specifically tuned to family names, and may not work well at all for general text.
ASHKENAZI (ash) : Ashkenazi family names.
GENERIC (gen) : Generic names and words.
SEPHARDIC (sep) : Sephardic family names.

./install_solr_service.sh /home/yyuan/solr-6.6.0.tgz -n
./solr zk -h
./solr zk mkroot /yourroot
ln -s /etc/default/solr.in.sh /var/solr/solr.in.sh
service solr start
curl -d "action=clusterstatus&wt=json" http://localhost:8983/solr/admin/collections | jq
http://solr.pl/en/2011/02/28/sorting-by-function-value-in-solr-solr-1297/

q=:&sort=opis_sort+desc	200.000	267ms	0ms
q=:&sort=sum(geo_x,geo_y)+desc	200.000	823ms	0ms
q=opis:ala&sort=opis_sort+desc	200.000	266ms	1ms
q=opis:ala&sort=sum(geo_x,geo_y)+desc	200.000	810ms	1ms

Above test shows that sorting using the sort function is much slower than the default sort order (which you’d expect). Sorting on the basis of function value is also slower than sorting with the use of string based field, but the difference is not as significant as in the previous case.

https://home.apache.org/~ctargett/RefGuidePOC/jekyll-full/function-queries.html
if(termfreq (cat,'electronics'),popularity,42) : This function checks each document for the to see if it contains the term “electronics” in the cat field. If it does, then the value of the popularityfield is returned, otherwise the value of 42is returned.

query(subquery, default) q=product(popularity, ` query({!dismax v='solr rocks'}): returns the product of the popularity and the score of the DisMax query. `q=product (popularity, ` query($qq))&qq={!dismax}solr rocks`: equivalent to the previous query, using parameter de-referencing. q=product(popularity, ` query($qq,0.1))` &qq={!dismax} solr rocks: specifies a default score of 0.1 for documents that don’t match the DisMax query

and(not (exists (popularity)), exists(price)): returns true for any document which has a value in the price field, but does not have a value in the popularity field

https://stackoverflow.com/questions/22107683/solr-boost-query-syntax

The problem is the way you are trying to nest queries inside of each other w/o any sort of quoting -- the parser has no indication that the "b" param is if(exists(query({!v='user_type:ADMIN'})),10,1) it thinks it's "if(exists(query({!v='user_type:ADMIN'" and the rest is confusing it.

If you quote the "b" param to the boost parser, then it should work...

http://localhost:8983/solr/select?q={!boost b="if(exists(query({!v='foo_s:ADMIN'})),10,1)"}id:1

...or if you could use variable derefrencing, either of these should work...

http://localhost:8983/solr/select?q={!boost b=$b}id:1&b=if(exists(query({!v='foo_s:ADMIN'})),10,1)

http://localhost:8983/solr/select?q={!boost b=if(exists(query($nestedq)),10,1)}id:1&nestedq=foo_s:ADMIN

Nested function query must use $param or {!v=value}
http://lucene.472066.n3.nabble.com/Nested-function-query-must-use-td4038037.html
You have to do exactly what the error message tells you:

rewrite:

query(id:3)

as:

query({!=v'id:3'})
The correct syntax is:

http://localhost:8983/solr/articles.0/select/?q={!func}query({!query
v='hello'})&fl=Document.title,score,&debugQuery=on
https://lucidworks.com/2009/03/31/nested-queries-in-solr/

To embed a query of another type in a Lucene/Solr query string, simply use the magic field name _query_. The following example embeds a lucene query type:poems into another lucene query:

text:"roses are red" AND _query_:"type:poems"

Now of course this isn’t too useful on it’s own, but it becomes very powerful in conjunction with the query parser framework and local params which allows us to change the types of queries. The following example embeds a DisMax query in a normal lucene query:

text:hi  AND  _query_:"{!dismax qf=title pf=title}how now brown cow"

And we can further use parameter defererencing in the local params syntax to make it easier for the front-end to compose the request:

&q=text:hi  AND  _query_:"{!dismax qf=title pf=title v=$qq}
&qq=how now brown cow

Nested Queries in Function Query Syntax

q=how now brown cow&bq={!query v=$datefunc}

And the defaults for the handler in solrconfig.xml would contain the actual definition of datefunc as a function query:

<lst name="defaults">
   <str name="datefunc">{!func}recip(rord(date),1,1000,1000)</str>
   [...]

https://cwiki.apache.org/confluence/display/solr/Function+Queries

Use in a parameter that is explicitly for specifying functions, such as the EDisMax query parser's boost param, or DisMax query parser's bf (boost function) parameter. (Note that the bf parameter actually takes a list of function queries separated by white space and each with an optional boost. Make sure you eliminate any internal white space in single function queries when using bf). For example:

q=dismax&bf="ord(popularity)^0.5 recip(rord(price),1,1000,1000)^0.3"
Introduce a function query inline in the lucene QParser with the _val_ keyword. For example:

q=_val_:mynumericfield _val_:"recip(rord(myfield),1,2,3)"

https://stackoverflow.com/questions/17654266/solr-autocommit-vs-autosoftcommit

You have openSearcher=false for hard commits. Which means that even though the commit happened, the searcher has not been restarted and cannot see the changes. Try changing that setting and you will not need soft commit.

SoftCommit does reopen the searcher. So if you have both sections, soft commit shows new changes (even if they are not hard-committed) and - as configured - hard commit saves them to disk, but does not change visibility.

This allows to put soft commit to 1 second and have documents show up quickly and have hard commit happen less frequently.

http://lucene.472066.n3.nabble.com/optimal-maxWarmingSearchers-in-solr-cloud-td4046164.html
Broadly speaking, you want it to be a smallish value as
background warming can be expensive. So how often are you doing a "hard"
commit and does it take longer to warm-up your searchers?

A few things to consider are:

auto-commits - can use openSearcher=false to avoid opening a new searcher
when doing large batch updates. This allows you to auto-commit more
frequently so that your update log doesn't get too big w/o paying the price
of warming a new searcher on every auto-commit.

new searcher warming queries - how many of these do you have and how long
do they take to warm up? You can get searcher warm-up time from the
Searcher MBean in the admin console.

cache auto-warming - again, how much of your existing caches are you
auto-warming? Keep a close eye on the filterCache and it's autowarmCount.
Warm-up times also available from the admin console.
https://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F

Whenever a commit happens in Solr, a new "searcher" (with new caches) is opened, "warmed" up according to your SolrConfigXml settings, and then put in place. The previous searcher is not closed until the "warming" search is ready. If multiple commits happen in rapid succession -- before the warming searcher from first commit has enough time to warm up, then there can be multiple searchers all competeing for resources at the same time, even htough one of them will be thrown away as soon as the next one is ready.

maxWarmingSearchers is a setting in SolrConfigXml that helps you put a safety valve on the number of overlapping warming searchers that can exist at one time. If you see this error it means Solr prevented a commit from resulting an a new searcher being opened because there were already X warming searchers open.

If you encounter this error a lot, you can (in theory) increase the number in your maxWarmingSearchers, but that is risky to do unless you are confident you have the system resources (RAM, CPU, etc...) to do it safely. A more correct way to deal with the situation is to reduce how frequently you send commits.

If you only encounter this error infrequently because of fluke situations, you'll probably be ok just ignoring it.

Why doesn't my index directory get smaller (immediately) when i delete documents? force a merge? optimize?

Because of the "inverted index" data structure, deleting documents only annotates them as deleted for the purpose of searching. The space used by those documents will be reclaimed when the segments they are in are merged.

When segments are merged (either because of the Merge Policy as documents are added, or explicitly because of a forced merge or optimize command) then Solr attempts to delete old segment files, but on some filesystems Notably in Microsoft Windows) it is not possible to delete a file while the file is open for reading (Which is usually true since Solr is still serving requests against the old segments until the new Searcher is ready and has it's caches warmed). When this happens, the older segment files are left on disk, and Solr will re-attempt to delete them later the next time a merge happens.

http://blog.csdn.net/a550246215/article/details/52402232

结论spring-data-solr在save的时候如果没有设置事务管理会直接执行solrClient.commit()方式（硬提交）
解决方案：
方案1：设置事务
方案2：弃用spring-data-solr 改为源生solrj
直接调用solrClient.addBeans(args);
（只add 不提交，提交的动作由solr服务端决定）
配置solrconfig.xml

https://support.datastax.com/hc/en-us/articles/207690673-FAQ-Solr-logging-PERFORMANCE-WARNING-Overlapping-onDeckSearchers-and-its-meaning

WARN  [commitScheduler-4-thread-1] 2015-12-16 08:46:42,488  SolrCore.java:1712 - [mykeyspace.my_search_table] PERFORMANCE WARNING: Overlapping onDeckSearchers=2

When a commit is issued to a Solr core, it makes index changes visible to new search requests. Commits may come from an application or an auto-commit. A "normal" commit in DSE is usually more often than not from an auto commit in which is, as outlined here, configured in the solr config file.

Each time a commit is issued a new searcher object is created. When there are too many searcher objects this warning will be observed.

Also if the configuration is such that a searcher has pre-warming queries, this can delay the start time meaning that the searcher is still starting up when a new commit comes in.

Decrease overlapping commits

Find out if there are commits being issued from the application. These might overlap with the auto commits. Ideally one would tune the auto commit settings to suit the application, negating the need for anything but auto commits.

Reduce auto warm count (if not using SolrFilterCache)

If you are not using the default SolrFilterCache Disable or reduce the autowarmCount setting for the given filter cache you are using. This setting controls the amount of objects populated from an older cache.

Increase the max searchers

One can increase the following setting in the solr_config.xml if required

<maxWarmingSearchers>16</maxWarmingSearchers>

Note: Having too high a number here can place more load on the node and have a negative impact on performance. It is normally recommended to keep the at 50% number of CPUs at the most. In most cases the default of 2 should be sufficient anyway.

http://lucene.472066.n3.nabble.com/Result-Grouping-vs-Collapsing-Query-Parser-Can-one-be-deprecated-td4302127.html
1) Collapse does not directly support faceting. It simply collapses the
results and the faceting components compute facets on the collapsed result
set. Grouping has direct support for faceting which, can be slow, but it
has options other then just computing facets on the collapsed result set.

2) Originally collapse only supported selecting group heads with min/max
value of a numeric field. It did not support using the sort parameter for
selecting the group head. Recently the sort parameter was added to
collapse, but this likely is not nearly as fast as using the min/max for
selecting group heads.

The collapse query parser filters search results so that only one document is returned out of all of those for a given field's value. Said differently, it collapses search results to one document per group of those with the same field value. This query parser is a special type called post-filter, which can only be used as a filter query because it needs to see the results of all other filter queries and the main query.

For min, the document with the smallest value is chosen, and for max, the largest. If your function query needs to be computed based on the document's score, refer to that via cscore().

If, and only if, your documents are partitioned into separate shards by manufacturer name would you get the correct group count, because each group would be guaranteed to only exist on one shard.

/select?q=*:*&fq={!collapse field=fieldToCollapseOn max=sum(field1, field2)}
/select?q=*:*&fq={!collapse field=fieldToCollapseOn max=sum(product(field1,
1000000), cscore())}
we are effectively sorting by field1 desc and then the cscore desc (a special “collapse score” function for getting the score of a document before collapsing)

https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/test/org/apache/solr/client/solrj/request/SchemaTest.java
SchemaRequest.SchemaVersion schemaVersionRequest = new SchemaRequest.SchemaVersion();
SchemaResponse.SchemaVersionResponse schemaVersionResponse = schemaVersionRequest.process(getSolrClient());

https://github.com/apache/lucene-solr/blob/master/lucene/suggest/src/java/org/apache/lucene/search/suggest/DocumentDictionary.java

<str name="weightField">price</str>

http://opensourceconnections.com/blog/2017/01/23/our-solution-to-solr-multiterm-synonyms/
http://blog.trifork.com/2009/10/20/result-grouping-field-collapsing-with-solr/
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

The Collapsing query parser and the Expand component combine to form an approach to grouping documents for field collapsing in search results. The Collapsing query parser groups documents (collapsing the result set) according to your parameters, while the Expand component provides access to documents in the collapsed group for use in results display or other processing by a client application. Collapse & Expand can together do what the older Result Grouping (group=true) does for most use-cases but not all. Generally, you should prefer Collapse & Expand.

In order to use these features with SolrCloud, the documents must be located on the same shard. To ensure document co-location, you can define the router.name parameter as compositeId when creating the collection.

The CollapsingQParser is really a post filter that provides more performant field collapsing than Solr's standard approach when the number of distinct groups in the result set is high. This parser collapses the result set to a single document per group before it forwards the result set to the rest of the search components. So all downstream components (faceting, highlighting, etc...) will work with the collapsed result set.

q=foo&fq={!collapse field=ISBN}&expand=true

http://opensourceconnections.com/blog/2016/01/22/solr-vs-elasticsearch-relevance-part-two/

{"query": {
    "multi_match": {
        "query": "dog catcher law",
        "fields": ["text", "catch_line"],
        "minimum_should_match": "50%",
        "type": "most_fields"
    }
}}

This verbosity pays off. It’s much easier for the uninitiated to look at the JSON and guess what’s happening. It’s clear that there’s a query, of type “multi_match” being passed a query stirng “dog catcher law.” You can see clearly the fields being searched. Without much knowledge, you could make guesses about what minimum_should_match ormost_fields might mean.

It’s also helpful that Elasticsearch always scopes the parameters to the current query. There’s no “local” vs “global” parameters. There’s just the current JSON query object and its arguments. To appreciate this point, you have to appreciate an annoying Solr localparams quirk. Solr localparams inherit the global query parameters. For example, let’s say you use the following query parameters q=dog catcher law&defType=edismax&q.op=AND&bq={!edismax mm=50% tie=1 qf='catch_line text'}cat (search for dog catcher law, boost (bq) by a ‘cat’ query). Well your scoped local params query unintuitively receives the outside parameter q.op=AND. More frustratingly, with this query you’ll get a deeply befuddling “Infinite Recursion” error from Solr. Why? because hey guess what, your local params query in bq also inherits the bq from the outside – aka itself! So in reality this query is

bq={!edismax mm=50% tie=1 q.op=AND bq='{!edismax mm=50% tie=1 q.op=AND bq='...' qf='catch_line text'} qf='catch_line text'}

. Solr keeps filling in that ‘bq’ from the outside bq, and therefore reports the not so intuitive:

org.apache.solr.search.SyntaxError: Infinite Recursion detected parsing query ‘dog catcher law’

To avoid accepting the external arguments, you need to be explicit in your local params query. Here we set no bq and change q.op to OR.

bq={!edismax mm=50% tie=1 bq='' q.op=OR qf='catch_line text'}

Elasticsearch, on the other hand, focusses on the common use cases.

Well Solr’s desire for terseness has created features like parameter substitution and dereferencing. These features let you reuse parts of queries in a fairly readable fashion. Moreover, Solr’s function query syntax gives you an extremely powerful function (the query()) function that lets you combine relevance scoring and math more seamlessly than the Elasticsearch equivalents.

http://www.elasticsearchtutorial.com/common-solr-queries-in-elasticsearch.html

https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results

Performance Problems with "Deep Paging"

In some situations, the results of a Solr search are not destined for a simple paginated user interface. When you wish to fetch a very large number of sorted results from Solr to feed into an external system, using very large values for the start or rows parameters can be very inefficient. Pagination using start and rows not only require Solr to compute (and sort) in memory all of the matching documents that should be fetched for the current page, but also all of the documents that would have appeared on previous pages. So while a request for start=0&rows=1000000 may be obviously inefficient because it requires Solr to maintain & sort in memory a set of 1 million documents, likewise a request for start=999000&rows=1000 is equally inefficient for the same reasons. Solr can't compute which matching document is the 999001st result in sorted order, without first determining what the first 999000 matching sorted results are. If the index is distributed, which is common when running in SolrCloud mode, then 1 million documents are retrieved from each shard. For a ten shard index, ten million entries must be retrieved and sorted to figure out the 1000 documents that match those query parameters.

Instead the sort values of the last document returned to the client are used to compute a "mark" representing a logical point in the ordered space of sort values. That "mark" can be specified in the parameters of subsequent requests to tell Solr where to continue.

cursorMark and start are mutually exclusive parameters
- Your requests must either not include a start parameter, or it must be specified with a value of "0".
sort clauses must include the uniqueKey field (either "asc" or "desc")
- If id is your uniqueKey field, then sort params like id asc and name asc, id desc would both work fine, but name asc by itself would not
Sorts including Date Math based functions that involve calculations relative to NOW will cause confusing results, since every document will get a new sort value on every subsequent request. This can easily result in cursors that never end, and constantly return the same documents over and over – even if the documents are never updated. In this situation, choose & re-use a fixed value for the NOW request param in all of your cursor requests.

Cursor mark values are computed based on the sort values of each document in the result, which means multiple documents with identical sort values will produce identical Cursor mark values if one of them is the last document on a page of results. In that situation, the subsequent request using that cursorMark would not know which of the documents with the identical mark values should be skipped. Requiring that the uniqueKey field be used as a clause in the sort criteria guarantees that a deterministic ordering will be returned, and that every cursorMark value will identify a unique point in the sequence of documents.

A client requests q=*:*&rows=5&start=0&sort=name asc, id asc&cursorMark=*
- Documents with the ids 1-5 will be returned to the client in order
Document id 3 is deleted
The client requests 5 more documents using the nextCursorMark from the previous response
- Documents 6-10 will be returned -- the deletion of a document that's already been returned doesn't affect the relative position of the cursor
3 new documents are now added with the ids 90, 91, and 92; All three documents have a name of A
The client requests 5 more documents using the nextCursorMark from the previous response
- Documents 11-15 will be returned -- the addition of new documents with sort values already past does not affect the relative position of the cursor
Document id 1 is updated to change its 'name' to Q
Document id 17 is updated to change its 'name' to A
The client requests 5 more documents using the nextCursorMark from the previous response
- The resulting documents are 16,1,18,19,20 in that order
- Because the sort value of document 1 changed so that it is after the cursor position, the document is returned to the client twice
- Because the sort value of document 17 changed so that it is before the cursor position, the document has been "skipped" and will not be returned to the client as the cursor continues to progress

In a nutshell: When fetching all results matching a query using cursorMark, the only way index modifications can result in a document being skipped, or returned twice, is if the sort value of the document changes.

https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
https://www.slideshare.net/shalinmangar/parallel-sql-and-streaming-expressions-in-apache-solr-6

https://cwiki.apache.org/confluence/display/solr/Other+Parsers

A common mistake is to try to filter parents with a which filter, as in this bad example:

q={!parent which="title:join"}comments:SolrCloud

Instead, you should use a sibling mandatory clause as a filter:

q= +title:join +{!parent which="content_type:parentDocument"}comments:SolrCloud

Block Join Parent Query Parser

This parser takes a query that matches child documents and returns their parents. The syntax for this parser is similar: q={!parent which=<allParents>}<someChildren>. The parameter allParents is a filter that matches only parent documents; here you would define the field and value that you used to identify all parent documents. The parameter someChildren is a query that matches some or all of the child documents. Note that the query for someChildrenshould match only child documents or you may get an exception: Child query must not match same docs with parent filter. Combine them as must clauses (+) to find a problem doc. or in older version it's: child query must only match non-parent docs. As it's said, you can search for q=+(parentFilter) +(someChildren) to find a cause .

Again using the example documents above, we can construct a query such as q={!parent which="content_type:parentDocument"}comments:SolrCloud

https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents

This transformer returns all descendant documents of each parent document matching your query in a flat list nested inside the matching parent document. This is useful when you have indexed nested child documents and want to retrieve the child documents for the relevant parent documents for any type of search query.

fl=id,[child parentFilter=doc_type:book childFilter=doc_type:chapter limit=100]

Note that this transformer can be used even though the query itself is not a Block Join query.

When using this transformer, the parentFilter parameter must be specified, and works the same as in all Block Join Queries, additional optional parameters are:

childFilter - query to filter which child documents should be included, this can be particularly useful when you have multiple levels of hierarchical documents (default: all children)
limit - the maximum number of child documents to be returned per parent document (default: 10)

Solr Block Join - Nested Documents
https://blog.griddynamics.com/introduction-to-block-join-faceting-in-solr
The screenshot above is taken from an online retailer’s website. According to the graphic, a dress can be blue, pink or red, and only sizes XS and S are available in blue. However, for merchandisers and customers this dress is considered a single product, not many similar variations. When a customer navigates the site, she should see all SKUs belonging to the same product as a single product, not as multiple products. This means that for facet calculations, our facet counts should represent products, not SKUs. Thus, we need to find a way to aggregate SKU-level facets into product ones.

A common solution is to propagate properties from the SKU level to the product level and produce a single product document with multivalued fields aggregated from the SKUs. With this approach, our aggregated product looks like this:

results of the propagation of SKU level attributes to product level

However, this approach creates the possibility of false positive matches with regards to combinations of SKU-level fields. For example, if a customer filters by color ‘Blue’ and size ‘M’, Product_1 will be considered a valid match, even though there is no SKU in the original catalog which is both 'Blue' and 'M'. This happens because when we are aggregating values from the SKU level, we are losing information about what value comes from what SKU.

Getting back to the technology, this means we should carefully support our catalog structure when searching and faceting products. The problem of searching structured data is already addressed in Solr with a powerful, high performance and robust solution: Block Join Query.
https://blog.griddynamics.com/high-performance-join-in-solr-with-blockjoinquery

A Join Query looks like:

q=text_all:(patient OR autumn OR helen)&fl=id,score&sort=score desc&fq={!join from=join_id to=id}acl:[1303 TO 1309]
You can see that Join almost never ran for less than a second, and the CPU saturated with 100 requests per minute. Adding more queries harmed latency.

All index was cached in RAM via memory mapped files magic

blockjoin.

q=text_all:(patient OR autumn OR helen)&fl=id,score&sort=score desc&fq={!parent which=kind:body}acl:[1303 TO 1309]
Search now takes only a few tens of milliseconds and survives with 6K requests per minute (100 qps). And you see plenty of free CPU!

We can check where Join uses so much CPU power with jstack:

java.lang.Thread.State: RUNNABLE
at
o.a.l.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docFreq(BlockTreeTermsReader.java:2098)
at
o.a.s.search.JoinQuery$JoinQueryWeight.getDocSet(JoinQParserPlugin.java:338)

let’s explain how a 55GB index can ever be cached in just 8GB RAM. You should know that not all files in your index are equally valuable. (In other words, tune your schema wisely.) In my index the frq file is 7.7GB and the tim file is only 427MB, and it’s almost all that’s needed for these queries. Of course, a file which stores primary key values is also read, but it doesn’t seem significant.

BlockJoin is the most efficient way to do the join operation, but it doesn’t mean you need to get rid of your solution based on the other (slow) Join. The place for Join is frequent child updates -- and small indexes, of course.

https://blog.griddynamics.com/how-to-implement-block-join-faceting-in-solr/lucene

Solr supports the search of hierarchical documents using BlockJoinQuery (BJQ). Using this query requires a special way of indexing documents, based on their positioning in the index. All documents belonging to the same hierarchy have to be indexed together, starting with child documents followed by their parent documents.

BJQ works as a bridge between levels of the document hierarchy; e.g. it transforms matches on child documents to matches on parent documents. When we search using BJQ, we provide a child query and a parent filter as parameters. A child query represents what we are looking for among child documents, and a parent filter tells BJQ how to distinguish parent documents from child documents in the index.

For each matched child document, BJQ scans ahead in the index until it finds the nearest parent document, which is sent into the collector chain instead of the child document. This trick of relying on relative document positioning in the index, or “index-time join,” is the secret behind BJQ’s high performance.

We consider each hierarchy of matched documents separately. As we are using BJQ, each hierarchy is represented in the Solr index as a document block, or DocSet slice as we call it.

First, we calculate facets based on matched SKUs from our block. Then we aggregate obtained SKU counts into Product-level facet counts, increasing the product level facet count by only one for every matched block, irrespective of the number of matched SKUs within the block. For example, if we are searching by COLOR:Blue, even though two Blue SKUs were found within a block, aggregated product-level counts will be increased only by one.

structure of block index. When there are multiple children level hits in a single block, facet count on parent level has to be increased only by one.

This solution is implemented inside the BlockJoinFacetComponent, which extends the standard Solr SearchComponent. BlockJoinFacetComponent validates the query and injects a special BlockJoinFacetCollector into the Solr post-filter collectors chain. When BlockJoinFacetCollector is invoked, it receives a parent document, since the BlockJoinFacetComponent ensures that only ToParentBlockJoinQuery is allowed as a top level query.

http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html

Use Lucene’s MMapDirectory on 64bit platforms, please!

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

https://wiki.apache.org/solr/SolrCaching#Types_of_Caches_and_Example_Configuration

https://issues.apache.org/jira/browse/LUCENE-7452

when parent filter intersects with child query the exception exposes internal details: docnum and scorer class.

java.lang.IllegalStateException: Child query must not match same docs with parent filter. Combine them as must clauses (+) to find a problem doc. docId=23, class org.apache.lucene.search.DisjunctionSumScorer

java.lang.IllegalStateException: Parent query must not match any docs beside parent filter. Combine them as must (+) and must-not (-) clauses to find a problem doc. docID=12

https://issues.apache.org/jira/browse/SOLR-6096

This is currently possible to do all this stuff on client side by issuing additional request to delete document before every update. It would be more efficient if this could be handled on SOLR side. One would benefit on atomic update. The biggest plus shows when using "delete-by-query".

Deletion of '1' by query

<delete>
  <query>title:*</query>
  <!-- implying also
    <query>_root_:1</query>
   -->
</delete>

In that case one would not have to first query all documents and issue deletes by those id and every document that are nested.

https://issues.apache.org/jira/browse/SOLR-7888
https://issues.apache.org/jira/browse/SOLR-7963

~~SOLR-7888~~ has introduced a new parameter for filtering suggester queries

suggest.cfq=ctx1 OR ctx2

The implementation use the Solr StandardQueryParser for parsing the cfq param.

This card is to allow to pass in local param queries such as

suggest.cfq={!terms f=contextx}ctx1,ctx2

https://www.garysieling.com/blog/list-solr-functions
https://wiki.apache.org/solr/CoreQueryParameters

echoParams

The echoParams parameter tells Solr what kinds of Request parameters should be included in the response for debugging purposes, legal values include:

none - don't include any request parameters for debugging
explicit - include the parameters explicitly specified by the client in the request
all - include all parameters involved in this request, either specified explicitly by the client, or implicit because of the request handler configuration.

TZ

The TZ parameter can be specified to override the default TimeZone (UTC) used for the purposes of adding and rounding in date math. The local rules for the specified TimeZone (including the start/end of DST if any) determine when each arbitrary day starts -- which affects not only rounding/adding of DAYs, but also cascades to rounding of HOUR, MIN, MONTH, YEAR as well.

For example "2013-03-10T12:34:56Z/YEAR" using the default TZ would be 2013-01-01T00:00:00Z but with TZ=America/Los_Angeles, the result is 2013-01-01T08:00:00Z. Likewise, 2013-03-10T08:00:00Z+1DAY evaluates to 2013-03-11T08:00:00Z by default, but with TZ=America/Los_Angeles the local DST rules result in 2013-03-11T07:00:00Z

qt

If a request uses the /select URL, and no SolrRequestHandler has been configured with /select as its name, then Solr uses the qt parameter to determine which Query Handler should be used to process the request. Valid values are any of the names specified by <requestHandler ... /> declarations in solrconfig.xml

"qt" doesn't really have a default, but the default request handler to dispatch to is "/select".

q.alt

Calls the standard query parser and defines query input strings, when the q parameter is not used.

Jira
http://lucene.472066.n3.nabble.com/both-way-synonyms-with-ManagedSynonymFilterFactory-td4256592.html
Think the issue here is that when the SynonymFilter is created based on the managed map, option “expand” is always set to “false”, while the default for file-based synonym dictionary is “true”.

So with expand=false, what happens is that the input word (e.g. “mb”) is *replaced* with the synonym “megabytes”. Confusingly enough, when synonyms are applied both on index and query side, your document will contain “megabytes” instead of “mb”, but when you query for “mb”, the same happens on query side, so you will actually match :-)

I think what we need is to switch default to expand=true, and make it configurable also in the managed factory.
https://issues.apache.org/jira/browse/SOLR-8737
We can use solr analysis admin ui to test whether synonym works

Thursday, May 25, 2017

How to Respond to Recruiters

Sunday, May 14, 2017

Solr Misc Part 4

RuleType

NameType

Why doesn't my index directory get smaller (immediately) when i delete documents? force a merge? optimize?

Performance Problems with "Deep Paging"

Block Join Parent Query Parser

echoParams

TZ

qt

Labels

Popular Posts