Friday, June 23, 2017

How to Review Software Design



http://blog.bliley.com/5-simple-steps-for-truly-effective-design-reviews
Take notes in real time (digitally)
5. Follow up immediately
Now I know this may come as a surprise to most of us, but the majority of meetings are basically worthless.  They could just as well have been an email, saving everyone invited precious time to do something more productive.  Design reviews are one of those special exceptions that are actually worth pulling a bunch of highly paid engineers into a room for.

There's an epic email salutation on a site I found recently that says "Do good, crush evil, but document it."  Design reviews are certainly a place to do good.  They are a great way to crush evil design defects. So it only stands to reason that we should document them.

https://effectivesoftwaredesign.com/2010/10/24/effective-design-reviews/
Therefore, effective design reviews must include the presentation of several alternatives. The discussion should not be about whether a particular design satisfies the requirements. Correctness is the most fundamental attribute of a solution, so the diverse design choices must differ on other attributes that can be compared. In other words, each option has its advantages and disadvantages, and by choosing one of them we are actually deciding which attributes are the most important, but all options must be correct.
In this “multiple-choice” approach, the goal of the design review is to analyze several alternatives and understand their different implications. Then, if there is a solution that is clearly better than all others, according to some criteria, it should be the chosen one. Otherwise, we should try to look for additional options.
But here is the second problem with the previous definition of design review: It says that the "design is tested against its requirements." In most cases this means only the functional requirements, the specific demands related to the feature being implemented. This definition ensures that the design is correct, but it does not address how the new feature will affect the existing system architecture. In other words, this definition does not take the non-functional requirements into consideration.
Diverse design alternatives will probably have very different implications for non-functional requirements, and this should be used as the basis for comparison when selecting one of them.
If there are no alternatives, there is no real review being done.

“Then,” said Sloan, “I propose we postpone further discussion until our next meeting, to give ourselves time to develop disagreement and perhaps gain some understanding of what the decision is all about.”
So let’s hope we will have more disagreement in our next design reviews!
https://www.codeproject.com/Articles/20467/Software-Architecture-Review-Guidelines [Good]

https://www.quora.com/What-are-some-tips-for-conducting-a-software-design-review
- Your priority should be to understand objections and questions, not necessarily to answer them. People will come up with questions you haven't thought about and "helpful" suggestions that barely make sense. It's OK to note them and look into the issue after the meeting when you'll have time to carefully craft a diplomatic reply.

- You, the designer, should take notes on the feedback and decisions for your design, & circulate your notes promptly after the meeting. Bring a printout of your design spec to write notes on, and don't be afraid to ask the meeting to pause for a minute or two while you catch up on notes. I find that it's best for the project manager to take notes on action items that arise for people other than the designer, such as technical issues that need investigating by developers. 

- Keep three running lists: "Questions for management", "Questions for users", and "Parking lot (interesting ideas that we won't pursue at this time)".

Listen graciously and thank everyone for their input. People always want to leave a meeting with the feeling that they were listened to and understood

http://www.processimpact.com/articles/revu_sins.html
Reviewers Critique the Producer, Not the Product
Reviews Are Not Planned

http://design.engineering.dal.ca/sites/default/files/studentresources/files/design_reviews.pdf

https://medium.com/git-out-the-vote/strengthening-products-and-teams-with-technical-design-reviews-ae6a1bec5216
The document should have sections including:
  • Background
  • Design goals
  • System diagram
  • Design summary
  • Design details
  • Tradeoffs made

Saturday, June 10, 2017

TV & Movie Data




http://developer.tmsapi.com/page/Program_Metadata
audience (string): indicates program target audience, derived from genres; one of two possible values: Children, Adults only
recommendations (Recommendation[]): top 3 TMS editorial recommendations, similar to the given program
topCast (string[]): list of top three cast members, sorted by billingOrder ascending (not present in program details, which gives the full cast)
preferredImage (Image): program image; if none is found, a generic image for the program's entity type will be returned


tmsId (string): 14-character alphanumeric identifier for the program record; the first 2 characters generally identify program type (MV=movie, SH=show/series, EP=episode, SP=sports event). Where available, the parameter includeDetail=false will restrict program metadata to only programIds (tmsId, rootId, seriesId).

Movie Program Metadata


runTime (number): duration, specified in ISO-8601 format; PTxxHyyM = xx hours, yy minutes
qualityRating (QualityRating):
  • ratingsBody: critical reviewer; currently only TMS ratings are provided, where available
  • value: numeric rating; TMS has a 1 - 4 star rating system (with 4 as best quality); possible values are 1, 1.5, 2, 2.5, 3, 3.5, 4
officialUrl (string): official movie website, if available
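Since runTime comes back as an ISO-8601 duration, it can be decoded with the standard java.time API. A minimal sketch (the sample value is made up):

import java.time.Duration;

public class RunTimeDemo {
    public static void main(String[] args) {
        // TMS-style runTime, e.g. "PT2H15M" = 2 hours, 15 minutes
        Duration d = Duration.parse("PT2H15M");
        System.out.println(d.toMinutes()); // prints 135
    }
}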

https://en.wikipedia.org/wiki/Limited_release
In the United States motion picture industry, a limited release is where a new film is played in a select few theaters across the country, typically in major metropolitan markets.
A limited release is often used to gauge the appeal of specialty films, like documentaries, independent films, and art films. A common practice by film studios is to give highly anticipated and critically acclaimed films a limited release on or before December 31 in Los Angeles in order to qualify for an Academy Award nomination (as set by its rules). These films are almost always released to a wider audience in January or February of the following year.

API:
http://www.imdb.com/licensing/
IMDb licensing content includes: cast & crew, user ratings, plot summaries, release dates, box office, keywords, filmography credits, awards, biographies, nicknames and more.
http://www.imdb.com/interfaces

https://developer.fandango.com
The Tomatometer rating represents the percentage of positive professional reviews for movies and TV shows and is used by millions to guide entertainment viewing decisions.

Available Data

The Rotten Tomatoes API provides access to Rotten Tomatoes' ratings and reviews, allowing approved companies and individuals to enrich their applications and widgets with Rotten Tomatoes data.  By implementing the Rotten Tomatoes API, your users can access:
  • Critic and Audience Scores. Tomatometer and Audience scores for movies.
  • Critic Reviews. A sampling of critic reviews for each movie.
We do not provide:
  • Detailed movie metadata. We only provide a subset of metadata to help with title matching.
  • Posters and images. We do not offer high-resolution images; these will need to be sourced from a metadata provider. 
  • TV scores. Coming soon!
https://stackoverflow.com/questions/14172735/how-to-request-movie-data-from-rotten-tomatoes-using-their-json-api


http://variety.com/2016/digital/news/fandango-rotten-tomatoes-flixster-1201708444/
Online ticketing service Fandango has agreed to acquire Flixster and movie review-aggregator Rotten Tomatoes from Warner Bros.


Fandango will continue to operate as a unit of NBCUniversal.


Tuesday, June 6, 2017

Solr Query Relevancy



https://stackoverflow.com/questions/16655933/fuzzy-search-in-solr
http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html

FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.
The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).
FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other "close" terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.
So you need to write queries like this - Health~2
https://home.apache.org/~ctargett/RefGuidePOC/jekyll-full/the-dismax-query-parser.html

https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
Phrase Fields: boosts the score of documents in cases where all of the terms in the q parameter appear in close proximity.
Phrase Slop: specifies the number of positions two terms can be apart in order to match the specified phrase.
Query Phrase Slop: specifies the number of positions two terms can be apart in order to match the specified phrase. Used specifically with the qf parameter.


Boost Query: specifies a factor by which a term or phrase should be "boosted" in importance when considering a match.
Boost Functions: specifies functions to be applied to boosts. (See the function query documentation for details.)

The bq (Boost Query) Parameter

The bq parameter specifies an additional, optional, query clause that will be added to the user's main query to influence the score. For example, if you wanted to add a relevancy boost for recent documents:
q=cheese
bq=date:[NOW/DAY-1YEAR TO NOW/DAY]
You can specify multiple bq parameters. If you want your query to be parsed as separate clauses with separate boosts, use multiple bq parameters.
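For instance, to combine a recency boost with a category boost as separate clauses (a sketch; the category field and values are hypothetical):
q=cheese
bq=date:[NOW/DAY-1YEAR TO NOW/DAY]^2
bq=category:dairy^1.5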
https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

Using 'slop'

Dismax and Edismax can run queries against all query fields, and also run a query in the form of a phrase against the phrase fields. (This will work only for boosting documents, not actually for matching.) However, that phrase query can have a 'slop,' which is the distance between the terms of the query while still considering it a phrase match. For example:
q=foo bar
qf=field1^5 field2^10
pf=field1^50 field2^20
defType=dismax
With these parameters, the Dismax Query Parser generates a query that looks something like this:
(+(field1:foo^5 OR field2:foo^10) AND (field1:bar^5 OR field2:bar^10))
But it also generates another query that will only be used for boosting results:
field1:"foo bar"^50 OR field2:"foo bar"^20
Thus, any document that has the terms "foo" and "bar" will match; however if some of those documents have both of the terms as a phrase, it will score much higher because it's more relevant.
If you add the parameter ps=10 (phrase slop), the second query will instead be:
field1:"foo bar"~10^50 OR field2:"foo bar"~10^20
This means that if the terms "foo" and "bar" appear in the document with less than 10 terms between each other, the phrase will match. For example the doc that says:
*Foo* term1 term2 term3 *bar*
will match the phrase query.
How does one use phrase slop? Usually it is configured in the request handler (in solrconfig).
With query slop (qs) the concept is similar, but it applies to explicit phrase queries from the user. For example, if you want to search for a name, you could enter:
q="Hans Anderson"
A document that contains "Hans Anderson" will match, but a document that contains the middle name "Christian", or where the name is written with the last name first ("Anderson, Hans"), won't. For those cases one could configure the query slop (qs) parameter, so that even when the user searches for an explicit phrase query, a slop is applied.
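A sketch of what that buys you: with qs=1 the phrase query tolerates one position move, enough to absorb a middle name, and qs=2 also tolerates the swapped order, since a transposition costs two moves:
q="Hans Anderson"&qs=1    matches "Hans Christian Anderson"
q="Hans Anderson"&qs=2    also matches "Anderson, Hans"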
Finally, in addition to the phrase fields (pf) parameter, edismax also supports the pf2 and pf3 parameters, for fields over which to create bigram and trigram phrase queries.  The phrase slop for these parameters' queries can be specified using the ps2 and ps3 parameters, respectively.  If you use pf2/pf3 but not ps2/ps3, the phrase slop for those queries will be taken from the ps parameter, if any.

https://wiki.apache.org/solr/SolrRelevancyFAQ
How can I search for "superman" in both the title and subject fields


q=superman&qf=title subject
How can I make "superman" in the title field score higher than in the subject field
For the standard request handler, "boost" the clause on the title field:
q=title:superman^2 subject:superman
Using the dismax request handler, one can specify boosts on fields in parameters such as qf:
q=superman&qf=title^2 subject
Why are search results returned in the order they are?
If no other sort order is specified, the default is by relevancy score.

How can I make exact-case matches score higher

Example: a query of "Penguin" should score documents containing "Penguin" higher than docs containing "penguin".
The general strategy is to index the content twice, using different fields with different fieldTypes (and different analyzers associated with those fieldTypes). One analyzer will contain a lowercase filter for case-insensitive matches, and one will preserve case for exact-case matches.
Use copyField commands in the schema to index a single input field multiple times.

Once the content is indexed into multiple fields that are analyzed differently, query across both fields.
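A minimal schema sketch of this strategy (all field and type names here are made up, not from the FAQ):
<field name="content" type="text_lower" indexed="true" stored="true"/>
<field name="contentExactCase" type="text_case" indexed="true" stored="false"/>
<copyField source="content" dest="contentExactCase"/>
where text_lower includes solr.LowerCaseFilterFactory in its analyzer and text_case does not. A query can then weight the case-preserving field higher:
q=content:Penguin contentExactCase:Penguin^5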

How can I search for one term near another term (say, "batman" and "movie")

A proximity search can be done with a sloppy phrase query. The closer together the two terms appear in the document, the higher the score will be. A sloppy phrase query specifies a maximum "slop", or the number of positions tokens need to be moved to get a match.
This example for the standard request handler will find all documents where "batman" occurs within 100 words of "movie":
q=text:"batman movie"~100
The dismax handler can easily create sloppy phrase queries with the pf (phrase fields) and ps (phrase slop) parameters:
q=batman movie&pf=text&ps=100
The dismax handler also allows users to explicitly specify a phrase query with double quotes, and the qs(query slop) parameter can be used to add slop to any explicit phrase queries:

q="batman movie"&qs=100


Since debugQuery=on only gives you scoring "explain" info for the documents returned, the explainOther parameter can be used to specify other documents you want detailed scoring info for.
q=supervillians&debugQuery=on&explainOther=id:juggernaut

How do I give a negative (or very low) boost to documents that match a query?

True negative boosts are not supported, but you can use a very "low" numeric boost value on query clauses. In general the problem that confuses people is that a "low" boost is still a boost, it can only improve the score of documents that match. For example, if you want to find all docs matching "foo" or "bar" but penalize the scores of documents matching "xxx" you might be tempted to try...
    q = foo^100 bar^100 xxx^0.00001    # NOT WHAT YOU WANT
...but this will still help a document matching all three clauses score higher than a document matching only the first two. One way to fake a "negative boost" is to give a large boost to everything that does *not* match. For example...
    q =  foo^100 bar^100 (*:* -xxx)^999  
NOTE: When using (e)dismax, people sometimes expect that specifying a pure negative query with a large boost in the "bq" param will work (since Solr automatically makes top-level purely negative queries positive by adding an implicit "*:*"), but this doesn't work with "bq", because queries specified via "bq" are added directly to the main query. You need to be explicit...

    ? defType = dismax 
    & q = foo bar 
    & bq = (*:* -xxx)^999  
http://everydaydeveloper.blogspot.com/2012/02/solr-improve-relevancy-by-boosting.html
By boosting exact and phrase matches over plain query matches we can improve relevancy by a significant factor.
To set up a field (or fields) for exact matching, add another field in schema.xml and copy the content into it using copyField:
<field name="title" type="text" indexed="true" stored="true" />
<field name="titleExact" type="textExact" indexed="true" stored="true" />
<copyField source="title" dest="titleExact"/>
Notice that the data type for titleExact is set to "textExact" (defined below). A similar exact-match effect could be achieved by setting the type to "string", but by defining our own type we can fine-tune it with an appropriate tokenizer and filters.
<fieldType name="textExact" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="20"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="20"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Here I have used WhitespaceTokenizer without stopword or stemming filters. I am using LimitTokenCountFilter to limit the number of tokens and LowerCaseFilter to make matching case-insensitive. We can further fine-tune the textExact fieldType to make the exact match more lenient or strict per our use case.
Now, to boost the exact-match field and phrase matching, in solrconfig.xml:
<str name="qf">title titleExact^10</str>
<str name="pf">title^10 titleExact^100</str>

Now, for both query and phrase matching, we boost the exact-matching field "titleExact" higher than the non-exact field "title"; the same fields are also boosted higher for phrase search (pf) compared to query/keyword search (qf). This is a simple first step toward improving relevancy.
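Putting it together, a request against such a schema might look like this (a sketch, using the field weights above and a made-up query):
q=star wars&defType=edismax&qf=title titleExact^10&pf=title^10 titleExact^100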

Debug Solr Query
http://splainer.io/
http://blog.thedigitalgroup.com/dattatrayap/understanding-solr-explain/
select?q=summary:"Apache solr"&fl=id,summary,score&debugQuery=true
http://opensourceconnections.com/blog/2013/08/21/name-search-in-solr/
<fieldType name="AuthorsType" class="solr.TextField"  positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
OK, down to business. There are two main issues that, if we could tackle them, would cover a very large portion of the name-search problem:
  1. Author names get rearranged, either in the document or in the query, with some parts omitted: (Douglas Turnbull vs Turnbull, Douglas vs Turnbull Douglas G)
  2. Many names are abbreviated: (Doug Turnbull vs D. Turnbull vs D. G. Turnbull vs Douglas G. Turnbull)
  • Authors:"Douglas Turnbull"~2
Will happily return results where Douglas and Turnbull occur within 2 token moves of each other (regardless of order).


What happens when a user searches for Doug Turnbull and all Solr has indexed are references to Douglas Turnbull? To help with efficient prefix queries, Solr gives us the EdgeNGramFilterFactory. The EdgeNGramFilterFactory takes a token, say Douglas, and generates tokens based on slicing the string from either the front or the back of the string. For example, with minGramSize=1 and side=”front” the token “Douglas” will result in the following tokens:
Input:  douglas
Tokens: [d] [do] [dou] [doug] [dougl] [dougla] [douglas]
An important note about this filter (and many others in Solr) is that each generated token ends up occupying the same position in the indexed document. In reality, we can envision the document as two dimensional:
Position N:     Position N+1
    [d]         ->  [t]
    [do]            [tu]
    ...             ...
    [douglas]       [turnbull]
So a phrase query for “do turnbull” will hit on this document in the same location in this document as the phrase “douglas turnbull”. What a nice characteristic!
<field name="AuthorsPre" type="AuthorsPrefix" indexed="true" multiValued="true"/>
Copy Field:
<copyField source="Authors" dest="AuthorsPre"/>
Field Type:
<fieldType name="AuthorsPrefix" class="solr.TextField"  positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="200" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
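With this schema in place, partial names can be rewarded through the prefix field; a sketch (the query parameters are assumed, not from the article):
q=do turnbull&defType=edismax&qf=Authors&pf=AuthorsPre^10
Documents where the prefixes line up as a phrase in AuthorsPre ("do turnbull" hitting "douglas turnbull", as described above) get the boost.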
How to boost
http://lucene.472066.n3.nabble.com/How-do-I-use-multiple-boost-functions-td4144733.html
Use the edismax query parser; its boost parameter can take multiple values.

&boost=recip(geodist(destination,1.293841,103.846487),1,1000,1000) 
&boost=if(exists(query({!v=$b1})),100,0) 

if(termfreq(docType,"channel"),5,recip(ms(NOW/HOUR,releaseDate),3.16e-11,5,5))
A reciprocal function with recip(x,m,a,b) implementing a/(m*x+b). 
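Plugging the constants from the example above into a/(m*x+b), with m = 3.16e-11 (roughly 1/year in milliseconds) and a = b = 5:
  age = 0        ->  5/(0 + 5)  =  1.0
  age = 1 year   ->  5/(1 + 5)  ≈  0.83
  age = 5 years  ->  5/(5 + 5)  =  0.5
so a brand-new document keeps its full score and older ones decay smoothly.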
https://nolanlawson.com/2012/06/02/comparing-boost-methods-in-solr/
Summary of boost methods
Boost Method | Type | Input | Works With | Example
{!boost b} | Multiplicative | Function | lucene, dismax, edismax | q={!boost b=myBoostFunction()}myQuery
{!boost b} with variables | Multiplicative | Function | lucene, dismax, edismax | q={!boost b=$myboost v=$qq}&myboost=myBoostFunction()&qq=myQuery
bq (boost query) | Additive | Query | dismax, edismax | q=myQuery&bq=_val_:"myBoostFunction()"
bf (boost function) | Additive | Function | dismax, edismax | q=myQuery&bf=myBoostFunction()
boost | Multiplicative | Function | edismax | q=myQuery&boost=myBoostFunction()

Conclusions (TL;DR)

  1. Prefer multiplicative boosting to additive boosting.
  2. Be careful not to confuse queries with functions.
The {!boost b} and boost methods, on the other hand, apply a true multiplicative boost, using BoostedQuery. That is, they multiply the boost function’s score by whatever score would normally be spit out. This method is more faithful to the Lucene Javadoc for Similarity.java, and it seems to be the recommended choice, given how dismissively the word “additive” is tossed around in the documentation.
So basically, this is the boost you’re looking for. If you’re using the default lucene parser or the dismax parser, go with the {!boost b} method. If you’re using edismax, though, take advantage of the nice boost parameter and use that instead.
http://www.solrtutorial.com/boost-documents-by-age.html
One way is recip + linear, where recip computes an age-based score and linear scales it. The dismax query parser provides an easy way to apply boost functions through the bf parameter. Another way is recip + ms, e.g. bf=recip(ms(NOW,mydatefield),3.16e-11,1,1), the same shape used elsewhere in this post.
Either way, bf creates an additive boost. If you want a multiplicative boost for "type=active" you could add:
&boost=if(termfreq(type,"active"),2,1)
which gives a factor-2 boost for "type=active".
https://wiki.apache.org/solr/FunctionQuery

ord

ord(myfield) returns the ordinal of the indexed field value within the indexed list of terms for that field in lucene index order (lexicographically ordered by unicode value), starting at 1. In other words, for a given field, all values are ordered lexicographically; this function then returns the offset of a particular value in that ordering. The field must have a maximum of one value per document (not multiValued). 0 is returned for documents without a value in the field.
  • Example: if there were only three values for a particular field: "apple","banana","pear", then ord("apple")=1, ord("banana")=2, ord("pear")=3
  • Example Syntax: ord(myIndexedField)
  • Example SolrQuerySyntax: _val_:"ord(myIndexedField)"
WARNING: as of Solr 1.4, ord() and rord() can cause excess memory use since they must use a FieldCache entry at the top level reader, while sorting and function queries now use entries at the segment level. Hence sorting or using a different function query, in addition to ord()/rord() will double memory use.
WARNING: ord() depends on the position in an index and can thus change when other documents are inserted or deleted, or if a MultiSearcher is used.

rord

The reverse ordering of what ord provides.

  • Example Syntax: rord(myIndexedField)
  • Example: rord(myDateField) is a metric for how old a document is: the youngest document will return 1, the oldest document will return the total number of documents.


http://lucene.472066.n3.nabble.com/What-is-omitNorms-td2987547.html
  When you say "omitnorms=true"  for any fields it means SOLR will  not 
store norms . AFAIK , if you do not store these norms then your index size 
would be smaller and will take less memory  . You could safely omit these 
norms for smaller fields . 
i.e your indexing time is more. 

 So if you  do not store norms you save the memory 

Norms are used  to boosts and field length normalization during indexing 
time so that short document has higher score 
https://stackoverflow.com/questions/9261524/scoring-of-solr-multivalued-field
I have a document with a field for a person's name, where there may be multiple names for the same person. The names are all different (very different in some cases) but they all are the same person/document.
Person 1: David Bowie, David Robert Jones, Ziggy Stardust, Thin White Duke
Person 2: David Letterman
Person 3: David Hasselhoff, David Michael Hasselhoff
If I were to search for "David" I'd like for all of these to have about the same chance of a match. If each name is scored independently that would seem to be the case. If they are just stored and searched as a single field, David Bowie would be punished for having many more tokens than the others. How does Solr handle this scenario?
You can just run your query q=field_name:David with debugQuery=on and see what happens.
These are the results (included the score through fl=*,score) sorted by score desc:
<doc>
    <float name="score">0.4451987</float>
    <str name="id">2</str>
    <arr name="text_ws">
        <str>David Letterman</str>
    </arr>
</doc>
<doc>
    <float name="score">0.44072422</float>
    <str name="id">3</str>
    <arr name="text_ws">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.314803</float>
    <str name="id">1</str>
    <arr name="text_ws">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>
And this is the explanation:
<lst name="explain">
    <str name="2">
        0.4451987 = (MATCH) fieldWeight(text_ws:David in 1), product of: 1.0 = tf(termFreq(text_ws:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.625 = fieldNorm(field=text_ws, doc=1)
    </str>
    <str name="3">
        0.44072422 = (MATCH) fieldWeight(text_ws:David in 2), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.4375 = fieldNorm(field=text_ws, doc=2)
    </str>
    <str name="1">
        0.314803 = (MATCH) fieldWeight(text_ws:David in 0), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.3125 = fieldNorm(field=text_ws, doc=0)
    </str>
</lst>
The scoring factors here are:
  • termFreq: how often a term appears in the document
  • idf: how often the term appears across the index
  • fieldNorm: importance of the term, depending on index-time boosting and field length
In your example the fieldNorm makes the difference. You have one document with lower termFreq (1 instead of 1.4142135) since the term appears just one time, but that match is more important because of the field length.
The fact that your field is multiValued doesn't change the scoring. I guess it would be the same with a single value field with the same content. Solr works in terms of field length and terms, so, yes, David Bowie is punished for having many more tokens than the others. :)
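For reference, these explain values follow classic Lucene TF-IDF scoring (a sketch, assuming the old pre-BM25 DefaultSimilarity):
  score     = tf * idf * fieldNorm
  tf        = sqrt(termFreq)                 e.g. sqrt(2) = 1.4142135
  idf       = 1 + ln(numDocs/(docFreq+1))    here 1 + ln(3/4) = 0.71231794
  fieldNorm ≈ 1/sqrt(number of tokens), lossy-encoded in a single byte:
              "David Letterman" (2 tokens)  ->  1/sqrt(2) ≈ 0.707, stored as 0.625
              person 1's names (10 tokens)  ->  1/sqrt(10) ≈ 0.316, stored as 0.3125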
UPDATE
I actually think David Bowie deserves his opportunity. Like explained above, the fieldNorm makes the difference. Add the attribute omitNorms=true to your text_ws field in the schema.xml and reindex.
As you can see, now the termFreq wins and the fieldNorm is not taken into account at all. That's why the two documents with two occurrences of "David" are on top with the same score, despite their different lengths, and the shorter document with just one match is last with the lowest score. Here's the explanation with debugQuery=on:
<lst name="explain">
   <str name="1">
      1.0073696 = (MATCH) fieldWeight(text:David in 0), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=0)
   </str>
   <str name="3">
      1.0073696 = (MATCH) fieldWeight(text:David in 2), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=2)
   </str>
   <str name="2">
      0.71231794 = (MATCH) fieldWeight(text:David in 1), product of: 1.0 = tf(termFreq(text:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=1)
   </str>
</lst>
https://stackoverflow.com/questions/27919702/make-multivalued-field-scoring-aware-of-field-values-count-with-lucene
Make multivalued field scoring aware of field values count with Lucene
When testing on these four documents:
 - tags.put("doc1", "piano, electric guitar, violon");

 - tags.put("doc2", "piano, electric guitar");

 - tags.put("doc3", "piano");

 - tags.put("doc4", "electric guitar"); 
What I get is:
 - Score : 1.0 
 Doc : Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<id:doc4> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<tag:electric guitar>>
 - Score : 1.0 
 Doc : Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<id:doc2> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<tag:piano> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<tag:electric guitar>>
 - Score : 1.0 
 Doc : Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<id:doc1> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<tag:piano> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<tag:electric guitar> stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<tag:violon>>
  1. Add an IntField containing the number of tags of the doc
  2. Use a BoostedQuery using a Reciprocal function of the IntFieldSource
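A sketch of step 2 in Lucene's function-query API (assuming Lucene 6.x, where BoostedQuery still exists, and a hypothetical numTags field written at index time per step 1):

import org.apache.lucene.index.Term;
import org.apache.lucene.queries.function.BoostedQuery;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.valuesource.IntFieldSource;
import org.apache.lucene.queries.function.valuesource.ReciprocalFloatFunction;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// the original tag query
Query tagQuery = new TermQuery(new Term("tag", "piano"));
// 1/numTags: recip(x, m=1, a=1, b=0), so fewer tags => larger multiplier
ValueSource penalty = new ReciprocalFloatFunction(new IntFieldSource("numTags"), 1f, 1f, 0f);
// multiply the tag query's score by the penalty
Query scored = new BoostedQuery(tagQuery, penalty);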

http://www.typo3-media.com/blog/solr-recip-boosting.html
http://opensourceconnections.com/blog/2014/11/26/stepwise-date-boosting-in-solr/
the Solr function query documentation gives you a basic date boost:
boost=recip(ms(NOW,mydatefield),3.16e-11,1,1)
This will give you a nicely curving multiplicative downboost where the latest documents come first (multiplying the relevancy score by X for stuff “NOW”), slowly sloping off into the past (multiplying the relevancy score by some decreasing value < X as content moves into the past).
Let’s say that instead of a nice curving boost, you’d just like to downboost anything older than some date in the past. Say you set a marker date 10 years ago: anything older than that should get a significant downboost; anything newer should not be impacted.
Putting that together, you have something like:
boost=if(min(0, mydatefield-marker),0.8,1.0)
Expressing (mydatefield - marker) as a function query:
sub(ms(mydatefield),sub(ms(NOW),315569259747))
Putting it all together:
if(min(0,sub(ms(mydatefield),sub(ms(NOW),315569259747))),0.8,1)
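Reading that inside out (315569259747 ms is roughly 10 years):
  newer than 10 years:  sub(...) >= 0  ->  min(0, ...) = 0        ->  if() falls through to 1
  older than 10 years:  sub(...) <  0  ->  min(0, ...) is nonzero ->  if() returns the 0.8 downboost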
https://wiki.apache.org/solr/FunctionQuery

recip

A reciprocal function with recip(x,m,a,b) implementing a/(m*x+b). m,a,b are constants, x is any numeric field or arbitrarily complex function.
When a and b are equal, and x>=0, this function has a maximum value of 1 that drops as x increases. Increasing the value of a and b together results in a movement of the entire function to a flatter part of the curve. These properties can make this an ideal function for boosting more recent documents when x is rord(datefield).
  • Example Syntax: recip(rord(creationDate),1,1000,1000)

In Solr 1.4 and later, best practice is to avoid ord() and rord() and derive the boost directly from the value of the date field. See ms() for more details.

recency=product(recip(ms(NOW/HOUR,last_executed_on),1.27E-10,0.08,0.05),popularity)
Computes the product of the popularity and the age boost. The age boost is computed using Solr’s recip function: recip(x,m,a,b) = a / (m*x + b), where x is the document age and m, a, and b are parameters used to compute the age penalty. Document age is based on the number of milliseconds between the current time and the last_executed_on field.

https://stackoverflow.com/questions/22017616/stronger-boosting-by-date-in-solr
recip(x, m, a, b) implements f(x) = a/(xm+b) with :
  • x : the document age in ms, defined as ms(NOW,<datefield>).
  • m : a constant that defines the time scale used for the boost. It should be relative to what you consider an old document age (a reference_time) in milliseconds. For example, choosing a reference_time of 1 year (3.16e10 ms) means using its inverse: 3.16e-11 (1/3.16e10, rounded).
  • a and b are constants (defined arbitrarily).
  • xm = 1 when the document is 1 reference_time old (multiplier = a/(1+b)).
    xm ≈ 0 when the document is new, resulting in a value close to a/b.
  • Using the same value for a and b ensures the multiplier doesn't exceed 1 with recent documents.
  • With a = b = 1, a 1 reference_time old document has a multiplier of about 1/2, a 2 reference_time old document has a multiplier of about 1/3, and so on.
How can date boosting be made stronger?
  • Increase m: choose a lower reference_time, for example 6 months, which gives m = 6.33e-11. Compared to a 1-year reference, the multiplier decreases twice as fast as the document ages.
  • Decrease a and b: this expands the response curve of the function, which can be very aggressive. Example here (page 8)
  • Apply a boost to the boost function itself with the bf parameter, using the dismax or edismax query parser: bf=recip(ms(NOW,datefield),3.16e-11,1,1)^2.0
Note that bf behaves like an additive boost: it acts as a bonus added to the scores of newer documents, while {!boost b} acts more as a penalty applied to the scores of older documents. Still, an additive boost can be a good way to favor newer docs. Just remember that a bf score is independent of the global relevancy score, meaning a relevant result set (with higher scores) may be impacted less than a non-relevant result set (with lower scores); depending on your needs, that may or may not be what you want.
https://www.metaltoad.com/blog/date-boosting-solr-drupal-search-results
recip(abs(ms(NOW/HOUR,dm_field_date)),3.16e-11,1,.1)
This function will apply a +10 boost to a document with today's date, tapering off to about +1 after a year.
https://wiki.apache.org/solr/FunctionQuery#map
map(x,min,max,target) (Solr 1.3) and map(x,min,max,target,value) (Solr 1.4) map any values of the function x that fall within min and max inclusive to target. min, max, target, and value are constants. The function outputs the field's value (or "value") if it does not fall between min and max.
  • Example Syntax 1: map(x,0,0,1) changes any values of 0 to 1; useful in handling default 0 values
  • Example Syntax 2 (Solr 1.4): map(x,0,0,1,0) changes any values of 0 to 1, and if the value is not zero it can be set to the value of the 5th argument instead of defaulting to the field's value
  • Example Syntax 3 (Solr 1.3): map(price,0,100,0) returns 0 if price is between 0 and 100, otherwise returns price
Also, if x is NULL, it will match a min of 0.

https://stackoverflow.com/questions/6437512/order-by-an-expression-in-solr
Solr doesn't have an if/then (at least not until 4.0), but it does have a map function and the ability to use function queries in your sort. You can probably use something like this in your sort to achieve what you're after:
 ?q=*&sort=map(category,20,20,case,0),score desc
/select?q={!func}map(Category,20,20,1,0)&sort=score desc
The cool thing is that you can still sort on other fields, so:
&sort=score desc, name asc
http://robotlibrarian.billdueber.com/2012/03/requiringpreferring-searches-that-dont-span-multiple-values-sst-3/
 The query was davy jones. Document #1 contains a name that has both those terms, but document #2 (which has both terms, but in different names) gets a higher score.

It just concatenates them together and indexes them.


Solr does allow a phrase query to be “sloppy”, though – basically saying that instead of being right next to each other, the terms need to be within a certain number of tokens of each other.
    'pf' => 'name_text^10', # search this field as a phrase
    'ps' => '4' # allow 'phrase' to mean 'within 4 tokens of each other'

A positionIncrementGap of 1000 means: when computing slop, pretend there are 1000 tokens between the entries in a multiValued field.
A sloppy phrase search, then, will only find (and thus boost) the phrase if (a) the tokens are in the same entry of a multiValued field, and (b) your slop value is less than your positionIncrementGap.
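Concretely (made-up names, positionIncrementGap=1000):
  entry 1: "Davy Crockett"  ->  davy@0, crockett@1
  entry 2: "Indiana Jones"  ->  indiana and jones land roughly 1000 positions later
so a phrase query "davy jones"~4 can never bridge the gap between entries, while "davy crockett"~4 still matches within entry 1.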
All you have to do is use the pf and ps parameters and you’re set.
Note that this should be telling you two things:
  • Always use the same positionIncrementGap for your multiValued fields
  • Make it a number much larger than the maximum number of tokens you expect to ever have in a field.
Note that a large positionIncrementGap doesn’t actually put 1000 tokens in there – a large value doesn’t affect processing time or your index size or anything.
Slop of 4 makes the phrase “Sex in the City” be treated exactly the same as “In the Sex City”. If someone puts in an exact title, I want to reward them for that query by floating the exact match to the top, and slop prevents me from doing so.
qs is a dismax param that affects query slop – how much slop to allow in phrases within the query, much like the ps param.
The query
A three-token query
  'q' => 'Bill "The Weasel" Dueber'

…has three tokens, the second of which (“The Weasel”) is a phrase. It’s that phrase token that is affected by query slop.
  • Solr doesn’t really separate multiple values from each other in a multiValued field
  • Phrase slop (ps) and query slop (qs) can be used to allow “phrase” to mean “a bunch of tokens within X spots of each other”
slop refers to the number of positions one token needs to be moved in relation to another token in order to match a phrase specified in a query.

The number of other words permitted between the words of a query phrase is called "slop". We can use the tilde symbol, "~", at the end of our phrase for this. The smaller the distance between two terms, the higher the score will be. A sloppy phrase query specifies a maximum "slop", i.e. the number of positions tokens need to be moved to get a match. The slop is zero by default, requiring exact matches.

ps (Phrase Slop) affects boosting: if you play with the ps value, numFound and the result set do not change, but the order of the results does. More exact matches score higher than sloppier matches, so search results are sorted by exactness.
q=digital group&pf=text&ps=100
The dismax handler also allows users to explicitly specify a phrase query with double quotes, and the qs (query slop) parameter can be used to add slop to any explicit phrase queries.
q="digital group"&qs=100
<requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="echoParams">explicit</str>
        <str name="qf">field</str>
        <str name="qs">10</str>
        <str name="pf">field</str>
        <str name="ps">10</str>
        <str name="q.alt">*:*</str>
    </lst>
</requestHandler>
By setting the qf (Query Fields), qs (Query Phrase Slop), pf (Phrase Fields), ps (Phrase Slop), pf2 (Phrase bigram fields), ps2 (Phrase bigram slop), pf3 (Phrase trigram fields), and ps3 (Phrase trigram slop) parameters you can control which fields are searched and how. Usually the words are searched individually across all the fields and then scored according to proximity.
Order of the words also matters, so "word1 word2" is scored differently than "word2 word1", because a different number of transpositions is required.
