Thursday, December 15, 2016

Solr - Machine Learning



Example
https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
./svm_rank_learn -c 3 example3/train.dat example3/model
svm_rank_classify example3/test.dat example3/model example3/predictions

http://www.ibm.com/watson/developercloud/doc/retrieve-rank/

https://cwiki.apache.org/confluence/display/solr/Result+Reranking

https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking
Query Re-Ranking allows you to run a simple query (A) for matching documents and then re-rank the top N documents using the scores from a more complex query (B). Since the more costly ranking from query B is only applied to the top N documents, it has less impact on performance than just using the complex query B by itself. The trade-off is that documents which score very low using the simple query A may not be considered during the re-ranking phase, even if they would score very highly using query B.

q=greetings&rq={!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=3}&rqq=(hi+hello+hey+hiya)

It can be used in conjunction with the collapse parser to re-rank the group heads after they've been collapsed. It also preserves the order of documents elevated by the elevation component, and it even has its own custom explain so you can see how the re-ranking scores were derived when looking at debug information.

https://issues.apache.org/jira/browse/SOLR-8183
https://github.com/ajkl/solr-ltr-plugin
https://issues.apache.org/jira/browse/SOLR-8542
This is a ticket to integrate learning to rank machine learning models into Solr. Solr Learning to Rank (LTR) provides a way for you to extract features directly inside Solr for use in training a machine learned model. You can then deploy that model to Solr and use it to rerank your top X search results. This concept was previously presented by the authors at Lucene/Solr Revolution 2015.

https://findwise.com/blog/improve-search-relevance-using-machine-learning-statistics-apache-solr-learning-rank/
1) Select a supported training library for linear SVM and one for LambdaMART (the libraries already suggested in the README could be a starting point).
2) Create an Update Request Handler that accepts the training set (the format of the training set will be clearly described in the documentation, e.g. LETOR).
This update handler will take the training set file and the related parameters supported by the chosen library and proceed with the training.
Default configuration parameters should be used where possible, to keep the user interaction as easy as possible.
The update handler will then extract the document features (revisiting the cache could be interesting here, to improve the reuse of feature extraction).
3) The update request handler will train the model by calling the selected library internally, using all the parameters provided. The generated model will be converted into the supported JSON format and stored in the model store.
This simple approach can be made as complicated as we want (we can add flexibility in the libraries used and make it easy to extend).
A further step could be to add a layer of signal processing directly in Solr, to build the training set as well:
a sort of REST API that takes in input the document, queryId and rating score, and automatically creates an entry of the training set, stored in some smart way.
Then we can trigger the model generation, or set up a schedule to refresh the model automatically.
We could even take into account only certain periods, store training data in different places, clean the training set automatically from time to time, etc.

http://alexbenedetti.blogspot.com/2016/07/solr-is-learning-to-rank-better-part-1.html
Introducing supervised learning from user behaviour and signals can improve the relevancy of the retrieved documents, bringing a new approach to ranking them.

Collecting Signals

The start of the journey is the signal collection: it is a key phase and involves modelling the supervised training set that will be used to train the model.
A training set can be Explicit or Implicit.

Implicit

An Implicit training set is collected from user behaviours and interactions.
e.g.
Historical sales and transactions of an E-commerce website
User Clicks in the search result page
Time spent on each document accessed

A training set of this type is quite noisy but allows you to collect a great number of signals with little effort.
The more the user engaged with a particular document, the stronger the relevancy signal.
e.g.
A sale of a product is a stronger relevancy signal than adding it to the basket
User clicks in the search result page, in comparison with documents shown but not clicked
The longer you read a document, the stronger the relevancy

Pros : Cheap to build
Cons : Noisy

Explicit

An Explicit training set is collected directly from the interaction with Human Experts.
Feature Engineering
Document Features (query independent)
Query Dependent Features 
This kind of feature depends on both the query and the document.
e.g.
Does the document contain the query text in the title?
Is the document (product) of the same brand as expressed in the query?
Query Level Features 
This kind of feature depends only on the query.
e.g.
Number of words in the query
Cuisine type selected (e.g. "Italian", "Sushi" when searching for restaurants)
Date selected (e.g. when searching in a hotel booking system)
Department selected (e.g. "electronics", "kitchen", "DIY" ... in an E-commerce website)

User Dependent Features
This kind of feature also does not depend on the document;
it only depends on the user running the query.
e.g.
Device 
Age of the user
Gender

As described, for the convenience of the mathematical algorithms each high-level feature must be modelled as a numeric feature.
In the real world a feature describes an aspect of the object (document) and must be represented accordingly:

Ordinal Features
An ordinal feature represents a numerical value with a certain position in a sequence of numbers.
e.g.
Star Rating (for a document describing a hotel)
Price (for a document describing an e-commerce product)

For the Star Rating feature, there is an order among the different values:
1 < 2 < 3 < 4 < 5 is logically correct.
For the Price feature, the same observation applies:
$100 < $200 < $300
A feature is Ordinal when it is possible to compare different values and decide their ranking.

Categorical Features
A categorical feature represents an attribute of an object that has a set of distinct possible values.
In computer science it is common to call the possible values of a categorical feature an Enumeration.
e.g.
Colour ( for a document describing a dress)
Country ( for a document describing a location)

It is easy to observe that giving an order to the values of a categorical feature does not make any sense.
For the Colour feature :
red < blue < black has no general meaning.

Binary Features
A binary feature represents an attribute of an object that can have only two possible values.
Traditionally 0 / 1 in accordance with the binary numeral system.

One Hot Encoding

When a categorical feature describes your Document, it is tempting to represent each category as an Integer Id : 
e.g.
Categorical Feature : colour
Distinct Values : red, green, blue
Representation : colour:1, colour:2 colour:3 

With this strategy, the machine learning algorithm will be able to manage the feature values...
But is the information we pass the same as the original one?
Representing a categorical feature as an ordinal feature introduces an additional ordinal relationship:
1 (red) < 2 (green) < 3 (blue)
which doesn't reflect the original information.

There are different ways to encode categorical features to make them understandable by the training algorithm. Basically we need to encode the original information the feature provides in a numeric form, without any loss or addition if possible.
One possible approach is called One Hot Encoding [3]:
Given a categorical feature with N distinct values, encode it in N binary features, each feature will state if the category applies to the Document.
e.g.
Categorical Feature : colour
Distinct Values : red, green, blue
Encoded Features : colour_red, colour_green, colour_blue 

A document representing a blue shirt will be described by the following feature vector :
... colour_red:0 colour_green:0 colour_blue:1 ...

One Hot Encoding is really useful to properly model your information, but take care of the cardinality of your categorical feature, as this will be reflected in the number of final features that describe your signal.
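A minimal Python sketch of the encoding (the feature names are illustrative):

colours = ["red", "green", "blue"]   # distinct values of the categorical feature

def one_hot(value, categories):
    # one binary feature per category: 1 if the category applies to the document, 0 otherwise
    return {"colour_" + c: (1 if value == c else 0) for c in categories}

print(one_hot("blue", colours))
# {'colour_red': 0, 'colour_green': 0, 'colour_blue': 1}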
Feature Value Quantisation
Missing Values
Outliers

Feature Definition

  • Keep it simple : start from a limited set of features which are really fundamental to describe your problem; the model produced will be quite poor, but at least you have a baseline.
  • Iteratively train the model : remove or add a feature at each execution. This is time-consuming but will allow you to identify clearly which features really matter.
  • Visualise Important Features : after you have trained the model, use a visualisation tool to verify which features appear most often.
  • Meet the Business
Data Preparation
  •  Poor signal quality (noise)
  •  Incomplete feature vector representation
  •  Non-uniform distribution of relevant / not relevant documents per query

Noise Removal

Given a scale 1 to 3 :
1 - User clicked the product
2 - User added the product to the basket
3 - User bought the product
A single user episode can generate all three signals for the same query-document pair, each with a different relevancy rating. This doesn't help the training algorithm at all, so we need to find a strategy to avoid it.
One possible way is to keep only the strongest signal per query-document pair per user episode.
In the case of a user buying a product, we avoid storing 3 signals in the training set and keep only the most relevant one.
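One way to implement the "keep only the strongest signal" strategy could look like this Python sketch (the signal layout and field names are assumptions):

# raw signals: (query_id, doc_id, user_episode, relevancy) where 1=click, 2=basket, 3=sale
signals = [
    ("q1", "d1", "u1", 1),
    ("q1", "d1", "u1", 2),
    ("q1", "d1", "u1", 3),
    ("q1", "d2", "u1", 1),
]

strongest = {}
for query, doc, episode, relevancy in signals:
    key = (query, doc, episode)
    # keep only the most relevant signal per query-document pair per user episode
    strongest[key] = max(strongest.get(key, 0), relevancy)

print(strongest)
# {('q1', 'd1', 'u1'): 3, ('q1', 'd2', 'u1'): 1}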

Unbalanced Dataset

A dataset is unbalanced when the relevancy classes are not represented equally in the dataset, i.e. we have many more samples of one relevancy class than of another.
Taking again the E-commerce example, the number of strongly relevant signals (sales) will be much smaller than the number of weak signals (clicks).
This imbalance makes life harder for the training algorithm, as each relevant signal can be drowned out by many more weakly relevant ones.

Collect more data

Resample the dataset
You can manipulate the data you collected to have more balanced data.
This change is called sampling your dataset and there are two main methods that you can use to even-up the classes: Oversampling and Undersampling[5].
You can add copies of instances from the under-represented relevancy class, this is called over-sampling, or
you can delete instances from the over-represented class, this technique is called under-sampling.
These approaches are often very easy to implement and fast to run. They are an excellent starting point.

Some ideas and suggestions : 
  • Consider testing under-sampling when you have a lot of data (tens or hundreds of thousands of instances or more); you can undersample randomly or following any business logic available
  • Consider testing different resampling ratios (it is not necessary to target a 1:1 ratio)
  • When oversampling, consider advanced approaches: instead of simply duplicating existing samples, it could be good to artificially generate new ones [6]
Be careful, resampling does not always help your training algorithm, so experiment carefully with your use case (see the sketch below).
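A naive random over/under-sampling sketch in plain Python (the data layout and target sizes are illustrative; more advanced sample-generation techniques exist [6]):

import random

def undersample(majority, target_size):
    # randomly drop instances from the over-represented class
    return random.sample(majority, target_size)

def oversample(minority, target_size):
    # duplicate random instances from the under-represented class
    return minority + [random.choice(minority) for _ in range(target_size - len(minority))]

clicks = [("q1", "doc%d" % i, 1) for i in range(1000)]   # weak signals
sales  = [("q1", "doc%d" % i, 3) for i in range(30)]     # strong, relevant signals

# here we end up with a 1:1 ratio, but other ratios are worth testing too
balanced = undersample(clicks, 90) + oversample(sales, 90)
print(len(balanced))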
Generally, Ranking SVM includes three steps at training time:
  • It maps the similarities between queries and the clicked results onto a certain feature space.
  • It calculates the distances between any two of the vectors obtained in step 1.
  • It forms an optimization problem which is similar to a standard SVM classification and solves this problem with the regular SVM solver

The Ranking SVM function maps each search query to the features of each sample.
This mapping function projects each data pair onto a feature space.
Each sample ( query-document ) of the training data will be used for the Ranking SVM algorithm to refine the mapping.

Given the list of features that describe our problem, an SVM model will assign a numerical weight to each of them; the resulting function will assign a score to a document, given in input the feature vector that describes it.

Given the documents (expressed with their feature vector):
D1 [1.0, 100, 1]
D2 [0.0, 80, 1]

Applying the model we get the scores :

Score(D1) = 1.0 * userTextTitleMatch(1.0) + 0.5 * originalScore(100) + 0.1 * isBook(1.0) = 51.1
Score(D2) = 1.0 * userTextTitleMatch(0.0) + 0.5 * originalScore(80) + 0.1 * isBook(1.0) = 40.1
Document D1 is more relevant according to the model.
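In code, this linear model is just a dot product between the learned weights and the feature vector; a minimal Python sketch reproducing the numbers above:

weights = {"userTextTitleMatch": 1.0, "originalScore": 0.5, "isBook": 0.1}

def score(features):
    # linear model: weighted sum of the feature values
    return sum(weights[name] * value for name, value in features.items())

d1 = {"userTextTitleMatch": 1.0, "originalScore": 100, "isBook": 1.0}
d2 = {"userTextTitleMatch": 0.0, "originalScore": 80,  "isBook": 1.0}

print(score(d1), score(d2))   # 51.1 40.1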

Key points : 

  • easy to debug and understand
  • linear function
LambdaMART is a tree ensemble based model.
Each tree of the ensemble is a weighted regression tree and the final predicted score is the weighted sum of the prediction of each regression tree.
A regression tree is a decision tree that takes in input a feature vector and returns a scalar numerical value in output.
At a high level, LambdaMART is an algorithm that uses gradient boosting to directly optimize Learning to Rank specific cost functions such as NDCG.

To understand LambdaMART let's explore the main two aspects: Lambda and MART.
MART
LambdaMART is a specific instance of Gradient Boosted Regression Trees, also referred to as Multiple Additive Regression Trees (MART).
Gradient Boosting is a technique for forming a model that is a weighted combination of an ensemble of “weak learners”. In our case, each “weak learner” is a decision tree.
Lambda
At each training point we need to provide a gradient that allows us to optimise the cost function we selected (NDCG, for example).
To solve the problem LambdaMART uses an idea coming from LambdaRank:
at each training point we calculate a value (the lambda) that acts as the gradient we require; this component effectively modifies the ranking, pushing groups of documents up or down, and drives the leaf values of the trees that cover that training point.
LambdaMART is currently considered one of the most effective models, as proved by a number of contest wins.
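To make the "weighted sum of regression trees" idea concrete, here is a toy Python sketch of evaluating an already-trained ensemble (the tree structure, weights and values are invented for illustration; this is not the Solr LTR JSON model format):

# each tree: a weight plus a nested dict with a split feature, a threshold and left/right branches
# (leaves are plain numbers)
ensemble = [
    (0.6, {"feature": "originalScore", "threshold": 90.0, "left": 0.2, "right": 0.8}),
    (0.4, {"feature": "userTextTitleMatch", "threshold": 0.5, "left": 0.1, "right": 0.9}),
]

def eval_tree(node, features):
    if not isinstance(node, dict):   # leaf: return its value
        return node
    branch = "left" if features[node["feature"]] <= node["threshold"] else "right"
    return eval_tree(node[branch], features)

def score(features):
    # final prediction: weighted sum of the individual regression tree outputs
    return sum(weight * eval_tree(tree, features) for weight, tree in ensemble)

print(score({"originalScore": 100, "userTextTitleMatch": 1.0}))   # 0.6*0.8 + 0.4*0.9 = 0.84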


Evaluation Metric

Evaluating the model is vital for the validation phase during training, for testing the model against new data, and for consistently assessing the predictive quality of the trained model.

http://alexbenedetti.blogspot.com/2016/08/solr-is-learning-to-rank-better-part-3.html
The tools currently available provide support to:
- index the model  in a Solr collection
- index the training set in a Solr collection 
- print the top scoring leaves from a LambdaMART model



To use the LTR tools you must proceed with these simple steps :
  • set up the Solr backend - this will be a fresh Solr instance with 2 collections: models and trainingSet; the simple configuration is available in ltr-tools/configuration
  • gradle build - this will package the executable jar in ltr-tools/ltr-tools/build/libs

java -jar ltr-tools-1.0.jar -tool modelIndexer -model /models/lambdaMARTModel1.json -solrURL http://localhost:8983/solr/models
http://alexbenedetti.blogspot.com/2016/10/solr-is-learning-to-rank-better-part-4.html
The LTR Solr plugin allows Solr to rerank the search results by evaluating a provided LTR model.
The main responsibilities of the plugin are :
- storage of feature definitions
- storage of models
- feature extraction and caching
- search result reranking

The feature vector is the mathematical representation of each document/query pair, and the model will score each search result according to that vector.
Of course we need to tell Solr how to generate the feature vector for each document in the search results.
Here comes the Feature Definition file:
a JSON array describing all the relevant features necessary to score our documents through the machine learned LTR model.

A Solr feature is defined by a Solr query, following the Solr syntax.
The value of the Solr feature is calculated as the return value of the query run against the document we are scoring.
This feature can depend on query-time parameters or can be query independent (see examples).
e.g. 
"params":{"fq": ["{!terms f=category}book"] } 
- Query Independent
- Boolean feature
If the document matches the term 'book' in the field 'category', the feature value will be 1.
It is query independent, as no query parameter affects this calculation.
"params":{"q": "{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)"}
- Query Dependent
- Ordinal feature
The feature value will be calculated as the result of the function query: the more recent the document, the closer the value is to 1.
It is query dependent, as 'NOW' affects the feature value.
"params":{"q": "{!field f=title}${user_text}" }
- Query Dependent
- Ordinal feature
The feature value will be calculated as the result of the query: the more relevant the title content is for the user query, the higher the value.
It is query dependent, as the 'user_text' query parameter affects the calculation.

{
  "name":"book_price",
  "class":"org.apache.solr.ltr.feature.FieldValueFeature",
  "params":{"field":"book_price"}
}
{
   "name" : "userFromMobile",
   "class" : "org.apache.solr.ltr.feature.ValueFeature",
   "params" : { "value" : "${userFromMobile:<default>}", "required":true }
}

- Query Independent
A FieldValue feature is defined by a Solr field.
The value of the feature is calculated as the content of the field for the document we are scoring.
The field must be STORED or DOC-VALUED. This feature is query independent (see examples).
e.g.
"params":{"field":"book_price"}
- Query Independent
- Ordinal feature
The value of the feature will be the content of the 'book_price' field for a given document.
It is query independent as no query param affects this calculation.

A Value feature is defined by a constant or an external query parameter.
The value of the feature is calculated as the value passed in the Solr request as an EFI (External Feature Information) parameter, or as a constant.
This feature depends only on the configured parameter (see examples).
e.g.
"params" : { "value" : "${user_from_mobile:}", "required":false }
- Query Level
- Boolean feature
The user will pass the 'userFromMobile' request parameter as an EFI.
The value of the feature will be the value of that parameter.
The default value will be assigned if the parameter is missing in the request.
If the feature is required, an exception will be thrown when the parameter is missing in the request.

"params" : { "value" : "5", "required":false } 
- Constant
- Ordinal feature
The feature value will be calculated as the constant value '5'.

Apart from the constant, nothing affects the calculation.

EFI ( External Feature Information )

As you noticed in the feature definition json, external request parameters can affect the feature extraction calculation.
rq={!ltr reRankDocs=3 model=externalmodel efi.user_from_mobile=1}
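For example, assuming a model named 'externalmodel' is already deployed and the requests library is available, the same kind of rerank request could be issued from Python against the techproducts example collection (collection and field names follow the examples in this post):

import requests

params = {
    "q": "test",
    "rq": "{!ltr reRankDocs=25 model=externalmodel efi.user_from_mobile=1}",
    "fl": "id,score,[features]",   # [features] also returns the extracted feature values
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/techproducts/query", params=params)
print(resp.json()["response"]["docs"])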



./dist/solr-ltr-6.4.0-195.jar

./bin/solr -e techproducts

  1. curl -XPUT 'http://localhost:8983/solr/techproducts/schema/feature-store' --data-binary "@./contrib/ltr/example/techproducts-features.json" -H 'Content-type:application/json'
     curl -XPUT 'http://localhost:8983/solr/techproducts/schema/model-store' --data-binary "@./contrib/ltr/example/techproducts-model.json" -H 'Content-type:application/json'
    • Access to the default feature store
      http://localhost:8983/solr/techproducts/schema/feature-store/_DEFAULT_
    • Access to the model store
      http://localhost:8983/solr/techproducts/schema/model-store
    • Perform a reranking query using the model, and retrieve the features
      http://localhost:8983/solr/techproducts/query?indent=on&q=test&wt=json&rq={!ltr%20model=linear%20reRankDocs=25%20efi.user_query=%27test%27}&fl=[features],price,score,name

http://localhost:8983/solr/techproducts/query?indent=on&q=test&wt=json&rq={!ltr model=linear reRankDocs=25 efi.user_query='test'}&fl=[features],price,score,name
  1. python train_and_upload_demo_model.py -c config.json
    This script deploys your features from config.json "featuresFile" to Solr. Then it takes the relevance judged query document pairs of "userQueriesFile" and merges it with the features extracted from Solr into a training file. That file is used to train a linear model, which is then deployed to Solr for you to rerank results.
  2. Search and rerank the results using the trained model
    http://localhost:8983/solr/techproducts/query?indent=on&q=test&wt=json&rq={!ltr%20model=ExampleModel%20reRankDocs=25%20efi.user_query=%27test%27}&fl=price,score,name
  <queryParser name="ltr" class="org.apache.solr.search.LTRQParserPlugin" />
Currently the Learning to Rank plugin supports 2 generalized forms of models: 1. Linear Model, i.e. RankSVM, Pranking, and 2. Multiple Additive Trees, i.e. LambdaMART, Gradient Boosted Regression Trees (GBRT).

https://www.csie.ntu.edu.tw/~cjlin/liblinear/
- make

https://www.youtube.com/watch?v=M7BKwJoh96s

https://cwiki.apache.org/confluence/display/solr/Internal+-+TODO+List

https://github.com/bloomberg/lucene-solr/tree/master-ltr-plugin-release/solr/contrib/ltr
https://github.com/apache/lucene-solr/blob/master/solr/contrib/ltr/README.md


https://en.wikipedia.org/wiki/Learning_to_rank
https://lucidworks.com/blog/2016/08/17/learning-to-rank-solr/


click statistics
https://www.safaribooksonline.com/blog/2014/11/04/implementing-popularity-boosting-in-search/
http://stackoverflow.com/questions/22812761/solr-click-scoring-implementation

http://www.slideshare.net/LucidImagination/bialecki-andrzej-clickthroughrelevancerankinginsolrlucidworksenterprise
Many techniques available in Solr / Lucene:
• Indexing-time: text analysis, morphological analysis, synonyms, ...
• Query-time: boosting, rewriting, synonyms, DisMax, function queries, ...
• Editorial ranking (QueryElevationComponent)
No direct feedback from users on relevance.
What user actions do we know about?
• Search, navigation, click-through, other actions...

Click-through: the user selects an item at a position among a number of results for a query.
Why this information may be useful:
• "Indicates" the user's interest in a selected result
• "Implies" that the result is relevant to the query
• "Significant" when low-ranking results are selected
• "May be" considered as the user's implicit feedback

Click-through in context:
• Query log, click positions and click intervals provide a context
• Source of spell-checking data: query reformulation until a click event occurs
• Click events per user, total or during a session: building a user profile (e.g. topics of interest)
• Negative click events: the user did NOT click the top 3 results - demote?
• Clicks of all users for an item (or a query, or both): item popularity or relevance to queries
• Goal: analysis and modification of result ranking

• Clicking through == adding labels
• Collaborative filtering, recommendation systems
• Topic discovery & opinion mining
• Tracking the topic / opinion drift over time
• Click-stream is sparse and noisy - caveat emptor

• Query log, with unique id = f(user, query, time): user id (or group), query (+ facets, filters, origin, etc.), number of returned results, context (suggestions, autocomplete, "more like this" terms, ...)
• Click-through log: query id, document id, click position & click timestamp
What data would we like to get?
• A map of docId => aggregated queries, aggregated users, and a weight factor f(clickCount, positions, intervals)

Positive feedback loop:
clicked -> highly ranked -> clicked -> even higher ranked ...
• f(click data) should be sub-linear
• f(click data, time) should discount older clicks
• f(click data) should be sanitized and bounded (a toy sketch follows below)
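A toy Python illustration of those constraints; the exact shape of the function is an invented example, not something prescribed by the slides, but it is sub-linear in the click count, time-decayed, and bounded:

import math, time

def click_boost(click_timestamps, now=None, half_life_days=30.0, cap=2.0):
    # hypothetical boost function: older clicks count less (exponential decay),
    # growth is sub-linear in the decayed click count, and the result is bounded
    now = now if now is not None else time.time()
    decayed = sum(0.5 ** ((now - t) / (half_life_days * 86400)) for t in click_timestamps)
    return min(cap, math.log1p(decayed))

now = time.time()
# boost for a document clicked 1, 10 and 100 days ago
print(click_boost([now - days * 86400 for days in (1, 10, 100)], now=now))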

Optimizing Search Engines using Clickthrough Data
Kendall's τ
For comparing the ordinal correlation of two random variables, Kendall's τ is the most frequently used measure in statistics. For two finite strict orderings ra ⊂ D × D and rb ⊂ D × D, Kendall's τ can be defined based on the number P of concordant pairs and the number Q of discordant pairs (inversions):

τ(ra, rb) = (P − Q) / (P + Q)

A pair di ≠ dj is concordant if both ra and rb agree in how they order di and dj; it is discordant if they disagree.
Since P + Q = C(m, 2) for a fixed collection of m documents, the measure depends only on Q for a fixed collection.

http://www.cs.cornell.edu/people/tj/publications/joachims_etal_05a.pdf
Accurately Interpreting Clickthrough Data as Implicit Feedback

https://speakerdeck.com/anilshanbhag/optimizing-search-engines-using-clickthrough-data
TF-IDF -> PageRank -> ML
Term Frequency-Inverse Document Frequency is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.

• The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.

PageRank gives a notion of how well linked a document is on the web. This is a good indicator of the quality of a web page.

Learning to rank or machine-learned ranking (MLR) is a type of supervised or semi-supervised machine learning problem in which the goal is to automatically construct a ranking model from training data.
• Training data consists of lists of items with some partial order specified between items in each list.
• This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item.
• The ranking model's purpose is to rank, i.e. produce a permutation of items in new, unseen lists in a way that is "similar" to rankings in the training data in some sense.

Click-through data can be thought of as triplets (q, r, c) where:
• q is the user query,
• r is the ranking presented to the user, and
• c is the set of links the user clicked on.

There are strong dependencies between the three parts of (q, r, c):
• The presented ranking r depends on the query q, as determined by the retrieval function implemented in the search engine.
• Furthermore, the set c of clicked-on links depends on both the query q and the presented ranking r.
• A user is more likely to click on a link if it is relevant to q. While this dependency is desirable and interesting for analysis, the dependency of the clicks on the presented ranking r muddies the water.

• Thus we can only get relative relevance judgements rather than absolute relevance.

For a ranking (link1, link2, link3, ...) and a set C containing the ranks of the clicked-on links, extract a preference example
link_i <_{r*} link_j
for all pairs 1 ≤ j < i, with i ∈ C and j ∉ C.

Both r* and r_f(q) are binary relations over D × D, such that if a document d_i is ranked higher than d_j for an ordering r, i.e. d_i <_r d_j, then (d_i, d_j) ∈ r, otherwise (d_i, d_j) ∉ r.
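That extraction rule translates directly into code; a small Python sketch with toy data (the link names are placeholders):

# ranking as presented to the user (position 1 first) and the set of clicked positions
ranking = ["link1", "link2", "link3", "link4", "link5"]
clicked = {3, 5}   # the user clicked the results at positions 3 and 5

preferences = []
for i in sorted(clicked):
    for j in range(1, i):          # every position above the clicked one
        if j not in clicked:
            # the clicked link at position i is preferred over the skipped link at position j
            preferences.append((ranking[i - 1], ranking[j - 1]))

print(preferences)
# [('link3', 'link1'), ('link3', 'link2'), ('link5', 'link1'), ('link5', 'link2'), ('link5', 'link4')]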

Kendall's τ as performance measure:
A pair d_i ≠ d_j is concordant if both ra and rb agree in how they order d_i and d_j; it is discordant if they disagree.
• P : number of concordant pairs
• Q : number of discordant pairs
• P + Q = C(m, 2) on a finite domain D of m documents (a small computation sketch follows)
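A Python sketch computing τ = (P − Q) / (P + Q) for two orderings of the same m documents (document ids and positions are illustrative):

from itertools import combinations

def kendall_tau(rank_a, rank_b):
    # rank_x maps a document id to its position in ordering x (lower = ranked higher)
    p = q = 0
    for di, dj in combinations(list(rank_a), 2):
        agree = (rank_a[di] < rank_a[dj]) == (rank_b[di] < rank_b[dj])
        if agree:
            p += 1   # concordant pair
        else:
            q += 1   # discordant pair
    return (p - q) / (p + q)   # P + Q = C(m, 2)

ra = {"d1": 1, "d2": 2, "d3": 3, "d4": 4}
rb = {"d1": 1, "d2": 3, "d3": 2, "d4": 4}
print(kendall_tau(ra, rb))   # one inversion out of 6 pairs -> (5 - 1) / 6 ≈ 0.67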

Ranking SVM (Support Vector Machine)
• Ranking SVM is used to adaptively sort web pages by their relevance to a specific query.
• Generally, Ranking SVM includes three steps in the training period:
1. It maps the similarities between queries and the clicked pages onto a certain feature space.
2. It calculates the distances between any two of the vectors obtained in step 1.
3. It forms an optimization problem which is similar to a standard SVM classification and solves it with the regular SVM solver.

Rank-SVM
https://www.quora.com/What-is-the-intuition-for-SVM-Rank-and-when-should-I-use-it
SVM-Rank is a technique to order lists of items. SVM-Rank uses a standard SVM for the ranking task.
Let's suppose we have a classifier (SVM) and two items, item1 and item2,
where item1 is expected to be ordered before item2.

Then:
Input to the classifier: (item1, item2)
Output of the classifier: 1 [which implies the ordering of the items is correct, i.e. item1 is better than item2]

Input to the classifier: (item2, item1)
Output of the classifier: -1 [which implies the ordering of the items is incorrect]

Let's say the preferred order of three items is (item1, item2, item3).
Classify the pair of items using the trained classifier
(item1, item2) =  1 , (item1, item3) =  1  => score of item1 =  1 + 1 = 2
(item2, item1) = -1 , (item2, item3) =  1  => score of item2 = -1 + 1 = 0
(item3, item1) = -1 , (item3, item2) = -1  => score of item3 = -1 - 1 = -2

Based on the score we get rank: item1, item2 , item3

The same process can be repeated for any number of items.

The initial optimisation problem is formulated as ordinal regression; by using pairwise differences, the regression problem is converted into a classification problem.

SVM-Rank uses pairwise difference vectors to produce a ranked list of items.

So, given a list of items represented as feature vectors, if we want to order them then SVM-Rank is the right thing to use.
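The pairwise-difference idea from the answer above can be sketched as follows (the feature vectors are invented; in a real setup the resulting vectors and labels would be fed to an SVM such as SVMrank or LIBLINEAR):

# items listed in preferred order: item1 > item2 > item3, each with a feature vector
items = [("item1", [1.0, 0.9]), ("item2", [0.6, 0.7]), ("item3", [0.2, 0.1])]

def diff(a, b):
    return [x - y for x, y in zip(a, b)]

# build the classification training set from pairwise difference vectors
training = []
for i in range(len(items)):
    for j in range(len(items)):
        if i == j:
            continue
        label = 1 if i < j else -1   # +1 if the first item should be ranked before the second
        training.append((diff(items[i][1], items[j][1]), label))

for vector, label in training:
    print(vector, label)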
https://en.wikipedia.org/wiki/Binary_relation


http://yonik.com/solr-6-4/
snapshotcli.sh command line tool to manage snapshots
http://yonik.com/download/
https://builds.apache.org/job/Solr-Artifacts-6.x/lastSuccessfulBuild/artifact/solr/package/

https://cwiki.apache.org/confluence/display/solr/Post+Tool
http://www.solrtutorial.com/solr-in-5-minutes.html
bin/solr -e techproducts
bin/solr start -p 8983 -s "example/techproducts/solr"

1.1 Relevance Ranking Model

A relevance ranking model orders documents by the similarity between the query and each document. Commonly used models include the Boolean Model, the Vector Space Model, Latent Semantic Analysis, BM25, LMIR, and so on.
1.2 Importance Ranking Model

An importance ranking model ignores the query and judges a document's authority purely from the link graph between web pages (documents); typical authoritative sites are Google, Yahoo!, etc. Commonly used models include PageRank, HITS, HillTop, TrustRank, and so on.
2. Why use machine learning for ranking

A single traditional ranking model usually captures only one aspect (relevance or importance), so a single model is not enough. Search engines therefore combine multiple ranking models, but how to combine several models into a new one, and how to tune all the parameters, is a hard problem.

With machine learning we can use the outputs of the existing ranking models as features, train a new model, and learn its parameters automatically, which makes it very convenient to combine multiple existing ranking models into a new one.

Unlike text classification, L2R considers the ranking of the document set for a given query. So the features used by L2R include not only features of the document d itself (e.g. whether it is spam), but also the relevance between document d and the given query q, and the importance of the document on the whole web (e.g. its PageRank value). In other words, the outputs of relevance ranking models and importance ranking models can be used as L2R features.

1) The outputs of traditional ranking models, including both relevance ranking models f(q,d) and importance ranking models.
2) Features of the document itself, e.g. whether it is spam.

L2R training data can take three forms: for each query, an absolute relevance grade for each document (highly relevant, somewhat relevant, not relevant, ...); for each query, the relative relevance between pairs of documents (document 1 is more relevant than document 2, document 4 more relevant than document 3, ...); for each query, the full list of documents ordered by relevance (document 1 > document 2 > document 3). The three forms can be converted into one another; see [1].

Mining from logs: search engines keep large logs of user behaviour, from which L2R training data can be extracted. Joachims proposed an interesting method [4]: given a query, let L be the result list returned by the search engine and C the set of documents clicked by the user; if a document di was clicked and another document dj was not, and dj was ranked before di in the result list, then di > dj is a training record. That is, the training data is {di > dj | di ∈ C, dj ∈ L − C, p(dj) < p(di)}, where p(d) is the position of document d in the result list (smaller means closer to the top).

For each given query-document pair, we extract the corresponding features (the various relevance measures between the query and the document, as well as features and importance of the document itself), and we obtain the true ordering of the document set for the query either by human labelling or by mining logs. We then use one of the L2R algorithms to learn a ranking model whose output ordering is as similar as possible to the true ordering.

The Pairwise approach considers the relative relevance of two documents for a given query: given the true document ordering for query q, we only need to consider, for any two documents with different relevance, whether di > dj or di < dj.

Compared with the Pointwise approach, the Pairwise approach makes progress by ranking through the relative relevance of document pairs. However, the loss function based on pairs of documents can be very different from the metrics that actually measure ranking quality, and can sometimes even be negatively correlated with them (the pairwise loss and NDCG can show a negative correlation).

In addition, some Pairwise methods do not take into account the importance of the top positions for the overall ranking, nor the effect of the different sizes of the document sets associated with different queries (although some Pairwise methods improve on this; for example, IR SVM is an algorithm obtained by improving Ranking SVM with respect to these shortcomings).

Unlike the Pointwise and Pairwise approaches, the Listwise approach directly considers the whole document list for a given query and directly optimises the document ordering output by the model, making it as close as possible to the true ordering.

Listwise algorithms mainly include: LambdaRank (NIPS 2006), AdaRank (SIGIR 2007), SVM-MAP (SIGIR 2007), SoftRank (LR4IR 2007), GPRank (LR4IR 2007), CCA (SIGIR 2007), RankCosine (IP&M 2007), ListNet (ICML 2007), ListMLE (ICML 2008).

Compared with Pointwise and Pairwise methods, the Listwise approach directly optimises the ordering of the whole document set for a given query, and therefore overcomes the drawbacks of the algorithms above fairly well. Among Listwise methods, LambdaMART (an improvement of RankNet and LambdaRank) showed the best performance in the Yahoo! Learning to Rank Challenge.

L2R performs ranking with machine learning methods, so the metrics for evaluating L2R are ranking metrics, mainly the following:

1) WTA (Winners Take All): for a given query q, WTA(q) = 1 if the first document in the result list returned by the model is relevant, otherwise 0.

2) MRR (Mean Reciprocal Rank): for a given query q, if the position of the first relevant document is R(q), then MRR(q) = 1/R(q).

3) MAP (Mean Average Precision): for each truly relevant document d, consider its position P(d) in the model's ranking, compute the precision of the documents up to that position, and take the average of all these precision values.

4) NDCG (Normalized Discounted Cumulative Gain): a metric that jointly considers the relationship between the model's ranking and the true ordering; it is the most commonly used metric for measuring ranking results, see Wikipedia.

5) RC (Rank Correlation): uses correlation to measure the similarity between the ranking result and the true ordering; the commonly used measure is Kendall's Tau.
http://www.cnblogs.com/kemaswill/p/3241963.html
Ranking SVM is a pairwise ranking algorithm. Given a query q and documents d1 > d2 > d3 (i.e. document d1 is more relevant than d2, and d2 more relevant than d3; x1, x2, x3 are the feature vectors of d1, d2, d3), we turn ranking into a classification problem: we define new training samples, letting x1-x2, x1-x3, x2-x3 be positive samples and x2-x1, x3-x1, x3-x2 be negative samples, and then train a binary classifier (a support vector machine) to classify these new samples.

In the original post's figure, each ellipse on the left represents a query, and the points inside it are the documents whose relevance to that query is to be computed: triangles are highly relevant, circles somewhat relevant, and crosses not relevant. The single documents on the left are converted into document pairs (di, dj) on the right: solid squares are positive samples, i.e. di > dj, and hollow squares are negative samples, i.e. di < dj.

Having converted the ranking problem into a classification problem, we can solve it with standard machine learning methods. Ranking SVM uses an SVM for the classification, solving an optimization problem of the form:

minimize (1/2) ||w||^2 + C Σ ξ_ij
subject to y_ij · ⟨w, x_i − x_j⟩ ≥ 1 − ξ_ij and ξ_ij ≥ 0 for all pairs,

where w is the parameter vector, x are the document features, y is the relative relevance of the document pair, and ξ are the slack variables.

Suppose results 1, 3 and 7 were clicked by the user and the others were not. Because the returned results are themselves ordered and users tend to click results near the top, the click behaviour itself is biased. To obtain relevance information from the biased click data, we assume: if a user clicked a but did not click b, and b was ranked above a in the results, then a > b.

So the click behaviour above implies: 3 > 2, 7 > 2, 7 > 4, 7 > 5, 7 > 6.

The data format is quite similar to LIBSVM's input format: the first column is the relevance of the document (larger means more relevant), the second column is the query, and the remaining columns are the features.

http://www.cnblogs.com/eyeszjwang/articles/2368087.html
MAP (Mean Average Precision): the average precision for a single topic is the mean of the precision values obtained after each relevant document is retrieved; the mean average precision over a set of topics (MAP) is the mean of the per-topic average precisions. MAP is a single-value metric reflecting the system's performance over all relevant documents: the higher the relevant documents are ranked, the higher the MAP. If the system returns no relevant documents, the precision defaults to 0.

NDCG (Normalized Discounted Cumulative Gain): relatively complex to compute; NDCG at rank position n discounts the graded relevance logarithmically by position (see the DCG section below).
In MAP a document is either relevant or not, i.e. relevance is 0 or 1. NDCG improves on this by grading relevance into r+1 levels from 0 to r (r can be configured).
MRR (Mean Reciprocal Rank): take the reciprocal of the rank at which the correct answer appears in the system's results as its score, then average over all queries.

http://blog.csdn.net/clheang/article/details/45767103
SVMrank consists of a learning module (svm_rank_learn) and a prediction module (svm_rank_classify).

Feature/value pairs must be listed in order of increasing feature number; features with value 0 can be omitted. The target value gives the rank order of the examples for each query. The 'qid' constrains the generation of pairs: only features with the same 'qid' value are used to generate pairwise constraints.
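To make the format concrete, here is a small Python sketch that writes a toy train.dat in that format (the feature values and comments are invented; the layout - target, qid, then feature:value pairs in increasing feature order - follows the SVMrank documentation):

# each line: <relevance> qid:<query id> <feature>:<value> ... # optional comment
lines = [
    "3 qid:1 1:1.0 2:0.5 3:0.0 # doc A, most relevant for query 1",
    "2 qid:1 1:0.9 2:0.1 3:1.0 # doc B",
    "1 qid:1 1:0.2 2:0.3 3:0.0 # doc C",
    "2 qid:2 1:0.7 2:0.4 3:1.0 # doc D, pairs are only generated within the same qid",
]

with open("train.dat", "w") as f:
    f.write("\n".join(lines) + "\n")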

The result of svm_rank_learn is the model learned from the training data train.dat; the model is saved in model.dat.
The prediction command using svm_rank_classify is:
svm_rank_classify test.dat model.dat predictions
For each line in the test data test.dat, a prediction representing the ranking score is written to the predictions file; each line of predictions corresponds to a line of the test examples. Using the predicted scores, the test examples can then be sorted.
http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
https://en.wikipedia.org/wiki/Discounted_cumulative_gain
The value computed with the CG function is unaffected by changes in the ordering of search results. That is, moving a highly relevant document above a higher ranked, less relevant document does not change the computed value for CG. Based on the two assumptions made above about the usefulness of search results, DCG is used in place of CG for a more accurate measure.

The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced logarithmically proportionally to the position of the result. The discounted CG accumulated at a particular rank position p is defined as [2]:

DCG_p = Σ_{i=1..p} rel_i / log2(i + 1)

An alternative formulation of DCG [5] places stronger emphasis on retrieving relevant documents:

DCG_p = Σ_{i=1..p} (2^rel_i − 1) / log2(i + 1)

Search result lists vary in length depending on the query. Comparing a search engine's performance from one query to the next cannot be consistently achieved using DCG alone, so the cumulative gain at each position for a chosen value of p should be normalized across queries. This is done by sorting all relevant documents in the corpus by their relative relevance, producing the maximum possible DCG through position p, also called Ideal DCG (IDCG) through that position. For a query, the normalized discounted cumulative gain, or nDCG, is computed as:

nDCG_p = DCG_p / IDCG_p,

where IDCG_p is the DCG of the ideal ordering and |REL_p| represents the list of relevant documents (ordered by their relevance) in the corpus up to position p.
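The definitions above translate directly into code; a minimal Python sketch using graded relevance values (base-2 logarithmic discount, the common convention):

import math

def dcg(relevances, p):
    # standard formulation: rel_i / log2(i + 1), positions starting at 1
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:p], start=1))

def ndcg(relevances, p):
    ideal = dcg(sorted(relevances, reverse=True), p)   # IDCG: best possible ordering
    return dcg(relevances, p) / ideal if ideal > 0 else 0.0

ranked_relevances = [3, 2, 3, 0, 1, 2]   # graded relevance of the results as returned
print(ndcg(ranked_relevances, 6))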

http://www.kdd.org/
