Massive Technical Interviews Tips: Solr Internal

Friday, April 1, 2016

Solr Internal

http://yonik.com/solr-count-distinct/
Approximation:

A 100% accurate count of distinct values (count distinct) is generally not possible without observing all of the values together. However, there are a number of ways to estimate the count.

“unique” Facet Function

The unique facet function is Solr’s fastest way to calculate the number of distinct values.
It always provides exact counts on a single Solr node. For distributed search over multiple nodes, it provides exact counts as long as the number of distinct values on each node does not exceed 100 (the default limit).

When the number of unique values exceeds 100 on any shard, the count is estimated with the following algorithm:

Each shard sends its top 100 values along with its own exact “unique” count.
totalSeen is the number of values received from all shards, counted before deduplication.
uniqueSeen is the number of distinct values received from all shards, counted after deduplication.
notSeen is the number of unique values, summed over all shards, that were not sent because of the 100 cutoff.
factor = uniqueSeen / totalSeen (i.e. the fraction of the values we saw that were unique)
estimate = uniqueSeen + ( notSeen * factor ) (i.e. we simply apply the same factor to the values we did not see)
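The merge step above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Solr's actual code: the ShardResponse class and the estimate_unique function are hypothetical names, and the example uses a cutoff of 3 instead of 100 to keep the numbers small.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShardResponse:
    """What one shard reports (hypothetical structure, for illustration)."""
    sent_values: List[str]  # the values the shard actually sent (up to the cutoff)
    unique_count: int       # the shard's own exact count of distinct values


def estimate_unique(shards: List[ShardResponse]) -> float:
    """Apply the totalSeen / uniqueSeen / notSeen formula from the text."""
    # totalSeen: every value received, before deduplication.
    total_seen = sum(len(s.sent_values) for s in shards)
    if total_seen == 0:
        return 0.0

    # uniqueSeen: distinct values received, after deduplication.
    seen = set()
    for s in shards:
        seen.update(s.sent_values)
    unique_seen = len(seen)

    # notSeen: distinct values each shard had but did not send, summed.
    not_seen = sum(s.unique_count - len(s.sent_values) for s in shards)

    # factor: fraction of the received values that turned out to be unique.
    factor = unique_seen / total_seen
    return unique_seen + not_seen * factor


# Two shards with an assumed cutoff of 3:
# totalSeen = 6, uniqueSeen = 4, notSeen = (5 - 3) + (4 - 3) = 3, factor = 4/6
shards = [
    ShardResponse(["a", "b", "c"], unique_count=5),
    ShardResponse(["b", "c", "d"], unique_count=4),
]
print(estimate_unique(shards))  # about 6.0, i.e. 4 + 3 * (4/6)
```

When no shard exceeds the cutoff, notSeen is 0 and the estimate reduces to the exact deduplicated count, which matches the exactness claim made earlier.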

Example use:

$ curl http://localhost:8983/solr/techproducts/query -d '
q=*:*&
json.facet={
  x : "unique(manu_exact)"    // manu_exact is the manufacturer indexed as a single string
}'
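For readers who want to issue the same request from a script, here is a minimal Python sketch that builds an equivalent form-encoded body. The build_unique_request helper is a hypothetical name, and the sketch assumes Solr accepts strict JSON in json.facet just as it accepts the relaxed form in the curl example. Nothing is sent; the body would be POSTed to the /query URL shown above.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_unique_request(field: str, label: str = "x") -> str:
    """Build a form-encoded body asking for unique(<field>) under <label>."""
    # Strict JSON equivalent of: json.facet={ x : "unique(manu_exact)" }
    facet = {label: "unique({})".format(field)}
    return urlencode({"q": "*:*", "json.facet": json.dumps(facet)})


body = build_unique_request("manu_exact")

# Decode the body again to check what a server would receive.
decoded = parse_qs(body)
print(decoded["q"][0])
print(decoded["json.facet"][0])
```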

For more facet functions, for adding facet functions to each facet bucket, or for sorting by facet function, see Solr Facet Functions.

HyperLogLog

http://yonik.com/noggit-json-parser/
