Friday, April 1, 2016

Solr Internal



http://yonik.com/solr-count-distinct/
Approximation:
A 100% accurate count of distinct values (count distinct) is generally not possible without observing all of the values together. However, there are a number of ways to estimate the count.

“unique” Facet Function

The unique facet function is Solr’s fastest implementation for calculating the number of distinct values.
It always provides exact counts on a single Solr node. For distributed search over multiple nodes, it provides exact counts when the number of values per node does not exceed 100 (by default).
When the number of unique values exceeds 100 in any given shard, the count is estimated as follows:
  • Each shard sends its top 100 values along with its exact total “unique” count.
  • totalSeen is the number of values actually received from all shards (i.e. not yet deduped).
  • uniqueSeen is the number of unique values received from all shards (i.e. deduped).
  • notSeen is the number of unique values, summed over all shards, that were not sent (because of the 100 cutoff).
  • factor = uniqueSeen / totalSeen (i.e. the fraction of the values we saw that were unique)
  • estimate = uniqueSeen + ( notSeen * factor ) (i.e. we simply apply the factor to the number of values we didn’t see)
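The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula as described here, not Solr's actual code: the function name `estimate_unique`, the `limit` parameter, and the choice of sending the first 100 values in sorted order are all assumptions made for the example.

```python
def estimate_unique(shards, limit=100):
    """Estimate the distinct count across shards using the formula above.

    shards: a list of iterables, one per shard, holding that shard's values.
    limit:  how many distinct values each shard sends (100 in the text).
    """
    seen = set()      # deduped values received from all shards
    total_seen = 0    # values received from all shards, before deduping
    not_seen = 0      # distinct values the shards held back

    for values in shards:
        distinct = set(values)
        # Assumption: each shard sends its first `limit` values in sorted order.
        sent = sorted(distinct)[:limit]
        total_seen += len(sent)
        seen.update(sent)
        not_seen += len(distinct) - len(sent)

    unique_seen = len(seen)
    if not_seen == 0:
        # No shard hit the cutoff, so the count is exact.
        return unique_seen

    factor = unique_seen / total_seen
    return unique_seen + not_seen * factor


# Two shards with 150 values each and 50 values in common (250 distinct).
print(estimate_unique([range(0, 150), range(100, 250)]))  # 300.0
# Two shards holding the same 150 values (150 distinct).
print(estimate_unique([range(0, 150), range(0, 150)]))    # 150.0
```

In the first call none of the values sent overlap, so factor is 1 and the estimate overshoots the true count of 250. In the second call every value sent is a duplicate, factor is 0.5, and the estimate happens to match the true count.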
Example use:
$ curl http://localhost:8983/solr/techproducts/query -d '
q=*:*&
json.facet={
  x : "unique(manu_exact)"    // manu_exact is the manufacturer indexed as a single string
}'
For more facet functions, adding facet functions to each facet bucket, or sorting by facet function, see Solr Facet Functions

HyperLogLog

http://yonik.com/noggit-json-parser/
