Sunday, May 14, 2017

Solr Misc Part 4


https://cwiki.apache.org/confluence/display/solr/Other+Parsers
A common mistake is to try to filter parents with the which parameter, as in this bad example:
q={!parent which="title:join"}comments:SolrCloud 
Instead, you should use a sibling mandatory clause as a filter:
q= +title:join +{!parent which="content_type:parentDocument"}comments:SolrCloud

Block Join Parent Query Parser

This parser takes a query that matches child documents and returns their parents. The syntax for this parser is: q={!parent which=<allParents>}<someChildren>. The parameter allParents is a filter that matches only parent documents; here you would define the field and value that you used to identify all parent documents. The parameter someChildren is a query that matches some or all of the child documents. Note that the query for someChildren should match only child documents, or you may get an exception: "Child query must not match same docs with parent filter. Combine them as must clauses (+) to find a problem doc." In older versions the message is: "child query must only match non-parent docs". As the message says, you can search for q=+(parentFilter) +(someChildren) to find the cause.
Again using the example documents above, we can construct a query such as q={!parent which="content_type:parentDocument"}comments:SolrCloud
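The query forms above can be assembled programmatically. The following is a minimal sketch, assuming the example field names from the text; the helper names parent_query and diagnostic_query are hypothetical and not part of any Solr client library. It only builds strings and does not contact a Solr server.

```python
from urllib.parse import urlencode


def parent_query(parent_filter: str, child_query: str) -> str:
    """Build a {!parent} block join query: child matches are mapped to parents."""
    return '{!parent which="%s"}%s' % (parent_filter, child_query)


def diagnostic_query(parent_filter: str, child_query: str) -> str:
    """Build q=+(parentFilter) +(someChildren); any hit is a problem doc."""
    return "+(%s) +(%s)" % (parent_filter, child_query)


# Values taken from the examples in the text.
PARENT_FILTER = "content_type:parentDocument"
CHILD_QUERY = "comments:SolrCloud"

bjq = parent_query(PARENT_FILTER, CHILD_QUERY)

# Filter parents with a sibling mandatory clause, not inside "which".
filtered = "+title:join +" + bjq

# Query to run when the "Child query must not match..." exception appears.
diag = diagnostic_query(PARENT_FILTER, CHILD_QUERY)

# URL-encode the parameters for an HTTP request to a select handler.
params = urlencode({"q": filtered, "fl": "id,score"})
print(params)
```

Keeping the which filter constant (it always matches all parents) and moving parent-level restrictions into a separate mandatory clause mirrors the advice quoted at the top of this post.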

https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents
This transformer returns all descendant documents of each parent document matching your query in a flat list nested inside the matching parent document. This is useful when you have indexed nested child documents and want to retrieve the child documents for the relevant parent documents for any type of search query.
fl=id,[child parentFilter=doc_type:book childFilter=doc_type:chapter limit=100]
Note that this transformer can be used even though the query itself is not a Block Join query.
When using this transformer, the parentFilter parameter must be specified; it works the same as in all Block Join Queries. Additional optional parameters are:
  • childFilter - query to filter which child documents should be included, this can be particularly useful when you have multiple levels of hierarchical documents (default: all children)
  • limit - the maximum number of child documents to be returned per parent document (default: 10)
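As an illustration of how the parameters listed above combine, here is a small sketch that composes the fl value. The helper child_transformer is hypothetical (an assumption for this example, not a Solr API); it only formats the string shown in the documentation excerpt.

```python
from typing import Optional


def child_transformer(parent_filter: str,
                      child_filter: Optional[str] = None,
                      limit: Optional[int] = None) -> str:
    """Format a [child ...] document transformer for the fl parameter."""
    parts = ["child", "parentFilter=" + parent_filter]  # parentFilter is required
    if child_filter is not None:
        parts.append("childFilter=" + child_filter)     # default: all children
    if limit is not None:
        parts.append("limit=%d" % limit)                # default: 10
    return "[" + " ".join(parts) + "]"


fl = "id," + child_transformer("doc_type:book", "doc_type:chapter", 100)
print(fl)
```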

Solr Block Join - Nested Documents
https://blog.griddynamics.com/introduction-to-block-join-faceting-in-solr
The screenshot above is taken from an online retailer’s website. According to the graphic, a dress can be blue, pink or red, and only sizes XS and S are available in blue. However, for merchandisers and customers this dress is considered a single product, not many similar variations. When a customer navigates the site,  she should see all SKUs belonging to the same product as a single product, not as multiple products. This means that for facet calculations, our facet counts should represent products, not SKUs. Thus, we need to find a way to aggregate SKU-level facets into product ones.

A common solution is to propagate properties from the SKU level to the product level and produce a single product document with multivalued fields aggregated from the SKUs. With this approach, our aggregated product looks like this:
Figure: results of the propagation of SKU-level attributes to the product level
However, this approach creates the possibility of false positive matches with regards to combinations of SKU-level fields. For example, if a customer filters by color ‘Blue’ and size ‘M’, Product_1 will be considered a valid match, even though there is no SKU in the original catalog which is both 'Blue' and 'M'. This happens because when we are aggregating values from the SKU level, we are losing information about what value comes from what SKU. 
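The false positive can be reproduced with a toy model in plain Python. The SKU data is an assumption made for illustration: the blue XS and S variants follow the description above, while the color of the M-sized SKU is invented.

```python
# SKUs of a single product. Blue exists only in XS and S (as in the text);
# the Red/M SKU is a hypothetical addition for this sketch.
skus = [
    {"color": "Blue", "size": "XS"},
    {"color": "Blue", "size": "S"},
    {"color": "Red", "size": "M"},
]

# Flattened product document: multivalued fields aggregated from all SKUs.
product = {
    "color": {sku["color"] for sku in skus},
    "size": {sku["size"] for sku in skus},
}

# Filtering the flattened document: each field is checked independently,
# so the information about which value came from which SKU is lost.
flat_match = "Blue" in product["color"] and "M" in product["size"]

# Filtering at SKU level: both conditions must hold on the same SKU.
nested_match = any(s["color"] == "Blue" and s["size"] == "M" for s in skus)

print(flat_match, nested_match)
```

The flattened check accepts the product while the per-SKU check rejects it, which is the mismatch a block join is meant to avoid.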

Getting back to the technology, this means we should carefully support our catalog structure when searching and faceting products. The problem of searching structured data is already addressed in Solr with a powerful, high performance and robust solution: Block Join Query.
https://blog.griddynamics.com/high-performance-join-in-solr-with-blockjoinquery
Join Query looks like:
q=text_all:(patient OR autumn OR helen)&fl=id,score&sort=score desc&fq={!join from=join_id to=id}acl:[1303 TO 1309]
You can see that Join almost never ran for less than a second, and the CPU saturated with 100 requests per minute. Adding more queries harmed latency.

The whole index was cached in RAM via memory-mapped files.

The BlockJoin query looks like:
q=text_all:(patient OR autumn OR helen)&fl=id,score&sort=score desc&fq={!parent which=kind:body}acl:[1303 TO 1309]
Search now takes only a few tens of milliseconds and survives with 6K requests per minute (100 qps). And you see plenty of free CPU!

We can check where Join uses so much CPU power with jstack:
java.lang.Thread.State: RUNNABLE
at o.a.l.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.docFreq(BlockTreeTermsReader.java:2098)
at o.a.s.search.JoinQuery$JoinQueryWeight.getDocSet(JoinQParserPlugin.java:338)

Let’s explain how a 55GB index can ever be cached in just 8GB RAM. You should know that not all files in your index are equally valuable. (In other words, tune your schema wisely.) In my index the frq file is 7.7GB and the tim file is only 427MB, and it’s almost all that’s needed for these queries. Of course, a file which stores primary key values is also read, but it doesn’t seem significant.

BlockJoin is the most efficient way to do the join operation, but it doesn’t mean you need to get rid of your solution based on the other (slow) Join. The place for Join is frequent child updates -- and small indexes, of course.

Solr supports the search of hierarchical documents using BlockJoinQuery (BJQ). Using this query requires a special way of indexing documents, based on their positioning in the index. All documents belonging to the same hierarchy have to be indexed together, starting with child documents followed by their parent documents. 

BJQ works as a bridge between levels of the document hierarchy; e.g. it transforms matches on child documents to matches on parent documents. When we search using BJQ, we provide a child query and a parent filter as parameters. A child query represents what we are looking for among child documents, and a parent filter tells BJQ how to distinguish parent documents from child documents in the index.
For each matched child document, BJQ scans ahead in the index until it finds the nearest parent document, which is sent into the collector chain instead of the child document. This trick of relying on relative document positioning in the index, or “index-time join,” is the secret behind BJQ’s high performance.
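The scan-ahead step can be pictured with a toy model. This is a sketch under simplifying assumptions, not Lucene's implementation: the index is a plain Python list in which each block stores its children first and its parent last, and the function name parents_of and the sample documents are hypothetical.

```python
# Toy index: positions are doc ids; each block is children followed by parent.
INDEX = [
    {"id": "sku_1", "is_parent": False},      # 0
    {"id": "sku_2", "is_parent": False},      # 1
    {"id": "product_1", "is_parent": True},   # 2
    {"id": "sku_3", "is_parent": False},      # 3
    {"id": "product_2", "is_parent": True},   # 4
]


def parents_of(index, matched_children):
    """Map matched child doc ids to the doc ids of their parents."""
    parents = []
    for doc_id in sorted(matched_children):
        pos = doc_id
        # Scan ahead until the nearest parent document is found.
        while not index[pos]["is_parent"]:
            pos += 1
        # Several children of one block collapse into a single parent hit.
        if not parents or parents[-1] != pos:
            parents.append(pos)
    return parents


print(parents_of(INDEX, {0, 1}))  # both children belong to product_1
print(parents_of(INDEX, {1, 3}))  # one child in each block
```

No lookup by key is needed at query time: the relationship is encoded purely by position, which is why the documents of a block have to be indexed together.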
We consider each hierarchy of matched documents separately. As we are using BJQ, each hierarchy is represented in the Solr index as a document block, or DocSet slice as we call it.
First, we calculate facets based on matched SKUs from our block. Then we aggregate obtained SKU counts into Product-level facet counts, increasing the product level facet count by only one for every matched block, irrespective of the  number of matched SKUs within the block. For example, if we are searching by COLOR:Blue, even though two Blue SKUs were found within a block, aggregated product-level counts will be increased only by one.
Figure: structure of the block index. When there are multiple child-level hits in a single block, the facet count on the parent level has to be increased only by one.
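The aggregation rule described above can be sketched in a few lines. This is a toy model of the counting logic only, assuming a list-based index with children stored before their parent; the function product_facets and the sample SKUs are hypothetical and do not reflect the internals of BlockJoinFacetComponent.

```python
from collections import Counter

# Toy block index: SKUs (children) precede their product (parent).
INDEX = [
    {"is_parent": False, "color": "Blue", "size": "XS"},  # 0
    {"is_parent": False, "color": "Blue", "size": "S"},   # 1
    {"is_parent": True},                                  # 2: product 1
    {"is_parent": False, "color": "Blue", "size": "M"},   # 3
    {"is_parent": False, "color": "Red", "size": "M"},    # 4
    {"is_parent": True},                                  # 5: product 2
]


def product_facets(index, matched, field):
    """Count facet values per product: at most +1 per value per block."""
    counts = Counter()
    seen_in_block = set()  # distinct values among matched SKUs of this block
    for pos, doc in enumerate(index):
        if doc["is_parent"]:
            # End of block: each distinct value counts once for this product.
            counts.update(seen_in_block)
            seen_in_block = set()
        elif pos in matched:
            seen_in_block.add(doc[field])
    return counts


blue_skus = {0, 1, 3}  # SKUs matching COLOR:Blue
print(product_facets(INDEX, blue_skus, "color"))
print(product_facets(INDEX, blue_skus, "size"))
```

Three blue SKUs match, but they sit in two blocks, so the product-level count for Blue is two rather than three.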
This solution is implemented inside the BlockJoinFacetComponent, which extends the standard Solr SearchComponent. BlockJoinFacetComponent validates the query and injects a special BlockJoinFacetCollector into the Solr post-filter collector chain. When BlockJoinFacetCollector is invoked, it receives a parent document, since the BlockJoinFacetComponent ensures that only ToParentBlockJoinQuery is allowed as a top-level query.

Use Lucene’s MMapDirectory on 64bit platforms, please!

https://wiki.apache.org/solr/SolrCaching#Types_of_Caches_and_Example_Configuration

https://issues.apache.org/jira/browse/LUCENE-7452
When the parent filter intersects with the child query, the exception exposes internal details: docnum and scorer class.

java.lang.IllegalStateException: Child query must not match same docs with parent filter. Combine them as must clauses (+) to find a problem doc. docId=23, class org.apache.lucene.search.DisjunctionSumScorer
java.lang.IllegalStateException: Parent query must not match any docs beside parent filter. Combine them as must (+) and must-not (-) clauses to find a problem doc. docID=12
https://issues.apache.org/jira/browse/SOLR-6096
Currently all of this can be done on the client side by issuing an additional request to delete the document before every update. It would be more efficient if this could be handled on the Solr side. Atomic updates would also benefit. The biggest advantage shows when using "delete-by-query".
Deletion of '1' by query
<delete>
  <query>title:*</query>
  <!-- implying also
    <query>_root_:1</query>
   -->
</delete>
In that case one would not have to first query for all matching documents and then issue deletes for those ids and for every document nested under them.



