Monday, September 18, 2017

Solr Misc Part 5


https://www.elastic.co/blog/lucene-points-6.0
Coming in Lucene's next major release (6.0) is a new feature called dimensional points, using the k-d tree geo-spatial data structure to offer fast single- and multi-dimensional numeric range and geo-spatial point-in-shape filtering.

This feature replaces the now deprecated numeric fields and numeric range query since it has better overall performance and is more general, allowing up to 8 dimensions (versus 1) and up to 16 bytes (versus the 8 byte limit today) per dimension.
At index time, they are built by recursively partitioning the full space of N-dimensional points to be indexed into smaller and smaller rectangular cells, splitting equally along the widest ranging dimension at each step of the recursion. However, unlike an ordinary k-d tree, a block k-d tree stops recursing once there are fewer than a pre-specified (1024 in our case, by default) number of points in the cell.
At that point, all points within that cell are written into one leaf block on disk and the starting file-pointer for that block is saved into an in-heap binary tree structure. In the 1D case, this is simply a full sort of all values, divided into adjacent leaf blocks. There are k-d tree variants that can support removing values, and rebalancing, but Lucene does not need these operations because of its write-once per-segment design.
The current data-structure that underlies dimensional points is called a block KD tree, which in the case of a single dimension is a simple binary search tree that stores blocks of values on the leaves rather than individual values. Lucene currently defaults to having between 512 and 1024 values on the leaves.
In the case of Lucene, the fact that files are write-once makes the writing easy since we can build a perfectly balanced tree and then not worry about rebalancing since the tree will never be modified. Merging is easy too since you can get a sorted iterator over the values that are stored in a segment, then merge these iterators into a sorted iterator on the fly and build the tree of the merged segment from this sorted iterator.
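To make this concrete, here is a minimal sketch of using dimensional points through the Lucene 6.x API (the field name and values are made up); the BKD building and merging described above happens inside IndexWriter:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class PointsExample {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

    for (int price : new int[] {423, 445, 521, 642, 700}) {
      Document doc = new Document();
      doc.add(new IntPoint("price", price));      // indexed into the BKD tree, not stored
      doc.add(new StoredField("price", price));   // stored copy so we can display it
      writer.addDocument(doc);
    }
    writer.close();

    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    // Range filtering is answered by walking the BKD tree instead of enumerating terms.
    Query q = IntPoint.newRangeQuery("price", 423, 642);
    System.out.println("hits: " + searcher.search(q, 10).totalHits);
  }
}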

While integer types are easy to compress based on the number of bits that they actually use (for instance if all integers are between 2 and 200, they can be encoded on one byte), the same is not true for floating-point data. In particular, the fact that they cannot represent decimals accurately makes them use all available bits in order to be as close as possible to the value that needs to be represented. This makes compression hard since values in a given segment rarely have a pattern that can be leveraged for compression.
But do you actually need the accuracy of a float or a double for your data, or would you happily trade some precision for reduced disk usage and faster queries? For those to whom this sounds like a logical trade-off, we introduced two new field types called half_float and scaled_float.
Half floats work the same way as floats and doubles, except that they use fewer bits for the mantissa and the exponent, allowing them to be stored on 16 bits in total, rather than 32 in the case of floats and 64 in the case of doubles. However beware that this storage reduction comes at a cost: they only have 3.3 significant decimal digits and the maximum value is 65504.
We suspect the new scaled_float field will be even more useful in practice: it takes a scaling factor and for every value, it internally stores the long that is closest to the product of the value and the scaling factor. For instance, with a scaling factor of 100, 0.123 would internally be indexed as 12 and all queries and aggregations would behave as if the value was actually 0.12. This helps because even though values are decimal, they are internally encoded as integers, which Lucene can compress more efficiently.
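A tiny sketch of the scaled_float encoding described above (Math.round stands in for "the closest long"; the actual field type lives in Elasticsearch, this just illustrates the arithmetic):

public class ScaledFloatSketch {
  public static void main(String[] args) {
    long scalingFactor = 100;
    double value = 0.123;
    long encoded = Math.round(value * scalingFactor);   // stored internally: 12
    double decoded = (double) encoded / scalingFactor;  // what queries/aggs see: 0.12
    System.out.println(encoded + " -> " + decoded);
  }
}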
The k-d tree is a binary tree in which every node is a k-dimensional point. Every non-leaf node can be thought of as implicitly generating a splitting hyperplane that divides the space into two parts, known as half-spaces.


Old Trie Field
Solr uses this idea of prefixes to index numbers so that it can perform range queries efficiently. Just like we can organize words into tries, we can also organize numbers into tries.

Let's say I want to index the integer 3735928559. For clarity, let's rewrite that in hexadecimal, 0xDEADBEEF. When we index this using a TrieIntField, Solr stores the integer four times at different levels of precision.

0xDE000000
0xDEAD0000
0xDEADBE00
0xDEADBEEF

What Solr is doing here is constructing numbers with different length prefixes. This would be equivalent to a trie with this structure.

DE--AD--BE--EF

The reason that this allows for fast range queries is because of what the prefixes represent. The prefixes represent a range of values. It might be better to think of them indexed like this, instead.

0xDExxxxxx
0xDEADxxxx
0xDEADBExx
0xDEADBEEF

The precision step lets you tune your index, trading range query speed for index size. A smaller precision step will result in a larger index and faster range queries. A larger precision step will result in a smaller index and slower range queries.

In the example above, I was using a precision step of 8, the default. The precision step is how many more bits get pruned off the end of the number at each level. Let's see what would happen if we indexed 0xDEADBEEF with a precision step of 4.

0xDxxxxxxx
0xDExxxxxx
0xDEAxxxxx
0xDEADxxxx
0xDEADBxxx
0xDEADBExx
0xDEADBEEx
0xDEADBEEF

As you can see, compared to the default precision step of 8, a precision step of 4 doubled the number of entries in the index. The way it speeds up range searches is by allowing better granularity. If I wanted to search for the documents matching the range 0xDEADBEE0 to 0xDEADBEEF with the default precision step, I would have to check all 16 records in the index and merge the results. With the precision step of 4, I can check the one record for 0xDEADBEEx and get the results I want.
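A rough sketch of the index-time side of this: zero out "shift" low-order bits for shift = 0, precisionStep, 2*precisionStep, ... and you get the prefix terms listed above. (This ignores Lucene's actual prefix-coded term encoding, which also embeds the shift into each term so prefixes at different precisions don't collide.)

public class TriePrefixSketch {
  // Print the prefix terms a 32-bit value contributes at a given precisionStep,
  // by zeroing 'shift' low-order bits for shift = 0, step, 2*step, ...
  static void printPrefixes(int value, int precisionStep) {
    for (int shift = 0; shift < 32; shift += precisionStep) {
      int prefix = value & ~((1 << shift) - 1);
      System.out.printf("shift=%2d  0x%08X%n", shift, prefix);
    }
  }

  public static void main(String[] args) {
    printPrefixes(0xDEADBEEF, 8);  // 0xDEADBEEF, 0xDEADBE00, 0xDEAD0000, 0xDE000000
    printPrefixes(0xDEADBEEF, 4);  // eight entries, down to 0xD0000000
  }
}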
Let's assume we want to find all records with term values between “423” and “642”. A naive algorithm would expand the range into separate values: 423 OR 445 OR ... 641 OR 642 (note: values that were not indexed are omitted to simplify the description). But because we use a special type of field, instead of selecting all terms in the lowermost row, the query is optimized to match the lower-precision term values where applicable (the gray-filled, labelled elements in the source post's diagram). It is enough to select “5” to match all records starting with “5” (“521”, “522”), or “44” for “445”, “446”, “448”. The query is therefore simplified to match all records containing the following terms: “423”, “44”, “5”, “63”, “641”, or “642”.

So, instead of searching by every value in the requested range, the algorithm uses grouped values wherever possible.
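Here is a hedged sketch of that range-splitting idea in the decimal analogy (class and method names are mine; Lucene's real implementation is the binary NumericUtils.splitXXXRange() mentioned below). Note it emits every covering prefix, whereas the text above lists only the prefixes that actually occur in its example index.

public class DecimalRangeSplit {
  // Cover [lo, hi] with decimal "prefix terms": exact values at the ragged edges,
  // coarser prefixes (44x, 5xx, ...) for the aligned middle of the range.
  static void split(long lo, long hi, int level) {
    if (lo > hi) return;
    boolean lowerEdge = lo % 10 != 0;   // lo is not aligned to a block of ten
    boolean upperEdge = hi % 10 != 9;   // hi is not aligned to a block of ten
    long nextLo = lowerEdge ? lo / 10 + 1 : lo / 10;
    long nextHi = upperEdge ? hi / 10 - 1 : hi / 10;
    if (nextLo > nextHi) {              // range too narrow to move up another level
      emit(lo, hi, level);
      return;
    }
    if (lowerEdge) emit(lo, lo / 10 * 10 + 9, level);   // e.g. 423..429
    if (upperEdge) emit(hi / 10 * 10, hi, level);       // e.g. 640..642
    split(nextLo, nextHi, level + 1);                   // e.g. 43..63 handled as prefixes
  }

  static void emit(long lo, long hi, int level) {
    for (long v = lo; v <= hi; v++) {
      StringBuilder term = new StringBuilder(Long.toString(v));
      for (int i = 0; i < level; i++) term.append('x');   // 44x covers 440-449, 5xx covers 500-599
      System.out.println(term);
    }
  }

  public static void main(String[] args) {
    split(423, 642, 0);
    // prints 423..429 and 640..642 exactly, 43x..49x and 60x..63x as prefixes, and 5xx
  }
}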

Index time

During index time, all terms for a value and its quotients need to be produced. The Solr indexing flow is as simple as: DirectUpdateHandler->IndexWriter#updateDocuments->DocumentWriter->DocumentConsumer#processDocument(fieldInfos)->DocFieldProcessor->DocFieldProcessorPerField->DocInverterPerField

So, for an incoming document we get all its fields and process each field's values with DocInverterPerField. Basically it gets a token stream (Lucene's TokenStreams are explained by Mike McCandless) which produces the sequence of tokens to be indexed for a document's fields.

Search time

Again we have FieldType here which creates a query. TrieIntField creates a NumericRangeQuery.

NumericRangeQuery extends MultiTermQuery.
Underneath, it uses the utility method NumericUtils.splitXXXRange() to calculate the values to be matched.
Solr.IntField
Stores and indexes the text value verbatim, and hence doesn't correctly support range queries, since lexicographic ordering isn't equal to numeric ordering:
1, 10, 11, 2, 3, 4, 5, 6, 7, 8, 9
Interesting, but “sort by” works fine: a clever comparator knows that the values are ints!
solr.SortableIntField
“Sortable”, in fact, refers to the notion of making the numbers sort correctly as terms. It's not about “sort by”, actually!
● Processed and compared as strings!!! A tricky string encoding: NumberUtils.int2sortableStr(...)
● Deprecated and will be removed in 5.X
● What should I use then?
TrieIntField
NumericTokenStream is where half of the magic happens!
● precision step = 1 ● value = 11
The algorithm indexes all 32/precisionStep terms.
So, for “11” we have 11, 10, 8, 8, 0, 0, 0, 0, 0, ..., 0 (one more low-order bit zeroed per term).
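A quick check of that sequence, using the same low-bit masking as the earlier trie-prefix sketch:

public class ElevenTerms {
  public static void main(String[] args) {
    int value = 11;                                   // binary 1011
    for (int shift = 0; shift < 32; shift++) {        // precisionStep = 1 -> 32 terms
      System.out.print((value & ~((1 << shift) - 1)) + " ");
    }
    System.out.println();                             // prints: 11 10 8 8 0 0 ... 0
  }
}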

Sub-classes of FieldType could override #getRangeQuery(...) to provide their own range query implementation.
If not, then likely you will have:
MultiTermQuery rangeQuery = TermRangeQuery.newStringRange(...)
TrieField overrides it. And here comes...

● Decimal example, precision step = one decimal digit ● q = price:[423 TO 642]


Multi-Tenant
https://www.slideshare.net/lucidworks/efficient-scalable-search-in-a-multi-tenant-environment-harry-hight
•  Massive scale - shards have to be left offline until needed

Incremental Search
•  Calculating the full result set is time consuming
•  Query cache usually cold due to unload
•  Shard loads take time
•  Users want to review a subset before exporting
•  Shards and results are date sorted
•  Search shards sequentially, and return partial results as available
•  Creates a streaming interface

Pinned Shards
•  Incremental search starts with the most recent data
•  `Pin` shards for most recent data
•  Subset of shards to be kept loaded at all times
•  Shards already loaded for the beginning of the stream
•  User doesn’t see the load times for the rest since it happens while they review initial results
•  Allows query caches to be more effective
•  User sees results in seconds rather than minutes

What if each user has a different view of a document?
•  User 1 has permission to view the red
•  User 2 has permission to view green
•  User 3 has permission to view everything

•  Post process each document
•  Ends up being horribly slow
•  Ties application logic to backend
•  Generate a unique document for each view
•  1000s of unique views makes for an unmanageable index
•  Trillions of documents is a whole different problem!
•  Dynamic fields
•  text_view1:value1, text_view2:value2, text_view3:”value1 value2”
•  Solr doesn’t have a max number of fields, but string interning becomes an issue
•  Mangle field values
•  text:”view1_value1 view2_value2 view3_value1 view3_value2”
•  Works pretty well
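A hedged sketch of the "mangle field values" idea (field name, view ids, and the helper are all made up): at index time every token is prefixed with a view that is allowed to see it, and at query time the user's terms get rewritten with that user's view prefix.

import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class ViewMangling {
  // Index time: prefix every token with the views allowed to see it,
  // so one document can serve many per-user views of its text.
  static String mangle(Map<String, List<String>> valuesByView) {
    StringJoiner joiner = new StringJoiner(" ");
    valuesByView.forEach((view, values) ->
        values.forEach(v -> joiner.add(view + "_" + v)));
    return joiner.toString();
  }

  public static void main(String[] args) {
    // view1 sees value1, view2 sees value2, view3 sees both
    System.out.println(mangle(Map.of(
        "view1", List.of("value1"),
        "view2", List.of("value2"),
        "view3", List.of("value1", "value2"))));
    // Query time: rewrite the user's terms with their view prefix,
    // e.g. user 3 searching for "value1" queries text:view3_value1
  }
}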
https://www.slideshare.net/lucidworks/multitenant-log-analytics-saas-service-using-solr-presented-by-chirag-gupta-srivatsan-parthasarathy-microsoft

https://www.elastic.co/blog/found-multi-tenancy
There is a base cost of memory for each shard, and this base cost is constant even if the shard contains no documents.

https://lucidworks.com/2015/09/23/bloomberg-scales-apache-solr-multi-tenant-environment/
Basic security always starts with different users having access to subsets of the documents, but gets more interesting when users only have access to a subset of the data within a given document, and their search results must reflect that restriction to avoid revealing information

https://sematext.com/blog/2015/09/29/solrcloud-large-tenants-and-routing/
Using routing to handle such data is one of the solutions, and it allows one to efficiently divide the clients and put them into dedicated shards while still using all the goodness of SolrCloud


The post also covers how to deal with some of the problems that come up with this solution: the different number of documents in shards and the uneven load.

it is usually best to avoid per/tenant collection creation. Having hundreds or thousands of collections inside a single SolrCloud cluster will most likely cause maintenance headaches and can stress the SolrCloud and ZooKeeper nodes that work together


This is not, however, a perfect solution. Very large tenant data will end up in a single shard. This means that some shards may become very large and the queries for such tenants will be slower, even though Solr doesn’t have to aggregate data from multiple shards. For such use cases — when you know the size of your tenants — you can use a modified routing approach in Solr, one that will place the data of such tenants in multiple shards.

Composite Hash Routing

Solr allows us to modify the routing value slightly and provide information on how many bits from the routing key to use (this is possible for the default, compositeId router). For example, instead of providing the routing value of user1!12345, we could use user1/2!12345. The /2 part indicates how many bits to use in the composite hash. In this case we would use 2 bits, which means that the data would go to 1/4 of the shards. If we set it to /3, the data would go to 1/8 of the shards, and so on.
So instead of providing the _route_ request parameter and setting it to user1!, we’ll provide the composite hash, meaning the _route_ parameter value should be user1/2!.

The nice thing about the provided routing solutions is that they can be combined to work together. For small tenants you can just provide a routing value equal to the user identifier; for medium tenants you can send their data to more than a single shard; and for large tenants you can spread them across even more shards. The best thing about these strategies is that they are applied automatically, and one only needs to care about the routing values during indexing and querying.
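A hedged SolrJ 6.x-style sketch of composite-id routing (collection, field names, and ZooKeeper address are illustrative): with the compositeId router the route is part of the document id itself, and queries pass the same prefix via _route_.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CompositeRoutingSketch {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("localhost:2181")                 // illustrative ZooKeeper address
        .build();
    client.setDefaultCollection("tenants");           // illustrative collection

    // compositeId routing: "user1/2!" means only 2 bits of the tenant hash are used,
    // so this tenant's documents spread over 1/4 of the shards
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "user1/2!12345");
    doc.addField("tenant_s", "user1");
    client.add(doc);
    client.commit();

    // Query only the shards that can hold user1's data
    SolrQuery query = new SolrQuery("*:*");
    query.set("_route_", "user1/2!");
    System.out.println(client.query(query).getResults().getNumFound());
    client.close();
  }
}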
https://lucene.apache.org/solr/guide/6_6/streaming-expressions.html
Many of the provided streaming functions are designed to work with entire result sets rather than the top N results of a normal search. This is supported by the /export handler.
Some streaming functions act as stream sources that originate the stream flow. Other streaming functions act as stream decorators that wrap other stream functions and perform operations on the stream of tuples. Many stream functions can be parallelized across a worker collection.
curl --data-urlencode 'expr=search(enron_emails,
                                   q="from:1800flowers*",
                                   fl="from, to",
                                   sort="from asc",
                                   qt="/export")' http://localhost:8983/solr/enron_emails/stream

StreamFactory streamFactory = new StreamFactory().withCollectionZkHost("collection1", zkServer.getZkAddress())
    .withStreamFunction("search", CloudSolrStream.class)
    .withStreamFunction("unique", UniqueStream.class)
    .withStreamFunction("top", RankStream.class)
    .withStreamFunction("group", ReducerStream.class)
    .withStreamFunction("parallel", ParallelStream.class);

ParallelStream pstream = (ParallelStream) streamFactory.constructStream(
    "parallel(collection1, "
    + "group(search(collection1, q=\"*:*\", fl=\"id,a_s,a_i,a_f\", sort=\"a_s asc,a_f asc\", partitionKeys=\"a_s\"), by=\"a_s asc\"), "
    + "workers=\"2\", zkHost=\"" + zkHost + "\", sort=\"a_s asc\")");


partitionKeys: Comma delimited list of keys to partition the search results by. To be used with the parallel function for parallelizing operations across worker nodes.

jdbc(
    connection="jdbc:hsqldb:mem:.",
    sql="select NAME, ADDRESS, EMAIL, AGE from PEOPLE where AGE > 25 order by AGE, NAME DESC",
    sort="AGE asc, NAME desc",
    driver="org.hsqldb.jdbcDriver"
)
echo("Hello world")
facet(collection1,
      q="*:*",
      buckets="year_i, month_i, day_i",
      bucketSorts="year_i desc, month_i desc, day_i desc",
      bucketSizeLimit=100,
      sum(a_i),
      sum(a_f),
      min(a_i),
      min(a_f),
      max(a_i),
      max(a_f),
      avg(a_i),
      avg(a_f),
      count(*))
random(baskets,
       q="productID:productX",
       rows="100",
       fl="basketID")
shortestPath(collection,
             from="john@company.com",
             to="jane@company.com",
             edge="from_address=to_address",
             threads="6",
             partitionSize="300",
             fq="limiting query",
             maxDepth="4")


http://solr.pl/en/2012/02/20/simple-photo-search/
Use Tika to extract metadata from images
https://hortonworks.com/hadoop-tutorial/indexing-and-searching-text-within-images-with-apache-solr/
  • Download Leptonica, an image processing library
wget http://www.leptonica.org/source/leptonica-1.69.tar.gz
  • Download Tesseract, an Optical Character Recognition engine

    wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz

    https://sematext.com/blog/2017/06/19/solr-vs-elasticsearch-differences/

    https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html
    In most cases, when running in SolrCloud mode, indexing client applications should not send explicit commit requests. Rather, you should configure auto commits with openSearcher=false and auto soft-commits to make recent updates visible in search requests. This ensures that auto commits occur on a regular schedule in the cluster.
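    The kind of solrconfig.xml settings this refers to, inside <updateHandler> (the intervals are illustrative, not recommendations):

    <!-- solrconfig.xml, inside <updateHandler> -->
    <autoCommit>
      <maxTime>60000</maxTime>            <!-- hard commit every 60s, flushes to stable storage -->
      <openSearcher>false</openSearcher>  <!-- don't open a new searcher on hard commits -->
    </autoCommit>
    <autoSoftCommit>
      <maxTime>5000</maxTime>             <!-- soft commit every 5s makes recent updates searchable -->
    </autoSoftCommit>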
    To enforce a policy where client applications should not send explicit commits, you should update all client applications that index data into SolrCloud. However, that is not always feasible, so Solr provides the IgnoreCommitOptimizeUpdateProcessorFactory, which allows you to ignore explicit commits and/or optimize requests from client applications without having to refactor your client application code.
    To activate this request processor you’ll need to add the following to your solrconfig.xml:


    <updateRequestProcessorChain name="ignore-commit-from-client" default="true">
      <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
        <int name="statusCode">200</int>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.DistributedUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

    Avoiding Distributed Deadlock


    Each shard serves top-level query requests and then makes sub-requests to all of the other shards. Care should be taken to ensure that the max number of threads serving HTTP requests is greater than the possible number of requests from both top-level clients and other shards. If this is not the case, the configuration may result in a distributed deadlock.
    For example, a deadlock might occur in the case of two shards, each with just a single thread to service HTTP requests. Both threads could receive a top-level request concurrently, and make sub-requests to each other. Because there are no more remaining threads to service requests, the incoming requests will be blocked until the other pending requests are finished, but they will not finish since they are waiting for the sub-requests. By ensuring that Solr is configured to handle a sufficient number of threads, you can avoid deadlock situations like this.

    Prefer Local Shards


    Solr allows you to pass an optional boolean parameter named preferLocalShards to indicate that a distributed query should prefer local replicas of a shard when available. In other words, if a query includes preferLocalShards=true, then the query controller will look for local replicas to service the query instead of selecting replicas at random from across the cluster. This is useful when a query requests many fields or large fields to be returned per document because it avoids moving large amounts of data over the network when it is available locally. In addition, this feature can be useful for minimizing the impact of a problematic replica with degraded performance, as it reduces the likelihood that the degraded replica will be hit by other healthy replicas.
    Lastly, it follows that the value of this feature diminishes as the number of shards in a collection increases because the query controller will have to direct the query to non-local replicas for most of the shards. In other words, this feature is mostly useful for optimizing queries directed towards collections with a small number of shards and many replicas. Also, this option should only be used if you are load balancing requests across all nodes that host replicas for the collection you are querying, as Solr’s CloudSolrClient will do. If not load-balancing, this feature can introduce a hotspot in the cluster since queries won’t be evenly distributed across the cluster.
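    A minimal SolrJ sketch (field names are illustrative); preferLocalShards is just a request parameter:

    import org.apache.solr.client.solrj.SolrQuery;

    public class PreferLocalShardsSketch {
      public static void main(String[] args) {
        SolrQuery query = new SolrQuery("*:*");
        query.setFields("id", "big_stored_field_t");   // illustrative: large stored field
        query.set("preferLocalShards", true);          // prefer replicas on the node that received the query
        System.out.println(query);
        // cloudSolrClient.query("collection1", query) would then avoid shipping big fields across nodes
      }
    }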



    If an update fails because cores are reloading schemas and some have finished but others have not, the leader tells the nodes that the update failed and starts the recovery procedure.
    Behind the scenes, this means that Solr has accepted updates that are only on one of the nodes (the current leader). Solr supports the optional min_rf parameter on update requests that cause the server to return the achieved replication factor for an update request in the response. For the example scenario described above, if the client application included min_rf >= 1, then Solr would return rf=1 in the Solr response header because the request only succeeded on the leader. The update request will still be accepted as the min_rf parameter only tells Solr that the client application wishes to know what the achieved replication factor was for the update request. In other words, min_rf does not mean Solr will enforce a minimum replication factor as Solr does not support rolling back updates that succeed on a subset of replicas.
    On the client side, if the achieved replication factor is less than the acceptable level, then the client application can take additional measures to handle the degraded state. For instance, a client application may want to keep a log of which update requests were sent while the state of the collection was degraded and then resend the updates once the problem has been resolved. In short, min_rf is an optional mechanism for a client application to be warned that an update request was accepted while the collection is in a degraded state.
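    A hedged SolrJ sketch of the client side of this (collection name, ZooKeeper address, and the rf handling are illustrative):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class MinRfSketch {
      public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("localhost:2181").build();     // illustrative address
        client.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");

        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        req.setParam("min_rf", "2");                   // ask Solr to report the achieved replication factor
        UpdateResponse rsp = req.process(client);

        Object rf = rsp.getResponseHeader().get("rf");
        if (rf != null && Integer.parseInt(rf.toString()) < 2) {
          // degraded: e.g. log this update so it can be replayed once the replicas recover
          System.out.println("update accepted with rf=" + rf);
        }
        client.close();
      }
    }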


    If shards.tolerant=true then partial results may be returned. If the returned response does not contain results from all the appropriate shards then the response header contains a special flag called partialResults.
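    A small SolrJ sketch (URL and collection are illustrative) that sets shards.tolerant and checks the partialResults flag:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class TolerantQuerySketch {
      public static void main(String[] args) throws Exception {
        HttpSolrClient client =
            new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build();
        SolrQuery query = new SolrQuery("*:*");
        query.set("shards.tolerant", true);            // don't fail the whole query if some shards are down
        QueryResponse rsp = client.query(query);
        if (Boolean.TRUE.equals(rsp.getResponseHeader().get("partialResults"))) {
          System.out.println("warning: results are partial");
        }
        client.close();
      }
    }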
    https://doc.lucidworks.com/fusion/3.0/Indexing_Data/Time-Based-Partitioning.html
    Once a collection is configured for time-based partitioning, Fusion automatically ages out old partitions and creates new ones, using the configured partition sizes, expiration intervals, and so on. No manual maintenance is needed.
    https://lucene.apache.org/solr/guide/6_6/config-api.html


    Every core watches the ZooKeeper directory for the configset being used with that core. In standalone mode, however, there is no watch (because ZooKeeper is not running). If there are multiple cores in the same node using the same configset, only one ZooKeeper watch is used. For instance, if the configset 'myconf' is used by a core, the node would watch /configs/myconf. Every write operation performed through the API would 'touch' the directory (sets an empty byte[] to trigger watches) and all watchers are notified. Every core would check if the Schema file, solrconfig.xml or configoverlay.json is modified by comparing the znode versions and if modified, the core is reloaded.

    Use SolrCore#addConfListener(Runnable listener)
    to get notified of config changes. This is not very useful if the modified files result in core reloads (i.e., configoverlay.json or the schema). Components can use this to reload the files they are interested in.
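    A tiny fragment of what registering such a listener might look like (assuming a component that already has a SolrCore reference, e.g. via SolrCoreAware; the helper is hypothetical):

    // inside a component holding a SolrCore reference
    core.addConfListener(() -> {
      // called when the configset znode is touched; re-read any custom config files here
      reloadMyResources();   // hypothetical helper
    });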

    <requestHandler name="/my_handler" class="solr.SearchHandler" useParams="my_handler_params"/>
    https://github.com/apache/lucene-solr/tree/master/solr/example/films



