Saturday, March 30, 2019

Lucene-Solr Architecture



https://sease.io/2015/07/exploring-solr-internals-lucene.html
Assuming we are using a FileSystem Lucene Directory, the index will be stored on disk for durability (we will not cover the commit concept and policies here; if curious, see [2]).
Modern implementations of the FileSystem Directory leverage the OS memory-mapping feature to load chunks of the index (or possibly the whole index) into memory (RAM) when necessary.
In the file system, the index looks like a collection of immutable segments.
Each segment is a fully working inverted index, built from a set of documents.
A segment is a partition of the full index: it represents a part of it and is fully searchable.
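
As a minimal sketch of how this looks from the Lucene API (assuming a recent Lucene release and a hypothetical index path; FSDirectory.open() typically selects a memory-mapped implementation on 64-bit JVMs), each leaf of a DirectoryReader corresponds to one immutable segment:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ListSegments {
  public static void main(String[] args) throws Exception {
    // Open the on-disk index; FSDirectory.open() usually returns MMapDirectory
    // on 64-bit platforms, so segment files are memory-mapped into RAM as needed.
    try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
         DirectoryReader reader = DirectoryReader.open(dir)) {
      // Each leaf is one immutable segment: a fully searchable inverted index.
      for (LeafReaderContext leaf : reader.leaves()) {
        System.out.println("segment docBase=" + leaf.docBase
            + " maxDoc=" + leaf.reader().maxDoc());
      }
    }
  }
}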

Term Dictionary

The term dictionary is a sorted skip list containing all the unique terms for the specific field.
Two operations are permitted, starting from a pointer in the dictionary:
next() -> iterate over the terms one by one
advance(BytesRef b) -> jump to an entry >= the input (this operation is O(log n), where n = number of unique terms).
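
A minimal sketch of these two operations through Lucene's public API: the TermsEnum of a field, where seekCeil() plays the role of the advance operation described above. Here leafReader is assumed to be one segment's reader, e.g. leaf.reader() from the previous sketch, and the field name is just an example:

import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// The term dictionary of the "title" field for one segment
// (assumes the field exists in this segment).
Terms terms = leafReader.terms("title");
TermsEnum te = terms.iterator();

// next(): walk the sorted dictionary one term at a time.
BytesRef term;
while ((term = te.next()) != null) {
  System.out.println(term.utf8ToString() + " docFreq=" + te.docFreq());
}

// seekCeil(): jump to the first entry >= the input (the "advance" above).
te = terms.iterator();
if (te.seekCeil(new BytesRef("gam")) != TermsEnum.SeekStatus.END) {
  System.out.println("first term >= 'gam' is " + te.term().utf8ToString());
}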
An auxiliary automaton is stored in memory; it accepts a set of carefully calculated prefixes of the terms in the dictionary.
It is a weighted automaton: a weight is associated with each prefix (i.e. the offset at which to look into the term dictionary).
This automaton is used at query time to identify a starting point for looking into the dictionary.
When we run a query (a TermQuery, for example):
1) we feed the query term to the in-memory automaton, and an offset is returned
2) we access the location associated with that offset in the term dictionary
3) we advance to the BytesRef representation of the query term
4) if the term is a match for the TermQuery, we return the associated posting list
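
Roughly, steps 1-3 are what TermsEnum.seekExact() does internally (the prefix automaton lookup is hidden inside the codec), and step 4 corresponds to asking the TermsEnum for its postings. A minimal sketch, reusing the hypothetical leafReader from the previous sketches:

import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

Terms terms = leafReader.terms("title");
TermsEnum te = terms.iterator();
// seekExact() looks the term up via the in-memory index and the term dictionary.
if (te.seekExact(new BytesRef("game"))) {
  // The posting list for "game" in field "title": doc IDs plus frequencies.
  PostingsEnum postings = te.postings(null, PostingsEnum.FREQS);
  int doc;
  while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
    System.out.println("segment-local docId=" + doc + " freq=" + postings.freq());
  }
}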

Posting List

The posting list is the sorted skip list of the doc IDs of the documents that contain the related term.
It's used to return the documents for the searched term.
Let's take a closer look at a complete posting list for the term game in the field title:
0 : 1 : [2] : [6-10],
1 : 2 : [1, 4] : [0-4, 18-22],
2 : 1 : [1] : [0-4]
Each element of this posting list is:
Document Ordinal : Term Frequency : [array of Term Positions] : [array of Term Offsets].
Document Ordinal -> the ordinal (Lucene-internal ID) of the document in the corpus containing the related term.
Never rely on this ordinal at the application level, as it may change over time (during segment merges, for example).
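
Extending the previous sketch, we can decode exactly this layout by asking the TermsEnum for positions and offsets as well. This assumes the field was indexed with offsets (e.g. IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS) and that te is the TermsEnum already positioned on the term game:

import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.search.DocIdSetIterator;

// Ask for everything the codec stored: doc IDs, frequencies, positions, offsets.
PostingsEnum p = te.postings(null, PostingsEnum.ALL);
int doc;
while ((doc = p.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  int freq = p.freq();
  StringBuilder positions = new StringBuilder();
  StringBuilder offsets = new StringBuilder();
  for (int i = 0; i < freq; i++) {
    int pos = p.nextPosition();                       // term position in the document
    positions.append(i > 0 ? ", " : "").append(pos);
    offsets.append(i > 0 ? ", " : "")
           .append(p.startOffset()).append('-').append(p.endOffset());  // character offsets
  }
  // Prints lines such as: 1 : 2 : [1, 4] : [0-4, 18-22]
  System.out.println(doc + " : " + freq + " : [" + positions + "] : [" + offsets + "]");
}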

https://community.hortonworks.com/articles/212776/understanding-solr-architecture-and-best-practices.html
1. Solr works on a non-master-slave architecture; every Solr node is its own master. Solr nodes use ZooKeeper to learn about the state of the cluster.
2. A Solr node (JVM) can host multiple cores.
3. A core is where the Lucene (index) engine runs. Every core has its own Lucene engine.
4. A collection is divided into shards.
5. A shard is represented as a core (a part of the JVM) on a Solr node (JVM).
6. Every Solr node keeps sending heartbeats to ZooKeeper to report its availability.
7. Using the local FS provides the most stable and best I/O for Solr.
8. A replication factor of 2 should be maintained in local mode to avoid any data loss.
9. Remember that every replica has a core attached to it and also takes disk space.
10. If a collection is divided into 3 shards with a replication factor of 3, a total of 9 cores will be hosted across the Solr nodes, and the data saved on the local FS will be 3x (see the sketch after this list).
11. A Solr node doesn't publish data to Ambari Metrics by default. A Solr metrics process (a separate process from the Solr node) needs to run on every host where a Solr node is deployed; it fetches data from the Solr node and pushes it to Ambari Metrics.
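
As a concrete illustration of point 10, here is a minimal SolrJ sketch that creates such a collection. The ZooKeeper address, collection name and configset are hypothetical, and constructor/method names may differ slightly between SolrJ versions (this assumes SolrJ 7/8-style APIs):

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollectionExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical ZooKeeper ensemble address; Solr nodes register themselves here.
    try (CloudSolrClient client = new CloudSolrClient.Builder(
             Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
      // 3 shards x replication factor 3 = 9 cores spread over the Solr nodes,
      // and roughly 3x the raw data on the local file systems.
      CollectionAdminRequest.createCollection("my_collection", "_default", 3, 3)
          .process(client);
    }
  }
}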


