Massive Technical Interviews Tips: Solr Miscs

Thursday, August 13, 2015

Solr Miscs

<field name="_root_" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
http://blog-archive.griddynamics.com/2013/09/solr-block-join-support.html

Customize DocumentObjectBinder
https://gist.github.com/mdread/7680293
private <T> T mapJsonFields(T bean, SolrDocument document) throws IllegalArgumentException, IllegalAccessException {
Class<?> clazz = bean.getClass();

for (Field field : clazz.getDeclaredFields()) {
if(field.getAnnotation(JSONField.class) != null){
JSONField annotation = field.getAnnotation(JSONField.class);
String solrField = annotation.value();

if(solrField.equals(JSONField.DEFAULT))
solrField = field.getName();

String json = (String)document.getFieldValue(solrField);
if(json == null || json.trim().length() == 0)
continue;

Type type = field.getType();
field.setAccessible(true);

if(Collection.class.isAssignableFrom((Class<?>)type))
type = field.getGenericType();

Object value = new Gson().fromJson(json, type);
field.set(bean, value);
}
}

return bean;
}

http://stackoverflow.com/questions/20917040/what-is-the-root-field-in-schema-xml

is needed for block-join support. See here for more detailed explanation.

You can use this when you have relationships between entities and you don't want to flatten your docs, for example, one Class doc, contains many Student docs, and you want to be able to query in a more similar way as you would do it in a DB.

http://stackoverflow.com/questions/20826325/solr-can-i-get-parent-fields-and-child-fields-in-the-sae-resultChild documents can be retrieved with the help of expand attribute as explained in expand block join. If we add

expand=true&expand.q=*:*&expand.field=_root_

to the query. We should be able to get all child documents in "extended" attribute.

[child]

DocTransformer for optionally including Block-Join decendent documents inline in the results of a search. This transformer returns all descendants of each parent document in a flat list nested inside the parent document. This is useful when you have indexed nested child documents and want to retrieve the child documents for the relavant parent documents for any type of search query.

https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents#TransformingResultDocuments-[child]-ChildDocTransformerFactory

`[child]` - ChildDocTransformerFactory

This transformer returns all descendant documents of each parent document matching your query in a flat list nested inside the matching parent document. This is useful when you have indexed nested child documents and want to retrieve the child documents for the relevant parent documents for any type of search query.

fl=id,[child parentFilter=doc_type:book childFilter=doc_type:chapter limit=100]

Note that this transformer can be used even though the query itself is not a Block Join query.

When using this transformer, the parentFilter parameter must be specified, and works the same as in all Block Join Queries, additional optional parameters are:

childFilter - query to filter which child documents should be included, this can be particularly useful when you have multiple levels of hierarchical documents (default: all children)
limit - the maximum number of child documents to be returned per parent document (default: 10)

http://yonik.com/solr-nested-objects/

Nested Documents (also called Nested Objects) provides the ability to “nest” some documents inside of other documents in a parent/child relationship.

Lucene Index Representation

Lucene has a flat object model and does not really support “nesting” of documents in the index.
Lucene *does* support adding a list of documents atomically and contiguously (i.e. a virtual “block”), and this is the feature used by Solr to implement “nested objects”.

When you add a parent document with 3 children, these appear int the index contiguously as

child1, child2, child3, parent

There is no Lucene-level information that links parent and child, or distinguishes this parent/child block from the other documents in the index that come before or after. Successfully using parent/child relationships relies on more information being provided at query time.

Limitations

All children of a parent document must be indexed together with the parent document. One cannot update any document (parent or child) individually. The entire block needs to be re-indexed of any changes need to be made.

Schema Requirements

There are no schema requirements except that the _root_ field must exist (but that is there by default in all our schemas).
Any document can have nested child documents.

Lucene Index Representation

When you add a parent document with 3 children, these appear int the index contiguously as

child1, child2, child3, parent

Limitations

Schema Requirements

There are no schema requirements except that the _root_ field must exist (but that is there by default in all our schemas).
Any document can have nested child documents.

Block Join

“Block Join” refers to the set of related query technologies to efficiently map from parents to children or vice versa at query time. The locality of children and parents can be used to both speed up query operations and lower memory requirements compared to other join methods.

NOTE: This example currently requires Solr 5.3 or later.

$ curl http://localhost:8983/solr/demo/update?commitWithin=3000 -d '

[

 {id : book1, type_s:book, title_t : "The Way of Kings", author_s : "Brandon Sanderson",

  cat_s:fantasy, pubyear_i:2010, publisher_s:Tor,

  _childDocuments_ : [

    { id: book1_c1, type_s:review, review_dt:"2015-01-03T14:30:00Z",

      stars_i:5, author_s:yonik,

      comment_t:"A great start to what looks like an epic series!"

    }

    ,

    { id: book1_c2, type_s:review, review_dt:"2014-03-15T12:00:00Z",

      stars_i:3, author_s:dan,

      comment_t:"This book was too long."

    }

  ]

 }

]'

https://cwiki.apache.org/confluence/display/solr/Other+Parsers

Block Join Children Query Parser

This parser takes a query that matches some parent documents and returns their children. The syntax for this parser is: q={!child of=<allParents>}<someParents>. The parameter allParents is a filter that matches only parent documents; here you would define the field and value that you used to identify all parent documents. The parameter someParents identifies a query that will match some of the parent documents. The output is the children.

Using the example documents above, we can construct a query such as q={!child of="content_type:parentDocument"}title:lucene. We only get one document in response:

Block Join Parent Query Parser

This parser takes a query that matches child documents and returns their parents. The syntax for this parser is similar: q={!parent which=<allParents>}<someChildren>. Again the parameter The parameter allParents is a filter that matches only parent documents; here you would define the field and value that you used to identify all parent documents. The parameter someChildren is a query that matches some or all of the child documents. Note that the query for someChildren should match only child documents or you may get an exception.

Again using the example documents above, we can construct a query such as q={!parent which="content_type:parentDocument"}comments:SolrCloud. We get this document in response:

http://lucene.472066.n3.nabble.com/Does-SolrJ-support-nested-annotated-beans-td868375.html
https://issues.apache.org/jira/browse/SOLR-1945
http://stackoverflow.com/questions/37241489/solrj-6-0-0-insertion-of-a-bean-object-which-associate-list-of-bean-object-is-g

So as of now I am annotating @Field annotation at field level rather than at setter:

 @Field (child = true)
    private Collection<Technology2> technologies2;

https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production

By default, the bin/solr script sets the maximum Java heap size to 512M (-Xmx512m), which is fine for getting started with Solr. For production, you’ll want to increase the maximum heap size based on the memory requirements of your search application; values between 10 and 20 gigabytes are not uncommon for production servers. When you need to change the memory settings for your Solr server, use the SOLR_JAVA_MEM variable in the include file, such as:

SOLR_JAVA_MEM="-Xms10g -Xmx10g"

ZooKeeper chroot

If you're using a ZooKeeper instance that is shared by other systems, it's recommended to isolate the SolrCloud znode tree using ZooKeeper's chroot support. For instance, to ensure all znodes created by SolrCloud are stored under /solr, you can put /solr on the end of your ZK_HOST connection string, such as:

ZK_HOST=zk1,zk2,zk3/solr

Before using a chroot for the first time, you need to create the root path (znode) in ZooKeeper by using the zkcli.sh script. We can use the makepath command for that:

$ server/scripts/cloud-scripts/zkcli.sh -zkhost zk1,zk2,zk3 -cmd makepath /solr

https://cwiki.apache.org/confluence/display/solr/Solr+Start+Script+Reference

-f	Start Solr in the foreground; you cannot use this option when running examples with the -e option.

-m <memory>

Start Solr with the defined value as the min (-Xms) and max (-Xmx) heap size for the JVM.

-m 1g
https://cwiki.apache.org/confluence/display/solr/JVM+Settings

http://grokbase.com/t/lucene/solr-user/128r96vwz6/how-do-i-represent-a-group-of-customer-key-value-pairs
The general rule in Solr is simple: denormalize your data.

If you have some maps (or tables) and a set of keys (columns) for each map
(table), define fields with names like <map-name>_<key-name>, such as
"map1_name", "map2_name", "map1_field1", "map2_field1". Solr has dynamic
fields, so you can define "<map-name>_*" to have a desired type - if all the
keys have the same type.

https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/test/org/apache/solr/client/solrj/beans/TestDocumentObjectBinder.java
@Field("supplier_*")
Map<String, List<String>> supplier;

@Field("sup_simple_*")
Map<String, String> supplier_simple;

private String[] allSuppliers;

@Field("supplier_*")
public void setAllSuppliers(String[] allSuppliers) {
this.allSuppliers = allSuppliers;
}
@Field(child = true)
Child[] child;

http://stackoverflow.com/questions/6238181/solrj-and-dynamic-fields
Solrj and Dynamic Fields

I worked out the matching of the 'pattern' value in the @Field annotation doesn't have to match what's in your schema.xml. So, I defined a map in my doc class:

@Field("*DF")
private Map<String, Object> dynamicFields;

and then in the schema.xml the dynamicFields have patterns postfixed by 'DF':

<dynamicField name="*_sDF" type="string" indexed="true" stored="true"/>
<dynamicField name="*_siDF" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_tDF" type="date" indexed="true" stored="true"/>

Now all the dynamicField with different value types get stored and retrieved using solrServer.addBean(doc) and solrResponse.getBeans(Doc.class).

http://lucene.472066.n3.nabble.com/Maximum-number-of-fields-allowed-in-a-Solr-document-td505435.html
There is no build-in limit. The limit is going to be dictated by your hardware resources.
http://stackoverflow.com/questions/26139507/how-to-retrieve-all-stored-fields-from-core-with-solrj
How to retrieve all stored fields from Core with SolrJ

    SolrServer solrCore = new HttpSolrServer("http://{host:port}/solr/core-name");
    SolrQuery query = new SolrQuery();
    query.add(CommonParams.QT, "/schema/fields");
    QueryResponse response = solrCore.query(query);
    NamedList responseHeader = response.getResponseHeader();
    ArrayList<SimpleOrderedMap> fields = (ArrayList<SimpleOrderedMap>) response.getResponse().get("fields");
    for (SimpleOrderedMap field : fields) {
        Object fieldName = field.get("name");
        Object fieldType = field.get("type");
        Object isIndexed = field.get("indexed");
        Object isStored = field.get("stored");
        // you can use other attributes here.

        System.out.println(field.toString());
    }

https://cwiki.apache.org/confluence/display/solr/Using+SolrJ

To choose a different request handler, there is a specific method available in SolrJ version 4.0 and later:

query.setRequestHandler("/spellCheckCompRH");

Uploading Content in XML or Binary Formats

SolrJ lets you upload content in binary format instead of the default XML format. Use the following code to upload using binary format, which is the same format SolrJ uses to fetch results. If you are trying to mix Solr and SolrJ versions where one is version 1.x and the other is 3.x or later, then you MUST stick with the XML request writer. The binary format changed in 3.x, and the two javabin versions are entirely incompatible.

solr.setRequestWriter(new BinaryRequestWriter());

Using the ConcurrentUpdateSolrClient

When implementing java applications that will be bulk loading a lot of documents at once, ConcurrentUpdateSolrClient is an alternative to consider instead of using HttpSolrClient. The ConcurrentUpdateSolrClient buffers all added documents and writes them into open HTTP connections. This class is thread safe. Although any SolrClient request can be made with this implementation, it is only recommended to use the ConcurrentUpdateSolrClient for /update requests.

https://qnalist.com/questions/662591/in-a-requesthandlers-init-how-to-get-solr-data-dir
You can implement the SolrCoreAware interface which will give you access to the SolrCore object through the SolrCoreAware#inform method you will need to implement. It is called after the init method.
https://cwiki.apache.org/confluence/display/solr/Faceting

In addition to some of the general local parameters supported by other types of faceting, a stats local parameters can be used with facet.pivot to refer to stats.field instances (by tag) that you would like to have computed for each Pivot Constraint.

In the example below, two different (overlapping) sets of statistics are computed for each of the facet.pivot result hierarchies:

stats=true

stats.field={!tag=piv1,piv2 min=true max=true}price

stats.field={!tag=piv2 mean=true}popularity

facet=true

facet.pivot={!stats=piv1}cat,inStock

facet.pivot={!stats=piv2}manu,inStock

https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
stats.facet

This legacy parameter is not recommended for new users - instead please consider combining stats.field with facet.pivot

http://stackoverflow.com/questions/634765/using-or-and-not-in-solr-query

Solr currently checks for a "pure negative" query and inserts *:* (which matches all documents) so that it works correctly.

-foo is transformed by solr into (*:* -foo)

The big caveat is that Solr only checks to see if the top level query is a pure negative query! So this means that a query like bar OR (-foo) is not changed since the pure negative query is in a sub-clause of the top level query. You need to transform this query yourself into bar OR (*:* -foo)

Use defType=edismax
https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

supports pure negative nested queries: queries such as +foo (-foo) will match all documents.

https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser

Pure negative queries (all clauses prohibited) are allowed (only as a top-level clause)
- -inStock:false finds all field values where inStock is not false
- -field:[* TO *] finds all documents without a value for field

http://stackoverflow.com/questions/16922247/how-to-access-the-admin-interface-of-an-embeddedsolrserver-instance

No, there isn't. At least, not while your application is running.

The embedded solr server reads and writes the index directly on your filesystem in your solr install directory. To access the admin console, shut down your app. In a console navigate to your /example in your solr install directory. Type java -jar start.jar. Then, you can navigate to http://localhost:8983/solr to access the admin directory.

You can not have your solr server running and be running an application that accesses the same index via EmbeddedSolrServer or you will get a LockObtainFailedException. Solr only allows one reader/writer per index at a time, and that reader/writer obtains a 'lock' in order to access the index. This prevents the index from becoming corrupted from multiple simultaneous reads/writes.

For that reason, I prefer to use the HttpSolrServer instead of the EmbeddedSolrServer, even for a development environment.

http://wiki.apache.org/solr/Solrj#EmbeddedSolrServer
http://wiki.apache.org/solr/EmbeddedSolr

If you want to use MultiCore features, then you should use this:

    File home = new File( "/path/to/solr/home" );
    File f = new File( home, "solr.xml" );
    CoreContainer container = new CoreContainer();
    container.load( "/path/to/solr/home", f );

    EmbeddedSolrServer server = new EmbeddedSolrServer( container, "core name as defined in solr.xml" );

org.apache.solr.core.SolrConfig.plugins
.add(new SolrPluginInfo(TransformerFactory.class, "transformer", REQUIRE_NAME, REQUIRE_CLASS, MULTI_OK))

RawValueTransformerFactory

https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents
q=*:*&fl=id,greeting:[value v='hello']

`[subquery]`

This transformer executes a separate query per transforming document passing document fields as an input for subquery parameters. It's usually used with {!join}and {!parent} query parsers, and is intended to be an improvement for [child].

It must be given an unique name: fl=*,children:[subquery]
There might be a few of them, eg fl=*,sons:[subquery],daughters:[subquery].
Every [subquery] occurrence adds a field into a result document with the given name, the value of this field is a document list, which is a result of executing subquery using document fields as an input.

Subquery Parameters Shift

If subquery is declared as fl=*,foo:[subquery], subquery parameters are prefixed with the given name and period. eg

q=*:*&fl=*,foo:[subquery]&foo.q=to be continued&foo.rows=10&foo.sort=id desc

https://cwiki.apache.org/confluence/display/solr/The+Query+Elevation+Component

The Query Elevation Component lets you configure the top results for a given query regardless of the normal Lucene scoring. This is sometimes called "sponsored search," "editorial boosting," or "best bets." This component matches the user query text to a configured map of top results. The text can be any string or non-string IDs, as long as it's indexed. Although this component will work with any QueryParser, it makes the most sense to use with DisMax or eDisMax.

The Query Elevation Component is supported by distributed searching.

https://zookeeper.apache.org/doc/r3.3.6/zookeeperAdmin.html
The ZooKeeper server creates snapshot and log files, but never deletes them. The retention policy of the data and log files is implemented outside of the ZooKeeper server. The server itself only needs the latest complete fuzzy snapshot and the log files from the start of that snapshot.
https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html

The ZooKeeper Data Directory contains files which are a persistent copy of the znodes stored by a particular serving ensemble. These are the snapshot and transactional log files. As changes are made to the znodes these changes are appended to a transaction log, occasionally, when a log grows large, a snapshot of the current state of all znodes will be written to the filesystem. This snapshot supercedes all previous logs.

A ZooKeeper server will not remove old snapshots and log files when using the default configuration (see autopurge below), this is the responsibility of the operator. Every serving environment is different and therefore the requirements of managing these files may differ from install to install (backup for example).

The PurgeTxnLog utility implements a simple retention policy that administrators can use. The API docs contains details on calling conventions (arguments, etc...).

autopurge.snapRetainCount

(No Java system property)

New in 3.4.0: When enabled, ZooKeeper auto purge feature retains the autopurge.snapRetainCount most recent snapshots and the corresponding transaction logs in the dataDirand dataLogDir respectively and deletes the rest. Defaults to 3. Minimum value is 3.

autopurge.purgeInterval

(No Java system property)

New in 3.4.0: The time interval in hours for which the purge task has to be triggered. Set to a positive integer (1 and above) to enable the auto purging. Defaults to 0.

http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/80357

You should be able to simply use the ZkCli clear cmd:

http://wiki.apache.org/solr/SolrCloud#Command_Line_Util

Just make sure you stop your Solr instances before clearing it. Clearing out zk from under a running Solr
instance is not a good thing to do.

This should be as simple as, stop your Solr instances, use the clean command on / or /solr (whatever the root
is in zk for you Solr stuff), start your Solr instances, create the collection again.

https://cwiki.apache.org/confluence/display/solr/Adding+Custom+Plugins+in+SolrCloud+Mode

https://cwiki.apache.org/confluence/display/solr/Blob+Store+API

The Blob Store REST API provides REST methods to store, retrieve or list files in a Lucene index. This can be used to upload a jar file which contains standard solr components such as RequestHandlers, SearchComponents, or other custom code you have written for Solr.

When using the blob store, note that the API does not delete or overwrite a previous object if a new one is uploaded with the same name. It always adds a new version of the blob to the index. Deletes can be performed with standard REST delete commands.

Create a .system Collection

Before using the blob store, a special collection must be created and it must be named .system.

The BlobHandler is automatically registered in the .system collection. The solrconfig.xml, Schema, and other configuration files for the collection are automatically provided by the system and don't need to be defined specifically.

If you do not use the -shards or -replicationFactor options, then defaults of 1 shard and 1 replica will be used.

You can create the .system collection with the Collections API, as in this example:

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=.system&replicationFactor=2"

Upload Files to Blob Store

After the .system collection has been created, files can be uploaded to the blob store with a request similar to the following:

curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @{filename} http://localhost:8983/solr/.system/blob/{blobname}

For example, to upload a file named "test1.jar" as a blob named "test", you would make a POST request like:

curl -X POST -H 'Content-Type: application/octet-stream' --data-binary @test1.jar http://localhost:8983/solr/.system/blob/test

A GET request will return the list of blobs and other details:

curl http://localhost:8983/solr/.system/blob?omitHeader=true

Use a Blob in a Handler or Component

To use the blob as the class for a request handler or search component, you create a request handler in solrconfig.xml as usual. You will need to define the following parameters:

class: the fully qualified class name. For example, if you created a new request handler class called CRUDHandler, you would enter org.apache.solr.core.CRUDHandler.
runtimeLib: Set to true to require that this component should be loaded from the classloader that loads the runtime jars.

For example, to use a blob named test, you would configure solrconfig.xml like this:

<requestHandler name="/myhandler" class="org.apache.solr.core.myHandler" runtimeLib="true" version="1">

</requestHandler>

http://stackoverflow.com/questions/21236774/how-does-solr-sort-by-default-when-using-filter-query

Querying for *:* is also called a MatchAllDocsQuery. According to the SO question How are results ordered in solr in a "match all docs" query it will return the docs in the same order as they were stored in the index.

https://cwiki.apache.org/confluence/display/solr/Defining+core.properties
./cores/core1/core.properties

dataDir

The core's data directory (where indexes are stored) as either an absolute pathname, or a path relative to the value of instanceDir. This isdata by default.

https://cwiki.apache.org/confluence/display/solr/MBean+Request+Handler

http://localhost:8983/solr/techproducts/admin/mbeans?cat=CACHE

To return information and statistics about the CACHE category only, formatted in JSON:

http://localhost:8983/solr/techproducts/admin/mbeans?stats=true&cat=CACHE&indent=true&wt=json

To return information for everything, and statistics for everything except the fieldCache:

http://localhost:8983/solr/techproducts/admin/mbeans?stats=true&f.fieldCache.stats=false

To return information and statistics for the fieldCache only:

http://localhost:8983/solr/techproducts/admin/mbeans?key=fieldCache&stats=true

admin/mbeans?stats=true&wt=json&cat=QUERYHANDLER
https://scoutapp.com/plugin_urls/10831-solr-stats

Average Timer Per Request (milliseconds)
Median Request Time (milliseconds)
95th Percentile Request Time (milliseconds)

admin/system

/admin/mbeans?stats=true&cat=CACHE&key=queryResultCache&key=filterCache

admin/mbeans?stats=true&cat=QUERYHANDLER&key=/select

https://sematext.com/blog/2015/09/29/solrcloud-large-tenants-and-routing/https://lucidworks.com/blog/2013/06/13/solr-cloud-document-routing/

Shards Are Assigned to the 32 Bit Hash Space

When a Solr Cloud collection is created with the numShards parameter, Solr assigns each shard a range of the 32 bit hash space.

In clusterstate.json each shard has a range attribute which shows the range the shard has been assigned. The ranges are shown in hex. Below is the decimal translation of the hex ranges for a 4 shard collection:

Shard1 : 2147483648-3221225471
Shard2 : 3221225472-4294967295
Shard3 : 0-1073741823
Shard4 : 1073741824-2147483647

Simple Document Routing With a Document Id Only

Each document indexed in Solr Cloud must have a unique document id assigned to it. For example:

doc50

When presented with a document id the compositeId router calculates a 32 bit murmurhash3 for the id. The compositeId router then routes the document to the shard whose range includes the hash value for the document id.

Composite Id Document Routing
A shard key can be pre-pended to the unique document id to create a composite id. The composite id is formed with the following syntax:

shard_key!document_id

The ! is the separator.

In a multi-tenant setup this might look like this:

tenant1!doc50

When a shard key is provided, the compositeId router calculates the 32 bit hash for both the shard key and the document id.

Then it creates a composite 32 bit hash by taking 16 bits from the shard key’s hash and 16 bits from the document id’s hash.

The upper bits of the hash are taken from the shard key and the lower bits from the document id.

The compositeId router then routes the document to the shard whose range includes the hash value for the composite id.

The upper bits, which come from the shard key, will dictate which shard the document is placed in.

The lower bits of the hash, which come from the unique doc id, place the document within a 65536 slice of the shard.

This scenario allows tenants to be split into multiple shards if needed in the future through Shard Splitting which was introduced in Solr 4.3.

Spreading Tenants Across More Then One Shard

When a tenant is too large to fit on a single shard it can be spread across multiple shards be specifying the number of bits to use from the shard key.

The syntax for this is:

shard_key/num!document_id

The /num is the number of bits from the shard key to use in the composite hash.

For example:

tenant1/4!doc50

This will take 4 bits from the shard key and 28 bits from the unique doc id, spreading the tenant over 1/16th of the shards in the collection.

3 bits would spread the tenant over 1/8th of the collection.
2 bits would spread the tenant over 1/4th of the collection.
1 bit would spread the tenant over 1/2 the collection.
0 bits would spread the tenant across the entire collection.

At Query Time

At query time the parameter shard.keys can be used to limit the query to a specific shard or range of shards. You specify which shard to query using this syntax:

shard.keys=tenant1!

or multiple keys:

shard.keys=tenant1!,tenant2!

During indexing:
Add a document id with routing related information for each document. e.g. myapp!user1!doc

At query time:
To query all records for myapp: shard.keys=myapp/8!

Note the explicit mention of 8 bits in case of querying by component 1 only i.e. app level. This is required because the usage of the router as 2 or 3 level isn’t implicit. Specifying ‘8’ bits for the component highlights the use of ‘3’ level router.

To query all records belonging to user1 for myapp: shard.keys=myapp!user1!

https://cwiki.apache.org/confluence/display/solr/Advanced+Distributed+Request+Options

In the traditional non-SolrCloud distributed setup, the shards parameter can be used to provide a comma-separated list of servers on which a distributed request is to be performed.

http://localhost:8983/solr/collection1/select?q=*:*&shards=localhost:7574/solr

http://localhost:8983/solr/collection1/select?q=*:*&shards=localhost:7574/solr,localhost:8983/solr

SolrCloud makes it easier because it knows about the shard names and the physical addresses of the replicas. By default, SolrCloud will run searches on all shards and combine the results if the shards parameter is not specified. You can specify one or more shard names as the value of the "shards" parameter to limit the shards that you want to search against.

http://localhost:8983/solr/collection1/select?q=*:*&shards=shard1

https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser

Square brackets [ ] denote an inclusive range query that matches values including the upper and lower bound.
Curly brackets { } denote an exclusive range query that matches values between the upper and lower bounds, but excluding the upper and lower bounds themselves.
You can mix these types so one end of the range is inclusive and the other is exclusive. Here's an example: count:{1 TO 10]

http://yonik.com/solr/query-syntax/

+color:blue^=1 text:shoes
(inStock:true text:solr)^=100 native code faceting

Filter Query

(Since Solr 5.4)
A filter query retrieves a set of documents matching a query from the filter cache. Since scores are not cached, all documents that match the filter produce the same score (0 by default). Cached filters will be extremely fast when they are used again in another query.

Filter Query Example:

description:HDTV OR filter(+promotion:tv +promotion_date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY])

The power of the filter() syntax is that it may be used anywhere within a lucene/solr query syntax. Normal fq support is limited to top-level conjunctions. However when normal top-level fq filter caching can be used, that form is preferred.

http://yonik.com/solr/atomic-updates/

$ curl http://localhost:8983/solr/demo/update -d '

[

 {"id"         : "book1",

  "author_s"   : {"set":"Neal Stephenson"},

  "copies_i"   : {"inc":3},

  "cat_ss"     : {"add":"Cyberpunk"}

 }

]'

// create the SolrJ client

HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr");

// create the document

SolrInputDocument sdoc = new SolrInputDocument();

sdoc.addField("id","book1");

Map<String,Object> fieldModifier = new HashMap<>(1);

fieldModifier.put("add","Cyberpunk");

sdoc.addField("cat", fieldModifier);  // add the map as the field value

client.add( sdoc );  // send it to the solr server

select?q={!join+from=manu_id_s+to=id}cat:"graphics card"
zkcli:
https://cwiki.apache.org/confluence/display/solr/Command+Line+Utilities

https://wiki.apache.org/solr/CommonQueryParameters#debugQuery
debugQuery=true

debug

Clients may also specify control over individual parts of debugging output by specifying debug= with one of four options:

timing -- Provide debug info about timing of components, etc. only
query -- Provide debug info about the query only
results -- Provide debug info about the results (currently explains)
true -- If true, this is the equivalent of &debugQuery=true

explainOther

This parameter allows clients to specify a Lucene query to identify a set of documents. If non-blank, the explain info of each document that matches this query, relative to the main query (specified by the q parameter) will be returned along with the rest of the debugging information. This is useful, for instance, for understanding why a particular document is not in the result set. For instance, the query http://localhost:8983/solr/select?q=ipod&debug=results&explainOther=id:MA* (run against

Solr4.0) shows the explanations for the query ipod and also shows the explanations for all documents that match id:MA* as if the main query were run (ipod) and produced the documents that the id:MA* query produced.

http://stackoverflow.com/questions/4238609/how-to-query-solr-for-empty-fields

According to SolrQuerySyntax, you can use q=-id:[* TO *].

One caveat! If you want to compose this via OR or AND you cannot use it in this form:

-myfield:*

but you must use

(*:* NOT myfield:*)

This form is perfectly composable. Apparently SOLR will expand the first form to the second, but only when it is a top node. Hope this saves you some time!

https://wiki.apache.org/solr/SearchHandler

defaults - provides default param values that will be used if the param does not have a value specified at request time.
appends - provides param values that will be used in addition to any values specified at request time (or as defaults.
invariants - provides param values that will be used in spite of any values provided at request time. They are a way of letting the Solr maintainer lock down the options available to Solr clients. Any params values specified here are used regardless of what values may be specified in either the query, the "defaults", or the "appends" params.

https://lucidworks.com/blog/2011/12/28/why-not-and-or-and-not/
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/94143

fq=((*:* -(field1:value1)))+OR+(field2:value2).

http://blog.csdn.net/matthewei6/article/details/50620600
bin/solr start 启动单机版
bin/solr start -f 前台启动
bin/solr start -p 8984 指定端口启动
bin/solr start -cloud 启动分布式版本

bin/solr start -e cloud -noprompt -e表示要启动一个现有的例子，例子名称是cloud，cloud这个例子是以SolrCloud方式启动的

bin/solr restart 重启项目

create

如果是单机版要创建core，如果是分布式的要创建collection

bin/solr create -help 查看create帮助

bin/solr create -c abc

abc是core或collection的名字，取决于solr是单机版还是cloud版本；刷新http://localhost:8983/solr ，可以看到core selector中多了一个abc

abc目录的位置创建在solr.solr.home（默认是solr的server/solr目录）目录下

post提交数据生成索引

bin/post -c abc docs/

向名为abc的core或collection提交数据，数据源在docs/目录中

删除

bin/solr delete -c abc 删除一个core或collection

删除索引

bin/post -c abc -d "<delete><id>/home/matthewi/software/solr-5.4.1/docs/solr-morphlines-core/allclasses-noframe.html</id></delete>"

重新执行上面的搜索可以看到搜索结果的数量少了一条：numFound列

bin/post -c abc -d "<delete><query>*:*</query></delete>"

删除所有数据

停止solr

bin/solr stop -all

状态

bin/solr status

http://blog.csdn.net/ajian005/article/details/37669765

•Just like all request handlers, update handlers can be mapped to a specific URL and have their own set of default or invariant parameters.
• Each update handler can have it’s own Update Processor Chain that can do Document-level operations prior to indexing, or even redirect indexing to a different server or create multiple documents (or zero) from a single one.
• All of the configuration is declarative, including the specification of update processor chains.

Lucene/Solr plugins
•RequestHandlers – handle a request at a URL like /select
•SearchComponents – part of a SearchHandler, a componentized request handler
–Includes, Query, Facet, Highlight, Debug, Stats
–Distributed Search capable
•UpdateHandlers – handle an indexing request
•Update Processor Chains – per-handler componentized chain that handle updates
•Query Parser plugins
–Mix and match query types in a single request
–Function plugins for Function Query
•Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
•ResponseWriters serialize & stream response to client

Each request handler can be mapped to a different URL
• SearchHandler is a componentized RequestHandler that allows search components to be chained together and also enables the framework for distributed search operations.
• Each Searchhandler can have it’s own custom set of search components, along with default or invariant parameters
• All of the configuration is declarative – including adding new request handlers or search components.
• The QueryResponse object is very generic and can handle returning any type of data

https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud
bin/solr stop -all
By default, an embedded Zookeeper server runs at the Solr port plus 1000.

an index split across multiple nodes is called a collection.
SolrCloud also depends on a distributed coordination service called Apache ZooKeeper.

ZooKeeper as an abstract service that manages cluster state and distributes configuration files to nodes joining the cluster.

When first starting a SolrCloud cluster, you need to start one Solr node that lays the groundwork for all other nodes to come. We’ll refer to this first node as our bootstrap node, as it performs special one-time initialization work for the rest of the cluster.

The collection.configName parameter specifies the name of a configuration directory in ZooKeeper. Every collection in SolrCloud needs to identify a named configuration directory in ZooKeeper;

The bootstrap_confdir parameter tells the bootstrap node to upload its configuration files to ZooKeeper. One of the primary features provided by ZooKeeper in SolrCloud is a centralized configuration store. Centralized configuration allows all nodes in the cluster to download their configurations from a central location instead of a system administrator having to push configuration changes to multiple nodes. Before ZooKeeper can provide centralized configuration, you need to upload the configuration from the bootstrap server.

a shard leader handles additional responsibilities when processing update requests to the shard, such as assigning a unique version number for each document being created or updated.

java -DzkHost=localhost:9983 -Djetty.port=8984 -jar start.jar
The zkHost parameter activates SolrCloud mode by telling Solr to register itself with the specified ZooKeeper server during initialization. ZooKeeper will assign the initializing Solr instance to a specific shard and assign a role for that shard, either leader or replica.

you need to upload the configuration files to ZooKeeper prior to creating the collection using the Collections API. The configuration for a new collection must exist in ZooKeeper before the collection is created.

cd $SOLR_INSTALL/shard1/scripts/cloud-scripts/
./zkcli.sh
upconfig uploads the configuration files from your local workstation to ZooKeeper.
Using ZooKeeper to Manage Configuration Files
https://cwiki.apache.org/confluence/display/solr/Using+ZooKeeper+to+Manage+Configuration+Files
bin/solr -e cloud -noprompt

bin/solr create -c mycollection -d data_driven_schema_configs
The create command will upload a copy of the data_driven_schema_configs configuration directory to ZooKeeper under /configs/mycollection.

sh zkcli.sh -cmd upconfig -zkhost <host:port> -confname <name for configset> -solrhome <solrhome> -confdir <path to directory with configset>
zkcli.sh -zkhost localhost:2181 -cmd putfile /solr.xml /path/to/solr.xml

/admin/collections?action=CREATE&name=name&numShards=number&replicationFactor=number&maxShardsPerNode=number&createNodeSet=nodelist&collection.configName=configname

http://stackoverflow.com/questions/28589942/delete-remove-solr-configuration-from-zookeeper-using-zkcli
From within your zoo keeper bin, start zkCli.sh, this will open zookeeper console then fire the following command to delete a node/config

rmr /configshttp://stackoverflow.com/questions/28589942/delete-remove-solr-configuration-from-zookeeper-using-zkcli

Using Collections API
https://cwiki.apache.org/confluence/display/solr/https://cwiki.apache.org/confluence/display/solr/Collections+API
/admin/collections?action=CREATE&name=name&numShards=number&replicationFactor=number&maxShardsPerNode=number&createNodeSet=nodelist&collection.configName=configname

/admin/collections?action=DELETESHARD&shard=shardID&collection=name

http://lucene.472066.n3.nabble.com/Querying-a-specific-core-in-solr-cloud-td4079964.html
1> With SolrCloud, you don't need to specify shards. That's only
really for non-SolrCloud mode.
2> You can add &distrib=false to your query to only return the results
from the node you direct the query to.
https://cwiki.apache.org/confluence/display/solr/Solr+Start+Script+Reference

https://cwiki.apache.org/confluence/display/solr/Schema+API
http://localhost:8983/solr/message/schema/fields

Range Searches

A range search specifies a range of values for a field (a range with an upper bound and a lower bound). The query matches documents whose values for the specified field or fields fall within the range. Range queries can be inclusive or exclusive of the upper and lower bounds. Sorting is done lexicographically, except on numeric fields. For example, the range query below matches all documents whose mod_date field has a value between 20020101 and 20030101, inclusive.

mod_date:[20020101 TO 20030101]

Range queries are not limited to date fields or even numerical fields. You could also use range queries with non-date fields:

title:{Aida TO Carmen}

This will find all documents whose titles are between Aida and Carmen, but not including Aida and Carmen.

The brackets around a query determine its inclusiveness.

Square brackets [ ] denote an inclusive range query that matches values including the upper and lower bound.
Curly brackets { } denote an exclusive range query that matches values between the upper and lower bounds, but excluding the upper and lower bounds themselves.
You can mix these types so one end of the range is inclusive and the other is exclusive. Here's an example: count:{1 TO 10]

Reloading Zoo keeper solr conf (schema.xml)
Solr Start Script Reference
Command Line Utilities
http://rayoo.iteye.com/blog/2121443

zkcli.sh -zkhost localhost:9983 -cmd bootstrap -solrhome /opt/solr

根据[配置名称] 和 [本地配置路径] 上传本地Solr配置到ZooKeeper服务器:

zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir /opt/solr/collection1/conf -confname myconf

根据[配置名称] 下载ZooKeeper服务器配置到 [本地]

zkcli.sh -zkhost localhost:9983 -cmd downconfig -confdir /opt/solr/collection1/conf -confname myconf

连接Solr中的 [集合名称] 和 [配置名称] 的配置

zkcli.sh -zkhost localhost:9983 -cmd linkconfig -collection collection1 -confname myconf

....创建路径

zkcli.sh -zkhost localhost:9983 -cmd makepath /apache/solr

....将内容写到服务器上的某个文件

zkcli.sh -zkhost localhost:9983 -cmd put /solr.conf 'conf data'

....将本地文件写到服务器上的某个文件

zkcli.sh -zkhost localhost:9983 -cmd putfile /solr.xml /User/myuser/solr/solr.xml

....从服务器上查看某个文件的内容

zkcli.sh -zkhost localhost:9983 -cmd get /solr.xml

....从服务器上下载某个文件的内容

zkcli.sh -zkhost localhost:9983 -cmd getfile /solr.xml solr.xml.file

zkcli.sh -zkhost localhost:9983 -cmd clear /solr

zkcli.sh -zkhost localhost:9983 -cmd list

http://localhost:8983/solr/admin/info/system
solr/collection/schema/version
https://wiki.apache.org/solr/SolrPerformanceProblems#Slow_commits

If you commit very frequently, you may send a new commit before the previous commit is finished. If you have cache warming enabled as just discussed, this is more of a problem. If you have a high maxWarmingSearchers in your solrconfig.xml, you can end up with a lot of new searchers warming at the same time, which is very I/O intensive, so the problem compounds itself.

If you are having problems with slow commit times when NOT opening a new searcher, then this is probably due to a general performance problem, like extreme GC pauses or not enough OS memory for disk caching. Both of these issues are discussed earlier on this page.

https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Be very careful committing from the client! In fact, don’t do it. By and large, do not issue commits from any client indexing to Solr, it’s almost always a mistake. And especially in those cases where you have multiple clients indexing at once, it is A Bad Thing. What happens is commits come in unpredictably close to each other, generating work as above. You’ll possibly see warnings in your log about “too many warming searchers”. Or you’ll see a zillion small segments. Or… Let your autocommit settings (both soft and hard) in solrconfig.xml handle the commit frequency. If you absolutely must control the visibility, say you want to search docs right after the indexing run happens and you can’t afford to wait for your autocommit settings to kick in, commit once at the end.

Index-heavy, Query-heavy

This is the Near Real Time (NRT) case, and is really the trickiest of the lot. This one will require experimentation, but here’s where I’d start

Set your soft commit interval to as long as you can stand. Don’t listen to your product manager who says “we need no more than 1 second latency”. Really. Push back hard and see if the user is best served or will even notice. Soft commits and NRT are pretty amazing, but they’re not free.
Set your hard commit interval to 15 seconds.

Consider a soft commit. On execution you have the following:

The tlog has NOT been truncated. It will continue to grow.
The documents WILL be visible.
Some caches will have to be reloaded.
Your top-level caches will be invalidated.
Autowarming will be performed.
New segments are created that will be merged.

Note, I haven’t said a thing about index segments! That’s for hard commits. And again, soft commits are “less expensive” than hard commits (openSearcher=true), but they are not free. The motto of the Lunar Colony in a science fiction novel (“The Moon Is a Harsh Mistress” by Robert Heinlein) was TANSTAAFL, There Ain’t No Such Thing As A Free Lunch. Soft commits are there to support Near Real Time, and they do. But they do have cost, so use the longest soft commit interval you can for best performance.

Hard commits are about durability, soft commits are about visibility. There are really two flavors here, openSearcher=true and openSearcher=false. First we’ll talk about what happens in both cases. If openSearcher=true or openSearcher=false, the following two consequences are most important:

The tlog is truncated: A new tlog is started. Old tlogs will be deleted if there are more than 100 documents in newer, closed tlogs.
The current index segment is closed and flushed.
Background segment merges may be initiated.

The above happens on all hard commits. That leaves the openSearcher setting

openSearcher=true: The Solr/Lucene searchers are re-opened and all caches are invalidated. Autowarming is done etc. This used to be the only way you could see newly-added documents.
openSearcher=false: Nothing further happens other than the four points above. To search the docs, a soft commit is necessary.

To get solrcloud nodes info(ip address:)
java -classpath "*" org.apache.solr.cloud.ZkCLI "-zkhost myzkhost -cmd get /clusterstate.json"

/opt/solr-5.4.1/server/scripts/cloud-scripts/zkcli.sh -zkhost myzkhost -cmd get /clusterstate.json

Thursday, August 13, 2015

Solr Miscs

`[child]` - ChildDocTransformerFactory

Lucene Index Representation

Limitations

Schema Requirements

Lucene Index Representation

Limitations

Schema Requirements

Block Join

Block Join Children Query Parser

Block Join Parent Query Parser

ZooKeeper chroot

Uploading Content in XML or Binary Formats

Using the ConcurrentUpdateSolrClient

`[subquery]`

Subquery Parameters Shift

Create a .system Collection

Upload Files to Blob Store

Use a Blob in a Handler or Component

Filter Query

debug

explainOther

Range Searches

Index-heavy, Query-heavy

Labels

Popular Posts

Thursday, August 13, 2015

Solr Miscs

[child] - ChildDocTransformerFactory

Lucene Index Representation

Limitations

Schema Requirements

Lucene Index Representation

Limitations

Schema Requirements

Block Join

Block Join Children Query Parser

Block Join Parent Query Parser

ZooKeeper chroot

Uploading Content in XML or Binary Formats

Using the ConcurrentUpdateSolrClient

[subquery]

Subquery Parameters Shift

Create a .system Collection

Upload Files to Blob Store

Use a Blob in a Handler or Component

Filter Query

debug

explainOther

Range Searches

Index-heavy, Query-heavy

Labels

Popular Posts

`[child]` - ChildDocTransformerFactory

`[subquery]`