Saturday, 10 May 2014

Intro to Solr schema.xml File

So far, I have introduced the Apache Solr dashboard. Now I am going to discuss the core part of Apache Solr, i.e. the Solr schema.xml file.

"The schema.xml file defines the fields that are used as a reference for inserting data into Apache Solr as well as for querying it."

This XML file defines which fields your documents can have. If a document contains a field that is neither defined in schema.xml nor matched by a dynamic field, Solr by default rejects the document with an "unknown field" error rather than silently dropping the extra field.
When you provide a document in any of the supported formats, Apache Solr uses this file to decide how the different fields in that document will be treated, i.e. this file is used for:

  1. Deciding the unique field in the document.
  2. Declaring the fields used for indexing.
  3. Declaring the fields to be generated dynamically.
  4. Deciding how different fields will be used at indexing time as well as at query time, etc.

For beginners, I divide this XML file into five core tags. The file also has some other tags, but these are enough to start with. I will discuss all of them in the Advanced Solr Learning Tutorial Series.
  1. Fields
  2. Unique Key 
  3. Copy Field
  4. Types
  5. Default Search Fields
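This schema.xml file is located in the /solr4.7.2/example/collection1/conf directory (in a stock Solr 4.x download the path is typically example/solr/collection1/conf under the installation directory). In this XML file the root tag is <schema>, and all other tags are defined within it.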

<fields> Tag
This is the section where you define the fields that you want to have in your documents and that are used for indexing as well as for querying. Within this tag we define multiple <field> tags, e.g.

<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>

name : Every field has this attribute, which is used as a reference when adding documents and in search queries.
type : It defines the type of the field, e.g. string, int, etc. The value must be the name of a <fieldType> defined in the schema.
stored : If false, the field's original value is not stored in Apache Solr and cannot be returned in search results. Its default value is true.
indexed : If true, the field is indexed, which makes it searchable, sortable and facetable by Apache Solr.
multiValued : If true, Solr accepts multiple values for this field in a single document. Its default value is false.
required : If true, every document must contain this field.
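
To make the interplay of indexed and stored concrete, here is a minimal sketch of three declarations that would sit inside the <fields> tag. The field names title, keywords and thumbnail_url are hypothetical and only serve as illustration; they are not part of the schema used later in this post.

```xml
<!-- Searchable and returned in results (hypothetical field "title") -->
<field name="title" type="string" indexed="true" stored="true" />

<!-- Searchable, but the original value is not returned in results -->
<field name="keywords" type="string" indexed="true" stored="false" multiValued="true" />

<!-- Returned in results, but not searchable -->
<field name="thumbnail_url" type="string" indexed="false" stored="true" />
```

A common reason to set stored="false" is a field that exists only to be searched, while indexed="false" suits values that are only displayed.
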
The <fields> tag also supports dynamic fields.

For example, the following dynamic field declaration tells Solr that whenever it sees a field name ending in "_i" that is not an explicitly defined field, it should dynamically create a string field with that name:

<dynamicField name="*_i" type="string" indexed="true" stored="true" />

Dynamic fields let Solr generate fields on the fly, so we do not need to define every field in advance; Solr does it for us.
You can use * only at the start or at the end of the name pattern. If a field name matches more than one pattern, the longest pattern is tried first; among patterns of equal length, the one defined first in the schema is used.
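
As a rough illustration of where the * can go, the sketch below pairs the suffix pattern from this post with a prefix pattern. The attr_* pattern and the example field names in the comments are assumptions made for illustration only.

```xml
<!-- Suffix pattern from this post: matches names such as "dynamicField_i" -->
<dynamicField name="*_i" type="string" indexed="true" stored="true" />

<!-- Hypothetical prefix pattern: matches names such as "attr_color" -->
<dynamicField name="attr_*" type="string" indexed="true" stored="true" multiValued="true" />
```

Both declarations belong inside the <fields> tag, next to the regular <field> declarations.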

Mandatory fields - 

_root_
Points to the root document of a block of nested documents. It is required for nested document support and may be removed otherwise.
_version_
The _version_ field is an internal field that is used by the partial update procedure, the update log process, etc. It is only used internally for those processes, and simply providing the _version_ field in your schema.xml should be sufficient.

<copyField> Tag
<copyField source="title" dest="text"/>
<copyField source="text_data" dest="text"/>
<copyField source="description" dest="text"/>

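The first declaration copies the content of the title field into the text field. You can copy the content of multiple fields into one field. For this to work, the destination field (text here) must exist in your schema, and it should be multiValued when it receives values from more than one source. You can then use the destination field for indexing and searching.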
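
Two further variations may be useful, shown here as a sketch rather than as part of this post's schema: the source attribute accepts a wildcard pattern, and the optional maxChars attribute limits how many characters are copied. The limit of 300 is an arbitrary value chosen for illustration.

```xml
<!-- Copy every field whose name ends in "_i" into the catch-all "text" field -->
<copyField source="*_i" dest="text" />

<!-- Copy at most the first 300 characters of "description" (300 is an arbitrary example value) -->
<copyField source="description" dest="text" maxChars="300" />
```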

<uniqueKey>id</uniqueKey>
As its name suggests, this is the primary key of the document records. It is used to uniquely identify a record during modification and deletion, as well as when searching. Adding a document whose key already exists replaces the existing document.

<defaultSearchField>text</defaultSearchField>
This field is used when you do not provide a field name explicitly, i.e. if you search for plain text without a field label, Solr searches the text field and shows you the results. Note that this tag is marked as deprecated in Solr 4.x in favour of the df request parameter.
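
In Solr 4.x the example configuration marks <defaultSearchField> as deprecated and sets the default field through the df parameter instead. The following is a minimal sketch of how that could look in solrconfig.xml (not schema.xml), assuming the standard /select handler and the text field used in this post:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Field searched when the query does not name a field explicitly -->
    <str name="df">text</str>
  </lst>
</requestHandler>
```

The same parameter can also be passed per request, e.g. by adding df=text to the query URL.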

<types>
This section lets you declare the different <fieldType> definitions that will be available to your schema, e.g.

<fieldType name="string" class="solr.StrField" />

name : The unique name that works as an identifier; the type attribute of a <field> refers to it.
class : The Solr class that implements this field type. A <fieldType> can also carry default options for the fields that use it. The default schema.xml file comes with a large number of <fieldType> definitions; go through them to learn the options available for each field type. I am not going to discuss every optional attribute here; I will explain them as our requirements call for them.
For common numeric types (integer, float, etc.) multiple implementations are provided, so you can choose one depending on your needs.
You can also create your own custom field type. Please see SolrPlugins for information on how to ensure that your custom field types can be loaded into Solr.
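
To give an idea of what "multiple implementations" means in practice, here is a short sketch modelled on the kind of definitions found in the Solr 4.x example schema. The type names int, tint and text_general are just labels and could be anything; the precisionStep values are illustrative.

```xml
<types>
  <!-- Plain integer type: precisionStep="0" indexes a single term per value -->
  <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0" />

  <!-- Same class, tuned for faster range queries at the cost of a larger index -->
  <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0" />

  <!-- Tokenized, lowercased text for full-text search -->
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
  </fieldType>
</types>
```

Unlike solr.StrField, which indexes the whole value as one exact term, a solr.TextField with an analyzer splits the value into words, so a query for a single word can match.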

Let's define a new schema as shown below. Just copy and paste this schema into your schema.xml file and then start (or restart) your Solr server.


<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example core zero" version="1.1">
  <fields>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <field name="_root_" type="string" indexed="false" stored="false"/>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="name" type="string" indexed="true" stored="true" multiValued="true" />
    <field name="address" type="string" indexed="true" stored="true" multiValued="true" />
    <field name="comments" type="string" indexed="true" stored="true" multiValued="true" />
    <field name="text" type="string" indexed="true" stored="false" multiValued="true"/>
    <dynamicField name="*_i" type="string" indexed="true" stored="true" />
  </fields>

  <uniqueKey>id</uniqueKey>

  <copyField source="name" dest="text"/>
  <copyField source="address" dest="text"/>
  <copyField source="comments" dest="text"/>

  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
  </types>
</schema>

To check whether your schema has been updated:
Go to the Solr Dashboard screen -> select the collection1 core -> select Files -> schema.xml

Now go to the Documents section, then copy and paste the following JSON into the Document(s) field:
{
  "id": "Solr101",
  "name": "Solr version 4.7.2",
  "address": ["House No - 100, LR Apache, 40702", "Address 2 ", "address 3"],
  "comments": ["Working with Apache Solr and it's cool till now I am a beginner and started with Ankur.",
               "Comment 2", "Comment 3"],
  "dynamicField_i": "It is dynamically generated field."
}
Your screen should look like this:

[Screenshot: the Documents screen with the JSON document pasted in]

Set the Document Type to JSON and click Submit Document.
Then go to the Query section and click the Execute Query button. You will see a screen like the following:

[Screenshot: the query result showing the indexed document]

Here you can see that the address and comments fields are returned as JSON arrays, because I set multiValued to true for both of them. One more field, dynamicField_i, has been created: its name matches the *_i dynamic field pattern, so Solr created it for us dynamically.
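
For comparison, the same kind of document can also be sent in Solr's XML update format, for example by choosing XML as the document type. The sketch below uses the values from the JSON above, with the comments shortened, and shows how multiValued fields are expressed in XML:

```xml
<add>
  <doc>
    <field name="id">Solr101</field>
    <field name="name">Solr version 4.7.2</field>
    <!-- A multiValued field is sent as one <field> element per value -->
    <field name="address">House No - 100, LR Apache, 40702</field>
    <field name="address">Address 2</field>
    <field name="address">address 3</field>
    <field name="comments">Comment 2</field>
    <field name="comments">Comment 3</field>
    <!-- Matches the *_i dynamic field pattern -->
    <field name="dynamicField_i">It is dynamically generated field.</field>
  </doc>
</add>
```

Because the id is the same, submitting this document would replace the one added earlier rather than create a second record.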