
"name":"id",

"type":"string",
"multiValued":false,
"indexed":true,
"required":true,
"stored":true,
"uniqueKey":true}]}
Configuring Schemaless Mode
As described above, there are three configuration elements that need to be in place to use
Solr in
schemaless mode. In the _default configset included with Solr these are already configured.
If, however,
you would like to implement schemaless on your own, you should make the following changes.
Enable Managed Schema
As described in the section Schema Factory Definition in SolrConfig, Managed Schema support
is enabled by
default, unless your configuration specifies that ClassicIndexSchemaFactory should be used.
You can configure the ManagedIndexSchemaFactory (and control the resource file used, or disable future
modifications) by adding an explicit <schemaFactory/> like the one below. Please see Schema Factory
Definition in SolrConfig for more details on the options available.
<schemaFactory class="ManagedIndexSchemaFactory">
<bool name="mutable">true</bool>
<str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
Enable Field Class Guessing
In Solr, an UpdateRequestProcessorChain defines a chain of plugins that are applied to
documents before or
while they are indexed.
Apache Solr Reference Guide 7.7 Page 215 of 1426
© 2019, Apache Software Foundation Guide Version 7.7 - Published: 2019-03-04
The field guessing aspect of Solr’s schemaless mode uses a specially-defined
UpdateRequestProcessorChain
that allows Solr to guess field types. You can also define the default field type classes to
use.
To start, you should define it as follows (see the javadoc links below for update processor
factory
documentation):
<updateProcessor class="solr.UUIDUpdateProcessorFactory" name="uuid"/>
<updateProcessor class="solr.RemoveBlankFieldUpdateProcessorFactory" name="remove-blank"/>
<updateProcessor class="solr.FieldNameMutatingUpdateProcessorFactory" name="field-
namemutating">

<str name="pattern">[^\w-\.]</str>
<str name="replacement">_</str>
</updateProcessor>
<updateProcessor class="solr.ParseBooleanFieldUpdateProcessorFactory" name="parse-boolean"/>

<updateProcessor class="solr.ParseLongFieldUpdateProcessorFactory" name="parse-long"/>
<updateProcessor class="solr.ParseDoubleFieldUpdateProcessorFactory" name="parse-double"/>
<updateProcessor class="solr.ParseDateFieldUpdateProcessorFactory" name="parse-date">
<arr name="format">
<str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
<str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
<str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
<str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
<str>yyyy-MM-dd'T'HH:mm:ssZ</str>
<str>yyyy-MM-dd'T'HH:mm:ss</str>
<str>yyyy-MM-dd'T'HH:mmZ</str>
<str>yyyy-MM-dd'T'HH:mm</str>
<str>yyyy-MM-dd HH:mm:ss.SSSZ</str>
<str>yyyy-MM-dd HH:mm:ss,SSSZ</str>
<str>yyyy-MM-dd HH:mm:ss.SSS</str>
<str>yyyy-MM-dd HH:mm:ss,SSS</str>
<str>yyyy-MM-dd HH:mm:ssZ</str>
<str>yyyy-MM-dd HH:mm:ss</str>
<str>yyyy-MM-dd HH:mmZ</str>
<str>yyyy-MM-dd HH:mm</str>
<str>yyyy-MM-dd</str>
</arr>
</updateProcessor>
<updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields">

<lst name="typeMapping">
<str name="valueClass">java.lang.String</str> ④
<str name="fieldType">text_general</str>
<lst name="copyField"> ⑤
<str name="dest">*_str</str>
<int name="maxChars">256</int>
</lst>
<!-- Use as default mapping instead of defaultFieldType -->
<bool name="default">true</bool>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Boolean</str>
<str name="fieldType">booleans</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.util.Date</str>
<str name="fieldType">pdates</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Long</str> ⑥
<str name="valueClass">java.lang.Integer</str>
<str name="fieldType">plongs</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Number</str>
<str name="fieldType">pdoubles</str>
</lst>
</updateProcessor>
<!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema"
  default="${update.autoCreateFields:true}"
  processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields"> ⑦
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
There are many things defined in this chain. Let’s step through a few of them.
① First, we’re using the FieldNameMutatingUpdateProcessorFactory to replace any characters in field
names that are not word characters, hyphens, or periods with underscores. Note that this and every
following <updateProcessor> element includes a name. These names will be used in the final chain
definition at the end of this example.
② Next we add several update request processors to parse different field types. Note the
ParseDateFieldUpdateProcessorFactory includes a long list of possible date formats that would be
parsed into valid Solr dates. If you have a custom date format, you could add it to this list (see the
link to the Javadocs below for information on how).
③ Once the fields have been parsed, we define the field types that will be assigned to
those fields. You can
modify any of these that you would like to change.
④ In this definition, if the parsing step decides the incoming data in a field is a string,
we will put this into a
field in Solr with the field type text_general. This field type by default allows Solr to
query on this field.
⑤ After we’ve added the text_general field, we have also defined a copy field rule that
will copy all data
from the new text_general field to a field with the same name suffixed with _str. This is
done by Solr’s
dynamic fields feature. By defining the target of the copy field rule as a dynamic field in
this way, you
can control the field type used in your schema. The default selection allows Solr to facet,
highlight, and
sort on these fields.
⑥ This is another example of a mapping rule. In this case we define that when either the Long or the
Integer field parser identifies a field, it should be mapped to the plongs field type.
⑦ Finally, we add a chain definition that calls the list of plugins. These plugins are each
called by the names
we gave to them when we defined them. We can also add other processors to the chain, as shown
here.
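The field-name-mutating step above is easy to picture with a small sketch. The following Python is a hypothetical stand-in for what the processor does to a single field name, not Solr's actual implementation; note that Python's \w is Unicode-aware while Java's is ASCII-only by default, so this is only an approximation:

```python
import re

# Mirrors the <str name="pattern">[^\w-\.]</str> configuration above:
# any character that is not a word character (letter, digit, underscore),
# hyphen, or period is replaced with an underscore.
PATTERN = re.compile(r"[^\w.-]")

def mutate_field_name(name: str) -> str:
    """Sketch of what FieldNameMutatingUpdateProcessorFactory does to one name."""
    return PATTERN.sub("_", name)

print(mutate_field_name("Product Price ($)"))  # Product_Price____
print(mutate_field_name("plain_name-1.0"))     # plain_name-1.0 (unchanged)
```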
Note we have also given the entire chain a name ("add-unknown-fields-to-the-schema"). We’ll
use this
name in the next section to specify that our update request handler should use this chain
definition.

This chain definition will make a number of copy field rules for string fields to be created
from corresponding text fields. If your data causes you to end up with a lot of copy field
rules, indexing may be slowed down noticeably, and your index size will be larger. To
control for these issues, it’s recommended that you review the copy field rules that are
created, and remove any which you do not need for faceting, sorting, highlighting, etc.
If you’re interested in more information about the classes used in this chain, here are
links to the Javadocs
for update processor factories mentioned above:
• UUIDUpdateProcessorFactory
• RemoveBlankFieldUpdateProcessorFactory
• FieldNameMutatingUpdateProcessorFactory
• ParseBooleanFieldUpdateProcessorFactory
• ParseLongFieldUpdateProcessorFactory
• ParseDoubleFieldUpdateProcessorFactory
• ParseDateFieldUpdateProcessorFactory
• AddSchemaFieldsUpdateProcessorFactory
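The format-list behavior of ParseDateFieldUpdateProcessorFactory, trying each pattern in order until one matches, can be approximated in Python. The Java patterns do not map one-to-one onto strptime directives (SSS is milliseconds while %f accepts up to six digits), so this is only a rough analogue:

```python
from datetime import datetime

# A few of the Java patterns from the configuration, translated approximately.
FORMATS = [
    "%Y-%m-%dT%H:%M:%S.%f%z",  # yyyy-MM-dd'T'HH:mm:ss.SSSZ
    "%Y-%m-%dT%H:%M:%S",       # yyyy-MM-dd'T'HH:mm:ss
    "%Y-%m-%d %H:%M:%S",       # yyyy-MM-dd HH:mm:ss
    "%Y-%m-%d",                # yyyy-MM-dd
]

def parse_date(value: str):
    """Try each format in order; first match wins, as in the Solr processor."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            pass
    return None  # not a date; the field would not be guessed as pdates

print(parse_date("1988-08-13"))  # 1988-08-13 00:00:00
print(parse_date("not a date"))  # None
```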
Set the Default UpdateRequestProcessorChain
Once the UpdateRequestProcessorChain has been defined, you must instruct your
UpdateRequestHandlers
to use it when working with index updates (i.e., adding, removing, replacing documents).
There are two ways to do this. The update chain shown above has a default=true attribute
which will use it
for any update handler.
An alternative, more explicit way is to use InitParams to set the defaults on all /update
request handlers:
<initParams path="/update/**">
<lst name="defaults">
<str name="update.chain">add-unknown-fields-to-the-schema</str>
</lst>
</initParams>
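The initParams approach applies the default to every handler matching /update/**. If you prefer to scope it to a single handler, the same default can also be set directly on that handler's definition; the following is a sketch of a standard requestHandler declaration, not taken from the _default configset:

```xml
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">add-unknown-fields-to-the-schema</str>
  </lst>
</requestHandler>
```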
After all of these changes have been made, Solr should be restarted or the cores
reloaded.
Disabling Automatic Field Guessing
Automatic field creation can be disabled with the update.autoCreateFields property. To do
this, you can
use bin/solr config with a command such as:
bin/solr config -c mycollection -p 8983 -action set-user-property -property update.autoCreateFields -value false
Examples of Indexed Documents
Once the schemaless mode has been enabled (whether you configured it manually or are using
the
_default configset), documents that include fields that are not defined in your schema will
be indexed,
using the guessed field types which are automatically added to the schema.
For example, adding a CSV document will cause unknown fields to be added, with fieldTypes
based on
values:
curl "http://localhost:8983/solr/gettingstarted/update?commit=true&wt=xml" -H "Contenttype:
application/csv" -d '
id,Artist,Album,Released,Rating,FromDistributor,Sold
44C,Old Shews,Mead for Walking,1988-08-13,0.01,14,0'
Output indicating success:
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">106</int></lst>
</response>
The fields now in the schema (output from curl
http://localhost:8983/solr/gettingstarted/schema/fields ):
{
"responseHeader":{
"status":0,
"QTime":2},
"fields":[{
"name":"Album",
"type":"text_general"},
{
"name":"Artist",
"type":"text_general"},
{
"name":"FromDistributor",
"type":"plongs"},
{
"name":"Rating",
"type":"pdoubles"},
{
"name":"Released",
"type":"pdates"},
{
"name":"Sold",
"type":"plongs"},
{
"name":"_root_", ...},
{
"name":"_text_", ...},
{
"name":"_version_", ...},
{
"name":"id", ...}
]}
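A response like this is easy to post-process if you want the guessed types at a glance. The following sketch parses a trimmed copy of the JSON shown above (the system fields are omitted for brevity):

```python
import json

# Trimmed version of the /schema/fields response shown above.
response = json.loads("""
{"responseHeader": {"status": 0, "QTime": 2},
 "fields": [
   {"name": "Album",           "type": "text_general"},
   {"name": "Artist",          "type": "text_general"},
   {"name": "FromDistributor", "type": "plongs"},
   {"name": "Rating",          "type": "pdoubles"},
   {"name": "Released",        "type": "pdates"},
   {"name": "Sold",            "type": "plongs"}]}
""")

# Map each field name to the type Solr guessed for it.
guessed = {f["name"]: f["type"] for f in response["fields"]}
print(guessed["Sold"])    # plongs
print(guessed["Rating"])  # pdoubles
```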
In addition, string versions of the text fields are indexed, using copyFields to a *_str dynamic field
(output from curl http://localhost:8983/solr/gettingstarted/schema/copyfields):
{
"responseHeader":{
"status":0,
"QTime":0},
"copyFields":[{
"source":"Artist",
"dest":"Artist_str",
"maxChars":256},
{
"source":"Album",
"dest":"Album_str",
"maxChars":256}]}

You Can Still Be Explicit


Even if you want to use schemaless mode for most fields, you can still use the Schema API
to pre-emptively create some fields, with explicit types, before you index documents that
use them.
Internally, the Schema API and the Schemaless Update Processors both use the same
Managed Schema functionality.
Also, if you do not need the *_str version of a text field, you can simply remove the
copyField definition from the auto-generated schema and it will not be re-added since the
original field is now defined.
Once a field has been added to the schema, its field type is fixed. As a consequence, adding
documents with
field value(s) that conflict with the previously guessed field type will fail. For example,
after adding the above
document, the “Sold” field has the fieldType plongs, but the document below has a non-
integral decimal
value in this field:
curl "http://localhost:8983/solr/gettingstarted/update?commit=true&wt=xml" -H "Contenttype:
application/csv" -d '
id,Description,Sold
19F,Cassettes by the pound,4.93'
This document will fail, as shown in this output:
<response>
<lst name="responseHeader">
<int name="status">400</int>
<int name="QTime">7</int>
</lst>
<lst name="error">
<str name="msg">ERROR: [doc=19F] Error adding field 'Sold'='4.93' msg=For input string:
"4.93"</str>
<int name="code">400</int>
</lst>
</response>
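The error mirrors what happens when a long parser is handed a decimal string. The same class of failure is easy to reproduce locally; this is a Python sketch of the idea, not Solr's code:

```python
def parse_as_long(value: str):
    """A plongs field ultimately needs an integral value; '4.93' is not one."""
    try:
        return int(value)     # java.lang.Long.parseLong behaves similarly
    except ValueError as e:
        return f"ERROR: {e}"  # Solr surfaces the parse failure as a 400 response

print(parse_as_long("14"))    # 14
print(parse_as_long("4.93"))  # ERROR: invalid literal for int() with base 10: '4.93'
```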

Understanding Analyzers, Tokenizers, and Filters
The following sections describe how Solr breaks down and works with textual data. There are
three main
concepts to understand: analyzers, tokenizers, and filters.
• Field analyzers are used both during ingestion, when a document is indexed, and at query
time. An
analyzer examines the text of fields and generates a token stream. Analyzers may be a single
class or
they may be composed of a series of tokenizer and filter classes.
• Tokenizers break field data into lexical units, or tokens.
• Filters examine a stream of tokens and keep them, transform or discard them, or create new
ones.
Tokenizers and filters may be combined to form pipelines, or chains, where the output of one
is input to
the next. Such a sequence of tokenizers and filters is called an analyzer and the resulting
output of an
analyzer is used to match query results or build indices.

Using Analyzers, Tokenizers, and Filters


Although the analysis process is used for both indexing and querying, the same analysis
process need not
be used for both operations. For indexing, you often want to simplify, or normalize, words.
For example,
setting all letters to lowercase, eliminating punctuation and accents, mapping words to their
stems, and so
on. Doing so can increase recall because, for example, "ram", "Ram" and "RAM" would all match
a query for
"ram". To increase query-time precision, a filter could be employed to narrow the matches by,
for example,
ignoring all-cap acronyms if you’re interested in male sheep, but not Random Access Memory.
The tokens output by the analysis process define the values, or terms, of that field and are
used either to
build an index of those terms when a new document is added, or to identify which documents
contain the
terms you are querying for.
For More Information
These sections will show you how to configure field analyzers and also serve as a reference for the
details of configuring each of the available tokenizer and filter classes. They also serve as a guide
so that you can configure your own analysis classes if you have special needs that cannot be met with
the included filters or tokenizers.
For Analyzers, see:
• Analyzers: Detailed conceptual information about Solr analyzers.
• Running Your Analyzer: Detailed information about testing and running your Solr analyzer.
For Tokenizers, see:
• About Tokenizers: Detailed conceptual information about Solr tokenizers.
• Tokenizers: Information about configuring tokenizers, and about the tokenizer factory
classes included
in this distribution of Solr.
For Filters, see:
• About Filters: Detailed conceptual information about Solr filters.
• Filter Descriptions: Information about configuring filters, and about the filter factory
classes included in
this distribution of Solr.
• CharFilterFactories: Information about filters for pre-processing input characters.
To find out how to use Tokenizers and Filters with various languages, see:
• Language Analysis: Information about tokenizers and filters for character set conversion or
for use with
specific languages.

Analyzers
An analyzer examines the text of fields and generates a token stream.
Analyzers are specified as a child of the <fieldType> element in the schema.xml
configuration file (in the
same conf/ directory as solrconfig.xml).
In normal usage, only fields of type solr.TextField or solr.SortableTextField will
specify an analyzer.
The simplest way to configure an analyzer is with a single <analyzer> element whose class
attribute is a fully
qualified Java class name. The named class must derive from
org.apache.lucene.analysis.Analyzer. For
example:
<fieldType name="nametext" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>
In this case a single class, WhitespaceAnalyzer, is responsible for analyzing the content
of the named text
field and emitting the corresponding tokens. For simple cases, such as plain English prose, a
single analyzer
class like this may be sufficient. But it’s often necessary to do more complex analysis of
the field content.
Even the most complex analysis requirements can usually be decomposed into a series of
discrete, relatively
simple processing steps. As you will soon discover, the Solr distribution comes with a large
selection of
tokenizers and filters that covers most scenarios you are likely to encounter. Setting up an
analyzer chain is
very straightforward; you specify a simple <analyzer> element (no class attribute) with
child elements that
name factory classes for the tokenizer and filters to use, in the order you want them to run.
For example:
<fieldType name="nametext" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
</fieldType>
Note that classes in the org.apache.solr.analysis package may be referred to here with the
shorthand
solr. prefix.
In this case, no Analyzer class was specified on the <analyzer> element. Rather, a sequence
of more
specialized classes are wired together and collectively act as the Analyzer for the field.
The text of the field is
passed to the first item in the list ( solr.StandardTokenizerFactory), and the tokens that
emerge from the
last one (solr.EnglishPorterFilterFactory) are the terms that are used for indexing or
querying any
fields that use the "nametext" fieldType.
Field Values versus Indexed Terms
The output of an Analyzer affects the terms indexed in a given field (and the terms used
when parsing queries against those fields) but it has no impact on the stored value for the
fields. For example: an analyzer might split "Brown Cow" into two indexed terms "brown"
and "cow", but the stored value will still be a single String: "Brown Cow"
Analysis Phases
Analysis takes place in two contexts. At index time, when a field is being created, the token
stream that
results from analysis is added to an index and defines the set of terms (including positions,
sizes, and so on)
for the field. At query time, the values being searched for are analyzed and the terms that
result are
matched against those that are stored in the field’s index.
In many cases, the same analysis should be applied to both phases. This is desirable when you
want to
query for exact string matches, possibly with case-insensitivity, for example. In other
cases, you may want to
apply slightly different analysis steps during indexing than those used at query time.
If you provide a simple <analyzer> definition for a field type, as in the examples above,
then it will be used
for both indexing and queries. If you want distinct analyzers for each phase, you may include
two
<analyzer> definitions distinguished with a type attribute. For example:
<fieldType name="nametext" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
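The two-phase behavior above can be sketched as follows, with hypothetical in-memory sets standing in for keepwords.txt and syns.txt:

```python
KEEPWORDS = {"gb", "ram", "memory"}   # stand-in for keepwords.txt
SYNONYMS = {"gb": "gib"}              # stand-in for syns.txt: gb => gib

def tokenize_lower(text):
    return text.lower().split()

def index_analyzer(text):
    # tokenize -> lowercase -> keep only listed words -> apply synonyms
    tokens = [t for t in tokenize_lower(text) if t in KEEPWORDS]
    return [SYNONYMS.get(t, t) for t in tokens]

def query_analyzer(text):
    # tokenize -> lowercase only
    return tokenize_lower(text)

print(index_analyzer("4 GB of RAM"))  # ['gib', 'ram']
print(query_analyzer("GB"))           # ['gb']
```

Note the asymmetry this creates: the index contains "gib" while the query produces "gb", which is exactly the kind of mismatch you must think through when index-time and query-time analysis differ.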
In this theoretical example, at index time the text is tokenized, the tokens are set to
lowercase, any that are
not listed in keepwords.txt are discarded and those that remain are mapped to alternate
values as defined
by the synonym rules in the file syns.txt. This essentially builds an index from a
restricted set of possible
values and then normalizes them to values that may not even occur in the original text.
At query time, the only normalization that happens is to convert the query terms to
lowercase. The filtering
