14.5. Analysis
Analyzer
s to control this process.
14.5.1. Default Analyzer and Analyzer by Class
default.analyzer
property. The default value for this property is org.apache.lucene.analysis.standard.StandardAnalyzer
.
@Field
, which is useful when multiple fields are indexed from a single property.
EntityAnalyzer
is used to index all tokenized properties, such as name except, summary and body, which are indexed with PropertyAnalyzer
and FieldAnalyzer
respectively.
Example 14.9. Different ways of using @Analyzer
@Indexed @Analyzer(impl = EntityAnalyzer.class) public class MyEntity { @Field private String name; @Field @Analyzer(impl = PropertyAnalyzer.class) private String summary; @Field(analyzer = @Analyzer(impl = FieldAnalyzer.class)) private String body; }
Note
QueryParser
. Use the same analyzer for indexing and querying on any field.
14.5.2. Named Analyzers
@Analyzer
declarations and includes the following:
- a name: the unique string used to refer to the definition.
- a list of
CharFilters
: eachCharFilter
is responsible to pre-process input characters before the tokenization.CharFilters
can add, change, or remove characters. One common usage is for character normalization. - a
Tokenizer
: responsible for tokenizing the input stream into individual words. - a list of filters: each filter is responsible to remove, modify, or sometimes add words into the stream provided by the
Tokenizer
.
Analyzer
separates these components into multiple tasks, allowing individual components to be reused and components to be built with flexibility using the following procedure:
Procedure 14.1. The Analyzer Process
- The
CharFilters
process the character input. Tokenizer
converts the character input into tokens.- The tokens are the processed by the
TokenFilters
.
14.5.3. Analyzer Definitions
@Analyzer
annotation.
Example 14.10. Referencing an analyzer by name
@Indexed @AnalyzerDef(name = "customanalyzer") public class Team { @Field private String name; @Field private String location; @Field @Analyzer(definition = "customanalyzer") private String description; }
@AnalyzerDef
are also available by their name in the SearchFactory
, which is useful when building queries.
Analyzer analyzer = Search.getSearchManager(cache).getSearchFactory().getAnalyzer("customanalyzer")
14.5.4. @AnalyzerDef for Solr
org.hibernate:hibernate-search-analyzers
. Add the following dependency:
<dependency> <groupId>org.hibernate</groupId> <artifactId>hibernate-search-analyzers</artifactId> <version>${version.hibernate.search}</version> <dependency>
CharFilter
is defined by its factory. In this example, a mapping char filter is used, which will replace characters in the input based on the rules specified in the mapping file. Finally, a list of filters is defined by their factories. In this example, the StopFilter
filter is built reading the dedicated words property file. The filter will ignore case.
Procedure 14.2. @AnalyzerDef and the Solr framework
Configure the CharFilter
Define aCharFilter
by factory. In this example, a mappingCharFilter
is used, which will replace characters in the input based on the rules specified in the mapping file.@AnalyzerDef(name = "customanalyzer", charFilters = { @CharFilterDef(factory = MappingCharFilterFactory.class, params = { @Parameter(name = "mapping", value = "org/hibernate/search/test/analyzer/solr/mapping-chars.properties") }) },
Define the Tokenizer
ATokenizer
is then defined using theStandardTokenizerFactory.class
.@AnalyzerDef(name = "customanalyzer", charFilters = { @CharFilterDef(factory = MappingCharFilterFactory.class, params = { @Parameter(name = "mapping", value = "org/hibernate/search/test/analyzer/solr/mapping-chars.properties") }) }, tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class)
List of Filters
Define a list of filters by their factories. In this example, theStopFilter
filter is built reading the dedicated words property file. The filter will ignore case.@AnalyzerDef(name = "customanalyzer", charFilters = { @CharFilterDef(factory = MappingCharFilterFactory.class, params = { @Parameter(name = "mapping", value = "org/hibernate/search/test/analyzer/solr/mapping-chars.properties") }) }, tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = { @TokenFilterDef(factory = ISOLatin1AccentFilterFactory.class), @TokenFilterDef(factory = LowerCaseFilterFactory.class), @TokenFilterDef(factory = StopFilterFactory.class, params = { @Parameter(name = "words", value= "org/hibernate/search/test/analyzer/solr/stoplist.properties" ), @Parameter(name = "ignoreCase", value = "true") }) }) public class Team { }
Note
CharFilters
are applied in the order they are defined in the @AnalyzerDef
annotation.
14.5.5. Loading Analyzer Resources
Tokenizers
, TokenFilters
, and CharFilters
can load resources such as configuration or metadata files using the StopFilterFactory.class
or the synonym filter. The virtual machine default can be explicitly specified by adding a resource_charset
parameter.
Example 14.11. Use a specific charset to load the property file
@AnalyzerDef(name = "customanalyzer", charFilters = { @CharFilterDef(factory = MappingCharFilterFactory.class, params = { @Parameter(name = "mapping", value = "org/hibernate/search/test/analyzer/solr/mapping-chars.properties") }) }, tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = { @TokenFilterDef(factory = ISOLatin1AccentFilterFactory.class), @TokenFilterDef(factory = LowerCaseFilterFactory.class), @TokenFilterDef(factory = StopFilterFactory.class, params = { @Parameter(name="words", value= "org/hibernate/search/test/analyzer/solr/stoplist.properties"), @Parameter(name = "resource_charset", value = "UTF-16BE"), @Parameter(name = "ignoreCase", value = "true") }) }) public class Team { }
14.5.6. Dynamic Analyzer Selection
@AnalyzerDiscriminator
annotation to enable the dynamic analyzer selection.
BlogEntry
class, the analyzer can depend on the language property of the entry. Depending on this property, the correct language-specific stemmer can then be chosen to index the text.
Discriminator
interface must return the name of an existing Analyzer definition, or null if the default analyzer is not overridden.
de
' or 'en
', which is specified in the @AnalyzerDefs
.
Procedure 14.3. Configure the @AnalyzerDiscriminator
Predefine Dynamic Analyzers
The@AnalyzerDiscriminator
requires that all analyzers that are to be used dynamically are predefined via@AnalyzerDef
. The@AnalyzerDiscriminator
annotation can then be placed either on the class, or on a specific property of the entity, in order to dynamically select an analyzer. An implementation of theDiscriminator
interface can be specified using the@AnalyzerDiscriminator
impl
parameter.@Indexed @AnalyzerDefs({ @AnalyzerDef(name = "en", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = { @TokenFilterDef(factory = LowerCaseFilterFactory.class), @TokenFilterDef(factory = EnglishPorterFilterFactory.class) }), @AnalyzerDef(name = "de", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = { @TokenFilterDef(factory = LowerCaseFilterFactory.class), @TokenFilterDef(factory = GermanStemFilterFactory.class) }) }) public class BlogEntry { @Field @AnalyzerDiscriminator(impl = LanguageDiscriminator.class) private String language; @Field private String text; private Set<BlogEntry> references; // standard getter/setter }
Implement the Discriminator Interface
Implement thegetAnalyzerDefinitionName()
method, which is called for each field added to the Lucene document. The entity being indexed is also passed to the interface method.Thevalue
parameter is set if the@AnalyzerDiscriminator
is placed on the property level instead of the class level. In this example, the value represents the current value of this property.public class LanguageDiscriminator implements Discriminator { public String getAnalyzerDefinitionName(Object value, Object entity, String field) { if (value == null || !(entity instanceof Article)) { return null; } return (String) value; } }
14.5.7. Retrieving an Analyzer
- Standard analyzer: used in the
title
field. - Stemming analyzer: used in the
title_stemmed
field.
Example 14.12. Using the scoped analyzer when building a full-text query
SearchManager manager = Search.getSearchManager(cache); org.apache.lucene.queryParser.QueryParser parser = new QueryParser( org.apache.lucene.util.Version.LUCENE_36, "title", manager.getSearchFactory().getAnalyzer(Song.class) ); org.apache.lucene.search.Query luceneQuery = parser.parse("title:sky Or title_stemmed:diamond"); // wrap Lucene query in a org.infinispan.query.CacheQuery CacheQuery cacheQuery = manager.getQuery(luceneQuery, Song.class); List result = cacheQuery.list(); //return the list of matching objects
Note
@AnalyzerDef
can also be retrieved by their definition name using searchFactory.getAnalyzer(String)
.
14.5.8. Available Analyzers
CharFilters
, tokenizers
, and filters
. A complete list of CharFilter
, tokenizer
, and filter
factories is available at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters. The following tables provide some example CharFilters
, tokenizers
, and filters
.
Factory | Description | Parameters | Additional dependencies |
---|---|---|---|
MappingCharFilterFactory | Replaces one or more characters with one or more characters, based on mappings specified in the resource file | mapping : points to a resource file containing the mappings using the format:
| none |
HTMLStripCharFilterFactory | Remove HTML standard tags, keeping the text | none | none |
Factory | Description | Parameters | Additional dependencies |
---|---|---|---|
StandardTokenizerFactory | Use the Lucene StandardTokenizer | none | none |
HTMLStripCharFilterFactory | Remove HTML tags, keep the text and pass it to a StandardTokenizer. | none | solr-core |
PatternTokenizerFactory | Breaks text at the specified regular expression pattern. | pattern : the regular expression to use for tokenizing
group: says which pattern group to extract into tokens
| solr-core |
Factory | Description | Parameters | Additional dependencies |
---|---|---|---|
StandardFilterFactory | Remove dots from acronyms and 's from words | none | solr-core |
LowerCaseFilterFactory | Lowercases all words | none | solr-core |
StopFilterFactory | Remove words (tokens) matching a list of stop words | words : points to a resource file containing the stop words
ignoreCase: true if
case should be ignored when comparing stop words, false otherwise
| solr-core |
SnowballPorterFilterFactory | Reduces a word to it's root in a given language. (example: protect, protects, protection share the same root). Using such a filter allows searches matching related words. | language : Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish and a few more | solr-core |
ISOLatin1AccentFilterFactory | Remove accents for languages like French | none | solr-core |
PhoneticFilterFactory | Inserts phonetically similar tokens into the token stream | encoder : One of DoubleMetaphone, Metaphone, Soundex or RefinedSoundex
inject:
true will add tokens to the stream, false will replace the existing token
maxCodeLength : sets the maximum length of the code to be generated. Supported only for Metaphone and DoubleMetaphone encodings
| solr-core and commons-codec |
CollationKeyFilterFactory | Converts each token into its java.text.CollationKey , and then encodes the CollationKey with IndexableBinaryStringTools , to allow it to be stored as an index term. | custom , language , country , variant , strength , decomposition see Lucene's CollationKeyFilter javadocs for more info | solr-core and commons-io |
org.apache.solr.analysis.TokenizerFactory
and org.apache.solr.analysis.TokenFilterFactory
are checked in your IDE to see available implementations.