4.5.8. Available Analyzers


Apache Solr and Lucene ship with a number of default CharFilters, tokenizers, and filters. A complete list of CharFilter, tokenizer, and filter factories is available at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters. The following tables provide some example CharFilters, tokenizers, and filters.
Table 4.1. Example of available CharFilters
Factory Description Parameters Additional dependencies
MappingCharFilterFactory Replaces one or more characters with one or more characters, based on mappings specified in the resource file
mapping: points to a resource file containing the mappings using the format:


                    "á" => "a"
                    "ñ" => "n"
                    "ø" => "o"

none
HTMLStripCharFilterFactory Remove HTML standard tags, keeping the text none none
Table 4.2. Example of available tokenizers
Factory Description Parameters Additional dependencies
StandardTokenizerFactory Use the Lucene StandardTokenizer none none
HTMLStripCharFilterFactory Remove HTML tags, keep the text and pass it to a StandardTokenizer. none solr-core
PatternTokenizerFactory Breaks text at the specified regular expression pattern.
pattern: the regular expression to use for tokenizing
group: says which pattern group to extract into tokens
solr-core
Table 4.3. Examples of available filters
Factory Description Parameters Additional dependencies
StandardFilterFactory Remove dots from acronyms and 's from words none solr-core
LowerCaseFilterFactory Lowercases all words none solr-core
StopFilterFactory Remove words (tokens) matching a list of stop words
words: points to a resource file containing the stop words
ignoreCase: true if case should be ignored when comparing stop words, false otherwise
solr-core
SnowballPorterFilterFactory Reduces a word to it's root in a given language. (example: protect, protects, protection share the same root). Using such a filter allows searches matching related words. language: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish and a few more solr-core
ISOLatin1AccentFilterFactory Remove accents for languages like French none solr-core
PhoneticFilterFactory Inserts phonetically similar tokens into the token stream
encoder: One of DoubleMetaphone, Metaphone, Soundex or RefinedSoundex
inject: true will add tokens to the stream, false will replace the existing token
maxCodeLength: sets the maximum length of the code to be generated. Supported only for Metaphone and DoubleMetaphone encodings
solr-core and commons-codec
CollationKeyFilterFactory Converts each token into its java.text.CollationKey, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term. custom, language, country, variant, strength, decompositionsee Lucene's CollationKeyFilter javadocs for more info solr-core and commons-io
It is recommended that all implementations of org.apache.solr.analysis.TokenizerFactory and org.apache.solr.analysis.TokenFilterFactory are checked in your IDE to see available implementations.
Red Hat logoGithubRedditYoutubeTwitter

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

We help Red Hat users innovate and achieve their goals with our products and services with content they can trust.

Making open source more inclusive

Red Hat is committed to replacing problematic language in our code, documentation, and web properties. For more details, see the Red Hat Blog.

About Red Hat

We deliver hardened solutions that make it easier for enterprises to work across platforms and environments, from the core datacenter to the network edge.

© 2024 Red Hat, Inc.