4.5.8. Available Analyzers

Apache Solr and Lucene ship with a number of default CharFilters, tokenizers, and filters. A complete list of CharFilter, tokenizer, and filter factories is available at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters. The following tables provide some example CharFilters, tokenizers, and filters.

Table 4.1. Examples of available CharFilters
Factory	Description	Parameters	Additional dependencies
`MappingCharFilterFactory`	Replaces one or more characters with one or more characters, based on the mappings specified in a resource file	`mapping`: points to a resource file containing the mappings, one per line, in the format `"á" => "a"`, `"ñ" => "n"`, `"ø" => "o"`	none
`HTMLStripCharFilterFactory`	Removes standard HTML tags, keeping the text	none	none
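
To make the effect of a character mapping concrete, the following plain-Java sketch applies the three example mappings from the table to a string. It is an illustration only and does not use Solr's `MappingCharFilterFactory`; the class name `MappingSketch` and the method `applyMappings` are hypothetical, and the simple sequential replacement is an assumption that is adequate only for single-character mappings like these.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration only; this is NOT Solr's mapping char filter.
final class MappingSketch {

    // Replaces every occurrence of each mapping key with its mapped value,
    // applying the mappings one after another in insertion order.
    static String applyMappings(String input, Map<String, String> mappings) {
        String result = input;
        for (Map.Entry<String, String> entry : mappings.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    // The three example mappings from the table, written as Unicode escapes
    // so the source compiles regardless of file encoding.
    static Map<String, String> exampleMappings() {
        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("\u00e1", "a"); // a with acute accent => a
        mappings.put("\u00f1", "n"); // n with tilde => n
        mappings.put("\u00f8", "o"); // o with stroke => o
        return mappings;
    }

    public static void main(String[] args) {
        // "Espa\u00f1a" is "España" written with an escape.
        System.out.println(applyMappings("Espa\u00f1a", exampleMappings()));
    }
}
```

Because the mapping runs before tokenization, the tokenizer and all later filters only ever see the replaced characters.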

Table 4.2. Examples of available tokenizers
Factory	Description	Parameters	Additional dependencies
`StandardTokenizerFactory`	Uses the Lucene StandardTokenizer	none	none
`HTMLStripCharFilterFactory`	Removes HTML tags, keeps the text, and passes it to a StandardTokenizer	none	`solr-core`
`PatternTokenizerFactory`	Breaks text at the specified regular expression pattern	`pattern`: the regular expression to use for tokenizing; `group`: specifies which pattern group to extract into tokens	`solr-core`
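
The difference between breaking text at a pattern and extracting a pattern group can be sketched with `java.util.regex`. This is a plain-Java illustration, not the Solr tokenizer itself; the class `PatternTokenizeSketch` and its `tokenize` method are hypothetical, and treating a negative `group` as "split on the pattern" is an assumption about how the two modes relate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; this is NOT Solr's pattern tokenizer.
final class PatternTokenizeSketch {

    // group < 0  : the pattern is a delimiter and the text between matches becomes tokens.
    // group >= 0 : each match is found and the given capture group becomes a token.
    static List<String> tokenize(String text, String regex, int group) {
        List<String> tokens = new ArrayList<>();
        Pattern pattern = Pattern.compile(regex);
        if (group < 0) {
            for (String part : pattern.split(text)) {
                if (!part.isEmpty()) {
                    tokens.add(part);
                }
            }
        } else {
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                String token = matcher.group(group);
                if (token != null && !token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Split on commas followed by optional whitespace.
        System.out.println(tokenize("red, green,blue", ",\\s*", -1));
        // Extract the text inside single quotes (capture group 1).
        System.out.println(tokenize("'aaa' 'bbb'", "'([^']+)'", 1));
    }
}
```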

Table 4.3. Examples of available filters
Factory	Description	Parameters	Additional dependencies
`StandardFilterFactory`	Removes dots from acronyms and 's from words	none	`solr-core`
`LowerCaseFilterFactory`	Lowercases all words	none	`solr-core`
`StopFilterFactory`	Removes words (tokens) matching a list of stop words	`words`: points to a resource file containing the stop words; `ignoreCase`: `true` if case should be ignored when comparing stop words, `false` otherwise	`solr-core`
`SnowballPorterFilterFactory`	Reduces a word to its root in a given language (for example, protect, protects, and protection share the same root). Using such a filter allows searches to match related words.	`language`: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, and a few more	`solr-core`
`ISOLatin1AccentFilterFactory`	Removes accents for languages like French	none	`solr-core`
`PhoneticFilterFactory`	Inserts phonetically similar tokens into the token stream	`encoder`: one of DoubleMetaphone, Metaphone, Soundex, or RefinedSoundex; `inject`: `true` adds tokens to the stream, `false` replaces the existing token; `maxCodeLength`: sets the maximum length of the code to be generated (supported only for Metaphone and DoubleMetaphone encodings)	`solr-core` and `commons-codec`
`CollationKeyFilterFactory`	Converts each token into its `java.text.CollationKey`, and then encodes the `CollationKey` with `IndexableBinaryStringTools`, to allow it to be stored as an index term	`custom`, `language`, `country`, `variant`, `strength`, `decomposition`; see Lucene's `CollationKeyFilter` javadocs for more information	`solr-core` and `commons-io`
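
Filters are applied to the token stream one after another, so their order matters. The following plain-Java sketch imitates a lowercase filter followed by a stop-word filter to show how the `ignoreCase` parameter changes the result. It does not use the Solr factories; `FilterChainSketch`, `lowerCase`, and `removeStopWords` are hypothetical names, and the in-memory stop-word set stands in for the resource file referenced by `words`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; this is NOT the Solr filter implementation.
final class FilterChainSketch {

    // Imitates a lowercase filter: every token is lowercased.
    static List<String> lowerCase(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(token.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    // Imitates a stop-word filter: drops tokens found in stopWords.
    // With ignoreCase set, both sides are compared in lowercase.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords, boolean ignoreCase) {
        Set<String> lookup = new HashSet<>();
        for (String word : stopWords) {
            lookup.add(ignoreCase ? word.toLowerCase(Locale.ROOT) : word);
        }
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String key = ignoreCase ? token.toLowerCase(Locale.ROOT) : token;
            if (!lookup.contains(key)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "Quick", "fox");
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        // Case-sensitive comparison keeps "The" because it differs from "the".
        System.out.println(removeStopWords(tokens, stopWords, false));
        // Lowercasing first makes the case-sensitive stop filter match.
        System.out.println(removeStopWords(lowerCase(tokens), stopWords, false));
    }
}
```

In this sketch, placing the lowercase step before the stop-word step has the same effect on matching as setting `ignoreCase` to `true`, except that the surviving tokens are also lowercased.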

To see all available implementations, check the implementations of org.apache.solr.analysis.TokenizerFactory and org.apache.solr.analysis.TokenFilterFactory in your IDE.
