
4.5.8. Available Analyzers


Apache Solr and Lucene ship with a number of default CharFilters, tokenizers, and filters. A complete list of CharFilter, tokenizer, and filter factories is available at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters. The following tables provide some example CharFilters, tokenizers, and filters.
Table 4.1. Example of available CharFilters
| Factory | Description | Parameters | Additional dependencies |
| --- | --- | --- | --- |
| MappingCharFilterFactory | Replaces one or more characters with one or more characters, based on mappings specified in the resource file | mapping: points to a resource file containing the mappings, one per line, using the format "á" => "a", "ñ" => "n", "ø" => "o" | none |
| HTMLStripCharFilterFactory | Removes standard HTML tags, keeping the text | none | none |
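The mapping file format used by MappingCharFilterFactory can be illustrated with a small plain-Java sketch. This is not how Solr implements the filter; the class name `MappingCharFilterSketch` and its methods are hypothetical, and the sketch simply applies each rule in file order, whereas the real filter works on a character stream. The Unicode escapes stand for the accented characters from the mapping example above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MappingCharFilterSketch {

    // Matches one mapping line of the form "source" => "target".
    private static final Pattern RULE =
            Pattern.compile("^\\s*\"(.+)\"\\s*=>\\s*\"(.*)\"\\s*$");

    /** Parses mapping lines into an ordered source-to-target map; other lines are skipped. */
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> rules = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = RULE.matcher(line);
            if (m.matches()) {
                rules.put(m.group(1), m.group(2));
            }
        }
        return rules;
    }

    /** Applies every rule to the input, in the order the rules were declared. */
    public static String apply(Map<String, String> rules, String input) {
        String out = input;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            out = out.replace(rule.getKey(), rule.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> rules = parse(List.of(
                "\"\u00e1\" => \"a\"",   // a with acute accent
                "\"\u00f1\" => \"n\"",   // n with tilde
                "\"\u00f8\" => \"o\"")); // o with stroke
        // prints: senor Soren
        System.out.println(apply(rules, "se\u00f1or S\u00f8ren"));
    }
}
```

Running the character mapping before tokenization means that the tokenizer and all later filters only ever see the normalized text.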
Table 4.2. Example of available tokenizers
| Factory | Description | Parameters | Additional dependencies |
| --- | --- | --- | --- |
| StandardTokenizerFactory | Uses the Lucene StandardTokenizer | none | none |
| HTMLStripCharFilterFactory | Removes HTML tags, keeps the text, and passes it to a StandardTokenizer | none | solr-core |
| PatternTokenizerFactory | Breaks text at the specified regular expression pattern | pattern: the regular expression to use for tokenizing; group: specifies which pattern group to extract into tokens | solr-core |
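The interaction between the pattern and group parameters of PatternTokenizerFactory is easier to see in code. The following is a minimal plain-Java sketch built on `java.util.regex`, not the Solr implementation; the class `PatternTokenizerSketch` is hypothetical. It assumes the behavior described in the Solr documentation, where a group of -1 treats the pattern as a delimiter and a group of 0 or higher extracts that capture group from each match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternTokenizerSketch {

    /**
     * group < 0:  the pattern is a delimiter; the text between matches becomes tokens.
     * group >= 0: each match contributes the given capture group as a token
     *             (group 0 is the whole match).
     */
    public static List<String> tokenize(String text, String pattern, int group) {
        List<String> tokens = new ArrayList<>();
        Pattern p = Pattern.compile(pattern);
        if (group < 0) {
            for (String part : p.split(text)) {
                if (!part.isEmpty()) {
                    tokens.add(part);
                }
            }
        } else {
            Matcher m = p.matcher(text);
            while (m.find()) {
                String token = m.group(group);
                if (token != null && !token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Split on commas with optional surrounding whitespace: prints [a, b, c]
        System.out.println(tokenize("a, b,c", "\\s*,\\s*", -1));
        // Extract only the digits captured by group 1: prints [12, 7]
        System.out.println(tokenize("id=12 id=7", "id=(\\d+)", 1));
    }
}
```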
Table 4.3. Examples of available filters
| Factory | Description | Parameters | Additional dependencies |
| --- | --- | --- | --- |
| StandardFilterFactory | Removes dots from acronyms and 's from words | none | solr-core |
| LowerCaseFilterFactory | Lowercases all words | none | solr-core |
| StopFilterFactory | Removes words (tokens) matching a list of stop words | words: points to a resource file containing the stop words; ignoreCase: true if case should be ignored when comparing stop words, false otherwise | solr-core |
| SnowballPorterFilterFactory | Reduces a word to its root in a given language (for example, protect, protects, and protection share the same root). Using such a filter allows searches to match related words. | language: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, and a few more | solr-core |
| ISOLatin1AccentFilterFactory | Removes accents for languages like French | none | solr-core |
| PhoneticFilterFactory | Inserts phonetically similar tokens into the token stream | encoder: one of DoubleMetaphone, Metaphone, Soundex, or RefinedSoundex; inject: true adds tokens to the stream, false replaces the existing token; maxCodeLength: sets the maximum length of the code to be generated (supported only for Metaphone and DoubleMetaphone encodings) | solr-core and commons-codec |
| CollationKeyFilterFactory | Converts each token into its java.text.CollationKey, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term | custom, language, country, variant, strength, decomposition; see Lucene's CollationKeyFilter javadocs for more information | solr-core and commons-io |
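The effect of the ignoreCase parameter of StopFilterFactory, and of placing a lowercasing filter before the stop filter, can be sketched in plain Java. The class `StopFilterSketch` and its methods are hypothetical and only mimic the described behavior on a list of tokens; they are not the Solr classes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopFilterSketch {

    /** Mimics a lowercasing filter: every token is lowercased. */
    public static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Mimics a stop filter: drops tokens found in the stop word set. */
    public static List<String> removeStopWords(List<String> tokens,
                                               Set<String> stopWords,
                                               boolean ignoreCase) {
        // When ignoreCase is true, both stop words and tokens are compared in lowercase.
        Set<String> lookup = new HashSet<>();
        for (String word : stopWords) {
            lookup.add(ignoreCase ? word.toLowerCase(Locale.ROOT) : word);
        }
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String key = ignoreCase ? token.toLowerCase(Locale.ROOT) : token;
            if (!lookup.contains(key)) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("The", "quick", "fox");
        Set<String> stopWords = Set.of("the");
        // Case-sensitive comparison keeps "The": prints [The, quick, fox]
        System.out.println(removeStopWords(tokens, stopWords, false));
        // ignoreCase=true drops it: prints [quick, fox]
        System.out.println(removeStopWords(tokens, stopWords, true));
        // Lowercasing first has the same effect: prints [quick, fox]
        System.out.println(removeStopWords(lowerCase(tokens), stopWords, false));
    }
}
```

The last call shows why filter order matters in an analyzer definition: with a lowercasing filter placed first, a case-sensitive stop filter still removes capitalized stop words.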
To see all available implementations, inspect the implementations of org.apache.solr.analysis.TokenizerFactory and org.apache.solr.analysis.TokenFilterFactory in your IDE.