4.5.8. Available Analyzers


Apache Solr and Lucene ship with a number of default CharFilters, tokenizers, and filters. A complete list of CharFilter, tokenizer, and filter factories is available at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters. The following tables list a few examples of each.

Table 4.1. Example of available CharFilters

| Factory | Description | Parameters | Additional dependencies |
|---|---|---|---|
| MappingCharFilterFactory | Replaces one or more characters with one or more characters, based on mappings specified in the resource file | mapping: points to a resource file containing the mappings, one per line, using the format "á" => "a", "ñ" => "n", "ø" => "o" | none |
| HTMLStripCharFilterFactory | Removes standard HTML tags, keeping the text | none | none |
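
To make the mapping file format concrete, the following is a minimal sketch in plain Java that parses rules written as "source" => "target" and applies them to a string. It uses only the JDK and is not the Solr implementation: the class name MappingSketch and its methods are hypothetical, and the sketch simply applies the rules one after another, which may differ from how MappingCharFilterFactory resolves overlapping rules. The accented characters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of Solr or Lucene).
final class MappingSketch {
    // Matches one rule of the form "source" => "target".
    private static final Pattern RULE =
            Pattern.compile("\"([^\"]+)\"\\s*=>\\s*\"([^\"]*)\"");

    // Parses mapping lines, silently skipping lines that are not rules.
    static Map<String, String> parseRules(List<String> lines) {
        Map<String, String> rules = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = RULE.matcher(line.trim());
            if (m.matches()) {
                rules.put(m.group(1), m.group(2));
            }
        }
        return rules;
    }

    // Applies every rule in file order (a simplification of the real filter).
    static String applyRules(String text, Map<String, String> rules) {
        String result = text;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            result = result.replace(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // The three rules shown in Table 4.1, with the sources as escapes.
        Map<String, String> rules = parseRules(List.of(
                "\"\u00e1\" => \"a\"",
                "\"\u00f1\" => \"n\"",
                "\"\u00f8\" => \"o\""));
        System.out.println(applyRules("ma\u00f1ana", rules)); // prints manana
    }
}
```

Because the mapping runs on the raw character stream before tokenization, the tokenizer and every later filter only ever see the replaced characters.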

Table 4.2. Example of available tokenizers

| Factory | Description | Parameters | Additional dependencies |
|---|---|---|---|
| StandardTokenizerFactory | Uses the Lucene StandardTokenizer | none | none |
| HTMLStripCharFilterFactory | Removes HTML tags, keeps the text, and passes it to a StandardTokenizer | none | solr-core |
| PatternTokenizerFactory | Breaks text at the specified regular expression pattern | pattern: the regular expression to use for tokenizing; group: specifies which pattern group to extract into tokens | solr-core |
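
The interplay of the pattern and group parameters is easier to see in code. The sketch below is a plain-JDK illustration, not the Solr class: PatternTokenizerSketch and tokenize are hypothetical names, and treating a negative group as "split on the pattern" is a convention assumed for this sketch rather than something stated in the table above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of Solr or Lucene).
final class PatternTokenizerSketch {
    // group < 0  (assumed convention): the pattern is a delimiter and the
    //            text between matches becomes the tokens.
    // group >= 0: the given capture group of every match becomes a token.
    static List<String> tokenize(String text, String pattern, int group) {
        List<String> tokens = new ArrayList<>();
        Pattern compiled = Pattern.compile(pattern);
        if (group < 0) {
            for (String piece : compiled.split(text)) {
                if (!piece.isEmpty()) {
                    tokens.add(piece);
                }
            }
            return tokens;
        }
        Matcher m = compiled.matcher(text);
        while (m.find()) {
            String token = m.group(group);
            if (token != null && !token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Split on commas or semicolons followed by optional whitespace.
        System.out.println(tokenize("red, green;blue", "[,;]\\s*", -1)); // prints [red, green, blue]
        // Extract only the digits captured by group 1.
        System.out.println(tokenize("id=42 id=7", "id=(\\d+)", 1));      // prints [42, 7]
    }
}
```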

Table 4.3. Examples of available filters

| Factory | Description | Parameters | Additional dependencies |
|---|---|---|---|
| StandardFilterFactory | Removes dots from acronyms and 's from words | none | solr-core |
| LowerCaseFilterFactory | Lowercases all words | none | solr-core |
| StopFilterFactory | Removes words (tokens) matching a list of stop words | words: points to a resource file containing the stop words; ignoreCase: true if case should be ignored when comparing stop words, false otherwise | solr-core |
| SnowballPorterFilterFactory | Reduces a word to its root in a given language (for example, protect, protects, and protection share the same root). Using such a filter allows searches to match related words. | language: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, and a few more | solr-core |
| ISOLatin1AccentFilterFactory | Removes accents for languages like French | none | solr-core |
| PhoneticFilterFactory | Inserts phonetically similar tokens into the token stream | encoder: one of DoubleMetaphone, Metaphone, Soundex, or RefinedSoundex; inject: true adds tokens to the stream, false replaces the existing token; maxCodeLength: sets the maximum length of the code to be generated (supported only for Metaphone and DoubleMetaphone encodings) | solr-core and commons-codec |
| CollationKeyFilterFactory | Converts each token into its java.text.CollationKey, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term | custom, language, country, variant, strength, decomposition; see Lucene's CollationKeyFilter javadocs for more info | solr-core and commons-io |
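
As a rough illustration of how filters act on a token stream one after another, the sketch below chains three steps comparable to a stop-word filter, a lowercase filter, and an accent-removing filter. It relies only on the JDK and does not use the Solr factories; FilterChainSketch and its methods are hypothetical names, the stop-word set is assumed to be stored in lower case, and the order of the steps is a choice made for this example.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, for illustration only (not part of Solr or Lucene).
final class FilterChainSketch {
    // Stop-word check; the set is assumed to hold lower-case entries.
    static boolean isStopWord(String token, Set<String> stopWords, boolean ignoreCase) {
        String candidate = ignoreCase ? token.toLowerCase(Locale.ROOT) : token;
        return stopWords.contains(candidate);
    }

    // Decomposes accented letters and drops the combining marks.
    static String stripAccents(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Runs each token through: stop-word removal, lowercasing, accent removal.
    static List<String> analyze(List<String> tokens, Set<String> stopWords, boolean ignoreCase) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (isStopWord(token, stopWords, ignoreCase)) {
                continue;
            }
            out.add(stripAccents(token.toLowerCase(Locale.ROOT)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("The", "Caf\u00e9", "is", "OPEN");
        Set<String> stopWords = Set.of("the", "is");
        System.out.println(analyze(tokens, stopWords, true));  // prints [cafe, open]
        System.out.println(analyze(tokens, stopWords, false)); // prints [the, cafe, open]
    }
}
```

The second call shows why ignoreCase matters when the stop-word step runs before lowercasing: "The" does not match the lower-case entry "the" and therefore survives.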

To see everything that is available, browse the implementations of org.apache.solr.analysis.TokenizerFactory and org.apache.solr.analysis.TokenFilterFactory in your IDE.
© 2024 Red Hat, Inc.