
    Namespace Lucene.Net.Analysis.Core

    Basic, general-purpose analysis components.

    Classes

    KeywordAnalyzer

    "Tokenizes" the entire stream as a single token. This is useful for data like zip codes, IDs, and some product names.
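    A minimal sketch of how this behaves, assuming Lucene.Net 4.8 (the field name "sku" and the sample value are illustrative):

    ```csharp
    using System;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Analysis.TokenAttributes;

    // KeywordAnalyzer emits the whole input as exactly one token,
    // so the identifier is indexed verbatim.
    var analyzer = new KeywordAnalyzer();
    using var ts = analyzer.GetTokenStream("sku", "AB-1234/Z");
    var term = ts.AddAttribute<ICharTermAttribute>();
    ts.Reset();
    while (ts.IncrementToken())
        Console.WriteLine(term.ToString()); // the single token "AB-1234/Z"
    ts.End();
    ```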

    KeywordTokenizer

    Emits the entire input as a single token.

    KeywordTokenizerFactory

    Factory for KeywordTokenizer.

    <fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>

    LetterTokenizer

    A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as determined by the System.Char.IsLetter(System.Char) predicate.

    Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.

    You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating LetterTokenizer:

    • As of 3.1, CharTokenizer uses a System.Int32-based API to normalize and detect token characters. See IsTokenChar(Int32) and Normalize(Int32) for details.
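    For example, non-letters such as apostrophes, digits, and spaces all break tokens. A sketch, assuming Lucene.Net 4.8 (the input string is illustrative):

    ```csharp
    using System;
    using System.IO;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    // "don't stop" tokenizes to "don", "t", "stop" because the
    // apostrophe and the space are both non-letters.
    var tokenizer = new LetterTokenizer(LuceneVersion.LUCENE_48,
        new StringReader("don't stop"));
    var term = tokenizer.AddAttribute<ICharTermAttribute>();
    tokenizer.Reset();
    while (tokenizer.IncrementToken())
        Console.WriteLine(term.ToString());
    tokenizer.End();
    ```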

    LetterTokenizerFactory

    Factory for LetterTokenizer.

    <fieldType name="text_letter" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.LetterTokenizerFactory"/>
      </analyzer>
    </fieldType>

    LowerCaseFilter

    Normalizes token text to lower case.

    You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating LowerCaseFilter:

    • As of 3.1, supplementary characters are properly lowercased.
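    A typical use is chaining the filter after a tokenizer. A sketch, assuming Lucene.Net 4.8:

    ```csharp
    using System;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    // Lowercase whitespace-separated tokens: "New York" -> "new", "york".
    var source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
        new StringReader("New York"));
    TokenStream stream = new LowerCaseFilter(LuceneVersion.LUCENE_48, source);
    var term = stream.AddAttribute<ICharTermAttribute>();
    stream.Reset();
    while (stream.IncrementToken())
        Console.WriteLine(term.ToString());
    stream.End();
    ```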

    LowerCaseFilterFactory

    Factory for LowerCaseFilter.

    <fieldType name="text_lwrcase" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    LowerCaseTokenizer

    LowerCaseTokenizer performs the function of LetterTokenizer and LowerCaseFilter together. It divides text at non-letters and converts the resulting tokens to lower case. While it is functionally equivalent to the combination of LetterTokenizer and LowerCaseFilter, there is a performance advantage to performing the two tasks at once, hence this (redundant) implementation.

    Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.

    You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating LowerCaseTokenizer:

    • As of 3.1, CharTokenizer uses an int-based API to normalize and detect token characters. See IsTokenChar(Int32) and Normalize(Int32) for details.
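    The equivalence described above can be sketched as follows, assuming Lucene.Net 4.8 (the input string is illustrative):

    ```csharp
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Util;

    var input = "Foo Bar";

    // Single pass: divide at non-letters and lowercase in one step.
    var combined = new LowerCaseTokenizer(LuceneVersion.LUCENE_48,
        new StringReader(input));

    // Equivalent two-stage pipeline, at the cost of an extra pass
    // over each token.
    TokenStream chained = new LowerCaseFilter(LuceneVersion.LUCENE_48,
        new LetterTokenizer(LuceneVersion.LUCENE_48, new StringReader(input)));
    ```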

    LowerCaseTokenizerFactory

    Factory for LowerCaseTokenizer.

    <fieldType name="text_lwrcase" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
      </analyzer>
    </fieldType>

    SimpleAnalyzer

    A Lucene.Net.Analysis.Analyzer that filters LetterTokenizer with LowerCaseFilter.

    You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating SimpleAnalyzer:

    • As of 3.1, LowerCaseTokenizer uses an int-based API to normalize and detect token codepoints. See IsTokenChar(Int32) and Normalize(Int32) for details.

    StopAnalyzer

    Filters LetterTokenizer with LowerCaseFilter and StopFilter.

    You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating StopAnalyzer:

    • As of 3.1, StopFilter correctly handles Unicode 4.0 supplementary characters in stopwords.
    • As of 2.9, position increments are preserved.

    StopFilter

    Removes stop words from a token stream.

    You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating StopFilter:

    • As of 3.1, StopFilter correctly handles Unicode 4.0 supplementary characters in stopwords, and position increments are preserved.
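    A sketch of direct usage, assuming Lucene.Net 4.8 (the stop-word list and input are illustrative):

    ```csharp
    using System;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Analysis.Util;
    using Lucene.Net.Util;

    // Drop "the" and "of" from a whitespace-tokenized stream.
    var stopWords = new CharArraySet(LuceneVersion.LUCENE_48,
        new[] { "the", "of" }, ignoreCase: true);
    var source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48,
        new StringReader("the king of Spain"));
    TokenStream stream = new StopFilter(LuceneVersion.LUCENE_48, source, stopWords);
    var term = stream.AddAttribute<ICharTermAttribute>();
    stream.Reset();
    while (stream.IncrementToken())
        Console.WriteLine(term.ToString()); // "king", "Spain"
    stream.End();
    ```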

    StopFilterFactory

    Factory for StopFilter.

    <fieldType name="text_stop" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" format="wordset" />
      </analyzer>
    </fieldType>

    All attributes are optional:

    • ignoreCase defaults to false
    • words should be the name of a stopwords file to parse; if not specified, the factory uses ENGLISH_STOP_WORDS_SET
    • format defines how the words file will be parsed, and defaults to wordset. If words is not specified, then format must not be specified.

    The valid values for the format option are:

    • wordset - This is the default format, which supports one word per line (including any intra-word whitespace) and allows whole-line comments beginning with the "#" character. Blank lines are ignored. See GetLines(Stream, Encoding) for details.
    • snowball - This format allows for multiple words specified on each line, and trailing comments may be specified using the vertical line ("|"). Blank lines are ignored. See GetSnowballWordSet(TextReader, LuceneVersion) for details.
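    For illustration, a hypothetical stopwords.txt in the default wordset format might look like this:

    ```
    # wordset format: one stop word per line; "#" begins a whole-line comment
    the
    of
    a
    ```

    In the snowball format, the same list could instead be written on a single line, with "|" introducing a trailing comment, e.g. `the of a | common English stop words`.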

    TypeTokenFilter

    Removes tokens whose types appear in a set of blocked types from a token stream.

    TypeTokenFilterFactory

    Factory class for TypeTokenFilter.

    <fieldType name="chars" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt"
                      useWhitelist="false"/>
      </analyzer>
    </fieldType>

    UpperCaseFilter

    Normalizes token text to UPPER CASE.

    You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating UpperCaseFilter.

    NOTE: In Unicode, this transformation may lose information when the upper case character represents more than one lower case character. Use this filter when you require uppercase tokens; use LowerCaseFilter for general search matching.

    UpperCaseFilterFactory

    Factory for UpperCaseFilter.

    <fieldType name="text_uppercase" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.UpperCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    NOTE: In Unicode, this transformation may lose information when the upper case character represents more than one lower case character. Use this filter when you require uppercase tokens; use LowerCaseFilterFactory for general search matching.

    WhitespaceAnalyzer

    A Lucene.Net.Analysis.Analyzer that uses WhitespaceTokenizer.

    You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating WhitespaceAnalyzer:

    • As of 3.1, WhitespaceTokenizer uses an int-based API to normalize and detect token codepoints. See IsTokenChar(Int32) and Normalize(Int32) for details.

    WhitespaceTokenizer

    A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-whitespace characters form tokens.

    You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating WhitespaceTokenizer:

    • As of 3.1, CharTokenizer uses an int-based API to normalize and detect token characters. See IsTokenChar(Int32) and Normalize(Int32) for details.

    WhitespaceTokenizerFactory

    Factory for WhitespaceTokenizer.

    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)