
    Namespace Lucene.Net.Analysis.Standard

    Fast, general-purpose grammar-based tokenizers.

    The Lucene.Net.Analysis.Standard namespace contains three fast grammar-based tokenizers constructed with JFlex (and ported to .NET):

    • StandardTokenizer: as of Lucene 3.1, implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Unlike UAX29URLEmailTokenizer, URLs and email addresses are not tokenized as single tokens, but are instead split up into tokens according to the UAX#29 word break rules (see the sketch after this list).

      [StandardAnalyzer](xref:Lucene.Net.Analysis.Standard.StandardAnalyzer) includes
      [StandardTokenizer](xref:Lucene.Net.Analysis.Standard.StandardTokenizer),
      [StandardFilter](xref:Lucene.Net.Analysis.Standard.StandardFilter), 
      [LowerCaseFilter](xref:Lucene.Net.Analysis.Core.LowerCaseFilter)
      and [StopFilter](xref:Lucene.Net.Analysis.Core.StopFilter).
      When the `Version` specified in the constructor is lower than 3.1,
      the ClassicTokenizer implementation is invoked.

    • ClassicTokenizer: this class was formerly (prior to Lucene 3.1) named StandardTokenizer. (Its tokenization rules are not based on the Unicode Text Segmentation algorithm.) ClassicAnalyzer includes ClassicTokenizer, StandardFilter, LowerCaseFilter and StopFilter.

    • UAX29URLEmailTokenizer: implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.

      [UAX29URLEmailAnalyzer](xref:Lucene.Net.Analysis.Standard.UAX29URLEmailAnalyzer) includes
      [UAX29URLEmailTokenizer](xref:Lucene.Net.Analysis.Standard.UAX29URLEmailTokenizer),
      [StandardFilter](xref:Lucene.Net.Analysis.Standard.StandardFilter),
      [LowerCaseFilter](xref:Lucene.Net.Analysis.Core.LowerCaseFilter)
      and [StopFilter](xref:Lucene.Net.Analysis.Core.StopFilter).
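
    The sketch below (a minimal example, assuming Lucene.NET 4.8 with the Lucene.Net.Analysis.Common package; the field name "f", the sample text, and the expected output in comments are illustrative and version-dependent) contrasts how StandardAnalyzer and UAX29URLEmailAnalyzer treat text containing a URL and an email address:

    using System;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    // Consume a token stream and print each term in brackets.
    static void PrintTokens(Analyzer analyzer, string text)
    {
        using TokenStream stream = analyzer.GetTokenStream("f", new StringReader(text));
        ICharTermAttribute term = stream.AddAttribute<ICharTermAttribute>();
        stream.Reset();
        while (stream.IncrementToken())
            Console.Write($"[{term}] ");
        stream.End();
        Console.WriteLine();
    }

    const string text = "Visit https://lucene.apache.org or mail dev@lucene.apache.org";

    // StandardAnalyzer applies the UAX#29 word break rules, so the URL and the
    // address are split apart (then lowercased, with stop words such as "or"
    // removed), roughly:
    // [visit] [https] [lucene.apache.org] [mail] [dev] [lucene.apache.org]
    PrintTokens(new StandardAnalyzer(LuceneVersion.LUCENE_48), text);

    // UAX29URLEmailAnalyzer keeps each URL and email address as a single token:
    // [visit] [https://lucene.apache.org] [mail] [dev@lucene.apache.org]
    PrintTokens(new UAX29URLEmailAnalyzer(LuceneVersion.LUCENE_48), text);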
      

    Classes

    ClassicAnalyzer

    Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.

    You must specify the required LuceneVersion compatibility when creating ClassicAnalyzer:

    • As of 3.1, StopFilter correctly handles Unicode 4.0 supplementary characters in stopwords
    • As of 2.9, StopFilter preserves position increments
    • As of 2.4, tokens incorrectly identified as acronyms are corrected (see LUCENE-1068)

    ClassicAnalyzer was named StandardAnalyzer in Lucene versions prior to 3.1. As of 3.1, StandardAnalyzer implements Unicode text segmentation, as specified by UAX#29.
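
    For example, a minimal sketch (assuming Lucene.NET 4.8, where ClassicAnalyzer ships in the Lucene.Net.Analysis.Common package) of pinning the compatibility version:

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Util;

    // Current (4.8) behavior: Unicode-aware stopword handling, preserved
    // position increments, and acronym correction, per the list above.
    var analyzer = new ClassicAnalyzer(LuceneVersion.LUCENE_48);

    // Pinning an older version re-enables the matching legacy behavior
    // for compatibility with an existing index.
    var legacy = new ClassicAnalyzer(LuceneVersion.LUCENE_30);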

    ClassicFilter

    Normalizes tokens extracted with ClassicTokenizer.
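
    For instance, a minimal sketch (Lucene.NET 4.8 assumed; the sample text is illustrative) of the usual chaining pattern, with ClassicFilter wrapped around a ClassicTokenizer:

    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Util;

    // ClassicFilter is applied on top of ClassicTokenizer's output; it
    // normalizes tokens (e.g. removing trailing "'s" and acronym dots).
    Tokenizer source = new ClassicTokenizer(
        LuceneVersion.LUCENE_48, new StringReader("the I.B.M. network's down"));
    TokenStream chain = new ClassicFilter(source);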

    ClassicFilterFactory

    Factory for ClassicFilter.

    <fieldType name="text_clssc" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.ClassicTokenizerFactory"/>
        <filter class="solr.ClassicFilterFactory"/>
      </analyzer>
    </fieldType>

    ClassicTokenizer

    A grammar-based tokenizer constructed with JFlex (and then ported to .NET).

    This should be a good tokenizer for most European-language documents:

    • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
    • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
    • Recognizes email addresses and internet hostnames as one token.

    Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

    ClassicTokenizer was named StandardTokenizer in Lucene versions prior to 3.1. As of 3.1, StandardTokenizer implements Unicode text segmentation, as specified by UAX#29.
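
    A minimal sketch (Lucene.NET 4.8 assumed; the sample text and expected output are illustrative) that exercises the rules above:

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    using var tokenizer = new ClassicTokenizer(
        LuceneVersion.LUCENE_48,
        new StringReader("wi-fi AB-3200 support@example.com lucene.apache.org"));
    var term = tokenizer.AddAttribute<ICharTermAttribute>();
    tokenizer.Reset();
    while (tokenizer.IncrementToken())
        Console.Write($"[{term}] ");
    tokenizer.End();
    // Expected, per the rules above: [wi] [fi] (split at the hyphen),
    // [AB-3200] (kept whole because it contains a digit),
    // [support@example.com] and [lucene.apache.org] (one token each).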

    ClassicTokenizerFactory

    Factory for ClassicTokenizer.

    <fieldType name="text_clssc" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="120"/>
      </analyzer>
    </fieldType>

    StandardAnalyzer

    Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.

    You must specify the required LuceneVersion compatibility when creating StandardAnalyzer:

    • As of 3.4, Hiragana and Han characters are no longer wrongly split from their combining characters. If you use a previous version number, you get the exact broken behavior for backwards compatibility.
    • As of 3.1, StandardTokenizer implements Unicode text segmentation, and StopFilter correctly handles Unicode 4.0 supplementary characters in stopwords. ClassicTokenizer and ClassicAnalyzer are the pre-3.1 implementations of StandardTokenizer and StandardAnalyzer.
    • As of 2.9, StopFilter preserves position increments
    • As of 2.4, tokens incorrectly identified as acronyms are corrected (see LUCENE-1068)
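
    For example, a minimal sketch (Lucene.NET 4.8 assumed; the sample stop words are illustrative) of version-pinned construction, including the overload that accepts a custom stop word set:

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Analysis.Util;
    using Lucene.Net.Util;

    // Default English stop words with current (4.8) behavior:
    var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);

    // Supplying a custom stop word set (the final argument ignores case):
    var stopWords = new CharArraySet(
        LuceneVersion.LUCENE_48, new[] { "the", "und", "le" }, true);
    var custom = new StandardAnalyzer(LuceneVersion.LUCENE_48, stopWords);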

    StandardFilter

    Normalizes tokens extracted with StandardTokenizer.

    StandardFilterFactory

    Factory for StandardFilter.

    <fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
      </analyzer>
    </fieldType>

    StandardTokenizer

    A grammar-based tokenizer constructed with JFlex.

    As of Lucene version 3.1, this class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

    Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

    You must specify the required LuceneVersion compatibility when creating StandardTokenizer:

    • As of 3.4, Hiragana and Han characters are no longer wrongly split from their combining characters. If you use a previous version number, you get the exact broken behavior for backwards compatibility.
    • As of 3.1, StandardTokenizer implements Unicode text segmentation. If you use a previous version number, you get the exact behavior of ClassicTokenizer for backwards compatibility.
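
    A minimal sketch (Lucene.NET 4.8 assumed) of constructing the tokenizer directly; the MaxTokenLength property corresponds to the factory's maxTokenLength attribute shown below:

    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Util;

    var tokenizer = new StandardTokenizer(
        LuceneVersion.LUCENE_48, new StringReader("text to tokenize"));
    // Cap the length of emitted tokens (mirrors maxTokenLength below):
    tokenizer.MaxTokenLength = 255;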

    StandardTokenizerFactory

    Factory for StandardTokenizer.

    <fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="255"/>
      </analyzer>
    </fieldType>

    StandardTokenizerImpl

    This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

    Tokens produced are of the following types:

    • <ALPHANUM>: A sequence of alphabetic and numeric characters
    • <NUM>: A number
    • <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
    • <IDEOGRAPHIC>: A single CJKV ideographic character
    • <HIRAGANA>: A single hiragana character
    • <KATAKANA>: A sequence of katakana characters
    • <HANGUL>: A sequence of Hangul characters
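
    These types are surfaced through the token stream's type attribute. A minimal sketch (Lucene.NET 4.8 assumed; the sample text and expected types are illustrative):

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    using var tokenizer = new StandardTokenizer(
        LuceneVersion.LUCENE_48, new StringReader("Lucene 4.8 こんにちは 世界"));
    var term = tokenizer.AddAttribute<ICharTermAttribute>();
    var type = tokenizer.AddAttribute<ITypeAttribute>();
    tokenizer.Reset();
    while (tokenizer.IncrementToken())
        Console.WriteLine($"{term} -> {type.Type}");
    tokenizer.End();
    // Expected: Lucene -> <ALPHANUM>, 4.8 -> <NUM>,
    // each hiragana character -> <HIRAGANA>, each kanji -> <IDEOGRAPHIC>.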

    StandardTokenizerInterface

    UAX29URLEmailAnalyzer

    Filters UAX29URLEmailTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.

    You must specify the required LuceneVersion compatibility when creating UAX29URLEmailAnalyzer.

    UAX29URLEmailTokenizer

    This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.

    Tokens produced are of the following types:

    • <ALPHANUM>: A sequence of alphabetic and numeric characters
    • <NUM>: A number
    • <URL>: A URL
    • <EMAIL>: An email address
    • <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
    • <IDEOGRAPHIC>: A single CJKV ideographic character
    • <HIRAGANA>: A single hiragana character

    You must specify the required LuceneVersion compatibility when creating UAX29URLEmailTokenizer:

    • As of 3.4, Hiragana and Han characters are no longer wrongly split from their combining characters. If you use a previous version number, you get the exact broken behavior for backwards compatibility.
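
    A minimal sketch (Lucene.NET 4.8 assumed; the sample text is illustrative) showing the <URL> and <EMAIL> token types this tokenizer adds on top of the standard ones:

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    using var tokenizer = new UAX29URLEmailTokenizer(
        LuceneVersion.LUCENE_48,
        new StringReader("See https://lucene.apache.org or write dev@lucene.apache.org"));
    var term = tokenizer.AddAttribute<ICharTermAttribute>();
    var type = tokenizer.AddAttribute<ITypeAttribute>();
    tokenizer.Reset();
    while (tokenizer.IncrementToken())
        Console.WriteLine($"{type.Type,-12} {term}");
    tokenizer.End();
    // Expected: the full URL comes out as one <URL> token, the address as one
    // <EMAIL> token, and the surrounding words as <ALPHANUM>.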

    UAX29URLEmailTokenizerFactory

    Factory for UAX29URLEmailTokenizer.

    <fieldType name="text_urlemail" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.UAX29URLEmailTokenizerFactory" maxTokenLength="255"/>
      </analyzer>
    </fieldType>

    UAX29URLEmailTokenizerImpl

    This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.

    Tokens produced are of the following types:

    • <ALPHANUM>: A sequence of alphabetic and numeric characters
    • <NUM>: A number
    • <URL>: A URL
    • <EMAIL>: An email address
    • <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
    • <IDEOGRAPHIC>: A single CJKV ideographic character
    • <HIRAGANA>: A single hiragana character
    • <KATAKANA>: A sequence of katakana characters
    • <HANGUL>: A sequence of Hangul characters

    Interfaces

    IStandardTokenizerInterface

    Internal interface for supporting versioned grammars. This is an internal Lucene API.
