    Class Analyzer

    An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

    In order to define what analysis is done, subclasses must define their TokenStreamComponents in CreateComponents(String, TextReader). The components are then reused in each call to GetTokenStream(String, TextReader).

    Simple example:

    Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) => 
    {
        Tokenizer source = new FooTokenizer(reader);
        TokenStream filter = new FooFilter(source);
        filter = new BarFilter(filter);
        return new TokenStreamComponents(source, filter);
    });

    For more examples, see the Lucene.Net.Analysis namespace documentation.
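
    The subclassing route described above can be sketched as follows. This is a minimal sketch, assuming the Lucene.Net.Analysis.Common package is referenced; LowercaseWhitespaceAnalyzer is a hypothetical name chosen for illustration, while WhitespaceTokenizer, LowerCaseFilter and LuceneVersion.LUCENE_48 are existing 4.8 types.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Util;

// Hypothetical analyzer: splits on whitespace, then lowercases each token.
public sealed class LowercaseWhitespaceAnalyzer : Analyzer
{
    private const LuceneVersion MatchVersion = LuceneVersion.LUCENE_48;

    // Called when no reusable components exist yet for the current thread.
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer source = new WhitespaceTokenizer(MatchVersion, reader);
        TokenStream result = new LowerCaseFilter(MatchVersion, source);
        return new TokenStreamComponents(source, result);
    }
}
```

    A named subclass is mostly useful when the analyzer is shared across a code base; for one-off chains, the NewAnonymous factory methods express the same thing without a new type.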

    For some concrete implementations bundled with Lucene, look in the analysis modules:

    • Common: Analyzers for indexing content in different languages and domains.
    • ICU: Exposes functionality from ICU to Apache Lucene.
    • Kuromoji: Morphological analyzer for Japanese text.
    • Morfologik: Dictionary-driven lemmatization for the Polish language.
    • Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
    • Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
    • Stempel: Algorithmic Stemmer for the Polish Language.
    • UIMA: Analysis integration with Apache UIMA.

    Inheritance
    System.Object
    Analyzer
    AnalyzerWrapper
    Namespace: Lucene.Net.Analysis
    Assembly: Lucene.Net.dll
    Syntax
    public abstract class Analyzer : IDisposable

    Constructors

    Analyzer()

    Create a new Analyzer, reusing the same set of components per-thread across calls to GetTokenStream(String, TextReader).

    Declaration
    public Analyzer()

    Analyzer(ReuseStrategy)

    Expert: create a new Analyzer with a custom ReuseStrategy.

    NOTE: if you just want to reuse on a per-field basis, it's easier to use a subclass of AnalyzerWrapper such as Lucene.Net.Analysis.Common.Miscellaneous.PerFieldAnalyzerWrapper instead.

    Declaration
    public Analyzer(ReuseStrategy reuseStrategy)
    Parameters
    Type Name Description
    ReuseStrategy reuseStrategy

    Fields

    GLOBAL_REUSE_STRATEGY

    A predefined ReuseStrategy that reuses the same components for every field.

    Declaration
    public static readonly ReuseStrategy GLOBAL_REUSE_STRATEGY
    Field Value
    Type Description
    ReuseStrategy

    PER_FIELD_REUSE_STRATEGY

    A predefined ReuseStrategy that reuses components per-field by maintaining a Map of TokenStreamComponents per field name.

    Declaration
    public static readonly ReuseStrategy PER_FIELD_REUSE_STRATEGY
    Field Value
    Type Description
    ReuseStrategy
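
    The difference between the two predefined strategies matters as soon as CreateComponents branches on the field name. The sketch below assumes the Lucene.Net.Analysis.Common package; IdAwareAnalyzer and the field name "id" are hypothetical, chosen only to illustrate passing PER_FIELD_REUSE_STRATEGY to the base constructor.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Util;

// Hypothetical analyzer: keeps "id" values as a single token and splits
// every other field on whitespace.
public sealed class IdAwareAnalyzer : Analyzer
{
    public IdAwareAnalyzer()
        : base(Analyzer.PER_FIELD_REUSE_STRATEGY) // cache components per field name
    {
    }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // Branching on fieldName is safe here because components are cached
        // per field; with GLOBAL_REUSE_STRATEGY the components built for the
        // first field seen would be reused for every other field.
        Tokenizer source = fieldName == "id"
            ? (Tokenizer)new KeywordTokenizer(reader)
            : new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
        return new TokenStreamComponents(source);
    }
}
```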

    Properties

    Strategy

    Returns the ReuseStrategy in use.

    Declaration
    public ReuseStrategy Strategy { get; }
    Property Value
    Type Description
    ReuseStrategy

    Methods

    CreateComponents(String, TextReader)

    Creates a new TokenStreamComponents instance for this analyzer.

    Declaration
    protected abstract TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    Parameters
    Type Name Description
    System.String fieldName

    the name of the field whose content is passed to the TokenStreamComponents sink as a reader

    TextReader reader

    the reader passed to the Tokenizer constructor

    Returns
    Type Description
    TokenStreamComponents

    the TokenStreamComponents for this analyzer.

    Dispose()

    Frees persistent resources used by this Analyzer.

    Declaration
    public void Dispose()

    Dispose(Boolean)

    Frees persistent resources used by this Analyzer.

    Declaration
    protected virtual void Dispose(bool disposing)
    Parameters
    Type Name Description
    System.Boolean disposing

    GetOffsetGap(String)

    Just like GetPositionIncrementGap(String), except for Token offsets instead. By default this returns 1. This method is only called if the field produced at least one token for indexing.

    Declaration
    public virtual int GetOffsetGap(string fieldName)
    Parameters
    Type Name Description
    System.String fieldName

    the field just indexed

    Returns
    Type Description
    System.Int32

    offset gap, added to the next token emitted from GetTokenStream(String, TextReader). This value must be >= 0.

    GetPositionIncrementGap(String)

    Invoked before indexing an IIndexableField instance if terms have already been added to that field. This allows custom analyzers to place an automatic position increment gap between IIndexableField instances using the same field name. The default position increment gap is 0. With a 0 position increment gap and the typical default token position increment of 1, all terms in a field, including across IIndexableField instances, are in successive positions, allowing exact PhraseQuery matches, for instance, across IIndexableField instance boundaries.

    Declaration
    public virtual int GetPositionIncrementGap(string fieldName)
    Parameters
    Type Name Description
    System.String fieldName

    IIndexableField name being indexed.

    Returns
    Type Description
    System.Int32

    position increment gap, added to the next token emitted from GetTokenStream(String, TextReader). This value must be >= 0.
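
    To keep phrase queries from matching across IIndexableField instance boundaries, an analyzer can return a non-zero gap. The following is a minimal sketch, assuming the Lucene.Net.Analysis.Common package; GappedAnalyzer is a hypothetical name and the value 100 is an illustrative choice, not a value prescribed by the library.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Util;

// Hypothetical analyzer: whitespace tokenization plus a position gap
// between multiple values indexed under the same field name.
public sealed class GappedAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
        return new TokenStreamComponents(source);
    }

    // Illustrative value: applies the same gap to every field. Must be >= 0.
    public override int GetPositionIncrementGap(string fieldName)
    {
        return 100;
    }
}
```

    With this override, the first token of each additional value added under the same field name is pushed about 100 positions past the last token of the previous value, so an exact PhraseQuery, or one with a slop well below the gap, should not match across the boundary.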

    GetTokenStream(String, String)

    Returns a TokenStream suitable for fieldName, tokenizing the contents of text.

    This method uses CreateComponents(String, TextReader) to obtain an instance of TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through SetReader(TextReader).

    NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Lucene.Net.Analysis namespace documentation for some examples demonstrating this.

    Declaration
    public TokenStream GetTokenStream(string fieldName, string text)
    Parameters
    Type Name Description
    System.String fieldName

    the name of the field the created TokenStream is used for

    System.String text

    the text the stream's source reads from

    Returns
    Type Description
    TokenStream

    TokenStream for iterating the analyzed content of text

    See Also
    GetTokenStream(String, TextReader)
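
    The consumer workflow mentioned in the note above can be sketched as a small helper. This is a minimal sketch, assuming the 4.8 API in which TokenStream implements IDisposable; TokenStreamDemo and Tokenize are hypothetical names introduced for illustration.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

public static class TokenStreamDemo
{
    // Hypothetical helper: collects the terms an analyzer produces for some text.
    public static IList<string> Tokenize(Analyzer analyzer, string fieldName, string text)
    {
        var terms = new List<string>();
        using (TokenStream stream = analyzer.GetTokenStream(fieldName, text))
        {
            // Obtain attributes before Reset(); the same instance is updated per token.
            ICharTermAttribute termAtt = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();                    // required before the first IncrementToken()
            while (stream.IncrementToken())
            {
                terms.Add(termAtt.ToString()); // copy the term; the attribute is reused
            }
            stream.End();                      // finalize end-of-stream state
        }                                      // Dispose() releases the stream for reuse
        return terms;
    }
}
```

    For example, calling TokenStreamDemo.Tokenize with a WhitespaceAnalyzer (from Lucene.Net.Analysis.Common) and the text "Hello World" yields the two terms "Hello" and "World". Because the components are reused, the stream returned by one call should be fully consumed and disposed before requesting another stream from the same analyzer on the same thread.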

    GetTokenStream(String, TextReader)

    Returns a TokenStream suitable for fieldName, tokenizing the contents of reader.

    This method uses CreateComponents(String, TextReader) to obtain an instance of TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through SetReader(TextReader).

    NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Lucene.Net.Analysis namespace documentation for some examples demonstrating this.

    Declaration
    public TokenStream GetTokenStream(string fieldName, TextReader reader)
    Parameters
    Type Name Description
    System.String fieldName

    the name of the field the created TokenStream is used for

    TextReader reader

    the reader the stream's source reads from

    Returns
    Type Description
    TokenStream

    TokenStream for iterating the analyzed content of reader

    See Also
    GetTokenStream(String, String)

    InitReader(String, TextReader)

    Override this if you want to add a CharFilter chain.

    The default implementation returns reader unchanged.

    Declaration
    protected virtual TextReader InitReader(string fieldName, TextReader reader)
    Parameters
    Type Name Description
    System.String fieldName

    IIndexableField name being indexed

    TextReader reader

    the original TextReader

    Returns
    Type Description
    TextReader

    reader, optionally decorated with CharFilter(s)
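
    An InitReader override in a subclass might look like the sketch below. It assumes the Lucene.Net.Analysis.Common package, which provides HTMLStripCharFilter and WhitespaceTokenizer; HtmlAwareAnalyzer is a hypothetical name.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CharFilters;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Util;

// Hypothetical analyzer: strips HTML markup before tokenizing on whitespace.
public sealed class HtmlAwareAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // The reader received here has already passed through InitReader.
        Tokenizer source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
        return new TokenStreamComponents(source);
    }

    // Wrap the incoming reader in a CharFilter; further filters could be chained here.
    protected override TextReader InitReader(string fieldName, TextReader reader)
    {
        return new HTMLStripCharFilter(reader);
    }
}
```

    Doing the markup removal in a CharFilter rather than a TokenFilter means the tokenizer never sees the tags, while token offsets are still corrected back to the original text.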

    NewAnonymous(Func<String, TextReader, TokenStreamComponents>)

    Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader) method through the createComponents parameter. Simple example:

        var analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) => 
        {
            Tokenizer source = new FooTokenizer(reader);
            TokenStream filter = new FooFilter(source);
            filter = new BarFilter(filter);
            return new TokenStreamComponents(source, filter);
        });

    LUCENENET specific

    Declaration
    public static Analyzer NewAnonymous(Func<string, TextReader, TokenStreamComponents> createComponents)
    Parameters
    Type Name Description
    Func<System.String, TextReader, TokenStreamComponents> createComponents

    A delegate method that represents (is called by) the CreateComponents(String, TextReader) method. It accepts a fieldName and a reader and returns the TokenStreamComponents for this analyzer.

    Returns
    Type Description
    Analyzer

    A new Lucene.Net.Analysis.Analyzer.AnonymousAnalyzer instance.

    NewAnonymous(Func<String, TextReader, TokenStreamComponents>, Func<String, TextReader, TextReader>)

    Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader) method through the createComponents parameter and the body of the InitReader(String, TextReader) method through the initReader parameter. Simple example:

        var analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) => 
        {
            Tokenizer source = new FooTokenizer(reader);
            TokenStream filter = new FooFilter(source);
            filter = new BarFilter(filter);
            return new TokenStreamComponents(source, filter);
        }, initReader: (fieldName, reader) => 
        {
            return new HTMLStripCharFilter(reader);
        });

    LUCENENET specific

    Declaration
    public static Analyzer NewAnonymous(Func<string, TextReader, TokenStreamComponents> createComponents, Func<string, TextReader, TextReader> initReader)
    Parameters
    Type Name Description
    Func<System.String, TextReader, TokenStreamComponents> createComponents

    A delegate method that represents (is called by) the CreateComponents(String, TextReader) method. It accepts a fieldName and a reader and returns the TokenStreamComponents for this analyzer.

    Func<System.String, TextReader, TextReader> initReader

    A delegate method that represents (is called by) the InitReader(String, TextReader) method. It accepts a fieldName and a reader and returns the TextReader that can be modified or wrapped by the initReader method.

    Returns
    Type Description
    Analyzer

    A new Lucene.Net.Analysis.Analyzer.AnonymousAnalyzer instance.

    NewAnonymous(Func<String, TextReader, TokenStreamComponents>, Func<String, TextReader, TextReader>, ReuseStrategy)

    Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader) method through the createComponents parameter, the body of the InitReader(String, TextReader) method through the initReader parameter, and allows the use of a ReuseStrategy. Simple example:

        var analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) => 
        {
            Tokenizer source = new FooTokenizer(reader);
            TokenStream filter = new FooFilter(source);
            filter = new BarFilter(filter);
            return new TokenStreamComponents(source, filter);
        }, initReader: (fieldName, reader) => 
        {
            return new HTMLStripCharFilter(reader);
        }, reuseStrategy);

    LUCENENET specific

    Declaration
    public static Analyzer NewAnonymous(Func<string, TextReader, TokenStreamComponents> createComponents, Func<string, TextReader, TextReader> initReader, ReuseStrategy reuseStrategy)
    Parameters
    Type Name Description
    Func<System.String, TextReader, TokenStreamComponents> createComponents

    A delegate method that represents (is called by) the CreateComponents(String, TextReader) method. It accepts a fieldName and a reader and returns the TokenStreamComponents for this analyzer.

    Func<System.String, TextReader, TextReader> initReader

    A delegate method that represents (is called by) the InitReader(String, TextReader) method. It accepts a fieldName and a reader and returns the TextReader that can be modified or wrapped by the initReader method.

    ReuseStrategy reuseStrategy

    A custom ReuseStrategy instance.

    Returns
    Type Description
    Analyzer

    A new Lucene.Net.Analysis.Analyzer.AnonymousAnalyzer instance.

    NewAnonymous(Func<String, TextReader, TokenStreamComponents>, ReuseStrategy)

    Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader) method through the createComponents parameter and allows the use of a ReuseStrategy. Simple example:

        var analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) => 
        {
            Tokenizer source = new FooTokenizer(reader);
            TokenStream filter = new FooFilter(source);
            filter = new BarFilter(filter);
            return new TokenStreamComponents(source, filter);
        }, reuseStrategy);

    LUCENENET specific

    Declaration
    public static Analyzer NewAnonymous(Func<string, TextReader, TokenStreamComponents> createComponents, ReuseStrategy reuseStrategy)
    Parameters
    Type Name Description
    Func<System.String, TextReader, TokenStreamComponents> createComponents

    A delegate method that represents (is called by) the CreateComponents(String, TextReader) method. It accepts a fieldName and a reader and returns the TokenStreamComponents for this analyzer.

    ReuseStrategy reuseStrategy

    A custom ReuseStrategy instance.

    Returns
    Type Description
    Analyzer

    A new Lucene.Net.Analysis.Analyzer.AnonymousAnalyzer instance.

    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)