    Class PatternAnalyzer

    Efficient Lucene analyzer/tokenizer that preferably operates on a System.String rather than a System.IO.TextReader, that can flexibly separate text into terms via a regular expression System.Text.RegularExpressions.Regex (with behaviour similar to System.Text.RegularExpressions.Regex.Split(System.String)), and that combines the functionality of LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, and StopFilter into a single efficient multi-purpose class.

    If you are unsure what a regular expression should look like, consider prototyping by trying various expressions on some test texts via System.Text.RegularExpressions.Regex.Split(System.String). Once you are satisfied, pass that regex to PatternAnalyzer. Also see Regular Expression Tutorial.
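
    As a rough sketch of that prototyping step, the snippet below splits a sample sentence on a candidate delimiter pattern and prints the resulting terms. The RegexPrototype class, the SplitTerms helper, and the sample text are illustrative assumptions and not part of Lucene.Net; empty entries are dropped here only to make the output easier to read.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for trying out delimiter patterns; not part of Lucene.Net.
public static class RegexPrototype
{
    // Splits text on the delimiter pattern and drops the empty strings that
    // Regex.Split emits when the text starts or ends with a delimiter.
    public static string[] SplitTerms(string pattern, string text)
    {
        string[] parts = new Regex(pattern).Split(text);
        return Array.FindAll(parts, part => part.Length > 0);
    }

    public static void Main()
    {
        string sample = "James is running, round in the woods!";
        foreach (string term in SplitTerms(@"\W+", sample))
        {
            Console.WriteLine(term);
        }
    }
}
```

    Swapping @"\W+" for another candidate pattern and rerunning is usually enough to see whether the delimiter behaves as intended before handing it to PatternAnalyzer.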

    This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene.Net.Analysis.TokenFilter chain, as in this stemming example:

    PatternAnalyzer pat = ...
    TokenStream tokenStream = new SnowballFilter(
        pat.GetTokenStream("content", "James is running round in the woods"),
        "English");

    Inheritance
    System.Object
    Lucene.Net.Analysis.Analyzer
    PatternAnalyzer
    Implements
    System.IDisposable
    Inherited Members
    Analyzer.NewAnonymous(Func<String, TextReader, TokenStreamComponents>)
    Analyzer.NewAnonymous(Func<String, TextReader, TokenStreamComponents>, ReuseStrategy)
    Analyzer.NewAnonymous(Func<String, TextReader, TokenStreamComponents>, Func<String, TextReader, TextReader>)
    Analyzer.NewAnonymous(Func<String, TextReader, TokenStreamComponents>, Func<String, TextReader, TextReader>, ReuseStrategy)
    Analyzer.GetTokenStream(String, TextReader)
    Analyzer.GetTokenStream(String, String)
    Analyzer.InitReader(String, TextReader)
    Analyzer.GetPositionIncrementGap(String)
    Analyzer.GetOffsetGap(String)
    Lucene.Net.Analysis.Analyzer.Strategy
    Lucene.Net.Analysis.Analyzer.Dispose()
    Analyzer.Dispose(Boolean)
    Lucene.Net.Analysis.Analyzer.GLOBAL_REUSE_STRATEGY
    Lucene.Net.Analysis.Analyzer.PER_FIELD_REUSE_STRATEGY
    System.Object.Equals(System.Object, System.Object)
    System.Object.GetType()
    System.Object.MemberwiseClone()
    System.Object.ReferenceEquals(System.Object, System.Object)
    System.Object.ToString()
    Namespace: Lucene.Net.Analysis.Miscellaneous
    Assembly: Lucene.Net.Analysis.Common.dll
    Syntax
    [Obsolete("(4.0) use the pattern-based analysis in the analysis/pattern package instead.")]
    public sealed class PatternAnalyzer : Analyzer, IDisposable

    Constructors

    PatternAnalyzer(LuceneVersion, Regex, Boolean, CharArraySet)

    Constructs a new instance with the given parameters.

    Declaration
    public PatternAnalyzer(LuceneVersion matchVersion, Regex pattern, bool toLowerCase, CharArraySet stopWords)
    Parameters
    Type Name Description
    Lucene.Net.Util.LuceneVersion matchVersion

    currently does nothing

    System.Text.RegularExpressions.Regex pattern

    a regular expression delimiting tokens

    System.Boolean toLowerCase

    if true, returns tokens after applying String.toLowerCase()

    CharArraySet stopWords

    if non-null, ignores all tokens that are contained in the given stop set (after applying toLowerCase(), if applicable). The set can be created, for example, via MakeStopSet(LuceneVersion, String[]) and/or WordlistLoader, as in

    WordlistLoader.getWordSet(new File("samples/fulltext/stopwords.txt"))

    or from other stop word lists.
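
    The constructor parameters above can be combined roughly as follows. This is a minimal sketch, assuming the Lucene.Net 4.8 packages (Lucene.Net and Lucene.Net.Analysis.Common) are referenced and that LuceneVersion.LUCENE_48 is an acceptable match version; the PatternAnalyzerDemo class, the stop words, and the sample sentence are made up for illustration.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Hypothetical demo class; not part of Lucene.Net.
public static class PatternAnalyzerDemo
{
    public static void Main()
    {
        // Assumed stop words, chosen only for this illustration.
        CharArraySet stopWords =
            StopFilter.MakeStopSet(LuceneVersion.LUCENE_48, "is", "in", "the");

#pragma warning disable 618 // PatternAnalyzer is marked [Obsolete]
        var analyzer = new PatternAnalyzer(
            LuceneVersion.LUCENE_48,            // matchVersion
            PatternAnalyzer.WHITESPACE_PATTERN, // pattern delimiting tokens
            true,                               // toLowerCase
            stopWords);                         // tokens to ignore
#pragma warning restore 618

        using (TokenStream stream =
            analyzer.GetTokenStream("content", "James is running round in the woods"))
        {
            ICharTermAttribute term = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset(); // required before the first IncrementToken call
            while (stream.IncrementToken())
            {
                Console.WriteLine(term.ToString());
            }
            stream.End();
        }
    }
}
```

    Passing null for stopWords skips stop-word removal, as described for that parameter.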

    Fields

    DEFAULT_ANALYZER

    A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader.

    Declaration
    public static readonly PatternAnalyzer DEFAULT_ANALYZER
    Field Value
    Type Description
    PatternAnalyzer

    EXTENDED_ANALYZER

    A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader. The stop words are borrowed from http://thomas.loc.gov/home/stopwords.html, see http://thomas.loc.gov/home/all.about.inquery.html

    Declaration
    public static readonly PatternAnalyzer EXTENDED_ANALYZER
    Field Value
    Type Description
    PatternAnalyzer

    NON_WORD_PATTERN

    "\W+"; Divides text at non-letters (NOT Character.isLetter(c))

    Declaration
    public static readonly Regex NON_WORD_PATTERN
    Field Value
    Type Description
    System.Text.RegularExpressions.Regex

    WHITESPACE_PATTERN

    "\s+"; Divides text at whitespace (Character.isWhitespace(c))

    Declaration
    public static readonly Regex WHITESPACE_PATTERN
    Field Value
    Type Description
    System.Text.RegularExpressions.Regex
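
    To illustrate how the two delimiter patterns differ, the sketch below splits the same text with "\W+" and with "\s+". The PatternComparison class and its members are hypothetical; the patterns are re-declared locally rather than read from PatternAnalyzer so that the snippet needs no Lucene.Net reference.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical comparison helper; not part of Lucene.Net.
public static class PatternComparison
{
    // Local copies of the two delimiter expressions documented above.
    public static readonly Regex NonWord = new Regex(@"\W+");
    public static readonly Regex Whitespace = new Regex(@"\s+");

    // Joins the split result with '|' so the term boundaries are visible.
    public static string Describe(Regex delimiter, string text)
    {
        return string.Join("|", delimiter.Split(text));
    }

    public static void Main()
    {
        string sample = "don't stop-me now";
        Console.WriteLine(Describe(Whitespace, sample)); // don't|stop-me|now
        Console.WriteLine(Describe(NonWord, sample));    // don|t|stop|me|now
    }
}
```

    Note that in .NET regular expressions \w also matches digits and the underscore, so "\W+" does not split at those characters.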

    Methods

    CreateComponents(String, TextReader)

    Creates a token stream that tokenizes all the text in the given TextReader. This implementation forwards to GetTokenStream(String, TextReader) and is less efficient than GetTokenStream(String, TextReader).

    Declaration
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    Parameters
    Type Name Description
    System.String fieldName

    the name of the field to tokenize (currently ignored).

    System.IO.TextReader reader

    the reader delivering the text

    Returns
    Type Description
    Lucene.Net.Analysis.TokenStreamComponents

    a new token stream

    Overrides
    Analyzer.CreateComponents(String, TextReader)

    CreateComponents(String, TextReader, String)

    Creates a token stream that tokenizes the given string into token terms (aka words).

    Declaration
    public TokenStreamComponents CreateComponents(string fieldName, TextReader reader, string text)
    Parameters
    Type Name Description
    System.String fieldName

    the name of the field to tokenize (currently ignored).

    System.IO.TextReader reader

    the reader (e.g. a char filter) of the original text; can be null.

    System.String text

    the string to tokenize

    Returns
    Type Description
    Lucene.Net.Analysis.TokenStreamComponents

    a new token stream


    Equals(Object)

    Indicates whether some other object is "equal to" this one.

    Declaration
    public override bool Equals(object other)
    Parameters
    Type Name Description
    System.Object other

    the reference object with which to compare.

    Returns
    Type Description
    System.Boolean

    true if equal, false otherwise

    Overrides
    System.Object.Equals(System.Object)

    GetHashCode()

    Returns a hash code value for the object.

    Declaration
    public override int GetHashCode()
    Returns
    Type Description
    System.Int32

    the hash code.

    Overrides
    System.Object.GetHashCode()

    Implements

    System.IDisposable
    Copyright © 2020 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.