Class PatternAnalyzer

Efficient Lucene analyzer/tokenizer that preferably operates on a string rather than a TextReader, that can flexibly separate text into terms via a Regex (with behaviour similar to Split(string)), and that combines the functionality of LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, and StopFilter into a single efficient multi-purpose class.

If you are unsure what exactly a regular expression should look like, consider prototyping by trying various expressions on some test texts via Split(string). Once you are satisfied, give that regex to PatternAnalyzer. Also see Regular Expression Tutorial.
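The prototyping step can be sketched with nothing but System.Text.RegularExpressions. This is a minimal illustration, not part of Lucene.NET: the class name SplitPrototype and the method Tokens are hypothetical helpers, and dropping empty entries is an assumption made here so that the output resembles a token list.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for trying out delimiter patterns; not a Lucene.NET type.
public static class SplitPrototype
{
    // Splits text on the delimiter pattern and drops the empty strings that
    // Regex.Split yields when the text starts or ends with a delimiter.
    public static string[] Tokens(string pattern, string text)
    {
        string[] parts = new Regex(pattern).Split(text);
        return Array.FindAll(parts, part => part.Length > 0);
    }
}
```

For instance, SplitPrototype.Tokens(@"\W+", "Hello, world! It's 4pm.") yields Hello, world, It, s, 4pm, while the pattern @"\s+" keeps the punctuation attached to each word. Comparing the two on sample text is a quick way to decide which delimiter pattern to hand to the analyzer.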

This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene.Net.Analysis.TokenFilter chain, as in this stemming example:

PatternAnalyzer pat = ...
// Wrap the analyzer's token stream in an English Snowball stemmer.
TokenStream tokenStream = new SnowballFilter(
    pat.GetTokenStream("content", "James is running round in the woods"),
    "English");

Inheritance

object

Analyzer

PatternAnalyzer

Implements

IDisposable

Inherited Members

Analyzer.NewAnonymous(Func<string, TextReader, TokenStreamComponents>)

Analyzer.NewAnonymous(Func<string, TextReader, TokenStreamComponents>, ReuseStrategy)

Analyzer.NewAnonymous(Func<string, TextReader, TokenStreamComponents>, Func<string, TextReader, TextReader>)

Analyzer.NewAnonymous(Func<string, TextReader, TokenStreamComponents>, Func<string, TextReader, TextReader>, ReuseStrategy)

Analyzer.GetTokenStream(string, TextReader)

Analyzer.GetTokenStream(string, string)

Analyzer.GetPositionIncrementGap(string)

Analyzer.GetOffsetGap(string)

Analyzer.Strategy

Analyzer.Dispose()

Analyzer.GLOBAL_REUSE_STRATEGY

Analyzer.PER_FIELD_REUSE_STRATEGY

object.Equals(object, object)

object.GetType()

object.ReferenceEquals(object, object)

object.ToString()

Namespace: Lucene.Net.Analysis.Miscellaneous

Assembly: Lucene.Net.Analysis.Common.dll

Syntax

[Obsolete("(4.0) use the pattern-based analysis in the analysis/pattern package instead.")]
public sealed class PatternAnalyzer : Analyzer, IDisposable

Constructors

PatternAnalyzer(LuceneVersion, Regex, bool, CharArraySet)

Constructs a new instance with the given parameters.

Declaration

public PatternAnalyzer(LuceneVersion matchVersion, Regex pattern, bool toLowerCase, CharArraySet stopWords)

Parameters

Type	Name	Description
LuceneVersion	matchVersion	currently does nothing
Regex	pattern	a regular expression delimiting tokens
bool	toLowerCase	if `true`, returns tokens after converting them to lower case
CharArraySet	stopWords	if non-null, ignores all tokens that are contained in the given stop set (after lower-casing them, if toLowerCase is `true`). The set can be created, for example, via MakeStopSet(LuceneVersion, params string[]) and/or WordlistLoader, as in `WordlistLoader.getWordSet(new File("samples/fulltext/stopwords.txt"))`, or from other stop word lists.
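The interaction of the three parameters described above (split on the pattern, optionally lower-case, then drop stop words) can be sketched in plain C#. This is an illustrative model only, not the library's implementation: PatternAnalyzerSketch and Analyze are hypothetical names, an ISet<string> stands in for CharArraySet, and the use of ToLowerInvariant is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical model of the parameter semantics; not a Lucene.NET type.
public static class PatternAnalyzerSketch
{
    public static List<string> Analyze(
        Regex pattern, bool toLowerCase, ISet<string> stopWords, string text)
    {
        var terms = new List<string>();
        foreach (string part in pattern.Split(text))
        {
            if (part.Length == 0) continue; // skip empty pieces at the edges

            // Lower-casing happens before the stop-word check, so the stop
            // set should contain lower-case entries when toLowerCase is true.
            string term = toLowerCase ? part.ToLowerInvariant() : part;

            if (stopWords != null && stopWords.Contains(term)) continue;
            terms.Add(term);
        }
        return terms;
    }
}
```

With the pattern @"\W+", toLowerCase set to true, and the stop set { "the", "is", "in" }, the text "James is running round in The woods" reduces to james, running, round, woods. Note that "The" is removed only because it is lower-cased before the stop-set lookup.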

Fields

DEFAULT_ANALYZER

A lower-casing word analyzer with English stop words. It is a single shared static instance and can be shared freely across threads without harm.

Declaration

public static readonly PatternAnalyzer DEFAULT_ANALYZER

Field Value

Type	Description
PatternAnalyzer

EXTENDED_ANALYZER

A lower-casing word analyzer with extended English stop words. It is a single shared static instance and can be shared freely across threads without harm. The stop words are borrowed from http://thomas.loc.gov/home/stopwords.html; see http://thomas.loc.gov/home/all.about.inquery.html.

Declaration

public static readonly PatternAnalyzer EXTENDED_ANALYZER

Field Value

Type	Description
PatternAnalyzer

NON_WORD_PATTERN

"\W+"; divides text at runs of non-word characters (roughly, anything that is not a letter, digit, or underscore).

Declaration

public static readonly Regex NON_WORD_PATTERN

Field Value

Type	Description
Regex

WHITESPACE_PATTERN

"\s+"; divides text at runs of whitespace characters.

Declaration

public static readonly Regex WHITESPACE_PATTERN

Field Value

Type	Description
Regex

Methods

CreateComponents(string, TextReader)

Creates a token stream that tokenizes all the text in the given TextReader. This implementation forwards to CreateComponents(string, TextReader, string) and is less efficient than calling that overload directly.

Declaration

protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)

Parameters

Type	Name	Description
string	fieldName	the name of the field to tokenize (currently ignored).
TextReader	reader	the reader delivering the text

Returns

Type	Description
TokenStreamComponents	a new token stream

Overrides

Analyzer.CreateComponents(string, TextReader)

CreateComponents(string, TextReader, string)

Creates a token stream that tokenizes the given string into token terms (aka words).

Declaration

public TokenStreamComponents CreateComponents(string fieldName, TextReader reader, string text)

Parameters

Type	Name	Description
string	fieldName	the name of the field to tokenize (currently ignored).
TextReader	reader	reader (e.g. a CharFilter) over the original text; can be null.
string	text	the string to tokenize

Returns

Type	Description
TokenStreamComponents	a new token stream

Equals(object)

Indicates whether some other object is "equal to" this one.

Declaration

public override bool Equals(object other)

Parameters

Type	Name	Description
object	other	the reference object with which to compare.

Returns

Type	Description
bool	true if equal, false otherwise

Overrides

object.Equals(object)

GetHashCode()

Returns a hash code value for the object.

Declaration

public override int GetHashCode()

Returns

Type	Description
int	the hash code.

Overrides

object.GetHashCode()

Implements

IDisposable