Class PatternAnalyzer
An efficient Lucene analyzer/tokenizer that preferably operates on a string rather than a TextReader, flexibly separates text into terms via a regular expression (a Regex, with behaviour similar to Split(string)), and combines the functionality of LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, and StopFilter into a single efficient multi-purpose class.
If you are unsure what exactly a regular expression should look like, consider prototyping by trying various expressions on some test texts via Split(string). Once you are satisfied, pass that regex to PatternAnalyzer. Also see Regular Expression Tutorial.
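The prototyping advice above can be tried with nothing but System.Text.RegularExpressions. The following is a minimal sketch: the class name PatternPrototype, the helper Terms, the sample text, and the pattern are all made up for illustration, and dropping empty strings is a choice made here for readability rather than documented PatternAnalyzer behaviour.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternPrototype
{
    // Splits text on a candidate delimiter pattern and drops the empty strings
    // that Regex.Split emits when the input starts or ends with a delimiter.
    public static string[] Terms(string pattern, string text)
    {
        string[] pieces = new Regex(pattern).Split(text);
        return Array.FindAll(pieces, s => s.Length > 0);
    }

    public static void Main()
    {
        // Hypothetical test text and delimiter pattern (comma or semicolon,
        // followed by optional whitespace).
        string sample = "red, green;blue,  yellow";
        string pattern = @"[,;]\s*";

        Console.WriteLine(string.Join("|", Terms(pattern, sample)));
        // prints: red|green|blue|yellow
    }
}
```

Once the output matches the terms you expect, the same pattern string can be wrapped in a Regex and handed to the PatternAnalyzer constructor.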
This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene.Net.Analysis.TokenFilter chain, as in this stemming example:
PatternAnalyzer pat = ...
TokenStream tokenStream = new SnowballFilter(
    pat.GetTokenStream("content", "James is running round in the woods"),
    "English");
Namespace: Lucene.Net.Analysis.Miscellaneous
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
[Obsolete("(4.0) use the pattern-based analysis in the analysis/pattern package instead.")]
public sealed class PatternAnalyzer : Analyzer, IDisposable
  Constructors
PatternAnalyzer(LuceneVersion, Regex, bool, CharArraySet)
Constructs a new instance with the given parameters.
Declaration
public PatternAnalyzer(LuceneVersion matchVersion, Regex pattern, bool toLowerCase, CharArraySet stopWords)
  Parameters
| Type | Name | Description |
|---|---|---|
| LuceneVersion | matchVersion | currently does nothing |
| Regex | pattern | a regular expression delimiting tokens |
| bool | toLowerCase | if true, returns tokens after applying String.toLowerCase() |
| CharArraySet | stopWords | if non-null, ignores all tokens that are contained in the given stop set (after previously having applied toLowerCase() if applicable). For example, created via MakeStopSet(LuceneVersion, params string[]) and/or WordlistLoader. |
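To make the interplay of the pattern, toLowerCase, and stopWords parameters concrete, here is a plain .NET emulation of the behaviour described in the table. It is a sketch under stated assumptions, not PatternAnalyzer's actual code: the class PatternAnalyzerSketch and its Analyze method are hypothetical, a HashSet<string> stands in for CharArraySet, ToLowerInvariant stands in for the lower-casing step, and empty pieces are skipped by choice.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PatternAnalyzerSketch
{
    // Hypothetical stand-in for the documented parameters:
    // split on the delimiter pattern, optionally lower-case,
    // then drop terms found in the stop set (if one is given).
    public static List<string> Analyze(
        Regex pattern, bool toLowerCase, ISet<string> stopWords, string text)
    {
        var terms = new List<string>();
        foreach (string piece in pattern.Split(text))
        {
            if (piece.Length == 0) continue; // assumption: empty pieces are not terms

            // Lower-casing happens before the stop-set lookup, as the table states.
            string term = toLowerCase ? piece.ToLowerInvariant() : piece;

            if (stopWords != null && stopWords.Contains(term)) continue;
            terms.Add(term);
        }
        return terms;
    }

    public static void Main()
    {
        var stop = new HashSet<string> { "is", "in", "the" };
        List<string> terms = Analyze(
            new Regex(@"\W+"), true, stop, "James is running round in The woods");

        Console.WriteLine(string.Join(" ", terms));
        // prints: james running round woods
    }
}
```

Because lower-casing is applied first, "The" is matched against the stop set as "the" and removed. With toLowerCase set to false, the same input would keep "James" and "The" unchanged.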
Fields
DEFAULT_ANALYZER
A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader.
Declaration
public static readonly PatternAnalyzer DEFAULT_ANALYZER
  Field Value
| Type | Description | 
|---|---|
| PatternAnalyzer | |
EXTENDED_ANALYZER
A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader. The stop words are borrowed from http://thomas.loc.gov/home/stopwords.html, see http://thomas.loc.gov/home/all.about.inquery.html
Declaration
public static readonly PatternAnalyzer EXTENDED_ANALYZER
  Field Value
| Type | Description | 
|---|---|
| PatternAnalyzer | |
NON_WORD_PATTERN
"\W+"; Divides text at non-letters (NOT Character.isLetter(c))
Declaration
public static readonly Regex NON_WORD_PATTERN
  Field Value
| Type | Description | 
|---|---|
| Regex | |
WHITESPACE_PATTERN
"\s+"; Divides text at whitespaces (Character.isWhitespace(c))
Declaration
public static readonly Regex WHITESPACE_PATTERN
  Field Value
| Type | Description | 
|---|---|
| Regex | |
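The difference between the two predefined patterns is easiest to see side by side. The sketch below assumes the fields correspond to the literal expressions quoted above (`\W+` and `\s+`) and uses plain Regex.Split rather than PatternAnalyzer itself. The class name PredefinedPatternDemo, the helper Terms, and the sample text are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PredefinedPatternDemo
{
    // Splits on the given delimiter pattern and drops empty pieces.
    public static string[] Terms(string pattern, string text)
    {
        return Array.FindAll(Regex.Split(text, pattern), s => s.Length > 0);
    }

    public static void Main()
    {
        string sample = "foo_bar 42, baz!";

        // \W+ splits at runs of non-word characters. In .NET regular expressions,
        // digits and the underscore count as word characters, so they stay in terms.
        Console.WriteLine(string.Join("|", Terms(@"\W+", sample)));
        // prints: foo_bar|42|baz

        // \s+ splits at whitespace only, so punctuation stays attached to terms.
        Console.WriteLine(string.Join("|", Terms(@"\s+", sample)));
        // prints: foo_bar|42,|baz!
    }
}
```

In practice this means `\W+` strips punctuation but keeps digits and underscores inside terms, while `\s+` leaves every non-whitespace character in place.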
Methods
CreateComponents(string, TextReader)
Creates a token stream that tokenizes all the text in the given TextReader. This implementation forwards to CreateComponents(string, TextReader, string) and is less efficient than calling that overload with the text already available as a string.
Declaration
protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
  Parameters
| Type | Name | Description |
|---|---|---|
| string | fieldName | the name of the field to tokenize (currently ignored). |
| TextReader | reader | the reader delivering the text |
Returns
| Type | Description | 
|---|---|
| TokenStreamComponents | a new token stream  | 
      
CreateComponents(string, TextReader, string)
Creates a token stream that tokenizes the given string into token terms (aka words).
Declaration
public TokenStreamComponents CreateComponents(string fieldName, TextReader reader, string text)
  Parameters
| Type | Name | Description |
|---|---|---|
| string | fieldName | the name of the field to tokenize (currently ignored). |
| TextReader | reader | a reader (e.g. a CharFilter) of the original text; can be null. |
| string | text | the string to tokenize |
Returns
| Type | Description | 
|---|---|
| TokenStreamComponents | a new token stream  | 
      
Equals(object)
Indicates whether some other object is "equal to" this one.
Declaration
public override bool Equals(object other)
  Parameters
| Type | Name | Description | 
|---|---|---|
| object | other | the reference object with which to compare.  | 
      
Returns
| Type | Description | 
|---|---|
| bool | true if equal, false otherwise  | 
      
GetHashCode()
Returns a hash code value for the object.
Declaration
public override int GetHashCode()
  Returns
| Type | Description | 
|---|---|
| int | the hash code.  |