Class PatternAnalyzer
Efficient Lucene analyzer/tokenizer that preferably operates on a System.String rather than a System.IO.TextReader. It flexibly separates text into terms via a regular expression (System.Text.RegularExpressions.Regex), with behaviour similar to System.Text.RegularExpressions.Regex.Split(System.String), and combines the functionality of LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, and StopFilter into a single efficient multi-purpose class.
If you are unsure what exactly a regular expression should look like, consider prototyping by trying various expressions on some test texts via System.Text.RegularExpressions.Regex.Split(System.String). Once you are satisfied, give that regex to PatternAnalyzer. Also see Regular Expression Tutorial.
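The prototyping step described above can be sketched in plain .NET, without any Lucene types. The class name `RegexPrototype`, the helper `Terms`, the candidate pattern, and the sample text are all hypothetical and chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPrototype
{
    // Hypothetical helper: splits the sample text with a candidate delimiter
    // pattern and drops the empty entries that Regex.Split produces when the
    // text starts or ends with a delimiter.
    public static string[] Terms(string pattern, string text)
    {
        string[] parts = new Regex(pattern).Split(text);
        return Array.FindAll(parts, p => p.Length > 0);
    }

    public static void Main()
    {
        string sample = "red, green;blue ,  dark grey";

        // Candidate delimiter: a comma or semicolon with optional surrounding whitespace.
        foreach (string term in Terms(@"\s*[,;]\s*", sample))
        {
            Console.WriteLine("[" + term + "]");
        }
    }
}
```

Once the printed terms look right for a few representative texts, the same pattern can be wrapped in a Regex and passed to the PatternAnalyzer constructor.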
This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene.Net.Analysis.TokenFilter chain, as in this stemming example:
PatternAnalyzer pat = ...
TokenStream tokenStream = new SnowballFilter(
    pat.GetTokenStream("content", "James is running round in the woods"),
    "English");
Namespace: Lucene.Net.Analysis.Miscellaneous
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
[Obsolete("(4.0) use the pattern-based analysis in the analysis/pattern package instead.")]
public sealed class PatternAnalyzer : Analyzer, IDisposable
Constructors
PatternAnalyzer(LuceneVersion, Regex, Boolean, CharArraySet)
Constructs a new instance with the given parameters.
Declaration
public PatternAnalyzer(LuceneVersion matchVersion, Regex pattern, bool toLowerCase, CharArraySet stopWords)
Parameters
| Type | Name | Description | 
|---|---|---|
| Lucene.Net.Util.LuceneVersion | matchVersion | currently does nothing | 
| System.Text.RegularExpressions.Regex | pattern | a regular expression delimiting tokens | 
| System.Boolean | toLowerCase | if true, returns tokens after lower-casing them | 
| CharArraySet | stopWords | if non-null, ignores all tokens that are contained in the given stop set (after lower-casing them first, if toLowerCase is set). For example, a set created via MakeStopSet(LuceneVersion, String[]) and/or WordlistLoader. | 
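How the three behavioural parameters interact can be approximated in plain .NET: split on the delimiter pattern, optionally lower-case each token, then drop tokens found in the stop set. This is a minimal sketch of the documented behaviour, not the Lucene.NET implementation; `PatternPipelineSketch`, `Tokenize`, and the tiny stop set are hypothetical, and a `HashSet<string>` stands in for CharArraySet.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PatternPipelineSketch
{
    // Hypothetical stand-in for the analyzer: pattern delimits tokens,
    // toLowerCase controls case folding, stopWords (may be null) filters tokens.
    public static List<string> Tokenize(Regex pattern, bool toLowerCase, ISet<string> stopWords, string text)
    {
        var tokens = new List<string>();
        foreach (string part in pattern.Split(text))
        {
            // Leading or trailing delimiters make Regex.Split yield empty strings.
            if (part.Length == 0) continue;

            // Lower-casing happens before the stop-set lookup, as the table describes.
            string token = toLowerCase ? part.ToLowerInvariant() : part;
            if (stopWords != null && stopWords.Contains(token)) continue;

            tokens.Add(token);
        }
        return tokens;
    }

    public static void Main()
    {
        var stop = new HashSet<string> { "the", "in", "is" };
        List<string> tokens = Tokenize(new Regex(@"\s+"), true, stop,
            "James is running round in the Woods");
        Console.WriteLine(string.Join("|", tokens)); // prints "james|running|round|woods"
    }
}
```

Because the lookup runs on the lower-cased token, a stop set intended for use with toLowerCase set to true should itself contain lower-case entries.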
Fields
DEFAULT_ANALYZER
A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader.
Declaration
public static readonly PatternAnalyzer DEFAULT_ANALYZER
Field Value
| Type | Description | 
|---|---|
| PatternAnalyzer | 
EXTENDED_ANALYZER
A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader. The stop words are borrowed from http://thomas.loc.gov/home/stopwords.html, see http://thomas.loc.gov/home/all.about.inquery.html
Declaration
public static readonly PatternAnalyzer EXTENDED_ANALYZER
Field Value
| Type | Description | 
|---|---|
| PatternAnalyzer | 
NON_WORD_PATTERN
"\W+"; divides text at non-letters (characters for which char.IsLetter(c) is false)
Declaration
public static readonly Regex NON_WORD_PATTERN
Field Value
| Type | Description | 
|---|---|
| System.Text.RegularExpressions.Regex | 
WHITESPACE_PATTERN
"\s+"; divides text at whitespace (characters for which char.IsWhiteSpace(c) is true)
Declaration
public static readonly Regex WHITESPACE_PATTERN
Field Value
| Type | Description | 
|---|---|
| System.Text.RegularExpressions.Regex | 
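To see how the two delimiter expressions differ, the sketch below runs the same sentence through both with plain Regex.Split. The class `BuiltInPatternComparison` and its helper are hypothetical. Note one caveat: in a plain .NET Regex, `\W` treats digits and underscores as word characters, so this sketch may not match the letter-based description of NON_WORD_PATTERN for text that contains them.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BuiltInPatternComparison
{
    // Hypothetical helper: split with the given delimiter expression and
    // discard empty entries caused by leading or trailing delimiters.
    public static string[] SplitNonEmpty(string pattern, string text)
    {
        return Regex.Split(text, pattern).Where(s => s.Length > 0).ToArray();
    }

    public static void Main()
    {
        string text = "Don't stop, e-mail me!";

        // Non-word delimiters also break at apostrophes, hyphens and punctuation.
        Console.WriteLine(string.Join("|", SplitNonEmpty(@"\W+", text))); // prints "Don|t|stop|e|mail|me"

        // Whitespace delimiters keep punctuation attached to the tokens.
        Console.WriteLine(string.Join("|", SplitNonEmpty(@"\s+", text))); // prints "Don't|stop,|e-mail|me!"
    }
}
```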
Methods
CreateComponents(String, TextReader)
Creates a token stream that tokenizes all the text in the given reader. This implementation forwards to CreateComponents(String, TextReader, String) and is less efficient than calling that overload directly.
Declaration
protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
Parameters
| Type | Name | Description | 
|---|---|---|
| System.String | fieldName | the name of the field to tokenize (currently ignored). | 
| System.IO.TextReader | reader | the reader delivering the text | 
Returns
| Type | Description | 
|---|---|
| Lucene.Net.Analysis.TokenStreamComponents | a new token stream | 
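One plausible reason the reader-based path is slower can be sketched in plain .NET. The sketch assumes that a regex-based tokenizer needs the complete text as a string, so a reader has to be drained first; that assumption, and the names `ReaderVersusString`, `FromString`, and `FromReader`, are illustrative and not taken from the Lucene.NET source.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class ReaderVersusString
{
    // String path: the text is already in memory and can be split directly.
    public static string[] FromString(Regex pattern, string text)
    {
        return pattern.Split(text);
    }

    // Reader path (assumed): the whole reader is first copied into a string,
    // which adds a buffering step before any splitting can start.
    public static string[] FromReader(Regex pattern, TextReader reader)
    {
        return FromString(pattern, reader.ReadToEnd());
    }

    public static void Main()
    {
        var whitespace = new Regex(@"\s+");
        string text = "James is running round in the woods";

        string[] direct = FromString(whitespace, text);
        string[] viaReader = FromReader(whitespace, new StringReader(text));

        // Both paths yield the same terms; only the amount of copying differs.
        Console.WriteLine(direct.Length == viaReader.Length);
    }
}
```

When the text is already available as a string, passing it directly avoids the intermediate copy.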
CreateComponents(String, TextReader, String)
Creates a token stream that tokenizes the given string into token terms (aka words).
Declaration
public TokenStreamComponents CreateComponents(string fieldName, TextReader reader, string text)
Parameters
| Type | Name | Description | 
|---|---|---|
| System.String | fieldName | the name of the field to tokenize (currently ignored). | 
| System.IO.TextReader | reader | reader (e.g. a CharFilter) of the original text; can be null. | 
| System.String | text | the string to tokenize | 
Returns
| Type | Description | 
|---|---|
| Lucene.Net.Analysis.TokenStreamComponents | a new token stream | 
Equals(Object)
Indicates whether some other object is "equal to" this one.
Declaration
public override bool Equals(object other)
Parameters
| Type | Name | Description | 
|---|---|---|
| System.Object | other | the reference object with which to compare. | 
Returns
| Type | Description | 
|---|---|
| System.Boolean | true if equal, false otherwise | 
GetHashCode()
Returns a hash code value for the object.
Declaration
public override int GetHashCode()
Returns
| Type | Description | 
|---|---|
| System.Int32 | the hash code. |