Class PatternAnalyzer
An efficient Lucene analyzer/tokenizer that preferably operates on a System.String rather than a System.IO.TextReader. It flexibly separates text into terms via a regular expression (System.Text.RegularExpressions.Regex), with behaviour similar to System.Text.RegularExpressions.Regex.Split(System.String), and combines the functionality of LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, and StopFilter into a single efficient multi-purpose class.
If you are unsure what exactly a regular expression should look like, consider prototyping by trying various expressions on some test texts via System.Text.RegularExpressions.Regex.Split(System.String). Once you are satisfied, pass that regex to PatternAnalyzer. Also see Regular Expression Tutorial.
This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene.Net.Analysis.TokenFilter chain, as in this stemming example:
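The prototyping step described above can be sketched with nothing but System.Text.RegularExpressions. The helper name `Prototype` and the sample inputs are illustrative assumptions, not part of PatternAnalyzer. The sketch also drops empty pieces, which is a choice made here for readability, because Regex.Split returns an empty string when a delimiter sits at the very start or end of the input.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not a Lucene.NET API): split `text` on the candidate
// delimiter `pattern` and drop empty pieces so the result is easy to inspect.
string[] Prototype(string pattern, string text) =>
    new Regex(pattern).Split(text).Where(piece => piece.Length > 0).ToArray();

// Try two candidate delimiter expressions on small test texts.
string[] words = Prototype(@"\s+", "James is running round in the woods");
string[] fields = Prototype(@"\s*,\s*", "red , green,blue");

Console.WriteLine(string.Join("|", words));  // James|is|running|round|in|the|woods
Console.WriteLine(string.Join("|", fields)); // red|green|blue
```

Once an expression splits the test texts as intended, the same Regex instance could be passed as the `pattern` argument of the constructor.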
PatternAnalyzer pat = ...
TokenStream tokenStream = new SnowballFilter(
    pat.GetTokenStream("content", "James is running round in the woods"),
    "English");
Namespace: Lucene.Net.Analysis.Miscellaneous
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
[Obsolete("(4.0) use the pattern-based analysis in the analysis/pattern package instead.")]
public sealed class PatternAnalyzer : Analyzer, IDisposable
Constructors
PatternAnalyzer(LuceneVersion, Regex, Boolean, CharArraySet)
Constructs a new instance with the given parameters.
Declaration
public PatternAnalyzer(LuceneVersion matchVersion, Regex pattern, bool toLowerCase, CharArraySet stopWords)
Parameters
Type | Name | Description |
---|---|---|
Lucene.Net.Util.LuceneVersion | matchVersion | currently does nothing |
System.Text.RegularExpressions.Regex | pattern | a regular expression delimiting tokens |
System.Boolean | toLowerCase | if true, returns tokens after applying String.toLowerCase() |
CharArraySet | stopWords | if non-null, ignores all tokens that are contained in the given stop set (after previously having applied toLowerCase() if applicable). For example, created via MakeStopSet(LuceneVersion, String[]) and/or WordlistLoader. |
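To make the interplay of the three behavioural parameters concrete, here is a minimal plain-C# sketch of the order of operations the descriptions imply: split on the pattern, optionally lower-case, then drop stop words. It is an emulation under stated assumptions, not the class's actual implementation. The function `Analyze`, the use of ISet<string> in place of CharArraySet, ToLowerInvariant() as the lower-casing step, and the skipping of empty pieces are all choices made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the analyzer's behaviour (not a Lucene.NET API).
List<string> Analyze(string text, Regex pattern, bool toLowerCase, ISet<string> stopWords)
{
    var terms = new List<string>();
    foreach (string piece in pattern.Split(text))
    {
        if (piece.Length == 0) continue;                 // assumption: empty pieces are skipped
        string term = toLowerCase ? piece.ToLowerInvariant() : piece;
        // The stop set is consulted after lower-casing, as the parameter table describes.
        if (stopWords != null && stopWords.Contains(term)) continue;
        terms.Add(term);
    }
    return terms;
}

var stop = new HashSet<string> { "is", "in", "the" };
List<string> result = Analyze("James is running round in The woods", new Regex(@"\W+"), true, stop);

Console.WriteLine(string.Join(" ", result)); // james running round woods
```

Because lower-casing happens before the stop-set lookup, "The" is removed even though the set only contains "the".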
Fields
DEFAULT_ANALYZER
A lower-casing word analyzer with English stop words (can be shared freely across threads without harm); global per class loader.
Declaration
public static readonly PatternAnalyzer DEFAULT_ANALYZER
Field Value
Type | Description |
---|---|
PatternAnalyzer |
EXTENDED_ANALYZER
A lower-casing word analyzer with extended English stop words (can be shared freely across threads without harm); global per class loader. The stop words are borrowed from http://thomas.loc.gov/home/stopwords.html, see http://thomas.loc.gov/home/all.about.inquery.html
Declaration
public static readonly PatternAnalyzer EXTENDED_ANALYZER
Field Value
Type | Description |
---|---|
PatternAnalyzer |
NON_WORD_PATTERN
"\W+"; divides text at non-letters (NOT Character.isLetter(c)).
Declaration
public static readonly Regex NON_WORD_PATTERN
Field Value
Type | Description |
---|---|
System.Text.RegularExpressions.Regex |
WHITESPACE_PATTERN
"\s+"; divides text at whitespace (Character.isWhitespace(c)).
Declaration
public static readonly Regex WHITESPACE_PATTERN
Field Value
Type | Description |
---|---|
System.Text.RegularExpressions.Regex |
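The difference between the two pattern strings is easiest to see on a text that contains punctuation. The sketch below builds local Regex objects from the strings "\W+" and "\s+" rather than using the library fields, and it only shows how the plain regular expressions split text. Note that the plain regex `\W` treats digits and underscores as word characters, so the analyzer's own output for NON_WORD_PATTERN, which is described above in terms of letters, may differ in such cases.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Local stand-ins built from the documented pattern strings (not the library fields).
var nonWord = new Regex(@"\W+");
var whitespace = new Regex(@"\s+");
string text = "Don't panic: it's 42!";

// Empty pieces (from a delimiter at the end of the input) are dropped for readability.
string[] nonWordTerms = nonWord.Split(text).Where(t => t.Length > 0).ToArray();
string[] whitespaceTerms = whitespace.Split(text).Where(t => t.Length > 0).ToArray();

Console.WriteLine(string.Join("|", nonWordTerms));    // Don|t|panic|it|s|42
Console.WriteLine(string.Join("|", whitespaceTerms)); // Don't|panic:|it's|42!
```

Splitting on whitespace keeps punctuation attached to the terms, while splitting on non-word characters also breaks contractions apart.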
Methods
CreateComponents(String, TextReader)
Creates a token stream that tokenizes all the text in the given TextReader. This implementation forwards to GetTokenStream(String, TextReader) and is less efficient than GetTokenStream(String, TextReader).
Declaration
protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
Parameters
Type | Name | Description |
---|---|---|
System.String | fieldName | the name of the field to tokenize (currently ignored). |
System.IO.TextReader | reader | the reader delivering the text |
Returns
Type | Description |
---|---|
Lucene.Net.Analysis.TokenStreamComponents | a new token stream |
CreateComponents(String, TextReader, String)
Creates a token stream that tokenizes the given string into token terms (aka words).
Declaration
public TokenStreamComponents CreateComponents(string fieldName, TextReader reader, string text)
Parameters
Type | Name | Description |
---|---|---|
System.String | fieldName | the name of the field to tokenize (currently ignored). |
System.IO.TextReader | reader | reader (e.g. a CharFilter) of the original text; can be null. |
System.String | text | the string to tokenize |
Returns
Type | Description |
---|---|
Lucene.Net.Analysis.TokenStreamComponents | a new token stream |
Equals(Object)
Indicates whether some other object is "equal to" this one.
Declaration
public override bool Equals(object other)
Parameters
Type | Name | Description |
---|---|---|
System.Object | other | the reference object with which to compare. |
Returns
Type | Description |
---|---|
System.Boolean | true if equal, false otherwise |
GetHashCode()
Returns a hash code value for the object.
Declaration
public override int GetHashCode()
Returns
Type | Description |
---|---|
System.Int32 | the hash code. |