Class Analyzer
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.
In order to define what analysis is done, subclasses must define their TokenStreamComponents in CreateComponents(String, TextReader). The components are then reused in each call to GetTokenStream(String, TextReader).
Simple example:
Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
Tokenizer source = new FooTokenizer(reader);
TokenStream filter = new FooFilter(source);
filter = new BarFilter(filter);
return new TokenStreamComponents(source, filter);
});
For more examples, see the Lucene.Net.Analysis namespace documentation.
For some concrete implementations bundled with Lucene, look in the analysis modules:
- Common: Analyzers for indexing content in different languages and domains.
- ICU: Exposes functionality from ICU to Apache Lucene.
- Kuromoji: Morphological analyzer for Japanese text.
- Morfologik: Dictionary-driven lemmatization for the Polish language.
- OpenNLP: Analysis integration with Apache OpenNLP.
- Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
- Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
- Stempel: Algorithmic Stemmer for the Polish Language.
Implements
System.IDisposable
Namespace: Lucene.Net.Analysis
Assembly: Lucene.Net.dll
Syntax
public abstract class Analyzer : IDisposable
Constructors
Analyzer()
Create a new Analyzer, reusing the same set of components per-thread across calls to GetTokenStream(String, TextReader).
Declaration
protected Analyzer()
Analyzer(ReuseStrategy)
Expert: create a new Analyzer with a custom ReuseStrategy.
NOTE: if you just want to reuse on a per-field basis, it's easier to use a subclass of AnalyzerWrapper such as Lucene.Net.Analysis.Common.Miscellaneous.PerFieldAnalyzerWrapper instead.
Declaration
protected Analyzer(ReuseStrategy reuseStrategy)
Parameters
Type | Name | Description |
---|---|---|
ReuseStrategy | reuseStrategy | the ReuseStrategy to use for this analyzer |
Fields
GLOBAL_REUSE_STRATEGY
A predefined ReuseStrategy that reuses the same components for every field.
Declaration
public static readonly ReuseStrategy GLOBAL_REUSE_STRATEGY
Field Value
Type | Description |
---|---|
ReuseStrategy |
PER_FIELD_REUSE_STRATEGY
A predefined ReuseStrategy that reuses components per-field by maintaining a Map of TokenStreamComponents per field name.
Declaration
public static readonly ReuseStrategy PER_FIELD_REUSE_STRATEGY
Field Value
Type | Description |
---|---|
ReuseStrategy |
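For example, this predefined strategy can be passed to NewAnonymous(Func&lt;String, TextReader, TokenStreamComponents&gt;, ReuseStrategy). The following is a minimal sketch; StandardTokenizer and LowerCaseFilter are assumed to be available from the Lucene.Net.Analysis.Common package and are used only for illustration.
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;     // LowerCaseFilter (assumed from Lucene.Net.Analysis.Common)
using Lucene.Net.Analysis.Standard; // StandardTokenizer (assumed from Lucene.Net.Analysis.Common)
using Lucene.Net.Util;

// Reuse TokenStreamComponents per field name instead of globally.
Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
    Tokenizer source = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
    TokenStream filter = new LowerCaseFilter(LuceneVersion.LUCENE_48, source);
    return new TokenStreamComponents(source, filter);
}, Analyzer.PER_FIELD_REUSE_STRATEGY);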
Properties
Strategy
Returns the used ReuseStrategy.
Declaration
public ReuseStrategy Strategy { get; }
Property Value
Type | Description |
---|---|
ReuseStrategy |
Methods
CreateComponents(String, TextReader)
Creates a new TokenStreamComponents instance for this analyzer.
Declaration
protected abstract TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
Parameters
Type | Name | Description |
---|---|---|
System.String | fieldName | the name of the field whose content is passed to the TokenStreamComponents sink as a reader |
System.IO.TextReader | reader | the reader passed to the Tokenizer constructor |
Returns
Type | Description |
---|---|
TokenStreamComponents | the TokenStreamComponents for this analyzer. |
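As an illustration only, a concrete subclass might implement CreateComponents(String, TextReader) as in the sketch below. The class name is hypothetical, and WhitespaceTokenizer and LowerCaseFilter are assumed to be available from the Lucene.Net.Analysis.Common package.
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core; // WhitespaceTokenizer, LowerCaseFilter (assumed from Lucene.Net.Analysis.Common)
using Lucene.Net.Util;

// Hypothetical analyzer: split on whitespace, then lowercase each token.
public sealed class LowercaseWhitespaceAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
        TokenStream filter = new LowerCaseFilter(LuceneVersion.LUCENE_48, source);
        return new TokenStreamComponents(source, filter);
    }
}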
Dispose()
Frees persistent resources used by this Analyzer
Declaration
public void Dispose()
Dispose(Boolean)
Frees persistent resources used by this Analyzer
Declaration
protected virtual void Dispose(bool disposing)
Parameters
Type | Name | Description |
---|---|---|
System.Boolean | disposing | true to release both managed and unmanaged resources; false to release only unmanaged resources |
GetOffsetGap(String)
Just like GetPositionIncrementGap(String), except for Token offsets instead. By default this returns 1. This method is only called if the field produced at least one token for indexing.
Declaration
public virtual int GetOffsetGap(string fieldName)
Parameters
Type | Name | Description |
---|---|---|
System.String | fieldName | the field just indexed |
Returns
Type | Description |
---|---|
System.Int32 | offset gap, added to the next token emitted from GetTokenStream(String, TextReader). This value must be >= 0. |
GetPositionIncrementGap(String)
Invoked before indexing an IIndexableField instance if terms have already been added to that field. This allows custom analyzers to place an automatic position increment gap between IIndexableField instances using the same field name. The default position increment gap is 0. With a 0 position increment gap and the typical default token position increment of 1, all terms in a field, including terms across IIndexableField instances, are in successive positions, allowing exact PhraseQuery matches, for instance, across IIndexableField instance boundaries.
Declaration
public virtual int GetPositionIncrementGap(string fieldName)
Parameters
Type | Name | Description |
---|---|---|
System.String | fieldName | IIndexableField name being indexed. |
Returns
Type | Description |
---|---|
System.Int32 | position increment gap, added to the next token emitted from GetTokenStream(String, TextReader). This value must be >= 0. |
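As a sketch of how this hook can be used (the subclass and gap size below are hypothetical, not part of this API), returning a non-zero gap keeps phrase queries from matching across the boundary between two values of the same field.
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core; // WhitespaceTokenizer (assumed from Lucene.Net.Analysis.Common)
using Lucene.Net.Util;

public sealed class GapAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
        return new TokenStreamComponents(source);
    }

    // Leave 10 unused positions between values of a multi-valued field.
    public override int GetPositionIncrementGap(string fieldName)
    {
        return 10; // the default is 0, which makes values position-adjacent
    }
}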
GetTokenStream(String, TextReader)
Returns a TokenStream suitable for fieldName, tokenizing the contents of reader.
This method uses CreateComponents(String, TextReader) to obtain an instance of TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through SetReader(TextReader).
NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Lucene.Net.Analysis namespace documentation for some examples demonstrating this.
Declaration
public TokenStream GetTokenStream(string fieldName, TextReader reader)
Parameters
Type | Name | Description |
---|---|---|
System.String | fieldName | the name of the field the created TokenStream is used for |
System.IO.TextReader | reader | the reader the streams source reads from |
Returns
Type | Description |
---|---|
TokenStream | TokenStream for iterating the analyzed content of System.IO.TextReader |
Exceptions
Type | Condition |
---|---|
System.ObjectDisposedException | if the Analyzer is disposed. |
System.IO.IOException | if an i/o error occurs (may rarely happen for strings). |
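A minimal consumption sketch following that workflow is shown below. StandardAnalyzer is assumed to be available from the Lucene.Net.Analysis.Common package; any other Analyzer is consumed the same way.
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard; // StandardAnalyzer (assumed from Lucene.Net.Analysis.Common)
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

Analyzer analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
TokenStream stream = analyzer.GetTokenStream("body", new StringReader("Some text to analyze."));
ICharTermAttribute termAtt = stream.AddAttribute<ICharTermAttribute>();
try
{
    stream.Reset();                 // required before the first IncrementToken()
    while (stream.IncrementToken())
    {
        Console.WriteLine(termAtt.ToString());
    }
    stream.End();                   // report the final offset state
}
finally
{
    stream.Dispose();               // always release the stream when done
}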
GetTokenStream(String, String)
Returns a TokenStream suitable for fieldName, tokenizing the contents of text.
This method uses CreateComponents(String, TextReader) to obtain an instance of TokenStreamComponents. It returns the sink of the components and stores the components internally. Subsequent calls to this method will reuse the previously stored components after resetting them through SetReader(TextReader).
NOTE: After calling this method, the consumer must follow the workflow described in TokenStream to properly consume its contents. See the Lucene.Net.Analysis namespace documentation for some examples demonstrating this.
Declaration
public TokenStream GetTokenStream(string fieldName, string text)
Parameters
Type | Name | Description |
---|---|---|
System.String | fieldName | the name of the field the created TokenStream is used for |
System.String | text | the System.String the streams source reads from |
Returns
Type | Description |
---|---|
TokenStream | TokenStream for iterating the analyzed content of text |
Exceptions
Type | Condition |
---|---|
System.ObjectDisposedException | if the Analyzer is disposed. |
System.IO.IOException | if an i/o error occurs (may rarely happen for strings). |
InitReader(String, TextReader)
Override this if you want to add a CharFilter chain.
The default implementation returns reader unchanged.
Declaration
protected virtual TextReader InitReader(string fieldName, TextReader reader)
Parameters
Type | Name | Description |
---|---|---|
System.String | fieldName | IIndexableField name being indexed |
System.IO.TextReader | reader | original System.IO.TextReader |
Returns
Type | Description |
---|---|
System.IO.TextReader | reader, optionally decorated with CharFilter(s) |
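As an illustration, a hypothetical subclass could wrap the incoming reader in HTMLStripCharFilter (assumed to be available from the Lucene.Net.Analysis.Common package) so that markup is stripped before tokenization.
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CharFilters; // HTMLStripCharFilter (assumed from Lucene.Net.Analysis.Common)
using Lucene.Net.Analysis.Core;        // WhitespaceTokenizer (assumed from Lucene.Net.Analysis.Common)
using Lucene.Net.Util;

public sealed class HtmlAwareAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
        return new TokenStreamComponents(source);
    }

    protected override TextReader InitReader(string fieldName, TextReader reader)
    {
        // The reader returned here is what the Tokenizer created in
        // CreateComponents ultimately reads from.
        return new HTMLStripCharFilter(reader);
    }
}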
NewAnonymous(Func<String, TextReader, TokenStreamComponents>)
Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader)
method through the createComponents
parameter.
Simple example:
var analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
Tokenizer source = new FooTokenizer(reader);
TokenStream filter = new FooFilter(source);
filter = new BarFilter(filter);
return new TokenStreamComponents(source, filter);
});
LUCENENET specific
Declaration
public static Analyzer NewAnonymous(Func<string, TextReader, TokenStreamComponents> createComponents)
Parameters
Type | Name | Description |
---|---|---|
System.Func<System.String, System.IO.TextReader, TokenStreamComponents> | createComponents | A delegate method that represents (is called by) the CreateComponents(String, TextReader) method. It accepts a System.String fieldName and a System.IO.TextReader reader and returns the TokenStreamComponents for this analyzer. |
Returns
Type | Description |
---|---|
Analyzer | A new Lucene.Net.Analysis.Analyzer.AnonymousAnalyzer instance. |
NewAnonymous(Func<String, TextReader, TokenStreamComponents>, ReuseStrategy)
Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader)
method through the createComponents
parameter and allows the use of a ReuseStrategy.
Simple example:
var analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
Tokenizer source = new FooTokenizer(reader);
TokenStream filter = new FooFilter(source);
filter = new BarFilter(filter);
return new TokenStreamComponents(source, filter);
}, reuseStrategy);
LUCENENET specific
Declaration
public static Analyzer NewAnonymous(Func<string, TextReader, TokenStreamComponents> createComponents, ReuseStrategy reuseStrategy)
Parameters
Type | Name | Description |
---|---|---|
System.Func<System.String, System.IO.TextReader, TokenStreamComponents> | createComponents | A delegate method that represents (is called by) the CreateComponents(String, TextReader) method. It accepts a System.String fieldName and a System.IO.TextReader reader and returns the TokenStreamComponents for this analyzer. |
ReuseStrategy | reuseStrategy | A custom ReuseStrategy instance. |
Returns
Type | Description |
---|---|
Analyzer | A new Lucene.Net.Analysis.Analyzer.AnonymousAnalyzer instance. |
NewAnonymous(Func<String, TextReader, TokenStreamComponents>, Func<String, TextReader, TextReader>)
Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader)
method through the createComponents
parameter and the body of the InitReader(String, TextReader)
method through the initReader
parameter.
Simple example:
var analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
Tokenizer source = new FooTokenizer(reader);
TokenStream filter = new FooFilter(source);
filter = new BarFilter(filter);
return new TokenStreamComponents(source, filter);
}, initReader: (fieldName, reader) =>
{
return new HTMLStripCharFilter(reader);
});
LUCENENET specific
Declaration
public static Analyzer NewAnonymous(Func<string, TextReader, TokenStreamComponents> createComponents, Func<string, TextReader, TextReader> initReader)
Parameters
Type | Name | Description |
---|---|---|
System.Func<System.String, System.IO.TextReader, TokenStreamComponents> | createComponents | A delegate method that represents (is called by) the CreateComponents(String, TextReader) method. It accepts a System.String fieldName and a System.IO.TextReader reader and returns the TokenStreamComponents for this analyzer. |
System.Func<System.String, System.IO.TextReader, System.IO.TextReader> | initReader | A delegate method that represents (is called by) the InitReader(String, TextReader) method. It accepts a System.String fieldName and a System.IO.TextReader reader and returns the System.IO.TextReader that can be modified or wrapped before it is passed to the createComponents delegate. |
Returns
Type | Description |
---|---|
Analyzer | A new Lucene.Net.Analysis.Analyzer.AnonymousAnalyzer instance. |
NewAnonymous(Func<String, TextReader, TokenStreamComponents>, Func<String, TextReader, TextReader>, ReuseStrategy)
Creates a new instance with the ability to specify the body of the CreateComponents(String, TextReader)
method through the createComponents
parameter, the body of the InitReader(String, TextReader)
method through the initReader
parameter, and allows the use of a ReuseStrategy.
Simple example:
var analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
{
Tokenizer source = new FooTokenizer(reader);
TokenStream filter = new FooFilter(source);
filter = new BarFilter(filter);
return new TokenStreamComponents(source, filter);
}, initReader: (fieldName, reader) =>
{
return new HTMLStripCharFilter(reader);
}, reuseStrategy);
LUCENENET specific
Declaration
public static Analyzer NewAnonymous(Func<string, TextReader, TokenStreamComponents> createComponents, Func<string, TextReader, TextReader> initReader, ReuseStrategy reuseStrategy)
Parameters
Type | Name | Description |
---|---|---|
System.Func<System.String, System.IO.TextReader, TokenStreamComponents> | createComponents | A delegate method that represents (is called by) the CreateComponents(String, TextReader) method. It accepts a System.String fieldName and a System.IO.TextReader reader and returns the TokenStreamComponents for this analyzer. |
System.Func<System.String, System.IO.TextReader, System.IO.TextReader> | initReader | A delegate method that represents (is called by) the InitReader(String, TextReader) method. It accepts a System.String fieldName and a System.IO.TextReader reader and returns the System.IO.TextReader that can be modified or wrapped before it is passed to the createComponents delegate. |
ReuseStrategy | reuseStrategy | A custom ReuseStrategy instance. |
Returns
Type | Description |
---|---|
Analyzer | A new Lucene.Net.Analysis.Analyzer.AnonymousAnalyzer instance. |