Class NGramTokenFilter
Tokenizes the input into n-grams of the given size(s).
You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating an NGramTokenFilter. As of Lucene 4.4, this token filter:
- handles supplementary characters correctly,
- emits all n-grams for the same token at the same position,
- does not modify offsets,
- sorts n-grams by their offset in the original token first, then increasing length (meaning that "abc" will give "a", "ab", "abc", "b", "bc", "c").
You can make this filter use the old behavior by providing a version < Lucene.Net.Util.LuceneVersion.LUCENE_44 in the constructor, but this is not recommended, as it will lead to broken Lucene.Net.Analysis.TokenStreams that will cause highlighting bugs.
If you were using this Lucene.Net.Analysis.TokenFilter to perform partial highlighting, this won't work anymore since this filter doesn't update offsets. You should modify your analysis chain to use NGramTokenizer, and potentially override IsTokenChar(int) to perform pre-tokenization.
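For example, here is a minimal sketch of wrapping a tokenizer with this filter and consuming the resulting n-grams (the WhitespaceTokenizer, the gram sizes, and the LUCENE_48 version constant are illustrative choices, not requirements):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.NGram;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Tokenize "abc", then expand each token into 1- to 3-character n-grams.
const LuceneVersion version = LuceneVersion.LUCENE_48;
TokenStream stream = new WhitespaceTokenizer(version, new StringReader("abc"));
stream = new NGramTokenFilter(version, stream, minGram: 1, maxGram: 3);

ICharTermAttribute termAtt = stream.AddAttribute<ICharTermAttribute>();
stream.Reset();
while (stream.IncrementToken())
{
    // Prints: a, ab, abc, b, bc, c (sorted by offset first, then length)
    Console.WriteLine(termAtt.ToString());
}
stream.End();
stream.Dispose();
```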
Implements
IDisposable
Namespace: Lucene.Net.Analysis.NGram
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
public sealed class NGramTokenFilter : TokenFilter, IDisposable
Constructors
NGramTokenFilter(LuceneVersion, TokenStream)
Creates NGramTokenFilter with default min and max n-grams.
Declaration
public NGramTokenFilter(LuceneVersion version, TokenStream input)
Parameters
Type | Name | Description |
---|---|---|
LuceneVersion | version | Lucene version to enable correct position increments. See NGramTokenFilter for details. |
TokenStream | input | Lucene.Net.Analysis.TokenStream holding the input to be tokenized |
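As an illustrative sketch, this overload applies the default gram sizes documented under Fields below, so it behaves like the four-argument overload called with DEFAULT_MIN_NGRAM_SIZE and DEFAULT_MAX_NGRAM_SIZE (the KeywordTokenizer and version constant here are illustrative):

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.NGram;
using Lucene.Net.Util;

TokenStream input = new KeywordTokenizer(new StringReader("abc"));

// Uses DEFAULT_MIN_NGRAM_SIZE (1) and DEFAULT_MAX_NGRAM_SIZE (2),
// so "abc" yields: a, ab, b, bc, c
var filter = new NGramTokenFilter(LuceneVersion.LUCENE_48, input);
```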
NGramTokenFilter(LuceneVersion, TokenStream, int, int)
Creates NGramTokenFilter with given min and max n-grams.
Declaration
public NGramTokenFilter(LuceneVersion version, TokenStream input, int minGram, int maxGram)
Parameters
Type | Name | Description |
---|---|---|
LuceneVersion | version | Lucene version to enable correct position increments. See NGramTokenFilter for details. |
TokenStream | input | Lucene.Net.Analysis.TokenStream holding the input to be tokenized |
int | minGram | the smallest n-gram to generate |
int | maxGram | the largest n-gram to generate |
Fields
DEFAULT_MAX_NGRAM_SIZE
The default maximum n-gram size, used by the NGramTokenFilter(LuceneVersion, TokenStream) constructor when no maxGram is specified.
Declaration
public const int DEFAULT_MAX_NGRAM_SIZE = 2
Field Value
Type | Description |
---|---|
int |
DEFAULT_MIN_NGRAM_SIZE
The default minimum n-gram size, used by the NGramTokenFilter(LuceneVersion, TokenStream) constructor when no minGram is specified.
Declaration
public const int DEFAULT_MIN_NGRAM_SIZE = 1
Field Value
Type | Description |
---|---|
int |
Methods
IncrementToken()
Advances the stream to the next token. Returns true if a token was produced, or false at the end of the stream (EOS).
Declaration
public override sealed bool IncrementToken()
Returns
Type | Description |
---|---|
bool | true if the stream advanced to a new token; false at end of stream. |
Overrides
Lucene.Net.Analysis.TokenStream.IncrementToken()
Reset()
This method is called by a consumer before it begins consumption using Lucene.Net.Analysis.TokenStream.IncrementToken().
Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh. If you override this method, always call base.Reset(); otherwise, some internal state will not be correctly reset (e.g., Lucene.Net.Analysis.Tokenizer will throw InvalidOperationException on further usage).
Declaration
public override void Reset()
Overrides
Lucene.Net.Analysis.TokenFilter.Reset()
Remarks
NOTE: The default implementation chains the call to the input Lucene.Net.Analysis.TokenStream, so be sure to call base.Reset() when overriding this method.
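To illustrate the contract, here is a minimal sketch of a hypothetical stateful filter (CountingFilter is not part of the library) that chains base.Reset() and also clears its own state:

```csharp
using Lucene.Net.Analysis;

// Hypothetical example filter: counts the tokens that pass through it.
public sealed class CountingFilter : TokenFilter
{
    private int _tokenCount;

    public CountingFilter(TokenStream input) : base(input) { }

    public int TokenCount => _tokenCount;

    public override bool IncrementToken()
    {
        if (!m_input.IncrementToken()) return false;
        _tokenCount++;
        return true;
    }

    public override void Reset()
    {
        base.Reset();    // chains to the input TokenStream; skipping this breaks reuse
        _tokenCount = 0; // also reset this filter's own state
    }
}
```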