Namespace Lucene.Net.Analysis.NGram
Character n-gram tokenizers and filters.
Classes
EdgeNGramFilterFactory
Creates new instances of EdgeNGramTokenFilter.
<fieldType name="text_edgngrm" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1"/>
</analyzer>
</fieldType>
EdgeNGramTokenFilter
Tokenizes the given token into n-grams of given size(s).
This TokenFilter creates n-grams from the beginning edge of an input token.
As of Lucene 4.4, this filter does not support
BACK (you can use ReverseStringFilter up-front and afterward to get the same behavior), handles supplementary characters correctly and does not update offsets anymore.
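The expansion this filter performs is easy to picture. Below is a minimal sketch in plain Python (illustrative only, not the Lucene.Net API) of how front-edge n-grams of a single token are derived from `minGramSize`/`maxGramSize`:

```python
def edge_ngrams(term, min_gram, max_gram):
    """Front-edge n-grams of one token: prefixes of length
    min_gram..max_gram, capped at the token's length."""
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

print(edge_ngrams("banana", 1, 3))  # ['b', 'ba', 'ban']
```

With `minGramSize="1"` and `maxGramSize="1"`, as in the factory example above, each token would yield only its first character.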
EdgeNGramTokenizer
Tokenizes the input from an edge into n-grams of given size(s).
This Tokenizer creates n-grams from the beginning edge of an input token.
As of Lucene 4.4, this tokenizer
- can handle maxGram larger than 1024 chars, but beware that this will result in increased memory usage,
- doesn't trim the input,
- sets position increments equal to 1 instead of 1 for the first token and 0 for all other ones,
- doesn't support backward n-grams anymore,
- supports IsTokenChar(Int32) pre-tokenization,
- correctly handles supplementary characters.
Although highly discouraged, it is still possible
to use the old behavior through Lucene43EdgeNGramTokenizer.
EdgeNGramTokenizerFactory
Creates new instances of EdgeNGramTokenizer.
<fieldType name="text_edgngrm" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="1" maxGramSize="1"/>
</analyzer>
</fieldType>
Lucene43EdgeNGramTokenizer
Old version of EdgeNGramTokenizer which doesn't handle correctly supplementary characters.
Lucene43NGramTokenizer
Old broken version of NGramTokenizer.
NGramFilterFactory
Factory for NGramTokenFilter.
<fieldType name="text_ngrm" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="2"/>
</analyzer>
</fieldType>
NGramTokenFilter
Tokenizes the input into n-grams of the given size(s).
You must specify the required LuceneVersion compatibility when creating a NGramTokenFilter. As of Lucene 4.4, this token filter:
- handles supplementary characters correctly,
- emits all n-grams for the same token at the same position,
- does not modify offsets,
- sorts n-grams by their offset in the original token first, then increasing length (meaning that "abc" will give "a", "ab", "abc", "b", "bc", "c").
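The ordering described in the last bullet can be sketched in plain Python (illustrative only, not the Lucene.Net API): iterate over start offsets first, then over gram lengths.

```python
def ngrams_by_offset(term, min_gram, max_gram):
    """All n-grams of one token, sorted by start offset in the
    original token first, then by increasing length -- the
    post-Lucene-4.4 emission order."""
    grams = []
    for start in range(len(term)):
        for n in range(min_gram, max_gram + 1):
            if start + n <= len(term):
                grams.append(term[start:start + n])
    return grams

print(ngrams_by_offset("abc", 1, 3))  # ['a', 'ab', 'abc', 'b', 'bc', 'c']
```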
You can make this filter use the old behavior by providing a version < LUCENE_44 in the constructor, but this is not recommended as it will lead to broken TokenStreams that will cause highlighting bugs.
If you were using this TokenFilter to perform partial highlighting, this won't work anymore since this filter doesn't update offsets. You should modify your analysis chain to use NGramTokenizer, and potentially override NGramTokenizer.IsTokenChar(Int32) to perform pre-tokenization.
NGramTokenizer
Tokenizes the input into n-grams of the given size(s).
On the contrary to NGramTokenFilter, this class sets offsets so that characters between startOffset and endOffset in the original stream are the same as the term chars.
For example, "abcde" would be tokenized as (minGram=2, maxGram=3):
| Term | Position increment | Position length | Offsets |
|---|---|---|---|
| ab | 1 | 1 | [0,2[ |
| abc | 1 | 1 | [0,3[ |
| bc | 1 | 1 | [1,3[ |
| bcd | 1 | 1 | [1,4[ |
| cd | 1 | 1 | [2,4[ |
| cde | 1 | 1 | [2,5[ |
| de | 1 | 1 | [3,5[ |
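The table above can be reproduced with a short sketch in plain Python (illustrative only, not the Lucene.Net API): emit (term, startOffset, endOffset) triples by increasing start offset, then increasing length.

```python
def ngram_tokens(text, min_gram, max_gram):
    """(term, start, end) triples in the post-Lucene-4.4
    NGramTokenizer order: increasing start offset, then
    increasing gram length."""
    tokens = []
    for start in range(len(text)):
        for n in range(min_gram, max_gram + 1):
            end = start + n
            if end <= len(text):
                tokens.append((text[start:end], start, end))
    return tokens

for term, start, end in ngram_tokens("abcde", 2, 3):
    print(f"{term}\t[{start},{end}[")
# ab    [0,2[
# abc   [0,3[
# bc    [1,3[  ... and so on, matching the table above
```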
This tokenizer changed a lot in Lucene 4.4 in order to:
- tokenize in a streaming fashion to support streams which are larger than 1024 chars (limit of the previous version),
- count grams based on unicode code points instead of java chars (and never split in the middle of surrogate pairs),
- give the ability to pre-tokenize the stream (IsTokenChar(Int32)) before computing n-grams.
Additionally, this class doesn't trim trailing whitespaces and emits tokens in a different order: tokens are now emitted by increasing start offsets, while they used to be emitted by increasing lengths (which prevented supporting large input streams).
Although highly discouraged, it is still possible
to use the old behavior through Lucene43NGramTokenizer.
NGramTokenizerFactory
Factory for NGramTokenizer.
<fieldType name="text_ngrm" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="2"/>
</analyzer>
</fieldType>
Enums
EdgeNGramTokenFilter.Side
Specifies which side of the input the n-gram should be generated from
Lucene43EdgeNGramTokenizer.Side
Specifies which side of the input the n-gram should be generated from