Class CommonGramsFilter
Constructs bigrams for frequently occurring terms while indexing. Single terms are still indexed, with the bigrams overlaid at the same position. This is achieved through the use of PositionIncrement. Bigrams have a type of GRAM_TYPE.
Example:
- input: "the quick brown fox"
- output: |"the","the-quick"|"quick"|"brown"|"fox"|
- "the-quick" has a position increment of 0, so it is in the same position as "the"
- "the-quick" has a token type of "gram"
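The overlay described above can be observed by running a filter chain and printing each token's position increment and type. The following is a minimal sketch, assuming Lucene.NET 4.8 with the Lucene.Net.Analysis.Common package referenced; the class name `CommonGramsDemo`, the whitespace tokenizer, and the one-word common set are choices made for this illustration only. The separator character that appears inside emitted bigram terms is determined by the implementation and may differ from the one shown in the example above.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CommonGrams;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Hypothetical demo class; not part of Lucene.NET.
public static class CommonGramsDemo
{
    public static void Main()
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;

        // Assumption: "the" is the only common word in this illustration.
        var commonWords = new CharArraySet(version, 1, true);
        commonWords.Add("the");

        using (var reader = new StringReader("the quick brown fox"))
        using (TokenStream stream = new CommonGramsFilter(
            version, new WhitespaceTokenizer(version, reader), commonWords))
        {
            var term = stream.AddAttribute<ICharTermAttribute>();
            var posInc = stream.AddAttribute<IPositionIncrementAttribute>();
            var type = stream.AddAttribute<ITypeAttribute>();

            // Consumer workflow: Reset, IncrementToken until false, End, Dispose.
            stream.Reset();
            while (stream.IncrementToken())
            {
                // Bigrams are reported with posInc=0 and type=gram.
                Console.WriteLine(
                    $"{term.ToString()} posInc={posInc.PositionIncrement} type={type.Type}");
            }
            stream.End();
        }
    }
}
```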
Implements
System.IDisposable
Namespace: Lucene.Net.Analysis.CommonGrams
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
public sealed class CommonGramsFilter : TokenFilter, IDisposable
Constructors
CommonGramsFilter(LuceneVersion, TokenStream, CharArraySet)
Constructs a token stream that filters the given input, using a set of common words to create bigrams. Outputs unigrams with their normal position increment, plus bigrams with position increment 0 and type="gram" wherever one or both of the words in a potential bigram are in the set of common words.
Declaration
public CommonGramsFilter(LuceneVersion matchVersion, TokenStream input, CharArraySet commonWords)
Parameters
Type | Name | Description |
---|---|---|
LuceneVersion | matchVersion | Lucene compatibility version |
TokenStream | input | The input TokenStream in the filter chain |
CharArraySet | commonWords | The set of common words. |
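In practice the constructor is usually called from an analyzer that builds the filter chain. The sketch below shows one way this might look, assuming Lucene.NET 4.8; `CommonGramsAnalyzer` is a hypothetical name introduced for this illustration, and the choice of `WhitespaceTokenizer` as the source is an assumption, not a requirement of the filter.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.CommonGrams;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Hypothetical analyzer name, used only for this illustration.
public sealed class CommonGramsAnalyzer : Analyzer
{
    private readonly LuceneVersion matchVersion;
    private readonly CharArraySet commonWords;

    public CommonGramsAnalyzer(LuceneVersion matchVersion, CharArraySet commonWords)
    {
        this.matchVersion = matchVersion;
        this.commonWords = commonWords;
    }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // The tokenizer is the source; CommonGramsFilter wraps it as `input`.
        Tokenizer source = new WhitespaceTokenizer(matchVersion, reader);
        TokenStream result = new CommonGramsFilter(matchVersion, source, commonWords);
        return new TokenStreamComponents(source, result);
    }
}
```

Passing the same `matchVersion` to every component in the chain keeps their compatibility behavior consistent.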
Fields
GRAM_TYPE
Declaration
public const string GRAM_TYPE = "gram"
Field Value
Type | Description |
---|---|
System.String |
Methods
IncrementToken()
Inserts bigrams for common words into a token stream. For each input token, the token itself is output. If the token and/or the following token is in the list of common words, a bigram with position increment 0 and type="gram" is output as well.
TODO: Consider adding an option to not emit unigram stopwords, as in CDL XTF BigramStopFilter. CommonGramsQueryFilter would need to be changed to work with this.
TODO: Consider optimizing for the case of three commongrams, e.g., "man of the year" normally produces 3 bigrams: "man-of", "of-the", "the-year". With proper management of positions, the middle bigram "of-the" could be eliminated, saving a disk seek and a whole set of position lookups.
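The rule described above can be sketched in plain C# without any Lucene.NET dependency. This is a simplified illustration over an in-memory list, not the filter's actual implementation: `BigramSketch`, `Expand`, and the `separator` parameter are hypothetical names, and "word" is used as the type for unigrams on the assumption that the upstream tokenizer leaves the default token type in place.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; a simplified model of the bigram-insertion rule.
public static class BigramSketch
{
    public static List<(string Term, int PosInc, string Type)> Expand(
        IReadOnlyList<string> tokens, ISet<string> commonWords, char separator)
    {
        var output = new List<(string Term, int PosInc, string Type)>();
        for (int i = 0; i < tokens.Count; i++)
        {
            // Every input token is emitted as a unigram at its own position.
            output.Add((tokens[i], 1, "word"));

            // Overlay a bigram at the same position (increment 0) when this
            // token and/or the following token is a common word.
            bool hasNext = i + 1 < tokens.Count;
            if (hasNext &&
                (commonWords.Contains(tokens[i]) || commonWords.Contains(tokens[i + 1])))
            {
                output.Add((tokens[i] + separator + tokens[i + 1], 0, "gram"));
            }
        }
        return output;
    }
}
```

Calling `BigramSketch.Expand(new[] { "man", "of", "the", "year" }, new HashSet<string> { "of", "the" }, '-')` yields the three bigrams mentioned in the second TODO, each overlaid on the unigram that starts it.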
Declaration
public override bool IncrementToken()
Returns
Type | Description |
---|---|
System.Boolean |
Overrides
TokenStream.IncrementToken()
Reset()
This method is called by a consumer before it begins consumption using IncrementToken().
Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call base.Reset(); otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw System.InvalidOperationException on further usage).
Declaration
public override void Reset()
Overrides
TokenFilter.Reset()
Remarks
NOTE: The default implementation chains the call to the input TokenStream, so be sure to call base.Reset() when overriding this method.
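Because CommonGramsFilter is sealed, this advice applies when writing your own filters. The sketch below illustrates the pattern with a hypothetical `CountingFilter` that keeps per-stream state, assuming Lucene.NET 4.8, where the wrapped stream is exposed to subclasses as the protected field `m_input`.

```csharp
using Lucene.Net.Analysis;

// Hypothetical filter used only to illustrate the Reset() contract.
public sealed class CountingFilter : TokenFilter
{
    // Per-stream state that must be cleared before the stream is reused.
    private int tokenCount;

    public CountingFilter(TokenStream input) : base(input) { }

    public int TokenCount => tokenCount;

    public override bool IncrementToken()
    {
        if (!m_input.IncrementToken())
        {
            return false;
        }
        tokenCount++;
        return true;
    }

    public override void Reset()
    {
        base.Reset();   // chains the call to the wrapped input stream
        tokenCount = 0; // then clears this filter's own state
    }
}
```

Omitting the `base.Reset()` call would leave the wrapped stream in its previous state, which is the failure mode the note above warns about.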