Class WordDelimiterFilter
Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:
- split on intra-word delimiters (by default, all non-alphanumeric characters): "Wi-Fi" → "Wi", "Fi"
- split on case transitions: "PowerShot" → "Power", "Shot"
- split on letter-number transitions: "SD500" → "SD", "500"
- leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, 'dude'" → "hello", "there", "dude"
- trailing "'s" are removed for each subword: "O'Neil's" → "O", "Neil"
  - Note: this step isn't performed in a separate filter because of possible subword combinations.
The combinations parameter affects how subwords are combined:
- combinations="0" causes no subword combinations: "PowerShot" → 0:"Power", 1:"Shot" (0 and 1 are the token positions)
- combinations="1" means that in addition to the subwords, maximum runs of non-numeric subwords are catenated and produced at the same position of the last subword in the run:
  - "PowerShot" → 0:"Power", 1:"Shot", 1:"PowerShot"
  - "A's+B's&C's" → 0:"A", 1:"B", 2:"C", 2:"ABC"
  - "Super-Duper-XL500-42-AutoCoder!" → 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
One use for WordDelimiterFilter is to help match words with different subword delimiters. For example, if the source text contains "wi-fi", one may want the queries "wifi", "WiFi", "wi-fi", and "wi+fi" to all match. One way of doing so is to specify combinations="1" in the analyzer used for indexing, and combinations="0" (the default) in the analyzer used for querying.
Because the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that does not do this (such as WhitespaceTokenizer).
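The index-time versus query-time setup described above might be sketched as follows. This is a minimal sketch, not the library's prescribed configuration: the class and method names are hypothetical, and it assumes that the WordDelimiterFlags members (GENERATE_WORD_PARTS, CATENATE_WORDS, and so on) carry the same names as the corresponding Java Lucene constants, with CATENATE_WORDS playing the role of combinations="1".

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Util;

// Hypothetical helper class, for illustration only.
public static class WordDelimiterChains
{
    // Assumption: these flag names mirror the Java Lucene constants.
    private const WordDelimiterFlags SplitFlags =
        WordDelimiterFlags.GENERATE_WORD_PARTS |
        WordDelimiterFlags.GENERATE_NUMBER_PARTS |
        WordDelimiterFlags.SPLIT_ON_CASE_CHANGE |
        WordDelimiterFlags.SPLIT_ON_NUMERICS |
        WordDelimiterFlags.STEM_ENGLISH_POSSESSIVE;

    // Index side: emit the subwords plus catenated runs of word parts.
    public static TokenStream ForIndexing(TextReader reader)
    {
        Tokenizer source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
        return new WordDelimiterFilter(
            LuceneVersion.LUCENE_48,
            source,
            SplitFlags | WordDelimiterFlags.CATENATE_WORDS,
            null); // no protected words
    }

    // Query side: emit the subwords only, without catenation.
    public static TokenStream ForQuerying(TextReader reader)
    {
        Tokenizer source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, reader);
        return new WordDelimiterFilter(
            LuceneVersion.LUCENE_48,
            source,
            SplitFlags,
            null);
    }
}
```

Both chains start from WhitespaceTokenizer so that intra-word delimiters reach the filter intact, as recommended above.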
Namespace: Lucene.Net.Analysis.Miscellaneous
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
public sealed class WordDelimiterFilter : TokenFilter, IDisposable
Constructors
WordDelimiterFilter(LuceneVersion, TokenStream, WordDelimiterFlags, CharArraySet)
Creates a new WordDelimiterFilter using DEFAULT_WORD_DELIM_TABLE as its charTypeTable
Declaration
public WordDelimiterFilter(LuceneVersion matchVersion, TokenStream @in, WordDelimiterFlags configurationFlags, CharArraySet protWords)
Parameters
Type | Name | Description |
---|---|---|
LuceneVersion | matchVersion | Lucene compatibility version |
TokenStream | in | TokenStream to be filtered |
WordDelimiterFlags | configurationFlags | Flags configuring the filter |
CharArraySet | protWords | If not null, the set of tokens to protect from being delimited |
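As an illustration of the protWords parameter, the sketch below builds a filter that should leave the token "Wi-Fi" undelimited while still splitting other tokens. The class name is hypothetical, and the flag names and the CharArraySet constructor overload used here are assumptions based on the Lucene.NET 4.8 API.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Hypothetical helper class, for illustration only.
public static class ProtectedWordsExample
{
    public static TokenStream Create(TextReader reader)
    {
        var version = LuceneVersion.LUCENE_48;

        // Tokens in this set are passed through without being delimited.
        // The last argument (ignoreCase: false) makes the match case-sensitive.
        var protWords = new CharArraySet(version, new[] { "Wi-Fi" }, false);

        Tokenizer source = new WhitespaceTokenizer(version, reader);

        // Assumption: flag names mirror the Java Lucene constants.
        var flags = WordDelimiterFlags.GENERATE_WORD_PARTS
                  | WordDelimiterFlags.SPLIT_ON_CASE_CHANGE;

        // This overload uses the default character type table.
        return new WordDelimiterFilter(version, source, flags, protWords);
    }
}
```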
WordDelimiterFilter(LuceneVersion, TokenStream, Byte[], WordDelimiterFlags, CharArraySet)
Creates a new WordDelimiterFilter
Declaration
public WordDelimiterFilter(LuceneVersion matchVersion, TokenStream @in, byte[] charTypeTable, WordDelimiterFlags configurationFlags, CharArraySet protWords)
Parameters
Type | Name | Description |
---|---|---|
LuceneVersion | matchVersion | Lucene compatibility version |
TokenStream | in | TokenStream to be filtered |
System.Byte[] | charTypeTable | Table containing character types |
WordDelimiterFlags | configurationFlags | Flags configuring the filter |
CharArraySet | protWords | If not null, the set of tokens to protect from being delimited |
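A custom charTypeTable could be built along the following lines. This is a sketch under stated assumptions: the table is assumed to be indexed by character code, with each entry holding one of the type constants defined on this class, and the 256-entry size and the choice to reclassify '-' as ALPHA are illustrative only. The class and method names are hypothetical.

```csharp
using Lucene.Net.Analysis.Miscellaneous;

// Hypothetical helper class, for illustration only.
public static class CustomCharTypeTable
{
    // Builds a table for the first 256 characters in which '-' is
    // treated as a letter, so that it no longer acts as a delimiter.
    public static byte[] Build()
    {
        var table = new byte[256];
        for (int i = 0; i < table.Length; i++)
        {
            char c = (char)i;
            int code;
            if (char.IsLower(c)) code = WordDelimiterFilter.LOWER;
            else if (char.IsUpper(c)) code = WordDelimiterFilter.UPPER;
            else if (char.IsDigit(c)) code = WordDelimiterFilter.DIGIT;
            else code = WordDelimiterFilter.SUBWORD_DELIM;
            table[i] = (byte)code;
        }

        // Illustrative override: classify the hyphen as alphabetic.
        table['-'] = (byte)WordDelimiterFilter.ALPHA;
        return table;
    }
}
```

The resulting array would be passed as the charTypeTable argument of the five-parameter constructor.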
Fields
ALPHA
Declaration
public const int ALPHA = 3
Field Value
Type | Description |
---|---|
System.Int32 |
ALPHANUM
Declaration
public const int ALPHANUM = 7
Field Value
Type | Description |
---|---|
System.Int32 |
DIGIT
Declaration
public const int DIGIT = 4
Field Value
Type | Description |
---|---|
System.Int32 |
LOWER
Declaration
public const int LOWER = 1
Field Value
Type | Description |
---|---|
System.Int32 |
SUBWORD_DELIM
Declaration
public const int SUBWORD_DELIM = 8
Field Value
Type | Description |
---|---|
System.Int32 |
UPPER
Declaration
public const int UPPER = 2
Field Value
Type | Description |
---|---|
System.Int32 |
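The field values above are bit flags: ALPHA (3) equals LOWER | UPPER, and ALPHANUM (7) equals ALPHA | DIGIT. A short sketch of how a type code can be tested against these masks follows; the helper class and its method names are hypothetical and not part of the documented API.

```csharp
using System;
using Lucene.Net.Analysis.Miscellaneous;

// Hypothetical helper class, for illustration only.
public static class CharTypeBits
{
    // True if the type code has the LOWER or UPPER bit set.
    public static bool IsAlpha(int type) => (type & WordDelimiterFilter.ALPHA) != 0;

    // True if the type code has the DIGIT bit set.
    public static bool IsDigit(int type) => (type & WordDelimiterFilter.DIGIT) != 0;

    // True if the type code marks a subword delimiter.
    public static bool IsSubwordDelim(int type) => (type & WordDelimiterFilter.SUBWORD_DELIM) != 0;

    public static void Main()
    {
        // 1 | 2 == 3
        Console.WriteLine(
            WordDelimiterFilter.ALPHA == (WordDelimiterFilter.LOWER | WordDelimiterFilter.UPPER)); // True
        // 3 | 4 == 7
        Console.WriteLine(
            WordDelimiterFilter.ALPHANUM == (WordDelimiterFilter.ALPHA | WordDelimiterFilter.DIGIT)); // True
    }
}
```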
Methods
IncrementToken()
Declaration
public override bool IncrementToken()
Returns
Type | Description |
---|---|
System.Boolean |
Overrides
TokenStream.IncrementToken()
Reset()
Declaration
public override void Reset()
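Reset() and IncrementToken() are normally called as part of the standard TokenStream consumption workflow. The sketch below shows that workflow and prints each term with its position, in the same position:"term" notation used in the class description. The class name is hypothetical, and the flag names and the attribute interfaces in Lucene.Net.Analysis.TokenAttributes are assumptions based on the Lucene.NET 4.8 API.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Hypothetical demo class, for illustration only.
public static class ConsumeExample
{
    public static void Main()
    {
        var version = LuceneVersion.LUCENE_48;

        // Assumption: flag names mirror the Java Lucene constants.
        var flags = WordDelimiterFlags.GENERATE_WORD_PARTS
                  | WordDelimiterFlags.GENERATE_NUMBER_PARTS
                  | WordDelimiterFlags.SPLIT_ON_CASE_CHANGE
                  | WordDelimiterFlags.SPLIT_ON_NUMERICS
                  | WordDelimiterFlags.CATENATE_WORDS;

        using (var reader = new StringReader("PowerShot SD500"))
        using (TokenStream stream = new WordDelimiterFilter(
                   version, new WhitespaceTokenizer(version, reader), flags, null))
        {
            var term = stream.AddAttribute<ICharTermAttribute>();
            var posInc = stream.AddAttribute<IPositionIncrementAttribute>();

            stream.Reset();                 // must be called before the first IncrementToken()
            int position = -1;
            while (stream.IncrementToken()) // returns false once the stream is exhausted
            {
                // A position increment of 0 means "same position as the previous token".
                position += posInc.PositionIncrement;
                Console.WriteLine(position + ":\"" + term.ToString() + "\"");
            }
            stream.End();                   // finalize end-of-stream state
        }
    }
}
```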