
    Class WordDelimiterFilter

    Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:

    • split on intra-word delimiters (by default, all non alpha-numeric characters): "Wi-Fi" → "Wi", "Fi"
    • split on case transitions: "PowerShot" → "Power", "Shot"
    • split on letter-number transitions: "SD500" → "SD", "500"
    • leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, 'dude'" → "hello", "there", "dude"
    • trailing "'s" are removed for each subword: "O'Neil's" → "O", "Neil"
        Note: this step isn't performed in a separate filter because of possible subword combinations.
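
    The splitting rules above can be sketched in a few lines of Python. This is an illustrative approximation, not the Lucene.NET code: the possessive-"'s" step and the protected-words set are omitted, and only ASCII character classes are handled.

    ```python
    import re

    def split_subwords(token):
        """Approximate the splitting rules: break on intra-word delimiters,
        case transitions, and letter-number transitions."""
        # Splitting on non-alphanumerics also discards leading/trailing delimiters.
        parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
        subwords = []
        for part in parts:
            # "PowerShot" -> ["Power", "Shot"]; "SD500" -> ["SD", "500"]
            subwords.extend(
                re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|[0-9]+", part)
            )
        return subwords

    print(split_subwords("Wi-Fi"))      # ['Wi', 'Fi']
    print(split_subwords("PowerShot"))  # ['Power', 'Shot']
    print(split_subwords("SD500"))      # ['SD', '500']
    ```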

    The combinations parameter affects how subwords are combined:
    • combinations="0" causes no subword combinations:
      "PowerShot"
      → 0:"Power", 1:"Shot" (0 and 1 are the token positions)
    • combinations="1" means that in addition to the subwords, maximum runs of non-numeric subwords are catenated and produced at the same position of the last subword in the run:
        "PowerShot" → 0:"Power", 1:"Shot" 1:"PowerShot""A's+B's&C's" -gt; 0:"A", 1:"B", 2:"C", 2:"ABC""Super-Duper-XL500-42-AutoCoder!" → 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500" 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"

    One use for WordDelimiterFilter is to help match words with different subword delimiters. For example, if the source text contained "wi-fi", one may want "wifi", "WiFi", "wi-fi", and "wi+fi" queries to all match. One way of doing so is to specify combinations="1" in the analyzer used for indexing and combinations="0" (the default) in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that does not do this (such as WhitespaceTokenizer).
    Inheritance
    object
    AttributeSource
    TokenStream
    TokenFilter
    WordDelimiterFilter
    Implements
    IDisposable
    Inherited Members
    TokenFilter.End()
    TokenStream.Dispose()
    AttributeSource.GetAttributeFactory()
    AttributeSource.GetAttributeClassesEnumerator()
    AttributeSource.GetAttributeImplsEnumerator()
    AttributeSource.AddAttributeImpl(Attribute)
    AttributeSource.AddAttribute<T>()
    AttributeSource.HasAttributes
    AttributeSource.HasAttribute<T>()
    AttributeSource.GetAttribute<T>()
    AttributeSource.ClearAttributes()
    AttributeSource.CaptureState()
    AttributeSource.RestoreState(AttributeSource.State)
    AttributeSource.GetHashCode()
    AttributeSource.Equals(object)
    AttributeSource.ReflectAsString(bool)
    AttributeSource.ReflectWith(IAttributeReflector)
    AttributeSource.CloneAttributes()
    AttributeSource.CopyTo(AttributeSource)
    AttributeSource.ToString()
    object.Equals(object, object)
    object.GetType()
    object.ReferenceEquals(object, object)
    Namespace: Lucene.Net.Analysis.Miscellaneous
    Assembly: Lucene.Net.Analysis.Common.dll
    Syntax
    public sealed class WordDelimiterFilter : TokenFilter, IDisposable

    Constructors

    WordDelimiterFilter(LuceneVersion, TokenStream, WordDelimiterFlags, CharArraySet)

    Creates a new WordDelimiterFilter using DEFAULT_WORD_DELIM_TABLE as its charTypeTable

    Declaration
    public WordDelimiterFilter(LuceneVersion matchVersion, TokenStream @in, WordDelimiterFlags configurationFlags, CharArraySet protWords)
    Parameters
    Type Name Description
    LuceneVersion matchVersion

    lucene compatibility version

    TokenStream in

    Lucene.Net.Analysis.TokenStream to be filtered

    WordDelimiterFlags configurationFlags

    Flags configuring the filter

    CharArraySet protWords

    If not null, the set of tokens to protect from being delimited

    WordDelimiterFilter(LuceneVersion, TokenStream, byte[], WordDelimiterFlags, CharArraySet)

    Creates a new WordDelimiterFilter

    Declaration
    public WordDelimiterFilter(LuceneVersion matchVersion, TokenStream @in, byte[] charTypeTable, WordDelimiterFlags configurationFlags, CharArraySet protWords)
    Parameters
    Type Name Description
    LuceneVersion matchVersion

    lucene compatibility version

    TokenStream in

    TokenStream to be filtered

    byte[] charTypeTable

    table containing character types

    WordDelimiterFlags configurationFlags

    Flags configuring the filter

    CharArraySet protWords

    If not null, the set of tokens to protect from being delimited

    Fields

    ALPHA

    Character type flag for alphabetic characters; equal to LOWER | UPPER.
    Declaration
    public const int ALPHA = 3
    Field Value
    Type Description
    int

    ALPHANUM

    Character type flag for alphanumeric characters; equal to ALPHA | DIGIT.
    Declaration
    public const int ALPHANUM = 7
    Field Value
    Type Description
    int

    DIGIT

    Character type flag for digits.
    Declaration
    public const int DIGIT = 4
    Field Value
    Type Description
    int

    LOWER

    Character type flag for lower-case letters.
    Declaration
    public const int LOWER = 1
    Field Value
    Type Description
    int

    SUBWORD_DELIM

    Character type flag for subword delimiter characters.
    Declaration
    public const int SUBWORD_DELIM = 8
    Field Value
    Type Description
    int

    UPPER

    Character type flag for upper-case letters.
    Declaration
    public const int UPPER = 2
    Field Value
    Type Description
    int

    Methods

    IncrementToken()

    Consumers (i.e., Lucene.Net.Index.IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate Lucene.Net.Util.IAttributes with the attributes of the next token.

    The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use Lucene.Net.Util.AttributeSource.CaptureState() to create a copy of the current attribute state.

    This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to Lucene.Net.Util.AttributeSource.AddAttribute<T>() and Lucene.Net.Util.AttributeSource.GetAttribute<T>(), references to all Lucene.Net.Util.IAttributes that this stream uses should be retrieved during instantiation.

    To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in Lucene.Net.Analysis.TokenStream.IncrementToken().
    Declaration
    public override bool IncrementToken()
    Returns
    Type Description
    bool

    false for end of stream; true otherwise

    Overrides
    Lucene.Net.Analysis.TokenStream.IncrementToken()
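
    The consumption contract described above (Reset() before use, then IncrementToken() until it returns false, reading attributes after each successful call) can be modeled with a toy Python stream. Names here loosely mirror Lucene.Net.Analysis.TokenStream; this is a sketch of the contract, not the real API.

    ```python
    class ToyTokenStream:
        """Minimal model of the TokenStream consumer contract."""

        def __init__(self, tokens):
            self._tokens = tokens
            self._pos = -1
            self.term = None  # stands in for the term attribute of the current token

        def reset(self):
            """Return the stream to a clean state before consumption begins."""
            self._pos = -1
            self.term = None

        def increment_token(self):
            """Advance to the next token; False signals end of stream."""
            self._pos += 1
            if self._pos >= len(self._tokens):
                return False
            self.term = self._tokens[self._pos]
            return True

    stream = ToyTokenStream(["Power", "Shot"])
    stream.reset()                   # consumers must reset before consuming
    seen = []
    while stream.increment_token():  # loop until end of stream
        seen.append(stream.term)     # read attributes before the next call mutates them
    print(seen)  # ['Power', 'Shot']
    ```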

    Reset()

    This method is called by a consumer before it begins consumption using Lucene.Net.Analysis.TokenStream.IncrementToken().

    Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.

    If you override this method, always call base.Reset(), otherwise some internal state will not be correctly reset (e.g., Lucene.Net.Analysis.Tokenizer will throw InvalidOperationException on further usage).
    Declaration
    public override void Reset()
    Overrides
    Lucene.Net.Analysis.TokenFilter.Reset()
    Remarks

    NOTE: The default implementation chains the call to the input Lucene.Net.Analysis.TokenStream, so be sure to call base.Reset() when overriding this method.

    Implements

    IDisposable
    Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.