Namespace Lucene.Net.Analysis.Miscellaneous

Miscellaneous TokenStreams

Classes

ASCIIFoldingFilter

This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 128 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if they exist.

Characters from the following Unicode blocks are converted; however, only those characters with reasonable ASCII alternatives are converted:

C1 Controls and Latin-1 Supplement: http://www.unicode.org/charts/PDF/U0080.pdf

Latin Extended-A: http://www.unicode.org/charts/PDF/U0100.pdf

Latin Extended-B: http://www.unicode.org/charts/PDF/U0180.pdf

Latin Extended Additional: http://www.unicode.org/charts/PDF/U1E00.pdf

Latin Extended-C: http://www.unicode.org/charts/PDF/U2C60.pdf

Latin Extended-D: http://www.unicode.org/charts/PDF/UA720.pdf

IPA Extensions: http://www.unicode.org/charts/PDF/U0250.pdf

Phonetic Extensions: http://www.unicode.org/charts/PDF/U1D00.pdf

Phonetic Extensions Supplement: http://www.unicode.org/charts/PDF/U1D80.pdf

General Punctuation: http://www.unicode.org/charts/PDF/U2000.pdf

Superscripts and Subscripts: http://www.unicode.org/charts/PDF/U2070.pdf

Enclosed Alphanumerics: http://www.unicode.org/charts/PDF/U2460.pdf

Dingbats: http://www.unicode.org/charts/PDF/U2700.pdf

Supplemental Punctuation: http://www.unicode.org/charts/PDF/U2E00.pdf

Alphabetic Presentation Forms: http://www.unicode.org/charts/PDF/UFB00.pdf

Halfwidth and Fullwidth Forms: http://www.unicode.org/charts/PDF/UFF00.pdf

See: http://en.wikipedia.org/wiki/Latin_characters_in_Unicode

For example, 'à' will be replaced by 'a'.
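The folding idea can be approximated in a few lines of Python. This is only a sketch: it relies on Unicode decomposition (NFKD) and drops combining marks, whereas the filter is described above as converting specific characters from the listed blocks. The function name `fold_to_ascii` is hypothetical, and characters that have no Unicode decomposition pass through this sketch unchanged.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Approximate ASCII folding (assumption: NFKD + dropping combining marks).

    This is NOT the filter's own mapping; it only illustrates the idea of
    replacing accented characters with their base letters.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark (accent, diaeresis, ...).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_to_ascii("à"))       # a
print(fold_to_ascii("résumé"))  # resume
```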

ASCIIFoldingFilterFactory

Factory for ASCIIFoldingFilter.

<fieldType name="text_ascii" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
  </analyzer>
</fieldType>

CapitalizationFilter

A filter to apply normal capitalization rules to Tokens. It will make the first letter capital and the rest lower case.

This filter is particularly useful to build nice looking facet parameters. This filter is not appropriate if you intend to use a prefix query.

CodepointCountFilter

Removes words that are too long or too short from the stream.

Note: Length is calculated as the number of Unicode codepoints.

CodepointCountFilterFactory

Factory for CodepointCountFilter.

<fieldType name="text_lngth" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.CodepointCountFilterFactory" min="0" max="1" />
  </analyzer>
</fieldType>

HyphenatedWordsFilter

When plain text is extracted from documents, many words are often hyphenated and broken across two lines. This is common in documents with narrow text columns, such as newsletters. In order to increase search efficiency, this filter puts hyphenated words broken across two lines back together. This filter should be used at indexing time only. Example field definition in schema.xml:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"/>
     <filter class="solr.HyphenatedWordsFilterFactory"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
 <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
</fieldtype>
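The rejoining behavior described for the hyphenated-words filter can be sketched as follows. This is an illustration under assumptions, not the filter's implementation: tokens are plain strings, the helper name `join_hyphenated` is hypothetical, and the handling of a trailing fragment (kept with its hyphen) is a choice made for this sketch.

```python
from typing import Iterable, List


def join_hyphenated(tokens: Iterable[str]) -> List[str]:
    """Glue a token ending in '-' onto the token that follows it."""
    joined: List[str] = []
    pending = ""  # fragments of a word that was broken across lines
    for token in tokens:
        if len(token) > 1 and token.endswith("-"):
            # Remember the fragment without its hyphen and wait for the rest.
            pending += token[:-1]
        else:
            joined.append(pending + token)
            pending = ""
    if pending:
        # The stream ended on a fragment; this sketch emits it with the hyphen.
        joined.append(pending + "-")
    return joined


print(join_hyphenated(["ecologi-", "cal", "develop-", "ment", "news"]))
# ['ecological', 'development', 'news']
```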

HyphenatedWordsFilterFactory

Factory for HyphenatedWordsFilter.

<fieldType name="text_hyphn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.HyphenatedWordsFilterFactory"/>
  </analyzer>
</fieldType>

KeepWordFilter

A Lucene.Net.Analysis.TokenFilter that only keeps tokens with text contained in the required words. This filter behaves like the inverse of StopFilter.

@since solr 1.3

KeepWordFilterFactory

Factory for KeepWordFilter.

<fieldType name="text_keepword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
  </analyzer>
</fieldType>

KeywordMarkerFilter

Marks terms as keywords via the KeywordAttribute.

KeywordMarkerFilterFactory

Factory for KeywordMarkerFilter.

<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protectedkeyword.txt" pattern="^.+er$" ignoreCase="false"/>
  </analyzer>
</fieldType>

KeywordRepeatFilter

This TokenFilter emits each incoming token twice: once as a keyword and once as a non-keyword, in other words once with IsKeyword set to true and once with it set to false. This is useful when combined with a stem filter that respects the KeywordAttribute, to index both the stemmed and the unstemmed version of a term into the same field.

KeywordRepeatFilterFactory

Factory for KeywordRepeatFilter.

Because KeywordRepeatFilter emits two tokens for every input token, any tokens that aren't transformed later in the analysis chain will be in the document twice. Therefore, consider adding RemoveDuplicatesTokenFilterFactory later in the analysis chain.
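The interaction between repeating tokens, a keyword-aware stemmer, and duplicate removal can be sketched in Python. Everything here is hypothetical illustration: the `Token` tuple, the toy stemmer that only strips a trailing "ing", and the helper names are assumptions and do not correspond to Lucene.NET types.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    term: str
    is_keyword: bool
    position: int


def keyword_repeat(terms: Iterable[str]) -> List[Token]:
    """Emit each term twice at the same position: keyword, then non-keyword."""
    out: List[Token] = []
    for position, term in enumerate(terms):
        out.append(Token(term, True, position))
        out.append(Token(term, False, position))
    return out


def toy_stem(token: Token) -> Token:
    """Hypothetical stemmer: strips a trailing 'ing' and leaves keywords alone."""
    if token.is_keyword or not token.term.endswith("ing"):
        return token
    return token._replace(term=token.term[:-3])


def remove_duplicates(tokens: Iterable[Token]) -> List[Token]:
    """Drop tokens whose (position, term) pair was already seen."""
    seen = set()
    out: List[Token] = []
    for token in tokens:
        key = (token.position, token.term)
        if key not in seen:
            seen.add(key)
            out.append(token)
    return out


stemmed = [toy_stem(t) for t in keyword_repeat(["running", "woods"])]
result = [(t.position, t.term) for t in remove_duplicates(stemmed)]
print(result)  # [(0, 'running'), (0, 'runn'), (1, 'woods')]
```

"woods" is untouched by the toy stemmer, so its two copies are identical and the second one is removed; "running" keeps both its original and its stemmed form.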

LengthFilter

Removes words that are too long or too short from the stream.

Note: Length is calculated as the number of UTF-16 code units.
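The difference between counting UTF-16 code units (LengthFilter) and counting Unicode codepoints (CodepointCountFilter) only shows up for characters outside the Basic Multilingual Plane. A small Python sketch, with hypothetical helper names, makes the distinction concrete:

```python
def codepoint_length(term: str) -> int:
    """Number of Unicode codepoints (Python strings are codepoint sequences)."""
    return len(term)


def utf16_length(term: str) -> int:
    """Number of UTF-16 code units (two bytes each in UTF-16LE)."""
    return len(term.encode("utf-16-le")) // 2


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, encoded as a surrogate pair in UTF-16

print(codepoint_length("abc"), utf16_length("abc"))  # 3 3
print(codepoint_length(clef), utf16_length(clef))    # 1 2
```

With a maximum length of 1, a token consisting of this single character would pass a codepoint-based check but fail a check based on UTF-16 code units.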

LengthFilterFactory

Factory for LengthFilter.

<fieldType name="text_lngth" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LengthFilterFactory" min="0" max="1" />
  </analyzer>
</fieldType>

LimitTokenCountAnalyzer

This Lucene.Net.Analysis.Analyzer limits the number of tokens while indexing. It is a replacement for the maximum field length setting inside IndexWriter.

LimitTokenCountFilter

This Lucene.Net.Analysis.TokenFilter limits the number of tokens while indexing. It is a replacement for the maximum field length setting inside IndexWriter.

By default, this filter ignores any tokens in the wrapped Lucene.Net.Analysis.TokenStream once the limit has been reached, which can result in Reset() being called prior to IncrementToken() returning false. For most Lucene.Net.Analysis.TokenStream implementations this should be acceptable, and faster than consuming the full stream. If you are wrapping a Lucene.Net.Analysis.TokenStream which requires that the full stream of tokens be exhausted in order to function properly, use the LimitTokenCountFilter(TokenStream, Int32, Boolean) consumeAllTokens option.
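The effect of the consumeAllTokens option can be illustrated with Python generators. This is a sketch under assumptions, not the filter's implementation: `limit_token_count` and `counting_source` are hypothetical names, and the `pulled` list exists only to show how much of the wrapped stream was actually read.

```python
from typing import Iterable, Iterator, List


def counting_source(terms: Iterable[str], pulled: List[str]) -> Iterator[str]:
    """Wrapped stream that records every token the consumer pulls from it."""
    for term in terms:
        pulled.append(term)
        yield term


def limit_token_count(tokens: Iterable[str], max_token_count: int,
                      consume_all_tokens: bool = False) -> Iterator[str]:
    """Yield at most max_token_count tokens from the wrapped stream."""
    iterator = iter(tokens)
    emitted = 0
    while emitted < max_token_count:
        try:
            token = next(iterator)
        except StopIteration:
            return  # wrapped stream ended before the limit was reached
        emitted += 1
        yield token
    if consume_all_tokens:
        # Drain the rest without emitting it, for streams that must be exhausted.
        for _ in iterator:
            pass


pulled_default: List[str] = []
print(list(limit_token_count(counting_source("abcd", pulled_default), 2)))
print(pulled_default)  # ['a', 'b']: the wrapped stream was not exhausted

pulled_all: List[str] = []
print(list(limit_token_count(counting_source("abcd", pulled_all), 2, True)))
print(pulled_all)      # ['a', 'b', 'c', 'd']: the wrapped stream was drained
```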

LimitTokenCountFilterFactory

Factory for LimitTokenCountFilter.

<fieldType name="text_lngthcnt" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10" consumeAllTokens="false" />
  </analyzer>
</fieldType>

The Lucene.Net.Analysis.Miscellaneous.LimitTokenCountFilterFactory.consumeAllTokens property is optional and defaults to false.
See LimitTokenCountFilter for an explanation of its use.

LimitTokenPositionFilter

This Lucene.Net.Analysis.TokenFilter limits its emitted tokens to those with positions that are not greater than the configured limit.

By default, this filter ignores any tokens in the wrapped Lucene.Net.Analysis.TokenStream once the limit has been exceeded, which can result in Reset() being called prior to IncrementToken() returning false. For most Lucene.Net.Analysis.TokenStream implementations this should be acceptable, and faster than consuming the full stream. If you are wrapping a Lucene.Net.Analysis.TokenStream which requires that the full stream of tokens be exhausted in order to function properly, use the LimitTokenPositionFilter(TokenStream, Int32, Boolean) consumeAllTokens option.

LimitTokenPositionFilterFactory

Factory for LimitTokenPositionFilter.

<fieldType name="text_limit_pos" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LimitTokenPositionFilterFactory" maxTokenPosition="3" consumeAllTokens="false" />
  </analyzer>
</fieldType>

The Lucene.Net.Analysis.Miscellaneous.LimitTokenPositionFilterFactory.consumeAllTokens property is optional and defaults to false.
See LimitTokenPositionFilter for an explanation of its use.

Lucene47WordDelimiterFilter

Old, broken version of WordDelimiterFilter.

PatternAnalyzer

Efficient Lucene analyzer/tokenizer that preferably operates on a System.String rather than a System.IO.TextReader, that can flexibly separate text into terms via a regular expression System.Text.RegularExpressions.Regex (with behaviour similar to System.Text.RegularExpressions.Regex.Split(System.String)), and that combines the functionality of LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, StopFilter into a single efficient multi-purpose class.

If you are unsure what exactly a regular expression should look like, consider prototyping by simply trying various expressions on some test texts via System.Text.RegularExpressions.Regex.Split(System.String). Once you are satisfied, give that regex to PatternAnalyzer. Also see Regular Expression Tutorial.

This class can be considerably faster than the "normal" Lucene tokenizers. It can also serve as a building block in a compound Lucene.Net.Analysis.TokenFilter chain, as in this stemming example:

PatternAnalyzer pat = ...
TokenStream tokenStream = new SnowballFilter(
    pat.GetTokenStream("content", "James is running round in the woods"),
    "English");

PatternKeywordMarkerFilter

Marks terms as keywords via the KeywordAttribute. Each token that matches the provided pattern is marked as a keyword by setting IsKeyword to true.

PerFieldAnalyzerWrapper

This analyzer is used to facilitate scenarios where different fields require different analysis techniques. Use the dictionary argument in PerFieldAnalyzerWrapper(Analyzer, IDictionary<String, Analyzer>) to add non-default analyzers for fields.

Example usage:

IDictionary<string, Analyzer> analyzerPerField = new Dictionary<string, Analyzer>();
analyzerPerField["firstname"] = new KeywordAnalyzer();
analyzerPerField["lastname"] = new KeywordAnalyzer();

PerFieldAnalyzerWrapper aWrapper =
  new PerFieldAnalyzerWrapper(new StandardAnalyzer(version), analyzerPerField);

In this example, StandardAnalyzer will be used for all fields except "firstname" and "lastname", for which KeywordAnalyzer will be used.

A PerFieldAnalyzerWrapper can be used like any other analyzer, for both indexing and query parsing.

PrefixAndSuffixAwareTokenFilter

Links two PrefixAwareTokenFilter instances.

NOTE: This filter might not behave correctly if used with custom IAttributes, i.e. IAttributes other than the ones located in Lucene.Net.Analysis.TokenAttributes.

PrefixAwareTokenFilter

Joins two token streams and leaves the last token of the first stream available to be used when updating the token values in the second stream based on that token.

The default implementation adds the end offset of the last prefix token to the start and end offsets of each suffix token.

NOTE: This filter might not behave correctly if used with custom IAttributes, i.e. IAttributes other than the ones located in Lucene.Net.Analysis.TokenAttributes.

RemoveDuplicatesTokenFilter

A Lucene.Net.Analysis.TokenFilter which filters out Lucene.Net.Analysis.Tokens that have the same position and term text as the previous token in the stream.

RemoveDuplicatesTokenFilterFactory

Factory for RemoveDuplicatesTokenFilter.

<fieldType name="text_rmdup" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

ScandinavianFoldingFilter

This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against the use of double vowels aa, ae, ao, oe and oo, leaving just the first one.

It is a semantically more destructive solution than ScandinavianNormalizationFilter, but can in addition help with matching raksmorgas as räksmörgås.

blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj
räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas == raksmorgas

Background: Swedish åäö are in fact the same letters as Norwegian and Danish åæø and thus interchangeable when used between these languages. They are however folded differently when people type them on a keyboard lacking these characters.

In that situation almost all Swedish people use a, a, o instead of å, ä, ö.

Norwegians and Danes on the other hand usually type aa, ae and oe instead of å, æ and ø. Some do however use a, a, o, oo, ao and sometimes permutations of everything above.

This filter solves that mismatch problem, but might also cause new ones.
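The folding rules described above can be sketched in Python to check the listed equivalences. This is an illustration written from the description, not the filter's source: the function name `scandinavian_fold` is hypothetical, and the choice to apply the double-vowel rule only to a literal a or o (not to a character that was just folded) is an assumption of this sketch.

```python
# Single-character folds taken from the description: åÅäæÄÆ -> a, öÖøØ -> o.
FOLD = {
    "å": "a", "ä": "a", "æ": "a", "Å": "A", "Ä": "A", "Æ": "A",
    "ö": "o", "ø": "o", "Ö": "O", "Ø": "O",
}


def scandinavian_fold(term: str) -> str:
    """Fold Scandinavian characters and collapse aa, ae, ao, oe, oo."""
    chars = list(term)
    i = 0
    while i < len(chars):
        c = chars[i]
        if c in FOLD:
            chars[i] = FOLD[c]
        elif i + 1 < len(chars):
            following = chars[i + 1].lower()
            if c in ("a", "A") and following in ("a", "e", "o"):
                del chars[i + 1]  # aa, ae, ao -> a
            elif c in ("o", "O") and following in ("e", "o"):
                del chars[i + 1]  # oe, oo -> o
        i += 1
    return "".join(chars)


for word in ["blåbærsyltetøj", "blåbärsyltetöj", "blaabaarsyltetoej"]:
    print(scandinavian_fold(word))  # blabarsyltetoj each time

for word in ["räksmörgås", "ræksmørgås", "ræksmörgaos", "raeksmoergaas"]:
    print(scandinavian_fold(word))  # raksmorgas each time
```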

ScandinavianFoldingFilterFactory

Factory for ScandinavianFoldingFilter.

<fieldType name="text_scandfold" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ScandinavianFoldingFilterFactory"/>
  </analyzer>
</fieldType>

ScandinavianNormalizationFilter

This filter normalizes the use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.

It's a semantically less destructive solution than ScandinavianFoldingFilter, most useful when a person with a Norwegian or Danish keyboard queries a Swedish index and vice versa. This filter does not perform the common Swedish folds of å and ä to a, nor ö to o.

blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej but not blabarsyltetoj
räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas but not raksmorgas

ScandinavianNormalizationFilterFactory

Factory for ScandinavianNormalizationFilter.

<fieldType name="text_scandnorm" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ScandinavianNormalizationFilterFactory"/>
  </analyzer>
</fieldType>

SetKeywordMarkerFilter

Marks terms as keywords via the KeywordAttribute. Each token contained in the provided set is marked as a keyword by setting IsKeyword to true.

SingleTokenTokenStream

A Lucene.Net.Analysis.TokenStream containing a single token.

StemmerOverrideFilter

Provides the ability to override any KeywordAttribute aware stemmer with custom dictionary-based stemming.

StemmerOverrideFilter.Builder

This builder builds an FST for the StemmerOverrideFilter

StemmerOverrideFilter.StemmerOverrideMap

A read-only 4-byte FST backed map that allows fast case-insensitive key value lookups for StemmerOverrideFilter

StemmerOverrideFilterFactory

Factory for StemmerOverrideFilter.

<fieldType name="text_dicstem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="dictionary.txt" ignoreCase="false"/>
  </analyzer>
</fieldType>

TrimFilter

Trims leading and trailing whitespace from Tokens in the stream.

As of Lucene 4.4, this filter does not support updateOffsets=true anymore as it can lead to broken token streams.

TrimFilterFactory

Factory for TrimFilter.

<fieldType name="text_trm" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldType>

TruncateTokenFilter

A token filter for truncating terms to a specific length. Fixed prefix truncation, as a stemming method, produces good results for the Turkish language. It is reported that F5, which uses the first 5 characters, produced the best results in "Information Retrieval on Turkish Texts".

TruncateTokenFilterFactory

Factory for TruncateTokenFilter. The following type is recommended for "diacritics-insensitive search" for Turkish.

<fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

WordDelimiterFilter

Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:

split on intra-word delimiters (by default, all non alpha-numeric characters): "Wi-Fi" → "Wi", "Fi"
split on case transitions: "PowerShot" → "Power", "Shot"
split on letter-number transitions: "SD500" → "SD", "500"
leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, 'dude'" → "hello", "there", "dude"
trailing "'s" are removed for each subword: "O'Neil's" → "O", "Neil"
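The splitting rules listed above can be approximated with a few regular expressions. This is a rough sketch under assumptions, not the filter's algorithm: it handles ASCII letters and digits only, treats only lower-to-upper changes as case transitions, and the function name `split_subwords` is hypothetical.

```python
import re
from typing import List

# Zero-width boundaries: lower->upper case change, letter->digit, digit->letter.
_BOUNDARY = re.compile(
    r"(?<=[a-z])(?=[A-Z])|(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])"
)


def split_subwords(text: str) -> List[str]:
    """Split text into subwords following the rules listed above (ASCII only)."""
    # Remove a trailing possessive 's that is not followed by a letter or digit.
    text = re.sub(r"'[sS](?![A-Za-z0-9])", "", text)
    # Treat every run of non-alphanumeric characters as a delimiter.
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    # Insert a split point at each case or letter/number transition.
    text = _BOUNDARY.sub(" ", text)
    return text.split()


print(split_subwords("Wi-Fi"))                    # ['Wi', 'Fi']
print(split_subwords("PowerShot"))                # ['Power', 'Shot']
print(split_subwords("SD500"))                    # ['SD', '500']
print(split_subwords("//hello---there, 'dude'"))  # ['hello', 'there', 'dude']
print(split_subwords("O'Neil's"))                 # ['O', 'Neil']
```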

The combinations parameter affects how subwords are combined:

combinations="0" causes no subword combinations: "PowerShot" → 0:"Power", 1:"Shot" (0 and 1 are the token positions)
combinations="1" means that in addition to the subwords, maximum runs of non-numeric subwords are catenated and produced at the same position as the last subword in the run: "PowerShot" → 0:"Power", 1:"Shot", 1:"PowerShot"

One use for WordDelimiterFilter is to help match words with different subword delimiters. For example, if the source text contained "wi-fi", one may want the queries "wifi", "WiFi", "wi-fi", and "wi+fi" to all match. One way of doing so is to specify combinations="1" in the analyzer used for indexing, and combinations="0" (the default) in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that does not do this (such as WhitespaceTokenizer).

WordDelimiterFilterFactory

Factory for WordDelimiterFilter.

<fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" protected="protectedword.txt"
            preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            generateWordParts="1" generateNumberParts="1" stemEnglishPossessive="1"
            types="wdfftypes.txt" />
  </analyzer>
</fieldType>

WordDelimiterIterator

A BreakIterator-like API for iterating over subwords in text, according to WordDelimiterFilter rules.

This is a Lucene.NET INTERNAL API, use at your own risk

Enums

WordDelimiterFlags

Configuration options for the WordDelimiterFilter.

LUCENENET specific - these options were passed as int constant flags in Lucene.