Namespace Lucene.Net.Analysis.Miscellaneous
Miscellaneous TokenStreams
Classes
ASCIIFoldingFilter
This class converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, where such equivalents exist.
Characters from the following Unicode blocks are converted; however, only those characters with reasonable ASCII alternatives are converted:
- C1 Controls and Latin-1 Supplement: http://www.unicode.org/charts/PDF/U0080.pdf
- Latin Extended-A: http://www.unicode.org/charts/PDF/U0100.pdf
- Latin Extended-B: http://www.unicode.org/charts/PDF/U0180.pdf
- Latin Extended Additional: http://www.unicode.org/charts/PDF/U1E00.pdf
- Latin Extended-C: http://www.unicode.org/charts/PDF/U2C60.pdf
- Latin Extended-D: http://www.unicode.org/charts/PDF/UA720.pdf
- IPA Extensions: http://www.unicode.org/charts/PDF/U0250.pdf
- Phonetic Extensions: http://www.unicode.org/charts/PDF/U1D00.pdf
- Phonetic Extensions Supplement: http://www.unicode.org/charts/PDF/U1D80.pdf
- General Punctuation: http://www.unicode.org/charts/PDF/U2000.pdf
- Superscripts and Subscripts: http://www.unicode.org/charts/PDF/U2070.pdf
- Enclosed Alphanumerics: http://www.unicode.org/charts/PDF/U2460.pdf
- Dingbats: http://www.unicode.org/charts/PDF/U2700.pdf
- Supplemental Punctuation: http://www.unicode.org/charts/PDF/U2E00.pdf
- Alphabetic Presentation Forms: http://www.unicode.org/charts/PDF/UFB00.pdf
- Halfwidth and Fullwidth Forms: http://www.unicode.org/charts/PDF/UFF00.pdf
See: http://en.wikipedia.org/wiki/Latin_characters_in_Unicode
For example, 'à' will be replaced by 'a'.
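The idea of folding can be illustrated with a rough, language-neutral sketch in Python. This is not the mapping table that ASCIIFoldingFilter uses; it is an approximation that assumes Unicode NFKD decomposition followed by removal of combining marks, and the function name fold_to_ascii is hypothetical. Characters without a decomposition (for example 'ø') are left unchanged by this sketch, whereas the filter itself may still map them.

```python
import unicodedata


def fold_to_ascii(text):
    """Approximate ASCII folding (assumption: NFKD + dropping combining marks).

    This is only an illustration of the idea, not the filter's real mapping.
    """
    # Decompose characters such as 'à' into a base letter plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(fold_to_ascii("à"))     # accented letter from the Latin-1 Supplement block
    print(fold_to_ascii("\ufb01"))  # 'fi' ligature from Alphabetic Presentation Forms
```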
ASCIIFoldingFilterFactory
Factory for ASCIIFolding
<fieldType name="text_ascii" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
</analyzer>
</fieldType>
CapitalizationFilter
A filter to apply normal capitalization rules to Tokens. It will make the first letter capital and the rest lower case.
This filter is particularly useful to build nice looking facet parameters. This filter is not appropriate if you intend to use a prefix query.
CapitalizationFilterFactory
Factory for Capitalization
The factory takes parameters:
"onlyFirstWord" - should only the first word be capitalized, or all of the words?
"keep" - a list of words to keep unchanged, separated by whitespace.
"keepIgnoreCase" - true or false. If true, the keep list will be considered case-insensitive.
"forceFirstLetter" - force the first letter to be capitalized even if it is in the keep list.
"okPrefix" - do not change word capitalization if a word begins with something in this list. For example, if "McK" is on the okPrefix list, the word "McKinley" should not be changed to "Mckinley".
"minWordLength" - how long the word needs to be to get capitalization applied. If the minWordLength is 3, "and" becomes "And" but "or" stays "or".
"maxWordCount" - if the token contains more than maxWordCount words, the capitalization is assumed to be correct.
"culture" - the culture to use to apply the capitalization rules. If not supplied or the string "invariant" is supplied, the invariant culture is used.
<fieldType name="text_cptlztn" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.CapitalizationFilterFactory" onlyFirstWord="true"
keep="java solr lucene" keepIgnoreCase="false"
okPrefix="McK McD McA"/>
</analyzer>
</fieldType>
@since solr 1.3
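The interaction of the "keep", "okPrefix" and "minWordLength" parameters can be sketched for a single word as follows. This is a simplified Python illustration and not the filter's implementation; the function name capitalize_word and its keyword arguments are hypothetical, and parameters such as "onlyFirstWord", "forceFirstLetter", "maxWordCount" and "culture" are deliberately left out.

```python
def capitalize_word(word, keep=frozenset(), ok_prefixes=(), min_word_length=0):
    """Capitalize one word, honouring simplified keep/okPrefix/minWordLength rules."""
    # Words shorter than the minimum length are left untouched.
    if len(word) < min_word_length:
        return word
    # Words on the keep list are left untouched.
    if word in keep:
        return word
    # Words starting with an accepted prefix keep their existing capitalization.
    if any(word.startswith(prefix) for prefix in ok_prefixes):
        return word
    # Otherwise: first letter upper case, the rest lower case.
    return word[:1].upper() + word[1:].lower()


if __name__ == "__main__":
    for w in ("and", "or", "McKinley"):
        print(w, "->", capitalize_word(w, ok_prefixes=("McK",), min_word_length=3))
```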
CodepointCountFilter
Removes words that are too long or too short from the stream.
Note: Length is calculated as the number of Unicode codepoints.
CodepointCountFilterFactory
Factory for Codepoint
<fieldType name="text_lngth" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.CodepointCountFilterFactory" min="0" max="1" />
</analyzer>
</fieldType>
EmptyTokenStream
An always exhausted token stream.
HyphenatedWordsFilter
When plain text is extracted from documents, many words are often hyphenated and broken across two lines. This is common in documents that use narrow text columns, such as newsletters. In order to increase search efficiency, this filter puts hyphenated words that were broken across two lines back together. This filter should be used at index time only. Example field definition in schema.xml:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldtype>
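The joining step described for HyphenatedWordsFilter can be sketched on a plain list of tokens. This is a simplified Python illustration under the assumption that a token ending in a hyphen is merged with the token that follows it; the function name join_hyphenated is hypothetical, and offsets and other token attributes are ignored.

```python
def join_hyphenated(tokens):
    """Merge tokens that end with '-' with the token that follows them."""
    joined = []
    pending = ""  # accumulated word fragments waiting for their continuation
    for token in tokens:
        if len(token) > 1 and token.endswith("-"):
            # Drop the trailing hyphen and wait for the next token.
            pending += token[:-1]
        elif pending:
            # Complete the hyphenated word.
            joined.append(pending + token)
            pending = ""
        else:
            joined.append(token)
    if pending:
        # Nothing followed the last fragment, so restore its hyphen.
        joined.append(pending + "-")
    return joined


if __name__ == "__main__":
    print(join_hyphenated(["ecologi-", "cal", "develop-", "ment", "plain"]))
```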
HyphenatedWordsFilterFactory
Factory for Hyphenated
<fieldType name="text_hyphn" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
</analyzer>
</fieldType>
KeepWordFilter
A Lucene.
@since solr 1.3
KeepWordFilterFactory
Factory for Keep
<fieldType name="text_keepword" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false"/>
</analyzer>
</fieldType>
KeywordMarkerFilter
Marks terms as keywords via the Keyword
KeywordMarkerFilterFactory
Factory for Keyword
<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protectedkeyword.txt" pattern="^.+er$" ignoreCase="false"/>
</analyzer>
</fieldType>
KeywordRepeatFilter
This TokenFilter emits each incoming token twice, once as a keyword and once as a non-keyword; in other words, once with the keyword flag set to true and once with it set to false.
This is useful if used with a stem filter that respects the Keyword
KeywordRepeatFilterFactory
Factory for Keyword
Since Keyword
LengthFilter
Removes words that are too long or too short from the stream.
Note: Length is calculated as the number of UTF-16 code units.
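The difference between counting UTF-16 code units (LengthFilter) and counting Unicode code points (CodepointCountFilter) only shows up for characters outside the Basic Multilingual Plane. A small Python sketch, with hypothetical helper names, makes the distinction concrete. Python strings are indexed by code point, so the UTF-16 length is derived here by encoding the string.

```python
def codepoint_length(term):
    """Number of Unicode code points in the term."""
    return len(term)


def utf16_length(term):
    """Number of UTF-16 code units in the term (two bytes per unit)."""
    return len(term.encode("utf-16-le")) // 2


if __name__ == "__main__":
    term = "a\U0001D11E"  # 'a' followed by MUSICAL SYMBOL G CLEF (U+1D11E)
    print(codepoint_length(term), utf16_length(term))
```

A term such as the one above therefore passes a maximum length of 2 when measured in code points, but not when measured in UTF-16 code units.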
LengthFilterFactory
Factory for Length
<fieldType name="text_lngth" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="0" max="1" />
</analyzer>
</fieldType>
LimitTokenCountAnalyzer
This Lucene.
LimitTokenCountFilter
This Lucene.
By default, this filter ignores any tokens in the wrapped Lucene.false
. For most
Lucene.
LimitTokenCountFilterFactory
Factory for Limit
<fieldType name="text_lngthcnt" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10" consumeAllTokens="false" />
</analyzer>
</fieldType>
The Lucene.false
.
See Limit
LimitTokenPositionFilter
This Lucene.
By default, this filter ignores any tokens in the wrapped Lucene.false
. For most
Lucene.
LimitTokenPositionFilterFactory
Factory for Limit
<fieldType name="text_limit_pos" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LimitTokenPositionFilterFactory" maxTokenPosition="3" consumeAllTokens="false" />
</analyzer>
</fieldType>
The Lucene.false
.
See Limit
Lucene47WordDelimiterFilter
Old Broken version of Word
PatternAnalyzer
Efficient Lucene analyzer/tokenizer that preferably operates on a System.
If you are unsure exactly what a regular expression should look like, consider
prototyping by simply trying various expressions on some test texts via
System.
This class can be considerably faster than the "normal" Lucene tokenizers.
It can also serve as a building block in a compound Lucene
Lucene.
PatternAnalyzer pat = ...
TokenStream tokenStream = new SnowballFilter(
    pat.GetTokenStream("content", "James is running round in the woods"),
    "English");
PatternKeywordMarkerFilter
Marks terms as keywords via the Keywordtrue
.
PerFieldAnalyzerWrapper
This analyzer is used to facilitate scenarios where different fields require different analysis techniques. Use the dictionary argument in PerFieldAnalyzerWrapper(Analyzer, IDictionary<String, Analyzer>) to add non-default analyzers for fields.
Example usage:
IDictionary<string, Analyzer> analyzerPerField = new Dictionary<string, Analyzer>();
analyzerPerField["firstname"] = new KeywordAnalyzer();
analyzerPerField["lastname"] = new KeywordAnalyzer();
PerFieldAnalyzerWrapper aWrapper =
new PerFieldAnalyzerWrapper(new StandardAnalyzer(version), analyzerPerField);
In this example, Standard
A Per
PrefixAndSuffixAwareTokenFilter
Links two Prefix
NOTE: This filter might not behave correctly if used with custom IAttributes, i.e. IAttributes other than the ones located in Lucene.Net.Analysis.TokenAttributes.
PrefixAwareTokenFilter
Joins two token streams and leaves the last token of the first stream available to be used when updating the token values in the second stream based on that token.
The default implementation adds last prefix token end offset to the suffix token start and end offsets.
NOTE: This filter might not behave correctly if used with custom IAttributes, i.e. IAttributes other than the ones located in Lucene.Net.Analysis.TokenAttributes.
RemoveDuplicatesTokenFilter
A Lucene.
RemoveDuplicatesTokenFilterFactory
Factory for Remove
<fieldType name="text_rmdup" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
ScandinavianFoldingFilter
This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against the use of double vowels aa, ae, ao, oe and oo, leaving just the first one.
It is a semantically more destructive solution than Scandinavian
blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj
räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas == raksmorgas
Background: Swedish åäö are in fact the same letters as Norwegian and Danish åæø and thus interchangeable when used between these languages. They are however folded differently when people type them on a keyboard lacking these characters.
In that situation almost all Swedish people use a, a, o instead of å, ä, ö.
Norwegians and Danes on the other hand usually type aa, ae and oe instead of å, æ and ø. Some do however use a, a, o, oo, ao and sometimes permutations of everything above.
This filter solves that mismatch problem, but might also cause new ones.
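The folding rules above can be sketched in a few lines of Python. This is an illustration of the described behaviour, not the Lucene.Net implementation; the function name scandinavian_fold is hypothetical, and preserving the case of the folded letter is an assumption of this sketch.

```python
def scandinavian_fold(term):
    """Fold å/ä/æ to a, ö/ø to o, and collapse aa, ae, ao, oe, oo to the first vowel."""
    out = []
    i = 0
    while i < len(term):
        c = term[i]
        if c in "åäæ":
            out.append("a")
        elif c in "ÅÄÆ":
            out.append("A")  # assumption: case is preserved
        elif c in "öø":
            out.append("o")
        elif c in "ÖØ":
            out.append("O")  # assumption: case is preserved
        else:
            out.append(c)
            nxt = term[i + 1] if i + 1 < len(term) else ""
            # Skip the second vowel of a folded double-vowel spelling.
            if c in "aA" and nxt and nxt in "aAeEoO":
                i += 1
            elif c in "oO" and nxt and nxt in "eEoO":
                i += 1
        i += 1
    return "".join(out)


if __name__ == "__main__":
    for word in ("blåbærsyltetøj", "blåbärsyltetöj", "blaabaarsyltetoej"):
        print(word, "->", scandinavian_fold(word))
```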
ScandinavianFoldingFilterFactory
Factory for Scandinavian
<fieldType name="text_scandfold" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ScandinavianFoldingFilterFactory"/>
</analyzer>
</fieldType>
ScandinavianNormalizationFilter
This filter normalizes the use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
It's a semantically less destructive solution than Scandinavian
blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej but not blabarsyltetoj
räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas but not raksmorgas
ScandinavianNormalizationFilterFactory
Factory for Scandinavian
<fieldType name="text_scandnorm" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ScandinavianNormalizationFilterFactory"/>
</analyzer>
</fieldType>
SetKeywordMarkerFilter
Marks terms as keywords via the Keywordtrue
.
SingleTokenTokenStream
A Lucene.
StemmerOverrideFilter
Provides the ability to override any Keyword
StemmerOverrideFilter.Builder
This builder builds an FST for the Stemmer
StemmerOverrideFilter.StemmerOverrideMap
A read-only 4-byte FST backed map that allows fast case-insensitive key value lookups for Stemmer
StemmerOverrideFilterFactory
Factory for Stemmer
<fieldType name="text_dicstem" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="dictionary.txt" ignoreCase="false"/>
</analyzer>
</fieldType>
TrimFilter
Trims leading and trailing whitespace from Tokens in the stream.
As of Lucene 4.4, this filter does not support updateOffsets=true anymore as it can lead to broken token streams.
TrimFilterFactory
Factory for Trim
<fieldType name="text_trm" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
</fieldType>
TruncateTokenFilter
A token filter that truncates terms to a specified length. Fixed prefix truncation, as a stemming method, produces good results for Turkish. It is reported that F5, which uses the first 5 characters, produced the best results in "Information Retrieval on Turkish Texts".
TruncateTokenFilterFactory
Factory for Truncate
<fieldType name="text_tr_ascii_f5" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ApostropheFilterFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.TruncateTokenFilterFactory" prefixLength="5"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
WordDelimiterFilter
Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:
- split on intra-word delimiters (by default, all non-alphanumeric characters): "Wi-Fi" → "Wi", "Fi"
- split on case transitions: "PowerShot" → "Power", "Shot"
- split on letter-number transitions: "SD500" → "SD", "500"
- leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, 'dude'" → "hello", "there", "dude"
- trailing "'s" are removed for each subword: "O'Neil's" → "O", "Neil"
  Note: this step isn't performed in a separate filter because of possible subword combinations.
The combinations parameter affects how subwords are combined:
- combinations="0" causes no subword combinations:
  "PowerShot" → 0:"Power", 1:"Shot" (0 and 1 are the token positions)
- combinations="1" means that in addition to the subwords, maximum runs of non-numeric subwords are catenated and produced at the same position of the last subword in the run:
  "PowerShot" → 0:"Power", 1:"Shot", 1:"PowerShot"
  "A's+B's&C's" → 0:"A", 1:"B", 2:"C", 2:"ABC"
  "Super-Duper-XL500-42-AutoCoder!" → 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
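The subword splitting rules listed earlier (delimiters, case transitions, letter-number transitions and trailing "'s") can be sketched with regular expressions. This Python illustration is restricted to ASCII letters and digits, treats only lower-to-upper changes as case transitions, and does not model token positions or the combinations parameter; those are simplifying assumptions, and the function name split_subwords is hypothetical.

```python
import re

# Trailing possessive 's at the end of a subword (not followed by a letter or digit).
_POSSESSIVE = re.compile(r"'[sS](?![A-Za-z0-9])")
# Runs of intra-word delimiters: anything that is not an ASCII letter or digit.
_DELIMITERS = re.compile(r"[^A-Za-z0-9]+")
# Zero-width boundaries: lower-to-upper case, letter-to-digit, digit-to-letter.
_TRANSITIONS = re.compile(
    r"(?<=[a-z])(?=[A-Z])"
    r"|(?<=[A-Za-z])(?=[0-9])"
    r"|(?<=[0-9])(?=[A-Za-z])"
)


def split_subwords(text):
    """Split text into subwords using simplified word-delimiter rules."""
    text = _POSSESSIVE.sub("", text)
    subwords = []
    for chunk in _DELIMITERS.split(text):
        if chunk:
            # Mark each transition with a space, then split on whitespace.
            subwords.extend(_TRANSITIONS.sub(" ", chunk).split())
    return subwords


if __name__ == "__main__":
    for sample in ("Wi-Fi", "PowerShot", "SD500", "O'Neil's"):
        print(sample, "->", split_subwords(sample))
```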
One use for Word
WordDelimiterFilterFactory
Factory for Word
<fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" protected="protectedword.txt"
preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="1"
catenateWords="0" catenateNumbers="0" catenateAll="0"
generateWordParts="1" generateNumberParts="1" stemEnglishPossessive="1"
types="wdfftypes.txt" />
</analyzer>
</fieldType>
WordDelimiterIterator
A BreakIterator-like API for iterating over subwords in text, according to Word
Enums
WordDelimiterFlags
Configuration options for the Word
LUCENENET specific - these options were passed as int constant flags in Lucene.