Namespace Lucene.Net.Analysis.Core
Basic, general-purpose analysis components.
Classes
KeywordAnalyzer
"Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.
KeywordTokenizer
Emits the entire input as a single token.
KeywordTokenizerFactory
Factory for KeywordTokenizer.
<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
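A minimal C# sketch of the same behaviour in code, assuming the Lucene.NET 4.8 APIs; the field name "id" and the input string are placeholders:

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.TokenAttributes;

// Print every token the analyzer produces for the given text.
var analyzer = new KeywordAnalyzer();
TokenStream ts = analyzer.GetTokenStream("id", new StringReader("ABC-123 DEF"));
var term = ts.AddAttribute<ICharTermAttribute>();
ts.Reset();
while (ts.IncrementToken())
    Console.WriteLine(term.ToString());   // prints the single token "ABC-123 DEF"
ts.End();
ts.Dispose();

The same Reset/IncrementToken/End/Dispose loop applies to every token stream on this page, so the later sketches only show how each chain is built.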
LetterTokenizer
A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters.
Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.
You must specify the required LuceneVersion compatibility when creating LetterTokenizer:
- As of 3.1, CharTokenizer uses an int-based API to normalize and detect token characters. See IsTokenChar(Int32) and Normalize(Int32) for details.
LetterTokenizerFactory
Factory for LetterTokenizer.
<fieldType name="text_letter" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>
</fieldType>
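In code (a sketch under the same Lucene.NET 4.8 assumptions, with an illustrative input string), LetterTokenizer keeps only maximal runs of letters:

using System.IO;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Util;

// The apostrophe and the digits are not letters, so they end the current token.
var tok = new LetterTokenizer(LuceneVersion.LUCENE_48, new StringReader("don't ship v2.0"));
// Consuming tok as shown above yields "don", "t", "ship", "v".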
LowerCaseFilter
Normalizes token text to lower case.
You must specify the required LuceneVersion compatibility when creating LowerCaseFilter:
- As of 3.1, supplementary characters are properly lowercased.
LowerCaseFilterFactory
Factory for LowerCaseFilter.
<fieldType name="text_lwrcase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
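The chain configured above, written out in C# (a sketch assuming Lucene.NET 4.8):

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Util;

// Split on whitespace first, then lowercase each token.
TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, new StringReader("The Quick BROWN Fox"));
ts = new LowerCaseFilter(LuceneVersion.LUCENE_48, ts);
// Consuming ts yields "the", "quick", "brown", "fox".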
LowerCaseTokenizer
LowerCaseTokenizer performs the function of LetterTokenizer and LowerCaseFilter together. It divides text at non-letters and converts the tokens to lower case.
Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.
You must specify the required LuceneVersion compatibility when creating LowerCaseTokenizer:
- As of 3.1, CharTokenizer uses an int-based API to normalize and detect token characters. See IsTokenChar(Int32) and Normalize(Int32) for details.
LowerCaseTokenizerFactory
Factory for LowerCaseTokenizer.
<fieldType name="text_lwrcase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
</fieldType>
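As a sketch (assuming Lucene.NET 4.8, with an illustrative input), the single-pass tokenizer and the two-stage chain below produce identical tokens:

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Util;

// Divides at non-letters and lowercases in one pass: "Foo-Bar" -> "foo", "bar".
TokenStream onePass = new LowerCaseTokenizer(LuceneVersion.LUCENE_48, new StringReader("Foo-Bar"));
// Same tokens as LetterTokenizer followed by LowerCaseFilter.
TokenStream twoStage = new LowerCaseFilter(LuceneVersion.LUCENE_48,
    new LetterTokenizer(LuceneVersion.LUCENE_48, new StringReader("Foo-Bar")));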
SimpleAnalyzer
An Analyzer that filters LetterTokenizer with LowerCaseFilter.
You must specify the required LuceneVersion compatibility when creating SimpleAnalyzer:
- As of 3.1, LowerCaseTokenizer uses an int-based API to normalize and detect token codepoints. See IsTokenChar(Int32) and Normalize(Int32) for details.
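A short sketch of SimpleAnalyzer in use, assuming Lucene.NET 4.8; the field name "body" and the input text are illustrative:

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Util;

var analyzer = new SimpleAnalyzer(LuceneVersion.LUCENE_48);
// Behaves like LowerCaseTokenizer: "Visit example.com 24/7" becomes "visit", "example", "com".
TokenStream ts = analyzer.GetTokenStream("body", new StringReader("Visit example.com 24/7"));
// Consume and dispose ts with the Reset/IncrementToken loop shown for KeywordAnalyzer.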
StopAnalyzer
Filters LetterTokenizer with LowerCaseFilter and StopFilter.
You must specify the required LuceneVersion compatibility when creating StopAnalyzer:
- As of 3.1, StopFilter correctly handles Unicode 4.0 supplementary characters in stopwords
- As of 2.9, position increments are preserved
StopFilter
Removes stop words from a token stream.
You must specify the required LuceneVersion compatibility when creating StopFilter:
- As of 3.1, StopFilter correctly handles Unicode 4.0 supplementary characters in stopwords and position increments are preserved
StopFilterFactory
Factory for StopFilter.
<fieldType name="text_stop" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" format="wordset" />
</analyzer>
</fieldType>
All attributes are optional:
- ignoreCase: defaults to false.
- words: the name of a stopwords file to parse; if not specified, the factory will use ENGLISH_STOP_WORDS_SET.
- format: defines how the words file will be parsed, and defaults to wordset. If words is not specified, then format must not be specified.
The valid values for the format option are:
- wordset: the default format, which supports one word per line (including any intra-word whitespace) and allows whole-line comments beginning with the "#" character. Blank lines are ignored. See GetLines(Stream, Encoding) for details.
- snowball: allows multiple words on each line, and trailing comments may be specified using the vertical line ("|"). Blank lines are ignored. See GetSnowballWordSet for details.
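The programmatic counterpart to the factory configuration above, as a sketch assuming Lucene.NET 4.8; the three stopwords are illustrative rather than the full ENGLISH_STOP_WORDS_SET, and MakeStopSet is the helper that mirrors the Java API of the same name:

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Build a small stop set, then lowercase and filter a whitespace-tokenized stream.
CharArraySet stops = StopFilter.MakeStopSet(LuceneVersion.LUCENE_48, "the", "a", "of");
TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, new StringReader("the end of a story"));
ts = new LowerCaseFilter(LuceneVersion.LUCENE_48, ts);
ts = new StopFilter(LuceneVersion.LUCENE_48, ts, stops);
// Consuming ts yields "end" and "story"; position increments record the removed words.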
TypeTokenFilter
Removes tokens whose types appear in a set of blocked types from a token stream.
TypeTokenFilterFactory
Factory class for TypeTokenFilter.
<fieldType name="chars" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt"
useWhitelist="false"/>
</analyzer>
</fieldType>
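A sketch of the same idea in code, assuming Lucene.NET 4.8 and that TypeTokenFilter accepts the blocked-type set the way the Lucene 4.8 Java constructor does; StandardTokenizer is used here because it assigns token types such as "<ALPHANUM>" and "<NUM>":

using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// Block numeric tokens by their type rather than by their text.
TokenStream ts = new StandardTokenizer(LuceneVersion.LUCENE_48, new StringReader("call 555 1234 now"));
ts = new TypeTokenFilter(LuceneVersion.LUCENE_48, ts, new HashSet<string> { "<NUM>" });
// Consuming ts yields "call" and "now".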
UpperCaseFilter
Normalizes token text to UPPER CASE.
You must specify the required LuceneVersion compatibility when creating UpperCaseFilter.
NOTE: In Unicode, this transformation may lose information when the upper case character represents more than one lower case character. Use this filter when you require uppercase tokens. Use the LowerCaseFilter for general search matching.
UpperCaseFilterFactory
Factory for UpperCaseFilter.
<fieldType name="text_uppercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.UpperCaseFilterFactory"/>
</analyzer>
</fieldType>
NOTE: In Unicode, this transformation may lose information when the upper case character represents more than one lower case character. Use this filter when you require uppercase tokens. Use the LowerCaseFilterFactory for general search matching.
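In C# (a sketch assuming Lucene.NET 4.8, with an illustrative input), the configuration above corresponds to:

using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Util;

// Split on whitespace, then uppercase each token.
TokenStream ts = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, new StringReader("us Gaap filings"));
ts = new UpperCaseFilter(LuceneVersion.LUCENE_48, ts);
// Consuming ts yields "US", "GAAP", "FILINGS".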
WhitespaceAnalyzer
An Analyzer that uses WhitespaceTokenizer.
You must specify the required LuceneVersion compatibility when creating WhitespaceAnalyzer:
- As of 3.1, WhitespaceTokenizer uses an int-based API to normalize and detect token codepoints. See IsTokenChar(Int32) and Normalize(Int32) for details.
WhitespaceTokenizer
A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-whitespace characters form tokens.
You must specify the required LuceneVersion compatibility when creating WhitespaceTokenizer:
- As of 3.1, CharTokenizer uses an int-based API to normalize and detect token characters. See IsTokenChar(Int32) and Normalize(Int32) for details.
WhitespaceTokenizerFactory
Factory for WhitespaceTokenizer.
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
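Finally, a sketch (assuming Lucene.NET 4.8, with an illustrative input) showing that whitespace tokenization leaves case and punctuation untouched; WhitespaceAnalyzer simply wraps this tokenizer:

using System.IO;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Util;

// Only whitespace ends a token, so punctuation stays attached and case is preserved.
var tok = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, new StringReader("Hello, World!"));
// Consuming tok yields "Hello," and "World!".
var analyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);   // same tokens via the Analyzer API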