Namespace Lucene.Net.Analysis.Ja
Kuromoji is a morphological analyzer for Japanese text.
This module provides support for Japanese text analysis, including features such as part-of-speech tagging, lemmatization, and compound word analysis.
For an introduction to Lucene's analysis API, see the Lucene.Net.Analysis package documentation.
Classes
GraphvizFormatter
Outputs the DOT (Graphviz) string for the Viterbi lattice.
JapaneseAnalyzer
Analyzer for Japanese that uses morphological analysis.
JapaneseBaseFormFilter
Replaces the term text with the base form held in IBaseFormAttribute.
This acts as a lemmatizer for verbs and adjectives. To prevent terms from being stemmed, place an instance of SetKeywordMarkerFilter, or a custom TokenFilter that sets the IKeywordAttribute, before this TokenStream.
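The behaviour described above can be illustrated with a small language-neutral sketch. This is not the Lucene.NET implementation: `MorphToken`, its fields, and `base_form_filter` are hypothetical names invented for illustration, and the sketch assumes that a token without a base form is passed through unchanged.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class MorphToken:
    """Hypothetical stand-in for a token carrying term, base form and keyword flag."""
    term: str
    base_form: Optional[str] = None  # None when the token has no separate base form
    keyword: bool = False            # True when an upstream filter marked it as a keyword


def base_form_filter(tokens: Iterable[MorphToken]) -> Iterator[MorphToken]:
    """Replace each term with its base form unless the token is a keyword."""
    for token in tokens:
        if not token.keyword and token.base_form is not None:
            yield replace(token, term=token.base_form)
        else:
            # Keywords and tokens without a base form are left untouched.
            yield token


tokens = [
    MorphToken("食べ", base_form="食べる"),
    MorphToken("た"),
    MorphToken("走っ", base_form="走る", keyword=True),
]
terms = [t.term for t in base_form_filter(tokens)]
```

Here the first token is lemmatized, the second has nothing to replace, and the third keeps its surface form because it was marked as a keyword earlier in the chain.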
JapaneseBaseFormFilterFactory
Factory for JapaneseBaseFormFilter.
<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory"/>
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
  </analyzer>
</fieldType>
JapaneseIterationMarkCharFilter
Normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.
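To show what "expanded form" means, here is a deliberately simplified sketch that handles only the kanji iteration mark 々 (U+3005). It is not the filter's actual code: the function names are hypothetical, kana iteration marks are ignored, and the rule that a run of n marks copies the n characters before it is an assumption made for illustration.

```python
KANJI_ITERATION_MARK = "\u3005"  # 々


def is_kanji(ch: str) -> bool:
    """Rough check: CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"


def expand_kanji_iteration_marks(text: str) -> str:
    """Replace each run of 々 with the kanji that the run repeats."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != KANJI_ITERATION_MARK:
            out.append(text[i])
            i += 1
            continue
        # Measure the run of consecutive iteration marks starting at i.
        j = i
        while j < len(text) and text[j] == KANJI_ITERATION_MARK:
            j += 1
        span = j - i
        for k in range(i, j):
            src = k - span  # position of the character this mark repeats
            if src >= 0 and is_kanji(text[src]):
                out.append(text[src])
            else:
                out.append(text[k])  # nothing sensible to copy; keep the mark
        i = j
    return "".join(out)


expanded_single = expand_kanji_iteration_marks("時々")        # "時時"
expanded_double = expand_kanji_iteration_marks("馬鹿々々しい")  # "馬鹿馬鹿しい"
```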
JapaneseIterationMarkCharFilterFactory
Factory for JapaneseIterationMarkCharFilter.
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <charFilter class="solr.JapaneseIterationMarkCharFilterFactory" normalizeKanji="true" normalizeKana="true"/>
    <tokenizer class="solr.JapaneseTokenizerFactory"/>
  </analyzer>
</fieldType>
JapaneseKatakanaStemFilter
A TokenFilter that normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing that character. Only katakana words of at least the minimum length (four by default) are stemmed.
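The stemming rule can be sketched in a few lines. This is an illustration of the described behaviour, not the library's code: the function names are hypothetical, "katakana" is approximated as the Unicode Katakana block (U+30A0 to U+30FF), and the length threshold is assumed to be inclusive, so a term of exactly `minimum_length` characters is stemmed.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー


def is_katakana(term: str) -> bool:
    """True when every character lies in the Unicode Katakana block."""
    return bool(term) and all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term: str, minimum_length: int = 4) -> str:
    """Drop one trailing long sound mark from sufficiently long katakana terms."""
    if len(term) < minimum_length:
        return term  # too short: leave as-is (assumed inclusive threshold)
    if not is_katakana(term):
        return term  # mixed-script or non-katakana terms are not touched
    if term.endswith(PROLONGED_SOUND_MARK):
        return term[:-1]
    return term


long_word = stem_katakana("コンピューター")  # "コンピュータ"
short_word = stem_katakana("コピー")         # "コピー" (3 characters, below the minimum)
```

The minimum length guards short words such as コピー, where the trailing mark is part of the word rather than a spelling variation.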
JapaneseKatakanaStemFilterFactory
Factory for JapaneseKatakanaStemFilter.
<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory"/>
    <filter class="solr.JapaneseKatakanaStemFilterFactory"
            minimumLength="4"/>
  </analyzer>
</fieldType>
JapanesePartOfSpeechStopFilter
Removes tokens that match a set of part-of-speech tags.
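As a rough illustration, the filter behaves like the sketch below. The function name and the (term, tag) tuple representation are hypothetical, the tags are sample values chosen for the example, and exact string matching against the stop set is an assumption.

```python
from typing import Iterable, Iterator, Set, Tuple

TaggedToken = Tuple[str, str]  # (term, part-of-speech tag)


def pos_stop_filter(tokens: Iterable[TaggedToken],
                    stop_tags: Set[str]) -> Iterator[TaggedToken]:
    """Yield only the tokens whose part-of-speech tag is not in stop_tags."""
    for term, tag in tokens:
        if tag not in stop_tags:  # assumed: exact match on the full tag string
            yield term, tag


sentence = [
    ("私", "名詞-代名詞-一般"),
    ("は", "助詞-係助詞"),
    ("学生", "名詞-一般"),
    ("です", "助動詞"),
]
kept = [term for term, _ in pos_stop_filter(sentence, {"助詞-係助詞", "助動詞"})]
```

With particles and auxiliary verbs in the stop set, only the two nouns remain in `kept`.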
JapanesePartOfSpeechStopFilterFactory
Factory for JapanesePartOfSpeechStopFilter.
<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory"/>
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory"
            tags="stopTags.txt"
            enablePositionIncrements="true"/>
  </analyzer>
</fieldType>
JapaneseReadingFormFilter
A TokenFilter that replaces the term attribute with the reading of a token in either katakana or romaji form. The default reading form is katakana.
JapaneseReadingFormFilterFactory
Factory for JapaneseReadingFormFilter.
<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory"/>
    <filter class="solr.JapaneseReadingFormFilterFactory"
            useRomaji="false"/>
  </analyzer>
</fieldType>
JapaneseTokenizer
Tokenizer for Japanese that uses morphological analysis.
JapaneseTokenizerFactory
Factory for JapaneseTokenizer.
<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory"
               mode="NORMAL"
               userDictionary="user.txt"
               userDictionaryEncoding="UTF-8"
               discardPunctuation="true"
    />
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
  </analyzer>
</fieldType>
Token
Analyzed token with morphological data from its dictionary.
Enums
JapaneseTokenizerMode
Tokenization mode: this determines how the tokenizer handles compound and unknown words.
JapaneseTokenizerType
Token type reflecting the original source of this token.