Namespace Lucene.Net.Analysis.Ja

Kuromoji is a morphological analyzer for Japanese text.

This module provides support for Japanese text analysis, including features such as part-of-speech tagging, lemmatization, and compound word analysis.

For an introduction to Lucene's analysis API, see the Lucene.Net.Analysis namespace documentation.

Classes

GraphvizFormatter

Outputs the DOT (Graphviz) string for the Viterbi lattice.

JapaneseAnalyzer

Analyzer for Japanese that uses morphological analysis.
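
As a rough orientation, the analyzer's default behavior can be approximated with the factories listed in this namespace, written in the same Solr-style configuration as the other examples on this page. This is a sketch under assumptions rather than the analyzer's definition: the filter order follows the commonly documented default chain, and the resource paths lang/stoptags_ja.txt and lang/stopwords_ja.txt are placeholder file names that must exist in your own configuration.

```xml
<!-- Sketch only: approximates the default JapaneseAnalyzer chain.
     File paths are assumed placeholders, not shipped defaults. -->
<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <!-- Morphological tokenization; SEARCH mode additionally splits compounds -->
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="SEARCH"/>
    <!-- Reduce inflected verbs and adjectives to their base form -->
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <!-- Drop tokens whose part-of-speech tag is listed in the file -->
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
    <!-- Fold full-width Latin and half-width katakana variants -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Remove stop words listed in the file -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
    <!-- Strip a trailing long sound mark from katakana words -->
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <!-- Lowercase any Latin characters -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Spelling the chain out this way is mainly useful when one stage needs to be changed or removed; if the defaults are acceptable, using JapaneseAnalyzer directly is simpler.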

JapaneseBaseFormFilter

Replaces the term text with the token's base form, as exposed by IBaseFormAttribute.

This acts as a lemmatizer for verbs and adjectives. To prevent terms from being stemmed, use an instance of SetKeywordMarkerFilter or a custom TokenFilter that sets the IKeywordAttribute before this TokenStream.

JapaneseBaseFormFilterFactory

Factory for JapaneseBaseFormFilter.

<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory"/>
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
  </analyzer>
</fieldType>

JapaneseIterationMarkCharFilter

Normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.

JapaneseIterationMarkCharFilterFactory

Factory for JapaneseIterationMarkCharFilter.

<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <charFilter class="solr.JapaneseIterationMarkCharFilterFactory" normalizeKanji="true" normalizeKana="true"/>
    <tokenizer class="solr.JapaneseTokenizerFactory"/>
  </analyzer>
</fieldType>

JapaneseKatakanaStemFilter

A TokenFilter that normalizes common katakana spelling variations ending in a long sound character (U+30FC) by removing that character. Only katakana words that reach a minimum length are stemmed (the default is four).

JapaneseKatakanaStemFilterFactory

Factory for JapaneseKatakanaStemFilter.

<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory"/>
    <filter class="solr.JapaneseKatakanaStemFilterFactory"
            minimumLength="4"/>
  </analyzer>
</fieldType>

JapanesePartOfSpeechStopFilter

Removes tokens that match a set of part-of-speech tags.

JapanesePartOfSpeechStopFilterFactory

Factory for JapanesePartOfSpeechStopFilter.

<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory"/>
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory"
            tags="stopTags.txt"
            enablePositionIncrements="true"/>
  </analyzer>
</fieldType>

JapaneseReadingFormFilter

A TokenFilter that replaces the term attribute with the reading of a token in either katakana or romaji form. The default reading form is katakana.

JapaneseReadingFormFilterFactory

Factory for JapaneseReadingFormFilter.

<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory"/>
    <filter class="solr.JapaneseReadingFormFilterFactory"
            useRomaji="false"/>
  </analyzer>
</fieldType>

JapaneseTokenizer

Tokenizer for Japanese that uses morphological analysis.

JapaneseTokenizerFactory

Factory for JapaneseTokenizer.

<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory"
      mode="NORMAL"
      userDictionary="user.txt"
      userDictionaryEncoding="UTF-8"
      discardPunctuation="true"
    />
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
  </analyzer>
</fieldType>

Token

Analyzed token with morphological data from its dictionary.

Enums

JapaneseTokenizerMode

Tokenization mode: this determines how the tokenizer handles compound and unknown words.

JapaneseTokenizerType

Token type reflecting the original source of this token.