
    Namespace Lucene.Net.Analysis.Cjk

    Analyzer for Chinese, Japanese, and Korean, which indexes bigrams. This analyzer generates bigram terms, which are overlapping groups of two adjacent Han, Hiragana, Katakana, or Hangul characters.

    Three analyzers are provided for Chinese, each of which treats Chinese text in a different way:

    • ChineseAnalyzer (in the analyzers/cn package): indexes unigrams (individual Chinese characters) as tokens.
    • CJKAnalyzer (in this package): indexes bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
    • SmartChineseAnalyzer (in the analyzers/smartcn package): indexes words (attempting to segment Chinese text into words) as tokens.

    Example phrase: "我是中国人"

    1. ChineseAnalyzer: 我-是-中-国-人
    2. CJKAnalyzer: 我是-是中-中国-国人
    3. SmartChineseAnalyzer: 我-是-中国-人
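    The unigram and bigram schemes above are easy to sketch. The following Python snippet is purely illustrative (it is not part of the library) and reproduces the first two tokenizations of the example phrase:

```python
def unigrams(text):
    # ChineseAnalyzer-style: one token per character
    return list(text)

def bigrams(text):
    # CJKAnalyzer-style: overlapping pairs of adjacent characters
    return [text[i:i + 2] for i in range(len(text) - 1)]

phrase = "我是中国人"
print(unigrams(phrase))  # ['我', '是', '中', '国', '人']
print(bigrams(phrase))   # ['我是', '是中', '中国', '国人']
```

    Word-level segmentation (the SmartChineseAnalyzer output) requires a statistical model and cannot be reproduced by simple slicing.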

    Classes

    CJKAnalyzer

    A Lucene.Net.Analysis.Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK characters with CJKBigramFilter, and filters stopwords with StopFilter.

    CJKBigramFilter

    Forms bigrams of CJK terms that are generated from StandardTokenizer or ICUTokenizer.

    CJK types are set by these tokenizers, but you can also use CJKBigramFilter(TokenStream, CJKScript) to explicitly control which of the CJK scripts are turned into bigrams.

    By default, when a CJK character has no adjacent characters to form a bigram, it is output in unigram form. If you want to always output both unigrams and bigrams, set the outputUnigrams flag in CJKBigramFilter(TokenStream, CJKScript, Boolean). This can be used for a combined unigram+bigram approach.

    In all cases, all non-CJK input is passed through unmodified.
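    The behavior described above can be modeled roughly as follows. This is an illustrative Python sketch under simplifying assumptions, not the filter's actual implementation — the real filter operates on attribute-based token streams, and its combined unigram+bigram output interleaves positions rather than grouping them as this sketch does:

```python
def is_cjk(ch):
    # Rough check for the Han, Hiragana, Katakana, and Hangul ranges
    return any(lo <= ord(ch) <= hi for lo, hi in [
        (0x4E00, 0x9FFF),   # CJK Unified Ideographs (Han)
        (0x3040, 0x309F),   # Hiragana
        (0x30A0, 0x30FF),   # Katakana
        (0xAC00, 0xD7AF),   # Hangul syllables
    ])

def cjk_bigrams(tokens, output_unigrams=False):
    # tokens: single CJK characters or non-CJK words, roughly what
    # StandardTokenizer emits
    out, i = [], 0
    while i < len(tokens):
        if not is_cjk(tokens[i][0]):
            out.append(tokens[i])    # non-CJK passes through unmodified
            i += 1
            continue
        j = i                        # collect a run of adjacent CJK chars
        while j < len(tokens) and is_cjk(tokens[j][0]):
            j += 1
        run = tokens[i:j]
        if len(run) == 1 or output_unigrams:
            out.extend(run)          # unigram fallback / combined mode
        out.extend(run[k] + run[k + 1] for k in range(len(run) - 1))
        i = j
    return out

print(cjk_bigrams(["java", "我", "是", "中", "国", "人"]))
# → ['java', '我是', '是中', '中国', '国人']
```

    Note how a lone CJK character (a run of length 1) falls back to unigram output, while output_unigrams=True emits both forms for every run.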

    CJKBigramFilterFactory

    Factory for CJKBigramFilter.

    <fieldType name="text_cjk" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.CJKBigramFilterFactory" 
          han="true" hiragana="true" 
          katakana="true" hangul="true" outputUnigrams="false" />
      </analyzer>
    </fieldType>

    CJKTokenizer

    CJKTokenizer is designed for Chinese, Japanese, and Korean languages.


    The tokens returned are overlapping pairs of adjacent characters.

    Example: "java C1C2C3C4" will be segmented to: "java" "C1C2" "C2C3" "C3C4".

    Additionally, the following is applied to Latin text (such as English):
    • Text is converted to lowercase.
    • Numeric digits, '+', '#', and '_' are tokenized as letters.
    • Full-width forms are converted to half-width forms.
    For more information on text segmentation for Asian languages (Chinese, Japanese, and Korean), consult the literature on CJK word segmentation.
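    A rough sketch of the tokenizer's output (illustrative Python, not the actual implementation; the character classes here are simplified to Han ideographs plus the ASCII letter-like class the docs describe):

```python
import re

def cjk_tokenize(text):
    # Latin-ish runs (letters, digits, '+', '#', '_') become one
    # lowercased token; CJK runs become overlapping bigram tokens.
    tokens = []
    for match in re.finditer(r"[A-Za-z0-9+#_]+|[\u4e00-\u9fff]+", text):
        run = match.group()
        if run[0].isascii():
            tokens.append(run.lower())       # lowercase Latin text
        elif len(run) == 1:
            tokens.append(run)               # lone CJK char: unigram
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

print(cjk_tokenize("Java 我是中国人"))
# → ['java', '我是', '是中', '中国', '国人']
```

    This mirrors the "java C1C2C3C4" example above; full-width-to-half-width conversion is omitted here for brevity.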

    CJKTokenizerFactory

    Factory for CJKTokenizer.

    <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.CJKTokenizerFactory"/>
      </analyzer>
    </fieldType>

    CJKWidthFilter

    A Lucene.Net.Analysis.TokenFilter that normalizes CJK width differences:

    • Folds fullwidth ASCII variants into the equivalent Basic Latin characters
    • Folds halfwidth Katakana variants into the equivalent fullwidth Kana

    NOTE: this filter can be viewed as a (practical) subset of NFKC/NFKD Unicode normalization. See the normalization support in the ICU package for full normalization.
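    Since the filter is described as a practical subset of NFKC, its folding can be approximated with Unicode normalization. A quick Python illustration (an approximation only — NFKC also applies normalizations, such as ligature and superscript folding, that this filter does not):

```python
import unicodedata

def fold_width(text):
    # NFKC maps fullwidth ASCII to Basic Latin and halfwidth
    # Katakana to fullwidth Katakana, among other foldings
    return unicodedata.normalize("NFKC", text)

print(fold_width("Ｌｕｃｅｎｅ"))  # fullwidth ASCII → 'Lucene'
print(fold_width("ｶﾀｶﾅ"))         # halfwidth Katakana → 'カタカナ'
```

    For full normalization, use the ICU package as noted above.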

    CJKWidthFilterFactory

    Factory for CJKWidthFilter.

    <fieldType name="text_cjk" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.CJKBigramFilterFactory"/>
      </analyzer>
    </fieldType>

    Enums

    CJKScript

    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)