    Namespace Lucene.Net.Analysis.Cn

    Analyzer for Chinese, which indexes unigrams (individual Chinese characters).

    Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.

    • StandardAnalyzer: Index unigrams (individual Chinese characters) as a token.
    • CJKAnalyzer (in the analyzers/cjk package): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
    • SmartChineseAnalyzer (in the analyzers/smartcn package): Index words (attempt to segment Chinese text into words) as tokens.

    Example phrase: "我是中国人"

    1. StandardAnalyzer: 我-是-中-国-人
    2. CJKAnalyzer: 我是-是中-中国-国人
    3. SmartChineseAnalyzer: 我-是-中国-人
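
    The unigram and bigram outputs for the example phrase can be reproduced with a few lines of plain Python. This is only an illustrative sketch of the two splitting strategies, not the Lucene.Net implementation; the function names unigrams and bigrams are hypothetical, and word segmentation as done by SmartChineseAnalyzer is not attempted because it requires a dictionary or statistical model.

```python
def unigrams(text):
    """Return every character of the text as its own token."""
    return list(text)


def bigrams(text):
    """Return overlapping pairs of adjacent characters.

    Assumption for this sketch: a text shorter than two characters
    is returned as a single token (or no token if it is empty).
    """
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


phrase = "我是中国人"
print("-".join(unigrams(phrase)))  # 我-是-中-国-人
print("-".join(bigrams(phrase)))   # 我是-是中-中国-国人
```

    For a run of n characters, the unigram strategy yields n tokens and the bigram strategy yields n - 1 overlapping tokens.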

    Classes

    ChineseAnalyzer

    A Lucene.Net.Analysis.Analyzer that tokenizes text with ChineseTokenizer and filters with ChineseFilter.

    ChineseFilter

    A Lucene.Net.Analysis.TokenFilter with a stop word table.

    • Numeric tokens are removed.
    • English tokens must be longer than one character.
    • Each Chinese character is treated as one Chinese word.
    TO DO:
    1. Add Chinese stop words, such as \ue400
    2. Dictionary based Chinese word extraction
    3. Intelligent Chinese word extraction
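
    The filtering rules listed above can be sketched as a small Python function. This is a rough illustration under stated assumptions, not the Lucene.Net code: the name chinese_filter, the contents of STOP_WORDS, the case-insensitive stop word lookup, and the use of ASCII-only patterns for "numeric" and "English" tokens are all hypothetical choices made for the example.

```python
import re

# Hypothetical stop word table; the real table is not listed on this page.
STOP_WORDS = {"and", "the", "of"}

NUMERIC = re.compile(r"[0-9]+")     # assumption: "numeric" means ASCII digits only
ENGLISH = re.compile(r"[A-Za-z]+")  # assumption: "English" means ASCII letters only


def chinese_filter(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that pass the three rules described above."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # listed in the stop word table
        if NUMERIC.fullmatch(token):
            continue  # numeric tokens are removed
        if ENGLISH.fullmatch(token) and len(token) <= 1:
            continue  # English tokens must be longer than one character
        kept.append(token)  # includes single Chinese characters
    return kept


print(chinese_filter(["the", "a", "42", "Lucene", "中", "国"]))
# ['Lucene', '中', '国']
```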

    ChineseFilterFactory

    Factory for ChineseFilter

    ChineseTokenizer

    Tokenize Chinese text as individual Chinese characters.

    ChineseTokenizer and CJKTokenizer differ in how they parse text into tokens.

    For example, if the Chinese text "C1C2C3C4" is to be indexed:

    • The tokens returned from ChineseTokenizer are C1, C2, C3, C4.
    • The tokens returned from the CJKTokenizer are C1C2, C2C3, C3C4.

    Therefore the index created by CJKTokenizer is much larger.

    However, searches for C1, C1C2, C1C3, C4C2, C1C2C3 ... work with ChineseTokenizer but do not work with CJKTokenizer.
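
    The search behavior can be illustrated with a simplified model in which a query matches when every term produced from it is present in the index. This is a sketch under that assumption only; real query matching also involves positions and scoring. The names unigram_terms, bigram_terms, and all_terms_indexed are hypothetical, and characters are represented as the symbolic strings "C1" to "C4".

```python
def unigram_terms(chars):
    """One term per character."""
    return list(chars)


def bigram_terms(chars):
    """One term per overlapping pair of adjacent characters.

    Assumption for this sketch: a single isolated character is
    emitted as a term on its own.
    """
    if len(chars) < 2:
        return list(chars)
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def all_terms_indexed(query, index_terms, tokenize):
    """True if every term produced from the query exists in the index."""
    return all(term in index_terms for term in tokenize(query))


document = ["C1", "C2", "C3", "C4"]
unigram_index = set(unigram_terms(document))  # C1, C2, C3, C4
bigram_index = set(bigram_terms(document))    # C1C2, C2C3, C3C4

queries = [["C1"], ["C1", "C2"], ["C1", "C3"], ["C4", "C2"], ["C1", "C2", "C3"]]
for query in queries:
    uni = all_terms_indexed(query, unigram_index, unigram_terms)
    bi = all_terms_indexed(query, bigram_index, bigram_terms)
    print("".join(query), "unigram:", uni, "bigram:", bi)
```

    In this simplified model, every query finds its terms in the unigram index. In the bigram index, the queries C1, C1C3, and C4C2 find no matching terms, while the contiguous queries C1C2 and C1C2C3 do.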

    ChineseTokenizerFactory

    Factory for ChineseTokenizer

    Copyright © 2020 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.