Namespace Lucene.Net.Analysis.Cn.Smart
Analyzer for Simplified Chinese, which indexes words.
Note
This API is experimental and might change in incompatible ways in the next release.
Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.
StandardAnalyzer: Index unigrams (individual Chinese characters) as a token.
CJKAnalyzer (in the Lucene.Net.Analysis.Cjk namespace of Lucene.Net.Analysis.Common): Index bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
SmartChineseAnalyzer (in this package): Index words (attempt to segment Chinese text into words) as tokens.
Example phrase: "我是中国人"
StandardAnalyzer: 我-是-中-国-人
CJKAnalyzer: 我是-是中-中国-国人
SmartChineseAnalyzer: 我-是-中国-人
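The word-level segmentation shown above can be reproduced with a few lines of code. The following is a sketch, assuming the Lucene.Net 4.8 APIs (SmartChineseAnalyzer from the Lucene.Net.Analysis.SmartCn package); the field name "content" is arbitrary:

```csharp
using System;
using Lucene.Net.Analysis.Cn.Smart;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Print the word-level tokens SmartChineseAnalyzer produces
// for the example phrase.
var analyzer = new SmartChineseAnalyzer(LuceneVersion.LUCENE_48);
using var stream = analyzer.GetTokenStream("content", "我是中国人");
var term = stream.AddAttribute<ICharTermAttribute>();
stream.Reset();
while (stream.IncrementToken())
{
    // Per the comparison above: 我, 是, 中国, 人
    Console.WriteLine(term.ToString());
}
stream.End();
```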
Classes
AnalyzerProfile
Manages analysis data configuration for SmartChineseAnalyzer
Note
This API is experimental and might change in incompatible ways in the next release.
HMMChineseTokenizer
Tokenizer for Chinese or mixed Chinese-English text.
The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
HMMChineseTokenizerFactory
Factory for HMMChineseTokenizer
Note: this class will currently emit tokens for punctuation. So you should either add a StopFilter after the tokenizer, or use the provided stopword list via words="org/apache/lucene/analysis/cn/smart/stopwords.txt"
Note
This API is experimental and might change in incompatible ways in the next release.
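The stopword list mentioned above is typically wired in through a Solr-style field type, as in the upstream Lucene documentation. A sketch (the field type name "text_hmmchn" is illustrative):

```xml
<fieldType name="text_hmmchn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
  </analyzer>
</fieldType>
```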
SentenceTokenizer
Tokenizes input text into sentences.
The output tokens can then be broken into words with WordTokenFilter
Note
This API is experimental and might change in incompatible ways in the next release.
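A minimal sketch of the sentence-then-word pipeline this class participates in, assuming the Lucene.Net 4.8 APIs (SentenceTokenizer wrapping a TextReader, WordTokenFilter wrapping its output); HMMChineseTokenizer is the newer single-tokenizer alternative to this pair:

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Cn.Smart;
using Lucene.Net.Analysis.TokenAttributes;

// Break text into sentence tokens, then segment each sentence into words.
using var reader = new StringReader("我是中国人。你是谁？");
using TokenStream stream = new WordTokenFilter(new SentenceTokenizer(reader));
var term = stream.AddAttribute<ICharTermAttribute>();
stream.Reset();
while (stream.IncrementToken())
{
    Console.WriteLine(term.ToString());
}
stream.End();
```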
SmartChineseAnalyzer
SmartChineseAnalyzer is an analyzer for Chinese or mixed Chinese-English text. The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
Segmentation is based upon the Hidden Markov Model. A large training corpus was used to calculate Chinese word frequency probability.
This analyzer requires a dictionary to provide statistical data.
SmartChineseAnalyzer has an included dictionary out-of-box.
The included dictionary data is from ICTCLAS1.0. Thanks to ICTCLAS for their hard work, and for contributing the data under the Apache 2 License!
Note
This API is experimental and might change in incompatible ways in the next release.
SmartChineseSentenceTokenizerFactory
Factory for the SmartChineseAnalyzer SentenceTokenizer
Note
This API is experimental and might change in incompatible ways in the next release.
SmartChineseWordTokenFilterFactory
Factory for the SmartChineseAnalyzer WordTokenFilter
Note: this class will currently emit tokens for punctuation. So you should either add a StopFilter after the filter, or use the provided stopword list via words="org/apache/lucene/analysis/cn/smart/stopwords.txt"
Note
This API is experimental and might change in incompatible ways in the next release.
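As with HMMChineseTokenizerFactory above, the upstream Lucene documentation shows this factory and its companion sentence tokenizer factory combined with the bundled stopword list in a Solr-style field type. A sketch (the field type name "text_smartcn" is illustrative):

```xml
<fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
  </analyzer>
</fieldType>
```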
Utility
SmartChineseAnalyzer utility constants and methods
Note
This API is experimental and might change in incompatible ways in the next release.
WordTokenFilter
A TokenFilter that breaks sentences into words.
Note
This API is experimental and might change in incompatible ways in the next release.
Enums
CharType
Internal SmartChineseAnalyzer character type constants.
Note
This API is experimental and might change in incompatible ways in the next release.
WordType
Internal SmartChineseAnalyzer token type constants.
Note
This API is experimental and might change in incompatible ways in the next release.