Namespace Lucene.Net.Analysis.Cn.Smart
Analyzer for Simplified Chinese, which indexes words.
For an introduction to Lucene's analysis API, see the Lucene.Net.Analysis namespace documentation.
Classes
AnalyzerProfile
Manages analysis data configuration for SmartChineseAnalyzer.
SmartChineseAnalyzer has a built-in dictionary and stopword list out-of-box.
NOTE: To use a dictionary other than the built-in one, put the "bigramdict.dct" and "coredict.dct" files in a subdirectory of your application named "analysis-data". This subdirectory can be placed in any directory up to and including the root directory (if the OS permissions allow). To place the files in an alternate location, set an environment variable named "analysis.data.dir" to the directory in which the "bigramdict.dct" and "coredict.dct" files can be found.
The default "bigramdict.dct" and "coredict.dct" files can be found at: https://issues.apache.org/jira/browse/LUCENE-1629.
HMMChineseTokenizer
Tokenizer for Chinese or mixed Chinese-English text.
The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
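For example, a minimal sketch of driving the tokenizer directly (assuming the Lucene.NET 4.8 TokenStream contract of Reset/IncrementToken/End; the sample sentence is arbitrary):

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Cn.Smart;
    using Lucene.Net.Analysis.TokenAttributes;

    // Segment a Simplified Chinese string and print one word per line.
    using var tokenizer = new HMMChineseTokenizer(new StringReader("我喜欢读书"));
    var term = tokenizer.AddAttribute<ICharTermAttribute>();
    tokenizer.Reset();
    while (tokenizer.IncrementToken())
    {
        Console.WriteLine(term.ToString());
    }
    tokenizer.End();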
HMMChineseTokenizerFactory
Factory for HMMChineseTokenizer.
Note: this class will currently emit tokens for punctuation. So you should either add a WordDelimiterFilter after it to remove these (with concatenate off), or use the SmartChinese stoplist to filter them out instead of StopFilter's English default, e.g.:

    <fieldType name="text_chs" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.HMMChineseTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
      </analyzer>
    </fieldType>
SentenceTokenizer
Tokenizes input text into sentences.
The output tokens can then be broken into words with WordTokenFilter.
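For example, a minimal sketch of the two-stage chain (assuming the Lucene.NET 4.8 port, where this pair is deprecated in favor of HMMChineseTokenizer; the sample text is arbitrary):

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Cn.Smart;
    using Lucene.Net.Analysis.TokenAttributes;

    // Split the input into sentences, then segment each sentence into words.
    using var sentences = new SentenceTokenizer(new StringReader("我喜欢读书。我也喜欢写作。"));
    using var words = new WordTokenFilter(sentences);
    var term = words.AddAttribute<ICharTermAttribute>();
    words.Reset();
    while (words.IncrementToken())
    {
        Console.WriteLine(term.ToString());
    }
    words.End();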
SmartChineseAnalyzer
SmartChineseAnalyzer is an analyzer for Chinese or mixed Chinese-English text. The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
Segmentation is based upon the Hidden Markov Model. A large training corpus was used to calculate Chinese word frequency probability.
This analyzer requires a dictionary to provide statistical data.
SmartChineseAnalyzer has an included dictionary out-of-box.
The included dictionary data is from ICTCLAS1.0. Thanks to ICTCLAS for their hard work, and for contributing the data under the Apache 2 License!
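A minimal usage sketch (assuming the Lucene.NET 4.8 API; the field name and sample text are arbitrary):

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Cn.Smart;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Util;

    // Analyze a Simplified Chinese string and print each emitted term.
    // This constructor overload applies the analyzer's default stopword list.
    var analyzer = new SmartChineseAnalyzer(LuceneVersion.LUCENE_48);
    using var stream = analyzer.GetTokenStream("body", new StringReader("我购买了道具和服装。"));
    var term = stream.AddAttribute<ICharTermAttribute>();
    stream.Reset();
    while (stream.IncrementToken())
    {
        Console.WriteLine(term.ToString());
    }
    stream.End();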
SmartChineseSentenceTokenizerFactory
Factory for the SmartChineseAnalyzer SentenceTokenizer.
SmartChineseWordTokenFilterFactory
Factory for the SmartChineseAnalyzer WordTokenFilter.
Note: this class will currently emit tokens for punctuation. So you should either add a WordDelimiterFilter after it to remove these (with concatenate off), or use the SmartChinese stoplist to filter them out instead of StopFilter's English default, e.g.:

    <fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
        <filter class="solr.SmartChineseWordTokenFilterFactory"/>
        <filter class="solr.StopFilterFactory"
                words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
      </analyzer>
    </fieldType>
Utility
SmartChineseAnalyzer utility constants and methods.
WordTokenFilter
A TokenFilter that breaks sentences into words.
Enums
CharType
Internal SmartChineseAnalyzer character type constants.
WordType
Internal SmartChineseAnalyzer token type constants.