Class HMMChineseTokenizer
Tokenizer for Chinese or mixed Chinese-English text.
The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
Inheritance
AttributeSource
TokenStream
Tokenizer
SegmentingTokenizerBase
HMMChineseTokenizer
Implements
IDisposable
Inherited Members
SegmentingTokenizerBase.BUFFERMAX
SegmentingTokenizerBase.m_buffer
SegmentingTokenizerBase.m_offset
SegmentingTokenizerBase.IncrementToken()
SegmentingTokenizerBase.End()
Tokenizer.m_input
TokenStream.Dispose()
AttributeSource.GetAttributeFactory()
AttributeSource.GetAttributeClassesEnumerator()
AttributeSource.GetAttributeImplsEnumerator()
AttributeSource.AddAttributeImpl(Attribute)
AttributeSource.AddAttribute<T>()
AttributeSource.HasAttributes
AttributeSource.HasAttribute<T>()
AttributeSource.GetAttribute<T>()
AttributeSource.ClearAttributes()
AttributeSource.CaptureState()
AttributeSource.RestoreState(AttributeSource.State)
AttributeSource.GetHashCode()
AttributeSource.ReflectWith(IAttributeReflector)
AttributeSource.CloneAttributes()
AttributeSource.CopyTo(AttributeSource)
AttributeSource.ToString()
Namespace: Lucene.Net.Analysis.Cn.Smart
Assembly: Lucene.Net.Analysis.SmartCn.dll
Syntax
public class HMMChineseTokenizer : SegmentingTokenizerBase, IDisposable
Constructors
HMMChineseTokenizer(AttributeFactory, TextReader)
Creates a new HMMChineseTokenizer, supplying the Lucene.Net.Util.AttributeSource.AttributeFactory.
Declaration
public HMMChineseTokenizer(AttributeSource.AttributeFactory factory, TextReader reader)
Parameters
Type | Name | Description |
---|---|---|
AttributeSource.AttributeFactory | factory | |
TextReader | reader | |
HMMChineseTokenizer(TextReader)
Creates a new HMMChineseTokenizer.
Declaration
public HMMChineseTokenizer(TextReader reader)
Parameters
Type | Name | Description |
---|---|---|
TextReader | reader | |
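The constructor and the inherited consumer methods listed above can be combined roughly as follows. This is a minimal sketch, not an official sample: the `Program` class and the input string are illustrative, and it assumes the `ICharTermAttribute` and `IOffsetAttribute` interfaces from `Lucene.Net.Analysis.TokenAttributes` are available in the referenced Lucene.NET version.

```csharp
using System;
using System.IO;
using System.Text;
using Lucene.Net.Analysis.Cn.Smart;
using Lucene.Net.Analysis.TokenAttributes;

public static class Program
{
    public static void Main()
    {
        // Allow Chinese characters to be written to the console.
        Console.OutputEncoding = Encoding.UTF8;

        // Illustrative mixed Chinese-English input.
        using (var tokenizer = new HMMChineseTokenizer(new StringReader("我是中国人。I like Lucene.")))
        {
            // Register the attributes we want to read for each token.
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            IOffsetAttribute offset = tokenizer.AddAttribute<IOffsetAttribute>();

            // Reset() must be called before the first IncrementToken().
            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                Console.WriteLine($"{term} [{offset.StartOffset}-{offset.EndOffset}]");
            }

            // Signal end of stream; the using block then disposes the tokenizer.
            tokenizer.End();
        }
    }
}
```

The two-argument overload is used the same way, with an `AttributeSource.AttributeFactory` instance passed as the first argument.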
Methods
Dispose(bool)
Releases resources used by the HMMChineseTokenizer and, if overridden in a derived class, optionally releases unmanaged resources.
Declaration
protected override void Dispose(bool disposing)
Parameters
Type | Name | Description |
---|---|---|
bool | disposing | |
Overrides
IncrementWord()
Returns true if another word is available.
Declaration
protected override bool IncrementWord()
Returns
Type | Description |
---|---|
bool | |
Overrides
Lucene.Net.Analysis.Util.SegmentingTokenizerBase.IncrementWord()
Reset()
This method is called by a consumer before it begins consumption using Lucene.Net.Analysis.TokenStream.IncrementToken().
Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh. If you override this method, always call base.Reset(); otherwise, some internal state will not be correctly reset (e.g., Lucene.Net.Analysis.Tokenizer will throw InvalidOperationException on further usage).
Declaration
public override void Reset()
Overrides
Lucene.Net.Analysis.Util.SegmentingTokenizerBase.Reset()
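The rule about calling base.Reset() can be illustrated with a small derived class. This is a sketch under stated assumptions: `CountingChineseTokenizer` and its `WordCount` property are hypothetical names invented for illustration and are not part of Lucene.NET. It relies only on the overridable members documented on this page.

```csharp
using System.IO;
using Lucene.Net.Analysis.Cn.Smart;

// Hypothetical subclass that counts the words emitted since the last Reset().
public class CountingChineseTokenizer : HMMChineseTokenizer
{
    private int wordCount;

    public CountingChineseTokenizer(TextReader reader)
        : base(reader)
    {
    }

    // Number of words produced since the stream was last reset.
    public int WordCount => wordCount;

    protected override bool IncrementWord()
    {
        // Delegate segmentation to the base class, then update our own state.
        bool hasWord = base.IncrementWord();
        if (hasWord)
        {
            wordCount++;
        }
        return hasWord;
    }

    public override void Reset()
    {
        // Always call the base implementation first so the tokenizer's
        // internal state is reset correctly.
        base.Reset();

        // Then clear the state this subclass added.
        wordCount = 0;
    }
}
```

Keeping the subclass state reset next to the base call means a reused instance starts each pass with a zero count, matching the "as if created fresh" requirement.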
SetNextSentence(int, int)
Provides the next input sentence for analysis.
Declaration
protected override void SetNextSentence(int sentenceStart, int sentenceEnd)
Parameters
Type | Name | Description |
---|---|---|
int | sentenceStart | |
int | sentenceEnd | |