Class HMMChineseTokenizer

Tokenizer for Chinese or mixed Chinese-English text.

The tokenizer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, and each sentence is then segmented into words.

Inheritance

object

AttributeSource

TokenStream

Tokenizer

SegmentingTokenizerBase

HMMChineseTokenizer

Implements

IDisposable

Inherited Members

SegmentingTokenizerBase.BUFFERMAX

SegmentingTokenizerBase.m_buffer

SegmentingTokenizerBase.m_offset

SegmentingTokenizerBase.IncrementToken()

SegmentingTokenizerBase.End()

SegmentingTokenizerBase.IsSafeEnd(char)

Tokenizer.m_input

Tokenizer.CorrectOffset(int)

Tokenizer.SetReader(TextReader)

TokenStream.Dispose()

AttributeSource.GetAttributeFactory()

AttributeSource.GetAttributeClassesEnumerator()

AttributeSource.GetAttributeImplsEnumerator()

AttributeSource.AddAttributeImpl(Attribute)

AttributeSource.AddAttribute<T>()

AttributeSource.HasAttributes

AttributeSource.HasAttribute<T>()

AttributeSource.GetAttribute<T>()

AttributeSource.ClearAttributes()

AttributeSource.CaptureState()

AttributeSource.RestoreState(AttributeSource.State)

AttributeSource.GetHashCode()

AttributeSource.Equals(object)

AttributeSource.ReflectAsString(bool)

AttributeSource.ReflectWith(IAttributeReflector)

AttributeSource.CloneAttributes()

AttributeSource.CopyTo(AttributeSource)

AttributeSource.ToString()

object.Equals(object, object)

object.GetType()

object.MemberwiseClone()

object.ReferenceEquals(object, object)

Namespace: Lucene.Net.Analysis.Cn.Smart

Assembly: Lucene.Net.Analysis.SmartCn.dll

Syntax

public class HMMChineseTokenizer : SegmentingTokenizerBase, IDisposable

Constructors

HMMChineseTokenizer(AttributeFactory, TextReader)

Creates a new HMMChineseTokenizer that uses the supplied Lucene.Net.Util.AttributeSource.AttributeFactory to create its attributes.

Declaration

public HMMChineseTokenizer(AttributeSource.AttributeFactory factory, TextReader reader)

Parameters

Type	Name	Description
AttributeSource.AttributeFactory	factory	The factory used to create attribute instances for this tokenizer.
TextReader	reader	The input text to tokenize.

HMMChineseTokenizer(TextReader)

Creates a new HMMChineseTokenizer that reads its input from the given reader.

Declaration

public HMMChineseTokenizer(TextReader reader)

Parameters

Type	Name	Description
TextReader	reader	The input text to tokenize.

Methods

Dispose(bool)

Releases the resources used by the HMMChineseTokenizer and, if overridden in a derived class, optionally releases unmanaged resources.

Declaration

protected override void Dispose(bool disposing)

Parameters

Type	Name	Description
bool	disposing	`true` to release both managed and unmanaged resources; `false` to release only unmanaged resources.

Overrides

Tokenizer.Dispose(bool)

IncrementWord()

Returns true if another word is available.

Declaration

protected override bool IncrementWord()

Returns

Type	Description
bool	true if another word is available; otherwise, false.

Overrides

SegmentingTokenizerBase.IncrementWord()

Reset()

This method is called by a consumer before it begins consumption using Lucene.Net.Analysis.TokenStream.IncrementToken().

Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.

If you override this method, always call base.Reset(). Otherwise some internal state will not be reset correctly (for example, Lucene.Net.Analysis.Tokenizer will throw InvalidOperationException on further usage).

Declaration

public override void Reset()

Overrides

SegmentingTokenizerBase.Reset()
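
The consumer workflow described for Reset() can be sketched as follows. This is a minimal illustration rather than an official sample: it assumes the Lucene.Net.Analysis.SmartCn package (4.8.x) is referenced, and the class HmmTokenizerExample and its Tokenize helper are hypothetical names introduced only for this example. Note that the tokenizer on its own may also emit punctuation tokens.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Cn.Smart;
using Lucene.Net.Analysis.TokenAttributes;

public static class HmmTokenizerExample
{
    // Hypothetical helper (not part of Lucene.NET): returns one string per
    // token, containing the term text and its character offsets.
    public static List<string> Tokenize(string text)
    {
        var terms = new List<string>();

        // TokenStream implements IDisposable, so "using" releases the reader.
        using (var tokenizer = new HMMChineseTokenizer(new StringReader(text)))
        {
            // Attributes must be obtained before consumption starts.
            ICharTermAttribute termAtt = tokenizer.AddAttribute<ICharTermAttribute>();
            IOffsetAttribute offsetAtt = tokenizer.AddAttribute<IOffsetAttribute>();

            // Reset() must be called before the first IncrementToken().
            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                terms.Add($"{termAtt} [{offsetAtt.StartOffset},{offsetAtt.EndOffset})");
            }

            // End() signals that the end of the stream has been reached.
            tokenizer.End();
        }

        return terms;
    }

    public static void Main()
    {
        // Mixed Chinese-English input, as supported by this tokenizer.
        foreach (string term in Tokenize("我是中国人。Lucene is fast."))
        {
            Console.WriteLine(term);
        }
    }
}
```

The order of calls (AddAttribute, Reset, IncrementToken in a loop, End, Dispose) follows the general TokenStream contract; skipping Reset() is the usual cause of the InvalidOperationException mentioned above.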

SetNextSentence(int, int)

Provides the next input sentence for analysis.

Declaration

protected override void SetNextSentence(int sentenceStart, int sentenceEnd)

Parameters

Type	Name	Description
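int	sentenceStart	Start index of the sentence within the inherited buffer (m_buffer).
int	sentenceEnd	End index of the sentence within the inherited buffer (m_buffer).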

Overrides

SegmentingTokenizerBase.SetNextSentence(int, int)
