Class CJKTokenizer

CJKTokenizer is designed for Chinese, Japanese, and Korean languages.

The tokenizer returns overlapping bigrams: each token is a pair of adjacent CJK characters, and consecutive tokens share one character.

Example: "java C1C2C3C4", where C1 through C4 each stand for one CJK character, is segmented into: "java" "C1C2" "C2C3" "C3C4".
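The overlapping-pair rule can be sketched in a few lines. This is an illustration only, not the tokenizer's actual implementation: `BigramSketch` and `OverlappingBigrams` are hypothetical names, the input is assumed to be a single run of CJK characters with no surrogate pairs, and returning a one-character run unchanged is a choice made for this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.Net. It only illustrates
// how overlapping bigrams are formed from a run of CJK characters.
public static class BigramSketch
{
    public static List<string> OverlappingBigrams(string cjkRun)
    {
        var tokens = new List<string>();

        // Sketch-level choice: a single character is returned unchanged.
        if (cjkRun.Length == 1)
        {
            tokens.Add(cjkRun);
            return tokens;
        }

        // Slide a window of two characters, advancing one character at a time,
        // so that each token overlaps the previous one by one character.
        for (int i = 0; i + 1 < cjkRun.Length; i++)
        {
            tokens.Add(cjkRun.Substring(i, 2));
        }
        return tokens;
    }
}
```

For a four-character run such as "日本語学", `BigramSketch.OverlappingBigrams` yields three tokens: "日本", "本語", and "語学".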

Additionally, the following is applied to Latin text (such as English):

Text is converted to lowercase.
Numeric digits, '+', '#', and '_' are tokenized as letters.
Full-width forms are converted to half-width forms.
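The three Latin-text rules above can be approximated as follows. This is a minimal sketch under stated assumptions, not the tokenizer's source: `LatinNormalizationSketch` and its members are hypothetical names, full-width forms are assumed to mean the full-width ASCII variants U+FF01 through U+FF5E, and `IsTokenChar` is meant for Latin input only (it does not distinguish CJK characters).

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Lucene.Net.
public static class LatinNormalizationSketch
{
    // Converts a full-width ASCII variant (U+FF01..U+FF5E) to its
    // half-width counterpart (U+0021..U+007E), then lowercases it.
    public static char Normalize(char c)
    {
        if (c >= '\uFF01' && c <= '\uFF5E')
        {
            c = (char)(c - 0xFEE0); // fixed offset between the two ranges
        }
        return char.ToLowerInvariant(c);
    }

    // Applies the per-character normalization to a whole string.
    public static string Normalize(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            sb.Append(Normalize(c));
        }
        return sb.ToString();
    }

    // Letters, digits, '+', '#', and '_' all count as token characters.
    public static bool IsTokenChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '+' || c == '#' || c == '_';
    }
}
```

Under these assumptions, the full-width string "Ｃ＃" normalizes to "c#", and both characters count as token characters, so the text stays together as one token.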

For more information on text segmentation for Asian languages (Chinese, Japanese, and Korean), search the web for CJK text segmentation.

Inheritance

System.Object

AttributeSource

TokenStream

Tokenizer

CJKTokenizer

Implements

IDisposable

Inherited Members

Tokenizer.m_input

Tokenizer.Dispose(Boolean)

Tokenizer.CorrectOffset(Int32)

Tokenizer.SetReader(TextReader)

TokenStream.Dispose()

AttributeSource.GetAttributeFactory()

AttributeSource.GetAttributeClassesEnumerator()

AttributeSource.GetAttributeImplsEnumerator()

AttributeSource.AddAttributeImpl(Attribute)

AttributeSource.AddAttribute<T>()

AttributeSource.HasAttributes

AttributeSource.HasAttribute<T>()

AttributeSource.GetAttribute<T>()

AttributeSource.ClearAttributes()

AttributeSource.CaptureState()

AttributeSource.RestoreState(AttributeSource.State)

AttributeSource.GetHashCode()

AttributeSource.Equals(Object)

AttributeSource.ReflectAsString(Boolean)

AttributeSource.ReflectWith(IAttributeReflector)

AttributeSource.CloneAttributes()

AttributeSource.CopyTo(AttributeSource)

AttributeSource.ToString()

Namespace: Lucene.Net.Analysis.Cjk

Assembly: Lucene.Net.Analysis.Common.dll

Syntax

public sealed class CJKTokenizer : Tokenizer, IDisposable

Constructors

CJKTokenizer(AttributeSource.AttributeFactory, TextReader)

Declaration

public CJKTokenizer(AttributeSource.AttributeFactory factory, TextReader in)

Parameters

Type	Name	Description
AttributeSource.AttributeFactory	factory
TextReader	in

CJKTokenizer(TextReader)

Construct a token stream processing the given input.

Declaration

public CJKTokenizer(TextReader in)

Parameters

Type	Name	Description
TextReader	in	I/O reader

Methods

End()

Declaration

public override sealed void End()

Overrides

TokenStream.End()

IncrementToken()

Advances to the next token in the stream. Returns true if a token was produced, or false at the end of the stream (EOS). For details on Unicode blocks, see http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html.

Declaration

public override bool IncrementToken()

Returns

Type	Description
System.Boolean	false for end of stream, true otherwise

Overrides

TokenStream.IncrementToken()
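A typical way to consume the tokenizer is to call Reset(), loop on IncrementToken(), and then call End(). The sketch below assumes the Lucene.Net 4.8 packages are referenced and that `ICharTermAttribute` lives in the `Lucene.Net.Analysis.TokenAttributes` namespace; `CjkTokenizerUsage` and `Tokenize` are hypothetical names introduced for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Cjk;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical wrapper showing the Reset / IncrementToken / End workflow.
public static class CjkTokenizerUsage
{
    public static List<string> Tokenize(string text)
    {
        var terms = new List<string>();

        using (var reader = new StringReader(text))
        using (var tokenizer = new CJKTokenizer(reader))
        {
            // The term attribute exposes the text of the current token.
            ICharTermAttribute termAtt = tokenizer.AddAttribute<ICharTermAttribute>();

            tokenizer.Reset();                 // must be called before the first IncrementToken()
            while (tokenizer.IncrementToken()) // false signals end of stream
            {
                terms.Add(termAtt.ToString());
            }
            tokenizer.End();                   // finalizes end-of-stream state
        }

        return terms;
    }
}
```

Calling `CjkTokenizerUsage.Tokenize("java C1C2C3C4")`, with real CJK characters in place of C1 through C4, should produce the segmentation described in the class summary.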

| Improve this Doc View Source

Reset()

Declaration

public override void Reset()

Overrides

Tokenizer.Reset()
