
    Class CJKTokenizer

    CJKTokenizer is designed for Chinese, Japanese, and Korean languages.


    The tokenizer returns overlapping bigrams: every pair of adjacent CJK characters becomes one token.

    Example: "java C1C2C3C4" is segmented into: "java" "C1C2" "C2C3" "C3C4".

    Additionally, the following is applied to Latin text (such as English):
    • Text is converted to lowercase.
    • Numeric digits, '+', '#', and '_' are tokenized as letters.
    • Full-width forms are converted to half-width forms.
    For more information on text segmentation for Asian languages (Chinese, Japanese, and Korean), search online.
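
The segmentation rules described above can be illustrated with a small standalone sketch. This is not the Lucene.NET implementation: the class `CjkBigramSketch` and its helpers are hypothetical names, and treating every non-ASCII letter as a CJK character is a simplifying assumption made only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration of the rules above; not part of Lucene.NET.
public static class CjkBigramSketch
{
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var latin = new StringBuilder(); // current run of Latin-like characters
        var cjk = new StringBuilder();   // current run of non-ASCII letters

        foreach (char raw in text)
        {
            char c = raw;

            // Fold full-width ASCII variants (U+FF01..U+FF5E) to half-width.
            if (c >= '\uFF01' && c <= '\uFF5E')
            {
                c = (char)(c - 0xFEE0);
            }

            // Digits, '+', '#', and '_' are treated as letters in Latin text.
            bool isLatin = c < 0x80 &&
                (char.IsLetterOrDigit(c) || c == '+' || c == '#' || c == '_');
            // Assumption: any non-ASCII letter is handled as a CJK character.
            bool isCjk = c >= 0x80 && char.IsLetter(c);

            if (isLatin)
            {
                FlushCjk(cjk, tokens);
                latin.Append(char.ToLowerInvariant(c));
            }
            else if (isCjk)
            {
                FlushLatin(latin, tokens);
                cjk.Append(c);
            }
            else
            {
                // Whitespace and punctuation end the current run.
                FlushLatin(latin, tokens);
                FlushCjk(cjk, tokens);
            }
        }

        FlushLatin(latin, tokens);
        FlushCjk(cjk, tokens);
        return tokens;
    }

    // Emits a Latin run as a single token.
    private static void FlushLatin(StringBuilder run, List<string> tokens)
    {
        if (run.Length > 0)
        {
            tokens.Add(run.ToString());
            run.Clear();
        }
    }

    // Emits a CJK run as overlapping bigrams; a lone character is emitted as is.
    private static void FlushCjk(StringBuilder run, List<string> tokens)
    {
        if (run.Length == 1)
        {
            tokens.Add(run.ToString());
        }
        for (int i = 0; i + 1 < run.Length; i++)
        {
            tokens.Add(run.ToString(i, 2));
        }
        run.Clear();
    }
}
```

With this sketch, `CjkBigramSketch.Tokenize("java 日本語学")` yields `java`, `日本`, `本語`, `語学`, mirroring the "java C1C2C3C4" example.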

    Inheritance
    System.Object
    AttributeSource
    TokenStream
    Tokenizer
    CJKTokenizer
    Implements
    IDisposable
    Inherited Members
    Tokenizer.m_input
    Tokenizer.Dispose(Boolean)
    Tokenizer.CorrectOffset(Int32)
    Tokenizer.SetReader(TextReader)
    TokenStream.Dispose()
    AttributeSource.GetAttributeFactory()
    AttributeSource.GetAttributeClassesEnumerator()
    AttributeSource.GetAttributeImplsEnumerator()
    AttributeSource.AddAttributeImpl(Attribute)
    AttributeSource.AddAttribute<T>()
    AttributeSource.HasAttributes
    AttributeSource.HasAttribute<T>()
    AttributeSource.GetAttribute<T>()
    AttributeSource.ClearAttributes()
    AttributeSource.CaptureState()
    AttributeSource.RestoreState(AttributeSource.State)
    AttributeSource.GetHashCode()
    AttributeSource.Equals(Object)
    AttributeSource.ReflectAsString(Boolean)
    AttributeSource.ReflectWith(IAttributeReflector)
    AttributeSource.CloneAttributes()
    AttributeSource.CopyTo(AttributeSource)
    AttributeSource.ToString()
    Namespace: Lucene.Net.Analysis.Cjk
    Assembly: Lucene.Net.Analysis.Common.dll
    Syntax
    public sealed class CJKTokenizer : Tokenizer, IDisposable

    Constructors


    CJKTokenizer(AttributeSource.AttributeFactory, TextReader)

    Declaration
    public CJKTokenizer(AttributeSource.AttributeFactory factory, TextReader in)
    Parameters
    Type                               Name      Description
    AttributeSource.AttributeFactory   factory
    TextReader                         in

    CJKTokenizer(TextReader)

    Constructs a token stream that processes the given input.

    Declaration
    public CJKTokenizer(TextReader in)
    Parameters
    Type         Name   Description
    TextReader   in     I/O reader

    Methods


    End()

    Declaration
    public override sealed void End()
    Overrides
    TokenStream.End()

    IncrementToken()

    Advances the stream to the next token. Returns true if a token was produced, or false at end of stream (EOS). See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html for details on Unicode blocks.

    Declaration
    public override bool IncrementToken()
    Returns
    Type             Description
    System.Boolean   false for end of stream, true otherwise

    Overrides
    TokenStream.IncrementToken()
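
A typical consumer calls Reset(), loops on IncrementToken() until it returns false, then calls End() and disposes the tokenizer. The sketch below assumes the Lucene.Net.Analysis.Common package is referenced and that `ICharTermAttribute` from `Lucene.Net.Analysis.TokenAttributes` is used to read each term; the class `CjkTokenizerDemo` and its `Tokens` method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Cjk;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical helper showing the Reset / IncrementToken / End workflow.
public static class CjkTokenizerDemo
{
    public static List<string> Tokens(string text)
    {
        var result = new List<string>();

        using (var reader = new StringReader(text))
        using (var tokenizer = new CJKTokenizer(reader))
        {
            // The term attribute exposes the text of the current token.
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();

            tokenizer.Reset();                  // must be called before the first IncrementToken()
            while (tokenizer.IncrementToken())  // false signals end of stream
            {
                result.Add(term.ToString());
            }
            tokenizer.End();                    // finalizes end-of-stream state
        }

        return result;
    }

    public static void Main()
    {
        foreach (string token in Tokens("java 日本語学"))
        {
            Console.WriteLine(token);
        }
    }
}
```

Disposing the tokenizer through `using` releases the underlying reader, which is why no explicit Dispose() call appears in the loop.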

    Reset()

    Declaration
    public override void Reset()
    Overrides
    Tokenizer.Reset()

    Implements

    IDisposable
    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)