
    Class JapaneseTokenizer

    Tokenizer for Japanese that uses morphological analysis.

    Inheritance
    System.Object
    AttributeSource
    TokenStream
    Tokenizer
    JapaneseTokenizer
    Implements
    IDisposable
    Inherited Members
    Tokenizer.m_input
    Tokenizer.CorrectOffset(Int32)
    Tokenizer.SetReader(TextReader)
    TokenStream.Dispose()
    AttributeSource.GetAttributeFactory()
    AttributeSource.GetAttributeClassesEnumerator()
    AttributeSource.GetAttributeImplsEnumerator()
    AttributeSource.AddAttributeImpl(Attribute)
    AttributeSource.AddAttribute<T>()
    AttributeSource.HasAttributes
    AttributeSource.HasAttribute<T>()
    AttributeSource.GetAttribute<T>()
    AttributeSource.ClearAttributes()
    AttributeSource.CaptureState()
    AttributeSource.RestoreState(AttributeSource.State)
    AttributeSource.GetHashCode()
    AttributeSource.Equals(Object)
    AttributeSource.ReflectAsString(Boolean)
    AttributeSource.ReflectWith(IAttributeReflector)
    AttributeSource.CloneAttributes()
    AttributeSource.CopyTo(AttributeSource)
    AttributeSource.ToString()
    Namespace: Lucene.Net.Analysis.Ja
    Assembly: Lucene.Net.Analysis.Kuromoji.dll
    Syntax
    public sealed class JapaneseTokenizer : Tokenizer, IDisposable
    Remarks

    This tokenizer sets a number of additional attributes:

    • IBaseFormAttribute containing the base form of inflected adjectives and verbs.
    • IPartOfSpeechAttribute containing the part of speech.
    • IReadingAttribute containing the reading and pronunciation.
    • IInflectionAttribute containing additional part-of-speech information for inflected forms.
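
The attributes listed above can be read per token while consuming the stream. The following is a minimal sketch, not a definitive usage pattern. It assumes the Lucene.Net.Analysis.Kuromoji package is referenced, that ICharTermAttribute comes from Lucene.Net.Analysis.TokenAttributes, and that the Japanese attributes live in Lucene.Net.Analysis.Ja.TokenAttributes with accessor methods named GetBaseForm(), GetPartOfSpeech(), GetReading(), GetPronunciation(), GetInflectionType() and GetInflectionForm(). None of those namespaces or accessor names appear on this page, so verify them against your installed version. The sample sentence is arbitrary.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Ja;
using Lucene.Net.Analysis.Ja.TokenAttributes; // assumed namespace of the Japanese attributes
using Lucene.Net.Analysis.TokenAttributes;    // assumed namespace of ICharTermAttribute

public static class AttributeDemo
{
    public static void Main()
    {
        // null = no user dictionary; true = drop punctuation tokens.
        using (var reader = new StringReader("美しい花を見ました"))
        using (var tokenizer = new JapaneseTokenizer(
            reader, null, true, JapaneseTokenizer.DEFAULT_MODE))
        {
            // Register the attributes once, before consuming the stream.
            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            var baseForm = tokenizer.AddAttribute<IBaseFormAttribute>();
            var pos = tokenizer.AddAttribute<IPartOfSpeechAttribute>();
            var reading = tokenizer.AddAttribute<IReadingAttribute>();
            var inflection = tokenizer.AddAttribute<IInflectionAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // Accessor names below are assumptions; values may be null
                // for tokens that carry no such information.
                Console.WriteLine(
                    $"{term} base={baseForm.GetBaseForm()} pos={pos.GetPartOfSpeech()} " +
                    $"reading={reading.GetReading()} pron={reading.GetPronunciation()} " +
                    $"inflType={inflection.GetInflectionType()} inflForm={inflection.GetInflectionForm()}");
            }
            tokenizer.End();
        }
    }
}
```

The attribute instances are reused for every token, so read their values inside the loop rather than storing the attribute objects for later.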

    This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters. A token is treated as a possible compound if it is longer than 2 characters and consists entirely of Kanji, or is longer than 7 characters otherwise. For such tokens, the tokenizer applies a penalty to long tokens and checks whether a second-best segmentation exists. If one does, and the mode is SEARCH, the alternate segmentation is output as well.
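
To make the idea of a least-cost path and a long-token penalty concrete, here is a toy sketch. It is not the tokenizer's implementation: the word list, the costs, the UnknownCharCost fallback, and the longTokenPenalty parameter are all hypothetical, it processes the whole string at once instead of rolling, it ignores connection costs between tokens, and it only applies the all-Kanji threshold (length greater than 2).

```csharp
using System;
using System.Collections.Generic;

public static class ToyViterbi
{
    // Hypothetical word costs (lower is better); not the real dictionary.
    static readonly Dictionary<string, int> WordCost = new Dictionary<string, int>
    {
        ["関西"] = 3, ["国際"] = 3, ["空港"] = 3, ["関西国際空港"] = 7,
    };

    // Hypothetical fallback cost for a single character that is not listed.
    const int UnknownCharCost = 10;

    public static List<string> Segment(string text, int longTokenPenalty)
    {
        int n = text.Length;
        var best = new int[n + 1];      // best[i]: least cost to segment text[0..i)
        var backStart = new int[n + 1]; // start of the last token on that best path
        for (int i = 1; i <= n; i++) best[i] = int.MaxValue;

        for (int end = 1; end <= n; end++)
        {
            for (int start = 0; start < end; start++)
            {
                if (best[start] == int.MaxValue) continue;
                string word = text.Substring(start, end - start);

                int cost;
                if (WordCost.TryGetValue(word, out int known)) cost = known;
                else if (word.Length == 1) cost = UnknownCharCost;
                else continue; // not a candidate token

                // Penalize long tokens so a decompounded path can win.
                if (word.Length > 2) cost += longTokenPenalty;

                if (best[start] + cost < best[end])
                {
                    best[end] = best[start] + cost;
                    backStart[end] = start;
                }
            }
        }

        // Walk the back pointers from the end to recover the tokens.
        var tokens = new List<string>();
        for (int end = n; end > 0; end = backStart[end])
            tokens.Add(text.Substring(backStart[end], end - backStart[end]));
        tokens.Reverse();
        return tokens;
    }

    public static void Main()
    {
        // No penalty: the single long token is cheapest (7 < 3 + 3 + 3).
        Console.WriteLine(string.Join(" | ", Segment("関西国際空港", 0)));
        // Penalty of 5: the long token costs 12, so the three-part path (9) wins.
        Console.WriteLine(string.Join(" | ", Segment("関西国際空港", 5)));
    }
}
```

With these made-up costs, the first call yields the single token 関西国際空港 and the second yields 関西, 国際, 空港. In SEARCH mode the real tokenizer would emit the alternate segmentation alongside the original rather than choosing only one.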

    Constructors


    JapaneseTokenizer(AttributeSource.AttributeFactory, TextReader, UserDictionary, Boolean, JapaneseTokenizerMode)

    Create a new JapaneseTokenizer.

    Declaration
    public JapaneseTokenizer(AttributeSource.AttributeFactory factory, TextReader input, UserDictionary userDictionary, bool discardPunctuation, JapaneseTokenizerMode mode)
    Parameters
    Type Name Description
    AttributeSource.AttributeFactory factory

    The AttributeFactory to use.

    TextReader input

    TextReader containing text.

    UserDictionary userDictionary

    Optional user dictionary; pass null to use none.

    System.Boolean discardPunctuation

    true if punctuation tokens should be dropped from the output.

    JapaneseTokenizerMode mode

    Tokenization mode.


    JapaneseTokenizer(TextReader, UserDictionary, Boolean, JapaneseTokenizerMode)

    Create a new JapaneseTokenizer.

    Uses the default AttributeFactory.

    Declaration
    public JapaneseTokenizer(TextReader input, UserDictionary userDictionary, bool discardPunctuation, JapaneseTokenizerMode mode)
    Parameters
    Type Name Description
    TextReader input

    TextReader containing text.

    UserDictionary userDictionary

    Optional user dictionary; pass null to use none.

    System.Boolean discardPunctuation

    true if punctuation tokens should be dropped from the output.

    JapaneseTokenizerMode mode

    Tokenization mode.
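
As an illustration of the constructor parameters, the sketch below tokenizes the same text under two modes and prints the resulting terms. It assumes the Lucene.Net.Analysis.Kuromoji package is referenced, that ICharTermAttribute lives in Lucene.Net.Analysis.TokenAttributes, and that JapaneseTokenizerMode has a NORMAL member alongside SEARCH. Only SEARCH is named on this page, so treat NORMAL as an assumption to check. The Tokenize helper and the sample text are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Ja;
using Lucene.Net.Analysis.TokenAttributes; // assumed namespace of ICharTermAttribute

public static class ModeDemo
{
    // Hypothetical helper: collects the term text of every token.
    static List<string> Tokenize(string text, JapaneseTokenizerMode mode)
    {
        var terms = new List<string>();
        // No user dictionary (null); punctuation tokens are discarded (true).
        using (var tokenizer = new JapaneseTokenizer(
            new StringReader(text), null, true, mode))
        {
            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                terms.Add(term.ToString());
            }
            tokenizer.End();
        }
        return terms;
    }

    public static void Main()
    {
        // JapaneseTokenizerMode.NORMAL is assumed to exist; see the note above.
        var modes = new[] { JapaneseTokenizerMode.NORMAL, JapaneseTokenizerMode.SEARCH };
        foreach (var mode in modes)
        {
            Console.WriteLine($"{mode}: {string.Join(" | ", Tokenize("関西国際空港", mode))}");
        }
    }
}
```

Passing true for discardPunctuation keeps punctuation out of the output; pass false if downstream filters need to see it.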

    Fields


    DEFAULT_MODE

    Default tokenization mode. Currently this is SEARCH.

    Declaration
    public static readonly JapaneseTokenizerMode DEFAULT_MODE
    Field Value
    Type Description
    JapaneseTokenizerMode

    Properties


    GraphvizFormatter

    Expert: set this to produce Graphviz (DOT) output of the Viterbi lattice.

    Declaration
    public GraphvizFormatter GraphvizFormatter { get; set; }
    Property Value
    Type Description
    GraphvizFormatter

    Methods


    Dispose(Boolean)

    Declaration
    protected override void Dispose(bool disposing)
    Parameters
    Type Name Description
    System.Boolean disposing

    End()

    Declaration
    public override void End()
    Overrides
    TokenStream.End()

    IncrementToken()

    Declaration
    public override bool IncrementToken()
    Returns
    Type Description
    System.Boolean
    Overrides
    TokenStream.IncrementToken()

    Reset()

    Declaration
    public override void Reset()
    Overrides
    Tokenizer.Reset()

    Implements

    IDisposable
    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)