Class JapaneseTokenizer

Tokenizer for Japanese that uses morphological analysis.

Inheritance

System.Object

AttributeSource

TokenStream

Tokenizer

JapaneseTokenizer

Implements

IDisposable

Inherited Members

Tokenizer.m_input

Tokenizer.CorrectOffset(Int32)

Tokenizer.SetReader(TextReader)

TokenStream.Dispose()

AttributeSource.GetAttributeFactory()

AttributeSource.GetAttributeClassesEnumerator()

AttributeSource.GetAttributeImplsEnumerator()

AttributeSource.AddAttributeImpl(Attribute)

AttributeSource.AddAttribute<T>()

AttributeSource.HasAttributes

AttributeSource.HasAttribute<T>()

AttributeSource.GetAttribute<T>()

AttributeSource.ClearAttributes()

AttributeSource.CaptureState()

AttributeSource.RestoreState(AttributeSource.State)

AttributeSource.GetHashCode()

AttributeSource.Equals(Object)

AttributeSource.ReflectAsString(Boolean)

AttributeSource.ReflectWith(IAttributeReflector)

AttributeSource.CloneAttributes()

AttributeSource.CopyTo(AttributeSource)

AttributeSource.ToString()

Namespace: Lucene.Net.Analysis.Ja

Assembly: Lucene.Net.Analysis.Kuromoji.dll

Syntax

public sealed class JapaneseTokenizer : Tokenizer, IDisposable

Remarks

This tokenizer sets a number of additional attributes:

IBaseFormAttribute containing the base form of inflected adjectives and verbs.
IPartOfSpeechAttribute containing the part of speech.
IReadingAttribute containing the reading and pronunciation.
IInflectionAttribute containing additional part-of-speech information for inflected forms.
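
The attributes listed above can be read from the token stream as it is consumed. The following is a minimal sketch, not a definitive usage pattern. It assumes the attribute interfaces live in the Lucene.Net.Analysis.Ja.TokenAttributes namespace and expose accessor methods named GetBaseForm(), GetPartOfSpeech(), GetReading(), GetPronunciation(), GetInflectionType(), and GetInflectionForm(), as in Lucene.NET 4.8; verify these names against the version you use. The sample sentence and the class name JapaneseAttributeDemo are illustrative only.

```csharp
using System;
using System.IO;
using System.Text;
using Lucene.Net.Analysis.Ja;
using Lucene.Net.Analysis.Ja.TokenAttributes; // assumed namespace of the Japanese attributes
using Lucene.Net.Analysis.TokenAttributes;

public static class JapaneseAttributeDemo // hypothetical class name
{
    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        // No user dictionary (null), discard punctuation, default mode.
        using (var tokenizer = new JapaneseTokenizer(
            new StringReader("美味しい寿司を食べました。"),
            null, true, JapaneseTokenizer.DEFAULT_MODE))
        {
            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            var baseForm = tokenizer.AddAttribute<IBaseFormAttribute>();
            var pos = tokenizer.AddAttribute<IPartOfSpeechAttribute>();
            var reading = tokenizer.AddAttribute<IReadingAttribute>();
            var inflection = tokenizer.AddAttribute<IInflectionAttribute>();

            tokenizer.Reset(); // must be called before the first IncrementToken()
            while (tokenizer.IncrementToken())
            {
                // Accessor names below are assumptions; values may be null
                // for tokens that carry no such information.
                Console.WriteLine(
                    "{0}\tbase={1}\tpos={2}\treading={3}\tpron={4}\tinfl={5}/{6}",
                    term.ToString(),
                    baseForm.GetBaseForm() ?? "-",
                    pos.GetPartOfSpeech() ?? "-",
                    reading.GetReading() ?? "-",
                    reading.GetPronunciation() ?? "-",
                    inflection.GetInflectionType() ?? "-",
                    inflection.GetInflectionForm() ?? "-");
            }
            tokenizer.End(); // finalize offsets after the last token
        }
    }
}
```

The sketch follows the usual TokenStream consumer workflow: add attributes, call Reset(), loop on IncrementToken(), call End(), then dispose the tokenizer.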

This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters. For tokens that appear to be compounds (longer than 2 characters if all Kanji, or longer than 7 characters otherwise), it checks whether a second-best segmentation of that token exists after penalties are applied to long tokens. If one exists and the mode is SEARCH, the alternate segmentation is output as well.
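
The effect of the mode on compound tokens can be observed by tokenizing the same text in NORMAL and SEARCH modes and printing each token's position increment and position length. This is a hedged sketch: it assumes the JapaneseTokenizerMode enumeration has NORMAL and SEARCH members and that IPositionIncrementAttribute and IPositionLengthAttribute expose PositionIncrement and PositionLength properties, as in Lucene.NET 4.8. The helper PrintTokens, the class name, and the sample compound are illustrative, and the exact tokens produced depend on the bundled dictionary.

```csharp
using System;
using System.IO;
using System.Text;
using Lucene.Net.Analysis.Ja;
using Lucene.Net.Analysis.TokenAttributes;

public static class JapaneseModeDemo // hypothetical class name
{
    // Hypothetical helper: tokenizes text in the given mode and prints each token.
    private static void PrintTokens(string text, JapaneseTokenizerMode mode)
    {
        Console.WriteLine("Mode: " + mode);
        using (var tokenizer = new JapaneseTokenizer(
            new StringReader(text), null, true, mode))
        {
            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            var posInc = tokenizer.AddAttribute<IPositionIncrementAttribute>();
            var posLen = tokenizer.AddAttribute<IPositionLengthAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // A position length greater than 1 marks a token that spans
                // several positions, such as an undecomposed compound emitted
                // alongside its alternate segmentation.
                Console.WriteLine("  {0}\tposInc={1}\tposLen={2}",
                    term.ToString(), posInc.PositionIncrement, posLen.PositionLength);
            }
            tokenizer.End();
        }
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        const string text = "関西国際空港"; // an all-Kanji compound longer than 2 characters

        PrintTokens(text, JapaneseTokenizerMode.NORMAL);
        PrintTokens(text, JapaneseTokenizerMode.SEARCH);
    }
}
```

Comparing the two listings shows whether SEARCH mode emitted an alternate segmentation in addition to the least-cost path for this input.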

Constructors

JapaneseTokenizer(AttributeSource.AttributeFactory, TextReader, UserDictionary, Boolean, JapaneseTokenizerMode)

Create a new JapaneseTokenizer.

Declaration

public JapaneseTokenizer(AttributeSource.AttributeFactory factory, TextReader input, UserDictionary userDictionary, bool discardPunctuation, JapaneseTokenizerMode mode)

Parameters

Type	Name	Description
AttributeSource.AttributeFactory	factory	The AttributeFactory to use.
TextReader	input	TextReader containing text.
UserDictionary	userDictionary	Optional: if non-null, user dictionary.
System.Boolean	discardPunctuation	`true` if punctuation tokens should be dropped from the output.
JapaneseTokenizerMode	mode	Tokenization mode.

JapaneseTokenizer(TextReader, UserDictionary, Boolean, JapaneseTokenizerMode)

Create a new JapaneseTokenizer.

Uses the default AttributeFactory.

Declaration

public JapaneseTokenizer(TextReader input, UserDictionary userDictionary, bool discardPunctuation, JapaneseTokenizerMode mode)

Parameters

Type	Name	Description
TextReader	input	TextReader containing text.
UserDictionary	userDictionary	Optional: if non-null, user dictionary.
System.Boolean	discardPunctuation	`true` if punctuation tokens should be dropped from the output.
JapaneseTokenizerMode	mode	Tokenization mode.

Fields

DEFAULT_MODE

Default tokenization mode. Currently this is SEARCH.

Declaration

public static readonly JapaneseTokenizerMode DEFAULT_MODE

Field Value

Type	Description
JapaneseTokenizerMode

Properties

GraphvizFormatter

Expert: set this to produce Graphviz (dot) output of the Viterbi lattice.

Declaration

public GraphvizFormatter GraphvizFormatter { get; set; }

Property Value

Type	Description
GraphvizFormatter

Methods

Dispose(Boolean)

Declaration

protected override void Dispose(bool disposing)

Parameters

Type	Name	Description
System.Boolean	disposing

End()

Declaration

public override void End()

Overrides

TokenStream.End()

IncrementToken()

Declaration

public override bool IncrementToken()

Returns

Type	Description
System.Boolean

Overrides

TokenStream.IncrementToken()

Reset()

Declaration

public override void Reset()

Overrides

Tokenizer.Reset()
