Class JapaneseTokenizer
Tokenizer for Japanese that uses morphological analysis.
Namespace: Lucene.Net.Analysis.Ja
Assembly: Lucene.Net.Analysis.Kuromoji.dll
Syntax
public sealed class JapaneseTokenizer : Tokenizer, IDisposable
Remarks
This tokenizer sets a number of additional attributes:
- IBaseFormAttribute containing base form for inflected adjectives and verbs.
- IPartOfSpeechAttribute containing part-of-speech.
- IReadingAttribute containing reading and pronunciation.
- IInflectionAttribute containing additional part-of-speech information for inflected forms.
This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters. A token is treated as a possible compound when it is longer than 2 characters and all Kanji, or longer than 7 characters otherwise. For such tokens, the tokenizer applies penalties to the long tokens and checks whether a second-best segmentation exists. If one does, and the mode is SEARCH, the alternate segmentation is output as well.
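The least-cost search described above can be illustrated with a small, self-contained sketch. This is not the tokenizer's implementation: the `ToySegmenter` class, the lexicon, the word costs, and the flat `longTokenPenalty` are all hypothetical, and connection costs between adjacent tokens are left out for brevity. It only shows how a penalty on long tokens can shift the best path from one compound to its parts.

```csharp
using System;
using System.Collections.Generic;

public static class ToySegmenter
{
    // Hypothetical toy lexicon: surface form -> word cost (lower is better).
    private static readonly Dictionary<string, int> Lexicon = new Dictionary<string, int>
    {
        { "関西", 3 },
        { "国際", 3 },
        { "空港", 3 },
        { "関西国際空港", 7 },
    };

    // Returns the least-cost segmentation of text using lexicon entries only,
    // or null when no sequence of lexicon entries covers the whole text.
    public static List<string> Segment(string text, int longTokenPenalty)
    {
        int n = text.Length;
        var best = new int[n + 1]; // best[i]: lowest cost of any path ending at position i
        var back = new int[n + 1]; // back[i]: start position of the last token on that path
        for (int i = 1; i <= n; i++)
        {
            best[i] = int.MaxValue;
            back[i] = -1;
        }

        for (int end = 1; end <= n; end++)
        {
            for (int start = 0; start < end; start++)
            {
                if (best[start] == int.MaxValue) continue; // position not reachable
                string word = text.Substring(start, end - start);
                if (!Lexicon.TryGetValue(word, out int cost)) continue;

                // Assumption: a flat penalty for tokens longer than 2 characters
                // (every entry in this toy lexicon is all Kanji).
                if (word.Length > 2) cost += longTokenPenalty;

                int total = best[start] + cost;
                if (total < best[end])
                {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }

        if (best[n] == int.MaxValue) return null;

        // Walk the back pointers from the end to recover the path.
        var tokens = new List<string>();
        for (int pos = n; pos > 0; pos = back[pos])
        {
            tokens.Add(text.Substring(back[pos], pos - back[pos]));
        }
        tokens.Reverse();
        return tokens;
    }
}
```

With these toy costs, `ToySegmenter.Segment("関西国際空港", 0)` keeps the compound as one token (cost 7 against 9 for the three parts), while a penalty of 5 makes the three-part path cheaper (9 against 12). The real tokenizer in SEARCH mode emits the alternate segmentation in addition to the original rather than replacing it.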
Constructors
JapaneseTokenizer(AttributeSource.AttributeFactory, TextReader, UserDictionary, Boolean, JapaneseTokenizerMode)
Create a new JapaneseTokenizer.
Declaration
public JapaneseTokenizer(AttributeSource.AttributeFactory factory, TextReader input, UserDictionary userDictionary, bool discardPunctuation, JapaneseTokenizerMode mode)
Parameters
Type | Name | Description |
---|---|---|
AttributeSource.AttributeFactory | factory | The AttributeFactory to use. |
System.IO.TextReader | input | TextReader containing text. |
UserDictionary | userDictionary | Optional: if non-null, user dictionary. |
System.Boolean | discardPunctuation | true if punctuation tokens should be dropped from the output. |
JapaneseTokenizerMode | mode | Tokenization mode. |
JapaneseTokenizer(TextReader, UserDictionary, Boolean, JapaneseTokenizerMode)
Create a new JapaneseTokenizer.
Uses the default AttributeFactory.
Declaration
public JapaneseTokenizer(TextReader input, UserDictionary userDictionary, bool discardPunctuation, JapaneseTokenizerMode mode)
Parameters
Type | Name | Description |
---|---|---|
System.IO.TextReader | input | TextReader containing text. |
UserDictionary | userDictionary | Optional: if non-null, user dictionary. |
System.Boolean | discardPunctuation | true if punctuation tokens should be dropped from the output. |
JapaneseTokenizerMode | mode | Tokenization mode. |
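As a rough illustration of how the constructors and the attributes listed in the remarks fit together, the sketch below builds a tokenizer with the four-argument constructor and reads each token. It assumes the Lucene.Net.Analysis.Kuromoji package is referenced and that the attribute accessors are named `GetPartOfSpeech()`, `GetBaseForm()`, and `GetReading()` as in Lucene.NET 4.8. The `JapaneseTokenizerExample` class and its `Tokenize` helper are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Ja;
using Lucene.Net.Analysis.Ja.TokenAttributes;
using Lucene.Net.Analysis.TokenAttributes;

public static class JapaneseTokenizerExample
{
    // Hypothetical helper: returns one tab-separated line per token.
    public static List<string> Tokenize(string text)
    {
        var lines = new List<string>();

        using (var reader = new StringReader(text))
        // No user dictionary (null), punctuation discarded, SEARCH mode.
        using (var tokenizer = new JapaneseTokenizer(reader, null, true, JapaneseTokenizerMode.SEARCH))
        {
            // Attributes must be obtained before the stream is consumed.
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            IPartOfSpeechAttribute pos = tokenizer.AddAttribute<IPartOfSpeechAttribute>();
            IBaseFormAttribute baseForm = tokenizer.AddAttribute<IBaseFormAttribute>();
            IReadingAttribute reading = tokenizer.AddAttribute<IReadingAttribute>();

            tokenizer.Reset(); // required before the first IncrementToken() call
            while (tokenizer.IncrementToken())
            {
                // Base form and reading may be null for some tokens.
                lines.Add(string.Join("\t",
                    term.ToString(),
                    pos.GetPartOfSpeech(),
                    baseForm.GetBaseForm() ?? "",
                    reading.GetReading() ?? ""));
            }
            tokenizer.End(); // signals end of stream and finalizes offsets
        }

        return lines;
    }
}
```

A caller might print the result with `foreach (var line in JapaneseTokenizerExample.Tokenize("関西国際空港")) Console.WriteLine(line);`. The Reset, IncrementToken, End, Dispose order follows the usual TokenStream consumer workflow.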
Fields
DEFAULT_MODE
Default tokenization mode. Currently this is SEARCH.
Declaration
public static readonly JapaneseTokenizerMode DEFAULT_MODE
Field Value
Type | Description |
---|---|
JapaneseTokenizerMode |
Properties
GraphvizFormatter
Expert: set this to produce Graphviz (dot) output of the Viterbi lattice.
Declaration
public GraphvizFormatter GraphvizFormatter { get; set; }
Property Value
Type | Description |
---|---|
GraphvizFormatter |
Methods
Dispose(Boolean)
Declaration
protected override void Dispose(bool disposing)
Parameters
Type | Name | Description |
---|---|---|
System.Boolean | disposing |
Overrides
End()
Declaration
public override void End()
Overrides
IncrementToken()
Declaration
public override bool IncrementToken()
Returns
Type | Description |
---|---|
System.Boolean |
Overrides
Reset()
Declaration
public override void Reset()