Class DefaultICUTokenizerConfig

Default ICUTokenizerConfig that is generally applicable to many languages.

Inheritance

System.Object

DefaultICUTokenizerConfig

Namespace: Lucene.Net.Analysis.Icu.Segmentation

Assembly: Lucene.Net.ICU.dll

Syntax

public class DefaultICUTokenizerConfig : ICUTokenizerConfig

Remarks

Generally tokenizes Unicode text according to UAX#29 (BreakIterator.GetWordInstance(ULocale.ROOT)), but with the following tailorings:

Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.

This is a Lucene.NET EXPERIMENTAL API; use at your own risk.

Constructors

DefaultICUTokenizerConfig(Boolean, Boolean)

Creates a new config. This object is lightweight, but the first time the class is referenced, the break iterators will be initialized.

Declaration

public DefaultICUTokenizerConfig(bool cjkAsWords, bool myanmarAsWords)

Parameters

Type	Name	Description
System.Boolean	cjkAsWords	true if CJK text should undergo dictionary-based segmentation; otherwise text will be segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC.
System.Boolean	myanmarAsWords	true if Myanmar text should undergo dictionary-based segmentation; otherwise it will be tokenized as syllables.
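
Example

The following is a minimal sketch, not part of the generated reference, of how a config created with this constructor might be passed to an ICUTokenizer. It assumes the Lucene.Net.ICU package is referenced and that ICUTokenizer exposes a (TextReader, ICUTokenizerConfig) constructor; the class name DefaultConfigDemo and the sample text are hypothetical. Both flags are set to true here purely for illustration.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical demo class; the name is not part of Lucene.NET.
public static class DefaultConfigDemo
{
    public static void Main()
    {
        // Both flags enabled for illustration: cjkAsWords = true, myanmarAsWords = true.
        var config = new DefaultICUTokenizerConfig(true, true);

        // Assumption: ICUTokenizer accepts a TextReader and an ICUTokenizerConfig.
        using (var tokenizer = new ICUTokenizer(new StringReader("Lucene.NET 4.8 東京都"), config))
        {
            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            var type = tokenizer.AddAttribute<ITypeAttribute>();

            // Standard TokenStream workflow: Reset, IncrementToken until false, End.
            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // Print each token's text together with its token type.
                Console.WriteLine($"{term} [{type.Type}]");
            }
            tokenizer.End();
        }
    }
}
```

The printed token types should correspond to the WORD_* fields listed under Fields; the exact tokens depend on the ICU data bundled with the library.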

Fields

WORD_HANGUL

Token type for words containing Korean hangul

Declaration

public static readonly string WORD_HANGUL

Field Value

Type	Description
System.String

WORD_HIRAGANA

Token type for words containing Japanese hiragana

Declaration

public static readonly string WORD_HIRAGANA

Field Value

Type	Description
System.String

WORD_IDEO

Token type for words containing ideographic characters

Declaration

public static readonly string WORD_IDEO

Field Value

Type	Description
System.String

WORD_KATAKANA

Token type for words containing Japanese katakana

Declaration

public static readonly string WORD_KATAKANA

Field Value

Type	Description
System.String

WORD_LETTER

Token type for words that contain letters

Declaration

public static readonly string WORD_LETTER

Field Value

Type	Description
System.String

WORD_NUMBER

Token type for words that appear to be numbers

Declaration

public static readonly string WORD_NUMBER

Field Value

Type	Description
System.String

Properties

CombineCJ

Declaration

public override bool CombineCJ { get; }

Property Value

Type	Description
System.Boolean

Overrides

ICUTokenizerConfig.CombineCJ

Methods

GetBreakIterator(Int32)

Declaration

public override BreakIterator GetBreakIterator(int script)

Parameters

Type	Name	Description
System.Int32	script

Returns

Type	Description
BreakIterator

Overrides

ICUTokenizerConfig.GetBreakIterator(Int32)

GetType(Int32, Int32)

Declaration

public override string GetType(int script, int ruleStatus)

Parameters

Type	Name	Description
System.Int32	script
System.Int32	ruleStatus

Returns

Type	Description
System.String

Overrides

ICUTokenizerConfig.GetType(Int32, Int32)