Class DefaultICUTokenizerConfig

Default ICUTokenizerConfig that is generally applicable to many languages.

Inheritance

object

ICUTokenizerConfig

DefaultICUTokenizerConfig

Inherited Members

ICUTokenizerConfig.EMOJI_SEQUENCE_STATUS

object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.MemberwiseClone()

object.ReferenceEquals(object, object)

object.ToString()

Namespace: Lucene.Net.Analysis.Icu.Segmentation

Assembly: Lucene.Net.ICU.dll

Syntax

public class DefaultICUTokenizerConfig : ICUTokenizerConfig

Remarks

Generally tokenizes Unicode text according to UAX#29 (BreakIterator.GetWordInstance(ULocale.ROOT)), but with the following tailorings:

Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.

Note

This API is experimental and might change in incompatible ways in the next release.

Constructors

DefaultICUTokenizerConfig(bool, bool)

Creates a new config. The object itself is lightweight, but the break iterators are initialized the first time the class is referenced.

Declaration

public DefaultICUTokenizerConfig(bool cjkAsWords, bool myanmarAsWords)

Parameters

Type	Name	Description
bool	cjkAsWords	true if CJK text should undergo dictionary-based segmentation; otherwise the text is segmented according to UAX#29 defaults. If true, all Han+Hiragana+Katakana words are tagged as IDEOGRAPHIC.
bool	myanmarAsWords	true if Myanmar text should undergo dictionary-based segmentation; otherwise it is tokenized as syllables.
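
As a rough illustration of how the constructor might be used, the sketch below passes a config to an ICUTokenizer and prints each token with its type. It assumes the Lucene.Net.ICU package is referenced and that ICUTokenizer accepts a TextReader plus an ICUTokenizerConfig; the class name DefaultConfigDemo and the sample text are made up for this example.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical demo class; not part of Lucene.NET.
public static class DefaultConfigDemo
{
    public static void Main()
    {
        // cjkAsWords: true     -> CJK text is segmented with a dictionary.
        // myanmarAsWords: true -> Myanmar text is segmented with a dictionary.
        var config = new DefaultICUTokenizerConfig(cjkAsWords: true, myanmarAsWords: true);

        // Arbitrary sample text, chosen only for illustration.
        using var reader = new StringReader("Hello 世界 123");
        using var tokenizer = new ICUTokenizer(reader, config);

        // Attributes expose the term text and the token type of the current token.
        var term = tokenizer.AddAttribute<ICharTermAttribute>();
        var type = tokenizer.AddAttribute<ITypeAttribute>();

        // Standard TokenStream workflow: Reset, IncrementToken until false, End.
        tokenizer.Reset();
        while (tokenizer.IncrementToken())
        {
            Console.WriteLine($"{term} -> {type.Type}");
        }
        tokenizer.End();
    }
}
```

Passing false for cjkAsWords in the same sketch would fall back to UAX#29 default segmentation for CJK text, which makes the two modes easy to compare side by side.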

Fields

WORD_EMOJI

Token type for words that appear to be emoji sequences

Declaration

public static readonly string WORD_EMOJI

Field Value

Type	Description
string

WORD_HANGUL

Token type for words containing Korean hangul

Declaration

public static readonly string WORD_HANGUL

Field Value

Type	Description
string

WORD_HIRAGANA

Token type for words containing Japanese hiragana

Declaration

public static readonly string WORD_HIRAGANA

Field Value

Type	Description
string

WORD_IDEO

Token type for words containing ideographic characters

Declaration

public static readonly string WORD_IDEO

Field Value

Type	Description
string

WORD_KATAKANA

Token type for words containing Japanese katakana

Declaration

public static readonly string WORD_KATAKANA

Field Value

Type	Description
string

WORD_LETTER

Token type for words that contain letters

Declaration

public static readonly string WORD_LETTER

Field Value

Type	Description
string

WORD_NUMBER

Token type for words that appear to be numbers

Declaration

public static readonly string WORD_NUMBER

Field Value

Type	Description
string
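
The WORD_* fields are the values a tokenizer configured with this class reports as token types, so they can be compared against the type attribute of each token. The following is a minimal sketch under that assumption: it collects tokens typed as numbers or Hangul. The class name TokenTypeDemo and the sample text are hypothetical, and the Lucene.Net.ICU package is assumed to be referenced.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical demo class; not part of Lucene.NET.
public static class TokenTypeDemo
{
    public static void Main()
    {
        var config = new DefaultICUTokenizerConfig(true, true);

        // Arbitrary sample text mixing Latin letters, digits, and Hangul.
        using var reader = new StringReader("Order 66 shipped to 서울");
        using var tokenizer = new ICUTokenizer(reader, config);

        var term = tokenizer.AddAttribute<ICharTermAttribute>();
        var type = tokenizer.AddAttribute<ITypeAttribute>();

        var numbers = new List<string>();
        var hangul = new List<string>();

        tokenizer.Reset();
        while (tokenizer.IncrementToken())
        {
            // Compare the reported type against the public constants
            // instead of hard-coding the underlying string values.
            if (type.Type == DefaultICUTokenizerConfig.WORD_NUMBER)
            {
                numbers.Add(term.ToString());
            }
            else if (type.Type == DefaultICUTokenizerConfig.WORD_HANGUL)
            {
                hangul.Add(term.ToString());
            }
        }
        tokenizer.End();

        Console.WriteLine("Numbers: " + string.Join(", ", numbers));
        Console.WriteLine("Hangul:  " + string.Join(", ", hangul));
    }
}
```

Comparing against the constants keeps calling code independent of the literal strings behind them, which matters for an API marked experimental.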

Properties

CombineCJ

true if Han, Hiragana, and Katakana scripts should all be returned as Japanese

Declaration

public override bool CombineCJ { get; }

Property Value

Type	Description
bool

Overrides

ICUTokenizerConfig.CombineCJ
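
A small sketch for inspecting this property on two differently configured instances. It assumes, as in the upstream Java implementation, that the value follows the cjkAsWords constructor argument; this page does not state that explicitly, so treat it as an assumption to verify. The class name CombineCJDemo is hypothetical.

```csharp
using System;
using Lucene.Net.Analysis.Icu.Segmentation;

// Hypothetical demo class; not part of Lucene.NET.
public static class CombineCJDemo
{
    public static void Main()
    {
        // Dictionary-based CJK segmentation enabled.
        var withDictionary = new DefaultICUTokenizerConfig(cjkAsWords: true, myanmarAsWords: true);

        // UAX#29 default segmentation for CJK text.
        var uax29Only = new DefaultICUTokenizerConfig(cjkAsWords: false, myanmarAsWords: true);

        // Assumption: CombineCJ mirrors the cjkAsWords argument.
        Console.WriteLine($"cjkAsWords=true:  CombineCJ = {withDictionary.CombineCJ}");
        Console.WriteLine($"cjkAsWords=false: CombineCJ = {uax29Only.CombineCJ}");
    }
}
```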

Methods

GetBreakIterator(int)

Returns a break iterator capable of processing the given script.

Declaration

public override RuleBasedBreakIterator GetBreakIterator(int script)

Parameters

Type	Name	Description
int	script

Returns

Type	Description
RuleBasedBreakIterator

Overrides

ICUTokenizerConfig.GetBreakIterator(int)

GetType(int, int)

Returns a token type value for the given script and BreakIterator rule status.

Declaration

public override string GetType(int script, int ruleStatus)

Parameters

Type	Name	Description
int	script
int	ruleStatus

Returns

Type	Description
string

Overrides

ICUTokenizerConfig.GetType(int, int)
