Class DefaultICUTokenizerConfig
Default ICUTokenizerConfig that is generally applicable to many languages.
Namespace: Lucene.Net.Analysis.Icu.Segmentation
Assembly: Lucene.Net.ICU.dll
Syntax
public class DefaultICUTokenizerConfig : ICUTokenizerConfig
Remarks
Generally tokenizes Unicode text according to UAX#29 (BreakIterator.GetWordInstance(ULocale.ROOT)), but with the following tailorings:
- Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.
Note
This API is experimental and might change in incompatible ways in the next release.
Constructors
DefaultICUTokenizerConfig(bool, bool)
Creates a new config. This object is lightweight, but the first time the class is referenced, the break iterators will be initialized.
Declaration
public DefaultICUTokenizerConfig(bool cjkAsWords, bool myanmarAsWords)
Parameters
Type | Name | Description |
---|---|---|
bool | cjkAsWords | true if CJK text should undergo dictionary-based segmentation, otherwise text will be segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC. |
bool | myanmarAsWords | true if Myanmar text should undergo dictionary-based segmentation, otherwise it will be tokenized as syllables. |
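As an illustration of how the two flags are typically supplied, the following is a minimal sketch that passes a config to an ICUTokenizer and prints each token with its type. It assumes the Lucene.Net.ICU package is referenced and that ICUTokenizer exposes a (TextReader, ICUTokenizerConfig) constructor; the class name DefaultConfigExample and the sample text are hypothetical.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;

public static class DefaultConfigExample
{
    public static void Main()
    {
        // cjkAsWords: true    -> CJK text is segmented with a dictionary.
        // myanmarAsWords: true -> Myanmar text is segmented with a dictionary.
        var config = new DefaultICUTokenizerConfig(cjkAsWords: true, myanmarAsWords: true);

        // Sample input is arbitrary; any mixed-script text will do.
        using (var reader = new StringReader("Lucene.NET 検索エンジン 2024"))
        using (var tokenizer = new ICUTokenizer(reader, config))
        {
            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            var type = tokenizer.AddAttribute<ITypeAttribute>();

            // Standard TokenStream consumption: Reset, IncrementToken loop, End.
            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // type.Type holds one of the WORD_* token type strings.
                Console.WriteLine($"{term} [{type.Type}]");
            }
            tokenizer.End();
        }
    }
}
```

The exact tokens and types printed depend on the ICU data in use, so no expected output is shown here.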
Fields
WORD_EMOJI
Token type for words that appear to be emoji sequences
Declaration
public static readonly string WORD_EMOJI
Field Value
Type | Description |
---|---|
string |
WORD_HANGUL
Token type for words containing Korean hangul
Declaration
public static readonly string WORD_HANGUL
Field Value
Type | Description |
---|---|
string |
WORD_HIRAGANA
Token type for words containing Japanese hiragana
Declaration
public static readonly string WORD_HIRAGANA
Field Value
Type | Description |
---|---|
string |
WORD_IDEO
Token type for words containing ideographic characters
Declaration
public static readonly string WORD_IDEO
Field Value
Type | Description |
---|---|
string |
WORD_KATAKANA
Token type for words containing Japanese katakana
Declaration
public static readonly string WORD_KATAKANA
Field Value
Type | Description |
---|---|
string |
WORD_LETTER
Token type for words that contain letters
Declaration
public static readonly string WORD_LETTER
Field Value
Type | Description |
---|---|
string |
WORD_NUMBER
Token type for words that appear to be numbers
Declaration
public static readonly string WORD_NUMBER
Field Value
Type | Description |
---|---|
string |
Properties
CombineCJ
true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
Declaration
public override bool CombineCJ { get; }
Property Value
Type | Description |
---|---|
bool |
Overrides
ICUTokenizerConfig.CombineCJ
Methods
GetBreakIterator(int)
Returns a break iterator capable of processing the given script.
Declaration
public override RuleBasedBreakIterator GetBreakIterator(int script)
Parameters
Type | Name | Description |
---|---|---|
int | script |
Returns
Type | Description |
---|---|
RuleBasedBreakIterator |
Overrides
ICUTokenizerConfig.GetBreakIterator(int)
GetType(int, int)
Returns a token type value for the given script and BreakIterator rule status.
Declaration
public override string GetType(int script, int ruleStatus)
Parameters
Type | Name | Description |
---|---|---|
int | script | |
int | ruleStatus |
Returns
Type | Description |
---|---|
string |
Overrides
ICUTokenizerConfig.GetType(int, int)
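Because the class is not sealed and GetType(int, int) is an override, a subclass can adjust the token types it reports. The following is a hedged sketch of that pattern, not part of the library: the class name HangulAsIdeoConfig and the choice to relabel hangul tokens are purely illustrative assumptions.

```csharp
using Lucene.Net.Analysis.Icu.Segmentation;

// Hypothetical subclass: reports hangul words under the ideographic token type
// and leaves every other type as the default config computes it.
public sealed class HangulAsIdeoConfig : DefaultICUTokenizerConfig
{
    public HangulAsIdeoConfig()
        : base(cjkAsWords: true, myanmarAsWords: true)
    {
    }

    public override string GetType(int script, int ruleStatus)
    {
        // Ask the default implementation first, then remap a single type.
        string type = base.GetType(script, ruleStatus);
        return type == WORD_HANGUL ? WORD_IDEO : type;
    }
}
```

An instance of such a subclass would be passed to the tokenizer in the same way as a DefaultICUTokenizerConfig.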