Class DefaultICUTokenizerConfig
Default ICUTokenizerConfig that is generally applicable to many languages.
Namespace: Lucene.Net.Analysis.Icu.Segmentation
Assembly: Lucene.Net.ICU.dll
Syntax
public class DefaultICUTokenizerConfig : ICUTokenizerConfig
Remarks
Generally tokenizes Unicode text according to UAX#29 (BreakIterator.GetWordInstance(ULocale.ROOT)), but with the following tailorings:
- Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.
Constructors
DefaultICUTokenizerConfig(Boolean, Boolean)
Creates a new config. The object itself is lightweight, but the break iterators are initialized the first time the class is referenced.
Declaration
public DefaultICUTokenizerConfig(bool cjkAsWords, bool myanmarAsWords)
Parameters
Type | Name | Description |
---|---|---|
System.Boolean | cjkAsWords | true if CJK text should undergo dictionary-based segmentation; otherwise the text is segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC. |
System.Boolean | myanmarAsWords | true if Myanmar text should undergo dictionary-based segmentation; otherwise it is tokenized as syllables. |
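The following is a minimal sketch of how this constructor might be used, assuming the config is passed to the `ICUTokenizer(TextReader, ICUTokenizerConfig)` constructor from the same namespace and that token text and type are read through `ICharTermAttribute` and `ITypeAttribute`. The class name `ConfigDemo` and the sample text are hypothetical and chosen only for illustration.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;

public static class ConfigDemo // hypothetical name, for illustration only
{
    public static void Main()
    {
        // Request dictionary-based segmentation for both CJK and Myanmar text.
        var config = new DefaultICUTokenizerConfig(cjkAsWords: true, myanmarAsWords: true);

        // Assumption: the sample text is arbitrary mixed-script input.
        using (var tokenizer = new ICUTokenizer(new StringReader("Lucene.NET 全文検索"), config))
        {
            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            var type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // Print each token together with the type assigned by the config.
                Console.WriteLine($"{term} [{type.Type}]");
            }
            tokenizer.End();
        }
    }
}
```

Passing `false` for `cjkAsWords` instead would leave CJK text to the UAX#29 default rules, so the token boundaries and types printed for the same input may differ.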
Fields
WORD_HANGUL
Token type for words containing Korean hangul
Declaration
public static readonly string WORD_HANGUL
Field Value
Type | Description |
---|---|
System.String |
WORD_HIRAGANA
Token type for words containing Japanese hiragana
Declaration
public static readonly string WORD_HIRAGANA
Field Value
Type | Description |
---|---|
System.String |
WORD_IDEO
Token type for words containing ideographic characters
Declaration
public static readonly string WORD_IDEO
Field Value
Type | Description |
---|---|
System.String |
WORD_KATAKANA
Token type for words containing Japanese katakana
Declaration
public static readonly string WORD_KATAKANA
Field Value
Type | Description |
---|---|
System.String |
WORD_LETTER
Token type for words that contain letters
Declaration
public static readonly string WORD_LETTER
Field Value
Type | Description |
---|---|
System.String |
WORD_NUMBER
Token type for words that appear to be numbers
Declaration
public static readonly string WORD_NUMBER
Field Value
Type | Description |
---|---|
System.String |
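As a rough illustration of how these token type constants might be consumed, the sketch below compares the type reported by `ITypeAttribute` against `WORD_NUMBER` to pick out numeric tokens. It assumes the tokenizer is an `ICUTokenizer` built with this config; the class name `NumberTokenDemo` and the input string are hypothetical.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;

public static class NumberTokenDemo // hypothetical name, for illustration only
{
    public static void Main()
    {
        var config = new DefaultICUTokenizerConfig(cjkAsWords: true, myanmarAsWords: true);

        using (var tokenizer = new ICUTokenizer(new StringReader("Room 42 costs 1250 yen"), config))
        {
            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            var type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // Keep only tokens whose type matches the WORD_NUMBER constant.
                if (type.Type == DefaultICUTokenizerConfig.WORD_NUMBER)
                {
                    Console.WriteLine(term.ToString());
                }
            }
            tokenizer.End();
        }
    }
}
```

The same comparison pattern applies to the other constants, such as `WORD_IDEO` or `WORD_HANGUL`, when a downstream filter needs to treat scripts differently.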
Properties
CombineCJ
Declaration
public override bool CombineCJ { get; }
Property Value
Type | Description |
---|---|
System.Boolean |
Overrides
ICUTokenizerConfig.CombineCJ
Methods
GetBreakIterator(Int32)
Declaration
public override BreakIterator GetBreakIterator(int script)
Parameters
Type | Name | Description |
---|---|---|
System.Int32 | script |
Returns
Type | Description |
---|---|
ICU4N.Text.BreakIterator |
Overrides
ICUTokenizerConfig.GetBreakIterator(Int32)
GetType(Int32, Int32)
Declaration
public override string GetType(int script, int ruleStatus)
Parameters
Type | Name | Description |
---|---|---|
System.Int32 | script | |
System.Int32 | ruleStatus |
Returns
Type | Description |
---|---|
System.String |