Class DefaultICUTokenizerConfig
Default ICUTokenizerConfig that is generally applicable to many languages.
Namespace: Lucene.Net.Analysis.Icu.Segmentation
Assembly: Lucene.Net.ICU.dll
Syntax
public class DefaultICUTokenizerConfig : ICUTokenizerConfig
  Remarks
Generally tokenizes Unicode text according to UAX#29 (BreakIterator.GetWordInstance(ULocale.ROOT)), but with the following tailorings:
- Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.
 
Constructors
DefaultICUTokenizerConfig(Boolean, Boolean)
Creates a new config. The object itself is lightweight, but the break iterators are initialized the first time the class is referenced.
Declaration
public DefaultICUTokenizerConfig(bool cjkAsWords, bool myanmarAsWords)
  Parameters
| Type | Name | Description |
|---|---|---|
| System.Boolean | cjkAsWords | true if CJK text should undergo dictionary-based segmentation; otherwise the text is segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC. |
| System.Boolean | myanmarAsWords | true if Myanmar text should undergo dictionary-based segmentation; otherwise it is tokenized as syllables. |
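
As a minimal sketch of how this constructor might be used, the example below builds a config and passes it to a tokenizer, printing each token with its type. It assumes that `ICUTokenizer` in the same namespace accepts a `TextReader` and an `ICUTokenizerConfig`, and that the term and type attributes live in `Lucene.Net.Analysis.TokenAttributes`; neither is documented on this page. The class name `DefaultConfigExample` and the sample text are purely illustrative.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical example class; not part of Lucene.Net.
public static class DefaultConfigExample
{
    public static void Main()
    {
        // cjkAsWords: true requests dictionary-based segmentation for CJK text.
        // myanmarAsWords: true requests dictionary-based segmentation for Myanmar text.
        var config = new DefaultICUTokenizerConfig(cjkAsWords: true, myanmarAsWords: true);

        // Assumption: ICUTokenizer has a (TextReader, ICUTokenizerConfig) constructor.
        using (var tokenizer = new ICUTokenizer(new StringReader("Lucene.Net 検索エンジン 2024"), config))
        {
            // Attributes expose the current token's text and its type string.
            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            var type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                Console.WriteLine($"{term} [{type.Type}]");
            }
            tokenizer.End();
        }
    }
}
```

The type printed for each token should correspond to one of the `WORD_*` fields listed under Fields. Passing `false` for `cjkAsWords` would instead leave CJK text to the UAX#29 default rules.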
      
Fields
WORD_HANGUL
Token type for words containing Korean hangul
Declaration
public static readonly string WORD_HANGUL
  Field Value
| Type | Description | 
|---|---|
| System.String | |
WORD_HIRAGANA
Token type for words containing Japanese hiragana
Declaration
public static readonly string WORD_HIRAGANA
  Field Value
| Type | Description | 
|---|---|
| System.String | |
WORD_IDEO
Token type for words containing ideographic characters
Declaration
public static readonly string WORD_IDEO
  Field Value
| Type | Description | 
|---|---|
| System.String | |
WORD_KATAKANA
Token type for words containing Japanese katakana
Declaration
public static readonly string WORD_KATAKANA
  Field Value
| Type | Description | 
|---|---|
| System.String | |
WORD_LETTER
Token type for words that contain letters
Declaration
public static readonly string WORD_LETTER
  Field Value
| Type | Description | 
|---|---|
| System.String | |
WORD_NUMBER
Token type for words that appear to be numbers
Declaration
public static readonly string WORD_NUMBER
  Field Value
| Type | Description | 
|---|---|
| System.String | |
Properties
CombineCJ
Declaration
public override bool CombineCJ { get; }
  Property Value
| Type | Description | 
|---|---|
| System.Boolean | |
Overrides
Methods
GetBreakIterator(Int32)
Declaration
public override BreakIterator GetBreakIterator(int script)
  Parameters
| Type | Name | Description | 
|---|---|---|
| System.Int32 | script | |
Returns
| Type | Description | 
|---|---|
| ICU4N.Text.BreakIterator | |
Overrides
GetType(Int32, Int32)
Declaration
public override string GetType(int script, int ruleStatus)
  Parameters
| Type | Name | Description | 
|---|---|---|
| System.Int32 | script | |
| System.Int32 | ruleStatus | |
Returns
| Type | Description | 
|---|---|
| System.String | |