    Class DefaultICUTokenizerConfig

    Default ICUTokenizerConfig that is generally applicable to many languages.

    Inheritance
    object
    ICUTokenizerConfig
    DefaultICUTokenizerConfig
    Inherited Members
    ICUTokenizerConfig.EMOJI_SEQUENCE_STATUS
    object.Equals(object)
    object.Equals(object, object)
    object.GetHashCode()
    object.GetType()
    object.MemberwiseClone()
    object.ReferenceEquals(object, object)
    object.ToString()
    Namespace: Lucene.Net.Analysis.Icu.Segmentation
    Assembly: Lucene.Net.ICU.dll
    Syntax
    public class DefaultICUTokenizerConfig : ICUTokenizerConfig
    Remarks

    Generally tokenizes Unicode text according to UAX#29 (BreakIterator.GetWordInstance(ULocale.ROOT)), but with the following tailorings:

    • Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Constructors

    DefaultICUTokenizerConfig(bool, bool)

    Creates a new config. This object is lightweight, but the first time the class is referenced, the break iterators will be initialized.

    Declaration
    public DefaultICUTokenizerConfig(bool cjkAsWords, bool myanmarAsWords)
    Parameters
    Type Name Description
    bool cjkAsWords

    true if CJK text should undergo dictionary-based segmentation; otherwise text will be segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC.

    bool myanmarAsWords

    true if Myanmar text should undergo dictionary-based segmentation; otherwise it will be tokenized as syllables.
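
The following is a minimal sketch of passing this config to a tokenizer and printing each token with its type. It assumes the `ICUTokenizer(TextReader, ICUTokenizerConfig)` constructor from the same namespace and the standard `ICharTermAttribute` and `ITypeAttribute` token attributes; the class name `DefaultConfigExample` and the sample text are illustrative only.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;

public static class DefaultConfigExample
{
    public static void Main()
    {
        // Both flags request dictionary-based segmentation
        // (for CJK text and for Myanmar text, respectively).
        var config = new DefaultICUTokenizerConfig(cjkAsWords: true, myanmarAsWords: true);

        using (var reader = new StringReader("Lucene.Net 検索エンジン 2024"))
        using (var tokenizer = new ICUTokenizer(reader, config))
        {
            // Attributes are populated anew on every IncrementToken() call.
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            ITypeAttribute type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                Console.WriteLine($"{term} [{type.Type}]");
            }
            tokenizer.End();
        }
    }
}
```

Setting `cjkAsWords` to false in this sketch should fall back to the UAX#29 defaults for CJK text, which makes it easy to compare the two segmentations on the same input.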

    Fields

    WORD_EMOJI

    Token type for words that appear to be emoji sequences

    Declaration
    public static readonly string WORD_EMOJI
    Field Value
    Type Description
    string

    WORD_HANGUL

    Token type for words containing Korean hangul

    Declaration
    public static readonly string WORD_HANGUL
    Field Value
    Type Description
    string

    WORD_HIRAGANA

    Token type for words containing Japanese hiragana

    Declaration
    public static readonly string WORD_HIRAGANA
    Field Value
    Type Description
    string

    WORD_IDEO

    Token type for words containing ideographic characters

    Declaration
    public static readonly string WORD_IDEO
    Field Value
    Type Description
    string

    WORD_KATAKANA

    Token type for words containing Japanese katakana

    Declaration
    public static readonly string WORD_KATAKANA
    Field Value
    Type Description
    string

    WORD_LETTER

    Token type for words that contain letters

    Declaration
    public static readonly string WORD_LETTER
    Field Value
    Type Description
    string

    WORD_NUMBER

    Token type for words that appear to be numbers

    Declaration
    public static readonly string WORD_NUMBER
    Field Value
    Type Description
    string
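
The token type fields above can be compared against the type reported for each token. The sketch below collects every token that is not tagged WORD_NUMBER. It assumes that `ICUTokenizer` fills `ITypeAttribute` from the config's GetType result; the helper name `NonNumericTokens` and the class name `TokenTypeExample` are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;

public static class TokenTypeExample
{
    // Hypothetical helper: returns the text of every token whose type
    // differs from DefaultICUTokenizerConfig.WORD_NUMBER.
    public static IList<string> NonNumericTokens(string text)
    {
        var result = new List<string>();
        var config = new DefaultICUTokenizerConfig(cjkAsWords: true, myanmarAsWords: true);

        using (var tokenizer = new ICUTokenizer(new StringReader(text), config))
        {
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            ITypeAttribute type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // Compare against the public constant rather than a literal,
                // since the literal value is not documented on this page.
                if (type.Type != DefaultICUTokenizerConfig.WORD_NUMBER)
                {
                    result.Add(term.ToString());
                }
            }
            tokenizer.End();
        }

        return result;
    }
}
```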

    Properties

    CombineCJ

    true if Han, Hiragana, and Katakana scripts should all be returned as Japanese

    Declaration
    public override bool CombineCJ { get; }
    Property Value
    Type Description
    bool
    Overrides
    ICUTokenizerConfig.CombineCJ

    Methods

    GetBreakIterator(int)

    Returns a break iterator capable of processing a given script.

    Declaration
    public override RuleBasedBreakIterator GetBreakIterator(int script)
    Parameters
    Type Name Description
    int script
    Returns
    Type Description
    RuleBasedBreakIterator
    Overrides
    ICUTokenizerConfig.GetBreakIterator(int)

    GetType(int, int)

    Returns a token type value for a given script and BreakIterator rule status.

    Declaration
    public override string GetType(int script, int ruleStatus)
    Parameters
    Type Name Description
    int script
    int ruleStatus
    Returns
    Type Description
    string
    Overrides
    ICUTokenizerConfig.GetType(int, int)
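
Because the class is not sealed and GetType(int, int) is an override, a subclass can adjust the reported token type. The sketch below relabels number tokens and leaves every other type unchanged. The class name `NumberTaggingConfig` and the label `<CUSTOM_NUM>` are hypothetical and not part of Lucene.NET.

```csharp
using Lucene.Net.Analysis.Icu.Segmentation;

// Hypothetical subclass: reuses the default break iterators and
// only changes the type string reported for number tokens.
public sealed class NumberTaggingConfig : DefaultICUTokenizerConfig
{
    public NumberTaggingConfig(bool cjkAsWords, bool myanmarAsWords)
        : base(cjkAsWords, myanmarAsWords)
    {
    }

    public override string GetType(int script, int ruleStatus)
    {
        // Let the default config classify the token first.
        string type = base.GetType(script, ruleStatus);

        // Swap in the custom label only for tokens the default marks as numbers.
        return type == WORD_NUMBER ? "<CUSTOM_NUM>" : type;
    }
}
```

Delegating to `base.GetType` first keeps the default classification for all scripts, so the subclass stays correct if the default mapping changes in a later release.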

    Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.