    Class DefaultICUTokenizerConfig

    Default ICUTokenizerConfig that is generally applicable to many languages.

    Inheritance
    System.Object
    ICUTokenizerConfig
    DefaultICUTokenizerConfig
    Namespace: Lucene.Net.Analysis.Icu.Segmentation
    Assembly: Lucene.Net.ICU.dll
    Syntax
    public class DefaultICUTokenizerConfig : ICUTokenizerConfig
    Remarks

    Generally tokenizes Unicode text according to UAX#29 (BreakIterator.GetWordInstance(ULocale.ROOT)), but with the following tailorings:

    • Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.

    This is a Lucene.NET EXPERIMENTAL API; use at your own risk.

    Constructors

    DefaultICUTokenizerConfig(Boolean, Boolean)

    Creates a new config. This object is lightweight, but the first time the class is referenced, its break iterators will be initialized.

    Declaration
    public DefaultICUTokenizerConfig(bool cjkAsWords, bool myanmarAsWords)
    Parameters
    Type Name Description
    System.Boolean cjkAsWords

    true if CJK text should undergo dictionary-based segmentation; otherwise text will be segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC.

    System.Boolean myanmarAsWords

    true if Myanmar text should undergo dictionary-based segmentation; otherwise it will be tokenized as syllables.
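As a rough illustration of how this constructor might be used, the sketch below builds a config and hands it to an ICUTokenizer, then prints each token with its type. It assumes the ICUTokenizer(TextReader, ICUTokenizerConfig) constructor and the ICharTermAttribute and ITypeAttribute interfaces from Lucene.Net.Analysis.TokenAttributes, none of which are documented on this page. The sample text and the DefaultConfigExample class name are made up for the example.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;

public static class DefaultConfigExample
{
    public static void Main()
    {
        // cjkAsWords: true     -> dictionary-based segmentation for CJK text.
        // myanmarAsWords: true -> dictionary-based segmentation for Myanmar text.
        var config = new DefaultICUTokenizerConfig(cjkAsWords: true, myanmarAsWords: true);

        // Assumption: ICUTokenizer accepts a TextReader plus a config.
        using (var reader = new StringReader("Lucene.NET handles 東京都 and ภาษาไทย"))
        using (var tokenizer = new ICUTokenizer(reader, config))
        {
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            ITypeAttribute type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // Print the token text followed by the type assigned by the config.
                Console.WriteLine($"{term} [{type.Type}]");
            }
            tokenizer.End();
        }
    }
}
```

Passing false for cjkAsWords should leave CJK text to the UAX#29 defaults described above, so comparing the output of both settings on the same input is a quick way to see the difference.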

    Fields

    WORD_HANGUL

    Token type for words containing Korean hangul

    Declaration
    public static readonly string WORD_HANGUL
    Field Value
    Type Description
    System.String

    WORD_HIRAGANA

    Token type for words containing Japanese hiragana

    Declaration
    public static readonly string WORD_HIRAGANA
    Field Value
    Type Description
    System.String

    WORD_IDEO

    Token type for words containing ideographic characters

    Declaration
    public static readonly string WORD_IDEO
    Field Value
    Type Description
    System.String

    WORD_KATAKANA

    Token type for words containing Japanese katakana

    Declaration
    public static readonly string WORD_KATAKANA
    Field Value
    Type Description
    System.String

    WORD_LETTER

    Token type for words that contain letters

    Declaration
    public static readonly string WORD_LETTER
    Field Value
    Type Description
    System.String

    WORD_NUMBER

    Token type for words that appear to be numbers

    Declaration
    public static readonly string WORD_NUMBER
    Field Value
    Type Description
    System.String
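Because these fields are declared static readonly rather than const, they cannot appear as case labels in a switch statement; ordinary equality checks work instead. The helper below is a hypothetical sketch (TokenTypeLabels and Describe are not part of Lucene.NET) that maps a token type string to a short label. It deliberately does not assume what the underlying string values are.

```csharp
using Lucene.Net.Analysis.Icu.Segmentation;

public static class TokenTypeLabels
{
    // Hypothetical helper: turns a token type (for example, the value read
    // from a type attribute) into a human-readable label.
    public static string Describe(string tokenType)
    {
        if (tokenType == DefaultICUTokenizerConfig.WORD_IDEO) return "ideographic";
        if (tokenType == DefaultICUTokenizerConfig.WORD_HIRAGANA) return "hiragana";
        if (tokenType == DefaultICUTokenizerConfig.WORD_KATAKANA) return "katakana";
        if (tokenType == DefaultICUTokenizerConfig.WORD_HANGUL) return "hangul";
        if (tokenType == DefaultICUTokenizerConfig.WORD_NUMBER) return "number";
        if (tokenType == DefaultICUTokenizerConfig.WORD_LETTER) return "letter";
        return "other";
    }
}
```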

    Properties

    CombineCJ

    Declaration
    public override bool CombineCJ { get; }
    Property Value
    Type Description
    System.Boolean
    Overrides
    ICUTokenizerConfig.CombineCJ

    Methods

    GetBreakIterator(Int32)

    Declaration
    public override BreakIterator GetBreakIterator(int script)
    Parameters
    Type Name Description
    System.Int32 script
    Returns
    Type Description
    BreakIterator
    Overrides
    ICUTokenizerConfig.GetBreakIterator(Int32)

    GetType(Int32, Int32)

    Declaration
    public override string GetType(int script, int ruleStatus)
    Parameters
    Type Name Description
    System.Int32 script
    System.Int32 ruleStatus
    Returns
    Type Description
    System.String
    Overrides
    ICUTokenizerConfig.GetType(Int32, Int32)
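Since the class is not sealed and GetType(Int32, Int32) is an override of a virtual member, a subclass can adjust the token types while reusing the default break iterators. The following is only a sketch under that assumption: NumbersAsLettersConfig is a hypothetical class, and relabeling numbers as letters is an arbitrary choice made for illustration.

```csharp
using Lucene.Net.Analysis.Icu.Segmentation;

// Hypothetical subclass: keeps the default segmentation but reports
// number tokens with the letter token type.
public class NumbersAsLettersConfig : DefaultICUTokenizerConfig
{
    public NumbersAsLettersConfig(bool cjkAsWords, bool myanmarAsWords)
        : base(cjkAsWords, myanmarAsWords)
    {
    }

    public override string GetType(int script, int ruleStatus)
    {
        // Ask the default config first, then relabel only the number type.
        string type = base.GetType(script, ruleStatus);
        return type == WORD_NUMBER ? WORD_LETTER : type;
    }
}
```

Delegating to the base implementation first keeps every other script and rule status mapped exactly as the default config maps it.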
    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)