    Namespace Lucene.Net.Analysis.Icu.Segmentation

    Provides a tokenizer that breaks text into words using the Unicode Text Segmentation algorithm (UAX #29).

    Classes

    DefaultICUTokenizerConfig

    Default ICUTokenizerConfig that is generally applicable to many languages.

    ICUTokenizer

    Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).

    Text is first split at script boundaries. Each single-script run is then segmented by the BreakIterator, and its tokens are typed, as provided by the ICUTokenizerConfig.

    Note

    This API is experimental and might change in incompatible ways in the next release.
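A minimal usage sketch for ICUTokenizer follows. It assumes the Lucene.Net.ICU package is referenced, that ICUTokenizer exposes a (TextReader, ICUTokenizerConfig) constructor, and that DefaultICUTokenizerConfig takes two boolean arguments (cjkAsWords, myanmarAsWords). These signatures are not stated on this page, so check them against the class reference before relying on them. The sample text is arbitrary.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;

public static class IcuTokenizerSketch
{
    public static void Main()
    {
        // Mixed-script sample input (Latin, Thai, digits); any text works here.
        using (var reader = new StringReader("Testing สวัสดี 123"))
        {
            // Assumed arguments: cjkAsWords = true, myanmarAsWords = true.
            var config = new DefaultICUTokenizerConfig(true, true);

            using (var tokenizer = new ICUTokenizer(reader, config))
            {
                // The term attribute exposes the text of the current token.
                var term = tokenizer.AddAttribute<ICharTermAttribute>();

                // Standard TokenStream workflow: Reset, IncrementToken loop, End.
                tokenizer.Reset();
                while (tokenizer.IncrementToken())
                {
                    Console.WriteLine(term.ToString());
                }
                tokenizer.End();
            }
        }
    }
}
```

Each printed line is one token. Because text is split at script boundaries before segmentation, the Latin and Thai portions of the input should never end up in the same token.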

    ICUTokenizerConfig

    Class that allows for tailored Unicode Text Segmentation on a per-writing-system basis.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    ICUTokenizerFactory

    Factory for ICUTokenizer. Words are broken across script boundaries, then segmented according to the ICU4N.Text.BreakIterator and typing provided by the DefaultICUTokenizerConfig.

    Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.