Namespace Lucene.Net.Analysis.Icu.Segmentation

Tokenizer that breaks text into words with the Unicode Text Segmentation algorithm.

Classes

DefaultICUTokenizerConfig

Default ICUTokenizerConfig that is generally applicable to many languages.

ICUTokenizer

Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).

Words are first broken at script boundaries, then segmented according to the BreakIterator and token typing provided by the ICUTokenizerConfig.

Note

This API is experimental and might change in incompatible ways in the next release.
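
As a rough illustration of the behavior described above, the following minimal sketch runs an ICUTokenizer over a mixed-script string and collects the resulting terms. It assumes the ICU analysis assembly (distributed as the Lucene.Net.ICU package) is referenced and that the single-argument constructor applies the default configuration. The class name IcuTokenizerSketch and the method Tokenize are hypothetical names introduced only for this example, and because the API is marked experimental, the signatures may differ between releases.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class IcuTokenizerSketch
{
    // Returns the term text of every token the ICUTokenizer emits for the input.
    public static List<string> Tokenize(string text)
    {
        var terms = new List<string>();

        using (var reader = new StringReader(text))
        using (var tokenizer = new ICUTokenizer(reader)) // assumed to use the default config
        {
            // The term attribute exposes the text of the current token.
            var term = tokenizer.AddAttribute<ICharTermAttribute>();

            tokenizer.Reset();                  // required before the first IncrementToken()
            while (tokenizer.IncrementToken())  // advances to the next token, false at end
            {
                terms.Add(term.ToString());
            }
            tokenizer.End();                    // finalizes end-of-stream state
        }

        return terms;
    }
}
```

Calling IcuTokenizerSketch.Tokenize("Hello мир") should yield one token per word, with the Latin and Cyrillic runs handled separately. To tailor segmentation per script, an ICUTokenizerConfig can be supplied through the two-argument constructor instead.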

ICUTokenizerConfig

Class that allows Unicode Text Segmentation to be tailored on a per-writing-system basis.

Note

This API is experimental and might change in incompatible ways in the next release.

ICUTokenizerFactory

Factory for ICUTokenizer. Words are broken across script boundaries, then segmented according to the ICU4N.Text.BreakIterator and typing provided by the DefaultICUTokenizerConfig.