Lucene.Net.ICU

This module exposes functionality from ICU to Apache Lucene through ICU4N, a .NET library that enhances .NET's internationalization support by improving performance, keeping current with the Unicode Standard, and providing richer APIs.

Note

Because the .NET platform doesn't provide a BreakIterator class (or anything similar), the functionality that depends on it was consolidated from Java Lucene's analyzers-icu package, Lucene.Net.Analysis.Common, and Lucene.Net.Highlighter into this unified package.

Warning

ICU4N's BreakIterator has customizable rules, but its default behavior is not the same as that of the BreakIterator in the JDK. Features of this package outside of the Lucene.Net.Analysis.Icu namespace will therefore behave differently than they do in Java Lucene, and the rules may need some tweaking to fit your needs. See the Break Rules ICU documentation for details on how to customize ICU4N.Text.RuleBasedBreakIterator.

This module exposes the following functionality:

- Text Analysis: For an introduction to Lucene's analysis API, see the <xref:Lucene.Net.Analysis> package documentation.
  - Text Segmentation: Tokenizes text based on properties and rules defined in Unicode.
  - Collation: Compares strings according to the conventions and standards of a particular language, region, or country.
  - Normalization: Converts text to a unique, equivalent form.
  - Case Folding: Removes case distinctions with Unicode's Default Caseless Matching algorithm.
  - Search Term Folding: Removes distinctions (such as accent marks) between similar characters for a loose or fuzzy search.
  - Text Transformation: Transforms Unicode text in a context-sensitive fashion, e.g. mapping Traditional to Simplified Chinese.
  - Thai Language Analysis
- Unicode Highlighter Support
  - Postings Highlighter: Highlighter implementation that uses offsets from postings lists.
  - Vector Highlighter: An implementation of IBoundaryScanner for use with the vector highlighter in the Lucene.Net.Highlighter module.
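
The Normalization, Case Folding, and Search Term Folding items above are easier to picture with concrete strings. The following is a minimal, language-neutral sketch using Python's standard `unicodedata` module. It is not the Lucene.Net.ICU API, the helper names (`normalize`, `case_fold`, `fold_for_search`) are hypothetical, and stripping combining marks after decomposition is only a rough approximation of the much broader folding that ICU performs.

```python
import unicodedata


def normalize(text: str) -> str:
    """Convert text to one equivalent form (NFKC is assumed here)."""
    return unicodedata.normalize("NFKC", text)


def case_fold(text: str) -> str:
    """Remove case distinctions using Unicode full case folding."""
    return text.casefold()


def fold_for_search(text: str) -> str:
    """Approximate search-term folding (assumption: accents only).

    Decompose, drop combining marks such as accents, recompose,
    then fold case so that similar terms compare equal.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return unicodedata.normalize("NFKC", without_marks).casefold()


if __name__ == "__main__":
    # The "fi" ligature (U+FB01) and "e" + combining acute (U+0301)
    # both collapse to their ordinary composed equivalents.
    print(normalize("\ufb01ance\u0301"))   # fiancé
    print(case_fold("Straße"))             # strasse
    print(fold_for_search("Résumé"))       # resume
```

With this kind of folding applied at both index time and query time, a search for "resume" can match a document containing "Résumé", which is the loose matching that the Search Term Folding item describes.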