Class ICUTokenizer
Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).
Words are first split at script boundaries, then segmented according to the BreakIterator and token typing provided by the ICUTokenizerConfig.
Note
This API is experimental and might change in incompatible ways in the next release.
Namespace: Lucene.Net.Analysis.Icu.Segmentation
Assembly: Lucene.Net.ICU.dll
Syntax
public sealed class ICUTokenizer : Tokenizer, IDisposable
Constructors
ICUTokenizer(AttributeSource.AttributeFactory, TextReader, ICUTokenizerConfig)
Construct a new ICUTokenizer that breaks text into words from the given System.IO.TextReader, using a tailored ICU4N.Text.BreakIterator configuration.
Declaration
public ICUTokenizer(AttributeSource.AttributeFactory factory, TextReader input, ICUTokenizerConfig config)
Parameters
| Type | Name | Description |
|---|---|---|
| Lucene.Net.Util.AttributeSource.AttributeFactory | factory | Lucene.Net.Util.AttributeSource.AttributeFactory to use. |
| System.IO.TextReader | input | System.IO.TextReader containing text to tokenize. |
| ICUTokenizerConfig | config | Tailored ICU4N.Text.BreakIterator configuration. |
ICUTokenizer(TextReader)
Construct a new ICUTokenizer that breaks text into words from the given System.IO.TextReader.
Declaration
public ICUTokenizer(TextReader input)
Parameters
| Type | Name | Description |
|---|---|---|
| System.IO.TextReader | input | System.IO.TextReader containing text to tokenize. |
Remarks
The default script-specific handling is used.
The default attribute factory is used.
ICUTokenizer(TextReader, ICUTokenizerConfig)
Construct a new ICUTokenizer that breaks text into words from the given System.IO.TextReader, using a tailored ICU4N.Text.BreakIterator configuration.
Declaration
public ICUTokenizer(TextReader input, ICUTokenizerConfig config)
Parameters
| Type | Name | Description |
|---|---|---|
| System.IO.TextReader | input | System.IO.TextReader containing text to tokenize. |
| ICUTokenizerConfig | config | Tailored ICU4N.Text.BreakIterator configuration. |
Remarks
The default attribute factory is used.
Methods
End()
Declaration
public override void End()
Overrides
Lucene.Net.Analysis.TokenStream.End()
IncrementToken()
Declaration
public override bool IncrementToken()
Returns
| Type | Description |
|---|---|
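| System.Boolean | false when the end of the stream is reached; true otherwise. |
Overrides
Lucene.Net.Analysis.TokenStream.IncrementToken()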
Reset()
Declaration
public override void Reset()
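Example
The methods above are normally called in a fixed order: Reset(), then IncrementToken() until it returns false, then End(), and finally disposal of the tokenizer. The following is a minimal sketch of that workflow using the ICUTokenizer(TextReader) constructor. It assumes the Lucene.Net and Lucene.Net.ICU packages are referenced; the class name ICUTokenizerExample and the sample text are made up for illustration, and the tokens produced depend on the ICU data in use.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical example class; not part of Lucene.NET.
public static class ICUTokenizerExample
{
    public static void Main()
    {
        // Illustrative mixed-script input.
        string text = "Hello мир";

        // The tokenizer is IDisposable, so a using block releases it (and its reader).
        using (var tokenizer = new ICUTokenizer(new StringReader(text)))
        {
            // Request the term text attribute before consuming tokens.
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();

            tokenizer.Reset();                  // must be called before the first IncrementToken()
            while (tokenizer.IncrementToken())  // returns false at end of stream
            {
                Console.WriteLine(term.ToString());
            }
            tokenizer.End();                    // finalizes end-of-stream state such as the final offset
        }
    }
}
```

To tailor segmentation, the same loop should work unchanged with the ICUTokenizer(TextReader, ICUTokenizerConfig) overload, given a suitable ICUTokenizerConfig instance.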