Class SegmentingTokenizerBase
Breaks text into sentences with an ICU4N.Text.BreakIterator and allows subclasses to decompose these sentences into words.
This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.
It can also be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.
Note
This API is experimental and might change in incompatible ways in the next release.
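The subclassing contract described above can be illustrated with a minimal sketch. The class name `WholeSentenceTokenizer` is hypothetical, and the sketch assumes that `sentenceStart` and `sentenceEnd` are indices into `m_buffer`, that `BreakIterator.GetSentenceInstance(CultureInfo)` is available in the ICU4N version in use, and that the usual `Tokenizer` helpers (`AddAttribute<T>()`, `ClearAttributes()`, `CorrectOffset(int)`) are inherited. It emits each sentence as a single token and is not a definitive implementation.

```csharp
using System.Globalization;
using System.IO;
using ICU4N.Text;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;

// Hypothetical subclass: emits every sentence as one token.
public sealed class WholeSentenceTokenizer : SegmentingTokenizerBase
{
    private readonly ICharTermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;

    // State handed over by SetNextSentence and consumed by IncrementWord.
    private int sentenceStart;
    private int sentenceEnd;
    private bool hasSentence;

    public WholeSentenceTokenizer(TextReader reader)
        // A fresh BreakIterator per tokenizer; one is never shared across token streams.
        : base(reader, BreakIterator.GetSentenceInstance(CultureInfo.InvariantCulture))
    {
        termAtt = AddAttribute<ICharTermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
    }

    // Called by the base class with the bounds of the next sentence
    // (assumed here to be indices into m_buffer).
    protected override void SetNextSentence(int sentenceStart, int sentenceEnd)
    {
        this.sentenceStart = sentenceStart;
        this.sentenceEnd = sentenceEnd;
        hasSentence = true;
    }

    // Returns true once per sentence, then false until the next sentence is set.
    protected override bool IncrementWord()
    {
        if (!hasSentence)
        {
            return false;
        }
        hasSentence = false;
        ClearAttributes();
        termAtt.CopyBuffer(m_buffer, sentenceStart, sentenceEnd - sentenceStart);
        // m_offset accounts for buffers already consumed from this reader.
        offsetAtt.SetOffset(CorrectOffset(m_offset + sentenceStart),
                            CorrectOffset(m_offset + sentenceEnd));
        return true;
    }
}
```

A real segmenter would typically split the sentence into several words inside `IncrementWord()` rather than returning it whole.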
Inheritance
Implements
Inherited Members
Namespace: Lucene.Net.Analysis.Util
Assembly: Lucene.Net.ICU.dll
Syntax
public abstract class SegmentingTokenizerBase : Tokenizer, IDisposable
Constructors
SegmentingTokenizerBase(AttributeSource.AttributeFactory, TextReader, BreakIterator)
Constructs a new SegmentingTokenizerBase, also supplying the Lucene.Net.Util.AttributeSource.AttributeFactory.
Declaration
protected SegmentingTokenizerBase(AttributeSource.AttributeFactory factory, TextReader reader, BreakIterator iterator)
Parameters
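Type | Name | Description |
---|---|---|
Lucene.Net.Util.AttributeSource.AttributeFactory | factory | The attribute factory to use. |
System.IO.TextReader | reader | The input text to tokenize. |
ICU4N.Text.BreakIterator | iterator | The break iterator used for sentence segmentation. |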
SegmentingTokenizerBase(TextReader, BreakIterator)
Constructs a new SegmentingTokenizerBase, using the provided ICU4N.Text.BreakIterator for sentence segmentation.
Never share an ICU4N.Text.BreakIterator across different Lucene.Net.Analysis.TokenStream instances; always pass a newly created or cloned one to this constructor.
Declaration
protected SegmentingTokenizerBase(TextReader reader, BreakIterator iterator)
Parameters
Type | Name | Description |
---|---|---|
System.IO.TextReader | reader | The input text to tokenize. |
ICU4N.Text.BreakIterator | iterator | The break iterator used for sentence segmentation. |
Fields
BUFFERMAX
Declaration
protected const int BUFFERMAX = 1024
Field Value
Type | Description |
---|---|
System.Int32 |
m_buffer
Declaration
protected readonly char[] m_buffer
Field Value
Type | Description |
---|---|
System.Char[] |
m_offset
The accumulated offset of previous buffers for this reader, used when setting offsets on the offset attribute (offsetAtt).
Declaration
protected int m_offset
Field Value
Type | Description |
---|---|
System.Int32 |
Methods
End()
Declaration
public sealed override void End()
Overrides
IncrementToken()
Declaration
public sealed override bool IncrementToken()
Returns
Type | Description |
---|---|
System.Boolean |
Overrides
IncrementWord()
Returns true if another word is available.
Declaration
protected abstract bool IncrementWord()
Returns
Type | Description |
---|---|
System.Boolean |
IsSafeEnd(Char)
Returns true if the given character is an unambiguous break position for sentence tokenization.
Declaration
protected virtual bool IsSafeEnd(char ch)
Parameters
Type | Name | Description |
---|---|---|
System.Char | ch |
Returns
Type | Description |
---|---|
System.Boolean |
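As an illustration of what such a predicate could look like, the standalone sketch below treats line and paragraph terminators as unambiguous break positions. This page does not state which characters the default implementation accepts, so the character set here is an assumption, and `SafeEndSketch` is a hypothetical class that does not depend on Lucene.NET. A subclass could apply the same logic in an override of `IsSafeEnd(char)`.

```csharp
using System;

// Hypothetical, standalone illustration of a "safe end" predicate.
public static class SafeEndSketch
{
    // Assumed set of unambiguous sentence break characters:
    // line and paragraph terminators only.
    public static bool IsSafeEnd(char ch)
    {
        switch (ch)
        {
            case '\u000D': // carriage return
            case '\u000A': // line feed
            case '\u0085': // next line
            case '\u2028': // line separator
            case '\u2029': // paragraph separator
                return true;
            default:
                return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsSafeEnd('\n')); // a line feed is in the assumed set
        Console.WriteLine(IsSafeEnd('.'));  // a period is ambiguous (abbreviations, decimals)
    }
}
```

Sentence-final punctuation such as a period is deliberately left out of this sketch, because deciding whether it ends a sentence requires context that the break iterator provides.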
Reset()
Declaration
public override void Reset()
Overrides
SetNextSentence(Int32, Int32)
Provides the next input sentence for analysis.
Declaration
protected abstract void SetNextSentence(int sentenceStart, int sentenceEnd)
Parameters
Type | Name | Description |
---|---|---|
System.Int32 | sentenceStart | |
System.Int32 | sentenceEnd |