Class SegmentingTokenizerBase
Breaks text into sentences with an ICU4N.Text.BreakIterator and allows subclasses to decompose these sentences into words.
This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.
It can also be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.
This API is experimental and might change in incompatible ways in the next release.
Inheritance
Implements
Inherited Members
Namespace: Lucene.Net.Analysis.Util
Assembly: Lucene.Net.ICU.dll
Syntax
public abstract class SegmentingTokenizerBase : Tokenizer, IDisposable
Constructors
SegmentingTokenizerBase(AttributeSource.AttributeFactory, TextReader, BreakIterator)
Constructs a new SegmentingTokenizerBase, also supplying the AttributeSource.AttributeFactory.
Declaration
public SegmentingTokenizerBase(AttributeSource.AttributeFactory factory, TextReader reader, BreakIterator iterator)
Parameters
Type | Name | Description |
---|---|---|
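AttributeSource.AttributeFactory | factory | The attribute factory to use for this tokenizer. |
System.IO.TextReader | reader | The input text to tokenize. |
ICU4N.Text.BreakIterator | iterator | The break iterator used for sentence segmentation. |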
SegmentingTokenizerBase(TextReader, BreakIterator)
Constructs a new SegmentingTokenizerBase, using the provided ICU4N.Text.BreakIterator for sentence segmentation.
Never share an ICU4N.Text.BreakIterator across different TokenStreams; always pass a newly created or cloned instance to this constructor.
Declaration
public SegmentingTokenizerBase(TextReader reader, BreakIterator iterator)
Parameters
Type | Name | Description |
---|---|---|
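System.IO.TextReader | reader | The input text to tokenize. |
ICU4N.Text.BreakIterator | iterator | The break iterator used for sentence segmentation; must not be shared with other TokenStreams. |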
Fields
BUFFERMAX
Declaration
protected const int BUFFERMAX = 1024
Field Value
Type | Description |
---|---|
System.Int32 |
m_buffer
Declaration
protected readonly char[] m_buffer
Field Value
Type | Description |
---|---|
System.Char[] |
m_offset
The accumulated offset of previous buffers for this reader, used when computing offsets for offsetAtt.
Declaration
protected int m_offset
Field Value
Type | Description |
---|---|
System.Int32 |
Methods
End()
Declaration
public override sealed void End()
Overrides
IncrementToken()
Declaration
public override sealed bool IncrementToken()
Returns
Type | Description |
---|---|
System.Boolean |
Overrides
IncrementWord()
Returns true if another word is available.
Declaration
protected abstract bool IncrementWord()
Returns
Type | Description |
---|---|
System.Boolean |
IsSafeEnd(Char)
For sentence tokenization, these are the unambiguous break positions.
Declaration
protected virtual bool IsSafeEnd(char ch)
Parameters
Type | Name | Description |
---|---|---|
System.Char | ch |
Returns
Type | Description |
---|---|
System.Boolean |
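To make "unambiguous break positions" concrete, here is a minimal standalone sketch of such a check. It assumes that hard line and paragraph terminators are the characters treated as safe; the class name SafeEndSketch is hypothetical and the code does not depend on Lucene.NET. A subclass could apply similar logic in its own IsSafeEnd(char) override.

```csharp
using System;

// Hypothetical stand-in, not part of Lucene.NET: illustrates one way an
// IsSafeEnd(char)-style check could be written.
public static class SafeEndSketch
{
    // Assumption: only hard line/paragraph terminators are unambiguous
    // sentence boundaries; every other character is treated as unsafe.
    public static bool IsSafeEnd(char ch)
    {
        switch (ch)
        {
            case '\u000D': // carriage return
            case '\u000A': // line feed
            case '\u0085': // next line (NEL)
            case '\u2028': // line separator
            case '\u2029': // paragraph separator
                return true;
            default:
                return false;
        }
    }
}
```

Restricting the check to hard terminators keeps the decision independent of language-specific punctuation rules, which are left to the sentence BreakIterator.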
Reset()
Declaration
public override void Reset()
Overrides
SetNextSentence(Int32, Int32)
Provides the next input sentence for analysis.
Declaration
protected abstract void SetNextSentence(int sentenceStart, int sentenceEnd)
Parameters
Type | Name | Description |
---|---|---|
System.Int32 | sentenceStart | |
System.Int32 | sentenceEnd |
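The two abstract members work together: SetNextSentence receives the sentence boundaries, and IncrementWord then emits tokens from that sentence. The following is a minimal sketch of a subclass that emits each sentence as a single token. The class name WholeSentenceTokenizer is hypothetical, and the sketch assumes that sentenceStart and sentenceEnd are indices into m_buffer and that m_offset must be added to obtain offsets relative to the whole input.

```csharp
using System.IO;
using ICU4N.Text;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;

// Hypothetical example subclass: emits every sentence as one token.
public sealed class WholeSentenceTokenizer : SegmentingTokenizerBase
{
    private readonly ICharTermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;

    private int sentenceStart;
    private int sentenceEnd;
    private bool hasSentence;

    // The caller supplies a newly created or cloned BreakIterator;
    // it must not be shared with any other TokenStream.
    public WholeSentenceTokenizer(TextReader reader, BreakIterator iterator)
        : base(reader, iterator)
    {
        termAtt = AddAttribute<ICharTermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
    }

    // Remember the boundaries of the sentence handed over by the base class.
    protected override void SetNextSentence(int sentenceStart, int sentenceEnd)
    {
        this.sentenceStart = sentenceStart;
        this.sentenceEnd = sentenceEnd;
        hasSentence = true;
    }

    // Emit the pending sentence once, then report that it is exhausted.
    protected override bool IncrementWord()
    {
        if (!hasSentence)
        {
            return false;
        }

        hasSentence = false;
        ClearAttributes();
        // Assumption: sentence indices are relative to m_buffer.
        termAtt.CopyBuffer(m_buffer, sentenceStart, sentenceEnd - sentenceStart);
        // Assumption: m_offset shifts buffer-relative indices to input offsets.
        offsetAtt.SetOffset(
            CorrectOffset(m_offset + sentenceStart),
            CorrectOffset(m_offset + sentenceEnd));
        return true;
    }
}
```

A real word segmenter would keep a cursor inside the current sentence and return true from IncrementWord once per word until the sentence is consumed.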