Class SegmentingTokenizerBase
Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.
This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.
Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.
@lucene.experimental
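As a rough illustration of the subclass contract described above, the following sketch emits each sentence as a single token. The class name `WholeSentenceTokenizer` and the `Demo` class are hypothetical, and the sketch assumes that `BreakIterator` comes from the ICU4N package and exposes `GetSentenceInstance(CultureInfo)`, and that the attribute interfaces live in `Lucene.Net.Analysis.TokenAttributes`. It is not taken from the library's own documentation.

```csharp
using System;
using System.Globalization;
using System.IO;
using ICU4N.Text;                          // assumed home of BreakIterator
using Lucene.Net.Analysis.TokenAttributes; // assumed home of the attribute interfaces
using Lucene.Net.Analysis.Util;

// Hypothetical subclass: treats every sentence as one token.
public sealed class WholeSentenceTokenizer : SegmentingTokenizerBase
{
    private readonly ICharTermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;
    private int sentenceStart, sentenceEnd;
    private bool hasSentence;

    public WholeSentenceTokenizer(TextReader reader)
        // A fresh iterator is created for this tokenizer; it is never shared.
        : base(reader, BreakIterator.GetSentenceInstance(CultureInfo.InvariantCulture))
    {
        termAtt = AddAttribute<ICharTermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
    }

    // Called by the base class with buffer-relative sentence bounds.
    protected override void SetNextSentence(int sentenceStart, int sentenceEnd)
    {
        this.sentenceStart = sentenceStart;
        this.sentenceEnd = sentenceEnd;
        hasSentence = true;
    }

    // Called repeatedly until it returns false for the current sentence.
    protected override bool IncrementWord()
    {
        if (!hasSentence) return false;
        hasSentence = false;
        ClearAttributes();
        termAtt.CopyBuffer(m_buffer, sentenceStart, sentenceEnd - sentenceStart);
        // m_offset converts buffer-relative positions into reader-wide offsets.
        offsetAtt.SetOffset(CorrectOffset(m_offset + sentenceStart),
                            CorrectOffset(m_offset + sentenceEnd));
        return true;
    }
}

public static class Demo
{
    // Runs the usual consumer workflow and returns the number of tokens seen.
    public static int CountSentences(string text)
    {
        int count = 0;
        using (var tokenizer = new WholeSentenceTokenizer(new StringReader(text)))
        {
            var term = tokenizer.GetAttribute<ICharTermAttribute>();
            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                Console.WriteLine(term.ToString());
                count++;
            }
            tokenizer.End();
        }
        return count;
    }
}
```

A real segmenter would instead split the range between `sentenceStart` and `sentenceEnd` into words inside `IncrementWord()`, returning `true` once per word.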
Namespace: Lucene.Net.Analysis.Util
Assembly: Lucene.Net.ICU.dll
Syntax
public abstract class SegmentingTokenizerBase : Tokenizer, IDisposable
Constructors
SegmentingTokenizerBase(AttributeSource.AttributeFactory, TextReader, BreakIterator)
Constructs a new SegmentingTokenizerBase, also supplying the AttributeSource.AttributeFactory.
Declaration
public SegmentingTokenizerBase(AttributeSource.AttributeFactory factory, TextReader reader, BreakIterator iterator)
Parameters
| Type | Name | Description |
|---|---|---|
| AttributeSource.AttributeFactory | factory | |
| TextReader | reader | |
| BreakIterator | iterator | |
SegmentingTokenizerBase(TextReader, BreakIterator)
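Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.
Note that you should never share BreakIterators across different TokenStreams; instead, a newly created or cloned one should always be provided to this constructor.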
Declaration
public SegmentingTokenizerBase(TextReader reader, BreakIterator iterator)
Parameters
| Type | Name | Description |
|---|---|---|
| TextReader | reader | |
| BreakIterator | iterator | |
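One way to avoid sharing an iterator between token streams is to keep a single prototype and hand each tokenizer its own clone. The sketch below is only an illustration: the `SentenceIterators` helper is hypothetical, and it assumes the ICU4N `BreakIterator` type with `GetSentenceInstance(CultureInfo)` and `Clone()`.

```csharp
using System.Globalization;
using ICU4N.Text; // assumed home of BreakIterator

// Hypothetical helper that hands out one iterator per tokenizer.
public static class SentenceIterators
{
    // The prototype itself is never passed to a tokenizer.
    private static readonly BreakIterator Prototype =
        BreakIterator.GetSentenceInstance(CultureInfo.InvariantCulture);

    // Each call returns an independent copy that is safe to give to one tokenizer.
    public static BreakIterator Create()
    {
        return (BreakIterator)Prototype.Clone();
    }
}
```

A subclass constructor could then pass `SentenceIterators.Create()` as the `iterator` argument.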
Fields
BUFFERMAX
Declaration
protected const int BUFFERMAX = 1024
Field Value
| Type | Description |
|---|---|
| System.Int32 | |
m_buffer
Declaration
protected readonly char[] m_buffer
Field Value
| Type | Description |
|---|---|
| System.Char[] | |
m_offset
Accumulated offset of the previous buffers for this reader, used when setting offsets on the offset attribute.
Declaration
protected int m_offset
Field Value
| Type | Description |
|---|---|
| System.Int32 | |
Methods
End()
Declaration
public override sealed void End()
IncrementToken()
Declaration
public override sealed bool IncrementToken()
Returns
| Type | Description |
|---|---|
| System.Boolean | |
IncrementWord()
Returns true if another word is available.
Declaration
protected abstract bool IncrementWord()
Returns
| Type | Description |
|---|---|
| System.Boolean | |
IsSafeEnd(Char)
For sentence tokenization, these are the unambiguous break positions.
Declaration
protected virtual bool IsSafeEnd(char ch)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Char | ch | |
Returns
| Type | Description |
|---|---|
| System.Boolean | |
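To make "unambiguous break positions" concrete, here is a standalone sketch of how such a check might be used when a buffer fills up before the input ends: scan backwards for the last safe character and process only the text up to that point. The set of characters below (line and paragraph separators) is an assumption modeled on the upstream Java Lucene implementation, and `SafeEndSketch` and `FindSafeEnd` are hypothetical names rather than members of this class.

```csharp
using System;

public static class SafeEndSketch
{
    // Assumed default: only line and paragraph separators count as safe ends.
    public static bool IsSafeEnd(char ch)
    {
        switch (ch)
        {
            case '\u000D': // carriage return
            case '\u000A': // line feed
            case '\u0085': // next line (NEL)
            case '\u2028': // line separator
            case '\u2029': // paragraph separator
                return true;
            default:
                return false;
        }
    }

    // Scans backwards through the first `length` chars of the buffer.
    // Returns the index just past the last safe end, or -1 if there is none.
    public static int FindSafeEnd(char[] buffer, int length)
    {
        for (int i = length - 1; i >= 0; i--)
        {
            if (IsSafeEnd(buffer[i]))
            {
                return i + 1;
            }
        }
        return -1;
    }
}
```

A subclass for a script with other reliable sentence terminators could override `IsSafeEnd(char)` to accept additional characters.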
Reset()
Declaration
public override void Reset()
SetNextSentence(Int32, Int32)
Provides the next input sentence for analysis.
Declaration
protected abstract void SetNextSentence(int sentenceStart, int sentenceEnd)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Int32 | sentenceStart | |
| System.Int32 | sentenceEnd | |