Class SegmentingTokenizerBase
Breaks text into sentences with an ICU4N.Text.BreakIterator and allows subclasses to decompose these sentences into words.
This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.
Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.
Note
This API is experimental and might change in incompatible ways in the next release.
Namespace: Lucene.Net.Analysis.Util
Assembly: Lucene.Net.ICU.dll
Syntax
public abstract class SegmentingTokenizerBase : Tokenizer, IDisposable
Constructors
SegmentingTokenizerBase(AttributeFactory, TextReader, BreakIterator)
Constructs a new SegmentingTokenizerBase, also supplying the Lucene.Net.Util.AttributeSource.AttributeFactory.
Declaration
protected SegmentingTokenizerBase(AttributeSource.AttributeFactory factory, TextReader reader, BreakIterator iterator)
Parameters
Type | Name | Description |
---|---|---|
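AttributeSource.AttributeFactory | factory | The attribute factory to use. |
TextReader | reader | The input text. |
BreakIterator | iterator | The ICU4N.Text.BreakIterator used for sentence segmentation. |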
SegmentingTokenizerBase(TextReader, BreakIterator)
Constructs a new SegmentingTokenizerBase, using the provided ICU4N.Text.BreakIterator for sentence segmentation.
Note that you should never share an ICU4N.Text.BreakIterator across different Lucene.Net.Analysis.TokenStream instances; instead, always provide a newly created or cloned one to this constructor.
Declaration
protected SegmentingTokenizerBase(TextReader reader, BreakIterator iterator)
Parameters
Type | Name | Description |
---|---|---|
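TextReader | reader | The input text. |
BreakIterator | iterator | The ICU4N.Text.BreakIterator used for sentence segmentation; must not be shared with another token stream. |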
Fields
BUFFERMAX
Declaration
protected const int BUFFERMAX = 1024
Field Value
Type | Description |
---|---|
int |
m_buffer
Declaration
protected readonly char[] m_buffer
Field Value
Type | Description |
---|---|
char[] |
m_offset
Accumulated offset of the previous buffers for this reader, used when setting offsets on the offset attribute (offsetAtt).
Declaration
protected int m_offset
Field Value
Type | Description |
---|---|
int |
Methods
End()
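This method is called by the consumer after the last token has been consumed, that is, after Lucene.Net.Analysis.TokenStream.IncrementToken() has returned false (using the new Lucene.Net.Analysis.TokenStream API). Streams implementing the old API should upgrade to use this feature.
Overrides of Lucene.Net.Analysis.TokenStream.End() must always call base.End(). In this class the method is sealed, so subclasses cannot override it.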
Declaration
public override sealed void End()
Overrides
Exceptions
Type | Condition |
---|---|
IOException | If an I/O error occurs |
IncrementToken()
Consumers (e.g., Lucene.Net.Index.IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate Lucene.Net.Util.IAttribute instances with the attributes of the next token.
The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use Lucene.Net.Util.AttributeSource.CaptureState() to create a copy of the current attribute state.
This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to Lucene.Net.Util.AttributeSource.AddAttribute<T>() and Lucene.Net.Util.AttributeSource.GetAttribute<T>(), references to all Lucene.Net.Util.IAttribute instances that this stream uses should be retrieved during instantiation.
To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in Lucene.Net.Analysis.TokenStream.IncrementToken().
Declaration
public override sealed bool IncrementToken()
Returns
Type | Description |
---|---|
bool | false for end of stream; true otherwise |
Overrides
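The consumer contract described for IncrementToken(), Reset() and End() can be sketched as the loop below. This is a minimal illustration, not part of this class: the TokenStreamConsumer class and its CollectTerms method are hypothetical names, and the sketch assumes the stream exposes an ICharTermAttribute from Lucene.Net.Analysis.TokenAttributes.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical helper for illustration; not part of Lucene.NET.
public static class TokenStreamConsumer
{
    // Drains a token stream and returns the text of every token it produced.
    public static IList<string> CollectTerms(TokenStream stream)
    {
        var terms = new List<string>();

        // Retrieve attribute references once, before consuming any tokens.
        ICharTermAttribute termAtt = stream.AddAttribute<ICharTermAttribute>();

        try
        {
            stream.Reset();                  // must be called before the first IncrementToken()
            while (stream.IncrementToken())  // false signals end of stream
            {
                // Copy the term text now; the attribute is overwritten by the next token.
                terms.Add(termAtt.ToString());
            }
            stream.End();                    // called once, after IncrementToken() returned false
        }
        finally
        {
            stream.Dispose();
        }

        return terms;
    }
}
```

Copying the term text inside the loop matters because the attribute instance is reused for every token.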
IncrementWord()
Returns true if another word is available.
Declaration
protected abstract bool IncrementWord()
Returns
Type | Description |
---|---|
bool |
IsSafeEnd(char)
For sentence tokenization, these are the unambiguous break positions.
Declaration
protected virtual bool IsSafeEnd(char ch)
Parameters
Type | Name | Description |
---|---|---|
char | ch |
Returns
Type | Description |
---|---|
bool |
Reset()
This method is called by a consumer before it begins consumption using Lucene.Net.Analysis.TokenStream.IncrementToken().
Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh. If you override this method, always call base.Reset(), otherwise some internal state will not be correctly reset (e.g., Lucene.Net.Analysis.Tokenizer will throw InvalidOperationException on further usage).
Declaration
public override void Reset()
Overrides
SetNextSentence(int, int)
Provides the next input sentence for analysis.
Declaration
protected abstract void SetNextSentence(int sentenceStart, int sentenceEnd)
Parameters
Type | Name | Description |
---|---|---|
int | sentenceStart | |
int | sentenceEnd |
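As an illustration of how the two abstract members fit together, here is a minimal sketch of a subclass that emits each sentence as a single token. The class name WholeSentenceTokenizer is hypothetical. The sketch assumes that sentenceStart and sentenceEnd are indexes into m_buffer, that BreakIterator.GetSentenceInstance(CultureInfo) is available in ICU4N, and that token offsets are obtained by adding m_offset to the buffer positions.

```csharp
using System.Globalization;
using System.IO;
using ICU4N.Text;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;

// Hypothetical subclass for illustration: every sentence becomes one token.
public sealed class WholeSentenceTokenizer : SegmentingTokenizerBase
{
    private readonly ICharTermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;
    private int sentenceStart;
    private int sentenceEnd;
    private bool hasSentence;

    public WholeSentenceTokenizer(TextReader reader)
        // A fresh BreakIterator per tokenizer; iterators must not be shared.
        : base(reader, BreakIterator.GetSentenceInstance(CultureInfo.InvariantCulture))
    {
        // Attributes are added once, during instantiation.
        termAtt = AddAttribute<ICharTermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
    }

    // Remember the bounds of the sentence handed over by the base class.
    protected override void SetNextSentence(int sentenceStart, int sentenceEnd)
    {
        this.sentenceStart = sentenceStart;
        this.sentenceEnd = sentenceEnd;
        hasSentence = true;
    }

    // Emit the pending sentence as a single token, then report that no
    // further words remain in it.
    protected override bool IncrementWord()
    {
        if (!hasSentence)
        {
            return false;
        }
        hasSentence = false;
        ClearAttributes();
        termAtt.CopyBuffer(m_buffer, sentenceStart, sentenceEnd - sentenceStart);
        // Assumption: m_offset converts buffer positions to stream offsets.
        offsetAtt.SetOffset(CorrectOffset(m_offset + sentenceStart),
                            CorrectOffset(m_offset + sentenceEnd));
        return true;
    }

    public override void Reset()
    {
        base.Reset(); // always call the base implementation first
        sentenceStart = sentenceEnd = 0;
        hasSentence = false;
    }
}
```

A real word-level segmenter would instead walk the range between sentenceStart and sentenceEnd inside IncrementWord() and return true once per word.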