Class SegmentingTokenizerBase

Breaks text into sentences with an ICU4N.Text.BreakIterator and allows subclasses to decompose these sentences into words.

This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.

Additionally, it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc.) for downstream processing.
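
As an illustration of the subclassing contract, the following is a minimal sketch of a subclass that splits each sentence on whitespace. The class name WhitespaceWordTokenizer is hypothetical, and the sketch assumes that ICU4N exposes a parameterless BreakIterator.GetSentenceInstance() and that sentence indices passed to SetNextSentence are positions in m_buffer; it is not how ThaiTokenizer is implemented.

```csharp
using System.IO;
using ICU4N.Text;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;

// Hypothetical example subclass: emits whitespace-delimited words, one sentence at a time.
public sealed class WhitespaceWordTokenizer : SegmentingTokenizerBase
{
    private readonly ICharTermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;
    private int pos;          // next unread position in m_buffer
    private int sentenceEnd;  // exclusive end of the current sentence in m_buffer

    public WhitespaceWordTokenizer(TextReader reader)
        // Assumption: a fresh sentence iterator per tokenizer instance (never shared).
        : base(reader, BreakIterator.GetSentenceInstance())
    {
        termAtt = AddAttribute<ICharTermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
    }

    // Called by the base class each time a new sentence is found.
    protected override void SetNextSentence(int sentenceStart, int sentenceEnd)
    {
        pos = sentenceStart;
        this.sentenceEnd = sentenceEnd;
    }

    // Emits the next word of the current sentence, or returns false when none is left.
    protected override bool IncrementWord()
    {
        while (pos < sentenceEnd && char.IsWhiteSpace(m_buffer[pos])) pos++;
        if (pos >= sentenceEnd) return false;

        int start = pos;
        while (pos < sentenceEnd && !char.IsWhiteSpace(m_buffer[pos])) pos++;

        ClearAttributes();
        termAtt.CopyBuffer(m_buffer, start, pos - start);
        // m_offset converts buffer-relative positions into reader-relative offsets.
        offsetAtt.SetOffset(CorrectOffset(m_offset + start), CorrectOffset(m_offset + pos));
        return true;
    }
}
```

A consumer would drive it like any other tokenizer: call Reset(), loop on IncrementToken(), then call End() and Dispose().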

Note

This API is experimental and might change in incompatible ways in the next release.

Inheritance

System.Object

Lucene.Net.Util.AttributeSource

Lucene.Net.Analysis.TokenStream

Lucene.Net.Analysis.Tokenizer

SegmentingTokenizerBase

ThaiTokenizer

Implements

System.IDisposable

Inherited Members

Lucene.Net.Analysis.Tokenizer.m_input

Tokenizer.Dispose(Boolean)

Tokenizer.CorrectOffset(Int32)

Tokenizer.SetReader(TextReader)

Lucene.Net.Analysis.TokenStream.Dispose()

Lucene.Net.Util.AttributeSource.GetAttributeFactory()

Lucene.Net.Util.AttributeSource.GetAttributeClassesEnumerator()

Lucene.Net.Util.AttributeSource.GetAttributeImplsEnumerator()

Lucene.Net.Util.AttributeSource.AddAttributeImpl(Lucene.Net.Util.Attribute)

Lucene.Net.Util.AttributeSource.AddAttribute<T>()

Lucene.Net.Util.AttributeSource.HasAttributes

Lucene.Net.Util.AttributeSource.HasAttribute<T>()

Lucene.Net.Util.AttributeSource.GetAttribute<T>()

Lucene.Net.Util.AttributeSource.ClearAttributes()

Lucene.Net.Util.AttributeSource.CaptureState()

Lucene.Net.Util.AttributeSource.RestoreState(Lucene.Net.Util.AttributeSource.State)

Lucene.Net.Util.AttributeSource.GetHashCode()

AttributeSource.Equals(Object)

AttributeSource.ReflectAsString(Boolean)

Lucene.Net.Util.AttributeSource.ReflectWith(Lucene.Net.Util.IAttributeReflector)

Lucene.Net.Util.AttributeSource.CloneAttributes()

Lucene.Net.Util.AttributeSource.CopyTo(Lucene.Net.Util.AttributeSource)

Lucene.Net.Util.AttributeSource.ToString()

System.Object.Equals(System.Object, System.Object)

System.Object.GetType()

System.Object.MemberwiseClone()

System.Object.ReferenceEquals(System.Object, System.Object)

Namespace: Lucene.Net.Analysis.Util

Assembly: Lucene.Net.ICU.dll

Syntax

public abstract class SegmentingTokenizerBase : Tokenizer, IDisposable

Constructors

SegmentingTokenizerBase(AttributeSource.AttributeFactory, TextReader, BreakIterator)

Constructs a new SegmentingTokenizerBase, also supplying the Lucene.Net.Util.AttributeSource.AttributeFactory.

Declaration

protected SegmentingTokenizerBase(AttributeSource.AttributeFactory factory, TextReader reader, BreakIterator iterator)

Parameters

Type	Name	Description
Lucene.Net.Util.AttributeSource.AttributeFactory	factory
System.IO.TextReader	reader
ICU4N.Text.BreakIterator	iterator

SegmentingTokenizerBase(TextReader, BreakIterator)

Constructs a new SegmentingTokenizerBase, using the provided ICU4N.Text.BreakIterator for sentence segmentation.

Note that you should never share ICU4N.Text.BreakIterator instances across different Lucene.Net.Analysis.TokenStream instances; always pass a newly created or cloned one to this constructor.

Declaration

protected SegmentingTokenizerBase(TextReader reader, BreakIterator iterator)

Parameters

Type	Name	Description
System.IO.TextReader	reader
ICU4N.Text.BreakIterator	iterator
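
One way to follow the rule about not sharing iterators is to keep a single prototype and hand a clone to each tokenizer instance. The sketch below is illustrative only: the class name ClonedIteratorTokenizerBase is hypothetical, and it assumes that ICU4N provides BreakIterator.GetSentenceInstance(CultureInfo) and a Clone() method returning object.

```csharp
using System.Globalization;
using System.IO;
using ICU4N.Text;
using Lucene.Net.Analysis.Util;

// Hypothetical intermediate base class; concrete subclasses still implement
// SetNextSentence and IncrementWord.
public abstract class ClonedIteratorTokenizerBase : SegmentingTokenizerBase
{
    // Built once and used only as a template; it is never handed to a tokenizer directly.
    private static readonly BreakIterator SentencePrototype =
        BreakIterator.GetSentenceInstance(CultureInfo.InvariantCulture);

    protected ClonedIteratorTokenizerBase(TextReader reader)
        // Each tokenizer instance receives its own private copy of the iterator.
        : base(reader, (BreakIterator)SentencePrototype.Clone())
    {
    }
}
```

Cloning avoids rebuilding the iterator rules for every tokenizer while still giving each stream its own iteration state.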

Fields

BUFFERMAX

Declaration

protected const int BUFFERMAX = 1024

Field Value

Type	Description
System.Int32

m_buffer

Declaration

protected readonly char[] m_buffer

Field Value

Type	Description
System.Char[]

m_offset

Accumulated offset of the previous buffers for this reader. Add it to a position in m_buffer when setting offsets on the offset attribute.

Declaration

protected int m_offset

Field Value

Type	Description
System.Int32

Methods

End()

Declaration

public sealed override void End()

Overrides

Lucene.Net.Analysis.TokenStream.End()

IncrementToken()

Declaration

public sealed override bool IncrementToken()

Returns

Type	Description
System.Boolean

Overrides

Lucene.Net.Analysis.TokenStream.IncrementToken()

IncrementWord()

Returns true if another word is available.

Declaration

protected abstract bool IncrementWord()

Returns

Type	Description
System.Boolean

IsSafeEnd(Char)

For sentence tokenization, these are the unambiguous break positions.

Declaration

protected virtual bool IsSafeEnd(char ch)

Parameters

Type	Name	Description
System.Char	ch

Returns

Type	Description
System.Boolean
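
To make the idea of an unambiguous break position concrete, the standalone sketch below models how a buffer of at most BUFFERMAX characters might be cut at the last safe position, so that no sentence is split across two buffer fills. It does not use the Lucene.NET types. The character set in IsSafeEnd (CR, LF, NEL, line separator, paragraph separator) is an assumption mirrored from upstream Lucene, and SafeEndSketch and FindSafeEnd are hypothetical names.

```csharp
using System;

// Standalone model, not part of Lucene.NET.
public static class SafeEndSketch
{
    // Assumed default: hard line and paragraph terminators are always sentence ends.
    public static bool IsSafeEnd(char ch)
    {
        switch (ch)
        {
            case '\u000D': // carriage return
            case '\u000A': // line feed
            case '\u0085': // next line
            case '\u2028': // line separator
            case '\u2029': // paragraph separator
                return true;
            default:
                return false;
        }
    }

    // Returns the index just past the last safe character in buffer[0..length),
    // or -1 if the buffer contains no safe position.
    public static int FindSafeEnd(char[] buffer, int length)
    {
        for (int i = length - 1; i >= 0; i--)
        {
            if (IsSafeEnd(buffer[i])) return i + 1;
        }
        return -1;
    }
}
```

For example, SafeEndSketch.FindSafeEnd("ab\ncd".ToCharArray(), 5) returns 3, the position just after the line feed. A subclass that overrides IsSafeEnd widens or narrows the set of positions at which such a cut is allowed.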

Reset()

Declaration

public override void Reset()

Overrides

Lucene.Net.Analysis.Tokenizer.Reset()

SetNextSentence(Int32, Int32)

Provides the next input sentence for analysis.

Declaration

protected abstract void SetNextSentence(int sentenceStart, int sentenceEnd)

Parameters

Type	Name	Description
System.Int32	sentenceStart
System.Int32	sentenceEnd
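
Because SetNextSentence is called once per sentence, it is a natural place to record a sentence boundary for downstream consumers. The sketch below marks boundaries by leaving a position gap before the first word of every sentence after the first. The class name SentenceGapTokenizer and the gap size of 2 are illustrative assumptions, as is the parameterless BreakIterator.GetSentenceInstance().

```csharp
using System.IO;
using ICU4N.Text;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;

// Hypothetical example: whitespace words with a position gap at sentence boundaries.
public sealed class SentenceGapTokenizer : SegmentingTokenizerBase
{
    private readonly ICharTermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;
    private readonly IPositionIncrementAttribute posIncAtt;
    private int pos, end;
    private bool atSentenceStart; // true until the first word of the sentence is emitted
    private bool emittedAny;      // true once any word has been emitted from this reader

    public SentenceGapTokenizer(TextReader reader)
        : base(reader, BreakIterator.GetSentenceInstance())
    {
        termAtt = AddAttribute<ICharTermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
        posIncAtt = AddAttribute<IPositionIncrementAttribute>();
    }

    public override void Reset()
    {
        base.Reset();
        atSentenceStart = false;
        emittedAny = false;
    }

    protected override void SetNextSentence(int sentenceStart, int sentenceEnd)
    {
        pos = sentenceStart;
        end = sentenceEnd;
        atSentenceStart = true; // remember the boundary until a word is produced
    }

    protected override bool IncrementWord()
    {
        while (pos < end && char.IsWhiteSpace(m_buffer[pos])) pos++;
        if (pos >= end) return false;
        int start = pos;
        while (pos < end && !char.IsWhiteSpace(m_buffer[pos])) pos++;

        ClearAttributes(); // resets the position increment to its default of 1
        termAtt.CopyBuffer(m_buffer, start, pos - start);
        offsetAtt.SetOffset(CorrectOffset(m_offset + start), CorrectOffset(m_offset + pos));
        if (atSentenceStart && emittedAny)
        {
            posIncAtt.PositionIncrement = 2; // leave one empty position between sentences
        }
        atSentenceStart = false;
        emittedAny = true;
        return true;
    }
}
```

With a gap in place, exact phrase queries downstream no longer match across a sentence boundary. A custom attribute or an extra marker token would be alternative designs.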
