    Class SegmentingTokenizerBase

    Breaks text into sentences with an ICU4N.Text.BreakIterator and allows subclasses to decompose these sentences into words.

    This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.

    It can also be used by subclasses that want to mark sentence boundaries (with a custom attribute, an extra token, a position increment, etc.) for downstream processing.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Inheritance
    System.Object
    Lucene.Net.Util.AttributeSource
    Lucene.Net.Analysis.TokenStream
    Lucene.Net.Analysis.Tokenizer
    SegmentingTokenizerBase
    ThaiTokenizer
    Implements
    System.IDisposable
    Inherited Members
    Lucene.Net.Analysis.Tokenizer.m_input
    Tokenizer.Dispose(Boolean)
    Tokenizer.CorrectOffset(Int32)
    Tokenizer.SetReader(TextReader)
    Lucene.Net.Analysis.TokenStream.Dispose()
    Lucene.Net.Util.AttributeSource.GetAttributeFactory()
    Lucene.Net.Util.AttributeSource.GetAttributeClassesEnumerator()
    Lucene.Net.Util.AttributeSource.GetAttributeImplsEnumerator()
    Lucene.Net.Util.AttributeSource.AddAttributeImpl(Lucene.Net.Util.Attribute)
    Lucene.Net.Util.AttributeSource.AddAttribute<T>()
    Lucene.Net.Util.AttributeSource.HasAttributes
    Lucene.Net.Util.AttributeSource.HasAttribute<T>()
    Lucene.Net.Util.AttributeSource.GetAttribute<T>()
    Lucene.Net.Util.AttributeSource.ClearAttributes()
    Lucene.Net.Util.AttributeSource.CaptureState()
    Lucene.Net.Util.AttributeSource.RestoreState(Lucene.Net.Util.AttributeSource.State)
    Lucene.Net.Util.AttributeSource.GetHashCode()
    AttributeSource.Equals(Object)
    AttributeSource.ReflectAsString(Boolean)
    Lucene.Net.Util.AttributeSource.ReflectWith(Lucene.Net.Util.IAttributeReflector)
    Lucene.Net.Util.AttributeSource.CloneAttributes()
    Lucene.Net.Util.AttributeSource.CopyTo(Lucene.Net.Util.AttributeSource)
    Lucene.Net.Util.AttributeSource.ToString()
    System.Object.Equals(System.Object, System.Object)
    System.Object.GetType()
    System.Object.MemberwiseClone()
    System.Object.ReferenceEquals(System.Object, System.Object)
    Namespace: Lucene.Net.Analysis.Util
    Assembly: Lucene.Net.ICU.dll
    Syntax
    public abstract class SegmentingTokenizerBase : Tokenizer, IDisposable

    Constructors

    SegmentingTokenizerBase(AttributeSource.AttributeFactory, TextReader, BreakIterator)

    Constructs a new SegmentingTokenizerBase, using the provided ICU4N.Text.BreakIterator for sentence segmentation and the supplied Lucene.Net.Util.AttributeSource.AttributeFactory.

    Declaration
    protected SegmentingTokenizerBase(AttributeSource.AttributeFactory factory, TextReader reader, BreakIterator iterator)
    Parameters
    Type Name Description
    Lucene.Net.Util.AttributeSource.AttributeFactory factory
    System.IO.TextReader reader
    ICU4N.Text.BreakIterator iterator

    SegmentingTokenizerBase(TextReader, BreakIterator)

    Constructs a new SegmentingTokenizerBase, using the provided ICU4N.Text.BreakIterator for sentence segmentation.

    Never share an ICU4N.Text.BreakIterator across different Lucene.Net.Analysis.TokenStream instances. Always pass a newly created or cloned iterator to this constructor.

    Declaration
    protected SegmentingTokenizerBase(TextReader reader, BreakIterator iterator)
    Parameters
    Type Name Description
    System.IO.TextReader reader
    ICU4N.Text.BreakIterator iterator
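
    The note about not sharing iterators can be illustrated with a minimal sketch. The class name SentenceAwareTokenizer is hypothetical, and the choice of CultureInfo.InvariantCulture for the sentence iterator is an assumption made for illustration. Each constructor asks ICU4N for a fresh sentence iterator rather than reusing a shared instance:

```csharp
using System.Globalization;
using System.IO;
using ICU4N.Text;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Hypothetical intermediate base class; it stays abstract, so it does not
// need to implement SetNextSentence or IncrementWord itself.
public abstract class SentenceAwareTokenizer : SegmentingTokenizerBase
{
    // A new BreakIterator is created for every tokenizer instance,
    // so no iterator is ever shared between token streams.
    protected SentenceAwareTokenizer(TextReader reader)
        : base(reader, BreakIterator.GetSentenceInstance(CultureInfo.InvariantCulture))
    {
    }

    // Same idea, additionally forwarding a caller-supplied attribute factory.
    protected SentenceAwareTokenizer(AttributeSource.AttributeFactory factory, TextReader reader)
        : base(factory, reader, BreakIterator.GetSentenceInstance(CultureInfo.InvariantCulture))
    {
    }
}
```

    Cloning a prototype iterator is the other option the note mentions. If the prototype is held in a static field, access to it may need to be synchronized.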

    Fields

    BUFFERMAX

    Declaration
    protected const int BUFFERMAX = 1024
    Field Value
    Type Description
    System.Int32

    m_buffer

    Declaration
    protected readonly char[] m_buffer
    Field Value
    Type Description
    System.Char[]

    m_offset

    Accumulated offset of the previous buffers for this reader, used when computing token offsets (offsetAtt).

    Declaration
    protected int m_offset
    Field Value
    Type Description
    System.Int32

    Methods

    End()

    Declaration
    public sealed override void End()
    Overrides
    Lucene.Net.Analysis.TokenStream.End()

    IncrementToken()

    Declaration
    public sealed override bool IncrementToken()
    Returns
    Type Description
    System.Boolean
    Overrides
    Lucene.Net.Analysis.TokenStream.IncrementToken()

    IncrementWord()

    Returns true if another word is available in the current sentence.

    Declaration
    protected abstract bool IncrementWord()
    Returns
    Type Description
    System.Boolean

    IsSafeEnd(Char)

    For sentence tokenization, these are the unambiguous break positions.

    Declaration
    protected virtual bool IsSafeEnd(char ch)
    Parameters
    Type Name Description
    System.Char ch
    Returns
    Type Description
    System.Boolean
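
    Because IsSafeEnd(Char) is virtual, a subclass can widen the set of characters treated as unambiguous break positions. The sketch below is illustrative only: the class name is hypothetical, and treating the ideographic full stop (U+3002) as a safe end is an assumption that may not suit every corpus. All other characters are deferred to the base implementation:

```csharp
using System.IO;
using ICU4N.Text;
using Lucene.Net.Analysis.Util;

// Hypothetical abstract subclass that only customizes IsSafeEnd.
public abstract class FullStopAwareTokenizer : SegmentingTokenizerBase
{
    protected FullStopAwareTokenizer(TextReader reader, BreakIterator iterator)
        : base(reader, iterator)
    {
    }

    protected override bool IsSafeEnd(char ch)
    {
        // Assumption: U+3002 (ideographic full stop) always ends a sentence
        // in the text being indexed. Everything else uses the base rules.
        return ch == '\u3002' || base.IsSafeEnd(ch);
    }
}
```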

    Reset()

    Declaration
    public override void Reset()
    Overrides
    Lucene.Net.Analysis.Tokenizer.Reset()

    SetNextSentence(Int32, Int32)

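    Provides the next input sentence for analysis. Subclasses typically record the sentence bounds here and then emit the words of that sentence from IncrementWord().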

    Declaration
    protected abstract void SetNextSentence(int sentenceStart, int sentenceEnd)
    Parameters
    Type Name Description
    System.Int32 sentenceStart
    System.Int32 sentenceEnd
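
    How SetNextSentence(Int32, Int32) and IncrementWord() work together can be sketched with a minimal subclass that splits each sentence on whitespace. The class name WhitespaceSentenceTokenizer is hypothetical. The sketch assumes that sentenceStart and sentenceEnd are indexes into m_buffer with sentenceEnd exclusive, and that m_offset must be added to obtain offsets relative to the whole input:

```csharp
using System.Globalization;
using System.IO;
using ICU4N.Text;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;

// Hypothetical example: emits whitespace-delimited words, one sentence at a time.
public sealed class WhitespaceSentenceTokenizer : SegmentingTokenizerBase
{
    private readonly ICharTermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;

    private int pos;          // next position to scan in m_buffer
    private int sentenceEnd;  // exclusive end of the current sentence (assumed)

    public WhitespaceSentenceTokenizer(TextReader reader)
        : base(reader, BreakIterator.GetSentenceInstance(CultureInfo.InvariantCulture))
    {
        termAtt = AddAttribute<ICharTermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
    }

    // Called by the base class for each sentence it finds; just record the bounds.
    protected override void SetNextSentence(int sentenceStart, int sentenceEnd)
    {
        this.pos = sentenceStart;
        this.sentenceEnd = sentenceEnd;
    }

    // Called repeatedly until it returns false, meaning the sentence is exhausted.
    protected override bool IncrementWord()
    {
        // Skip whitespace before the next word.
        while (pos < sentenceEnd && char.IsWhiteSpace(m_buffer[pos]))
        {
            pos++;
        }
        if (pos >= sentenceEnd)
        {
            return false; // no more words in this sentence
        }

        int start = pos;
        while (pos < sentenceEnd && !char.IsWhiteSpace(m_buffer[pos]))
        {
            pos++;
        }

        ClearAttributes();
        termAtt.CopyBuffer(m_buffer, start, pos - start);
        // m_offset accounts for text consumed in earlier buffers.
        offsetAtt.SetOffset(CorrectOffset(m_offset + start), CorrectOffset(m_offset + pos));
        return true;
    }
}
```

    A subclass that wants to mark sentence boundaries could, for example, set a flag in SetNextSentence and raise the position increment of the first word it emits afterwards.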

    Implements

    System.IDisposable
    Copyright © 2022 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.