
    Class SegmentingTokenizerBase

    Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.

    This can be used by subclasses that need sentence context for tokenization purposes, such as CJK segmenters.

    Additionally it can be used by subclasses that want to mark sentence boundaries (with a custom attribute, extra token, position increment, etc) for downstream processing.

    This API is experimental and might change in incompatible ways in the next release.
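The two-level flow described in the summary (sentences first, then words within each sentence) can be sketched without any Lucene.NET dependency. The types below are hypothetical stand-ins: `SentenceThenWordSketch` and `LetterRunWords` are not part of the library, the sentence detection is a naive punctuation scan rather than a BreakIterator, and `IncrementWord` takes an `out` parameter here only to keep the sketch free of attribute classes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the pattern only; it does not derive from or call Lucene.NET.
public abstract class SentenceThenWordSketch
{
    protected readonly char[] Buffer; // plays the role of m_buffer

    protected SentenceThenWordSketch(string text) { Buffer = text.ToCharArray(); }

    // Mirror the two abstract members documented on this page.
    protected abstract void SetNextSentence(int sentenceStart, int sentenceEnd);
    protected abstract bool IncrementWord(out string word);

    // Drives the loop: find a sentence, hand it to the subclass, drain its words.
    public List<string> Tokenize()
    {
        var words = new List<string>();
        int start = 0;
        while (start < Buffer.Length)
        {
            int end = start;
            // Naive boundary: stop at '.', '!' or '?'. A real BreakIterator is far more careful.
            while (end < Buffer.Length && Buffer[end] != '.' && Buffer[end] != '!' && Buffer[end] != '?') end++;
            if (end < Buffer.Length) end++; // include the terminator in the sentence
            SetNextSentence(start, end);
            while (IncrementWord(out string word)) words.Add(word);
            start = end;
        }
        return words;
    }
}

// Toy subclass: a "word" is any run of letters or digits inside the current sentence.
public sealed class LetterRunWords : SentenceThenWordSketch
{
    private int pos, limit;

    public LetterRunWords(string text) : base(text) { }

    protected override void SetNextSentence(int sentenceStart, int sentenceEnd)
    {
        pos = sentenceStart;
        limit = sentenceEnd;
    }

    protected override bool IncrementWord(out string word)
    {
        while (pos < limit && !char.IsLetterOrDigit(Buffer[pos])) pos++; // skip separators
        int wordStart = pos;
        while (pos < limit && char.IsLetterOrDigit(Buffer[pos])) pos++;  // consume one word
        word = new string(Buffer, wordStart, pos - wordStart);
        return pos > wordStart; // false once the sentence is exhausted
    }
}
```

Calling `new LetterRunWords("Hi there. How are you?").Tokenize()` walks two sentences and collects the words from each in order. A real subclass would write each word into token attributes instead of returning strings.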

    Inheritance
    System.Object
    AttributeSource
    TokenStream
    Tokenizer
    SegmentingTokenizerBase
    HMMChineseTokenizer
    OpenNLPTokenizer
    ThaiTokenizer
    Implements
    IDisposable
    Inherited Members
    Tokenizer.m_input
    Tokenizer.Dispose(Boolean)
    Tokenizer.CorrectOffset(Int32)
    Tokenizer.SetReader(TextReader)
    TokenStream.Dispose()
    AttributeSource.GetAttributeFactory()
    AttributeSource.GetAttributeClassesEnumerator()
    AttributeSource.GetAttributeImplsEnumerator()
    AttributeSource.AddAttributeImpl(Attribute)
    AttributeSource.AddAttribute<T>()
    AttributeSource.HasAttributes
    AttributeSource.HasAttribute<T>()
    AttributeSource.GetAttribute<T>()
    AttributeSource.ClearAttributes()
    AttributeSource.CaptureState()
    AttributeSource.RestoreState(AttributeSource.State)
    AttributeSource.GetHashCode()
    AttributeSource.Equals(Object)
    AttributeSource.ReflectAsString(Boolean)
    AttributeSource.ReflectWith(IAttributeReflector)
    AttributeSource.CloneAttributes()
    AttributeSource.CopyTo(AttributeSource)
    AttributeSource.ToString()
    Namespace: Lucene.Net.Analysis.Util
    Assembly: Lucene.Net.ICU.dll
    Syntax
    public abstract class SegmentingTokenizerBase : Tokenizer, IDisposable

    Constructors


    SegmentingTokenizerBase(AttributeSource.AttributeFactory, TextReader, BreakIterator)

    Constructs a new SegmentingTokenizerBase, also supplying the AttributeSource.AttributeFactory.

    Declaration
    public SegmentingTokenizerBase(AttributeSource.AttributeFactory factory, TextReader reader, BreakIterator iterator)
    Parameters
    Type Name Description
    AttributeSource.AttributeFactory factory
    TextReader reader
    BreakIterator iterator

    SegmentingTokenizerBase(TextReader, BreakIterator)

    Constructs a new SegmentingTokenizerBase, using the provided BreakIterator for sentence segmentation.

    Note that you should never share BreakIterators across different TokenStreams. Instead, always pass a newly created or cloned one to this constructor.

    Declaration
    public SegmentingTokenizerBase(TextReader reader, BreakIterator iterator)
    Parameters
    Type Name Description
    TextReader reader
    BreakIterator iterator
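The constructor note about not sharing iterators follows from the fact that a break iterator carries mutable state (its current text and position). The sketch below illustrates the hazard with a hypothetical `ToyBreakIterator`; it is not the real BreakIterator class and its methods are simplified assumptions, but the clone-per-consumer pattern is the one the note recommends.

```csharp
using System;

// Hypothetical stateful iterator standing in for BreakIterator: it remembers text and position.
public sealed class ToyBreakIterator : ICloneable
{
    private string text = "";
    private int pos;

    public void SetText(string newText) { text = newText; pos = 0; }

    // Returns the index just past the next space-delimited chunk, or -1 when done.
    public int Next()
    {
        if (pos >= text.Length) return -1;
        int space = text.IndexOf(' ', pos);
        pos = space < 0 ? text.Length : space + 1;
        return pos;
    }

    public object Clone() => MemberwiseClone();
}

public static class SharingDemo
{
    // Two consumers share one iterator: the second SetText wipes the first consumer's state.
    public static int SharedNext()
    {
        var shared = new ToyBreakIterator();
        shared.SetText("alpha beta"); // consumer A
        shared.SetText("x");          // consumer B resets the same instance
        return shared.Next();         // A now gets a boundary computed from B's text
    }

    // Each consumer gets its own clone of a prototype, so state stays separate.
    public static int ClonedNext()
    {
        var prototype = new ToyBreakIterator();
        var a = (ToyBreakIterator)prototype.Clone();
        var b = (ToyBreakIterator)prototype.Clone();
        a.SetText("alpha beta");
        b.SetText("x");
        return a.Next();              // boundary computed from A's own text
    }
}
```

In the shared case the first consumer silently receives a boundary for the wrong text. With one clone per consumer, each boundary is computed from that consumer's own input.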

    Fields


    BUFFERMAX

    Declaration
    protected const int BUFFERMAX = 1024
    Field Value
    Type Description
    System.Int32

    m_buffer

    Declaration
    protected readonly char[] m_buffer
    Field Value
    Type Description
    System.Char[]

    m_offset

    Accumulated offset of the previous buffers for this reader, used when computing offsets for offsetAtt.

    Declaration
    protected int m_offset
    Field Value
    Type Description
    System.Int32
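Because m_offset counts the characters consumed by earlier buffer fills, a position inside the current buffer has to be shifted by it to become an offset into the whole input. The helper below is a hypothetical sketch of that arithmetic, assuming the subclass tracks word positions relative to the start of the current sentence. `OffsetSketch` and `ToAbsolute` are illustrative names, not library members.

```csharp
public static class OffsetSketch
{
    // accumulatedOffset plays the role of m_offset: characters consumed by earlier buffer fills.
    // sentenceStart is relative to the current buffer; wordStart and wordEnd are relative to the sentence.
    public static (int Start, int End) ToAbsolute(int accumulatedOffset, int sentenceStart, int wordStart, int wordEnd)
    {
        int start = accumulatedOffset + sentenceStart + wordStart;
        int end = accumulatedOffset + sentenceStart + wordEnd;
        return (start, end);
    }
}
```

A subclass would typically pass each resulting value through CorrectOffset(Int32) before storing it in the offset attribute.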

    Methods


    End()

    Declaration
    public override sealed void End()
    Overrides
    TokenStream.End()

    IncrementToken()

    Declaration
    public override sealed bool IncrementToken()
    Returns
    Type Description
    System.Boolean
    Overrides
    TokenStream.IncrementToken()

    IncrementWord()

    Returns true if another word is available.

    Declaration
    protected abstract bool IncrementWord()
    Returns
    Type Description
    System.Boolean

    IsSafeEnd(Char)

    For sentence tokenization, these are the unambiguous break positions.

    Declaration
    protected virtual bool IsSafeEnd(char ch)
    Parameters
    Type Name Description
    System.Char ch
    Returns
    Type Description
    System.Boolean
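An "unambiguous break position" is a character after which the text can be cut with no risk of splitting a sentence, which presumably matters when a full buffer has to be cut before all input has been read. The sketch below shows one plausible predicate of that kind. The character set (line and paragraph terminators) is an assumption for illustration and is not stated on this page; `SafeEndSketch` is a hypothetical name.

```csharp
public static class SafeEndSketch
{
    // Treats hard line and paragraph terminators as safe cut points (assumed set, see lead-in).
    public static bool IsSafeEnd(char ch)
    {
        switch (ch)
        {
            case '\u000D': // carriage return
            case '\u000A': // line feed
            case '\u0085': // next line (NEL)
            case '\u2028': // line separator
            case '\u2029': // paragraph separator
                return true;
            default:
                return false;
        }
    }
}
```

Since IsSafeEnd(Char) is virtual, a subclass for a language with additional unambiguous sentence terminators could override it to accept those characters as well.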

    Reset()

    Declaration
    public override void Reset()
    Overrides
    Tokenizer.Reset()

    SetNextSentence(Int32, Int32)

    Provides the next input sentence for analysis.

    Declaration
    protected abstract void SetNextSentence(int sentenceStart, int sentenceEnd)
    Parameters
    Type Name Description
    System.Int32 sentenceStart
    System.Int32 sentenceEnd

    Implements

    IDisposable
    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)