
    Class ICUTokenizer

    Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).

    Words are broken across script boundaries, then segmented according to the BreakIterator and typing provided by the ICUTokenizerConfig.

    This is a Lucene.NET EXPERIMENTAL API; use at your own risk.
    Inheritance
    System.Object
    AttributeSource
    TokenStream
    Tokenizer
    ICUTokenizer
    Implements
    IDisposable
    Inherited Members
    Tokenizer.m_input
    Tokenizer.Dispose(Boolean)
    Tokenizer.CorrectOffset(Int32)
    Tokenizer.SetReader(TextReader)
    TokenStream.Dispose()
    AttributeSource.GetAttributeFactory()
    AttributeSource.GetAttributeClassesEnumerator()
    AttributeSource.GetAttributeImplsEnumerator()
    AttributeSource.AddAttributeImpl(Attribute)
    AttributeSource.AddAttribute<T>()
    AttributeSource.HasAttributes
    AttributeSource.HasAttribute<T>()
    AttributeSource.GetAttribute<T>()
    AttributeSource.ClearAttributes()
    AttributeSource.CaptureState()
    AttributeSource.RestoreState(AttributeSource.State)
    AttributeSource.GetHashCode()
    AttributeSource.Equals(Object)
    AttributeSource.ReflectAsString(Boolean)
    AttributeSource.ReflectWith(IAttributeReflector)
    AttributeSource.CloneAttributes()
    AttributeSource.CopyTo(AttributeSource)
    AttributeSource.ToString()
    Namespace: Lucene.Net.Analysis.Icu.Segmentation
    Assembly: Lucene.Net.ICU.dll
    Syntax
    public sealed class ICUTokenizer : Tokenizer, IDisposable

    Constructors

    ICUTokenizer(AttributeSource.AttributeFactory, TextReader, ICUTokenizerConfig)

    Constructs a new ICUTokenizer that breaks text into words from the given TextReader, using a tailored configuration.

    Declaration
    public ICUTokenizer(AttributeSource.AttributeFactory factory, TextReader input, ICUTokenizerConfig config)
    Parameters
    Type Name Description
    AttributeSource.AttributeFactory factory

    AttributeSource.AttributeFactory to use.

    TextReader input

    TextReader containing text to tokenize.

    ICUTokenizerConfig config

    Tailored configuration.

    ICUTokenizer(TextReader)

    Constructs a new ICUTokenizer that breaks text into words from the given TextReader.

    Declaration
    public ICUTokenizer(TextReader input)
    Parameters
    Type Name Description
    TextReader input

    TextReader containing text to tokenize.

    Remarks

    The default script-specific handling is used.

    The default attribute factory is used.

    See Also
    DefaultICUTokenizerConfig

    ICUTokenizer(TextReader, ICUTokenizerConfig)

    Constructs a new ICUTokenizer that breaks text into words from the given TextReader, using a tailored configuration.

    Declaration
    public ICUTokenizer(TextReader input, ICUTokenizerConfig config)
    Parameters
    Type Name Description
    TextReader input

    TextReader containing text to tokenize.

    ICUTokenizerConfig config

    Tailored configuration.

    Remarks

    The default attribute factory is used.
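
As a rough sketch of how the overloads above relate, the helper below picks the single-argument constructor when no configuration is supplied and the two-argument constructor otherwise. The class `IcuTokenizerExample` and its `Create` method are hypothetical names introduced for illustration only; they are not part of Lucene.NET. Only the constructor signatures listed on this page are used.

```csharp
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;

// Hypothetical helper for illustration; not part of Lucene.NET.
public static class IcuTokenizerExample
{
    // Creates a tokenizer over 'input'. When 'config' is null, the
    // single-argument constructor is used, which applies the default
    // script-specific handling and the default attribute factory.
    public static ICUTokenizer Create(TextReader input, ICUTokenizerConfig config = null)
    {
        if (config == null)
        {
            return new ICUTokenizer(input);
        }

        // Tailored configuration, still with the default attribute factory.
        // The three-argument overload additionally takes an
        // AttributeSource.AttributeFactory as its first parameter.
        return new ICUTokenizer(input, config);
    }
}
```

The caller owns the returned tokenizer and is responsible for disposing it, since ICUTokenizer implements IDisposable.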

    Methods

    End()

    Declaration
    public override void End()
    Overrides
    TokenStream.End()

    IncrementToken()

    Declaration
    public override bool IncrementToken()
    Returns
    Type Description
    System.Boolean
    Overrides
    TokenStream.IncrementToken()

    Reset()

    Declaration
    public override void Reset()
    Overrides
    Tokenizer.Reset()
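
The three overrides above are normally called in the standard TokenStream consumer order: Reset(), then IncrementToken() until it returns false, then End(), and finally Dispose(). The following is a minimal sketch of that loop, assuming the Lucene.Net.ICU and Lucene.Net packages are referenced. The class name `IcuTokenizerDemo` and the sample text are illustrative choices, not taken from this page; ICharTermAttribute and IOffsetAttribute come from Lucene.Net.Analysis.TokenAttributes.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;

// Illustrative console program; the class name is hypothetical.
public static class IcuTokenizerDemo
{
    public static void Main()
    {
        // Sample input chosen for illustration only.
        using (var tokenizer = new ICUTokenizer(new StringReader("Hello wide world")))
        {
            // Register the attributes to read before consuming the stream.
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            IOffsetAttribute offset = tokenizer.AddAttribute<IOffsetAttribute>();

            tokenizer.Reset();                  // must be called before IncrementToken()
            while (tokenizer.IncrementToken())  // returns false when input is exhausted
            {
                Console.WriteLine($"{term} [{offset.StartOffset}, {offset.EndOffset})");
            }
            tokenizer.End();                    // finalizes end-of-stream state such as the final offset
        }                                       // Dispose() releases the underlying reader
    }
}
```

Wrapping the tokenizer in a using block ensures Dispose() runs even if tokenization throws.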

    Implements

    IDisposable

    See Also

    ICUTokenizerConfig
    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)