    Class ThaiTokenizer

    Tokenizer that uses ICU4N.Text.BreakIterator to tokenize Thai text.

    Inheritance
    object
    AttributeSource
    TokenStream
    Tokenizer
    SegmentingTokenizerBase
    ThaiTokenizer
    Implements
    IDisposable
    Inherited Members
    SegmentingTokenizerBase.BUFFERMAX
    SegmentingTokenizerBase.m_buffer
    SegmentingTokenizerBase.m_offset
    SegmentingTokenizerBase.IncrementToken()
    SegmentingTokenizerBase.End()
    SegmentingTokenizerBase.IsSafeEnd(char)
    Tokenizer.m_input
    Tokenizer.Dispose(bool)
    Tokenizer.CorrectOffset(int)
    Tokenizer.SetReader(TextReader)
    TokenStream.Dispose()
    AttributeSource.GetAttributeFactory()
    AttributeSource.GetAttributeClassesEnumerator()
    AttributeSource.GetAttributeImplsEnumerator()
    AttributeSource.AddAttributeImpl(Attribute)
    AttributeSource.AddAttribute<T>()
    AttributeSource.HasAttributes
    AttributeSource.HasAttribute<T>()
    AttributeSource.GetAttribute<T>()
    AttributeSource.ClearAttributes()
    AttributeSource.RestoreState(AttributeSource.State)
    AttributeSource.GetHashCode()
    AttributeSource.Equals(object)
    AttributeSource.ReflectAsString(bool)
    AttributeSource.ReflectWith(IAttributeReflector)
    AttributeSource.CloneAttributes()
    AttributeSource.CopyTo(AttributeSource)
    AttributeSource.ToString()
    object.Equals(object, object)
    object.GetType()
    object.MemberwiseClone()
    object.ReferenceEquals(object, object)
    Namespace: Lucene.Net.Analysis.Th
    Assembly: Lucene.Net.ICU.dll
    Syntax
    public class ThaiTokenizer : SegmentingTokenizerBase, IDisposable
    Remarks

    This is an attempt to mimic the behavior of the JDK's java.text.BreakIterator approach to tokenizing Thai text. While it passes the Lucene tests, there may be many differences between this implementation and the one in the JDK.

    Unlike the JDK, this implementation is guaranteed to be stable across all supported target frameworks. While it does use ICU4N's ICU4N.Text.RuleBasedBreakIterator, this implementation doesn't follow the UAX #29 specification (http://unicode.org/reports/tr29) and is not guaranteed to behave the same as either the one in the JDK or in ICU4J.

    This implementation is provided primarily for API compatibility with Lucene. If strict Unicode compliance is desired, it is highly recommended to use the ICUTokenizer instead.
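
    As a rough illustration of the recommendation above, the sketch below swaps in ICUTokenizer. It assumes that ICUTokenizer lives in the Lucene.Net.Analysis.Icu.Segmentation namespace, that it has a constructor taking a TextReader, and that the token attribute interfaces live in Lucene.Net.Analysis.TokenAttributes. None of these is stated on this page, so check them against the version you use. The class name and sample text are made up for the example.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Icu.Segmentation; // assumed namespace of ICUTokenizer
using Lucene.Net.Analysis.TokenAttributes;  // assumed namespace of ICharTermAttribute

public static class IcuTokenizerSketch // hypothetical name, for illustration only
{
    public static void Main()
    {
        // Assumption: ICUTokenizer exposes a constructor that takes a TextReader.
        using var tokenizer = new ICUTokenizer(new StringReader("ภาษาไทย and English"));
        ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();

        tokenizer.Reset();                 // required before the first IncrementToken()
        while (tokenizer.IncrementToken())
        {
            Console.WriteLine(term.ToString());
        }
        tokenizer.End();                   // signals end of stream to the attributes
    }
}
```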

    Constructors

    ThaiTokenizer(AttributeFactory, TextReader)

    Creates a new ThaiTokenizer, supplying the Lucene.Net.Util.AttributeSource.AttributeFactory used to create its attributes.

    Declaration
    public ThaiTokenizer(AttributeSource.AttributeFactory factory, TextReader reader)
    Parameters
    Type Name Description
    AttributeSource.AttributeFactory factory
    TextReader reader
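
    A minimal sketch of calling this overload follows. It assumes the default factory is exposed as AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY in Lucene.Net.Util; that member is not listed on this page, so treat it as an assumption. The class and method names are hypothetical.

```csharp
using System.IO;
using Lucene.Net.Analysis.Th;
using Lucene.Net.Util;

public static class ThaiTokenizerFactorySketch // hypothetical helper
{
    public static ThaiTokenizer Create(TextReader reader)
    {
        // Assumption: the default factory is available under this name.
        AttributeSource.AttributeFactory factory =
            AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;

        // The factory decides which attribute implementations the tokenizer creates.
        return new ThaiTokenizer(factory, reader);
    }
}
```

    A custom factory is only needed when the attribute implementations must differ from the defaults; otherwise the single-argument constructor is simpler.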

    ThaiTokenizer(TextReader)

    Creates a new ThaiTokenizer that reads from the given TextReader.

    Declaration
    public ThaiTokenizer(TextReader reader)
    Parameters
    Type Name Description
    TextReader reader
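
    The usual consumer workflow for a tokenizer built this way might look like the sketch below. It assumes ICharTermAttribute and IOffsetAttribute live in Lucene.Net.Analysis.TokenAttributes and expose StartOffset and EndOffset properties; the class name and sample text are made up. No expected output is shown because the exact segmentation depends on the break iterator data.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Th;
using Lucene.Net.Analysis.TokenAttributes; // assumed namespace of the attribute interfaces

public static class ThaiTokenizerSketch // hypothetical name, for illustration only
{
    public static void Main()
    {
        using var tokenizer = new ThaiTokenizer(new StringReader("การที่ได้ต้องแสดงว่างานดี"));

        // Attributes are registered once and updated in place for each token.
        ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
        IOffsetAttribute offset = tokenizer.AddAttribute<IOffsetAttribute>();

        tokenizer.Reset();                 // must be called before IncrementToken()
        while (tokenizer.IncrementToken())
        {
            Console.WriteLine($"{term} [{offset.StartOffset}-{offset.EndOffset}]");
        }
        tokenizer.End();                   // finalizes offsets at end of stream
    }
}
```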

    Methods

    CaptureState()

    Captures the state of all Lucene.Net.Util.Attributes. The return value can be passed to Lucene.Net.Util.AttributeSource.RestoreState(Lucene.Net.Util.AttributeSource.State) to restore the state of this or another Lucene.Net.Util.AttributeSource.

    Declaration
    public override AttributeSource.State CaptureState()
    Returns
    Type Description
    AttributeSource.State
    Overrides
    Lucene.Net.Util.AttributeSource.CaptureState()
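
    To illustrate the capture and restore round trip described above, here is a small sketch. It assumes ICharTermAttribute is in Lucene.Net.Analysis.TokenAttributes; the class name and sample text are hypothetical. Note that restoring a state puts the attribute values back; it does not rewind the tokenizer's position in the input.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Th;
using Lucene.Net.Analysis.TokenAttributes; // assumed namespace of ICharTermAttribute
using Lucene.Net.Util;

public static class CaptureStateSketch // hypothetical name, for illustration only
{
    public static void Main()
    {
        using var tokenizer = new ThaiTokenizer(new StringReader("ภาษาไทย สวัสดี"));
        ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();

        tokenizer.Reset();
        if (tokenizer.IncrementToken())
        {
            // Snapshot the attribute values of the first token.
            AttributeSource.State saved = tokenizer.CaptureState();
            string first = term.ToString();

            if (tokenizer.IncrementToken())
            {
                // Attributes now describe the second token; put the first one back.
                tokenizer.RestoreState(saved);
                Console.WriteLine(term.ToString() == first); // attributes match the snapshot
            }
        }
        tokenizer.End();
    }
}
```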

    IncrementWord()

    Returns true if another word is available.

    Declaration
    protected override bool IncrementWord()
    Returns
    Type Description
    bool
    Overrides
    SegmentingTokenizerBase.IncrementWord()

    Reset()

    This method is called by a consumer before it begins consumption using Lucene.Net.Analysis.TokenStream.IncrementToken().

    Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.

    If you override this method, always call base.Reset(), otherwise some internal state will not be correctly reset (e.g., Lucene.Net.Analysis.Tokenizer will throw InvalidOperationException on further usage).
    Declaration
    public override void Reset()
    Overrides
    SegmentingTokenizerBase.Reset()
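
    The rule about calling base.Reset() applies to any stateful stream. The sketch below shows it with a hypothetical CountingFilter wrapped around a token stream such as ThaiTokenizer. It assumes TokenFilter exposes the wrapped stream as the protected field m_input, following the same naming as Tokenizer.m_input listed above.

```csharp
using Lucene.Net.Analysis;

// Hypothetical filter that counts the tokens passing through it.
public sealed class CountingFilter : TokenFilter
{
    private int count;

    public CountingFilter(TokenStream input) : base(input) { }

    public int Count => count;

    public override bool IncrementToken()
    {
        // Assumption: the wrapped stream is available as m_input.
        if (!m_input.IncrementToken())
        {
            return false;
        }
        count++;
        return true;
    }

    public override void Reset()
    {
        base.Reset(); // resets the wrapped stream; skipping this leaves stale internal state
        count = 0;    // clear this filter's own state so the instance can be reused
    }
}
```

    Clearing the counter in Reset() is what lets one instance behave as if freshly created each time a consumer starts over.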

    SetNextSentence(int, int)

    Provides the next input sentence for analysis.

    Declaration
    protected override void SetNextSentence(int sentenceStart, int sentenceEnd)
    Parameters
    Type Name Description
    int sentenceStart
    int sentenceEnd
    Overrides
    SegmentingTokenizerBase.SetNextSentence(int, int)

    Implements

    IDisposable
    Copyright © 2026 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.