Class ThaiTokenizer

Tokenizer that uses ICU4N.Text.BreakIterator to tokenize Thai text.

Inheritance

AttributeSource

TokenStream

Tokenizer

SegmentingTokenizerBase

ThaiTokenizer

Implements

IDisposable

Inherited Members

SegmentingTokenizerBase.BUFFERMAX

SegmentingTokenizerBase.m_buffer

SegmentingTokenizerBase.m_offset

SegmentingTokenizerBase.IncrementToken()

SegmentingTokenizerBase.End()

SegmentingTokenizerBase.IsSafeEnd(char)

Tokenizer.m_input

Tokenizer.Dispose(bool)

Tokenizer.CorrectOffset(int)

Tokenizer.SetReader(TextReader)

TokenStream.Dispose()

AttributeSource.GetAttributeFactory()

AttributeSource.GetAttributeClassesEnumerator()

AttributeSource.GetAttributeImplsEnumerator()

AttributeSource.AddAttributeImpl(Attribute)

AttributeSource.AddAttribute<T>()

AttributeSource.HasAttributes

AttributeSource.HasAttribute<T>()

AttributeSource.GetAttribute<T>()

AttributeSource.ClearAttributes()

AttributeSource.RestoreState(AttributeSource.State)

AttributeSource.GetHashCode()

AttributeSource.Equals(object)

AttributeSource.ReflectAsString(bool)

AttributeSource.ReflectWith(IAttributeReflector)

AttributeSource.CloneAttributes()

AttributeSource.CopyTo(AttributeSource)

AttributeSource.ToString()

object.Equals(object, object)

object.GetType()

object.MemberwiseClone()

object.ReferenceEquals(object, object)

Namespace: Lucene.Net.Analysis.Th

Assembly: Lucene.Net.ICU.dll

Syntax

public class ThaiTokenizer : SegmentingTokenizerBase, IDisposable

Remarks

This is an attempt to mimic the behavior of the JDK's java.text.BreakIterator approach to tokenizing Thai text. Although it passes the Lucene tests, there may be many differences, not covered by those tests, between this implementation and the one in the JDK.

Unlike the JDK, this implementation is guaranteed to be stable across all supported target frameworks. While it does use ICU4N's ICU4N.Text.RuleBasedBreakIterator, this implementation doesn't follow the UAX #29 specification (http://unicode.org/reports/tr29) and is not guaranteed to behave the same as either the one in the JDK or in ICU4J.

This implementation is provided primarily for API compatibility with Lucene. If strict Unicode compliance is desired, it is highly recommended to use the ICUTokenizer instead.

Constructors

ThaiTokenizer(AttributeFactory, TextReader)

Creates a new ThaiTokenizer that uses the supplied Lucene.Net.Util.AttributeSource.AttributeFactory.

Declaration

public ThaiTokenizer(AttributeSource.AttributeFactory factory, TextReader reader)

Parameters

Type	Name	Description
AttributeSource.AttributeFactory	factory
TextReader	reader
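
As a minimal sketch of this overload, the helper below passes an explicit factory to the constructor. The class and method names (ThaiTokenizerFactoryExample, Create) are hypothetical, and the use of AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY assumes the Lucene.NET 4.8 API; substitute your own factory where custom attribute implementations are needed.

```csharp
using System.IO;
using Lucene.Net.Analysis.Th;
using Lucene.Net.Util;

// Hypothetical helper, for illustration only.
public static class ThaiTokenizerFactoryExample
{
    // Builds a ThaiTokenizer with an explicitly supplied attribute factory.
    // DEFAULT_ATTRIBUTE_FACTORY is assumed here; any AttributeFactory works.
    public static ThaiTokenizer Create(TextReader reader)
    {
        return new ThaiTokenizer(
            AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY,
            reader);
    }
}
```

The caller owns the returned tokenizer and should dispose it when finished.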

ThaiTokenizer(TextReader)

Creates a new ThaiTokenizer.

Declaration

public ThaiTokenizer(TextReader reader)

Parameters

Type	Name	Description
TextReader	reader

Methods

CaptureState()

Captures the state of all Lucene.Net.Util.Attributes. The return value can be passed to Lucene.Net.Util.AttributeSource.RestoreState(Lucene.Net.Util.AttributeSource.State) to restore the state of this or another Lucene.Net.Util.AttributeSource.

Declaration

public override AttributeSource.State CaptureState()

Returns

Type	Description
AttributeSource.State

Overrides

Lucene.Net.Util.AttributeSource.CaptureState()
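
The capture and restore round trip described above might look like the following sketch. The class and method names (CaptureStateExample, Run) are hypothetical, and the attribute types are assumed to come from the Lucene.NET 4.8 Lucene.Net.Analysis.TokenAttributes namespace.

```csharp
using System.IO;
using Lucene.Net.Analysis.Th;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Hypothetical helper, for illustration only.
public static class CaptureStateExample
{
    // Returns the first token's text as read directly, and again after
    // the tokenizer has advanced and the saved state has been restored.
    public static (string First, string Restored) Run(string text)
    {
        using var tokenizer = new ThaiTokenizer(new StringReader(text));
        ICharTermAttribute termAtt = tokenizer.AddAttribute<ICharTermAttribute>();

        tokenizer.Reset();
        if (!tokenizer.IncrementToken())
        {
            tokenizer.End();
            return (string.Empty, string.Empty); // no tokens in the input
        }

        string first = termAtt.ToString();

        // Snapshot all attribute values for the current token.
        AttributeSource.State saved = tokenizer.CaptureState();

        // Advance; the attributes now describe a different token (or none).
        tokenizer.IncrementToken();

        // Put the attribute values back to the snapshot.
        tokenizer.RestoreState(saved);
        string restored = termAtt.ToString();

        tokenizer.End();
        return (first, restored);
    }
}
```

Restoring a state only rewrites attribute values; it does not rewind the underlying reader, so the next IncrementToken() call continues from where the tokenizer left off.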

IncrementWord()

Returns true if another word is available.

Declaration

protected override bool IncrementWord()

Returns

Type	Description
bool

Overrides

SegmentingTokenizerBase.IncrementWord()

Reset()

This method is called by a consumer before it begins consumption using Lucene.Net.Analysis.TokenStream.IncrementToken().

Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.

If you override this method, always call base.Reset(), otherwise some internal state will not be correctly reset (e.g., Lucene.Net.Analysis.Tokenizer will throw InvalidOperationException on further usage).

Declaration

public override void Reset()

Overrides

SegmentingTokenizerBase.Reset()
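
The consumer workflow this method belongs to (Reset(), then IncrementToken() until it returns false, then End()) can be sketched as follows. The class and method names (ThaiTokenizerExample, Tokenize) are hypothetical, and the attribute interfaces are assumed to be those of Lucene.NET 4.8. The project is assumed to reference the package that ships Lucene.Net.ICU.dll.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Th;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical helper, for illustration only.
public static class ThaiTokenizerExample
{
    // Collects each token as "term[start,end)" using the standard
    // consumer workflow for a TokenStream.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();

        using var tokenizer = new ThaiTokenizer(new StringReader(text));
        ICharTermAttribute termAtt = tokenizer.AddAttribute<ICharTermAttribute>();
        IOffsetAttribute offsetAtt = tokenizer.AddAttribute<IOffsetAttribute>();

        // Reset() must be called before the first IncrementToken().
        tokenizer.Reset();
        while (tokenizer.IncrementToken())
        {
            tokens.Add(termAtt.ToString()
                + "[" + offsetAtt.StartOffset + "," + offsetAtt.EndOffset + ")");
        }

        // End() performs end-of-stream operations such as setting the final offset.
        tokenizer.End();
        return tokens;
    }
}
```

The exact token boundaries depend on the break iterator rules, so no expected output is shown here.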

SetNextSentence(int, int)

Provides the next input sentence for analysis.

Declaration

protected override void SetNextSentence(int sentenceStart, int sentenceEnd)

Parameters

Type	Name	Description
int	sentenceStart
int	sentenceEnd

Overrides

SegmentingTokenizerBase.SetNextSentence(int, int)

Implements