Class StandardTokenizerImpl

This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

Tokens produced are of the following types:

<ALPHANUM>: A sequence of alphabetic and numeric characters
<NUM>: A number
<SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
<IDEOGRAPHIC>: A single CJKV ideographic character
<HIRAGANA>: A single hiragana character
<KATAKANA>: A sequence of katakana characters
<HANGUL>: A sequence of Hangul characters

Inheritance

object

StandardTokenizerImpl

Implements

IStandardTokenizerInterface

Inherited Members

object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.ReferenceEquals(object, object)

object.ToString()

Namespace: Lucene.Net.Analysis.Standard

Assembly: Lucene.Net.Analysis.Common.dll

Syntax

public sealed class StandardTokenizerImpl : IStandardTokenizerInterface

Constructors

StandardTokenizerImpl(TextReader)

Creates a new scanner that reads from the given TextReader.

Declaration

public StandardTokenizerImpl(TextReader @in)

Parameters

Type	Name	Description
TextReader	in	the TextReader to read input from.

Fields

HANGUL_TYPE

Token type for <HANGUL> tokens: a sequence of Hangul characters.

Declaration

public static readonly int HANGUL_TYPE

Field Value

Type	Description
int

HIRAGANA_TYPE

Token type for <HIRAGANA> tokens: a single hiragana character.

Declaration

public static readonly int HIRAGANA_TYPE

Field Value

Type	Description
int

IDEOGRAPHIC_TYPE

Token type for <IDEOGRAPHIC> tokens: a single CJKV ideographic character.

Declaration

public static readonly int IDEOGRAPHIC_TYPE

Field Value

Type	Description
int

KATAKANA_TYPE

Token type for <KATAKANA> tokens: a sequence of katakana characters.

Declaration

public static readonly int KATAKANA_TYPE

Field Value

Type	Description
int

NUMERIC_TYPE

Numbers

Declaration

public static readonly int NUMERIC_TYPE

Field Value

Type	Description
int

SOUTH_EAST_ASIAN_TYPE

Characters in the class \p{Line_Break = Complex_Context} come from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these characters are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.

See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA

Declaration

public static readonly int SOUTH_EAST_ASIAN_TYPE

Field Value

Type	Description
int

WORD_TYPE

Alphanumeric sequences

Declaration

public static readonly int WORD_TYPE

Field Value

Type	Description
int

YYEOF

This character denotes the end of file

Declaration

public static readonly int YYEOF

Field Value

Type	Description
int

YYINITIAL

The initial lexical state.

Declaration

public const int YYINITIAL = 0

Field Value

Type	Description
int

Properties

YyChar

Returns the current position.

Declaration

public int YyChar { get; }

Property Value

Type	Description
int

YyLength

Returns the length of the matched text region.

Declaration

public int YyLength { get; }

Property Value

Type	Description
int

YyState

Returns the current lexical state.

Declaration

public int YyState { get; }

Property Value

Type	Description
int

YyText

Returns the text matched by the current regular expression.

Declaration

public string YyText { get; }

Property Value

Type	Description
string

Methods

GetNextToken()

Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.

Declaration

public int GetNextToken()

Returns

Type	Description
int	the type of the next token, or YYEOF at the end of input

Exceptions

Type	Condition
IOException	if any I/O error occurs
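The scanning loop implied by GetNextToken() can be sketched as follows. This is a minimal illustration rather than the way Lucene.NET itself drives the scanner; the class ScannerLoopSketch and the helper Describe are hypothetical names introduced here, and only members documented on this page are called on the scanner.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;

public static class ScannerLoopSketch
{
    // Hypothetical helper: turns a token type code into a readable label.
    // The type fields are static readonly rather than const, so they
    // cannot appear as switch case labels; plain comparisons are used.
    private static string Describe(int tokenType)
    {
        if (tokenType == StandardTokenizerImpl.WORD_TYPE) return "ALPHANUM";
        if (tokenType == StandardTokenizerImpl.NUMERIC_TYPE) return "NUM";
        if (tokenType == StandardTokenizerImpl.SOUTH_EAST_ASIAN_TYPE) return "SOUTHEAST_ASIAN";
        if (tokenType == StandardTokenizerImpl.IDEOGRAPHIC_TYPE) return "IDEOGRAPHIC";
        if (tokenType == StandardTokenizerImpl.HIRAGANA_TYPE) return "HIRAGANA";
        if (tokenType == StandardTokenizerImpl.KATAKANA_TYPE) return "KATAKANA";
        if (tokenType == StandardTokenizerImpl.HANGUL_TYPE) return "HANGUL";
        return "OTHER(" + tokenType + ")";
    }

    // Scans the input and returns one "LABEL:text@position" entry per token.
    public static List<string> Scan(string input)
    {
        var results = new List<string>();
        using (var reader = new StringReader(input))
        {
            var scanner = new StandardTokenizerImpl(reader);
            int tokenType;
            // Keep scanning until the scanner reports the end of input.
            while ((tokenType = scanner.GetNextToken()) != StandardTokenizerImpl.YYEOF)
            {
                results.Add(Describe(tokenType) + ":" + scanner.YyText + "@" + scanner.YyChar);
            }
        }
        return results;
    }
}
```

Calling ScannerLoopSketch.Scan("hello world 42") should yield one entry per word or number, with whitespace skipped between tokens.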

GetText(ICharTermAttribute)

Fills Lucene.Net.Analysis.TokenAttributes.ICharTermAttribute with the current token text.

Declaration

public void GetText(ICharTermAttribute t)

Parameters

Type	Name	Description
ICharTermAttribute	t	the attribute to fill with the current token text
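As a rough sketch of how GetText might be used outside a full analysis chain, the example below copies each matched token into a term attribute. It assumes that Lucene.Net.Analysis.TokenAttributes.CharTermAttribute can be constructed directly and that its ToString() returns the term text; that class is not described on this page. The class name GetTextSketch is hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;

public static class GetTextSketch
{
    // Collects token texts by way of an ICharTermAttribute instead of YyText.
    public static List<string> Terms(string input)
    {
        var terms = new List<string>();
        // Assumption: CharTermAttribute has a public parameterless constructor.
        ICharTermAttribute termAttribute = new CharTermAttribute();
        using (var reader = new StringReader(input))
        {
            var scanner = new StandardTokenizerImpl(reader);
            while (scanner.GetNextToken() != StandardTokenizerImpl.YYEOF)
            {
                // Fill the attribute with the text of the current token.
                scanner.GetText(termAttribute);
                terms.Add(termAttribute.ToString());
            }
        }
        return terms;
    }
}
```

Reusing one attribute instance across tokens mirrors how token streams typically avoid allocating a new string holder per token.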

YyBegin(int)

Enters a new lexical state

Declaration

public void YyBegin(int newState)

Parameters

Type	Name	Description
int	newState	the new lexical state

YyCharAt(int)

Returns the character at position pos from the matched text.

It is equivalent to YyText[pos], but faster.

Declaration

public char YyCharAt(int pos)

Parameters

Type	Name	Description
int	pos	the position of the character to fetch. A value from 0 to YyLength-1.

Returns

Type	Description
char	the character at position pos
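A small sketch of per-character access to the matched text, using YyCharAt together with YyLength. The class and method names (YyCharAtSketch, CountUpper) are hypothetical and exist only for this illustration.

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;

public static class YyCharAtSketch
{
    // Counts uppercase characters across all tokens in the input.
    public static int CountUpper(string input)
    {
        int upper = 0;
        using (var reader = new StringReader(input))
        {
            var scanner = new StandardTokenizerImpl(reader);
            while (scanner.GetNextToken() != StandardTokenizerImpl.YYEOF)
            {
                // Valid positions run from 0 to YyLength - 1.
                for (int pos = 0; pos < scanner.YyLength; pos++)
                {
                    if (char.IsUpper(scanner.YyCharAt(pos)))
                    {
                        upper++;
                    }
                }
            }
        }
        return upper;
    }
}
```

Reading characters this way avoids building a string through YyText for every token.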

YyClose()

Disposes the input stream.

Declaration

public void YyClose()

YyPushBack(int)

Pushes the specified amount of characters back into the input stream.

They will be read again by the next call of the scanning method.

Declaration

public void YyPushBack(int number)

Parameters

Type	Name	Description
int	number	the number of characters to be read again. This number must not be greater than YyLength!

YyReset(TextReader)

Resets the scanner to read from a new input stream. Does not close the old reader.

All internal variables are reset, and the old input stream cannot be reused (the internal buffer is discarded and lost). The lexical state is set to YYINITIAL.

The internal scan buffer is resized down to its initial length if it has grown.

Declaration

public void YyReset(TextReader reader)

Parameters

Type	Name	Description
TextReader	reader	the new input stream
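One way to reuse a single scanner across several inputs is sketched below. Because YyReset does not close the old reader, the sketch disposes each reader itself. The names ScannerReuseSketch and ScanAll are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;

public static class ScannerReuseSketch
{
    // Tokenizes several inputs with one scanner instance and
    // returns all token texts in order.
    public static List<string> ScanAll(IEnumerable<string> inputs)
    {
        var texts = new List<string>();
        StandardTokenizerImpl scanner = null;
        foreach (var input in inputs)
        {
            // The caller owns each reader; YyReset leaves the old one open.
            using (var reader = new StringReader(input))
            {
                if (scanner == null)
                {
                    scanner = new StandardTokenizerImpl(reader);
                }
                else
                {
                    // Discards buffered state and returns to YYINITIAL.
                    scanner.YyReset(reader);
                }
                while (scanner.GetNextToken() != StandardTokenizerImpl.YYEOF)
                {
                    texts.Add(scanner.YyText);
                }
            }
        }
        return texts;
    }
}
```

Each input is scanned to the end before the next reset, so no buffered text from one input carries over into the next.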
