Class UAX29URLEmailTokenizerImpl

This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.

Tokens produced are of the following types:

<ALPHANUM>: A sequence of alphabetic and numeric characters
<NUM>: A number
<URL>: A URL
<EMAIL>: An email address
<SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
<IDEOGRAPHIC>: A single CJKV ideographic character
<HIRAGANA>: A single hiragana character
<KATAKANA>: A sequence of katakana characters
<HANGUL>: A sequence of Hangul characters

Inheritance

System.Object

UAX29URLEmailTokenizerImpl

Implements

IStandardTokenizerInterface

Inherited Members

System.Object.Equals(System.Object)

System.Object.Equals(System.Object, System.Object)

System.Object.GetHashCode()

System.Object.GetType()

System.Object.MemberwiseClone()

System.Object.ReferenceEquals(System.Object, System.Object)

System.Object.ToString()

Namespace: Lucene.Net.Analysis.Standard

Assembly: Lucene.Net.Analysis.Common.dll

Syntax

public sealed class UAX29URLEmailTokenizerImpl : IStandardTokenizerInterface

Constructors

UAX29URLEmailTokenizerImpl(TextReader)

Creates a new scanner.

Declaration

public UAX29URLEmailTokenizerImpl(TextReader in)

Parameters

Type	Name	Description
System.IO.TextReader	in	the TextReader to read input from.

Fields

AVOID_BAD_URL

Declaration

public const int AVOID_BAD_URL = 2

Field Value

Type	Description
System.Int32

EMAIL_TYPE

Declaration

public static readonly int EMAIL_TYPE

Field Value

Type	Description
System.Int32

HANGUL_TYPE

Declaration

public static readonly int HANGUL_TYPE

Field Value

Type	Description
System.Int32

HIRAGANA_TYPE

Declaration

public static readonly int HIRAGANA_TYPE

Field Value

Type	Description
System.Int32

IDEOGRAPHIC_TYPE

Declaration

public static readonly int IDEOGRAPHIC_TYPE

Field Value

Type	Description
System.Int32

KATAKANA_TYPE

Declaration

public static readonly int KATAKANA_TYPE

Field Value

Type	Description
System.Int32

NUMERIC_TYPE

Numbers

Declaration

public static readonly int NUMERIC_TYPE

Field Value

Type	Description
System.Int32

SOUTH_EAST_ASIAN_TYPE

Characters in the class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these characters are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.

See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA

Declaration

public static readonly int SOUTH_EAST_ASIAN_TYPE

Field Value

Type	Description
System.Int32

URL_TYPE

Declaration

public static readonly int URL_TYPE

Field Value

Type	Description
System.Int32

WORD_TYPE

Alphanumeric sequences

Declaration

public static readonly int WORD_TYPE

Field Value

Type	Description
System.Int32

YYEOF

This value denotes the end of the input.

Declaration

public static readonly int YYEOF

Field Value

Type	Description
System.Int32

YYINITIAL

The initial lexical state.

Declaration

public const int YYINITIAL = 0

Field Value

Type	Description
System.Int32

Properties

YyChar

Declaration

public int YyChar { get; }

Property Value

Type	Description
System.Int32

YyLength

Returns the length of the matched text region.

Declaration

public int YyLength { get; }

Property Value

Type	Description
System.Int32

YyState

Returns the current lexical state.

Declaration

public int YyState { get; }

Property Value

Type	Description
System.Int32

YyText

Returns the text matched by the current regular expression.

Declaration

public string YyText { get; }

Property Value

Type	Description
System.String

Methods

GetNextToken()

Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.

Declaration

public int GetNextToken()

Returns

Type	Description
System.Int32	the next token

Exceptions

Type	Condition
System.IO.IOException	if an I/O error occurs
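
The scanning loop described above can be sketched as follows. This is a minimal illustration that uses only members documented on this page; it assumes the Lucene.Net.Analysis.Common assembly is referenced. The class name ScanLoopSketch and its helper methods are hypothetical, and most applications would use a higher-level tokenizer rather than driving this scanner directly.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;

// Hypothetical helper class, not part of Lucene.NET.
public static class ScanLoopSketch
{
    // Maps a token type code to a readable label.
    // The type fields are static readonly (not const), so they cannot be
    // used as switch case labels; an if-chain is used instead.
    public static string Describe(int type)
    {
        if (type == UAX29URLEmailTokenizerImpl.WORD_TYPE) return "ALPHANUM";
        if (type == UAX29URLEmailTokenizerImpl.NUMERIC_TYPE) return "NUM";
        if (type == UAX29URLEmailTokenizerImpl.URL_TYPE) return "URL";
        if (type == UAX29URLEmailTokenizerImpl.EMAIL_TYPE) return "EMAIL";
        if (type == UAX29URLEmailTokenizerImpl.SOUTH_EAST_ASIAN_TYPE) return "SOUTHEAST_ASIAN";
        if (type == UAX29URLEmailTokenizerImpl.IDEOGRAPHIC_TYPE) return "IDEOGRAPHIC";
        if (type == UAX29URLEmailTokenizerImpl.HIRAGANA_TYPE) return "HIRAGANA";
        if (type == UAX29URLEmailTokenizerImpl.KATAKANA_TYPE) return "KATAKANA";
        if (type == UAX29URLEmailTokenizerImpl.HANGUL_TYPE) return "HANGUL";
        return "OTHER";
    }

    // Scans the whole input and returns one "LABEL: text" entry per token.
    public static List<string> Scan(string text)
    {
        var results = new List<string>();
        var scanner = new UAX29URLEmailTokenizerImpl(new StringReader(text));
        try
        {
            int type;
            // GetNextToken() is called until it reports the end of input.
            while ((type = scanner.GetNextToken()) != UAX29URLEmailTokenizerImpl.YYEOF)
            {
                results.Add(Describe(type) + ": " + scanner.YyText);
            }
        }
        finally
        {
            // YyClose() disposes the underlying reader.
            scanner.YyClose();
        }
        return results;
    }
}
```

Calling ScanLoopSketch.Scan with a string that mixes words, a URL, and an email address should return one labeled entry per token, in input order.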

GetText(ICharTermAttribute)

Fills ICharTermAttribute with the current token text.

Declaration

public void GetText(ICharTermAttribute t)

Parameters

Type	Name	Description
Lucene.Net.Analysis.TokenAttributes.ICharTermAttribute	t
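
As a rough sketch of how GetText might be used, the example below copies the text of the first token into a term attribute. It assumes that CharTermAttribute from Lucene.Net.Analysis.TokenAttributes can be constructed directly and that its ToString() returns the term text; neither is stated on this page. GetTextSketch and FirstTokenText are hypothetical names.

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical helper class, not part of Lucene.NET.
public static class GetTextSketch
{
    // Returns the text of the first token in the input, or null if there is none.
    public static string FirstTokenText(string text)
    {
        var scanner = new UAX29URLEmailTokenizerImpl(new StringReader(text));
        // Assumption: CharTermAttribute is a concrete ICharTermAttribute
        // with a public parameterless constructor.
        ICharTermAttribute term = new CharTermAttribute();
        try
        {
            if (scanner.GetNextToken() == UAX29URLEmailTokenizerImpl.YYEOF)
            {
                return null; // no tokens in the input
            }
            // Fill the attribute with the text of the token just matched.
            scanner.GetText(term);
            return term.ToString();
        }
        finally
        {
            scanner.YyClose();
        }
    }
}
```

Filling a reusable attribute in this way avoids going through the YyText string when the caller already works with term attributes.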

YyBegin(Int32)

Enters a new lexical state.

Declaration

public void YyBegin(int newState)

Parameters

Type	Name	Description
System.Int32	newState	the new lexical state

YyCharAt(Int32)

Returns the character at position pos from the matched text.

This is equivalent to YyText[pos], but faster.

Declaration

public char YyCharAt(int pos)

Parameters

Type	Name	Description
System.Int32	pos	the position of the character to fetch. A value from 0 to YyLength-1.

Returns

Type	Description
System.Char	the character at position pos

YyClose()

Disposes the input stream.

Declaration

public void YyClose()

YyPushBack(Int32)

Pushes the specified number of characters back into the input stream.

They will be read again by the next call of the scanning method.

Declaration

public void YyPushBack(int number)

Parameters

Type	Name	Description
System.Int32	number	the number of characters to be read again. This number must not be greater than YyLength.

YyReset(TextReader)

Resets the scanner to read from a new input stream. Does not close the old reader.

All internal variables are reset; the old input stream cannot be reused (its internal buffer is discarded and lost). The lexical state is set to YYINITIAL.

Internal scan buffer is resized down to its initial length, if it has grown.

Declaration

public void YyReset(TextReader reader)

Parameters

Type	Name	Description
System.IO.TextReader	reader	the new input stream
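
The reset behavior described above allows one scanner instance to be reused across several inputs. The sketch below illustrates this under the assumption that the Lucene.Net.Analysis.Common assembly is referenced; ResetSketch and CountTokens are hypothetical names. Because YyReset does not close the old reader, each reader is disposed by the caller.

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;

// Hypothetical helper class, not part of Lucene.NET.
public static class ResetSketch
{
    // Points the scanner at a new reader and counts the tokens it produces.
    public static int CountTokens(UAX29URLEmailTokenizerImpl scanner, TextReader reader)
    {
        // Discards any buffered input and returns to the YYINITIAL state.
        scanner.YyReset(reader);
        int count = 0;
        while (scanner.GetNextToken() != UAX29URLEmailTokenizerImpl.YYEOF)
        {
            count++;
        }
        return count;
    }

    // Counts tokens in several inputs with a single scanner instance.
    public static int[] CountAll(string[] inputs)
    {
        var counts = new int[inputs.Length];
        // The scanner starts on an empty reader; YyReset supplies the real inputs.
        var scanner = new UAX29URLEmailTokenizerImpl(new StringReader(string.Empty));
        for (int i = 0; i < inputs.Length; i++)
        {
            // YyReset does not close the previous reader, so dispose each one here.
            using (var reader = new StringReader(inputs[i]))
            {
                counts[i] = CountTokens(scanner, reader);
            }
        }
        return counts;
    }
}
```

Reusing the scanner in this way avoids allocating a new scan buffer for every input.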

Implements

IStandardTokenizerInterface