Class UAX29URLEmailTokenizerImpl
This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <URL>: A URL
- <EMAIL>: An email address
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
- <KATAKANA>: A sequence of katakana characters
- <HANGUL>: A sequence of Hangul characters
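The token types listed above are easiest to observe through the public tokenizer that wraps this scanner. The following is a minimal sketch, assuming Lucene.NET 4.8 and its `UAX29URLEmailTokenizer`, `ICharTermAttribute`, and `ITypeAttribute` types; the `TokenTypeDemo` class and its `Tokenize` helper are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class TokenTypeDemo
{
    // Hypothetical helper: returns (token text, token type) pairs for the input.
    public static List<KeyValuePair<string, string>> Tokenize(string text)
    {
        var result = new List<KeyValuePair<string, string>>();
        using (var tokenizer = new UAX29URLEmailTokenizer(LuceneVersion.LUCENE_48, new StringReader(text)))
        {
            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            var type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();                       // required before the first IncrementToken()
            while (tokenizer.IncrementToken())
            {
                // type.Type holds a label such as <ALPHANUM>, <NUM>, <URL>, or <EMAIL>.
                result.Add(new KeyValuePair<string, string>(term.ToString(), type.Type));
            }
            tokenizer.End();
        }
        return result;
    }

    public static void Main()
    {
        foreach (var pair in Tokenize("Write to bob@example.com about Lucene 4.8"))
        {
            Console.WriteLine($"{pair.Value}\t{pair.Key}");
        }
    }
}
```

Most applications should prefer this wrapper; the scanner class documented here is the generated implementation it delegates to.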
Inheritance
Implements
Inherited Members
Namespace: Lucene.Net.Analysis.Standard
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
public sealed class UAX29URLEmailTokenizerImpl : IStandardTokenizerInterface
Constructors
UAX29URLEmailTokenizerImpl(TextReader)
Creates a new scanner.
Declaration
public UAX29URLEmailTokenizerImpl(TextReader in)
Parameters
| Type | Name | Description | 
|---|---|---|
| System.IO.TextReader | in | the TextReader to read input from. | 
Fields
AVOID_BAD_URL
Declaration
public const int AVOID_BAD_URL = 2
Field Value
| Type | Description | 
|---|---|
| System.Int32 | 
EMAIL_TYPE
Declaration
public static readonly int EMAIL_TYPE
Field Value
| Type | Description | 
|---|---|
| System.Int32 | 
HANGUL_TYPE
Declaration
public static readonly int HANGUL_TYPE
Field Value
| Type | Description | 
|---|---|
| System.Int32 | 
HIRAGANA_TYPE
Declaration
public static readonly int HIRAGANA_TYPE
Field Value
| Type | Description | 
|---|---|
| System.Int32 | 
IDEOGRAPHIC_TYPE
Declaration
public static readonly int IDEOGRAPHIC_TYPE
Field Value
| Type | Description | 
|---|---|
| System.Int32 | 
KATAKANA_TYPE
Declaration
public static readonly int KATAKANA_TYPE
Field Value
| Type | Description | 
|---|---|
| System.Int32 | 
NUMERIC_TYPE
Numbers
Declaration
public static readonly int NUMERIC_TYPE
Field Value
| Type | Description | 
|---|---|
| System.Int32 | 
SOUTH_EAST_ASIAN_TYPE
Characters in the class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
Declaration
public static readonly int SOUTH_EAST_ASIAN_TYPE
Field Value
| Type | Description | 
|---|---|
| System.Int32 | 
URL_TYPE
Declaration
public static readonly int URL_TYPE
Field Value
| Type | Description | 
|---|---|
| System.Int32 | 
WORD_TYPE
Alphanumeric sequences
Declaration
public static readonly int WORD_TYPE
Field Value
| Type | Description | 
|---|---|
| System.Int32 | 
YYEOF
This character denotes the end of file.
Declaration
public static readonly int YYEOF
Field Value
| Type | Description | 
|---|---|
| System.Int32 | 
YYINITIAL
Lexical states
Declaration
public const int YYINITIAL = 0
Field Value
| Type | Description | 
|---|---|
| System.Int32 | 
Properties
YyChar
Declaration
public int YyChar { get; }
Property Value
| Type | Description | 
|---|---|
| System.Int32 | 
YyLength
Returns the length of the matched text region.
Declaration
public int YyLength { get; }
Property Value
| Type | Description | 
|---|---|
| System.Int32 | 
YyState
Returns the current lexical state.
Declaration
public int YyState { get; }
Property Value
| Type | Description | 
|---|---|
| System.Int32 | 
YyText
Returns the text matched by the current regular expression.
Declaration
public string YyText { get; }
Property Value
| Type | Description | 
|---|---|
| System.String | 
Methods
GetNextToken()
Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
Declaration
public int GetNextToken()
Returns
| Type | Description | 
|---|---|
| System.Int32 | the next token | 
Exceptions
| Type | Condition | 
|---|---|
| System.IO.IOException | if any I/O-Error occurs | 
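A typical scan loop calls `GetNextToken()` until it returns `YYEOF`, reading the properties documented above after each match. The sketch below is illustrative only; it assumes the returned value can be compared against the `*_TYPE` fields of this class and that `YyChar` reports the character offset at which the matched text starts. The `ScanLoopDemo` class and its `Describe` helper are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;

public static class ScanLoopDemo
{
    // Hypothetical helper: produces one descriptive line per token in the input.
    public static List<string> Describe(string text)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        {
            var scanner = new UAX29URLEmailTokenizerImpl(reader);
            int tokenType;

            // GetNextToken() returns YYEOF once the input is exhausted.
            while ((tokenType = scanner.GetNextToken()) != UAX29URLEmailTokenizerImpl.YYEOF)
            {
                string kind =
                    tokenType == UAX29URLEmailTokenizerImpl.EMAIL_TYPE ? "email" :
                    tokenType == UAX29URLEmailTokenizerImpl.URL_TYPE ? "url" :
                    "other";

                // Assumption: YyChar is the start offset of the current match.
                lines.Add($"{kind} at {scanner.YyChar}, length {scanner.YyLength}: {scanner.YyText}");
            }
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe("mail bob@example.com or see http://example.com/docs"))
        {
            Console.WriteLine(line);
        }
    }
}
```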
GetText(ICharTermAttribute)
Fills ICharTermAttribute with the current token text.
Declaration
public void GetText(ICharTermAttribute t)
Parameters
| Type | Name | Description | 
|---|---|---|
| Lucene.Net.Analysis.TokenAttributes.ICharTermAttribute | t | 
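`GetText` copies the current match into an existing attribute instead of allocating a string as `YyText` does. A minimal sketch of that pattern follows, assuming a standalone `CharTermAttribute` instance from `Lucene.Net.Analysis.TokenAttributes` can be used as the target; the `GetTextDemo` class and its `Terms` helper are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;

public static class GetTextDemo
{
    // Hypothetical helper: collects token texts by way of a reusable term attribute.
    public static List<string> Terms(string text)
    {
        var terms = new List<string>();
        using (var reader = new StringReader(text))
        {
            var scanner = new UAX29URLEmailTokenizerImpl(reader);

            // One attribute instance is reused for every token.
            ICharTermAttribute term = new CharTermAttribute();

            while (scanner.GetNextToken() != UAX29URLEmailTokenizerImpl.YYEOF)
            {
                scanner.GetText(term);        // overwrites the attribute with the current match
                terms.Add(term.ToString());   // snapshot the text before the next token replaces it
            }
        }
        return terms;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", Terms("contact bob@example.com today")));
    }
}
```

Inside a tokenizer the attribute would normally come from `AddAttribute<ICharTermAttribute>()` rather than being constructed directly.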
YyBegin(Int32)
Enters a new lexical state.
Declaration
public void YyBegin(int newState)
Parameters
| Type | Name | Description | 
|---|---|---|
| System.Int32 | newState | the new lexical state | 
YyCharAt(Int32)
Returns the character at position pos from the matched text.
It is equivalent to YyText[pos], but faster.
Declaration
public char YyCharAt(int pos)
Parameters
| Type | Name | Description | 
|---|---|---|
| System.Int32 | pos | the position of the character to fetch. A value from 0 to YyLength-1. | 
Returns
| Type | Description | 
|---|---|
| System.Char | the character at position pos | 
YyClose()
Disposes the input stream.
Declaration
public void YyClose()
YyPushBack(Int32)
Pushes the specified number of characters back into the input stream.
They will be read again by the next call of the scanning method.
Declaration
public void YyPushBack(int number)
Parameters
| Type | Name | Description | 
|---|---|---|
| System.Int32 | number | the number of characters to be read again. This number must not be greater than YyLength! | 
YyReset(TextReader)
Resets the scanner to read from a new input stream. Does not close the old reader.
All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to YYINITIAL.
Internal scan buffer is resized down to its initial length, if it has grown.
Declaration
public void YyReset(TextReader reader)
Parameters
| Type | Name | Description | 
|---|---|---|
| System.IO.TextReader | reader | the new input stream |
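Because `YyReset` does not close the old reader, one scanner instance can be reused across several inputs as long as the caller disposes each reader itself. The following is a sketch under that reading of the description above; the `ResetDemo` class and its `CountTokens` helper are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;

public static class ResetDemo
{
    // Hypothetical helper: counts tokens per input while reusing a single scanner.
    public static List<int> CountTokens(IEnumerable<string> inputs)
    {
        var counts = new List<int>();

        // The initial reader is a placeholder; every input is attached through YyReset.
        var scanner = new UAX29URLEmailTokenizerImpl(new StringReader(string.Empty));

        foreach (string input in inputs)
        {
            // The caller owns the reader: YyReset does not close the previous one.
            using (var reader = new StringReader(input))
            {
                scanner.YyReset(reader);   // lexical state returns to YYINITIAL

                int count = 0;
                while (scanner.GetNextToken() != UAX29URLEmailTokenizerImpl.YYEOF)
                {
                    count++;
                }
                counts.Add(count);
            }
        }
        return counts;
    }

    public static void Main()
    {
        var counts = CountTokens(new[] { "first input", "bob@example.com" });
        Console.WriteLine(string.Join(", ", counts));
    }
}
```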