Class UAX29URLEmailTokenizer
This class implements the Word Break rules from the Unicode Text Segmentation
algorithm, as specified in Unicode Standard Annex #29.
URLs and email addresses are also tokenized according to the relevant RFCs.
Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <URL>: A URL
- <EMAIL>: An email address
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast
Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
You must specify the required LuceneVersion
compatibility when creating UAX29URLEmailTokenizer:
- As of 3.4, Hiragana and Han characters are no longer wrongly split
from their combining characters. If you pass an earlier version,
you get the old, incorrect splitting, which is preserved for backward compatibility.
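The following is a minimal sketch of how the tokenizer might be driven to collect each token's text and type. The class `UrlEmailTokenizerExample`, the helper `Tokenize`, and the sample text are hypothetical; it assumes the Lucene.NET 4.8 packages are referenced and that `LuceneVersion.LUCENE_48` is the desired compatibility version.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class UrlEmailTokenizerExample
{
    // Hypothetical helper: returns (term, type) pairs for the given text.
    public static List<KeyValuePair<string, string>> Tokenize(string text)
    {
        var tokens = new List<KeyValuePair<string, string>>();
        using (var tokenizer = new UAX29URLEmailTokenizer(
            LuceneVersion.LUCENE_48, new StringReader(text)))
        {
            // Attributes expose the current token's text and type string.
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            ITypeAttribute type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();                  // required before the first IncrementToken()
            while (tokenizer.IncrementToken())  // false once the input is exhausted
            {
                tokens.Add(new KeyValuePair<string, string>(term.ToString(), type.Type));
            }
            tokenizer.End();                    // finalizes end-of-stream state
        }
        return tokens;
    }

    public static void Main()
    {
        foreach (var t in Tokenize("Mail dev@example.org or see http://example.org/docs"))
        {
            Console.WriteLine($"{t.Key}\t{t.Value}");
        }
    }
}
```

The `using` block relies on the class implementing IDisposable, and the Reset(), IncrementToken(), End() order follows the usual token stream workflow.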
Inheritance
System.Object
UAX29URLEmailTokenizer
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
public sealed class UAX29URLEmailTokenizer : Tokenizer, IDisposable
Constructors
UAX29URLEmailTokenizer(LuceneVersion, AttributeSource.AttributeFactory, TextReader)
Declaration
public UAX29URLEmailTokenizer(LuceneVersion matchVersion, AttributeSource.AttributeFactory factory, TextReader input)
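Parameters
Type | Name | Description
LuceneVersion | matchVersion | Lucene compatibility version
AttributeSource.AttributeFactory | factory |
TextReader | input | The input reader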
UAX29URLEmailTokenizer(LuceneVersion, TextReader)
Declaration
public UAX29URLEmailTokenizer(LuceneVersion matchVersion, TextReader input)
Parameters
Type | Name | Description
LuceneVersion | matchVersion | Lucene compatibility version
TextReader | input | The input reader
Fields
ALPHANUM
Declaration
public const int ALPHANUM = 0
Field Value
Type | Description
System.Int32 |
EMAIL
Declaration
public const int EMAIL = 8
Field Value
Type | Description
System.Int32 |
HANGUL
Declaration
public const int HANGUL = 6
Field Value
Type | Description
System.Int32 |
HIRAGANA
Declaration
public const int HIRAGANA = 4
Field Value
Type | Description
System.Int32 |
IDEOGRAPHIC
Declaration
public const int IDEOGRAPHIC = 3
Field Value
Type | Description
System.Int32 |
KATAKANA
Declaration
public const int KATAKANA = 5
Field Value
Type | Description
System.Int32 |
NUM
Declaration
public const int NUM = 1
Field Value
Type | Description
System.Int32 |
SOUTHEAST_ASIAN
Declaration
public const int SOUTHEAST_ASIAN = 2
Field Value
Type | Description
System.Int32 |
TOKEN_TYPES
String token types, indexed by the corresponding token type int constants
Declaration
public static readonly string[] TOKEN_TYPES
Field Value
Type | Description
System.String[] |
URL
Declaration
public const int URL = 7
Field Value
Type | Description
System.Int32 |
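As a small illustration of how the int constants relate to TOKEN_TYPES, the sketch below looks up the string label for a constant. The class `TokenTypeNames` and helper `NameOf` are hypothetical names introduced here; the sketch assumes the array is indexed directly by the constants, as the TOKEN_TYPES description suggests.

```csharp
using System;
using Lucene.Net.Analysis.Standard;

public static class TokenTypeNames
{
    // Hypothetical helper: maps a token type constant to its string label.
    public static string NameOf(int tokenType)
    {
        return UAX29URLEmailTokenizer.TOKEN_TYPES[tokenType];
    }

    public static void Main()
    {
        Console.WriteLine(NameOf(UAX29URLEmailTokenizer.URL));
        Console.WriteLine(NameOf(UAX29URLEmailTokenizer.EMAIL));
    }
}
```

The returned strings are the same type labels a token's type attribute would carry, so they can be compared against it when filtering tokens by type.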
Properties
MaxTokenLength
Gets or sets the maximum allowed token length. Any token longer
than this is skipped.
Declaration
public int MaxTokenLength { get; set; }
Property Value
Type | Description
System.Int32 |
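A minimal sketch of MaxTokenLength in use follows. The class `MaxTokenLengthExample`, the helper `Terms`, and the sample input are hypothetical; it assumes Lucene.NET 4.8 and `LuceneVersion.LUCENE_48`. With a limit of 5, a long word such as "extraordinarily" should be skipped per the description above, while shorter words pass through.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class MaxTokenLengthExample
{
    // Hypothetical helper: tokenizes text with the given length limit
    // and returns the surviving terms.
    public static List<string> Terms(string text, int maxTokenLength)
    {
        var terms = new List<string>();
        using (var tokenizer = new UAX29URLEmailTokenizer(
            LuceneVersion.LUCENE_48, new StringReader(text)))
        {
            // Set the limit before consuming any tokens.
            tokenizer.MaxTokenLength = maxTokenLength;

            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                terms.Add(term.ToString());
            }
            tokenizer.End();
        }
        return terms;
    }

    public static void Main()
    {
        foreach (string t in Terms("tiny extraordinarily tiny", 5))
        {
            Console.WriteLine(t);
        }
    }
}
```

Note that a low limit also drops URLs and email addresses that exceed it, since they are emitted as single tokens.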
Methods
Dispose(Boolean)
Declaration
protected override void Dispose(bool disposing)
Parameters
Type | Name | Description
System.Boolean | disposing |
End()
Declaration
public override sealed void End()
Overrides
IncrementToken()
Declaration
public override sealed bool IncrementToken()
Returns
Type | Description
System.Boolean |
Overrides
Reset()
Declaration
public override void Reset()
Overrides
Implements
IDisposable