Class ClassicTokenizer
A grammar-based tokenizer constructed with JFlex (and then ported to .NET).
This should be a good tokenizer for most European-language documents:
- Splits words at punctuation characters, removing punctuation. However, a
dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case
the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
Many applications have specific tokenizer needs. If this tokenizer does
not suit your application, please consider copying this source code
directory to your project and maintaining your own grammar-based tokenizer.
ClassicTokenizer was named StandardTokenizer in Lucene versions prior to 3.1.
As of 3.1, StandardTokenizer implements Unicode text segmentation,
as specified by UAX#29.
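The splitting rules above can be observed by running a short string through the tokenizer and printing each token with its type. The following is a minimal sketch, not a definitive usage pattern: it assumes Lucene.NET 4.8 (so `LuceneVersion.LUCENE_48` is available), and the class name `ClassicTokenizerSketch`, the helper `Tokenize`, and the sample sentence are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class ClassicTokenizerSketch
{
    // Hypothetical helper: returns each token's text paired with its type string.
    public static List<(string Term, string Type)> Tokenize(string text)
    {
        var tokens = new List<(string Term, string Type)>();

        // Assumption: LuceneVersion.LUCENE_48 matches the installed Lucene.NET package.
        using (var tokenizer = new ClassicTokenizer(LuceneVersion.LUCENE_48, new StringReader(text)))
        {
            // Attributes expose the current token's text and type after each IncrementToken().
            ICharTermAttribute termAtt = tokenizer.AddAttribute<ICharTermAttribute>();
            ITypeAttribute typeAtt = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();                  // must be called before the first IncrementToken()
            while (tokenizer.IncrementToken())  // returns false once the input is exhausted
            {
                tokens.Add((termAtt.ToString(), typeAtt.Type));
            }
            tokenizer.End();                    // finalizes end-of-stream state
        }

        return tokens;
    }

    public static void Main()
    {
        // Sample text containing an email address, a hyphenated product number, and a hostname.
        string sample = "Mail jane.doe@example.com about part XL-2000 or visit www.example.org";
        foreach (var (term, type) in Tokenize(sample))
        {
            Console.WriteLine($"{term}\t{type}");
        }
    }
}
```

Calling Reset() first and End() after the loop follows the usual TokenStream consumption order; the `using` block takes care of Dispose().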
Inheritance
System.Object
AttributeSource
TokenStream
Tokenizer
ClassicTokenizer
Implements
System.IDisposable
Inherited Members
System.Object.Equals(System.Object, System.Object)
System.Object.GetType()
System.Object.MemberwiseClone()
System.Object.ReferenceEquals(System.Object, System.Object)
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
public sealed class ClassicTokenizer : Tokenizer, IDisposable
Constructors
ClassicTokenizer(LuceneVersion, AttributeSource.AttributeFactory, TextReader)
Declaration
public ClassicTokenizer(LuceneVersion matchVersion, AttributeSource.AttributeFactory factory, TextReader input)
Parameters
Type | Name | Description
LuceneVersion | matchVersion |
AttributeSource.AttributeFactory | factory |
TextReader | input |
ClassicTokenizer(LuceneVersion, TextReader)
Creates a new instance of the ClassicTokenizer. Attaches the input to the newly created JFlex scanner.
Declaration
public ClassicTokenizer(LuceneVersion matchVersion, TextReader input)
Parameters
Type | Name | Description
LuceneVersion | matchVersion |
TextReader | input |
Fields
ACRONYM
Declaration
public const int ACRONYM = 2
Field Value
Type | Description
System.Int32 |
ACRONYM_DEP
Declaration
public const int ACRONYM_DEP = 8
Field Value
Type | Description
System.Int32 |
ALPHANUM
Declaration
public const int ALPHANUM = 0
Field Value
Type | Description
System.Int32 |
APOSTROPHE
Declaration
public const int APOSTROPHE = 1
Field Value
Type | Description
System.Int32 |
CJ
Declaration
public const int CJ = 7
Field Value
Type | Description
System.Int32 |
COMPANY
Declaration
public const int COMPANY = 3
Field Value
Type | Description
System.Int32 |
EMAIL
Declaration
public const int EMAIL = 4
Field Value
Type | Description
System.Int32 |
HOST
Declaration
public const int HOST = 5
Field Value
Type | Description
System.Int32 |
NUM
Declaration
public const int NUM = 6
Field Value
Type | Description
System.Int32 |
TOKEN_TYPES
String token types that correspond to the token type int constants.
Declaration
public static readonly string[] TOKEN_TYPES
Field Value
Type | Description
System.String[] |
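To illustrate how the int constants relate to TOKEN_TYPES, the sketch below looks up the type string for several constants. It assumes the array is indexed by the constant values, which is how the description above reads but is not spelled out on this page; the class name `TokenTypeNamesSketch` and the helper `NameOf` are hypothetical.

```csharp
using System;
using Lucene.Net.Analysis.Standard;

public static class TokenTypeNamesSketch
{
    // Hypothetical helper. Assumption: TOKEN_TYPES is indexed by the int type constants.
    public static string NameOf(int tokenType) => ClassicTokenizer.TOKEN_TYPES[tokenType];

    public static void Main()
    {
        int[] constants =
        {
            ClassicTokenizer.ALPHANUM,
            ClassicTokenizer.APOSTROPHE,
            ClassicTokenizer.ACRONYM,
            ClassicTokenizer.COMPANY,
            ClassicTokenizer.EMAIL,
            ClassicTokenizer.HOST
        };

        // Print each constant next to the string stored at that index.
        foreach (int constant in constants)
        {
            Console.WriteLine($"{constant} -> {NameOf(constant)}");
        }
    }
}
```

A string obtained this way can be compared against the type reported for a token, for example to keep only tokens recognized as email addresses.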
Properties
MaxTokenLength
Gets or sets the maximum allowed token length. Any token longer than this is skipped.
Declaration
public int MaxTokenLength { get; set; }
Property Value
Type | Description
System.Int32 |
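As a rough illustration of the skipping behavior, the sketch below lowers MaxTokenLength before consuming the stream, so longer tokens never reach the result list. It assumes Lucene.NET 4.8 (`LuceneVersion.LUCENE_48`); the class name `MaxTokenLengthSketch`, the helper `TokenizeWithLimit`, and the sample input are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class MaxTokenLengthSketch
{
    // Hypothetical helper: tokenizes text, skipping tokens longer than maxTokenLength.
    public static List<string> TokenizeWithLimit(string text, int maxTokenLength)
    {
        var terms = new List<string>();

        // Assumption: LuceneVersion.LUCENE_48 matches the installed Lucene.NET package.
        using (var tokenizer = new ClassicTokenizer(LuceneVersion.LUCENE_48, new StringReader(text)))
        {
            // Set the limit before consuming the stream.
            tokenizer.MaxTokenLength = maxTokenLength;

            ICharTermAttribute termAtt = tokenizer.AddAttribute<ICharTermAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                terms.Add(termAtt.ToString());
            }
            tokenizer.End();
        }

        return terms;
    }

    public static void Main()
    {
        // With a limit of 5, the ten-character word should not appear in the output.
        foreach (string term in TokenizeWithLimit("a bb cccccccccc", 5))
        {
            Console.WriteLine(term);
        }
    }
}
```

Over-long tokens are dropped rather than truncated, so an application that needs them should raise the limit instead of post-processing the output.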
Methods
Dispose(Boolean)
Declaration
protected override void Dispose(bool disposing)
Parameters
Type | Name | Description
System.Boolean | disposing |
Overrides
End()
Declaration
public override sealed void End()
Overrides
IncrementToken()
Declaration
public override sealed bool IncrementToken()
Returns
Type | Description
System.Boolean |
Overrides
Reset()
Declaration
public override void Reset()
Overrides
Implements
System.IDisposable