
    Class ClassicTokenizer

    A grammar-based tokenizer constructed with JFlex (and then ported to .NET).

    This should be a good tokenizer for most European-language documents:

    • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
    • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
    • Recognizes email addresses and internet hostnames as one token.

    Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

    ClassicTokenizer was named StandardTokenizer in Lucene versions prior to 3.1. As of 3.1, StandardTokenizer implements Unicode text segmentation, as specified by UAX#29.
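
    The splitting rules listed above can be observed by running a short piece of text through the tokenizer and printing each term with its type. The following is a minimal sketch, assuming the Lucene.NET 4.8 packages are referenced; the sample sentence and the class name ClassicTokenizerDemo are illustrative and not part of the library.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class ClassicTokenizerDemo
{
    public static void Main()
    {
        // Illustrative input: an email address, a hyphenated token containing
        // a number, a hostname, and a plain hyphenated word.
        const string text =
            "Contact john.doe@example.com about part XR-2000, " +
            "or see www.example.com for well-known issues.";

        using (var tokenizer = new ClassicTokenizer(
            LuceneVersion.LUCENE_48, new StringReader(text)))
        {
            // Attributes expose the current token's text and type.
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            ITypeAttribute type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();                  // must be called before the first IncrementToken()
            while (tokenizer.IncrementToken())  // returns false when the input is exhausted
            {
                Console.WriteLine($"{term} {type.Type}");
            }
            tokenizer.End();                    // signals end of stream
        }
    }
}
```

    Printing the type next to each term makes it easier to see which grammar rule matched a given token.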

    Inheritance
    System.Object
    AttributeSource
    TokenStream
    Tokenizer
    ClassicTokenizer
    Implements
    IDisposable
    Inherited Members
    Tokenizer.m_input
    Tokenizer.CorrectOffset(Int32)
    Tokenizer.SetReader(TextReader)
    TokenStream.Dispose()
    AttributeSource.GetAttributeFactory()
    AttributeSource.GetAttributeClassesEnumerator()
    AttributeSource.GetAttributeImplsEnumerator()
    AttributeSource.AddAttributeImpl(Attribute)
    AttributeSource.AddAttribute<T>()
    AttributeSource.HasAttributes
    AttributeSource.HasAttribute<T>()
    AttributeSource.GetAttribute<T>()
    AttributeSource.ClearAttributes()
    AttributeSource.CaptureState()
    AttributeSource.RestoreState(AttributeSource.State)
    AttributeSource.GetHashCode()
    AttributeSource.Equals(Object)
    AttributeSource.ReflectAsString(Boolean)
    AttributeSource.ReflectWith(IAttributeReflector)
    AttributeSource.CloneAttributes()
    AttributeSource.CopyTo(AttributeSource)
    AttributeSource.ToString()
    Namespace: Lucene.Net.Analysis.Standard
    Assembly: Lucene.Net.Analysis.Common.dll
    Syntax
    public sealed class ClassicTokenizer : Tokenizer, IDisposable

    Constructors


    ClassicTokenizer(LuceneVersion, AttributeSource.AttributeFactory, System.IO.TextReader)

    Creates a new ClassicTokenizer with a given AttributeSource.AttributeFactory.

    Declaration
    public ClassicTokenizer(LuceneVersion matchVersion, AttributeSource.AttributeFactory factory, System.IO.TextReader input)
    Parameters
    Type Name Description
    LuceneVersion matchVersion
    AttributeSource.AttributeFactory factory
    System.IO.TextReader input
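
    As a rough illustration of this overload, the sketch below passes the library's default attribute factory explicitly. It assumes Lucene.NET 4.8, where the default factory is exposed as AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY; the class name FactoryConstructorDemo and the input text are hypothetical.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class FactoryConstructorDemo
{
    public static void Main()
    {
        // Assumption: the default factory is sufficient here. A custom factory
        // would only be needed to supply custom attribute implementations.
        AttributeSource.AttributeFactory factory =
            AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY;

        using (var tokenizer = new ClassicTokenizer(
            LuceneVersion.LUCENE_48, factory, new StringReader("hello world")))
        {
            var term = tokenizer.AddAttribute<ICharTermAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                Console.WriteLine(term.ToString());
            }
            tokenizer.End();
        }
    }
}
```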

    ClassicTokenizer(LuceneVersion, System.IO.TextReader)

    Creates a new instance of the ClassicTokenizer. Attaches the input to the newly created JFlex scanner.

    Declaration
    public ClassicTokenizer(LuceneVersion matchVersion, System.IO.TextReader input)
    Parameters
    Type Name Description
    LuceneVersion matchVersion Lucene compatibility version
    System.IO.TextReader input The input reader. See http://issues.apache.org/jira/browse/LUCENE-1068

    Fields


    ACRONYM

    Declaration
    public const int ACRONYM = 2
    Field Value
    Type Description
    System.Int32

    ACRONYM_DEP

    Declaration
    public const int ACRONYM_DEP = 8
    Field Value
    Type Description
    System.Int32

    ALPHANUM

    Declaration
    public const int ALPHANUM = 0
    Field Value
    Type Description
    System.Int32

    APOSTROPHE

    Declaration
    public const int APOSTROPHE = 1
    Field Value
    Type Description
    System.Int32

    CJ

    Declaration
    public const int CJ = 7
    Field Value
    Type Description
    System.Int32

    COMPANY

    Declaration
    public const int COMPANY = 3
    Field Value
    Type Description
    System.Int32

    EMAIL

    Declaration
    public const int EMAIL = 4
    Field Value
    Type Description
    System.Int32

    HOST

    Declaration
    public const int HOST = 5
    Field Value
    Type Description
    System.Int32

    NUM

    Declaration
    public const int NUM = 6
    Field Value
    Type Description
    System.Int32

    TOKEN_TYPES

    String token types that correspond to the token type int constants.

    Declaration
    public static readonly string[] TOKEN_TYPES
    Field Value
    Type Description
    System.String[]
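
    One way to use the int constants together with TOKEN_TYPES is to look up the type string for a constant and compare it with the type reported for each token. The sketch below assumes Lucene.NET 4.8 and assumes that the tokenizer sets ITypeAttribute.Type from the TOKEN_TYPES array; the class name TokenTypeDemo and the sample text are illustrative.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class TokenTypeDemo
{
    public static void Main()
    {
        // The int constants act as indexes into TOKEN_TYPES.
        string emailType = ClassicTokenizer.TOKEN_TYPES[ClassicTokenizer.EMAIL];
        string hostType = ClassicTokenizer.TOKEN_TYPES[ClassicTokenizer.HOST];

        const string text = "Mail admin@example.org or visit www.example.org today";

        using (var tokenizer = new ClassicTokenizer(
            LuceneVersion.LUCENE_48, new StringReader(text)))
        {
            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            var type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // Keep only tokens typed as email addresses or hostnames.
                if (type.Type == emailType || type.Type == hostType)
                {
                    Console.WriteLine($"{term} {type.Type}");
                }
            }
            tokenizer.End();
        }
    }
}
```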

    Properties


    MaxTokenLength

    Gets or sets the maximum allowed token length. Any token longer than this is skipped.

    Declaration
    public int MaxTokenLength { get; set; }
    Property Value
    Type Description
    System.Int32
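
    The effect of MaxTokenLength can be sketched by lowering the limit and tokenizing text that contains words longer than it. This example assumes Lucene.NET 4.8; the limit of 8 and the sample text are arbitrary choices for illustration, and the comment about position increments reflects an assumption about how skipped tokens are reported rather than something stated on this page.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class MaxTokenLengthDemo
{
    public static void Main()
    {
        const string text = "a tokenizer splits extraordinarily long words";

        using (var tokenizer = new ClassicTokenizer(
            LuceneVersion.LUCENE_48, new StringReader(text)))
        {
            // Tokens longer than 8 characters should be skipped.
            tokenizer.MaxTokenLength = 8;

            var term = tokenizer.AddAttribute<ICharTermAttribute>();
            var posIncr = tokenizer.AddAttribute<IPositionIncrementAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // Assumption: a skipped token shows up as a larger position
                // increment on the next emitted token.
                Console.WriteLine($"{term} (+{posIncr.PositionIncrement})");
            }
            tokenizer.End();
        }
    }
}
```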

    Methods


    Dispose(Boolean)

    Declaration
    protected override void Dispose(bool disposing)
    Parameters
    Type Name Description
    System.Boolean disposing

    End()

    Declaration
    public override sealed void End()
    Overrides
    TokenStream.End()

    IncrementToken()

    Declaration
    public override sealed bool IncrementToken()
    Returns
    Type Description
    System.Boolean
    Overrides
    TokenStream.IncrementToken()

    Reset()

    Declaration
    public override void Reset()
    Overrides
    Tokenizer.Reset()

    Implements

    IDisposable
    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)