    Class ArabicLetterTokenizer

    Tokenizer that breaks text into runs of letters and diacritics.

    The standard LetterTokenizer fails on diacritics: they are nonspacing marks rather than letters, so it splits a token wherever one occurs. Similar handling is necessary for Indic scripts, Hebrew, Thaana, and other scripts that rely on combining marks.

    You must specify the required LuceneVersion compatibility when creating ArabicLetterTokenizer:

    • As of 3.1, CharTokenizer uses an int-based (code point) API to normalize and detect token characters. See IsTokenChar(Int32) and Normalize(Int32) for details.

    Inheritance
    System.Object
    AttributeSource
    TokenStream
    Tokenizer
    CharTokenizer
    LetterTokenizer
    ArabicLetterTokenizer
    Implements
    IDisposable
    Inherited Members
    CharTokenizer.Normalize(Int32)
    CharTokenizer.IncrementToken()
    CharTokenizer.End()
    CharTokenizer.Reset()
    Tokenizer.m_input
    Tokenizer.Dispose(Boolean)
    Tokenizer.CorrectOffset(Int32)
    Tokenizer.SetReader(TextReader)
    TokenStream.Dispose()
    AttributeSource.GetAttributeFactory()
    AttributeSource.GetAttributeClassesEnumerator()
    AttributeSource.GetAttributeImplsEnumerator()
    AttributeSource.AddAttributeImpl(Attribute)
    AttributeSource.AddAttribute<T>()
    AttributeSource.HasAttributes
    AttributeSource.HasAttribute<T>()
    AttributeSource.GetAttribute<T>()
    AttributeSource.ClearAttributes()
    AttributeSource.CaptureState()
    AttributeSource.RestoreState(AttributeSource.State)
    AttributeSource.GetHashCode()
    AttributeSource.Equals(Object)
    AttributeSource.ReflectAsString(Boolean)
    AttributeSource.ReflectWith(IAttributeReflector)
    AttributeSource.CloneAttributes()
    AttributeSource.CopyTo(AttributeSource)
    AttributeSource.ToString()
    Namespace: Lucene.Net.Analysis.Ar
    Assembly: Lucene.Net.Analysis.Common.dll
    Syntax
    public class ArabicLetterTokenizer : LetterTokenizer, IDisposable

    Constructors

    ArabicLetterTokenizer(LuceneVersion, AttributeSource.AttributeFactory, TextReader)

    Construct a new ArabicLetterTokenizer using a given AttributeSource.AttributeFactory.

    Declaration
    public ArabicLetterTokenizer(LuceneVersion matchVersion, AttributeSource.AttributeFactory factory, TextReader in)
    Parameters
    Type | Name | Description
    LuceneVersion | matchVersion | Lucene version to match. See LuceneVersion.
    AttributeSource.AttributeFactory | factory | The attribute factory to use for this Tokenizer.
    TextReader | in | The input to split up into tokens.

    ArabicLetterTokenizer(LuceneVersion, TextReader)

    Construct a new ArabicLetterTokenizer.

    Declaration
    public ArabicLetterTokenizer(LuceneVersion matchVersion, TextReader in)
    Parameters
    Type | Name | Description
    LuceneVersion | matchVersion | Lucene version to match. See LuceneVersion.
    TextReader | in | The input to split up into tokens.
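
The following is a minimal usage sketch for the two-argument constructor, not an official sample. It assumes the Lucene.Net and Lucene.Net.Analysis.Common packages are referenced; `LuceneVersion.LUCENE_48`, `ICharTermAttribute`, the sample text, and the class name `ArabicLetterTokenizerUsage` are assumptions chosen for illustration and do not come from this page.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Ar;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class ArabicLetterTokenizerUsage
{
    public static void Main()
    {
        // Illustrative input: the first word carries diacritics (nonspacing marks).
        var reader = new StringReader("كِتَاب جديد");

        // The pragma is a precaution in case the referenced release
        // marks this tokenizer as obsolete; it is harmless otherwise.
#pragma warning disable CS0618
        using (var tokenizer = new ArabicLetterTokenizer(LuceneVersion.LUCENE_48, reader))
#pragma warning restore CS0618
        {
            // Register the term attribute before consuming the stream.
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();

            // Standard TokenStream workflow: Reset, IncrementToken loop, End.
            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                Console.WriteLine(term.ToString());
            }
            tokenizer.End();
        } // Disposing the tokenizer also releases the underlying reader.
    }
}
```

The three-argument overload follows the same workflow, with an AttributeSource.AttributeFactory passed between the version and the reader.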

    Methods

    IsTokenChar(Int32)

    Returns true for characters in a Unicode Letter category or in the NonSpacingMark category, so diacritics stay inside the token they modify.

    Declaration
    protected override bool IsTokenChar(int c)
    Parameters
    Type Name Description
    System.Int32 c
    Returns
    Type Description
    System.Boolean
    Overrides
    LetterTokenizer.IsTokenChar(Int32)
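
The check described above can be approximated with the .NET base class library alone. This is a sketch of the idea under stated assumptions, not the Lucene.NET source: `TokenCharSketch` and `IsLetterOrNonSpacingMark` are hypothetical names, and the helper expects a valid Unicode scalar value (`char.ConvertFromUtf32` throws for surrogate or out-of-range input).

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not part of Lucene.NET.
public static class TokenCharSketch
{
    public static bool IsLetterOrNonSpacingMark(int codePoint)
    {
        // Convert the code point to a string so supplementary characters
        // (outside the Basic Multilingual Plane) are handled as surrogate pairs.
        string s = char.ConvertFromUtf32(codePoint);
        UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(s, 0);

        switch (category)
        {
            // The five Unicode letter categories.
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.TitlecaseLetter:
            case UnicodeCategory.ModifierLetter:
            case UnicodeCategory.OtherLetter:
            // Diacritics such as Arabic vowel marks fall in this category.
            case UnicodeCategory.NonSpacingMark:
                return true;
            default:
                return false;
        }
    }
}
```

For example, `TokenCharSketch.IsLetterOrNonSpacingMark(0x0650)` (ARABIC KASRA, a nonspacing mark) is accepted, while a space (0x0020) is rejected and therefore ends the current token.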

    Implements

    IDisposable
    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)