
    Namespace Lucene.Net.Analysis.Ar

    Analyzer for Arabic.

    Classes

    ArabicAnalyzer

    Analyzer for Arabic.

    This analyzer implements light stemming as described in the paper "Light Stemming for Arabic Information Retrieval":
    http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf

    The analysis package contains three primary components:

    • ArabicNormalizationFilter: Arabic orthographic normalization.
    • ArabicStemFilter: Arabic light stemming.
    • Arabic stop words file: a set of default Arabic stop words.

    ArabicLetterTokenizer

    Tokenizer that breaks text into runs of letters and diacritics.

    The standard LetterTokenizer fails on diacritics: it treats them as non-letter characters, so a word is split wherever a diacritic occurs. Similar handling is necessary for Indic scripts, Hebrew, Thaana, and other scripts that use combining marks.

    You must specify the required LuceneVersion compatibility when creating ArabicLetterTokenizer:

    • As of 3.1, CharTokenizer uses an int-based API to normalize and detect token characters. See IsTokenChar(Int32) and Normalize(Int32) for details.
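
    The tokenization rule described above can be illustrated with a small standalone sketch. It is written in Python rather than against the Lucene.NET API, and it assumes that "letters and diacritics" corresponds to the Unicode letter categories (L*) plus nonspacing marks (Mn). The names is_token_char and split_runs are hypothetical and exist only for this illustration.

```python
import unicodedata


def is_token_char(ch: str, allow_marks: bool = True) -> bool:
    """Return True if ch may appear inside a token.

    Assumption for this sketch: letters are Unicode categories L*,
    and diacritics are nonspacing marks (category Mn).
    """
    category = unicodedata.category(ch)
    return category.startswith("L") or (allow_marks and category == "Mn")


def split_runs(text: str, allow_marks: bool = True) -> list[str]:
    """Break text into maximal runs of token characters."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch, allow_marks):
            current.append(ch)
        elif current:
            # A non-token character closes the current run.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# Two words; the first carries a kasra (U+0650) after its first letter.
text = "\u0643\u0650\u062A\u0627\u0628 \u062C\u062F\u064A\u062F"
print(split_runs(text))                     # letters and marks: two tokens
print(split_runs(text, allow_marks=False))  # letters only: the kasra splits the first word
```

    With allow_marks=False the sketch behaves like a letters-only tokenizer, which shows why diacritics need to be accepted as token characters.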

    ArabicLetterTokenizerFactory

    Factory for ArabicLetterTokenizer.

    ArabicNormalizationFilter

    A TokenFilter that applies ArabicNormalizer to normalize the orthography.

    ArabicNormalizationFilterFactory

    Factory for ArabicNormalizationFilter.

    <fieldType name="text_arnormal" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ArabicNormalizationFilterFactory"/>
      </analyzer>
    </fieldType>

    ArabicNormalizer

    Normalizer for Arabic.

    Normalization is done in place for efficiency, operating on a term buffer.

    Normalization is defined as:

    • Normalization of hamza with alef seat to a bare alef.
    • Normalization of teh marbuta to heh.
    • Normalization of dotless yeh (alef maksura) to yeh.
    • Removal of Arabic diacritics (the harakat).
    • Removal of tatweel (stretching character).
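
    The rules above can be sketched as an in-place pass over a character buffer. This is a Python illustration, not the Lucene.NET implementation. The code points in the two tables are assumptions chosen to match the listed rules (alef with madda is included alongside the hamza forms as a further assumption), and normalize and normalize_text are hypothetical names.

```python
# Characters rewritten to another character (assumed code points).
REPLACEMENTS = {
    "\u0622": "\u0627",  # alef with madda above -> bare alef (assumption)
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0629": "\u0647",  # teh marbuta -> heh
    "\u0649": "\u064A",  # dotless yeh (alef maksura) -> yeh
}

# Characters dropped entirely: tatweel (U+0640) and the harakat U+064B..U+0652.
REMOVALS = frozenset("\u0640\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")


def normalize(buffer: list[str], length: int) -> int:
    """Normalize buffer[:length] in place and return the new length."""
    write = 0
    for read in range(length):
        ch = buffer[read]
        if ch in REMOVALS:
            continue  # skip the character; later ones shift left
        buffer[write] = REPLACEMENTS.get(ch, ch)
        write += 1
    return write


def normalize_text(text: str) -> str:
    """Convenience wrapper that normalizes a whole string."""
    buffer = list(text)
    new_length = normalize(buffer, len(buffer))
    return "".join(buffer[:new_length])


# A fully vocalized word loses its harakat, and its final teh marbuta becomes heh.
print(normalize_text("\u0645\u064E\u062F\u0652\u0631\u064E\u0633\u064E\u0629\u064C"))
```

    Returning the new length instead of a new string mirrors the in-place, buffer-oriented design described above: the caller keeps one buffer and only tracks how much of it is valid.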

    ArabicStemFilter

    A TokenFilter that applies ArabicStemmer to stem Arabic words.

    To prevent terms from being stemmed, place an instance of SetKeywordMarkerFilter, or a custom TokenFilter that sets the KeywordAttribute, before this TokenStream.
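
    The keyword mechanism can be pictured as a flag that travels with each token through the filter chain. The following Python sketch is a conceptual model only, not the Lucene.NET API: Token, mark_keywords, stem_filter, and strip_definite_article are hypothetical stand-ins for the token attributes, the keyword marker filter, the stem filter, and the stemmer.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Token:
    term: str
    is_keyword: bool = False  # stand-in for the keyword attribute


def mark_keywords(tokens: Iterable[Token], protected: set[str]) -> Iterator[Token]:
    """Flag every token whose term is in the protected set."""
    for token in tokens:
        if token.term in protected:
            token.is_keyword = True
        yield token


def stem_filter(tokens: Iterable[Token], stem: Callable[[str], str]) -> Iterator[Token]:
    """Stem each token unless an earlier filter flagged it as a keyword."""
    for token in tokens:
        if not token.is_keyword:
            token.term = stem(token.term)
        yield token


def strip_definite_article(term: str) -> str:
    """Toy stemmer for this sketch: drop a leading 'al-' if 2+ letters remain."""
    article = "\u0627\u0644"
    if term.startswith(article) and len(term) >= len(article) + 2:
        return term[len(article):]
    return term


city = "\u0627\u0644\u0642\u0627\u0647\u0631\u0629"  # a proper noun we want to protect
book = "\u0627\u0644\u0643\u062A\u0627\u0628"        # an ordinary noun

# The marker must run before the stem filter, as the paragraph above requires.
stream = stem_filter(mark_keywords([Token(city), Token(book)], {city}), strip_definite_article)
print([token.term for token in stream])
```

    If the two filters were chained in the opposite order, the flag would be set only after stemming had already happened, which is why the marker has to come first.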

    ArabicStemFilterFactory

    Factory for ArabicStemFilter.

    <fieldType name="text_arstem" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ArabicNormalizationFilterFactory"/>
        <filter class="solr.ArabicStemFilterFactory"/>
      </analyzer>
    </fieldType>

    ArabicStemmer

    Stemmer for Arabic.

    Stemming is done in place for efficiency, operating on a term buffer.

    Stemming is defined as:

    • Removal of attached definite article, conjunction, and prepositions.
    • Stemming of common suffixes.
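
    The two steps above can be sketched as a prefix pass followed by a suffix pass. This Python illustration is not the Lucene.NET implementation. The prefix and suffix lists and the minimum-length rules are assumptions made for the example, and stem is a hypothetical name.

```python
# Illustrative affix lists (assumptions for this sketch).
PREFIXES = [
    "\u0627\u0644",        # al-   (definite article)
    "\u0648\u0627\u0644",  # wal-  (conjunction + article)
    "\u0628\u0627\u0644",  # bal-  (preposition + article)
    "\u0643\u0627\u0644",  # kal-  (preposition + article)
    "\u0641\u0627\u0644",  # fal-  (conjunction + article)
    "\u0644\u0644",        # lil-  (preposition + article)
    "\u0648",              # wa-   (conjunction)
]
SUFFIXES = [
    "\u0647\u0627", "\u0627\u0646", "\u0627\u062A", "\u0648\u0646",
    "\u064A\u0646", "\u064A\u0647", "\u064A\u0629",
    "\u0647", "\u0629", "\u064A",
]


def stem(word: str) -> str:
    """Light-stem a normalized word by stripping one prefix, then suffixes."""
    chars = list(word)  # mutable buffer, edited in place

    # Prefix pass: remove at most one prefix (the first that matches).
    for prefix in PREFIXES:
        # Assumed rule: a one-letter prefix must leave 3 letters, others 2.
        min_left = 3 if len(prefix) == 1 else 2
        if len(chars) >= len(prefix) + min_left and chars[:len(prefix)] == list(prefix):
            del chars[:len(prefix)]
            break

    # Suffix pass: each suffix is tried once, in order, on what remains.
    for suffix in SUFFIXES:
        # Assumed rule: at least 2 letters must remain after removal.
        if len(chars) >= len(suffix) + 2 and chars[-len(suffix):] == list(suffix):
            del chars[-len(suffix):]

    return "".join(chars)


print(stem("\u0648\u0627\u0644\u0643\u062A\u0627\u0628"))        # conjunction + article removed
print(stem("\u0627\u0644\u0645\u0639\u0644\u0645\u0648\u0646"))  # article and plural suffix removed
```

    The length checks keep very short words intact, so a three-letter word that merely begins with the conjunction letter is left unchanged.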

    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)