Namespace Lucene.Net.Analysis.Ar

Analyzer for Arabic.

Classes

ArabicAnalyzer

Analyzer for Arabic.

This analyzer implements light-stemming as specified by: Light Stemming for Arabic Information Retrieval
http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf

The analysis package contains three primary components:

ArabicNormalizationFilter: Arabic orthographic normalization.
ArabicStemFilter: Arabic light stemming.
Arabic stop words file: a set of default Arabic stop words.

ArabicLetterTokenizer

Tokenizer that breaks text into runs of letters and diacritics.

The standard letter tokenizer fails on diacritics because it treats them as token boundaries. Similar handling is necessary for Indic scripts, Hebrew, Thaana, etc.

You must specify the required LuceneVersion compatibility when creating ArabicLetterTokenizer:

As of 3.1, CharTokenizer uses an int-based API to normalize and detect token characters. See IsTokenChar(Int32) and Normalize(Int32) for details.
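
The tokenization rule described above can be illustrated with a short, self-contained Python sketch. It is not the Lucene.NET implementation: the names `is_token_char` and `letter_tokenize` are hypothetical, and the rule "any Unicode letter or non-spacing mark" is an assumption about what "letters and diacritics" means here.

```python
import unicodedata


def is_token_char(ch):
    """Hypothetical predicate: accept letters and non-spacing marks (diacritics)."""
    category = unicodedata.category(ch)
    # "L*" covers all letter categories; "Mn" covers non-spacing marks such as harakat.
    return category.startswith("L") or category == "Mn"


def letter_tokenize(text):
    """Split text into maximal runs of characters accepted by is_token_char."""
    tokens = []
    current = []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            # A non-token character closes the current run.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens
```

A tokenizer that accepted letters only would break a vocalized word at every diacritic; accepting category `Mn` as well keeps the word in one piece.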

ArabicLetterTokenizerFactory

Factory for ArabicLetterTokenizer.

ArabicNormalizationFilter

A TokenFilter that applies ArabicNormalizer to normalize the orthography.

ArabicNormalizationFilterFactory

Factory for ArabicNormalizationFilter.

<fieldType name="text_arnormal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
  </analyzer>
</fieldType>

ArabicNormalizer

Normalizer for Arabic.

Normalization is done in place for efficiency, operating on a term buffer.

Normalization is defined as:

Normalization of hamza with alef seat to a bare alef.
Normalization of teh marbuta to heh.
Normalization of dotless yeh (alef maksura) to yeh.
Removal of Arabic diacritics (the harakat).
Removal of tatweel (stretching character).
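
The rules above can be sketched as a character translation table. This is a minimal Python illustration, not the library's code: the function name `normalize` is hypothetical, the specific code points chosen for each rule are assumptions, and the sketch returns a new string instead of editing a term buffer in place.

```python
ALEF = "\u0627"
YEH = "\u064A"
HEH = "\u0647"

# Assumed code points for each rule; adjust if your data needs a different set.
_MAPPING = {
    "\u0622": ALEF,  # alef with madda above
    "\u0623": ALEF,  # alef with hamza above
    "\u0625": ALEF,  # alef with hamza below
    "\u0629": HEH,   # teh marbuta
    "\u0649": YEH,   # dotless yeh (alef maksura)
    "\u0640": None,  # tatweel is removed
}
# Harakat, assumed here to be U+064B (fathatan) through U+0652 (sukun), are removed.
for code_point in range(0x064B, 0x0652 + 1):
    _MAPPING[chr(code_point)] = None

_TABLE = str.maketrans(_MAPPING)


def normalize(term):
    """Return a copy of term with the orthographic rules applied."""
    return term.translate(_TABLE)
```

Because every rule maps one character to at most one character, the result is never longer than the input, which is what makes an in-place implementation on a buffer possible.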

ArabicStemFilter

A TokenFilter that applies ArabicStemmer to stem Arabic words.

To prevent terms from being stemmed, place an instance of SetKeywordMarkerFilter, or a custom TokenFilter that sets the KeywordAttribute, before this TokenStream.
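
The keyword-protection mechanism can be modeled in a few lines of Python. This is a conceptual sketch only: `Token`, `mark_keywords`, `stem_filter`, and `toy_stem` are hypothetical names, the `is_keyword` flag merely stands in for the KeywordAttribute, and `toy_stem` is a deliberately trivial stand-in for a real stemmer.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    is_keyword: bool = False  # stands in for the KeywordAttribute


def mark_keywords(tokens, protected):
    """Upstream filter: flag every token whose text is in the protected set."""
    for token in tokens:
        if token.text in protected:
            token.is_keyword = True
        yield token


def stem_filter(tokens, stem):
    """Downstream filter: stem only the tokens that were not flagged."""
    for token in tokens:
        if not token.is_keyword:
            token.text = stem(token.text)
        yield token


def toy_stem(text):
    """Toy stemmer (assumption): strip a leading definite article from longer words."""
    article = "\u0627\u0644"
    if text.startswith(article) and len(text) >= 4:
        return text[len(article):]
    return text
```

The ordering matters: the marking step has to run before the stemming step, which is why the marker filter sits earlier in the chain.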

ArabicStemFilterFactory

Factory for ArabicStemFilter.

<fieldType name="text_arstem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>

ArabicStemmer

Stemmer for Arabic.

Stemming is done in place for efficiency, operating on a term buffer.

Stemming is defined as:

Removal of attached definite article, conjunction, and prepositions.
Stemming of common suffixes.
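
The two steps above can be sketched in Python as follows. This is a simplified illustration rather than the library's implementation: the function names are hypothetical, and the prefix list, suffix list, and minimum-length checks are assumptions chosen to resemble a light stemmer.

```python
# Illustrative prefixes (assumption): al-, wal-, bal-, kal-, fal-, lil-, wa-
PREFIXES = (
    "\u0627\u0644",
    "\u0648\u0627\u0644",
    "\u0628\u0627\u0644",
    "\u0643\u0627\u0644",
    "\u0641\u0627\u0644",
    "\u0644\u0644",
    "\u0648",
)
# Illustrative suffixes (assumption): -ha, -an, -at, -un, -in, -yh, -yt, -h, -t, -y
SUFFIXES = (
    "\u0647\u0627",
    "\u0627\u0646",
    "\u0627\u062A",
    "\u0648\u0646",
    "\u064A\u0646",
    "\u064A\u0647",
    "\u064A\u0629",
    "\u0647",
    "\u0629",
    "\u064A",
)


def stem_prefix(word):
    """Strip the first matching prefix, if enough characters would remain."""
    for prefix in PREFIXES:
        # Assumed rule: a one-letter prefix must leave 3 characters, others 2.
        min_length = 4 if len(prefix) == 1 else len(prefix) + 2
        if len(word) >= min_length and word.startswith(prefix):
            return word[len(prefix):]
    return word


def stem_suffix(word):
    """Strip each matching suffix in list order, keeping at least 2 characters."""
    for suffix in SUFFIXES:
        if len(word) >= len(suffix) + 2 and word.endswith(suffix):
            word = word[: -len(suffix)]
    return word


def stem(word):
    """Apply prefix removal, then suffix removal."""
    return stem_suffix(stem_prefix(word))
```

The length checks keep very short words intact, so a two-letter particle is never reduced to a single character.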