Namespace Lucene.Net.Analysis.Ar

Analyzer for Arabic.

Classes

ArabicAnalyzer

Lucene.Net.Analysis.Analyzer for Arabic.

This analyzer implements light stemming as specified in:

Light Stemming for Arabic Information Retrieval
http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf

The analysis package contains three primary components:

ArabicNormalizationFilter: Arabic orthographic normalization.
ArabicStemFilter: Arabic light stemming.
Arabic stop words file: a set of default Arabic stop words.

ArabicLetterTokenizer

Tokenizer that breaks text into runs of letters and diacritics.

The standard letter tokenizer breaks tokens at diacritics, because combining marks are not classified as letters. Similar handling is necessary for Indic scripts, Hebrew, Thaana, etc.

You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating ArabicLetterTokenizer:

As of 3.1, CharTokenizer uses an int-based API to normalize and detect token characters. See IsTokenChar(int) and Normalize(int) for details.
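
The tokenization rule described above can be illustrated with a minimal Python sketch. It is not the Lucene.Net implementation: the function names `is_token_char` and `tokenize` and the `keep_marks` switch are hypothetical, and the sketch assumes that "diacritics" means Unicode non-spacing marks (general category Mn).

```python
import unicodedata


def is_token_char(ch: str, keep_marks: bool = True) -> bool:
    """Return True if ch may appear inside a token.

    Assumption: a token character is a Unicode letter or, when
    keep_marks is True, a non-spacing mark (category Mn).
    """
    if ch.isalpha():
        return True
    return keep_marks and unicodedata.category(ch) == "Mn"


def tokenize(text: str, keep_marks: bool = True) -> list:
    """Split text into maximal runs of token characters."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch, keep_marks):
            current.append(ch)
        elif current:
            # A non-token character closes the current run.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# KAF, FATHA, TEH, FATHA, BEH, then digits, then QAF, LAM, MEEM
sample = "\u0643\u064E\u062A\u064E\u0628 123 \u0642\u0644\u0645"
with_marks = tokenize(sample)
letters_only = tokenize(sample, keep_marks=False)
```

With `keep_marks=False` the vocalized word is broken into single letters at each fatha, which is the failure the Arabic-specific tokenizer avoids.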

ArabicNormalizationFilter

A Lucene.Net.Analysis.TokenFilter that applies ArabicNormalizer to normalize the orthography.

ArabicNormalizationFilterFactory

Factory for ArabicNormalizationFilter.

<fieldType name="text_arnormal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
  </analyzer>
</fieldType>

ArabicNormalizer

Normalizer for Arabic.

Normalization is done in place for efficiency, operating on a term buffer.

Normalization is defined as:

Normalization of hamza with alef seat to a bare alef.
Normalization of teh marbuta to heh.
Normalization of dotless yeh (alef maksura) to yeh.
Removal of Arabic diacritics (the harakat).
Removal of tatweel (stretching character).
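
The rules listed above can be sketched in a few lines of Python. This is an illustration, not the ArabicNormalizer code: the name `normalize` is hypothetical, the specific code points chosen for each rule are assumptions based on the rule descriptions (including treating alef with madda as an alef variant), and the sketch builds a new string instead of editing a term buffer in place.

```python
ALEF = "\u0627"
YEH = "\u064A"
HEH = "\u0647"

# Characters replaced by a canonical form (assumed code points).
REPLACEMENTS = {
    "\u0622": ALEF,  # alef with madda above
    "\u0623": ALEF,  # alef with hamza above
    "\u0625": ALEF,  # alef with hamza below
    "\u0629": HEH,   # teh marbuta
    "\u0649": YEH,   # alef maksura (dotless yeh)
}

# Characters dropped entirely (assumed code points).
REMOVED = {
    "\u0640",  # tatweel
    "\u064B",  # fathatan
    "\u064C",  # dammatan
    "\u064D",  # kasratan
    "\u064E",  # fatha
    "\u064F",  # damma
    "\u0650",  # kasra
    "\u0651",  # shadda
    "\u0652",  # sukun
}


def normalize(term: str) -> str:
    """Apply the orthographic normalization rules to one term."""
    out = []
    for ch in term:
        if ch in REMOVED:
            continue
        out.append(REPLACEMENTS.get(ch, ch))
    return "".join(out)


# Alef with hamza, fatha, hah, sukun, meem, fatha, dal
vocalized = normalize("\u0623\u064E\u062D\u0652\u0645\u064E\u062F")
# Kaf, tatweel, teh, alef, beh
stretched = normalize("\u0643\u0640\u062A\u0627\u0628")
```

Because every rule either deletes a character or replaces it one for one, the output is never longer than the input, which is what makes an in-place implementation on a buffer possible.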

ArabicStemFilter

A Lucene.Net.Analysis.TokenFilter that applies ArabicStemmer to stem Arabic words.

To prevent terms from being stemmed, use an instance of SetKeywordMarkerFilter or a custom Lucene.Net.Analysis.TokenFilter that sets the Lucene.Net.Analysis.TokenAttributes.IKeywordAttribute before this Lucene.Net.Analysis.TokenStream.

ArabicStemFilterFactory

Factory for ArabicStemFilter.

<fieldType name="text_arstem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>

ArabicStemmer

Stemmer for Arabic.

Stemming is done in place for efficiency, operating on a term buffer.

Stemming is defined as:

Removal of attached definite article, conjunction, and prepositions.
Stemming of common suffixes.
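
The two stemming steps above can be sketched in Python as follows. This is a rough illustration of light stemming, not the ArabicStemmer code: the function names are hypothetical, and the affix lists and minimum-length guards are assumptions modeled on the light stemmer described in the paper cited for ArabicAnalyzer. The sketch removes at most one prefix and then tries each suffix in turn.

```python
# Assumed prefixes: definite article, with attached conjunction or
# preposition, and the bare conjunction waw.
PREFIXES = [
    "\u0627\u0644",        # al-
    "\u0648\u0627\u0644",  # wal-
    "\u0628\u0627\u0644",  # bal-
    "\u0643\u0627\u0644",  # kal-
    "\u0641\u0627\u0644",  # fal-
    "\u0644\u0644",        # lil-
    "\u0648",              # wa-
]

# Assumed common suffixes.
SUFFIXES = [
    "\u0647\u0627", "\u0627\u0646", "\u0627\u062A", "\u0648\u0646",
    "\u064A\u0646", "\u064A\u0647", "\u064A\u0629",
    "\u0647", "\u0629", "\u064A",
]


def strip_prefix(word: str) -> str:
    """Remove the first matching prefix, if enough letters remain."""
    for prefix in PREFIXES:
        # Assumption: a one-letter prefix must leave at least three
        # letters, longer prefixes at least two.
        remaining = 3 if len(prefix) == 1 else 2
        if word.startswith(prefix) and len(word) >= len(prefix) + remaining:
            return word[len(prefix):]
    return word


def strip_suffixes(word: str) -> str:
    """Remove each matching suffix that leaves at least two letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) >= len(suffix) + 2:
            word = word[: -len(suffix)]
    return word


def light_stem(word: str) -> str:
    return strip_suffixes(strip_prefix(word))


# "wal" + kaf, teh, alef, beh
stem_a = light_stem("\u0648\u0627\u0644\u0643\u062A\u0627\u0628")
# "al" + meem, ain, lam, meem + "un"
stem_b = light_stem("\u0627\u0644\u0645\u0639\u0644\u0645\u0648\u0646")
```

The length guards keep short words intact: a three-letter word that happens to begin with waw is returned unchanged because stripping the conjunction would leave too little of the stem.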