Namespace Lucene.Net.Analysis.Ar
Analyzer for Arabic.
Classes
ArabicAnalyzer
Lucene.Net.Analysis.Analyzer for Arabic.
This analyzer implements light-stemming as specified by:
Light Stemming for Arabic Information Retrieval
http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf
The analysis package contains three primary components:
- ArabicNormalizationFilter: Arabic orthographic normalization.
- ArabicStemFilter: Arabic light stemming.
- Arabic stop words file: a set of default Arabic stop words.
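As a rough illustration of how these components are used together, the sketch below runs a string through ArabicAnalyzer and collects the resulting terms. It assumes the Lucene.Net.Analysis.Common 4.8 package is referenced; the class name ArabicAnalyzerDemo, the method name Analyze, and the field name "body" are hypothetical and chosen only for this example.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Ar;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class ArabicAnalyzerDemo
{
    // Hypothetical helper: returns the terms the analyzer emits for the text.
    public static List<string> Analyze(string text)
    {
        var terms = new List<string>();
        using (var analyzer = new ArabicAnalyzer(LuceneVersion.LUCENE_48))
        using (TokenStream ts = analyzer.GetTokenStream("body", new StringReader(text)))
        {
            ICharTermAttribute term = ts.AddAttribute<ICharTermAttribute>();
            ts.Reset();                      // required before the first IncrementToken()
            while (ts.IncrementToken())
            {
                terms.Add(term.ToString()); // normalized, stemmed term text
            }
            ts.End();
        }
        return terms;
    }
}
```

Calling ArabicAnalyzerDemo.Analyze on a short Arabic phrase should return one term per non-stop-word token, after normalization and light stemming.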
ArabicLetterTokenizer
Tokenizer that breaks text into runs of letters and diacritics.
The problem with the standard Letter tokenizer is that it fails on diacritics. Handling similar to this is necessary for Indic Scripts, Hebrew, Thaana, etc.
You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating ArabicLetterTokenizer:
- As of 3.1, CharTokenizer uses an int-based API to normalize and detect token characters. See IsTokenChar(Int32) and Normalize(Int32) for details.
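The idea of "runs of letters and diacritics" can be sketched without Lucene.NET at all. The following is a minimal, standalone approximation, not the library's implementation: it assumes that a token character is either a letter or a Unicode non-spacing mark, and it works per UTF-16 char rather than per code point. The names LetterAndMarkSplitter, IsTokenChar, and Split are hypothetical.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class LetterAndMarkSplitter
{
    // Assumption: a token character is a letter or a non-spacing mark
    // (the Arabic harakat fall into the non-spacing mark category).
    public static bool IsTokenChar(char c) =>
        char.IsLetter(c) ||
        char.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark;

    // Splits text into maximal runs of token characters.
    public static List<string> Split(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (IsTokenChar(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }
}
```

A letters-only predicate would break a vocalized word at each diacritic; accepting non-spacing marks keeps the word in one token.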
ArabicLetterTokenizerFactory
Factory for ArabicLetterTokenizer.
ArabicNormalizationFilter
A Lucene.Net.Analysis.TokenFilter that applies ArabicNormalizer to normalize the orthography.
ArabicNormalizationFilterFactory
Factory for ArabicNormalizationFilter.
<fieldType name="text_arnormal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
  </analyzer>
</fieldType>
ArabicNormalizer
Normalizer for Arabic.
Normalization is done in-place for efficiency, operating on a termbuffer.
Normalization is defined as:
- Normalization of hamza with alef seat to a bare alef.
- Normalization of teh marbuta to heh.
- Normalization of dotless yeh (alef maksura) to yeh.
- Removal of Arabic diacritics (the harakat).
- Removal of tatweel (stretching character).
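The rules above can be sketched as a single in-place pass over a term buffer. This is a standalone approximation rather than the library's code. The specific code points are assumptions based on the Unicode Arabic block: U+0622, U+0623, and U+0625 are taken as the hamza-on-alef forms, and U+064B through U+0652 as the harakat. The class name ArabicNormalizationSketch is hypothetical.

```csharp
public static class ArabicNormalizationSketch
{
    const char AlefMadda = '\u0622';
    const char AlefHamzaAbove = '\u0623';
    const char AlefHamzaBelow = '\u0625';
    const char Alef = '\u0627';
    const char TehMarbuta = '\u0629';
    const char Heh = '\u0647';
    const char DotlessYeh = '\u0649';
    const char Yeh = '\u064A';
    const char Tatweel = '\u0640';
    const char FirstHaraka = '\u064B'; // fathatan
    const char LastHaraka = '\u0652';  // sukun

    // Normalizes s[0..len) in place and returns the new length.
    public static int Normalize(char[] s, int len)
    {
        int write = 0;
        for (int read = 0; read < len; read++)
        {
            char c = s[read];
            switch (c)
            {
                case AlefMadda:
                case AlefHamzaAbove:
                case AlefHamzaBelow:
                    s[write++] = Alef;   // hamza with alef seat -> bare alef
                    break;
                case TehMarbuta:
                    s[write++] = Heh;    // teh marbuta -> heh
                    break;
                case DotlessYeh:
                    s[write++] = Yeh;    // alef maksura -> yeh
                    break;
                case Tatweel:
                    break;               // drop the stretching character
                default:
                    if (c < FirstHaraka || c > LastHaraka)
                    {
                        s[write++] = c;  // keep everything except harakat
                    }
                    break;
            }
        }
        return write;
    }
}
```

Because the write index never passes the read index, the buffer can be rewritten in place without a second array, which is the efficiency point made above.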
ArabicStemFilter
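A Lucene.Net.Analysis.TokenFilter that applies ArabicStemmer to stem Arabic words.
To prevent terms from being stemmed, use an instance of SetKeywordMarkerFilter or a custom TokenFilter that sets the KeywordAttribute before this TokenStream.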
ArabicStemFilterFactory
Factory for ArabicStemFilter.
<fieldType name="text_arstem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>
ArabicStemmer
Stemmer for Arabic.
Stemming is done in-place for efficiency, operating on a termbuffer.
Stemming is defined as:
- Removal of attached definite article, conjunction, and prepositions.
- Stemming of common suffixes.
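The two steps above can be sketched as an in-place prefix strip followed by a suffix strip. This is a simplified, standalone illustration and not the library's stemmer: the prefix and suffix lists are a small illustrative subset, the minimum remaining length of two characters is an assumption, and at most one prefix and one suffix are removed. The name ArabicLightStemSketch is hypothetical.

```csharp
using System;

public static class ArabicLightStemSketch
{
    // Illustrative subset only: al-, wal-, bal-, wa-
    static readonly string[] Prefixes =
        { "\u0627\u0644", "\u0648\u0627\u0644", "\u0628\u0627\u0644", "\u0648" };

    // Illustrative subset only: -at, -un, -in, teh marbuta
    static readonly string[] Suffixes =
        { "\u0627\u062A", "\u0648\u0646", "\u064A\u0646", "\u0629" };

    // Assumption: keep at least this many characters after stripping an affix.
    const int MinStemLength = 2;

    // Stems s[0..len) in place and returns the new length.
    public static int Stem(char[] s, int len)
    {
        len = StripPrefix(s, len);
        len = StripSuffix(s, len);
        return len;
    }

    static int StripPrefix(char[] s, int len)
    {
        foreach (string p in Prefixes)
        {
            if (len >= p.Length + MinStemLength && StartsWith(s, len, p))
            {
                // Shift the remainder to the front of the buffer.
                Array.Copy(s, p.Length, s, 0, len - p.Length);
                return len - p.Length;
            }
        }
        return len;
    }

    static int StripSuffix(char[] s, int len)
    {
        foreach (string x in Suffixes)
        {
            if (len >= x.Length + MinStemLength && EndsWith(s, len, x))
            {
                return len - x.Length; // shortening the length drops the suffix
            }
        }
        return len;
    }

    static bool StartsWith(char[] s, int len, string p)
    {
        if (p.Length > len) return false;
        for (int i = 0; i < p.Length; i++)
            if (s[i] != p[i]) return false;
        return true;
    }

    static bool EndsWith(char[] s, int len, string x)
    {
        if (x.Length > len) return false;
        for (int i = 0; i < x.Length; i++)
            if (s[len - x.Length + i] != x[i]) return false;
        return true;
    }
}
```

The length guard is what keeps the stemmer "light": very short words are left untouched rather than reduced to a single letter.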