Namespace Lucene.Net.Analysis.Ar
Analyzer for Arabic.
Classes
ArabicAnalyzer
Lucene.Net.Analysis.Analyzer for Arabic.
This analyzer implements light-stemming as specified by:
Light Stemming for Arabic Information Retrieval
http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf
The analysis package contains three primary components:
- ArabicNormalizationFilter: Arabic orthographic normalization.
- ArabicStemFilter: Arabic light stemming.
- Arabic stop words file: a set of default Arabic stop words.
ArabicLetterTokenizer
Tokenizer that breaks text into runs of letters and diacritics.
The standard letter tokenizer fails on diacritics: they are not letters, so it treats them as token boundaries and splits the words that contain them. Similar handling is necessary for Indic scripts, Hebrew, Thaana, and so on.
You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating ArabicLetterTokenizer:
- As of 3.1, CharTokenizer uses an int-based API to normalize and detect token characters. See IsTokenChar(Int32) and Normalize(Int32) for details.
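The tokenization rule described above can be illustrated with a minimal Python sketch. It is not the Lucene.NET implementation: it assumes that "letters and diacritics" means Unicode letter categories plus nonspacing marks (category Mn), and the function names `is_token_char` and `tokenize` are hypothetical.

```python
import unicodedata


def is_token_char(ch):
    """Return True for letters and nonspacing marks (assumed rule)."""
    category = unicodedata.category(ch)
    # "L*" covers all letter categories; "Mn" covers combining diacritics
    # such as the Arabic harakat.
    return category.startswith("L") or category == "Mn"


def tokenize(text):
    """Split text into maximal runs of token characters."""
    tokens = []
    current = []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# A vocalized word (kaf, kasra, teh, fatha, alef, beh) followed by a plain word.
sample = "\u0643\u0650\u062A\u064E\u0627\u0628 \u062C\u062F\u064A\u062F"
print(tokenize(sample))
```

Because the kasra and fatha are accepted as token characters, the vocalized word stays in one piece; a letters-only rule would break it at each diacritic.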
ArabicLetterTokenizerFactory
Factory for ArabicLetterTokenizer
ArabicNormalizationFilter
A Lucene.Net.Analysis.TokenFilter that applies ArabicNormalizer to normalize the orthography.
ArabicNormalizationFilterFactory
Factory for ArabicNormalizationFilter.
<fieldType name="text_arnormal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
  </analyzer>
</fieldType>
ArabicNormalizer
Normalizer for Arabic.
Normalization is done in-place for efficiency, operating on a termbuffer.
Normalization is defined as:
- Normalization of hamza with alef seat to a bare alef.
- Normalization of teh marbuta to heh.
- Normalization of dotless yeh (alef maksura) to yeh.
- Removal of Arabic diacritics (the harakat).
- Removal of tatweel (stretching character).
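The normalization rules listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the ArabicNormalizer source: the exact code points in `REPLACE` and `REMOVE` (including alef with madda and shadda/sukun among the removed marks) are assumptions, and `normalize` and `normalize_text` are hypothetical names. A Python list stands in for the term buffer so that the rewrite happens in place and the function returns the new length.

```python
ALEF = "\u0627"
YEH = "\u064A"
HEH = "\u0647"

# Letter variants folded to a base letter (assumed set).
REPLACE = {
    "\u0622": ALEF,  # alef with madda above
    "\u0623": ALEF,  # alef with hamza above
    "\u0625": ALEF,  # alef with hamza below
    "\u0649": YEH,   # dotless yeh (alef maksura)
    "\u0629": HEH,   # teh marbuta
}

# Characters dropped entirely: tatweel plus the diacritics (assumed set).
REMOVE = {
    "\u0640",  # tatweel
    "\u064B",  # fathatan
    "\u064C",  # dammatan
    "\u064D",  # kasratan
    "\u064E",  # fatha
    "\u064F",  # damma
    "\u0650",  # kasra
    "\u0651",  # shadda
    "\u0652",  # sukun
}


def normalize(buf, length):
    """Normalize buf[:length] in place and return the new length."""
    write = 0
    for read in range(length):
        ch = buf[read]
        if ch in REMOVE:
            continue  # skip the character; later ones shift left
        buf[write] = REPLACE.get(ch, ch)
        write += 1
    return write


def normalize_text(text):
    """Convenience wrapper around the in-place routine."""
    buf = list(text)
    return "".join(buf[:normalize(buf, len(buf))])


# Alef-hamza, fatha, hah, sukun, meem, fatha, dal becomes alef, hah, meem, dal.
print(normalize_text("\u0623\u064E\u062D\u0652\u0645\u064E\u062F"))
```

Since the write index never overtakes the read index, a single pass over the buffer is enough and no second buffer is needed, which is the point of working in place.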
ArabicStemFilter
A Lucene.Net.Analysis.TokenFilter that applies ArabicStemmer to stem Arabic words.
To prevent terms from being stemmed, use an instance of SetKeywordMarkerFilter or a custom Lucene.Net.Analysis.TokenFilter that sets the Lucene.Net.Analysis.TokenAttributes.KeywordAttribute before this Lucene.Net.Analysis.TokenStream.
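The keyword-protection idea can be sketched in Python as a two-stage pipeline. Everything here is hypothetical and only mirrors the pattern: `Token.is_keyword` stands in for the keyword attribute, `mark_keywords` for a marker filter placed earlier in the chain, and `toy_stem` is a placeholder that strips only a leading definite article rather than performing real Arabic stemming.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    is_keyword: bool = False  # stand-in for the keyword attribute


def mark_keywords(tokens, protected):
    """Flag every token whose text is in the protected set."""
    for tok in tokens:
        if tok.text in protected:
            tok.is_keyword = True
        yield tok


def stem_filter(tokens, stem):
    """Apply stem() to each token unless it was flagged as a keyword."""
    for tok in tokens:
        if not tok.is_keyword:
            tok.text = stem(tok.text)
        yield tok


def toy_stem(word):
    """Hypothetical placeholder stemmer: drop a leading alef-lam."""
    article = "\u0627\u0644"
    if word.startswith(article) and len(word) >= 4:
        return word[len(article):]
    return word


tokens = [
    Token("\u0627\u0644\u0643\u062A\u0627\u0628"),        # stemmed
    Token("\u0627\u0644\u0642\u0627\u0647\u0631\u0629"),  # protected
]
protected = {"\u0627\u0644\u0642\u0627\u0647\u0631\u0629"}
stemmed = [t.text for t in stem_filter(mark_keywords(tokens, protected), toy_stem)]
print(stemmed)
```

The marker stage has to run before the stemming stage; if the order is reversed, the flag is set after the term has already been rewritten.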
ArabicStemFilterFactory
Factory for ArabicStemFilter.
<fieldType name="text_arstem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>
ArabicStemmer
Stemmer for Arabic.
Stemming is done in-place for efficiency, operating on a termbuffer.
Stemming is defined as:
- Removal of attached definite article, conjunction, and prepositions.
- Stemming of common suffixes.
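The two stemming steps above can be sketched in Python. This is a simplified illustration, not the ArabicStemmer source: the prefix and suffix lists, the minimum-length guards, and the function names are assumptions chosen in the spirit of light stemming, and the sketch works on immutable strings rather than on a term buffer in place.

```python
# Assumed prefixes: definite article, optionally preceded by a
# conjunction or preposition, plus the bare conjunction waw.
PREFIXES = [
    "\u0627\u0644",        # al-
    "\u0648\u0627\u0644",  # wal-
    "\u0628\u0627\u0644",  # bal-
    "\u0643\u0627\u0644",  # kal-
    "\u0641\u0627\u0644",  # fal-
    "\u0644\u0644",        # lil-
    "\u0648",              # wa-
]

# Assumed common suffixes.
SUFFIXES = [
    "\u0647\u0627",  # -ha
    "\u0627\u0646",  # -an
    "\u0627\u062A",  # -at
    "\u0648\u0646",  # -un
    "\u064A\u0646",  # -in
    "\u064A\u0647",  # -yh
    "\u064A\u0629",  # -yah
    "\u0647",        # heh
    "\u0629",        # teh marbuta
    "\u064A",        # yeh
]


def strip_prefix(word):
    """Remove at most one prefix, keeping a minimum remainder."""
    for prefix in PREFIXES:
        # Assumed guard: a one-letter prefix needs a word of 4+ characters,
        # longer prefixes must leave at least 2 characters behind.
        min_len = 4 if len(prefix) == 1 else len(prefix) + 2
        if len(word) >= min_len and word.startswith(prefix):
            return word[len(prefix):]
    return word


def strip_suffixes(word):
    """Remove each matching suffix in list order, leaving 2+ characters."""
    for suffix in SUFFIXES:
        if len(word) >= len(suffix) + 2 and word.endswith(suffix):
            word = word[:-len(suffix)]
    return word


def stem(word):
    """Light stemming: prefix removal followed by suffix removal."""
    return strip_suffixes(strip_prefix(word))


print(stem("\u0648\u0627\u0644\u0643\u062A\u0627\u0628"))  # wal- + kitab
```

The length guards keep short words intact: a three-letter word that happens to begin with waw, for example, is left unchanged instead of being reduced to two letters.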