Namespace Lucene.Net.Analysis.Morfologik

This package provides a dictionary-driven lemmatization ("accurate stemming") filter and analyzer for the Polish language, built on the Morfologik library developed by Dawid Weiss and Marcin Miłkowski.

For an introduction to Lucene's analysis API, see the Lucene.Net.Analysis namespace documentation.

The MorfologikFilter yields one or more terms for each token. Each of those terms is given the same position in the index.
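The stacking of lemmas at a single position can be observed by reading the position increment of each emitted term. The following is a minimal sketch, not a definitive usage pattern: it assumes that MorfologikAnalyzer exposes a constructor taking a LuceneVersion, and the field name "body" and the sample text are purely illustrative.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Morfologik;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class MorfologikPositionsDemo
{
    public static void Main()
    {
        // Assumption: a constructor accepting only the match version exists
        // and loads the default Polish dictionary.
        using Analyzer analyzer = new MorfologikAnalyzer(LuceneVersion.LUCENE_48);

        // "body" is a hypothetical field name; the text is an arbitrary Polish sample.
        using TokenStream stream = analyzer.GetTokenStream("body", new StringReader("liście spadły"));

        var term = stream.AddAttribute<ICharTermAttribute>();
        var posInc = stream.AddAttribute<IPositionIncrementAttribute>();

        stream.Reset();
        while (stream.IncrementToken())
        {
            // An increment of 0 means the term shares its position
            // with the previously emitted term.
            Console.WriteLine($"{term} (position increment: {posInc.PositionIncrement})");
        }
        stream.End();
    }
}
```

When a token has more than one lemma, the additional lemmas should be reported with a position increment of 0, which is how they end up at the same index position as the first.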

Classes

MorfologikAnalyzer

A Lucene.Net.Analysis.Analyzer that uses the Morfologik library.

See: Morfologik project page

MorfologikFilter

A Lucene.Net.Analysis.TokenFilter that uses the Morfologik library to transform input tokens into lemma and morphosyntactic (POS) tokens. Applies to Polish only.

MorfologikFilter contains a MorphosyntacticTagsAttribute, which provides morphosyntactic annotations for produced lemmas. See the Morfologik documentation for details.
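As a rough illustration of reading these annotations, the sketch below prints each lemma together with its tags. It rests on several assumptions that should be checked against the API reference: that the attribute is exposed through an IMorphosyntacticTagsAttribute interface in the Lucene.Net.Analysis.Morfologik.TokenAttributes namespace, that it has a Tags property holding a list of StringBuilder values, and that Tags may be null when no lemma was found. The field name and sample text are hypothetical.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Morfologik;
using Lucene.Net.Analysis.Morfologik.TokenAttributes;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class MorfologikTagsDemo
{
    public static void Main()
    {
        // Assumption: the single-argument constructor uses the default Polish dictionary.
        using Analyzer analyzer = new MorfologikAnalyzer(LuceneVersion.LUCENE_48);
        using TokenStream stream = analyzer.GetTokenStream("body", new StringReader("liście spadły"));

        var term = stream.AddAttribute<ICharTermAttribute>();
        // Assumption: interface name and namespace as described in the lead-in.
        var tags = stream.AddAttribute<IMorphosyntacticTagsAttribute>();

        stream.Reset();
        while (stream.IncrementToken())
        {
            // Guard against a missing tag list for tokens the dictionary does not cover.
            string tagText = tags.Tags is null ? "(no tags)" : string.Join(", ", tags.Tags);
            Console.WriteLine($"{term}: {tagText}");
        }
        stream.End();
    }
}
```

The format of the tag strings themselves is defined by the dictionary in use, so the Morfologik documentation remains the reference for interpreting them.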

MorfologikFilterFactory

Filter factory for MorfologikFilter.

An explicit resource name of the dictionary (".dict") can be provided via the dictionary attribute, as the example below demonstrates:

<fieldType name="text_mylang" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.MorfologikFilterFactory" dictionary="mylang.dict" />
  </analyzer>
</fieldType>

If the dictionary attribute is not provided, the Polish dictionary is loaded and used by default.

See: Morfologik web site