
    Namespace Lucene.Net.Analysis.Compound

    Classes

    CompoundWordTokenFilterBase

    Base class for decomposition token filters.

    You must specify the required LuceneVersion compatibility when creating CompoundWordTokenFilterBase:

    • As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries.
    • As of 4.4, CompoundWordTokenFilterBase doesn't update offsets.

    CompoundWordTokenFilterBase.CompoundToken

    Helper class to hold decompounded token information.

    DictionaryCompoundWordTokenFilter

    A TokenFilter that decomposes compound words found in many Germanic languages.

    "Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a brute-force algorithm to achieve this.
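    The brute-force idea can be illustrated with a short Python sketch. This is not the Lucene.Net implementation; it only assumes that every substring within the configured size limits is looked up in the dictionary, case-insensitively. The function name `decompose` and its keyword arguments are hypothetical and merely mirror the factory parameters shown on this page.

```python
def decompose(word, dictionary, min_word_size=5, min_subword_size=2,
              max_subword_size=15, only_longest_match=False):
    """Return dictionary subwords found inside `word` (illustrative sketch)."""
    # Words shorter than min_word_size are left alone.
    if len(word) < min_word_size:
        return []
    # Assumption: dictionary lookups ignore case.
    vocab = {entry.lower() for entry in dictionary}
    subwords = []
    # Try every start offset and every allowed subword length.
    for start in range(len(word) - min_subword_size + 1):
        longest = None
        for length in range(min_subword_size, max_subword_size + 1):
            if start + length > len(word):
                break
            candidate = word[start:start + length]
            if candidate.lower() in vocab:
                if only_longest_match:
                    # Lengths only grow, so the last hit is the longest one.
                    longest = candidate
                else:
                    subwords.append(candidate)
        if longest is not None:
            subwords.append(longest)
    return subwords


parts = decompose("Donaudampfschiff", ["Donau", "dampf", "schiff"])
print(parts)
```

    With `only_longest_match=True`, the sketch keeps at most one subword per start offset, which is the behaviour the `onlyLongestMatch` attribute name suggests.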

    You must specify the required LuceneVersion compatibility when creating CompoundWordTokenFilterBase:

    • As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries.

    DictionaryCompoundWordTokenFilterFactory

    Factory for DictionaryCompoundWordTokenFilter.

    <fieldType name="text_dictcomp" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt"
            minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="true"/>
      </analyzer>
    </fieldType>

    HyphenationCompoundWordTokenFilter

    A TokenFilter that decomposes compound words found in many Germanic languages.

    "Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a hyphenation grammar and a word dictionary to achieve this.
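    As a rough illustration of how hyphenation points narrow the search, the Python sketch below only considers substrings that start and end at a hyphenation point and then checks them against an optional dictionary. It is not the Lucene.Net implementation: the function name `decompose_at_breaks` is hypothetical, and the `breaks` argument is a hand-written stand-in for the offsets a hyphenation grammar would produce.

```python
def decompose_at_breaks(word, breaks, dictionary=None, min_word_size=5,
                        min_subword_size=2, max_subword_size=15,
                        only_longest_match=False):
    """Return subwords of `word` bounded by hyphenation points (sketch)."""
    if len(word) < min_word_size:
        return []
    # Assumption: no dictionary means every candidate is accepted.
    vocab = None if dictionary is None else {e.lower() for e in dictionary}
    # Word start and end always count as boundaries.
    points = sorted(set(breaks) | {0, len(word)})
    subwords = []
    for i, start in enumerate(points):
        longest = None
        for end in points[i + 1:]:
            length = end - start
            if length > max_subword_size:
                break  # later boundaries only make the part longer
            if length < min_subword_size:
                continue
            part = word[start:end]
            if vocab is None or part.lower() in vocab:
                if only_longest_match:
                    longest = part
                else:
                    subwords.append(part)
        if longest is not None:
            subwords.append(longest)
    return subwords


# Hypothetical hyphenation: Do-nau-dampf-schiff
parts = decompose_at_breaks("Donaudampfschiff", [2, 5, 10],
                            dictionary=["Donau", "dampf", "schiff"])
print(parts)
```

    Compared with trying every offset, only a handful of candidates are examined, and the dictionary then filters out fragments such as "nau" that are not words.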

    You must specify the required LuceneVersion compatibility when creating CompoundWordTokenFilterBase:

    • As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries.

    HyphenationCompoundWordTokenFilterFactory

    Factory for HyphenationCompoundWordTokenFilter.

    This factory accepts the following parameters:

    • hyphenator
      (mandatory): path to the FOP XML hyphenation pattern file. See http://offo.sourceforge.net/hyphenation/.
    • encoding
      (optional): encoding of the XML hyphenation file. Defaults to UTF-8.
    • dictionary
      (optional): dictionary of words. Defaults to no dictionary.
    • minWordSize
      (optional): minimum length a word must have to be decomposed. Defaults to 5.
    • minSubwordSize
      (optional): minimum length of subwords. Defaults to 2.
    • maxSubwordSize
      (optional): maximum length of subwords. Defaults to 15.
    • onlyLongestMatch
      (optional): if true, adds only the longest matching subword to the stream. Defaults to false.

    <fieldType name="text_hyphncomp" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.HyphenationCompoundWordTokenFilterFactory" hyphenator="hyphenator.xml" encoding="UTF-8"
            dictionary="dictionary.txt" minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="false"/>
      </analyzer>
    </fieldType>

    Copyright © 2019 Licensed to the Apache Software Foundation (ASF)