Namespace Lucene.Net.Analysis.Compound

Classes

CompoundWordTokenFilterBase

Base class for decomposition token filters.

You must specify the required LuceneVersion compatibility when creating CompoundWordTokenFilterBase:

As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries.
As of 4.4, CompoundWordTokenFilterBase no longer updates offsets.
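
To see why supplementary characters need explicit handling: .NET strings are UTF-16, so a character outside the Basic Multilingual Plane occupies two char values (a surrogate pair). The Python sketch below only illustrates that counting difference. It is not code from Lucene.Net, and the sample word and the helper name `utf16_units` are arbitrary choices made for this example.

```python
# Illustration only: compare code points with UTF-16 code units.
def utf16_units(text: str) -> int:
    """Return the number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


# Two Deseret letters (U+10400 and U+10428), both outside the BMP.
word = "\U00010400\U00010428"

code_points = len(word)        # Python counts code points
code_units = utf16_units(word)  # a UTF-16 string counts code units

print(code_points, code_units)  # prints: 2 4
```

A dictionary lookup that indexes by code unit could split such a word in the middle of a surrogate pair, which is the kind of error the 3.1 behavior avoids.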

CompoundWordTokenFilterBase.CompoundToken

Helper class that holds information about a decompounded token.

DictionaryCompoundWordTokenFilter

A TokenFilter that decomposes compound words found in many Germanic languages.

"Donaudampfschiff" is split into "Donau", "dampf", and "schiff", so a search for "schiff" alone still finds "Donaudampfschiff". The filter uses a brute-force algorithm to find the subwords.

You must specify the required LuceneVersion compatibility when creating CompoundWordTokenFilterBase:

As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries.
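
The brute-force decomposition described above can be sketched in a few lines. This is a simplified Python illustration, not the Lucene.Net implementation: the function name `decompose` is hypothetical, the parameter names mirror the factory attributes shown on this page, and lowercasing the word before lookup is an assumption of the sketch.

```python
# Simplified sketch of brute-force compound splitting (not the library code).
def decompose(word, dictionary, min_word_size=5, min_subword_size=2,
              max_subword_size=15, only_longest_match=False):
    """Return every dictionary subword found inside word, in scan order."""
    if len(word) < min_word_size:
        return []  # short words are left alone

    lowered = word.lower()  # assumption: the dictionary holds lowercase words
    subwords = []
    # Try every start position and every allowed subword length.
    for start in range(len(lowered) - min_subword_size + 1):
        longest = None
        for length in range(min_subword_size, max_subword_size + 1):
            if start + length > len(lowered):
                break
            candidate = lowered[start:start + length]
            if candidate in dictionary:
                if only_longest_match:
                    # Lengths grow, so the last hit at this start is the longest.
                    longest = candidate
                else:
                    subwords.append(candidate)
        if only_longest_match and longest is not None:
            subwords.append(longest)
    return subwords


words = {"donau", "dampf", "schiff"}
print(decompose("Donaudampfschiff", words))  # ['donau', 'dampf', 'schiff']
```

The nested loops test every substring within the size limits, which is why the approach is called brute force: the cost grows with the word length times the range of subword sizes, but no language-specific resources beyond the dictionary are needed.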

DictionaryCompoundWordTokenFilterFactory

Factory for DictionaryCompoundWordTokenFilter.

```xml
<fieldType name="text_dictcomp" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt"
        minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="true"/>
  </analyzer>
</fieldType>
```

HyphenationCompoundWordTokenFilter

A TokenFilter that decomposes compound words found in many Germanic languages.

"Donaudampfschiff" is split into "Donau", "dampf", and "schiff", so a search for "schiff" alone still finds "Donaudampfschiff". The filter uses a hyphenation grammar and a word dictionary to find the subwords.

You must specify the required LuceneVersion compatibility when creating CompoundWordTokenFilterBase:

As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries.
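
The idea of combining hyphenation points with a dictionary can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the Lucene.Net implementation: the function name `decompose_with_hyphenation` is hypothetical, the hyphenation points are written by hand here (a real filter derives them from the hyphenation pattern file), and lowercasing before lookup is an assumption of the sketch.

```python
# Simplified sketch: only substrings bounded by hyphenation points are candidates.
def decompose_with_hyphenation(word, hyphenation_points, dictionary=None,
                               min_word_size=5, min_subword_size=2,
                               max_subword_size=15, only_longest_match=False):
    """hyphenation_points: sorted offsets into word, including 0 and len(word)."""
    if len(word) < min_word_size:
        return []  # short words are left alone

    lowered = word.lower()  # assumption: the dictionary holds lowercase words
    subwords = []
    for i, start in enumerate(hyphenation_points):
        longest = None
        for end in hyphenation_points[i + 1:]:
            length = end - start
            if length > max_subword_size:
                break  # later end points only make the candidate longer
            if length < min_subword_size:
                continue
            candidate = lowered[start:end]
            # Without a dictionary, every segment within the limits is kept.
            if dictionary is None or candidate in dictionary:
                if only_longest_match:
                    longest = candidate
                else:
                    subwords.append(candidate)
        if only_longest_match and longest is not None:
            subwords.append(longest)
    return subwords


# Hand-written, hypothetical break points: Do|nau|dampf|schiff
points = [0, 2, 5, 10, 16]
words = {"donau", "dampf", "schiff"}
print(decompose_with_hyphenation("Donaudampfschiff", points, words))
```

Compared with testing every substring, restricting candidates to hyphenation boundaries checks far fewer positions and avoids matches that start in the middle of a syllable.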

HyphenationCompoundWordTokenFilterFactory

Factory for HyphenationCompoundWordTokenFilter.

This factory accepts the following parameters:

`hyphenator` (mandatory): path to the FOP XML hyphenation pattern. See http://offo.sourceforge.net/hyphenation/.
`encoding` (optional): encoding of the XML hyphenation file. Defaults to UTF-8.
`dictionary` (optional): dictionary of words. Defaults to no dictionary.
`minWordSize` (optional): minimal word length that gets decomposed. Defaults to 5.
`minSubwordSize` (optional): minimum length of subwords. Defaults to 2.
`maxSubwordSize` (optional): maximum length of subwords. Defaults to 15.
`onlyLongestMatch` (optional): if true, adds only the longest matching subword to the stream. Defaults to false.

```xml
<fieldType name="text_hyphncomp" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.HyphenationCompoundWordTokenFilterFactory" hyphenator="hyphenator.xml" encoding="UTF-8"
        dictionary="dictionary.txt" minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="false"/>
  </analyzer>
</fieldType>
```