Namespace Lucene.Net.Analysis.CharFilters

Normalization of text before the tokenizer.

CharFilters are chainable filters that normalize text before tokenization and provide mappings between offsets in the normalized text and the corresponding offsets in the original text.

CharFilter offset mappings

CharFilters modify an input stream via a series of substring replacements (including deletions and insertions) to produce an output stream. There are three possible replacement cases: the replacement string has the same length as the original substring; the replacement is shorter; and the replacement is longer. In the latter two cases (when the replacement has a different length than the original), one or more offset correction mappings are required.

When the replacement is shorter than the original (e.g. when the replacement is the empty string), add a single offset correction mapping at the replacement's end offset in the output stream. The cumulativeDiff argument passed to AddOffCorrectMap() is the sum of all previous replacement offset adjustments plus the difference between the lengths of the original substring and the replacement string (a positive value).

When the replacement is longer than the original (e.g. when the original is the empty string), add as many offset correction mappings as the difference between the lengths of the replacement string and the original substring, one per successive output offset, starting at the end offset the original substring would have had in the output stream. The cumulativeDiff argument passed to AddOffCorrectMap() is the sum of all previous replacement offset adjustments plus the difference between the lengths of the original substring and the portion of the replacement string emitted so far (a negative value).
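
The two rules above can be checked with a small worked example. The sketch below is a language-neutral model in plain Python and does not use the Lucene.Net API. The class OffsetCorrector and its method names are hypothetical stand-ins that mirror AddOffCorrectMap(off, cumulativeDiff). It assumes a lookup rule in which an output offset is corrected by the cumulative diff of the last mapping at or before it.

```python
import bisect


class OffsetCorrector:
    """Toy model of an offset correction table (not the Lucene.Net class)."""

    def __init__(self):
        self.offsets = []  # output-stream offsets, in increasing order
        self.diffs = []    # cumulative diff in effect from each offset onward

    def add_off_correct_map(self, off, cumulative_diff):
        self.offsets.append(off)
        self.diffs.append(cumulative_diff)

    def correct(self, current_off):
        # Assumed rule: use the last mapping whose offset is <= current_off.
        i = bisect.bisect_right(self.offsets, current_off) - 1
        return current_off if i < 0 else current_off + self.diffs[i]


original = "a&amp;b1c"
output = "a&bonec"  # "&amp;" -> "&" (shorter), then "1" -> "one" (longer)

corrector = OffsetCorrector()

# Shorter case: 5 characters became 1. The replacement ends at output
# offset 2, so one mapping is added there.
corrector.add_off_correct_map(2, 4)  # 0 + (5 - 1)

# Longer case: 1 character became 3, so 3 - 1 = 2 mappings are added.
# "1" would have ended at output offset 4, so the mappings start there.
corrector.add_off_correct_map(4, 3)  # 4 + (1 - 2)
corrector.add_off_correct_map(5, 2)  # 4 + (1 - 3)

# Original offset for every output offset, including the end of the text.
corrected = [corrector.correct(i) for i in range(len(output) + 1)]
```

In this model every character of "one" maps back to the original offset of "1", and the characters after it line up with their original positions again.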

Classes

BaseCharFilter

Base utility class for implementing a Lucene.Net.Analysis.CharFilter. Subclass it, record mappings by calling AddOffCorrectMap(int, int), and then call the Correct(int) method to correct an offset.

HTMLStripCharFilter

A Lucene.Net.Analysis.CharFilter that wraps another TextReader and attempts to strip out HTML constructs.

HTMLStripCharFilterFactory

Factory for HTMLStripCharFilter.

<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" escapedTags="a, title" />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

MappingCharFilter

Simplistic Lucene.Net.Analysis.CharFilter that applies the mappings contained in a NormalizeCharMap to the character stream and corrects the resulting changes to the offsets. Matching is greedy (the longest pattern matching at a given point wins). The replacement is allowed to be the empty string.
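
The greedy rule can be illustrated with a short, self-contained sketch. It is written in plain Python, uses an ordinary dictionary rather than a NormalizeCharMap, and leaves out offset correction. The function name apply_mappings is hypothetical, and the sketch makes no claim about how MappingCharFilter is implemented internally. Keys are assumed to be non-empty.

```python
def apply_mappings(text, mappings):
    """Replace greedily: at each position the longest matching key wins."""
    longest = max((len(key) for key in mappings), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in mappings:
                out.append(mappings[candidate])  # may be the empty string
                i += n
                break
        else:
            out.append(text[i])  # no key matches here; copy one character
            i += 1
    return "".join(out)


mappings = {"ae": "X", "aeb": "Y", "c": ""}
result = apply_mappings("aebcae", mappings)
```

At offset 0 both "ae" and "aeb" match, and the longer key "aeb" is chosen. The "c" that follows is deleted because its replacement is empty, and the final "ae" is replaced on its own.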

MappingCharFilterFactory

Factory for MappingCharFilter.

<fieldType name="text_map" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

Available since Solr 1.4.

NormalizeCharMap

Holds a map of string input to string output, to be used with MappingCharFilter. Use NormalizeCharMap.Builder to create this.

NormalizeCharMap.Builder

Builds a NormalizeCharMap.

Call Add() until you have added all the mappings, then call Build() to get a NormalizeCharMap.

Note

This API is experimental and might change in incompatible ways in the next release.