Namespace Lucene.Net.Search.Spell

Suggest alternate spellings for words. Also see the spell checker Wiki page.

Classes

CombineSuggestion

A suggestion generated by combining one or more original query terms.

DirectSpellChecker

Simple automaton-based spellchecker. Candidates are presented directly from the term dictionary, based on Levenshtein distance. This is an alternative to SpellChecker if you are using an edit-distance-like metric such as Levenshtein or JaroWinklerDistance.

A practical benefit of this spellchecker is that it requires no additional data structures (neither in RAM nor on disk) to do its work.

DirectSpellChecker.ScoreTerm

Holds a spelling correction for internal usage inside DirectSpellChecker.

HighFrequencyDictionary

HighFrequencyDictionary: terms taken from the given field of a Lucene index, which appear in a number of documents above a given threshold.

The threshold is a value in [0..1] representing the minimum fraction of documents (out of the total) in which a term must appear.

Based on LuceneDictionary.
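The threshold rule can be illustrated without an index. The following is a minimal sketch, assuming document frequencies are already available in a dictionary; `HighFrequencyFilter`, `Select`, and the truncation of the cutoff to a whole document count are hypothetical choices for illustration, not the Lucene.NET API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET: illustrates the threshold rule only.
public static class HighFrequencyFilter
{
    // docFreqs maps each term to the number of documents that contain it.
    // threshold is the minimum fraction of documents, in [0..1].
    public static List<string> Select(
        IDictionary<string, int> docFreqs, int totalDocs, float threshold)
    {
        if (threshold < 0f || threshold > 1f)
            throw new ArgumentOutOfRangeException(nameof(threshold));

        // Assumption: the fractional cutoff is truncated to a whole document count.
        int minDocs = (int)(threshold * totalDocs);

        var kept = new List<string>();
        foreach (var entry in docFreqs)
        {
            // Keep only terms that appear in at least minDocs documents.
            if (entry.Value >= minDocs)
                kept.Add(entry.Key);
        }
        kept.Sort(StringComparer.Ordinal);
        return kept;
    }
}
```

With 10 documents and a threshold of 0.5, a term found in 5 or more documents is kept and a term found in only 1 document is dropped.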

JaroWinklerDistance

Similarity measure for short strings such as person names. See http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
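For orientation, here is a sketch of the textbook Jaro-Winkler similarity. It is not the library's source: `JaroWinklerSketch` is a hypothetical name, the prefix scaling factor of 0.1 and the 4-character prefix cap are the conventional values, and the Lucene.NET class may differ in details.

```csharp
using System;

// Hypothetical textbook implementation, not the Lucene.NET class.
public static class JaroWinklerSketch
{
    public static double Jaro(string s1, string s2)
    {
        if (s1.Length == 0 && s2.Length == 0) return 1.0;
        if (s1.Length == 0 || s2.Length == 0) return 0.0;

        // Characters match only if they are equal and close enough in position.
        int window = Math.Max(0, Math.Max(s1.Length, s2.Length) / 2 - 1);
        var matched1 = new bool[s1.Length];
        var matched2 = new bool[s2.Length];
        int matches = 0;
        for (int i = 0; i < s1.Length; i++)
        {
            int lo = Math.Max(0, i - window);
            int hi = Math.Min(s2.Length - 1, i + window);
            for (int j = lo; j <= hi; j++)
            {
                if (!matched2[j] && s1[i] == s2[j])
                {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Count matched characters that appear in a different order.
        int k = 0, outOfOrder = 0;
        for (int i = 0; i < s1.Length; i++)
        {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1[i] != s2[k]) outOfOrder++;
            k++;
        }
        double m = matches;
        double t = outOfOrder / 2.0; // each transposition involves two characters
        return (m / s1.Length + m / s2.Length + (m - t) / m) / 3.0;
    }

    public static double JaroWinkler(string s1, string s2, double p = 0.1)
    {
        double jaro = Jaro(s1, s2);
        // Boost the score for a shared prefix of up to 4 characters.
        int max = Math.Min(4, Math.Min(s1.Length, s2.Length));
        int prefix = 0;
        while (prefix < max && s1[prefix] == s2[prefix]) prefix++;
        return jaro + prefix * p * (1.0 - jaro);
    }
}
```

For the classic pair "MARTHA" and "MARHTA", this sketch gives a Jaro score of about 0.944 and a Jaro-Winkler score of about 0.961.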

LevensteinDistance

Levenshtein edit distance class.
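As a reminder of what the metric computes, the sketch below implements the standard two-row dynamic-programming Levenshtein distance. `LevenshteinSketch` is a hypothetical name, and the normalization in `Similarity` (dividing by the longer string's length so that 1 means identical) is an assumption for illustration rather than a statement about the library's exact formula.

```csharp
using System;

// Hypothetical illustration, not the Lucene.NET class.
public static class LevenshteinSketch
{
    // Minimum number of single-character insertions, deletions,
    // and substitutions needed to turn a into b.
    public static int Distance(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1),
                    prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Assumption: scale into [0..1] by the longer string, 1 meaning identical.
    public static float Similarity(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 1f;
        return 1f - (float)Distance(a, b) / longest;
    }
}
```

For example, `LevenshteinSketch.Distance("kitten", "sitting")` is 3 (two substitutions and one insertion).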

LuceneDictionary

Lucene Dictionary: terms taken from the given field of a Lucene index.

LuceneLevenshteinDistance

Damerau-Levenshtein (optimal string alignment) implemented in a way consistent with Lucene's FuzzyTermsEnum with the transpositions option enabled.

Notes:

This metric treats full Unicode code points as characters.
This metric scales raw edit distances into a floating-point score based upon the shortest of the two terms.
Transpositions of two adjacent code points are treated as primitive edits.
Edits are applied in parallel: for example, "ab" and "bca" have distance 3.

NOTE: this class is not particularly efficient. It is only intended for merging results from multiple DirectSpellCheckers.
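The notes above can be made concrete with a sketch of the raw optimal string alignment distance computed over code points. `OsaSketch` is a hypothetical name; this is the textbook recurrence, not the library's source, and it deliberately omits the scaling step that turns the raw distance into a score.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration, not the Lucene.NET class.
public static class OsaSketch
{
    // Decode UTF-16 into code points so a surrogate pair counts as one character.
    static int[] ToCodePoints(string s)
    {
        var cps = new List<int>(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length
                && char.IsLowSurrogate(s[i + 1]))
            {
                cps.Add(char.ConvertToUtf32(s[i], s[i + 1]));
                i++;
            }
            else
            {
                cps.Add(s[i]);
            }
        }
        return cps.ToArray();
    }

    // Raw optimal string alignment distance: insert, delete, substitute,
    // and transposition of two adjacent code points each cost 1.
    public static int Distance(string source, string target)
    {
        int[] a = ToCodePoints(source);
        int[] b = ToCodePoints(target);
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);

                // A transposed pair is one edit, but it cannot be edited again.
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                    d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + 1);
            }
        }
        return d[a.Length, b.Length];
    }
}
```

Under this recurrence, "ab" and "ba" have distance 1, while "ab" and "bca" have distance 3, because the transposed pair cannot also receive an insertion between its two characters.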

NGramDistance

N-gram version of edit distance based on the paper by Grzegorz Kondrak, "N-gram similarity and distance". Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126, Buenos Aires, Argentina, November 2005. http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf

This implementation uses the position-based optimization to compute partial matches of n-gram substrings and adds a null-character prefix of size n-1 so that the first character is contained in the same number of n-grams as a middle character. Null-character prefix matches are discounted so that strings with no matching characters return a distance of 0, the value that denotes no similarity.
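The effect of the null-character prefix can be seen in isolation. The sketch below only generates padded n-grams; it does not implement the distance computation, and `NGramPaddingSketch` and `PaddedNGrams` are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of the prefix padding only, not the distance itself.
public static class NGramPaddingSketch
{
    // Prefix the string with n-1 null characters, then slide a window of size n.
    public static List<string> PaddedNGrams(string s, int n)
    {
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        string padded = new string('\0', n - 1) + s;
        var grams = new List<string>();
        for (int i = 0; i + n <= padded.Length; i++)
            grams.Add(padded.Substring(i, n));
        return grams;
    }
}
```

For "abc" with n = 2, the n-grams are "\0a", "ab", and "bc", so the first character 'a' appears in two n-grams, the same count as the middle character 'b'.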

PlainTextDictionary

Dictionary represented by a text file.

Format allowed: one word per line:

 word1
 word2
 word3
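Reading this format needs nothing beyond a line-by-line loop. The sketch below is a stand-alone illustration, not the library's reader: `WordListSketch` and `ReadWords` are hypothetical names, and trimming whitespace and skipping blank lines are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical illustration, not the Lucene.NET class.
public static class WordListSketch
{
    // Yields one word per non-empty line of the reader.
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Assumption: surrounding whitespace is ignored and blank lines are skipped.
            string word = line.Trim();
            if (word.Length > 0)
                yield return word;
        }
    }
}
```

Passing a `StringReader` over "word1\nword2\nword3" yields the three words in order; a `StreamReader` over a file works the same way.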

SpellChecker

Spell checker class (main class), initially inspired by the David Spencer code.

Example Usage (C#):

 SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
 // To index a field of a user index:
 spellchecker.IndexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
 // To index a file containing words:
 spellchecker.IndexDictionary(new PlainTextDictionary(new FileInfo("myfile.txt")));
 string[] suggestions = spellchecker.SuggestSimilar("misspelt", 5);

Enums

SuggestMode

Set of strategies for suggesting related terms.

This is a Lucene.NET EXPERIMENTAL API; use at your own risk.

WordBreakSpellChecker.BreakSuggestionSortMethod

Determines the order in which word break suggestions are listed.
