Namespace Lucene.Net.Misc
Misc Tools
The misc package contains various tools for splitting and merging indices, changing norms, finding high-frequency terms, and other tasks.
Classes
GetTermInfo
Utility to get document frequency and total number of occurrences (sum of the tf for each doc) of a term.
LUCENENET specific: In the Java implementation, this class' Main method was intended to be called from the command line. However, in .NET a method within a DLL can't be directly called from the command line, so we provide a .NET tool, lucene-cli, with a command that maps to that method: index list-term-info
HighFreqTerms
HighFreqTerms class extracts the top n most frequent terms (by document frequency) from an existing Lucene index and reports their document frequency.
LUCENENET specific: In the Java implementation, this class' Main method was intended to be called from the command line. However, in .NET a method within a DLL can't be directly called from the command line, so we provide a .NET tool, lucene-cli, with a command that maps to that method: index list-high-freq-terms
HighFreqTerms.DocFreqComparer
Compares terms by DocFreq
HighFreqTerms.TotalTermFreqComparer
Compares terms by TotalTermFreq
IndexMergeTool
Merges indices specified on the command line into the index specified as the first command line argument.
LUCENENET specific: In the Java implementation, this class' Main method was intended to be called from the command line. However, in .NET a method within a DLL can't be directly called from the command line, so we provide a .NET tool, lucene-cli, with a command that maps to that method: index merge
SweetSpotSimilarity
A similarity with a lengthNorm that provides for a "plateau" of equally good lengths, and tf helper functions.
For lengthNorm, a min and max can be specified to define the plateau of lengths that should all have a norm of 1.0. Below the min and above the max, the lengthNorm drops off following a square-root curve.
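The plateau behaviour can be illustrated with a small Python sketch of the math. It is not the Lucene.NET API: the function name `length_norm`, its parameter names, and the default `steepness` of 0.5 are assumptions made for illustration, and the formula is written as recalled from the upstream Lucene documentation, so verify it against the actual source before relying on it.

```python
import math

def length_norm(num_terms, min_len, max_len, steepness=0.5):
    """Illustrative plateau-shaped length norm (assumed formula, see lead-in).

    Inside [min_len, max_len] the two absolute distances sum to exactly
    (max_len - min_len), so `excess` is 0 and the norm is 1.0. Outside the
    plateau, `excess` is twice the distance to the nearest edge.
    """
    excess = (abs(num_terms - min_len)
              + abs(num_terms - max_len)
              - (max_len - min_len))
    return 1.0 / math.sqrt(steepness * excess + 1.0)

# Every length on the plateau [3, 5] gets the same norm of 1.0.
plateau = [length_norm(n, 3, 5) for n in (3, 4, 5)]

# Lengths outside the plateau are penalized on both sides.
too_short = length_norm(1, 3, 5)
too_long = length_norm(8, 3, 5)
```

With these assumed parameters, a field of 8 terms lies 3 terms past the plateau and gets a norm of 0.5, while a field of 1 term lies 2 terms below it and gets roughly 0.58.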
For tf, baselineTf and hyperbolicTf functions are provided, which subclasses can choose between.
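The two tf shapes can be sketched the same way in Python. Again, this illustrates the math rather than the Lucene.NET API: the function and parameter names are hypothetical, and the formulas are written as recalled from the upstream Lucene documentation, so treat them as assumptions. The hyperbolic form is expressed through `math.tanh`, which is algebraically the same as `(b^x - b^-x) / (b^x + b^-x)` with `x` scaled by `ln(b)` and avoids overflow for large frequencies.

```python
import math

def baseline_tf(freq, base, min_tf):
    """Flat at `base` up to `min_tf`, then grows like a square root (assumed formula)."""
    if freq <= min_tf:
        return base
    return math.sqrt(freq + base * base - min_tf)

def hyperbolic_tf(freq, min_val, max_val, base, x_offset):
    """S-shaped curve between `min_val` and `max_val`, centred on `x_offset` (assumed formula)."""
    if freq == 0:
        return 0.0
    x = freq - x_offset
    # tanh(x * ln(base)) == (base**x - base**-x) / (base**x + base**-x)
    s_curve = math.tanh(x * math.log(base))
    return min_val + (max_val - min_val) / 2.0 * (s_curve + 1.0)

# Baseline: all frequencies up to min_tf score the same, then growth is damped.
flat = baseline_tf(3, base=2.0, min_tf=5.0)
grown = baseline_tf(21, base=2.0, min_tf=0.0)

# Hyperbolic: the midpoint of the range is reached at x_offset, and the
# score saturates near max_val for very frequent terms.
midpoint = hyperbolic_tf(10, min_val=0.0, max_val=2.0, base=1.3, x_offset=10.0)
saturated = hyperbolic_tf(100, min_val=0.0, max_val=2.0, base=1.3, x_offset=10.0)
```

The baseline shape suits fields where a few repetitions should not matter, while the hyperbolic shape caps the benefit of very high term frequencies.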
TermStats
Holder for a term along with its statistics (DocFreq and TotalTermFreq).
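The difference between the two statistics used throughout this namespace (DocFreq and TotalTermFreq) can be shown on a toy in-memory corpus. The Python sketch below does not use Lucene.NET at all; `term_stats`, `top_terms`, and the sample documents are hypothetical names and data chosen for illustration.

```python
from collections import Counter

def term_stats(docs):
    """Return (doc_freq, total_term_freq) counters for tokenized documents."""
    doc_freq = Counter()         # number of documents containing the term
    total_term_freq = Counter()  # sum of the per-document tf of the term
    for doc in docs:
        for term, tf in Counter(doc).items():
            doc_freq[term] += 1
            total_term_freq[term] += tf
    return doc_freq, total_term_freq

def top_terms(stats, n):
    """Top n terms by the given statistic, ties broken alphabetically."""
    return sorted(stats.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

docs = [
    ["lucene", "index", "lucene"],
    ["index", "merge"],
    ["lucene", "search", "index", "index"],
]
doc_freq, total_term_freq = term_stats(docs)

by_doc_freq = top_terms(doc_freq, 2)
by_total_term_freq = top_terms(total_term_freq, 2)
```

Here "lucene" appears in two documents but three times overall, so its document frequency is 2 and its total term frequency is 3. Ranking by one statistic or the other corresponds to the choice between the two comparers listed above.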