Miscellaneous index tools.
Utility to get the document frequency and the total number of occurrences (the sum of tf over all documents) of a term.
The HighFreqTerms class extracts the top n most frequent terms (by document frequency) from an existing Lucene index and reports their document frequency.
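To make the two statistics concrete, here is a minimal sketch that computes them over a tiny in-memory list of tokenized documents. It does not use the Lucene API; the class name TermStatsSketch and its methods are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: computes both statistics
// for a term over an in-memory list of tokenized documents.
class TermStatsSketch {

    /** Number of documents that contain the term at least once. */
    static int docFreq(List<List<String>> docs, String term) {
        int df = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return df;
    }

    /** Sum over all documents of the term's within-document frequency (tf). */
    static long totalTermFreq(List<List<String>> docs, String term) {
        long total = 0;
        for (List<String> doc : docs) {
            for (String token : doc) {
                if (token.equals(term)) {
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("lucene", "index", "lucene"),
            Arrays.asList("index", "merge"),
            Arrays.asList("lucene"));
        System.out.println("docFreq=" + docFreq(docs, "lucene"));
        System.out.println("totalTermFreq=" + totalTermFreq(docs, "lucene"));
    }
}
```

The two numbers differ whenever a term repeats inside a document: a term that appears twice in one document adds one to the document frequency but two to the total.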
If the -t flag is given, both document frequency and total tf (total number of occurrences) are reported, ordered by descending total tf.
Compares terms by document frequency (docFreq).
Compares terms by total term frequency (totalTermFreq).
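The top-n extraction and the two comparators can be pictured with a bounded min-heap, sketched below in plain Java. This is an illustration of the general technique under stated assumptions, not the tool's actual code: the TopTermsSketch class, its Stats type, and the comparator constants are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical types for illustration; they mirror the idea, not Lucene's classes.
class TopTermsSketch {

    /** Per-term statistics: document frequency and total occurrences. */
    static final class Stats {
        final String term;
        final int docFreq;
        final long totalTermFreq;

        Stats(String term, int docFreq, long totalTermFreq) {
            this.term = term;
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }
    }

    /** Orders terms by ascending document frequency. */
    static final Comparator<Stats> BY_DOC_FREQ =
        Comparator.comparingInt((Stats s) -> s.docFreq);

    /** Orders terms by ascending total term frequency. */
    static final Comparator<Stats> BY_TOTAL_TERM_FREQ =
        Comparator.comparingLong((Stats s) -> s.totalTermFreq);

    /** Returns the n largest entries under cmp, in descending order. */
    static List<Stats> topN(Iterable<Stats> all, int n, Comparator<Stats> cmp) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Min-heap: the head is the smallest entry currently kept.
        PriorityQueue<Stats> heap = new PriorityQueue<>(n, cmp);
        for (Stats s : all) {
            if (heap.size() < n) {
                heap.add(s);
            } else if (cmp.compare(s, heap.peek()) > 0) {
                heap.poll(); // evict the smallest kept entry
                heap.add(s);
            }
        }
        List<Stats> result = new ArrayList<>(heap);
        result.sort(cmp.reversed());
        return result;
    }
}
```

Passing BY_DOC_FREQ corresponds to the default ordering described above, and BY_TOTAL_TERM_FREQ to the ordering used when total tf is requested. The heap never holds more than n entries, so memory stays bounded regardless of how many terms are scanned.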
Merges indices specified on the command line into the index specified as the first command line argument.
A similarity with a lengthNorm that provides for a "plateau" of equally good lengths, and tf helper functions.
For lengthNorm, a min and a max can be specified to define the plateau of lengths that should all have a norm of 1.0. Below the min and above the max, the lengthNorm drops off following a sqrt function.
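One function with this plateau shape is sketched below. It is a standalone illustration, not the class's confirmed implementation: the class name PlateauLengthNormSketch is hypothetical, and the steepness parameter, which controls how quickly the norm falls off outside the plateau, is an assumption that the description above does not mention.

```java
// Hypothetical standalone sketch of a plateau-shaped length norm.
class PlateauLengthNormSketch {

    /**
     * Returns 1.0 for any length in [min, max]; outside that range the
     * value decays as 1 / sqrt(steepness * d + 1), where d is twice the
     * distance from the nearest edge of the plateau.
     */
    static float lengthNorm(int numTerms, int min, int max, float steepness) {
        // Zero on the plateau; twice the distance to the nearest edge otherwise.
        int outside = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * outside + 1.0));
    }

    public static void main(String[] args) {
        // Print the norm for a range of lengths around a plateau of [3, 10].
        for (int len = 1; len <= 14; len++) {
            System.out.println(len + " -> " + lengthNorm(len, 3, 10, 0.5f));
        }
    }
}
```

The sum of the two absolute distances is constant (max - min) anywhere between the bounds, which is what makes the plateau flat without any explicit branching. The drop-off is symmetric: a length two below min gets the same norm as a length two above max.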
For tf, baselineTf and hyperbolicTf functions are provided, which subclasses can choose between.
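The description does not give the formulas, so the following is only a sketch of what two such tf curves could look like. It assumes that the baseline variant is flat up to a minimum frequency and grows as a square root afterwards, and that the hyperbolic variant follows a tanh-shaped curve between a lower and an upper bound. The class name TfSketch and every parameter name (base, min, max, xOffset) are assumptions made for illustration.

```java
// Hypothetical standalone sketch of two tf shaping functions.
class TfSketch {

    /**
     * Flat at `base` for frequencies up to `min`, then sqrt growth.
     * The two pieces meet at freq == min, where both evaluate to `base`.
     */
    static float baselineTf(float freq, float base, float min) {
        if (freq == 0.0f) {
            return 0.0f;
        }
        return (freq <= min) ? base : (float) Math.sqrt(freq + (base * base) - min);
    }

    /**
     * S-shaped curve that rises from `min` towards `max`, centered on xOffset.
     * The ratio below equals tanh(x * ln(base)).
     */
    static float hyperbolicTf(float freq, float min, float max, double base, float xOffset) {
        if (freq == 0.0f) {
            return 0.0f;
        }
        double x = freq - xOffset;
        double tanh = (Math.pow(base, x) - Math.pow(base, -x))
                    / (Math.pow(base, x) + Math.pow(base, -x));
        float result = min + (float) ((max - min) / 2.0 * (tanh + 1.0));
        // Very large frequencies overflow to Infinity / Infinity = NaN; treat as the ceiling.
        return Float.isNaN(result) ? max : result;
    }

    public static void main(String[] args) {
        for (float f = 0; f <= 20; f += 5) {
            System.out.println(f + " baseline=" + baselineTf(f, 1.5f, 5f)
                + " hyperbolic=" + hyperbolicTf(f, 0f, 2f, 1.3, 10f));
        }
    }
}
```

Under these assumptions, the baseline curve avoids rewarding the first few repetitions of a term, while the hyperbolic curve caps the contribution of very frequent terms at max. A subclass would pick whichever shape suits its collection.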