Namespace Lucene.Net.Misc
Misc Tools
The misc package has various tools for splitting/merging indices, changing norms, finding high-frequency terms, and more.
NativeUnixDirectory
NOTE: This uses C++ sources (accessible via JNI), which you'll have to compile on your platform.
<xref:Lucene.Net.Store.NativeUnixDirectory> is a Directory implementation that bypasses the OS's buffer cache (using direct IO) for any IndexInput and IndexOutput used during merging of segments larger than a specified size (default 10 MB). This avoids evicting hot pages that are still in use for searching, keeping search more responsive while large merges run.
See this blog post for details.
Steps to build:
- cd lucene/misc/
- To compile NativePosixUtil.cpp -> libNativePosixUtil.so, run ant build-native-unix.
- libNativePosixUtil.so will be located in the lucene/build/native/ folder.
- Make sure libNativePosixUtil.so is on your LD_LIBRARY_PATH so java can find it (something like export LD_LIBRARY_PATH=/path/to/dir:$LD_LIBRARY_PATH, where /path/to/dir contains libNativePosixUtil.so).
- Run ant jar to compile the java source, and put that JAR on your CLASSPATH.
NativePosixUtil.cpp/java also expose the posix_madvise, madvise, and posix_fadvise functions, which are somewhat more cross-platform than O_DIRECT. However, in testing (see the blog post mentioned above), these APIs did not seem to help prevent buffer cache eviction.
Classes
GetTermInfo
Utility to get document frequency and total number of occurrences (sum of the tf for each doc) of a term.
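The two statistics can be illustrated on a toy in-memory corpus. The following is a minimal sketch, not the GetTermInfo implementation: the class TermInfoSketch and its methods are hypothetical names, and documents are modeled as plain token lists rather than a Lucene index.

```java
import java.util.Arrays;
import java.util.List;

public class TermInfoSketch {
    /** Document frequency: number of documents containing the term at least once. */
    public static int docFreq(List<List<String>> docs, String term) {
        int df = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                df++;
            }
        }
        return df;
    }

    /** Total occurrences: sum over all documents of the term's tf in that document. */
    public static long totalTermFreq(List<List<String>> docs, String term) {
        long total = 0;
        for (List<String> doc : docs) {
            for (String token : doc) {
                if (token.equals(term)) {
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical corpus: three already-tokenized documents.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("lucene", "index", "lucene"),
            Arrays.asList("merge", "index"),
            Arrays.asList("lucene"));
        System.out.println("docFreq=" + docFreq(docs, "lucene"));
        System.out.println("totalTermFreq=" + totalTermFreq(docs, "lucene"));
    }
}
```

In this corpus "lucene" appears in two documents (document frequency 2) and three times overall (total term frequency 3), which shows why the two numbers can differ.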
HighFreqTerms
The HighFreqTerms class extracts the top n most frequent terms (by document frequency) from an existing Lucene index and reports their document frequency.
If the -t flag is given, both document frequency and total tf (total number of occurrences) are reported, ordered by descending total tf.
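The difference between the two orderings can be sketched without an index. This is an illustration only: HighFreqSketch, Stat, and topN are hypothetical names, the statistics are made-up sample values, and the boolean parameter merely stands in for the effect of the -t flag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class HighFreqSketch {
    /** Hypothetical holder for this sketch: a term plus its two statistics. */
    public static final class Stat {
        public final String term;
        public final int docFreq;
        public final long totalTermFreq;

        public Stat(String term, int docFreq, long totalTermFreq) {
            this.term = term;
            this.docFreq = docFreq;
            this.totalTermFreq = totalTermFreq;
        }

        public int getDocFreq() { return docFreq; }
        public long getTotalTermFreq() { return totalTermFreq; }
    }

    /** Returns the top n entries, by descending total tf if byTotalTf is set, else by descending docFreq. */
    public static List<Stat> topN(List<Stat> stats, int n, boolean byTotalTf) {
        Comparator<Stat> order = byTotalTf
            ? Comparator.comparingLong(Stat::getTotalTermFreq).reversed()
            : Comparator.comparingInt(Stat::getDocFreq).reversed();
        List<Stat> sorted = new ArrayList<>(stats); // copy so the input is left untouched
        sorted.sort(order);
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        // Made-up statistics for three terms.
        List<Stat> stats = Arrays.asList(
            new Stat("the", 3, 10),
            new Stat("index", 2, 12),
            new Stat("merge", 1, 1));
        for (Stat s : topN(stats, 2, false)) {
            System.out.println(s.term + " docFreq=" + s.docFreq);
        }
        for (Stat s : topN(stats, 2, true)) {
            System.out.println(s.term + " totalTermFreq=" + s.totalTermFreq);
        }
    }
}
```

With these sample values, ordering by document frequency puts "the" first, while ordering by total tf puts "index" first, because "index" occurs more often overall despite appearing in fewer documents.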
HighFreqTerms.DocFreqComparer
Compares terms by DocFreq.
HighFreqTerms.TotalTermFreqComparer
Compares terms by TotalTermFreq.
IndexMergeTool
Merges indices specified on the command line into the index specified as the first command line argument.
SweetSpotSimilarity
A similarity with a lengthNorm that provides for a "plateau" of equally good lengths, and tf helper functions.
For lengthNorm, a min/max can be specified to define the plateau of lengths that should all have a norm of 1.0. Below the min and above the max, the lengthNorm drops off as a sqrt function.
For tf, baselineTf and hyperbolicTf functions are provided, which subclasses can choose between.
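The plateau behavior can be sketched as a standalone function. The formulas below are assumptions modeled on the Java Lucene implementation and are not stated on this page: lengthNorm is taken to be 1 / sqrt(steepness * (|x - min| + |x - max| - (max - min)) + 1), and baselineTf to be base when freq <= min, else sqrt(freq + base^2 - min). The class name SweetSpotSketch and the steepness parameter are illustrative.

```java
public class SweetSpotSketch {
    /**
     * Assumed length norm: exactly 1.0 for lengths in [min, max],
     * falling off with an inverse square root outside that range.
     */
    public static double lengthNorm(int numTerms, int min, int max, double steepness) {
        // Distance outside the plateau, counted twice; zero anywhere inside [min, max].
        int outside = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return 1.0 / Math.sqrt(steepness * outside + 1.0);
    }

    /**
     * Assumed baseline tf: a constant base up to min occurrences,
     * then square-root growth that is continuous at freq == min.
     */
    public static double baselineTf(double freq, double base, double min) {
        return freq <= min ? base : Math.sqrt(freq + base * base - min);
    }

    public static void main(String[] args) {
        // Plateau from 3 to 7 terms with an illustrative steepness of 0.5.
        for (int len : new int[] {1, 3, 5, 7, 9}) {
            System.out.println("len=" + len + " norm=" + lengthNorm(len, 3, 7, 0.5));
        }
        System.out.println("baselineTf(2)=" + baselineTf(2, 1.5, 5));
        System.out.println("baselineTf(9)=" + baselineTf(9, 1.5, 5));
    }
}
```

Under these assumptions, every length from 3 to 7 gets a norm of 1.0, and lengths 1 and 9 get the same reduced norm because they are equally far outside the plateau.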
TermStats
Holder for a term along with its statistics (DocFreq and TotalTermFreq).