Namespace Lucene.Net.Codecs.Bloom
Codec PostingsFormat for fast access to low-frequency terms such as primary key fields.
Classes
BloomFilterFactory
Class used to create index-time FuzzySet appropriately configured for each field. Also called to right-size bitsets for serialization.
BloomFilteringPostingsFormat
A Lucene.Net.Codecs.PostingsFormat useful for low doc-frequency fields such as primary keys. Bloom filters are maintained in a ".blm" file which offers "fast-fail" for reads in segments known to have no record of the key. A choice of delegate Lucene.Net.Codecs.PostingsFormat is used to record all other Postings data.
A choice of BloomFilterFactory can be passed to tailor Bloom Filter settings on a per-field basis. The default configuration is DefaultBloomFilterFactory which allocates a ~8mb bitset and hashes values using MurmurHash2. This should be suitable for most purposes.
The format of the blm file is as follows:
- BloomFilter (.blm) --> Header, DelegatePostingsFormatName, NumFilteredFields, FilterNumFilteredFields, Footer
- Filter --> FieldNumber, FuzzySet
- FuzzySet -->See Serialize(DataOutput)
- Header --> CodecHeader (WriteHeader(DataOutput, String, Int32))
- DelegatePostingsFormatName --> String (WriteString(String)) The name of a ServiceProvider registered Lucene.Net.Codecs.PostingsFormat
- NumFilteredFields --> Uint32 (WriteInt32(Int32))
- FieldNumber --> Uint32 (WriteInt32(Int32)) The number of the field in this segment
- Footer --> CodecFooter (WriteFooter(IndexOutput))
DefaultBloomFilterFactory
Default policy is to allocate a bitset with 10% saturation given a unique term per document. Bits are set via MurmurHash2 hashing function.
FuzzySet
A class used to represent a set of many, potentially large, values (e.g. many long strings such as URLs), using a significantly smaller amount of memory.
The set is "lossy" in that it cannot definitively state that is does contain a value but it can definitively say if a value is not in the set. It can therefore be used as a Bloom Filter.
Another application of the set is that it can be used to perform fuzzy counting because it can estimate reasonably accurately how many unique values are contained in the set.
This class is NOT threadsafe.
Internally a Bitset is used to record values and once a client has finished recording a stream of values the Downsize(Single) method can be used to create a suitably smaller set that is sized appropriately for the number of values recorded and desired saturation levels.
HashFunction
Base class for hashing functions that can be referred to by name. Subclasses are expected to provide threadsafe implementations of the hash function on the range of bytes referenced in the provided Lucene.Net.Util.BytesRef.
MurmurHash2
This is a very fast, non-cryptographic hash suitable for general hash-based lookup. See http://murmurhash.googlepages.com/ for more details.
The C version of MurmurHash 2.0 found at that site was ported to Java by Andrzej Bialecki (ab at getopt org).
The code from getopt.org was adapted by Mark Harwood in the form here as one of a pluggable choice of
hashing functions as the core function had to be adapted to work with Lucene.Net.Util.BytesRefs with offsets and lengths
rather than raw byte arrays.