Namespace Lucene.Net.Codecs.Bloom

Codec PostingsFormat for fast access to low-frequency terms such as primary key fields.

Classes

BloomFilterFactory

Class used to create an index-time FuzzySet appropriately configured for each field. It is also called to right-size bitsets for serialization.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

BloomFilteringPostingsFormat

A PostingsFormat useful for low doc-frequency fields such as primary keys. Bloom filters are maintained in a ".blm" file, which offers "fast-fail" for reads in segments known to have no record of the key. A delegate PostingsFormat of your choice records all other postings data.

A choice of BloomFilterFactory can be passed to tailor Bloom Filter settings on a per-field basis. The default configuration is DefaultBloomFilterFactory which allocates a ~8mb bitset and hashes values using MurmurHash2. This should be suitable for most purposes.
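The "fast-fail" read path can be illustrated with a minimal Python sketch. It is not the Lucene.NET API: `Segment`, `lookup`, and `find` are hypothetical names, the bitset uses a single hash per key, and `zlib.crc32` stands in for MurmurHash2 to keep the example short.

```python
import zlib


class Segment:
    """Hypothetical stand-in for an index segment with a Bloom-filtered key field."""

    def __init__(self, postings, num_bits=1024):
        self.postings = dict(postings)   # stand-in for the delegate postings data
        self.num_bits = num_bits
        self.bits = 0                    # Python int used as a bitset
        self.delegate_lookups = 0        # counts how often the delegate is consulted
        for key in self.postings:
            self.bits |= 1 << self._bit(key)

    def _bit(self, key):
        # Assumption: one hash per key; crc32 stands in for MurmurHash2.
        return zlib.crc32(key.encode("utf-8")) % self.num_bits

    def lookup(self, key):
        # Fast-fail: a clear bit proves the key is absent from this segment.
        if not (self.bits >> self._bit(key)) & 1:
            return None
        # A set bit only means "maybe", so the delegate must confirm.
        self.delegate_lookups += 1
        return self.postings.get(key)


def find(segments, key):
    """Return the first document id recorded for key across all segments."""
    for segment in segments:
        doc_id = segment.lookup(key)
        if doc_id is not None:
            return doc_id
    return None


segments = [
    Segment({"user-1": 0, "user-2": 1}),
    Segment({"user-3": 0}),
]
print(find(segments, "user-3"))
print(find(segments, "user-404"))
```

For a primary-key lookup, most segments do not hold the key, so most calls should return at the bit test without touching the delegate data.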

The format of the blm file is as follows:

BloomFilter (.blm) --> Header, DelegatePostingsFormatName, NumFilteredFields, Filter^{NumFilteredFields}, Footer
Filter --> FieldNumber, FuzzySet
FuzzySet --> See Serialize(DataOutput)
Header --> CodecHeader
DelegatePostingsFormatName --> String (the name of a ServiceProvider-registered PostingsFormat)
NumFilteredFields --> Uint32
FieldNumber --> Uint32 (the number of the field in this segment)
Footer --> CodecFooter (WriteFooter(IndexOutput))
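
The nesting of the layout above can be sketched as a small Python round trip. This is a toy encoding under stated assumptions, not the actual Lucene wire format: Header and Footer are omitted, integers are written as big-endian 32-bit values, strings and FuzzySet payloads are length-prefixed, the FuzzySet bytes are treated as opaque, and the delegate name `ExampleDelegate` is hypothetical.

```python
import io
import struct


def write_uint32(out, value):
    out.write(struct.pack(">I", value))


def read_uint32(inp):
    return struct.unpack(">I", inp.read(4))[0]


def write_bytes(out, payload):
    # Assumption: every variable-length item carries a uint32 length prefix.
    write_uint32(out, len(payload))
    out.write(payload)


def read_bytes(inp):
    return inp.read(read_uint32(inp))


def write_blm_body(out, delegate_name, filters):
    """filters maps FieldNumber -> serialized FuzzySet bytes (opaque here)."""
    write_bytes(out, delegate_name.encode("utf-8"))   # DelegatePostingsFormatName
    write_uint32(out, len(filters))                   # NumFilteredFields
    for field_number, fuzzy_set in sorted(filters.items()):
        write_uint32(out, field_number)               # FieldNumber
        write_bytes(out, fuzzy_set)                   # FuzzySet


def read_blm_body(inp):
    delegate_name = read_bytes(inp).decode("utf-8")
    num_filtered_fields = read_uint32(inp)
    filters = {}
    for _ in range(num_filtered_fields):
        field_number = read_uint32(inp)
        filters[field_number] = read_bytes(inp)
    return delegate_name, filters


buffer = io.BytesIO()
write_blm_body(buffer, "ExampleDelegate", {3: b"\x01\x02", 7: b"\xff"})
buffer.seek(0)
print(read_blm_body(buffer))
```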

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

DefaultBloomFilterFactory

Default policy is to allocate a bitset with 10% saturation given a unique term per document. Bits are set via MurmurHash2 hashing function.
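
As a worked example of the 10% target: if each value sets one bit chosen uniformly among m bits, then after n distinct values the expected fraction of set bits is about 1 - e^(-n/m). Solving for m at saturation s gives m ≈ n / -ln(1 - s), so s = 0.10 needs roughly 9.5 bits per document. The Python sketch below applies this arithmetic. The single-hash model and the rounding up to a power of two are assumptions made for the illustration, not a statement of how DefaultBloomFilterFactory computes its size.

```python
import math


def bits_for_saturation(num_values, saturation):
    """Bits needed so num_values distinct values fill about `saturation` of the set."""
    return math.ceil(num_values / -math.log(1.0 - saturation))


def round_up_to_power_of_two(n):
    # Assumption: power-of-two sizes, so a hash can be masked instead of divided.
    return 1 << max(n - 1, 0).bit_length()


def expected_saturation(num_values, num_bits):
    return 1.0 - math.exp(-num_values / num_bits)


docs = 1_000_000                      # one unique term per document
exact = bits_for_saturation(docs, 0.10)
rounded = round_up_to_power_of_two(exact)
print(exact, rounded, round(expected_saturation(docs, rounded), 4))
```

Rounding up to a power of two lowers the saturation below the target, which costs memory but reduces false positives.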

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

FuzzySet

A class used to represent a set of many, potentially large, values (e.g. many long strings such as URLs), using a significantly smaller amount of memory.

The set is "lossy" in that it cannot definitively state that it does contain a value, but it can definitively say if a value is not in the set. It can therefore be used as a Bloom filter.

Another application of the set is that it can be used to perform fuzzy counting because it can estimate reasonably accurately how many unique values are contained in the set.
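
Both uses, membership tests and fuzzy counting, can be sketched in a few lines of Python. `TinyFuzzySet` is a hypothetical single-hash set, not the Lucene.NET FuzzySet; `zlib.crc32` stands in for the pluggable hash function, and the `MAYBE`/`NO` members are assumed names modelled on FuzzySet.ContainsResult. The count estimate inverts the saturation formula saturation ≈ 1 - e^(-n/m).

```python
import enum
import math
import zlib


class ContainsResult(enum.Enum):
    MAYBE = "maybe"
    NO = "no"


class TinyFuzzySet:
    """Hypothetical single-hash lossy set; not the Lucene.NET FuzzySet."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _index(self, value):
        # Assumption: crc32 stands in for the pluggable hash function.
        return zlib.crc32(value) % self.num_bits

    def add(self, value):
        self.bits[self._index(value)] = 1

    def contains(self, value):
        # A clear bit is definitive; a set bit may be another value's collision.
        if self.bits[self._index(value)]:
            return ContainsResult.MAYBE
        return ContainsResult.NO

    def estimate_unique_values(self):
        # Invert saturation = 1 - exp(-n / m) to recover n from the set-bit count.
        saturation = sum(self.bits) / self.num_bits
        return round(-self.num_bits * math.log(1.0 - saturation))


urls = [f"https://example.com/page/{i}".encode("utf-8") for i in range(500)]
fuzzy = TinyFuzzySet(num_bits=8192)
for url in urls + urls:          # adding duplicates does not change the bitset
    fuzzy.add(url)

print(fuzzy.contains(urls[0]))
print(fuzzy.estimate_unique_values())
```

The set stores only 8192 bits regardless of how long the URLs are, which is where the memory saving comes from.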

This class is NOT threadsafe.

Internally, a Bitset is used to record values. Once a client has finished recording a stream of values, the Downsize(Single) method can be used to create a smaller set that is sized appropriately for the number of values recorded and the desired saturation level.
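
One way to shrink a bitset without re-hashing the original values is to fold each set position into the smaller range. The Python sketch below shows the idea under the assumption that both sizes are powers of two; `downsize` and `position` are hypothetical helpers, not the signature or implementation of Downsize(Single).

```python
def position(hash_value, size):
    """Bit position of a hash in a power-of-two sized bitset."""
    return hash_value & (size - 1)


def downsize(set_bits, old_size, new_size):
    """Fold the set bit positions of a large bitset into a smaller one.

    Assumption: both sizes are powers of two, so a position computed as
    hash & (old_size - 1) folds to hash & (new_size - 1) by masking again.
    """
    assert new_size <= old_size
    assert old_size & (old_size - 1) == 0 and new_size & (new_size - 1) == 0
    mask = new_size - 1
    return {bit & mask for bit in set_bits}


hashes = [0x1234ABCD, 0x0BADF00D, 0x7FFFFFFF, 0x00C0FFEE]
large = {position(h, 1 << 20) for h in hashes}
small = downsize(large, 1 << 20, 1 << 8)

# Every value recorded in the large set is still reported by the small one.
print(all(position(h, 1 << 8) in small for h in hashes))
```

Folding can only merge bits, never lose them, so the smaller set keeps the "definitely not present" guarantee at the cost of a higher false-positive rate.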

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

HashFunction

Base class for hashing functions that can be referred to by name. Subclasses are expected to provide threadsafe implementations of the hash function on the range of bytes referenced in the provided BytesRef.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

MurmurHash2

This is a very fast, non-cryptographic hash suitable for general hash-based lookup. See http://murmurhash.googlepages.com/ for more details.

The C version of MurmurHash 2.0 found at that site was ported to Java by Andrzej Bialecki (ab at getopt org).

Mark Harwood adapted the code from getopt.org into the form used here, as one of a pluggable choice of hashing functions. The core function had to be changed to work with BytesRefs with offsets and lengths rather than raw byte arrays.
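
For orientation, here is a Python sketch of the 32-bit MurmurHash2 algorithm that takes an offset and a length instead of a whole byte array, which is the adaptation described above. It follows the published MurmurHash 2.0 algorithm and is not the Lucene.NET source. The function name and the caller-supplied `seed` parameter are choices made for this example, and seed 0 in the usage lines is arbitrary.

```python
def murmurhash2_32(data, offset, length, seed):
    """32-bit MurmurHash2 over data[offset:offset + length]."""
    m = 0x5BD1E995
    r = 24
    mask = 0xFFFFFFFF          # emulate unsigned 32-bit arithmetic
    h = (seed ^ length) & mask

    end_of_blocks = offset + (length & ~3)
    for i in range(offset, end_of_blocks, 4):
        # Read four bytes as a little-endian 32-bit word.
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & mask
        k ^= k >> r
        k = (k * m) & mask
        h = (h * m) & mask
        h ^= k

    # Mix in the remaining one to three bytes, if any.
    tail = length & 3
    if tail == 3:
        h ^= data[end_of_blocks + 2] << 16
    if tail >= 2:
        h ^= data[end_of_blocks + 1] << 8
    if tail >= 1:
        h ^= data[end_of_blocks]
        h = (h * m) & mask

    # Final avalanche.
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h


# Hashing a range inside a larger buffer matches hashing a copy of that range.
buffer = b"xxprimary-key-42yy"
whole = murmurhash2_32(b"primary-key-42", 0, 14, seed=0)
ranged = murmurhash2_32(buffer, 2, 14, seed=0)
print(whole == ranged)
```

Working on a range avoids copying the term bytes out of a shared buffer before hashing them.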

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

Enums

FuzzySet.ContainsResult