
    Namespace Lucene.Net.Codecs.Bloom

    Codec PostingsFormat for fast access to low-frequency terms such as primary key fields.

    Classes

    BloomFilterFactory

    Class used to create index-time FuzzySet instances appropriately configured for each field. It is also called to right-size bitsets for serialization.

    This is a Lucene.NET EXPERIMENTAL API, use at your own risk

    BloomFilteringPostingsFormat

    A PostingsFormat useful for low doc-frequency fields such as primary keys. Bloom filters are maintained in a ".blm" file which offers "fast-fail" for reads in segments known to have no record of the key. A choice of delegate PostingsFormat is used to record all other Postings data.

    A choice of BloomFilterFactory can be passed to tailor Bloom filter settings on a per-field basis. The default configuration is DefaultBloomFilterFactory, which allocates a ~8 MB bitset and hashes values using MurmurHash2. This should be suitable for most purposes.
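    The "fast-fail" behaviour described above can be illustrated with a small, language-neutral sketch in Python. It is not the Lucene.NET API: the Segment class, the find function, the dictionary standing in for the delegate postings data, and the use of zlib.crc32 as the hash are all assumptions made for illustration.

```python
import zlib


class Segment:
    """Hypothetical segment: a small bitset in front of a slower term lookup."""

    def __init__(self, keys, num_bits=1024):
        self.num_bits = num_bits            # must be a multiple of 8 here
        self.bits = bytearray(num_bits // 8)
        self.postings = {}                  # stand-in for the delegate postings data
        self.delegate_lookups = 0           # how often the slow path was taken
        for doc_id, key in enumerate(keys):
            slot = self._slot(key)
            self.bits[slot >> 3] |= 1 << (slot & 7)
            self.postings[key] = doc_id

    def _slot(self, key):
        # Stand-in hash (assumption); any well-distributed hash would do.
        return zlib.crc32(key.encode("utf-8")) % self.num_bits

    def lookup(self, key):
        slot = self._slot(key)
        if not self.bits[slot >> 3] & (1 << (slot & 7)):
            return None                     # fast-fail: key is definitely absent
        self.delegate_lookups += 1          # bit set: key may be present, so check
        return self.postings.get(key)


def find(segments, key):
    """Return (segment, doc_id) for the first segment holding the key, else None."""
    for segment in segments:
        doc_id = segment.lookup(key)
        if doc_id is not None:
            return segment, doc_id
    return None
```

    In this sketch a segment whose bit for the key is unset never touches its postings dictionary, which is the saving a primary-key lookup across many segments benefits from.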

    The format of the blm file is as follows:

    • BloomFilter (.blm) --> Header, DelegatePostingsFormatName, NumFilteredFields, Filter^NumFilteredFields, Footer
    • Filter --> FieldNumber, FuzzySet
    • FuzzySet --> See Serialize(DataOutput)
    • Header --> CodecHeader
    • DelegatePostingsFormatName --> String. The name of a ServiceProvider registered PostingsFormat
    • NumFilteredFields --> Uint32
    • FieldNumber --> Uint32. The number of the field in this segment
    • Footer --> CodecFooter (WriteFooter(IndexOutput))

    This is a Lucene.NET EXPERIMENTAL API, use at your own risk

    DefaultBloomFilterFactory

    The default policy is to allocate a bitset that reaches 10% saturation when every document contributes one unique term. Bits are set via the MurmurHash2 hashing function.
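    As a rough worked example of what a 10% saturation target implies for bitset size, the sketch below uses the standard occupancy estimate for a filter that sets one bit per value. The function name bits_for_saturation and the rounding up to a power of two are assumptions made here, not a description of how DefaultBloomFilterFactory chooses its size.

```python
import math


def bits_for_saturation(num_unique_terms, target_saturation=0.10):
    """Smallest power-of-two bit count whose expected saturation is <= target.

    With one bit set per value and a uniform hash, the expected fraction of
    set bits after n values in m bits is about 1 - exp(-n / m).
    Solving for m gives m >= n / -ln(1 - target).
    """
    min_bits = num_unique_terms / -math.log(1.0 - target_saturation)
    size = 1
    while size < min_bits:
        size <<= 1  # power of two (assumption), so a hash can be masked to fit
    return size
```

    For a 10% target the formula asks for roughly 9.5 bits per unique term before rounding.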

    This is a Lucene.NET EXPERIMENTAL API, use at your own risk

    FuzzySet

    A class used to represent a set of many, potentially large, values (e.g. many long strings such as URLs), using a significantly smaller amount of memory.

    The set is "lossy" in that it cannot definitively state that it does contain a value, but it can definitively say if a value is not in the set. It can therefore be used as a Bloom filter.
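    The asymmetry between "maybe present" and "definitely absent" can be sketched as follows. This is a toy model, not the FuzzySet implementation: TinyFuzzySet, the member names MAYBE and NO, and the zlib.crc32 hash are assumptions chosen for illustration.

```python
import zlib
from enum import Enum


class ContainsResult(Enum):
    MAYBE = "maybe"   # bit is set: the value, or a colliding one, was added
    NO = "no"         # bit is clear: the value was definitely never added


class TinyFuzzySet:
    """Toy one-bit-per-value filter over a power-of-two bitset."""

    def __init__(self, num_bits=1 << 16):
        self.mask = num_bits - 1              # valid because num_bits is a power of two
        self.bits = bytearray(num_bits // 8)

    def _slot(self, value: bytes) -> int:
        return zlib.crc32(value) & self.mask  # stand-in hash (assumption)

    def add(self, value: bytes) -> None:
        slot = self._slot(value)
        self.bits[slot >> 3] |= 1 << (slot & 7)

    def contains(self, value: bytes) -> ContainsResult:
        slot = self._slot(value)
        if self.bits[slot >> 3] & (1 << (slot & 7)):
            return ContainsResult.MAYBE
        return ContainsResult.NO
```

    Only the bit positions are stored, never the values themselves, which is why long strings such as URLs cost the same small amount of memory as short ones.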

    Another application of the set is that it can be used to perform fuzzy counting because it can estimate reasonably accurately how many unique values are contained in the set.
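    One common way to derive such an estimate from a one-bit-per-value filter is to invert the expected-saturation formula. The sketch below shows that calculation; the function name is hypothetical, and this page does not state whether FuzzySet uses exactly this formula.

```python
import math


def estimate_unique_values(num_bits, num_set_bits):
    """Estimate how many distinct values were added, allowing for collisions.

    Expected saturation after n values in m bits is about 1 - exp(-n / m),
    so n is approximately -m * ln(1 - saturation).
    """
    if num_set_bits >= num_bits:
        raise ValueError("bitset is fully saturated; no finite estimate exists")
    saturation = num_set_bits / num_bits
    return int(-num_bits * math.log(1.0 - saturation))
```

    The estimate is always at least the number of set bits, because several values may have hashed to the same bit.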

    This class is NOT threadsafe.

    Internally, a bitset is used to record values. Once a client has finished recording a stream of values, the Downsize(Single) method can be used to create a smaller set that is sized appropriately for the number of values recorded and the desired saturation level.
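    One way such downsizing can work is by "folding" a power-of-two bitset onto a smaller power-of-two size, because masking a hash to the small size gives the same result as masking the already-masked position. The sketch below assumes that approach; fold_bitset and downsize are hypothetical names, the bitset is modelled as a Python integer, and the actual Downsize(Single) logic may differ.

```python
def fold_bitset(bits, new_num_bits):
    """Map every set bit position i to i & (new_num_bits - 1)."""
    mask = new_num_bits - 1           # new_num_bits must be a power of two
    folded = 0
    while bits:
        low = bits & -bits            # isolate the lowest set bit
        folded |= 1 << ((low.bit_length() - 1) & mask)
        bits ^= low                   # clear it and continue
    return folded


def downsize(bits, num_bits, target_saturation):
    """Return (folded_bits, size) for the smallest size meeting the target."""
    size = 1
    while size < num_bits:
        folded = fold_bitset(bits, size)
        if bin(folded).count("1") / size <= target_saturation:
            return folded, size
        size <<= 1
    return bits, num_bits             # nothing smaller is sparse enough
```

    Lookups against the folded set stay consistent as long as they mask the same hash with the new, smaller size.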

    This is a Lucene.NET EXPERIMENTAL API, use at your own risk

    HashFunction

    Base class for hashing functions that can be referred to by name. Subclasses are expected to provide threadsafe implementations of the hash function on the range of bytes referenced in the provided BytesRef.

    This is a Lucene.NET EXPERIMENTAL API, use at your own risk

    MurmurHash2

    This is a very fast, non-cryptographic hash suitable for general hash-based lookup. See http://murmurhash.googlepages.com/ for more details.

    The C version of MurmurHash 2.0 found at that site was ported to Java by Andrzej Bialecki (ab at getopt org).

    The code from getopt.org was adapted by Mark Harwood into the form here, as one of a pluggable choice of hashing functions. The core function had to be adapted to work with BytesRefs, which carry offsets and lengths, rather than raw byte arrays.
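    For orientation, here is a sketch of the widely published 32-bit MurmurHash2 routine, written to hash a range of a byte buffer given an offset and a length. It is an illustration in Python, not the Lucene.NET port: the function name and the default seed of 0 are assumptions, and the port may differ in details such as seed choice or integer signedness.

```python
def murmurhash2_32(data: bytes, offset: int, length: int, seed: int = 0) -> int:
    """32-bit MurmurHash2 over data[offset : offset + length]."""
    M = 0x5BD1E995
    R = 24
    MASK = 0xFFFFFFFF                 # keep arithmetic within 32 bits

    h = (seed ^ length) & MASK
    i = offset
    end = offset + length

    # Mix four bytes at a time, read as a little-endian integer.
    while end - i >= 4:
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * M) & MASK
        k ^= k >> R
        k = (k * M) & MASK
        h = (h * M) & MASK
        h ^= k
        i += 4

    # Mix the remaining one to three bytes, if any.
    tail = end - i
    if tail == 3:
        h ^= data[i + 2] << 16
    if tail >= 2:
        h ^= data[i + 1] << 8
    if tail >= 1:
        h ^= data[i]
        h = (h * M) & MASK

    # Final avalanche so the last bytes affect all output bits.
    h ^= h >> 13
    h = (h * M) & MASK
    h ^= h >> 15
    return h
```

    Taking the offset and length as parameters means a slice of a shared buffer can be hashed without copying it first, which is the reason given above for the adaptation.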

    This is a Lucene.NET EXPERIMENTAL API, use at your own risk

    Enums

    FuzzySet.ContainsResult

    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)