
    Class FuzzySet

    A class used to represent a set of many, potentially large, values (e.g. many long strings such as URLs), using a significantly smaller amount of memory.

    The set is "lossy" in that it cannot definitively state that it does contain a value, but it can definitively say if a value is not in the set. It can therefore be used as a Bloom filter.

    Another application of the set is that it can be used to perform fuzzy counting because it can estimate reasonably accurately how many unique values are contained in the set.

    This class is NOT threadsafe.

    Internally a Bitset is used to record values. Once a client has finished recording a stream of values, the Downsize(Single) method can be used to create a smaller set that is sized appropriately for the number of values recorded and the desired saturation level.

    This is a Lucene.NET EXPERIMENTAL API; use at your own risk.
    Inheritance
    System.Object
    FuzzySet
    Namespace: Lucene.Net.Codecs.Bloom
    Assembly: Lucene.Net.Codecs.dll
    Syntax
    public class FuzzySet : object

    Fields


    VERSION_CURRENT

    Declaration
    public static readonly int VERSION_CURRENT
    Field Value
    Type Description
    System.Int32

    VERSION_SPI

    Declaration
    public static readonly int VERSION_SPI
    Field Value
    Type Description
    System.Int32

    VERSION_START

    Declaration
    public static readonly int VERSION_START
    Field Value
    Type Description
    System.Int32

    Methods


    AddValue(BytesRef)

    Records a value in the set. The referenced bytes are hashed and the hash is then reduced modulo n, where n is the chosen size of the internal bitset.

    Declaration
    public virtual void AddValue(BytesRef value)
    Parameters
    Type Name Description
    BytesRef value The key value to be hashed.
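
    As a rough illustration of the hash-then-modulo step described above, the sketch below maps a byte sequence to a bit position. It is not the Lucene.NET implementation: HashModuloSketch, Fnv1a and BitIndex are hypothetical names, a System.Collections.BitArray stands in for the internal bitset, and FNV-1a is only a stand-in for whichever HashFunction the set is configured with.

```csharp
using System;
using System.Collections;

public static class HashModuloSketch
{
    // Stand-in hash (32-bit FNV-1a). Assumption: not the hash FuzzySet uses.
    public static int Fnv1a(byte[] bytes)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (byte b in bytes)
            {
                hash ^= b;
                hash *= 16777619;
            }
            return (int)hash;
        }
    }

    // Reduces a hash to a bit position in [0, n).
    public static int BitIndex(int hash, int n)
    {
        int index = hash % n;
        // The C# % operator keeps the sign of the dividend, so shift negatives up.
        return index < 0 ? index + n : index;
    }

    // Records a value by setting the single bit its hash maps to.
    public static void AddValue(BitArray bits, byte[] value)
    {
        bits.Set(BitIndex(Fnv1a(value), bits.Length), true);
    }
}
```

    Each recorded value sets exactly one bit, so two different values that hash to the same position are indistinguishable afterwards. That is the source of the "lossy" behaviour described in the class summary.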


    Contains(BytesRef)

    The main method required for a Bloom filter which, given a value, determines set membership. Unlike a conventional set, the fuzzy set returns NO or MAYBE rather than true or false.

    Declaration
    public virtual FuzzySet.ContainsResult Contains(BytesRef value)
    Parameters
    Type Name Description
    BytesRef value
    Returns
    Type Description
    FuzzySet.ContainsResult NO or MAYBE
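
    The NO / MAYBE distinction can be seen in a toy model. The sketch below is an assumption-laden illustration, not the library code: TinyFuzzySet and Membership are hypothetical names, strings replace BytesRef, and the hash is a deliberately weak one chosen so that a collision is easy to reproduce.

```csharp
using System;
using System.Collections;

// Hypothetical stand-in for FuzzySet.ContainsResult.
public enum Membership { No, Maybe }

public sealed class TinyFuzzySet
{
    private readonly BitArray bits;

    public TinyFuzzySet(int size)
    {
        bits = new BitArray(size);
    }

    // Toy hash (h = 31 * h + c), reduced modulo the bitset size.
    private int IndexOf(string value)
    {
        uint h = 0;
        foreach (char c in value)
        {
            h = unchecked(h * 31 + c);
        }
        return (int)(h % (uint)bits.Length);
    }

    public void Add(string value)
    {
        bits[IndexOf(value)] = true;
    }

    // A clear bit proves absence; a set bit only means the value may be present.
    public Membership Contains(string value)
    {
        return bits[IndexOf(value)] ? Membership.Maybe : Membership.No;
    }
}
```

    With a size of 8, adding "a" sets bit 1 (97 % 8). Contains("b") then returns No, while Contains("i") returns Maybe even though "i" was never added, because 105 % 8 is also 1.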

    CreateSetBasedOnMaxMemory(Int32)

    Declaration
    public static FuzzySet CreateSetBasedOnMaxMemory(int maxNumBytes)
    Parameters
    Type Name Description
    System.Int32 maxNumBytes
    Returns
    Type Description
    FuzzySet

    CreateSetBasedOnQuality(Int32, Single)

    Declaration
    public static FuzzySet CreateSetBasedOnQuality(int maxNumUniqueValues, float desiredMaxSaturation)
    Parameters
    Type Name Description
    System.Int32 maxNumUniqueValues
    System.Single desiredMaxSaturation
    Returns
    Type Description
    FuzzySet

    Deserialize(DataInput)

    Declaration
    public static FuzzySet Deserialize(DataInput input)
    Parameters
    Type Name Description
    DataInput input
    Returns
    Type Description
    FuzzySet

    Downsize(Single)

    Declaration
    public virtual FuzzySet Downsize(float targetMaxSaturation)
    Parameters
    Type Name Description
    System.Single targetMaxSaturation A number between 0 and 1 giving the fraction of bits that would ideally be set in the result. Lower values give better accuracy but require more space.

    Returns
    Type Description
    FuzzySet
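
    One way a recorded bitset can be shrunk without re-reading the original values is to fold its set bits into a smaller bitset. The sketch below shows that idea only; it is not confirmed to be how Downsize(Single) works. DownsizeSketch and Project are hypothetical names, and both lengths are assumed to be powers of two so that a bit mask behaves like a modulo.

```csharp
using System;
using System.Collections;

public static class DownsizeSketch
{
    // Folds every set bit of source into a smaller bitset.
    // Assumption: source.Length and smallerLength are powers of two.
    public static BitArray Project(BitArray source, int smallerLength)
    {
        if (smallerLength <= 0 || smallerLength > source.Length)
            throw new ArgumentOutOfRangeException(nameof(smallerLength));

        var result = new BitArray(smallerLength);
        int mask = smallerLength - 1;
        for (int i = 0; i < source.Length; i++)
        {
            if (source[i])
            {
                // i & mask equals i % smallerLength for power-of-two lengths.
                result[i & mask] = true;
            }
        }
        return result;
    }
}
```

    Under the power-of-two assumption, a position computed as hash % 16 folds to the same place as hash % 8 computed directly, so lookups against the smaller set stay consistent. The smaller set has a higher saturation, which is the trade-off the targetMaxSaturation parameter controls.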

    GetEstimatedNumberUniqueValuesAllowingForCollisions(Int32, Int32)

    Given a setSize and the number of set bits, produces an estimate of the number of unique values recorded.

    Declaration
    public static int GetEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits)
    Parameters
    Type Name Description
    System.Int32 setSize
    System.Int32 numRecordedBits
    Returns
    Type Description
    System.Int32
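
    A standard estimator for a single-hash bitset is n ≈ -m · ln(1 - X / m), where m is the set size and X is the number of set bits. The sketch below applies that formula. Whether this method uses exactly this formula is an assumption, and UniqueValueEstimateSketch and Estimate are hypothetical names.

```csharp
using System;

public static class UniqueValueEstimateSketch
{
    // Estimates how many distinct values were recorded, allowing for collisions.
    // Assumes setSize > 0 and 0 <= numRecordedBits < setSize; a full set has no finite estimate.
    public static int Estimate(int setSize, int numRecordedBits)
    {
        double saturation = (double)numRecordedBits / setSize;
        return (int)(setSize * -Math.Log(1.0 - saturation));
    }
}
```

    For example, Estimate(1000, 500) returns 693: half the bits are set, but collisions mean that noticeably more than 500 distinct values were probably recorded.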

    GetEstimatedUniqueValues()

    Declaration
    public virtual int GetEstimatedUniqueValues()
    Returns
    Type Description
    System.Int32

    GetNearestSetSize(Int32)

    Rounds the required maxNumberOfBits down to the nearest number whose binary representation consists entirely of ones (for example, 1023).
    Use this method where controlling memory use is paramount.

    Declaration
    public static int GetNearestSetSize(int maxNumberOfBits)
    Parameters
    Type Name Description
    System.Int32 maxNumberOfBits
    Returns
    Type Description
    System.Int32
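
    The rounding rule described above can be sketched as finding the largest value of the form 2^k - 1 that does not exceed the input. This is an illustration only: AllOnesSizeSketch and LargestAllOnesAtMost are hypothetical names, and the real method may additionally clamp the result to a minimum and maximum supported size.

```csharp
using System;

public static class AllOnesSizeSketch
{
    // Returns the largest number of the form 2^k - 1 that is <= maxNumberOfBits.
    public static int LargestAllOnesAtMost(int maxNumberOfBits)
    {
        if (maxNumberOfBits < 1)
            throw new ArgumentOutOfRangeException(nameof(maxNumberOfBits));

        int result = 1;
        // Stop at int.MaxValue so the shift below cannot overflow into a negative value.
        while (result < int.MaxValue && ((result << 1) | 1) <= maxNumberOfBits)
        {
            result = (result << 1) | 1;
        }
        return result;
    }
}
```

    LargestAllOnesAtMost(1000) returns 511 and LargestAllOnesAtMost(1024) returns 1023. An all-ones size can double as a bit mask, which makes projecting a hash into the set cheap.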

    GetNearestSetSize(Int32, Single)

    Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.

    Declaration
    public static int GetNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation)
    Parameters
    Type Name Description
    System.Int32 maxNumberOfValuesExpected
    System.Single desiredSaturation A number between 0 and 1 expressing the fraction of bits set once all values have been recorded.

    Returns
    Type Description
    System.Int32 The size of the set nearest to the required size.
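
    A quality-driven size can be sketched by inverting the estimator n ≈ -m · ln(1 - s) to find how many bits m are needed for n values at saturation s, then rounding up to the next all-ones size. The code below is a minimal sketch under those assumptions, not the library's algorithm. SaturationSizingSketch and SmallestAllOnesSizeFor are hypothetical names, and returning -1 when no size fits is a choice made for this example.

```csharp
using System;

public static class SaturationSizingSketch
{
    // Smallest all-ones size (2^k - 1) expected to hold the given number of
    // values without exceeding the desired saturation; -1 if none fits in an int.
    public static int SmallestAllOnesSizeFor(int maxNumberOfValuesExpected, float desiredSaturation)
    {
        if (maxNumberOfValuesExpected < 0)
            throw new ArgumentOutOfRangeException(nameof(maxNumberOfValuesExpected));
        if (!(desiredSaturation > 0f && desiredSaturation < 1f))
            throw new ArgumentOutOfRangeException(nameof(desiredSaturation));

        // Bits needed so that n values leave roughly the desired fraction of bits set.
        double minBits = maxNumberOfValuesExpected / -Math.Log(1.0 - desiredSaturation);

        long size = 1;
        while (size < minBits && size < int.MaxValue)
        {
            size = (size << 1) | 1;
        }
        return size >= minBits ? (int)size : -1;
    }
}
```

    For 1000 expected values at a desired saturation of 0.1, about 9491 bits are needed, so the sketch returns 16383. Lowering the saturation raises the bit count, which is the accuracy-for-memory trade-off this overload is meant for.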


    GetSaturation()

    Declaration
    public virtual float GetSaturation()
    Returns
    Type Description
    System.Single

    HashFunctionForVersion(Int32)

    Declaration
    public static HashFunction HashFunctionForVersion(int version)
    Parameters
    Type Name Description
    System.Int32 version
    Returns
    Type Description
    HashFunction

    RamBytesUsed()

    Declaration
    public virtual long RamBytesUsed()
    Returns
    Type Description
    System.Int64

    Serialize(DataOutput)

    Serializes the data set to file using the following format:

    • FuzzySet --> FuzzySetVersion, HashFunctionName, BloomSize, NumBitSetWords, BitSetWord^NumBitSetWords (that is, NumBitSetWords consecutive BitSetWord entries)
    • HashFunctionName --> String. The name of a ServiceProvider registered HashFunction.
    • FuzzySetVersion --> Uint32. The version number of the FuzzySet class.
    • BloomSize --> Uint32. The modulo value used to project hashes into the field's Bitset.
    • NumBitSetWords --> Uint32. The number of longs (as returned from GetBits()).
    • BitSetWord --> Long. A long from the array returned by GetBits().

    Declaration
    public virtual void Serialize(DataOutput output)
    Parameters
    Type Name Description
    DataOutput output Data output stream.

    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)