    Class FuzzySet

    A class used to represent a set of many, potentially large, values (e.g. many long strings such as URLs), using a significantly smaller amount of memory.

    The set is "lossy" in that it cannot definitively state that it does contain a value, but it can definitively say when a value is not in the set. It can therefore be used as a Bloom filter.

    Another application of the set is that it can be used to perform fuzzy counting because it can estimate reasonably accurately how many unique values are contained in the set.

    This class is NOT threadsafe.

    Internally, a bitset is used to record values. Once a client has finished recording a stream of values, the Downsize(float) method can be used to create a smaller set that is sized appropriately for the number of values recorded and the desired saturation level.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Inheritance
    object
    FuzzySet
    Inherited Members
    object.Equals(object)
    object.Equals(object, object)
    object.GetHashCode()
    object.GetType()
    object.MemberwiseClone()
    object.ReferenceEquals(object, object)
    object.ToString()
    Namespace: Lucene.Net.Codecs.Bloom
    Assembly: Lucene.Net.Codecs.dll
    Syntax
    public class FuzzySet

    Fields

    VERSION_CURRENT

    Declaration
    public static readonly int VERSION_CURRENT
    Field Value
    Type Description
    int

    VERSION_SPI

    Declaration
    public static readonly int VERSION_SPI
    Field Value
    Type Description
    int

    VERSION_START

    Declaration
    public static readonly int VERSION_START
    Field Value
    Type Description
    int

    Methods

    AddValue(BytesRef)

    Records a value in the set. The referenced bytes are hashed, and the hash is then reduced modulo n, where n is the chosen size of the internal bitset.

    Declaration
    public virtual void AddValue(BytesRef value)
    Parameters
    Type Name Description
    BytesRef value

    The key value to be hashed.

    Exceptions
    Type Condition
    IOException

    If there is a low-level I/O error.

    Contains(BytesRef)

    The main method required for a Bloom filter: given a value, it determines set membership. Unlike a conventional set, the fuzzy set returns NO or MAYBE rather than true or false.

    Declaration
    public virtual FuzzySet.ContainsResult Contains(BytesRef value)
    Parameters
    Type Name Description
    BytesRef value
    Returns
    Type Description
    FuzzySet.ContainsResult

    NO or MAYBE
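The NO / MAYBE behaviour described for AddValue and Contains can be illustrated with a minimal, self-contained sketch. Everything here is an assumption made for illustration: `ToyFuzzySet` and `ToyContainsResult` are hypothetical names, values are plain strings rather than BytesRef, and FNV-1a stands in for whatever HashFunction the library actually registers. It is not the Lucene.NET implementation.

```csharp
using System;
using System.Collections;
using System.Text;

public enum ToyContainsResult { Maybe, No }

// Hypothetical stand-in for illustration only; not the Lucene.NET FuzzySet.
public sealed class ToyFuzzySet
{
    private readonly BitArray bits;
    private readonly uint mask; // all ones in binary, e.g. 1023

    public ToyFuzzySet(int allOnesSize)
    {
        mask = (uint)allOnesSize;
        bits = new BitArray(allOnesSize + 1);
    }

    // FNV-1a, an assumed substitute for the library's hash function.
    private static uint Hash(byte[] data)
    {
        unchecked
        {
            uint h = 2166136261;
            foreach (byte b in data)
            {
                h ^= b;
                h *= 16777619;
            }
            return h;
        }
    }

    // With an all-ones mask, (hash & mask) equals hash modulo (mask + 1).
    private int Position(string value) =>
        (int)(Hash(Encoding.UTF8.GetBytes(value)) & mask);

    // Recording a value sets exactly one bit.
    public void AddValue(string value) => bits[Position(value)] = true;

    // A clear bit proves absence; a set bit may be a collision, hence Maybe.
    public ToyContainsResult Contains(string value) =>
        bits[Position(value)] ? ToyContainsResult.Maybe : ToyContainsResult.No;
}
```

A value that was added always yields Maybe. A value that was never added usually yields No, but it can yield Maybe when its hash lands on a bit that another value already set.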

    CreateSetBasedOnMaxMemory(int)

    Declaration
    public static FuzzySet CreateSetBasedOnMaxMemory(int maxNumBytes)
    Parameters
    Type Name Description
    int maxNumBytes
    Returns
    Type Description
    FuzzySet

    CreateSetBasedOnQuality(int, float)

    Declaration
    public static FuzzySet CreateSetBasedOnQuality(int maxNumUniqueValues, float desiredMaxSaturation)
    Parameters
    Type Name Description
    int maxNumUniqueValues
    float desiredMaxSaturation
    Returns
    Type Description
    FuzzySet

    Deserialize(DataInput)

    Declaration
    public static FuzzySet Deserialize(DataInput input)
    Parameters
    Type Name Description
    DataInput input
    Returns
    Type Description
    FuzzySet

    Downsize(float)

    Creates a smaller set that is sized appropriately for the number of values recorded and the target saturation level.

    Declaration
    public virtual FuzzySet Downsize(float targetMaxSaturation)
    Parameters
    Type Name Description
    float targetMaxSaturation

    A number between 0 and 1 describing the proportion of bits that would ideally be set in the result. Lower values give better accuracy but require more space.

    Returns
    Type Description
    FuzzySet
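One way a bitset can be shrunk without rehashing the original values is to fold each set bit into the smaller range with a mask. The sketch below assumes that both sizes are all-ones numbers (such as 1023 and 63), so that masking a position gives the same result as masking the original hash. `BitSetFolding.Fold` is a hypothetical helper that uses System.Collections.BitArray instead of the library's bitset, and it does not show how a target size would be chosen from targetMaxSaturation.

```csharp
using System.Collections;

public static class BitSetFolding
{
    // Projects every set bit of 'source' into a smaller bit array.
    // 'targetMask' is assumed to be all ones in binary (e.g. 7, 63, 1023).
    public static BitArray Fold(BitArray source, int targetMask)
    {
        var target = new BitArray(targetMask + 1);
        for (int i = 0; i < source.Length; i++)
        {
            if (source[i])
            {
                // Same effect as taking the position modulo (targetMask + 1).
                target[i & targetMask] = true;
            }
        }
        return target;
    }
}
```

Folding can only merge bits, never separate them, so the smaller set is at least as saturated as the original and answers MAYBE at least as often.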

    GetEstimatedNumberUniqueValuesAllowingForCollisions(int, int)

    Given a setSize and the number of set bits, produces an estimate of the number of unique values recorded.

    Declaration
    public static int GetEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits)
    Parameters
    Type Name Description
    int setSize
    int numRecordedBits
    Returns
    Type Description
    int
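A common estimator for a bitset filled through a single hash function is n ≈ -m · ln(1 - X/m), where m is the set size and X the number of set bits. The sketch below applies that formula as an illustration of how collisions can be allowed for. Whether the library uses exactly this formula is an assumption, and `UniqueValueEstimate.Estimate` is a hypothetical name.

```csharp
using System;

public static class UniqueValueEstimate
{
    // Estimates how many distinct values were recorded, given that
    // 'numRecordedBits' of 'setSize' bits are set.
    public static int Estimate(int setSize, int numRecordedBits)
    {
        // A fully saturated set carries no information about the count;
        // returning int.MaxValue here is a choice made by this sketch.
        if (numRecordedBits >= setSize)
        {
            return int.MaxValue;
        }
        double saturation = (double)numRecordedBits / setSize;
        return (int)(setSize * -Math.Log(1.0 - saturation));
    }
}
```

The estimate is always at least the number of set bits, and the gap widens as saturation rises, because more of the recorded values will have collided on an already-set bit.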

    GetEstimatedUniqueValues()

    Estimates how many unique values have been recorded in the set.

    Declaration
    public virtual int GetEstimatedUniqueValues()
    Returns
    Type Description
    int

    GetNearestSetSize(int)

    Rounds the required maxNumberOfBits down to the nearest number whose binary representation consists entirely of ones (for example, 63 or 1023).
    Use this method where controlling memory use is paramount.

    Declaration
    public static int GetNearestSetSize(int maxNumberOfBits)
    Parameters
    Type Name Description
    int maxNumberOfBits
    Returns
    Type Description
    int
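The rounding rule can be sketched as follows. This is a minimal illustration, not the library's code: `AllOnesRounding.RoundDown` is a hypothetical helper, and the real method may restrict its results to a fixed table of sizes with its own minimum and maximum.

```csharp
using System;

public static class AllOnesRounding
{
    // Returns the largest number of the form 2^k - 1 that is <= maxNumberOfBits.
    public static int RoundDown(int maxNumberOfBits)
    {
        if (maxNumberOfBits < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxNumberOfBits));
        }
        // Use long so that the candidate 2^32 - 1 does not overflow int.
        long result = 1;
        while (((result << 1) | 1) <= maxNumberOfBits)
        {
            result = (result << 1) | 1;
        }
        return (int)result;
    }
}
```

Sizes of this form double as bit masks, so a hash can be projected into the set with a single AND instead of a division.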

    GetNearestSetSize(int, float)

    Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.

    Declaration
    public static int GetNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation)
    Parameters
    Type Name Description
    int maxNumberOfValuesExpected
    float desiredSaturation

    A number between 0 and 1 expressing the % of bits set once all values have been recorded.

    Returns
    Type Description
    int

    The size of the set nearest to the required size.
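One plausible way to pick a size from an expected value count and a desired saturation is to walk the all-ones sizes from smallest to largest and stop at the first one that could hold more than the expected number of values before reaching that saturation. The sketch below does this under stated assumptions: the starting size of 63, the -1 result when nothing fits, the single-hash estimator, and the names `SetSizeBySaturation.Choose` and `EstimateUniqueValues` are all illustrative rather than taken from the library.

```csharp
using System;

public static class SetSizeBySaturation
{
    // Assumed single-hash estimator: n ≈ -m * ln(1 - X/m).
    private static int EstimateUniqueValues(int setSize, int numRecordedBits)
    {
        double saturation = (double)numRecordedBits / setSize;
        return (int)(setSize * -Math.Log(1.0 - saturation));
    }

    // Returns the smallest all-ones size whose estimated capacity at the
    // desired saturation exceeds the expected value count, or -1 if none fits.
    public static int Choose(int maxNumberOfValuesExpected, float desiredSaturation)
    {
        if (desiredSaturation <= 0f || desiredSaturation >= 1f)
        {
            throw new ArgumentOutOfRangeException(nameof(desiredSaturation));
        }
        // Candidate sizes are 63, 127, 255, ... up to int.MaxValue.
        for (long size = 63; size <= int.MaxValue; size = (size << 1) | 1)
        {
            int bitsAtSaturation = (int)(size * (double)desiredSaturation);
            int capacity = EstimateUniqueValues((int)size, bitsAtSaturation);
            if (capacity > maxNumberOfValuesExpected)
            {
                return (int)size;
            }
        }
        return -1;
    }
}
```

Under these assumptions, 1,000 expected values at a desired saturation of 0.1 selects a size of 16383 bits, because the next smaller candidate (8191) reaches that saturation after roughly 860 values.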

    GetSaturation()

    Declaration
    public virtual float GetSaturation()
    Returns
    Type Description
    float

    HashFunctionForVersion(int)

    Declaration
    public static HashFunction HashFunctionForVersion(int version)
    Parameters
    Type Name Description
    int version
    Returns
    Type Description
    HashFunction

    RamBytesUsed()

    Declaration
    public virtual long RamBytesUsed()
    Returns
    Type Description
    long

    Serialize(DataOutput)

    Serializes the data set to file using the following format:

    • FuzzySet --> FuzzySetVersion, HashFunctionName, BloomSize, NumBitSetWords, BitSetWord^NumBitSetWords
    • HashFunctionName --> String (WriteString(string)) The name of a ServiceProvider registered HashFunction
    • FuzzySetVersion --> Uint32 (WriteInt32(int)) The version number of the FuzzySet class
    • BloomSize --> Uint32 (WriteInt32(int)) The modulo value used to project hashes into the field's Bitset
    • NumBitSetWords --> Uint32 (WriteInt32(int)) The number of longs (as returned from Lucene.Net.Util.FixedBitSet.GetBits())
    • BitSetWord --> Long (WriteInt64(long)) A long from the array returned by Lucene.Net.Util.FixedBitSet.GetBits()
    Declaration
    public virtual void Serialize(DataOutput output)
    Parameters
    Type Name Description
    DataOutput output

    Data output stream.

    Exceptions
    Type Condition
    IOException

    If there is a low-level I/O error.

    Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.