Class FuzzySet
A class used to represent a set of many, potentially large, values (e.g. many
long strings such as URLs), using a significantly smaller amount of memory.
The set is "lossy" in that it cannot definitively state that it does contain
a value, but it can definitively say that a value is not in
the set. It can therefore be used as a Bloom filter.
The set can also be used for fuzzy counting, because
it can estimate reasonably accurately how many unique values it contains.
This class is NOT threadsafe.
Internally, a bitset is used to record values. Once a client has finished recording
a stream of values, the Downsize(Single) method can be used to create a smaller set that
is sized appropriately for the number of values recorded and the desired saturation level.
This is a Lucene.NET EXPERIMENTAL API, use at your own risk
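The page does not say how downsizing works internally. As a rough illustration of why a recorded bitset can be shrunk after the fact, the sketch below folds a large bit array into a smaller one. It assumes that a value's bit index is its hash modulo the set size and that the smaller size divides the larger one, so every value's folded index equals the index it would have had in the smaller set. `BitSetFolding`, `Fold` and `Saturation` are hypothetical names, not part of Lucene.NET.

```csharp
using System;

// Hypothetical helper for illustration only; not the Lucene.NET implementation.
public static class BitSetFolding
{
    // Folds a bit array into a smaller one.
    // Assumes index = hash % size, and that smallSize divides bits.Length,
    // so (hash % bits.Length) % smallSize == hash % smallSize.
    public static bool[] Fold(bool[] bits, int smallSize)
    {
        if (bits == null) throw new ArgumentNullException(nameof(bits));
        if (smallSize <= 0 || bits.Length % smallSize != 0)
            throw new ArgumentException("smallSize must be positive and divide bits.Length");

        var folded = new bool[smallSize];
        for (int i = 0; i < bits.Length; i++)
        {
            if (bits[i]) folded[i % smallSize] = true;
        }
        return folded;
    }

    // Fraction of bits that are set, between 0 and 1.
    public static double Saturation(bool[] bits)
    {
        int count = 0;
        foreach (bool b in bits)
        {
            if (b) count++;
        }
        return (double)count / bits.Length;
    }
}
```

Folding can only keep or raise saturation, because distinct bits may collide in the smaller array. Under these assumptions, a lower target saturation therefore means stopping at a larger folded size.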
Inheritance
System.Object
FuzzySet
Inherited Members
System.Object.Equals(System.Object)
System.Object.Equals(System.Object, System.Object)
System.Object.GetHashCode()
System.Object.GetType()
System.Object.MemberwiseClone()
System.Object.ReferenceEquals(System.Object, System.Object)
System.Object.ToString()
Assembly: Lucene.Net.Codecs.dll
Syntax
Fields
VERSION_CURRENT
Declaration
public static readonly int VERSION_CURRENT
Field Value
Type | Description
System.Int32 |
VERSION_SPI
Declaration
public static readonly int VERSION_SPI
Field Value
Type | Description
System.Int32 |
VERSION_START
Declaration
public static readonly int VERSION_START
Field Value
Type | Description
System.Int32 |
Methods
AddValue(BytesRef)
Records a value in the set. The referenced bytes are hashed and the hash is reduced modulo n, where n is the
chosen size of the internal bitset.
Declaration
public virtual void AddValue(BytesRef value)
Parameters
Type | Name | Description
Lucene.Net.Util.BytesRef | value | The key value to be hashed.
Exceptions
Type | Condition
System.IO.IOException | If there is a low-level I/O error.
Contains(BytesRef)
The main method required for a Bloom filter which, given a value, determines set membership.
Unlike a conventional set, the fuzzy set returns NO or MAYBE rather than true or false.
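The NO/MAYBE behaviour can be illustrated with a minimal, self-contained sketch of a single-hash bit set. This is not the Lucene.NET implementation: `ToyFuzzySet` and `ToyContainsResult` are hypothetical names, plain byte arrays stand in for BytesRef, and 32-bit FNV-1a is used only as a stand-in for whichever hash function the library selects.

```csharp
using System;

// Hypothetical result type mirroring the NO / MAYBE answers described above.
public enum ToyContainsResult { No, Maybe }

// Hypothetical single-hash bit set for illustration only.
public sealed class ToyFuzzySet
{
    private readonly bool[] bits;

    public ToyFuzzySet(int size)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        bits = new bool[size];
    }

    // 32-bit FNV-1a, used here purely as a stand-in hash function.
    private static uint Hash(byte[] data)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (byte b in data)
            {
                hash ^= b;
                hash *= 16777619;
            }
            return hash;
        }
    }

    // Hash the bytes, then reduce modulo the size of the bit array.
    private int IndexOf(byte[] value) => (int)(Hash(value) % (uint)bits.Length);

    // Recording a value sets exactly one bit.
    public void AddValue(byte[] value) => bits[IndexOf(value)] = true;

    // A clear bit proves absence; a set bit may be caused by a different value.
    public ToyContainsResult Contains(byte[] value) =>
        bits[IndexOf(value)] ? ToyContainsResult.Maybe : ToyContainsResult.No;
}
```

An empty `ToyFuzzySet` answers `No` for every value, and any value that has been added answers `Maybe`. Two different values can share a bit, which is why `Maybe` is the strongest positive answer this structure can give.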
Declaration
public virtual FuzzySet.ContainsResult Contains(BytesRef value)
Parameters
Type | Name | Description
Lucene.Net.Util.BytesRef | value |
Returns
Type | Description
FuzzySet.ContainsResult | NO if the value is definitely not in the set, MAYBE otherwise.
CreateSetBasedOnMaxMemory(Int32)
Declaration
public static FuzzySet CreateSetBasedOnMaxMemory(int maxNumBytes)
Parameters
Type | Name | Description
System.Int32 | maxNumBytes |
Returns
Type | Description
FuzzySet |
CreateSetBasedOnQuality(Int32, Single)
Declaration
public static FuzzySet CreateSetBasedOnQuality(int maxNumUniqueValues, float desiredMaxSaturation)
Parameters
Type | Name | Description
System.Int32 | maxNumUniqueValues |
System.Single | desiredMaxSaturation |
Returns
Type | Description
FuzzySet |
Deserialize(DataInput)
Declaration
public static FuzzySet Deserialize(DataInput input)
Parameters
Type | Name | Description
Lucene.Net.Store.DataInput | input |
Returns
Type | Description
FuzzySet |
Downsize(Single)
Declaration
public virtual FuzzySet Downsize(float targetMaxSaturation)
Parameters
Type | Name | Description
System.Single | targetMaxSaturation | A number between 0 and 1 describing the fraction of bits that would ideally be set in the result. Lower values give better accuracy but require more space.
Returns
Type | Description
FuzzySet |
GetEstimatedNumberUniqueValuesAllowingForCollisions(Int32, Int32)
Given a setSize and the number of set bits, produces an estimate of the number of unique values recorded.
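One common way to make such an estimate for a single-hash bitset is the occupancy formula n ≈ -m * ln(1 - x/m), where m is the set size and x is the number of set bits. The sketch below applies that formula. It is an illustration under that assumption, not a statement of how this method is implemented, and `UniqueValueEstimate.Estimate` is a hypothetical name.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class UniqueValueEstimate
{
    // Estimates how many distinct values were recorded, allowing for the
    // fact that several values may have landed on the same bit.
    public static int Estimate(int setSize, int numRecordedBits)
    {
        if (setSize <= 0) throw new ArgumentOutOfRangeException(nameof(setSize));
        if (numRecordedBits < 0 || numRecordedBits > setSize)
            throw new ArgumentOutOfRangeException(nameof(numRecordedBits));

        // A completely full set carries no usable count information.
        if (numRecordedBits == setSize) return int.MaxValue;

        double saturation = (double)numRecordedBits / setSize;
        return (int)(-setSize * Math.Log(1.0 - saturation));
    }
}
```

For example, `Estimate(1000, 100)` yields 105: at 10% saturation about five values are assumed to have collided with earlier ones. The correction grows quickly as saturation approaches 1.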
Declaration
public static int GetEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits)
Parameters
Type | Name | Description
System.Int32 | setSize |
System.Int32 | numRecordedBits |
Returns
Type | Description
System.Int32 |
GetEstimatedUniqueValues()
Declaration
public virtual int GetEstimatedUniqueValues()
Returns
Type | Description
System.Int32 |
GetNearestSetSize(Int32)
Rounds down the required maxNumberOfBits to the nearest number that is made up of all ones as a binary number.
Use this method where controlling memory use is paramount.
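"Made up of all ones as a binary number" means a value of the form 2^k - 1, such as 511 (binary 111111111) or 1023. The following sketch shows one way to round down to such a value. `AllOnesRounding.RoundDownToAllOnes` is a hypothetical name, and the real method may additionally enforce minimum or maximum sizes that are not described on this page.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class AllOnesRounding
{
    // Returns the largest value of the form 2^k - 1 that is <= maxNumberOfBits.
    public static int RoundDownToAllOnes(int maxNumberOfBits)
    {
        if (maxNumberOfBits < 1) throw new ArgumentOutOfRangeException(nameof(maxNumberOfBits));

        int result = 1;
        // Append another 1-bit for as long as the next all-ones value still fits.
        while (result < int.MaxValue && ((result << 1) | 1) <= maxNumberOfBits)
        {
            result = (result << 1) | 1;
        }
        return result;
    }
}
```

With this sketch, 1000 rounds down to 511, while both 1023 and 1024 round down to 1023. Rounding down rather than up is what keeps the result within the stated memory budget.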
Declaration
public static int GetNearestSetSize(int maxNumberOfBits)
Parameters
Type | Name | Description
System.Int32 | maxNumberOfBits |
Returns
Type | Description
System.Int32 |
GetNearestSetSize(Int32, Single)
Use this method to choose a set size where accuracy (low content saturation) is more important
than deciding how much memory to throw at the problem.
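To make the trade-off concrete, the sketch below picks the smallest all-ones size (2^k - 1) whose estimated capacity at the desired saturation covers the expected number of values. It assumes the single-hash occupancy estimate n ≈ -m * ln(1 - s), with m the set size and s the saturation. `SetSizeForSaturation.Choose` is a hypothetical name, and the library's own candidate sizes and rounding may differ.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class SetSizeForSaturation
{
    // Returns the smallest size of the form 2^k - 1 that is expected to hold
    // maxNumberOfValuesExpected values at or below desiredSaturation,
    // or -1 if no int-sized candidate is large enough.
    public static int Choose(int maxNumberOfValuesExpected, double desiredSaturation)
    {
        if (maxNumberOfValuesExpected < 0)
            throw new ArgumentOutOfRangeException(nameof(maxNumberOfValuesExpected));
        if (desiredSaturation <= 0.0 || desiredSaturation >= 1.0)
            throw new ArgumentOutOfRangeException(nameof(desiredSaturation));

        // Estimated number of distinct values each bit of capacity supports.
        double valuesPerBit = -Math.Log(1.0 - desiredSaturation);

        for (int k = 1; k <= 31; k++)
        {
            int size = (int)((1L << k) - 1);
            if (size * valuesPerBit >= maxNumberOfValuesExpected) return size;
        }
        return -1;
    }
}
```

Under these assumptions, 1000 expected values at a saturation of 0.1 need a set of 16383 bits, while accepting a saturation of 0.5 brings that down to 2047 bits at the cost of more false MAYBE answers.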
Declaration
public static int GetNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation)
Parameters
Type | Name | Description
System.Int32 | maxNumberOfValuesExpected |
System.Single | desiredSaturation | A number between 0 and 1 expressing the fraction of bits set once all values have been recorded.
Returns
Type | Description
System.Int32 | The size of the set nearest to the required size.
GetSaturation()
Declaration
public virtual float GetSaturation()
Returns
Type | Description
System.Single |
HashFunctionForVersion(Int32)
Declaration
public static HashFunction HashFunctionForVersion(int version)
Parameters
Type | Name | Description
System.Int32 | version |
Returns
Type | Description
HashFunction |
RamBytesUsed()
Declaration
public virtual long RamBytesUsed()
Returns
Type | Description
System.Int64 |
Serialize(DataOutput)
Serializes the data set to file.
Declaration
public virtual void Serialize(DataOutput output)
Parameters
Type | Name | Description
Lucene.Net.Store.DataOutput | output | Data output stream.
Exceptions
Type | Condition
System.IO.IOException | If there is a low-level I/O error.