Class FuzzySet
A class used to represent a set of many, potentially large, values (e.g. many long strings such as URLs), using a significantly smaller amount of memory.
The set is "lossy": it cannot state definitively that it contains a value, but it can state definitively that a value is not in the set. It can therefore be used as a Bloom filter. The set can also be used for fuzzy counting, because it can estimate reasonably accurately how many unique values it contains.
This class is NOT thread-safe.
Internally a bitset is used to record values. Once a client has finished recording a stream of values, the Downsize(float) method can be used to create a smaller set sized appropriately for the number of values recorded and the desired saturation level.
Note
This API is experimental and might change in incompatible ways in the next release.
Namespace: Lucene.Net.Codecs.Bloom
Assembly: Lucene.Net.Codecs.dll
Syntax
public class FuzzySet
  Fields
VERSION_CURRENT
Declaration
public static readonly int VERSION_CURRENT
  Field Value
| Type | Description | 
|---|---|
| int | |
VERSION_SPI
Declaration
public static readonly int VERSION_SPI
  Field Value
| Type | Description | 
|---|---|
| int | |
VERSION_START
Declaration
public static readonly int VERSION_START
  Field Value
| Type | Description | 
|---|---|
| int | |
Methods
AddValue(BytesRef)
Records a value in the set. The referenced bytes are hashed and the hash is then reduced modulo n, where n is the chosen size of the internal bitset.
Declaration
public virtual void AddValue(BytesRef value)
  Parameters
| Type | Name | Description | 
|---|---|---|
| BytesRef | value | The Key value to be hashed.  | 
      
Exceptions
| Type | Condition | 
|---|---|
| IOException | If there is a low-level I/O error.  | 
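The hash-then-modulo step can be sketched in a few lines of Python. This is an illustration only, not the Lucene.NET implementation: `bit_position` is a hypothetical helper, and CRC32 stands in for whatever hash function the set is actually configured with.

```python
import zlib


def bit_position(value: bytes, set_size: int) -> int:
    """Map a byte string to a slot index in the range [0, set_size)."""
    # Assumption: zlib.crc32 is a stand-in hash chosen for this sketch only.
    return zlib.crc32(value) % set_size


# Record two values in a 16-slot bitset (a list of booleans for readability).
bits = [False] * 16
for url in (b"http://example.com/a", b"http://example.com/b"):
    bits[bit_position(url, len(bits))] = True
```

Two different values can land on the same slot, which is why membership answers from such a set are never a definite "yes".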
      
Contains(BytesRef)
The main method required for a Bloom filter: given a value, it determines set membership.
Unlike a conventional set, the fuzzy set returns NO or MAYBE rather than true or false.
Declaration
public virtual FuzzySet.ContainsResult Contains(BytesRef value)
  Parameters
| Type | Name | Description | 
|---|---|---|
| BytesRef | value | |
Returns
| Type | Description | 
|---|---|
| FuzzySet.ContainsResult | |
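The NO / MAYBE contract can be illustrated with a single-hash bitset. The following is a minimal Python sketch, not Lucene.NET code: `TinyFuzzySet`, its method names, and the CRC32 hash are assumptions made for illustration.

```python
import zlib
from enum import Enum


class ContainsResult(Enum):
    MAYBE = "MAYBE"
    NO = "NO"


class TinyFuzzySet:
    """Hypothetical single-hash bitset used only to show the NO / MAYBE contract."""

    def __init__(self, set_size: int) -> None:
        self.set_size = set_size
        self.bits = bytearray(set_size)  # one byte per slot, for readability

    def _slot(self, value: bytes) -> int:
        # Assumption: CRC32 stands in for the real hash function.
        return zlib.crc32(value) % self.set_size

    def add_value(self, value: bytes) -> None:
        self.bits[self._slot(value)] = 1

    def contains(self, value: bytes) -> ContainsResult:
        # A set slot only proves that *some* value hashed here, hence MAYBE.
        if self.bits[self._slot(value)]:
            return ContainsResult.MAYBE
        return ContainsResult.NO


seen = TinyFuzzySet(1024)
seen.add_value(b"http://example.com/a")
recorded = seen.contains(b"http://example.com/a")   # always MAYBE after add_value
never_added = TinyFuzzySet(1024).contains(b"anything")  # an empty set always says NO
```

A NO answer lets the caller skip a more expensive lookup; a MAYBE answer still has to be confirmed against the real data.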
CreateSetBasedOnMaxMemory(int)
Declaration
public static FuzzySet CreateSetBasedOnMaxMemory(int maxNumBytes)
  Parameters
| Type | Name | Description | 
|---|---|---|
| int | maxNumBytes | |
Returns
| Type | Description | 
|---|---|
| FuzzySet | |
CreateSetBasedOnQuality(int, float)
Declaration
public static FuzzySet CreateSetBasedOnQuality(int maxNumUniqueValues, float desiredMaxSaturation)
  Parameters
| Type | Name | Description | 
|---|---|---|
| int | maxNumUniqueValues | |
| float | desiredMaxSaturation | |
Returns
| Type | Description | 
|---|---|
| FuzzySet | |
Deserialize(DataInput)
Declaration
public static FuzzySet Deserialize(DataInput input)
  Parameters
| Type | Name | Description | 
|---|---|---|
| DataInput | input | |
Returns
| Type | Description | 
|---|---|
| FuzzySet | |
Downsize(float)
Declaration
public virtual FuzzySet Downsize(float targetMaxSaturation)
  Parameters
| Type | Name | Description | 
|---|---|---|
| float | targetMaxSaturation | A number between 0 and 1 describing the fraction of bits that would ideally be set in the result. Lower values give better accuracy but require more space.  | 
      
Returns
| Type | Description | 
|---|---|
| FuzzySet | |
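One way to shrink a bitset without re-reading the original values is to fold its set positions onto a smaller all-ones mask. The Python sketch below illustrates that idea under stated assumptions: `choose_mask` and `fold_bits` are hypothetical helpers, and whether Downsize(float) works exactly this way is not confirmed by this page.

```python
from typing import Optional, Set


def choose_mask(num_bits_set: int, target_max_saturation: float,
                current_mask: int) -> Optional[int]:
    """Return the smallest all-ones mask (3, 7, 15, ...) not exceeding
    current_mask whose saturation would stay within the target, or None."""
    mask = 3
    while mask <= current_mask:
        if num_bits_set / mask <= target_max_saturation:
            return mask
        mask = (mask << 1) | 1
    return None


def fold_bits(set_positions: Set[int], small_mask: int) -> Set[int]:
    """Project set-bit positions of a larger bitset onto a smaller one."""
    # Masking with an all-ones value keeps only the low-order bits.
    return {pos & small_mask for pos in set_positions}


# Five recorded positions in a bitset addressed with mask 1023.
positions = {3, 130, 257, 600, 901}
mask = choose_mask(len(positions), 0.1, 1023)
smaller = fold_bits(positions, mask) if mask is not None else None
```

Folding keeps lookups consistent: for all-ones masks with `small <= big`, `(h & big) & small == h & small`, so a hash tested against the folded set with the small mask finds the same bit it would have set directly.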
GetEstimatedNumberUniqueValuesAllowingForCollisions(int, int)
Given a setSize and the number of set bits, produces an estimate of the number of unique values recorded.
Declaration
public static int GetEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits)
  Parameters
| Type | Name | Description | 
|---|---|---|
| int | setSize | |
| int | numRecordedBits | |
Returns
| Type | Description | 
|---|---|
| int | |
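A standard estimate for a single-hash bitset follows from the expected occupancy: after n distinct values are hashed into m slots, the expected fraction of set bits is about 1 - e^(-n/m), so n is roughly -m * ln(1 - saturation). The Python sketch below applies that formula; `estimate_unique_values` is a hypothetical name, and whether this method uses exactly this formula is an assumption.

```python
import math


def estimate_unique_values(set_size: int, num_recorded_bits: int) -> int:
    """Estimate distinct values from bitset occupancy, allowing for collisions.

    Requires 0 <= num_recorded_bits < set_size; a fully saturated set
    carries no usable information for this estimator.
    """
    if not 0 <= num_recorded_bits < set_size:
        raise ValueError("num_recorded_bits must be in [0, set_size)")
    saturation = num_recorded_bits / set_size
    # Assumption: standard occupancy estimate n = -m * ln(1 - s).
    return int(set_size * -math.log(1.0 - saturation))


half_full = estimate_unique_values(1024, 512)
```

The estimate is never below the number of set bits, because each set bit needs at least one value, and it grows quickly as saturation approaches 1.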
GetEstimatedUniqueValues()
Declaration
public virtual int GetEstimatedUniqueValues()
  Returns
| Type | Description | 
|---|---|
| int | |
GetNearestSetSize(int)
Rounds maxNumberOfBits down to the nearest number whose binary representation consists entirely of ones.
Use this method where controlling memory use is paramount.
Declaration
public static int GetNearestSetSize(int maxNumberOfBits)
  Parameters
| Type | Name | Description | 
|---|---|---|
| int | maxNumberOfBits | |
Returns
| Type | Description | 
|---|---|
| int | |
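Rounding down to an all-ones number (1, 3, 7, 15, ...) can be sketched as follows in Python. `nearest_all_ones_at_most` is a hypothetical name, and the smallest size the real method will return is not stated on this page, so the lower bound of 1 used here is an assumption.

```python
def nearest_all_ones_at_most(max_number_of_bits: int) -> int:
    """Largest value of the form 2**k - 1 that does not exceed the input."""
    if max_number_of_bits < 1:
        raise ValueError("max_number_of_bits must be at least 1")
    result = 1
    # Append one more low-order 1 bit while the next candidate still fits.
    while ((result << 1) | 1) <= max_number_of_bits:
        result = (result << 1) | 1
    return result


size = nearest_all_ones_at_most(1000)  # 511, binary 111111111
```

An all-ones size is convenient because it doubles as a bit mask, so reducing a hash to a slot index becomes a single AND.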
GetNearestSetSize(int, float)
Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.
Declaration
public static int GetNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation)
  Parameters
| Type | Name | Description | 
|---|---|---|
| int | maxNumberOfValuesExpected | |
| float | desiredSaturation | A number between 0 and 1 expressing the fraction of bits set once all values have been recorded.  | 
      
Returns
| Type | Description | 
|---|---|
| int | The size of the set nearest to the required size.  | 
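Choosing a size for accuracy can be sketched by walking the all-ones sizes from small to large and stopping at the first one whose estimated capacity at the desired saturation exceeds the expected number of values. This Python sketch is an assumption about the approach, not the library's code: `size_for_quality`, the starting size of 3, the cap of 30 candidates, and the -1 failure value are all illustrative choices.

```python
import math


def size_for_quality(max_values_expected: int, desired_saturation: float) -> int:
    """Smallest all-ones size that holds the expected values at the target saturation."""
    if not 0.0 < desired_saturation < 1.0:
        raise ValueError("desired_saturation must be strictly between 0 and 1")
    size = 3
    for _ in range(30):
        bits_at_saturation = int(size * desired_saturation)
        # Assumption: capacity uses the occupancy estimate n = -m * ln(1 - s).
        capacity = int(size * -math.log(1.0 - bits_at_saturation / size))
        if capacity > max_values_expected:
            return size
        size = (size << 1) | 1
    return -1  # no candidate was large enough


chosen = size_for_quality(1000, 0.1)
```

For 1000 expected values at 10% saturation this picks 16383: a size of 8191 tops out at roughly 860 values at that saturation, so the next all-ones size is needed.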
      
GetSaturation()
Declaration
public virtual float GetSaturation()
  Returns
| Type | Description | 
|---|---|
| float | |
HashFunctionForVersion(int)
Declaration
public static HashFunction HashFunctionForVersion(int version)
  Parameters
| Type | Name | Description | 
|---|---|---|
| int | version | |
Returns
| Type | Description | 
|---|---|
| HashFunction | |
RamBytesUsed()
Declaration
public virtual long RamBytesUsed()
  Returns
| Type | Description | 
|---|---|
| long | |
Serialize(DataOutput)
Serializes the data set to file using the following format:
- FuzzySet --> FuzzySetVersion, HashFunctionName, BloomSize, NumBitSetWords, BitSetWord^NumBitSetWords
 - HashFunctionName --> String (WriteString(string)) The name of a ServiceProvider registered HashFunction
 - FuzzySetVersion --> Uint32 (WriteInt32(int)) The version number of the FuzzySet class
 - BloomSize --> Uint32 (WriteInt32(int)) The modulo value used to project hashes into the field's Bitset
 - NumBitSetWords --> Uint32 (WriteInt32(int)) The number of longs (as returned from Lucene.Net.Util.FixedBitSet.GetBits())
 - BitSetWord --> Long (WriteInt64(long)) A long from the array returned by Lucene.Net.Util.FixedBitSet.GetBits()
 
Declaration
public virtual void Serialize(DataOutput output)
  Parameters
| Type | Name | Description | 
|---|---|---|
| DataOutput | output | Data output stream.  | 
      
Exceptions
| Type | Condition | 
|---|---|
| IOException | If there is a low-level I/O error.  |
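The field order listed above can be sketched with Python's `struct` module. This only illustrates the sequence of fields and is not byte-compatible with files written by Lucene.NET: big-endian integers and a 4-byte length prefix for the string are assumptions, and `pack_fuzzy_set` is a hypothetical helper.

```python
import struct
from typing import Sequence


def pack_fuzzy_set(version: int, hash_name: str, bloom_size: int,
                   words: Sequence[int]) -> bytes:
    """Pack the fields in the documented order into a byte string."""
    name = hash_name.encode("utf-8")
    out = struct.pack(">i", version)                    # FuzzySetVersion
    # Assumption: length-prefixed UTF-8; the real WriteString encoding may differ.
    out += struct.pack(">i", len(name)) + name          # HashFunctionName
    out += struct.pack(">ii", bloom_size, len(words))   # BloomSize, NumBitSetWords
    out += struct.pack(f">{len(words)}q", *words)       # BitSetWord, repeated
    return out


# Placeholder values chosen for the sketch, not taken from a real index.
blob = pack_fuzzy_set(2, "ExampleHash", 7, [0b10010001])
```

Writing NumBitSetWords before the words themselves lets a reader allocate the array up front and read exactly that many 64-bit values.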