Class WAH8DocIdSet

DocIdSet implementation based on word-aligned hybrid (WAH) encoding of 8-bit words.

This implementation doesn't support random access, but it has a fast DocIdSetIterator that can advance in logarithmic time thanks to an index.

The compression scheme is simplistic and should work well with sparse and very dense doc ID sets, while being only slightly larger than a FixedBitSet for incompressible sets (overhead < 2% in the worst case), in spite of the index.

Format: The format is byte-aligned. An 8-bit word is either clean, meaning composed only of zeros or only of ones, or dirty, meaning that it contains between 1 and 7 set bits. The idea is to encode sequences of clean words using run-length encoding and to leave sequences of dirty words as-is.

Token	Clean length+	Dirty length+	Dirty words
1 byte	0-n bytes	0-n bytes	0-n bytes

Token: Encodes, in its first bit, whether clean means full of zeros or full of ones; in the next 3 bits, the number of clean words minus 2; and in the last 4 bits, the number of dirty words. In each of these two length fields, the higher-order bit is a continuation bit, meaning that the number is incomplete and additional bytes must be read.
Clean length+: If clean length has its higher-order bit set, you need to read a vint (ReadVInt32()), shift it left by 3 bits, and add it to the 3 bits that were read from the token.
Dirty length+: Works the same way as Clean length+, but on 4 bits and for the number of dirty words.
Dirty words: The dirty words themselves; there are Dirty length of them.
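
The token layout described above can be sketched as a small field splitter. This is only an illustration of the description on this page, not the library's decoder: the class and member names are hypothetical, and it assumes that "first bit" means the most significant bit of the token and that a set flag means the clean words are full of ones. It deliberately stops short of reading the optional vint extensions.

```csharp
using System;

// Hypothetical helper, not part of Lucene.NET: splits a token byte into
// the fields described above.
public static class Wah8TokenSketch
{
    public sealed class TokenFields
    {
        public bool CleanIsOnes;     // assumption: MSB set => clean words are 0xFF
        public int CleanBits;        // raw 3-bit clean-length field
        public bool CleanContinues;  // higher-order bit of the 3-bit field
        public int DirtyBits;        // raw 4-bit dirty-length field
        public bool DirtyContinues;  // higher-order bit of the 4-bit field
    }

    public static TokenFields Parse(byte token)
    {
        int cleanBits = (token >> 4) & 0x07; // the 3 bits after the first bit
        int dirtyBits = token & 0x0F;        // the last 4 bits
        return new TokenFields
        {
            CleanIsOnes = (token & 0x80) != 0,
            CleanBits = cleanBits,
            CleanContinues = (cleanBits & 0x04) != 0,
            DirtyBits = dirtyBits,
            DirtyContinues = (dirtyBits & 0x08) != 0,
        };
    }
}
```

Under these assumptions, `Wah8TokenSketch.Parse(0x93)` reports a clean field of 1 and a dirty field of 3 with no continuation bits, which for any sequence other than the first would mean 1 + 2 = 3 clean words followed by 3 dirty words.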

This format cannot encode sequences of fewer than 2 clean words and 0 dirty words. The reason is that if you find a single clean word, you should encode it as a dirty word instead. This takes the same space as starting a new sequence (since you need one byte for the token) but is lighter to decode. There is, however, an exception for the first sequence: since it may start directly with a dirty word, its clean length is encoded directly, without subtracting 2.

There is an additional restriction on the format: the sequence of dirty words is not allowed to contain two consecutive clean words. This restriction exists to make sure no space is wasted and to make sure iterators can read the next doc ID by reading at most 2 dirty words.
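
To make the sequence rules concrete, here is a minimal sketch that splits an array of 8-bit words into sequences of a clean run followed by dirty words. It is a simplification for illustration only: the class and its members are hypothetical, the real builder works incrementally from doc IDs rather than from a word array, and the sketch assumes that a clean run means two or more identical clean words in a row. A lone clean word is kept among the dirty words, as the text above recommends.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET.
public static class Wah8SegmentationSketch
{
    // One sequence: a run of identical clean words, then literal dirty words.
    public sealed class Sequence
    {
        public bool CleanIsOnes;  // true for 0xFF runs, false for 0x00 runs
        public int CleanLength;   // number of clean words in the run
        public List<byte> DirtyWords = new List<byte>();
    }

    public static bool IsClean(byte w) => w == 0x00 || w == 0xFF;

    public static List<Sequence> Split(byte[] words)
    {
        var result = new List<Sequence>();
        var current = new Sequence(); // the first sequence may have no clean run
        int i = 0;
        while (i < words.Length)
        {
            // Measure the run of identical clean words starting at i.
            int run = 0;
            if (IsClean(words[i]))
            {
                run = 1;
                while (i + run < words.Length && words[i + run] == words[i]) run++;
            }

            if (run >= 2)
            {
                // A clean run of 2+ words starts a new sequence,
                // unless the current one is still empty.
                if (current.CleanLength > 0 || current.DirtyWords.Count > 0)
                {
                    result.Add(current);
                    current = new Sequence();
                }
                current.CleanIsOnes = words[i] == 0xFF;
                current.CleanLength = run;
                i += run;
            }
            else
            {
                // Dirty word, or a single clean word stored as if it were dirty.
                current.DirtyWords.Add(words[i]);
                i++;
            }
        }
        if (current.CleanLength > 0 || current.DirtyWords.Count > 0) result.Add(current);
        return result;
    }
}
```

For the words `00 00 00 12 00 34 FF FF 01`, this sketch yields two sequences: three zero words followed by the dirty words `12 00 34` (the isolated `00` stays dirty), then two one-filled words followed by the dirty word `01`.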

This is a Lucene.NET EXPERIMENTAL API; use at your own risk.

Inheritance

System.Object

DocIdSet

WAH8DocIdSet

Inherited Members

DocIdSet.Bits

DocIdSet.NewAnonymous(Func<DocIdSetIterator>)

DocIdSet.NewAnonymous(Func<DocIdSetIterator>, Func<IBits>)

DocIdSet.NewAnonymous(Func<DocIdSetIterator>, Func<Boolean>)

DocIdSet.NewAnonymous(Func<DocIdSetIterator>, Func<IBits>, Func<Boolean>)

System.Object.Equals(System.Object)

System.Object.Equals(System.Object, System.Object)

System.Object.GetHashCode()

System.Object.GetType()

System.Object.MemberwiseClone()

System.Object.ReferenceEquals(System.Object, System.Object)

System.Object.ToString()

Namespace: Lucene.Net.Util

Assembly: Lucene.Net.dll

Syntax

public sealed class WAH8DocIdSet : DocIdSet

Fields

DEFAULT_INDEX_INTERVAL

Default index interval.

Declaration

public const int DEFAULT_INDEX_INTERVAL = 24

Field Value

Type	Description
System.Int32

Properties

IsCacheable

Declaration

public override bool IsCacheable { get; }

Property Value

Type	Description
System.Boolean

Overrides

DocIdSet.IsCacheable

Methods

Cardinality()

Return the number of documents in this DocIdSet in constant time.

Declaration

public int Cardinality()

Returns

Type	Description
System.Int32

GetIterator()

Declaration

public override DocIdSetIterator GetIterator()

Returns

Type	Description
DocIdSetIterator

Overrides

DocIdSet.GetIterator()
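
Since the set offers no random access, membership checks and range scans go through the iterator. The following is a minimal usage sketch, assuming the Lucene.Net package is referenced. `WAH8DocIdSet.Builder` (with `Add(int)` and `Build()`) is not described on this page, so its use here is an assumption, as is the requirement that doc IDs be added in increasing order; the wrapper class and its method names are hypothetical.

```csharp
using System.Collections.Generic;
using Lucene.Net.Search;
using Lucene.Net.Util;

// Hypothetical wrapper for illustration.
public static class Wah8IterationSketch
{
    // Assumption: WAH8DocIdSet.Builder exists and expects increasing doc IDs.
    public static WAH8DocIdSet BuildSet(params int[] sortedDocIds)
    {
        var builder = new WAH8DocIdSet.Builder();
        foreach (int docId in sortedDocIds)
        {
            builder.Add(docId);
        }
        return builder.Build();
    }

    // Collects every doc ID at or after 'target': one Advance call to skip
    // ahead using the index, then NextDoc until the iterator is exhausted.
    public static List<int> DocsFrom(WAH8DocIdSet set, int target)
    {
        var docs = new List<int>();
        DocIdSetIterator it = set.GetIterator();
        for (int doc = it.Advance(target);
             doc != DocIdSetIterator.NO_MORE_DOCS;
             doc = it.NextDoc())
        {
            docs.Add(doc);
        }
        return docs;
    }
}
```

For a set built from 3, 40, 41 and 9000, calling `DocsFrom(set, 41)` would be expected to return 41 and 9000.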

Intersect(ICollection<WAH8DocIdSet>)

Same as Intersect(ICollection<WAH8DocIdSet>, Int32) with the default index interval.

Declaration

public static WAH8DocIdSet Intersect(ICollection<WAH8DocIdSet> docIdSets)

Parameters

Type	Name	Description
System.Collections.Generic.ICollection<WAH8DocIdSet>	docIdSets

Returns

Type	Description
WAH8DocIdSet

Intersect(ICollection<WAH8DocIdSet>, Int32)

Compute the intersection of the provided sets. This method is much faster than computing the intersection manually since it operates directly at the byte level.

Declaration

public static WAH8DocIdSet Intersect(ICollection<WAH8DocIdSet> docIdSets, int indexInterval)

Parameters

Type	Name	Description
System.Collections.Generic.ICollection<WAH8DocIdSet>	docIdSets
System.Int32	indexInterval

Returns

Type	Description
WAH8DocIdSet

RamBytesUsed()

Return the memory usage of this class in bytes.

Declaration

public long RamBytesUsed()

Returns

Type	Description
System.Int64

Union(ICollection<WAH8DocIdSet>)

Same as Union(ICollection<WAH8DocIdSet>, Int32) with the default index interval.

Declaration

public static WAH8DocIdSet Union(ICollection<WAH8DocIdSet> docIdSets)

Parameters

Type	Name	Description
System.Collections.Generic.ICollection<WAH8DocIdSet>	docIdSets

Returns

Type	Description
WAH8DocIdSet

Union(ICollection<WAH8DocIdSet>, Int32)

Compute the union of the provided sets. This method is much faster than computing the union manually since it operates directly at the byte level.

Declaration

public static WAH8DocIdSet Union(ICollection<WAH8DocIdSet> docIdSets, int indexInterval)

Parameters

Type	Name	Description
System.Collections.Generic.ICollection<WAH8DocIdSet>	docIdSets
System.Int32	indexInterval

Returns

Type	Description
WAH8DocIdSet
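
As a rough usage sketch for the static Union and Intersect methods, the example below combines two small sets and reads back their cardinalities. It assumes the Lucene.Net package is referenced. `WAH8DocIdSet.Builder` is not described on this page, so relying on it (and on adding doc IDs in increasing order) is an assumption; the wrapper class and its method names are hypothetical.

```csharp
using System.Collections.Generic;
using Lucene.Net.Util;

// Hypothetical wrapper for illustration.
public static class Wah8SetOpsSketch
{
    // Assumption: WAH8DocIdSet.Builder exists and expects increasing doc IDs.
    public static WAH8DocIdSet BuildSet(params int[] sortedDocIds)
    {
        var builder = new WAH8DocIdSet.Builder();
        foreach (int docId in sortedDocIds)
        {
            builder.Add(docId);
        }
        return builder.Build();
    }

    // Returns { union cardinality, intersection cardinality } for two sets,
    // using the overloads that take the default index interval.
    public static int[] UnionAndIntersectCardinalities(WAH8DocIdSet a, WAH8DocIdSet b)
    {
        var sets = new List<WAH8DocIdSet> { a, b };
        WAH8DocIdSet union = WAH8DocIdSet.Union(sets);
        WAH8DocIdSet intersection = WAH8DocIdSet.Intersect(sets);
        return new[] { union.Cardinality(), intersection.Cardinality() };
    }
}
```

With sets built from {1, 5, 100} and {5, 7}, the union holds 1, 5, 7 and 100 and the intersection holds only 5, so the expected cardinalities are 4 and 1.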