Namespace Lucene.Net.Util.Packed
Packed integer arrays and streams.
The packed package provides:
- sequential and random access capable arrays of positive longs,
- routines for efficient serialization and deserialization of streams of packed integers.
The implementations provide different trade-offs between memory usage and access speed. The standard usage scenario is replacing large int or long arrays in order to reduce the memory footprint.
The main access point is the <xref:Lucene.Net.Util.Packed.PackedInts> factory.
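To make the memory trade-off concrete, here is a minimal Python sketch of the arithmetic. It is not the Lucene.Net API: `bits_required` and `packed_size_bytes` are hypothetical helper names, and the layout is assumed to be contiguous and rounded up to whole 64-bit words.

```python
def bits_required(max_value: int) -> int:
    """Number of bits needed to represent any value in [0, max_value]."""
    if max_value < 0:
        raise ValueError("only non-negative values are supported")
    return max(1, max_value.bit_length())


def packed_size_bytes(value_count: int, bits_per_value: int) -> int:
    """Size of a contiguous packed layout, rounded up to whole 64-bit words."""
    total_bits = value_count * bits_per_value
    words = (total_bits + 63) // 64
    return words * 8


count = 1_000_000
bpv = bits_required(1000)               # values in [0, 1000]
plain = count * 8                       # one 64-bit long per value
packed = packed_size_bytes(count, bpv)  # same values at bpv bits each
print(bpv, plain, packed)
```

Under these assumptions, one million values no larger than 1000 need 10 bits each, so the packed form takes a fraction of the space of a plain 64-bit array.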
In-memory structures
<xref:Lucene.Net.Util.Packed.PackedInts.Mutable>
Only supports positive longs.
Requires the number of bits per value to be known in advance.
Random-access for both writing and reading.
<xref:Lucene.Net.Util.Packed.GrowableWriter>
Same as PackedInts.Mutable but grows the number of bits per value when needed.
Useful to build a PackedInts.Mutable from a read-once stream of longs.
<xref:Lucene.Net.Util.Packed.PagedGrowableWriter>
Slices data into fixed-size blocks stored in GrowableWriters.
Supports more than 2B values.
You should use Appending(Delta)PackedLongBuffer instead if you don't need random write access.
<xref:Lucene.Net.Util.Packed.AppendingDeltaPackedLongBuffer>
Can store any sequence of longs.
Compression is good when values are close to each other.
Supports random reads, but only sequential writes.
Can address up to 2^42 values.
<xref:Lucene.Net.Util.Packed.AppendingPackedLongBuffer>
- Same as AppendingDeltaPackedLongBuffer but assumes values are 0-based.
<xref:Lucene.Net.Util.Packed.MonotonicAppendingLongBuffer>
- Same as AppendingDeltaPackedLongBuffer except that compression is good when the stream is a succession of affine functions.
Disk-based structures
<xref:Lucene.Net.Util.Packed.PackedInts.Writer>, <xref:Lucene.Net.Util.Packed.PackedInts.Reader>, <xref:Lucene.Net.Util.Packed.PackedInts.ReaderIterator>
Only supports positive longs.
Requires the number of bits per value to be known in advance.
Supports both fast sequential access with a low memory footprint (ReaderIterator) and random access, either by loading values into memory or by leaving them on disk (Reader).
BlockPackedWriter, BlockPackedReader, BlockPackedReaderIterator
Splits the stream into fixed-size blocks.
Compression is good when values are close to each other.
Can address up to 2B * blockSize values.
MonotonicBlockPackedWriter, MonotonicBlockPackedReader
Same as the non-monotonic variants except that compression is good when the stream is a succession of affine functions.
There is no sequential-access reader: if you need sequential access, delta-encode the values yourself and use BlockPackedWriter instead.
PackedDataOutput, PackedDataInput
- Writes sequences of longs where each long can use any number of bits.
Classes
AbstractAppendingInt64Buffer
Common functionality shared by AppendingDeltaPackedInt64Buffer and MonotonicAppendingInt64Buffer.
NOTE: This was AbstractAppendingLongBuffer in Lucene
AbstractAppendingInt64Buffer.Iterator
AbstractBlockPackedWriter
AbstractPagedMutable<T>
Base implementation for PagedMutable and PagedGrowableWriter.
AppendingDeltaPackedInt64Buffer
Utility class to buffer a list of signed longs in memory. This class only supports appending and is optimized for the case where values are close to each other.
NOTE: This was AppendingDeltaPackedLongBuffer in Lucene
AppendingPackedInt64Buffer
Utility class to buffer a list of signed longs in memory. This class only supports appending and is optimized for non-negative numbers with a uniform distribution over a fixed (limited) range.
NOTE: This was AppendingPackedLongBuffer in Lucene
BlockPackedReader
Provides random access to a stream written with BlockPackedWriter.
BlockPackedReaderIterator
Reader for sequences of System.Int64s written with BlockPackedWriter.
BlockPackedWriter
A writer for large sequences of longs.
The sequence is divided into fixed-size blocks and for each block, the difference between each value and the minimum value of the block is encoded using as few bits as possible. Memory usage of this class is proportional to the block size. Each block has an overhead between 1 and 10 bytes to store the minimum value and the number of bits per value of the block.
Format:
- <Block>^BlockCount
- BlockCount: ⌈ ValueCount / BlockSize ⌉
- Block: <Header, (Ints)>
- Header: <Token, (MinValue)>
- Token: a byte (WriteByte(Byte)), first 7 bits are the number of bits per value (bitsPerValue). If the 8th bit is 1, then MinValue (see next) is 0, otherwise MinValue is present and needs to be decoded
- MinValue: a zigzag-encoded variable-length System.Int64 (WriteVInt64(Int64)) whose value should be added to every int from the block to restore the original values
- Ints: If the number of bits per value is 0, then there is nothing to decode and all ints are equal to MinValue. Otherwise: BlockSize packed ints (PackedInt32s) encoded on exactly bitsPerValue bits per value. They are the subtraction of the original values and MinValue
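The per-block scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, the deltas are kept in a plain list instead of being bit-packed, and the mapping of "first 7 bits" and "8th bit" to the low 7 bits and `0x80` is an assumption of this sketch.

```python
def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def encode_block(values):
    """Return (token, min_value, deltas) for one block of values."""
    min_value = min(values)
    deltas = [v - min_value for v in values]
    bits_per_value = max(deltas).bit_length()  # 0 when all values are equal
    token = bits_per_value                     # assumed: low 7 bits of the token
    if min_value == 0:
        token |= 0x80                          # assumed flag bit: MinValue omitted
    return token, min_value, deltas


def decode_block(token, min_value, deltas, block_size):
    """Rebuild the original values of one block."""
    bits_per_value = token & 0x7F
    if token & 0x80:
        min_value = 0                          # MinValue was not stored
    if bits_per_value == 0:
        return [min_value] * block_size        # nothing to decode
    return [d + min_value for d in deltas]
```

For example, the block `[1000, 1003, 1001, 1007]` has a minimum of 1000 and deltas `[0, 3, 1, 7]`, so 3 bits per value suffice instead of the 10 bits the raw values would need.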
EliasFanoDecoder
A decoder for an EliasFanoEncoder.
EliasFanoDocIdSet
A DocIdSet in Elias-Fano encoding.
EliasFanoEncoder
Encode a non-decreasing sequence of non-negative whole numbers in the Elias-Fano encoding that was introduced in the 1970s by Peter Elias and Robert Fano.
The Elias-Fano encoding is a high bits / low bits representation of a monotonically increasing sequence of numValues > 0 natural numbers x[i]
0 <= x[0] <= x[1] <= ... <= x[numValues-2] <= x[numValues-1] <= upperBound
where upperBound > 0 is an upper bound on the last value.
The Elias-Fano encoding uses less than half a bit per encoded number more than the smallest representation that can encode any monotone sequence with the same bounds.
The lower L bits of each x[i] are stored explicitly and contiguously in the lower-bits array, with L chosen as (Log() base 2):
L = max(0, floor(log(upperBound/numValues)))
The upper bits are stored in the upper-bits array as a sequence of unary-coded gaps (x[-1] = 0):
(x[i]/2^L) - (x[i-1]/2^L)
The unary code encodes a natural number n by n 0 bits followed by a 1 bit: 0...01.
In the upper bits the total number of 1 bits is numValues and the total number of 0 bits is:
floor(x[numValues-1]/2^L) <= upperBound/(2^max(0, floor(log(upperBound/numValues)))) <= 2*numValues
The Elias-Fano encoding uses at most
2 + Ceil(Log(upperBound/numValues))
bits per encoded number. With upperBound in these bounds (p is an integer):
2^p < x[numValues-1] <= upperBound <= 2^(p+1)
the number of bits per encoded number is minimized.
In this implementation the values in the sequence can be given as long, numValues = 0 and upperBound = 0 are allowed, and each of the upper and lower bit arrays should fit in a long[].
An index of positions of zeros in the upper bits is also built.
This implementation is based on this article:
Sebastiano Vigna, "Quasi Succinct Indices", June 19, 2012, sections 3, 4 and 9. Retrieved from http://arxiv.org/pdf/1206.4300 .
The articles originally describing the Elias-Fano representation are:
Peter Elias, "Efficient storage and retrieval by content and address of static files", J. Assoc. Comput. Mach., 21(2):246–260, 1974.
Robert M. Fano, "On the number of bits required to implement an associative memory", Memorandum 61, Computer Structures Group, Project MAC, MIT, Cambridge, Mass., 1971.
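The high bits / low bits split described above can be sketched in a few lines of Python. This is an illustration only, with hypothetical function names; it keeps the upper bits as a list of 0/1 integers for readability rather than packing them into a long[], and it does not build the index of zero positions.

```python
def elias_fano_encode(values, upper_bound):
    """Encode a non-decreasing list of non-negative ints.

    Returns (L, lower, upper_bits): the low-bit count, the low L bits of each
    value, and the unary-coded gaps between successive high parts.
    """
    n = len(values)
    # L = max(0, floor(log2(upper_bound / n))), computed with integers only.
    ratio = upper_bound // n if n > 0 else 0
    L = max(0, ratio.bit_length() - 1)
    low_mask = (1 << L) - 1
    lower = [v & low_mask for v in values]
    upper_bits = []
    prev_high = 0
    for v in values:
        high = v >> L
        upper_bits.extend([0] * (high - prev_high))  # gap as that many 0 bits
        upper_bits.append(1)                         # terminated by a 1 bit
        prev_high = high
    return L, lower, upper_bits


def elias_fano_decode(L, lower, upper_bits):
    """Rebuild the original sequence from the two bit arrays."""
    values, high, i = [], 0, 0
    for bit in upper_bits:
        if bit == 0:
            high += 1                                # each 0 bit advances the high part
        else:
            values.append((high << L) | lower[i])    # each 1 bit emits one value
            i += 1
    return values
```

For the sequence `[2, 3, 5, 7, 11, 13, 24]` with an upper bound of 24, this gives L = 1, seven 1 bits and twelve 0 bits in the upper array, which respects the 2*numValues bound on zero bits stated above.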
GrowableWriter
Implements PackedInt32s.Mutable, but grows the bit count of the underlying packed ints on-demand.
Beware that this class accepts negative values, but to store them it grows the number of bits per value to 64.
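The growth rule can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the library's implementation: `GrowableArray` is a hypothetical name, and the values live in a plain list, so only the bookkeeping of the bit width is shown (a packed implementation would also have to re-pack the existing values at the wider width).

```python
class GrowableArray:
    """Tracks the bit width needed so far and widens it on demand."""

    def __init__(self, size: int, start_bits: int = 1):
        self.bits_per_value = start_bits
        self.values = [0] * size

    def set(self, index: int, value: int) -> None:
        # A negative 64-bit long has its sign bit set, so it needs all 64 bits.
        needed = 64 if value < 0 else max(1, value.bit_length())
        if needed > self.bits_per_value:
            # A packed implementation would re-pack existing values here.
            self.bits_per_value = needed
        self.values[index] = value

    def get(self, index: int) -> int:
        return self.values[index]
```

Setting 1, then 300, then -1 moves the width from 1 to 9 to 64 bits, which is why a single negative value is costly for the whole array.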
MonotonicAppendingInt64Buffer
Utility class to buffer signed longs in memory, which is optimized for the case where the sequence is monotonic, although it can encode any sequence of arbitrary longs. It only supports appending.
NOTE: This was MonotonicAppendingLongBuffer in Lucene.
MonotonicBlockPackedReader
Provides random access to a stream written with MonotonicBlockPackedWriter.
MonotonicBlockPackedWriter
A writer for large monotonically increasing sequences of positive System.Int64s.
The sequence is divided into fixed-size blocks and for each block, values are modeled after a linear function f: x → A × x + B. The block encodes deltas from the expected values computed from this function using as few bits as possible. Each block has an overhead between 6 and 14 bytes.
Format:
- <Block>^BlockCount
- BlockCount: ⌈ ValueCount / BlockSize ⌉
- Block: <Header, (Ints)>
- Header: <B, A, BitsPerValue>
- B: the B from f: x → A × x + B using a variable-length System.Int64 (WriteVInt64(Int64))
- A: the A from f: x → A × x + B encoded using J2N.BitConversion.SingleToInt32Bits(System.Single) on 4 bytes (WriteInt32(Int32))
- BitsPerValue: a variable-length System.Int32 (WriteVInt32(Int32))
- Ints: if BitsPerValue is 0, then there is nothing to read and all values perfectly match the result of the function. Otherwise, these are the zigzag-encoded packed (PackedInt32s) deltas from the expected value (computed from the function) using exactly BitsPerValue bits per value
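As a rough illustration of the linear model, the Python sketch below encodes one block as B, A and zigzag-encoded deltas. Several things here are assumptions rather than facts from this page: the function names are hypothetical, B is taken as the first value and A as the average slope of the block (one simple way to pick the line), A is a Python float rather than a 32-bit single, and the deltas are left in a list instead of being bit-packed.

```python
def zigzag_encode(n: int) -> int:
    """Map signed to unsigned: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return 2 * n if n >= 0 else -2 * n - 1


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return z >> 1 if z % 2 == 0 else -(z >> 1) - 1


def encode_monotonic_block(values):
    """Return (b, a, bits_per_value, deltas) for one block."""
    b = values[0]                                    # assumed choice of B
    n = len(values)
    a = 0.0 if n == 1 else (values[-1] - b) / (n - 1)  # assumed choice of A
    deltas = [zigzag_encode(v - (b + int(a * i))) for i, v in enumerate(values)]
    bits_per_value = max(deltas).bit_length()        # 0 when every value is on the line
    return b, a, bits_per_value, deltas


def decode_monotonic_block(b, a, deltas):
    """Rebuild values as expected value plus decoded delta."""
    return [b + int(a * i) + zigzag_decode(z) for i, z in enumerate(deltas)]
```

A perfectly regular block such as `[100, 110, 120, 130]` yields BitsPerValue = 0, while `[100, 111, 119, 130]` deviates from the line by at most 1 and needs 2 bits per value after zigzag encoding.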
Packed64
Space optimized random access capable array of values with a fixed number of bits/value. Values are packed contiguously.
The implementation strives to perform as fast as possible under the constraint of contiguous bits, by avoiding expensive operations. This comes at the cost of code clarity.
Technical details: this implementation is a refinement of a non-branching version. The non-branching get and set methods meant that 2 or 4 atomics in the underlying array were always accessed, even for the cases where only 1 or 2 were needed. Even with caching, this had a detrimental effect on performance. Related to this issue, the old implementation used lookup tables for shifts and masks, which also proved to be a bit slower than calculating the shifts and masks on the fly. See https://issues.apache.org/jira/browse/LUCENE-4062 for details.
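To show what contiguous packing implies, here is a small Python sketch of a fixed-width array stored in 64-bit words, where a value may straddle two words. It is not the library's code: `PackedArray` and `WORD_MASK` are hypothetical names, and the bit order within a word is an arbitrary choice of this sketch. It branches on whether a value spills into the next word, so only the words that are actually needed get touched.

```python
WORD_MASK = (1 << 64) - 1  # emulate 64-bit unsigned words with Python ints


class PackedArray:
    """Fixed-width values packed contiguously into 64-bit words."""

    def __init__(self, value_count: int, bits_per_value: int):
        if not 1 <= bits_per_value <= 64:
            raise ValueError("bits_per_value must be in [1, 64]")
        self.bits = bits_per_value
        self.mask = (1 << bits_per_value) - 1
        self.size = value_count
        self.words = [0] * ((value_count * bits_per_value + 63) // 64)

    def set(self, index: int, value: int) -> None:
        if not 0 <= value <= self.mask:
            raise ValueError("value does not fit in bits_per_value bits")
        word, offset = divmod(index * self.bits, 64)
        # Clear, then write, the part that lands in the first word.
        self.words[word] &= ~(self.mask << offset) & WORD_MASK
        self.words[word] |= (value << offset) & WORD_MASK
        spill = offset + self.bits - 64
        if spill > 0:
            # The value straddles a word boundary: high bits go to the next word.
            high_mask = (1 << spill) - 1
            self.words[word + 1] &= ~high_mask & WORD_MASK
            self.words[word + 1] |= value >> (self.bits - spill)

    def get(self, index: int) -> int:
        word, offset = divmod(index * self.bits, 64)
        value = self.words[word] >> offset
        spill = offset + self.bits - 64
        if spill > 0:
            value |= self.words[word + 1] << (self.bits - spill)
        return value & self.mask
```

With 13 bits per value, for instance, the value at index 4 starts at bit 52 and spills one bit into the following word, which is the case the `spill > 0` branch handles.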
PackedDataInput
A DataInput wrapper to read unaligned, variable-length packed integers. This API is much slower than the PackedInt32s fixed-length API but can be convenient to save space.
PackedDataOutput
A DataOutput wrapper to write unaligned, variable-length packed integers.
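The idea of unaligned, variable-length packed integers can be sketched with a tiny bit writer and reader in Python. The class and method names below are hypothetical and do not mirror the PackedDataOutput or PackedDataInput members; writing the most significant bit first is an assumption of this sketch.

```python
class BitWriter:
    """Appends values using a caller-chosen number of bits each."""

    def __init__(self):
        self.buffer = bytearray()
        self.current = 0     # bits accumulated for the byte in progress
        self.remaining = 8   # free bits left in that byte

    def write_long(self, value: int, bits: int) -> None:
        """Append the low `bits` bits of value, most significant bit first."""
        while bits > 0:
            if self.remaining == 0:
                self._flush_byte()
            take = min(bits, self.remaining)
            chunk = (value >> (bits - take)) & ((1 << take) - 1)
            self.current |= chunk << (self.remaining - take)
            bits -= take
            self.remaining -= take

    def _flush_byte(self) -> None:
        self.buffer.append(self.current)
        self.current, self.remaining = 0, 8

    def finish(self) -> bytes:
        """Pad the last partial byte with zero bits and return the stream."""
        if self.remaining < 8:
            self._flush_byte()
        return bytes(self.buffer)


class BitReader:
    """Reads back values written by BitWriter, given the same bit counts."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # absolute bit position in the stream

    def read_long(self, bits: int) -> int:
        value = 0
        for _ in range(bits):
            byte = self.data[self.pos >> 3]
            value = (value << 1) | ((byte >> (7 - (self.pos & 7))) & 1)
            self.pos += 1
        return value
```

Writing 5 on 3 bits, 1000 on 10 bits and 1 on 1 bit takes 14 bits, so the stream fits in 2 bytes; the reader must be given the same bit counts in the same order, since no lengths are stored.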
PackedInt32s
Simplistic compression for arrays of unsigned long values. Each value is >= 0 and <= a specified maximum value. The values are stored as packed ints, with each value consuming a fixed number of bits.
NOTE: This was PackedInts in Lucene.
PackedInt32s.Format
A format to write packed System.Int32s.
PackedInt32s.FormatAndBits
Simple class that holds a format and a number of bits per value.
PackedInt32s.Header
Header identifying the structure of a packed integer array.
PackedInt32s.Mutable
A packed integer array that can be modified.
PackedInt32s.MutableImpl
PackedInt32s.NullReader
A PackedInt32s.Reader which has all its values equal to 0 (bitsPerValue = 0).
PackedInt32s.Reader
A read-only random access array of positive integers.
PackedInt32s.Writer
A write-once Writer.
PagedGrowableWriter
A PagedGrowableWriter. This class slices data into fixed-size blocks which have independent numbers of bits per value and grow on-demand.
You should use this class instead of the AbstractAppendingInt64Buffer related ones only when you need random write-access. Otherwise this class will likely be slower and less memory-efficient.
PagedMutable
A PagedMutable. This class slices data into fixed-size blocks which have the same number of bits per value. It can be a useful replacement for PackedInt32s.Mutable to store more than 2B values.
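Slicing into fixed-size blocks comes down to splitting an index into a page number and an offset within the page. The Python sketch below shows that addressing under stated assumptions: `PagedArray` is a hypothetical name, the page size is assumed to be a power of two so that a shift and a mask can be used, and each page is a plain list rather than a packed array.

```python
class PagedArray:
    """Stores values in fixed-size pages addressed by shift and mask."""

    def __init__(self, size: int, page_size: int):
        if page_size <= 0 or page_size & (page_size - 1):
            raise ValueError("page_size must be a power of two")
        self.page_shift = page_size.bit_length() - 1
        self.page_mask = page_size - 1
        self.size = size
        num_pages = (size + page_size - 1) // page_size
        self.pages = []
        for p in range(num_pages):
            # The last page may be shorter than page_size.
            length = min(page_size, size - p * page_size)
            self.pages.append([0] * length)

    def get(self, index: int) -> int:
        return self.pages[index >> self.page_shift][index & self.page_mask]

    def set(self, index: int, value: int) -> None:
        self.pages[index >> self.page_shift][index & self.page_mask] = value
```

Because each page is indexed separately, the total number of values is no longer limited by the maximum length of a single array.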
Interfaces
PackedInt32s.IDecoder
A decoder for packed integers.
PackedInt32s.IEncoder
An encoder for packed integers.
PackedInt32s.IReaderIterator
Run-once iterator interface, to decode previously saved PackedInt32s.