Class Lucene45DocValuesFormat
Lucene 4.5 DocValues format.
Encodes the four per-document value types (Numeric, Binary, Sorted, SortedSet) with these strategies:
- Delta-compressed: per-document integers written in blocks of 16k. For each block the minimum value in that block is encoded, and each entry is a delta from that minimum value. Each block of deltas is compressed with bitpacking. For more information, see BlockPackedWriter.
- Table-compressed: when the number of unique values is very small (< 256), and when there are unused "gaps" in the range of values used (such as SmallSingle), a lookup table is written instead. Each per-document entry is instead the ordinal to this table, and those ordinals are compressed with bitpacking (PackedInt32s).
- GCD-compressed: when all numbers share a common divisor, such as dates, the greatest common divisor (GCD) is computed, and quotients are stored using Delta-compressed Numerics.
- Fixed-width Binary: one large concatenated byte[] is written, along with the fixed length. Each document's value can be addressed directly with multiplication (docID * length).
- Variable-width Binary: one large concatenated byte[] is written, along with end addresses for each document. The addresses are written in blocks of 16k, with the current absolute start for the block, and the average (expected) delta per entry. For each document the deviation from the delta (actual - expected) is written.
- Prefix-compressed Binary: values are written in chunks of 16, with the first value written completely and other values sharing prefixes. Chunk addresses are written in blocks of 16k, with the current absolute start for the block, and the average (expected) delta per entry. For each chunk the deviation from the delta (actual - expected) is written.
- Sorted: a mapping of ordinals to deduplicated terms is written as Prefix-Compressed Binary, along with the per-document ordinals written using one of the numeric strategies above.
- SortedSet: a mapping of ordinals to deduplicated terms is written as Prefix-Compressed Binary; an ordinal list and a per-document index into this list are written using the numeric strategies above.
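The prefix-compressed layout used for Sorted and SortedSet terms can be sketched as follows. This is a minimal illustration, not the codec's actual writer: the class and method names are hypothetical, and it assumes each non-leading value in a chunk shares its prefix with the value immediately before it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of prefix compression over sorted terms.
public static class PrefixCompressionSketch
{
    // Number of leading bytes that a and b have in common.
    public static int SharedPrefix(byte[] a, byte[] b)
    {
        int n = Math.Min(a.Length, b.Length);
        int i = 0;
        while (i < n && a[i] == b[i]) i++;
        return i;
    }

    // Encodes terms as (shared prefix length, suffix) pairs.
    // The first term of every chunk is written completely (shared length 0).
    public static List<(int Shared, byte[] Suffix)> Encode(IList<byte[]> sortedTerms, int chunkSize = 16)
    {
        var result = new List<(int Shared, byte[] Suffix)>();
        for (int i = 0; i < sortedTerms.Count; i++)
        {
            byte[] term = sortedTerms[i];
            int shared = (i % chunkSize == 0) ? 0 : SharedPrefix(sortedTerms[i - 1], term);
            byte[] suffix = new byte[term.Length - shared];
            Array.Copy(term, shared, suffix, 0, suffix.Length);
            result.Add((shared, suffix));
        }
        return result;
    }

    // Rebuilds the full terms by copying the shared prefix from the previous term.
    public static List<byte[]> Decode(IList<(int Shared, byte[] Suffix)> encoded)
    {
        var terms = new List<byte[]>();
        byte[] previous = Array.Empty<byte>();
        foreach (var (shared, suffix) in encoded)
        {
            byte[] term = new byte[shared + suffix.Length];
            Array.Copy(previous, 0, term, 0, shared);
            Array.Copy(suffix, 0, term, shared, suffix.Length);
            terms.Add(term);
            previous = term;
        }
        return terms;
    }
}
```

With the terms "apple", "applet", "apply" in one chunk, the shared prefix lengths under this sketch are 0, 5 and 4, so only the suffixes "apple", "t" and "y" need to be stored.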
Files:
- .dvd: DocValues data
- .dvm: DocValues metadata
- The DocValues metadata or .dvm file.
For each DocValues field, this stores metadata, such as the offset into the DocValues data (.dvd).
DocValues metadata (.dvm) --> Header,<Entry>^NumFields,Footer
- Entry --> NumericEntry | BinaryEntry | SortedEntry | SortedSetEntry
- NumericEntry --> GCDNumericEntry | TableNumericEntry | DeltaNumericEntry
- GCDNumericEntry --> NumericHeader,MinValue,GCD
- TableNumericEntry --> NumericHeader,TableSize,Int64^TableSize (WriteInt64(Int64))
- DeltaNumericEntry --> NumericHeader
- NumericHeader --> FieldNumber,EntryType,NumericType,MissingOffset,PackedVersion,DataOffset,Count,BlockSize
- BinaryEntry --> FixedBinaryEntry | VariableBinaryEntry | PrefixBinaryEntry
- FixedBinaryEntry --> BinaryHeader
- VariableBinaryEntry --> BinaryHeader,AddressOffset,PackedVersion,BlockSize
- PrefixBinaryEntry --> BinaryHeader,AddressInterval,AddressOffset,PackedVersion,BlockSize
- BinaryHeader --> FieldNumber,EntryType,BinaryType,MissingOffset,MinLength,MaxLength,DataOffset
- SortedEntry --> FieldNumber,EntryType,BinaryEntry,NumericEntry
- SortedSetEntry --> EntryType,BinaryEntry,NumericEntry,NumericEntry
- FieldNumber,PackedVersion,MinLength,MaxLength,BlockSize,ValueCount --> VInt (WriteVInt32(Int32))
- EntryType,CompressionType --> Byte (WriteByte(Byte))
- Header --> CodecHeader (WriteHeader(DataOutput, String, Int32))
- MinValue,GCD,MissingOffset,AddressOffset,DataOffset --> Int64 (WriteInt64(Int64))
- TableSize --> VInt (WriteVInt32(Int32))
- Footer --> CodecFooter (WriteFooter(IndexOutput))
Sorted fields have two entries: a Lucene45DocValuesProducer.BinaryEntry with the value metadata, and an ordinary Lucene45DocValuesProducer.NumericEntry for the document-to-ord metadata.
SortedSet fields have three entries: a Lucene45DocValuesProducer.BinaryEntry with the value metadata, and two Lucene45DocValuesProducer.NumericEntry instances for the document-to-ord-index and ordinal list metadata.
FieldNumber of -1 indicates the end of metadata.
EntryType is 0 (Lucene45DocValuesProducer.NumericEntry) or 1 (Lucene45DocValuesProducer.BinaryEntry).
DataOffset is the pointer to the start of the data in the DocValues data (.dvd)
NumericType indicates how Numeric values will be compressed:
- 0 --> delta-compressed. For each block of 16k integers, every integer is delta-encoded from the minimum value within the block.
- 1 --> gcd-compressed. When all integers share a common divisor, only quotients are stored using blocks of delta-encoded ints.
- 2 --> table-compressed. When the number of unique numeric values is small and it would save space, a lookup table of unique values is written, followed by the ordinal for each document.
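The delta and GCD strategies can be illustrated together: every value is rewritten as min + gcd * quotient, and only the small quotients need bit-packing. The sketch below is an assumption-laden illustration with hypothetical names, not the codec's writer; it ignores overflow of the differences and encodes one block at a time.

```csharp
using System;
using System.Linq;

// Hypothetical sketch of delta-from-minimum and GCD encoding for one block.
public static class NumericEncodingSketch
{
    public static long GreatestCommonDivisor(long a, long b)
    {
        a = Math.Abs(a);
        b = Math.Abs(b);
        while (b != 0)
        {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Returns (min, divisor, quotients) with values[i] == min + divisor * quotients[i].
    // A divisor of 1 corresponds to plain delta compression.
    public static (long Min, long Divisor, long[] Quotients) Encode(long[] values)
    {
        long min = values.Min();
        long divisor = 0;
        foreach (long v in values)
        {
            divisor = GreatestCommonDivisor(divisor, v - min);
        }
        if (divisor == 0) divisor = 1; // all values are equal
        long[] quotients = values.Select(v => (v - min) / divisor).ToArray();
        return (min, divisor, quotients);
    }

    public static long Decode(long min, long divisor, long quotient)
    {
        return min + divisor * quotient;
    }

    // Bits needed to bit-pack non-negative values up to maxValue (at least 1).
    public static int BitsRequired(long maxValue)
    {
        int bits = 0;
        while (maxValue > 0)
        {
            bits++;
            maxValue >>= 1;
        }
        return Math.Max(bits, 1);
    }
}
```

For the values 1000, 3000 and 8000, this sketch yields a minimum of 1000, a divisor of 1000 and quotients 0, 2 and 7, which fit in 3 bits each.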
BinaryType indicates how Binary values will be stored:
- 0 --> fixed-width. All values have the same length, addressing by multiplication.
- 1 --> variable-width. An address for each value is stored.
- 2 --> prefix-compressed. An address to the start of every interval'th value is stored.
MinLength and MaxLength represent the min and max byte[] value lengths for Binary values. If they are equal, then all values are of a fixed size, and can be addressed as DataOffset + (docID * length). Otherwise, the binary values are of variable size, and packed integer metadata (PackedVersion,BlockSize) is written for the addresses.
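The two addressing modes can be sketched as below. The names are hypothetical, and the variable-width case assumes the end addresses have already been decoded into an array of offsets relative to DataOffset.

```csharp
using System;

// Hypothetical sketch of locating a document's bytes in the .dvd data.
public static class BinaryAddressingSketch
{
    // Fixed-width: every value has the same length, so the start is a multiplication.
    public static long FixedStart(long dataOffset, int docId, int length)
    {
        return dataOffset + (long)docId * length;
    }

    // Variable-width: endAddresses[i] is where document i's value ends,
    // so its start is the end address of the previous document (0 for the first).
    public static (long Start, int Length) VariableRange(long dataOffset, long[] endAddresses, int docId)
    {
        long start = docId == 0 ? 0 : endAddresses[docId - 1];
        long end = endAddresses[docId];
        return (dataOffset + start, (int)(end - start));
    }
}
```

In this sketch a document with no bytes simply has an end address equal to its start, giving a length of 0.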
MissingOffset points to a byte[] containing a bitset of all documents that had a value for the field. If it is -1, then there are no missing values.
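A lookup against such a bitset might look like the following sketch. The helper name is hypothetical, and the bit order (least significant bit first within each byte) is an assumption that the text above does not specify.

```csharp
using System;

// Hypothetical sketch of testing whether a document has a value.
public static class MissingBitsSketch
{
    // Assumes document d is stored in bit (d % 8) of byte (d / 8).
    public static bool HasValue(byte[] bitset, int docId)
    {
        return (bitset[docId >> 3] & (1 << (docId & 7))) != 0;
    }
}
```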
Checksum contains the CRC32 checksum of all bytes in the .dvm file up until the checksum. This is used to verify the integrity of the file when the index is opened.
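To make the integrity check concrete, here is a small, dependency-free sketch of the standard CRC-32 computation (reflected polynomial 0xEDB88320). The class name is hypothetical, and the exact byte layout of the footer is not shown; the sketch only covers computing the checksum over a byte range.

```csharp
using System;

// Hypothetical sketch of a table-driven CRC-32 over a byte range.
public static class Crc32Sketch
{
    private static readonly uint[] Table = BuildTable();

    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
            {
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            }
            table[i] = c;
        }
        return table;
    }

    // Computes the CRC-32 of data[offset .. offset + count - 1].
    public static uint Compute(byte[] data, int offset, int count)
    {
        uint crc = 0xFFFFFFFFu;
        for (int i = offset; i < offset + count; i++)
        {
            crc = Table[(crc ^ data[i]) & 0xFF] ^ (crc >> 8);
        }
        return crc ^ 0xFFFFFFFFu;
    }
}
```

A reader would compute this over every byte that precedes the stored checksum and compare the result with the stored value; a mismatch indicates a corrupted file.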
- The DocValues data or .dvd file.
For each DocValues field, this stores the actual per-document data (the heavy-lifting).
DocValues data (.dvd) --> Header,<NumericData | BinaryData | SortedData>^NumFields,Footer
- NumericData --> DeltaCompressedNumerics | TableCompressedNumerics | GCDCompressedNumerics
- BinaryData --> Byte^DataLength (WriteByte(Byte)),Addresses
- SortedData --> FST<Int64> (FST<T>)
- DeltaCompressedNumerics --> BlockPackedInts(blockSize=16k) (BlockPackedWriter)
- TableCompressedNumerics --> PackedInts (PackedInt32s)
- GCDCompressedNumerics --> BlockPackedInts(blockSize=16k) (BlockPackedWriter)
- Addresses --> MonotonicBlockPackedInts(blockSize=16k) (MonotonicBlockPackedWriter)
- Footer --> CodecFooter (WriteFooter(IndexOutput))
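The monotonic address encoding (a block start, an average delta, and a per-entry deviation) can be sketched as follows. This is an illustrative sketch with hypothetical names, not MonotonicBlockPackedWriter itself; it assumes a non-empty block and leaves out the final bit-packing of the deviations.

```csharp
using System;

// Hypothetical sketch of "expected plus deviation" encoding for one block of addresses.
public static class MonotonicAddressSketch
{
    public static (long Start, double AverageDelta, long[] Deviations) EncodeBlock(long[] addresses)
    {
        long start = addresses[0];
        long last = addresses[addresses.Length - 1];
        double average = addresses.Length == 1
            ? 0.0
            : (double)(last - start) / (addresses.Length - 1);

        var deviations = new long[addresses.Length];
        for (int i = 0; i < addresses.Length; i++)
        {
            long expected = start + (long)(average * i);
            deviations[i] = addresses[i] - expected; // actual - expected
        }
        return (start, average, deviations);
    }

    // Recomputes the same expected value and adds the stored deviation.
    public static long Decode(long start, double averageDelta, long[] deviations, int index)
    {
        return start + (long)(averageDelta * index) + deviations[index];
    }
}
```

For the addresses 10, 20, 31 and 40, the average delta is 10 and the deviations are 0, 0, 1 and 0. Values that grow at a roughly steady rate therefore leave only small numbers to store.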
SortedSet entries store the list of ordinals in their BinaryData as a sequence of increasing vLongs (WriteVInt64(Int64)), delta-encoded.
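A delta-encoded vLong sequence can be sketched as below. The helper names are hypothetical; the variable-length format shown is the usual seven-bits-per-byte scheme with the high bit as a continuation flag, and the sketch assumes the first ordinal is encoded as a delta from 0.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of writing increasing ordinals as delta-encoded vLongs.
public static class VLongDeltaSketch
{
    // Writes value using 7 data bits per byte; the high bit marks "more bytes follow".
    public static void WriteVLong(List<byte> output, long value)
    {
        ulong v = (ulong)value;
        while ((v & ~0x7FUL) != 0)
        {
            output.Add((byte)((v & 0x7F) | 0x80));
            v >>= 7;
        }
        output.Add((byte)v);
    }

    // Each ordinal is stored as the difference from the previous one.
    public static byte[] EncodeOrds(IEnumerable<long> increasingOrds)
    {
        var output = new List<byte>();
        long previous = 0;
        foreach (long ord in increasingOrds)
        {
            WriteVLong(output, ord - previous);
            previous = ord;
        }
        return output.ToArray();
    }

    public static List<long> DecodeOrds(byte[] data)
    {
        var ords = new List<long>();
        long previous = 0;
        int pos = 0;
        while (pos < data.Length)
        {
            long delta = 0;
            int shift = 0;
            byte b;
            do
            {
                b = data[pos++];
                delta |= (long)(b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta;
            ords.Add(previous);
        }
        return ords;
    }
}
```

The ordinals 3, 5 and 300 become the deltas 3, 2 and 295, which this sketch writes as the four bytes 0x03, 0x02, 0xA7, 0x02.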
Note
This API is experimental and might change in incompatible ways in the next release.
Namespace: Lucene.Net.Codecs.Lucene45
Assembly: Lucene.Net.dll
Syntax
[DocValuesFormatName("Lucene45")]
public sealed class Lucene45DocValuesFormat : DocValuesFormat
Constructors
Lucene45DocValuesFormat()
Sole Constructor
Declaration
public Lucene45DocValuesFormat()
Methods
FieldsConsumer(SegmentWriteState)
Declaration
public override DocValuesConsumer FieldsConsumer(SegmentWriteState state)
Parameters
Type | Name | Description |
---|---|---|
SegmentWriteState | state |
Returns
Type | Description |
---|---|
DocValuesConsumer |
Overrides
FieldsProducer(SegmentReadState)
Declaration
public override DocValuesProducer FieldsProducer(SegmentReadState state)
Parameters
Type | Name | Description |
---|---|---|
SegmentReadState | state |
Returns
Type | Description |
---|---|
DocValuesProducer |