Class Lucene45DocValuesFormat
Lucene 4.5 DocValues format.
Encodes the four per-document value types (Numeric, Binary, Sorted, SortedSet) with these strategies:
- Delta-compressed: per-document integers are written in blocks of 16k. For each block, the minimum value in that block is encoded, and each entry is a delta from that minimum value. Each block of deltas is compressed with bitpacking. For more information, see BlockPackedWriter.
- Table-compressed: when the number of unique values is very small (< 256), and when there are unused "gaps" in the range of values used (such as SmallSingle), a lookup table is written instead. Each per-document entry is instead the ordinal to this table, and those ordinals are compressed with bitpacking (PackedInt32s).
- GCD-compressed: when all numbers share a common divisor, such as dates, the greatest common divisor (GCD) is computed, and quotients are stored using Delta-compressed Numerics.
- Fixed-width Binary: one large concatenated byte[] is written, along with the fixed length. Each document's value can be addressed directly with multiplication (docID * length).
- Variable-width Binary: one large concatenated byte[] is written, along with end addresses for each document. The addresses are written in blocks of 16k, with the current absolute start for the block, and the average (expected) delta per entry. For each document the deviation from the delta (actual - expected) is written.
- Prefix-compressed Binary: values are written in chunks of 16, with the first value written completely and other values sharing prefixes. Chunk addresses are written in blocks of 16k, with the current absolute start for the block, and the average (expected) delta per entry. For each chunk the deviation from the delta (actual - expected) is written.
- Sorted: a mapping of ordinals to deduplicated terms is written as Prefix-Compressed Binary, along with the per-document ordinals written using one of the numeric strategies above.
- SortedSet: a mapping of ordinals to deduplicated terms is written as Prefix-Compressed Binary, and an ordinal list and a per-document index into this list are written using the numeric strategies above.
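The Prefix-compressed Binary strategy used for the Sorted and SortedSet term mappings can be illustrated with a minimal sketch. The class `PrefixChunkSketch` and its methods are hypothetical names, not part of Lucene.NET, and the sketch assumes each value shares its prefix with the immediately preceding value in the chunk; the actual on-disk layout may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not the Lucene.NET implementation.
public static class PrefixChunkSketch
{
    // Encodes one chunk of sorted values. The first value is kept whole
    // (prefix length 0); each later value keeps only the bytes that differ
    // from the previous value (assumption: prefix is relative to the previous value).
    public static List<(int PrefixLength, byte[] Suffix)> EncodeChunk(byte[][] sortedValues)
    {
        var entries = new List<(int PrefixLength, byte[] Suffix)>();
        byte[] previous = Array.Empty<byte>();
        for (int i = 0; i < sortedValues.Length; i++)
        {
            byte[] current = sortedValues[i];
            int shared = 0;
            if (i > 0)
            {
                int limit = Math.Min(previous.Length, current.Length);
                while (shared < limit && previous[shared] == current[shared]) shared++;
            }
            var suffix = new byte[current.Length - shared];
            Array.Copy(current, shared, suffix, 0, suffix.Length);
            entries.Add((shared, suffix));
            previous = current;
        }
        return entries;
    }

    // Rebuilds the values by copying the shared prefix from the previously
    // decoded value and appending the stored suffix.
    public static List<byte[]> DecodeChunk(List<(int PrefixLength, byte[] Suffix)> entries)
    {
        var values = new List<byte[]>();
        byte[] previous = Array.Empty<byte>();
        foreach (var (prefixLength, suffix) in entries)
        {
            var value = new byte[prefixLength + suffix.Length];
            Array.Copy(previous, 0, value, 0, prefixLength);
            Array.Copy(suffix, 0, value, prefixLength, suffix.Length);
            values.Add(value);
            previous = value;
        }
        return values;
    }
}
```

For the ASCII terms "apple" and "apply", the second entry in this sketch has a prefix length of 4 and a one-byte suffix, which is where the space saving comes from when terms are sorted.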
Files:
- .dvd: DocValues data
- .dvm: DocValues metadata
- The DocValues metadata or .dvm file.
For each DocValues field, this stores metadata, such as the offset into the DocValues data (.dvd).
DocValues metadata (.dvm) --> Header,<Entry>^NumFields,Footer
- Entry --> NumericEntry | BinaryEntry | SortedEntry | SortedSetEntry
- NumericEntry --> GCDNumericEntry | TableNumericEntry | DeltaNumericEntry
- GCDNumericEntry --> NumericHeader,MinValue,GCD
- TableNumericEntry --> NumericHeader,TableSize,Int64^TableSize (WriteInt64(Int64))
- DeltaNumericEntry --> NumericHeader
- NumericHeader --> FieldNumber,EntryType,NumericType,MissingOffset,PackedVersion,DataOffset,Count,BlockSize
- BinaryEntry --> FixedBinaryEntry | VariableBinaryEntry | PrefixBinaryEntry
- FixedBinaryEntry --> BinaryHeader
- VariableBinaryEntry --> BinaryHeader,AddressOffset,PackedVersion,BlockSize
- PrefixBinaryEntry --> BinaryHeader,AddressInterval,AddressOffset,PackedVersion,BlockSize
- BinaryHeader --> FieldNumber,EntryType,BinaryType,MissingOffset,MinLength,MaxLength,DataOffset
- SortedEntry --> FieldNumber,EntryType,BinaryEntry,NumericEntry
- SortedSetEntry --> EntryType,BinaryEntry,NumericEntry,NumericEntry
- FieldNumber,PackedVersion,MinLength,MaxLength,BlockSize,ValueCount --> VInt (WriteVInt32(Int32))
- EntryType,CompressionType --> Byte (WriteByte(Byte))
- Header --> CodecHeader (WriteHeader(DataOutput, String, Int32))
- MinValue,GCD,MissingOffset,AddressOffset,DataOffset --> Int64 (WriteInt64(Int64))
- TableSize --> VInt (WriteVInt32(Int32))
- Footer --> CodecFooter (WriteFooter(IndexOutput))
Sorted fields have two entries: a Lucene45DocValuesProducer.BinaryEntry with the value metadata, and an ordinary Lucene45DocValuesProducer.NumericEntry for the document-to-ord metadata.
SortedSet fields have three entries: a Lucene45DocValuesProducer.BinaryEntry with the value metadata, and two Lucene45DocValuesProducer.NumericEntry instances for the document-to-ord-index and ordinal list metadata.
FieldNumber of -1 indicates the end of metadata.
EntryType is a 0 (Lucene45DocValuesProducer.NumericEntry) or 1 (Lucene45DocValuesProducer.BinaryEntry).
DataOffset is the pointer to the start of the data in the DocValues data (.dvd).
NumericType indicates how Numeric values will be compressed:
- 0 --> delta-compressed. For each block of 16k integers, every integer is delta-encoded from the minimum value within the block.
- 1 --> gcd-compressed. When all integers share a common divisor, only quotients are stored using blocks of delta-encoded ints.
- 2 --> table-compressed. When the number of unique numeric values is small and it would save space, a lookup table of unique values is written, followed by the ordinal for each document.
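As a rough illustration of the three NumericType strategies, the following sketch transforms an array of values the way each strategy is described. `NumericStrategySketch` and its methods are hypothetical names; the sketch skips the bitpacking step, assumes a non-empty input whose value range does not overflow a 64-bit integer, and assumes the GCD is taken over offsets from the minimum (in line with GCDNumericEntry storing both MinValue and GCD).

```csharp
using System;
using System.Linq;

// Hypothetical illustration only; bitpacking of the results is omitted.
public static class NumericStrategySketch
{
    // NumericType 0: store the block minimum once, then each value minus that minimum.
    public static (long Min, long[] Deltas) DeltaEncode(long[] block)
    {
        long min = block.Min();
        return (min, block.Select(v => v - min).ToArray());
    }

    // NumericType 1: divide offsets from the minimum by their greatest common divisor
    // and keep only the quotients (assumption: GCD is computed over value - min).
    public static (long Min, long Gcd, long[] Quotients) GcdEncode(long[] values)
    {
        long min = values.Min();
        long gcd = 0;
        foreach (long v in values) gcd = Gcd(gcd, v - min);
        if (gcd == 0) gcd = 1; // all values are equal
        return (min, gcd, values.Select(v => (v - min) / gcd).ToArray());
    }

    // NumericType 2: write the unique values as a lookup table and replace
    // every per-document value with its ordinal in that table.
    public static (long[] Table, int[] Ordinals) TableEncode(long[] values)
    {
        long[] table = values.Distinct().OrderBy(v => v).ToArray();
        int[] ordinals = values.Select(v => Array.BinarySearch(table, v)).ToArray();
        return (table, ordinals);
    }

    private static long Gcd(long a, long b)
    {
        while (b != 0) { (a, b) = (b, a % b); }
        return Math.Abs(a);
    }
}
```

In each case the stored numbers are smaller than the originals, so fewer bits per value are needed once they are bitpacked.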
BinaryType indicates how Binary values will be stored:
- 0 --> fixed-width. All values have the same length, addressing by multiplication.
- 1 --> variable-width. An address for each value is stored.
- 2 --> prefix-compressed. An address to the start of every interval'th value is stored.
MinLength and MaxLength represent the min and max byte[] value lengths for Binary values. If they are equal, then all values are of a fixed size, and can be addressed as DataOffset + (docID * length). Otherwise, the binary values are of variable size, and packed integer metadata (PackedVersion,BlockSize) is written for the addresses.
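The two addressing modes can be sketched against an in-memory byte[]. `BinaryAddressingSketch` is a hypothetical helper, not a Lucene.NET type; it assumes the end addresses are already decoded into a long[] and are relative to DataOffset.

```csharp
using System;

// Hypothetical illustration only; real readers work on index inputs, not arrays.
public static class BinaryAddressingSketch
{
    // Fixed-width: the value starts at DataOffset + (docID * length).
    public static byte[] ReadFixed(byte[] data, long dataOffset, int length, int docID)
    {
        var result = new byte[length];
        Array.Copy(data, dataOffset + (long)docID * length, result, 0, length);
        return result;
    }

    // Variable-width: the value spans from the previous document's end address
    // (or 0 for the first document) to this document's end address.
    public static byte[] ReadVariable(byte[] data, long dataOffset, long[] endAddresses, int docID)
    {
        long start = docID == 0 ? 0 : endAddresses[docID - 1];
        int length = (int)(endAddresses[docID] - start);
        var result = new byte[length];
        Array.Copy(data, dataOffset + start, result, 0, length);
        return result;
    }
}
```

The fixed-width path needs no per-document metadata at all, which is why it is chosen whenever MinLength equals MaxLength.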
MissingOffset points to a byte[] containing a bitset of all documents that had a value for the field. If it is -1, then there are no missing values.
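A lookup against such a bitset might look like the sketch below. `MissingBitsSketch` is a hypothetical name, and the bit order (least significant bit first within each byte) is an assumption made for illustration rather than something this page specifies.

```csharp
// Hypothetical illustration only; bit order within each byte is an assumption.
public static class MissingBitsSketch
{
    // Returns true when the document has a value for the field.
    public static bool HasValue(byte[] data, long missingOffset, int docID)
    {
        if (missingOffset == -1) return true; // no missing values at all
        int b = data[missingOffset + (docID >> 3)]; // byte holding this document's bit
        return (b & (1 << (docID & 7))) != 0;
    }
}
```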
Checksum contains the CRC32 checksum of all bytes in the .dvm file up until the checksum. This is used to verify the integrity of the file on opening the index.
- The DocValues data or .dvd file.
For each DocValues field, this stores the actual per-document data (the heavy-lifting).
DocValues data (.dvd) --> Header,<NumericData | BinaryData | SortedData>^NumFields,Footer
- NumericData --> DeltaCompressedNumerics | TableCompressedNumerics | GCDCompressedNumerics
- BinaryData --> Byte^DataLength (WriteByte(Byte)),Addresses
- SortedData --> FST<Int64> (FST<T>)
- DeltaCompressedNumerics --> BlockPackedInts(blockSize=16k) (BlockPackedWriter)
- TableCompressedNumerics --> PackedInts (PackedInt32s)
- GCDCompressedNumerics --> BlockPackedInts(blockSize=16k) (BlockPackedWriter)
- Addresses --> MonotonicBlockPackedInts(blockSize=16k) (MonotonicBlockPackedWriter)
- Footer --> CodecFooter (WriteFooter(IndexOutput))
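The Addresses stream relies on addresses growing at a roughly constant rate: each block keeps its absolute start and an average (expected) delta, and stores only each entry's deviation from the expected value. The sketch below models a single block; `MonotonicAddressSketch` is a hypothetical name, the use of a float average is an assumption, and the bitpacking of the deviations is left out.

```csharp
using System;

// Hypothetical illustration only; models one block and omits bitpacking.
public static class MonotonicAddressSketch
{
    // Splits increasing addresses into a start, an average delta, and
    // per-entry deviations (actual - expected).
    public static (long Start, float AvgDelta, long[] Deviations) Encode(long[] addresses)
    {
        if (addresses.Length == 0) return (0L, 0f, Array.Empty<long>());
        long start = addresses[0];
        float avg = addresses.Length == 1
            ? 0f
            : (float)(addresses[addresses.Length - 1] - start) / (addresses.Length - 1);
        var deviations = new long[addresses.Length];
        for (int i = 0; i < addresses.Length; i++)
        {
            long expected = start + (long)(avg * i);
            deviations[i] = addresses[i] - expected;
        }
        return (start, avg, deviations);
    }

    // Recomputes the expected address with the same formula and adds the deviation.
    public static long Decode(long start, float avgDelta, long[] deviations, int index)
    {
        return start + (long)(avgDelta * index) + deviations[index];
    }
}
```

When value lengths are similar, the deviations stay close to zero and need far fewer bits than the absolute addresses would.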
SortedSet entries store the list of ordinals in their BinaryData as a sequence of increasing vLongs (WriteVInt64(Int64)), delta-encoded.
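To make the delta-encoded vLong idea concrete, here is a minimal sketch that writes one increasing ordinal list as variable-length deltas and reads it back. `OrdinalListSketch` is a hypothetical name; the byte layout shown (7 data bits per byte, high bit set when more bytes follow) is the usual variable-length integer scheme and is assumed here rather than taken from this page.

```csharp
using System.Collections.Generic;

// Hypothetical illustration only; assumes ordinals are non-negative and increasing.
public static class OrdinalListSketch
{
    // Writes each ordinal as the difference from the previous one, using
    // 7 bits per byte with the high bit as a continuation flag.
    public static byte[] Encode(long[] sortedOrds)
    {
        var output = new List<byte>();
        long previous = 0;
        foreach (long ord in sortedOrds)
        {
            ulong delta = (ulong)(ord - previous);
            while (delta >= 0x80)
            {
                output.Add((byte)((delta & 0x7F) | 0x80));
                delta >>= 7;
            }
            output.Add((byte)delta);
            previous = ord;
        }
        return output.ToArray();
    }

    // Reads the deltas back and accumulates them into absolute ordinals.
    public static List<long> Decode(byte[] bytes)
    {
        var ords = new List<long>();
        long previous = 0;
        int pos = 0;
        while (pos < bytes.Length)
        {
            long delta = 0;
            int shift = 0;
            byte b;
            do
            {
                b = bytes[pos++];
                delta |= (long)(b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta;
            ords.Add(previous);
        }
        return ords;
    }
}
```

Because the ordinals are increasing, most deltas are small and fit in a single byte.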
Namespace: Lucene.Net.Codecs.Lucene45
Assembly: Lucene.Net.dll
Syntax
public sealed class Lucene45DocValuesFormat : DocValuesFormat
Constructors
Lucene45DocValuesFormat()
Sole Constructor
Declaration
public Lucene45DocValuesFormat()
Methods
FieldsConsumer(SegmentWriteState)
Declaration
public override DocValuesConsumer FieldsConsumer(SegmentWriteState state)
Parameters
Type | Name | Description |
---|---|---|
SegmentWriteState | state | |
Returns
Type | Description |
---|---|
DocValuesConsumer | |
Overrides
DocValuesFormat.FieldsConsumer(SegmentWriteState)
FieldsProducer(SegmentReadState)
Declaration
public override DocValuesProducer FieldsProducer(SegmentReadState state)
Parameters
Type | Name | Description |
---|---|---|
SegmentReadState | state | |
Returns
Type | Description |
---|---|
DocValuesProducer | |