Class Lucene45DocValuesFormat
Lucene 4.5 DocValues format.
Encodes the four per-document value types (Numeric, Binary, Sorted, SortedSet) with these strategies:
NUMERIC:
- Delta-compressed: per-document integers written in blocks of 16k. For each block the minimum value in that block is encoded, and each entry is a delta from that minimum value. Each block of deltas is compressed with bitpacking. For more information, see BlockPackedWriter.
- Table-compressed: when the number of unique values is very small (< 256), and when there are unused "gaps" in the range of values used (such as SmallSingle), a lookup table is written instead. Each per-document entry is instead the ordinal to this table, and those ordinals are compressed with bitpacking (PackedInt32s).
- GCD-compressed: when all numbers share a common divisor, such as dates, the greatest common divisor (GCD) is computed, and quotients are stored using Delta-compressed Numerics.
BINARY:
- Fixed-width Binary: one large concatenated byte[] is written, along with the fixed length. Each document's value can be addressed directly with multiplication (docID * length).
- Variable-width Binary: one large concatenated byte[] is written, along with end addresses for each document. The addresses are written in blocks of 16k, with the current absolute start for the block, and the average (expected) delta per entry. For each document the deviation from the delta (actual - expected) is written.
- Prefix-compressed Binary: values are written in chunks of 16, with the first value written completely and other values sharing prefixes. Chunk addresses are written in blocks of 16k, with the current absolute start for the block, and the average (expected) delta per entry. For each chunk the deviation from the delta (actual - expected) is written.
SORTED:
- Sorted: a mapping of ordinals to deduplicated terms is written as Prefix-Compressed Binary, along with the per-document ordinals written using one of the numeric strategies above.
SORTED_SET:
- SortedSet: a mapping of ordinals to deduplicated terms is written as Prefix-Compressed Binary, and an ordinal list and a per-document index into this list are written using the numeric strategies above.
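The delta and GCD strategies for numerics can be illustrated with a minimal sketch. It is not the Lucene.NET implementation: `NumericSketch` and its methods are hypothetical names, it handles a single block rather than 16k-entry blocks, it ignores arithmetic overflow, and it only reports the bit width that bitpacking would need instead of actually packing the values.

```csharp
using System;
using System.Linq;

// Hypothetical illustration of delta + GCD compression for one block of values.
public static class NumericSketch
{
    // Euclid's algorithm on non-negative inputs; Gcd(0, x) returns x.
    public static long Gcd(long a, long b)
    {
        while (b != 0)
        {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Splits values into (min, gcd, quotients) so that
    // values[i] == min + gcd * quotients[i]. Overflow is not handled.
    public static (long Min, long Gcd, long[] Quotients) Encode(long[] values)
    {
        long min = values.Min();
        long gcd = 0;
        foreach (long v in values)
        {
            gcd = Gcd(gcd, v - min); // deltas from the minimum are non-negative
        }
        if (gcd == 0)
        {
            gcd = 1; // all values are equal; avoid dividing by zero
        }
        long[] quotients = values.Select(v => (v - min) / gcd).ToArray();
        return (min, gcd, quotients);
    }

    // Reverses Encode for a single entry.
    public static long Decode(long min, long gcd, long quotient)
    {
        return min + gcd * quotient;
    }

    // Number of bits per entry that bitpacking would need for these quotients.
    public static int BitsRequired(long[] quotients)
    {
        long max = quotients.Max();
        int bits = 0;
        while (max != 0)
        {
            bits++;
            max >>= 1;
        }
        return Math.Max(bits, 1);
    }
}
```

For example, three timestamps at day granularity in milliseconds (days 3, 5 and 10) share a divisor of 86400000, so the sketch stores the quotients 0, 2 and 7, which fit in 3 bits each.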
Files:
- .dvd: DocValues data
- .dvm: DocValues metadata
The DocValues metadata or .dvm file.
For each DocValues field, this stores metadata, such as the offset into the DocValues data (.dvd).
DocValues metadata (.dvm) --> Header,<Entry>^NumFields,Footer
- Entry --> NumericEntry | BinaryEntry | SortedEntry | SortedSetEntry
- NumericEntry --> GCDNumericEntry | TableNumericEntry | DeltaNumericEntry
- GCDNumericEntry --> NumericHeader,MinValue,GCD
- TableNumericEntry --> NumericHeader,TableSize,Int64^TableSize (WriteInt64(long))
- DeltaNumericEntry --> NumericHeader
- NumericHeader --> FieldNumber,EntryType,NumericType,MissingOffset,PackedVersion,DataOffset,Count,BlockSize
- BinaryEntry --> FixedBinaryEntry | VariableBinaryEntry | PrefixBinaryEntry
- FixedBinaryEntry --> BinaryHeader
- VariableBinaryEntry --> BinaryHeader,AddressOffset,PackedVersion,BlockSize
- PrefixBinaryEntry --> BinaryHeader,AddressInterval,AddressOffset,PackedVersion,BlockSize
- BinaryHeader --> FieldNumber,EntryType,BinaryType,MissingOffset,MinLength,MaxLength,DataOffset
- SortedEntry --> FieldNumber,EntryType,BinaryEntry,NumericEntry
- SortedSetEntry --> EntryType,BinaryEntry,NumericEntry,NumericEntry
- FieldNumber,PackedVersion,MinLength,MaxLength,BlockSize,ValueCount --> VInt (WriteVInt32(int))
- EntryType,CompressionType --> Byte (WriteByte(byte))
- Header --> CodecHeader (WriteHeader(DataOutput, string, int))
- MinValue,GCD,MissingOffset,AddressOffset,DataOffset --> Int64 (WriteInt64(long))
- TableSize --> VInt (WriteVInt32(int))
- Footer --> CodecFooter (WriteFooter(IndexOutput))
Sorted fields have two entries: a Lucene45DocValuesProducer.BinaryEntry with the value metadata, and an ordinary Lucene45DocValuesProducer.NumericEntry for the document-to-ord metadata.
SortedSet fields have three entries: a Lucene45DocValuesProducer.BinaryEntry with the value metadata, and two Lucene45DocValuesProducer.NumericEntrys for the document-to-ord-index and ordinal list metadata.
FieldNumber of -1 indicates the end of metadata.
EntryType is 0 (Lucene45DocValuesProducer.NumericEntry) or 1 (Lucene45DocValuesProducer.BinaryEntry).
DataOffset is the pointer to the start of the data in the DocValues data (.dvd).
NumericType indicates how Numeric values will be compressed:
- 0 --> delta-compressed. For each block of 16k integers, every integer is delta-encoded from the minimum value within the block.
- 1 --> gcd-compressed. When all integers share a common divisor, only quotients are stored using blocks of delta-encoded ints.
- 2 --> table-compressed. When the number of unique numeric values is small and it would save space, a lookup table of unique values is written, followed by the ordinal for each document.
BinaryType indicates how Binary values will be stored:
- 0 --> fixed-width. All values have the same length, addressing by multiplication.
- 1 --> variable-width. An address for each value is stored.
- 2 --> prefix-compressed. An address to the start of every AddressInterval-th value is stored.
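The fixed-width and variable-width addressing schemes can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `BinaryAddressSketch` and its members are hypothetical names, the average delta is computed with integer division for simplicity, and the deviations are returned as a plain array rather than being bitpacked.

```csharp
using System;

// Hypothetical illustration of how binary values could be located in the
// concatenated byte[] and how monotonic addresses could be encoded.
public static class BinaryAddressSketch
{
    // Fixed-width: the value for docID starts at docID * length.
    public static long FixedStart(int docID, int length)
    {
        return (long)docID * length;
    }

    // Variable-width: endAddresses[docID] is where the document's value ends;
    // it starts where the previous document's value ended (0 for the first).
    public static (long Start, long End) VariableRange(long[] endAddresses, int docID)
    {
        long start = docID == 0 ? 0 : endAddresses[docID - 1];
        return (start, endAddresses[docID]);
    }

    // Encodes one block of increasing addresses as the block start, the
    // average (expected) delta, and per-entry deviations (actual - expected).
    public static (long Start, long AvgDelta, long[] Deviations) EncodeBlock(long[] addresses)
    {
        long start = addresses[0];
        long avg = addresses.Length > 1
            ? (addresses[addresses.Length - 1] - start) / (addresses.Length - 1)
            : 0;
        var deviations = new long[addresses.Length];
        for (int i = 0; i < addresses.Length; i++)
        {
            long expected = start + avg * i;
            deviations[i] = addresses[i] - expected;
        }
        return (start, avg, deviations);
    }

    // Recovers the address of entry i from an encoded block.
    public static long DecodeAddress(long start, long avgDelta, long[] deviations, int i)
    {
        return start + avgDelta * i + deviations[i];
    }
}
```

With end addresses 3, 8, 12, 20 the block start is 3 and the average delta is 5, so the stored deviations are 0, 0, -1, 2. They stay small whenever value lengths are roughly uniform, which is what makes them cheap to bitpack.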
The DocValues data or .dvd file.
For each DocValues field, this stores the actual per-document data (the heavy lifting).
DocValues data (.dvd) --> Header,<NumericData | BinaryData | SortedData>^NumFields,Footer
- NumericData --> DeltaCompressedNumerics | TableCompressedNumerics | GCDCompressedNumerics
- BinaryData --> Byte^DataLength (WriteByte(byte)),Addresses
- SortedData --> FST<Int64> (FST<T>)
- DeltaCompressedNumerics --> BlockPackedInts(blockSize=16k) (BlockPackedWriter)
- TableCompressedNumerics --> PackedInts (PackedInt32s)
- GCDCompressedNumerics --> BlockPackedInts(blockSize=16k) (BlockPackedWriter)
- Addresses --> MonotonicBlockPackedInts(blockSize=16k) (MonotonicBlockPackedWriter)
- Footer --> CodecFooter (WriteFooter(IndexOutput))
SortedSet entries store the list of ordinals in their BinaryData as a sequence of increasing vLongs (WriteVInt64(long)), delta-encoded.
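A delta-encoded list of increasing ordinals written as variable-length longs can be sketched as below. This assumes the usual vLong layout of seven payload bits per byte with the high bit as a continuation flag. `OrdListSketch` and its methods are hypothetical and use a `MemoryStream` in place of the library's output classes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical illustration of delta-encoding increasing ordinals as vLongs.
public static class OrdListSketch
{
    // Writes a non-negative long using 7 bits per byte, low bits first;
    // the high bit of each byte signals that more bytes follow.
    private static void WriteVLong(Stream output, long value)
    {
        ulong v = (ulong)value;
        while ((v & ~0x7FUL) != 0)
        {
            output.WriteByte((byte)((v & 0x7F) | 0x80));
            v >>= 7;
        }
        output.WriteByte((byte)v);
    }

    // Reads one value written by WriteVLong.
    private static long ReadVLong(Stream input)
    {
        long result = 0;
        int shift = 0;
        while (true)
        {
            int b = input.ReadByte();
            if (b < 0)
            {
                throw new EndOfStreamException();
            }
            result |= (long)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)
            {
                return result;
            }
            shift += 7;
        }
    }

    // Stores each ordinal as the difference from the previous one.
    public static byte[] Encode(long[] increasingOrds)
    {
        using (var ms = new MemoryStream())
        {
            long previous = 0;
            foreach (long ord in increasingOrds)
            {
                WriteVLong(ms, ord - previous);
                previous = ord;
            }
            return ms.ToArray();
        }
    }

    // Rebuilds the ordinals by summing the deltas.
    public static long[] Decode(byte[] data)
    {
        var result = new List<long>();
        using (var ms = new MemoryStream(data))
        {
            long previous = 0;
            while (ms.Position < ms.Length)
            {
                previous += ReadVLong(ms);
                result.Add(previous);
            }
        }
        return result.ToArray();
    }
}
```

Because the ordinals are increasing, every delta is non-negative and usually small, so most entries take a single byte.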
Note
This API is experimental and might change in incompatible ways in the next release.
Namespace: Lucene.Net.Codecs.Lucene45
Assembly: Lucene.Net.dll
Syntax
[DocValuesFormatName("Lucene45")]
public sealed class Lucene45DocValuesFormat : DocValuesFormat
Constructors
Lucene45DocValuesFormat()
Sole Constructor
Declaration
public Lucene45DocValuesFormat()
Methods
FieldsConsumer(SegmentWriteState)
Returns a DocValuesConsumer to write docvalues to the index.
Declaration
public override DocValuesConsumer FieldsConsumer(SegmentWriteState state)
Parameters
Type | Name | Description |
---|---|---|
SegmentWriteState | state |
Returns
Type | Description |
---|---|
DocValuesConsumer |
FieldsProducer(SegmentReadState)
Returns a DocValuesProducer to read docvalues from the index.
NOTE: by the time this call returns, it must hold open any files it will need to use; else, those files may be deleted. Additionally, required files may be deleted during the execution of this call before there is a chance to open them. Under these circumstances an IOException should be thrown by the implementation. IOExceptions are expected and will automatically cause a retry of the segment opening logic with the newly revised segments.
Declaration
public override DocValuesProducer FieldsProducer(SegmentReadState state)
Parameters
Type | Name | Description |
---|---|---|
SegmentReadState | state |
Returns
Type | Description |
---|---|
DocValuesProducer |