    Class Lucene42DocValuesFormat

    Lucene 4.2 DocValues format.

    Encodes the four per-document value types (Numeric, Binary, Sorted, SortedSet) with eight basic strategies.

    • Delta-compressed Numerics: per-document integers written in blocks of 4096. For each block the minimum value is encoded, and each entry is a delta from that minimum value.
    • Table-compressed Numerics: when the number of unique values is very small, a lookup table is written instead. Each per-document entry is then the ordinal into this table.
    • Uncompressed Numerics: when all values would fit into a single byte, and the acceptableOverheadRatio would pack values into 8 bits per value anyway, they are written as absolute values (with no indirection or packing) for performance.
    • GCD-compressed Numerics: when all numbers share a common divisor, such as dates, the greatest common divisor (GCD) is computed, and quotients are stored using Delta-compressed Numerics.
    • Fixed-width Binary: one large concatenated byte[] is written, along with the fixed length. Each document's value can be addressed by docID*length.
    • Variable-width Binary: one large concatenated byte[] is written, along with end addresses for each document. The addresses are written in blocks of 4096, with the current absolute start for the block, and the average (expected) delta per entry. For each document the deviation from the delta (actual - expected) is written.
    • Sorted: an FST mapping deduplicated terms to ordinals is written, along with the per-document ordinals written using one of the numeric strategies above.
    • SortedSet: an FST mapping deduplicated terms to ordinals is written, along with the per-document ordinal list written using one of the binary strategies above.
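
The delta and GCD strategies for numerics can be illustrated with a minimal sketch. It follows only the prose description above, not the codec's actual writer: the class `NumericSketch` and all of its members are hypothetical names, bit packing is left out, and values are assumed to stay well inside the `long` range.

```csharp
using System;
using System.Linq;

public static class NumericSketch
{
    // Delta-compress one block: keep the minimum, then store each value minus that minimum.
    public static (long Min, long[] Deltas) DeltaEncode(long[] block)
    {
        long min = block.Min();
        return (min, block.Select(v => v - min).ToArray());
    }

    // Reverse of DeltaEncode: add the block minimum back to every delta.
    public static long[] DeltaDecode(long min, long[] deltas)
        => deltas.Select(d => min + d).ToArray();

    // Euclid's algorithm on non-negative inputs; Gcd(0, x) returns x.
    private static long Gcd(long a, long b)
    {
        while (b != 0) { (a, b) = (b, a % b); }
        return a;
    }

    // GCD-compress: divide every value by the shared divisor, then delta-compress the quotients.
    public static (long Divisor, long Min, long[] Deltas) GcdEncode(long[] values)
    {
        long divisor = values.Aggregate(0L, (g, v) => Gcd(g, Math.Abs(v)));
        if (divisor == 0) divisor = 1; // every value was zero
        var (min, deltas) = DeltaEncode(values.Select(v => v / divisor).ToArray());
        return (divisor, min, deltas);
    }

    // Reverse of GcdEncode: rebuild the quotients, then multiply by the divisor.
    public static long[] GcdDecode(long divisor, long min, long[] deltas)
        => DeltaDecode(min, deltas).Select(q => q * divisor).ToArray();
}
```

For day-aligned timestamps in milliseconds, `GcdEncode` would find a divisor of 86400000 and leave only small day counts to be delta-encoded, which is where the space saving comes from.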

    Files:
    1. .dvd: DocValues data
    2. .dvm: DocValues metadata
    1. The DocValues metadata or .dvm file.

      For each DocValues field, this stores metadata, such as the offset into the DocValues data (.dvd).

      DocValues metadata (.dvm) --> Header,<FieldNumber,EntryType,Entry>^NumFields,Footer

      • Entry --> NumericEntry | BinaryEntry | SortedEntry
      • NumericEntry --> DataOffset,CompressionType,PackedVersion
      • BinaryEntry --> DataOffset,DataLength,MinLength,MaxLength,PackedVersion?,BlockSize?
      • SortedEntry --> DataOffset,ValueCount
      • FieldNumber,PackedVersion,MinLength,MaxLength,BlockSize,ValueCount --> VInt (WriteVInt32(int))
      • DataOffset,DataLength --> Int64 (WriteInt64(long))
      • EntryType,CompressionType --> Byte (WriteByte(byte))
      • Header --> CodecHeader (WriteHeader(DataOutput, string, int))
      • Footer --> CodecFooter (WriteFooter(IndexOutput))

      Sorted fields have two entries: a SortedEntry with the FST metadata, and an ordinary NumericEntry for the document-to-ord metadata.

      SortedSet fields have two entries: a SortedEntry with the FST metadata, and an ordinary BinaryEntry for the document-to-ord-list metadata.

      FieldNumber of -1 indicates the end of metadata.

      EntryType is 0 (NumericEntry), 1 (BinaryEntry), or 2 (SortedEntry).

      DataOffset is the pointer to the start of the data in the DocValues data (.dvd)

      CompressionType indicates how Numeric values will be compressed:
      • 0 --> delta-compressed. For each block of 4096 integers, every integer is delta-encoded from the minimum value within the block.
      • 1 --> table-compressed. When the number of unique numeric values is small and it would save space, a lookup table of unique values is written, followed by the ordinal for each document.
      • 2 --> uncompressed. When the acceptableOverheadRatio parameter would upgrade the number of bits required to 8, and all values fit in a byte, these are written as absolute binary values for performance.
      • 3 --> gcd-compressed. When all integers share a common divisor, only quotients are stored using blocks of delta-encoded ints.

      MinLength and MaxLength represent the min and max byte[] value lengths for Binary values. If they are equal, then all values are of a fixed size, and can be addressed as DataOffset + (docID * length). Otherwise, the binary values are of variable size, and packed integer metadata (PackedVersion,BlockSize) is written for the addresses.
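
The two addressing schemes for binary values can be sketched as follows. This is an illustration under stated assumptions, not the codec's reader: `BinaryAddressingSketch` and its members are hypothetical names, end addresses are assumed to be relative to DataOffset, and the monotonic encoding keeps only the idea of "expected position plus deviation" without any bit packing.

```csharp
using System;

public static class BinaryAddressingSketch
{
    // Fixed-width: every value has the same length, so the start is one multiplication away.
    public static long FixedStart(long dataOffset, int docId, int length)
        => dataOffset + (long)docId * length;

    // Variable-width: endAddresses[i] is where document i's value ends.
    // A document's value starts where the previous document's value ended.
    public static (long Start, int Length) VariableSlice(long dataOffset, long[] endAddresses, int docId)
    {
        long start = docId == 0 ? 0 : endAddresses[docId - 1];
        long end = endAddresses[docId];
        return (dataOffset + start, (int)(end - start));
    }

    // Expected address of entry i when addresses grow by avg per entry.
    private static long Expected(long first, double avg, int i)
        => first + (long)(avg * i);

    // Encode one block of increasing addresses as first value, average delta,
    // and per-entry deviation (actual - expected).
    public static (long First, double Avg, long[] Deviations) MonotonicEncode(long[] addresses)
    {
        int n = addresses.Length;
        long first = addresses[0];
        double avg = n > 1 ? (double)(addresses[n - 1] - first) / (n - 1) : 0.0;
        var deviations = new long[n];
        for (int i = 0; i < n; i++)
        {
            deviations[i] = addresses[i] - Expected(first, avg, i);
        }
        return (first, avg, deviations);
    }

    // Rebuild the addresses by adding each deviation back to its expected value.
    public static long[] MonotonicDecode(long first, double avg, long[] deviations)
    {
        var addresses = new long[deviations.Length];
        for (int i = 0; i < deviations.Length; i++)
        {
            addresses[i] = Expected(first, avg, i) + deviations[i];
        }
        return addresses;
    }
}
```

When value lengths are roughly uniform, the deviations stay close to zero and need far fewer bits than the absolute addresses would.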
    2. The DocValues data or .dvd file.

      For each DocValues field, this stores the actual per-document data (the heavy lifting).

      DocValues data (.dvd) --> Header,<NumericData | BinaryData | SortedData>^NumFields,Footer

      • NumericData --> DeltaCompressedNumerics | TableCompressedNumerics | UncompressedNumerics | GCDCompressedNumerics
      • BinaryData --> Byte^DataLength (WriteByte(byte)),Addresses
      • SortedData --> FST<Int64> (FST<T>)
      • DeltaCompressedNumerics --> BlockPackedInts(blockSize=4096) (BlockPackedWriter)
      • TableCompressedNumerics --> TableSize, Int64^TableSize (WriteInt64(long)), PackedInts (PackedInt32s)
      • UncompressedNumerics --> Byte^maxdoc (WriteByte(byte))
      • Addresses --> MonotonicBlockPackedInts(blockSize=4096) (MonotonicBlockPackedWriter)
      • Footer --> CodecFooter (WriteFooter(IndexOutput))

      SortedSet entries store the list of ordinals in their BinaryData as a sequence of increasing vLongs (WriteVInt64(long)), delta-encoded.
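
The per-document ordinal list of a SortedSet field can be sketched as below. This is an illustrative round trip rather than the codec's own code: `OrdListSketch` and its members are hypothetical names, the stream is a plain `MemoryStream` rather than Lucene.Net's `DataOutput`, and the variable-length layout (7 payload bits per byte, low-order group first, high bit set while more bytes follow) is the usual vLong convention and is reimplemented here by hand.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class OrdListSketch
{
    // Write a non-negative long using 7 bits per byte, low-order group first.
    private static void WriteVInt64(Stream output, long value)
    {
        ulong v = (ulong)value;
        while (v >= 0x80)
        {
            output.WriteByte((byte)((v & 0x7F) | 0x80)); // high bit: more bytes follow
            v >>= 7;
        }
        output.WriteByte((byte)v);
    }

    // Read one value written by WriteVInt64.
    private static long ReadVInt64(Stream input)
    {
        long result = 0;
        int shift = 0;
        while (true)
        {
            int b = input.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            result |= (long)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }

    // Encode increasing ordinals as deltas from the previous ordinal (the first is a delta from 0).
    public static byte[] Encode(IReadOnlyList<long> sortedOrds)
    {
        using var ms = new MemoryStream();
        long previous = 0;
        foreach (long ord in sortedOrds)
        {
            WriteVInt64(ms, ord - previous);
            previous = ord;
        }
        return ms.ToArray();
    }

    // Decode by accumulating the deltas until the buffer is exhausted.
    public static List<long> Decode(byte[] bytes)
    {
        var ords = new List<long>();
        using var ms = new MemoryStream(bytes);
        long current = 0;
        while (ms.Position < ms.Length)
        {
            current += ReadVInt64(ms);
            ords.Add(current);
        }
        return ords;
    }
}
```

Because the ordinals are increasing, the deltas are small and most of them fit in a single byte each.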

    Limitations:
    • Binary doc values can be at most MAX_BINARY_FIELD_LENGTH in length.
    Inheritance
    object
    DocValuesFormat
    Lucene42DocValuesFormat
    Inherited Members
    DocValuesFormat.SetDocValuesFormatFactory(IDocValuesFormatFactory)
    DocValuesFormat.GetDocValuesFormatFactory()
    DocValuesFormat.Name
    DocValuesFormat.ToString()
    DocValuesFormat.ForName(string)
    DocValuesFormat.AvailableDocValuesFormats
    object.Equals(object)
    object.Equals(object, object)
    object.GetHashCode()
    object.GetType()
    object.MemberwiseClone()
    object.ReferenceEquals(object, object)
    Namespace: Lucene.Net.Codecs.Lucene42
    Assembly: Lucene.Net.dll
    Syntax
    [Obsolete("Only for reading old 4.2 segments")]
    [DocValuesFormatName("Lucene42")]
    public class Lucene42DocValuesFormat : DocValuesFormat

    Constructors

    Lucene42DocValuesFormat()

    Calls Lucene42DocValuesFormat(PackedInts.DEFAULT) (Lucene42DocValuesFormat(float)).

    Declaration
    public Lucene42DocValuesFormat()

    Lucene42DocValuesFormat(float)

    Creates a new Lucene42DocValuesFormat with the specified acceptableOverheadRatio for NumericDocValues.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Declaration
    public Lucene42DocValuesFormat(float acceptableOverheadRatio)
    Parameters
    Type Name Description
    float acceptableOverheadRatio

    Compression parameter for numerics. Currently this is only used when the number of unique values is small.

    Fields

    MAX_BINARY_FIELD_LENGTH

    Maximum length for each binary doc values field.

    Declaration
    public static readonly int MAX_BINARY_FIELD_LENGTH
    Field Value
    Type Description
    int

    m_acceptableOverheadRatio

    Declaration
    protected readonly float m_acceptableOverheadRatio
    Field Value
    Type Description
    float

    Methods

    FieldsConsumer(SegmentWriteState)

    Returns a DocValuesConsumer to write docvalues to the index.

    Declaration
    public override DocValuesConsumer FieldsConsumer(SegmentWriteState state)
    Parameters
    Type Name Description
    SegmentWriteState state
    Returns
    Type Description
    DocValuesConsumer
    Overrides
    DocValuesFormat.FieldsConsumer(SegmentWriteState)

    FieldsProducer(SegmentReadState)

    Returns a DocValuesProducer to read docvalues from the index.

    NOTE: by the time this call returns, it must hold open any files it will need to use; else, those files may be deleted. Additionally, required files may be deleted during the execution of this call before there is a chance to open them. Under these circumstances an IOException should be thrown by the implementation. IOExceptions are expected and will automatically cause a retry of the segment opening logic with the newly revised segments.
    Declaration
    public override DocValuesProducer FieldsProducer(SegmentReadState state)
    Parameters
    Type Name Description
    SegmentReadState state
    Returns
    Type Description
    DocValuesProducer
    Overrides
    DocValuesFormat.FieldsProducer(SegmentReadState)
    Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.