    Class Lucene40DocValuesFormat

    Lucene 4.0 DocValues format.

    Files:
    • .dv.cfs: compound container (CompoundFileDirectory)
    • .dv.cfe: compound entries (CompoundFileDirectory)
    Entries within the compound file:
    • <segment>_<fieldNumber>.dat: data values
    • <segment>_<fieldNumber>.idx: index into the .dat for DEREF types

    There are several types of DocValues with different encodings. From the perspective of filenames, all types store their values in .dat entries within the compound file. In the case of dereferenced/sorted types, the .dat contains only the unique values, and an additional .idx file contains pointers to these unique values.

    Formats (X^n denotes n repetitions of X):
    • VAR_INTS .dat --> Header, PackedType, MinValue, DefaultValue, PackedStream
    • FIXED_INTS_8 .dat --> Header, ValueSize, Byte (WriteByte(byte))^maxdoc
    • FIXED_INTS_16 .dat --> Header, ValueSize, Short (WriteInt16(short))^maxdoc
    • FIXED_INTS_32 .dat --> Header, ValueSize, Int32 (WriteInt32(int))^maxdoc
    • FIXED_INTS_64 .dat --> Header, ValueSize, Int64 (WriteInt64(long))^maxdoc
    • FLOAT_32 .dat --> Header, ValueSize, Float32^maxdoc
    • FLOAT_64 .dat --> Header, ValueSize, Float64^maxdoc
    • BYTES_FIXED_STRAIGHT .dat --> Header, ValueSize, (Byte (WriteByte(byte)) * ValueSize)^maxdoc
    • BYTES_VAR_STRAIGHT .idx --> Header, TotalBytes, Addresses
    • BYTES_VAR_STRAIGHT .dat --> Header, (Byte (WriteByte(byte)) * variable ValueSize)^maxdoc
    • BYTES_FIXED_DEREF .idx --> Header, NumValues, Addresses
    • BYTES_FIXED_DEREF .dat --> Header, ValueSize, (Byte (WriteByte(byte)) * ValueSize)^NumValues
    • BYTES_VAR_DEREF .idx --> Header, TotalVarBytes, Addresses
    • BYTES_VAR_DEREF .dat --> Header, (LengthPrefix + Byte (WriteByte(byte)) * variable ValueSize)^NumValues
    • BYTES_FIXED_SORTED .idx --> Header, NumValues, Ordinals
    • BYTES_FIXED_SORTED .dat --> Header, ValueSize, (Byte (WriteByte(byte)) * ValueSize)^NumValues
    • BYTES_VAR_SORTED .idx --> Header, TotalVarBytes, Addresses, Ordinals
    • BYTES_VAR_SORTED .dat --> Header, (Byte (WriteByte(byte)) * variable ValueSize)^NumValues
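
As an illustration of the fixed-width layouts listed above, the following is a minimal sketch of the body of a FIXED_INTS_32 entry: a ValueSize field followed by one Int32 per document. It is not the Lucene.NET writer. The class and method names are hypothetical, the CodecHeader is omitted entirely, and the sketch assumes the high-byte-first (big-endian) order that WriteInt32(int) uses in this generation of the format.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

// Hypothetical helper, not part of Lucene.NET.
public static class FixedInts32Sketch
{
    // Produces ValueSize followed by one big-endian Int32 per document.
    // The CodecHeader that precedes these bytes in a real entry is omitted.
    public static byte[] WriteBody(int[] valuesByDocId)
    {
        using (var stream = new MemoryStream())
        {
            byte[] buffer = new byte[4];

            // ValueSize: every value in this entry occupies 4 bytes.
            BinaryPrimitives.WriteInt32BigEndian(buffer, sizeof(int));
            stream.Write(buffer, 0, buffer.Length);

            foreach (int value in valuesByDocId)
            {
                BinaryPrimitives.WriteInt32BigEndian(buffer, value);
                stream.Write(buffer, 0, buffer.Length);
            }

            return stream.ToArray();
        }
    }

    // Because the width is fixed, the value for a document is found by
    // arithmetic alone: skip the ValueSize field, then docId * ValueSize bytes.
    public static int ReadValue(byte[] body, int docId)
    {
        int valueSize = BinaryPrimitives.ReadInt32BigEndian(body.AsSpan(0, 4));
        int offset = 4 + docId * valueSize;
        return BinaryPrimitives.ReadInt32BigEndian(body.AsSpan(offset, 4));
    }
}
```

The same offset arithmetic is why the fixed-width straight types need no .idx entry.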
    Data Types:
    • Header --> CodecHeader (WriteHeader(DataOutput, string, int))
    • PackedType --> Byte (WriteByte(byte))
    • MaxAddress, MinValue, DefaultValue --> Int64 (WriteInt64(long))
    • PackedStream, Addresses, Ordinals --> PackedInt32s
    • ValueSize, NumValues --> Int32 (WriteInt32(int))
    • Float32 --> 32-bit float encoded with SingleToRawInt32Bits(float) then written as Int32 (WriteInt32(int))
    • Float64 --> 64-bit float encoded with DoubleToRawInt64Bits(double) then written as Int64 (WriteInt64(long))
    • TotalBytes --> VLong (WriteVInt64(long))
    • TotalVarBytes --> Int64 (WriteInt64(long))
    • LengthPrefix --> Length of the data value as VInt (WriteVInt32(int)) (maximum of 2 bytes)
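
The primitive encodings above can be sketched as follows. This is a hedged illustration and not the Lucene.NET implementation: the class name is hypothetical, the .NET BitConverter methods stand in for SingleToRawInt32Bits(float) and DoubleToRawInt64Bits(double) on the assumption that they produce the same raw bit patterns, and EncodeVInt32 shows the general variable-length scheme (seven payload bits per byte, low-order group first, high bit set when more bytes follow) without claiming to reproduce the exact LengthPrefix bytes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET.
public static class PrimitiveEncodingSketch
{
    // Reinterprets the IEEE 754 bits of a float as an Int32.
    public static int Float32Bits(float value) => BitConverter.SingleToInt32Bits(value);

    // Reinterprets the IEEE 754 bits of a double as an Int64.
    public static long Float64Bits(double value) => BitConverter.DoubleToInt64Bits(value);

    // Variable-length encoding: 7 bits per byte, low-order group first,
    // with the high bit of each byte signalling that another byte follows.
    public static byte[] EncodeVInt32(int value)
    {
        var bytes = new List<byte>();
        uint remaining = (uint)value;

        while ((remaining & ~0x7Fu) != 0)
        {
            bytes.Add((byte)((remaining & 0x7F) | 0x80));
            remaining >>= 7;
        }

        bytes.Add((byte)remaining);
        return bytes.ToArray();
    }
}
```

Under this scheme, values below 128 occupy one byte and values below 16384 occupy two.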
    Notes:
    • PackedType is 0 when compressed, 1 when the stream is written as 64-bit integers.
    • Addresses stores pointers to the actual byte location (indexed by docid). In the VAR_STRAIGHT case, each entry can have a different length, so the address for docid+1 is also retrieved and the length is the difference between the two addresses. A sentinel address is written at the end for the VAR_STRAIGHT case, so the Addresses stream contains maxdoc+1 entries. For the deduplicated VAR_DEREF case, each length is encoded as a prefix to the data itself as a VInt (WriteVInt32(int)) (maximum of 2 bytes).
    • Ordinals stores the term ID in sorted order (indexed by docid). In the FIXED_SORTED case, the address into the .dat can be computed from the ordinal as Header+ValueSize+(ordinal*ValueSize) because the byte length is fixed. In the VAR_SORTED case, there is double indirection (docid -> ordinal -> address), but an additional sentinel ordinal+address is always written (so there are NumValues+1 ordinals). To determine the length, ord+1's address is looked up as well.
    • BYTES_VAR_STRAIGHT, in contrast to the other straight variants, uses a .idx file to improve lookup performance. In contrast to BYTES_VAR_DEREF, it does not deduplicate the document values.
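
The lookups described in these notes can be sketched over plain in-memory arrays. This is an illustration under stated assumptions rather than the reader's actual code: the class and method names are hypothetical, the packed Addresses and Ordinals streams are modelled as already-decoded arrays, addresses are treated as opaque offsets, and the fixed-sorted formula reads "Header+ValueSize" as the byte length of the header plus the 4-byte ValueSize field.

```csharp
using System;

// Hypothetical helper, not part of Lucene.NET.
public static class AddressLookupSketch
{
    // VAR_STRAIGHT: addresses holds maxdoc+1 entries, the last being the
    // sentinel, so the length of a value is the gap to the next address.
    public static (long Start, long Length) VarStraightRange(long[] addresses, int docId)
    {
        long start = addresses[docId];
        long length = addresses[docId + 1] - start;
        return (start, length);
    }

    // FIXED_SORTED: every value has the same width, so the offset into the
    // .dat is computed directly from the ordinal.
    public static long FixedSortedOffset(long headerLength, int valueSize, int ordinal)
    {
        const int valueSizeFieldLength = sizeof(int); // assumed: the Int32 ValueSize field
        return headerLength + valueSizeFieldLength + (long)ordinal * valueSize;
    }

    // VAR_SORTED: double indirection, docid -> ordinal -> address. The
    // sentinel entry makes addresses[ord + 1] valid for the last ordinal.
    public static (long Start, long Length) VarSortedRange(int[] ordinals, long[] addresses, int docId)
    {
        int ord = ordinals[docId];
        long start = addresses[ord];
        return (start, addresses[ord + 1] - start);
    }
}
```

For example, with addresses { 0, 3, 3, 10 }, document 1 has an empty value starting at offset 3 and document 2 has a 7-byte value starting at the same offset.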

    Limitations:
    • Binary doc values can be at most MAX_BINARY_FIELD_LENGTH in length.
    Inheritance
    object
    DocValuesFormat
    Lucene40DocValuesFormat
    Inherited Members
    DocValuesFormat.SetDocValuesFormatFactory(IDocValuesFormatFactory)
    DocValuesFormat.GetDocValuesFormatFactory()
    DocValuesFormat.Name
    DocValuesFormat.ToString()
    DocValuesFormat.ForName(string)
    DocValuesFormat.AvailableDocValuesFormats
    object.Equals(object)
    object.Equals(object, object)
    object.GetHashCode()
    object.GetType()
    object.MemberwiseClone()
    object.ReferenceEquals(object, object)
    Namespace: Lucene.Net.Codecs.Lucene40
    Assembly: Lucene.Net.dll
    Syntax
    [Obsolete("Only for reading old 4.0 and 4.1 segments")]
    [DocValuesFormatName("Lucene40")]
    public class Lucene40DocValuesFormat : DocValuesFormat

    Constructors

    Lucene40DocValuesFormat()

    Sole constructor.

    Declaration
    public Lucene40DocValuesFormat()

    Fields

    MAX_BINARY_FIELD_LENGTH

    Maximum length for each binary doc values field.

    Declaration
    public static readonly int MAX_BINARY_FIELD_LENGTH
    Field Value
    Type Description
    int

    Methods

    FieldsConsumer(SegmentWriteState)

    Returns a DocValuesConsumer to write docvalues to the index.

    Declaration
    public override DocValuesConsumer FieldsConsumer(SegmentWriteState state)
    Parameters
    Type Name Description
    SegmentWriteState state
    Returns
    Type Description
    DocValuesConsumer
    Overrides
    DocValuesFormat.FieldsConsumer(SegmentWriteState)

    FieldsProducer(SegmentReadState)

    Returns a DocValuesProducer to read docvalues from the index.

    NOTE: by the time this call returns, it must hold open any files it will need to use; else, those files may be deleted. Additionally, required files may be deleted during the execution of this call before there is a chance to open them. Under these circumstances an IOException should be thrown by the implementation. IOExceptions are expected and will automatically cause a retry of the segment opening logic with the newly revised segments.
    Declaration
    public override DocValuesProducer FieldsProducer(SegmentReadState state)
    Parameters
    Type Name Description
    SegmentReadState state
    Returns
    Type Description
    DocValuesProducer
    Overrides
    DocValuesFormat.FieldsProducer(SegmentReadState)
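
The retry behaviour mentioned in the FieldsProducer note can be pictured from the caller's side with the sketch below. It is purely illustrative and is not how Lucene.NET implements segment opening: OpenWithRetry, its delegate parameter, and the maxAttempts bound are all hypothetical. The point is only that an IOException raised while opening files is treated as a signal to try again against the latest segments rather than as a fatal error.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of Lucene.NET.
public static class SegmentOpenRetrySketch
{
    // Invokes the supplied open routine, retrying when it fails with an
    // IOException (for example because a required file was deleted before
    // it could be opened). The final failure is allowed to propagate.
    public static T OpenWithRetry<T>(Func<T> openLatestSegments, int maxAttempts)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return openLatestSegments();
            }
            catch (IOException) when (attempt < maxAttempts)
            {
                // The files changed underneath us; loop and try again.
            }
        }
    }
}
```

An implementation of FieldsProducer only needs to honour its side of this contract: open every file it will need before returning, and throw an IOException if one is missing.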
    Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.