Class Lucene41StoredFieldsFormat
Lucene 4.1 stored fields format.
Principle
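This StoredFieldsFormat compresses blocks of 16KB of documents in order to improve the compression ratio compared to document-level compression. It uses the LZ4 compression algorithm, which is fast to compress and very fast to decompress data.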
File formats
Stored fields are represented by two files:
- A fields data file (extension .fdt). This file stores a compact representation of documents in compressed blocks of 16KB or more. When writing a segment, documents are appended to an in-memory byte[] buffer. When its size reaches 16KB or more, some metadata about the documents is flushed to disk, immediately followed by a compressed representation of the buffer using the LZ4 compression format.
Here is a more detailed description of the field data file format:
- FieldData (.fdt) --> <Header>, PackedIntsVersion, <Chunk>^ChunkCount
- Header --> CodecHeader (WriteHeader(DataOutput, String, Int32))
- PackedIntsVersion --> VERSION_CURRENT as a VInt (WriteVInt32(Int32))
- ChunkCount is not known in advance and is the number of chunks necessary to store all documents of the segment
- Chunk --> DocBase, ChunkDocs, DocFieldCounts, DocLengths, <CompressedDocs>
- DocBase --> the ID of the first document of the chunk as a VInt (WriteVInt32(Int32))
- ChunkDocs --> the number of documents in the chunk as a VInt (WriteVInt32(Int32))
- DocFieldCounts --> the number of stored fields of every document in the chunk, encoded as follows:
  - if chunkDocs=1, the unique value is encoded as a VInt (WriteVInt32(Int32))
  - else read a VInt (WriteVInt32(Int32)) (let's call it bitsRequired)
    - if bitsRequired is 0 then all values are equal, and the common value is the following VInt (WriteVInt32(Int32))
    - else bitsRequired is the number of bits required to store any value, and values are stored in a packed (PackedInt32s) array where every value is stored on exactly bitsRequired bits
- DocLengths --> the lengths of all documents in the chunk, encoded with the same method as DocFieldCounts
- CompressedDocs --> a compressed representation of <Docs> using the LZ4 compression format
- Docs --> <Doc>^ChunkDocs
- Doc --> <FieldNumAndType, Value>^DocFieldCount
- FieldNumAndType --> a VLong (WriteVInt64(Int64)), whose 3 last bits are Type and other bits are FieldNum
- Type -->
  - 0: Value is String
  - 1: Value is BinaryValue
  - 2: Value is Int
  - 3: Value is Float
  - 4: Value is Long
  - 5: Value is Double
  - 6, 7: unused
- FieldNum --> an ID of the field
- Value --> String (WriteString(String)) | BinaryValue | Int | Float | Long | Double depending on Type
- BinaryValue --> ValueLength <Byte>^ValueLength
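The two numeric encodings in the grammar above can be illustrated with a short sketch. It is not the Lucene.NET implementation: the class ChunkEncodingSketch and its method names are hypothetical, and the code only computes the values a writer would hand to WriteVInt32 or WriteVInt64, assuming non-negative inputs.

```csharp
using System;

// Illustrative helpers only; none of these names belong to Lucene.NET.
public static class ChunkEncodingSketch
{
    // "3 last bits are Type and other bits are FieldNum"
    private const int TypeBits = 3;
    private const long TypeMask = (1L << TypeBits) - 1;

    // Builds the FieldNumAndType value for a field number and a type code (0..5).
    public static long PackFieldNumAndType(int fieldNum, int type)
    {
        if (fieldNum < 0) throw new ArgumentOutOfRangeException(nameof(fieldNum));
        if (type < 0 || type > 5) throw new ArgumentOutOfRangeException(nameof(type));
        return ((long)fieldNum << TypeBits) | (uint)type;
    }

    // Splits a FieldNumAndType value back into its two parts.
    public static (int FieldNum, int Type) UnpackFieldNumAndType(long packed)
    {
        return ((int)(packed >> TypeBits), (int)(packed & TypeMask));
    }

    // Computes bitsRequired for DocFieldCounts or DocLengths when chunkDocs > 1:
    // 0 when all values are equal, otherwise the number of bits needed
    // to represent the largest value.
    public static int BitsRequired(int[] values)
    {
        if (values == null || values.Length == 0)
            throw new ArgumentException("At least one value is required.", nameof(values));

        bool allEqual = true;
        int max = 0;
        foreach (int v in values)
        {
            if (v < 0) throw new ArgumentOutOfRangeException(nameof(values));
            if (v != values[0]) allEqual = false;
            if (v > max) max = v;
        }
        if (allEqual) return 0;

        int bits = 0;
        while (max != 0)
        {
            bits++;
            max >>= 1;
        }
        return bits;
    }
}
```

For instance, field number 5 with type 2 (Int) packs to 42, and the field counts 1, 2 and 5 need 3 bits each, whereas three documents with 3 fields each fall into the bitsRequired = 0 case.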
Notes
- If documents are larger than 16KB then chunks will likely contain only one document. However, documents can never spread across several chunks (all fields of a single document are in the same chunk).
- When at least one document in a chunk is large enough so that the chunk is larger than 32KB, the chunk will actually be compressed in several LZ4 blocks of 16KB. This allows StoredFieldVisitors which are only interested in the first fields of a document to not have to decompress 10MB of data if the document is 10MB, but only 16KB.
- Given that the original lengths are written in the metadata of the chunk, the decompressor can leverage this information to stop decoding as soon as enough data has been decompressed.
- In case documents are incompressible, CompressedDocs will be less than 0.5% larger than Docs.
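The chunking behaviour described in these notes can be simulated with a minimal sketch. The class ChunkingSketch and its methods are hypothetical names, not Lucene.NET APIs. The code assumes the only flush trigger is the 16KB threshold (the real writer may apply further conditions) and takes the 32KB boundary exactly as worded in the note above.

```csharp
using System;
using System.Collections.Generic;

// Illustrative simulation only; not the Lucene.NET writer.
public static class ChunkingSketch
{
    public const int ChunkSize = 16 * 1024; // the 16KB threshold

    // Groups document lengths (in bytes) into chunks. A document is always
    // appended whole, so it can never spread across two chunks; the buffer
    // is flushed once its size reaches the threshold.
    public static List<List<int>> GroupIntoChunks(IEnumerable<int> docLengths)
    {
        var chunks = new List<List<int>>();
        var current = new List<int>();
        long buffered = 0;

        foreach (int length in docLengths)
        {
            if (length < 0) throw new ArgumentOutOfRangeException(nameof(docLengths));
            current.Add(length);
            buffered += length;
            if (buffered >= ChunkSize)
            {
                chunks.Add(current);
                current = new List<int>();
                buffered = 0;
            }
        }

        // Whatever remains at the end of the segment forms a last chunk.
        if (current.Count > 0) chunks.Add(current);
        return chunks;
    }

    // Number of LZ4 blocks for a chunk: one block unless the chunk is larger
    // than 32KB, in which case it is split into blocks of 16KB.
    public static int Lz4BlockCount(long chunkBytes)
    {
        if (chunkBytes < 0) throw new ArgumentOutOfRangeException(nameof(chunkBytes));
        if (chunkBytes <= 2L * ChunkSize) return 1;
        return (int)((chunkBytes + ChunkSize - 1) / ChunkSize);
    }
}
```

With document lengths of 6000, 6000, 6000, 20000 and 100 bytes, the sketch yields three chunks: the first three documents together, the 20000-byte document alone, and the trailing 100-byte document. Under the same assumptions a 10MB document is split into 640 blocks, so a reader interested only in its first fields needs to decompress just the first one.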
- A fields index file (extension .fdx).
- FieldsIndex (.fdx) --> <Header>, <ChunkIndex>
- Header --> CodecHeader (WriteHeader(DataOutput, String, Int32))
- ChunkIndex: See CompressingStoredFieldsIndexWriter
Known limitations
This StoredFieldsFormat does not support individual documents larger than (2^31 - 2^14) bytes. In case this is a problem, you should use another format, such as Lucene40StoredFieldsFormat.
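To make the limit concrete, the following sketch evaluates it, reading the bound as 2^31 - 2^14 bytes. The class StoredFieldsLimits and its members are hypothetical names used only for illustration; Lucene.NET does not expose them.

```csharp
// Illustrative only; these names are not part of Lucene.NET.
public static class StoredFieldsLimits
{
    // 2^31 - 2^14 = 2,147,483,648 - 16,384 = 2,147,467,264 bytes
    public const long MaxDocumentBytes = (1L << 31) - (1L << 14);

    // True when a document of the given size stays within the limit.
    public static bool FitsInFormat(long documentBytes)
    {
        return documentBytes >= 0 && documentBytes <= MaxDocumentBytes;
    }
}
```

The bound is 16KB short of 2GB, so a check like this could be run before indexing unusually large documents.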
Namespace: Lucene.Net.Codecs.Lucene41
Assembly: Lucene.Net.dll
Syntax
public sealed class Lucene41StoredFieldsFormat : CompressingStoredFieldsFormat
Constructors
Lucene41StoredFieldsFormat()
Sole constructor.
Declaration
public Lucene41StoredFieldsFormat()