    Class BlockTreeTermsWriter

    Block-based terms index and dictionary writer.

    Writes terms dict and index, block-encoding (column stride) each term's metadata for each set of terms between two index terms.

    Files:

    • .tim: Term Dictionary
    • .tip: Term Index

    Term Dictionary

    The .tim file contains the list of terms in each field along with per-term statistics (such as docfreq) and per-term metadata (typically pointers to the postings list for that term in the inverted index).

    The .tim file is arranged in blocks, each containing a variable number of entries (by default 25-48), where each entry is either a term or a reference to a sub-block.

    NOTE: The term dictionary can plug into different postings implementations: the postings writer/reader are actually responsible for encoding and decoding the Postings Metadata and Term Metadata sections.

    In the grammar below, X^N denotes N consecutive occurrences of X (for example, Byte^SuffixLength is SuffixLength bytes).

    • TermsDict (.tim) --> Header, PostingsHeader, NodeBlock^NumBlocks, FieldSummary, DirOffset, Footer
    • NodeBlock --> (OuterNode | InnerNode)
    • OuterNode --> EntryCount, SuffixLength, Byte^SuffixLength, StatsLength, <TermStats>^EntryCount, MetaLength, <TermMetadata>^EntryCount
    • InnerNode --> EntryCount, SuffixLength[,Sub?], Byte^SuffixLength, StatsLength, <TermStats?>^EntryCount, MetaLength, <TermMetadata?>^EntryCount
    • TermStats --> DocFreq, TotalTermFreq
    • FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength, Byte^RootCodeLength, SumTotalTermFreq?, SumDocFreq, DocCount>^NumFields
    • Header --> CodecHeader (WriteHeader(DataOutput, String, Int32))
    • DirOffset --> Uint64 (WriteInt64(Int64))
    • EntryCount, SuffixLength, StatsLength, DocFreq, MetaLength, NumFields, FieldNumber, RootCodeLength, DocCount --> VInt (WriteVInt32(Int32))
    • TotalTermFreq, NumTerms, SumTotalTermFreq, SumDocFreq --> VLong (WriteVInt64(Int64))
    • Footer --> CodecFooter (WriteFooter(IndexOutput))

    Notes:

    • Header is a CodecHeader (WriteHeader(DataOutput, String, Int32)) storing the version information for the BlockTree implementation.
    • DirOffset is a pointer to the FieldSummary section.
    • DocFreq is the count of documents which contain the term.
    • TotalTermFreq is the total number of occurrences of the term. It is encoded as the difference between the total number of occurrences and the DocFreq.
    • FieldNumber is the field's number from Lucene.Net.Codecs.BlockTreeTermsWriter.fieldInfos (.fnm).
    • NumTerms is the number of unique terms for the field.
    • RootCode points to the root block for the field.
    • SumDocFreq is the total number of postings, the number of term-document pairs across the entire field.
    • DocCount is the number of documents that have at least one posting for this field.
    • PostingsHeader and TermMetadata are supplied by the specific postings implementation: they contain arbitrary per-file data (such as parameters or versioning information) and per-term data (such as pointers to inverted files).
    • For inner nodes of the tree, every entry steals one bit to mark whether it points to a child node (sub-block). If it does, the corresponding TermStats and TermMetadata are omitted.
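
The TermStats layout and the stolen bit described in the notes above can be illustrated with a small, dependency-free sketch. The class and method names here are hypothetical and are not part of Lucene.Net. The sketch assumes that both DocFreq and TotalTermFreq are present for the term, that variable-length integers use 7 payload bits per byte with the high bit as a continuation flag, and that the stolen bit is the low bit of the suffix length; treat it as an approximation of the format rather than the writer's actual code.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: hypothetical helpers, not part of Lucene.Net.
public static class TermStatsSketch
{
    // Appends a non-negative value as a variable-length integer:
    // 7 payload bits per byte, low-order group first, with the high bit
    // set on every byte except the last. For non-negative values a VInt
    // and a VLong produce the same bytes, so one helper covers both.
    public static void WriteVLong(List<byte> output, long value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        while ((value & ~0x7FL) != 0)
        {
            output.Add((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        output.Add((byte)value);
    }

    // TermStats --> DocFreq, TotalTermFreq, where the second value is
    // stored as (totalTermFreq - docFreq) as described in the notes.
    public static byte[] EncodeTermStats(int docFreq, long totalTermFreq)
    {
        var output = new List<byte>();
        WriteVLong(output, docFreq);
        WriteVLong(output, totalTermFreq - docFreq);
        return output.ToArray();
    }

    // Inner-node entry header: the suffix length shifted left by one,
    // with the low bit set when the entry refers to a sub-block
    // (assumption: the stolen bit is the low bit).
    public static int EncodeInnerEntryHeader(int suffixLength, bool isSubBlock)
    {
        return (suffixLength << 1) | (isSubBlock ? 1 : 0);
    }
}
```

Storing the difference keeps the second value small for the common case where a term appears about once per matching document, so it usually fits in a single byte.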

    Term Index

    The .tip file contains an index into the term dictionary, so that it can be accessed randomly. The index is also used to determine when a given term cannot exist on disk (in the .tim file), saving a disk seek.

    • TermsIndex (.tip) --> Header, FSTIndex^NumFields, <IndexStartFP>^NumFields, DirOffset, Footer
    • Header --> CodecHeader (WriteHeader(DataOutput, String, Int32))
    • DirOffset --> Uint64 (WriteInt64(Int64))
    • IndexStartFP --> VLong (WriteVInt64(Int64))
    • FSTIndex --> FST{byte[]}
    • Footer --> CodecFooter (WriteFooter(IndexOutput))

    Notes:

    • The .tip file contains a separate FST for each field. The FST maps a term prefix to the on-disk block that holds all terms starting with that prefix. Each field's IndexStartFP points to its FST.
    • DirOffset is a pointer to the start of the IndexStartFPs for all fields.
    • An on-disk block may end up containing too many terms (more than the allowed maximum, 48 by default). When this happens, the block is subdivided into new blocks (called "floor blocks"), and the FST output for the block's prefix encodes the leading byte and file pointer of each sub-block.
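
The prefix-to-block mapping described in the notes above can be pictured with a much simpler structure than an FST. The sketch below uses a plain dictionary keyed by string prefixes and returns the file pointer registered under the longest prefix of a term. Everything in it (the class, the method, the use of strings instead of bytes, the -1 sentinel) is a hypothetical simplification for illustration; it does not model floor blocks or the real FST traversal.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: a dictionary stands in for the per-field FST.
public static class PrefixIndexSketch
{
    // Returns the file pointer of the block registered under the longest
    // prefix of 'term', or -1 if no prefix (not even the empty one) is
    // registered.
    public static long FindBlock(IDictionary<string, long> prefixToBlockFP, string term)
    {
        if (prefixToBlockFP == null) throw new ArgumentNullException(nameof(prefixToBlockFP));
        if (term == null) throw new ArgumentNullException(nameof(term));

        // Try the whole term first, then successively shorter prefixes,
        // ending with the empty prefix (the root block).
        for (int len = term.Length; len >= 0; len--)
        {
            if (prefixToBlockFP.TryGetValue(term.Substring(0, len), out long fp))
            {
                return fp;
            }
        }
        return -1;
    }
}
```

With entries for "" (pointer 0), "ca" (pointer 120) and "cat" (pointer 480), a lookup for "cats" resolves to the "cat" block, "car" to the "ca" block, and "dog" falls back to the root block.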

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Inheritance
    System.Object
    FieldsConsumer
    BlockTreeTermsWriter
    Implements
    System.IDisposable
    Inherited Members
    FieldsConsumer.Dispose()
    FieldsConsumer.Merge(MergeState, Fields)
    System.Object.Equals(System.Object)
    System.Object.Equals(System.Object, System.Object)
    System.Object.GetHashCode()
    System.Object.GetType()
    System.Object.MemberwiseClone()
    System.Object.ReferenceEquals(System.Object, System.Object)
    System.Object.ToString()
    Namespace: Lucene.Net.Codecs
    Assembly: Lucene.Net.dll
    Syntax
    public class BlockTreeTermsWriter : FieldsConsumer, IDisposable

    Constructors

    BlockTreeTermsWriter(SegmentWriteState, PostingsWriterBase, Int32, Int32)

    Create a new writer. The number of items (terms or sub-blocks) per block will aim to be between minItemsInBlock and maxItemsInBlock, though in some cases the blocks may be smaller than the min.

    Declaration
    public BlockTreeTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter, int minItemsInBlock, int maxItemsInBlock)
    Parameters
    Type Name Description
    SegmentWriteState state
    PostingsWriterBase postingsWriter
    System.Int32 minItemsInBlock
    System.Int32 maxItemsInBlock
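
When choosing values other than the suggested defaults, it can help to sanity-check the pair before constructing the writer. The helper below is a hypothetical sketch, not part of Lucene.Net. The constraints it checks (a minimum of at least 2, a maximum no smaller than the minimum, and a maximum of at least 2*(min-1)) are assumptions based on the upstream Lucene 4.x implementation and are not stated on this page, so verify them against the version you use.

```csharp
using System;

// Illustrative only: hypothetical helper, not part of Lucene.Net.
public static class BlockSizeCheck
{
    // Mirrors DEFAULT_MIN_BLOCK_SIZE and DEFAULT_MAX_BLOCK_SIZE.
    public const int DefaultMin = 25;
    public const int DefaultMax = 48;

    // Returns true when the pair satisfies the assumed constraints:
    //   minItemsInBlock >= 2
    //   maxItemsInBlock >= minItemsInBlock
    //   maxItemsInBlock >= 2 * (minItemsInBlock - 1)
    public static bool IsValid(int minItemsInBlock, int maxItemsInBlock)
    {
        if (minItemsInBlock < 2) return false;
        if (maxItemsInBlock < minItemsInBlock) return false;
        // Use long arithmetic so very large minimums cannot overflow.
        return 2L * (minItemsInBlock - 1) <= maxItemsInBlock;
    }
}
```

Under these assumptions the defaults sit exactly on the boundary: 2 * (25 - 1) equals 48, so raising the minimum without also raising the maximum would fail the check.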

    Fields

    DEFAULT_MAX_BLOCK_SIZE

    Suggested default value for the maxItemsInBlock parameter to BlockTreeTermsWriter(SegmentWriteState, PostingsWriterBase, Int32, Int32).

    Declaration
    public const int DEFAULT_MAX_BLOCK_SIZE = 48
    Field Value
    Type Description
    System.Int32

    DEFAULT_MIN_BLOCK_SIZE

    Suggested default value for the minItemsInBlock parameter to BlockTreeTermsWriter(SegmentWriteState, PostingsWriterBase, Int32, Int32).

    Declaration
    public const int DEFAULT_MIN_BLOCK_SIZE = 25
    Field Value
    Type Description
    System.Int32

    VERSION_APPEND_ONLY

    Append-only

    Declaration
    public const int VERSION_APPEND_ONLY = 1
    Field Value
    Type Description
    System.Int32

    VERSION_CHECKSUM

    Checksums.

    Declaration
    public const int VERSION_CHECKSUM = 3
    Field Value
    Type Description
    System.Int32

    VERSION_CURRENT

    Current terms format.

    Declaration
    public const int VERSION_CURRENT = 3
    Field Value
    Type Description
    System.Int32

    VERSION_META_ARRAY

    Meta data as array.

    Declaration
    public const int VERSION_META_ARRAY = 2
    Field Value
    Type Description
    System.Int32

    VERSION_START

    Initial terms format.

    Declaration
    public const int VERSION_START = 0
    Field Value
    Type Description
    System.Int32

    Methods

    AddField(FieldInfo)

    Declaration
    public override TermsConsumer AddField(FieldInfo field)
    Parameters
    Type Name Description
    FieldInfo field
    Returns
    Type Description
    TermsConsumer
    Overrides
    FieldsConsumer.AddField(FieldInfo)

    Dispose(Boolean)

    Disposes all resources used by this object.

    Declaration
    protected override void Dispose(bool disposing)
    Parameters
    Type Name Description
    System.Boolean disposing
    Overrides
    FieldsConsumer.Dispose(Boolean)

    WriteHeader(IndexOutput)

    Writes the terms file header.

    Declaration
    protected virtual void WriteHeader(IndexOutput @out)
    Parameters
    Type Name Description
    IndexOutput out

    WriteIndexHeader(IndexOutput)

    Writes the index file header.

    Declaration
    protected virtual void WriteIndexHeader(IndexOutput @out)
    Parameters
    Type Name Description
    IndexOutput out

    WriteIndexTrailer(IndexOutput, Int64)

    Writes the index file trailer.

    Declaration
    protected virtual void WriteIndexTrailer(IndexOutput indexOut, long dirStart)
    Parameters
    Type Name Description
    IndexOutput indexOut
    System.Int64 dirStart

    WriteTrailer(IndexOutput, Int64)

    Writes the terms file trailer.

    Declaration
    protected virtual void WriteTrailer(IndexOutput @out, long dirStart)
    Parameters
    Type Name Description
    IndexOutput out
    System.Int64 dirStart

    Implements

    System.IDisposable

    See Also

    BlockTreeTermsReader
    Copyright © 2022 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.