    Class FSTTermsWriter

    FST-based term dict, using metadata as FST output.

    The FST directly holds the mapping between <term, metadata>.

    Term metadata consists of three parts:
    1. term statistics: docFreq, totalTermFreq;
    2. monotonic long[], e.g. the pointer to the postings list for that term;
    3. generic byte[], e.g. other information needed by the postings reader.

    File:

    • .tst: Term Dictionary

    Term Dictionary

    The .tst file contains a list of FSTs, one for each field. Each FST maps a term to its statistics (e.g. docFreq) and metadata (e.g. information for the postings list reader, such as the file pointer to the postings list).

    Typically the metadata is separated into two parts:

    • Monotonic long array: Some metadata always ascends in term order. The FST uses this part to share outputs between arcs.
    • Generic byte array: Used to store non-monotonic metadata.
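    To illustrate how a monotonic long array lets two terms share an output, here is a minimal sketch. It is not the library's implementation: the class and method names (MonotonicOutputSketch, TryCommon, Subtract) are hypothetical, and it assumes that a shared output exists only when one array is element-wise less than or equal to the other.

```csharp
using System;

public static class MonotonicOutputSketch
{
    // Hypothetical helper: finds the output two terms could share.
    // Succeeds only when one array is element-wise <= the other.
    public static bool TryCommon(long[] a, long[] b, out long[] shared)
    {
        if (a.Length != b.Length) throw new ArgumentException("length mismatch");
        bool aLeB = true, bLeA = true;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] > b[i]) aLeB = false;
            if (b[i] > a[i]) bLeA = false;
        }
        if (aLeB) { shared = (long[])a.Clone(); return true; }
        if (bLeA) { shared = (long[])b.Clone(); return true; }
        shared = Array.Empty<long>(); // not comparable: nothing to share
        return false;
    }

    // Remainder that stays on the deeper arc once the shared part is removed.
    public static long[] Subtract(long[] value, long[] shared)
    {
        var delta = new long[value.Length];
        for (int i = 0; i < value.Length; i++) delta[i] = value[i] - shared[i];
        return delta;
    }
}
```

    With the arrays {100, 7} and {250, 9}, the shared part is {100, 7} and the second term keeps only the delta {150, 2}. Non-monotonic data gains nothing from this scheme, which is why it goes into the generic byte array instead.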

    File format:

    • TermsDict(.tst) --> Header, PostingsHeader, FieldSummary, DirOffset
    • FieldSummary --> NumFields, <FieldNumber, NumTerms, SumTotalTermFreq?, SumDocFreq, DocCount, LongsSize, TermFST>^NumFields
    • TermFST --> FST<TermData>
    • TermData --> Flag, BytesSize?, LongDelta^LongsSize?, Byte^BytesSize?, <DocFreq[Same?], (TotalTermFreq-DocFreq)>?
    • Header --> CodecHeader (WriteHeader(DataOutput, string, int))
    • DirOffset --> Uint64 (WriteInt64(long))
    • DocFreq, LongsSize, BytesSize, NumFields, FieldNumber, DocCount --> VInt (WriteVInt32(int))
    • TotalTermFreq, NumTerms, SumTotalTermFreq, SumDocFreq, LongDelta --> VLong (WriteVInt64(long))
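    The VInt and VLong entries above use a variable-length encoding: each byte carries seven payload bits, low-order bits first, and the high bit signals that another byte follows. The following is a minimal standalone sketch of that encoding, not the library's DataOutput code; the class VIntSketch and its methods are hypothetical names, and rejecting negative values is an assumption of the sketch.

```csharp
using System;
using System.Collections.Generic;

public static class VIntSketch
{
    // Encodes a non-negative value, seven bits per byte, low-order bits first.
    public static byte[] Encode(long value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        var bytes = new List<byte>();
        ulong v = (ulong)value;
        while ((v & ~0x7FUL) != 0)
        {
            bytes.Add((byte)((v & 0x7F) | 0x80)); // high bit set: more bytes follow
            v >>= 7;
        }
        bytes.Add((byte)v); // final byte has the high bit clear
        return bytes.ToArray();
    }

    // Decodes a single value from the start of the buffer.
    public static long Decode(byte[] bytes)
    {
        long result = 0;
        int shift = 0;
        foreach (byte b in bytes)
        {
            result |= (long)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
        }
        return result;
    }
}
```

    Small values such as a typical DocFreq fit in one byte, which is why the per-term fields use this encoding while DirOffset is written as a fixed-width 64-bit value.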

    Notes:

    • The formats of PostingsHeader and the generic metadata bytes are defined by the specific postings implementation: they hold arbitrary per-file data (such as parameters or versioning information) and per-term data (non-monotonic values such as pulsed postings data).
    • The format of TermData is determined by the FST. Typically, monotonic metadata is dense around shallow arcs, while deeper arcs carry only generic bytes and term statistics.
    • The Flag byte indicates which parts of the metadata exist on the current arc. In particular, the monotonic part is omitted when it is an array of 0s.
    • Because LongsSize is fixed per field, it is written only once, in the field summary.
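    The role of the Flag byte can be sketched as follows. The bit assignments here are hypothetical, since this page does not list the actual bit positions, and TermDataFlags and FlagSketch are illustrative names rather than library types. The sketch only shows the idea that each optional part of TermData gets a presence bit, and that an all-zero long array counts as absent.

```csharp
using System;

[Flags]
public enum TermDataFlags : byte
{
    None = 0,
    HasLongs = 1 << 0, // assumed bit: monotonic long[] is present
    HasBytes = 1 << 1, // assumed bit: generic byte[] is present
    HasStats = 1 << 2, // assumed bit: DocFreq / TotalTermFreq are present
}

public static class FlagSketch
{
    // Computes which parts of the metadata would be written for one arc.
    public static TermDataFlags Compute(long[] longs, byte[] bytes, int docFreq)
    {
        var flags = TermDataFlags.None;
        // An array of 0s is treated as absent, as the notes describe.
        if (Array.Exists(longs, v => v != 0)) flags |= TermDataFlags.HasLongs;
        if (bytes != null && bytes.Length > 0) flags |= TermDataFlags.HasBytes;
        if (docFreq > 0) flags |= TermDataFlags.HasStats;
        return flags;
    }
}
```

    A reader would test the same bits before decoding each optional part, so an arc that carries only statistics costs one flag byte plus the statistics themselves.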

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Inheritance
    object
    FieldsConsumer
    FSTTermsWriter
    Implements
    IDisposable
    Inherited Members
    FieldsConsumer.Dispose()
    FieldsConsumer.Merge(MergeState, Fields)
    object.Equals(object)
    object.Equals(object, object)
    object.GetHashCode()
    object.GetType()
    object.MemberwiseClone()
    object.ReferenceEquals(object, object)
    object.ToString()
    Namespace: Lucene.Net.Codecs.Memory
    Assembly: Lucene.Net.Codecs.dll
    Syntax
    public class FSTTermsWriter : FieldsConsumer, IDisposable

    Constructors

    FSTTermsWriter(SegmentWriteState, PostingsWriterBase)

    Declaration
    public FSTTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter)
    Parameters
    Type Name Description
    SegmentWriteState state
    PostingsWriterBase postingsWriter

    Fields

    TERMS_VERSION_CHECKSUM

    Declaration
    public const int TERMS_VERSION_CHECKSUM = 1
    Field Value
    Type Description
    int

    TERMS_VERSION_CURRENT

    Declaration
    public const int TERMS_VERSION_CURRENT = 1
    Field Value
    Type Description
    int

    TERMS_VERSION_START

    Declaration
    public const int TERMS_VERSION_START = 0
    Field Value
    Type Description
    int
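    As a rough sketch of how the three version constants might be used together, the snippet below checks whether a version number read from a file header is in the supported range and whether it is recent enough to carry a checksum. The constant values are copied from the declarations above, but the class TermsVersionSketch, its methods, and the interpretation of TERMS_VERSION_CHECKSUM as "first version with a checksum" are assumptions inferred from the names, not behavior documented on this page.

```csharp
using System;

public static class TermsVersionSketch
{
    // Values copied from the constants documented above.
    public const int TERMS_VERSION_START = 0;
    public const int TERMS_VERSION_CHECKSUM = 1;
    public const int TERMS_VERSION_CURRENT = 1;

    // Assumed rule: a file is readable when its version lies in [START, CURRENT].
    public static bool IsSupported(int version) =>
        version >= TERMS_VERSION_START && version <= TERMS_VERSION_CURRENT;

    // Assumed rule: files at or after the CHECKSUM version carry a checksum.
    public static bool HasChecksum(int version) =>
        version >= TERMS_VERSION_CHECKSUM;
}
```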

    Methods

    AddField(FieldInfo)

    Add a new field.

    Declaration
    public override TermsConsumer AddField(FieldInfo field)
    Parameters
    Type Name Description
    FieldInfo field
    Returns
    Type Description
    TermsConsumer
    Overrides
    Lucene.Net.Codecs.FieldsConsumer.AddField(Lucene.Net.Index.FieldInfo)

    Dispose(bool)

    Implementations must override and should dispose all resources used by this instance.

    Declaration
    protected override void Dispose(bool disposing)
    Parameters
    Type Name Description
    bool disposing
    Overrides
    FieldsConsumer.Dispose(bool)

    Implements

    IDisposable
    Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.