Class FSTTermsWriter
FST-based term dict, using metadata as FST output.
The FST directly holds the mapping between <term, metadata>. Term metadata consists of three parts:
- term statistics: docFreq, totalTermFreq;
- monotonic long[], e.g. the pointer to the postings list for that term;
- generic byte[], e.g. other information needed by the postings reader.
File:
- .tst: Term Dictionary
Term Dictionary
The .tst file contains a list of FSTs, one for each field. Each FST maps a term to its statistics (e.g. docFreq) and its metadata (e.g. information for the postings reader, such as the file pointer to the postings list).
Typically the metadata is separated into two parts:
- Monotonic long array: Some metadata always ascends in term order. The FST uses this part to share outputs between arcs.
- Generic byte array: Used to store non-monotonic metadata.
File format:
- TermsDict(.tst) --> Header, PostingsHeader, FieldSummary, DirOffset
- FieldSummary --> NumFields, <FieldNumber, NumTerms, SumTotalTermFreq?, SumDocFreq, DocCount, LongsSize, TermFST>^NumFields
- TermFST --> FST<TermData>
- TermData --> Flag, BytesSize?, LongDelta^LongsSize?, Byte^BytesSize?, <DocFreq[Same?], (TotalTermFreq-DocFreq)>?
- Header --> CodecHeader (WriteHeader(DataOutput, string, int))
- DirOffset --> Uint64 (WriteInt64(long))
- DocFreq, LongsSize, BytesSize, NumFields, FieldNumber, DocCount --> VInt (WriteVInt32(int))
- TotalTermFreq, NumTerms, SumTotalTermFreq, SumDocFreq, LongDelta --> VLong (WriteVInt64(long))
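The VInt and VLong entries above refer to a variable-length integer encoding: seven payload bits per byte, low-order group first, with the high bit of each byte marking that another byte follows. The following is a minimal, self-contained sketch of that idea. The `VarIntSketch` class and its methods are hypothetical names used only for illustration; they are not the Lucene.NET `DataOutput` implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class VarIntSketch
{
    // Encodes a non-negative value using 7 data bits per byte, low-order
    // group first. The high bit of a byte is set when more bytes follow.
    public static byte[] Encode(long value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        var bytes = new List<byte>();
        while ((value & ~0x7FL) != 0)
        {
            bytes.Add((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        bytes.Add((byte)value);
        return bytes.ToArray();
    }

    // Reverses Encode: accumulates 7-bit groups until a byte without
    // the continuation bit is seen.
    public static long Decode(byte[] bytes)
    {
        long result = 0;
        int shift = 0;
        foreach (byte b in bytes)
        {
            result |= (long)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
        }
        return result;
    }
}
```

Small values such as a typical DocFreq fit in a single byte under this scheme, which is why the counts in the field summary and term data are written this way rather than as fixed-width integers.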
Notes:
- The formats of PostingsHeader and the generic metadata bytes are defined by the specific postings implementation: they contain arbitrary per-file data (such as parameters or versioning information) and per-term data (non-monotonic data such as pulsed postings).
- The format of TermData is determined by the FST. Typically, monotonic metadata is dense around shallow arcs, while deeper arcs hold only generic bytes and term statistics.
- The Flag byte indicates which parts of the metadata exist on the current arc. In particular, the monotonic part is omitted when it is an array of 0s.
- Since LongsSize is per-field fixed, it is only written once in field summary.
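The role of the Flag byte described in the notes can be sketched as a small bit set computed per arc. The code below is only an illustration: the class name, the constant names, the bit positions, and the use of a non-zero docFreq as the test for "statistics present" are all assumptions, not the documented on-disk layout.

```csharp
using System;
using System.Linq;

// Hypothetical illustration; the bit assignments are assumed, not taken from the format.
public static class TermDataFlagSketch
{
    public const int HasLongs = 1 << 0; // monotonic long[] part is present
    public const int HasBytes = 1 << 1; // generic byte[] part is present
    public const int HasStats = 1 << 2; // term statistics are present

    public static byte ComputeFlag(long[] longs, byte[] bytes, int docFreq)
    {
        int flag = 0;
        // The monotonic part is omitted when it is an array of 0s.
        if (longs != null && longs.Any(v => v != 0)) flag |= HasLongs;
        if (bytes != null && bytes.Length > 0) flag |= HasBytes;
        // Assumption: a zero docFreq means no statistics are stored on this arc.
        if (docFreq != 0) flag |= HasStats;
        return (byte)flag;
    }
}
```

A reader would test the same bits to decide which of the optional TermData components to decode for the current arc.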
Note
This API is experimental and might change in incompatible ways in the next release.
Namespace: Lucene.Net.Codecs.Memory
Assembly: Lucene.Net.Codecs.dll
Syntax
public class FSTTermsWriter : FieldsConsumer, IDisposable
Constructors
FSTTermsWriter(SegmentWriteState, PostingsWriterBase)
Declaration
public FSTTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter)
Parameters
Type | Name | Description |
---|---|---|
SegmentWriteState | state | |
PostingsWriterBase | postingsWriter |
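A writer like this is typically constructed from a postings format when a segment is being written. The sketch below shows one way to do that, under some assumptions: `Lucene41PostingsWriter` is chosen here as the postings writer, and the `FstTermsWriterUsageSketch` class and `CreateConsumer` method are hypothetical names for illustration. The postings writer is disposed if constructing the terms writer fails, so that its output is not left open.

```csharp
using Lucene.Net.Codecs;
using Lucene.Net.Codecs.Lucene41;
using Lucene.Net.Codecs.Memory;
using Lucene.Net.Index;

// Hypothetical helper for illustration only.
public static class FstTermsWriterUsageSketch
{
    public static FieldsConsumer CreateConsumer(SegmentWriteState state)
    {
        // Assumption: the Lucene 4.1 postings writer supplies the postings lists.
        PostingsWriterBase postingsWriter = new Lucene41PostingsWriter(state);
        bool success = false;
        try
        {
            FieldsConsumer consumer = new FSTTermsWriter(state, postingsWriter);
            success = true;
            return consumer;
        }
        finally
        {
            if (!success)
            {
                // Release the postings writer if the terms writer could not be created.
                postingsWriter.Dispose();
            }
        }
    }
}
```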
Fields
TERMS_VERSION_CHECKSUM
Declaration
public const int TERMS_VERSION_CHECKSUM = 1
Field Value
Type | Description |
---|---|
int |
TERMS_VERSION_CURRENT
Declaration
public const int TERMS_VERSION_CURRENT = 1
Field Value
Type | Description |
---|---|
int |
TERMS_VERSION_START
Declaration
public const int TERMS_VERSION_START = 0
Field Value
Type | Description |
---|---|
int |
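Version constants of this kind are usually used to validate a file header and to gate format features. The following self-contained sketch copies the three values from the declarations above into a hypothetical `TermsVersionSketch` class. The assumption that a checksum is present from `TERMS_VERSION_CHECKSUM` onward is inferred from the constant's name and is not stated in this page.

```csharp
using System;

// Hypothetical illustration; constant values mirror the declarations above.
public static class TermsVersionSketch
{
    public const int TERMS_VERSION_START = 0;
    public const int TERMS_VERSION_CHECKSUM = 1;
    public const int TERMS_VERSION_CURRENT = 1;

    // A version is readable when it lies in [START, CURRENT].
    public static bool IsSupported(int version) =>
        version >= TERMS_VERSION_START && version <= TERMS_VERSION_CURRENT;

    // Assumption: files at or after TERMS_VERSION_CHECKSUM carry a checksum.
    public static bool HasChecksum(int version)
    {
        if (!IsSupported(version))
            throw new ArgumentOutOfRangeException(nameof(version));
        return version >= TERMS_VERSION_CHECKSUM;
    }
}
```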
Methods
AddField(FieldInfo)
Add a new field.
Declaration
public override TermsConsumer AddField(FieldInfo field)
Parameters
Type | Name | Description |
---|---|---|
FieldInfo | field |
Returns
Type | Description |
---|---|
TermsConsumer |
Dispose(bool)
Implementations must override and should dispose all resources used by this instance.
Declaration
protected override void Dispose(bool disposing)
Parameters
Type | Name | Description |
---|---|---|
bool | disposing |
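The override requirement above follows the standard .NET dispose pattern: a public `Dispose()` on the base type calls the protected `Dispose(bool)`, and each subclass releases its own resources there. The sketch below is self-contained and uses hypothetical types (`ConsumerBaseSketch`, `StreamConsumerSketch`) rather than the Lucene.NET classes, so it shows only the shape of the pattern.

```csharp
using System;
using System.IO;

// Hypothetical base type standing in for a consumer with a disposable contract.
public abstract class ConsumerBaseSketch : IDisposable
{
    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);
    }

    // Implementations must override and release everything they own.
    protected abstract void Dispose(bool disposing);
}

// Hypothetical subclass that owns a single output stream.
public sealed class StreamConsumerSketch : ConsumerBaseSketch
{
    private Stream output;

    public StreamConsumerSketch(Stream output) { this.output = output; }

    public bool IsDisposed => output == null;

    protected override void Dispose(bool disposing)
    {
        // Guard against repeated calls; only touch managed state when disposing.
        if (disposing && output != null)
        {
            output.Dispose();
            output = null;
        }
    }
}
```

Wrapping the consumer in a `using` statement ensures `Dispose(bool)` runs even when writing fails partway through.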