Class BlockTreeTermsWriter<TSubclassState>

Block-based terms index and dictionary writer.

Writes terms dict and index, block-encoding (column stride) each term's metadata for each set of terms between two index terms.

Files:

.tim: Term Dictionary
.tip: Term Index

Term Dictionary

The .tim file contains the list of terms in each field along with per-term statistics (such as docfreq) and per-term metadata (typically pointers to the postings list for that term in the inverted index).

The .tim file is arranged in blocks, each containing a variable number of entries (by default 25-48), where each entry is either a term or a reference to a sub-block.

NOTE: The term dictionary can plug into different postings implementations: the postings writer/reader are actually responsible for encoding and decoding the Postings Metadata and Term Metadata sections.

TermsDict (.tim) --> Header, PostingsHeader, NodeBlock^NumBlocks, FieldSummary, DirOffset, Footer
NodeBlock --> (OuterNode | InnerNode)
OuterNode --> EntryCount, SuffixLength, Byte^SuffixLength, StatsLength, <TermStats>^EntryCount, MetaLength, <TermMetadata>^EntryCount
InnerNode --> EntryCount, SuffixLength[,Sub?], Byte^SuffixLength, StatsLength, <TermStats?>^EntryCount, MetaLength, <TermMetadata?>^EntryCount
TermStats --> DocFreq, TotalTermFreq
FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength, Byte^RootCodeLength, SumTotalTermFreq?, SumDocFreq, DocCount>^NumFields
Header --> CodecHeader (WriteHeader(DataOutput, string, int))
DirOffset --> Uint64 (WriteInt64(long))
EntryCount, SuffixLength, StatsLength, DocFreq, MetaLength, NumFields, FieldNumber, RootCodeLength, DocCount --> VInt (WriteVInt32(int))
TotalTermFreq, NumTerms, SumTotalTermFreq, SumDocFreq --> VLong (WriteVInt64(long))
Footer --> CodecFooter (WriteFooter(IndexOutput))

Notes:

Header is a CodecHeader (WriteHeader(DataOutput, string, int)) storing the version information for the BlockTree implementation.
DirOffset is a pointer to the FieldSummary section.
DocFreq is the count of documents which contain the term.
TotalTermFreq is the total number of occurrences of the term. It is encoded as the difference between the total number of occurrences and the DocFreq.
FieldNumber is the field's number from fieldInfos (.fnm).
NumTerms is the number of unique terms for the field.
RootCode points to the root block for the field.
SumDocFreq is the total number of postings, the number of term-document pairs across the entire field.
DocCount is the number of documents that have at least one posting for this field.
PostingsHeader and TermMetadata are supplied by the specific postings implementation: they contain arbitrary per-file data (such as parameters or versioning information) and per-term data (such as pointers to inverted files).
For inner nodes of the tree, every entry steals one bit to mark whether it points to a child node (sub-block). If so, the corresponding TermStats and TermMetadata are omitted.
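
As an illustration of the TermStats note above, the following is a minimal sketch of how a DocFreq/TotalTermFreq pair could be laid out. TermStatsSketch and its members are hypothetical helpers, not Lucene.NET API. The sketch assumes the usual Lucene variable-length layout (7 data bits per byte, low-order group first, high bit set when more bytes follow) and stores TotalTermFreq as its difference from DocFreq.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of Lucene.NET) illustrating the TermStats layout.
public static class TermStatsSketch
{
    // Writes a non-negative value using 7 data bits per byte, low-order group first.
    // The high bit of each byte signals that another byte follows.
    public static void WriteVInt64(Stream output, long value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        ulong v = (ulong)value;
        while (v >= 0x80)
        {
            output.WriteByte((byte)((v & 0x7F) | 0x80));
            v >>= 7;
        }
        output.WriteByte((byte)v);
    }

    // Reads a value written by WriteVInt64.
    public static long ReadVInt64(Stream input)
    {
        long result = 0;
        int shift = 0;
        while (true)
        {
            int b = input.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            result |= (long)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }

    // TermStats --> DocFreq, TotalTermFreq (stored as TotalTermFreq - DocFreq).
    public static byte[] Encode(int docFreq, long totalTermFreq)
    {
        if (docFreq < 0 || totalTermFreq < docFreq)
            throw new ArgumentException("totalTermFreq must be >= docFreq >= 0");
        using (var ms = new MemoryStream())
        {
            WriteVInt64(ms, docFreq);
            WriteVInt64(ms, totalTermFreq - docFreq);
            return ms.ToArray();
        }
    }

    // Reverses Encode by adding DocFreq back onto the stored difference.
    public static (int DocFreq, long TotalTermFreq) Decode(byte[] data)
    {
        using (var ms = new MemoryStream(data))
        {
            int docFreq = (int)ReadVInt64(ms);
            long totalTermFreq = docFreq + ReadVInt64(ms);
            return (docFreq, totalTermFreq);
        }
    }
}
```

With this sketch, TermStatsSketch.Encode(3, 303) yields the three bytes 03 AC 02: one byte for DocFreq and two for the difference 300. Storing the difference keeps the second value small for terms that occur about once per document.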

Term Index

The .tip file contains an index into the term dictionary, so that it can be accessed randomly. The index is also used to determine when a given term cannot exist on disk (in the .tim file), saving a disk seek.

TermsIndex (.tip) --> Header, FSTIndex^NumFields, <IndexStartFP>^NumFields, DirOffset, Footer
Header --> CodecHeader (WriteHeader(DataOutput, string, int))
DirOffset --> Uint64 (WriteInt64(long))
IndexStartFP --> VLong (WriteVInt64(long))
FSTIndex --> FST{byte[]}
Footer --> CodecFooter (WriteFooter(IndexOutput))

Notes:

The .tip file contains a separate FST for each field. The FST maps a term prefix to the on-disk block that holds all terms starting with that prefix. Each field's IndexStartFP points to its FST.
DirOffset is a pointer to the start of the IndexStartFPs for all fields.
An on-disk block may end up containing more terms than the allowed maximum (default: 48). When this happens, the block is sub-divided into new blocks (called "floor blocks"), and the output in the FST for the block's prefix encodes the leading byte and the file pointer of each sub-block.
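
The floor-block idea can be sketched as follows. This is a simplified illustration, not the partitioning that BlockTreeTermsWriter actually performs. FloorBlockSketch and Split are hypothetical names, and the greedy rule (keep entries with the same leading byte together, and start a new floor block when the next group would exceed the maximum) is an assumption made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Lucene.NET) that partitions an oversized block.
public static class FloorBlockSketch
{
    // sortedLeadBytes: the leading suffix byte of every entry in the block, in sorted order.
    // Returns, for each floor block, the leading byte of its first entry and its entry count.
    public static List<(byte LeadByte, int Count)> Split(
        IReadOnlyList<byte> sortedLeadBytes, int maxItemsInBlock)
    {
        if (maxItemsInBlock < 1) throw new ArgumentOutOfRangeException(nameof(maxItemsInBlock));

        var blocks = new List<(byte LeadByte, int Count)>();
        byte blockLead = 0;
        int blockCount = 0;
        int i = 0;
        while (i < sortedLeadBytes.Count)
        {
            // Measure the run of entries sharing the same leading byte.
            byte lead = sortedLeadBytes[i];
            int runStart = i;
            while (i < sortedLeadBytes.Count && sortedLeadBytes[i] == lead) i++;
            int runLength = i - runStart;

            // Close the current floor block if this run would overflow it.
            if (blockCount > 0 && blockCount + runLength > maxItemsInBlock)
            {
                blocks.Add((blockLead, blockCount));
                blockCount = 0;
            }
            if (blockCount == 0) blockLead = lead;
            blockCount += runLength;
        }
        if (blockCount > 0) blocks.Add((blockLead, blockCount));
        return blocks;
    }
}
```

A reader that knows the leading byte of each floor block can jump straight to the one sub-block that could hold a given term, which is why the FST output records those bytes alongside the file pointers. In this sketch a run longer than the maximum is kept whole rather than split further.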

Note

This API is experimental and might change in incompatible ways in the next release.

Inheritance

object

FieldsConsumer

BlockTreeTermsWriter<TSubclassState>

Implements

IDisposable

Inherited Members

FieldsConsumer.Dispose()

FieldsConsumer.Merge(MergeState, Fields)

object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.MemberwiseClone()

object.ReferenceEquals(object, object)

object.ToString()

Namespace: Lucene.Net.Codecs

Assembly: Lucene.Net.dll

Syntax

public class BlockTreeTermsWriter<TSubclassState> : FieldsConsumer, IDisposable

Type Parameters

Name	Description
TSubclassState

Constructors

BlockTreeTermsWriter(SegmentWriteState, PostingsWriterBase, int, int, TSubclassState)

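Create a new writer. The number of items (terms or sub-blocks) per block will aim to be between minItemsInBlock and maxItemsInBlock, though in some cases the blocks may be smaller than the minimum.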

Declaration

public BlockTreeTermsWriter(SegmentWriteState state, PostingsWriterBase postingsWriter, int minItemsInBlock, int maxItemsInBlock, TSubclassState subclassState)

Parameters

Type	Name	Description
SegmentWriteState	state
PostingsWriterBase	postingsWriter
int	minItemsInBlock
int	maxItemsInBlock
TSubclassState	subclassState	LUCENENET-specific parameter that lets a subclass pass state to the base class. It is optional and is intended for use when overriding WriteHeader(IndexOutput) or WriteIndexHeader(IndexOutput). It matters only when one of those methods needs state that was passed to the subclass constructor. The value passed to this constructor is stored in the protected field m_subclassState before either method is called, so overrides can read it from there. If your subclass needs to pass more than one piece of data, create a class or struct to hold it. No other virtual members of BlockTreeTermsWriter are called from the constructor, so overrides of those members do not need this field (although they can use it for consistency).
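
The reason for the subclassState parameter can be shown with a small, self-contained sketch. HeaderWriterBase and VersionedWriter are hypothetical stand-ins, not Lucene.NET types. They only mimic the pattern described above: a base constructor that stores the state and then calls a virtual method.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a base writer whose constructor calls a virtual method.
public class HeaderWriterBase<TSubclassState>
{
    protected readonly TSubclassState m_subclassState;

    // Records what WriteHeader produced, so the effect is observable.
    public List<string> Written { get; } = new List<string>();

    public HeaderWriterBase(TSubclassState subclassState)
    {
        m_subclassState = subclassState; // stored before any virtual call
        WriteHeader();                   // virtual call made from the constructor
    }

    protected virtual void WriteHeader() => Written.Add("base-header");
}

// Hypothetical subclass that needs a version number while writing the header.
public sealed class VersionedWriter : HeaderWriterBase<int>
{
    private readonly int _versionField;

    public VersionedWriter(int version) : base(version)
    {
        // This assignment runs only after the base constructor has returned,
        // which is after WriteHeader() has already been called.
        _versionField = version;
    }

    protected override void WriteHeader()
    {
        // _versionField still holds its default (0) here; m_subclassState is already set.
        Written.Add($"field={_versionField}, state={m_subclassState}");
    }
}
```

Constructing new VersionedWriter(7) records "field=0, state=7": the subclass's own field has not been assigned when the override runs, while the state passed through the base constructor is available.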

Fields

m_subclassState

Declaration

protected object m_subclassState

Field Value

Type	Description
object

Methods

AddField(FieldInfo)

Add a new field.

Declaration

public override TermsConsumer AddField(FieldInfo field)

Parameters

Type	Name	Description
FieldInfo	field

Returns

Type	Description
TermsConsumer

Overrides

FieldsConsumer.AddField(FieldInfo)

Dispose(bool)

Disposes all resources used by this object.

Declaration

protected override void Dispose(bool disposing)

Parameters

Type	Name	Description
bool	disposing

Overrides

FieldsConsumer.Dispose(bool)

WriteHeader(IndexOutput)

Writes the terms file header.

Declaration

protected virtual void WriteHeader(IndexOutput @out)

Parameters

Type	Name	Description
IndexOutput	out

WriteIndexHeader(IndexOutput)

Writes the index file header.

Declaration

protected virtual void WriteIndexHeader(IndexOutput @out)

Parameters

Type	Name	Description
IndexOutput	out

WriteIndexTrailer(IndexOutput, long)

Writes the index file trailer.

Declaration

protected virtual void WriteIndexTrailer(IndexOutput indexOut, long dirStart)

Parameters

Type	Name	Description
IndexOutput	indexOut
long	dirStart

WriteTrailer(IndexOutput, long)

Writes the terms file trailer.

Declaration

protected virtual void WriteTrailer(IndexOutput @out, long dirStart)

Parameters

Type	Name	Description
IndexOutput	out
long	dirStart
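
To show how dirStart relates to the DirOffset and Footer entries in the file grammar, here is a minimal sketch of a trailer being written and then located again. TrailerSketch is a hypothetical helper, not Lucene.NET API. The 16-byte footer length and the big-endian 64-bit encoding are assumptions about the on-disk format made for this example, and the footer bytes are placeholders.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of Lucene.NET) illustrating the DirOffset trailer.
public static class TrailerSketch
{
    // Assumption: the codec footer occupies 16 bytes at the end of the file.
    public const int AssumedFooterLength = 16;

    // Writes a 64-bit value, most significant byte first (assumed byte order).
    public static void WriteInt64BigEndian(Stream s, long value)
    {
        for (int shift = 56; shift >= 0; shift -= 8)
            s.WriteByte((byte)((value >> shift) & 0xFF));
    }

    // Reads a 64-bit value written by WriteInt64BigEndian.
    public static long ReadInt64BigEndian(Stream s)
    {
        long value = 0;
        for (int i = 0; i < 8; i++)
        {
            int b = s.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            value = (value << 8) | (long)b;
        }
        return value;
    }

    // Lays out: body, summary section, DirOffset (pointing at the summary), placeholder footer.
    public static byte[] BuildFile(byte[] body, byte[] summary)
    {
        using (var ms = new MemoryStream())
        {
            ms.Write(body, 0, body.Length);
            long dirStart = ms.Position;          // where the summary section begins
            ms.Write(summary, 0, summary.Length);
            WriteInt64BigEndian(ms, dirStart);    // the trailer's DirOffset
            ms.Write(new byte[AssumedFooterLength], 0, AssumedFooterLength);
            return ms.ToArray();
        }
    }

    // Finds DirOffset by stepping back over the footer and the 8-byte pointer.
    public static long ReadDirOffset(byte[] file)
    {
        using (var ms = new MemoryStream(file))
        {
            ms.Seek(file.Length - AssumedFooterLength - 8, SeekOrigin.Begin);
            return ReadInt64BigEndian(ms);
        }
    }
}
```

Because the pointer sits at a fixed distance from the end of the file, a reader can find the summary section without scanning the blocks that precede it.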

