Namespace Lucene.Net.Codecs

Codecs API: API for customization of the encoding and structure of the index.

The Codec API allows you to customise the way the following pieces of index information are stored:

* Postings lists - see PostingsFormat
* DocValues - see DocValuesFormat
* Stored fields - see StoredFieldsFormat
* Term vectors - see TermVectorsFormat
* FieldInfos - see FieldInfosFormat
* SegmentInfo - see SegmentInfoFormat
* Norms - see NormsFormat
* Live documents - see LiveDocsFormat

For some concrete implementations beyond Lucene's official index format, see the Codecs module.

In the Java version of Lucene, codecs are identified by name through the Java Service Provider Interface. To create your own codec there, extend Codec and pass the new codec's name to the super() constructor:

    public class MyCodec extends Codec {
        public MyCodec() {
            super("MyCodecName");
        }
        ...
    }

You will need to register the Codec class so that the ServiceLoader can find it, by including a META-INF/services/org.apache.lucene.codecs.Codec file on your classpath that contains the package-qualified name of your codec. In Lucene.NET, codecs are instead registered through an ICodecFactory; see the Codec class below.

If you just want to customise the PostingsFormat, or use different postings formats for different fields, then you can register your custom postings format in the same way (in META-INF/services/org.apache.lucene.codecs.PostingsFormat), and then extend the default Lucene46Codec and override GetPostingsFormatForField(String) to return your custom postings format.

Similarly, if you just want to customise the DocValuesFormat per-field, have a look at GetDocValuesFormatForField(String).

Classes

BlockTermState

Holds all state required for PostingsReaderBase to produce a DocsEnum without re-seeking the terms dict.

BlockTreeTermsReader

A block-based terms index and dictionary that assigns terms to variable length blocks according to how they share prefixes. The terms index is a prefix trie whose leaves are term blocks. The advantage of this approach is that SeekExact() is often able to determine that a term cannot exist without doing any IO, and intersection with Automata is very fast. Note that this terms dictionary has its own fixed terms index (i.e., it does not support a pluggable terms index implementation).

NOTE: this terms dictionary does not support index divisor when opening an IndexReader. Instead, you can change the min/maxItemsPerBlock during indexing.

The data structure used by this implementation is very similar to a burst trie (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), but with added logic to break up too-large blocks of all terms sharing a given prefix into smaller ones.

Use CheckIndex with the -verbose option to see summary statistics on the blocks in the dictionary.

See BlockTreeTermsWriter.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

BlockTreeTermsReader.FieldReader

BlockTree's implementation of GetTerms(String).

BlockTreeTermsReader.Stats

BlockTree statistics for a single field returned by ComputeStats().

BlockTreeTermsWriter

Block-based terms index and dictionary writer.

Writes terms dict and index, block-encoding (column stride) each term's metadata for each set of terms between two index terms.

Files:

.tim:Term Dictionary
.tip:Term Index

Term Dictionary

The .tim file contains the list of terms in each field along with per-term statistics (such as docfreq) and per-term metadata (typically pointers to the postings list for that term in the inverted index).

The .tim file is arranged in blocks, each containing a variable number of entries (by default 25-48), where each entry is either a term or a reference to a sub-block.

NOTE: The term dictionary can plug into different postings implementations: the postings writer/reader are actually responsible for encoding and decoding the Postings Metadata and Term Metadata sections.

TermsDict (.tim) --> Header, PostingsHeader, NodeBlock^NumBlocks, FieldSummary, DirOffset, Footer
NodeBlock --> (OuterNode | InnerNode)
OuterNode --> EntryCount, SuffixLength, Byte^SuffixLength, StatsLength, < TermStats >^EntryCount, MetaLength, <TermMetadata>^EntryCount
InnerNode --> EntryCount, SuffixLength[,Sub?], Byte^SuffixLength, StatsLength, < TermStats ? >^EntryCount, MetaLength, <TermMetadata ? >^EntryCount
TermStats --> DocFreq, TotalTermFreq
FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength, Byte^RootCodeLength, SumTotalTermFreq?, SumDocFreq, DocCount>^NumFields
Header --> CodecHeader (WriteHeader(DataOutput, String, Int32))
DirOffset --> Uint64 (WriteInt64(Int64))
EntryCount, SuffixLength, StatsLength, DocFreq, MetaLength, NumFields, FieldNumber, RootCodeLength, DocCount --> VInt (WriteVInt32(Int32))
TotalTermFreq,NumTerms,SumTotalTermFreq,SumDocFreq --> VLong (WriteVInt64(Int64))
Footer --> CodecFooter (WriteFooter(IndexOutput))

Notes:

Header is a CodecHeader (WriteHeader(DataOutput, String, Int32)) storing the version information for the BlockTree implementation.
DirOffset is a pointer to the FieldSummary section.
DocFreq is the count of documents which contain the term.
TotalTermFreq is the total number of occurrences of the term. This is encoded as the difference between the total number of occurrences and the DocFreq.
FieldNumber is the field's number from Lucene.Net.Codecs.BlockTreeTermsWriter.fieldInfos (.fnm).
NumTerms is the number of unique terms for the field.
RootCode points to the root block for the field.
SumDocFreq is the total number of postings, that is, the number of term-document pairs across the entire field.
DocCount is the number of documents that have at least one posting for this field.
PostingsHeader and TermMetadata are supplied by the specific postings implementation: these contain arbitrary per-file data (such as parameters or versioning information) and per-term data (such as pointers to inverted files).
For inner nodes of the tree, every entry will steal one bit to mark whether it points to a child node (sub-block). If so, the corresponding TermStats and TermMetadata are omitted.
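
The variable-length integer encoding and the TotalTermFreq delta described in the notes above can be sketched as follows. This is a minimal in-memory illustration, not the library's DataOutput implementation: the class TermStatsSketch and its methods are hypothetical names, and a List<byte> stands in for the real output stream.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class TermStatsSketch
{
    // Variable-length encoding: 7 data bits per byte, low-order group first.
    // The high bit of each byte signals that another byte follows.
    public static void WriteVInt64(List<byte> output, long value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        while ((value & ~0x7FL) != 0)
        {
            output.Add((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        output.Add((byte)value);
    }

    public static long ReadVInt64(IReadOnlyList<byte> input, ref int pos)
    {
        long result = 0;
        int shift = 0;
        while (true)
        {
            byte b = input[pos++];
            result |= (long)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }

    // TermStats --> DocFreq, TotalTermFreq, where TotalTermFreq is stored
    // as the difference (totalTermFreq - docFreq).
    public static byte[] EncodeTermStats(int docFreq, long totalTermFreq)
    {
        var bytes = new List<byte>();
        WriteVInt64(bytes, docFreq);
        WriteVInt64(bytes, totalTermFreq - docFreq);
        return bytes.ToArray();
    }

    public static (int DocFreq, long TotalTermFreq) DecodeTermStats(byte[] bytes)
    {
        int pos = 0;
        int docFreq = (int)ReadVInt64(bytes, ref pos);
        long delta = ReadVInt64(bytes, ref pos);
        return (docFreq, docFreq + delta);
    }
}
```

Storing the difference keeps the second value small for terms that occur about once per document, so it usually fits in a single byte.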

Term Index

The .tip file contains an index into the term dictionary, so that it can be accessed randomly. The index is also used to determine when a given term cannot exist on disk (in the .tim file), saving a disk seek.

TermsIndex (.tip) --> Header, FSTIndex^NumFields, <IndexStartFP>^NumFields, DirOffset, Footer
Header --> CodecHeader (WriteHeader(DataOutput, String, Int32))
DirOffset --> Uint64 (WriteInt64(Int64))
IndexStartFP --> VLong (WriteVInt64(Int64))
FSTIndex --> FST{byte[]}
Footer --> CodecFooter (WriteFooter(IndexOutput))

Notes:

The .tip file contains a separate FST for each field. The FST maps a term prefix to the on-disk block that holds all terms starting with that prefix. Each field's IndexStartFP points to its FST.
DirOffset is a pointer to the start of the IndexStartFPs for all fields.
An on-disk block may contain too many terms, that is, more than the allowed maximum (default: 48). When this happens, the block is subdivided into new blocks (called "floor blocks"), and the output in the FST for the block's prefix then encodes the leading byte of each sub-block and its file pointer.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

Codec

Encodes/decodes an inverted index segment.

Note, when extending this class, the name (Name) is written into the index. In order for the segment to be read, the name must resolve to your implementation via ForName(String). This method uses GetCodec(String) to resolve codec names.

To implement your own codec:

Subclass this class.
Subclass DefaultCodecFactory, override the Initialize() method, and add the line base.ScanForCodecs(typeof(YourCodec).GetTypeInfo().Assembly). If you have any codec classes in your assembly that are not meant for reading, you can add the ExcludeCodecFromScanAttribute to them so they are ignored by the scan.
Set the new ICodecFactory by calling SetCodecFactory(ICodecFactory) at application startup.

If your codec has dependencies, you may also override GetCodec(Type) to inject them via pure DI or a DI container. See DI-Friendly Framework to understand the approach used.

Codec Names

Unlike the Java version, codec names are by default convention-based on the class name. If you name your custom codec class "MyCustomCodec", the codec name will be the same name without the "Codec" suffix: "MyCustom".

You can override this default behavior by using the CodecNameAttribute to name the codec differently than this convention. Codec names must be all ASCII alphanumeric, and less than 128 characters in length.
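
The naming convention and the constraints on names can be sketched roughly as below. This is an illustration under assumptions, not the library's actual code: CodecNameSketch, DefaultName and IsValidName are hypothetical names, and how the real implementation handles edge cases (for example a class named exactly "Codec") is not covered by this page.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class CodecNameSketch
{
    // Derive the default name by dropping the conventional suffix from the class name.
    public static string DefaultName(string typeName, string suffix = "Codec")
    {
        if (typeName.Length > suffix.Length &&
            typeName.EndsWith(suffix, StringComparison.Ordinal))
        {
            return typeName.Substring(0, typeName.Length - suffix.Length);
        }
        return typeName;
    }

    // Names must be ASCII alphanumeric and less than 128 characters long.
    public static bool IsValidName(string name)
    {
        if (name.Length >= 128) return false;
        foreach (char c in name)
        {
            bool isAsciiAlphanumeric =
                (c >= 'a' && c <= 'z') ||
                (c >= 'A' && c <= 'Z') ||
                (c >= '0' && c <= '9');
            if (!isAsciiAlphanumeric) return false;
        }
        return true;
    }
}
```

The same suffix-stripping idea applies to the other format types by passing a different suffix, for example DefaultName("MyCustomPostingsFormat", "PostingsFormat").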

CodecNameAttribute

Represents an attribute that is used to name a Codec, if a name other than the default Codec naming convention is desired.

CodecUtil

Utility class for reading and writing versioned headers.

Writing codec headers is useful to ensure that a file is in the format you think it is.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

DefaultCodecFactory

LUCENENET specific class that implements the default functionality for the ICodecFactory.

The most common use cases are:

Initialize DefaultCodecFactory with a set of CustomCodecTypes.
Subclass DefaultCodecFactory and override GetCodec(Type) so an external dependency injection container can be used to supply the instances (lifetime should be singleton). Note that you could alternately use the "named type" feature that many DI containers have to supply the type based on name by overriding GetCodec(String).
Subclass DefaultCodecFactory and override GetCodecType(String) so a new type can be supplied that is not in the Lucene.Net.Codecs.DefaultCodecFactory.codecNameToTypeMap.
Subclass DefaultCodecFactory to add new or override the default Codec types by overriding Initialize() and calling PutCodecType(Type).
Subclass DefaultCodecFactory to scan additional assemblies for Codec subclasses by overriding Initialize() and calling ScanForCodecs(Assembly). For performance reasons, the default behavior only loads Lucene.Net codecs.

To set the ICodecFactory, call SetCodecFactory(ICodecFactory).

DefaultDocValuesFormatFactory

LUCENENET specific class that implements the default functionality for the IDocValuesFormatFactory.

The most common use cases are:

Initialize DefaultDocValuesFormatFactory with a set of CustomDocValuesFormatTypes.
Subclass DefaultDocValuesFormatFactory and override GetDocValuesFormat(Type) so an external dependency injection container can be used to supply the instances (lifetime should be singleton). Note that you could alternately use the "named type" feature that many DI containers have to supply the type based on name by overriding GetDocValuesFormat(String).
Subclass DefaultDocValuesFormatFactory and override GetDocValuesFormatType(String) so a new type can be supplied that is not in the Lucene.Net.Codecs.DefaultDocValuesFormatFactory.docValuesFormatNameToTypeMap.
Subclass DefaultDocValuesFormatFactory to add new or override the default DocValuesFormat types by overriding Initialize() and calling PutDocValuesFormatType(Type).
Subclass DefaultDocValuesFormatFactory to scan additional assemblies for DocValuesFormat subclasses by overriding Initialize() and calling ScanForDocValuesFormats(Assembly). For performance reasons, the default behavior only loads Lucene.Net codecs.

To set the IDocValuesFormatFactory, call SetDocValuesFormatFactory(IDocValuesFormatFactory).

DefaultPostingsFormatFactory

LUCENENET specific class that implements the default functionality for the IPostingsFormatFactory.

The most common use cases are:

Initialize DefaultPostingsFormatFactory with a set of CustomPostingsFormatTypes.
Subclass DefaultPostingsFormatFactory and override GetPostingsFormat(Type) so an external dependency injection container can be used to supply the instances (lifetime should be singleton). Note that you could alternately use the "named type" feature that many DI containers have to supply the type based on name by overriding GetPostingsFormat(String).
Subclass DefaultPostingsFormatFactory and override GetPostingsFormatType(String) so a new type can be supplied that is not in the Lucene.Net.Codecs.DefaultPostingsFormatFactory.postingsFormatNameToTypeMap.
Subclass DefaultPostingsFormatFactory to add new or override the default PostingsFormat types by overriding Initialize() and calling PutPostingsFormatType(Type).
Subclass DefaultPostingsFormatFactory to scan additional assemblies for PostingsFormat subclasses by overriding Initialize() and calling ScanForPostingsFormats(Assembly). For performance reasons, the default behavior only loads Lucene.Net codecs.

To set the IPostingsFormatFactory, call SetPostingsFormatFactory(IPostingsFormatFactory).

DocValuesConsumer

Abstract API that consumes numeric, binary and sorted docvalues. Concrete implementations of this actually do "something" with the docvalues (write it into the index in a specific format).

The lifecycle is:

DocValuesConsumer is created by FieldsConsumer(SegmentWriteState) or NormsConsumer(SegmentWriteState).
AddNumericField(FieldInfo, IEnumerable<Nullable<Int64>>), AddBinaryField(FieldInfo, IEnumerable<BytesRef>), or AddSortedField(FieldInfo, IEnumerable<BytesRef>, IEnumerable<Nullable<Int64>>) are called for each Numeric, Binary, or Sorted docvalues field. The API is a "pull" rather than "push", and the implementation is free to iterate over the values multiple times (System.Collections.Generic.IEnumerable<T>.GetEnumerator()).
After all fields are added, the consumer is Dispose()d.
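
Because the API is "pull"-based, an implementation can enumerate the same values more than once. The sketch below illustrates that idea with a two-pass encoder: the first pass finds the minimum, the second stores each value as an offset from it. It is a simplified illustration, not how any particular Lucene.NET format works: PullApiSketch and EncodeAsDeltas are hypothetical names, and treating a missing (null) value as 0 is an assumption made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class PullApiSketch
{
    public static (long Min, long[] Deltas) EncodeAsDeltas(IEnumerable<long?> values)
    {
        // Pass 1: enumerate once to find the minimum and the number of values.
        long min = long.MaxValue;
        int count = 0;
        foreach (long? v in values)
        {
            min = Math.Min(min, v ?? 0L); // assumption: a missing value counts as 0
            count++;
        }
        if (count == 0) return (0L, Array.Empty<long>());

        // Pass 2: enumerate again and store each value relative to the minimum.
        // Overflow for extreme value ranges is ignored in this sketch.
        var deltas = new long[count];
        int i = 0;
        foreach (long? v in values)
        {
            deltas[i++] = (v ?? 0L) - min;
        }
        return (min, deltas);
    }
}
```

A "push" API would hand over each value exactly once, so an encoder like this would have to buffer all values itself before it could compute the minimum.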

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

DocValuesFormat

Encodes/decodes per-document values.

Note, when extending this class, the name (Name) may be written into the index in certain configurations. In order for the segment to be read, the name must resolve to your implementation via ForName(String). This method uses GetDocValuesFormat(String) to resolve format names.

To implement your own format:

Subclass this class.
Subclass DefaultDocValuesFormatFactory, override the Initialize() method, and add the line base.ScanForDocValuesFormats(typeof(YourDocValuesFormat).GetTypeInfo().Assembly). If you have any format classes in your assembly that are not meant for reading, you can add the ExcludeDocValuesFormatFromScanAttribute to them so they are ignored by the scan.
Set the new IDocValuesFormatFactory by calling SetDocValuesFormatFactory(IDocValuesFormatFactory) at application startup.

If your format has dependencies, you may also override GetDocValuesFormat(Type) to inject them via pure DI or a DI container. See DI-Friendly Framework to understand the approach used.

DocValuesFormat Names

Unlike the Java version, format names are by default convention-based on the class name. If you name your custom format class "MyCustomDocValuesFormat", the format name will be the same name without the "DocValuesFormat" suffix: "MyCustom".

You can override this default behavior by using the DocValuesFormatNameAttribute to name the format differently than this convention. Format names must be all ASCII alphanumeric, and less than 128 characters in length.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

DocValuesFormatNameAttribute

Represents an attribute that is used to name a DocValuesFormat, if a name other than the default DocValuesFormat naming convention is desired.

DocValuesProducer

Abstract API that produces numeric, binary and sorted docvalues.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

ExcludeCodecFromScanAttribute

When placed on a class that subclasses Codec, adding this attribute will exclude the type from consideration in the ScanForCodecs(Assembly) method.

However, the Codec type can still be added manually using PutCodecType(Type).

ExcludeDocValuesFormatFromScanAttribute

When placed on a class that subclasses DocValuesFormat, adding this attribute will exclude the type from consideration in the ScanForDocValuesFormats(Assembly) method.

However, the DocValuesFormat type can still be added manually using PutDocValuesFormatType(Type).

ExcludePostingsFormatFromScanAttribute

When placed on a class that subclasses PostingsFormat, adding this attribute will exclude the type from consideration in the ScanForPostingsFormats(Assembly) method.

However, the PostingsFormat type can still be added manually using PutPostingsFormatType(Type).

FieldInfosFormat

Encodes/decodes FieldInfos.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

FieldInfosReader

Codec API for reading FieldInfos.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

FieldInfosWriter

Codec API for writing FieldInfos.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

FieldsConsumer

Abstract API that consumes terms, doc, freq, prox, offset and payloads postings. Concrete implementations of this actually do "something" with the postings (write it into the index in a specific format).

The lifecycle is:

FieldsConsumer is created by FieldsConsumer(SegmentWriteState).
For each field, AddField(FieldInfo) is called, returning a TermsConsumer for the field.
After all fields are added, the consumer is Dispose()d.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

FieldsProducer

Abstract API that produces terms, doc, freq, prox, offset and payloads postings.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

FilterCodec

A codec that forwards all its method calls to another codec.

Extend this class when you need to reuse the functionality of an existing codec. For example, if you want to build a codec that redefines Lucene46's LiveDocsFormat:

    public sealed class CustomCodec : FilterCodec 
    {
        public CustomCodec()
            : base("CustomCodec", new Lucene46Codec())
        {
        }

        public override LiveDocsFormat LiveDocsFormat 
        {
            get { return new CustomLiveDocsFormat(); }
        }
    }

Please note: Don't call ForName(String) from the no-arg constructor of your own codec. When the DefaultCodecFactory loads your own Codec, the DefaultCodecFactory has not yet fully initialized! If you want to extend another Codec, instantiate it directly by calling its constructor.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

LiveDocsFormat

Format for live/deleted documents.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

MappingMultiDocsAndPositionsEnum

Exposes flex API, merged from flex API of sub-segments, remapping docIDs (this is used for segment merging).

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

MappingMultiDocsEnum

Exposes flex API, merged from flex API of sub-segments, remapping docIDs (this is used for segment merging).

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

MultiLevelSkipListReader

This abstract class reads skip lists with multiple levels.

See MultiLevelSkipListWriter for the information about the encoding of the multi level skip lists.

Subclasses must implement the abstract method ReadSkipData(Int32, IndexInput) which defines the actual format of the skip data.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

MultiLevelSkipListWriter

This abstract class writes skip lists with multiple levels.

Example for skipInterval = 3:
                                                    c            (skip level 2)
                c                 c                 c            (skip level 1)
    x     x     x     x     x     x     x     x     x     x      (skip level 0)
d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d  (posting list)
    3     6     9     12    15    18    21    24    27    30     (df)

d - document
x - skip data
c - skip data with child pointer

Skip level i contains every skipInterval-th entry from skip level i-1.
Therefore the number of entries on level i is: floor(df / (skipInterval ^ (i + 1))).

Each skip entry on a level i>0 contains a pointer to the corresponding skip entry in list i-1.
This guarantees a logarithmic amount of skips to find the target document.
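
The entry counts per level can be checked with a short calculation. The sketch below is only an illustration of the formula above: SkipLevelSketch and EntriesPerLevel are hypothetical names, and any cap a real writer may place on the number of levels is ignored here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class SkipLevelSketch
{
    // Returns the number of skip entries on each level, starting at level 0.
    // Level i holds floor(df / skipInterval^(i + 1)) entries.
    public static long[] EntriesPerLevel(long df, int skipInterval)
    {
        if (df < 0) throw new ArgumentOutOfRangeException(nameof(df));
        if (skipInterval < 2) throw new ArgumentOutOfRangeException(nameof(skipInterval));

        var counts = new List<long>();
        // Repeated integer division gives the same result as dividing by the power.
        long entries = df / skipInterval;
        while (entries > 0)
        {
            counts.Add(entries);
            entries /= skipInterval;
        }
        return counts.ToArray();
    }
}
```

For the diagram above (32 documents, skipInterval = 3), EntriesPerLevel(32, 3) yields 10, 3 and 1 entries for levels 0, 1 and 2.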

While this class takes care of writing the different skip levels,
subclasses must define the actual format of the skip data.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

NormsFormat

Encodes/decodes per-document score normalization values.

PostingsBaseFormat

Provides a PostingsReaderBase and PostingsWriterBase.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

PostingsConsumer

Abstract API that consumes postings for an individual term.

The lifecycle is:

PostingsConsumer is returned for each term by StartTerm(BytesRef).
StartDoc(Int32, Int32) is called for each document where the term occurs, specifying id and term frequency for that document.
If positions are enabled for the field, then AddPosition(Int32, BytesRef, Int32, Int32) will be called for each occurrence in the document.
FinishDoc() is called when the producer is done adding positions to the document.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

PostingsFormat

Encodes/decodes terms, postings, and proximity data.

If you implement your own format:

Subclass this class.
Subclass DefaultPostingsFormatFactory, override Initialize(), and add the line base.ScanForPostingsFormats(typeof(YourPostingsFormat).GetTypeInfo().Assembly). If you have any format classes in your assembly that are not meant for reading, you can add the ExcludePostingsFormatFromScanAttribute to them so they are ignored by the scan.
Set the new IPostingsFormatFactory by calling SetPostingsFormatFactory(IPostingsFormatFactory) at application startup.

If your format has dependencies, you may also override GetPostingsFormat(Type) to inject them via pure DI or a DI container. See DI-Friendly Framework to understand the approach used.

PostingsFormat Names

Unlike the Java version, format names are by default convention-based on the class name. If you name your custom format class "MyCustomPostingsFormat", the format name will be the same name without the "PostingsFormat" suffix: "MyCustom".

You can override this default behavior by using the PostingsFormatNameAttribute to name the format differently than this convention. Format names must be all ASCII alphanumeric, and less than 128 characters in length.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

PostingsFormatNameAttribute

Represents an attribute that is used to name a PostingsFormat, if a name other than the default PostingsFormat naming convention is desired.

PostingsReaderBase

The core terms dictionaries (BlockTermsReader, BlockTreeTermsReader) interact with a single instance of this class to manage creation of DocsEnum and DocsAndPositionsEnum instances. It provides an IndexInput (termsIn) where this class may read any previously stored data that it had written in its corresponding PostingsWriterBase at indexing time.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

PostingsWriterBase

Extension of PostingsConsumer to support pluggable term dictionaries.

This class contains additional hooks to interact with the provided term dictionaries such as BlockTreeTermsWriter. If you want to re-use an existing implementation and are only interested in customizing the format of the postings list, extend this class instead.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

SegmentInfoFormat

Expert: Controls the format of the SegmentInfo (segment metadata file).

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

SegmentInfoReader

Specifies an API for classes that can read SegmentInfo information.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

SegmentInfoWriter

Specifies an API for classes that can write out SegmentInfo data.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

StoredFieldsFormat

Controls the format of stored fields.

StoredFieldsReader

Codec API for reading stored fields.

You need to implement VisitDocument(Int32, StoredFieldVisitor) to read the stored fields for a document, implement Clone() (creating clones of any IndexInputs used, etc.), and implement Dispose(Boolean) to clean up any allocated resources.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

StoredFieldsWriter

Codec API for writing stored fields:

For every document, StartDocument(Int32) is called, informing the Codec how many fields will be written.
WriteField(FieldInfo, IIndexableField) is called for each field in the document.
After all documents have been written, Finish(FieldInfos, Int32) is called for verification/sanity-checks.
Finally, the writer is disposed (Dispose(Boolean)).

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

TermsConsumer

Abstract API that consumes terms for an individual field.

The lifecycle is:

TermsConsumer is returned for each field by AddField(FieldInfo).
TermsConsumer returns a PostingsConsumer for each term in StartTerm(BytesRef).
When the producer (e.g. IndexWriter) is done adding documents for the term, it calls FinishTerm(BytesRef, TermStats), passing in the accumulated term statistics.
Producer calls Finish(Int64, Int64, Int32) with the accumulated collection statistics when it is finished adding terms to the field.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

TermStats

Holder for per-term statistics.

TermVectorsFormat

Controls the format of term vectors.

TermVectorsReader

Codec API for reading term vectors.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

TermVectorsWriter

Codec API for writing term vectors:

For every document, StartDocument(Int32) is called, informing the Codec how many fields will be written.
StartField(FieldInfo, Int32, Boolean, Boolean, Boolean) is called for each field in the document, informing the codec how many terms will be written for that field, and whether or not positions, offsets, or payloads are enabled.
Within each field, StartTerm(BytesRef, Int32) is called for each term.
If offsets and/or positions are enabled, then AddPosition(Int32, Int32, Int32, BytesRef) will be called for each term occurrence.
After all documents have been written, Finish(FieldInfos, Int32) is called for verification/sanity-checks.
Finally, the writer is disposed (Dispose(Boolean)).

This is a Lucene.NET EXPERIMENTAL API, use at your own risk
