Namespace Lucene.Net.Codecs
Codecs API: API for customization of the encoding and structure of the index.
The Codec API allows you to customise the way the following pieces of index information are stored:
- Postings lists - see Postings
For some concrete implementations beyond Lucene's official index format, see the Codecs module.
In the Java version of Lucene, codecs are identified by name through the Java Service Provider Interface. There, you create your own codec by extending Codec and passing the new codec's name to the super() constructor:
public class MyCodec extends Codec {
    public MyCodec() {
        super("MyCodecName");
    }
    ...
}
The Codec class must then be registered so that java.util.ServiceLoader can find it, by including a META-INF/services/org.apache.lucene.codecs.Codec file on the classpath that contains the package-qualified name of the codec. Lucene.NET has no ServiceLoader; codecs are resolved by name through an ICodecFactory, as described under Codec below.
If you just want to customise the Postings
Similarly, if you just want to customise the Doc
Classes
BlockTermState
Holds all state required for Postings
BlockTreeTermsReader
A block-based terms index and dictionary that assigns terms to variable-length blocks according to how they share prefixes. The terms index is a prefix trie whose leaves are term blocks. The advantage of this approach is that SeekExact() is often able to determine that a term cannot exist without doing any IO, and intersection with Automata is very fast. Note that this terms dictionary has its own fixed terms index (i.e., it does not support a pluggable terms index implementation).
NOTE: this terms dictionary does not support index divisor when opening an IndexReader. Instead, you can change the min/maxItemsPerBlock during indexing.
The data structure used by this implementation is very similar to a burst trie (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), but with added logic to break up too-large blocks of all terms sharing a given prefix into smaller ones.
Use Check-verbose option to see summary statistics on the blocks in the dictionary.
See Block
BlockTreeTermsReader.FieldReader
BlockTree's implementation of Get
BlockTreeTermsReader.Stats
BlockTree statistics for a single field
returned by Compute
BlockTreeTermsWriter
Block-based terms index and dictionary writer.
Writes terms dict and index, block-encoding (column stride) each term's metadata for each set of terms between two index terms.
Files:
- .tim: Term Dictionary
- .tip: Term Index
Term Dictionary
The .tim file contains the list of terms in each field along with per-term statistics (such as docfreq) and per-term metadata (typically pointers to the postings list for that term in the inverted index).
The .tim file is arranged in blocks, each containing a variable number of entries (by default 25-48), where each entry is either a term or a reference to a sub-block.
NOTE: The term dictionary can plug into different postings implementations: the postings writer/reader are actually responsible for encoding and decoding the Postings Metadata and Term Metadata sections.
- TermsDict (.tim) --> Header, PostingsHeader, NodeBlock^NumBlocks, FieldSummary, DirOffset, Footer
- NodeBlock --> (OuterNode | InnerNode)
- OuterNode --> EntryCount, SuffixLength, Byte^SuffixLength, StatsLength, <TermStats>^EntryCount, MetaLength, <TermMetadata>^EntryCount
- InnerNode --> EntryCount, SuffixLength[,Sub?], Byte^SuffixLength, StatsLength, <TermStats?>^EntryCount, MetaLength, <TermMetadata?>^EntryCount
- TermStats --> DocFreq, TotalTermFreq
- FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength, Byte^RootCodeLength, SumTotalTermFreq?, SumDocFreq, DocCount>^NumFields
- Header --> CodecHeader (WriteHeader(DataOutput, String, Int32))
- DirOffset --> Uint64 (WriteInt64(Int64))
- EntryCount, SuffixLength, StatsLength, DocFreq, MetaLength, NumFields, FieldNumber, RootCodeLength, DocCount --> VInt (WriteVInt32(Int32))
- TotalTermFreq, NumTerms, SumTotalTermFreq, SumDocFreq --> VLong (WriteVInt64(Int64))
- Footer --> CodecFooter (WriteFooter(IndexOutput))
Notes:
- Header is a CodecHeader (WriteHeader(DataOutput, String, Int32)) storing the version information for the BlockTree implementation.
- DirOffset is a pointer to the FieldSummary section.
- DocFreq is the count of documents which contain the term.
- TotalTermFreq is the total number of occurrences of the term. This is encoded as the difference between the total number of occurrences and the DocFreq.
- FieldNumber is the field's number from Lucene.Net.Codecs.BlockTreeTermsWriter.fieldInfos (.fnm).
- NumTerms is the number of unique terms for the field.
- RootCode points to the root block for the field.
- SumDocFreq is the total number of postings, the number of term-document pairs across the entire field.
- DocCount is the number of documents that have at least one posting for this field.
- PostingsHeader and TermMetadata are plugged into by the specific postings implementation: these contain arbitrary per-file data (such as parameters or versioning information) and per-term data (such as pointers to inverted files).
- For inner nodes of the tree, every entry will steal one bit to mark whether it points to child nodes (sub-block). If so, the corresponding TermStats and TermMetadata are omitted.
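The delta encoding of TotalTermFreq and the variable-length VLong type mentioned above can be illustrated with a minimal, self-contained sketch. The class and method names here (TermStatsEncoding, WriteVLong, EncodeTotalTermFreq, DecodeTotalTermFreq) are hypothetical and are not part of Lucene.NET; the sketch assumes the usual variable-length layout of 7 payload bits per byte with the high bit marking continuation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not a Lucene.NET type.
public static class TermStatsEncoding
{
    // Variable-length encoding: 7 payload bits per byte, lowest bits first.
    // The high bit is set on every byte except the last one.
    public static byte[] WriteVLong(long value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        var bytes = new List<byte>();
        while ((value & ~0x7FL) != 0)
        {
            bytes.Add((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        bytes.Add((byte)value);
        return bytes.ToArray();
    }

    // Every document that contains the term contributes at least one
    // occurrence, so totalTermFreq >= docFreq and the difference is never negative.
    public static long EncodeTotalTermFreq(int docFreq, long totalTermFreq)
    {
        if (totalTermFreq < docFreq)
            throw new ArgumentException("totalTermFreq must be >= docFreq");
        return totalTermFreq - docFreq;
    }

    // Reverses EncodeTotalTermFreq once DocFreq has been read.
    public static long DecodeTotalTermFreq(int docFreq, long delta)
    {
        return docFreq + delta;
    }
}
```

For a term with DocFreq 100 and TotalTermFreq 130, the stored delta is 30, which fits in a single byte, whereas 130 itself would need two.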
Term Index
The .tip file contains an index into the term dictionary, so that it can be accessed randomly. The index is also used to determine when a given term cannot exist on disk (in the .tim file), saving a disk seek.
- TermsIndex (.tip) --> Header, FSTIndex^NumFields, <IndexStartFP>^NumFields, DirOffset, Footer
- Header --> CodecHeader (WriteHeader(DataOutput, String, Int32))
- DirOffset --> Uint64 (WriteInt64(Int64))
- IndexStartFP --> VLong (WriteVInt64(Int64))
- FSTIndex --> FST<byte[]>
- Footer --> CodecFooter (WriteFooter(IndexOutput))
Notes:
- The .tip file contains a separate FST for each field. The FST maps a term prefix to the on-disk block that holds all terms starting with that prefix. Each field's IndexStartFP points to its FST.
- DirOffset is a pointer to the start of the IndexStartFPs for all fields.
- It's possible that an on-disk block would contain too many terms (more than the allowed maximum (default: 48)). When this happens, the block is sub-divided into new blocks (called "floor blocks"), and then the output in the FST for the block's prefix encodes the leading byte of each sub-block, and its file pointer.
Codec
Encodes/decodes an inverted index segment.
Note, when extending this class, the name (Name) is
written into the index. In order for the segment to be read, the
name must resolve to your implementation via For
To implement your own codec:
- Subclass this class.
- Subclass DefaultCodecFactory, override the Initialize() method, and add the line base.ScanForCodecs(typeof(YourCodec).GetTypeInfo().Assembly). If you have any codec classes in your assembly that are not meant for reading, you can add the ExcludeCodecFromScanAttribute to them so they are ignored by the scan.
- Set the new ICodecFactory by calling SetCodecFactory(ICodecFactory) at application startup.
Codec Names
Unlike the Java version, codec names are by default convention-based on the class name. If you name your custom codec class "MyCustomCodec", the codec name will be the same name without the "Codec" suffix: "MyCustom".
You can override this default behavior by using the Codec
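The default naming convention described above amounts to stripping a known suffix from the class name. The following is a minimal sketch of that rule only; CodecNaming and DefaultName are hypothetical names, and this is not the code Lucene.NET itself uses to resolve names.

```csharp
using System;

// Hypothetical helper for illustration only; not a Lucene.NET type.
public static class CodecNaming
{
    // Returns the class name with the given suffix removed, or the class
    // name unchanged when it does not end with the suffix (or consists
    // of nothing but the suffix).
    public static string DefaultName(string className, string suffix)
    {
        if (className.Length > suffix.Length &&
            className.EndsWith(suffix, StringComparison.Ordinal))
        {
            return className.Substring(0, className.Length - suffix.Length);
        }
        return className;
    }
}
```

With this sketch, DefaultName("MyCustomCodec", "Codec") yields "MyCustom". The same rule with the suffixes "DocValuesFormat" and "PostingsFormat" matches the conventions described for those format classes.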
CodecNameAttribute
Represents an attribute that is used to name a Codec, if a name other than the default Codec naming convention is desired.
CodecUtil
Utility class for reading and writing versioned headers.
Writing codec headers is useful to ensure that a file is in the format you think it is.
DefaultCodecFactory
LUCENENET specific class that implements the default functionality for the
ICodec
The most common use cases are:
- Initialize DefaultCodecFactory with a set of CustomCodecTypes.
- Subclass DefaultCodecFactory and override GetCodec(Type) so an external dependency injection container can be used to supply the instances (lifetime should be singleton). Note that you could alternately use the "named type" feature that many DI containers have to supply the type based on name by overriding GetCodec(String).
- Subclass DefaultCodecFactory and override GetCodecType(String) so a new type can be supplied that is not in the Lucene.Net.Codecs.DefaultCodecFactory.codecNameToTypeMap.
- Subclass DefaultCodecFactory to add new or override the default Codec types by overriding Initialize() and calling PutCodecType(Type).
- Subclass DefaultCodecFactory to scan additional assemblies for Codec subclasses by overriding Initialize() and calling ScanForCodecs(Assembly). For performance reasons, the default behavior only loads Lucene.Net codecs.
To set the ICodec
DefaultDocValuesFormatFactory
LUCENENET specific class that implements the default functionality for the
IDoc
The most common use cases are:
- Initialize DefaultDocValuesFormatFactory with a set of CustomDocValuesFormatTypes.
- Subclass DefaultDocValuesFormatFactory and override GetDocValuesFormat(Type) so an external dependency injection container can be used to supply the instances (lifetime should be singleton). Note that you could alternately use the "named type" feature that many DI containers have to supply the type based on name by overriding GetDocValuesFormat(String).
- Subclass DefaultDocValuesFormatFactory and override GetDocValuesFormatType(String) so a new type can be supplied that is not in the Lucene.Net.Codecs.DefaultDocValuesFormatFactory.docValuesFormatNameToTypeMap.
- Subclass DefaultDocValuesFormatFactory to add new or override the default DocValuesFormat types by overriding Initialize() and calling PutDocValuesFormatType(Type).
- Subclass DefaultDocValuesFormatFactory to scan additional assemblies for DocValuesFormat subclasses by overriding Initialize() and calling ScanForDocValuesFormats(Assembly). For performance reasons, the default behavior only loads Lucene.Net codecs.
To set the IDoc
DefaultPostingsFormatFactory
LUCENENET specific class that implements the default functionality for the
IPostings
The most common use cases are:
- Initialize DefaultPostingsFormatFactory with a set of CustomPostingsFormatTypes.
- Subclass DefaultPostingsFormatFactory and override GetPostingsFormat(Type) so an external dependency injection container can be used to supply the instances (lifetime should be singleton). Note that you could alternately use the "named type" feature that many DI containers have to supply the type based on name by overriding GetPostingsFormat(String).
- Subclass DefaultPostingsFormatFactory and override GetPostingsFormatType(String) so a new type can be supplied that is not in the Lucene.Net.Codecs.DefaultPostingsFormatFactory.postingsFormatNameToTypeMap.
- Subclass DefaultPostingsFormatFactory to add new or override the default PostingsFormat types by overriding Initialize() and calling PutPostingsFormatType(Type).
- Subclass DefaultPostingsFormatFactory to scan additional assemblies for PostingsFormat subclasses by overriding Initialize() and calling ScanForPostingsFormats(Assembly). For performance reasons, the default behavior only loads Lucene.Net codecs.
To set the IPostings
DocValuesConsumer
Abstract API that consumes numeric, binary and sorted docvalues. Concrete implementations of this actually do "something" with the docvalues (write it into the index in a specific format).
The lifecycle is:
- DocValuesConsumer is created by FieldsConsumer(SegmentWriteState) or NormsConsumer(SegmentWriteState).
- AddNumericField(FieldInfo, IEnumerable<Nullable<Int64>>), AddBinaryField(FieldInfo, IEnumerable<BytesRef>), or AddSortedField(FieldInfo, IEnumerable<BytesRef>, IEnumerable<Nullable<Int64>>) are called for each Numeric, Binary, or Sorted docvalues field. The API is a "pull" rather than "push", and the implementation is free to iterate over the values multiple times (System.Collections.Generic.IEnumerable<T>.GetEnumerator()).
- After all fields are added, the consumer is Dispose()d.
DocValuesFormat
Encodes/decodes per-document values.
Note, when extending this class, the name (Name) may be
written into the index in certain configurations. In order for the segment
to be read, the name must resolve to your implementation via For
To implement your own format:
- Subclass this class.
- Subclass DefaultDocValuesFormatFactory, override the Initialize() method, and add the line base.ScanForDocValuesFormats(typeof(YourDocValuesFormat).GetTypeInfo().Assembly). If you have any format classes in your assembly that are not meant for reading, you can add the ExcludeDocValuesFormatFromScanAttribute to them so they are ignored by the scan.
- Set the new IDocValuesFormatFactory by calling SetDocValuesFormatFactory(IDocValuesFormatFactory) at application startup.
DocValuesFormat Names
Unlike the Java version, format names are by default convention-based on the class name. If you name your custom format class "MyCustomDocValuesFormat", the format name will be the same name without the "DocValuesFormat" suffix: "MyCustom".
You can override this default behavior by using the Doc
DocValuesFormatNameAttribute
Represents an attribute that is used to name a Doc
DocValuesProducer
Abstract API that produces numeric, binary and sorted docvalues.
ExcludeCodecFromScanAttribute
When placed on a class that subclasses Codec, adding this
attribute will exclude the type from consideration in the
Scan
However, the Codec type can still be added manually using
Put
ExcludeDocValuesFormatFromScanAttribute
When placed on a class that subclasses Doc
However, the Doc
ExcludePostingsFormatFromScanAttribute
When placed on a class that subclasses Postings
However, the Postings
FieldInfosFormat
Encodes/decodes Field
FieldInfosReader
Codec API for reading Field
FieldInfosWriter
Codec API for writing Field
FieldsConsumer
Abstract API that consumes terms, doc, freq, prox, offset and payloads postings. Concrete implementations of this actually do "something" with the postings (write it into the index in a specific format).
The lifecycle is:
- FieldsConsumer is created by FieldsConsumer(SegmentWriteState).
- For each field, AddField(FieldInfo) is called, returning a TermsConsumer for the field.
- After all fields are added, the consumer is Dispose()d.
FieldsProducer
Abstract API that produces terms, doc, freq, prox, offset and payloads postings.
FilterCodec
A codec that forwards all its method calls to another codec.
Extend this class when you need to reuse the functionality of an existing
codec. For example, if you want to build a codec that redefines Lucene46's
Live
public sealed class CustomCodec : FilterCodec
{
    public CustomCodec()
        : base("CustomCodec", new Lucene46Codec())
    {
    }

    public override LiveDocsFormat LiveDocsFormat
    {
        get { return new CustomLiveDocsFormat(); }
    }
}
Please note: Don't call For
LiveDocsFormat
Format for live/deleted documents.
MappingMultiDocsAndPositionsEnum
Exposes flex API, merged from flex API of sub-segments, remapping docIDs (this is used for segment merging).
MappingMultiDocsEnum
Exposes flex API, merged from flex API of sub-segments, remapping docIDs (this is used for segment merging).
MultiLevelSkipListReader
This abstract class reads skip lists with multiple levels.
See Multi
Subclasses must implement the abstract method Read
MultiLevelSkipListWriter
This abstract class writes skip lists with multiple levels.
Example for skipInterval = 3:
                                                    c            (skip level 2)
                c                 c                 c            (skip level 1)
    x     x     x     x     x     x     x     x     x     x      (skip level 0)
d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d  (posting list)
    3     6     9     12    15    18    21    24    27    30     (df)
d - document
x - skip data
c - skip data with child pointer
Skip level i contains every skipInterval-th entry from skip level i-1.
Therefore the number of entries on level i is: floor(df / (skipInterval ^ (i + 1))).
Each skip entry on a level i>0 contains a pointer to the corresponding skip entry in list i-1.
This guarantees a logarithmic amount of skips to find the target document.
While this class takes care of writing the different skip levels,
subclasses must define the actual format of the skip data.
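The entry-count formula above can be checked with a short, self-contained sketch. SkipListMath and EntriesPerLevel are hypothetical names used for illustration; the sketch only evaluates the formula and ignores any cap on the number of levels that a concrete writer may apply.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not a Lucene.NET type.
public static class SkipListMath
{
    // Returns the number of skip entries on each level, starting at level 0,
    // using floor(df / skipInterval^(i + 1)) for level i.
    public static IReadOnlyList<int> EntriesPerLevel(int df, int skipInterval)
    {
        if (df < 0) throw new ArgumentOutOfRangeException(nameof(df));
        if (skipInterval < 2) throw new ArgumentOutOfRangeException(nameof(skipInterval));

        var counts = new List<int>();
        // Repeated integer division equals floor(df / skipInterval^(i + 1)).
        int entries = df / skipInterval;
        while (entries > 0)
        {
            counts.Add(entries);
            entries /= skipInterval;
        }
        return counts;
    }
}
```

For the posting list drawn above (32 documents, skipInterval = 3), EntriesPerLevel(32, 3) gives 10, 3 and 1 entries for levels 0, 1 and 2, matching the x and c marks in the diagram.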
NormsFormat
Encodes/decodes per-document score normalization values.
PostingsBaseFormat
Provides a Postings
PostingsConsumer
Abstract API that consumes postings for an individual term.
The lifecycle is:
- PostingsConsumer is returned for each term by StartTerm(BytesRef).
- StartDoc(Int32, Int32) is called for each document where the term occurs, specifying id and term frequency for that document.
- If positions are enabled for the field, then AddPosition(Int32, BytesRef, Int32, Int32) will be called for each occurrence in the document.
- FinishDoc() is called when the producer is done adding positions to the document.
PostingsFormat
Encodes/decodes terms, postings, and proximity data.
Note, when extending this class, the name (Name) may be
written into the index in certain configurations. In order for the segment
to be read, the name must resolve to your implementation via For
If you implement your own format:
- Subclass this class.
- Subclass DefaultPostingsFormatFactory, override Initialize(), and add the line base.ScanForPostingsFormats(typeof(YourPostingsFormat).GetTypeInfo().Assembly). If you have any format classes in your assembly that are not meant for reading, you can add the ExcludePostingsFormatFromScanAttribute to them so they are ignored by the scan.
- Set the new IPostingsFormatFactory by calling SetPostingsFormatFactory(IPostingsFormatFactory) at application startup.
PostingsFormat Names
Unlike the Java version, format names are by default convention-based on the class name. If you name your custom format class "MyCustomPostingsFormat", the format name will be the same name without the "PostingsFormat" suffix: "MyCustom".
You can override this default behavior by using the Postings
PostingsFormatNameAttribute
Represents an attribute that is used to name a Postings
PostingsReaderBase
The core terms dictionaries (BlockTermsReader,
Block
PostingsWriterBase
Extension of Postings
This class contains additional hooks to interact with the provided
term dictionaries such as Block
SegmentInfoFormat
Expert: Controls the format of the
Segment
SegmentInfoReader
Specifies an API for classes that can read Segment
SegmentInfoWriter
Specifies an API for classes that can write out Segment
StoredFieldsFormat
Controls the format of stored fields.
StoredFieldsReader
Codec API for reading stored fields.
You need to implement Visit
StoredFieldsWriter
Codec API for writing stored fields:
- For every document, StartDocument(Int32) is called, informing the Codec how many fields will be written.
- WriteField(FieldInfo, IIndexableField) is called for each field in the document.
- After all documents have been written, Finish(FieldInfos, Int32) is called for verification/sanity-checks.
- Finally the writer is disposed (Dispose(Boolean)).
TermsConsumer
Abstract API that consumes terms for an individual field.
The lifecycle is:
- TermsConsumer is returned for each field by AddField(FieldInfo).
- TermsConsumer returns a PostingsConsumer for each term in StartTerm(BytesRef).
- When the producer (e.g. IndexWriter) is done adding documents for the term, it calls FinishTerm(BytesRef, TermStats), passing in the accumulated term statistics.
- Producer calls Finish(Int64, Int64, Int32) with the accumulated collection statistics when it is finished adding terms to the field.
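The order of calls in this lifecycle, together with the per-document calls listed under PostingsConsumer, can be traced with a small self-contained sketch. Nothing here uses Lucene.NET types: PostingsLifecycleDemo and TraceField are hypothetical, the calls are only recorded as strings, and positions are reduced to plain integers. The three values passed to Finish are assumed to be the sum of total term frequencies, the sum of document frequencies, and the document count.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tracer for illustration only; it records call names as
// strings and does not use or implement any Lucene.NET type.
public static class PostingsLifecycleDemo
{
    // postings maps term -> (document id -> positions of the term in that document).
    public static List<string> TraceField(
        string field,
        SortedDictionary<string, SortedDictionary<int, int[]>> postings)
    {
        var log = new List<string>();
        long sumTotalTermFreq = 0;
        long sumDocFreq = 0;
        var docsWithField = new SortedSet<int>();

        log.Add($"AddField({field})");
        foreach (var term in postings)
        {
            log.Add($"StartTerm({term.Key})");
            long totalTermFreq = 0;
            foreach (var doc in term.Value)
            {
                // Document id and term frequency within that document.
                log.Add($"StartDoc({doc.Key}, {doc.Value.Length})");
                foreach (int position in doc.Value)
                {
                    log.Add($"AddPosition({position})");
                }
                log.Add("FinishDoc()");
                totalTermFreq += doc.Value.Length;
                docsWithField.Add(doc.Key);
            }
            // Per-term statistics accumulated while the documents were added.
            log.Add($"FinishTerm({term.Key}, docFreq={term.Value.Count}, totalTermFreq={totalTermFreq})");
            sumTotalTermFreq += totalTermFreq;
            sumDocFreq += term.Value.Count;
        }
        // Collection statistics for the whole field.
        log.Add($"Finish({sumTotalTermFreq}, {sumDocFreq}, {docsWithField.Count})");
        return log;
    }
}
```

Tracing a field with a single term that occurs twice in document 1 produces AddField, StartTerm, StartDoc, two AddPosition calls, FinishDoc, FinishTerm and finally Finish(2, 1, 1).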
TermStats
Holder for per-term statistics.
TermVectorsFormat
Controls the format of term vectors.
TermVectorsReader
Codec API for reading term vectors.
TermVectorsWriter
Codec API for writing term vectors:
- For every document, StartDocument(Int32) is called, informing the Codec how many fields will be written.
- StartField(FieldInfo, Int32, Boolean, Boolean, Boolean) is called for each field in the document, informing the codec how many terms will be written for that field, and whether or not positions, offsets, or payloads are enabled.
- Within each field, StartTerm(BytesRef, Int32) is called for each term.
- If offsets and/or positions are enabled, then AddPosition(Int32, Int32, Int32, BytesRef) will be called for each term occurrence.
- After all documents have been written, Finish(FieldInfos, Int32) is called for verification/sanity-checks.
- Finally the writer is disposed (Dispose(Boolean)).
Interfaces
ICodecFactory
LUCENENET specific contract for extending the functionality of Codec implementations so they can be injected with dependencies.
To set the ICodec
IDocValuesFormatFactory
LUCENENET specific contract for extending the functionality of Doc
To set the IDoc
IPostingsFormatFactory
LUCENENET specific contract for extending the functionality of Postings
To set the IPostings