Namespace Lucene.Net.Index

Code to maintain and access indices.

Postings APIs
- Fields
- Terms
- Documents
- Positions
Index Statistics

Postings APIs

Fields

Fields is the initial entry point into the postings APIs. It can be obtained in several ways:

// access indexed fields for an index segment
Fields fields = reader.Fields;
// access term vector fields for a specified document
Fields vectorFields = reader.GetTermVectors(docid);

Fields implements .NET's IEnumerable<T> interface, so it's easy to enumerate the list of fields:

// enumerate list of fields
foreach (string field in fields)
{
    // access the terms for this field
    Terms terms = fields.GetTerms(field);
}

Terms

Terms represents the collection of terms within a field, exposes some metadata and statistics, and provides an API for enumeration.

// metadata about the field
Console.WriteLine("positions? " + terms.HasPositions);
Console.WriteLine("offsets? " + terms.HasOffsets);
Console.WriteLine("payloads? " + terms.HasPayloads);
// iterate through terms
TermsEnum termsEnum = terms.GetEnumerator();
while (termsEnum.MoveNext())
{
    DoSomethingWith(termsEnum.Term); // Term is a BytesRef
}

TermsEnum provides an enumerator over the list of terms within a field, some statistics about the term, and methods to access the term's documents and positions.

// seek to a specific term
bool found = termsEnum.SeekExact(new BytesRef("foobar"));
if (found)
{
    // get the document frequency
    Console.WriteLine(termsEnum.DocFreq);
    // enumerate through documents
    DocsEnum docs = termsEnum.Docs(null, null);
    // enumerate through documents and positions
    DocsAndPositionsEnum docsAndPositions = termsEnum.DocsAndPositions(null, null);
}

Documents

DocsEnum is an extension of DocIdSetIterator that iterates over the list of documents for a term, along with the term frequency within that document.

int docid;
while ((docid = docsEnum.NextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
{
    Console.WriteLine(docid);
    Console.WriteLine(docsEnum.Freq);
}

Positions

DocsAndPositionsEnum is an extension of DocsEnum that additionally allows iteration over the positions at which a term occurred within the document, and over any additional per-position information (offsets and payload).

int docid;
while ((docid = docsAndPositionsEnum.NextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
{
    Console.WriteLine(docid);
    int freq = docsAndPositionsEnum.Freq;
    for (int i = 0; i < freq; i++)
    {
        Console.WriteLine(docsAndPositionsEnum.NextPosition());
        Console.WriteLine(docsAndPositionsEnum.StartOffset);
        Console.WriteLine(docsAndPositionsEnum.EndOffset);
        Console.WriteLine(docsAndPositionsEnum.GetPayload());
    }
}

Index Statistics

Term statistics

DocFreq: Returns the number of documents that contain at least one occurrence of the term. This statistic is always available for an indexed term. Note that it will also count deleted documents; when segments are merged, the statistic is updated as those deleted documents are merged away.
TotalTermFreq: Returns the number of occurrences of this term across all documents. Note that this statistic is unavailable (returns -1) if term frequencies were omitted from the index (DOCS_ONLY) for the field. Like DocFreq, it will also count occurrences that appear in deleted documents.

Field statistics

Count: Returns the number of unique terms in the field. This statistic may be unavailable (returns -1) for some Terms implementations such as MultiTerms, where it cannot be efficiently computed. Note that this count also includes terms that appear only in deleted documents: when segments are merged such terms are also merged away and the statistic is then updated.
DocCount: Returns the number of documents that contain at least one occurrence of any term for this field. This can be thought of as a Field-level DocFreq. Like DocFreq it will also count deleted documents.
SumDocFreq: Returns the number of postings (term-document mappings in the inverted index) for the field. This can be thought of as the sum of TermsEnum.DocFreq across all terms in the field, and like DocFreq it will also count postings that appear in deleted documents.
SumTotalTermFreq: Returns the number of tokens for the field. This can be thought of as the sum of TermsEnum.TotalTermFreq across all terms in the field, and like TotalTermFreq it will also count occurrences that appear in deleted documents, and will be unavailable (returns -1) if term frequencies were omitted from the index (DOCS_ONLY) for the field.
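To make the relationships between the term and field statistics above concrete, here is a small self-contained sketch that computes them over a toy in-memory inverted index. It does not use Lucene.NET: the corpus, the postings dictionary, and the local functions are all invented for illustration, and deleted documents are not modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical corpus: one field, three documents, already tokenized.
string[][] docs =
{
    new[] { "quick", "brown", "fox", "quick" },
    new[] { "lazy", "dog", "quick" },
    new string[0], // this document has no terms in the field
};

// term -> (docId -> term frequency within that document)
var postings = new SortedDictionary<string, Dictionary<int, int>>(StringComparer.Ordinal);
for (int docId = 0; docId < docs.Length; docId++)
{
    foreach (string term in docs[docId])
    {
        if (!postings.TryGetValue(term, out var perDoc))
        {
            postings[term] = perDoc = new Dictionary<int, int>();
        }
        perDoc[docId] = perDoc.TryGetValue(docId, out int tf) ? tf + 1 : 1;
    }
}

// Term statistics.
int DocFreq(string term) => postings[term].Count;
long TotalTermFreq(string term) => postings[term].Values.Sum(v => (long)v);

// Field statistics.
long count = postings.Count;                                            // unique terms
int docCount = postings.Values.SelectMany(p => p.Keys).Distinct().Count(); // docs with any term
long sumDocFreq = postings.Keys.Sum(t => (long)DocFreq(t));             // number of postings
long sumTotalTermFreq = postings.Keys.Sum(t => TotalTermFreq(t));       // number of tokens

Console.WriteLine($"DocFreq(quick)={DocFreq("quick")} TotalTermFreq(quick)={TotalTermFreq("quick")}");
Console.WriteLine($"Count={count} DocCount={docCount} SumDocFreq={sumDocFreq} SumTotalTermFreq={sumTotalTermFreq}");
```

In this toy model SumTotalTermFreq equals the total number of tokens in the field (4 + 3 + 0), and DocCount is smaller than the number of documents because the third document contributes no terms.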

Segment statistics

MaxDoc: Returns the number of documents (including deleted documents) in the index.
NumDocs: Returns the number of live documents (excluding deleted documents) in the index.
NumDeletedDocs: Returns the number of deleted documents in the index.
Count: Returns the number of indexed fields.
UniqueTermCount: Returns the number of indexed terms, the sum of Count across all fields.

Document statistics

Document statistics are available during the indexing process for an indexed field: typically a Similarity implementation will store some of these values (possibly in a lossy way), into the normalization value for the document in its Similarity.ComputeNorm(FieldInvertState) method.

Length: Returns the number of tokens for this field in the document. Note that this is just the number of times that IncrementToken() returned true, and is unrelated to the values in PositionIncrementAttribute.
NumOverlap: Returns the number of tokens for this field in the document that had a position increment of zero. This can be used to compute a document length that discounts artificial tokens such as synonyms.
Position: Returns the accumulated position value for this field in the document: computed from the values of PositionIncrementAttribute and including GetPositionIncrementGap(String)s across multivalued fields.
Offset: Returns the total character offset value for this field in the document: computed from the values of OffsetAttribute returned by End(), and including GetOffsetGap(String)s across multivalued fields.
UniqueTermCount: Returns the number of unique terms encountered for this field in the document.
MaxTermFrequency: Returns the maximum frequency across all unique terms encountered for this field in the document.

Additional user-supplied statistics can be added to the document as DocValues fields and accessed via GetNumericDocValues(String).
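The document statistics listed above can be illustrated with a simplified, self-contained model of a token stream. This is only a sketch: the token list is hypothetical, Lucene.NET is not used, and gaps between multivalued fields as well as any final increment reported by End() are ignored.

```csharp
using System;
using System.Linq;

// Hypothetical tokens for one field of one document: (term, position increment).
// "fast" is a synonym stacked on "quick", so its position increment is zero.
(string Term, int PosInc)[] tokens =
{
    ("the", 1), ("quick", 1), ("fast", 0), ("fox", 1), ("the", 1),
};

int length = tokens.Length;                                      // Length: one per emitted token
int numOverlap = tokens.Count(t => t.PosInc == 0);               // NumOverlap: zero-increment tokens
int position = tokens.Sum(t => t.PosInc);                        // Position: accumulated increments
int uniqueTermCount = tokens.Select(t => t.Term).Distinct().Count();
int maxTermFrequency = tokens.GroupBy(t => t.Term).Max(g => g.Count());

// A document length that discounts artificial tokens such as synonyms.
int discountedLength = length - numOverlap;

Console.WriteLine($"Length={length} NumOverlap={numOverlap} Position={position}");
Console.WriteLine($"UniqueTermCount={uniqueTermCount} MaxTermFrequency={maxTermFrequency}");
Console.WriteLine($"Length - NumOverlap={discountedLength}");
```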

Classes

AtomicReader

AtomicReader is an abstract class, providing an interface for accessing an index. Search of an index is done entirely through this abstract interface, so that any subclass which implements it is searchable. IndexReaders implemented by this subclass do not consist of several sub-readers; they are atomic. They support retrieval of stored fields, doc values, terms, and postings.

For efficiency, in this API documents are often referred to via document numbers, non-negative integers which each name a unique document in the index. These document numbers are ephemeral -- they may change as documents are added to and deleted from an index. Clients should thus not rely on a given document having the same number between sessions.

NOTE: IndexReader instances are completely thread safe, meaning multiple threads can call any of their methods concurrently. If your application requires external synchronization, you should not synchronize on the IndexReader instance; use your own (non-Lucene) objects instead.

AtomicReaderContext

IndexReaderContext for AtomicReader instances.

BaseCompositeReader<R>

Base class for implementing CompositeReaders based on an array of sub-readers. The implementing class has to add code for correctly refcounting and closing the sub-readers.

User code will most likely use MultiReader to build a composite reader on a set of sub-readers (like several DirectoryReaders).

Note

This API is for internal purposes only and might change in incompatible ways in the next release.

BinaryDocValues

A per-document byte[]

BufferedUpdates

Holds buffered deletes and updates, by docID, term or query for a single segment. This is used to hold buffered pending deletes and updates against the to-be-flushed segment. Once the deletes and updates are pushed (on flush in Lucene.Net.Index.DocumentsWriter), they are converted to a FrozenDeletes instance.

NOTE: instances of this class are accessed either via a private instance on Lucene.Net.Index.DocumentsWriterPerThread, or via sync'd code by Lucene.Net.Index.DocumentsWriterDeleteQueue

ByteSliceReader

IndexInput that knows how to read the byte slices written by Posting and PostingVector. We read the bytes in each slice until we hit the end of that slice at which point we read the forwarding address of the next slice and then jump to it.

CheckAbort

Class for recording units of work when merging segments.

CheckIndex

Basic tool and API to check the health of an index and write a new segments file that removes reference to problematic segments.

As this tool checks every byte in the index, on a large index it can take quite a long time to run.

Please make a complete backup of your index before using this to fix your index!
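A minimal sketch of a read-only health check is shown below. It assumes the Lucene.NET 4.8 packages are referenced; the class name, method name, and indexPath parameter are hypothetical. The sketch only inspects the index and does not write a new segments file.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class IndexHealth
{
    // Returns true when CheckIndex found no problems in the index at indexPath.
    public static bool IsClean(string indexPath)
    {
        using (FSDirectory dir = FSDirectory.Open(indexPath))
        {
            var checker = new CheckIndex(dir);

            // Walks every segment; this can take a long time on a large index.
            CheckIndex.Status status = checker.DoCheckIndex();
            return status.Clean;
        }
    }
}
```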

Note

This API is experimental and might change in incompatible ways in the next release.

CheckIndex.Status

Returned from DoCheckIndex() detailing the health and status of the index.

Note

This API is experimental and might change in incompatible ways in the next release.

CheckIndex.Status.DocValuesStatus

Status from testing DocValues

CheckIndex.Status.FieldNormStatus

Status from testing field norms.

CheckIndex.Status.SegmentInfoStatus

Holds the status of each segment in the index. See SegmentInfos.

Note

This API is experimental and might change in incompatible ways in the next release.

CheckIndex.Status.StoredFieldStatus

Status from testing stored fields.

CheckIndex.Status.TermIndexStatus

Status from testing term index.

CheckIndex.Status.TermVectorStatus

Status from testing term vectors.

CompositeReader

Instances of this reader type can only be used to get stored fields from the underlying AtomicReaders, but it is not possible to directly retrieve postings. To do that, get the AtomicReaderContext for all sub-readers via Leaves. Alternatively, you can mimic an AtomicReader (with a serious slowdown), by wrapping composite readers with SlowCompositeReaderWrapper.

IndexReader instances for indexes on disk are usually constructed with a call to one of the static DirectoryReader.Open() methods, e.g. Open(Directory). DirectoryReader inherits from the CompositeReader abstract class, so it is not possible to directly get postings from it.

Concrete subclasses of IndexReader are usually constructed with a call to one of the static Open() methods, e.g. Open(Directory).
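The per-segment access pattern described above (going through Leaves to reach each AtomicReader) might look like the following sketch. It assumes the Lucene.NET 4.8 packages are referenced; the class and method names and the field/text parameters are hypothetical.

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Search; // DocIdSetIterator
using Lucene.Net.Util;   // BytesRef

public static class PerSegmentPostings
{
    // Prints the index-wide document number of every live document containing the term.
    public static void PrintDocsForTerm(IndexReader reader, string field, string text)
    {
        var term = new BytesRef(text);
        foreach (AtomicReaderContext context in reader.Leaves)
        {
            AtomicReader leaf = context.AtomicReader;

            Terms terms = leaf.GetTerms(field);
            if (terms == null) continue; // field does not exist in this segment

            TermsEnum termsEnum = terms.GetEnumerator();
            if (!termsEnum.SeekExact(term)) continue; // term does not exist in this segment

            // Passing LiveDocs skips deleted documents.
            DocsEnum docs = termsEnum.Docs(leaf.LiveDocs, null);
            int doc;
            while ((doc = docs.NextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
            {
                // Segment-local ids are rebased with DocBase.
                Console.WriteLine(context.DocBase + doc);
            }
        }
    }
}
```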

CompositeReaderContext

IndexReaderContext for CompositeReader instance.

CompositeReaderContext.Builder

ConcurrentMergeScheduler

A MergeScheduler that runs each merge using a separate thread.

Specify the max number of threads that may run at once, and the maximum number of simultaneous merges with SetMaxMergesAndThreads(Int32, Int32).

If the number of merges exceeds the max number of threads then the largest merges are paused until one of the smaller merges completes.

If more than MaxMergeCount merges are requested then this class will forcefully throttle the incoming threads by pausing until one or more merges complete.
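As a sketch of how these limits could be configured, assuming the Lucene.NET 4.8 packages are referenced. The class and method names are hypothetical, and the values 6 and 2 are arbitrary illustrations rather than recommendations.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.Util;

public static class MergeSchedulerSetup
{
    public static IndexWriterConfig Configure(Analyzer analyzer)
    {
        var scheduler = new ConcurrentMergeScheduler();

        // First argument: max simultaneous merges; second: max merge threads.
        // The merge count must be at least as large as the thread count.
        scheduler.SetMaxMergesAndThreads(6, 2);

        return new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
        {
            MergeScheduler = scheduler
        };
    }
}
```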

ConcurrentMergeScheduler.MergeThread

Runs a merge thread, which may run one or more merges in sequence.

CorruptIndexException

This exception is thrown when Lucene detects an inconsistency in the index.

DirectoryReader

DirectoryReader is an implementation of CompositeReader that can read indexes in a Directory.

DirectoryReader instances are usually constructed with a call to one of the static Open() methods, e.g. Open(Directory).
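For illustration, a reader over an on-disk index might be opened and disposed as sketched below. This assumes the Lucene.NET 4.8 packages are referenced; the class name, method name, and indexPath parameter are hypothetical.

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class ReaderExample
{
    public static void PrintDocCounts(string indexPath)
    {
        using (FSDirectory dir = FSDirectory.Open(indexPath))
        using (DirectoryReader reader = DirectoryReader.Open(dir))
        {
            // MaxDoc includes deleted documents; NumDocs does not.
            Console.WriteLine(
                $"MaxDoc={reader.MaxDoc}, NumDocs={reader.NumDocs}, NumDeletedDocs={reader.NumDeletedDocs}");
        }
    }
}
```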

DocsAndPositionsEnum

Also iterates through positions.

DocsEnum

Iterates through the documents and term freqs. NOTE: you must first call NextDoc() before using any of the per-doc methods.

DocTermOrds

This class enables fast access to multiple term ords for a specified field across all docIDs.

Like IFieldCache, it uninverts the index and holds a packed data structure in RAM to enable fast access. Unlike IFieldCache, it can handle multi-valued fields, and, it does not hold the term bytes in RAM. Rather, you must obtain a TermsEnum from the GetOrdTermsEnum(AtomicReader) method, and then seek-by-ord to get the term's bytes.

While normally term ords are type System.Int64, in this API they are System.Int32 as the internal representation here cannot address more than System.Int32.MaxValue unique terms. Also, typically this class is used on fields with relatively few unique terms vs the number of documents. In addition, there is an internal limit (16 MB) on how many bytes each chunk of documents may consume. If you trip this limit you'll hit a System.InvalidOperationException.

Deleted documents are skipped during uninversion, and if you look them up you'll get 0 ords.

The returned per-document ords do not retain their original order in the document. Instead they are returned in sorted (by ord, ie term's BytesRef comparer) order. They are also de-dup'd (ie if doc has same term more than once in this field, you'll only get that ord back once).

This class tests whether the provided reader is able to retrieve terms by ord (i.e., it's single segment, and it uses an ord-capable terms index). If not, this class will create its own term index internally, which allows it to create a wrapped TermsEnum that can handle ords. The GetOrdTermsEnum(AtomicReader) method then provides this wrapped enum, if necessary.

The RAM consumption of this class can be high!

Note

This API is experimental and might change in incompatible ways in the next release.

DocValues

This class contains utility methods and constants for DocValues

FieldInfo

Access to the Field Info file that describes document fields and whether or not they are indexed. Each segment has a separate Field Info file. Objects of this class are thread-safe for multiple readers, but only one thread can be adding documents at a time, with no other reader or writer threads accessing this object.

FieldInfos

Collection of FieldInfos (accessible by number or by name).

Note

This API is experimental and might change in incompatible ways in the next release.

FieldInvertState

This class tracks the number and position / offset parameters of terms being added to the index. The information collected in this class is also used to calculate the normalization factor for a field.

Note

This API is experimental and might change in incompatible ways in the next release.

Fields

Flex API for access to fields and terms

Note

This API is experimental and might change in incompatible ways in the next release.

FilterAtomicReader

A FilterAtomicReader contains another AtomicReader, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality. The class FilterAtomicReader itself simply implements all abstract methods of IndexReader with versions that pass all requests to the contained index reader. Subclasses of FilterAtomicReader may further override some of these methods and may also provide additional methods and fields.

NOTE: If you override LiveDocs, you will likely need to override NumDocs as well and vice-versa.

NOTE: If this FilterAtomicReader does not change the content of the contained reader, you could consider overriding CoreCacheKey so that IFieldCache and CachingWrapperFilter share the same entries for this atomic reader and the wrapped one. CombinedCoreAndDeletesKey could be overridden as well if the LiveDocs are not changed either.

FilterAtomicReader.FilterDocsAndPositionsEnum

Base class for filtering DocsAndPositionsEnum implementations.

FilterAtomicReader.FilterDocsEnum

Base class for filtering DocsEnum implementations.

FilterAtomicReader.FilterFields

Base class for filtering Fields implementations.

FilterAtomicReader.FilterTerms

Base class for filtering Terms implementations.

NOTE: If the order of terms and documents is not changed, and if these terms are going to be intersected with automata, you could consider overriding Intersect(CompiledAutomaton, BytesRef) for better performance.

FilterAtomicReader.FilterTermsEnum

Base class for filtering TermsEnum implementations.

FilterDirectoryReader

A FilterDirectoryReader wraps another DirectoryReader, allowing implementations to transform or extend it.

Subclasses should implement DoWrapDirectoryReader(DirectoryReader) to return an instance of the subclass.

If the subclass wants to wrap the DirectoryReader's subreaders, it should also implement a FilterDirectoryReader.SubReaderWrapper subclass, and pass an instance to its base constructor.

FilterDirectoryReader.StandardReaderWrapper

A no-op FilterDirectoryReader.SubReaderWrapper that simply returns the parent DirectoryReader's original subreaders.

FilterDirectoryReader.SubReaderWrapper

Factory class passed to FilterDirectoryReader constructor that allows subclasses to wrap the filtered DirectoryReader's subreaders. You can use this to, e.g., wrap the subreaders with specialized FilterAtomicReader implementations.

FilteredTermsEnum

Abstract class for enumerating a subset of all terms.

Term enumerations are always ordered by Comparer. Each term in the enumeration is greater than all that precede it.

Please note: Consumers of this enumeration cannot call Seek(), it is forward only; it throws System.NotSupportedException when a seeking method is called.

IndexCommit

Expert: represents a single commit into an index as seen by the IndexDeletionPolicy or IndexReader.

Changes to the content of an index are made visible only after the writer who made that change commits by writing a new segments file (segments_N). This point in time, when the action of writing of a new segments file to the directory is completed, is an index commit.

Each index commit point has a unique segments file associated with it. The segments file associated with a later index commit point would have a larger N.
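The commit points present in a directory can be inspected with a sketch like the one below. It assumes the Lucene.NET 4.8 packages are referenced and that DirectoryReader.ListCommits(Directory) is available in the version in use; the class name, method name, and indexPath parameter are hypothetical.

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class CommitLister
{
    // Prints the segments file and generation of every commit point in the index.
    public static void PrintCommits(string indexPath)
    {
        using (FSDirectory dir = FSDirectory.Open(indexPath))
        {
            foreach (IndexCommit commit in DirectoryReader.ListCommits(dir))
            {
                Console.WriteLine($"{commit.SegmentsFileName} (generation {commit.Generation})");
            }
        }
    }
}
```

With the default KeepOnlyLastCommitDeletionPolicy, such a listing would normally show a single commit.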

Note

This API is experimental and might change in incompatible ways in the next release.

IndexDeletionPolicy

Expert: policy for deletion of stale IndexCommits.

Implement this interface, and pass it to one of the IndexWriter or IndexReader constructors, to customize when older point-in-time commits (IndexCommit) are deleted from the index directory. The default deletion policy is KeepOnlyLastCommitDeletionPolicy, which always removes old commits as soon as a new commit is done (this matches the behavior before 2.2).

One expected use case for this (and the reason why it was first created) is to work around problems with an index directory accessed via filesystems like NFS because NFS does not provide the "delete on last close" semantics that Lucene's "point in time" search normally relies on. By implementing a custom deletion policy, such as "a commit is only removed once it has been stale for more than X minutes", you can give your readers time to refresh to the new commit before IndexWriter removes the old commits. Note that doing so will increase the storage requirements of the index. See LUCENE-710 for details.

Implementers of sub-classes should make sure that Clone() returns an independent instance able to work with any other IndexWriter or Directory instance.

IndexFileNames

This class contains useful constants representing filenames and extensions used by lucene, as well as convenience methods for querying whether a file name matches an extension (MatchesExtension(String, String)), as well as generating file names from a segment name, generation and extension (FileNameFromGeneration(String, String, Int64), SegmentFileName(String, String, String)).

NOTE: extensions used by codecs are not listed here. You must interact with the Codec directly.

Note

This API is for internal purposes only and might change in incompatible ways in the next release.

IndexFormatTooNewException

This exception is thrown when Lucene detects an index that is newer than this Lucene version.

IndexFormatTooOldException

This exception is thrown when Lucene detects an index that is too old for this Lucene version

IndexNotFoundException

Signals that no index was found in the Directory. This may be because the directory is empty, but it can also indicate index corruption.

IndexOptionsComparer

Represents an IndexOptions comparison operation that uses System.Int32 comparison rules.

Since in .NET the standard comparers will do boxing when comparing enum types, this class was created as a more performant alternative than calling CompareTo() on IndexOptions.

IndexReader

IndexReader is an abstract class, providing an interface for accessing an index. Search of an index is done entirely through this abstract interface, so that any subclass which implements it is searchable.

There are two different types of IndexReaders:

AtomicReader: These readers do not consist of several sub-readers; they are atomic. They support retrieval of stored fields, doc values, terms, and postings.
CompositeReader: Instances (like DirectoryReader) of this reader can only be used to get stored fields from the underlying AtomicReaders, but it is not possible to directly retrieve postings. To do that, get the sub-readers via GetSequentialSubReaders(). Alternatively, you can mimic an AtomicReader (with a serious slowdown), by wrapping composite readers with SlowCompositeReaderWrapper.

IndexReader instances for indexes on disk are usually constructed with a call to one of the static DirectoryReader.Open() methods, e.g. Open(Directory). DirectoryReader inherits from the CompositeReader abstract class, so it is not possible to directly get postings from it.

IndexReaderContext

A struct-like class that represents a hierarchical relationship between IndexReader instances.

IndexUpgrader

This is an easy-to-use tool that upgrades all segments of an index from previous Lucene versions to the current segment file format. It can be used from command line:

 java -cp lucene-core.jar Lucene.Net.Index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir

Alternatively, this class can be instantiated and Upgrade() invoked. It uses UpgradeIndexMergePolicy and triggers the upgrade via a ForceMerge(Int32) request to IndexWriter.

This tool keeps only the last commit in an index; for this reason, if the incoming index has more than one commit, the tool refuses to run by default. Specify -delete-prior-commits to override this, allowing the tool to delete all but the last commit. From .NET code this can be enabled by passing true to IndexUpgrader(Directory, LuceneVersion, TextWriter, Boolean).

Warning: this tool may reorder documents if the index was partially upgraded before execution (e.g., documents were added). If your application relies on "monotonicity" of doc IDs (which means that the order in which the documents were added to the index is preserved), do a full ForceMerge instead. The MergePolicy set by IndexWriterConfig may also reorder documents.
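Invoking the upgrader from .NET code, as described above, could look like this sketch. It assumes the Lucene.NET 4.8 packages are referenced; the class name, method name, and parameters are hypothetical, while the constructor arguments follow the IndexUpgrader(Directory, LuceneVersion, TextWriter, Boolean) signature quoted above.

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class UpgradeExample
{
    // Upgrades all segments of the index at indexPath to the current format.
    // Back up the index first: with deletePriorCommits set to true, all but
    // the last commit are removed.
    public static void UpgradeIndex(string indexPath, bool deletePriorCommits)
    {
        using (FSDirectory dir = FSDirectory.Open(indexPath))
        {
            var upgrader = new IndexUpgrader(
                dir, LuceneVersion.LUCENE_48, Console.Out, deletePriorCommits);
            upgrader.Upgrade();
        }
    }
}
```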

IndexWriter

An IndexWriter creates and maintains an index.

IndexWriter.IndexReaderWarmer

If Open(IndexWriter, Boolean) has been called (ie, this writer is in near real-time mode), then after a merge completes, this class can be invoked to warm the reader on the newly merged segment, before the merge commits. This is not required for near real-time search, but will reduce search latency on opening a new near real-time reader after a merge completes.

Note

This API is experimental and might change in incompatible ways in the next release.

NOTE: Warm(AtomicReader) is called before any deletes have been carried over to the merged segment.

IndexWriterConfig

Holds all the configuration that is used to create an IndexWriter. Once IndexWriter has been created with this object, changes to this object will not affect the IndexWriter instance. For that, use LiveIndexWriterConfig that is returned from Config.

LUCENENET NOTE: Unlike Lucene, we use property setters instead of setter methods. In C#, this allows you to initialize the IndexWriterConfig using the language features of C#, for example:

    IndexWriterConfig conf = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
    {
        Codec = new Lucene46Codec(),
        OpenMode = OpenMode.CREATE
    };

However, if you prefer to match the syntax of Lucene using chained setter methods, there are extension methods in the Lucene.Net.Index.Extensions namespace. Example usage:

    using Lucene.Net.Index.Extensions;

    ..

    IndexWriterConfig conf = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
        .SetCodec(new Lucene46Codec())
        .SetOpenMode(OpenMode.CREATE);

@since 3.1

KeepOnlyLastCommitDeletionPolicy

This IndexDeletionPolicy implementation keeps only the most recent commit and immediately removes all prior commits after a new commit is done. This is the default deletion policy.

LiveIndexWriterConfig

Holds all the configuration used by IndexWriter, with a few setters for settings that can be changed on an IndexWriter instance "live".

@since 4.0

LogByteSizeMergePolicy

This is a LogMergePolicy that measures size of a segment as the total byte size of the segment's files.

LogDocMergePolicy

This is a LogMergePolicy that measures size of a segment as the number of documents (not taking deletions into account).

LogMergePolicy

This class implements a MergePolicy that tries to merge segments into levels of exponentially increasing size, where each level has fewer segments than the value of the merge factor. Whenever extra segments (beyond the merge factor upper bound) are encountered, all segments within the level are merged. You can get or set the merge factor using MergeFactor.

This class is abstract and requires a subclass to define the Size(SegmentCommitInfo) method which specifies how a segment's size is determined. LogDocMergePolicy is one subclass that measures size by document count in the segment. LogByteSizeMergePolicy is another subclass that measures size as the total byte size of the file(s) for the segment.
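As a sketch, a concrete LogMergePolicy subclass can be selected and its merge factor set through IndexWriterConfig. This assumes the Lucene.NET 4.8 packages are referenced; the class and method names are hypothetical, and the value 10 is only an illustration.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.Util;

public static class LogMergePolicySetup
{
    public static IndexWriterConfig Configure(Analyzer analyzer)
    {
        // Measures segment size in bytes; LogDocMergePolicy would count documents instead.
        var policy = new LogByteSizeMergePolicy
        {
            MergeFactor = 10 // upper bound on segments per level before they are merged
        };

        return new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
        {
            MergePolicy = policy
        };
    }
}
```

A larger merge factor generally means fewer merges during indexing but more segments to search; a smaller one has the opposite effect.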

MergePolicy

Expert: a MergePolicy determines the sequence of primitive merge operations.

Whenever the segments in an index have been altered by IndexWriter, either the addition of a newly flushed segment, addition of many segments from AddIndexes* calls, or a previous merge that may now need to cascade, IndexWriter invokes FindMerges(MergeTrigger, SegmentInfos) to give the MergePolicy a chance to pick merges that are now required. This method returns a MergePolicy.MergeSpecification instance describing the set of merges that should be done, or null if no merges are necessary. When ForceMerge(Int32) is called, it calls FindForcedMerges(SegmentInfos, Int32, IDictionary<SegmentCommitInfo, Boolean>) and the MergePolicy should then return the necessary merges.

Note that the policy can return more than one merge at a time. In this case, if the writer is using SerialMergeScheduler, the merges will be run sequentially but if it is using ConcurrentMergeScheduler they will be run concurrently.

The default MergePolicy is TieredMergePolicy.

Note

This API is experimental and might change in incompatible ways in the next release.

MergePolicy.DocMap

A map of doc IDs.

MergePolicy.MergeAbortedException

Thrown when a merge was explicitly aborted because Dispose(Boolean) was called with false. Normally this exception is privately caught and suppressed by IndexWriter.

MergePolicy.MergeException

Exception thrown if there are any problems while executing a merge.

MergePolicy.MergeSpecification

A MergePolicy.MergeSpecification instance provides the information necessary to perform multiple merges. It simply contains a list of MergePolicy.OneMerge instances.

MergePolicy.OneMerge

OneMerge provides the information necessary to perform an individual primitive merge operation, resulting in a single new segment. The merge spec includes the subset of segments to be merged as well as whether the new segment should use the compound file format.

MergeScheduler

Expert: IndexWriter uses an instance implementing this interface to execute the merges selected by a MergePolicy. The default MergeScheduler is ConcurrentMergeScheduler.

Implementers of sub-classes should make sure that Clone() returns an independent instance able to work with any IndexWriter instance.

Note

This API is experimental and might change in incompatible ways in the next release.

MergeState

Holds common state used during segment merging.

Note

This API is experimental and might change in incompatible ways in the next release.

MergeState.DocMap

Remaps docids around deletes during merge

MultiDocsAndPositionsEnum

Exposes flex API, merged from flex API of sub-segments.

Note

This API is experimental and might change in incompatible ways in the next release.

MultiDocsAndPositionsEnum.EnumWithSlice

Holds a DocsAndPositionsEnum along with the corresponding ReaderSlice.

MultiDocsEnum

Exposes DocsEnum, merged from DocsEnum API of sub-segments.

Note

This API is experimental and might change in incompatible ways in the next release.

MultiDocsEnum.EnumWithSlice

Holds a DocsEnum along with the corresponding ReaderSlice.

MultiDocValues

A wrapper for CompositeReader providing access to DocValues.

NOTE: for multi readers, you'll get better performance by gathering the sub readers using Context to get the atomic leaves and then operate per-AtomicReader, instead of using this class.

NOTE: this is very costly.

Note

This API is experimental and might change in incompatible ways in the next release.

Note

This API is for internal purposes only and might change in incompatible ways in the next release.

MultiDocValues.MultiSortedDocValues

Implements SortedDocValues over n subs, using a MultiDocValues.OrdinalMap.

Note

This API is for internal purposes only and might change in incompatible ways in the next release.

MultiDocValues.MultiSortedSetDocValues

Implements SortedSetDocValues over n subs, using a MultiDocValues.OrdinalMap.

Note

This API is for internal purposes only and might change in incompatible ways in the next release.

MultiDocValues.OrdinalMap

Maps per-segment ordinals to/from global ordinal space.

MultiFields

Exposes flex API, merged from flex API of sub-segments. This is useful when you're interacting with an IndexReader implementation that consists of sequential sub-readers (eg DirectoryReader or MultiReader).

NOTE: for composite readers, you'll get better performance by gathering the sub readers using Context to get the atomic leaves and then operate per-AtomicReader, instead of using this class.

Note

This API is experimental and might change in incompatible ways in the next release.

MultiReader

A CompositeReader which reads multiple indexes, appending their content. It can be used to create a view on several sub-readers (like DirectoryReader) and execute searches on it.

MultiTerms

Exposes flex API, merged from flex API of sub-segments.

Note

This API is experimental and might change in incompatible ways in the next release.

MultiTermsEnum

Exposes TermsEnum API, merged from TermsEnum API of sub-segments. This does a merge sort, by term text, of the sub-readers.

Note

This API is experimental and might change in incompatible ways in the next release.

MultiTermsEnum.TermsEnumIndex

MultiTermsEnum.TermsEnumWithSlice

NoDeletionPolicy

An IndexDeletionPolicy which keeps all index commits around, never deleting them. This class is a singleton and can be accessed by referencing INSTANCE.

NoMergePolicy

A MergePolicy which never returns merges to execute (hence its name). It is also a singleton and can be accessed through NO_COMPOUND_FILES if you want to indicate the index does not use compound files, or through COMPOUND_FILES otherwise. Use it if you want to prevent an IndexWriter from ever executing merges, without going through the hassle of tweaking a merge policy's settings to achieve that, such as changing its merge factor.

NoMergeScheduler

A MergeScheduler which never executes any merges. It is also a singleton and can be accessed through INSTANCE. Use it if you want to prevent an IndexWriter from ever executing merges, regardless of the MergePolicy used. Note that you can achieve the same thing by using NoMergePolicy, however with NoMergeScheduler you also ensure that no unnecessary code of any MergeScheduler implementation is ever executed. Hence it is recommended to use both if you want to disable merges from ever happening.
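
Following the recommendation above to use both classes, a configuration that disables merges might look like this. It is a sketch only: NoMergeConfig and Create are hypothetical names, and LuceneVersion.LUCENE_48 is assumed as the match version.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.Util;

public static class NoMergeConfig
{
    // Hypothetical helper: builds a writer configuration that never merges.
    public static IndexWriterConfig Create(Analyzer analyzer)
    {
        // LUCENE_48 is an assumption; use the version that matches your index.
        var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);

        // The policy never selects merges; this instance also indicates that
        // the index does not use compound files.
        config.MergePolicy = NoMergePolicy.NO_COMPOUND_FILES;

        // The scheduler never runs merges, whatever the policy returns.
        config.MergeScheduler = NoMergeScheduler.INSTANCE;
        return config;
    }
}
```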

NumericDocValues

A per-document numeric value.

OrdTermState

An ordinal-based TermState.

Note

This API is experimental and might change in incompatible ways in the next release.

ParallelAtomicReader

An AtomicReader which reads multiple, parallel indexes. Each index added must have the same number of documents, but typically each contains different fields. Deletions are taken from the first reader. Each document contains the union of the fields of all documents with the same document number. When searching, matches for a query term are from the first index added that has the field.

This is useful, e.g., with collections that have large fields which change rarely and small fields that change more frequently. The smaller fields may be re-indexed in a new index and both indexes may be searched together.

ParallelCompositeReader

A CompositeReader which reads multiple, parallel indexes. Each index added must have the same number of documents, and exactly the same hierarchical subreader structure, but typically each contains different fields. Deletions are taken from the first reader. Each document contains the union of the fields of all documents with the same document number. When searching, matches for a query term are from the first index added that has the field.

Warning: It is up to you to make sure all indexes are created and modified the same way. For example, if you add documents to one index, you need to add the same documents in the same order to the other indexes. Failure to do so will result in undefined behavior. A good strategy to create suitable indexes with IndexWriter is to use LogDocMergePolicy, as this one does not reorder documents during merging (as TieredMergePolicy does) and triggers merges by number of documents per segment. If you use different MergePolicy implementations, the segment structure of your index may no longer be predictable.

PersistentSnapshotDeletionPolicy

A SnapshotDeletionPolicy which adds a persistence layer so that snapshots can be maintained across the life of an application. The snapshots are persisted in a Directory and are committed as soon as Snapshot() or Release(IndexCommit) is called.

NOTE: Sharing PersistentSnapshotDeletionPolicys that write to the same directory across IndexWriters will corrupt snapshots. You should make sure every IndexWriter has its own PersistentSnapshotDeletionPolicy and that they all write to a different Directory. It is OK to use the same Directory that holds the index.

This class adds a Release(Int64) method to release commits from a previous snapshot's Generation.

Note

This API is experimental and might change in incompatible ways in the next release.

RandomAccessOrds

Extension of SortedSetDocValues that supports random access to the ordinals of a document.

Operations via this API are independent of the iterator api (NextOrd()) and do not impact its state.

Codecs can optionally extend this API if they support constant-time access to ordinals for the document.

RandomAccessOrdsExtensions

ReaderManager

Utility class to safely share DirectoryReader instances across multiple threads, while periodically reopening. This class ensures each reader is disposed only once all threads have finished using it.
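
The acquire/release discipline this class relies on can be sketched as follows. The example assumes a ReaderManager created elsewhere, for instance with new ReaderManager(writer, true); ReaderManagerSketch and CountDocs are hypothetical names.

```csharp
using Lucene.Net.Index;

public static class ReaderManagerSketch
{
    // Hypothetical helper: reads the live document count from a shared reader.
    public static int CountDocs(ReaderManager manager)
    {
        // Reopen the reader if the index changed since the last refresh.
        manager.MaybeRefresh();

        DirectoryReader reader = manager.Acquire();
        try
        {
            return reader.NumDocs;
        }
        finally
        {
            // Always hand the reader back to the manager; do not dispose it directly.
            manager.Release(reader);
        }
    }
}
```

Every Acquire() is paired with a Release(...) in a finally block so that the manager can dispose old readers once no thread is using them.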

Note

This API is experimental and might change in incompatible ways in the next release.

ReaderSlice

Subreader slice from a parent composite reader.

Note

This API is for internal purposes only and might change in incompatible ways in the next release.

ReaderUtil

Common util methods for dealing with IndexReaders and IndexReaderContexts.

Note

This API is for internal purposes only and might change in incompatible ways in the next release.

SegmentCommitInfo

Embeds a [read-only] SegmentInfo and adds per-commit fields.

Note

This API is experimental and might change in incompatible ways in the next release.

SegmentInfo

Information about a segment such as its name, directory, and files related to the segment.

Note

This API is experimental and might change in incompatible ways in the next release.

SegmentInfos

A collection of SegmentInfo objects with methods for operating on those segments in relation to the file system.

The active segments in the index are stored in the segment info file, segments_N. There may be one or more segments_N files in the index; however, the one with the largest generation is the active one (when older segments_N files are present it's because they temporarily cannot be deleted, or, a writer is in the process of committing, or a custom IndexDeletionPolicy is in use). This file lists each segment by name and has details about the codec and generation of deletes.

There is also a file segments.gen. This file contains the current generation (the _N in segments_N) of the index. This is used only as a fallback in case the current generation cannot be accurately determined by directory listing alone (as is the case for some NFS clients with time-based directory cache expiration). This file simply contains a WriteInt32(Int32) version header (FORMAT_SEGMENTS_GEN_CURRENT), followed by the generation recorded as WriteInt64(Int64), written twice.

Files:

segments.gen: GenHeader, Generation, Generation, Footer
segments_N: Header, Version, NameCounter, SegCount, <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>^SegCount, CommitUserData, Footer

Data types:

Header --> WriteHeader(DataOutput, String, Int32)
GenHeader, NameCounter, SegCount, DeletionCount --> WriteInt32(Int32)
Generation, Version, DelGen, Checksum, FieldInfosGen --> WriteInt64(Int64)
SegName, SegCodec --> WriteString(String)
CommitUserData --> WriteStringStringMap(IDictionary<String, String>)
UpdatesFiles --> WriteStringSet(ISet<String>)
Footer --> WriteFooter(IndexOutput)

Field Descriptions:

Version counts how often the index has been changed by adding or deleting documents.
NameCounter is used to generate names for new segment files.
SegName is the name of the segment, and is used as the file name prefix for all of the files that compose the segment's index.
DelGen is the generation count of the deletes file. If this is -1, there are no deletes. Anything above zero means there are deletes stored by LiveDocsFormat.
DeletionCount records the number of deleted documents in this segment.
SegCodec is the Name of the Codec that encoded this segment.
CommitUserData stores an optional user-supplied opaque IDictionary<String, String> that was passed to SetCommitData(IDictionary<String, String>).
FieldInfosGen is the generation count of the fieldInfos file. If this is -1, there are no updates to the fieldInfos in that segment. Anything above zero means there are updates to fieldInfos stored by FieldInfosFormat.
UpdatesFiles stores the list of files that were updated in that segment.
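
To make the segments.gen layout above concrete, the sketch below decodes the header and the two generation values from a raw byte buffer. It is an illustration of the layout only, not Lucene.NET's own reader: SegmentsGenSketch and ReadGeneration are hypothetical names, the Footer is ignored, the header value is not validated, and the bytes are assumed to be big-endian (high-order byte first).

```csharp
using System;
using System.Buffers.Binary;

public static class SegmentsGenSketch
{
    // Hypothetical helper: returns the generation when both recorded copies
    // agree, or -1 when the buffer is too short or the copies differ.
    public static long ReadGeneration(ReadOnlySpan<byte> bytes, out int genHeader)
    {
        genHeader = 0;
        if (bytes.Length < 20)
        {
            return -1; // need a 4-byte GenHeader plus two 8-byte Generation values
        }

        genHeader = BinaryPrimitives.ReadInt32BigEndian(bytes.Slice(0, 4));
        long first = BinaryPrimitives.ReadInt64BigEndian(bytes.Slice(4, 8));
        long second = BinaryPrimitives.ReadInt64BigEndian(bytes.Slice(12, 8));

        // The generation is written twice; a mismatch suggests the file
        // should not be trusted.
        return first == second ? first : -1;
    }
}
```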

Note

This API is experimental and might change in incompatible ways in the next release.

SegmentInfos.FindSegmentsFile

Utility class for executing code that needs to do something with the current segments file. This is necessary with lock-less commits because from the time you locate the current segments file name, until you actually open it, read its contents, or check modified time, etc., it could have been deleted due to a writer commit finishing.

SegmentReader

IndexReader implementation over a single segment.

Instances pointing to the same segment (but with different deletes, etc) may share the same core data.

Note

This API is experimental and might change in incompatible ways in the next release.

SegmentReadState

Holder class for common parameters used during read.

Note

This API is experimental and might change in incompatible ways in the next release.

SegmentWriteState

Holder class for common parameters used during write.

Note

This API is experimental and might change in incompatible ways in the next release.

SerialMergeScheduler

A MergeScheduler that simply does each merge sequentially, using the current thread.

SimpleMergedSegmentWarmer

A very simple merged segment warmer that just ensures data structures are initialized.

SingleTermsEnum

Subclass of FilteredTermsEnum for enumerating a single term.

For example, this can be used by MultiTermQuerys that need only visit one term, but want to preserve MultiTermQuery semantics such as MultiTermRewriteMethod.

SlowCompositeReaderWrapper

This class forces a composite reader (e.g., a MultiReader or DirectoryReader) to emulate an atomic reader. This requires implementing the postings APIs on-the-fly, using the static methods in MultiFields and MultiDocValues, by stepping through the sub-readers to merge fields/terms, appending docs, etc.

NOTE: This class almost always results in a performance hit. If performance is important to your use case, you'll get better results by gathering the sub-readers using Context to get the atomic leaves and then operating per-AtomicReader, instead of using this class.

SnapshotDeletionPolicy

An IndexDeletionPolicy that wraps any other IndexDeletionPolicy and adds the ability to hold and later release snapshots of an index. While a snapshot is held, the IndexWriter will not remove any files associated with it even if the index is otherwise being actively, arbitrarily changed. Because we wrap another arbitrary IndexDeletionPolicy, this gives you the freedom to continue using whatever IndexDeletionPolicy you would normally want to use with your index.

This class maintains all snapshots in-memory, and so the information is not persisted and not protected against system failures. If persistence is important, you can use PersistentSnapshotDeletionPolicy.
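
A typical hold-and-release cycle, for example to back up index files, can be sketched as follows. The sketch assumes the policy was installed on the writer configuration beforehand (config.IndexDeletionPolicy = policy) and that at least one commit exists; SnapshotSketch and ListSnapshotFiles are hypothetical names.

```csharp
using System.Collections.Generic;
using Lucene.Net.Index;

public static class SnapshotSketch
{
    // Hypothetical helper: returns the names of the files that make up the
    // most recent commit while that commit is protected from deletion.
    public static IList<string> ListSnapshotFiles(SnapshotDeletionPolicy policy)
    {
        IndexCommit commit = policy.Snapshot();
        try
        {
            // A real backup would copy these files here, while the snapshot is held.
            return new List<string>(commit.FileNames);
        }
        finally
        {
            // Releasing the snapshot lets the writer delete the files again.
            policy.Release(commit);
        }
    }
}
```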

Note

This API is experimental and might change in incompatible ways in the next release.

SortedDocValues

A per-document byte[] with presorted values.

Per-Document values in a SortedDocValues are deduplicated, dereferenced, and sorted into a dictionary of unique values. A pointer to the dictionary value (ordinal) can be retrieved for each document. Ordinals are dense and in increasing sorted order.
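
The relationship between per-document values, the dictionary, and ordinals can be modeled in a few lines. This is a conceptual model only and does not use or reproduce Lucene.NET's implementation: SortedOrdinalsSketch and Build are hypothetical names, and strings with ordinal comparison stand in for the byte[] values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortedOrdinalsSketch
{
    // Hypothetical model: deduplicates and sorts the values into a dictionary,
    // then maps each document to the ordinal (index) of its value.
    public static (string[] Dictionary, int[] Ords) Build(IReadOnlyList<string> perDocValues)
    {
        string[] dictionary = perDocValues
            .Distinct()
            .OrderBy(v => v, StringComparer.Ordinal)
            .ToArray();

        int[] ords = perDocValues
            .Select(v => Array.BinarySearch(dictionary, v, StringComparer.Ordinal))
            .ToArray();

        return (dictionary, ords);
    }
}
```

For the per-document values "b", "a", "b", "c", the dictionary is "a", "b", "c" and the ordinals are 1, 0, 1, 2. Comparing two ordinals gives the same result as comparing the underlying values, which is what makes ordinals convenient for sorting.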

SortedSetDocValues

A per-document set of presorted byte[] values.

StoredFieldVisitor

Expert: Provides a low-level means of accessing the stored field values in an index. See Document(Int32, StoredFieldVisitor).

NOTE: A StoredFieldVisitor implementation should not try to load or visit other stored documents in the same reader because the implementation of stored fields for most codecs is not reentrant and you will see strange exceptions as a result.

See DocumentStoredFieldVisitor, which is a StoredFieldVisitor that builds the Document containing all stored fields. This is used by Document(Int32).

Note

This API is experimental and might change in incompatible ways in the next release.

Term

A Term represents a word from text. This is the unit of search. It is composed of two elements, the text of the word, as a string, and the name of the field that the text occurred in.

Note that terms may represent not only words from text fields, but also things like dates, email addresses, URLs, etc.

TermContext

Maintains an IndexReader TermState view over IndexReader instances containing a single term. The TermContext doesn't track whether the given TermState objects are valid, nor whether the TermState instances refer to the same terms in the associated readers.

Note

This API is experimental and might change in incompatible ways in the next release.

TermExtensions

Terms

Access to the terms in a specific field. See Fields.

Note

This API is experimental and might change in incompatible ways in the next release.

TermsEnum

Enumerator to seek (SeekCeil(BytesRef), SeekExact(BytesRef)) or step through (MoveNext()) terms to obtain the Term, frequency information (DocFreq), and a DocsEnum or DocsAndPositionsEnum for the current term (Docs(IBits, DocsEnum)).

Term enumerations are always ordered by Comparer. Each term in the enumeration is greater than the one before it.

The TermsEnum is unpositioned when you first obtain it and you must first successfully call MoveNext() or one of the Seek methods.

Note

This API is experimental and might change in incompatible ways in the next release.

TermState

Encapsulates all required internal state to position the associated TermsEnum without re-seeking.

Note

This API is experimental and might change in incompatible ways in the next release.

TieredMergePolicy

Merges segments of approximately equal size, subject to an allowed number of segments per tier. This is similar to LogByteSizeMergePolicy, except this merge policy is able to merge non-adjacent segments, and separates how many segments are merged at once (MaxMergeAtOnce) from how many segments are allowed per tier (SegmentsPerTier). This merge policy also does not over-merge (i.e. cascade merges).

For normal merging, this policy first computes a "budget" of how many segments are allowed to be in the index. If the index is over-budget, then the policy sorts segments by decreasing size (pro-rating by percent deletes), and then finds the least-cost merge. Merge cost is measured by a combination of the "skew" of the merge (size of largest segment divided by smallest segment), total merge size and percent deletes reclaimed, so that merges with lower skew, smaller size and those reclaiming more deletes, are favored.

If a merge will produce a segment that's larger than MaxMergedSegmentMB, then the policy will merge fewer segments (down to 1 at once, if that one has deletions) to keep the segment size under budget.

NOTE: This policy freely merges non-adjacent segments; if this is a problem, use LogMergePolicy.

NOTE: This policy always merges by byte size of the segments, always pro-rates by percent deletes, and does not apply any maximum segment size during forceMerge (unlike LogByteSizeMergePolicy).
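
The three settings named above can be set when the policy is constructed. The sketch below is illustrative only: TieredMergeConfig and Create are hypothetical names, and the values are examples rather than tuning recommendations.

```csharp
using Lucene.Net.Index;

public static class TieredMergeConfig
{
    // Hypothetical helper: builds a TieredMergePolicy with explicit settings.
    public static TieredMergePolicy Create()
    {
        return new TieredMergePolicy
        {
            MaxMergeAtOnce = 10,             // segments merged in a single merge
            SegmentsPerTier = 10.0,          // segments allowed per tier before merging
            MaxMergedSegmentMB = 5 * 1024.0  // cap on the size of a merged segment, in MB
        };
    }
}
```

The resulting policy is assigned to the writer configuration through its MergePolicy property before the IndexWriter is created.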

Note

This API is experimental and might change in incompatible ways in the next release.

TieredMergePolicy.MergeScore

Holds score and explanation for a single candidate merge.

TrackingIndexWriter

Class that tracks changes to a delegated IndexWriter, used by ControlledRealTimeReopenThread<T> to ensure specific changes are visible. Create this class (passing your IndexWriter), and then pass this class to ControlledRealTimeReopenThread<T>. Be sure to make all changes via the TrackingIndexWriter, otherwise ControlledRealTimeReopenThread<T> won't know about the changes.

Note

This API is experimental and might change in incompatible ways in the next release.

TwoPhaseCommitTool

A utility for executing 2-phase commit on several objects.
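
As an illustration, two IndexWriter instances can be committed together, since IndexWriter implements ITwoPhaseCommit. This is a minimal sketch; TwoPhaseCommitSketch and CommitTogether are hypothetical names, and the writers are assumed to be open and owned by the caller.

```csharp
using Lucene.Net.Index;

public static class TwoPhaseCommitSketch
{
    // Hypothetical helper: commits both writers as one unit of work.
    public static void CommitTogether(IndexWriter first, IndexWriter second)
    {
        // PrepareCommit() runs on every object first; Commit() follows only if
        // all preparations succeed. A failure surfaces as one of the nested
        // TwoPhaseCommitTool exception types.
        TwoPhaseCommitTool.Execute(first, second);
    }
}
```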

Note

This API is experimental and might change in incompatible ways in the next release.

TwoPhaseCommitTool.CommitFailException

Thrown by Execute(ITwoPhaseCommit[]) when an object fails to Commit().

TwoPhaseCommitTool.PrepareCommitFailException

Thrown by Execute(ITwoPhaseCommit[]) when an object fails to PrepareCommit().

UpgradeIndexMergePolicy

This MergePolicy is used for upgrading all existing segments of an index when calling ForceMerge(Int32). All other methods delegate to the base MergePolicy given to the constructor. This allows for an as-cheap-as-possible upgrade of an older index by only upgrading segments that were created by previous Lucene versions. ForceMerge no longer really merges; it is just used to "ForceMerge" older segment versions away.

In general one would use IndexUpgrader, but for a fully customizable upgrade, you can use this like any other MergePolicy and call ForceMerge(Int32):

    IndexWriterConfig iwc = new IndexWriterConfig(LuceneVersion.LUCENE_XX, new KeywordAnalyzer());
    iwc.MergePolicy = new UpgradeIndexMergePolicy(iwc.MergePolicy);
    using (IndexWriter w = new IndexWriter(dir, iwc))
    {
        w.ForceMerge(1);
    }

Warning: this merge policy may reorder documents if the index was partially upgraded before calling ForceMerge(Int32) (e.g., documents were added). If your application relies on "monotonicity" of doc IDs (which means that the order in which the documents were added to the index is preserved), do a ForceMerge(1) instead. Please note, the delegate MergePolicy may also reorder documents.

Note

This API is experimental and might change in incompatible ways in the next release.

Interfaces

IConcurrentMergeScheduler

IIndexableField

Represents a single field for indexing. IndexWriter consumes IEnumerable<IIndexableField> as a document.

Note

This API is experimental and might change in incompatible ways in the next release.

IIndexableFieldType

Describes the properties of a field.

Note

This API is experimental and might change in incompatible ways in the next release.

IMergeScheduler

IndexReader.IReaderClosedListener

A custom listener that's invoked when the IndexReader is closed.

Note

This API is experimental and might change in incompatible ways in the next release.

IndexWriter.IEvent

Interface for internal atomic events. See Lucene.Net.Index.DocumentsWriter for details. Events are executed concurrently and no order is guaranteed. Each event should only rely on the serializability within its Process method. All actions that must happen before or after a certain action must be encoded inside the Process(IndexWriter, Boolean, Boolean) method.

ITwoPhaseCommit

An interface for implementations that support 2-phase commit. You can use TwoPhaseCommitTool to execute a 2-phase commit algorithm over several ITwoPhaseCommits.

Note

This API is experimental and might change in incompatible ways in the next release.

SegmentReader.ICoreDisposedListener

Called when the shared core for this SegmentReader is disposed.

This listener is called only once all SegmentReaders sharing the same core are disposed. At this point it is safe for apps to evict this reader from any caches keyed on CoreCacheKey. This is the same interface that IFieldCache uses, internally, to evict entries.

NOTE: This was CoreClosedListener in Lucene.

Note

This API is experimental and might change in incompatible ways in the next release.
