Namespace Lucene.Net.Index
Code to maintain and access indices.
Postings APIs
Fields
Fields is the initial entry point into the postings APIs. It can be obtained in several ways:
// access indexed fields for an index segment
Fields fields = reader.Fields;
// access term vector fields for a specified document
Fields fields = reader.GetTermVectors(docid);
Fields implements .NET's IEnumerable<T> interface, so it's easy to enumerate the list of fields:
// enumerate list of fields
foreach (string field in fields)
{
    // access the terms for this field
    Terms terms = fields.GetTerms(field);
}
Terms
Terms represents the collection of terms within a field. It exposes some metadata and statistics, and an API for enumeration.
// metadata about the field
Console.WriteLine("positions? " + terms.HasPositions);
Console.WriteLine("offsets? " + terms.HasOffsets);
Console.WriteLine("payloads? " + terms.HasPayloads);
// iterate through terms
TermsEnum termsEnum = terms.GetEnumerator();
while (termsEnum.MoveNext())
{
    DoSomethingWith(termsEnum.Term); // Term is a BytesRef
}
TermsEnum provides an enumerator over the list of terms within a field, some statistics about the term, and methods to access the term's documents and positions.
// seek to a specific term
bool found = termsEnum.SeekExact(new BytesRef("foobar"));
if (found)
{
    // get the document frequency
    Console.WriteLine(termsEnum.DocFreq);
    // enumerate through documents
    DocsEnum docs = termsEnum.Docs(null, null);
    // enumerate through documents and positions
    DocsAndPositionsEnum docsAndPositions = termsEnum.DocsAndPositions(null, null);
}
Documents
DocsEnum is an extension of DocIdSetIterator that iterates over the list of documents for a term, along with the term frequency within that document.
int docid;
while ((docid = docsEnum.NextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
{
    Console.WriteLine(docid);
    Console.WriteLine(docsEnum.Freq);
}
Positions
DocsAndPositionsEnum is an extension of DocsEnum that additionally allows iteration over the positions at which a term occurred within the document, and over any additional per-position information (offsets and payload).
int docid;
while ((docid = docsAndPositionsEnum.NextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
{
    Console.WriteLine(docid);
    int freq = docsAndPositionsEnum.Freq;
    for (int i = 0; i < freq; i++)
    {
        Console.WriteLine(docsAndPositionsEnum.NextPosition());
        Console.WriteLine(docsAndPositionsEnum.StartOffset);
        Console.WriteLine(docsAndPositionsEnum.EndOffset);
        Console.WriteLine(docsAndPositionsEnum.GetPayload());
    }
}
Index Statistics
Term statistics
- DocFreq: Returns the number of documents that contain at least one occurrence of the term. This statistic is always available for an indexed term. Note that it will also count deleted documents; when segments are merged, the statistic is updated as those deleted documents are merged away.
- TotalTermFreq: Returns the number of occurrences of this term across all documents. Note that this statistic is unavailable (returns -1) if term frequencies were omitted from the index (DOCS_ONLY) for the field. Like DocFreq, it will also count occurrences that appear in deleted documents.
Field statistics
- Count: Returns the number of unique terms in the field. This statistic may be unavailable (returns -1) for some Terms implementations such as MultiTerms, where it cannot be efficiently computed. Note that this count also includes terms that appear only in deleted documents: when segments are merged such terms are also merged away and the statistic is then updated.
- DocCount: Returns the number of documents that contain at least one occurrence of any term for this field. This can be thought of as a Field-level DocFreq. Like DocFreq, it will also count deleted documents.
- SumDocFreq: Returns the number of postings (term-document mappings in the inverted index) for the field. This can be thought of as the sum of TermsEnum.DocFreq across all terms in the field, and like DocFreq it will also count postings that appear in deleted documents.
- SumTotalTermFreq: Returns the number of tokens for the field. This can be thought of as the sum of TermsEnum.TotalTermFreq across all terms in the field, and like TotalTermFreq it will also count occurrences that appear in deleted documents, and will be unavailable (returns -1) if term frequencies were omitted from the index (DOCS_ONLY) for the field.
Segment statistics
- MaxDoc: Returns the number of documents (including deleted documents) in the index.
- NumDocs: Returns the number of live documents (excluding deleted documents) in the index.
- NumDeletedDocs: Returns the number of deleted documents in the index.
- Count: Returns the number of indexed fields.
- UniqueTermCount: Returns the number of indexed terms, the sum of Count across all fields.
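The term, field, and segment statistics above can be read back from a small index as in the following sketch. It is illustrative only: the class name IndexStatisticsSketch, the field name "body", the sample text, and the choice of RAMDirectory, WhitespaceAnalyzer (from the Lucene.Net.Analysis.Common package), and LuceneVersion.LUCENE_48 are all assumptions made for the example.

```csharp
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class IndexStatisticsSketch
{
    // Indexes two small documents into an in-memory directory, then reads
    // term, field, and segment statistics for the hypothetical "body" field.
    public static long[] Collect()
    {
        using (var dir = new RAMDirectory())
        {
            var config = new IndexWriterConfig(
                LuceneVersion.LUCENE_48,
                new WhitespaceAnalyzer(LuceneVersion.LUCENE_48));
            using (var writer = new IndexWriter(dir, config))
            {
                foreach (string text in new[] { "lucene lucene index", "lucene search" })
                {
                    var doc = new Document();
                    doc.Add(new TextField("body", text, Field.Store.NO));
                    writer.AddDocument(doc);
                }
            } // disposing the writer commits the pending documents

            using (DirectoryReader reader = DirectoryReader.Open(dir))
            {
                // Both documents were flushed together, so there is one leaf.
                AtomicReader leaf = reader.Leaves[0].AtomicReader;
                Terms terms = leaf.Fields.GetTerms("body");

                long docFreq = 0, totalTermFreq = 0;
                TermsEnum termsEnum = terms.GetEnumerator();
                if (termsEnum.SeekExact(new BytesRef("lucene")))
                {
                    docFreq = termsEnum.DocFreq;             // term statistic
                    totalTermFreq = termsEnum.TotalTermFreq; // term statistic
                }

                return new long[]
                {
                    docFreq, totalTermFreq,
                    // field statistics
                    terms.Count, terms.DocCount, terms.SumDocFreq, terms.SumTotalTermFreq,
                    // segment statistics
                    leaf.MaxDoc, leaf.NumDocs, leaf.NumDeletedDocs
                };
            }
        }
    }
}
```

For this sample data, IndexStatisticsSketch.Collect() yields 2, 3, 3, 2, 4, 5, 2, 2, 0: the term "lucene" occurs three times across two documents, the field holds three unique terms and five tokens, and nothing has been deleted.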
Document statistics
Document statistics are available during the indexing process for an indexed field. Typically a Similarity implementation will store some of these values (possibly in a lossy way) in the normalization value for the document in its Similarity.ComputeNorm(FieldInvertState) method.
- Length: Returns the number of tokens for this field in the document. Note that this is just the number of times that IncrementToken() returned true, and is unrelated to the values in PositionIncrementAttribute.
- NumOverlap: Returns the number of tokens for this field in the document that had a position increment of zero. This can be used to compute a document length that discounts artificial tokens such as synonyms.
- Position: Returns the accumulated position value for this field in the document: computed from the values of PositionIncrementAttribute and including GetPositionIncrementGap(String)s across multivalued fields.
- Offset: Returns the total character offset value for this field in the document: computed from the values of OffsetAttribute returned by End(), and including GetOffsetGap(String)s across multivalued fields.
- UniqueTermCount: Returns the number of unique terms encountered for this field in the document.
- MaxTermFrequency: Returns the maximum frequency across all unique terms encountered for this field in the document.
Additional user-supplied statistics can be added to the document as DocValues fields and accessed via GetNumericDocValues(String).
Classes
AtomicReader
AtomicReader is an abstract class, providing an interface for accessing an index. Search of an index is done entirely through this abstract interface, so that any subclass which implements it is searchable. IndexReaders implemented by this subclass do not consist of several sub-readers; they are atomic. They support retrieval of stored fields, doc values, terms, and postings.
For efficiency, in this API documents are often referred to via document numbers, non-negative integers which each name a unique document in the index. These document numbers are ephemeral -- they may change as documents are added to and deleted from an index. Clients should thus not rely on a given document having the same number between sessions.
NOTE: IndexReader instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently. If your application requires external synchronization, you should not synchronize on the IndexReader instance; use your own (non-Lucene) objects instead.
AtomicReaderContext
IndexReaderContext for AtomicReader instances.
BaseCompositeReader<R>
Base class for implementing CompositeReaders based on an array of sub-readers. The implementing class has to add code for correctly refcounting and closing the sub-readers.
User code will most likely use MultiReader to build a composite reader on a set of sub-readers (like several DirectoryReaders).
For efficiency, in this API documents are often referred to via document numbers, non-negative integers which each name a unique document in the index. These document numbers are ephemeral -- they may change as documents are added to and deleted from an index. Clients should thus not rely on a given document having the same number between sessions.
NOTE: IndexReader instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently. If your application requires external synchronization, you should not synchronize on the IndexReader instance; use your own (non-Lucene) objects instead.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
BinaryDocValues
A per-document byte[]
BufferedUpdates
Holds buffered deletes and updates, by docID, term or query for a single segment. This is used to hold buffered pending deletes and updates against the to-be-flushed segment. Once the deletes and updates are pushed (on flush in Lucene.Net.Index.DocumentsWriter), they are converted to a FrozenDeletes instance.
NOTE: instances of this class are accessed either via a private instance on Lucene.Net.Index.DocumentsWriterPerThread, or via sync'd code by Lucene.Net.Index.DocumentsWriterDeleteQueue.
ByteSliceReader
IndexInput that knows how to read the byte slices written by Posting and PostingVector. We read the bytes in each slice until we hit the end of that slice at which point we read the forwarding address of the next slice and then jump to it.
CheckAbort
Class for recording units of work when merging segments.
CheckIndex
Basic tool and API to check the health of an index and write a new segments file that removes references to problematic segments.
As this tool checks every byte in the index, on a large index it can take quite a long time to run.
Please make a complete backup of your index before using this to fix your index!
Note
This API is experimental and might change in incompatible ways in the next release.
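A minimal sketch of the API route follows: it runs a read-only check and reports whether the index is clean. The class name CheckIndexSketch, the sample document, and the use of RAMDirectory, WhitespaceAnalyzer, and LuceneVersion.LUCENE_48 are assumptions for illustration; a real check would normally target an on-disk directory that no writer has open.

```csharp
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class CheckIndexSketch
{
    // Runs a read-only health check; nothing is modified or repaired here.
    public static bool IsClean(Directory dir)
    {
        var checker = new CheckIndex(dir);
        CheckIndex.Status status = checker.DoCheckIndex();
        return status.Clean;
    }

    // Builds a one-document in-memory index and checks it.
    public static bool Demo()
    {
        using (var dir = new RAMDirectory())
        {
            var config = new IndexWriterConfig(
                LuceneVersion.LUCENE_48,
                new WhitespaceAnalyzer(LuceneVersion.LUCENE_48));
            using (var writer = new IndexWriter(dir, config))
            {
                var doc = new Document();
                doc.Add(new TextField("body", "hello index", Field.Store.NO));
                writer.AddDocument(doc);
            } // the writer is disposed (and committed) before checking

            return IsClean(dir);
        }
    }
}
```

The sketch deliberately stops at inspection. Writing a repaired segments file drops the problematic segments and their documents, which is why a backup comes first.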
CheckIndex.Status
Returned from DoCheckIndex() detailing the health and status of the index.
Note
This API is experimental and might change in incompatible ways in the next release.
CheckIndex.Status.DocValuesStatus
Status from testing DocValues.
CheckIndex.Status.FieldNormStatus
Status from testing field norms.
CheckIndex.Status.SegmentInfoStatus
Holds the status of each segment in the index. See SegmentInfos.
Note
This API is experimental and might change in incompatible ways in the next release.
CheckIndex.Status.StoredFieldStatus
Status from testing stored fields.
CheckIndex.Status.TermIndexStatus
Status from testing term index.
CheckIndex.Status.TermVectorStatus
Status from testing term vectors.
CompositeReader
Instances of this reader type can only be used to get stored fields from the underlying AtomicReaders, but it is not possible to directly retrieve postings. To do that, get the AtomicReaderContext for all sub-readers via Leaves. Alternatively, you can mimic an AtomicReader (with a serious slowdown), by wrapping composite readers with SlowCompositeReaderWrapper.
IndexReader instances for indexes on disk are usually constructed with a call to one of the static DirectoryReader.Open() methods, e.g. Open(Directory). Because DirectoryReader extends the CompositeReader abstract class, it is not possible to directly get postings from it.
Concrete subclasses of IndexReader are usually constructed with a call to one of the static Open() methods, e.g. Open(Directory).
For efficiency, in this API documents are often referred to via document numbers, non-negative integers which each name a unique document in the index. These document numbers are ephemeral -- they may change as documents are added to and deleted from an index. Clients should thus not rely on a given document having the same number between sessions.
NOTE: IndexReader instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently. If your application requires external synchronization, you should not synchronize on the IndexReader instance; use your own (non-Lucene) objects instead.
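The per-leaf access pattern described for composite readers might look like the sketch below, which walks Leaves and converts each leaf-local document number to a top-level one by adding the leaf's DocBase. The class name LeafPostingsSketch, the "body" field, the sample text, and the RAMDirectory and WhitespaceAnalyzer setup are assumptions for illustration.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class LeafPostingsSketch
{
    // Collects top-level document numbers of live documents that contain
    // `term` in `field`, visiting each atomic leaf of the reader in turn.
    public static List<int> FindDocs(IndexReader reader, string field, string term)
    {
        var hits = new List<int>();
        foreach (AtomicReaderContext context in reader.Leaves)
        {
            AtomicReader leaf = context.AtomicReader;
            Terms terms = leaf.Fields?.GetTerms(field);
            if (terms == null) continue; // field has no postings in this segment

            TermsEnum termsEnum = terms.GetEnumerator();
            if (!termsEnum.SeekExact(new BytesRef(term))) continue;

            // Passing LiveDocs skips deleted documents.
            DocsEnum docs = termsEnum.Docs(leaf.LiveDocs, null);
            int docId;
            while ((docId = docs.NextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
            {
                hits.Add(context.DocBase + docId); // leaf-local -> top-level
            }
        }
        return hits;
    }

    // Commits twice so the documents land in separate segments.
    public static List<int> Demo()
    {
        using (var dir = new RAMDirectory())
        {
            var config = new IndexWriterConfig(
                LuceneVersion.LUCENE_48,
                new WhitespaceAnalyzer(LuceneVersion.LUCENE_48));
            using (var writer = new IndexWriter(dir, config))
            {
                foreach (string text in new[] { "lucene index", "lucene search" })
                {
                    var doc = new Document();
                    doc.Add(new TextField("body", text, Field.Store.NO));
                    writer.AddDocument(doc);
                    writer.Commit();
                }
            }

            using (DirectoryReader reader = DirectoryReader.Open(dir))
            {
                return FindDocs(reader, "body", "lucene");
            }
        }
    }
}
```

Working per leaf avoids the slowdown of SlowCompositeReaderWrapper, at the cost of handling the DocBase offset yourself.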
CompositeReaderContext
IndexReaderContext for CompositeReader instances.
CompositeReaderContext.Builder
ConcurrentMergeScheduler
A MergeScheduler that runs each merge using a separate thread.
Specify the max number of threads that may run at once, and the maximum number of simultaneous merges with SetMaxMergesAndThreads(Int32, Int32).
If the number of merges exceeds the max number of threads then the largest merges are paused until one of the smaller merges completes.
If more than MaxMergeCount merges are requested then this class will forcefully throttle the incoming threads by pausing until one or more merges complete.
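As a sketch of how these limits might be configured, the example below caps the scheduler at four pending merges and two merge threads, then attaches it to a writer configuration. The class name MergeSchedulerSketch, the specific limits, and the WhitespaceAnalyzer and LuceneVersion.LUCENE_48 choices are assumptions for illustration.

```csharp
using Lucene.Net.Analysis.Core;
using Lucene.Net.Index;
using Lucene.Net.Util;

public static class MergeSchedulerSketch
{
    public static ConcurrentMergeScheduler CreateScheduler()
    {
        var scheduler = new ConcurrentMergeScheduler();
        // First argument: maximum simultaneous merges (MaxMergeCount).
        // Second argument: maximum merge threads (MaxThreadCount), which
        // must not exceed the first.
        scheduler.SetMaxMergesAndThreads(4, 2);
        return scheduler;
    }

    public static IndexWriterConfig CreateConfig()
    {
        var config = new IndexWriterConfig(
            LuceneVersion.LUCENE_48,
            new WhitespaceAnalyzer(LuceneVersion.LUCENE_48));
        config.MergeScheduler = CreateScheduler();
        return config;
    }
}
```

With these values, a third and fourth merge would wait for a free thread, and a fifth request would throttle the indexing thread that triggered it.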
ConcurrentMergeScheduler.MergeThread
Runs a merge thread, which may run one or more merges in sequence.
CorruptIndexException
This exception is thrown when Lucene detects an inconsistency in the index.
DirectoryReader
DirectoryReader is an implementation of CompositeReader that can read indexes in a Directory.
DirectoryReader instances are usually constructed with a call to one of the static Open() methods, e.g. Open(Directory).
For efficiency, in this API documents are often referred to via document numbers, non-negative integers which each name a unique document in the index. These document numbers are ephemeral -- they may change as documents are added to and deleted from an index. Clients should thus not rely on a given document having the same number between sessions.
NOTE: IndexReader instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently. If your application requires external synchronization, you should not synchronize on the IndexReader instance; use your own (non-Lucene) objects instead.
DocsAndPositionsEnum
Also iterates through positions.
DocsEnum
Iterates through the documents and term freqs. NOTE: you must first call NextDoc() before using any of the per-doc methods.
DocTermOrds
This class enables fast access to multiple term ords for a specified field across all docIDs.
Like IFieldCache, it uninverts the index and holds a packed data structure in RAM to enable fast access. Unlike IFieldCache, it can handle multi-valued fields, and it does not hold the term bytes in RAM. Rather, you must obtain a TermsEnum from the GetOrdTermsEnum(AtomicReader) method, and then seek-by-ord to get the term's bytes.
While normally term ords are type System.Int64, in this API they are System.Int32 as the internal representation here cannot address more than Lucene.Net.Index.BufferedUpdates.MAX_INT32 unique terms. Also, typically this class is used on fields with relatively few unique terms vs the number of documents. In addition, there is an internal limit (16 MB) on how many bytes each chunk of documents may consume. If you trip this limit you'll hit a System.InvalidOperationException.
Deleted documents are skipped during uninversion, and if you look them up you'll get 0 ords.
The returned per-document ords do not retain their original order in the document. Instead they are returned in sorted (by ord, i.e. term's BytesRef comparer) order. They are also de-duplicated (i.e. if a doc has the same term more than once in this field, you'll only get that ord back once).
This class tests whether the provided reader is able to retrieve terms by ord (i.e., it's single segment, and it uses an ord-capable terms index). If not, this class will create its own term index internally, allowing it to create a wrapped TermsEnum that can handle ord. The GetOrdTermsEnum(AtomicReader) method then provides this wrapped enum, if necessary.
The RAM consumption of this class can be high!
Note
This API is experimental and might change in incompatible ways in the next release.
DocValues
This class contains utility methods and constants for DocValues
FieldInfo
Access to the Field Info file that describes document fields and whether or not they are indexed. Each segment has a separate Field Info file. Objects of this class are thread-safe for multiple readers, but only one thread can be adding documents at a time, with no other reader or writer threads accessing this object.
FieldInfos
Collection of FieldInfos (accessible by number or by name).
Note
This API is experimental and might change in incompatible ways in the next release.
FieldInvertState
This class tracks the number and position / offset parameters of terms being added to the index. The information collected in this class is also used to calculate the normalization factor for a field.
Note
This API is experimental and might change in incompatible ways in the next release.
Fields
Flex API for access to fields and terms
Note
This API is experimental and might change in incompatible ways in the next release.
FilterAtomicReader
A FilterAtomicReader contains another AtomicReader, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality. The class FilterAtomicReader itself simply implements all abstract methods of IndexReader with versions that pass all requests to the contained index reader. Subclasses of FilterAtomicReader may further override some of these methods and may also provide additional methods and fields.
NOTE: If you override LiveDocs, you will likely need to override NumDocs as well and vice-versa.
NOTE: If this FilterAtomicReader does not change the content of the contained reader, you could consider overriding CoreCacheKey so that IFieldCache and CachingWrapperFilter share the same entries for this atomic reader and the wrapped one. CombinedCoreAndDeletesKey could be overridden as well if the LiveDocs are not changed either.
FilterAtomicReader.FilterDocsAndPositionsEnum
Base class for filtering DocsAndPositionsEnum implementations.
FilterAtomicReader.FilterDocsEnum
Base class for filtering DocsEnum implementations.
FilterAtomicReader.FilterFields
Base class for filtering Fields implementations.
FilterAtomicReader.FilterTerms
Base class for filtering Terms implementations.
NOTE: If the order of terms and documents is not changed, and if these terms are going to be intersected with automata, you could consider overriding Intersect(CompiledAutomaton, BytesRef) for better performance.
FilterAtomicReader.FilterTermsEnum
Base class for filtering TermsEnum implementations.
FilterDirectoryReader
A FilterDirectoryReader wraps another DirectoryReader, allowing implementations to transform or extend it.
Subclasses should implement DoWrapDirectoryReader(DirectoryReader) to return an instance of the subclass.
If the subclass wants to wrap the DirectoryReader's subreaders, it should also implement a FilterDirectoryReader.SubReaderWrapper subclass, and pass an instance to its base constructor.
FilterDirectoryReader.StandardReaderWrapper
A no-op FilterDirectoryReader.SubReaderWrapper that simply returns the parent DirectoryReader's original subreaders.
FilterDirectoryReader.SubReaderWrapper
Factory class passed to FilterDirectoryReader constructor that allows subclasses to wrap the filtered DirectoryReader's subreaders. You can use this to, e.g., wrap the subreaders with specialized FilterAtomicReader implementations.
FilteredTermsEnum
Abstract class for enumerating a subset of all terms.
Term enumerations are always ordered by Comparer. Each term in the enumeration is greater than all that precede it.
Please note: Consumers of this enumeration cannot call Seek(); it is forward only, and it throws System.NotSupportedException when a seeking method is called.
IndexCommit
Expert: represents a single commit into an index as seen by the IndexDeletionPolicy or IndexReader.
Changes to the content of an index are made visible only after the writer who made that change commits by writing a new segments file (segments_N). This point in time, when the action of writing of a new segments file to the directory is completed, is an index commit.
Each index commit point has a unique segments file associated with it. The segments file associated with a later index commit point would have a larger N.
Note
This API is experimental and might change in incompatible ways in the next release.
IndexDeletionPolicy
Expert: policy for deletion of stale IndexCommits.
Implement this interface, and pass it to one of the IndexWriter or IndexReader constructors, to customize when older point-in-time commits (IndexCommit) are deleted from the index directory. The default deletion policy is KeepOnlyLastCommitDeletionPolicy, which always removes old commits as soon as a new commit is done (this matches the behavior before 2.2).
One expected use case for this (and the reason why it was first created) is to work around problems with an index directory accessed via filesystems like NFS because NFS does not provide the "delete on last close" semantics that Lucene's "point in time" search normally relies on. By implementing a custom deletion policy, such as "a commit is only removed once it has been stale for more than X minutes", you can give your readers time to refresh to the new commit before IndexWriter removes the old commits. Note that doing so will increase the storage requirements of the index. See LUCENE-710 for details.
Implementers of sub-classes should make sure that Clone() returns an independent instance able to work with any other IndexWriter or Directory instance.
IndexFileNames
This class contains useful constants representing filenames and extensions used by Lucene, convenience methods for querying whether a file name matches an extension (MatchesExtension(String, String)), and methods for generating file names from a segment name, generation and extension (FileNameFromGeneration(String, String, Int64), SegmentFileName(String, String, String)).
NOTE: extensions used by codecs are not listed here. You must interact with the Codec directly.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
IndexFormatTooNewException
This exception is thrown when Lucene detects an index that is newer than this Lucene version.
IndexFormatTooOldException
This exception is thrown when Lucene detects an index that is too old for this Lucene version.
IndexNotFoundException
Signals that no index was found in the Directory. This is possibly because the directory is empty; however, it can also indicate index corruption.
IndexOptionsComparer
Represents an IndexOptions comparison operation that uses System.Int32 comparison rules. Since in .NET the standard comparers will do boxing when comparing enum types, this class was created as a more performant alternative to calling CompareTo() on IndexOptions.
IndexReader
IndexReader is an abstract class, providing an interface for accessing an index. Search of an index is done entirely through this abstract interface, so that any subclass which implements it is searchable.
There are two different types of IndexReaders:
- AtomicReader: These readers do not consist of several sub-readers; they are atomic. They support retrieval of stored fields, doc values, terms, and postings.
- CompositeReader: Instances (like DirectoryReader) of this reader can only be used to get stored fields from the underlying AtomicReaders, but it is not possible to directly retrieve postings. To do that, get the sub-readers via GetSequentialSubReaders(). Alternatively, you can mimic an AtomicReader (with a serious slowdown), by wrapping composite readers with SlowCompositeReaderWrapper.
IndexReader instances for indexes on disk are usually constructed with a call to one of the static DirectoryReader.Open() methods, e.g. Open(Directory). Because DirectoryReader inherits the CompositeReader abstract class, it is not possible to directly get postings from it.
For efficiency, in this API documents are often referred to via document numbers, non-negative integers which each name a unique document in the index. These document numbers are ephemeral -- they may change as documents are added to and deleted from an index. Clients should thus not rely on a given document having the same number between sessions.
NOTE: IndexReader instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently. If your application requires external synchronization, you should not synchronize on the IndexReader instance; use your own (non-Lucene) objects instead.
IndexReaderContext
A struct-like class that represents a hierarchical relationship between IndexReader instances.
IndexUpgrader
This is an easy-to-use tool that upgrades all segments of an index from previous Lucene versions to the current segment file format. It can be used from the command line:
java -cp lucene-core.jar Lucene.Net.Index.IndexUpgrader [-delete-prior-commits] [-verbose] indexDir
Alternatively this class can be instantiated and Upgrade() invoked. It uses UpgradeIndexMergePolicy and triggers the upgrade via a ForceMerge(Int32) request to IndexWriter.
This tool keeps only the last commit in an index; for this reason, if the incoming index has more than one commit, the tool refuses to run by default. Specify -delete-prior-commits to override this, allowing the tool to delete all but the last commit. From .NET code this can be enabled by passing true to IndexUpgrader(Directory, LuceneVersion, TextWriter, Boolean).
Warning: this tool may reorder documents if the index was partially upgraded before execution (e.g., documents were added). If your application relies on "monotonicity" of doc IDs (which means that the order in which the documents were added to the index is preserved), do a full ForceMerge instead. The MergePolicy set by IndexWriterConfig may also reorder documents.
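The programmatic route can be sketched as follows, using the four-argument constructor named above with deletePriorCommits set to true. The class name IndexUpgraderSketch, the sample document, and the use of RAMDirectory, WhitespaceAnalyzer, LuceneVersion.LUCENE_48, and a discarding TextWriter are assumptions for illustration. The index here is already in the current format, so the call only exercises the API.

```csharp
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class IndexUpgraderSketch
{
    // Builds a one-document index, runs the upgrader over it, and returns
    // the number of live documents afterwards.
    public static int UpgradeAndCount()
    {
        using (var dir = new RAMDirectory())
        {
            var config = new IndexWriterConfig(
                LuceneVersion.LUCENE_48,
                new WhitespaceAnalyzer(LuceneVersion.LUCENE_48));
            using (var writer = new IndexWriter(dir, config))
            {
                var doc = new Document();
                doc.Add(new TextField("body", "hello index", Field.Store.NO));
                writer.AddDocument(doc);
            }

            // Passing true allows the tool to drop all but the last commit.
            // TextWriter.Null discards the diagnostic output.
            var upgrader = new IndexUpgrader(
                dir, LuceneVersion.LUCENE_48, System.IO.TextWriter.Null, true);
            upgrader.Upgrade();

            using (DirectoryReader reader = DirectoryReader.Open(dir))
            {
                return reader.NumDocs;
            }
        }
    }
}
```

Pass Console.Out instead of the discarding writer to see what the upgrade does, which corresponds to the -verbose switch.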
IndexWriter
An IndexWriter creates and maintains an index.
IndexWriter.IndexReaderWarmer
If Open(IndexWriter, Boolean) has been called (ie, this writer is in near real-time mode), then after a merge completes, this class can be invoked to warm the reader on the newly merged segment, before the merge commits. This is not required for near real-time search, but will reduce search latency on opening a new near real-time reader after a merge completes.
Note
This API is experimental and might change in incompatible ways in the next release.
NOTE: Warm(AtomicReader) is called before any deletes have been carried over to the merged segment.
IndexWriterConfig
Holds all the configuration that is used to create an IndexWriter. Once IndexWriter has been created with this object, changes to this object will not affect the IndexWriter instance. For that, use LiveIndexWriterConfig that is returned from Config.
LUCENENET NOTE: Unlike Lucene, we use property setters instead of setter methods. In C#, this allows you to initialize the IndexWriterConfig using the language features of C#, for example:
IndexWriterConfig conf = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
    Codec = new Lucene46Codec(),
    OpenMode = OpenMode.CREATE
};
However, if you prefer to match the syntax of Lucene using chained setter methods, there are extension methods in the Lucene.Net.Index.Extensions namespace. Example usage:
using Lucene.Net.Index.Extensions;
..
IndexWriterConfig conf = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
    .SetCodec(new Lucene46Codec())
    .SetOpenMode(OpenMode.CREATE);
@since 3.1
KeepOnlyLastCommitDeletionPolicy
This IndexDeletionPolicy implementation keeps only the most recent commit and immediately removes all prior commits after a new commit is done. This is the default deletion policy.
LiveIndexWriterConfig
Holds all the configuration used by IndexWriter, with a few setters for settings that can be changed on an IndexWriter instance "live".
@since 4.0
LogByteSizeMergePolicy
This is a LogMergePolicy that measures size of a segment as the total byte size of the segment's files.
LogDocMergePolicy
This is a LogMergePolicy that measures size of a segment as the number of documents (not taking deletions into account).
LogMergePolicy
This class implements a MergePolicy that tries to merge segments into levels of exponentially increasing size, where each level has fewer segments than the value of the merge factor. Whenever extra segments (beyond the merge factor upper bound) are encountered, all segments within the level are merged. You can get or set the merge factor using MergeFactor.
This class is abstract and requires a subclass to define the Size(SegmentCommitInfo) method which specifies how a segment's size is determined. LogDocMergePolicy is one subclass that measures size by document count in the segment. LogByteSizeMergePolicy is another subclass that measures size as the total byte size of the file(s) for the segment.
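A short sketch of selecting one of these subclasses and adjusting the merge factor follows. The class name LogMergePolicySketch, the merge factor of 5, and the WhitespaceAnalyzer and LuceneVersion.LUCENE_48 choices are assumptions for illustration.

```csharp
using Lucene.Net.Analysis.Core;
using Lucene.Net.Index;
using Lucene.Net.Util;

public static class LogMergePolicySketch
{
    public static LogMergePolicy CreatePolicy()
    {
        // Measure segments by byte size and merge once a level holds
        // 5 segments (an arbitrary example value).
        return new LogByteSizeMergePolicy { MergeFactor = 5 };
    }

    public static IndexWriterConfig CreateConfig()
    {
        var config = new IndexWriterConfig(
            LuceneVersion.LUCENE_48,
            new WhitespaceAnalyzer(LuceneVersion.LUCENE_48));
        config.MergePolicy = CreatePolicy();
        return config;
    }
}
```

Swapping in LogDocMergePolicy would keep the same levelling behavior but size segments by document count instead of bytes.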
MergePolicy
Expert: a MergePolicy determines the sequence of primitive merge operations.
Whenever the segments in an index have been altered by IndexWriter, either the addition of a newly flushed segment, addition of many segments from AddIndexes* calls, or a previous merge that may now need to cascade, IndexWriter invokes FindMerges(MergeTrigger, SegmentInfos) to give the MergePolicy a chance to pick merges that are now required. This method returns a MergePolicy.MergeSpecification instance describing the set of merges that should be done, or null if no merges are necessary. When ForceMerge(Int32) is called, it calls FindForcedMerges(SegmentInfos, Int32, IDictionary<SegmentCommitInfo, Boolean>) and the MergePolicy should then return the necessary merges.
Note that the policy can return more than one merge at a time. In this case, if the writer is using SerialMergeScheduler, the merges will be run sequentially but if it is using ConcurrentMergeScheduler they will be run concurrently.
The default MergePolicy is TieredMergePolicy.
Note
This API is experimental and might change in incompatible ways in the next release.
MergePolicy.DocMap
A map of doc IDs.
MergePolicy.MergeAbortedException
Thrown when a merge was explicitly aborted because Dispose(Boolean) was called with false. Normally this exception is privately caught and suppressed by IndexWriter.
MergePolicy.MergeException
Exception thrown if there are any problems while executing a merge.
MergePolicy.MergeSpecification
A MergePolicy.MergeSpecification instance provides the information necessary to perform multiple merges. It simply contains a list of MergePolicy.OneMerge instances.
MergePolicy.OneMerge
OneMerge provides the information necessary to perform an individual primitive merge operation, resulting in a single new segment. The merge spec includes the subset of segments to be merged as well as whether the new segment should use the compound file format.
MergeScheduler
Expert: IndexWriter uses an instance implementing this interface to execute the merges selected by a MergePolicy. The default MergeScheduler is ConcurrentMergeScheduler.
Implementers of sub-classes should make sure that Clone() returns an independent instance able to work with any IndexWriter instance.
Note
This API is experimental and might change in incompatible ways in the next release.
MergeState
Holds common state used during segment merging.
Note
This API is experimental and might change in incompatible ways in the next release.
MergeState.DocMap
Remaps docids around deletes during merge.
MultiDocsAndPositionsEnum
Exposes flex API, merged from flex API of sub-segments.
Note
This API is experimental and might change in incompatible ways in the next release.
MultiDocsAndPositionsEnum.EnumWithSlice
Holds a DocsAndPositionsEnum along with the corresponding ReaderSlice.
MultiDocsEnum
Exposes DocsEnum, merged from DocsEnum API of sub-segments.
Note
This API is experimental and might change in incompatible ways in the next release.
MultiDocsEnum.EnumWithSlice
Holds a DocsEnum along with the corresponding ReaderSlice.
MultiDocValues
A wrapper for CompositeReader providing access to DocValues.
NOTE: for multi readers, you'll get better performance by gathering the sub readers using Context to get the atomic leaves and then operate per-AtomicReader, instead of using this class.
NOTE: this is very costly.
Note
This API is experimental and might change in incompatible ways in the next release.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
MultiDocValues.MultiSortedDocValues
Implements SortedDocValues over n subs, using a MultiDocValues.OrdinalMap.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
MultiDocValues.MultiSortedSetDocValues
Implements SortedSetDocValues over n subs, using a MultiDocValues.OrdinalMap.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
MultiDocValues.OrdinalMap
Maps per-segment ordinals to/from the global ordinal space.
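To make the per-segment versus global ordinal distinction concrete, here is a minimal sketch that uses plain strings rather than Lucene types. OrdinalMapSketch and Build are hypothetical names, and ordinal string comparison stands in for the byte-wise term order; the real class is more compact and works on TermsEnum instances.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrdinalMapSketch
{
    // segmentTerms[s] is the sorted, de-duplicated term dictionary of segment s.
    // Returns the global dictionary plus, for each segment, an array where
    // index = segment ordinal and value = global ordinal.
    public static (string[] GlobalTerms, int[][] SegmentToGlobal) Build(string[][] segmentTerms)
    {
        string[] globalTerms = segmentTerms
            .SelectMany(terms => terms)
            .Distinct()
            .OrderBy(t => t, StringComparer.Ordinal)
            .ToArray();

        var globalOrd = new Dictionary<string, int>(StringComparer.Ordinal);
        for (int i = 0; i < globalTerms.Length; i++)
        {
            globalOrd[globalTerms[i]] = i;
        }

        int[][] segmentToGlobal = segmentTerms
            .Select(terms => terms.Select(t => globalOrd[t]).ToArray())
            .ToArray();

        return (globalTerms, segmentToGlobal);
    }

    public static void Main()
    {
        var (globalTerms, map) = Build(new[]
        {
            new[] { "apple", "cherry" },
            new[] { "banana", "cherry", "date" }
        });
        Console.WriteLine(string.Join(",", globalTerms)); // apple,banana,cherry,date
        Console.WriteLine(string.Join(",", map[0]));      // 0,2
        Console.WriteLine(string.Join(",", map[1]));      // 1,2,3
    }
}
```

Note that "cherry" has ordinal 1 in both segments but global ordinal 2, which is why a mapping is needed before ordinals from different segments can be compared.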
MultiFields
Exposes flex API, merged from flex API of sub-segments. This is useful when you're interacting with an IndexReader implementation that consists of sequential sub-readers (e.g., DirectoryReader or MultiReader).
NOTE: for composite readers, you'll get better performance by gathering the sub readers using Context to get the atomic leaves and then operating per-AtomicReader, instead of using this class.
Note
This API is experimental and might change in incompatible ways in the next release.
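The per-leaf approach recommended in the note might look like the sketch below. It assumes Lucene.NET 4.8 with the Lucene.Net.Analysis.Common package (for WhitespaceAnalyzer); the field name "body", the sample text, and the PerLeafTermsExample class are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class PerLeafTermsExample
{
    // Indexes one small document, then collects the terms of "body" leaf by leaf.
    public static List<string> Run()
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;
        var result = new List<string>();
        using (var dir = new RAMDirectory())
        {
            using (var writer = new IndexWriter(dir,
                new IndexWriterConfig(version, new WhitespaceAnalyzer(version))))
            {
                var doc = new Document();
                doc.Add(new TextField("body", "hello postings world", Field.Store.NO));
                writer.AddDocument(doc);
                writer.Commit();
            }

            using (DirectoryReader reader = DirectoryReader.Open(dir))
            {
                // Visit each atomic leaf directly instead of going through MultiFields.
                foreach (AtomicReaderContext context in reader.Leaves)
                {
                    Terms terms = context.AtomicReader.GetTerms("body");
                    if (terms == null) continue; // this leaf has no such field

                    TermsEnum termsEnum = terms.GetEnumerator();
                    while (termsEnum.MoveNext())
                    {
                        result.Add(termsEnum.Term.Utf8ToString());
                    }
                }
            }
        }
        return result;
    }

    public static void Main()
    {
        foreach (string term in Run()) Console.WriteLine(term);
    }
}
```

Document numbers seen inside a leaf are relative to that leaf; add context.DocBase to obtain the number in the top-level reader.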
MultiReader
A CompositeReader which reads multiple indexes, appending their content. It can be used to create a view on several sub-readers (like DirectoryReader) and execute searches on it.
For efficiency, in this API documents are often referred to via document numbers, non-negative integers which each name a unique document in the index. These document numbers are ephemeral -- they may change as documents are added to and deleted from an index. Clients should thus not rely on a given document having the same number between sessions.
NOTE: IndexReader instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently. If your application requires external synchronization, you should not synchronize on the IndexReader instance; use your own (non-Lucene) objects instead.
MultiTerms
Exposes flex API, merged from flex API of sub-segments.
Note
This API is experimental and might change in incompatible ways in the next release.
MultiTermsEnum
Exposes TermsEnum API, merged from TermsEnum API of sub-segments. This does a merge sort, by term text, of the sub-readers.
Note
This API is experimental and might change in incompatible ways in the next release.
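The merge sort described for MultiTermsEnum can be sketched with plain sorted string arrays standing in for the per-segment term enumerations. This is an illustration of the technique under those assumptions, not the class's implementation; MergedTermsSketch and Merge are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class MergedTermsSketch
{
    // Each element of subs is one sub-reader's term list, already sorted.
    // Returns every distinct term once, in sorted order.
    public static List<string> Merge(string[][] subs)
    {
        var positions = new int[subs.Length];
        var merged = new List<string>();
        while (true)
        {
            // Find the smallest term among the current heads of all subs.
            string smallest = null;
            for (int i = 0; i < subs.Length; i++)
            {
                if (positions[i] >= subs[i].Length) continue;
                string candidate = subs[i][positions[i]];
                if (smallest == null || string.CompareOrdinal(candidate, smallest) < 0)
                {
                    smallest = candidate;
                }
            }
            if (smallest == null) break; // every sub is exhausted

            merged.Add(smallest);

            // Advance every sub positioned on that term, so duplicates collapse.
            for (int i = 0; i < subs.Length; i++)
            {
                if (positions[i] < subs[i].Length && subs[i][positions[i]] == smallest)
                {
                    positions[i]++;
                }
            }
        }
        return merged;
    }

    public static void Main()
    {
        var merged = Merge(new[]
        {
            new[] { "apple", "cherry" },
            new[] { "banana", "cherry" }
        });
        Console.WriteLine(string.Join(",", merged)); // apple,banana,cherry
    }
}
```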
MultiTermsEnum.TermsEnumIndex
MultiTermsEnum.TermsEnumWithSlice
NoDeletionPolicy
An IndexDeletionPolicy which keeps all index commits around, never deleting them. This class is a singleton and can be accessed by referencing INSTANCE.
NoMergePolicy
A MergePolicy which never returns merges to execute (hence its name). It is also a singleton and can be accessed through NO_COMPOUND_FILES if you want to indicate the index does not use compound files, or through COMPOUND_FILES otherwise. Use it if you want to prevent an IndexWriter from ever executing merges, without going through the hassle of tweaking a merge policy's settings to achieve that, such as changing its merge factor.
NoMergeScheduler
A MergeScheduler which never executes any merges. It is also a singleton and can be accessed through INSTANCE. Use it if you want to prevent an IndexWriter from ever executing merges, regardless of the MergePolicy used. Note that you can achieve the same thing by using NoMergePolicy, however with NoMergeScheduler you also ensure that no unnecessary code of any MergeScheduler implementation is ever executed. Hence it is recommended to use both if you want to disable merges from ever happening.
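A configuration that follows the recommendation to use both classes could look like this sketch. It assumes Lucene.NET 4.8 and that IndexWriterConfig exposes settable MergePolicy and MergeScheduler properties; NoMergeConfig and Create are hypothetical names.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.Util;

public static class NoMergeConfig
{
    // Returns a writer configuration under which no merges are selected or executed.
    public static IndexWriterConfig Create(Analyzer analyzer)
    {
        var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);
        config.MergePolicy = NoMergePolicy.COMPOUND_FILES;  // never selects a merge
        config.MergeScheduler = NoMergeScheduler.INSTANCE;  // never runs one
        return config;
    }
}
```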
NumericDocValues
A per-document numeric value.
OrdTermState
An ordinal-based TermState.
Note
This API is experimental and might change in incompatible ways in the next release.
ParallelAtomicReader
An AtomicReader which reads multiple, parallel indexes. Each index added must have the same number of documents, but typically each contains different fields. Deletions are taken from the first reader. Each document contains the union of the fields of all documents with the same document number. When searching, matches for a query term are from the first index added that has the field.
This is useful, e.g., with collections that have large fields which change rarely and small fields that change more frequently. The smaller fields may be re-indexed in a new index and both indexes may be searched together.
Warning: It is up to you to make sure all indexes are created and modified the same way. For example, if you add documents to one index, you need to add the same documents in the same order to the other indexes. Failure to do so will result in undefined behavior.
ParallelCompositeReader
A CompositeReader which reads multiple, parallel indexes. Each index added must have the same number of documents, and exactly the same hierarchical subreader structure, but typically each contains different fields. Deletions are taken from the first reader. Each document contains the union of the fields of all documents with the same document number. When searching, matches for a query term are from the first index added that has the field.
This is useful, e.g., with collections that have large fields which change rarely and small fields that change more frequently. The smaller fields may be re-indexed in a new index and both indexes may be searched together.
Warning: It is up to you to make sure all indexes are created and modified the same way. For example, if you add documents to one index, you need to add the same documents in the same order to the other indexes. Failure to do so will result in undefined behavior. A good strategy to create suitable indexes with IndexWriter is to use LogDocMergePolicy, as this one does not reorder documents during merging (as TieredMergePolicy does) and triggers merges by number of documents per segment. If you use different MergePolicys it might happen that the segment structure of your index is no longer predictable.
PersistentSnapshotDeletionPolicy
A SnapshotDeletionPolicy which adds a persistence layer so that snapshots can be maintained across the life of an application. The snapshots are persisted in a Directory and are committed as soon as Snapshot() or Release(IndexCommit) is called.
NOTE: Sharing PersistentSnapshotDeletionPolicys that write to the same directory across IndexWriters will corrupt snapshots. You should make sure every IndexWriter has its own PersistentSnapshotDeletionPolicy and that they all write to a different Directory. It is OK to use the same Directory that holds the index.
This class adds a Release(Int64) method to release commits from a previous snapshot's Generation.
Note
This API is experimental and might change in incompatible ways in the next release.
RandomAccessOrds
Extension of SortedSetDocValues that supports random access to the ordinals of a document.
Operations via this API are independent of the iterator api (NextOrd()) and do not impact its state.
Codecs can optionally extend this API if they support constant-time access to ordinals for the document.
RandomAccessOrdsExtensions
ReaderManager
Utility class to safely share DirectoryReader instances across multiple threads, while periodically reopening. This class ensures each reader is disposed only once all threads have finished using it.
Note
This API is experimental and might change in incompatible ways in the next release.
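The usual acquire/release pattern is sketched below, assuming Lucene.NET 4.8 with the Lucene.Net.Analysis.Common package. The in-memory directory, the "id" field, and the ReaderManagerExample class are illustrative choices.

```csharp
using System;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class ReaderManagerExample
{
    // Adds one document, refreshes the manager, and returns the visible doc count.
    public static int Run()
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;
        using (var dir = new RAMDirectory())
        using (var writer = new IndexWriter(dir,
            new IndexWriterConfig(version, new WhitespaceAnalyzer(version))))
        {
            // true: readers obtained from the manager apply all pending deletes.
            var manager = new ReaderManager(writer, true);
            try
            {
                var doc = new Document();
                doc.Add(new StringField("id", "1", Field.Store.YES));
                writer.AddDocument(doc);

                // Reopen if needed so that later Acquire() calls see the new document.
                manager.MaybeRefresh();

                DirectoryReader reader = manager.Acquire();
                try
                {
                    return reader.NumDocs;
                }
                finally
                {
                    // Always hand the reader back; never dispose it directly.
                    manager.Release(reader);
                }
            }
            finally
            {
                manager.Dispose();
            }
        }
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

Pairing every Acquire() with a Release() in a finally block is what lets the manager dispose an old reader only after its last user has finished.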
ReaderSlice
Subreader slice from a parent composite reader.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
ReaderUtil
Common util methods for dealing with IndexReaders and IndexReaderContexts.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
SegmentCommitInfo
Embeds a [read-only] SegmentInfo and adds per-commit fields.
Note
This API is experimental and might change in incompatible ways in the next release.
SegmentInfo
Information about a segment such as its name, directory, and files related to the segment.
Note
This API is experimental and might change in incompatible ways in the next release.
SegmentInfos
A collection of SegmentInfo objects with methods for operating on those segments in relation to the file system.
The active segments in the index are stored in the segment info file, segments_N. There may be one or more segments_N files in the index; however, the one with the largest generation is the active one (when older segments_N files are present it's because they temporarily cannot be deleted, or a writer is in the process of committing, or a custom IndexDeletionPolicy is in use). This file lists each segment by name and has details about the codec and generation of deletes.
There is also a file segments.gen. This file contains the current generation (the _N in segments_N) of the index. This is used only as a fallback in case the current generation cannot be accurately determined by directory listing alone (as is the case for some NFS clients with time-based directory cache expiration). This file simply contains a WriteInt32(Int32) version header (FORMAT_SEGMENTS_GEN_CURRENT), followed by the generation recorded as WriteInt64(Int64), written twice.
Files:
- segments.gen: GenHeader, Generation, Generation, Footer
- segments_N: Header, Version, NameCounter, SegCount, <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>^SegCount, CommitUserData, Footer
- Header --> WriteHeader(DataOutput, String, Int32)
- GenHeader, NameCounter, SegCount, DeletionCount --> WriteInt32(Int32)
- Generation, Version, DelGen, Checksum, FieldInfosGen --> WriteInt64(Int64)
- SegName, SegCodec --> WriteString(String)
- CommitUserData --> WriteStringStringMap(IDictionary<String, String>)
- UpdatesFiles --> WriteStringSet(ISet<String>)
- Footer --> WriteFooter(IndexOutput)
- Version counts how often the index has been changed by adding or deleting documents.
- NameCounter is used to generate names for new segment files.
- SegName is the name of the segment, and is used as the file name prefix for all of the files that compose the segment's index.
- DelGen is the generation count of the deletes file. If this is -1, there are no deletes. Anything above zero means there are deletes stored by LiveDocsFormat.
- DeletionCount records the number of deleted documents in this segment.
- SegCodec is the Name of the Codec that encoded this segment.
- CommitUserData stores an optional user-supplied opaque IDictionary<String, String> that was passed to SetCommitData(IDictionary<String, String>).
- FieldInfosGen is the generation count of the fieldInfos file. If this is -1, there are no updates to the fieldInfos in that segment. Anything above zero means there are updates to fieldInfos stored by FieldInfosFormat.
- UpdatesFiles stores the list of files that were updated in that segment.
Note
This API is experimental and might change in incompatible ways in the next release.
SegmentInfos.FindSegmentsFile
Utility class for executing code that needs to do something with the current segments file. This is necessary with lock-less commits because from the time you locate the current segments file name, until you actually open it, read its contents, or check modified time, etc., it could have been deleted due to a writer commit finishing.
SegmentReader
IndexReader implementation over a single segment.
Instances pointing to the same segment (but with different deletes, etc) may share the same core data.
Note
This API is experimental and might change in incompatible ways in the next release.
SegmentReadState
Holder class for common parameters used during read.
Note
This API is experimental and might change in incompatible ways in the next release.
SegmentWriteState
Holder class for common parameters used during write.
Note
This API is experimental and might change in incompatible ways in the next release.
SerialMergeScheduler
A MergeScheduler that simply does each merge sequentially, using the current thread.
SimpleMergedSegmentWarmer
A very simple merged segment warmer that just ensures data structures are initialized.
SingleTermsEnum
Subclass of FilteredTermsEnum for enumerating a single term.
For example, this can be used by MultiTermQuerys that need only visit one term, but want to preserve MultiTermQuery semantics such as MultiTermRewriteMethod.
SlowCompositeReaderWrapper
This class forces a composite reader (e.g., a MultiReader or DirectoryReader) to emulate an atomic reader. This requires implementing the postings APIs on-the-fly, using the static methods in MultiFields and MultiDocValues, by stepping through the sub-readers to merge fields/terms, appending docs, etc.
NOTE: This class almost always results in a performance hit. If this is important to your use case, you'll get better performance by gathering the sub readers using Context to get the atomic leaves and then operating per-AtomicReader, instead of using this class.
SnapshotDeletionPolicy
An IndexDeletionPolicy that wraps any other IndexDeletionPolicy and adds the ability to hold and later release snapshots of an index. While a snapshot is held, the IndexWriter will not remove any files associated with it even if the index is otherwise being actively, arbitrarily changed. Because we wrap another arbitrary IndexDeletionPolicy, this gives you the freedom to continue using whatever IndexDeletionPolicy you would normally want to use with your index.
This class maintains all snapshots in-memory, and so the information is not persisted and not protected against system failures. If persistence is important, you can use PersistentSnapshotDeletionPolicy.
Note
This API is experimental and might change in incompatible ways in the next release.
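A typical use is to hold a snapshot while copying index files for a backup. The sketch below assumes Lucene.NET 4.8 with the Lucene.Net.Analysis.Common package and a settable IndexWriterConfig.IndexDeletionPolicy property; SnapshotExample is a hypothetical name and the "backup" step merely lists the file names.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class SnapshotExample
{
    // Commits one document, snapshots that commit, and returns its file names.
    public static List<string> Run()
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;

        // Wrap the policy you would normally use; here, keep only the last commit.
        var policy = new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
        var config = new IndexWriterConfig(version, new WhitespaceAnalyzer(version));
        config.IndexDeletionPolicy = policy;

        using (var dir = new RAMDirectory())
        using (var writer = new IndexWriter(dir, config))
        {
            var doc = new Document();
            doc.Add(new StringField("id", "1", Field.Store.YES));
            writer.AddDocument(doc);
            writer.Commit(); // a commit must exist before Snapshot() is called

            IndexCommit snapshot = policy.Snapshot();
            try
            {
                // While the snapshot is held, these files will not be removed.
                return new List<string>(snapshot.FileNames);
            }
            finally
            {
                policy.Release(snapshot);
            }
        }
    }

    public static void Main()
    {
        foreach (string file in Run()) Console.WriteLine(file);
    }
}
```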
SortedDocValues
A per-document byte[] with presorted values.
Per-Document values in a SortedDocValues are deduplicated, dereferenced, and sorted into a dictionary of unique values. A pointer to the dictionary value (ordinal) can be retrieved for each document. Ordinals are dense and in increasing sorted order.
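The dictionary-plus-ordinal layout can be sketched with plain strings. This illustrates the idea only and is not how any codec stores the data; SortedOrdinalsSketch and Build are hypothetical names, and ordinal string comparison stands in for byte-wise ordering.

```csharp
using System;
using System.Linq;

public static class SortedOrdinalsSketch
{
    // perDocValues[i] is the value of document i.
    // Returns the sorted dictionary of unique values and, per document,
    // the ordinal (index into that dictionary) of its value.
    public static (string[] Dictionary, int[] Ords) Build(string[] perDocValues)
    {
        string[] dictionary = perDocValues
            .Distinct()
            .OrderBy(v => v, StringComparer.Ordinal)
            .ToArray();

        int[] ords = perDocValues
            .Select(v => Array.BinarySearch(dictionary, v, StringComparer.Ordinal))
            .ToArray();

        return (dictionary, ords);
    }

    public static void Main()
    {
        var (dictionary, ords) = Build(new[] { "b", "a", "b", "c" });
        Console.WriteLine(string.Join(",", dictionary)); // a,b,c
        Console.WriteLine(string.Join(",", ords));       // 1,0,1,2
    }
}
```

Because ordinals follow the sort order of the values, comparing two documents by ordinal gives the same result as comparing their values, without touching the bytes.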
SortedSetDocValues
A per-document set of presorted byte[] values.
Per-Document values in a SortedSetDocValues are deduplicated, dereferenced, and sorted into a dictionary of unique values. A pointer to the dictionary value (ordinal) can be retrieved for each document. Ordinals are dense and in increasing sorted order.
StoredFieldVisitor
Expert: Provides a low-level means of accessing the stored field values in an index. See Document(Int32, StoredFieldVisitor).
NOTE: a StoredFieldVisitor implementation should not try to load or visit other stored documents in the same reader because the implementation of stored fields for most codecs is not reentrant and you will see strange exceptions as a result.
See DocumentStoredFieldVisitor, which is a StoredFieldVisitor that builds the Document containing all stored fields. This is used by Document(Int32).
Note
This API is experimental and might change in incompatible ways in the next release.
Term
A Term represents a word from text. This is the unit of search. It is composed of two elements: the text of the word, as a string, and the name of the field that the text occurred in.
Note that terms may represent more than words from text fields, but also things like dates, email addresses, urls, etc.
TermContext
Maintains an IndexReader TermState view over IndexReader instances containing a single term. The TermContext doesn't track whether the given TermState objects are valid, nor whether the TermState instances refer to the same terms in the associated readers.
Note
This API is experimental and might change in incompatible ways in the next release.
TermExtensions
Terms
Access to the terms in a specific field. See Fields.
Note
This API is experimental and might change in incompatible ways in the next release.
TermsEnum
Enumerator to seek (SeekCeil(BytesRef), SeekExact(BytesRef)) or step through (MoveNext()) terms to obtain Term, frequency information (DocFreq), DocsEnum or DocsAndPositionsEnum for the current term (Docs(IBits, DocsEnum)).
Term enumerations are always ordered by Comparer. Each term in the enumeration is greater than the one before it.
The TermsEnum is unpositioned when you first obtain it and you must first successfully call MoveNext() or one of the Seek methods.
Note
This API is experimental and might change in incompatible ways in the next release.
TermState
Encapsulates all required internal state to position the associated TermsEnum without re-seeking.
Note
This API is experimental and might change in incompatible ways in the next release.
TieredMergePolicy
Merges segments of approximately equal size, subject to an allowed number of segments per tier. This is similar to LogByteSizeMergePolicy, except this merge policy is able to merge non-adjacent segments, and separates how many segments are merged at once (MaxMergeAtOnce) from how many segments are allowed per tier (SegmentsPerTier). This merge policy also does not over-merge (i.e. cascade merges).
For normal merging, this policy first computes a "budget" of how many segments are allowed to be in the index. If the index is over-budget, then the policy sorts segments by decreasing size (pro-rating by percent deletes), and then finds the least-cost merge. Merge cost is measured by a combination of the "skew" of the merge (size of largest segment divided by smallest segment), total merge size and percent deletes reclaimed, so that merges with lower skew, smaller size and those reclaiming more deletes, are favored.
If a merge will produce a segment that's larger than MaxMergedSegmentMB, then the policy will merge fewer segments (down to 1 at once, if that one has deletions) to keep the segment size under budget.
NOTE: This policy freely merges non-adjacent segments; if this is a problem, use LogMergePolicy.
NOTE: This policy always merges by byte size of the segments, always pro-rates by percent deletes, and does not apply any maximum segment size during ForceMerge (unlike LogByteSizeMergePolicy).
Note
This API is experimental and might change in incompatible ways in the next release.
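The three settings named above can be set when the policy is constructed. This is a minimal sketch assuming Lucene.NET 4.8, where they are exposed as settable properties; the numeric values are arbitrary illustrations rather than recommendations, and TieredMergeConfig is a hypothetical name.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.Util;

public static class TieredMergeConfig
{
    // Builds a writer configuration with an explicitly tuned TieredMergePolicy.
    public static IndexWriterConfig Create(Analyzer analyzer)
    {
        var policy = new TieredMergePolicy
        {
            MaxMergeAtOnce = 5,          // segments merged in a single merge
            SegmentsPerTier = 8.0,       // segments allowed per tier
            MaxMergedSegmentMB = 1024.0  // size ceiling for a merged segment, in MB
        };

        var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);
        config.MergePolicy = policy;
        return config;
    }
}
```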
TieredMergePolicy.MergeScore
Holds score and explanation for a single candidate merge.
TrackingIndexWriter
Class that tracks changes to a delegated IndexWriter, used by ControlledRealTimeReopenThread<T> to ensure specific changes are visible. Create this class (passing your IndexWriter), and then pass this class to ControlledRealTimeReopenThread<T>. Be sure to make all changes via the TrackingIndexWriter, otherwise ControlledRealTimeReopenThread<T> won't know about the changes.
Note
This API is experimental and might change in incompatible ways in the next release.
TwoPhaseCommitTool
A utility for executing 2-phase commit on several objects.
Note
This API is experimental and might change in incompatible ways in the next release.
TwoPhaseCommitTool.CommitFailException
Thrown by Execute(ITwoPhaseCommit[]) when an object fails to Commit().
TwoPhaseCommitTool.PrepareCommitFailException
Thrown by Execute(ITwoPhaseCommit[]) when an object fails to PrepareCommit().
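As an illustration, two IndexWriter instances can be committed together, since IndexWriter implements ITwoPhaseCommit. The sketch assumes Lucene.NET 4.8 with the Lucene.Net.Analysis.Common package; the in-memory directories, the "id" field, and the TwoPhaseCommitExample class are illustrative.

```csharp
using System;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class TwoPhaseCommitExample
{
    private static IndexWriter NewWriter(Directory dir)
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;
        return new IndexWriter(dir,
            new IndexWriterConfig(version, new WhitespaceAnalyzer(version)));
    }

    // Adds one document to each index, commits both together,
    // and returns the total number of committed documents.
    public static int Run()
    {
        using (var dirA = new RAMDirectory())
        using (var dirB = new RAMDirectory())
        using (IndexWriter writerA = NewWriter(dirA))
        using (IndexWriter writerB = NewWriter(dirB))
        {
            var doc = new Document();
            doc.Add(new StringField("id", "1", Field.Store.YES));
            writerA.AddDocument(doc);
            writerB.AddDocument(doc);

            try
            {
                // PrepareCommit() is called on every object first, then Commit().
                TwoPhaseCommitTool.Execute(writerA, writerB);
            }
            catch (TwoPhaseCommitTool.PrepareCommitFailException e)
            {
                Console.WriteLine("prepare phase failed: " + e.Message);
                throw;
            }
            catch (TwoPhaseCommitTool.CommitFailException e)
            {
                Console.WriteLine("commit phase failed: " + e.Message);
                throw;
            }

            using (DirectoryReader a = DirectoryReader.Open(dirA))
            using (DirectoryReader b = DirectoryReader.Open(dirB))
            {
                return a.NumDocs + b.NumDocs;
            }
        }
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```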
UpgradeIndexMergePolicy
This MergePolicy is used for upgrading all existing segments of an index when calling ForceMerge(Int32). All other methods delegate to the base MergePolicy given to the constructor. This allows for an as-cheap-as-possible upgrade of an older index by only upgrading segments that were created by previous Lucene versions. ForceMerge no longer really merges; it is just used to "ForceMerge" older segment versions away.
In general one would use IndexUpgrader, but for a fully customizable upgrade, you can use this like any other MergePolicy and call ForceMerge(Int32):
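IndexWriterConfig iwc = new IndexWriterConfig(LuceneVersion.LUCENE_XX, new KeywordAnalyzer());
// wrap the configured policy so that ForceMerge only rewrites old-format segments
iwc.MergePolicy = new UpgradeIndexMergePolicy(iwc.MergePolicy);
using (IndexWriter w = new IndexWriter(dir, iwc))
{
    // with the upgrade policy in place, this upgrades segments instead of merging down to one
    w.ForceMerge(1);
}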
Warning: this merge policy may reorder documents if the index was partially upgraded before calling ForceMerge(Int32) (e.g., documents were added). If your application relies on "monotonicity" of doc IDs (which means that the order in which the documents were added to the index is preserved), do a ForceMerge(1) instead. Please note, the delegate MergePolicy may also reorder documents.
Note
This API is experimental and might change in incompatible ways in the next release.
Interfaces
IConcurrentMergeScheduler
IIndexableField
Represents a single field for indexing. IndexWriter consumes IEnumerable<IIndexableField> as a document.
Note
This API is experimental and might change in incompatible ways in the next release.
IIndexableFieldType
Describes the properties of a field.
Note
This API is experimental and might change in incompatible ways in the next release.
IMergeScheduler
IndexReader.IReaderClosedListener
A custom listener that's invoked when the IndexReader is closed.
Note
This API is experimental and might change in incompatible ways in the next release.
IndexWriter.IEvent
Interface for internal atomic events. See Lucene.Net.Index.DocumentsWriter for details. Events are executed concurrently and no order is guaranteed. Each event should only rely on the serializability within its Process method. All actions that must happen before or after a certain action must be encoded inside the Process(IndexWriter, Boolean, Boolean) method.
ITwoPhaseCommit
An interface for implementations that support 2-phase commit. You can use TwoPhaseCommitTool to execute a 2-phase commit algorithm over several ITwoPhaseCommits.
Note
This API is experimental and might change in incompatible ways in the next release.
SegmentReader.ICoreDisposedListener
Called when the shared core for this SegmentReader is disposed.
This listener is called only once all SegmentReaders sharing the same core are disposed. At this point it is safe for apps to evict this reader from any caches keyed on CoreCacheKey. This is the same interface that IFieldCache uses, internally, to evict entries.
NOTE: This was CoreClosedListener in Lucene.
Note
This API is experimental and might change in incompatible ways in the next release.
Enums
DocsAndPositionsFlags
DocsFlags
DocValuesType
DocValues types. Note that DocValues is strongly typed, so a field cannot have different types across different documents.
FilteredTermsEnum.AcceptStatus
Return value indicating whether the term should be accepted or the iteration should END. The *_SEEK values denote that, after handling the current term, the enum should call NextSeekTerm(BytesRef) and step forward.
IndexOptions
Controls how much information is stored in the postings lists.
Note
This API is experimental and might change in incompatible ways in the next release.
MergeTrigger
MergeTrigger is passed to FindMerges(MergeTrigger, SegmentInfos) to indicate the event that triggered the merge.
OpenMode
Specifies the open mode for IndexWriter.
StoredFieldVisitor.Status
Enumeration of possible return values for NeedsField(FieldInfo).
TermsEnum.SeekStatus
Represents returned result from SeekCeil(BytesRef).