Class DocTermOrds
This class enables fast access to multiple term ords for a specified field across all docIDs.
Like IFieldCache, it uninverts the index and holds a packed data structure in RAM to enable fast access. Unlike IFieldCache, it can handle multi-valued fields, and it does not hold the term bytes in RAM. Instead, you must obtain a TermsEnum from the GetOrdTermsEnum(AtomicReader) method and then seek by ord to get the term's bytes.
While term ords are normally of type System.Int64, in this API they are System.Int32 because the internal representation cannot address more than System.Int32.MaxValue unique terms. This class is typically used on fields with relatively few unique terms compared to the number of documents. In addition, there is an internal limit (16 MB) on how many bytes each chunk of documents may consume. If you exceed this limit, a System.InvalidOperationException is thrown.
Deleted documents are skipped during uninversion, and if you look them up you'll get 0 ords.
The returned per-document ords do not retain their original order in the document. Instead, they are returned in sorted order (by ord, i.e. by the term's BytesRef comparer). They are also de-duplicated: if a document has the same term more than once in this field, you get that ord back only once.
This class tests whether the provided reader can retrieve terms by ord (i.e. it is a single segment and uses an ord-capable terms index). If not, this class creates its own term index internally, which lets it build a wrapped TermsEnum that can handle ord. The GetOrdTermsEnum(AtomicReader) method then returns this wrapped enum, if necessary.
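As a small illustration of the ordering rule above, the following sketch shows how a document's ords would look once sorted and de-duplicated. It is independent of Lucene.NET; the class and method names are hypothetical and only model the observable result, not how the class computes it.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Lucene.NET: models the shape of the
// per-document ords a caller observes (sorted ascending, no duplicates).
public static class PerDocumentOrdsSketch
{
    public static int[] AsReturned(IEnumerable<int> ordsInDocumentOrder)
    {
        // SortedSet<int> both orders the ords and drops repeated values.
        return new SortedSet<int>(ordsInDocumentOrder).ToArray();
    }
}
```

For example, a document whose field contains the ords 5, 2, 5, 9, 2 in that order would be seen as 2, 5, 9.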
The RAM consumption of this class can be high!
Note
This API is experimental and might change in incompatible ways in the next release.
Namespace: Lucene.Net.Index
Assembly: Lucene.Net.dll
Syntax
public class DocTermOrds
Remarks
Final form of the un-inverted field:
- Each document points to a list of term numbers that are contained in that document.
- Term numbers are in sorted order, and are encoded as variable-length deltas from the previous term number. Real term numbers start at 2 since 0 and 1 are reserved. A term number of 0 signals the end of the termNumber list.
- There is a single int[maxDoc()] which either contains a pointer into a byte[] for the termNumber lists, or directly contains the termNumber list if it fits in the 4 bytes of an integer. If the first byte in the integer is 1, the next 3 bytes are a pointer into a byte[] where the termNumber list starts.
- There are actually 256 byte arrays, to compensate for the fact that the pointers into the byte arrays are only 3 bytes long. The correct byte array for a document is a function of its ID.
- To save space and speed up faceting, any term that matches enough documents will not be un-inverted... it will be skipped while building the un-inverted field structure, and will use a set intersection method during faceting.
- To further save memory, the terms (the actual string values) are not all stored in memory, but a TermIndex is used to convert term numbers to term values only for the terms needed after faceting has completed. Only every 128th term value is stored, along with its corresponding term number, and this is used as an index to find the closest term and iterate until the desired number is hit (very much like Lucene's own internal term index).
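The delta encoding described in the list above can be sketched as follows. This is a simplified, standalone illustration and not the class's actual code: it assumes the reservation of 0 and 1 is realized by adding an offset of 2 to every delta, and that each value is written as 7-bit groups with the high bit marking continuation. All names in the sketch are hypothetical.

```csharp
using System.Collections.Generic;

// Simplified sketch of a termNumber list: sorted term numbers are stored as
// variable-length deltas, and an encoded value of 0 ends the list.
public static class TermNumberListSketch
{
    // Assumption: adding 2 to each delta keeps the encoded values 0 and 1 free.
    private const int Offset = 2;

    public static byte[] Encode(IReadOnlyList<int> sortedTermNumbers)
    {
        var bytes = new List<byte>();
        int previous = 0;
        foreach (int termNumber in sortedTermNumbers)
        {
            WriteVInt(termNumber - previous + Offset, bytes);
            previous = termNumber;
        }
        bytes.Add(0); // terminator
        return bytes.ToArray();
    }

    public static List<int> Decode(byte[] bytes)
    {
        var result = new List<int>();
        int termNumber = 0;
        int pos = 0;
        while (true)
        {
            // Read one variable-length value, most significant group first.
            int delta = 0;
            byte b;
            do
            {
                b = bytes[pos++];
                delta = (delta << 7) | (b & 0x7F);
            } while ((b & 0x80) != 0);

            if (delta == 0) break; // end of the list
            termNumber += delta - Offset;
            result.Add(termNumber);
        }
        return result;
    }

    private static void WriteVInt(int value, List<byte> output)
    {
        // Emit the higher 7-bit groups with the continuation bit set,
        // skipping leading groups that are zero.
        bool started = false;
        for (int shift = 28; shift > 0; shift -= 7)
        {
            int group = (value >> shift) & 0x7F;
            if (group != 0 || started)
            {
                output.Add((byte)(group | 0x80));
                started = true;
            }
        }
        output.Add((byte)(value & 0x7F)); // final group, no continuation bit
    }
}
```

With this sketch, the term numbers 3, 4, 200 become the deltas 5, 3, 198 and are stored in five bytes (5, 3, 0x81, 0x46, 0). A list this short would fit the "directly in the integer" case only when it occupies at most 4 bytes.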
Constructors
DocTermOrds(AtomicReader, IBits, String)
Inverts all terms.
Declaration
public DocTermOrds(AtomicReader reader, IBits liveDocs, string field)
Parameters
Type | Name | Description |
---|---|---|
AtomicReader | reader | |
IBits | liveDocs | |
System.String | field |
DocTermOrds(AtomicReader, IBits, String, BytesRef)
Inverts only terms starting with the given prefix.
Declaration
public DocTermOrds(AtomicReader reader, IBits liveDocs, string field, BytesRef termPrefix)
Parameters
Type | Name | Description |
---|---|---|
AtomicReader | reader | |
IBits | liveDocs | |
System.String | field | |
BytesRef | termPrefix |
DocTermOrds(AtomicReader, IBits, String, BytesRef, Int32)
Inverts only terms starting with the given prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq.
Declaration
public DocTermOrds(AtomicReader reader, IBits liveDocs, string field, BytesRef termPrefix, int maxTermDocFreq)
Parameters
Type | Name | Description |
---|---|---|
AtomicReader | reader | |
IBits | liveDocs | |
System.String | field | |
BytesRef | termPrefix | |
System.Int32 | maxTermDocFreq |
DocTermOrds(AtomicReader, IBits, String, BytesRef, Int32, Int32)
Inverts only terms starting with the given prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (default is every 128th term).
Declaration
public DocTermOrds(AtomicReader reader, IBits liveDocs, string field, BytesRef termPrefix, int maxTermDocFreq, int indexIntervalBits)
Parameters
Type | Name | Description |
---|---|---|
AtomicReader | reader | |
IBits | liveDocs | |
System.String | field | |
BytesRef | termPrefix | |
System.Int32 | maxTermDocFreq | |
System.Int32 | indexIntervalBits |
DocTermOrds(String, Int32, Int32)
Subclasses initialize with this constructor, but must then call Uninvert(AtomicReader, IBits, BytesRef) exactly once.
Declaration
protected DocTermOrds(string field, int maxTermDocFreq, int indexIntervalBits)
Parameters
Type | Name | Description |
---|---|---|
System.String | field | |
System.Int32 | maxTermDocFreq | |
System.Int32 | indexIntervalBits |
Fields
DEFAULT_INDEX_INTERVAL_BITS
Every 128th term is indexed, by default.
Declaration
public const int DEFAULT_INDEX_INTERVAL_BITS = 7
Field Value
Type | Description |
---|---|
System.Int32 |
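The relationship between the bit count and the interval is a power of two: 7 bits gives 1 << 7 = 128. The sketch below shows that arithmetic and how an ord could be split into an indexed slot plus a number of forward steps. The names are hypothetical, and the lookup split is an assumption about how such an index is typically used, not a description of the class's internals.

```csharp
// Hypothetical helper, not part of Lucene.NET.
public static class IndexIntervalSketch
{
    // Mirrors the documented default of 7 bits.
    public const int DefaultIndexIntervalBits = 7;

    // Number of terms between two indexed terms for a given bit count.
    public static int Interval(int indexIntervalBits)
    {
        return 1 << indexIntervalBits;
    }

    // Which indexed term precedes (or equals) the given ord.
    public static int IndexedSlot(int ord, int indexIntervalBits)
    {
        return ord >> indexIntervalBits;
    }

    // How many terms to step forward from that indexed term to reach ord.
    public static int StepsForward(int ord, int indexIntervalBits)
    {
        return ord & ((1 << indexIntervalBits) - 1);
    }
}
```

With the default of 7 bits, ord 300 maps to indexed slot 2 (ord 256) followed by 44 forward steps. A smaller bit count stores more indexed terms and uses more RAM in exchange for shorter scans.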
m_docsEnum
Used while uninverting.
Declaration
protected DocsEnum m_docsEnum
Field Value
Type | Description |
---|---|
DocsEnum |
m_field
Field we are uninverting.
Declaration
protected readonly string m_field
Field Value
Type | Description |
---|---|
System.String |
m_index
Holds the per-document ords or a pointer to the ords.
Declaration
protected int[] m_index
Field Value
Type | Description |
---|---|
System.Int32[] |
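The Remarks section says an entry either holds the termNumber list inline or, when its first byte is 1, a 3-byte pointer into one of 256 byte arrays. A minimal sketch of interpreting such an entry follows. It assumes the "first byte" is the low-order byte and that the byte array is selected from bits 16 to 23 of the document ID; both are assumptions made for illustration, and all names are hypothetical.

```csharp
// Hypothetical helper, not part of Lucene.NET.
public static class IndexEntrySketch
{
    // Assumption: the marker lives in the low-order byte of the entry.
    public static bool IsPointer(int entry)
    {
        return (entry & 0xFF) == 1;
    }

    // The remaining 3 bytes, read as an unsigned offset into a byte array.
    public static int PointerOffset(int entry)
    {
        return (int)((uint)entry >> 8);
    }

    // Assumption: one of 256 arrays is chosen from bits 16..23 of the doc ID.
    public static int ArrayIndexForDoc(int docId)
    {
        return (docId >> 16) & 0xFF;
    }
}
```

An entry for which IsPointer returns false would instead be read as up to 4 bytes of inline termNumber data.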
m_indexedTermsArray
Holds the indexed (by default every 128th) terms.
Declaration
protected BytesRef[] m_indexedTermsArray
Field Value
Type | Description |
---|---|
BytesRef[] |
m_maxTermDocFreq
Don't uninvert terms that exceed this count.
Declaration
protected readonly int m_maxTermDocFreq
Field Value
Type | Description |
---|---|
System.Int32 |
m_numTermsInField
Number of terms in the field.
Declaration
protected int m_numTermsInField
Field Value
Type | Description |
---|---|
System.Int32 |
m_ordBase
Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement Ord.
Declaration
protected int m_ordBase
Field Value
Type | Description |
---|---|
System.Int32 |
m_phase1_time
Time for phase1 of the uninvert process.
Declaration
protected int m_phase1_time
Field Value
Type | Description |
---|---|
System.Int32 |
m_prefix
If non-null, only terms matching this prefix were indexed.
Declaration
protected BytesRef m_prefix
Field Value
Type | Description |
---|---|
BytesRef |
m_sizeOfIndexedStrings
Total bytes (sum of term lengths) for all indexed terms.
Declaration
protected long m_sizeOfIndexedStrings
Field Value
Type | Description |
---|---|
System.Int64 |
m_termInstances
Total number of references to term numbers.
Declaration
protected long m_termInstances
Field Value
Type | Description |
---|---|
System.Int64 |
m_tnums
Holds term ords for documents.
Declaration
[CLSCompliant(false)]
protected sbyte[][] m_tnums
Field Value
Type | Description |
---|---|
System.SByte[][] |
m_total_time
Total time to uninvert the field.
Declaration
protected int m_total_time
Field Value
Type | Description |
---|---|
System.Int32 |
Properties
IsEmpty
Returns true if no terms were indexed.
Declaration
public virtual bool IsEmpty { get; }
Property Value
Type | Description |
---|---|
System.Boolean |
NumTerms
Returns the number of terms in this field.
Declaration
public virtual int NumTerms { get; }
Property Value
Type | Description |
---|---|
System.Int32 |
Methods
GetIterator(AtomicReader)
Returns a SortedSetDocValues view of this instance.
Declaration
public virtual SortedSetDocValues GetIterator(AtomicReader reader)
Parameters
Type | Name | Description |
---|---|---|
AtomicReader | reader |
Returns
Type | Description |
---|---|
SortedSetDocValues |
GetOrdTermsEnum(AtomicReader)
Returns a TermsEnum that implements Ord. If the provided reader supports Ord, we just return its TermsEnum; if it does not, we build a "private" terms index internally (WARNING: consumes RAM) and use that index to implement Ord. This also enables Ord on top of a composite reader. The returned TermsEnum is unpositioned. This returns null if there are no terms.
NOTE: you must pass the same reader that was used when creating this class.
Declaration
public virtual TermsEnum GetOrdTermsEnum(AtomicReader reader)
Parameters
Type | Name | Description |
---|---|---|
AtomicReader | reader |
Returns
Type | Description |
---|---|
TermsEnum |
LookupTerm(TermsEnum, Int32)
Returns the term (BytesRef) corresponding to the provided ordinal.
Declaration
public virtual BytesRef LookupTerm(TermsEnum termsEnum, int ord)
Parameters
Type | Name | Description |
---|---|---|
TermsEnum | termsEnum | |
System.Int32 | ord |
Returns
Type | Description |
---|---|
BytesRef |
RamUsedInBytes()
Returns total bytes used.
Declaration
public virtual long RamUsedInBytes()
Returns
Type | Description |
---|---|
System.Int64 |
SetActualDocFreq(Int32, Int32)
Invoked during Uninvert(AtomicReader, IBits, BytesRef) to record the document frequency for each uninverted term.
Declaration
protected virtual void SetActualDocFreq(int termNum, int df)
Parameters
Type | Name | Description |
---|---|---|
System.Int32 | termNum | |
System.Int32 | df |
Uninvert(AtomicReader, IBits, BytesRef)
If you subclass, call this method only once.
Declaration
protected virtual void Uninvert(AtomicReader reader, IBits liveDocs, BytesRef termPrefix)
Parameters
Type | Name | Description |
---|---|---|
AtomicReader | reader | |
IBits | liveDocs | |
BytesRef | termPrefix |
VisitTerm(TermsEnum, Int32)
Subclasses can override this.
Declaration
protected virtual void VisitTerm(TermsEnum te, int termNum)
Parameters
Type | Name | Description |
---|---|---|
TermsEnum | te | |
System.Int32 | termNum |