    Class DocTermOrds

    This class enables fast access to multiple term ords for a specified field across all docIDs.

    Like IFieldCache, it uninverts the index and holds a packed data structure in RAM to enable fast access. Unlike IFieldCache, it can handle multi-valued fields, and it does not hold the term bytes in RAM. Instead, you must obtain a TermsEnum from the GetOrdTermsEnum(AtomicReader) method and then seek by ord to get the term's bytes.

    While term ords are normally of type long, in this API they are int, because the internal representation cannot address more than MAX_INT32 unique terms. This class is also typically used on fields with relatively few unique terms compared to the number of documents. In addition, there is an internal limit (16 MB) on how many bytes each chunk of documents may consume. If you trip this limit, you'll hit an InvalidOperationException.

    Deleted documents are skipped during uninversion, and if you look them up you'll get 0 ords.

    The returned per-document ords do not retain their original order in the document. Instead, they are returned sorted by ord (i.e., by the term's BytesRef comparer). They are also de-duplicated: if a document contains the same term more than once in this field, you get that ord back only once.
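    The ordering and de-duplication described above can be illustrated without Lucene.NET. The following is a minimal sketch, not the library's implementation: PerDocOrdsSketch and OrdsForDoc are hypothetical names, terms are modeled as plain strings, and ordinal string comparison stands in for the BytesRef comparer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PerDocOrdsSketch
{
    // Hypothetical helper: maps the terms found in one document to their ords,
    // where an ord is the position of the term in the sorted list of unique
    // field terms. Duplicates collapse and the result is ordered by ord.
    public static int[] OrdsForDoc(List<string> sortedUniqueTerms, IEnumerable<string> docTerms)
    {
        var ords = new SortedSet<int>();
        foreach (string term in docTerms)
        {
            // Assumption: ordinal comparison approximates the BytesRef comparer.
            int ord = sortedUniqueTerms.BinarySearch(term, StringComparer.Ordinal);
            if (ord >= 0)
            {
                ords.Add(ord); // SortedSet ignores repeated ords
            }
        }
        return ords.ToArray();
    }
}
```

    For a field whose unique terms are "apple", "banana", "cherry", a document containing "cherry", "apple", "cherry" (in that order) yields the ords 0 and 2, regardless of the order or repetition in the document.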

    This class tests whether the provided reader is able to retrieve terms by ord (i.e., it is a single segment and uses an ord-capable terms index). If not, this class creates its own term index internally, which lets it build a wrapped TermsEnum that can handle ord. The GetOrdTermsEnum(AtomicReader) method then returns this wrapped enum, if necessary.
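    A typical read path combines the members documented on this page. The following is a sketch only, assuming the Lucene.NET 4.8 API shape (reader.LiveDocs, reader.MaxDoc, SortedSetDocValues.SetDocument, NextOrd and NO_MORE_ORDS); DocTermOrdsUsageSketch and PrintTermsPerDocument are hypothetical names, and the code needs the Lucene.Net package and an existing index to run.

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Util;

public static class DocTermOrdsUsageSketch
{
    // Hypothetical helper: prints every term of 'field' for each document.
    public static void PrintTermsPerDocument(AtomicReader reader, string field)
    {
        // Uninverts all terms of the field; this can consume a lot of RAM.
        var docTermOrds = new DocTermOrds(reader, reader.LiveDocs, field);

        // Must be the same reader that was used to build docTermOrds.
        TermsEnum termsEnum = docTermOrds.GetOrdTermsEnum(reader);
        if (termsEnum == null)
        {
            return; // no terms in this field
        }

        SortedSetDocValues ords = docTermOrds.GetIterator(reader);
        for (int docId = 0; docId < reader.MaxDoc; docId++)
        {
            ords.SetDocument(docId);
            long ord;
            while ((ord = ords.NextOrd()) != SortedSetDocValues.NO_MORE_ORDS)
            {
                // Seek by ord to recover the term bytes.
                BytesRef term = docTermOrds.LookupTerm(termsEnum, (int)ord);
                Console.WriteLine($"doc {docId}: {term.Utf8ToString()}");
            }
        }
    }
}
```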

    The RAM consumption of this class can be high!

    This is a Lucene.NET EXPERIMENTAL API, use at your own risk
    Inheritance
    System.Object
    DocTermOrds
    Namespace: Lucene.Net.Index
    Assembly: Lucene.Net.dll
    Syntax
    public class DocTermOrds : object
    Remarks

    Final form of the un-inverted field:

    • Each document points to a list of term numbers that are contained in that document.
    • Term numbers are in sorted order, and are encoded as variable-length deltas from the previous term number. Real term numbers start at 2 since 0 and 1 are reserved. A term number of 0 signals the end of the termNumber list.
    • There is a single int[maxDoc()] which either contains a pointer into a byte[] for the termNumber lists, or directly contains the termNumber list if it fits in the 4 bytes of an integer. If the first byte in the integer is 1, the next 3 bytes are a pointer into a byte[] where the termNumber list starts.
    • There are actually 256 byte arrays, to compensate for the fact that the pointers into the byte arrays are only 3 bytes long. The correct byte array for a document is a function of its ID.
    • To save space and speed up faceting, any term that matches enough documents will not be un-inverted. It is skipped while building the un-inverted field structure and is handled with a set intersection method during faceting.
    • To further save memory, the terms (the actual string values) are not all stored in memory. Instead, a TermIndex is used to convert term numbers to term values only for the terms needed after faceting has completed. Only every 128th term value is stored, along with its corresponding term number, and this is used as an index to find the closest term and iterate until the desired number is hit (very much like Lucene's own internal term index).
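
    The delta encoding of a per-document termNumber list described above can be sketched as follows. This is a simplified illustration, not the library's exact byte layout: TermNumberListSketch is a hypothetical name, the input ords are assumed to be non-negative and strictly increasing, and a generic 7-bits-per-byte variable-length encoding is assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TermNumberListSketch
{
    // Real term numbers start at 2 because 0 and 1 are reserved.
    private const int Offset = 2;

    // Encodes strictly increasing ords as variable-length deltas, ending with 0.
    public static byte[] Encode(IEnumerable<int> sortedOrds)
    {
        var bytes = new List<byte>();
        int previous = 0;
        foreach (int ord in sortedOrds)
        {
            int termNumber = ord + Offset;
            WriteVInt(bytes, termNumber - previous);
            previous = termNumber;
        }
        bytes.Add(0); // a term number of 0 signals the end of the list
        return bytes.ToArray();
    }

    // Reads deltas until the 0 terminator and rebuilds the original ords.
    public static List<int> Decode(byte[] bytes)
    {
        var ords = new List<int>();
        int position = 0;
        int previous = 0;
        while (true)
        {
            int delta = 0;
            int shift = 0;
            byte b;
            do
            {
                b = bytes[position++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);

            if (delta == 0)
            {
                break; // terminator reached
            }
            previous += delta;
            ords.Add(previous - Offset);
        }
        return ords;
    }

    // Assumed encoding: 7 data bits per byte, low-order group first,
    // high bit set on every byte except the last.
    private static void WriteVInt(List<byte> output, int value)
    {
        while (value >= 0x80)
        {
            output.Add((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        output.Add((byte)value);
    }
}
```

    In this sketch, small gaps between consecutive term numbers cost one byte each, which is why a short list can fit directly in the 4 bytes of an index entry while longer lists need a pointer into one of the byte arrays.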

    Constructors

    DocTermOrds(AtomicReader, IBits, String)

    Inverts all terms

    Declaration
    public DocTermOrds(AtomicReader reader, IBits liveDocs, string field)
    Parameters
    Type Name Description
    AtomicReader reader
    IBits liveDocs
    System.String field

    DocTermOrds(AtomicReader, IBits, String, BytesRef)

    Inverts only terms starting with the given prefix.

    Declaration
    public DocTermOrds(AtomicReader reader, IBits liveDocs, string field, BytesRef termPrefix)
    Parameters
    Type Name Description
    AtomicReader reader
    IBits liveDocs
    System.String field
    BytesRef termPrefix

    DocTermOrds(AtomicReader, IBits, String, BytesRef, Int32)

    Inverts only terms starting with the given prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq.

    Declaration
    public DocTermOrds(AtomicReader reader, IBits liveDocs, string field, BytesRef termPrefix, int maxTermDocFreq)
    Parameters
    Type Name Description
    AtomicReader reader
    IBits liveDocs
    System.String field
    BytesRef termPrefix
    System.Int32 maxTermDocFreq

    DocTermOrds(AtomicReader, IBits, String, BytesRef, Int32, Int32)

    Inverts only terms starting with the given prefix, and only terms whose docFreq (not taking deletions into account) is <= maxTermDocFreq, with a custom indexing interval (default is every 128th term).

    Declaration
    public DocTermOrds(AtomicReader reader, IBits liveDocs, string field, BytesRef termPrefix, int maxTermDocFreq, int indexIntervalBits)
    Parameters
    Type Name Description
    AtomicReader reader
    IBits liveDocs
    System.String field
    BytesRef termPrefix
    System.Int32 maxTermDocFreq
    System.Int32 indexIntervalBits

    DocTermOrds(String, Int32, Int32)

    Subclasses initialize with this constructor. Be sure to then call Uninvert(AtomicReader, IBits, BytesRef), and only once.

    Declaration
    protected DocTermOrds(string field, int maxTermDocFreq, int indexIntervalBits)
    Parameters
    Type Name Description
    System.String field
    System.Int32 maxTermDocFreq
    System.Int32 indexIntervalBits

    Fields

    DEFAULT_INDEX_INTERVAL_BITS

    Every 128th term is indexed, by default.
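
    The interval is expressed in bits, so "every 128th term" corresponds to 7 bits (1 << 7 == 128). The following sketch shows how such an interval can be used to jump to the nearest stored term and then step forward. It is an illustration under that assumption, not the library's code: TermIndexSketch, BuildCheckpoints and Locate are hypothetical names, and terms are modeled as plain strings.

```csharp
using System;
using System.Collections.Generic;

public static class TermIndexSketch
{
    // Assumed default: 7 bits, i.e. every 128th term is kept.
    public const int IndexIntervalBits = 7;

    // Keeps only every (1 << IndexIntervalBits)-th term of a sorted term list.
    public static List<string> BuildCheckpoints(IReadOnlyList<string> sortedTerms)
    {
        var checkpoints = new List<string>();
        for (int ord = 0; ord < sortedTerms.Count; ord += 1 << IndexIntervalBits)
        {
            checkpoints.Add(sortedTerms[ord]);
        }
        return checkpoints;
    }

    // For a target ord, returns which stored term to seek to and how many
    // sequential steps are needed after that seek.
    public static (int CheckpointIndex, int Steps) Locate(int ord)
    {
        int mask = (1 << IndexIntervalBits) - 1;
        return (ord >> IndexIntervalBits, ord & mask);
    }
}
```

    For example, Locate(300) points at stored term 2 (ord 256) followed by 44 sequential steps. A larger interval stores fewer terms but requires more stepping per lookup.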

    Declaration
    public static readonly int DEFAULT_INDEX_INTERVAL_BITS
    Field Value
    Type Description
    System.Int32

    m_docsEnum

    Used while uninverting.

    Declaration
    protected DocsEnum m_docsEnum
    Field Value
    Type Description
    DocsEnum

    m_field

    Field we are uninverting.

    Declaration
    protected readonly string m_field
    Field Value
    Type Description
    System.String

    m_index

    Holds the per-document ords or a pointer to the ords.

    Declaration
    protected int[] m_index
    Field Value
    Type Description
    System.Int32[]

    m_indexedTermsArray

    Holds the indexed (by default every 128th) terms.

    Declaration
    protected BytesRef[] m_indexedTermsArray
    Field Value
    Type Description
    BytesRef[]

    m_maxTermDocFreq

    Don't uninvert terms that exceed this count.

    Declaration
    protected readonly int m_maxTermDocFreq
    Field Value
    Type Description
    System.Int32

    m_numTermsInField

    Number of terms in the field.

    Declaration
    protected int m_numTermsInField
    Field Value
    Type Description
    System.Int32

    m_ordBase

    Ordinal of the first term in the field, or 0 if the PostingsFormat does not implement Ord.

    Declaration
    protected int m_ordBase
    Field Value
    Type Description
    System.Int32

    m_phase1_time

    Time for phase1 of the uninvert process.

    Declaration
    protected int m_phase1_time
    Field Value
    Type Description
    System.Int32

    m_prefix

    If non-null, only terms matching this prefix were indexed.

    Declaration
    protected BytesRef m_prefix
    Field Value
    Type Description
    BytesRef

    m_sizeOfIndexedStrings

    Total bytes (sum of term lengths) for all indexed terms.

    Declaration
    protected long m_sizeOfIndexedStrings
    Field Value
    Type Description
    System.Int64

    m_termInstances

    Total number of references to term numbers.

    Declaration
    protected long m_termInstances
    Field Value
    Type Description
    System.Int64

    m_tnums

    Holds term ords for documents.

    Declaration
    protected sbyte[][] m_tnums
    Field Value
    Type Description
    System.SByte[][]

    m_total_time

    Total time to uninvert the field.

    Declaration
    protected int m_total_time
    Field Value
    Type Description
    System.Int32

    Properties

    IsEmpty

    Returns true if no terms were indexed.

    Declaration
    public virtual bool IsEmpty { get; }
    Property Value
    Type Description
    System.Boolean

    NumTerms

    Returns the number of terms in this field

    Declaration
    public virtual int NumTerms { get; }
    Property Value
    Type Description
    System.Int32

    Methods

    GetIterator(AtomicReader)

    Returns a SortedSetDocValues view of this instance

    Declaration
    public virtual SortedSetDocValues GetIterator(AtomicReader reader)
    Parameters
    Type Name Description
    AtomicReader reader
    Returns
    Type Description
    SortedSetDocValues

    GetOrdTermsEnum(AtomicReader)

    Returns a TermsEnum that implements Ord. If the provided reader supports Ord, we just return its TermsEnum; if it does not, we build a "private" terms index internally (WARNING: consumes RAM) and use that index to implement Ord. This also enables Ord on top of a composite reader. The returned TermsEnum is unpositioned. This returns null if there are no terms.

    NOTE: you must pass the same reader that was used when creating this class

    Declaration
    public virtual TermsEnum GetOrdTermsEnum(AtomicReader reader)
    Parameters
    Type Name Description
    AtomicReader reader
    Returns
    Type Description
    TermsEnum

    LookupTerm(TermsEnum, Int32)

    Returns the term (BytesRef) corresponding to the provided ordinal.

    Declaration
    public virtual BytesRef LookupTerm(TermsEnum termsEnum, int ord)
    Parameters
    Type Name Description
    TermsEnum termsEnum
    System.Int32 ord
    Returns
    Type Description
    BytesRef

    RamUsedInBytes()

    Returns total bytes used.

    Declaration
    public virtual long RamUsedInBytes()
    Returns
    Type Description
    System.Int64

    SetActualDocFreq(Int32, Int32)

    Invoked during Uninvert(AtomicReader, IBits, BytesRef) to record the document frequency for each uninverted term.

    Declaration
    protected virtual void SetActualDocFreq(int termNum, int df)
    Parameters
    Type Name Description
    System.Int32 termNum
    System.Int32 df

    Uninvert(AtomicReader, IBits, BytesRef)

    Call this only once (if you subclass!)

    Declaration
    protected virtual void Uninvert(AtomicReader reader, IBits liveDocs, BytesRef termPrefix)
    Parameters
    Type Name Description
    AtomicReader reader
    IBits liveDocs
    BytesRef termPrefix

    VisitTerm(TermsEnum, Int32)

    Subclasses can override this.

    Declaration
    protected virtual void VisitTerm(TermsEnum te, int termNum)
    Parameters
    Type Name Description
    TermsEnum te
    System.Int32 termNum
    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)