Namespace Lucene.Net.Facet.Taxonomy

Taxonomy of Categories

Facets are defined using a hierarchy of categories, known as a _Taxonomy_.
For example, the taxonomy of a book store application might have the following structure:

Author
- Mark Twain
- J. K. Rowling
Date
- 2010
  - March
  - April
- 2009

The taxonomy translates category paths into integer identifiers (often termed ordinals) and vice versa. The category Author/Mark Twain adds two nodes to the taxonomy: Author and Author/Mark Twain, each of which is assigned a different ordinal. The taxonomy maintains the invariant that a node's ordinal is always smaller than the ordinals of all its children.
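
The ordinal assignment described above can be illustrated with a minimal, self-contained sketch. It is plain Java using only the standard library, not the Lucene.NET API; the class ToyTaxonomy and its method names are hypothetical, and category paths are assumed to be '/'-separated strings for simplicity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy taxonomy: illustrates the idea only, NOT the Lucene.NET implementation.
class ToyTaxonomy {
    private final Map<String, Integer> ordinals = new HashMap<>();
    private final List<String> paths = new ArrayList<>();
    private final List<Integer> parents = new ArrayList<>();

    ToyTaxonomy() {
        // Ordinal 0 is reserved for the root, represented here by the empty path.
        ordinals.put("", 0);
        paths.add("");
        parents.add(-1);
    }

    // Adds a path such as "Author/Mark Twain" and returns its ordinal.
    // Missing ancestors are added first, so a parent always gets a smaller ordinal.
    int addCategory(String path) {
        Integer existing = ordinals.get(path);
        if (existing != null) {
            return existing;
        }
        int slash = path.lastIndexOf('/');
        int parent = (slash < 0) ? 0 : addCategory(path.substring(0, slash));
        int ordinal = paths.size();
        ordinals.put(path, ordinal);
        paths.add(path);
        parents.add(parent);
        return ordinal;
    }

    // Reverse mapping: ordinal back to path.
    String getPath(int ordinal) { return paths.get(ordinal); }

    int getParent(int ordinal) { return parents.get(ordinal); }
}
```

In this sketch, adding Author/Mark Twain to an empty instance assigns ordinal 1 to Author and ordinal 2 to Author/Mark Twain, so the parent's ordinal is smaller than the child's.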

Classes

AssociationFacetField

Add an instance of this to your Document to add a facet label associated with an arbitrary byte[]. This will require a custom Facets implementation at search time; see Int32AssociationFacetField and SingleAssociationFacetField to use existing Facets implementations.

@lucene.experimental

CachedOrdinalsReader

A per-segment cache of documents' facet ordinals. Every CachedOrdinalsReader.CachedOrds holds the ordinals in a raw int[], and therefore consumes as much RAM as the total number of ordinals found in the segment, but saves the CPU cost of decoding ordinals during facet counting.

NOTE: every CachedOrdinalsReader.CachedOrds is limited to 2.1B total ordinals. If that is a limitation for you then consider limiting the segment size to fewer documents, or use an alternative cache which pages through the category ordinals.

NOTE: when using this cache, it is advised to use a DocValuesFormat that does not cache the data in memory, at least for the category list fields; otherwise you will be double-caching.

NOTE: create one instance of this and re-use it for all facet implementations (the cache is per-instance, not static).

CachedOrdinalsReader.CachedOrds

Holds the cached ordinals in two parallel int[] arrays.
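
One common way to lay out two parallel arrays of this kind is an offsets array that indexes into a flat ordinals array. The sketch below assumes that layout; it is plain Java, the class ToyCachedOrds and its members are hypothetical names, and it is not the Lucene.NET type itself.

```java
import java.util.Arrays;

// Hypothetical illustration of an "offsets + flat ordinals" layout.
class ToyCachedOrds {
    final int[] offsets;   // offsets[doc] .. offsets[doc + 1] delimit the slice for doc
    final int[] ordinals;  // ordinals of all documents, concatenated

    ToyCachedOrds(int[][] ordsPerDoc) {
        offsets = new int[ordsPerDoc.length + 1];
        int total = 0;
        for (int doc = 0; doc < ordsPerDoc.length; doc++) {
            offsets[doc] = total;
            total += ordsPerDoc[doc].length;
        }
        offsets[ordsPerDoc.length] = total;

        ordinals = new int[total];
        for (int doc = 0; doc < ordsPerDoc.length; doc++) {
            System.arraycopy(ordsPerDoc[doc], 0, ordinals, offsets[doc], ordsPerDoc[doc].length);
        }
    }

    // Returns a copy of the ordinals stored for one document.
    int[] get(int doc) {
        return Arrays.copyOfRange(ordinals, offsets[doc], offsets[doc + 1]);
    }
}
```

Because a layout like this indexes a single int-addressed array, the total number of ordinals it can hold is bounded by the maximum array length, which is consistent with the limit of roughly 2.1 billion noted above.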

CategoryPath

Holds a sequence of string components, specifying the hierarchical name of a category.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

DocValuesOrdinalsReader

Decodes ordinals previously indexed into a BinaryDocValues field.

FacetLabel

Holds a sequence of string components, specifying the hierarchical name of a category.

This is a Lucene.NET INTERNAL API, use at your own risk

FastTaxonomyFacetCounts

Computes facet counts, assuming the default encoding into DocValues was used.

@lucene.experimental

Int32AssociationFacetField

Add an instance of this to your Document to add a facet label associated with an int. Use TaxonomyFacetSumInt32Associations to aggregate int values per facet label at search time.

NOTE: This was IntAssociationFacetField in Lucene

@lucene.experimental

Int32TaxonomyFacets

Base class for all taxonomy-based facets that aggregate to a per-ords int[].

NOTE: This was IntTaxonomyFacets in Lucene

LRUHashMap<TKey, TValue>

LRUHashMap<TKey, TValue> is similar to Java's HashMap, but has a bounded Limit: once that Limit is reached, each time a new element is added, the least recently used (LRU) entry is removed.

Unlike the Java Lucene implementation, this one is thread safe because it is backed by the LurchTable<TKey, TValue>. Note that every time an element is read from LRUHashMap<TKey, TValue>, a write operation also takes place to update the element's last access time. This is because the LRU order must be remembered to determine which element to evict when the Limit is exceeded.
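
The eviction behavior, and the fact that a read also updates the LRU order, can be seen in a small sketch built on Java's LinkedHashMap in access-order mode. This is only an illustration: the class ToyLruMap is a hypothetical name, and unlike the class described here, this sketch is not thread safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded LRU map for illustration. NOT thread safe.
class ToyLruMap<K, V> extends LinkedHashMap<K, V> {
    private final int limit;

    ToyLruMap(int limit) {
        // accessOrder = true: get() moves the entry to the "most recently used" end,
        // which is why even a read modifies the map's internal state.
        super(16, 0.75f, true);
        this.limit = limit;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; evicts the least recently used entry
        // once the limit is exceeded.
        return size() > limit;
    }
}
```

With a limit of 2, putting "a" and "b", reading "a", and then putting "c" evicts "b" rather than "a", because the read made "a" the more recently used entry.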

@lucene.experimental

OrdinalsReader

Provides per-document ordinals.

OrdinalsReader.OrdinalsSegmentReader

Returns ordinals for documents in one segment.

ParallelTaxonomyArrays

Returns 3 arrays for traversing the taxonomy:

Parents: Parents[i] denotes the parent of category ordinal i.
Children: Children[i] denotes a child of category ordinal i.
Siblings: Siblings[i] denotes the sibling of category ordinal i.

To traverse the taxonomy tree, you typically start with Children[0] (ordinal 0 is reserved for ROOT). Then, depending on whether you want a depth-first (DFS) or breadth-first (BFS) traversal, you continue with Children[Children[0]] or Siblings[Children[0]], respectively, and so forth.
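
The traversal described above can be sketched as follows. This is plain Java rather than the Lucene.NET API: the class ToyParallelArrays is hypothetical, the children and siblings arrays are derived here from a parents array, and the value -1 is assumed to mean "no such ordinal".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: builds children/siblings arrays from a parents array
// and walks them depth-first. -1 marks "no such ordinal" in this sketch.
class ToyParallelArrays {
    final int[] parents;
    final int[] children;
    final int[] siblings;

    ToyParallelArrays(int[] parents) {
        this.parents = parents;
        children = new int[parents.length];
        siblings = new int[parents.length];
        Arrays.fill(children, -1);
        Arrays.fill(siblings, -1);
        // Ordinals are processed in increasing order, so children[p] ends up
        // holding the most recently added child of p, and siblings[ord] the
        // child of the same parent that was added just before ord.
        for (int ord = 1; ord < parents.length; ord++) {
            int p = parents[ord];
            siblings[ord] = children[p];
            children[p] = ord;
        }
    }

    // Depth-first: for each child of ord, record it, descend into it,
    // then move on to that child's sibling.
    void dfs(int ord, List<Integer> out) {
        for (int child = children[ord]; child != -1; child = siblings[child]) {
            out.add(child);
            dfs(child, out);
        }
    }
}
```

Calling dfs(0, out) starts at the root and visits every category exactly once, using only array lookups.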

NOTE: you are not expected to modify the values of the arrays, since the arrays are shared with other threads.

@lucene.experimental

PrintTaxonomyStats

Prints how many ords are under each dimension.

SearcherTaxonomyManager

Manages near-real-time reopen of both an IndexSearcher and a TaxonomyReader.

NOTE: If you call ReplaceTaxonomy(Store.Directory) then you must open a new SearcherTaxonomyManager afterwards.

SearcherTaxonomyManager.SearcherAndTaxonomy

Holds a matched pair of IndexSearcher and TaxonomyReader.

SingleAssociationFacetField

Add an instance of this to your Document to add a facet label associated with a float. Use TaxonomyFacetSumSingleAssociations to aggregate float values per facet label at search time.

NOTE: This was FloatAssociationFacetField in Lucene

@lucene.experimental

SingleTaxonomyFacets

Base class for all taxonomy-based facets that aggregate to a per-ords float[].

NOTE: This was FloatTaxonomyFacets in Lucene

TaxonomyFacetCounts

Reads from any OrdinalsReader; use FastTaxonomyFacetCounts if you are using the default encoding from BinaryDocValues.

@lucene.experimental

TaxonomyFacets

Base class for all taxonomy-based facets implementations.

TaxonomyFacetSumInt32Associations

Aggregates sum of values previously indexed with Int32AssociationFacetField, assuming the default encoding.

NOTE: This was TaxonomyFacetSumIntAssociations in Lucene

@lucene.experimental

TaxonomyFacetSumSingleAssociations

Aggregates sum of values previously indexed with SingleAssociationFacetField, assuming the default encoding.

NOTE: This was TaxonomyFacetSumFloatAssociations in Lucene

@lucene.experimental

TaxonomyFacetSumValueSource

Aggregates sum of values from DoubleVal(Int32) and DoubleVal(Int32, Double[]), for each facet label.

@lucene.experimental

TaxonomyFacetSumValueSource.ScoreValueSource

ValueSource that returns the score for each hit; use this to aggregate the sum of all hit scores for each facet label.

TaxonomyReader

TaxonomyReader is the read-only interface with which the faceted-search library uses the taxonomy during search time.

A TaxonomyReader holds a list of categories. Each category has a serial number which we call an "ordinal", and a hierarchical "path" name:

The ordinal is an integer that starts at 0 for the first category (which is always the root category), and grows contiguously as more categories are added. Note that once a category is added, it can never be deleted.
The path is a CategoryPath object specifying the category's position in the hierarchy.

Notes about concurrent access to the taxonomy:

An implementation must allow multiple readers to be active concurrently with a single writer. Readers follow so-called "point in time" semantics, i.e., a TaxonomyReader object will only see taxonomy entries which were available at the time it was created. What the writer writes is only available to (new) readers after the writer's Commit() is called.

In faceted search, two separate indices are used: the main Lucene index, and the taxonomy. Because the main index refers to the categories listed in the taxonomy, it is important to open the taxonomy after opening the main index, and it is also necessary to Reopen() the taxonomy after Reopen()ing the main index.

This order is important; otherwise the main index could refer to a category which is not yet visible in the old snapshot of the taxonomy. Note that it is fine for the taxonomy to be opened after the main index, even a long time after. The reason is that once a category is added to the taxonomy, it can never be changed or deleted, so there is no danger of a "too new" taxonomy being inconsistent with an older index.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

TaxonomyReader.ChildrenIterator

An iterator over a category's children.

Interfaces

ITaxonomyWriter

ITaxonomyWriter is the interface which the faceted-search library uses to dynamically build the taxonomy at indexing time.

Notes about concurrent access to the taxonomy:

An implementation must allow multiple readers and a single writer to be active concurrently. Readers follow so-called "point in time" semantics, i.e., a reader object will only see taxonomy entries which were available at the time it was created. What the writer writes is only available to (new) readers after the writer's Commit() is called.

Faceted search keeps two indices: Lucene's main index and this taxonomy index. When one or more readers are active concurrently with the writer, care must be taken to avoid an inconsistency between the state of these two indices. When writing to the indices, the taxonomy must always be committed to disk before the main index, because the main index refers to categories listed in the taxonomy. Such control can best be achieved by turning off the main index's "autocommit" feature and explicitly calling Commit() for both indices (first for the taxonomy, then for the main index). In old versions of Lucene (2.2 or earlier), when autocommit could not be turned off, a more complicated solution was needed: for example, some sort of (possibly inter-process) locking to ensure that a reader is opened only right after both indices have been flushed (and before anything else is written to them).

This is a Lucene.NET EXPERIMENTAL API, use at your own risk