Namespace Lucene.Net.Facet.Taxonomy
Taxonomy of Categories
Facets are defined using a hierarchy of categories, known as a _Taxonomy_.
For example, the taxonomy of a book store application might have the following structure:
Author
    Mark Twain
    J. K. Rowling
Date
    2010
        March
        April
    2009
The Taxonomy translates category paths into integer identifiers (often termed ordinals) and vice versa. The category Author/Mark Twain adds two nodes to the taxonomy: Author and Author/Mark Twain, each of which is assigned a different ordinal. The taxonomy maintains the invariant that a node's ordinal is always smaller than the ordinals of all its children.
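The path-to-ordinal mapping described above can be illustrated with a small, library-independent sketch. This is not the Lucene.NET API: the class `TinyTaxonomy` and its methods are hypothetical names chosen for illustration, and the sketch assumes ordinal 0 is reserved for the root, as stated elsewhere on this page.

```python
class TinyTaxonomy:
    """Toy in-memory taxonomy (hypothetical, not the Lucene.NET classes)."""

    def __init__(self):
        # Ordinal 0 is reserved for the root, represented by the empty path.
        self._paths = [()]
        self._ordinals = {(): 0}
        self._parents = [-1]  # -1 is an assumed "no parent" marker

    def add_category(self, *components):
        """Add a path plus any missing ancestors; return the path's ordinal."""
        ordinal = 0
        for depth in range(1, len(components) + 1):
            prefix = tuple(components[:depth])
            if prefix not in self._ordinals:
                # Ancestors are created first, so a parent always receives
                # a smaller ordinal than any of its children.
                self._ordinals[prefix] = len(self._paths)
                self._paths.append(prefix)
                self._parents.append(ordinal)
            ordinal = self._ordinals[prefix]
        return ordinal

    def get_ordinal(self, *components):
        """Return the ordinal of a path, or -1 if it is unknown."""
        return self._ordinals.get(tuple(components), -1)

    def get_path(self, ordinal):
        """Translate an ordinal back into its path components."""
        return self._paths[ordinal]

    def get_parent(self, ordinal):
        return self._parents[ordinal]


taxo = TinyTaxonomy()
twain = taxo.add_category("Author", "Mark Twain")  # adds two nodes
author = taxo.get_ordinal("Author")
```

Adding Author/Mark Twain creates both Author and Author/Mark Twain, and the parent receives the smaller ordinal, which is the invariant the paragraph above describes.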
Classes
AssociationFacetField
Add an instance of this to your
@lucene.experimental
CachedOrdinalsReader
A per-segment cache of documents' facet ordinals. Every
Cached
NOTE: every Cached
NOTE: when using this cache, it is advised to use
a
NOTE: create one instance of this and re-use it for all facet implementations (the cache is per-instance, not static).
CachedOrdinalsReader.CachedOrds
Holds the cached ordinals in two parallel int[] arrays.
CategoryPath
Holds a sequence of string components, specifying the hierarchical name of a category.
DocValuesOrdinalsReader
Decodes ordinals previously indexed into a
FacetLabel
Holds a sequence of string components, specifying the hierarchical name of a category.
FastTaxonomyFacetCounts
Computes facet counts, assuming the default encoding into DocValues was used.
@lucene.experimental
Int32AssociationFacetField
Add an instance of this to your
NOTE: This was IntAssociationFacetField in Lucene
@lucene.experimental
Int32TaxonomyFacets
Base class for all taxonomy-based facets that aggregate to a per-ords int[].
NOTE: This was IntTaxonomyFacets in Lucene
LRUHashMap<TKey, TValue>
LRUHashMap<TKey, TValue> is similar to Java's HashMap, except that it has a bounded Limit. Once it reaches that Limit, each time a new element is added, the least recently used (LRU) entry is removed.
Unlike the Java Lucene implementation, this one is thread safe because it is backed by the
@lucene.experimental
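The eviction behavior described for LRUHashMap<TKey, TValue> can be sketched as follows. This is a conceptual model only, not the Lucene.NET implementation: the class `LruMap`, its method names, and the use of a single lock for thread safety are all assumptions made for illustration.

```python
import threading
from collections import OrderedDict


class LruMap:
    """Bounded map that evicts the least recently used entry (toy sketch)."""

    def __init__(self, limit):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self._limit = limit
        self._entries = OrderedDict()  # oldest entry first, newest last
        self._lock = threading.Lock()  # assumed locking strategy

    def put(self, key, value):
        with self._lock:
            if key in self._entries:
                self._entries.move_to_end(key)  # refresh recency
            self._entries[key] = value
            if len(self._entries) > self._limit:
                self._entries.popitem(last=False)  # drop the LRU entry

    def get(self, key, default=None):
        with self._lock:
            if key not in self._entries:
                return default
            self._entries.move_to_end(key)  # a read counts as a use
            return self._entries[key]

    def __contains__(self, key):
        # Membership checks do not change recency in this sketch.
        with self._lock:
            return key in self._entries

    def __len__(self):
        with self._lock:
            return len(self._entries)


cache = LruMap(limit=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now more recently used than "b"
cache.put("c", 3)  # exceeds the limit, so "b" is evicted
```

In this sketch, reads also refresh an entry's recency, which is the usual LRU convention; whether the real class treats reads the same way is not stated on this page.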
OrdinalsReader
Provides per-document ordinals.
OrdinalsReader.OrdinalsSegmentReader
Returns ordinals for documents in one segment.
ParallelTaxonomyArrays
Returns 3 arrays for traversing the taxonomy:
- Parents: Parents[i] denotes the parent of category ordinal i.
- Children: Children[i] denotes a child of category ordinal i.
- Siblings: Siblings[i] denotes the sibling of category ordinal i.
To traverse the taxonomy tree, you typically start with Children[0] (ordinal 0 is reserved for ROOT). Then, depending on whether you want to do DFS or BFS, you continue with Children[Children[0]] or Siblings[Children[0]], respectively, and so forth.
NOTE: you are not expected to modify the values of the arrays, since the arrays are shared with other threads.
@lucene.experimental
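The traversal described above can be sketched with plain lists standing in for the three arrays. The sketch rests on assumptions that this page does not spell out: that Children[i] holds the child of i with the highest ordinal, that Siblings[i] holds the next sibling with a lower ordinal, and that -1 marks "none". The helper names are hypothetical, and the ordinals are those the book store example would receive if its categories were added top to bottom.

```python
from collections import deque

INVALID = -1  # assumed sentinel meaning "no such category"


def build_children_and_siblings(parents):
    """Derive child and sibling links from a parents array."""
    children = [INVALID] * len(parents)
    siblings = [INVALID] * len(parents)
    for ordinal in range(1, len(parents)):  # ordinal 0 is the root
        parent = parents[ordinal]
        # The previous youngest child becomes this node's sibling.
        siblings[ordinal] = children[parent]
        children[parent] = ordinal
    return children, siblings


def depth_first(children, siblings, ordinal=0):
    """Visit a node, then each child's whole subtree in turn."""
    visited = [ordinal]
    child = children[ordinal]
    while child != INVALID:
        visited.extend(depth_first(children, siblings, child))
        child = siblings[child]
    return visited


def breadth_first(children, siblings, root=0):
    """Visit all nodes of one level before descending to the next."""
    visited = []
    queue = deque([root])
    while queue:
        ordinal = queue.popleft()
        visited.append(ordinal)
        child = children[ordinal]
        while child != INVALID:
            queue.append(child)
            child = siblings[child]
    return visited


# 0 ROOT, 1 Author, 2 Author/Mark Twain, 3 Author/J. K. Rowling,
# 4 Date, 5 Date/2010, 6 Date/2010/March, 7 Date/2010/April, 8 Date/2009
parents = [INVALID, 0, 1, 1, 0, 4, 5, 5, 4]
children, siblings = build_children_and_siblings(parents)
dfs_order = depth_first(children, siblings)
bfs_order = breadth_first(children, siblings)
```

Both functions only read the arrays, in keeping with the note above that the arrays are shared and should not be modified.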
PrintTaxonomyStats
Prints how many ords are under each dimension.
SearcherTaxonomyManager
Manages near-real-time reopen of both an
NOTE: If you call Replace
SearcherTaxonomyManager.SearcherAndTaxonomy
Holds a matched pair of
SingleAssociationFacetField
Add an instance of this to your
NOTE: This was FloatAssociationFacetField in Lucene
@lucene.experimental
SingleTaxonomyFacets
Base class for all taxonomy-based facets that aggregate to a per-ords float[].
NOTE: This was FloatTaxonomyFacets in Lucene
TaxonomyFacetCounts
Reads from any Ordinals
@lucene.experimental
TaxonomyFacets
Base class for all taxonomy-based facet implementations.
TaxonomyFacetSumInt32Associations
Aggregates sum of
NOTE: This was TaxonomyFacetSumIntAssociations in Lucene
@lucene.experimental
TaxonomyFacetSumSingleAssociations
Aggregates sum of
NOTE: This was TaxonomyFacetSumFloatAssociations in Lucene
@lucene.experimental
TaxonomyFacetSumValueSource
Aggregates sum of values from Double
@lucene.experimental
TaxonomyFacetSumValueSource.ScoreValueSource
Value
TaxonomyReader
TaxonomyReader is the read-only interface with which the faceted-search library uses the taxonomy during search time.
A TaxonomyReader holds a list of categories. Each category has a serial number which we call an "ordinal", and a hierarchical "path" name:
- The ordinal is an integer that starts at 0 for the first category (which is always the root category), and grows contiguously as more categories are added. Note that once a category is added, it can never be deleted.
- The path is a CategoryPath object specifying the category's position in the hierarchy.
An implementation must allow multiple readers to be active concurrently
with a single writer. Readers follow so-called "point in time" semantics,
i.e., a TaxonomyReader object will only see taxonomy entries which were
available at the time it was created. What the writer writes is only
available to (new) readers after the writer's
In faceted search, two separate indices are used: the main Lucene index and the taxonomy. Because the main index refers to the categories listed in the taxonomy, it is important to open the taxonomy after opening the main index, and it is also necessary to Reopen() the taxonomy after reopening the main index.
This order is important; otherwise it would be possible for the main index to refer to a category which is not yet visible in the old snapshot of the taxonomy. Note that it is indeed fine for the taxonomy to be opened after the main index, even a long time after. The reason is that once a category is added to the taxonomy, it can never be changed or deleted, so there is no danger of a "too new" taxonomy being inconsistent with an older index.
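The "point in time" semantics and the append-only argument above can be modeled with a toy writer and reader. This is a conceptual sketch only: `ToyTaxonomyWriter`, `ToyTaxonomyReader`, and their methods are hypothetical names, and the assumption that a reader's view is fixed by the last commit made before the reader was opened is a simplification made for illustration.

```python
class ToyTaxonomyReader:
    """Sees only the categories that were committed when it was opened."""

    def __init__(self, categories, size):
        self._categories = categories  # shared, append-only list
        self._size = size              # snapshot boundary, never changes

    @property
    def size(self):
        return self._size

    def get_path(self, ordinal):
        """Return the path for an ordinal, or None if it is not visible."""
        if 0 <= ordinal < self._size:
            return self._categories[ordinal]
        return None


class ToyTaxonomyWriter:
    """Append-only category list with an explicit commit step."""

    def __init__(self):
        self._categories = ["<root>"]  # ordinal 0 is the root
        self._committed = 1            # how many entries new readers see

    def add_category(self, path):
        # Existing categories keep their ordinal; nothing is ever removed.
        if path not in self._categories:
            self._categories.append(path)
        return self._categories.index(path)

    def commit(self):
        self._committed = len(self._categories)

    def open_reader(self):
        return ToyTaxonomyReader(self._categories, self._committed)


writer = ToyTaxonomyWriter()
author = writer.add_category("Author")
writer.commit()
old_reader = writer.open_reader()

twain = writer.add_category("Author/Mark Twain")
writer.commit()
new_reader = writer.open_reader()
```

Here `old_reader` cannot resolve the ordinal of Author/Mark Twain, which is why a main index opened later than the taxonomy could refer to a category the taxonomy snapshot does not yet contain. The newer reader still resolves every older ordinal to the same path, so a taxonomy that is newer than the main index stays consistent with it.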
TaxonomyReader.ChildrenIterator
An iterator over a category's children.
Interfaces
ITaxonomyWriter
ITaxonomy
Notes about concurrent access to the taxonomy:
An implementation must allow multiple readers and a single writer to be
active concurrently. Readers follow so-called "point in time" semantics,
i.e., a reader object will only see taxonomy entries which were available
at the time it was created. What the writer writes is only available to
(new) readers after the writer's
Faceted search keeps two indices - namely Lucene's main index, and this
taxonomy index. When one or more readers are active concurrently with the
writer, care must be taken to avoid an inconsistency between the state of
these two indices: When writing to the indices, the taxonomy must always
be committed to disk before the main index, because the main index
refers to categories listed in the taxonomy.
Such control can best be achieved by turning off the main index's
"autocommit" feature, and explicitly calling