Namespace Lucene.Net.Facet.Taxonomy
Taxonomy of Categories
Facets are defined using a hierarchy of categories, known as a _Taxonomy_.
For example, the taxonomy of a book store application might have the following structure:
- Author
  - Mark Twain
  - J. K. Rowling
- Date
  - 2010
    - March
    - April
  - 2009
The Taxonomy translates category paths into integer identifiers (often termed ordinals) and vice versa. The category Author/Mark Twain adds two nodes to the taxonomy: Author and Author/Mark Twain, each of which is assigned a different ordinal. The taxonomy maintains the invariant that a node always has an ordinal that is smaller than the ordinals of all its children.
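The path-to-ordinal mapping and the parent-before-child invariant can be illustrated with a minimal sketch. This is a toy in-memory model written in Python for illustration only; the class ToyTaxonomy and its methods are hypothetical and are not part of the Lucene.NET API.

```python
class ToyTaxonomy:
    """Toy in-memory taxonomy (hypothetical, not the Lucene.NET implementation)."""

    def __init__(self):
        # Ordinal 0 is reserved for the root, modeled here as the empty path.
        self._paths = [()]         # ordinal -> path components
        self._ordinals = {(): 0}   # path components -> ordinal
        self._parents = [-1]       # ordinal -> parent ordinal (-1 for the root)

    def add_category(self, *components):
        """Add a path and any missing ancestors; return the path's ordinal."""
        path = tuple(components)
        if path in self._ordinals:
            return self._ordinals[path]
        # Adding the parent first guarantees it receives the smaller ordinal.
        parent = self.add_category(*path[:-1])
        ordinal = len(self._paths)
        self._paths.append(path)
        self._ordinals[path] = ordinal
        self._parents.append(parent)
        return ordinal

    def get_ordinal(self, *components):
        """Return the ordinal of a path, or -1 if the path is unknown."""
        return self._ordinals.get(tuple(components), -1)

    def get_path(self, ordinal):
        """Return the path of an ordinal as a '/'-separated string."""
        return "/".join(self._paths[ordinal])

    def parent(self, ordinal):
        return self._parents[ordinal]

    @property
    def size(self):
        return len(self._paths)


taxo = ToyTaxonomy()
twain = taxo.add_category("Author", "Mark Twain")  # also adds "Author"
author = taxo.get_ordinal("Author")
print(author, twain)          # 1 2
print(taxo.get_path(twain))   # Author/Mark Twain
```

Because an ancestor is always inserted before its descendant, the ordinal of a node is necessarily smaller than the ordinals of its children in this model.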
Classes
AssociationFacetField
Add an instance of this to your Lucene.Net.Documents.Document to add a facet label associated with an arbitrary byte[]. This will require a custom Facets implementation at search time; see Int32AssociationFacetField and SingleAssociationFacetField to use existing Facets implementations.
CachedOrdinalsReader
A per-segment cache of documents' facet ordinals. Every CachedOrdinalsReader.CachedOrds holds the ordinals in a raw int[], and therefore consumes as much RAM as the total number of ordinals found in the segment, but saves the CPU cost of decoding ordinals during facet counting.
NOTE: every CachedOrdinalsReader.CachedOrds is limited to 2.1B total ordinals. If that is a limitation for you then consider limiting the segment size to fewer documents, or use an alternative cache which pages through the category ordinals.
NOTE: when using this cache, it is advised to use a DocValuesFormat that does not cache the data in memory, at least for the category lists fields, or otherwise you'll be doing double-caching.
NOTE: create one instance of this and re-use it for all facet implementations (the cache is per-instance, not static).
CachedOrdinalsReader.CachedOrds
Holds the cached ordinals in two parallel int[] arrays.
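One way two parallel arrays can hold variable-length ordinal lists per document is sketched below in Python. The layout (an offsets array indexing into a flat ordinals array) and all function names are assumptions made for illustration; the sketch does not call or reproduce the Lucene.NET code.

```python
def build_cache(per_doc_ordinals):
    """Flatten per-document ordinal lists into two parallel arrays.

    offsets[doc] .. offsets[doc + 1] delimits the slice of `ordinals`
    that belongs to document `doc` (assumed layout, for illustration).
    """
    offsets = [0]
    ordinals = []
    for doc_ords in per_doc_ordinals:
        ordinals.extend(doc_ords)
        offsets.append(len(ordinals))
    return offsets, ordinals


def ordinals_for_doc(offsets, ordinals, doc_id):
    """Return the cached ordinals of one document without any decoding."""
    return ordinals[offsets[doc_id]:offsets[doc_id + 1]]


def count_facets(offsets, ordinals, matching_docs, num_ordinals):
    """Count how many matching documents carry each ordinal."""
    counts = [0] * num_ordinals
    for doc_id in matching_docs:
        for i in range(offsets[doc_id], offsets[doc_id + 1]):
            counts[ordinals[i]] += 1
    return counts


# Three documents in one segment; the third has no facets.
offsets, ordinals = build_cache([[1, 2], [1, 3], []])
counts = count_facets(offsets, ordinals, matching_docs=[0, 1], num_ordinals=4)
print(offsets, ordinals, counts)
```

The memory cost is one integer per stored ordinal plus one per document, which matches the trade-off described above: more RAM in exchange for skipping the decode step during counting.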
CategoryPath
Holds a sequence of string components, specifying the hierarchical name of a category.
DocValuesOrdinalsReader
Decodes ordinals previously indexed into a BinaryDocValues field.
FacetLabel
Holds a sequence of string components, specifying the hierarchical name of a category.
FastTaxonomyFacetCounts
Computes facet counts, assuming the default encoding into DocValues was used.
Int32AssociationFacetField
Add an instance of this to your Lucene.Net.Documents.Document to add a facet label associated with a System.Int32. Use TaxonomyFacetSumInt32Associations to aggregate System.Int32 values per facet label at search time.
NOTE: This was IntAssociationFacetField in Lucene
Int32TaxonomyFacets
Base class for all taxonomy-based facets that aggregate to a per-ords int[].
NOTE: This was IntTaxonomyFacets in Lucene
LruDictionary<TKey, TValue>
LruDictionary<TKey, TValue> is similar to Java's HashMap, except that it has a bounded Limit. Once it reaches that Limit, each time a new element is added, the least recently used (LRU) entry is removed.
Unlike the Java Lucene implementation, this one is thread safe because it is backed by J2N.Collections.Concurrent.LurchTable<TKey, TValue>. Note that every time an element is read from LruDictionary<TKey, TValue>, a write operation also takes place to update the element's last access time, because the LRU order must be tracked to determine which element to evict when the Limit is exceeded.
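The eviction behavior, and the fact that a read also updates bookkeeping, can be sketched with a small Python model. ToyLruDict is a hypothetical class built on the standard library's OrderedDict; unlike LruDictionary<TKey, TValue>, this sketch is not thread safe.

```python
from collections import OrderedDict


class ToyLruDict:
    """Bounded dictionary that evicts the least recently used entry.

    Hypothetical illustration only; NOT thread safe.
    """

    def __init__(self, limit):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self._entries = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        # A read is also a write: the entry becomes the most recently used.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        if len(self._entries) > self.limit:
            # Drop the least recently used entry (the first one).
            self._entries.popitem(last=False)

    def __contains__(self, key):
        # Membership tests do not count as an access in this sketch.
        return key in self._entries

    def __len__(self):
        return len(self._entries)


cache = ToyLruDict(limit=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now more recently used than "b"
cache.put("c", 3)   # exceeds the limit, so "b" is evicted
print("b" in cache, "a" in cache, "c" in cache)  # False True True
```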
OrdinalsReader
Provides per-document ordinals.
OrdinalsReader.OrdinalsSegmentReader
Returns ordinals for documents in one segment.
ParallelTaxonomyArrays
Returns 3 arrays for traversing the taxonomy:
- Parents: Parents[i] denotes the parent of category ordinal i.
- Children: Children[i] denotes a child of category ordinal i.
- Siblings: Siblings[i] denotes the sibling of category ordinal i.
To traverse the taxonomy tree, you typically start with Children[0] (ordinal 0 is reserved for ROOT), and then, depending on whether you want DFS or BFS, you call Children[Children[0]] or Siblings[Children[0]], respectively, and so forth.
NOTE: you are not expected to modify the values of the arrays, since the arrays are shared with other threads.
This API is experimental.
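The traversal described above can be sketched in Python using plain lists. The sketch assumes that Children[i] holds the most recently added child of i, that Siblings[i] holds the next older sibling of i, and that -1 marks "none"; these conventions and all function names are assumptions for illustration, not a statement of the Lucene.NET contract.

```python
INVALID = -1  # assumed marker for "no such ordinal"


def build_children_and_siblings(parents):
    """Derive children/siblings arrays from a parents array.

    Assumption: children[i] is the youngest (highest-ordinal) child of i,
    and siblings[i] is the next older sibling of i.
    """
    n = len(parents)
    children = [INVALID] * n
    siblings = [INVALID] * n
    for ordinal in range(1, n):  # ordinal 0 is the root and has no parent
        parent = parents[ordinal]
        # The previous youngest child becomes this node's older sibling.
        siblings[ordinal] = children[parent]
        children[parent] = ordinal
    return children, siblings


def depth_first(children, siblings, ordinal=0):
    """Yield ordinals in depth-first order, starting at `ordinal`."""
    yield ordinal
    child = children[ordinal]
    while child != INVALID:
        yield from depth_first(children, siblings, child)
        child = siblings[child]


def direct_children(children, siblings, ordinal):
    """Walk one level only: first child, then its chain of siblings."""
    child = children[ordinal]
    while child != INVALID:
        yield child
        child = siblings[child]


# Book store example: 0=ROOT, 1=Author, 2=Author/Mark Twain,
# 3=Author/J. K. Rowling, 4=Date, 5=Date/2010, 6=Date/2010/March,
# 7=Date/2010/April, 8=Date/2009
parents = [INVALID, 0, 1, 1, 0, 4, 5, 5, 4]
children, siblings = build_children_and_siblings(parents)
dfs_order = list(depth_first(children, siblings))
dimensions = list(direct_children(children, siblings, 0))
print(dfs_order)    # [0, 4, 8, 5, 7, 6, 1, 3, 2]
print(dimensions)   # [4, 1]
```

The functions only read the arrays, in line with the note that the arrays are shared and should not be modified.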
PrintTaxonomyStats
Prints how many ords are under each dimension.
SearcherTaxonomyManager
Manages near-real-time reopen of both a Lucene.Net.Search.IndexSearcher and a TaxonomyReader.
NOTE: If you call ReplaceTaxonomy(Directory) then you must open a new SearcherTaxonomyManager afterwards.
SearcherTaxonomyManager.SearcherAndTaxonomy
Holds a matched pair of Lucene.Net.Search.IndexSearcher and TaxonomyReader.
SingleAssociationFacetField
Add an instance of this to your Lucene.Net.Documents.Document to add a facet label associated with a System.Single. Use TaxonomyFacetSumSingleAssociations to aggregate System.Single values per facet label at search time.
NOTE: This was FloatAssociationFacetField in Lucene
SingleTaxonomyFacets
Base class for all taxonomy-based facets that aggregate to a per-ords float[].
NOTE: This was FloatTaxonomyFacets in Lucene
TaxonomyFacetCounts
Reads from any OrdinalsReader; use FastTaxonomyFacetCounts if you are using the default encoding from BinaryDocValues.
TaxonomyFacets
Base class for all taxonomy-based facet implementations.
TaxonomyFacetSumInt32Associations
Aggregates sum of System.Int32 values previously indexed with Int32AssociationFacetField, assuming the default encoding.
NOTE: This was TaxonomyFacetSumIntAssociations in Lucene
TaxonomyFacetSumSingleAssociations
Aggregates sum of System.Single values previously indexed with SingleAssociationFacetField, assuming the default encoding.
NOTE: This was TaxonomyFacetSumFloatAssociations in Lucene
TaxonomyFacetSumValueSource
Aggregates sum of values from Lucene.Net.Queries.Function.FunctionValues.DoubleVal(System.Int32) and Lucene.Net.Queries.Function.FunctionValues.DoubleVal(System.Int32,System.Double[]), for each facet label.
TaxonomyFacetSumValueSource.ScoreValueSource
Lucene.Net.Queries.Function.ValueSource that returns the score for each hit; use this to aggregate the sum of all hit scores for each facet label.
TaxonomyReader
TaxonomyReader is the read-only interface with which the faceted-search library uses the taxonomy during search time.
A TaxonomyReader holds a list of categories. Each category has a serial number which we call an "ordinal", and a hierarchical "path" name:
- The ordinal is an integer that starts at 0 for the first category (which is always the root category), and grows contiguously as more categories are added. Note that once a category is added, it can never be deleted.
- The path is a CategoryPath object specifying the category's position in the hierarchy.
An implementation must allow multiple readers to be active concurrently with a single writer. Readers follow so-called "point in time" semantics, i.e., a TaxonomyReader object will only see taxonomy entries which were available at the time it was created. What the writer writes is only available to (new) readers after the writer's Commit() is called.
In faceted search, two separate indices are used: the main Lucene index, and the taxonomy. Because the main index refers to the categories listed in the taxonomy, it is important to open the taxonomy after opening the main index, and it is also necessary to Reopen() the taxonomy after Reopen()ing the main index.
This order is important; otherwise, it would be possible for the main index to refer to a category which is not yet visible in the old snapshot of the taxonomy. Note that it is indeed fine for the taxonomy to be opened after the main index, even a long time after. The reason is that once a category is added to the taxonomy, it can never be changed or deleted, so there is no danger of a "too new" taxonomy being inconsistent with an older index.
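The "point in time" semantics can be modeled with a short Python sketch. ToyTaxonomyStore and ToyTaxonomyReader are hypothetical classes that only mimic the behavior described above (append-only categories, visibility after commit, readers fixed at creation time); they are not the Lucene.NET types.

```python
class ToyTaxonomyReader:
    """Read-only view fixed at the moment it was opened (hypothetical)."""

    def __init__(self, snapshot):
        self._snapshot = snapshot  # immutable tuple of paths, indexed by ordinal

    @property
    def size(self):
        return len(self._snapshot)

    def get_path(self, ordinal):
        """Return the path for an ordinal, or None if not visible here."""
        if 0 <= ordinal < len(self._snapshot):
            return self._snapshot[ordinal]
        return None


class ToyTaxonomyStore:
    """Append-only taxonomy with a single writer side (hypothetical)."""

    def __init__(self):
        self._pending = ["<root>"]       # ordinal 0 is the root
        self._committed = ("<root>",)

    def add(self, path):
        """Add a category if missing and return its ordinal."""
        if path not in self._pending:
            self._pending.append(path)
        return self._pending.index(path)

    def commit(self):
        # Only now do the additions become visible to NEW readers.
        self._committed = tuple(self._pending)

    def open_reader(self):
        return ToyTaxonomyReader(self._committed)


store = ToyTaxonomyStore()
author = store.add("Author")
store.commit()
old_reader = store.open_reader()

date = store.add("Date")
store.commit()
new_reader = store.open_reader()

print(old_reader.get_path(date))    # None: the old snapshot predates "Date"
print(new_reader.get_path(date))    # Date
print(new_reader.get_path(author))  # Author: existing ordinals never change
```

A document in the main index that refers to the ordinal of "Date" cannot be resolved through old_reader, which is why the taxonomy must be reopened after the main index. The newer reader still resolves every older ordinal, so opening it late is harmless.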
TaxonomyReader.ChildrenEnumerator
An iterator over a category's children.
Interfaces
ITaxonomyWriter
ITaxonomyWriter is the interface which the faceted-search library uses to dynamically build the taxonomy at indexing time.
Notes about concurrent access to the taxonomy:
An implementation must allow multiple readers and a single writer to be active concurrently. Readers follow so-called "point in time" semantics, i.e., a reader object will only see taxonomy entries which were available at the time it was created. What the writer writes is only available to (new) readers after the writer's Commit() is called.
Faceted search keeps two indices: Lucene's main index and this taxonomy index. When one or more readers are active concurrently with the writer, care must be taken to avoid an inconsistency between the state of these two indices. When writing to the indices, the taxonomy must always be committed to disk before the main index, because the main index refers to categories listed in the taxonomy. Such control can best be achieved by turning off the main index's "autocommit" feature and explicitly calling Commit() for both indices (first for the taxonomy, then for the main index). In old versions of Lucene (2.2 or earlier), where autocommit could not be turned off, a more complicated solution is needed: for example, some sort of (possibly inter-process) locking to ensure that a reader is opened only right after both indices have been flushed (and before anything else is written to them).
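The effect of the commit order can be shown with a toy Python model. ToyStore, dangling_ordinals, and commit_both are hypothetical helpers invented for this sketch; they only model "pending" versus "committed" state and do not represent the Lucene.NET writers.

```python
class ToyStore:
    """Holds pending and committed state for one index (hypothetical)."""

    def __init__(self, initial=()):
        self.pending = list(initial)
        self.committed = list(initial)

    def commit(self):
        self.committed = list(self.pending)


def dangling_ordinals(main_index, taxonomy):
    """Ordinals used by committed documents but missing from the committed taxonomy."""
    known = len(taxonomy.committed)
    return sorted({o for doc in main_index.committed for o in doc if o >= known})


def commit_both(taxonomy, main_index):
    """Commit in the safe order: taxonomy first, then the main index."""
    taxonomy.commit()
    main_index.commit()


taxonomy = ToyStore(["<root>"])   # ordinal 0 is the root
main_index = ToyStore()           # each document is a list of ordinals

# Unsafe order: the main index is committed before the taxonomy.
taxonomy.pending.append("Author")  # ordinal 1
main_index.pending.append([1])
main_index.commit()
unsafe_gap = dangling_ordinals(main_index, taxonomy)
taxonomy.commit()

# Safe order: the taxonomy is committed first.
taxonomy.pending.append("Date")    # ordinal 2
main_index.pending.append([2])
taxonomy.commit()
safe_gap = dangling_ordinals(main_index, taxonomy)
main_index.commit()
after_both = dangling_ordinals(main_index, taxonomy)

print(unsafe_gap, safe_gap, after_both)  # [1] [] []
```

A reader opened during the unsafe window would find a committed document pointing at an ordinal the committed taxonomy does not contain yet. With the taxonomy committed first, the worst case is a taxonomy entry that no committed document uses, which is harmless.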