Namespace Lucene.Net.Facet.Taxonomy
Taxonomy of Categories
Facets are defined using a hierarchy of categories, known as a Taxonomy. For example, the taxonomy of a book store application might have the following structure:
- Author
  - Mark Twain
  - J. K. Rowling
- Date
  - 2010
    - March
    - April
  - 2009
The Taxonomy translates category-paths into integer identifiers (often termed ordinals) and vice versa.
The category Author/Mark Twain adds two nodes to the taxonomy: Author and Author/Mark Twain, each of which is assigned a different ordinal. The taxonomy maintains the invariant that a node always has an ordinal smaller than those of all its children.
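The ordinal assignment described above can be illustrated with a small toy model. This is a sketch in Python, not the Lucene.NET implementation: the class `ToyTaxonomy` and its methods are hypothetical names invented for this example, and the only behavior taken from the text is that missing ancestors are added first, so a parent always receives a smaller ordinal than its children.

```python
class ToyTaxonomy:
    """Toy model of the path <-> ordinal mapping (not the Lucene.NET classes)."""

    def __init__(self):
        # Ordinal 0 is reserved for the root, modeled here as the empty path.
        self.paths = [()]
        self.ordinals = {(): 0}
        self.parents = [-1]  # the root has no parent

    def add_category(self, *components):
        """Add a path and any missing ancestors; return the path's ordinal."""
        path = tuple(components)
        if path in self.ordinals:
            return self.ordinals[path]
        # Add the parent first, so it always gets the smaller ordinal.
        parent = self.add_category(*path[:-1])
        ordinal = len(self.paths)
        self.paths.append(path)
        self.ordinals[path] = ordinal
        self.parents.append(parent)
        return ordinal

    def get_path(self, ordinal):
        """Translate an ordinal back into its category path."""
        return "/".join(self.paths[ordinal])


taxo = ToyTaxonomy()
twain = taxo.add_category("Author", "Mark Twain")  # adds Author, then Author/Mark Twain
author = taxo.add_category("Author")               # already present, returns its ordinal
```

In this model, adding Author/Mark Twain to an empty taxonomy yields ordinal 1 for Author and ordinal 2 for Author/Mark Twain.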
Classes
AssociationFacetField
Add an instance of this to your Lucene.Net.Documents.Document to add a facet label associated with an arbitrary byte[]. This will require a custom Facets implementation at search time; see Int32AssociationFacetField and SingleAssociationFacetField to use existing Facets implementations.
Note
This API is experimental and might change in incompatible ways in the next release.
CachedOrdinalsReader
A per-segment cache of documents' facet ordinals. Every CachedOrdinalsReader.CachedOrds holds the ordinals in a raw int[], and therefore consumes as much RAM as the total number of ordinals found in the segment, but saves the CPU cost of decoding ordinals during facet counting.
NOTE: every CachedOrdinalsReader.CachedOrds is limited to 2.1B total ordinals. If that is a limitation for you, consider limiting the segment size to fewer documents, or using an alternative cache which pages through the category ordinals.
NOTE: when using this cache, it is advised to use a Lucene.Net.Codecs.DocValuesFormat that does not cache the data in memory, at least for the category lists fields, or otherwise you'll be doing double-caching.
NOTE: create one instance of this and re-use it for all facet implementations (the cache is per-instance, not static).
CachedOrdinalsReader.CachedOrds
Holds the cached ordinals in two parallel int[] arrays.
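As an illustration of what "two parallel int[] arrays" can look like, the sketch below packs per-document ordinal lists into a flat ordinals list plus an offsets list that marks where each document's ordinals start and end. This layout is an assumption made for the example (the text above only states that two arrays are used), and the function names are hypothetical.

```python
def build_cached_ords(per_doc_ordinals):
    """Pack per-document ordinal lists into two flat lists: (offsets, ordinals)."""
    offsets = [0]
    ordinals = []
    for doc_ords in per_doc_ordinals:
        ordinals.extend(doc_ords)
        # offsets[doc + 1] marks the end of this document's ordinals.
        offsets.append(len(ordinals))
    return offsets, ordinals


def ordinals_for_doc(offsets, ordinals, doc_id):
    """Return one document's ordinals by slicing; no decoding step is needed."""
    return ordinals[offsets[doc_id]:offsets[doc_id + 1]]


# Three documents: doc 0 has two ordinals, doc 1 has none, doc 2 has three.
offsets, ordinals = build_cached_ords([[1, 2], [], [1, 4, 5]])
doc2 = ordinals_for_doc(offsets, ordinals, 2)
```

The memory cost is one integer per indexed ordinal plus one per document, which matches the trade-off described for CachedOrdinalsReader: more RAM in exchange for skipping the decode step.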
CategoryPath
Holds a sequence of string components, specifying the hierarchical name of a category.
Note
This API is experimental and might change in incompatible ways in the next release.
DocValuesOrdinalsReader
Decodes ordinals previously indexed into a Lucene.Net.Index.BinaryDocValues field.
FacetLabel
Holds a sequence of string components, specifying the hierarchical name of a category.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
FastTaxonomyFacetCounts
Computes facet counts, assuming the default encoding into DocValues was used.
Note
This API is experimental and might change in incompatible ways in the next release.
Int32AssociationFacetField
Add an instance of this to your Lucene.Net.Documents.Document to add a facet label associated with an int. Use TaxonomyFacetSumInt32Associations to aggregate int values per facet label at search time.
NOTE: This was IntAssociationFacetField in Lucene
Note
This API is experimental and might change in incompatible ways in the next release.
Int32TaxonomyFacets
Base class for all taxonomy-based facets that aggregate to a per-ords int[].
NOTE: This was IntTaxonomyFacets in Lucene
LruDictionary<TKey, TValue>
LruDictionary<TKey, TValue> is similar to Java's HashMap, but has a bounded Limit: once it reaches that Limit, each time a new element is added, the least recently used (LRU) entry is removed.
Unlike the Java Lucene implementation, this one is thread safe because it is backed by the J2N.Collections.Concurrent.LurchTable<TKey, TValue>. Do note that every time an element is read from LruDictionary<TKey, TValue>, a write operation also takes place to update the element's last access time. This is because the LRU order needs to be remembered to determine which element to evict when the Limit is exceeded.
Note
This API is experimental and might change in incompatible ways in the next release.
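The eviction behavior described for LruDictionary<TKey, TValue> can be sketched with a small, single-threaded toy in Python. `ToyLruDictionary` is a hypothetical name; unlike the real class, which is described above as thread safe, this sketch does no locking. It only illustrates two points from the description: a read also updates recency, and adding beyond the limit evicts the least recently used entry.

```python
from collections import OrderedDict


class ToyLruDictionary:
    """Single-threaded toy LRU map with a fixed limit (illustration only)."""

    def __init__(self, limit):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self._items = OrderedDict()  # oldest entry first, most recent last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # A read is also a write: it marks the key as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.limit:
            # Evict the least recently used entry (the first one).
            self._items.popitem(last=False)

    def __contains__(self, key):
        # Membership checks do not change the recency order in this sketch.
        return key in self._items

    def __len__(self):
        return len(self._items)


cache = ToyLruDictionary(limit=2)
cache.put("Author", 1)
cache.put("Date", 4)
cache.get("Author")      # refreshes "Author", so "Date" is now the oldest
cache.put("Publisher", 7)  # exceeds the limit and evicts "Date"
```

Because every read reorders the entries, a thread-safe version has to synchronize reads as well as writes, which is the cost the description above points out.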
OrdinalsReader
Provides per-document ordinals.
OrdinalsReader.OrdinalsSegmentReader
Returns ordinals for documents in one segment.
ParallelTaxonomyArrays
Returns 3 arrays for traversing the taxonomy:
- Parents: Parents[i] denotes the parent of category ordinal i.
- Children: Children[i] denotes a child of category ordinal i.
- Siblings: Siblings[i] denotes the sibling of category ordinal i.
To traverse the taxonomy tree, you typically start with Children[0] (ordinal 0 is reserved for ROOT). Then, depending on whether you want DFS or BFS, you continue with Children[Children[0]] or Siblings[Children[0]], respectively, and so forth.
NOTE: you are not expected to modify the values of the arrays, since the arrays are shared with other threads.
Note
This API is experimental and might change in incompatible ways in the next release.
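The traversal pattern for the three arrays can be sketched as follows. This is a toy model in Python rather than the Lucene.NET API, and it rests on stated assumptions: `-1` marks "no such node", Children[i] records the most recently added child of i, and Siblings[i] points to the previously added child of the same parent. The function names are hypothetical.

```python
INVALID = -1  # assumed "no such node" marker for this sketch


def build_arrays(parents):
    """Derive children/siblings lists from a parents list.

    Relies on parents[i] < i for every non-root ordinal i.
    """
    children = [INVALID] * len(parents)
    siblings = [INVALID] * len(parents)
    for ordinal in range(1, len(parents)):
        parent = parents[ordinal]
        # The child recorded so far becomes this node's sibling,
        # and this node becomes the parent's recorded child.
        siblings[ordinal] = children[parent]
        children[parent] = ordinal
    return children, siblings


def depth_first(children, siblings, ordinal=0):
    """Yield every ordinal below `ordinal`, descending before moving sideways."""
    child = children[ordinal]
    while child != INVALID:
        yield child
        yield from depth_first(children, siblings, child)
        child = siblings[child]


def direct_children(children, siblings, ordinal):
    """Return the direct children of `ordinal` (one level of a BFS)."""
    result = []
    child = children[ordinal]
    while child != INVALID:
        result.append(child)
        child = siblings[child]
    return result


# 0=ROOT, 1=Author, 2=Author/Mark Twain, 3=Author/J. K. Rowling,
# 4=Date, 5=Date/2010, 6=Date/2010/March
parents = [INVALID, 0, 1, 1, 0, 4, 5]
children, siblings = build_arrays(parents)
top_level = direct_children(children, siblings, 0)
dfs_order = list(depth_first(children, siblings))
```

None of these functions modifies the arrays once they are built, in line with the note above that the arrays are shared with other threads.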
PrintTaxonomyStats
Prints how many ords are under each dimension.
SearcherTaxonomyManager
Manages near-real-time reopen of both a Lucene.Net.Search.IndexSearcher and a TaxonomyReader.
NOTE: If you call ReplaceTaxonomy(Directory) then you must open a new SearcherTaxonomyManager afterwards.
SearcherTaxonomyManager.SearcherAndTaxonomy
Holds a matched pair of Lucene.Net.Search.IndexSearcher and TaxonomyReader.
SingleAssociationFacetField
Add an instance of this to your Lucene.Net.Documents.Document to add a facet label associated with a float. Use TaxonomyFacetSumSingleAssociations to aggregate float values per facet label at search time.
NOTE: This was FloatAssociationFacetField in Lucene
Note
This API is experimental and might change in incompatible ways in the next release.
SingleTaxonomyFacets
Base class for all taxonomy-based facets that aggregate to a per-ords float[].
NOTE: This was FloatTaxonomyFacets in Lucene
TaxonomyFacetCounts
Reads from any OrdinalsReader; use FastTaxonomyFacetCounts if you are using the default encoding from Lucene.Net.Index.BinaryDocValues.
Note
This API is experimental and might change in incompatible ways in the next release.
TaxonomyFacetSumInt32Associations
Aggregates sum of int values previously indexed with Int32AssociationFacetField, assuming the default encoding.
NOTE: This was TaxonomyFacetSumIntAssociations in Lucene
Note
This API is experimental and might change in incompatible ways in the next release.
TaxonomyFacetSumSingleAssociations
Aggregates sum of float values previously indexed with SingleAssociationFacetField, assuming the default encoding.
NOTE: This was TaxonomyFacetSumFloatAssociations in Lucene
Note
This API is experimental and might change in incompatible ways in the next release.
TaxonomyFacetSumValueSource
Aggregates sum of values from DoubleVal(int) and DoubleVal(int, double[]), for each facet label.
Note
This API is experimental and might change in incompatible ways in the next release.
TaxonomyFacetSumValueSource.ScoreValueSource
Lucene.Net.Queries.Function.ValueSource that returns the score for each hit; use this to aggregate the sum of all hit scores for each facet label.
TaxonomyFacets
Base class for all taxonomy-based facet implementations.
TaxonomyReader
TaxonomyReader is the read-only interface with which the faceted-search library uses the taxonomy during search time.
A TaxonomyReader holds a list of categories. Each category has a serial number which we call an "ordinal", and a hierarchical "path" name:
- The ordinal is an integer that starts at 0 for the first category (which is always the root category), and grows contiguously as more categories are added. Note that once a category is added, it can never be deleted.
- The path is a CategoryPath object specifying the category's position in the hierarchy.
An implementation must allow multiple readers to be active concurrently with a single writer. Readers follow so-called "point in time" semantics, i.e., a TaxonomyReader object will only see taxonomy entries which were available at the time it was created. What the writer writes is only available to (new) readers after the writer's Lucene.Net.Index.IndexWriter.Commit() is called.
In faceted search, two separate indices are used: the main Lucene index, and the taxonomy. Because the main index refers to the categories listed in the taxonomy, it is important to open the taxonomy *after* opening the main index, and it is also necessary to Reopen() the taxonomy after Reopen()ing the main index.
This order is important: otherwise, the main index could refer to a category that is not yet visible in the old snapshot of the taxonomy. Note that it is fine for the taxonomy to be opened after the main index, even a long time after. The reason is that once a category is added to the taxonomy, it can never be changed or deleted, so there is no danger of a "too new" taxonomy being inconsistent with an older index.
Note
This API is experimental and might change in incompatible ways in the next release.
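The "point in time" semantics described for TaxonomyReader can be illustrated with a toy writer and reader pair. This is a sketch in Python with hypothetical class names, not the Lucene.NET API: a reader takes a snapshot of the categories that were committed when it was opened, and never sees anything added or committed later.

```python
class ToyTaxonomyReader:
    """Immutable snapshot of the categories committed when it was opened."""

    def __init__(self, paths):
        self._paths = tuple(paths)

    @property
    def size(self):
        return len(self._paths)

    def get_path(self, ordinal):
        """Return the path for an ordinal, or None if it is not visible."""
        if 0 <= ordinal < len(self._paths):
            return self._paths[ordinal]
        return None


class ToyTaxonomyWriter:
    """Append-only category list with an explicit commit point."""

    def __init__(self):
        self._paths = ["<root>"]  # ordinal 0 is the root
        self._committed = 1       # how many categories new readers may see

    def add_category(self, path):
        if path in self._paths:
            return self._paths.index(path)
        self._paths.append(path)
        return len(self._paths) - 1

    def commit(self):
        self._committed = len(self._paths)

    def open_reader(self):
        return ToyTaxonomyReader(self._paths[:self._committed])


writer = ToyTaxonomyWriter()
writer.add_category("Author")
writer.commit()
old_reader = writer.open_reader()           # sees <root> and Author

date = writer.add_category("Date")          # not yet committed
early_reader = writer.open_reader()         # still cannot see Date
writer.commit()
new_reader = writer.open_reader()           # opened after the commit, sees Date
```

Since categories are only ever appended, a snapshot taken later is always a superset of an earlier one, which is why a taxonomy opened after the main index can resolve every ordinal the index refers to.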
TaxonomyReader.ChildrenEnumerator
An iterator over a category's children.
Interfaces
ITaxonomyWriter
ITaxonomyWriter is the interface which the faceted-search library uses to dynamically build the taxonomy at indexing time.
Notes about concurrent access to the taxonomy:
An implementation must allow multiple readers and a single writer to be active concurrently. Readers follow so-called "point in time" semantics, i.e., a reader object will only see taxonomy entries which were available at the time it was created. What the writer writes is only available to (new) readers after the writer's Lucene.Net.Index.IndexWriter.Commit() is called.
Faceted search keeps two indices: Lucene's main index and this taxonomy index. When one or more readers are active concurrently with the writer, care must be taken to avoid an inconsistency between the state of these two indices: when writing to the indices, the taxonomy must always be committed to disk *before* the main index, because the main index refers to categories listed in the taxonomy. Such control can best be achieved by turning off the main index's "autocommit" feature, and explicitly calling Lucene.Net.Index.IndexWriter.Commit() for both indices (first for the taxonomy, then for the main index). In old versions of Lucene (2.2 or earlier), when autocommit could not be turned off, a more complicated solution was needed: for example, some sort of (possibly inter-process) locking to ensure that a reader is opened only right after both indices have been flushed (and before anything else is written to them).
Note
This API is experimental and might change in incompatible ways in the next release.
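The commit-ordering rule above can be checked with a toy model. This is a sketch in Python with hypothetical names (`ToyStore`, `index_document`, `is_consistent`), not the Lucene.NET API: each store has a pending view and a committed view, and the committed state is consistent only if every ordinal in a committed document refers to a committed category.

```python
class ToyStore:
    """A list with separate pending and committed views."""

    def __init__(self, initial=()):
        self.pending = list(initial)
        self.committed = list(initial)

    def commit(self):
        self.committed = list(self.pending)


def index_document(taxonomy, main_index, category):
    """Add the category if needed, then add a document that refers to its ordinal."""
    if category not in taxonomy.pending:
        taxonomy.pending.append(category)
    main_index.pending.append([taxonomy.pending.index(category)])


def is_consistent(taxonomy, main_index):
    """True if every ordinal in the committed documents is a committed category."""
    visible = len(taxonomy.committed)
    return all(ordinal < visible
               for doc in main_index.committed
               for ordinal in doc)


# Recommended order: taxonomy first, then the main index.
taxonomy, main_index = ToyStore(["<root>"]), ToyStore()
index_document(taxonomy, main_index, "Author")
taxonomy.commit()
after_taxonomy_commit = is_consistent(taxonomy, main_index)
main_index.commit()
after_both_commits = is_consistent(taxonomy, main_index)

# Reversed order: a reader opened between the two commits sees a dangling ordinal.
taxonomy2, main_index2 = ToyStore(["<root>"]), ToyStore()
index_document(taxonomy2, main_index2, "Author")
main_index2.commit()
wrong_order_gap = is_consistent(taxonomy2, main_index2)
```

With the taxonomy committed first, a reader opened at any moment sees a consistent pair; with the order reversed, there is a window in which a committed document refers to a category that no reader can resolve yet.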