Namespace Lucene.Net.Facet.Taxonomy.WriterCache
Improves indexing time by caching the mapping from each category path to its ordinal.
Classes
Cl2oTaxonomyWriterCache
ITaxonomyWriterCache using CompactLabelToOrdinal. Although it is called a cache, it keeps all the mappings from category to ordinal in memory, relying on CompactLabelToOrdinal being an efficient mapping for this purpose.
CollisionMap
HashMap to store colliding labels. See CompactLabelToOrdinal for details.
CompactLabelToOrdinal
This is a very efficient LabelToOrdinal implementation that uses a CharBlockArray to store all labels and a configurable number of HashArrays to reference the labels.
Since the HashArrays don't handle collisions, a CollisionMap is used to store the colliding labels.
This data structure grows by adding a new HashArray whenever the number of collisions in the CollisionMap exceeds loadFactor * GetMaxOrdinal(). Growing also includes reinserting all colliding labels into the HashArrays to possibly reduce the number of collisions.
To set the loadFactor, see CompactLabelToOrdinal(Int32, Single, Int32).
This data structure has a much lower memory footprint (~30%) compared to a Java HashMap<String, Integer>. It also uses only a small fraction of the objects a HashMap would use, which limits GC overhead. In a test with 3M unique labels, ingestion was also ~50% faster than with a HashMap.
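The growth rule described above can be illustrated with a small, language-neutral sketch in Python. It is not the Lucene.Net implementation: the class name CompactLabelMapSketch, the CRC32 hash, the slot layout, and the choice to double the size of each new hash array are all assumptions made for illustration. It also assumes each label is added only once. It shows only the general idea: one shared character buffer, fixed-size hash arrays that hold a single entry per slot, a collision map for labels that find no free slot, and growth once the collision count exceeds load_factor * max_ordinal.

```python
import zlib


class CompactLabelMapSketch:
    """Toy model of a label -> ordinal map (hypothetical, not the library class)."""

    def __init__(self, initial_capacity=16, load_factor=0.15, num_hash_arrays=2):
        self.load_factor = load_factor
        self.chars = []       # one shared buffer holding the characters of placed labels
        # A slot is either None or a tuple (offset, length, ordinal).
        self.hash_arrays = [[None] * initial_capacity for _ in range(num_hash_arrays)]
        self.collisions = {}  # label -> ordinal for labels that found no free slot
        self.max_ordinal = 0

    @staticmethod
    def _hash(label):
        # Assumption: any stable hash works for the sketch; CRC32 is used here.
        return zlib.crc32(label.encode("utf-8"))

    def _try_place(self, label, ordinal):
        """Store the label in the first hash array whose slot is free."""
        h = self._hash(label)
        for array in self.hash_arrays:
            slot = h % len(array)
            if array[slot] is None:
                offset = len(self.chars)
                self.chars.extend(label)
                array[slot] = (offset, len(label), ordinal)
                return True
        return False

    def _grow(self):
        """Add a larger hash array, then retry every colliding label."""
        self.hash_arrays.append([None] * (2 * len(self.hash_arrays[-1])))
        pending, self.collisions = self.collisions, {}
        for label, ordinal in pending.items():
            if not self._try_place(label, ordinal):
                self.collisions[label] = ordinal

    def add(self, label, ordinal):
        """Add a label that is not yet present."""
        self.max_ordinal = max(self.max_ordinal, ordinal)
        if not self._try_place(label, ordinal):
            self.collisions[label] = ordinal
            if len(self.collisions) > self.load_factor * self.max_ordinal:
                self._grow()

    def get(self, label):
        """Return the ordinal for the label, or -1 if it is unknown."""
        h = self._hash(label)
        for array in self.hash_arrays:
            entry = array[h % len(array)]
            if entry is not None:
                offset, length, ordinal = entry
                stored = "".join(self.chars[offset:offset + length])
                if stored == label:
                    return ordinal
        return self.collisions.get(label, -1)
```

In this sketch a lookup touches at most one slot per hash array plus the collision map, so keeping the collision map small (by growing) is what keeps lookups cheap.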
LabelToOrdinal
Abstract class for storing Label->Ordinal mappings in a taxonomy.
LruTaxonomyWriterCache
LRU ITaxonomyWriterCache - good choice for huge taxonomies.
NameHashInt32CacheLRU
An LRU cache mapping names to ints. Used to cache the ordinals of category paths. It uses a hash of the path as the key, instead of the path itself. This way the cache takes less RAM, but correctness depends on the assumption that no hash collisions occur.
NOTE: this was NameHashIntCacheLRU in Lucene
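The trade-off between keying by hash and keying by full name can be sketched as follows. This is an illustrative Python model, not the Lucene.Net classes: LruOrdinalCacheSketch, key_fn, and weak_hash are hypothetical names. The deliberately weak hash is chosen only to force a collision; a real hash-keyed cache would presumably use a much stronger hash, which makes collisions rare but not impossible.

```python
from collections import OrderedDict


def weak_hash(name):
    # Deliberately poor hash (assumption for the demo): anagrams always collide.
    return sum(map(ord, name)) % 16


class LruOrdinalCacheSketch:
    """Toy LRU cache from name to ordinal; key_fn decides what is stored as the key."""

    def __init__(self, max_size, key_fn=lambda name: name):
        self.max_size = max_size
        self.key_fn = key_fn
        self.entries = OrderedDict()  # least recently used entry comes first

    def get(self, name):
        key = self.key_fn(name)
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, name, ordinal):
        key = self.key_fn(name)
        self.entries[key] = ordinal
        self.entries.move_to_end(key)
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # evict the least recently used entry


by_hash = LruOrdinalCacheSketch(max_size=100, key_fn=weak_hash)
by_name = LruOrdinalCacheSketch(max_size=100)
for cache in (by_hash, by_name):
    cache.put("ab", 7)

wrong = by_hash.get("ba")  # 7: "ba" was never stored, but it hashes like "ab"
right = by_name.get("ba")  # None: full names cannot be confused with each other
```

The hash-keyed variant stores only small integers as keys, which is where the RAM saving comes from; the name-keyed variant pays for the full strings but can never return the ordinal of a different category.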
NameInt32CacheLRU
An LRU cache mapping names to ints. Used to cache the ordinals of category paths.
NOTE: This was NameIntCacheLRU in Lucene
Interfaces
ITaxonomyWriterCache
ITaxonomyWriterCache is a relatively simple interface for a cache of category->ordinal mappings, used in ITaxonomyWriter implementations (such as DirectoryTaxonomyWriter).
Its main methods are Put(FacetLabel, Int32), which adds a mapping, and Get(FacetLabel), which looks a mapping up in the cache. The cache does not guarantee to hold everything that has been put into it, and may selectively delete some mappings (e.g., the least recently used ones). Consequently, a negative result from Get(FacetLabel) does not necessarily mean that the category doesn't exist, only that it is not in the cache. The caller can infer that the category doesn't exist only if it knows the cache to be complete (because all the categories were loaded into the cache, and since then no Put(FacetLabel, Int32) returned true).
However, if a cache does delete mappings, it should clear out large parts of the cache at once, because the user will typically need to work hard to recover from every cache cleanup (see the return value of Put(FacetLabel, Int32)).
NOTE: the cache may be accessed concurrently by multiple threads, therefore cache implementations should take this into consideration.
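The contract described above can be sketched from the caller's side. The Python model below is hypothetical and single-threaded (it ignores the concurrency note): BulkEvictingCacheSketch, WriterSketch, and the cache_complete flag are illustrative names, not Lucene.Net APIs. It assumes, as the description implies, that put returns true when the cache had to discard mappings, and that get returns a negative number on a miss.

```python
class BulkEvictingCacheSketch:
    """Hypothetical cache: when full, it drops the oldest two thirds in one sweep."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = {}  # dicts keep insertion order in Python 3.7+

    def get(self, label):
        return self.entries.get(label, -1)  # negative means "not cached"

    def put(self, label, ordinal):
        evicted = False
        if label not in self.entries and len(self.entries) >= self.max_size:
            drop = len(self.entries) - self.max_size // 3
            for old in list(self.entries)[:drop]:
                del self.entries[old]
            evicted = True
        self.entries[label] = ordinal
        return evicted  # True tells the caller that mappings were discarded


class WriterSketch:
    """Hypothetical caller that owns the authoritative label -> ordinal store."""

    def __init__(self, cache):
        self.cache = cache
        self.store = {}              # stands in for the slow, authoritative taxonomy
        self.cache_complete = True   # valid because store and cache both start empty
        self.store_lookups = 0

    def add_category(self, label):
        ordinal = self.cache.get(label)
        if ordinal >= 0:
            return ordinal
        if not self.cache_complete:
            # A miss proves nothing here, so consult the authoritative store.
            self.store_lookups += 1
            ordinal = self.store.get(label, -1)
        if ordinal < 0:
            # Either the cache is complete or the store has no entry: new category.
            ordinal = len(self.store)
            self.store[label] = ordinal
        if self.cache.put(label, ordinal):
            self.cache_complete = False  # from now on, misses are inconclusive
        return ordinal
```

While cache_complete holds, a miss alone is enough to conclude that the category is new, which avoids the slow lookup entirely. After the first eviction every miss costs a store lookup, which is why clearing a large part of the cache in one sweep is preferable to evicting often.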
Enums
LruTaxonomyWriterCache.LRUType
Determines the cache type. For guaranteed correctness, that is, without relying on the hash function being collision-free, use LRU_STRING.