Namespace Lucene.Net.Facet.Taxonomy.WriterCache

Improves indexing time by caching the mapping from category paths to their ordinals.

Classes

Cl2oTaxonomyWriterCache

ITaxonomyWriterCache using CompactLabelToOrdinal. Although called a cache, it keeps every category-to-ordinal mapping in memory, relying on CompactLabelToOrdinal being an efficient mapping for this purpose.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

CollisionMap

HashMap to store colliding labels. See CompactLabelToOrdinal for details.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

CompactLabelToOrdinal

This is a very efficient LabelToOrdinal implementation that uses a CharBlockArray to store all labels and a configurable number of HashArrays to reference the labels.

Since the HashArrays don't handle collisions, a CollisionMap is used to store the colliding labels.

This data structure grows by adding a new HashArray whenever the number of collisions in the CollisionMap exceeds loadFactor * GetMaxOrdinal(). Growing also includes reinserting all colliding labels into the HashArrays to possibly reduce the number of collisions.

To set the loadFactor, see CompactLabelToOrdinal(Int32, Single, Int32).

This data structure has a much lower memory footprint (~30%) compared to a Java HashMap<String, Integer>. It also uses only a small fraction of the objects a HashMap would use, which limits GC overhead. Ingestion was also ~50% faster than with a HashMap for 3M unique labels.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk
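
The growth rule described for CompactLabelToOrdinal can be illustrated with a small standalone sketch. It is not the Lucene.NET implementation. The class name CompactMapSketch and all of its members are hypothetical. A list of strings stands in for the CharBlockArray, plain int arrays stand in for the HashArrays, and the sketch assigns ordinals itself and uses the label count as a stand-in for GetMaxOrdinal().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: hash arrays without collision handling plus an overflow map. */
class CompactMapSketch {
    static final int MISSING = -1;

    // Stand-in for the label storage; the list index is the ordinal.
    private final List<String> labels = new ArrayList<>();
    // Each array maps slot -> ordinal (or MISSING). A slot holds at most one label.
    private final List<int[]> hashArrays = new ArrayList<>();
    // Labels that found no free slot in any hash array.
    private final Map<String, Integer> collisions = new HashMap<>();
    private final float loadFactor;

    CompactMapSketch(int initialSlots, float loadFactor) {
        this.loadFactor = loadFactor;
        hashArrays.add(newArray(initialSlots));
    }

    /** Adds a label that is not yet present and returns its new ordinal. */
    int add(String label) {
        int ordinal = labels.size();
        labels.add(label);
        if (!tryStore(label, ordinal)) {
            collisions.put(label, ordinal);
            // Assumption: the label count approximates the maximum ordinal.
            if (collisions.size() > loadFactor * labels.size()) {
                grow();
            }
        }
        return ordinal;
    }

    /** Returns the ordinal of the label, or MISSING if it was never added. */
    int get(String label) {
        for (int[] array : hashArrays) {
            int ordinal = array[slot(label, array.length)];
            if (ordinal != MISSING && labels.get(ordinal).equals(label)) {
                return ordinal;
            }
        }
        Integer ordinal = collisions.get(label);
        return ordinal == null ? MISSING : ordinal;
    }

    int hashArrayCount() { return hashArrays.size(); }

    int collisionCount() { return collisions.size(); }

    /** Stores the ordinal in the first hash array whose slot is free. */
    private boolean tryStore(String label, int ordinal) {
        for (int[] array : hashArrays) {
            int s = slot(label, array.length);
            if (array[s] == MISSING) {
                array[s] = ordinal;
                return true;
            }
        }
        return false;
    }

    /** Adds a larger hash array and reinserts every colliding label. */
    private void grow() {
        int lastSize = hashArrays.get(hashArrays.size() - 1).length;
        hashArrays.add(newArray(lastSize * 2));
        Map<String, Integer> pending = new HashMap<>(collisions);
        collisions.clear();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (!tryStore(e.getKey(), e.getValue())) {
                collisions.put(e.getKey(), e.getValue());
            }
        }
    }

    private static int[] newArray(int size) {
        int[] array = new int[size];
        Arrays.fill(array, MISSING);
        return array;
    }

    private static int slot(String label, int length) {
        return Math.floorMod(label.hashCode(), length);
    }
}
```

In this sketch a lookup probes one slot per hash array and falls back to the collision map only when none of them matches, so the overflow map stays small as long as growth keeps pace with insertions.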

LabelToOrdinal

Abstract class for storing Label->Ordinal mappings in a taxonomy.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

LruTaxonomyWriterCache

LRU ITaxonomyWriterCache - good choice for huge taxonomies.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

NameHashInt32CacheLRU

An LRU cache that maps names to ints. Used to cache the ordinals of category paths. It uses a hash of the path as the key instead of the path itself. This way the cache takes less RAM, but correctness depends on the assumption that no hash collisions occur.

NOTE: this was NameHashIntCacheLRU in Lucene

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

NameInt32CacheLRU

An LRU cache that maps names to ints. Used to cache the ordinals of category paths.

NOTE: This was NameIntCacheLRU in Lucene

This is a Lucene.NET EXPERIMENTAL API, use at your own risk
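
The trade-off between the two LRU caches above can be sketched in a few lines. This is a standalone illustration and not the Lucene.NET code. All class names here are hypothetical, and the 32-bit String.hashCode() is an assumption chosen only because it makes a collision easy to reproduce ("Aa" and "BB" share a hash code in Java). The real hash function may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: a bounded LRU map from a key to an ordinal. */
class LruOrdinalCacheSketch<K> {
    private final LinkedHashMap<K, Integer> map;

    LruOrdinalCacheSketch(final int maxSize) {
        // accessOrder = true keeps the least recently used entry first.
        this.map = new LinkedHashMap<K, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Integer> eldest) {
                return size() > maxSize;
            }
        };
    }

    void put(K key, int ordinal) {
        map.put(key, ordinal);
    }

    /** Returns the cached ordinal, or -1 on a miss. */
    int get(K key) {
        Integer ordinal = map.get(key);
        return ordinal == null ? -1 : ordinal;
    }
}

/** Keys are the full path strings: more memory, no false hits. */
class NameKeyedCacheSketch {
    private final LruOrdinalCacheSketch<String> cache;

    NameKeyedCacheSketch(int maxSize) {
        cache = new LruOrdinalCacheSketch<>(maxSize);
    }

    void put(String path, int ordinal) { cache.put(path, ordinal); }

    int get(String path) { return cache.get(path); }
}

/** Keys are hashes of the path: less memory, but colliding paths share an entry. */
class HashKeyedCacheSketch {
    private final LruOrdinalCacheSketch<Integer> cache;

    HashKeyedCacheSketch(int maxSize) {
        cache = new LruOrdinalCacheSketch<>(maxSize);
    }

    void put(String path, int ordinal) { cache.put(path.hashCode(), ordinal); }

    int get(String path) { return cache.get(path.hashCode()); }
}
```

After putting "Aa" into the hash-keyed sketch, a lookup of "BB" returns the ordinal stored for "Aa", while the name-keyed sketch reports a miss. That is the kind of silent error the name-keyed variant rules out at the cost of storing the paths themselves.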

Interfaces

ITaxonomyWriterCache

ITaxonomyWriterCache is a relatively simple interface for a cache of category->ordinal mappings, used in ITaxonomyWriter implementations (such as DirectoryTaxonomyWriter).

It basically has a Put(FacetLabel, Int32) method for adding a mapping, and Get(FacetLabel) for looking up a mapping in the cache. The cache does not guarantee to hold everything that has been put into it, and might in fact selectively delete some of the mappings (e.g., the ones least recently used). This means that if Get(FacetLabel) returns a negative response, it does not necessarily mean that the category doesn't exist, only that it is not in the cache. The caller can infer that the category doesn't exist only if it knows the cache to be complete (because all the categories were loaded into the cache, and since then no Put(FacetLabel, Int32) returned true).

If an implementation does delete mappings, it should clear out large parts of the cache at once, because the user will typically need to work hard to recover from every cache cleanup (see Put(FacetLabel, Int32)'s return value).

NOTE: the cache may be accessed concurrently by multiple threads, therefore cache implementations should take this into consideration.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk
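
The contract described for ITaxonomyWriterCache can be sketched from the caller's side. This is a standalone, single-threaded illustration and not the Lucene.NET API. The interface OrdinalCacheSketch, both classes, and the in-memory map that stands in for the slower authoritative taxonomy store are all hypothetical.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the cache contract described above. */
interface OrdinalCacheSketch {
    /** Returns the ordinal, or a negative value if the category is not cached. */
    int get(String category);

    /** Returns true if mappings were dropped to make room. */
    boolean put(String category, int ordinal);
}

/** Drops the older half of its entries in one sweep when it overflows. */
class HalfClearingCacheSketch implements OrdinalCacheSketch {
    private final int maxSize;
    private final LinkedHashMap<String, Integer> map = new LinkedHashMap<>();

    HalfClearingCacheSketch(int maxSize) {
        this.maxSize = maxSize;
    }

    public int get(String category) {
        Integer ordinal = map.get(category);
        return ordinal == null ? -1 : ordinal;
    }

    public boolean put(String category, int ordinal) {
        map.put(category, ordinal);
        if (map.size() <= maxSize) {
            return false;
        }
        // Clear a large part at once instead of one entry per put.
        int toRemove = map.size() / 2;
        Iterator<String> it = map.keySet().iterator();
        while (toRemove-- > 0 && it.hasNext()) {
            it.next();
            it.remove();
        }
        return true;
    }
}

/** Caller that trusts a cache miss only while the cache is known to be complete. */
class TaxonomyLookupSketch {
    private final OrdinalCacheSketch cache;
    // Stand-in for the slow, authoritative store (an assumption of this sketch).
    private final Map<String, Integer> slowStore = new HashMap<>();
    private boolean cacheComplete = true;
    int slowLookups = 0;

    TaxonomyLookupSketch(OrdinalCacheSketch cache) {
        this.cache = cache;
    }

    /** Returns the existing ordinal or assigns the next free one. */
    int addCategory(String category) {
        int ordinal = find(category);
        if (ordinal >= 0) {
            return ordinal;
        }
        ordinal = slowStore.size();
        slowStore.put(category, ordinal);
        remember(category, ordinal);
        return ordinal;
    }

    /** Returns the ordinal, or -1 if the category does not exist. */
    int find(String category) {
        int ordinal = cache.get(category);
        if (ordinal >= 0) {
            return ordinal;
        }
        if (cacheComplete) {
            return -1; // a miss in a complete cache means the category is unknown
        }
        slowLookups++;
        Integer stored = slowStore.get(category);
        if (stored == null) {
            return -1;
        }
        remember(category, stored);
        return stored;
    }

    private void remember(String category, int ordinal) {
        if (cache.put(category, ordinal)) {
            cacheComplete = false; // from now on a miss proves nothing
        }
    }
}
```

Once any put reports that mappings were dropped, every later miss in this sketch costs a slow lookup, which is why clearing a large part of the cache in one sweep is preferable to evicting on every insertion.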

Enums

LruTaxonomyWriterCache.LRUType

Determines the cache type. For guaranteed correctness that does not depend on the hash function being collision-free, use LRU_STRING.