Namespace Lucene.Net.Util
Some utility classes.
Classes
AlreadySetException
Thrown when Set(T) is called more than once.
ArrayUtil
Methods for manipulating arrays.
Attribute
Base class for Attributes that can be added to an AttributeSource.
Attributes are used to add data in a dynamic, yet type-safe way to a source of usually streamed objects, e.g. a TokenStream.
AttributeSource
An AttributeSource contains a list of different Attributes, and methods to add and get them. There can only be a single instance of an attribute in the same AttributeSource instance. This is ensured by passing the actual type of the IAttribute to AddAttribute<T>(), which then checks whether an instance of that type is already present. If so, it returns that instance; otherwise it creates a new instance and returns it.
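The get-or-create behavior described above can be illustrated with a minimal sketch. This is not the AttributeSource implementation: AttributeBagSketch and TermTextSketch are hypothetical names, and the sketch assumes attributes are plain classes with a parameterless constructor rather than interfaces resolved through a factory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical attribute type used only for this illustration.
public sealed class TermTextSketch
{
    public string Text { get; set; } = "";
}

// Hypothetical stand-in for an attribute container keyed by attribute type.
public sealed class AttributeBagSketch
{
    private readonly Dictionary<Type, object> attributes = new Dictionary<Type, object>();

    // Returns the existing instance for T if one is present, otherwise creates and stores one.
    public T AddAttribute<T>() where T : class, new()
    {
        if (attributes.TryGetValue(typeof(T), out object existing))
        {
            return (T)existing;
        }
        var created = new T();
        attributes[typeof(T)] = created;
        return created;
    }
}
```

Calling AddAttribute<TermTextSketch>() twice on the same AttributeBagSketch returns the same object both times, which is the single-instance guarantee the description refers to.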
AttributeSource.AttributeFactory
An AttributeSource.AttributeFactory creates instances of Attributes.
AttributeSource.State
This class holds the state of an AttributeSource.
Bits
Bits.MatchAllBits
Bits impl of the specified length with all bits set.
Bits.MatchNoBits
Bits impl of the specified length with no bits set.
BitUtil
A variety of high efficiency bit twiddling routines.
BroadWord
Methods and constants inspired by the article "Broadword Implementation of Rank/Select Queries" by Sebastiano Vigna, January 30, 2012:
- algorithm 1: BitCount(Int64), count of set bits in a System.Int64,
- algorithm 2: Select(Int64, Int32), selection of a set bit in a System.Int64,
- bytewise signed smaller <8 operator: SmallerUpTo7_8(Int64, Int64),
- shortwise signed smaller <16 operator: SmallerUpto15_16(Int64, Int64),
- some of the Lk and Hk constants that are used by the above: L8 L8_L, H8 H8_L, L9 L9_L, L16 L16_L and H16 H8_L.
BundleResourceManagerFactory
This implementation of IResourceManagerFactory uses a convention to retrieve resources. In Java NLS, the convention is to use the same name for the resource key properties and for the resource file names. This presents a problem for .NET because the resource generator already creates an internal class with the same name as the .resx file.
To work around this, we use the convention of appending the suffix "Bundle" to the end of the type the resource key properties are stored in. For example, if our constants are stored in a class named ErrorMessages, the type that will be looked up by this factory will be ErrorMessagesBundle (which is the name of the .resx file that should be added to your project).
This implementation can be inherited to use a different convention or can be replaced to get the resources from an external source.
ByteBlockPool
Class that Posting and PostingVector use to write byte streams into shared fixed-size byte[] arrays. The idea is to allocate slices of increasing lengths. For example, the first slice is 5 bytes, the next slice is 14, etc. We start by writing our bytes into the first 5 bytes. When we hit the end of the slice, we allocate the next slice and then write the address of the new slice into the last 4 bytes of the previous slice (the "forwarding address").
Each slice is filled with 0's initially, and we mark the end with a non-zero byte. This way the methods that are writing into the slice don't need to record its length and instead allocate a new slice once they hit a non-zero byte.
ByteBlockPool.Allocator
Abstract class for allocating and freeing byte blocks.
ByteBlockPool.DirectAllocator
A simple ByteBlockPool.Allocator that never recycles.
ByteBlockPool.DirectTrackingAllocator
A simple ByteBlockPool.Allocator that never recycles, but tracks how much total RAM is in use.
BytesRef
Represents byte[], as a slice (offset + length) into an existing byte[]. The Bytes property should never be null; use EMPTY_BYTES if necessary.
Important note: Unless otherwise noted, Lucene uses this class to represent terms that are encoded as UTF-8 bytes in the index. To convert them to a .NET System.String (which is UTF-16), use Utf8ToString(). Using code like new String(bytes, offset, length) to do this is wrong, as it does not respect the correct character set and may return wrong results (depending on the platform's defaults)!
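As a rough illustration of decoding a slice with an explicit character set, the sketch below uses only System.Text.Encoding from the .NET base class library. Utf8SliceSketch and Decode are hypothetical names, and nothing here is taken from the BytesRef implementation.

```csharp
using System;
using System.Text;

public static class Utf8SliceSketch
{
    // Decodes `length` bytes of `bytes`, starting at `offset`, as UTF-8.
    // Naming the encoding explicitly avoids depending on any platform default.
    public static string Decode(byte[] bytes, int offset, int length)
    {
        return Encoding.UTF8.GetString(bytes, offset, length);
    }
}
```

For example, if a buffer holds the UTF-8 bytes of "xxcaféyy", the slice at offset 2 with length 5 decodes to "café", because "é" occupies two bytes in UTF-8.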
BytesRefArray
A simple append only random-access BytesRef array that stores full copies of the appended bytes in a ByteBlockPool.
Note: This class is not thread-safe!
BytesRefHash
BytesRefHash is a special purpose hash-map like data-structure optimized for BytesRef instances. BytesRefHash maintains mappings of byte arrays to ids (Map<BytesRef,int>) storing the hashed bytes efficiently in continuous storage. The mapping to the id is encapsulated inside BytesRefHash and is guaranteed to be increased for each added BytesRef.
Note: A BytesRef instance passed to Add(BytesRef) must not be longer than BYTE_BLOCK_SIZE-2. The internal storage is limited to 2GB total byte storage.
BytesRefHash.BytesStartArray
Manages allocation of the per-term addresses.
BytesRefHash.DirectBytesStartArray
A simple BytesRefHash.BytesStartArray that tracks memory allocation using a private Counter instance.
BytesRefHash.MaxBytesLengthExceededException
Thrown if a BytesRef exceeds the BytesRefHash limit of BYTE_BLOCK_SIZE-2.
BytesRefIterator
LUCENENET specific class to make the syntax of creating an empty IBytesRefIterator the same as it was in Lucene. Example:
var iter = BytesRefIterator.Empty;
CharsRef
Represents char[], as a slice (offset + Length) into an existing char[]. The Chars property should never be null; use EMPTY_CHARS if necessary.
CollectionUtil
Methods for manipulating (sorting) collections. Sort methods work directly on the supplied lists and don't copy to/from arrays before/after. For medium size collections as used in the Lucene indexer that is much more efficient.
CommandLineUtil
Class containing some useful methods used by command line tools.
Constants
Some useful constants.
Counter
Simple counter class
DisposableThreadLocal<T>
Java's built-in ThreadLocal has a serious flaw: it can take an arbitrarily long amount of time to dereference the things you had stored in it, even once the ThreadLocal instance itself is no longer referenced. This is because there is a single master map stored for each thread, which all ThreadLocals share, and that master map only periodically purges "stale" entries.
While not technically a memory leak, because eventually the memory will be reclaimed, it can take a long time and you can easily hit System.OutOfMemoryException because from the GC's standpoint the stale entries are not reclaimable.
This class works around that by only enrolling WeakReference values into the ThreadLocal, and separately holding a hard reference to each stored value. When you call Dispose(), these hard references are cleared and the GC is then free to reclaim the space used by the objects stored in it.
You should not call Dispose() until all threads are done using the instance.
DocIdBitSet
Simple DocIdSet and DocIdSetIterator backed by a BitSet
DoubleBarrelLRUCache
LUCENENET specific class to nest the DoubleBarrelLRUCache.CloneableKey so it can be accessed without referencing the generic closing types of DoubleBarrelLRUCache<TKey, TValue>.
DoubleBarrelLRUCache.CloneableKey
Object providing clone(); the key class must subclass this.
DoubleBarrelLRUCache<TKey, TValue>
Simple concurrent LRU cache, using a "double barrel" approach where two ConcurrentHashMaps record entries.
At any given time, one hash is primary and the other is secondary. Get(TKey) first checks primary, and if that's a miss, checks secondary. If secondary has the entry, it's promoted to primary (NOTE: the key is cloned at this point). Once primary is full, the secondary is cleared and the two are swapped.
This is not as space efficient as other possible concurrent approaches (see LUCENE-2075): to achieve perfect LRU(N) it requires 2*N storage. But, this approach is relatively simple and seems in practice to not grow unbounded in size when under hideously high load.
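The primary/secondary swap described above can be sketched as follows. This is a deliberately simplified, single-threaded illustration using plain dictionaries; DoubleBarrelSketch is a hypothetical name, key cloning is omitted, and it is not the DoubleBarrelLRUCache<TKey, TValue> implementation.

```csharp
using System.Collections.Generic;

// Hypothetical, non-concurrent illustration of the "double barrel" idea.
public sealed class DoubleBarrelSketch<TKey, TValue> where TValue : class
{
    private Dictionary<TKey, TValue> primary = new Dictionary<TKey, TValue>();
    private Dictionary<TKey, TValue> secondary = new Dictionary<TKey, TValue>();
    private readonly int maxSize;
    private int countdown;

    public DoubleBarrelSketch(int maxSize)
    {
        this.maxSize = maxSize;
        countdown = maxSize;
    }

    // Checks primary first, then secondary; a secondary hit is promoted to primary.
    public TValue Get(TKey key)
    {
        if (primary.TryGetValue(key, out TValue value))
        {
            return value;
        }
        if (secondary.TryGetValue(key, out value))
        {
            Put(key, value);
            return value;
        }
        return null;
    }

    // Writes to primary; after maxSize writes, clears secondary and swaps the two maps.
    public void Put(TKey key, TValue value)
    {
        primary[key] = value;
        if (--countdown == 0)
        {
            secondary.Clear();
            var previousPrimary = primary;
            primary = secondary;
            secondary = previousPrimary;
            countdown = maxSize;
        }
    }
}
```

With maxSize set to 2, putting "a" and "b", reading "a", and then putting "c" leaves "a" and "c" reachable while "b" has been dropped, since "b" was only in the map that got cleared at the second swap.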
ExceptionExtensions
Extensions to the System.Exception class to allow for adding and retrieving suppressed exceptions, like you can do in Java.
ExcludeServiceAttribute
Base class for Attribute types that exclude services from Reflection scanning.
FieldCacheSanityChecker
Provides methods for sanity checking that entries in the FieldCache are not wasteful or inconsistent.
Lucene 2.9 introduced numerous enhancements into how the FieldCache is used by the low levels of Lucene searching (for sorting and ValueSourceQueries) to improve both the speed of sorting and the reopening of IndexReaders. But these changes have shifted the usage of FieldCache from "top level" IndexReaders (frequently a MultiReader or DirectoryReader) down to the leaf level SegmentReaders. As a result, existing applications that directly access the FieldCache may find RAM usage increase significantly when upgrading to 2.9 or later. This class provides an API for these applications (or their unit tests) to check at run time if the FieldCache contains "insane" usages of the FieldCache.
FieldCacheSanityChecker.Insanity
Simple container for a collection of related FieldCache.CacheEntry objects that in conjunction with each other represent some "insane" usage of the IFieldCache.
FieldCacheSanityChecker.InsanityType
An enumeration of the different types of "insane" behavior that may be detected in an IFieldCache.
FilterIterator<T>
A System.Collections.Generic.IEnumerator<T> implementation that filters elements with a boolean predicate.
FixedBitSet
BitSet of fixed length (numBits), backed by accessible (GetBits()) long[], accessed with an int index, implementing IBits and DocIdSet. If you need to manage more than 2.1B bits, use Int64BitSet.
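The idea of a fixed-length bit set backed by a long[] and addressed with an int index can be sketched like this. BitSetSketch is a hypothetical type written for illustration; it performs no bounds checking beyond what the array does and is not the FixedBitSet source.

```csharp
using System;

// Hypothetical fixed-length bit set: bit i lives in word i / 64 at position i % 64.
public sealed class BitSetSketch
{
    private readonly long[] bits;

    public BitSetSketch(int numBits)
    {
        // Round up so that every bit index below numBits has a backing word.
        bits = new long[(numBits + 63) >> 6];
    }

    public void Set(int index)
    {
        bits[index >> 6] |= 1L << (index & 63);
    }

    public bool Get(int index)
    {
        return (bits[index >> 6] & (1L << (index & 63))) != 0;
    }

    // Exposes the backing words, mirroring the "accessible long[]" idea.
    public long[] GetBits()
    {
        return bits;
    }
}
```

Because the index is an int, this layout tops out at roughly 2.1 billion bits, which is the limit the description mentions.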
FixedBitSet.FixedBitSetIterator
A DocIdSetIterator which iterates over set bits in a FixedBitSet.
GrowableByteArrayDataOutput
A DataOutput that can be used to build a byte[].
IndexableBinaryStringTools
Provides support for converting byte sequences to System.Strings and back again. The resulting System.Strings preserve the original byte sequences' sort order.
The System.Strings are constructed using a Base 8000h encoding of the original binary data - each char of an encoded System.String represents a 15-bit chunk from the byte sequence. Base 8000h was chosen because it allows for all lower 15 bits of char to be used without restriction; the surrogate range [U+D800-U+DFFF] does not represent valid chars, and would require complicated handling to avoid them and allow use of char's high bit.
Although unset bits are used as padding in the final char, the original byte sequence could contain trailing bytes with no set bits (null bytes): padding is indistinguishable from valid information. To overcome this problem, a char is appended, indicating the number of encoded bytes in the final content char.
InfoStream
Debugging API for Lucene classes such as IndexWriter and SegmentInfos.
NOTE: Enabling infostreams may cause performance degradation in some components.
InPlaceMergeSorter
Sorter implementation based on the merge-sort algorithm that merges in place (no extra memory will be allocated). Small arrays are sorted with insertion sort.
Int32BlockPool
A pool for System.Int32 blocks similar to ByteBlockPool.
NOTE: This was IntBlockPool in Lucene
Int32BlockPool.Allocator
Abstract class for allocating and freeing System.Int32 blocks.
Int32BlockPool.DirectAllocator
A simple Int32BlockPool.Allocator that never recycles.
Int32BlockPool.SliceReader
An Int32BlockPool.SliceReader that can read System.Int32 slices written by an Int32BlockPool.SliceWriter.
Int32BlockPool.SliceWriter
An Int32BlockPool.SliceWriter that allows writing multiple integer slices into a given Int32BlockPool.
Int32sRef
Represents int[], as a slice (offset + length) into an existing int[]. The Int32s member should never be null; use EMPTY_INT32S if necessary.
NOTE: This was IntsRef in Lucene
Int64BitSet
BitSet of fixed length (numBits), backed by accessible (GetBits()) long[], accessed with a System.Int64 index. Use it only if you intend to store more than 2.1B bits, otherwise you should use FixedBitSet.
NOTE: This was LongBitSet in Lucene
Int64sRef
Represents long[], as a slice (offset + length) into an existing long[]. The Int64s member should never be null; use EMPTY_INT64S if necessary.
NOTE: This was LongsRef in Lucene
Int64Values
Abstraction over an array of System.Int64s. This class extends NumericDocValues so that we don't need to add another level of abstraction every time we want, e.g., to use the PackedInt32s utility classes to represent a NumericDocValues instance.
NOTE: This was LongValues in Lucene
IntroSorter
Sorter implementation based on a variant of the quicksort algorithm called introsort: when the recursion level exceeds the log of the length of the array to sort, it falls back to heapsort. This prevents quicksort from running into its worst-case quadratic runtime. Small arrays are sorted with insertion sort.
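The fallback strategy described above can be sketched for a plain int[] as follows. This is an illustrative introsort, not the IntroSorter source: IntroSortSketch is a hypothetical name, and the depth limit of ceil(log2(n)), the insertion-sort threshold of 16, and the middle-element pivot are all assumptions made for the example.

```csharp
using System;

public static class IntroSortSketch
{
    private const int InsertionThreshold = 16; // assumed cut-off for "small" ranges

    public static void Sort(int[] a)
    {
        if (a.Length < 2) return;
        int maxDepth = (int)Math.Ceiling(Math.Log(a.Length, 2));
        Sort(a, 0, a.Length - 1, maxDepth);
    }

    // Quicksort on a[lo..hi] that switches to heapsort once the depth budget is used up.
    private static void Sort(int[] a, int lo, int hi, int depth)
    {
        if (hi - lo < InsertionThreshold) { InsertionSort(a, lo, hi); return; }
        if (depth == 0) { HeapSort(a, lo, hi); return; }
        int p = Partition(a, lo, hi);
        Sort(a, lo, p - 1, depth - 1);
        Sort(a, p + 1, hi, depth - 1);
    }

    // Lomuto partition around the middle element; returns the pivot's final index.
    private static int Partition(int[] a, int lo, int hi)
    {
        Swap(a, lo + (hi - lo) / 2, hi);
        int pivot = a[hi], store = lo;
        for (int i = lo; i < hi; i++)
        {
            if (a[i] < pivot) Swap(a, i, store++);
        }
        Swap(a, store, hi);
        return store;
    }

    private static void InsertionSort(int[] a, int lo, int hi)
    {
        for (int i = lo + 1; i <= hi; i++)
        {
            int v = a[i], j = i - 1;
            while (j >= lo && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
    }

    // Heapsort on a[lo..hi] using a max-heap laid out relative to lo.
    private static void HeapSort(int[] a, int lo, int hi)
    {
        int n = hi - lo + 1;
        for (int i = n / 2 - 1; i >= 0; i--) SiftDown(a, lo, i, n);
        for (int end = n - 1; end > 0; end--)
        {
            Swap(a, lo, lo + end);
            SiftDown(a, lo, 0, end);
        }
    }

    private static void SiftDown(int[] a, int lo, int root, int n)
    {
        while (true)
        {
            int child = 2 * root + 1;
            if (child >= n) return;
            if (child + 1 < n && a[lo + child] < a[lo + child + 1]) child++;
            if (a[lo + root] >= a[lo + child]) return;
            Swap(a, lo + root, lo + child);
            root = child;
        }
    }

    private static void Swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}
```

An array whose elements are all equal is a bad case for this particular partition scheme, so the depth budget runs out quickly and the heapsort branch takes over, which is the situation the depth limit is meant to guard against.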
IOUtils
This class emulates the new Java 7 "Try-With-Resources" statement. Remove once Lucene is on Java 7.
ListExtensions
Extensions to System.Collections.Generic.IList<T>.
LuceneVersionExtensions
Extension methods to the LuceneVersion enumeration to provide version comparison and parsing functionality.
MapOfSets<TKey, TValue>
Helper class for keeping Lists of Objects associated with keys. WARNING: This class is not thread-safe.
MathUtil
Math static utility methods.
MergedIterator<T>
Provides a merged sorted view from several sorted iterators.
If built with removeDuplicates set to true and an element appears in multiple iterators, then it is deduplicated; that is, this iterator returns the sorted union of elements.
If built with removeDuplicates set to false, then all elements in all iterators are returned.
Caveats:
- The behavior is undefined if the iterators are not actually sorted.
- Null elements are unsupported.
- If removeDuplicates is set to true and a single iterator contains duplicates, then they will not be deduplicated.
- When elements are deduplicated, it is not defined which one is returned.
- If removeDuplicates is set to false, then the order in which duplicates are returned isn't defined.
The caller is responsible for disposing the System.Collections.Generic.IEnumerator<T> instances that are passed into the constructor; MergedIterator<T> doesn't do it automatically.
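The merging behavior can be illustrated with a small sketch over sorted int sequences. MergedSketch and Merge are hypothetical names; the sketch scans the current heads linearly instead of using a heap, and unlike the class described above it disposes the enumerators it creates itself.

```csharp
using System.Collections.Generic;

public static class MergedSketch
{
    // Lazily merges already-sorted inputs; with removeDuplicates, a value that is
    // at the head of several inputs at once is returned only once.
    public static IEnumerable<int> Merge(bool removeDuplicates, params IEnumerable<int>[] sortedInputs)
    {
        var heads = new List<IEnumerator<int>>();
        try
        {
            foreach (var input in sortedInputs)
            {
                var enumerator = input.GetEnumerator();
                if (enumerator.MoveNext()) heads.Add(enumerator); else enumerator.Dispose();
            }
            while (heads.Count > 0)
            {
                // Find the input whose current element is smallest.
                int best = 0;
                for (int i = 1; i < heads.Count; i++)
                {
                    if (heads[i].Current < heads[best].Current) best = i;
                }
                int value = heads[best].Current;
                yield return value;

                // Advance the chosen input and, when deduplicating, every input showing the same value.
                for (int i = heads.Count - 1; i >= 0; i--)
                {
                    bool advance = i == best || (removeDuplicates && heads[i].Current == value);
                    if (advance && !heads[i].MoveNext())
                    {
                        heads[i].Dispose();
                        heads.RemoveAt(i);
                    }
                }
            }
        }
        finally
        {
            foreach (var enumerator in heads) enumerator.Dispose();
        }
    }
}
```

Merging { 1, 3, 5 } and { 1, 2, 5 } yields 1, 2, 3, 5 with removeDuplicates set to true and 1, 1, 2, 3, 5, 5 with it set to false. Each input is advanced only once per returned value, so duplicates inside a single input are passed through, matching the caveat in the list above.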
NamedServiceFactory<TService>
LUCENENET specific abstract class containing common functionality for named service factories.
NumberFormat
A LUCENENET specific class that represents a numeric format. This class mimics the design of Java's NumberFormat class, which unlike the System.Globalization.NumberFormatInfo class in .NET, can be subclassed.
NumericUtils
This is a helper class to generate prefix-encoded representations for numerical values and supplies converters to represent float/double values as sortable integers/longs.
To quickly execute range queries in Apache Lucene, a range is divided recursively into multiple intervals for searching: The center of the range is searched only with the lowest possible precision in the trie, while the boundaries are matched more exactly. This reduces the number of terms dramatically.
This class generates terms to achieve this: First the numerical integer values need to be converted to bytes. For that, integer values (32 bit or 64 bit) are made unsigned and the bits are converted to ASCII chars with 7 bits each. The resulting byte[] is sortable like the original integer value (even using UTF-8 sort order). Each value is also prefixed (in the first char) by the shift value (number of bits removed) used during encoding.
To also index floating point numbers, this class supplies two methods to convert them to integer values by changing their bit layout: DoubleToSortableInt64(Double) and SingleToSortableInt32(Single). You will have no precision loss by converting floating point numbers to integers and back (only that the integer form is not usable). Other data types like dates can easily be converted to System.Int64s or System.Int32s (e.g. date to long: System.DateTime.Ticks).
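One common way to change a double's bit layout so that the resulting integers sort in the same order as the original values is sketched below. SortableDoubleSketch and its method names are hypothetical; the sketch relies only on System.BitConverter and should be read as an illustration of the idea rather than as the body of DoubleToSortableInt64(Double).

```csharp
using System;

public static class SortableDoubleSketch
{
    // Maps a double to a long whose signed ordering matches the double's numeric ordering.
    public static long ToSortableInt64(double value)
    {
        long bits = BitConverter.DoubleToInt64Bits(value);
        // Negative doubles have the sign bit set; flipping their lower 63 bits
        // reverses their order so that more negative values become smaller longs.
        if (bits < 0) bits ^= 0x7fffffffffffffffL;
        return bits;
    }

    // Inverse of ToSortableInt64: the same conditional flip restores the original bits.
    public static double FromSortableInt64(long sortable)
    {
        if (sortable < 0) sortable ^= 0x7fffffffffffffffL;
        return BitConverter.Int64BitsToDouble(sortable);
    }
}
```

The transformation is a bijection on the 64-bit pattern, which is why the round trip loses no precision, while the intermediate long is not meaningful as a number in its own right.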
For easy usage, the trie algorithm is implemented for indexing inside NumericTokenStream that can index System.Int32, System.Int64, System.Single, and System.Double. For querying, NumericRangeQuery and NumericRangeFilter implement the query part for the same data types.
This class can also be used to generate lexicographically sortable (according to UTF8SortedAsUTF16Comparer) representations of numeric data types for other usages (e.g. sorting).
Since 2.9; the API changed in a non-backwards-compatible way in 4.0.
NumericUtils.Int32RangeBuilder
Callback for SplitInt32Range(NumericUtils.Int32RangeBuilder, Int32, Int32, Int32). You need to override only one of the methods.
NOTE: This was IntRangeBuilder in Lucene
Since 2.9; the API changed in a non-backwards-compatible way in 4.0.
NumericUtils.Int64RangeBuilder
Callback for SplitInt64Range(NumericUtils.Int64RangeBuilder, Int32, Int64, Int64). You need to override only one of the methods.
NOTE: This was LongRangeBuilder in Lucene
Since 2.9; the API changed in a non-backwards-compatible way in 4.0.
OfflineSorter
On-disk sorting of byte arrays. Each byte array (entry) is composed of the following fields:
- (two bytes) length of the following byte array,
- exactly the above count of bytes for the sequence to be sorted.
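The entry layout listed above (a two-byte length followed by that many bytes) can be sketched with plain streams. LengthPrefixedSketch is a hypothetical helper; the big-endian byte order and the treatment of the length as an unsigned 16-bit value are assumptions of this example and are not stated in the description.

```csharp
using System;
using System.IO;

public static class LengthPrefixedSketch
{
    // Writes a two-byte length (assumed big-endian, unsigned) followed by the entry bytes.
    public static void Write(Stream output, byte[] entry)
    {
        if (entry.Length > ushort.MaxValue)
        {
            throw new ArgumentException("Entry does not fit a two-byte length prefix.", nameof(entry));
        }
        output.WriteByte((byte)(entry.Length >> 8));
        output.WriteByte((byte)entry.Length);
        output.Write(entry, 0, entry.Length);
    }

    // Reads the next entry, or returns null when the stream is exhausted.
    public static byte[] Read(Stream input)
    {
        int high = input.ReadByte();
        if (high < 0) return null;
        int low = input.ReadByte();
        if (low < 0) throw new EndOfStreamException("Truncated length prefix.");

        var entry = new byte[(high << 8) | low];
        int read = 0;
        while (read < entry.Length)
        {
            int n = input.Read(entry, read, entry.Length - read);
            if (n <= 0) throw new EndOfStreamException("Truncated entry.");
            read += n;
        }
        return entry;
    }
}
```

Writing several entries to a MemoryStream, rewinding it, and calling Read until it returns null gives the entries back in the order they were written.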
OfflineSorter.BufferSize
A bit more descriptive unit for constructors.
OfflineSorter.ByteSequencesReader
Utility class to read length-prefixed byte[] entries from an input. Complementary to OfflineSorter.ByteSequencesWriter.
OfflineSorter.ByteSequencesWriter
Utility class to emit length-prefixed byte[] entries to an output stream for sorting. Complementary to OfflineSorter.ByteSequencesReader.
OfflineSorter.SortInfo
Sort info (debugging mostly).
OpenBitSet
An "open" BitSet implementation that allows direct access to the array of words storing the bits.
NOTE: This can be used in .NET any place where a java.util.BitSet is used in Java.
Unlike java.util.BitSet, the fact that bits are packed into an array of longs is part of the interface. This allows efficient implementation of other algorithms by someone other than the author. It also allows one to efficiently implement alternate serialization or interchange formats.
OpenBitSet is faster than java.util.BitSet in most operations and much faster at calculating cardinality of sets and results of set operations. It can also handle sets of larger cardinality (up to 64 * 2**32-1).
The goals of OpenBitSet are the fastest implementation possible, and maximum code reuse. Extra safety and encapsulation may always be built on top, but if that's built in, the cost can never be removed (and hence people re-implement their own version in order to get better performance).
Performance Results
Test system: Pentium 4, Sun Java 1.5_06 -server -Xbatch -Xmx64M
BitSet size = 1,000,000
Results are java.util.BitSet time divided by OpenBitSet time.
 | cardinality | IntersectionCount | Union | NextSetBit | Get | GetIterator |
---|---|---|---|---|---|---|
50% full | 3.36 | 3.96 | 1.44 | 1.46 | 1.99 | 1.58 |
1% full | 3.31 | 3.90 | | 1.04 | | 0.99 |
Test system: AMD Opteron, 64 bit linux, Sun Java 1.5_06 -server -Xbatch -Xmx64M
BitSet size = 1,000,000
Results are java.util.BitSet time divided by OpenBitSet time.
 | cardinality | IntersectionCount | Union | NextSetBit | Get | GetIterator |
---|---|---|---|---|---|---|
50% full | 2.50 | 3.50 | 1.00 | 1.03 | 1.12 | 1.25 |
1% full | 2.51 | 3.49 | | 1.00 | | 1.02 |
OpenBitSetDISI
OpenBitSet with added methods to bulk-update the bits from a DocIdSetIterator. (DISI stands for DocIdSetIterator).
OpenBitSetIterator
An iterator to iterate over set bits in an OpenBitSet. This is faster than NextSetBit(Int64) for iterating over the complete set of bits, especially when the density of the bits set is high.
PagedBytes
Represents a logical byte[] as a series of pages. You can write-once into the logical byte[] (append only), using copy, and then retrieve slices (BytesRef) into it using fill.
PagedBytes.PagedBytesDataInput
PagedBytes.PagedBytesDataOutput
PagedBytes.Reader
Provides methods to read BytesRefs from a frozen PagedBytes.
PForDeltaDocIdSet
DocIdSet implementation based on pfor-delta encoding.
This implementation is inspired by LinkedIn's Kamikaze (http://data.linkedin.com/opensource/kamikaze) and Daniel Lemire's JavaFastPFOR (https://github.com/lemire/JavaFastPFOR).
Contrary to the original PFOR paper, exceptions are encoded with FOR instead of Simple16.
PForDeltaDocIdSet.Builder
A builder for PForDeltaDocIdSet.
PrintStreamInfoStream
LUCENENET specific stub to assist with migration to TextWriterInfoStream.
PriorityQueue<T>
A PriorityQueue<T> maintains a partial ordering of its elements such that the element with least priority can always be found in constant time. Put() and Pop() operations require log(size) time.
NOTE: This class will pre-allocate a full array of length maxSize+1 if instantiated via the PriorityQueue(Int32, Boolean) constructor with prepopulate set to true. That maximum size can grow as we insert elements over time.
QueryBuilder
Creates queries from the Analyzer chain.
Example usage:
QueryBuilder builder = new QueryBuilder(analyzer);
Query a = builder.CreateBooleanQuery("body", "just a test");
Query b = builder.CreatePhraseQuery("body", "another test");
Query c = builder.CreateMinShouldMatchQuery("body", "another test", 0.5f);
This can also be used as a subclass for query parsers to make it easier to interact with the analysis chain. Factory methods such as NewTermQuery(Term) are provided so that the generated queries can be customized.
RamUsageEstimator
Estimates the size (memory representation) of .NET objects.
RecyclingByteBlockAllocator
A ByteBlockPool.Allocator implementation that recycles unused byte blocks in a buffer and reuses them in subsequent calls to GetByteBlock().
Note: this class is not thread-safe.
RecyclingInt32BlockAllocator
An Int32BlockPool.Allocator implementation that recycles unused System.Int32 blocks in a buffer and reuses them in subsequent calls to GetInt32Block().
Note: this class is not thread-safe.
NOTE: This was RecyclingIntBlockAllocator in Lucene
RefCount<T>
Manages reference counting for a given object. Extensions can override Release() to do custom logic when reference counting hits 0.
RollingBuffer
LUCENENET specific class to allow referencing static members of RollingBuffer<T> without referencing its generic closing type.
RollingBuffer<T>
Acts like a forever growing T[], but internally uses a circular buffer to reuse instances of T.
SentinelInt32Set
A native System.Int32 hash-based set where one value is reserved to mean "EMPTY" internally. The space overhead is fairly low as there is only one power-of-two sized int[] to hold the values. The set is re-hashed when adding a value that would make it >= 75% full. Consider extending and over-riding Hash(Int32) if the values might be poor hash keys; Lucene docids should be fine. The internal fields are exposed publicly to enable more efficient use at the expense of better O-O principles.
To iterate over the integers held in this set, simply use code like this:
SentinelInt32Set set = ...
foreach (int v in set.Keys)
{
    if (v == set.EmptyVal)
        continue;
    // use v...
}
NOTE: This was SentinelIntSet in Lucene
ServiceNameAttribute
LUCENENET specific abstract class for System.Attributes that can be used to override the default convention-based names of services. For example, "Lucene40Codec" will by convention be named "Lucene40". Using the CodecNameAttribute, the name can be overridden with a custom value.
SetOnce<T>
A convenient class which offers a semi-immutable object wrapper implementation which allows one to set the value of an object exactly once, and retrieve it many times. If Set(T) is called more than once, AlreadySetException is thrown and the operation will fail.
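The set-once contract can be sketched with an interlocked flag. SetOnceSketch is a hypothetical type for illustration, and it throws InvalidOperationException as a stand-in for AlreadySetException; it is not the SetOnce<T> source.

```csharp
using System;
using System.Threading;

// Hypothetical wrapper whose value can be assigned exactly once.
public sealed class SetOnceSketch<T> where T : class
{
    private T value;
    private int isSet; // 0 = not set yet, 1 = set

    public void Set(T newValue)
    {
        // Only the first caller flips the flag from 0 to 1; every later call fails.
        if (Interlocked.CompareExchange(ref isSet, 1, 0) != 0)
        {
            throw new InvalidOperationException("The value has already been set.");
        }
        Volatile.Write(ref value, newValue);
    }

    // Returns the stored value, or null if Set has not been called yet.
    public T Get()
    {
        return Volatile.Read(ref value);
    }
}
```

A second call to Set throws and leaves the first value in place, so readers can call Get as many times as they like and always observe the same object.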
SloppyMath
Math functions that trade off accuracy for speed.
SmallSingle
Floating point numbers smaller than 32 bits.
NOTE: This was SmallFloat in Lucene
Sorter
Base class for sorting algorithms implementations.
SPIClassIterator<S>
Helper class for loading SPI classes from classpath (META-INF files). This is a light impl of java.util.ServiceLoader but is guaranteed to be bug-free regarding classpath order and does not instantiate or initialize the classes found.
StringHelper
Methods for manipulating strings.
SystemConsole
Mimics System.Console, but allows for swapping the System.IO.TextWriter of Out and Error, or the System.IO.TextReader of In with user-defined implementations.
TextWriterInfoStream
InfoStream implementation over a System.IO.TextWriter such as System.Console.Out.
NOTE: This is analogous to PrintStreamInfoStream in Lucene.
TimSorter
Sorter implementation based on the TimSort algorithm.
This implementation is especially good at sorting partially-sorted arrays and sorts small arrays with binary sort.
NOTE: There are a few differences with the original implementation:
- The extra amount of memory to perform merges is configurable. This allows small merges to be very fast while large merges will be performed in-place (slightly slower). You can make sure that the fast merge routine will always be used by having maxTempSlots equal to half of the length of the slice of data to sort.
- Only the fast merge routine can gallop (the one that doesn't run in-place) and it only gallops on the longest slice.
ToStringUtils
Helper methods to ease implementing System.Object.ToString().
UnicodeUtil
Class to encode .NET's UTF16 char[] into UTF8 byte[] without always allocating a new byte[] as System.Text.Encoding.GetBytes(System.String) of System.Text.Encoding.UTF8 does.
VirtualMethod
A utility for keeping backwards compatibility on previously abstract methods (or similar replacements).
Before the replacement method can be made abstract, the old method must be kept deprecated. If somebody still overrides the deprecated method in a non-sealed class, you must keep track of this and maybe delegate to the old method in the subclass. The cost of reflection is minimized by the following usage of this class:
Define static readonly fields in the base class (BaseClass), where the old and new method are declared:
internal static readonly VirtualMethod newMethod =
new VirtualMethod(typeof(BaseClass), "newName", parameters...);
internal static readonly VirtualMethod oldMethod =
new VirtualMethod(typeof(BaseClass), "oldName", parameters...);
This enforces the singleton status of these objects, as the maintenance of the cache would be too costly otherwise. If you try to create a second instance for the same method/baseClass combination, an exception is thrown.
To detect if e.g. the old method was overridden by a more distant subclass on the inheritance path to the current instance's class, use a non-static field:
bool isDeprecatedMethodOverridden =
    oldMethod.GetImplementationDistance(this.GetType()) > newMethod.GetImplementationDistance(this.GetType());
// alternatively (more readable):
bool isDeprecatedMethodOverridden =
    VirtualMethod.CompareImplementationDistance(this.GetType(), oldMethod, newMethod) > 0;
GetImplementationDistance(Type) returns the distance of the subclass that overrides this method. The one with the larger distance should be preferred. This way, more complicated method rename scenarios can also be handled (think of 2.9 TokenStream deprecations).
WAH8DocIdSet
DocIdSet implementation based on word-aligned hybrid encoding on words of 8 bits.
This implementation doesn't support random-access but has a fast DocIdSetIterator which can advance in logarithmic time thanks to an index.
The compression scheme is simplistic and should work well with sparse and very dense doc id sets while being only slightly larger than a FixedBitSet for incompressible sets (overhead<2% in the worst case) in spite of the index.
Format: The format is byte-aligned. An 8-bit word is either clean, meaning composed only of zeros or ones, or dirty, meaning that it contains between 1 and 7 bits set. The idea is to encode sequences of clean words using run-length encoding and to leave sequences of dirty words as-is.
Token | Clean length+ | Dirty length+ | Dirty words |
---|---|---|---|
1 byte | 0-n bytes | 0-n bytes | 0-n bytes |
- Token encodes whether clean means full of zeros or ones in the first bit, the number of clean words minus 2 on the next 3 bits and the number of dirty words on the last 4 bits. The higher-order bit is a continuation bit, meaning that the number is incomplete and needs additional bytes to be read.
- Clean length+: If clean length has its higher-order bit set, you need to read a vint (ReadVInt32()), shift it by 3 bits on the left side and add it to the 3 bits which have been read in the token.
- Dirty length+ works the same way as Clean length+ but on 4 bits and for the length of dirty words.
- Dirty words are the dirty words; there are Dirty length of them.
This format cannot encode sequences of less than 2 clean words and 0 dirty words. The reason is that if you find a single clean word, you should rather encode it as a dirty word. This takes the same space as starting a new sequence (since you need one byte for the token) but will be lighter to decode. There is however an exception for the first sequence. Since the first sequence may start directly with a dirty word, the clean length is encoded directly, without subtracting 2.
There is an additional restriction on the format: the sequence of dirty words is not allowed to contain two consecutive clean words. This restriction exists to make sure no space is wasted and to make sure iterators can read the next doc ID by reading at most 2 dirty words.
WAH8DocIdSet.Builder
A builder for WAH8DocIdSets.
WAH8DocIdSet.WordBuilder
Word-based builder.
Interfaces
IAccountable
An object whose RAM usage can be computed.
IAttribute
Base interface for attributes.
IAttributeReflector
This interface is used to reflect contents of AttributeSource or Attribute.
IBits
Interface for Bitset-like structures.
IBytesRefIterator
A simple iterator interface for BytesRef iteration.
IMutableBits
Extension of IBits for live documents.
IResourceManagerFactory
LUCENENET specific interface used to inject instances of System.Resources.ResourceManager. This extension point can be used to override the default behavior to, for example, retrieve resources from a persistent data store, rather than getting them from resource files.
IServiceListable
LUCENENET specific contract that provides support for AvailableCodecs, AvailableDocValuesFormats, and AvailablePostingsFormats. Implement this interface in addition to ICodecFactory, IDocValuesFormatFactory, or IPostingsFormatFactory to provide optional support for the above methods when providing a custom implementation. If this interface is not supported by the corresponding factory, a System.NotSupportedException will be thrown from the above methods.
RollingBuffer.IResettable
Implement to reset an instance
Enums
LuceneVersion
Used by certain classes to match version compatibility across releases of Lucene.
WARNING: When changing the version parameter that you supply to components in Lucene, do not simply change the version at search-time, but instead also adjust your indexing code to match, and re-index.