
    Namespace Lucene.Net.Analysis.Util

    Utility functions for text analysis.

    Classes

    AbstractAnalysisFactory

    Abstract parent class for analysis factories TokenizerFactory, TokenFilterFactory and CharFilterFactory.

    The typical lifecycle for a factory consumer is:

    • Create factory via its constructor (or via XXXFactory.ForName)
    • (Optional) If the factory uses resources such as files, Inform(IResourceLoader) is called to initialize those resources.
    • Consumer calls Create() to obtain instances (see the lifecycle sketch below).
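
    A minimal sketch of that lifecycle (the "lowercase" factory name and the "4.8" version string are assumptions; adjust both to your setup):

    using System.Collections.Generic;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Analysis.Util;
    using Lucene.Net.Util;

    // 1. Create the factory via its constructor or via ForName.
    var args = new Dictionary<string, string>
    {
        { "luceneMatchVersion", "4.8" } // version-aware factories require this entry
    };
    TokenFilterFactory factory = TokenFilterFactory.ForName("lowercase", args);

    // 2. (Optional) If the factory uses resources such as files, inform it with an IResourceLoader.
    if (factory is IResourceLoaderAware aware)
    {
        aware.Inform(new FilesystemResourceLoader(new DirectoryInfo(".")));
    }

    // 3. The consumer calls Create() to obtain instances.
    TokenStream source = new WhitespaceTokenizer(LuceneVersion.LUCENE_48, new StringReader("Some Text"));
    TokenStream filtered = factory.Create(source);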

    BufferedCharFilter

    LUCENENET specific class to mimic Java's BufferedReader (that is, a reader that is seekable) so it supports Mark() and Reset() (which are part of the Java Reader class), but also provides the Correct() method of BaseCharFilter.

    CharacterUtils

    CharacterUtils provides a unified interface to Character-related operations to implement backwards compatible character operations based on a Lucene.Net.Util.LuceneVersion instance.

    This is a Lucene.NET INTERNAL API, use at your own risk

    CharacterUtils.CharacterBuffer

    A simple IO buffer to use with Fill(CharacterUtils.CharacterBuffer, TextReader).

    CharArrayMap

    CharArrayMap<TValue>

    A simple class that stores key System.Strings as char[]'s in a hash table. Note that this is not a general purpose class. For example, it cannot remove items from the map, nor does it resize its hash table to be smaller, etc. It is designed to be quick to retrieve items by char[] keys without the necessity of converting to a System.String first.

    You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating CharArrayMap (see the construction sketch after the list below):

    • As of 3.1, supplementary characters are properly lowercased.
    Before 3.1, supplementary characters could not be lowercased correctly due to the lack of Unicode 4 support in JDK 1.4. To use instances of CharArrayMap with the behavior before Lucene 3.1, pass a Lucene.Net.Util.LuceneVersion < 3.1 to the constructors.
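
    A minimal construction sketch, assuming the (LuceneVersion, startSize, ignoreCase) constructor and the System.String / char[] overloads of Put, ContainsKey and Get shown below; check the class reference for the exact signatures:

    using Lucene.Net.Analysis.Util;
    using Lucene.Net.Util;

    // Pass the compatibility version explicitly; the final argument enables case-insensitive keys.
    var weights = new CharArrayMap<float>(LuceneVersion.LUCENE_48, 16, true);
    weights.Put("Lucene", 1.0f);

    // Keys can be probed as a char[] slice without first converting to a System.String.
    char[] buffer = "lucene search".ToCharArray();
    bool found = weights.ContainsKey(buffer, 0, 6);   // true, because ignoreCase was set
    float value = weights.Get(buffer, 0, 6);          // 1.0f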

    CharArrayMap<TValue>.EntryIterator

    Public iterator class so that efficient methods are exposed to users.

    CharArrayMap<TValue>.EntrySet_

    Public EntrySet_ class so that efficient methods are exposed to users.

    NOTE: In .NET this class was renamed to EntrySet_ because its name conflicted with the EntrySet() method. Since there is also an extension method named IDictionary{K,V}.EntrySet() that this class needs to override, renaming the method instead was not possible: the extension method would produce incorrect results if it were inadvertently called, leading to hard-to-diagnose bugs.

    Another difference between this set and the Java counterpart is that it implements System.Collections.Generic.ICollection<T> rather than System.Collections.Generic.ISet<T> so we don't have to implement a bunch of methods that we aren't really interested in. The Keys and Values properties both return System.Collections.Generic.ICollection<T>, and while there is no EntrySet() method or property in .NET, if there were it would certainly return System.Collections.Generic.ICollection<T>.

    CharArrayMapExtensions

    LUCENENET specific extension methods for CharArrayMap

    CharArraySet

    A simple class that stores System.Strings as char[]'s in a hash table. Note that this is not a general purpose class. For example, it cannot remove items from the set, nor does it resize its hash table to be smaller, etc. It is designed to be quick to test if a char[] is in the set without the necessity of converting it to a System.String first.

    You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating CharArraySet:

    • As of 3.1, supplementary characters are properly lowercased.
    Before 3.1, supplementary characters could not be lowercased correctly due to the lack of Unicode 4 support in JDK 1.4. To use instances of CharArraySet with the behavior before Lucene 3.1, pass a Lucene.Net.Util.LuceneVersion < 3.1 to the constructors.

    Please note: This class implements System.Collections.Generic.ISet<T> but does not behave as it should in all cases. The generic type is System.String because you can add any object that has a string representation (it is converted to a string). The Add methods use System.Object.ToString() and store the result in a char[] buffer. The Contains(String) methods behave the same way. GetEnumerator() returns an IEnumerator{char[]}.
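
    A minimal construction sketch, assuming the (LuceneVersion, startSize, ignoreCase) constructor and the Add/Contains overloads shown below:

    using Lucene.Net.Analysis.Util;
    using Lucene.Net.Util;

    // 16 is the initial size; true makes membership tests case-insensitive.
    var stopWords = new CharArraySet(LuceneVersion.LUCENE_48, 16, true);
    stopWords.Add("the");
    stopWords.Add("And");

    bool hit = stopWords.Contains("AND");            // true, because ignoreCase was set
    char[] text = "the quick".ToCharArray();
    bool sliceHit = stopWords.Contains(text, 0, 3);  // probe a char[] slice without building a string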

    CharArraySetExtensions

    LUCENENET specific extension methods for CharArraySet

    CharFilterFactory

    Abstract parent class for analysis factories that create Lucene.Net.Analysis.CharFilter instances.

    CharTokenizer

    An abstract base class for simple, character-oriented tokenizers.

    You must specify the required Lucene.Net.Util.LuceneVersion compatibility when creating CharTokenizer:

    • As of 3.1, CharTokenizer uses an int based API to normalize and detect token codepoints. See IsTokenChar(Int32) and Normalize(Int32) for details.

    A new CharTokenizer API has been introduced with Lucene 3.1. This API moved from UTF-16 code units to UTF-32 codepoints to eventually add support for supplementary characters. The old char based API has been deprecated and should be replaced with the int based methods IsTokenChar(Int32) and Normalize(Int32).

    As of Lucene 3.1 each CharTokenizer constructor expects a Lucene.Net.Util.LuceneVersion argument. Based on the given Lucene.Net.Util.LuceneVersion either the new API or a backwards compatibility layer is used at runtime. For Lucene.Net.Util.LuceneVersion < 3.1 the backwards compatibility layer ensures correct behavior even for indexes built with previous versions of Lucene. If a Lucene.Net.Util.LuceneVersion >= 3.1 is used, CharTokenizer requires the new API to be implemented by the instantiated class. Yet, the old char based API is no longer required even if backwards compatibility must be preserved. CharTokenizer subclasses implementing the new API are fully backwards compatible if instantiated with Lucene.Net.Util.LuceneVersion < 3.1.

    Note: If you use a subclass of CharTokenizer with Lucene.Net.Util.LuceneVersion >= 3.1 on an index built with a version < 3.1, created tokens might not be compatible with the terms in your index.
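
    For illustration, a hypothetical subclass built on the int based API (the char.IsLetterOrDigit call below is a simplification that ignores supplementary code points):

    using System.IO;
    using Lucene.Net.Analysis.Util;
    using Lucene.Net.Util;

    // Hypothetical tokenizer that emits runs of letters and digits.
    public sealed class LetterOrDigitTokenizer : CharTokenizer
    {
        public LetterOrDigitTokenizer(LuceneVersion matchVersion, TextReader input)
            : base(matchVersion, input) // a version >= 3.1 selects the codepoint-based API
        {
        }

        // Called per codepoint; return true to keep the character in the current token.
        protected override bool IsTokenChar(int c)
        {
            // Sketch only: supplementary codepoints (> 0xFFFF) are not classified here.
            return c <= char.MaxValue && char.IsLetterOrDigit((char)c);
        }
    }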

    ClasspathResourceLoader

    Simple IResourceLoader that uses System.Reflection.Assembly.GetManifestResourceStream(System.String) and System.Reflection.Assembly.GetType(System.String) to open resources and load System.Type instances, respectively.

    ElisionFilter

    Removes elisions from a Lucene.Net.Analysis.TokenStream. For example, "l'avion" (the plane) will be tokenized as "avion" (plane).

    Elision in Wikipedia

    ElisionFilterFactory

    Factory for ElisionFilter.

    <fieldType name="text_elsn" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ElisionFilterFactory" 
          articles="stopwordarticles.txt" ignoreCase="true"/>
      </analyzer>
    </fieldType>

    FilesystemResourceLoader

    Simple IResourceLoader that opens resource files from the local file system, optionally resolving against a base directory.

    This loader wraps a delegate IResourceLoader that is used to resolve all files the current base directory does not contain. NewInstance<T>(String) is always resolved against the delegate, as a System.Reflection.Assembly is needed.

    You can chain several FilesystemResourceLoaders to allow lookup of files in more than one base directory.
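
    A usage sketch, assuming the DirectoryInfo constructor overload; a delegate IResourceLoader can be supplied as a second constructor argument for files the base directory does not contain:

    using System.IO;
    using Lucene.Net.Analysis.Util;

    IResourceLoader loader = new FilesystemResourceLoader(new DirectoryInfo("/etc/myapp"));

    using (Stream s = loader.OpenResource("stopwords.txt"))
    using (var reader = new StreamReader(s))
    {
        string firstLine = reader.ReadLine();
    }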

    FilteringTokenFilter

    Abstract base class for TokenFilters that may remove tokens. You have to implement Accept() and return true if the current token should be preserved. IncrementToken() uses this method to decide if a token should be passed to the caller.

    As of Lucene 4.4, an System.ArgumentException is thrown when trying to disable position increments when filtering terms.
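
    For illustration, a hypothetical minimum-length filter built on this class (similar in spirit to the built-in LengthFilter):

    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.TokenAttributes;
    using Lucene.Net.Analysis.Util;
    using Lucene.Net.Util;

    // Hypothetical filter that drops tokens shorter than minLength characters.
    public sealed class MinLengthFilter : FilteringTokenFilter
    {
        private readonly ICharTermAttribute termAtt;
        private readonly int minLength;

        public MinLengthFilter(LuceneVersion version, TokenStream input, int minLength)
            : base(version, input)
        {
            this.termAtt = AddAttribute<ICharTermAttribute>();
            this.minLength = minLength;
        }

        // Return true to keep the current token, false to drop it.
        protected override bool Accept()
        {
            return termAtt.Length >= minLength;
        }
    }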

    OpenStringBuilder

    A StringBuilder-like class that allows one to access the underlying character array.

    RollingCharBuffer

    Acts like a forever growing char[] as you read characters into it from the provided reader, but internally it uses a circular buffer to only hold the characters that haven't been freed yet. This is like a PushbackReader, except you don't have to specify up-front the max size of the buffer, but you do have to periodically call FreeBefore(Int32).
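
    A hedged usage sketch; the members used below (Reset(TextReader), Get(Int32) returning -1 at end of input, and FreeBefore(Int32)) are assumed from this description and the Java original, so verify them against the class reference:

    using System.IO;
    using Lucene.Net.Analysis.Util;

    var buffer = new RollingCharBuffer();
    buffer.Reset(new StringReader("a long stream of characters ..."));

    for (int pos = 0; ; pos++)
    {
        int ch = buffer.Get(pos);        // reads (and caches) the character at absolute position pos
        if (ch == -1)
            break;                       // end of input

        if (pos > 0 && pos % 1024 == 0)
            buffer.FreeBefore(pos);      // release characters that will never be read again
    }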

    StemmerUtil

    Some commonly-used stemming functions

    This is a Lucene.NET INTERNAL API, use at your own risk

    StopwordAnalyzerBase

    Base class for Lucene.Net.Analysis.Analyzers that need to make use of stopword sets.

    TokenFilterFactory

    Abstract parent class for analysis factories that create Lucene.Net.Analysis.TokenFilter instances.

    TokenizerFactory

    Abstract parent class for analysis factories that create Lucene.Net.Analysis.Tokenizer instances.

    WordlistLoader

    Loader for text files that represent a list of stopwords.

    See IOUtils to obtain System.IO.TextReader instances.

    This is a Lucene.NET INTERNAL API, use at your own risk
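
    A hedged usage sketch, assuming a GetWordSet(TextReader, LuceneVersion) overload that reads one entry per line:

    using System.IO;
    using Lucene.Net.Analysis.Util;
    using Lucene.Net.Util;

    using (TextReader reader = new StreamReader("stopwords.txt"))
    {
        CharArraySet stopWords = WordlistLoader.GetWordSet(reader, LuceneVersion.LUCENE_48);
    }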

    Interfaces

    IMultiTermAwareComponent

    Add to any analysis factory component to allow returning an analysis component factory for use with partial terms in prefix queries, wildcard queries, range query endpoints, regex queries, etc.

    This is a Lucene.NET EXPERIMENTAL API, use at your own risk

    IResourceLoader

    Abstraction for loading resources (streams, files, and classes).

    IResourceLoaderAware

    Interface for a component that needs to be initialized by an implementation of IResourceLoader.
