Lucene.Net  3.0.3
Lucene.Net is a .NET port of the Java Lucene Indexing Library
Package Lucene.Net.Analysis

Namespaces

package  AR
 
package  BR
 
package  CJK
 
package  Cn
 
package  Compound
 
package  Cz
 
package  De
 
package  El
 
package  Ext
 
package  Fa
 
package  Fr
 
package  Hunspell
 
package  Miscellaneous
 
package  NGram
 
package  Nl
 
package  Payloads
 
package  Position
 
package  Query
 
package  Reverse
 
package  Ru
 
package  Shingle
 
package  Sinks
 
package  Snowball
 
package  Standard
 
package  Th
 
package  Tokenattributes
 

Classes

class  ChainedFilter
  More...
 
class  Analyzer
 An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text. Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer. More...
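
The tokenizer-then-filters policy described above can be modelled in a few lines. This is a minimal sketch in Python rather than C#, and every name in it (`letter_tokenizer`, `lower_case_filter`, `stop_filter`, `analyze`, the stop-word set) is hypothetical and not part of the Lucene.Net API; it only illustrates how a chain of filters wraps a tokenizer.

```python
from itertools import groupby

def letter_tokenizer(text):
    """Hypothetical tokenizer: yield maximal runs of letters as raw tokens."""
    for is_letter, run in groupby(text, key=str.isalpha):
        if is_letter:
            yield "".join(run)

def lower_case_filter(tokens):
    """Hypothetical filter: normalize each token to lower case."""
    for token in tokens:
        yield token.lower()

def stop_filter(tokens, stop_words):
    """Hypothetical filter: drop tokens found in the stop-word set."""
    for token in tokens:
        if token not in stop_words:
            yield token

def analyze(text, stop_words=frozenset({"the", "a", "of"})):
    """The 'analyzer': one tokenizer wrapped by zero or more filters."""
    stream = letter_tokenizer(text)
    stream = lower_case_filter(stream)
    stream = stop_filter(stream, stop_words)
    return list(stream)

index_terms = analyze("The Art of Indexing")
print(index_terms)
```

Each stage consumes the previous stage lazily, so the order of the wrappers is the order in which the text is processed.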
 
class  ASCIIFoldingFilter
 This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 128 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if they exist. More...
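
To give a feel for what folding to ASCII means, the sketch below uses a different technique from the one the description implies: it relies on Unicode compatibility decomposition (NFKD) and then drops combining marks. It is an approximation in Python, not the class's own mapping, and the function name `fold_to_ascii` is hypothetical. Characters that have no decomposition are left untouched here, so the real class may handle cases that this approach misses.

```python
import unicodedata

def fold_to_ascii(text):
    """Approximate ASCII folding via NFKD decomposition (an assumption,
    not the mapping used by the class described above)."""
    decomposed = unicodedata.normalize("NFKD", text)
    kept = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop combining accents such as U+0300 or U+0301
        kept.append(ch)
    return "".join(kept)

folded = fold_to_ascii("Crème brûlée")
print(folded)
```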
 
class  BaseCharFilter
  More...
 
class  CachingTokenFilter
 This class can be used if the token attributes of a TokenStream are intended to be consumed more than once. It caches all token attribute states locally in a List. More...
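
The caching idea can be sketched as a wrapper that drains a one-shot token source into a list on first use and replays the list afterwards. This is an illustrative Python model; the class name `CachingTokens` is hypothetical, and plain strings stand in for token attribute states.

```python
class CachingTokens:
    """Hypothetical wrapper that lets a one-shot token source be replayed."""

    def __init__(self, source):
        self._source = source
        self._cache = None

    def __iter__(self):
        if self._cache is None:
            # First consumption: drain the upstream source exactly once.
            self._cache = list(self._source)
        return iter(self._cache)

one_shot = (word for word in ["quick", "brown", "fox"])  # usable only once
cached = CachingTokens(one_shot)
first_pass = list(cached)
second_pass = list(cached)
print(first_pass, second_pass)
```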
 
class  CharArraySet
 A simple class that stores Strings as char[]'s in a hash table. Note that this is not a general purpose class. For example, it cannot remove items from the set, nor does it resize its hash table to be smaller, etc. It is designed to be quick to test if a char[] is in the set without the necessity of converting it to a String first. Please note: This class implements System.Collections.Generic.ISet{T} but does not behave like it should in all cases. The generic type is System.Collections.Generic.ICollection{T}, because you can add any object that has a string representation. The add methods use object.ToString() and store the result in a char buffer. The Contains(object) methods behave the same way. The GetEnumerator method returns a string IEnumerable. For type safety, stringIterator() is also provided. More...
 
class  CharFilter
 Subclasses of CharFilter can be chained to filter CharStream. They can be used as System.IO.TextReader with additional offset correction. Tokenizers will automatically use CorrectOffset if a CharFilter/CharStream subclass is used. More...
 
class  CharReader
 CharReader is a Reader wrapper. It reads chars from a Reader and outputs a CharStream, defining an identity CorrectOffset method that simply returns the provided offset. More...
 
class  CharStream
 CharStream adds CorrectOffset functionality over System.IO.TextReader. All Tokenizers accept a CharStream instead of System.IO.TextReader as input, which enables arbitrary character-based filtering before tokenization. The CorrectOffset method fixes offsets to account for removal or insertion of characters, so that the offsets reported in the tokens match the character offsets of the original Reader. More...
 
class  CharTokenizer
 An abstract base class for simple, character-oriented tokenizers. More...
 
class  ISOLatin1AccentFilter
 A filter that replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalent. The case will not be altered. For instance, 'à' will be replaced by 'a'. More...
 
class  KeywordAnalyzer
 "Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names. More...
 
class  KeywordTokenizer
 Emits the entire input as a single token. More...
 
class  LengthFilter
 Removes words that are too long or too short from the stream. More...
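
A length filter reduces to a single comparison per token. The Python sketch below assumes that both bounds are inclusive, which the description above does not state; the name `length_filter` is hypothetical.

```python
def length_filter(tokens, min_len, max_len):
    """Keep tokens whose length lies in [min_len, max_len].
    Inclusive bounds are an assumption of this sketch."""
    for token in tokens:
        if min_len <= len(token) <= max_len:
            yield token

kept = list(length_filter(["a", "an", "index", "tokenization"], 2, 5))
print(kept)
```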
 
class  LetterTokenizer
 A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as defined by the System.Char.IsLetter() predicate (java.lang.Character.isLetter() in the Java original). Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces. More...
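
The "maximal strings of adjacent letters" rule, and the caveat about languages written without spaces, can both be seen in a short Python model. Here `str.isalpha` stands in for the letter predicate and `letter_tokens` is a hypothetical name; the exact set of characters counted as letters may differ from the .NET predicate.

```python
from itertools import groupby

def letter_tokens(text):
    """Split text at non-letters; each maximal run of letters is one token."""
    return ["".join(run)
            for is_letter, run in groupby(text, key=str.isalpha)
            if is_letter]

english = letter_tokens("don't stop42now")
chinese = letter_tokens("我爱北京天安门")
print(english)
print(chinese)
```

The apostrophe and the digits act as separators in the English input, while the Chinese sentence consists only of letters and therefore comes back as a single token.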
 
class  LowerCaseFilter
 Normalizes token text to lower case. More...
 
class  LowerCaseTokenizer
 LowerCaseTokenizer performs the function of LetterTokenizer and LowerCaseFilter together. It divides text at non-letters and converts the resulting tokens to lower case. While it is functionally equivalent to the combination of LetterTokenizer and LowerCaseFilter, there is a performance advantage to doing the two tasks at once, hence this (redundant) implementation. Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces. More...
 
class  MappingCharFilter
 Simplistic CharFilter that applies the mappings contained in a NormalizeCharMap to the character stream and corrects the resulting changes to the offsets. More...
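
Why offsets need correcting after a mapping is easiest to see with a worked example. The Python sketch below replaces substrings and records, for every output character, the index in the original text that produced it. The per-character table is a simplification chosen for clarity and is not claimed to be how the class tracks corrections; `apply_mappings` and the sample mapping are hypothetical.

```python
def apply_mappings(text, mappings):
    """Return (filtered_text, offsets). offsets[i] is the index in `text`
    that produced filtered_text[i]; a final entry maps the end of the output."""
    out, offsets = [], []
    # Prefer the longest matching key; ignore empty keys to guarantee progress.
    keys = sorted((k for k in mappings if k), key=len, reverse=True)
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                replacement = mappings[key]
                out.append(replacement)
                offsets.extend([i] * len(replacement))
                i += len(key)
                break
        else:
            # No mapping applies here: copy the character through unchanged.
            out.append(text[i])
            offsets.append(i)
            i += 1
    offsets.append(len(text))
    return "".join(out), offsets

text = "caf&eacute; menu"
filtered, offsets = apply_mappings(text, {"&eacute;": "e"})
start = filtered.index("menu")
end = start + len("menu")
# Correct the token's offsets so they point into the original text.
original_span = text[offsets[start]:offsets[end]]
print(filtered, (start, end), (offsets[start], offsets[end]), original_span)
```

Without the correction, the token "menu" would be reported at positions 5 to 9 of the filtered text, which do not line up with the original input.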
 
class  NormalizeCharMap
 Holds a map of String input to String output, to be used with MappingCharFilter. More...
 
class  NumericTokenStream
 Expert: This class provides a TokenStream for indexing numeric values that can be used by NumericRangeQuery{T} or NumericRangeFilter{T}. More...
 
class  PerFieldAnalyzerWrapper
 This analyzer is used to facilitate scenarios where different fields require different analysis techniques. Use AddAnalyzer to add a non-default analyzer on a field name basis. More...
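
Conceptually, the wrapper is a lookup from field name to analyzer with a fallback. The following Python model is a sketch under that assumption; `PerFieldAnalyzer`, `add_analyzer`, `analyze`, and the two toy analyzers are hypothetical names, not the Lucene.Net signatures.

```python
class PerFieldAnalyzer:
    """Hypothetical model: choose an analyzer by field name, else a default."""

    def __init__(self, default_analyzer):
        self._default = default_analyzer
        self._by_field = {}

    def add_analyzer(self, field_name, analyzer):
        self._by_field[field_name] = analyzer

    def analyze(self, field_name, text):
        analyzer = self._by_field.get(field_name, self._default)
        return analyzer(text)

def whitespace_lowercase(text):
    """Toy analyzer: lower-case, then split at whitespace."""
    return text.lower().split()

def keyword(text):
    """Toy analyzer: the whole input is one token."""
    return [text]

wrapper = PerFieldAnalyzer(default_analyzer=whitespace_lowercase)
wrapper.add_analyzer("sku", keyword)
print(wrapper.analyze("title", "Red Widget"))
print(wrapper.analyze("sku", "RW 100-A"))
```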
 
class  PorterStemFilter
 Transforms the token stream as per the Porter stemming algorithm. Note: the input to the stemming filter must already be in lower case, so you will need to use LowerCaseFilter or LowerCaseTokenizer earlier in the chain (ahead of this filter) in order for this to work properly! To use this filter with other analyzers, you'll want to write an Analyzer class that sets up the TokenStream chain as you want it. To use this with LowerCaseTokenizer, for example, you'd write an analyzer like this: More...
 
class  PorterStemmer
 A stemmer implementing the Porter stemming algorithm. More...
 
class  SimpleAnalyzer
 An Analyzer that filters LetterTokenizer with LowerCaseFilter. More...
 
class  StopAnalyzer
 Filters LetterTokenizer with LowerCaseFilter and StopFilter. More...
 
class  StopFilter
 Removes stop words from a token stream. More...
 
class  TeeSinkTokenFilter
 This TokenFilter provides the ability to set aside attribute states that have already been analyzed. This is useful in situations where multiple fields share many common analysis steps and then go their separate ways. It is also useful for doing things like entity extraction or proper noun analysis as part of the analysis workflow and saving off those tokens for use in another field. More...
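
The tee-and-sink arrangement can be modelled as a pass-through iterator that copies selected tokens into side lists while the main stream continues. This Python sketch uses hypothetical names (`TeeSink`, `new_sink`, `accept`) and plain strings in place of attribute states; it mirrors the entity-extraction scenario mentioned above.

```python
class TeeSink:
    """Hypothetical model: pass tokens through and copy some into sinks."""

    def __init__(self, source):
        self._source = source
        self._sinks = []

    def new_sink(self, accept=lambda token: True):
        """Register a sink; `accept` decides which tokens it receives."""
        collected = []
        self._sinks.append((accept, collected))
        return collected

    def __iter__(self):
        for token in self._source:
            for accept, collected in self._sinks:
                if accept(token):
                    collected.append(token)
            yield token

tokens = "Ada Lovelace wrote the first program".split()
tee = TeeSink(tokens)
proper_nouns = tee.new_sink(accept=lambda t: t[0].isupper())
# Consuming the main stream fills the sink as a side effect.
body_field = [t.lower() for t in tee]
print(body_field)
print(proper_nouns)
```

The shared analysis work (here, tokenization) runs once; the sink's contents can then feed a second field.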
 
class  Token
 A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string. The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC (KeyWord In Context) display, etc. The type is a string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example an end of sentence marker token might be implemented with type "eos". The default token type is "word". A Token can optionally have metadata (a.k.a. Payload) in the form of a variable length byte array. Use TermPositions.PayloadLength and TermPositions.GetPayload(byte[], int) to retrieve the payloads from the index. More...
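
The role of the start and end offsets is clearest in a highlighting example. The Python sketch below models a token as term text, offsets, and a type string, then uses the offsets to mark query terms in the original text. The `Token` dataclass and `highlight` function are illustrative stand-ins, not the Lucene.Net types, and payloads are omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    term: str            # normalized term text
    start: int           # offset of the first character in the source text
    end: int             # offset one past the last character
    type: str = "word"   # lexical class; "word" is the default

def highlight(text, tokens, query_terms):
    """Wrap each matching token's source span in square brackets."""
    pieces, cursor = [], 0
    for tok in tokens:
        if tok.term in query_terms:
            pieces.append(text[cursor:tok.start])
            pieces.append("[" + text[tok.start:tok.end] + "]")
            cursor = tok.end
    pieces.append(text[cursor:])
    return "".join(pieces)

text = "Fast Search, fast results"
tokens = [
    Token("fast", 0, 4),
    Token("search", 5, 11),
    Token("fast", 13, 17),
    Token("results", 18, 25),
]
marked = highlight(text, tokens, {"fast"})
print(marked)
```

The terms are lower-cased, yet the offsets still point at the original spelling, so the capitalized "Fast" is highlighted as written.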
 
class  TokenFilter
 A TokenFilter is a TokenStream whose input is another TokenStream. This is an abstract class; subclasses must override TokenStream.IncrementToken(). More...
 
class  Tokenizer
 A Tokenizer is a TokenStream whose input is a Reader. This is an abstract class; subclasses must override TokenStream.IncrementToken(). NOTE: Subclasses overriding TokenStream.IncrementToken() must call AttributeSource.ClearAttributes() before setting attributes. More...
 
class  TokenStream
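 A TokenStream enumerates the sequence of tokens, either from Fields of a Document or from query text. This is an abstract class. Concrete subclasses are:

  • Tokenizer, a TokenStream whose input is a Reader; and
  • TokenFilter, a TokenStream whose input is another TokenStream.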

A new TokenStream API has been introduced with Lucene 2.9. This API has moved from being Token based to IAttribute based. While Token still exists in 2.9 as a convenience class, the preferred way to store the information of a Token is to use Util.Attributes. TokenStream now extends AttributeSource, which provides access to all of the token IAttributes for the TokenStream. Note that only one instance per Util.Attribute is created and reused for every token. This approach reduces object creation and allows local caching of references to the Util.Attributes. See IncrementToken() for further details. The workflow of the new TokenStream API is as follows:

  • Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
  • The consumer calls TokenStream.Reset().
  • The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.
  • The consumer calls IncrementToken() until it returns false and consumes the attributes after each call.
  • The consumer calls End() so that any end-of-stream operations can be performed.
  • The consumer calls Close() to release any resource when finished using the TokenStream.

To make sure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in IncrementToken(). You can find some example code for the new API in the analysis package level Javadoc. Sometimes it is desirable to capture the current state of a TokenStream, e.g. for buffering purposes (see CachingTokenFilter, TeeSinkTokenFilter). For this use case AttributeSource.CaptureState and AttributeSource.RestoreState can be used. More...
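
The consumer workflow listed above (reset, fetch attribute references, loop on the increment call, end, close) can be traced with a small model. This is a Python sketch, not the Lucene.Net API: `TermAttribute`, `WhitespaceStream`, and `consume` are hypothetical, and a single reused attribute object stands in for the attribute instances held by the stream.

```python
class TermAttribute:
    """One mutable attribute instance, reused for every token."""
    def __init__(self):
        self.term = ""

class WhitespaceStream:
    """Hypothetical stream that emits whitespace-separated terms."""
    def __init__(self, text):
        self._text = text
        self.term_attr = TermAttribute()  # attribute added at instantiation
        self._words = iter(())
        self.closed = False

    def reset(self):
        self._words = iter(self._text.split())

    def increment_token(self):
        try:
            self.term_attr.term = next(self._words)
        except StopIteration:
            return False  # no more tokens
        return True

    def end(self):
        pass  # end-of-stream bookkeeping would go here

    def close(self):
        self.closed = True

def consume(stream):
    terms = []
    stream.reset()
    term_attr = stream.term_attr          # keep a local reference
    try:
        while stream.increment_token():
            terms.append(term_attr.term)  # copy the value; the object is reused
        stream.end()
    finally:
        stream.close()                    # release resources even on error
    return terms

stream = WhitespaceStream("to be or not")
terms = consume(stream)
print(terms)
```

Because the same attribute object is overwritten on every call, a consumer that needs earlier tokens has to copy the values out, which is the situation the state-capturing methods mentioned above address.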

 
class  WhitespaceAnalyzer
 An Analyzer that uses WhitespaceTokenizer. More...
 
class  WhitespaceTokenizer
 A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-whitespace characters form tokens. More...
 
class  WordlistLoader
 Loader for text files that represent a list of stopwords. More...