API and code to convert text into indexable/searchable tokens.

Classes

Class    Description
Public class Analyzer
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer.
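
As an illustrative sketch of such a chain (Java-style, matching the other examples in this reference; the class name LowercasingAnalyzer is made up, while the tokenizer and filters are existing analysis classes):

            class LowercasingAnalyzer extends Analyzer {
              // A Tokenizer produces the raw tokens; TokenFilters then refine its output.
              public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream result = new WhitespaceTokenizer(reader);
                result = new LowerCaseFilter(result);
                result = new LengthFilter(result, 2, 50); // drop very short or very long terms
                return result;
              }
            }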

Public class ASCIIFoldingFilter
This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. Characters from a number of Latin-related Unicode blocks are converted; however, only those characters with reasonable ASCII alternatives are converted (see http://en.wikipedia.org/wiki/Latin_characters_in_Unicode). The set of character conversions supported by this class is a superset of those supported by Lucene's {@link ISOLatin1AccentFilter}, which strips accents from Latin-1 characters. For example, 'à' will be replaced by 'a'.
Public class BaseCharFilter
Base utility class for implementing a {@link CharFilter}. You subclass this, and then record mappings by calling {@link #addOffCorrectMap}, and then invoke the correct method to correct an offset.
Public class CachingTokenFilter
This class can be used if the token attributes of a TokenStream are intended to be consumed more than once. It caches all token attribute states locally in a List.

CachingTokenFilter implements the optional method {@link TokenStream#Reset()}, which repositions the stream to the first Token.
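
For example (a minimal sketch in the Java style of the other examples here; the analyzer, reader, and the two consumer methods are placeholders):

            TokenStream input = analyzer.tokenStream("body", reader);
            CachingTokenFilter cached = new CachingTokenFilter(input);
            buildField(cached);      // first pass consumes the input and caches the attribute states
            cached.reset();          // reposition the cached stream to the first token
            buildOtherField(cached); // second pass replays tokens from the cache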

Public class CharacterCache Obsolete.
Replacement for Java 1.5 Character.valueOf()
Public class CharArraySet
A simple class that stores Strings as char[]'s in a hash table. Note that this is not a general purpose class. For example, it cannot remove items from the set, nor does it resize its hash table to be smaller, etc. It is designed to be quick to test if a char[] is in the set without the necessity of converting it to a String first.
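
A brief usage sketch (Java-flavored like the other examples; the termBuffer/termLength variables are placeholders for a token's character buffer):

            CharArraySet stopSet = new CharArraySet(16, /* ignoreCase = */ true);
            stopSet.add("the");
            stopSet.add("and");
            // Test a char[] directly, without first converting it to a String:
            boolean isStop = stopSet.contains(termBuffer, 0, termLength);
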
Public class CharArraySet.CharArraySetIterator
The Iterator<String> for this set. Strings are constructed on the fly, so use nextCharArray for more efficient access.
Public class CharFilter
Subclasses of CharFilter can be chained to filter CharStream. They can be used as {@link java.io.Reader} with additional offset correction. {@link Tokenizer}s will automatically use {@link #CorrectOffset} if a CharFilter/CharStream subclass is used.
Public class CharReader
CharReader is a Reader wrapper. It reads chars from a Reader and outputs a {@link CharStream}, defining an identity {@link #CorrectOffset} method that simply returns the provided offset.
Public class CharStream
CharStream adds {@link #CorrectOffset} functionality over {@link Reader}. All Tokenizers accept a CharStream instead of a {@link Reader} as input, which enables arbitrary character-based filtering before tokenization. The {@link #CorrectOffset} method fixes offsets to account for removal or insertion of characters, so that the offsets reported in the tokens match the character offsets of the original Reader.
Public class CharTokenizer
An abstract base class for simple, character-oriented tokenizers.
Public class ISOLatin1AccentFilter Obsolete.
A filter that replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalent. The case will not be altered.

For instance, 'à' will be replaced by 'a'.

Public class KeywordAnalyzer
"Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.
Public class KeywordTokenizer
Emits the entire input as a single token.
Public class LengthFilter
Removes words that are too long or too short from the stream.
Public class LetterTokenizer
A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as defined by the java.lang.Character.isLetter() predicate. Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.
Public class LowerCaseFilter
Normalizes token text to lower case.
Public class LowerCaseTokenizer
LowerCaseTokenizer performs the function of LetterTokenizer and LowerCaseFilter together. It divides text at non-letters and converts the letters to lower case. While it is functionally equivalent to the combination of LetterTokenizer and LowerCaseFilter, there is a performance advantage to doing the two tasks at once, hence this (redundant) implementation.

Note: this does a decent job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.

Public class MappingCharFilter
Simplistic {@link CharFilter} that applies the mappings contained in a {@link NormalizeCharMap} to the character stream, correcting the resulting changes to the offsets.
Public class NormalizeCharMap
Holds a map of String input to String output, to be used with {@link MappingCharFilter}.
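
A short sketch of how the two classes fit together (Java-flavored as elsewhere in this reference; the reader variable is a placeholder):

            NormalizeCharMap normMap = new NormalizeCharMap();
            normMap.add("à", "a");
            normMap.add("œ", "oe");
            // Apply the mappings before tokenization; offsets are corrected back
            // to the positions in the original Reader.
            CharStream mapped = new MappingCharFilter(normMap, CharReader.get(reader));
            TokenStream tokens = new WhitespaceTokenizer(mapped);
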
Public class NumericTokenStream
Expert: This class provides a {@link TokenStream} for indexing numeric values that can be used by {@link NumericRangeQuery} or {@link NumericRangeFilter}.

Note that for simple usage, {@link NumericField} is recommended. {@link NumericField} disables norms and term freqs, as they are not usually needed during searching. If you need to change these settings, you should use this class.

See {@link NumericField} for capabilities of fields indexed numerically.

Here's an example usage, for an int field:

             Field field = new Field(name, new NumericTokenStream(precisionStep).setIntValue(value));
             field.setOmitNorms(true);
             field.setOmitTermFreqAndPositions(true);
             document.add(field);


For optimal performance, re-use the TokenStream and Field instance for more than one document:

             NumericTokenStream stream = new NumericTokenStream(precisionStep);
             Field field = new Field(name, stream);
             field.setOmitNorms(true);
             field.setOmitTermFreqAndPositions(true);
             Document document = new Document();
             document.add(field);

             // for each document:
             stream.setIntValue(value);
             writer.addDocument(document);


This stream is not intended to be used in analyzers; it's more for iterating the different precisions during indexing a specific numeric value.

NOTE: as token streams are only consumed once the document is added to the index, if you index more than one numeric field, use a separate NumericTokenStream instance for each.

See {@link NumericRangeQuery} for more details on the precisionStep parameter as well as how numeric fields work under the hood.

NOTE: This API is experimental and might change in incompatible ways in the next release.

Public class PerFieldAnalyzerWrapper
This analyzer is used to facilitate scenarios where different fields require different analysis techniques. Use {@link #addAnalyzer} to add a non-default analyzer on a field name basis.

Example usage:

            PerFieldAnalyzerWrapper aWrapper =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            aWrapper.addAnalyzer("firstname", new KeywordAnalyzer());
            aWrapper.addAnalyzer("lastname", new KeywordAnalyzer());
            

In this example, StandardAnalyzer will be used for all fields except "firstname" and "lastname", for which KeywordAnalyzer will be used.

A PerFieldAnalyzerWrapper can be used like any other analyzer, for both indexing and query parsing.

Public class PorterStemFilter
Transforms the token stream as per the Porter stemming algorithm. Note: the input to the stemming filter must already be in lower case, so you will need to use LowerCaseFilter or LowerCaseTokenizer farther down the Tokenizer chain in order for this to work properly!

To use this filter with other analyzers, you'll want to write an Analyzer class that sets up the TokenStream chain as you want it. To use this with LowerCaseTokenizer, for example, you'd write an analyzer like this:

            class MyAnalyzer extends Analyzer {
              public final TokenStream tokenStream(String fieldName, Reader reader) {
                return new PorterStemFilter(new LowerCaseTokenizer(reader));
              }
            }

Public class SimpleAnalyzer
An {@link Analyzer} that filters {@link LetterTokenizer} with {@link LowerCaseFilter}.
Public class SinkTokenizer Obsolete.
A SinkTokenizer can be used to cache Tokens for use in an Analyzer.

WARNING: {@link TeeTokenFilter} and {@link SinkTokenizer} only work with the old TokenStream API. If you switch to the new API, you need to use {@link TeeSinkTokenFilter} instead, which offers the same functionality.

Public class StopAnalyzer
Filters {@link LetterTokenizer} with {@link LowerCaseFilter} and {@link StopFilter}.
Public class StopFilter
Removes stop words from a token stream.
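
For instance (a sketch in the Java style of the other examples; the reader variable is a placeholder):

            Set stopWords = StopFilter.makeStopSet(new String[] { "the", "a", "an" });
            // true = preserve position increments across removed stop words
            TokenStream ts = new StopFilter(true, new WhitespaceTokenizer(reader), stopWords);
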
Public class TeeSinkTokenFilter
This TokenFilter provides the ability to set aside attribute states that have already been analyzed. This is useful in situations where multiple fields share many common analysis steps and then go their separate ways.

It is also useful for doing things like entity extraction or proper noun analysis as part of the analysis workflow and saving off those tokens for use in another field.

            TeeSinkTokenFilter source1 = new TeeSinkTokenFilter(new WhitespaceTokenizer(reader1));
            TeeSinkTokenFilter.SinkTokenStream sink1 = source1.newSinkTokenStream();
            TeeSinkTokenFilter.SinkTokenStream sink2 = source1.newSinkTokenStream();
            TeeSinkTokenFilter source2 = new TeeSinkTokenFilter(new WhitespaceTokenizer(reader2));
            source2.addSinkTokenStream(sink1);
            source2.addSinkTokenStream(sink2);
            TokenStream final1 = new LowerCaseFilter(source1);
            TokenStream final2 = source2;
            TokenStream final3 = new EntityDetect(sink1);
            TokenStream final4 = new URLDetect(sink2);
            d.add(new Field("f1", final1));
            d.add(new Field("f2", final2));
            d.add(new Field("f3", final3));
            d.add(new Field("f4", final4));
            
In this example, sink1 and sink2 will both get tokens from both reader1 and reader2 after the whitespace tokenizer, and we can further wrap any of these in extra analysis; more "sources" can be inserted if desired. It is important that tees are consumed before sinks (in the above example, the field names must be less than the sinks' field names). If you are not sure which stream is consumed first, you can simply add another sink and then pass all tokens to the sinks at once using {@link #consumeAllTokens}. This TokenFilter is exhausted after this. To do so, change the example above to:
            ...
            TokenStream final1 = new LowerCaseFilter(source1.newSinkTokenStream());
            TokenStream final2 = source2.newSinkTokenStream();
            sink1.consumeAllTokens();
            sink2.consumeAllTokens();
            ...
            
In this case, the fields can be added in any order, because the sources are not used anymore and all sinks are ready.

Note, the EntityDetect and URLDetect TokenStreams are for the example and do not currently exist in Lucene.

Public class TeeSinkTokenFilter.AnonymousClassSinkFilter
Public class TeeSinkTokenFilter.SinkFilter
A filter that decides which {@link AttributeSource} states to store in the sink.
Public class TeeSinkTokenFilter.SinkTokenStream
Public class TeeTokenFilter Obsolete.
Works in conjunction with the SinkTokenizer to provide the ability to set aside tokens that have already been analyzed. This is useful in situations where multiple fields share many common analysis steps and then go their separate ways.

It is also useful for doing things like entity extraction or proper noun analysis as part of the analysis workflow and saving off those tokens for use in another field.

            SinkTokenizer sink1 = new SinkTokenizer();
            SinkTokenizer sink2 = new SinkTokenizer();
            TokenStream source1 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader1), sink1), sink2);
            TokenStream source2 = new TeeTokenFilter(new TeeTokenFilter(new WhitespaceTokenizer(reader2), sink1), sink2);
            TokenStream final1 = new LowerCaseFilter(source1);
            TokenStream final2 = source2;
            TokenStream final3 = new EntityDetect(sink1);
            TokenStream final4 = new URLDetect(sink2);
            d.add(new Field("f1", final1));
            d.add(new Field("f2", final2));
            d.add(new Field("f3", final3));
            d.add(new Field("f4", final4));
            
In this example, sink1 and sink2 will both get tokens from both reader1 and reader2 after the whitespace tokenizer, and we can further wrap any of these in extra analysis; more "sources" can be inserted if desired. It is important that tees are consumed before sinks (in the above example, the field names must be less than the sinks' field names). Note: the EntityDetect and URLDetect TokenStreams are for the example and do not currently exist in Lucene.

See LUCENE-1058.

WARNING: {@link TeeTokenFilter} and {@link SinkTokenizer} only work with the old TokenStream API. If you switch to the new API, you need to use {@link TeeSinkTokenFilter} instead, which offers the same functionality.

Public class Token
A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string.

The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser, or to show matching text fragments in a KWIC display, etc.

The type is a string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example an end of sentence marker token might be implemented with type "eos". The default token type is "word".

A Token can optionally have metadata (a.k.a. Payload) in the form of a variable length byte array. Use {@link TermPositions#GetPayloadLength()} and {@link TermPositions#GetPayload(byte[], int)} to retrieve the payloads from the index.
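
As a sketch of reading payloads back at search time (Java-flavored like the other examples; the index reader, field, and term are placeholders):

            TermPositions tp = indexReader.termPositions(new Term("body", "lucene"));
            while (tp.next()) {
              for (int i = 0; i < tp.freq(); i++) {
                tp.nextPosition();
                if (tp.isPayloadAvailable()) {
                  byte[] payload = tp.getPayload(new byte[tp.getPayloadLength()], 0);
                  // interpret the payload bytes here
                }
              }
            }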

Public class TokenFilter
A TokenFilter is a TokenStream whose input is another TokenStream.

This is an abstract class; subclasses must override {@link #IncrementToken()}.

Public class Tokenizer
A Tokenizer is a TokenStream whose input is a Reader.

This is an abstract class; subclasses must override {@link #IncrementToken()}.

NOTE: Subclasses overriding {@link #next(Token)} must call {@link AttributeSource#ClearAttributes()} before setting attributes. Subclasses overriding {@link #IncrementToken()} must call {@link Token#Clear()} before setting Token attributes.

Public class TokenStream
A TokenStream enumerates the sequence of tokens, either from {@link Field}s of a {@link Document} or from query text.

This is an abstract class. Concrete subclasses are:

  • {@link Tokenizer}, a TokenStream whose input is a Reader; and
  • {@link TokenFilter}, a TokenStream whose input is another TokenStream.
A new TokenStream API has been introduced with Lucene 2.9. This API has moved from being {@link Token} based to {@link Attribute} based. While {@link Token} still exists in 2.9 as a convenience class, the preferred way to store the information of a {@link Token} is to use {@link AttributeImpl}s.

TokenStream now extends {@link AttributeSource}, which provides access to all of the token {@link Attribute}s for the TokenStream. Note that only one instance per {@link AttributeImpl} is created and reused for every token. This approach reduces object creation and allows local caching of references to the {@link AttributeImpl}s. See {@link #IncrementToken()} for further details.

The workflow of the new TokenStream API is as follows:

  1. Instantiation of TokenStream/{@link TokenFilter}s which add/get attributes to/from the {@link AttributeSource}.
  2. The consumer calls {@link TokenStream#Reset()}.
  3. The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.
  4. The consumer calls {@link #IncrementToken()} until it returns false and consumes the attributes after each call.
  5. The consumer calls {@link #End()} so that any end-of-stream operations can be performed.
  6. The consumer calls {@link #Close()} to release any resource when finished using the TokenStream.

To make sure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in {@link #IncrementToken()}.

You can find some example code for the new API in the analysis package level Javadoc.
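
As an illustration of that workflow, a minimal consumer might look like this (a Java-style sketch matching the other examples in this reference; the analyzer, field name, and reader are placeholders):

            TokenStream stream = analyzer.tokenStream("body", reader);
            stream.reset();
            // Store local references to the attributes of interest.
            TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
            OffsetAttribute offsetAtt = stream.addAttribute(OffsetAttribute.class);
            while (stream.incrementToken()) {
              System.out.println(termAtt.term() + " [" + offsetAtt.startOffset() + "," + offsetAtt.endOffset() + ")");
            }
            stream.end();   // perform any end-of-stream operations
            stream.close(); // release resources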

Sometimes it is desirable to capture the current state of a TokenStream, e.g. for buffering purposes (see {@link CachingTokenFilter}, {@link TeeSinkTokenFilter}). For this use case {@link AttributeSource#CaptureState} and {@link AttributeSource#RestoreState} can be used.
Public class TokenWrapper Obsolete.
This class wraps a Token and supplies a single attribute instance where the delegate token can be replaced.
Public class WhitespaceAnalyzer
An Analyzer that uses {@link WhitespaceTokenizer}.
Public class WhitespaceTokenizer
A WhitespaceTokenizer is a tokenizer that divides text at whitespace. Adjacent sequences of non-whitespace characters form tokens.
Public class WordlistLoader
Loads a text file and adds every line as an entry to a Hashtable. Every line should contain only one word. If the file is not found or on any error, an empty table is returned.
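
For example (a Java-style sketch like the other examples; the file name is a placeholder):

            // Load one word per line from a plain-text file into a word set:
            HashSet stopWords = WordlistLoader.getWordSet(new File("stopwords.txt"));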