Lucene.Net  3.0.3
Lucene.Net is a .NET port of the Java Lucene Indexing Library
Lucene.Net.Analysis.Standard.StandardTokenizer Class Reference

A grammar-based tokenizer constructed with JFlex.

Inherits Lucene.Net.Analysis.Tokenizer.

Public Member Functions

 StandardTokenizer (Version matchVersion, System.IO.TextReader input)
 Creates a new instance of the Lucene.Net.Analysis.Standard.StandardTokenizer. Attaches the input to the newly created JFlex scanner.
 
 StandardTokenizer (Version matchVersion, AttributeSource source, System.IO.TextReader input)
 Creates a new StandardTokenizer with a given AttributeSource.
 
 StandardTokenizer (Version matchVersion, AttributeFactory factory, System.IO.TextReader input)
 Creates a new StandardTokenizer with a given Lucene.Net.Util.AttributeSource.AttributeFactory.
 
override bool IncrementToken ()
 Consumers (e.g., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate Util.Attributes with the attributes of the next token.
 
override void End ()
 This method is called by the consumer after the last token has been consumed, i.e. after IncrementToken returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature. This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the offset of the last token, e.g. when one or more whitespace characters followed the last token and a WhitespaceTokenizer was used.
 
override void Reset (System.IO.TextReader reader)
 Expert: Reset the tokenizer to a new reader. Typically, an analyzer (in its reusableTokenStream method) will use this to re-use a previously created tokenizer.
 
void SetReplaceInvalidAcronym (bool replaceInvalidAcronym)
 Deprecated: to be removed in 3.x, when true will become the only valid value. See https://issues.apache.org/jira/browse/LUCENE-1068
 

Public Attributes

const int ALPHANUM = 0
 
const int APOSTROPHE = 1
 
const int ACRONYM = 2
 
const int COMPANY = 3
 
const int EMAIL = 4
 
const int HOST = 5
 
const int NUM = 6
 
const int CJ = 7
 
const int ACRONYM_DEP = 8
 

Static Public Attributes

static readonly System.String[] TOKEN_TYPES = new System.String[]{"<ALPHANUM>", "<APOSTROPHE>", "<ACRONYM>", "<COMPANY>", "<EMAIL>", "<HOST>", "<NUM>", "<CJ>", "<ACRONYM_DEP>"}
 String token types that correspond to the token type int constants.
 

Properties

int MaxTokenLength [get, set]
 Gets or sets the maximum allowed token length. Any token longer than this is skipped.
 

Additional Inherited Members

- Protected Member Functions inherited from Lucene.Net.Analysis.Tokenizer
override void Dispose (bool disposing)
 

Detailed Description

A grammar-based tokenizer constructed with JFlex

This should be a good tokenizer for most European-language documents:

- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

You must specify the required Version compatibility when creating StandardTokenizer: as of 2.4, tokens incorrectly identified as acronyms are corrected (see LUCENE-1068).

Definition at line 56 of file StandardTokenizer.cs.
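
A minimal consumption loop might look like the following sketch. It assumes the Lucene.Net 3.0.3 package is referenced; the sample text is an arbitrary placeholder. Attribute references are retrieved once, up front, as recommended in the IncrementToken documentation below.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Lucene.Net.Util;

class TokenizeExample
{
    static void Main()
    {
        var text = "Visit lucene.apache.org for more information";
        var tokenizer = new StandardTokenizer(Version.LUCENE_30, new StringReader(text));

        // Retrieve attribute references once, before the loop, not per token.
        var termAttr = tokenizer.AddAttribute<ITermAttribute>();
        var typeAttr = tokenizer.AddAttribute<ITypeAttribute>();

        while (tokenizer.IncrementToken())
        {
            // Prints each token's text and its type, e.g. "<ALPHANUM>" or "<HOST>".
            Console.WriteLine("{0}\t{1}", termAttr.Term, typeAttr.Type);
        }
        tokenizer.End();     // finalize end-of-stream state (e.g. final offset)
        tokenizer.Dispose();
    }
}
```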

Constructor & Destructor Documentation

Lucene.Net.Analysis.Standard.StandardTokenizer.StandardTokenizer ( Version  matchVersion,
System.IO.TextReader  input 
)

Creates a new instance of the Lucene.Net.Analysis.Standard.StandardTokenizer. Attaches the input to the newly created JFlex scanner.

Parameters
matchVersion
input	The input reader

See http://issues.apache.org/jira/browse/LUCENE-1068

Definition at line 106 of file StandardTokenizer.cs.

Lucene.Net.Analysis.Standard.StandardTokenizer.StandardTokenizer ( Version  matchVersion,
AttributeSource  source,
System.IO.TextReader  input 
)

Creates a new StandardTokenizer with a given AttributeSource.

Definition at line 114 of file StandardTokenizer.cs.

Lucene.Net.Analysis.Standard.StandardTokenizer.StandardTokenizer ( Version  matchVersion,
AttributeFactory  factory,
System.IO.TextReader  input 
)

Creates a new StandardTokenizer with a given Lucene.Net.Util.AttributeSource.AttributeFactory.

Definition at line 124 of file StandardTokenizer.cs.

Member Function Documentation

override void Lucene.Net.Analysis.Standard.StandardTokenizer.End ( )
virtual

This method is called by the consumer after the last token has been consumed, i.e. after IncrementToken returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature. This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the offset of the last token, e.g. when one or more whitespace characters followed the last token and a WhitespaceTokenizer was used.

Throws: IOException

Reimplemented from Lucene.Net.Analysis.TokenStream.

Definition at line 207 of file StandardTokenizer.cs.

override bool Lucene.Net.Analysis.Standard.StandardTokenizer.IncrementToken ( )
virtual

Consumers (e.g., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate Util.Attributes with the attributes of the next token.

The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use AttributeSource.CaptureState to create a copy of the current attribute state.

This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to AttributeSource.AddAttribute{T}() and AttributeSource.GetAttribute{T}(), references to all Util.Attributes that this stream uses should be retrieved during instantiation.

To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in IncrementToken().

Returns
false for end of stream; true otherwise

Implements Lucene.Net.Analysis.TokenStream.

Definition at line 159 of file StandardTokenizer.cs.
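
The producer-side pattern described above can be sketched with a hypothetical TokenStream subclass (the class name and its fixed-list behavior are illustrative, not part of Lucene.Net): attributes are added and their references cached at construction time, so IncrementToken never calls AddAttribute or GetAttribute.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

// Hypothetical stream that emits a fixed list of terms; illustrates caching
// attribute references during instantiation instead of inside IncrementToken.
class ListTokenStream : TokenStream
{
    private readonly string[] terms;
    private int index;
    private readonly ITermAttribute termAttr; // retrieved once, reused every call

    public ListTokenStream(string[] terms)
    {
        this.terms = terms;
        // Adding the attribute here ensures filters and consumers can see it
        // before the first call to IncrementToken().
        termAttr = AddAttribute<ITermAttribute>();
    }

    public override bool IncrementToken()
    {
        if (index >= terms.Length) return false; // end of stream
        ClearAttributes();
        termAttr.SetTermBuffer(terms[index++]);
        return true;
    }

    protected override void Dispose(bool disposing) { }
}
```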

override void Lucene.Net.Analysis.Standard.StandardTokenizer.Reset ( System.IO.TextReader  input)
virtual

Expert: Reset the tokenizer to a new reader. Typically, an analyzer (in its reusableTokenStream method) will use this to re-use a previously created tokenizer.

Reimplemented from Lucene.Net.Analysis.Tokenizer.

Definition at line 214 of file StandardTokenizer.cs.
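
The reuse pattern described above, sketched under the assumption of two input documents:

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// Reuse one tokenizer instance across documents instead of allocating a new one.
var tokenizer = new StandardTokenizer(Version.LUCENE_30, new StringReader("first document"));
// ... consume tokens via IncrementToken() ...
tokenizer.Reset(new StringReader("second document")); // points the scanner at the new reader
// ... consume tokens again ...
```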

void Lucene.Net.Analysis.Standard.StandardTokenizer.SetReplaceInvalidAcronym ( bool  replaceInvalidAcronym)

Deprecated: to be removed in 3.x, when true will become the only valid value. See https://issues.apache.org/jira/browse/LUCENE-1068

Parameters
replaceInvalidAcronym	Set to true to replace mischaracterized acronyms as HOST.

Definition at line 227 of file StandardTokenizer.cs.

Member Data Documentation

const int Lucene.Net.Analysis.Standard.StandardTokenizer.ACRONYM = 2

Definition at line 67 of file StandardTokenizer.cs.

const int Lucene.Net.Analysis.Standard.StandardTokenizer.ACRONYM_DEP = 8

Deprecated: this solves a bug where HOSTs that end with '.' are identified as ACRONYMs.

Definition at line 78 of file StandardTokenizer.cs.

const int Lucene.Net.Analysis.Standard.StandardTokenizer.ALPHANUM = 0

Definition at line 65 of file StandardTokenizer.cs.

const int Lucene.Net.Analysis.Standard.StandardTokenizer.APOSTROPHE = 1

Definition at line 66 of file StandardTokenizer.cs.

const int Lucene.Net.Analysis.Standard.StandardTokenizer.CJ = 7

Definition at line 72 of file StandardTokenizer.cs.

const int Lucene.Net.Analysis.Standard.StandardTokenizer.COMPANY = 3

Definition at line 68 of file StandardTokenizer.cs.

const int Lucene.Net.Analysis.Standard.StandardTokenizer.EMAIL = 4

Definition at line 69 of file StandardTokenizer.cs.

const int Lucene.Net.Analysis.Standard.StandardTokenizer.HOST = 5

Definition at line 70 of file StandardTokenizer.cs.

const int Lucene.Net.Analysis.Standard.StandardTokenizer.NUM = 6

Definition at line 71 of file StandardTokenizer.cs.

readonly System.String [] Lucene.Net.Analysis.Standard.StandardTokenizer.TOKEN_TYPES = new System.String[]{"<ALPHANUM>", "<APOSTROPHE>", "<ACRONYM>", "<COMPANY>", "<EMAIL>", "<HOST>", "<NUM>", "<CJ>", "<ACRONYM_DEP>"}
static

String token types that correspond to token type int constants

Definition at line 81 of file StandardTokenizer.cs.
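
The correspondence between the int constants and TOKEN_TYPES can be shown with a self-contained sketch; the local array below mirrors the field's documented value rather than referencing the library.

```csharp
using System;

class TokenTypesDemo
{
    // Mirrors StandardTokenizer.TOKEN_TYPES; index it with the int constants
    // (ALPHANUM = 0, APOSTROPHE = 1, ..., ACRONYM_DEP = 8).
    static readonly string[] TOKEN_TYPES =
    {
        "<ALPHANUM>", "<APOSTROPHE>", "<ACRONYM>", "<COMPANY>",
        "<EMAIL>", "<HOST>", "<NUM>", "<CJ>", "<ACRONYM_DEP>"
    };

    const int HOST = 5; // StandardTokenizer.HOST

    static void Main()
    {
        Console.WriteLine(TOKEN_TYPES[HOST]); // prints "<HOST>"
    }
}
```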

Property Documentation

int Lucene.Net.Analysis.Standard.StandardTokenizer.MaxTokenLength
get, set

Gets or sets the maximum allowed token length. Any token longer than this is skipped.

Definition at line 91 of file StandardTokenizer.cs.
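
For example (a sketch, assuming the 3.0.3 API; `text` is a placeholder):

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

var text = "some input";
var tokenizer = new StandardTokenizer(Version.LUCENE_30, new StringReader(text));
tokenizer.MaxTokenLength = 32; // any token longer than 32 characters is skipped
```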


The documentation for this class was generated from the following file:
StandardTokenizer.cs