Class WikipediaTokenizer
Extension of StandardTokenizer that is aware of Wikipedia syntax. It is based on the Wikipedia tutorial available at http://en.wikipedia.org/wiki/Wikipedia:Tutorial, but it may not be complete.
Note
This API is experimental and might change in incompatible ways in the next release.
Implements
Inherited Members
Namespace: Lucene.Net.Analysis.Wikipedia
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
public sealed class WikipediaTokenizer : Tokenizer, IDisposable
Constructors
WikipediaTokenizer(AttributeFactory, TextReader, int, ICollection<string>)
Creates a new instance of the WikipediaTokenizer. Attaches the input to the newly created JFlex scanner. Uses the given Lucene.Net.Util.AttributeSource.AttributeFactory.
Declaration
public WikipediaTokenizer(AttributeSource.AttributeFactory factory, TextReader input, int tokenOutput, ICollection<string> untokenizedTypes)
Parameters
Type | Name | Description |
---|---|---|
AttributeSource.AttributeFactory | factory | The Lucene.Net.Util.AttributeSource.AttributeFactory |
TextReader | input | The input |
int | tokenOutput | One of TOKENS_ONLY, UNTOKENIZED_ONLY, BOTH |
ICollection<string> | untokenizedTypes | Untokenized types |
WikipediaTokenizer(TextReader)
Creates a new instance of the WikipediaTokenizer. Attaches the input to a newly created JFlex scanner.
Declaration
public WikipediaTokenizer(TextReader input)
Parameters
Type | Name | Description |
---|---|---|
TextReader | input | The Input TextReader |
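A minimal sketch of how a consumer might drive a tokenizer built with this constructor, following the Reset(), IncrementToken(), End(), Dispose() sequence listed under Methods. The WikipediaTokenizerDemo class and its Tokenize helper are hypothetical names used only for illustration; the attribute interfaces are assumed to come from Lucene.Net.Analysis.TokenAttributes.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Wikipedia;

// Hypothetical helper class, for illustration only.
public static class WikipediaTokenizerDemo
{
    // Returns each token's text (Key) paired with its type string (Value).
    public static List<KeyValuePair<string, string>> Tokenize(string wikiText)
    {
        var result = new List<KeyValuePair<string, string>>();

        // The using block calls Dispose(), which closes the input reader.
        using (var tokenizer = new WikipediaTokenizer(new StringReader(wikiText)))
        {
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            ITypeAttribute type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();                  // must be called before the first IncrementToken()
            while (tokenizer.IncrementToken())  // returns false once the input is exhausted
            {
                result.Add(new KeyValuePair<string, string>(term.ToString(), type.Type));
            }
            tokenizer.End();                    // end-of-stream bookkeeping
        }
        return result;
    }
}
```

Calling WikipediaTokenizerDemo.Tokenize with a string of wiki markup yields one entry per token; with this constructor no untokenized types are requested, so only the individual tokens are expected.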
WikipediaTokenizer(TextReader, int, ICollection<string>)
Creates a new instance of the WikipediaTokenizer. Attaches the input to the newly created JFlex scanner.
Declaration
public WikipediaTokenizer(TextReader input, int tokenOutput, ICollection<string> untokenizedTypes)
Parameters
Type | Name | Description |
---|---|---|
TextReader | input | The input |
int | tokenOutput | One of TOKENS_ONLY, UNTOKENIZED_ONLY, BOTH |
ICollection<string> | untokenizedTypes | Untokenized types |
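The tokenOutput and untokenizedTypes parameters can be illustrated with a short sketch. It assumes that untokenizedTypes holds type strings such as CATEGORY or INTERNAL_LINK from the Fields section; the UntokenizedOutputDemo class, its Describe helper, and the choice of types are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Wikipedia;

// Hypothetical helper class, for illustration only.
public static class UntokenizedOutputDemo
{
    // Returns "term [type]" strings for every token the tokenizer emits.
    public static List<string> Describe(string wikiText)
    {
        // Assumption: these type strings select which constructs are also
        // emitted as a single untokenized token.
        var untokenizedTypes = new HashSet<string>
        {
            WikipediaTokenizer.CATEGORY,
            WikipediaTokenizer.INTERNAL_LINK
        };

        var lines = new List<string>();
        using (var tokenizer = new WikipediaTokenizer(
            new StringReader(wikiText), WikipediaTokenizer.BOTH, untokenizedTypes))
        {
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            ITypeAttribute type = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                lines.Add(term.ToString() + " [" + type.Type + "]");
            }
            tokenizer.End();
        }
        return lines;
    }
}
```

Passing WikipediaTokenizer.TOKENS_ONLY or WikipediaTokenizer.UNTOKENIZED_ONLY instead of BOTH should restrict the output to one of the two kinds, per the field descriptions below.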
Fields
ACRONYM_ID
Declaration
public const int ACRONYM_ID = 2
Field Value
Type | Description |
---|---|
int |
ALPHANUM_ID
Declaration
public const int ALPHANUM_ID = 0
Field Value
Type | Description |
---|---|
int |
APOSTROPHE_ID
Declaration
public const int APOSTROPHE_ID = 1
Field Value
Type | Description |
---|---|
int |
BOLD
Declaration
public const string BOLD = "b"
Field Value
Type | Description |
---|---|
string |
BOLD_ID
Declaration
public const int BOLD_ID = 12
Field Value
Type | Description |
---|---|
int |
BOLD_ITALICS
Declaration
public const string BOLD_ITALICS = "bi"
Field Value
Type | Description |
---|---|
string |
BOLD_ITALICS_ID
Declaration
public const int BOLD_ITALICS_ID = 14
Field Value
Type | Description |
---|---|
int |
BOTH
Output both the untokenized token and the splits
Declaration
public const int BOTH = 2
Field Value
Type | Description |
---|---|
int |
CATEGORY
Declaration
public const string CATEGORY = "c"
Field Value
Type | Description |
---|---|
string |
CATEGORY_ID
Declaration
public const int CATEGORY_ID = 11
Field Value
Type | Description |
---|---|
int |
CITATION
Declaration
public const string CITATION = "ci"
Field Value
Type | Description |
---|---|
string |
CITATION_ID
Declaration
public const int CITATION_ID = 10
Field Value
Type | Description |
---|---|
int |
CJ_ID
Declaration
public const int CJ_ID = 7
Field Value
Type | Description |
---|---|
int |
COMPANY_ID
Declaration
public const int COMPANY_ID = 3
Field Value
Type | Description |
---|---|
int |
EMAIL_ID
Declaration
public const int EMAIL_ID = 4
Field Value
Type | Description |
---|---|
int |
EXTERNAL_LINK
Declaration
public const string EXTERNAL_LINK = "el"
Field Value
Type | Description |
---|---|
string |
EXTERNAL_LINK_ID
Declaration
public const int EXTERNAL_LINK_ID = 9
Field Value
Type | Description |
---|---|
int |
EXTERNAL_LINK_URL
Declaration
public const string EXTERNAL_LINK_URL = "elu"
Field Value
Type | Description |
---|---|
string |
EXTERNAL_LINK_URL_ID
Declaration
public const int EXTERNAL_LINK_URL_ID = 17
Field Value
Type | Description |
---|---|
int |
HEADING
Declaration
public const string HEADING = "h"
Field Value
Type | Description |
---|---|
string |
HEADING_ID
Declaration
public const int HEADING_ID = 15
Field Value
Type | Description |
---|---|
int |
HOST_ID
Declaration
public const int HOST_ID = 5
Field Value
Type | Description |
---|---|
int |
INTERNAL_LINK
Declaration
public const string INTERNAL_LINK = "il"
Field Value
Type | Description |
---|---|
string |
INTERNAL_LINK_ID
Declaration
public const int INTERNAL_LINK_ID = 8
Field Value
Type | Description |
---|---|
int |
ITALICS
Declaration
public const string ITALICS = "i"
Field Value
Type | Description |
---|---|
string |
ITALICS_ID
Declaration
public const int ITALICS_ID = 13
Field Value
Type | Description |
---|---|
int |
NUM_ID
Declaration
public const int NUM_ID = 6
Field Value
Type | Description |
---|---|
int |
SUB_HEADING
Declaration
public const string SUB_HEADING = "sh"
Field Value
Type | Description |
---|---|
string |
SUB_HEADING_ID
Declaration
public const int SUB_HEADING_ID = 16
Field Value
Type | Description |
---|---|
int |
TOKENS_ONLY
Only output tokens
Declaration
public const int TOKENS_ONLY = 0
Field Value
Type | Description |
---|---|
int |
TOKEN_TYPES
String token types that correspond to token type int constants
Declaration
public static readonly string[] TOKEN_TYPES
Field Value
Type | Description |
---|---|
string[] |
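As a sketch of how this array might be used, the helper below maps one of the int constants (for example BOLD_ID) to its type string. It assumes that TOKEN_TYPES is indexed by the *_ID constants, which the description suggests but does not state outright; TokenTypeLookup and NameOf are hypothetical names.

```csharp
using System;
using Lucene.Net.Analysis.Wikipedia;

// Hypothetical helper class, for illustration only.
public static class TokenTypeLookup
{
    // Assumption: TOKEN_TYPES[id] is the type string for the matching *_ID constant.
    public static string NameOf(int tokenTypeId)
    {
        string[] types = WikipediaTokenizer.TOKEN_TYPES;
        if (tokenTypeId < 0 || tokenTypeId >= types.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(tokenTypeId));
        }
        return types[tokenTypeId];
    }
}
```

Under that assumption, TokenTypeLookup.NameOf(WikipediaTokenizer.BOLD_ID) would return the same string as WikipediaTokenizer.BOLD.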
UNTOKENIZED_ONLY
Only output untokenized tokens, which are tokens that would normally be split into several tokens
Declaration
public const int UNTOKENIZED_ONLY = 1
Field Value
Type | Description |
---|---|
int |
UNTOKENIZED_TOKEN_FLAG
This flag indicates that the produced token would have been split into multiple tokens if TOKENS_ONLY had been used.
Declaration
public const int UNTOKENIZED_TOKEN_FLAG = 1
Field Value
Type | Description |
---|---|
int |
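A consumer could test for this flag along the following lines. This is a sketch under the assumption that the flag is exposed through IFlagsAttribute from Lucene.Net.Analysis.TokenAttributes; the UntokenizedFlagDemo class, its UntokenizedTerms helper, and the choice of CATEGORY as the untokenized type are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Wikipedia;

// Hypothetical helper class, for illustration only.
public static class UntokenizedFlagDemo
{
    // Collects the text of every token that carries UNTOKENIZED_TOKEN_FLAG.
    public static List<string> UntokenizedTerms(string wikiText)
    {
        var untokenizedTypes = new HashSet<string> { WikipediaTokenizer.CATEGORY };
        var flagged = new List<string>();

        using (var tokenizer = new WikipediaTokenizer(
            new StringReader(wikiText), WikipediaTokenizer.BOTH, untokenizedTypes))
        {
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
            IFlagsAttribute flags = tokenizer.AddAttribute<IFlagsAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // Bitwise test, so any other flag bits are ignored.
                if ((flags.Flags & WikipediaTokenizer.UNTOKENIZED_TOKEN_FLAG) != 0)
                {
                    flagged.Add(term.ToString());
                }
            }
            tokenizer.End();
        }
        return flagged;
    }
}
```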
Methods
Dispose(bool)
Releases resources associated with this stream.
If you override this method, always call base.Dispose(disposing); otherwise some internal state will not be correctly reset (e.g., Lucene.Net.Analysis.Tokenizer will throw InvalidOperationException on reuse).
Declaration
protected override void Dispose(bool disposing)
Parameters
Type | Name | Description |
---|---|---|
bool | disposing |
Overrides
Remarks
NOTE: The default implementation closes the input TextReader, so be sure to call base.Dispose(disposing) when overriding this method.
End()
This method is called by the consumer after the last token has been consumed, after Lucene.Net.Analysis.TokenStream.IncrementToken() returned false (using the new Lucene.Net.Analysis.TokenStream API). Streams implementing the old API should upgrade to use this feature.
If you override this method, always call base.End().
Declaration
public override void End()
Overrides
Exceptions
Type | Condition |
---|---|
IOException | If an I/O error occurs |
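The call order described above can be sketched as follows. In Lucene's TokenStream contract, End() is typically where a stream records its final offset, so the sketch reads IOffsetAttribute only after End() has returned; FinalOffsetDemo and FinalOffset are hypothetical names, and the exact offset value returned is not specified here.

```csharp
using System.IO;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Wikipedia;

// Hypothetical helper class, for illustration only.
public static class FinalOffsetDemo
{
    // Consumes the whole stream, then reads the end offset recorded by End().
    public static int FinalOffset(string wikiText)
    {
        using (var tokenizer = new WikipediaTokenizer(new StringReader(wikiText)))
        {
            IOffsetAttribute offset = tokenizer.AddAttribute<IOffsetAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                // Tokens are ignored here; only the end-of-stream state matters.
            }
            tokenizer.End();   // called once, after IncrementToken() returned false
            return offset.EndOffset;
        }
    }
}
```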
IncrementToken()
Lucene.Net.Analysis.TokenStream.IncrementToken()
Declaration
public override sealed bool IncrementToken()
Returns
Type | Description |
---|---|
bool |
Overrides
Reset()
Lucene.Net.Analysis.TokenStream.Reset()
Declaration
public override void Reset()