Class WikipediaTokenizer
Extension of StandardTokenizer that is aware of Wikipedia syntax. It is based on the Wikipedia tutorial available at http://en.wikipedia.org/wiki/Wikipedia:Tutorial, but it may not be complete.
Note
This API is experimental and might change in incompatible ways in the next release.
Namespace: Lucene.Net.Analysis.Wikipedia
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
public sealed class WikipediaTokenizer : Tokenizer, IDisposable
Constructors
WikipediaTokenizer(AttributeFactory, TextReader, int, ICollection<string>)
Creates a new instance of the WikipediaTokenizer. Attaches the
input to a newly created JFlex scanner. Uses the given Lucene.Net.Util.AttributeSource.AttributeFactory.
Declaration
public WikipediaTokenizer(AttributeSource.AttributeFactory factory, TextReader input, int tokenOutput, ICollection<string> untokenizedTypes)
Parameters
| Type | Name | Description |
|---|---|---|
| AttributeSource.AttributeFactory | factory | The Lucene.Net.Util.AttributeSource.AttributeFactory |
| TextReader | input | The input |
| int | tokenOutput | One of TOKENS_ONLY, UNTOKENIZED_ONLY, BOTH |
| ICollection<string> | untokenizedTypes | The token types to output untokenized, for example CATEGORY or INTERNAL_LINK |
WikipediaTokenizer(TextReader)
Creates a new instance of the WikipediaTokenizer. Attaches the
input to a newly created JFlex scanner.
Declaration
public WikipediaTokenizer(TextReader input)
Parameters
| Type | Name | Description |
|---|---|---|
| TextReader | input | The input TextReader |
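The following is a minimal sketch of how a consumer might read tokens from a tokenizer created with this constructor. It assumes the usual Lucene.NET token stream workflow (Reset, IncrementToken, End, Dispose) and the ICharTermAttribute and ITypeAttribute interfaces from Lucene.Net.Analysis.TokenAttributes. The helper name TokenizeWiki and the sample markup are hypothetical and written as C# top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Wikipedia;

// Hypothetical helper: collects (term, type) pairs from a wiki-markup string.
static List<(string Term, string Type)> TokenizeWiki(string markup)
{
    var result = new List<(string Term, string Type)>();
    using (var tokenizer = new WikipediaTokenizer(new StringReader(markup)))
    {
        // Attributes are registered once and updated in place for every token.
        ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();
        ITypeAttribute type = tokenizer.AddAttribute<ITypeAttribute>();

        tokenizer.Reset();                    // must be called before the first IncrementToken()
        while (tokenizer.IncrementToken())    // returns false once the input is exhausted
        {
            result.Add((term.ToString(), type.Type));
        }
        tokenizer.End();                      // end-of-stream call after the last token
    }                                         // Dispose() closes the underlying reader
    return result;
}

// Sample markup is illustrative only.
foreach (var (term, type) in TokenizeWiki("'''Lucene''' is a [[search engine]] library."))
{
    Console.WriteLine($"{term}\t{type}");
}
```

The type printed for each token is expected to be one of the strings in TOKEN_TYPES, such as BOLD or INTERNAL_LINK for text inside the corresponding markup.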
WikipediaTokenizer(TextReader, int, ICollection<string>)
Creates a new instance of the WikipediaTokenizer. Attaches the
input to a newly created JFlex scanner.
Declaration
public WikipediaTokenizer(TextReader input, int tokenOutput, ICollection<string> untokenizedTypes)
Parameters
| Type | Name | Description |
|---|---|---|
| TextReader | input | The input |
| int | tokenOutput | One of TOKENS_ONLY, UNTOKENIZED_ONLY, BOTH |
| ICollection<string> | untokenizedTypes | The token types to output untokenized, for example CATEGORY or INTERNAL_LINK |
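As a rough illustration of the tokenOutput and untokenizedTypes parameters, the sketch below requests BOTH output for category and internal-link tokens and uses UNTOKENIZED_TOKEN_FLAG to tell the untokenized tokens apart from their splits. It assumes the IFlagsAttribute interface from Lucene.Net.Analysis.TokenAttributes carries the flag. The helper name TokenizeBoth and the sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Wikipedia;

// Hypothetical helper: returns each token with its type and whether it was emitted untokenized.
static List<(string Term, string Type, bool Untokenized)> TokenizeBoth(string markup)
{
    // Token types that should also be emitted as one untokenized token.
    var untokenizedTypes = new HashSet<string>
    {
        WikipediaTokenizer.CATEGORY,
        WikipediaTokenizer.INTERNAL_LINK
    };

    var result = new List<(string Term, string Type, bool Untokenized)>();
    using (var tokenizer = new WikipediaTokenizer(
        new StringReader(markup), WikipediaTokenizer.BOTH, untokenizedTypes))
    {
        var term = tokenizer.AddAttribute<ICharTermAttribute>();
        var type = tokenizer.AddAttribute<ITypeAttribute>();
        var flags = tokenizer.AddAttribute<IFlagsAttribute>();

        tokenizer.Reset();
        while (tokenizer.IncrementToken())
        {
            // Assumption: untokenized tokens have UNTOKENIZED_TOKEN_FLAG set in their flags.
            bool untokenized = (flags.Flags & WikipediaTokenizer.UNTOKENIZED_TOKEN_FLAG) != 0;
            result.Add((term.ToString(), type.Type, untokenized));
        }
        tokenizer.End();
    }
    return result;
}

// Sample markup is illustrative only.
foreach (var token in TokenizeBoth("[[Category:Search engines]] and [[Apache Lucene]]"))
{
    Console.WriteLine($"{token.Term}\t{token.Type}\tuntokenized={token.Untokenized}");
}
```

Passing TOKENS_ONLY or UNTOKENIZED_ONLY instead of BOTH should restrict the output to the splits or to the untokenized tokens, respectively.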
Fields
ACRONYM_ID
Declaration
public const int ACRONYM_ID = 2
Field Value
| Type | Description |
|---|---|
| int |
ALPHANUM_ID
Declaration
public const int ALPHANUM_ID = 0
Field Value
| Type | Description |
|---|---|
| int |
APOSTROPHE_ID
Declaration
public const int APOSTROPHE_ID = 1
Field Value
| Type | Description |
|---|---|
| int |
BOLD
Declaration
public const string BOLD = "b"
Field Value
| Type | Description |
|---|---|
| string |
BOLD_ID
Declaration
public const int BOLD_ID = 12
Field Value
| Type | Description |
|---|---|
| int |
BOLD_ITALICS
Declaration
public const string BOLD_ITALICS = "bi"
Field Value
| Type | Description |
|---|---|
| string |
BOLD_ITALICS_ID
Declaration
public const int BOLD_ITALICS_ID = 14
Field Value
| Type | Description |
|---|---|
| int |
BOTH
Output both the untokenized token and the splits.
Declaration
public const int BOTH = 2
Field Value
| Type | Description |
|---|---|
| int |
CATEGORY
Declaration
public const string CATEGORY = "c"
Field Value
| Type | Description |
|---|---|
| string |
CATEGORY_ID
Declaration
public const int CATEGORY_ID = 11
Field Value
| Type | Description |
|---|---|
| int |
CITATION
Declaration
public const string CITATION = "ci"
Field Value
| Type | Description |
|---|---|
| string |
CITATION_ID
Declaration
public const int CITATION_ID = 10
Field Value
| Type | Description |
|---|---|
| int |
CJ_ID
Declaration
public const int CJ_ID = 7
Field Value
| Type | Description |
|---|---|
| int |
COMPANY_ID
Declaration
public const int COMPANY_ID = 3
Field Value
| Type | Description |
|---|---|
| int |
EMAIL_ID
Declaration
public const int EMAIL_ID = 4
Field Value
| Type | Description |
|---|---|
| int |
EXTERNAL_LINK
Declaration
public const string EXTERNAL_LINK = "el"
Field Value
| Type | Description |
|---|---|
| string |
EXTERNAL_LINK_ID
Declaration
public const int EXTERNAL_LINK_ID = 9
Field Value
| Type | Description |
|---|---|
| int |
EXTERNAL_LINK_URL
Declaration
public const string EXTERNAL_LINK_URL = "elu"
Field Value
| Type | Description |
|---|---|
| string |
EXTERNAL_LINK_URL_ID
Declaration
public const int EXTERNAL_LINK_URL_ID = 17
Field Value
| Type | Description |
|---|---|
| int |
HEADING
Declaration
public const string HEADING = "h"
Field Value
| Type | Description |
|---|---|
| string |
HEADING_ID
Declaration
public const int HEADING_ID = 15
Field Value
| Type | Description |
|---|---|
| int |
HOST_ID
Declaration
public const int HOST_ID = 5
Field Value
| Type | Description |
|---|---|
| int |
INTERNAL_LINK
Declaration
public const string INTERNAL_LINK = "il"
Field Value
| Type | Description |
|---|---|
| string |
INTERNAL_LINK_ID
Declaration
public const int INTERNAL_LINK_ID = 8
Field Value
| Type | Description |
|---|---|
| int |
ITALICS
Declaration
public const string ITALICS = "i"
Field Value
| Type | Description |
|---|---|
| string |
ITALICS_ID
Declaration
public const int ITALICS_ID = 13
Field Value
| Type | Description |
|---|---|
| int |
NUM_ID
Declaration
public const int NUM_ID = 6
Field Value
| Type | Description |
|---|---|
| int |
SUB_HEADING
Declaration
public const string SUB_HEADING = "sh"
Field Value
| Type | Description |
|---|---|
| string |
SUB_HEADING_ID
Declaration
public const int SUB_HEADING_ID = 16
Field Value
| Type | Description |
|---|---|
| int |
TOKENS_ONLY
Only output tokens
Declaration
public const int TOKENS_ONLY = 0
Field Value
| Type | Description |
|---|---|
| int |
TOKEN_TYPES
String token types that correspond to token type int constants
Declaration
public static readonly string[] TOKEN_TYPES
Field Value
| Type | Description |
|---|---|
| string[] |
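A small sketch of looking up a type string from one of the int constants follows. It assumes TOKEN_TYPES is indexed by the *_ID constants documented on this page, which the field description suggests but does not state outright. The helper name TypeName is hypothetical.

```csharp
using System;
using Lucene.Net.Analysis.Wikipedia;

// Hypothetical helper: maps a token type ID to its string form.
// Assumption: TOKEN_TYPES is indexed by the *_ID constants.
static string TypeName(int typeId) => WikipediaTokenizer.TOKEN_TYPES[typeId];

Console.WriteLine(TypeName(WikipediaTokenizer.BOLD_ID));
Console.WriteLine(TypeName(WikipediaTokenizer.CATEGORY_ID));
```

Under that assumption, the strings returned for the Wikipedia-specific IDs should match the corresponding string constants, such as BOLD and CATEGORY.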
UNTOKENIZED_ONLY
Only output untokenized tokens, which are tokens that would normally be split into several tokens
Declaration
public const int UNTOKENIZED_ONLY = 1
Field Value
| Type | Description |
|---|---|
| int |
UNTOKENIZED_TOKEN_FLAG
This flag indicates that the produced token would have been split into multiple tokens if TOKENS_ONLY had been used.
Declaration
public const int UNTOKENIZED_TOKEN_FLAG = 1
Field Value
| Type | Description |
|---|---|
| int |
Methods
Dispose(bool)
Releases resources associated with this stream.
If you override this method, always call base.Dispose(disposing); otherwise
some internal state will not be correctly reset (e.g., Lucene.Net.Analysis.Tokenizer will
throw InvalidOperationException on reuse).
Declaration
protected override void Dispose(bool disposing)
Parameters
| Type | Name | Description |
|---|---|---|
| bool | disposing |
Remarks
NOTE:
The default implementation closes the input TextReader, so
be sure to call base.Dispose(disposing) when overriding this method.
End()
This method is called by the consumer after the last token has been
consumed, that is, after Lucene.Net.Analysis.TokenStream.IncrementToken() has returned false
(using the new Lucene.Net.Analysis.TokenStream API). Streams implementing the old API
should upgrade to use this feature.
If you override this method, always call base.End().
Declaration
public override void End()
Exceptions
| Type | Condition |
|---|---|
| IOException | If an I/O error occurs |
IncrementToken()
Lucene.Net.Analysis.TokenStream.IncrementToken()
Declaration
public override sealed bool IncrementToken()
Returns
| Type | Description |
|---|---|
| bool |
Reset()
Lucene.Net.Analysis.TokenStream.Reset()
Declaration
public override void Reset()