    Class WikipediaTokenizer

    An extension of StandardTokenizer that is aware of Wikipedia syntax. It is based on the Wikipedia tutorial available at http://en.wikipedia.org/wiki/Wikipedia:Tutorial, but it may not cover all Wikipedia markup.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Inheritance
    object
    AttributeSource
    TokenStream
    Tokenizer
    WikipediaTokenizer
    Implements
    IDisposable
    Inherited Members
    Tokenizer.SetReader(TextReader)
    TokenStream.Dispose()
    AttributeSource.GetAttributeFactory()
    AttributeSource.GetAttributeClassesEnumerator()
    AttributeSource.GetAttributeImplsEnumerator()
    AttributeSource.AddAttributeImpl(Attribute)
    AttributeSource.AddAttribute<T>()
    AttributeSource.HasAttributes
    AttributeSource.HasAttribute<T>()
    AttributeSource.GetAttribute<T>()
    AttributeSource.ClearAttributes()
    AttributeSource.CaptureState()
    AttributeSource.RestoreState(AttributeSource.State)
    AttributeSource.GetHashCode()
    AttributeSource.Equals(object)
    AttributeSource.ReflectAsString(bool)
    AttributeSource.ReflectWith(IAttributeReflector)
    AttributeSource.CloneAttributes()
    AttributeSource.CopyTo(AttributeSource)
    AttributeSource.ToString()
    object.Equals(object, object)
    object.GetType()
    object.ReferenceEquals(object, object)
    Namespace: Lucene.Net.Analysis.Wikipedia
    Assembly: Lucene.Net.Analysis.Common.dll
    Syntax
    public sealed class WikipediaTokenizer : Tokenizer, IDisposable
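
A minimal usage sketch, assuming the Lucene.Net.Analysis.Common package is referenced. The class and method names (WikipediaTokenizerSketch, Tokenize) are hypothetical and exist only for illustration; the consumer workflow (Reset, IncrementToken, End, Dispose) is the standard TokenStream contract.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Wikipedia;

public static class WikipediaTokenizerSketch
{
    // Hypothetical helper: returns each term paired with its token type string.
    public static List<(string Term, string Type)> Tokenize(string wikiText)
    {
        var result = new List<(string Term, string Type)>();

        // The single-argument constructor emits ordinary tokens only.
        using (var tokenizer = new WikipediaTokenizer(new StringReader(wikiText)))
        {
            ICharTermAttribute termAtt = tokenizer.AddAttribute<ICharTermAttribute>();
            ITypeAttribute typeAtt = tokenizer.AddAttribute<ITypeAttribute>();

            tokenizer.Reset();                      // must be called before the first IncrementToken()
            while (tokenizer.IncrementToken())
            {
                result.Add((termAtt.ToString(), typeAtt.Type));
            }
            tokenizer.End();                        // end-of-stream bookkeeping
        }
        return result;
    }
}
```

Calling `WikipediaTokenizerSketch.Tokenize("This is a [[Category:foo]]")` should report the plain words with the standard alphanumeric type and the category text with the type `WikipediaTokenizer.CATEGORY`.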

    Constructors

    WikipediaTokenizer(AttributeFactory, TextReader, int, ICollection<string>)

    Creates a new instance of the WikipediaTokenizer. Attaches the input to a newly created JFlex scanner. Uses the given Lucene.Net.Util.AttributeSource.AttributeFactory.

    Declaration
    public WikipediaTokenizer(AttributeSource.AttributeFactory factory, TextReader input, int tokenOutput, ICollection<string> untokenizedTypes)
    Parameters
    Type | Name | Description
    AttributeSource.AttributeFactory | factory | The Lucene.Net.Util.AttributeSource.AttributeFactory to use
    TextReader | input | The input TextReader
    int | tokenOutput | One of TOKENS_ONLY, UNTOKENIZED_ONLY, BOTH
    ICollection<string> | untokenizedTypes | The token types (for example CATEGORY or ITALICS) to emit as untokenized tokens

    WikipediaTokenizer(TextReader)

    Creates a new instance of the WikipediaTokenizer. Attaches the input to a newly created JFlex scanner.

    Declaration
    public WikipediaTokenizer(TextReader input)
    Parameters
    Type | Name | Description
    TextReader | input | The input TextReader

    WikipediaTokenizer(TextReader, int, ICollection<string>)

    Creates a new instance of the WikipediaTokenizer. Attaches the input to a newly created JFlex scanner.

    Declaration
    public WikipediaTokenizer(TextReader input, int tokenOutput, ICollection<string> untokenizedTypes)
    Parameters
    Type | Name | Description
    TextReader | input | The input TextReader
    int | tokenOutput | One of TOKENS_ONLY, UNTOKENIZED_ONLY, BOTH
    ICollection<string> | untokenizedTypes | The token types (for example CATEGORY or ITALICS) to emit as untokenized tokens
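
The effect of tokenOutput and untokenizedTypes can be sketched as follows. This is an illustration only: the helper name (TokenOutputSketch.Terms) is hypothetical, and the choice to keep only category links whole is an assumption made for the example.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Wikipedia;

public static class TokenOutputSketch
{
    // Hypothetical helper: collects the terms produced for a given tokenOutput mode.
    public static List<string> Terms(string wikiText, int tokenOutput)
    {
        // Assumption for this sketch: only category links are kept as whole tokens.
        var untokenizedTypes = new HashSet<string> { WikipediaTokenizer.CATEGORY };
        var terms = new List<string>();

        using (var tokenizer = new WikipediaTokenizer(
            new StringReader(wikiText), tokenOutput, untokenizedTypes))
        {
            ICharTermAttribute termAtt = tokenizer.AddAttribute<ICharTermAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                terms.Add(termAtt.ToString());
            }
            tokenizer.End();
        }
        return terms;
    }
}
```

With an input such as `[[Category:a b c d]]`, TOKENS_ONLY should yield the individual words, UNTOKENIZED_ONLY the whole category text as one token, and BOTH the whole text followed by its splits.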

    Fields

    ACRONYM_ID

    Declaration
    public const int ACRONYM_ID = 2
    Field Value
    Type Description
    int

    ALPHANUM_ID

    Declaration
    public const int ALPHANUM_ID = 0
    Field Value
    Type Description
    int

    APOSTROPHE_ID

    Declaration
    public const int APOSTROPHE_ID = 1
    Field Value
    Type Description
    int

    BOLD

    Declaration
    public const string BOLD = "b"
    Field Value
    Type Description
    string

    BOLD_ID

    Declaration
    public const int BOLD_ID = 12
    Field Value
    Type Description
    int

    BOLD_ITALICS

    Declaration
    public const string BOLD_ITALICS = "bi"
    Field Value
    Type Description
    string

    BOLD_ITALICS_ID

    Declaration
    public const int BOLD_ITALICS_ID = 14
    Field Value
    Type Description
    int

    BOTH

    Output both the untokenized token and its splits.

    Declaration
    public const int BOTH = 2
    Field Value
    Type Description
    int

    CATEGORY

    Declaration
    public const string CATEGORY = "c"
    Field Value
    Type Description
    string

    CATEGORY_ID

    Declaration
    public const int CATEGORY_ID = 11
    Field Value
    Type Description
    int

    CITATION

    Declaration
    public const string CITATION = "ci"
    Field Value
    Type Description
    string

    CITATION_ID

    Declaration
    public const int CITATION_ID = 10
    Field Value
    Type Description
    int

    CJ_ID

    Declaration
    public const int CJ_ID = 7
    Field Value
    Type Description
    int

    COMPANY_ID

    Declaration
    public const int COMPANY_ID = 3
    Field Value
    Type Description
    int

    EMAIL_ID

    Declaration
    public const int EMAIL_ID = 4
    Field Value
    Type Description
    int

    EXTERNAL_LINK

    Declaration
    public const string EXTERNAL_LINK = "el"
    Field Value
    Type Description
    string

    EXTERNAL_LINK_ID

    Declaration
    public const int EXTERNAL_LINK_ID = 9
    Field Value
    Type Description
    int

    EXTERNAL_LINK_URL

    Declaration
    public const string EXTERNAL_LINK_URL = "elu"
    Field Value
    Type Description
    string

    EXTERNAL_LINK_URL_ID

    Declaration
    public const int EXTERNAL_LINK_URL_ID = 17
    Field Value
    Type Description
    int

    HEADING

    Declaration
    public const string HEADING = "h"
    Field Value
    Type Description
    string

    HEADING_ID

    Declaration
    public const int HEADING_ID = 15
    Field Value
    Type Description
    int

    HOST_ID

    Declaration
    public const int HOST_ID = 5
    Field Value
    Type Description
    int

    INTERNAL_LINK

    Declaration
    public const string INTERNAL_LINK = "il"
    Field Value
    Type Description
    string

    INTERNAL_LINK_ID

    Declaration
    public const int INTERNAL_LINK_ID = 8
    Field Value
    Type Description
    int

    ITALICS

    Declaration
    public const string ITALICS = "i"
    Field Value
    Type Description
    string

    ITALICS_ID

    Declaration
    public const int ITALICS_ID = 13
    Field Value
    Type Description
    int

    NUM_ID

    Declaration
    public const int NUM_ID = 6
    Field Value
    Type Description
    int

    SUB_HEADING

    Declaration
    public const string SUB_HEADING = "sh"
    Field Value
    Type Description
    string

    SUB_HEADING_ID

    Declaration
    public const int SUB_HEADING_ID = 16
    Field Value
    Type Description
    int

    TOKENS_ONLY

    Only output tokens

    Declaration
    public const int TOKENS_ONLY = 0
    Field Value
    Type Description
    int

    TOKEN_TYPES

    String token types that correspond to token type int constants

    Declaration
    public static readonly string[] TOKEN_TYPES
    Field Value
    Type Description
    string[]
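
As a small illustration of how the integer `*_ID` constants relate to the string type constants, the sketch below indexes TOKEN_TYPES by ID. The helper name (TokenTypeLookupSketch.TypeName) is hypothetical, and the example assumes the array is ordered by the ID values listed on this page.

```csharp
using Lucene.Net.Analysis.Wikipedia;

public static class TokenTypeLookupSketch
{
    // Hypothetical helper: maps a token type ID (e.g. CATEGORY_ID) to its
    // string form (e.g. CATEGORY) by indexing into TOKEN_TYPES.
    public static string TypeName(int tokenTypeId)
    {
        return WikipediaTokenizer.TOKEN_TYPES[tokenTypeId];
    }
}
```

Under that assumption, `TypeName(WikipediaTokenizer.CATEGORY_ID)` should equal `WikipediaTokenizer.CATEGORY`, and likewise for the other Wikipedia-specific types.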

    UNTOKENIZED_ONLY

    Only output untokenized tokens, which are tokens that would normally be split into several tokens

    Declaration
    public const int UNTOKENIZED_ONLY = 1
    Field Value
    Type Description
    int

    UNTOKENIZED_TOKEN_FLAG

    This flag indicates that the produced token would have been split into multiple tokens if TOKENS_ONLY were used.

    Declaration
    public const int UNTOKENIZED_TOKEN_FLAG = 1
    Field Value
    Type Description
    int
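
One way a consumer might tell untokenized tokens apart from their splits is to read the flags attribute, as sketched below. The helper name (UntokenizedFlagSketch.Flagged) and the choice of CATEGORY as the only untokenized type are assumptions made for this example.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Wikipedia;

public static class UntokenizedFlagSketch
{
    // Hypothetical helper: pairs each term with whether it carries UNTOKENIZED_TOKEN_FLAG.
    public static List<(string Term, bool IsUntokenized)> Flagged(string wikiText)
    {
        // Assumption for this sketch: only category links are kept as whole tokens.
        var untokenizedTypes = new HashSet<string> { WikipediaTokenizer.CATEGORY };
        var result = new List<(string Term, bool IsUntokenized)>();

        using (var tokenizer = new WikipediaTokenizer(
            new StringReader(wikiText), WikipediaTokenizer.BOTH, untokenizedTypes))
        {
            ICharTermAttribute termAtt = tokenizer.AddAttribute<ICharTermAttribute>();
            IFlagsAttribute flagsAtt = tokenizer.AddAttribute<IFlagsAttribute>();

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                bool isUntokenized =
                    (flagsAtt.Flags & WikipediaTokenizer.UNTOKENIZED_TOKEN_FLAG) != 0;
                result.Add((termAtt.ToString(), isUntokenized));
            }
            tokenizer.End();
        }
        return result;
    }
}
```

In BOTH mode the whole-phrase token is expected to carry the flag, while the individual splits that follow it are not.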

    Methods

    Dispose(bool)

    Releases resources associated with this stream.

    If you override this method, always call base.Dispose(disposing), otherwise some internal state will not be correctly reset (e.g., Lucene.Net.Analysis.Tokenizer will throw InvalidOperationException on reuse).
    Declaration
    protected override void Dispose(bool disposing)
    Parameters
    Type Name Description
    bool disposing
    Overrides
    Tokenizer.Dispose(bool)
    Remarks

    NOTE: The default implementation closes the input TextReader, so be sure to call base.Dispose(disposing) when overriding this method.

    End()

    This method is called by the consumer after the last token has been consumed, after Lucene.Net.Analysis.TokenStream.IncrementToken() returned false (using the new Lucene.Net.Analysis.TokenStream API). Streams implementing the old API should upgrade to use this feature.

    This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the end offset of the last token, for example when one or more whitespace characters follow the last token and a WhitespaceTokenizer was used.

    Additionally, any skipped positions (such as those removed by a StopFilter) can be applied to the position increment, and any other attributes whose end-of-stream value matters can be adjusted.

    If you override this method, always call base.End().
    Declaration
    public override void End()
    Overrides
    Lucene.Net.Analysis.TokenStream.End()
    Exceptions
    Type | Condition
    IOException | If an I/O error occurs
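
The difference between the last token's end offset and the final offset set by End() can be observed with a sketch like the following. The helper name (EndOffsetSketch.Offsets) is hypothetical and introduced only for illustration.

```csharp
using System.IO;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Wikipedia;

public static class EndOffsetSketch
{
    // Hypothetical helper: returns the end offset of the last token and the
    // final offset reported after End() has been called.
    public static (int LastTokenEnd, int FinalOffset) Offsets(string wikiText)
    {
        using (var tokenizer = new WikipediaTokenizer(new StringReader(wikiText)))
        {
            IOffsetAttribute offsetAtt = tokenizer.AddAttribute<IOffsetAttribute>();
            int lastTokenEnd = 0;

            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                lastTokenEnd = offsetAtt.EndOffset;
            }

            // After End(), the offset attribute holds the final offset of the stream.
            tokenizer.End();
            return (lastTokenEnd, offsetAtt.EndOffset);
        }
    }
}
```

For input that ends in markup or whitespace, such as `This is a [[Category:foo]]`, the final offset is expected to cover the whole input while the last token ends before the closing brackets.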

    IncrementToken()

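    Advances the stream to the next token and returns false when the end of the stream is reached. See Lucene.Net.Analysis.TokenStream.IncrementToken().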

    Declaration
    public override sealed bool IncrementToken()
    Returns
    Type Description
    bool
    Overrides
    Lucene.Net.Analysis.TokenStream.IncrementToken()

    Reset()

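    Resets this stream to a clean state; a consumer calls it before the first call to IncrementToken(). See Lucene.Net.Analysis.TokenStream.Reset().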

    Declaration
    public override void Reset()
    Overrides
    Lucene.Net.Analysis.Tokenizer.Reset()

    Implements

    IDisposable
    Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.