A Token is an occurrence of a term from the text of a field. It consists of a term's text, the start and end offset of the term in the text of the field, and a type string.

The start and end offsets permit applications to re-associate a token with its source text, e.g., to display highlighted query terms in a document browser or to show matching text fragments in a KWIC (keyword-in-context) display.
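The offset arithmetic can be sketched with a tiny, self-contained example; the field text, offsets, and the `highlight` helper below are hypothetical illustrations, not part of the Lucene API:

```java
// Illustrative sketch: using a token's start/end offsets to highlight a
// match in the original field text. Offsets index into the source string.
public class HighlightDemo {
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
             + "<b>" + text.substring(start, end) + "</b>"
             + text.substring(end);
    }

    public static void main(String[] args) {
        String field = "the quick brown fox";
        // Suppose the analyzer produced a token "quick" with
        // startOffset=4 and endOffset=9 into this field.
        System.out.println(highlight(field, 4, 9)); // the <b>quick</b> brown fox
    }
}
```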

The type is a string, assigned by a lexical analyzer (a.k.a. tokenizer), naming the lexical or syntactic class that the token belongs to. For example, an end-of-sentence marker token might be implemented with the type "eos". The default token type is "word".

A Token can optionally have metadata (a.k.a. Payload) in the form of a variable length byte array. Use {@link TermPositions#GetPayloadLength()} and {@link TermPositions#GetPayload(byte[], int)} to retrieve the payloads from the index.



NOTE: As of 2.9, Token implements all {@link Attribute} interfaces that are part of core Lucene and can be found in the {@code tokenattributes} subpackage. Even though it is no longer necessary to use Token, with the new TokenStream API it can be used as a convenience class that implements all {@link Attribute}s, which is especially useful for easily switching from the old to the new TokenStream API.



NOTE: As of 2.3, Token stores the term text internally as a malleable char[] termBuffer instead of a String termText. The indexing code and core tokenizers have been changed to re-use a single Token instance, changing its buffer and other fields in place as the Token is processed. This provides substantially better indexing performance, as it saves the GC cost of allocating a new Token and String for every term. The APIs that accept a String termText are still available, but a warning about the associated performance cost has been added (below). The {@link #TermText()} method has been deprecated.

Tokenizers and TokenFilters should try to re-use a Token instance when possible for best performance, by implementing the {@link TokenStream#IncrementToken()} API. Failing that, to create a new Token you should first use one of the constructors that start with null text. To load the token from a char[] use {@link #SetTermBuffer(char[], int, int)}. To load from a String use {@link #SetTermBuffer(String)} or {@link #SetTermBuffer(String, int, int)}. Alternatively you can get the Token's termBuffer by calling either {@link #TermBuffer()}, if you know that your text is shorter than the capacity of the termBuffer, or {@link #ResizeTermBuffer(int)}, if there is any possibility that you may need to grow the buffer. Fill in the characters of your term into this buffer, using {@link String#getChars(int, int, char[], int)} if loading from a string or {@link System#arraycopy(Object, int, Object, int, int)} if loading from a char[], and finally call {@link #SetTermLength(int)} to set the length of the term text. See LUCENE-969 for details.
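The resize-fill-setLength flow described above can be illustrated with a minimal stand-in class; `ReusableToken` here is a simplified sketch written for this example, not Lucene's actual Token:

```java
// Minimal self-contained sketch of the char[] buffer-reuse pattern:
// grow the buffer if needed, fill it in place, then record the length.
public class ReuseDemo {
    static class ReusableToken {
        private char[] termBuffer = new char[16]; // small initial capacity
        private int termLength;

        // Ensure the buffer can hold newSize chars, preserving existing content.
        char[] resizeTermBuffer(int newSize) {
            if (termBuffer.length < newSize) {
                char[] grown = new char[Math.max(newSize, termBuffer.length * 2)];
                System.arraycopy(termBuffer, 0, grown, 0, termLength);
                termBuffer = grown;
            }
            return termBuffer;
        }

        void setTermLength(int length) { termLength = length; }

        String term() { return new String(termBuffer, 0, termLength); }
    }

    public static void main(String[] args) {
        ReusableToken reusable = new ReusableToken();
        String source = "internationalization";        // longer than the initial capacity
        char[] buf = reusable.resizeTermBuffer(source.length());
        source.getChars(0, source.length(), buf, 0);   // fill the buffer in place
        reusable.setTermLength(source.length());
        System.out.println(reusable.term());           // internationalization
    }
}
```

The same instance can then be refilled for the next term without allocating a new buffer, which is the point of the pattern.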

Typical Token reuse patterns:

  • Copying text from a string (type is reset to {@link #DEFAULT_TYPE} if not specified):
                return reusableToken.reinit(string, startOffset, endOffset[, type]);
                
  • Copying some text from a string (type is reset to {@link #DEFAULT_TYPE} if not specified):
                return reusableToken.reinit(string, 0, string.length(), startOffset, endOffset[, type]);
                
  • Copying text from char[] buffer (type is reset to {@link #DEFAULT_TYPE} if not specified):
                return reusableToken.reinit(buffer, 0, buffer.length, startOffset, endOffset[, type]);
                
  • Copying some text from a char[] buffer (type is reset to {@link #DEFAULT_TYPE} if not specified):
                return reusableToken.reinit(buffer, start, end - start, startOffset, endOffset[, type]);
                
  • Copying from one Token to another (type is reset to {@link #DEFAULT_TYPE} if not specified):
                return reusableToken.reinit(source.termBuffer(), 0, source.termLength(), source.startOffset(), source.endOffset()[, source.type()]);
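The first pattern above might be mirrored by a toy `reinit` like the following; the `Token` class here is a hypothetical sketch written for this example, not the real Lucene class:

```java
// Sketch of the reinit(String, startOffset, endOffset) reuse pattern:
// reload the token's text and offsets in place, resetting type to the default.
public class ReinitDemo {
    static class Token {
        static final String DEFAULT_TYPE = "word";
        String term;
        int startOffset, endOffset;
        String type;

        Token reinit(String text, int start, int end) {
            term = text;
            startOffset = start;
            endOffset = end;
            type = DEFAULT_TYPE; // type is reset when not specified
            return this;         // returned for the "return reusableToken.reinit(...)" idiom
        }
    }

    public static void main(String[] args) {
        Token reusableToken = new Token();
        Token t = reusableToken.reinit("quick", 4, 9);
        System.out.println(t.term + " [" + t.startOffset + "," + t.endOffset + ") type=" + t.type);
        // quick [4,9) type=word
    }
}
```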
                
A few things to note:
  • clear() initializes all of the fields to default values. This behavior changed from Lucene 2.4, but should affect no one.
  • Because TokenStreams can be chained, one cannot assume that the Token's current type is correct.
  • The startOffset and endOffset represent the start and end offsets in the source text, so be careful when adjusting them.
  • When caching a reusable token, clone it. When injecting a cached token into a stream that can be reset, clone it again.
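The cloning advice exists because the stream reuses one mutable instance; caching a bare reference sees later mutations. A minimal sketch of the hazard, using a hypothetical `Token` stand-in rather than the real class:

```java
// Illustrative sketch of why a cached reusable token must be cloned:
// the stream mutates a single instance, so an aliased reference goes stale.
public class CloneDemo {
    static class Token implements Cloneable {
        private char[] buffer;
        private int length;

        Token(String s) { set(s); }

        void set(String s) { buffer = s.toCharArray(); length = s.length(); }

        String term() { return new String(buffer, 0, length); }

        // Deep copy: the clone gets its own buffer, independent of the original.
        @Override public Token clone() { return new Token(term()); }
    }

    public static void main(String[] args) {
        Token reusable = new Token("first");
        Token cachedRef  = reusable;          // wrong: aliases the mutable token
        Token cachedCopy = reusable.clone();  // right: independent snapshot
        reusable.set("second");               // the stream reuses the instance
        System.out.println(cachedRef.term()); // second (stale alias was mutated)
        System.out.println(cachedCopy.term()); // first (snapshot is safe)
    }
}
```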

Namespace: Lucene.Net.Analysis
Assembly: Lucene.Net (in Lucene.Net.dll) Version: 2.9.4.1

Syntax

Inheritance Hierarchy

System.Object
  Lucene.Net.Util.AttributeImpl
    Lucene.Net.Analysis.Token

See Also