
    Namespace Lucene.Net.Analysis.TokenAttributes

    General-purpose attributes for text analysis.

    Classes

    CharTermAttribute

    Default implementation of ICharTermAttribute.

    FlagsAttribute

    Default implementation of IFlagsAttribute.

    KeywordAttribute

    Default implementation of IKeywordAttribute.

    OffsetAttribute

    Default implementation of IOffsetAttribute.

    PayloadAttribute

    Default implementation of IPayloadAttribute.

    PositionIncrementAttribute

    Default implementation of IPositionIncrementAttribute.

    PositionLengthAttribute

    Default implementation of IPositionLengthAttribute.

    TypeAttribute

    Default implementation of ITypeAttribute.

    Interfaces

    ICharTermAttribute

    The term text of a Token.

    IFlagsAttribute

    This attribute can be used to pass different flags down the Tokenizer chain, e.g., from one TokenFilter to another.

    This is completely distinct from TypeAttribute, although they do share similar purposes. The flags can be used to encode information about the token for use by other TokenFilters. This API is experimental: while we think it is here to stay, we may want to change it to be a long.

    IKeywordAttribute

    This attribute can be used to mark a token as a keyword. Keyword-aware TokenStreams can decide whether to modify a token based on the value of IsKeyword. Stemming filters, for instance, can use this attribute to skip a term when IsKeyword is true.

    IOffsetAttribute

    The start and end character offset of a Token.

    IPayloadAttribute

    The payload of a Token.

    The payload is stored in the index at each position, and can be used to influence scoring when using Payload-based queries in the Lucene.Net.Search.Payloads and Lucene.Net.Search.Spans namespaces.

    NOTE: because the payload will be stored at each position, it's usually best to use the minimum number of bytes necessary. Some codec implementations may optimize payload storage when all payloads have the same length.

    IPositionIncrementAttribute

    Determines the position of this token relative to the previous Token in a TokenStream, used in phrase searching.

    The default value is one.

    Some common uses for this are:

    • Set it to zero to put multiple terms in the same position. This is useful if, e.g., a word has multiple stems. Searches for phrases including either stem will match. In this case, all but the first stem's increment should be set to zero: the increment of the first instance should be one. Repeating a token with an increment of zero can also be used to boost the scores of matches on that token.
    • Set it to values greater than one to inhibit exact phrase matches. If, for example, one does not want phrases to match across removed stop words, then one could build a stop word filter that removes stop words and also increases the increment of each following non-stop word by the number of stop words removed before it. Then exact phrase queries will only match when the terms occur with no intervening stop words.
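
    The two uses above can be illustrated by converting increments into absolute positions. The following is a minimal sketch in plain C# that does not call Lucene.NET: the class PositionIncrementDemo and the method ToPositions are hypothetical names introduced only for illustration. It assumes that a first token with an increment of one lands at position 0, and that a stop word filter gives each surviving token an increment of one plus the number of stop words removed before it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET: models how position
// increments accumulate into absolute token positions.
public static class PositionIncrementDemo
{
    // Each token's position is the previous position plus its increment.
    // Starting at -1 means a first token with increment 1 lands at position 0
    // (an assumption of this sketch).
    public static List<int> ToPositions(IEnumerable<int> increments)
    {
        var positions = new List<int>();
        int position = -1;
        foreach (int increment in increments)
        {
            position += increment;
            positions.Add(position);
        }
        return positions;
    }

    public static void Main()
    {
        // Tokens: quick(1), running(1), run(0), fox(1).
        // "run" is a second stem stacked on the same position as "running".
        var stacked = ToPositions(new[] { 1, 1, 0, 1 });
        Console.WriteLine(string.Join(",", stacked)); // prints 0,1,1,2

        // "over the lazy dog" with the stop word "the" removed:
        // over(1), lazy(2), dog(1). The gap keeps "over lazy" from
        // matching as an exact phrase.
        var gapped = ToPositions(new[] { 1, 2, 1 });
        Console.WriteLine(string.Join(",", gapped)); // prints 0,2,3
    }
}
```

    In the first case a phrase query for either "quick running fox" or "quick run fox" finds consecutive positions; in the second, "over" and "lazy" are two positions apart, so an exact phrase query for "over lazy" does not match.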

    IPositionLengthAttribute

    Determines how many positions this token spans. Very few analyzer components actually produce this attribute, and indexing ignores it, but it's useful to express the graph structure naturally produced by decompounding, word splitting/joining, synonym filtering, etc.

    NOTE: this is optional, and most analyzers don't change the default value (1).

    ITermToBytesRefAttribute

    This attribute is requested by TermsHashPerField to index the contents. This attribute can be used to customize the final byte[] encoding of terms.

    Consumers of this attribute get the BytesRef property up-front, and then invoke FillBytesRef() for each term. Example:

      // Attributes are looked up by their interface type.
      ITermToBytesRefAttribute termAtt = tokenStream.GetAttribute<ITermToBytesRefAttribute>();
      BytesRef bytes = termAtt.BytesRef;

      while (tokenStream.IncrementToken())
      {
        // you must call termAtt.FillBytesRef() before doing something with the bytes.
        // this encodes the term value (internally it might be a char[], etc) into the bytes.
        termAtt.FillBytesRef();

        if (IsInteresting(bytes))
        {
          // because the bytes are reused by the attribute (like CharTermAttribute's char[] buffer),
          // you should make a copy if you need persistent access to the bytes, otherwise they will
          // be rewritten across calls to IncrementToken()

          DoSomethingWith(BytesRef.DeepCopyOf(bytes));
        }
      }
      ...

    This API is experimental and intended for expert use only. For UTF-8 terms, please use CharTermAttribute and its implementation of this method.

    ITypeAttribute

    A Token's lexical type. The default value is "word".

    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)