Lucene.Net  3.0.3
Lucene.Net is a port of the Lucene search engine library, written in C# and targeted at .NET runtime users.
Lucene.Net.Analysis.Shingle.ShingleMatrixFilter Class Reference


Inherits Lucene.Net.Analysis.TokenStream.

Public Member Functions

 ShingleMatrixFilter (Matrix.Matrix matrix, int minimumShingleSize, int maximumShingleSize, Char spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle, TokenSettingsCodec settingsCodec)
 Creates a shingle filter based on a user defined matrix.
 
 ShingleMatrixFilter (TokenStream input, int minimumShingleSize, int maximumShingleSize)
 Creates a shingle filter using default settings.
 
 ShingleMatrixFilter (TokenStream input, int minimumShingleSize, int maximumShingleSize, Char? spacerCharacter)
 Creates a shingle filter using default settings.
 
 ShingleMatrixFilter (TokenStream input, int minimumShingleSize, int maximumShingleSize, Char? spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle)
 Creates a shingle filter using the default TokenSettingsCodec.
 
 ShingleMatrixFilter (TokenStream input, int minimumShingleSize, int maximumShingleSize, Char? spacerCharacter, bool ignoringSinglePrefixOrSuffixShingle, TokenSettingsCodec settingsCodec)
 Creates a shingle filter with ad hoc parameter settings.
 
override void Reset ()
 Resets this stream to the beginning. This is an optional operation, so subclasses may or may not implement this method. Reset() is not needed for the standard indexing process. However, if the tokens of a TokenStream are intended to be consumed more than once, it is necessary to implement Reset(). Note that if your TokenStream caches tokens and feeds them back again after a reset, it is imperative that you clone the tokens when you store them away (on the first pass) as well as when you return them (on future passes after Reset()).
 
override sealed bool IncrementToken ()
 Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate Util.Attributes with the attributes of the next token.
 
void UpdateToken (Token token, List< Token > shingle, int currentPermutationStartOffset, List< Row > currentPermutationRows, List< Token > currentPermuationTokens)
 Final touch of a shingle token before it is passed on to the consumer from method IncrementToken().
 
float CalculateShingleWeight (Token shingleToken, List< Token > shingle, int currentPermutationStartOffset, List< Row > currentPermutationRows, List< Token > currentPermuationTokens)
 Evaluates the new shingle token weight.
 
Public Member Functions inherited from Lucene.Net.Analysis.TokenStream
virtual void End ()
 This method is called by the consumer after the last token has been consumed, after IncrementToken returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature. This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the end offset of the last token, e.g. when one or more whitespace characters follow the last token and a WhitespaceTokenizer was used.
 
void Close ()
 Releases resources associated with this stream.
 
void Dispose ()
 
Public Member Functions inherited from Lucene.Net.Util.AttributeSource
 AttributeSource ()
 An AttributeSource using the default attribute factory AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY.
 
 AttributeSource (AttributeSource input)
 An AttributeSource that uses the same attributes as the supplied one.
 
 AttributeSource (AttributeFactory factory)
 An AttributeSource using the supplied AttributeFactory for creating new IAttribute instances.
 
virtual IEnumerable< Type > GetAttributeTypesIterator ()
 Returns a new iterator that iterates the attribute classes in the same order they were added in. Signature for Java 1.5: public Iterator<Class<? extends Attribute>> getAttributeClassesIterator()
 
virtual IEnumerable< Attribute > GetAttributeImplsIterator ()
 Returns a new iterator that iterates all unique Attribute implementations. This iterator may contain fewer entries than GetAttributeTypesIterator if one instance implements more than one Attribute interface. Signature for Java 1.5: public Iterator<AttributeImpl> getAttributeImplsIterator()
 
virtual void AddAttributeImpl (Attribute att)
 Expert: Adds a custom AttributeImpl instance with one or more Attribute interfaces.
 
virtual T AddAttribute< T > ()
 The caller must pass in a Class<? extends Attribute> value. This method first checks if an instance of that class is already in this AttributeSource and returns it. Otherwise a new instance is created, added to this AttributeSource and returned.
 
virtual bool HasAttribute< T > ()
 The caller must pass in a Class<? extends Attribute> value. Returns true, iff this AttributeSource contains the passed-in Attribute.
 
virtual T GetAttribute< T > ()
 The caller must pass in a Class<? extends Attribute> value. Returns the instance of the passed in Attribute contained in this AttributeSource
 
virtual void ClearAttributes ()
 Resets all Attributes in this AttributeSource by calling Attribute.Clear() on each Attribute implementation.
 
virtual State CaptureState ()
 Captures the state of all Attributes. The return value can be passed to RestoreState to restore the state of this or another AttributeSource.
 
virtual void RestoreState (State state)
 Restores this state by copying the values of all attribute implementations that this state contains into the attribute implementations of the targetStream. The targetStream must contain a corresponding instance for each argument contained in this state (e.g. it is not possible to restore the state of an AttributeSource containing a TermAttribute into an AttributeSource using a Token instance as implementation).
 
override int GetHashCode ()
 
override bool Equals (System.Object obj)
 
override System.String ToString ()
 
virtual AttributeSource CloneAttributes ()
 Performs a clone of all Attribute instances returned in a new AttributeSource instance. This method can be used to e.g. create another TokenStream with exactly the same attributes (using AttributeSource(AttributeSource))
 

Static Public Attributes

static Char DefaultSpacerCharacter = '_'
 
static TokenSettingsCodec DefaultSettingsCodec = new OneDimensionalNonWeightedTokenSettingsCodec()
 
static bool IgnoringSinglePrefixOrSuffixShingleByDefault
 

Protected Member Functions

override void Dispose (bool disposing)
 

Properties

int MinimumShingleSize [get, set]
 
int MaximumShingleSize [get, set]
 
Matrix.Matrix Matrix [get, set]
 
Char SpacerCharacter [get, set]
 
bool IsIgnoringSinglePrefixOrSuffixShingle [get, set]
 

Detailed Description

A ShingleMatrixFilter constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token.

For example, the sentence "please divide this sentence into shingles" might be tokenized into shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".
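
As a rough illustration of the sentence above, the following stand-alone Python sketch builds two-token shingles from a whitespace-split sentence. It is not Lucene.Net code; the function name `shingles` and its parameters are hypothetical and chosen only for this example.

```python
def shingles(tokens, min_size, max_size, spacer=" "):
    """Return all contiguous token n-grams holding min_size..max_size tokens."""
    result = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break  # larger sizes cannot fit either
            result.append(spacer.join(tokens[start:start + size]))
    return result


sentence = "please divide this sentence into shingles"
bigrams = shingles(sentence.split(), 2, 2)
print(bigrams)
```

Raising `max_size` to 3 in this sketch would additionally emit three-token shingles such as "please divide this".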

Using a shingle filter at both index and query time can in some instances replace phrase queries, especially those with 0 slop.

Without a spacer character it can be used to handle composition and decomposition of words such as searching for "multi dimensional" instead of "multidimensional". It is a rather common human problem at query time in several languages, notably the northern Germanic branch.
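
The idea of shingling without a spacer can be sketched as follows. This is a stand-alone Python illustration under simple assumptions (an index modelled as a plain set of terms, a hypothetical helper `joined_bigrams`); it does not use or mirror the Lucene.Net API.

```python
def joined_bigrams(tokens):
    """Concatenate each pair of adjacent tokens with no spacer character."""
    return {a + b for a, b in zip(tokens, tokens[1:])}


# Terms assumed to be present in the index (hypothetical data).
indexed_terms = {"multidimensional", "search"}

# The user typed the compound as two words.
query_terms = joined_bigrams("multi dimensional search".split())

# The spacer-less shingle of "multi" and "dimensional" hits the indexed compound.
matches = indexed_terms & query_terms
print(matches)
```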

Among other things, shingles are also known to help with spell checking, language detection and document clustering.

This filter is backed by a three-dimensional, column-oriented matrix. It creates permutations of the second dimension, the rows, and leaves the third, the z-axis, for multi-token synonyms.

To use this filter you need to define a way of positioning the input stream tokens in the matrix. This is done using a ShingleMatrixFilter.TokenSettingsCodec. There are three simple implementations for demonstration purposes: ShingleMatrixFilter.OneDimensionalNonWeightedTokenSettingsCodec, ShingleMatrixFilter.TwoDimensionalNonWeightedSynonymTokenSettingsCodec and ShingleMatrixFilter.SimpleThreeDimensionalTokenSettingsCodec.

Consider this token matrix:

 Token[column][row][z-axis]{
   {{hello}, {greetings, and, salutations}},
   {{world}, {earth}, {tellus}}
 };

It would produce the following 2-3 gram sized shingles:

"hello_world"
"greetings_and"
"greetings_and_salutations"
"and_salutations"
"and_salutations_world"
"salutations_world"
"hello_earth"
"and_salutations_earth"
"salutations_earth"
"hello_tellus"
"and_salutations_tellus"
"salutations_tellus"
 

This implementation can be rather heap demanding if (maximum shingle size - minimum shingle size) is a great number and the stream contains many columns, or if each column contains a great number of rows.

The problem is that, in order to avoid producing duplicates, the filter needs to keep track of every shingle already produced and returned to the consumer.

There is a bit of resource management to handle this but it would of course be much better if the filter was written so it never created the same shingle more than once in the first place.
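
The row permutations and the duplicate tracking described above can be sketched in a few lines of stand-alone Python. This is an illustration under stated assumptions, not the filter's actual implementation: the matrix is modelled as nested lists (column, row, tokens in the row), the z-axis is ignored, and the function name `matrix_shingles` is hypothetical. The `seen` set plays the role of the bookkeeping that makes the real filter heap demanding.

```python
from itertools import product


def matrix_shingles(columns, min_size, max_size, spacer="_"):
    """Emit each distinct shingle once, over every choice of one row per column."""
    seen = set()   # every shingle already returned; grows with the output
    out = []
    # product() varies its last argument fastest; reversing the columns
    # makes the first column vary fastest instead.
    for combo in product(*reversed(columns)):
        rows = list(reversed(combo))
        tokens = [token for row in rows for token in row]
        for start in range(len(tokens)):
            for size in range(min_size, max_size + 1):
                if start + size > len(tokens):
                    break
                shingle = spacer.join(tokens[start:start + size])
                if shingle not in seen:
                    seen.add(shingle)
                    out.append(shingle)
    return out


matrix = [
    [["hello"], ["greetings", "and", "salutations"]],
    [["world"], ["earth"], ["tellus"]],
]
result = matrix_shingles(matrix, 2, 3)
for shingle in result:
    print(shingle)
```

For the example matrix, this sketch yields the same twelve shingles as listed above; six row permutations would otherwise repeat "greetings_and", "greetings_and_salutations" and "and_salutations" three times each.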

The filter also has basic support for calculating weights for the shingles based on the weights of the tokens from the input stream, output shingle size, etc. See CalculateShingleWeight.

NOTE: This filter might not behave correctly if used with custom Attributes, i.e. Attributes other than the ones located in org.apache.lucene.analysis.tokenattributes.

Definition at line 105 of file ShingleMatrixFilter.cs.

Constructor & Destructor Documentation

Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.ShingleMatrixFilter ( Matrix.Matrix  matrix,
int  minimumShingleSize,
int  maximumShingleSize,
Char  spacerCharacter,
bool  ignoringSinglePrefixOrSuffixShingle,
TokenSettingsCodec  settingsCodec 
)

Creates a shingle filter based on a user defined matrix.

The filter /will/ delete columns from the input matrix! You will not be able to reset the filter if you use this constructor. TODO: don't touch the matrix! Use a bool, set the input stream to null or something, and keep track of where in the matrix we are.

Parameters
matrix: the input base for creating shingles. Does not need to contain any information until ShingleMatrixFilter.IncrementToken() is called the first time.
minimumShingleSize: minimum number of tokens in any shingle.
maximumShingleSize: maximum number of tokens in any shingle.
spacerCharacter: character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle: if true, shingles that only contain permutations of the first or the last column will not be produced as shingles. Useful when adding boundary marker tokens such as '^' and '$'.
settingsCodec: codec used to read input token weight and matrix positioning.
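
The effect of ignoringSinglePrefixOrSuffixShingle with boundary markers can be sketched as below. This stand-alone Python example encodes one reading of the parameter description, namely that shingles made up solely of a token from the first or the last column are suppressed; each token is treated as its own column, and the function and parameter names are hypothetical rather than part of Lucene.Net.

```python
def shingles(tokens, min_size, max_size, spacer="_",
             ignore_single_prefix_or_suffix=False):
    """Contiguous n-grams; optionally drop lone first/last-column tokens."""
    last = len(tokens) - 1
    result = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break
            # Assumed rule: a one-token shingle taken from the first or the
            # last column (here: the boundary markers) is not emitted.
            if ignore_single_prefix_or_suffix and size == 1 and start in (0, last):
                continue
            result.append(spacer.join(tokens[start:start + size]))
    return result


tokens = ["^", "hello", "world", "$"]
kept = shingles(tokens, 1, 2, ignore_single_prefix_or_suffix=True)
print(kept)
```

In this sketch the markers still appear inside longer shingles such as "^_hello", so a match can be anchored to the start or end of the text, while the bare "^" and "$" never reach the consumer.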

Definition at line 166 of file ShingleMatrixFilter.cs.

Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.ShingleMatrixFilter ( TokenStream  input,
int  minimumShingleSize,
int  maximumShingleSize 
)

Creates a shingle filter using default settings.

See ShingleMatrixFilter.DefaultSpacerCharacter, ShingleMatrixFilter.IgnoringSinglePrefixOrSuffixShingleByDefault, and ShingleMatrixFilter.DefaultSettingsCodec

Parameters
input: stream from which to construct the matrix
minimumShingleSize: minimum number of tokens in any shingle.
maximumShingleSize: maximum number of tokens in any shingle.

Definition at line 205 of file ShingleMatrixFilter.cs.

Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.ShingleMatrixFilter ( TokenStream  input,
int  minimumShingleSize,
int  maximumShingleSize,
Char?  spacerCharacter 
)

Creates a shingle filter using default settings.

See IgnoringSinglePrefixOrSuffixShingleByDefault and DefaultSettingsCodec

Parameters
input: stream from which to construct the matrix
minimumShingleSize: minimum number of tokens in any shingle.
maximumShingleSize: maximum number of tokens in any shingle.
spacerCharacter: character to use between texts of the token parts in a shingle. null for none.

Definition at line 217 of file ShingleMatrixFilter.cs.

Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.ShingleMatrixFilter ( TokenStream  input,
int  minimumShingleSize,
int  maximumShingleSize,
Char?  spacerCharacter,
bool  ignoringSinglePrefixOrSuffixShingle 
)

Creates a shingle filter using the default TokenSettingsCodec.

See DefaultSettingsCodec

Parameters
input: stream from which to construct the matrix
minimumShingleSize: minimum number of tokens in any shingle.
maximumShingleSize: maximum number of tokens in any shingle.
spacerCharacter: character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle: if true, shingles that only contain permutations of the first or the last column will not be produced as shingles. Useful when adding boundary marker tokens such as '^' and '$'.

Definition at line 230 of file ShingleMatrixFilter.cs.

Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.ShingleMatrixFilter ( TokenStream  input,
int  minimumShingleSize,
int  maximumShingleSize,
Char?  spacerCharacter,
bool  ignoringSinglePrefixOrSuffixShingle,
TokenSettingsCodec  settingsCodec 
)

Creates a shingle filter with ad hoc parameter settings.

Parameters
input: stream from which to construct the matrix
minimumShingleSize: minimum number of tokens in any shingle.
maximumShingleSize: maximum number of tokens in any shingle.
spacerCharacter: character to use between texts of the token parts in a shingle. null for none.
ignoringSinglePrefixOrSuffixShingle: if true, shingles that only contain permutations of the first or the last column will not be produced as shingles. Useful when adding boundary marker tokens such as '^' and '$'.
settingsCodec: codec used to read input token weight and matrix positioning.

Definition at line 242 of file ShingleMatrixFilter.cs.

Member Function Documentation

float Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.CalculateShingleWeight ( Token  shingleToken,
List< Token >  shingle,
int  currentPermutationStartOffset,
List< Row >  currentPermutationRows,
List< Token >  currentPermuationTokens 
)

Evaluates the new shingle token weight.

for (shingle part token in shingle) weight += shingle part token weight * (1 / sqrt(all shingle part token weights summed))

This algorithm gives a slightly greater score to longer shingles and rather penalises large weights of the individual shingle part tokens.
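
The formula above can be worked through with a small stand-alone Python sketch. It follows the pseudocode as written in this section; the function name `shingle_weight` and the zero-total guard are assumptions added for the example, not part of the documented method.

```python
import math


def shingle_weight(part_weights):
    """weight = sum over parts of part_weight * (1 / sqrt(sum of all part weights))."""
    total = sum(part_weights)
    if total <= 0:
        return 0.0  # assumption: avoid dividing by zero for empty or zero weights
    factor = 1.0 / math.sqrt(total)
    return sum(weight * factor for weight in part_weights)


two_parts = shingle_weight([1.0, 1.0])         # two tokens of weight 1
three_parts = shingle_weight([1.0, 1.0, 1.0])  # three tokens of weight 1
heavy_part = shingle_weight([4.0])             # one token of weight 4
print(two_parts, three_parts, heavy_part)
```

Written this way, the result equals the square root of the summed part weights, which is why a longer shingle scores only slightly higher and a single part of weight 4 ends up with a shingle weight of 2.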

Parameters
shingleToken: token returned to consumer
shingle: the tokens used to produce the shingle token.
currentPermutationStartOffset: start offset in parameter currentPermutationRows and currentPermutationTokens.
currentPermutationRows: an index to the matrix row in which each token in parameter currentPermutationTokens exists.
currentPermuationTokens: all tokens in the current row permutation of the matrix. A sub list (parameter offset, parameter shingle.size) equals parameter shingle.
Returns
weight to be set for parameter shingleToken

Definition at line 557 of file ShingleMatrixFilter.cs.

override void Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.Dispose ( bool  disposing)
protected virtual

Implements Lucene.Net.Analysis.TokenStream.

Definition at line 285 of file ShingleMatrixFilter.cs.

override sealed bool Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.IncrementToken ( )
virtual

Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate Util.Attributes with the attributes of the next token.

The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use AttributeSource.CaptureState to create a copy of the current attribute state.

This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to AttributeSource.AddAttribute{T}() and AttributeSource.GetAttribute{T}(), references to all Util.Attributes that this stream uses should be retrieved during instantiation.

To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in IncrementToken().

Returns
false for end of stream; true otherwise

Implements Lucene.Net.Analysis.TokenStream.

Definition at line 290 of file ShingleMatrixFilter.cs.

override void Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.Reset ( )
virtual

Resets this stream to the beginning. This is an optional operation, so subclasses may or may not implement this method. Reset() is not needed for the standard indexing process. However, if the tokens of a TokenStream are intended to be consumed more than once, it is necessary to implement Reset(). Note that if your TokenStream caches tokens and feeds them back again after a reset, it is imperative that you clone the tokens when you store them away (on the first pass) as well as when you return them (on future passes after Reset()).

Reimplemented from Lucene.Net.Analysis.TokenStream.

Definition at line 278 of file ShingleMatrixFilter.cs.
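
The cloning advice in the Reset() description can be illustrated with a small stand-alone Python sketch. It is a conceptual model only, with hypothetical `Token` and `CachingStream` classes that do not correspond to Lucene.Net types: tokens are copied when they are cached on the first pass and copied again when they are replayed after a reset, so changes made by a consumer never leak into the cache.

```python
import copy


class Token:
    def __init__(self, text):
        self.text = text


class CachingStream:
    """Caches tokens on the first pass and replays them after reset()."""

    def __init__(self, source_tokens):
        self._source = iter(source_tokens)
        self._cache = []
        self._replay = None

    def next_token(self):
        if self._replay is not None:
            token = next(self._replay, None)
            # Clone on the way out so the cache survives later passes.
            return copy.copy(token) if token is not None else None
        token = next(self._source, None)
        if token is not None:
            # Clone on the way in so consumer edits do not reach the cache.
            self._cache.append(copy.copy(token))
        return token

    def reset(self):
        self._replay = iter(self._cache)


stream = CachingStream([Token("hello"), Token("world")])
first = stream.next_token()
first.text = "MUTATED"  # the consumer changes the token it was handed
while stream.next_token() is not None:
    pass
stream.reset()
replayed = [stream.next_token().text for _ in range(2)]
print(replayed)
```

Without the copy made when caching, the replayed stream in this sketch would start with "MUTATED" instead of "hello".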

void Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.UpdateToken ( Token  token,
List< Token >  shingle,
int  currentPermutationStartOffset,
List< Row >  currentPermutationRows,
List< Token >  currentPermuationTokens 
)

Final touch of a shingle token before it is passed on to the consumer from method IncrementToken().

Calculates and sets type, flags, position increment, start/end offsets and weight.

Parameters
token: Shingle Token
shingle: Tokens used to produce the shingle token.
currentPermutationStartOffset: Start offset in parameter currentPermutationTokens
currentPermutationRows: index to Matrix.Column.Row from the position of tokens in parameter currentPermutationTokens
currentPermuationTokens: tokens of the current permutation of rows in the matrix.

Definition at line 528 of file ShingleMatrixFilter.cs.

Member Data Documentation

TokenSettingsCodec Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.DefaultSettingsCodec = new OneDimensionalNonWeightedTokenSettingsCodec()
static

Definition at line 108 of file ShingleMatrixFilter.cs.

Char Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.DefaultSpacerCharacter = '_'
static

Definition at line 107 of file ShingleMatrixFilter.cs.

bool Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.IgnoringSinglePrefixOrSuffixShingleByDefault
static

Definition at line 109 of file ShingleMatrixFilter.cs.

Property Documentation

bool Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.IsIgnoringSinglePrefixOrSuffixShingle
get, set

Definition at line 276 of file ShingleMatrixFilter.cs.

Matrix.Matrix Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.Matrix
get, set

Definition at line 272 of file ShingleMatrixFilter.cs.

int Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.MaximumShingleSize
get, set

Definition at line 270 of file ShingleMatrixFilter.cs.

int Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.MinimumShingleSize
get, set

Definition at line 268 of file ShingleMatrixFilter.cs.

Char Lucene.Net.Analysis.Shingle.ShingleMatrixFilter.SpacerCharacter
get, set

Definition at line 274 of file ShingleMatrixFilter.cs.


The documentation for this class was generated from the following file: