Namespace Lucene.Net.Search.Similarities
This package contains the various ranking models that can be used in Lucene. The abstract class Similarity serves as the base for ranking functions. For searching, users can employ the models already implemented or create their own by extending one of the classes in this package.
Table Of Contents
Summary of the Ranking Methods
DefaultSimilarity is the original Lucene scoring function. It is based on a highly optimized Vector Space Model. For more information, see TFIDFSimilarity.
BM25Similarity is an optimized implementation of the successful Okapi BM25 model.
SimilarityBase provides a basic implementation of the Similarity contract and exposes a highly simplified interface, which makes it an ideal starting point for new ranking functions. Lucene ships the following methods built on SimilarityBase: * Amati and Rijsbergen's {@linkplain org.apache.lucene.search.similarities.DFRSimilarity DFR} framework; * Clinchant and Gaussier's {@linkplain org.apache.lucene.search.similarities.IBSimilarity Information-based models} for IR; * The implementation of two {@linkplain org.apache.lucene.search.similarities.LMSimilarity language models} from Zhai and Lafferty's paper. Since SimilarityBase is not optimized to the same extent as DefaultSimilarity and BM25Similarity, a difference in performance is to be expected when using the methods listed above. However, optimizations can always be implemented in subclasses; see below.
Changing Similarity
Chances are the available Similarities are sufficient for all your searching needs. However, in some applications it may be necessary to customize your Similarity implementation. For instance, some applications do not need to distinguish between shorter and longer documents (see a "fair" similarity).
To change Similarity, one must do so for both indexing and searching, and the changes must happen before either of these actions take place. Although in theory there is nothing stopping you from changing mid-stream, it just isn't well-defined what is going to happen.
To make this change, implement your own Similarity (likely you'll want to simply subclass an existing method, be it DefaultSimilarity or a descendant of SimilarityBase), and then register the new class by calling #setSimilarity(Similarity) before indexing and #setSimilarity(Similarity) before searching.
Extending {@linkplain org.apache.lucene.search.similarities.SimilarityBase}
The easiest way to quickly implement a new ranking method is to extend SimilarityBase, which provides basic implementations for the low level . Subclasses are only required to implement the Float) and #toString() methods.
Another option is to extend one of the frameworks based on SimilarityBase. These Similarities are implemented modularly, e.g. DFRSimilarity delegates computation of the three parts of its formula to the classes BasicModel, AfterEffect and Normalization. Instead of subclassing the Similarity, one can simply introduce a new basic model and tell DFRSimilarity to use it.
Changing {@linkplain org.apache.lucene.search.similarities.DefaultSimilarity}
If you are interested in use cases for changing your similarity, see the Lucene users's mailing list at Overriding Similarity. In summary, here are a few use cases: 1.
The SweetSpotSimilarity
in org.apache.lucene.misc
gives small increases as the frequency increases a small amount and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.
Overriding tf — In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these cases people have overridden Similarity to return 1 from the tf() method.
3.Changing Length Normalization — By overriding State), it is possible to discount how the length of a field contributes to a score. In DefaultSimilarity, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be 1 / (numTerms in field), all fields will be treated "fairly".
In general, Chris Hostetter sums it up best in saying (from the Lucene users's mailing list):[One would override the Similarity in] ... any situation where you know more about your data then just that it's "text" is a situation where it might make sense to to override your Similarity method.
Classes
AfterEffect
This class acts as the base class for the implementations of the first normalization of the informative content in the DFR framework. This component is also called the after effect and is defined by the formula Inf2 = 1 - Prob2, where Prob2 measures the information gain.
@lucene.experimental
AfterEffect.NoAfterEffect
Implementation used when there is no aftereffect.
AfterEffectB
Model of the information gain based on the ratio of two Bernoulli processes.
AfterEffectL
Model of the information gain based on Laplace's law of succession.
BasicModel
This class acts as the base class for the specific basic model implementations in the DFR framework. Basic models compute the informative content Inf1 = -log2Prob1 .
BasicModelBE
Limiting form of the Bose-Einstein model. The formula used in Lucene differs
slightly from the one in the original paper: F
is increased by tfn+1
and N
is increased by F
@lucene.experimental
NOTE: in some corner cases this model may give poor performance with Normalizations that
return large values for tfn
such as NormalizationH3. Consider using the
geometric approximation (BasicModelG) instead, which provides the same relevance
but with less practical problems.
BasicModelD
Implements the approximation of the binomial model with the divergence
for DFR. The formula used in Lucene differs slightly from the one in the
original paper: to avoid underflow for small values of N
and
F
, N
is increased by 1
and
F
is always increased by tfn+1
.
WARNING: for terms that do not meet the expected random distribution (e.g. stopwords), this model may give poor performance, such as abnormally high scores for low tf values.
BasicModelG
Geometric as limiting form of the Bose-Einstein model. The formula used in Lucene differs
slightly from the one in the original paper: F
is increased by 1
and N
is increased by F
.
BasicModelIF
An approximation of the I(ne) model.
BasicModelIn
The basic tf-idf model of randomness.
BasicModelIne
Tf-idf model of randomness, based on a mixture of Poisson and inverse document frequency.
BasicModelP
Implements the Poisson approximation for the binomial model for DFR.
@lucene.experimental
WARNING: for terms that do not meet the expected random distribution (e.g. stopwords), this model may give poor performance, such as abnormally high scores for low tf values.
BasicStats
Stores all statistics commonly used ranking methods.
BM25Similarity
BM25 Similarity. Introduced in Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.
DefaultSimilarity
Expert: Default scoring implementation which encodes (EncodeNormValue(Single)) norm values as a single byte before being stored. At search time, the norm byte value is read from the index Directory and decoded (DecodeNormValue(Int64)) back to a float norm value. this encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that Decode(Encode(x)) = x. For instance, Decode(Encode(0.89)) = 0.75.
Compression of norm values to a single byte saves memory at search time, because once a field is referenced at search time, its norms - for all documents - are maintained in memory.
The rationale supporting such lossy compression of norm values is that given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.
Last, note that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search.
DFRSimilarity
Implements the divergence from randomness (DFR) framework introduced in Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20, 4 (October 2002), 357-389.
The DFR scoring formula is composed of three separate components: the basic model, the aftereffect and an additional normalization component, represented by the classes BasicModel, AfterEffect and Normalization, respectively. The names of these classes were chosen to match the names of their counterparts in the Terrier IR engine.
To construct a DFRSimilarity, you must specify the implementations for all three components of DFR:
ComponentImplementations | |
---|---|
BasicModel: Basic model of information content:
| |
AfterEffect: First normalization of information gain:
| |
Normalization: Second (length) normalization:
|
Note that qtf, the multiplicity of term-occurrence in the query, is not handled by this implementation.
Distribution
The probabilistic distribution used to model term occurrence in information-based models.
DistributionLL
Log-logistic distribution.
Unlike for DFR, the natural logarithm is used, as it is faster to compute and the original paper does not express any preference to a specific base.
DistributionSPL
The smoothed power-law (SPL) distribution for the information-based framework that is described in the original paper.
Unlike for DFR, the natural logarithm is used, as it is faster to compute and the original paper does not express any preference to a specific base.
IBSimilarity
Provides a framework for the family of information-based models, as described in StÉphane Clinchant and Eric Gaussier. 2010. Information-based models for ad hoc IR. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval (SIGIR '10). ACM, New York, NY, USA, 234-241.
The retrieval function is of the form RSV(q, d) = ∑ -xqw log Prob(Xw >= tdw | λw), where
- xqw is the query boost;
- Xw is a random variable that counts the occurrences of word w;
- tdw is the normalized term frequency;
- λw is a parameter.
The framework described in the paper has many similarities to the DFR framework (see DFRSimilarity). It is possible that the two Similarities will be merged at one point.
To construct an IBSimilarity, you must specify the implementations for all three components of the Information-Based model.
ComponentImplementations | |
---|---|
Distribution: Probabilistic distribution used to
model term occurrence
| |
Lambda: λw parameter of the probability distribution | |
Normalization: Term frequency normalizationAny supported DFR normalization (listed in DFRSimilarity) |
Lambda
The lambda (λw) parameter in information-based models.
LambdaDF
Computes lambda as docFreq+1 / numberOfDocuments+1
.
LambdaTTF
Computes lambda as totalTermFreq+1 / numberOfDocuments+1
.
LMDirichletSimilarity
Bayesian smoothing using Dirichlet priors. From Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to Ad Hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '01). ACM, New York, NY, USA, 334-342.
The formula as defined the paper assigns a negative score to documents that
contain the term, but with fewer occurrences than predicted by the collection
language model. The Lucene implementation returns 0
for such
documents.
LMJelinekMercerSimilarity
Language model based on the Jelinek-Mercer smoothing method. From Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to Ad Hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '01). ACM, New York, NY, USA, 334-342.
The model has a single parameter, λ. According to said paper, the
optimal value depends on both the collection and the query. The optimal value
is around 0.1
for title queries and 0.7
for long queries.
LMSimilarity
Abstract superclass for language modeling Similarities. The following inner types are introduced:
- LMSimilarity.LMStats, which defines a new statistic, the probability that the collection language model generates the current term;
- LMSimilarity.ICollectionModel, which is a strategy interface for object that
compute the collection language model
p(w|C)
; - LMSimilarity.DefaultCollectionModel, an implementation of the former, that computes the term probability as the number of occurrences of the term in the collection, divided by the total number of tokens.
LMSimilarity.DefaultCollectionModel
Models p(w|C)
as the number of occurrences of the term in the
collection, divided by the total number of tokens + 1
.
LMSimilarity.LMStats
Stores the collection distribution of the current term.
MultiSimilarity
Implements the CombSUM method for combining evidence from multiple similarity values described in: Joseph A. Shaw, Edward A. Fox. In Text REtrieval Conference (1993), pp. 243-252
Normalization
This class acts as the base class for the implementations of the term frequency normalization methods in the DFR framework.
Normalization.NoNormalization
Implementation used when there is no normalization.
NormalizationH1
Normalization model that assumes a uniform distribution of the term frequency.
While this model is parameterless in the
original article,
information-based models (see IBSimilarity) introduced a
multiplying factor.
The default value for the c
parameter is 1
.
NormalizationH2
Normalization model in which the term frequency is inversely related to the length.
While this model is parameterless in the
original article, the thesis
introduces the parameterized variant.
The default value for the c
parameter is 1
.
NormalizationH3
Dirichlet Priors normalization
NormalizationZ
Pareto-Zipf Normalization
PerFieldSimilarityWrapper
Provides the ability to use a different Similarity for different fields.
Subclasses should implement Get(String) to return an appropriate Similarity (for example, using field-specific parameter values) for the field.
Similarity
Similarity defines the components of Lucene scoring.
Expert: Scoring API.
This is a low-level API, you should only extend this API if you want to implement an information retrieval model. If you are instead looking for a convenient way to alter Lucene's scoring, consider extending a higher-level implementation such as TFIDFSimilarity, which implements the vector space model with this API, or just tweaking the default implementation: DefaultSimilarity.
Similarity determines how Lucene weights terms, and Lucene interacts with this class at both index-time and query-time.
At indexing time, the indexer calls ComputeNorm(FieldInvertState), allowing the Similarity implementation to set a per-document value for the field that will be later accessible via GetNormValues(String). Lucene makes no assumption about what is in this norm, but it is most useful for encoding length normalization information.
Implementations should carefully consider how the normalization is encoded: while Lucene's classical TFIDFSimilarity encodes a combination of index-time boost and length normalization information with SmallSingle into a single byte, this might not be suitable for all purposes.
Many formulas require the use of average document length, which can be computed via a combination of SumTotalTermFreq and MaxDoc or DocCount, depending upon whether the average should reflect field sparsity.
Additional scoring factors can be stored in named NumericDocValuesFields and accessed at query-time with GetNumericDocValues(String).
Finally, using index-time boosts (either via folding into the normalization byte or via DocValues), is an inefficient way to boost the scores of different fields if the boost will be the same for every document, instead the Similarity can simply take a constant boost parameter C, and PerFieldSimilarityWrapper can return different instances with different boosts depending upon field name.
At query-time, Queries interact with the Similarity via these steps:
- The ComputeWeight(Single, CollectionStatistics, TermStatistics[]) method is called a single time, allowing the implementation to compute any statistics (such as IDF, average document length, etc) across the entire collection. The TermStatistics and CollectionStatistics passed in already contain all of the raw statistics involved, so a Similarity can freely use any combination of statistics without causing any additional I/O. Lucene makes no assumption about what is stored in the returned Similarity.SimWeight object.
- The query normalization process occurs a single time: GetValueForNormalization() is called for each query leaf node, QueryNorm(Single) is called for the top-level query, and finally Normalize(Single, Single) passes down the normalization value and any top-level boosts (e.g. from enclosing BooleanQuerys).
- For each segment in the index, the Query creates a GetSimScorer(Similarity.SimWeight, AtomicReaderContext) The GetScore() method is called for each matching document.
When Explain(Query, Int32) is called, queries consult the Similarity's DocScorer for an explanation of how it computed its score. The query passes in a the document id and an explanation of how the frequency was computed.
Similarity.SimScorer
API for scoring "sloppy" queries such as TermQuery, SpanQuery, and PhraseQuery.
Frequencies are floating-point values: an approximate within-document frequency adjusted for "sloppiness" by ComputeSlopFactor(Int32).
Similarity.SimWeight
Stores the weight for a query across the indexed collection. this abstract implementation is empty; descendants of Similarity should subclass Similarity.SimWeight and define the statistics they require in the subclass. Examples include idf, average field length, etc.
SimilarityBase
A subclass of Similarity that provides a simplified API for its descendants. Subclasses are only required to implement the Score(BasicStats, Single, Single) and ToString() methods. Implementing Explain(Explanation, BasicStats, Int32, Single, Single) is optional, inasmuch as SimilarityBase already provides a basic explanation of the score and the term frequency. However, implementers of a subclass are encouraged to include as much detail about the scoring method as possible.
Note: multi-word queries such as phrase queries are scored in a different way than Lucene's default ranking algorithm: whereas it "fakes" an IDF value for the phrase as a whole (since it does not know it), this class instead scores phrases as a summation of the individual term scores.
TFIDFSimilarity
Implementation of Similarity with the Vector Space Model.
Expert: Scoring API.
TFIDFSimilarity defines the components of Lucene scoring. Overriding computation of these components is a convenient way to alter Lucene scoring.
Suggested reading: Introduction To Information Retrieval, Chapter 6.
The following describes how Lucene scoring evolves from underlying information retrieval models to (efficient) implementation. We first brief on VSM Score, then derive from it Lucene's Conceptual Scoring Formula, from which, finally, evolves Lucene's Practical Scoring Function (the latter is connected directly with Lucene classes and methods).
Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM.
In VSM, documents and queries are represented as weighted vectors in a multi-dimensional space, where each distinct index term is a dimension, and weights are Tf-idf values.
VSM does not require weights to be Tf-idf values, but Tf-idf values are believed to produce search results of high quality, and so Lucene is using Tf-idf. Tf and Idf are described in more detail below, but for now, for completion, let's just say that for given term t and document (or query) x, Tf(t,x) varies with the number of occurrences of term t in x (when one increases so does the other) and idf(t) similarly varies with the inverse of the number of index documents containing term t.
VSM score of document d for query q is the Cosine Similarity of the weighted query vectors V(q) and V(d):
| |||
VSM Score |
Where V(q) · V(d) is the dot product of the weighted vectors, and |V(q)| and |V(d)| are their Euclidean norms.
Note: the above equation can be viewed as the dot product of the normalized weighted vectors, in the sense that dividing V(q) by its euclidean norm is normalizing it to a unit vector.
Lucene refines VSM score for both search quality and usability:
- Normalizing V(d) to the unit vector is known to be problematic in that it removes all document length information. For some documents removing this info is probably ok, e.g. a document made by duplicating a certain paragraph 10 times, especially if that paragraph is made of distinct terms. But for a document which contains no duplicated paragraphs, this might be wrong. To avoid this problem, a different document length normalization factor is used, which normalizes to a vector equal to or larger than the unit vector: doc-len-norm(d).
- At indexing, users can specify that certain documents are more important than others, by assigning a document boost. For this, the score of each document is also multiplied by its boost value doc-boost(d).
- Lucene is field based, hence each query term applies to a single field, document length normalization is by the length of the certain field, and in addition to document boost there are also document fields boosts.
- The same field can be added to a document during indexing several times, and so the boost of that field is the multiplication of the boosts of the separate additions (or parts) of that field within the document.
- At search time users can specify boosts to each query, sub-query, and each query term, hence the contribution of a query term to the score of a document is multiplied by the boost of that query term query-boost(q).
- A document may match a multi term query without containing all the terms of that query (this is correct for some of the queries), and users can further reward documents matching more query terms through a coordination factor, which is usually larger when more terms are matched: coord-factor(q,d).
Under the simplifying assumption of a single field in the index, we get Lucene's Conceptual scoring formula:
| |||||||||
Lucene Conceptual Scoring Formula |
The conceptual formula is a simplification in the sense that (1) terms and documents are fielded and (2) boosts are usually per query term rather than per query.
We now describe how Lucene implements this conceptual scoring formula, and derive from it Lucene's Practical Scoring Function.
For efficient score computation some scoring components are computed and aggregated in advance:
- Query-boost for the query (actually for each query term) is known when search starts.
- Query Euclidean norm |V(q)| can be computed when search starts,
as it is independent of the document being scored.
From search optimization perspective, it is a valid question
why bother to normalize the query at all, because all
scored documents will be multiplied by the same |V(q)|,
and hence documents ranks (their order by score) will not
be affected by this normalization.
There are two good reasons to keep this normalization:
- Recall that Cosine Similarity can be used find how similar two documents are. One can use Lucene for e.g. clustering, and use a document as a query to compute its similarity to other documents. In this use case it is important that the score of document d3 for query d1 is comparable to the score of document d3 for query d2. In other words, scores of a document for two distinct queries should be comparable. There are other applications that may require this. And this is exactly what normalizing the query vector V(q) provides: comparability (to a certain extent) of two or more queries.
- Applying query normalization on the scores helps to keep the scores around the unit vector, hence preventing loss of score data because of floating point precision limitations.
- Document length norm doc-len-norm(d) and document boost doc-boost(d) are known at indexing time. They are computed in advance and their multiplication is saved as a single value in the index: norm(d). (In the equations below, norm(t in d) means norm(field(t) in doc d) where field(t) is the field associated with term t.)
Lucene's Practical Scoring Function is derived from the above. The color codes demonstrate how it relates to those of the conceptual formula:
| |||||
Lucene Practical Scoring Function |
where
-
tf(t in d)
correlates to the term's frequency,
defined as the number of times term t appears in the currently scored document d.
Documents that have more occurrences of a given term receive a higher score.
Note that tf(t in q) is assumed to be 1 and therefore it does not appear in this equation,
However if a query contains twice the same term, there will be
two term-queries with that same term and hence the computation would still be correct (although
not very efficient).
The default computation for tf(t in d) in
DefaultSimilarity (Tf(Single)) is:
tf(t in d) =
frequency½ -
idf(t) stands for Inverse Document Frequency. this value
correlates to the inverse of DocFreq
(the number of documents in which the term t appears).
this means rarer terms give higher contribution to the total score.
idf(t) appears for t in both the query and the document,
hence it is squared in the equation.
The default computation for idf(t) in
DefaultSimilarity (Idf(Int64, Int64)) is:
idf(t) = 1 + log (
)NumDocs ––––––––– DocFreq+1 -
coord(q,d)
is a score factor based on how many of the query terms are found in the specified document.
Typically, a document that contains more of the query's terms will receive a higher score
than another document with fewer query terms.
this is a search time factor computed in
coord(q,d) (Coord(Int32, Int32))
by the Similarity in effect at search time.
-
queryNorm(q)
is a normalizing factor used to make scores between queries comparable.
this factor does not affect document ranking (since all ranked documents are multiplied by the same factor),
but rather just attempts to make scores from different queries (or even different indexes) comparable.
this is a search time factor computed by the Similarity in effect at search time.
The default computation in DefaultSimilarity (QueryNorm(Single)) produces a Euclidean norm:
queryNorm(q) =
queryNorm(sumOfSquaredWeights) =
1 –––––––––––––– sumOfSquaredWeights½ The sum of squared weights (of the query terms) is computed by the query Weight object. For example, a BooleanQuery computes this value as:
where sumOfSquaredWeights is GetValueForNormalization() and q.Boost is BoostsumOfSquaredWeights =
q.Boost 2 · ∑ ( idf(t) · t.Boost ) 2t in q -
t.Boost
is a search time boost of term t in the query q as
specified in the query text
(see query syntax),
or as set by application calls to
Boost.
Notice that there is really no direct API for accessing a boost of one term in a multi term query,
but rather multi terms are represented in a query as multi
TermQuery objects,
and so the boost of a term in the query is accessible by calling the sub-query
Boost.
-
norm(t,d) encapsulates a few (indexing time) boost and length factors:
- Field boost - set Boost before adding the field to a document.
- lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing.
When a document is added to the index, all the above factors are multiplied. If the document has multiple fields with the same name, all their boosts are multiplied together:
Note that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search.norm(t,d) =
lengthNorm · ∏Boostfield f in d named as t
Interfaces
LMSimilarity.ICollectionModel
A strategy for computing the collection language model.