Namespace Lucene.Net.Search.Similarities

This package contains the various ranking models that can be used in Lucene. The abstract class Similarity serves as the base for ranking functions. For searching, users can employ the models already implemented or create their own by extending one of the classes in this package.

Summary of the Ranking Methods
Changing the Similarity

Summary of the Ranking Methods

DefaultSimilarity is the original Lucene scoring function. It is based on a highly optimized Vector Space Model. For more information, see TFIDFSimilarity.

BM25Similarity is an optimized implementation of the successful Okapi BM25 model.

SimilarityBase provides a basic implementation of the Similarity contract and exposes a highly simplified interface, which makes it an ideal starting point for new ranking functions. Lucene ships the following methods built on SimilarityBase:

Amati and Rijsbergen's DFR framework;
Clinchant and Gaussier's Information-based models for IR;
The implementation of two language models from Zhai and Lafferty's paper.

Since SimilarityBase is not optimized to the same extent as DefaultSimilarity and BM25Similarity, a difference in performance is to be expected when using the methods listed above. However, optimizations can always be implemented in subclasses; see below.

Changing Similarity

Chances are the available Similarities are sufficient for all your searching needs. However, in some applications it may be necessary to customize your Similarity implementation. For instance, some applications do not need to distinguish between shorter and longer documents (see a "fair" similarity).

To change Similarity, one must do so for both indexing and searching, and the change must happen before either of these actions takes place. Although in theory nothing stops you from changing mid-stream, the resulting behavior is not well-defined.

To make this change, implement your own Similarity (likely you'll want to simply subclass an existing method, be it DefaultSimilarity or a descendant of SimilarityBase), and then register the new class by setting IndexWriterConfig.Similarity before indexing and IndexSearcher.Similarity before searching.

Extending SimilarityBase

The easiest way to quickly implement a new ranking method is to extend SimilarityBase, which provides basic implementations of the low-level Similarity methods. Subclasses are only required to implement the Score(BasicStats, Single, Single) and ToString() methods.

Another option is to extend one of the frameworks based on SimilarityBase. These Similarities are implemented modularly, e.g. DFRSimilarity delegates computation of the three parts of its formula to the classes BasicModel, AfterEffect and Normalization. Instead of subclassing the Similarity, one can simply introduce a new basic model and tell DFRSimilarity to use it.

Changing DefaultSimilarity

If you are interested in use cases for changing your similarity, see the Lucene users' mailing list at Overriding Similarity. In summary, here are a few use cases:

The SweetSpotSimilarity gives small increases as the frequency increases a small amount and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.
Overriding Tf: In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these cases people have overridden Similarity to return 1 from the Tf() method.
Changing Length Normalization: By overriding ComputeNorm(FieldInvertState), it is possible to discount how the length of a field contributes to a score. In DefaultSimilarity, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be 1 / (numTerms in field), all fields will be treated "fairly".
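
The effect of the last two overrides can be seen numerically. The following is a minimal sketch in Python rather than Lucene.NET code: the function names are hypothetical, and the square-root tf is assumed from the DefaultSimilarity description in the TFIDFSimilarity section of this page.

```python
import math

def default_tf(freq: float) -> float:
    """Square-root term frequency (assumed DefaultSimilarity behaviour)."""
    return math.sqrt(freq)

def constant_tf(freq: float) -> float:
    """Overridden tf: any matching term contributes the same amount."""
    return 1.0 if freq > 0 else 0.0

def default_length_norm(num_terms: int) -> float:
    """lengthNorm = 1 / numTerms^0.5."""
    return 1.0 / math.sqrt(num_terms)

def alternative_length_norm(num_terms: int) -> float:
    """lengthNorm = 1 / numTerms, the variant mentioned in the list above."""
    return 1.0 / num_terms

# Compare how quickly each norm shrinks as the field gets longer.
for n in (4, 100, 10000):
    print(n, default_length_norm(n), alternative_length_norm(n))
```

In Lucene.NET itself these formulas would live in overrides on a Similarity subclass; the sketch only shows how the two choices diverge as field length grows.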

In general, Chris Hostetter sums it up best in saying (from the Lucene users' mailing list):

[One would override the Similarity in] ... any situation where you know more about your data than just that it's "text" is a situation where it might make sense to override your Similarity method.

Classes

AfterEffect

This class acts as the base class for the implementations of the first normalization of the informative content in the DFR framework. This component is also called the after effect and is defined by the formula Inf₂ = 1 - Prob₂, where Prob₂ measures the information gain.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

AfterEffect.NoAfterEffect

Implementation used when there is no aftereffect.

AfterEffectB

Model of the information gain based on the ratio of two Bernoulli processes.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

AfterEffectL

Model of the information gain based on Laplace's law of succession.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

BasicModel

This class acts as the base class for the specific basic model implementations in the DFR framework. Basic models compute the informative content Inf₁ = -log₂Prob₁ .

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

BasicModelBE

Limiting form of the Bose-Einstein model. The formula used in Lucene differs slightly from the one in the original paper: F is increased by tfn+1 and N is increased by F.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

NOTE: in some corner cases this model may give poor performance with Normalizations that return large values for tfn, such as NormalizationH3. Consider using the geometric approximation (BasicModelG) instead, which provides the same relevance but with fewer practical problems.

BasicModelD

Implements the approximation of the binomial model with the divergence for DFR. The formula used in Lucene differs slightly from the one in the original paper: to avoid underflow for small values of N and F, N is increased by 1 and F is always increased by tfn+1.

WARNING: for terms that do not meet the expected random distribution (e.g. stopwords), this model may give poor performance, such as abnormally high scores for low tf values.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

BasicModelG

Geometric as limiting form of the Bose-Einstein model. The formula used in Lucene differs slightly from the one in the original paper: F is increased by 1 and N is increased by F.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

BasicModelIF

An approximation of the I(n_e) model.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

BasicModelIn

The basic tf-idf model of randomness.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

BasicModelIne

Tf-idf model of randomness, based on a mixture of Poisson and inverse document frequency.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

BasicModelP

Implements the Poisson approximation for the binomial model for DFR.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

WARNING: for terms that do not meet the expected random distribution (e.g. stopwords), this model may give poor performance, such as abnormally high scores for low tf values.

BasicStats

Stores all statistics commonly used by ranking methods.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

BM25Similarity

BM25 Similarity. Introduced in Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

DefaultSimilarity

Expert: Default scoring implementation which encodes (EncodeNormValue(Single)) norm values as a single byte before being stored. At search time, the norm byte value is read from the index Directory and decoded (DecodeNormValue(Int64)) back to a float norm value. This encoding/decoding, while reducing index size, comes at the price of precision loss: it is not guaranteed that Decode(Encode(x)) = x. For instance, Decode(Encode(0.89)) = 0.875.

Compression of norm values to a single byte saves memory at search time, because once a field is referenced at search time, its norms - for all documents - are maintained in memory.

The rationale supporting such lossy compression of norm values is that, given how difficult (and imprecise) it is for users to express their true information need in a query, only big differences matter.

Last, note that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search.
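
The precision loss described above can be reproduced with a small sketch. It is written in Python rather than C#, and it assumes the norm byte follows the layout of Lucene's SmallFloat "315" encoding (3 mantissa bits, zero exponent 15). The function names are hypothetical, and the constants should be checked against the SmallSingle source before relying on them.

```python
import struct

ZERO_EXP = 15
FZERO = (63 - ZERO_EXP) << 3  # smallest representable "small float" offset

def float_to_byte315(f: float) -> int:
    """Encode a float into one byte (returned as 0..255)."""
    bits = struct.unpack("<i", struct.pack("<f", f))[0]  # raw 32-bit pattern
    small = bits >> 21                                   # drop low mantissa bits
    if small <= FZERO:
        return 0 if bits <= 0 else 1   # zero/negative -> 0, underflow -> 1
    if small >= FZERO + 0x100:
        return 255                     # overflow saturates
    return small - FZERO

def byte315_to_float(b: int) -> float:
    """Decode a byte produced by float_to_byte315 back into a float."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - ZERO_EXP) << 24)
    return struct.unpack("<f", struct.pack("<i", bits))[0]

# Nearby inputs collapse onto the same decoded value.
for x in (0.5, 0.6, 0.89, 1.0, 1.1):
    print(x, byte315_to_float(float_to_byte315(x)))
```

Running the loop shows that distinct inputs can decode to the same value, which is why only large differences in norms survive the round trip.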

DFRSimilarity

Implements the divergence from randomness (DFR) framework introduced in Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20, 4 (October 2002), 357-389.

The DFR scoring formula is composed of three separate components: the basic model, the aftereffect and an additional normalization component, represented by the classes BasicModel, AfterEffect and Normalization, respectively. The names of these classes were chosen to match the names of their counterparts in the Terrier IR engine.

To construct a DFRSimilarity, you must specify the implementations for all three components of DFR:

Components and their implementations:

BasicModel: Basic model of information content:
- BasicModelBE: Limiting form of Bose-Einstein
- BasicModelG: Geometric approximation of Bose-Einstein
- BasicModelP: Poisson approximation of the Binomial
- BasicModelD: Divergence approximation of the Binomial
- BasicModelIn: Inverse document frequency
- BasicModelIne: Inverse expected document frequency [mixture of Poisson and IDF]
- BasicModelIF: Inverse term frequency [approximation of I(ne)]
AfterEffect: First normalization of information gain:
- AfterEffectL: Laplace's law of succession
- AfterEffectB: Ratio of two Bernoulli processes
- AfterEffect.NoAfterEffect: no first normalization
Normalization: Second (length) normalization:
- NormalizationH1: Uniform distribution of term frequency
- NormalizationH2: term frequency density inversely related to length
- NormalizationH3: term frequency normalization provided by Dirichlet prior
- NormalizationZ: term frequency normalization provided by a Zipfian relation
- Normalization.NoNormalization: no second normalization
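
How the three components combine can be sketched as follows. This is an illustrative Python model, not the Lucene.NET implementation: the Stats class is a hypothetical stand-in for BasicStats, and the formulas used for the In basic model, the L after effect and the H2 normalization are assumptions based on the upstream Lucene sources, so verify them before relying on exact values.

```python
import math
from dataclasses import dataclass

@dataclass
class Stats:
    """Hypothetical stand-in for the statistics a DFR model needs."""
    number_of_documents: int
    doc_freq: int
    avg_field_length: float
    total_boost: float = 1.0

def basic_model_in(stats: Stats, tfn: float) -> float:
    # Inverse document frequency model (assumed formula).
    return tfn * math.log2((stats.number_of_documents + 1) / (stats.doc_freq + 0.5))

def after_effect_l(stats: Stats, tfn: float) -> float:
    # Laplace's law of succession (assumed formula).
    return 1.0 / (tfn + 1.0)

def normalization_h2(stats: Stats, tf: float, doc_len: float, c: float = 1.0) -> float:
    # Term frequency density inversely related to length (assumed formula).
    return tf * math.log2(1.0 + c * stats.avg_field_length / doc_len)

def dfr_score(stats, freq, doc_len, basic_model, after_effect, normalization):
    """Normalize tf first, then multiply the two information components."""
    tfn = normalization(stats, freq, doc_len)
    return stats.total_boost * basic_model(stats, tfn) * after_effect(stats, tfn)

stats = Stats(number_of_documents=1000, doc_freq=10, avg_field_length=100.0)
print(dfr_score(stats, 2.0, 100.0, basic_model_in, after_effect_l, normalization_h2))
```

Because each component is passed in as a plain function, swapping one basic model for another does not require touching the scoring routine, which mirrors the modular design described above.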

Note that qtf, the multiplicity of term-occurrence in the query, is not handled by this implementation.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

Distribution

The probabilistic distribution used to model term occurrence in information-based models.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

DistributionLL

Log-logistic distribution.

Unlike for DFR, the natural logarithm is used, as it is faster to compute and the original paper does not express any preference to a specific base.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

DistributionSPL

The smoothed power-law (SPL) distribution for the information-based framework that is described in the original paper.

Unlike for DFR, the natural logarithm is used, as it is faster to compute and the original paper does not express any preference to a specific base.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

IBSimilarity

Provides a framework for the family of information-based models, as described in Stéphane Clinchant and Eric Gaussier. 2010. Information-based models for ad hoc IR. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval (SIGIR '10). ACM, New York, NY, USA, 234-241.

The retrieval function is of the form RSV(q, d) = ∑ -x^q_w log Prob(X_w >= t^d_w | λ_w), where

x^q_w is the query boost;
X_w is a random variable that counts the occurrences of word w;
t^d_w is the normalized term frequency;
λ_w is a parameter.

The framework described in the paper has many similarities to the DFR framework (see DFRSimilarity). It is possible that the two Similarities will be merged at one point.

To construct an IBSimilarity, you must specify the implementations for all three components of the Information-Based model.

Components and their implementations:

Distribution: Probabilistic distribution used to model term occurrence
- DistributionLL: Log-logistic
- DistributionSPL: Smoothed power-law
Lambda: λ_w parameter of the probability distribution
- LambdaDF: `N_w/N` or average number of documents where w occurs
- LambdaTTF: `F_w/N` or average number of occurrences of w in the collection
Normalization: Term frequency normalization
- Any supported DFR normalization (listed in DFRSimilarity)
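
A rough sketch of how these components fit together is given below, in Python rather than C#. The function names are hypothetical. The lambda formulas follow the LambdaDF and LambdaTTF descriptions on this page, while the log-logistic score is an assumption based on the upstream Lucene sources. The normalized term frequency tfn is taken as an input here.

```python
import math

def lambda_df(doc_freq: int, number_of_documents: int) -> float:
    """(docFreq + 1) / (numberOfDocuments + 1)."""
    return (doc_freq + 1.0) / (number_of_documents + 1.0)

def lambda_ttf(total_term_freq: int, number_of_documents: int) -> float:
    """(totalTermFreq + 1) / (numberOfDocuments + 1)."""
    return (total_term_freq + 1.0) / (number_of_documents + 1.0)

def distribution_ll(tfn: float, lam: float) -> float:
    """Log-logistic information content, using the natural logarithm (assumed formula)."""
    return -math.log(lam / (tfn + lam))

def ib_score(boost: float, tfn: float, lam: float) -> float:
    """Query boost times the information content of seeing tfn occurrences."""
    return boost * distribution_ll(tfn, lam)

lam = lambda_df(doc_freq=9, number_of_documents=99)
for tfn in (0.0, 1.0, 5.0):
    print(tfn, ib_score(1.0, tfn, lam))
```

The score is zero when the term does not occur and grows with tfn, faster for rarer terms (smaller lambda).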

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

Lambda

The lambda (λ_w) parameter in information-based models.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

LambdaDF

Computes lambda as (docFreq + 1) / (numberOfDocuments + 1).

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

LambdaTTF

Computes lambda as (totalTermFreq + 1) / (numberOfDocuments + 1).

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

LMDirichletSimilarity

Bayesian smoothing using Dirichlet priors. From Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to Ad Hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '01). ACM, New York, NY, USA, 334-342.

The formula as defined in the paper assigns a negative score to documents that contain the term, but with fewer occurrences than predicted by the collection language model. The Lucene implementation returns 0 for such documents.
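
The clamping behaviour can be illustrated with a short Python sketch. The function name is hypothetical, and both the scoring expression and the default mu = 2000 are assumptions based on the upstream Lucene sources, not something stated on this page.

```python
import math

def lm_dirichlet_score(freq: float, doc_len: float, p_collection: float,
                       mu: float = 2000.0, boost: float = 1.0) -> float:
    """Dirichlet-smoothed language model score, clamped at zero.

    p_collection is p(w|C), the probability of the term under the
    collection language model.
    """
    raw = boost * (math.log(1.0 + freq / (mu * p_collection))
                   + math.log(mu / (doc_len + mu)))
    return max(raw, 0.0)  # negative scores are replaced by 0

# A common term occurring once in a very long document: clamped.
print(lm_dirichlet_score(freq=1, doc_len=10000, p_collection=0.01))
# A rare term occurring often in a short document: positive score.
print(lm_dirichlet_score(freq=50, doc_len=100, p_collection=0.0001))
```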

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

LMJelinekMercerSimilarity

Language model based on the Jelinek-Mercer smoothing method. From Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to Ad Hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '01). ACM, New York, NY, USA, 334-342.

The model has a single parameter, λ. According to said paper, the optimal value depends on both the collection and the query. The optimal value is around 0.1 for title queries and 0.7 for long queries.
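
To get a feel for the role of λ, here is a minimal Python sketch. The function name is hypothetical and the scoring expression is an assumption based on the upstream Lucene sources; p_collection stands for p(w|C).

```python
import math

def lm_jelinek_mercer_score(freq: float, doc_len: float, p_collection: float,
                            lam: float, boost: float = 1.0) -> float:
    """Jelinek-Mercer smoothed score.

    lam weights the collection model; (1 - lam) weights the document model.
    """
    doc_part = (1.0 - lam) * freq / doc_len
    return boost * math.log(1.0 + doc_part / (lam * p_collection))

# The same match scored with the two values of lambda quoted above.
for lam in (0.1, 0.7):
    print(lam, lm_jelinek_mercer_score(freq=3, doc_len=100, p_collection=0.001, lam=lam))
```

A smaller λ trusts the document's own term frequencies more, so a matching term contributes more to the score than it does with a larger λ.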

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

LMSimilarity

Abstract superclass for language modeling Similarities. The following inner types are introduced:

LMSimilarity.LMStats, which defines a new statistic, the probability that the collection language model generates the current term;
LMSimilarity.ICollectionModel, which is a strategy interface for objects that compute the collection language model p(w|C);
LMSimilarity.DefaultCollectionModel, an implementation of that interface, which computes the term probability as the number of occurrences of the term in the collection, divided by the total number of tokens.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

LMSimilarity.DefaultCollectionModel

Models p(w|C) as the number of occurrences of the term in the collection, divided by the total number of tokens + 1.

LMSimilarity.LMStats

Stores the collection distribution of the current term.

MultiSimilarity

Implements the CombSUM method for combining evidence from multiple similarity values described in: Joseph A. Shaw, Edward A. Fox. In Text REtrieval Conference (1993), pp. 243-252
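
CombSUM itself is simple: the combined score is the sum of the scores produced by each underlying similarity. A minimal Python sketch, with hypothetical names and scorers modelled as plain functions, might look like this:

```python
from typing import Callable, Sequence

# A scorer maps (doc_id, freq) to a score; this shape is an assumption
# made for illustration only.
Scorer = Callable[[int, float], float]

def comb_sum(scorers: Sequence[Scorer]) -> Scorer:
    """Return a scorer whose result is the sum of all sub-scorer results."""
    def combined(doc_id: int, freq: float) -> float:
        return sum(s(doc_id, freq) for s in scorers)
    return combined

combined = comb_sum([lambda doc, f: f, lambda doc, f: 2.0 * f])
print(combined(0, 1.5))
```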

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

Normalization

This class acts as the base class for the implementations of the term frequency normalization methods in the DFR framework.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

Normalization.NoNormalization

Implementation used when there is no normalization.

NormalizationH1

Normalization model that assumes a uniform distribution of the term frequency.

While this model is parameterless in the original article, information-based models (see IBSimilarity) introduced a multiplying factor. The default value for the c parameter is 1.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

NormalizationH2

Normalization model in which the term frequency is inversely related to the length.

While this model is parameterless in the original article, the thesis introduces the parameterized variant. The default value for the c parameter is 1.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

NormalizationH3

Dirichlet Priors normalization.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

NormalizationZ

Pareto-Zipf Normalization.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

PerFieldSimilarityWrapper

Provides the ability to use a different Similarity for different fields.

Subclasses should implement Get(String) to return an appropriate Similarity (for example, using field-specific parameter values) for the field.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

Similarity

Similarity defines the components of Lucene scoring.

Expert: Scoring API.

This is a low-level API; you should only extend it if you want to implement an information retrieval model. If you are instead looking for a convenient way to alter Lucene's scoring, consider extending a higher-level implementation such as TFIDFSimilarity, which implements the vector space model with this API, or just tweaking the default implementation: DefaultSimilarity.

Similarity determines how Lucene weights terms, and Lucene interacts with this class at both index-time and query-time.

At indexing time, the indexer calls ComputeNorm(FieldInvertState), allowing the Similarity implementation to set a per-document value for the field that will be later accessible via GetNormValues(String). Lucene makes no assumption about what is in this norm, but it is most useful for encoding length normalization information.

Implementations should carefully consider how the normalization is encoded: while Lucene's classical TFIDFSimilarity encodes a combination of index-time boost and length normalization information with SmallSingle into a single byte, this might not be suitable for all purposes.

Many formulas require the use of average document length, which can be computed via a combination of SumTotalTermFreq and MaxDoc or DocCount, depending upon whether the average should reflect field sparsity.
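
As a small illustration of the two choices, the sketch below computes the average in Python. The function name is hypothetical, and the reading that dividing by MaxDoc (all documents) is the variant that reflects sparsity, while DocCount (only documents that have the field) ignores it, is an assumption.

```python
def average_field_length(sum_total_term_freq: int, max_doc: int,
                         doc_count: int, reflect_sparsity: bool) -> float:
    """Average number of tokens per document for one field.

    reflect_sparsity=True divides by every document in the index, so
    documents lacking the field pull the average down.
    """
    denominator = max_doc if reflect_sparsity else doc_count
    return sum_total_term_freq / denominator

# 1000 tokens in a field that only 50 of 100 documents contain.
print(average_field_length(1000, max_doc=100, doc_count=50, reflect_sparsity=True))
print(average_field_length(1000, max_doc=100, doc_count=50, reflect_sparsity=False))
```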

Additional scoring factors can be stored in named NumericDocValuesFields and accessed at query-time with GetNumericDocValues(String).

Finally, using index-time boosts (either by folding them into the normalization byte or via DocValues) is an inefficient way to boost the scores of different fields if the boost will be the same for every document. Instead, the Similarity can simply take a constant boost parameter C, and PerFieldSimilarityWrapper can return different instances with different boosts depending upon field name.

At query-time, Queries interact with the Similarity via these steps:

The ComputeWeight(Single, CollectionStatistics, TermStatistics[]) method is called a single time, allowing the implementation to compute any statistics (such as IDF, average document length, etc) across the entire collection. The TermStatistics and CollectionStatistics passed in already contain all of the raw statistics involved, so a Similarity can freely use any combination of statistics without causing any additional I/O. Lucene makes no assumption about what is stored in the returned Similarity.SimWeight object.
The query normalization process occurs a single time: GetValueForNormalization() is called for each query leaf node, QueryNorm(Single) is called for the top-level query, and finally Normalize(Single, Single) passes down the normalization value and any top-level boosts (e.g. from enclosing BooleanQuerys).
For each segment in the index, the Query creates a scorer via GetSimScorer(Similarity.SimWeight, AtomicReaderContext). The GetScore() method is called for each matching document.

When Explain(Query, Int32) is called, queries consult the Similarity's DocScorer for an explanation of how it computed its score. The query passes in the document id and an explanation of how the frequency was computed.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

Similarity.SimScorer

API for scoring "sloppy" queries such as TermQuery, SpanQuery, and PhraseQuery.

Frequencies are floating-point values: an approximate within-document frequency adjusted for "sloppiness" by ComputeSlopFactor(Int32).

Similarity.SimWeight

Stores the weight for a query across the indexed collection. This abstract implementation is empty; descendants of Similarity should subclass Similarity.SimWeight and define the statistics they require in the subclass. Examples include idf, average field length, etc.

SimilarityBase

A subclass of Similarity that provides a simplified API for its descendants. Subclasses are only required to implement the Score(BasicStats, Single, Single) and ToString() methods. Implementing Explain(Explanation, BasicStats, Int32, Single, Single) is optional, inasmuch as SimilarityBase already provides a basic explanation of the score and the term frequency. However, implementers of a subclass are encouraged to include as much detail about the scoring method as possible.

Note: multi-word queries such as phrase queries are scored in a different way than Lucene's default ranking algorithm: whereas it "fakes" an IDF value for the phrase as a whole (since it does not know it), this class instead scores phrases as a summation of the individual term scores.

This is a Lucene.NET EXPERIMENTAL API, use at your own risk

TFIDFSimilarity

Implementation of Similarity with the Vector Space Model.

Expert: Scoring API.

TFIDFSimilarity defines the components of Lucene scoring. Overriding computation of these components is a convenient way to alter Lucene scoring.

The following describes how Lucene scoring evolves from underlying information retrieval models to an (efficient) implementation. We first give a brief overview of the VSM score, then derive from it Lucene's Conceptual Scoring Formula, from which, finally, evolves Lucene's Practical Scoring Function (the latter is connected directly with Lucene classes and methods).

Lucene combines the Boolean model (BM) of Information Retrieval with the Vector Space Model (VSM) of Information Retrieval: documents "approved" by BM are scored by VSM.

In VSM, documents and queries are represented as weighted vectors in a multi-dimensional space, where each distinct index term is a dimension, and weights are Tf-idf values.

VSM does not require weights to be Tf-idf values, but Tf-idf values are believed to produce search results of high quality, and so Lucene uses Tf-idf. Tf and Idf are described in more detail below, but for now, for completeness, let's just say that for a given term t and document (or query) x, Tf(t,x) varies with the number of occurrences of term t in x (when one increases so does the other) and idf(t) similarly varies with the inverse of the number of index documents containing term t.

VSM score of document d for query q is the Cosine Similarity of the weighted query vectors V(q) and V(d):

$$\text{cosine-similarity}(q,d) = \frac{V(q) \cdot V(d)}{|V(q)| \, |V(d)|}$$

VSM Score

Where V(q) · V(d) is the dot product of the weighted vectors, and |V(q)| and |V(d)| are their Euclidean norms.

Note: the above equation can be viewed as the dot product of the normalized weighted vectors, in the sense that dividing V(q) by its euclidean norm is normalizing it to a unit vector.
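
For a concrete instance of the VSM score, the following Python sketch computes the cosine similarity of two sparse weighted vectors. Representing vectors as term-to-weight dictionaries is an assumption made for illustration, as is the function name.

```python
import math
from typing import Dict

def cosine_similarity(vq: Dict[str, float], vd: Dict[str, float]) -> float:
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(w * vd.get(term, 0.0) for term, w in vq.items())
    norm_q = math.sqrt(sum(w * w for w in vq.values()))
    norm_d = math.sqrt(sum(w * w for w in vd.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # convention for empty vectors
    return dot / (norm_q * norm_d)

query = {"lucene": 1.0, "scoring": 1.0}
doc_a = {"lucene": 2.0, "scoring": 2.0}   # same direction as the query
doc_b = {"cooking": 3.0}                  # shares no terms with the query
print(cosine_similarity(query, doc_a), cosine_similarity(query, doc_b))
```

Scaling a vector leaves the result unchanged, which is exactly the loss of document length information discussed next.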

Lucene refines VSM score for both search quality and usability:

Normalizing V(d) to the unit vector is known to be problematic in that it removes all document length information. For some documents removing this info is probably ok, e.g. a document made by duplicating a certain paragraph 10 times, especially if that paragraph is made of distinct terms. But for a document which contains no duplicated paragraphs, this might be wrong. To avoid this problem, a different document length normalization factor is used, which normalizes to a vector equal to or larger than the unit vector: doc-len-norm(d).
At indexing, users can specify that certain documents are more important than others, by assigning a document boost. For this, the score of each document is also multiplied by its boost value doc-boost(d).
Lucene is field based, hence each query term applies to a single field, document length normalization is by the length of that field, and in addition to document boost there are also document field boosts.
The same field can be added to a document during indexing several times, and so the boost of that field is the multiplication of the boosts of the separate additions (or parts) of that field within the document.
At search time users can specify boosts to each query, sub-query, and each query term, hence the contribution of a query term to the score of a document is multiplied by the boost of that query term query-boost(q).
A document may match a multi term query without containing all the terms of that query (this is correct for some of the queries), and users can further reward documents matching more query terms through a coordination factor, which is usually larger when more terms are matched: coord-factor(q,d).

Under the simplifying assumption of a single field in the index, we get Lucene's Conceptual scoring formula:

$$\text{score}(q,d) = \text{coord-factor}(q,d) \cdot \text{query-boost}(q) \cdot \frac{V(q) \cdot V(d)}{|V(q)|} \cdot \text{doc-len-norm}(d) \cdot \text{doc-boost}(d)$$

Lucene Conceptual Scoring Formula

The conceptual formula is a simplification in the sense that (1) terms and documents are fielded and (2) boosts are usually per query term rather than per query.

We now describe how Lucene implements this conceptual scoring formula, and derive from it Lucene's Practical Scoring Function.

For efficient score computation some scoring components are computed and aggregated in advance:

Query-boost for the query (actually for each query term) is known when search starts.
Query Euclidean norm |V(q)| can be computed when search starts, as it is independent of the document being scored. From a search optimization perspective, it is a valid question why one should bother to normalize the query at all, because all scored documents will be multiplied by the same |V(q)|, and hence document ranks (their order by score) will not be affected by this normalization. There are two good reasons to keep this normalization:
- Recall that Cosine Similarity can be used to find how similar two documents are. One can use Lucene for e.g. clustering, and use a document as a query to compute its similarity to other documents. In this use case it is important that the score of document d3 for query d1 is comparable to the score of document d3 for query d2. In other words, scores of a document for two distinct queries should be comparable. There are other applications that may require this. And this is exactly what normalizing the query vector V(q) provides: comparability (to a certain extent) of two or more queries.
- Applying query normalization on the scores helps to keep the scores around the unit vector, hence preventing loss of score data because of floating point precision limitations.
Document length norm doc-len-norm(d) and document boost doc-boost(d) are known at indexing time. They are computed in advance and their multiplication is saved as a single value in the index: norm(d). (In the equations below, norm(t in d) means norm(field(t) in doc d) where field(t) is the field associated with term t.)

Lucene's Practical Scoring Function is derived from the above:

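$$\text{score}(q,d) = \text{coord}(q,d) \cdot \text{queryNorm}(q) \cdot \sum_{t \in q} \left( \text{tf}(t \text{ in } d) \cdot \text{idf}(t)^2 \cdot t.\text{Boost} \cdot \text{norm}(t,d) \right)$$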

Lucene Practical Scoring Function

where

tf(t in d) correlates to the term's frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score. Note that tf(t in q) is assumed to be 1 and therefore it does not appear in this equation. However, if a query contains the same term twice, there will be two term-queries with that same term, and hence the computation would still be correct (although not very efficient). The default computation for tf(t in d) in DefaultSimilarity (Tf(Single)) is:

$$\text{tf}(t \text{ in } d) = \text{frequency}^{1/2}$$
idf(t) stands for Inverse Document Frequency. This value correlates to the inverse of DocFreq (the number of documents in which the term t appears). This means rarer terms give a higher contribution to the total score. idf(t) appears for t in both the query and the document, hence it is squared in the equation. The default computation for idf(t) in DefaultSimilarity (Idf(Int64, Int64)) is:

$$\text{idf}(t) = 1 + \log\left( \frac{\text{NumDocs}}{\text{DocFreq} + 1} \right)$$
coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search time factor computed in Coord(Int32, Int32) by the Similarity in effect at search time.
queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable. This is a search time factor computed by the Similarity in effect at search time.

The default computation in DefaultSimilarity (QueryNorm(Single)) produces a Euclidean norm:

$$\text{queryNorm}(q) = \text{queryNorm}(\text{sumOfSquaredWeights}) = \frac{1}{\text{sumOfSquaredWeights}^{1/2}}$$

The sum of squared weights (of the query terms) is computed by the query Weight object. For example, a BooleanQuery computes this value as:

$$\text{sumOfSquaredWeights} = q.\text{Boost}^2 \cdot \sum_{t \in q} \left( \text{idf}(t) \cdot t.\text{Boost} \right)^2$$
where sumOfSquaredWeights is GetValueForNormalization() and q.Boost is Boost
t.Boost is a search time boost of term t in the query q as specified in the query text (see query syntax), or as set by application calls to Boost. Notice that there is really no direct API for accessing the boost of one term in a multi-term query; rather, multiple terms are represented in a query as multiple TermQuery objects, and so the boost of a term in the query is accessible by calling the sub-query Boost.
norm(t,d) encapsulates a few (indexing time) boost and length factors:
- Field boost - set Boost before adding the field to a document.
- lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing.
The ComputeNorm(FieldInvertState) method is responsible for combining all of these factors into a single System.Single.

When a document is added to the index, all the above factors are multiplied. If the document has multiple fields with the same name, all their boosts are multiplied together:

$$\text{norm}(t,d) = \text{lengthNorm} \cdot \prod_{\text{field } f \text{ in } d \text{ named as } t} \text{Boost}$$
Note that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search.
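
Putting the pieces together, the practical scoring function can be sketched for a single field as follows. This is an illustrative Python model, not Lucene.NET code: all function names are hypothetical, the logarithm is assumed to be the natural logarithm, and coord is assumed to be the fraction of query terms matched, which this page does not state explicitly.

```python
import math
from typing import Dict

def tf(freq: float) -> float:
    return math.sqrt(freq)

def idf(doc_freq: int, num_docs: int) -> float:
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))

def query_norm(sum_of_squared_weights: float) -> float:
    return 1.0 / math.sqrt(sum_of_squared_weights)

def coord(overlap: int, max_overlap: int) -> float:
    return overlap / max_overlap  # assumed default behaviour

def practical_score(query: Dict[str, float], doc_freqs: Dict[str, int],
                    num_docs: int, term_freqs: Dict[str, int],
                    norm: float, query_boost: float = 1.0) -> float:
    """query maps term -> t.Boost; term_freqs maps term -> freq in the document;
    norm is norm(t,d) for the single field being scored."""
    idfs = {t: idf(doc_freqs[t], num_docs) for t in query}
    sum_sq = query_boost ** 2 * sum((idfs[t] * b) ** 2 for t, b in query.items())
    matched = [t for t in query if term_freqs.get(t, 0) > 0]
    total = sum(tf(term_freqs[t]) * idfs[t] ** 2 * query[t] * norm for t in matched)
    return coord(len(matched), len(query)) * query_norm(sum_sq) * total

print(practical_score(
    query={"lucene": 1.0, "scoring": 2.0},
    doc_freqs={"lucene": 40, "scoring": 5},
    num_docs=1000,
    term_freqs={"lucene": 4},
    norm=0.5,
))
```

The norm argument is fixed at index time, which is why changing the Similarity at search time cannot alter that part of the score.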

Interfaces

LMSimilarity.ICollectionModel

A strategy for computing the collection language model.
