Class BM25Similarity
BM25 Similarity. Introduced in Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.
Note
This API is experimental and might change in incompatible ways in the next release.
Namespace: Lucene.Net.Search.Similarities
Assembly: Lucene.Net.dll
Syntax
public class BM25Similarity : Similarity
Constructors
BM25Similarity()
BM25 with these default values: k1 = 1.2, b = 0.75.
Declaration
public BM25Similarity()
BM25Similarity(float, float)
BM25 with the supplied parameter values.
Declaration
public BM25Similarity(float k1, float b)
Parameters
Type | Name | Description |
---|---|---|
float | k1 | Controls non-linear term frequency normalization (saturation). |
float | b | Controls to what degree document length normalizes tf values. |
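To illustrate what the two parameters do, here is a minimal standalone sketch of the textbook BM25 term-frequency component. It is not taken from the Lucene.NET source; the class name `Bm25TermFrequencySketch` and the method `TfNorm` are hypothetical names used only for this example.

```csharp
using System;

public static class Bm25TermFrequencySketch
{
    // Textbook BM25 term-frequency component (an illustration, not Lucene.NET's code).
    // freq:           occurrences of the term in the field
    // fieldLength:    length of the field in tokens
    // avgFieldLength: average field length across the collection
    public static float TfNorm(float freq, float fieldLength, float avgFieldLength, float k1, float b)
    {
        // b = 0 disables length normalization; b = 1 applies it fully.
        float lengthFactor = 1 - b + b * (fieldLength / avgFieldLength);

        // As freq grows, the result approaches (k1 + 1): this is the saturation controlled by k1.
        return freq * (k1 + 1) / (freq + k1 * lengthFactor);
    }
}
```

With b = 0 the field length has no effect on the result, and for any k1 the value stays below k1 + 1 no matter how often the term occurs.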
Properties
B
Returns the b parameter.
Declaration
public virtual float B { get; }
Property Value
Type | Description |
---|---|
float |
DiscountOverlaps
Gets or sets whether overlap tokens (tokens with a position increment of 0) are ignored when computing norms. The default is true, meaning overlap tokens do not count when computing norms.
Declaration
public virtual bool DiscountOverlaps { get; set; }
Property Value
Type | Description |
---|---|
bool |
K1
Returns the k1 parameter.
Declaration
public virtual float K1 { get; }
Property Value
Type | Description |
---|---|
float |
Methods
AvgFieldLength(CollectionStatistics)
The default implementation computes the average as sumTotalTermFreq / maxDoc, or returns 1 if the index does not store sumTotalTermFreq (Lucene 3.x indexes or any field that omits frequency information).
Declaration
protected virtual float AvgFieldLength(CollectionStatistics collectionStats)
Parameters
Type | Name | Description |
---|---|---|
CollectionStatistics | collectionStats |
Returns
Type | Description |
---|---|
float |
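The rule described for AvgFieldLength can be sketched as a standalone function. This is an illustration, not the library's code: it takes the two statistics as plain numbers instead of a CollectionStatistics instance, and it assumes that a non-positive sumTotalTermFreq is how a missing statistic is signalled. The class name `AvgFieldLengthSketch` is hypothetical.

```csharp
public static class AvgFieldLengthSketch
{
    // Assumption: a value <= 0 means the index does not store sumTotalTermFreq.
    public static float AvgFieldLength(long sumTotalTermFreq, long maxDoc)
    {
        if (sumTotalTermFreq <= 0)
        {
            return 1f; // fallback when the statistic is unavailable
        }

        // Divide in double precision, then narrow to float.
        return (float)(sumTotalTermFreq / (double)maxDoc);
    }
}
```

For example, 1000 total term occurrences across 10 documents gives an average field length of 100.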
ComputeNorm(FieldInvertState)
Computes the normalization value for a field, given the accumulated state of term processing for this field (see FieldInvertState).
Matches in longer fields are less precise, so implementations of this method usually set smaller values when state.Length is large, and larger values when state.Length is small.
Note
This API is experimental and might change in incompatible ways in the next release.
Declaration
public override sealed long ComputeNorm(FieldInvertState state)
Parameters
Type | Name | Description |
---|---|---|
FieldInvertState | state | current processing state for this field |
Returns
Type | Description |
---|---|
long | computed norm value |
Overrides
Similarity.ComputeNorm(FieldInvertState)
ComputeWeight(float, CollectionStatistics, params TermStatistics[])
Compute any collection-level weight (e.g. IDF, average document length, etc) needed for scoring a query.
Declaration
public override sealed Similarity.SimWeight ComputeWeight(float queryBoost, CollectionStatistics collectionStats, params TermStatistics[] termStats)
Parameters
Type | Name | Description |
---|---|---|
float | queryBoost | the query-time boost. |
CollectionStatistics | collectionStats | collection-level statistics, such as the number of tokens in the collection. |
TermStatistics[] | termStats | term-level statistics, such as the document frequency of a term across the collection. |
Returns
Type | Description |
---|---|
Similarity.SimWeight | Similarity.SimWeight object with the information this Similarity needs to score a query. |
Overrides
Similarity.ComputeWeight(float, CollectionStatistics, params TermStatistics[])
DecodeNormValue(byte)
The default implementation returns 1 / f^2, where f is Byte315ToSingle(byte).
Declaration
protected virtual float DecodeNormValue(byte b)
Parameters
Type | Name | Description |
---|---|---|
byte | b |
Returns
Type | Description |
---|---|
float |
EncodeNormValue(float, int)
The default implementation encodes boost / sqrt(length) with SingleToByte315(float). This is compatible with Lucene's default implementation. If you change this, you should change DecodeNormValue(byte) to match.
Declaration
protected virtual byte EncodeNormValue(float boost, int fieldLength)
Parameters
Type | Name | Description |
---|---|---|
float | boost | |
int | fieldLength |
Returns
Type | Description |
---|---|
byte |
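The default EncodeNormValue and DecodeNormValue implementations are designed as a pair. The following short derivation shows why; it ignores the precision lost when the value is quantized to a single byte, and the symbol f is used as in the DecodeNormValue description.

```latex
f = \frac{\text{boost}}{\sqrt{\text{length}}}
\quad\Longrightarrow\quad
\frac{1}{f^{2}} = \frac{\text{length}}{\text{boost}^{2}}
```

With a boost of 1, the decoded value is therefore approximately the field length itself, up to the rounding introduced by the one-byte encoding.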
GetSimScorer(SimWeight, AtomicReaderContext)
Creates a new Similarity.SimScorer to score matching documents from a segment of the inverted index.
Declaration
public override sealed Similarity.SimScorer GetSimScorer(Similarity.SimWeight stats, AtomicReaderContext context)
Parameters
Type | Name | Description |
---|---|---|
Similarity.SimWeight | stats | |
AtomicReaderContext | context | segment of the inverted index to be scored. |
Returns
Type | Description |
---|---|
Similarity.SimScorer | Sloppy Similarity.SimScorer for scoring documents across context |
Overrides
Similarity.GetSimScorer(Similarity.SimWeight, AtomicReaderContext)
Exceptions
Type | Condition |
---|---|
IOException | if there is a low-level I/O error |
Idf(long, long)
Implemented as log(1 + (numDocs - docFreq + 0.5)/(docFreq + 0.5)).
Declaration
protected virtual float Idf(long docFreq, long numDocs)
Parameters
Type | Name | Description |
---|---|---|
long | docFreq | |
long | numDocs |
Returns
Type | Description |
---|---|
float |
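The Idf formula can be sketched as a standalone function. This is an illustration rather than the library's code; it assumes that log denotes the natural logarithm, and the class name `IdfSketch` is hypothetical.

```csharp
using System;

public static class IdfSketch
{
    // log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5)), assuming a natural logarithm.
    public static float Idf(long docFreq, long numDocs)
    {
        return (float)Math.Log(1 + (numDocs - docFreq + 0.5D) / (docFreq + 0.5D));
    }
}
```

Because of the 1 + inside the logarithm, the result stays positive even for a term that occurs in every document, and it decreases as docFreq grows.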
IdfExplain(CollectionStatistics, TermStatistics)
Computes a score factor for a simple term and returns an explanation for that score factor.
The default implementation uses: Idf(docFreq, searcher.MaxDoc);
Note that MaxDoc is used instead of NumDocs because DocFreq is also used: when DocFreq is inaccurate, MaxDoc is inaccurate in the same direction. In addition, MaxDoc is more efficient to compute.
Declaration
public virtual Explanation IdfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
Parameters
Type | Name | Description |
---|---|---|
CollectionStatistics | collectionStats | collection-level statistics |
TermStatistics | termStats | term-level statistics for the term |
Returns
Type | Description |
---|---|
Explanation | an Explanation object that includes both an idf score factor and an explanation for the term. |
IdfExplain(CollectionStatistics, TermStatistics[])
Computes a score factor for a phrase.
The default implementation sums the idf factor for each term in the phrase.
Declaration
public virtual Explanation IdfExplain(CollectionStatistics collectionStats, TermStatistics[] termStats)
Parameters
Type | Name | Description |
---|---|---|
CollectionStatistics | collectionStats | collection-level statistics |
TermStatistics[] | termStats | term-level statistics for the terms in the phrase |
Returns
Type | Description |
---|---|
Explanation | an Explanation object that includes both an idf score factor for the phrase and an explanation for each term. |
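Written out, the phrase-level factor is a sum over the terms of the phrase. The notation below is an assumption made for illustration: df_i stands for the document frequency of the i-th term, N for the document count passed to Idf(long, long), and each summand is assumed to use the same formula as Idf(long, long).

```latex
\mathrm{idf}(\text{phrase}) = \sum_{i=1}^{n} \log\!\left(1 + \frac{N - \mathit{df}_i + 0.5}{\mathit{df}_i + 0.5}\right)
```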
ScorePayload(int, int, int, BytesRef)
The default implementation returns 1.
Declaration
protected virtual float ScorePayload(int doc, int start, int end, BytesRef payload)
Parameters
Type | Name | Description |
---|---|---|
int | doc | |
int | start | |
int | end | |
BytesRef | payload |
Returns
Type | Description |
---|---|
float |
SloppyFreq(int)
Implemented as 1 / (distance + 1).
Declaration
protected virtual float SloppyFreq(int distance)
Parameters
Type | Name | Description |
---|---|---|
int | distance |
Returns
Type | Description |
---|---|
float |
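A standalone sketch of the SloppyFreq formula, for illustration only (the class name `SloppyFreqSketch` is hypothetical):

```csharp
public static class SloppyFreqSketch
{
    // 1 / (distance + 1): an exact match (distance 0) counts fully,
    // and the contribution shrinks as the match gets sloppier.
    public static float SloppyFreq(int distance)
    {
        return 1.0f / (distance + 1);
    }
}
```

A distance of 0 yields 1, a distance of 1 yields 0.5, and a distance of 3 yields 0.25.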
ToString()
Returns a string that represents the current object.
Declaration
public override string ToString()
Returns
Type | Description |
---|---|
string | A string that represents the current object. |