Namespace Lucene.Net.Search
Code to search indices.
Table Of Contents
- Search Basics
- The Query Classes
- Scoring: Introduction
- Scoring: Basics
- Changing the Scoring
- Appendix: Search Algorithm
Search Basics
Lucene offers a wide variety of Query implementations, most of which are in this package, its subpackages (Lucene.Net.Spans, Lucene.Net.Payloads), or the Lucene.Net.Queries module. These implementations can be combined in a wide variety of ways to provide complex querying capabilities along with information about where matches took place in the document collection. The Query Classes section below highlights some of the more important Query classes. For details on implementing your own Query class, see Custom Queries -- Expert Level below.
To perform a search, applications usually call Search(Query, int) or Search(Query, Filter, int).
Once a Query has been created and submitted to the IndexSearcher, the scoring process begins. After some infrastructure setup, control finally passes to the Weight implementation and its Scorer or BulkScorer instances. See the Algorithm section for more notes on the process.
Query Classes
TermQuery
Of the various implementations of Query, the TermQuery is the easiest to understand and the most often used in applications. A TermQuery matches all the documents that contain the specified Term, which is a word that occurs in a certain Field. Thus, a TermQuery identifies and scores all Documents that have a Field with the specified string in it. Constructing a TermQuery is as simple as:
TermQuery tq = new TermQuery(new Term("fieldName", "term"));
In this example, the Query identifies all Documents that have the Field named "fieldName" containing the word "term".
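Putting these pieces together, here is a minimal end-to-end search sketch. The directory path and field name are illustrative assumptions, not part of the API:

using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

using (var dir = FSDirectory.Open("indexDir"))    // assumed existing index
using (var reader = DirectoryReader.Open(dir))
{
    var searcher = new IndexSearcher(reader);
    var query = new TermQuery(new Term("fieldName", "term"));

    TopDocs topDocs = searcher.Search(query, 10);  // top 10 hits by score
    foreach (ScoreDoc hit in topDocs.ScoreDocs)
    {
        Document doc = searcher.Doc(hit.Doc);      // load the stored fields
    }
}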
BooleanQuery
Things start to get interesting when one combines multiple TermQuery instances into a BooleanQuery. A BooleanQuery contains multiple BooleanClauses, where each clause contains a sub-query (Query instance) and an operator (from BooleanClause.Occur) describing how that sub-query is combined with the other clauses:
[Occur.SHOULD](xref:Lucene.Net.Search.Occur#Lucene_Net_Search_Occur_SHOULD) — Use this operator when a clause can occur in the result set, but is not required. If a query is made up of all SHOULD clauses, then every document in the result set matches at least one of these clauses.
[Occur.MUST](xref:Lucene.Net.Search.Occur#Lucene_Net_Search_Occur_MUST) — Use this operator when a clause is required to occur in the result set. Every document in the result set will match all such clauses.
[Occur.MUST_NOT](xref:Lucene.Net.Search.Occur#Lucene_Net_Search_Occur_MUST_NOT) — Use this operator when a clause must not occur in the result set. No document in the result set will match any such clauses.
Boolean queries are constructed by adding two or more BooleanClause instances. If too many clauses are added, a TooManyClausesException will be thrown during searching. This most often occurs when a Query is rewritten into a BooleanQuery with many TermQuery clauses, for example by WildcardQuery. The default setting for the maximum number of clauses is 1024, but this can be changed via the static BooleanQuery.MaxClauseCount property.
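For example, the following sketch (field and term values are illustrative) requires one term, prefers another, and excludes a third:

var booleanQuery = new BooleanQuery()
{
    { new TermQuery(new Term("body", "lucene")), Occur.MUST },        // required
    { new TermQuery(new Term("body", "search")), Occur.SHOULD },      // optional; improves score
    { new TermQuery(new Term("body", "deprecated")), Occur.MUST_NOT } // excluded
};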
Phrases
Another common search is to find documents containing certain phrases. This is handled three different ways:
PhraseQuery — Matches a sequence of Terms. PhraseQuery uses a slop factor to determine how many positions may occur between any two terms in the phrase and still be considered a match. The slop is 0 by default, meaning the phrase must match exactly.
MultiPhraseQuery — A more general form of PhraseQuery that accepts multiple Terms for a position in the phrase. For example, this can be used to perform phrase queries that also incorporate synonyms.
SpanNearQuery — Matches a sequence of other SpanQuery instances. SpanNearQuery allows for much more complicated phrase queries since it is constructed from other SpanQuery instances, instead of only TermQuery instances.
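For example, a sketch of a sloppy PhraseQuery (the field and terms are illustrative):

var phrase = new PhraseQuery();
phrase.Add(new Term("body", "quick"));
phrase.Add(new Term("body", "fox"));
phrase.Slop = 1; // 0 (the default) requires an exact phrase; 1 tolerates one position of movement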
TermRangeQuery
The TermRangeQuery matches all documents that occur in the exclusive range of a lower Term and an upper Term according to TermsEnum.Comparer. It is not intended for numerical ranges; use NumericRangeQuery instead. For example, one could find all documents that have terms beginning with the letters a through c.
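For example, a sketch matching terms between "apple" and "cherry" (the two bool arguments control whether the lower and upper endpoints are inclusive):

var range = TermRangeQuery.NewStringRange("title", "apple", "cherry", true, true);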
NumericRangeQuery
The NumericRangeQuery matches all documents that occur in a numeric range. For NumericRangeQuery to work, you must index the values using one of the numeric fields (Int32Field, Int64Field, SingleField, or DoubleField).
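For example, a sketch assuming a "year" field that was indexed as an Int32Field:

// At index time: doc.Add(new Int32Field("year", 2005, Field.Store.NO));
var years = NumericRangeQuery.NewInt32Range("year", 2000, 2010, true, true); // 2000..2010 inclusive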
PrefixQuery, WildcardQuery, RegexpQuery
While the PrefixQuery has a different implementation, it is essentially a special case of the WildcardQuery. The PrefixQuery allows an application to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing for the use of * (matches 0 or more characters) and ? (matches exactly one character) wildcards. Note that the WildcardQuery can be quite slow. Also note that WildcardQuery terms should not start with * or ?, as these are extremely slow. Some QueryParsers may not allow this by default, but provide an AllowLeadingWildcard property to remove that protection. The RegexpQuery is even more general than WildcardQuery, allowing an application to identify all documents with terms that match a regular expression pattern.
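Minimal sketches of all three (the field name and patterns are illustrative):

var prefix = new PrefixQuery(new Term("body", "luc"));       // luc, lucene, lucid, ...
var wildcard = new WildcardQuery(new Term("body", "lu*e?")); // note: no leading * or ?
var regexp = new RegexpQuery(new Term("body", "[dl]og?"));   // do, dog, lo, log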
FuzzyQuery
A FuzzyQuery matches documents that contain terms similar to the specified term. Similarity is determined using Levenshtein (edit) distance. This type of query can be useful when accounting for spelling variations in the collection.
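For example, a sketch matching terms within two edits of "lucene" (the field is illustrative):

var fuzzy = new FuzzyQuery(new Term("body", "lucene"), 2); // maxEdits may be 0, 1, or 2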
Scoring — Introduction
Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user. In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to work. Then we are left digging into Lucene internals or asking for help on user@lucenenet.apache.org to figure out why a document with five of our query terms scores lower than a different document with only one of the query terms.
While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can help you figure out the what and why of Lucene scoring.
Lucene scoring supports a number of pluggable information retrieval models, including the Vector Space Model (VSM), probabilistic models such as Okapi BM25 and DFR (divergence from randomness), and language models.
These models can be plugged in via the Similarity API, and offer extension hooks and parameters for tuning. In general, Lucene first finds the documents that need to be scored based on boolean logic in the Query specification, and then ranks this subset of matching documents via the retrieval model. For some valuable references on VSM and IR in general refer to Lucene Wiki IR references.
The rest of this document will cover Scoring basics and explain how to change your Similarity. Next, it will cover ways you can customize the Lucene internals in Custom Queries -- Expert Level, which gives details on implementing your own Query class and related functionality. Finally, we will finish up with some reference material in the Appendix.
Scoring — Basics
Scoring is very much dependent on the way documents are indexed, so it is important to understand indexing (see the Lucene overview before continuing with this section). Be sure to use IndexSearcher.Explain(Query, int) to understand how the score for a certain matching document was computed.
Generally, the Query determines which documents match (a binary decision), while the Similarity determines how to assign scores to the matching documents.
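For example, a short sketch (assuming a searcher, a query, and an internal document id docId from a prior search):

Explanation explanation = searcher.Explain(query, docId);
Console.WriteLine(explanation); // human-readable breakdown of the score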
Fields and Documents
In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (Tokenized, Stored, etc.). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This is important because two Documents with exactly the same content, but with one having the content in two Fields and the other in one Field, may return different scores for the same query due to length normalization.
Score Boosting
Lucene allows influencing search results by "boosting" at different times:
- Index-time boost by setting Field.Boost before a document is added to the index.
- Query-time boost by setting a boost on a query clause, setting Query.Boost.
Index-time boosts are pre-processed for storage efficiency and written to storage for a field as follows:
- All boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied.
- The boost is then encoded into a normalization value by the Similarity object at index-time: ComputeNorm. The actual encoding depends upon the Similarity implementation, but note that most use a lossy encoding (such as multiplying the boost with document length or similar, packed into a single byte!).
- Decoding of any index-time normalization values and integration into the document's score is also performed at search time by the Similarity.
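A minimal sketch of the two boosting hooks (field and term values are illustrative):

// Index-time: boost a field before its document is added; the boost is
// folded (lossily) into the field's norm by the Similarity.
var titleField = new TextField("title", "Lucene in Action", Field.Store.YES);
titleField.Boost = 2.0f;

// Query-time: boost one clause relative to its siblings.
var titleQuery = new TermQuery(new Term("title", "lucene")) { Boost = 5f };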
Changing Scoring — Similarity
Changing Similarity is an easy way to influence scoring. This is done at index-time with IndexWriterConfig.Similarity and at query-time with IndexSearcher.Similarity. Be sure to use the same Similarity at query-time as at index-time (so that norms are encoded/decoded correctly); Lucene makes no effort to verify this.
You can influence scoring by configuring a different built-in Similarity implementation, by tweaking its parameters, or by subclassing it to override behavior. Some implementations also offer a modular API which you can extend by plugging in a different component (e.g. a term frequency normalizer).
Finally, you can extend the low level Similarity directly to implement a new retrieval model, or to use external scoring factors particular to your application. For example, a custom Similarity can access per-document values via FieldCache or NumericDocValues and integrate them into the score.
See the Lucene.Net.Search.Similarities package documentation for information on the built-in available scoring models and extending or changing Similarity.
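For example, a sketch that switches both index time and query time to BM25 (assuming an existing analyzer and reader):

Similarity similarity = new BM25Similarity(); // from Lucene.Net.Search.Similarities

var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
    Similarity = similarity // index-time: used when computing norms
};
// ... add documents with an IndexWriter built from config ...

var searcher = new IndexSearcher(reader)
{
    Similarity = similarity // query-time: should match the index-time choice
};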
Custom Queries — Expert Level
Custom queries are an expert level task, so tread carefully and be prepared to share your code if you want help.
With the warning out of the way, it is possible to change a lot more than just the Similarity when it comes to matching and scoring in Lucene. Lucene's search is a complex mechanism that is grounded by the following main classes:
- Query — The abstract object representation of the user's information need.
- Weight — The internal interface representation of the user's Query, so that Query objects may be reused. This is global (across all segments of the index) and generally will require global statistics (such as DocFreq for a given term across all segments).
- Scorer — An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities. This is created per-segment.
- BulkScorer — An abstract class that scores a range of documents. A default implementation simply iterates through the hits from Scorer, but some queries such as BooleanQuery have more efficient implementations.
Details on each of these classes, and their children, can be found in the subsections below.
The Query Class
In some sense, the Query class is where it all begins. Without a Query, there would be nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it is often responsible for creating them or coordinating the functionality between them. The Query class has several methods that are important for derived classes:
- CreateWeight(IndexSearcher searcher) — A Weight is the internal representation of the Query, so each Query implementation must provide an implementation of Weight. See the subsection on The Weight Interface below for details on implementing the Weight interface.
- Rewrite(IndexReader reader) — Rewrites queries into primitive queries. Primitive queries are: TermQuery, BooleanQuery, and other queries that implement CreateWeight(IndexSearcher searcher).
The Weight Interface
The Weight interface provides an internal representation of the Query so that it can be reused. Any IndexSearcher dependent state should be stored in the Weight implementation, not in the Query class. The interface defines five members that must be implemented:
- Query — Pointer to the Query that this Weight represents.
- GetValueForNormalization() — A weight can return a floating point value to indicate its magnitude for query normalization. Typically a weight such as TermWeight that scores via a Similarity will just defer to the Similarity's implementation: SimWeight.GetValueForNormalization(). For example, with Lucene's classic vector-space formula, this is implemented as the sum of squared weights: (idf * boost)².
- Normalize(float norm, float topLevelBoost) — Performs query normalization:
  - topLevelBoost: a query-boost factor from any wrapping queries that should be multiplied into every document's score. For example, a TermQuery that is wrapped within a BooleanQuery with a boost of 5 would receive this value at this time. This allows the TermQuery (the leaf node in this case) to compute this up-front a single time (e.g. by multiplying into the IDF), rather than for every document.
  - norm: passes in a normalization factor which may allow for comparing scores between queries. Typically a weight such as TermWeight that scores via a Similarity will just defer to the Similarity's implementation: SimWeight.Normalize(float, float).
- GetScorer(AtomicReaderContext context, IBits acceptDocs) — Construct a new Scorer for this Weight. See The Scorer Class below for help defining a Scorer. As the name implies, the Scorer is responsible for doing the actual scoring of documents given the Query.
- GetBulkScorer(AtomicReaderContext context, bool scoreDocsInOrder, IBits acceptDocs) — Construct a new BulkScorer for this Weight. See The BulkScorer Class below for help defining a BulkScorer. This is an optional method, and most queries do not implement it.
- Explain(AtomicReaderContext context, int doc) — Provide a means for explaining why a given document was scored the way it was. Typically a weight such as TermWeight that scores via a Similarity will make use of the Similarity's implementation: SimScorer.Explain(int doc, Explanation freq).
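To make the shape of the contract concrete, here is a skeletal custom Weight. It assumes the Lucene.NET 4.8 member signatures described above; MyWeight and its fields are hypothetical, and the per-segment scoring logic is elided:

using Lucene.Net.Index; // AtomicReaderContext
using Lucene.Net.Search;
using Lucene.Net.Util;  // IBits

internal sealed class MyWeight : Weight
{
    private readonly Query outerQuery;
    private float queryWeight; // e.g. idf * boost
    private float queryNorm;

    public MyWeight(Query query) { outerQuery = query; }

    // Pointer back to the Query this Weight represents.
    public override Query Query => outerQuery;

    // This weight's contribution to the sum of squared weights.
    public override float GetValueForNormalization() => queryWeight * queryWeight;

    // Fold the query norm and any wrapping boost into this weight once.
    public override void Normalize(float norm, float topLevelBoost)
    {
        queryNorm = norm * topLevelBoost;
        queryWeight *= queryNorm;
    }

    public override Scorer GetScorer(AtomicReaderContext context, IBits acceptDocs)
    {
        // Build and return a per-segment Scorer here, reading postings from
        // context.AtomicReader; elided in this sketch.
        return null;
    }

    public override Explanation Explain(AtomicReaderContext context, int doc)
        => new Explanation(0f, "not implemented in this sketch");
}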
The Scorer Class
The Scorer abstract class provides common scoring functionality for all Scorer implementations and is the heart of the Lucene scoring process. Scorer defines the following methods, which must be implemented (some are inherited from DocIdSetIterator, and some are not yet abstract but will become so in future versions and should be treated as abstract now):
- NextDoc() — Advances to the next document that matches this Query, returning its document id, or NO_MORE_DOCS if there are no more matches.
- DocID — Returns the id of the Document that contains the match.
- GetScore() — Return the score of the current document. This value can be determined in any appropriate way for an application. For instance, the TermScorer simply defers to the configured Similarity: SimScorer.Score(int doc, float freq).
- Freq — Returns the number of matches for the current document. This value can be determined in any appropriate way for an application. For instance, the TermScorer simply defers to the term frequency from the inverted index: DocsEnum.Freq.
- Advance(int) — Skips ahead to the first matching document whose id is greater than or equal to the given target. In many instances, Advance can be implemented more efficiently than simply looping through all the matching documents until the target document is identified.
- GetChildren() — Returns any child subscorers underneath this scorer. This allows for users to navigate the scorer hierarchy and receive more fine-grained details on the scoring process.
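A sketch of the consumption side, driving a Scorer the way a collector does (assuming scorer was obtained from Weight.GetScorer for a single segment):

int doc;
while ((doc = scorer.NextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
{
    float score = scorer.GetScore(); // per-document score for this match
    int freq = scorer.Freq;          // number of matches within the document
    // ... collect (doc, score) ...
}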
The BulkScorer Class
The BulkScorer scores a range of documents. There is only one abstract method:
- Score(ICollector, int) — Score all documents up to but not including the specified max document.
Why would I want to add my own Query?
In a nutshell, you want to add your own custom Query implementation when you think that Lucene's existing implementations aren't appropriate for the task that you want to do. You might be doing some cutting-edge research, or you might need more information back out of Lucene (similar to Doug adding SpanQuery functionality).
Appendix: Search Algorithm
This section is mostly notes on stepping through the Scoring process and serves as fertilizer for the earlier sections.
In the typical search application, a Query is passed to the IndexSearcher, beginning the scoring process.
Once inside the IndexSearcher, an ICollector is used for the scoring and sorting of the search results. These important objects are involved in a search:
- The Weight object of the Query. The Weight object is an internal representation of the Query that allows the Query to be reused by the IndexSearcher.
- The IndexSearcher that initiated the call.
- A Filter for limiting the result set. Note that the Filter may be null.
- A Sort object for specifying how to sort the results if the standard score-based sort method is not desired.
Assuming we are not sorting (since sorting doesn't affect the raw Lucene score), we call one of the search methods of the IndexSearcher, passing in the Weight object created by IndexSearcher.CreateNormalizedWeight(Query), the Filter, and the number of results we want. This method returns a TopDocs object, which is an internal collection of search results. The IndexSearcher creates a TopScoreDocCollector and passes it, along with the Weight and Filter, to another expert search method (for more on the ICollector mechanism, see IndexSearcher). The TopScoreDocCollector uses a PriorityQueue to collect the top results for the search.
If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, we ask the Weight for a Scorer for each IndexReader segment and proceed by calling BulkScorer.Score(ICollector).
At last, we are actually going to score some documents. The score method takes in the ICollector (most likely the TopScoreDocCollector or TopFieldCollector) and does its business. Of course, here is where things get involved. The Scorer that is returned by the Weight object depends on what type of Query was submitted. In most real world applications with multiple query terms, the Scorer is going to be a BooleanScorer2 created from BooleanWeight (see the section on custom queries for info on changing this).
Assuming a BooleanScorer2, we first initialize the Coordinator, which is used to apply the Coord() factor. We then get an internal Scorer based on the required, optional, and prohibited parts of the query. Using this internal Scorer, the BooleanScorer2 then proceeds into a while loop based on the Scorer.NextDoc() method. The NextDoc() method advances to the next document matching the query. This is an abstract method in the Scorer class and is thus overridden by all derived implementations. If you have a simple OR query, your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers from the sub-scorers of the OR'd terms.
Classes
AutomatonQuery
A Query that will match terms against a finite-state machine.
This query will match documents that contain terms accepted by a given finite-state machine. The automaton can be constructed with the Lucene.Net.Util.Automaton API. Alternatively, it can be created from a regular expression with RegexpQuery or from the standard Lucene wildcard syntax with WildcardQuery.
When the query is executed, it will create an equivalent DFA of the finite-state machine, and will enumerate the term dictionary in an intelligent way to reduce the number of comparisons. For example, the regular expression [dl]og? will make approximately four comparisons: do, dog, lo, and log.
Note
This API is experimental and might change in incompatible ways in the next release.
BitsFilteredDocIdSet
This implementation supplies a filtered DocIdSet that excludes all docids which are not in an IBits instance. This is especially useful in Filter to apply the acceptDocs passed to GetDocIdSet(AtomicReaderContext, IBits) before returning the final DocIdSet.
BooleanClause
A clause in a BooleanQuery.
BooleanQuery
A Query that matches documents matching boolean combinations of other queries, e.g. TermQuerys, PhraseQuerys or other BooleanQuerys.
Collection initializer note: To create and populate a BooleanQuery in a single statement, you can use the following example as a guide:
var booleanQuery = new BooleanQuery() {
{ new WildcardQuery(new Term("field2", "foobar")), Occur.SHOULD },
{ new MultiPhraseQuery() {
new Term("field", "microsoft"),
new Term("field", "office")
}, Occur.SHOULD }
};
// or
var booleanQuery = new BooleanQuery() {
new BooleanClause(new WildcardQuery(new Term("field2", "foobar")), Occur.SHOULD),
new BooleanClause(new MultiPhraseQuery() {
new Term("field", "microsoft"),
new Term("field", "office")
}, Occur.SHOULD)
};
BooleanQuery.BooleanWeight
Expert: the Weight for BooleanQuery, used to normalize, score and explain these queries.
Note
This API is experimental and might change in incompatible ways in the next release.
BooleanQuery.TooManyClausesException
Thrown when an attempt is made to add more than MaxClauseCount clauses. This typically happens if a PrefixQuery, FuzzyQuery, WildcardQuery, or TermRangeQuery is expanded to many terms during search.
BoostAttribute
Implementation class for IBoostAttribute.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
BulkScorer
This class is used to score a range of documents at once, and is returned by GetBulkScorer(AtomicReaderContext, bool, IBits). Only queries that have a more optimized means of scoring across a range of documents need to override this. Otherwise, a default implementation is wrapped around the Scorer returned by GetScorer(AtomicReaderContext, IBits).
CachingCollector
Caches all docs, and optionally also scores, coming from
a search, and is then able to replay them to another
collector. You specify the max RAM this class may use.
Once the collection is done, call IsCached. If
this returns true
, you can use Replay(ICollector)
against a new collector. If it returns false
, this means
too much RAM was required and you must instead re-run the
original search.
See the Lucene modules/grouping
module for more
details including a full code example.
Note
This API is experimental and might change in incompatible ways in the next release.
CachingWrapperFilter
Wraps another Filter's result and caches it. The purpose is to allow filters to simply filter, and then wrap with this class to add caching.
CollectionStatistics
Contains statistics for a collection (field)
Note
This API is experimental and might change in incompatible ways in the next release.
CollectionTerminatedException
Throw this exception in Collect(int) to prematurely terminate collection of the current leaf.
Note: IndexSearcher swallows this exception and never re-throws it. As a consequence, you should not catch it when calling any overload of Search(Weight, FieldDoc?, int, Sort, bool, bool, bool) as it is unnecessary and might hide misuse of this exception.
Collector
LUCENENET specific class used to hold the NewAnonymous(Action<Scorer>, Action<int>, Action<AtomicReaderContext>, Func<bool>) static method.
ComplexExplanation
Expert: Describes the score computation for document and query, and can distinguish a match independent of a positive value.
ConstantScoreAutoRewrite
A rewrite method that tries to pick the best constant-score rewrite method based on term and document counts from the query. If both the number of terms and documents is small enough, then CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE is used. Otherwise, CONSTANT_SCORE_FILTER_REWRITE is used.
ConstantScoreQuery
A query that wraps another query or a filter and simply returns a constant score equal to the query boost for every document that matches the filter or query. For queries, it therefore simply strips off all scores and returns a constant one.
ConstantScoreQuery.ConstantBulkScorer
We return this as our BulkScorer so that if the CSQ wraps a query with its own optimized top-level scorer (e.g. BooleanScorer) we can use that top-level scorer.
ConstantScoreQuery.ConstantScorer
ConstantScoreQuery.ConstantWeight
Code to search indices.
Table Of Contents
- Search Basics
- The Query Classes
- Scoring: Introduction
- Scoring: Basics
- Changing the Scoring
- Appendix: Search Algorithm
Search Basics
Lucene offers a wide variety of Query implementations, most of which are in this package, its subpackages (Lucene.Net.Spans, Lucene.Net.Payloads), or the Lucene.Net.Queries module. These implementations can be combined in a wide variety of ways to provide complex querying capabilities along with information about where matches took place in the document collection. The Query Classes section below highlights some of the more important Query classes. For details on implementing your own Query class, see Custom Queries -- Expert Level below.
To perform a search, applications usually call Search(Query, int) or Search(Query, Filter, int).
Once a Query has been created and submitted to the IndexSearcher, the scoring process begins. After some infrastructure setup, control finally passes to the Weight implementation and its Scorer or BulkScorer instances. See the Algorithm section for more notes on the process.
Query Classes
TermQuery
Of the various implementations of Query, the TermQuery is the easiest to understand and the most often used in applications. A TermQuery matches all the documents that contain the specified Term, which is a word that occurs in a certain Field. Thus, a TermQuery identifies and scores all Documents that have a Field with the specified string in it. Constructing a TermQuery is as simple as:
TermQuery tq = new TermQuery(new Term("fieldName", "term"));
In this example, the Query identifies all Documents that have the Field named "fieldName"
containing the word "term"
.
BooleanQuery
Things start to get interesting when one combines multiple TermQuery instances into a BooleanQuery. A BooleanQuery contains multiple BooleanClauses, where each clause contains a sub-query (Query instance) and an operator (from BooleanClause.Occur) describing how that sub-query is combined with the other clauses:
[Occur.SHOULD](xref:Lucene.Net.Search.Occur#Lucene_Net_Search_Occur_SHOULD) — Use this operator when a clause can occur in the result set, but is not required. If a query is made up of all SHOULD clauses, then every document in the result set matches at least one of these clauses.
[Occur.MUST](xref:Lucene.Net.Search.Occur#Lucene_Net_Search_Occur_MUST) — Use this operator when a clause is required to occur in the result set. Every document in the result set will match all such clauses.
[Occur.MUST_NOT](xref:Lucene.Net.Search.Occur#Lucene_Net_Search_Occur_MUST_NOT) — Use this operator when a clause must not occur in the result set. No document in the result set will match any such clauses.
Boolean queries are constructed by adding two or more BooleanClause instances. If too many clauses are added, a TooManyClausesException will be thrown during searching. This most often occurs when a Query is rewritten into a BooleanQuery with many TermQuery clauses, for example by WildcardQuery. The default setting for the maximum number of clauses 1024, but this can be changed via the static method BooleanQuery.MaxClauseCount.
Phrases
Another common search is to find documents containing certain phrases. This is handled three different ways:
PhraseQuery — Matches a sequence of Terms. PhraseQuery uses a slop factor to determine how many positions may occur between any two terms in the phrase and still be considered a match. The slop is 0 by default, meaning the phrase must match exactly.
MultiPhraseQuery — A more general form of PhraseQuery that accepts multiple Terms for a position in the phrase. For example, this can be used to perform phrase queries that also incorporate synonyms.
SpanNearQuery — Matches a sequence of other SpanQuery instances. SpanNearQuery allows for much more complicated phrase queries since it is constructed from other SpanQuery instances, instead of only TermQuery instances.
TermRangeQuery
The TermRangeQuery matches all documents that occur in the exclusive range of a lower Term and an upper Term according to TermsEnum.Comparer. It is not intended for numerical ranges; use NumericRangeQuery instead. For example, one could find all documents that have terms beginning with the letters a through c.
NumericRangeQuery
The NumericRangeQuery matches all documents that occur in a numeric range. For NumericRangeQuery to work, you must index the values using a one of the numeric fields (Int32Field, Int64Field, SingleField, or DoubleField).
PrefixQuery, WildcardQuery, RegexpQuery
While the PrefixQuery has a different implementation, it is essentially a special case of the WildcardQuery. The PrefixQuery allows an application to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing for the use of (matches 0 or more characters) and ? (matches exactly one character) wildcards. Note that the WildcardQuery can be quite slow. Also note that WildcardQuery should not start with and ?, as these are extremely slow. Some QueryParsers may not allow this by default, but provide an AllowLeadingWildcard
property to remove that protection. The RegexpQuery is even more general than WildcardQuery, allowing an application to identify all documents with terms that match a regular expression pattern.
FuzzyQuery
A FuzzyQuery matches documents that contain terms similar to the specified term. Similarity is determined using Levenshtein (edit) distance. This type of query can be useful when accounting for spelling variations in the collection.
Scoring — Introduction
Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user. In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to work. Then we are left digging into Lucene internals or asking for help on user@lucenenet.apache.org to figure out why a document with five of our query terms scores lower than a different document with only one of the query terms.
While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can help you figure out the what and why of Lucene scoring.
Lucene scoring supports a number of pluggable information retrieval models, including:
These models can be plugged in via the Similarity API, and offer extension hooks and parameters for tuning. In general, Lucene first finds the documents that need to be scored based on boolean logic in the Query specification, and then ranks this subset of matching documents via the retrieval model. For some valuable references on VSM and IR in general refer to Lucene Wiki IR references.
The rest of this document will cover Scoring basics and explain how to change your Similarity. Next, it will cover ways you can customize the Lucene internals in Custom Queries -- Expert Level, which gives details on implementing your own Query class and related functionality. Finally, we will finish up with some reference material in the Appendix.
Scoring — Basics
Scoring is very much dependent on the way documents are indexed, so it is important to understand indexing. (see Lucene overview before continuing on with this section) Be sure to use the useful IndexSearcher.Explain(Query, int) to understand how the score for a certain matching document was computed.
Generally, the Query determines which documents match (a binary decision), while the Similarity determines how to assign scores to the matching documents.
Fields and Documents
In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (Tokenized, Stored, etc). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This is important because two Documents with the exact same content, but one having the content in two Fields and the other in one Field may return different scores for the same query due to length normalization.
Score Boosting
Lucene allows influencing search results by "boosting" at different times:
- Index-time boost by setting Field.Boost before a document is added to the index.
- Query-time boost by setting a boost on a query clause, setting Query.Boost.
Indexing time boosts are pre-processed for storage efficiency and written to storage for a field as follows:
- All boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied.
- The boost is then encoded into a normalization value by the Similarity object at index-time: ComputeNorm. The actual encoding depends upon the Similarity implementation, but note that most use a lossy encoding (such as multiplying the boost with document length or similar, packed into a single byte!).
- Decoding of any index-time normalization values and integration into the document's score is also performed at search time by the Similarity.
Changing Scoring — Similarity
Changing Similarity is an easy way to influence scoring, this is done at index-time with IndexWriterConfig.setSimilarity and at query-time with IndexSearcher.Similarity. Be sure to use the same Similarity at query-time as at index-time (so that norms are encoded/decoded correctly); Lucene makes no effort to verify this.
You can influence scoring by configuring a different built-in Similarity implementation, or by tweaking its parameters, subclassing it to override behavior. Some implementations also offer a modular API which you can extend by plugging in a different component (e.g. term frequency normalizer).
Finally, you can extend the low level Similarity directly to implement a new retrieval model, or to use external scoring factors particular to your application. For example, a custom Similarity can access per-document values via FieldCache or NumericDocValues and integrate them into the score.
See the Lucene.Net.Search.Similarities package documentation for information on the built-in available scoring models and extending or changing Similarity.
Custom Queries — Expert Level
Custom queries are an expert level task, so tread carefully and be prepared to share your code if you want help.
With the warning out of the way, it is possible to change a lot more than just the Similarity when it comes to matching and scoring in Lucene. Lucene's search is a complex mechanism that is grounded by three main classes:
- Query — The abstract object representation of the user's information need.
- Weight — The internal interface representation of the user's Query, so that Query objects may be reused. This is global (across all segments of the index) and generally will require global statistics (such as DocFreq for a given term across all segments).
- Scorer — An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities. This is created per-segment.
- BulkScorer — An abstract class that scores a range of documents. A default implementation simply iterates through the hits from Scorer, but some queries such as BooleanQuery have more efficient implementations. Details on each of these classes, and their children, can be found in the subsections below.
The Query Class
In some sense, the Query class is where it all begins. Without a Query, there would be nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it is often responsible for creating them or coordinating the functionality between them. The Query class has several methods that are important for derived classes:
- CreateWeight(IndexSearcher searcher) — A Weight is the internal representation of the Query, so each Query implementation must provide an implementation of Weight. See the subsection on The Weight Interface below for details on implementing the Weight interface.
- Rewrite(IndexReader reader) — Rewrites queries into primitive queries. Primitive queries are: TermQuery, BooleanQuery, and other queries that implement CreateWeight(IndexSearcher searcher)
The Weight Interface
The Weight interface provides an internal representation of the Query so that it can be reused. Any IndexSearcher dependent state should be stored in the Weight implementation, not in the Query class. The interface defines five members that must be implemented:
- Query — Pointer to the Query that this Weight represents.
- GetValueForNormalization() — A weight can return a floating point value to indicate its magnitude for query normalization. Typically a weight such as TermWeight that scores via a Similarity will just defer to the Similarity's implementation: SimWeight.GetValueForNormalization(). For example, with Lucene's classic vector-space formula, this is implemented as the sum of squared weights: (idf * boost)².
- Normalize(float norm, float topLevelBoost) — Performs query normalization:
  - topLevelBoost: a query-boost factor from any wrapping queries that should be multiplied into every document's score. For example, a TermQuery that is wrapped within a BooleanQuery with a boost of 5 would receive this value at this time. This allows the TermQuery (the leaf node in this case) to compute this up-front a single time (e.g. by multiplying into the IDF), rather than for every document.
  - norm: passes in a normalization factor which may allow for comparing scores between queries. Typically a weight such as TermWeight that scores via a Similarity will just defer to the Similarity's implementation: SimWeight.Normalize(float, float).
- GetScorer(AtomicReaderContext context, IBits acceptDocs) — Construct a new Scorer for this Weight. See The Scorer Class below for help defining a Scorer. As the name implies, the Scorer is responsible for doing the actual scoring of documents given the Query.
- GetBulkScorer(AtomicReaderContext context, bool scoreDocsInOrder, IBits acceptDocs) — Construct a new BulkScorer for this Weight. See The BulkScorer Class below for help defining a BulkScorer. This is an optional method; most queries do not implement it.
- Explain(AtomicReaderContext context, int doc) — Provide a means for explaining why a given document was scored the way it was. Typically a weight such as TermWeight that scores via a Similarity will make use of the Similarity's implementation: SimScorer.Explain(int doc, Explanation freq).
The Scorer Class
The Scorer abstract class provides common scoring functionality for all Scorer implementations and is the heart of the Lucene scoring process. The Scorer class defines the following methods, which must be implemented (some are inherited from DocIdSetIterator, and some are not yet abstract but will be in future versions and should be treated as abstract now):
- NextDoc() — Advances to the next document that matches this Query, returning the document id of that match, or NO_MORE_DOCS if there are no more matching documents.
- DocID — Returns the id of the Document that contains the match.
- GetScore() — Return the score of the current document. This value can be determined in any appropriate way for an application. For instance, the TermScorer simply defers to the configured Similarity: SimScorer.Score(int doc, float freq).
- Freq — Returns the number of matches for the current document. This value can be determined in any appropriate way for an application. For instance, the TermScorer simply defers to the term frequency from the inverted index: DocsEnum.Freq.
- Advance(int target) — Skip ahead in the document matches to the first document whose id is greater than or equal to the passed-in target. In many instances, advance can be implemented more efficiently than simply looping through all the matching documents until the target document is identified.
- GetChildren() — Returns any child subscorers underneath this scorer. This allows for users to navigate the scorer hierarchy and receive more fine-grained details on the scoring process.
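To see these members in action, here is a minimal sketch, assuming Lucene.NET 4.8 APIs and an existing searcher and query, that drives a Scorer by hand over each segment; this is essentially the loop a collector normally runs:
using System;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Visit every matching document per segment, reading Freq and GetScore().
Weight weight = searcher.CreateNormalizedWeight(query);
foreach (AtomicReaderContext context in searcher.IndexReader.Leaves)
{
    Scorer scorer = weight.GetScorer(context, context.AtomicReader.LiveDocs);
    if (scorer == null) continue; // no matches in this segment
    int doc;
    while ((doc = scorer.NextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
    {
        // DocBase converts the segment-local id to an index-wide id.
        Console.WriteLine($"doc={context.DocBase + doc} freq={scorer.Freq} score={scorer.GetScore()}");
    }
}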
The BulkScorer Class
The BulkScorer scores a range of documents. There is only one abstract method:
- Score(ICollector, int) — Score all documents up to but not including the specified max document.
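As a sketch of what a default BulkScorer does, assuming Lucene.NET 4.8 APIs (SimpleBulkScorer is a hypothetical name, loosely modeled on the default implementation mentioned above):
using Lucene.Net.Search;

// Feeds a wrapped Scorer's hits to the collector, up to (but excluding) max.
public class SimpleBulkScorer : BulkScorer
{
    private readonly Scorer scorer;

    public SimpleBulkScorer(Scorer scorer)
    {
        this.scorer = scorer;
    }

    public override bool Score(ICollector collector, int max)
    {
        collector.SetScorer(scorer);
        int doc = scorer.DocID;
        if (doc == -1) doc = scorer.NextDoc(); // iteration not yet started
        while (doc < max)
        {
            collector.Collect(doc); // collector may call scorer.GetScore()
            doc = scorer.NextDoc();
        }
        return doc != DocIdSetIterator.NO_MORE_DOCS; // true if more docs remain
    }
}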
Why would I want to add my own Query?
In a nutshell, you want to add your own custom Query implementation when you think Lucene's existing queries aren't appropriate for the task you want to do. You might be doing some cutting-edge research, or you might need more information back out of Lucene (similar to Doug adding SpanQuery functionality).
Appendix: Search Algorithm
This section is mostly notes on stepping through the Scoring process and serves as fertilizer for the earlier sections.
In the typical search application, a Query is passed to the IndexSearcher, beginning the scoring process.
Once inside the IndexSearcher, an ICollector is used for the scoring and sorting of the search results. These important objects are involved in a search:
- The Weight object of the Query. The Weight object is an internal representation of the Query that allows the Query to be reused by the IndexSearcher.
- The IndexSearcher that initiated the call.
- A Filter for limiting the result set. Note, the Filter may be null.
- A Sort object for specifying how to sort the results if the standard score-based sort method is not desired.
Assuming we are not sorting (since sorting doesn't affect the raw Lucene score), we call one of the search methods of the IndexSearcher, passing in the Weight object created by IndexSearcher.CreateNormalizedWeight(Query), the Filter, and the number of results we want. This method returns a TopDocs object, which is an internal collection of search results. The IndexSearcher creates a TopScoreDocCollector and passes it along with the Weight and Filter to another expert search method (for more on the ICollector mechanism, see IndexSearcher). The TopScoreDocCollector uses a PriorityQueue to collect the top results for the search.
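A minimal sketch of that collector path, assuming Lucene.NET 4.8 APIs and an existing searcher, query and filter (the hit count of 10 is arbitrary, and the filter may be null):
// Collect the top 10 score-sorted hits via the expert ICollector overload.
TopScoreDocCollector collector = TopScoreDocCollector.Create(10, true); // true: docs scored in order
searcher.Search(query, filter, collector);
TopDocs topDocs = collector.GetTopDocs();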
If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, we ask the Weight for a Scorer for each IndexReader segment and proceed by calling BulkScorer.Score(ICollector).
At last, we are actually going to score some documents. The score method takes in the ICollector (most likely the TopScoreDocCollector or TopFieldCollector) and does its business. Of course, here is where things get involved. The Scorer that is returned by the Weight object depends on what type of Query was submitted. In most real world applications with multiple query terms, the Scorer is going to be a BooleanScorer2 created from BooleanWeight (see the section on custom queries for info on changing this).
Assuming a BooleanScorer2, we first initialize the Coordinator, which is used to apply the Coord() factor. We then get an internal Scorer based on the required, optional and prohibited parts of the query. Using this internal Scorer, the BooleanScorer2 then proceeds into a while loop based on the Scorer.NextDoc() method. The NextDoc() method advances to the next document matching the query. This is an abstract method in the Scorer class and is thus overridden by all derived implementations. If you have a simple OR query, your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers from the sub scorers of the OR'd terms.
ControlledRealTimeReopenThread<T>
Utility class that runs a thread to manage periodic reopens of a ReferenceManager<G>, with methods to wait for a specific index change to become visible. To use this class you must first wrap your IndexWriter with a TrackingIndexWriter and always use it to make changes to the index, saving the returned generation. Then, when a given search request needs to see a specific index change, call WaitForGeneration(long) to wait for that change to be visible. Note that this will only scale well if most searches do not need to wait for a specific index generation.
Note
This API is experimental and might change in incompatible ways in the next release.
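A sketch of the wiring described above, assuming Lucene.NET 4.8 APIs (writer and doc are assumed to already exist, and the staleness bounds are illustrative):
using Lucene.Net.Index;
using Lucene.Net.Search;

// All index changes must go through the TrackingIndexWriter so generations are recorded.
var trackingWriter = new TrackingIndexWriter(writer);
var searcherManager = new SearcherManager(writer, true, null); // applyAllDeletes: true, default SearcherFactory
var reopenThread = new ControlledRealTimeReopenThread<IndexSearcher>(
    trackingWriter, searcherManager, 5.0, 0.1); // target max/min staleness in seconds
reopenThread.Start();

long generation = trackingWriter.AddDocument(doc); // save the returned generation
reopenThread.WaitForGeneration(generation);        // block until that change is searchable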
DisjunctionMaxQuery
A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries. This is useful when searching for a word in multiple fields with different boost factors (so that the fields cannot be combined equivalently into a single search field). We want the primary score to be the one associated with the highest boost, not the sum of the field scores (as BooleanQuery would give).
If the query is "albino elephant" this ensures that "albino" matching one field and "elephant" matching another gets a higher score than "albino" matching both fields. To get this result, use both BooleanQuery and DisjunctionMaxQuery: for each term, a DisjunctionMaxQuery searches for it in each field, while the set of these DisjunctionMaxQuerys is combined into a BooleanQuery. The tie breaker capability allows results that include the same term in multiple fields to be judged better than results that include this term in only the best of those multiple fields, without confusing this with the better case of two different terms in the multiple fields.
Collection initializer note: To create and populate a DisjunctionMaxQuery in a single statement, you can use the following example as a guide:
var disjunctionMaxQuery = new DisjunctionMaxQuery(0.1f) {
    new TermQuery(new Term("field1", "albino")),
    new TermQuery(new Term("field2", "elephant"))
};
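Extending that initializer, a sketch of the full "albino elephant" composition described above (field names are illustrative):
// One DisjunctionMaxQuery per term, searching all fields; then OR them together.
var albino = new DisjunctionMaxQuery(0.1f)
{
    new TermQuery(new Term("field1", "albino")),
    new TermQuery(new Term("field2", "albino"))
};
var elephant = new DisjunctionMaxQuery(0.1f)
{
    new TermQuery(new Term("field1", "elephant")),
    new TermQuery(new Term("field2", "elephant"))
};
var combined = new BooleanQuery();
combined.Add(albino, Occur.SHOULD);
combined.Add(elephant, Occur.SHOULD);
Each term contributes the maximum score across fields (plus the 0.1f tie-break), so a document matching both terms anywhere outscores one matching a single term in every field.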
DisjunctionMaxQuery.DisjunctionMaxWeight
Expert: the Weight for DisjunctionMaxQuery, used to normalize, score and explain these queries.
NOTE: this API and implementation are subject to change suddenly in the next release.
DocIdSet
A DocIdSet contains a set of doc ids. Implementing classes need only implement GetIterator() to provide access to the set.
DocIdSetIterator
This abstract class defines methods to iterate over a set of non-decreasing doc ids. Note that this class assumes it iterates on doc ids, and therefore NO_MORE_DOCS is set to int.MaxValue in order to be used as a sentinel value. Implementations of this class are expected to consider int.MaxValue as an invalid value.
DocTermOrdsRangeFilter
A range filter built on top of a cached multi-valued term field (in IFieldCache).
Like FieldCacheRangeFilter, this is just a specialized range query versus using a TermRangeQuery with DocTermOrdsRewriteMethod: it will only do two ordinal-to-term lookups.
DocTermOrdsRewriteMethod
Rewrites MultiTermQuerys into a filter, using DocTermOrds for term enumeration.
This can be used to perform these queries against an unindexed docvalues field.
Note
This API is experimental and might change in incompatible ways in the next release.
Explanation
Expert: Describes the score computation for document and query.
FieldCache
Expert: maintains caches of per-document term values (see IFieldCache and the typed entries below).
FieldCache.Bytes
Field values as 8-bit signed bytes
FieldCache.CacheEntry
EXPERT: A unique Identifier/Description for each item in the IFieldCache. Can be useful for logging/debugging.
Note
This API is experimental and might change in incompatible ways in the next release.
FieldCache.CreationPlaceholder<TValue>
Placeholder indicating creation of this cache is currently in-progress.
FieldCache.Doubles
Field values as 64-bit doubles
FieldCache.Int16s
Field values as 16-bit signed shorts
NOTE: This was Shorts in Lucene
FieldCache.Int32s
Field values as 32-bit signed integers
NOTE: This was Ints in Lucene
FieldCache.Int64s
Field values as 64-bit signed long integers
NOTE: This was Longs in Lucene
FieldCache.Singles
Field values as 32-bit floats
NOTE: This was Floats in Lucene
FieldCacheDocIdSet
Base class for DocIdSet to be used with IFieldCache. The implementation of its iterator is very stupid and slow if the implementation of the MatchDoc(int) method is not optimized, as iterators simply increment the document id until MatchDoc(int) returns true. Because of this, MatchDoc(int) must be as fast as possible and in no case do any I/O.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
FieldCacheRangeFilter
A range filter built on top of a cached single-term field (in IFieldCache).
FieldCacheRangeFilter builds a single cache for the field the first time it is used. Each subsequent FieldCacheRangeFilter on the same field then reuses this cache, even if the range itself changes. This means that FieldCacheRangeFilter is much faster (sometimes more than 100x as fast) than building a TermRangeFilter, if using a NewStringRange(string, string, string, bool, bool). However, if the range never changes, it is slower (around 2x as slow) than building a CachingWrapperFilter on top of a single TermRangeFilter.
For numeric data types, this filter may be significantly faster than NumericRangeFilter. Furthermore, it does not need the numeric values encoded by Int32Field, SingleField, Int64Field or DoubleField. But it has the problem that it only works with exactly one value per document (see below).
As with all IFieldCache based functionality, FieldCacheRangeFilter is only valid for fields which have exactly one term for each document (except for NewStringRange(string, string, string, bool, bool), where 0 terms are also allowed). Due to a restriction of IFieldCache, for numeric ranges a value of 0 is assumed for all documents that do not have a numeric value. Thus it works on dates, prices and other single-value fields, but will not work on regular text fields. It is preferable to use a NOT_ANALYZED field to ensure that there is only a single term.
This class does not have a constructor; use one of the static factory methods available, which create a correct instance for the different data types supported by IFieldCache.
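A usage sketch, assuming Lucene.NET 4.8 APIs and an existing searcher and query (the "price" field and bounds are illustrative):
// A reusable, cached range filter over a single-valued numeric field.
Filter priceFilter = FieldCacheRangeFilter.NewInt32Range(
    "price", 10, 100, true, true); // lower/upper bounds, both inclusive
TopDocs hits = searcher.Search(query, priceFilter, 20);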
FieldCacheRangeFilter<T>
The generic, typed form of FieldCacheRangeFilter; instances are obtained via the static factory methods described above.
FieldCacheRewriteMethod
Rewrites MultiTermQuerys into a filter, using the IFieldCache for term enumeration.
This can be used to perform these queries against an unindexed docvalues field.
Note
This API is experimental and might change in incompatible ways in the next release.
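A sketch, assuming Lucene.NET 4.8 APIs (MultiTermRewriteMethod is the assumed Lucene.NET rendering of Java Lucene's setRewriteMethod; verify against your version):
// Rewrite the wildcard's term enumeration through the IFieldCache.
var wildcard = new WildcardQuery(new Term("category", "elec*"));
wildcard.MultiTermRewriteMethod = new FieldCacheRewriteMethod();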
FieldCacheTermsFilter
A Filter that only accepts documents whose single term value in the specified field is contained in the provided set of allowed terms.
This is the same functionality as TermsFilter (from queries/), except this filter requires that the field contains only a single term for all documents. Because of drastically different implementations, they also have different performance characteristics, as described below.
The first invocation of this filter on a given field will be slower, since a SortedDocValues must be created. Subsequent invocations using the same field will re-use this cache. However, as with all functionality based on IFieldCache, persistent RAM is consumed to hold the cache, and is not freed until the IndexReader is disposed. In contrast, TermsFilter has no persistent RAM consumption.
With each search, this filter translates the specified set of Terms into a private FixedBitSet keyed by term number per unique IndexReader (normally one reader per segment). Then, during matching, the term number for each docID is retrieved from the cache and then checked for inclusion using the FixedBitSet. Since all testing is done using RAM resident data structures, performance should be very fast, most likely fast enough to not require further caching of the DocIdSet for each possible combination of terms. However, because docIDs are simply scanned linearly, an index with a great many small documents may find this linear scan too costly.
In contrast, TermsFilter builds up a FixedBitSet, keyed by docID, every time it's created, by enumerating through all matching docs using DocsEnum to seek and scan through each term's docID list. While there is no linear scan of all docIDs, besides the allocation of the underlying array in the FixedBitSet, this approach requires a number of "disk seeks" in proportion to the number of terms, which can be exceptionally costly when there are cache misses in the OS's IO cache.
Generally, this filter will be slower on the first invocation for a given field, but subsequent invocations, even if you change the allowed set of Terms, should be faster than TermsFilter, especially as the number of Terms being matched increases. If you are matching only a very small number of terms, and those terms in turn match a very small number of documents, TermsFilter may perform faster.
Which filter is best is very application dependent.
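A usage sketch, assuming Lucene.NET 4.8 APIs and an existing searcher and query (the field and terms are illustrative):
// Accept only documents whose single-valued "category" field is one of these terms.
var categoryFilter = new FieldCacheTermsFilter("category", "electronics", "books");
TopDocs hits = searcher.Search(query, categoryFilter, 20);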
FieldComparer
Expert: compares hits to determine their sort order when collecting the top results with TopFieldCollector.
FieldComparer.ByteComparer
Parses field's values as byte (using GetBytes(AtomicReader, string, IByteParser, bool)) and sorts by ascending value.
FieldComparer.DocComparer
Sorts by ascending docID
FieldComparer.DoubleComparer
Parses field's values as double (using GetDoubles(AtomicReader, string, IDoubleParser, bool)) and sorts by ascending value.
FieldComparer.Int16Comparer
Parses field's values as short (using GetInt16s(AtomicReader, string, IInt16Parser, bool)) and sorts by ascending value.
NOTE: This was ShortComparator in Lucene
FieldComparer.Int32Comparer
Parses field's values as int (using GetInt32s(AtomicReader, string, IInt32Parser, bool)) and sorts by ascending value.
NOTE: This was IntComparator in Lucene
FieldComparer.Int64Comparer
Parses field's values as long (using GetInt64s(AtomicReader, string, IInt64Parser, bool)) and sorts by ascending value.
NOTE: This was LongComparator in Lucene
FieldComparer.NumericComparer<TNumber>
Base FieldComparer class for numeric types
FieldComparer.RelevanceComparer
Sorts by descending relevance. NOTE: if you are sorting only by descending relevance and then secondarily by ascending docID, performance is faster using TopScoreDocCollector directly (which all overloads of Search(Query, int) use when no Sort is specified).
FieldComparer.SingleComparer
Parses field's values as float (using GetSingles(AtomicReader, string, ISingleParser, bool)) and sorts by ascending value.
NOTE: This was FloatComparator in Lucene
FieldComparer.TermOrdValComparer
Sorts by field's natural Term sort order, using ordinals. This is functionally equivalent to FieldComparer.TermValComparer, but it first resolves the strings to their relative ordinal positions (using the index returned by GetTermsIndex(AtomicReader, string, float)) and does most comparisons using the ordinals. For medium to large result sets, this comparer will be much faster than FieldComparer.TermValComparer. For very small result sets it may be slower.
FieldComparer.TermValComparer
Sorts by field's natural Term sort order. All comparisons are done using CompareTo(BytesRef), which is slow for medium to large result sets but possibly very fast for very small results sets.
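Applications normally do not construct these comparers directly; a SortField selects the appropriate one. A small usage sketch, in which searcher and query are assumed to exist, the "price" field is hypothetical, and the SortFieldType member name is an assumption based on the Int32 naming above:
Sort sort = new Sort(new SortField("price", SortFieldType.INT32));
TopFieldDocs hits = searcher.Search(query, 10, sort); // results sorted by ascending "price"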
FieldComparerSource
Provides a FieldComparer for custom field sorting.
Note
This API is experimental and might change in incompatible ways in the next release.
FieldComparer<T>
Expert: a FieldComparer compares hits so as to determine their sort order when collecting the top results with TopFieldCollector. The concrete public FieldComparer classes here correspond to the SortField types.
This API is designed to achieve high performance sorting, by exposing a tight interaction with FieldValueHitQueue as it visits hits. Whenever a hit is competitive, it's enrolled into a virtual slot, which is an int ranging from 0 to numHits-1. The FieldComparer is made aware of segment transitions during searching in case any internal state it's tracking needs to be recomputed during these transitions.
A comparer must define these functions (a minimal sketch follows the note below):
- Compare(int, int) — Compare a hit at 'slot a' with a hit at 'slot b'.
- SetBottom(int) — Called by FieldValueHitQueue to notify the FieldComparer of the current weakest ("bottom") slot. Note that this slot may not hold the weakest value according to your comparer, in cases where your comparer is not the primary one (i.e., it is only used to break ties from the comparers before it).
- CompareBottom(int) — Compare a new hit (docID) against the "weakest" (bottom) entry in the queue.
- SetTopValue(T) — Called by TopFieldCollector to notify the FieldComparer of the top-most value, which is used by future calls to CompareTop(int).
- CompareTop(int) — Compare a new hit (docID) against the top value previously set by a call to SetTopValue(T).
- Copy(int, int) — Installs a new hit into the priority queue. The FieldValueHitQueue calls this method when a new hit is competitive.
- SetNextReader(AtomicReaderContext) — Invoked when the search is switching to the next segment. You may need to update internal state of the comparer, for example retrieving new values from the IFieldCache.
- GetValue(int) — Return the sort value stored in the specified slot. This is only called at the end of the search, in order to populate Fields when returning the top results.
Note
This API is experimental and might change in incompatible ways in the next release.
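To make the contract concrete, here is a minimal sketch of a custom comparer that sorts ascending by a cached int field. It is written against the member list above; the concrete Lucene.Net signatures (for example, whether GetValue is exposed as a method or an indexer) may differ, and the class and field names are hypothetical.
// Sketch only: sorts hits ascending by an int field read from the field cache.
public class CachedInt32Comparer : FieldComparer<int>
{
    private readonly string field;      // hypothetical sort field name
    private readonly int[] values;      // one sort value per queue slot
    private FieldCache.Int32s current;  // per-segment value lookup
    private int bottom;                 // value of the weakest queued slot
    private int top;                    // value set via SetTopValue

    public CachedInt32Comparer(string field, int numHits)
    {
        this.field = field;
        values = new int[numHits];
    }

    public override int Compare(int slot1, int slot2) => values[slot1].CompareTo(values[slot2]);

    public override void SetBottom(int slot) => bottom = values[slot];

    public override int CompareBottom(int doc) => bottom.CompareTo(current.Get(doc));

    public override void SetTopValue(int value) => top = value;

    public override int CompareTop(int doc) => top.CompareTo(current.Get(doc));

    public override void Copy(int slot, int doc) => values[slot] = current.Get(doc);

    public override FieldComparer SetNextReader(AtomicReaderContext context)
    {
        // Refresh the per-segment lookup whenever the search switches segment.
        current = FieldCache.DEFAULT.GetInt32s(context.AtomicReader, field, /*parser*/ null, false);
        return this;
    }

    public override int GetValue(int slot) => values[slot]; // see caveat in the lead-in
}
Such a comparer would typically reach Lucene through the FieldComparerSource described above, attached to a SortField.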
FieldDoc
Expert: A ScoreDoc which also contains information about how to sort the referenced document. In addition to the document number and score, this object contains an array of values for the document from the field(s) used to sort. For example, if the sort criteria were to sort by fields "a", "b" then "c", the fields object array will have three elements, corresponding respectively to the term values for the document in fields "a", "b" and "c". The class of each element in the array will be either int, float or string depending on the type of values in the terms of each field.
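For example, a sketch that reads those per-field sort values back from a sorted search (searcher, query, and the field names are assumed):
Sort sort = new Sort(new SortField("a", SortFieldType.STRING), new SortField("b", SortFieldType.INT32));
TopFieldDocs hits = searcher.Search(query, 10, sort);
foreach (ScoreDoc sd in hits.ScoreDocs)
{
    FieldDoc fd = (FieldDoc)sd;   // sorted searches return FieldDoc instances
    object aValue = fd.Fields[0]; // sort value for field "a"
    object bValue = fd.Fields[1]; // sort value for field "b"
}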
FieldValueFilter
A Filter that accepts all documents that have one or more values in a given field. This Filter requests IBits from the IFieldCache and builds the bits if not present.
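For example, to restrict results to documents that have any value in a hypothetical "author" field (searcher and query assumed):
Filter filter = new FieldValueFilter("author"); // matches docs with at least one value in "author"
TopDocs hits = searcher.Search(query, filter, 10);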
FieldValueHitQueue
FieldValueHitQueue.Entry
Code to search indices.
Table Of Contents
- Search Basics
- The Query Classes
- Scoring: Introduction
- Scoring: Basics
- Changing the Scoring
- Appendix: Search Algorithm
Search Basics
Lucene offers a wide variety of Query implementations, most of which are in this package, its subpackages (Lucene.Net.Spans, Lucene.Net.Payloads), or the Lucene.Net.Queries module. These implementations can be combined in a wide variety of ways to provide complex querying capabilities along with information about where matches took place in the document collection. The Query Classes section below highlights some of the more important Query classes. For details on implementing your own Query class, see Custom Queries -- Expert Level below.
To perform a search, applications usually call Search(Query, int) or Search(Query, Filter, int).
Once a Query has been created and submitted to the IndexSearcher, the scoring process begins. After some infrastructure setup, control finally passes to the Weight implementation and its Scorer or BulkScorer instances. See the Algorithm section for more notes on the process.
Query Classes
TermQuery
Of the various implementations of Query, the TermQuery is the easiest to understand and the most often used in applications. A TermQuery matches all the documents that contain the specified Term, which is a word that occurs in a certain Field. Thus, a TermQuery identifies and scores all Documents that have a Field with the specified string in it. Constructing a TermQuery is as simple as:
TermQuery tq = new TermQuery(new Term("fieldName", "term"));
In this example, the Query identifies all Documents that have the Field named "fieldName"
containing the word "term"
.
BooleanQuery
Things start to get interesting when one combines multiple TermQuery instances into a BooleanQuery. A BooleanQuery contains multiple BooleanClauses, where each clause contains a sub-query (Query instance) and an operator (from BooleanClause.Occur) describing how that sub-query is combined with the other clauses:
[Occur.SHOULD](xref:Lucene.Net.Search.Occur#Lucene_Net_Search_Occur_SHOULD) — Use this operator when a clause can occur in the result set, but is not required. If a query is made up of all SHOULD clauses, then every document in the result set matches at least one of these clauses.
[Occur.MUST](xref:Lucene.Net.Search.Occur#Lucene_Net_Search_Occur_MUST) — Use this operator when a clause is required to occur in the result set. Every document in the result set will match all such clauses.
[Occur.MUST_NOT](xref:Lucene.Net.Search.Occur#Lucene_Net_Search_Occur_MUST_NOT) — Use this operator when a clause must not occur in the result set. No document in the result set will match any such clauses.
Boolean queries are constructed by adding two or more BooleanClause instances. If too many clauses are added, a TooManyClausesException will be thrown during searching. This most often occurs when a Query is rewritten into a BooleanQuery with many TermQuery clauses, for example by WildcardQuery. The default setting for the maximum number of clauses 1024, but this can be changed via the static method BooleanQuery.MaxClauseCount.
Phrases
Another common search is to find documents containing certain phrases. This is handled three different ways:
PhraseQuery — Matches a sequence of Terms. PhraseQuery uses a slop factor to determine how many positions may occur between any two terms in the phrase and still be considered a match. The slop is 0 by default, meaning the phrase must match exactly.
MultiPhraseQuery — A more general form of PhraseQuery that accepts multiple Terms for a position in the phrase. For example, this can be used to perform phrase queries that also incorporate synonyms.
SpanNearQuery — Matches a sequence of other SpanQuery instances. SpanNearQuery allows for much more complicated phrase queries since it is constructed from other SpanQuery instances, instead of only TermQuery instances.
TermRangeQuery
The TermRangeQuery matches all documents that occur in the exclusive range of a lower Term and an upper Term according to TermsEnum.Comparer. It is not intended for numerical ranges; use NumericRangeQuery instead. For example, one could find all documents that have terms beginning with the letters a through c.
NumericRangeQuery
The NumericRangeQuery matches all documents that occur in a numeric range. For NumericRangeQuery to work, you must index the values using a one of the numeric fields (Int32Field, Int64Field, SingleField, or DoubleField).
PrefixQuery, WildcardQuery, RegexpQuery
While the PrefixQuery has a different implementation, it is essentially a special case of the WildcardQuery. The PrefixQuery allows an application to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing for the use of (matches 0 or more characters) and ? (matches exactly one character) wildcards. Note that the WildcardQuery can be quite slow. Also note that WildcardQuery should not start with and ?, as these are extremely slow. Some QueryParsers may not allow this by default, but provide an AllowLeadingWildcard
property to remove that protection. The RegexpQuery is even more general than WildcardQuery, allowing an application to identify all documents with terms that match a regular expression pattern.
FuzzyQuery
A FuzzyQuery matches documents that contain terms similar to the specified term. Similarity is determined using Levenshtein (edit) distance. This type of query can be useful when accounting for spelling variations in the collection.
Scoring — Introduction
Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user. In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to work. Then we are left digging into Lucene internals or asking for help on user@lucenenet.apache.org to figure out why a document with five of our query terms scores lower than a different document with only one of the query terms.
While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can help you figure out the what and why of Lucene scoring.
Lucene scoring supports a number of pluggable information retrieval models, including:
These models can be plugged in via the Similarity API, and offer extension hooks and parameters for tuning. In general, Lucene first finds the documents that need to be scored based on boolean logic in the Query specification, and then ranks this subset of matching documents via the retrieval model. For some valuable references on VSM and IR in general refer to Lucene Wiki IR references.
The rest of this document will cover Scoring basics and explain how to change your Similarity. Next, it will cover ways you can customize the Lucene internals in Custom Queries -- Expert Level, which gives details on implementing your own Query class and related functionality. Finally, we will finish up with some reference material in the Appendix.
Scoring — Basics
Scoring is very much dependent on the way documents are indexed, so it is important to understand indexing. (see Lucene overview before continuing on with this section) Be sure to use the useful IndexSearcher.Explain(Query, int) to understand how the score for a certain matching document was computed.
Generally, the Query determines which documents match (a binary decision), while the Similarity determines how to assign scores to the matching documents.
Fields and Documents
In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (Tokenized, Stored, etc). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This is important because two Documents with the exact same content, but one having the content in two Fields and the other in one Field may return different scores for the same query due to length normalization.
Score Boosting
Lucene allows influencing search results by "boosting" at different times:
- Index-time boost by setting Field.Boost before a document is added to the index.
- Query-time boost by setting a boost on a query clause, setting Query.Boost.
Indexing time boosts are pre-processed for storage efficiency and written to storage for a field as follows:
- All boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied.
- The boost is then encoded into a normalization value by the Similarity object at index-time: ComputeNorm. The actual encoding depends upon the Similarity implementation, but note that most use a lossy encoding (such as multiplying the boost with document length or similar, packed into a single byte!).
- Decoding of any index-time normalization values and integration into the document's score is also performed at search time by the Similarity.
Changing Scoring — Similarity
Changing Similarity is an easy way to influence scoring, this is done at index-time with IndexWriterConfig.setSimilarity and at query-time with IndexSearcher.Similarity. Be sure to use the same Similarity at query-time as at index-time (so that norms are encoded/decoded correctly); Lucene makes no effort to verify this.
You can influence scoring by configuring a different built-in Similarity implementation, or by tweaking its parameters, subclassing it to override behavior. Some implementations also offer a modular API which you can extend by plugging in a different component (e.g. term frequency normalizer).
Finally, you can extend the low level Similarity directly to implement a new retrieval model, or to use external scoring factors particular to your application. For example, a custom Similarity can access per-document values via FieldCache or NumericDocValues and integrate them into the score.
See the Lucene.Net.Search.Similarities package documentation for information on the built-in available scoring models and extending or changing Similarity.
Custom Queries — Expert Level
Custom queries are an expert level task, so tread carefully and be prepared to share your code if you want help.
With the warning out of the way, it is possible to change a lot more than just the Similarity when it comes to matching and scoring in Lucene. Lucene's search is a complex mechanism that is grounded by three main classes:
- Query — The abstract object representation of the user's information need.
- Weight — The internal interface representation of the user's Query, so that Query objects may be reused. This is global (across all segments of the index) and generally will require global statistics (such as DocFreq for a given term across all segments).
- Scorer — An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities. This is created per-segment.
- BulkScorer — An abstract class that scores a range of documents. A default implementation simply iterates through the hits from Scorer, but some queries such as BooleanQuery have more efficient implementations. Details on each of these classes, and their children, can be found in the subsections below.
The Query Class
In some sense, the Query class is where it all begins. Without a Query, there would be nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it is often responsible for creating them or coordinating the functionality between them. The Query class has several methods that are important for derived classes:
- CreateWeight(IndexSearcher searcher) — A Weight is the internal representation of the Query, so each Query implementation must provide an implementation of Weight. See the subsection on The Weight Interface below for details on implementing the Weight interface.
- Rewrite(IndexReader reader) — Rewrites queries into primitive queries. Primitive queries are: TermQuery, BooleanQuery, and other queries that implement CreateWeight(IndexSearcher searcher)
The Weight Interface
The Weight interface provides an internal representation of the Query so that it can be reused. Any IndexSearcher dependent state should be stored in the Weight implementation, not in the Query class. The interface defines five members that must be implemented:
- Query — Pointer to the Query that this Weight represents.
- GetValueForNormalization() — A weight can return a floating point value to indicate its magnitude for query normalization. Typically a weight such as TermWeight that scores via a Similarity will just defer to the Similarity's implementation: SimWeight.GetValueForNormalization(). For example, with Lucene's classic vector-space formula, this is implemented as the sum of squared weights:
(idf * boost)
² - Normalize(float norm, float topLevelBoost) — Performs query normalization:
topLevelBoost
: A query-boost factor from any wrapping queries that should be multiplied into every document's score. For example, a TermQuery that is wrapped within a BooleanQuery with a boost of5
would receive this value at this time. This allows the TermQuery (the leaf node in this case) to compute this up-front a single time (e.g. by multiplying into the IDF), rather than for every document.norm
: Passes in a a normalization factor which may allow for comparing scores between queries. Typically a weight such as TermWeight that scores via a Similarity will just defer to the Similarity's implementation: SimWeight.Normalize(float, float).
- GetScorer(AtomicReaderContext context, IBits acceptDocs) — Construct a new Scorer for this Weight. See The Scorer Class below for help defining a Scorer. As the name implies, the Scorer is responsible for doing the actual scoring of documents given the Query.
- GetScorer(AtomicReaderContext, bool scoreDocsInOrder, IBits acceptDocs) — Construct a new BulkScorer for this Weight. See The BulkScorer Class below for help defining a BulkScorer. This is an optional method, and most queries do not implement it.
- Explain(AtomicReaderContext context, int doc) — Provide a means for explaining why a given document was scored the way it was. Typically a weight such as TermWeight that scores via a Similarity will make use of the Similarity's implementation: SimScorer.Explain(int doc, Explanation freq).
The Scorer Class
The Scorer abstract class provides common scoring functionality for all Scorer implementations and is the heart of the Lucene scoring process. The Scorer defines the following abstract (some of them are not yet abstract, but will be in future versions and should be considered as such now) methods which must be implemented (some of them inherited from DocIdSetIterator):
- NextDoc() — Advances to the next document that matches this Query, returning true if and only if there is another document that matches.
- DocID — Returns the id of the Document that contains the match.
- GetScore() — Return the score of the current document. This value can be determined in any appropriate way for an application. For instance, the TermScorer simply defers to the configured Similarity: SimScorer.Score(int doc, float freq).
- Freq — Returns the number of matches for the current document. This value can be determined in any appropriate way for an application. For instance, the TermScorer simply defers to the term frequency from the inverted index: DocsEnum.Freq.
- Advance() — Skip ahead in the document matches to the document whose id is greater than or equal to the passed in value. In many instances, advance can be implemented more efficiently than simply looping through all the matching documents until the target document is identified.
- GetChildren() — Returns any child subscorers underneath this scorer. This allows for users to navigate the scorer hierarchy and receive more fine-grained details on the scoring process.
The BulkScorer Class
The BulkScorer scores a range of documents. There is only one abstract method:
- Score(ICollector, int) — Score all documents up to but not including the specified max document.
Why would I want to add my own Query?
In a nutshell, you want to add your own custom Query implementation when you think that Lucene's aren't appropriate for the task that you want to do. You might be doing some cutting edge research or you need more information back out of Lucene (similar to Doug adding SpanQuery functionality).
Appendix: Search Algorithm
This section is mostly notes on stepping through the Scoring process and serves as fertilizer for the earlier sections.
In the typical search application, a Query is passed to the IndexSearcher, beginning the scoring process.
Once inside the IndexSearcher, a ICollector is used for the scoring and sorting of the search results. These important objects are involved in a search:
- The Weight object of the Query. The Weight object is an internal representation of the Query that allows the Query to be reused by the IndexSearcher.
- The IndexSearcher that initiated the call.
- A Filter for limiting the result set. Note, the Filter may be null.
- A Sort object for specifying how to sort the results if the standard score-based sort method is not desired.
Assuming we are not sorting (since sorting doesn't affect the raw Lucene score), we call one of the search methods of the IndexSearcher, passing in the Weight object created by IndexSearcher.CreateNormalizedWeight(Query), the Filter, and the number of results we want. This method returns a TopDocs object, which is an internal collection of search results. The IndexSearcher creates a TopScoreDocCollector and passes it along with the Weight and Filter to another expert search method (for more on the ICollector mechanism, see IndexSearcher). The TopScoreDocCollector uses a PriorityQueue to collect the top results for the search.
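As a simple illustration of this flow (a sketch; reader is assumed to be an already-open DirectoryReader, and the field/term are illustrative):
var searcher = new IndexSearcher(reader);
var query = new TermQuery(new Term("fieldName", "term"));
TopDocs topDocs = searcher.Search(query, null, 10); // the Filter may be null
foreach (ScoreDoc sd in topDocs.ScoreDocs)
{
    Console.WriteLine($"doc={sd.Doc} score={sd.Score}");
}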
If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, we ask the Weight for a Scorer for each IndexReader segment and proceed by calling BulkScorer.Score(ICollector).
At last, we are actually going to score some documents. The score method takes in the ICollector (most likely the TopScoreDocCollector or TopFieldCollector) and does its business. Of course, here is where things get involved. The Scorer that is returned by the Weight object depends on what type of Query was submitted. In most real world applications with multiple query terms, the Scorer is going to be a BooleanScorer2 created from BooleanWeight (see the section on custom queries for info on changing this).
Assuming a BooleanScorer2, we first initialize the Coordinator, which is used to apply the Coord() factor. We then get an internal Scorer based on the required, optional and prohibited parts of the query. Using this internal Scorer, the BooleanScorer2 then proceeds into a while loop based on the Scorer.NextDoc() method. The NextDoc() method advances to the next document matching the query. This is an abstract method in the Scorer class and is thus overridden by all derived implementations. If you have a simple OR query, your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers from the sub scorers of the OR'd terms.
FieldValueHitQueue<T>
Expert: A hit queue for sorting hits by terms in more than one field. Uses FieldCache.DEFAULT for maintaining internal term lookup tables.
Note
This API is experimental and might change in incompatible ways in the next release.
Filter
Abstract base class for restricting which documents may be returned during searching.
FilteredDocIdSet
Abstract decorator class for a DocIdSet implementation that provides on-demand filtering/validation mechanism on a given DocIdSet.
Technically, this same functionality could be achieved with ChainedFilter (under queries/), however the benefit of this class is it never materializes the full bitset for the filter. Instead, the Match(int) method is invoked on-demand, per docID visited during searching. If you know few docIDs will be visited, and the logic behind Match(int) is relatively costly, this may be a better way to filter than ChainedFilter.
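For illustration, a hedged sketch of a FilteredDocIdSet subclass (the even-docID predicate is purely illustrative):
using Lucene.Net.Search;

// Keeps only even doc IDs from the wrapped set. Match(int) is invoked
// on demand per docID visited during searching, so no full bitset is
// ever materialized.
public class EvenDocsOnlySet : FilteredDocIdSet
{
    public EvenDocsOnlySet(DocIdSet innerSet) : base(innerSet) { }

    protected override bool Match(int docId)
    {
        return docId % 2 == 0;
    }
}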
FilteredDocIdSetIterator
Abstract decorator class of a DocIdSetIterator implementation that provides on-demand filter/validation mechanism on an underlying DocIdSetIterator. See DocIdSetIterator.
FilteredQuery
A query that applies a filter to the results of another query.
Note: the bits are retrieved from the filter each time this query is used in a search; use a CachingWrapperFilter to avoid regenerating the bits every time.
@since 1.4
FilteredQuery.FilterStrategy
Abstract class that defines how the filter (DocIdSet) is applied during document collection.
FilteredQuery.RandomAccessFilterStrategy
A FilteredQuery.FilterStrategy that conditionally uses a random access filter if the given DocIdSet supports random access (returns a non-null value from Bits) and UseRandomAccess(IBits, int) returns true. Otherwise this strategy falls back to a "zig-zag join" (LEAP_FROG_FILTER_FIRST_STRATEGY) strategy.
FuzzyQuery
Implements the fuzzy search query. The similarity measurement is based on the Damerau-Levenshtein (optimal string alignment) algorithm, though you can explicitly choose classic Levenshtein by passing false to the transpositions parameter.
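For example (the field and term are illustrative):
// Matches terms within 2 edits of "lucene" in field "body", using the
// default Damerau-Levenshtein (transpositions enabled) distance.
var fuzzy = new FuzzyQuery(new Term("body", "lucene"), 2);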
FuzzyTermsEnum
Subclass of TermsEnum for enumerating all terms that are similar to the specified filter term.
Term enumerations are always ordered by Comparer. Each term in the enumeration is greater than all that precede it.
FuzzyTermsEnum.LevenshteinAutomataAttribute
Stores compiled automata as a list (indexed by edit distance)
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
IndexSearcher
Implements search over a single IndexReader.
Applications usually need only call the inherited Search(Query, int) or Search(Query, Filter?, int) methods. For performance reasons, if your index is unchanging, you should share a single IndexSearcher instance across multiple searches instead of creating a new one per-search. If your index has changed and you wish to see the changes reflected in searching, you should use OpenIfChanged(DirectoryReader) to obtain a new reader and then create a new IndexSearcher from that. Also, for low-latency turnaround it's best to use a near-real-time reader (Open(IndexWriter, bool)). Once you have a new IndexReader, it's relatively cheap to create a new IndexSearcher from it.
NOTE: IndexSearcher instances are completely thread safe, meaning multiple threads can call any of its methods, concurrently. If your application requires external synchronization, you should not synchronize on the IndexSearcher instance; use your own (non-Lucene) objects instead.
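A sketch of the refresh pattern described above, assuming current is the open DirectoryReader behind the searcher you are sharing:
DirectoryReader newReader = DirectoryReader.OpenIfChanged(current);
if (newReader != null) // null means nothing changed
{
    // In real code, let in-flight searches finish first (see SearcherManager).
    current.Dispose();                     // release the stale reader
    current = newReader;
    searcher = new IndexSearcher(current); // cheap relative to reopening the reader
}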
IndexSearcher.LeafSlice
A class holding a subset of the IndexSearcher's leaf contexts to be executed within a single thread.
Note
This API is experimental and might change in incompatible ways in the next release.
LiveFieldValues<S, T>
Tracks live field values across NRT reader reopens. This holds a map for all updated ids since the last reader reopen. Once the NRT reader is reopened, it prunes the map. This means you must reopen your NRT reader periodically otherwise the RAM consumption of this class will grow unbounded!
NOTE: you must ensure the same id is never updated at the same time by two threads, because in this case you cannot in general know which thread "won".
MatchAllDocsQuery
A query that matches all documents.
MaxNonCompetitiveBoostAttribute
Implementation class for IMaxNonCompetitiveBoostAttribute.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
MultiCollector
A ICollector which allows running a search with several
ICollectors. It offers a static Wrap(params ICollector[]) method which accepts a
list of collectors and wraps them with MultiCollector, while
filtering out the null
ones.
MultiPhraseQuery
MultiPhraseQuery is a generalized version of PhraseQuery, with an added method Add(Term[]).
To use this class to search for the phrase "Microsoft app*", first use Add(Term) on the term "Microsoft", then find all terms that have "app" as prefix using MultiFields.GetFields(IndexReader).GetTerms(string), and use Add(Term[]) to add them to the query.
Collection initializer note: To create and populate a MultiPhraseQuery in a single statement, you can use the following example as a guide:
var multiPhraseQuery = new MultiPhraseQuery() {
new Term("field", "microsoft"),
new Term("field", "office")
};
Note that as long as you specify all of the parameters, you can use either Add(Term), Add(Term[]), or Add(Term[], int) as the method to use to initialize. If there are multiple parameters, each parameter set must be surrounded by curly braces.
MultiTermQuery
An abstract Query that matches documents containing a subset of terms provided by a FilteredTermsEnum enumeration.
This query cannot be used directly; you must subclass it and define GetTermsEnum(Terms, AttributeSource) to provide a FilteredTermsEnum that iterates through the terms to be matched. NOTE: if MultiTermRewriteMethod is either CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE or SCORING_BOOLEAN_QUERY_REWRITE, you may encounter a BooleanQuery.TooManyClausesException exception during searching, which happens when the number of terms to be searched exceeds MaxClauseCount. Setting MultiTermRewriteMethod to CONSTANT_SCORE_FILTER_REWRITE prevents this. The recommended rewrite method is CONSTANT_SCORE_AUTO_REWRITE_DEFAULT: it doesn't spend CPU computing unhelpful scores, and it tries to pick the most performant rewrite method given the query. If you need scoring (like FuzzyQuery), use MultiTermQuery.TopTermsScoringBooleanQueryRewrite, which uses a priority queue to collect only competitive terms and not hit this limitation.
MultiTermQuery.RewriteMethod
Abstract class that defines how the query is rewritten.
MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite
A rewrite method that first translates each term into a SHOULD clause in a BooleanQuery, but the scores are only computed as the boost.
This rewrite method only uses the top scoring terms so it will not overflow the boolean max clause count.
MultiTermQuery.TopTermsScoringBooleanQueryRewrite
A rewrite method that first translates each term into a SHOULD clause in a BooleanQuery, and keeps the scores as computed by the query.
This rewrite method only uses the top scoring terms so it will not overflow the boolean max clause count. It is the default rewrite method for FuzzyQuery.
MultiTermQueryWrapperFilter<Q>
A wrapper for MultiTermQuery, that exposes its functionality as a Filter.
MultiTermQueryWrapperFilter<Q> is not designed to be used by itself. Normally you subclass it to provide a Filter counterpart for a MultiTermQuery subclass. For example, TermRangeFilter and PrefixFilter extend MultiTermQueryWrapperFilter<Q>. This class also provides the functionality behind CONSTANT_SCORE_FILTER_REWRITE; this is why it is not abstract.
NGramPhraseQuery
This is a PhraseQuery which is optimized for n-gram phrase queries. For example, when you query "ABCD" on a 2-gram field, you may want to use NGramPhraseQuery rather than PhraseQuery, because NGramPhraseQuery will Rewrite(IndexReader) the query to "AB/0 CD/2", while PhraseQuery will query "AB/0 BC/1 CD/2" (where term/position).
Collection initializer note: To create and populate an NGramPhraseQuery in a single statement, you can use the following example as a guide:
var phraseQuery = new NGramPhraseQuery(2) {
new Term("field", "ABCD"),
new Term("field", "EFGH")
};
Note that as long as you specify all of the parameters, you can use either Add(Term) or Add(Term, int) as the method to use to initialize. If there are multiple parameters, each parameter set must be surrounded by curly braces.
NumericRangeFilter
LUCENENET specific static class to provide access to static methods without referring to the NumericRangeFilter<T>'s generic closing type.
NumericRangeFilter<T>
A Filter that only accepts numeric values within a specified range. To use this, you must first index the numeric values using Int32Field, SingleField, Int64Field or DoubleField (expert: NumericTokenStream).
You create a new NumericRangeFilter with the static factory methods, e.g.:
Filter f = NumericRangeFilter.NewSingleRange("weight", 0.03f, 0.10f, true, true);
Accepts all documents whose float valued "weight" field ranges from 0.03 to 0.10, inclusive. See NumericRangeQuery for details on how Lucene indexes and searches numeric valued fields.
@since 2.9
NumericRangeQuery
LUCENENET specific class to provide access to static factory methods of NumericRangeQuery<T> without referring to its generic closing type.
NumericRangeQuery<T>
A Query that matches numeric values within a specified range. To use this, you must first index the numeric values using Int32Field, SingleField, Int64Field or DoubleField (expert: NumericTokenStream). If your terms are instead textual, you should use TermRangeQuery. NumericRangeFilter is the filter equivalent of this query.
You create a new NumericRangeQuery<T> with the static factory methods, e.g.:
Query q = NumericRangeQuery.NewSingleRange("weight", 0.03f, 0.10f, true, true);
matches all documents whose float valued "weight" field ranges from 0.03 to 0.10, inclusive.
The performance of NumericRangeQuery<T> is much better than the corresponding TermRangeQuery because the number of terms that must be searched is usually far fewer, thanks to trie indexing, described below.
You can optionally specify a precisionStep when creating this query. This is necessary if you've changed this configuration from its default (4) during indexing. Lower values consume more disk space but speed up searching. Suitable values are between 1 and 8. A good starting point to test is 4, which is the default value for all Numeric* classes. See below for details.
This query defaults to CONSTANT_SCORE_AUTO_REWRITE_DEFAULT. With precision steps of <=4, this query can be run with one of the BooleanQuery rewrite methods without changing BooleanQuery's default max clause count.
How it works
See the publication about panFMP, where this algorithm was described (referred to as TrieRangeQuery):
Schindler, U, Diepenbroek, M, 2008. Generic XML-based Framework for Metadata Portals. Computers & Geosciences 34 (12), 1947-1955. doi:10.1016/j.cageo.2008.02.023
A quote from this paper: Because Apache Lucene is a full-text search engine and not a conventional database, it cannot handle numerical ranges (e.g., field value is inside user defined bounds, even dates are numerical values). We have developed an extension to Apache Lucene that stores the numerical values in a special string-encoded format with variable precision (all numerical values like doubles, longs, floats, and ints are converted to lexicographic sortable string representations and stored with different precisions; for a more detailed description of how the values are stored, see NumericUtils). A range is then divided recursively into multiple intervals for searching: The center of the range is searched only with the lowest possible precision in the trie, while the boundaries are matched more exactly. This reduces the number of terms dramatically.
For the variant that stores long values in 8 different precisions (each reduced by 8 bits) that uses a lowest precision of 1 byte, the index contains only a maximum of 256 distinct values in the lowest precision. Overall, a range could consist of a theoretical maximum of 7*255*2 + 255 = 3825 distinct terms (when there is a term for every distinct value of an 8-byte-number in the index and the range covers almost all of them; a maximum of 255 distinct values is used because it would always be possible to reduce the full 256 values to one term with degraded precision). In practice, we have seen up to 300 terms in most cases (index with 500,000 metadata records and a uniform value distribution).
Precision Step
You can choose any precisionStep when encoding values. Lower step values mean more precisions and so more terms in the index (and the index gets larger). The number of indexed terms per value is (those are generated by NumericTokenStream):
indexedTermsPerValue = ceil(bitsPerValue / precisionStep)
As the lower precision terms are shared by many values, the additional terms only slightly grow the term dictionary (approx. 7% for precisionStep=4), but have a larger impact on the postings (the postings file will have more entries, as every document is linked to indexedTermsPerValue terms instead of one). The growth of the term dictionary in comparison to one term per value can be estimated as approximately:
termDictOverhead = 1 / (2^precisionStep - 1)
(for precisionStep=4 this gives 1/15, i.e. the roughly 7% mentioned above).
On the other hand, if the precisionStep is smaller, the maximum number of terms to match reduces, which optimizes query speed. The formula to calculate the maximum number of terms that will be visited while executing the query is:
maxQueryTerms = ((bitsPerValue / precisionStep) - 1) * (2^precisionStep - 1) * 2 + (2^precisionStep - 1)
For longs stored using a precision step of 4, maxQueryTerms = 15*15*2 + 15 = 465, and for a precision step of 2, maxQueryTerms = 31*3*2 + 3 = 189. But the faster search speed is reduced by more seeking in the term enum of the index. Because of this, the ideal precisionStep value can only be found out by testing. Important: You can index with a lower precision step value and test search speed using a multiple of the original step value.
Good values for precisionStep depend on usage and data type (see the sketch after this list):
- The default for all data types is 4, which is used when no precisionStep is given.
- Ideal value in most cases for 64 bit data types (long, double) is 6 or 8.
- Ideal value in most cases for 32 bit data types (int, float) is 4.
- For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is fair to use MaxValue (see below).
- Steps >=64 for long/double and >=32 for int/float produce one token per value in the index and querying is as slow as a conventional TermRangeQuery. But it can be used to produce fields that are solely used for sorting (in this case simply use MaxValue as precisionStep). Using Int32Field, Int64Field, SingleField or DoubleField for sorting is ideal, because building the field cache is much faster than with text-only numbers. These fields have one term per value and therefore also work with term enumeration for building distinct lists (e.g. facets / preselected values to search for). Sorting is also possible with range query optimized fields using one of the above precisionSteps.
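As a sketch of choosing a non-default step at index time (assuming a Lucene.Net.Documents.Document named doc; the field name and step value are illustrative):
// Index a long with precisionStep=6, a common choice for 64-bit types.
var ft = new FieldType(Int64Field.TYPE_NOT_STORED) { NumericPrecisionStep = 6 };
doc.Add(new Int64Field("timestamp", DateTime.UtcNow.Ticks, ft));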
Comparisons of the different types of RangeQueries on an index with about 500,000 docs showed that TermRangeQuery in boolean rewrite mode (with raised BooleanQuery clause count) took about 30-40 secs to complete, TermRangeQuery in constant score filter rewrite mode took 5 secs and executing this class took <100ms to complete (on an Opteron64 machine, Java 1.5, 8 bit precision step). This query type was developed for a geographic portal, where the performance for e.g. bounding boxes or exact date/time stamps is important.
@since 2.9
PhraseQuery
A Query that matches documents containing a particular sequence of terms.
A PhraseQuery is built by QueryParser for input like "new york".
Collection initializer note: To create and populate a PhraseQuery in a single statement, you can use the following example as a guide:
var phraseQuery = new PhraseQuery() {
new Term("field", "microsoft"),
new Term("field", "office")
};
Note that as long as you specify all of the parameters, you can use either Add(Term) or Add(Term, int) as the method to use to initialize. If there are multiple parameters, each parameter set must be surrounded by curly braces.
PositiveScoresOnlyCollector
A ICollector implementation which wraps another ICollector and makes sure only documents with scores > 0 are collected.
PrefixFilter
A Filter that restricts search results to values that have a matching prefix in a given field.
PrefixQuery
A Query that matches documents containing terms with a specified prefix. A PrefixQuery is built by QueryParser for input like app*.
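For example (the field name is illustrative):
// Matches all documents containing a term starting with "app" in field "body".
var prefixQuery = new PrefixQuery(new Term("body", "app"));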
PrefixTermsEnum
Subclass of FilteredTermsEnum for enumerating all terms that match the specified prefix filter term.
Term enumerations are always ordered by Comparer. Each term in the enumeration is greater than all that precede it.
Query
The abstract base class for queries.
Instantiable subclasses are:
- TermQuery
- BooleanQuery
- WildcardQuery
- PhraseQuery
- PrefixQuery
- MultiPhraseQuery
- FuzzyQuery
- RegexpQuery
- TermRangeQuery
- NumericRangeQuery
- ConstantScoreQuery
- DisjunctionMaxQuery
- MatchAllDocsQuery
QueryRescorer
A Rescorer that uses a provided Query to assign scores to the first-pass hits.
Note
This API is experimental and might change in incompatible ways in the next release.
QueryWrapperFilter
Constrains search results to only match those which also match a provided query.
This could be used, for example, with a NumericRangeQuery on a suitably formatted date field to implement date filtering. One could re-use a single CachingWrapperFilter(QueryWrapperFilter) that matches, e.g., only documents modified within the last week. This would only need to be reconstructed once per day.
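A sketch of that pattern, assuming the date field was indexed as an Int64Field of ticks (the field name and time source are illustrative):
long nowTicks = DateTime.UtcNow.Ticks;
long weekAgoTicks = DateTime.UtcNow.AddDays(-7).Ticks;
Query lastWeek = NumericRangeQuery.NewInt64Range("modifiedTicks", weekAgoTicks, nowTicks, true, true);
// Cache so the bits are not regenerated on every search; rebuild once per day.
Filter filter = new CachingWrapperFilter(new QueryWrapperFilter(lastWeek));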
ReferenceManager
LUCENENET specific class used to provide static access to ReferenceManager.IRefreshListener without having to specify the generic closing type of ReferenceManager<G>.
ReferenceManagerExtensions
ReferenceManager<G>
Utility class to safely share instances of a certain type across multiple threads, while periodically refreshing them. This class ensures each reference is closed only once all threads have finished using it. It is recommended to consult the documentation of ReferenceManager<G> implementations for their MaybeRefresh() semantics.
Note
This API is experimental and might change in incompatible ways in the next release.
RegexpQuery
A fast regular expression query based on the Lucene.Net.Util.Automaton package.
- Comparisons are fast
- The term dictionary is enumerated in an intelligent way, to avoid comparisons. See AutomatonQuery for more details.
The supported syntax is documented in the RegExp class. Note this might be different than other regular expression implementations. For some alternatives with different syntax, look under the sandbox.
Note this query can be slow, as it needs to iterate over many terms. In order to prevent extremely slow RegexpQuerys, a RegExp term should not start with the expression .*
Note
This API is experimental and might change in incompatible ways in the next release.
Rescorer
Re-scores the topN results (TopDocs) from an original query. See QueryRescorer for an actual implementation. Typically, you run a low-cost first-pass query across the entire index, collecting the top few hundred hits perhaps, and then use this class to mix in a more costly second pass scoring.
See Rescore(IndexSearcher, TopDocs, Query, double, int) for a simple static method to call to rescore using a 2nd pass Query.
Note
This API is experimental and might change in incompatible ways in the next release.
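A sketch of the two-pass pattern (cheapQuery and costlyQuery are illustrative names):
// Cheap first pass over the whole index, then a costlier second pass over
// just the top 200 hits, weighting the rescore query by 2.0 and keeping 100.
TopDocs firstPass = searcher.Search(cheapQuery, 200);
TopDocs rescored = QueryRescorer.Rescore(searcher, firstPass, costlyQuery, 2.0, 100);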
ScoreCachingWrappingScorer
A Scorer which wraps another scorer and caches the score of the current document. Successive calls to GetScore() will return the same result and will not invoke the wrapped Scorer's GetScore() method, unless the current document has changed.
This class might be useful due to the changes done to the ICollector interface, in which the score is not computed for a document by default, only if the collector requests it. Some collectors may need to use the score in several places, however all they have in hand is a Scorer object, and might end up computing the score of a document more than once.
ScoreDoc
Holds one hit in TopDocs.
Scorer
Expert: Common scoring functionality for different types of queries.
A Scorer iterates over documents matching a query in increasing order of doc Id.
Document scores are computed using a given Similarity implementation.
NOTE: The values NaN, NegativeInfinity and PositiveInfinity are not valid scores. Certain collectors (eg TopScoreDocCollector) will not properly collect hits with these scores.
Scorer.ChildScorer
A child Scorer and its relationship to its parent. The meaning of the relationship depends upon the parent query.
Note
This API is experimental and might change in incompatible ways in the next release.
ScoringRewrite<Q>
Base rewrite method that translates each term into a query, and keeps the scores as computed by the query.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
SearcherFactory
Factory class used by SearcherManager to create new IndexSearchers. The default implementation just creates an IndexSearcher with no custom behavior:
public IndexSearcher NewSearcher(IndexReader r)
{
return new IndexSearcher(r);
}
You can pass your own factory instead if you want custom behavior, such as:
- Setting a custom scoring model: Similarity
- Parallel per-segment search: IndexSearcher(IndexReader, TaskScheduler?)
- Return custom subclasses of IndexSearcher (for example that implement distributed scoring)
- Run queries to warm your IndexSearcher before it is used. Note: when using near-realtime search you may want to also set MergedSegmentWarmer to warm newly merged segments in the background, outside of the reopen path.
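A hedged sketch of such a factory (the similarity choice and warming query are illustrative, not required):
public class WarmingSearcherFactory : SearcherFactory
{
    public override IndexSearcher NewSearcher(IndexReader reader)
    {
        var searcher = new IndexSearcher(reader)
        {
            Similarity = new BM25Similarity() // custom scoring model
        };
        // Run a cheap query so internal caches are warm before publishing.
        searcher.Search(new MatchAllDocsQuery(), 1);
        return searcher;
    }
}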
Note
This API is experimental and might change in incompatible ways in the next release.
SearcherLifetimeManager
Keeps track of current plus old IndexSearchers, disposing the old ones once they have timed out.
Use it like this:
SearcherLifetimeManager mgr = new SearcherLifetimeManager();
Per search-request, if it's a "new" search request, then obtain the latest searcher you have (for example, by using SearcherManager), and then record this searcher:
// Record the current searcher, and save the returned
// token into user's search results (eg as a hidden
// HTML form field):
long token = mgr.Record(searcher);
When a follow-up search arrives, for example the user clicks next page, drills down/up, etc., take the token that you saved from the previous search and:
// If possible, obtain the same searcher as the last
// search:
IndexSearcher searcher = mgr.Acquire(token);
if (searcher != null)
{
// Searcher is still here
try
{
// do searching...
}
finally
{
mgr.Release(searcher);
// Do not use searcher after this!
searcher = null;
}
}
else
{
// Searcher was pruned -- notify user session timed
// out, or, pull fresh searcher again
}
Finally, in a separate thread, ideally the same thread that's periodically reopening your searchers, you should periodically prune old searchers:
mgr.Prune(new PruneByAge(600.0));
NOTE: keeping many searchers around means you'll use more resources (open files, RAM) than a single searcher. However, as long as you are using OpenIfChanged(DirectoryReader), the searchers will usually share almost all segments and the added resource usage is contained. When a large merge has completed, and you reopen, because that is a large change, the new searcher will use higher additional RAM than other searchers; but large merges don't complete very often and it's unlikely you'll hit two of them in your expiration window. Still you should budget plenty of heap in the runtime to have a good safety margin.
SearcherLifetimeManager.PruneByAge
Simple pruner that drops any searcher that is older than the newest searcher by more than the specified number of seconds.
SearcherManager
Utility class to safely share IndexSearcher instances across multiple threads, while periodically reopening. This class ensures each searcher is disposed only once all threads have finished using it.
Use Acquire() to obtain the current searcher, and Release(G) to release it, like this:
IndexSearcher s = manager.Acquire();
try
{
// Do searching, doc retrieval, etc. with s
}
finally
{
manager.Release(s);
// Do not use s after this!
s = null;
}
In addition you should periodically call MaybeRefresh(). While it's possible to call this just before running each query, this is discouraged since it penalizes the unlucky queries that do the reopen. It's better to use a separate background thread, that periodically calls MaybeRefresh(). Finally, be sure to call Dispose() once you are done.
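A sketch of such a background refresh loop (keepRunning is an application-level shutdown flag; the interval is illustrative):
var refreshThread = new Thread(() =>
{
    while (keepRunning)
    {
        manager.MaybeRefresh(); // cheap no-op when nothing has changed
        Thread.Sleep(TimeSpan.FromSeconds(1));
    }
}) { IsBackground = true };
refreshThread.Start();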
Note
This API is experimental and might change in incompatible ways in the next release.
Sort
Encapsulates sort criteria for returned hits.
The fields used to determine sort order must be carefully chosen. Documents must contain a single term in such a field, and the value of the term should indicate the document's relative position in a given sort order. The field must be indexed, but should not be tokenized, and does not need to be stored (unless you happen to want it back with the rest of your document data). In other words:
document.Add(new Field("byNumber", x.ToString(CultureInfo.InvariantCulture), Field.Store.NO, Field.Index.NOT_ANALYZED));
Valid Types of Values
There are four possible kinds of term values which may be put into sorting fields: ints, longs, floats, or strings. Unless SortField objects are specified, the type of value in the field is determined by parsing the first term in the field.
int term values should contain only digits and an optional preceding negative sign. Values must be base 10 and in the range MinValue and MaxValue inclusive. Documents which should appear first in the sort should have low value integers, later documents high values (i.e. the documents should be numbered 1..n where 1 is the first and n the last).
long term values should contain only digits and an optional preceding negative sign. Values must be base 10 and in the range MinValue and MaxValue inclusive. Documents which should appear first in the sort should have low value integers, later documents high values.
float term values should conform to values accepted by float (except that NaN and Infinity are not supported). Documents which should appear first in the sort should have low values, later documents high values.
string term values can contain any valid string, but should not be tokenized. The values are sorted according to their comparable natural order (Ordinal). Note that using this type of term value has higher memory requirements than the other types.
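For illustration, a Sort over the int field described above, with score as a tie-break (the names are illustrative):
var sort = new Sort(
    new SortField("byNumber", SortFieldType.INT32),
    SortField.FIELD_SCORE);
TopFieldDocs hits = searcher.Search(query, null, 10, sort);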
Object Reuse
One of these objects can be used multiple times and the sort order changed between usages. This class is thread safe.
Memory Usage
Sorting uses caches of term values maintained by the internal HitQueue(s). The cache is static and contains an int or float array of length IndexReader.MaxDoc for each field name for which a sort is performed. In other words, the size of the cache in bytes is:
4 * IndexReader.MaxDoc * (# of different fields actually used to sort)
For string fields, the cache is larger: in addition to the above array, the value of every term in the field is kept in memory. If there are many unique terms in the field, this could be quite large.
Note that the size of the cache is not affected by how many fields are in the index and might be used to sort - only by the ones actually used to sort a result set.
Created: Feb 12, 2004 10:53:57 AM
@since lucene 1.4
SortField
Stores information about how to sort documents by terms in an individual field. Fields must be indexed in order to sort by them.
Created: Feb 11, 2004 1:25:29 PM @since lucene 1.4
SortRescorer
A Rescorer that re-sorts according to a provided Sort.
TermCollectingRewrite<Q>
Code to search indices.
Table Of Contents
- Search Basics
- The Query Classes
- Scoring: Introduction
- Scoring: Basics
- Changing the Scoring
- Appendix: Search Algorithm
Search Basics
Lucene offers a wide variety of Query implementations, most of which are in this package, its subpackages (Lucene.Net.Spans, Lucene.Net.Payloads), or the Lucene.Net.Queries module. These implementations can be combined in a wide variety of ways to provide complex querying capabilities along with information about where matches took place in the document collection. The Query Classes section below highlights some of the more important Query classes. For details on implementing your own Query class, see Custom Queries -- Expert Level below.
To perform a search, applications usually call Search(Query, int) or Search(Query, Filter, int).
Once a Query has been created and submitted to the IndexSearcher, the scoring process begins. After some infrastructure setup, control finally passes to the Weight implementation and its Scorer or BulkScorer instances. See the Algorithm section for more notes on the process.
Query Classes
TermQuery
Of the various implementations of Query, the TermQuery is the easiest to understand and the most often used in applications. A TermQuery matches all the documents that contain the specified Term, which is a word that occurs in a certain Field. Thus, a TermQuery identifies and scores all Documents that have a Field with the specified string in it. Constructing a TermQuery is as simple as:
TermQuery tq = new TermQuery(new Term("fieldName", "term"));
In this example, the Query identifies all Documents that have the Field named "fieldName"
containing the word "term"
.
BooleanQuery
Things start to get interesting when one combines multiple TermQuery instances into a BooleanQuery. A BooleanQuery contains multiple BooleanClauses, where each clause contains a sub-query (Query instance) and an operator (from BooleanClause.Occur) describing how that sub-query is combined with the other clauses:
[Occur.SHOULD](xref:Lucene.Net.Search.Occur#Lucene_Net_Search_Occur_SHOULD) — Use this operator when a clause can occur in the result set, but is not required. If a query is made up of all SHOULD clauses, then every document in the result set matches at least one of these clauses.
[Occur.MUST](xref:Lucene.Net.Search.Occur#Lucene_Net_Search_Occur_MUST) — Use this operator when a clause is required to occur in the result set. Every document in the result set will match all such clauses.
[Occur.MUST_NOT](xref:Lucene.Net.Search.Occur#Lucene_Net_Search_Occur_MUST_NOT) — Use this operator when a clause must not occur in the result set. No document in the result set will match any such clauses.
Boolean queries are constructed by adding two or more BooleanClause instances. If too many clauses are added, a TooManyClausesException will be thrown during searching. This most often occurs when a Query is rewritten into a BooleanQuery with many TermQuery clauses, for example by WildcardQuery. The default setting for the maximum number of clauses 1024, but this can be changed via the static method BooleanQuery.MaxClauseCount.
Phrases
Another common search is to find documents containing certain phrases. This is handled three different ways:
PhraseQuery — Matches a sequence of Terms. PhraseQuery uses a slop factor to determine how many positions may occur between any two terms in the phrase and still be considered a match. The slop is 0 by default, meaning the phrase must match exactly.
MultiPhraseQuery — A more general form of PhraseQuery that accepts multiple Terms for a position in the phrase. For example, this can be used to perform phrase queries that also incorporate synonyms.
SpanNearQuery — Matches a sequence of other SpanQuery instances. SpanNearQuery allows for much more complicated phrase queries since it is constructed from other SpanQuery instances, instead of only TermQuery instances.
TermRangeQuery
The TermRangeQuery matches all documents that occur in the exclusive range of a lower Term and an upper Term according to TermsEnum.Comparer. It is not intended for numerical ranges; use NumericRangeQuery instead. For example, one could find all documents that have terms beginning with the letters a through c.
NumericRangeQuery
The NumericRangeQuery matches all documents that occur in a numeric range. For NumericRangeQuery to work, you must index the values using a one of the numeric fields (Int32Field, Int64Field, SingleField, or DoubleField).
PrefixQuery, WildcardQuery, RegexpQuery
While the PrefixQuery has a different implementation, it is essentially a special case of the WildcardQuery. The PrefixQuery allows an application to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing for the use of (matches 0 or more characters) and ? (matches exactly one character) wildcards. Note that the WildcardQuery can be quite slow. Also note that WildcardQuery should not start with and ?, as these are extremely slow. Some QueryParsers may not allow this by default, but provide an AllowLeadingWildcard
property to remove that protection. The RegexpQuery is even more general than WildcardQuery, allowing an application to identify all documents with terms that match a regular expression pattern.
FuzzyQuery
A FuzzyQuery matches documents that contain terms similar to the specified term. Similarity is determined using Levenshtein (edit) distance. This type of query can be useful when accounting for spelling variations in the collection.
Scoring — Introduction
Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user. In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to work. Then we are left digging into Lucene internals or asking for help on user@lucenenet.apache.org to figure out why a document with five of our query terms scores lower than a different document with only one of the query terms.
While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can help you figure out the what and why of Lucene scoring.
Lucene scoring supports a number of pluggable information retrieval models, including:
These models can be plugged in via the Similarity API, and offer extension hooks and parameters for tuning. In general, Lucene first finds the documents that need to be scored based on boolean logic in the Query specification, and then ranks this subset of matching documents via the retrieval model. For some valuable references on VSM and IR in general refer to Lucene Wiki IR references.
The rest of this document will cover Scoring basics and explain how to change your Similarity. Next, it will cover ways you can customize the Lucene internals in Custom Queries -- Expert Level, which gives details on implementing your own Query class and related functionality. Finally, we will finish up with some reference material in the Appendix.
Scoring — Basics
Scoring is very much dependent on the way documents are indexed, so it is important to understand indexing. (see Lucene overview before continuing on with this section) Be sure to use the useful IndexSearcher.Explain(Query, int) to understand how the score for a certain matching document was computed.
Generally, the Query determines which documents match (a binary decision), while the Similarity determines how to assign scores to the matching documents.
Fields and Documents
In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (Tokenized, Stored, etc). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This is important because two Documents with the exact same content, but one having the content in two Fields and the other in one Field may return different scores for the same query due to length normalization.
Score Boosting
Lucene allows influencing search results by "boosting" at different times:
- Index-time boost by setting Field.Boost before a document is added to the index.
- Query-time boost by setting a boost on a query clause via Query.Boost (both are sketched below).
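A brief sketch of both (field and boost values are illustrative):
// Index-time boost: favor matches in this document's "title" field.
TextField title = new TextField("title", "the quick brown fox", Field.Store.YES);
title.Boost = 2.0f;

// Query-time boost: weight this clause more heavily than its siblings.
TermQuery tq = new TermQuery(new Term("title", "fox"));
tq.Boost = 3.0f;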
Indexing time boosts are pre-processed for storage efficiency and written to storage for a field as follows:
- All boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied.
- The boost is then encoded into a normalization value by the Similarity object at index-time: ComputeNorm. The actual encoding depends upon the Similarity implementation, but note that most use a lossy encoding (such as multiplying the boost with document length or similar, packed into a single byte!).
- Decoding of any index-time normalization values and integration into the document's score is also performed at search time by the Similarity.
Changing Scoring — Similarity
Changing Similarity is an easy way to influence scoring. It is set at index-time with IndexWriterConfig.Similarity and at query-time with IndexSearcher.Similarity. Be sure to use the same Similarity at query-time as at index-time (so that norms are encoded/decoded correctly); Lucene makes no effort to verify this.
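For example, a hedged sketch using the built-in BM25Similarity (the analyzer and reader are assumed to exist):
// Index-time: configure the writer with the chosen Similarity.
IndexWriterConfig config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);
config.Similarity = new BM25Similarity();

// Query-time: the searcher must use the same Similarity.
IndexSearcher searcher = new IndexSearcher(reader);
searcher.Similarity = new BM25Similarity();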
You can influence scoring by configuring a different built-in Similarity implementation, by tweaking its parameters, or by subclassing it to override behavior. Some implementations also offer a modular API which you can extend by plugging in a different component (e.g. term frequency normalizer).
Finally, you can extend the low level Similarity directly to implement a new retrieval model, or to use external scoring factors particular to your application. For example, a custom Similarity can access per-document values via FieldCache or NumericDocValues and integrate them into the score.
See the Lucene.Net.Search.Similarities package documentation for information on the built-in available scoring models and extending or changing Similarity.
Custom Queries — Expert Level
Custom queries are an expert level task, so tread carefully and be prepared to share your code if you want help.
With the warning out of the way, it is possible to change a lot more than just the Similarity when it comes to matching and scoring in Lucene. Lucene's search is a complex mechanism that is grounded by the following main classes:
- Query — The abstract object representation of the user's information need.
- Weight — The internal interface representation of the user's Query, so that Query objects may be reused. This is global (across all segments of the index) and generally will require global statistics (such as DocFreq for a given term across all segments).
- Scorer — An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities. This is created per-segment.
- BulkScorer — An abstract class that scores a range of documents. A default implementation simply iterates through the hits from Scorer, but some queries such as BooleanQuery have more efficient implementations.
Details on each of these classes, and their children, can be found in the subsections below.
The Query Class
In some sense, the Query class is where it all begins. Without a Query, there would be nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it is often responsible for creating them or coordinating the functionality between them. The Query class has several methods that are important for derived classes:
- CreateWeight(IndexSearcher searcher) — A Weight is the internal representation of the Query, so each Query implementation must provide an implementation of Weight. See the subsection on The Weight Interface below for details on implementing the Weight interface.
- Rewrite(IndexReader reader) — Rewrites queries into primitive queries. Primitive queries are: TermQuery, BooleanQuery, and other queries that implement CreateWeight(IndexSearcher searcher).
The Weight Interface
The Weight interface provides an internal representation of the Query so that it can be reused. Any IndexSearcher dependent state should be stored in the Weight implementation, not in the Query class. The interface defines the following members:
- Query — Pointer to the Query that this Weight represents.
- GetValueForNormalization() — A weight can return a floating point value to indicate its magnitude for query normalization. Typically a weight such as TermWeight that scores via a Similarity will just defer to the Similarity's implementation: SimWeight.GetValueForNormalization(). For example, with Lucene's classic vector-space formula, this is implemented as the sum of squared weights: (idf * boost)².
- Normalize(float norm, float topLevelBoost) — Performs query normalization:
  - topLevelBoost: A query-boost factor from any wrapping queries that should be multiplied into every document's score. For example, a TermQuery that is wrapped within a BooleanQuery with a boost of 5 would receive this value at this time. This allows the TermQuery (the leaf node in this case) to compute this up-front a single time (e.g. by multiplying into the IDF), rather than for every document.
  - norm: Passes in a normalization factor which may allow for comparing scores between queries. Typically a weight such as TermWeight that scores via a Similarity will just defer to the Similarity's implementation: SimWeight.Normalize(float, float).
- GetScorer(AtomicReaderContext context, IBits acceptDocs) — Construct a new Scorer for this Weight. See The Scorer Class below for help defining a Scorer. As the name implies, the Scorer is responsible for doing the actual scoring of documents given the Query.
- GetBulkScorer(AtomicReaderContext context, bool scoreDocsInOrder, IBits acceptDocs) — Construct a new BulkScorer for this Weight. See The BulkScorer Class below for help defining a BulkScorer. This is an optional method, and most queries do not implement it.
- Explain(AtomicReaderContext context, int doc) — Provide a means for explaining why a given document was scored the way it was. Typically a weight such as TermWeight that scores via a Similarity will make use of the Similarity's implementation: SimScorer.Explain(int doc, Explanation freq).
The Scorer Class
The Scorer abstract class provides common scoring functionality for all Scorer implementations and is the heart of the Lucene scoring process. Scorer defines the following methods, which must be implemented (some are not yet abstract, but will be in future versions and should be treated as such now; some are inherited from DocIdSetIterator):
- NextDoc() — Advances to the next document that matches this Query, returning the id of that document, or NO_MORE_DOCS if there are no more matches (see the iteration sketch after this list).
- DocID — Returns the id of the Document that contains the match.
- GetScore() — Return the score of the current document. This value can be determined in any appropriate way for an application. For instance, the TermScorer simply defers to the configured Similarity: SimScorer.Score(int doc, float freq).
- Freq — Returns the number of matches for the current document. This value can be determined in any appropriate way for an application. For instance, the TermScorer simply defers to the term frequency from the inverted index: DocsEnum.Freq.
- Advance(int target) — Skip ahead in the document matches to the document whose id is greater than or equal to target. In many instances, advance can be implemented more efficiently than simply looping through all the matching documents until the target document is identified.
- GetChildren() — Returns any child subscorers underneath this scorer. This allows for users to navigate the scorer hierarchy and receive more fine-grained details on the scoring process.
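Put together, a consumer typically drives a Scorer like this (a minimal sketch; the scorer is assumed to have been obtained from a Weight):
int doc;
while ((doc = scorer.NextDoc()) != DocIdSetIterator.NO_MORE_DOCS)
{
    float score = scorer.GetScore(); // score the current match
    // ... collect (doc, score) ...
}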
The BulkScorer Class
The BulkScorer scores a range of documents. There is only one abstract method:
- Score(ICollector, int) — Score all documents up to but not including the specified max document.
Why would I want to add my own Query?
In a nutshell, you want to add your own custom Query implementation when you think that Lucene's existing Query implementations aren't appropriate for the task that you want to do. You might be doing some cutting edge research or you need more information back out of Lucene (similar to Doug adding SpanQuery functionality).
Appendix: Search Algorithm
This section is mostly notes on stepping through the Scoring process and serves as fertilizer for the earlier sections.
In the typical search application, a Query is passed to the IndexSearcher, beginning the scoring process.
Once inside the IndexSearcher, an ICollector is used for the scoring and sorting of the search results. These important objects are involved in a search:
- The Weight object of the Query. The Weight object is an internal representation of the Query that allows the Query to be reused by the IndexSearcher.
- The IndexSearcher that initiated the call.
- A Filter for limiting the result set. Note, the Filter may be null.
- A Sort object for specifying how to sort the results if the standard score-based sort method is not desired.
Assuming we are not sorting (since sorting doesn't affect the raw Lucene score), we call one of the search methods of the IndexSearcher, passing in the Weight object created by IndexSearcher.CreateNormalizedWeight(Query), the Filter, and the number of results we want. This method returns a TopDocs object, which is an internal collection of search results. The IndexSearcher creates a TopScoreDocCollector and passes it, along with the Weight and Filter, to another expert search method (for more on the ICollector mechanism, see IndexSearcher). The TopScoreDocCollector uses a PriorityQueue to collect the top results for the search.
If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, we ask the Weight for a Scorer for each IndexReader segment and proceed by calling BulkScorer.Score(ICollector).
At last, we are actually going to score some documents. The score method takes in the ICollector (most likely the TopScoreDocCollector or TopFieldCollector) and does its business. Of course, here is where things get involved. The Scorer that is returned by the Weight object depends on what type of Query was submitted. In most real world applications with multiple query terms, the Scorer is going to be a BooleanScorer2 created from BooleanWeight (see the section on custom queries for info on changing this).
Assuming a BooleanScorer2, we first initialize the Coordinator, which is used to apply the Coord() factor. We then get an internal Scorer based on the required, optional, and prohibited parts of the query. Using this internal Scorer, the BooleanScorer2 then proceeds into a while loop based on the Scorer.NextDoc() method. The NextDoc() method advances to the next document matching the query. This is an abstract method in the Scorer class and is thus overridden by all derived implementations. If you have a simple OR query, your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scores of the sub-scorers of the OR'd terms.
TermQuery
A Query that matches documents containing a term. This may be combined with other terms with a BooleanQuery.
TermRangeFilter
A Filter that restricts search results to a range of term values in a given field.
This filter matches the documents looking for terms that fall into the supplied range according to CompareTo(byte). It is not intended for numerical ranges; use NumericRangeFilter instead. If you construct a large number of range filters with different ranges but on the same field, FieldCacheRangeFilter may have significantly better performance. @since 2.9
TermRangeQuery
A Query that matches documents within a range of terms.
This query matches the documents looking for terms that fall into the supplied range according to CompareTo(byte). It is not intended for numerical ranges; use NumericRangeQuery instead. This query uses the CONSTANT_SCORE_AUTO_REWRITE_DEFAULT rewrite method. @since 2.9
TermRangeTermsEnum
Subclass of FilteredTermsEnum for enumerating all terms that match the specified range parameters.
Term enumerations are always ordered by Comparer. Each term in the enumeration is greater than all that precede it.
TermStatistics
Contains statistics for a specific term
Note
This API is experimental and might change in incompatible ways in the next release.
TimeLimitingCollector
The TimeLimitingCollector is used to timeout search requests that take longer than the maximum allowed search time limit. After this time is exceeded, the search thread is stopped by throwing a TimeLimitingCollector.TimeExceededException.
TimeLimitingCollector.TimeExceededException
Thrown when elapsed search time exceeds allowed search time.
TimeLimitingCollector.TimerThread
Thread used to timeout search requests. Can be stopped completely with StopTimer()
Note
This API is experimental and might change in incompatible ways in the next release.
TopDocs
Represents hits returned by Search(Query, Filter?, int) and Search(Query, int).
TopDocsCollector<T>
A base class for all collectors that return a TopDocs output. This collector allows easy extension by providing a single constructor which accepts a PriorityQueue<T> as well as protected members for that priority queue and a counter of the number of total hits.
Extending classes can override any of the methods to provide their own implementation, as well as avoid the use of the priority queue entirely by passing null to TopDocsCollector(PriorityQueue<T>). In that case, however, you might want to consider overriding all methods in order to avoid a NullReferenceException.
TopFieldCollector
A ICollector that sorts by SortField using FieldComparers.
See the Create(Sort, int, bool, bool, bool, bool) method for instantiating a TopFieldCollector.
Note
This API is experimental and might change in incompatible ways in the next release.
TopFieldDocs
Represents hits returned by Search(Query, Filter?, int, Sort).
TopScoreDocCollector
An ICollector implementation that collects the top-scoring hits, returning them as a TopDocs. This is used by IndexSearcher to implement TopDocs-based search. Hits are sorted by score descending and then (when the scores are tied) docID ascending. When you create an instance of this collector you should know in advance whether documents are going to be collected in doc Id order or not.
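A typical usage sketch (the hit count, searcher, and query are illustrative):
TopScoreDocCollector collector = TopScoreDocCollector.Create(10, true); // top 10, docs collected in order
searcher.Search(query, collector);
TopDocs hits = collector.GetTopDocs();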
NOTE: The values NaN and NegativeInfinity are not valid scores. This collector will not properly collect hits with such scores.
TopTermsRewrite<Q>
Base rewrite method for collecting only the top terms via a priority queue.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
TotalHitCountCollector
Just counts the total number of hits.
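For example, a minimal sketch (an existing searcher and query are assumed):
TotalHitCountCollector counter = new TotalHitCountCollector();
searcher.Search(query, counter);
int totalHits = counter.TotalHits; // number of matching documents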
Weight
Expert: Calculate query weights and build query scorers.
The purpose of Weight is to ensure searching does not modify a Query, so that a Query instance can be reused. IndexSearcher dependent state of the query should reside in the Weight. AtomicReader dependent state should reside in the Scorer. Since Weight creates Scorer instances for a given AtomicReaderContext (GetScorer(AtomicReaderContext, IBits)), callers must maintain the relationship between the searcher's top-level IndexReaderContext and the context used to create a Scorer. A Weight is used in the following way (sketched in code after this list):
- A Weight is constructed by a top-level query, given an IndexSearcher (CreateWeight(IndexSearcher)).
- The GetValueForNormalization() method is called on the Weight to compute the query normalization factor QueryNorm(float) of the query clauses contained in the query.
- The query normalization factor is passed to Normalize(float, float). At this point the weighting is complete.
- A Scorer is constructed by GetScorer(AtomicReaderContext, IBits).
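In code, the lifecycle looks roughly like this (a hedged sketch; the searcher, query, and a per-segment leafContext are assumed to exist):
Weight weight = query.CreateWeight(searcher);
float sumOfSquares = weight.GetValueForNormalization();
float queryNorm = searcher.Similarity.QueryNorm(sumOfSquares);
weight.Normalize(queryNorm, 1.0f); // topLevelBoost of 1.0 for a top-level query
Scorer scorer = weight.GetScorer(leafContext, leafContext.AtomicReader.LiveDocs);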
WildcardQuery
Implements the wildcard search query. Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. '\' is the escape character.
This query uses the CONSTANT_SCORE_AUTO_REWRITE_DEFAULT rewrite method.
Structs
ReferenceContext<T>
ReferenceContext<T> holds a reference instance and ensures it is properly de-referenced from its corresponding ReferenceManager<G> when Dispose() is called. This struct is intended to be used with a using block to simplify releasing a reference such as a SearcherManager instance. LUCENENET specific.
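For example, a hedged sketch (the GetContext() helper and the Reference property are assumptions based on the description above):
using (ReferenceContext<IndexSearcher> context = searcherManager.GetContext())
{
    IndexSearcher searcher = context.Reference; // the acquired searcher (assumed property name)
    TopDocs hits = searcher.Search(query, 10);
} // Dispose() releases the reference back to the SearcherManager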
Interfaces
FieldCache.IByteParser
Interface to parse bytes from document fields.
FieldCache.ICreationPlaceholder
Interface used to identify a FieldCache.CreationPlaceholder<TValue> without referencing its generic closing type.
FieldCache.IDoubleParser
Interface to parse doubles from document fields.
FieldCache.IInt16Parser
Interface to parse shorts from document fields.
NOTE: This was ShortParser in Lucene
FieldCache.IInt32Parser
Interface to parse ints from document fields.
NOTE: This was IntParser in Lucene
FieldCache.IInt64Parser
Interface to parse longs from document fields.
NOTE: This was LongParser in Lucene
FieldCache.IParser
Marker interface as super-interface to all parsers. It is used to specify a custom parser to SortField(string, IParser).
FieldCache.ISingleParser
Interface to parse floats from document fields.
NOTE: This was FloatParser in Lucene
FuzzyTermsEnum.ILevenshteinAutomataAttribute
Reuses compiled automata across different segments, because they are independent of the index.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
IBoostAttribute
Add this IAttribute to a TermsEnum returned by GetTermsEnum(Terms, AttributeSource) and update the boost on each returned term. This makes it possible to control the boost factor for each matching term in SCORING_BOOLEAN_QUERY_REWRITE or TopTermsRewrite<Q> mode. FuzzyQuery is using this to take the edit distance into account.
Please note: this attribute is intended to be added only by the TermsEnum to itself in its constructor and consumed by the MultiTermQuery.RewriteMethod.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
ICollector
Expert: Collectors are primarily meant to be used to gather raw results from a search, and implement sorting or custom result filtering, collation, etc.
Lucene's core collectors are derived from ICollector. Likely your application can use one of these classes, or subclass TopDocsCollector<T>, instead of implementing ICollector directly:
- TopDocsCollector<T> is an abstract base class that assumes you will retrieve the top N docs, according to some criteria, after collection is done.
- TopScoreDocCollector is a concrete subclass TopDocsCollector<T> and sorts according to score + docID. This is used internally by the IndexSearcher search methods that do not take an explicit Sort. It is likely the most frequently used collector.
- TopFieldCollector subclasses TopDocsCollector<T> and sorts according to a specified Sort object (sort by field). This is used internally by the IndexSearcher search methods that take an explicit Sort.
- TimeLimitingCollector, which wraps any other ICollector and aborts the search if it has taken too much time.
- PositiveScoresOnlyCollector wraps any other ICollector and prevents collection of hits whose score is <= 0.0
ICollector decouples the score from the collected doc: the score computation is skipped entirely if it's not needed. Collectors that do need the score should implement the SetScorer(Scorer) method, to hold onto the passed Scorer instance, and call GetScore() within the collect method to compute the current hit's score. If your collector may request the score for a single hit multiple times, you should use ScoreCachingWrappingScorer.
NOTE: The doc that is passed to the collect method is relative to the current reader. If your collector needs to resolve this to the docID space of the Multi*Reader, you must re-base it by recording the docBase from the most recent SetNextReader(AtomicReaderContext) call. Here's a simple example showing how to collect docIDs into an OpenBitSet:
using System;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Util;

private class MySearchCollector : ICollector
{
    private readonly OpenBitSet bits;
    private int docBase;

    public MySearchCollector(OpenBitSet bits)
    {
        this.bits = bits ?? throw new ArgumentNullException(nameof(bits));
    }

    // This collector does not use scores, so the Scorer can be ignored.
    public void SetScorer(Scorer scorer)
    {
    }

    // Accept docs out of order (for a bit set the order doesn't matter).
    public bool AcceptsDocsOutOfOrder
    {
        get { return true; }
    }

    public void Collect(int doc)
    {
        // Re-base the segment-relative doc id into the top-level docID space.
        bits.Set(doc + docBase);
    }

    public void SetNextReader(AtomicReaderContext context)
    {
        this.docBase = context.DocBase;
    }
}

IndexSearcher searcher = new IndexSearcher(indexReader);
OpenBitSet bits = new OpenBitSet(indexReader.MaxDoc);
searcher.Search(query, new MySearchCollector(bits));
Not all collectors will need to rebase the docID. For example, a collector that simply counts the total number of hits would skip it.
NOTE: Prior to 2.9, Lucene silently filtered out hits with score <= 0. As of 2.9, the core ICollectors no longer do that. It's very unusual to have such hits (a negative query boost, or function query returning negative custom scores, could cause it to happen). If you need that behavior, use PositiveScoresOnlyCollector.
Note
This API is experimental and might change in incompatible ways in the next release.
IFieldCache
Expert: Maintains caches of term values.
Created: May 19, 2004 11:13:14 AM
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
IMaxNonCompetitiveBoostAttribute
Add this IAttribute to a fresh AttributeSource before calling GetTermsEnum(Terms, AttributeSource). FuzzyQuery is using this to control its internal behaviour to only return competitive terms.
Please note: this attribute is intended to be added by the MultiTermQuery.RewriteMethod to an empty AttributeSource that is shared for all segments during query rewrite. This attribute source is passed to all segment enums on GetTermsEnum(Terms, AttributeSource). TopTermsRewrite<Q> uses this attribute to inform all enums about the current boost that is not competitive.
Note
This API is for internal purposes only and might change in incompatible ways in the next release.
ITopDocsCollector
LUCENENET specific interface used to reference TopDocsCollector<T> without referencing its generic type.
ReferenceManager.IRefreshListener
Use to receive notification when a refresh has finished. See AddListener(IRefreshListener).
SearcherLifetimeManager.IPruner
See Prune(IPruner).
Enums
Occur
Specifies how clauses are to occur in matching documents.
SortFieldType
Specifies the type of the terms to be sorted, or special types such as CUSTOM.