Namespace Lucene.Net.Search
Code to search indices.
Table Of Contents
1. Search Basics
2. The Query Classes
3. Scoring: Introduction
4. Scoring: Basics
5. Changing the Scoring
6. Appendix: Search Algorithm
Search Basics
Lucene offers a wide variety of Query implementations, most of which are in this package, its subpackages (spans, payloads), or the queries module. These implementations can be combined in a wide variety of ways to provide complex querying capabilities along with information about where matches took place in the document collection. The Query Classes section below highlights some of the more important Query classes. For details on implementing your own Query class, see Custom Queries -- Expert Level below.
To perform a search, applications usually call #search(Query,int) or #search(Query,Filter,int).
Once a Query has been created and submitted to the Index
Query Classes
[TermQuery](xref:Lucene.Net.Search.TermQuery)
Of the various implementations of Query, the Term
[BooleanQuery](xref:Lucene.Net.Search.BooleanQuery)
Things start to get interesting when one combines multiple Term
1. SHOULD: Use this operator when a clause can occur in the result set, but is not required. If a query is made up of all SHOULD clauses, then every document in the result set matches at least one of these clauses.
2. MUST: Use this operator when a clause is required to occur in the result set. Every document in the result set will match all such clauses.
3. NOT: Use this operator when a clause must not occur in the result set. No document in the result set will match any such clauses.
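The operator semantics above can be sketched with a toy matcher. This is a minimal model for illustration only, not Lucene's implementation: documents are plain sets of terms, and the names `BooleanClauseSketch`, `Clause`, `Occur`, `terms`, and `matches` are hypothetical. It also assumes that a query with no required and no optional clauses matches nothing.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BooleanClauseSketch {
    // Toy operators; MUST_NOT plays the role of the NOT operator described above.
    enum Occur { SHOULD, MUST, MUST_NOT }

    // Hypothetical clause type: one term plus how it must occur.
    static final class Clause {
        final String term;
        final Occur occur;
        Clause(String term, Occur occur) { this.term = term; this.occur = occur; }
    }

    // Helper that models a document as the set of terms it contains.
    static Set<String> terms(String... ts) {
        return new HashSet<>(Arrays.asList(ts));
    }

    // A document matches when every MUST term is present, no MUST_NOT term is
    // present, and (if there are no MUST clauses) at least one SHOULD term is present.
    static boolean matches(Set<String> docTerms, List<Clause> clauses) {
        boolean hasMust = false;
        boolean anyShouldMatched = false;
        for (Clause c : clauses) {
            boolean present = docTerms.contains(c.term);
            if (c.occur == Occur.MUST) {
                hasMust = true;
                if (!present) return false;
            } else if (c.occur == Occur.MUST_NOT) {
                if (present) return false;
            } else if (present) {
                anyShouldMatched = true;
            }
        }
        return hasMust || anyShouldMatched;
    }

    public static void main(String[] args) {
        List<Clause> query = Arrays.asList(
            new Clause("lucene", Occur.MUST),
            new Clause("search", Occur.SHOULD),
            new Clause("spam", Occur.MUST_NOT));
        System.out.println(matches(terms("lucene", "index"), query)); // true
        System.out.println(matches(terms("lucene", "spam"), query));  // false
        System.out.println(matches(terms("search"), query));          // false
    }
}
```

In this sketch the optional clause never excludes a document once a required clause is present; it would only matter for ranking, which the toy model does not cover.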
Boolean queries are constructed by adding two or more Boolean
Phrases
Another common search is to find documents containing certain phrases. This is handled in three different ways:
Phrase
Multi
Span
[TermRangeQuery](xref:Lucene.Net.Search.TermRangeQuery)
The Term
[NumericRangeQuery](xref:Lucene.Net.Search.NumericRangeQuery)
The Numeric
[PrefixQuery](xref:Lucene.Net.Search.PrefixQuery),
[WildcardQuery](xref:Lucene.Net.Search.WildcardQuery),
[RegexpQuery](xref:Lucene.Net.Search.RegexpQuery)
While the PrefixsetAllowLeadingWildcard
method to remove that protection. The Regexp
[FuzzyQuery](xref:Lucene.Net.Search.FuzzyQuery)
A Fuzzy
Scoring — Introduction
Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user. In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms scores lower than a different document with only one of the query terms.
While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can help you figure out the what and why of Lucene scoring.
Lucene scoring supports a number of pluggable information retrieval models, including:
* Vector Space Model (VSM)
* Probabilistic Models such as Okapi BM25 and DFR
* Language models

These models can be plugged in via the Similarity API, and offer extension hooks and parameters for tuning. In general, Lucene first finds the documents that need to be scored based on boolean logic in the Query specification, and then ranks this subset of matching documents via the retrieval model. For some valuable references on VSM and IR in general, refer to Lucene Wiki IR references.
The rest of this document will cover Scoring basics and explain how to change your Similarity. Next, it will cover ways you can customize the Lucene internals in Custom Queries -- Expert Level, which gives details on implementing your own Query class and related functionality. Finally, we will finish up with some reference material in the Appendix.
Scoring — Basics
Scoring is very much dependent on the way documents are indexed, so it is important to understand indexing (see the Lucene overview before continuing with this section). Be sure to use the useful Doc) to understand how the score for a certain matching document was computed.
Generally, the Query determines which documents match (a binary decision), while the Similarity determines how to assign scores to the matching documents.
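The division of labor described above can be illustrated with a small toy model. It is a sketch under stated assumptions rather than Lucene code: `MatchThenScoreSketch`, `ToySimilarity`, `RAW_TF`, `LENGTH_NORMALIZED_TF`, `sampleDocs`, and `search` are all hypothetical names, the "query" is a single term, and the two scoring functions are simplified stand-ins for real retrieval models. Swapping the scoring function changes the ranking but never the set of matching documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class MatchThenScoreSketch {
    // Hypothetical stand-in for a Similarity: scores a document already known to match.
    interface ToySimilarity {
        double score(List<String> docTokens, String term);
    }

    // Score = number of occurrences of the term.
    static final ToySimilarity RAW_TF =
        (tokens, term) -> (double) Collections.frequency(tokens, term);

    // Score = occurrences divided by document length.
    static final ToySimilarity LENGTH_NORMALIZED_TF =
        (tokens, term) -> (double) Collections.frequency(tokens, term) / tokens.size();

    // Three tiny tokenized documents; document ids are their list positions.
    static List<List<String>> sampleDocs() {
        return Arrays.asList(
            Arrays.asList("lucene", "lucene", "search", "index", "query", "score"),
            Arrays.asList("lucene", "search"),
            Arrays.asList("index"));
    }

    // Step 1: a binary match decision per document. Step 2: score only the matches.
    // Returns matching document ids, best score first.
    static List<Integer> search(List<List<String>> docs, String term, ToySimilarity sim) {
        final double[] scores = new double[docs.size()];
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            List<String> tokens = docs.get(id);
            if (tokens.contains(term)) {
                scores[id] = sim.score(tokens, term);
                hits.add(id);
            }
        }
        hits.sort((a, b) -> Double.compare(scores[b], scores[a]));
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search(sampleDocs(), "lucene", RAW_TF));               // [0, 1]
        System.out.println(search(sampleDocs(), "lucene", LENGTH_NORMALIZED_TF)); // [1, 0]
    }
}
```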
Fields and Documents
In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (Tokenized, Stored, etc). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This is important because two Documents with the exact same content, but one having the content in two Fields and the other in one Field may return different scores for the same query due to length normalization.
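As a rough worked example of the length normalization effect, the sketch below assumes a norm of 1 / sqrt(number of terms in the field), the form used by classic TF-IDF style scoring. It ignores boosts, IDF, and the lossy encoding of stored norms, and the names `LengthNormSketch`, `lengthNorm`, and `fieldScore` are hypothetical.

```java
public class LengthNormSketch {
    // Assumed norm: 1 / sqrt(number of terms in the field).
    // Boosts and the lossy encoding of stored norms are ignored here.
    static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    // Toy per-field score: term frequency scaled by the field's length norm.
    static double fieldScore(int termFreq, int numTermsInField) {
        return termFreq * lengthNorm(numTermsInField);
    }

    public static void main(String[] args) {
        // Document A: all 8 terms in one "body" field, query term occurs once.
        double a = fieldScore(1, 8);
        // Document B: the same 8 terms split into a 2-term "title" and a 6-term "body";
        // the single occurrence of the query term sits in the 6-term "body" field.
        double b = fieldScore(1, 6);
        System.out.println(a < b); // true
    }
}
```

Under these assumptions the shorter field yields the larger norm, so a query against "body" scores Document B above Document A even though both hold the same text overall.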
Score Boosting
Lucene allows influencing search results by "boosting" at different times: * Index-time boost by calling Field.
Indexing time boosts are pre-processed for storage efficiency and written to storage for a field as follows:
* All boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied.
* The boost is then encoded into a normalization value by the Similarity object at index-time: Compute
Changing Scoring — Similarity
Changing Similarity is an easy way to influence scoring; this is done at index-time with Index
You can influence scoring by configuring a different built-in Similarity implementation, by tweaking its parameters, or by subclassing it to override behavior. Some implementations also offer a modular API which you can extend by plugging in a different component (e.g. term frequency normalizer).
Finally, you can extend the low level Similarity directly to implement a new retrieval model, or to use external scoring factors particular to your application. For example, a custom Similarity can access per-document values via Field
See the Lucene.
Custom Queries — Expert Level
Custom queries are an expert level task, so tread carefully and be prepared to share your code if you want help.
With the warning out of the way, it is possible to change a lot more than just the Similarity when it comes to matching and scoring in Lucene. Lucene's search is a complex mechanism that is grounded by the following main classes:
1. Query: The abstract object representation of the user's information need.
2. Weight: The internal interface representation of the user's Query, so that Query objects may be reused. This is global (across all segments of the index) and generally will require global statistics (such as docFreq for a given term across all segments).
3. Scorer: An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities. This is created per-segment.
4. Bulk
The Query Class
In some sense, the Query class is where it all begins. Without a Query, there would be nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it is often responsible for creating them or coordinating the functionality between them. The Query class has several methods that are important for derived classes:
1. Searcher): A Weight is the internal representation of the Query, so each Query implementation must provide an implementation of Weight. See the subsection on The Weight Interface below for details on implementing the Weight interface.
2. Reader): Rewrites queries into primitive queries. Primitive queries are: Term
The Weight Interface
The Weight interface provides an internal representation of the Query so that it can be reused. Any Index
3. [TopLevelBoost)](xref:Lucene.Net.Search.Weight#methods): Performs query normalization:
   * `topLevelBoost`: A query-boost factor from any wrapping queries that should be multiplied into every document's score. For example, a TermQuery that is wrapped within a BooleanQuery with a boost of `5` would receive this value at this time. This allows the TermQuery (the leaf node in this case) to compute this up-front a single time (e.g. by multiplying into the IDF), rather than for every document.
   * `norm`: Passes in a normalization factor which may allow for comparing scores between queries. Typically a weight such as TermWeight that scores via a Similarity will just defer to the Similarity's implementation: Sim
The Scorer Class
The Scorer abstract class provides common scoring functionality for all Scorer implementations and is the heart of the Lucene scoring process. The Scorer defines the following abstract (some of them are not yet abstract, but will be in future versions and should be considered as such now) methods which must be implemented (some of them inherited from Doc
The BulkScorer Class
The Bulk
Why would I want to add my own Query?
In a nutshell, you want to add your own custom Query implementation when you think that Lucene's aren't appropriate for the task that you want to do. You might be doing some cutting-edge research, or you might need more information back out of Lucene (similar to Doug adding SpanQuery functionality).
Appendix: Search Algorithm
This section is mostly notes on stepping through the Scoring process and serves as fertilizer for the earlier sections.
In the typical search application, a Query is passed to the Index
Once inside the IndexSearcher, a Collector is used for the scoring and sorting of the search results. These important objects are involved in a search:
1. The Weight object of the Query. The Weight object is an internal representation of the Query that allows the Query to be reused by the IndexSearcher.
2. The IndexSearcher that initiated the call.
3. A Filter for limiting the result set. Note, the Filter may be null.
4. A Sort object for specifying how to sort the results if the standard score-based sort method is not desired.
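To make the role of a score-based collector more concrete, here is a minimal sketch of keeping the best N hits with a bounded min-heap. It is a toy analogue for illustration, not the code of any Lucene collector: `TopNCollectorSketch`, `Hit`, `WORST_FIRST`, `collect`, and `topDocs` are hypothetical names, and the tie-break toward lower document ids is a choice made for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNCollectorSketch {
    // Hypothetical hit record: a document id and its score.
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Orders hits from worst to best: a lower score is worse; on equal scores
    // the higher document id is treated as worse (a choice made for this sketch).
    static final Comparator<Hit> WORST_FIRST = (a, b) ->
        a.score != b.score ? Float.compare(a.score, b.score) : Integer.compare(b.doc, a.doc);

    private final int n;
    private final PriorityQueue<Hit> heap = new PriorityQueue<>(WORST_FIRST);

    TopNCollectorSketch(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be at least 1");
        this.n = n;
    }

    // Called once per matching document; keeps at most n hits in the heap.
    void collect(int doc, float score) {
        Hit hit = new Hit(doc, score);
        if (heap.size() < n) {
            heap.add(hit);
        } else if (WORST_FIRST.compare(hit, heap.peek()) > 0) {
            heap.poll(); // evict the current worst hit
            heap.add(hit);
        }
    }

    // Returns the collected document ids, best hit first.
    List<Integer> topDocs() {
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort(WORST_FIRST.reversed());
        List<Integer> ids = new ArrayList<>();
        for (Hit h : hits) ids.add(h.doc);
        return ids;
    }

    public static void main(String[] args) {
        TopNCollectorSketch collector = new TopNCollectorSketch(2);
        collector.collect(0, 1.0f);
        collector.collect(1, 3.0f);
        collector.collect(2, 2.0f);
        collector.collect(3, 3.0f);
        System.out.println(collector.topDocs()); // [1, 3]
    }
}
```

The heap keeps memory proportional to N rather than to the number of matching documents, which is the main reason for this design in the sketch.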
Assuming we are not sorting (since sorting doesn't affect the raw Lucene score), we call one of the search methods of the IndexSearcher, passing in the Weight object created by Index
If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, we ask the Weight for a Scorer for each Index
At last, we are actually going to score some documents. The score method takes in the Collector (most likely the TopScoreDocCollector or TopFieldCollector) and does its business. Of course, here is where things get involved. The Scorer that is returned by the Weight object depends on what type of Query was submitted. In most real world applications with multiple query terms, the Scorer is going to be a BooleanScorer2
created from Boolean
Assuming a BooleanScorer2, we first initialize the Coordinator, which is used to apply the coord() factor. We then get an internal Scorer based on the required, optional and prohibited parts of the query. Using this internal Scorer, the BooleanScorer2 then proceeds into a while loop based on the Scorer.
Classes
AssertingBulkOutOfOrderScorer
A crazy
AssertingBulkScorer
Wraps a
AssertingCollector
Wraps another
AssertingIndexSearcher
Helper class that adds some extra checks to ensure correct usage of
AssertingQuery
Assertion-enabled query.
AssertingScorer
Wraps a
CheckHits
Utility class for asserting expected hits in tests.
ExplanationAsserter
Asserts that the score explanation for every document matching a query corresponds with the true score.
NOTE: this HitCollector should only be used with the
ExplanationAssertingSearcher
An
FCInvisibleMultiReader
This is a
QueryUtils
Utility class for sanity-checking queries.
RandomSimilarityProvider
Similarity implementation that randomizes Similarity implementations per-field.
The choices are 'sticky', so the selected algorithm is always used for the same field.
SearchEquivalenceTestBase
Simple base class for checking search equivalence.
Extend it, and write tests that create Random
SearcherExpiredException
Thrown when the lease for a searcher has expired.
SetCollector
Just collects document ids into a set.
ShardSearchingTestBase
Base test class for simulating distributed search across multiple shards.
ShardSearchingTestBase.NodeState
ShardSearchingTestBase.NodeState.ShardIndexSearcher
Matches docs in the local shard but scores based on aggregated stats ("mock distributed scoring") from all nodes.
ShardSearchingTestBase.SearcherAndVersion
An