Lucene.Net
3.0.3
Lucene.Net is a port of the Lucene search engine library, written in C# and targeted at .NET runtime users.
High-performance single-document main memory Apache Lucene fulltext search index.
Public Member Functions
MemoryIndex ()
void AddField (String fieldName, String text, Analyzer analyzer)
TokenStream CreateKeywordTokenStream< T > (ICollection< T > keywords)
void AddField (String fieldName, TokenStream stream)
void AddField (String fieldName, TokenStream stream, float boost)
IndexSearcher CreateSearcher ()
float Search (Query query)
int GetMemorySize ()
override String ToString ()
This class is a replacement/substitute for a large subset of RAMDirectory functionality. It is designed to enable maximum efficiency for on-the-fly matchmaking combining structured and fuzzy fulltext search in realtime streaming applications such as Nux XQuery-based XML message queues, publish-subscribe systems for blogs/newsfeeds, text chat, data acquisition and distribution systems, application-level routers, firewalls, classifiers, etc. Rather than targeting fulltext search of infrequent queries over huge persistent data archives (historic search), this class targets fulltext search of huge numbers of queries over comparatively small transient realtime data (prospective search), for example as in:
float score = search(String text, Query query)
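The prospective-search pattern described above can be sketched without Lucene: one small incoming text is tokenized once and then tested against many standing queries. The following plain-Java sketch is only an illustration of that idea, not how MemoryIndex works internally. The class `ProspectiveSearchSketch`, the method `matchingSubscribers`, and the rule that a query is simply a set of required terms are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of prospective search: many standing queries, one transient text.
final class ProspectiveSearchSketch {

    // Returns the names of all subscriptions whose required terms all occur in the text.
    static List<String> matchingSubscribers(String text, Map<String, Set<String>> subscriptions) {
        // Tokenize the incoming text once: lower-case, then split on non-letter/non-digit runs.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        Set<String> terms = new HashSet<>(Arrays.asList(tokens));

        // Evaluate every standing query against the same small term set.
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> subscription : subscriptions.entrySet()) {
            if (terms.containsAll(subscription.getValue())) {
                matches.add(subscription.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> subs = new LinkedHashMap<>();
        subs.put("anglers", new HashSet<>(Arrays.asList("alaska", "fishing")));
        subs.put("cooks", new HashSet<>(Arrays.asList("salmon", "recipe")));
        System.out.println(matchingSubscribers("Readings about Alaska fishing manuals", subs));
    }
}
```

The cost of tokenizing is paid once per incoming text, while each additional query only adds a cheap lookup against the already built term set.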
Each instance can hold at most one Lucene "document", with a document containing zero or more "fields", each field having a name and a fulltext value. The fulltext value is tokenized (split and transformed) into zero or more index terms (aka words) on AddField(), according to the policy implemented by an Analyzer. For example, Lucene analyzers can split on whitespace, normalize to lower case for case insensitivity, ignore common terms with little discriminatory value such as "he", "in", "and" (stop words), reduce the terms to their natural linguistic root form such as "fishing" being reduced to "fish" (stemming), resolve synonyms/inflexions/thesauri (upon indexing and/or querying), etc. For details, see Lucene Analyzer Intro.
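The analysis steps listed above (splitting, lower-casing, stop-word removal, stemming) can be illustrated with a toy tokenizer in plain Java. This is a rough sketch and not a Lucene Analyzer: the class `ToyAnalyzer`, its three-word stop list, and the crude rule of stripping a trailing "ing" are assumptions chosen only to mirror the examples in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy analyzer; it only mimics the kinds of steps a Lucene Analyzer may perform.
final class ToyAnalyzer {

    // Illustrative stop-word list taken from the examples in the text, not Lucene's list.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("he", "in", "and"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {          // split on whitespace
            String term = raw.toLowerCase(Locale.ROOT);          // normalize to lower case
            if (term.isEmpty() || STOP_WORDS.contains(term)) {   // skip empty tokens and stop words
                continue;
            }
            // Crude stand-in for stemming: strip a trailing "ing" from longer words.
            if (term.length() > 5 && term.endsWith("ing")) {
                term = term.substring(0, term.length() - 3);
            }
            terms.add(term);
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [went, fish, alaska, canada]
        System.out.println(analyze("He went Fishing in Alaska and Canada"));
    }
}
```

Real analyzers use proper stemming algorithms and configurable stop lists; the point here is only that the index sees the transformed terms, never the raw input.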
Arbitrary Lucene queries can be run against this class - see Lucene Query Syntax as well as Query Parser Rules. Note that a Lucene query selects on the field names and associated (indexed) tokenized terms, not on the original fulltext(s) - the latter are not stored but rather thrown away immediately after tokenization.
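To make the last point concrete, here is a minimal plain-Java sketch of a single-document index that keeps only field names and indexed terms and discards the original text. It is an illustration under stated assumptions, not the MemoryIndex data structure: `SingleDocIndexSketch`, its `addField` and `termFrequency` methods, and the lower-casing tokenizer are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical single-document index: only field names and term counts are kept.
final class SingleDocIndexSketch {

    // field name -> (indexed term -> number of occurrences); the original text is never stored.
    private final Map<String, Map<String, Integer>> fields = new HashMap<>();

    void addField(String fieldName, String text) {
        Map<String, Integer> terms = fields.computeIfAbsent(fieldName, k -> new HashMap<>());
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                terms.merge(token, 1, Integer::sum);
            }
        }
        // 'text' goes out of scope here; only the tokenized terms survive.
    }

    // A term lookup selects on a field name plus an indexed term.
    int termFrequency(String fieldName, String term) {
        Map<String, Integer> terms = fields.get(fieldName);
        if (terms == null) {
            return 0;
        }
        Integer count = terms.get(term);
        return count == null ? 0 : count;
    }

    public static void main(String[] args) {
        SingleDocIndexSketch index = new SingleDocIndexSketch();
        index.addField("author", "Tales of James");
        System.out.println(index.termFrequency("author", "james"));   // indexed term
        System.out.println(index.termFrequency("author", "James"));   // original casing was not kept
        System.out.println(index.termFrequency("content", "james"));  // different field
    }
}
```

Because only the analyzed terms are retained, a lookup for the original spelling "James" finds nothing in this sketch, while the indexed form "james" matches.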
For some interesting background information on search technology, see Bob Wyman's Prospective Search, Jim Gray's A Call to Arms - Custom subscriptions, and Tim Bray's On Search, the Series.
Analyzer analyzer = PatternAnalyzer.DEFAULT_ANALYZER;
//Analyzer analyzer = new SimpleAnalyzer();
MemoryIndex index = new MemoryIndex();
index.addField("content", "Readings about Salmons and other select Alaska fishing Manuals", analyzer);
index.addField("author", "Tales of James", analyzer);
QueryParser parser = new QueryParser("content", analyzer);
float score = index.search(parser.parse("+author:james +salmon~ +fish* manual~"));
if (score > 0.0f) {
    System.out.println("it's a match");
} else {
    System.out.println("no match found");
}
System.out.println("indexData=" + index.toString());
(: An XQuery that finds all books authored by James that have something to do with "salmon fishing manuals", sorted by relevance :)
declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
declare variable $query := "+salmon~ +fish* manual~"; (: any arbitrary Lucene query can go here :)

for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book
An instance can be queried multiple times with the same or different queries, but an instance is not thread-safe. If desired, use idioms such as:
MemoryIndex index = ...
synchronized (index) {
    // read and/or write index (i.e. add fields and/or query)
}
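The locking idiom above can be tried out with any non-thread-safe object. The sketch below uses a plain `ArrayList` as a stand-in for the index and has several threads write to it under the same monitor. The class `SynchronizedAccessSketch` and the method `addConcurrently` are hypothetical names for this illustration; nothing here touches the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a non-thread-safe index that is guarded by external locking.
final class SynchronizedAccessSketch {

    // Starts the given number of threads, each adding entries under the shared lock,
    // and returns the final number of entries.
    static int addConcurrently(int threads, int addsPerThread) {
        final List<String> index = new ArrayList<>(); // not thread-safe on its own
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < addsPerThread; i++) {
                    synchronized (index) {            // every read or write takes the same lock
                        index.add("term" + i);
                    }
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        synchronized (index) {
            return index.size();
        }
    }

    public static void main(String[] args) {
        System.out.println(addConcurrently(4, 1000)); // prints 4000
    }
}
```

The important detail is that readers and writers all synchronize on the same object. In C#, the corresponding construct is the `lock` statement.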
Internally there's a new data structure geared towards efficient indexing and searching, plus the necessary support code to seamlessly plug into the Lucene framework.
This class performs very well for very small texts (e.g. 10 chars) as well as for large texts (e.g. 10 MB) and everything in between. Typically, it is about 10-100 times faster than RAMDirectory. Note that RAMDirectory has particularly large efficiency overheads for small to medium sized texts, both in time and space. Indexing a field with N tokens takes O(N) in the best case, and O(N logN) in the worst case. Memory consumption is probably larger than for RAMDirectory.
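One plausible way bounds like O(N) and O(N logN) can arise is sketched below in plain Java: tokens are first collected in a hash map (one expected constant-time update per token), and the distinct terms are sorted only when an ordered view is needed. Whether MemoryIndex is organized exactly this way is an assumption; the class `IndexingCostSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of where linear and N log N indexing costs can come from.
final class IndexingCostSketch {

    // Expected O(N): one hash-map update per token, recording each token's position.
    static Map<String, List<Integer>> collectPositions(List<String> tokens) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String token : tokens) {
            positions.computeIfAbsent(token, k -> new ArrayList<>()).add(pos++);
        }
        return positions;
    }

    // O(K log K) for K distinct terms; K equals N when every token is unique.
    static List<String> sortedTerms(Map<String, List<Integer>> positions) {
        List<String> terms = new ArrayList<>(positions.keySet());
        Collections.sort(terms);
        return terms;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        Collections.addAll(tokens, "salmon", "fish", "salmon", "alaska");
        Map<String, List<Integer>> positions = collectPositions(tokens);
        System.out.println(sortedTerms(positions)); // prints [alaska, fish, salmon]
        System.out.println(positions.get("salmon")); // prints [0, 2]
    }
}
```

In this sketch a text with many repeated words stays close to the linear case, while a text of all-distinct tokens pays the full sorting cost.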
Example throughput of many simple term queries over a single MemoryIndex: ~500000 queries/sec on a MacBook Pro, jdk 1.5.0_06, server VM. As always, your mileage may vary.
If you're curious about the whereabouts of bottlenecks, run Java 1.5 with the non-perturbing '-server -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and correlate its hotspot trailer with its call stack headers (see hprof tracing).
Definition at line 30 of file EmptyCollector.cs.
Lucene.Net.Index.Memory.MemoryIndex.MemoryIndex ()
Definition at line 177 of file MemoryIndex.cs.
void Lucene.Net.Index.Memory.MemoryIndex.AddField (String fieldName, String text, Analyzer analyzer)
Definition at line 216 of file MemoryIndex.cs.
void Lucene.Net.Index.Memory.MemoryIndex.AddField (String fieldName, TokenStream stream)
Definition at line 259 of file MemoryIndex.cs.
void Lucene.Net.Index.Memory.MemoryIndex.AddField (String fieldName, TokenStream stream, float boost)
Definition at line 279 of file MemoryIndex.cs.
TokenStream Lucene.Net.Index.Memory.MemoryIndex.CreateKeywordTokenStream< T > (ICollection< T > keywords)
Definition at line 242 of file MemoryIndex.cs.
IndexSearcher Lucene.Net.Index.Memory.MemoryIndex.CreateSearcher ()
Definition at line 364 of file MemoryIndex.cs.
int Lucene.Net.Index.Memory.MemoryIndex.GetMemorySize ()
Definition at line 428 of file MemoryIndex.cs.
float Lucene.Net.Index.Memory.MemoryIndex.Search (Query query)
Definition at line 384 of file MemoryIndex.cs.
override String Lucene.Net.Index.Memory.MemoryIndex.ToString ()
Definition at line 494 of file MemoryIndex.cs.