Lucene.Net
Apache Lucene is a high-performance, full-featured text search engine library. Here is a simple example of how to use Lucene for indexing and searching (using JUnit to check that the results are what we expect):
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
//Directory directory = FSDirectory.open(new File("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
// Now search the index:
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
assertEquals(1, hits.length);
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}
ireader.close();
The Lucene API is divided into several packages:
Lucene.Net.Analysis defines an abstract Analyzer API for converting text from a java.io.Reader into a TokenStream, an enumeration of token Attributes. A TokenStream can be composed by applying TokenFilters to the output of a Tokenizer. Tokenizers and TokenFilters are strung together and applied with an Analyzer. analyzers-common provides a number of Analyzer implementations, including StopAnalyzer and the grammar-based StandardAnalyzer.
Lucene.Net.Codecs provides an abstraction over the encoding and decoding of the inverted index structure, as well as different implementations that can be chosen depending upon application needs.
Lucene.Net.Documents provides a simple Document class. A Document is simply a set of named Fields, whose values may be strings or instances of java.io.Reader.
Lucene.Net.Index provides two primary classes: IndexWriter, which creates and adds documents to indices; and IndexReader, which accesses the data in the index.
Lucene.Net.Search provides data structures to represent queries (e.g. TermQuery for individual words, PhraseQuery for phrases, and BooleanQuery for boolean combinations of queries) and the IndexSearcher, which turns queries into TopDocs. A number of QueryParsers are provided for producing query structures from strings or XML.
Lucene.Net.Store defines an abstract class for storing persistent data, the Directory, which is a collection of named files written by an IndexOutput and read by an IndexInput. Multiple implementations are provided, including FSDirectory, which uses a file system directory to store files, and RAMDirectory, which implements files as memory-resident data structures.
Lucene.Net.Util contains a few handy data structures and utility classes, e.g. OpenBitSet and PriorityQueue.
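The way a Tokenizer and a chain of TokenFilters combine into an Analyzer can be sketched in plain Java without any Lucene dependency. This is a toy illustration only, not Lucene's API: the class ToyAnalysis, its methods, and the hand-picked stop-word set are all hypothetical names introduced for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis chain. Hypothetical names; this is NOT Lucene's Analyzer API. */
final class ToyAnalysis {

    /** Plays the role of a Tokenizer: splits text on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    /** Plays the role of a TokenFilter: lowercases every token. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Plays the role of a TokenFilter: drops tokens found in the stop-word set. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    /** Plays the role of an Analyzer: one tokenizer followed by two filters. */
    static List<String> analyze(String text) {
        // Hand-picked stop words for this sketch (an assumption, not Lucene's list).
        Set<String> stopWords = new HashSet<>(Arrays.asList("this", "is", "the", "to", "be"));
        return removeStopWords(lowerCase(tokenize(text)), stopWords);
    }

    public static void main(String[] args) {
        // prints [text, indexed]
        System.out.println(analyze("This is the text to be indexed."));
    }
}
```

Each stage consumes the output of the previous one, which mirrors how filters wrap a tokenizer; a real Analyzer streams tokens lazily rather than building intermediate lists as this sketch does.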
To use Lucene, an application should:
Create an IndexWriter and add documents to it with AddDocument;
Call QueryParser.parse() to build a query from a string; and
Create an IndexSearcher and pass the query to its Search method.
Some simple examples of code that does this are:
IndexFiles.java creates an index for all the files contained in a directory.
SearchFiles.java prompts for queries and searches an index.
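To make the write-then-search workflow described above more concrete, the following is a minimal sketch of an in-memory inverted index in plain Java. It is a toy under stated assumptions and does not reflect how Lucene stores or scores anything: the class ToyIndex and its methods are hypothetical, queries are limited to a single term, and results are returned in document-id order with no ranking.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index. Hypothetical names; this is NOT Lucene's implementation. */
final class ToyIndex {
    // Stored document text, addressed by document id (its position in the list).
    private final List<String> storedDocs = new ArrayList<>();
    // Term -> sorted ids of the documents that contain the term.
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Indexing step: stores the text and records each lowercased term. Returns the new id. */
    int addDocument(String text) {
        int docId = storedDocs.size();
        storedDocs.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Search step: looks up a single term and returns matching ids in ascending order. */
    List<Integer> search(String queryTerm) {
        SortedSet<Integer> hits = postings.get(queryTerm.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return Collections.<Integer>emptyList();
        }
        return new ArrayList<>(hits);
    }

    /** Retrieves the stored text for a document id. */
    String doc(int docId) {
        return storedDocs.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("This is the text to be indexed.");
        index.addDocument("Another document.");
        for (int docId : index.search("text")) {
            // prints 0: This is the text to be indexed.
            System.out.println(docId + ": " + index.doc(docId));
        }
    }
}
```

The sketch shows why documents must be added before a search can find them: a lookup only consults the term-to-document map built during indexing.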