Namespace Lucene.Net.Benchmarks.ByTask.Feeds
Sources for benchmark inputs: documents and queries.
Classes
AbstractQueryMaker
Abstract base query maker. Each query maker should just implement the PrepareQueries() method.
ContentItemsSource
Base class for source of data for benchmarking.
ContentSource
Represents content from a specified source, such as TREC, Reuters etc. A ContentSource is responsible for creating DocData objects for its documents to be consumed by DocMaker. It also keeps track of various statistics, such as how many documents were generated, size in bytes etc.
For supported configuration parameters see ContentItemsSource.
DemoHTMLParser
Simple HTML Parser extracting title, meta tags, and body text that is based on NekoHTML.
DemoHTMLParser.Parser
The actual parser to read HTML documents.
DirContentSource
A ContentSource using the Dir collection for its input. Supports the following configuration parameters (on top of ContentSource):
- work.dir: specifies the working directory. Required if "docs.dir" denotes a relative path (default=work).
- docs.dir: specifies the directory of the Dir collection. Can be set to a relative path if "work.dir" is also specified (default=dir-out).
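As a rough illustration of the two parameters above, a benchmark algorithm (.alg) file might contain a fragment like the following. This is a sketch only: the `content.source` key and the assembly-qualified type name are assumptions about how the source is selected, not something this page states; the `work.dir` and `docs.dir` values are simply the documented defaults.

```properties
# Hypothetical .alg fragment for DirContentSource.
# Assumption: content.source selects the source by type name in this format.
content.source=Lucene.Net.Benchmarks.ByTask.Feeds.DirContentSource, Lucene.Net.Benchmark

# Working directory; required here because docs.dir is a relative path (default: work).
work.dir=work
# Directory of the Dir collection, relative to work.dir (default: dir-out).
docs.dir=dir-out
```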
DirContentSource.Enumerator
Iterator over the files in the directory.
DocData
Output of parsing (e.g. HTML parsing) of an input document.
DocMaker
Creates Lucene.Net.Documents.Document objects. Uses a ContentSource to generate DocData objects.
DocMaker.DocState
Document state, supports reuse of field instances across documents (see reuseFields parameter).
EnwikiContentSource
A ContentSource which reads the English Wikipedia dump. You can read the .bz2 file directly (it will be decompressed on the fly). Config properties:
- keep.image.only.docs: false|true (default true).
- docs.file: <path to the file>
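The two properties above could appear in an .alg file roughly as follows. The file path is a hypothetical placeholder, and the `content.source` line (key and type name format) is an assumption rather than something documented on this page.

```properties
# Hypothetical .alg fragment for EnwikiContentSource.
# Assumption: content.source selects the source by type name in this format.
content.source=Lucene.Net.Benchmarks.ByTask.Feeds.EnwikiContentSource, Lucene.Net.Benchmark

# Placeholder path to the Wikipedia dump; a .bz2 file is decompressed on the fly.
docs.file=data/enwiki-pages-articles.xml.bz2
# Override the documented default (true) for keep.image.only.docs.
keep.image.only.docs=false
```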
EnwikiQueryMaker
A QueryMaker that uses common and uncommon actual Wikipedia queries for searching the English Wikipedia collection. 90 queries total.
FacetSource
Source items for facets.
For supported configuration parameters see ContentItemsSource.
FileBasedQueryMaker
Creates queries from a FileStream, one per line, passing each through the QueryParser. Lines beginning with # are treated as comments.
GeonamesLineParser
A line parser for Geonames.org data. See 'geoname' table. Requires SpatialDocMaker.
HeaderLineParser
LineParser which sets field names and order by the header - any header - of the lines file. It is less efficient than SimpleLineParser but more powerful.
Int64ToEnglishContentSource
Creates documents whose content is a long number starting from MinValue + 10.
Int64ToEnglishQueryMaker
Creates queries whose content is a spelled-out long number starting from MinValue + 10.
LineDocSource
A ContentSource reading one line at a time as a Lucene.Net.Documents.Document from a single file. This saves IO cost (over DirContentSource) of recursing through a directory and opening a new file for every document.
LineParser
Reader of a single input line into DocData.
NoMoreDataException
Exception indicating there is no more data. Thrown by doc makers if doc.maker.forever is false and the doc sources of that maker were exhausted. This is useful for iterating over all documents of a source when the number of docs is not known in advance.
RandomFacetSource
Simple implementation of a random facet source.
ReutersContentSource
A ContentSource reading from the Reuters collection.
Config properties:
- work.dir: path to the root of docs and indexes dirs (default work).
- docs.dir: path to the docs dir (default reuters-out).
ReutersQueryMaker
An IQueryMaker that makes queries devised manually (by Grant Ingersoll) for searching in the Reuters collection.
SimpleLineParser
LineParser which ignores the header passed to its constructor and assumes simply that field names and their order are the same as in DEFAULT_FIELDS.
SimpleQueryMaker
An IQueryMaker that makes queries for a collection created using SingleDocSource.
SimpleSloppyPhraseQueryMaker
Creates sloppy phrase queries for performance tests, in an index created using a simple doc maker.
SingleDocSource
Creates the same document each time GetNextDocData(DocData) is called.
SortableSingleDocSource
Adds fields appropriate for sorting: country, random_string and sort_field (int). Supports the following parameters:
- sort.rng: defines the range for the sort-by-int field (default 20000).
- rand.seed: defines the seed to initialize Random with (default 13).
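For illustration, the two parameters above might be set in an .alg file like this. The values shown are the documented defaults; the `content.source` line (key and type name format) is an assumption and not taken from this page.

```properties
# Hypothetical .alg fragment for SortableSingleDocSource.
# Assumption: content.source selects the source by type name in this format.
content.source=Lucene.Net.Benchmarks.ByTask.Feeds.SortableSingleDocSource, Lucene.Net.Benchmark

# Range for the sort-by-int field (default: 20000).
sort.rng=20000
# Seed used to initialize Random, so runs are repeatable (default: 13).
rand.seed=13
```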
SpatialDocMaker
Indexes spatial data according to a configured Lucene.Net.Spatial.SpatialStrategy with optional shape transformation via a configured IShapeConverter. The converter can turn points into circles and bounding boxes, in order to vary the type of indexing performance tests. Unless it is subclassed to do otherwise, this class configures a Spatial4n.Context.SpatialContext, Lucene.Net.Spatial.Prefix.Tree.SpatialPrefixTree, and Lucene.Net.Spatial.Prefix.RecursivePrefixTreeStrategy. The Strategy is made available to a query maker via the static method GetSpatialStrategy(int). See spatial.alg for a listing of spatial parameters, in particular those starting with "spatial." and "doc.spatial".
SpatialFileQueryMaker
Reads spatial data from the body field of docs from an internally created LineDocSource. It's parsed by ReadShapeFromWkt(string) and then further manipulated via a configurable IShapeConverter. When using point data, it's likely you'll want to configure the shape converter so that the query shapes actually cover a region. The queries are all created and cached in advance. This query maker works in conjunction with SpatialDocMaker. See spatial.alg for a listing of options, in particular the options starting with "query.".
TrecContentSource
Implements a ContentSource over the TREC collection.
TrecDocParser
Parser for trec doc content, invoked on doc text excluding <DOC> and <DOCNO> which are handled in TrecContentSource. Required to be stateless and hence thread safe.
TrecFBISParser
Parser for the FBIS docs in trec disks 4+5 collection format.
TrecFR94Parser
Parser for the FR94 docs in trec disks 4+5 collection format.
TrecFTParser
Parser for the FT docs in trec disks 4+5 collection format.
TrecGov2Parser
Parser for the GOV2 collection format.
TrecLATimesParser
Parser for the LA Times docs in trec disks 4+5 collection format.
TrecParserByPath
Parser for trec docs which selects the parser to apply according to the source file's path, defaulting to TrecGov2Parser.
Interfaces
IHTMLParser
HTML Parsing Interface for test purposes.
IQueryMaker
Create queries for the test.
IShapeConverter
Converts one shape to another. Created by MakeShapeConverter(SpatialStrategy, Config, string).
Enums
TrecDocParser.ParsePathType
Types of trec parse paths.