Class TrecDocParser
Parser for TREC doc content, invoked on doc text excluding <DOC> and <DOCNO>, which are handled in TrecContentSource. Implementations are required to be stateless and hence thread safe.
Namespace: Lucene.Net.Benchmarks.ByTask.Feeds
Assembly: Lucene.Net.Benchmark.dll
Syntax
public abstract class TrecDocParser
Fields
DEFAULT_PATH_TYPE
TREC parser type used for unknown extensions.
Declaration
public static readonly TrecDocParser.ParsePathType DEFAULT_PATH_TYPE
Field Value
Type | Description |
---|---|
TrecDocParser.ParsePathType |
Methods
Extract(StringBuilder, string, string, int, string[])
Extract from buf the text of interest within specified tags.
Declaration
public static string Extract(StringBuilder buf, string startTag, string endTag, int maxPos, string[] noisePrefixes)
Parameters
Type | Name | Description |
---|---|---|
StringBuilder | buf | Entire input text. |
string | startTag | Tag marking start of text of interest. |
string | endTag | Tag marking end of text of interest. |
int | maxPos | If ≥ 0, sets a limit on the start of the text of interest. |
string[] | noisePrefixes | |
Returns
Type | Description |
---|---|
string | Text of interest or null if not found. |
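The tag lookup described for Extract can be illustrated with a minimal, self-contained sketch. It is not the Lucene.NET implementation: the class ExtractSketch and the method ExtractBetween are hypothetical names, the sketch assumes that maxPos limits where the start tag may begin and that surrounding whitespace is trimmed, and it leaves out noisePrefixes because this page does not describe how they are applied.

```csharp
using System;
using System.Text;

public static class ExtractSketch
{
    // Hypothetical stand-in for illustration only; not TrecDocParser.Extract itself.
    public static string ExtractBetween(StringBuilder buf, string startTag, string endTag, int maxPos)
    {
        string text = buf.ToString();
        int start = text.IndexOf(startTag, StringComparison.Ordinal);

        // The start tag must exist and, when maxPos >= 0, begin before maxPos (assumption).
        if (start < 0 || (maxPos >= 0 && start >= maxPos))
        {
            return null;
        }

        start += startTag.Length; // skip past the start tag itself
        int end = text.IndexOf(endTag, start, StringComparison.Ordinal);
        if (end < 0)
        {
            return null; // no matching end tag
        }

        // Trimming is an assumption of this sketch.
        return text.Substring(start, end - start).Trim();
    }
}
```

Called on a buffer holding `<TITLE> Hello TREC </TITLE><TEXT>body</TEXT>` with the tags `<TITLE>` and `</TITLE>` and a maxPos of -1, the sketch returns `Hello TREC`; with a maxPos of 0 it returns null because the start tag does not begin before the limit.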
Parse(DocData, string, TrecContentSource, StringBuilder, ParsePathType)
Parse the text prepared in docBuf into a result DocData. No synchronization is required.
Declaration
public abstract DocData Parse(DocData docData, string name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType)
Parameters
Type | Name | Description |
---|---|---|
DocData | docData | Reusable result. |
string | name | Name that should be set to the result. |
TrecContentSource | trecSrc | Calling TREC content source. |
StringBuilder | docBuf | Text to parse. |
TrecDocParser.ParsePathType | pathType | Type of parsed file, or UNKNOWN if unknown. Parsers may use it to alter their behavior according to the file path type. |
Returns
Type | Description |
---|---|
DocData |
PathType(FileInfo)
Compute the path type of a file by inspecting the name of the file and the names of its parent directories.
Declaration
public static TrecDocParser.ParsePathType PathType(FileInfo f)
Parameters
Type | Name | Description |
---|---|---|
FileInfo | f |
Returns
Type | Description |
---|---|
TrecDocParser.ParsePathType |
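The idea of inspecting a file name and then its parents can be sketched as below. This is an illustration under stated assumptions, not the library's code: the enum SketchPathType, its members, the directory names COLLA and COLLB, and the method Classify are all hypothetical, since this page does not list the members of ParsePathType or the names it recognizes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PathTypeSketch
{
    // Hypothetical path types; the real ParsePathType members are not listed on this page.
    public enum SketchPathType { Unknown, CollectionA, CollectionB }

    // Hypothetical name-to-type table, matched case-insensitively.
    private static readonly Dictionary<string, SketchPathType> NameToType =
        new Dictionary<string, SketchPathType>(StringComparer.OrdinalIgnoreCase)
        {
            ["COLLA"] = SketchPathType.CollectionA,
            ["COLLB"] = SketchPathType.CollectionB,
        };

    public static SketchPathType Classify(FileInfo f)
    {
        // Check the file name itself first.
        if (NameToType.TryGetValue(f.Name, out var type))
        {
            return type;
        }

        // Then walk up through the parent directories until one matches or the root is passed.
        for (DirectoryInfo dir = f.Directory; dir != null; dir = dir.Parent)
        {
            if (NameToType.TryGetValue(dir.Name, out type))
            {
                return type;
            }
        }

        // Fall back to a default, analogous in spirit to DEFAULT_PATH_TYPE.
        return SketchPathType.Unknown;
    }
}
```

FileInfo does not require the file to exist, so the sketch can be tried on any path string; a file located somewhere beneath a directory named collb would be classified as CollectionB.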
StripTags(string, int)
Strip tags from input.
Declaration
public static string StripTags(string buf, int start)
Parameters
Type | Name | Description |
---|---|---|
string | buf | |
int | start |
Returns
Type | Description |
---|---|
string |
See Also
StripTags(StringBuilder, int)
StripTags(StringBuilder, int)
Strip tags from buf: each tag is replaced by a single blank.
Declaration
public static string StripTags(StringBuilder buf, int start)
Parameters
Type | Name | Description |
---|---|---|
StringBuilder | buf | |
int | start |
Returns
Type | Description |
---|---|
string | Text obtained when stripping all tags from buf. |
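The rule that each tag is replaced by a single blank can be shown with a small, self-contained sketch. It is not the Lucene.NET implementation: StripTagsSketch and StripTagsFrom are hypothetical names, and the sketch assumes that start is the index from which tags are stripped, that text before start is kept unchanged, and that 0 ≤ start ≤ input length.

```csharp
using System.Text;

public static class StripTagsSketch
{
    // Hypothetical stand-in for illustration only; not TrecDocParser.StripTags itself.
    public static string StripTagsFrom(string input, int start)
    {
        var result = new StringBuilder(input.Length);
        result.Append(input, 0, start); // text before 'start' is copied unchanged (assumption)

        int pos = start;
        while (pos < input.Length)
        {
            int open = input.IndexOf('<', pos);
            int close = open < 0 ? -1 : input.IndexOf('>', open + 1);
            if (close < 0)
            {
                break; // no complete tag remains
            }

            result.Append(input, pos, open - pos); // text in front of the tag
            result.Append(' ');                    // the tag becomes a single blank
            pos = close + 1;
        }

        result.Append(input, pos, input.Length - pos); // remaining text after the last tag
        return result.ToString();
    }
}
```

With this sketch, the input `a<b>c` and a start of 0 gives `a c`, while `<x>a<b>c` with a start of 4 gives `<x>a c` because the first tag lies before the start index.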