Class TrecDocParser
Parser for TREC document content, invoked on the document text excluding <DOC> and <DOCNO>, which are handled in TrecContentSource. Implementations are required to be stateless and hence thread-safe.
Namespace: Lucene.Net.Benchmarks.ByTask.Feeds
Assembly: Lucene.Net.Benchmark.dll
Syntax
public abstract class TrecDocParser
Fields
DEFAULT_PATH_TYPE
TREC parser type used for unknown extensions.
Declaration
public static readonly TrecDocParser.ParsePathType DEFAULT_PATH_TYPE
Field Value
| Type | Description |
|---|---|
| TrecDocParser.ParsePathType | |
Methods
Extract(StringBuilder, string, string, int, string[])
Extract from buf the text of interest within specified tags.
Declaration
public static string Extract(StringBuilder buf, string startTag, string endTag, int maxPos, string[] noisePrefixes)
Parameters
| Type | Name | Description |
|---|---|---|
| StringBuilder | buf | Entire input text. |
| string | startTag | Tag marking start of text of interest. |
| string | endTag | Tag marking end of text of interest. |
| int | maxPos | if ≥ 0 sets a limit on start of text of interest. |
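| string[] | noisePrefixes | Optional prefixes treated as noise: when one occurs inside the extracted span, the text of interest starts after it. May be null. |
Returns
| Type | Description |
|---|---|
| string | Text of interest or null if not found. |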
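As a rough illustration of how Extract might be called, the sketch below pulls the text between a pair of tags out of a buffer. The wrapper class ExtractExample, the method TitleOf, and the <TITLE> tag are assumptions made for this example; only the Extract signature is taken from the declaration above.

```csharp
using System.Text;
using Lucene.Net.Benchmarks.ByTask.Feeds;

// Hypothetical helper for illustration; not part of Lucene.NET.
public static class ExtractExample
{
    // Returns the text between <TITLE> and </TITLE>, or null if not found.
    public static string TitleOf(string rawDoc)
    {
        var buf = new StringBuilder(rawDoc);
        // maxPos < 0: no limit on where the text of interest may start.
        // noisePrefixes == null: no noise prefixes to skip (assumed to be accepted).
        return TrecDocParser.Extract(buf, "<TITLE>", "</TITLE>", -1, null);
    }
}
```

Because Extract is static and takes all of its input as arguments, a helper like this stays as stateless as the parser itself.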
Parse(DocData, string, TrecContentSource, StringBuilder, ParsePathType)
Parse the text prepared in docBuf into a result DocData. No synchronization is required.
Declaration
public abstract DocData Parse(DocData docData, string name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType)
Parameters
| Type | Name | Description |
|---|---|---|
| DocData | docData | Reusable result. |
| string | name | Name that should be set to the result. |
| TrecContentSource | trecSrc | Calling trec content source. |
| StringBuilder | docBuf | Text to parse. |
| TrecDocParser.ParsePathType | pathType | Type of parsed file, or UNKNOWN if unknown - may be used by parsers to alter their behavior according to the file path type. |
Returns
| Type | Description |
|---|---|
| DocData | |
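A minimal sketch of what a concrete parser could look like, under some assumptions: the class name PlainTextTrecParser is hypothetical, and the DocData members Name and Body are assumed to be settable properties (they are not documented on this page). The point is only to show a stateless override that fills the reusable result and returns it.

```csharp
using System.Text;
using Lucene.Net.Benchmarks.ByTask.Feeds;

// Hypothetical parser, not part of Lucene.NET. It keeps no instance fields,
// so one instance can be shared across threads without synchronization.
public sealed class PlainTextTrecParser : TrecDocParser
{
    public override DocData Parse(DocData docData, string name, TrecContentSource trecSrc,
        StringBuilder docBuf, TrecDocParser.ParsePathType pathType)
    {
        // Assumption: DocData exposes settable Name and Body properties.
        docData.Name = name;
        // Use the whole buffer, with tags stripped, as the body.
        docData.Body = StripTags(docBuf, 0);
        return docData;
    }
}
```

This sketch ignores trecSrc and pathType; a real parser may use pathType to adapt to the layout of a particular collection.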
PathType(FileInfo)
Compute the path type of a file by inspecting name of file and its parents.
Declaration
public static TrecDocParser.ParsePathType PathType(FileInfo f)
Parameters
| Type | Name | Description |
|---|---|---|
| FileInfo | f | |
Returns
| Type | Description |
|---|---|
| TrecDocParser.ParsePathType | |
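For orientation, a call to PathType might look like the sketch below. The class PathTypeExample and the method Classify are hypothetical names introduced for this example, and any path passed in is up to the caller.

```csharp
using System.IO;
using Lucene.Net.Benchmarks.ByTask.Feeds;

// Hypothetical helper for illustration; not part of Lucene.NET.
public static class PathTypeExample
{
    public static TrecDocParser.ParsePathType Classify(string path)
    {
        // Constructing a FileInfo does not require the file to exist.
        // Per the description above, the type is computed from the names
        // of the file and its parents.
        return TrecDocParser.PathType(new FileInfo(path));
    }
}
```

The returned value can then be passed as the pathType argument of Parse. Presumably DEFAULT_PATH_TYPE is what comes back when no name matches, but that is an inference from the field's description rather than something this page states.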
StripTags(string, int)
Strip tags from input.
Declaration
public static string StripTags(string buf, int start)
Parameters
| Type | Name | Description |
|---|---|---|
| string | buf | |
| int | start | |
Returns
| Type | Description |
|---|---|
| string | |
See Also
StripTags(StringBuilder, int)
Strip tags from buf: each tag is replaced by a single blank.
Declaration
public static string StripTags(StringBuilder buf, int start)
Parameters
| Type | Name | Description |
|---|---|---|
| StringBuilder | buf | |
| int | start | |
Returns
| Type | Description |
|---|---|
| string | Text obtained when stripping all tags from buf. |
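The following sketch shows one way the StringBuilder overload could be used to turn markup into plain text. StripTagsExample and PlainText are hypothetical names for this example; only the StripTags signature is taken from the declaration above.

```csharp
using System.Text;
using Lucene.Net.Benchmarks.ByTask.Feeds;

// Hypothetical helper for illustration; not part of Lucene.NET.
public static class StripTagsExample
{
    public static string PlainText(string markup)
    {
        // start = 0: strip tags from the beginning of the buffer.
        // Per the description above, each tag is replaced by a single blank,
        // so the result may contain extra whitespace where tags used to be.
        return TrecDocParser.StripTags(new StringBuilder(markup), 0);
    }
}
```

Callers that need compact text may want to trim or collapse whitespace afterwards, since a blank is left in place of every tag.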