
    Class TrecDocParser

    Parser for TREC document content. It is invoked on the document text excluding the <DOC> and <DOCNO> tags, which are handled in TrecContentSource. Implementations are required to be stateless and hence thread-safe.

    Inheritance
    object
    TrecDocParser
    TrecFBISParser
    TrecFR94Parser
    TrecFTParser
    TrecGov2Parser
    TrecLATimesParser
    TrecParserByPath
    Inherited Members
    object.Equals(object)
    object.Equals(object, object)
    object.GetHashCode()
    object.GetType()
    object.MemberwiseClone()
    object.ReferenceEquals(object, object)
    object.ToString()
    Namespace: Lucene.Net.Benchmarks.ByTask.Feeds
    Assembly: Lucene.Net.Benchmark.dll
    Syntax
    public abstract class TrecDocParser

    Fields

    DEFAULT_PATH_TYPE

    TREC parser type used for unknown extensions.

    Declaration
    public static readonly TrecDocParser.ParsePathType DEFAULT_PATH_TYPE
    Field Value
    Type Description
    TrecDocParser.ParsePathType

    Methods

    Extract(StringBuilder, string, string, int, string[])

    Extract from buf the text of interest within specified tags.

    Declaration
    public static string Extract(StringBuilder buf, string startTag, string endTag, int maxPos, string[] noisePrefixes)
    Parameters
    Type Name Description
    StringBuilder buf

    Entire input text.

    string startTag

    Tag marking start of text of interest.

    string endTag

    Tag marking end of text of interest.

    int maxPos

    If ≥ 0, sets a limit on the start of the text of interest.

    string[] noisePrefixes

    Returns
    Type Description
    string

    Text of interest or null if not found.
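
    The documented contract of Extract can be illustrated with a minimal standalone sketch. It is not the library's implementation: the class name ExtractSketch is hypothetical, the noisePrefixes parameter is left out because this page does not describe it, and both the trimming of the result and the reading of maxPos as a bound on the start tag position are assumptions.

```csharp
using System;
using System.Text;

public static class ExtractSketch
{
    // Illustrative stand-in for TrecDocParser.Extract (assumed behavior, not the library's code).
    public static string Extract(StringBuilder buf, string startTag, string endTag, int maxPos)
    {
        string text = buf.ToString();
        int start = text.IndexOf(startTag, StringComparison.Ordinal);

        // The start tag must exist and, when maxPos >= 0, begin before maxPos (assumption).
        if (start < 0 || (maxPos >= 0 && start >= maxPos))
        {
            return null;
        }

        start += startTag.Length;
        int end = text.IndexOf(endTag, start, StringComparison.Ordinal);
        if (end < 0)
        {
            return null; // no closing tag after the start tag
        }

        // Trimming surrounding whitespace is an assumption of this sketch.
        return text.Substring(start, end - start).Trim();
    }
}
```

    With the input "<DATE>1994-01-02</DATE><HEADLINE> Markets rally </HEADLINE>", calling the sketch with "<HEADLINE>", "</HEADLINE>" and a maxPos of -1 yields "Markets rally", while a maxPos of 5 yields null because the start tag begins at position 23.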

    Parse(DocData, string, TrecContentSource, StringBuilder, ParsePathType)

    Parse the text prepared in docBuf into a result DocData. No synchronization is required.

    Declaration
    public abstract DocData Parse(DocData docData, string name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType)
    Parameters
    Type Name Description
    DocData docData

    Reusable result.

    string name

    Name that should be set to the result.

    TrecContentSource trecSrc

    Calling trec content source.

    StringBuilder docBuf

    Text to parse.

    TrecDocParser.ParsePathType pathType

    Type of the parsed file, or UNKNOWN if unknown. Parsers may use it to alter their behavior according to the file path type.

    Returns
    Type Description
    DocData

    PathType(FileInfo)

    Compute the path type of a file by inspecting the name of the file and of its parent directories.

    Declaration
    public static TrecDocParser.ParsePathType PathType(FileInfo f)
    Parameters
    Type Name Description
    FileInfo f
    Returns
    Type Description
    TrecDocParser.ParsePathType
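
    The idea of classifying a file by its own name and the names of its parent directories can be sketched as follows. This is a hedged illustration only: the enum CollectionType, its member names, the name-to-type table and the method Classify are all hypothetical stand-ins, and any limit the library may place on how many parent levels it inspects is not modeled.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PathTypeSketch
{
    // Hypothetical stand-in for TrecDocParser.ParsePathType; member names are assumptions.
    public enum CollectionType { Unknown, Gov2, Fbis, Ft, Fr94, LaTimes }

    // Assumed mapping from a file or directory name to a collection type (case-insensitive).
    private static readonly Dictionary<string, CollectionType> ByName =
        new Dictionary<string, CollectionType>(StringComparer.OrdinalIgnoreCase)
        {
            ["gov2"] = CollectionType.Gov2,
            ["fbis"] = CollectionType.Fbis,
            ["ft"] = CollectionType.Ft,
            ["fr94"] = CollectionType.Fr94,
            ["latimes"] = CollectionType.LaTimes,
        };

    public static CollectionType Classify(FileInfo f)
    {
        // Check the file name first, then each ancestor directory, nearest first.
        if (ByName.TryGetValue(f.Name, out CollectionType type))
        {
            return type;
        }
        for (DirectoryInfo dir = f.Directory; dir != null; dir = dir.Parent)
        {
            if (ByName.TryGetValue(dir.Name, out type))
            {
                return type;
            }
        }
        // Nothing matched: fall back to a default, in the spirit of DEFAULT_PATH_TYPE.
        return CollectionType.Unknown;
    }
}
```

    A dictionary lookup is used here rather than Enum.TryParse so that purely numeric directory names such as "1996" are never mistaken for an enum value.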

    StripTags(string, int)

    Strip tags from input.

    Declaration
    public static string StripTags(string buf, int start)
    Parameters
    Type Name Description
    string buf
    int start
    Returns
    Type Description
    string
    See Also
    StripTags(StringBuilder, int)

    StripTags(StringBuilder, int)

    Strip tags from buf: each tag is replaced by a single blank.

    Declaration
    public static string StripTags(StringBuilder buf, int start)
    Parameters
    Type Name Description
    StringBuilder buf
    int start
    Returns
    Type Description
    string

    Text obtained when stripping all tags from buf (input StringBuilder is unmodified).
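
    The replacement rule described above (each tag becomes a single blank, and the input is left unmodified) can be sketched in a few lines. This is an illustrative approximation rather than the library's code: the class name StripTagsSketch is hypothetical, start is assumed to be the index at which scanning begins, and keeping the remainder verbatim after an unterminated '<' is a choice made for this sketch.

```csharp
using System;
using System.Text;

public static class StripTagsSketch
{
    // Illustrative stand-in for TrecDocParser.StripTags (assumed behavior, not the library's code).
    public static string StripTags(StringBuilder buf, int start)
    {
        string text = buf.ToString(); // work on a copy so the caller's buffer is untouched
        var result = new StringBuilder();
        int pos = start;

        while (pos < text.Length)
        {
            int open = text.IndexOf('<', pos);
            int close = open < 0 ? -1 : text.IndexOf('>', open + 1);
            if (close < 0)
            {
                // No further complete tag: keep the rest of the text as-is.
                result.Append(text, pos, text.Length - pos);
                break;
            }

            // Copy the text before the tag, then emit one blank in place of the tag.
            result.Append(text, pos, open - pos).Append(' ');
            pos = close + 1;
        }

        return result.ToString();
    }
}
```

    For example, the sketch turns "a<b>c" into "a c", and with start set to 5 it turns "<DOC>x<y>z" into "x z" because scanning begins after the leading tag.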

    Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.