Show / Hide Table of Contents

    extract-wikipedia

    Name

    benchmark-extract-wikipedia - Extracts a downloaded Wikipedia dump into separate files for indexing.

    Synopsis

    lucene benchmark extract-wikipedia <INPUT_WIKIPEDIA_FILE> <OUTPUT_DIRECTORY> [-d|--discard-image-only-docs] [?|-h|--help]

    Arguments

    INPUT_WIKIPEDIA_FILE

    Input path to a Wikipedia XML file.

    OUTPUT_DIRECTORY

    Path to a directory where the output files will be written.

    Options

    ?|-h|--help

    Prints out a short help for the command.

    -d|--discard-image-only-docs

    Tells the extractor to skip WIKI docs that contain only images.

    Example

    Extracts the c:\wiki.xml file into the c:\out directory, skipping any docs that only contain images.

    lucene benchmark extract-wikipedia c:\wiki.xml c:\out -d

    • Improve this Doc
    Back to top Copyright © 2019 Licensed to the Apache Software Foundation (ASF)