extract-wikipedia

benchmark-extract-wikipedia - Extracts a downloaded Wikipedia dump into separate files for indexing.

lucene benchmark extract-wikipedia <INPUT_WIKIPEDIA_FILE> <OUTPUT_DIRECTORY> [-d|--discard-image-only-docs] [?|-h|--help]

INPUT_WIKIPEDIA_FILE

Input path to a Wikipedia XML file.

OUTPUT_DIRECTORY

Path to a directory where the output files will be written.

?|-h|--help

Prints a short help message for the command.

-d|--discard-image-only-docs

Skips Wikipedia documents that contain only images.

Extracts the c:\wiki.xml file into the c:\out directory, skipping any documents that contain only images.

lucene benchmark extract-wikipedia c:\wiki.xml c:\out -d
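
The synopsis lists `--discard-image-only-docs` as the long form of `-d`, so the example above can also be written with the long option, or without it to keep image-only documents. The following is a minimal sketch that only prints the resulting command lines and does not run the extractor. The helper name `extract_wikipedia_cmd` and its `yes`/`no` switch are hypothetical and not part of the `lucene` CLI, and the sketch assumes paths without spaces.

```shell
# Sketch: print the extract-wikipedia command line for a given input and output.
# extract_wikipedia_cmd and its "yes"/"no" switch are hypothetical helpers,
# not part of the lucene CLI. Paths are assumed to contain no spaces.
extract_wikipedia_cmd() {
  input="$1"      # path to the Wikipedia XML file
  output="$2"     # directory where the output files will be written
  discard="$3"    # "yes" appends --discard-image-only-docs

  cmd="lucene benchmark extract-wikipedia $input $output"
  if [ "$discard" = "yes" ]; then
    cmd="$cmd --discard-image-only-docs"
  fi
  printf '%s\n' "$cmd"
}

# Long-form equivalent of the example above.
extract_wikipedia_cmd 'c:\wiki.xml' 'c:\out' yes

# Same extraction, keeping documents that contain only images.
extract_wikipedia_cmd 'c:\wiki.xml' 'c:\out' no
```

Printing the command first makes it easy to check the arguments before running the printed line in a terminal.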