kuromoji-build-dictionary

Name

analysis-kuromoji-build-dictionary - Generates a set of custom dictionary files for the Lucene.Net.Analysis.Kuromoji library.

Synopsis

lucene analysis kuromoji-build-dictionary <FORMAT> <INPUT_DIRECTORY> <OUTPUT_DIRECTORY> [-e|--encoding] [-n|--normalize] [?|-h|--help]

Description

Generates the following set of binary files:

CharacterDefinition.dat
ConnectionCosts.dat
TokenInfoDictionary$buffer.dat
TokenInfoDictionary$fst.dat
TokenInfoDictionary$posDict.dat
TokenInfoDictionary$targetMap.dat
UnknownDictionary$buffer.dat
UnknownDictionary$posDict.dat
UnknownDictionary$targetMap.dat

If these files are placed into a subdirectory of your application named kuromoji-data, they will be used automatically by Lucene.Net.Analysis.Kuromoji features such as the JapaneseAnalyzer or JapaneseTokenizer. To use an alternate directory location, put the path in an environment variable named kuromoji.data.dir. The files must be placed in a subdirectory of this location named kuromoji-data.

See this blog post for information about the dictionary format. A sample is available at https://sourceforge.net/projects/mecab/files/mecab-ipadic/2.7.0-20070801/. The Kuromoji project documentation may also be helpful.

Arguments

FORMAT

The dictionary format. Valid values are IPADIC and UNIDIC. If an invalid value is passed, IPADIC is assumed.

INPUT_DIRECTORY

The directory where the dictionary input files are located.

OUTPUT_DIRECTORY

The directory to put the dictionary output.

Options

?|-h|--help

Prints out a short help for the command.

-e|--encoding <ENCODING>

The file encoding used by the input files. If not supplied, the default value is EUC-JP.

-n|--normalize

Normalize the entries using normalization form KC.

Example

lucene analysis kuromoji-build-dictionary IPADIC X:\kuromoji-data X:\kuromoji-dictionary --normalize