kuromoji-build-dictionary
Name
analysis-kuromoji-build-dictionary - Generates a set of custom dictionary files for the Lucene.Net.Analysis.Kuromoji library.
Synopsis
lucene analysis kuromoji-build-dictionary <FORMAT> <INPUT_DIRECTORY> <OUTPUT_DIRECTORY> [-e|--encoding <ENCODING>] [-n|--normalize] [?|-h|--help]
Description
Generates the following set of binary files:
- CharacterDefinition.dat
- ConnectionCosts.dat
- TokenInfoDictionary$buffer.dat
- TokenInfoDictionary$fst.dat
- TokenInfoDictionary$posDict.dat
- TokenInfoDictionary$targetMap.dat
- UnknownDictionary$buffer.dat
- UnknownDictionary$posDict.dat
- UnknownDictionary$targetMap.dat
If these files are placed in a subdirectory of your application named kuromoji-data, Lucene.Net.Analysis.Kuromoji features such as JapaneseAnalyzer and JapaneseTokenizer use them automatically. To use a different location, set an environment variable named kuromoji.data.dir to the path of the parent directory. The files must still be placed in a subdirectory of that location named kuromoji-data.
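Before pointing an application at a generated dictionary, it can help to confirm that all nine files are present. The following is a minimal Python sketch of such a check, not part of the lucene tool. The function name missing_dictionary_files is hypothetical. The file names are copied from the list above.

```python
from pathlib import Path

# File names copied from the list of generated files above.
EXPECTED_FILES = [
    "CharacterDefinition.dat",
    "ConnectionCosts.dat",
    "TokenInfoDictionary$buffer.dat",
    "TokenInfoDictionary$fst.dat",
    "TokenInfoDictionary$posDict.dat",
    "TokenInfoDictionary$targetMap.dat",
    "UnknownDictionary$buffer.dat",
    "UnknownDictionary$posDict.dat",
    "UnknownDictionary$targetMap.dat",
]


def missing_dictionary_files(base_dir):
    """Return the expected file names that are absent from <base_dir>/kuromoji-data.

    Hypothetical helper: base_dir is either the application directory or the
    path stored in the kuromoji.data.dir environment variable.
    """
    data_dir = Path(base_dir) / "kuromoji-data"
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]
```

An empty list means every expected file was found. Any names returned are the ones that still need to be copied into the kuromoji-data subdirectory.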
See this blog post for information about the dictionary format. A sample is available at https://sourceforge.net/projects/mecab/files/mecab-ipadic/2.7.0-20070801/. The Kuromoji project documentation may also be helpful.
Arguments
FORMAT
The dictionary format. Valid values are IPADIC and UNIDIC. If an invalid value is passed, IPADIC is assumed.
INPUT_DIRECTORY
The directory where the dictionary input files are located.
OUTPUT_DIRECTORY
The directory in which to write the dictionary output files.
Options
?|-h|--help
Prints a short help message for the command.
-e|--encoding <ENCODING>
The file encoding used by the input files. If not supplied, the default value is EUC-JP.
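The encoding option matters because the same bytes decode differently, or not at all, under another encoding. The Python sketch below illustrates this with a short EUC-JP sample. The helper can_decode is hypothetical, and the codec names ("euc_jp", "utf-8") are Python's. The names the lucene tool itself accepts may differ.

```python
def can_decode(data, encoding):
    """Return True if the bytes decode cleanly under the given Python codec."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# "東京" (Tokyo) encoded as EUC-JP, the default input encoding of the command.
raw = "東京".encode("euc_jp")

print(can_decode(raw, "euc_jp"))  # True
print(can_decode(raw, "utf-8"))   # False: these bytes are not valid UTF-8
```

A check like this on a sample of the input files can show whether the default is appropriate or whether a different value should be passed with -e.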
-n|--normalize
Normalizes the entries using Unicode Normalization Form KC (NFKC).
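To show what normalization form KC does to Japanese text, the Python sketch below applies NFKC to a few strings using the standard unicodedata module. It illustrates the Unicode normalization form only. How the tool applies it to dictionary entries internally is not covered here.

```python
import unicodedata


def nfkc(text):
    """Apply Unicode Normalization Form KC (compatibility composition)."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters and digits fold to their ASCII forms.
print(nfkc("ＡＢＣ１２３"))  # ABC123
# Half-width katakana plus a voicing mark composes to one full-width character.
print(nfkc("ｶﾞ"))            # ガ
# A squared compatibility character expands to ordinary katakana.
print(nfkc("㌔"))            # キロ
```

Text that will be matched against a normalized dictionary generally needs the same normalization, otherwise full-width and half-width variants will not line up.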
Example
lucene analysis kuromoji-build-dictionary IPADIC X:\kuromoji-data X:\kuromoji-dictionary --normalize
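When the command is run from a build script, the argument list can be assembled programmatically. The Python sketch below is one way to do that, using only the arguments and options documented above. The function build_command is hypothetical, and passing the encoding value as a separate argument after --encoding is an assumption based on the option syntax shown in Options.

```python
VALID_FORMATS = ("IPADIC", "UNIDIC")


def build_command(fmt, input_dir, output_dir, encoding=None, normalize=False):
    """Build the argument list for 'lucene analysis kuromoji-build-dictionary'.

    Raises ValueError for an unknown format instead of relying on the
    documented fallback to IPADIC.
    """
    if fmt not in VALID_FORMATS:
        raise ValueError(f"FORMAT must be one of {VALID_FORMATS}, got {fmt!r}")
    cmd = ["lucene", "analysis", "kuromoji-build-dictionary",
           fmt, str(input_dir), str(output_dir)]
    if encoding is not None:
        # Assumption: the value follows --encoding as a separate argument.
        cmd += ["--encoding", encoding]
    if normalize:
        cmd.append("--normalize")
    return cmd


# Equivalent to the example above; run it with subprocess.run(cmd, check=True)
# on a machine where the lucene command is installed.
cmd = build_command("IPADIC", r"X:\kuromoji-data", r"X:\kuromoji-dictionary",
                    normalize=True)
print(" ".join(cmd))
```

Rejecting an unknown format up front is a deliberate choice in this sketch: because the tool assumes IPADIC for invalid values, a typo in the format name would otherwise go unnoticed.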