Class ICUTokenizerFactory
Factory for ICUTokenizer. Words are broken across script boundaries, then segmented according to the ICU4N.Text.BreakIterator and typing provided by the DefaultICUTokenizerConfig.
Namespace: Lucene.Net.Analysis.Icu.Segmentation
Assembly: Lucene.Net.ICU.dll
Syntax
public class ICUTokenizerFactory : TokenizerFactory, IResourceLoaderAware
Remarks
To use the default set of per-script rules:
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>
You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU ICU4N.Text.RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference.
To add per-script rules, add a "rulefiles" argument containing a comma-separated list of code:rulefile pairs. Each pair is a four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):
<fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true"
               rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
  </analyzer>
</fieldType>
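To make the "rulefiles" format concrete, the following is a minimal standalone sketch of how such a value could be split into script-code and resource-path pairs. It is not the factory's actual implementation: `RuleFilesSketch` and `Parse` are hypothetical names introduced here for illustration, and the validation rules (four-character code, non-empty path) are assumptions based only on the format described above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class RuleFilesSketch
{
    // Splits a "rulefiles" value such as
    // "Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"
    // into a map from script code to resource path.
    public static IDictionary<string, string> Parse(string rulefiles)
    {
        var result = new Dictionary<string, string>(StringComparer.Ordinal);
        foreach (string pair in rulefiles.Split(','))
        {
            string trimmed = pair.Trim();
            if (trimmed.Length == 0)
            {
                continue; // tolerate stray commas
            }

            // The first colon separates the script code from the resource path.
            int colon = trimmed.IndexOf(':');
            if (colon <= 0 || colon == trimmed.Length - 1)
            {
                throw new ArgumentException(
                    "Expected 'code:rulefile' but got '" + trimmed + "'.");
            }

            string scriptCode = trimmed.Substring(0, colon).Trim();
            string resourcePath = trimmed.Substring(colon + 1).Trim();

            // Assumption: ISO 15924 script codes are four letters (e.g. "Latn").
            if (scriptCode.Length != 4)
            {
                throw new ArgumentException(
                    "Script code must be four letters: '" + scriptCode + "'.");
            }

            result[scriptCode] = resourcePath;
        }
        return result;
    }

    public static void Main()
    {
        var parsed = Parse("Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi");
        foreach (KeyValuePair<string, string> entry in parsed)
        {
            Console.WriteLine(entry.Key + " -> " + entry.Value);
        }
    }
}
```

A sketch like this can be handy for validating a configuration value before passing it to the factory, since a malformed pair is reported with the offending text.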
Constructors
ICUTokenizerFactory(IDictionary<string, string>)
Creates a new ICUTokenizerFactory.
Declaration
public ICUTokenizerFactory(IDictionary<string, string> args)
Parameters
Type | Name | Description |
---|---|---|
IDictionary<string, string> | args |
Methods
Create(AttributeFactory, TextReader)
Creates a Lucene.Net.Analysis.TokenStream of the specified input using the given Lucene.Net.Util.AttributeSource.AttributeFactory.
Declaration
public override Tokenizer Create(AttributeSource.AttributeFactory factory, TextReader input)
Parameters
Type | Name | Description |
---|---|---|
AttributeSource.AttributeFactory | factory | |
TextReader | input |
Returns
Type | Description |
---|---|
Tokenizer |
Inform(IResourceLoader)
Initializes this component with the provided Lucene.Net.Analysis.Util.IResourceLoader (used for loading types, embedded resources, files, etc).
Declaration
public virtual void Inform(IResourceLoader loader)
Parameters
Type | Name | Description |
---|---|---|
IResourceLoader | loader |
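Outside of a Solr schema, the constructor, Inform, and Create members documented here might be combined roughly as follows. This is a sketch under assumptions rather than a definitive recipe: it assumes the Lucene.Net.ICU and Lucene.Net.Analysis.Common packages are referenced, that `ClasspathResourceLoader` (from Lucene.Net.Analysis.Util) is an acceptable loader when no rule files are configured, and that an empty argument dictionary selects the default per-script rules. The class name `IcuTokenizerFactoryDemo` and the sample text are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Hypothetical demo class; requires the Lucene.NET ICU and analysis packages.
public static class IcuTokenizerFactoryDemo
{
    public static void Main()
    {
        // No "rulefiles" entry: assumed to fall back to the default per-script rules.
        var args = new Dictionary<string, string>();
        var factory = new ICUTokenizerFactory(args);

        // Inform must run before Create so the factory can resolve any rule files.
        // Assumption: ClasspathResourceLoader is suitable here because no
        // custom rule files need to be located.
        factory.Inform(new ClasspathResourceLoader(typeof(ICUTokenizerFactory)));

        using TextReader reader = new StringReader("Hello мир");
        using Tokenizer tokenizer = factory.Create(
            AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, reader);

        ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();

        // Standard TokenStream consumption: Reset, IncrementToken loop, End.
        tokenizer.Reset();
        while (tokenizer.IncrementToken())
        {
            Console.WriteLine(term.ToString());
        }
        tokenizer.End();
    }
}
```

To use custom rules instead, the dictionary would carry a "rulefiles" entry in the format described in the class remarks, and the loader passed to Inform would need to be able to resolve those resource paths.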