Class ICUTokenizerFactory
Factory for ICUTokenizer.
Words are broken across script boundaries, then segmented according to
the BreakIterator and typing provided by the DefaultICUTokenizerConfig.
Implements
IResourceLoaderAware
Namespace: Lucene.Net.Analysis.Icu.Segmentation
Assembly: Lucene.Net.ICU.dll
Syntax
public class ICUTokenizerFactory : TokenizerFactory, IResourceLoaderAware
Remarks
To use the default set of per-script rules:
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>
You can customize this tokenizer's behavior by specifying per-script rule files,
which are compiled by the ICU RuleBasedBreakIterator.
To add per-script rules, add a "rulefiles" argument, which should contain a
comma-separated list of code:rulefile
pairs in the following format:
four-letter ISO 15924 script code, followed by a colon, then a resource
path. E.g. to specify rules for Latin (script code "Latn") and Cyrillic
(script code "Cyrl"):
<fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true"
               rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
  </analyzer>
</fieldType>
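The pair format described above can be illustrated with a small parsing sketch. This is not the library's own parsing code; `RuleFilesExample` and `ParseRuleFiles` are hypothetical names introduced only to show how a "rulefiles" value breaks down into script-code/resource-path pairs.

```csharp
using System;
using System.Collections.Generic;

public static class RuleFilesExample
{
    // Hypothetical helper (not part of Lucene.NET): splits a value such as
    // "Latn:a.rbbi,Cyrl:b.rbbi" into a map of script code -> resource path.
    public static IDictionary<string, string> ParseRuleFiles(string ruleFiles)
    {
        var result = new Dictionary<string, string>(StringComparer.Ordinal);
        foreach (string pair in ruleFiles.Split(','))
        {
            string trimmed = pair.Trim();

            // A four-letter ISO 15924 code puts the colon at index 4,
            // and a non-empty resource path must follow it.
            int colon = trimmed.IndexOf(':');
            if (colon != 4 || trimmed.Length <= 5)
            {
                throw new ArgumentException(
                    $"Expected \"code:rulefile\" with a four-letter script code, got \"{trimmed}\".");
            }

            result[trimmed.Substring(0, 4)] = trimmed.Substring(5);
        }
        return result;
    }

    public static void Main()
    {
        var parsed = ParseRuleFiles("Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi");
        foreach (var entry in parsed)
        {
            Console.WriteLine($"{entry.Key} -> {entry.Value}");
        }
    }
}
```

The sketch only checks the shape of each pair. It does not verify that the code is a registered ISO 15924 script or that the resource exists.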
Constructors
ICUTokenizerFactory(IDictionary<String, String>)
Creates a new ICUTokenizerFactory.
Declaration
public ICUTokenizerFactory(IDictionary<string, string> args)
Parameters
Type | Name | Description |
---|---|---|
IDictionary<System.String, System.String> | args | |
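As a rough C# counterpart to the XML configuration in the Remarks, the same settings can be passed through the `args` dictionary. This is a minimal sketch: it assumes the dictionary keys match the XML attribute names (`cjkAsWords`, `rulefiles`), and the rule file names are the placeholder names from the Remarks, not files that ship with the library.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Icu.Segmentation;

// Assumption: keys mirror the attribute names used in the XML example.
var args = new Dictionary<string, string>
{
    ["cjkAsWords"] = "true",
    // Placeholder resource names taken from the Remarks section.
    ["rulefiles"] = "Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi",
};

// The factory reads its settings from the dictionary at construction time.
var factory = new ICUTokenizerFactory(args);
```

Analysis factories in this family typically reject keys they do not recognize, so it is safest to pass only the documented arguments.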
Methods
Create(AttributeSource.AttributeFactory, TextReader)
Declaration
public override Tokenizer Create(AttributeSource.AttributeFactory factory, TextReader input)
Parameters
Type | Name | Description |
---|---|---|
AttributeSource.AttributeFactory | factory | |
TextReader | input | |
Returns
Type | Description |
---|---|
Tokenizer |
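The following sketch shows one way the factory might be used end to end with the default per-script rules. It rests on several assumptions that this page does not confirm: that `Inform` should be called before `Create`, that `ClasspathResourceLoader` (from `Lucene.Net.Analysis.Util`) is a suitable loader when no rule files are specified, and that `AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY` is the appropriate attribute factory. The sample text is arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

// Default configuration: an empty argument dictionary, no per-script rule files.
var factory = new ICUTokenizerFactory(new Dictionary<string, string>());

// Assumption: Inform is called before Create so the factory can set up its configuration.
factory.Inform(new ClasspathResourceLoader(typeof(ICUTokenizerFactory)));

using Tokenizer tokenizer = factory.Create(
    AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY,
    new StringReader("Testing Тестирование 測試"));

// The term attribute exposes the text of the current token.
ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();

tokenizer.Reset();
while (tokenizer.IncrementToken())
{
    Console.WriteLine(term.ToString());
}
tokenizer.End();
```

The loop follows the usual token stream contract of `Reset`, repeated `IncrementToken`, then `End`. The `using` declaration disposes the tokenizer when it goes out of scope.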
Inform(IResourceLoader)
Declaration
public virtual void Inform(IResourceLoader loader)
Parameters
Type | Name | Description |
---|---|---|
IResourceLoader | loader | |