
    Class ICUTokenizerFactory

    Factory for ICUTokenizer. Words are broken across script boundaries, then segmented according to the ICU4N.Text.BreakIterator and typing provided by the DefaultICUTokenizerConfig.

    Inheritance
    object
    AbstractAnalysisFactory
    TokenizerFactory
    ICUTokenizerFactory
    Implements
    IResourceLoaderAware
    Inherited Members
    TokenizerFactory.ForName(string, IDictionary<string, string>)
    TokenizerFactory.LookupClass(string)
    TokenizerFactory.AvailableTokenizers
    TokenizerFactory.ReloadTokenizers()
    TokenizerFactory.Create(TextReader)
    AbstractAnalysisFactory.LUCENE_MATCH_VERSION_PARAM
    AbstractAnalysisFactory.m_luceneMatchVersion
    AbstractAnalysisFactory.OriginalArgs
    AbstractAnalysisFactory.AssureMatchVersion()
    AbstractAnalysisFactory.LuceneMatchVersion
    AbstractAnalysisFactory.Require(IDictionary<string, string>, string)
    AbstractAnalysisFactory.Require(IDictionary<string, string>, string, ICollection<string>)
    AbstractAnalysisFactory.Require(IDictionary<string, string>, string, ICollection<string>, bool)
    AbstractAnalysisFactory.Get(IDictionary<string, string>, string, string)
    AbstractAnalysisFactory.Get(IDictionary<string, string>, string, ICollection<string>)
    AbstractAnalysisFactory.Get(IDictionary<string, string>, string, ICollection<string>, string)
    AbstractAnalysisFactory.Get(IDictionary<string, string>, string, ICollection<string>, string, bool)
    AbstractAnalysisFactory.RequireInt32(IDictionary<string, string>, string)
    AbstractAnalysisFactory.GetInt32(IDictionary<string, string>, string, int)
    AbstractAnalysisFactory.RequireBoolean(IDictionary<string, string>, string)
    AbstractAnalysisFactory.GetBoolean(IDictionary<string, string>, string, bool)
    AbstractAnalysisFactory.RequireSingle(IDictionary<string, string>, string)
    AbstractAnalysisFactory.GetSingle(IDictionary<string, string>, string, float)
    AbstractAnalysisFactory.RequireChar(IDictionary<string, string>, string)
    AbstractAnalysisFactory.GetChar(IDictionary<string, string>, string, char)
    AbstractAnalysisFactory.GetSet(IDictionary<string, string>, string)
    AbstractAnalysisFactory.GetPattern(IDictionary<string, string>, string)
    AbstractAnalysisFactory.GetCulture(IDictionary<string, string>, string, CultureInfo)
    AbstractAnalysisFactory.GetWordSet(IResourceLoader, string, bool)
    AbstractAnalysisFactory.GetLines(IResourceLoader, string)
    AbstractAnalysisFactory.GetSnowballWordSet(IResourceLoader, string, bool)
    AbstractAnalysisFactory.SplitFileNames(string)
    AbstractAnalysisFactory.GetClassArg()
    AbstractAnalysisFactory.IsExplicitLuceneMatchVersion
    object.Equals(object)
    object.Equals(object, object)
    object.GetHashCode()
    object.GetType()
    object.MemberwiseClone()
    object.ReferenceEquals(object, object)
    object.ToString()
    Namespace: Lucene.Net.Analysis.Icu.Segmentation
    Assembly: Lucene.Net.ICU.dll
    Syntax
    public class ICUTokenizerFactory : TokenizerFactory, IResourceLoaderAware
    Remarks

    To use the default set of per-script rules:

    <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
      </analyzer>
    </fieldType>

    You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by ICU4N.Text.RuleBasedBreakIterator. For the rule syntax, see the ICU RuleBasedBreakIterator syntax reference.

    To add per-script rules, add a "rulefiles" argument containing a comma-separated list of code:rulefile pairs. Each pair consists of a four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):

    <fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true"
                   rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
      </analyzer>
    </fieldType>
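
The "rulefiles" format described above can be illustrated with a small standalone parser. This is only a sketch of the documented format, not the factory's own parsing code: the class name RuleFilesArgument and its Parse method are hypothetical, and the real factory additionally maps each script code to an ICU script value, which this sketch does not attempt.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Lucene.NET) that illustrates the
// documented "rulefiles" format: Code:path,Code:path,...
public static class RuleFilesArgument
{
    // Splits the argument into (script code, resource path) pairs.
    public static IList<KeyValuePair<string, string>> Parse(string ruleFiles)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (string entry in ruleFiles.Split(','))
        {
            string trimmed = entry.Trim();
            if (trimmed.Length == 0)
            {
                continue; // tolerate stray commas
            }

            // A four-letter ISO 15924 code puts the colon at index 4.
            int colon = trimmed.IndexOf(':');
            if (colon != 4)
            {
                throw new FormatException(
                    "Expected a four-letter script code followed by ':', got '" + trimmed + "'.");
            }

            string code = trimmed.Substring(0, colon);
            string path = trimmed.Substring(colon + 1).Trim();
            if (path.Length == 0)
            {
                throw new FormatException("Missing resource path in '" + trimmed + "'.");
            }

            pairs.Add(new KeyValuePair<string, string>(code, path));
        }
        return pairs;
    }
}
```

Passing the value from the example above, "Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi", yields two pairs: Latn with my.Latin.rules.rbbi, and Cyrl with my.Cyrillic.rules.rbbi.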

    Constructors

    ICUTokenizerFactory(IDictionary<string, string>)

    Creates a new ICUTokenizerFactory.

    Declaration
    public ICUTokenizerFactory(IDictionary<string, string> args)
    Parameters
    Type Name Description
    IDictionary<string, string> args
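
Outside of Solr, the same settings can presumably be passed through the args dictionary. The sketch below assumes that the dictionary keys match the XML attribute names used on this page ("cjkAsWords" and "rulefiles"). The wrapper class IcuFactoryConfigSketch is hypothetical, and the rule file names are the placeholders from this page, so they would need to exist as resources before the factory could load them.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Icu.Segmentation;

// Hypothetical wrapper; only ICUTokenizerFactory comes from Lucene.NET.
public static class IcuFactoryConfigSketch
{
    public static ICUTokenizerFactory CreateCustomFactory()
    {
        // Assumption: keys mirror the attributes of the <tokenizer> element.
        var args = new Dictionary<string, string>
        {
            ["cjkAsWords"] = "true",
            ["rulefiles"] = "Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi",
        };

        return new ICUTokenizerFactory(args);
    }
}
```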

    Methods

    Create(AttributeFactory, TextReader)

    Creates a Lucene.Net.Analysis.TokenStream of the specified input using the given Lucene.Net.Util.AttributeSource.AttributeFactory.

    Declaration
    public override Tokenizer Create(AttributeSource.AttributeFactory factory, TextReader input)
    Parameters
    Type Name Description
    AttributeSource.AttributeFactory factory
    TextReader input
    Returns
    Type Description
    Tokenizer
    Overrides
    TokenizerFactory.Create(AttributeSource.AttributeFactory, TextReader)

    Inform(IResourceLoader)

    Initializes this component with the provided Lucene.Net.Analysis.Util.IResourceLoader (used for loading types, embedded resources, files, etc).

    Declaration
    public virtual void Inform(IResourceLoader loader)
    Parameters
    Type Name Description
    IResourceLoader loader
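
A minimal end-to-end sketch of how Inform and Create might be combined with the default per-script rules. It assumes that Inform should be called before Create, and that ClasspathResourceLoader from Lucene.Net.Analysis.Util is a suitable loader when no rule files are configured. The class IcuTokenizeSketch and its Tokenize method are hypothetical names introduced for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;

// Hypothetical helper; not part of Lucene.NET.
public static class IcuTokenizeSketch
{
    public static IList<string> Tokenize(string text)
    {
        // Empty args: use the default set of per-script rules.
        var factory = new ICUTokenizerFactory(new Dictionary<string, string>());

        // Assumption: the factory needs its resource loader before Create is called.
        factory.Inform(new ClasspathResourceLoader(typeof(ICUTokenizerFactory)));

        var tokens = new List<string>();
        using (Tokenizer tokenizer = factory.Create(new StringReader(text)))
        {
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();

            // Standard TokenStream consumption: Reset, IncrementToken loop, End.
            tokenizer.Reset();
            while (tokenizer.IncrementToken())
            {
                tokens.Add(term.ToString());
            }
            tokenizer.End();
        }
        return tokens;
    }
}
```

When rule files are configured through "rulefiles", the loader passed to Inform would need to be able to resolve those resource paths.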

    Implements

    Lucene.Net.Analysis.Util.IResourceLoaderAware
    Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.