    Namespace Lucene.Net.Analysis.Pattern

    Set of components for pattern-based (regex) analysis.

    Classes

    PatternCaptureGroupFilterFactory

    Factory for PatternCaptureGroupTokenFilter.

    <fieldType name="text_ptncapturegroup" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.PatternCaptureGroupFilterFactory" pattern="([^a-z])" preserve_original="true"/>
      </analyzer>
    </fieldType>

    PatternCaptureGroupTokenFilter

    CaptureGroup uses .NET regexes to emit multiple tokens - one for each capture group in one or more patterns.

    For example, a pattern like:

    "(https?://([a-zA-Z-_0-9.]+))"

    when matched against the string "http://www.foo.com/index", would return the tokens "http://www.foo.com" and "www.foo.com".

    If none of the patterns match, or if preserveOriginal is true, the original token will be preserved.

    Each pattern is matched as often as it can be, so the pattern "(...)", when matched against "abcdefghi", would produce ["abc","def","ghi"].

    A camelCaseFilter could be written as:

      "([A-Z]{2,})",                                 
      "(?<![A-Z])([A-Z][a-z]+)",                     
      "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)", 
      "([0-9]+)"

    In addition, if preserveOriginal is true, the filter would also return the original token, for example "camelCaseFilter".
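
The capture-group behavior described above can be sketched in a few lines of Python. This is only an illustration of the matching rules, not the Lucene.Net implementation: Python's `re` module stands in for .NET regexes, and the function name `capture_group_tokens` and its parameters are hypothetical.

```python
import re

def capture_group_tokens(token, patterns, preserve_original=True):
    """Hypothetical helper: emit one token per non-empty capture group."""
    found = []
    for pattern in patterns:
        # Each pattern is matched as often as it can be.
        for m in re.finditer(pattern, token):
            for g in range(1, m.re.groups + 1):
                text = m.group(g)
                if text:
                    found.append((m.start(g), text))
    # Order captured tokens by where they start in the original token.
    found.sort(key=lambda item: item[0])
    captured = [text for _, text in found]
    # Keep the original if requested, or if nothing matched at all.
    if preserve_original or not captured:
        return [token] + [t for t in captured if t != token]
    return captured

url_tokens = capture_group_tokens(
    "http://www.foo.com/index",
    [r"(https?://([a-zA-Z-_0-9.]+))"],
    preserve_original=False,
)

chunk_tokens = capture_group_tokens("abcdefghi", [r"(...)"], preserve_original=False)

camel_case_patterns = [
    r"([A-Z]{2,})",
    r"(?<![A-Z])([A-Z][a-z]+)",
    r"(?:^|\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
    r"([0-9]+)",
]
camel_tokens = capture_group_tokens("camelCaseFilter", camel_case_patterns)
```

Under these assumptions, `url_tokens` holds the two URL tokens from the example, `chunk_tokens` holds the three-character chunks, and `camel_tokens` holds the original token followed by its component words.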

    PatternReplaceCharFilter

    A Lucene.Net.Analysis.CharFilter that uses a regular expression to select the text to replace. The pattern is matched against each "block" of the character stream.

    ex1) source="aa bb aa bb", pattern="(aa)\s+(bb)" replacement="$1#$2" output="aa#bb aa#bb"

    NOTE: If the replacement produces a phrase whose length differs from that of the source string, and the field is used to highlight a term in that phrase, the highlighting may be wrong.

    ex2) source="aa123bb", pattern="(aa)\d+(bb)" replacement="$1 $2" output="aa bb". If you then search for bb and highlight it, the highlighted snippet will be "aa1<em>23bb</em>".
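
The two examples above can be reproduced with a short Python sketch. It only illustrates the replacement itself, not how the char filter processes blocks or corrects offsets. The helper name `pattern_replace` is hypothetical, and the `$1`-style references are translated to Python's `\g<1>` form because Python's `re` module uses a different replacement syntax than .NET.

```python
import re

def pattern_replace(source, pattern, replacement):
    """Hypothetical helper: apply a regex replacement written with $n references."""
    # Translate $1, $2, ... into Python's \g<1>, \g<2>, ... group references.
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, py_replacement, source)

ex1 = pattern_replace("aa bb aa bb", r"(aa)\s+(bb)", "$1#$2")
ex2 = pattern_replace("aa123bb", r"(aa)\d+(bb)", "$1 $2")

# The output of ex2 is shorter than its source, which is why offsets
# into the original text no longer line up for highlighting.
length_change = len(ex2) - len("aa123bb")
```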

    @since Solr 1.5

    PatternReplaceCharFilterFactory

    Factory for PatternReplaceCharFilter.

    <fieldType name="text_ptnreplace" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <charFilter class="solr.PatternReplaceCharFilterFactory" 
                       pattern="([^a-z])" replacement=""/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>

    @since Solr 3.1

    PatternReplaceFilter

    A TokenFilter which applies a System.Text.RegularExpressions.Regex to each token in the stream, replacing match occurrences with the specified replacement string.

    Note: Depending on the pattern used and the input Lucene.Net.Analysis.TokenStream, this Lucene.Net.Analysis.TokenFilter may produce tokens whose text is the empty string.
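
As a rough illustration of how an empty token can arise, the following Python sketch applies a pattern to each token of a list. The helper name `replace_in_tokens` and the `replace_all` flag are hypothetical; the flag is assumed to mirror the choice between replacing every match and replacing only the first one.

```python
import re

def replace_in_tokens(tokens, pattern, replacement, replace_all=True):
    """Hypothetical helper: apply a regex replacement to every token."""
    regex = re.compile(pattern)
    # In re.sub, count=0 means "replace every match"; count=1 replaces only the first.
    count = 0 if replace_all else 1
    return [regex.sub(replacement, tok, count=count) for tok in tokens]

tokens = ["Hello!", "123", "abc"]
all_replaced = replace_in_tokens(tokens, r"([^a-z])", "")
first_replaced = replace_in_tokens(tokens, r"([^a-z])", "", replace_all=False)
```

With the pattern `([^a-z])` and an empty replacement, the token `123` is reduced to the empty string when every match is replaced, so a downstream consumer may need to handle or drop such tokens.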

    PatternReplaceFilterFactory

    Factory for PatternReplaceFilter.

    <fieldType name="text_ptnreplace" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement=""
                replace="all"/>
      </analyzer>
    </fieldType>

    PatternTokenizer

    This tokenizer uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: "pattern" and "group".

    • "pattern" is the regular expression.
    • "group" says which group to extract into tokens.

    group=-1 (the default) is equivalent to "split". In this case, the tokens will be equivalent to the output of System.Text.RegularExpressions.Regex.Split(System.String,System.String), without empty tokens.

    Using group >= 0 selects the matching group as the token. For example, if you have:

     pattern = \'([^\']+)\'
     group = 0
     input = aaa 'bbb' 'ccc'

    the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks).

    NOTE: This Lucene.Net.Analysis.Tokenizer does not output tokens that are of zero length.
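
The two modes described above can be sketched in Python as follows. This is a simplified model of the tokenizer's behavior, not its implementation: the function name `pattern_tokenize` is hypothetical, Python's `re` module stands in for .NET regexes, and the whitespace pattern used for the split case is an added illustration.

```python
import re

def pattern_tokenize(text, pattern, group=-1):
    """Hypothetical helper: tokenize by splitting (group=-1) or by extracting a group."""
    regex = re.compile(pattern)
    tokens = []
    if group >= 0:
        # Extract mode: each match contributes the selected group as a token.
        for m in regex.finditer(text):
            tok = m.group(group)
            if tok:  # skip zero-length (or non-participating) groups
                tokens.append(tok)
    else:
        # Split mode: the text between matches becomes the tokens.
        last = 0
        for m in regex.finditer(text):
            if m.start() > last:  # skip zero-length pieces
                tokens.append(text[last:m.start()])
            last = m.end()
        if last < len(text):
            tokens.append(text[last:])
    return tokens

text = "aaa 'bbb' 'ccc'"
with_quotes = pattern_tokenize(text, r"'([^']+)'", group=0)
without_quotes = pattern_tokenize(text, r"'([^']+)'", group=1)
split_on_space = pattern_tokenize(text, r"\s+")
```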

    PatternTokenizerFactory

    Factory for PatternTokenizer. This tokenizer uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: "pattern" and "group".

    • "pattern" is the regular expression.
    • "group" says which group to extract into tokens.

    group=-1 (the default) is equivalent to "split". In this case, the tokens will be equivalent to the output of System.Text.RegularExpressions.Regex.Split(System.String,System.String), without empty tokens.

    Using group >= 0 selects the matching group as the token. For example, if you have:

        pattern = \'([^\']+)\'
        group = 0
        input = aaa 'bbb' 'ccc'

    the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks).

    NOTE: This Tokenizer does not output tokens that are of zero length.

    <fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.PatternTokenizerFactory" pattern="\'([^\']+)\'" group="1"/>
      </analyzer>
    </fieldType>

    @since Solr 1.2

    Copyright © 2020 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.