Namespace Lucene.Net.Analysis.Pattern

Set of components for pattern-based (regex) analysis.

Classes

PatternCaptureGroupFilterFactory

Factory for PatternCaptureGroupTokenFilter.

<fieldType name="text_ptncapturegroup" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternCaptureGroupFilterFactory" pattern="([^a-z])" preserve_original="true"/>
  </analyzer>
</fieldType>

PatternCaptureGroupTokenFilter

CaptureGroup uses .NET regexes to emit multiple tokens - one for each capture group in one or more patterns.

For example, a pattern like:

"(https?://([a-zA-Z-_0-9.]+))"

when matched against the string "http://www.foo.com/index", would return the tokens "http://www.foo.com" and "www.foo.com".

If none of the patterns match, or if preserveOriginal is true, the original token will be preserved.

Each pattern is matched as often as it can be, so the pattern "(...)", when matched against "abcdefghi", would produce ["abc","def","ghi"].
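The two examples above can be reproduced with a small sketch. It uses Python's `re` module rather than .NET's Regex, so it only approximates the filter's matching behaviour; the helper name `capture_group_tokens` is hypothetical and is not part of Lucene.Net.

```python
import re

def capture_group_tokens(pattern, text):
    """Collect every non-empty capture group from every match of pattern in text.

    Hypothetical helper for illustration only; not the Lucene.Net implementation.
    """
    tokens = []
    for match in re.finditer(pattern, text):
        # match.groups() holds groups 1..n; group 0 (the whole match) is skipped.
        for group in match.groups():
            if group:
                tokens.append(group)
    return tokens

url_tokens = capture_group_tokens(r"(https?://([a-zA-Z-_0-9.]+))",
                                  "http://www.foo.com/index")
chunk_tokens = capture_group_tokens(r"(...)", "abcdefghi")
print(url_tokens)    # ['http://www.foo.com', 'www.foo.com']
print(chunk_tokens)  # ['abc', 'def', 'ghi']
```

The nested group in the URL pattern is why one match yields two tokens: the outer group captures the scheme plus host, the inner group captures the host alone.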

A camelCaseFilter could be written as:

  "([A-Z]{2,})",                                 
  "(?<![A-Z])([A-Z][a-z]+)",                     
  "(?:^|\\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)", 
  "([0-9]+)"

If preserveOriginal is true, the filter would also return the original token, for example "camelCaseFilter".
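As a rough illustration of how the four patterns split a camel-case token, the sketch below applies them with Python's `re` module. It is an approximation under stated assumptions: ordering tokens by start offset and placing the original token first are choices made for this sketch, not confirmed behaviour of the filter, and the function name `camel_case_tokens` is hypothetical.

```python
import re

# The four patterns from the text, written as Python raw strings.
CAMEL_CASE_PATTERNS = [
    r"([A-Z]{2,})",
    r"(?<![A-Z])([A-Z][a-z]+)",
    r"(?:^|\b|(?<=[0-9_])|(?<=[A-Z]{2}))([a-z]+)",
    r"([0-9]+)",
]

def camel_case_tokens(text, preserve_original=False):
    """Apply every pattern to text and return the captured pieces.

    Hypothetical helper; ordering by start offset is an assumption of this sketch.
    """
    found = []
    for pattern in CAMEL_CASE_PATTERNS:
        for match in re.finditer(pattern, text):
            if match.group(1):
                found.append((match.start(1), match.group(1)))
    found.sort(key=lambda item: item[0])
    tokens = [token for _, token in found]
    # Keep the original when asked to, or when nothing matched at all.
    if preserve_original or not tokens:
        tokens.insert(0, text)
    return tokens

print(camel_case_tokens("camelCaseFilter"))
print(camel_case_tokens("camelCaseFilter", preserve_original=True))
```

For "camelCaseFilter" only the second and third patterns match: the second yields "Case" and "Filter", and the third yields "camel".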

PatternReplaceCharFilter

A Lucene.Net.Analysis.CharFilter that uses a regular expression to select the text to replace. The pattern is matched against each "block" of the char stream.

ex1) source="aa bb aa bb", pattern="(aa)\s+(bb)" replacement="$1#$2" output="aa#bb aa#bb"

NOTE: If the replacement produces text whose length differs from the source string, and the field is used for highlighting a term in that text, the highlighted region can be misaligned.

ex2) source="aa123bb", pattern="(aa)\d+(bb)" replacement="$1 $2" output="aa bb". If you search for bb and highlight it, the highlight snippet will be "aa1<em>23bb</em>".
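The two replacements can be checked with a short sketch. It uses Python's `re.sub`, where group references are written `\1` and `\2` instead of the `$1` and `$2` shown above, so it approximates the examples rather than the char filter itself; `replace_all` is a hypothetical helper name.

```python
import re

def replace_all(source, pattern, replacement):
    """Replace every match of pattern in source (hypothetical helper)."""
    return re.sub(pattern, replacement, source)

ex1 = replace_all("aa bb aa bb", r"(aa)\s+(bb)", r"\1#\2")
ex2 = replace_all("aa123bb", r"(aa)\d+(bb)", r"\1 \2")

print(ex1)  # aa#bb aa#bb
print(ex2)  # aa bb
# ex2 is shorter than its source (5 characters versus 7), which is the
# length mismatch the NOTE warns about for highlighting.
print(len("aa123bb"), len(ex2))
```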

@since Solr 1.5

PatternReplaceCharFilterFactory

Factory for PatternReplaceCharFilter.

<fieldType name="text_ptnreplace" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" 
                   pattern="([^a-z])" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

@since Solr 3.1

PatternReplaceFilter

A TokenFilter which applies a System.Text.RegularExpressions.Regex to each token in the stream, replacing match occurrences with the specified replacement string.

Note: Depending on the pattern used and the input Lucene.Net.Analysis.TokenStream, this Lucene.Net.Analysis.TokenFilter may produce tokens whose text is the empty string.
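A minimal sketch of how an empty token can arise, using Python's `re` module on a plain list of strings as a stand-in for a token stream. The helper name `pattern_replace_tokens` and the sample tokens are assumptions for illustration.

```python
import re

def pattern_replace_tokens(tokens, pattern, replacement):
    """Apply the replacement to each token independently (hypothetical helper)."""
    regex = re.compile(pattern)
    return [regex.sub(replacement, token) for token in tokens]

# Removing every character that is not a lowercase letter leaves nothing
# of the token "123", so its text becomes the empty string.
result = pattern_replace_tokens(["abc-def", "123"], r"([^a-z])", "")
print(result)  # ['abcdef', '']
```

Note that the sketch keeps the empty token in the list instead of dropping it, which mirrors the behaviour the note describes.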

PatternReplaceFilterFactory

Factory for PatternReplaceFilter.

<fieldType name="text_ptnreplace" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement=""
            replace="all"/>
  </analyzer>
</fieldType>

PatternTokenizer

This tokenizer uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: "pattern" and "group".

"pattern" is the regular expression.
"group" says which group to extract into tokens.

group=-1 (the default) is equivalent to "split". In this case, the tokens will be equivalent to the output from (without empty tokens): System.Text.RegularExpressions.Regex.Split(System.String,System.String)

Using group >= 0 selects the matching group as the token. For example, if you have:

 pattern = \'([^\']+)\'
 group = 0
 input = aaa 'bbb' 'ccc'

the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks).

NOTE: This Lucene.Net.Analysis.Tokenizer does not output tokens that are of zero length.
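The difference between split mode and group mode can be sketched as follows. This uses Python's `re` module as an approximation of the tokenizer's behaviour; the function `pattern_tokenize` is hypothetical, and the `\s+` pattern in the split call is an added example that does not come from the text above.

```python
import re

def pattern_tokenize(pattern, text, group=-1):
    """Tokenize text by pattern (hypothetical helper, not the Lucene.Net API).

    group < 0  : split mode, the matches act as separators.
    group >= 0 : the selected group of each match becomes a token.
    """
    regex = re.compile(pattern)
    if group >= 0:
        tokens = [match.group(group) for match in regex.finditer(text)]
    else:
        tokens, last = [], 0
        for match in regex.finditer(text):
            tokens.append(text[last:match.start()])
            last = match.end()
        tokens.append(text[last:])
    # Zero-length tokens (and groups that did not participate) are dropped.
    return [token for token in tokens if token]

text = "aaa 'bbb' 'ccc'"
print(pattern_tokenize(r"\'([^\']+)\'", text, group=0))  # ["'bbb'", "'ccc'"]
print(pattern_tokenize(r"\'([^\']+)\'", text, group=1))  # ['bbb', 'ccc']
print(pattern_tokenize(r"\s+", text))                    # ['aaa', "'bbb'", "'ccc'"]
```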

PatternTokenizerFactory

Factory for PatternTokenizer. This tokenizer uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: "pattern" and "group".

"pattern" is the regular expression.
"group" says which group to extract into tokens.

Using group >= 0 selects the matching group as the token. For example, if you have:

    pattern = \'([^\']+)\'
    group = 0
    input = aaa 'bbb' 'ccc'

the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks).

NOTE: This Tokenizer does not output tokens that are of zero length.

<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\'([^\']+)\'" group="1"/>
  </analyzer>
</fieldType>

@since Solr 1.2