Namespace Lucene.Net.Analysis.OpenNlp
OpenNLP Library Integration
This module exposes functionality from Apache OpenNLP to Apache Lucene.NET. The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
For an introduction to Lucene's analysis API, see the Lucene.Net.Analysis namespace documentation.
The OpenNLP Tokenizer behavior is similar to the WhitespaceTokenizer but is smart about inter-word punctuation. The term stream looks very much like the way you parse words and punctuation while reading. The major difference between this tokenizer and most other tokenizers shipped with Lucene is that punctuation is tokenized. This is required for the following taggers to operate properly.
The OpenNLP taggers annotate terms using the <xref:Lucene.Net.Analysis.TokenAttributes.ITypeAttribute>.
OpenNLPTokenizer segments text into sentences or words. This Tokenizer uses the OpenNLP Sentence Detector and/or Tokenizer classes. When used together, the Tokenizer receives sentences and can do a better job.
OpenNLPPOSFilter tags words for Part-of-Speech and OpenNLPChunkerFilter tags words for Chunking. These tags are assigned as token types. Note that only one of these tags can be held by the <xref:Lucene.Net.Analysis.TokenAttributes.ITypeAttribute> at a time: the chunker replaces any POS tags previously placed there. Since the <xref:Lucene.Net.Analysis.TokenAttributes.ITypeAttribute> is not stored in the index, it is recommended that one of the following filters be used after the OpenNLP filters to enable search against the assigned tags:
<xref:Lucene.Net.Analysis.Payloads.TypeAsPayloadTokenFilter> copies the <xref:Lucene.Net.Analysis.TokenAttributes.ITypeAttribute> value to the <xref:Lucene.Net.Analysis.TokenAttributes.IPayloadAttribute>
<xref:Lucene.Net.Analysis.Miscellaneous.TypeAsSynonymFilter> creates a cloned token at the same position as each tagged token, and copies the <xref:Lucene.Net.Analysis.TokenAttributes.ITypeAttribute> value to the <xref:Lucene.Net.Analysis.TokenAttributes.ICharTermAttribute>, optionally with a customized prefix (so that tags effectively occupy a different namespace from token text).
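As a concrete illustration, the chain above can be wired up in C# using the factory classes documented below. This is a minimal, hedged sketch: the "models" directory and *.bin file names are assumptions, and the exact factory and filter constructor signatures may differ slightly in your version of Lucene.NET, so treat it as a starting point rather than a definitive recipe.
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.OpenNlp;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Analysis.Util;

public static class OpenNlpChainSketch
{
    public static void Run()
    {
        // Resource loader the factories use to load the OpenNLP model files
        // (directory and file names below are assumptions).
        var loader = new FilesystemResourceLoader(new DirectoryInfo("models"));

        var tokenizerFactory = new OpenNLPTokenizerFactory(new Dictionary<string, string>
        {
            ["sentenceModel"] = "en-sent.bin",
            ["tokenizerModel"] = "en-token.bin"
        });
        tokenizerFactory.Inform(loader); // loads the sentence and tokenizer models

        var posFactory = new OpenNLPPOSFilterFactory(new Dictionary<string, string>
        {
            ["posTaggerModel"] = "en-pos-maxent.bin"
        });
        posFactory.Inform(loader); // loads the POS tagger model

        using TextReader reader = new StringReader("Sentence number 1 has 6 words. Sentence number 2, 5 words.");
        Tokenizer tokenizer = tokenizerFactory.Create(reader);
        TokenStream stream = posFactory.Create(tokenizer);

        // Clone each tagged token and copy its type (the POS tag) into the term text,
        // prefixed so the tags effectively occupy their own namespace in the index.
        stream = new TypeAsSynonymFilter(stream, "@");

        var termAtt = stream.AddAttribute<ICharTermAttribute>();
        stream.Reset();
        while (stream.IncrementToken())
        {
            Console.WriteLine(termAtt.ToString());
        }
        stream.End();
        stream.Dispose();
    }
}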
Named Entity Recognition is also supported by OpenNLP, but there is no OpenNLPNERFilter included. For an implementation, see the lucenenet-opennlp-mavenreference-demo.
MavenReference Primer
When a <PackageReference> for this NuGet package is included in your SDK-style MSBuild project, it automatically brings in a transitive dependency on opennlp-tools from maven.org. That transitive dependency is expressed as a <MavenReference> in your MSBuild project.
The <MavenReference> item group operates much like a dependency in Maven: all transitive dependencies are collected and resolved, and then the final output is produced. However, unlike <PackageReference>s, <MavenReference>s are collected by the final output project and reassessed there. That is, each dependent project within your .NET SDK-style solution contributes its <MavenReference>s to the project(s) that reference it, and each project builds its own dependency graph. Projects do not contribute their built assemblies upward; they only contribute their dependencies. This allows each project in a complicated solution to make its own local conflict-resolution attempt.
Note
<MavenReference> is only supported on SDK-style MSBuild projects.
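As a quick illustration of the primer above: once the package (and its transitive opennlp-tools <MavenReference>) is referenced, the Java types can be used directly from C# under their original Java package names. This is a hedged sketch; SimpleTokenizer is a rule-based tokenizer shipped with opennlp-tools, and the exact surface IKVM generates may vary.
using System;
using opennlp.tools.tokenize;

public static class OpenNlpToolsDirectUse
{
    public static void Main()
    {
        // SimpleTokenizer is rule-based and needs no model file, which makes it a
        // convenient smoke test that the Maven dependency resolved correctly.
        string[] tokens = SimpleTokenizer.INSTANCE.tokenize("OpenNLP types, straight from C#.");
        foreach (string token in tokens)
        {
            Console.WriteLine(token);
        }
    }
}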
MavenReference Example
This means this package can be combined with other related Maven packages in your project, and those packages can be accessed using the same package path as in Java, much like a namespace in .NET. For example, you can add a <MavenReference> to your project to include a reference to opennlp-uima. The UIMA (Unstructured Information Management Architecture) integration module is designed to work with the Apache UIMA framework. UIMA is a framework for building applications that analyze unstructured information, and it is often used for processing natural language text. The opennlp-uima module allows you to integrate OpenNLP functionality into UIMA pipelines, leveraging the capabilities of both frameworks.
Here's a basic outline of how you might extend an existing Lucene.NET analyzer to incorporate OpenNLP-UIMA annotators:
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="Lucene.Net.Analysis.OpenNLP" Version="4.8.0-beta00017" />
  </ItemGroup>

  <ItemGroup>
    <MavenReference Include="org.apache.opennlp:opennlp-uima" Version="1.9.1" />
  </ItemGroup>

</Project>
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.OpenNlp;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;
using org.apache.uima.analysis_engine;
using System.IO;

public class CustomOpenNLPAnalyzer : OpenNLPTokenizerFactory
{
    // ... constructor and other methods ...

    public override Tokenizer Create(AttributeFactory factory, TextReader reader)
    {
        // Create the standard OpenNLP tokenizer first.
        Tokenizer tokenizer = base.Create(factory, reader);

        // Wrap the tokenizer with UIMA annotators.
        AnalysisEngineDescription uimaSentenceAnnotator = CreateUIMASentenceAnnotator();
        AnalysisEngineDescription uimaTokenAnnotator = CreateUIMATokenAnnotator();

        // Combine the OpenNLP-UIMA annotators with the existing tokenizer.
        // CreateAggregate and UIMATokenizer are placeholders you would implement
        // on top of the UIMA framework types exposed by the MavenReference.
        AnalysisEngine tokenizerAndUIMAAnnotators = CreateAggregate(uimaSentenceAnnotator, uimaTokenAnnotator);
        return new UIMATokenizer(tokenizer, tokenizerAndUIMAAnnotators);
    }

    // ... other methods ...

    private AnalysisEngineDescription CreateUIMASentenceAnnotator()
    {
        // Create and configure the UIMA sentence annotator description here.
        throw new System.NotImplementedException();
    }

    private AnalysisEngineDescription CreateUIMATokenAnnotator()
    {
        // Create and configure the UIMA token annotator description here.
        throw new System.NotImplementedException();
    }
}
In the above example, CustomOpenNLPAnalyzer extends OpenNLPTokenizerFactory (assuming that is the factory you are using), and it wraps the OpenNLP tokenizer with UIMA annotators. You will need to replace the placeholder methods (CreateUIMASentenceAnnotator and CreateUIMATokenAnnotator) with the actual code to create and configure your UIMA annotators. Please note that configuring NLP can be complex. See the OpenNLP 1.9.4 Manual and OpenNLP UIMA 1.9.4 API Documentation for details.
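A hypothetical usage sketch for the factory above, assuming its constructor accepts the usual IDictionary<string, string> of factory arguments (like its base class) and that the OpenNLP model files are available to a resource loader; the argument names mirror the Solr examples shown later on this page.
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Util;

public static class CustomOpenNLPAnalyzerUsage
{
    public static Tokenizer CreateTokenizer(TextReader reader)
    {
        // Hypothetical constructor call; the model file names are assumptions.
        var factory = new CustomOpenNLPAnalyzer(new Dictionary<string, string>
        {
            ["sentenceModel"] = "en-sent.bin",
            ["tokenizerModel"] = "en-token.bin"
        });

        // The factory must be informed with a resource loader so it can load the
        // OpenNLP models before Create() is called.
        factory.Inform(new FilesystemResourceLoader(new DirectoryInfo("models")));

        // Create() returns the OpenNLP tokenizer wrapped with the UIMA annotators.
        return factory.Create(reader);
    }
}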
Note
IKVM (and <MavenReference>) does not support Java SE higher than version 8, so it will not be possible to add a <MavenReference> to OpenNLP 2.x until support for it is added to IKVM.
For a more complete example, see the lucenenet-opennlp-mavenreference-demo.
Classes
OpenNLPChunkerFilter
Run OpenNLP chunker. Prerequisite: the OpenNLPTokenizer and OpenNLPPOSFilter must precede this filter. Tags terms in the Lucene.Net.Analysis.TokenAttributes.ITypeAttribute, replacing the POS tags previously put there by OpenNLPPOSFilter.
The Lucene.Net.Analysis.Payloads.TypeAsPayloadTokenFilter can be used to copy the POS tag values to Lucene.Net.Analysis.TokenAttributes.IPayloadAttribute, which will index the value. Alternatively, the Lucene.Net.Analysis.Miscellaneous.TypeAsSynonymFilter creates a cloned token at the same position as each tagged token, and copies the Lucene.Net.Analysis.TokenAttributes.ITypeAttribute value to the Lucene.Net.Analysis.TokenAttributes.ICharTermAttribute, optionally with a customized prefix (so that tags effectively occupy a different namespace from token text).
OpenNLPChunkerFilterFactory
Factory for OpenNLPChunkerFilter.
<fieldType name="text_opennlp_chunked" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="filename" tokenizerModel="filename"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="filename"/>
    <filter class="solr.OpenNLPChunkerFilterFactory" chunkerModel="filename"/>
  </analyzer>
</fieldType>
OpenNLPLemmatizerFilter
Runs OpenNLP dictionary-based and/or MaxEnt lemmatizers.
Both a dictionary-based lemmatizer and a MaxEnt lemmatizer are supported, via the "dictionary" and "lemmatizerModel" params, respectively. If both are configured, the dictionary-based lemmatizer is tried first, and then the MaxEnt lemmatizer is consulted for out-of-vocabulary tokens. The dictionary file must be encoded as UTF-8, with one entry per line, in the form word[tab]lemma[tab]part-of-speech (for example, bought[tab]buy[tab]VBD).
OpenNLPLemmatizerFilterFactory
Factory for OpenNLPLemmatizerFilter.
<fieldType name="text_opennlp_lemma" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="filename"
               tokenizerModel="filename"/>
    <filter class="solr.OpenNLPLemmatizerFilterFactory"
            dictionary="filename"
            lemmatizerModel="filename"/>
  </analyzer>
</fieldType>
OpenNLPPOSFilter
Run OpenNLP POS tagger. Tags all terms in the Lucene.Net.Analysis.TokenAttributes.ITypeAttribute.
The Lucene.Net.Analysis.Payloads.TypeAsPayloadTokenFilter can be used to copy the POS tag values to Lucene.Net.Analysis.TokenAttributes.IPayloadAttribute, which will index the value. Alternatively, the Lucene.Net.Analysis.Miscellaneous.TypeAsSynonymFilter creates a cloned token at the same position as each tagged token, and copies the Lucene.Net.Analysis.TokenAttributes.ITypeAttribute value to the Lucene.Net.Analysis.TokenAttributes.ICharTermAttribute, optionally with a customized prefix (so that tags effectively occupy a different namespace from token text).
OpenNLPPOSFilterFactory
Factory for OpenNLPPOSFilter.
<fieldType name="text_opennlp_pos" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="filename" tokenizerModel="filename"/>
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="filename"/>
  </analyzer>
</fieldType>
OpenNLPSentenceBreakIterator
An ICU4N.Text.BreakIterator that splits sentences using an OpenNLP sentence chunking model.
OpenNLPTokenizer
Run OpenNLP SentenceDetector and Lucene.Net.Analysis.Tokenizer. The last token in each sentence is marked by setting the EOS_FLAG_BIT in the Lucene.Net.Analysis.TokenAttributes.IFlagsAttribute; following filters can use this information to apply operations to tokens one sentence at a time.
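To illustrate how a following filter can consume these sentence boundaries, here is a hedged sketch of a trivial filter that checks the flag on each token; the EOS_FLAG_BIT constant name comes from the summary above, while the m_input field name and attribute APIs follow the usual Lucene.NET conventions and are assumptions about this version.
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.OpenNlp;
using Lucene.Net.Analysis.TokenAttributes;

// Logs the last token of each sentence detected by OpenNLPTokenizer.
public sealed class SentenceBoundaryLoggingFilter : TokenFilter
{
    private readonly IFlagsAttribute flagsAtt;
    private readonly ICharTermAttribute termAtt;

    public SentenceBoundaryLoggingFilter(TokenStream input)
        : base(input)
    {
        flagsAtt = AddAttribute<IFlagsAttribute>();
        termAtt = AddAttribute<ICharTermAttribute>();
    }

    public override bool IncrementToken()
    {
        if (!m_input.IncrementToken())
        {
            return false;
        }

        // OpenNLPTokenizer sets EOS_FLAG_BIT on the last token of each sentence.
        if ((flagsAtt.Flags & OpenNLPTokenizer.EOS_FLAG_BIT) != 0)
        {
            Console.WriteLine($"End of sentence at token: {termAtt}");
        }

        return true;
    }
}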
OpenNLPTokenizerFactory
Factory for OpenNLPTokenizer.
<fieldType name="text_opennlp" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory" sentenceModel="filename" tokenizerModel="filename"/>
  </analyzer>
</fieldType>