Class QueryAutoStopWordAnalyzer
A Lucene.Net.Analysis.Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection that prevents very common words from being passed into queries.
For very large indexes, the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38-million-document index that had a term in around 50% of its documents, causing TermQueries for that term to take 2 seconds.
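As a sketch of typical usage (assuming a Lucene.NET 4.8 setup; the index path is illustrative), the analyzer wraps a base analyzer together with an open IndexReader, using the default 40% document-frequency threshold:

```csharp
// Sketch: wrap a StandardAnalyzer at query time so that terms appearing in
// more than the default 40% of documents are dropped from queries.
// The index path below is an example, not a real location.
using Lucene.Net.Analysis.Query;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

using var dir = FSDirectory.Open("/path/to/index");
using IndexReader reader = DirectoryReader.Open(dir);

var baseAnalyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
var analyzer = new QueryAutoStopWordAnalyzer(
    LuceneVersion.LUCENE_48, baseAnalyzer, reader);
// Use `analyzer` when building queries (e.g. with a QueryParser) so that
// very common terms never reach a TermQuery.
```

Because stop words are computed from the supplied reader at construction time, the analyzer reflects the index as of that moment; reconstruct it if the index changes substantially.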
Namespace: Lucene.Net.Analysis.Query
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
public sealed class QueryAutoStopWordAnalyzer : AnalyzerWrapper, IDisposable
Constructors
QueryAutoStopWordAnalyzer(LuceneVersion, Analyzer, IndexReader)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent
Declaration
public QueryAutoStopWordAnalyzer(LuceneVersion matchVersion, Analyzer @delegate, IndexReader indexReader)
Parameters
Type | Name | Description |
---|---|---|
LuceneVersion | matchVersion | Version to be used in StopFilter |
Analyzer | delegate | Lucene.Net.Analysis.Analyzer whose Lucene.Net.Analysis.TokenStream will be filtered |
IndexReader | indexReader | Lucene.Net.Index.IndexReader to identify the stopwords from |
Exceptions
Type | Condition |
---|---|
IOException | Can be thrown while reading from the Lucene.Net.Index.IndexReader |
QueryAutoStopWordAnalyzer(LuceneVersion, Analyzer, IndexReader, ICollection<string>, int)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the
given selection of fields from terms with a document frequency greater than
the given maxDocFreq
Declaration
public QueryAutoStopWordAnalyzer(LuceneVersion matchVersion, Analyzer @delegate, IndexReader indexReader, ICollection<string> fields, int maxDocFreq)
Parameters
Type | Name | Description |
---|---|---|
LuceneVersion | matchVersion | Version to be used in StopFilter |
Analyzer | delegate | Analyzer whose TokenStream will be filtered |
IndexReader | indexReader | Lucene.Net.Index.IndexReader to identify the stopwords from |
ICollection<string> | fields | Selection of fields to calculate stopwords for |
int | maxDocFreq | Document frequency terms should be above in order to be stopwords |
Exceptions
Type | Condition |
---|---|
IOException | Can be thrown while reading from the Lucene.Net.Index.IndexReader |
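A hypothetical sketch of this overload, restricting stop-word detection to a single field and using an absolute threshold (the index path, field name, and threshold value are examples, not recommendations):

```csharp
// Sketch: only the "body" field gets auto stop words, and a term must occur
// in more than 5000 documents to be treated as a stop word.
using Lucene.Net.Analysis.Query;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

using var dir = FSDirectory.Open("/path/to/index");
using IndexReader reader = DirectoryReader.Open(dir);

var analyzer = new QueryAutoStopWordAnalyzer(
    LuceneVersion.LUCENE_48,
    new StandardAnalyzer(LuceneVersion.LUCENE_48),
    reader,
    new[] { "body" },   // fields to scan for stop words
    5000);              // maxDocFreq: absolute document-frequency threshold
```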
QueryAutoStopWordAnalyzer(LuceneVersion, Analyzer, IndexReader, ICollection<string>, float)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the
given selection of fields from terms with a document frequency percentage
greater than the given maxPercentDocs
Declaration
public QueryAutoStopWordAnalyzer(LuceneVersion matchVersion, Analyzer @delegate, IndexReader indexReader, ICollection<string> fields, float maxPercentDocs)
Parameters
Type | Name | Description |
---|---|---|
LuceneVersion | matchVersion | Version to be used in StopFilter |
Analyzer | delegate | Lucene.Net.Analysis.Analyzer whose Lucene.Net.Analysis.TokenStream will be filtered |
IndexReader | indexReader | Lucene.Net.Index.IndexReader to identify the stopwords from |
ICollection<string> | fields | Selection of fields to calculate stopwords for |
float | maxPercentDocs | The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word |
Exceptions
Type | Condition |
---|---|
IOException | Can be thrown while reading from the Lucene.Net.Index.IndexReader |
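A hypothetical sketch of this overload with a percentage threshold; `reader` stands for an already-open Lucene.Net.Index.IndexReader, and the field names and 25% cutoff are illustrative:

```csharp
// Sketch: treat a term as a stop word for the "title" and "body" fields
// when it appears in more than 25% of the index's documents.
var analyzer = new QueryAutoStopWordAnalyzer(
    LuceneVersion.LUCENE_48,
    new StandardAnalyzer(LuceneVersion.LUCENE_48),
    reader,                      // assumed: an open IndexReader
    new[] { "title", "body" },   // fields to calculate stop words for
    0.25f);                      // maxPercentDocs: between 0.0 and 1.0
```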
QueryAutoStopWordAnalyzer(LuceneVersion, Analyzer, IndexReader, int)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all
indexed fields from terms with a document frequency greater than the given
maxDocFreq
Declaration
public QueryAutoStopWordAnalyzer(LuceneVersion matchVersion, Analyzer @delegate, IndexReader indexReader, int maxDocFreq)
Parameters
Type | Name | Description |
---|---|---|
LuceneVersion | matchVersion | Version to be used in StopFilter |
Analyzer | delegate | Lucene.Net.Analysis.Analyzer whose Lucene.Net.Analysis.TokenStream will be filtered |
IndexReader | indexReader | Lucene.Net.Index.IndexReader to identify the stopwords from |
int | maxDocFreq | Document frequency terms should be above in order to be stopwords |
Exceptions
Type | Condition |
---|---|
IOException | Can be thrown while reading from the Lucene.Net.Index.IndexReader |
QueryAutoStopWordAnalyzer(LuceneVersion, Analyzer, IndexReader, float)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all
indexed fields from terms with a document frequency percentage greater than
the given maxPercentDocs
Declaration
public QueryAutoStopWordAnalyzer(LuceneVersion matchVersion, Analyzer @delegate, IndexReader indexReader, float maxPercentDocs)
Parameters
Type | Name | Description |
---|---|---|
LuceneVersion | matchVersion | Version to be used in StopFilter |
Analyzer | delegate | Lucene.Net.Analysis.Analyzer whose Lucene.Net.Analysis.TokenStream will be filtered |
IndexReader | indexReader | Lucene.Net.Index.IndexReader to identify the stopwords from |
float | maxPercentDocs | The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word |
Exceptions
Type | Condition |
---|---|
IOException | Can be thrown while reading from the Lucene.Net.Index.IndexReader |
Fields
defaultMaxDocFreqPercent
The default maximum document frequency percentage (0.4, i.e. 40% of documents) above which a term is considered a stop word. Used by the constructors that do not take an explicit frequency threshold.
Declaration
public const float defaultMaxDocFreqPercent = 0.4f;
Field Value
Type | Description |
---|---|
float |
Methods
GetStopWords()
Provides information on which stop words have been identified for all fields
Declaration
public Term[] GetStopWords()
Returns
Type | Description |
---|---|
Term[] | the stop words (as terms) |
GetStopWords(string)
Provides information on which stop words have been identified for a field
Declaration
public string[] GetStopWords(string fieldName)
Parameters
Type | Name | Description |
---|---|---|
string | fieldName | The field for which the stop words identified at construction time will be returned |
Returns
Type | Description |
---|---|
string[] | the stop words identified for a field |
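A sketch of inspecting the derived stop words; `analyzer` stands for an already-constructed QueryAutoStopWordAnalyzer, and the "body" field name is an example. `Term` comes from Lucene.Net.Index:

```csharp
// Sketch: inspect which stop words the analyzer derived from the index.
string[] bodyStops = analyzer.GetStopWords("body");  // per-field, as strings
Term[] allStops = analyzer.GetStopWords();           // all fields, as Terms
foreach (Term t in allStops)
    Console.WriteLine(t);   // Term.ToString() renders as "field:text"
```

This can be useful for sanity-checking thresholds: if the list is empty, no term exceeded the configured document frequency; if it is very long, the threshold may be too low.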
GetWrappedAnalyzer(string)
Retrieves the wrapped Lucene.Net.Analysis.Analyzer appropriate for analyzing the field with the given name
Declaration
protected override Analyzer GetWrappedAnalyzer(string fieldName)
Parameters
Type | Name | Description |
---|---|---|
string | fieldName | Name of the field which is to be analyzed |
Returns
Type | Description |
---|---|
Analyzer | Lucene.Net.Analysis.Analyzer for the field with the given name. Assumed to be non-null |
Overrides
WrapComponents(string, TokenStreamComponents)
Wraps / alters the given Lucene.Net.Analysis.TokenStreamComponents, taken from the wrapped Lucene.Net.Analysis.Analyzer, to form new components. It is through this method that new Lucene.Net.Analysis.TokenFilters can be added by Lucene.Net.Analysis.AnalyzerWrappers. By default, the given components are returned.
Declaration
protected override TokenStreamComponents WrapComponents(string fieldName, TokenStreamComponents components)
Parameters
Type | Name | Description |
---|---|---|
string | fieldName | Name of the field which is to be analyzed |
TokenStreamComponents | components | Lucene.Net.Analysis.TokenStreamComponents taken from the wrapped Lucene.Net.Analysis.Analyzer |
Returns
Type | Description |
---|---|
TokenStreamComponents | Wrapped / altered Lucene.Net.Analysis.TokenStreamComponents. |