Class TokenSources
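Hides implementation issues associated with obtaining a TokenStream for use with the highlighter. The stream can be obtained from term vectors with offsets and (optionally) positions, or by using an Analyzer to re-parse the stored content.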
Inheritance
Namespace: Lucene.Net.Search.Highlight
Assembly: Lucene.Net.Highlighter.dll
Syntax
public class TokenSources : object
Methods
GetAnyTokenStream(IndexReader, Int32, String, Analyzer)
A convenience method that tries several approaches to obtaining a token stream: it uses a stored term vector when one is available and otherwise re-analyzes the stored field content. Discovering that the index has no term vectors costs very little (1000 invocations still register 0 ms), so this "lazy" (flexible?) approach to coding is probably acceptable.
Declaration
public static TokenStream GetAnyTokenStream(IndexReader reader, int docId, string field, Analyzer analyzer)
Parameters
Type | Name | Description |
---|---|---|
IndexReader | reader | |
System.Int32 | docId | |
System.String | field | |
Analyzer | analyzer | |
Returns
Type | Description |
---|---|
TokenStream | null if the field is not stored correctly |
GetAnyTokenStream(IndexReader, Int32, String, Document, Analyzer)
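A convenience method that first tries to get a TermPositionVector for the specified docId, then falls back to using the passed-in Document to retrieve the TokenStream. This is useful when you already have the document but would prefer to use the vector first.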
Declaration
public static TokenStream GetAnyTokenStream(IndexReader reader, int docId, string field, Document doc, Analyzer analyzer)
Parameters
Type | Name | Description |
---|---|---|
IndexReader | reader | The IndexReader to use to try to get the vector from |
System.Int32 | docId | The docId to retrieve. |
System.String | field | The field to retrieve on the document |
Document | doc | The document to fall back on |
Analyzer | analyzer | The analyzer to use for creating the TokenStream if the vector doesn't exist |
Returns
Type | Description |
---|---|
TokenStream | The TokenStream for the field on the Document |
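The vector-first, document-second order described for this overload can be sketched roughly as follows. This is an illustration only, not the library's actual implementation: the class and method names `TokenSourceFallbackSketch` and `VectorFirst` are hypothetical, and the `HasOffsets` guard is an added assumption to avoid building a stream from a vector that lacks offsets.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search.Highlight;

// Hypothetical helper class; not part of Lucene.NET.
public static class TokenSourceFallbackSketch
{
    public static TokenStream VectorFirst(
        IndexReader reader, int docId, string field, Document doc, Analyzer analyzer)
    {
        // Try the term vector stored for this document and field first.
        Terms vector = reader.GetTermVector(docId, field);

        // Assumption: only use the vector when it carries offsets.
        if (vector != null && vector.HasOffsets)
        {
            return TokenSources.GetTokenStream(vector);
        }

        // No usable vector: re-analyze the field value held by the document.
        return TokenSources.GetTokenStream(doc, field, analyzer);
    }
}
```

In practice, calling GetAnyTokenStream directly is simpler; the sketch only makes the fallback order explicit.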
GetTokenStream(Document, String, Analyzer)
Declaration
public static TokenStream GetTokenStream(Document doc, string field, Analyzer analyzer)
Parameters
Type | Name | Description |
---|---|---|
Document | doc | |
System.String | field | |
Analyzer | analyzer | |
Returns
Type | Description |
---|---|
TokenStream | |
GetTokenStream(IndexReader, Int32, String, Analyzer)
Declaration
public static TokenStream GetTokenStream(IndexReader reader, int docId, string field, Analyzer analyzer)
Parameters
Type | Name | Description |
---|---|---|
IndexReader | reader | |
System.Int32 | docId | |
System.String | field | |
Analyzer | analyzer | |
Returns
Type | Description |
---|---|
TokenStream | |
GetTokenStream(String, String, Analyzer)
Declaration
public static TokenStream GetTokenStream(string field, string contents, Analyzer analyzer)
Parameters
Type | Name | Description |
---|---|---|
System.String | field | |
System.String | contents | |
Analyzer | analyzer | |
Returns
Type | Description |
---|---|
TokenStream | |
GetTokenStream(Terms)
Declaration
public static TokenStream GetTokenStream(Terms vector)
Parameters
Type | Name | Description |
---|---|---|
Terms | vector | |
Returns
Type | Description |
---|---|
TokenStream | |
GetTokenStream(Terms, Boolean)
Low-level API. Returns a token stream generated from a Terms (term vector).
In my tests, the times to recreate 1000 token streams using this method were:
- with TermVector offset-only data stored: 420 milliseconds
- with TermVector offset AND position data stored: 271 milliseconds (NB: timings for TermVector with position data are based on a tokenizer with contiguous positions, with no overlaps or gaps)
The cost of not using TermPositionVector to store pre-parsed content, and instead using an analyzer to re-parse the original content:
- re-analyzing the original content: 980 milliseconds
The re-analysis timings will typically vary depending on:
- The complexity of the analyzer code (the timings above used a stemmer/lowercaser/stopword combination)
- The number of other fields (Lucene reads ALL fields off the disk when accessing just one document field, which can be costly)
- Use of compression on field storage, which could be faster (less disk I/O) or slower (more CPU use) depending on the content
Declaration
public static TokenStream GetTokenStream(Terms tpv, bool tokenPositionsGuaranteedContiguous)
Parameters
Type | Name | Description |
---|---|---|
Terms | tpv | |
System.Boolean | tokenPositionsGuaranteedContiguous | true if the token position numbers have no overlaps or gaps. If you want to eke out the last drops of performance, set to true. If in doubt, set to false. |
Returns
Type | Description |
---|---|
TokenStream | |
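To benefit from the faster term vector path measured above, the field has to be indexed with term vectors that include offsets (and, optionally, positions). A minimal sketch of one way to do that, and to read the vector back, is shown here. The class and method names (`TermVectorSketch`, `CreateVectorField`, `FromVector`) are hypothetical, and returning null when no usable vector exists is a choice made for this illustration.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search.Highlight;

// Hypothetical helper class; not part of Lucene.NET.
public static class TermVectorSketch
{
    // Builds a stored, analyzed field that also keeps term vectors
    // with positions and offsets.
    public static Field CreateVectorField(string name, string value)
    {
        var type = new FieldType(TextField.TYPE_STORED)
        {
            StoreTermVectors = true,
            StoreTermVectorPositions = true,
            StoreTermVectorOffsets = true
        };
        type.Freeze(); // prevent further changes to the field type
        return new Field(name, value, type);
    }

    // Rebuilds a token stream from the stored vector, or returns null
    // (an assumption of this sketch) when no vector with offsets exists.
    public static TokenStream FromVector(IndexReader reader, int docId, string field)
    {
        Terms tpv = reader.GetTermVector(docId, field);
        if (tpv == null || !tpv.HasOffsets)
        {
            return null;
        }

        // "If in doubt, set to false": do not assume contiguous positions.
        return TokenSources.GetTokenStream(tpv, false);
    }
}
```

Passing true for tokenPositionsGuaranteedContiguous is only safe when the analyzer used at index time never emits overlapping tokens or position gaps, for example when no synonym or stop-word filtering is involved.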
GetTokenStreamWithOffsets(IndexReader, Int32, String)
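Returns a TokenStream with positions and offsets constructed from the field's term vectors. If the field has no term vectors, or the term vectors do not include positions or offsets, returns null.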
Declaration
public static TokenStream GetTokenStreamWithOffsets(IndexReader reader, int docId, string field)
Parameters
Type | Name | Description |
---|---|---|
IndexReader | reader | the IndexReader to retrieve term vectors from |
System.Int32 | docId | the document to retrieve term vectors for |
System.String | field | the field to retrieve term vectors for |
Returns
Type | Description |
---|---|
TokenStream | a TokenStream, or null if positions and offsets are not available |
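A caller will usually want a fallback for documents whose term vectors cannot be used. The sketch below assumes that GetTokenStreamWithOffsets returns null in that case and then re-analyzes text the caller already holds. The names `OffsetsFallbackSketch`, `OffsetsOrReanalyze`, and `storedText` are hypothetical.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.Search.Highlight;

// Hypothetical helper class; not part of Lucene.NET.
public static class OffsetsFallbackSketch
{
    public static TokenStream OffsetsOrReanalyze(
        IndexReader reader, int docId, string field, string storedText, Analyzer analyzer)
    {
        // Prefer the stream rebuilt from term vectors with positions and offsets.
        TokenStream ts = TokenSources.GetTokenStreamWithOffsets(reader, docId, field);

        // Assumption: null means the vectors are missing or incomplete,
        // so fall back to re-analyzing the text supplied by the caller.
        if (ts == null)
        {
            ts = TokenSources.GetTokenStream(field, storedText, analyzer);
        }
        return ts;
    }
}
```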