Namespace Lucene.Net.Documents
The logical representation of a Document for indexing and searching.
The document package provides the user level logical representation of content to be indexed and searched. The package also provides utilities for working with Documents and IIndexableFields.
Document and IndexableField
A Document is a collection of IIndexableFields. An IIndexableField is a logical representation of a user's content that needs to be indexed or stored. IIndexableFields have a number of properties that tell Lucene.NET how to treat the content (indexed, tokenized, stored, etc.). See the Field implementation of IIndexableField for specifics on these properties.
Note: it is common to refer to Documents having Fields, even though technically they have IIndexableFields.
Working with Documents
First and foremost, a Document is something created by the user application. It is your job to create Documents based on the content of the files you are working with in your application (Word, txt, PDF, Excel or any other format). How this is done is completely up to you. That being said, there are many tools available in other projects that can simplify the process of taking a file and converting it into a Lucene Document.
DateTools is a utility class to make dates and times searchable (remember, Lucene only searches text). Int32Field, Int64Field, SingleField and DoubleField are special helper classes that simplify indexing of numeric values (and also dates) for fast range queries with NumericRangeQuery<T> (using a special sortable string representation of numeric values).
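As a minimal sketch of the document-building step described above, the following helper turns already-extracted content into a Document. The class name, field names ("id", "title", "body", "modified") and the choice of which fields to store are assumptions made for illustration; it also assumes the Lucene.Net package is referenced.

```csharp
using System;
using Lucene.Net.Documents;

public static class DocumentBuilderSketch
{
    // Hypothetical helper: the caller has already extracted text from the source file.
    public static Document Build(string id, string title, string body, DateTime modifiedUtc)
    {
        var doc = new Document();

        // Indexed as a single token and stored, so it can identify the hit.
        doc.Add(new StringField("id", id, Field.Store.YES));

        // Tokenized full text; the title is stored, the (large) body is not.
        doc.Add(new TextField("title", title, Field.Store.YES));
        doc.Add(new TextField("body", body, Field.Store.NO));

        // Date indexed as a numeric value for range queries.
        doc.Add(new Int64Field("modified", modifiedUtc.Ticks, Field.Store.NO));

        return doc;
    }
}
```

The resulting Document would then be passed to an IndexWriter; how the text is extracted from Word, PDF or other formats is left to the application.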
Classes
BinaryDocValuesField
Field that stores a per-document BytesRef value.
The values are stored directly with no sharing, which is a good fit when the fields don't share (many) values, such as a title field. If values may be shared and sorted it's better to use SortedDocValuesField. Here's an example usage:
document.Add(new BinaryDocValuesField(name, new BytesRef("hello")));
If you also need to store the value, you should add a separate StoredField instance.
ByteDocValuesField
Field that stores a per-document System.Byte value for scoring, sorting or value retrieval. Here's an example usage:
document.Add(new ByteDocValuesField(name, (byte) 22));
If you also need to store the value, you should add a separate StoredField instance.
CompressionTools
Simple utility class providing static methods to compress and decompress binary data for stored fields. This class uses the System.IO.Compression.DeflateStream class to compress and decompress.
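To illustrate the underlying mechanism, here is a small round trip written directly against System.IO.Compression.DeflateStream. It is a sketch of the technique only: the class name DeflateSketch is hypothetical, and the bytes it produces are not claimed to match what CompressionTools emits.

```csharp
using System.IO;
using System.IO.Compression;

public static class DeflateSketch
{
    // Compresses a byte array with DEFLATE.
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            // The DeflateStream must be disposed before reading the result,
            // otherwise the final block may not have been flushed.
            using (var deflate = new DeflateStream(output, CompressionLevel.Optimal, leaveOpen: true))
            {
                deflate.Write(data, 0, data.Length);
            }
            return output.ToArray();
        }
    }

    // Reverses Compress.
    public static byte[] Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

A compressed byte array of this kind is the sort of value one would place in a binary stored field.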
DateTools
Provides support for converting dates to strings and vice-versa. The strings are structured so that lexicographic sorting orders them by date, which makes them suitable for use as field values and search terms.
This class also helps you to limit the resolution of your dates. Do not save dates with a finer resolution than you really need, as then TermRangeQuery and PrefixQuery will require more memory and become slower.
Another approach is NumericUtils, which provides a sortable binary representation (prefix encoded) of numeric values; date/time values are numeric values as well.
To index a System.DateTime, take its System.DateTime.Ticks value (UnixTimeMillisecondsToTicks(Int64) converts a Unix-epoch millisecond timestamp to ticks), index it as a numeric value with Int64Field, and use NumericRangeQuery<T> to query it.
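The idea of date strings whose lexicographic order matches chronological order, at a chosen resolution, can be sketched with plain BCL formatting. The format strings and the class name below are assumptions chosen for illustration and are not claimed to be the exact output of DateTools.

```csharp
using System;
using System.Globalization;

public static class SortableDateSketch
{
    // Day resolution: everything finer than the day is dropped.
    public static string ToDayString(DateTime utc) =>
        utc.ToString("yyyyMMdd", CultureInfo.InvariantCulture);

    // Second resolution: more distinct terms, hence a larger term dictionary.
    public static string ToSecondString(DateTime utc) =>
        utc.ToString("yyyyMMddHHmmss", CultureInfo.InvariantCulture);

    // True when ordinal string order agrees with chronological order.
    public static bool SortsBefore(DateTime earlierUtc, DateTime laterUtc) =>
        string.CompareOrdinal(ToSecondString(earlierUtc), ToSecondString(laterUtc)) < 0;
}
```

Because the most significant unit comes first and every unit is zero-padded to a fixed width, a coarser resolution is simply a prefix of a finer one, which is why limiting the resolution reduces the number of distinct terms.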
DerefBytesDocValuesField
Field that stores a per-document BytesRef value. Here's an example usage:
document.Add(new DerefBytesDocValuesField(name, new BytesRef("hello")));
If you also need to store the value, you should add a separate StoredField instance.
Document
Documents are the unit of indexing and search.
A Document is a set of fields. Each field has a name and a textual value. A field may be stored (IsStored) with the document, in which case it is returned with search hits on the document. Thus each document should typically contain one or more stored fields which uniquely identify it.
Note that fields that are not stored (IsStored is false) are not available in documents retrieved from the index, e.g. with Doc or Document(Int32).
DocumentStoredFieldVisitor
A StoredFieldVisitor that creates a Document containing all stored fields, or only specific requested fields provided to DocumentStoredFieldVisitor(ISet<String>).
This is used by Document(Int32) to load a document.
Note
This API is experimental and might change in incompatible ways in the next release.
DoubleDocValuesField
Syntactic sugar for encoding doubles as NumericDocValues via J2N.BitConversion.DoubleToRawInt64Bits(System.Double).
Per-document double values can be retrieved via GetDoubles(AtomicReader, String, Boolean).
NOTE: In almost all cases this will be rather inefficient, requiring eight bytes per document. Consider encoding double values yourself with only as much precision as you require.
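The two options in the note above can be contrasted in a short sketch: keeping the full 64-bit pattern of a double, or encoding it yourself as a fixed-point integer. The class name and the choice of two decimal places are assumptions for illustration; BitConverter is used here as the BCL counterpart of the J2N conversion named above.

```csharp
using System;

public static class DoubleEncodingSketch
{
    // Full precision: the raw 64-bit pattern of the double.
    public static long ToRawBits(double value) => BitConverter.DoubleToInt64Bits(value);
    public static double FromRawBits(long bits) => BitConverter.Int64BitsToDouble(bits);

    // Reduced precision: fixed-point with two decimals (an illustrative choice).
    // Small integers like these leave room for a more compact per-document encoding.
    public static long ToHundredths(double value) =>
        (long)Math.Round(value * 100.0, MidpointRounding.AwayFromZero);
    public static double FromHundredths(long hundredths) => hundredths / 100.0;
}
```

A value encoded with ToHundredths could be indexed with a NumericDocValuesField and decoded by the application when read back; whether this saves space in practice depends on the value range and the codec in use.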
DoubleField
Field that indexes System.Double values for efficient range filtering and sorting. Here's an example usage:
document.Add(new DoubleField(name, 6.0, Field.Store.NO));
For optimal performance, re-use the DoubleField and Document instance for more than one document:
DoubleField field = new DoubleField(name, 0.0, Field.Store.NO);
Document document = new Document();
document.Add(field);
foreach (double value in values) // "values" stands for the per-document values to index
{
    ...
    field.SetDoubleValue(value);
    writer.AddDocument(document);
    ...
}
See also Int32Field, Int64Field, SingleField.
To perform range querying or filtering against a DoubleField, use NumericRangeQuery<T> or NumericRangeFilter<T>. To sort according to a DoubleField, use the normal numeric sort types, e.g. DOUBLE. DoubleField values can also be loaded directly from IFieldCache.
You may add the same field name as a DoubleField to the same document more than once. Range querying and filtering will be the logical OR of all values; so a range query will hit all documents that have at least one value in the range. However sort behavior is not defined. If you need to sort, you should separately index a single-valued DoubleField.
A DoubleField will consume somewhat more disk space in the index than an ordinary single-valued field. However, for a typical index that includes substantial textual content per document, this increase will likely be in the noise.
Within Lucene, each numeric value is indexed as a trie structure, where each term is logically assigned to larger and larger pre-defined brackets (which are simply lower-precision representations of the value). The step size between each successive bracket is called the precisionStep, measured in bits. Smaller precisionStep values result in a larger number of brackets, which consumes more disk space in the index but may result in faster range search performance. The default value, 4, was selected for a reasonable tradeoff of disk space consumption versus performance. You can create a custom FieldType and invoke the NumericPrecisionStep setter if you'd like to change the value. Note that you must also specify a congruent value when creating NumericRangeQuery<T> or NumericRangeFilter<T>. For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is fair to use System.Int32.MaxValue, which produces one term per value.
For more information on the internals of numeric trie indexing, including the precisionStep configuration, see NumericRangeQuery<T>. The format of indexed values is described in NumericUtils.
If you only need to sort by numeric value, and never run range querying/filtering, you can index using a precisionStep of System.Int32.MaxValue. This will minimize disk space consumed.
More advanced users can instead use NumericTokenStream directly, when indexing numbers. This class is a wrapper around this token stream type for easier, more intuitive usage.
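The bracket idea behind precisionStep can be made concrete with a simplified model: each bracket is the value with its lowest bits shifted away, and the shift grows by precisionStep each time. This is an illustrative sketch only; the class name is hypothetical, and the real term format (prefix coding, sortable bit patterns) described in NumericUtils is not reproduced here.

```csharp
using System;
using System.Collections.Generic;

public static class PrecisionStepSketch
{
    // Lists the (shift, bracket) pairs a 64-bit value would be assigned to
    // under this simplified model.
    public static List<(int Shift, long Bracket)> Brackets(long value, int precisionStep)
    {
        if (precisionStep < 1) throw new ArgumentOutOfRangeException(nameof(precisionStep));

        var result = new List<(int Shift, long Bracket)>();
        // A long loop variable avoids overflow when precisionStep is int.MaxValue.
        for (long shift = 0; shift < 64; shift += precisionStep)
        {
            // Dropping the lowest 'shift' bits yields a lower-precision bracket.
            result.Add(((int)shift, value >> (int)shift));
        }
        return result;
    }
}
```

Under this model a precisionStep of 4 gives 16 brackets per 64-bit value, a step of 8 gives 8, and System.Int32.MaxValue gives exactly one (the full-precision value), which matches the trade-off described above: more brackets cost more index space but give range queries coarser terms to match against.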
@since 2.9
Field
Expert: directly create a field for a document. Most users should use one of the sugar subclasses: Int32Field, Int64Field, SingleField, DoubleField, BinaryDocValuesField, NumericDocValuesField, SortedDocValuesField, StringField, TextField, StoredField.
A field is a section of a Document. Each field has three parts: name, type and value. Values may be text (System.String, System.IO.TextReader or pre-analyzed TokenStream), binary (byte[]), or numeric (System.Int32, System.Int64, System.Single, or System.Double). Fields are optionally stored in the index, so that they may be returned with hits on the document.
NOTE: the field type is an IIndexableFieldType. Making changes to the state of the IIndexableFieldType will impact any Field it is used in. It is strongly recommended that no changes be made after Field instantiation.
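One way to follow the recommendation above is to configure a FieldType once, freeze it, and then share it between fields. The sketch below assumes the Lucene.Net package is referenced; the class name, the "body" field name and the decision to enable term vectors are illustrative choices.

```csharp
using Lucene.Net.Documents;

public static class FieldTypeSketch
{
    // Configured once, then shared by every field created below.
    public static readonly FieldType StoredWithVectors = Create();

    private static FieldType Create()
    {
        // Start from a copy of an existing type so the shared TextField.TYPE_STORED is not modified.
        var type = new FieldType(TextField.TYPE_STORED)
        {
            StoreTermVectors = true
        };
        // Freezing is intended to prevent later changes to the type's state.
        type.Freeze();
        return type;
    }

    public static Field NewBodyField(string text) =>
        new Field("body", text, StoredWithVectors);
}
```

Copying before customizing keeps the change local to the new type, so other fields that use the original type are unaffected.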
FieldExtensions
LUCENENET specific extension methods to add functionality to enumerations that mimic Lucene.
FieldType
Describes the properties of a field.
Int16DocValuesField
Field that stores a per-document System.Int16 value for scoring, sorting or value retrieval. Here's an example usage:
document.Add(new Int16DocValuesField(name, (short) 22));
If you also need to store the value, you should add a separate StoredField instance.
NOTE: This was ShortDocValuesField in Lucene
Int32DocValuesField
Field that stores a per-document System.Int32 value for scoring, sorting or value retrieval. Here's an example usage:
document.Add(new Int32DocValuesField(name, 22));
If you also need to store the value, you should add a separate StoredField instance.
NOTE: This was IntDocValuesField in Lucene
Int32Field
Field that indexes System.Int32 values for efficient range filtering and sorting. Here's an example usage:
document.Add(new Int32Field(name, 6, Field.Store.NO));
For optimal performance, re-use the Int32Field and Document instance for more than one document:
Int32Field field = new Int32Field(name, 6, Field.Store.NO);
Document document = new Document();
document.Add(field);
foreach (int value in values) // "values" stands for the per-document values to index
{
    ...
    field.SetInt32Value(value);
    writer.AddDocument(document);
    ...
}
See also Int64Field, SingleField, DoubleField.
To perform range querying or filtering against an Int32Field, use NumericRangeQuery<T> or NumericRangeFilter<T>. To sort according to an Int32Field, use the normal numeric sort types, e.g. INT32. Int32Field values can also be loaded directly from IFieldCache.
You may add the same field name as an Int32Field to the same document more than once. Range querying and filtering will be the logical OR of all values; so a range query will hit all documents that have at least one value in the range. However sort behavior is not defined. If you need to sort, you should separately index a single-valued Int32Field.
An Int32Field will consume somewhat more disk space in the index than an ordinary single-valued field. However, for a typical index that includes substantial textual content per document, this increase will likely be in the noise.
Within Lucene, each numeric value is indexed as a trie structure, where each term is logically assigned to larger and larger pre-defined brackets (which are simply lower-precision representations of the value). The step size between each successive bracket is called the precisionStep, measured in bits. Smaller precisionStep values result in a larger number of brackets, which consumes more disk space in the index but may result in faster range search performance. The default value, 4, was selected for a reasonable tradeoff of disk space consumption versus performance. You can create a custom FieldType and invoke the NumericPrecisionStep setter if you'd like to change the value. Note that you must also specify a congruent value when creating NumericRangeQuery<T> or NumericRangeFilter<T>. For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is fair to use System.Int32.MaxValue, which produces one term per value.
For more information on the internals of numeric trie indexing, including the precisionStep configuration, see NumericRangeQuery<T>. The format of indexed values is described in NumericUtils.
If you only need to sort by numeric value, and never run range querying/filtering, you can index using a precisionStep of System.Int32.MaxValue. This will minimize disk space consumed.
More advanced users can instead use NumericTokenStream directly, when indexing numbers. This class is a wrapper around this token stream type for easier, more intuitive usage.
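A sketch of changing the precision step and keeping the query congruent with it might look as follows. It assumes the Lucene.Net package is referenced; the class name, the "price" field name and the step value of 8 are illustrative assumptions, not recommendations.

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Search;

public static class PrecisionStepUsageSketch
{
    // Illustrative value; it must be the same at index time and query time.
    public const int Step = 8;

    public static readonly FieldType PriceType = CreateType();

    private static FieldType CreateType()
    {
        // Copy the stock Int32Field type and override only the precision step.
        var type = new FieldType(Int32Field.TYPE_NOT_STORED)
        {
            NumericPrecisionStep = Step
        };
        type.Freeze();
        return type;
    }

    public static Document BuildDocument(int price)
    {
        var doc = new Document();
        doc.Add(new Int32Field("price", price, PriceType));
        return doc;
    }

    // The query is created with the same precision step used for indexing.
    public static Query PriceBetween(int min, int max) =>
        NumericRangeQuery.NewInt32Range("price", Step, min, max, true, true);
}
```

Keeping the step in a single constant shared by the indexing and querying code is one simple way to avoid the two drifting apart.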
NOTE: This was IntField in Lucene
@since 2.9
Int64DocValuesField
Field that stores a per-document System.Int64 value for scoring, sorting or value retrieval. Here's an example usage:
document.Add(new Int64DocValuesField(name, 22L));
If you also need to store the value, you should add a separate StoredField instance.
NOTE: This was LongDocValuesField in Lucene
Int64Field
Field that indexes System.Int64 values for efficient range filtering and sorting. Here's an example usage:
document.Add(new Int64Field(name, 6L, Field.Store.NO));
For optimal performance, re-use the Int64Field and Document instance for more than one document:
Int64Field field = new Int64Field(name, 0L, Field.Store.NO);
Document document = new Document();
document.Add(field);
foreach (long value in values) // "values" stands for the per-document values to index
{
    ...
    field.SetInt64Value(value);
    writer.AddDocument(document);
    ...
}
See also Int32Field, SingleField, DoubleField.
Any type that can be converted to long can also be indexed. For example, date/time values represented by a System.DateTime can be translated into a long value using the System.DateTime.Ticks property. If you don't need millisecond precision, you can quantize the value, either by dividing the result of System.DateTime.Ticks or by using the separate getters (for year, month, etc.) to construct a System.Int32 or System.Int64 value.
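Both quantization approaches mentioned above can be sketched with the BCL alone. The class and method names are hypothetical, and day resolution is just one possible choice.

```csharp
using System;

public static class DateQuantizationSketch
{
    // Approach 1: divide the tick count, here down to whole days.
    public static long ToDayNumber(DateTime utc) => utc.Ticks / TimeSpan.TicksPerDay;

    // Approach 2: combine the separate getters into a single integer (yyyyMMdd).
    public static int ToYearMonthDay(DateTime utc) =>
        utc.Year * 10000 + utc.Month * 100 + utc.Day;
}
```

Either result can be indexed as an Int64Field or Int32Field value. All timestamps that fall on the same day map to the same number, so the index holds far fewer distinct values than it would with raw ticks.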
To perform range querying or filtering against an Int64Field, use NumericRangeQuery<T> or NumericRangeFilter<T>. To sort according to an Int64Field, use the normal numeric sort types, e.g. INT64. Int64Field values can also be loaded directly from IFieldCache.
You may add the same field name as an Int64Field to the same document more than once. Range querying and filtering will be the logical OR of all values; so a range query will hit all documents that have at least one value in the range. However sort behavior is not defined. If you need to sort, you should separately index a single-valued Int64Field.
An Int64Field will consume somewhat more disk space in the index than an ordinary single-valued field. However, for a typical index that includes substantial textual content per document, this increase will likely be in the noise.
Within Lucene, each numeric value is indexed as a trie structure, where each term is logically assigned to larger and larger pre-defined brackets (which are simply lower-precision representations of the value). The step size between each successive bracket is called the precisionStep, measured in bits. Smaller precisionStep values result in a larger number of brackets, which consumes more disk space in the index but may result in faster range search performance. The default value, 4, was selected for a reasonable tradeoff of disk space consumption versus performance. You can create a custom FieldType and invoke the NumericPrecisionStep setter if you'd like to change the value. Note that you must also specify a congruent value when creating NumericRangeQuery<T> or NumericRangeFilter<T>. For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is fair to use System.Int32.MaxValue, which produces one term per value.
For more information on the internals of numeric trie indexing, including the precisionStep configuration, see NumericRangeQuery<T>. The format of indexed values is described in NumericUtils.
If you only need to sort by numeric value, and never run range querying/filtering, you can index using a precisionStep of System.Int32.MaxValue. This will minimize disk space consumed.
More advanced users can instead use NumericTokenStream directly, when indexing numbers. This class is a wrapper around this token stream type for easier, more intuitive usage.
NOTE: This was LongField in Lucene
@since 2.9
NumericDocValuesField
Field that stores a per-document System.Int64 value for scoring, sorting or value retrieval. Here's an example usage:
document.Add(new NumericDocValuesField(name, 22L));
If you also need to store the value, you should add a separate StoredField instance.
PackedInt64DocValuesField
Field that stores a per-document System.Int64 value for scoring, sorting or value retrieval. Here's an example usage:
document.Add(new PackedInt64DocValuesField(name, 22L));
If you also need to store the value, you should add a separate StoredField instance.
NOTE: This was PackedLongDocValuesField in Lucene
SingleDocValuesField
Syntactic sugar for encoding floats as NumericDocValues via J2N.BitConversion.SingleToRawInt32Bits(System.Single).
Per-document floating point values can be retrieved via GetSingles(AtomicReader, String, Boolean).
NOTE: In almost all cases this will be rather inefficient, requiring four bytes per document. Consider encoding floating point values yourself with only as much precision as you require.
NOTE: This was FloatDocValuesField in Lucene
SingleField
Field that indexes System.Single values for efficient range filtering and sorting. Here's an example usage:
document.Add(new SingleField(name, 6.0F, Field.Store.NO));
For optimal performance, re-use the SingleField and Document instance for more than one document:
SingleField field = new SingleField(name, 0.0F, Field.Store.NO);
Document document = new Document();
document.Add(field);
foreach (float value in values) // "values" stands for the per-document values to index
{
    ...
    field.SetSingleValue(value);
    writer.AddDocument(document);
    ...
}
See also Int32Field, Int64Field, DoubleField.
To perform range querying or filtering against a SingleField, use NumericRangeQuery<T> or NumericRangeFilter<T>. To sort according to a SingleField, use the normal numeric sort types, e.g. SINGLE. SingleField values can also be loaded directly from IFieldCache.
You may add the same field name as a SingleField to the same document more than once. Range querying and filtering will be the logical OR of all values; so a range query will hit all documents that have at least one value in the range. However sort behavior is not defined. If you need to sort, you should separately index a single-valued SingleField.
A SingleField will consume somewhat more disk space in the index than an ordinary single-valued field. However, for a typical index that includes substantial textual content per document, this increase will likely be in the noise.
Within Lucene, each numeric value is indexed as a trie structure, where each term is logically assigned to larger and larger pre-defined brackets (which are simply lower-precision representations of the value). The step size between each successive bracket is called the precisionStep, measured in bits. Smaller precisionStep values result in a larger number of brackets, which consumes more disk space in the index but may result in faster range search performance. The default value, 4, was selected for a reasonable tradeoff of disk space consumption versus performance. You can create a custom FieldType and invoke the NumericPrecisionStep setter if you'd like to change the value. Note that you must also specify a congruent value when creating NumericRangeQuery<T> or NumericRangeFilter<T>. For low cardinality fields larger precision steps are good. If the cardinality is < 100, it is fair to use System.Int32.MaxValue, which produces one term per value.
For more information on the internals of numeric trie indexing, including the precisionStep configuration, see NumericRangeQuery<T>. The format of indexed values is described in NumericUtils.
If you only need to sort by numeric value, and never run range querying/filtering, you can index using a precisionStep of System.Int32.MaxValue. This will minimize disk space consumed.
More advanced users can instead use NumericTokenStream directly, when indexing numbers. This class is a wrapper around this token stream type for easier, more intuitive usage.
NOTE: This was FloatField in Lucene
@since 2.9
SortedBytesDocValuesField
Field that stores a per-document BytesRef value, indexed for sorting. Here's an example usage:
document.Add(new SortedBytesDocValuesField(name, new BytesRef("hello")));
If you also need to store the value, you should add a separate StoredField instance.
SortedDocValuesField
Field that stores a per-document BytesRef value, indexed for sorting. Here's an example usage:
document.Add(new SortedDocValuesField(name, new BytesRef("hello")));
If you also need to store the value, you should add a separate StoredField instance.
SortedSetDocValuesField
Field that stores a set of per-document BytesRef values, indexed for faceting, grouping and joining. Here's an example usage:
document.Add(new SortedSetDocValuesField(name, new BytesRef("hello")));
document.Add(new SortedSetDocValuesField(name, new BytesRef("world")));
If you also need to store the value, you should add a separate StoredField instance.
StoredField
A field whose value is stored so that Doc(Int32) and Document(Int32) will return the field and its value.
StraightBytesDocValuesField
Field that stores a per-document BytesRef value. If values may be shared it's better to use SortedDocValuesField. Here's an example usage:
document.Add(new StraightBytesDocValuesField(name, new BytesRef("hello")));
If you also need to store the value, you should add a separate StoredField instance.
StringField
A field that is indexed but not tokenized: the entire System.String value is indexed as a single token. For example, this might be used for a 'country' field or an 'id' field, or any field that you intend to use for sorting or access through the field cache.
TextField
A field that is indexed and tokenized, without term vectors. For example, this would be used on a 'body' field that contains the bulk of a document's text.
Enums
DateResolution
Specifies the time granularity.
Field.Index
Specifies whether and how a field should be indexed.
Field.Store
Specifies whether and how a field should be stored.
Field.TermVector
Specifies whether and how a field should have term vectors.
NumericFieldType
Data type of the numeric IIndexableField value
NumericRepresentation
Specifies how a time will be represented as a System.Int64.
NumericType
Data type of the numeric value.
@since 3.2