Namespace Lucene.Net.Documents

The logical representation of a Document for indexing and searching.

The document package provides the user-level logical representation of content to be indexed and searched. The package also provides utilities for working with Documents and IIndexableFields.

Document and IndexableField

A Document is a collection of IIndexableFields. An IIndexableField is a logical representation of a user's content that needs to be indexed or stored. IIndexableFields have a number of properties that tell Lucene.NET how to treat the content (such as indexed, tokenized, stored, etc.). See the Field implementation of IIndexableField for specifics on these properties.

Note: it is common to refer to Documents having Fields, even though technically they have IIndexableFields.

Working with Documents

First and foremost, a Document is something created by the user application. It is your job to create Documents based on the content of the files you are working with in your application (Word, txt, PDF, Excel or any other format). How this is done is completely up to you. That being said, there are many tools available in other projects that can simplify the process of taking a file and converting it into a Lucene Document.

DateTools is a utility class that makes dates and times searchable (remember, Lucene only searches text). Int32Field, Int64Field, SingleField and DoubleField are special helper classes that simplify indexing of numeric values (and also dates) for fast range queries with NumericRangeQuery<T> (using a special sortable string representation of numeric values).

Classes

BinaryDocValuesField

Field that stores a per-document BytesRef value.

The values are stored directly with no sharing, which is a good fit when the fields don't share (many) values, such as a title field. If values may be shared and sorted it's better to use SortedDocValuesField. Here's an example usage:

  document.Add(new BinaryDocValuesField(name, new BytesRef("hello")));

If you also need to store the value, you should add a separate StoredField instance.

ByteDocValuesField

Field that stores a per-document System.Byte value for scoring, sorting or value retrieval. Here's an example usage:

  document.Add(new ByteDocValuesField(name, (byte) 22));

If you also need to store the value, you should add a separate StoredField instance.

CompressionTools

Simple utility class providing static methods to compress and decompress binary data for stored fields. This class uses the System.IO.Compression.DeflateStream class to compress and decompress.

DateTools

Provides support for converting dates to strings and vice-versa. The strings are structured so that lexicographic sorting orders them by date, which makes them suitable for use as field values and search terms.

This class also helps you to limit the resolution of your dates. Do not save dates with a finer resolution than you really need, as then TermRangeQuery and PrefixQuery will require more memory and become slower.
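The effect of limiting resolution can be illustrated without Lucene.NET at all. The following is a minimal sketch that uses only the .NET base class library; it is not the DateTools implementation, and the class name, method name and format strings are assumptions chosen for illustration. It shows that fixed-width, zero-padded date strings sort in chronological order, and that a coarser resolution produces fewer distinct strings.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class SortableDateSketch
{
    // Formats a timestamp at day or second resolution. Every component is
    // fixed-width and zero-padded, so ordinal string comparison orders the
    // results the same way the underlying dates are ordered.
    public static string ToSortableString(DateTime value, bool dayResolution)
    {
        string format = dayResolution ? "yyyyMMdd" : "yyyyMMddHHmmss";
        return value.ToString(format, CultureInfo.InvariantCulture);
    }
}
```

At day resolution, every timestamp within the same day maps to the same string, so the index holds at most one term per day for that field rather than one per distinct instant.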

Another approach is NumericUtils, which provides a sortable, prefix-encoded binary representation of numeric values; dates and times can be treated as such values.

To index a System.DateTime, take its tick count (System.DateTime.Ticks; see also UnixTimeMillisecondsToTicks(Int64)), index it as a numeric value with Int64Field, and query it with NumericRangeQuery<T>.

DerefBytesDocValuesField

Field that stores a per-document BytesRef value. Here's an example usage:

  document.Add(new DerefBytesDocValuesField(name, new BytesRef("hello")));

If you also need to store the value, you should add a separate StoredField instance.

Document

Documents are the unit of indexing and search.

A Document is a set of fields. Each field has a name and a textual value. A field may be stored (IsStored) with the document, in which case it is returned with search hits on the document. Thus each document should typically contain one or more stored fields which uniquely identify it.

Note that fields which are not IsStored are not available in documents retrieved from the index, e.g. with Doc or Document(Int32).

DocumentStoredFieldVisitor

A StoredFieldVisitor that creates a Document containing all stored fields, or only specific requested fields provided to DocumentStoredFieldVisitor(ISet<String>).

This is used by Document(Int32) to load a document.

Note

This API is experimental and might change in incompatible ways in the next release.

DoubleDocValuesField

Syntactic sugar for encoding doubles as NumericDocValues via J2N.BitConversion.DoubleToRawInt64Bits(System.Double).

Per-document double values can be retrieved via GetDoubles(AtomicReader, String, Boolean).

NOTE: In almost all cases this will be rather inefficient, requiring eight bytes per document. Consider encoding double values yourself with only as much precision as you require.
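The two encodings mentioned above can be compared with a small sketch that uses only the .NET base class library. It is an illustration under stated assumptions, not the DoubleDocValuesField implementation: BitConverter.DoubleToInt64Bits stands in for the J2N conversion, and the class name, the method names and the choice of two decimal places are hypothetical.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class DoubleEncodingSketch
{
    // Full-precision route: reinterpret the 64 bits of the double as a long.
    public static long ToBits(double value) => BitConverter.DoubleToInt64Bits(value);

    // Inverse of ToBits; the round trip is exact.
    public static double FromBits(long bits) => BitConverter.Int64BitsToDouble(bits);

    // Reduced-precision route: keep two decimal places as a scaled integer.
    public static long ToHundredths(double value) => (long)Math.Round(value * 100.0);

    // Inverse of ToHundredths, accurate to the two retained decimal places.
    public static double FromHundredths(long scaled) => scaled / 100.0;
}
```

The scaled value could then be indexed through an integer doc values field such as NumericDocValuesField. Whether that actually saves space depends on the range of values and on the codec in use, so treat it as something to measure rather than a guarantee.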

DoubleField

Field that indexes System.Double values for efficient range filtering and sorting. Here's an example usage:

    document.Add(new DoubleField(name, 6.0, Field.Store.NO));

For optimal performance, re-use the DoubleField and Document instance for more than one document:

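    // Assumes `writer` is an open IndexWriter and `values` is your own sequence of doubles.
    DoubleField field = new DoubleField(name, 0.0, Field.Store.NO);
    Document document = new Document();
    document.Add(field);

    foreach (double value in values)
    {
        // ... update any other reused fields here ...
        field.SetDoubleValue(value);
        writer.AddDocument(document);
    }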

See also Int32Field, Int64Field, SingleField.

To perform range querying or filtering against a DoubleField, use NumericRangeQuery<T> or NumericRangeFilter<T>. To sort according to a DoubleField, use the normal numeric sort types, e.g. DOUBLE. DoubleField values can also be loaded directly from IFieldCache.

You may add the same field name as a DoubleField to the same document more than once. Range querying and filtering will be the logical OR of all values, so a range query will hit all documents that have at least one value in the range. However, sort behavior is not defined. If you need to sort, you should separately index a single-valued DoubleField.

A DoubleField will consume somewhat more disk space in the index than an ordinary single-valued field. However, for a typical index that includes substantial textual content per document, this increase will likely be in the noise.

Within Lucene, each numeric value is indexed as a trie structure, where each term is logically assigned to larger and larger pre-defined brackets (which are simply lower-precision representations of the value). The step size between each successive bracket is called the precisionStep, measured in bits. Smaller precisionStep values result in a larger number of brackets, which consumes more disk space in the index but may result in faster range search performance. The default value, 4, was selected for a reasonable tradeoff of disk space consumption versus performance. You can create a custom FieldType and invoke the NumericPrecisionStep setter if you'd like to change the value. Note that you must also specify a congruent value when creating NumericRangeQuery<T> or NumericRangeFilter<T>. For low-cardinality fields, larger precision steps are good. If the cardinality is < 100, it is fair to use System.Int32.MaxValue, which produces one term per value.
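The relationship between precisionStep and the number of brackets can be made concrete with a small sketch. This is not Lucene's term encoding (the real format is described in NumericUtils); it only computes the lower-precision values for a 64-bit number by dropping successive groups of low bits. The class and method names are hypothetical, and non-negative inputs are assumed to keep the arithmetic shift easy to read.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class PrecisionStepSketch
{
    // Returns the "brackets" for a 64-bit value: the value with 0, step,
    // 2*step, ... low bits dropped. One entry corresponds to one indexed term.
    public static List<long> Brackets(long value, int precisionStep)
    {
        if (precisionStep < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(precisionStep));
        }

        var brackets = new List<long>();

        // A long counter avoids overflow when precisionStep is int.MaxValue.
        for (long shift = 0; shift < 64; shift += precisionStep)
        {
            brackets.Add(value >> (int)shift);
        }

        return brackets;
    }
}
```

With a precisionStep of 4 the sketch yields 16 brackets per 64-bit value, with 8 it yields 8, and with System.Int32.MaxValue it yields exactly one, which matches the "one term per value" case described above.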

For more information on the internals of numeric trie indexing, including the PrecisionStep (precisionStep) configuration, see NumericRangeQuery<T>. The format of indexed values is described in NumericUtils.

If you only need to sort by numeric value, and never run range querying/filtering, you can index using a precisionStep of System.Int32.MaxValue. This will minimize disk space consumed.

More advanced users can instead use NumericTokenStream directly, when indexing numbers. This class is a wrapper around this token stream type for easier, more intuitive usage.

@since 2.9

Field

Expert: directly create a field for a document. Most users should use one of the sugar subclasses: Int32Field, Int64Field, SingleField, DoubleField, BinaryDocValuesField, NumericDocValuesField, SortedDocValuesField, StringField, TextField, StoredField.

A field is a section of a Document. Each field has three parts: name, type and value. Values may be text (System.String, System.IO.TextReader or pre-analyzed TokenStream), binary (byte[]), or numeric (System.Int32, System.Int64, System.Single, or System.Double). Fields are optionally stored in the index, so that they may be returned with hits on the document.

NOTE: the field type is an IIndexableFieldType. Making changes to the state of the IIndexableFieldType will impact any Field it is used in. It is strongly recommended that no changes be made after Field instantiation.

FieldExtensions

LUCENENET-specific extension methods that add functionality to enumerations that mimic Lucene.

FieldType

Describes the properties of a field.

Int16DocValuesField

Field that stores a per-document System.Int16 value for scoring, sorting or value retrieval. Here's an example usage:

    document.Add(new Int16DocValuesField(name, (short) 22));

If you also need to store the value, you should add a separate StoredField instance.

NOTE: This was ShortDocValuesField in Lucene

Int32DocValuesField

Field that stores a per-document System.Int32 value for scoring, sorting or value retrieval. Here's an example usage:

    document.Add(new Int32DocValuesField(name, 22));

If you also need to store the value, you should add a separate StoredField instance.

NOTE: This was IntDocValuesField in Lucene

Int32Field

Field that indexes System.Int32 values for efficient range filtering and sorting. Here's an example usage:

    document.Add(new Int32Field(name, 6, Field.Store.NO));

For optimal performance, re-use the Int32Field and Document instance for more than one document:

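    // Assumes `writer` is an open IndexWriter and `values` is your own sequence of ints.
    Int32Field field = new Int32Field(name, 6, Field.Store.NO);
    Document document = new Document();
    document.Add(field);

    foreach (int value in values)
    {
        // ... update any other reused fields here ...
        field.SetInt32Value(value);
        writer.AddDocument(document);
    }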

See also Int64Field, SingleField, DoubleField.

To perform range querying or filtering against an Int32Field, use NumericRangeQuery<T> or NumericRangeFilter<T>. To sort according to an Int32Field, use the normal numeric sort types, e.g. INT32. Int32Field values can also be loaded directly from IFieldCache.

You may add the same field name as an Int32Field to the same document more than once. Range querying and filtering will be the logical OR of all values; so a range query will hit all documents that have at least one value in the range. However sort behavior is not defined. If you need to sort, you should separately index a single-valued Int32Field.

An Int32Field will consume somewhat more disk space in the index than an ordinary single-valued field. However, for a typical index that includes substantial textual content per document, this increase will likely be in the noise.

For more information on the internals of numeric trie indexing, including the PrecisionStep (precisionStep) configuration, see NumericRangeQuery<T>. The format of indexed values is described in NumericUtils.

If you only need to sort by numeric value, and never run range querying/filtering, you can index using a precisionStep of System.Int32.MaxValue. This will minimize disk space consumed.

More advanced users can instead use NumericTokenStream directly, when indexing numbers. This class is a wrapper around this token stream type for easier, more intuitive usage.

NOTE: This was IntField in Lucene

@since 2.9

Int64DocValuesField

Field that stores a per-document System.Int64 value for scoring, sorting or value retrieval. Here's an example usage:

    document.Add(new Int64DocValuesField(name, 22L));

If you also need to store the value, you should add a separate StoredField instance.

NOTE: This was LongDocValuesField in Lucene

Int64Field

Field that indexes System.Int64 values for efficient range filtering and sorting. Here's an example usage:

    document.Add(new Int64Field(name, 6L, Field.Store.NO));

For optimal performance, re-use the Int64Field and Document instance for more than one document:

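    // Assumes `writer` is an open IndexWriter and `values` is your own sequence of longs.
    Int64Field field = new Int64Field(name, 0L, Field.Store.NO);
    Document document = new Document();
    document.Add(field);

    foreach (long value in values)
    {
        // ... update any other reused fields here ...
        field.SetInt64Value(value);
        writer.AddDocument(document);
    }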

See also Int32Field, SingleField, DoubleField.

Any type that can be converted to long can also be indexed. For example, date/time values represented by a System.DateTime can be translated into a long value using the System.DateTime.Ticks property. If you don't need tick-level (100-nanosecond) precision, you can quantize the value, either by dividing the result of System.DateTime.Ticks or by using the separate properties (for year, month, etc.) to construct a System.Int32 or System.Int64 value.
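Both quantization routes can be sketched with the .NET base class library alone. The class name, the method names and the choice of day resolution are assumptions made for illustration; the resulting values are the kind of number you might then pass to an Int64Field or Int32Field.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class TickQuantizationSketch
{
    // Full resolution: the raw tick count of the DateTime.
    public static long ToTicks(DateTime value) => value.Ticks;

    // Day resolution by dividing the tick count by the ticks in one day.
    public static long ToDayNumber(DateTime value) => value.Ticks / TimeSpan.TicksPerDay;

    // Day resolution built from separate properties, packed as yyyyMMdd.
    public static int ToYearMonthDay(DateTime value) =>
        value.Year * 10000 + value.Month * 100 + value.Day;
}
```

Either day-level value preserves chronological order under numeric comparison, so range queries over the quantized field still behave as expected at that resolution.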

To perform range querying or filtering against an Int64Field, use NumericRangeQuery<T> or NumericRangeFilter<T>. To sort according to an Int64Field, use the normal numeric sort types, e.g. INT64. Int64Field values can also be loaded directly from IFieldCache.

You may add the same field name as an Int64Field to the same document more than once. Range querying and filtering will be the logical OR of all values; so a range query will hit all documents that have at least one value in the range. However sort behavior is not defined. If you need to sort, you should separately index a single-valued Int64Field.

An Int64Field will consume somewhat more disk space in the index than an ordinary single-valued field. However, for a typical index that includes substantial textual content per document, this increase will likely be in the noise.

If you only need to sort by numeric value, and never run range querying/filtering, you can index using a precisionStep of System.Int32.MaxValue. This will minimize disk space consumed.

More advanced users can instead use NumericTokenStream directly, when indexing numbers. This class is a wrapper around this token stream type for easier, more intuitive usage.

NOTE: This was LongField in Lucene

@since 2.9

NumericDocValuesField

Field that stores a per-document System.Int64 value for scoring, sorting or value retrieval. Here's an example usage:

    document.Add(new NumericDocValuesField(name, 22L));

If you also need to store the value, you should add a separate StoredField instance.

PackedInt64DocValuesField

Field that stores a per-document System.Int64 value for scoring, sorting or value retrieval. Here's an example usage:

    document.Add(new PackedInt64DocValuesField(name, 22L));

If you also need to store the value, you should add a separate StoredField instance.

NOTE: This was PackedLongDocValuesField in Lucene

SingleDocValuesField

Syntactic sugar for encoding floats as NumericDocValues via J2N.BitConversion.SingleToRawInt32Bits(System.Single).

Per-document floating point values can be retrieved via GetSingles(AtomicReader, String, Boolean).

NOTE: In almost all cases this will be rather inefficient, requiring four bytes per document. Consider encoding floating point values yourself with only as much precision as you require.

NOTE: This was FloatDocValuesField in Lucene

SingleField

Field that indexes System.Single values for efficient range filtering and sorting. Here's an example usage:

    document.Add(new SingleField(name, 6.0F, Field.Store.NO));

For optimal performance, re-use the SingleField and Document instance for more than one document:

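    // Assumes `writer` is an open IndexWriter and `values` is your own sequence of floats.
    SingleField field = new SingleField(name, 0.0F, Field.Store.NO);
    Document document = new Document();
    document.Add(field);

    foreach (float value in values)
    {
        // ... update any other reused fields here ...
        field.SetSingleValue(value);
        writer.AddDocument(document);
    }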

See also Int32Field, Int64Field, DoubleField.

To perform range querying or filtering against a SingleField, use NumericRangeQuery<T> or NumericRangeFilter<T>. To sort according to a SingleField, use the normal numeric sort types, e.g. SINGLE. SingleField values can also be loaded directly from IFieldCache.

You may add the same field name as a SingleField to the same document more than once. Range querying and filtering will be the logical OR of all values, so a range query will hit all documents that have at least one value in the range. However, sort behavior is not defined. If you need to sort, you should separately index a single-valued SingleField.

A SingleField will consume somewhat more disk space in the index than an ordinary single-valued field. However, for a typical index that includes substantial textual content per document, this increase will likely be in the noise.

If you only need to sort by numeric value, and never run range querying/filtering, you can index using a precisionStep of System.Int32.MaxValue. This will minimize disk space consumed.

More advanced users can instead use NumericTokenStream directly, when indexing numbers. This class is a wrapper around this token stream type for easier, more intuitive usage.

NOTE: This was FloatField in Lucene

@since 2.9

SortedBytesDocValuesField

Field that stores a per-document BytesRef value, indexed for sorting. Here's an example usage:

    document.Add(new SortedBytesDocValuesField(name, new BytesRef("hello")));

If you also need to store the value, you should add a separate StoredField instance.

SortedDocValuesField

Field that stores a per-document BytesRef value, indexed for sorting. Here's an example usage:

    document.Add(new SortedDocValuesField(name, new BytesRef("hello")));

If you also need to store the value, you should add a separate StoredField instance.

SortedSetDocValuesField

Field that stores a set of per-document BytesRef values, indexed for faceting, grouping and joining. Here's an example usage:

    document.Add(new SortedSetDocValuesField(name, new BytesRef("hello")));
    document.Add(new SortedSetDocValuesField(name, new BytesRef("world")));

If you also need to store the value, you should add a separate StoredField instance.

StoredField

A field whose value is stored so that Doc(Int32) and Document(Int32) will return the field and its value.

StraightBytesDocValuesField

Field that stores a per-document BytesRef value. If values may be shared it's better to use SortedDocValuesField. Here's an example usage:

    document.Add(new StraightBytesDocValuesField(name, new BytesRef("hello")));

If you also need to store the value, you should add a separate StoredField instance.

StringField

A field that is indexed but not tokenized: the entire System.String value is indexed as a single token. For example this might be used for a 'country' field or an 'id' field, or any field that you intend to use for sorting or access through the field cache.

TextField

A field that is indexed and tokenized, without term vectors. For example this would be used on a 'body' field, that contains the bulk of a document's text.
