    Namespace Lucene.Net.Codecs

    Codecs API: API for customization of the encoding and structure of the index.

    The Codec API allows you to customize the way the following pieces of index information are stored:

    • Postings lists - see PostingsFormat
    • DocValues - see DocValuesFormat
    • Stored fields - see StoredFieldsFormat
    • Term vectors - see TermVectorsFormat
    • FieldInfos - see FieldInfosFormat
    • SegmentInfo - see SegmentInfoFormat
    • Norms - see NormsFormat
    • Live documents - see LiveDocsFormat

    For some concrete implementations beyond Lucene's official index format, see the Codecs module.

    Codecs are identified by name through the ICodecFactory implementation, which by default is the DefaultCodecFactory. To create your own codec, extend Codec. By default, the name of the class (minus the suffix "Codec") will be used as the codec's name.

    // By default, the name will be "My" because the "Codec" suffix is removed
    public class MyCodec : Codec 
    {
    }
    
    Note

    There is a built-in FilterCodec type that can be used to easily extend an existing codec type.

    To override the default codec name, decorate the custom codec with the CodecNameAttribute.

    The CodecNameAttribute can be used to set the name to that of a built-in codec to override its registration in the DefaultCodecFactory.

    [CodecName("MyCodec")] // Sets the codec name explicitly
    public class MyCodec : Codec
    {
    }
    

    Register the Codec class so Lucene.NET can find it either by providing it to the DefaultCodecFactory at application start up or by using a dependency injection container.

    Using Microsoft.Extensions.DependencyInjection to Register a Custom Codec

    First, create an ICodecFactory implementation that returns the codec instance registered under a given name. Here is a generic implementation that can be used with almost any dependency injection container.

    public class NamedCodecFactory : ICodecFactory, IServiceListable
    {
        private readonly IDictionary<string, Codec> codecs;
    
        public NamedCodecFactory(IEnumerable<Codec> codecs)
        {
            this.codecs = codecs.ToDictionary(n => n.Name);
        }
    
        public ICollection<string> AvailableServices => codecs.Keys;
    
        public Codec GetCodec(string name)
        {
            if (codecs.TryGetValue(name, out Codec value))
                return value;
    
            throw new ArgumentException($"The codec {name} is not registered.", nameof(name));
        }
    }
    
    Note

    Implementing IServiceListable is optional. This allows for logging scenarios (such as those built into the test framework) to list the codecs that are registered.

    Next, register all of the codecs that your Lucene.NET implementation will use and the NamedCodecFactory with dependency injection using singleton lifetime.

    IServiceProvider services = new ServiceCollection()
        .AddSingleton<Codec, Lucene.Net.Codecs.Lucene46.Lucene46Codec>()
        .AddSingleton<Codec, MyCodec>()
        .AddSingleton<ICodecFactory, NamedCodecFactory>()
        .BuildServiceProvider();
    

    Finally, set the ICodecFactory implementation Lucene.NET will use with the static Codec.SetCodecFactory(ICodecFactory) method. This must be done one time at application start up.

    Codec.SetCodecFactory(services.GetService<ICodecFactory>());
    

    Using DefaultCodecFactory to Register a Custom Codec

    If your application is not using dependency injection, you can register a custom codec by adding your codec at start up.

    Codec.SetCodecFactory(new DefaultCodecFactory { 
        CustomCodecTypes = new Type[] { typeof(MyCodec) }
    });
    
    Note

    DefaultCodecFactory also registers all built-in codec types automatically.
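
    Once a codec has been registered by either approach, it can be looked up by name and handed to an IndexWriter. The following is a minimal sketch rather than a definitive recipe: the class and method names (CodecUsageExample, OpenWriter) are hypothetical, it assumes the codec was registered under the name "MyCodec", and it assumes the Lucene.Net.Analysis.Common package is referenced for WhitespaceAnalyzer.

    ```csharp
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Codecs;
    using Lucene.Net.Index;
    using Lucene.Net.Util;

    // Hypothetical helper, for illustration only.
    public static class CodecUsageExample
    {
        public static IndexWriter OpenWriter(Lucene.Net.Store.Directory directory)
        {
            // Resolved through the ICodecFactory that was set at application start up.
            // Assumes a codec named "MyCodec" has been registered.
            Codec codec = Codec.ForName("MyCodec");

            var analyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);
            var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
            {
                // New segments written by this writer use the custom codec.
                Codec = codec
            };
            return new IndexWriter(directory, config);
        }
    }
    ```

    The codec name is recorded in each segment, so the same codec must also be registered in any application that later reads the index.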

    Custom Postings Formats

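    If you just want to customize the PostingsFormat, or use different postings formats for different fields, you do not need to write a complete codec. First, create the postings format by extending PostingsFormat.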

    [PostingsFormatName("MyPostingsFormat")]
    public class MyPostingsFormat : PostingsFormat
    {
        private readonly string field;
    
        public MyPostingsFormat(string field)
        {
            this.field = field ?? throw new ArgumentNullException(nameof(field));
        }
    
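        public override FieldsConsumer FieldsConsumer(SegmentWriteState state)
        {
            // Placeholder: return the FieldsConsumer that writes postings for this format.
            throw new NotImplementedException();
        }
    
        public override FieldsProducer FieldsProducer(SegmentReadState state)
        {
            // Placeholder: return the FieldsProducer that reads postings for this format.
            throw new NotImplementedException();
        }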
    }
    

    Then extend the default Lucene46Codec and override GetPostingsFormatForField(string) to return your custom postings format.

    [CodecName("MyCodec")]
    public class MyCodec : Lucene46Codec
    {
        public override PostingsFormat GetPostingsFormatForField(string field)
        {
            return new MyPostingsFormat(field);
        }
    }
    

    Registration of a custom postings format is similar to registering custom codecs: implement IPostingsFormatFactory (or configure the built-in DefaultPostingsFormatFactory, as shown here) and then call PostingsFormat.SetPostingsFormatFactory(IPostingsFormatFactory) at application start up.

    PostingsFormat.SetPostingsFormatFactory(new DefaultPostingsFormatFactory {
        CustomPostingsFormatTypes = new Type[] { typeof(MyPostingsFormat) }
    });
    

    Custom DocValues Formats

    Similarly, if you just want to customize the DocValuesFormat per-field, have a look at GetDocValuesFormatForField(string). Custom implementations can be provided by implementing IDocValuesFormatFactory and registering the factory using DocValuesFormat.SetDocValuesFormatFactory(IDocValuesFormatFactory).
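
    The per-field override follows the same pattern as the postings format example. The sketch below is illustrative only: the codec name "MyDocValuesCodec" and the field name "price" are hypothetical, and the "Memory" format is assumed to be available, which requires a reference to Lucene.Net.Codecs.dll.

    ```csharp
    using Lucene.Net.Codecs;
    using Lucene.Net.Codecs.Lucene46;

    // Hypothetical codec that picks a doc values format per field.
    [CodecName("MyDocValuesCodec")]
    public class MyDocValuesCodec : Lucene46Codec
    {
        // Look the formats up once and reuse them for every field.
        // "Lucene45" ships in Lucene.Net.dll; "Memory" ships in Lucene.Net.Codecs.dll.
        private readonly DocValuesFormat defaultFormat = DocValuesFormat.ForName("Lucene45");
        private readonly DocValuesFormat memoryFormat = DocValuesFormat.ForName("Memory");

        public override DocValuesFormat GetDocValuesFormatForField(string field)
        {
            // "price" is an example field name; all other fields use the default format.
            return field == "price" ? memoryFormat : defaultFormat;
        }
    }
    ```

    Because the format is resolved by name when a segment is read, any format returned here must also be registered in the application that opens the index.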

    Testing Custom Codecs

    The Lucene.Net.TestFramework library contains specialized classes to minimize the amount of code required to thoroughly test extensions to Lucene.NET. Create a new class library project targeting an executable framework your consumers will be using and add the NuGet package references shown in the example project file below. The test framework uses NUnit as the test runner.

    Note

    See Unit testing C# with NUnit and .NET Core for detailed instructions on how to set up a class library to use with NUnit.

    Note

    .NET Standard is not an executable target. Tests will not run unless you target a framework such as net6.0 or net48.

    Here is an example project file for .NET 8 for testing a project named MyCodecs.csproj.

    <Project Sdk="Microsoft.NET.Sdk">
    
        <PropertyGroup>
            <TargetFramework>net8.0</TargetFramework>
        </PropertyGroup>
    
        <ItemGroup>
            <PackageReference Include="nunit" Version="3.13.2" />
            <PackageReference Include="NUnit3TestAdapter" Version="3.17.0" />
            <PackageReference Include="Microsoft.NET.Test.Sdk" Version="16.11.0" />
            <PackageReference Include="Lucene.Net.TestFramework" Version="4.8.0-beta00017" />
            <PackageReference Include="System.Net.Primitives" Version="4.3.0" />
        </ItemGroup>
    
        <ItemGroup>
            <ProjectReference Include="..\MyCodecs\MyCodecs.csproj" />
        </ItemGroup>
    
    </Project>
    
    Note

    This example outlines testing a custom PostingsFormat, but testing other codec dependencies is a similar procedure.

    To extend an existing codec with a new PostingsFormat, the FilterCodec class can be subclassed and the codec to be extended supplied to the FilterCodec constructor. A PostingsFormat should be supplied to an existing codec to run the tests against it.

    This example is for testing a custom postings format named MyPostingsFormat. Creating a postings format is a bit involved, but an overview of the process is in Building a new Lucene postings format.

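    public class MyCodec : FilterCodec
    {
        private readonly PostingsFormat myPostingsFormat;
    
        public MyCodec()
            : base(new Lucene.Net.Codecs.Lucene46.Lucene46Codec())
        {
            myPostingsFormat = new MyPostingsFormat();
        }
    
        // Return the custom postings format instead of the delegate codec's,
        // so the tests actually exercise MyPostingsFormat.
        public override PostingsFormat PostingsFormat => myPostingsFormat;
    }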
    

    Next, add a class to the test project and decorate it with the TestFixtureAttribute from NUnit. To test a postings format, subclass BasePostingsFormatTestCase (in the Lucene.Net.Index namespace), override the GetCodec() method, and return the codec under test. The codec can be cached in a member variable to improve the performance of the tests.

    namespace ExampleLuceneNetTestFramework
    {
        [TestFixture]
        public class TestMyPostingsFormat : BasePostingsFormatTestCase
        {
            private readonly Codec codec = new MyCodec();
    
            protected override Codec GetCodec()
            {
                return codec;
            }
        }
    }
    

    The BasePostingsFormatTestCase class includes a barrage of 8 tests that can now be run using your favorite test runner, such as Visual Studio Test Explorer.

    • TestDocsAndFreqs
    • TestDocsAndFreqsAndPositions
    • TestDocsAndFreqsAndPositionsAndOffsets
    • TestDocsAndFreqsAndPositionsAndOffsetsAndPayloads
    • TestDocsAndFreqsAndPositionsAndPayloads
    • TestDocsOnly
    • TestMergeStability
    • TestRandom

    The goal of BasePostingsFormatTestCase is that if all of these tests pass, the PostingsFormat will be compatible with Lucene.NET.

    Registering Codecs with the Test Framework

    Codecs, postings formats and doc values formats can be injected into the test framework to integration test them against other Lucene.NET components. This is an advanced scenario that assumes integration tests for Lucene.NET components exist in your test project.

    In your test project, add a new file named Startup.cs to the root of the project containing a class that inherits LuceneTestFrameworkInitializer (in the Lucene.Net.Util namespace). The class may exist in any namespace. Override the Initialize() method to set your custom CodecFactory.

    Note

    There may only be one LuceneTestFrameworkInitializer subclass per assembly.

    public class Startup : LuceneTestFrameworkInitializer
    {
        /// <summary>
        /// Runs before all tests in the current assembly
        /// </summary>
        protected override void Initialize()
        {
            CodecFactory = new TestCodecFactory {
                CustomCodecTypes = new Type[] { typeof(MyCodec) }
            };
        }
    }
    
    Important

    In Lucene.NET 4.8.0-beta00015 and prior, the CodecFactory should be set in the TestFrameworkSetUp() method. All later versions must use the Initialize() method to set the factory properties, or an InvalidOperationException will be thrown.

    Setting the Default Codec for use in Tests

    The above block will register a new codec named MyCodec with the test framework. However, the test framework will not select the codec for use in tests on its own. To override the default behavior of selecting a random codec, the configuration parameter tests:codec must be set explicitly.

    Note

    A codec name is derived from either the name of the class (minus the "Codec" suffix) or the Name property of the CodecNameAttribute.

    Setting the Default Codec using an Environment Variable

    Set an environment variable named lucene:tests:codec to the name of the codec.

    $env:lucene:tests:codec = "MyCodec"; # PowerShell example
    

    Setting the Default Codec using a Configuration File

    Add a file to the test project (or a parent directory of the test project) named lucene.testsettings.json with a value named tests:codec.

    {
        "tests": {
            "codec": "MyCodec"
        }
    }
    

    Setting the Default Postings Format or Doc Values Format for use in Tests

    Similarly to codecs, the default postings format or doc values format can be set via environment variable or configuration file.

    Environment Variables

    Set environment variables named lucene:tests:postingsformat to the name of the postings format and/or lucene:tests:docvaluesformat to the name of the doc values format.

    $env:lucene:tests:postingsformat = "MyPostingsFormat"; # PowerShell example
    $env:lucene:tests:docvaluesformat = "MyDocValuesFormat"; # PowerShell example
    

    Configuration File

    Add a file to the test project (or a parent directory of the test project) named lucene.testsettings.json with a value named tests:postingsformat and/or tests:docvaluesformat.

    {
        "tests": {
            "postingsformat": "MyPostingsFormat",
            "docvaluesformat": "MyDocValuesFormat"
        }
    }
    

    Default Codec Configuration

    For reference, the default configuration of codecs, postings formats, and doc values formats is as follows.

    Codecs

    These are the types registered by the DefaultCodecFactory by default.

    Name        Type             Assembly
    Lucene46    Lucene46Codec    Lucene.Net.dll
    Lucene3x    Lucene3xCodec    Lucene.Net.dll
    Lucene45    Lucene45Codec    Lucene.Net.dll
    Lucene42    Lucene42Codec    Lucene.Net.dll
    Lucene41    Lucene41Codec    Lucene.Net.dll
    Lucene40    Lucene40Codec    Lucene.Net.dll
    Appending   AppendingCodec   Lucene.Net.Codecs.dll
    SimpleText  SimpleTextCodec  Lucene.Net.Codecs.dll

    Note

    The codecs in Lucene.Net.Codecs.dll are only loaded if referenced in the calling project.

    Postings Formats

    These are the types registered by the DefaultPostingsFormatFactory by default.

    Name             Type                           Assembly
    Lucene41         Lucene41PostingsFormat         Lucene.Net.dll
    Lucene40         Lucene40PostingsFormat         Lucene.Net.dll
    SimpleText       SimpleTextPostingsFormat       Lucene.Net.Codecs.dll
    Pulsing41        Pulsing41PostingsFormat        Lucene.Net.Codecs.dll
    Direct           DirectPostingsFormat           Lucene.Net.Codecs.dll
    FSTOrd41         FSTOrdPostingsFormat           Lucene.Net.Codecs.dll
    FSTOrdPulsing41  FSTOrdPulsing41PostingsFormat  Lucene.Net.Codecs.dll
    FST41            FSTPostingsFormat              Lucene.Net.Codecs.dll
    FSTPulsing41     FSTPulsing41PostingsFormat     Lucene.Net.Codecs.dll
    Memory           MemoryPostingsFormat           Lucene.Net.Codecs.dll
    BloomFilter      BloomFilteringPostingsFormat   Lucene.Net.Codecs.dll

    Note

    The postings formats in Lucene.Net.Codecs.dll are only loaded if referenced in the calling project.

    Doc Values Formats

    These are the types registered by the DefaultDocValuesFormatFactory by default.

    Name        Type                       Assembly
    Lucene45    Lucene45DocValuesFormat    Lucene.Net.dll
    Lucene42    Lucene42DocValuesFormat    Lucene.Net.dll
    Lucene40    Lucene40DocValuesFormat    Lucene.Net.dll
    SimpleText  SimpleTextDocValuesFormat  Lucene.Net.Codecs.dll
    Direct      DirectDocValuesFormat      Lucene.Net.Codecs.dll
    Memory      MemoryDocValuesFormat      Lucene.Net.Codecs.dll
    Disk        DiskDocValuesFormat        Lucene.Net.Codecs.dll

    Note

    The doc values formats in Lucene.Net.Codecs.dll are only loaded if referenced in the calling project.

    Classes

    BlockTermState

    Holds all state required for PostingsReaderBase to produce a DocsEnum without re-seeking the terms dict.

    BlockTreeTermsReader<TSubclassState>

    A block-based terms index and dictionary that assigns terms to variable length blocks according to how they share prefixes. The terms index is a prefix trie whose leaves are term blocks. The advantage of this approach is that SeekExact() is often able to determine a term cannot exist without doing any IO, and intersection with Automata is very fast. Note that this terms dictionary has its own fixed terms index (i.e., it does not support a pluggable terms index implementation).

    NOTE: this terms dictionary does not support index divisor when opening an IndexReader. Instead, you can change the min/maxItemsPerBlock during indexing.

    The data structure used by this implementation is very similar to a burst trie (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), but with added logic to break up too-large blocks of all terms sharing a given prefix into smaller ones.

    Use CheckIndex with the -verbose option to see summary statistics on the blocks in the dictionary.

    See BlockTreeTermsWriter<TSubclassState>.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    BlockTreeTermsReader<TSubclassState>.FieldReader

    BlockTree's implementation of GetTerms(string).

    BlockTreeTermsReader<TSubclassState>.Stats

    BlockTree statistics for a single field returned by ComputeStats().

    BlockTreeTermsWriter

    BlockTreeTermsWriter<TSubclassState>

    Block-based terms index and dictionary writer.

    Writes terms dict and index, block-encoding (column stride) each term's metadata for each set of terms between two index terms.

    Files:
    • .tim:Term Dictionary
    • .tip:Term Index

    Term Dictionary

    The .tim file contains the list of terms in each field along with per-term statistics (such as docfreq) and per-term metadata (typically pointers to the postings list for that term in the inverted index).

    The .tim file is arranged in blocks, each containing a variable number of entries (by default 25-48), where each entry is either a term or a reference to a sub-block.

    NOTE: The term dictionary can plug into different postings implementations: the postings writer/reader are actually responsible for encoding and decoding the Postings Metadata and Term Metadata sections.

    In the grammar below, X^N denotes N repetitions of X.

    • TermsDict (.tim) --> Header, PostingsHeader, NodeBlock^NumBlocks, FieldSummary, DirOffset, Footer
    • NodeBlock --> (OuterNode | InnerNode)
    • OuterNode --> EntryCount, SuffixLength, Byte^SuffixLength, StatsLength, <TermStats>^EntryCount, MetaLength, <TermMetadata>^EntryCount
    • InnerNode --> EntryCount, SuffixLength[,Sub?], Byte^SuffixLength, StatsLength, <TermStats?>^EntryCount, MetaLength, <TermMetadata?>^EntryCount
    • TermStats --> DocFreq, TotalTermFreq
    • FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength, Byte^RootCodeLength, SumTotalTermFreq?, SumDocFreq, DocCount>^NumFields
    • Header --> CodecHeader (WriteHeader(DataOutput, string, int))
    • DirOffset --> Uint64 (WriteInt64(long))
    • EntryCount, SuffixLength, StatsLength, DocFreq, MetaLength, NumFields, FieldNumber, RootCodeLength, DocCount --> VInt (WriteVInt32(int))
    • TotalTermFreq, NumTerms, SumTotalTermFreq, SumDocFreq --> VLong (WriteVInt64(long))
    • Footer --> CodecFooter (WriteFooter(IndexOutput))

    Notes:

    • Header is a CodecHeader (WriteHeader(DataOutput, string, int)) storing the version information for the BlockTree implementation.
    • DirOffset is a pointer to the FieldSummary section.
    • DocFreq is the count of documents which contain the term.
    • TotalTermFreq is the total number of occurrences of the term. This is encoded as the difference between the total number of occurrences and the DocFreq.
    • FieldNumber is the field's number from FieldInfos (.fnm).
    • NumTerms is the number of unique terms for the field.
    • RootCode points to the root block for the field.
    • SumDocFreq is the total number of postings, the number of term-document pairs across the entire field.
    • DocCount is the number of documents that have at least one posting for this field.
    • PostingsHeader and TermMetadata are plugged into by the specific postings implementation: these contain arbitrary per-file data (such as parameters or versioning information) and per-term data (such as pointers to inverted files).
    • For inner nodes of the tree, every entry will steal one bit to mark whether it points to child nodes (a sub-block). If so, the corresponding TermStats and TermMetadata are omitted.
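
    The TotalTermFreq note above can be illustrated with a small sketch. The class and method names here are hypothetical and not part of Lucene.NET; the sketch only shows the arithmetic, assuming the stored value is the total number of occurrences minus DocFreq (never negative, because a term occurs at least once in every document that contains it).

```csharp
using System;

// Hypothetical helpers (not part of Lucene.NET) illustrating the delta described above.
public static class TermStatsDelta
{
    // The value that would be stored: total occurrences minus document count.
    public static long EncodeTotalTermFreq(int docFreq, long totalTermFreq)
    {
        if (totalTermFreq < docFreq)
            throw new ArgumentException("totalTermFreq cannot be smaller than docFreq.");
        return totalTermFreq - docFreq;
    }

    // Recover the real total from the stored delta.
    public static long DecodeTotalTermFreq(int docFreq, long storedDelta)
    {
        return docFreq + storedDelta;
    }
}
```

    For a term that appears 7 times across 3 documents, the stored value would be 4. The delta is never larger than the raw total, so it tends to encode compactly as a VLong.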

    Term Index

    The .tip file contains an index into the term dictionary, so that it can be accessed randomly. The index is also used to determine when a given term cannot exist on disk (in the .tim file), saving a disk seek.

    • TermsIndex (.tip) --> Header, FSTIndex^NumFields, <IndexStartFP>^NumFields, DirOffset, Footer
    • Header --> CodecHeader (WriteHeader(DataOutput, string, int))
    • DirOffset --> Uint64 (WriteInt64(long))
    • IndexStartFP --> VLong (WriteVInt64(long))
    • FSTIndex --> FST<byte[]>
    • Footer --> CodecFooter (WriteFooter(IndexOutput))

    Notes:

    • The .tip file contains a separate FST for each field. The FST maps a term prefix to the on-disk block that holds all terms starting with that prefix. Each field's IndexStartFP points to its FST.
    • DirOffset is a pointer to the start of the IndexStartFPs for all fields.
    • It's possible that an on-disk block would contain too many terms (more than the allowed maximum (default: 48)). When this happens, the block is sub-divided into new blocks (called "floor blocks"), and then the output in the FST for the block's prefix encodes the leading byte of each sub-block, and its file pointer.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Codec

    Encodes/decodes an inverted index segment.

    Note, when extending this class, the name (Name) is written into the index. In order for the segment to be read, the name must resolve to your implementation via ForName(string). This method uses GetCodec(string) to resolve codec names.

    To implement your own codec:
    1. Subclass this class.
    2. Subclass DefaultCodecFactory, override the Initialize() method, and add the line base.ScanForCodecs(typeof(YourCodec).Assembly). If you have any codec classes in your assembly that are not meant for reading, you can add the ExcludeCodecFromScanAttribute to them so they are ignored by the scan.
    3. Set the new ICodecFactory by calling SetCodecFactory(ICodecFactory) at application startup.
    If your codec has dependencies, you may also override GetCodec(Type) to inject them via pure DI or a DI container. See DI-Friendly Framework to understand the approach used.

    Codec Names

    Unlike the Java version, codec names are by default convention-based on the class name. If you name your custom codec class "MyCustomCodec", the codec name will be the same name without the "Codec" suffix: "MyCustom".

    You can override this default behavior by using the CodecNameAttribute to name the codec differently than this convention. Codec names must be all ASCII alphanumeric, and less than 128 characters in length.
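
    The naming rules above can be sketched as follows. This is an illustration only: CodecNaming and its methods are hypothetical helpers, not part of Lucene.NET, and the library's own implementation may differ in details such as how it treats an empty name.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of Lucene.NET) mirroring the documented naming rules.
public static class CodecNaming
{
    // Default convention: the class name without a trailing "Codec".
    public static string DefaultName(string className)
    {
        const string suffix = "Codec";
        return className.EndsWith(suffix, StringComparison.Ordinal) && className.Length > suffix.Length
            ? className.Substring(0, className.Length - suffix.Length)
            : className;
    }

    // Documented constraints: ASCII alphanumeric only, fewer than 128 characters.
    // Rejecting the empty string is an assumption of this sketch.
    public static bool IsValidName(string name)
    {
        return name.Length > 0
            && name.Length < 128
            && name.All(c => (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9'));
    }
}
```

    With these helpers, DefaultName("MyCustomCodec") yields "MyCustom", which passes IsValidName, while a name such as "My-Custom" would be rejected because of the hyphen.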

    CodecNameAttribute

    Represents an attribute that is used to name a Codec, if a name other than the default Codec naming convention is desired.

    CodecUtil

    Utility class for reading and writing versioned headers.

    Writing codec headers is useful to ensure that a file is in the format you think it is.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    DefaultCodecFactory

    Implements the default functionality of ICodecFactory.

    To replace the DefaultCodecFactory instance, call SetCodecFactory(ICodecFactory) at application start up. DefaultCodecFactory can be subclassed or passed additional parameters to register additional codecs, inject dependencies, or change caching behavior, as shown in the following examples. Alternatively, ICodecFactory can be implemented to provide complete control over codec creation and lifetimes.

    Register Additional Codecs

    Additional codecs can be added by initializing the instance of DefaultCodecFactory and passing an array of Codec-derived types.
    // Register the factory at application start up.
    Codec.SetCodecFactory(new DefaultCodecFactory {
        CustomCodecTypes = new Type[] { typeof(MyCodec), typeof(AnotherCodec) }
    });

    Only Use Explicitly Defined Codecs

    PutCodecType(Type) can be used to explicitly add codec types. In this example, the call to base.Initialize() is excluded to skip the built-in codec registration. Since AnotherCodec doesn't have a default constructor, the NewCodec(Type) method is overridden to supply the required parameters.
    public class ExplicitCodecFactory : DefaultCodecFactory
    {
        protected override void Initialize()
        {
            // Load specific codecs in a specific order.
            PutCodecType(typeof(MyCodec));
            PutCodecType(typeof(AnotherCodec));
        }

        protected override Codec NewCodec(Type type)
        {
            // Special case: AnotherCodec has a required dependency
            if (typeof(AnotherCodec).Equals(type))
                return new AnotherCodec(new SomeDependency());

            return base.NewCodec(type);
        }
    }

    // Register the factory at application start up.
    Codec.SetCodecFactory(new ExplicitCodecFactory());

    See the Lucene.Net.Codecs namespace documentation for more examples of how to inject dependencies into Codec subclasses.

    Use Reflection to Scan an Assembly for Codecs

    ScanForCodecs(Assembly) or ScanForCodecs(IEnumerable<Assembly>) can be used to scan assemblies using .NET Reflection for codec types and add all subclasses that are found automatically. This example calls base.Initialize() to load the default codecs prior to scanning for additional codecs.
    public class ScanningCodecFactory : DefaultCodecFactory
    {
        protected override void Initialize()
        {
            // Load all default codecs
            base.Initialize();

            // Load all of the codecs inside of the same assembly that MyCodec is defined in
            ScanForCodecs(typeof(MyCodec).Assembly);
        }
    }

    // Register the factory at application start up.
    Codec.SetCodecFactory(new ScanningCodecFactory());

    Codecs in the target assemblies can be excluded from the scan by decorating them with the ExcludeCodecFromScanAttribute.

    DefaultDocValuesFormatFactory

    Implements the default functionality of IDocValuesFormatFactory.

    To replace the DefaultDocValuesFormatFactory instance, call SetDocValuesFormatFactory(IDocValuesFormatFactory) at application start up. DefaultDocValuesFormatFactory can be subclassed or passed additional parameters to register additional doc values formats, inject dependencies, or change caching behavior, as shown in the following examples. Alternatively, IDocValuesFormatFactory can be implemented to provide complete control over doc values format creation and lifetimes.

    Register Additional DocValuesFormats

    Additional doc values formats can be added by initializing the instance of DefaultDocValuesFormatFactory and passing an array of DocValuesFormat-derived types.
    // Register the factory at application start up.
    DocValuesFormat.SetDocValuesFormatFactory(new DefaultDocValuesFormatFactory {
        CustomDocValuesFormatTypes = new Type[] { typeof(MyDocValuesFormat), typeof(AnotherDocValuesFormat) }
    });

    Only Use Explicitly Defined DocValuesFormats

    PutDocValuesFormatType(Type) can be used to explicitly add doc values format types. In this example, the call to base.Initialize() is excluded to skip the built-in doc values format registration. Since AnotherDocValuesFormat doesn't have a default constructor, the NewDocValuesFormat(Type) method is overridden to supply the required parameters.
    public class ExplicitDocValuesFormatFactory : DefaultDocValuesFormatFactory
    {
        protected override void Initialize()
        {
            // Load specific doc values formats in a specific order.
            PutDocValuesFormatType(typeof(MyDocValuesFormat));
            PutDocValuesFormatType(typeof(AnotherDocValuesFormat));
        }

        protected override DocValuesFormat NewDocValuesFormat(Type type)
        {
            // Special case: AnotherDocValuesFormat has a required dependency
            if (typeof(AnotherDocValuesFormat).Equals(type))
                return new AnotherDocValuesFormat(new SomeDependency());

            return base.NewDocValuesFormat(type);
        }
    }

    // Register the factory at application start up.
    DocValuesFormat.SetDocValuesFormatFactory(new ExplicitDocValuesFormatFactory());

    See the Lucene.Net.Codecs namespace documentation for more examples of how to inject dependencies into DocValuesFormat subclasses.

    Use Reflection to Scan an Assembly for DocValuesFormats

    ScanForDocValuesFormats(Assembly) or ScanForDocValuesFormats(IEnumerable<Assembly>) can be used to scan assemblies using .NET Reflection for doc values format types and add all subclasses that are found automatically. This example calls base.Initialize() to load the default doc values formats prior to scanning for additional ones.
    public class ScanningDocValuesFormatFactory : DefaultDocValuesFormatFactory
    {
        protected override void Initialize()
        {
            // Load all default doc values formats
            base.Initialize();

            // Load all of the doc values formats inside of the same assembly that MyDocValuesFormat is defined in
            ScanForDocValuesFormats(typeof(MyDocValuesFormat).Assembly);
        }
    }

    // Register the factory at application start up.
    DocValuesFormat.SetDocValuesFormatFactory(new ScanningDocValuesFormatFactory());

    Doc values formats in the target assembly can be excluded from the scan by decorating them with the ExcludeDocValuesFormatFromScanAttribute.

    DefaultPostingsFormatFactory

    Implements the default functionality of IPostingsFormatFactory.

    To replace the DefaultPostingsFormatFactory instance, call SetPostingsFormatFactory(IPostingsFormatFactory) at application start up. DefaultPostingsFormatFactory can be subclassed or passed additional parameters to register additional postings formats, inject dependencies, or change caching behavior, as shown in the following examples. Alternatively, IPostingsFormatFactory can be implemented to provide complete control over postings format creation and lifetimes.

    Register Additional PostingsFormats

    Additional postings formats can be added by initializing the instance of DefaultPostingsFormatFactory and passing an array of PostingsFormat-derived types.
    // Register the factory at application start up.
    PostingsFormat.SetPostingsFormatFactory(new DefaultPostingsFormatFactory {
        CustomPostingsFormatTypes = new Type[] { typeof(MyPostingsFormat), typeof(AnotherPostingsFormat) }
    });

    Only Use Explicitly Defined PostingsFormats

    PutPostingsFormatType(Type) can be used to explicitly add postings format types. In this example, the call to base.Initialize() is excluded to skip the built-in postings format registration. Since AnotherPostingsFormat doesn't have a default constructor, the NewPostingsFormat(Type) method is overridden to supply the required parameters.
    public class ExplicitPostingsFormatFactory : DefaultPostingsFormatFactory
    {
        protected override void Initialize()
        {
            // Load specific postings formats in a specific order.
            PutPostingsFormatType(typeof(MyPostingsFormat));
            PutPostingsFormatType(typeof(AnotherPostingsFormat));
        }

        protected override PostingsFormat NewPostingsFormat(Type type)
        {
            // Special case: AnotherPostingsFormat has a required dependency
            if (typeof(AnotherPostingsFormat).Equals(type))
                return new AnotherPostingsFormat(new SomeDependency());

            return base.NewPostingsFormat(type);
        }
    }

    // Register the factory at application start up.
    PostingsFormat.SetPostingsFormatFactory(new ExplicitPostingsFormatFactory());

    See the Lucene.Net.Codecs namespace documentation for more examples of how to inject dependencies into PostingsFormat subclasses.

    Use Reflection to Scan an Assembly for PostingsFormats

    ScanForPostingsFormats(Assembly) or ScanForPostingsFormats(IEnumerable<Assembly>) can be used to scan assemblies using .NET Reflection for postings format types and add all subclasses that are found automatically. This example calls base.Initialize() to load the default postings formats prior to scanning for additional ones.
    public class ScanningPostingsFormatFactory : DefaultPostingsFormatFactory
    {
        protected override void Initialize()
        {
            // Load all default postings formats
            base.Initialize();

            // Load all of the postings formats inside of the same assembly that MyPostingsFormat is defined in
            ScanForPostingsFormats(typeof(MyPostingsFormat).Assembly);
        }
    }

    // Register the factory at application start up.
    PostingsFormat.SetPostingsFormatFactory(new ScanningPostingsFormatFactory());

    Postings formats in the target assembly can be excluded from the scan by decorating them with the ExcludePostingsFormatFromScanAttribute.

    DocValuesConsumer

    Abstract API that consumes numeric, binary and sorted docvalues. Concrete implementations of this actually do "something" with the docvalues (for example, write them into the index in a specific format).

    The lifecycle is:
    1. DocValuesConsumer is created by FieldsConsumer(SegmentWriteState) or NormsConsumer(SegmentWriteState).
    2. AddNumericField(FieldInfo, IEnumerable<long?>), AddBinaryField(FieldInfo, IEnumerable<BytesRef>), or AddSortedField(FieldInfo, IEnumerable<BytesRef>, IEnumerable<long?>) are called for each Numeric, Binary, or Sorted docvalues field. The API is a "pull" rather than "push", and the implementation is free to iterate over the values multiple times (GetEnumerator()).
    3. After all fields are added, the consumer is Dispose()d.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    DocValuesFormat

    Encodes/decodes per-document values.

    Note, when extending this class, the name (Name) may be written into the index in certain configurations. In order for the segment to be read, the name must resolve to your implementation via ForName(string). This method uses GetDocValuesFormat(string) to resolve format names.

    To implement your own format:
    1. Subclass this class.
    2. Subclass DefaultDocValuesFormatFactory, override the Initialize() method, and add the line base.ScanForDocValuesFormats(typeof(YourDocValuesFormat).Assembly). If you have any format classes in your assembly that are not meant for reading, you can add the ExcludeDocValuesFormatFromScanAttribute to them so they are ignored by the scan.
    3. Set the new IDocValuesFormatFactory by calling SetDocValuesFormatFactory(IDocValuesFormatFactory) at application startup.
    If your format has dependencies, you may also override GetDocValuesFormat(Type) to inject them via pure DI or a DI container. See DI-Friendly Framework to understand the approach used.

    DocValuesFormat Names

    Unlike the Java version, format names are by default convention-based on the class name. If you name your custom format class "MyCustomDocValuesFormat", the format name will be the same name without the "DocValuesFormat" suffix: "MyCustom".

    You can override this default behavior by using the DocValuesFormatNameAttribute to name the format differently than this convention. Format names must be all ASCII alphanumeric, and less than 128 characters in length.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    DocValuesFormatNameAttribute

    Represents an attribute that is used to name a DocValuesFormat, if a name other than the default DocValuesFormat naming convention is desired.

    DocValuesProducer

    Abstract API that produces numeric, binary and sorted docvalues.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    ExcludeCodecFromScanAttribute

    When placed on a class that subclasses Codec, adding this attribute will exclude the type from consideration in the ScanForCodecs(Assembly) method.

    However, the Codec type can still be added manually using PutCodecType(Type).

    ExcludeDocValuesFormatFromScanAttribute

    When placed on a class that subclasses DocValuesFormat, adding this attribute will exclude the type from consideration in the ScanForDocValuesFormats(Assembly) method.

    However, the DocValuesFormat type can still be added manually using PutDocValuesFormatType(Type).

    ExcludePostingsFormatFromScanAttribute

    When placed on a class that subclasses PostingsFormat, adding this attribute will exclude the type from consideration in the ScanForPostingsFormats(Assembly) method.

    However, the PostingsFormat type can still be added manually using PutPostingsFormatType(Type).

    FieldInfosFormat

    Encodes/decodes FieldInfos.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    FieldInfosReader

    Codec API for reading FieldInfos.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    FieldInfosWriter

    Codec API for writing FieldInfos.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    FieldsConsumer

    Abstract API that consumes terms, doc, freq, prox, offset and payloads postings. Concrete implementations of this actually do "something" with the postings (for example, write them into the index in a specific format).

    The lifecycle is:
    1. FieldsConsumer is created by FieldsConsumer(SegmentWriteState).
    2. For each field, AddField(FieldInfo) is called, returning a TermsConsumer for the field.
    3. After all fields are added, the consumer is Dispose()d.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    FieldsProducer

    Abstract API that produces terms, doc, freq, prox, offset and payloads postings.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    FilterCodec

    A codec that forwards all its method calls to another codec.

    Extend this class when you need to reuse the functionality of an existing codec. For example, if you want to build a codec that redefines Lucene46's LiveDocsFormat:
    public sealed class CustomCodec : FilterCodec
    {
        public CustomCodec()
            : base("CustomCodec", new Lucene46Codec())
        {
        }

        public override LiveDocsFormat LiveDocsFormat
        {
            get { return new CustomLiveDocsFormat(); }
        }
    }

    Please note: Don't call ForName(string) from the no-arg constructor of your own codec. When the DefaultCodecFactory loads your own Codec, the DefaultCodecFactory has not yet fully initialized! If you want to extend another Codec, instantiate it directly by calling its constructor.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    LiveDocsFormat

    Format for live/deleted documents.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    MappingMultiDocsAndPositionsEnum

    Exposes flex API, merged from flex API of sub-segments, remapping docIDs (this is used for segment merging).

    Note

    This API is experimental and might change in incompatible ways in the next release.

    MappingMultiDocsEnum

    Exposes flex API, merged from flex API of sub-segments, remapping docIDs (this is used for segment merging).

    Note

    This API is experimental and might change in incompatible ways in the next release.

    MultiLevelSkipListReader

    This abstract class reads skip lists with multiple levels.

    See MultiLevelSkipListWriter for the information about the encoding of the multi level skip lists.

    Subclasses must implement the abstract method ReadSkipData(int, IndexInput) which defines the actual format of the skip data.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    MultiLevelSkipListWriter

    This abstract class writes skip lists with multiple levels.

    Example for skipInterval = 3:
                                                        c            (skip level 2)
                    c                 c                 c            (skip level 1)
        x     x     x     x     x     x     x     x     x     x      (skip level 0)
    d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d  (posting list)
        3     6     9     12    15    18    21    24    27    30     (df)
    
    d - document
    x - skip data
    c - skip data with child pointer
    
    Skip level i contains every skipInterval-th entry from skip level i-1.
    Therefore the number of entries on level i is: floor(df / (skipInterval ^ (i + 1))).
    
    Each skip entry on a level i>0 contains a pointer to the corresponding skip entry in list i-1.
    This guarantees a logarithmic amount of skips to find the target document.
    
    While this class takes care of writing the different skip levels,
    subclasses must define the actual format of the skip data.
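
    The per-level entry count given above can be checked with a short sketch. SkipLevels is a hypothetical helper, not part of Lucene.NET; it only evaluates the formula and ignores any cap on the number of levels that an implementation may apply.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Lucene.NET) evaluating floor(df / skipInterval^(i + 1)).
public static class SkipLevels
{
    // Returns the number of skip entries on each level, starting at level 0,
    // stopping before the first level that would be empty.
    public static List<int> EntriesPerLevel(int df, int skipInterval)
    {
        if (skipInterval < 2)
            throw new ArgumentOutOfRangeException(nameof(skipInterval));

        var counts = new List<int>();
        long divisor = skipInterval; // skipInterval ^ (level + 1)
        while (df / divisor > 0)
        {
            // Integer division equals floor for non-negative values.
            counts.Add((int)(df / divisor));
            divisor *= skipInterval;
        }
        return counts;
    }
}
```

    For the diagram above (32 documents, skipInterval = 3), EntriesPerLevel(32, 3) gives 10, 3 and 1 entries for levels 0, 1 and 2, matching the number of x and c marks drawn on each row.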

    Note

    This API is experimental and might change in incompatible ways in the next release.

    NormsFormat

    Encodes/decodes per-document score normalization values.

    PostingsBaseFormat

    Provides a PostingsReaderBase and PostingsWriterBase.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    PostingsConsumer

    Abstract API that consumes postings for an individual term.

    The lifecycle is:
    1. PostingsConsumer is returned for each term by StartTerm(BytesRef).
    2. StartDoc(int, int) is called for each document where the term occurs, specifying id and term frequency for that document.
    3. If positions are enabled for the field, then AddPosition(int, BytesRef, int, int) will be called for each occurrence in the document.
    4. FinishDoc() is called when the producer is done adding positions to the document.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    PostingsFormat

    Encodes/decodes terms, postings, and proximity data.

    Note, when extending this class, the name (Name) may be written into the index in certain configurations. In order for the segment to be read, the name must resolve to your implementation via ForName(string). This method uses GetPostingsFormat(string) to resolve format names.

    If you implement your own format:
    1. Subclass this class.
    2. Subclass DefaultPostingsFormatFactory, override Initialize(), and add the line base.ScanForPostingsFormats(typeof(YourPostingsFormat).Assembly). If you have any format classes in your assembly that are not meant for reading, you can add the ExcludePostingsFormatFromScanAttribute to them so they are ignored by the scan.
    3. Set the new IPostingsFormatFactory by calling SetPostingsFormatFactory(IPostingsFormatFactory) at application startup.
    If your format has dependencies, you may also override GetPostingsFormat(Type) to inject them via pure DI or a DI container. See DI-Friendly Framework to understand the approach used.

    PostingsFormat Names

    Unlike the Java version, format names are by default convention-based on the class name. If you name your custom format class "MyCustomPostingsFormat", the format name will be the same name without the "PostingsFormat" suffix: "MyCustom".

    You can override this default behavior by using the PostingsFormatNameAttribute to name the format differently than this convention. Format names must be all ASCII alphanumeric, and less than 128 characters in length.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    PostingsFormatNameAttribute

    Represents an attribute that is used to name a PostingsFormat, if a name other than the default PostingsFormat naming convention is desired.

    PostingsReaderBase

    The core terms dictionaries (BlockTermsReader, BlockTreeTermsReader<TSubclassState>) interact with a single instance of this class to manage creation of DocsEnum and DocsAndPositionsEnum instances. It provides an IndexInput (termsIn) where this class may read any previously stored data that it had written in its corresponding PostingsWriterBase at indexing time.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    PostingsWriterBase

    Extension of PostingsConsumer to support pluggable term dictionaries.

    This class contains additional hooks to interact with the provided term dictionaries such as BlockTreeTermsWriter<TSubclassState>. If you want to re-use an existing implementation and are only interested in customizing the format of the postings list, extend this class instead.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    SegmentInfoFormat

    Expert: Controls the format of the SegmentInfo (segment metadata file).

    Note

    This API is experimental and might change in incompatible ways in the next release.

    SegmentInfoReader

    Specifies an API for classes that can read SegmentInfo information.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    SegmentInfoWriter

    Specifies an API for classes that can write out SegmentInfo data.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    StoredFieldsFormat

    Controls the format of stored fields.

    StoredFieldsReader

    Codec API for reading stored fields.

    You need to implement VisitDocument(int, StoredFieldVisitor) to read the stored fields for a document, implement Clone() (creating clones of any IndexInputs used, etc), and Dispose(bool) to cleanup any allocated resources.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    StoredFieldsWriter

    Codec API for writing stored fields:

    1. For every document, StartDocument(int) is called, informing the Codec how many fields will be written.
    2. WriteField(FieldInfo, IIndexableField) is called for each field in the document.
    3. After all documents have been written, Finish(FieldInfos, int) is called for verification/sanity-checks.
    4. Finally, the writer is disposed (Dispose(bool)).

    Note

    This API is experimental and might change in incompatible ways in the next release.

    TermStats

    Holder for per-term statistics.

    TermVectorsFormat

    Controls the format of term vectors.

    TermVectorsReader

    Codec API for reading term vectors.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    TermVectorsWriter

    Codec API for writing term vectors:

    1. For every document, StartDocument(int) is called, informing the Codec how many fields will be written.
    2. StartField(FieldInfo, int, bool, bool, bool) is called for each field in the document, informing the codec how many terms will be written for that field, and whether or not positions, offsets, or payloads are enabled.
    3. Within each field, StartTerm(BytesRef, int) is called for each term.
    4. If offsets and/or positions are enabled, then AddPosition(int, int, int, BytesRef) will be called for each term occurrence.
    5. After all documents have been written, Finish(FieldInfos, int) is called for verification/sanity-checks.
    6. Finally, the writer is disposed (Dispose(bool)).

    Note

    This API is experimental and might change in incompatible ways in the next release.

    TermsConsumer

    Abstract API that consumes terms for an individual field.

    The lifecycle is:
    1. TermsConsumer is returned for each field by AddField(FieldInfo).
    2. TermsConsumer returns a PostingsConsumer for each term in StartTerm(BytesRef).
    3. When the producer (e.g. IndexWriter) is done adding documents for the term, it calls FinishTerm(BytesRef, TermStats), passing in the accumulated term statistics.
    4. Producer calls Finish(long, long, int) with the accumulated collection statistics when it is finished adding terms to the field.
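
    The call order in the lifecycle above can be sketched with simplified stand-in types. These classes are NOT the Lucene.NET TermsConsumer and PostingsConsumer; they take plain strings and integers instead of BytesRef and TermStats, omit positions, and only record the sequence of calls a producer would make for one field.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-ins for illustration only; these are NOT the Lucene.NET types.
public sealed class LoggingPostingsConsumer
{
    private readonly List<string> log;
    public LoggingPostingsConsumer(List<string> log) { this.log = log; }
    public void StartDoc(int docId, int freq) => log.Add($"StartDoc({docId}, {freq})");
    public void FinishDoc() => log.Add("FinishDoc()");
}

public sealed class LoggingTermsConsumer
{
    private readonly List<string> log;
    public LoggingTermsConsumer(List<string> log) { this.log = log; }

    // Step 2: a postings consumer is handed out for each term.
    public LoggingPostingsConsumer StartTerm(string term)
    {
        log.Add($"StartTerm({term})");
        return new LoggingPostingsConsumer(log);
    }

    // Step 3: called once the producer has added all documents for the term.
    public void FinishTerm(string term, int docFreq) =>
        log.Add($"FinishTerm({term}, docFreq={docFreq})");

    // Step 4: called once per field with the accumulated collection statistics.
    public void Finish(long sumTotalTermFreq, long sumDocFreq, int docCount) =>
        log.Add($"Finish({sumTotalTermFreq}, {sumDocFreq}, {docCount})");
}

public static class LifecycleDemo
{
    // Plays the role of the producer for one field holding a single term in two documents.
    public static List<string> Run()
    {
        var log = new List<string>();
        var terms = new LoggingTermsConsumer(log);

        var postings = terms.StartTerm("lucene");
        postings.StartDoc(0, 2);
        postings.FinishDoc();
        postings.StartDoc(5, 1);
        postings.FinishDoc();

        terms.FinishTerm("lucene", 2);
        terms.Finish(3, 2, 2);
        return log;
    }
}
```

    Running LifecycleDemo.Run() records StartTerm first, then a StartDoc/FinishDoc pair per document, then FinishTerm, and finally Finish for the field.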

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Interfaces

    ICodecFactory

    Contract for extending the functionality of Codec implementations so they can be injected with dependencies.

    To set the ICodecFactory, call SetCodecFactory(ICodecFactory).

    See the Lucene.Net.Codecs namespace documentation for some common usage examples.

    IDocValuesFormatFactory

    Contract for extending the functionality of DocValuesFormat implementations so they can be injected with dependencies.

    To set the IDocValuesFormatFactory, call SetDocValuesFormatFactory(IDocValuesFormatFactory).

    See the Lucene.Net.Codecs namespace documentation for some common usage examples.

    IPostingsFormatFactory

    Contract for extending the functionality of PostingsFormat implementations so they can be injected with dependencies.

    To set the IPostingsFormatFactory, call SetPostingsFormatFactory(IPostingsFormatFactory).

    See the Lucene.Net.Codecs namespace documentation for some common usage examples.
    Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.