Fork me on GitHub
  • API

    Show / Hide Table of Contents

    Namespace Lucene.Net.Util.Fst

    Finite state transducers

    This package implements Finite State Transducers with the following characteristics:

    • Fast and low memory overhead construction of the minimal FST (but inputs must be provided in sorted order)

    • Low object overhead and quick deserialization (byte[] representation)

    • Optional two-pass compression: FST.Pack()

    • Lookup-by-output when the outputs are in sorted order (e.g., ordinals or file pointers)

    • Pluggable Outputs representation

    • N-shortest-paths search by weight

    • Enumerators (Int32sRef and BytesRef) that behave like SortedDictionary<TKey, TValue> enumerators

    FST Construction example:

    // Input values (keys). These must be provided to Builder in Unicode sorted order!
    string[] inputValues = new string[] { "cat", "dog", "dogs" };
    long[] outputValues = new long[] { 5, 7, 12 };
    
    PositiveInt32Outputs outputs = PositiveInt32Outputs.Singleton;
    Builder<long?> builder = new Builder<long?>(FST.INPUT_TYPE.BYTE1, outputs);
    BytesRef scratchBytes = new BytesRef();
    Int32sRef scratchInts = new Int32sRef();
    for (int i = 0; i < inputValues.Length; i++)
    {
        scratchBytes.CopyChars(inputValues[i]);
        builder.Add(Util.ToInt32sRef(scratchBytes, scratchInts), outputValues[i]);
    }
    FST<long?> fst = builder.Finish();
    

    Retrieval by key:

    long? value = Util.Get(fst, new BytesRef("dog"));
    Console.WriteLine(value); // 7
    

    Retrieval by value:

    // Only works because outputs are also in sorted order
    Int32sRef key = Util.GetByOutput(fst, 12);
    Console.WriteLine(Util.ToBytesRef(key, scratchBytes).Utf8ToString()); // dogs
    

    Iterate over key - value pairs in sorted order:

    // Like TermsEnum, this also supports seeking (advance)
    BytesRefFSTEnum<long?> enumerator = new BytesRefFSTEnum<long?>(fst);
    while (enumerator.MoveNext())
    {
        BytesRefFSTEnum.InputOutput<long?> mapEntry = enumerator.Current;
        Console.WriteLine(mapEntry.Input.Utf8ToString());
        Console.WriteLine(mapEntry.Output);
    }
    

    N-shortest paths by weight:

    var comparer = Comparer<long?>.Create((left, right) =>
    {
        return left.GetValueOrDefault().CompareTo(right);
    });
    FST.Arc<long?> firstArc = fst.GetFirstArc(new FST.Arc<long?>());
    Util.TopResults<long?> paths = Util.ShortestPaths(
        fst, firstArc, startOutput: 0, comparer, topN: 2, allowEmptyString: false);
    
    foreach (Util.Result<long?> path in paths)
    {
        Console.WriteLine(Util.ToBytesRef(path.Input, scratchBytes).Utf8ToString());
        Console.WriteLine(path.Output);
    }
    
    // Results:
    //
    // cat
    // 5
    // dog
    // 7
    

    Classes

    Builder

    LUCENENET specific type used to access nested types of Builder<T> without referring to its generic closing type.

    Builder.Arc<S>

    Expert: holds a pending (seen but not yet serialized) arc.

    Builder.CompiledNode

    Builder.FreezeTail<S>

    Expert: this is invoked by Builder whenever a suffix is serialized.

    Builder.UnCompiledNode<S>

    Expert: holds a pending (seen but not yet serialized) Node.

    Builder<T>

    Builds a minimal FST (maps an Int32sRef term to an arbitrary output) from pre-sorted terms with outputs. The FST becomes an FSA if you use NoOutputs. The FST is written on-the-fly into a compact serialized format byte array, which can be saved to / loaded from a Directory or used directly for traversal. The FST is always finite (no cycles).

    NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698

    The parameterized type is the output type. See the subclasses of Outputs<T>.

    FSTs larger than 2.1GB are now possible (as of Lucene 4.2). FSTs containing more than 2.1B nodes are also now possible, however they cannot be packed.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    ByteSequenceOutputs

    An FST Outputs{BytesRef} implementation where each output is a sequence of bytes.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    BytesRefFSTEnum

    LUCENENET specific. This class is to mimic Java's ability to specify nested classes of Generics without having to specify the generic type (i.e. BytesRefFSTEnum.InputOutput{T} rather than BytesRefFSTEnum{T}.InputOutput{T})

    BytesRefFSTEnum.InputOutput<T>

    Holds a single input (BytesRef) + output pair.

    BytesRefFSTEnum<T>

    Enumerates all input (BytesRef) + output pairs in an FST.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    CharSequenceOutputs

    An FST Outputs<T> implementation where each output is a sequence of characters.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    FST

    LUCENENET specific: This new base class is to mimic Java's ability to use nested types without specifying a type parameter. i.e. FST.BytesReader instead of FST<BytesRef>.BytesReader

    FST.Arc<T>

    Represents a single arc.

    FST.BytesReader

    Reads bytes stored in an FST.

    FST<T>

    Represents an finite state machine (FST), using a compact byte[] format.

    The format is similar to what's used by Morfologik (http://sourceforge.net/projects/morfologik).

    See the FST package documentation for some simple examples.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    FSTEnum<T>

    Can Next() and Advance() through the terms in an FST

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Int32SequenceOutputs

    An FST Outputs<T> implementation where each output is a sequence of System.Int32s.

    NOTE: This was IntSequenceOutputs in Lucene

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Int32sRefFSTEnum

    LUCENENET specific. This class is to mimic Java's ability to specify nested classes of Generics without having to specify the generic type (i.e. Int32sRefFSTEnum.InputOutput{T} rather than Int32sRefFSTEnum{T}.InputOutput{T})

    NOTE: This was Int32sRefFSTEnum{T} in Lucene

    Int32sRefFSTEnum.InputOutput<T>

    Holds a single input (Int32sRef) + output pair.

    Int32sRefFSTEnum<T>

    Enumerates all input (Int32sRef) + output pairs in an FST.

    NOTE: This was IntsRefFSTEnum{T} in Lucene

    Note

    This API is experimental and might change in incompatible ways in the next release.

    NoOutputs

    A null FST Outputs<T> implementation; use this if you just want to build an FSA.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Outputs<T>

    Represents the outputs for an FST, providing the basic algebra required for building and traversing the FST.

    Note that any operation that returns NO_OUTPUT must return the same singleton object from NoOutput.

    LUCENENET IMPORTANT: If T is a collection type, it must implement System.Collections.IStructuralEquatable in order to properly compare its nested values.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    PairOutputs<A, B>

    An FST Outputs<T> implementation, holding two other outputs.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    PairOutputs<A, B>.Pair

    Holds a single pair of two outputs.

    PositiveInt32Outputs

    An FST Outputs<T> implementation where each output is a non-negative value.

    NOTE: This was PositiveIntOutputs in Lucene

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Util

    Static helper methods.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Util.FSTPath<T>

    Represents a path in TopNSearcher.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Util.Result<T>

    Holds a single input (Int32sRef) + output, returned by ShortestPaths<T>(FST<T>, FST.Arc<T>, T, IComparer<T>, Int32, Boolean).

    Util.TopNSearcher<T>

    Utility class to find top N shortest paths from start point(s).

    Util.TopResults<T>

    Holds the results for a top N search using Util.TopNSearcher<T>

    Interfaces

    Builder.INode

    Enums

    FST.INPUT_TYPE

    Specifies allowed range of each int input label for this FST.

    • Improve this Doc
    Back to top Copyright © 2021 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.