    Class Builder<T>

    Builds a minimal FST (maps an Int32sRef term to an arbitrary output) from pre-sorted terms with outputs. The FST becomes an FSA if you use NoOutputs. The FST is written on the fly into a compact, serialized byte array, which can be saved to or loaded from a Directory, or used directly for traversal. The FST is always finite (no cycles).

    NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698

    The parameterized type is the output type. See the subclasses of Outputs<T>.

    FSTs larger than 2.1 GB are possible as of Lucene 4.2. FSTs containing more than 2.1 billion nodes are also possible; however, they cannot be packed.

    Note

    This API is experimental and might change in incompatible ways in the next release.
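
To make "minimal" concrete, the following toy sketch compares a plain trie with the same graph after identical suffix subtrees are merged. It is only an illustration of suffix sharing under simplifying assumptions: the types and names (`SuffixSharingSketch`, `Node`, `Count`) are hypothetical, it merges nodes after the fact rather than incrementally from sorted input, and it is not how Lucene.Net implements the builder.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SuffixSharingSketch
{
    // Hypothetical toy node; not a Lucene.Net type.
    private sealed class Node
    {
        public bool IsFinal;
        public SortedDictionary<char, Node> Arcs = new SortedDictionary<char, Node>();
    }

    // Builds an ordinary trie: one path per word, no sharing of suffixes.
    private static Node BuildTrie(IEnumerable<string> words)
    {
        var root = new Node();
        foreach (var word in words)
        {
            var node = root;
            foreach (var c in word)
            {
                if (!node.Arcs.TryGetValue(c, out var next))
                {
                    next = new Node();
                    node.Arcs[c] = next;
                }
                node = next;
            }
            node.IsFinal = true;
        }
        return root;
    }

    private static int CountTrieNodes(Node node)
    {
        return 1 + node.Arcs.Values.Sum(child => CountTrieNodes(child));
    }

    // Assigns an id to each node bottom-up. Two nodes with the same finality
    // and the same (label, target id) arcs receive the same id, so they count once.
    private static int Canonicalize(Node node, Dictionary<string, int> ids)
    {
        var parts = node.Arcs.Select(kv => kv.Key + ":" + Canonicalize(kv.Value, ids));
        var signature = (node.IsFinal ? "F|" : "N|") + string.Join(",", parts);
        if (!ids.TryGetValue(signature, out var id))
        {
            id = ids.Count;
            ids[signature] = id;
        }
        return id;
    }

    // Returns the node count of the plain trie and of the suffix-shared graph.
    public static (int TrieNodes, int SharedNodes) Count(params string[] words)
    {
        var root = BuildTrie(words);
        var ids = new Dictionary<string, int>();
        Canonicalize(root, ids);
        return (CountTrieNodes(root), ids.Count);
    }
}
```

For the inputs "cats" and "dogs", the trie has 9 nodes while the suffix-shared graph has 7, because the two "s" endings collapse into one path.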

    Inheritance
    object
    Builder
    Builder<T>
    Inherited Members
    object.Equals(object)
    object.Equals(object, object)
    object.GetHashCode()
    object.GetType()
    object.MemberwiseClone()
    object.ReferenceEquals(object, object)
    object.ToString()
    Namespace: Lucene.Net.Util.Fst
    Assembly: Lucene.Net.dll
    Syntax
    public class Builder<T> : Builder where T : class
    Type Parameters
    Name Description
    T The output type; see the subclasses of Outputs<T>.

    Constructors

    Builder(INPUT_TYPE, Outputs<T>)

    Instantiates an FST/FSA builder without any pruning. A shortcut to Builder(INPUT_TYPE, int, int, bool, bool, int, Outputs<T>, FreezeTail<T>, bool, float, bool, int) with pruning options turned off.

    Declaration
    public Builder(FST.INPUT_TYPE inputType, Outputs<T> outputs)
    Parameters
    Type Name Description
    FST.INPUT_TYPE inputType
    Outputs<T> outputs

    Builder(INPUT_TYPE, int, int, bool, bool, int, Outputs<T>, FreezeTail<T>, bool, float, bool, int)

    Instantiates an FST/FSA builder with all the possible tuning and construction tweaks. Read parameter documentation carefully.

    Declaration
    public Builder(FST.INPUT_TYPE inputType, int minSuffixCount1, int minSuffixCount2, bool doShareSuffix, bool doShareNonSingletonNodes, int shareMaxTailLength, Outputs<T> outputs, Builder.FreezeTail<T> freezeTail, bool doPackFST, float acceptableOverheadRatio, bool allowArrayArcs, int bytesPageBits)
    Parameters
    Type Name Description
    FST.INPUT_TYPE inputType

    The input type (transition labels). Can be any value of the FST.INPUT_TYPE enumeration. Shorter types consume less memory. Strings (character sequences) are represented as BYTE4 (full Unicode code points).

    int minSuffixCount1

    If pruning the input graph during construction, this threshold is used for telling if a node is kept or pruned. If transition_count(node) >= minSuffixCount1, the node is kept.

    int minSuffixCount2

    (Note: only Mike McCandless knows what this one is really doing...)

    bool doShareSuffix

    If true, shared suffixes are compacted into unique paths. This requires an additional RAM-intensive hash map for lookups in memory. Setting this parameter to false creates a single suffix path for all input sequences. This results in a larger FST, but requires substantially less memory and CPU during building.

    bool doShareNonSingletonNodes

    Only used if doShareSuffix is true. Set this to true to ensure the FST is fully minimal, at the cost of more CPU and more RAM during building.

    int shareMaxTailLength

    Only used if doShareSuffix is true. Set this to int.MaxValue to ensure the FST is fully minimal, at the cost of more CPU and more RAM during building.

    Outputs<T> outputs

    The output type for each input sequence. Applies only if building an FST. For an FSA, use NoOutputs.Singleton and NoOutputs.NoOutput as the singleton output object.

    Builder.FreezeTail<T> freezeTail

    bool doPackFST

    Pass true to create a packed FST.

    float acceptableOverheadRatio

    How to trade speed for space when building the FST. This option is only relevant when doPackFST is true. See PackedInt32s.GetMutable(int, int, float).

    bool allowArrayArcs

    Pass false to disable the array arc optimization while building the FST; this will make the resulting FST smaller but slower to traverse.

    int bytesPageBits

    How many bits wide to make each byte[] block in the BytesStore; if you know the FST will be large, make this larger. For example, 15 bits gives 32768-byte pages.
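
The relationship between bytesPageBits and page size is a plain power of two, which the small sketch below spells out. The class and method names (`BytesPageSketch`, `PageSizeBytes`, `PagesNeeded`) are hypothetical helpers for illustration, and the 1 to 30 range check is a choice of this sketch to keep the shift within int range, not a documented limit of the builder.

```csharp
using System;

public static class BytesPageSketch
{
    // Size in bytes of one page when each page is addressed with bytesPageBits bits.
    public static int PageSizeBytes(int bytesPageBits)
    {
        // Assumption of this sketch: keep 1 << bits inside the positive int range.
        if (bytesPageBits < 1 || bytesPageBits > 30)
        {
            throw new ArgumentOutOfRangeException(nameof(bytesPageBits));
        }
        return 1 << bytesPageBits;
    }

    // Number of pages needed to hold totalBytes, using ceiling division.
    public static long PagesNeeded(long totalBytes, int bytesPageBits)
    {
        long pageSize = PageSizeBytes(bytesPageBits);
        return (totalBytes + pageSize - 1) / pageSize;
    }
}
```

With 15 bits, `PageSizeBytes` yields 32768, matching the example in the parameter description; larger values mean fewer, bigger pages for the same FST size.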

    Properties

    MappedStateCount

    Declaration
    public virtual long MappedStateCount { get; }
    Property Value
    Type Description
    long

    TermCount

    Declaration
    public virtual long TermCount { get; }
    Property Value
    Type Description
    long

    TotStateCount

    Declaration
    public virtual long TotStateCount { get; }
    Property Value
    Type Description
    long

    Methods

    Add(Int32sRef, T)

    It's OK to add the same input twice in a row with different outputs, as long as the Outputs implementation provides the Merge method. Note that input is fully consumed once this method returns (so the caller is free to reuse it), but output is not. So if your outputs are mutable (e.g. ByteSequenceOutputs or Int32SequenceOutputs), you cannot reuse them across calls.

    Declaration
    public virtual void Add(Int32sRef input, T output)
    Parameters
    Type Name Description
    Int32sRef input
    T output
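
Because the builder expects pre-sorted terms, a caller may want to sort before calling Add. The sketch below assumes the terms are strings that will be fed to the builder as UTF-8 bytes, in which case the relevant order is unsigned byte order rather than culture-aware string order. `SortedInputSketch`, `CompareUtf8`, and `SortForBuilder` are hypothetical helper names, not part of Lucene.Net.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SortedInputSketch
{
    // Compares two strings by their UTF-8 encoding, byte by byte (unsigned).
    public static int CompareUtf8(string a, string b)
    {
        byte[] x = Encoding.UTF8.GetBytes(a);
        byte[] y = Encoding.UTF8.GetBytes(b);
        int n = Math.Min(x.Length, y.Length);
        for (int i = 0; i < n; i++)
        {
            int diff = x[i] - y[i]; // byte is unsigned in C#, so this is an unsigned comparison
            if (diff != 0)
            {
                return diff;
            }
        }
        // When one term is a prefix of the other, the shorter term sorts first.
        return x.Length - y.Length;
    }

    // Returns a new list with the terms in the order assumed above.
    public static List<string> SortForBuilder(IEnumerable<string> terms)
    {
        var sorted = new List<string>(terms);
        sorted.Sort(CompareUtf8);
        return sorted;
    }
}
```

Re-encoding on every comparison keeps the sketch short; for large term sets, encoding each term once and sorting the byte arrays would avoid the repeated allocations.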

    Finish()

    Returns the final FST. NOTE: this returns null if nothing is accepted by the FST.

    Declaration
    public virtual FST<T> Finish()
    Returns
    Type Description
    FST<T>
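
A minimal sketch of how the two-argument constructor, Add, and Finish might be used together. It assumes the Lucene.Net 4.8 API, in particular `ByteSequenceOutputs.Singleton`, `Util.ToInt32sRef(BytesRef, Int32sRef)`, `Util.Get(FST<T>, BytesRef)`, and `BytesRef.Utf8ToString()`, none of which are documented on this page and which may differ between releases. `BuilderUsageSketch` and `BuildAndLookup` are hypothetical names.

```csharp
using Lucene.Net.Util;
using Lucene.Net.Util.Fst;
// Alias so that "Util" is never confused with the Lucene.Net.Util namespace.
using FstUtil = Lucene.Net.Util.Fst.Util;

public static class BuilderUsageSketch
{
    // Builds a small FST from three terms and looks up one key in it.
    public static string BuildAndLookup(string key)
    {
        ByteSequenceOutputs outputs = ByteSequenceOutputs.Singleton;
        var builder = new Builder<BytesRef>(FST.INPUT_TYPE.BYTE1, outputs);

        // Terms are listed in sorted order, as the builder expects.
        string[] terms = { "cat", "dog", "dogs" };
        string[] values = { "meow", "woof", "woofs" };

        // The input is fully consumed by Add, so one scratch instance can be reused.
        var scratch = new Int32sRef();
        for (int i = 0; i < terms.Length; i++)
        {
            FstUtil.ToInt32sRef(new BytesRef(terms[i]), scratch);
            // The output is not consumed by Add, so a fresh BytesRef is passed each time.
            builder.Add(scratch, new BytesRef(values[i]));
        }

        // Finish returns null if nothing is accepted by the FST.
        FST<BytesRef> fst = builder.Finish();
        if (fst == null)
        {
            return null;
        }

        BytesRef found = FstUtil.Get(fst, new BytesRef(key));
        return found?.Utf8ToString();
    }
}
```

BytesRef is used as the output type here because it is a reference type, which fits the `where T : class` constraint shown in the class syntax.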

    GetFstSizeInBytes()

    Declaration
    public virtual long GetFstSizeInBytes()
    Returns
    Type Description
    long
    Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.