Fork me on GitHub
  • API

    Show / Hide Table of Contents

    Namespace Lucene.Net.Util.Automaton

    Finite-state automaton for regular expressions.

    This package contains a full DFA/NFA implementation with Unicode alphabet and support for all standard (and a number of non-standard) regular expression operations.

    The most commonly used functionality is located in the classes Automaton and RegExp.

    For more information, go to the package home page at http://www.brics.dk/automaton/.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Classes

    Automaton

    Finite-state automaton with regular expression operations.

    Class invariants:

    • An automaton is either represented explicitly (with State and Transition objects) or with a singleton string (see Singleton and ExpandSingleton()) in case the automaton is known to accept exactly one string. (Implicitly, all states and transitions of an automaton are reachable from its initial state.)
    • Automata are always reduced (see Reduce()) and have no transitions to dead states (see RemoveDeadTransitions()).
    • If an automaton is nondeterministic, then IsDeterministic returns false (but the converse is not required).
    • Automata provided as input to operations are generally assumed to be disjoint.

    If the states or transitions are manipulated manually, the RestoreInvariant() method and IsDeterministic setter should be used afterwards to restore representation invariants that are assumed by the built-in automata operations.

    Note: this class has internal mutable state and is not thread safe. It is the caller's responsibility to ensure any necessary synchronization if you wish to use the same Automaton from multiple threads. In general it is instead recommended to use a RunAutomaton for multithreaded matching: it is immutable, thread safe, and much faster.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    BasicAutomata

    Construction of basic automata.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    BasicOperations

    Basic automata operations.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    ByteRunAutomaton

    Automaton representation for matching UTF-8 byte[].

    CharacterRunAutomaton

    Automaton representation for matching char[].

    CompiledAutomaton

    Immutable class holding compiled details for a given Automaton. The Automaton is deterministic, must not have dead states but is not necessarily minimal.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    LevenshteinAutomata

    Class to construct DFAs that match a word within some edit distance.

    Implements the algorithm described in: Schulz and Mihov: Fast String Correction with Levenshtein Automata

    Note

    This API is experimental and might change in incompatible ways in the next release.

    MinimizationOperations

    Operations for minimizing automata.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    RegExp

    Regular Expression extension to Automaton.

    Regular expressions are built from the following abstract syntax:

    regexp::=unionexp
    |
    unionexp::=interexp | unionexp(union)
    |interexp
    interexp::=concatexp & interexp(intersection)[OPTIONAL]
    |concatexp
    concatexp::=repeatexp concatexp(concatenation)
    |repeatexp
    repeatexp::=repeatexp ?(zero or one occurrence)
    |repeatexp *(zero or more occurrences)
    |repeatexp +(one or more occurrences)
    |repeatexp {n}(n occurrences)
    |repeatexp {n,}(n or more occurrences)
    |repeatexp {n,m}(n to m occurrences, including both)
    |complexp
    complexp::=~ complexp(complement)[OPTIONAL]
    |charclassexp
    charclassexp::=[ charclasses ](character class)
    |[^ charclasses ](negated character class)
    |simpleexp
    charclasses::=charclass charclasses
    |charclass
    charclass::=charexp - charexp(character range, including end-points)
    |charexp
    simpleexp::=charexp
    |.(any single character)
    |#(the empty language)[OPTIONAL]
    |@(any string)[OPTIONAL]
    |" <Unicode string without double-quotes> "(a string)
    |( )(the empty string)
    |( unionexp )(precedence override)
    |< <identifier> >(named automaton)[OPTIONAL]
    |<n-m>(numerical interval)[OPTIONAL]
    charexp::=<Unicode character>(a single non-reserved character)
    |</strong> <Unicode character> (a single character)

    The productions marked [OPTIONAL] are only allowed if specified by the syntax flags passed to the RegExp constructor. The reserved characters used in the (enabled) syntax must be escaped with backslash (</code>) or double-quotes ("..."). (In contrast to other regexp syntaxes, this is required also in character classes.) Be aware that dash (-) has a special meaning in charclass expressions. An identifier is a string not containing right angle bracket (>) or dash (-). Numerical intervals are specified by non-negative decimal integers and include both end points, and if n and m have the same number of digits, then the conforming strings must have that length (i.e. prefixed by 0's).

    Note

    This API is experimental and might change in incompatible ways in the next release.

    RunAutomaton

    Finite-state automaton with fast run operation.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    SpecialOperations

    Special automata operations.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    State

    Automaton state.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    StatePair

    Pair of states.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Transition

    Automaton transition.

    A transition, which belongs to a source state, consists of a Unicode codepoint interval and a destination state.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    UTF32ToUTF8

    Converts UTF-32 automata to the equivalent UTF-8 representation.

    Note

    This API is for internal purposes only and might change in incompatible ways in the next release.

    Interfaces

    IAutomatonProvider

    Automaton provider for RegExp.

    Note

    This API is experimental and might change in incompatible ways in the next release.

    Enums

    CompiledAutomaton.AUTOMATON_TYPE

    Automata are compiled into different internal forms for the most efficient execution depending upon the language they accept.

    RegExpSyntax

    • Improve this Doc
    Back to top Copyright © 2021 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.