Namespace Lucene.Net.Util.Automaton
Finite-state automaton for regular expressions.
This package contains a full DFA/NFA implementation with Unicode alphabet and support for all standard (and a number of non-standard) regular expression operations.
The most commonly used functionality is located in the classes Automaton and RegExp.
For more information, go to the package home page at http://www.brics.dk/automaton/.
Classes
Automaton
Finite-state automaton with regular expression operations.
Class invariants:
- An automaton is either represented explicitly (with State and Transition objects) or with a singleton string (see Singleton and ExpandSingleton()) in case the automaton is known to accept exactly one string. (Implicitly, all states and transitions of an automaton are reachable from its initial state.)
- Automata are always reduced (see Reduce()) and have no transitions to dead states (see RemoveDeadTransitions()).
- If an automaton is nondeterministic, then IsDeterministic
returns
false
(but the converse is not required). - Automata provided as input to operations are generally assumed to be disjoint.
If the states or transitions are manipulated manually, the RestoreInvariant() method and IsDeterministic setter should be used afterwards to restore representation invariants that are assumed by the built-in automata operations.
Note: this class has internal mutable state and is not thread safe. It is the caller's responsibility to ensure any necessary synchronization if you wish to use the same Automaton from multiple threads. In general it is instead recommended to use a RunAutomaton for multithreaded matching: it is immutable, thread safe, and much faster.
BasicAutomata
Construction of basic automata.
BasicOperations
Basic automata operations.
ByteRunAutomaton
Automaton representation for matching UTF-8 byte[].
CharacterRunAutomaton
Automaton representation for matching char[].
CompiledAutomaton
Immutable class holding compiled details for a given Automaton. The Automaton is deterministic, must not have dead states but is not necessarily minimal.
LevenshteinAutomata
Class to construct DFAs that match a word within some edit distance.
Implements the algorithm described in: Schulz and Mihov: Fast String Correction with Levenshtein Automata
MinimizationOperations
Operations for minimizing automata.
RegExp
Regular Expression extension to Automaton.
Regular expressions are built from the following abstract syntax:
regexp::=unionexp | |
| | |
unionexp::=interexp | unionexp(union) | |
|interexp | |
interexp::=concatexp & interexp(intersection)[OPTIONAL] | |
|concatexp | |
concatexp::=repeatexp concatexp(concatenation) | |
|repeatexp | |
repeatexp::=repeatexp ?(zero or one occurrence) | |
|repeatexp *(zero or more occurrences) | |
|repeatexp +(one or more occurrences) | |
|repeatexp {n}(n occurrences) | |
|repeatexp {n,}(n or more occurrences) | |
|repeatexp {n,m}(n to m occurrences, including both) | |
|complexp | |
complexp::=~ complexp(complement)[OPTIONAL] | |
|charclassexp | |
charclassexp::=[ charclasses ](character class) | |
|[^ charclasses ](negated character class) | |
|simpleexp | |
charclasses::=charclass charclasses | |
|charclass | |
charclass::=charexp - charexp(character range, including end-points) | |
|charexp | |
simpleexp::=charexp | |
|.(any single character) | |
|#(the empty language)[OPTIONAL] | |
|@(any string)[OPTIONAL] | |
|" <Unicode string without double-quotes> "(a string) | |
|( )(the empty string) | |
|( unionexp )(precedence override) | |
|< <identifier> >(named automaton)[OPTIONAL] | |
|<n-m>(numerical interval)[OPTIONAL] | |
charexp::=<Unicode character>(a single non-reserved character) | |
|</strong> <Unicode character> (a single character) |
The productions marked [OPTIONAL] are only allowed if
specified by the syntax flags passed to the RegExp constructor.
The reserved characters used in the (enabled) syntax must be escaped with
backslash (
</code>) or double-quotes (
"..."
). (In
contrast to other regexp syntaxes, this is required also in character
classes.) Be aware that dash (-
) has a special meaning in
charclass expressions. An identifier is a string not containing right
angle bracket (>
) or dash (-
). Numerical
intervals are specified by non-negative decimal integers and include both end
points, and if n
and m
have the same number
of digits, then the conforming strings must have that length (i.e. prefixed
by 0's).
RunAutomaton
Finite-state automaton with fast run operation.
SpecialOperations
Special automata operations.
State
Automaton state.
StatePair
Pair of states.
Transition
Automaton transition.
A transition, which belongs to a source state, consists of a Unicode codepoint interval and a destination state.
UTF32ToUTF8
Converts UTF-32 automata to the equivalent UTF-8 representation.
Interfaces
IAutomatonProvider
Automaton provider for RegExp.
Enums
CompiledAutomaton.AUTOMATON_TYPE
Automata are compiled into different internal forms for the most efficient execution depending upon the language they accept.