Deterministic acyclic finite state automaton

The strings "tap", "taps", "top", and "tops" stored in a trie (left) and a DAFSA (right). EOW stands for end-of-word.

In computer science, a deterministic acyclic finite state automaton (DAFSA),^[1] also called a directed acyclic word graph (DAWG, although that name also refers to a related data structure that functions as a suffix index^[2]), is a data structure that represents a set of strings and supports a query operation that tests whether a given string belongs to the set in time proportional to the string's length. Algorithms exist to construct and maintain such automata^[1] while keeping them minimal.

A DAFSA is a special case of a finite state recognizer that takes the form of a directed acyclic graph with a single source vertex (a vertex with no incoming edges), in which each edge of the graph is labeled by a letter or symbol, and in which each vertex has at most one outgoing edge for each possible letter or symbol. The strings represented by the DAFSA are formed by the symbols on paths in the graph from the source vertex to any sink vertex (a vertex with no outgoing edges). In fact, a deterministic finite state automaton is acyclic if and only if it recognizes a finite set of strings.^[1]

Comparison to tries

By allowing the same vertices to be reached by multiple paths, a DAFSA may use significantly fewer vertices than the closely related trie data structure. Consider, for example, the four English words "tap", "taps", "top", and "tops". A trie for those four words would have 12 vertices, one for each of the strings formed as a prefix of one of these words, or for one of the words followed by the end-of-string marker. However, a DAFSA can represent these same four words using only six vertices v_i for 0 ≤ i ≤ 5, and the following edges: an edge from v₀ to v₁ labeled "t", two edges from v₁ to v₂ labeled "a" and "o", an edge from v₂ to v₃ labeled "p", an edge from v₃ to v₄ labeled "s", and edges from v₃ and v₄ to v₅ labeled with the end-of-string marker. There is a tradeoff between memory and functionality: a standard DAFSA can tell you whether a word exists within it, but it cannot point you to auxiliary information about that word, whereas a trie can.
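The vertex counts in this example can be checked with a short Python sketch. The "$" character used as the end-of-string marker and the names EOW, WORDS, and DAFSA are assumptions made for illustration; the edge list itself is the one described in the paragraph above.

```python
EOW = "$"  # assumed end-of-string marker
WORDS = ["tap", "taps", "top", "tops"]


def trie_vertex_count(words):
    """A trie has one vertex per distinct prefix of word + EOW (including the empty prefix)."""
    prefixes = set()
    for word in words:
        marked = word + EOW
        for i in range(len(marked) + 1):
            prefixes.add(marked[:i])
    return len(prefixes)


# The six-vertex DAFSA from the text: vertex index -> {edge label: target index}.
DAFSA = {
    0: {"t": 1},
    1: {"a": 2, "o": 2},
    2: {"p": 3},
    3: {"s": 4, EOW: 5},
    4: {EOW: 5},
    5: {},
}


def dafsa_words(graph, vertex=0, prefix=""):
    """Spell out every path from the source to the sink, dropping the final marker."""
    if not graph[vertex]:
        yield prefix[:-1]  # every complete path ends with the one-character EOW
    for label, target in graph[vertex].items():
        yield from dafsa_words(graph, target, prefix + label)


if __name__ == "__main__":
    print(trie_vertex_count(WORDS))    # vertices in the trie
    print(len(DAFSA))                  # vertices in the DAFSA
    print(sorted(dafsa_words(DAFSA)))  # same four words
```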

The primary difference between a DAFSA and a trie is the elimination of suffix and infix redundancy in storing strings. A trie eliminates only prefix redundancy, since all common prefixes are shared between strings: for example, "doctors" and "doctorate" share the prefix "doctor". In a DAFSA, common suffixes are also shared among words that have the same set of possible suffixes as each other. For dictionary sets of common English words, this translates into a major reduction in memory usage.
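One way to see which trie vertices a DAFSA can merge is to group prefixes by the set of suffixes that may follow them. The Python sketch below does this by brute force; it is meant as an illustration under the assumption of a "$" end-of-word marker, not as an efficient construction algorithm, and the function name merge_classes is hypothetical.

```python
from collections import defaultdict

EOW = "$"  # assumed end-of-word marker


def merge_classes(words):
    """Group the prefixes (trie vertices) that share the same set of possible suffixes."""
    completions = defaultdict(set)  # prefix -> every suffix that can follow it
    for word in words:
        marked = word + EOW
        for i in range(len(marked) + 1):
            completions[marked[:i]].add(marked[i:])
    classes = defaultdict(list)  # suffix set -> prefixes that have exactly that set
    for prefix, suffixes in completions.items():
        classes[frozenset(suffixes)].append(prefix)
    return sorted(sorted(group) for group in classes.values())


if __name__ == "__main__":
    for group in merge_classes(["tap", "taps", "top", "tops"]):
        print(group)
```

For the words "tap", "taps", "top", and "tops", the prefixes "ta" and "to" land in the same group because both can only be followed by "p" or "ps". Each group corresponds to a single DAFSA vertex, so the 12 trie vertices collapse into six groups.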

Because the terminal nodes of a DAFSA can be reached by multiple paths, a DAFSA cannot directly store auxiliary information relating to each path, e.g. a word's frequency in the English language. However, if for each node we store the number of unique paths through that point in the structure, we can use it to retrieve the index of a word, or a word given its index.^[3] The auxiliary information can then be stored in an array.
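The indexing idea can be sketched in Python as follows. This is one possible reading of the scheme, assuming that each vertex stores the number of paths from it to the sink and that edges are visited in sorted label order; the "$" marker, the function names, and the NOTES array are hypothetical and chosen only for illustration.

```python
from functools import lru_cache

EOW = "$"  # assumed marker; it sorts before letters, so "tap" precedes "taps"

# The tap/taps/top/tops DAFSA: vertex index -> {edge label: target index}.
DAFSA = {
    0: {"t": 1},
    1: {"a": 2, "o": 2},
    2: {"p": 3},
    3: {"s": 4, EOW: 5},
    4: {EOW: 5},
    5: {},
}


@lru_cache(maxsize=None)
def paths_from(vertex):
    """Number of distinct paths from this vertex to the sink."""
    if not DAFSA[vertex]:
        return 1
    return sum(paths_from(target) for target in DAFSA[vertex].values())


def word_to_index(word):
    """Rank of the word in sorted order, or None if it is not stored."""
    vertex, index = 0, 0
    for symbol in word + EOW:
        edges = DAFSA[vertex]
        if symbol not in edges:
            return None
        # Every word that leaves through a smaller label sorts before this one.
        index += sum(paths_from(edges[label]) for label in edges if label < symbol)
        vertex = edges[symbol]
    return index


def index_to_word(index):
    """Inverse of word_to_index, or None if the index is out of range."""
    if not 0 <= index < paths_from(0):
        return None
    vertex, labels = 0, []
    while DAFSA[vertex]:
        for label in sorted(DAFSA[vertex]):
            target = DAFSA[vertex][label]
            count = paths_from(target)
            if index < count:  # the wanted word leaves through this edge
                labels.append(label)
                vertex = target
                break
            index -= count  # skip all words that leave through this edge
    return "".join(labels)[:-1]  # drop the trailing EOW


# Hypothetical auxiliary array, aligned with the word indices.
NOTES = ["note for tap", "note for taps", "note for top", "note for tops"]

if __name__ == "__main__":
    print(word_to_index("top"), index_to_word(1), NOTES[word_to_index("tops")])
```

With this numbering, the automaton answers membership queries and the separate array carries the per-word data, so the shared vertices never need to hold word-specific information.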

References

^[1] Jan Daciuk, Stoyan Mihov, Bruce Watson and Richard Watson (2000). Incremental construction of minimal acyclic finite state automata. Computational Linguistics 26(1):3-16.
^[2] This article incorporates public domain material from Paul E. Black. "directed acyclic word graph". Dictionary of Algorithms and Data Structures. NIST.
^[3] Kowaltowski, T.; Lucchesi, C. L. (1993). "Applications of finite automata representing large vocabularies". Software: Practice and Experience. 1993: 15–30. CiteSeerX 10.1.1.56.5272.

Blumer, A.; Blumer, J.; Haussler, D.; Ehrenfeucht, A.; Chen, M.T.; Seiferas, J. (1985), "The smallest automaton recognizing the subwords of a text", Theoretical Computer Science, 40: 31–55, doi:10.1016/0304-3975(85)90157-4
Appel, Andrew; Jacobsen, Guy (1988), "The World's Fastest Scrabble Program" (PDF), Communications of the ACM, 31 (5): 572–578, doi:10.1145/42411.42420. One of the early mentions of the data structure.
Jansen, Cees J. A.; Boekee, Dick E. (1990), "On the significance of the directed acyclic word graph in cryptology", Advances in Cryptology – AUSCRYPT '90, Lecture Notes in Computer Science, vol. 453, Springer-Verlag, pp. 318–326, doi:10.1007/BFb0030372, ISBN 3-540-53000-2.
Epifanio, Chiara; Mignosi, Filippo; Shallit, Jeffrey; Venturini, Ilaria (2004), "Sturmian graphs and a conjecture of Moser", in Calude, Cristian S.; Calude, Elena; Dineen, Michael J. (eds.), Developments in language theory. Proceedings, 8th international conference (DLT 2004), Auckland, New Zealand, December 2004, Lecture Notes in Computer Science, vol. 3340, Springer-Verlag, pp. 175–187, ISBN 3-540-24014-4, Zbl 1117.68454
Tresoldi, Tiago (2020), "DAFSA: a Python library for Deterministic Acyclic Finite State Automata", Journal of Open Source Software, 5 (46): 1986, doi:10.21105/joss.01986, hdl:21.11116/0000-0005-AD0D-B An open source Python implementation.

External links

"Directed Acyclic Word Graph or DAWG" – JohnPaul Adamovsky teaches how to construct a DAFSA using an array of integers (Archived 22 July 2022 at the Wayback Machine)
"Caroline Word Graph or CWG" – JohnPaul Adamovsky teaches how to construct a DAFSA hash function using a novel encoding with multiple integer arrays (Archived 27 July 2022 at the Wayback Machine)

This page was last edited on 3 January 2024, at 01:02

From Wikipedia, the free encyclopedia
