E-graph

From testwiki
Jump to navigation Jump to search

Template:Short description

In computer science, an e-graph is a data structure that stores an equivalence relation over terms of some language.

Definition and operations

Let Σ be a set of uninterpreted functions, where Σn is the subset of Σ consisting of functions of arity n. Let π•šπ•• be a countable set of opaque identifiers that may be compared for equality, called e-class IDs. The application of fΣn to e-class IDs i1,i2,,inπ•šπ•• is denoted f(i1,i2,,in) and called an e-node.

The e-graph then represents equivalence classes of e-nodes, using the following data structures:[1]

  • A union-find structure U representing equivalence classes of e-class IDs, with the usual operations find, add and merge. An e-class ID e is canonical if find(U,e)=e; an e-node f(i1,,in) is canonical if each ij is canonical (j in 1,,n).
  • An association of e-class IDs with sets of e-nodes, called e-classes. This consists of
    • a hashcons H (i.e. a mapping) from canonical e-nodes to e-class IDs, and
    • an e-class map M that maps e-class IDs to e-classes, such that M maps equivalent IDs to the same set of e-nodes: i,jπ•šπ••,M[i]=M[j]find(U,i)=find(U,j)

Invariants

In addition to the above structure, a valid e-graph conforms to several data structure invariants.[2] Two e-nodes are equivalent if they are in the same e-class. The congruence invariant states that an e-graph must ensure that equivalence is closed under congruence, where two e-nodes f(i1,,in),f(j1,,jn) are congruent when find(U,ik)=find(U,jk),k{1,,n}. The hashcons invariant states that the hashcons maps canonical e-nodes to their e-class ID.

Operations

Template:Expand section

E-graphs expose wrappers around the add, find, and merge operations from the union-find that preserve the e-graph invariants. The last operation, e-matching, is described below.

Equivalent formulations

An e-graph can also be formulated as a bipartite graph G=(Nid,E) where

  • id is the set of e-class IDs (as above),
  • N is the set of e-nodes, and
  • E(id×N)(N×id) is a set of directed edges.

There is a directed edge from each e-class to each of its members, and from each e-node to each of its children.[3]

E-matching

Let V be a set of variables and let Term(Σ,V) be the smallest set that includes the 0-arity function symbols (also called constants), includes the variables, and is closed under application of the function symbols. In other words, Term(Σ,V) is the smallest set such that VTerm(Σ,V), Σ0Term(Σ,V), and when x1,,xnTerm(Σ,V) and fΣn, then f(x1,,xn)Term(Σ,V). A term containing variables is called a pattern, a term without variables is called ground.

An e-graph E represents a ground term tTerm(Σ,) if one of its e-classes represents t. An e-class C represents t if some e-node f(i1,,in)C does. An e-node f(i1,,in)C represents a term g(j1,,jn) if f=g and each e-class M[ik] represents the term jk (k in 1,,n).

e-matching is an operation that takes a pattern pTerm(Σ,V) and an e-graph E, and yields all pairs (σ,C) where σV×π•šπ•• is a substitution mapping the variables in p to e-class IDs and Cπ•šπ•• is an e-class ID such that the term σ(p) is represented by C. There are several known algorithms for e-matching,[4][5] the relational e-matching algorithm is based on worst-case optimal joins and is worst-case optimal.[6]

Extraction

Given an e-class and a cost function that maps each function symbol in Σ to a natural number, the extraction problem is to find a ground term with minimal total cost that is represented by the given e-class. This problem is NP-hard.[7] There is also no constant-factor approximation algorithm for this problem, which can be shown by reduction from the set cover problem. However, for graphs with bounded treewidth, there is a linear-time, fixed-parameter tractable algorithm.[8]

Complexity

  • An e-graph with n equalities can be constructed in O(n log n) time.[9]

Equality saturation

Template:Expand section

Equality saturation is a technique for building optimizing compilers using e-graphs.[10] It operates by applying a set of rewrites using e-matching until the e-graph is saturated, a timeout is reached, an e-graph size limit is reached, a fixed number of iterations is exceeded, or some other halting condition is reached. After rewriting, an optimal term is extracted from the e-graph according to some cost function, usually related to AST size or performance considerations.

Applications

E-graphs are used in automated theorem proving. They are a crucial part of modern SMT solvers such as Z3[11] and CVC4, where they are used to decide the empty theory by computing the congruence closure of a set of equalities, and e-matching is used to instantiate quantifiers.[12] In DPLL(T)-based solvers that use conflict-driven clause learning (also known as non-chronological backtracking), e-graphs are extended to produce proof certificates.[13] E-graphs are also used in the Simplify theorem prover of ESC/Java.[14]

Equality saturation is used in specialized optimizing compilers,[15] e.g. for deep learning[16] and linear algebra.[17] Equality saturation has also been used for translation validation applied to the LLVM toolchain.[18]

E-graphs have been applied to several problems in program analysis, including fuzzing,[19] abstract interpretation,[20] and library learning.[21]

References

Template:Reflist

Template:Program analysis