Draft:Context-free language reachability

From testwiki
Jump to navigation Jump to search

Template:Short description Template:Draft topics Template:AfC topic Template:AfC submission Template:AfC submission

Context-free language reachability is an algorithmic problem with applications in static program analysis. Given a graph with edge labels from some alphabet and a context-free grammar over that alphabet, the problem is to determine whether there exists a path through the graph such that the concatenation of the labels along the path is a string accepted by the grammar.

Variations

In addition to the decision problem formulation given above, there are several related function problem formulations of CFL-reachability. For brevity, define an L-path to be a path with edge labels in the language of the grammar. Then:[1]

  • The all-pairs variant is to determine all pairs of nodes such that there exists an L-path between them.
  • The single-source variant is to determine all nodes that are reachable from a given source node via an L-path.
  • The single-target variant is to determine all nodes that are the sources of L-paths that end at a given target node.
  • The single-source/single-target variant is to determine whether there is an L-path between two given nodes.

Algorithms

There is a relatively simple dynamic programming algorithm for solving all-pairs CFL-reachability. The algorithm requires a normalized grammar, where each production has at most two symbols (terminals or nonterminals) on the right-hand side. The runtime of this algorithm O(Σ3n3), where Σ is the number of terminals and nonterminals in the normalized grammar (which is linear with respect to the original grammar), and n is the number of nodes in the graph.[2] The algorithm works by repeatedly adding summary edges to the graph: given a production A::=B C, if there exists an edge between some nodes x and y labeled with B and an edge between y and z labeled C, then the algorithm adds a new edge labeled A between x and z. This process is repeated until saturation, i.e., until no more summary edges may be added.

Applications to program analysis

Several problems in program analysis can be formulated as CFL-reachability problems, including:[3]

Alias analysis

Consider an imperative language with pointers, like a simplified C. The program expression graph (PEG) for a program in such a language has a node for each expression in the program, and two kinds of edges:

  • A pointer dereference edge labeled d from each pointer dereference expression *e to the corresponding expression e
  • An assignment edge labeled a from r to l for each assignment l = r

For each d- and a-edge, there are also corresponding edges in the opposite direction, labeled ~d and ~a, respectively.

The CFL-reachability problem over the PEG and the following grammar encodes the may-alias problem:[6]

M ::= ~d V d
V ::= ~F M? F
F ::= (a M?)*
~F ::= (M? ~a)*

The nonterminal M signifies that two memory locations may alias, i.e., they point to the same value. Nonterminal V signifies that two values may alias, i.e., they hold pointers that may alias. F signifies data-flows, which are sequences of assignments interleaved with memory aliases. ~F is the inverse production of F.

The following grammar is equivalent:

M ::= ~d V d
V ::= (M? ~a)* M? (a M?)*

Every CFL-reachability problem can be encoded as a Datalog program.[7]

References

Template:Reflist