Chomsky–Schützenberger enumeration theorem

From testwiki
Jump to navigation Jump to search

In formal language theory, the Chomsky–Schützenberger enumeration theorem is a theorem derived by Noam Chomsky and Marcel-Paul Schützenberger about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

Statement

In order to state the theorem, a few notions from algebra and formal language theory are needed.

Let denote the set of nonnegative integers. A power series over is an infinite series of the form

f=f(x)=k=0akxk=a0+a1x1+a2x2+a3x3+

with coefficients ak in . The multiplication of two formal power series f and g is defined in the expected way as the convolution of the sequences an and bn:

f(x)g(x)=k=0(i=0kaibki)xk.

In particular, we write f2=f(x)f(x), f3=f(x)f(x)f(x), and so on. In analogy to algebraic numbers, a power series f(x) is called algebraic over (x), if there exists a finite set of polynomials p0(x),p1(x),p2(x),,pn(x) each with rational coefficients such that

p0(x)+p1(x)f+p2(x)f2++pn(x)fn=0.

A context-free grammar is said to be unambiguous if every string generated by the grammar admits a unique parse tree or, equivalently, only one leftmost derivation. Having established the necessary notions, the theorem is stated as follows.

Chomsky–Schützenberger theorem. If L is a context-free language admitting an unambiguous context-free grammar, and ak:=|L Σk| is the number of words of length k in L, then G(x)=k=0akxk is a power series over that is algebraic over (x).

Proofs of this theorem are given by Template:Harvtxt, and by Template:Harvtxt.

Usage

Asymptotic estimates

The theorem can be used in analytic combinatorics to estimate the number of words of length n generated by a given unambiguous context-free grammar, as n grows large. The following example is given by Template:Harvtxt: the unambiguous context-free grammar G over the alphabet {0,1} has start symbol S and the following rules

SM | U
M → 0M1M | ε
U → 0S | 0M1U.

To obtain an algebraic representation of the power series Template:Tmath associated with a given context-free grammar G, one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by x, each occurrence of ε by the integer '1', each occurrence of '→' by '=', and each occurrence of '|' by '+', respectively. The operation of concatenation at the right-hand-side of each rule corresponds to the multiplication operation in the equations thus obtained. This yields the following system of equations:

S = M + U
M = M²x² + 1
U = Sx + MUx²

In this system of equations, S, M, and U are functions of x, so one could also write Template:Tmath, Template:Tmath, and Template:Tmath. The equation system can be resolved after S, resulting in a single algebraic equation:

Template:Tmath.

This quadratic equation has two solutions for S, one of which is the algebraic power series Template:Tmath. By applying methods from complex analysis to this equation, the number an of words of length n generated by G can be estimated, as n grows large. In this case, one obtains anO(2+ϵ)n but anO(2ϵ)n for each ϵ>0.[1]

The following example is from Template:Harvtxt:{SXYTaT|TbT|YcYYYaY|cY|abTaYYa|XXa|b|c{s(z)=x(z)y(z)t(z)=zt(z)+zt(z)2+zy(z)2y(z)=zy(z)2+zy(z)+z4t(z)y(z)2+x(z)x(z)=3zwhich simplifies tos(z)827(z3z2)s(z)5++59049z10=0

Inherent ambiguity

In classical formal language theory, the theorem can be used to prove that certain context-free languages are inherently ambiguous. For example, the Goldstine language LG over the alphabet {a,b} consists of the words an1ban2banpb with p1, ni>0 for i{1,2,,p}, and njj for some j{1,2,,p}.

It is comparably easy to show that the language LG is context-free.Template:Sfnp The harder part is to show that there does not exist an unambiguous grammar that generates LG. This can be proved as follows: If gk denotes the number of words of length k in LG, then for the associated power series holds G(x)=k=0gkxk=1x12x1xk1xk(k+1)/21. Using methods from complex analysis, one can prove that this function is not algebraic over (x). By the Chomsky-Schützenberger theorem, one can conclude that LG does not admit an unambiguous context-free grammar.[2]

Notes

Template:Reflist

References

Template:Refbegin

Template:Refend

Template:Noam Chomsky

  1. See Template:Harvtxt for a detailed exposition.
  2. See Template:Harvtxt for detailed account.