Chomsky–Schützenberger enumeration theorem

In formal language theory, the Chomsky–Schützenberger enumeration theorem is a theorem derived by Noam Chomsky and Marcel-Paul Schützenberger about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

Statement

In order to state the theorem, a few notions from algebra and formal language theory are needed.

Let $ℕ$ denote the set of nonnegative integers. A power series over $ℕ$ is an infinite series of the form

f = f (x) = \sum_{k = 0}^{\infty} a_{k} x^{k} = a_{0} + a_{1} x^{1} + a_{2} x^{2} + a_{3} x^{3} + \dots

with coefficients $a_{k}$ in $ℕ$. The product of two formal power series $f$ and $g$, with coefficient sequences $a_{k}$ and $b_{k}$ respectively, is defined in the expected way as the convolution of these sequences:

f (x) \cdot g (x) = \sum_{k = 0}^{\infty} (\sum_{i = 0}^{k} a_{i} b_{k - i}) x^{k} .
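The convolution formula can be illustrated on truncated coefficient lists. The following is a minimal sketch; the function name `cauchy_product` and the truncation parameter `n` are illustrative choices, not part of the theorem.

```python
def cauchy_product(a, b, n):
    """Return the first n coefficients of f*g, where a and b list the
    leading coefficients of f and g (missing coefficients count as 0)."""
    c = []
    for k in range(n):
        # c_k = sum over i of a_i * b_(k-i), as in the convolution formula
        c.append(sum(a[i] * b[k - i]
                     for i in range(k + 1)
                     if i < len(a) and k - i < len(b)))
    return c


# (1 + x)^2 = 1 + 2x + x^2
print(cauchy_product([1, 1], [1, 1], 4))        # [1, 2, 1, 0]
# (1 + x + x^2 + ...)^2 = 1 + 2x + 3x^2 + ...
print(cauchy_product([1] * 5, [1] * 5, 5))      # [1, 2, 3, 4, 5]
```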

In particular, we write $f^{2} = f (x) \cdot f (x)$, $f^{3} = f (x) \cdot f (x) \cdot f (x)$, and so on. In analogy to algebraic numbers, a power series $f (x)$ is called algebraic over $ℚ (x)$ if there exist finitely many polynomials $p_{0} (x), p_{1} (x), p_{2} (x), \dots, p_{n} (x)$, not all zero and each with rational coefficients, such that

p_{0} (x) + p_{1} (x) \cdot f + p_{2} (x) \cdot f^{2} + \dots + p_{n} (x) \cdot f^{n} = 0.

A context-free grammar is said to be unambiguous if every string generated by the grammar admits a unique parse tree or, equivalently, only one leftmost derivation. Having established the necessary notions, the theorem is stated as follows.

Chomsky–Schützenberger theorem. If $L$ is a context-free language admitting an unambiguous context-free grammar, and $a_{k} := | L \cap Σ^{k} |$ is the number of words of length $k$ in $L$, then $G (x) = \sum_{k = 0}^{\infty} a_{k} x^{k}$ is a power series over $ℕ$ that is algebraic over $ℚ (x)$.
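As a small illustration of the statement (an example chosen here for simplicity, not taken from the cited proofs), consider the language of words $a^{n} b^{n}$ with $n \geq 0$. It is generated by the unambiguous grammar with rules S → aSb | ε and contains exactly one word of each even length. Its counting series is then rational, hence in particular algebraic:

```latex
\[
a_k =
\begin{cases}
  1 & \text{if } k \text{ is even} \\
  0 & \text{if } k \text{ is odd}
\end{cases}
\qquad\Longrightarrow\qquad
G(x) = \sum_{n \ge 0} x^{2n} = \frac{1}{1 - x^{2}}
\]
% Algebraic equation with p_0(x) = -1 and p_1(x) = 1 - x^2:
\[
-1 + (1 - x^{2}) \cdot G(x) = 0
\]
```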

Proofs of this theorem are given by Template:Harvtxt, and by Template:Harvtxt.

Usage

Asymptotic estimates

The theorem can be used in analytic combinatorics to estimate the number of words of length n generated by a given unambiguous context-free grammar, as n grows large. The following example is given by Template:Harvtxt: the unambiguous context-free grammar G over the alphabet {0,1} has start symbol S and the following rules

S → M | U

M → 0M1M | ε

U → 0S | 0M1U.

To obtain an algebraic representation of the power series $G (x)$ associated with a given context-free grammar G, one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by x, each occurrence of ε by the integer '1', each occurrence of '→' by '=', and each occurrence of '|' by '+', respectively. Concatenation on the right-hand side of each rule corresponds to multiplication in the equations thus obtained. This yields the following system of equations:

S = M + U

M = M²x² + 1

U = Sx + MUx²
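The translation from rules to equations can also be carried out numerically. The sketch below is one possible approach, not the method of the cited sources: it represents each nonterminal by a truncated coefficient list and iterates the system from zero until nothing changes. The function name `count_words` and the dictionary encoding of the grammar are hypothetical, and termination is assumed to rely on every word having only finitely many derivations, as is the case for an unambiguous grammar.

```python
def count_words(rules, start, n):
    """Coefficients a_0 .. a_(n-1) of the power series attached to `start`.

    `rules` maps each nonterminal to a list of alternatives; an alternative
    is a list of symbols, and the empty list stands for epsilon.
    Any symbol that is not a key of `rules` is treated as a terminal.
    """
    def mul(f, g):
        # Cauchy product, truncated to degree < n
        h = [0] * n
        for i, a in enumerate(f):
            if a:
                for j in range(n - i):
                    h[i + j] += a * g[j]
        return h

    x = ([0, 1] + [0] * n)[:n]      # a terminal symbol becomes x
    one = ([1] + [0] * n)[:n]       # epsilon becomes 1
    series = {nt: [0] * n for nt in rules}
    while True:
        new = {}
        for nt, alternatives in rules.items():
            total = [0] * n                      # '|' becomes '+'
            for alt in alternatives:
                term = one
                for sym in alt:                  # concatenation becomes product
                    term = mul(term, series[sym] if sym in rules else x)
                total = [a + b for a, b in zip(total, term)]
            new[nt] = total
        if new == series:                        # truncated fixed point reached
            return series[start]
        series = new


grammar = {
    "S": [["M"], ["U"]],
    "M": [["0", "M", "1", "M"], []],
    "U": [["0", "S"], ["0", "M", "1", "U"]],
}
print(count_words(grammar, "S", 7))   # [1, 1, 2, 3, 6, 10, 20]
print(count_words(grammar, "M", 7))   # [1, 0, 1, 0, 2, 0, 5]
```

The printed lists are the numbers of words of length 0 to 6 derivable from S and from M in the grammar above.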

In this system of equations, S, M, and U are functions of x, so one could also write $S (x)$, $M (x)$, and $U (x)$. The system can be solved for S, resulting in a single algebraic equation:

x (2 x - 1) S^{2} + (2 x - 1) S + 1 = 0.
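One way to carry out this elimination is sketched below as a worked derivation from the three equations of the system. The auxiliary symbol $w = x^{2} M$ is introduced here only for convenience. Dividing by S is legitimate because S has constant term 1.

```latex
\begin{align*}
U (1 - x^{2} M) &= x S
  && \text{from } U = Sx + MUx^{2} \\
(S - M)(1 - x^{2} M) &= x S
  && \text{substituting } U = S - M \\
S (1 - x - x^{2} M) &= M - x^{2} M^{2} = 1
  && \text{using } M = M^{2} x^{2} + 1 \\
w := x^{2} M &= 1 - x - \frac{1}{S} \\
w &= w^{2} + x^{2}
  && \text{multiplying } M = M^{2} x^{2} + 1 \text{ by } x^{2} \\
0 &= \bigl((1 - x) S - 1\bigr)^{2} - \bigl((1 - x) S - 1\bigr) S + x^{2} S^{2}
  && \text{substituting } w \text{ and multiplying by } S^{2} \\
0 &= x (2x - 1) S^{2} + (2x - 1) S + 1
\end{align*}
```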

This quadratic equation has two solutions for S, one of which is the algebraic power series $G (x)$. By applying methods from complex analysis to this equation, the number $a_{n}$ of words of length n generated by G can be estimated as n grows large. In this case, one obtains $a_{n} \in O ((2 + ϵ)^{n})$ but $a_{n} \notin O ((2 - ϵ)^{n})$ for each $ϵ > 0$.^[1]
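To see where the growth rate 2 comes from, the quadratic $x (2x - 1) S^{2} + (2x - 1) S + 1 = 0$, which follows from the system above by eliminating M and U, can be solved explicitly. The following is a sketch of that computation, not the full analysis given in the cited exposition. Only one of the two roots is a power series at $x = 0$.

```latex
\begin{align*}
\Delta &= (2x - 1)^{2} - 4x(2x - 1) = (2x - 1)(-2x - 1) = 1 - 4x^{2} \\
S(x) &= \frac{\sqrt{1 - 4x^{2}} - (1 - 2x)}{2x (1 - 2x)}
      = \frac{1}{2x} \left( \sqrt{\frac{1 + 2x}{1 - 2x}} - 1 \right) \\
     &= 1 + x + 2x^{2} + 3x^{3} + 6x^{4} + \dots
\end{align*}
% The singularities closest to the origin are x = 1/2 and x = -1/2,
% so the radius of convergence is 1/2 and
\[
\limsup_{n \to \infty} a_{n}^{1/n} = 2 .
\]
```

A radius of convergence of exactly 1/2 is what yields both bounds: the coefficients grow more slowly than $(2 + ϵ)^{n}$ but are not bounded by any multiple of $(2 - ϵ)^{n}$.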

The following example is from Template:Harvtxt: $\begin{matrix} S \to X Y \\ T \to a T \mid T b T \mid Y c Y \\ Y \to Y a Y \mid c Y \mid a b T a Y Y a \mid X \\ X \to a \mid b \mid c \end{matrix} \Rightarrow \begin{matrix} s (z) = x (z) y (z) \\ t (z) = z t (z) + z t (z)^{2} + z y (z)^{2} \\ y (z) = z y (z)^{2} + z y (z) + z^{4} t (z) y (z)^{2} + x (z) \\ x (z) = 3 z \end{matrix}$ which simplifies to $s (z)^{8} - 27 (z^{3} - z^{2}) s (z)^{5} + \dots + 59049 z^{10} = 0$.

Inherent ambiguity

In classical formal language theory, the theorem can be used to prove that certain context-free languages are inherently ambiguous. For example, the Goldstine language $L_{G}$ over the alphabet $\{a, b\}$ consists of the words $a^{n_{1}} b a^{n_{2}} b \dots a^{n_{p}} b$ with $p \geq 1$, $n_{i} > 0$ for $i \in \{1, 2, \dots, p\}$, and $n_{j} \neq j$ for some $j \in \{1, 2, \dots, p\}$.

It is comparatively easy to show that the language $L_{G}$ is context-free. The harder part is to show that there does not exist an unambiguous grammar that generates $L_{G}$. This can be proved as follows: If $g_{k}$ denotes the number of words of length $k$ in $L_{G}$, then the associated power series satisfies $G (x) = \sum_{k = 0}^{\infty} g_{k} x^{k} = \frac{1 - x}{1 - 2 x} - \frac{1}{x} \sum_{k \geq 1} x^{k (k + 1) / 2 - 1}$. Using methods from complex analysis, one can prove that this function is not algebraic over $ℚ (x)$. By the Chomsky–Schützenberger theorem, one can conclude that $L_{G}$ does not admit an unambiguous context-free grammar.^[2]

Notes


[1] See Template:Harvtxt for a detailed exposition.

[2] See Template:Harvtxt for a detailed account.
