Thompson's construction

Template:Short description In computer science, Thompson's construction algorithm, also called the McNaughton–Yamada–Thompson algorithm,^[1] is a method of transforming a regular expression into an equivalent nondeterministic finite automaton (NFA).^[2] This NFA can be used to match strings against the regular expression. This algorithm is credited to Ken Thompson.

Regular expressions and nondeterministic finite automata are two representations of formal languages. For instance, text processing utilities use regular expressions to describe advanced search patterns, but NFAs are better suited for execution on a computer. Hence, this algorithm is of practical interest, since it can compile regular expressions into NFAs. From a theoretical point of view, this algorithm is a part of the proof that they both accept exactly the same languages, that is, the regular languages.

An NFA can be made deterministic by the powerset construction and then be minimized to get an optimal automaton corresponding to the given regular expression. However, an NFA may also be interpreted directly.

To decide whether two given regular expressions describe the same language, each can be converted into an equivalent minimal deterministic finite automaton via Thompson's construction, powerset construction, and DFA minimization. If, and only if, the resulting automata agree up to renaming of states, the regular expressions' languages agree.

The algorithm

The algorithm works recursively by splitting an expression into its constituent subexpressions, from which the NFA will be constructed using a set of rules.^[3] More precisely, from a regular expression Template:Mvar, the obtained automaton Template:Mvar with the transition function Template:Mvar Template:Clarification needed respects the following properties:

Template:Mvar has exactly one initial state Template:Math, which is not accessible from any other state. That is, for any state Template:Mvar and any letter Template:Mvar, $Δ (q, a)$ does not contain Template:Math.
Template:Mvar has exactly one final state Template:Mvar, which is not co-accessible from any other state. That is, for any letter Template:Mvar, $Δ (q_{f}, a) = \emptyset$ .
Let Template:Mvar be the number of concatenation of the regular expression Template:Mvar and let Template:Mvar be the number of symbols apart from parentheses — that is, Template:Math, Template:Math, Template:Mvar and Template:Mvar. Then, the number of states of Template:Mvar is Template:Math (linear in the size of Template:Mvar).
The number of transitions leaving any state is at most two.
Since an NFA of Template:Mvar states and at most Template:Mvar transitions from each state can match a string of length Template:Mvar in time Template:Math, a Thompson NFA can do pattern matching in linear time, assuming a fixed-size alphabet.^[4]Template:Better source needed

Rules

The following rules are depicted according to Aho et al. (2007),^[1] p. 122. In what follows, N(Template:Color) and N(Template:Color) are the NFA of the subexpressions Template:Color and Template:Color, respectively.

The empty-expression ε is converted to

A symbol a of the input alphabet is converted to

The union expression Template:Color|Template:Color is converted to

State q goes via ε either to the initial state of N(Template:Color) or N(Template:Color). Their final states become intermediate states of the whole NFA and merge via two ε-transitions into the final state of the NFA.

The concatenation expression st is converted to

The initial state of N(Template:Color) is the initial state of the whole NFA. The final state of N(Template:Color) becomes the initial state of N(Template:Color). The final state of N(Template:Color) is the final state of the whole NFA.

The Kleene star expression Template:Color^* is converted to

An ε-transition connects initial and final state of the NFA with the sub-NFA N(Template:Color) in between. Another ε-transition from the inner final to the inner initial state of N(Template:Color) allows for repetition of expression Template:Color according to the star operator.

The parenthesized expression (Template:Color) is converted to N(Template:Color) itself.

With these rules, using the empty expression and symbol rules as base cases, it is possible to prove with structural induction that any regular expression may be converted into an equivalent NFA.^[1]

Example

Two examples are now given, a small informal one with the result, and a bigger with a step by step application of the algorithm.

Small Example

Example of Template:Code using Thompson's construction, step by step

The picture below shows the result of Thompson's construction on (ε|a*b). The purple oval corresponds to a, the teal oval corresponds to a*, the green oval corresponds to b, the orange oval corresponds to a*b, and the blue oval corresponds to ε.

Application of the algorithm

As an example, the picture shows the result of Thompson's construction algorithm on the regular expression Template:Nowrap begin(0|(1(01*(00)*0)*1)*)*Template:Nowrap end that denotes the set of binary numbers that are multiples of 3:

{ ε, "0", "00", "11", "000", "011", "110", "0000", "0011", "0110", "1001", "1100", "1111", "00000", ... }.

The upper right part shows the logical structure (syntax tree) of the expression, with "." denoting concatenation (assumed to have variable arity); subexpressions are named Template:Color-Template:Color for reference purposes. The left part shows the nondeterministic finite automaton resulting from Thompson's algorithm, with the Template:Color and Template:Color state of each subexpression colored in Template:Color and Template:Color, respectively. An ε as transition label is omitted for clarity — unlabelled transitions are in fact ε transitions. The entry and exit state corresponding to the root expression Template:Color is the start and accept state of the automaton, respectively.

The algorithm's steps are as follows:

q:	start converting Kleene star expression									`(0\|(1(01(00)0)1))*`
	b:	start converting union expression								`0\|(1(01(00)0)1)`
		a:	convert symbol							`0`
		p:	start converting Kleene star expression							`(1(01(00)0)1)`
			d:	start converting concatenation expression						`1(01(00)0)*1`
				c:	convert symbol					`1`
				n:	start converting Kleene star expression					`(01(00)0)*`
					f:	start converting concatenation expression				`01(00)0`
						e:	convert symbol			`0`
						h:	start converting Kleene star expression			`1*`
							g:	convert symbol		`1`
						h:	finished converting Kleene star expression			`1*`
						l:	start converting Kleene star expression			`(00)*`
							j:	start converting concatenation expression		`00`
								i:	convert symbol	`0`
								k:	convert symbol	`0`
							j:	finished converting concatenation expression		`00`
						l:	finished converting Kleene star expression			`(00)*`
						m:	convert symbol			`0`
					f:	finished converting concatenation expression				`01(00)0`
				n:	finished converting Kleene star expression					`(01(00)0)*`
				o:	convert symbol					`1`
			d:	finished converting concatenation expression						`1(01(00)0)*1`
		p:	finished converting Kleene star expression							`(1(01(00)0)1)`
	b:	finished converting union expression								`0\|(1(01(00)0)1)`
q:	finished converting Kleene star expression									`(0\|(1(01(00)0)1))*`

An equivalent minimal deterministic automaton is shown below.

Relation to other algorithms

Thompson's is one of several algorithms for constructing NFAs from regular expressions;^[5] an earlier algorithm was given by McNaughton and Yamada.^[6] Converse to Thompson's construction, Kleene's algorithm transforms a finite automaton into a regular expression.

Glushkov's construction algorithm is similar to Thompson's construction, once the ε-transitions are removed.

Use in string pattern matching

Regular expressions are often used to specify patterns that software is then asked to match. Generating an NFA by Thompson's construction, and using an appropriate algorithm to simulate it, it is possible to create pattern-matching software with performance that is Template:Tmath, where m is the length of the regular expression and n is the length of the string being matched. This is much better than is achieved by many popular programming-language implementations;^[7] however, it is restricted to purely regular expressions and does not support patterns for non-regular languages like backreferences.

References

Template:Reflist

Template:Strings

[DragonBook-1] 1.0 ^1.1 ^1.2 Template:Cite book

[2] Template:Cite book

[3] Template:Cite journal

[4] Template:Cite web

[5] Template:Cite tech report

[6] Template:Cite journal

[7] Template:Cite web

[1]

[2]

[3]

[4]

[5]

[6]

[7]

Thompson's construction

Contents

The algorithm

Rules

Example

Small Example

Application of the algorithm

Relation to other algorithms

Use in string pattern matching

References

Navigation menu

Thompson's construction

The algorithm

Rules

Example

Small Example

Application of the algorithm

Relation to other algorithms

Use in string pattern matching

References

Navigation menu

Search