Hunt–Szymanski algorithm

In computer science, the Hunt–Szymanski algorithm,[1][2] also known as the Hunt–McIlroy algorithm, is a solution to the longest common subsequence problem. It was one of the first non-heuristic algorithms used in diff, which compares a pair of files, each represented as a sequence of lines. Variations of the algorithm are still found in incremental version control systems, wiki engines, and molecular phylogenetics research software.

The worst-case complexity of this algorithm is $O(n^2 \log n)$, but in practice $O(n \log n)$ is expected.[3][4]

History

The algorithm was proposed by Harold S. Stone as a generalization of a special case solved by Thomas G. Szymanski.[5][6][7] James W. Hunt refined the idea, implemented the first version of the candidate-listing algorithm used by diff and embedded it into an older framework of Douglas McIlroy.[5]

The description of the algorithm appeared as a technical report by Hunt and McIlroy in 1976.[5] The following year, a variant of the algorithm was finally published in a joint paper by Hunt and Szymanski.[5][8]

Algorithm

The Hunt–Szymanski algorithm is a modification of a basic solution to the longest common subsequence problem, which has complexity $O(n^2)$. The modification lowers the time and space requirements on typical inputs.

Basic longest common subsequence solution

Algorithm

Let $A_i$ be the $i$th element of the first sequence.

Let $B_j$ be the $j$th element of the second sequence.

Let $P_{ij}$ be the length of the longest common subsequence for the first $i$ elements of $A$ and the first $j$ elements of $B$.

$$P_{ij} = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ 1 + P_{i-1,j-1} & \text{if } A_i = B_j \\ \max(P_{i-1,j}, P_{i,j-1}) & \text{if } A_i \neq B_j \end{cases}$$
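The recurrence can be evaluated bottom-up by filling a table row by row. The following is a minimal Python sketch; the function name `lcs_length_table` and the sample strings are illustrative assumptions, not part of the original description.

```python
def lcs_length_table(a, b):
    """Fill the table P bottom-up; P[i][j] is the LCS length of a[:i] and b[:j]."""
    m, n = len(a), len(b)
    # Row 0 and column 0 stay 0: an empty prefix has no common subsequence.
    P = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:  # A_i = B_j (Python indices are 0-based)
                P[i][j] = 1 + P[i - 1][j - 1]
            else:
                P[i][j] = max(P[i - 1][j], P[i][j - 1])
    return P


# The bottom-right cell holds the LCS length of the two full sequences.
print(lcs_length_table("abcbdab", "bdcaba")[-1][-1])
```

Every cell of the table is computed exactly once, which is where the quadratic time and space cost of the basic solution comes from.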

Example

[Figure: a table showing the recursive steps that the basic longest common subsequence algorithm takes.]

Consider the sequences $A$ and $B$.

$A$ contains three elements:

$A_1 = a$, $A_2 = b$, $A_3 = c$

$B$ contains three elements:

$B_1 = a$, $B_2 = c$, $B_3 = b$

The steps that the above algorithm would perform to determine the length of the longest common subsequence for both sequences are shown in the diagram. The algorithm correctly reports that the longest common subsequence of the two sequences is two elements long.
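The recursive evaluation for this example can be traced with a memoized version of the recurrence. This is only a sketch: the `visited` list and the use of `functools.lru_cache` are illustrative choices, and the order of steps need not match the one drawn in the diagram.

```python
from functools import lru_cache

A = "abc"
B = "acb"
visited = []  # distinct subproblems (i, j), in the order they are first evaluated


@lru_cache(maxsize=None)
def P(i, j):
    """Length of the LCS of the first i elements of A and the first j of B."""
    visited.append((i, j))
    if i == 0 or j == 0:
        return 0
    if A[i - 1] == B[j - 1]:
        return 1 + P(i - 1, j - 1)
    return max(P(i - 1, j), P(i, j - 1))


length = P(len(A), len(B))
print(length)   # 2
print(visited)  # the subproblems that were actually needed
```

Because results are cached, each pair appears in `visited` at most once, and pairs that the recursion never reaches are not evaluated at all.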

Complexity

The above algorithm has worst-case time and space complexities of $O(mn)$ (see big O notation), where $m$ is the number of elements in sequence $A$ and $n$ is the number of elements in sequence $B$. The Hunt–Szymanski algorithm modifies this algorithm to have a worst-case time complexity of $O(mn \log m)$ and space complexity of $O(mn)$, though it regularly beats the worst case with typical inputs.

Essential matches

$k$-candidates

The Hunt–Szymanski algorithm only considers what the authors call essential matches, or $k$-candidates. $k$-candidates are pairs of indices $(i, j)$ such that:

$A_i = B_j$
$P_{ij} > \max(P_{i-1,j}, P_{i,j-1})$
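The two conditions can be checked directly against the table of the basic solution. The sketch below assumes that the $k$ of a $k$-candidate $(i, j)$ is the value $P_{ij}$; the function name `k_candidates` is hypothetical. It computes the full table only to illustrate the definition, which is exactly the work the Hunt–Szymanski algorithm avoids.

```python
def k_candidates(a, b):
    """Return {(i, j): k} for every 1-based pair (i, j) meeting both conditions."""
    m, n = len(a), len(b)
    P = [[0] * (n + 1) for _ in range(m + 1)]
    found = {}
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = a[i - 1] == b[j - 1]
            if match:
                P[i][j] = 1 + P[i - 1][j - 1]
            else:
                P[i][j] = max(P[i - 1][j], P[i][j - 1])
            # Condition 1: A_i = B_j.  Condition 2: P_ij exceeds both neighbours.
            if match and P[i][j] > max(P[i - 1][j], P[i][j - 1]):
                found[(i, j)] = P[i][j]
    return found


print(k_candidates("abc", "acb"))  # {(1, 1): 1, (2, 3): 2, (3, 2): 2}
print(k_candidates("aa", "a"))     # {(1, 1): 1}
```

In the second call the pair (2, 1) is a match but not a $k$-candidate, because the shorter prefix of the first sequence already reaches the same length.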

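The second condition implies two properties of a $k$-candidate $(i, j)$, where $k = P_{ij}$:

There is a common subsequence of length $k$ in the first $i$ elements of $A$ and the first $j$ elements of $B$.
There is no common subsequence of length $k$ in the first $i-1$ elements of $A$ and the first $j$ elements of $B$, nor in the first $i$ elements of $A$ and the first $j-1$ elements of $B$.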

Connecting $k$-candidates

[Figure: a diagram showing how using $k$-candidates reduces the amount of time and space needed to find the longest common subsequence of two sequences.]

To create the longest common subsequence from a collection of $k$-candidates, a grid is drawn with the contents of one sequence along each axis. The $k$-candidates are marked on the grid. A common subsequence can be created by joining marked coordinates of the grid such that any increase in $i$ is accompanied by an increase in $j$.

This is illustrated in the adjacent diagram.

Black dots represent candidates that would have to be considered by the simple algorithm and the black lines are connections that create common subsequences of length 3.

Red dots represent $k$-candidates that are considered by the Hunt–Szymanski algorithm and the red line is the connection that creates a common subsequence of length 3.
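The idea of visiting only matching pairs and chaining them so that both indices increase can be sketched as follows. This is a commonly presented threshold formulation with one binary search per match, not necessarily the authors' exact procedure; the names `hunt_szymanski_lcs`, `positions`, `thresh` and `link` are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def hunt_szymanski_lcs(a, b):
    """Return one longest common subsequence of a and b as a list."""
    # positions[x] lists the 0-based indices j with b[j] == x, in increasing order.
    positions = defaultdict(list)
    for j, x in enumerate(b):
        positions[x].append(j)

    # thresh[k] is the smallest j found so far that ends a common subsequence
    # of length k + 1; the list is strictly increasing, so it can be bisected.
    thresh = []
    # link[k] is an (i, j, previous) chain describing that subsequence.
    link = []
    for i, x in enumerate(a):
        # Decreasing j ensures a[i] is used at most once in any chain.
        for j in reversed(positions.get(x, ())):
            k = bisect_left(thresh, j)  # number of thresholds smaller than j
            prev = link[k - 1] if k > 0 else None
            if k == len(thresh):        # extends the longest chain so far
                thresh.append(j)
                link.append((i, j, prev))
            elif j < thresh[k]:         # same length, but ends earlier in b
                thresh[k] = j
                link[k] = (i, j, prev)

    # Walk the chain back from the end of the longest subsequence.
    out = []
    node = link[-1] if link else None
    while node is not None:
        i, j, node = node
        out.append(a[i])
    return out[::-1]


print(hunt_szymanski_lcs("abc", "acb"))  # ['a', 'c']
```

Only positions where the two sequences agree are ever examined, so inputs with few matching pairs, such as files whose lines are mostly distinct, need far less work than filling the whole grid.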

References

  1. Template:Cite web
  2. Template:Cite journal
  3. Template:Cite journal
  4. See Section 5.6 of Aho, A. V., Hopcroft, J. E., Ullman, J. D., Data Structures and Algorithms. Addison-Wesley, 1983.
  5. Template:Cite journal
  6. Template:Cite web
  7. Szymanski, T. G. (1975) A special case of the maximal common subsequence problem. Technical Report TR-170, Computer Science Lab., Princeton University.
  8. Template:Cite journal