Derivation of the conjugate gradient method

From testwiki
Jump to navigation Jump to search

In numerical linear algebra, the conjugate gradient method is an iterative method for numerically solving the linear system

๐‘จ๐’™=๐’ƒ

where ๐‘จ is symmetric positive-definite, without computing ๐‘จโˆ’1 explicitly. The conjugate gradient method can be derived from several different perspectives, including specialization of the conjugate direction method[1] for optimization, and variation of the Arnoldi/Lanczos iteration for eigenvalue problems.

The intent of this article is to document the important steps in these derivations.

Conjugate direction

The conjugate gradient method can be seen as a special case of the conjugate direction method applied to minimization of the quadratic function

f(๐’™)=๐’™T๐‘จ๐’™โˆ’2๐’ƒT๐’™.

which allows us to apply geometric intuition.

Geometrically, the quadratic function can be equivalently presented by writing down its value at every point in space. The points of equal value make up its contour surfaces, which are concentric ellipsoids with the equation ๐’™T๐‘จ๐’™โˆ’2๐’ƒT๐’™=Cfor varying C. As C decreases, the ellipsoids become smaller and smaller, until at its minimal value, the ellipsoid shrinks to their shared center.

Minimizing the quadratic function is then a problem of moving around the plane, searching for that shared center of all those ellipsoids. The center can be found by computing ๐‘จโˆ’1 explicitly, but this is precisely what we are trying to avoid.

The simplest method is greedy line search, where we start at some point ๐’™0, pick a direction ๐’‘0 somehow, then minimize f(๐’™0+๐’‘0ฮฑ0). This has a simple closed-form solution that does not involve matrix inversion:ฮฑ0=๐’‘0T(๐’ƒโˆ’๐‘จ๐’™0)๐’‘0T๐‘จ๐’‘0Geometrically, we start at some point ๐’™0 on some ellipsoid, then choose a direction and travel along that direction, until we hit the point where the ellipsoid is minimized in that direction. This is not necessarily the minimum, but it is progress towards it. Visually, it is moving along a line, and stopping as soon as we reach a point tangent to the contour ellipsoid.

We can now repeat this procedure, starting at our new point ๐’™1=๐’™0+ฮฑ0๐’‘0, pick a new direction ๐’‘1, compute ฮฑ1, etc.

We can summarize this as the following algorithm:

Start by picking an initial guess ๐’™0, and compute the initial residual ๐’“0=๐’ƒโˆ’๐‘จ๐’™0, then iterate:

ฮฑi=๐’‘iT๐’“i๐’‘iT๐‘จ๐’‘i,๐’™i+1=๐’™i+ฮฑi๐’‘i,๐’“i+1=๐’“iโˆ’ฮฑi๐‘จ๐’‘i

where ๐’‘0,๐’‘1,๐’‘2,โ€ฆ are to be picked. Notice in particular how the residual is calculated iteratively step-by-step, instead of anew every time:๐’“i+1=๐’ƒโˆ’๐‘จ๐’™i+1=๐’ƒโˆ’๐‘จ(๐’™i+ฮฑi๐’‘i)=๐’“iโˆ’ฮฑi๐‘จ๐’‘iIt is possibly true that ฮฑi=0 prematurely, which would bring numerical problems. However, for particular choices of ๐’‘0,๐’‘1,๐’‘2,โ€ฆ, this will not occur before convergence, as we will prove below.

Conjugate directions

If the directions ๐’‘0,๐’‘1,๐’‘2,โ€ฆ are not picked well, then progress will be slow. In particular, the gradient descent method would be slow. This can be seen in the diagram, where the green line is the result of always picking the local gradient direction. It zig-zags towards the minimum, but repeatedly overshoots. In contrast, if we pick the directions to be a set of mutually conjugate directions, then there will be no overshoot, and we would obtain the global minimum after n steps, where n is the number of dimensions.

Two conjugate diameters of an ellipse. Each edge of the bounding parallelogram is parallel to one of the diameters.

The concept of conjugate directions came from classical geometry of ellipse. For an ellipse, two semi-axes center are mutually conjugate with respect to the ellipse iff the lines are parallel to the tangent bounding parallelogram, as pictured. The concept generalizes to n-dimensional ellipsoids, where n semi-axes t0๐’‘0,,tnโˆ’1๐’‘nโˆ’1 are mutually conjugate with respect to the ellipsoid iff each axis is parallel to the tangent bounding parallelepiped. In other words, for any i, the tangent plane to the ellipsoid at ๐’„+ti๐’‘i is a hyperplane spanned by the vectors {๐’‘j:jโ‰ i}, where ๐’„ is the center of the ellipsoid.

Note that we need to scale each directional vector ๐’‘i by a scalar ti, so that ๐’„+ti๐’‘i falls exactly on the ellipsoid.

Given an ellipsoid with equation ๐’™T๐‘จ๐’™โˆ’2๐’ƒT๐’™=Cfor some constant C, we can translate it so that its center is at origin. This changes the equation to ๐’™T๐‘จ๐’™=Cfor some other constant C. The condition of tangency is then:(ti๐’‘i+๐’‘jdtj)T๐‘จ(ti๐’‘i+๐’‘jdtj)=C+O(dtj2),โˆ€iโ‰ jthat is, ๐’‘iT๐‘จ๐’‘j=0 for any iโ‰ j.

The conjugate direction method is imprecise in the sense that no formulae are given for selection of the directions ๐’‘0,๐’‘1,๐’‘2,โ€ฆ. Specific choices lead to various methods including the conjugate gradient method and Gaussian elimination.

Gramโ€“Schmidt process

We can tabulate the equations that we need to set to zero:

0 1 2 3 โ‹ฏ
0 ๐’‘0T๐‘จ๐’‘1 ๐’‘0T๐‘จ๐’‘2 ๐’‘0T๐‘จ๐’‘3 โ‹ฏ
1 ๐’‘1T๐‘จ๐’‘2 ๐’‘1T๐‘จ๐’‘3 โ‹ฏ
2 ๐’‘2T๐‘จ๐’‘3 โ‹ฏ
โ‹ฎ โ‹ฑ

This resembles the problem of orthogonalization, which requires ๐’‘iT๐’‘j=0 for any iโ‰ j, and ๐’‘iT๐’‘j=1 for any i=j. Thus the problem of finding conjugate axes is less constrained than the problem of orthogonalization, so the Gramโ€“Schmidt process works, with additional degrees of freedom that we can later use to pick the ones that would simplify the computation:

  • Arbitrarily set ๐’‘0.
  • Arbitrarily set ๐’‘10, then modify it to ๐’‘1=๐’‘10โˆ’๐’‘0T๐‘จ๐’‘10๐’‘0T๐‘จ๐’‘0๐’‘0.
  • Arbitrarily set ๐’‘20, then modify it to ๐’‘2=๐’‘20โˆ’โˆ‘i=01๐’‘iT๐‘จ๐’‘20๐’‘iT๐‘จ๐’‘i๐’‘i.
  • ...
  • Arbitrarily set ๐’‘nโˆ’1,0, then modify it to ๐’‘nโˆ’1=๐’‘nโˆ’1,0โˆ’โˆ‘i=0nโˆ’2๐’‘iT๐‘จ๐’‘nโˆ’1,0๐’‘iT๐‘จ๐’‘i๐’‘i.

The most natural choice of ๐’‘k,0 is the gradient. That is, ๐’‘k,0=โˆ‡f(๐’™k). Since conjugate directions can be scaled by a nonzero value, we scale it by โˆ’1/2 for notational cleanness, obtaining ๐’‘k,0=๐ซk=๐›โˆ’๐€๐ฑkThus, we have ๐’‘k=๐’“kโˆ’โˆ‘i=0kโˆ’1๐’‘iT๐‘จ๐’“k๐’‘iT๐‘จ๐’‘i๐’‘i. Plugging it in, we have the conjugate gradient algorithm:๐ซ0:=๐›โˆ’๐€๐ฑ0๐ฉ0:=๐ซ0k:=0do while k<nฮฑk:=๐ฉk๐–ณ๐ซk๐ฉk๐–ณ๐€๐ฉk๐ฑk+1:=๐ฑk+ฮฑk๐ฉkif |ฮฑk| is sufficiently small, then exit loop๐ซk+1:=๐ซkโˆ’ฮฑk๐€๐ฉk๐ฉk+1:=๐’“k+1โˆ’โˆ‘i=0k๐’‘iT๐‘จ๐’“k+1๐’‘iT๐‘จ๐’‘i๐’‘ik:=k+1return ๐ฑk+1 as the resultProposition. If at some point, ฮฑk=0, then the algorithm has converged, that is, โˆ‡f(xk+1)=0.

Proof. By construction, it would mean that ๐ฑk+1=๐ฑk, that is, taking a conjugate gradient step gets us exactly back to where we were. This is only possible if the local gradient is already zero.

Simplification

This algorithm can be significantly simplified by some lemmas, resulting in the conjugate gradient algorithm.

Lemma 1. ๐ฉiT๐ซj=0,โˆ€i<j and ๐ซiT๐ซj=0,โˆ€i<j.

Proof. By the geometric construction, the tangent plane to the ellipsoid at ๐ฑj contains each of the previous conjugate direction vectors ๐ฉ0,๐ฉ1,,๐ฉjโˆ’1. Further, ๐ซj is perpendicular to the tangent, thus ๐ฉiT๐ซj=0,โˆ€i<j. The second equation is true since by construction, ๐ซ0,๐ซ1,,๐ซjโˆ’1 is a linear transform of ๐ฉ0,๐ฉ1,,๐ฉjโˆ’1.

Lemma 2. ๐ฉkT๐ซk=๐ซkT๐ซk.

Proof. By construction, ๐ฉk:=๐’“kโˆ’โˆ‘i=0kโˆ’1๐’‘iT๐‘จ๐’“kโˆ’1๐’‘iT๐‘จ๐’‘i๐’‘i, now apply lemma 1.

Lemma 3. ๐’‘iT๐‘จ๐’“k+1={0,i<kโˆ’๐’“k+1T๐’“k+1/ฮฑk,i=k.

Proof. By construction, we have ๐ซi+1=๐ซiโˆ’ฮฑi๐€๐ฉi, thus๐’“k+1T๐‘จ๐’‘i=๐’“k+1T๐’“iโˆ’๐’“i+1ฮฑiNow apply lemma 1.


Plugging lemmas 1-3 in, we have ฮฑk=๐ซkโŠค๐ซk๐ฉkโŠค๐€๐ฉk and ๐ฉk+1:=๐’“k+1+๐ซk+1โŠค๐ซk+1๐ซkโŠค๐ซk๐ฉk, which is the proper conjugate gradient algorithm.

Arnoldi/Lanczos iteration

Template:See The conjugate gradient method can also be seen as a variant of the Arnoldi/Lanczos iteration applied to solving linear systems.

The general Arnoldi method

In the Arnoldi iteration, one starts with a vector ๐’“0 and gradually builds an orthonormal basis {๐’—1,๐’—2,๐’—3,โ€ฆ} of the Krylov subspace

๐’ฆ(๐‘จ,๐’“0)=span{๐’“0,๐‘จ๐’“0,๐‘จ2๐’“0,โ€ฆ}

by defining ๐’—i=๐’˜i/โ€–๐’˜iโ€–2 where

๐’—i={๐’“0if i=1,๐‘จ๐’—iโˆ’1โˆ’โˆ‘j=1iโˆ’1(๐’—jT๐‘จ๐’—iโˆ’1)๐’—jif i>1.

In other words, for i>1, ๐’—i is found by Gram-Schmidt orthogonalizing ๐‘จ๐’—iโˆ’1 against {๐’—1,๐’—2,โ€ฆ,๐’—iโˆ’1} followed by normalization.

Put in matrix form, the iteration is captured by the equation

๐‘จ๐‘ฝi=๐‘ฝi+1๐‘ฏ~i

where

๐‘ฝi=[๐’—1๐’—2โ‹ฏ๐’—i],๐‘ฏ~i=[h11h12h13โ‹ฏh1,ih21h22h23โ‹ฏh2,ih32h33โ‹ฏh3,iโ‹ฑโ‹ฑโ‹ฎhi,iโˆ’1hi,ihi+1,i]=[๐‘ฏihi+1,i๐’†iT]

with

hji={๐’—jT๐‘จ๐’—iif jโ‰คi,โ€–๐’˜i+1โ€–2if j=i+1,0if j>i+1.

When applying the Arnoldi iteration to solving linear systems, one starts with ๐’“0=๐’ƒโˆ’๐‘จ๐’™0, the residual corresponding to an initial guess ๐’™0. After each step of iteration, one computes ๐’ši=๐‘ฏiโˆ’1(โ€–๐’“0โ€–2๐’†1) and the new iterate ๐’™i=๐’™0+๐‘ฝi๐’ši.

The direct Lanczos method

For the rest of discussion, we assume that ๐‘จ is symmetric positive-definite. With symmetry of ๐‘จ, the upper Hessenberg matrix ๐‘ฏi=๐‘ฝiT๐‘จ๐‘ฝi becomes symmetric and thus tridiagonal. It then can be more clearly denoted by

๐‘ฏi=[a1b2b2a2b3โ‹ฑโ‹ฑโ‹ฑbiโˆ’1aiโˆ’1bibiai].

This enables a short three-term recurrence for ๐’—i in the iteration, and the Arnoldi iteration is reduced to the Lanczos iteration.

Since ๐‘จ is symmetric positive-definite, so is ๐‘ฏi. Hence, ๐‘ฏi can be LU factorized without partial pivoting into

๐‘ฏi=๐‘ณi๐‘ผi=[1c21โ‹ฑโ‹ฑciโˆ’11ci1][d1b2d2b3โ‹ฑโ‹ฑdiโˆ’1bidi]

with convenient recurrences for ci and di:

ci=bi/diโˆ’1,di={a1if i=1,aiโˆ’cibiif i>1.

Rewrite ๐’™i=๐’™0+๐‘ฝi๐’ši as

๐’™i=๐’™0+๐‘ฝi๐‘ฏiโˆ’1(โ€–๐’“0โ€–2๐’†1)=๐’™0+๐‘ฝi๐‘ผiโˆ’1๐‘ณiโˆ’1(โ€–๐’“0โ€–2๐’†1)=๐’™0+๐‘ทi๐’›i

with

๐‘ทi=๐‘ฝi๐‘ผiโˆ’1,๐’›i=๐‘ณiโˆ’1(โ€–๐’“0โ€–2๐’†1).

It is now important to observe that

๐‘ทi=[๐‘ทiโˆ’1๐’‘i],๐’›i=[๐’›iโˆ’1ฮถi].

In fact, there are short recurrences for ๐’‘i and ฮถi as well:

๐’‘i=1di(๐’—iโˆ’bi๐’‘iโˆ’1),ฮถi=โˆ’ciฮถiโˆ’1.

With this formulation, we arrive at a simple recurrence for ๐’™i:

๐’™i=๐’™0+๐‘ทi๐’›i=๐’™0+๐‘ทiโˆ’1๐’›iโˆ’1+ฮถi๐’‘i=๐’™iโˆ’1+ฮถi๐’‘i.

The relations above straightforwardly lead to the direct Lanczos method, which turns out to be slightly more complex.

The conjugate gradient method from imposing orthogonality and conjugacy

If we allow ๐’‘i to scale and compensate for the scaling in the constant factor, we potentially can have simpler recurrences of the form:

๐’™i=๐’™iโˆ’1+ฮฑiโˆ’1๐’‘iโˆ’1,๐’“i=๐’“iโˆ’1โˆ’ฮฑiโˆ’1๐‘จ๐’‘iโˆ’1,๐’‘i=๐’“i+ฮฒiโˆ’1๐’‘iโˆ’1.

As premises for the simplification, we now derive the orthogonality of ๐’“i and conjugacy of ๐’‘i, i.e., for iโ‰ j,

๐’“iT๐’“j=0,๐’‘iT๐‘จ๐’‘j=0.

The residuals are mutually orthogonal because ๐’“i is essentially a multiple of ๐’—i+1 since for i=0, ๐’“0=โ€–๐’“0โ€–2๐’—1, for i>0,

๐’“i=๐’ƒโˆ’๐‘จ๐’™i=๐’ƒโˆ’๐‘จ(๐’™0+๐‘ฝi๐’ši)=๐’“0โˆ’๐‘จ๐‘ฝi๐’ši=๐’“0โˆ’๐‘ฝi+1๐‘ฏ~i๐’ši=๐’“0โˆ’๐‘ฝi๐‘ฏi๐’šiโˆ’hi+1,i(๐’†iT๐’ši)๐’—i+1=โ€–๐’“0โ€–2๐’—1โˆ’๐‘ฝi(โ€–๐’“0โ€–2๐’†1)โˆ’hi+1,i(๐’†iT๐’ši)๐’—i+1=โˆ’hi+1,i(๐’†iT๐’ši)๐’—i+1.

To see the conjugacy of ๐’‘i, it suffices to show that ๐‘ทiT๐‘จ๐‘ทi is diagonal:

๐‘ทiT๐‘จ๐‘ทi=๐‘ผiโˆ’T๐‘ฝiT๐‘จ๐‘ฝi๐‘ผiโˆ’1=๐‘ผiโˆ’T๐‘ฏi๐‘ผiโˆ’1=๐‘ผiโˆ’T๐‘ณi๐‘ผi๐‘ผiโˆ’1=๐‘ผiโˆ’T๐‘ณi

is symmetric and lower triangular simultaneously and thus must be diagonal.

Now we can derive the constant factors ฮฑi and ฮฒi with respect to the scaled ๐’‘i by solely imposing the orthogonality of ๐’“i and conjugacy of ๐’‘i.

Due to the orthogonality of ๐’“i, it is necessary that ๐’“i+1T๐’“i=(๐’“iโˆ’ฮฑi๐‘จ๐’‘i)T๐’“i=0. As a result,

ฮฑi=๐’“iT๐’“i๐’“iT๐‘จ๐’‘i=๐’“iT๐’“i(๐’‘iโˆ’ฮฒiโˆ’1๐’‘iโˆ’1)T๐‘จ๐’‘i=๐’“iT๐’“i๐’‘iT๐‘จ๐’‘i.

Similarly, due to the conjugacy of ๐’‘i, it is necessary that ๐’‘i+1T๐‘จ๐’‘i=(๐’“i+1+ฮฒi๐’‘i)T๐‘จ๐’‘i=0. As a result,

ฮฒi=โˆ’๐’“i+1T๐‘จ๐’‘i๐’‘iT๐‘จ๐’‘i=โˆ’๐’“i+1T(๐’“iโˆ’๐’“i+1)ฮฑi๐’‘iT๐‘จ๐’‘i=๐’“i+1T๐’“i+1๐’“iT๐’“i.

This completes the derivation.

References

Template:Reflist

  1. Template:Cite journal
  2. Shewchuk, Jonathan Richard. "An introduction to the conjugate gradient method without the agonizing pain." (1994)
  3. Template:Cite book

Template:Numerical linear algebra