Derivation of the conjugate gradient method


In numerical linear algebra, the conjugate gradient method is an iterative method for numerically solving the linear system

A x = b

where A is symmetric positive-definite, without computing A^{-1} explicitly. The conjugate gradient method can be derived from several different perspectives, including specialization of the conjugate direction method[1] for optimization, and variation of the Arnoldi/Lanczos iteration for eigenvalue problems.

The intent of this article is to document the important steps in these derivations.

Conjugate direction

The conjugate gradient method can be seen as a special case of the conjugate direction method applied to minimization of the quadratic function

f(x) = x^T A x - 2 b^T x,

which allows us to apply geometric intuition.

Geometrically, the quadratic function can be equivalently presented by writing down its value at every point in space. The points of equal value make up its contour surfaces, which are concentric ellipsoids with the equation x^T A x - 2 b^T x = C for varying C. As C decreases, the ellipsoids become smaller and smaller, until, at the minimal value, they shrink to their shared center.

Minimizing the quadratic function is then a problem of moving around the space, searching for that shared center of all those ellipsoids. The center can be found by computing A^{-1} explicitly, but this is precisely what we are trying to avoid.

The simplest method is greedy line search, where we start at some point x_0, pick a direction p_0 somehow, then minimize f(x_0 + \alpha_0 p_0) over \alpha_0. This has a simple closed-form solution that does not involve matrix inversion:

\alpha_0 = \frac{p_0^T (b - A x_0)}{p_0^T A p_0}

Geometrically, we start at some point x_0 on some ellipsoid, then choose a direction and travel along it, stopping at the point where f is minimized along that direction. This is not necessarily the minimum, but it is progress towards it. Visually, it is moving along a line, and stopping as soon as we reach a point where the line is tangent to a contour ellipsoid.

We can now repeat this procedure: starting at our new point x_1 = x_0 + \alpha_0 p_0, pick a new direction p_1, compute \alpha_1, and so on.

We can summarize this as the following algorithm:

Start by picking an initial guess x_0, and compute the initial residual r_0 = b - A x_0, then iterate:

\alpha_i = \frac{p_i^T r_i}{p_i^T A p_i}, \qquad x_{i+1} = x_i + \alpha_i p_i, \qquad r_{i+1} = r_i - \alpha_i A p_i

where p_0, p_1, p_2, \ldots are to be picked. Notice in particular how the residual is calculated iteratively step-by-step, instead of anew every time:

r_{i+1} = b - A x_{i+1} = b - A (x_i + \alpha_i p_i) = r_i - \alpha_i A p_i

It is possible for \alpha_i = 0 to occur prematurely, which would cause numerical problems. However, for particular choices of p_0, p_1, p_2, \ldots, this will not occur before convergence, as we will prove below.
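As a concrete illustration, here is a minimal NumPy sketch of this generic direction-picking iteration; the function name direction_method and the callback interface for supplying directions are assumptions made for the example, not part of the derivation.

```python
import numpy as np

def direction_method(A, b, x0, pick_direction, n_steps):
    """Generic line-search iteration; the choice of each direction p_i is
    delegated to the caller via pick_direction(i, r)."""
    x = np.asarray(x0, dtype=float).copy()
    r = b - A @ x                        # initial residual r_0 = b - A x_0
    for i in range(n_steps):
        p = pick_direction(i, r)
        alpha = (p @ r) / (p @ (A @ p))  # exact minimizer of f along the line
        x += alpha * p
        r -= alpha * (A @ p)             # step-by-step residual update
    return x

# Steepest descent is the special case p_i = r_i (the local gradient direction):
# x = direction_method(A, b, x0, lambda i, r: r, n_steps=50)
```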

Conjugate directions

If the directions 𝒑0,𝒑1,𝒑2, are not picked well, then progress will be slow. In particular, the gradient descent method would be slow. This can be seen in the diagram, where the green line is the result of always picking the local gradient direction. It zig-zags towards the minimum, but repeatedly overshoots. In contrast, if we pick the directions to be a set of mutually conjugate directions, then there will be no overshoot, and we would obtain the global minimum after n steps, where n is the number of dimensions.

Two conjugate diameters of an ellipse. Each edge of the bounding parallelogram is parallel to one of the diameters.

The concept of conjugate directions comes from the classical geometry of the ellipse. For an ellipse, two semi-axes from the center are mutually conjugate with respect to the ellipse iff each is parallel to one pair of sides of the tangent bounding parallelogram, as pictured. The concept generalizes to n-dimensional ellipsoids, where n semi-axes t_0 p_0, \ldots, t_{n-1} p_{n-1} are mutually conjugate with respect to the ellipsoid iff each axis is parallel to the corresponding faces of the tangent bounding parallelepiped. In other words, for any i, the tangent plane to the ellipsoid at c + t_i p_i is the hyperplane spanned by the vectors \{p_j : j \neq i\}, where c is the center of the ellipsoid.

Note that we need to scale each directional vector 𝒑i by a scalar ti, so that 𝒄+ti𝒑i falls exactly on the ellipsoid.

Given an ellipsoid with equation x^T A x - 2 b^T x = C for some constant C, we can translate it so that its center is at the origin. This changes the equation to x^T A x = C' for some other constant C'. The condition of tangency is then

(t_i p_i + p_j \, dt_j)^T A (t_i p_i + p_j \, dt_j) = C' + O(dt_j^2), \qquad i \neq j,

that is, p_i^T A p_j = 0 for any i \neq j.
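To spell out the step from the tangency condition to conjugacy, expand the left-hand side (using the symmetry of A):

(t_i p_i + p_j \, dt_j)^T A (t_i p_i + p_j \, dt_j) = t_i^2 \, p_i^T A p_i + 2 t_i (p_i^T A p_j) \, dt_j + (p_j^T A p_j) \, dt_j^2.

Since t_i^2 \, p_i^T A p_i = C' (the point c + t_i p_i lies on the ellipsoid), the condition can hold up to O(dt_j^2) only if the first-order cross term vanishes, i.e. p_i^T A p_j = 0.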

The conjugate direction method is imprecise in the sense that no formulae are given for selection of the directions 𝒑0,𝒑1,𝒑2,. Specific choices lead to various methods including the conjugate gradient method and Gaussian elimination.

Gram–Schmidt process

We can tabulate the equations that we need to set to zero:

           1               2               3
  0   p_0^T A p_1    p_0^T A p_2    p_0^T A p_3
  1                  p_1^T A p_2    p_1^T A p_3
  2                                 p_2^T A p_3

This resembles the problem of orthogonalization, which requires p_i^T p_j = 0 for any i \neq j, and p_i^T p_j = 1 for i = j. Thus the problem of finding conjugate axes is less constrained than the problem of orthogonalization, so the Gram–Schmidt process works, with additional degrees of freedom that we can later use to pick the ones that would simplify the computation (a code sketch follows the list):

  • Arbitrarily set p_0.
  • Arbitrarily set p_{1,0}, then modify it to p_1 = p_{1,0} - \frac{p_0^T A p_{1,0}}{p_0^T A p_0} p_0.
  • Arbitrarily set p_{2,0}, then modify it to p_2 = p_{2,0} - \sum_{i=0}^{1} \frac{p_i^T A p_{2,0}}{p_i^T A p_i} p_i.
  • ...
  • Arbitrarily set p_{n-1,0}, then modify it to p_{n-1} = p_{n-1,0} - \sum_{i=0}^{n-2} \frac{p_i^T A p_{n-1,0}}{p_i^T A p_i} p_i.
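As a sketch of the procedure above, the following NumPy snippet performs Gram–Schmidt conjugation with respect to the A-inner product; the function name a_conjugate and its interface are assumptions made for the example.

```python
import numpy as np

def a_conjugate(A, candidates):
    """Turn arbitrary starting vectors p_{k,0} into mutually A-conjugate
    directions p_k via Gram-Schmidt with the inner product <u, v> = u^T A v."""
    directions = []
    for v in candidates:
        p = np.asarray(v, dtype=float).copy()
        for q in directions:
            p -= (q @ (A @ v)) / (q @ (A @ q)) * q  # remove the A-projection onto q
        directions.append(p)
    return directions
```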

The most natural choice of p_{k,0} is the gradient. That is, p_{k,0} = -\nabla f(x_k). Since conjugate directions can be scaled by any nonzero value, we scale it by 1/2 for notational cleanness, obtaining

p_{k,0} = r_k = b - A x_k.

Thus, we have p_k = r_k - \sum_{i=0}^{k-1} \frac{p_i^T A r_k}{p_i^T A p_i} p_i. Plugging it in, we have the conjugate gradient algorithm:

r_0 := b - A x_0
p_0 := r_0
k := 0
do while k < n
    \alpha_k := (p_k^T r_k) / (p_k^T A p_k)
    x_{k+1} := x_k + \alpha_k p_k
    if |\alpha_k| is sufficiently small, then exit loop
    r_{k+1} := r_k - \alpha_k A p_k
    p_{k+1} := r_{k+1} - \sum_{i=0}^{k} \frac{p_i^T A r_{k+1}}{p_i^T A p_i} p_i
    k := k + 1
end do
return x_{k+1} as the result

Proposition. If at some point \alpha_k = 0, then the algorithm has converged, that is, \nabla f(x_{k+1}) = 0.

Proof. By construction, \alpha_k = 0 would mean that x_{k+1} = x_k, that is, taking a conjugate gradient step gets us exactly back to where we were. Since \alpha_k is the exact minimizer of f along p_k, and p_k^T r_k = r_k^T r_k (see Lemma 2 below), this is only possible if the local gradient is already zero.

Simplification

This algorithm can be significantly simplified by some lemmas, resulting in the conjugate gradient algorithm.

Lemma 1. p_i^T r_j = 0 for i < j, and r_i^T r_j = 0 for i < j.

Proof. By the geometric construction, the tangent plane to the ellipsoid at x_j contains each of the previous conjugate direction vectors p_0, p_1, \ldots, p_{j-1}. Further, r_j is perpendicular to that tangent plane, thus p_i^T r_j = 0 for i < j. The second equation holds because, by construction, each r_i with i < j is a linear combination of p_0, p_1, \ldots, p_i.

Lemma 2. p_k^T r_k = r_k^T r_k.

Proof. By construction, p_k := r_k - \sum_{i=0}^{k-1} \frac{p_i^T A r_k}{p_i^T A p_i} p_i. Taking the inner product with r_k, every term of the sum vanishes by Lemma 1, leaving p_k^T r_k = r_k^T r_k.

Lemma 3. p_i^T A r_{k+1} = 0 for i < k, and p_k^T A r_{k+1} = -r_{k+1}^T r_{k+1} / \alpha_k.

Proof. By construction, we have r_{i+1} = r_i - \alpha_i A p_i, thus

r_{k+1}^T A p_i = \frac{r_{k+1}^T (r_i - r_{i+1})}{\alpha_i}.

Now apply Lemma 1: for i < k both terms vanish, and for i = k only -r_{k+1}^T r_{k+1} / \alpha_k remains.


Plugging Lemmas 1–3 in (by Lemma 3, only the i = k term of the Gram–Schmidt sum survives), we have \alpha_k = \frac{r_k^T r_k}{p_k^T A p_k} and p_{k+1} := r_{k+1} + \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k} p_k, which is the proper conjugate gradient algorithm.
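Putting the simplified coefficients together, a minimal NumPy sketch of the resulting algorithm might look as follows; the tolerance-based stopping test is an assumption standing in for the "\alpha_k sufficiently small" check above.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Conjugate gradient with the simplified coefficients
    alpha_k = r_k^T r_k / (p_k^T A p_k), beta_k = r_{k+1}^T r_{k+1} / (r_k^T r_k)."""
    x = np.asarray(x0, dtype=float).copy()
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(len(b)):            # at most n steps in exact arithmetic
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:      # residual (local gradient) is essentially zero
            break
        p = r + (rs_new / rs_old) * p  # only the last Gram-Schmidt term survives
        rs_old = rs_new
    return x
```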

Arnoldi/Lanczos iteration

The conjugate gradient method can also be seen as a variant of the Arnoldi/Lanczos iteration applied to solving linear systems.

The general Arnoldi method

In the Arnoldi iteration, one starts with a vector r_0 and gradually builds an orthonormal basis \{v_1, v_2, v_3, \ldots\} of the Krylov subspace

\mathcal{K}(A, r_0) = \operatorname{span}\{r_0, A r_0, A^2 r_0, \ldots\}

by defining v_i = w_i / \|w_i\|_2 where

w_i = \begin{cases} r_0 & \text{if } i = 1, \\ A v_{i-1} - \sum_{j=1}^{i-1} (v_j^T A v_{i-1}) v_j & \text{if } i > 1. \end{cases}

In other words, for i>1, 𝒗i is found by Gram-Schmidt orthogonalizing 𝑨𝒗i1 against {𝒗1,𝒗2,,𝒗i1} followed by normalization.

Put in matrix form, the iteration is captured by the equation

A V_i = V_{i+1} \tilde{H}_i

where

V_i = \begin{bmatrix} v_1 & v_2 & \cdots & v_i \end{bmatrix}, \qquad
\tilde{H}_i = \begin{bmatrix}
h_{11} & h_{12} & h_{13} & \cdots & h_{1,i} \\
h_{21} & h_{22} & h_{23} & \cdots & h_{2,i} \\
       & h_{32} & h_{33} & \cdots & h_{3,i} \\
       &        & \ddots & \ddots & \vdots \\
       &        &        & h_{i,i-1} & h_{i,i} \\
       &        &        &           & h_{i+1,i}
\end{bmatrix} = \begin{bmatrix} H_i \\ h_{i+1,i} e_i^T \end{bmatrix}

with

h_{ji} = \begin{cases} v_j^T A v_i & \text{if } j \le i, \\ \|w_{i+1}\|_2 & \text{if } j = i + 1, \\ 0 & \text{if } j > i + 1. \end{cases}

When applying the Arnoldi iteration to solving linear systems, one starts with r_0 = b - A x_0, the residual corresponding to an initial guess x_0. After each step of iteration, one computes y_i = H_i^{-1}(\|r_0\|_2 e_1) and the new iterate x_i = x_0 + V_i y_i.
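The construction above can be sketched in a few lines of NumPy; the function name arnoldi and the fixed step count m are assumptions for the example, and the update x_i = x_0 + V_i H_i^{-1}(\|r_0\|_2 e_1) would be applied on top of the returned factors.

```python
import numpy as np

def arnoldi(A, r0, m):
    """Build an orthonormal Krylov basis V_{m+1} and the (m+1) x m Hessenberg
    matrix H~ satisfying A V_m = V_{m+1} H~."""
    n = len(r0)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = r0 / np.linalg.norm(r0)     # v_1 = r_0 / ||r_0||_2
    for i in range(m):
        w = A @ V[:, i]
        for j in range(i + 1):
            H[j, i] = V[:, j] @ w         # h_{j,i} = v_j^T A v_i
            w = w - H[j, i] * V[:, j]     # orthogonalize against v_j
        H[i + 1, i] = np.linalg.norm(w)   # h_{i+1,i} = ||w_{i+1}||_2
        if H[i + 1, i] == 0.0:            # breakdown: the Krylov subspace is invariant
            return V[:, :i + 1], H[:i + 2, :i + 1]
        V[:, i + 1] = w / H[i + 1, i]
    return V, H
```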

The direct Lanczos method

For the rest of the discussion, we assume that A is symmetric positive-definite. With the symmetry of A, the upper Hessenberg matrix H_i = V_i^T A V_i becomes symmetric and thus tridiagonal. It then can be more clearly denoted by

H_i = \begin{bmatrix}
a_1 & b_2 &        &         &     \\
b_2 & a_2 & b_3    &         &     \\
    & \ddots & \ddots & \ddots &   \\
    &        & b_{i-1} & a_{i-1} & b_i \\
    &        &         & b_i     & a_i
\end{bmatrix}.

This enables a short three-term recurrence for 𝒗i in the iteration, and the Arnoldi iteration is reduced to the Lanczos iteration.

Since 𝑨 is symmetric positive-definite, so is 𝑯i. Hence, 𝑯i can be LU factorized without partial pivoting into

H_i = L_i U_i = \begin{bmatrix}
1   &        &         &     &   \\
c_2 & 1      &         &     &   \\
    & \ddots & \ddots  &     &   \\
    &        & c_{i-1} & 1   &   \\
    &        &         & c_i & 1
\end{bmatrix}
\begin{bmatrix}
d_1 & b_2 &        &         &     \\
    & d_2 & b_3    &         &     \\
    &     & \ddots & \ddots  &     \\
    &     &        & d_{i-1} & b_i \\
    &     &        &         & d_i
\end{bmatrix}

with convenient recurrences for ci and di:

c_i = b_i / d_{i-1}, \qquad d_i = \begin{cases} a_1 & \text{if } i = 1, \\ a_i - c_i b_i & \text{if } i > 1. \end{cases}

Rewrite x_i = x_0 + V_i y_i as

x_i = x_0 + V_i H_i^{-1}(\|r_0\|_2 e_1) = x_0 + V_i U_i^{-1} L_i^{-1}(\|r_0\|_2 e_1) = x_0 + P_i z_i

with

P_i = V_i U_i^{-1}, \qquad z_i = L_i^{-1}(\|r_0\|_2 e_1).

It is now important to observe that

P_i = \begin{bmatrix} P_{i-1} & p_i \end{bmatrix}, \qquad z_i = \begin{bmatrix} z_{i-1} \\ \zeta_i \end{bmatrix}.

In fact, there are short recurrences for 𝒑i and ζi as well:

p_i = \frac{1}{d_i}\left(v_i - b_i p_{i-1}\right), \qquad \zeta_i = -c_i \zeta_{i-1}.
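These recurrences simply unwind the definitions: comparing the last column of P_i U_i = V_i gives b_i p_{i-1} + d_i p_i = v_i, and comparing the last row of L_i z_i = \|r_0\|_2 e_1 gives c_i \zeta_{i-1} + \zeta_i = 0 for i > 1 (with \zeta_1 = \|r_0\|_2 from the first row); rearranging yields the formulas above.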

With this formulation, we arrive at a simple recurrence for 𝒙i:

𝒙i=𝒙0+𝑷i𝒛i=𝒙0+𝑷i1𝒛i1+ζi𝒑i=𝒙i1+ζi𝒑i.

The relations above straightforwardly lead to the direct Lanczos method, sketched below, which turns out to be slightly more complex than the conjugate gradient recurrences derived next.
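A compact NumPy sketch of the direct Lanczos method, combining the recurrences above with the Lanczos three-term recurrence that supplies a_i, b_i and v_i, might look as follows; the function name direct_lanczos and the fixed step count m are assumptions for the example.

```python
import numpy as np

def direct_lanczos(A, b, x0, m):
    """Direct Lanczos sketch: v_i, a_i, b_i come from the Lanczos recurrence,
    while c_i, d_i, p_i, zeta_i are updated with the short recurrences above."""
    x = np.asarray(x0, dtype=float).copy()
    r0 = b - A @ x
    zeta = np.linalg.norm(r0)              # zeta_1 = ||r_0||_2
    v_prev, v = np.zeros_like(r0), r0 / zeta
    b_i = 0.0                              # off-diagonal b_i (none for i = 1)
    d_prev = 1.0                           # placeholder, unused at i = 1
    p_prev = np.zeros_like(r0)
    for i in range(1, m + 1):
        w = A @ v - b_i * v_prev           # Lanczos three-term recurrence
        a_i = v @ w                        # a_i = v_i^T A v_i
        w = w - a_i * v
        if i == 1:
            d_i = a_i                      # d_1 = a_1
        else:
            c_i = b_i / d_prev             # c_i = b_i / d_{i-1}
            d_i = a_i - c_i * b_i          # d_i = a_i - c_i b_i
            zeta = -c_i * zeta             # zeta_i = -c_i zeta_{i-1}
        p = (v - b_i * p_prev) / d_i       # p_i = (v_i - b_i p_{i-1}) / d_i
        x += zeta * p                      # x_i = x_{i-1} + zeta_i p_i
        b_next = np.linalg.norm(w)         # b_{i+1} = ||w||_2
        if b_next == 0.0:                  # breakdown / exact convergence
            break
        v_prev, v = v, w / b_next
        b_i, d_prev, p_prev = b_next, d_i, p
    return x
```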

The conjugate gradient method from imposing orthogonality and conjugacy

If we allow p_i to be rescaled, and compensate for the rescaling in the constant factors, we can potentially obtain simpler recurrences of the form:

x_i = x_{i-1} + \alpha_{i-1} p_{i-1}, \qquad r_i = r_{i-1} - \alpha_{i-1} A p_{i-1}, \qquad p_i = r_i + \beta_{i-1} p_{i-1}.

As premises for the simplification, we now derive the orthogonality of the r_i and the conjugacy of the p_i, i.e., for i \neq j,

r_i^T r_j = 0, \qquad p_i^T A p_j = 0.

The residuals are mutually orthogonal because r_i is essentially a multiple of v_{i+1}: for i = 0, r_0 = \|r_0\|_2 v_1, and for i > 0,

r_i = b - A x_i
    = b - A (x_0 + V_i y_i)
    = r_0 - A V_i y_i
    = r_0 - V_{i+1} \tilde{H}_i y_i
    = r_0 - V_i H_i y_i - h_{i+1,i} (e_i^T y_i) v_{i+1}
    = \|r_0\|_2 v_1 - V_i (\|r_0\|_2 e_1) - h_{i+1,i} (e_i^T y_i) v_{i+1}
    = -h_{i+1,i} (e_i^T y_i) v_{i+1}.

To see the conjugacy of 𝒑i, it suffices to show that 𝑷iT𝑨𝑷i is diagonal:

P_i^T A P_i = U_i^{-T} V_i^T A V_i U_i^{-1} = U_i^{-T} H_i U_i^{-1} = U_i^{-T} L_i U_i U_i^{-1} = U_i^{-T} L_i

is symmetric and lower triangular simultaneously and thus must be diagonal.

Now we can derive the constant factors αi and βi with respect to the scaled 𝒑i by solely imposing the orthogonality of 𝒓i and conjugacy of 𝒑i.

Due to the orthogonality of r_i, it is necessary that r_{i+1}^T r_i = (r_i - \alpha_i A p_i)^T r_i = 0. As a result,

\alpha_i = \frac{r_i^T r_i}{r_i^T A p_i} = \frac{r_i^T r_i}{(p_i - \beta_{i-1} p_{i-1})^T A p_i} = \frac{r_i^T r_i}{p_i^T A p_i}.

Similarly, due to the conjugacy of 𝒑i, it is necessary that 𝒑i+1T𝑨𝒑i=(𝒓i+1+βi𝒑i)T𝑨𝒑i=0. As a result,

\beta_i = -\frac{r_{i+1}^T A p_i}{p_i^T A p_i} = -\frac{r_{i+1}^T (r_i - r_{i+1})}{\alpha_i \, p_i^T A p_i} = \frac{r_{i+1}^T r_{i+1}}{r_i^T r_i}.

This completes the derivation.

References


  1. Template:Cite journal
  2. Shewchuk, Jonathan Richard (1994). "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain".
  3. Template:Cite book
