SPIKE algorithm

From testwiki
Jump to navigation Jump to search

The SPIKE algorithm is a hybrid parallel solver for banded linear systems developed by Eric Polizzi and Ahmed SamehTemplate:RefTemplate:NoteTemplate:Ref

Overview

The SPIKE algorithm deals with a linear system Template:Math, where Template:Math is a banded n×n matrix of bandwidth much less than n, and Template:Math is an n×s matrix containing s right-hand sides. It is divided into a preprocessing stage and a postprocessing stage.

Preprocessing stage

In the preprocessing stage, the linear system Template:Math is partitioned into a block tridiagonal form

[๐‘จ1๐‘ฉ1๐‘ช2๐‘จ2๐‘ฉ2๐‘ชp1๐‘จp1๐‘ฉp1๐‘ชp๐‘จp][๐‘ฟ1๐‘ฟ2๐‘ฟp1๐‘ฟp]=[๐‘ญ1๐‘ญ2๐‘ญp1๐‘ญp].

Assume, for the time being, that the diagonal blocks Template:Math (Template:Math with Template:Math) are nonsingular. Define a block diagonal matrix

Template:Math,

then Template:Math is also nonsingular. Left-multiplying Template:Math to both sides of the system gives

[๐‘ฐ๐‘ฝ1๐‘พ2๐‘ฐ๐‘ฝ2๐‘พp1๐‘ฐ๐‘ฝp1๐‘พp๐‘ฐ][๐‘ฟ1๐‘ฟ2๐‘ฟp1๐‘ฟp]=[๐‘ฎ1๐‘ฎ2๐‘ฎp1๐‘ฎp],

which is to be solved in the postprocessing stage. Left-multiplication by Template:Math is equivalent to solving p systems of the form

Template:Math

(omitting Template:Math and Template:Math for j=1, and Template:Math and Template:Math for j=p), which can be carried out in parallel.

Due to the banded nature of Template:Math, only a few leftmost columns of each Template:Math and a few rightmost columns of each Template:Math can be nonzero. These columns are called the spikes.

Postprocessing stage

Without loss of generality, assume that each spike contains exactly m columns (m is much less than n) (pad the spike with columns of zeroes if necessary). Partition the spikes in all Template:Math and Template:Math into

[๐‘ฝj(t)๐‘ฝj๐‘ฝj(b)] and [๐‘พj(t)๐‘พj๐‘พj(b)]

where Template:Math, Template:Math, Template:Math and Template:Math are of dimensions m×m. Partition similarly all Template:Math and Template:Math into

[๐‘ฟj(t)๐‘ฟj๐‘ฟj(b)] and [๐‘ฎj(t)๐‘ฎj๐‘ฎj(b)].

Notice that the system produced by the preprocessing stage can be reduced to a block pentadiagonal system of much smaller size (recall that m is much less than n)

[๐‘ฐm0๐‘ฝ1(t)0๐‘ฐm๐‘ฝ1(b)00๐‘พ2(t)๐‘ฐm0๐‘ฝ2(t)๐‘พ2(b)0๐‘ฐm๐‘ฝ2(b)00๐‘พp1(t)๐‘ฐm0๐‘ฝp1(t)๐‘พp1(b)0๐‘ฐm๐‘ฝp1(b)00๐‘พp(t)๐‘ฐm0๐‘พp(b)0๐‘ฐm][๐‘ฟ1(t)๐‘ฟ1(b)๐‘ฟ2(t)๐‘ฟ2(b)๐‘ฟp1(t)๐‘ฟp1(b)๐‘ฟp(t)๐‘ฟp(b)]=[๐‘ฎ1(t)๐‘ฎ1(b)๐‘ฎ2(t)๐‘ฎ2(b)๐‘ฎp1(t)๐‘ฎp1(b)๐‘ฎp(t)๐‘ฎp(b)],

which we call the reduced system and denote by Template:Math.

Once all Template:Math and Template:Math are found, all Template:Math can be recovered with perfect parallelism via

{๐‘ฟ1=๐‘ฎ1๐‘ฝ1๐‘ฟ2(t),๐‘ฟj=๐‘ฎj๐‘ฝj๐‘ฟj+1(t)๐‘พj๐‘ฟj1(b),j=2,,p1,๐‘ฟp=๐‘ฎp๐‘พp๐‘ฟp1(b).

SPIKE as a polyalgorithmic banded linear system solver

Despite being logically divided into two stages, computationally, the SPIKE algorithm comprises three stages:

  1. factorizing the diagonal blocks,
  2. computing the spikes,
  3. solving the reduced system.

Each of these stages can be accomplished in several ways, allowing a multitude of variants. Two notable variants are the recursive SPIKE algorithm for non-diagonally-dominant cases and the truncated SPIKE algorithm for diagonally-dominant cases. Depending on the variant, a system can be solved either exactly or approximately. In the latter case, SPIKE is used as a preconditioner for iterative schemes like Krylov subspace methods and iterative refinement.

Recursive SPIKE

Preprocessing stage

The first step of the preprocessing stage is to factorize the diagonal blocks Template:Math. For numerical stability, one can use LAPACK's XGBTRF routines to LU factorize them with partial pivoting. Alternatively, one can also factorize them without partial pivoting but with a "diagonal boosting" strategy. The latter method tackles the issue of singular diagonal blocks.

In concrete terms, the diagonal boosting strategy is as follows. Let Template:Math denote a configurable "machine zero". In each step of LU factorization, we require that the pivot satisfy the condition

Template:Math.

If the pivot does not satisfy the condition, it is then boosted by

pivot={pivot+ϵ๐‘จj1if pivot0,pivotϵ๐‘จj1if pivot<0

where Template:Math is a positive parameter depending on the machine's unit roundoff, and the factorization continues with the boosted pivot. This can be achieved by modified versions of ScaLAPACK's XDBTRF routines. After the diagonal blocks are factorized, the spikes are computed and passed on to the postprocessing stage.

Postprocessing stage

The two-partition case

In the two-partition case, i.e., when Template:Math, the reduced system Template:Math has the form

[๐‘ฐm0๐‘ฝ1(t)0๐‘ฐm๐‘ฝ1(b)00๐‘พ2(t)๐‘ฐm0๐‘พ2(b)0๐‘ฐm][๐‘ฟ1(t)๐‘ฟ1(b)๐‘ฟ2(t)๐‘ฟ2(b)]=[๐‘ฎ1(t)๐‘ฎ1(b)๐‘ฎ2(t)๐‘ฎ2(b)].

An even smaller system can be extracted from the center:

[๐‘ฐm๐‘ฝ1(b)๐‘พ2(t)๐‘ฐm][๐‘ฟ1(b)๐‘ฟ2(t)]=[๐‘ฎ1(b)๐‘ฎ2(t)],

which can be solved using the block LU factorization

[๐‘ฐm๐‘ฝ1(b)๐‘พ2(t)๐‘ฐm]=[๐‘ฐm๐‘พ2(t)๐‘ฐm][๐‘ฐm๐‘ฝ1(b)๐‘ฐm๐‘พ2(t)๐‘ฝ1(b)].

Once Template:Math and Template:Math are found, Template:Math and Template:Math can be computed via

Template:Math,
Template:Math.
The multiple-partition case

Assume that Template:Math is a power of two, i.e., Template:Math. Consider a block diagonal matrix

Template:Math

where

๐‘ซ~k[1]=[๐‘ฐm0๐‘ฝ2k1(t)0๐‘ฐm๐‘ฝ2k1(b)00๐‘พ2k(t)๐‘ฐm0๐‘พ2k(b)0๐‘ฐm]

for Template:Math. Notice that Template:Math essentially consists of diagonal blocks of order Template:Math extracted from Template:Math. Now we factorize Template:Math as

Template:Math.

The new matrix Template:Math has the form

[๐‘ฐ3m0๐‘ฝ1[2](t)0๐‘ฐm๐‘ฝ1[2](b)00๐‘พ2[2](t)๐‘ฐm0๐‘ฝ2[2](t)๐‘พ2[2](b)0๐‘ฐ3m๐‘ฝ2[2](b)00๐‘พp/21[2](t)๐‘ฐ3m0๐‘ฝp/21[2](t)๐‘พp/21[2](b)0๐‘ฐm๐‘ฝp/21[2](b)00๐‘พp/2[2](t)๐‘ฐm0๐‘พp/2[2](b)0๐‘ฐ3m].

Its structure is very similar to that of Template:Math, only differing in the number of spikes and their height (their width stays the same at Template:Math). Thus, a similar factorization step can be performed on Template:Math to produce

Template:Math

and

Template:Math.

Such factorization steps can be performed recursively. After Template:Math steps, we obtain the factorization

Template:Math,

where Template:Math has only two spikes. The reduced system will then be solved via

Template:Math.

The block LU factorization technique in the two-partition case can be used to handle the solving steps involving Template:Math, ..., Template:Math and Template:Math for they essentially solve multiple independent systems of generalized two-partition forms.

Generalization to cases where Template:Math is not a power of two is almost trivial.

Truncated SPIKE

When Template:Math is diagonally-dominant, in the reduced system

[๐‘ฐm0๐‘ฝ1(t)0๐‘ฐm๐‘ฝ1(b)00๐‘พ2(t)๐‘ฐm0๐‘ฝ2(t)๐‘พ2(b)0๐‘ฐm๐‘ฝ2(b)00๐‘พp1(t)๐‘ฐm0๐‘ฝp1(t)๐‘พp1(b)0๐‘ฐm๐‘ฝp1(b)00๐‘พp(t)๐‘ฐm0๐‘พp(b)0๐‘ฐm][๐‘ฟ1(t)๐‘ฟ1(b)๐‘ฟ2(t)๐‘ฟ2(b)๐‘ฟp1(t)๐‘ฟp1(b)๐‘ฟp(t)๐‘ฟp(b)]=[๐‘ฎ1(t)๐‘ฎ1(b)๐‘ฎ2(t)๐‘ฎ2(b)๐‘ฎp1(t)๐‘ฎp1(b)๐‘ฎp(t)๐‘ฎp(b)],

the blocks Template:Math and Template:Math are often negligible. With them omitted, the reduced system becomes block diagonal

[๐‘ฐm๐‘ฐm๐‘ฝ1(b)๐‘พ2(t)๐‘ฐm๐‘ฐm๐‘ฝ2(b)๐‘พp1(t)๐‘ฐm๐‘ฐm๐‘ฝp1(b)๐‘พp(t)๐‘ฐm๐‘ฐm][๐‘ฟ1(t)๐‘ฟ1(b)๐‘ฟ2(t)๐‘ฟ2(b)๐‘ฟp1(t)๐‘ฟp1(b)๐‘ฟp(t)๐‘ฟp(b)]=[๐‘ฎ1(t)๐‘ฎ1(b)๐‘ฎ2(t)๐‘ฎ2(b)๐‘ฎp1(t)๐‘ฎp1(b)๐‘ฎp(t)๐‘ฎp(b)]

and can be easily solved in parallel Template:Ref.

The truncated SPIKE algorithm can be wrapped inside some outer iterative scheme (e.g., BiCGSTAB or iterative refinement) to improve the accuracy of the solution.

SPIKE for tridiagonal systems

The first SPIKE partitioning and algorithm was presented in Template:Ref and was designed as the means to improve the stability properties of a parallel Givens rotations-based solver for tridiagonal systems. A version of the algorithm, termed g-Spike, that is based on serial Givens rotations applied independently on each block was designed for the NVIDIA GPU Template:Ref. A SPIKE-based algorithm for the GPU that is based on a special block diagonal pivoting strategy is described in Template:Ref.

SPIKE as a preconditioner

The SPIKE algorithm can also function as a preconditioner for iterative methods for solving linear systems. To solve a linear system Template:Math using a SPIKE-preconditioned iterative solver, one extracts center bands from Template:Math to form a banded preconditioner Template:Math and solves linear systems involving Template:Math in each iteration with the SPIKE algorithm.

In order for the preconditioner to be effective, row and/or column permutation is usually necessary to move "heavy" elements of Template:Math close to the diagonal so that they are covered by the preconditioner. This can be accomplished by computing the weighted spectral reordering of Template:Math.

The SPIKE algorithm can be generalized by not restricting the preconditioner to be strictly banded. In particular, the diagonal block in each partition can be a general matrix and thus handled by a direct general linear system solver rather than a banded solver. This enhances the preconditioner, and hence allows better chance of convergence and reduces the number of iterations.

Implementations

Intel offers an implementation of the SPIKE algorithm under the name Intel Adaptive Spike-Based Solver Template:Ref. Tridiagonal solvers have also been developed for the NVIDIA GPU Template:RefTemplate:Ref and the Xeon Phi co-processors. The method in Template:Ref is the basis for a tridiagonal solver in the cuSPARSE library.[1] The Givens rotations based solver was also implemented for the GPU and the Intel Xeon Phi.[2]

References

Template:Reflist

  1. Template:NoteTemplate:Cite journal
  2. Template:NoteTemplate:Cite journal
  3. Template:NoteTemplate:Cite journal
  4. Template:NoteTemplate:Cite book
  5. Template:NoteTemplate:Cite web
  6. Template:NoteTemplate:Cite journal
  7. Template:NoteTemplate:Cite journal
  8. Template:NoteTemplate:Cite journal

Further reading

Template:Numerical linear algebra

  1. โ†‘ NVIDIA, Accessed October 28, 2014. CUDA Toolkit Documentation v. 6.5: cuSPARSE, http://docs.nvidia.com/cuda/cusparse.
  2. โ†‘ Template:Cite web