Matrix Chernoff bound


For certain applications in linear algebra, it is useful to know properties of the probability distribution of the largest eigenvalue of a finite sum of random matrices. Suppose {๐—k} is a finite sequence of random matrices. Analogous to the well-known Chernoff bound for sums of scalars, a bound on the following is sought for a given parameter t:

Pr{ฮปmax(โˆ‘k๐—k)โ‰ฅt}

The following theorems answer this general question under various assumptions; these assumptions are named below by analogy to their classical, scalar counterparts. All of these theorems can be found in the work of Tropp, as specific applications of a general result which is derived below. A summary of related works is given.

Matrix Gaussian and Rademacher series

Self-adjoint matrices case

Consider a finite sequence {๐€k} of fixed, self-adjoint matrices with dimension d, and let {ฮพk} be a finite sequence of independent standard normal or independent Rademacher random variables.

Then, for all tโ‰ฅ0,

Pr{ฮปmax(โˆ‘kฮพk๐€k)โ‰ฅt}โ‰คdโ‹…eโˆ’t2/2ฯƒ2

where

ฯƒ2=โ€–โˆ‘k๐€k2โ€–.

Rectangular case

Consider a finite sequence {๐k} of fixed matrices with dimension d1ร—d2, and let {ฮพk} be a finite sequence of independent standard normal or independent Rademacher random variables. Define the variance parameter

ฯƒ2=max{โ€–โˆ‘k๐k๐kโˆ—โ€–,โ€–โˆ‘k๐kโˆ—๐kโ€–}.

Then, for all tโ‰ฅ0,

Pr{โ€–โˆ‘kฮพk๐kโ€–โ‰ฅt}โ‰ค(d1+d2)โ‹…eโˆ’t2/2ฯƒ2.

Matrix Chernoff inequalities

The classical Chernoff bounds concern the sum of independent, nonnegative, and uniformly bounded random variables. In the matrix setting, the analogous theorem concerns a sum of positive-semidefinite random matrices subjected to a uniform eigenvalue bound.

Matrix Chernoff I

Consider a finite sequence {๐—k} of independent, random, self-adjoint matrices with dimension d. Assume that each random matrix satisfies

๐—kโชฐ๐ŸŽandฮปmax(๐—k)โ‰คR

almost surely.

Define

$$\mu_{\min} = \lambda_{\min}\left(\sum_k \mathbb{E}\,\mathbf{X}_k\right) \quad\text{and}\quad \mu_{\max} = \lambda_{\max}\left(\sum_k \mathbb{E}\,\mathbf{X}_k\right).$$

Then

Pr{ฮปmin(โˆ‘k๐—k)โ‰ค(1โˆ’ฮด)ฮผmin}โ‰คdโ‹…[eโˆ’ฮด(1โˆ’ฮด)1โˆ’ฮด]ฮผmin/Rfor ฮดโˆˆ[0,1), and
Pr{ฮปmax(โˆ‘k๐—k)โ‰ฅ(1+ฮด)ฮผmax}โ‰คdโ‹…[eฮด(1+ฮด)1+ฮด]ฮผmax/Rfor ฮดโ‰ฅ0.

Matrix Chernoff II

Consider a sequence {๐—k:k=1,2,โ€ฆ,n} of independent, random, self-adjoint matrices that satisfy

๐—kโชฐ๐ŸŽandฮปmax(๐—k)โ‰ค1

almost surely.

Compute the minimum and maximum eigenvalues of the average expectation,

$$\bar{\mu}_{\min} = \lambda_{\min}\left(\frac{1}{n}\sum_{k=1}^{n} \mathbb{E}\,\mathbf{X}_k\right) \quad\text{and}\quad \bar{\mu}_{\max} = \lambda_{\max}\left(\frac{1}{n}\sum_{k=1}^{n} \mathbb{E}\,\mathbf{X}_k\right).$$

Then

$$\Pr\left\{\lambda_{\min}\left(\frac{1}{n}\sum_{k=1}^{n} \mathbf{X}_k\right) \leq \alpha\right\} \leq d \cdot e^{-n D(\alpha \,\|\, \bar{\mu}_{\min})} \quad\text{for } 0 \leq \alpha \leq \bar{\mu}_{\min}, \text{ and}$$
$$\Pr\left\{\lambda_{\max}\left(\frac{1}{n}\sum_{k=1}^{n} \mathbf{X}_k\right) \geq \alpha\right\} \leq d \cdot e^{-n D(\alpha \,\|\, \bar{\mu}_{\max})} \quad\text{for } \bar{\mu}_{\max} \leq \alpha \leq 1.$$

The binary information divergence is defined as

$$D(a \,\|\, u) = a\left(\log a - \log u\right) + (1-a)\left(\log(1-a) - \log(1-u)\right)$$

for a,uโˆˆ[0,1].
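A minimal numerical sketch (with hypothetical diagonal summands $\mathbf{X}_k = \operatorname{diag}(\mathrm{Bernoulli}(p_i))$, which satisfy $\mathbf{0} \preceq \mathbf{X}_k$ and $\lambda_{\max}(\mathbf{X}_k) \leq 1$) compares empirical tails against these bounds:

```python
import numpy as np

def binary_divergence(a, u):
    """D(a||u) = a(log a - log u) + (1-a)(log(1-a) - log(1-u)), with 0 log 0 = 0."""
    total = 0.0
    for x, y in ((a, u), (1 - a, 1 - u)):
        total += 0.0 if x == 0 else x * (np.log(x) - np.log(y))
    return total

rng = np.random.default_rng(2)
d, n = 4, 50
p = np.array([0.3, 0.5, 0.6, 0.8])   # hypothetical coordinate-wise means
# X_k = diag(Bernoulli(p_1), ..., Bernoulli(p_d)); the average's eigenvalues are
# just the coordinate-wise empirical means, and (1/n) sum_k E X_k = diag(p).
mu_min, mu_max = p.min(), p.max()

alpha_lo, alpha_hi = 0.15, 0.9       # 0 <= alpha_lo <= mu_min and mu_max <= alpha_hi <= 1
bound_lo = d * np.exp(-n * binary_divergence(alpha_lo, mu_min))
bound_hi = d * np.exp(-n * binary_divergence(alpha_hi, mu_max))

trials = 1000
lo_hits = hi_hits = 0
for _ in range(trials):
    avg = (rng.random((n, d)) < p).mean(axis=0)  # diagonal of (1/n) sum_k X_k
    lo_hits += avg.min() <= alpha_lo
    hi_hits += avg.max() >= alpha_hi
print(lo_hits / trials, "<=", bound_lo, ";", hi_hits / trials, "<=", bound_hi)
```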

Matrix Bennett and Bernstein inequalities

In the scalar setting, Bennett and Bernstein inequalities describe the upper tail of a sum of independent, zero-mean random variables that are either bounded or subexponential. In the matrix case, the analogous results concern a sum of zero-mean random matrices.

Bounded case

Consider a finite sequence {๐—k} of independent, random, self-adjoint matrices with dimension d. Assume that each random matrix satisfies

๐”ผ๐—k=๐ŸŽandฮปmax(๐—k)โ‰คR

almost surely.

Compute the norm of the total variance,

ฯƒ2=โ€–โˆ‘k๐”ผ(๐—k2)โ€–.

Then, the following chain of inequalities holds for all tโ‰ฅ0:

Pr{ฮปmax(โˆ‘k๐—k)โ‰ฅt}โ‰คdโ‹…exp(โˆ’ฯƒ2R2โ‹…h(Rtฯƒ2))โ‰คdโ‹…exp(โˆ’t2ฯƒ2+Rt/3)โ‰ค{dโ‹…exp(โˆ’3t2/8ฯƒ2)for tโ‰คฯƒ2/R;dโ‹…exp(โˆ’3t/8R)for tโ‰ฅฯƒ2/R.

The function h(u) is defined as h(u)=(1+u)log(1+u)โˆ’u for uโ‰ฅ0.
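The chain of inequalities above can be verified numerically for arbitrary (hypothetical) values of $d$, $R$, and $\sigma^2$, since each step is a scalar inequality in $t$:

```python
import numpy as np

def h(u):
    # h(u) = (1 + u) log(1 + u) - u
    return (1 + u) * np.log1p(u) - u

d, R, sigma2 = 4.0, 1.0, 2.5     # hypothetical parameter values
ts = np.linspace(0.01, 10.0, 200)

bennett = d * np.exp(-(sigma2 / R**2) * h(R * ts / sigma2))
bernstein = d * np.exp(-(ts**2 / 2) / (sigma2 + R * ts / 3))
split = np.where(ts <= sigma2 / R,
                 d * np.exp(-3 * ts**2 / (8 * sigma2)),
                 d * np.exp(-3 * ts / (8 * R)))

# The Bennett bound is the sharpest and the piecewise bound the weakest.
chain_ok = bool(np.all(bennett <= bernstein + 1e-12) and np.all(bernstein <= split + 1e-12))
print(chain_ok)
```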

Consider a sequence {๐—k}k=1n of independent and identically distributed random column vectors in โ„d. Assume that each random vector satisfies โ€–๐—kโ€–2โ‰คM almost surely, and โ€–๐”ผ[๐—k๐—kT]โ€–2โ‰ค1. Then, for all tโ‰ฅ0,[1]

$$\Pr\left\{\left\|\frac{1}{n}\sum_{k=1}^{n} \mathbf{X}_k\mathbf{X}_k^T - \mathbb{E}\left[\mathbf{X}_1\mathbf{X}_1^T\right]\right\|_2 \geq t\right\} \leq \left(2\min(d,n)\right)^2 \cdot \exp\left(-\frac{n(t-1)}{4M^2}\right).$$

Subexponential case

Consider a finite sequence {๐—k} of independent, random, self-adjoint matrices with dimension d. Assume that

๐”ผ๐—k=๐ŸŽand๐”ผ(๐—kp)โชฏp!2โ‹…Rpโˆ’2๐€k2

for p=2,3,4,โ€ฆ.

Compute the variance parameter,

ฯƒ2=โ€–โˆ‘k๐€k2โ€–.

Then, the following chain of inequalities holds for all tโ‰ฅ0:

Pr{ฮปmax(โˆ‘k๐—k)โ‰ฅt}โ‰คdโ‹…exp(โˆ’t2/2ฯƒ2+Rt)โ‰ค{dโ‹…exp(โˆ’t2/4ฯƒ2)for tโ‰คฯƒ2/R;dโ‹…exp(โˆ’t/4R)for tโ‰ฅฯƒ2/R.

Rectangular case

Consider a finite sequence {๐™k} of independent, random, matrices with dimension d1ร—d2. Assume that each random matrix satisfies

๐”ผ๐™k=๐ŸŽandโ€–๐™kโ€–โ‰คR

almost surely. Define the variance parameter

ฯƒ2=max{โ€–โˆ‘k๐”ผ(๐™k๐™kโˆ—)โ€–,โ€–โˆ‘k๐”ผ(๐™kโˆ—๐™k)โ€–}.

Then, for all $t \geq 0$,

$$\Pr\left\{\left\|\sum_k \mathbf{Z}_k\right\| \geq t\right\} \leq (d_1 + d_2) \cdot \exp\left(\frac{-t^2/2}{\sigma^2 + Rt/3}\right).$$
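As a sketch, this bound can be exercised with the hypothetical model $\mathbf{Z}_k = \xi_k \mathbf{B}_k$ for fixed matrices $\mathbf{B}_k$ and independent Rademacher $\xi_k$, which is zero-mean and bounded as required:

```python
import numpy as np

rng = np.random.default_rng(3)
d1, d2, n = 3, 5, 10
B = rng.normal(size=(n, d1, d2))      # fixed example matrices
# Z_k = xi_k B_k with Rademacher xi_k: E Z_k = 0 and ||Z_k|| <= R almost surely,
# and Z_k Z_k^* = B_k B_k^T deterministically (xi_k^2 = 1).
R = max(np.linalg.norm(Bk, ord=2) for Bk in B)
sigma2 = max(np.linalg.norm(sum(Bk @ Bk.T for Bk in B), ord=2),
             np.linalg.norm(sum(Bk.T @ Bk for Bk in B), ord=2))

t = 2.0 * np.sqrt(sigma2)
bound = (d1 + d2) * np.exp(-(t**2 / 2) / (sigma2 + R * t / 3))

trials = 2000
hits = sum(
    np.linalg.norm(np.tensordot(rng.choice([-1.0, 1.0], size=n), B, axes=1), ord=2) >= t
    for _ in range(trials)
)
emp_tail = hits / trials
print(emp_tail, "<=", bound)
```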

Matrix Azuma, Hoeffding, and McDiarmid inequalities

Matrix Azuma

The scalar version of Azuma's inequality states that a scalar martingale exhibits normal concentration about its mean value, and the scale for deviations is controlled by the total maximum squared range of the difference sequence. The following is its extension to the matrix setting.

Consider a finite adapted sequence {๐—k} of self-adjoint matrices with dimension d, and a fixed sequence {๐€k} of self-adjoint matrices that satisfy

๐”ผkโˆ’1๐—k=๐ŸŽand๐—k2โชฏ๐€k2

almost surely.

Compute the variance parameter

ฯƒ2=โ€–โˆ‘k๐€k2โ€–.

Then, for all tโ‰ฅ0

Pr{ฮปmax(โˆ‘k๐—k)โ‰ฅt}โ‰คdโ‹…eโˆ’t2/8ฯƒ2

The constant 1/8 can be improved to 1/2 when there is additional information available. One case occurs when each summand ๐—k is conditionally symmetric. Another example requires the assumption that ๐—k commutes almost surely with ๐€k.

Matrix Hoeffding

Adding the assumption that the summands in the matrix Azuma inequality are independent gives a matrix extension of Hoeffding's inequality.

Consider a finite sequence {๐—k} of independent, random, self-adjoint matrices with dimension d, and let {๐€k} be a sequence of fixed self-adjoint matrices. Assume that each random matrix satisfies

๐”ผ๐—k=๐ŸŽand๐—k2โชฏ๐€k2

almost surely.

Then, for all tโ‰ฅ0

Pr{ฮปmax(โˆ‘k๐—k)โ‰ฅt}โ‰คdโ‹…eโˆ’t2/8ฯƒ2

where

ฯƒ2=โ€–โˆ‘k๐€k2โ€–.

An improvement of this result was established in later work: for all tโ‰ฅ0

Pr{ฮปmax(โˆ‘k๐—k)โ‰ฅt}โ‰คdโ‹…eโˆ’t2/2ฯƒ2

where

$$\sigma^2 = \frac{1}{2}\left\|\sum_k \mathbf{A}_k^2 + \mathbb{E}\,\mathbf{X}_k^2\right\| \leq \left\|\sum_k \mathbf{A}_k^2\right\|.$$
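The improvement can be seen in a small numerical sketch. Under the hypothetical model $\mathbf{X}_k = u_k \mathbf{A}_k$ with independent $u_k \sim \mathrm{Uniform}[-1,1]$ (so $\mathbf{X}_k^2 \preceq \mathbf{A}_k^2$ almost surely and $\mathbb{E}\,\mathbf{X}_k^2 = \mathbf{A}_k^2/3$), the improved variance parameter is strictly smaller:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 4, 6
A = [(M + M.T) / 2 for M in rng.normal(size=(n, d, d))]   # example bound matrices

# Hypothetical model: X_k = u_k A_k with u_k ~ Uniform[-1, 1], so X_k^2 <= A_k^2
# almost surely and E X_k^2 = A_k^2 / 3.
S = sum(Ak @ Ak for Ak in A)
sigma2_hoeffding = np.linalg.norm(S, ord=2)               # || sum_k A_k^2 ||
sigma2_improved = 0.5 * np.linalg.norm(S + S / 3, ord=2)  # (1/2)|| sum_k (A_k^2 + E X_k^2) ||
print(sigma2_improved, "<=", sigma2_hoeffding)
```

Here the improved parameter equals exactly $(2/3)\|\sum_k \mathbf{A}_k^2\|$, so the exponent in the tail bound shrinks accordingly.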

Matrix bounded difference (McDiarmid)

In the scalar setting, McDiarmid's inequality provides one common way of bounding the differences by applying Azuma's inequality to a Doob martingale. A version of the bounded differences inequality holds in the matrix setting.

Let {Zk:k=1,2,โ€ฆ,n} be an independent family of random variables, and let ๐‡ be a function that maps n variables to a self-adjoint matrix of dimension d. Consider a sequence {๐€k} of fixed self-adjoint matrices that satisfy

$$\left(\mathbf{H}(z_1,\ldots,z_k,\ldots,z_n) - \mathbf{H}(z_1,\ldots,z_k',\ldots,z_n)\right)^2 \preceq \mathbf{A}_k^2,$$

where $z_i$ and $z_i'$ range over all possible values of $Z_i$ for each index $i$. Compute the variance parameter

ฯƒ2=โ€–โˆ‘k๐€k2โ€–.

Then, for all tโ‰ฅ0

Pr{ฮปmax(๐‡(๐ณ)โˆ’๐”ผ๐‡(๐ณ))โ‰ฅt}โ‰คdโ‹…eโˆ’t2/8ฯƒ2,

where ๐ณ=(Z1,โ€ฆ,Zn).

An improvement of this result was established in later work: for all tโ‰ฅ0

Pr{ฮปmax(๐‡(๐ณ)โˆ’๐”ผ๐‡(๐ณ))โ‰ฅt}โ‰คdโ‹…eโˆ’t2/ฯƒ2,

where ๐ณ=(Z1,โ€ฆ,Zn) and ฯƒ2=โ€–โˆ‘k๐€k2โ€–.

The first bounds of this type were derived by Ahlswede and Winter. Recall the theorem above for self-adjoint matrix Gaussian and Rademacher bounds: for a finite sequence {๐€k} of fixed, self-adjoint matrices with dimension d and for {ฮพk} a finite sequence of independent standard normal or independent Rademacher random variables,

Pr{ฮปmax(โˆ‘kฮพk๐€k)โ‰ฅt}โ‰คdโ‹…eโˆ’t2/2ฯƒ2

where

ฯƒ2=โ€–โˆ‘k๐€k2โ€–.

Ahlswede and Winter would give the same result, except with

ฯƒAW2=โˆ‘kฮปmax(๐€k2).

By comparison, the $\sigma^2$ in the theorem above interchanges the sum and $\lambda_{\max}$: it is the largest eigenvalue of the sum rather than the sum of the largest eigenvalues. By the triangle inequality for the norm, it is never larger than the Ahlswedeโ€“Winter value, but it can be much smaller. Therefore, the theorem above gives a tighter bound than the Ahlswedeโ€“Winter result.
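This comparison is easy to verify numerically: for any fixed self-adjoint matrices $\mathbf{A}_k$, the largest eigenvalue of the sum is at most the sum of the largest eigenvalues (a sketch with arbitrary example matrices):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 6, 10
A = [(M + M.T) / 2 for M in rng.normal(size=(n, d, d))]   # example fixed matrices

sigma2 = np.linalg.norm(sum(Ak @ Ak for Ak in A), ord=2)       # largest eigenvalue of the sum
sigma2_aw = sum(np.linalg.eigvalsh(Ak @ Ak)[-1] for Ak in A)   # sum of largest eigenvalues
print(sigma2, "<=", sigma2_aw)
```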

The chief contribution of Ahlswede and Winter was the extension of the Laplace-transform method used to prove the scalar Chernoff bound (see Chernoff bound#Additive form (absolute error)) to the case of self-adjoint matrices. This procedure is given in the derivation below. All of the recent works on this topic follow the same procedure, and the chief differences lie in the subsequent steps. Ahlswede and Winter use the Goldenโ€“Thompson inequality to proceed, whereas Tropp uses Lieb's theorem.

Suppose one wished to vary the length of the series (n) and the dimensions of the matrices (d) while keeping the right-hand side approximately constant. Then n must vary approximately as the logarithm of d. Several papers have attempted to establish a bound without a dependence on dimensions. Rudelson and Vershynin give a result for matrices which are the outer product of two vectors, and later work provides a result without the dimensional dependence for low-rank matrices. That result was derived independently of the Ahlswedeโ€“Winter approach, though a similar result has also been proved using the Ahlswedeโ€“Winter approach.

Finally, Oliveira proves a result for matrix martingales independently of the Ahlswedeโ€“Winter framework, and Tropp slightly improves on that result using the Ahlswedeโ€“Winter framework. Neither result is presented in this article.

Derivation and proof

Ahlswede and Winter

The Laplace-transform argument found in Ahlswede and Winter is a significant result in its own right: let $\mathbf{Y}$ be a random self-adjoint matrix. Then

$$\Pr\left\{\lambda_{\max}(\mathbf{Y}) \geq t\right\} \leq \inf_{\theta > 0}\left\{e^{-\theta t} \cdot \mathbb{E}\left[\operatorname{tr} e^{\theta\mathbf{Y}}\right]\right\}.$$

To prove this, fix ฮธ>0. Then

Pr{ฮปmax(๐˜)โ‰ฅt}=Pr{ฮปmax(๐œฝ๐˜)โ‰ฅฮธt}=Pr{eฮปmax(ฮธ๐˜)โ‰ฅeฮธt}โ‰คeโˆ’ฮธtEeฮปmax(ฮธ๐˜)โ‰คeโˆ’ฮธtEtre(ฮธ๐˜)

The second-to-last inequality is Markov's inequality. The last inequality holds since $e^{\lambda_{\max}(\theta\mathbf{Y})} = \lambda_{\max}\left(e^{\theta\mathbf{Y}}\right) \leq \operatorname{tr}\left(e^{\theta\mathbf{Y}}\right)$. Since the left-most quantity is independent of $\theta$, the infimum over $\theta > 0$ remains an upper bound for it.

Thus, our task is to understand $\mathbb{E}\left[\operatorname{tr}\left(e^{\theta\mathbf{Y}}\right)\right]$. Since trace and expectation are both linear, we can commute them, so it is sufficient to consider $\mathbb{E}\,e^{\theta\mathbf{Y}} =: \mathbf{M}_{\mathbf{Y}}(\theta)$, which we call the matrix generating function. This is where the methods of Ahlswedeโ€“Winter and of Tropp diverge. The presentation that immediately follows is that of Ahlswede and Winter.

The Goldenโ€“Thompson inequality implies that

tr๐Œ๐—1+๐—2(ฮธ)โ‰คtr[(Eeฮธ๐—1)(Eeฮธ๐—2)]=tr๐Œ๐—1(ฮธ)๐Œ๐—2(ฮธ), where we used the linearity of expectation several times.

Suppose ๐˜=โˆ‘k๐—k. We can find an upper bound for tr๐Œ๐˜(ฮธ) by iterating this result. Noting that tr(๐€๐)โ‰คtr(๐€)ฮปmax(๐), then

tr๐Œ๐˜(ฮธ)โ‰คtr[(Eeโˆ‘k=1nโˆ’1ฮธ๐—k)(Eeฮธ๐—n)]โ‰คtr(Eeโˆ‘k=1nโˆ’1ฮธ๐—k)ฮปmax(Eeฮธ๐—n).

Iterating this, we get

tr๐Œ๐˜(ฮธ)โ‰ค(tr๐ˆ)[ฮ kฮปmax(Eeฮธ๐—k)]=deโˆ‘kฮปmax(logEeฮธ๐—k)

So far we have found a bound with an infimum over $\theta$; this infimum, in turn, can be bounded. In any case, one can see how the Ahlswedeโ€“Winter bound arises as the sum of largest eigenvalues.

Tropp

The major contribution of Tropp is the application of Lieb's theorem where Ahlswede and Winter had applied the Goldenโ€“Thompson inequality. Tropp's corollary is the following: if H is a fixed self-adjoint matrix and X is a random self-adjoint matrix, then

Etre๐‡+๐—โ‰คtre๐‡+log(Ee๐—)

Proof: Let ๐˜=e๐—. Then Lieb's theorem tells us that

f(๐˜)=tre๐‡+log(๐˜)

is concave. The final step is to use Jensen's inequality to move the expectation inside the function:

Etre๐‡+log(๐˜)โ‰คtre๐‡+log(E๐˜).

This gives us the major result of the paper: the subadditivity of the log of the matrix generating function.
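Tropp's corollary can be checked numerically. With a two-point distribution for $\mathbf{X}$, both sides are exact finite sums; the matrices below are arbitrary example data, with matrix functions computed via eigendecomposition:

```python
import numpy as np

def fn_sym(f, A):
    """Apply scalar function f to a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.T

rng = np.random.default_rng(7)
d = 4
sym = lambda M: (M + M.T) / 2
H = sym(rng.normal(size=(d, d)))
X1, X2 = sym(rng.normal(size=(d, d))), sym(rng.normal(size=(d, d)))

# X takes the values X1 and X2 with probability 1/2 each, so both sides are exact.
lhs = 0.5 * (np.trace(fn_sym(np.exp, H + X1)) + np.trace(fn_sym(np.exp, H + X2)))
E_eX = 0.5 * (fn_sym(np.exp, X1) + fn_sym(np.exp, X2))   # E e^X, positive definite
rhs = np.trace(fn_sym(np.exp, H + fn_sym(np.log, E_eX)))
print(lhs, "<=", rhs)
```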

Subadditivity of log mgf

Let ๐—k be a finite sequence of independent, random self-adjoint matrices. Then for all ฮธโˆˆโ„,

tr๐Œโˆ‘k๐—k(ฮธ)โ‰คtreโˆ‘klog๐Œ๐—k(ฮธ)

Proof: It is sufficient to treat the case $\theta = 1$, since the general case follows by replacing each $\mathbf{X}_k$ with $\theta\mathbf{X}_k$. Expanding the definitions, we need to show that

Etreโˆ‘kฮธ๐—kโ‰คtreโˆ‘klogEeฮธ๐—k.

To complete the proof, we use the law of total expectation. Let Ek be the expectation conditioned on ๐—1,โ€ฆ,๐—k. Since we assume all the ๐—i are independent,

Ekโˆ’1e๐—k=Ee๐—k.

Define ๐œฉk=logEkโˆ’1e๐—k=log๐Œ๐—k(ฮธ).

Finally, we have

Etreโˆ‘k=1n๐—k=E0โ‹ฏEnโˆ’1treโˆ‘k=1nโˆ’1๐—k+๐—nโ‰คE0โ‹ฏEnโˆ’2treโˆ‘k=1nโˆ’1๐—k+log(Enโˆ’1e๐—n)=E0โ‹ฏEnโˆ’2treโˆ‘k=1nโˆ’2๐—k+๐—nโˆ’1+๐œฉnโ‹ฎ=treโˆ‘k=1n๐œฉk

where at every step m we use Tropp's corollary with

$$\mathbf{H}_m = \sum_{k=1}^{m-1} \mathbf{X}_k + \sum_{k=m+1}^{n} \mathbf{\Xi}_k.$$
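Since the $\mathbf{X}_k$ in the sketch below take finitely many values, both sides of the subadditivity inequality (at $\theta = 1$) can be computed exactly by enumerating all outcomes; the two-point distributions are hypothetical example data:

```python
import numpy as np
from itertools import product

def fn_sym(f, A):
    """Apply scalar function f to a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.T

rng = np.random.default_rng(8)
d, n = 3, 3
sym = lambda M: (M + M.T) / 2
# Each X_k independently takes one of two symmetric values with probability 1/2.
vals = [(sym(rng.normal(size=(d, d))), sym(rng.normal(size=(d, d)))) for _ in range(n)]

# Left side: E tr e^{sum_k X_k}, averaged over all 2^n equally likely outcomes.
lhs = np.mean([np.trace(fn_sym(np.exp, sum(vals[k][b] for k, b in enumerate(bits))))
               for bits in product((0, 1), repeat=n)])

# Right side: tr e^{sum_k log E e^{X_k}}.
rhs = np.trace(fn_sym(np.exp,
                      sum(fn_sym(np.log, 0.5 * (fn_sym(np.exp, a) + fn_sym(np.exp, b)))
                          for a, b in vals)))
print(lhs, "<=", rhs)
```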

Master tail bound

The following is immediate from the previous result:

Pr{ฮปmax(โˆ‘k๐—k)โ‰ฅt}โ‰คinfฮธ>0{eโˆ’ฮธttreโˆ‘klog๐Œ๐—k(ฮธ)}

All of the theorems given above are derived from this bound; the theorems consist in various ways of bounding the infimum. These steps are significantly simpler than the proofs given.

References

Template:Reflist