Stein's lemma

Stein's lemma, named in honor of Charles Stein, is a theorem of probability theory that is of interest primarily because of its applications to statistical inference (in particular, to James–Stein estimation and empirical Bayes methods) and to portfolio choice theory.[1] The theorem gives a formula for the covariance of one random variable with the value of a function of another, when the two random variables are jointly normally distributed.

Note that the name "Stein's lemma" is also commonly used[2] to refer to a different result in the area of statistical hypothesis testing, which connects the error exponents in hypothesis testing with the Kullback–Leibler divergence. This result is also known as the Chernoff–Stein lemma[3] and is not related to the lemma discussed in this article.

Statement

Suppose X is a normally distributed random variable with expectation μ and variance σ². Further suppose g is a differentiable function for which the two expectations E(g(X)(X − μ)) and E(g′(X)) both exist. (The existence of the expectation of any random variable is equivalent to the finiteness of the expectation of its absolute value.) Then

$$ E\left(g(X)(X-\mu)\right) = \sigma^2\, E\left(g'(X)\right). $$
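
As a quick sanity check, the identity can be verified by Monte Carlo simulation. The sketch below assumes NumPy; the values μ = 1.5, σ = 2 and the test function g(x) = x³ are arbitrary illustrative choices, not part of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0                     # illustrative expectation and standard deviation
x = rng.normal(mu, sigma, size=2_000_000)

g = x ** 3                               # arbitrary smooth test function g(x) = x^3
g_prime = 3 * x ** 2                     # its derivative g'(x) = 3 x^2

lhs = np.mean(g * (x - mu))              # E(g(X)(X - mu))
rhs = sigma ** 2 * np.mean(g_prime)      # sigma^2 E(g'(X))
print(lhs, rhs)                          # the two estimates should nearly agree
```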

Multidimensional

In general, suppose X and Y are jointly normally distributed. Then

$$ \operatorname{Cov}\left(g(X), Y\right) = \operatorname{Cov}(X, Y)\, E\left(g'(X)\right). $$
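
The bivariate form can be checked the same way. A minimal sketch, again assuming NumPy; the joint covariance and the test function g(x) = tanh(x) are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])                       # illustrative covariance of (X, Y)
samples = rng.multivariate_normal([0.0, 3.0], cov, size=2_000_000)
x, y = samples[:, 0], samples[:, 1]

gx = np.tanh(x)                                    # arbitrary smooth test function g
gx_prime = 1.0 - gx ** 2                           # g'(x) = 1 - tanh(x)^2

lhs = np.mean(gx * y) - np.mean(gx) * np.mean(y)   # Cov(g(X), Y)
rhs = cov[0, 1] * np.mean(gx_prime)                # Cov(X, Y) E(g'(X))
print(lhs, rhs)                                    # should nearly agree
```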

For a general multivariate Gaussian random vector (X₁, …, Xₙ) ∼ N(μ, Σ) it follows that

$$ E\left(g(X)(X-\mu)\right) = \Sigma\, E\left(\nabla g(X)\right). $$

Similarly, when μ = 0,
$$ E\left[\partial_i g(X)\right] = E\left[g(X)\,(\Sigma^{-1}X)_i\right], \qquad E\left[\partial_i \partial_j g(X)\right] = E\left[g(X)\left((\Sigma^{-1}X)_i\,(\Sigma^{-1}X)_j - (\Sigma^{-1})_{ij}\right)\right]. $$
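
The vector form E(g(X)(X − μ)) = Σ E(∇g(X)) can also be checked numerically. A sketch assuming NumPy; the covariance, the mean, and the test function g(x) = sin(aᵀx) are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.5, -1.0, 2.0])
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)                    # an arbitrary positive-definite covariance
X = rng.multivariate_normal(mu, Sigma, size=2_000_000)

a = np.array([0.3, -0.7, 0.2])                 # fixed direction defining g
g = np.sin(X @ a)                              # g(x) = sin(a . x)
grad_g = np.cos(X @ a)[:, None] * a            # gradient of g at each sample

lhs = (g[:, None] * (X - mu)).mean(axis=0)     # E(g(X)(X - mu))
rhs = Sigma @ grad_g.mean(axis=0)              # Sigma E(grad g(X))
print(lhs)
print(rhs)                                     # the two vectors should nearly agree
```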

Gradient descent

Stein's lemma can be used to stochastically estimate a gradient:
$$ \nabla_x\, E_{\epsilon \sim \mathcal{N}(0,I)}\left(g(x + \Sigma^{1/2}\epsilon)\right) = \Sigma^{-1/2}\, E_{\epsilon \sim \mathcal{N}(0,I)}\left(g(x + \Sigma^{1/2}\epsilon)\,\epsilon\right) \approx \Sigma^{-1/2}\, \frac{1}{N} \sum_{i=1}^{N} g(x + \Sigma^{1/2}\epsilon_i)\,\epsilon_i, $$
where ε_1, …, ε_N are IID samples from the standard normal distribution 𝒩(0, I). This form has applications in Stein variational gradient descent[4] and Stein variational policy gradient.[5]
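
A minimal sketch of this estimator, assuming NumPy. The quadratic objective g, the covariance Σ, and the sample size are illustrative choices; a quadratic is convenient because adding zero-mean Gaussian noise changes its expected value but not its gradient, so (B + Bᵀ)x + c serves as an exact reference value.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 4, 1_000_000

B = rng.normal(size=(d, d))
c = rng.normal(size=d)                                        # g(y) = y^T B y + c^T y

A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)                                   # smoothing covariance
evals, evecs = np.linalg.eigh(Sigma)
S_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T            # symmetric Sigma^{1/2}
S_half_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T  # Sigma^{-1/2}

x = rng.normal(size=d)
eps = rng.normal(size=(N, d))                                 # eps_i ~ N(0, I)
Y = x + eps @ S_half                                          # rows are x + Sigma^{1/2} eps_i
vals = np.einsum('ij,jk,ik->i', Y, B, Y) + Y @ c              # g at each perturbed point

grad_est = S_half_inv @ (vals[:, None] * eps).mean(axis=0)    # Stein estimate of the gradient
grad_exact = (B + B.T) @ x + c                                # exact gradient of the smoothed quadratic
print(grad_est)
print(grad_exact)                                             # the two should nearly agree
```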

Proof

The probability density function of the univariate normal distribution with expectation 0 and variance 1 is

$$ \varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. $$

Since $\int x \exp(-x^2/2)\, dx = -\exp(-x^2/2)$, we get from integration by parts:

$$ E\left[g(X)\,X\right] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} g(x)\, x\, e^{-x^2/2}\, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} g'(x)\, e^{-x^2/2}\, dx = E\left[g'(X)\right]. $$

The case of general expectation μ and variance σ² follows by substitution.
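
Explicitly, writing X = μ + σZ with Z standard normal and applying the unit-variance identity to the function z ↦ g(μ + σz):

$$ E\left[g(X)(X-\mu)\right] = \sigma\, E\left[g(\mu + \sigma Z)\, Z\right] = \sigma\, E\left[\tfrac{d}{dz}\, g(\mu + \sigma Z)\right] = \sigma^2\, E\left[g'(X)\right]. $$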

Generalizations

Isserlis' theorem is equivalently stated as
$$ E\left(X_1 f(X_1, \dots, X_n)\right) = \sum_{i=1}^{n} \operatorname{Cov}(X_1, X_i)\, E\left(\partial_{X_i} f(X_1, \dots, X_n)\right), $$
where (X₁, …, Xₙ) is a zero-mean multivariate normal random vector.
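
This restatement can be checked numerically in the same way. A sketch assuming NumPy; the covariance and the test function f(x) = sin(aᵀx), for a fixed vector a, are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)                        # arbitrary covariance, zero mean
X = rng.multivariate_normal(np.zeros(3), Sigma, size=2_000_000)

a = np.array([0.4, 0.1, -0.3])                     # fixed vector defining f
f = np.sin(X @ a)                                  # f(x) = sin(a . x)
partials = np.cos(X @ a)[:, None] * a              # partial_i f(x) = a_i cos(a . x)

lhs = np.mean(X[:, 0] * f)                         # E(X_1 f(X_1, ..., X_n))
rhs = sum(Sigma[0, i] * partials[:, i].mean() for i in range(3))  # sum_i Cov(X_1, X_i) E(partial_i f)
print(lhs, rhs)                                    # should nearly agree
```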

Suppose X is in an exponential family, that is, X has the density

$$ f_\eta(x) = \exp\left(\eta^\top T(x) - \Psi(\eta)\right) h(x). $$

Suppose this density has support (a, b), where a and b may be −∞ or ∞, and that as x → a or x → b, exp(ηᵀT(x)) h(x) g(x) → 0, where g is any differentiable function such that E|g′(X)| < ∞ (when a and b are finite, it suffices that exp(ηᵀT(x)) h(x) → 0). Then

$$ E\left[\left(\frac{h'(X)}{h(X)} + \sum_i \eta_i T_i'(X)\right) g(X)\right] = -E\left[g'(X)\right]. $$

The derivation is the same as in the special case, namely integration by parts.
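
As a concrete illustration, the Gamma distribution with shape k and scale θ fits this form with η = −1/θ, T(x) = x and h(x) = x^(k−1), so the identity reads E[((k−1)/X + η) g(X)] = −E[g′(X)] for suitable g. The sketch below assumes NumPy; the parameters k = 3, θ = 2 and the test function g(x) = sin(x) are arbitrary choices (g is bounded and k > 1, so the boundary terms vanish).

```python
import numpy as np

rng = np.random.default_rng(5)
k, theta = 3.0, 2.0                        # illustrative Gamma shape and scale
x = rng.gamma(k, theta, size=2_000_000)

# Gamma(k, theta) in exponential-family form:
#   f_eta(x) = exp(eta * T(x) - Psi(eta)) * h(x),  with eta = -1/theta, T(x) = x, h(x) = x**(k-1)
eta = -1.0 / theta

g = np.sin(x)                              # bounded test function, so boundary terms vanish
g_prime = np.cos(x)

lhs = np.mean(((k - 1.0) / x + eta) * g)   # E[(h'(X)/h(X) + eta T'(X)) g(X)]
rhs = -np.mean(g_prime)                    # -E[g'(X)]
print(lhs, rhs)                            # the two estimates should nearly agree
```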

If we only know that X has support ℝ, then it could be the case that E|g(X)| < ∞ and E|g′(X)| < ∞, yet f_η(x) g(x) does not tend to 0 as x → ∞, so the boundary term in the integration by parts does not vanish. To see this, simply put g(x) = 1 and let f_η(x) have infinitely many spikes towards infinity while remaining integrable. One such example could be adapted from
$$ f(x) = \begin{cases} 1 & x \in [n, n + 2^{-n}) \text{ for some positive integer } n, \\ 0 & \text{otherwise}, \end{cases} $$
modified so that f is smooth.

Extensions to elliptically contoured distributions also exist.[6][7][8]
