Proofs involving ordinary least squares

From testwiki
Jump to navigation Jump to search

Template:Multiple issues The purpose of this page is to provide supplementary materials for the ordinary least squares article, reducing the load of the main article with mathematics and improving its accessibility, while at the same time retaining the completeness of exposition.

Derivation of the normal equations

Define the ith residual to be

ri=yij=1nXijβj.

Then the objective S can be rewritten

S=i=1mri2.

Given that S is convex, it is minimized when its gradient vector is zero (This follows by definition: if the gradient vector is not zero, there is a direction in which we can move to minimize it further – see maxima and minima.) The elements of the gradient vector are the partial derivatives of S with respect to the parameters:

Sβj=2i=1mririβj(j=1,2,,n).

The derivatives are

riβj=Xij.

Substitution of the expressions for the residuals and the derivatives into the gradient equations gives

Sβj=2i=1m(yik=1nXikβk)(Xij)(j=1,2,,n).

Thus if β^ minimizes S, we have

2i=1m(yik=1nXikβ^k)(Xij)=0(j=1,2,,n).

Upon rearrangement, we obtain the normal equations:

i=1mk=1nXijXikβ^k=i=1mXijyi(j=1,2,,n).

The normal equations are written in matrix notation as

(𝐗T𝐗)β^=𝐗T𝐲 (where XT is the matrix transpose of X).

The solution of the normal equations yields the vector β^ of the optimal parameter values.

Derivation directly in terms of matrices

The normal equations can be derived directly from a matrix representation of the problem as follows. The objective is to minimize

S(β)=𝐲𝐗β2=(𝐲𝐗β)T(𝐲𝐗β)=𝐲T𝐲βT𝐗T𝐲𝐲T𝐗β+βT𝐗T𝐗β.

Here (βT𝐗T𝐲)T=𝐲T𝐗β has the dimension 1x1 (the number of columns of 𝐲), so it is a scalar and equal to its own transpose, hence βT𝐗T𝐲=𝐲T𝐗β and the quantity to minimize becomes

S(β)=𝐲T𝐲2βT𝐗T𝐲+βT𝐗T𝐗β.

Differentiating this with respect to β and equating to zero to satisfy the first-order conditions gives

𝐗T𝐲+(𝐗T𝐗)β=0,

which is equivalent to the above-given normal equations. A sufficient condition for satisfaction of the second-order conditions for a minimum is that 𝐗 have full column rank, in which case 𝐗T𝐗 is positive definite.

Derivation without calculus

When 𝐗T𝐗 is positive definite, the formula for the minimizing value of β can be derived without the use of derivatives. The quantity

S(β)=𝐲T𝐲2βT𝐗T𝐲+βT𝐗T𝐗β

can be written as

β,β2β,(𝐗T𝐗)1𝐗T𝐲+(𝐗T𝐗)1𝐗T𝐲,(𝐗T𝐗)1𝐗T𝐲+C,

where C depends only on 𝐲 and 𝐗, and , is the inner product defined by

x,y=xT(𝐗T𝐗)y.

It follows that S(β) is equal to

β(𝐗T𝐗)1𝐗T𝐲,β(𝐗T𝐗)1𝐗T𝐲+C

and therefore minimized exactly when

β(𝐗T𝐗)1𝐗T𝐲=0.

Generalization for complex equations

In general, the coefficients of the matrices 𝐗,β and 𝐲 can be complex. By using a Hermitian transpose instead of a simple transpose, it is possible to find a vector β^ which minimizes S(β), just as for the real matrix case. In order to get the normal equations we follow a similar path as in previous derivations:

S(β)=𝐲𝐗β,𝐲𝐗β=𝐲,𝐲𝐗β,𝐲𝐲,𝐗β+𝐗β,𝐗β=𝐲T𝐲β𝐗𝐲𝐲𝐗β+βT𝐗T𝐗β,

where stands for Hermitian transpose.

We should now take derivatives of S(β) with respect to each of the coefficients βj, but first we separate real and imaginary parts to deal with the conjugate factors in above expression. For the βj we have

βj=βjR+iβjI

and the derivatives change into

Sβj=SβjRβjRβj+SβjIβjIβj=SβjRiSβjI(j=1,2,3,,n).

After rewriting S(β) in the summation form and writing βj explicitly, we can calculate both partial derivatives with result:

SβjR=i=1m(Xijyi+yiXij)+2i=1mXijXijβjR+i=1mkjn(XijXikβk+βkXikXij),iSβjI=i=1m(XijyiyiXij)2ii=1mXijXijβjI+i=1mkjn(XijXikβkβkXikXij),

which, after adding it together and comparing to zero (minimization condition for β^) yields

i=1mXijyi=i=1mk=1nXijXikβ^k(j=1,2,3,,n).

In matrix form:

XTy=XT(Xβ^) or (XX)β^=Xy.

Least squares estimator for β

Using matrix notation, the sum of squared residuals is given by

S(β)=(yXβ)T(yXβ).

Since this is a quadratic expression, the vector which gives the global minimum may be found via matrix calculus by differentiating with respect to the vector β (using denominator layout) and setting equal to zero:

0=dSdβ(β^)=ddβ(yTyβTXTyyTXβ+βTXTXβ)|β=β^=2XTy+2XTXβ^

By assumption matrix X has full column rank, and therefore XTX is invertible and the least squares estimator for β is given by

β^=(XTX)1XTy

Unbiasedness and variance of β^

Plug y =  + ε into the formula for β^ and then use the law of total expectation:

E[β^]=E[(XTX)1XT(Xβ+ε)]=β+E[(XTX)1XTε]=β+E[E[(XTX)1XTεX]]=β+E[(XTX)1XTE[εX]]=β,

where E[ε|X] = 0 by assumptions of the model. Since the expected value of β^ equals the parameter it estimates, β, it is an unbiased estimator of β.

For the variance, let the covariance matrix of ε be E[εεT]=σ2I (where I is the identity m×m matrix), and let X be a known constant. Then,

E[(β^β)(β^β)T]=E[((XTX)1XTε)((XTX)1XTε)T]=E[(XTX)1XTεεTX(XTX)1]=(XTX)1XTE[εεT]X(XTX)1=(XTX)1XTσ2X(XTX)1=σ2(XTX)1XTX(XTX)1=σ2(XTX)1,

where we used the fact that β^β is just an affine transformation of ε by the matrix (XTX)1XT.

For a simple linear regression model, where β=[β0,β1]T (β0 is the y-intercept and β1 is the slope), one obtains

σ2(XTX)1=σ2((11x1x2)(1x11x2))1=σ2(i=1m(1xixixi2))1=σ2(mxixixi2)1=σ21mxi2(xi)2(xi2xixim)=σ21m(xix¯)2(xi2xixim)Var(β^1)=σ2i=1m(xix¯)2.

Expected value and biasedness of σ^2

First we will plug in the expression for y into the estimator, and use the fact that X'M = MX = 0 (matrix M projects onto the space orthogonal to X):

σ^2=1nyMy=1n(Xβ+ε)M(Xβ+ε)=1nεMε

Now we can recognize ε as a 1×1 matrix, such matrix is equal to its own trace. This is useful because by properties of trace operator, tr(AB) = tr(BA), and we can use this to separate disturbance ε from matrix M which is a function of regressors X:

Eσ^2=1nE[tr(εMε)]=1ntr(E[Mεε])

Using the Law of iterated expectation this can be written as

Eσ^2=1ntr(E[ME[εε|X]])=1ntr(E[σ2MI])=1nσ2E[trM]

Recall that M = I − P where P is the projection onto linear space spanned by columns of matrix X. By properties of a projection matrix, it has p = rank(X) eigenvalues equal to 1, and all other eigenvalues are equal to 0. Trace of a matrix is equal to the sum of its characteristic values, thus tr(P) = p, and tr(M) = n − p. Therefore,

Eσ^2=npnσ2

Since the expected value of σ^2 does not equal the parameter it estimates, σ2, it is a biased estimator of σ2. Note in the later section “Maximum likelihood” we show that under the additional assumption that errors are distributed normally, the estimator σ^2 is proportional to a chi-squared distribution with n – p degrees of freedom, from which the formula for expected value would immediately follow. However the result we have shown in this section is valid regardless of the distribution of the errors, and thus has importance on its own.

Consistency and asymptotic normality of β^

Estimator β^ can be written as

β^=(1nXX)11nXy=β+(1nXX)11nXε=β+(1ni=1nxix'i)1(1ni=1nxiεi)

We can use the law of large numbers to establish that

1ni=1nxix'i p E[xixi]=Qxxn,1ni=1nxiεi p E[xiεi]=0

By Slutsky's theorem and continuous mapping theorem these results can be combined to establish consistency of estimator β^:

β^ p β+nQxx10=β

The central limit theorem tells us that

1ni=1nxiεi d 𝒩(0,V), where V=Var[xiεi]=E[εi2xix'i]=E[E[εi2xi]xix'i]=σ2Qxxn

Applying Slutsky's theorem again we'll have

n(β^β)=(1ni=1nxix'i)1(1ni=1nxiεi) d Qxx1n𝒩(0,σ2Qxxn)=𝒩(0,σ2Qxx1n)

Maximum likelihood approach

Maximum likelihood estimation is a generic technique for estimating the unknown parameters in a statistical model by constructing a log-likelihood function corresponding to the joint distribution of the data, then maximizing this function over all possible parameter values. In order to apply this method, we have to make an assumption about the distribution of y given X so that the log-likelihood function can be constructed. The connection of maximum likelihood estimation to OLS arises when this distribution is modeled as a multivariate normal.

Specifically, assume that the errors ε have multivariate normal distribution with mean 0 and variance matrix σ2I. Then the distribution of y conditionally on X is

yX  𝒩(Xβ,σ2I)

and the log-likelihood function of the data will be

(β,σ2X)=ln(1(2π)n/2(σ2)n/2e12(yXβ)(σ2I)1(yXβ))=n2ln2πn2lnσ212σ2(yXβ)(yXβ)

Differentiating this expression with respect to β and σ2 we'll find the ML estimates of these parameters:

β=12σ2(2Xy+2XXβ)=0β^=(XX)1Xyσ2=n21σ2+12σ4(yXβ)(yXβ)=0σ^2=1n(yXβ^)(yXβ^)=1nS(β^)

We can check that this is indeed a maximum by looking at the Hessian matrix of the log-likelihood function.

Finite-sample distribution

Since we have assumed in this section that the distribution of error terms is known to be normal, it becomes possible to derive the explicit expressions for the distributions of estimators β^ and σ^2:

β^=(XX)1Xy=(XX)1X(Xβ+ε)=β+(XX)1X𝒩(0,σ2I)

so that by the affine transformation properties of multivariate normal distribution

β^X  𝒩(β,σ2(XX)1).

Similarly the distribution of σ^2 follows from

σ^2=1n(yX(XX)1Xy)(yX(XX)1Xy)=1n(My)My=1n(Xβ+ε)M(Xβ+ε)=1nεMε,

where M=IX(XX)1X is the symmetric projection matrix onto subspace orthogonal to X, and thus MX = XM = 0. We have argued before that this matrix rank n – p, and thus by properties of chi-squared distribution,

nσ2σ^2X=(ε/σ)M(ε/σ)  χnp2

Moreover, the estimators β^ and σ^2 turn out to be independent (conditional on X), a fact which is fundamental for construction of the classical t- and F-tests. The independence can be easily seen from following: the estimator β^ represents coefficients of vector decomposition of y^=Xβ^=Py=Xβ+Pε by the basis of columns of X, as such β^ is a function of . At the same time, the estimator σ^2 is a norm of vector divided by n, and thus this estimator is a function of . Now, random variables (, ) are jointly normal as a linear transformation of ε, and they are also uncorrelated because PM = 0. By properties of multivariate normal distribution, this means that and are independent, and therefore estimators β^ and σ^2 will be independent as well.

Derivation of simple linear regression estimators

Template:Further

We look for α^ and β^ that minimize the sum of squared errors (SSE):

minα^,β^SSE(α^,β^)minα^,β^i=1n(yiα^β^xi)2

To find a minimum take partial derivatives with respect to α^ and β^

α^(SSE(α^,β^))=2i=1n(yiα^β^xi)=0i=1n(yiα^β^xi)=0i=1nyi=i=1nα^+β^i=1nxii=1nyi=nα^+β^i=1nxi1ni=1nyi=α^+1nβ^i=1nxiy¯=α^+β^x¯

Before taking partial derivative with respect to β^, substitute the previous result for α^.

minα^,β^i=1n[yi(y¯β^x¯)β^xi]2=minα^,β^i=1n[(yiy¯)β^(xix¯)]2

Now, take the derivative with respect to β^:

β^(SSE(α^,β^))=2i=1n[(yiy¯)β^(xix¯)](xix¯)=0i=1n(yiy¯)(xix¯)β^i=1n(xix¯)2=0β^=i=1n(yiy¯)(xix¯)i=1n(xix¯)2=Cov(x,y)Var(x)

And finally substitute β^ to determine α^

α^=y¯β^x¯

References

Template:Reflist