Sliced inverse regression: Difference between revisions

imported>Davide King

Latest revision as of 18:11, 25 March 2024

Sliced inverse regression (SIR) is a tool for dimensionality reduction in the field of multivariate statistics.[1]

In statistics, regression analysis is a method of studying the relationship between a response variable $y$ and its input variable $\underline{x}$, which is a $p$-dimensional vector. There are several approaches in the category of regression. For example, parametric methods include multiple linear regression, and non-parametric methods include local smoothing.

The number of observations needed for local smoothing methods grows exponentially with the dimension $p$ of the data, so reducing the number of dimensions can make the problem computationally tractable. Dimensionality reduction aims to achieve this by retaining only the most important directions of the data. SIR uses the inverse regression curve, $E(\underline{x}\mid y)$, to perform a weighted principal component analysis.

Model

Given a response variable $Y$ and a (random) vector $X\in\mathbb{R}^p$ of explanatory variables, SIR is based on the model

$$Y=f(\beta_1^\top X,\ldots,\beta_k^\top X,\varepsilon)\qquad(1)$$

where $\beta_1,\ldots,\beta_k$ are unknown projection vectors, $k$ is an unknown number smaller than $p$, $f$ is an unknown function on $\mathbb{R}^{k+1}$ (its arguments are the $k$ projections and the error term), and $\varepsilon$ is a random variable representing error with $E[\varepsilon\mid X]=0$ and finite variance $\sigma^2$. The model describes an ideal situation in which $Y$ depends on $X\in\mathbb{R}^p$ only through a $k$-dimensional subspace; i.e., one can reduce the dimension of the explanatory variables from $p$ to a smaller number $k$ without losing any information.

An equivalent version of (1) is: the conditional distribution of $Y$ given $X$ depends on $X$ only through the $k$-dimensional random vector $(\beta_1^\top X,\ldots,\beta_k^\top X)$. It is assumed that this reduced vector is as informative as the original $X$ in explaining $Y$.

The unknown $\beta_i$'s are called the effective dimension reducing directions (EDR-directions). The space spanned by these vectors is called the effective dimension reducing space (EDR-space).
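To make model (1) concrete, the following sketch draws data with $k=2$ EDR-directions in $p=5$ dimensions. The link function, the noise level, the choice of standard normal $X$, and the name `simulate_model` are illustrative assumptions and do not come from the article.

```python
import numpy as np

def simulate_model(n=500, p=5, seed=0):
    """Draw (X, y) from Y = f(beta_1'X, beta_2'X, eps) with k = 2.

    The link f and all numeric settings are hypothetical choices.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))       # explanatory variables
    B = np.zeros((p, 2))                  # columns are beta_1, beta_2
    B[0, 0] = 1.0                         # beta_1 = first unit vector
    B[1, 1] = 1.0                         # beta_2 = second unit vector
    eps = 0.1 * rng.standard_normal(n)    # error term, independent of X
    proj = X @ B                          # n x k matrix of projections
    # y depends on X only through the two projections and eps.
    y = proj[:, 0] / (0.5 + (proj[:, 1] + 1.5) ** 2) + eps
    return X, y, B

X, y, B = simulate_model()
print(X.shape, y.shape, B.shape)
```

Here the last three coordinates of `X` carry no information about `y`, so the EDR-space is the plane spanned by the first two coordinate axes.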

Relevant linear algebra background

Given $\underline{a}_1,\ldots,\underline{a}_r\in\mathbb{R}^n$, the set $V:=L(\underline{a}_1,\ldots,\underline{a}_r)$ of all linear combinations of these vectors is called a linear subspace and is therefore a vector space. The vectors $\underline{a}_1,\ldots,\underline{a}_r$ are said to span $V$, but the vectors that span $V$ are not unique.

The dimension of $V\ (\subseteq\mathbb{R}^n)$ is equal to the maximum number of linearly independent vectors in $V$. A set of $n$ linearly independent vectors of $\mathbb{R}^n$ makes up a basis of $\mathbb{R}^n$. The dimension of a vector space is unique, but the basis itself is not: several bases can span the same space. Linearly dependent vectors can still span a space, but its dimension is then smaller than the number of vectors; for example, two dependent nonzero vectors span only a straight line.
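These statements can be checked numerically with matrix ranks. The sketch below uses hypothetical example vectors in $\mathbb{R}^3$: two different pairs that span the same plane, and one dependent pair that spans only a line.

```python
import numpy as np

# Two different pairs of vectors in R^3 (hypothetical example values),
# stored as the columns of A and B.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]]).T
B = np.array([[1.0, 1.0, 0.0],
              [1.0, -1.0, 0.0]]).T

dim_A = np.linalg.matrix_rank(A)
dim_B = np.linalg.matrix_rank(B)
# If stacking both sets does not raise the rank, they span the same subspace.
dim_joint = np.linalg.matrix_rank(np.hstack([A, B]))
same_span = bool(dim_A == dim_B == dim_joint)

# Two linearly dependent vectors span only a line (dimension 1).
D = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]]).T
dim_D = np.linalg.matrix_rank(D)

print(dim_A, dim_B, dim_joint, same_span, dim_D)
```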

Inverse regression

Computing the inverse regression curve (IR) means that instead of looking for

  • $E[Y\mid X=x]$, which is a function of $x\in\mathbb{R}^p$,

one considers

  • $E[X\mid Y=y]$, which is a curve in $\mathbb{R}^p$ obtained from $p$ one-dimensional regressions.

The center of the inverse regression curve is located at $E[E[X\mid Y]]=E[X]$. Therefore, the centered inverse regression curve is

  • $E[X\mid Y=y]-E[X]$

which is a $p$-dimensional curve in $\mathbb{R}^p$.
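Written out componentwise, the inverse regression curve and its centering read as follows. This only restates the definitions above together with the law of total expectation; the component notation $X_1,\ldots,X_p$ for the entries of $X$ is introduced here for illustration.

```latex
\begin{aligned}
E[X \mid Y=y] &= \bigl(E[X_1 \mid Y=y],\ \ldots,\ E[X_p \mid Y=y]\bigr)^\top, \\
E\bigl[E[X \mid Y]\bigr] &= E[X], \\
E\bigl[E[X \mid Y] - E[X]\bigr] &= 0 .
\end{aligned}
```

Each component is a regression of one coordinate of $X$ on the scalar $Y$, which is why the curve is made up of $p$ one-dimensional regressions.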

Inverse regression versus dimension reduction

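The centered inverse regression curve lies in a $k$-dimensional subspace spanned by the vectors $\Sigma_{xx}\beta_i$. This is the connection between the model and inverse regression.

The result requires a linearity condition on the distribution of $X$: for every $b\in\mathbb{R}^p$, the conditional expectation $E[b^\top X\mid\beta_1^\top X,\ldots,\beta_k^\top X]$ is linear in $\beta_1^\top X,\ldots,\beta_k^\top X$. Given this condition and (1), the centered inverse regression curve $E[X\mid Y=y]-E[X]$ is contained in the linear subspace spanned by $\Sigma_{xx}\beta_i$ $(i=1,\ldots,k)$, where $\Sigma_{xx}=\operatorname{Cov}(X)$.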
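The containment can be checked numerically. The sketch below assumes Gaussian $X$, a single EDR-direction ($k=1$), and a cubic link; these choices, the covariance matrix, and all variable names are illustrative assumptions. It approximates the centered inverse regression curve by slice means and measures how much of them falls outside the line spanned by $\Sigma_{xx}\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, S = 50_000, 4, 10

# Hypothetical covariance matrix and a single EDR-direction (k = 1).
A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)
beta = np.array([1.0, -1.0, 0.0, 0.0])

# Gaussian X with covariance Sigma, response through beta'X only.
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T
y = (X @ beta) ** 3 + 0.1 * rng.standard_normal(n)

# Crude centered inverse regression curve: slice means of X - mean(X),
# with slices formed from the sorted responses.
Xc = X - X.mean(axis=0)
slices = np.array_split(np.argsort(y), S)
means = np.array([Xc[idx].mean(axis=0) for idx in slices])   # S x p

# Split each slice mean into its part along Sigma @ beta and the remainder.
d = Sigma @ beta
d /= np.linalg.norm(d)
along = np.outer(means @ d, d)
residual = means - along
off_subspace_share = (residual ** 2).sum() / (means ** 2).sum()
print(off_subspace_share)
```

A share close to zero indicates that the slice means lie almost entirely along $\Sigma_{xx}\beta$, up to sampling noise.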

Estimation of the EDR-directions

With the theoretical properties in place, the aim now is to estimate the EDR-directions. For that purpose, a weighted principal component analysis is performed on the slice sample means $\hat{m}_s$, after $X$ has been standardized to $Z=\Sigma_{xx}^{-1/2}\{X-E(X)\}$. By the result above, the IR-curve $m_1(y)=E[Z\mid Y=y]$ lies in the space spanned by $(\eta_1,\ldots,\eta_k)$, where $\eta_i=\Sigma_{xx}^{1/2}\beta_i$. As a consequence, the covariance matrix $\operatorname{cov}[E[Z\mid Y]]$ is degenerate in any direction orthogonal to the $\eta_i$'s. Therefore, the eigenvectors $\eta_i$ $(i=1,\ldots,k)$ associated with the $k$ largest eigenvalues are the standardized EDR-directions.

Algorithm

The algorithm to estimate the EDR-directions via SIR is as follows.

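1. Let $\Sigma_{xx}$ be the covariance matrix of $X$. Standardize $X$ to

$$Z=\Sigma_{xx}^{-1/2}\{X-E(X)\}$$

(Model (1) can then be rewritten as

$$Y=f(\eta_1^\top Z,\ldots,\eta_k^\top Z,\varepsilon)$$

where $\eta_i=\Sigma_{xx}^{1/2}\beta_i$ for all $i$.)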

2. Divide the range of the $y_i$ into $S$ non-overlapping slices $H_s$ $(s=1,\ldots,S)$. Let $n_s$ be the number of observations within slice $H_s$ and $I_{H_s}$ the indicator function for that slice:

$$n_s=\sum_{i=1}^{n}I_{H_s}(y_i)$$

3. Compute the mean of the $z_i$ within each slice, which gives a crude estimate $\hat{m}_1$ of the inverse regression curve $m_1$:

$$\bar{z}_s=n_s^{-1}\sum_{i=1}^{n}z_i\,I_{H_s}(y_i)$$

4. Calculate the estimate for $\operatorname{Cov}\{m_1(y)\}$:

$$\hat{V}=n^{-1}\sum_{s=1}^{S}n_s\,\bar{z}_s\bar{z}_s^\top$$

5. Identify the eigenvalues $\hat{\lambda}_i$ and the eigenvectors $\hat{\eta}_i$ of $\hat{V}$; the eigenvectors associated with the $k$ largest eigenvalues are the estimated standardized EDR-directions.

6. Transform the standardized EDR-directions back to the original scale. The estimates for the EDR-directions are given by:

$$\hat{\beta}_i=\hat{\Sigma}_{xx}^{-1/2}\hat{\eta}_i$$

(which are not necessarily orthogonal).
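The six steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `sir_directions` and its parameters are hypothetical, the sample mean and sample covariance stand in for $E(X)$ and $\Sigma_{xx}$, and the slices are formed with roughly equal counts from the sorted responses, which is one of several possible slicing schemes.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, k=1):
    """Estimate k EDR-directions by sliced inverse regression (sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Step 1: standardize X using the sample mean and sample covariance.
    mean = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mean) @ Sigma_inv_sqrt        # rows are the z_i

    # Step 2: slice the sorted responses (assumed equal-count slicing).
    slices = np.array_split(np.argsort(y), n_slices)

    # Steps 3-4: slice means and their weighted covariance estimate V.
    V = np.zeros((p, p))
    for idx in slices:
        z_bar = Z[idx].mean(axis=0)
        V += len(idx) * np.outer(z_bar, z_bar)
    V /= n

    # Step 5: eigen-decomposition, keeping the k largest eigenvalues.
    lam, eta = np.linalg.eigh(V)           # eigh returns ascending order
    top = np.argsort(lam)[::-1][:k]
    eta_top = eta[:, top]

    # Step 6: transform back to the original scale.
    beta_hat = Sigma_inv_sqrt @ eta_top    # columns are the estimates
    return beta_hat, lam[top]

# Usage on simulated data with one true direction (illustrative settings).
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 5))
beta = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
y = (X @ beta) ** 3 + 0.5 * rng.standard_normal(2000)

beta_hat, lam = sir_directions(X, y, n_slices=10, k=1)
b = beta_hat[:, 0]
cosine = abs(b @ beta) / np.linalg.norm(b)
print(cosine)
```

The estimated direction is identified only up to sign and scale, so the usage example compares it with the true direction through the absolute cosine of the angle between them; values near 1 indicate good recovery.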

References

Template:Reflist