Kernel Fisher discriminant analysis

In statistics, kernel Fisher discriminant analysis (KFD),[1] also known as generalized discriminant analysis[2] and kernel discriminant analysis,[3] is a kernelized version of linear discriminant analysis (LDA). It is named after Ronald Fisher.

Linear discriminant analysis

Intuitively, the idea of LDA is to find a projection where class separation is maximized. Given two sets of labeled data, 𝐂1 and 𝐂2, we can calculate the mean value of each class, 𝐦1 and 𝐦2, as

\mathbf{m}_i = \frac{1}{l_i} \sum_{n=1}^{l_i} \mathbf{x}_n^i,

where li is the number of examples of class 𝐂i. The goal of linear discriminant analysis is to give a large separation of the class means while also keeping the in-class variance small.[4] This is formulated as maximizing, with respect to 𝐰, the following ratio:

J(\mathbf{w}) = \frac{\mathbf{w}^{\mathsf{T}} \mathbf{S}_B \mathbf{w}}{\mathbf{w}^{\mathsf{T}} \mathbf{S}_W \mathbf{w}},

where 𝐒B is the between-class covariance matrix and 𝐒W is the total within-class covariance matrix:

\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^{\mathsf{T}}
\mathbf{S}_W = \sum_{i=1,2} \sum_{n=1}^{l_i} (\mathbf{x}_n^i - \mathbf{m}_i)(\mathbf{x}_n^i - \mathbf{m}_i)^{\mathsf{T}}.

The maximum of the above ratio is attained at

\mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1),

as can be shown by the Lagrange multiplier method (sketch of proof):

Maximizing $J(\mathbf{w}) = \frac{\mathbf{w}^{\mathsf{T}} \mathbf{S}_B \mathbf{w}}{\mathbf{w}^{\mathsf{T}} \mathbf{S}_W \mathbf{w}}$ is equivalent to maximizing

\mathbf{w}^{\mathsf{T}} \mathbf{S}_B \mathbf{w}

subject to

\mathbf{w}^{\mathsf{T}} \mathbf{S}_W \mathbf{w} = 1.

This, in turn, is equivalent to maximizing $I(\mathbf{w}, \lambda) = \mathbf{w}^{\mathsf{T}} \mathbf{S}_B \mathbf{w} - \lambda (\mathbf{w}^{\mathsf{T}} \mathbf{S}_W \mathbf{w} - 1)$, where $\lambda$ is the Lagrange multiplier.

At the maximum, the derivatives of $I(\mathbf{w}, \lambda)$ with respect to $\mathbf{w}$ and $\lambda$ must be zero. Taking $\frac{dI}{d\mathbf{w}} = \mathbf{0}$ yields

\mathbf{S}_B \mathbf{w} - \lambda \mathbf{S}_W \mathbf{w} = \mathbf{0},

which is trivially satisfied by $\mathbf{w} = c\,\mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$ and $\lambda = (\mathbf{m}_2 - \mathbf{m}_1)^{\mathsf{T}} \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$.
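
As a concrete illustration of this result, the following is a minimal NumPy sketch; the toy data, sample sizes, and variable names are assumptions made for the example rather than part of the derivation. It forms $\mathbf{S}_B$ and $\mathbf{S}_W$ from two labeled classes, computes $\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$, and checks the stationarity condition above.

import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: l_1 and l_2 points in 2-D (illustrative values only).
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(40, 2))   # class C_1
X2 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(60, 2))   # class C_2

# Class means m_1 and m_2.
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Between-class and within-class scatter matrices S_B and S_W.
d = (m2 - m1).reshape(-1, 1)
S_B = d @ d.T
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction: w proportional to S_W^{-1} (m_2 - m_1).
w = np.linalg.solve(S_W, m2 - m1)

# Check the stationarity condition S_B w = lambda S_W w with
# lambda = (m_2 - m_1)^T S_W^{-1} (m_2 - m_1).
lam = (m2 - m1) @ np.linalg.solve(S_W, m2 - m1)
assert np.allclose(S_B @ w, lam * (S_W @ w))

# One-dimensional projections of each class onto w.
y1, y2 = X1 @ w, X2 @ w
print("projected class means:", y1.mean(), y2.mean())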

Extending LDA

To extend LDA to non-linear mappings, the data, given as the points 𝐱i, can be mapped to a new feature space, F, via some function ϕ. In this new feature space, the function that needs to be maximized is[1]

J(\mathbf{w}) = \frac{\mathbf{w}^{\mathsf{T}} \mathbf{S}_B^{\phi} \mathbf{w}}{\mathbf{w}^{\mathsf{T}} \mathbf{S}_W^{\phi} \mathbf{w}},

where

\mathbf{S}_B^{\phi} = (\mathbf{m}_2^{\phi} - \mathbf{m}_1^{\phi})(\mathbf{m}_2^{\phi} - \mathbf{m}_1^{\phi})^{\mathsf{T}}
\mathbf{S}_W^{\phi} = \sum_{i=1,2} \sum_{n=1}^{l_i} (\phi(\mathbf{x}_n^i) - \mathbf{m}_i^{\phi})(\phi(\mathbf{x}_n^i) - \mathbf{m}_i^{\phi})^{\mathsf{T}},

and

\mathbf{m}_i^{\phi} = \frac{1}{l_i} \sum_{j=1}^{l_i} \phi(\mathbf{x}_j^i).

Further, note that $\mathbf{w} \in F$. Explicitly computing the mappings $\phi(\mathbf{x}_i)$ and then performing LDA can be computationally expensive, and in many cases intractable; for example, $F$ may be infinite dimensional. Thus, rather than explicitly mapping the data to $F$, the data can be implicitly embedded by rewriting the algorithm in terms of dot products and applying the kernel trick, in which the dot product in the new feature space is replaced by a kernel function, $k(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{y})$.
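
As a small check of this substitution, the following sketch uses the homogeneous degree-2 polynomial kernel on $\mathbb{R}^2$, whose explicit feature map is short enough to write out; the kernel and the feature map are illustrative choices, not prescribed by KFD.

import numpy as np

def phi(x):
    # Explicit feature map of the degree-2 polynomial kernel on R^2.
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def k(x, y):
    # Kernel function k(x, y) = (x . y)^2 = phi(x) . phi(y), computed without phi.
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

# The kernel value agrees with the dot product in the feature space F.
assert np.isclose(k(x, y), np.dot(phi(x), phi(y)))
print(k(x, y))

Any positive-semidefinite kernel, such as the Gaussian kernel, can be substituted in the same way even when its feature map cannot be written out explicitly.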

LDA can be reformulated in terms of dot products by first noting that 𝐰 will have an expansion of the form[5]

\mathbf{w} = \sum_{i=1}^{l} \alpha_i \phi(\mathbf{x}_i).

Then note that

\mathbf{w}^{\mathsf{T}} \mathbf{m}_i^{\phi} = \frac{1}{l_i} \sum_{j=1}^{l} \sum_{k=1}^{l_i} \alpha_j k(\mathbf{x}_j, \mathbf{x}_k^i) = \boldsymbol{\alpha}^{\mathsf{T}} \mathbf{M}_i,

where

(\mathbf{M}_i)_j = \frac{1}{l_i} \sum_{k=1}^{l_i} k(\mathbf{x}_j, \mathbf{x}_k^i).

The numerator of J(𝐰) can then be written as:

\mathbf{w}^{\mathsf{T}} \mathbf{S}_B^{\phi} \mathbf{w} = \mathbf{w}^{\mathsf{T}} (\mathbf{m}_2^{\phi} - \mathbf{m}_1^{\phi})(\mathbf{m}_2^{\phi} - \mathbf{m}_1^{\phi})^{\mathsf{T}} \mathbf{w} = \boldsymbol{\alpha}^{\mathsf{T}} \mathbf{M} \boldsymbol{\alpha}, \quad \text{where } \mathbf{M} = (\mathbf{M}_2 - \mathbf{M}_1)(\mathbf{M}_2 - \mathbf{M}_1)^{\mathsf{T}}.

Similarly, the denominator can be written as

\mathbf{w}^{\mathsf{T}} \mathbf{S}_W^{\phi} \mathbf{w} = \boldsymbol{\alpha}^{\mathsf{T}} \mathbf{N} \boldsymbol{\alpha}, \quad \text{where } \mathbf{N} = \sum_{j=1,2} \mathbf{K}_j (\mathbf{I} - \mathbf{1}_{l_j}) \mathbf{K}_j^{\mathsf{T}},

with the $(n,m)$-th component of $\mathbf{K}_j$ defined as $k(\mathbf{x}_n, \mathbf{x}_m^j)$, $\mathbf{I}$ the identity matrix, and $\mathbf{1}_{l_j}$ the matrix with all entries equal to $1/l_j$. This identity can be derived by starting out with the expression for $\mathbf{w}^{\mathsf{T}} \mathbf{S}_W^{\phi} \mathbf{w}$ and using the expansion of $\mathbf{w}$ and the definitions of $\mathbf{S}_W^{\phi}$ and $\mathbf{m}_i^{\phi}$:

\begin{aligned}
\mathbf{w}^{\mathsf{T}} \mathbf{S}_W^{\phi} \mathbf{w}
&= \left( \sum_{i=1}^{l} \alpha_i \phi^{\mathsf{T}}(\mathbf{x}_i) \right) \left( \sum_{j=1,2} \sum_{n=1}^{l_j} (\phi(\mathbf{x}_n^j) - \mathbf{m}_j^{\phi})(\phi(\mathbf{x}_n^j) - \mathbf{m}_j^{\phi})^{\mathsf{T}} \right) \left( \sum_{k=1}^{l} \alpha_k \phi(\mathbf{x}_k) \right) \\
&= \sum_{j=1,2} \sum_{i=1}^{l} \sum_{n=1}^{l_j} \sum_{k=1}^{l} \left( \alpha_i \phi^{\mathsf{T}}(\mathbf{x}_i) (\phi(\mathbf{x}_n^j) - \mathbf{m}_j^{\phi})(\phi(\mathbf{x}_n^j) - \mathbf{m}_j^{\phi})^{\mathsf{T}} \alpha_k \phi(\mathbf{x}_k) \right) \\
&= \sum_{j=1,2} \sum_{i=1}^{l} \sum_{n=1}^{l_j} \sum_{k=1}^{l} \left( \alpha_i k(\mathbf{x}_i, \mathbf{x}_n^j) - \frac{1}{l_j} \sum_{p=1}^{l_j} \alpha_i k(\mathbf{x}_i, \mathbf{x}_p^j) \right) \left( \alpha_k k(\mathbf{x}_k, \mathbf{x}_n^j) - \frac{1}{l_j} \sum_{q=1}^{l_j} \alpha_k k(\mathbf{x}_k, \mathbf{x}_q^j) \right) \\
&= \sum_{j=1,2} \left( \sum_{i=1}^{l} \sum_{n=1}^{l_j} \sum_{k=1}^{l} \left( \alpha_i \alpha_k k(\mathbf{x}_i, \mathbf{x}_n^j) k(\mathbf{x}_k, \mathbf{x}_n^j) - \frac{2 \alpha_i \alpha_k}{l_j} \sum_{p=1}^{l_j} k(\mathbf{x}_i, \mathbf{x}_n^j) k(\mathbf{x}_k, \mathbf{x}_p^j) + \frac{\alpha_i \alpha_k}{l_j^2} \sum_{p=1}^{l_j} \sum_{q=1}^{l_j} k(\mathbf{x}_i, \mathbf{x}_p^j) k(\mathbf{x}_k, \mathbf{x}_q^j) \right) \right) \\
&= \sum_{j=1,2} \left( \sum_{i=1}^{l} \sum_{n=1}^{l_j} \sum_{k=1}^{l} \left( \alpha_i \alpha_k k(\mathbf{x}_i, \mathbf{x}_n^j) k(\mathbf{x}_k, \mathbf{x}_n^j) - \frac{\alpha_i \alpha_k}{l_j} \sum_{p=1}^{l_j} k(\mathbf{x}_i, \mathbf{x}_n^j) k(\mathbf{x}_k, \mathbf{x}_p^j) \right) \right) \\
&= \sum_{j=1,2} \left( \boldsymbol{\alpha}^{\mathsf{T}} \mathbf{K}_j \mathbf{K}_j^{\mathsf{T}} \boldsymbol{\alpha} - \boldsymbol{\alpha}^{\mathsf{T}} \mathbf{K}_j \mathbf{1}_{l_j} \mathbf{K}_j^{\mathsf{T}} \boldsymbol{\alpha} \right) \\
&= \boldsymbol{\alpha}^{\mathsf{T}} \mathbf{N} \boldsymbol{\alpha}.
\end{aligned}
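
The identity can also be checked numerically when the feature map of the kernel is available in closed form. The sketch below is an illustration only; the degree-2 polynomial kernel, the toy data, and the randomly drawn coefficients $\alpha_i$ are assumptions for the example. It compares $\mathbf{w}^{\mathsf{T}} \mathbf{S}_W^{\phi} \mathbf{w}$, computed explicitly in feature space, with $\boldsymbol{\alpha}^{\mathsf{T}} \mathbf{N} \boldsymbol{\alpha}$.

import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    # Explicit feature map of the degree-2 polynomial kernel on R^2 (illustrative choice).
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def k(x, y):
    return float(np.dot(x, y)) ** 2

# Toy two-class data; X stacks all l = l_1 + l_2 points.
X1 = rng.normal(size=(5, 2))
X2 = rng.normal(size=(7, 2)) + 2.0
X = np.vstack([X1, X2])
l = len(X)
alpha = rng.normal(size=l)

# Left-hand side: map the data explicitly and form w and S_W^phi in feature space.
P = np.array([phi(x) for x in X])
w = P.T @ alpha
S_W_phi = np.zeros((3, 3))
for Xj in (X1, X2):
    Pj = np.array([phi(x) for x in Xj])
    mj = Pj.mean(axis=0)
    S_W_phi += (Pj - mj).T @ (Pj - mj)
lhs = w @ S_W_phi @ w

# Right-hand side: alpha^T N alpha with N = sum_j K_j (I - 1_{l_j}) K_j^T,
# where (K_j)_{n,m} = k(x_n, x_m^j) and 1_{l_j} has all entries equal to 1 / l_j.
N = np.zeros((l, l))
for Xj in (X1, X2):
    lj = len(Xj)
    Kj = np.array([[k(xn, xm) for xm in Xj] for xn in X])   # l x l_j
    N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T
rhs = alpha @ N @ alpha

assert np.isclose(lhs, rhs)
print(lhs, rhs)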

With these equations for the numerator and denominator of J(𝐰), the equation for J can be rewritten as

J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^{\mathsf{T}} \mathbf{M} \boldsymbol{\alpha}}{\boldsymbol{\alpha}^{\mathsf{T}} \mathbf{N} \boldsymbol{\alpha}}.

Then, differentiating and setting equal to zero gives

(\boldsymbol{\alpha}^{\mathsf{T}} \mathbf{M} \boldsymbol{\alpha})\, \mathbf{N} \boldsymbol{\alpha} = (\boldsymbol{\alpha}^{\mathsf{T}} \mathbf{N} \boldsymbol{\alpha})\, \mathbf{M} \boldsymbol{\alpha}.

Since only the direction of 𝐰, and hence the direction of α, matters, the above can be solved for α as

\boldsymbol{\alpha} = \mathbf{N}^{-1} (\mathbf{M}_2 - \mathbf{M}_1).

Note that in practice, $\mathbf{N}$ is usually singular, so a multiple of the identity is added to it:[1]

\mathbf{N}_{\epsilon} = \mathbf{N} + \epsilon \mathbf{I}.

Given the solution for α, the projection of a new data point is given by[1]

y(\mathbf{x}) = (\mathbf{w} \cdot \phi(\mathbf{x})) = \sum_{i=1}^{l} \alpha_i k(\mathbf{x}_i, \mathbf{x}).
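
Putting the two-class pieces together, the following is a minimal NumPy sketch of the procedure described above; the Gaussian kernel, the toy data, the helper names, and the value of the regularization constant $\epsilon$ are assumptions made for the example.

import numpy as np

rng = np.random.default_rng(2)

def rbf(x, y, gamma=1.0):
    # Gaussian (RBF) kernel, chosen here purely for illustration.
    return np.exp(-gamma * np.sum((x - y) ** 2))

# Toy two-class data; X stacks all l = l_1 + l_2 training points.
X1 = rng.normal(size=(30, 2))
X2 = rng.normal(size=(40, 2)) + 2.5
X = np.vstack([X1, X2])
l = len(X)

# M_i with (M_i)_j = (1 / l_i) sum_k k(x_j, x_k^i).
def M_vec(Xi):
    return np.array([np.mean([rbf(xj, xk) for xk in Xi]) for xj in X])

M1, M2 = M_vec(X1), M_vec(X2)

# N = sum_j K_j (I - 1_{l_j}) K_j^T, regularized as N_eps = N + eps * I.
N = np.zeros((l, l))
for Xj in (X1, X2):
    lj = len(Xj)
    Kj = np.array([[rbf(xn, xm) for xm in Xj] for xn in X])
    N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T
eps = 1e-3                      # illustrative value
N_eps = N + eps * np.eye(l)

# Coefficients alpha = N_eps^{-1} (M_2 - M_1).
alpha = np.linalg.solve(N_eps, M2 - M1)

# Projection of a new point: y(x) = sum_i alpha_i k(x_i, x).
def project(x_new):
    return sum(a * rbf(xi, x_new) for a, xi in zip(alpha, X))

print(project(np.array([0.0, 0.0])), project(np.array([2.5, 2.5])))

In practice the kernel, its parameters, and $\epsilon$ are typically chosen by model selection, for example cross-validation.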

Multi-class KFD

The extension to cases where there are more than two classes is relatively straightforward.[2][6][7] Let $c$ be the number of classes. Then multi-class KFD involves projecting the data into a $(c-1)$-dimensional space using $(c-1)$ discriminant functions

y_i = \mathbf{w}_i^{\mathsf{T}} \phi(\mathbf{x}), \quad i = 1, \ldots, c-1.

This can be written in matrix notation

\mathbf{y} = \mathbf{W}^{\mathsf{T}} \phi(\mathbf{x}),

where the 𝐰i are the columns of 𝐖.[6] Further, the between-class covariance matrix is now

\mathbf{S}_B^{\phi} = \sum_{i=1}^{c} l_i (\mathbf{m}_i^{\phi} - \mathbf{m}^{\phi})(\mathbf{m}_i^{\phi} - \mathbf{m}^{\phi})^{\mathsf{T}},

where 𝐦ϕ is the mean of all the data in the new feature space. The within-class covariance matrix is

\mathbf{S}_W^{\phi} = \sum_{i=1}^{c} \sum_{n=1}^{l_i} (\phi(\mathbf{x}_n^i) - \mathbf{m}_i^{\phi})(\phi(\mathbf{x}_n^i) - \mathbf{m}_i^{\phi})^{\mathsf{T}}.

The solution is now obtained by maximizing

J(\mathbf{W}) = \frac{\left| \mathbf{W}^{\mathsf{T}} \mathbf{S}_B^{\phi} \mathbf{W} \right|}{\left| \mathbf{W}^{\mathsf{T}} \mathbf{S}_W^{\phi} \mathbf{W} \right|}.

The kernel trick can again be used and the goal of multi-class KFD becomes[7]

\mathbf{A}^{*} = \underset{\mathbf{A}}{\operatorname{argmax}} \frac{\left| \mathbf{A}^{\mathsf{T}} \mathbf{M} \mathbf{A} \right|}{\left| \mathbf{A}^{\mathsf{T}} \mathbf{N} \mathbf{A} \right|},

where $\mathbf{A} = [\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_{c-1}]$ and

\mathbf{M} = \sum_{j=1}^{c} l_j (\mathbf{M}_j - \mathbf{M}_{*})(\mathbf{M}_j - \mathbf{M}_{*})^{\mathsf{T}}
\mathbf{N} = \sum_{j=1}^{c} \mathbf{K}_j (\mathbf{I} - \mathbf{1}_{l_j}) \mathbf{K}_j^{\mathsf{T}}.

The 𝐌i are defined as in the above section and 𝐌* is defined as

(\mathbf{M}_{*})_j = \frac{1}{l} \sum_{k=1}^{l} k(\mathbf{x}_j, \mathbf{x}_k).

$\mathbf{A}^{*}$ can then be computed by finding the $(c-1)$ leading eigenvectors of $\mathbf{N}^{-1} \mathbf{M}$.[7] Furthermore, the projection of a new input, $\mathbf{x}_t$, is given by[7]

\mathbf{y}(\mathbf{x}_t) = (\mathbf{A}^{*})^{\mathsf{T}} \mathbf{K}_t,

where the ith component of 𝐊t is given by k(𝐱i,𝐱t).
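
A minimal NumPy sketch of the multi-class computation follows; the RBF kernel, the three-class toy data, and the small ridge added to $\mathbf{N}$ before inversion are illustrative assumptions, not part of the formulation above. It forms $\mathbf{M}$ and $\mathbf{N}$ from the kernel matrix, takes the $(c-1)$ leading eigenvectors of $\mathbf{N}^{-1}\mathbf{M}$ as $\mathbf{A}^{*}$, and projects a new input.

import numpy as np

rng = np.random.default_rng(3)

def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

# Toy data with c = 3 classes; X stacks all l training points.
classes = [rng.normal(size=(20, 2)) + offset for offset in ([0, 0], [3, 0], [0, 3])]
X = np.vstack(classes)
l, c = len(X), len(classes)

# Full kernel matrix K with K[n, m] = k(x_n, x_m).
K = np.array([[rbf(xn, xm) for xm in X] for xn in X])

# M_j, M_* and the matrices M and N as defined above.
M_star = K.mean(axis=1)
M = np.zeros((l, l))
N = np.zeros((l, l))
start = 0
for Xj in classes:
    lj = len(Xj)
    Kj = K[:, start:start + lj]                  # columns belonging to class j
    Mj = Kj.mean(axis=1)
    d = (Mj - M_star).reshape(-1, 1)
    M += lj * (d @ d.T)
    N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T
    start += lj

# A*: the (c - 1) leading eigenvectors of N^{-1} M
# (a small ridge is added to N for numerical stability; illustrative choice).
N += 1e-3 * np.eye(l)
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(N, M))
order = np.argsort(eigvals.real)[::-1]
A_star = eigvecs[:, order[:c - 1]].real

# Projection of a new input x_t: y(x_t) = (A*)^T K_t, with (K_t)_i = k(x_i, x_t).
def project(x_t):
    K_t = np.array([rbf(xi, x_t) for xi in X])
    return A_star.T @ K_t

print(project(np.array([0.0, 0.0])))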

Classification using KFD

In both two-class and multi-class KFD, the class label of a new input can be assigned as[7]

f(\mathbf{x}) = \underset{j}{\operatorname{argmin}}\, D(\mathbf{y}(\mathbf{x}), \bar{\mathbf{y}}_j),

where $\bar{\mathbf{y}}_j$ is the projected mean for class $j$ and $D(\cdot, \cdot)$ is a distance function.
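
For example, a short sketch of this decision rule with the Euclidean distance taken as $D$; the function name and the example values are assumptions for illustration, and the projected class means would come from a fitted KFD model such as the sketches above.

import numpy as np

def classify(y_new, class_means):
    # Assign the class whose projected mean is nearest to y_new.
    # y_new       : projection of the new input (scalar or (c-1)-vector).
    # class_means : list of projected class means, one per class j.
    # D is taken to be the Euclidean distance (an illustrative choice).
    distances = [np.linalg.norm(np.atleast_1d(y_new) - np.atleast_1d(m))
                 for m in class_means]
    return int(np.argmin(distances))

# Example with made-up projected means for three classes.
print(classify(0.9, class_means=[0.1, 1.0, 2.3]))   # -> 1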

Applications

Kernel discriminant analysis has been used in a variety of applications.

See also

References

Template:Reflist