Kernel Fisher discriminant analysis


In statistics, kernel Fisher discriminant analysis (KFD),[1] also known as generalized discriminant analysis[2] and kernel discriminant analysis,[3] is a kernelized version of linear discriminant analysis (LDA). It is named after Ronald Fisher.

Linear discriminant analysis

Intuitively, the idea of LDA is to find a projection where class separation is maximized. Given two sets of labeled data, 𝐂1 and 𝐂2, we can calculate the mean value of each class, 𝐦1 and 𝐦2, as

$$\mathbf{m}_i = \frac{1}{l_i} \sum_{n=1}^{l_i} \mathbf{x}_n^i,$$

where li is the number of examples of class 𝐂i. The goal of linear discriminant analysis is to give a large separation of the class means while also keeping the in-class variance small.[4] This is formulated as maximizing, with respect to 𝐰, the following ratio:

$$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}},$$

where 𝐒B is the between-class covariance matrix and 𝐒W is the total within-class covariance matrix:

$$\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$$
$$\mathbf{S}_W = \sum_{i=1,2} \sum_{n=1}^{l_i} (\mathbf{x}_n^i - \mathbf{m}_i)(\mathbf{x}_n^i - \mathbf{m}_i)^T.$$

The maximum of the above ratio is attained at

π°βˆπ’Wβˆ’1(𝐦2βˆ’π¦1).

as can be shown by the Lagrange multiplier method (sketch of proof):

Maximizing $J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}$ is equivalent to maximizing

$$\mathbf{w}^T \mathbf{S}_B \mathbf{w}$$

subject to

$$\mathbf{w}^T \mathbf{S}_W \mathbf{w} = 1.$$

This, in turn, is equivalent to maximizing $I(\mathbf{w}, \lambda) = \mathbf{w}^T \mathbf{S}_B \mathbf{w} - \lambda (\mathbf{w}^T \mathbf{S}_W \mathbf{w} - 1)$, where $\lambda$ is the Lagrange multiplier.

At the maximum, the derivatives of $I(\mathbf{w}, \lambda)$ with respect to $\mathbf{w}$ and $\lambda$ must be zero. Taking $\frac{dI}{d\mathbf{w}} = \mathbf{0}$ yields

𝐒Bπ°βˆ’Ξ»π’W𝐰=𝟎,

which is trivially satisfied by $\mathbf{w} = c \, \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$ and $\lambda = (\mathbf{m}_2 - \mathbf{m}_1)^T \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$.
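The closed-form optimum and the stationarity condition above can be checked numerically. The following is a minimal sketch (the toy Gaussian clusters are made up for illustration): it builds the scatter matrices as defined above, computes $\mathbf{w} = \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$, and verifies that $\mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w}$.

```python
import numpy as np

# Toy two-class data (made up for illustration).
rng = np.random.default_rng(0)
C1 = rng.normal([0.0, 0.0], 0.5, size=(20, 2))
C2 = rng.normal([2.0, 1.0], 0.5, size=(20, 2))

m1, m2 = C1.mean(axis=0), C2.mean(axis=0)
d = m2 - m1

# Scatter matrices as defined above.
S_B = np.outer(d, d)
S_W = (C1 - m1).T @ (C1 - m1) + (C2 - m2).T @ (C2 - m2)

# Optimal direction (up to scale) and the corresponding multiplier.
w = np.linalg.solve(S_W, d)    # w = S_W^{-1} (m2 - m1)
lam = d @ w                    # lambda = (m2 - m1)^T S_W^{-1} (m2 - m1)

# Stationarity: S_B w - lambda * S_W w should vanish.
residual = np.linalg.norm(S_B @ w - lam * (S_W @ w))
```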

Extending LDA

To extend LDA to non-linear mappings, the data, given as the β„“ points 𝐱i, can be mapped to a new feature space, F, via some function Ο•. In this new feature space, the function that needs to be maximized is[1]

J(𝐰)=𝐰T𝐒Bϕ𝐰𝐰T𝐒Wϕ𝐰,

where

$$\mathbf{S}_B^\phi = (\mathbf{m}_2^\phi - \mathbf{m}_1^\phi)(\mathbf{m}_2^\phi - \mathbf{m}_1^\phi)^T$$
$$\mathbf{S}_W^\phi = \sum_{i=1,2} \sum_{n=1}^{l_i} (\phi(\mathbf{x}_n^i) - \mathbf{m}_i^\phi)(\phi(\mathbf{x}_n^i) - \mathbf{m}_i^\phi)^T,$$

and

$$\mathbf{m}_i^\phi = \frac{1}{l_i} \sum_{j=1}^{l_i} \phi(\mathbf{x}_j^i).$$

Further, note that $\mathbf{w} \in F$. Explicitly computing the mappings $\phi(\mathbf{x}_i)$ and then performing LDA can be computationally expensive, and in many cases intractable; for example, $F$ may be infinite-dimensional. Thus, rather than explicitly mapping the data to $F$, the data can be implicitly embedded by rewriting the algorithm in terms of dot products and applying the kernel trick, in which the dot product in the new feature space is replaced by a kernel function, $k(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{y})$.
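As a concrete illustration of such a kernel function, one common choice (an assumption for this example, not mandated by the text) is the Gaussian (RBF) kernel $k(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\|^2 / 2\sigma^2)$; the sketch below evaluates it on a small made-up dataset to form a Gram matrix of pairwise kernel values.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gram matrix K with K[n, m] = exp(-||X[n] - Y[m]||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

# Two made-up points at distance 1; the Gram matrix is symmetric
# with ones on the diagonal (k(x, x) = 1 for the RBF kernel).
X = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel(X, X)
```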

LDA can be reformulated in terms of dot products by first noting that 𝐰 will have an expansion of the form[5]

$$\mathbf{w} = \sum_{i=1}^{l} \alpha_i \phi(\mathbf{x}_i).$$

Then note that

$$\mathbf{w}^T \mathbf{m}_i^\phi = \frac{1}{l_i} \sum_{j=1}^{l} \sum_{k=1}^{l_i} \alpha_j \, k(\mathbf{x}_j, \mathbf{x}_k^i) = \boldsymbol{\alpha}^T \mathbf{M}_i,$$

where

$$(\mathbf{M}_i)_j = \frac{1}{l_i} \sum_{k=1}^{l_i} k(\mathbf{x}_j, \mathbf{x}_k^i).$$

The numerator of J(𝐰) can then be written as:

$$\mathbf{w}^T \mathbf{S}_B^\phi \mathbf{w} = \mathbf{w}^T (\mathbf{m}_2^\phi - \mathbf{m}_1^\phi)(\mathbf{m}_2^\phi - \mathbf{m}_1^\phi)^T \mathbf{w} = \boldsymbol{\alpha}^T \mathbf{M} \boldsymbol{\alpha}, \quad \text{where } \mathbf{M} = (\mathbf{M}_2 - \mathbf{M}_1)(\mathbf{M}_2 - \mathbf{M}_1)^T.$$

Similarly, the denominator can be written as

$$\mathbf{w}^T \mathbf{S}_W^\phi \mathbf{w} = \boldsymbol{\alpha}^T \mathbf{N} \boldsymbol{\alpha}, \quad \text{where } \mathbf{N} = \sum_{j=1,2} \mathbf{K}_j (\mathbf{I} - \mathbf{1}_{l_j}) \mathbf{K}_j^T,$$

with the $(n, m)$th component of $\mathbf{K}_j$ defined as $k(\mathbf{x}_n, \mathbf{x}_m^j)$, $\mathbf{I}$ the identity matrix, and $\mathbf{1}_{l_j}$ the matrix with all entries equal to $1/l_j$. This identity can be derived by starting from the expression for $\mathbf{w}^T \mathbf{S}_W^\phi \mathbf{w}$ and using the expansion of $\mathbf{w}$ and the definitions of $\mathbf{S}_W^\phi$ and $\mathbf{m}_i^\phi$:

$$\begin{align*}
\mathbf{w}^T \mathbf{S}_W^\phi \mathbf{w}
&= \left( \sum_{i=1}^{l} \alpha_i \phi^T(\mathbf{x}_i) \right) \left( \sum_{j=1,2} \sum_{n=1}^{l_j} (\phi(\mathbf{x}_n^j) - \mathbf{m}_j^\phi)(\phi(\mathbf{x}_n^j) - \mathbf{m}_j^\phi)^T \right) \left( \sum_{k=1}^{l} \alpha_k \phi(\mathbf{x}_k) \right) \\
&= \sum_{j=1,2} \sum_{i=1}^{l} \sum_{n=1}^{l_j} \sum_{k=1}^{l} \alpha_i \phi^T(\mathbf{x}_i) (\phi(\mathbf{x}_n^j) - \mathbf{m}_j^\phi)(\phi(\mathbf{x}_n^j) - \mathbf{m}_j^\phi)^T \alpha_k \phi(\mathbf{x}_k) \\
&= \sum_{j=1,2} \sum_{i=1}^{l} \sum_{n=1}^{l_j} \sum_{k=1}^{l} \left( \alpha_i k(\mathbf{x}_i, \mathbf{x}_n^j) - \frac{1}{l_j} \sum_{p=1}^{l_j} \alpha_i k(\mathbf{x}_i, \mathbf{x}_p^j) \right) \left( \alpha_k k(\mathbf{x}_k, \mathbf{x}_n^j) - \frac{1}{l_j} \sum_{q=1}^{l_j} \alpha_k k(\mathbf{x}_k, \mathbf{x}_q^j) \right) \\
&= \sum_{j=1,2} \sum_{i=1}^{l} \sum_{n=1}^{l_j} \sum_{k=1}^{l} \left( \alpha_i \alpha_k k(\mathbf{x}_i, \mathbf{x}_n^j) k(\mathbf{x}_k, \mathbf{x}_n^j) - \frac{2 \alpha_i \alpha_k}{l_j} \sum_{p=1}^{l_j} k(\mathbf{x}_i, \mathbf{x}_n^j) k(\mathbf{x}_k, \mathbf{x}_p^j) + \frac{\alpha_i \alpha_k}{l_j^2} \sum_{p=1}^{l_j} \sum_{q=1}^{l_j} k(\mathbf{x}_i, \mathbf{x}_p^j) k(\mathbf{x}_k, \mathbf{x}_q^j) \right) \\
&= \sum_{j=1,2} \sum_{i=1}^{l} \sum_{n=1}^{l_j} \sum_{k=1}^{l} \left( \alpha_i \alpha_k k(\mathbf{x}_i, \mathbf{x}_n^j) k(\mathbf{x}_k, \mathbf{x}_n^j) - \frac{\alpha_i \alpha_k}{l_j} \sum_{p=1}^{l_j} k(\mathbf{x}_i, \mathbf{x}_n^j) k(\mathbf{x}_k, \mathbf{x}_p^j) \right) \\
&= \sum_{j=1,2} \boldsymbol{\alpha}^T \mathbf{K}_j \mathbf{K}_j^T \boldsymbol{\alpha} - \boldsymbol{\alpha}^T \mathbf{K}_j \mathbf{1}_{l_j} \mathbf{K}_j^T \boldsymbol{\alpha} \\
&= \boldsymbol{\alpha}^T \mathbf{N} \boldsymbol{\alpha}.
\end{align*}$$
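The identity $\mathbf{w}^T \mathbf{S}_W^\phi \mathbf{w} = \boldsymbol{\alpha}^T \mathbf{N} \boldsymbol{\alpha}$ can be sanity-checked in a finite-dimensional feature space, where both sides are computable explicitly. The sketch below uses a made-up feature map $\phi(x) = (x_1, x_2, x_1 x_2)$ and random data; all names and values are illustrative.

```python
import numpy as np

# Made-up data and expansion coefficients alpha.
rng = np.random.default_rng(2)
X1 = rng.normal(size=(4, 2))
X2 = rng.normal(size=(5, 2))
X = np.vstack([X1, X2])
l = len(X)
alpha = rng.normal(size=l)

def phi(x):
    # Explicit (hypothetical) feature map, so S_W^phi is computable directly.
    return np.array([x[0], x[1], x[0] * x[1]])

def k(x, y):
    # The kernel induced by phi: k(x, y) = phi(x) . phi(y).
    return phi(x) @ phi(y)

# Left side: explicit w = sum_i alpha_i phi(x_i) and S_W^phi in feature space.
Phi = np.array([phi(x) for x in X])
w = Phi.T @ alpha
S_W = np.zeros((3, 3))
for Xj in (X1, X2):
    Pj = np.array([phi(x) for x in Xj])
    mj = Pj.mean(axis=0)
    S_W += (Pj - mj).T @ (Pj - mj)
lhs = w @ S_W @ w

# Right side: alpha^T N alpha with N = sum_j K_j (I - 1_{l_j}) K_j^T.
N = np.zeros((l, l))
for Xj in (X1, X2):
    lj = len(Xj)
    Kj = np.array([[k(xn, xm) for xm in Xj] for xn in X])
    N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T
rhs = alpha @ N @ alpha
```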

With these equations for the numerator and denominator of J(𝐰), the equation for J can be rewritten as

$$J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^T \mathbf{M} \boldsymbol{\alpha}}{\boldsymbol{\alpha}^T \mathbf{N} \boldsymbol{\alpha}}.$$

Then, differentiating and setting equal to zero gives

$$(\boldsymbol{\alpha}^T \mathbf{M} \boldsymbol{\alpha}) \, \mathbf{N} \boldsymbol{\alpha} = (\boldsymbol{\alpha}^T \mathbf{N} \boldsymbol{\alpha}) \, \mathbf{M} \boldsymbol{\alpha}.$$

Since only the direction of 𝐰, and hence the direction of 𝜢, matters, the above can be solved for 𝜢 as

𝜢=πβˆ’1(𝐌2βˆ’πŒ1).

Note that in practice, $\mathbf{N}$ is usually singular, and so a multiple of the identity is added to it:[1]

$$\mathbf{N}_\epsilon = \mathbf{N} + \epsilon \mathbf{I}.$$

Given the solution for 𝜢, the projection of a new data point is given by[1]

$$y(\mathbf{x}) = (\mathbf{w} \cdot \phi(\mathbf{x})) = \sum_{i=1}^{l} \alpha_i \, k(\mathbf{x}_i, \mathbf{x}).$$
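Putting the two-class pieces together, the procedure can be sketched as follows. The RBF kernel, the regularization constant, and the toy clusters are all illustrative assumptions; the sketch computes $\mathbf{M}_1$, $\mathbf{M}_2$, $\mathbf{N}_\epsilon$, solves $\boldsymbol{\alpha} = \mathbf{N}_\epsilon^{-1}(\mathbf{M}_2 - \mathbf{M}_1)$, and projects points via $y(\mathbf{x}) = \sum_i \alpha_i k(\mathbf{x}_i, \mathbf{x})$.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kfd_two_class(X1, X2, sigma=1.0, eps=1e-3):
    """Two-class KFD sketch: returns (alpha, X) where X stacks all
    training points and alpha expands w = sum_i alpha_i phi(x_i)."""
    X = np.vstack([X1, X2])
    l = len(X)
    M_list, N = [], np.zeros((l, l))
    for Xj in (X1, X2):
        lj = len(Xj)
        Kj = rbf_kernel(X, Xj, sigma)       # (K_j)[n, m] = k(x_n, x_m^j)
        M_list.append(Kj.mean(axis=1))      # (M_j)_n = (1/l_j) sum_k k(x_n, x_k^j)
        N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T
    N += eps * np.eye(l)                    # regularize: N_eps = N + eps I
    alpha = np.linalg.solve(N, M_list[1] - M_list[0])   # alpha = N^{-1}(M2 - M1)
    return alpha, X

def project(alpha, Xtrain, x, sigma=1.0):
    """y(x) = sum_i alpha_i k(x_i, x)."""
    return alpha @ rbf_kernel(Xtrain, x[None, :], sigma)[:, 0]

# Toy usage on two made-up clusters.
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 0.3, size=(15, 2))
X2 = rng.normal([2, 2], 0.3, size=(15, 2))
alpha, Xtr = kfd_two_class(X1, X2)
y1 = np.array([project(alpha, Xtr, x) for x in X1])
y2 = np.array([project(alpha, Xtr, x) for x in X2])
```

By construction the mean projection of class 2 exceeds that of class 1, since their difference equals $\boldsymbol{\alpha}^T(\mathbf{M}_2 - \mathbf{M}_1) = (\mathbf{M}_2 - \mathbf{M}_1)^T \mathbf{N}_\epsilon^{-1} (\mathbf{M}_2 - \mathbf{M}_1) > 0$.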

Multi-class KFD

The extension to cases where there are more than two classes is relatively straightforward.[2][6][7] Let c be the number of classes. Then multi-class KFD involves projecting the data into a (cβˆ’1)-dimensional space using (cβˆ’1) discriminant functions

$$y_i = \mathbf{w}_i^T \phi(\mathbf{x}), \qquad i = 1, \ldots, c-1.$$

This can be written in matrix notation as

$$\mathbf{y} = \mathbf{W}^T \phi(\mathbf{x}),$$

where the 𝐰i are the columns of 𝐖.[6] Further, the between-class covariance matrix is now

$$\mathbf{S}_B^\phi = \sum_{i=1}^{c} l_i (\mathbf{m}_i^\phi - \mathbf{m}^\phi)(\mathbf{m}_i^\phi - \mathbf{m}^\phi)^T,$$

where 𝐦ϕ is the mean of all the data in the new feature space. The within-class covariance matrix is

$$\mathbf{S}_W^\phi = \sum_{i=1}^{c} \sum_{n=1}^{l_i} (\phi(\mathbf{x}_n^i) - \mathbf{m}_i^\phi)(\phi(\mathbf{x}_n^i) - \mathbf{m}_i^\phi)^T.$$

The solution is now obtained by maximizing

$$J(\mathbf{W}) = \frac{\left| \mathbf{W}^T \mathbf{S}_B^\phi \mathbf{W} \right|}{\left| \mathbf{W}^T \mathbf{S}_W^\phi \mathbf{W} \right|}.$$

The kernel trick can again be used and the goal of multi-class KFD becomes[7]

π€βˆ—=argmax𝐀=|𝐀TπŒπ€||𝐀T𝐍𝐀|,

where $\mathbf{A} = [\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_{c-1}]$ and

$$\mathbf{M} = \sum_{j=1}^{c} l_j (\mathbf{M}_j - \mathbf{M}_*)(\mathbf{M}_j - \mathbf{M}_*)^T$$
$$\mathbf{N} = \sum_{j=1}^{c} \mathbf{K}_j (\mathbf{I} - \mathbf{1}_{l_j}) \mathbf{K}_j^T.$$

The 𝐌i are defined as in the above section and πŒβˆ— is defined as

(πŒβˆ—)j=1lβˆ‘k=1lk(𝐱j,𝐱k).

π€βˆ— can then be computed by finding the (cβˆ’1) leading eigenvectors of πβˆ’1𝐌.[7] Furthermore, the projection of a new input, 𝐱t, is given by[7]

𝐲(𝐱t)=(π€βˆ—)T𝐊t,

where the ith component of 𝐊t is given by k(𝐱i,𝐱t).
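The multi-class procedure can be sketched in the same way as the two-class case, again under the illustrative assumption of an RBF kernel and a small regularization constant added to $\mathbf{N}$: build $\mathbf{M}$ and $\mathbf{N}$ from the kernel matrices, take the $(c-1)$ leading eigenvectors of $\mathbf{N}^{-1}\mathbf{M}$ as the columns of $\mathbf{A}^*$, and project a new input via $(\mathbf{A}^*)^T \mathbf{K}_t$.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kfd_multiclass(classes, sigma=1.0, eps=1e-3):
    """Multi-class KFD sketch: returns (X, A) where A is l x (c-1),
    with columns alpha_1, ..., alpha_{c-1}."""
    X = np.vstack(classes)
    l, c = len(X), len(classes)
    M_star = rbf_kernel(X, X, sigma).mean(axis=1)   # (M*)_j = (1/l) sum_k k(x_j, x_k)
    M = np.zeros((l, l))
    N = np.zeros((l, l))
    for Xj in classes:
        lj = len(Xj)
        Kj = rbf_kernel(X, Xj, sigma)
        d = Kj.mean(axis=1) - M_star                # M_j - M*
        M += lj * np.outer(d, d)
        N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T
    N += eps * np.eye(l)                            # regularize as in the two-class case
    # A* = the (c-1) leading eigenvectors of N^{-1} M.
    vals, vecs = np.linalg.eig(np.linalg.solve(N, M))
    order = np.argsort(vals.real)[::-1]
    return X, vecs[:, order[: c - 1]].real

# Toy usage: three made-up clusters, so c - 1 = 2 discriminants.
rng = np.random.default_rng(0)
classes = [rng.normal(mu, 0.3, size=(10, 2)) for mu in ([0, 0], [2, 0], [0, 2])]
Xtr, A = kfd_multiclass(classes)

# Projection of a new input x_t: y(x_t) = A^T K_t, (K_t)_i = k(x_i, x_t).
x_t = np.array([2.0, 0.0])
y_t = A.T @ rbf_kernel(Xtr, x_t[None, :])[:, 0]
```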

Classification using KFD

In both two-class and multi-class KFD, the class label of a new input can be assigned as[7]

$$f(\mathbf{x}) = \arg\min_j D(\mathbf{y}(\mathbf{x}), \bar{\mathbf{y}}_j),$$

where 𝐲¯j is the projected mean for class j and D(β‹…,β‹…) is a distance function.
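A minimal sketch of this nearest-projected-mean rule, assuming Euclidean distance for $D$ (one possible choice) and made-up projected class means:

```python
import numpy as np

def classify(y_x, projected_means):
    """Assign the class whose projected mean is closest to y(x)
    in Euclidean distance (the assumed choice of D)."""
    dists = [np.linalg.norm(y_x - ybar) for ybar in projected_means]
    return int(np.argmin(dists))

# Hypothetical 1-D projected means for three classes.
means = [np.array([-1.0]), np.array([0.5]), np.array([3.0])]
label = classify(np.array([0.4]), means)   # closest to 0.5, i.e. class 1
```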

Applications

Kernel discriminant analysis has been used in a variety of applications.

References
