Kernel perceptron

In machine learning, the kernel perceptron is a variant of the popular perceptron learning algorithm that can learn kernel machines, i.e. non-linear classifiers that employ a kernel function to compute the similarity of unseen samples to training samples. The algorithm was invented in 1964,[1] making it the first kernel classification learner.[2]

Preliminaries

The perceptron algorithm

Main article: Perceptron

The perceptron algorithm is an online learning algorithm that operates by a principle called "error-driven learning". It iteratively improves a model by running it on training samples, then updating the model whenever it finds it has made an incorrect classification with respect to a supervised signal. The model learned by the standard perceptron algorithm is a linear binary classifier: a vector of weights w (and optionally an intercept term b, omitted here for simplicity) that is used to classify a sample vector x as class "one" or class "minus one" according to

\hat{y} = \operatorname{sgn}(\mathbf{w}^\mathsf{T} \mathbf{x})

where a zero is arbitrarily mapped to one or minus one. (The "hat" on ŷ denotes an estimated value.)

In pseudocode, the perceptron algorithm is given by:

Initialize w to an all-zero vector of length p, the number of predictors (features).
For some fixed number of iterations, or until some stopping criterion is met:
    For each training example x_i with ground truth label y_i ∈ {−1, 1}:
        Let ŷ = sgn(w^T x_i).
        If ŷ ≠ y_i, update w ← w + y_i x_i.
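
As an illustration, the loop above might be written as follows in Python with NumPy; the function and parameter names (perceptron_train, n_epochs) are chosen for this sketch and are not part of the algorithm's standard description.

import numpy as np

def perceptron_train(X, y, n_epochs=10):
    """Primal perceptron: X is an (n, p) array of samples, y holds labels in {-1, +1}."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)                        # all-zero weight vector of length p
    for _ in range(n_epochs):                       # fixed number of passes over the data
        for x_i, y_i in zip(X, y):
            y_hat = 1.0 if w @ x_i >= 0 else -1.0   # sgn, with zero mapped to +1
            if y_hat != y_i:                        # error-driven update
                w = w + y_i * x_i
    return w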

Kernel methods

Main article: Kernel method

By contrast with the linear models learned by the perceptron, a kernel method[3] is a classifier that stores a subset of its training examples x_i, associates with each a weight α_i, and makes decisions for new samples x′ by evaluating

\operatorname{sgn} \sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}').

Here, K is some kernel function. Formally, a kernel function is a non-negative semidefinite kernel (see Mercer's condition), representing an inner product between samples in a high-dimensional space, as if the samples had been expanded to include additional features by a function Φ: K(x, x′) = Φ(x) · Φ(x′). Intuitively, it can be thought of as a similarity function between samples, so the kernel machine establishes the class of a new sample by a weighted comparison to the training set. Each function x ↦ K(x_i, x) serves as a basis function in the classification.
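
As an illustration of this decision rule, the following sketch assumes a Gaussian (RBF) kernel; the kernel choice, the parameter gamma, and the helper names (rbf_kernel, kernel_decision) are illustrative rather than prescribed by the method.

import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2): an inner product in an implicit feature space."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_decision(x_new, X_train, y_train, alpha, kernel=rbf_kernel):
    """sgn( sum_i alpha_i y_i K(x_i, x_new) ): a weighted comparison to the stored samples."""
    score = sum(a * y_i * kernel(x_i, x_new)
                for a, y_i, x_i in zip(alpha, y_train, X_train))
    return 1.0 if score >= 0 else -1.0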

Algorithm

To derive a kernelized version of the perceptron algorithm, we must first formulate it in dual form, starting from the observation that the weight vector w can be expressed as a linear combination of the n training samples. The equation for the weight vector is

\mathbf{w} = \sum_i^n \alpha_i y_i \mathbf{x}_i

where α_i is the number of times x_i was misclassified, forcing an update w ← w + y_i x_i. Using this result, we can formulate the dual perceptron algorithm, which loops through the samples as before, making predictions, but instead of storing and updating a weight vector w, it updates a "mistake counter" vector α. We must also rewrite the prediction formula to get rid of w:

\hat{y} = \operatorname{sgn}(\mathbf{w}^\mathsf{T} \mathbf{x}) = \operatorname{sgn}\left(\sum_i^n \alpha_i y_i \mathbf{x}_i\right)^{\mathsf{T}} \mathbf{x} = \operatorname{sgn} \sum_i^n \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{x})

Plugging these two equations into the training loop turns it into the dual perceptron algorithm.
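
For concreteness, the dual prediction formula might be evaluated as in the sketch below (Python/NumPy, with illustrative names); the dot product x_i · x here is exactly what the kernel function replaces in the next step.

import numpy as np

def dual_predict(x, X_train, y_train, alpha):
    """sgn( sum_i alpha_i y_i (x_i . x) ), with sgn(0) mapped to +1 as above."""
    score = np.sum(alpha * y_train * (X_train @ x))
    return 1.0 if score >= 0 else -1.0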

Finally, we can replace the dot product in the dual perceptron by an arbitrary kernel function, to get the effect of a feature map Φ without computing Φ(x) explicitly for any samples. Doing this yields the kernel perceptron algorithm:[4]

Initialize α to an all-zeros vector of length n, the number of training samples.
For some fixed number of iterations, or until some stopping criterion is met:
    For each training example x_j with ground truth label y_j:
        Let ŷ = sgn Σ_i α_i y_i K(x_i, x_j).
        If ŷ ≠ y_j, perform an update by incrementing the mistake counter:
            α_j ← α_j + 1
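
Combining the pieces, the training loop above might be implemented as follows (Python/NumPy); the Gram-matrix precomputation and the names kernel_perceptron_train and n_epochs are choices made for this sketch, and any valid kernel, such as the rbf_kernel sketched earlier, can be passed in. With a plain dot-product kernel the loop reduces to the dual perceptron.

import numpy as np

def kernel_perceptron_train(X, y, kernel, n_epochs=10):
    """Learn the mistake-counter vector alpha for samples X and labels y in {-1, +1}."""
    n_samples = X.shape[0]
    alpha = np.zeros(n_samples)                     # one mistake counter per training sample
    K = np.array([[kernel(x_i, x_j) for x_j in X]   # precompute the Gram matrix K[i, j]
                  for x_i in X])
    for _ in range(n_epochs):
        for j in range(n_samples):
            score = np.sum(alpha * y * K[:, j])     # sum_i alpha_i y_i K(x_i, x_j)
            y_hat = 1.0 if score >= 0 else -1.0     # sgn, with zero mapped to +1
            if y_hat != y[j]:
                alpha[j] += 1                       # increment the mistake counter
    return alpha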

Variants and extensions

One problem with the kernel perceptron, as presented above, is that it does not learn sparse kernel machines. Initially, all the α_i are zero, so that evaluating the decision function to get ŷ requires no kernel evaluations at all, but each update increments a single α_i, making the evaluation increasingly more costly. Moreover, when the kernel perceptron is used in an online setting, the number of non-zero α_i, and thus the evaluation cost, grows linearly in the number of examples presented to the algorithm.

The forgetron variant of the kernel perceptron was suggested to deal with this problem. It maintains an active set of examples with non-zero α_i, removing ("forgetting") examples from the active set when it exceeds a pre-determined budget, and "shrinking" (lowering the weight of) old examples as new ones are promoted to non-zero α_i.[5]

Another problem with the kernel perceptron is that it does not regularize, making it vulnerable to overfitting. The NORMA online kernel learning algorithm can be regarded as a generalization of the kernel perceptron algorithm with regularization.[6] The sequential minimal optimization (SMO) algorithm used to learn support vector machines can also be regarded as a generalization of the kernel perceptron.[6]

The voted perceptron algorithm of Freund and Schapire also extends to the kernelized case,[7] giving generalization bounds comparable to the kernel SVM.[2]

References


  1. Template:Cite journal Cited in Template:Cite conference
  2. Template:Cite journal
  3. Schölkopf, Bernhard; Smola, Alexander J. (2002). Learning with Kernels. Cambridge, MA: MIT Press.
  4. Template:Cite book
  5. Template:Cite journal
  6. Template:Cite journal
  7. Template:Cite journal