Polynomial kernel


In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models. It represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, which allows non-linear models to be learned.

Intuitively, the polynomial kernel determines the similarity of input samples by looking not only at their given features but also at combinations of these. In the context of regression analysis, such combinations are known as interaction features. The (implicit) feature space of a polynomial kernel is equivalent to that of polynomial regression, but without the combinatorial blowup in the number of parameters to be learned. When the input features are binary-valued (booleans), the features correspond to logical conjunctions of input features.^[1]

Definition

For degree-d polynomials, the polynomial kernel is defined as^[2]

K (𝐱, 𝐲) = (𝐱^{𝖳} 𝐲 + c)^{d}

where 𝐱 and 𝐲 are vectors of size n in the input space, i.e. vectors of features computed from training or test samples, and c ≥ 0 is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. When c = 0, the kernel is called homogeneous.^[3] (A further generalized polykernel divides 𝐱^{𝖳} 𝐲 by a user-specified scalar parameter.)
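As a minimal sketch of the definition above, the kernel can be evaluated directly in plain Python. The function name `polynomial_kernel` and the sample vectors are illustrative choices, not taken from any particular library.

```python
def polynomial_kernel(x, y, c=1.0, d=2):
    """Evaluate K(x, y) = (x^T y + c)^d for two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if c < 0:
        raise ValueError("c must be non-negative")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (dot + c) ** d


# Illustrative inputs: x^T y = 1*3 + 2*0.5 = 4
x = [1.0, 2.0]
y = [3.0, 0.5]
k_inhomogeneous = polynomial_kernel(x, y, c=1.0, d=2)  # (4 + 1)^2 = 25.0
k_homogeneous = polynomial_kernel(x, y, c=0.0, d=2)    # 4^2 = 16.0
```

Setting `c=0.0` gives the homogeneous kernel, which contains only degree-d terms.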

As a kernel, K corresponds to an inner product in a feature space based on some mapping φ:

K (𝐱, 𝐲) = ⟨ φ (𝐱), φ (𝐲) ⟩

The nature of φ can be seen from an example. Let d = 2, so we get the special case of the quadratic kernel. After using the multinomial theorem (twice; the outermost application is the binomial theorem) and regrouping,

K (𝐱, 𝐲) = {(\sum_{i = 1}^{n} x_{i} y_{i} + c)}^{2} = \sum_{i = 1}^{n} (x_{i}^{2}) (y_{i}^{2}) + \sum_{i = 2}^{n} \sum_{j = 1}^{i - 1} (\sqrt{2} x_{i} x_{j}) (\sqrt{2} y_{i} y_{j}) + \sum_{i = 1}^{n} (\sqrt{2 c} x_{i}) (\sqrt{2 c} y_{i}) + c^{2}

From this it follows that the feature map is given by:

φ (𝐱) = (x_{n}^{2}, \dots, x_{1}^{2}, \sqrt{2} x_{n} x_{n - 1}, \dots, \sqrt{2} x_{n} x_{1}, \sqrt{2} x_{n - 1} x_{n - 2}, \dots, \sqrt{2} x_{n - 1} x_{1}, \dots, \sqrt{2} x_{2} x_{1}, \sqrt{2 c} x_{n}, \dots, \sqrt{2 c} x_{1}, c)
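The feature map above can be checked numerically. The following sketch builds the explicit quadratic features and compares their inner product with the kernel value. The helper names and the sample vectors are assumptions made for illustration, and the features are listed in a different order than in the formula, which does not affect the inner product.

```python
import math


def quadratic_feature_map(x, c):
    """Explicit feature map for K(x, y) = (x^T y + c)^2, assuming c >= 0."""
    n = len(x)
    squares = [xi * xi for xi in x]
    # One cross term per unordered pair (i, j) with j < i, scaled by sqrt(2).
    cross = [math.sqrt(2) * x[i] * x[j] for i in range(n) for j in range(i)]
    linear = [math.sqrt(2 * c) * xi for xi in x]
    return squares + cross + linear + [c]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Illustrative inputs: x^T y = 0.5 + 6 - 2 = 4.5
x = [1.0, 2.0, -1.0]
y = [0.5, 3.0, 2.0]
c = 1.5

kernel_value = (dot(x, y) + c) ** 2
feature_value = dot(quadratic_feature_map(x, c), quadratic_feature_map(y, c))
```

For n = 3 the map has 3 squared terms, 3 cross terms, 3 linear terms and one constant, so 10 features in total, while the kernel itself needs only one inner product in the 3-dimensional input space.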

Generalizing to ${(𝐱^{T} 𝐲 + c)}^{d}$, where $𝐱 \in ℝ^{n}$ and $𝐲 \in ℝ^{n}$, and applying the multinomial theorem gives:

$\begin{aligned} {(𝐱^{T} 𝐲 + c)}^{d} &= \sum_{j_{1} + j_{2} + \dots + j_{n + 1} = d} \left( \frac{\sqrt{d!}}{\sqrt{j_{1}! \dots j_{n}! j_{n + 1}!}} x_{1}^{j_{1}} \dots x_{n}^{j_{n}} {\sqrt{c}}^{j_{n + 1}} \right) \left( \frac{\sqrt{d!}}{\sqrt{j_{1}! \dots j_{n}! j_{n + 1}!}} y_{1}^{j_{1}} \dots y_{n}^{j_{n}} {\sqrt{c}}^{j_{n + 1}} \right) \\ &= φ (𝐱)^{T} φ (𝐲) \end{aligned}$

The last summation has $l_{d} = \binom{n + d}{d}$ elements, so that:

φ (𝐱) = (a_{1}, \dots, a_{l}, \dots, a_{l_{d}})

where each index $l$ stands for a multi-index $(j_{1}, j_{2}, \dots, j_{n}, j_{n + 1})$ and

a_{l} = \frac{\sqrt{d!}}{\sqrt{j_{1}! \dots j_{n}! j_{n + 1}!}} x_{1}^{j_{1}} \dots x_{n}^{j_{n}} {\sqrt{c}}^{j_{n + 1}} | j_{1} + j_{2} + \dots + j_{n} + j_{n + 1} = d
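The general feature map can be sketched in the same way. The code below enumerates all multi-indices by brute force, which is only practical for small n and d. The function names are hypothetical and the inputs are arbitrary sample values.

```python
import math
from itertools import product


def multi_indices(n, d):
    """Yield all tuples (j_1, ..., j_{n+1}) of non-negative ints summing to d."""
    for head in product(range(d + 1), repeat=n):
        if sum(head) <= d:
            yield head + (d - sum(head),)


def feature_map(x, c, d):
    """Explicit feature map for (x^T y + c)^d, assuming c >= 0."""
    n = len(x)
    features = []
    for j in multi_indices(n, d):
        denom = math.prod(math.factorial(jk) for jk in j)
        coeff = math.sqrt(math.factorial(d) / denom)
        monomial = math.prod(xi ** jk for xi, jk in zip(x, j[:n]))
        features.append(coeff * monomial * math.sqrt(c) ** j[n])
    return features


# Illustrative inputs: x^T y + c = 4.5 + 1.5 = 6
x = [1.0, 2.0, -1.0]
y = [0.5, 3.0, 2.0]
c, d = 1.5, 3

phi_x = feature_map(x, c, d)
phi_y = feature_map(y, c, d)
kernel_value = (sum(a * b for a, b in zip(x, y)) + c) ** d
feature_value = sum(a * b for a, b in zip(phi_x, phi_y))
```

With n = 3 and d = 3 the map has `math.comb(6, 3)` = 20 components, and the inner product of `phi_x` and `phi_y` agrees with the kernel value up to floating-point rounding.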

Practical use

Although the RBF kernel is more popular in SVM classification than the polynomial kernel, the latter is quite popular in natural language processing (NLP).^[4] The most common degree is d = 2 (quadratic), since larger degrees tend to overfit on NLP problems.

Various ways of computing the polynomial kernel (both exact and approximate) have been devised as alternatives to the usual non-linear SVM training algorithms, including:

full expansion of the kernel prior to training/testing with a linear SVM, i.e. full computation of the mapping φ as in polynomial regression;
basket mining (using a variant of the apriori algorithm) for the most commonly occurring feature conjunctions in a training set to produce an approximate expansion;^[5]
inverted indexing of support vectors.
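The full-expansion approach rests on the fact that a kernel decision function, a weighted sum of kernel values over the support vectors, equals a linear function of the explicit features. The following sketch checks this for the quadratic kernel. The support vectors and coefficients are made-up numbers for illustration, not the output of an SVM solver, and the bias term is omitted.

```python
import math


def phi(x, c):
    """Explicit quadratic feature map for (x^T y + c)^2, assuming c >= 0."""
    n = len(x)
    squares = [xi * xi for xi in x]
    cross = [math.sqrt(2) * x[i] * x[j] for i in range(n) for j in range(i)]
    linear = [math.sqrt(2 * c) * xi for xi in x]
    return squares + cross + linear + [c]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


c = 1.0
# Hypothetical support vectors and signed dual coefficients (alpha_i * label_i).
support_vectors = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
alphas = [0.5, -0.25, 1.0]

# Collapse the support vectors into a single weight vector in feature space.
w = [0.0] * len(phi(support_vectors[0], c))
for alpha, s in zip(alphas, support_vectors):
    for k, value in enumerate(phi(s, c)):
        w[k] += alpha * value

x_test = [2.0, -1.0]
# Kernel form: one kernel evaluation per support vector.
kernel_score = sum(
    a * (dot(s, x_test) + c) ** 2 for a, s in zip(alphas, support_vectors)
)
# Expanded form: a single inner product with the precomputed weights.
linear_score = dot(w, phi(x_test, c))
```

Both scores agree up to rounding. The expanded form trades a larger feature vector for a prediction cost that no longer depends on the number of support vectors.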

One problem with the polynomial kernel is that it may suffer from numerical instability: when 𝐱^{𝖳} 𝐲 + c < 1, K (𝐱, 𝐲) = (𝐱^{𝖳} 𝐲 + c)^{d} tends to zero with increasing d, whereas when 𝐱^{𝖳} 𝐲 + c > 1, K (𝐱, 𝐲) tends to infinity.^[6]
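The effect can be illustrated by evaluating the kernel for increasing degrees, assuming the relevant quantity is the base 𝐱^{𝖳} 𝐲 + c that is raised to the power d. The vectors and the value of `c` below are arbitrary, chosen so that the base is 0.45 in one case and 2.6 in the other.

```python
def polynomial_kernel(x, y, c, d):
    """Evaluate K(x, y) = (x^T y + c)^d."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (dot + c) ** d


x = [0.3, 0.4]
small = [0.5, 0.5]  # x^T y + c = 0.35 + 0.1 = 0.45, below 1
large = [3.0, 4.0]  # x^T y + c = 2.5 + 0.1 = 2.6, above 1
c = 0.1

degrees = [1, 2, 5, 10, 20]
shrinking = [polynomial_kernel(x, small, c, d) for d in degrees]
growing = [polynomial_kernel(x, large, c, d) for d in degrees]

for d, lo, hi in zip(degrees, shrinking, growing):
    print(f"d={d:2d}  base<1: {lo:.3e}  base>1: {hi:.3e}")
```

At d = 20 the two kernel values already differ by roughly fifteen orders of magnitude, which is why inputs are often rescaled before a high-degree polynomial kernel is applied.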

References

[1] Yoav Goldberg and Michael Elhadad (2008). splitSVM: Fast, Space-Efficient, non-Heuristic, Polynomial Kernel Computation for NLP Applications. Proc. ACL-08: HLT.
[2] Template:Cite web
[3] Template:Cite arXiv
[4] Template:Cite journal
[5] Template:Cite conference
[6] Template:Cite conference
