Hinge loss

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).^[1]

For an intended output $t = \pm 1$ and a classifier score $y$, the hinge loss of the prediction $y$ is defined as

$$ℓ (y) = \max (0, 1 - t \cdot y)$$

Note that $y$ should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, $y = 𝐰 \cdot 𝐱 + b$ , where $(𝐰, b)$ are the parameters of the hyperplane and $𝐱$ is the input variable(s).

When $t$ and $y$ have the same sign (meaning $y$ predicts the right class) and $| y | \geq 1$, the hinge loss $ℓ (y) = 0$. When they have opposite signs, $ℓ (y)$ increases linearly with $| y |$. The loss is also positive when $| y | < 1$, even if $y$ has the same sign as $t$ (a correct prediction, but not by enough margin).
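The three regimes described above can be checked numerically. The following is a minimal sketch in plain Python; the function name `hinge_loss` and the sample scores are illustrative choices, not part of any particular library.

```python
def hinge_loss(t: int, y: float) -> float:
    """Hinge loss max(0, 1 - t*y) for a label t in {-1, +1} and a raw score y."""
    return max(0.0, 1.0 - t * y)


# Correct side of the boundary with margin at least 1: no loss.
print(hinge_loss(+1, 2.5))   # 0.0
# Correct side, but inside the margin: small positive loss.
print(hinge_loss(+1, 0.25))  # 0.75
# Wrong side: the loss grows linearly as the score moves further away.
print(hinge_loss(+1, -1.0))  # 2.0
print(hinge_loss(+1, -2.0))  # 3.0
```

Note that the function takes the raw decision value, not a predicted label in {-1, +1}; passing thresholded labels would hide the margin information the loss depends on.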

Extensions

While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,^[2] it is also possible to extend the hinge loss itself for such an end. Several different variations of multiclass hinge loss have been proposed.^[3] For example, Crammer and Singer^[4] defined it for a linear classifier as^[5]

$$ℓ (y) = \max (0, 1 + \max_{y \neq t} 𝐰_{y} 𝐱 - 𝐰_{t} 𝐱),$$

where $t$ is the target label and $𝐰_{t}$ and $𝐰_{y}$ are the model parameters (weight vectors) for classes $t$ and $y$.
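As an illustration of the Crammer and Singer definition, the sketch below evaluates it for a small linear classifier in plain Python. The weight matrix `W`, the input `x`, and the helper names are hypothetical values chosen for the example; `W[k]` plays the role of $𝐰_{k}$.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def crammer_singer_loss(W: Sequence[Sequence[float]], x: Sequence[float], t: int) -> float:
    """max(0, 1 + max_{y != t} w_y.x - w_t.x), where W[k] is the weight vector of class k."""
    scores = [dot(w, x) for w in W]
    # Highest score among the wrong classes only.
    best_wrong = max(s for k, s in enumerate(scores) if k != t)
    return max(0.0, 1.0 + best_wrong - scores[t])


W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical weights for 3 classes
x = [2.0, 0.5]                               # class scores are [2.0, 0.5, -2.5]

print(crammer_singer_loss(W, x, t=0))  # 0.0: class 0 wins by a margin of 1.5
print(crammer_singer_loss(W, x, t=1))  # 2.5: class 0 outscores the target by 1.5
```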

Weston and Watkins provided a similar definition, but with a sum rather than a max:^[6]^[3]

$$ℓ (y) = \sum_{y \neq t} \max (0, 1 + 𝐰_{y} 𝐱 - 𝐰_{t} 𝐱).$$
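The difference between the sum and the max shows up whenever more than one wrong class violates the margin. The following plain-Python sketch evaluates the Weston and Watkins form and, for comparison, the max over the same per-class terms; the weights, input, and function names are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def margin_violations(W: Sequence[Sequence[float]], x: Sequence[float], t: int) -> List[float]:
    """Per-class terms max(0, 1 + w_y.x - w_t.x) for every wrong class y != t."""
    scores = [dot(w, x) for w in W]
    return [max(0.0, 1.0 + s - scores[t]) for k, s in enumerate(scores) if k != t]


def weston_watkins_loss(W: Sequence[Sequence[float]], x: Sequence[float], t: int) -> float:
    """Sum of the per-class margin violations."""
    return sum(margin_violations(W, x, t))


W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical weights for 3 classes
x = [2.0, 0.5]                               # class scores are [2.0, 0.5, -2.5]

# With target class 2, both wrong classes violate the margin (5.5 and 4.0).
print(weston_watkins_loss(W, x, t=2))     # 9.5 (sum of violations)
print(max(margin_violations(W, x, t=2)))  # 5.5 (max of violations, for comparison)
```

When only one wrong class violates the margin, the two quantities coincide.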

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where $𝐰$ denotes the SVM's parameters, $𝐲$ the SVM's predictions, $ϕ$ the joint feature function, and $Δ$ the Hamming loss:

$$\begin{aligned} ℓ (𝐲) & = \max (0, Δ (𝐲, 𝐭) + ⟨ 𝐰, ϕ (𝐱, 𝐲) ⟩ - ⟨ 𝐰, ϕ (𝐱, 𝐭) ⟩) \\ & = \max (0, \max_{𝐲 \in 𝒴} (Δ (𝐲, 𝐭) + ⟨ 𝐰, ϕ (𝐱, 𝐲) ⟩) - ⟨ 𝐰, ϕ (𝐱, 𝐭) ⟩). \end{aligned}$$
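For a very small output space, the margin-rescaled loss can be evaluated by brute-force enumeration. The sketch below does this in plain Python for binary label sequences. Everything specific in it is an assumption made for illustration: the joint feature map `phi` (input mass on positions labelled 1, plus a count of adjacent equal labels), the unnormalized Hamming loss, and the weights and inputs. Real structured SVMs replace the enumeration with problem-specific loss-augmented inference.

```python
from itertools import product
from typing import List, Sequence, Tuple

Labels = Tuple[int, ...]


def hamming(y: Labels, t: Labels) -> int:
    """Unnormalized Hamming loss: number of positions where y and t differ."""
    return sum(a != b for a, b in zip(y, t))


def phi(x: Sequence[float], y: Labels) -> List[float]:
    """Hypothetical joint feature map for a binary label sequence."""
    emission = sum(xi for xi, yi in zip(x, y) if yi == 1)  # input mass on label-1 positions
    agreement = sum(a == b for a, b in zip(y, y[1:]))      # adjacent equal labels
    return [float(emission), float(agreement)]


def structured_hinge(w: Sequence[float], x: Sequence[float], t: Labels) -> float:
    """max(0, max_y [Delta(y, t) + <w, phi(x, y)>] - <w, phi(x, t)>) by enumeration."""
    def score(y: Labels) -> float:
        return sum(wi * fi for wi, fi in zip(w, phi(x, y)))

    candidates = product((0, 1), repeat=len(t))  # the whole output space
    augmented = max(hamming(y, t) + score(y) for y in candidates)
    return max(0.0, augmented - score(t))


w = [1.0, 0.5]        # hypothetical parameters
x = [2.0, -1.0, 0.5]  # hypothetical input sequence
t = (1, 0, 1)         # target labelling

print(structured_hinge(w, x, t))  # 1.0
```

In this example the target scores 2.5, while several labellings that differ from it reach a loss-augmented score of 3.5, so the loss is 1.0.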

Optimization

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to model parameters $𝐰$ of a linear SVM with score function $y = 𝐰 \cdot 𝐱$ that is given by

$$\frac{\partial ℓ}{\partial w_{i}} = \begin{cases} - t \cdot x_{i} & \text{if } t \cdot y < 1, \\ 0 & \text{otherwise}. \end{cases}$$
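The subgradient translates directly into a subgradient-descent update. The following plain-Python sketch applies it to a single example with no bias or regularization term; the function names, the learning rate `lr`, and the data are illustrative assumptions.

```python
from typing import List, Sequence


def hinge_subgradient(w: Sequence[float], x: Sequence[float], t: int) -> List[float]:
    """Subgradient of max(0, 1 - t * (w . x)) with respect to w."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    if t * y < 1:
        return [-t * xi for xi in x]
    return [0.0 for _ in x]


def subgradient_step(w: Sequence[float], x: Sequence[float], t: int, lr: float = 0.1) -> List[float]:
    """One subgradient-descent update on a single example (lr is a hypothetical step size)."""
    g = hinge_subgradient(w, x, t)
    return [wi - lr * gi for wi, gi in zip(w, g)]


w = [0.0, 0.0]
x = [1.0, 2.0]

print(hinge_subgradient(w, x, t=+1))  # [-1.0, -2.0]: the margin is violated
w = subgradient_step(w, x, t=+1, lr=0.5)
print(w)                              # [0.5, 1.0]
print(hinge_subgradient(w, x, t=+1))  # [0.0, 0.0]: the score is now 2.5 >= 1
```

Once the example is classified with sufficient margin, the subgradient is zero and further updates on it leave the weights unchanged.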

Figure: Plot of three variants of the hinge loss as a function of $t y$: the "ordinary" variant (blue), its square (green), and the piecewise smooth version by Rennie and Srebro (red). The vertical axis is the hinge loss $ℓ (y)$, and the horizontal axis is $t y$.

However, since the derivative of the hinge loss at $t y = 1$ is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's^[7]

$$ℓ (y) = \begin{cases} \frac{1}{2} - t y & \text{if } t y \leq 0, \\ \frac{1}{2} (1 - t y)^{2} & \text{if } 0 < t y < 1, \\ 0 & \text{if } 1 \leq t y \end{cases}$$

or the quadratically smoothed

$$ℓ_{γ} (y) = \begin{cases} \frac{1}{2 γ} \max (0, 1 - t y)^{2} & \text{if } t y \geq 1 - γ, \\ 1 - \frac{γ}{2} - t y & \text{otherwise} \end{cases}$$

suggested by Zhang.^[8] The modified Huber loss $L$ is a special case of this loss function with $γ = 2$ , specifically $L (t, y) = 4 ℓ_{2} (y)$ .
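Both smoothed variants, and the stated relation to the modified Huber loss, can be checked numerically. The sketch below is plain Python with illustrative function names. The piecewise definition of the modified Huber loss used here, $\max (0, 1 - t y)^{2}$ for $t y \geq -1$ and $- 4 t y$ otherwise, is the commonly used form and is assumed rather than taken from the text above.

```python
import math


def smooth_hinge(t: int, y: float) -> float:
    """Rennie and Srebro's piecewise smooth hinge loss."""
    z = t * y
    if z <= 0:
        return 0.5 - z
    if z < 1:
        return 0.5 * (1 - z) ** 2
    return 0.0


def quadratically_smoothed_hinge(t: int, y: float, gamma: float) -> float:
    """Zhang's quadratically smoothed hinge loss with smoothing parameter gamma > 0."""
    z = t * y
    if z >= 1 - gamma:
        return max(0.0, 1 - z) ** 2 / (2 * gamma)
    return 1 - gamma / 2 - z


def modified_huber(t: int, y: float) -> float:
    """Modified Huber loss in its commonly used piecewise form (assumed here)."""
    z = t * y
    if z >= -1:
        return max(0.0, 1 - z) ** 2
    return -4 * z


# Check L(t, y) = 4 * l_2(y) on a grid of scores, for both labels.
for t in (-1, +1):
    for y in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
        assert math.isclose(modified_huber(t, y), 4 * quadratically_smoothed_hinge(t, y, gamma=2.0))

print(smooth_hinge(+1, 0.5))                       # 0.125
print(quadratically_smoothed_hinge(+1, 0.5, 2.0))  # 0.0625
```

Unlike the ordinary hinge loss, both functions have a continuous first derivative, including at $t y = 1$.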

References

[1] Template:Cite journal

[2] Template:Cite book

[3] Template:Cite journal

[4] Template:Cite journal

[5] Template:Cite conference

[6] Template:Cite conference

[7] Template:Cite conference

[8] Template:Cite conference
