Hinge loss


The vertical axis represents the value of the hinge loss (in blue) and zero–one loss (in green) for fixed $t = 1$, while the horizontal axis represents the value of the prediction $y$. The plot shows that the hinge loss penalizes predictions $y < 1$, corresponding to the notion of a margin in a support vector machine.

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).[1]

For an intended output $t = \pm 1$ and a classifier score $y$, the hinge loss of the prediction $y$ is defined as

$$\ell(y) = \max(0, 1 - ty)$$

Note that $y$ should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, $y = \mathbf{w} \cdot \mathbf{x} + b$, where $(\mathbf{w}, b)$ are the parameters of the hyperplane and $\mathbf{x}$ is the input variable(s).

When $t$ and $y$ have the same sign (meaning $y$ predicts the right class) and $|y| \ge 1$, the hinge loss $\ell(y) = 0$. When they have opposite signs, $\ell(y)$ increases linearly with $y$, and it likewise remains positive if $|y| < 1$ even when the signs agree (a correct prediction, but not by enough of a margin).
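
To make the definition concrete, the following sketch evaluates the hinge loss on the raw score of a linear decision function in NumPy; the particular values of w, b, x and t below are illustrative only.

<syntaxhighlight lang="python">
import numpy as np

def hinge_loss(y, t):
    """Binary hinge loss for a raw classifier score y and a label t in {-1, +1}."""
    return np.maximum(0.0, 1.0 - t * y)

# Illustrative linear decision function y = w . x + b (values chosen arbitrarily).
w, b = np.array([2.0, -1.0]), 0.5
x, t = np.array([1.0, 0.5]), 1
y = w @ x + b                      # the raw score, not the predicted class label
print(hinge_loss(y, t))            # 0.0: t*y = 2.0 lies outside the margin
</syntaxhighlight>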

Extensions

While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] it is also possible to extend the hinge loss itself for such an end. Several different variations of multiclass hinge loss have been proposed.[3] For example, Crammer and Singer[4] defined it for a linear classifier as[5]

$$\ell(y) = \max\left(0, 1 + \max_{y \ne t} \mathbf{w}_y \cdot \mathbf{x} - \mathbf{w}_t \cdot \mathbf{x}\right),$$

where $t$ is the target label, and $\mathbf{w}_t$ and $\mathbf{w}_y$ are the model parameters.
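
As an illustration, a minimal NumPy sketch of the Crammer–Singer loss for a linear classifier might look as follows; the weight-matrix layout (one row $\mathbf{w}_y$ per class) and the function name are assumptions made for the example.

<syntaxhighlight lang="python">
import numpy as np

def crammer_singer_hinge(W, x, t):
    """Crammer-Singer multiclass hinge loss for a linear classifier.

    W: (n_classes, n_features) weight matrix with one row w_y per class.
    x: (n_features,) input vector.
    t: integer index of the target label.
    """
    scores = W @ x                              # w_y . x for every class y
    competing = np.max(np.delete(scores, t))    # best score among classes y != t
    return max(0.0, 1.0 + competing - scores[t])
</syntaxhighlight>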

Weston and Watkins provided a similar definition, but with a sum rather than a max:[6][3]

$$\ell(y) = \sum_{y \ne t} \max\left(0, 1 + \mathbf{w}_y \cdot \mathbf{x} - \mathbf{w}_t \cdot \mathbf{x}\right).$$
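
Under the same assumed conventions as in the previous sketch, the Weston–Watkins variant simply replaces the maximum over competing classes with a sum:

<syntaxhighlight lang="python">
import numpy as np

def weston_watkins_hinge(W, x, t):
    """Weston-Watkins multiclass hinge loss: a sum over y != t instead of a max."""
    scores = W @ x
    margins = np.maximum(0.0, 1.0 + scores - scores[t])  # per-class margin violations
    margins[t] = 0.0                                      # exclude the target class itself
    return margins.sum()
</syntaxhighlight>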

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where $\mathbf{w}$ denotes the SVM's parameters, $\mathbf{y}$ the SVM's predictions, $\phi$ the joint feature function, and $\Delta$ the Hamming loss:

$$\ell(\mathbf{y}) = \max\big(0, \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle\big) = \max\Big(0, \max_{y \in \mathcal{Y}} \big(\Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle\big) - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle\Big).$$
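
The sketch below evaluates this margin-rescaled loss for an output space that is small enough to enumerate explicitly; the callables phi and delta and the candidate set Y_all are hypothetical placeholders for the joint feature map and the Hamming loss, not part of any particular library.

<syntaxhighlight lang="python">
import numpy as np

def structured_hinge(w, x, t, Y_all, phi, delta):
    """Margin-rescaled structured hinge loss.

    w     : parameter vector of the structured SVM
    x     : input
    t     : true structured output
    Y_all : iterable of candidate outputs (assumed small enough to enumerate)
    phi   : joint feature map, phi(x, y) -> feature vector
    delta : structured loss, e.g. a Hamming loss delta(y, t)
    """
    score_t = w @ phi(x, t)
    # Loss-augmented inference: the inner maximization over the output space.
    augmented = max(delta(y, t) + w @ phi(x, y) for y in Y_all)
    return max(0.0, augmented - score_t)
</syntaxhighlight>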

Optimization

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but it has a subgradient with respect to the model parameters $\mathbf{w}$ of a linear SVM with score function $y = \mathbf{w} \cdot \mathbf{x}$, given by

$$\frac{\partial \ell}{\partial w_i} = \begin{cases} -t x_i & \text{if } ty < 1, \\ 0 & \text{otherwise.} \end{cases}$$
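
For illustration, one unregularized subgradient-descent step based on this expression could be written as follows; the learning-rate parameter eta and the omission of a regularization term are simplifying assumptions.

<syntaxhighlight lang="python">
import numpy as np

def hinge_subgradient_step(w, x, t, eta=0.1):
    """One unregularized subgradient-descent step on the hinge loss for y = w . x."""
    if t * (w @ x) < 1:        # inside the margin or misclassified: subgradient is -t * x
        w = w + eta * t * x
    # otherwise the subgradient is zero and w is left unchanged
    return w
</syntaxhighlight>
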
Plot of three variants of the hinge loss as a function of $z = ty$: the "ordinary" variant (blue), its square (green), and the piece-wise smooth version by Rennie and Srebro (red). The vertical axis is the hinge loss $\ell(y)$, and the horizontal axis is $z = ty$.

However, since the derivative of the hinge loss at $ty = 1$ is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[7]

$$\ell(y) = \begin{cases} \tfrac{1}{2} - ty & \text{if } ty \le 0, \\ \tfrac{1}{2}(1 - ty)^2 & \text{if } 0 < ty < 1, \\ 0 & \text{if } 1 \le ty \end{cases}$$
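
A direct transcription of this piecewise definition, written in terms of $z = ty$, might read:

<syntaxhighlight lang="python">
import numpy as np

def smooth_hinge(y, t):
    """Piecewise-smooth hinge loss of Rennie and Srebro, written in terms of z = t * y."""
    z = t * y
    return np.where(z <= 0, 0.5 - z,
                    np.where(z < 1, 0.5 * (1.0 - z) ** 2, 0.0))
</syntaxhighlight>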

or the quadratically smoothed

$$\ell_\gamma(y) = \begin{cases} \frac{1}{2\gamma} \max(0, 1 - ty)^2 & \text{if } ty \ge 1 - \gamma, \\ 1 - \frac{\gamma}{2} - ty & \text{otherwise} \end{cases}$$

suggested by Zhang.[8] The modified Huber loss $L$ is a special case of this loss function with $\gamma = 2$, specifically $L(t, y) = 4\ell_2(y)$.
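
As a final sketch, Zhang's quadratically smoothed loss can be transcribed directly from the case analysis above; the default value gamma = 2.0 is chosen only to illustrate the connection to the modified Huber loss.

<syntaxhighlight lang="python">
def quad_smoothed_hinge(y, t, gamma=2.0):
    """Quadratically smoothed hinge loss of Zhang."""
    z = t * y
    if z >= 1.0 - gamma:
        return max(0.0, 1.0 - z) ** 2 / (2.0 * gamma)
    return 1.0 - gamma / 2.0 - z

# With gamma = 2, multiplying by 4 recovers the modified Huber loss L(t, y) = 4 * l_2(y).
</syntaxhighlight>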

See also

References

Template:Reflist