Hinge loss

The vertical axis represents the value of the hinge loss (in blue) and zero-one loss (in green) for fixed t = 1, while the horizontal axis represents the value of the prediction y. The plot shows that the hinge loss penalizes predictions y < 1, corresponding to the notion of a margin in a support vector machine.

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).[1]

For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

β„“(y)=max(0,1βˆ’tβ‹…y)

Note that y should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, y = 𝐰·𝐱 + b, where (𝐰, b) are the parameters of the hyperplane and 𝐱 is the input vector.

When t and y have the same sign (meaning y predicts the right class) and |y| β‰₯ 1, the hinge loss β„“(y) = 0. When they have opposite signs, β„“(y) increases linearly with |y|; the same holds when |y| < 1 even if the signs agree (a correct prediction, but not by enough margin).
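The three cases above can be checked with a minimal sketch of the binary hinge loss (the function name is ours, not from the article):

```python
def hinge_loss(t, y):
    """Hinge loss for a label t in {-1, +1} and a raw classifier score y."""
    return max(0.0, 1.0 - t * y)

# Correct prediction with enough margin: no loss.
print(hinge_loss(1, 2.0))   # 0.0
# Correct sign but inside the margin: linear penalty.
print(hinge_loss(1, 0.5))   # 0.5
# Wrong sign: penalty grows linearly with |y|.
print(hinge_loss(-1, 2.0))  # 3.0
```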

Extensions

While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] the hinge loss itself can also be extended to the multiclass setting. Several variants of a multiclass hinge loss have been proposed.[3] For example, Crammer and Singer[4] defined it for a linear classifier as[5]

β„“(y) = max(0, 1 + max_{y β‰  t} 𝐰_y·𝐱 βˆ’ 𝐰_t·𝐱),

where t is the target label, and 𝐰_t and 𝐰_y are the model's per-class parameter vectors.
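A minimal sketch of the Crammer–Singer loss for a linear classifier, with W holding one weight vector per class (names are ours):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def crammer_singer_hinge(W, x, t):
    """Crammer-Singer multiclass hinge loss.
    W: per-class weight vectors, x: input vector, t: target class index."""
    target_score = dot(W[t], x)
    # Only the highest-scoring wrong class enters the loss.
    worst_other = max(dot(w, x) for k, w in enumerate(W) if k != t)
    return max(0.0, 1.0 + worst_other - target_score)
```

With W = [[1, 0], [0, 1]] and x = [2, 0], class 0 scores 2 and class 1 scores 0, so the loss is 0 when t = 0 and 3 when t = 1.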

Weston and Watkins provided a similar definition, but with a sum rather than a max:[6][3]

β„“(y) = βˆ‘_{y β‰  t} max(0, 1 + 𝐰_y·𝐱 βˆ’ 𝐰_t·𝐱).
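The Weston–Watkins variant accumulates a margin violation for every wrong class rather than only the worst one; a sketch under the same setup as above:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weston_watkins_hinge(W, x, t):
    """Weston-Watkins multiclass hinge loss: sum over all wrong classes."""
    target_score = dot(W[t], x)
    return sum(max(0.0, 1.0 + dot(w, x) - target_score)
               for k, w in enumerate(W) if k != t)
```

With W = [[1, 0], [0, 1], [-1, 0]], x = [2, 0], and t = 1, classes 0 and 2 score 2 and βˆ’2 against a target score of 0, so the loss is 3 + 0 = 3.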

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where 𝐰 denotes the SVM's parameters, 𝐲 the SVM's predictions, Ο• the joint feature function, and Ξ” the Hamming loss:

β„“(𝐲) = max(0, Ξ”(𝐲, 𝐭) + ⟨𝐰, Ο•(𝐱, 𝐲)⟩ βˆ’ ⟨𝐰, Ο•(𝐱, 𝐭)⟩) = max(0, max_{𝐲 ∈ 𝒴} (Ξ”(𝐲, 𝐭) + ⟨𝐰, Ο•(𝐱, 𝐲)⟩) βˆ’ ⟨𝐰, Ο•(𝐱, 𝐭)⟩).
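The structured variant can be sketched on a toy problem; the joint feature map phi, the output space, and all concrete values below are illustrative assumptions, not from the article:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hamming(y, t):
    """Hamming loss Delta(y, t): number of disagreeing positions."""
    return sum(1 for a, b in zip(y, t) if a != b)

def structured_hinge(w, phi, x, t, output_space):
    """Structured hinge loss with margin rescaling (sketch):
    max over candidate outputs of Delta plus the score gap."""
    score_t = dot(w, phi(x, t))
    worst = max(hamming(y, t) + dot(w, phi(x, y)) for y in output_space)
    return max(0.0, worst - score_t)

# Toy setup: outputs are pairs of bits, phi(x, y) = [x*y0, x*y1].
phi = lambda x, y: [x * y[0], x * y[1]]
space = [(0, 0), (0, 1), (1, 0), (1, 1)]
loss = structured_hinge([1.0, 1.0], phi, 2.0, (1, 0), space)  # 3.0
```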

Optimization

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but it has a subgradient with respect to the model parameters 𝐰 of a linear SVM with score function y = 𝐰·𝐱, given by

βˆ‚β„“/βˆ‚w_i = βˆ’t·x_i  if t·y < 1,  and 0 otherwise.

Plot of three variants of the hinge loss as a function of z = ty: the "ordinary" variant (blue), its square (green), and the piecewise smooth version by Rennie and Srebro (red). The y-axis is the hinge loss β„“(y), and the x-axis is the margin ty.
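This subgradient drives subgradient-descent training of a linear SVM; a minimal sketch (function name is ours):

```python
def hinge_subgradient(w, x, t):
    """A subgradient of the hinge loss w.r.t. w, for a linear SVM with
    score y = w.x and label t in {-1, +1}."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    if t * y < 1:
        return [-t * xi for xi in x]  # margin violated: nonzero subgradient
    return [0.0] * len(w)             # margin satisfied: flat region, zero
```

A parameter update is then w ← w βˆ’ η·hinge_subgradient(w, x, t) for a step size η.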

However, since the derivative of the hinge loss at ty=1 is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[7]

β„“(y) = 1/2 βˆ’ ty  if ty ≀ 0,  (1/2)(1 βˆ’ ty)Β²  if 0 < ty < 1,  0  if 1 ≀ ty,

or the quadratically smoothed

β„“_Ξ³(y) = (1/(2Ξ³)) max(0, 1 βˆ’ ty)Β²  if ty β‰₯ 1 βˆ’ Ξ³,  and 1 βˆ’ Ξ³/2 βˆ’ ty  otherwise,

suggested by Zhang.[8] The modified Huber loss L is a special case of this loss function with Ξ³ = 2, specifically L(t, y) = 4β„“β‚‚(y).
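Both smoothed variants are short piecewise functions; a sketch of each (function names are ours), including a check of the modified-Huber relation at Ξ³ = 2:

```python
def smooth_hinge(t, y):
    """Piecewise-smooth hinge of Rennie and Srebro."""
    z = t * y
    if z <= 0:
        return 0.5 - z
    if z < 1:
        return 0.5 * (1.0 - z) ** 2
    return 0.0

def quad_smoothed_hinge(t, y, gamma):
    """Zhang's quadratically smoothed hinge loss."""
    z = t * y
    if z >= 1.0 - gamma:
        return max(0.0, 1.0 - z) ** 2 / (2.0 * gamma)
    return 1.0 - gamma / 2.0 - z

# With gamma = 2, 4 * quad_smoothed_hinge(t, y, 2) gives the modified
# Huber loss: max(0, 1 - ty)^2 for ty >= -1, and -4ty for ty < -1.
```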

References
