The vertical axis represents the value of the hinge loss (in blue) and the zero-one loss (in green) for fixed t = 1, while the horizontal axis represents the value of the prediction y. The plot shows that the hinge loss penalizes predictions y < 1, corresponding to the notion of a margin in a support vector machine.

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).[1]

For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

$$\ell(y) = \max(0,\, 1 - t \cdot y).$$

Note that $y$ should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, $y = \mathbf{w} \cdot \mathbf{x} + b$, where $(\mathbf{w}, b)$ are the parameters of the hyperplane and $\mathbf{x}$ is the input variable(s).

When t and y have the same sign (meaning y predicts the right class) and $|y| \ge 1$, the hinge loss $\ell(y) = 0$. When they have opposite signs, $\ell(y)$ increases linearly with y, and similarly if $|y| < 1$, even if it has the same sign (correct prediction, but not by enough margin).
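To make the definition concrete, here is a minimal sketch (not part of the original article) that evaluates the loss for a label t in {−1, +1} and a raw score y; the function name and the sample values are illustrative only.

```python
def hinge_loss(t, y):
    """Hinge loss max(0, 1 - t*y) for a label t in {-1, +1} and a raw score y."""
    return max(0.0, 1.0 - t * y)

# A confidently correct prediction incurs no loss, a correct but low-margin
# one incurs a small loss, and a wrong-sign prediction is penalized linearly.
print(hinge_loss(1, 2.3))   # 0.0
print(hinge_loss(1, 0.4))   # 0.6
print(hinge_loss(1, -1.5))  # 2.5
```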

Extensions

While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] it is also possible to extend the hinge loss itself for such an end. Several different variations of multiclass hinge loss have been proposed.[3] For example, Crammer and Singer[4] defined it for a linear classifier as[5]

$$\ell(y) = \max(0,\, 1 + \max_{y \ne t} \mathbf{w}_y \cdot \mathbf{x} - \mathbf{w}_t \cdot \mathbf{x}),$$

where $t$ is the target label, and $\mathbf{w}_t$ and $\mathbf{w}_y$ are the model parameters.

Weston and Watkins provided a similar definition, but with a sum rather than a max:[6][3]

$$\ell(y) = \sum_{y \ne t} \max(0,\, 1 + \mathbf{w}_y \cdot \mathbf{x} - \mathbf{w}_t \cdot \mathbf{x}).$$
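As an informal sketch (the function names and the three-class scores are illustrative, not from the article), the two multiclass variants differ only in whether the margin violations of the wrong classes are maximized or summed:

```python
import numpy as np

def crammer_singer_hinge(scores, t):
    """Crammer-Singer multiclass hinge loss.

    scores: per-class scores w_y . x, t: index of the target class.
    Penalizes only the most-violating wrong class (a max over y != t).
    """
    margins = 1.0 + scores - scores[t]
    margins[t] = 0.0                      # exclude the target class itself
    return max(0.0, margins.max())

def weston_watkins_hinge(scores, t):
    """Weston-Watkins variant: sums the violations of all wrong classes."""
    margins = 1.0 + scores - scores[t]
    margins[t] = 0.0
    return np.maximum(0.0, margins).sum()

scores = np.array([2.0, 1.5, -0.3])       # hypothetical scores for 3 classes
print(crammer_singer_hinge(scores, t=0))  # 0.5 (only class 1 violates the margin)
print(weston_watkins_hinge(scores, t=0))  # 0.5 (sum over the single violation)
```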

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where w denotes the SVM's parameters, y the SVM's predictions, φ the joint feature function, and Δ the Hamming loss:

$$\ell(\mathbf{y}) = \max\bigl(0,\, \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle\bigr).$$
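A small sketch of the margin-rescaled structured hinge loss; the joint feature map phi, the task loss delta, and the toy output space below are hypothetical placeholders, since the article does not fix a particular structured task:

```python
import numpy as np

def structured_hinge(w, phi, x, t, output_space, delta):
    """Margin-rescaled structured hinge loss: the maximum over candidate
    outputs y of delta(y, t) + <w, phi(x, y)> - <w, phi(x, t)>, clipped at 0."""
    score_t = w @ phi(x, t)
    worst = max(delta(y, t) + w @ phi(x, y) - score_t for y in output_space)
    return max(0.0, worst)

# Toy example: outputs are bit-strings of length 2, phi concatenates x scaled
# by each bit, and delta is the Hamming distance (all names are illustrative).
outputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
phi = lambda x, y: np.concatenate([b * x for b in y])
delta = lambda y, t: sum(a != b for a, b in zip(y, t))

x = np.array([1.0, -0.5])
w = np.array([0.2, 0.1, -0.3, 0.4])
print(structured_hinge(w, phi, x, t=(1, 0), output_space=outputs, delta=delta))
```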

Optimization

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to the model parameters w of a linear SVM with score function $y = \mathbf{w} \cdot \mathbf{x}$ that is given by

$$\frac{\partial \ell}{\partial w_i} = \begin{cases} -t \cdot x_i & \text{if } t \cdot y < 1, \\ 0 & \text{otherwise}. \end{cases}$$
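A minimal subgradient-descent sketch for a linear SVM trained with this loss; the learning rate, the number of passes, and the toy data are illustrative assumptions, and no regularization term is included:

```python
import numpy as np

def hinge_subgradient_step(w, x, t, lr=0.1):
    """One subgradient-descent update of a linear SVM's weights w on a single
    example (x, t), t in {-1, +1}, using the hinge-loss subgradient above."""
    y = w @ x                    # raw score of the current model
    if t * y < 1:                # margin violated: subgradient is -t * x
        w = w + lr * t * x
    return w                     # otherwise the subgradient is 0: no change

# Tiny demo on a linearly separable toy problem (purely illustrative data).
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, -2.0], [-2.0, 0.5]])
T = np.array([1, 1, -1, -1])
w = np.zeros(2)
for _ in range(50):
    for x, t in zip(X, T):
        w = hinge_subgradient_step(w, x, t)
print(w, [float(np.sign(w @ x)) for x in X])  # learned weights and predicted signs
```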

Plot of three variants of the hinge loss as a function of z = ty: the "ordinary" variant (blue), its square (green), and the piece-wise smooth version by Rennie and Srebro (red). The vertical axis is the loss ℓ and the horizontal axis is the margin z = ty.

However, since the derivative of the hinge loss at $ty = 1$ is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[7]

$$\ell(y) = \begin{cases} \frac{1}{2} - ty & \text{if } ty \le 0, \\ \frac{1}{2}(1 - ty)^2 & \text{if } 0 < ty < 1, \\ 0 & \text{if } 1 \le ty, \end{cases}$$

or the quadratically smoothed

$$\ell_\gamma(y) = \begin{cases} \frac{1}{2\gamma} \max(0,\, 1 - ty)^2 & \text{if } ty \ge 1 - \gamma, \\ 1 - \frac{\gamma}{2} - ty & \text{otherwise} \end{cases}$$

suggested by Zhang.[8] The modified Huber loss $L$ is a special case of this loss function with $\gamma = 2$, specifically $L(t, y) = 4\ell_2(y)$.
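The piecewise definitions above translate directly into code; this sketch (illustrative only, with hypothetical function names) also checks the stated relation that the modified Huber loss equals 4·ℓ_γ(y) at γ = 2:

```python
def smooth_hinge_rennie_srebro(t, y):
    """Piecewise-smooth hinge loss of Rennie and Srebro."""
    z = t * y
    if z <= 0:
        return 0.5 - z
    if z < 1:
        return 0.5 * (1 - z) ** 2
    return 0.0

def smooth_hinge_zhang(t, y, gamma=2.0):
    """Quadratically smoothed hinge loss suggested by Zhang."""
    z = t * y
    if z >= 1 - gamma:
        return max(0.0, 1 - z) ** 2 / (2 * gamma)
    return 1 - gamma / 2 - z

def modified_huber(t, y):
    """Modified Huber loss, equal to 4 * l_gamma(y) with gamma = 2."""
    return 4 * smooth_hinge_zhang(t, y, gamma=2.0)

for z in (-2.0, 0.0, 0.5, 1.0, 2.0):      # sample margins z = ty (with t = 1)
    print(z, smooth_hinge_rennie_srebro(1, z),
          smooth_hinge_zhang(1, z), modified_huber(1, z))
```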

References

  1. ^ Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?" (PDF). Neural Computation. 16 (5): 1063–1076. CiteSeerX 10.1.1.109.6786. doi:10.1162/089976604773135104. PMID 15070510.
  2. ^ Duan, K. B.; Keerthi, S. S. (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study" (PDF). Multiple Classifier Systems. LNCS. Vol. 3541. pp. 278–285. CiteSeerX 10.1.1.110.6789. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7.
  3. ^ a b Doğan, Ürün; Glasmachers, Tobias; Igel, Christian (2016). "A Unified View on Multi-class Support Vector Classification" (PDF). Journal of Machine Learning Research. 17: 1–32.
  4. ^ Crammer, Koby; Singer, Yoram (2001). "On the algorithmic implementation of multiclass kernel-based vector machines" (PDF). Journal of Machine Learning Research. 2: 265–292.
  5. ^ Moore, Robert C.; DeNero, John (2011). "L1 and L2 regularization for multiclass hinge loss models" (PDF). Proc. Symp. on Machine Learning in Speech and Language Processing.
  6. ^ Weston, Jason; Watkins, Chris (1999). "Support Vector Machines for Multi-Class Pattern Recognition" (PDF). European Symposium on Artificial Neural Networks.
  7. ^ Rennie, Jason D. M.; Srebro, Nathan (2005). Loss Functions for Preference Levels: Regression with Discrete Ordered Labels (PDF). Proc. IJCAI Multidisciplinary Workshop on Advances in Preference Handling.
  8. ^ Zhang, Tong (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms (PDF). ICML.