
Loss

Mean Squared Error (MSE) / Least Squares

Earliest version of a loss function, attributed to Gauss and Legendre: $L = \frac{1}{n}\sum(y - \hat{y})^2$. Best for regression-like problems.
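
A minimal NumPy sketch of MSE, assuming y holds the true targets and y_hat the predictions (names are illustrative):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of the squared residuals."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

# Small regression batch
print(mse([3.0, 5.0, 2.5], [2.5, 5.0, 3.0]))  # 0.1666...
```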

Cross-Entropy (Log Loss)

Used when the model outputs a probability rather than a plain numeric value: the loss measures how likely the observed outcome was under the model. This is necessary because MSE doesn't work well for classification problems.

Categorical Cross-Entropy

Used for multiple classes: $L = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$ where (a small sketch follows the list):

  • $M$ is the number of classes
  • $y_{o,c}$ is 1 if observation $o$ belongs to class $c$, else 0
  • $p_{o,c}$ is the predicted probability that observation $o$ belongs to class $c$
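
A minimal sketch for a single observation, assuming y is the one-hot label vector and p the predicted probability vector:

```python
import numpy as np

def categorical_cross_entropy(y, p, eps=1e-12):
    """Categorical cross-entropy for one observation.

    y: one-hot true-label vector of length M
    p: predicted probability vector of length M (sums to 1)
    """
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    p = np.clip(p, eps, 1.0)  # avoid log(0)
    return -np.sum(y * np.log(p))

# True class is index 1 out of M = 3
print(categorical_cross_entropy([0, 1, 0], [0.1, 0.7, 0.2]))  # ≈ 0.357
```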

Binary Cross-Entropy

Used for yes/no questions: $L = -[y \cdot \log(p) + (1-y) \cdot \log(1-p)]$ where $y$ is the true label (0 or 1) and $p$ is the predicted probability of the positive class.
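
A minimal sketch, assuming y is the 0/1 label and p the predicted probability of the positive class:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy for a single yes/no label."""
    p = np.clip(p, eps, 1.0 - eps)  # keep both log terms finite
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(binary_cross_entropy(1, 0.9))  # ≈ 0.105 (confident and correct)
print(binary_cross_entropy(0, 0.9))  # ≈ 2.303 (confident and wrong)
```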

Hinge Loss

Comes from Support Vector Machines. Unlike cross-entropy, it stops penalizing distance once the prediction is on the correct side of the margin: $L = \max(0, 1 - y \cdot \hat{y})$, with $y \in \{-1, +1\}$.
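
A minimal sketch, assuming labels are encoded as y ∈ {-1, +1} and y_hat is the raw decision score:

```python
def hinge_loss(y, y_hat):
    """Hinge loss: zero once the prediction is on the correct side of the margin."""
    return max(0.0, 1.0 - y * y_hat)

print(hinge_loss(+1, 2.3))  # 0.0 -- correct and beyond the margin
print(hinge_loss(+1, 0.4))  # 0.6 -- correct but inside the margin
print(hinge_loss(-1, 0.4))  # 1.4 -- wrong side
```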

Focal Loss

Cross-entropy gets overwhelmed when the data is imbalanced and a majority class dominates. Focal Loss adds a reshaping factor that down-weights easy (frequently occurring) examples and focuses training on hard (rarely occurring) examples: $L = -(1 - p_t)^\gamma \cdot \log(p_t)$ where (a small sketch follows the list):

  • $\gamma$ is the reshaping factor (larger values down-weight easy examples more aggressively)
  • $p_t$ is the model's predicted probability for the true class
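
A minimal sketch, assuming p_t is the predicted probability of the true class; the default gamma of 2 is a common choice, not one prescribed by this note:

```python
import numpy as np

def focal_loss(p_t, gamma=2.0, eps=1e-12):
    """Focal loss: cross-entropy scaled by the reshaping factor (1 - p_t)^gamma."""
    p_t = np.clip(p_t, eps, 1.0)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

print(focal_loss(0.9))  # ≈ 0.001 -- easy example, heavily down-weighted
print(focal_loss(0.1))  # ≈ 1.866 -- hard example, keeps most of its penalty
```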

Direct Preference Optimization (DPO) Loss

In scenarios where users prefer some outputs and reject others, there needs to be a mechanism that increases the likelihood of the winning outputs and decreases the likelihood of the losing outputs.

DPO is designed to solve this issue:

$L_{\text{DPO}}(\pi_\theta, \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)\sim D}\left[ \log \sigma \left( \underbrace{\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}}_{\substack{\text{Raise likelihood of} \\ \text{preferred output}}} - \underbrace{\beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}}_{\substack{\text{Lower likelihood of} \\ \text{dispreferred output}}} \right) \right]$
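
A minimal sketch of this loss over a batch of preference pairs, assuming you already have the summed log-probabilities of each winning ($y_w$) and losing ($y_l$) response under the trained policy $\pi_\theta$ and the frozen reference $\pi_{\text{ref}}$; all names and the value of beta are illustrative:

```python
import numpy as np

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a batch of (preferred, dispreferred) response pairs.

    Each argument is an array of log pi(y|x) values, one entry per pair.
    """
    # beta-scaled margin between the preferred and dispreferred log-ratios
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigmoid(margin), written in a numerically stable form
    return np.mean(np.logaddexp(0.0, -margin))

# Toy batch of two preference pairs
print(dpo_loss(np.array([-12.0, -9.5]), np.array([-13.0, -11.0]),
               np.array([-12.5, -10.0]), np.array([-12.8, -10.5])))  # ≈ 0.65
```

Minimizing this pushes the policy's log-ratio for winning responses up relative to losing ones, which is exactly the mechanism described above.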