
Loss

Mean Squared Error (MSE) / Least Squares

Earliest version of a loss function, attributed to Gauss and Legendre: $L = \frac{1}{n}\sum(y - \hat{y})^2$. Best for regression-like problems.
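
A minimal NumPy sketch of MSE, assuming y holds the true targets and y_hat the predictions (names are illustrative):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of the squared residuals."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

# Small regression batch
print(mse([3.0, 5.0, 2.5], [2.5, 5.0, 3.0]))  # 0.1666...
```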

Cross-Entropy (Log Loss)

Used when the model outputs a probability rather than a plain numeric value: the loss measures how likely the observed outcome was under the model. This is necessary because MSE doesn't work well for classification problems.

Categorical Cross-Entropy

Used for multiple classes: $L = -\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$ where (a small sketch follows the list):

  • $M$ is the number of classes
  • $y_{o,c}$ is 1 if observation $o$ belongs to class $c$, else 0
  • $p_{o,c}$ is the predicted probability that observation $o$ belongs to class $c$
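
A minimal sketch for a single observation, assuming y is the one-hot label vector and p the predicted probability vector:

```python
import numpy as np

def categorical_cross_entropy(y, p, eps=1e-12):
    """Categorical cross-entropy for one observation.

    y: one-hot true-label vector of length M
    p: predicted probability vector of length M (sums to 1)
    """
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    p = np.clip(p, eps, 1.0)  # avoid log(0)
    return -np.sum(y * np.log(p))

# True class is index 1 out of M = 3
print(categorical_cross_entropy([0, 1, 0], [0.1, 0.7, 0.2]))  # ≈ 0.357
```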

Binary Cross-Entropy

Used for yes/no questions: $L = -[y \cdot \log(p) + (1-y) \cdot \log(1-p)]$ where $y$ is the true label (0 or 1) and $p$ is the predicted probability of the positive class.
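
A minimal sketch, assuming y is the 0/1 label and p the predicted probability of the positive class:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy for a single yes/no label."""
    p = np.clip(p, eps, 1.0 - eps)  # keep both log terms finite
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(binary_cross_entropy(1, 0.9))  # ≈ 0.105 (confident and correct)
print(binary_cross_entropy(0, 0.9))  # ≈ 2.303 (confident and wrong)
```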

Hinge Loss

Comes from Support Vector Machines. Unlike cross-entropy, it stops penalizing distance once the prediction is on the correct side of the margin: $L = \max(0, 1 - y \cdot \hat{y})$, with $y \in \{-1, +1\}$.
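
A minimal sketch, assuming labels are encoded as y ∈ {-1, +1} and y_hat is the raw decision score:

```python
def hinge_loss(y, y_hat):
    """Hinge loss: zero once the prediction is on the correct side of the margin."""
    return max(0.0, 1.0 - y * y_hat)

print(hinge_loss(+1, 2.3))  # 0.0 -- correct and beyond the margin
print(hinge_loss(+1, 0.4))  # 0.6 -- correct but inside the margin
print(hinge_loss(-1, 0.4))  # 1.4 -- wrong side
```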

Focal Loss

Cross-entropy gets overwhelmed when the data is imbalanced and a majority class dominates. Focal Loss adds a reshaping factor that down-weights easy (frequently occurring) examples and focuses training on hard (rarely occurring) examples: $L = -(1 - p_t)^\gamma \cdot \log(p_t)$ where (a small sketch follows the list):

  • $\gamma$ is the reshaping factor (larger values down-weight easy examples more aggressively)
  • $p_t$ is the model's predicted probability for the true class
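
A minimal sketch, assuming p_t is the predicted probability of the true class; the default gamma of 2 is a common choice, not one prescribed by this note:

```python
import numpy as np

def focal_loss(p_t, gamma=2.0, eps=1e-12):
    """Focal loss: cross-entropy scaled by the reshaping factor (1 - p_t)^gamma."""
    p_t = np.clip(p_t, eps, 1.0)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

print(focal_loss(0.9))  # ≈ 0.001 -- easy example, heavily down-weighted
print(focal_loss(0.1))  # ≈ 1.866 -- hard example, keeps most of its penalty
```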

Direct Preference Optimization (DPO) Loss

In scenarios where users prefer some outputs and reject others, there needs to be a mechanism that increases the likelihood of the winning outputs and decreases the likelihood of the losing outputs.

DPO is designed to solve this issue:

$L_{\text{DPO}}(\pi_\theta, \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)\sim D}\left[ \log \sigma \left( \underbrace{\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}}_{\substack{\text{Raise likelihood of} \\ \text{preferred output}}} - \underbrace{\beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}}_{\substack{\text{Lower likelihood of} \\ \text{dispreferred output}}} \right) \right]$
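
A minimal sketch of this loss over a batch of preference pairs, assuming you already have the summed log-probabilities of each winning ($y_w$) and losing ($y_l$) response under the trained policy $\pi_\theta$ and the frozen reference $\pi_{\text{ref}}$; all names and the value of beta are illustrative:

```python
import numpy as np

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a batch of (preferred, dispreferred) response pairs.

    Each argument is an array of log pi(y|x) values, one entry per pair.
    """
    # beta-scaled margin between the preferred and dispreferred log-ratios
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigmoid(margin), written in a numerically stable form
    return np.mean(np.logaddexp(0.0, -margin))

# Toy batch of two preference pairs
print(dpo_loss(np.array([-12.0, -9.5]), np.array([-13.0, -11.0]),
               np.array([-12.5, -10.0]), np.array([-12.8, -10.5])))  # ≈ 0.65
```

Minimizing this pushes the policy's log-ratio for winning responses up relative to losing ones, which is exactly the mechanism described above.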