
# Optimization

## SGD

Stochastic gradient descent. Updates the weights by subtracting the gradient scaled by a learning rate.

$$w_t = w_{t-1} - \eta * g_t$$

- $w$: Weights of the network
- $g$: Gradient of the loss function
- $\eta$: Learning rate

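A minimal NumPy sketch of this update rule; the function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """Plain SGD: w_t = w_{t-1} - eta * g_t."""
    return w - lr * grad

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sgd_step(w, 2 * w, lr=0.1)
print(w)  # close to [0, 0]
```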
## SGD with Momentum

Keeps a memory of the previous gradients. Still SOTA for tasks where data is scarce and generalization is critical.

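A minimal NumPy sketch of this variant, implementing the velocity update given by the equations below; the names `sgd_momentum_step`, `velocity`, and `momentum` are illustrative assumptions:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """Momentum SGD: v_t = rho * v_{t-1} + g_t, then w_t = w_{t-1} - eta * v_t."""
    velocity = momentum * velocity + grad
    w = w - lr * velocity
    return w, velocity

# Toy usage on f(w) = ||w||^2 (gradient 2w); the velocity starts at zero.
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, v = sgd_momentum_step(w, 2 * w, v, lr=0.1)
print(w)  # spirals in toward [0, 0]
```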
$$
\begin{aligned}
v_t &= \rho * v_{t-1} + g_t \\
w_t &= w_{t-1} - \eta * v_t
\end{aligned}
$$

- $v$: Velocity, starts at 0
- $\rho$: Momentum factor

\+ Finds better global minima than adaptive optimizers

\- Sensitive to the learning rate, requires warmup/decay schedules to work well

## Adam

Adaptive moment estimation. Computes an individual learning rate for each parameter. Keeps track of:

- First moment, $m$: a moving average of the gradients
- Second moment, $v$: a moving average of the squared gradients, acts like a variance

It then divides the update by $\sqrt{v}$, so that the update size is inversely proportional to the gradient magnitude.

$$
\begin{aligned}
g_t &= g_t + \lambda\theta_{t-1} \\
m_t &= \beta_1 * m_{t-1} + (1-\beta_1) * g_t \\
v_t &= \beta_2 * v_{t-1} + (1-\beta_2) * g_t^2 \\
\theta_t &= \theta_{t-1} - \eta * m_t / (\sqrt{v_t} + \epsilon)
\end{aligned}
$$

- $\beta_1$: Momentum coefficient, controls "short-term" memory (update direction)
- $\beta_2$: Scaling coefficient, controls "long-term" memory (regularization)
- $\epsilon$: Makes sure the denominator never hits zero
- $\lambda$: Weight decay
- $\theta$: Trainable parameters

### Coupling Problem of Adam

The algorithm first applies weight decay to the gradient, then uses this gradient to compute the moments and update the parameters. Because the update is scaled by the second moment, the weight decay gets scaled too: parameters with a larger gradient history are decayed less, while parameters with a smaller gradient history are decayed more. This is not the behavior we want; weight decay should apply uniformly to all parameters.

TODO: Bias correction in Adam

## AdamW

Fixes Adam's weight decay error by applying weight decay separately as a final step. The weight decay term at the start is removed and the last step is updated to:

$$\theta_t = \theta_{t-1} - \eta * m_t / (\sqrt{v_t} + \epsilon) - \underline{\eta * \lambda * \theta_{t-1}}$$
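To make the coupling difference concrete, here is a minimal NumPy sketch of a single update with the weight decay applied in the two different places. Bias correction is omitted to match the equations above, and all names (`adam_like_step`, `decoupled`, `wd`) are illustrative assumptions, not any library's API:

```python
import numpy as np

def adam_like_step(theta, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, wd=0.01, decoupled=False):
    """One update step, without bias correction (see the TODO above).

    decoupled=False -> Adam-style: weight decay is folded into the gradient,
                       so it ends up divided by sqrt(v) like everything else.
    decoupled=True  -> AdamW-style: weight decay is applied as a separate
                       final step and is not scaled by the second moment.
    """
    theta_prev = theta
    if not decoupled:
        grad = grad + wd * theta_prev          # g_t = g_t + lambda * theta_{t-1}
    m = beta1 * m + (1 - beta1) * grad         # first moment (update direction)
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment (per-parameter scale)
    theta = theta_prev - lr * m / (np.sqrt(v) + eps)
    if decoupled:
        theta = theta - lr * wd * theta_prev   # the underlined term above
    return theta, m, v
```

With `decoupled=True`, every parameter loses the same fraction `lr * wd` of its previous value per step, regardless of its gradient history; with `decoupled=False`, that decay term gets divided by `sqrt(v)`, which is exactly the coupling problem described above.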

### Why does AdamW sometimes generalize worse than SGD?

Because the Adam family tends to find *sharp minima* while SGD finds *flat minima*. A flat minimum is usually considered to generalize well, as it represents the data better. Technically, when the gradient of a batch turns out large, Adam normalizes it and starts to take smaller steps for the parameters that caused it. In contrast, SGD doesn't have any mechanism for that and can "jump back up" on the next batch. The noisiness of SGD causes it to settle on a minimum that is a combination of gradients. Sometimes this is expressed as SGD being in the "Rich Regime" and Adam being in the "Lazy Regime".

TODO: Lazy/Rich regimes, Grokking

TODO: Schedule-Free Optimizers

TODO: Muon