January 3, 2026

Regularization

A set of techniques that apply a penalty or constraint to a model in order to control its complexity and improve generalization.

Gradient Clipping

Used to deal with the exploding-gradients problem: during backpropagation, gradients are clipped before the weight update.

Clipping is done either by value (clamp each gradient component that exceeds a min/max threshold) or by norm (rescale the whole gradient tensor when its L2 norm exceeds a threshold, preserving its direction). Clipping by norm usually works better.

It can be viewed as an emergency brake.

TODO: Adaptive Gradient Clipping
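The two clipping modes can be sketched as follows (a minimal NumPy sketch; function names are my own, not from any particular framework):

```python
import numpy as np

def clip_by_value(grad, threshold):
    # Clamp each gradient component into [-threshold, threshold].
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, max_norm):
    # Rescale the whole gradient tensor when its L2 norm exceeds
    # max_norm; unlike clip_by_value, this preserves the direction.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad
```

Note that clipping by value can change the gradient's direction (each component is clamped independently), while clipping by norm only shrinks its magnitude.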

Dropout

Prevents overfitting by randomly zeroing out neuron activations during training.

+ Prevents overfitting when the dataset is small.
- Doesn't help transformers trained on giant datasets, where the model sees each token only once or twice and usually underfits.
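A minimal sketch of inverted dropout (the common formulation, where survivors are scaled at training time so no scaling is needed at inference):

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    # Inverted dropout: zero each activation with probability p and
    # scale the survivors by 1 / (1 - p) so the expected activation
    # value is unchanged. At inference, pass activations through as-is.
    if not training or p == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)
```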

Stochastic Depth

Also called DropPath. Randomly skips a subset of residual blocks at each training step, passing inputs through the skip connection instead, so the input's features can still reach later layers. This way, the task of training one very deep network becomes training an ensemble of shallower networks.

This results in shorter training times, easier convergence and better test results.
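A sketch of stochastic depth applied to a single residual block (the function name and callback are hypothetical; `block_fn` stands in for the block's forward pass):

```python
import numpy as np

def residual_block_with_droppath(x, block_fn, drop_prob, training=True, rng=None):
    # Stochastic depth on a residual block: during training, with
    # probability drop_prob the block is skipped entirely and only the
    # identity (skip) path survives. At inference, the block always runs
    # but its output is scaled by the keep probability.
    keep_prob = 1.0 - drop_prob
    if not training:
        return x + keep_prob * block_fn(x)
    rng = rng or np.random.default_rng()
    if rng.random() < drop_prob:
        return x  # block dropped: identity path only
    return x + block_fn(x)
```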

Weight Decay

Usually part of the optimizer. Prevents weights from growing large quickly.

Adds a term to the weight update that pulls weights toward zero in proportion to their magnitude; larger weights are pulled back more strongly.

TODO: This is equivalent to an L2 regularization term (for vanilla SGD; decoupled weight decay, as in AdamW, behaves differently).
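A sketch of one SGD step with decoupled weight decay (function name is my own):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, weight_decay=0.01):
    # Decoupled weight decay: besides the usual gradient step, shrink
    # each weight toward zero in proportion to its magnitude, so larger
    # weights are pulled back more strongly.
    return w - lr * grad - lr * weight_decay * w
```

Even with a zero gradient, every step multiplies the weights by `(1 - lr * weight_decay)`, which is exactly the "pull toward zero" described above.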

Early Stopping

Halting training when the validation loss begins to rise. Fell out of favor after the discovery of double descent.
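A minimal sketch of an early-stopping loop with patience (the callbacks `step_fn` and `val_loss_fn` are hypothetical stand-ins for one training step and one validation evaluation):

```python
def train_with_early_stopping(step_fn, val_loss_fn, patience=5, max_steps=1000):
    # Stop when the validation loss hasn't improved for `patience`
    # consecutive evaluations; return the best loss seen.
    best = float("inf")
    bad_evals = 0
    for _ in range(max_steps):
        step_fn()
        loss = val_loss_fn()
        if loss < best:
            best, bad_evals = loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break
    return best
```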

Double Descent

When training models with very many parameters, validation loss decreases for a while, then starts to increase, and then, counter-intuitively, starts to decrease again as model size or training time grows.

Chinchilla Optimization

Chinchilla is a 70B-parameter language model released by DeepMind. Its training set the standard of fixing a compute budget up front instead of relying on a stopping condition.

They also found that model size and training tokens should be scaled in equal proportion: to stay compute-optimal, doubling the model size requires doubling the training tokens as well (which takes roughly four times the compute).
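This can be sketched with the commonly cited back-of-the-envelope approximations (assumptions, not exact Chinchilla fits): training compute C ≈ 6·N·D for N parameters and D tokens, and compute-optimal training uses roughly 20 tokens per parameter:

```python
def chinchilla_optimal(compute_budget_flops):
    # Assumptions: C ≈ 6 * N * D, and compute-optimal D ≈ 20 * N.
    # Substituting gives C = 120 * N^2, so N = sqrt(C / 120).
    n_params = (compute_budget_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens
```

Because both N and D grow as the square root of C, quadrupling the compute budget doubles both the optimal model size and the optimal token count.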

This replaced early stopping with checkpoint selection, where the best-performing checkpoint is chosen retrospectively after training completes.