Regularization
Set of techniques that constrain or penalize a model in order to control model complexity and improve generalization.
Gradient Clipping
Used to deal with the exploding-gradients problem: gradients are clipped during backpropagation.
Done either by value (clamp each gradient element into a min/max range) or by norm (rescale the gradient tensor when its L2 norm exceeds a threshold). Clipping by norm usually works better.
It can be viewed as an emergency brake.
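A minimal PyTorch sketch of both variants; the toy model and the thresholds are placeholders:

```python
import torch
import torch.nn as nn

# Toy model and loss, just to produce gradients to clip.
model = nn.Linear(10, 1)
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()

# By value: clamp each gradient element into [-1.0, 1.0].
nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)

# By norm: rescale all gradients so their global L2 norm is at most 1.0.
# In practice you would pick one of the two, not apply both.
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```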
TODO: Adaptive Gradient Clipping
Dropout
Prevents overfitting by randomly zeroing out neuron activations during training.
+ Prevents overfitting when the dataset is small.
- Doesn't work well for transformers trained on giant datasets, where the model sees each token only once or twice and usually underfits.
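A minimal sketch of inverted dropout as implemented in PyTorch (the tensor values are arbitrary):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)  # each activation is zeroed with probability 0.5
x = torch.ones(2, 4)

drop.train()
print(drop(x))  # survivors are scaled by 1/(1-p) = 2, so expectations match

drop.eval()
print(drop(x))  # identity at inference time
```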
Stochastic Depth
Also called DropPath. Randomly skips a subset of (residual) layers at each training step, so features of the input can reach deeper layers more directly. The task of training one large network becomes training an ensemble of smaller networks instead.
This results in shorter training time, easier convergence, and better test results. A sketch follows.
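A common DropPath sketch for residual networks; the per-sample drop decision and the 1/(1-p) rescaling follow the usual implementation, but the module name and drop probability here are placeholders:

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Drop the residual branch of a block with probability p during training."""
    def __init__(self, p: float = 0.1):
        super().__init__()
        self.p = p

    def forward(self, residual: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return residual
        # One keep/drop decision per sample, broadcast over remaining dims.
        shape = (residual.shape[0],) + (1,) * (residual.ndim - 1)
        keep = (torch.rand(shape, device=residual.device) >= self.p).to(residual.dtype)
        # Rescale kept branches so the expected output matches eval time.
        return residual * keep / (1.0 - self.p)

# Usage inside a residual block: x = x + drop_path(block(x))
```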
Weight Decay
Usually part of the optimizer. Prevents weights from growing large quickly.
Adds a term to the weight update that pulls weights toward zero in proportion to their magnitude; larger weights are pulled back more strongly.
TODO: This is equivalent to an L2 regularization term (exactly so for plain SGD; AdamW decouples the decay from the gradient).
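A sketch of the update rule in both its coupled and decoupled forms; the learning rates and decay values are placeholders:

```python
import torch

w = torch.randn(5, requires_grad=True)

# Coupled (classic L2): the decay term wd * w is added to the gradient,
# i.e. w <- w - lr * (grad + wd * w).
sgd = torch.optim.SGD([w], lr=0.1, weight_decay=1e-4)

# Decoupled (AdamW): decay is applied separately from the adaptive step,
# i.e. w <- w - lr * adam_step(grad) - lr * wd * w.
adamw = torch.optim.AdamW([w], lr=1e-3, weight_decay=0.01)
```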
Early Stopping
Halting training when the validation loss begins to rise. Deprecated since the discovery of double descent.
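A minimal patience-based sketch; train_one_epoch, validate, and save_checkpoint are hypothetical helpers, and the patience value is arbitrary:

```python
best_loss = float("inf")
patience, bad_epochs = 5, 0

for epoch in range(100):
    train_one_epoch(model)      # hypothetical helper
    val_loss = validate(model)  # hypothetical helper
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
        save_checkpoint(model)  # hypothetical helper: keep the best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation loss hasn't improved for `patience` epochs
```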
Double Descent
When training over-parameterized models, validation loss decreases for a while, then starts to increase, and then, counter-intuitively, starts to decrease again as model size or training time grows.
Chinchilla Optimization
Chinchilla is a 70B-parameter language model released by DeepMind. Its training set the standard of fixing the compute budget up front instead of using a stopping condition.
They also found that model size and training tokens should be scaled in equal proportion: for every doubling of model size, the number of training tokens should also double (roughly 20 tokens per parameter).
This replaced early stopping with checkpoint selection, where the best-performing checkpoint is chosen retrospectively.
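A rule-of-thumb sketch using the widely cited C ≈ 6ND compute approximation and the ~20 tokens-per-parameter ratio from the Chinchilla paper; both are approximations, not exact laws:

```python
def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    # C ≈ 6 * N * D with D ≈ 20 * N gives C ≈ 120 * N^2.
    n_params = (compute_flops / 120) ** 0.5
    return n_params, 20 * n_params

# Roughly Chinchilla's own budget: 6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs.
params, tokens = chinchilla_optimal(5.9e23)
print(f"{params:.1e} params, {tokens:.1e} tokens")  # ~7.0e10 params, ~1.4e12 tokens
```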