GPT
Why does GPT apply LayerNorm before the sublayer (Pre-LN) instead of after it?
The original Transformer paper uses Post-LN, where each sublayer's output is
y = LayerNorm(x + SubLayer(x))
Here, LayerNorm's Jacobian, which depends on the mean and variance of its input, sits directly on the residual path. The gradient flowing back through x is therefore rescaled and re-centered at every layer, instead of passing through as a clean identity, and stacking many such layers can distort the gradients reaching the lower layers.
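A minimal sketch of this Post-LN block, assuming PyTorch; the class name PostLNBlock and the generic `sublayer` module (standing in for attention or the feed-forward network) are placeholders, not GPT's actual implementation:

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN residual block: y = LayerNorm(x + SubLayer(x))."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LayerNorm is applied AFTER the residual add, so its Jacobian
        # multiplies the gradient of x itself on the way back.
        return self.norm(x + self.sublayer(x))
```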
GPT uses Pre-LN:
y = x + SubLayer(LayerNorm(x))
This leaves the residual path as a plain identity, so gradients reach earlier layers without being rescaled by LayerNorm. Empirically, this makes training deep models more stable.
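A matching sketch of the Pre-LN block under the same assumptions (PyTorch, placeholder names, a generic `sublayer` for attention or the MLP):

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN residual block: y = x + SubLayer(LayerNorm(x))."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LayerNorm only normalizes the sublayer's input; the residual
        # term x is added untouched, so dy/dx keeps an identity component.
        return x + self.sublayer(self.norm(x))

# Quick usage check with a linear layer standing in for the sublayer.
block = PreLNBlock(d_model=16, sublayer=nn.Linear(16, 16))
x = torch.randn(2, 16, requires_grad=True)
block(x).sum().backward()
print(x.grad.shape)  # torch.Size([2, 16])
```

The only design difference between the two sketches is where the single LayerNorm sits relative to the residual add; everything else is identical.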