GPT
Why does GPT apply LayerNorm before the sublayer (Pre-LN) instead of after it?
The original Transformer paper uses Post-LN, where each sublayer's output is
y = LayerNorm(x + SubLayer(x))
Here, LayerNorm's Jacobian, which depends on the mean and variance of its input, sits directly on the residual path. The gradient flowing back through x is therefore rescaled and re-centered at every layer, instead of passing through as a clean identity, and stacking many such layers can distort the gradients reaching the lower layers.
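A minimal sketch of this Post-LN block, assuming PyTorch; the class name PostLNBlock and the generic `sublayer` module (standing in for attention or the feed-forward network) are placeholders, not GPT's actual implementation:

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN residual block: y = LayerNorm(x + SubLayer(x))."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LayerNorm is applied AFTER the residual add, so its Jacobian
        # multiplies the gradient of x itself on the way back.
        return self.norm(x + self.sublayer(x))
```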
GPT uses Pre-LN:
y = x + SubLayer(LayerNorm(x))
This leaves the residual path as a plain identity, so gradients reach earlier layers without being rescaled by LayerNorm. Empirically, this makes training deep models more stable.
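A matching sketch of the Pre-LN block under the same assumptions (PyTorch, placeholder names, a generic `sublayer` for attention or the MLP):

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN residual block: y = x + SubLayer(LayerNorm(x))."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LayerNorm only normalizes the sublayer's input; the residual
        # term x is added untouched, so dy/dx keeps an identity component.
        return x + self.sublayer(self.norm(x))

# Quick usage check with a linear layer standing in for the sublayer.
block = PreLNBlock(d_model=16, sublayer=nn.Linear(16, 16))
x = torch.randn(2, 16, requires_grad=True)
block(x).sum().backward()
print(x.grad.shape)  # torch.Size([2, 16])
```

The only design difference between the two sketches is where the single LayerNorm sits relative to the residual add; everything else is identical.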