Activation
It refers to how strongly a neuron fires in response to its inputs.
The input to the activation function is called the logit or pre-activation; its output is called the post-activation.
In models, it can refer to:
- Output after the activation function
- More generally, the output of any layer in the network (see the sketch below)
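A minimal NumPy sketch of the distinction between pre- and post-activation; the layer sizes and the choice of sigmoid here are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # inputs to the layer
W = rng.normal(size=(4, 3))   # weights: 4 neurons, 3 inputs each
b = np.zeros(4)               # biases

z = W @ x + b                 # pre-activation (the "logits")
a = sigmoid(z)                # post-activation (the "activations")

print("pre-activation :", z)
print("post-activation:", a)
```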
Sigmoid
Inspired by biological neurons, which are either on or off.
- Suffers from the vanishing gradients problem (see the derivative bound below).
- Is strictly positive, which makes gradient descent inefficient: because every output is positive, the gradients of the weights in the next layer all share the same sign, forcing zig-zag updates.
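For reference, the sigmoid and its derivative; the derivative never exceeds 1/4, which is one way to see why gradients shrink as they pass back through many sigmoid layers.

\sigma(x) = \frac{1}{1 + e^{-x}}, \quad \sigma'(x) = \sigma(x) (1 - \sigma(x)) \le \frac{1}{4}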
Tanh
+ Is zero-centered.
- Still suffers from the vanishing gradients problem.
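Tanh is just a rescaled, shifted sigmoid, which is why it is zero-centered but keeps the same saturating tails:

y = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2 \sigma(2x) - 1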
ReLU
+ Fixes the vanishing gradients problem (the gradient is exactly 1 for all positive pre-activations).
+ Because it returns true zeros, negative pre-activation paths are cheap to compute, resulting in sparse networks.
- Suffers from the dying ReLU problem (see below); in larger networks we want at least some signal for all inputs.
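For completeness, the definition and the gradient behaviour behind the dying ReLU problem:

y = \max(0, x)

The gradient is 1 for x > 0 and 0 for x < 0, so a neuron whose pre-activation is negative for every input in the data receives no gradient at all and stops learning.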
Swish / SiLU
Proposed by Google researchers, who found it via an automated search over activation functions, in the spirit of neural architecture search.
y = x * \frac{1}{1 + e^{-x}}
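A minimal NumPy sketch of the formula above; for large positive x it approaches the identity and for large negative x it decays to zero, so it behaves like a smooth ReLU.

```python
import numpy as np

def silu(x):
    # x * sigmoid(x), per the formula above
    return x / (1.0 + np.exp(-x))

xs = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
print(silu(xs))  # approx [-0.015, -0.269, 0.0, 0.731, 5.985]
```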
GeLU
Gaussian Error Linear Unit. Used in BERT and GPT.
y = x * \Phi(x)
where \Phi is the CDF of the standard Gaussian distribution. In practice the tanh approximation is often used:
y = 0.5 * x * (1 + \tanh(\sqrt{ 2 / \pi } * (x + 0.044715 * x^3)))
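A small sketch comparing the exact definition with the tanh approximation; over typical input ranges the two agree to about three decimal places.

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF written via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation from above
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"{x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```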