Activation
It refers to how strongly a neuron fires in response to its inputs.
The input to the activation function is called the logit or pre-activation; its output is called the post-activation.
In models, it can refer to:
- Output after the activation function
- More generally, the output of any layer in the network (see the sketch below)
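A minimal NumPy sketch of the distinction between pre- and post-activation; the layer sizes and the choice of sigmoid here are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # inputs to the layer
W = rng.normal(size=(4, 3))   # weights: 4 neurons, 3 inputs each
b = np.zeros(4)               # biases

z = W @ x + b                 # pre-activation (the "logits")
a = sigmoid(z)                # post-activation (the "activations")

print("pre-activation :", z)
print("post-activation:", a)
```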
Sigmoid
Inspired by biological neurons, which are either on or off.
- Suffers from the vanishing gradients problem (see the derivative bound below).
- Is strictly positive, which makes gradient descent inefficient: because every output is positive, the gradients of the weights in the next layer all share the same sign, forcing zig-zag updates.
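For reference, the sigmoid and its derivative; the derivative never exceeds 1/4, which is one way to see why gradients shrink as they pass back through many sigmoid layers.

\sigma(x) = \frac{1}{1 + e^{-x}}, \quad \sigma'(x) = \sigma(x) (1 - \sigma(x)) \le \frac{1}{4}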
Tanh
+ Is zero-centered.
- Still suffers from the vanishing gradients problem.
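Tanh is just a rescaled, shifted sigmoid, which is why it is zero-centered but keeps the same saturating tails:

y = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2 \sigma(2x) - 1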
ReLU
+ Fixes the vanishing gradients problem (the gradient is exactly 1 for all positive pre-activations).
+ Because it returns true zeros, negative pre-activation paths are cheap to compute, resulting in sparse networks.
- Suffers from the dying ReLU problem (see below); in larger networks we want at least some signal for all inputs.
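For completeness, the definition and the gradient behaviour behind the dying ReLU problem:

y = \max(0, x)

The gradient is 1 for x > 0 and 0 for x < 0, so a neuron whose pre-activation is negative for every input in the data receives no gradient at all and stops learning.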
Swish / SiLU
Proposed by Google researchers, who found it via an automated search over activation functions, in the spirit of neural architecture search.
y = x * \frac{1}{1 + e^{-x}}
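A minimal NumPy sketch of the formula above; for large positive x it approaches the identity and for large negative x it decays to zero, so it behaves like a smooth ReLU.

```python
import numpy as np

def silu(x):
    # x * sigmoid(x), per the formula above
    return x / (1.0 + np.exp(-x))

xs = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
print(silu(xs))  # approx [-0.015, -0.269, 0.0, 0.731, 5.985]
```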
GeLU
Gaussian Error Linear Unit. Used in BERT and GPT.
y = x * \Phi(x)
where \Phi is the CDF of the standard Gaussian distribution. In practice the tanh approximation is often used:
y = 0.5 * x * (1 + \tanh(\sqrt{ 2 / \pi } * (x + 0.044715 * x^3)))
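A small sketch comparing the exact definition with the tanh approximation; over typical input ranges the two agree to about three decimal places.

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF written via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation from above
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"{x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```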