Attention
Softmax
Calculates a probability distribution that adds up to 1, in which the largest value gets the biggest share.
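A minimal sketch of this in NumPy (the score values are made up for illustration):

```python
import numpy as np

def softmax(x):
    # Subtracting the max does not change the result but keeps exp() numerically stable.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # [0.659 0.242 0.099] -- the largest score takes the biggest share
print(probs.sum())  # 1.0
```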
$W_Q$, $W_K$ and Low-Rank Bottleneck
Instead of training a single $d_{model} \times d_{model}$ matrix for sequence relevances, attention uses two matrices $W_Q$ and $W_K$ that project the representation down to a smaller dimension $d_k$. This is called the low-rank bottleneck, and it allows each attention head to specialize in a certain relationship instead of trying to learn every single one at once.
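A rough sketch of the shapes involved, assuming a model dimension of 512 and a head dimension of 64 (illustrative values, not from the text):

```python
import numpy as np

d_model, d_k, seq_len = 512, 64, 10   # assumed sizes; typically d_k = d_model / num_heads
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))   # token representations

# Instead of one d_model x d_model relevance matrix, two thin projections:
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))

Q = X @ W_Q                        # (seq_len, d_k)
K = X @ W_K                        # (seq_len, d_k)
scores = Q @ K.T / np.sqrt(d_k)    # (seq_len, seq_len) attention scores

print(scores.shape)                # (10, 10)
```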
Using multiple attention heads then lets the model learn many different relationships within the sequence in parallel.
The mathematical reason behind this is that when we multiply two matrices, the rank of the product is at most the minimum of their ranks, and therefore the rank of the attention score matrix $Q K^\top$ is at most $d_k$.
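A quick numerical check of this bound, again with illustrative sizes (seq_len = 128 is larger than d_k = 64, so the bottleneck is visible):

```python
import numpy as np

d_model, d_k, seq_len = 512, 64, 128
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))

# The 128 x 128 score matrix is factored through the d_k-dimensional projections,
# so rank(Q K^T) <= min(rank(Q), rank(K)) <= d_k.
scores = (X @ W_Q) @ (X @ W_K).T

print(np.linalg.matrix_rank(scores))   # 64, even though the matrix is 128 x 128
```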