Modules
January 4, 2026

Attention

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
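
A minimal NumPy sketch of this formula for a single, unbatched head; the function name, toy shapes, and random inputs are illustrative assumptions, not from the note.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) pairwise relevance scores
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax, rows sum to 1
    return weights @ V                              # each output is a weighted sum of values

# Toy example: sequence length n = 4, head dimension d_k = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))                # unpacks into three (4, 8) matrices
print(attention(Q, K, V).shape)                     # (4, 8)
```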

Softmax

Computes a probability distribution that sums to 1, in which the largest input gets the biggest share.

\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}
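
A tiny numerical check of this definition (the input values below are just an illustrative example):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
probs = np.exp(x) / np.exp(x).sum()

print(probs)        # [0.09003057 0.24472847 0.66524096]
print(probs.sum())  # 1.0 (up to floating-point rounding)
# The largest input (3.0) receives the biggest share of the probability mass.
```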

QK^T and the Low-Rank Bottleneck

Instead of learning a single n × n matrix of pairwise relevance scores directly, attention uses two projection matrices W_Q and W_K that map the embeddings down to a smaller dimension d_q = d_k. This is called a low-rank bottleneck, and it lets each attention head specialize in a particular relationship instead of trying to learn every one at once.

Using multiple attention heads then lets the model learn many different relationships across the sequence in parallel.

The mathematical reason is that the rank of a product of two matrices is at most the minimum of their individual ranks, so the rank of the attention score matrix QK^T is at most d_k.
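
A quick numerical check of that bound, with hypothetical shapes chosen for illustration (n = 16 tokens, model dimension 64, head dimension d_k = 8):

```python
import numpy as np

n, d_model, d_k = 16, 64, 8
rng = np.random.default_rng(0)

X   = rng.normal(size=(n, d_model))    # token embeddings
W_Q = rng.normal(size=(d_model, d_k))  # query projection
W_K = rng.normal(size=(d_model, d_k))  # key projection

Q, K = X @ W_Q, X @ W_K                # each (n, d_k)
scores = Q @ K.T                       # (n, n) attention score matrix

# rank(AB) <= min(rank(A), rank(B)), so the n x n score matrix has rank at most d_k.
print(np.linalg.matrix_rank(scores))   # 8
```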