Attention
Softmax
Calculates a probability distribution that adds up to 1, in which the largest value gets the biggest share.
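A minimal sketch of this in NumPy (the score values are made up for illustration):

```python
import numpy as np

def softmax(x):
    # Subtracting the max does not change the result but keeps exp() numerically stable.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # [0.659 0.242 0.099] -- the largest score takes the biggest share
print(probs.sum())  # 1.0
```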
$W_Q$, $W_K$ and Low-Rank Bottleneck
Instead of training a single $d_{model} \times d_{model}$ matrix for sequence relevances, attention uses two matrices $W_Q$ and $W_K$ that project the representation down to a smaller dimension $d_k$. This is called the low-rank bottleneck, and it allows each attention head to specialize in a certain relationship instead of trying to learn every single one at once.
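A rough sketch of the shapes involved, assuming a model dimension of 512 and a head dimension of 64 (illustrative values, not from the text):

```python
import numpy as np

d_model, d_k, seq_len = 512, 64, 10   # assumed sizes; typically d_k = d_model / num_heads
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))   # token representations

# Instead of one d_model x d_model relevance matrix, two thin projections:
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))

Q = X @ W_Q                        # (seq_len, d_k)
K = X @ W_K                        # (seq_len, d_k)
scores = Q @ K.T / np.sqrt(d_k)    # (seq_len, seq_len) attention scores

print(scores.shape)                # (10, 10)
```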
Using multiple attention heads then lets the model learn many different relationships within the sequence in parallel.
The mathematical reason behind this is that when we multiply two matrices, the rank of the product is at most the minimum of their ranks, and therefore the rank of the attention score matrix $Q K^\top$ is at most $d_k$.
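A quick numerical check of this bound, again with illustrative sizes (seq_len = 128 is larger than d_k = 64, so the bottleneck is visible):

```python
import numpy as np

d_model, d_k, seq_len = 512, 64, 128
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))

# The 128 x 128 score matrix is factored through the d_k-dimensional projections,
# so rank(Q K^T) <= min(rank(Q), rank(K)) <= d_k.
scores = (X @ W_Q) @ (X @ W_K).T

print(np.linalg.matrix_rank(scores))   # 64, even though the matrix is 128 x 128
```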