What is self-attention mathematically and how is it computed?

Updated May 17, 2026

Short answer

Self-attention computes weighted relationships between tokens using query, key, and value projections.

Deep explanation

Self-attention transforms input embeddings into queries (Q), keys (K), and values (V). Attention scores are computed as softmax(QK^T / sqrt(d_k)), producing weights that are applied to V. This allows each token to dynamically attend to all others, capturing global dependencies in a sequence.

Unlock with a Pro subscription to view this section.

View pricing