seniorNeural Networks
What is self-attention mathematically and how is it computed?
Updated May 17, 2026
Short answer
Self-attention computes weighted relationships between tokens using query, key, and value projections.
Deep explanation
Self-attention transforms input embeddings into queries (Q), keys (K), and values (V). Attention scores are computed as softmax(QK^T / sqrt(d_k)), producing weights that are applied to V. This allows each token to dynamically attend to all others, capturing global dependencies in a sequence.
Unlock with a Pro subscription to view this section.
View pricingReal-world example
No real-world example available yet.
Unlock with a Pro subscription to view this section.
Upgrade to ProCommon mistakes
No common mistakes listed yet.
Unlock with a Pro subscription to view this section.
Upgrade to ProFollow-up questions
No follow-up questions available yet.
Unlock with a Pro subscription to view this section.
Upgrade to Pro