What is self-attention mathematically and how is it computed?

Updated May 17, 2026

Short answer

Self-attention computes weighted relationships between tokens using query, key, and value projections.

Deep explanation

Self-attention transforms input embeddings into queries (Q), keys (K), and values (V). Attention scores are computed as softmax(QK^T / sqrt(d_k)), producing weights that are applied to V. This allows each token to dynamically attend to all others, capturing global dependencies in a sequence.

Unlock with a Pro subscription to view this section.

View pricing

Real-world example

No real-world example available yet.

Unlock with a Pro subscription to view this section.

Upgrade to Pro

Common mistakes

No common mistakes listed yet.

Unlock with a Pro subscription to view this section.

Upgrade to Pro

Follow-up questions

No follow-up questions available yet.

Unlock with a Pro subscription to view this section.

Upgrade to Pro

More Neural Networks interview questions

View all →