What is Self-Attention in Transformers and how does it compute contextual representations?

Updated May 16, 2026

Short answer

Self-attention is a mechanism that lets each token in a sequence weigh every token in that sequence, itself included, and combine their information into a context-aware representation.

Deep explanation

Self-attention is the core innovation behind Transformer architectures and modern LLMs.

Unlike RNNs, which process tokens one at a time, self-attention processes all positions in parallel while still capturing dependencies between any two positions in the sequence.

Core idea: Each token is transformed into three vectors:

  • Query (Q)
  • Key (K)
  • Value (V)

Attention computation: Attention(Q, K, V) = softmax(QK^T / √d) V, where d is the dimension of the query and key vectors.
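
To make the formula concrete, here is a minimal pure-Python sketch that evaluates it on small lists of lists. The function names `softmax` and `attention`, and the toy values for Q, K, and V, are made up for illustration; d is assumed to be the length of each key vector.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Evaluate softmax(Q K^T / sqrt(d)) V for row-major lists of lists."""
    d = len(K[0])  # assumed: d is the key dimension
    out = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output coordinate at a time.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy inputs (illustrative only): 2 tokens, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
```

With these toy inputs, each query matches one key more closely than the other, so each output row leans toward the value vector paired with that key without ignoring the other one.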

Step-by-step process:

  1. Input embeddings are linearly projected into Q, K, V.
  2. Similarity scores are computed as dot products between every query and every key.
  3. Scores are scaled by 1/√d and normalized with softmax, so each token's weights over the sequence sum to 1.

4.…
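
Steps 1 to 3 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reference implementation: the sizes `n_tokens`, `d_model`, and `d`, the random stand-ins for the embeddings `X`, and the weight matrices `W_q`, `W_k`, `W_v` are all hypothetical (in a trained model the weights are learned). The sketch stops at the attention weights.

```python
import math
import random

def matmul(A, B):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols):
    """Random stand-in for embeddings or learned weights (illustrative only)."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n_tokens, d_model, d = 3, 4, 2  # hypothetical sizes

X = random_matrix(n_tokens, d_model)  # input embeddings, one row per token
W_q = random_matrix(d_model, d)
W_k = random_matrix(d_model, d)
W_v = random_matrix(d_model, d)

# Step 1: linear projections of the embeddings into Q, K, V.
Q = matmul(X, W_q)
K = matmul(X, W_k)
V = matmul(X, W_v)  # used only in the final multiplication of the formula

# Step 2: similarity scores between every query and every key (Q K^T).
K_T = [list(col) for col in zip(*K)]
scores = matmul(Q, K_T)

# Step 3: scale by 1/sqrt(d), then apply softmax to each row.
weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
```

Each row of `weights` is a probability distribution over the sequence: row i says how much token i attends to each token, itself included.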
