

Self-Attention Mechanism

Query, Key, Value, and Scaled Dot-Product Attention — Step by Step

Self-attention is a core idea behind modern language models because it lets each word in a sequence look at other words and decide which ones matter most. Instead of processing words only in order, the model builds relationships across the whole sentence at once. This helps it capture context, resolve ambiguity, and represent meaning more effectively. Self-attention is one reason transformer models work so well in translation, chatbots, summarization, and code generation.

The mechanism works by turning each input token into three vectors called query, key, and value. A token compares its query to the keys of all tokens to produce attention scores, then those scores are normalized with softmax into weights that sum to 1. The final output for that token is a weighted sum of the value vectors, so important tokens contribute more strongly. Repeating this process across all tokens allows the model to build context-aware representations in parallel.
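To make these steps concrete, here is a minimal NumPy sketch of scaled dot-product attention over three toy tokens; every number below is made up purely for illustration.

```python
import numpy as np

# Toy example: 3 tokens, each already projected to d_k = 2 dimensions.
# The values are arbitrary, chosen only to walk through the steps.
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # queries, one row per token
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # keys
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # values

d_k = K.shape[-1]

# 1. Raw scores: each token's query dotted with every token's key.
scores = Q @ K.T                                    # shape (3, 3)

# 2. Scale by sqrt(d_k) to keep the dot products in a stable range.
scaled = scores / np.sqrt(d_k)

# 3. Softmax each row so a token's weights sum to 1.
weights = np.exp(scaled) / np.exp(scaled).sum(axis=-1, keepdims=True)

# 4. Each output row is a weighted sum of the value vectors.
output = weights @ V                                # shape (3, 2)
print(weights.round(3))
print(output.round(3))
```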

Key Facts

  • Each token is projected into query, key, and value vectors: Q = XW_Q, K = XW_K, V = XW_V
  • Raw attention scores are computed with a dot product: score(i,j) = q_i · k_j
  • Scaled dot-product attention uses: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k))V
  • The softmax step converts scores into weights that sum to 1 across compared tokens
  • A token's output is a weighted sum of value vectors: output_i = sum_j a_ij v_j
  • Multi-head attention runs several attention operations in parallel, then combines them: MultiHead = Concat(head_1,...,head_h)W_O (see the sketch after this list)
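The formulas above compose directly into code. The following is a hedged sketch of multi-head attention in NumPy: the random weight matrices stand in for learned parameters, and the dimensions (4 tokens, d_model = 8, 2 heads) are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

n_tokens, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads

X = rng.normal(size=(n_tokens, d_model))  # token embeddings (random stand-ins)

heads = []
for _ in range(n_heads):
    # Per-head projections: Q = XW_Q, K = XW_K, V = XW_V.
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

# Concatenate the heads and project back to d_model with W_O.
W_O = rng.normal(size=(n_heads * d_k, d_model))
multi_head = np.concatenate(heads, axis=-1) @ W_O
print(multi_head.shape)  # (4, 8)
```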

Vocabulary

Token
A token is a basic unit of input, such as a word, subword, or symbol, that the model processes.
Query vector
A query vector represents what information a token is looking for from other tokens.
Key vector
A key vector represents what kind of information a token offers to other tokens.
Value vector
A value vector contains the information that gets combined into the final attention output.
Softmax
Softmax is a function that turns a list of scores into positive weights that add up to 1.
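Since softmax appears in every attention step, a small numerically stable implementation may help; this is an illustrative sketch, not code from the infographic.

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels out in the ratio, so the result is unchanged.
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])).round(3))  # positive weights summing to 1
```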

Common Mistakes to Avoid

  • Assuming attention scores are the final output, which is wrong because the scores must first be normalized into weights and then applied to the value vectors.
  • Forgetting the scaling factor 1/sqrt(d_k), which is wrong because large dot products can make softmax too sharp and hurt training stability (see the sketch after this list).
  • Thinking self-attention only compares neighboring words, which is wrong because each token can attend to every token in the sequence unless masking limits it.
  • Mixing up keys and values, which is wrong because keys are used for matching with queries while values are the vectors actually combined to form the output.
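The scaling mistake is the easiest one to see numerically. The sketch below, with an arbitrary d_k = 512 and random vectors, compares softmax weights with and without the 1/sqrt(d_k) factor.

```python
import numpy as np

rng = np.random.default_rng(1)
d_k = 512

# Dot products of random unit-variance vectors grow like sqrt(d_k),
# so raw scores can be large in high dimensions.
q, k1, k2 = (rng.normal(size=d_k) for _ in range(3))
raw = np.array([q @ k1, q @ k2])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print(softmax(raw).round(4))                 # often nearly one-hot: too sharp
print(softmax(raw / np.sqrt(d_k)).round(4))  # softer, better-behaved weights
```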

Practice Questions

  1. A token has query q = [1, 2]. Two tokens have keys k1 = [1, 0] and k2 = [0, 2]. Compute the raw attention scores q · k1 and q · k2.
  2. Suppose a token has attention weights [0.25, 0.75] over two value vectors v1 = 4 and v2 = 10. Compute the weighted output.
  3. Explain why self-attention can represent the meaning of the word "bank" differently in the sentences "I sat by the bank" and "I went to the bank to deposit money".
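To check your answers to questions 1 and 2, a few lines of NumPy suffice; the arrays below simply restate the numbers from the questions.

```python
import numpy as np

# Question 1: raw attention scores as dot products.
q = np.array([1, 2])
k1, k2 = np.array([1, 0]), np.array([0, 2])
print(q @ k1, q @ k2)    # q · k1 and q · k2

# Question 2: weighted sum of the two scalar values.
weights = np.array([0.25, 0.75])
values = np.array([4, 10])
print(weights @ values)  # the weighted output
```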