The transformer architecture is the foundation of many modern language, vision, and multimodal models. This cheat sheet covers the core computations inside an encoder or decoder block, including attention, projections, normalization, and feedforward layers. It helps students connect implementation details with the mathematical structure of the model. It is especially useful when reading papers, debugging neural network code, or comparing model variants. The most important idea is that attention computes a weighted mixture of value vectors using similarities between query and key vectors. Multi-head attention repeats this process in parallel so the model can learn different relation patterns at once. Positional information is added because attention alone does not know token order. Residual connections, layer normalization, and feedforward networks make deep transformer stacks trainable and expressive.

Key Facts

  • Given input X, the projected matrices are Q = XW_Q, K = XW_K, and V = XW_V.
  • Scaled dot-product attention is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V.
  • The factor sqrt(d_k) counteracts the growth of dot-product magnitude with d_k, so the softmax does not become too sharp early in training.
  • Multi-head attention is MultiHead(X) = concat(head_1, ..., head_h)W_O, where head_i = Attention(Q_i, K_i, V_i) and Q_i, K_i, V_i come from head-specific projections of X.
  • Sinusoidal positional encoding often uses PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)).
  • A transformer feedforward block is usually FFN(x) = max(0, xW_1 + b_1)W_2 + b_2; many models replace the ReLU, max(0, ·), with GELU.
  • A common pre-norm transformer block uses x = x + Attention(LayerNorm(x)) followed by x = x + FFN(LayerNorm(x)).
  • Full self-attention has time and memory complexity O(n^2) with respect to sequence length n because it compares every token with every other token.
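The facts above can be tied together in a minimal NumPy sketch of one pre-norm block. It is an illustration under stated assumptions, not a reference implementation: the function names, the parameter dictionary p, the toy sizes, and the random weights are all hypothetical, d_model is assumed even and divisible by h, and layer normalization omits its learned gain and bias.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtracting the maximum improves numerical stability without changing the result.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token (row) over its features; learned gain and bias are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sinusoidal_positions(n, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i / d_model)); PE(pos, 2i+1) = cos of the same angle.
    pos = np.arange(n)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    n, d_model = x.shape
    d_k = d_model // h

    def split(m):
        # (n, d_model) -> (h, n, d_k): one slice of the feature axis per head.
        return m.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(x @ W_q), split(x @ W_k), split(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                               # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 (ReLU variant).
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def pre_norm_block(x, p, h):
    # x = x + Attention(LayerNorm(x)), then x = x + FFN(LayerNorm(x)).
    attn_out, weights = multi_head_attention(
        layer_norm(x), p["W_q"], p["W_k"], p["W_v"], p["W_o"], h)
    x = x + attn_out
    x = x + ffn(layer_norm(x), p["W1"], p["b1"], p["W2"], p["b2"])
    return x, weights

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n, d_model, h, d_ff = 6, 16, 4, 64
shapes = {
    "W_q": (d_model, d_model), "W_k": (d_model, d_model),
    "W_v": (d_model, d_model), "W_o": (d_model, d_model),
    "W1": (d_model, d_ff), "b1": (d_ff,),
    "W2": (d_ff, d_model), "b2": (d_model,),
}
p = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

x = rng.normal(size=(n, d_model)) + sinusoidal_positions(n, d_model)
y, weights = pre_norm_block(x, p, h)
print(y.shape, weights.shape)
```

The attention weights have one n by n matrix per head, which is where the O(n^2) cost in sequence length comes from, while the block output keeps the input shape n by d_model so blocks can be stacked.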

Vocabulary

Token embedding
A learned vector representation of a token that serves as the numerical input to the transformer.
Query
A vector used to search for relevant information by comparing it with key vectors.
Key
A vector that represents what information a token can be matched against during attention.
Value
A vector containing the information that is mixed and passed forward after attention weights are computed.
Attention mask
A matrix that blocks attention to certain positions, such as future tokens in causal language modeling.
Layer normalization
A normalization operation that rescales features within each token representation to stabilize training.
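To make the query, key, and value roles concrete, here is a small hand-sized sketch with one query and three tokens. The numbers are made up for illustration, and no learned projections are involved.

```python
import numpy as np

# Three tokens with hand-picked 2-D keys and values (illustrative numbers only).
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
q = np.array([2.0, 0.0])  # a query that points along the first key

d_k = K.shape[1]
scores = K @ q / np.sqrt(d_k)            # one similarity score per key
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # softmax over the three keys
output = weights @ V                     # weighted mixture of the value vectors

print("weights:", weights)
print("output:", output)
```

The keys decide how much weight each token receives, and the values decide what is actually mixed into the output. Here the first and third keys score equally against the query, so their values contribute equally, while the second token contributes less.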

Common Mistakes to Avoid

  • Confusing keys and values is wrong because keys determine attention weights, while values provide the information that gets averaged.
  • Forgetting the division by sqrt(d_k) is wrong because large dot products can saturate the softmax and make gradients less useful.
  • Applying a causal mask after softmax is wrong because masked positions must be assigned very negative scores before softmax so their probabilities become zero.
  • Assuming self-attention automatically knows word order is wrong because attention is permutation-equivariant (reordering the input tokens only reorders the outputs in the same way) unless positional information is added.
  • Mixing up sequence length and embedding dimension is wrong because attention scores have shape n by n, while token representations have width d_model.
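Two of these mistakes can be checked numerically. The sketch below is a simplified illustration that assumes identity projections (Q = K = V = x) and random toy data; the helper names are hypothetical. It compares masking before softmax with zeroing weights after softmax, and it checks that shuffling the input tokens only shuffles the outputs when no positional information is present.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, mask=None):
    # Identity projections keep the sketch short; real layers use learned W_Q, W_K, W_V.
    scores = x @ x.T / np.sqrt(x.shape[-1])   # shape (n, n)
    if mask is not None:
        # Blocked positions get -inf BEFORE softmax, so their weight is exactly zero.
        scores = np.where(mask, scores, -np.inf)
    weights = softmax(scores)
    return weights @ x, weights

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))
causal = np.tril(np.ones((n, n), dtype=bool))  # token i may attend to tokens 0..i

# Correct: mask the scores, then apply softmax. Each row is still a distribution.
_, w_before = self_attention(x, causal)

# Mistaken: apply softmax to all scores, then zero out future positions.
_, w_full = self_attention(x)
w_after = w_full * causal

print("row sums, mask before softmax:", w_before.sum(axis=-1))
print("row sums, mask after softmax: ", w_after.sum(axis=-1))

# Without positional information, permuting the tokens permutes the outputs identically.
perm = rng.permutation(n)
out, _ = self_attention(x)
out_perm, _ = self_attention(x[perm])
print("outputs only reordered:", np.allclose(out_perm, out[perm]))
```

Masking after softmax leaves rows that no longer sum to 1, because probability mass was already spent on the future positions. The score matrix is n by n regardless of the feature width d, which is the shape distinction the last bullet describes.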

Practice Questions

  1. If a sequence has n = 128 tokens and full self-attention is used, how many pairwise attention scores are computed for one head?
  2. For d_k = 64, what scaling factor appears in Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V?
  3. A transformer layer has h = 8 attention heads and each head has dimension 64. What is the concatenated multi-head attention dimension before the output projection?
  4. Why does a transformer need positional encoding or another position mechanism if self-attention compares every token with every other token?