Transformer Architecture Reference Cheat Sheet

The transformer architecture is the foundation of many modern language, vision, and multimodal models. This cheat sheet covers the core computations inside an encoder or decoder block, including attention, projections, normalization, and feedforward layers. It helps college students connect implementation details with the mathematical structure of the model.

It is especially useful when reading papers, debugging neural network code, or comparing model variants.

The most important idea is that attention computes a weighted mixture of value vectors using similarities between query and key vectors. Multi-head attention repeats this process in parallel so the model can learn different relation patterns at once. Positional information is added because attention alone does not know token order.

Residual connections, layer normalization, and feedforward networks make deep transformer stacks trainable and expressive.

Key Facts

Given input X, the projected matrices are Q = XW_Q, K = XW_K, and V = XW_V.
Scaled dot-product attention is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V.
The factor sqrt(d_k) reduces overly large dot products so the softmax does not become too sharp early in training.
Multi-head attention is MultiHead(X) = concat(head_1, ..., head_h)W_O, where head_i = Attention(Q_i, K_i, V_i) and Q_i, K_i, V_i come from that head's own projection matrices.
Sinusoidal positional encoding often uses PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)).
A transformer feedforward block is usually FFN(x) = max(0, xW_1 + b_1)W_2 + b_2; many models replace the ReLU nonlinearity with GELU.
A common pre-norm transformer block uses x = x + Attention(LayerNorm(x)) followed by x = x + FFN(LayerNorm(x)).
Full self-attention has time and memory complexity O(n^2) with respect to sequence length n because it compares every token with every other token.
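The projection and attention formulas above can be sketched in a few lines. This is a minimal single-head illustration that uses plain Python lists instead of a tensor library so the shapes stay visible. The helper names (matmul, transpose, attention) and the toy input are assumptions made for this example, not part of any library.

```python
import math

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    """Single-head self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # shape n x n, one score per token pair
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input: n = 3 tokens, d_model = 2. Identity projections keep Q = K = V = X.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention(x, identity, identity, identity)
```

Each row of weights is a distribution over the three tokens, and output has the same n by d shape as the value matrix. A real model would use learned W_Q, W_K, W_V rather than identity matrices.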

Vocabulary

Token embedding: A learned vector representation of a token that serves as the numerical input to the transformer.
Query: A vector used to search for relevant information by comparing it with key vectors.
Key: A vector that represents what information a token can be matched against during attention.
Value: A vector containing the information that is mixed and passed forward after attention weights are computed.
Attention mask: A matrix that blocks attention to certain positions, such as future tokens in causal language modeling.
Layer normalization: A normalization operation that rescales features within each token representation to stabilize training.

Common Mistakes to Avoid

Confusing keys and values is wrong because keys determine attention weights, while values provide the information that gets averaged.
Forgetting the division by sqrt(d_k) is wrong because large dot products can saturate the softmax and make gradients less useful.
Applying a causal mask after softmax is wrong because the remaining weights no longer sum to one; masked positions must be assigned very negative scores before softmax so their probabilities become zero.
Assuming self-attention automatically knows word order is wrong because, without positional information, attention is permutation-equivariant: reordering the input tokens only reorders the outputs in the same way.
Mixing up sequence length and embedding dimension is wrong because attention scores have shape n by n, while token representations have width d_model.
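One way to supply the positional information that the word-order mistake above leaves out is the sinusoidal encoding from Key Facts. The sketch below builds that table directly from the two formulas. The function name is an assumption for this example, and d_model is assumed to be even.

```python
import math

def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table; d_model is assumed even."""
    table = []
    for pos in range(num_positions):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            row[2 * i] = math.sin(angle)      # PE(pos, 2i)
            row[2 * i + 1] = math.cos(angle)  # PE(pos, 2i+1)
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(num_positions=4, d_model=6)
```

Row pos of the table is typically added to the embedding of the token at that position, so the same token at two different positions enters the block with two different vectors.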

Practice Questions

1. If a sequence has n = 128 tokens and full self-attention is used, how many pairwise attention scores are computed for one head?
2. For d_k = 64, what scaling factor appears in Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V?
3. A transformer layer has h = 8 attention heads and each head has dimension 64. What is the concatenated multi-head attention dimension before the output projection?
4. Why does a transformer need positional encoding or another position mechanism if self-attention compares every token with every other token?

Understanding Transformer Architecture Reference

A transformer begins with a vector for each token. A token may be a word piece, a character, an image patch, or another small unit of data. Its vector has many features, but those features do not have fixed human-readable meanings. During training, the model adjusts them so useful patterns become easier to detect.

The three attention projections give each token different jobs. One representation, the query, expresses what the token is looking for. Another, the key, expresses what it offers as a possible match. The third, the value, carries information that can be passed onward. This separation matters because the best features for matching are not always the best features to copy into a new representation.

Attention weights form a distribution over permitted positions. Softmax makes the weights positive and makes their total equal one. A token can therefore collect a small amount of information from many positions or focus strongly on one position.
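The difference between spreading attention and focusing it can be seen on a single row of scores. The sketch below uses made-up dot products for d_k = 64 and compares softmax on the raw scores with softmax on the scores divided by sqrt(d_k). The numbers are illustrative assumptions, not values from a trained model.

```python
import math

def softmax(scores):
    """Positive weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [8.0, 0.0, -8.0, 4.0]  # made-up query-key dot products
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # [1.0, 0.0, -1.0, 0.5]

sharp = softmax(raw_scores)      # nearly all weight on the first position
spread = softmax(scaled_scores)  # weight shared across several positions
```

Both results are valid distributions. The unscaled row is close to one-hot, which is the overly sharp behavior that the sqrt(d_k) factor is meant to reduce.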

In a language encoder, every non-padding token can usually attend in both directions. In a text generator, a causal mask blocks future tokens. Without that mask, a model could see the answer token while learning to predict it.

Padding masks are important too. Batches often contain sequences of different lengths, with empty filler positions added to make arrays the same size. The model must ignore those fillers. Implementations usually apply masks before softmax by giving forbidden scores a very large negative value.
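The causal and padding masks described above can be combined into one table of allowed positions and applied before softmax. The sketch below is a minimal illustration with hypothetical helper names. It uses negative infinity as the forbidden score, while many implementations use a large finite negative number instead.

```python
import math

NEG_INF = float("-inf")

def allowed_positions(lengths, max_len, causal):
    """allowed[b][i][j] is True when query i may attend to key j in sequence b."""
    masks = []
    for length in lengths:
        # Key j must be a real token (j < length) and, if causal, not in the future.
        masks.append([[j < length and (not causal or j <= i)
                       for j in range(max_len)] for i in range(max_len)])
    return masks

def masked_softmax(scores, allowed):
    """Set forbidden scores to -inf before softmax so their weight is exactly 0.

    Assumes at least one position in the row is allowed.
    """
    masked = [s if ok else NEG_INF for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Batch of two sequences padded to length 3; the second has one filler position.
masks = allowed_positions(lengths=[3, 2], max_len=3, causal=True)
weights = masked_softmax([1.0, 2.0, 3.0], masks[1][2])
```

In this example the filler position has the highest raw score, yet it receives zero weight and the remaining weights still sum to one.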

Each attention head has a smaller internal feature size than the full model vector. Heads can specialize, though this is not guaranteed. One head may often follow nearby words, while another may connect a pronoun to an earlier noun or compare matching brackets in code. The output projection then mixes information from all heads.

After attention, the feedforward network processes each position separately. It does not communicate between tokens, but it transforms the features gathered by attention. Its hidden layer is commonly much wider than the model vector, which gives the block room to build more complex feature combinations.

Residual paths preserve the earlier representation and give gradients a direct route through many layers. Layer normalization keeps feature scales more stable, which reduces training problems caused by values growing or shrinking across depth.
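The feedforward, residual, and normalization ideas above fit together in the pre-norm step x = x + FFN(LayerNorm(x)). The sketch below is a simplified illustration: the layer normalization omits the learned scale and shift, the weights are arbitrary, and the function names are assumptions for this example. The attention half of a block is wired the same way.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2 for a single token vector x."""
    hidden = [max(0.0, sum(x[i] * w1[i][j] for i in range(len(x))) + b1[j])
              for j in range(len(b1))]
    return [sum(hidden[j] * w2[j][k] for j in range(len(hidden))) + b2[k]
            for k in range(len(b2))]

def ffn_sublayer(tokens, w1, b1, w2, b2):
    """Pre-norm residual step x = x + FFN(LayerNorm(x)), applied to each token separately."""
    out = []
    for x in tokens:
        update = ffn(layer_norm(x), w1, b1, w2, b2)
        out.append([a + b for a, b in zip(x, update)])  # residual add
    return out

# d_model = 2 and hidden width = 4, with arbitrary illustrative weights.
w1 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, -0.5, 1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
b2 = [0.0, 0.0]
tokens = [[1.0, 3.0], [2.0, 0.0]]
out = ffn_sublayer(tokens, w1, b1, w2, b2)
```

Because the loop handles one token at a time, changing the second token cannot affect the output for the first. If the second weight matrix were all zeros, the residual path would return the input unchanged.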

The main practical limit is sequence length. If a sequence has twice as many tokens, full attention creates about four times as many pairwise scores. This affects memory first, especially during training when intermediate values are stored for gradient calculation. Long documents, high-resolution images split into patches, and long program files can therefore become expensive quickly.

Decoder models have an extra inference issue. When producing one token at a time, recomputing attention information for all earlier tokens would waste work. A key-value (KV) cache stores the earlier projected keys and values so each new token only adds one more entry.

Students reading code should track tensor shapes, mask direction, and where normalization occurs. Small errors in any of these details can produce a model that runs without crashing but learns the wrong behavior.
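The caching idea can be sketched for a single attention head. In the illustration below, the KVCache class is a hypothetical helper, and the query, key, and value vectors are made-up numbers that stand in for already-projected tokens. No explicit causal mask is needed here because the cache only ever contains the current and earlier tokens.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Attention output for one query over the given keys and values."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]

class KVCache:
    """Stores projected keys and values of earlier tokens (hypothetical helper)."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Append only the new token's key and value, then attend over everything cached.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Already-projected vectors for three generated tokens (made-up numbers).
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
```

Each step gives the same result as recomputing attention over the whole prefix, but it projects and stores only one new key and value per token.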
