Computer Science
How ChatGPT Picks the Next Word
Tokens, transformers, attention, and sampling
ChatGPT writes by predicting one token at a time, where a token is a small piece of text such as a word, part of a word, or a punctuation mark. It does not look up a fixed answer in a database, and it does not truly know what will come next. Instead, it uses patterns learned from huge amounts of text to estimate which tokens are most likely to follow the prompt and the text generated so far. This matters because the same next-token process can produce essays, code, explanations, jokes, and mistakes.
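The one-token-at-a-time loop can be sketched in a few lines of Python. This is only an illustration: `TOY_MODEL` is a made-up lookup table standing in for a trained network, and it conditions only on the most recent token, whereas a real model conditions on the whole context.

```python
import random

# Hypothetical stand-in for a trained model: maps the most recent token to
# next-token probabilities. A real model looks at the entire context.
TOY_MODEL = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def generate(rng, greedy=False, max_tokens=10):
    """Build a sequence one token at a time until <end> or the length limit."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        dist = TOY_MODEL[tokens[-1]]
        if greedy:
            # Always take the single most probable token.
            next_token = max(dist, key=dist.get)
        else:
            # Sample in proportion to the probabilities.
            next_token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the updated text feeds the next step
    return tokens[1:]  # drop the <start> marker

greedy_output = generate(random.Random(0), greedy=True)
sampled_output = generate(random.Random(0))
print(greedy_output)
print(sampled_output)
```

With `greedy=True` the result is always `['the', 'cat', 'sat']`. The sampled result depends on the random seed and may follow a less probable path such as "the dog ran".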
Key Facts
- Text is split into tokens before the model processes it, so one word may become one token or several tokens.
- Attention lets each token use information from earlier tokens in the sequence to build context.
- The model outputs a score called a logit for every token in its vocabulary.
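- Probabilities are computed from logits using softmax: P(token i) = e^(z_i) / sum over j of e^(z_j), where z_i is the logit for token i.
- Temperature T changes randomness by dividing each logit by T before softmax: a temperature above 1 flattens the distribution so low-probability tokens are chosen more often, while a temperature below 1 makes the top token dominate.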
- Training adjusts weights to reduce prediction error, often using cross-entropy loss: L = -log(P(correct token)).
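The softmax, temperature, and cross-entropy formulas above can be tried out directly. The sketch below assumes the common convention of dividing logits by the temperature before softmax. The three logits are made-up numbers for a three-token vocabulary, and the function names are chosen for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities, dividing by temperature first."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtracting the max avoids overflow; result is unchanged
    exps = [math.exp(z - largest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, correct_index):
    """Loss for one prediction: L = -log(P(correct token))."""
    return -math.log(probs[correct_index])

logits = [2.0, 1.0, 0.1]  # made-up scores; token 0 is treated as the correct one
for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature)
    loss = cross_entropy(probs, correct_index=0)
    print(temperature, [round(p, 3) for p in probs], round(loss, 3))
```

Running it shows the top token's probability shrinking as the temperature rises, and the loss growing as the model puts less probability on the correct token.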
Vocabulary
- Token: A token is a piece of text, such as a word, word fragment, number, or punctuation mark, that the model processes as one unit.
- Transformer: A transformer is a neural network design that uses attention to process relationships among tokens in a sequence.
- Attention: Attention is a method that lets the model decide which earlier tokens are most useful for understanding the current context.
- Probability distribution: A probability distribution assigns a probability to each possible next token, with all probabilities adding up to 1.
- Temperature: Temperature is a setting that controls how random or predictable the model's token choices are during generation.
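To make the attention definition concrete, here is a minimal sketch of scaled dot-product attention in which each position may only look at itself and earlier positions. It is a simplification: a real transformer first passes each token vector through learned projections to get separate query, key, and value vectors, while this example reuses the same made-up vectors for all three roles.

```python
import math

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention(queries, keys, values):
    """For each position i, blend the values at positions 0..i,
    weighted by how well query i matches each key."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # Score only the current and earlier positions (no peeking ahead).
        scores = [dot(query, keys[j]) / math.sqrt(dim) for j in range(i + 1)]
        weights = softmax(scores)
        blended = [
            sum(w * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up 2-dimensional vectors for three tokens.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(vectors, vectors, vectors)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is a probability distribution over the positions that token is allowed to see, so the first row has one entry and the last has three.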
Common Mistakes to Avoid
- Thinking ChatGPT chooses whole sentences at once. It usually generates text step by step by choosing one token, then using the updated text to choose the next token.
- Assuming the highest-probability token is always selected. Sampling settings can allow lower-probability tokens to be chosen, which can make answers more creative but less predictable.
- Believing attention means the model understands like a human. Attention computes weighted relationships between tokens, but it is still a mathematical pattern-matching process, not human comprehension.
- Ignoring tokenization when counting words. A single word can be split into multiple tokens, so token limits and model costs do not always match word counts.
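The gap between word counts and token counts can be seen with a toy tokenizer. The sketch below uses greedy longest-match splitting against a tiny hypothetical vocabulary. Production tokenizers use learned schemes such as byte-pair encoding with far larger vocabularies, so treat this only as an illustration of why one word can become several tokens.

```python
# Hypothetical subword vocabulary, invented for this example.
VOCAB = {"un", "believ", "able", "token", "ization", "is", "fun"}

def tokenize_word(word):
    """Greedy longest-match split; unknown characters become one-character tokens."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shorter ones.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

sentence = "tokenization is unbelievable fun"
words = sentence.split()
tokens = [piece for word in words for piece in tokenize_word(word)]
print(len(words), "words ->", len(tokens), "tokens:", tokens)
```

Here 4 words become 7 tokens: "tokenization" splits into two pieces and "unbelievable" into three, while "is" and "fun" stay whole.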
Practice Questions
- 1. A model assigns next-token probabilities of 0.50 for "cat", 0.30 for "dog", 0.15 for "bird", and 0.05 for "fish". What is the probability that it does not choose "cat"?
- 2. A sentence has 18 words, and each word averages 1.4 tokens. About how many tokens will the model process for that sentence?
- 3. A model gives a fluent but false answer about a science fact. Explain how next-token prediction and training on text patterns can lead to this kind of mistake.