How ChatGPT Picks the Next Word Infographic

ChatGPT writes by predicting one token at a time, where a token is a small piece of text such as a word, part of a word, or punctuation mark. It does not look up a fixed answer in a database, and it does not truly know what will come next. Instead, it uses patterns learned from huge amounts of text to estimate which tokens are most likely to follow the prompt.

This matters because the same next-token process can produce essays, code, explanations, jokes, and mistakes.
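The one-token-at-a-time loop can be sketched in a few lines. This is only an illustration: the small table named TOY_MODEL is a made-up stand-in for the neural network, and it looks only at the previous token, whereas a real model computes its probabilities from the whole text so far.

```python
# Hypothetical stand-in for the neural network: next-token probabilities that
# depend only on the previous token. A real model uses the whole text so far.
TOY_MODEL = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, "<end>": 0.3},
    "dog": {"barks": 0.8, "<end>": 0.2},
    "sleeps": {"<end>": 1.0},
    "barks": {"<end>": 1.0},
}


def generate(model, max_tokens=10):
    """Build text one token at a time, feeding each choice back in."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        next_probs = model[tokens[-1]]
        # Greedy choice: take the most likely token (sampling is another option).
        next_token = max(next_probs, key=next_probs.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the <start> marker


generated = generate(TOY_MODEL)
print(" ".join(generated))  # the cat sleeps
```

The important part is the loop itself: each chosen token is appended to the text, and the updated text is what the next prediction is based on.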

Understanding How ChatGPT Picks the Next Word

Inside a transformer, tokens are first turned into lists of numbers called embeddings. These numbers place related ideas near one another in a large mathematical space. The model adds position information too. Without position, the words "dog bites man" would look much like "man bites dog."

Each layer then builds richer representations. Early layers may notice spelling, grammar, or nearby phrases. Later layers can connect a pronoun with a person named earlier, follow a topic across several sentences, or recognize the structure of a programming language. The final representation contains the model's best numerical summary of the text so far.
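The point about word order can be checked with toy numbers. In this sketch the three-number embeddings and the position vectors are invented for illustration. Real models learn these values and use far more dimensions, and the way position is added varies between models.

```python
# Toy embeddings: made-up 3-number vectors (real models learn these values).
EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "bites": [0.0, 0.8, 0.3],
    "man": [0.7, 0.2, 0.1],
}

# Toy position vectors, one per slot in the sentence (also made up).
POSITIONS = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 3.0],
]


def embed(words, use_position):
    """Return one vector per word, optionally adding the position vector."""
    vectors = []
    for index, word in enumerate(words):
        vector = list(EMBEDDINGS[word])
        if use_position:
            vector = [v + p for v, p in zip(vector, POSITIONS[index])]
        vectors.append(vector)
    return vectors


def same_bag_of_vectors(a, b):
    """True when two sentences contain exactly the same vectors in any order."""
    return sorted(a) == sorted(b)


first = ["dog", "bites", "man"]
second = ["man", "bites", "dog"]

without_position = same_bag_of_vectors(embed(first, False), embed(second, False))
with_position = same_bag_of_vectors(embed(first, True), embed(second, True))

print(without_position)  # True: the same three vectors, so order is invisible
print(with_position)     # False: "dog" in slot 1 differs from "dog" in slot 3
```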

Attention is the part that decides which earlier pieces of text deserve focus at a particular moment. For every position, the model creates internal signals that act roughly like a search request, a label for available information, and the information itself (usually called the query, key, and value). It compares the request with the labels from the current and earlier positions. Strong matches receive more weight. The weighted information is combined and passed onward.

Many attention heads run in parallel. One head might track sentence structure, while another notices repeated names or matching brackets in code. Attention does not guarantee understanding. It is a calculation that can find useful relationships when training has made those relationships valuable.
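The request, label, and information steps can be sketched for a single attention head. This is a minimal version of scaled dot-product attention, the standard transformer formulation. The numbers are made up, and the function names (attend, dot, softmax) are chosen here for illustration. A real model learns how to produce these signals and runs many heads over much longer vectors.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Similarity score between a request and a label."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Average the values, weighted by how well each key matches the query."""
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    weights = softmax(scores)
    size = len(values[0])
    combined = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(size)]
    return weights, combined


# Hypothetical signals for three positions seen so far (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # labels
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]    # information
query = [2.0, 0.0]                                 # the current search request

weights, combined = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in combined])
```

Positions whose label matches the request (the first and third here) receive more weight than the second, so their information dominates the combined result.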

During training, the model sees an enormous number of text sequences with part of each sequence hidden from it. It makes a prediction, checks the actual next piece, then changes billions of adjustable values by a tiny amount. A wrong confident prediction causes a larger correction than a wrong uncertain prediction.

Repeating this process teaches patterns such as word order, common facts, styles of explanation, and links between ideas. It does not store every training sentence in a neat library. Some details are learned strongly, some weakly, and some are missing or outdated.

This is one reason a fluent answer can still contain invented facts. The model is optimized to continue text plausibly, not to independently verify every claim.
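The size of the correction is tied to the loss, which the Key Facts section gives as L = -log(P(correct token)). The sketch below applies that formula to three made-up cases. The probabilities are assumptions chosen for illustration, not measurements from any real model.

```python
import math


def cross_entropy(prob_of_correct_token):
    """Loss for one prediction: L = -log(P(correct token))."""
    return -math.log(prob_of_correct_token)


# Hypothetical cases: the probability the model gave to the token that
# actually came next in the training text.
confident_and_right = cross_entropy(0.90)
wrong_but_uncertain = cross_entropy(0.30)
wrong_and_confident = cross_entropy(0.01)

print(round(confident_and_right, 3))  # 0.105
print(round(wrong_but_uncertain, 3))  # 1.204
print(round(wrong_and_confident, 3))  # 4.605
```

When the model put almost all of its probability on the wrong tokens, the loss is much larger, which is why that case leads to a bigger adjustment.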

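The final choice of token is controlled by more than temperature, the setting that scales how random the selection is. A system may reject extremely unlikely options, consider only a limited group of the most likely options (approaches commonly called top-p and top-k sampling), or always choose the single highest-ranked option (greedy decoding). These settings affect whether repeated runs give nearly identical wording.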
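These options can be sketched as small filters applied to the probabilities before a token is drawn. The names greedy, top_k_filter, and top_p_filter are chosen here for illustration, following the common terms greedy decoding, top-k sampling, and top-p sampling. The probabilities are made up, and real systems may combine or order these steps differently.

```python
import random


def greedy(probs):
    """Always pick the single highest-probability token."""
    return max(probs, key=probs.get)


def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then rescale so they sum to 1."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def top_p_filter(probs, p_limit):
    """Keep the most likely tokens until their combined probability reaches p_limit."""
    kept = {}
    running = 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        running += p
        if running >= p_limit:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


def sample(probs, rng):
    """Draw one token at random, weighted by probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token probabilities (made up for illustration).
probs = {"the": 0.50, "a": 0.30, "this": 0.15, "zebra": 0.05}
rng = random.Random(0)  # fixed seed so the run is repeatable

print(greedy(probs))
print(sorted(top_k_filter(probs, 2)))
print(sorted(top_p_filter(probs, 0.9)))
print(sample(top_k_filter(probs, 2), rng))
```

With these numbers, the top-k filter with k = 2 leaves only "the" and "a", while the top-p filter with a limit of 0.9 also keeps "this" and drops only "zebra".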

Low randomness is useful for structured tasks such as formatting data or producing consistent code, though it can repeat dull phrasing. More randomness can help with brainstorming, but it raises the chance of drifting away from the prompt.

Students should pay attention to the prompt itself. Clear instructions, relevant details, examples of the desired format, and limits on assumptions give the model's attention better material to work with. For schoolwork, treat generated text as a draft. Check quotations, dates, calculations, sources, and claims that sound unusually certain.

Key Facts

Text is split into tokens before the model processes it, so one word may become one token or several tokens.
Attention lets each token use information from earlier tokens in the text so far to build context.
The model outputs a score called a logit for every token in its vocabulary.
Probabilities are computed from logits using softmax: P(token i) = e^(z_i) / sum over j of e^(z_j), where z_i is the logit for token i and the sum runs over every token j in the vocabulary.
Temperature changes randomness: higher temperature makes low-probability tokens more likely, while lower temperature makes the top token dominate.
Training adjusts weights to reduce prediction error, often using cross-entropy loss: L = -log(P(correct token)).
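The softmax and temperature facts above can be tried out with a few made-up logits. This sketch assumes the common convention of dividing each logit by the temperature T before applying softmax. The page does not spell out that formula, so treat it as an assumption.

```python
import math


def softmax(logits, temperature=1.0):
    """P(token i) = e^(z_i / T) / sum over j of e^(z_j / T)."""
    scaled = [z / temperature for z in logits]
    biggest = max(scaled)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens

results = {}
for temperature in (0.5, 1.0, 2.0):
    results[temperature] = softmax(logits, temperature)
    print(temperature, [round(p, 3) for p in results[temperature]])
```

At the lowest temperature the top token takes most of the probability. At the highest temperature the three probabilities move closer together, so lower-ranked tokens are picked more often.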

Vocabulary

Token: A token is a piece of text, such as a word, word fragment, number, or punctuation mark, that the model processes as one unit.
Transformer: A transformer is a neural network design that uses attention to process relationships among tokens in a sequence.
Attention: Attention is a method that lets the model decide which earlier tokens are most useful for understanding the current context.
Probability distribution: A probability distribution assigns a probability to each possible next token, with all probabilities adding up to 1.
Temperature: Temperature is a setting that controls how random or predictable the model's token choices are during generation.

Common Mistakes to Avoid

Thinking ChatGPT chooses whole sentences at once. It usually generates text step by step by choosing one token, then using the updated text to choose the next token.
Assuming the highest-probability token is always selected. Sampling settings can allow lower-probability tokens to be chosen, which can make answers more creative but less predictable.
Believing attention means the model understands like a human. Attention measures useful relationships between tokens, but it is still a mathematical pattern-matching process.
Ignoring tokenization when counting words. A single word can be split into multiple tokens, so token limits and model costs do not always match word counts.
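The gap between word counts and token counts can be seen with a toy tokenizer. This sketch uses a simple longest-match rule and a tiny made-up vocabulary. Production tokenizers typically learn their pieces from data with methods such as byte-pair encoding, so real splits will differ.

```python
# Hypothetical vocabulary of known pieces (real vocabularies are far larger).
VOCAB = {"un", "believ", "able", "cat", "s", "the"}


def tokenize_word(word, vocab):
    """Split one word by repeatedly taking the longest known piece."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest possible piece first, then shrink.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            # Fall back to a single character when no known piece fits.
            if piece in vocab or end == start + 1:
                tokens.append(piece)
                start = end
                break
    return tokens


def tokenize(text, vocab):
    """Tokenize each whitespace-separated word in turn."""
    tokens = []
    for word in text.split():
        tokens.extend(tokenize_word(word, vocab))
    return tokens


sentence = "the unbelievable cats"
tokens = tokenize(sentence, VOCAB)
print(tokens)  # ['the', 'un', 'believ', 'able', 'cat', 's']
print(len(sentence.split()), "words ->", len(tokens), "tokens")  # 3 words -> 6 tokens
```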

Practice Questions

1. A model assigns next-token probabilities of 0.50 for cat, 0.30 for dog, 0.15 for bird, and 0.05 for fish. What is the probability that it does not choose cat?
2. A sentence has 18 words, and each word averages 1.4 tokens. About how many tokens will the model process for that sentence?
3. A model gives a fluent but false answer about a science fact. Explain how next-token prediction and training on text patterns can lead to this kind of mistake.
