Computer Science high-school May 24, 2026

How Does ChatGPT Pick Its Next Word?

A probability machine for language

A computer science diagram showing text entering an AI model, being split into tokens, and producing a ranked list of possible next tokens

ChatGPT turns your message into small text pieces, then uses patterns learned from many examples to guess what piece should come next. It gives many possible next pieces different chances, then picks one and repeats the process. It does not understand like a person, but it can produce useful text by following learned language patterns.

Big Idea. Common Core HSS.MD.B.5 connects this topic to using probability to make decisions from uncertain outcomes.

ChatGPT writes one small step at a time. It does not pull a finished paragraph from a file. It reads the text so far, turns it into small pieces called tokens, and calculates which token is likely to come next. Then it adds one token and does the same calculation again. This loop can make a sentence, a poem, or code. The key idea is probability. The model gives each possible next token a score, then turns those scores into chances. A word that fits the context gets a higher chance. A word that does not fit gets a lower chance. Settings can make the choice more steady or more varied. That is why the same prompt can lead to different answers. This article connects to the AI & ML Basics cheat sheet and high school probability standards.

Text becomes tokens

A sentence being divided into smaller token blocks, with each block connected to a numeric ID
A model reads text as tokens, not as full thoughts
ChatGPT does not usually choose a whole word at once. First, it splits text into tokens. A token can be a whole word, part of a word, a number, a space, or a punctuation mark. The sentence "The robot smiles" might become pieces like "The", " robot", and " smiles". Longer or unusual words may be split into smaller parts. This helps the model handle words it has not seen as complete words before. It also gives the model a consistent way to turn text into numbers. Computers work with numbers, so each token is matched to an ID. The ID itself is only a label. The model looks up a list of numbers linked to each ID and does its math on those lists. When it answers, it predicts token IDs. Those IDs are then turned back into readable text.
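
The splitting step can be sketched in a few lines of Python. This is only a toy: the vocabulary and the IDs below are made up for illustration, and the "try the longest piece first" rule is a simplification standing in for the more involved methods real tokenizers use.

```python
# Toy vocabulary. These tokens and IDs are hypothetical, not ChatGPT's real ones.
VOCAB = {"The": 0, " robot": 1, " smiles": 2, " smil": 3, "ing": 4}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def encode(text):
    """Split text into token IDs by always taking the longest known piece."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining piece first, then shorter ones.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            # No piece in the toy vocabulary starts at this position.
            raise ValueError(f"no token found for text starting at {text[pos:]!r}")
    return ids


def decode(ids):
    """Turn token IDs back into readable text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


print(encode("The robot smiles"))   # [0, 1, 2]
print(encode("The robot smiling"))  # [0, 1, 3, 4]
print(decode([0, 1, 3, 4]))         # The robot smiling
```

Notice that " smiling" is not in the toy vocabulary, so it is built from the two smaller pieces " smil" and "ing".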

Tokens are the small pieces the model predicts.

Context changes the guess

Tokens in a prompt connected by attention lines, with thicker lines showing stronger influence on the next token
Attention lets earlier tokens influence the next prediction
The next token depends on the tokens before it. If the prompt says "peanut butter and", the token " jelly" is more likely than " volcano". If the prompt says "lava and", the ranking changes. A transformer model handles this by comparing tokens with other tokens in the prompt. This process is called attention. Attention helps the model decide which earlier pieces matter most for the next prediction. In a math problem, a number near the start may matter later. In a story, a character name may matter across many sentences. The model does not store a human memory of the story. It uses number patterns from the visible context. Each layer of the transformer updates those patterns. By the end, the model has a context-shaped representation that helps rank possible next tokens.
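
One way to picture the weighing step is with a tiny calculation. The sketch below assumes each token is described by a short list of numbers. The lists here are invented for the example, while a real model learns much longer ones and repeats this step many times across its layers. A match score is computed for each earlier token, and the scores are turned into weights that add up to 1.

```python
import math


def softmax(scores):
    """Turn any list of scores into positive weights that add up to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Score each key against the query with a scaled dot product."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Hypothetical number patterns for the prompt "peanut butter and".
tokens = ["peanut", " butter", " and"]
keys = [[1.0, 0.2], [1.0, 0.5], [0.1, 0.1]]
query = [2.0, 1.0]  # made-up pattern for "what should come next?"

weights = attention_weights(query, keys)
for token, weight in zip(tokens, weights):
    print(f"{token!r}: {weight:.2f}")
```

With these made-up numbers, "peanut" and " butter" get most of the weight and " and" gets very little, which matches the thicker and thinner lines in the diagram.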

Attention is a way to weigh which earlier tokens matter.

Scores become chances

A bar chart showing possible next tokens with different probabilities after a prompt
The model ranks many possible next tokens
After reading the context, the model gives every token in its vocabulary a score. These raw scores, often called logits, are not yet probabilities. A step called softmax turns them into a probability distribution. That means every possible next token gets a chance, and all the chances add up to 1. If the context is "The sky is", tokens like " blue" and " clear" may get large chances. Tokens like " sandwich" may still get a tiny chance. The model then chooses from this distribution. It can choose the highest-chance token every time, or it can sample from the set of likely tokens. Sampling means a less likely token can appear sometimes. This is one reason AI text can be fluent but not perfectly predictable.
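
Here is a small sketch of that step, using three candidate tokens and invented scores for the context "The sky is". The `softmax` function follows the standard formula. The scores themselves are assumptions chosen only to make the example readable.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into chances that add up to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


tokens = [" blue", " clear", " sandwich"]
scores = [4.0, 3.0, -2.0]  # made-up raw scores, not from a real model

probs = softmax(scores)
for token, p in zip(tokens, probs):
    print(f"{token!r}: {p:.4f}")

# Option 1: always take the token with the highest chance.
greedy_choice = tokens[probs.index(max(probs))]

# Option 2: sample, so less likely tokens can appear sometimes.
rng = random.Random(0)  # fixed seed so the run can be repeated
samples = rng.choices(tokens, weights=probs, k=10)

print("greedy:", greedy_choice)
print("sampled:", samples)
```

Even " sandwich" ends up with a chance above zero. It is just so small that it will rarely be picked.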

The next token is picked from a distribution of chances.

Temperature changes variety

Two probability charts comparing low temperature and high temperature sampling for next token choices
Lower temperature is more steady, higher temperature is more varied
Temperature is a setting that changes how spread out the probabilities are before the model samples. The raw scores are divided by the temperature before softmax turns them into chances. A low temperature makes the highest-chance tokens dominate. This often gives steadier, more predictable answers. A high temperature spreads the chances out more. This makes unusual tokens more likely to be chosen. The result can feel more creative, but it can also be less reliable. Temperature does not give the model new knowledge. It changes how boldly the model samples from what it already ranked. Some systems also use other sampling settings, such as limiting choices to the k most likely tokens, which is often called top-k sampling. These controls are like changing the spinner used to pick from the probability distribution. They affect wording, examples, and order, even when the prompt stays the same.
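
A short sketch can show the effect. It assumes the common approach of dividing each raw score by the temperature before softmax, and it uses the same kind of invented scores as a stand-in for a real model's output. The helper `keep_top_k` is a hypothetical name for the "limit choices to the most likely tokens" idea.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Divide scores by the temperature, then turn them into chances."""
    if temperature <= 0:
        raise ValueError("temperature must be greater than 0")
    scaled = [s / temperature for s in scores]
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def keep_top_k(probs, k):
    """Keep only the k largest chances and rescale them to add up to 1."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = [p if i in top else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


tokens = [" blue", " clear", " sandwich"]
scores = [4.0, 3.0, -2.0]  # made-up raw scores

low = softmax_with_temperature(scores, 0.5)   # steadier
high = softmax_with_temperature(scores, 2.0)  # more varied
top_two = keep_top_k(high, 2)                 # " sandwich" is removed

for name, probs in [("low", low), ("high", high), ("top two", top_two)]:
    print(name, [round(p, 3) for p in probs])
```

At the low temperature, " blue" takes most of the chance. At the high temperature, the chances move closer together, so " clear" and even " sandwich" are picked more often.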

Temperature changes variety, not understanding.

One token at a time

A loop diagram showing a prompt, a probability distribution, a chosen token, and updated text feeding back into the model
Generated text is built by repeating the next token step
Once a token is chosen, it becomes part of the context. The model then predicts the next token using the updated text. This repeats until the answer reaches a stopping point. The stopping point might be a special end token, a length limit, or a rule set by the system. This step-by-step process is why early choices can shape the rest of an answer. If the model picks a different first sentence, later sentences may follow that path. It also explains why the same prompt can produce different wording. Each sampled token nudges the context for the next sample. The process can produce useful explanations, but it is not the same as checking truth. A language model predicts text patterns. People and software still need to check facts, calculations, and sources.
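
The loop can be sketched with a toy stand-in for the model. In this sketch, the table `NEXT` is hypothetical and looks only at the last token, while a real model uses the whole visible context. The loop itself has the same shape: get chances, pick a token, add it, repeat, and stop at an end token or a length limit.

```python
import random

END = "<end>"  # hypothetical special token that means "stop here"

# Toy "model": made-up chances for the next token, based on the last token only.
NEXT = {
    "The": {" robot": 0.7, " sky": 0.3},
    " robot": {" smiles": 0.6, " waves": 0.4},
    " sky": {" is": 1.0},
    " is": {" blue": 0.8, " clear": 0.2},
    " smiles": {END: 1.0},
    " waves": {END: 1.0},
    " blue": {END: 1.0},
    " clear": {END: 1.0},
}


def generate(prompt_tokens, max_new_tokens=10, seed=None):
    """Add one sampled token at a time until an end token or the length limit."""
    rng = random.Random(seed)
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        chances = NEXT[context[-1]]  # the toy table must contain the last token
        candidates = list(chances)
        weights = [chances[c] for c in candidates]
        token = rng.choices(candidates, weights=weights)[0]
        if token == END:
            break
        context.append(token)  # the chosen token becomes part of the context
    return "".join(context)


for seed in range(5):
    print(generate(["The"], seed=seed))
```

If the first pick is " sky", the rest of the sentence follows that path and " robot" can no longer appear. That is the same effect as an early choice shaping the rest of an answer.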

Every chosen token changes the context for the next choice.

Vocabulary

Token
A small piece of text, such as a word, part of a word, space, number, or punctuation mark.
Transformer
A type of machine learning model that uses attention to process relationships among tokens in context.
Attention
A method that lets a model give different weights to earlier tokens when predicting a later token.
Probability distribution
A list of possible outcomes with chances assigned to them, where the chances add up to 1.
Temperature
A sampling setting that changes how predictable or varied the next token choice can be.
Sampling
Choosing one outcome from a probability distribution instead of always taking the top option.

In the Classroom

Build a next word spinner

20 minutes | Grades 9-12

Students write a short prompt and list five possible next words with assigned probabilities that add to 1. They use a spinner or random number table to sample the next word several times and compare the outputs.
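
For classes that want to check their spinner results with a computer, here is one possible sketch. The prompt, the five words, and their chances are invented examples. The `spin` function works like a random number table: each word owns a range between 0 and 1, and the word whose range contains the random number is chosen.

```python
import random

# Hypothetical example: five possible next words after the prompt "The robot".
WORDS = [" smiles", " waves", " beeps", " sleeps", " sings"]
CHANCES = [0.40, 0.25, 0.20, 0.10, 0.05]  # these add up to 1


def spin(r, words, chances):
    """Map a random number r (from 0 up to 1) to a word using running totals."""
    upper = 0.0
    for word, chance in zip(words, chances):
        upper += chance
        if r < upper:
            return word
    return words[-1]  # safety net for rounding when r is very close to 1


rng = random.Random(42)  # fixed seed so the class can repeat the run
counts = {word: 0 for word in WORDS}
for _ in range(1000):
    counts[spin(rng.random(), WORDS, CHANCES)] += 1

print(counts)
```

Students can compare the counts from 1,000 computer spins with their own smaller set of spins and discuss why the two do not match exactly.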

Compare low and high temperature writing

25 minutes | Grades 9-12

Give students the same sentence starter and two different sampling rules. One rule strongly favors the top word, while the other spreads chances more evenly. Students discuss how the outputs change and what is gained or lost.

Attention map sketch

15 minutes | Grades 9-12

Students mark which earlier words matter most for predicting a blank in a sentence. They draw thicker lines for stronger influence and explain their choices using context clues.

Key Takeaways

  • ChatGPT generates text by predicting one token at a time.
  • Tokens are small text pieces that can be words, word parts, spaces, or punctuation.
  • Transformer attention helps the model use context when ranking next token choices.
  • Temperature and sampling settings can make answers more steady or more varied.
  • A language model predicts text patterns and does not understand the way a person does.