Contents
- Introduction — Words Are Just Numbers
- Tokenization — What LLMs Actually See
- Word Embeddings — Words in Space
- From Words to Sequences — Why Context Matters
- Attention — The Breakthrough Idea
- The Transformer Block — Putting It Together
- Pretraining — The Massive Multi-Task Learner
- RLHF — Teaching the Model to Be Helpful
- Modern RL — DPO, GRPO, and Beyond
- The Big Picture — Putting It All Together
Words Are Just Numbers
Here's a question that sounds simple but turns out to be surprisingly deep: how would you teach a computer to understand language?
Computers are, at their core, calculators. They're brilliant at arithmetic, at comparing numbers, at following rules. But language? Language is messy, ambiguous, beautiful, and weird. "I saw her duck" could mean you witnessed her pet waterfowl, or you watched her dodge something.
So how do you bridge this gap? You have to turn words into numbers. The question is: which numbers?
The simplest idea: just assign each word an ID. Cat = 1, dog = 2, fish = 3, quantum = 4, love = 5... You can imagine a giant dictionary with 50,000 entries.
But here's the problem. To the computer, the "distance" between cat (1) and dog (2) is the same as between cat (1) and quantum (4). There's no meaning in these numbers. Cat and dog are related! Cat and quantum are... not.
What if we could assign numbers so that similar words get similar numbers?
That's the key insight. And it turns out, there's an elegantly simple way to do it: look at which words appear near each other in real text. Words that show up in similar contexts probably mean similar things.
"You shall know a word by the company it keeps." — J.R. Firth, 1957
"Dog" and "cat" both appear near words like "pet", "fur", "feed", and "vet". So they should get similar numbers. "Dog" and "quantum" almost never share context. So they should get very different numbers.
This is the foundation of everything that follows. It's called the distributional hypothesis, and it's the idea behind word embeddings.
But first, let's make this concrete. Here's a toy version of the idea. Try dragging the words around on the number line below — notice how some groupings feel "right" and others feel absurd:
See how "king" and "queen" want to be close? And how "banana" feels out of place next to "emperor"? That's exactly the intuition behind word embeddings.
But we're getting ahead of ourselves. Before we talk about embeddings, there's something even more fundamental we need to cover: LLMs don't actually see words.
What LLMs Actually See
Here's a surprise that trips up a lot of people: LLMs don't process words. They process tokens.
What's a token? It's a chunk of text — sometimes a whole word, sometimes part of a word, sometimes just a single character. The word "unbelievable" might get split into three tokens: ["un", "believ", "able"].
Why? Because it's a much smarter strategy than trying to have a separate entry for every possible word. Think about it: "run", "running", "runner", "runs" — these all share the root "run". If the model learns what "run" means as a token, it can reuse that knowledge across all these forms.
The most common approach is called Byte Pair Encoding (BPE). The idea is beautifully simple:
- Start with individual characters as your tokens
- Find the pair of tokens that appears most frequently next to each other
- Merge that pair into a new token
- Repeat thousands of times
Common words like "the" become single tokens. Rare words like "defenestration" get split into pieces. This gives you the best of both worlds: efficiency for common words and flexibility for rare ones.
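Here's a toy sketch of that merge loop in Python. It's illustrative only: production tokenizers like GPT's operate on bytes and pre-split text, and learn tens of thousands of merges rather than a handful.

```python
# Toy BPE: repeatedly merge the most frequent adjacent pair of tokens.
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent pairs and return the most common one (or None)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

text = "the cat sat on the mat the cat ran"
tokens = list(text)            # start with individual characters
for _ in range(10):            # real training repeats this thousands of times
    pair = most_frequent_pair(tokens)
    if pair is None:
        break
    tokens = merge_pair(tokens, pair)

print(tokens)  # frequent pairs merge first, building up chunks like "at" and "the"
```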
Try it yourself! Type something below and see how it gets tokenized:
Most modern LLMs use vocabularies of 30,000–100,000 tokens. GPT-4 uses about 100,000. Each token gets assigned a unique ID number, and those are the numbers that actually get fed into the model.
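You can see this on real text with OpenAI's open-source tiktoken library, which exposes the same BPE vocabularies its models use. This is a quick sketch assuming you have it installed (pip install tiktoken); the exact splits you get are illustrative.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")         # a ~100,000-token BPE vocabulary

text = "Defenestration is unbelievable."
token_ids = enc.encode(text)                       # text -> token IDs (the numbers the model sees)
chunks = [enc.decode([tid]) for tid in token_ids]  # decode each ID back to its text chunk

print(token_ids)   # a list of integers, one per token
print(chunks)      # common words stay whole; rare words get split into pieces
assert enc.decode(token_ids) == text               # tokenization round-trips losslessly
```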
So the pipeline so far is: text → tokens → token IDs. But a token ID is just an arbitrary number (like our cat=1, dog=2 problem from before). We need something richer. We need embeddings.
Words in Space
Here's where things get genuinely beautiful.
Instead of representing a word with a single number, what if we used a whole list of numbers? Not just one dimension, but many. In practice, modern embeddings use 768, 1024, or even 4096 dimensions.
But let's start with something we can visualize: two dimensions. Each word gets an x-coordinate and a y-coordinate, placing it as a point in 2D space.
The magic is in how these coordinates are learned. The most famous approach, word2vec (2013), works like this:
- Take a huge pile of text (Wikipedia, books, the web...)
- For each word, look at the words that surround it (its "context")
- Train a simple neural network to predict context from a word (Skip-gram) or a word from its context (CBOW)
- The "side effect" of this training: the network learns number-representations for each word where similar words end up nearby
The result? Words that mean similar things cluster together. Animals clump together, colors clump together, countries clump together. Explore the space below:
See how the clusters form? Animals hang out with animals, countries with countries. This isn't programmed — it emerges from the training process. The model discovers these relationships by itself, just from reading lots of text.
And here's the part that really blew people's minds when word2vec was published. You can do arithmetic with these word vectors:
king − man + woman ≈ queen
Take the vector for "king", subtract "man", add "woman" — and the result lands closest to "queen"! The model has somehow learned that "king" is to "man" as "queen" is to "woman".
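To make the arithmetic concrete, here is a tiny sketch with hand-made four-dimensional vectors. The numbers and "dimensions" are invented purely for illustration; real embeddings learn hundreds of dimensions with no human-readable labels.

```python
import numpy as np

vectors = {  # toy dimensions, loosely: [royalty, maleness, femaleness, fruitness]
    "king":   np.array([0.9, 0.8, 0.1, 0.0]),
    "queen":  np.array([0.9, 0.1, 0.8, 0.0]),
    "man":    np.array([0.1, 0.9, 0.1, 0.0]),
    "woman":  np.array([0.1, 0.1, 0.9, 0.0]),
    "banana": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(a, b):
    """Similarity of direction, ignoring length."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vectors["king"] - vectors["man"] + vectors["woman"]
ranked = sorted(vectors, key=lambda w: cosine(target, vectors[w]), reverse=True)
print(ranked[0])  # "queen": the closest vector to king - man + woman
```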
Try some word arithmetic yourself:
This was the state of the art circa 2013–2017. Word embeddings like word2vec and GloVe were a huge breakthrough. But they had a fatal flaw...
Why Context Matters
Here's the problem: a word like "bank" has one embedding. But "bank" means completely different things in these two sentences:
"I sat by the river bank" and "I went to the bank for money." The word "bank" gets the same vector in both cases. The model can't tell them apart!
This is the fundamental limitation of static word embeddings. We need representations that change depending on context.
Before the transformer revolution, people tried recurrent neural networks (RNNs) and LSTMs. The idea: process words one at a time, left to right, maintaining a "memory" of what came before. The problem: that memory fades over long stretches of text, so by the end of a long passage the model has largely forgotten how it began.
And there was a second problem: RNNs process words sequentially. Word 1, then word 2, then word 3... This made them painfully slow to train, because the computation couldn't be parallelized across the sequence.
What we really wanted was a way for every word to look at every other word simultaneously. To understand each word in the full context of the entire sentence, all at once.
In 2017, a team at Google published a paper with an unforgettable title: "Attention Is All You Need." And the world changed.
The Breakthrough Idea
Attention is the single most important concept in modern AI. Here's the core idea:
Every word gets to look at every other word and decide how much to "pay attention" to it.
Think about reading the sentence "The cat sat on the mat because it was tired." When you read "it", your brain instantly connects it to "cat" — not "mat", not "the". You attend to the right words automatically.
The attention mechanism does the same thing, but mathematically. Let's break it down.
Each word creates three vectors from its embedding:
- Query (Q) — "What am I looking for?" (like a search query)
- Key (K) — "What do I contain?" (like a book title)
- Value (V) — "What information do I carry?" (like the book's content)
The analogy is a library search: your Query is matched against every word's Key to get a relevance score. Then you use those scores to grab a weighted mix of every word's Value.
Mathematically:
Attention(Q, K, V) = softmax(Q · Kᵀ / √dₖ) · V
Here dₖ is the dimension of the key vectors; dividing by √dₖ keeps the dot products from growing so large that the softmax saturates.
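And here is that formula as a few lines of PyTorch: a bare-bones sketch with made-up sizes and random weights, leaving out masking, batching, and multiple heads.

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # how well each query matches each key
    weights = F.softmax(scores, dim=-1)            # attention weights; each row sums to 1
    return weights @ V, weights                    # weighted mix of the values

seq_len, d_k = 5, 8                                # e.g. 5 tokens, 8-dim Q/K/V vectors
x = torch.randn(seq_len, d_k)                      # stand-in for token embeddings
W_q, W_k, W_v = (torch.randn(d_k, d_k) for _ in range(3))

out, weights = attention(x @ W_q, x @ W_k, x @ W_v)
print(weights.shape)  # (5, 5): one attention distribution per token
print(out.shape)      # (5, 8): each token's new, context-aware representation
```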
Let's see it in action. Hover over words in the sentence below to see what each word "attends" to:
Notice something fascinating: different attention heads learn different things! One head might specialize in resolving pronouns ("it" → "cat"), another in tracking syntactic structure (verbs attending to subjects), and another in simply paying attention to nearby words.
In practice, models use 12, 32, or even 128 attention heads running in parallel. Each learns its own pattern, and their outputs are combined. This is called multi-head attention.
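PyTorch ships a ready-made module for this. A quick sketch with assumed sizes (512-dimensional embeddings, 8 heads):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(1, 10, 512)   # a batch of 1 sentence, 10 tokens, 512-dim embeddings

# Self-attention: the same sequence plays query, key, and value
out, attn_weights = mha(x, x, x)
print(out.shape)           # (1, 10, 512): same shape, now context-enriched
print(attn_weights.shape)  # (1, 10, 10): attention weights, averaged over the 8 heads
```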
Now we have all the ingredients to understand the full transformer architecture.
Putting It Together
A transformer model is built from transformer blocks, stacked on top of each other like LEGO bricks. Each block takes in a sequence of token embeddings and outputs an improved sequence of embeddings — same shape, but each token now has a richer understanding of the whole sentence.
Each transformer block has four key components: self-attention, a feed-forward network, residual connections, and layer normalization.
Let's unpack each piece with an analogy. Imagine a team meeting:
- Self-Attention is the part where everyone in the room shares their notes with everyone else. Each person (token) gets to hear what everyone else knows. After this step, your understanding of each word is enriched by the context of all other words.
- Feed-Forward Network is the part where each person goes back to their desk and thinks privately about what they just heard. It's a small neural network applied to each token independently, doing deeper processing.
- Residual Connections (the "skip" arrows) ensure you don't forget what you knew before the meeting. The original input is added back to the output. This is crucial — it means the model can learn to make small refinements rather than needing to reconstruct everything from scratch.
- Layer Normalization keeps everyone "at a similar volume level." Without it, the numbers can grow out of control as they pass through many layers.
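Here is a minimal transformer block in PyTorch that wires those four components together. The sizes are assumptions for illustration, and a real GPT-style block would also add dropout and a causal mask so tokens can't peek at the future:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(          # feed-forward: per-token "private thinking"
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)  # self-attention: the "team meeting"
        x = self.norm1(x + attn_out)      # residual connection + layer norm
        x = self.norm2(x + self.ff(x))    # residual connection + layer norm
        return x                          # same shape in, same shape out

block = TransformerBlock()
tokens = torch.randn(1, 10, 512)          # 10 token embeddings
print(block(tokens).shape)                # torch.Size([1, 10, 512])
```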
GPT-3 stacks 96 of these blocks. GPT-4 likely has even more. Each layer refines the representations further — early layers handle grammar and syntax, middle layers capture semantics and facts, and deep layers do complex reasoning.
But a transformer is just an architecture — a blueprint. The magic happens in how it's trained.
The Massive Multi-Task Learner
Here's the punchline of the whole story. The training objective for GPT and its descendants is embarrassingly simple:
Predict the next token.
That's it. Given a sequence of tokens, predict what comes next. "The capital of France is ___" → "Paris". "def fibonacci(n): ___" → "if". "She felt sad because ___" → "her".
This sounds almost too simple. But here's the key insight that makes it work:
To predict the next word well across all possible texts, you implicitly need to learn an enormous range of skills.
To predict "Paris" after "The capital of France is", you need geography. To predict "her" after "She felt sad because", you need emotional reasoning. To predict the next line of code, you need to understand programming.
One objective, thousands of implicit tasks. This is why next-token prediction is sometimes called a "massive multi-task objective in disguise."
The training data is enormous: GPT-3 was trained on roughly 300 billion tokens (about 500GB of text). More recent models use trillions. This includes books, Wikipedia, web pages, code repositories, and much more.
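In code, the objective really is that short. The sketch below uses a stand-in model (TinyLM, just an embedding table and an output layer, with the transformer blocks omitted) to show how the targets are simply the input shifted by one token:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)  # real models put transformer blocks in between

    def forward(self, ids):
        return self.lm_head(self.embed(ids))           # (batch, seq, vocab) logits

model = TinyLM()
token_ids = torch.randint(0, vocab_size, (2, 16))      # a fake batch: 2 sequences, 16 tokens each

logits = model(token_ids[:, :-1])                      # predict from all but the last token
targets = token_ids[:, 1:]                             # each target is simply the next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # one step of "predict the next token", repeated over trillions of tokens
```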
Now, here's the process. Type a prompt below and watch the model predict tokens one at a time:
After pretraining, you have a model that can complete any text. Give it a news article, it'll write more news. Give it poetry, it'll write more poetry. Give it code, it'll write more code.
But there's a problem: it'll also happily complete toxic text with more toxic text. It has no judgment about what's helpful or appropriate. It's like a brilliant student who has read everything but has no values.
Enter: alignment.
Teaching the Model to Be Helpful
A pretrained language model is like a wildly talented but completely unfiltered person. Ask it a question, and it might give you a helpful answer — or a toxic rant, or a hallucinated lie, or a wall of incoherent text. It was just trained to predict what comes next.
RLHF — Reinforcement Learning from Human Feedback — is how we transform this raw text predictor into the helpful, harmless assistants we actually want. It's a three-step process:
Let's walk through each step:
Step 1: Supervised Fine-Tuning (SFT). Take the pretrained model and show it thousands of examples of good assistant behavior. "Here's a question, here's how a good assistant would answer." This is like apprenticeship — the model learns to mimic the format and tone of helpful responses.
Step 2: Reward Model. Have the SFT model generate multiple responses to the same prompt. Then have humans rank them: "This response is better than that one." Use thousands of these comparisons to train a separate reward model — a neural network that scores how "good" a response is.
Step 3: Reinforcement Learning (PPO). Use the reward model as the training signal. The LLM generates a response, the reward model scores it, and the LLM is updated to produce higher-scoring responses. This is the same family of policy-gradient reinforcement learning that trained systems like AlphaGo.
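The heart of Step 2 is a pairwise loss: score the preferred response higher than the rejected one. A minimal sketch, with placeholder scores standing in for a real reward network's outputs:

```python
import torch
import torch.nn.functional as F

# Imagine the reward model scored two responses to the same prompt:
reward_chosen = torch.tensor([1.3], requires_grad=True)    # the response humans preferred
reward_rejected = torch.tensor([0.4], requires_grad=True)  # the response humans ranked lower

# Pairwise (Bradley-Terry) loss: low when chosen >> rejected, high when the order is wrong
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()  # gradients nudge the reward model to widen the gap
```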
Try being a human rater yourself!
The result of RLHF is dramatic. The same model that would previously complete any text — helpful or harmful — now actively tries to be useful. This is the difference between GPT-3 and ChatGPT.
But RLHF has limitations. The reward model can be gamed ("reward hacking"). PPO is finicky and expensive. Researchers have been looking for better approaches...
DPO, GRPO, and Beyond
RLHF was a breakthrough, but it's complex. You need to train a separate reward model, wrangle PPO's instability, and collect mountains of human labels. Researchers asked: can we simplify?
DPO (Direct Preference Optimization) came first. The insight: you don't need a separate reward model. Train directly on preference pairs — "Response A is better than Response B" — and mathematically bake the reward model's job into the loss function itself. Simpler, more stable, and often works just as well.
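A sketch of the DPO loss for a single preference pair. The log-probabilities here are placeholder numbers; in practice they would be the summed per-token log-probs of each response under the current policy and under a frozen reference model:

```python
import torch
import torch.nn.functional as F

beta = 0.1  # how strongly to keep the policy close to the reference model

# log p(response | prompt) under the current policy and the frozen reference
policy_chosen = torch.tensor(-12.0, requires_grad=True)
policy_rejected = torch.tensor(-15.0, requires_grad=True)
ref_chosen, ref_rejected = torch.tensor(-13.0), torch.tensor(-14.0)

# Implicit "rewards" are log-ratios against the reference model
chosen_reward = beta * (policy_chosen - ref_chosen)
rejected_reward = beta * (policy_rejected - ref_rejected)

# Same pairwise form as the reward-model loss, but applied directly to the policy
loss = -F.logsigmoid(chosen_reward - rejected_reward)
loss.backward()
```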
But then came an even more interesting idea: what if the model could generate its own training signal?
GRPO (Group Relative Policy Optimization) takes this further:
- Given a prompt, generate multiple responses ("rollouts")
- Score each response (using a reward model, verifier, or correctness check)
- Compare each response's score to the group average: above-average responses serve as positive examples, below-average ones as negative
- Update the model to be more like its own best work
The analogy: instead of hiring a critic, you try many approaches yourself and learn from comparing your own best vs worst work. Like a writer who generates four drafts and improves by studying what made the best one good.
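The core bookkeeping is easy to sketch: sample a group of responses, score them, and convert each score into a group-relative advantage. The rewards below are made up; in a real system each response's tokens would then be reinforced in proportion to its advantage.

```python
import torch

# Rewards for 4 sampled responses ("rollouts") to the same prompt,
# e.g. 1.0 if a math answer checks out, 0.0 if it doesn't.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])

# Group-relative advantage: how much better than its siblings each response was
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)  # positive for above-average responses, negative for below-average

# Each response's tokens are then nudged up or down in proportion to its
# advantage, using a PPO-style clipped update, but with no separate value
# network: the group itself provides the baseline.
```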
This is especially powerful for reasoning tasks. If the model can verify whether an answer is correct (math, code that passes tests), it can generate its own training signal without any human labeling!
Modern LLMs can now improve themselves through practice — especially on problems with verifiable answers. The model generates, evaluates, and refines — getting better through its own experience.
Putting It All Together
Let's step back and see the full picture:
Each step builds on the last:
- Text is broken into tokens
- Tokens become embeddings — rich numerical representations
- Embeddings pass through attention layers that capture context
- These layers are stacked into transformer blocks
- The transformer is pretrained on next-token prediction over trillions of tokens
- The pretrained model is aligned via RLHF, DPO, or GRPO to be helpful
And that's how you get from raw text on the internet to the AI assistant you're chatting with right now.
The next time someone says "it's just predicting the next word," you'll know there's a lot more to it. Next-word prediction is the seed, but the tree that grows from it — through clever architecture, massive scale, and careful alignment — is genuinely extraordinary.
You don't truly understand something until you can predict it. And now, you can predict quite a lot about how LLMs work.
✽
Further Resources
- Attention Is All You Need (Vaswani et al., 2017) — the paper that started it all.
- Language Models are Unsupervised Multitask Learners (GPT-2 paper) — scaling up next-token prediction.
- Training language models to follow instructions with human feedback (InstructGPT) — the RLHF paper.
- Direct Preference Optimization (Rafailov et al., 2023) — simplifying alignment.
- DeepSeek-R1 — GRPO for reasoning models at scale.
- 3Blue1Brown: Neural Networks — excellent visual explanations of the math.
- The Illustrated Transformer by Jay Alammar — the gold standard for transformer visualizations.
Thank you for reading. This explainer was built as a self-contained HTML page with all visualizations in pure SVG and JavaScript.