Contents
- Introduction — Words Are Just Numbers
- Tokenization — What LLMs Actually See
- Word Embeddings — Words in Space
- From Words to Sequences — Why Context Matters
- Attention — The Breakthrough Idea
- The Transformer Block — Putting It Together
- Pretraining — The Massive Multi-Task Learner
- RLHF — Teaching the Model to Be Helpful
- Modern RL — DPO, GRPO, and Beyond
- The Big Picture — Putting It All Together
Words Are Just Numbers
Here's a question that sounds simple but turns out to be surprisingly deep: how would you teach a computer to understand language?
Computers are, at their core, calculators. They're brilliant at arithmetic, at comparing numbers, at following rules. But language? Language is messy, ambiguous, beautiful, and weird. "I saw her duck" could mean you witnessed her pet waterfowl, or you watched her dodge something.
So how do you bridge this gap? You have to turn words into numbers. The question is: which numbers?
The simplest idea: just assign each word an ID. Cat = 1, dog = 2, fish = 3, quantum = 4, love = 5... You can imagine a giant dictionary with 50,000 entries.
But here's the problem. To the computer, the "distance" between cat (1) and dog (2) is the same as between cat (1) and quantum (4). There's no meaning in these numbers. Cat and dog are related! Cat and quantum are... not.
What if we could assign numbers so that similar words get similar numbers?
That's the key insight. And it turns out, there's an elegantly simple way to do it: look at which words appear near each other in real text. Words that show up in similar contexts probably mean similar things.
"You shall know a word by the company it keeps." — J.R. Firth, 1957
"Dog" and "cat" both appear near words like "pet", "fur", "feed", and "vet". So they should get similar numbers. "Dog" and "quantum" almost never share context. So they should get very different numbers.
This is the foundation of everything that follows. It's called the distributional hypothesis, and it's the idea behind word embeddings.
But first, let's make this concrete. Here's a toy version of the idea. Try dragging the words around on the number line below — notice how some groupings feel "right" and others feel absurd:
See how "king" and "queen" want to be close? And how "banana" feels out of place next to "emperor"? That's exactly the intuition behind word embeddings.
But we're getting ahead of ourselves. Before we talk about embeddings, there's something even more fundamental we need to cover: LLMs don't actually see words.
What LLMs Actually See
Here's a surprise that trips up a lot of people: LLMs don't process words. They process tokens.
What's a token? It's a chunk of text — sometimes a whole word, sometimes part of a word, sometimes just a single character. The word "unbelievable" might get split into three tokens: ["un", "believ", "able"].
Why? Because it's a much smarter strategy than trying to have a separate entry for every possible word. Think about it: "run", "running", "runner", "runs" — these all share the root "run". If the model learns what "run" means as a token, it can reuse that knowledge across all these forms.
The most common approach is called Byte Pair Encoding (BPE). The idea is beautifully simple:
- Start with individual characters as your tokens
- Find the pair of tokens that appears most frequently next to each other
- Merge that pair into a new token
- Repeat thousands of times
Common words like "the" become single tokens. Rare words like "defenestration" get split into pieces. This gives you the best of both worlds: efficiency for common words and flexibility for rare ones.
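Here's a toy sketch of that merge loop in Python. It's illustrative only: production tokenizers like GPT's operate on bytes and pre-split text, and learn tens of thousands of merges rather than a handful.

```python
# Toy BPE: repeatedly merge the most frequent adjacent pair of tokens.
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent pairs and return the most common one (or None)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

text = "the cat sat on the mat the cat ran"
tokens = list(text)            # start with individual characters
for _ in range(10):            # real training repeats this thousands of times
    pair = most_frequent_pair(tokens)
    if pair is None:
        break
    tokens = merge_pair(tokens, pair)

print(tokens)  # frequent pairs merge first, building up chunks like "at" and "the"
```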
Try it yourself! Type something below and see how it gets tokenized:
Most modern LLMs use vocabularies of 30,000–100,000 tokens. GPT-4 uses about 100,000. Each token gets assigned a unique ID number, and those are the numbers that actually get fed into the model.
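You can see this on real text with OpenAI's open-source tiktoken library, which exposes the same BPE vocabularies its models use. This is a quick sketch assuming you have it installed (pip install tiktoken); the exact splits you get are illustrative.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")         # a ~100,000-token BPE vocabulary

text = "Defenestration is unbelievable."
token_ids = enc.encode(text)                       # text -> token IDs (the numbers the model sees)
chunks = [enc.decode([tid]) for tid in token_ids]  # decode each ID back to its text chunk

print(token_ids)   # a list of integers, one per token
print(chunks)      # common words stay whole; rare words get split into pieces
assert enc.decode(token_ids) == text               # tokenization round-trips losslessly
```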
So the pipeline so far is: text → tokens → token IDs. But a token ID is just an arbitrary number (like our cat=1, dog=2 problem from before). We need something richer. We need embeddings.
Words in Space
Here's where things get genuinely beautiful.
Instead of representing a word with a single number, what if we used a whole list of numbers? Not just one dimension, but many. In practice, modern embeddings use 768, 1024, or even 4096 dimensions.
But let's start with something we can visualize: two dimensions. Each word gets an x-coordinate and a y-coordinate, placing it as a point in 2D space.
The magic is in how these coordinates are learned. The most famous approach, word2vec (2013), works like this:
- Take a huge pile of text (Wikipedia, books, the web...)
- For each word, look at the words that surround it (its "context")
- Train a simple neural network to predict context from a word (Skip-gram) or a word from its context (CBOW)
- The "side effect" of this training: the network learns number-representations for each word where similar words end up nearby
The result? Words that mean similar things cluster together. Animals clump together, colors clump together, countries clump together. Explore the space below:
See how the clusters form? Animals hang out with animals, countries with countries. This isn't programmed — it emerges from the training process. The model discovers these relationships by itself, just from reading lots of text.
And here's the part that really blew people's minds when word2vec was published. You can do arithmetic with these word vectors:
king − man + woman ≈ queen
Take the vector for "king", subtract "man", add "woman" — and the result lands closest to "queen"! The model has somehow learned that "king" is to "man" as "queen" is to "woman".
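To make the arithmetic concrete, here is a tiny sketch with hand-made four-dimensional vectors. The numbers and "dimensions" are invented purely for illustration; real embeddings learn hundreds of dimensions with no human-readable labels.

```python
import numpy as np

vectors = {  # toy dimensions, loosely: [royalty, maleness, femaleness, fruitness]
    "king":   np.array([0.9, 0.8, 0.1, 0.0]),
    "queen":  np.array([0.9, 0.1, 0.8, 0.0]),
    "man":    np.array([0.1, 0.9, 0.1, 0.0]),
    "woman":  np.array([0.1, 0.1, 0.9, 0.0]),
    "banana": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(a, b):
    """Similarity of direction, ignoring length."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vectors["king"] - vectors["man"] + vectors["woman"]
ranked = sorted(vectors, key=lambda w: cosine(target, vectors[w]), reverse=True)
print(ranked[0])  # "queen": the closest vector to king - man + woman
```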
Try some word arithmetic yourself:
This was the state of the art circa 2013–2017. Word embeddings like word2vec and GloVe were a huge breakthrough. But they had a fatal flaw...
Why Context Matters
Here's the problem: a word like "bank" has one embedding. But "bank" means completely different things in these two sentences:
"I sat by the river bank" and "I went to the bank for money." The word "bank" gets the same vector in both cases. The model can't tell them apart!
This is the fundamental limitation of static word embeddings. We need representations that change depending on context.
Before the transformer revolution, people tried recurrent neural networks (RNNs) and LSTMs. The idea: process words one at a time, left to right, maintaining a "memory" of what came before. The problem: that memory fades over long stretches of text, so by the end of a long passage the model has largely forgotten how it began.
And there was a second problem: RNNs process words sequentially. Word 1, then word 2, then word 3... This made them painfully slow to train, because the computation couldn't be parallelized across the sequence.
What we really wanted was a way for every word to look at every other word simultaneously. To understand each word in the full context of the entire sentence, all at once.
In 2017, a team at Google published a paper with an unforgettable title: "Attention Is All You Need." And the world changed.
The Breakthrough Idea
Attention is the single most important concept in modern AI. Here's the core idea:
Every word gets to look at every other word and decide how much to "pay attention" to it.
Think about reading the sentence "The cat sat on the mat because it was tired." When you read "it", your brain instantly connects it to "cat" — not "mat", not "the". You attend to the right words automatically.
The attention mechanism does the same thing, but mathematically. Let's break it down.
Each word creates three vectors from its embedding:
- Query (Q) — "What am I looking for?" (like a search query)
- Key (K) — "What do I contain?" (like a book title)
- Value (V) — "What information do I carry?" (like the book's content)
The analogy is a library search: your Query is matched against every word's Key to get a relevance score. Then you use those scores to grab a weighted mix of every word's Value.
Mathematically:
Attention(Q, K, V) = softmax(Q · Kᵀ / √dₖ) · V
Here dₖ is the dimension of the key vectors; dividing by √dₖ keeps the dot products from growing so large that the softmax saturates.
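And here is that formula as a few lines of PyTorch: a bare-bones sketch with made-up sizes and random weights, leaving out masking, batching, and multiple heads.

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # how well each query matches each key
    weights = F.softmax(scores, dim=-1)            # attention weights; each row sums to 1
    return weights @ V, weights                    # weighted mix of the values

seq_len, d_k = 5, 8                                # e.g. 5 tokens, 8-dim Q/K/V vectors
x = torch.randn(seq_len, d_k)                      # stand-in for token embeddings
W_q, W_k, W_v = (torch.randn(d_k, d_k) for _ in range(3))

out, weights = attention(x @ W_q, x @ W_k, x @ W_v)
print(weights.shape)  # (5, 5): one attention distribution per token
print(out.shape)      # (5, 8): each token's new, context-aware representation
```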
Let's see it in action. Hover over words in the sentence below to see what each word "attends" to:
Notice something fascinating: different attention heads learn different things! One head might specialize in resolving pronouns ("it" → "cat"), another in tracking syntactic structure (verbs attending to subjects), and another in simply paying attention to nearby words.
In practice, models use 12, 32, or even 128 attention heads running in parallel. Each learns its own pattern, and their outputs are combined. This is called multi-head attention.
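PyTorch ships a ready-made module for this. A quick sketch with assumed sizes (512-dimensional embeddings, 8 heads):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(1, 10, 512)   # a batch of 1 sentence, 10 tokens, 512-dim embeddings

# Self-attention: the same sequence plays query, key, and value
out, attn_weights = mha(x, x, x)
print(out.shape)           # (1, 10, 512): same shape, now context-enriched
print(attn_weights.shape)  # (1, 10, 10): attention weights, averaged over the 8 heads
```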
Now we have all the ingredients to understand the full transformer architecture.
Putting It Together
A transformer model is built from transformer blocks, stacked on top of each other like LEGO bricks. Each block takes in a sequence of token embeddings and outputs an improved sequence of embeddings — same shape, but each token now has a richer understanding of the whole sentence.
Each transformer block has four key components: self-attention, a feed-forward network, residual connections, and layer normalization.
Let's unpack each piece with an analogy. Imagine a team meeting:
- Self-Attention is the part where everyone in the room shares their notes with everyone else. Each person (token) gets to hear what everyone else knows. After this step, your understanding of each word is enriched by the context of all other words.
- Feed-Forward Network is the part where each person goes back to their desk and thinks privately about what they just heard. It's a small neural network applied to each token independently, doing deeper processing.
- Residual Connections (the "skip" arrows) ensure you don't forget what you knew before the meeting. The original input is added back to the output. This is crucial — it means the model can learn to make small refinements rather than needing to reconstruct everything from scratch.
- Layer Normalization keeps everyone "at a similar volume level." Without it, the numbers can grow out of control as they pass through many layers.
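Here is a minimal transformer block in PyTorch that wires those four components together. The sizes are assumptions for illustration, and a real GPT-style block would also add dropout and a causal mask so tokens can't peek at the future:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(          # feed-forward: per-token "private thinking"
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)  # self-attention: the "team meeting"
        x = self.norm1(x + attn_out)      # residual connection + layer norm
        x = self.norm2(x + self.ff(x))    # residual connection + layer norm
        return x                          # same shape in, same shape out

block = TransformerBlock()
tokens = torch.randn(1, 10, 512)          # 10 token embeddings
print(block(tokens).shape)                # torch.Size([1, 10, 512])
```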
GPT-3 stacks 96 of these blocks. GPT-4 likely has even more. Each layer refines the representations further — early layers handle grammar and syntax, middle layers capture semantics and facts, and deep layers do complex reasoning.
But a transformer is just an architecture — a blueprint. The magic happens in how it's trained.
The Massive Multi-Task Learner
Here's the punchline of the whole story. The training objective for GPT and its descendants is embarrassingly simple:
Predict the next token.
That's it. Given a sequence of tokens, predict what comes next. "The capital of France is ___" → "Paris". "def fibonacci(n): ___" → "if". "She felt sad because ___" → "her".
This sounds almost too simple. But here's the key insight that makes it work:
To predict the next word well across all possible texts, you implicitly need to learn an enormous range of skills.
To predict "Paris" after "The capital of France is", you need geography. To predict "her" after "She felt sad because", you need emotional reasoning. To predict the next line of code, you need to understand programming.
One objective, thousands of implicit tasks. This is why next-token prediction is sometimes called a "massive multi-task objective in disguise."
The training data is enormous: GPT-3 was trained on roughly 300 billion tokens (about 500GB of text). More recent models use trillions. This includes books, Wikipedia, web pages, code repositories, and much more.
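In code, the objective really is that short. The sketch below uses a stand-in model (TinyLM, just an embedding table and an output layer, with the transformer blocks omitted) to show how the targets are simply the input shifted by one token:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)  # real models put transformer blocks in between

    def forward(self, ids):
        return self.lm_head(self.embed(ids))           # (batch, seq, vocab) logits

model = TinyLM()
token_ids = torch.randint(0, vocab_size, (2, 16))      # a fake batch: 2 sequences, 16 tokens each

logits = model(token_ids[:, :-1])                      # predict from all but the last token
targets = token_ids[:, 1:]                             # each target is simply the next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # one step of "predict the next token", repeated over trillions of tokens
```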
Now, here's the process. Type a prompt below and watch the model predict tokens one at a time:
After pretraining, you have a model that can complete any text. Give it a news article, it'll write more news. Give it poetry, it'll write more poetry. Give it code, it'll write more code.
But there's a problem: it'll also happily complete toxic text with more toxic text. It has no judgment about what's helpful or appropriate. It's like a brilliant student who has read everything but has no values.
Enter: alignment.
Teaching the Model to Be Helpful
A pretrained language model is like a wildly talented but completely unfiltered person. Ask it a question, and it might give you a helpful answer — or a toxic rant, or a hallucinated lie, or a wall of incoherent text. It was just trained to predict what comes next.
RLHF — Reinforcement Learning from Human Feedback — is how we transform this raw text predictor into the helpful, harmless assistants we actually want. It's a three-step process:
Let's walk through each step:
Step 1: Supervised Fine-Tuning (SFT). Take the pretrained model and show it thousands of examples of good assistant behavior. "Here's a question, here's how a good assistant would answer." This is like apprenticeship — the model learns to mimic the format and tone of helpful responses.
Step 2: Reward Model. Have the SFT model generate multiple responses to the same prompt. Then have humans rank them: "This response is better than that one." Use thousands of these comparisons to train a separate reward model — a neural network that scores how "good" a response is.
Step 3: Reinforcement Learning (PPO). Use the reward model as the training signal. The LLM generates a response, the reward model scores it, and the LLM is updated to produce higher-scoring responses. This is the same family of policy-gradient reinforcement learning that trained systems like AlphaGo.
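The heart of Step 2 is a pairwise loss: score the preferred response higher than the rejected one. A minimal sketch, with placeholder scores standing in for a real reward network's outputs:

```python
import torch
import torch.nn.functional as F

# Imagine the reward model scored two responses to the same prompt:
reward_chosen = torch.tensor([1.3], requires_grad=True)    # the response humans preferred
reward_rejected = torch.tensor([0.4], requires_grad=True)  # the response humans ranked lower

# Pairwise (Bradley-Terry) loss: low when chosen >> rejected, high when the order is wrong
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()  # gradients nudge the reward model to widen the gap
```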
Try being a human rater yourself!
The result of RLHF is dramatic. The same model that would previously complete any text — helpful or harmful — now actively tries to be useful. This is the difference between GPT-3 and ChatGPT.
But RLHF has limitations. The reward model can be gamed ("reward hacking"). PPO is finicky and expensive. Researchers have been looking for better approaches...
DPO, GRPO, and Beyond
RLHF was a breakthrough, but it's complex. You need to train a separate reward model, wrangle PPO's instability, and collect mountains of human labels. Researchers asked: can we simplify?
DPO (Direct Preference Optimization) came first. The insight: you don't need a separate reward model. Train directly on preference pairs — "Response A is better than Response B" — and mathematically bake the reward model's job into the loss function itself. Simpler, more stable, and often works just as well.
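A sketch of the DPO loss for a single preference pair. The log-probabilities here are placeholder numbers; in practice they would be the summed per-token log-probs of each response under the current policy and under a frozen reference model:

```python
import torch
import torch.nn.functional as F

beta = 0.1  # how strongly to keep the policy close to the reference model

# log p(response | prompt) under the current policy and the frozen reference
policy_chosen = torch.tensor(-12.0, requires_grad=True)
policy_rejected = torch.tensor(-15.0, requires_grad=True)
ref_chosen, ref_rejected = torch.tensor(-13.0), torch.tensor(-14.0)

# Implicit "rewards" are log-ratios against the reference model
chosen_reward = beta * (policy_chosen - ref_chosen)
rejected_reward = beta * (policy_rejected - ref_rejected)

# Same pairwise form as the reward-model loss, but applied directly to the policy
loss = -F.logsigmoid(chosen_reward - rejected_reward)
loss.backward()
```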
But then came an even more interesting idea: what if the model could generate its own training signal?
GRPO (Group Relative Policy Optimization) takes this further:
- Given a prompt, generate multiple responses ("rollouts")
- Score each response (using a reward model, verifier, or correctness check)
- Compare each response's score to the group average: above-average responses serve as positive examples, below-average ones as negative
- Update the model to be more like its own best work
The analogy: instead of hiring a critic, you try many approaches yourself and learn from comparing your own best vs worst work. Like a writer who generates four drafts and improves by studying what made the best one good.
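The core bookkeeping is easy to sketch: sample a group of responses, score them, and convert each score into a group-relative advantage. The rewards below are made up; in a real system each response's tokens would then be reinforced in proportion to its advantage.

```python
import torch

# Rewards for 4 sampled responses ("rollouts") to the same prompt,
# e.g. 1.0 if a math answer checks out, 0.0 if it doesn't.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])

# Group-relative advantage: how much better than its siblings each response was
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)  # positive for above-average responses, negative for below-average

# Each response's tokens are then nudged up or down in proportion to its
# advantage, using a PPO-style clipped update, but with no separate value
# network: the group itself provides the baseline.
```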
This is especially powerful for reasoning tasks. If the model can verify whether an answer is correct (math, code that passes tests), it can generate its own training signal without any human labeling!
Modern LLMs can now improve themselves through practice — especially on problems with verifiable answers. The model generates, evaluates, and refines — getting better through its own experience.
Putting It All Together
Let's step back and see the full picture:
Each step builds on the last:
- Text is broken into tokens
- Tokens become embeddings — rich numerical representations
- Embeddings pass through attention layers that capture context
- These layers are stacked into transformer blocks
- The transformer is pretrained on next-token prediction over trillions of tokens
- The pretrained model is aligned via RLHF, DPO, or GRPO to be helpful
And that's how you get from raw text on the internet to the AI assistant you're chatting with right now.
The next time someone says "it's just predicting the next word," you'll know there's a lot more to it. Next-word prediction is the seed, but the tree that grows from it — through clever architecture, massive scale, and careful alignment — is genuinely extraordinary.
You don't truly understand something until you can predict it. And now, you can predict quite a lot about how LLMs work.
✽
Further Resources
- Attention Is All You Need (Vaswani et al., 2017) — the paper that started it all.
- Language Models are Unsupervised Multitask Learners (GPT-2 paper) — scaling up next-token prediction.
- Training language models to follow instructions with human feedback (InstructGPT) — the RLHF paper.
- Direct Preference Optimization (Rafailov et al., 2023) — simplifying alignment.
- DeepSeek-R1 — GRPO for reasoning models at scale.
- 3Blue1Brown: Neural Networks — excellent visual explanations of the math.
- The Illustrated Transformer by Jay Alammar — the gold standard for transformer visualizations.
Thank you for reading. This explainer was built as a self-contained HTML page with all visualizations in pure SVG and JavaScript.