A Visual Journey

How LLMs Actually Work

From Vectors to Intelligence

Lossfunk

Paras Chopra

Founder & Researcher, Lossfunk

The Mystery

These models can write poetry, debug code, explain quantum physics, and pass the bar exam.

But what is actually happening inside?

"How is it that these models are doing these things that we don't know how to do? Imagine if some alien organism landed on Earth."

Chris Olah, Anthropic

Step 1

Everything is Numbers

A vector is a list of numbers, a point in space.

Each number describes one feature.

Example: Describe a fruit
[sweetness: 8, sourness: 2]

[Figure: two fruits plotted as points [3, 4] and [6, 2] on a plane whose axes are Feature 1 and Feature 2]
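The fruit example can be sketched in a few lines of Python. The 2-dimensional vectors and the `distance` helper below are toys written purely for illustration:

```python
import math

# Toy 2-dimensional "fruit" vectors: [sweetness, sourness]
apple = [8, 2]
grape = [7, 3]
lemon = [2, 9]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance(apple, grape))  # small: apple and grape are similar
print(distance(apple, lemon))  # large: apple and lemon are not
```

Closer points mean more similar items; the model applies the same idea in hundreds of dimensions.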

More Dimensions = Richer Descriptions

2 features: sweetness, sourness

3 features: + size

300 features: every shade of meaning

We can't visualize 300 dimensions, but the math works the same way.

[Figure: apple, lemon, watermelon, and grape plotted along sweetness, sourness, and size axes. Real embeddings: 300–4,096 dims.]

2013: A Breakthrough

Words as Vectors

Tomas Mikolov (Google) showed that words could be represented as vectors, and that these vectors captured meaning.

Similar words cluster together in vector space.

Mikolov et al., "Efficient Estimation of Word Representations in Vector Space," 2013

[Figure: word clusters in vector space: ANIMALS (cat, dog, lion, tiger), COUNTRIES (France, Japan, India, Brazil), EMOTIONS (happy, joyful, sad, angry)]

No one told the model what these words mean. It figured it out from context.

How Word2Vec Learns

Words that appear together should have similar vectors.

Words that don't co-occur should be pushed apart.

After billions of examples, the space self-organizes. Meaning emerges.

[Figure: training sentence "The cat sat on the mat." with a sliding context window; "cat" and "sat" are pulled together, "cat" and "stocks" are pushed apart]

Vector Arithmetic

king − man + woman ≈ queen

Relationships become geometric operations in vector space.

Paris − France + Japan ≈ Tokyo

"There seems to be a constant male-female difference vector."

Chris Olah
[Figure: "king", "man", "woman", and "queen" in vector space. The "royalty" direction is preserved; the "gender" direction changes.]
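The analogy can be reproduced with hand-made toy vectors. The four 4-dimensional vectors below are invented for illustration; real Word2Vec vectors are learned and have hundreds of dimensions:

```python
# Toy 4-d word vectors, hand-made so the analogy works exactly.
# Dimensions (roughly): [royalty, maleness, femaleness, "is a word"]
vectors = {
    "king":  [0.9, 0.9, 0.1, 0.2],
    "man":   [0.1, 0.9, 0.1, 0.2],
    "woman": [0.1, 0.1, 0.9, 0.2],
    "queen": [0.9, 0.1, 0.9, 0.2],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def nearest(v, vocab):
    """Word whose vector is closest (squared Euclidean) to v."""
    return min(vocab, key=lambda w: sum((x - y) ** 2 for x, y in zip(v, vocab[w])))

# king − man + woman
result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
print(nearest(result, vectors))  # prints "queen"
```

Subtracting "man" removes the maleness component; adding "woman" supplies femaleness; royalty survives the whole trip.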

Cosine Similarity

Measure meaning by the angle between vectors.

Small angle → Similar
"happy" & "joyful"

90° → Unrelated
"happy" & "table"

Opposite → Antonyms
"happy" & "miserable"

[Figure: "happy" and "joyful" at 15° (cos ≈ 0.97); "happy" and "table" at 90° (cos = 0); "happy" and "miserable" pointing opposite ways (cos ≈ −0.8)]
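Cosine similarity is a few lines of code. The 2-D vectors below are toy stand-ins, chosen so that the three cases above fall out of the arithmetic:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a · b) / (|a| |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors illustrating the three cases
happy     = [1.0, 0.9]
joyful    = [0.9, 1.0]    # small angle to "happy"
table     = [1.0, -1.0]   # roughly orthogonal to "happy"
miserable = [-1.0, -0.9]  # opposite direction

print(cosine_similarity(happy, joyful))     # close to 1: similar
print(cosine_similarity(happy, table))      # close to 0: unrelated
print(cosine_similarity(happy, miserable))  # close to -1: opposite
```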

Each Dimension Captures a Shade of Meaning

[Figure: "scientist" decomposes into dimensions like education, profession, knowledge, ... (300 dimensions total); "artist" into education, creativity, expression: same space, different pattern]

"We now think of internal representation as great big vectors."

Geoffrey Hinton, Nobel Prize 2024

From Words to Tokens

LLMs don't process words. They process tokens (word pieces).

INPUT TEXT: "Understanding language models"
TOKENIZED: "Under" · "standing" · "language" · "models"
EACH TOKEN → EMBEDDING VECTOR: [0.2, −0.1, ...] · [0.8, 0.3, ...] · [−0.1, 0.7, ...] · [0.5, −0.4, ...]
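A minimal sketch of tokenization plus embedding lookup. The four-token vocabulary, the made-up vectors, and the greedy longest-prefix matcher are all illustrative stand-ins; real tokenizers use learned BPE merges over vocabularies of ~50,000 tokens:

```python
# Toy vocabulary and made-up 2-d embedding table (illustrative only)
token_vocab = ["Under", "standing", "language", "models"]
embeddings = {
    "Under":    [0.2, -0.1],
    "standing": [0.8, 0.3],
    "language": [-0.1, 0.7],
    "models":   [0.5, -0.4],
}

def tokenize(text):
    """Greedy longest-prefix match against the toy vocabulary."""
    tokens, rest = [], text.replace(" ", "")
    while rest:
        match = max((t for t in token_vocab if rest.startswith(t)), key=len)
        tokens.append(match)
        rest = rest[len(match):]
    return tokens

tokens = tokenize("Understanding language models")
vectors = [embeddings[t] for t in tokens]  # one vector per token
print(tokens)  # ['Under', 'standing', 'language', 'models']
```

The embedding table is just a lookup: the first learned layer of the model maps every token id to its vector.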

The Embedding Space

Every token maps to a point in a vast high-dimensional space.

[Figure: regions of the embedding space: CODE, EMOTIONS, SCIENCE, LEGAL, MATH]

"I like to think of the model as a one terabyte zip file. It's full of compressed knowledge from the internet."

Andrej Karpathy

Act II

The Machinery

We've seen how meaning becomes geometry.

Now let's open the black box.

2017: The Architecture

The Transformer

Every modern LLM uses the same pattern:

Attention → MLP → Attention → MLP → ...

repeated 32–120+ times

Vaswani et al., "Attention Is All You Need," 2017

[Figure: token embeddings flow through Transformer Block 1 (Attention → MLP), then Block 2 (Attention → MLP), repeated ×N layers, ending in next-token probabilities]

The Residual Stream

Vectors flow through the model like a river.

Each layer reads from the stream, processes, and adds back.

x = x + Attention(x)
x = x + MLP(x)

Nothing is replaced entirely. Information accumulates.

[Figure: a token's vector travels along the residual stream, with Attention and MLP outputs added to it at each step, leaving an enriched vector]
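The residual update can be sketched with toy stand-ins for the two sublayers. The real Attention and MLP are learned matrices; here they are simple scalings, just to show the additive pattern:

```python
# Toy stand-ins for the real sublayers (illustration only)
def attention(x):
    """Placeholder: a real attention layer mixes information across tokens."""
    return [0.1 * v for v in x]

def mlp(x):
    """Placeholder: a real MLP applies stored knowledge."""
    return [0.2 * v for v in x]

def transformer_block(x):
    # Each sublayer ADDS its output to the stream; nothing overwrites it.
    x = [a + b for a, b in zip(x, attention(x))]
    x = [a + b for a, b in zip(x, mlp(x))]
    return x

stream = [1.0, -0.5, 0.3]   # one token's vector entering the stack
for _ in range(2):           # two stacked blocks
    stream = transformer_block(stream)
print(stream)
```

Because every layer only adds, early information (like token identity) is still available to the final layers.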

Attention: "Which words matter for understanding me?"

"The cat sat on the mat because it was tired": strong attention flows from "it" back to "cat".

The model LEARNED that "it" refers to "cat". Nobody programmed this.

"Self-attention allows the model to associate 'it' with 'animal'"

Jay Alammar

Attention as Soft Matching

In a database, you search for an exact match. Attention does something more powerful: soft matching.

Each token creates a Query ("what am I looking for?") and a Key ("what do I contain?").

The Query for "Paris" doesn't just match "Paris". It softly matches "Capital of France", "The city with Eiffel Tower", or even "Beautiful European city".

Soft matching lets the model find relevant context even when the words are completely different.

DATABASE: EXACT MATCH. Query "Paris" matches "Paris" only.

ATTENTION: SOFT MATCH. Query "Paris" matches "Capital of France" (0.95), "City with Eiffel Tower" (0.82), "Beautiful European city" (0.61), "Recipe for bread" (0.03).

The Value (V) vector carries the actual information. Output = weighted blend of all matched Values.

Attention as a Heatmap

Each cell shows how much one token attends to another.

[Figure: attention heatmap for "The cat sat on it"; each cell shows how much one token attends to another, and "it" attends strongly to "cat"]

Attention(Q, K, V) = softmax(QK^T / √d) · V
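The formula can be implemented directly for a single query. The toy keys and values below are hand-picked so that the query emitted by "it" matches the "cat" key most strongly:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys, values):
    """Scaled dot-product attention for one query:
    weights = softmax(q · K^T / sqrt(d)); output = weighted blend of values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return out, weights

# Toy keys/values for the tokens "the", "sat", "cat" (hand-picked)
keys   = [[0.1, 0.9], [0.0, 1.0], [1.0, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
q_it   = [1.0, 0.1]   # the query emitted by "it"

out, weights = attention(q_it, keys, values)
print([round(w, 2) for w in weights])  # highest weight on the "cat" key
```

Note the soft part: every key gets some weight; the output is a blend, not a single lookup.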

Multi-Head Attention

Multiple attention heads run in parallel, each looking for different relationships.

  • Head 1: syntax (subject → verb)
  • Head 2: coreference (pronoun → noun)
  • Head 3: proximity (nearby words)

GPT-4 reportedly has 128 attention heads per layer, across 120+ layers.

[Figure: the input token fans out to Head 1 (Q₁, K₁, V₁: syntax), Head 2 (Q₂, K₂, V₂: coreference), Head 3 (Q₃, K₃, V₃: proximity), ...; the head outputs are concatenated and passed through the W_o projection]

The MLP: The Model's Memory

Attention gathers context from other tokens. But where is the actual knowledge stored?

The MLP layer holds the model's learned facts and patterns. It reads the enriched representation and shifts it toward better predictions.

Attention decides what to look at. MLP decides what to do with what it sees.

[Figure: a vector enriched by attention enters the MLP layer, which holds stored knowledge and learned patterns; "The Eiffel Tower is in..." is shifted toward "Paris"]

Attention routes information between tokens. MLP applies stored knowledge to shift the prediction.

How Representations Transform Layer by Layer

Layer 1: token identity (spelling, position)
Layer 8: syntax (grammar, structure)
Layer 24: meaning (concepts, relationships)
Layer 48: reasoning (intent, logic)
Final: next-token prediction prep

Early layers = eyes/ears (sensory) · Middle layers = brain (abstract thought) · Final layers = mouth (output)

Next Token Prediction

The model converts its final vector into probabilities over the entire vocabulary.

Final vector × vocabulary matrix → 50,000 scores → softmax → probabilities.

Top predictions for "The cat sat on the ___": mat 42% · floor 24% · table 12% · couch 7% · ...49,996 more tokens.

Sample "mat" → feed back as input → repeat. This autoregressive loop generates text one token at a time.
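The autoregressive loop can be sketched with a toy logit table standing in for the whole network. Greedy decoding (always take the top token) is used here for determinism; real systems usually sample from the distribution:

```python
import math

# Toy "language model": maps the last token to logits over a tiny
# vocabulary. This dict stands in for the entire network.
vocab = ["the", "cat", "sat", "on", "mat", "."]
toy_logits = {
    "<s>": [2.0, 0.1, 0.0, 0.0, 0.0, 0.0],
    "the": [0.0, 2.0, 0.1, 0.0, 1.0, 0.0],
    "cat": [0.0, 0.0, 2.0, 0.1, 0.0, 0.0],
    "sat": [0.1, 0.0, 0.0, 2.0, 0.0, 0.0],
    "on":  [0.5, 0.0, 0.0, 0.0, 2.0, 0.0],
    "mat": [0.0, 0.0, 0.0, 0.0, 0.0, 2.0],
}

def softmax(logits):
    exps = [math.exp(l) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def generate(start, steps):
    token, out = start, []
    for _ in range(steps):
        probs = softmax(toy_logits[token])       # scores -> probabilities
        token = vocab[probs.index(max(probs))]   # greedy: take the top token
        out.append(token)                        # feed back as input, repeat
    return out

print(generate("<s>", 6))  # ['the', 'cat', 'sat', 'on', 'mat', '.']
```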

Act III

Growing Intelligence

The architecture exists. But how do random weights become intelligent?

"Grown, Not Engineered"

"Because we did not build the thing, what we build is a process which builds the thing."

Ilya Sutskever

"You can think of training a neural network as a process of maybe alchemy or transmutation, or maybe like refining the crude material, which is the data."

Ilya Sutskever

Nobody programs an LLM. We set up conditions for intelligence to emerge from data + optimization.

Cross-Entropy Loss

The model learns by predicting the next word and being scored on how surprised it is.

Predicted "mat" at 90% → Loss = 0.10 (low surprise)

Predicted "mat" at 1% → Loss = 4.6 (extreme surprise!)

The logarithm makes the penalty extreme for confident wrong answers.

LOSS = −log(predicted probability)

p = 0.90 → Loss = 0.10 ("I knew it!") · p = 0.01 → Loss = 4.6 ("I was so wrong!")
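The loss itself is one line of code; the two cases above fall straight out of the logarithm:

```python
import math

def cross_entropy_loss(predicted_prob):
    """Loss = -log(probability assigned to the correct token)."""
    return -math.log(predicted_prob)

print(round(cross_entropy_loss(0.90), 2))  # 0.11: low surprise
print(round(cross_entropy_loss(0.01), 2))  # 4.61: extreme surprise
```

Because −log(p) explodes as p approaches 0, a confident wrong answer is punished far more than an uncertain one.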

The Loss Landscape

Training = finding the lowest valley in a vast mountain range.

SGD (Stochastic Gradient Descent) rolls a ball downhill with some randomness.

With billions of parameters, the landscape has an unfathomable number of dimensions. The optimizer navigates it using gradients.

[Figure: the loss surface over parameter space; the ball starts high on the landscape and rolls down toward the valley]
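A minimal gradient-descent loop on a one-dimensional toy loss, L(w) = (w − 3)², shows the idea. Real SGD estimates the gradient from random mini-batches; the small noise term here mimics that stochasticity:

```python
import random

# Toy loss L(w) = (w - 3)^2, whose single valley sits at w = 3
def grad(w):
    return 2 * (w - 3)   # dL/dw

w = 10.0                  # random starting point on the landscape
lr = 0.1                  # learning rate: size of each downhill step
for step in range(100):
    noise = random.gauss(0, 0.01)    # stand-in for mini-batch noise
    w -= lr * (grad(w) + noise)      # roll downhill
print(round(w, 2))        # ends near 3.0, the bottom of the valley
```

With billions of parameters the landscape has billions of dimensions, but the update rule is exactly this: step against the gradient.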

Stage 1 of 3

Pre-Training: Massive Multi-Task Learning

Training data: Wikipedia, books, code, web pages; ~44 TB of text, trillions of tokens.

Task: predict the next token for every position.

The base model learns: grammar · facts · reasoning patterns · world knowledge · code · math · multiple languages · common sense. All from predicting the next word.

"Predicting the next token well means that you understand the underlying reality that led to the creation of that token."

Ilya Sutskever

"An Expensive Autocomplete"

The base model has knowledge but doesn't know how to be helpful.

Input

What is the capital of France?

Base model output

What is the capital of Germany? What is the capital of Spain? ...

It just continues the pattern. It doesn't answer.

"At its core, a base model is just an expensive autocomplete."

Andrej Karpathy

Stage 2 of 3

Instruction Tuning (SFT)

Train on curated instruction-response pairs (~100K examples).

Training Example

User: What is photosynthesis?

Assistant: Photosynthesis is the process by which plants convert sunlight into energy...

SFT teaches format and personality: the chat template. Much smaller data, but high quality.

"It's the human touch in post-training that gives it a soul."

Andrej Karpathy

Stage 3 of 3

RLHF: Learning Preferences

1. Generate: the model produces 2+ responses to the same prompt.
2. Human ranks: "Response A is better than B."
3. Reward model: learns to predict human preferences.
4. RL (PPO): optimize the model to maximize reward.

This is where the model learns nuance: helpful but not harmful · confident but not overconfident · concise but thorough.

The Full Training Pipeline

Stage        | Data                | Goal                | Result
Pre-training | Trillions of tokens | Predict next word   | Knowledge (raw)
SFT          | ~100K curated pairs | Follow instructions | Conversational
RLHF         | Human rankings      | Align with values   | Helpful & honest

2024-2025

The Reasoning Revolution

New paradigm: train models to think, not just respond.

GRPO (DeepSeek): eliminate the critic. Generate multiple answers, compare within the group.

For verifiable domains (math, code), no human judges needed. Just check: is the answer correct?

"This is also an aha moment for us, allowing us to witness the power and beauty of reinforcement learning."

DeepSeek team, on R1 spontaneously learning to reason

[Figure: GRPO group comparison. One prompt, four sampled answers: Ans 1 ✗ (r = −1), Ans 2 ✓ (r = +1), Ans 3 ✓ (r = +1), Ans 4 ✗ (r = −1). Mean reward = 0; normalize within the group; reinforce the correct answers. AIME math accuracy: 15.6% → 77.9%.]
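GRPO's within-group normalization can be sketched directly from the description above. This is an illustration of the group-relative advantage, not DeepSeek's actual implementation:

```python
# Sketch of GRPO's group-relative advantage (illustrative only)
def group_advantages(rewards):
    """Normalize each reward against its own group's mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0:
        std = 1.0  # all answers scored the same: no signal either way
    return [(r - mean) / std for r in rewards]

# 4 sampled answers to one math prompt: +1 if correct, -1 if wrong
rewards = [-1, +1, +1, -1]
print(group_advantages(rewards))  # correct answers get positive advantage
```

Because the baseline comes from the group itself, no separate critic network is needed; correct answers are pushed up relative to their siblings.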

Act IV

Looking Inside

What does the model actually learn? Can we peek inside?

May 2024: A Window Into the Mind

Golden Gate Claude

Anthropic found specific "features" inside Claude: directions in activation space that correspond to real concepts.

They found the Golden Gate Bridge feature cluster and amplified it to 10×.

Prompt

What do you look like?

Golden Gate Claude

"I am the Golden Gate Bridge... my physical form is the iconic bridge itself"

First ever detailed look inside a production-grade LLM.

Anthropic, "Scaling Monosemanticity," May 2024

Models Plan Ahead

LLMs don't just predict one word at a time. They plan ahead.

Anthropic found that when Claude writes a rhyming couplet, it decides the rhyming word early on, then builds the rest of the line to reach it.

The model doesn't just react word by word. It builds a plan across multiple tokens before committing to output.

Anthropic, "On the Biology of a Large Language Model," 2025

WHAT WE ASSUMED: The → cat → sat → on → the → ... (one token at a time)

WHAT ACTUALLY HAPPENS: the rhyme word "moon" is decided early; the line "Beneath the silver light of..." is then constructed to reach the planned word.

Like a poet who picks the rhyme first, then writes the line to reach it.

The Language of Thought

When Claude thinks, it thinks in concepts, not words.

EN "opposite of small" · FR "contraire de petit" · ZH "小的反义词" all activate the SAME internal features: "smallness" + "opposites". Language-specific features only activate at output.

The model has a universal "language of thought" that transcends any specific language.

Act V

Limitations & The Future

These models are powerful, but imperfect in surprising ways.

Jagged Intelligence

"The strange, unintuitive fact that state of the art LLMs can both perform extremely impressive tasks while simultaneously struggle with some very dumb problems."

Andrej Karpathy

[Figure: capability across tasks is jagged: high on legal briefs, complex math, code generation, and translation; low on counting the r's in "strawberry", comparing 9.11 vs 9.9, tic-tac-toe, and spatial reasoning]

LLMs take shortcuts instead of truly generalizing. When the shortcut works → brilliant. When it doesn't → foolish.

Open Problems the Community Is Tackling

Continual Learning

Current LLMs are frozen after training. How do we let them keep learning without forgetting? Catastrophic forgetting remains unsolved.

World Models

Do LLMs build internal models of reality, or just learn surface statistics? Can we train models that truly understand causality and physics?

Energy & Sample Efficiency

GPT-4 training cost ~$100M+ in compute. Human brains run on 20 watts. Can we close this gap? Data efficiency is equally critical: children learn language from far less data.

These aren't just engineering problems. They're fundamental questions about the nature of learning and intelligence.

A Timeline of Discovery

2013 Word2Vec: words become vectors (Mikolov, Google)
2017 "Attention Is All You Need": the Transformer (Vaswani, Google)
2018–20 GPT-1, 2, 3: scaling reveals emergent abilities
2022 ChatGPT: AI goes mainstream, RLHF revolution
2024 Golden Gate Claude: first look inside a production LLM
2024 Geoffrey Hinton wins Nobel Prize in Physics
2025 DeepSeek-R1: pure RL creates reasoning, Biology of LLMs

The Future: An Age of Wonder

"The 2010s were the age of scaling, now we're back in the age of wonder and discovery once again."

Ilya Sutskever

Old paradigm:
Bigger model + More data → Better

New paradigm:
Same model + More thinking + Better RL → Better reasoning

"If the benefits of the increased productivity can be shared equally, it will be a wonderful advance for all of humanity."

Geoffrey Hinton, Nobel Prize Speech, 2024

Key Takeaways

  • Meaning is geometry: words, concepts, knowledge are vectors in high-dimensional space
  • Attention is the key innovation: it lets words "look at" each other to understand context
  • LLMs are grown, not engineered: intelligence emerges from optimization on data
  • We can now look inside: interpretability reveals how models think and plan
  • Intelligence is jagged: powerful but imperfect, brilliant in some ways and brittle in others

References & Further Reading

Papers

Mikolov et al., "Word2Vec" (2013)

Vaswani et al., "Attention Is All You Need" (2017)

Brown et al., "GPT-3" (2020)

Ouyang et al., "InstructGPT" (2022)

Elhage et al., "Toy Models of Superposition" (2022)

Templeton et al., "Scaling Monosemanticity" (2024)

DeepSeek, "DeepSeek-R1" (2025)

Anthropic, "Biology of an LLM" (2025)

Anthropic, "Circuit Tracing" (2025)

Visual Guides & Blogs

Jay Alammar, "The Illustrated Transformer"

3Blue1Brown, "But What Is a GPT?"

Chris Olah, colah.github.io

Andrej Karpathy, "Software 2.0"

Chip Huyen, "RLHF Explained"

Ethan Mollick, "Jagged Frontier"

People Quoted

Ilya Sutskever · Andrej Karpathy · Chris Olah

Geoffrey Hinton · Jay Alammar · DeepSeek team
