The internet's default answer to "how does Instagram pick my Reels?" is "the algorithm." And that's not wrong, but it's not very useful. In this article, we'll peel back the layers of the recommendation system — from the two-tower retrieval model to real-time personalization — until you have a working mental model of how billions of candidate videos get whittled down to the handful you see in under 100 milliseconds. By the end, you'll understand enough to predict how your feed would change if you suddenly started watching entirely different content.
Let's start by asking ourselves: what should your Instagram Reels feed look like?
Think about it. At any given moment, there are hundreds of millions of Reels on Instagram. Some are 15-second dance clips. Some are 90-second cooking tutorials. Some are a golden retriever doing something inexplicably hilarious. And you — you're one person with one phone and maybe 20 minutes to kill on the train.
So how does Instagram decide which handful of Reels to show you, out of that ocean of content? And why does it feel so eerily accurate sometimes?
The answer involves some genuinely clever engineering. At the heart of it is a recommendation system — a machine learning pipeline that attempts to predict, in real time, which videos you'll find most engaging. It does this using a multi-stage architecture that trades off speed and precision at each step, narrowing billions of candidates to the few dozen you'll actually see.
In this article, we'll go deep enough that, by the end, you'll have a working mental model of the whole system. You'll understand why your feed shifts when your behavior changes, how the system can respond to a new interest in minutes, and what tradeoffs engineers make to serve recommendations in under 100 milliseconds.
A note on specifics: Meta hasn't published the complete internal architecture of their Reels recommendation system. What we'll discuss here is grounded in Meta's public engineering blog posts (particularly about Instagram Explore), published research papers, and industry-standard patterns. Where we're simplifying or illustrating, we'll say so. The mental model you'll build is accurate at the conceptual level — think of it as a very good map, not a satellite photo.
I.
What even IS a recommendation algorithm?
Before we dive into the specifics, let's build some intuition for the core problem.
A recommendation algorithm is, at its simplest, a function that takes two inputs — a user and a set of items — and returns a ranked list of items that user is most likely to enjoy. That's it. Everything else is engineering to make this function fast, accurate, and scalable.
But "simple" is doing a lot of heavy lifting there. Consider the scale:
Instagram has over 2 billion monthly active users. New Reels are uploaded constantly — millions per day. If you wanted to score every possible reel for every possible user, you'd need something on the order of 2 billion users × hundreds of millions of Reels' worth of scores. That's... not going to happen in real time. Not even close.
So the first big insight in recommendation systems is: you can't score everything. You need to be clever about which items you even consider.
This is why modern recommendation systems split the problem into stages. Think of it like hiring at a large company: you don't give every applicant on Earth a 5-hour interview. First, you filter by basic criteria (can they code?). Then a phone screen. Then an onsite. Each stage is more expensive but more precise, and operates on a smaller pool of candidates.
Recommendation systems work the same way. The rough stages look like this:
• Retrieval: hundreds of millions of eligible Reels → ~1,000 plausible candidates
• Pre-ranking: a lightweight model cuts ~1,000 → ~150
• Full ranking: a heavy model scores the ~150 in depth
• Blending & policy filters: diversity, freshness, and safety rules produce the final ~10-25 you see
The magic is in how each stage works. The retrieval stage needs to be blazing fast — it searches an index of hundreds of millions of items in milliseconds. The ranking stage can afford to be slower and more thoughtful, because it's only looking at ~1,000 candidates.
But before we dive into those stages, we need to understand the foundation that makes all of this possible: the two-tower model.
II.
The Two-Tower Model: A Tale of Two Embeddings
Here's the central idea, and it's genuinely elegant.
Instead of trying to score every user-item pair directly (which is impossibly expensive), we do something clever: we learn to represent users and items as vectors — lists of numbers — in the same mathematical space. Then, finding a good recommendation becomes as simple as finding which item vectors are closest to the user vector.
This is the two-tower model, and it's the workhorse of modern retrieval systems. Meta has publicly described using it for Instagram Explore, and variations of it are standard across the industry.[1]
It works like this. There are literally two "towers" — two separate neural networks:
The user tower takes in everything the system knows about you — your recent watch history, your likes, your demographics, the time of day, what device you're using — and compresses it all into a single vector. Think of it as a GPS coordinate for your current interests, but in a 128-dimensional space instead of two dimensions.
The item tower does the same thing for each Reel — it takes in the video's features (topic, creator, audio, engagement stats, age) and compresses them into a vector in the same space.
The score for a user-item pair is simply the dot product of their vectors. Items whose vectors point in a similar direction to your user vector get high scores. Items pointing elsewhere get low scores.
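To make this concrete, here is a toy version of that scoring step. The 4-dimensional vectors and item names below are made up for illustration; a real system would use learned embeddings of roughly 128 dimensions.

```python
# Toy scoring: rank items for a user by dot product.
# The vectors and names are invented for illustration;
# real embeddings are learned, not hand-written.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

user = [0.9, 0.1, 0.4, 0.0]  # leans heavily toward "cooking"

items = {
    "pasta_tutorial": [0.8, 0.0, 0.3, 0.1],
    "dunk_highlight": [0.0, 0.9, 0.1, 0.2],
    "comedy_cooking": [0.6, 0.1, 0.7, 0.0],
}

# Higher dot product = vectors point in a more similar direction.
scores = {name: dot(user, vec) for name, vec in items.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # cooking-adjacent items outrank the sports clip
```

Notice that nothing here is item-specific logic: the same two lines of scoring code handle any user and any item, which is exactly what makes the approach scale.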
Now here's why this is so clever: because the towers are independent, you can pre-compute all the item embeddings ahead of time. When a user opens Instagram, you only need to run the user tower once (cheap!), and then find the pre-computed item vectors closest to it.
"Finding the closest vectors" is a well-studied problem called approximate nearest neighbor (ANN) search. There are specialized data structures (like HNSW graphs or inverted file indices) that can search through billions of vectors in single-digit milliseconds. This is what makes retrieval so fast.
But what does it mean for a user vector and an item vector to be "close"? Let's build some intuition.
The beauty of embeddings is that they capture nuance. A cooking reel by a comedian might live somewhere between the "cooking" cluster and the "comedy" cluster. A basketball highlight set to trending music might bridge "sports" and "music." The embedding space lets the model capture these overlaps organically.
And because both users and items live in the same space, a user who watches lots of cooking comedy will have a vector that's naturally close to those hybrid videos. No one had to manually tag "cooking comedy" as a category — the model learned it from behavior.
III.
How does it learn what you like?
We've been talking about "learning" what you like, but how does the model actually know? You never filled out a survey saying "I'm 70% into cooking and 30% into sports." The system has to infer your preferences from your behavior.
These behavioral signals are called implicit feedback, and they're the lifeblood of the recommendation system. Every time you interact with a Reel — or don't interact — you're sending a signal:
The key insight is that watch time is the most informative signal. A "like" takes a deliberate tap, but watch time captures involuntary interest. If you watched a 30-second reel all the way through and then replayed it, the system learns far more than it does from a reflexive like on a video you then scrolled past after two seconds.
Other signals include:
• Shares — sharing a reel to a friend or your story is a very strong positive signal
• Saves — bookmarking for later suggests lasting value, not just momentary interest
• Comments — especially positive or substantive ones
• Skips — scrolling past quickly is a negative signal
• "Not interested" — the strongest explicit negative signal
These signals aren't just used for the currently playing reel. They're aggregated into your user profile over time, creating a rich picture of your preferences. Your recent signals are weighted more heavily than old ones — if you suddenly start watching a lot of travel content after booking a vacation, the system picks up on that shift quickly.
The model is trained on billions of these interactions. During training, it learns to position user and item embeddings so that the dot product between a user and items they did engage with is high, and the dot product with items they didn't engage with is low. That's the whole training objective, and it's what makes the embedding space meaningful.
Negative sampling: You might wonder — how do you train on "items the user didn't engage with"? After all, the user didn't see most Reels. The standard approach is to sample random items as "negatives." If you watched cooking reels all day, a random sports reel is treated as a negative example. Over billions of examples, this works remarkably well. It's simple, but it's one of those elegant tricks that makes the whole thing tractable.
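A heavily simplified sketch of that objective: score the positive (an item the user engaged with) and a handful of sampled negatives, then compute the softmax cross-entropy of the positive. The vectors here are hand-written stand-ins; in real training this loss is backpropagated through both towers over billions of examples.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sampled_softmax_loss(user, positive, negatives):
    """Training-objective sketch: the loss is low when the user's
    dot product with the engaged item beats the sampled negatives.
    We only compute the loss; real systems backpropagate it."""
    logits = [dot(user, positive)] + [dot(user, n) for n in negatives]
    log_z = math.log(sum(math.exp(x) for x in logits))
    return log_z - logits[0]  # -log softmax probability of the positive

user    = [0.9, 0.1, 0.4]   # a cooking-leaning user vector (made up)
watched = [0.8, 0.0, 0.3]   # a cooking reel this user finished

random.seed(0)  # random items stand in for sampled negatives
negatives = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
loss = sampled_softmax_loss(user, watched, negatives)
```

Minimizing this loss is exactly what "positions" the embeddings: it pushes the user vector toward items they engaged with and away from the random negatives.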
IV.
The Retrieval-Ranking Pipeline: Speed vs. Quality
Now we get to the part that makes systems engineers break out in a cold sweat: latency.
When you open Instagram and swipe to the Reels tab, you expect content to appear instantly. If it takes even two seconds, it feels broken. So the entire recommendation pipeline — from "this user just opened the app" to "here are your top 10 Reels" — needs to complete in under 100 milliseconds. For billions of users. Concurrently.
This is why the system uses a multi-stage pipeline. Each stage applies more sophisticated (and expensive) logic to a progressively smaller set of candidates. Let's walk through each stage.
Stage 1: Candidate Generation (Retrieval)
The first stage's job is simple: from the hundreds of millions of eligible Reels, find roughly 1,000 that are plausibly interesting to this user. It doesn't need to be perfect — it just needs to not miss the really good ones.
This is where the two-tower model and ANN search come in. The system computes your user embedding, then performs an approximate nearest neighbor search against a pre-built index of all item embeddings. In practice, multiple retrieval sources run in parallel — one might focus on "similar to what you've watched recently," another on "popular among users like you," another on "trending right now."
The whole thing takes roughly 10 milliseconds.
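Merging those parallel sources is mostly bookkeeping: union the candidate lists and deduplicate by reel ID. The source names and IDs below are invented for illustration.

```python
def merge_candidates(sources):
    """Union candidates from several retrieval sources, deduplicating
    by reel ID while preserving first-seen order. Source names and
    reel IDs are made up for illustration."""
    seen, merged = set(), []
    for source_name, reel_ids in sources.items():
        for rid in reel_ids:
            if rid not in seen:
                seen.add(rid)
                merged.append(rid)
    return merged

sources = {
    "similar_to_recent": ["r1", "r2", "r3"],
    "users_like_you":    ["r2", "r4"],
    "trending_now":      ["r5", "r1"],
}
candidates = merge_candidates(sources)
```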
Stage 2: Pre-ranking (Lightweight Scoring)
The ~1,000 candidates now get a quick score from a lighter model. This model is more sophisticated than the dot product, but still fast. It might be a simple neural network that takes in both user and item features jointly. Its job is to cut the candidates from ~1,000 to ~150.
Think of it as a "first read" — catching the obviously bad candidates that slipped through retrieval.
Stage 3: Full Ranking
This is the expensive one. A deep neural network scores each of the ~150 remaining candidates on multiple objectives simultaneously: predicted watch time, probability of like, probability of share, probability of "not interested," and more. Each candidate gets multiple scores that are combined into a final ranking.
This model is too expensive to run on millions of items — but for 150 candidates, it's totally feasible. It takes roughly 50 milliseconds.
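Combining those per-objective predictions into one ranking score is often a weighted sum. The weights and probabilities below are illustrative guesses, not Meta's actual values; in practice the weights are tuned constantly through experimentation.

```python
def final_score(preds, weights):
    """Combine per-objective predictions into one ranking score.
    Weights here are illustrative; real systems tune them via
    large-scale experiments."""
    return sum(weights[k] * v for k, v in preds.items())

weights = {
    "p_watch_30s":      1.0,
    "p_like":           0.5,
    "p_share":          0.8,
    "p_not_interested": -2.0,  # negative signals carry a strong penalty
}

# Two hypothetical candidates: B has more raw watch-time appeal,
# but a much higher predicted chance of "not interested".
reel_a = {"p_watch_30s": 0.7, "p_like": 0.2, "p_share": 0.05, "p_not_interested": 0.01}
reel_b = {"p_watch_30s": 0.9, "p_like": 0.1, "p_share": 0.01, "p_not_interested": 0.30}
```

Note the tradeoff this encodes: a candidate can "win" on predicted watch time and still rank lower once the penalty terms are applied.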
Stage 4: Blending & Policy Filters
The final stage applies business logic. It enforces diversity (don't show 10 cooking reels in a row), freshness (mix in some new content), safety filters (block policy-violating content), and creator fairness (don't over-concentrate exposure). The result: your final feed of ~10-25 Reels, ready to serve.
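One of those diversity rules can be sketched as a greedy re-rank: take items in score order, but skip any item that would extend a same-topic run too far, revisiting skipped items later. This is a toy version of the idea, not Meta's actual blending logic.

```python
def diversify(ranked, topic_of, max_run=2):
    """Greedy diversity re-rank: emit items in score order but defer
    any item that would extend a same-topic run past `max_run`.
    A toy stand-in for real blending logic."""
    result, pool = [], list(ranked)
    while pool:
        for i, item in enumerate(pool):
            recent = [topic_of[x] for x in result[-max_run:]]
            if len(recent) < max_run or any(t != topic_of[item] for t in recent):
                result.append(pool.pop(i))
                break
        else:
            result.append(pool.pop(0))  # nothing fits; relax the rule
    return result

topic_of = {"r1": "cooking", "r2": "cooking", "r3": "cooking", "r4": "sports"}
feed_order = diversify(["r1", "r2", "r3", "r4"], topic_of)
```

With `max_run=2`, the third cooking reel gets deferred behind the sports reel even though it scored higher, which is exactly the "don't show 10 cooking reels in a row" behavior.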
Abstraction vs. reality: In practice, Meta runs hundreds of models across Instagram's various surfaces. The "ranking model" isn't one model — it's a cascade of models making different predictions. Meta's engineering blog describes deploying over 1,000 models for Instagram recommendations. We're simplifying into four stages, but the real system has more fine-grained steps and many more parallel paths.
Here's something that might surprise you: the retrieval stage is often the most important to get right. If a great reel never makes it past retrieval, the ranking model never even sees it. You can have the world's best ranking model, but it can only rank what's in front of it. This is why teams invest enormous effort in recall — the percentage of truly relevant items that make it through the retrieval stage.
V.
Real-Time Personalization: Your Feed Updates While You Scroll
Here's where things get really interesting. The system doesn't just compute your preferences once and call it a day. It adapts to what you're doing right now, in near real-time.
Imagine you open Instagram at 8am on a Saturday. You're browsing lazily, watching some cooking content. The system serves you more cooking, some travel, some comedy — a nice mix based on your historical preferences. Standard stuff.
But then at 8:15, you watch three basketball highlights in a row. Something shifted. Maybe the NBA playoffs just started. The system notices this pattern immediately. Your user features — specifically your recent interaction sequence — have changed. The very next batch of recommendations will start blending in more sports content.
How does this work technically? The key is that user features are split into two categories:
• Stable features: Your age, location, long-term interest profile. These update slowly (daily or weekly).
• Real-time features: Your last 50 interactions, current session context, time since last visit. These update on every scroll.
The ranking model takes both as input. So even if your long-term profile says "cooking lover," your real-time features screaming "basketball basketball basketball" will shift the output meaningfully.
This real-time adaptation is one of the reasons the system feels "creepy" sometimes. You watch one video about a topic, and suddenly it's everywhere. But there's no dark magic — the system is simply very responsive to recent signals, especially when they represent a strong deviation from your baseline.
There's a subtlety here worth noting: the system doesn't literally retrain the model every time you scroll. The model weights are fixed (they're updated through offline training, typically daily or weekly). What changes in real-time are the input features. Your "recent interactions" feature is a rolling window that shifts with every action. The model was already trained to interpret these features, so feeding in fresh data naturally produces fresh outputs.
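That rolling window is easy to picture in code. The sketch below keeps the last N interactions and summarizes them as a topic mix; the topic labels and window size are illustrative, and a real feature store would track far richer signals than topic counts.

```python
from collections import Counter, deque

class RecentInteractions:
    """Rolling window of the user's last N interactions — the kind
    of real-time feature that shifts with every scroll. Topic labels
    and window size are made up for illustration."""
    def __init__(self, n=50):
        self.window = deque(maxlen=n)  # old entries fall off automatically

    def record(self, topic):
        self.window.append(topic)

    def topic_mix(self):
        counts = Counter(self.window)
        total = len(self.window) or 1
        return {t: c / total for t, c in counts.items()}

recent = RecentInteractions(n=5)
for t in ["cooking", "cooking", "basketball", "basketball", "basketball"]:
    recent.record(t)
mix = recent.topic_mix()
# The mix now leans basketball, even if the long-term profile
# still says "cooking lover" — no retraining required.
```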
Feature stores and streaming: To make real-time features possible, companies like Meta use sophisticated infrastructure — feature stores that can serve fresh user features with single-digit millisecond latency, event streaming pipelines (think Apache Kafka) that propagate your actions across the system in near real-time, and online feature computation that aggregates your recent behavior on the fly. The ML model itself might be the easy part — the infrastructure to feed it fresh data at scale is where the real engineering challenge lies.
VI.
The Balancing Act: Beyond Pure Engagement
If you optimized purely for engagement — predicted watch time, likes, shares — you'd get a pretty dysfunctional feed. The algorithm would learn that rage-bait and clickbait get clicks, that extreme content keeps people watching, and that showing the same narrow content over and over maximizes short-term engagement.
This is the filter bubble problem, and modern recommendation systems actively work to counteract it.
In practice, the final ranking isn't just "which reel will this user engage with most?" It's a weighted combination of multiple objectives:
• Relevance: Will the user enjoy this content? (The core objective)
• Diversity: Is the feed varied enough? (Don't show 10 cooking reels in a row)
• Freshness: Is there enough new/recent content? (Don't just replay proven hits)
• Exploration: Should we show something new this user hasn't seen before? (How the system discovers your new interests)
• Safety: Does this content comply with community guidelines?
• Creator fairness: Are we giving diverse creators a fair shot at exposure?
The art of recommendation engineering is tuning these objectives against each other. More diversity means less relevance per-item. More exploration means some "misses." More safety filtering means potentially removing edgy-but-legal content.
This multi-objective approach is also why two people with similar watch histories can see quite different feeds. The system is constantly A/B testing different weight configurations, and different users might be in different experimental cohorts.
The exploration vs. exploitation tradeoff is particularly fascinating. "Exploitation" means showing content the system is confident you'll like (based on past behavior). "Exploration" means showing content the system is uncertain about — probing whether you might enjoy something new. Without exploration, the system can never discover that you secretly love pottery videos if you've never watched one.
In practice, a small fraction of your feed (maybe 5-10%) is deliberately exploratory. These "exploration slots" are how the algorithm learns and how your feed evolves over time. If you engage with an exploratory reel, the system has learned something new about you and can exploit that knowledge in future sessions.
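A minimal version of those exploration slots looks like an epsilon-style split: reserve a small fraction of the feed for uncertain candidates and fill the rest with top-ranked ones. The 10% fraction and item names below are illustrative, and real systems choose exploratory items far more cleverly than random sampling.

```python
import random

def fill_feed(exploit_items, explore_items, n=10, explore_frac=0.1, rng=None):
    """Reserve roughly `explore_frac` of the feed for exploratory
    candidates; fill the rest with top-ranked ('exploit') items.
    A toy epsilon-style scheme, not a production policy."""
    rng = rng or random.Random(0)
    n_explore = max(1, round(n * explore_frac))
    feed = exploit_items[: n - n_explore]
    feed += rng.sample(explore_items, n_explore)
    rng.shuffle(feed)  # don't always bury exploration at the end
    return feed

top_ranked  = [f"top{i}" for i in range(20)]   # high-confidence picks
exploratory = [f"new{i}" for i in range(20)]   # uncertain candidates
feed = fill_feed(top_ranked, exploratory)
```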
VII.
Putting It All Together
Let's trace the full journey of a single recommendation request, from the moment you open the Reels tab to the moment the first video starts playing.
T+0ms: You open Reels. The Instagram app sends a request to the recommendation backend. Included: your user ID, device info, current time, and a hash of your recent on-device interactions.
T+2ms: Feature assembly. The system fetches your user features from the feature store. Your stable profile (demographics, long-term interests) and real-time features (last 50 interactions, current session) are assembled into a feature vector.
T+5ms: User embedding. The user tower of the two-tower model runs on your features, producing a 128-dimensional user embedding vector.
T+8ms: Candidate retrieval. Multiple retrieval sources fire in parallel. ANN search finds ~500 candidates from the embedding index. Collaborative filtering adds ~200 "users like you also watched" candidates. A popularity source adds ~100 trending candidates. A social source adds ~200 from people you follow. Total: ~1,000 unique candidates.
T+15ms: Pre-ranking. A lightweight model scores the 1,000 candidates and keeps the top 150.
T+50ms: Full ranking. The heavy ranking model scores 150 candidates on multiple objectives. Each gets a predicted watch time, P(like), P(share), P(skip), P(not interested), and more. A value model combines these into a single score.
T+70ms: Blending & policies. Diversity injection, freshness requirements, safety filters, and creator fairness rules are applied. The final ~25 reels are ordered.
T+80ms: Response sent. The ordered list of Reel IDs (plus pre-fetching hints for video CDN) is sent back to the app.
T+100ms: First frame. The app starts playing the first Reel. You start watching. And the system starts collecting signals for the next batch.
That's the whole cycle — and it repeats every time you've scrolled through most of the current batch and need more content. The system is constantly running, constantly adapting, constantly learning.
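The funnel traced above can be collapsed into a few lines. Each stage here is a stand-in function that just narrows a numbered candidate list — the point is the shape of the pipeline, not any real implementation.

```python
def recommend(user_id, stages):
    """End-to-end sketch of the funnel: each stage is a function
    that narrows the candidate list. Stage bodies are stand-ins
    for illustration, not real implementations."""
    candidates = stages["retrieve"](user_id)         # ~1,000 candidates
    candidates = stages["pre_rank"](candidates)[:150]  # lightweight cut
    candidates = stages["full_rank"](candidates)       # heavy scoring
    return stages["blend"](candidates)[:25]            # final ordered feed

# Stand-in stages operating on a numbered candidate list.
stages = {
    "retrieve":  lambda uid: list(range(1000)),
    "pre_rank":  lambda c: c,  # top 150 kept by the slice above
    "full_rank": lambda c: sorted(c, reverse=True),
    "blend":     lambda c: c,
}
feed = recommend("user_42", stages)
```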
The scale is staggering: Meta serves recommendations to over 2 billion users across multiple surfaces (Feed, Reels, Explore, Stories). At peak load, this means millions of recommendation requests per second, each touching hundreds of models. The infrastructure required to do this reliably, at low latency, while being fault-tolerant, is arguably the hardest part of the whole system. The ML is elegant; the systems engineering is heroic.
VIII.
The Bigger Picture: What's Next?
The two-tower + retrieval-ranking pipeline we've described is the industry workhorse. But the field is evolving rapidly. Here's what's on the frontier.
Transformers are coming for recommendations. The same architecture that powers ChatGPT is being adapted for recommendation. Instead of treating recommendation as "find similar items," transformer-based models treat it as a sequence prediction problem: given a user's history of interactions (watch A, like B, skip C, share D...), predict the next item they'll engage with. This captures richer temporal patterns than the two-tower approach, though it's more expensive at inference time.
Multi-modal understanding. Early recommendation systems treated videos as bags of metadata: "cooking," "creator_xyz," "30 seconds." But modern systems increasingly understand the actual content. Vision models analyze what's happening in the video. Audio models understand the soundtrack. NLP models read the captions and comments. This richer understanding lets the system make connections it couldn't before — like recommending a silent pottery video to someone who watches cooking, because both involve "satisfying, hands-on craftsmanship."
Generative retrieval. Rather than searching through a pre-built index, some new approaches let a neural network directly generate item IDs for retrieval. Think of it as the model hallucinating "the ID of the perfect video for this user" rather than searching for it. It's early, but promising for capturing complex, non-obvious matches.
On-device inference. Running parts of the model on the phone itself (rather than in the cloud) can improve latency and privacy. Apple has pushed this with its "on-device ML" initiative, and Meta has explored it for lightweight ranking. The challenge is keeping a model on a phone that's both small enough to run fast and good enough to be useful.
Causal reasoning and long-term value. Current systems optimize for short-term engagement: will the user watch this reel right now? But the most valuable recommendations might be ones that keep users coming back tomorrow, next week, next month. Modeling this requires causal reasoning (if I show X, will the user return more often?) rather than just correlation (users who watched X also came back). This is an active research frontier.
The recommendation system that powers your Reels feed is, in many ways, one of the most sophisticated deployed ML systems in the world. It combines deep learning, information retrieval, distributed systems, and real-time data processing, all operating at a scale that would have seemed like science fiction twenty years ago. And yet, from your perspective, it just... works. You open the app, and interesting videos appear.
Hopefully, the next time that happens, you'll have a better mental model of the extraordinary engineering behind that seamless experience.