The curse of dimensionality
In three dimensions, your brain is a geometry genius. You can catch a ball, parallel park a car, and instinctively understand that a room has corners, edges, and a center. But the moment we step into higher dimensions, this intuition doesn't just get worse — it becomes actively misleading.
Consider a simple question: how many data points do you need to "fill" a space? In 1D, ten evenly-spaced points give you reasonable coverage of the unit interval. In 2D, you'd need 10 × 10 = 100 points to cover a grid at the same resolution. In 3D, that's 1,000 points. See the pattern?
In d dimensions, you need 10^d points. For d = 100 — a modest number in machine learning — that's 10^100 points. That's more than the number of atoms in the observable universe. Your data will never, ever fill a high-dimensional space. It will always be an infinitesimally sparse dust.
This is what Richard Bellman called the curse of dimensionality in 1961, and it has consequences that feel genuinely bizarre:
- Almost all the volume is in the corners. A high-dimensional unit cube has 2^d corners. As d grows, the fraction of the cube's volume that's "near the center" vanishes. In 100D, over 99.99% of the cube's volume is in the outer shell, far from the center.
- Distance contrast vanishes. For random points, the ratio of the farthest distance to the nearest distance converges to 1 as d → ∞. The nearest neighbor is barely closer than the farthest point. Your nearest-neighbor algorithms are in trouble.
- The unit ball shrinks to nothing. The volume of a unit ball in d dimensions goes to zero as d increases. The ball occupies a vanishing fraction of the cube that contains it. The geometry you learned in school is lying to you.
This sounds devastating. If high-dimensional spaces are so hostile, how can machine learning work at all? How can we do anything useful with data in thousands of dimensions?
The answer is one of the most beautiful surprises in mathematics: the same weirdness that creates the curse also creates a hidden blessing. The very concentration phenomena that make high-dimensional spaces seem pathological are exactly what makes random projections work. To see why, we need to confront just how strange these spaces really are.
High-dimensional spaces are weird
Let's start with something concrete. What's the volume of a unit ball (a sphere of radius 1) in d dimensions?
In 1D it's just the interval [−1, 1], so the "volume" (length) is 2. In 2D it's a circle with area π ≈ 3.14. In 3D it's a sphere with volume 4π/3 ≈ 4.19. So far, the volume is increasing. Intuition says: more dimensions, more room, more volume. Right?
Wrong.
The volume peaks around d = 5 and then plummets. By d = 20, it's about 0.026. By d = 100, it's roughly 10^−40 — a number too small to mean anything physically. The unit ball in high dimensions is, for all practical purposes, empty.
See it for yourself:
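A minimal Python sketch does the job, using the closed-form volume V_d = π^(d/2) / Γ(d/2 + 1) and the log-gamma function for numerical stability (the dimensions printed are arbitrary examples):

```python
# Volume of the unit ball in d dimensions: V_d = pi^(d/2) / Gamma(d/2 + 1).
# Work in log space so large d doesn't overflow or underflow prematurely.
import math

def unit_ball_volume(d: int) -> float:
    log_vol = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    return math.exp(log_vol)

for d in [1, 2, 3, 5, 10, 20, 50, 100]:
    print(f"d = {d:3d}   volume = {unit_ball_volume(d):.3e}")
# The volume peaks near d = 5 (about 5.26) and then collapses toward zero.
```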
But here's where it gets really wild. It's not just that the ball is small — it's that almost all of what remains is concentrated in a thin shell near the surface.
The thin shell phenomenon
Think about peeling an orange. In 3D, the skin is a small fraction of the total volume — most of the orange is "inside." But in high dimensions, this relationship inverts dramatically. Almost all the volume of a high-dimensional ball is packed into a microscopically thin layer near the boundary.
Specifically, the fraction of volume between radius (1 − ε) and radius 1 is:
1 − (1 − ε)^d
For d = 100 and ε = 0.05 (a shell that's just 5% of the radius thick), this gives 1 − 0.95^100 ≈ 0.994. That means 99.4% of the volume is in the outermost 5% of the radius. The "interior" of the ball is essentially a vacuum.
Play with the numbers yourself:
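Here's the same calculation as a tiny sketch, so you can try other dimensions and shell thicknesses (the values below are just examples):

```python
# Fraction of a d-dimensional ball's volume in the outer shell of
# relative thickness eps: 1 - (1 - eps)^d.
def shell_fraction(d: int, eps: float) -> float:
    return 1 - (1 - eps) ** d

for d in [3, 10, 100, 1000]:
    print(f"d = {d:5d}   outer 5% shell holds {shell_fraction(d, 0.05):.2%} of the volume")
```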
This is the concentration of measure phenomenon, and it's not just a curiosity — it's the foundational reason why random projections work. When you randomly project a high-dimensional vector, its length doesn't change much, because the vector's "mass" was already concentrated in a thin shell of predictable radius. For the uniform ball, the shell capturing most of the volume has thickness ~O(1/d) relative to the radius. For Gaussian vectors (which are used in JL proofs), the relative thickness is ~O(1/√d). Either way: the higher the dimension, the tighter the concentration.
The uniform ball gives us geometric intuition, but Gaussians are even more convenient analytically — and they're the distribution actually used in JL proofs. The same shell concentration shows up there too:
Here's a precise statement: if you sample a random vector x from a d-dimensional standard Gaussian, its length concentrates tightly around √d. For absolute deviation t: P(| ||x|| − √d | > t) ≤ 2·exp(−c·t²). Equivalently, in relative terms: P(| ||x||/√d − 1 | > δ) ≤ 2·exp(−c·δ²·d). The vector lives on a thin shell of radius ≈ √d with relative thickness ≈ 1/√d. This is the Gaussian version of shell concentration — often called the Gaussian Annulus Theorem — and it's the engine behind the JL lemma.
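You can watch this concentration numerically. A rough simulation sketch (sample sizes and dimensions are arbitrary choices):

```python
# Gaussian shell concentration: lengths of standard Gaussian vectors in
# dimension d hug sqrt(d), with relative spread shrinking like 1/sqrt(2d).
import numpy as np

rng = np.random.default_rng(0)
for d in [10, 100, 1000, 10000]:
    x = rng.standard_normal((2000, d))               # 2000 random Gaussian vectors
    rel = np.linalg.norm(x, axis=1) / np.sqrt(d)     # lengths relative to sqrt(d)
    print(f"d = {d:6d}   mean = {rel.mean():.4f}   std = {rel.std():.4f}"
          f"   (≈ 1/sqrt(2d) = {1 / np.sqrt(2 * d):.4f})")
```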
So in high dimensions, the unit ball is essentially a thin, hollow shell. But this weirdness goes even deeper. Let's look at what happens to distances.
Everyone is equally far apart
Here's a thought experiment. Take two random vectors in ℝ^d, where each coordinate is drawn independently from a standard Gaussian (mean 0, variance 1). What's the distance between them?
The difference vector also has Gaussian coordinates — each coordinate of (x − y) is N(0, 2), since it's the difference of two independent N(0,1) values. The squared distance ||x − y||² is therefore a scaled chi-squared: 2·χ²_d, with mean 2d and variance 8d. The expected distance is approximately √(2d).
That's not surprising. What is surprising is what happens to the variability of this distance as d grows. The relative standard deviation of the distance scales as:
std(||x − y||) / E[||x − y||] ≈ 1/√(2d)
In d = 10,000, the relative variation is less than 1%. All pairwise distances are essentially the same. Every point is approximately the same distance from every other point. It's as if all your data points are sitting on the vertices of a regular simplex.
Watch this concentration happen in real time:
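A quick simulation sketch of the same effect, using NumPy and SciPy (point counts and dimensions are arbitrary):

```python
# Distance concentration: pairwise distances between random Gaussian points
# bunch up around sqrt(2d), so the farthest/nearest ratio drifts toward 1.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
for d in [2, 10, 100, 1000, 10000]:
    pts = rng.standard_normal((200, d))
    dists = pdist(pts)                        # all pairwise Euclidean distances
    print(f"d = {d:6d}   max/min = {dists.max() / dists.min():6.2f}"
          f"   relative std = {dists.std() / dists.mean():.4f}")
```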
At first glance, this seems catastrophic. If everything is the same distance from everything else, how can nearest-neighbor search work? How can you cluster data? How can you tell anything apart?
But look more carefully. The distances are concentrating, yes — but they're not becoming identical. The small remaining variations are exactly the meaningful structure in your data. And here's the critical insight: these small relative differences are precisely what random projections preserve.
The JL lemma doesn't promise to preserve absolute distances. It preserves ratios of distances, up to a factor of (1 ± ε). And because distances in high dimensions are already concentrated, those ratios are the only thing that matters.
Distance concentration is often presented as a problem (the "curse"), but it's actually the mechanism that makes dimensionality reduction possible. If distances in high-d were wildly spread out, a random projection would inevitably distort some of them badly. But because they're all similar in magnitude, a random projection can preserve their relative structure with just a few dimensions. The concentration is not the disease — it's the cure.
But concentration of distances isn't the only remarkable property of high-dimensional spaces. There's something even more striking about directions.
Near-orthogonality: the blessing
Quick quiz: in 3D, what's the maximum number of mutually orthogonal vectors? Three — the x, y, and z axes. In d dimensions, it's exactly d. No more, no less. This feels like an iron constraint: the number of "independent directions" is capped by the number of dimensions.
But here's where high dimensions perform a magic trick. What if we relax "orthogonal" to "almost orthogonal"? What if we only require that the cosine of the angle between any two vectors is at most ε — say, at most 0.1?
Then the number of near-orthogonal vectors explodes exponentially.
In d dimensions, there exist sets of exp(Θ(d·ε²)) unit vectors that are pairwise ε-orthogonal — meaning |cos θ| ≤ ε for every pair. A typical bound gives roughly exp(½·d·ε²). For d = 100 and ε = 0.3, this is about e^4.5 ≈ 90 vectors — already close to the 100 exactly orthogonal directions. For d = 1000 with the same ε, it's e^45 ≈ 3.5 × 10^19 — a number so large it makes the number of grains of sand on Earth look like a rounding error.
Think about what this means. In 1000 dimensions, you can find astronomically many directions that are all "almost perpendicular" to each other. High-dimensional spaces are not cramped — they're unfathomably spacious. There's room for exponentially many nearly independent directions, even though only d of them are perfectly independent.
Why does this work? It's intimately connected to the concentration of inner products. If you pick two random unit vectors in d dimensions, their inner product (cosine of the angle between them) is approximately Gaussian with mean 0 and standard deviation 1/√d. So for large d, two random vectors are automatically nearly orthogonal with high probability.
For two random unit vectors u, v in ℝ^d, the inner product ⟨u,v⟩ has mean 0 and standard deviation ≈ 1/√d. This means P(|⟨u,v⟩| > ε) ≤ 2·exp(−d·ε²/2) for small ε. In d = 1000, this bound says the probability that two random unit vectors have |cos θ| > 0.1 is at most 2·e^−5 ≈ 0.013 — and in practice it's even smaller. Random directions in high dimensions are almost always nearly perpendicular.
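A small simulation sketch to check this (pair counts and dimensions are arbitrary):

```python
# Near-orthogonality: inner products of independent random unit vectors
# concentrate around 0 with standard deviation about 1/sqrt(d).
import numpy as np

rng = np.random.default_rng(2)
for d in [10, 100, 1000, 10000]:
    u = rng.standard_normal((5000, d))
    v = rng.standard_normal((5000, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)    # normalize to unit length
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = np.sum(u * v, axis=1)                      # cosine of the angle, per pair
    print(f"d = {d:6d}   std(cos θ) = {cos.std():.4f}"
          f"   fraction with |cos θ| > 0.1: {np.mean(np.abs(cos) > 0.1):.4f}")
```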
This is the hidden superpower of high-dimensional spaces. Yes, all distances are roughly equal (the curse). But the space has so many nearly-independent directions that you can embed exponentially many distinct structures in it (the blessing). And this is exactly why a random projection down to a much lower-dimensional space can preserve the essential structure of your data.
Now we're ready for the main event: the Johnson–Lindenstrauss lemma itself.
Random projections preserve structure
In 1984, William Johnson and Joram Lindenstrauss proved something remarkable. Here's the theorem, stated informally: for any ε ∈ (0, 1) and any set of n points in ℝ^d, there is a linear map f into ℝ^k with k = O(log(n) / ε²) such that for every pair of points u and v:
(1 − ε) · ||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + ε) · ||u − v||²
Read that again. Let it sink in. Here's what it says:
- You can project n points from any number of dimensions down to just k = O(log(n) / ε²) dimensions.
- All pairwise distances are preserved within a factor of (1 ± ε).
- k doesn't depend on d at all. Whether your data lives in 100 dimensions or 100 million dimensions, the target dimension is the same — it depends only on how many points you have and how much distortion you can tolerate.
Let's put numbers to it. Suppose you have n = 1,000,000 points and you want ε = 0.1 (10% distortion). Using the practical bound k ≥ 8·ln(n)/ε²:
k ≈ 8 · ln(1,000,000) / 0.1² ≈ 8 · 13.8 / 0.01 ≈ 11,000
A million points — regardless of their original dimension — can be faithfully represented in about 11,000 dimensions. If you're willing to accept ε = 0.3 (30% distortion), that drops to about 1,228 dimensions.
And here's the kicker: a random Gaussian matrix works. Generate a k × d matrix where each entry is drawn independently from N(0, 1/k), multiply your data by it, and with high probability (typically at least 1 − 1/n), all pairwise distances are preserved within (1 ± ε). No optimization. No training. No clever construction. Just randomness.
See it in action:
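Here's a small sketch of that experiment — the data is synthetic 50-dimensional Gaussian noise, chosen purely for illustration:

```python
# A random Gaussian projection in action: project 50-dimensional points down
# to k dimensions and measure how much each pairwise distance gets distorted.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
n, d = 200, 50
X = rng.standard_normal((n, d))                     # toy data: 200 points in 50-D
orig = pdist(X)                                     # original pairwise distances

for k in [5, 10, 20, 40]:
    A = rng.standard_normal((k, d)) / np.sqrt(k)    # entries drawn from N(0, 1/k)
    distortion = np.abs(pdist(X @ A.T) / orig - 1)  # relative error per pair
    print(f"k = {k:3d}   median distortion = {np.median(distortion):.3f}"
          f"   90th percentile = {np.percentile(distortion, 90):.3f}")
```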
Notice something wonderful: as you increase k (the target dimension), the projected distances track the originals more and more closely — the projection becomes more faithful. But even at surprisingly low k, the worst-case distortion is modest. A random 10-dimensional projection of 50-dimensional data typically distorts distances by less than 50%. Push k to 40 or 50, and the distortion drops below 15%.
The original JL lemma is existential — it says a good projection exists. But the probabilistic proof actually shows that a random Gaussian projection works with probability at least 1 − δ, for k = O(log(n/δ) / ε²). The constant depends on the failure probability δ — a common practical bound is k ≥ 8·ln(n)/ε², though the exact constant varies by proof technique.
This result is astounding because the projection matrix knows nothing about your data. It's the same random matrix whether you're projecting images, text embeddings, genome sequences, or stock prices. The geometry of high-dimensional space — the very concentration phenomena we explored in earlier sections — guarantees that it works. (An important caveat: JL guarantees preservation for a fixed finite set of n points, not uniformly for all of ℝ^d. The projection is chosen obliviously, but the dimension bound depends on how many points you need to preserve.)
But why does it work? Let's dig into the proof intuition.
Why does JL work?
The proof of the JL lemma has two key ingredients, and both connect directly to the intuitions we've built.
Ingredient 1: Projection length concentrates
Fix a single vector v in ℝ^d. Now project it using a random k × d matrix A (with entries drawn from N(0, 1/k)). The projected vector is Av, and its squared length is:
||Av||² = (row_1 · v)² + (row_2 · v)² + … + (row_k · v)²
Each term (row_i · v) is the dot product of v with a random Gaussian vector. Since v is fixed and the row is random, this dot product is a Gaussian with mean 0 and variance ||v||²/k. So each term (row_i · v)² has expected value ||v||²/k.
Summing k such terms, the expected squared length is:
E[||Av||²] = k · (||v||²/k) = ||v||²
The expected projected length equals the original length! And here comes the concentration: since ||Av||² is a sum of k independent sub-exponential random variables, it concentrates tightly around its mean. Specifically:
P( | ||Av||² − ||v||² | > ε·||v||² ) ≤ 2·exp(−c·k·ε²)
This is shell concentration in disguise. The projected length lives on a thin shell around its expected value, and the shell gets thinner as k increases. This is exactly what we saw with the Gaussian Annulus Theorem — the length of a sum of independent random things concentrates around its expected value.
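Ingredient 1 is easy to check numerically. A rough sketch — the vector, dimensions, and trial counts below are arbitrary:

```python
# Ingredient 1, empirically: for one fixed vector v, the projected squared
# length ||Av||^2 concentrates around ||v||^2, more tightly as k grows.
import numpy as np

rng = np.random.default_rng(4)
d = 1000
v = rng.standard_normal(d)
true_sq = float(v @ v)

for k in [10, 50, 200, 500]:
    ratios = np.array([
        np.sum(((rng.standard_normal((k, d)) / np.sqrt(k)) @ v) ** 2)
        for _ in range(300)
    ]) / true_sq
    print(f"k = {k:4d}   mean ||Av||²/||v||² = {ratios.mean():.3f}"
          f"   typical |error| = {np.median(np.abs(ratios - 1)):.3f}")
```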
Ingredient 2: Union bound over all pairs
Ingredient 1 shows that a single vector's length is preserved with high probability. But we need all pairwise distances preserved simultaneously. For n points, there are n(n−1)/2 ≈ n²/2 pairs.
Here's the trick: distance preservation for a pair (u, v) is the same as length preservation for the difference vector (u − v). So we need:
(1 − ε) · ||u − v||² ≤ ||A(u − v)||² ≤ (1 + ε) · ||u − v||²
for all n²/2 pairs simultaneously. By the union bound:
P(some pair fails) ≤ (n²/2) · 2·exp(−c·k·ε²) = n² · exp(−c·k·ε²)
For this to be small, we need n² · exp(−c·k·ε²) < 1, which gives:
k > (2/c) · ln(n) / ε², i.e. k = O(log(n) / ε²)
That's it. That's where the log(n) comes from — it's just the union bound over n² pairs. And that's why k doesn't depend on d — the concentration bound for a single projection depends only on k and ε, not on the original dimension.
The JL lemma rests on two pillars: (1) concentration of measure — a single random projection preserves one vector's length with high probability, and (2) union bound — we can simultaneously preserve all O(n²) pairwise distances by choosing k large enough. The log(n) in the target dimension is the "price" of this simultaneous guarantee.
This proof structure reveals something deep: the JL lemma is fundamentally a statement about concentration of measure, dressed up as a dimensionality reduction result. Every piece of high-dimensional intuition we built — shell concentration, distance concentration, near-orthogonality — converges here.
Now that we understand why it works, let's look at what it's been used for.
Applications that changed computing
The JL lemma isn't just a pretty theorem — it's a workhorse. It's been quietly powering some of the most important algorithms in modern computing. Here are the greatest hits.
Nearest-neighbor search and locality-sensitive hashing
Finding the nearest neighbor to a query point in a large dataset is a fundamental problem. In high dimensions, exact algorithms are essentially no better than brute force — this is the curse of dimensionality striking again. But if you project your data down using a random projection, you can build efficient index structures in the reduced space.
Locality-Sensitive Hashing (LSH) takes this further: it uses random projections to build hash functions where similar points are likely to hash to the same bucket. This enables approximate nearest-neighbor search in sub-linear time — a technique used by Spotify for song recommendations, Google for image search, and countless other applications.
Streaming algorithms and sketching
When data arrives in a continuous stream — network traffic, sensor readings, financial transactions — you can't store it all. Random projections let you sketch the stream: maintain a compact summary that preserves key properties.
The Count-Min Sketch and AMS Sketch use random projections to estimate frequencies, inner products, and norms of streaming data using memory that's logarithmic in the data size. This is JL in disguise: you're projecting a high-dimensional frequency vector into a much smaller space while preserving its approximate structure.
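To make the connection concrete, here's a toy AMS-style sketch for the second moment (the sum of squared frequencies) of a stream. It's a simplified illustration: Python's built-in hash stands in for a proper 4-wise independent hash family, and the class name is invented for this example.

```python
# Toy AMS-style sketch: estimate the squared norm of a streaming frequency
# vector with k random-sign counters -- a Rademacher projection applied one
# stream update at a time.  Illustrative only; hash() is not 4-wise independent.
import numpy as np

class ToyAMSSketch:
    def __init__(self, k: int, seed: int = 0):
        self.k, self.seed = k, seed
        self.counters = np.zeros(k)

    def _sign(self, item, row: int) -> int:
        # Pseudo-random but deterministic +/-1 sign for each (item, row) pair.
        return 1 if hash((self.seed, row, item)) % 2 == 0 else -1

    def update(self, item, count: float = 1.0):
        for row in range(self.k):
            self.counters[row] += self._sign(item, row) * count

    def estimate_f2(self) -> float:
        # Each counter^2 is an unbiased estimate of sum_i freq_i^2; average them.
        return float(np.mean(self.counters ** 2))

stream = ["a"] * 100 + ["b"] * 50 + ["c"] * 10 + ["d"]
sketch = ToyAMSSketch(k=64)
for item in stream:
    sketch.update(item)
print("true F2 =", 100**2 + 50**2 + 10**2 + 1, "  estimate ≈", round(sketch.estimate_f2()))
```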
Compressed sensing
In 2006, Emmanuel Candès, Justin Romberg, Terence Tao, and David Donoho proved something revolutionary: if a signal is sparse (most of its coefficients are zero), you can recover it exactly from far fewer measurements than the signal's dimension. The measurement matrix? Random projections.
This builds on the same concentration phenomena, but with stronger structure. Random matrices satisfy the Restricted Isometry Property (RIP) with high probability — a condition stronger than JL that enables exact reconstruction via sparse recovery algorithms like L1 minimization. This has transformed MRI imaging (faster scans), radar (fewer sensors), and astronomical imaging.
Speeding up machine learning
Random projections have been used to speed up virtually every area of machine learning:
- Kernel approximation: Rahimi and Recht's "Random Kitchen Sinks" (2007) use random features — a cousin of JL projections — to approximate kernel methods in linear time. This won a Test of Time award at NeurIPS.
- Dimensionality reduction as preprocessing: Before running expensive algorithms (clustering, classification, regression), project your data down with JL. The results are almost the same, but the runtime drops dramatically.
- Random feature maps in neural networks: The hidden layers of a randomly-initialized neural network act much like a JL projection. This partially explains why neural networks can be effective even before training — a connection to the reservoir computing idea.
Privacy-preserving computation
Random projections can help with privacy by obfuscating individual data points while preserving aggregate properties (like distances and inner products). This isn't differential privacy by itself — additional noise calibration is needed for formal guarantees — but random projections provide a useful starting point for privacy-preserving data analysis.
| Application | What JL preserves | Speedup |
|---|---|---|
| Nearest neighbors (LSH) | Approximate distances | Sublinear query time |
| Streaming sketches | Norms, frequencies | O(log n) memory |
| Compressed sensing | Sparse signal structure | Fewer measurements |
| Kernel approximation | Kernel evaluations | Linear in n (vs n²) |
| ML preprocessing | Pairwise distances | Runtime scales with k instead of d |
The common thread: whenever you have high-dimensional data and need to process it efficiently, a random projection is often the first tool to reach for. It's fast, simple, data-oblivious, and backed by rock-solid theory.
So how do you actually build one?
Building your own random projection
If you want to use the JL lemma in practice, you need to choose a projection matrix. The theorem guarantees that a Gaussian random matrix works, but there are faster alternatives that work just as well.
Option 1: Gaussian projection
Each entry of the k × d matrix A is drawn from N(0, 1/k). This is the "textbook" construction — the one used in the proof. It's optimal in terms of distortion but has two drawbacks: generating k·d random Gaussian values is expensive, and the matrix-vector multiplication takes O(k·d) time.
Option 2: Sparse random projection (Achlioptas, 2003)
Dimitris Achlioptas showed that you can use a much sparser matrix. Each entry is independently set to +√(3/k) with probability 1/6, 0 with probability 2/3, and −√(3/k) with probability 1/6. Two-thirds of the entries are zero, making the projection about 3× faster. The √3 factor ensures the variance matches the Gaussian case (variance 1/k per entry).
Option 3: The ±1 projection (Rademacher)
Even simpler: each entry is +1/√k or −1/√k with equal probability. No zeros, no complicated distributions — just random sign flips. This is the fastest to generate and nearly as good as Gaussian in practice.
Compare them yourself:
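Here's a sketch that builds all three and compares the distortion each produces on the same synthetic data (sizes chosen arbitrarily):

```python
# Gaussian vs. Achlioptas-sparse vs. Rademacher projections.  All three are
# scaled so that E[||Av||^2] = ||v||^2; in practice they behave very similarly.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(5)
n, d, k = 300, 1000, 100
X = rng.standard_normal((n, d))
orig = pdist(X)

def gaussian(k, d):
    return rng.standard_normal((k, d)) / np.sqrt(k)

def achlioptas_sparse(k, d):
    # +sqrt(3/k) w.p. 1/6, 0 w.p. 2/3, -sqrt(3/k) w.p. 1/6
    return rng.choice([1.0, 0.0, -1.0], size=(k, d), p=[1/6, 2/3, 1/6]) * np.sqrt(3 / k)

def rademacher(k, d):
    return rng.choice([1.0, -1.0], size=(k, d)) / np.sqrt(k)

for name, make in [("Gaussian", gaussian), ("Sparse", achlioptas_sparse), ("±1", rademacher)]:
    distortion = np.abs(pdist(X @ make(k, d).T) / orig - 1)
    print(f"{name:8s}   median distortion = {np.median(distortion):.3f}"
          f"   max = {distortion.max():.3f}")
```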
Choosing the target dimension
The practical formula is:
k ≥ C · ln(n) / ε²
where C is a constant (typically between 4 and 12, depending on the desired failure probability and proof technique). In practice, C = 8 is a safe starting point. For n = 10,000 and ε = 0.1:
k ≈ 8 · ln(10,000) / 0.1² ≈ 8 · 9.2 / 0.01 ≈ 7,400
That said, the theoretical bound is often conservative. In practice, you can frequently get away with smaller k. A good strategy is to start with the formula, then experiment with smaller values and check the empirical distortion.
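A tiny helper makes this concrete (C = 8 here is just the conservative default from above; scikit-learn's johnson_lindenstrauss_min_dim computes a comparable bound):

```python
# Choose the JL target dimension k = ceil(C * ln(n) / eps^2).
import math

def jl_target_dim(n_points: int, eps: float, C: float = 8.0) -> int:
    return math.ceil(C * math.log(n_points) / eps ** 2)

for n, eps in [(10_000, 0.1), (1_000_000, 0.1), (1_000_000, 0.3)]:
    print(f"n = {n:>9,}   eps = {eps}   k ≈ {jl_target_dim(n, eps):,}")
```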
- Use ±1 (Rademacher) projections as your default — they're the fastest to generate, and Achlioptas showed they come with essentially the same guarantees as Gaussian ones.
- Always scale correctly: multiply by 1/√k so that lengths are preserved in expectation.
- For very large d, consider structured random projections (like the Fast JL Transform, which uses a fast Hadamard-style transform) to reduce the O(kd) multiplication to O(d log d).
- Test empirically: compute a few hundred pairwise distances before and after projection and check the distortion.
The bigger picture
Let's zoom out. What have we really learned?
We started with the curse of dimensionality — the idea that high-dimensional spaces are inherently hostile. And they are, in some ways: data is sparse, distances concentrate, and your low-dimensional intuitions fail catastrophically.
But then we discovered the blessing. The same concentration phenomena that create the curse also create remarkable structure:
- Shell concentration means that random vectors have predictable lengths, so projections preserve them.
- Distance concentration means that relative differences are what matter, and there aren't too many of those to preserve.
- Near-orthogonality means there's exponentially more "room" than the dimension count suggests, so random subspaces capture most of the structure.
The JL lemma ties these together into a single, clean guarantee: random projections preserve pairwise distances, and you only need O(log n) dimensions to do it.
This has profound implications beyond the specific applications we discussed:
Why do neural networks work? A neural network maps inputs through layers of random-ish weights and nonlinearities into high-dimensional hidden representations, then applies a linear readout. This is eerily similar to a reservoir computer or a JL projection — the random intermediate representation spreads data into a high-dimensional space where linear separation becomes easy. The JL lemma suggests that random high-dimensional representations are generically useful, which partially explains why neural networks are so robust to architectural choices.
Why does feature hashing work? The "hashing trick" (used in Vowpal Wabbit and scikit-learn) maps features to a fixed-size vector using a hash function. This is a deterministic random projection, and JL theory explains why it works: the hash function is sufficiently "random-looking" to preserve inner products.
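As a toy illustration of the mechanism, here's a minimal feature-hashing sketch. The md5-based hashing and the function name are choices made for this example, not how any particular library implements it:

```python
# The hashing trick: map each named feature to a bucket and a sign, then
# accumulate.  A fixed deterministic hash plays the role of the "random" map.
import hashlib
import numpy as np

def hash_features(features: dict, k: int) -> np.ndarray:
    """Map a sparse {feature_name: value} dict to a dense k-dimensional vector."""
    out = np.zeros(k)
    for name, value in features.items():
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        index = h % k                                 # which bucket this feature hits
        sign = 1.0 if (h >> 64) % 2 == 0 else -1.0    # sign hash cancels collision bias
        out[index] += sign * value
    return out

doc1 = {"the": 3.0, "cat": 1.0, "sat": 1.0}
doc2 = {"the": 2.0, "dog": 1.0, "sat": 1.0}
v1, v2 = hash_features(doc1, k=32), hash_features(doc2, k=32)
print("hashed inner product:", float(v1 @ v2), "   true inner product: 7.0")
```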
Is there a fundamental limit? Yes — and it's essentially tight. Noga Alon proved a near-matching lower bound, and Larsen and Nelson later showed that the O(log n / ε²) target dimension is optimal: in general, any map that preserves all pairwise distances within (1 ± ε) must use at least Ω(log n / ε²) dimensions. Randomness isn't just convenient — it's essentially achieving the theoretical minimum.
Perhaps the deepest lesson is philosophical. We tend to think of randomness as the enemy of structure — noise to be eliminated, chaos to be tamed. But the JL lemma reveals the opposite: randomness is a source of structure. A random projection isn't a lazy approximation; it's a mathematically optimal way to compress high-dimensional data. The geometry of high-dimensional spaces is so regular, so predictable in its concentration, that a random map is almost as good as the best possible map.
Next time someone tells you that high dimensions are a curse, remind them: it's also where the blessings hide.
Further reading
- Sanjoy Dasgupta and Anupam Gupta, "An elementary proof of a theorem of Johnson and Lindenstrauss" (2003) — the clearest proof of the JL lemma.
- Roman Vershynin, High-Dimensional Probability (2018) — the definitive textbook on concentration of measure and its applications.
- Dimitris Achlioptas, "Database-friendly random projections" (2003) — the paper introducing sparse random projections.
- Nir Ailon and Bernard Chazelle, "The Fast Johnson–Lindenstrauss Transform" (2009) — how to speed up JL projections using FFT.
- Rahimi and Recht, "Random Features for Large-Scale Kernel Machines" (2007) — the Random Kitchen Sinks paper, connecting JL to kernel methods.
This explainer was designed to build intuition about high-dimensional geometry and the JL lemma. For rigorous proofs, consult Vershynin's textbook or the Dasgupta-Gupta paper. Some interactive demos use simplified models for pedagogical clarity.