
Why Your Brain Is Shaped Like That

The 20-watt computer in your skull runs on a handful of design principles. They explain everything from why neurons whisper instead of shouting, to why you can't just make a bigger brain.

You have about 86 billion neurons. Your brain is 2% of your body mass but burns about 20% of your calories. A single spike, counting all of its downstream synaptic effects, costs several billion ATP molecules. Axons — the brain's wires — fill roughly half of cortical volume. And yet the whole thing fits in a box the size of a grapefruit, boots up every morning from nothing, and somehow writes Hamlet.

Not by magic. By ruthless engineering. Every gram of grey matter is under the same constraints a chip designer faces: energy, space, noise, latency. Evolution has been iterating on this design for 600 million years, and the solutions it landed on aren't arbitrary — they are nearly forced by physics. This is an explainer about those solutions.

I.

Your brain runs on a light bulb

Let's start with the headline number: a human brain runs on about 20 watts. That is less than a MacBook charger. It is less than a ceiling fan. It is a little more than a USB power bank trickle-charging a phone.

For that 20 watts you get: 86 billion neurons, roughly a quadrillion synapses, most of the heavy lifting of vision, hearing, language, motor control, memory, and whatever-it-is that happens when you daydream. The thing boots up every morning with no recalibration, runs for decades with no spare parts, and is small enough to carry around on top of your shoulders.

Contrast that with the computers we build. A single NVIDIA H100 GPU — the kind of chip used to train large language models — draws about 700 watts. Training one frontier model consumes gigawatt-hours. Even a mid-range laptop CPU burns twice the brain's budget just browsing the web.

The comparison isn't perfectly fair. Brains and GPUs solve different problems (one controls a body in real time; the other predicts tokens from a prompt). But the power gap is real, and it is enormous. The brain is doing something the GPU is not: computing under a brutal energy budget.

So how does this work? How does 20 watts of wet meat outcompete 700 watts of silicon? That's the mystery this explainer is about. And the short answer is: the brain is a machine shaped, by natural selection, to stay within a strict budget, under exactly the constraints laid out above.

The principles we will uncover aren't a list of cute tricks. They're the near-inevitable consequences of trying to compute anything useful on a budget this small. Once you see them, the brain's anatomy stops looking like a pile of pink oatmeal and starts looking like a contract drawn up with physics.

Figure 1 · Power budget, on a log scale
[Chart: continuous power draw on a log scale, 1 W to 100 MW. Human brain, 20 W (86 billion neurons, runs for ~80 years on autopilot); laptop CPU, 45 W (≈2.3× the brain, just browsing the web); one NVIDIA H100, 700 W (35× the brain, one GPU inference card); one LLM training cluster, ~10 MW (500,000× the brain, for a few weeks of training).]

The brain is the cheapest thing on the chart. By orders of magnitude. The rest of this explainer is about how.

II.

The currency of thought

Before we talk about design principles, we need to understand the budget. Every decision the brain makes — how many neurons to grow, how thick to make an axon, when to fire, whom to connect to — is a trade against one resource: ATP, the universal molecular battery.

ATP powers pretty much everything in a cell. In neurons, its single biggest customer is a protein called the sodium-potassium pump. Every time a neuron spikes, sodium rushes in and potassium rushes out. To fire again, the neuron has to pump those ions back across the membrane, against their concentration gradients, and that costs ATP — lots of it.

Here's the ugly truth, in round numbers (Attwell & Laughlin 2001; Harris & Attwell 2012):

  - A single spike, counting all of its downstream synaptic effects, costs on the order of a few billion ATP molecules.
  - Roughly half the brain's budget goes to signaling (spikes and synaptic transmission); the rest goes to housekeeping (resting potentials, protein turnover, vesicle recycling).
  - Within the signaling share, synaptic transmission rivals or exceeds the cost of the spikes themselves.

And here is the hard ceiling: if the average cortical firing rate crept up to even 10 Hz, the brain would demand more power than cerebral blood flow can supply. So it can't. The average firing rate has to stay well under 1 Hz, across the entire cortical population (Lennie 2003).

This is the constraint. Everything downstream — sparse firing, short wires, analog retinas, adaptive receptors, dendritic computation — is a strategy for squeezing more bits per ATP.

To make it concrete, try the calculator. You set the average firing rate and what fraction of the 86 billion cortical neurons are actively participating. The widget computes the brain's total power draw and tells you whether your design is biologically plausible, or whether you've just cooked your brain.
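The arithmetic behind the widget is simple enough to sketch. This is a back-of-envelope model under loudly stated assumptions (roughly 3 × 10⁹ ATP per spike, the "several billion" above; roughly 8 × 10⁻²⁰ J of usable free energy per ATP; a flat 10 W of housekeeping), not the finer-grained budgets of the cited papers:

```python
# Back-of-envelope cortical power budget.
# Assumptions (hedged): ~3e9 ATP per spike (incl. synaptic effects),
# ~8e-20 J of free energy per ATP hydrolysis, a flat 10 W housekeeping cost.

N_NEURONS = 86e9           # total neurons
ATP_PER_SPIKE = 3e9        # "several billion ATP" per spike
J_PER_ATP = 8e-20          # joules liberated per ATP molecule
HOUSEKEEPING_W = 10.0      # resting potentials, protein turnover, ...
BLOOD_FLOW_LIMIT_W = 25.0  # roughly what cerebral blood flow can supply

def brain_power(rate_hz, active_fraction=1.0):
    """Total power draw (watts) for a given average firing rate."""
    spikes_per_s = N_NEURONS * active_fraction * rate_hz
    signaling_w = spikes_per_s * ATP_PER_SPIKE * J_PER_ATP
    return HOUSEKEEPING_W + signaling_w

for rate in (0.3, 1.0, 10.0):
    p = brain_power(rate)
    verdict = "plausible" if p <= BLOOD_FLOW_LIMIT_W else "COOKED"
    print(f"{rate:5.1f} Hz -> {p:7.1f} W  ({verdict})")
```

At the widget's default of 0.3 Hz this lands near the real ~20 W; by 1 Hz it already breaches the blood-flow ceiling, which is the sense in which sparse firing is forced rather than chosen.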

Interactive 1 · Metabolic Budget Calculator
[Interactive widget: sliders set the average firing rate (default 0.30 Hz) and the fraction of neurons active (default 100%); bars show housekeeping ~10 W and signaling ~10 W against a ~25 W blood-flow limit, with a verdict such as "✓ Plausible design — blood flow can supply this much power."]

Move the sliders. At the default (0.3 Hz average rate across all neurons) you land near the real 20 W. Push the rate up and the power bar turns red — at some point, no blood supply could feed the brain, and the design fails.

Well ackshually… The exact split between "signaling" and "housekeeping" varies across studies and species, and not all authors agree on where to draw the line. The qualitative fact — that signaling cost scales with spike rate fast enough to force sparse firing — is robust across every budget that has been published since Attwell & Laughlin 2001.
III.

Send only what surprises

Given the budget, the first principle writes itself: don't send information the receiver already has.

This is the insight Horace Barlow turned into a research program in 1961. His claim was radical for its time: the goal of early sensory processing isn't to represent the world faithfully. It's to re-encode the world so the message is as short as possible. A neuron that transmits redundant information is burning ATP for no new bits.

And natural signals are wildly redundant. Neighboring pixels in a photograph are almost identical. A blue sky sends the same message a million times over. Textures repeat. Movies change slowly from frame to frame. Sending the raw data would be like mailing someone the entire Wikipedia dump every time they ask for the weather.

So the retina doesn't. It subtracts the predictable part.

The retina as a compression engine

The retina has roughly 100 million photoreceptors but only about 1 million optic-nerve fibers leaving the eye for the brain. That's a 100× compression ratio at the very first processing stage. Somehow the retina takes all those pixel-level measurements and squeezes them into a much smaller stream of signals without losing much that matters.

The trick — discovered in cat retina by Stephen Kuffler in 1953 — is the center-surround receptive field. A retinal ganglion cell looks at a small patch of photoreceptors in its "center" and compares their average brightness to an annulus of photoreceptors around it (the "surround"). It fires when the center is brighter than the surround (or the reverse, depending on the cell type). Effectively, each ganglion cell is computing:

output = (center brightness) − (local average brightness)

What gets through is the difference between a pixel and its neighborhood — in other words, the part the neighborhood didn't predict. On a uniform surface the output is zero: no spikes needed, no ATP spent. On an edge, where the center and surround disagree sharply, the output is large: the spike is earning its keep.

This is the same move JPEG uses (local averages + difference coding). The same move video codecs use (send only what changed from the last frame). The same move modern LLMs use (predict the next token and keep only the residual surprise). The retina figured it out first, by about half a billion years.
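A minimal sketch of the center-surround computation, using a crude difference-of-boxes filter in place of the retina's smoother difference-of-Gaussians (the image and kernel size are arbitrary illustrations):

```python
import numpy as np

def center_surround(img, radius=1):
    """output = pixel - local average: a crude ganglion-cell model."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    # Local average over a (k x k) neighborhood, via shifted sums.
    local_avg = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            local_avg += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    local_avg /= k * k
    return img - local_avg

# A flat field with one vertical edge:
img = np.zeros((5, 8))
img[:, 4:] = 1.0
out = center_surround(img)
# Uniform regions -> ~0 (no spikes to pay for); the edge columns light up.
print(np.round(out, 2))
```

On the flat regions the output is exactly zero, so a spiking readout would stay silent there; only the two columns straddling the edge produce a signal worth paying for.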

Interactive 2 · The Retina's Eye
Raw input
After center-surround
6 px
Active pixels
Bits saved vs raw

Uniform regions vanish into grey — no signal, no spikes. Edges light up. The retina sends roughly the complement of what you'd expect: not the image, but the places the image is about to surprise you.

Aside The precise framing of early sensory processing is debated. "Redundancy reduction" à la Barlow is one reading; "predictive coding" (send only the prediction error) is a closely related one; and "sparse coding" over a learned dictionary is yet another. They agree more than they disagree: the retina, cochlea, and early cortex all seem to push the signal toward a representation where each active unit is rare, informative, and cheap. This explainer uses the term efficient coding to cover all three.
IV.

Whisper instead of shout

If each spike costs several billion ATP — counting the downstream synaptic work — then every spike you don't send is money in the bank. So the next principle follows immediately:

Send information at the lowest spike rate that gets the job done.

There are two ways to pull this off, and the brain uses both:

  1. Lower the rate directly. Transmit at a few Hertz, not hundreds. Let most time slots be silent.
  2. Sparsify the population. Instead of 1,000 neurons each firing at 10 Hz, have 100 neurons fire at 100 Hz and the other 900 stay quiet. The average firing rate is the same, but activity is now a rarer event per neuron, and by Shannon's logic a rarer event carries more bits. You've made each spike earn its keep.
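The Shannon logic in step 2 is one line of arithmetic. Model each small time bin as a binary symbol with spike probability p: the entropy is H(p) bits per bin, so the code can deliver at most H(p)/p bits per spike, a ceiling that grows as spikes get rarer:

```python
from math import log2

def bits_per_spike(p):
    """Max information per spike for a binary channel with spike prob p."""
    h = -p * log2(p) - (1 - p) * log2(1 - p)  # entropy per time bin
    return h / p                              # bits carried per spike

for p in (0.5, 0.1, 0.01, 0.001):
    print(f"p = {p:<6} -> {bits_per_spike(p):5.2f} bits/spike")
```

Roughly one extra bit per spike for every halving of p. This is an upper bound from the channel statistics, not a rate any real decoder is guaranteed to achieve.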

Real cortex uses both, aggressively. Recordings from awake monkey V1 during natural viewing (Vinje & Gallant 2000) and from rodent auditory cortex (Hromádka et al. 2008) find that only a small minority of neurons respond above baseline at any given moment. Across cortex as a whole, average firing rates well under 1 Hz are routine (Lennie 2003, Shoham et al. 2006). Most neurons, most of the time, are silent.

A worked example: the fly's H1 neuron

A beautiful case study comes from an insect. The H1 neuron sits in the fly's visual lobe and reports horizontal motion — the kind of signal a fly needs to stabilize flight. H1 has been measured to carry about 1 bit of information per spike under natural conditions (de Ruyter van Steveninck & Bialek). That's astonishingly efficient; many engineered channels do worse.

How does H1 pull that off? Part of the answer is spike timing. If you rate-code (average spike count over a fixed window), you throw away the temporal pattern. H1 doesn't. Its downstream readout cares about precisely when each spike arrives. A small number of well-timed spikes beats a large number of sloppily-placed ones.

Noise is the hidden reason this matters

Every spike train is jittery. Voltage-gated channels are stochastic, synapses release vesicles probabilistically, membranes have thermal noise. If you want a reliable readout from a rate code, you have to average over many spikes to beat the noise down — and averaging costs spikes. A sparse, temporally precise code hands the downstream decoder the most useful bits in the fewest events. The noise argument and the energy argument point the same way.
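The averaging tax is easy to put numbers on. For Poisson-like spiking, a count of N spikes fluctuates with standard deviation √N, so the relative error of a rate estimate falls only as 1/√N (an idealization; real spike trains are not exactly Poisson):

```python
# Relative error of a rate estimate built from a Poisson spike count.
# Observing N spikes gives std sqrt(N), so relative error = 1/sqrt(N).

for n_spikes in (4, 16, 64, 256):
    rel_error = 1 / n_spikes ** 0.5
    print(f"{n_spikes:4d} spikes -> rate known to about +/-{100 * rel_error:.1f}%")
```

Halving the readout noise costs four times the spikes, which is exactly the pressure that pushes codes toward timing precision instead of brute-force averaging.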

Interactive 3 · H1 on a Rainy Day
[Interactive widget: a true motion (velocity) trace encoded three ways: ① dense rate code (always firing, rate tracks velocity), ② sparse rate code (fires only when velocity is high), ③ temporal code (one well-timed spike per event), with spike-count and bits-per-spike readouts for each.]

All three codes report the same motion signal. The dense rate code burns around 80 spikes per one-second window; the sparse code fires in the tens; the temporal code makes do with around a dozen — each one carrying more information than a whole barrage of dense-code spikes. On a strict ATP budget, only ③ is sustainable.

Aside This is why grandmother cells aren't quite the myth they're sometimes said to be. Extremely sparse codes ("one neuron fires for Jennifer Aniston") are metabolically cheap. The real brain probably sits somewhere between fully distributed and fully localist — sparse but not singular.
V.

Analog where you can, spikes only when you must

We've been talking about spikes as if the brain's whole vocabulary were ones and zeros. It isn't. Spikes are only the part of the story that we, as recording neuroscientists, find easiest to measure. Inside a single neuron, signals are analog — continuous, graded voltages on the membrane, continuous currents through synapses, continuous concentrations of calcium in a dendrite. These analog operations are where most of the actual computation happens, and they are astonishingly cheap.

An analog integration — summing inputs on a dendrite, for instance — costs roughly whatever it costs to let ions flow passively through a patch of membrane. There is no discrete event to pay for, no pump needing to run at full tilt. As long as the signal stays small and local, analog is close to free.

The catch is that analog signals degrade with distance. A membrane is both resistive and capacitive: it has the electrical shape of a leaky cable. A subthreshold voltage launched at one end of a long dendrite doesn't arrive at the other end unchanged — it smears out, attenuates, and drifts into the noise floor. As a rule of thumb: analog is beautiful over a few hundred micrometers. Over millimeters, it's a mess. Over centimeters or meters, it's impossible.
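The passive-cable picture can be put in numbers. At steady state a graded potential decays as V(x) = V₀ · e^(−x/λ); the length constant λ = 0.5 mm below is an assumed, dendrite-ish value, not a measured one:

```python
from math import exp

LAMBDA_MM = 0.5  # length constant: a few hundred micrometers (assumed)

def attenuation(distance_mm, lam=LAMBDA_MM):
    """Fraction of a graded potential surviving a passive cable run."""
    return exp(-distance_mm / lam)

for d in (0.1, 0.5, 2.0, 10.0):
    print(f"{d:5.1f} mm -> {100 * attenuation(d):10.6f}% of the signal left")
```

Most of the signal survives a few hundred micrometers; at a centimeter essentially nothing does. Hence spikes for anything long-haul.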

This is exactly the problem a digital spike solves. An action potential is self-restoring: as long as the membrane is excitable, each patch regenerates the full voltage swing as the wave passes through. A spike that leaves your lumbar spine can arrive at your big toe with the same amplitude a meter later. Digital is robust because each repeater cleans up the signal on the way.

So the design rule is:

Analog where you can, spikes only when you must.

The retina gives us a beautiful demonstration. Its first several layers — photoreceptors, bipolar cells, horizontal cells, and most amacrine cells — run entirely on graded voltages. No action potentials. Massive parallel computation (contrast enhancement, gain control, direction-selectivity, motion detection) happens in pure analog, inside a structure only a fraction of a millimeter thick. It's only when the retinal ganglion cells have to send their output a long way — down the optic nerve, to the thalamus — that spikes finally show up. The retina defers digitization until the last possible moment.

The same pattern shows up elsewhere. Cochlear hair cells communicate with their auditory-nerve partners via graded voltage, though the nerve itself spikes. Many invertebrate neurons get by with no spikes at all. Spikes are expensive and you only spend on them when geometry forces your hand.

Interactive 4 · The Analog-Digital Tradeoff
[Interactive widget: a distance slider (default 1 mm) compares an analog channel (graded voltage) with a digital channel (spikes), showing whether each arrives intact and their energy costs (analog ~1× baseline, digital ~1000× baseline).]

Move the slider. At short distances (inside a dendrite, up to a few hundred micrometers), analog wins on every axis. Crank the distance up and analog falls apart — the signal dies out before it arrives. Spikes cost vastly more to generate but stay legible over any distance.

VI.

Every wire is a tax

Your cortex is mostly wire. Axons and dendrites together occupy roughly half of cortical volume (Braitenberg & Schüz). Cell bodies, synapses, blood vessels, and glia share the remaining half. You are, anatomically speaking, walking around with the better part of a kilogram of neural cabling sloshing inside a braincase.

Every wire costs:

  - Space. Axons and dendrites compete for a fixed cranial volume, and they already fill half of it.
  - Energy. Membrane has to be built, maintained, and repolarized along the wire's entire length.
  - Time. A longer wire means a later-arriving signal, and latency matters to a body acting in real time.

The conclusion is inescapable:

Minimize wire.

The brain uses three big tricks to obey this rule:

Trick 1: Topographic maps

Adjacent points on the retina project to adjacent neurons in V1. Adjacent fingers project to adjacent columns in somatosensory cortex. Adjacent frequencies project to adjacent regions in auditory cortex. These topographic maps aren't there because the brain likes neatness — they're there because related neurons need to talk to each other, and putting them next to each other minimizes wire.

Trick 2: Modularity

The brain is carved into dozens of functionally distinct areas — V1, V2, V4, MT, IT, M1, S1, and so on. Why? Because most communication within any one function is local. Keeping all the parts of "visual processing" in the occipital lobe means most of visual cortex's axons stay within a few millimeters of each other. Only a handful need to be long-range.

Trick 3: Small-world connectivity

Most connections in cortex are local. A small number are long-range shortcuts between distant areas. This is the small-world pattern that Chklovskii and colleagues (2002) showed is close to what you would get if you solved the connectivity layout as a formal wire-minimization problem. In well-mapped nervous systems — C. elegans, primate visual cortex — the real layout sits remarkably close to the wiring-optimal one.

Below, try your hand at the problem. You have eight brain "modules" connected by a fixed pattern of edges. Drag them around. See if you can beat the random layout — and then try to beat the layout a computer finds by gradient-descent on wire length.
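The widget's optimizer can be sketched as gradient descent on total wire length. Everything here is illustrative: the eight-module graph is invented (two dense clusters plus one shortcut), and a soft repulsion term stands in for the fact that real modules occupy area and can't collapse onto one point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Eight "modules"; edges list which pairs are connected.
# Hypothetical graph: two densely wired clusters plus one shortcut.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),   # cluster A
         (4, 5), (4, 6), (5, 6), (5, 7), (6, 7),   # cluster B
         (3, 4)]                                   # long-range shortcut

def total_wire(pos):
    """Sum of Euclidean edge lengths for a layout."""
    return sum(np.linalg.norm(pos[i] - pos[j]) for i, j in edges)

pos = rng.uniform(0, 10, size=(8, 2))   # random starting layout
start = total_wire(pos)

lr = 0.05
for _ in range(2000):
    grad = np.zeros_like(pos)
    # Attraction: each edge pulls its endpoints together.
    for i, j in edges:
        d = pos[i] - pos[j]
        g = d / (np.linalg.norm(d) + 1e-9)
        grad[i] += g
        grad[j] -= g
    # Soft repulsion: modules occupy area, so they can't all collapse.
    for i in range(8):
        for j in range(i + 1, 8):
            d = pos[i] - pos[j]
            r = np.linalg.norm(d) + 1e-9
            push = d / r * min(4.0, 4.0 / r ** 2)
            grad[i] -= push
            grad[j] += push
    pos -= lr * grad

print(f"random layout: {start:.1f}  optimized: {total_wire(pos):.1f}")
```

Densely connected modules drift together while the lone shortcut stays long, which is the small-world layout the Chklovskii result predicts.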

Interactive 5 · Minimize the Wire
[Interactive widget: drag the eight modules; readouts compare your total wire length with a random layout and a computer-found optimum.]

Drag the modules. The highlighted edges tell you which pairs are connected, and the total wire length updates live. Notice how clusters of densely-connected modules want to be near each other — exactly the layout principle that organizes real cortex.

Well ackshually… The cortex is folded — gyri and sulci — and one reason often cited is that folding reduces white-matter volume by keeping connected areas close. This is probably partly right, but also partly mechanical: the cortical sheet grows faster than the skull, and has to buckle. As with most things in the brain, multiple pressures matter at once.
VII.

The axon triangle: speed, size, energy

Zoom in on a single wire — one axon. The designer (evolution) wants three things from it:

  - Speed: signals should arrive fast, because latency is behavior.
  - Size: the wire should be thin, because volume is scarce.
  - Energy: conduction should be cheap, because ATP is scarce.

You can't have all three. The physics sets up a triangle where improving any two corners makes the third worse. This is one of the most elegant design tradeoffs in all of biology.

Unmyelinated axons

In an unmyelinated axon — the kind you find in invertebrates and in many thin vertebrate fibers — the conduction velocity scales as the square root of the diameter:

velocity ∝ √(diameter)

Want to double the speed? You need to quadruple the diameter. The membrane area per unit length goes up 4×, and the volume of the axon (proportional to diameter squared) goes up a whopping 16×. Unmyelinated speed is ruinously expensive to buy.

This is why the giant squid axon is giant. It's about a millimeter thick, so it can conduct at ~25 m/s — fast enough to trigger an escape jet when a predator looms. The squid can afford the volume only because it's a huge animal and the axon is a single wire to its mantle. A human trying to do the same thing would need an optic nerve the thickness of a telephone pole.

Myelin: the hack that changed everything

Vertebrates invented a fix. Wrap the axon in an insulating sheath — myelin — leaving small gaps (nodes of Ranvier) where the spike is regenerated. The signal effectively jumps between nodes, a process called saltatory conduction. The velocity now scales linearly with diameter:

velocity (myelinated) ∝ diameter

This is huge. At a given diameter, a myelinated axon is roughly 5–10× faster than an unmyelinated one. Or equivalently, for a given speed, you can use a much thinner (and cheaper) wire.

But myelin isn't free. The oligodendrocytes that make it cost metabolic energy to grow and maintain. The sheath itself takes up space. And there's a minimum diameter below which it doesn't pay off: below about 0.2 µm (Waxman & Bennett 1972), the space and energy the sheath consumes outweigh the speed it buys. Real thin axons, like many cortical local-circuit fibers, don't myelinate. Real thick axons, like motor neurons projecting to muscles, always do.
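The two scaling laws are easy to lay side by side. The constants are rough calibrations, not gospel: a √d law pinned to the squid giant axon (about 25 m/s at about 1,000 µm) and the classic rule of thumb of about 6 m/s per µm for myelinated fibers; the 0.2 µm cutoff encodes the Waxman & Bennett overhead argument rather than anything in the speed curves:

```python
from math import sqrt

def v_unmyelinated(d_um):
    """~sqrt(d) law, pinned to the squid giant axon (25 m/s at 1000 um)."""
    return 25.0 * sqrt(d_um / 1000.0)

def v_myelinated(d_um):
    """~linear law, about 6 m/s per um of diameter (rule of thumb)."""
    return 6.0 * d_um

MIN_MYELIN_UM = 0.2  # below this, sheath overhead outweighs the speed gain

for d in (0.1, 0.2, 1.0, 10.0):
    u, m = v_unmyelinated(d), v_myelinated(d)
    best = "myelinate" if d >= MIN_MYELIN_UM else "stay bare"
    print(f"d = {d:5.1f} um: bare {u:5.2f} m/s, myelinated {m:6.1f} m/s -> {best}")
```

Below the cutoff the bare fiber wins by default; above it, the myelinated curve is faster at every diameter on the chart.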

The brain has both, each axon in the cheapest configuration for its job. Try to build one.

Interactive 6 · Design Your Own Axon
[Interactive widget: a diameter slider (default 1.0 µm) with a side view of your axon and a velocity-vs-diameter plot (0.1 to 20 µm, 0 to 120 m/s) showing the unmyelinated (∝√d) and myelinated (∝d) curves with real biological axons as gray dots; readouts give conduction speed, cross-section area, and a design verdict.]

The pink dashed curve is the unmyelinated option: slow to gain speed, thick to be fast. The teal curve is the myelinated option: faster for any given diameter, but only worth it above about 0.2 µm. The grey dots are real biological axons — from cortical thin fibers to motor neurons. They cluster tightly along one curve or the other. Evolution is not making arbitrary choices.

VIII.

Compute in the dendrites

Here's a generalization of the previous section: if wire is expensive, pack more computation into each neuron, because local computation is cheaper than extra neurons plus the axons to connect them. A neuron that can solve a hard sub-problem inside its own dendrites is saving you a whole circuit.

For a long time, textbooks treated the neuron as a point — a little weighted-sum-plus-threshold unit, the classic McCulloch-Pitts model, which is also what artificial neural networks imitate. Dendrites, in this view, were passive antennae: they just collected inputs and funneled them to the soma to be summed.

That picture is wrong, and has been for decades.

Real cortical dendrites are studded with voltage-gated ion channels (NMDA, Ca²⁺, Na⁺). They generate their own local regenerative events called dendritic spikes. Different branches act like semi-independent subcomputers, each one applying a nonlinearity to its own set of inputs before passing the result along. Poirazi & Mel (2003) showed that a single cortical pyramidal cell can, in principle, implement the computation of a small two-layer artificial neural network. One neuron; two layers.
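A toy version makes the two-layer claim concrete. A single linear-threshold "point neuron" cannot compute XOR, but a unit with two thresholded branches feeding a summing soma can; the branch wiring below is hand-picked for the demo, not a model of any real cell:

```python
def step(x, theta=0.5):
    """Threshold nonlinearity, standing in for a dendritic spike."""
    return 1.0 if x >= theta else 0.0

def dendritic_neuron(x1, x2):
    """Two nonlinear branches, then a summing soma: computes XOR."""
    branch_a = step(x1 - x2)   # fires for input (1, 0)
    branch_b = step(x2 - x1)   # fires for input (0, 1)
    return step(branch_a + branch_b)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", int(dendritic_neuron(x1, x2)))
```

No single weighted sum of the two inputs pushed through one threshold can produce this truth table, which is precisely the sense in which nonlinear dendrites buy a neuron an extra layer.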

A concrete case: direction selectivity in the retina

The retina has cells called starburst amacrine cells that detect the direction of motion. A starburst has a radial dendritic tree, like a starfish, with inputs arriving along the whole spread of each branch. When light moves outward along a dendrite, from soma toward tip, it excites the proximal inputs first, and the excitation sweeps along with the stimulus, summing up as it travels toward the tip. Motion in this "preferred" direction therefore builds a strong depolarization and a strong output. Motion in the opposite ("null") direction works against that summation, and inhibitory inputs along the way cancel what remains, so the output stays silent.

The punchline: the starburst cell computes direction selectivity inside its own dendritic tree, before a single spike leaves the cell. An equivalent network built from simple point-neurons would need around a dozen cells, several layers of synapses, and a cobweb of axons running between them. The dendrite-based version, in contrast, is one cell. Massively cheaper.

Figure 2 · One fancy neuron vs ten plain ones
[Diagram: ① a starburst amacrine cell (1 cell, ~24 inputs, 8 direction-selective outputs, motion arrow) beside ② an equivalent point-neuron network with input, delay-and-gate, and direction-cell layers (~14+ neurons, ~40+ synapses, extra axons).]

Both circuits compute the same function: given a moving stimulus, report its direction. On the left, one clever cell does it in analog, inside its own dendrites, for roughly the energy cost of one cell. On the right, a point-neuron network achieves the same computation using more neurons, more synapses, more axons — and more spikes. Local compute wins.

The same story plays out in the hippocampus, where pyramidal-cell dendrites combine grid-cell inputs with sensory cues to produce place-specific firing; in the cerebellum, where Purkinje cells integrate hundreds of thousands of inputs with branch-specific gain control; and in the cortex, where layer-5 pyramidal neurons use NMDA spikes and dendritic calcium plateaus to implement nonlinear combinations of feedback and feedforward signals. The common theme: whenever you can trade a little biophysical cleverness for a lot of extra neurons and axons, take the trade.

IX.

Adapt, match, track reality

Efficient coding (§III) isn't just about space. It's also about time. The world at noon is not the world at midnight; a code tuned for daylight is wasted on a moonless forest. So the "match your code to the statistics of the input" principle has to be reapplied continuously. That's adaptation.

Every neuron has a limited dynamic range — perhaps five or six bits of effective information per spike train, in a good case. The world, on the other hand, has a dynamic range that's embarrassing. Starlight to noon sunshine is a factor of 10⁹. Sound intensities span twelve orders of magnitude from quietest whisper to painful loudness. A single-scale neuron would saturate half the time and be lost in the dark the other half.

The way out is to rescale the encoding continuously. This is why:

  - your eyes take minutes to regain sensitivity when you step from sunlight into a dark cinema: the retina is re-tuning its operating point;
  - the pupil, the photoreceptors, and the retinal circuitry each apply their own layer of gain control;
  - a constant pressure on your skin fades from awareness: adapting receptors stop reporting what isn't changing.

Laughlin's fly, and the information-theoretic optimum

In 1981 Simon Laughlin (yes, same Laughlin) measured the input-output curve of the blowfly's large monopolar cells — the first interneurons in the fly visual system — and compared it to the distribution of contrasts the fly actually encounters in a natural scene. The comparison was beautiful:

The neuron's response curve was almost exactly the cumulative distribution function of natural contrast.

Why is that the right answer? Because if you match the response curve to the CDF of the input, each output level is used with equal probability — which, by a classic result in information theory, is the encoding that maximizes transmitted bits per spike for a bounded output. The fly has, in effect, tuned itself to the world it lives in. A fly that evolved in a different visual environment would have a slightly different curve.

This is also why perceptual laws like Weber–Fechner exist: perceived intensity often goes as the log of physical intensity. Not a mystery — close to the optimal compressive coding when inputs span many orders of magnitude with a roughly log-distributed probability.
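Laughlin's rule is easy to verify numerically: use the input's own empirical CDF as the response curve and every output level gets used equally often, which maximizes entropy for a bounded output. The log-normal "contrast" distribution below is a made-up stand-in for real scene statistics:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for natural contrasts: skewed, heavy-tailed, log-normal-ish.
contrasts = rng.lognormal(mean=0.0, sigma=0.75, size=100_000)

# The Laughlin move: response curve = empirical CDF of the input.
sorted_c = np.sort(contrasts)
def response(x):
    """Map input values through the empirical CDF, into [0, 1]."""
    return np.searchsorted(sorted_c, x) / len(sorted_c)

out = response(contrasts)
counts, _ = np.histogram(out, bins=10, range=(0, 1))

# Every response level is occupied about equally -> max bits per symbol.
print("occupancy per response bin:", counts)
```

A generic straight-line curve would waste output levels on inputs that rarely occur; the CDF spends dynamic range exactly where the probability mass is.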

Your turn to match a histogram.

Interactive 7 · Match the Histogram
[Interactive widget: an input histogram from a natural scene beside a draggable response curve (violet dots, 0 to max response over the input range); readouts show information transmitted and the match to the optimal CDF.]

Drag the violet dots to shape your neuron's response curve. The widget computes how much information the curve transmits about the input. The maximum — revealed by the "optimum" button — is always the cumulative distribution of the input. A neuron that knows the world's statistics carries the most bits. A neuron with a fixed generic curve (say, a straight line) does not.

X.

The principles compose

Take a breath. Here is the list, in seven words each:

  1. Send only what surprises the receiver downstream.
  2. Whisper: use the fewest spikes that work.
  3. Stay analog locally; spike only for distance.
  4. Minimize wire; keep neurons that talk close.
  5. Myelinate only when speed justifies the overhead.
  6. Compute in dendrites before spending a spike.
  7. Adapt the code to the world's statistics.

These are not seven independent tips. They all fall out of the same recurring pressure: intelligence has to run on a metabolically limited, physically embedded substrate. Evolution has stress-tested an enormous number of nervous systems — from jellyfish nerve nets to human cortex — and the ones that survived tend to respect these constraints in overlapping ways. The principles are, more or less, the shape of the survivors.

Why you can't just make a bigger brain

An obvious question: if a larger brain is smarter, why aren't we all dolphins? Suzana Herculano-Houzel and colleagues answered this using a technique called isotropic fractionation — literally dissolving brains and counting the nuclei in the soup. The results were startling:

  - Neuron count doesn't simply track brain size. Primate brains pack neurons at nearly constant density as they grow; rodent brains dilute them, so a rodent brain with our 86 billion neurons would have to weigh tens of kilograms.
  - The human brain is no outlier: it is a linearly scaled-up primate brain, with about the neuron count you'd predict for a primate of our brain mass.
  - Bigger isn't denser where it counts. An elephant brain is three times the mass of ours, yet its cerebral cortex holds roughly a third as many neurons as a human's; most of its neurons sit in the cerebellum.

The moral: the brain's constraints are not arbitrary. They're the reason bigger isn't automatically better, and they're part of why there is no obvious path to a brain ten times the size of a human's that would still fit in a skull with enough blood supply to run.

What these principles help explain

Not fully, but substantially:

  - why average cortical firing rates sit far below 1 Hz;
  - why the retina computes in analog and compresses 100× before the optic nerve;
  - why cortex is organized into topographic maps and local modules;
  - why some axons are myelinated and others are left bare;
  - why perception is compressive (Weber–Fechner) and ceaselessly adaptive.

A harder comparison: brains vs LLMs

You can't talk about neural design in the 2020s without the obvious question: if evolution built a 20-watt thinking machine, why does it take 700 watts to run GPT-4 inference, and gigawatt-hours to train it?

The honest answer isn't that LLMs are "doing it wrong." It's more nuanced:

  - GPUs keep memory and compute apart, so every operation pays for round-trips to HBM; a neuron stores its "weights" in the very synapses that do the computing.
  - Transformers activate densely and advance on a global clock; cortex is sparse and event-driven, spending energy only where something happened.
  - An LLM's weights are frozen at inference time; the brain re-tunes itself continuously.
  - And the head start is absurd: evolution has iterated on neural hardware for 600 million years, large-scale silicon for a few decades.

Here's a scorecard, with the caveat that the comparison is imperfect:

Principle                       | Brain                | Transformer on GPU  | Neuromorphic target
Sparse activity                 | ✓ <1 Hz avg          | ✘ dense             | ~ partial (MoE)
Analog local compute            | ✓ graded voltages    | ✘ all digital       | ✓ in-memory compute
Co-located memory + compute     | ✓ same cell          | ✘ HBM round-trip    | ✓ the whole point
Event-driven operation          | ✓ spikes             | ✘ tick-synchronous  | ✓ spike-based
Wire-aware layout               | ✓ small-world cortex | ~ chiplet placement | ~ mesh interconnects
Online adaptation               | ✓ continuous         | ✘ frozen weights    | ~ work in progress
Low precision / noise tolerance | ✓ noise-native       | ~ bfloat16, quant   | ✓ noise-native

The scorecard shows the efficiency gap isn't mysterious — it's a design gap, and it's one that hardware people are actively trying to close. The principles aren't just a story about biology. They are, increasingly, a roadmap for silicon.

Closing thought

The brain is not a general-purpose computer that happens to run on biology. It is a machine whose structure has been heavily sculpted by a few hard physical limits — energy, space, latency, noise. Evolution didn't "choose" these principles so much as stumble onto designs that don't go extinct. The principles are the shape of the survivors.

So the next time someone asks why a brain can do so much with so little — tell them it's because it had to. A more generous budget would have produced a different, lazier machine. The 20 watts is why the design is interesting at all.

The universe's stingiest engineer is the one you should learn from. Thermodynamics is a teacher who never grades on a curve.

Further reading

  - Barlow (1961), "Possible Principles Underlying the Transformations of Sensory Messages."
  - Laughlin (1981), "A Simple Coding Procedure Enhances a Neuron's Information Capacity."
  - Attwell & Laughlin (2001), "An Energy Budget for Signaling in the Grey Matter of the Brain."
  - Chklovskii, Schikorski & Stevens (2002), "Wiring Optimization in Cortical Circuits."
  - Lennie (2003), "The Cost of Cortical Computation."
  - Poirazi, Brannon & Mel (2003), "Pyramidal Neuron as Two-Layer Neural Network."
  - Herculano-Houzel (2009), "The Human Brain in Numbers: A Linearly Scaled-Up Primate Brain."
  - Sterling & Laughlin (2015), Principles of Neural Design.

Built with a lot of help from evolution.