Note that I said "predict" not "describe". It feels like we're still in the era of Kepler, not Newton.
Does anyone understand the formula they expressed above this sentence? Is this just the classic "skip updating parameters with high gradient/loss variance across multiple batches/samples"?
As a fellow Tufte CSS enjoyer: why is user-select turned off on the sidenotes? I would quite badly like to be able to copy-paste them.
We're given a signal channel and a reservoir. Signal lives in the channel, noise lives in the reservoir, and the reservoir supposedly doesn’t show up at test time.
Okay, but then the question is: why would SGD put the right things in the right bucket?
If the answer is “because the reservoir is defined as the stuff that doesn’t transfer to test,” then this is close to circular.
The Borges/Lavoisier stuff is a tell. "We have unified the field" rhetoric should come after nontrivial predictions and results. Claiming to solve benign overfitting, double descent, grokking, implicit bias, training on the population risk, avoiding a validation set, and, last but not least, skipping training by analytically jumping to the end is 6 theory papers, 3 NeurIPS winners, and a $10B startup. Let's get some results before we tell everyone we unified the field. :) I hope you're right.
Really? An essay that leads off with a Borges anecdote skewed grandiose. Oh my, how unprecedented!
https://www.youtube.com/watch?v=ppCZfjLdSY8
I found this video to be illustrative as well. Simple, and anyone can understand it.
Uppercase letters have different stroke width than lowercase ones — it’s like they are *B*old *L*ike this.
Not only that: tracking, kerning is basically non-existent.
Please don’t use that open-source font
You need real paid Bembo, not that piece of shit.
Think of it as a best fit curve and exceptions to that curve. The noise is essentially this set of exceptions that move points away from where they would otherwise fall on the curve.
Gradient descent wants to be able to make the smallest change that moves the most data points towards the curve. To do this it learns an arrangement where it can change, say, one parameter and have a bunch of points move at once. What does this correspond to? The big common patterns shared by many data points.
Most of the capacity gets soaked up modelling these sorts of common patterns, and after they have been learned the model starts adding exceptions that allow individual points to deviate from the curve.
Because they’re exceptions, they must not impact neighbouring points, or at least only ones within a very short distance from them. Otherwise they’re now driving the error higher by impacting more points than they should. So you end up with very narrow ranges of features that are able to trigger different sorts of noise.
How narrow they are is shaped by the training data: they're exactly as narrow as needed not to raise the error, so assuming the total population has the same distribution, they don't get hit. Much.
At least, that’s what I take away from it.
Nah, the softer stuff seems like valuable outreach / good science communication for people that aren't up for the math. Including probably lots of software engineers who are sick of dumb debates in forums, and starting to dip into the real literature and listen to better authorities. More people should do this really, since it's the only way to see past the marketing and hype from fully entrenched AI boosters or detractors. Neither of those groups is big on critical thinking, and they dominate most conversation.
Time/effort coming from experts who want to make things accessible is a gift! The paper is linked elsewhere in the thread if you want no-frills.
Lastly, we derive an exact population-risk objective from a single training run with no validation data, for any architecture, loss, or optimizer, and prove that it measures precisely the noise in the signal channel. This objective reduces in practice to an SNR preconditioner on top of Adam, adding one state vector at no extra cost; it accelerates grokking by 5x, suppresses memorization in PINNs and implicit neural representations, and improves DPO fine-tuning under noisy preferences while staying 3x closer to the reference policy. [1]
I suspect there is going to be a lot of handwaving to actually go from eNTK to that new update rule.
Given the focus of the theory, I also doubt it helps in the non-grokking regime, which is where all the practical applications I have ever heard of live.
Don't get me wrong, I did enjoy reading this essay. It's well written and reasonably argued without going into details.
[1] https://physoc.onlinelibrary.wiley.com/doi/full/10.1113/JP28...
The brain probably primarily uses something like TD for task learning, which is also not expressible as a gradient of any objective function. And, though the paper mentions Hebbian learning, it's only for very particular network architectures (e.g. a single neuron, or symmetric connections) that you can treat its updates as the gradient of some energy function; these architectures aren't anything close to what we see in the brain.
Elon Litman

Borges wrote a story about a man named Funes who, after a horseback accident, acquires the ability to perceive and remember everything. Every leaf on every tree. Every ripple on every stream at every moment. He is the perfect empiricist. Infinite data, infinite recall, infinite resolution. And he cannot think. Because thinking, as Borges understood, requires forgetting. Funes could reconstruct entire days from memory but could not understand why the dog at 3:14, seen from the side, should be called the same thing as the dog at 3:15, seen from the front.
_I suspect [that Funes] was not very good at thinking. To think is to ignore (or forget) differences, to generalize, to abstract. In the teeming world of Ireneo Funes there was nothing but particulars._ Jorge Luis Borges, "Funes the Memorious," in Ficciones (1944).
Later in the story, Borges conjures Locke, who in the seventeenth century postulated an impossible language in which each individual thing, each stone, each bird and each branch, would have its own name. Funes projected an analogous language but discarded it because it seemed too general to him, too ambiguous. Deep learning theory has built Locke's language and is well on its way to Funes'. More parameters. More data. Deeper networks. More compute. Uniform convergence people, optimization people, NTK people, PAC-Bayes people, stability people, mean-field people, all working on the same problem, none of them speaking the same language, each proving bounds that are vacuous under the others' assumptions.
Deep learning today is where chemistry was before Lavoisier: a practice that works, built on a theory that doesn't. Everyone agrees this is a problem. Few believe it is a solvable one. At the Diffusion Group at Stanford, we have been trying for some time to answer a question most of our colleagues consider premature and quixotic: why does deep learning work? We think we have an answer.
But first, to see why the question is hard, start with what classical theory predicts. Classical statistical learning theory posits the bias-variance tradeoff: too simple and you underfit the data, too expressive and you overfit. Deep neural networks are highly expressive and overparameterized—they have far more parameters than data points; they can shatter any possible labeling of the data. During training, the network interpolates the training data perfectly, including all noise, achieving zero error. Surely, the test error should be catastrophic. Zhang et al., "Understanding Deep Learning (Still) Requires Rethinking Generalization," Communications of the ACM 64, no. 3 (2021). The original 2017 version demonstrated that standard architectures can memorize random labels, establishing that classical capacity-based explanations of generalization are insufficient. But then, the test error…
is also very low.
This is called benign overfitting. It violates the most basic intuition in statistical learning theory. Bartlett et al., "Benign Overfitting in Linear Regression," PNAS 117, no. 48 (2020). You fit the training data exactly, so presumably the noise must have been destroyed, or rendered harmless in some form.
Trying to visualize the bias-variance tradeoff with neural networks doesn't yield the expected U-shaped curve, but instead shows double descent. Test error goes up as model complexity increases, then comes back down past the interpolation threshold. Belkin et al., "Reconciling Modern Machine Learning Practice and the Bias-Variance Trade-off," PNAS 116, no. 32 (2019). At the exact moment the network gains the capacity to memorize everything, it begins to generalize.

Gradient descent, given infinitely many solutions that interpolate the data, picks ones that generalize (usually low \(\ell_2\)-norm, low nuclear norm, approximately low-rank). This is called implicit bias. Gunasekar et al., "Implicit Regularization in Matrix Factorization," NeurIPS (2017), and Soudry et al., "The Implicit Bias of Gradient Descent on Separable Data," JMLR 19 (2018).
Lastly, in cases where the data-generating distribution is highly structured and the network doesn't possess the right inductive bias, the network memorizes the training set, then much later, hundreds of thousands of steps later, suddenly generalizes. This is _grokking._ Power et al., "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets," arXiv:2201.02177 (2022).
Our explanation is available via preprint here. Litman & Guo, "A Theory of Generalization in Deep Learning," arXiv:2605.01172. It comes with proofs, experiments, and an algorithm that allows you to train on the population risk of any model, loss function, and dataset.
The standard approach treats a neural network as a point in a hypothesis class, attempting to bound its complexity across billions of parameters. We propose a radical simplification: abandoning the parameter space entirely. Instead, we analyze the network as a dynamical system strictly in output space, focusing on how predictions evolve and where error flows. Stack all training outputs into a vector \(U_S \in \mathbb{R}^{np}\). Form the Jacobian \(J_S = D_w U_S\), the matrix of partial derivatives of every output with respect to every parameter. The object that governs everything is the empirical Neural Tangent Kernel (eNTK): Jacot et al., "Neural Tangent Kernel: Convergence and Generalization in Neural Networks," NeurIPS (2018).
$$K_{SS}(w) = J_S(w) J_S(w)^\top$$
A matrix that tells you, for every pair of training points, how much a gradient step on one affects the prediction on the other. Under gradient flow, the training outputs and their gradient evolve as
$$\partial_t u = -K_{SS} g$$
$$\partial_t g = -B K_{SS} g$$
where \(g = \nabla \Phi_S(u)\) is the output gradient and \(B = \nabla^2 \Phi_S(u)\) is the loss Hessian. The test outputs evolve in parallel through the cross-kernel \(K_{QS} = J_Q J_S^\top\):
$$\partial_t U_Q = -K_{QS} g$$
This holds for any differentiable architecture and any convex loss, without any infinite-width or depth limit. The loss itself dissipates as
$$\frac{d}{dt}\Phi_S(u(t)) = -g(t)^\top K_{SS}(t) \, g(t) = -\|J_S^\top g\|_2^2$$
Loss decreases at a rate set by the kernel. Decompose \(g\) along eigenvectors \(v_i\) of \(K_{SS}\) with eigenvalues \(\lambda_i\). For squared loss the residual \(r = u - y\) obeys \(\partial_t r = -M(t)r\) where \(M = K_{SS}/n\), so the component along \(v_i\) decays as \(e^{-\lambda_i t / n}\). A mode with eigenvalue \(10\lambda\) is learned ten times faster. On any finite training horizon, modes below some eigenvalue threshold have barely moved. Given infinite time, all modes are interpolated, noise included.
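To make the spectral picture concrete, here is a minimal sketch (not from the preprint; the toy MLP, the data, and every name in it are illustrative) that computes the empirical NTK of a small network with JAX, freezes the kernel at initialization, and applies the linearized squared-loss dynamics, so you can watch fast modes collapse while slow modes barely move:

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_params(key, sizes=(1, 64, 1)):
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (m, n)) / jnp.sqrt(m), jnp.zeros(n))
            for k, m, n in zip(keys, sizes[:-1], sizes[1:])]

def apply_fn(params, x):
    h = x
    for W, b in params[:-1]:
        h = jnp.tanh(h @ W + b)
    W, b = params[-1]
    return (h @ W + b).squeeze(-1)               # U_S, shape (n,)

key = jax.random.PRNGKey(0)
params = init_params(key)
x = jnp.linspace(-1.0, 1.0, 32)[:, None]          # n = 32 training inputs
y = jnp.sin(3.0 * x).squeeze(-1)

w0, unravel = ravel_pytree(params)
u_of_w = lambda w: apply_fn(unravel(w), x)        # training outputs as a function of flat params
J = jax.jacrev(u_of_w)(w0)                        # J_S, shape (n, num_params)
K = J @ J.T                                       # empirical NTK, K_SS = J_S J_S^T

# Linearized squared-loss dynamics with the kernel frozen at initialization:
# dr/dt = -(K_SS / n) r, so the residual along eigenvector v_i shrinks by exp(-lambda_i t / n).
lam, V = jnp.linalg.eigh(K)                       # eigenvalues in ascending order
r0 = u_of_w(w0) - y
n, t = x.shape[0], 50.0
r_t = V @ (jnp.exp(-lam * t / n) * (V.T @ r0))
for i in (-1, -2, 0):                             # two fastest modes vs the slowest
    print(f"lambda={float(lam[i]):9.3g}   "
          f"|residual along v_i|: {float(jnp.abs(V[:, i] @ r0)):.3g} -> {float(jnp.abs(V[:, i] @ r_t)):.3g}")
```

The same computation with the kernel re-evaluated along the trajectory, rather than frozen at initialization, is what the rest of the argument is about.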
In the feature learning regime, the kernel is not fixed. As the parameters move, the eigenvectors rotate and the eigenvalues shift, so signal and noise get rearranged. Here is the kernel rotating (plotted by centering and normalizing its Gram matrix, extracting eigenstructure changes relative to initialization, mapping those changes into a shaded deformed surface):

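A rough way to compute the quantity behind a plot like this (a sketch under my own assumptions, not the code used for the figure: `K0` and `Kt` stand for eNTK Gram matrices saved at initialization and at a later checkpoint, however you obtained them):

```python
import jax.numpy as jnp

def centered_normalized(K):
    # Center and normalize the Gram matrix so only its shape, not its scale, matters.
    n = K.shape[0]
    H = jnp.eye(n) - jnp.ones((n, n)) / n
    Kc = H @ K @ H
    return Kc / jnp.linalg.norm(Kc)

def eigenstructure_rotation(K0, Kt, top_k=8):
    # How far the top-k eigenspace has rotated away from its state at initialization,
    # measured through the cosines of the principal angles between the two subspaces.
    _, V0 = jnp.linalg.eigh(centered_normalized(K0))
    _, Vt = jnp.linalg.eigh(centered_normalized(Kt))
    A, B = V0[:, -top_k:], Vt[:, -top_k:]
    cosines = jnp.linalg.svd(A.T @ B, compute_uv=False)
    return 1.0 - jnp.mean(cosines)                # 0: no rotation, 1: complete rotation
```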
To capture the cumulative effect of the entire training trajectory, we take the time integral of the eNTK:
$$\mathcal{W}_S(s,T) = \int_s^T P_g(\tau,s)^\top K_{SS}(\tau) P_g(\tau,s) \, d\tau$$
where \(P_g\) is the propagator of the gradient ODE. The eigenvalue of \(\mathcal{W}_S\) along direction \(\psi_j\) is the total integrated squared reachability of that direction over the entire training window:
$$\lambda_j = \int_s^T \|J_S(\tau)^\top P_g(\tau,s) \psi_j\|_2^2 \, d\tau$$
Directions with large \(\lambda_j\) are where training dissipated loss. This is the signal channel, \(\text{range}(\mathcal{W}_S)\). Directions with \(\lambda_j = 0\) are where training dissipated nothing. This is the reservoir, \(\ker(\mathcal{W}_S)\).
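As a numerical illustration only (not the paper's construction): if you save the Gram matrix \(K_{SS}(t_j)\) at a handful of checkpoints and, as a strong simplifying assumption, take the propagator \(P_g \approx I\) (defensible only over a short window or in a near-lazy regime), the integral collapses to a weighted sum and the channel/reservoir split is an eigendecomposition with a threshold:

```python
import numpy as np

def integrated_kernel(kernels, dts):
    # Crude discretization of W_S under P_g ~ I: sum_j K_SS(t_j) * dt_j.
    return sum(K * dt for K, dt in zip(kernels, dts))

def channel_and_reservoir(W, rel_tol=1e-6):
    lam, V = np.linalg.eigh(W)                    # eigenvalues in ascending order
    in_channel = lam > rel_tol * lam.max()
    signal_channel = V[:, in_channel]             # directions where training dissipated loss
    reservoir = V[:, ~in_channel]                 # directions where training dissipated ~nothing
    return signal_channel, reservoir
```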
Now define the test transfer operator
$$G_Q(T,s) = \int_s^T K_{QS}(\tau) P_g(\tau,s) \, d\tau$$
which propagates the initial gradient to test displacement: \(U_Q(T) - U_Q(s) = -G\,g(s)\). We show that \(G\) vanishes on the reservoir: \(\ker \mathcal{W}_S \subseteq \ker G\). Thus, whatever the network memorized in the reservoir is invisible at test time. The point of overparameterization, of depth, of inductive bias, is to give the kernel a spectrum that puts signal in the channel and noise in the reservoir.
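Under the same simplification (\(P_g \approx I\), cross-kernels \(K_{QS}(t_j)\) saved at the same checkpoints), the containment can be spot-checked numerically on a particular run; this is a sanity check of the claim, not a proof:

```python
import numpy as np

def transfer_operator(cross_kernels, dts):
    # Crude discretization of G under P_g ~ I: sum_j K_QS(t_j) * dt_j.
    return sum(KQS * dt for KQS, dt in zip(cross_kernels, dts))

def reservoir_is_test_invisible(G, reservoir, atol=1e-6):
    # Test-output displacement produced by gradient components lying in the
    # reservoir should be numerically zero: G @ psi ~ 0 for every reservoir direction psi.
    return bool(np.allclose(G @ reservoir, 0.0, atol=atol))
```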
This theory unifies the major puzzles of deep learning theory under one mechanism.
Benign overfitting is noise sitting in the reservoir at interpolation. The network memorized the noise in the train set, but the noise is in the reservoir \(\ker \mathcal{W}_S\), which is test-invisible. It doesn't matter. As a pedagogical sidenote: yes, I know that in highly overparameterized networks this is technically a soft reservoir of near-zero eigenvalues rather than strictly a mathematical null space, but treating it as a hard boundary is the best way to build intuition for why that trapped noise disappears at test time.
Double descent is noise moving between the signal channel and the reservoir as model capacity sweeps across interpolation. At the interpolation threshold, noise briefly enters the signal channel and test error spikes. Past it, the noise gets absorbed back into the reservoir.
Implicit bias is the spectral schedule of \(\mathcal{W}_S(t)\) filling the signal channel from the largest kernel eigenvalue down. Gradient flow learns parsimonious, high-mobility modes first and low-mobility modes last. By strictly confining its test predictions to this accumulated signal channel, the network acts as a Moore-Penrose pseudo-inverse over the realized path, effectively finding the minimum-norm solution in the dynamic feature space rather than the static parameter space.
Grokking is signal migrating from the reservoir into the signal channel as the kernel evolves over training. The network memorizes first (fast noise-fitting modes saturate early), then generalizes later (slow signal modes finally enter the signal channel).
By the way: the same operators that explain generalization also give you a way to train directly on population risk. Treating each training point in a minibatch as a one-point held-out test set against the rest and localizing to a single optimizer step collapses the operator expression to a per-parameter rule: update parameter \(k\) if and only if
$$\mu_k^2 > \frac{\sigma_k^2}{b-1}$$
That is, if the batch signal on a parameter exceeds its leave-one-out noise, update it; if not, skip it. This is a one-line change to Adam that accelerates grokking by \(5 \times\), suppresses memorization in PINNs, and improves DPO fine-tuning, eliminating the need for validation sets entirely.
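Here is a minimal sketch of that gate as a mask applied on top of an Optax Adam step. The specific statistics are my reading of the rule above, not released code: per-example gradients come from wherever you like (e.g. `jax.vmap(jax.grad(loss))`), and \(\mu_k\) and \(\sigma_k^2\) are taken to be their within-batch mean and unbiased variance.

```python
import jax
import jax.numpy as jnp
import optax

def snr_mask(per_example_grads):
    # per_example_grads: a pytree whose leaves carry a leading batch dimension b.
    def mask_leaf(g):
        b = g.shape[0]
        mu = g.mean(axis=0)
        var = g.var(axis=0, ddof=1)               # unbiased sample variance
        return (mu ** 2 > var / (b - 1)).astype(mu.dtype)
    return jax.tree_util.tree_map(mask_leaf, per_example_grads)

def masked_adam_step(optimizer, opt_state, params, per_example_grads):
    # Ordinary Adam update from the mean gradient, then gated per parameter
    # by the rule mu_k^2 > sigma_k^2 / (b - 1).
    mean_grads = jax.tree_util.tree_map(lambda g: g.mean(axis=0), per_example_grads)
    mask = snr_mask(per_example_grads)
    updates, opt_state = optimizer.update(mean_grads, opt_state, params)
    updates = jax.tree_util.tree_map(lambda u, m: u * m, updates, mask)
    return optax.apply_updates(params, updates), opt_state

# usage: optimizer = optax.adam(1e-3); opt_state = optimizer.init(params)
```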
The math indicates several exciting areas of research on the horizon. The first implication is that we have been training neural networks with a tragic amount of waste. Gradient descent currently functions as a pointwise simulation of a dynamical system whose asymptotic behavior we can characterize in closed form. This exact characterization is possible because in output space, training dynamics can be understood through a locally linear differential equation along the realized path, where dominant eigenmodes of the evolving kernel equilibrate exponentially fast. Forcing an optimizer to slowly step through these solved directions is highly inefficient and suggests a path to analytically jump to the final network state.
Our theory also provides the foundation necessary to train neural networks directly on the population risk, completely bypassing the fundamental compromise of machine learning. Moving away from pure empirical risk minimization allows networks to target true generalization natively during the training process, eliminating overfitting as we understand it.
Finally, understanding that overparameterization primarily serves to create a larger test-invisible reservoir invites a fundamental rethinking of model architecture. We can now explore whether it is possible to achieve the generalization benefits of infinite scale by designing smaller, highly efficient models that optimally sequester label noise. \(\blacksquare\)