There are no Dedekind cuts or Cauchy sequences on digital computers, so the fact that the analytical equations map to algorithms at all is very non-obvious.
Discretizing e.g. time or space is perhaps a bigger issue, but these issues are usually well understood and mitigated by, e.g., advanced numerical integration schemes, discrete-continuous formulations, or just cranking up the discretization resolution.
Analytical tools for discrete formulations are usually a lot less developed and don't as easily admit closed-form solutions.
https://en.wikipedia.org/wiki/Finite_difference
I'm not sure about applications of real numbers outside of calculus, and how to replace them there.
I am unsure of the next course of action, whether software will survive another 5 years, or what my career will look like in the future. Seems like I am engaged in the ice trade and they are about to invent the refrigerator.
I'm most interested in how the equation can be implemented step by step in an ML library - worked examples would be very helpful.
Thank you!
Instead of people "hacking" university education to turn universities into purely flavor-of-the-month job training centers, the real hack would be something that really drills down on the fundamentals. CS, math, physics, and philosophy, to get an all-around education in approaching problems from fundamentals, would I think be the optimal school experience.
The way I like to look at it is that I'm engaged in the ice trade and they are about to invent everything else, which will end mine and every other current trade. Which leaves me with two practical options: a) deep despair, or b) becoming a jack of all trades, master of none, but oftentimes better than a master of one. The jacks can, for now, capitalize on the thing that the Machines currently lack, which is agency.
So in my specific case I stopped thinking about mathematics as "how to interpret a sequence of symbols."
But instead I decided to start thinking about it as "the symbols tell me about the multidimensional topological coordinate space that I need to inhabit."
So now when I look at an equation (or whatever) my first step is "OK, how do I turn this into a topology so that I can explore the topological space the way that a number would?"
Kind of like if you were to extend Nagel's "What is it like to be a bat?" but instead of being a bat you're a number.
Read here: http://incompleteideas.net/book/the-book-2nd.html
https://www.andrew.cmu.edu/course/10-703/textbook/BartoSutto...
This will have a gentler learning curve. After this you can move on to more advanced material.
The other resource I will recommend is everything by Bertsekas. In this context, his books on dynamic programming and neuro-dynamic programming.
Happy reading.
It includes both mathematical formulas and PyTorch code.
I found it a bit more practical than the Sutton & Barto book, which is a classic but doesn't cover some of the more modern methods used in deep reinforcement learning.
Once the fundamental concepts are understood, what problem is being solved and where the key difficulties are, only then will the equations start to make sense. If you start out with the math, you're making your life unnecessarily hard.
Also, not universally true but directionally true as a rule of thumb: the more equations a text contains, the less likely it is that the author themselves has truly grasped the subject. People who really grasp a subject can usually explain it well in plain language.
This is because they work assuming you know a model of the data. Most real-world RL is model-free RL. Or, like in LLMs, "model is known but too big to practically use" RL.
Apart from the resources you use (good ones in other comments already), try to get the initial mental model of the whole field right, that is important since everything you read can then fit in the right place of that mental model. I will try to give one below.
- the absolute core raison d'etre of RL as a separate field: the quality of data you train on only improves as your algorithm improves. As opposed to other ML where you have all your data beforehand.
- first, basic Bellman equation solving (code-wise this is just solving a system of linear equations; see the sketch after this list)
- an algo you will come across called policy iteration (code-wise, a bunch of for loops...)
- here you will be able to see how different parts of the algo become impossible in different setups, and what approximations can be done for each of them (and this is where the first neural network - called "function approximator" in RL literature - comes into play). Here you can recognise approximate versions of the bellman equation.
- here you learn DDPG, SAC algos. Crucial. Called "actor critic" in parlance.
- you also notice problems of this approach that arise because a) you don't have much high-quality data and b) learning recursively with neural networks is very unstable; this motivates stuff like PPO.
- then you can take a step back, look at deep RL, and re-cast everything in normal ML terms. For example, techniques like TD learning (the term you would have used so far) can be re-cast as simply "data augmentation", which you do in ML all the time.
- at this point you should get in the weeds of actually engineering real RL algos at scale. Stuff like Atari benchmarks. You will find that in reality, the algos as learnt are more or less a template and you need lots of problem-specific detailing to actually make it work. And you will also learn engineering tricks that are crucial. This is mostly computer science stuff (increasing throughput on GPU etc. - but correctly! without changing the model assumptions)
- learn goal conditioned RL, imitation learning, some model based RL like alphazero/dreamer after all of the above. You will be able to easily understand it in the overall context at this point. First two are used in robotics quite a bit. You can run a few small robotics benchmarks at this point.
- learn stuff like HRL, offline RL as extras since they are not that practically relevant yet.
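To make the Bellman-solving and policy-iteration steps concrete, here is a minimal tabular sketch (the synthetic MDP and all names here are mine, purely illustrative) of policy evaluation as a linear solve plus the greedy-improvement loop:

```python
import numpy as np

# Synthetic tabular MDP: P[a, s, s'] = P(s' | s, a), r[a, s] = r(s, a).
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.standard_normal((n_actions, n_states))

policy = np.zeros(n_states, dtype=int)
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = r_pi, a linear system.
    P_pi = P[policy, np.arange(n_states)]
    r_pi = r[policy, np.arange(n_states)]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: greedy one-step lookahead (the Bellman max).
    new_policy = (r + gamma * P @ V).argmax(axis=0)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
```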
More often than not this is duplicated work (mathematically speaking) and there is a lot to be gained by sharing advances in either field by running it through a "translation". This has happened many times historically - a lot of the "we met at a cafe and worked it out on a napkin" inventions are exactly that.
Math proficiency helps a lot with that. The level of abstraction you deal with is naturally high.
Recently, the problem of actually knowing every field well enough, even just cursorily, to make connections has become easier with AI. Modern LLMs do approximate retrieval and still need a planner plus a verifier; the mathematician can be that.
This is somewhat adjacent to what Terry Tao spoke about, and the setup is sort of what AlphaEvolve does.
You get that impression because such advances are high impact and rare (because they are difficult). Most advances come as a sequence of field-specific assumption, field-specific empirical observation, field-specific theorem, and so on. We only see the advances that are actually made, leading to an observation bias.
One of the values that a math grad brings is debugging and fixing these ML models when training fails. Many would not have an idea of how to even begin debugging why the trained model is not working so well, let alone how to explore fixes.
That's very much a matter of style. An equation is often the plainest way of expressing something.
Give a physicist an equation from a completely unrelated field of mathematics and it will make zero sense to them because they lack the context. And vice versa. The only people who can readily read and understand your equations are those who already understand the subject and have learned all the context around the math.
Therefore it's pointless to try to start with the math when you're foreign to a field. It simply won't make any sense without the context.
Why didn't the training converge?
Validation/test errors are great, but why is performance in the wild so poor?
Why is the model converging so soon?
Why is this all zero?
Why is this NaN?
Model performance is not great; do I need to move to something more complicated, or am I doing something wrong?
Did the nature of the upstream data change?
Sometimes this feature is missing; how should I deal with this?
The training set and the data on which the model will be deployed are different. How do I address this problem?
The labelers labelled only the instances that are easy to label, not chosen uniformly from the data. How do I train with such skewed label selection?
I need to update the model with a few thousand new data points, but not train from scratch. How do I do it?
The model is too large; which doubles can I replace with float32?
So on and so forth. Many times models are given up on prematurely because the expertise to investigate lackluster performance does not exist in the team.
Machine learning feels recent, but one of its core mathematical ideas dates back to 1952, when Richard Bellman published a seminal paper titled “On the Theory of Dynamic Programming” [6, 7], laying the foundation for optimal control and what we now call reinforcement learning.
Later in the 1950s, Bellman extended his work to continuous-time systems, turning the optimality condition into a PDE. He then found that this PDE was structurally identical to a result in physics published a century earlier, in the 1840s: the Hamilton-Jacobi equation.
Once that structure is visible, several topics line up naturally. In this post I want to turn our attention to two applications of Bellman's work: continuous-time reinforcement learning, and how the training of generative models (diffusion models) can be interpreted through stochastic optimal control.
Bellman originally formulated dynamic programming in discrete time in the early 1950s [6, 7]. Consider a Markov decision process with state space $\mathcal X$, action space $\mathcal A$, transition kernel $P(\cdot\mid x,a)$, reward function $r(x,a)$, and discount factor $\gamma\in(0,1)$. A policy $\pi$ maps each state to a distribution over actions. If the state evolves as a controlled Markov chain
$$ X_{n+1}\sim P(\cdot\mid X_n,a_n), $$
then the objective is
$$ J(\pi):=\mathbb E\left[\sum_{n=0}^\infty \gamma^n r(X_n,a_n)\right],\qquad a_n\sim \pi(\cdot\mid X_n), $$
and the value function is defined as:
$$ V(x):=\sup_\pi \mathbb E\left[\sum_{n=0}^\infty \gamma^n r(X_n,a_n)\,\middle|\,X_0=x\right]. $$
Under mild conditions, $V$ satisfies the Bellman equation
$$ V(x)=\max_{a\in\mathcal A}\left\{r(x,a)+\gamma\,\mathbb E \left[V(X_{n+1})\mid X_n=x,a_n=a\right]\right\}.\tag{Bellman equation} $$
This says: choose the action that maximizes immediate reward plus continuation value. Continuous time keeps the same local logic, but now the time step has length $h$ and we send $h\downarrow 0$.
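Before passing to continuous time, it may help to see the discrete equation as an algorithm. A minimal value-iteration sketch on a synthetic tabular MDP (sizes and names here are purely illustrative):

```python
import numpy as np

# Synthetic tabular MDP: P[a, x, x'] = P(x' | x, a), r[a, x] = r(x, a).
n_x, n_a, gamma = 5, 3, 0.95
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_x), size=(n_a, n_x))
r = rng.standard_normal((n_a, n_x))

V = np.zeros(n_x)
for _ in range(10_000):
    # Bellman update: V(x) = max_a { r(x, a) + gamma * E[V(X') | x, a] }.
    V_new = (r + gamma * P @ V).max(axis=0)
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new
```

Iterating the Bellman operator like this converges because the operator is a $\gamma$-contraction in the sup norm.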
To isolate the main idea, first ignore noise and consider the non-autonomous deterministic control system
$$ \dot X_s = f(s,X_s,a_s),\qquad X_t=x, $$
with finite-horizon value function
$$ V(t,x):=\sup_{a_\cdot}\left[\int_t^T r(s,X_s,a_s)\,ds+g(X_T)\,\middle|\,X_t=x\right]. $$
Theorem (HJB, deterministic non-autonomous). For $V\in C^1$, the value function satisfies
$$ -\partial_t V(t,x) = H\bigl(t,x,\nabla_x V(t,x)\bigr), \tag{HJB} $$
where the Hamiltonian is $H(t,x,p):=\sup_{a\in\mathcal A}\left\{r(t,x,a)+p^\top f(t,x,a)\right\}$.
Proof. Fix $(t,x)$ and $h>0$. The dynamic programming principle gives
$$ V(t,x)=\sup_{a_\cdot}\left[\int_t^{t+h} r(s,X_s,a_s)\,ds + V(t+h,X_{t+h})\right]. $$
To first order in $h$, it is enough to optimize over constant actions $a$ on $[t,t+h]$. For smooth $V$ and deterministic dynamics:
$$ V(t+h,X_{t+h})=V(t,x)+h\,\partial_t V(t,x)+h\,\nabla_x V(t,x)^\top f(t,x,a)+o(h), $$
$$ \int_t^{t+h} r(s,X_s,a)\,ds = h\,r(t,x,a)+o(h). $$
Substituting into the DPP, cancelling $V(t,x)$, dividing by $h$, and letting $h\downarrow 0$:
$$ 0=\sup_{a\in\mathcal A}\left\{r(t,x,a)+\nabla_x V(t,x)^\top f(t,x,a)+\partial_t V(t,x)\right\}, $$
which rearranges to $-\partial_t V(t,x) = H(t,x,\nabla_x V(t,x))$. $\quad\blacksquare$
Connection to Hamilton–Jacobi. What Bellman realized in the 1950s is that the partial differential equation produced by dynamic programming has exactly the same structure as the 19th-century Hamilton–Jacobi equation from classical mechanics. Writing the running reward as minus a Lagrangian, $r(t,x,a)=-L(t,x,a)$, define
$$ H(t,x,p):=\sup_{a\in\mathcal A}\{p^\top f(t,x,a)-L(t,x,a)\}. $$
The HJB equation then becomes identical in form to the Hamilton–Jacobi equation for the action $S(t,q)$,
$$ \frac{\partial S}{\partial t}+H\!\left(q,\frac{\partial S}{\partial q}\right)=0. $$
Under the correspondence $S\leftrightarrow V$ and $q\leftrightarrow x$, the two equations are the same at the level of PDE structure.
We now move to the stochastic setting: continuous time, continuous state and action spaces, and Itô dynamics. Assume the system evolves according to the SDE
$$ dX_t = f(X_t,a_t)\,dt + \Sigma(X_t,a_t)\,dW_t $$
where $X_t$ is the state, $a_t$ is the control, $W_t$ is a standard Wiener process, and $f$ and $\Sigma$ describe drift and diffusion. The reward is given by $r(x,a)$, and the objective is to maximize expected discounted reward over an infinite horizon:
$$ J(\pi):=\mathbb{E}\Big[\int_0^\infty e^{-\rho t}r(X_t,a_t)\,dt\Big],\qquad a_t\sim \pi(\cdot\mid X_t) $$
where $\rho>0$ is the discount rate. The associated value function is
$$ V(x):=\sup_\pi \mathbb{E}\Big[\int_0^\infty e^{-\rho t}r(X_t,a_t)\,dt \Big| X_0=x\Big] $$
Theorem (Hamilton-Jacobi-Bellman equation for a controlled diffusion). Under suitable regularity conditions:
- $f(\cdot,a)$, $\Sigma(\cdot,a)$, $r(\cdot,a)$ are continuous in $(x,a)$; Lipschitz in $x$ uniformly in $a$.
- $\Sigma\Sigma^\top(x,a)$ is bounded and uniformly nondegenerate (for classical $C^2$ theory; if you drop this you typically work in viscosity form).
- $r$ is bounded (or has at most linear growth with enough integrability).
- $V\in C^2(\mathbb R^d)$ and bounded (or polynomial growth, with the usual technical modifications).
Then the value function satisfies the Hamilton-Jacobi-Bellman (HJB) PDE
$$ \rho V(x)=\max_{a\in \mathcal{A}}\Big\{ r(x,a)+\mathcal{L}^a V(x)\Big\}.\tag{1} $$
where $\mathcal{L}^a$ is the infinitesimal generator under action $a$:
$$ \mathcal{L}^a \varphi(x):=\nabla \varphi(x)^\top f(x,a) +\tfrac12 \mathrm{Tr}\big(\Sigma\Sigma^\top(x,a)\,\nabla^2 \varphi(x)\big) $$
Proof. The structure is the same as in the deterministic case. The only new ingredient is the short-time expansion of $\mathbb E_x[V(X_h)]$: by Itô’s formula,
$$ \mathbb E_x[V(X_h)] = V(x) + h\,\mathcal L^a V(x) + o(h), $$
where the generator $\mathcal{L}^a$ replaces the directional derivative $\nabla V^\top f$, adding the curvature term $\tfrac{1}{2}\mathrm{Tr}(\Sigma\Sigma^\top\nabla^2 V)$ coming from the quadratic variation of $W$. The rest is unchanged: substitute into the DPP, cancel $V(x)$, divide by $h$, and let $h\downarrow 0$ to obtain (1). $\quad\blacksquare$
The same argument also yields the non-autonomous HJB when $f$, $\Sigma$, and $r$ depend explicitly on time; see Appendix C.
Historical Note: In 1960, Rudolf E. Kalman published his seminal paper on the linear-quadratic regulator (LQR) problem [8], which is a continuous-time optimal control problem with linear dynamics and quadratic cost. The solution to the LQR problem is given by the algebraic Riccati equation, which can be derived from the Hamilton-Jacobi-Bellman (HJB) equation for continuous-time control problems.
Define the continuous-time analogue of the Q-function by
$$ Q(x,a):=\frac{1}{\rho}\Big(r(x,a)+\mathcal{L}^a V(x)\Big).\tag{2} $$
From the HJB (1), it follows immediately that $V(x)=\max_{a} Q(x,a)$. This identity is the basis of policy improvement: once we have an estimate of $V$, the greedy action is $a^*(x) = \arg\max_a Q(x,a)$.
This stationary, discounted form is the RL convention used in the next two sections.
We solve the HJB numerically with policy iteration (PI), alternating between evaluating the current policy and improving it through the Q-function. Both the value $V_\theta$ and the policy $\alpha_\phi$ are represented by MLPs.
This algorithm is model-based: it assumes known dynamics through $f(x,a)$ and $\Sigma(x,a)$ (equivalently, access to the generator $\mathcal L^a$). The model is used both to simulate closed-loop trajectories in policy evaluation and to compute $\mathcal L^aV$ in policy improvement.
We iterate the following steps until convergence:
Policy evaluation (value under current policy $\alpha_k$):
$$ \rho V_k(x)=r\big(x,\alpha_k(x)\big)+\mathcal{L}^{\alpha_k(x)}V_k(x). $$
In practice, we estimate $V_k\approx V^{\alpha_k}$ by Monte Carlo rollouts of the closed-loop SDE and fit $V_\theta$ by regression.
Policy improvement (greedy with respect to $V_k$):
$$ \alpha_{k+1}(x)\in\arg\max_{a\in\mathcal A}\{r(x,a)+\mathcal L^aV_k(x)\} =\arg\max_{a\in\mathcal A} Q_k(x,a), $$
where
$$ Q_k(x,a):=\frac{1}{\rho}\big(r(x,a)+\mathcal L^aV_k(x)\big). $$
With a differentiable actor $\alpha_\phi$, this becomes gradient ascent on
$$ \max_\phi\;\mathbb E_x\big[Q_k\big(x,\alpha_\phi(x)\big)\big]. $$
Diagnosis / stopping: monitor the pointwise HJB residual
$$ \mathcal R_{\mathrm{HJB}}(x)=\rho V(x)-\max_a\{r(x,a)+\mathcal L^aV(x)\}, $$
and stop when sampled norms of $\mathcal R_{\mathrm{HJB}}$ and parameter changes plateau.
Intuition: evaluation estimates the value landscape induced by the current policy, and improvement moves the policy uphill on that landscape. Repeating the two steps drives $(V,\alpha)$ toward a fixed point of the HJB.
The generator requires $\nabla V$ and $\nabla^2 V$, obtained from $V_\theta$ via autograd. The diffusion $\Sigma(x,a)$ is problem-given (model-based setting).
```python
import torch
from torch import autograd

def compute_generator(V_net, x, f_xa, Sigma_xa):
    """L^a V(x) = ∇V · f + ½ Tr(ΣΣᵀ ∇²V); x must have requires_grad=True."""
    V = V_net(x)                                              # (batch, 1)
    grad_V = autograd.grad(V.sum(), x, create_graph=True)[0]  # (batch, d)
    drift = (grad_V * f_xa).sum(-1, keepdim=True)             # ∇V · f
    d = x.shape[1]
    # Hessian ∇²V, one row per coordinate via a second autograd pass.
    H = torch.stack([autograd.grad(grad_V[:, i].sum(), x,
                                   create_graph=True)[0] for i in range(d)], dim=1)
    A = Sigma_xa @ Sigma_xa.transpose(-1, -2)                 # ΣΣᵀ, (batch, d, d)
    diff = 0.5 * (A * H).sum(dim=(-2, -1)).unsqueeze(-1)      # ½ Tr(ΣΣᵀ ∇²V)
    return drift + diff
```
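The stopping criterion above can then be estimated at sampled states. A sketch, assuming the max over actions is approximated by the current (near-greedy) actor:

```python
def hjb_residual(V_net, policy_net, problem, x, rho):
    # R_HJB(x) = rho V(x) - { r(x, a) + L^a V(x) } at a = policy(x).
    x = x.detach().requires_grad_(True)   # the generator needs grads w.r.t. x
    a = policy_net(x)
    LV = compute_generator(V_net, x, problem.drift(x, a), problem.diffusion(x, a))
    return rho * V_net(x) - (problem.reward(x, a) + LV)
```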
During policy improvement, $\nabla V$ and $\nabla^2 V$ are detached from $\theta$ so gradients flow only through $\phi$.
For a fixed policy $\alpha$, $V^\alpha$ solves the linear PDE
$$ \rho V^\alpha(x)=r\big(x,\alpha(x)\big)+\mathcal L^{\alpha(x)}V^\alpha(x). $$
By the Feynman-Kac representation, for any truncation horizon $T>0$,
$$ V^\alpha(x)= \mathbb E_x\!\left[\int_0^\infty e^{-\rho s}\,r\big(X_s,\alpha(X_s)\big)\,ds\right]=\mathbb E_x\!\left[\int_0^T e^{-\rho s}\,r\big(X_s,\alpha(X_s)\big)\,ds + e^{-\rho T}V^\alpha(X_T)\right] $$
In Monte Carlo policy evaluation, we approximate this expectation with simulated trajectories and use the critic to bootstrap the terminal value at time $T$.
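A sketch of how those regression targets can be produced, assuming the `problem` interface used throughout and a `noise_dim` for the Brownian increments (both names are illustrative):

```python
import math
import torch

def mc_value_targets(V_net, policy_net, problem, x0, rho, T, dt, noise_dim):
    # Euler-Maruyama rollout of the closed-loop SDE, accumulating discounted
    # reward and bootstrapping the terminal value with the current critic.
    with torch.no_grad():
        x, G, disc = x0.clone(), torch.zeros(x0.shape[0], 1), 1.0
        for _ in range(int(T / dt)):
            a = policy_net(x)
            G = G + disc * problem.reward(x, a) * dt
            dW = math.sqrt(dt) * torch.randn(x.shape[0], noise_dim, 1)
            x = x + problem.drift(x, a) * dt + (problem.diffusion(x, a) @ dW).squeeze(-1)
            disc *= math.exp(-rho * dt)
        return G + disc * V_net(x)   # targets for fitting V_theta by regression
```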
At collocation points, compute $Q_k(x,\alpha_\phi(x))$ using the detached $\nabla V_k$, $\nabla^2 V_k$, and maximise $\mathbb{E}[Q_k]$ w.r.t. $\phi$:
```python
grad_V, H = compute_generator_detached(V_net, x)  # frozen V: ∇V, ∇²V, no θ-graph
a = policy_net(x)                                 # differentiable in φ
f_xa, S_xa = problem.drift(x, a), problem.diffusion(x, a)
L_V = generator_from_precomputed(grad_V, H, f_xa, S_xa)
Q = (problem.reward(x, a) + L_V) / rho
loss = -Q.mean()                                  # gradient ascent on Q
opt_pi.zero_grad()                                # clear stale gradients
loss.backward()
opt_pi.step()
```
Policy iteration is model-based. A complementary route is Q-learning, which can be implemented model-free from sampled transitions.
In continuous time, the Q-function satisfies the PDE
$$ \rho Q(x,a)=r(x,a)+\mathcal L^a\big(\max_{a'\in\mathcal A}Q(x,a')\big).\tag{3} $$
With neural networks, set
$$ Q_\psi(x,a)\approx Q(x,a),\qquad a_\omega(x)\approx \arg\max_{a}Q_\psi(x,a), $$
where $Q_\psi$ (critic) and $a_\omega$ (actor) are MLPs.
Using short transitions $(X_t,a_t,r_t,X_{t+\Delta t})$ and a small step size $\Delta t$, a practical TD target is
$$ y_t = r_t\,\Delta t + e^{-\rho\Delta t}\,\bar V(X_{t+\Delta t}), \qquad \bar V(x):=Q_{\bar\psi}(x,a_\omega(x))\ \text{(or }\max_a Q_{\bar\psi}(x,a)\text{)}. $$
Then train the critic with
$$ \mathcal L_Q(\psi)=\mathbb E\big[(Q_\psi(X_t,a_t)-y_t)^2\big]. $$
The actor is updated by ascent on
$$ \max_\omega\;\mathbb E\big[Q_\psi(X_t,a_\omega(X_t))\big]. $$
This mirrors the usual actor-critic split: one network fits state-action values, while the other outputs actions that maximize them.
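A sketch of one such update in PyTorch, assuming `Q_net`, a slowly updated target copy `Q_target`, and `actor` are MLPs with the obvious signatures, and `batch` holds sampled transitions (all names are illustrative):

```python
import math
import torch

def td_update(Q_net, Q_target, actor, opt_q, opt_a, batch, rho, dt):
    x, a, r, x_next = batch
    with torch.no_grad():                     # TD target from the frozen copy
        y = r * dt + math.exp(-rho * dt) * Q_target(x_next, actor(x_next))
    q_loss = (Q_net(x, a) - y).pow(2).mean()  # critic regression on the target
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()
    a_loss = -Q_net(x, actor(x)).mean()       # actor: ascent on Q(x, a_omega(x))
    opt_a.zero_grad(); a_loss.backward(); opt_a.step()
```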
The linear-quadratic regulator is the canonical continuous-time control benchmark: linear dynamics, quadratic cost, and a closed-form solution. That makes it ideal for validating a numerical solver.
Dynamics (additive noise, 1-D scalar):
$$dX_t = (\alpha\,X_t + \beta\,a_t)\,dt + \sigma\,dW_t$$
Reward (negative quadratic cost):
$$r(x,a) = -\tfrac{1}{2}(q\,x^2 + r_a\,a^2)$$
| Symbol | Meaning | Value |
|---|---|---|
| $\alpha$ | open-loop drift (stable if $<0$) | $-0.5$ |
| $\beta$ | control effectiveness | $1.0$ |
| $q$ | state cost weight | $1.0$ |
| $r_a$ | action cost weight | $0.1$ |
| $\sigma$ | diffusion (noise intensity) | $0.3$ |
| $\rho$ | discount rate | $0.1$ |
Substituting a quadratic ansatz $V(x) = -\tfrac{1}{2}Px^2 - c$ into the HJB and optimising over $a$ yields (see Appendix A for the full derivation):
The closed-form objects are:
$$ a^*(x)=-\frac{\beta P}{r_a}x=: -Kx, $$
$$ \rho P = q + 2\alpha P - \frac{\beta^2}{r_a}P^2, $$
$$ c = \frac{\sigma^2 P}{2\rho}. $$
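These quantities are easy to check numerically. A small sanity-check sketch with the table's parameters, taking the positive root of the scalar Riccati quadratic $\frac{\beta^2}{r_a}P^2+(\rho-2\alpha)P-q=0$:

```python
import math

alpha, beta, q, r_a, sigma, rho = -0.5, 1.0, 1.0, 0.1, 0.3, 0.1
a2, a1, a0 = beta**2 / r_a, rho - 2 * alpha, -q    # quadratic in P
P = (-a1 + math.sqrt(a1**2 - 4 * a2 * a0)) / (2 * a2)
K = beta * P / r_a            # optimal feedback gain: a*(x) = -K x
c = sigma**2 * P / (2 * rho)  # constant offset in V
print(P, K, c)
```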
```python
class StochasticLQR(ControlProblem):
    def drift(self, x, a):      # f(x, a) = A x + B a
        return x @ A.T + a @ B.T
    def diffusion(self, x, a):  # constant Σ = D (additive noise)
        return D.unsqueeze(0).expand(x.shape[0], -1, -1)
    def reward(self, x, a):     # r(x, a) = -½ (xᵀQx + aᵀRa)
        return -0.5 * ((x @ Q * x).sum(-1, keepdim=True)
                       + (a @ R * a).sum(-1, keepdim=True))

P, c, K = solve_are(A, B, Q, R, D, rho=0.1)  # exact Riccati solution
solver = PolicyIteration(problem, cfg)
history = solver.solve()                     # neural PI
```
Learned $V_\theta$ and $\alpha_\phi$ closely match $V^*(x) = -\tfrac{1}{2}Px^2 - c$ and $a^*(x) = -Kx$:

Sample trajectories under the learned policy ($x_0=1.5$) and cumulative discounted reward:

Convergence diagnostics (value-fit MSE, policy objective, HJB residual):

Merton’s (1969) portfolio problem asks how an investor should allocate wealth between a risk-free bond and a risky asset while also choosing a consumption rate. The objective is to maximize expected lifetime CRRA (constant relative risk aversion) utility of consumption. It also admits a closed-form solution, making it a useful second benchmark with multiplicative noise rather than the additive noise of LQR.
State: wealth $X_t > 0$. Controls: $a_t = (\pi_t, k_t)$ — risky-asset fraction and consumption-to-wealth ratio $k = c/X$.
Dynamics (geometric / multiplicative noise):
$$dX_t = \big[r_f + \pi_t(\mu - r_f) - k_t\big] X_t dt + \pi_t \sigma X_t dW_t$$
Reward (CRRA utility of consumption flow, $\gamma \neq 1$):
$$r(x,a) = \frac{(k\,x)^{1-\gamma}}{1-\gamma}$$
| Symbol | Meaning | Value |
|---|---|---|
| $r_f$ | risk-free rate | $0.03$ |
| $\mu$ | risky asset expected return | $0.08$ |
| $\sigma$ | risky asset volatility | $0.20$ |
| $\gamma$ | relative risk aversion (CRRA) | $2.0$ |
| $\rho$ | subjective discount rate | $0.05$ |
Substituting a power-law ansatz $V(x) = \frac{A}{1-\gamma}x^{1-\gamma}$ into the HJB and optimising over $(\pi, k)$ yields (see Appendix B for the full derivation):
The closed-form controls and value are:
$$ \pi^*=\frac{\mu-r_f}{\gamma\sigma^2}=0.625, $$
$$ k^*=\frac{\rho-(1-\gamma)M}{\gamma},\qquad M:=r_f+\frac{(\mu-r_f)^2}{2\gamma\sigma^2},\qquad k^*\approx 0.0478, $$
$$ V^*(x)=\frac{A}{1-\gamma}x^{1-\gamma},\qquad A=\left(\frac{\gamma}{\rho-(1-\gamma)M}\right)^\gamma. $$
Both optimal controls are constant — independent of wealth and time.
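A quick numeric check of these constants, using the parameter table above:

```python
r_f, mu, sigma, gamma, rho = 0.03, 0.08, 0.20, 2.0, 0.05
pi_star = (mu - r_f) / (gamma * sigma**2)         # 0.625
M = r_f + (mu - r_f)**2 / (2 * gamma * sigma**2)  # ≈ 0.04563
k_star = (rho - (1 - gamma) * M) / gamma          # ≈ 0.0478
A = (gamma / (rho - (1 - gamma) * M))**gamma      # value-function coefficient
```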
```python
class MertonProblem(ControlProblem):
    def drift(self, x, a):      # a = (π, k): risky fraction, consumption rate
        pi, cr = a[:, 0:1], a[:, 1:2]
        return (self.r_f + pi * (self.mu - self.r_f) - cr) * x
    def diffusion(self, x, a):  # multiplicative noise: π σ x
        return (a[:, 0:1] * self.sigma * x).unsqueeze(-1)
    def reward(self, x, a):     # CRRA utility of consumption c = k x
        c = (a[:, 1:2] * x).clamp(min=1e-8)
        return c.pow(1 - self.gamma) / (1 - self.gamma)
```
Learned value function matches the exact power-law $V^*\propto x^{1-\gamma}$; both controls converge to the analytical constants:

Sample wealth trajectories under the learned policy ($X_0 = 1$) and cumulative discounted reward:

Convergence diagnostics:

The same HJB machinery also appears in diffusion models once reverse-time sampling is written as a control problem. Let $p_{\text{data}}(x)$ be the target data distribution. For simplicity, consider a forward diffusion whose noise coefficient depends only on time, as in standard score-based SDE formulations:
$$ dY_t = f(Y_t, t)\,dt + \sigma(t)\,dB_t, \qquad Y_0 \sim p_{\text{data}}. $$
Let $p_t(x)$ denote the marginal density of $Y_t$. Under standard regularity assumptions, the time reversal of this process is again a diffusion. Instead of writing time backward from $T$ down to $0$, define
$$ X_t := Y_{T-t}, \qquad t\in[0,T]. $$
Then $X_0\sim p_T$, $X_T\sim p_{\text{data}}$, and $X_t$ has marginal $p_{T-t}$.
To expose the control structure [9], define the reverse-time drift and diffusion coefficients
$$ \mu(x, t) := -f(x, T-t), \qquad \Sigma(t) := \sigma(T-t). $$
Now consider a family of controlled diffusions $X_t^u$ driven by an arbitrary control field $u(x, t)$:
$$ dX_t^u = \big(\mu(X_t^u, t) + \Sigma(t) u(X_t^u, t)\big)\,dt + \Sigma(t)\,dW_t, \qquad X_0^u \sim p_T. $$
The goal is to choose $u$ so that the terminal law of $X_T^u$ matches $p_{\text{data}}$.
Now define the candidate value function
$$ V(x, t) := -\log p_{T-t}(x). $$
This is the negative log-density of the reverse-time marginals, so its terminal value is
$$ V(x, T) = -\log p_{\text{data}}(x). $$
To identify the PDE satisfied by $V$, recall that the forward marginals $p_t$ solve the Fokker-Planck equation for $Y_t$. Consequently, $\rho_t(x) := p_{T-t}(x) = e^{-V(x, t)}$ satisfies the reverse-time Fokker-Planck PDE
$$ \partial_t \rho_t = -\operatorname{div}(\mu\,\rho_t) - \tfrac{1}{2}\operatorname{Tr}\big(\Sigma\Sigma^\top \nabla_x^2 \rho_t\big). $$
Substituting $\rho_t = e^{-V}$, $\nabla_x \rho_t = -e^{-V}\nabla_x V$, and $\nabla_x^2 \rho_t = e^{-V}(\nabla_x V\nabla_x V^\top - \nabla_x^2 V)$ into that PDE and dividing by $-e^{-V}$ yields
$$ \partial_t V = \operatorname{div}\mu - \mu \cdot \nabla_x V + \tfrac{1}{2}\|\Sigma^\top \nabla_x V\|^2 - \tfrac{1}{2}\operatorname{Tr}\big(\Sigma\Sigma^\top \nabla_x^2 V\big). $$
To rewrite this as a control problem, introduce a control variable $u$ through the convex-conjugate identity for the quadratic function $g(y) = \frac{1}{2}\|y\|^2$:
$$ \tfrac{1}{2}\|y\|^2 = \sup_{u\in\mathbb{R}^d} \left\{ u \cdot y - \tfrac{1}{2}\|u\|^2 \right\}. $$
Setting $y = -\Sigma^\top \nabla_x V$ lets us rewrite the quadratic gradient term as
$$ \tfrac{1}{2}\|-\Sigma^\top \nabla_x V\|^2 = \sup_{u} \left\{ u^\top (-\Sigma^\top \nabla_x V) - \tfrac{1}{2}\|u\|^2 \right\} = -\inf_{u} \left\{ \tfrac{1}{2}\|u\|^2 + (\Sigma u) \cdot \nabla_x V \right\}. $$
This is the key step: it replaces a quadratic gradient term with a linear one that can be interpreted as controlled drift.
Plugging this back into the PDE for $V$, multiplying by $-1$, and collecting the $u$-independent terms inside the infimum gives the finite-horizon HJB equation
$$ -\partial_t V = \inf_u \left\{ \tfrac{1}{2}\|u\|^2 - \operatorname{div}\mu + (\mu + \Sigma u)\cdot \nabla_x V + \tfrac{1}{2}\operatorname{Tr}\big(\Sigma\Sigma^\top \nabla_x^2 V\big) \right\}. \tag{4} $$
This PDE shows that $V(x,t) = -\log p_{T-t}(x)$ is exactly the value function of a stochastic control problem with dynamics $X_t^u$ and cost
$$ J(u; x, t) = \mathbb{E}\left[ \int_t^T \left( \tfrac{1}{2}\|u(X_s^u,s)\|^2 - \operatorname{div}\mu(X_s^u,s) \right) ds - \log p_{\text{data}}(X_T^u) \;\middle|\; X_t^u=x \right]. $$
The optimal control law $u^*(x, t)$ is the minimizer in the HJB equation. From the quadratic optimization above, the minimizer of $\frac{1}{2}\|u\|^2 + u^\top (\Sigma^\top \nabla_x V)$ is obtained by setting its derivative with respect to $u$ to zero:
$$ u^* + \Sigma^\top \nabla_x V = 0 \implies u^*(x, t) = -\Sigma^\top(t)\nabla_x V(x, t). $$
Since $V(x, t) = -\log p_{T-t}(x)$, we have $\nabla_x V = -\nabla_x \log p_{T-t}(x)$. Substituting this identity gives the exact optimal control law
$$ u^*(x, t) = \Sigma^\top(t)\nabla_x \log p_{T-t}(x) = \sigma(T-t)^\top \nabla_x \log p_{T-t}(x). $$
Therefore the controlled drift becomes
$$ \mu(x,t)+\Sigma(t)u^*(x,t) =-f(x,T-t)+\Sigma(t)\Sigma(t)^\top \nabla_x \log p_{T-t}(x), $$
which is exactly the reverse-time score correction.
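In code, sampling then amounts to Euler-Maruyama simulation of the optimally controlled SDE, with a learned score network standing in for $\nabla_x\log p_{T-t}$. A sketch, assuming a scalar noise schedule $\sigma(t)$ and callables `score_net`, `f`, `sigma` (all names illustrative):

```python
import math
import torch

def sample(score_net, f, sigma, x0, T, n_steps):
    # Simulate dX = (mu + Sigma u*) dt + Sigma dW, with u* the score control.
    x, dt = x0, T / n_steps
    for i in range(n_steps):
        t = i * dt                    # reversed time: X_t ~ p_{T-t}
        s = sigma(T - t)
        drift = -f(x, T - t) + s**2 * score_net(x, T - t)
        x = x + drift * dt + s * math.sqrt(dt) * torch.randn_like(x)
    return x                          # approximately distributed as p_data
```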
Applying Itô’s formula to $V(X_s^u, s)$ along an arbitrary controlled trajectory and then using the HJB gives the verification identity
$$ J(u; x, t) = V(x, t) + \frac{1}{2} \mathbb{E}\left[ \int_t^T \| u(X_s^u,s) - u^*(X_s^u,s) \|^2 ds \;\middle|\; X_t^u = x \right]. $$
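A sketch of that computation: by Itô's formula, the drift of $V(X_s^u,s)$ is $\partial_s V+(\mu+\Sigma u)\cdot\nabla_x V+\tfrac12\operatorname{Tr}(\Sigma\Sigma^\top\nabla_x^2 V)$. Substituting the PDE for $V$ and completing the square with $u^*=-\Sigma^\top\nabla_x V$,

$$ \tfrac{1}{2}\|u\|^2+u^\top\Sigma^\top\nabla_x V=\tfrac{1}{2}\|u-u^*\|^2-\tfrac{1}{2}\|u^*\|^2, $$

reduces that drift to $\operatorname{div}\mu+\tfrac12\|u-u^*\|^2-\tfrac12\|u\|^2$. Taking expectations from $t$ to $T$, using the terminal condition $V(x,T)=-\log p_{\text{data}}(x)$, and rearranging gives the identity.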
This identity is the control-theoretic backbone of diffusion models: the cost of any control $u$ exceeds the optimal value $V(x,t)$ by exactly half the expected squared deviation of $u$ from the score-induced control $u^*$ along the controlled trajectory.
So diffusion-based generative modeling can be viewed as a finite-horizon stochastic optimal control problem whose optimal policy is precisely the score-induced reverse-time drift correction.
[1] Jia, Yanwei, and Xun Yu Zhou. “q-Learning in continuous time.” Journal of Machine Learning Research 24, no. 161 (2023): 1-61.
[2] Jia, Yanwei, and Xun Yu Zhou. “Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms.” Journal of Machine Learning Research 23, no. 275 (2022): 1-50.
[3] Moll, Benjamin. "Hamilton-Jacobi-Bellman Equations, Stochastic Differential Equations." Lecture notes. https://benjaminmoll.com/wp-content/uploads/2019/07/Lecture4_ECO521_web.pdf
[4] Fleming, Wendell H., and H. Mete Soner. Controlled Markov processes and viscosity solutions. New York, NY: Springer New York, 2006.
[5] Yong, Jiongmin, and Xun Yu Zhou. Stochastic controls: Hamiltonian systems and HJB equations. Vol. 43. Springer Science & Business Media, 1999.
[6] Bellman, Richard. “On the Theory of Dynamic Programming.” Proceedings of the National Academy of Sciences 38, no. 8 (1952): 716-719. https://doi.org/10.1073/pnas.38.8.716.
[7] Bellman, Richard Ernest. An Introduction to the Theory of Dynamic Programming. Santa Monica, CA: RAND Corporation, 1953. https://www.rand.org/pubs/reports/R245.html.
[8] Kalman, Rudolf E. “Contributions to the Theory of Optimal Control.” Boletin de la Sociedad Matematica Mexicana 5 (1960): 102-119. https://boletin.math.org.mx/pdf/2/5/BSMM%282%29.5.102-119.pdf.
[9] Berner, Julius, Lorenz Richter, and Karen Ullrich. “An optimal control perspective on diffusion-based generative modeling.” Transactions on Machine Learning Research, 2024. https://arxiv.org/abs/2211.01364.
Ansatz. Guess $V(x) = -\tfrac{1}{2}Px^2 - c$ with $P > 0$. Then $V'=-Px$, $V''=-P$.
Generator. With drift $f = \alpha x + \beta a$ and constant diffusion $\sigma$:
$$ \mathcal{L}^a V =V'(\alpha x+\beta a)+\tfrac{1}{2}\sigma^2V'' =-Px(\alpha x+\beta a)-\tfrac{1}{2}\sigma^2P =-\alpha P x^2-\beta P a x-\tfrac{1}{2}\sigma^2P. $$
HJB. Substituting into $\rho V = \max_a \{r + \mathcal{L}^a V\}$:
$$ -\tfrac{1}{2}\rho P x^2 - \rho c =\max_{a}\Big\{-\tfrac{1}{2}qx^2-\tfrac{1}{2}r_a a^2-\alpha P x^2-\beta P a x-\tfrac{1}{2}\sigma^2P\Big\}. $$
Optimality in $a$. The RHS is concave in $a$; set $\partial_a(\cdot) = 0$:
$$ -r_a\,a-\beta P x=0 \implies a^*(x)=-\frac{\beta P}{r_a}\,x=: -Kx. $$
Riccati equation. Substituting $a^* = -Kx$ back and matching the $x^2$ coefficient and the constant:
$$x^2:\quad \rho P = q + 2\alpha P - \frac{\beta^2}{r_a}P^2, \qquad \text{const:}\quad c = \frac{\sigma^2 P}{2\rho}$$
The first is the discounted algebraic Riccati equation. In the scalar case it is a quadratic in $P$. $\quad\blacksquare$
Ansatz. Power-law value: $V(x) = \frac{A}{1-\gamma}\,x^{1-\gamma}$, $A > 0$. Then:
$$V'(x) = A\,x^{-\gamma},\qquad V''(x) = -\gamma A\,x^{-\gamma-1}$$
Generator. Using drift $\mu_X x = [r_f + \pi(\mu-r_f) - k]x$ and diffusion $\sigma_X x = \pi\sigma x$:
$$\mathcal{L}^a V = A\,x^{-\gamma}\cdot\mu_X x - \tfrac{1}{2}\gamma A\,x^{-\gamma-1}\cdot\sigma_X^2 x^2 = A\,x^{1-\gamma}\big[r_f + \pi(\mu-r_f) - k - \tfrac{1}{2}\gamma\pi^2\sigma^2\big]$$
HJB. Substituting and dividing through by $x^{1-\gamma} > 0$:
$$\frac{\rho A}{1-\gamma} = \max_{\pi,\,k}\left\{\frac{k^{1-\gamma}}{1-\gamma} + A\big[r_f + \pi(\mu-r_f) - k - \tfrac{1}{2}\gamma\pi^2\sigma^2\big]\right\}$$
Optimality in $\pi$. FOC $\partial_\pi(\cdot) = 0$: $\;A[(\mu-r_f) - \gamma\pi\sigma^2] = 0$:
$$\pi^* = \frac{\mu - r_f}{\gamma\,\sigma^2}$$
This is the myopic portfolio rule — independent of wealth and time. Higher risk aversion $\gamma$ or volatility $\sigma$ reduces exposure.
Optimality in $k$. FOC $\partial_k(\cdot) = 0$: $\;k^{-\gamma} - A = 0$, so $k^* = A^{-1/\gamma}$.
Solving for $A$. Define the certainty-equivalent growth rate $M := r_f + \frac{(\mu-r_f)^2}{2\gamma\sigma^2}$. Substituting the optimisers back:
$$ \frac{\rho A}{1-\gamma} =\frac{(A^{-1/\gamma})^{1-\gamma}}{1-\gamma}+A\big[M-A^{-1/\gamma}\big] =\frac{A^{(\gamma-1)/\gamma}}{1-\gamma}+AM-A^{(\gamma-1)/\gamma}. $$
Multiplying by $1-\gamma$ and collecting terms gives
$$ \rho A=(1-\gamma)AM+\gamma A^{(\gamma-1)/\gamma}. $$
Since $A>0$, divide by $A^{(\gamma-1)/\gamma}$ to obtain
$$ A^{1/\gamma}=\frac{\gamma}{\rho-(1-\gamma)M}. $$
Therefore
$$ A = \left(\frac{\gamma}{\rho - (1-\gamma)M}\right)^\gamma, \qquad k^* = \frac{\rho - (1-\gamma)M}{\gamma} $$
The denominator $\rho - (1-\gamma)M$ must be positive — this is the feasibility condition ensuring lifetime utility is finite. With our parameters: $M \approx 0.04563$, so $k^* \approx 0.0478$. $\quad\blacksquare$
Let the dynamics and reward depend on time:
$$ dX_t=f(t,X_t,a_t)\,dt+\Sigma(t,X_t,a_t)\,dW_t,\qquad r=r(t,x,a). $$
For the discounted infinite-horizon problem, define the time-dependent value
$$ V(t,x):=\sup_\pi \mathbb E\Big[\int_t^\infty e^{-\rho(s-t)} r(s,X_s,a_s)\,ds\ \Big|\ X_t=x\Big]. $$
Then the time-dependent generator is
$$ \mathcal L_t^a \varphi(x)=\nabla \varphi(x)^\top f(t,x,a)+\tfrac12\mathrm{Tr}\big(\Sigma\Sigma^\top(t,x,a)\nabla^2\varphi(x)\big), $$
and the HJB becomes
$$ \rho V(t,x)=\max_{a\in\mathcal A}\Big\{r(t,x,a)+\partial_t V(t,x)+\mathcal L_t^a V(t,x)\Big\}. $$
Equivalently,
$$ -\partial_t V(t,x)=\max_{a\in\mathcal A}\Big\{r(t,x,a)+\mathcal L_t^a V(t,x)\Big\}-\rho V(t,x). $$
In the autonomous case, $V(t,x)$ is time-independent, so $\partial_t V=0$ and you recover (1).
For the finite-horizon deterministic case, set $\Sigma\equiv 0$ and define
$$ V(t,x):=\sup_{a_\cdot}\left[\int_t^T r(s,X_s,a_s)\,ds+g(X_T)\,\middle|\,X_t=x\right]. $$
Then
$$ -\partial_t V(t,x)=\sup_{a\in\mathcal A}\left\{r(t,x,a)+\nabla_x V(t,x)^\top f(t,x,a)\right\}, \qquad V(T,x)=g(x). $$
Writing $r=-L$ and
$$ H(t,x,p):=\sup_{a\in\mathcal A}\{p^\top f(t,x,a)-L(t,x,a)\}, $$
this becomes
$$ \partial_t V(t,x)+H\bigl(t,x,\nabla_x V(t,x)\bigr)=0, \qquad V(T,x)=g(x), $$
which is the classical Hamilton-Jacobi form.
But the examples you quoted were not my examples, at least not their primary movers (the NaNs could be caused by overflow, but that overflow can have a deeper cause). The examples I gave have/had very different root causes at play, and the fixes required some facility with math: not to the extent that you have to be capable of discovering new math, or something as complicated as the geometry and topology of strings, but nonetheless math at the level of grad school or an advanced and gifted undergrad.
Coming back to numeric overflow that you mention. I can imagine a software engineer eventually figuring out that overflow was a root cause (sometimes they will not). However there's quite a gap between overflow recognition and say knowledge of numerical analysis that will help guide a fix.
You say > "literally every single example"... can be dealt without much math. I would be very keen to learn from you about how to deal with this one, say. Without much math.
    The labelers labelled only the instances that are easy to label,
    not chosen uniformly from the data. How to train with such skewed
    label selection (without relabeling properly)?
This is not a gotcha, but genuine curiosity, because it is always useful to understand a solution different from your own (mine).