Now, since it's impossible to code review the tens of thousands of lines of code that AI produces, I see discussions about establishing an absolute rule like mathematical proofs. Reading this reminds me of Rust's borrow checker. In fact, after writing in Rust a few times, it often leads to bad practices where people use tricks to avoid the borrow checker.
Actually, when mathematical rigor goes too far, humans tend to find ways around it. An undereducated programmer like me is especially prone to that.
Looking back at this kind of attempt, it will probably result in writing code only for specific formalized answers. If it becomes that standardized, I'm not sure it will be able to respond to human needs.
I think these defensive programming attempts are fine, but I want to do offensive programming (I coined that term). You take risks, but you fix things quickly and ship. Believing that over time, it will become good enough. Of course, for established industries where accuracy matters and the scope of work is well defined, like Jane Street, the approach in this article is correct. In other words, because there is enough data to adequately model the market's demands
But for social losers like me trying to make money, constantly moving from place to place looking for gold mines, these kinds of methodologies seem like a luxury.Established businesses with mature modeling need highly educated and specialized personnel to continuously optimize. But I know that realistically, I can't keep up with that demand. So I look for places where modeling is unstructured, but I'm not sure if I can use this approach even then.
Hopefully we get more ergonomic ways to do this? Like of the tools listed in the post, dafny + iris are the closest to being industrial I think. And amzn S3 has a history of TLA use in-house I think. But we probably haven't seen the typescript in this space yet, a zero cost abstraction that drops into existing tools, and people genuinely prefer it to the old way.
(And custom linters are also still pretty bad to write. Like golangci-lint is a painful codebase, haven't tried semgrep but the rules engine seemed intimidating. I've yet to use an AST API that I liked)
Formal methods won, now what? https://muratbuffalo.blogspot.com/2026/04/bugbash26-keynote....
We've banned everything from JSON to Python, rewritten `nix` to have a compiler, and almost everything we write is not only property tested and multiply fuzzed to within an inch of it's life, but we have proofs in `lean4` that at a minimum drive property tests via `.olean` linkage, and when we have the bandwidth we prove exhaustiveness over the domain and property test that.
We skip the whole C++/Rust thing because all of the fast stuff is generated from `lean4` and so it doesn't really matter (C++ has advantages when bugs in it are actually bugs in `lean4` code, but you could go either way).
This is a big departure, and I certainly don't blame anyone who is skeptical: "ban JSON and Python wtf?!?!", but we've done millions of lines (checked) of this stuff and AI + formal systems is a dramatically bigger leap than no AI -> AI and Python. The latter in our experience is not monotone in progress, the former is almost always monotone in progress.
And you can do wild shit, this is a formal proof of the polyhedral tensor calculus that is modeled by things like ISL and CuTe, and using that we can get swizzling and tiling using `mdspan` in C++23 on device and prove it's right (up to some L'Hopital arugment about the coverage which this example doesn't demonstrate well: https://github.com/b7r6/mdspan-cute
That in turn, well, it goes real fast. On the first try.
There is a movement parallel to formal methods to define acceptance criteria at high resolution but not logico-mathematically, which at least grapples with the mapping problem but can’t resolve it where the map isn’t the territory, which is most places. Has Google’s results page, with its extremely evolved internal optimization frameworks really hit an optimum? Could that prototype you whipped up to capture a hazy idea have better illustrated it? These questions are best answered by looking outside the system to what the system serves.
I guess doing things twice can help catch errors, but I fail to see what’s so special about formal specs if they can suffer from the exact same bugs as the tests/implementation.
I guess the root of the problem is if you want to formally prove that a program does something, you have to be very specific (heh) about what that something is, at which point you are basically just writing tests/implementation all over again.
I have been looking into this on and off for years now, and I keep feeling like I’m missing something, but I don’t know what it is.
Can anyone enlighten me?
Is it that they can help write formulas faster? That they can help ensure formulas match the system they're modeling faster? If the problem you think formal methods will help with is sloppy code, isn't the verification code going to be sloppy as well, unless some (not sloppy) intelligence carefully confirms that the specification matches the target system, which was the labor that previously made formal methods too expensive? I guess I don't understand how the argument works if code was previously less sloppy and verification was too expensive, and now code is more sloppy, and there's more of it, but somehow the sloppy intelligence will make verification move fast enough to make it worthwhile.
Unless we have some non-sloppy intelligence that's less of a bottleneck on verification than humans, how are we in a better place?
Maybe it's that investing that huge amount of labor of verification by human experts is now worth it because so much code will be produced that uses the verification systems that the investment will now pay off. But that requires creating pretty general verification systems, such as type system verification or something (which is what they seem to be aimed at), rather than individually verifying software systems like the micro-kernel example.
In other words, maybe the play is to invest in reusable verification systems that can be run tons of times on new code and systems. If so, it's surprising that this wasn't always the strategy.
And in some cases, where the result isnt well defined, it can be learned, so it's not about sitting and thinking what would make sense.
A big difference is that formal methods allow you to use the "for all" quantifier.
For example, you might write a unit test that says "foo('abc') returns a string with no trailing whitespace".
But with formal methods, you can prove that "for any input x, foo(x) returns a string with no trailing whitespace".
This is a trivial example, but you could imagine something more complicated, such as "for any program P, compile(P) has the same behavior as P".
Of course, you have to define what "has the same behavior as" means!
You specify properties of the design and the tooling tests the design in a variety of ways to see if it can violate those properties.
Let’s say you have a system that controls turn signals. You can specify properties that say things like lanes that cross each other can’t both have a green light at the same time or even within some period of time of each other.
The toolling can exhaustively check the design for this and surface code traces that violate it.
Statically proven type systems let you do this in a way that each contributing piece is checked in advance against all the other pieces, guaranteeing not just "this test passes" but, in combination, "all these tests create a reasonable, coherent whole", and it disallows incoherent models from being implemented in the code. An example of this is like Rust's match, which requires complete coverage of all the possible branches, but writ large across an entire codebase.
You're right that it does nothing for you if you make a fundamental logic or theory error, it can't catch that kind of error, but you'd be surprised how much more robust it is than "merely" e.g. Haskell's type system + ad hoc functional testing + property testing - which I would consider a quite strong system overall, but formally proven e.g. cryptography is a much higher bar.
I don't disagree with your point (formal verification does not rid you of all bugs), but this is not the subject of the linked issue. This was a bug in an unverified path.
The point up to which formal methods stops is: do the formally encoded requirements actually encode what I want to be true?
One can make the argument that the requirements is a much smaller surface to verify than that of the entire program.
The counter argument is that figuring out what you want the program to do has always been the hardest part of programming, and that programming in itself is a journey to discover latent requirements.
The beauty of formal methods is it doesn't matter if your proof is sloppy. As long as it passes verification, it is correct. And unlike in pure math, the proof that a software system is correct is usually a huge mess of special cases, loop invariants, proofs by induction, and boilerplate that requires a large amount of human labour while providing no insight.
Proofs are also brittle: a tiny change in the code can force you to throw your proof away and start from scratch.
To me, the exciting thing about formal methods in the LLM era is it allows humans to offload the difficult and tedious work of writing proofs to a computer. Taken to an extreme, the human could live entirely in the world of a formal specification, and the LLM could generate 100% of the code. The code may be a mess, but if the system proves it satisfies the spec then it can't be wrong.
This argument is unfortunately empirically false for any program of any meaningful complexity (>1000 lines, probably even as low as >100 lines ignoring well-defined algorithms and data structures) using current formal methods.
Complete formal specifications are usually multiple times larger than the corresponding source code and encode esoteric propertys necessary for the proof, but which are largely even more impenetrable than a undocumented codebase.
So, it is both harder to figure out if you encoded the desired requirements and it is more complex. Your only advantage is confidence that what you wrote down is proven.
Things can be improved when people help guide and focus the LLMs, but these people still need to be formal methods experts.
If a formal spec is messy, then it's a proof of ... what, exactly?
A formal specification that bridges tech and product, that lets non-technical contributors read and discuss all the logical nuances, directly as operational code, at product's level of abstraction of interest, would transform a lot.
It's no longer a challenge to create code, it's a challenge to create business requirements and translate them into systems.
I’ve been telling people for the last 25 years that Jane Street as an organization was just not interested in formal methods.
I’m not saying that anymore.
It’s not exactly that I think we were wrong all those years. To be clear, we’re strong believers in the power of tools to help us write better and more reliable code. And type systems are a kind of lightweight formal method that we’ve gotten an enormous amount of benefit from. So you might expect us to have been big believers in more full-on formal methods.
But outside of some special cases (notably, hardware synthesis), our sense has been that formal methods were just not worth the costs for us. And those costs are really high! seL4 is a great example of this. It’s a formally verified microkernel, and a profound achievement. But, boy was it expensive to do! It took 25 person-years of effort to verify 8,700 lines of C, and each line of code required something like 23 lines of proof and a half a person-day to verify.
Our hope is to make formal methods as pervasively useful of a tool for building software as sophisticated type systems are for us today.
That kind of approach could be worth it for a security-critical microkernel, where the stakes are high and the specifications are fairly clear. But it just doesn’t make sense for most software, and to us it didn’t feel like it made sense for even our most critical software.
But the emergence of agentic coding has changed our perspective, and we’ve gone from being skeptical to being excited about the possibilities. And as a result, we’re now building a team to focus on formal methods. Our hope is to make formal methods as pervasively useful of a tool for building software as sophisticated type systems are for us today.
Agentic coding upsets the formal-methods apple-cart in a few ways.
For one thing, it dramatically changes the cost of using formal methods. It’s not that agents can on their own construct arbitrarily challenging proofs.1 But models are enormously helpful, and broaden the set of people who can use these tools productively. With formal methods being easier to use than ever, it’s worth reconsidering the old cost/benefit calculus.
But things haven’t changed only on the cost side. The benefits seem bigger now too. There are really two reasons for this:
The verification bottleneck is more important than ever. Models are increasingly good at writing useful code. But there’s a big gap between the code that models generate, and code that you’d want to actually release. To some degree, this is an artifact of how the models are trained. They’re surprisingly good at achieving the goal you set in front of them, but they don’t do a great job of maintaining and even improving the quality of the codebase as they do so. Agentic code is getting better, but is still tends towards slop: overly complicated, full of weird bugs and corner cases, often not following essential invariants of the codebase that it’s a part of.
As a result, people need to spend a lot of time verifying that the code produced by agents is up to snuff. And formal methods could be a way of relieving some of that verification burden, and making the process of review a lot more efficient.
Separately, agents thrive on feedback. This is true both when you’re training agents using RL, and when you’re using agents to code. And formal methods are another powerful form of feedback that can increase the agents’ ability to solve hard problems.
A lot of why we're excited about full-on formal methods is that we see how valuable types are when programming with agents.
Not that formal methods are the only way of getting feedback. Tests are incredibly valuable as well, and can be made even better by leaning into property-based tests and fuzzing. And lord knows we’ve spent a lot of time building out testing infrastructure.
But tests aren’t enough! There are inherent limits in the power of tests to cover the state space that your program might explore. One of the things we’ve seen in our own programming in OxCaml is that agents benefit a ton from universal guarantees, the ∀ you get out of type systems. If your type system has a way of preventing data races, it lets you get rid of all2 data races. If you set up your types to make cross-site scripting vulnerabilities impossible, then you can really get rid of those entirely, in a way that mere testing has trouble doing.
Indeed, a lot of why we’re excited about full-on formal methods is that we see how valuable types are when programming with agents, both for easing the verification bottleneck and providing agents with better feedback, and that makes us excited to see how much more uplift could be available by leveraging more powerful proof techniques.
We have two things going for us: deep control of the language we're using, and a community of programmers who are ready for this.
One question this raises is: why is Jane Street well positioned to work on this problem? The whole world is thinking about what agents mean for the future of programming, and there are endless startups looking for ways of mixing formal methods and agents. Why is this something we’d work on internally? And why should formal methods experts in the outside world be excited to join our efforts here?
For one thing, we have deep control of the language we’re using, and that lets us adjust that language to make it a better home for proof-oriented techniques. There are lots of potential directions to go here: from integrating modular specifications of properties into the type system, to adding type-level constraints around things like ownership and mutability to make certain kinds of proofs easier, to building proof techniques directly into the language.
We also have a community of programmers who are ready for this, or at least more ready than any serious programming community I’ve encountered. For most people who work on programming languages, the easy part is coming up with new and better ideas about how to make programming better. The hard part is convincing anyone to actually use those ideas for real work.
At Jane Street, things are different! We routinely have users angry at us because the new, weird type-system features we promised them aren’t coming fast enough. We have a lot of people with the right background to leverage these techniques, and a lot of baked in interest in getting things right and building high-quality software.
We think that user base will gives us the freedom to try a mixture of approaches; there are some near-term improvements we think we can make which will have pretty immediate impact, and some ambitious, longer-term visions for where we can get in a few years. Having an engaged and excited user base makes both of these approaches possible, and lets us learn from the first, while we build towards the second.
None of this is to say that we’re going to ignore work in the outside world. We’re excited and inspired by the work in a variety of other PL communities, built around tools like Lean, Dafny, Rocq, Agda, Iris, and too many more to mention. And we’re excited to look for ways of integrating OxCaml with some of these tools, to take advantage of the great infrastructure that’s already out there. But we also think there are some unique advantages that can only be realized by engaging with the language and the proof techniques at the same time.
If this sounds interesting to you, consider applying! We’re looking for people in both London and New York. We’re in the early stages of interviewing people for these spots and building a team, and there’s an enormous amount of work ahead of us, and we’d love you to be a part of it.
Our experience is that models still need help and guidance from humans in order to navigate a complex proof. A human programmer may have ideas about why a system works and how, at a high level, to go about proving it. But most programmers don’t know how to encode these proof ideas in a way that will satisfy a given proof system. Models can automate much of the drudgery and provide a ready source of expertise on the technical details of writing out a proof. ↩
OK, well, maybe not all. There are escape hatches, like Obj.magic, that let you work around type-level constraints. But you can track and ban exceptions like that for most of your code, at which point you do get something very close to universal guarantees. And, indeed, formal methods can allow you to make it explicit why your use of those escape hatches is actually safe. ↩
Yaron Minsky joined Jane Street back in 2002, and claims the dubious honor of having convinced the firm to start using OCaml.