Woohoo, we're almost all of the way there! Now all you need to do is ensure that the formal specification you are proving that the software implements is a complete and accurate description of the requirements (which are likely incomplete and contradictory) as they exist in the minds of the set of stakeholders affected by your software.
(sarcasm off).
Proofs never took off not just because they are difficult, but because most software engineering moved away from waterfall development. Long formal specifications were abandoned because those who wrote them often misunderstood what the user wanted, or the user didn't know what they wanted. Instead, agile development took over and software evolved more iteratively and rapidly to meet user needs.
The author seems to make their prediction based on the flawed assumption that difficulty in writing proofs was the only reason we avoided them, when in reality the real challenge was understanding what the user actually wanted.
The hard part isn’t getting an LLM to grind out proofs, it’s getting organizations to invest in specs and models at all. Right now we barely write good invariants in comments. If AI makes it cheap to iteratively propose and refine specs (“here’s what I think this service guarantees; what did I miss?”) that’s the moment things tip: verification stops being an academic side-quest and becomes another refactoring tool you reach for when changing code, like tests or linters, instead of a separate capital-P “formal methods project”.
This is - in my opinion - one of those. If an AI were able to formally verify with the same rigor as a system designed specifically for that purpose, I think that would require AGI rather than a simpler version of it. The task is complex enough that present-day AIs would generate as much noise as they would signal.
Another take is that LLMs don't have enough conceptual understanding to actually create proofs for the correctness of code.
Personally I believe this kind of work is predicated on more ergonomic proof systems. And those happen to be valuable even without LLMs. Moreover, the built-in guarantees of Rust seem like a great start for creating more ergonomic proof systems. Here I am both in awe of Kani and disappointed by it. The awe is for the good work put into making things more ergonomic. The disappointment is the use of bounded model checking for formal analysis, which can barely make use of the exclusion of mutable aliasing. Kani, but with equational reasoning: that's the way forward. Equational reasoning was long held back by needing to do a whole lot of pointer work to rule out mutable aliasing. Now you can lean on the borrow checker for that!
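For context, a minimal sketch of what a Kani harness looks like today, assuming a made-up function under test:

```rust
// Hedged sketch of a Kani proof harness: `kani::any()` gives a symbolic
// value, and the bounded model checker explores every possible input
// rather than a random sample. `clamp_to_percent` is invented for
// illustration.
fn clamp_to_percent(x: u32) -> u32 {
    if x > 100 { 100 } else { x }
}

#[cfg(kani)]
#[kani::proof]
fn clamp_never_exceeds_100() {
    let x: u32 = kani::any(); // symbolic input, not a concrete test value
    assert!(clamp_to_percent(x) <= 100);
}
```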
Might as well just learn Agda or Lean. There are good books out there. It’s not as hard as the author suggests. Hard, yes, but there’s no Royal road.
Sure, formal verification might give stronger guarantees about various levels of the stack, but I don’t think most of us care about having such strong guarantees now and I don’t think AI really introduces a need for new guarantees at that level.
At the most basic level this means making sure they can run commands to execute the code - easiest with languages like Python, with HTML+JavaScript you need to remind them that Playwright exists and they should use it.
The next step up from that is a good automated test suite.
Then we get into quality of code/life improvement tools - automatic code formatters, linters, fuzzing tools etc.
Debuggers are good too. These tend to be less coding-agent friendly due to them often having directly interactive interfaces, but agents can increasingly use them - and there are other options that are a better fit as well.
I'd put formal verification tools like the ones mentioned by Martin on this spectrum too. They're potentially a fantastic unlock for agents - they're effectively just niche programming languages, and models are really good at even niche languages these days.
If you're not finding any value in coding agents but you've also not invested in execution and automated testing environment features, that's probably why.
There is a problem with this argument similar to one made about imagining the future possibilities of vibe coding [1]: once we imagine AI to do this task, i.e. automatically prove software correct, we can just as easily imagine it to not have to do it (for us) in the first place. If AI can do the hardest things, those it is currently not very good at doing, there's no reason to assume it won't be able to do easier things/things it currently does better. In particular, we won't need it to verify our software for us, because there's no reason to believe that it won't be able to come up with what software we need better than us in the first place. It will come up with the idea, implement it, and then decide to what extent to verify it. Formal verification, or programming for that matter, will not become mainstream (as a human activity) but go extinct.
Indeed, it is far easier for humans to design and implement a proof assistant than it is to use one to verify a substantial computer program. A machine that will be able to effectively use a proof checker, will surely be able to come up with a novel proof checker on its own.
I agree it's not hard to extrapolate technological capabilities, but such extrapolation has a name: science fiction. Without a clear understanding of what makes things easier or harder for AI (in the near future), any prediction is based on arbitrary guesses that AI will be able to do X yet not Y. We can imagine any conceivable capability or limitation we like. In science fiction we see technology that's both capable and limited in some rather arbitrary ways.
It's like trying to imagine what problems computers can and cannot efficiently solve before discovering the notion of computational complexity classes.
At worst, we eventually create a sentient AI that can use our trust of the generated code to jailbreak and distribute itself like an unstoppable virus, and we become its pets, or are wiped out.
Personally, all my vibe coding includes a prompt to add comments to explain the code, and I review every line.
A formally verified system is easier for the model to check and consequently easier for it to program to. I suppose the question is whether or not formal methods are sufficiently tractable that they actually do help the LLM be able to finish the job before it runs out of its context.
Regardless, I often use coding assistants in that manner:
1. First, I use the assistant to come up with the success condition program (a toy checker is sketched just after this list)
2. Then I use the assistant to solve the original problem by asking it to check with the success condition program
3. Then I check the solution myself
It's not rocket science, and is just the same approach we've always taken to problem-solving, but it is nice that modern tools can also work in this way. With this, I can usually use Opus or GPT-5.2 in unattended mode.
0: https://wiki.roshangeorge.dev/w/Blog/2025-12-11/LLMs_Excel_A...
I like the idea that even languages like Rust and Haskell may become more accessible. Learn them, of course, but an LLM can steer you out of getting stuck.
So let's say that, similarly, there are programming tasks that are easier and harder for agents to do well. If we know that a task is in the easy category, of course having tests is good, but since we already know that an agent does it well, the test isn't the crucial aspect. On the other hand, for a hard task, all the testing in the world may not be enough for the agent to succeed.
Longer term, I think it's more important to understand what's hard and what's easy for agents.
One is that modern formal systems like Lean are quite concise and flexible compared to what you're probably expecting. Lean provides the primitives to formalize all kinds of things, not just math or software. In fact, I really believe that basically _any_ question with a rigorous yes-or-no answer can have its semantics formalized into a kind of "theory". The proofs are often close to how an English proof might look, thanks to high-level tactics involving automation and the power of induction.
Another is that proof-checking solves what are (in my opinion) two of the biggest challenges in modern AI: reward specification and grounding. You can run your solver for a long time, and if it finds an answer, you can trust that without worrying about reward hacking or hallucination, even if the answer is much too complicated for you to understand. You can do RL for an unlimited time for the same reason. And Lean also gives you a 'grounded' model of the objects in your theory, so that the model can manipulate them directly.
In combination, these two properties are extremely powerful. Lean lets us specify an unhackable reward for an extremely diverse set of problems across math, science, and engineering, as well as a common environment to do RL in. It also lets us accept answers to questions without checking them ourselves, which "closes the loop" on tools which generate code or circuitry.
I plan to write a much more in-depth blog post on these ideas at some point, but for now I'm interested to see where the discussion here goes.
[1] https://leandojo.org/leandojo.html [2] https://aristotle.harmonic.fun/
What will happen instead is a more general application of AI systems to verifying software correctness, which should lead to more reliable software. The bottleneck in software quality is in specifying what the behavior needs to be, not in validating conformance to a known specification.
> 2. Then I use the assistant to solve the original problem by asking it to check with the success condition program
This sounds a lot like Test-Driven Development. :)
If you really need an answer and you really need the LLM to give it to you, then ask it to write a (Python?) script to do the calculation you need, execute it, and give you the answer.
They are not. The power of rich and succinct specification languages (like TLA+) comes from the ability to succinctly express things that cannot be efficiently computed, or at all. That is because a description of what a program does is necessarily at a higher level of abstraction than the program (i.e. there are many possible programs or even magical oracles that can do what a program does).
To give a contrived example, let's say you want to state that a particular computation terminates. To do it in a clear and concise manner, you want to express the property of termination (and prove that the computation satisfies it), but that property is not, itself, computable. There are some ways around it, but as a rule, a specification language is more convenient when it can describe things that cannot be executed.
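A sketch of that contrived example in Lean 4, assuming a state type `σ` and a small-step function `step` (both invented here):

```lean
-- `step` returns `none` when the machine halts; `run` applies it at most
-- n times.
def run {σ : Type} (step : σ → Option σ) : Nat → σ → Option σ
  | 0,     s => some s
  | n + 1, s =>
    match step s with
    | none    => none
    | some s' => run step n s'

-- "The computation starting from `s` terminates" is easy to *state*,
-- even though no algorithm can decide it for an arbitrary `step`.
def Terminates {σ : Type} (step : σ → Option σ) (s : σ) : Prop :=
  ∃ n, run step n s = none
```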
People don't verify those because it's hard, not for lack of value.
It's hard even for a human who understands the full business, social and political context to disambiguate the meaning and intent of the spec; to try to express it mathematically would be an absolute nightmare... and extremely unwise. You would literally need some kind of super intelligence... And the amount of stream-of-thought tokens which would have to be generated to arrive at a correct, consistent, unambiguous formal spec is probably going to cost more than just hiring top software engineers to build the thing with 100% test coverage of all main cases and edge cases.
Worst part is: after you do all the expensive work of formal verification, you end up proving the 'correctness' of a solution that the client doesn't want.
The refactoring required will invalidate the entire proof from the beginning. We haven't even figured out the optimal way to formally architect software that is resilient to requirement changes; in fact, the industry is REALLY BAD at this. Almost nobody is even thinking about it. I am, but I sometimes feel like I may be the only person in the world who cares about designing optimal architectures to minimize line count and refactoring diff size. We'd have to solve this problem first before we even think about formal verification of 'most software'.
Without a hypothetical super-intelligence which understands everything about the world, the risk of misinterpreting any given 'typical' requirement is almost 100%... And once we have such a super-intelligence, we won't need formal verification, because it will be able to code perfectly on the first attempt; no need to verify.
And then there's the fact that most software can tolerate bugs... If operationally important big tech software which literally has millions of concurrent users can tolerate bugs, then most software can tolerate bugs.
> The next step up from that is a good automated test suite.
And if we're going for a powerful type system, then we can really leverage the power of property tests which are currently grossly underused. Property tests are a perfect match for LLMs because they allow the human to create a small number of tests that cover a very wide surface of possible errors.
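As a hedged sketch of that idea, here is the same style of property test written with Rust's proptest crate (a QuickCheck analogue); `normalize_whitespace` is a made-up function under test:

```rust
use proptest::prelude::*;

// Hypothetical function under test: collapse runs of whitespace.
fn normalize_whitespace(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

proptest! {
    // One property covers a wide surface of inputs: normalizing twice
    // gives the same result as normalizing once (idempotence).
    #[test]
    fn normalization_is_idempotent(s in ".*") {
        let once = normalize_whitespace(&s);
        prop_assert_eq!(normalize_whitespace(&once), once);
    }
}
```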
The "thinking in types" approach to software development in Haskell allows the human user to keep at a level of abstraction that still allows them to reason about critical parts of the program while not having to worry about the more tedious implementation parts.
Given how much interest there has been in using LLMs to improve Lean code for formal proofs in the math community, maybe there's a world where we make use of an even more powerful type system than Haskell. If LLMs with the right language can help prove complex mathematical theorems, then it should certainly be possible to write better software with them.
I don’t go into Claude without everything already set up. Codex helps me curate the plan and curate the issue tracker (one instance). Claude gets a command to fire up into context, grab an issue and implement it, and then Codex and Gemini review independently.
I’ve instructed Claude to go back and forth for as many rounds as it takes. Then I close the session (\new) and do it again. These are all the latest frontier models.
This is incredibly expensive, but it’s also the most reliable method I’ve found to get high-quality progress — I suspect it has something to do with ameliorating self-bias, and improving the diversity of viewpoints on the code.
I suspect rigorous static tooling is yet another layer to improve the distribution over program changes, but I do think that there is a big gap in folk knowledge already between “vanilla agents” and something fancy with just raw agents, and I’m not sure if just the addition of more rigorous static tooling (beyond the compiler) closes it.
Or, to put it the other way round, what kind of tech leads would we be if we told our junior engineers „Well, here’s the codebase, that’s all I‘ll give you. No debuggers, linters, or test runners for you. Using a browser on your frontend implementation? Nice try buddy! Now good luck getting those requirements implemented!“
Do you really think it is going to be easier for the average developer to write a specification for their program that does not terminate
vs
Giving them a framework or a language that does not have a for loop?
Edit: If by formal verification you mean type checking, then I very much agree.
You really have to be able to reduce your models to: "at some point in the future, this will happen" or "it will always be true from now on".
Have probabilistic outcomes? Or even floats [0]? It becomes challenging, and strings are a mess.
> Note there is not a float type. Floats have complex semantics that are extremely hard to represent. Usually you can abstract them out, but if you absolutely need floats then TLA+ is the wrong tool for the job.
TLA+ works for the problems it is suited to; try to extend past that and it simply fails.
Concrete Semantics [1] has a little example in §9.2.
A bidirectional bridge that spans multiple representations from informal spec to semiformal spec to code seems ideal. You change the most relevant layer that you're interested in and then see updates propagating semi-automatically to other layers. I'd say the jury is out on whether this uses extra tokens or saves them, but a few things we do know. Chain of code works better than chain of thought, and chain-of-spec seems like a simple generalization. Markdown-based planning and task-tracking agent workflows work better than just YOLOing one-shot changes everywhere, and so intermediate representations are useful.
It seems to me that you can't actually get rid of specs, right? So to shoot down the idea of productive cooperation between formal methods and LLM-style AI, one really must successfully argue that informal specs are inherently better than formal ones. Or even stronger: having only informal specs is better than having informal+formal.
My bet is on refinement types. Dafny fits that bill quite well, it's simple, it offers refinement types, and verification is automated with SAT/SMT.
In fact, there are already serious industrial efforts to generate Dafny using LLMs.
Besides, some of the largest verification efforts have been achieved with this language [1].
[1] https://www.andrew.cmu.edu/user/bparno/papers/ironfleet.pdf
It even lets you separate implementation from specification.
Yeah, it's gonna be fun waiting for compilation cycles when those models "reason" with themselves about a semicolon. I guess we just need more compute...
Strong type systems provide partial validation, which is helping quite a lot IMO. The better we can model the state and the more constraints we can define in the model, the closer we get to writing "self-proven" code. I assume formal proofs do far more than just ensure the validity of the model, but I believe similar approaches can be integrated into mainstream programs as well.
> At present, a human with specialist expertise still has to guide the process, but it’s not hard to extrapolate and imagine that process becoming fully automated in the next few years.
We already had some software engineers here on HN explain that they don't make a large use of LLMs because the hard part of their job isn't to actually write the code, but to understand the requirements behind it. And formal verification is all about requirements.
> Reading and writing such formal specifications still requires expertise and careful thought. But writing the spec is vastly easier and quicker than writing the proof by hand, so this is progress.
Writing the spec is easier once you are confident about having fully understood the requirements, and here we get back to the above issue. Plus, it is already the case that you don't write the proof by hand, this is what the prover either assists you with or does in full.
> I find it exciting to think that we could just specify in a high-level, declarative way the properties that we want some piece of code to have, and then to vibe code the implementation along with a proof that it satisfies the specification.
And here is where I think problems will arise: moving from the high level specification to the formal one that is the one actually getting formally verified.
Of course, this would still be better than having no verification at all. But it is important to keep in mind that, with these additional levels of abstractions, you will likely end up with a weaker form of formal verification, so to speak. Maybe it is worth it to still verify some high assurance software "the old way" and leave this only for the cases where additional verification is nice to have but not a matter of life or death.
LLMs enable faster code bootstrapping and experimentation not only for vibe coders, but also for researchers, many of whom are not really good coders, btw. It may well be that we will see wild new verification tools soon that come out of quick iteration with LLMs.
For example, I recently wrote an experimental distributed bug finder for TLA+ with Claude in about three weeks. A couple of years ago that effort would require three months and a team of three people.
The quest for purity is some fountain of youth nonsense that distracts a lot of otherwise brilliant engineers.
Ask the AI to make a program that consumes a program and determine if it halts.
What it will make go mainstream, and in fact has already started to, is “ChatGPT verified it so it must be OK.”
Fans of LLMs brag about speed and productivity.
Most popular programming languages are optimized for human convenience, not for correctness! Even most of the popular typed languages (Java/Kotlin/Go/...) have a wide surface area for misuse that is not caught at compile time.
Case in point: In my experience, LLMs produce correct code way more regularly for Rust than for Js/Ts/Python/... . Rust has a very strict type system. Both the standard library and the whole library ecosystem lean towards strict APIs that enforce correctness, prevent invalid operations, and push towards handling or at least propagating errors.
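As a small, hedged illustration of what that strictness looks like (the `parse_port` function is invented for illustration):

```rust
// Parsing returns a Result; the compiler will not let you silently
// ignore the failure case the way a loosely typed language might.
use std::num::ParseIntError;

fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    s.trim().parse::<u16>() // out-of-range or non-numeric input is an Err, not a NaN
}

fn main() {
    match parse_port("8080") {
        Ok(port) => println!("listening on {port}"),
        Err(e) => eprintln!("bad port: {e}"), // forced to handle or propagate
    }
}
```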
The AIs will often write code that won't compile initially, but after a few iterations with the compiler the result is often correct. Strong typing also makes it much easier to validate the output when reviewing.
With AIs being able to do more and more of the implementation, the "feel-good" factor of languages will become much less relevant. Iteration speed is not so important when parallel AI agents do the "grunt work". I'd much rather wait 10 minutes for solid output rather than 2 minutes for something fragile.
We can finally move the industry away from wild-west languages like Python/JS and towards more rigorous standards.
Rust is probably the sweet spot at the moment, thanks to it being semi-popular with a reasonably active ecosystem; sadly, I don't think the ideal language exists yet.
What we really want is a language with a very strict, comprehensive type system with dependent types, maybe linear types, structured concurrency, and a built-in formal proof system.
Something like Ada/SPARK, but more modern.
2020: I don't care how it performs
2030: I don't care why it performs
2040: I don't care what it performs
One example could be a low-level programming language for a given PLC manufacturer, where the prompt comes from a context-aware domain expert, and the LLM is able to output proper DSL code for that PLC. Think of "make sure this motor spins at 300rpm while this other task takes place"-type prompts.
The LLM essentially needs to juggle between understanding those highly-contextual clues, and writing DSL code that very tightly fits the DSL definition.
We're still years away from this being thoroughly reliable for all contexts, but it's interesting research nonetheless. Happy to see that someone also agrees with my sentiment ;-)
So, the job is not done for humans yet.
Some students whose work I had to fix (pre-AI) had code crashing all over the place, due to !'s instead of ?'s followed by guard let … {} and if let … {}.
As for those who use LLMs to impersonate humans, which is the kind of verification (verify that this solution that is purported to be built by a human actually works), I have no doubt we will rapidly evolve norms that make us more resistant to them. The cost of fraud and anti-fraud is not zero, but I suspect it will be much less than we fear.
It's an example of a simple task. How often are you relying on LLMs to complete simple tasks?
I think it's more nuanced than that. As a human, I can manually test code in ways an AI still can't. Sure, maybe it's better to have automated test suites, but I have other options too.
Source code generation is possible due to a large training set and the effort put into reinforcing better outcomes.
I suspect debugging is not that straightforward to LLM'ize.
It's a non-sequential interaction - when something happens, it hasn't necessarily caused the problem, and the timeline may be shuffled. An LLM would need tons of examples where something happens in a debugger or in the logs, and learn to associate it with another abstraction.
I was debugging something in gdb recently and it was a pretty challenging bug. Out of interest I tried chatgpt, and it was hopeless - try this, add this print etc. That's not how you debug multi-threaded and async code. When I found the root cause, I was analyzing how I did it and where did I learn that specific combination of techniques, each individually well documented, but never in combination - it was learning from other people and my own experience.
We valued automated tests and linters and fuzzers and documentation before AI, and that's because it serves the same purpose.
TL;DR: We don't need to be radically agnostic about the capabilities of AI-- we have enough experience already with the software value chain (with and without AI) for formal verification to be an appealing next step, for the reasons this author lays out.
Even if you have never written formal proofs but are intrigued by them, try asking a coding agent to do some basic verification. You will not regret it.
Formal proof is not just about proving stuff, it's also about disproving stuff, by finding counterexamples. Once you have stated your property, you can let quickcheck/plausible attack it, possibly helped by a suitable generator which does not have to be random: it can be steered by an LLM as well.
Even further, I'm toying with the idea of including LLMs inside the formalization itself. There is an old and rich idea in the domain of formal proof, that of certificates: rather than proving that the algorithm that produces a result is correct, just compute a checkable certificate with untrusted code and verify it is correct. Checkable certificates can be produced by unverified programs, humans, and now LLMs. Properties, invariants, can all be "guessed" without harm by an LLM and would still have to pass a checker. We have truly entered an age of oracles. It's not halting-problem-oracle territory of course, but it sometimes feels pretty close for practical purposes. LLMs are already better at math than most of us and certainly than me, and so any problem I could plausibly solve on my own, they will do faster without my having to wonder if there is a subtle bug in the proof. I still need to look at the definitions and statements, of course, but my role has changed from finding to checking. Exploring the space of possible solutions is now mostly done better and faster by LLMs. And you can run as many in parallel as you can keep up with, in attention and in time (and money).
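A minimal sketch of the certificate pattern, with graph colouring standing in as the problem (the names and toy checker are made up for illustration):

```rust
// An untrusted oracle (an unverified solver, a human, or an LLM) proposes
// a colouring; only this small checker needs to be trusted or formally
// verified.
type Edge = (usize, usize);

fn check_colouring(edges: &[Edge], colouring: &[u32]) -> bool {
    edges.iter().all(|&(u, v)| {
        // both endpoints must be coloured, and their colours must differ
        u < colouring.len() && v < colouring.len() && colouring[u] != colouring[v]
    })
}

fn main() {
    let triangle = [(0, 1), (1, 2), (2, 0)];
    let proposed = [0, 1, 2]; // e.g. guessed by an LLM
    assert!(check_colouring(&triangle, &proposed));
}
```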
If anyone else is as excited about all this as I am, feel free to reach out in comments, I'd love to hear about people's projects !
* A user interface is confusing, or the English around it is unclear
* An API you rely on changes, is deprecated, etc.
* Users use something in unexpected ways
* Updates forced by vendors or open source projects cause things to break
* The customer isn't clear what they want
* Complex behavior between interconnected systems, out of the purview of the formal language (OS + database + network + developer + VM + browser + user + web server)
For some mathematically pure task, sure, it's great. Or a low-level library like a regular expression parser or a compression codec. But I don't think that represents a lot of what most of us are tasked with, and those low-level "mathematically pure" libraries are generally pretty well handled by now. Reducing the problem to "ya just create a specification to formally verify" doesn't move the needle enough for me.
When it comes to real-world, pragmatic, boots-on-the-ground engineering and design, we are so far from even knowing the right questions to ask. I just don't buy it that we'd see huge mainstream productivity changes even if we had access to a crystal ball.
It's hilarious how close we're getting to The Hitchhiker's Guide to the Galaxy, though. We're almost at the phase where we ask what the question is supposed to be.
I wouldn't. An unreadable mess that has been formally verified is worse than a clear easy to understand piece of code that has not.
Code is rarely written from scratch. As long as you want humans to maintain code, readability is crucial. Code is changed magnitudes more often than written initially.
Of course, if you don't want humans to maintain the code then this point is moot. Though he gets to the catch later on: then we need to write (and maintain and debug and reason about) the specification instead. We will just have kicked the can down the road.
How big is the effort of writing a specification for an application versus implementing the application in the traditional way? Can someone with more knowledge chime in here please?
More likely is the rise of test driven development, or spec driven development.
However, I still don't believe in vibe-coding full programs. There are too many layers in software systems; even when the program core is fully verified, the programmer must know about the other layers.
If you are an Android app developer, you need to know what phones people commonly use, what kind of performance they have, how apps are deployed through the Google Play Store, how to manage a wide variety of app versions, and how to handle issues when storage is low, the network is offline, the battery is low, or the CPU is in a lower power state.
In pre-silicon verification, formal has been used successfully for decades, but is not a replacement for simulation based verification.
The future of verification (for hardware and software) is to eliminate verification all together, by synthesizing intent into correct code and tests.
-- https://www.verifai.ai
Debugging is not easy but there should be a lot of training corpus for "bug fixing" from all the commits that have ever existed.
They generated it, and had a compiler compile it, and then had it examine the output. Rinse, repeat.
I am wondering what exactly you are doing. What tasks are you solving using generated Lean?
https://chatgpt.com/share/6941df90-789c-8005-8783-6e1c76cdfc...
I understand that AI can help with writing, coding, analyzing code bases and summarizing other papers, but going through these myself makes a difference, at least for me. I tried ChatGPT 3.5 when I started and while I got a pile of work done, I had to throw it away at some point because I didn't fully understand it. AI could explain to me various parts, but it's different when you create it.
The reason we don't all write code to the level that can operate the Space Shuttle is because we don't have the resources and the projects most of us work on all allow some wiggle room for bugs since lives generally aren't at risk. But we'd all love to check in code that was verifiably bug-free, exploit-free, secure etc if we could get that at a low, low price.
AI makes creating mock objects much easier in some cases, but it still creates a lot of busy work and makes configuration more difficult. And at this point it is often difficult configuration management that causes the issues in the first place. Putting everything in some container doesn't help either; on the contrary.
But it's true. AI is still super narrow and dumb. It doesn't even understand basic prompts.
Look at computer games now - they still don't look real despite almost 30 years since Half-Life 1 started the revolution, I would claim. Damn, I think I ran it on a 166 MHz computer on the lowest details, even.
Yes, it keeps getting better and better, but it still looks super uncanny - at least to me. And it's been basically 30 years of constant improvements. Heck, Roomba is going bankrupt.
I am not saying things don't improve but the hype and AI bubble is insane and the reality doesn't match the expectation and predictions at all.
It is first and foremost about learning a way of thinking. Tools only exist to augment and systematize this thinking into a methodology. There are different levels of "Formal Methods Thinking" starting with informal all the way to completely rigorous. Understanding and using these methods of thinking as the "interface" to specify a problem to an AI agent/LLM is what is important to ensure "correctness by construction to a specification".
Everybody should read this excellent (and accessible) paper On Formal Methods Thinking in Computer Science Education which details the above approach - https://research.tue.nl/en/publications/on-formal-methods-th...
Excerpts:
One may ask What good is FM? Who needs it? Millions of programmers work everyday without it. Many think that FM in a CS curriculum is peddling the idea that Formal Logic (e.g., propositional or predicate logic) is required for everyday programmers, that they need it to write programs that are more likely to be correct, and correspondingly less likely to fail the tests to which they subsequently (of course) must still be subjected. However, this degree of formality is not necessarily needed. What is required of everyday programmers is that, as they write their programs, they think — and code — in a way that respects a correctness-oriented point of view. Assertions can be written informally, in natural language: just the “thinking of what those assertions might be” guides the program-construction process in an astonishingly effective way. What is also required are the engineering principles referred to above. Connecting programs with their specifications through assertions provides training on abstraction, which, in turn, encourages simplicity and focus, helping build more robust, flexible and usable systems.
The answer to “Who needs it?” is that everyday programmers and software developers indeed may not need to know the theory of FM. But what they do need to know is how to practise it, even if with a light touch, benefiting from its precepts. FM theory, which is what explains — to the more mathematically inclined — why FM works, has become confused with the FM practice of using the theory’s results to benefit from what it assures. Any “everyday programmer” can do that...except that most do not.
The paper posits 3 levels of "Formal Methods Thinking" viz.
a) Level 1 (“What’s True Here”). Level 1 of FM thinking is the application of FM in its most basic form. Students develop abilities to understand their programs and reason about their correctness using informal descriptions. By “What’s True Here”, we mean including natural language prose or informal diagrams to describe the properties that are true at different points of a program’s execution rather than the operations that brought them about.
b) Level 2 (Formal Assertions). Level 2 introduces greater precision to Level 1 by teaching students to write assertions that incorporate arithmetic and logical operators to capture FM thinking more rigorously. This may be accompanied by lightweight tools that can be used to test or check that their assertions hold.
c) Level 3 (Full Verification). This level enables students to prove program properties using tools such as a theorem prover, model checker or SMT solver. But in addition to tool-based checking of properties (now written using a formal language), this level can formally emphasise other aspects of system-level correctness, such as structural induction and termination.
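As a small, hedged illustration of Level 2: the invariant is written as an executable assertion with arithmetic and logical operators, without going all the way to a machine-checked proof (the function is invented for illustration).

```rust
fn sum_up_to(n: u64) -> u64 {
    let mut total: u64 = 0;
    for i in 1..=n {
        total += i;
        // Level 2 assertion: after adding i, total is the i-th triangular number
        debug_assert_eq!(total, i * (i + 1) / 2);
    }
    total
}

fn main() {
    assert_eq!(sum_up_to(10), 55);
}
```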
And nothing is stopping the AI from making the unreadable mess more readable in later iterations. It can make it pass the spec first and make it cleaner later. Just like we do!
Take sorting a list for example. The spec is quite short.
- for all xs: xs is a permutation of sort(xs)
- for all xs: sorted(sort(xs))
Where we can define "xs is a permutation of ys" as "for every value x: occurrences(x, xs) = occurrences(x, ys)".
And "sorted(l)" as "forall xs, x, y, ys: (l = xs ++ [x, y] ++ ys) => x <= y".
A straightforward bubble or insertion sort would perhaps be considered as simple or simpler than this spec. But the sorting algorithms in, say, standard libraries, tend to be significantly more complex than this spec.
Problem is, while these will be resolved (in one way or another) or left unresolved - since the user will only test the app on his own device, and that LLM "roll" will not have optimizations for the broad range of other devices - the user is still pretty much left clueless as to what has really happened.
Models theoretically inform you about what they did, why they did it (albeit, largely by using blanket terms and/or phrases unintelligible to the average 'vibe coder') but I feel like most people ignore that completely, and those who don't wouldn't need to use a LLM to code an entirety of an app regardless.
Still, for very simple projects I use at work just chucking something into Gemini and letting it work on it is oftentimes faster and more productive than doing it manually. Plus, if the user is interested in it, it can be used as a relatively good learning tool.
Going full speed ahead building a Rails app from scratch it seemed like I was spending $50/hour, but it was worth it because the App was finished in a weekend instead of weeks.
I can't bear to go in circles with Sonnet when Opus can just one shot it.
Isn't this what TLA+ was meant to deal with?
A lot of libraries document invariants that are either not checked at all, checked only at runtime, or somewhere in between. For instance, the requirement that a collection not be modified during iteration. Or that two regions of memory do not overlap, or that a variable is not modified without owning a lock. These are all things that, in principle, can be formally verified.
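As a small illustration, the first of those invariants is already enforced statically by Rust's borrow checker (a hedged sketch, not tied to any particular library):

```rust
// Mutating a collection while iterating is rejected at compile time,
// something many ecosystems only document or check at runtime.
fn main() {
    let mut items = vec![1, 2, 3];
    for x in &items {
        println!("{x}");
        // items.push(*x); // does not compile: `items` is already borrowed
        //                 // immutably by the iterator
    }
    items.push(4); // fine once the iteration is over
    println!("{:?}", items);
}
```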
No one claims that good type systems prevent buggy software. But, they do seem to improve programmer productivity.
For LLMs, there is an added benefit. If you can formally specify what you want, you can make that specification your entire program. Then have an LLM driven compiler produce a provably correct implementation. This is a novel programming paradigm that has never before been possible; although every "declarative" language is an attempt to approximate it.
Formal verification will eventually lead to good, stable API design.
> Users use something in unexpected ways
> Complex behavior between interconnected systems
It happens when there's no formal verification during the design stage.
Formal verification literally means covering 100% of state changes: for every possible input/output, every execution branch is checked.
Give me a list of all the libraries you work with that don't have some sort of "okay but not that bit" rule in the business logic, or "all of those function are f(src, dst) but the one you use most is f(dst,src) and we can't change it now".
I bet it's a very short list.
Really we need to scrap every piece of software ever written and start again from scratch with all these weirdities written down so we don't do it again, but we never will.
Published by Martin Kleppmann on 08 Dec 2025.
Much has been said about the effects that AI will have on software development, but there is an angle I haven’t seen talked about: I believe that AI will bring formal verification, which for decades has been a bit of a fringe pursuit, into the software engineering mainstream.
Proof assistants and proof-oriented programming languages such as Rocq, Isabelle, Lean, F*, and Agda have been around for a long time. They make it possible to write a formal specification that some piece of code is supposed to satisfy, and then mathematically prove that the code always satisfies that spec (even on weird edge cases that you didn’t think of testing). These tools have been used to develop some large formally verified software systems, such as an operating system kernel, a C compiler, and a cryptographic protocol stack.
At present, formal verification is mostly used by research projects, and it is uncommon for industrial software engineers to use formal methods (even those working on classic high-assurance software such as medical devices and aircraft). The reason is that writing those proofs is both very difficult (requiring PhD-level training) and very laborious.
For example, as of 2009, the formally verified seL4 microkernel consisted of 8,700 lines of C code, but proving it correct required 20 person-years and 200,000 lines of Isabelle code – or 23 lines of proof and half a person-day for every single line of implementation. Moreover, there are maybe a few hundred people in the world (wild guess) who know how to write such proofs, since it requires a lot of arcane knowledge about the proof system.
To put it in simple economic terms: for most systems, the expected cost of bugs is lower than the expected cost of using the proof techniques that would eliminate those bugs. Part of the reason is perhaps that bugs are a negative externality: it’s not the software developer who bears the cost of the bugs, but the users. But even if the software developer were to bear the cost, formal verification is simply very hard and expensive.
At least, that was the case until recently. Now, LLM-based coding assistants are getting pretty good not only at writing implementation code, but also at writing proof scripts in various languages. At present, a human with specialist expertise still has to guide the process, but it’s not hard to extrapolate and imagine that process becoming fully automated in the next few years. And when that happens, it will totally change the economics of formal verification.
If formal verification becomes vastly cheaper, then we can afford to verify much more software. But on top of that, AI also creates a need to formally verify more software: rather than having humans review AI-generated code, I’d much rather have the AI prove to me that the code it has generated is correct. If it can do that, I’ll take AI-generated code over handcrafted code (with all its artisanal bugs) any day!
In fact, I would argue that writing proof scripts is one of the best applications for LLMs. It doesn’t matter if they hallucinate nonsense, because the proof checker will reject any invalid proof and force the AI agent to retry. The proof checker is a small amount of code that is itself verified, making it virtually impossible to sneak an invalid proof past the checker.
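A toy Lean illustration of that dynamic: the first proof is accepted by the kernel, while the hallucinated variant (commented out) is simply rejected with a type error, forcing the agent to retry.

```lean
example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b      -- accepted by the kernel

-- example (a b : Nat) : a + b = b + a := by
--   exact Nat.mul_comm a b   -- rejected: Nat.mul_comm proves a * b = b * a
```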
That doesn’t mean software will suddenly be bug-free. As the verification process itself becomes automated, the challenge will move to correctly defining the specification: that is, how do you know that the properties that were proved are actually the properties that you cared about? Reading and writing such formal specifications still requires expertise and careful thought. But writing the spec is vastly easier and quicker than writing the proof by hand, so this is progress.
I could also imagine AI agents helping with the process of writing the specifications, translating between formal language and natural language. Here there is the potential for subtleties to be lost in translation, but this seems like a manageable risk.
I find it exciting to think that we could just specify in a high-level, declarative way the properties that we want some piece of code to have, and then to vibe code the implementation along with a proof that it satisfies the specification. That would totally change the nature of software development: we wouldn’t even need to bother looking at the AI-generated code any more, just like we don’t bother looking at the machine code generated by a compiler.
In summary: 1. formal verification is about to become vastly cheaper; 2. AI-generated code needs formal verification so that we can skip human review and still be sure that it works; 3. the precision of formal verification counteracts the imprecise and probabilistic nature of LLMs. These three things taken together mean formal verification is likely to go mainstream in the foreseeable future. I suspect that soon the limiting factor will not be the technology, but the culture change required for people to realise that formal methods have become viable in practice.
If you found this post useful, please support me on Patreon so that I can write more like it!
I don't see how AI would help with that even if it made writing code completely free. Even if the AI is writing the spec and fully specifies all possible outcomes, the human reviewing it will glance over the spec and approve it, only to change their mind when confronted with the actual behavior or user reports.
This makes me wonder if LLMs work better in Chinese.
All said, I’m now running all commits through Codex (which is the only thing it’s any good at), and it’s really pretty good at code reviews.
The problem is there is always some chance a coding agent will get stuck and be unable to produce a conforming implementation in a reasonable amount of time. And then you are back in a similar place to what you were with those pre-LLM solutions - needing a human expert to work out how to make further progress.
That's exactly what languages with advanced type systems claim. To be more precise, they claim to eliminate entire classes of bugs. So they reduce bugs, they don't eliminate them completely.
To me it seems they reduce productivity. In fact, for Rust, which seems to match the examples you gave about locks or regions of memory, the common wisdom is that it takes longer to start a project, but one reaps the benefits later thanks to more confidence when refactoring or adding code.
However, even that weaker claim hasn’t been proven.
In my experience, the more information is encoded in the type system, the more effort is required to change code. My initial enthusiasm for the idea of Ada and Spark evaporated when I saw how much ceremony the code required.
That is not novel and every declarative language precisely embodies it.
They really don’t. How did you arrive at such a conclusion?
Why? Has it ever happened like this? Because to me it would seem that if the system verified to work, then it works no matter how API is shaped, so there is no incentive to change it to something better.
Given the spec, formal verification can tell you if your implementation follows the spec. It cannot tell you if the spec is good.
Depends, of course, on what I am writing: a feature, a bugfix, a refactor...?
Bug reporting is already about signal vs noise. Imagine how it will be when we hand the megaphone to bots.
Forcing devs to pre-fix/avoid bugs before the compiler will allow the app means the programs are more correct as a group.
Wrong, incomplete, insufficient, unhelpful, unimpressive, and dumb are all still very possible. But more correct than likely in looser systems.
At one end of the spectrum, the weakest type systems limit the ability of an IDE to do basic maintenance tasks (e.g. refactoring).
At the other end of the spectrum, dependent type and especially sigma types capture arbitrary properties that can be expressed in the logic. But then constructing values in such types requires providing proofs of these properties, and the code and proofs are inextricably mixed in an unmaintainable mess. This does not scale well: you cannot easily add a new proof on top of existing self-sufficient code without temporarily breaking it.
Like other engineering domains, proof engineering has tradeoffs that require expertise to navigate.
I would tend to disagree. All that information encoded in the type system makes explicit what is needed in any case and is otherwise only carried informally in peoples' heads by convention. Maybe in some poorly updated doc or code comment where nobody finds it. Making it explicit and compiler-enforced is a good thing. It might feel like a burden at first, but you're otherwise just closing your eyes and ignoring what can end up important. Changed assumptions are immediately visible. Formal verification just pushes the boundary of that.
To be honest, I believe it makes refactoring/maintenance take longer. Sure, safer, but this is not a one-time only price.
E.g. you decide to optimize this part of the code and only return a reference or change the lifetime - this is an API-breaking change and you have to potentially recursively fix it. Meanwhile GC languages can mostly get away with a local-only change.
Don't get me wrong, in many cases this is more than worthwhile, but I would probably not choose rust for the n+1th backend crud app for this and similar reasons.
Of course it is a hyperbole, but sadly not that large.
I am right now working on an offline api client: https://voiden.md/. I wonder if this can be a feature.
That the spec solves the problem is called validation in my domain and treated explicitly with different methods.
We use formal validation to check for invariants, but also "it must return a value xor an error, but never just hang".
Have you seen large js codebases? Good luck changing anything in it, unless they are really, really well written, which is very rare. (My own js code is often a mess)
When you can change types on the fly somewhere hidden in the code, that is the opposite of clarity for me. And then a lot of effort is required to change something in a proper way that does not lead to more mess.
That's the case for one-off integrations, but the messy part always comes when system goal changes
Let's say formal verification could help to avoid some anti-patterns.
I beg to differ, if a spec is hard to verify, then it's a bad sign.
Things can only happen if you allow them to happen.
In other words, your software may come to a stage where it's no longer applicable, but it never crashes.
Formal verification had little adoption only because it costs 23x your original code and requires "PhD-level training".
This is, in fact, better for LLMs: they are better at carrying information and convention in their KV cache than at having to figure out the actual types by jumping between files and burning tokens in context / risking losing it on compaction (or getting it wrong and having to do a compilation cycle).
If a typed language lets a developer fearlessly build a semantically inconsistent or confusing private API, then LLMs will perform worse with it, even though correctness is more guaranteed.
It’s actually similar to tests in a way: they provide additional confidence in the code, but at the same time ossify it and make some changes potentially more difficult. Interestingly, they also make some changes easier, as long as not too many types/tests have to be adapted.
The theory is entirely correct. If a machine can write provably perfect code there is absolutely no reason to have people write code. The problem is that the 'If' is so big it can be seen from space.
It is quite clear that this industry is mostly driven by hype and fads, not by empirical studies.
Empirical evidence in favor of the claim that static typing and complex type systems reduce bugs or improve productivity is highly inconclusive at best.
I'd still like to hear about the actual mechanism of this happening. Because I personally find it much easier to believe that the moment keeping the formal verification up to date becomes untenable for whatever reason (specs changing too fast, external APIs to use are too baroque, etc) people would rather say "okay, guess we ditch the formal verification and just keep maintaining the integration tests" instead of "let's change everything about the external world so we could keep our methodology".
a) It’s fast to change the code, but now I have failures in some apparently unrelated part of the code base. (Javascript) and fixing that slows me down.
b) It’s slow to change the code because I have to re-encode all the relationships and semantic content in the type system (Rust), but once that’s done it will likely function as expected.
Depending on project, one or the other is preferable.
It's just people's hunches.
Of course, you can declare that the world itself is inherently sinful and imperfect, and is not ready for your beautiful theories but seriously.
When you say things like "eliminate a class of bugs", that is played out in the abstraction: an infinite subset of that infinity of machines is eliminated, leaving an infinity.
How you then sample from that infinity in order to have something which fits on your actual machine is a separate question.
I compared it to a statically typed language with a GC - where the runtime takes care of a property that Rust has to do statically, requiring more complexity.
As an example, I currently mostly write GUI applications for mobile and desktop as a solo dev. 90% of my time is spent on figuring out API calls and arranging layouts. Most of the data I deal with are strings with their own validation and formatting rules that are complicated and at the same time usually need to be permissive. Even at the backend, all the data is in the end converted to strings and integers when it is put into a database. Over-the-wire serialization also discards most typing (although I prefer protocol buffers to alleviate this problem a bit).
Strong typing can be used in between those steps but the added complexity from data conversions introduces additional sources of error, so in the end the advantages are mostly nullified.
i see we are both familiar with haskellers (friendly joke!)
I am not an expert on this, but the worst API I've seen is those with hidden states.
e.g. a .toggle() API. Call it an odd number of times, it goes to one state; call it an even number of times, it goes back.
And there are "call A before you call B" types of APIs, where the client has to keep a strict call order (which is itself a state machine of some kind).
To me, this has been one of the biggest advantages of both tests and types. They provide confidence to make changes without needing to be scared of unintended breakages.
this is exactly where a good type system helps: you have an unvalidated string and a validated string which you make incompatible at the type level, thus eliminating a whole class of possible mistakes. same with object ids, etc.
don't need haskell for this, either: https://brightinventions.pl/blog/branding-flavoring/
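The same trick sketched in Rust rather than TypeScript, using a newtype with a private field (the `Email` type and its toy validation rule are invented for illustration):

```rust
// The only way to obtain an `Email` is through `parse`, so raw and
// validated strings cannot be confused anywhere else in the program.
pub struct Email(String); // field stays private to this module

impl Email {
    pub fn parse(raw: &str) -> Result<Email, String> {
        if raw.contains('@') {
            Ok(Email(raw.to_owned()))
        } else {
            Err(format!("not an email address: {raw}"))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

// Any function taking `&Email` can rely on validation having already happened.
pub fn send_welcome(to: &Email) {
    println!("sending welcome mail to {}", to.as_str());
}

fn main() {
    let email = Email::parse("ada@example.org").expect("should validate");
    send_welcome(&email);
}
```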
> e.g. a .toggle() API. Call it an odd number of times, it goes to one state; call it an even number of times, it goes back.
This is literally a dumb light switch. If you have trouble proving that, starting from lights off, flicking a simple switch twice will still keep lights off then, well, I have bad news to tell you about the feasibility of using the formal methods for anything more complex than a dumb light switch. Because the rest of the world is a very complex and stateful place.
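Indeed, the light-switch version really is a two-line Lean proof (with toggle defined here simply as boolean negation):

```lean
def toggle (b : Bool) : Bool := !b

-- flicking the switch twice returns you to the starting state
example (s : Bool) : toggle (toggle s) = s := by
  cases s <;> rfl
```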
> (which itself is a state machine of some kind)
Yes? That's pretty much the raison d'être of the formal methods: for anything pure and immutable, normal intuition is usually more than enough; it's tracking the paths through enormous configuration spaces that our intuition has problem with. If the formal methods can't help with that with comparable amount of effort, then they are just not worth it.