Woohoo, we're almost all of the way there! Now all you need to do is ensure that the formal specification you are proving that the software implements is a complete and accurate description of the requirements (which are likely incomplete and contradictory) as they exist in the minds of the set of stakeholders affected by your software.
(sarcasm off).
Proofs never took off not just because they are difficult, but because most software engineering moved away from waterfall development. Long formal specifications were abandoned because those who wrote them often misunderstood what the user wanted, or the user didn't know what they wanted. Instead, agile development took over and software evolved more iteratively and rapidly to meet user needs.
The author seems to make their prediction based on the flawed assumption that difficulty in writing proofs was the only reason we avoided them, when in reality the real challenge was understanding what the user actually wanted.
The hard part isn’t getting an LLM to grind out proofs, it’s getting organizations to invest in specs and models at all. Right now we barely write good invariants in comments. If AI makes it cheap to iteratively propose and refine specs (“here’s what I think this service guarantees; what did I miss?”) that’s the moment things tip: verification stops being an academic side-quest and becomes another refactoring tool you reach for when changing code, like tests or linters, instead of a separate capital-P “formal methods project”.
This is - in my opinion - one of those. If an AI were able to formally verify with the same rigor as a system designed specifically for that purpose, I think that would require AGI rather than a simpler version of it. The task is complex enough that present-day AIs would generate as much noise as they would signal.
Another take is that LLMs don't have enough conceptual understanding to actually create proofs for the correctness of code.
Personally I believe this kind of work is predicated on more ergonomic proof systems. And those happen to be valuable even without LLMs. Moreover, the built-in guarantees of Rust seem like a great start for creating more ergonomic proof systems. Here I am both in awe of Kani and disappointed by it. The awe is for the good work put into making things more ergonomic. The disappointment is the use of bounded model checking for formal analysis, which can barely make use of the exclusion of mutable aliasing. Kani, but with equational reasoning: that's the way forward. Equational reasoning was long held back by needing to do a whole lot of pointer work to rule out mutable aliasing. Now you can lean on the borrow checker for that!
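For context, a minimal sketch of what a Kani harness looks like today, assuming a made-up function under test:

```rust
// Hedged sketch of a Kani proof harness: `kani::any()` gives a symbolic
// value, and the bounded model checker explores every possible input
// rather than a random sample. `clamp_to_percent` is invented for
// illustration.
fn clamp_to_percent(x: u32) -> u32 {
    if x > 100 { 100 } else { x }
}

#[cfg(kani)]
#[kani::proof]
fn clamp_never_exceeds_100() {
    let x: u32 = kani::any(); // symbolic input, not a concrete test value
    assert!(clamp_to_percent(x) <= 100);
}
```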
Might as well just learn Agda or Lean. There are good books out there. It’s not as hard as the author suggests. Hard, yes, but there’s no Royal road.
Sure, formal verification might give stronger guarantees about various levels of the stack, but I don’t think most of us care about having such strong guarantees now and I don’t think AI really introduces a need for new guarantees at that level.
At the most basic level this means making sure they can run commands to execute the code - easiest with languages like Python, with HTML+JavaScript you need to remind them that Playwright exists and they should use it.
The next step up from that is a good automated test suite.
Then we get into quality of code/life improvement tools - automatic code formatters, linters, fuzzing tools etc.
Debuggers are good too. These tend to be less coding-agent friendly due to them often having directly interactive interfaces, but agents can increasingly use them - and there are other options that are a better fit as well.
I'd put formal verification tools like the ones mentioned by Martin on this spectrum too. They're potentially a fantastic unlock for agents - they're effectively just niche programming languages, and models are really good at even niche languages these days.
If you're not finding any value in coding agents but you've also not invested in execution and automated testing environment features, that's probably why.
There is a problem with this argument similar to one made about imagining the future possibilities of vibe coding [1]: once we imagine AI to do this task, i.e. automatically prove software correct, we can just as easily imagine it to not have to do it (for us) in the first place. If AI can do the hardest things, those it is currently not very good at doing, there's no reason to assume it won't be able to do easier things/things it currently does better. In particular, we won't need it to verify our software for us, because there's no reason to believe that it won't be able to come up with what software we need better than us in the first place. It will come up with the idea, implement it, and then decide to what extent to verify it. Formal verification, or programming for that matter, will not become mainstream (as a human activity) but go extinct.
Indeed, it is far easier for humans to design and implement a proof assistant than it is to use one to verify a substantial computer program. A machine that will be able to effectively use a proof checker, will surely be able to come up with a novel proof checker on its own.
I agree it's not hard to extrapolate technological capabilities, but such extrapolation has a name: science fiction. Without a clear understanding of what makes things easier or harder for AI (in the near future), any prediction is based on arbitrary guesses that AI will be able to do X yet not Y. We can imagine any conceivable capability or limitation we like. In science fiction we see technology that's both capable and limited in some rather arbitrary ways.
It's like trying to imagine what problems computers can and cannot efficiently solve before discovering the notion of computational complexity classes.
At worst, we eventually create a sentient AI that can use our trust of the generated code to jailbreak and distribute itself like an unstoppable virus, and we become its pets, or are wiped out.
Personally, all my vibe coding includes a prompt to add comments to explain the code, and I review every line.
A formally verified system is easier for the model to check and consequently easier for it to program to. I suppose the question is whether or not formal methods are sufficiently tractable that they actually do help the LLM be able to finish the job before it runs out of its context.
Regardless, I often use coding assistants in that manner:
1. First, I use the assistant to come up with the success condition program (a toy checker is sketched just after this list)
2. Then I use the assistant to solve the original problem by asking it to check with the success condition program
3. Then I check the solution myself
It's not rocket science, and is just the same approach we've always taken to problem-solving, but it is nice that modern tools can also work in this way. With this, I can usually use Opus or GPT-5.2 in unattended mode.
0: https://wiki.roshangeorge.dev/w/Blog/2025-12-11/LLMs_Excel_A...
I like the idea that even languages like Rust and Haskell may become more accessible. Learn them, of course, but an LLM can steer you out of getting stuck.
So let's say that, similarly, there are programming tasks that are easier and harder for agents to do well. If we know that a task is in the easy category, of course having tests is good, but since we already know that an agent does it well, the test isn't the crucial aspect. On the other hand, for a hard task, all the testing in the world may not be enough for the agent to succeed.
Longer term, I think it's more important to understand what's hard and what's easy for agents.
One is that modern formal systems like Lean are quite concise and flexible compared to what you're probably expecting. Lean provides the primitives to formalize all kinds of things, not just math or software. In fact, I really believe that basically _any_ question with a rigorous yes-or-no answer can have its semantics formalized into a kind of "theory". The proofs are often close to how an English proof might look, thanks to high-level tactics involving automation and the power of induction.
Another is that proof-checking solves what are (in my opinion) two of the biggest challenges in modern AI: reward specification and grounding. You can run your solver for a long time, and if it finds an answer, you can trust that without worrying about reward hacking or hallucination, even if the answer is much too complicated for you to understand. You can do RL for an unlimited time for the same reason. And Lean also gives you a 'grounded' model of the objects in your theory, so that the model can manipulate them directly.
In combination, these two properties are extremely powerful. Lean lets us specify an unhackable reward for an extremely diverse set of problems across math, science, and engineering, as well as a common environment to do RL in. It also lets us accept answers to questions without checking them ourselves, which "closes the loop" on tools which generate code or circuitry.
I plan to write a much more in-depth blog post on these ideas at some point, but for now I'm interested to see where the discussion here goes.
[1] https://leandojo.org/leandojo.html [2] https://aristotle.harmonic.fun/
What will happen instead is a more general application of AI systems to verifying software correctness, which should lead to more reliable software. The bottleneck in software quality is in specifying what the behavior needs to be, not in validating conformance to a known specification.
> 2. Then I use the assistant to solve the original problem by asking it to check with the success condition program
This sounds a lot like Test-Driven Development. :)
If you really need an answer and you really need the LLM to give it to you, then ask it to write a (Python?) script to do the calculation you need, execute it, and give you the answer.
They are not. The power of rich and succinct specification languages (like TLA+) comes from the ability to succinctly express things that cannot be efficiently computed, or at all. That is because a description of what a program does is necessarily at a higher level of abstraction than the program (i.e. there are many possible programs or even magical oracles that can do what a program does).
To give a contrived example, let's say you want to state that a particular computation terminates. To do it in a clear and concise manner, you want to express the property of termination (and prove that the computation satisfies it), but that property is not, itself, computable. There are some ways around it, but as a rule, a specification language is more convenient when it can describe things that cannot be executed.
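A sketch of that contrived example in Lean 4, assuming a state type `σ` and a small-step function `step` (both invented here):

```lean
-- `step` returns `none` when the machine halts; `run` applies it at most
-- n times.
def run {σ : Type} (step : σ → Option σ) : Nat → σ → Option σ
  | 0,     s => some s
  | n + 1, s =>
    match step s with
    | none    => none
    | some s' => run step n s'

-- "The computation starting from `s` terminates" is easy to *state*,
-- even though no algorithm can decide it for an arbitrary `step`.
def Terminates {σ : Type} (step : σ → Option σ) (s : σ) : Prop :=
  ∃ n, run step n s = none
```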
People don't verify those because it's hard, not for lack of value.
It's hard even for a human who understands the full business, social and political context to disambiguate the meaning and intent of the spec; to try to express it mathematically would be an absolute nightmare... and extremely unwise. You would literally need some kind of super intelligence... And the amount of stream-of-thought tokens which would have to be generated to arrive at a correct, consistent, unambiguous formal spec is probably going to cost more than just hiring top software engineers to build the thing with 100% test coverage of all main cases and edge cases.
Worst part is: after you do all the expensive work of formal verification, you end up proving the 'correctness' of a solution that the client doesn't want.
The refactoring required will invalidate the entire proof from the beginning. We haven't even figured out the optimal way to formally architect software that is resilient to requirement changes; in fact, the industry is REALLY BAD at this. Almost nobody is even thinking about it. I am, but I sometimes feel like I may be the only person in the world who cares about designing optimal architectures to minimize line count and refactoring diff size. We'd have to solve this problem first before we even think about formal verification of 'most software'.
Without a hypothetical super-intelligence which understands everything about the world, the risk of misinterpreting any given 'typical' requirement is almost 100%... And once we have such a super-intelligence, we won't need formal verification, because it will be able to code perfectly on the first attempt; no need to verify.
And then there's the fact that most software can tolerate bugs... If operationally important big tech software which literally has millions of concurrent users can tolerate bugs, then most software can tolerate bugs.
> The next step up from that is a good automated test suite.
And if we're going for a powerful type system, then we can really leverage the power of property tests which are currently grossly underused. Property tests are a perfect match for LLMs because they allow the human to create a small number of tests that cover a very wide surface of possible errors.
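As a hedged sketch of that idea, here is the same style of property test written with Rust's proptest crate (a QuickCheck analogue); `normalize_whitespace` is a made-up function under test:

```rust
use proptest::prelude::*;

// Hypothetical function under test: collapse runs of whitespace.
fn normalize_whitespace(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

proptest! {
    // One property covers a wide surface of inputs: normalizing twice
    // gives the same result as normalizing once (idempotence).
    #[test]
    fn normalization_is_idempotent(s in ".*") {
        let once = normalize_whitespace(&s);
        prop_assert_eq!(normalize_whitespace(&once), once);
    }
}
```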
The "thinking in types" approach to software development in Haskell allows the human user to keep at a level of abstraction that still allows them to reason about critical parts of the program while not having to worry about the more tedious implementation parts.
Given how much interest there has been in using LLMs to improve Lean code for formal proofs in the math community, maybe there's a world where we make use of an even more powerful type system than Haskell. If LLMs with the right language can help prove complex mathematical theorems, then it should certainly be possible to write better software with them.
I don’t go into Claude without everything already set up. Codex helps me curate the plan and curate the issue tracker (one instance). Claude gets a command to fire up into context, grab an issue and implement it, and then Codex and Gemini review independently.
I’ve instructed Claude to go back and forth for as many rounds as it takes. Then I close the session (\new) and do it again. These are all the latest frontier models.
This is incredibly expensive, but it’s also the most reliable method I’ve found to get high-quality progress — I suspect it has something to do with ameliorating self-bias, and improving the diversity of viewpoints on the code.
I suspect rigorous static tooling is yet another layer to improve the distribution over program changes, but I do think that there is a big gap in folk knowledge already between “vanilla agents” and something fancy with just raw agents, and I’m not sure if just the addition of more rigorous static tooling (beyond the compiler) closes it.
Or, to put it the other way round, what kind of tech leads would we be if we told our junior engineers „Well, here’s the codebase, that’s all I‘ll give you. No debuggers, linters, or test runners for you. Using a browser on your frontend implementation? Nice try buddy! Now good luck getting those requirements implemented!“
Do you really think it is going to be easier for the average developer to write a specification for their program that does not terminate
vs
Giving them a framework or a language that does not have a for loop?
Edit: If by formal verification you mean type checking, then I very much agree.
You really have to be able to reduce your models to: "at some point in the future, this will happen" or "it will always be true from now on".
Have probabilistic outcomes? Or even floats [0]? It becomes challenging, and strings are a mess.
> Note there is not a float type. Floats have complex semantics that are extremely hard to represent. Usually you can abstract them out, but if you absolutely need floats then TLA+ is the wrong tool for the job.
TLA+ works for the problems it is suited to; try to extend past that and it simply fails.
Concrete Semantics [1] has a little example in §9.2.
A bidirectional bridge that spans multiple representations from informal spec to semiformal spec to code seems ideal. You change the most relevant layer that you're interested in and then see updates propagating semi-automatically to other layers. I'd say the jury is out on whether this uses extra tokens or saves them, but a few things we do know. Chain of code works better than chain of thought, and chain-of-spec seems like a simple generalization. Markdown-based planning and task-tracking agent workflows work better than just YOLOing one-shot changes everywhere, and so intermediate representations are useful.
It seems to me that you can't actually get rid of specs, right? So to shoot down the idea of productive cooperation between formal methods and LLM-style AI, one really must successfully argue that informal specs are inherently better than formal ones. Or even stronger: having only informal specs is better than having informal+formal.
My bet is on refinement types. Dafny fits that bill quite well, it's simple, it offers refinement types, and verification is automated with SAT/SMT.
In fact, there are already serious industrial efforts to generate Dafny using LLMs.
Besides, some of the largest verification efforts have been achieved with this language [1].
[1] https://www.andrew.cmu.edu/user/bparno/papers/ironfleet.pdf
It even lets you separate implementation from specification.
Yeah, it's gonna be fun waiting for compilation cycles when those models "reason" with themselves about a semicolon. I guess we just need more compute...
Strong type systems provide partial validation, which is helping quite a lot IMO. The better we can model the state and the more constraints we can define in the model, the closer we get to writing "self-proven" code. I assume formal proofs do far more than just ensure the validity of the model, but I believe similar approaches can be integrated into mainstream programs as well.
> At present, a human with specialist expertise still has to guide the process, but it’s not hard to extrapolate and imagine that process becoming fully automated in the next few years.
We already had some software engineers here on HN explain that they don't make a large use of LLMs because the hard part of their job isn't to actually write the code, but to understand the requirements behind it. And formal verification is all about requirements.
> Reading and writing such formal specifications still requires expertise and careful thought. But writing the spec is vastly easier and quicker than writing the proof by hand, so this is progress.
Writing the spec is easier once you are confident about having fully understood the requirements, and here we get back to the above issue. Plus, it is already the case that you don't write the proof by hand, this is what the prover either assists you with or does in full.
> I find it exciting to think that we could just specify in a high-level, declarative way the properties that we want some piece of code to have, and then to vibe code the implementation along with a proof that it satisfies the specification.
And here is where I think problems will arise: moving from the high level specification to the formal one that is the one actually getting formally verified.
Of course, this would still be better than having no verification at all. But it is important to keep in mind that, with these additional levels of abstractions, you will likely end up with a weaker form of formal verification, so to speak. Maybe it is worth it to still verify some high assurance software "the old way" and leave this only for the cases where additional verification is nice to have but not a matter of life or death.
LLMs enable faster code bootstrapping and experimentation not only for vibe coders, but also for researchers, many of whom are not really good coders, btw. It may well be that we will see wild new verification tools soon that come out of quick iteration with LLMs.
For example, I recently wrote an experimental distributed bug finder for TLA+ with Claude in about three weeks. A couple of years ago that effort would require three months and a team of three people.
The quest for purity is some fountain of youth nonsense that distracts a lot of otherwise brilliant engineers.
Ask the AI to make a program that consumes a program and determine if it halts.
What it will make go mainstream, and in fact has already started to, is “ChatGPT verified it so it must be OK.”
Fans of LLMs brag about speed and productivity.
Most popular programming languages are optimized for human convenience, not for correctness! Even most of the popular typed languages (Java/Kotlin/Go/...) have a wide surface area for misuse that is not caught at compile time.
Case in point: In my experience, LLMs produce correct code way more regularly for Rust than for Js/Ts/Python/... . Rust has a very strict type system. Both the standard library and the whole library ecosystem lean towards strict APIs that enforce correctness, prevent invalid operations, and push towards handling or at least propagating errors.
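As a small, hedged illustration of what that strictness looks like (the `parse_port` function is invented for illustration):

```rust
// Parsing returns a Result; the compiler will not let you silently
// ignore the failure case the way a loosely typed language might.
use std::num::ParseIntError;

fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    s.trim().parse::<u16>() // out-of-range or non-numeric input is an Err, not a NaN
}

fn main() {
    match parse_port("8080") {
        Ok(port) => println!("listening on {port}"),
        Err(e) => eprintln!("bad port: {e}"), // forced to handle or propagate
    }
}
```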
The AIs will often write code that won't compile initially, but after a few iterations with the compiler the result is often correct. Strong typing also makes it much easier to validate the output when reviewing.
With AIs being able to do more and more of the implementation, the "feel-good" factor of languages will become much less relevant. Iteration speed is not so important when parallel AI agents do the "grunt work". I'd much rather wait 10 minutes for solid output rather than 2 minutes for something fragile.
We can finally move the industry away from wild-west languages like Python/JS and towards more rigorous standards.
Rust is probably the sweet spot at the moment, thanks to it being semi-popular with a reasonably active ecosystem; sadly, I don't think the ideal language exists yet.
What we really want is a language with a very strict, comprehensive type system with dependent types, maybe linear types, structured concurrency, and a built-in formal proof system.
Something like Ada/SPARK, but more modern.
2020: I don't care how it performs
2030: I don't care why it performs
2040: I don't care what it performs
One example could be a low-level programming language for a given PLC manufacturer, where the prompt comes from a context-aware domain expert, and the LLM is able to output proper DSL code for that PLC. Think of "make sure this motor spins at 300rpm while this other task takes place"-type prompts.
The LLM essentially needs to juggle between understanding those highly-contextual clues, and writing DSL code that very tightly fits the DSL definition.
We're still years away from this being thoroughly reliable for all contexts, but it's interesting research nonetheless. Happy to see that someone also agrees with my sentiment ;-)
So, the job is not done for humans yet.
Some students whose work I had to fix (pre-AI) had code crashing all over the place, due to !'s instead of ?'s followed by guard let … {} and if let … {}.
As for those who use LLMs to impersonate humans, which is the kind of verification (verify that this solution that is purported to be built by a human actually works), I have no doubt we will rapidly evolve norms that make us more resistant to them. The cost of fraud and anti-fraud is not zero, but I suspect it will be much less than we fear.
It's an example of a simple task. How often are you relying on LLMs to complete simple tasks?
I think it's more nuanced than that. As a human, I can manually test code in ways an AI still can't. Sure, maybe it's better to have automated test suites, but I have other options too.
Source code generation is possible due to a large training set and the effort put into reinforcing better outcomes.
I suspect debugging is not that straightforward to LLM'ize.
It's a non-sequential interaction - when something happens, it hasn't necessarily caused the problem, and the timeline may be shuffled. An LLM would need tons of examples where something happens in a debugger or in the logs, and learn to associate it with another abstraction.
I was debugging something in gdb recently and it was a pretty challenging bug. Out of interest I tried chatgpt, and it was hopeless - try this, add this print etc. That's not how you debug multi-threaded and async code. When I found the root cause, I was analyzing how I did it and where did I learn that specific combination of techniques, each individually well documented, but never in combination - it was learning from other people and my own experience.
We valued automated tests and linters and fuzzers and documentation before AI, and that's because it serves the same purpose.
TL;DR: We don't need to be radically agnostic about the capabilities of AI-- we have enough experience already with the software value chain (with and without AI) for formal verification to be an appealing next step, for the reasons this author lays out.
Even if you have never written formal proofs but are intrigued by them, try asking a coding agent to do some basic verification. You will not regret it.
Formal proof is not just about proving stuff, it's also about disproving stuff, by finding counterexamples. Once you have stated your property, you can let quickcheck/plausible attack it, possibly helped by a suitable generator which does not have to be random: it can be steered by an LLM as well.
Even further, I'm toying with the idea of including LLMs inside the formalization itself. There is an old and rich idea in the domain of formal proof, that of certificates: rather than proving that the algorithm that produces a result is correct, just compute a checkable certificate with untrusted code and verify it is correct. Checkable certificates can be produced by unverified programs, humans, and now LLMs. Properties, invariants, can all be "guessed" without harm by an LLM and would still have to pass a checker. We have truly entered an age of oracles. It's not halting-problem-oracle territory of course, but it sometimes feels pretty close for practical purposes. LLMs are already better at math than most of us and certainly than me, and so any problem I could plausibly solve on my own, they will do faster without my having to wonder if there is a subtle bug in the proof. I still need to look at the definitions and statements, of course, but my role has changed from finding to checking. Exploring the space of possible solutions is now mostly done better and faster by LLMs. And you can run as many in parallel as you can keep up with, in attention and in time (and money).
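A minimal sketch of the certificate pattern, with graph colouring standing in as the problem (the names and toy checker are made up for illustration):

```rust
// An untrusted oracle (an unverified solver, a human, or an LLM) proposes
// a colouring; only this small checker needs to be trusted or formally
// verified.
type Edge = (usize, usize);

fn check_colouring(edges: &[Edge], colouring: &[u32]) -> bool {
    edges.iter().all(|&(u, v)| {
        // both endpoints must be coloured, and their colours must differ
        u < colouring.len() && v < colouring.len() && colouring[u] != colouring[v]
    })
}

fn main() {
    let triangle = [(0, 1), (1, 2), (2, 0)];
    let proposed = [0, 1, 2]; // e.g. guessed by an LLM
    assert!(check_colouring(&triangle, &proposed));
}
```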
If anyone else is as excited about all this as I am, feel free to reach out in comments, I'd love to hear about people's projects !
* A user interface is confusing, or the English around it is unclear
* An API you rely on changes, is deprecated, etc.
* Users use something in unexpected ways
* Updates forced by vendors or open source projects cause things to break
* The customer isn't clear what they want
* Complex behavior between interconnected systems, out of the purview of the formal language (OS + database + network + developer + VM + browser + user + web server)
For some mathematically pure task, sure, it's great. Or a low-level library like a regular expression parser or a compression codec. But I don't think that represents a lot of what most of us are tasked with, and those low-level "mathematically pure" libraries are generally pretty well handled by now. Reducing the problem to "ya just create a specification to formally verify" doesn't move the needle enough for me.
When it comes to real-world, pragmatic, boots-on-the-ground engineering and design, we are so far from even knowing the right questions to ask. I just don't buy it that we'd see huge mainstream productivity changes even if we had access to a crystal ball.
It's hilarious how close we're getting to The Hitchhiker's Guide to the Galaxy, though. We're almost at the phase where we ask what the question is supposed to be.
I wouldn't. An unreadable mess that has been formally verified is worse than a clear easy to understand piece of code that has not.
Code is rarely written from scratch. As long as you want humans to maintain code, readability is crucial. Code is changed magnitudes more often than written initially.
Of course, if you don't want humans to maintain the code then this point is moot. Though he gets to the catch later on: then we need to write (and maintain and debug and reason about) the specification instead. We will just have kicked the can down the road.
How big is the effort of writing a specification for an application versus implementing the application in the traditional way? Can someone with more knowledge chime in here please?
More likely is the rise of test driven development, or spec driven development.
However, I still don't believe in vibe-coding full programs. There are too many layers in software systems; even when the program core is fully verified, the programmer must know about the other layers.
If you are an Android app developer, you need to know what phones people commonly use, what kind of performance they have, how apps are deployed through the Google Play Store, how to manage a wide variety of app versions, and how to handle issues when storage is low, the network is offline, the battery is low, or the CPU is in a lower power state.
In pre-silicon verification, formal has been used successfully for decades, but is not a replacement for simulation based verification.
The future of verification (for hardware and software) is to eliminate verification all together, by synthesizing intent into correct code and tests.
-- https://www.verifai.ai
Debugging is not easy but there should be a lot of training corpus for "bug fixing" from all the commits that have ever existed.
They generated it, and had a compiler compile it, and then had it examine the output. Rinse, repeat.
I am wondering what exactly you are doing. What tasks are you solving using generated Lean?
https://chatgpt.com/share/6941df90-789c-8005-8783-6e1c76cdfc...
I understand that AI can help with writing, coding, analyzing code bases and summarizing other papers, but going through these myself makes a difference, at least for me. I tried ChatGPT 3.5 when I started and while I got a pile of work done, I had to throw it away at some point because I didn't fully understand it. AI could explain to me various parts, but it's different when you create it.
The reason we don't all write code to the level that can operate the Space Shuttle is because we don't have the resources and the projects most of us work on all allow some wiggle room for bugs since lives generally aren't at risk. But we'd all love to check in code that was verifiably bug-free, exploit-free, secure etc if we could get that at a low, low price.
AI makes creating mock objects much easier in some cases, but it still creates a lot of busy work and makes configuration more difficult. And at this point it is often difficult configuration management that causes the issues in the first place. Putting everything in some container doesn't help either; on the contrary.
But it's true. AI is still super narrow and dumb. It doesn't even understand basic prompts.
Look at computer games now - they still don't look real despite almost 30 years since Half-Life 1 started the revolution, I would claim. Damn, I think I ran it on a 166 MHz computer on the lowest details, even.
Yes, it keeps getting better and better, but it still looks super uncanny - at least to me. And it's been basically 30 years of constant improvements. Heck, Roomba is going bankrupt.
I am not saying things don't improve but the hype and AI bubble is insane and the reality doesn't match the expectation and predictions at all.
It is first and foremost about learning a way of thinking. Tools only exist to augment and systematize this thinking into a methodology. There are different levels of "Formal Methods Thinking" starting with informal all the way to completely rigorous. Understanding and using these methods of thinking as the "interface" to specify a problem to an AI agent/LLM is what is important to ensure "correctness by construction to a specification".
Everybody should read this excellent (and accessible) paper On Formal Methods Thinking in Computer Science Education which details the above approach - https://research.tue.nl/en/publications/on-formal-methods-th...
Excerpts:
One may ask What good is FM? Who needs it? Millions of programmers work everyday without it. Many think that FM in a CS curriculum is peddling the idea that Formal Logic (e.g., propositional or predicate logic) is required for everyday programmers, that they need it to write programs that are more likely to be correct, and correspondingly less likely to fail the tests to which they subsequently (of course) must still be subjected. However, this degree of formality is not necessarily needed. What is required of everyday programmers is that, as they write their programs, they think — and code — in a way that respects a correctness-oriented point of view. Assertions can be written informally, in natural language: just the “thinking of what those assertions might be” guides the program-construction process in an astonishingly effective way. What is also required are the engineering principles referred to above. Connecting programs with their specifications through assertions provides training on abstraction, which, in turn, encourages simplicity and focus, helping build more robust, flexible and usable systems.
The answer to “Who needs it?” is that everyday programmers and software developers indeed may not need to know the theory of FM. But what they do need to know is how to practise it, even if with a light touch, benefiting from its precepts. FM theory, which is what explains — to the more mathematically inclined — why FM works, has become confused with the FM practice of using the theory’s results to benefit from what it assures. Any “everyday programmer” can do that...except that most do not.
The paper posits 3 levels of "Formal Methods Thinking" viz.
a) Level 1 (“What’s True Here”). Level 1 of FM thinking is the application of FM in its most basic form. Students develop abilities to understand their programs and reason about their correctness using informal descriptions. By “What’s True Here”, we mean including natural language prose or informal diagrams to describe the properties that are true at different points of a program’s execution rather than the operations that brought them about.
b) Level 2 (Formal Assertions). Level 2 introduces greater precision to Level 1 by teaching students to write assertions that incorporate arithmetic and logical operators to capture FM thinking more rigorously. This may be accompanied by lightweight tools that can be used to test or check that their assertions hold.
c) Level 3 (Full Verification). This level enables students to prove program properties using tools such as a theorem prover, model checker or SMT solver. But in addition to tool-based checking of properties (now written using a formal language), this level can formally emphasise other aspects of system-level correctness, such as structural induction and termination.
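As a small, hedged illustration of Level 2: the invariant is written as an executable assertion with arithmetic and logical operators, without going all the way to a machine-checked proof (the function is invented for illustration).

```rust
fn sum_up_to(n: u64) -> u64 {
    let mut total: u64 = 0;
    for i in 1..=n {
        total += i;
        // Level 2 assertion: after adding i, total is the i-th triangular number
        debug_assert_eq!(total, i * (i + 1) / 2);
    }
    total
}

fn main() {
    assert_eq!(sum_up_to(10), 55);
}
```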
And nothing is stopping the AI from making the unreadable mess more readable in later iterations. It can make it pass the spec first and make it cleaner later. Just like we do!
Take sorting a list for example. The spec is quite short.
- for all xs: xs is a permutation of sort(xs)
- for all xs: sorted(sort(xs))
Where we can define "xs is a permutation of ys" as "for every value x: occurrences(x, xs) = occurrences(x, ys)".
And "sorted(l)" as "forall xs, x, y, ys: (l = xs ++ [x, y] ++ ys) => x <= y".
A straightforward bubble or insertion sort would perhaps be considered as simple or simpler than this spec. But the sorting algorithms in, say, standard libraries, tend to be significantly more complex than this spec.
Problem is, while these will be resolved (in one way or another) or left unresolved - since the user will only test the app on his own device, and that LLM "roll" will not have optimizations for the broad range of other devices - the user is still pretty much left clueless as to what has really happened.
Models theoretically inform you about what they did, why they did it (albeit, largely by using blanket terms and/or phrases unintelligible to the average 'vibe coder') but I feel like most people ignore that completely, and those who don't wouldn't need to use a LLM to code an entirety of an app regardless.
Still, for very simple projects I use at work just chucking something into Gemini and letting it work on it is oftentimes faster and more productive than doing it manually. Plus, if the user is interested in it, it can be used as a relatively good learning tool.
Going full speed ahead building a Rails app from scratch it seemed like I was spending $50/hour, but it was worth it because the App was finished in a weekend instead of weeks.
I can't bear to go in circles with Sonnet when Opus can just one shot it.
Isn't this what TLA+ was meant to deal with?
A lot of libraries document invariants that are either not checked at all, checked only at runtime, or somewhere in between. For instance, the requirement that a collection not be modified during iteration. Or that two regions of memory do not overlap, or that a variable is not modified without owning a lock. These are all things that, in principle, can be formally verified.
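As a small illustration, the first of those invariants is already enforced statically by Rust's borrow checker (a hedged sketch, not tied to any particular library):

```rust
// Mutating a collection while iterating is rejected at compile time,
// something many ecosystems only document or check at runtime.
fn main() {
    let mut items = vec![1, 2, 3];
    for x in &items {
        println!("{x}");
        // items.push(*x); // does not compile: `items` is already borrowed
        //                 // immutably by the iterator
    }
    items.push(4); // fine once the iteration is over
    println!("{:?}", items);
}
```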
No one claims that good type systems prevent buggy software. But, they do seem to improve programmer productivity.
For LLMs, there is an added benefit. If you can formally specify what you want, you can make that specification your entire program. Then have an LLM driven compiler produce a provably correct implementation. This is a novel programming paradigm that has never before been possible; although every "declarative" language is an attempt to approximate it.
Formal verification will eventually lead to good, stable API design.
> Users use something in unexpected ways
> Complex behavior between interconnected systems
It happens when there's no formal verification during the design stage.
Formal verification literally means covering 100% of state changes: for every possible input/output, every execution branch is checked.
Give me a list of all the libraries you work with that don't have some sort of "okay but not that bit" rule in the business logic, or "all of those function are f(src, dst) but the one you use most is f(dst,src) and we can't change it now".
I bet it's a very short list.
Really we need to scrap every piece of software ever written and start again from scratch with all these weirdities written down so we don't do it again, but we never will.
Published by Martin Kleppmann on 08 Dec 2025.
Much has been said about the effects that AI will have on software development, but there is an angle I haven’t seen talked about: I believe that AI will bring formal verification, which for decades has been a bit of a fringe pursuit, into the software engineering mainstream.
Proof assistants and proof-oriented programming languages such as Rocq, Isabelle, Lean, F*, and Agda have been around for a long time. They make it possible to write a formal specification that some piece of code is supposed to satisfy, and then mathematically prove that the code always satisfies that spec (even on weird edge cases that you didn’t think of testing). These tools have been used to develop some large formally verified software systems, such as an operating system kernel, a C compiler, and a cryptographic protocol stack.
At present, formal verification is mostly used by research projects, and it is uncommon for industrial software engineers to use formal methods (even those working on classic high-assurance software such as medical devices and aircraft). The reason is that writing those proofs is both very difficult (requiring PhD-level training) and very laborious.
For example, as of 2009, the formally verified seL4 microkernel consisted of 8,700 lines of C code, but proving it correct required 20 person-years and 200,000 lines of Isabelle code – or 23 lines of proof and half a person-day for every single line of implementation. Moreover, there are maybe a few hundred people in the world (wild guess) who know how to write such proofs, since it requires a lot of arcane knowledge about the proof system.
To put it in simple economic terms: for most systems, the expected cost of bugs is lower than the expected cost of using the proof techniques that would eliminate those bugs. Part of the reason is perhaps that bugs are a negative externality: it’s not the software developer who bears the cost of the bugs, but the users. But even if the software developer were to bear the cost, formal verification is simply very hard and expensive.
At least, that was the case until recently. Now, LLM-based coding assistants are getting pretty good not only at writing implementation code, but also at writing proof scripts in various languages. At present, a human with specialist expertise still has to guide the process, but it’s not hard to extrapolate and imagine that process becoming fully automated in the next few years. And when that happens, it will totally change the economics of formal verification.
If formal verification becomes vastly cheaper, then we can afford to verify much more software. But on top of that, AI also creates a need to formally verify more software: rather than having humans review AI-generated code, I’d much rather have the AI prove to me that the code it has generated is correct. If it can do that, I’ll take AI-generated code over handcrafted code (with all its artisanal bugs) any day!
In fact, I would argue that writing proof scripts is one of the best applications for LLMs. It doesn’t matter if they hallucinate nonsense, because the proof checker will reject any invalid proof and force the AI agent to retry. The proof checker is a small amount of code that is itself verified, making it virtually impossible to sneak an invalid proof past the checker.
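A toy Lean illustration of that dynamic: the first proof is accepted by the kernel, while the hallucinated variant (commented out) is simply rejected with a type error, forcing the agent to retry.

```lean
example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b      -- accepted by the kernel

-- example (a b : Nat) : a + b = b + a := by
--   exact Nat.mul_comm a b   -- rejected: Nat.mul_comm proves a * b = b * a
```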
That doesn’t mean software will suddenly be bug-free. As the verification process itself becomes automated, the challenge will move to correctly defining the specification: that is, how do you know that the properties that were proved are actually the properties that you cared about? Reading and writing such formal specifications still requires expertise and careful thought. But writing the spec is vastly easier and quicker than writing the proof by hand, so this is progress.
I could also imagine AI agents helping with the process of writing the specifications, translating between formal language and natural language. Here there is the potential for subtleties to be lost in translation, but this seems like a manageable risk.
I find it exciting to think that we could just specify in a high-level, declarative way the properties that we want some piece of code to have, and then to vibe code the implementation along with a proof that it satisfies the specification. That would totally change the nature of software development: we wouldn’t even need to bother looking at the AI-generated code any more, just like we don’t bother looking at the machine code generated by a compiler.
In summary: 1. formal verification is about to become vastly cheaper; 2. AI-generated code needs formal verification so that we can skip human review and still be sure that it works; 3. the precision of formal verification counteracts the imprecise and probabilistic nature of LLMs. These three things taken together mean formal verification is likely to go mainstream in the foreseeable future. I suspect that soon the limiting factor will not be the technology, but the culture change required for people to realise that formal methods have become viable in practice.
If you found this post useful, please support me on Patreon so that I can write more like it!
I don't see how AI would help with that even if it made writing code completely free. Even if the AI is writing the spec and fully specifies all possible outcomes, the human reviewing it will glance over the spec and approve it, only to change their mind when confronted with the actual behavior or user reports.
This makes me wonder if LLMs work better in Chinese.
All said, I’m now running all commits through Codex (which is the only thing it’s any good at), and it’s really pretty good at code reviews.
The problem is there is always some chance a coding agent will get stuck and be unable to produce a conforming implementation in a reasonable amount of time. And then you are back in a similar place to what you were with those pre-LLM solutions - needing a human expert to work out how to make further progress.
That's exactly what languages with advanced type systems claim. To be more precise, they claim to eliminate entire classes of bugs. So they reduce bugs, they don't eliminate them completely.
To me it seems they reduce productivity. In fact, for Rust, which seems to match the examples you gave about locks or regions of memory, the common wisdom is that it takes longer to start a project, but one reaps the benefits later thanks to more confidence when refactoring or adding code.
However, even that weaker claim hasn’t been proven.
In my experience, the more information is encoded in the type system, the more effort is required to change code. My initial enthusiasm for the idea of Ada and Spark evaporated when I saw how much ceremony the code required.
That is not novel and every declarative language precisely embodies it.
They really don’t. How did you arrive at such a conclusion?
Why? Has it ever happened like this? Because to me it would seem that if the system verified to work, then it works no matter how API is shaped, so there is no incentive to change it to something better.
Given the spec, formal verification can tell you if your implementation follows the spec. It cannot tell you if the spec is good.
Depends, of course, on what I am writing: a feature, a bugfix, a refactor...?
Bug reporting is already about signal vs noise. Imagine how it will be when we hand the megaphone to bots.
Forcing devs to pre-fix/avoid bugs before the compiler will allow the app means the programs are more correct as a group.
Wrong, incomplete, insufficient, unhelpful, unimpressive, and dumb are all still very possible. But more correct than likely in looser systems.
At one end of the spectrum, the weakest type systems limit the ability of an IDE to do basic maintenance tasks (e.g. refactoring).
At the other end of the spectrum, dependent type and especially sigma types capture arbitrary properties that can be expressed in the logic. But then constructing values in such types requires providing proofs of these properties, and the code and proofs are inextricably mixed in an unmaintainable mess. This does not scale well: you cannot easily add a new proof on top of existing self-sufficient code without temporarily breaking it.
Like other engineering domains, proof engineering has tradeoffs that require expertise to navigate.
I would tend to disagree. All that information encoded in the type system makes explicit what is needed in any case and is otherwise only carried informally in peoples' heads by convention. Maybe in some poorly updated doc or code comment where nobody finds it. Making it explicit and compiler-enforced is a good thing. It might feel like a burden at first, but you're otherwise just closing your eyes and ignoring what can end up important. Changed assumptions are immediately visible. Formal verification just pushes the boundary of that.
To be honest, I believe it makes refactoring/maintenance take longer. Sure, safer, but this is not a one-time only price.
E.g. you decide to optimize this part of the code and only return a reference or change the lifetime - this is an API-breaking change and you have to potentially recursively fix it. Meanwhile GC languages can mostly get away with a local-only change.
Don't get me wrong, in many cases this is more than worthwhile, but I would probably not choose rust for the n+1th backend crud app for this and similar reasons.
Of course it is a hyperbole, but sadly not that large.
I am right now working on an offline api client: https://voiden.md/. I wonder if this can be a feature.
That the spec solves the problem is called validation in my domain and treated explicitly with different methods.
We use formal validation to check for invariants, but also "it must return a value xor an error, but never just hang".
Have you seen large js codebases? Good luck changing anything in it, unless they are really, really well written, which is very rare. (My own js code is often a mess)
When you can change types on the fly somewhere hidden in the code, that is the opposite of clarity for me. And then a lot of effort is required to change something in a proper way that does not lead to more mess.
That's the case for one-off integrations, but the messy part always comes when system goal changes
Let's say formal verification could help to avoid some anti-patterns.
I beg to differ, if a spec is hard to verify, then it's a bad sign.
Things can only happen if you allow them to happen.
In other words, your software may come to a stage where it's no longer applicable, but it never crashes.
Formal verification had little adoption only because it costs 23x your original code and requires "PhD-level training".
This is, in fact, better for LLMs: they are better at carrying information and convention in their KV cache than at having to figure out the actual types by jumping between files and burning tokens in context / risking losing it on compaction (or getting it wrong and having to do a compilation cycle).
If a typed language lets a developer fearlessly build a semantically inconsistent or confusing private API, then LLMs will perform worse with it, even though correctness is more guaranteed.
It’s actually similar to tests in a way: they provide additional confidence in the code, but at the same time ossify it and make some changes potentially more difficult. Interestingly, they also make some changes easier, as long as not too many types/tests have to be adapted.
The theory is entirely correct. If a machine can write provably perfect code there is absolutely no reason to have people write code. The problem is that the 'If' is so big it can be seen from space.
It is quite clear that this industry is mostly driven by hype and fads, not by empirical studies.
Empirical evidence in favor of the claim that static typing and complex type systems reduce bugs or improve productivity is highly inconclusive at best.
I'd still like to hear about the actual mechanism of this happening. Because I personally find it much easier to believe that the moment keeping the formal verification up to date becomes untenable for whatever reason (specs changing too fast, external APIs to use are too baroque, etc) people would rather say "okay, guess we ditch the formal verification and just keep maintaining the integration tests" instead of "let's change everything about the external world so we could keep our methodology".
a) It’s fast to change the code, but now I have failures in some apparently unrelated part of the code base. (Javascript) and fixing that slows me down.
b) It’s slow to change the code because I have to re-encode all the relationships and semantic content in the type system (Rust), but once that’s done it will likely function as expected.
Depending on project, one or the other is preferable.
It's just people's hunches.
Of course, you can declare that the world itself is inherently sinful and imperfect, and is not ready for your beautiful theories but seriously.
When you say things like "eliminate a class of bugs", that is played out in the abstraction: an infinite subset of that infinity of machines is eliminated, leaving an infinity.
How you then sample from that infinity in order to have something which fits on your actual machine is a separate question.
I compared it to a statically typed language with a GC - where the runtime takes care of a property that Rust has to do statically, requiring more complexity.
As an example, I currently mostly write GUI applications for mobile and desktop as a solo dev. 90% of my time is spent on figuring out API calls and arranging layouts. Most of the data I deal with are strings with their own validation and formatting rules that are complicated and at the same time usually need to be permissive. Even at the backend, all the data is in the end converted to strings and integers when it is put into a database. Over-the-wire serialization also discards most typing (although I prefer protocol buffers to alleviate this problem a bit).
Strong typing can be used in between those steps but the added complexity from data conversions introduces additional sources of error, so in the end the advantages are mostly nullified.
i see we are both familiar with haskellers (friendly joke!)
I am not an expert on this, but the worst API I've seen is those with hidden states.
e.g. a .toggle() API. Call it an odd number of times, it goes to one state; call it an even number of times, it goes back.
And there are "call A before you call B" types of APIs, where the client has to keep a strict call order (which is itself a state machine of some kind).
To me, this has been one of the biggest advantages of both tests and types. They provide confidence to make changes without needing to be scared of unintended breakages.
this is exactly where a good type system helps: you have an unvalidated string and a validated string which you make incompatible at the type level, thus eliminating a whole class of possible mistakes. same with object ids, etc.
don't need haskell for this, either: https://brightinventions.pl/blog/branding-flavoring/
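The same trick sketched in Rust rather than TypeScript, using a newtype with a private field (the `Email` type and its toy validation rule are invented for illustration):

```rust
// The only way to obtain an `Email` is through `parse`, so raw and
// validated strings cannot be confused anywhere else in the program.
pub struct Email(String); // field stays private to this module

impl Email {
    pub fn parse(raw: &str) -> Result<Email, String> {
        if raw.contains('@') {
            Ok(Email(raw.to_owned()))
        } else {
            Err(format!("not an email address: {raw}"))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

// Any function taking `&Email` can rely on validation having already happened.
pub fn send_welcome(to: &Email) {
    println!("sending welcome mail to {}", to.as_str());
}

fn main() {
    let email = Email::parse("ada@example.org").expect("should validate");
    send_welcome(&email);
}
```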
> e.g. a .toggle() API. Call it an odd number of times, it goes to one state; call it an even number of times, it goes back.
This is literally a dumb light switch. If you have trouble proving that, starting from lights off, flicking a simple switch twice will still keep lights off then, well, I have bad news to tell you about the feasibility of using the formal methods for anything more complex than a dumb light switch. Because the rest of the world is a very complex and stateful place.
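Indeed, the light-switch version really is a two-line Lean proof (with toggle defined here simply as boolean negation):

```lean
def toggle (b : Bool) : Bool := !b

-- flicking the switch twice returns you to the starting state
example (s : Bool) : toggle (toggle s) = s := by
  cases s <;> rfl
```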
> (which itself is a state machine of some kind)
Yes? That's pretty much the raison d'être of the formal methods: for anything pure and immutable, normal intuition is usually more than enough; it's tracking the paths through enormous configuration spaces that our intuition has problem with. If the formal methods can't help with that with comparable amount of effort, then they are just not worth it.