Lean proved this program correct; then I found a bug

This article’s framing and title are odd. The author, in fact, found no bugs or errors in the proven code. She says so at the end of the article:

> The two bugs that were found both sat outside the boundary of what the proofs cover. The denial-of-service was a missing specification. The heap overflow was a deeper issue in the trusted computing base, the C++ runtime that the entire proof edifice assumes is correct.

Still an interesting and useful result to find a bug in the Lean runtime, but I’d argue that doesn’t justify the title. Or the claim that “the entire proof edifice” is somehow shaky.

It’s important to note that this is the Lean runtime that has a bug, not the Lean kernel, which is the part that actually does the verification (aka proving). [1] So it’s not even immediately clear what this bug would really apply to, since obviously no one’s running any compiled Lean code in any kind of production hot path.

[1] https://lean-lang.org/doc/reference/latest/Elaboration-and-C...

I’ve had similar experiences with code I’ve proven correct, although my issues were of the more common variety than the overflow issue - subtle spec bugs. (I think the post mentions the denial of service issue as related to this: a spec gap)

If you have a spec that isn’t correct, you can certainly write code that conforms to that spec and write proofs to support it. It just means you have verified a program that does something other than what you intended. This is one of the harder parts of verification: clearly expressing your intention as a human. As programs get more complex these get harder to write, which means it isn’t uncommon to have lean or rocq proofs for everything only to later find “nope, it has a bug that ultimately traces back to a subtle specification defect.” Once you’ve gone through this a few times you quickly realize that tools like lean and rocq are tricky to use effectively.

I kinda worry that the “proof assistants will fix ai correctness” will lead to a false sense of assurance if the specs that capture human intention don’t get scrutinized closely. Otherwise we’ll likely have lots of proofs for code that isn’t the code the humans actually intended due to spec flaws.

Clickbait title, the proved part of the program had no bugs?

As an aside, why can't people just write factually? This isn't a news site gamed for ad revenue. It's also less effort. I felt this post was mostly an insulting waste of time. I come to HN to read interesting stuff.

So this is very good news and was predictable.

LLM's are capable of producing code that passes formal verification.

The writing is on the wall: in the future more and more software on the abstract or platonic side of our computing base will be hermetically sealed against bugs and exploits. This quenching of bugs in the assured side will shift the mean location of bugs closer to the hardware side: at some point bugs and exploits will rely more and more on hardware quirks, and simply unspecified hardware.

Afterwards we can expect a long exponential decay of preventable safety violations: people mistakenly or surreptitiously disengaging the formal verification steps and shipping malicious or unverified code. Each such event will be its own big or small scandal, at some point there will be no deniability left: something must be on purpouse, either a malicious vulnerability or intentional disengagement of safety measures.

As the attack surface recedes towards the lower level hardware stack, it will open the debate that the community needs proper formal hardware descriptions (at least at the interface initially, not necessarily how the hardware has implemented it). As interface bugs get formalized 3 things can happen:

either vulnerabilities go extinct, and full formal hardware descriptions are not released

or vulnerabilities remain in each new generation of hardware, and malicious intent or negligence on behalf of the manufacturer can only be presumed, this will set up the community against manufacturers, as they demand full hardware descriptions (verilog, VHDL,...).

or vulnerabilities are tactically applied (vulnerabilities appear extinct to the bulk of the population, but only because manufactured vulnerabilities are sparingly exploited by the manufacturing block)

It is hard to predict what is more valuable: embedding HW vulnerabilities for the status quo and being able to exploit it for while before the public demands full hardware IP descriptions (verilog, VHDL) etc. or facing the end of vulnerabilities a little sooner but keeping hardware IP private (if the bugs stop with full interface descriptions).

The spec-completeness problem here is the same one that bites distributed systems verification: the proof holds inside an operating envelope (no adversarial inputs, trusted runtime, bounded sizes), and the interesting failures live at the boundary. TLA+ has the same property - you can prove liveness under a fairness assumption the deployment silently violates, and nothing in the proof tells you when reality drifted outside.

What I'd actually want from the tooling is a machine-checkable statement of the envelope itself, propagated as a runtime guard rather than a compile-time comment. Then "proof holds" and "we are still inside the proof's domain" are two separate, observable properties, and the unverified-parser / unverified-runtime cases stop being invisible.

Am I reading the article wrong? It appears that the author did not test the claims of the proof. Wouldn't a "bug" in this case mean she found an input that did not survive a round trip through the compression algorithm?

Update: Actually, I guess this may have been her point: "The two bugs that were found both sat outside the boundary of what the proofs cover." So then I guess the title might be a bit click baity.

"Beware of bugs in the above code; I have only proved it correct, not tried it." -- Donald Knuth

Not verifying the parser seems like a pretty big oversight. Parsing binary formats is notoriously dangerous!

You guys are missing the forest for the trees.

They used an AI agent sending ideas to a fuzzer and discovered a heap buffer overflow in Lean. This is big.

Compression/decompression is a good problem for proof of correctness. The specification is very simple (you must get back what you put in), while the implementation is complex.

What seems to have happened here is that the storage allocator underneath is unverified. That, too, has a relatively simple spec - all buffers disjoint, no lost buffers, no crashes.

claude making a statement that sounds impressive but it is actually the first codebase it has ever analyzed.

"This is genuinely one of the most memory-safe codebases I've analyzed."

I'll probably get a lot of hate mail for this but here goes nothing... Despite what many people like to claim, you cannot prove that a program has no bugs. That means proving the absence of bugs, and you cannot prove a negative. The best thing you can do is fail to find a bug, but that doesn't mean it isn't there.

Before everyone starts blabbing about formal verification, etc., consider this: how do you know that you didn't make a mistake in your formal verification? IOW, how do you know your formal verification is bug-free? Answer: you don't. Or if you try to formally verify your formal verification then you're just translating the problem to a new layer. It's just a chain of proofs that is always ultimately based on an unproven one, which invalidates the whole chain.

You can get asymptotically close to zero-bug proof, but you can never get there 100% of the way.

Tools like this (formal verification, sparse, etc.) and built-in mechanism (types, generics, RAII, rust's borrow checker) can only verify issues within the scope of that framework. There are also trade-offs and limitations with each type and each implementation.

Type checking allows you to (outside of type casting such as in languages like C/C++ and casting to object for generic containers in Java) verify that an object is of a given type. That allows you to be sure that a well-formed program isn't doing things like putting a random object in a list.

Languages like C#, Scala, and Kotlin improve Java generics by making the generic type of a container or other interface/type part of the type system. This allows generic types of a generic type to preserve the inner type. This makes it possible to implement things like monads and mapping functions to preserve the generic type.

A similar thing is possible with union types, sealed interfaces/traits, etc. that allow you to check and verify the return type instead of defaulting it to a generic object/any type.

Likewise with other features like nullable/non-null annotations (or corresponding nullable type annotations like in Kotlin and recent C# versions).

All of these can be abused/circumvented, but if you keep your code within that framework the compiler will stop that code compiling. Likewise, these solve a limited set of bugs. For example, nullable types can't verify memory management and related bugs.

Unfortunately, the discussion focused on the somewhat click baity title "proved this program correct". It's unclear what "this program" is. If it refers to the core algorithm with a proof, then there's no bug. If it includes the runtime and the header parser, then Lean didn't prove it correct.

That being said, using a coding agent to direct fuzzying and find bugs in the Lean kernel implementation is the big news here. (After all the kernel's implementation is not proved.)

The moral of the story is to push for more verified code not less and try AI bug hunting.

This is analogous to the fundamental problem of better automation in programming - eventually, the complexity and correctness of of the spec takes over, and if we don't manage that well, creating the spec is not that much less work than the programming part. If your program was for the wrong thing, a proof of it is also wrong.

> 105,823,818 fuzzing executions.

That sounds like quite the monthly bill. o_O

> The two bugs that were found both sat outside the boundary of what the proofs cover

Are we baiting people with headlines now?

This is very cool work but the author is labouring under a false premise about how axiomatic systems work:

> Every Lean proof assumes the runtime is correct.

No. Every valid Lean proof assumes that if the runtime/mathlib etc is correct, then it too is correct.

Tangentially also, most lean proofs are not dependent on whether or not the runtime has things like buffer overflows or denial of service against lean itself at all, because if I prove some result in Lean (without attacking the runtime) then a bug in the runtime doesn’t affect the validity of the result in general. It does mean however that it’s not ok to blindly trust a proof just because it only relies on standard axioms and has no “sorry”s. You also need to check that the proof doesn’t exploit lean itself.

So that's just one more win for formal verification despite the title it seems

I'm genuinely excited about AI agents and formal verification languages. To me it's obviously the way forward instead of moonshots trying to make agents that program in their own AI blackbox binary, or agents that code in current programming languages.

If we are heading in the direction of "huge codebases that nobody has written", or, "code is an artifact for the machine", I don't see a way out without making it proved.

If humans can review and edit the spec, then verify that the implementation matches it, suddenly leaving the implementation be an artifact for the machines seems okay

The downside of provers also being that they are a massive pain in the ass that very few want to use, this is also a complete win.

I didn't like the clickbait title. I would have preferred something along the lines of

"Lean proves other program correct but not itself"

Nice work. Amusing that Lean's own standard library has a buffer overflow bug, which "leaked out" due to being exempted from the verification.

Regarding the DoS in the lean-zip application itself: I think this is a really neat example of the difficult problem of spec completeness, which is a subcase of the general problem (mentioned by porcoda in a sibling comment) of being sure that the spec is checking the right things. For a compression program, the natural, and I would also say satisfyingly beautiful thing to prove is that decomp(comp(x)) = x for all possible inputs x. It's tempting to think that at that point, "It's proven!" But of course the real world can call decomp() on something that has never been through comp() at all, and this simple, beautiful spec is completely silent on what will happen then.

This is a great reminder that ‘proved correct’ always has an invisible suffix: ‘with respect to the thing you actually specified.’ The hard part was never just proving things, it was pinning reality down tightly enough that the proof is about the right world.

Incorrect usage of semicolon in title/headline. Should be a comma.

Tsk, tsk.

Correctness with respect to a specification is not the same as correctness with respect to intent(ion) or expectation

Formal proofs can only ever work to validate against requirements... But a major issue with a lot of modern software is that the requirements themselves can be incorrect and some requirements can logically conflict with other requirements... So long as we have non-technical people formulating requirements, we can never have provably correct software.

I've come across issues in the past which weren't actually bugs. For example, the software was behaving exactly as intended but it looked like a bug to a user who didn't understand the nuances and complexities of what was happening.

I also cannot count how many times a non-technical person asked me to implement conflicting functionality or functionality which would have made the UX incredibly confusing for the user.

c++, you asked for it.

Now I want to see benchmarks

"Beware of bugs in the above code; I have only proved it correct, not tried it." - Donald Knuth

https://news.ycombinator.com/item?id=12761986 (being this link more than 10yrs old is not surprising)

HN to be renamed Flex Social. Damn.

Alan Turing already proved with the Halting Problem that reasoning about program correctness is not possible. But we still try.

Wikipedia: [1] Turing proved no algorithm exists that always correctly decides whether, for a given arbitrary program and input, the program halts when run with that input. The essence of Turing's proof is that any such algorithm can be made to produce contradictory output and therefore cannot be correct.

[1] https://en.wikipedia.org/wiki/Halting_problem