I suspect a true "big new general-purpose" model is around the corner from them, whether or not they were in on Le Chaton Fat for real. They've mentioned it after the media circus. Hopefully more creatively named than just "Large 4".
I've found that you can get wildly different quality results from these sorts of models due to seemingly insignificant differences in prompt construction. It would be much easier to guess at what it wants if I could just see some RL transcripts -- and so the model author is in a much better position to provide initial advice.
I do a lot of OCR, file analysis, stuff like that. I use Mistral for that. I put 100$ into my account, and it just runs for a year without any worries about the amount of requests I make, because the cost is minuscule. That's valuable, even if it doesn't compete with Opus 4.8.
> One such bug was in the sign function for zigzag decoding of the datrs/varinteger library. On input Std.U64.MAX, the expression (value + 1) overflowed, causing crashes in debug mode and silent corruption in release mode—an edge case that testing and fuzzing would typically miss.
In what way would this boundary condition case be considered something that "testing [...] would typically miss"? It's certainly something that bad tests would miss or not think about, but I find that (a) careful people and (b) ML coding systems are actually really good at "oh, I should test the extreme values". Especially for things that parse user input.
I'm curious if they found other bugs that were more interesting, but found them too hard to explain quickly.
that library is: https://github.com/datrs/varinteger
it seems probably correct, as there's an identical issue filed on that repo a week before this was published: https://github.com/datrs/varinteger/issues/8 (is this a leanstral employee? they have almost no info and only very sparse activity. or did leanstral perhaps just pick up this issue?)
it's a tiny, surprisingly-poorly tested, long-untouched (8y) library: https://github.com/datrs/varinteger/blob/master/tests/test.r... that has about 1k downloads per day: https://crates.io/crates/varinteger [1] which seems rather low.
I don't think I'd consider that such a smashing success that it's worth bringing up as the sole example tbh. though automated detection is certainly useful. or is this a noteworthy accomplishment for this sub-field? I haven't played with proof-writing LLMs, but given the paucity of training data I wouldn't be surprised if they're a bit rough compared to general coding.
1: https://crates.io/crates/varinteger lists it as https://github.com/mafintosh/varinteger-rs which redirects to https://github.com/datrs/varinteger , so despite looking different at a glance it does appear to be the same library
--- edit
concretely, I made a very simple round-trip test with proptest, and got dozens of failures and this in less than a second:
thread 'signed_round_trip' (50528) panicked at tests/test.rs:72:1:
Test failed: attempt to multiply with overflow.
minimal failing input: value = 4611686018427387904
successes: 2
local rejects: 0
global rejects: 0It does speak to the benefits of using lean in that you don't need to be clever about the different examples you test.
The best and the brightest from Europe have no incentive to build in Europe when they can do it in America and be compensated and treated far better
Identify bugs in [datrs/varinteger](https://github.com/datrs/varinteger) . Do NOT look at the GitHub issues, just inspect the source
It also found the bug that Leanstral 1.5 found and the authors highlighted. I think this bug wasn't especially tricky; it's just a case of too few eyeballs on this repo.Congrats on the release regardless! Excited for the direction Lean + automated AI proofs are headed.
Disclosure: I work at OpenAI.
Plus, you can't really build loyalty on a 100 USD/Y price tag. Since there are no switching costs holding them back, those buyers will leave the moment somebody offers a lower rate. An easily cloned, low cost tool with zero customer lock in is not a business. It is a feature.
That might sound great for the buyer (you), but it is a terrible strategy if we want a European company to compete long term against global competitors on actual product merit instead of just regulatory arbitrage.
Every property-based testing system (invented ca. 1980) will explore boundary values. The semantics (or lack thereof) of C and C++ can make this difficult to actually test for because the compiler is allowed to say "test passed" to any input leading to UB.
Life's not all about the number on your bank account. Once you're past a certain level (which a competent engineer can easily reach in Europe as well), the marginal utility of money diminished quite quickly.
This is a little bit like someone pointing the moon and you look at the finger.
The formal proof domain goes way beyond just finding bugs.
It has tons of usages in term of functional safety, protocol validation, cryptography, etc...
The fact Mistral tackle this kind of problem is both smart and not so surprising.
Smart because it is niche enough that they do not front face the big competitors (yet).
No so surprising because the French labs have a well known and long time expertise with formal proof tools (Coq and all its Ocaml associated tools). It has been historically mainly pushed by the aerospace and train industries (Airbus, Dassault, Alsthom).
Honestly: Think twice before dragging your firm into what you say.
Disclaimer: I speak for myself. Not any firm I am associated with.
this sounds like a great tool to add to the toolbelt, as part of the "how do we handle all the code output from LLMs" problem
Basically, to find the answer you really need your own benchmark you run with real examples from what you want to do. Basically the same goes for anything ML nowadays as the public benchmarks cannot really be trusted to give you any sort of indication on how we'll it'd work for you.
However, Lean is currently gaining significant momentum as an alternative, particularly due to its capabilities as a general-purpose functional programming language.
Personally, I think something based on Hoare or separation logic would be more practical as it'd be easier to align requirements with specifications. I like Dafny and F*.
I created this tool for my own research and have found it really helpful to benchmark different automated theorem provers (my experience so far has been that Claude Code + Codex still out-perform Leanstral). My genuine aim is to share that usefulness with others, not self promote!
That said, if (or when) the progress of the LLMs flatten out, then I think even Europe can catch up in a few years. If they don’t, and that seems unlikely to me, if the required compute needs to increase at the rate it does today, then I am not sure any of us can predict where society ends up.
And yes, its good that "its good for buyer" after all we do business so that living would be nicer, not the other way around (live to do business)
I still have a ways to go before calling myself a lean4 expert, but I don't need assist to get useful programs anymore.
The ability to start with very little knowledge and still be able to trust parts you don't fully understand is a real unlock on learning progress: it's both practical and motivating to get useful programs you can rely on with incomplete knowledge, it sort of drags you in. You're bounded by the subset of the language that describes your axiom and proposition surface, not the subset that describes the intermediate steps. Over time as your ambition goes up, you need to understand more to do more things, but you can operate safely at level N+1 in a sense.
It's also just a delightful programming language irrespective of its theorem proving role, and it's remarkably fast. I've got it bolted to io_uring and in many cases it blows the ass off of C++ with libuv or Rust with Tokio. Now and again you'll see some huge tail at the p99.99 latency or something and you go make a number fixed width or something, but you have to tune C++ and Rust too.
I get what you are saying but does this actually apply to a test? If the code under test is in one compilation unit and the test harness in another and they are linked together then the UB optimization issue ends at the API boundary and can't possibly make the test pass ..?
That's still 3-10x smaller than the other models in that graph though (400B, 1T, 1.5T).
Now, go write the code for an artificial heart , and sleep at night thanks to strong testing !
Here’s a tip, imagine how you’d comment if this wasn’t your library but just one that you knew and used. Would you have commented at all? Would you explain why you’d use it, how, the direct and clear relevance to the article?
Take that and add in that it’s your library.
If you wouldn’t comment otherwise, you’re just advertising a thing where people are. If you would, you’re contributing to a conversation with something you built.
Oh and there’s lots of open source products that have most useful features paid saas or licensed, or it’s an oss library that only talks to a paid saas thing.
Edit - the other thing you can do is just call out the comment at the start as self promotion or a plug, I think your work is more directly relevant here and doesn’t need that but it’s another approach when it’s a bit more tangential. That’s better here than other communities.
"I love Lean because <abc>. I found it failed in <xyz> case because <123>. I created a thing <blah> which handles that like this: <ahhh>.
I'd love feedback! It's open source here: "
My thought was: Good job, this is tasteful personable marketing for a product with genuine value. I wish more marketing were done this way. So I think it's totally fine to be talking about the cool thing you're working on. I for one found it interesting and added to the discussion.
Just my 2c, but maybe your original post could’ve been written in a less promotional / less excited way. It looked a little like spam at _first_ glance.
Probably the most annoying part about Reddit and HN and X (although it let you mute people) is the abundance of "expert" opinions from people who aren't experts at all. You just end up with a bunch of false signals that you shouldn't even be listening to.
all in all I say invest in spreading the words via other channels, maybe even X is better and even the right time zone (besides US working hours, I find European time the worst statistically for sharing your work).
Food is important for national security so we should subsidize it, but it's a cost center. It'll never drive growth.
If that's what Mistral is aiming for, it would probably be better to give up now.
Another big use case is GPU kernel compilation. I've encoded the polyhedral algebra used by CuTe and all of the stated theorems from nvfuser in lean4. From this you can recover scheduling and tiling properties that must be met to get equal or better performance to CASK (likely optimal) for many operations, and then lower that to PTX. Surprisingly often the candidate set for the fastest possible kernel turns out to be finite and enumerable, so fast inference.
Somewhat recently I started benchmarking stuff just directly in lean4 with a very small TCB shim to e.g. io_uring. I've done like the basic systems programming ABCs (I've got a reverse proxy that's lean4 and ~100 lines of C and shatters nginx, faster at every point on the histogram, memory use flat as a strap). Lithe runs on it so I can do web serving as fast as anything in the hyper ecosystem.
Now that I've got my feet with it I'm digging in on a serious project which is a hypervisor that addresses the deficiencies in firecracker. This is a place where some proven properties can really get you a lot of performance (something like firecracker has to be very defensive regarding e.g. virtqueue semantics in the presence of a malicious guest). Firecracker is an exceptional piece of code and it's carried me far but at agent scale it's time to go another turn of the crank on virtualization performance.
For example, Mercedes autonomous driving team is moving ahead at glacial speed, but the system they have so far is excellent and reliable. I'd prefer that over the sad joke Tesla is promoting any day.
Is Big AI on track to pay that back with profit from any foreseeable and defensible business model? Different question. I sincerely doubt it.
The good thing about open capital markets is the possibility to invest in the upside. The downside is who gets holding the bag when the money runs out
There are a whole lot of commodity businesses that flourishes and that are profitable. It's true, that, yes, they will not have huge margins.
Grocery stores are like that - some of their suppliers might be subsidised but they are not, and many places they operate with typical margins in the 2% range. Discount supermarkets in the UK are operating on around 0.7% margins.
They are still huge, profitable businesses.
And they are examples of what happens when markets work.
If Mistral is aiming not to grow to infinity but instead to offer niche models at reasonable prices, should they give up now? I think I'm not sure what exactly you're saying.
And API is 80% of their business (subscriptions the other 20%)
Thinking
Summary
Leanstral 1.5, a free Apache-2.0 licensed model with 6B active parameters, delivers a major performance upgrade in formal verification, saturating miniF2F, solving 587/672 PutnamBench problems, and achieving state-of-the-art results on FATE-H (87%) and FATE-X (34%). Trained through mid-training, supervised fine-tuning, and reinforcement learning with CISPO, it excels in agentic proof engineering and real-world code verification, uncovering 5 previously unknown bugs across 57 repositories tested. Fully open-sourced and available via Hugging Face and a free API, Leanstral 1.5 is now accessible for practical proof engineering in Lean 4.
Since its launch, Leanstral has offered an open, practical approach to proof engineering in Lean 4. Today, we are releasing Leanstral 1.5, a free Apache-2.0 licensed model with 119B total and only 6B active parameters, delivering a performance upgrade that makes formal verification more powerful and accessible than ever.
Leanstral 1.5 saturates miniF2F, solves 587/672 PutnamBench problems, and achieves a new state-of-the-art of %87 on FATE-H and 34% on FATE-X. Beyond benchmarks, it verifies complex code properties and uncovers previously unknown bugs in open-source repositories—proving that rigorous formal methods can be both effective and practical for real-world use.

Leanstral 1.5 goes through a three-stage process: mid-training, supervised fine-tuning, and reinforcement learning with CISPO. Leanstral 1.5 leverages extensive training on two RL environments:
In the multiturn environment, the model is given a theorem statement and must either prove or disprove it. The model submits a proof, receives Lean compiler feedback, and refines its approach with each attempt. If the proof compiles it succeeds; otherwise the loop continues until the model either solves the problem or exhausts its budget.

In the code agent environment, Leanstral operates like a developer in a raw filesystem: it edits files, runs bash commands, and uses the Lean language server to inspect goals, errors, and type information in real time. This allows it to tackle long-horizon tasks like completing partial proofs in a repository, building auxiliary lemmas, and persisting through multiple rounds of context compaction. The model learns to navigate the full proof-engineering workflow and is finally verified by our fork of SafeVerify for correctness given a list of target theorems.

We evaluate Leanstral on the following benchmarks:
miniF2F is a cross-system benchmark for formal mathematics, ranging from elementary problems to IMO-level challenges, testing diverse proof abilities across algebra, combinatorics, and number theory.
PutnamBench consists of 672 problems from the Putnam Mathematical Competition, requiring deep reasoning and long proof chains to solve challenging mathematical problems.
FATE-H and FATE-X are abstract algebra benchmarks for graduate and PhD-level problems, respectively, testing advanced reasoning in areas like group theory, ring theory, and module theory.
FLTEval is based on real pull requests from the Fermat’s Last Theorem repository, testing practical proof engineering with real-world complexity.
We saturate miniF2F completely, reaching 100% on both the validation and test sets. On PutnamBench and FATE-H/X, we compare Leanstral 1.5 against Goedel-Architect without natural-language guidance, Seed-Prover 1.5 at its high setting, and AxProverBase. Leanstral reaches a new state-of-the-art on FATE-H/X, solving 87 and 34 problems respectively. On PutnamBench, it edges out Seed-Prover 1.5 high by 7 problems at far lower cost: about $4 per problem, against an estimated $300 or more for Seed-Prover, whose high setting runs with a budget of 10 H20-days per problem. The only provers ranked higher operate under different conditions—some receive natural-language proof guidance, others cost far more to run, like Aleph Prover at $54–68 per problem.
Leanstral 1.5 shows the strongest test-time scaling we have seen from a formal-reasoning model. The figure below tracks Pass@8 on PutnamBench as we raise the token budget per attempt from 25k to 4M: performance climbs smoothly and monotonically the whole way, from 44 problems solved at 50k to 244 at 200k, 493 at 1M, and 587 at 4M. Rather than giving up when a proof runs long, Leanstral keeps reasoning, editing files, and revising across millions of tokens, turning that budget directly into solved problems—the same behavior behind the AVL-tree proof below, which ran for over 2.7 million tokens across 22 compactions.

With this release, we also fully open source FLTEval. Leanstral 1.5 lifts pass@1 on the benchmark from 21.9 to 28.9 and pass@8 from 31.9 to 43.2, surpassing Opus 4.6's 39.6 at one-seventh the cost. It also widens its lead over open-source models 3–10× larger, as shown in the figure below.

While being primarily trained for mathematics, Leanstral 1.5 exhibits strong abilities in code verification. We present 2 critical case studies to demonstrate its impact.
AVL trees are self-balancing binary search trees that maintain O(log n) height through rebalancing during insertions and deletions. Leanstral 1.5 proved these time complexity guarantees for a real implementation—a task that required structural induction to mirror the tree’s recursive structure, careful handling of monadic time tracking, and exhaustive case analysis for rebalancing paths. Over 2.7 million tokens and 22 compactions, Leanstral systematically unfolded each layer of the TimeM monad, exposing the underlying computations despite their interleaving with control flow. It established an almost tight bound of 48 steps per height unit plus a constant for insertion, then connected height to tree size via a logarithmic relationship, delivering complete, verified proofs that insertion and deletion are indeed O(log n).
To test Leanstral’s bug-catching abilities, we built an automated pipeline: Aeneas translates Rust code to Lean, while Leanstral infers the user intent and generates correctness properties from the code. Leanstral then attempts to prove each property in four attempts. If they all fail, it tries to prove the negation instead, also with four attempts. Across 57 tested repositories, this process flagged 47 violated properties, with 11 pointing to genuine bugs—5 of them previously unreported on GitHub.
One such bug was in the sign function for zigzag decoding of the datrs/varinteger library. On input Std.U64.MAX, the expression (value + 1) overflowed, causing crashes in debug mode and silent corruption in release mode—an edge case that testing and fuzzing would typically miss. Leanstral’s pipeline caught it automatically, demonstrating that formal verification can already be applied to real-world codebases and find bugs that some traditional methods overlook.
Leanstral 1.5 has a Apache-2.0 license. The weights can be found on Huggingface, while also being available now as a free API endpoint as leanstral-1-5. We recommend using it in Mistral Vibe. To begin your journey, grab an API Key, and:
1. Set up Mistral Vibe
uv tool install mistral-vibeuv tool update mistral-vibevibe --setup
2. Install Leanstral 1.5
/leanstallexit
3. Launch the agent
vibe --agent lean
4. Install Lean LSP MCP (Optional)
It is highly recommended to install Lean LSP MCP by adding the following to your ~/.vibe/config.toml
[[mcp_servers]]name = "lean-lsp"transport = "stdio"command = "uvx"args = ["lean-lsp-mcp"]tool_timeout_sec = 600
If there are no existing MCP servers, you may have to remove mcp_servers = [].
5. Start proving
Ask Leanstral to tackle a theorem, debug a proof, or contribute to a repository. It’s that simple.