> passing tests, not for correctness. It hard-codes values to satisfy
> the test suite. It will not generalize.
This is one of the pain points I am suffering at work: workers ask coding agents to generate some code, and then to generate test coverage for the code. The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it. Off to merge and build it goes.
I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens.
1. Agent writes a minimal test against a spec. 2. Another agent writes minimal implementation to make test pass only.
Repeat
This is ping pong pair programming.
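The round-trip above can be sketched concretely. A toy round, assuming a hypothetical spec "slugify lowercases a title and replaces spaces with hyphens" (names here are illustrative, not from any real project):

```python
# Round 1: the "test" agent writes one minimal test from the spec alone.
def test_slugify_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"

# Round 2: the "code" agent writes just enough implementation to pass it.
def slugify(title: str) -> str:
    return title.lower().replace(" ", "-")

test_slugify_lowercases_and_hyphenates()  # passes; the next round adds the next test
```

The point of the split is that neither agent sees the other's artifact first, so the tests encode the spec rather than the implementation.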
> The value is not in the verification workforce. It is in what verified delivery enables
These takes misrepresent safety-critical software verification. The verification workforce is indispensable. Whereas the software engineers are usually highly specialized, the verification team owns the domain and project knowledge to make sure the whole thing integrates. They consult on requirements, anticipate oversights, build infrastructure and validate it, and execute the verification.
So when the business needs a go / no-go call, they ask the verification lead.
Automation and proofs have their place, but I don't see how verification without human accountability is workable.
As you add components to a system, the time it takes to verify that the components work together increases superlinearly.
At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
AI coding agents hit this barrier faster than ever, because of how quickly they can generate components (and how poorly they manage complexity).
I think verification is now the problem of agentic software engineering. I think formal methods will help, but I don't see how they will apply to messy situations like end-to-end UI testing or interactions between the system and the real world.
I posted more detailed thoughts on X: https://x.com/i/status/2027771813346820349
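One toy model of that superlinear growth (my illustration, not the commenter's): if each pairwise interaction between components costs a fixed amount to verify, total verification work grows quadratically with component count:

```python
def pairwise_interactions(n: int) -> int:
    """Number of component pairs that may need integration verification."""
    return n * (n - 1) // 2

# Doubling the components far more than doubles the interactions to check.
assert pairwise_interactions(10) == 45
assert pairwise_interactions(20) == 190  # roughly 4.2x, not 2x
```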
I've been saying "the last job to be automated will be QA" and it feels more true every day. It's one thing to be a product engineer in this era. It's another to be working at the level the author is, where code needs to be verifiable. However, once people stop vibing apps and start vibing kernels, it really does fundamentally change the game.
I also have another saying: "any sufficiently advanced agent is indistinguishable from a DSL." I hadn't considered Lean in this equation, but I put these two ideas together and I feel like we're approaching some world where Lean eats the entire agentic framework stack and the entire operating system disappears.
If you're thinking about building something today that will still be relevant in 10 years, this is insightful.
Not write code, write tests, ensure all test-cases are covered. Now, imagine such a flaky foundation is used to build on top of even more untested code. That's how bad quality software (that's usually unfixable without a major re-write) is born.
Also, most vibe-coders don't have enough experience/knowledge to figure out what is wrong with the code generated by the AI. For that, you need to know more than the AI and have a strong foundation in the domain you're working on. Here is an example: you ask the AI to write the code for a comment form. It generates the backend and frontend code for you (let's say React/Svelte/Vue/whatever). The vibe-coder sees the UI, most likely styled with Tailwind CSS, thinks "wow, that looks really good!", and clicks approve. However, an experienced person might notice the form has no CSRF protection in place. The vibe-coder might not even be aware of the concept of CSRF (let alone the OWASP Top 10 security risks).
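For readers unfamiliar with the CSRF point: the usual mitigation is a per-session token that the server embeds in the form and checks on submit. A minimal, framework-free sketch (the `session` dict is a stand-in for whatever your framework provides):

```python
import hmac
import secrets

def issue_csrf_token(session: dict) -> str:
    """Store a random token in the session; the server embeds it in the form."""
    token = secrets.token_hex(32)
    session["csrf_token"] = token
    return token

def check_csrf_token(session: dict, submitted: str) -> bool:
    """Reject the POST unless the submitted token matches the session's copy."""
    expected = session.get("csrf_token", "")
    return hmac.compare_digest(expected, submitted)

session = {}
token = issue_csrf_token(session)
assert check_csrf_token(session, token)          # legitimate form submission
assert not check_csrf_token(session, "forged")   # cross-site forgery attempt
```

A cross-site attacker can make the victim's browser send a POST, but cannot read the token out of the victim's page, which is why the check works.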
Hence, the fundamental problem is not knowing about the domain more than the AI to pick up the flaw. Unless this fundamental problem is solved - which I don't think it will anytime soon because everyone can generate code + UI these days, I don't see a solution to the verification problem.
However, this is good news for consultants and the like, because it creates more work down the line: fixing the vibe-coded mess after it got hacked the very next day, with a rush fee charged on top, too. So, it's not all that bad.
The Dafny code formed a security kernel at the core of a service, enforcing invariants like that an audit log must always be written to prior to a mutating operation being performed. Of course I still had bugs, usually from specification problems (poor spec / design) or Claude not taking the proof far enough (proving only for one of a number of related types, which could also have been a specification problem on my part).
In the end I realized I'm writing a bunch of I/O bound glue code and plain 'ol test driven development was fine enough for my threat model. I can review Python code more quickly and accurately than Dafny (or the Go code it eventually had to link to), so I'm back to optimizing for humans again...
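The invariant described above (the audit log must be written before any mutating operation) can be approximated in plain Python with a wrapper, trading Dafny's compile-time proof for a runtime guarantee. Names here are illustrative, not from the original kernel:

```python
def audited(audit_log: list):
    """Decorator enforcing: append to the audit log BEFORE running a mutation."""
    def wrap(mutation):
        def run(*args, **kwargs):
            audit_log.append((mutation.__name__, args))  # invariant: log first
            return mutation(*args, **kwargs)
        return run
    return wrap

log, store = [], {}

@audited(log)
def set_value(key, value):
    store[key] = value

set_value("a", 1)
assert log == [("set_value", ("a", 1))]  # log entry precedes the mutation
assert store == {"a": 1}
```

The difference, of course, is that Dafny proves the invariant for all executions, while the wrapper only enforces it on the paths that actually run, and only for mutations that remembered to use the decorator.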
The domains are different, but the strategy is similar: don’t rely on heuristics or empirical patching; define a small trusted core of axioms and generate coherent structure compositionally from there.
It keeps me in the loop, I'm testing actual functionality rather than code, and my code is always in a state where I can open a PR and merge it back to main.
The collapse of civilisation is real.
We've got fossil fuels that were deposited over millions of years, a timescale we are not even properly equipped to imagine. We've been tapping that reserve for a few decades and it's caused all kinds of problems. We've painted ourselves into a corner and can't get out.
Now we've got a few decades worth of software to tap. When you use an LLM you don't create anything new, you just recycle what's already there. How long until we find ourselves in a very similar corner?
The inability of people to think ahead really astounds me. Sustainability should be at the forefront of everyone's mind, but it's barely even an afterthought. Rather, people see a tap running and just drink from it without questioning once where the water is coming from. It's a real animal brain thing. It'll get you as far as reproducing, but that's about it.
This might be the case for a hobby project or a start-up MVP being created in a rush, but in reality, there are a few points we may want to take into account:
1. Software teams I work with are maintaining the usual review practices. Even if a feature is completely created by AI, it goes through the usual PR review process. The dev may choose "Accept All", and although I am not saying this is a good practice, the change still gets reviewed by a human.
2. From my experience, sub-agents intended for code and security review do a good job. It is even possible to use another model to review the code, which can provide a different perspective.
3. A year ago, code written by AI was failing to run the first time, requiring a painful joint troubleshooting effort. Now it works 95% of the time, but perhaps it is not optimal. Given the speed at which it is improving, it is safe to expect that in 6-9 months' time, it will not only work but will also be written to a good quality.
What's interesting is this might be the forcing function that finally brings formal verification into mainstream use. Tools like Lean and Coq have been technically impressive but adoption-starved. If unverified AI code is too risky to deploy in critical systems, organizations may have no choice but to invest in formal specs. AI writes the software, proof assistants verify it.
The irony: AI-generated code may be what makes formal methods economically viable.
Compiled languages like Go and Rust are my new default for projects on the backend, typescript with strict typing on for the frontend, and I foresee the popularity growing the more LLM use grows. The moment you let an LLM loose in a Javascript/Python codebase everything goes off the rails.
The goal is to make the code write-only and replace it with spec declaration? ... math ppl still cant accept the "x = 1;" statement :)
honestly think the answer isn't more tests, it's stricter contracts. like if your API has an OpenAPI spec, you can validate requests/responses against it automatically. the spec becomes the source of truth, not the tests, not the implementation
we've been doing this backwards for years. write code, write tests that match the code, realize six months later that both the code and tests were implementing the wrong behavior. but if you have a machine-readable contract (openapi, json schema, whatever), at least you can verify one dimension automatically
ngl this is why i'm skeptical of "AI will write all the code" takes. without formal specs, you're just getting really confident garbage that happens to pass its own tests. which tbh describes a lot of human-written code too lol
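the contract-as-source-of-truth idea can be sketched without any OpenAPI tooling. a hand-rolled check that a response matches a machine-readable contract (a real setup would use an actual schema validator; this just shows the shape):

```python
# A minimal machine-readable contract for one endpoint's response.
CONTRACT = {"id": int, "title": str, "published": bool}

def matches_contract(response: dict, contract: dict) -> bool:
    """True iff the response has exactly the contracted fields and types."""
    if set(response) != set(contract):
        return False
    return all(isinstance(response[k], t) for k, t in contract.items())

assert matches_contract({"id": 7, "title": "hi", "published": True}, CONTRACT)
# A string id violates the contract, regardless of what the tests expected.
assert not matches_contract({"id": "7", "title": "hi", "published": True}, CONTRACT)
```

the key property: the contract is written once, independently of both the implementation and the tests, so neither can drift to match the other unnoticed.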
The value proposition of a software engineer is no longer creating code; it is making sure that the ultra-fast code generating capability of the LLM fits the needs of the business.
BDD/TDD, code reviews, deep dives on what the code does, and educating yourself on design patterns... all of these help.
Currently, engineers work with loose specifications, which they translate into code. With the proposed approach, they would need to first convert those specifications into a formally verifiable form before using LLMs to generate the implementation.
But to be production-ready, that spec would have to cover all possible use-cases, edge cases, error handling, performance targets, security and privacy controls, etc. That sounds awfully close to being an actual implementation, only in a different language.
The harder problem is discovery: how do you build something entirely new, something that has no existing test suite to validate against?
Verification works because someone has already defined what "correct" looks like. Perhaps there is a spec, or a reference implementation, or a set of expected behaviours. The system just has to match them.
But truly novel creation does not have ground truth to compare against and no predefined finish line. You are not just solving a problem. You are figuring out what the problem even is.
In fact it will probably need to happen a few times PER org for the dust to settle. It will take several years.
Maybe in some other circles it is not like that, but I am sure that 90% of the industry measures output by the amount of value produced, and correct code is not the kind of value you can show to stockholders.
It is a sad state of affairs, dictated by a profit-seeking way of life (capitalism).
In addition, you can have one AI check another AI's code. I routinely copy/paste code between Claude, ChatGPT, and Gemini and have them check each other's code. This works very well. During the process I verify the code with my own eyes as well.
For example, I have discovered there is a big difference between prompting 'there is a look ahead bias' and 'there is a [T+1] look ahead bias', where the latter will cause it to not stop until it finds the [T+1] look ahead bias. It will start to write scripts that `.shift(1)` all values and do statistical analysis on the result set, trying to find the look ahead bias.
Now, I know there isn't look ahead bias, but the point is I was able to get it to iterate automatically trying different approaches to solve the problem.
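For readers outside quant finance: a T+1 look-ahead bias means a signal at day t secretly uses data from day t+1. A toy illustration of what the `.shift(1)` probe hunts for (plain lists, no pandas; made-up prices):

```python
prices = [100, 102, 101, 105]

# Buggy: the "signal" at day t peeks at day t+1's price (look-ahead bias).
leaky_signal = [prices[t + 1] - prices[t] for t in range(len(prices) - 1)]

# Fixed: lag by one so day t only sees data available through day t.
lagged = [None] + prices[:-1]          # what pandas' .shift(1) produces
safe_signal = [prices[t] - lagged[t] for t in range(1, len(prices))]

assert leaky_signal == [2, -1, 4]      # uses future information at each day
assert safe_signal == [2, -1, 4]       # same deltas, but known one day LATER
```

The two lists contain the same numbers; the bug is in the alignment. A leaky backtest credits day t with a delta it could only have known on day t+1, which is exactly what shifting everything by one and re-running the statistics exposes.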
The software is going to verify itself eventually, sooner than later.
It’s such an intoxicating copyright-abuse slot machine that a buddy who is building an ocaml+htmx tree editor told me “I always get stuck and end up going to the llm to generate code. Usually when I get to the html part.” I asked if he used a debugger before that, he said “that’s a good idea”.
Someone needs to be held accountable when things go wrong. Someone needs to be able to explain to the CEO why this or that is impossible.
If you want to have AI generate all the code for your business critical software, fine, but you better make sure you understand it well. Sometimes the fastest path to deep understanding is just coding things out yourself - so be it.
This is why the truly critical software doesn’t get developed much faster when AI tools are introduced. The bottleneck isn’t how fast the code can be created, it’s how fast humans can construct their understanding before they put their careers on the line by deploying it.
Ofc… this doesn’t apply to prototypes, hackathons, POCs, etc. for those “low stakes” projects, vibe code away, if you wish.
Sitting in your cubicle with your perfect set of test suites, code verification rules, SOPs, and code reviews, you won't want to hear this, but other companies will be gunning for your market: writing almost identical software to yours in the future from a series of prompts that generate the code they want fast, cheap, functionally identical, and quite possibly untested.
As AI gets more proficient and is given more autonomy (OpenClaw++), it will also generate directly executable binaries, completely replacing the compiler and making the output unreadable to a normal human, and it may even do this without prompts.
The scenario is terrifying to professional software developers, but other people will do this regardless of what you think, and run it in production, and I expect we are months or just a few years away from this.
Source code of the future will be the complete series of prompts used to generate the software, another AI to verify it, and an extensive test suite.
half sarcasm, half real-talk.
TDD is nice, but human coders barely do it. At least AI can do it more!
There is another route with Lean where Rust generates the Lean and there is proof done there but I haven't chased that down fully.
I think formal verification is a big win in the LLM era.
the bug will also be introduced in the formal spec, and people will still miss it by not looking.
i think fast response and fix time - anti-entropy - will win out against trying to increase the activation energy, to quote the various S3 talks. You need a cleanup method, rather than to prevent issues in the first place
If a piece of code is produced by an agent loop (prompt -> tool calls -> edits -> tests), the real artifact isn’t just the final code but the trace/pipeline that produced it.
In that sense verification might look closer to: checking constraints on the generator (tests/specs/contracts), verifying the toolchain used by the agent, and replaying generation under controlled inputs.
That feels closer to build reproducibility or supply-chain verification than traditional program proofs.
It does not matter whether the middleman is human or AI, or whether the code is written in a traditional language or with formal verification. Bugs will be there because humans failed to define bulletproof requirements.
The answer is no one, for most of the time.
If reviewing is the expensive part now, optimize for reviewability.
> Engineers spend more time writing specifications and models, designing systems at a higher level of abstraction, defining precisely what systems must do, what invariants they must maintain, what failures they must tolerate.
We do that already and the abstractions are very high. The other part is about knowing what the system is supposed to do way in advance, which is not how a lot of engineering is done because it is an exploratory problem. Very few of us write crypto or spend much time in a critical piece of code. And most importantly no user ever said if the software they buy is using proofs. Just like security these concerns are at the bottom of a barrel.
If people are having to specify things in TLA+ etc -- even with the help of an LLM to write that spec -- they will then have something they can point the LLM at in order for it to verify its output and assumptions.
I always felt like that's the main issue with unit testing. That's why I used it very rarely.
Maybe keeping tests in a separate module, not letting the agent see the source while writing tests, and not letting the agent see the tests while writing the implementation would help? They could just share the API and the spec.
And if tests fail, another agent with full context could decide whether the fix should be delegated to the coding agent or the testing agent.
This is an example of an article which 'buries the lede'†.
It should have started with the announcement of the new zlib autoformalization (!) https://leodemoura.github.io/blog/2026/02/28/when-ai-writes-... to get you excited.
Then it should have talked about the rest - instead of starting with rather graceless and ugly LLM-written generic prose about AI topics that to many readers is already tiresomely familiar and doubtless was tldr for even the readers who aren't repelled automatically by that.
† or in my terms, fails to 'make you care': https://gwern.net/blog/2026/make-me-care
Sigh. Is there any LLM solution for HN reader to filter out all top-level commenters that hadn't RTFA? I don't need the (micro-)shitstorms that these people spawn, even if the general HN algo scores these as "interesting".
I don’t think we’ll get those exact things back but we will see more specification and design than we do today.
Obvious question: why not? Let’s say you have competent devs, fair assumption. Maybe it’s because they don’t have enough time for solid QA? Lots of places are feature factories. In my personal projects I have more lines of code doing testing than implementation.
As a result, computer science students were lectured a lot (too much for most people's taste, it seems) about data modeling and not much about code itself, which was viewed as mundane and uninteresting until the US hacker culture finally took over in the late 2000s.
Turns out that the French were just right too early, like with the Minitel.
Depending on your success rate with agents, you can have one that validates multiple criteria or separate agents for different review criteria.
it's fun having LLMs because it makes it quite clear that a lot of testing has been cargo-culting. did people ever check whether the tests check for anything meaningful?
> At a certain point, the verification complexity takes off. You literally run out of time to verify everything.
Could you elaborate on this? Your post makes it sound as if the verification complexity diverged as the number of components n approaches a certain finite value n_0, but that seems unlikely to me. If, in contrast, the verification complexity remains finite at n_0, then verification should still be possible in finite time, shouldn't it? Yes, it might be a considerable amount of time but I assume your theorem doesn't predict lower bounds for the involved constants?
Either way, this entire discussion assumes n will increase as more and more software gets written by AI. Couldn't it also be the opposite, though? AI might also lead us to removing unnecessarily complex dependencies from our software supply chain or stripping them down to the few features we need.
I don't quite follow but I'd love to hear more about that.
Moreover, humans will still need to read even rigorously proved code if only to suss out performance issues. And training people to read Lean will continue to be costly.
Though, as the OP says, this is a very exciting time for developing provably correct systems programming.
All largely stemming from the fact that tests can't meaningfully see and interact with the page like the end user will.
A) They let you verify that implementation passes its spec, more or less.
B) They are a (trustworthy) description of how the system behaves, they allow you to understand the system better.
Of course, this remains largely theoretical for now, but it is an exciting possibility. Note high-level specifications often overlook performance issues, but they are likely sufficient for most scenarios. Regardless, we have formal development methodologies able to decompose problems to an arbitrary level of granularity since the 1990s, all while preserving correctness. It is likely that many of these ideas will be revisited soon.
I had a fun discussion when the client tried to change values... Why is it still 0? Didn't you test?
And that was at that time I had to dive into the code base and cry.
>> At a certain point, the verification complexity takes off. You literally run out of time to verify everything.

> Could you elaborate on this?
I plan to publish a thorough post with an interactive model. Whether human or AI, you are capacity constrained, and I glossed over `C` (capacity within a given timeframe) in the X post.
You are correct that verification complexity remains finite at n_0. The barrier is practical: n_0 is where V(n) exceeds your available capacity C. If V(n) = n^(1+k), then n_0 = C^(1/(1+k)). Doubling your capacity doesn't double n_0. It increases by a factor of 2^(1/(1+k)), which is always less than 2.
So the barrier always exists for, say, a given "dev year" or "token budget," and the cost to push it further grows superlinearly. It's not absolutely immovable, but moving it gets progressively harder. That's what I mean by "literally run out of time." At any given capacity, there is a finite n beyond which complete verification is not possible. Expanding capacity buys diminishing returns.
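The diminishing-returns claim can be checked numerically. With V(n) = n^(1+k) and capacity C, the largest verifiable system is n_0 = C^(1/(1+k)), and doubling C multiplies n_0 by 2^(1/(1+k)) < 2 (a sketch of the model as stated above, with an arbitrary k):

```python
def n0(C: float, k: float) -> float:
    """Largest n with V(n) = n**(1 + k) still within capacity C."""
    return C ** (1 / (1 + k))

k = 0.5
gain = n0(2_000_000, k) / n0(1_000_000, k)    # effect of doubling capacity
assert abs(gain - 2 ** (1 / (1 + k))) < 1e-9  # gain is 2^(1/(1+k)) ~ 1.587
assert gain < 2                               # doubling C buys less than 2x n_0
```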
> Either way, this entire discussion assumes n will increase as more and more software gets written by AI. Couldn't it also be the opposite, though?
You are getting at my core motivation for exploring this question.
Verification requires a definition of "done" and I wonder if it will ever be possible (or desirable) for AI to define done on its own, let alone verify it and simplify software based on its understanding of our needs.
You make a great point that we are not required to add more components and "go right" along the curve. We can choose to simplify, and that is absolutely the right takeaway. AI has made many people believe that by generating more code at a faster pace they are accomplishing more. But that's not how software productivity should be judged.
To answer your question about assumptions, while AI can certainly be prompted to help reduce n or k in isolated cases where "done" is very clear, I don't think it's realistic to expect this in aggregate for complex systems where "done" is subjective and dynamic.
I'm speaking mainly in the context of commercial software dev here, informed by my lived experience building hundreds of apps. I often say software projects have a fractal complexity. We're constantly identifying new needs and broader scope the deeper we go, not to mention pivots and specific customer asks. You rarely get to stand still.
I don't mean to be pessimistic, but my hunch is that complexity growth outpaces the rate of simplification in almost every software project. This model attempts to explain why that is so. And notably, simplification itself requires verification and so it is in a sense part of the verification cost, too.
Software has, since at least the adoption of "agile", created an industry culture of not just refusing to build to specs but of insisting that specs are impossible to get from a customer.
Honestly I think the other thing that is happening is that a lot of people who know better are keeping their mouths shut and waiting for things to blow up.
We’re at the very peak of the hype cycle right now, so it’s very hard to push back and tell people that maybe they should slow down and make sure they understand what the system is actually doing and what it should be doing.
For anyone who hasn't worked in a waterfall project and would like to try: You are kidding yourself.
There is no such thing as a perfect spec. Read that again and say it out loud.
It took humanity 50 years to figure out that perfect specs are impossible, unless of course you know exactly(!) what you need. And even then, the specs are never complete.
The reality is that we often don't know, and can't know, what we want, exactly, until we actually see and experience what we said we wanted. Then we adjust. Try again.
That's the reality for individuals already. By simply using logic we can deduce that entities made of more than one individual, aka companies, will not behave better. They just make it look better by giving you a nice document that says "we want this!", only to then come around when they see what they got, and claim "wait, we didn't mean it like that!".
That's just human nature. Not much we can do. AI will not change that.
So when some people think AI will deliver perfect software given perfect specs, and hence that we have to write the specs first: that is just missing the boat by a mile.
Agile is not a process but a human-friendly way of doing things. It simply says hey you want this? Let me build that and show you. Then let's adjust or move to the next thing. Rinse and repeat. Agile works because it matches how humans think and act. Step by step, day by day.
Writing the tests first and then writing code to pass the tests is no better than writing the code first and then writing tests that pass. What matters is that both the code and the tests are written independently, from the specs, not from one another.
I think that it is better not to have access to tests when first writing code, as to make sure to code the specs and not code the tests that test the specs as something may be lost in translation. It means that I have a preference for code first, but the ideal case would be for different people to do it in parallel.
Anyway, about AI: if an AI writes both the tests and the code, it will make sure they match no matter which comes first. It may even go back and forth between tests and code, but that doesn't mean it is correct.
When LLMs can assist with writing useful tests before having seen any implementation, then I’ll be properly impressed.
Now, it is actually completely possible to write UI code without any unit tests in a completely safe way. You use the functional core, imperative shell approach. When all your domain logic is in a fully tested, functional core, you can just go ahead and write "what works" in a thin UI shell. Good luck getting an LLM to rigidly conform to such an architecture, though.
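A minimal sketch of the pattern the comment names: all decisions live in a pure, fully testable core; the untested shell only does I/O and rendering (names and the cart example are illustrative):

```python
# Functional core: pure, deterministic, trivially unit-testable, no I/O.
def next_cart_total(total_cents: int, price_cents: int, qty: int) -> int:
    if qty < 0 or price_cents < 0:
        raise ValueError("negative quantity or price")
    return total_cents + price_cents * qty

# Imperative shell: thin "what works" UI glue; nothing here needs unit tests.
def on_add_to_cart_clicked(state: dict, price_cents: int, qty: int) -> None:
    state["total"] = next_cart_total(state["total"], price_cents, qty)
    print(f"Cart total: ${state['total'] / 100:.2f}")  # render to the user

state = {"total": 0}
on_add_to_cart_clicked(state, 250, 2)
assert state["total"] == 500
```

The architectural bet is that bugs hide in decisions, not in plumbing, so pushing every decision into the core leaves a shell too simple to be worth testing.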
Does it need to be HN-popular or a household name? Be in the news?
Or something that saves 50% of time by automating inane manual work from a team?
1. widely considered successful 2. made by humans from scratch in 2025
It looks like humans and AI are on par in this realm.
Another way of doing it is the agent just writes an algorithm to perform the task and runs it. In this world, tools are just APIs and the agent has to think through its entire process end to end before it even begins and account for all cases.
Only the latter is Turing complete, but the former approaches the latter as it improves.
Some performance issues (asymptotics) can be addressed via proof, others are routinely verified by benchmarking.
Not disagreeing with you here, but what ends up happening is the frontend works flawlessly in the browser/device being tested but has tons of bugs in the others. The best examples of this are most banking apps, corporate portals, etc. Honestly, you can get away without writing tests for the frontend. The backend, though, directly affects the security of the software, and you can't afford to skip the important tests there.
I recall a time, maybe around 2013-2017, when people were talking about 4 or 5 nines. But sometime around then the goalposts shifted, and instead of trying to make things as reliable as possible, it started becoming more about seeing how unreliable they can get before anyone notices or cares. It turns out people will suffer through a lot if there's some marginal benefit--remember what personal computers were like in the 1990s before memory protection? Vibe coding is just another chapter in that user hostile epic. Convenient reliability, like this author describes, (if it can be achieved) might actually make things better? But my money isn't on that.
> We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications … We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs.
https://aws.amazon.com/blogs/opensource/lean-into-verified-s...
If you want it to be a question of economics, I think the answer is in whether this approach is more economical than the alternative, which is having people run this substrate. There's a lot of enthusiasm here and you can't deny there has been progress.
I wouldn't be so quick to doubt. It costs nothing to be optimistic.
Isn't this a great use case for LLM tests? Have a "computer use agent" and then describe the parameters of the test as "load the page, then navigate to bar, expect foo to happen". You don't need the LLM to generate a test using puppeteer or whatever which is coupled to the specific dom, you just describe what should happen.
1. AI is meant to make us go faster, reviews are slow, the AI is smart, let it go.
2. There are plenty of AI maximizers who only think we should be writing design docs and letting the AI go to town on it.
Maybe, this might be a great time to start a company. Maximize the benefits of AI while you can without someone who has never written a line of code telling you that your job is going to disappear in 12 months.
All the incentives are against someone who wants to use AI in a reasonable way, right now.
It's like watching someone else solve a puzzle, or watching someone else play a game vs playing it yourself (at least that's half as interesting as playing it through)
Can you blame them? All the AI companies are saying “this does a better job than you ever could”, every discussion topic on AI includes at least one (totally organic, I’m sure) comment along the lines of “I’ve been developing software for over twenty years and these tools are going to replace me in six months. I’m learning how to be a plumber before I’m permanently unemployed.” So when Claude spits out something that seems to work with a short smoke test, how can you blame developers for thinking “damn the hype is real. LGTM”?
Then don’t even bother looking at C work or below.
I really want to say: "You are absolutely right"
But here is a problem I am facing personally (numbers are hypothetical).
I get 10-15 review requests a day from 4 teammates who are generating code by prompting, and I am doing the same, so you can guess we might have ~20 PRs/day to review. Each PR roughly touches 5-6 files, with 10-15 changed lines in each.
So you can estimate that I am looking at around 50-60 files, but I can't keep the context of the whole file because the change I am reviewing is scattered: 3 lines here, 5 lines there, and another 4 lines at the end.
How am I supposed to review all these?
I personally find the “move fast and break thing” ethos morally abhorrent, but that doesn’t make it go away.
The whole point of formal verification is that you don't test. You prove the program correct mathematically for any input.
> an agent that's technically correct but consistently misunderstands a whole class of user queries is invisible to any pre-deploy check
The agent isn't verifying the program. The agent is writing the code that proves the program correct. If the agent misunderstands, it fails to verify the program.
It’s not just dropping in a spec and getting a functional product. But more up-front design seems to work best. Very explicit design.
(Sorry.)
This is not a guaranteed outcome of requiring 100% coverage. Not that that's a good requirement, but responding badly to a bad requirement is just as bad.
One thing I want to mention here is that you should try to write a test that not only prevents this bug, but also similar bugs.
In our own codebase we saw that regression on fixed bugs is very rare, so writing a specific test for each one isn't the best way to spend your resources. Writing a broad test when possible is.
Not sure how LLMs handle that case to come up with a proper test.
In a team setting I try to do the same thing and invite team members to start writing the initial code by hand only. I suspect if an urgent deliverable comes up though, I will be flexible on some of my ideas.
If boilerplate was such a big issue, we should have worked on improving code generation. In fact, many tools and frameworks exist that did this already:
- rails has fantastic code generation for CRUD use cases
- IntelliJ IDEs have been able to do many types of refactors and class generation that included some of the boilerplate
I haven't reached a conclusion on this train of thought yet, though.
If you could pause a bit from being awed by your own perceived insightfulness, you would think just a bit harder and realize that LLMs can generate hundreds of thousands of lines of code that no human could ever verify within a finite amount of time. Human-written software is human-verifiable, AI-assisted human-written software is still human-verifiable to some extent, but purely AI-written software can no longer be verified by humans.
And there is an element of uncertainty. Am I just bad at using these new tools? To some degree probably, but does that mean I'm totally wrong and we should be going this fast?
I know I'm psychologizing the agent. I can't explain it in a different way.
Assigning different agents to have different focuses has worked for me. Especially when you task a code reviewer agent with the goal of critically examining the code. The results will normally be much better than asking the coder agent who will assure you it's "fully tested and production ready"
Although TDD says that you should only write one test before implementing it, which encourages spec writing to be an iterative process.
Writing the spec after implementation means that you are likely to have forgotten the nuance that went into what you created. That is why specs are written first. Then the nuance is captured up front as it comes to mind.
That isn’t necessarily a hit against them - they make an LLM coding tool and they should absolutely be dogfooding it as hard as they can. They need to be the ones to figure out how to achieve this sought-after productivity boost. But so far it seems to me like AI coding is more similar to past trends in industry practice (OOP, Scrum, TDD, whatever) than it is different in the only way that’s ever been particularly noteworthy to me: it massively changes where people spend their time, without necessarily living up to the hype about how much gets done in that time.
I do agree it's a weird metric to have, but I can't think of a better one outside of "business", and even that seems like a poor rubric, because the vast majority of people care about things that aren't businesses. And if this "life altering" technology basically amounts to creating digital slaves, then maybe we as a species shouldn't explore the stars.
They aren't good enough yet at all.
I got an agent to use the Windows UIA with success for a feedback loop, and it got the code from not working very well to basically done overnight. But without the MCP having good feedback and tagged/ID'd buttons and so on, the computer use was just garbage.
This is the ONLY point of software unless you’re doing it for fun.
Bluntly though, if what you were doing was CRUD boilerplate then yeah it is going to just be a review fest now, but that kind of work always was just begging to be automated out one way or another.
Also works with planning before any coding sessions. Gemini + Opus + GPT-xhigh works to get a lot of questions answered before coding starts.
Yes I absolutely can and do blame them
It might seem hopeless. But on the other hand the innate human BS detector is quite good. Imagine the state of us if we could be programmed by putting billions of dollars into our brains and not have any kind of subconscious filter that tells us, hey this doesn’t seem right. We’ve already tried that for a century. And it turns out that the cure is not billions of dollars of counter-propaganda consisting of the truth (that would be hopeless as the Truth doesn’t have that kind of money).
We don’t have to be discouraged by whoever replies to you and says things like, oh my goodness the new Siri AI replaced my parenting skills just in the last two weeks, the progress is astounding (Siri, the kids are home and should be in bed by 21:00). Or by the hypothetical people in my replies insisting, no no people are stupid as bricks; all my neighbors buy the propaganda of [wrong side of the political aisle]. Etc. etc. ad nauseam.
Even state of the art AI models seem to have no taste, or sense of 'hang on, what's even the point of this test' so I've seen them diligently write hundreds of completely pointless tests and sometimes the reason they're pointless is some subtle thing that's hard to notice amongst all the legit looking expect code.
For me, with any new release of my winery production software, I re-ran every job put into my clients' production systems from job #1, about 200,000 jobs. Before going into production we checked that all the balances, product compositions and inventory from the new software matched what the old system currently said. It takes about an hour to re-run fifteen years of production, and even a milliliter or milligram difference was enough to trigger a stop/check. We had an extensive set of input data I could also feed in, to ensure mistakes were caught there too.
I expect other people do it their own way, but as a business this would be the low bar of testing I would expect to be done.
A large part of it is probably just using it as a better search. Like "How do I define a new data type in Go?".
They still can't do math.
Or you can be a grifter and make some AI wrapper yourself and cash out with some VC investment. So good time for a new company either way.
I've been using LLMs for 14+ months now and they've exceeded my expectations.
AI changes none of this. If you’re putting up PRs and getting comments, you need to slow down. Slow is smooth, and smooth is fast.
I’ll caveat this with that’s only if your employer cares about quality. If you’re fine passing that on to your users, might as well just stop reviewing all together.
Just going ahead and piling up PRs or skipping the review process is of course not recommended.
The less any of those applies, the more costly it is to figure it out as you go along, because accounting for design changes can become something of a game of crack the whip. Iterative design is still important under such circumstances, but it may need to be a more thoughtful form of iteration that’s actively mindful about which kinds of design decisions should be front-loaded and which ones can be delayed.
I fear thinking about problem solving in this manner to make llms work is damaging to critical thinking skills.
I have personally outpaced some of my more impatient colleagues by spending extra time up front setting up test harnesses, reading specifications, etcetera. When done judiciously it pays off in time scales of weeks or less.
The problem is information fatigue from all the agents+code itself.
Which, funny enough, is a pretty good reflection of how I thought of the people writing in those languages before LLMs: One considers testing a complete afterthought and in the wild it is rare to find tests at all, and when they are present they often aren't good. Whereas the other brings testing as a first-class feature and most codebases I've seen generally contain fairly decent tests.
No doubt LLM training has picked up on that.
I would say that which comes first, specs or code, depends on the context. If you are implementing a standard, the specs of the standard obviously come first, but if you are iterating, say on a user interface, it can make sense to start with the code so that you have working prototypes. You can then write formal documents and tests later, when you are done prototyping, for regression control.
But I think that leaning on tests is not always a good idea. For example, let's continue with the HTTP server. You write a test suite, but there is a bug in your tests; say you confuse error 404 and 403. Then you write your code, correctly, run the tests, and see that one of them fails, telling you that you returned 404 and not 403. You don't think much, after all "the tests are the specs", and change the code. Congratulations, you are now making sure your code is wrong.
Of course, the opposite can and does happen: writing the code wrong and making the tests pass without thinking about what you are actually testing. I believe that's why people came up with the idea of TDD, but for me, test-first flips the problem without solving it. I'd say the only advantage, if it is one, is that it prevents taking a shortcut and releasing untested code, by moving tests out of the critical path.
But outside of that, I'd rather focus on the code, so if anything is to be "the spec", it's the code. It is the most important artifact, because it is the actual product; everything else is secondary. I don't mean unimportant, I mean that from the point of view of users, it is better for the test suite to be broken than for the code to be broken.
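A minimal Lean sketch of that failure mode, with hypothetical names: the handler below is correct, but a test suite that swaps the two status codes would push you to "fix" it into being wrong.

```lean
-- Toy handler (hypothetical): 404 when the resource is missing,
-- 403 when it exists but access is forbidden.
def status (resourceExists authorized : Bool) : Nat :=
  if !resourceExists then 404
  else if !authorized then 403
  else 200

#guard status false true = 404  -- the correct expectation
-- A buggy suite would instead assert `status false true = 403`;
-- changing the correct code until that assertion passes bakes the bug in.
```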
The real blockers and time sinks were always bad/missing docs and examples. LLMs bridge that gap pretty well, and of course they do. That's what they're designed to be (language models), not an AGI!
I find it baffling how many workplaces are chasing perceived productivity gains that their customers will never notice instead of building out their next gen apps. Anyone who fails to modernize their UI/UX for the massive shift in accessibility about to happen with WebMCP will become irrelevant. Content presentation is so much higher value to the user. People expect things to be reliable and simple. Especially new users don't want your annoying onboarding flow and complicated menus and controls. They'll just find another app that gives them what they want faster.
Pretending that only they can save the world, declaring they don't use AI while secretly using it to build a so-called "AI startup", and then going on the media doomsaying that "AGI" is coming.
At this point in this cycle in AI, "AGI" is just grifting until IPO.
That’s management, which is not necessarily bad by itself.
What is bad is if you have to be both the manager giving the requirements and the IC responsible for the output. That’s the worst of both worlds.
If it was possible to truly vibe code in the same way a product manager asks a team to build something then we wouldn’t need to have this discussion.
In reality you can never truly trust the output. Ultimately you’re on the hook if production breaks apart, not the LLM.
But when looking at the PR changes, you don't always see the whole picture, because the review subjects (changed lines) are scattered across files and methods, and GitHub also shows methods and files only partially, making it even more difficult to quickly spot the context around the updated lines.
It's a difficult problem, because even if GitHub showed the whole body of the updated method or file, you still wouldn't see the grand picture.
For example: A (calls) -> B -> C -> D
If you made changes in D, how do you know the side effects on B? What if it broke A?
And I completely agree that requirement proximity estimation is a critical skill. I do think estimating requirement proximity is a much easier task than estimating time.
Code Metal recently raised $125 million to rewrite defense industry code using AI. Google and Microsoft both report that 25–30% of their new code is AI-generated. AWS used AI to modernize 40 million lines of COBOL for Toyota. Microsoft’s CTO predicts that 95% of all code will be AI-generated by 2030. The rewriting of the world’s software is not coming. It is underway.
Anthropic recently built a 100,000-line C compiler using parallel AI agents in two weeks, for under $20,000. It boots Linux and compiles SQLite, PostgreSQL, Redis, and Lua. AI can now produce large-scale software at astonishing speed. But can it prove the compiler correct? Not yet.
No one is formally verifying the result.
Andrej Karpathy described the pattern: “I ‘Accept All’ always, I don’t read the diffs anymore.” When AI code is good enough most of the time, humans stop reviewing carefully. Nearly half of AI-generated code fails basic security tests, and newer, larger models do not generate significantly more secure code than their predecessors. The errors are there. The reviewers are not. Even Karpathy does not trust it: he later outlined a cautious workflow for “code [he] actually care[s] about,” and when he built his own serious project, he hand-coded it.
Consider what happens at scale. A single bug in OpenSSL — Heartbleed — exposed the private communications of millions of users, survived two years of code review, and cost the industry hundreds of millions of dollars to remediate. That was one bug, introduced by one human, in one library. AI is now generating code at a thousand times the speed, across every layer of the software stack, and the defenses we relied on (code review, testing, manual inspection) are the same ones that missed Heartbleed for two years.
The Harvard Business Review recently documented what it calls “workslop”: AI-generated work that looks polished but requires someone downstream to fix. When that work is a memo, it is annoying. When it is a cryptographic library, it is catastrophic. As AI accelerates the pace of software production, the verification gap does not shrink. It widens. Engineers stop understanding what their systems do. AI outsources not just the writing but the thinking.
The threat extends beyond accidental errors. When AI writes the software, the attack surface shifts: an adversary who can poison training data or compromise the model’s API can inject subtle vulnerabilities into every system that AI touches. These are not hypothetical risks. Supply chain attacks are already among the most damaging in cybersecurity, and AI-generated code creates a new supply chain at a scale that did not previously exist. Traditional code review cannot reliably detect deliberately subtle vulnerabilities, and a determined adversary can study the test suite and plant bugs specifically designed to evade it. A formal specification is the defense: it defines what “correct” means independently of the AI that produced the code. When something breaks, you know exactly which assumption failed, and so does the auditor.
Poor software quality already costs the U.S. economy $2.41 trillion per year, according to a 2022 study by the Consortium for Information & Software Quality. That number was calculated before AI began writing a quarter or more of new code at leading companies. Chris Lattner, the creator of LLVM and Clang, put it bluntly: AI amplifies both good and bad structure. Bad code at AI speed becomes “incomprehensible nightmares.” As AI generates an increasing share of the world’s critical infrastructure (financial systems, medical devices, defense, transportation), unverified code becomes a systemic risk, not just a quality problem.
Most critical software today works, maintained by talented engineers using testing, code review, and hard-won experience. The problem is not that everything is broken. It is that AI is changing the scale and speed of software production faster than our ability to verify it. What works at human pace may not survive AI pace.
It will get worse, and it will get worse fast, unless verification scales with generation.
Testing and proof are complementary. Testing, including property-based testing and fuzzing, is powerful: it catches bugs quickly, cheaply, and often in surprising ways. But testing provides confidence. Proof provides a guarantee. The difference matters, and it is hard to quantify how high the confidence from testing actually is. Software can be accompanied by proofs of its correctness, proofs that a machine checks mechanically, with no room for error. When AI makes proof cheap, it becomes the stronger path: one proof covers every possible input, every edge case, every interleaving. A verified cryptographic library is not better engineering. It is a mathematical guarantee.
Consider an example. An AI rewrites a TLS library. The code passes every test. But the specification requires constant-time execution: no branch may depend on secret key material, no memory access pattern may leak information. The AI’s implementation contains a subtle conditional that varies with key bits, a timing side-channel invisible to testing, invisible to code review. A formal proof of constant-time behavior catches it instantly. Without the proof, that vulnerability ships to production. Proving such low-level properties requires verification at the right level of abstraction, which is why the platform must support specialized sublanguages for reasoning about timing, memory layout, and other hardware-level concerns.
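One way to see why such a property is beyond testing: model the implementation as returning its step count alongside its result, and state constant-time execution as independence of that count from the secret. This is an illustrative Lean sketch, not how real side-channel verification is phrased.

```lean
-- Early-exit comparison: the step count reveals how long the matching
-- prefix of the secret is, a classic timing side-channel.
def cmpLeaky : List Bool → List Bool → Bool × Nat
  | [], [] => (true, 0)
  | s :: ss, i :: is =>
      if s = i then
        let (ok, steps) := cmpLeaky ss is
        (ok, steps + 1)
      else (false, 1)
  | _, _ => (false, 0)

-- The constant-time specification, which cmpLeaky cannot satisfy:
--   ∀ s₁ s₂ input, s₁.length = s₂.length →
--     (cmpLeaky s₁ input).2 = (cmpLeaky s₂ input).2
-- Any proof attempt fails, surfacing the leak that tests never sample.
```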
The Claude C Compiler illustrates the other side: it optimizes for passing tests, not for correctness. It hard-codes values to satisfy the test suite. It will not generalize. Property-based testing would likely catch this particular case, but the general problem remains: for any fixed testing strategy, a sufficiently adversarial system can overfit to it. A proof cannot be gamed. It covers all inputs by construction.
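To make the contrast concrete, here is a deliberately overfit toy in Lean (names are illustrative): it satisfies its entire one-case test suite yet implements nothing, while a universally quantified theorem leaves it nowhere to hide.

```lean
-- A "sort" hard-coded to its only test input.
def sortOverfit (xs : List Nat) : List Nat :=
  if xs = [3, 1, 2] then [1, 2, 3] else xs

#guard sortOverfit [3, 1, 2] = [1, 2, 3]  -- the whole test suite passes

-- A specification quantifies over every input, so no lookup table helps;
-- only a genuine sort admits a proof of something like
--   theorem sorted (xs : List Nat) : (mySort xs).Pairwise (· ≤ ·)
```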
The friction of writing code manually used to force careful design. AI removes that friction, including the beneficial friction. The answer is not to slow AI down. It is to replace human friction with mathematical friction: let AI move fast, but make it prove its work. The new friction is productive: writing specifications and models, defining precisely what “correct” means, designing before generating.
The obstacle has always been cost. Writing proofs by hand was too expensive to apply broadly. AI changes the economics. Proof is becoming practical at scale.
Most people think of verification as a cost, a tax on development, justified only for safety-critical systems. That framing is outdated. When AI can generate verified software as easily as unverified software, verification is no longer a cost. It is a catalyst.
The value is not in the verification workforce. It is in what verified delivery enables. Consider a company delivering ML kernels for new hardware. Today, months go to testing and qualification. When AI writes the kernel and proves it correct in one pass, that timeline collapses to hours. A provably correct hardware design delivered in weeks rather than the year it currently takes changes the economics of an entire industry.
Verification, testing, and specification have always been the bottleneck, not implementation. Good engineers know what they want to build. They just cannot afford to prove it correct. If that cost drops to near zero, every domain where correctness matters accelerates. Aerospace, automotive, and medical device certification currently takes years of qualification effort. Cloud providers invest similar effort qualifying security-critical services and cryptographic implementations. Verified code generation could collapse that timeline to weeks. Hardware verification, where a single bug can cost hundreds of millions of dollars, benefits equally.
This acceleration requires specifications: precise descriptions of what the software must do. As AI takes over implementation, specification becomes the core engineering discipline. Writing a specification forces clear thinking about what a system must do, what invariants it must maintain, what can go wrong. This is where the real engineering work has always lived. Implementation just used to be louder.
Writing specifications is not always easy, but it is easier than writing the optimized implementation. And a powerful shortcut exists: an inefficient program that is obviously correct can serve as its own specification. User and AI co-write a simple model, AI writes an efficient version, and proves the two equivalent. The hard part shifts from implementation to design. That is the right kind of hard.
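A minimal Lean illustration of the pattern, with hypothetical names: the naive model is the specification, the standard library's accumulator-based `List.reverse` plays the role of the efficient version, and a short proof ties them together.

```lean
-- The obviously correct (but quadratic) model serves as the specification.
def revModel : List Nat → List Nat
  | []        => []
  | x :: rest => revModel rest ++ [x]

-- The efficient implementation is the tail-recursive List.reverse from the
-- standard library; the theorem checks it agrees with the model everywhere.
theorem reverse_eq_model (xs : List Nat) : xs.reverse = revModel xs := by
  induction xs with
  | nil => rfl
  | cons x rest ih => simp [revModel, ih]
```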
This is not limited to critical systems. Any non-trivial engineering project where bugs are expensive (which is most of them) accelerates when correctness is cheap.
What would a verification platform for the AI era require?
A small, trusted kernel: a few thousand lines of code that check every step of every proof mechanically. Everything else (the AI, the automation, the human guidance) is outside the trust boundary. Independent reimplementations of that kernel, in different languages (Lean, Rust), serve as cross-checks. You do not need to trust a complex AI or solver; you verify the proof independently with a kernel small enough to audit completely. The verification layer must be separate from the AI that generates the code. In a world where AI writes critical software, the verifier is the last line of defense. If the same vendor provides both the AI and the verification, there is a conflict of interest. Independent verification is not a philosophical preference. It is a security architecture requirement. The platform must be open source and controlled by no single vendor.
The platform must be both a programming language and a theorem prover, with code and proofs in one system, with no translation gap. It needs a rich and extensible tactic framework that gives AI structured, incremental feedback: here is the current goal, here are the hypotheses available, here is what changed after each step. AI must control the proof search, not delegate to a black box.
The alternative, push-button solvers that return a binary pass or fail with no intermediate state, gives AI nothing to learn from and no way to guide the search. Worse, proofs that rely on heuristic solvers often break when the solver updates or when developers make small changes to how they write their specifications, even when the changes are logically equivalent. You cannot build a reliable AI pipeline on a foundation that is not reproducible. (I discuss this in detail in a recent Stanford talk.)
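For a toy example of what structured feedback looks like: after each tactic step Lean reports the remaining goal and hypotheses, which an agent can read and act on, whereas a push-button tactic such as `omega` would close the same goal in one opaque step.

```lean
example (a b : Nat) : a + b = b + a := by
  induction b with
  -- Lean reports the goal at each step; an agent steers using this state.
  | zero =>
    -- ⊢ a + 0 = 0 + a
    simp
  | succ n ih =>
    -- ih : a + n = n + a  ⊢ a + (n + 1) = (n + 1) + a
    simp [Nat.add_succ, Nat.succ_add, ih]
```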
It needs the largest possible library of formalized knowledge to build on. Verifying software is mathematics: the same reasoning that proves a theorem in abstract algebra proves that a cryptographic library correctly implements its specification. A platform that serves mathematicians and engineers is not a compromise. It is the recognition that rigorous reasoning is one discipline, whether applied to prime numbers or protocol correctness.
And it needs deep extensibility. Users and AI must be able to write extensions that access the system’s internals, building custom tools, automation, and domain-specific reasoning engines. This is already happening: AI agents build their own proof strategies on top of the platform. The platform adapts to its users, not the other way around.
The AI community has already made its choice. AlphaProof (Google DeepMind), Aristotle (Harmonic), SEED Prover (ByteDance), Axiom, Aleph (Logical Intelligence), and Mistral AI all build on Lean. Every major AI reasoning system that has achieved medal-level performance at the International Mathematical Olympiad used Lean. No competing platform was used by any of them. The future is much larger than today’s early applications.
Lean is backed by Mathlib, the largest coherent body of formalized mathematics ever created: over 200,000 formalized theorems and 750 contributors. Five Fields medalists engage with Lean. The same platform serves mathematicians formalizing theorems and engineers verifying production systems. ACM SIGPLAN recognized this convergence with its 2025 Programming Languages Software Award: “Lean has become the de facto choice for AI-based systems of mathematical reasoning.”
Enterprise teams already use Lean in production: AWS verified its Cedar authorization policy engine, and Microsoft is using Lean to verify its SymCrypt cryptographic library. Over 8,000 GitHub repositories contain Lean code. Over 200,000 users have installed the programming environment. More than 700 people are active in the Lean Zulip channel every day. Research groups worldwide contribute to the ecosystem. As Chris Lattner observed, manual rewrites and translation work are becoming AI-native tasks. AI will rewrite the world’s codebase. The platform it does so on matters enormously.
Lean already produces performance comparable to Haskell and OCaml. When higher performance is essential, Lean models can be translated into efficient imperative code embedded in Lean, with clean semantics and without C’s undefined behavior. We are actively working on closing the remaining gap for performance-critical code. The real comparison is not Lean versus C. It is verified code versus unverified code.
Lean is the result of over twelve years of continuous development. We designed and built every layer from scratch: the trusted kernel, the compiler, the language server, the IDE, the proof automation. The team is 20 people. The community independently chose Lean: mathematicians, AI researchers, and enterprise engineers, all building on the same platform.
At the Lean FRO, Kim Morrison, a Senior Research Software Engineer, recently ran an experiment that went well beyond our expectations. An AI agent converted zlib, a widely used C compression library embedded in countless systems, to Lean, with minimal human guidance. No special tooling was built. It was Claude, a general-purpose AI, with no special training for theorem proving, out of the box. The workflow had four steps. First, the AI produced a clean, readable Lean implementation of the zlib compression format, including the DEFLATE algorithm at its core. Second, the Lean version passed the library’s existing test suite, confirming behavioral equivalence. Third, key properties were stated and proved, not as tests, but as mathematical theorems. The capstone theorem:
```lean
theorem zlib_decompressSingle_compress (data : ByteArray) (level : UInt8)
    (hsize : data.size < 1024 * 1024 * 1024) :
    ZlibDecode.decompressSingle (ZlibEncode.compress data level) = .ok data
```
This is a machine-checked proof that decompressing a compressed buffer always returns the original data, at every compression level, for the full zlib format. The size precondition (hsize) is a proof engineering artifact: it bounds the fuel the AI used during proof construction. A human engineer would likely eliminate it; the AI, working without guidance, chose a bound that was sufficient to close the proof. As both AI capabilities and platform tooling improve, artifacts like this will disappear. The raw DEFLATE and gzip framing roundtrips are proved as well. Fourth, an optimized version is being developed and proved equivalent to the verified model. This work is ongoing.
The entire proof was constructed with minimal human guidance and no specialized tooling. The result demonstrates that AI can convert production software to a verified form today. This was not expected to be possible yet.
zlib is a sequential algorithm with a clean RFC specification, simpler than a database engine or a distributed system. The gap between this result and a verified software stack is real. But until recently, no one had demonstrated AI-generated proofs about production software. The trajectory matters more than the starting point. Crucially, the zlib proof was produced by a general-purpose AI with no specialized training for theorem proving. No custom model was needed. This means the barrier to verified software is no longer AI capability. It is platform readiness. As general-purpose AI improves, the bottleneck shifts entirely to the verification platform: how rich is the feedback it gives AI, how powerful is the automation, how large is the library of prior knowledge to build on.
The zlib workflow also reveals something testing structurally cannot do. When your tests and your code share the same wrong assumptions, testing finds nothing. The act of proving forces every assumption to be explicit: a designer states a property, the AI turns it into a precise formal statement, and then either constructs a proof or discovers that the specification needs revision. Wrong assumptions surface as proof obligations that cannot be discharged. Specifications become sharper through the attempt to prove them.
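A small self-contained Lean sketch of that dynamic, with made-up definitions: the tempting first spec, "decompression never fails," is false and no proof of it can be closed (`decompress []` already fails), which forces the revised, provable statement quantifying over compressed inputs.

```lean
-- Toy "codec": compression prefixes the payload with its length.
def compress (data : List Nat) : List Nat :=
  data.length :: data

def decompress (bytes : List Nat) : Option (List Nat) :=
  match bytes with
  | n :: rest => if rest.length = n then some rest else none
  | []        => none

-- The revised specification: roundtripping any *compressed* input succeeds.
theorem decompress_compress (data : List Nat) :
    decompress (compress data) = some data := by
  simp [compress, decompress]
```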
The approach extends beyond sequential algorithms to distributed systems, where the hardest bugs live. Ilya Sergey’s group at NUS built Veil, a distributed protocol verifier on top of Lean that combines model checking with full formal proof. When a property does not hold, Veil generates concrete counterexamples: actual execution traces that demonstrate the failure, just as model checkers do. When it does hold, Veil produces a full formal proof, not just the absence of counterexamples up to some bound. Veil has verified protocols including Rabia, a randomized consensus protocol, proving agreement and validity for any number of nodes. During verification, the team discovered an inconsistency in a prior formal verification of Rabia that had gone undetected across two separate tools. For simpler protocols, Claude Code can infer most of the required auxiliary invariants within minutes.
The broader picture confirms the trajectory. AI systems using Lean have solved open mathematical problems that resisted decades of human effort. A single mathematician working with an AI agent formalized the full Prime Number Theorem in three weeks: 25,000 lines of Lean, over 1,000 theorems. The previous formalization took over a year and dozens of contributors. Fields Medalist Maryna Viazovska’s proof that the E8 lattice is the optimal sphere packing in eight dimensions has been formally verified in Lean, with an AI agent completing the final steps. Google DeepMind’s AlphaEvolve discovered a new mathematical construction, and AlphaProof formalized the proof in Lean, a complete AI pipeline from discovery to verified proof.
The productivity gap is widening: teams with the best tools are pulling further ahead while others stagnate. Verification will be a decisive advantage.
Where this leads is clear. Layer by layer, the critical software stack will be reconstructed with mathematical proofs built in. The question is not whether this happens, but when.
It is already beginning. The ecosystem is growing. Startups are training AI on proof data, building verification tools, and growing talent. This is phase one: verifying existing code using Lean. Phase two is reconstruction: building the stack in Lean from the ground up, with proofs built in from the start. Both phases happen on the same platform. Both strengthen it.
The target is the foundation of the modern software stack: cryptography, because everything else trusts it. Core libraries (data structures, algorithms, compression) because they are the building blocks of all software. Storage engines like SQLite, embedded in every device on earth. Parsers and protocol implementations (JSON, HTTP, DNS, certificate validation) because every message passes through them. And compilers and runtimes, because they build everything else.
Consider SQLite, embedded in every smartphone, every browser, every major operating system. A verified SQLite carries proofs that a crash during a write cannot corrupt the database, that concurrent readers never see partial transactions. Testing tools like Jepsen and stateful property-based testing find many such bugs in practice, but they provide confidence, not a guarantee. Mathematical proof covers every crash point and every interleaving.
Each verified component is a permanent public good. Unlike proprietary software, a verified open-source library cannot be degraded, cannot have its guarantees quietly revoked, and cannot be held hostage by a single company’s business decisions. The proofs are public. Anyone can audit them, build on them, or replace the implementation while preserving the guarantees. This is infrastructure in the deepest sense: it raises the floor for everyone. When you build on this stack, you inherit the proofs. Each layer is independent. The platform and methodology come first; then the work parallelizes naturally.
Once verified components are cheap, you compose them with confidence. Today, integration is where most bugs live: the boundaries between components, the assumptions that do not quite match. Integration testing exists precisely because component tests do not compose: passing unit tests on two components tells you nothing about whether they work together. Verified interfaces are fundamentally different. When each component carries a proof against a shared specification, the composition is guaranteed correct by construction. You can verify each piece independently and know the whole is sound. This advantage scales superlinearly: the larger the system, the wider the gap between tested and verified.
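A toy Lean sketch (not from the article, illustrative names only) of what "correct by construction" composition means: each component is proved against its own contract, and the composed pipeline inherits the guarantee without any integration testing.

```lean
-- Toy sketch: two components, each verified against its contract.
def encode (n : Nat) : Nat := n + 1
def decode (n : Nat) : Nat := n - 1

-- Each component's contract, proved independently:
theorem decode_encode (n : Nat) : decode (encode n) = n := by
  simp [encode, decode]

-- Composition inherits the guarantee by construction,
-- with no new integration tests:
theorem pipeline_id (n : Nat) : decode (encode (decode (encode n))) = n := by
  simp [decode_encode]
```

Real verified components prove far richer properties than round-tripping a `Nat`, but the structural point is the same: the composition proof reuses the component proofs rather than re-examining the implementations.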
The goal is a verified software stack: open source, freely available, mathematically guaranteed correct. Developers building critical systems choose verified components the way they choose open-source libraries today, except these carry proofs, not just tests. No single team builds this alone, and no single platform is guaranteed to be the right one forever. What matters is that verified software becomes the norm. If something better than Lean emerges, the mission has still succeeded. It is an open challenge for the entire community.
The role of the engineer changes, but it does not shrink. Engineers spend more time writing specifications and models, designing systems at a higher level of abstraction, defining precisely what systems must do, what invariants they must maintain, what failures they must tolerate. The work becomes more creative, not less: designing before generating, thinking before building. Productivity comes not from generating more code, but from generating code that is provably correct on the first attempt.
Specifications also raise a question that is not purely technical: who decides what “correct” means? A specification for a medical device, a voting system, or an AI safety monitor encodes values, not just logic. Making specifications formal and public does not resolve these questions, but it makes them explicit and auditable in a way that buried code never can. AI itself can bridge the accessibility gap: the same systems that generate verified code can explain specifications in plain language, making formal guarantees accessible beyond the experts who write them.
The AI industry has a strategic stake in this outcome. An AI that generates provably correct code is qualitatively different from one that merely generates plausible code. Verification transforms AI code generation from a productivity tool into a trust infrastructure.
AI is going to write a great deal of the world’s software. It will advance mathematics, science, and engineering in ways we cannot yet anticipate. The question is whether anyone can prove the results correct.
Lean is open source and freely available at lean-lang.org. To learn Lean, start with our documentation. To join the community, visit the Lean Zulip. The Lean Focused Research Organization builds and maintains the platform. If you want to support this work, contact us at contact@lean-fro.org.
I do trust them, but the code is not theirs; the prompt is. What if I trust them, but heavy LLM use has made their brains lazy and they have started missing edge cases? Who should review the code, me or them?
At the beginning I relied on that trust and did quick scans, but eventually I noticed they became uninterested in the craft and started submitting LLM output as-is. I still trust them as good-faith actors, but not their brains anymore (and not my own either).
Also, the assumption is based on an ideal team where everyone behaves in good faith. But that is not the case in corporations and big tech, especially when incentives are aligned with the "output/impact" you are making. A lot of the time, people get promoted before the impact of their past bad judgement lands, so why craft perfect code?
It is more like a meta spec. You still have to write a final spec that applies to your particular technical constraints, business needs, etc. RFC 7231 specifies the minimum amount necessary to interface with the world, but an actual program to be deployed into the wild requires much, much more consideration.
And for that, since you have the full picture that is not available to a meta spec, you will logically write it in a language that both humans and computers can understand. For the best results, that means something like Lean, Rocq, etc. However, in the real world you likely have to deal with middling developers straight out of learn-to-code bootcamps, so tests are the practical middle ground.
> I don't know, you confuse error 404 and 403.
Just like you would when writing RFC 7231? But that's what the RFC process is for. You don't have to skip the RFC process just because the spec also happens to be machine readable. If you are trying to shortcut the process, then you're going to have this problem no matter what.
But even when shortcutting the process, it is still worthwhile to have written your spec in a machine-readable format, as that means any change to the spec automatically identifies all the places you need to change in the implementation.
> writing the code wrong and making passing test without thinking about what you actually testing
The much more likely scenario is that the code is right, but a mistake in the test leads it to not test anything. Then, years down the road after everyone has forgotten or moved on, when someone needs to do some refactoring there is no specification to define what the original code was actually supposed to do. Writing the test first means that you have proven that it can fail. That's not the only reason TDD suggests writing a test first, but it is certainly one of them.
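A hypothetical Python sketch of that failure mode, with made-up names: a test with a mistake in it that can never fail, which writing the test first (and watching it fail once) would have exposed.

```python
# Hypothetical example of a vacuous test that "passes" without testing anything.
def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def test_leap_buggy():
    # Bug: asserts the function OBJECT, not its result. This assertion is
    # always truthy, so it would keep passing even if is_leap_year were
    # rewritten to return garbage.
    assert is_leap_year

def test_leap_fixed():
    # Under TDD this test is written first and seen failing before the
    # implementation exists, which proves it is capable of failing at all.
    assert is_leap_year(2000)
    assert not is_leap_year(1900)
    assert is_leap_year(2024)
```

Years later, `test_leap_buggy` still shows green in CI while documenting nothing about what the code was supposed to do.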
> It is the most important, because it is the actual product
Nah. The specification is the actual product; it is what lives for the lifetime of the product. It defines the contract with the user. Implementation is throwaway. You can change the implementation code all day long and as long as the user contract remains satisfied the visible product will remain exactly the same.
I'm thinking HVAC or painting lines in parking lots. HVAC because I can program smart systems and parking lot lines because I can use google maps and algos to propose more efficient parking lot designs to existing business owners.
There is that paradox (Jevons) where, when something becomes cheaper, demand for it grows, so we'll see what happens.
Finally, I'm a mediocre dev that can only handle 2-3 agents at a time so I probably won't be good enough.
The rest of your issues sound architectural.
If changes are breaking contracts in calling code, that heavily implies that type declarations are not in use, or that enumerable values which drive conditional behavior are mistyped as a primitive supertype.
If unit tests are not catching things, that implies the unit tests are asserting trivial things, were written after the implementation just to codify whatever it already does, or are mocking modules they don't need to. Outside of pathological cases, the only thing you should be mocking is I/O, and even then that is the textbook use case for dependency injection.
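A minimal Python sketch of that last point, with invented names: the I/O boundary is injected as a plain callable, so a unit test substitutes a two-line stub instead of reaching for a mocking framework.

```python
# Illustrative sketch: dependency injection at the I/O boundary.
class UserGreeter:
    def __init__(self, fetch_user):
        # fetch_user is the ONLY I/O dependency; the rest is pure logic.
        self._fetch_user = fetch_user

    def greeting(self, user_id):
        user = self._fetch_user(user_id)
        return f"Hello, {user['name'].title()}!"

# In production you would inject a real database or HTTP fetcher.
# In a unit test, a trivial stub replaces it:
def stub_fetch(user_id):
    return {"id": user_id, "name": "ada lovelace"}

greeter = UserGreeter(stub_fetch)
print(greeter.greeting(7))  # Hello, Ada Lovelace!
```

Nothing internal to `UserGreeter` is mocked; only the injected boundary is swapped, which keeps the test exercising the real logic.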
You spend the time on what is needed for you to move ahead - if code review is now the most time consuming part, that is where you will spend your time. If ever that is no longer a problem, defining requirements will maybe be the next bottleneck and where you spend your time, and so forth.
Of course it would be great to get rid of the review bottleneck as well, but I at least don't have an answer to that - I don't think the current generation of LLMs are good enough to allow us bypassing that step.
> For example: A (calls) -> B -> C -> D
> And you made changes in D, how do you know the side effect on B, what if it broke A?
That's poor encapsulation. If the changes in D respect its contract, and C respects D's contract, your changes in D shouldn't affect C, much less B or A.
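A small Python sketch of that claim, with hypothetical layer names: as long as the change inside D preserves D's contract (same signature, same observable behavior), the callers up the chain are untouched.

```python
# Hypothetical call chain A -> B -> C -> D, each layer relying only on
# the contract of the layer below.
def d(xs):
    # Contract: return the sum of xs. Replacing the builtin sum() with a
    # manual loop is a change "in D" that respects the contract.
    total = 0
    for x in xs:
        total += x
    return total

def c(xs):  # relies only on d's contract
    return d(xs) / len(xs) if xs else 0

def b(xs):  # relies only on c's contract
    return round(c(xs), 2)

def a(xs):  # relies only on b's contract
    return f"average: {b(xs)}"

print(a([1, 2, 3, 4]))  # average: 2.5
```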
I do think some of this is just a hype wave and businesses will learn quality and trust matter. But maybe not - if wealth keeps becoming more concentrated at the top, it’s slop for the plebs.
What I usually do to prevent this situation is to write a passing test, then modify the code to make it fail, then revert the change. It also gives an occasion to read the code again, kind of like a review.
I have never seen this practice formalized, though. Good for me; this is the kind of thing I do because I care, and turning it into a process with Jira and such is a good way to make me stop caring.
That's the reality of most software built in last 20 years.
> If the changes in D respect its contract, and C respects D's contract, your changes in D shouldn't affect C, much less B or A.
Any change in D must eventually affect B or A; it's inevitable, otherwise D shouldn't exist in the call stack.
Here is how the case I mentioned can happen. Imagine each layer has 3 variations (1 happy path, 2 edge cases). Starting from the lowest layer:
D: 3, C: 3×D = 9, B: 3×C = 27, A: 3×B = 81.
Obviously you won't write 81 unit tests for A and 27 for B; you will mock implementations and write just enough unit tests to make the coverage look good. Because of that mocking, when you update D to add a new case but do not surface the change in the mocks of the upper layers, you end up in a situation where D impacts A, yet nothing is visible in the unit tests.
While reading the changes in D, I can't reconstruct every possible parent caller chain in my head in order to ask the engineer to write the relevant unit tests.
So the case I mentioned happens; otherwise there would be no bugs in the real world.
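A hypothetical Python sketch of exactly this failure mode, with made-up names: B's unit test stubs out the layer below, so when D grows a new edge case (returning `None` for zero), B's test stays green while the real chain breaks.

```python
# Hypothetical layers: B calls C calls D.
def d(x):
    if x == 0:
        return None      # new edge case added to D later
    return 100 / x

def c(x):
    return d(x)

def b(x, c_impl=c):
    # B assumes the layer below always returns a number.
    return c_impl(x) + 1

# B's unit test injects a stale stub that predates D's new case:
def test_b():
    assert b(4, c_impl=lambda x: 50) == 51   # passes; coverage looks fine

test_b()
# ...but in production the real chain now breaks:
# b(0) -> c(0) -> d(0) -> None, and None + 1 raises TypeError.
```

The stub was faithful to D's old contract, and nothing in the test suite forces it to track D's new behavior.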
Honestly I'm not sure much has changed with my output, because I don't submit PRs which aren't thoughtful. That is what the most annoying people in my organization do. They submit something that compiles, and then I spend a couple hours of my day demonstrating how incorrect it is.
For small fixes where I can recognize there is a clear, small fix which is easily testable I no longer add them to a TODO list, I simply set an agent off on the task and take it all the way to PR. It has been nice to be able to autopilot mindless changesets.
Big constraint: code changes. The initial architecture could have been amazing, but constantly changing business requirements make things messy.
Please don't use "in an ideal world" examples :) They are singular points in a vast space of non-ideal solutions.
> You know we’ve had the ability to generate large amounts of code for a long time, right?
No, I was not aware. Nothing comes close to the scale of the 'coherent-looking' code generation of today's tech.
Even if you employ 100K people and ask them to write proper if/else code non-stop, an LLM can still outcompete them by a huge margin, with much better-looking code.
(Don't compare LLM output to the codegen of the past; codegen was carefully crafted and often deterministic. I am only talking about people writing code vs. LLMs writing code.)
Isn't that what is often known as mutation testing? It is formalized to the point that we have tools to do the mutation for you automatically.
There's no way to make spaghetti code easy to review.
> No, I was not aware. Nothing comes close to the scale of 'coherent looking' code generation of today's tech.
Are you talking about “I’m overwhelmed by code review” or “we can now produce code at a scale no amount of humans can ever review”. Those are 2 very different things.
You review code because you're responsible for it. This problem existed pre-AI, and nothing has changed with respect to being overwhelmed; the solution is still the same. As for the latter, I think that's more the software "dark factory" kind of thinking?
I find that interesting, and maybe we'll get there. But again, the code it takes to verify a system is drastically more complex than the system itself. I don't know how you could build such a thing except in narrow use cases, which I do think we'll see one day, though how narrow they are is the key part.
About fuzzing, I have about 20 years of experience in development and I have never seen fuzzing being done as part of a documented process in a project I worked in, not even once. Many people working in validation don't even know that it exists! The only field where fuzzing seems to be mainstream is cybersecurity, and most fuzzing tools are "security oriented", which is nice but it doesn't mean that security is the only field where it is useful.
Anyway, what I do is a bit different in that it is not random like fuzzing; it is more like reverse-TDD. TDD starts with a failing test; then you write code to pass the test, and once it passes, you consider the code to be correct. Here you start with a passing test; then you modify the code to make the test fail, and once it fails, you consider the test to be correct.
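This reverse-TDD loop can be sketched as a hand-rolled mutation check (all names hypothetical): run the test against the real implementation and against a deliberately broken mutant; if the test passes on the mutant too, the test is vacuous.

```python
# Hand-rolled mutation check: a test is only trusted once it has been
# seen failing against a deliberately broken implementation.
def discount(price, pct):
    return price * (1 - pct / 100)

def discount_mutant(price, pct):
    return price * (1 + pct / 100)   # deliberate bug, reverted afterwards

def run_test(impl):
    try:
        assert impl(200, 10) == 180
        return "pass"
    except AssertionError:
        return "fail"

print(run_test(discount))         # pass -- the code satisfies the test
print(run_test(discount_mutant))  # fail -- good: the test CAN fail
```

Mutation-testing tools automate exactly this: they generate many such mutants and report any that survive the test suite.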