Cross entropy loss steers towards garden path sentences. Using a paragraph to say something any person could say with a sentence, or even a few precise words. Long sentences are the low perplexity (low statistical “surprise”) path.
In this case I would ask for smaller changes and justify every change. Have it look back upon these changes and have it ask itself are they truly justified or can it be simplified.
"Do not modify any code; only describe potential changes."
I often add it to the end when prompting to e.g. review code for potential optimizations or refactor changes.
The solution to this is to use quality gates that loop back and check the work.
I'm currently building a tool with gates and a diff regression check. I haven't seen these problems for a while now.
...and that led me to believe that AI might be very capable to develop over-engineered audio equipment. Think of all the bells and whistles that could be added, that could be expressed in ridiculous ways with ridiculous price tags.
The version it puts down into documents is not the thing it was actually doing. It's a little anxiety-inducing. I go back to review the code with big microscopes.
"Reproducibility" is still pretty important for those trapped in the basements of aerospace and defense companies. No one wants the Lying Machine to jump into the cockpit quite yet. Soon, though.
We have managed to convince the Overlords that some teensy non-agentic local models - sourced in good old America and running local - aren't going to All Your Base their Internets. So, baby steps.
I re-did an entire UI recently, and when one of the elements failed to render I noticed the old UI peeking out from underneath. It had tried just covering up old elements instead of adjusting or replacing them. Like telling your son to clean their room, so they push all the clothes under the bed and hope you don't notice LOL
It saves 2 hours of manual syntax wrangling but introduces 1 .5 hours of clean up and sanity checking. Still a net productivity increase, but not sure if its worth how lazy it seems to be making me (this is an easy error to correct, im sure, but meh Claude can fix it in 2 seconds so...)
Not surprised to see this, since once again, because some of us didn’t like history as a subject, lines of code is a performance measure, like a pissing contest.
Codex also has a tendency to apply unwanted styles everywhere.
I see similar tendencies in backend and data work, but I somehow find it easier to control there.
I'm pretty much all in on AI coding, but I still don't know how to give these things large units of work, and I still feel like I have to read everything but throwaway code.
I guess it comes down to how ossified you want your existing code to be.
If it's a big production application that's been running for decades then you probably want the minimum possible change.
If you're just experimenting with stuff and the project didn't exist at all 3 days ago then you want the agent to make it better rather than leave it alone.
Probably they just need to learn to calibrate themselves better to the project context.
It doesn't really make sense economically for me to write software for work anymore. I'm a teacher, architect, and infrastructure maintainer now. I hand over most development to my experienced team of Claude sessions. I review everything, but so does Claude (because Claude writes thorough tests also.) It has no problem handling a large project these days.
I don't mean for this post to be an ad for Claude. (Who knows what Anthropic will do to Claude tomorrow?) I intend for this post to be a question: what am I doing that makes Claude profoundly effective?
Also, I'm never running out of tokens anymore. I really only use the Opus model and I find it very efficient with tokens. Just last week I landed over 150 non-trivial commits, all with Claude's help, and used only 1/3 of the tokens allotted for the week. The most commits I could do before Claude was 25-30 per week.
(Gosh, it's hard to write that without coming across as an ad for Anthropic. Sorry.)
The idea being that if you're working in an area, you should refactor and tidy it up and clean up "tech debt" while there.
In practice, it was seldom done, and here we have LLMs actually doing it, and we're realising the drawbacks.
I suspect AI's learned to do this in order to game the system. Bailing out with an exception is an obvious failure and will be penalized, but hiding a potential issue can sometimes be regarded as a success.
I wonder how this extrapolates to general Q&A. Do models find ways to sound convincing enough to make the user feels satisfied and the go away? I've noticed models often use "it's not X, it's Y", which is a binary choice designed to keep the user away from thinking about other possibilities. Also they often come up with a plan of action at the end of their answer, a sales technique known as the "assumptive close", which tries to get the user to think about the result after agreeing with the AI, rather than the answer itself.
I think they're in here, last edited 8 months ago: https://github.com/nreHieW/fyp/blob/5a4023e4d1f287ac73a616b5...
1. I have no real understanding of what is actually happening under the hood. The ease of just accepting a prompt to run some script the agent has assembled is too enticing. But, I've already wiped a DB or two just because the agent thought it was the right thing to do. I've also caught it sending my AWS credentials to deployment targets when it should never do that.
2. I've learned nothing. So the cognitive load of doing it myself, even assembling a simple docker command, is just too high. Thus, I repeatedly fallback to the "crutch" of using AI.
The cynic in me thinks it's done on purpose to burn more tokens. The pragmatist however just wants full control over the harness and system prompts. I'm sure this could be done away with if we had access to all the knobs and levers.
I think we should move to semi-autonomous steerable agents, with manual and powerful context management. Our tools should graduate from simple chat threads to something more akin to the way we approach our work naturally. And a big benefit of this is that we won't need expensive locked down SOTA models to do this, the open models are more than powerful enough for pennies on the dollar.
I am surprised Gemini 3.1 Pro is so high up there. I have never managed to make it work reliably so maybe there's some metric not being covered here.
I've had success with greenfield code followed by frustration when asking for changes to that code due to over editing
And prompting for "minimal changes" does keep the edits down. In addition to this instruction, adding specifics about how to make the change and what not to do tends to get results I'm looking for.
"add one function that does X, add one property to the data structure, otherwise leave it as is, don't add any new validation"
I'll literally run an agent & tell it to clean up a markdown file that has too much design in it, delete the technical material, and/or delete key implementations/interfaces in the source, then tell a new session to do the work, come up with the design. (Then undelete and reconcile with less naive sessions.)
Path dependence is so strong. Right now I do this flow manually but I would very much like to codify this, make a skill for this pattern that serves so well.
The view of Claude on HN is extremely positive and nearly every thread will have highly positive comment "that is not an ad".
I think people are seeing others just irked by the constant stream what feels like ads and reading it as Claude being somehow disliked.
I don't measure my productivity, but I see it in the sort of tasks I tackle after years of waiting. It's especially good at tedious tasks like turning 100 markdown files into 5 json files and updating the code that reads them, for example.
At times even when a function is right there doing exactly what's needed.
Worse, when it modifies a function that exists, supposedly maintaining its behavior, but breaks for other use cases. Good try I guess.
Worst. Changing state across classes not realising the side effect. Deadlock, or plain bugs.
If LLMs are doing sensible and necessary refactors as they go then great
I have basically zero confidence that is actually the case though
I spent some time dealing with this today. The real issue for me, though, was that the refactors the agent did were bad. I only wanted it to stop making those changes so I could give it more explicit changes on what to fix and how.
Too many people are treating the tools as a complete replacement for a developer. When you are typing a text to someone and Google changes a word you misspelled to a completely different word and changes the whole meaning of the text message do you shrug and send it anyway? If so, maybe LLMs aren't for you.
With LLMs, you glimpse a distant mountain. In the next instant, you're standing on its summit. Blink, and you are halfway down a ridge you never climbed. A moment later, you're flung onto another peak with no trail behind you, no sense of direction, no memory of the ascent. The landscape keeps shifting beneath your feet, but you never quite see the panorama. Before you know it, you're back near the base, disoriented, as if the journey never happened. But confident, you say you were on the top of the mountain.
Manual coding feels entirely different. You spot the mountain, you study its slopes, trace a route, pack your gear. You begin the climb. Each step is earned steadily and deliberately. You feel the strain, adjust your path, learn the terrain. And when you finally reach the summit, the view unfolds with meaning. You know exactly where you are, because you've crossed every meter to get there. The satisfaction isn't just in arriving, nor in saying you were there: it is in having truly climbed.
But yeah, I saw a suggestion about adding a long-lived agent that would keep track of salient points (so kinda memory) but also monitor current progress by main agent in relation to the "memory" and give the main agent commands when it detects that the current code clashes with previous instructions or commands. Would be interesting to see if it would help.
Purely anecdotal.
Even within the same project, for a given PR, there are some parts of the codebase I want to modify freely and some that I want fixed to reduce the diff and testing scope.
I try to explain up-front to the agent how aggressively they can modify the existing code and which parts, but I've had mixed success; usually they bias towards a minimal diff even if that means duplication or abusing some abstractions. If anyone has had better success, I'd love to hear your approach.
I looked at some stats yesterday and was surprised to learn Cursor AI now writes 97% of my code at work. Mostly through cloud agents (watching it work is too distracting for me)
My approach is very simple: Just Talk To It
People way overthink this stuff. It works pretty good. Sharing .md files and hyperfocusing on various orchestrations and prompt hacks of the week feels as interesting as going deep on vim shortcuts and IDE skins.
Just ask for what you want, be clear, give good feedback. That’s it
Personally, I tend to get crap quality code out of Claude. Very branchy. Very un-DRY. Consistently fails to understand the conventions of my codebase (e.g. keeps hallucinating that my arena allocator zero initializes memory - it does not). And sometimes after a context compaction it goes haywire and starts creating new regressions everywhere. And while you can prompt to fix these things, it can take an entire afternoon of whack-a-mole prompting to fix the fallout of one bad initial run. I've also tried dumping lessons into a project specific skill file, which sometimes helps, but also sometimes hurts - the skill file can turn into a footgun if it gets out of sync with an evolving codebase.
In terms of limits, I usually find myself hitting the rate limit after two or three requests. On bad days, only one. This has made Claude borderline unusable over the past couple weeks, so I've started hand coding again and using Claude as a code search and debugging tool rather than a code generator.
1. Is a product/software you develop novel? As in does it do something useful and unique? Or it's a product that already exists in many varietes and yours is just "one of ..."?
2. What if one day, LLMs will get regulated/become terrible/raise prices above your budget. Do you have plans for that?
Are you working more on operational stuff or on "long-running product" stuff?
My personal headcanon: this tooling works well when built on simple patterns, and can handle complex work. This tooling has also been not great at coming up with new patterns, and if left unsupervised will totally make up new patterns that are going to go south very quickly. With that lens, I find myself just rewriting what Claude gives me in a good number of cases.
I sometimes race the robot and beat the robot at doing a change. I am "cheating" I guess cuz I know what I want already in many cases and it has to find things first but... I think the futzing fraction[0] is underestimated for some people.
And like in the "perils of laziness lost"[1] essay... I think that sometimes the machine trying too hard just offends my sensibilities. Why are you doing 3 things instead of just doing the one thing!
One might say "but it fixes it after it's corrected"... but I already go through this annoying "no don't do A,B, C just do A, yes just that it's fine" flow when working with coworkers, and it's annoying there too!
"Claude writes thorough tests" is also its own micro-mess here, because while guided test creation works very well for me, giving it any leeway in creativity leads to so many "test that foo + bar == bar + foo" tests. Applying skepticism to utility of tests is important, because it's part of the feedback loop. And I'm finding lots of the test to be mainly useful as a way to get all the imports I need in.
If we have all these machines doing this work for us, in theory average code quality should be able to go up. After all we're more capable! I think a lot of people have been using it in a "well most of the time it hits near the average" way, but depending on how you work there you might drag down your average.
[0]: https://blog.glyph.im/2025/08/futzing-fraction.html [1]: https://bcantrill.dtrace.org/2026/04/12/the-peril-of-lazines...
"Refactor-as-you-go" means to refactor right after you add features / fix bugs, not like what the agent does in this article.
This is horrible practice, and very typical junior behavior that needs to be corrected against. Unless you wrote it, Chesterton's Fence applies; you need to think deeply for a long time about why that code exists as it does, and that's not part of your current task. Nothing worse than dealing with a 1000 line PR opened for a small UI fix because the code needed to be "cleaned up".
With LLM-assisted coding, you skip the trek and you instantly know that’s not it.
https://vivekhaldar.com/articles/when-compilers-were-the--ai...
We are completely comfortable now letting the compilers do their thing, and never seem to worry that we "don't know what is actually happening under the hood".
I am not saying these situations are exactly analogous, but I am saying that I don't think we can know yet if this will be one of those things that we stop worrying about or it will be a serious concern for a while.
If you find any, consider making them into skills or /commands or maybe even add them to AGENTS.md.
I've thought about this and I think the reason is as follows: we hold code written by ourselves to a much higher standard than code written by somebody else. If you think of AI code as your own code, then it probably won't seem very acceptable because it lacks the beauty (partly subjective as all beauty tends to be) that we put into our own code. If you think of it as a coworker's code, then it's usually alright i.e. you wouldn't be wildly impressed with that coworker but it would also not be bad enough to raise a stink.
It follows from this that it also depends on how you regard the codebase that you're working on. Do you think of it as a personal masterpiece or is it some mishmash camel by committee as the codebases at work tend to be?
2. Regulation? I'm sceptical that the cat can be put back into the bag. It's already out there. More realistic problem is the business model part - openweight/local provides a counterpoint to that.
It's bitten me several times at work, and I rather not waste any more of my limited time doing the re-prompt -> modify code manually cycle. I'm capable of doing this myself.
It's great for the simple tasks tho, most feature work are simple tasks IMO. They were only "costly" in the sense that it took a while to previously read the code, find appropriate changes, create tests for appropriate changes, etc. LLMs reduce that cycle of work, but that type of work in general isn't the majority of my time at my job.
I've worked at feature factories before, it's hell. I can't imagine how much more hell it has become since the introduction of these tools.
Feature factories treat devs as literal assembly line machines, output is the only thing that matters not quality. Having it mass induced because of these tools is just so shitty to workers.
I fully expect a backlash in the upcoming years.
---
My only Q to the OP of this thread is what kind of teacher they are, because if you teach people anything about software while admitting that you no longer write code because it's not profitable (big LOL at caring about money over people) is just beyond pathetic.
1. Even really novel projects have large chunks of glue code and boring infrastructure that the novel bits depend on. claude means I spend 10% of my time on the borng stuff and 90% of time on stuff I previously onky had 10% of my day to work on. In my experience the software picked up our idioms fast and for context, we have a skill file explaining code standards.
2. codex and gemini are comparable when paired with a good harness (pi.dev). if things ever get really bad, I'll drop 8k on a dedicated agent coding server and run it locally. I tried it recently with my current system and it was sub par but I was running a drasticly simpler model.
I've been doing a greenfield project with Claude recently. The initial prototype worked but was very ugly (repeated duplicate boilerplate code, a few methods doing the same exact thing, poor isolation between classes)... I was very much tempted to rewrite it on my own. This time, I decided to try and get it to refactor so get the target architecture and fix those code quality issues, it's possible but it's very much like pulling teeths... I use plan mode, we have multiple round of reviews on a plan (that started based on me explaining what I expect), then it implements 95% of it but doesn't realize that some parts of it were not implemented... It reminds me of my experience mentoring a junior employee except that claude code is both more eager (jumping into implementation before understanding the problem), much faster at doing things and dumber.
That said, I've seen codebases created by humans that were as bad or worse than what claude produced when doing prototype.
I'm fascinated by this question.
I think the first two sections of this article point towards an answer: https://aphyr.com/posts/412-the-future-of-everything-is-lies...
I've personally had radically different experiences working on different projects, different features within the same project, etc.
Over-editing is definitely not some long gone problem. This was on xhigh thinking, because I forgot to set it to lower.
They are trained on human feedback, so there is no other way this goes. Every bit of every response is pointed toward subversion of the assumed evaluator.
[1] except perhaps read-only credentials to help diagnose problems, but even then I would only issue it an extremely short-lived token in case it leaks it somehow
Only helps if we listen to it :) which is fun b/c it means staying sharp which is inherently rewarding
Don’t give your agent access to content it should not edit, don’t give keys it shouldn’t use.
I've spent far more time pitting one AI context against another (reviewing each other's work) than I have using AI to build stuff these days.
The benefit is that since it mostly happens asynchronously, I'm free to do other stuff.
It's impossible to properly review this in a reasonable time and they always introduce tons of subtle bugs.
Is it by characters human typed vs AI generated, or by commit or something?
I've found this can be vastly reduced with AGENTS.md instructions, at least with codex/gpt-5.4.
Tech debt needs to be dealt with when it makes sense. Many times it will be right there and then as you're approaching the code to do something else. Other times it should be tackled later with more thought. The latter case is frequently a symptom of the absence of the former.
In Extreme Programming, that's called the Boy Scouting Rule.
https://furqanramzan.github.io/clean-code-guidelines/princip...
Instead you to do it later, and then never do it.
Day 1: Carefully handles the creds, gives me a lecture (without asking) about why .env should be in .gitignore and why I should edit .env and not hand over the creds to it.
Day 2: I ask for a repeat, has lost track of that skill or setting, frantically searches my entire disk, reads .env including many other files, understands that it is holding a token, manually creates curl commands to test the token and then comes back with some result.
It is like it is a security expert on Day 1 and absolute mediocre intern on Day 2
1. Everything is specified, written and tested by me, then cleaned up by AI. This is for the core of the application.
2. AI writes the functions, then sets up stub tests for me to write. Here I’ll often rewrite the functions as they often don’t do what I want, or do too much. I just find it gets rid of a lot of boilerplate to do things this way.
3. AI does everything. This is for experiments or parts of an application that I am perfectly willing to delete. About 70% of the time I do end up deleting these parts. I don’t allow it to touch 1 or 2.
Of course this requires that the architecture is setup in a way where this is possible. But I find it pretty nice.
I can't help but read complaints about the capabilities of AI – and I'm certainly not accusing you of complaining about AI, just a general thought – and think "Yet" to myself every time.
> python <<'EOF'
> ${code the agent wrote on the spot}
> EOF
I mean, yeah, in theory it's just as dangerous as running arbitrary shell commands, which the agent is already doing anyway, but still...
I'm not trying to be offense, so with all due respect... this sounds like a "you" problem. (And I've been there, too)
You can ask the LLMs: how do I run this, how do I know this is working, etc etc.
Sure... if you really know nothing or you put close to zero effort into critically thinking about what they give you, you can be fooled by their answers and mistake complete irrelevance or bullshit for evidence that something works is suitably tested to prove that it works, etc.
You can ask 2 or 3 other LLMs: check their work, is this conclusive, can you find any bugs, etc etc.
But you don't sound like you know nothing. You sound like you're rushing to get things done, cutting corners, and you're getting rushed results.
What do you expect?
Their work is cheap. They can pump out $50k+ worth of features in a $200/mo subscription with minimal baby-sitting. Be EAGER to reject their work. Send it back to them over and over again to do it right, for architectural reviews, to check for correctness, performance, etc.
They are not expensive people with feelings you need to consider in review, that might quit and be hard to replace. Don't let them cut corners. For whatever reason, they are EAGER to cut corners no matter how much you tell them not to.
We do, just tell it what you want in your AGENTS.md file.
Agents also often respond well to user frustration signs, like threatening to not continue your subscription.
How do you emulate that with llm’s? I suppose the objective is to get variance down to the point it’s barely noticeable. But not sure it’ll get to that place based on accumulating more data and re-training models.
A non deterministic compiler is probably defective and in any case much less useful
Code for this post is available here.
AI-assisted coding has become the norm and with tools like Cursor, GitHub Copilot, Claude Code, Codex, we are increasingly letting models touch our code. If you have used any of these tools in the past year, you have probably experienced something like this: you ask the model to fix a simple bug (perhaps a single off-by-one error, or maybe a wrong operator). The model fixes the bug but half the function has been rewritten. An extra helper function has appeared. A perfectly reasonable variable name has been renamed. New input validation has been added. And the diff is enormous.
I refer to this as the Over-Editing problem where models have the tendency to rewrite code that didn’t need rewriting. This matters more than it might seem. Code review is already a bottleneck and reviewers need to understand what changed, why it changed, and whether the change is safe. A model that rewrites entire functions, even correctly, makes this job dramatically harder as the code is now completely unrecognizable.
In this post, I will investigate this problem: whether existing LLMs have a tendency to over-edit and whether we can train models to be more faithful editors.
Figure 1: A classic example of the Over-editing problem. GPT-5.4 (High) rewrites the entire function when the correct fix is simply changing range(len(x) - 1) to range(len(x)).
Over-editing refers to a model modifying code beyond what is strictly necessary to fix the problem at hand. To be precise: a model is over-editing if its output is functionally correct but structurally diverges from the original code more than the minimal fix requires.
The example in Figure 1 illustrates this well. The bug is a single off-by-one error in a range() call (range(len(x) - 1) should be range(len(x))) and the correct fix is a single line. GPT-5.4 (with high reasoning effort) responds by rewriting the entire function: it adds explicit None checks, introduces np.asarray conversions with dtype=float, adds finite-value masking, validates array sizes, changes the curve_fit call signature, and replaces the plotting logic entirely. While the output passes the tests and is functionally correct, the diff is enormous, and none of those additions were asked for or even necessary.
It helps to think about this in terms of the kind of work being done. Software engineering broadly splits into two modes: green-field (building something new from scratch) and brown-field (working within an existing codebase). Specifically in brown-field, the existing code has been understood by the team and has been deliberately written the way it was. The model’s job is to fix the issue and nothing else.
A common piece of advice for working with AI coding tools is to simply write more tests because if the tests pass, the code is fine. However, Over-editing is a brown-field failure where unlike correctness failures, it is invisible to test suites. As models generate more code, engineers have more to review and over-editing makes that harder. There is more complex logic to parse, more lines of code to read, and a higher chance that overall codebase quality quietly degrades.
To study over-editing, we first need a dataset of code edits where the “ground truth” edit is well-defined with some degree of “minimality”. Rather than using another LLM to introduce bugs (which is what most existing benchmarks do), we programmatically corrupt 400 problems from BigCodeBench which gives us more fine-grained control: things like flipping a comparison operator (< → <=), swapping + for -, or changing boolean values (True → False).1 Each corrupted sample remains syntactically valid and verified to break the corresponding test cases. This ensures that the ground truth edit is exactly the reversal of the corruption and nothing more, thus making this edit minimal by construction. We can then evaluate not just whether a model fixes the bug, but how much else it changed in the process.
Most coding benchmarks evaluate models on correctness using some variant of Pass@1. However, Pass@1 is necessary but not sufficient. A model can score perfectly on Pass@1 while completely rewriting every function it touches. For this experiment, we need metrics that capture how much the model changed beyond what was required.
Token-level Levenshtein Distance. Unlike standard Levenshtein which counts the minimum number of character insertions, deletions, and substitutions to transform one string into another, we use a Python token-level variant. The code is first passed through Python’s tokenizer, which splits it into its atomic syntactic units (def, add, (, a, ,, b, ), :, return, a, +, b). Levenshtein is then computed over this token sequence rather than raw characters.
For example, consider the following two functions:
def add(a, b): def someotherfunctionname(a, b):
return a + b return a + b
Character-level Levenshtein gives a distance of 19. Token-level Levenshtein gives a distance of 1 since someotherfunctionname becomes a single token. We normalize by total token count so scores are comparable across functions of different lengths.
In addition, rather than simply comparing the model’s output to the ground truth, we compare both against the corrupted input. Let $C$ be the corrupted solution, $G$ the ground truth, and $M$ the model’s output. The true minimal edit (simply the reversal of the corruption) is $D_{\text{true}} = d(G, C)$ and the model’s edit is $D_{\text{model}} = d(M, C)$, giving a relative patch score:
$$S(M) = D_{\text{model}} - D_{\text{true}}$$
Values closer to zero indicate the model’s patch resembles the true minimal fix. The intuition is that we can interpret the original uncorrupted solution as the best possible edit to the corrupted solution, compute the scores for this best possible patch, and then compare with the model’s output.
Added Cognitive Complexity. Cognitive Complexity (an improvement over Cyclomatic Complexity) measures how hard code is to understand. It penalizes nesting, recursion, mixed logical operators, and non-obvious control flow. For example a straight line of code with no branches is much easier to read than something that requires a reader to hold state, such as an if, a loop, or try/except. An example is shown below:
def process(items):
result = []
for item in items: # +1
if item > 0: # +2 (nesting penalty: inside a loop)
if item % 2 == 0: # +3 (nesting penalty: two levels deep)
result.append(item)
return result
# Cognitive Complexity: 6
Since all our corruptions change values rather than structure, the correct fix should always add zero Cognitive Complexity. Any increase in the model’s output was introduced unprompted and is unnecessary. We report the absolute difference between the model’s output and the original, which should be zero for a faithful minimal edit. Values below 0 are also unwanted as unnecessary simplifications to code are also undesirable.
Yes, even frontier ones.
| Model | Pass@1 ↑ | Normalized Levenshtein ↓ | Added Cognitive Complexity ↓ |
|---|---|---|---|
| Reasoning Models | |||
| GPT-5.4 | 0.723 | 0.395 | 2.313 |
| Claude Opus 4.6 | 0.912 | 0.060 | 0.200 |
| Gemini 3.1 Pro Preview | 0.858 | 0.145 | 0.501 |
| GLM 5 High | 0.859 | 0.099 | 0.320 |
| Qwen 3.6 Plus | 0.858 | 0.145 | 0.048 |
| Kimi 2.5 | 0.835 | 0.151 | 0.770 |
| DeepSeek R1 | 0.820 | 0.232 | 0.673 |
| DeepSeek Chat V3.1 | 0.795 | 0.232 | 0.694 |
| GPT-5 High | 0.713 | 0.438 | 3.832 |
| Non-Reasoning Models | |||
| GPT-5.4 | 0.770 | 0.327 | 1.563 |
| Claude Opus 4.6 | 0.900 | 0.079 | 0.313 |
| Gemini 3.1 Pro Preview | 0.860 | 0.129 | 0.358 |
| GLM 5 | 0.840 | 0.097 | 0.235 |
| Qwen 3.6 Plus | 0.870 | 0.106 | 0.605 |
| Kimi 2.5 | 0.770 | 0.140 | 0.687 |
| DeepSeek V3 | 0.800 | 0.201 | 0.803 |
| DeepSeek Chat V3.1 | 0.802 | 0.235 | 1.223 |
| GPT-5 Minimal | 0.738 | 0.397 | 2.877 |
Table 1: Performance comparison of reasoning and non-reasoning models. Models that appear in both are hybrid models. Best scores for each metric are bolded.
Among the latest frontier models, GPT-5.4 over-edits the most. It has a Levenshtein of 0.39 in reasoning mode and 0.33 in non-reasoning, with Added Cognitive Complexity of 2.31 and 1.56 respectively. Despite this, its Pass@1 is only 0.723 and 0.770, making it one of the weakest correctness performers too. Claude Opus 4.6 achieves the highest Pass@1 of any model evaluated (0.912 reasoning, 0.900 non-reasoning) while also producing the smallest diffs with Levenshtein of 0.06 and 0.08, Added Cognitive Complexity of 0.20 and 0.31. Gemini 3.1 Pro Preview sits in similar territory, with GLM 5 arguably the most conservative model among the open weight ones.
Many papers that claim to uncover a new LLM failure mode do not first test whether the model can do the task when asked directly. A behavior that looks impossible in one setup may be easy under an explicit prompt, so I investigate the impact of adding “IMPORTANT: Try to preserve the original code and the logic of the original code as much as possible” to the prompt.
Figure 2: Change in Pass@1 and Levenshtein Distance when models are prompted explicitly to keep edits minimal. Models are colour coded by reasoning mode.
With explicit prompting, every model improves and reduces its Levenshtein Distance, and with the exception of DeepSeek R1/v3, also improves its Pass@1. One interpretation is that the constraint to make minimal edits inadvertently narrows the search space of possible fixes, steering models toward the kind of precise, targeted change that is more likely to be correct. The effect on Levenshtein Distance is much more pronounced in reasoning models, which is likely the result of their stronger instruction following ability.
Reasoning models are generally assumed to be better at coding tasks, and they do score higher on Pass@1. But since we are interested in the style of these edits, we need to look at the results through a different lens.
Figure 3: Comparison of the Levenshtein Distance of reasoning and non-reasoning models. Models are grouped into pairs of one reasoning and non-reasoning. Lower bars are better. Labels above each pair indicate the number of problems where both models get the answer correct.
Figure 3 groups the models into pairs where each pair contains a reasoning and non-reasoning model from the same family. For each pair, we plot the Levenshtein Distance of only the samples where both models get the answer correct. This allows us to isolate edit minimality from correctness since a model that fails more often has fewer samples to over-edit on, which would otherwise bias the comparison.
In the generic setting (top), reasoning models over-edit more than their non-reasoning counterparts in the majority of pairs. DeepSeek V3, GPT-5, GPT-5.4, Gemini 3.1 Pro Preview, Qwen 3.6 Plus, and Kimi 2.5 all show the reasoning bar sitting higher. Reasoning models seems to naturally have more elaborate rewrites where the model reasons its way into a “better” implementation rather than a minimal fix. The notable exception is Claude Opus 4.6, where the reasoning variant edits substantially less than its non-reasoning counterpart.
In the explicit setting (bottom), the picture changes considerably. Once models are told to preserve the original code, reasoning models have much lower Levenshtein Distance than their non-reasoning counterparts and match or undercut them in almost every pair. Claude Opus 4.6 (reasoning) drops to the lowest Levenshtein of any model in this setting. GPT-5 and GPT-5.4 both see their reasoning variants fall significantly, though GPT-5.4’s non-reasoning model still edges ahead.
Therefore, the takeaway is that the default behavior of most reasoning models is to over-edit. Left unconstrained, the extended reasoning gives models more room to “improve” code that doesn’t need improving. But, that same reasoning capacity also makes them better at following the constraint once it is given. The gap between the generic and explicit setting is consistently larger for reasoning models, which suggests the over-editing is not a fundamental limitation but rather a default behavior that can be overridden.
A natural next question: can we train models to be more faithful editors? For this experiment, I start with Qwen3 4B 2507 Instruct as the base model. I use both 0-shot and 8-shot prompting together with the explicit instruction to preserve the original code as baselines. All other methods are prompted in the generic setting without the explicit instruction during evaluation.
I first create a synthetic dataset of corrupted problems from DeepCoder using the same approach detailed above. In addition to this programmatically generated dataset, I also use the base Qwen3 4B 2507 Instruct model to create a synthetic dataset via self-distillation. Concretely, I prompt the model to generate 8 completions per problem, keeping only the samples that are functionally correct and ranking them by Levenshtein Distance. The model is then trained without the explicit instruction similar to Context Distillation.
We evaluate 4 different methods:
SFT: Supervised fine-tuning directly on the programmatically generated dataset.
rSFT: Rejection-sampled SFT where we train on the completions with the 3 lowest Levenshtein Distances for each sample from the self-distillation dataset.
DPO: Preference optimization between the completions with the highest and lowest Levenshtein Distances for each sample from the self-distillation dataset.
RL: Reinforcement learning with a reward combining functional correctness and Levenshtein-based edit minimality. The reward structure is a weighted sum of the Levenshtein Distance and a penalty for failing to pass the test cases:
r = r_edit + 0.1 # if generation passes all test cases r = -0.2 # otherwise
| Model | Pass@1 ↑ | Norm. Levenshtein ↓ | Added CC ↓ |
|---|---|---|---|
| Baseline (0-shot) | 0.735 | 0.169 | 0.731 |
| Baseline (8-shot) | 0.775 | 0.115 | 0.479 |
| SFT | 0.932 | 0.002 | 0.000 |
| rSFT | 0.782 | 0.100 | 0.435 |
| DPO | 0.752 | 0.021 | 0.113 |
| RL | 0.802 | 0.046 | 0.112 |
Table 2: Performance comparison of various fine-tuning methods when trained on in-domain data: the training and test set have the same corruption types.
On the first attempt, SFT is almost suspiciously good as the resultant model seems to have perfectly learned the task. I found this extremely surprising and had the initial hypothesis that the model was just memorizing the reversal for this set of corruptions rather than learning a general minimal editing behavior. As a result, I re-created both synthetic datasets but instead using a completely different set of corruptions than the evaluation set to test for generalization. The core hypothesis was that the model was simply learning to reverse a particular set of corruptions.
| Model | Pass@1 ↑ | Norm. Levenshtein ↓ | Added CC ↓ | LiveCodeBench Change ↓ |
|---|---|---|---|---|
| Baseline (0-shot) | 0.735 | 0.169 | 0.731 | — |
| Baseline (8-shot) | 0.775 | 0.115 | 0.479 | — |
| SFT | 0.458 | −0.008 | 0.006 | −0.149 |
| rSFT | 0.780 | 0.107 | 0.501 | −0.069 |
| DPO | 0.787 | 0.092 | 0.348 | −0.046 |
| RL | 0.782 | 0.050 | 0.185 | +0.006 |
Table 3: Performance comparison of various fine-tuning methods when trained on out-of-domain data: the training and test set have different corruption types.
SFT collapses entirely out-of-domain. Pass@1 drops to 0.458 as the model has learned to make specific minimal changes regardless of whether they fix anything. rSFT and DPO are both better but the overall improvement is slight compared to the 8-shot baseline. This indicates that training on traces distilled from the base model itself is sufficient to induce some degree of generalization. RL is the only method that generalizes cleanly, improving on all three metrics over both baselines. The fact that the RL model has larger improvements on Levenshtein Distance and Added Cognitive Complexity than on Pass@1 is further evidence that it is not just memorizing corruption reversals but has actually generalized to minimal editing.
Given the SFT model’s inability to even fix bugs, we also wanted to look at Catastrophic Forgetting. Specifically, whether fine-tuning for minimal editing degrades general coding ability. We evaluate all fine-tuned models on LiveCodeBench v6 and compare against the original pretrained model. Ideally, performance should remain similar after training.
SFT shows a 43% performance degradation, which aligns with our earlier finding that it can no longer identify and fix basic bugs. The rSFT and DPO models experience slight degradation, indicating that even though they were trained on samples generated by the original model, the nature of the task still results in some degree of Catastrophic Forgetting. The RL model, however, does not experience any degradation. Combined with the fact that it also performs the task best, RL is able to teach the model a new behavior without degrading previously acquired abilities. This aligns with broader work showing that SFT memorizes while RL generalizes.
Inspired by other work showing that RL’s has a bias towards KL-minimal solutions reduces forgetting, we can interpret these results from a distributional perspective. Specifically, the distribution of the programmatically generated dataset is very different from the model’s original distribution. As a result, the SFT model’s distribution has been heavily modified and thus suffers from Catastrophic Forgetting. In contrast, for both rSFT and DPO, the distribution of the self-distilled dataset is more aligned and is thus less heavy-handed in nature when shaping the trained model’s distribution. Therefore, it is likely that the degree of Catastrophic Forgetting is proportional to the difference between the model’s original distribution and the distribution of the task training data.
Given that this task is less about teaching the model new knowledge and more about tuning its style on an existing task, we also wanted to explore whether LoRA would be sufficient. Since the base model already has the capability to edit code and fix bugs, full fine-tuning might not be necessary.
| Rank | Pass@1 ↑ | Norm. Levenshtein ↓ | Added CC ↓ | LiveCodeBench Δ ↑ |
|---|---|---|---|---|
| 1 | 0.738 | 0.166 | 0.676 | -0.022 |
| 8 | 0.775 | 0.112 | 0.426 | -0.022 |
| 16 | 0.805 | 0.087 | 0.328 | -0.005 |
| 32 | 0.795 | 0.065 | 0.235 | -0.011 |
| 64 | 0.797 | 0.051 | 0.160 | +0.001 |
| Full RL (best) | 0.782 | 0.050 | 0.185 | +0.006 |
Table 4: Performance comparison of various LoRA ranks.
The results support the hypothesis. LoRA at rank 64 nearly matches full RL on Levenshtein Distance and beats it on Added Cognitive Complexity. LiveCodeBench dips slightly at low ranks but rank 64 is effectively flat, and full RL remains best overall. There is a clean monotonic trend as rank increases where both Levenshtein and Added CC fall steadily from rank 1 to rank 64. The rate of improvement is not uniform though, as the biggest gains happen early. Rank 1 to 16 accounts for most of the Levenshtein reduction (0.166 → 0.087), while rank 16 to 64 closes the remaining gap more gradually (0.087 → 0.051). Ranks 1 and 8 also trade correctness for edit minimality which could be explained by a lack of sufficient capacity to learn both reward functions and instead bias towards the higher-reward edit minimality.
This is consistent with the idea that a small number of additional parameters is enough to shift the model’s editing behavior and more capacity beyond a certain point yields diminishing returns. For style-level behavioral changes where the underlying capability is already present, LoRA is likely sufficient and considerably cheaper to run.
A Note on Reward Hacking
The original version of the reward function had a bug where rollouts with no successful execution were given a hardcoded reward of 0. This ended up being a higher reward than rollouts with successful execution since the Levenshtein distance was negated to make it "higher is better." I found it interesting that even with this buggy reward function, full RL was still able to learn the task. Only with LoRA did the model fail to learn it, seemingly reward hacking by learning to never output functionally correct code which triggered an investigation into the environment. With the fixed reward function, the results of full RL improved only slightly.
Lastly, to validate the results across larger models, I apply the same RL recipe using the Out-of-Domain data onto the larger Qwen3 14B model. Even at larger parameter counts, there are performance gains across the board with higher Pass@1, lower Levenshtein Distance, lower Added Cognitive Complexity, and no indication of Catastrophic Forgetting. This gives me the confidence that such a recipe for the task of Minimal Code Editing can be extended to various models of different scales.
| Model | Pass@1 ↑ | Norm. Levenshtein ↓ | Added CC ↓ | LiveCodeBench Δ ↑ |
|---|---|---|---|---|
| Baseline: 14B | 0.770 | 0.136 | 0.315 | - |
| RL | 0.833 | 0.059 | 0.165 | +0.011 |
Table 5: Performance comparison of RL when trained on Qwen3 14B.
A Note on GPT 5.4 and Opus 4.6
It is notable that, despite being a frontier model, GPT 5.4 struggles on the minimal editing task, especially in the generic setting and relative to Opus 4.6. Figure 3, however, shows that it sees one of the largest gains when explicitly prompted in reasoning mode, second only to its predecessor GPT 5, which suggests strong instruction following capabilities. By contrast, Opus 4.6 shows one of the smallest improvements, though that may simply reflect its already strong baseline performance. This pattern fits the broader view that while GPT 5.4 often defaults to overly verbose code ("slop"), its behavior can be steered effectively with proper prompting.
Taken together, the results suggest that Over-Editing is both widespread and measurable. At the same time, the prompting results show that this is not purely a capability limitation. Especially for reasoning models, a simple instruction to preserve the original code leads to much more faithful edits, which is an encouraging sign that when models like GPT 5.4 over-edit, they can still be steered toward higher-quality code.
Further, the training results suggest that this behavior can be improved. Reinforcement learning produced more faithful editors without degradation in general coding ability, and those gains held up across both the 4B and 14B Qwen3 models.
Admittedly, the field of code benchmarks has gone on from simple single function evaluations to more agentic evaluation paradigms like SWE-Bench Pro. Relative to those, evaluating bug fixes in isolated functions is still a fairly contained task given the nature of the bugs.
Even so, in my experience, despite the prevalence of Over-Editing across all frontier coding models today, it has long been difficult to quantify in realistic settings. I hope this work can serve as a first step toward evaluating and improving the minimal editing capability of coding models, and ultimately the overall quality of AI-generated code.
Acknowledgements:
I am grateful to my supervisor A/P Min-Yen Kan and my advisor Tongyao Zhu for their guidance, and to Prime Intellect for sponsoring the compute and API costs of this project.
© 2026. All rights reserved.
Of course exceptions apply. Some basic information that will reliably be discovered is still worth adding to your AGENTS.md to cut down on token use. But after a couple obvious things you quickly get into the realm of premature optimization (unless you actually measure the effects)
Cursor dashboard. I know they're incentivized to over-estimate but feels directionally accurate when I look at recent PRs.
Don’t really think about it. I think when I talk to it through Slack, cursor users codex, in my ide looks like it’s whatever highest claude. In Github comments, who even knows
Much like giving a codebase to a newbie developer, whatever patterns exist will proliferate and the lack of good patterns means that patterns will just be made up in an ad-hoc and messy way.
I think you're likely in the silent majority. LLMs do some stupid things, but when they work it's amazing and it far outweighs the negatives IMHO, and they're getting better by leaps and bounds.
I respect some of the complaints against them (plagiarism, censorship, gatekeeping, truth/bias, data center arms race, crawler behavior, etc.), but I think LLMs are a leap forward for mankind (hopefully). A Young Lady's Illustrated Primer for everyone. An entirely new computing interface.
It has about doubled my development pace. An absolutely incredible gain in a vacuum, though tiny compared to what people seem to manage without these self-constraints. But in exchange, my understanding of the code is as comprehensive as if I had paired on it, or merged a direct report's branch into a project I was responsible for. A reasonable enough tradeoff, for me.
( This was low-stakes test creds anyway which I was testing with thankfully. )
I never pass creds via env or anything else it can access now.
My approach now is to get it to write me linqpad scripts, which has a utility function to get creds out of a user-encrypted share, or prompts if it's not in the store.
This works well, but requires me to run the scripts and guide it.
Ultimately, fully autotonous isn't compatible with secrets. Otherwise, if it really wanted to inspect it, then it could just redirect the request to an echo service.
The only real way is to deal with it the same way we deal with insider threat.
A proxy layer / secondary auth, which injects the real credentials. Then give claude it's own user within that auth system, so it owns those creds. Now responsibilty can be delegated to it without exposing the original credentials.
That's a lot of work when you're just exploring an API or DB or similar.
By default these shell commands don't have network access or write access outside the project directory which is good, but nowhere near customizable enough. Once you approve a command because it needs network access, its other restrictions are lifted too. It's all or nothing.
I'm only 5 years into this career, and I'm going to work manually and absorb as much knowledge as possible while I'm still able to do it. Yes, that means manually doing shit-kicker work. If AI does get so good that I need to use it, as you say, then I'll be running it locally on a version I can master and build tooling for.
From the phrasing, I can't but imagine you as a very calm, completely unemotional person that only emulates user frustration signs, strategically threatening AI that you'll close your subscription when it nukes your code.
> Many assembly programmers were accustomed to having intimate control over memory and CPU instructions. Surrendering this control to a compiler felt risky. There was a sentiment of, if I don’t code it down to the metal, how can I trust what’s happening? In some cases, this was about efficiency. In other cases, it was about debuggability and understanding programming behavior. However, as compilers matured, they began providing diagnostic output and listings that actually improved understanding.
I would 100% use LLMs more and more aggressively if they were more transparent. All my reservations come from times when I prompt “change this one thing” and it rewrites my db schema for some reason, or adds a comment that is actively wrong in several ways. I also think I have a decent working understanding of the assembly my code compiles to, and do occasionally use https://godbolt.org/. Of course, I didn’t start out that way, but I also don’t really have any objections to teenagers vibe-coding games, I just think at some point you have to look under the hood if you’re serious.
The first (and maybe even second) usage of a gnarly, badly thought out pattern might work fine. But you're only a couple steps away from if statement soup. And in the world where your agent's life is built around "getting the tests to pass", you can quickly find it doing _very_ gnarly things to "fix" issues.
The agent only has access to exactly what it needs, be it an implementation agent, analysis agent, or review agent.
Makes it very easy to stay in command without having to sit and approve tons of random things the agent wants to do.
I do not allow bash or any kind of shell. I don't want to have to figure out what some random python script it's made up is supposed to do all the time.
You can get pretty close to reproducible output by narrowing the scope and using certain prompts/harnesses. As in, you get roughly the same output each time with identical prompts, assuming you're using a model which doesn't change every few hours to deal with load, and you aren't using a coding harness that changes how it works every update. It's not deterministic, but if you ask it for a scoped implementation you essentially get the same implementation every time, with some minor and usually irrelevant differences.
So you can imagine with a stable model and harness, with steering you can basically get what you ask it for each time. Tooling that exploits this fact can be much more akin to using an autocomplete, but instead of a line of code it's blocks of code, functions, etc.
A harness that makes it easy to steer means you can basically write the same code you would have otherwise written, just faster. Which I think is a genuine win, not only from a productivity standpoint but also you maintain control over the codebase and you aren't alienated or disenfranchised from the output, and it's much easier to make corrections or write your own implementations where you feel it's necessary. It becomes more of an augmentation and less of a replacement.
Isn't that what git is for, though? Just have your LLM work in a branch, and then you will have a clear record of all the changes it made when you review the pull request.
LLMs are nothing like that
Although, while the compiler devs might know what was going on in the compiler, they wouldn't know what the compiler was doing with that particular bit of code that the FORTRAN developer was writing. They couldn't possibly foresee every possible code path that a developer might traverse with the code they wrote. In some ways, you could say LLMs are like that, too; the LLM developers know how the LLM code works, but they don't know the end result with all the training data and what it will do based on that.
In addition, to the end developer writing FORTRAN it was a black box either way. Sure, someone else knows how the compiler works, but not the developer.
This means it can do anything in the VM, install dependencies, etc... So far, it managed to bork the VM once (unbootable), I could have spent a bit of time figuring out what happened but I had a script to rebuild the VM so didn't bother. To be entirely fair to claude, the VM runs arch linux which is definitely easier to break than other distros.
The latter is something you learn to judge the right time to tackle. Sometimes a small improvement that's not required will mean you're not pressed to make the refactor to avoid hacks. The earliest you can tackle problems, the cheaper they are to solve.
One the one hand, reviewing and micromaning everything it does is tedious and unrewarding. Unlike reviewing a colleague's code, you're never going to teach it anything; maybe you'll get some skills out of it if you finds something that comes up often enough it's worth writing a skill for. And this only gets you, at best, a slight speedup over writing it yourself, as you have to stay engaged and think about everything that's going on.
Or you can just let it grind away agentically and only test the final output. This allows you to get those huge gains at first, but it can easily just start accumulating more and more cruft and bad design decisions and hacks on top of hacks. And you increasingly don't know what it's doing or why, you're losing the skill of even being able to because you're not exercising it.
You're just building yourself a huge pile of technical debt. You might delete your prod database without realizing it. You might end up with an auth system that doesn't actually check the auth and so someone can just set a username of an admin in a cookie to log in. Or whatever; you have no idea, and even if the model gets it right 95% of the time, do you want to be periodically rolling a d20 and if you get a 1 you lose everything?
Of course this requires being fortunate enough that you have one of those AI positive employers where you can spend lots of money on clankers.
I don't review every move it makes, I rather have a workflow where I first ask it questions about the code, and it looks around and explores various design choices. then i nudge it towards the design choice I think is best, etc. That asking around about the code also loads up the context in the appropriate manner so that the AI knows how to do the change well.
It's a me in the loop workflow but that prevents a lot of bugs, makes me aware of the design choices, and thanks to fast mode, it is more pleasant and much faster than me manually doing it.
Maybe I’m just weird (actually that’s a given) but I don’t mind babysitting the clanker while it works.
There’s diminishing returns and moreover this idea that people are holding it wrong / they need to figure out the complexity goes against all that has been done over the past 30 years : making things simpler.
anonu has explicitly said that they've wiped a database twice as a result of agents doing stuff. What sort of diff would help against an agent running commands, without your approval?
The diff: +8000 -4000
edit: there might be a future where we develop robopsychology enough to understand LLM more than black boxes, we we are not there yet.
[1] Aside from injected randomness and parallel scheduling artifacts.
It is just the scope that makes it appear non-deterministic to a human looking at it, and it is large enough to be impossible for a human to follow the entire deterministic chain, but that doesn't mean it isn't in the end a function that translates input data into output data in a deterministic way.
There's plenty of resources online to rectify that, though.
Care to point to any that are set up to be deterministic?
Did you ever stop to think about why no one can get any use out of a model with temp set to zero?
Demonstrably incorrect. This is because the model selection, among other data, is not fixed for (I would say most) LLMs. They are constantly changing. I think you meant something more like an LLM with a fixed configuration. Maybe additional constraints, depending on the specific implementation.
$ main-app git:(main) kubectl get pods | grep agent | head -n 1 | sed -E 's/[a-z]+-agent(.*)/app-agent\1/'
app-agent-656c6ff85d-p86t8 1/1 Running 0 13d
Agent is fully capable of making PR etc. if you provide appropriate tooling. It wipes DB but DB is just separate ephemeral pod. One day perhaps it will find 0-day and break out, but so far it has not done it.Both OpenCode and VsCode support this. I think in ClaudeCode you can do it with skills now.
The other benefit is the MCP tool can mediate e.g. noisy build tool output, and reduce token usage by only showing errors or test failures, nothing else, or simply an ok response with the build run or test count.
So far, I have not needed to give them access to more than build tools, git, and a project/knowledge system (e.g. Obsidian) for the work I have them doing. Well and file read/write and web search.
There is a world of difference between translation and generation. It's even in the name: generative AI. I didn't say anything about magic.
I get why that is in practice different then the manner in which compilers are deterministic, but my point is the difference isnt because of determinism.
In TFA they found that prompting mitigates over-editing up to about 10 percentage points.
BTW, one tip is to look at the size of the codebase. When you see 100KLOC for a first draft of a C compiler, you know something has gone horribly wrong. I would suggest that you at least compare the number of lines the agent produced to what you think the project should take. If it's more than double, the code is in serious, serious trouble. If it's in the <1.5x range, there's a chance it could be saved.
Asking the agent questions is good - as an aid to a review, not as a substitute. The agents lie with a high enough frequency to be a serious problem.
The models don't yet write code anywhere near human quality, so they require much closer supervision than a human programmer.
I use Cursor but it's getting expensive lately, so I'm trying to reduce context size and move to OpenCode or something like that which I can use with some cheaper provider and Kimi 2.5 or whatever.
You could have it build something that takes fewer lines of code, but you aren’t gonna to find much with that level of specification and guardrails.
Also compilers usually compose well: you can test snippets of code in isolation and the generated code it will have at least some relation to whatever asm would be generated when the snippet is embedded in a larger code base (even under inter-procedural optimizations or LTO, you can predict and often control how it will affect the generated code).
Create a program that reads from /dev/random (not urandom). It's not determistic.
In other words, it isn't the random number part of LLMs that make them seem like a black box and unpredictable, but rather the complexity of the underlying model. Even if you ran it in a deterministic way, I don't think people would suddenly feel more confident about the outputted code.