Over-editing refers to a model modifying code beyond what is necessary

I'm either in a minority or a silent majority. Claude Code surpasses all my expectations. When it makes a mistake like over-editing, I explain the mistake, it fixes it, and I ask it to record what it learned in the relevant project-specific skills. It rarely makes that mistake again. When the skill file gets big, I ask Claude to clean and compact it. It does a great job.

It doesn't really make sense economically for me to write software for work anymore. I'm a teacher, architect, and infrastructure maintainer now. I hand over most development to my experienced team of Claude sessions. I review everything, but so does Claude (because Claude writes thorough tests also.) It has no problem handling a large project these days.

I don't mean for this post to be an ad for Claude. (Who knows what Anthropic will do to Claude tomorrow?) I intend for this post to be a question: what am I doing that makes Claude profoundly effective?

Also, I'm never running out of tokens anymore. I really only use the Opus model and I find it very efficient with tokens. Just last week I landed over 150 non-trivial commits, all with Claude's help, and used only 1/3 of the tokens allotted for the week. The most commits I could do before Claude was 25-30 per week.

(Gosh, it's hard to write that without coming across as an ad for Anthropic. Sorry.)

Conversely, I often find coding agents privileging the existing code when they could do a much better job if they changed it to suit the new requirement.

I guess it comes down to how ossified you want your existing code to be.

If it's a big production application that's been running for decades then you probably want the minimum possible change.

If you're just experimenting with stuff and the project didn't exist at all 3 days ago then you want the agent to make it better rather than leave it alone.

Probably they just need to learn to calibrate themselves better to the project context.

I think building something really well with AI takes a lot of work. You can certainly ask it to do things and it will comply, and produce something pretty good. But you don't know what you don't know, especially when it speaks to you authoritatively. So checking its work from many different angles and making sure it's precise can be a challenge. Will be interesting to see how all of this iterates over time.

I've noticed AI's often try and hide failure by catching exceptions and returning some dummy value maybe with some log message buried in tons of extraneous other log messages. And the logs themselves are often over abbreviated and missing key data to successfully debug what is happening.

I suspect AI's learned to do this in order to game the system. Bailing out with an exception is an obvious failure and will be penalized, but hiding a potential issue can sometimes be regarded as a success.

I wonder how this extrapolates to general Q&A. Do models find ways to sound convincing enough to make the user feels satisfied and the go away? I've noticed models often use "it's not X, it's Y", which is a binary choice designed to keep the user away from thinking about other possibilities. Also they often come up with a plan of action at the end of their answer, a sales technique known as the "assumptive close", which tries to get the user to think about the result after agreeing with the AI, rather than the answer itself.

Here, the author means the agent over-edits code. But agents also do "too much": as in they touch multiple files, run tests, do deployments, run smoke tests, etc... And all of this gets abstracted away. On one hand, its incredible. But on the other hand I have deep anxiety over this:

1. I have no real understanding of what is actually happening under the hood. The ease of just accepting a prompt to run some script the agent has assembled is too enticing. But, I've already wiped a DB or two just because the agent thought it was the right thing to do. I've also caught it sending my AWS credentials to deployment targets when it should never do that.

2. I've learned nothing. So the cognitive load of doing it myself, even assembling a simple docker command, is just too high. Thus, I repeatedly fallback to the "crutch" of using AI.

It's funny, because the wisdom that was often taught ( but essentially never practiced ) was "Refactor as you go".

The idea being that if you're working in an area, you should refactor and tidy it up and clean up "tech debt" while there.

In practice, it was seldom done, and here we have LLMs actually doing it, and we're realising the drawbacks.

I've not seen over-editing in Claude Code or Codex in quite a while, so I was interested to see the prompts being used for this study.

I think they're in here, last edited 8 months ago: https://github.com/nreHieW/fyp/blob/5a4023e4d1f287ac73a616b5...

This is a really solid writeup. LLMs are way too verbose in prose and code, and my suspicion is this is driven mainly by the training mechanism.

Cross entropy loss steers towards garden path sentences. Using a paragraph to say something any person could say with a sentence, or even a few precise words. Long sentences are the low perplexity (low statistical “surprise”) path.

I feel ambivalent about it. In most cases, I fully agree with the overdoing assessment and then having to spend 30min correcting and fixing. But I also agree with the fact sometimes the system is missing out on more comprehensive changes (context limitations I suppose)! I am starting to be very strict when coding with these tool but still not quite getting the level of control I would like to see

Feels like a training-data artifact. SFT and preference data are full of "here's a cleaner version of your file", not "here's the minimum 3-line diff". The model learned bigger, more polished outputs win. Prompting around it helps a bit but you're fighting the prior.

I'm building a website in Astro and today I've been scaffolding localization. I asked Codex 5.4 x-high to follow the official guidelines for localization and from that perspective the implementation was good. But then it decides to re-write the copy and layout of all pages. They were placeholders, but still?

Codex also has a tendency to apply unwanted styles everywhere.

I see similar tendencies in backend and data work, but I somehow find it easier to control there.

I'm pretty much all in on AI coding, but I still don't know how to give these things large units of work, and I still feel like I have to read everything but throwaway code.

There is no need for a new name. It’s called a high-impact change. As opposed to a low-impact change, where one changes or adds the least number of lines necessary to achieve the goal.

Not surprised to see this, since once again, because some of us didn’t like history as a subject, lines of code is a performance measure, like a pissing contest.

Like others mentioned, letting the agent touch the code makes learning difficult and induces anxiety. By introducing doubt it actually increases the burden of revision, negating the fast apparent progress. The way I found around this is to use LLMs for designing and auditing, not programming per se. Even more so because it’s terrible at keeping the coding style. Call it skill issue, but I’m happier treating it as a lousy assistant rather than as a dependable peer.

Interesting, my assumption used to be that models over-edit when they're run with optimizations in attention blocks (quantization, Gated DeltaNet, sliding window etc.). I.e. they can't always reconstruct the original code precisely and may end up re-inventing some bits. Can't it be one of the reasons too?

This resonates

I've had success with greenfield code followed by frustration when asking for changes to that code due to over editing

And prompting for "minimal changes" does keep the edits down. In addition to this instruction, adding specifics about how to make the change and what not to do tends to get results I'm looking for.

"add one function that does X, add one property to the data structure, otherwise leave it as is, don't add any new validation"

Don't forget the non-stop unnecessary comments

I feel like a core of this is that agents aren't exactly a replacement for a junior developer like some people say. A junior dev has its own biases, predispositions, history and understanding of the internal and external aspects of a product and company. An AI agent wants to do what you ask in the best way possible which is...not always what a dev wants :) The fix the article talks about is simple but shows that these models have no inherent sense of project scope or proportionality. You have to give context (as much context as possible) explicitly to fill in the gaps so it infers less and makes smaller decisions.

I wish there was a reliable way to choke the agents back and prevent them from doing this. Every line of code added is a potential bug, and they overzealously spew pages and pages of code. I've routinely gone through my (hobby) projects and (yes, still with the aid of an LLM) trimmed some 80% of the generated code with barely any loss of functionality.

The cynic in me thinks it's done on purpose to burn more tokens. The pragmatist however just wants full control over the harness and system prompts. I'm sure this could be done away with if we had access to all the knobs and levers.

I attempt to solve most agent problems by treating them as a dumb human.

In this case I would ask for smaller changes and justify every change. Have it look back upon these changes and have it ask itself are they truly justified or can it be simplified.

Over-editing and over-adding... I can find solutions that are just a few lines of code in a single file where AI would change 10 files and add 100s of lines of code. Writing less code is more important than ever. Too much code means more technical debt, a maintainability nightmare and more places where bugs can hide.

My experience is usually the opposite. The code they write is verbose yes, but the diffs are over-minimal. Whenever I see a comment like "Tool X doesn't support Y or has a bug with Z [insert terrible kludge]" and actually fixing the problem in the other file would be very easy, I know it is AI-generated. I suspect there is a bias towards local fixes to reduce token usage.

I call this overcooking. Adding unnecesary features.

Yeah I have always felt GPT 5.4 does too much. It is amazing at following instructions precisely but it convinces itself to do a bit too much.

I am surprised Gemini 3.1 Pro is so high up there. I have never managed to make it work reliably so maybe there's some metric not being covered here.

With the pi-mono coding agent (running local, open models) this works very well:

"Do not modify any code; only describe potential changes."

I often add it to the end when prompting to e.g. review code for potential optimizations or refactor changes.

I always described it as over-complicating the code, but doing too much is a better diagnosis.

As mentioned in the article, prompting for minimal changes does help. I find GPT models to be very steerable, but it doesn’t mean much when you take your hands of the wheel. These type of issues should be solved at planning stage.

Tangent and admittedly off-topic but I've come to see LLM-assisted coding as a kind of teleportation.

With LLMs, you glimpse a distant mountain. In the next instant, you're standing on its summit. Blink, and you are halfway down a ridge you never climbed. A moment later, you're flung onto another peak with no trail behind you, no sense of direction, no memory of the ascent. The landscape keeps shifting beneath your feet, but you never quite see the panorama. Before you know it, you're back near the base, disoriented, as if the journey never happened. But confident, you say you were on the top of the mountain.

Manual coding feels entirely different. You spot the mountain, you study its slopes, trace a route, pack your gear. You begin the climb. Each step is earned steadily and deliberately. You feel the strain, adjust your path, learn the terrain. And when you finally reach the summit, the view unfolds with meaning. You know exactly where you are, because you've crossed every meter to get there. The satisfaction isn't just in arriving, nor in saying you were there: it is in having truly climbed.

I think the industry has leaned waaay too far into completely autonomous agents. Of course there are reasons why corporations would want to completely replace their engineers with fully autonomous coding agents, but for those of us who actually work developing software, why would we want less and less autonomy? Especially since it alienates us from our codebases, requiring more effort in the future to gain an understanding of what is happening.

I think we should move to semi-autonomous steerable agents, with manual and powerful context management. Our tools should graduate from simple chat threads to something more akin to the way we approach our work naturally. And a big benefit of this is that we won't need expensive locked down SOTA models to do this, the open models are more than powerful enough for pennies on the dollar.

I’m not sure if I share the authors opinion. When I was hand-writing code I also followed the boy-scout rule and did smaller refactorings along the line.

I really felt this. total pain point for me.

When asked to show their development-test path in the form of a design document or test document, I've also noticed variance between the document generated and what the chain-of-thought thingy shows during the process.

The version it puts down into documents is not the thing it was actually doing. It's a little anxiety-inducing. I go back to review the code with big microscopes.

"Reproducibility" is still pretty important for those trapped in the basements of aerospace and defense companies. No one wants the Lying Machine to jump into the cockpit quite yet. Soon, though.

We have managed to convince the Overlords that some teensy non-agentic local models - sourced in good old America and running local - aren't going to All Your Base their Internets. So, baby steps.

> The model fixes the bug but half the function has been rewritten.

The solution to this is to use quality gates that loop back and check the work.

I'm currently building a tool with gates and a diff regression check. I haven't seen these problems for a while now.

https://github.com/tim-projects/hammer

Well seeing as they don't KNOW anything this isn't surprising at all

this is one of the best things about using claude over gpt. claude understands the bigger assignment and does all the work and sometimes more than necessary but for me it beats the alternative.

It's called code churn. Generally, LLMs make code churn.

I've had a bad experience using AI for front-end stuff, where I replace or deprecate a feature only to notice later all the artifacts it left behind, some which were never even used in the first place.

I re-did an entire UI recently, and when one of the elements failed to render I noticed the old UI peeking out from underneath. It had tried just covering up old elements instead of adjusting or replacing them. Like telling your son to clean their room, so they push all the clothes under the bed and hope you don't notice LOL

It saves 2 hours of manual syntax wrangling but introduces 1 .5 hours of clean up and sanity checking. Still a net productivity increase, but not sure if its worth how lazy it seems to be making me (this is an easy error to correct, im sure, but meh Claude can fix it in 2 seconds so...)

I use Claude Code every day and have for as long as it has been available. I use git add -p to ensure I'm only adding what is needed. I review all code changes and make sure I understand every change. I prompt Claude to never change only whitespace. I ask it to be sure to make the minimal changes to fix a bug.

Too many people are treating the tools as a complete replacement for a developer. When you are typing a text to someone and Google changes a word you misspelled to a completely different word and changes the whole meaning of the text message do you shrug and send it anyway? If so, maybe LLMs aren't for you.

You know, this made me think of over-engineering.

...and that led me to believe that AI might be very capable to develop over-engineered audio equipment. Think of all the bells and whistles that could be added, that could be expressed in ridiculous ways with ridiculous price tags.

This seems like something that should be easy to prevent in pi harness. Just tell it to make an extension that before calling file edit tool asks the model to make sure that no lines unconnected with the current topic are going to be unnecessarily changed by this edit.