Codex does in fact use a schema for constrained sampling, it's here: https://github.com/openai/codex/blob/main/codex-rs/core/src/...
It still has to produce an exact match, or at least I didn't read the code to see if there's any fuzzy matching used.
Note the two codex models were the only ones doing worse with the author's proposed format. The author found them doing better with replace than with apply patch, but since the author appears to be unaware that they use a schema for constrained sampling, I think a more realistic benchmark should enable constrained sampling for the apply test.
Also, nice clever optimization here. Lots of low hanging fruit in harness land.
Problem is, replace has been around for so long, most LLMs are tuned for it now
For them I think it would be optimal to provide a tag per function and trust the LLM to rewrite the whole function. As the article notes, full reproduction is generally more reliable than edits for short code.
I suspect the token and attention overhead of a per-line hash limits this approach for smaller models.
I'd love to use a different harness-- ideally an OSS one-- and hook it up to whichever LLM provides the best bang for the buck rather than being tied to Claude.
If you run this out, you realize that the Worse is Better paradox has inverted, it's an arbitrage, and the race is on.
Edit: Checking oh-my-pi, the model has access to str_replace too, so this is just another edit tool.
"You're absolutely right!"
At this point I'd take a contract with Anthropic to have Claude code pick better tooling.
the edit tool point hits though. when you give the model a better interface to express changes (structured diffs vs free-form patches), error rates drop. but nobody talks about this because benchmarks measure "did it solve the problem" not "how many attempts" or "what's the blast radius when it fails". idk maybe I'm just jaded from debugging too many of these.
So, the challenge is actually to find a map of "problem" to "author", then from "author" to "related code", and from there to a solution.
Seeing how bad the results are when you're casually approaching something makes it very evident that it's a topic that can be optimized.
How about Kimi tho how can I play with it?
the benchmark overselling isn't the point though - it's that we're barely using these things right. most people still chat with them like it's 2023. what happens when you combine this with actual review flows not just 'beat swe-bench'
idk I think everyone's too focused on the model when tooling matters more, since that's something you can actually control
The VC economics are creating a reality distortion field where Anthropic is incentivized to burn more tokens so they can rent more GPUs so they can get more investment, and where I am incentivized to pipe the LLM inputs into `claude -p` and blast 50KB of useless proompt onto it so they don't ban me from their 95% discounted API endpoint.
read_toc tool:
...
{
"name": "mcp",
"qualified_name": "mcp",
"type": "constant",
"docstring": null,
"content_point": "src\\mcps\\code_help\\server.py::17::18::python::mcp",
"is_nested": false
},
{
"name": "handler",
"qualified_name": "handler",
"type": "constant",
"docstring": null,
"content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
"is_nested": false
},
....

update_content tool:
{
"content": "...",
"content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
"project_root": ....
}

As Emacs has a built-in tree-sitter package, I implemented this same idea. I created gptel tools like tree_sitter_list_nodes, tree_sitter_get_nodes, tree_sitter_update_nodes, tree_sitter_insert_before_node and tree_sitter_insert_after_node. The "list" tool returns a list of AST nodes with first line number, first line content and node hash. The LLM can then use "get" to collect interesting nodes in their entirety and "update" to update a list of nodes identified by hash with new content (var/function bodies).
Worked like a charm.
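For the curious, here is a minimal sketch of that list/update-by-hash shape. The real tools are Emacs Lisp and the node spans come from tree-sitter; the TypeScript and the names in it are just a hypothetical translation of the idea:

```ts
// Sketch of the "list"/"update" pair: per-node short hashes instead of raw line
// numbers. Node spans are plain data here; in the real setup tree-sitter
// supplies them. All names are hypothetical.
import { createHash } from "node:crypto";

interface NodeSpan { startLine: number; endLine: number } // 1-based, inclusive

const shortHash = (text: string) =>
  createHash("sha256").update(text).digest("hex").slice(0, 8);

// "list": first line number, first line content, and a hash of the node body.
function listNodes(source: string, spans: NodeSpan[]) {
  const lines = source.split("\n");
  return spans.map((s) => {
    const body = lines.slice(s.startLine - 1, s.endLine).join("\n");
    return { firstLine: s.startLine, firstLineText: lines[s.startLine - 1], hash: shortHash(body), body };
  });
}

// "update": replace the body of the node identified by its hash; stale hashes are rejected.
function updateNode(source: string, spans: NodeSpan[], hash: string, newBody: string): string {
  const node = listNodes(source, spans).find((n) => n.hash === hash);
  if (!node) throw new Error(`no node with hash ${hash}; re-run list`);
  const lines = source.split("\n");
  lines.splice(node.firstLine - 1, node.body.split("\n").length, ...newBody.split("\n"));
  return lines.join("\n");
}
```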
Back when I was maintaining a coding harness around the time of Claude 3.5, we tried hash prefixes, we tried line-number prefixes, we tried a lot of different approaches to making the model better at selecting edit blocks, and ultimately, at least back then, fuzzy string matching won out.
If the smaller labs (Zai, Moonshot, DeepSeek, Mistral...) got together as a consortium and embraced a harness, OpenCode for example, then just by the power of "evolution across different environments" they might hit the jackpot earlier than the bigger labs.
I see a lot of evidence to the contrary though. Anyone know what the underlying issue here is?
I think one of the things that this confirms, for me at least, is that it's better to think of "the AI" as not just the LLM itself, but the whole cybernetic system of feedback loops joining the LLM and its harness. Because, if the harness can make as much if not more of a difference, when improved, as improvements to the model itself, then they have to be really considered equally important. Not to mention the fact that models are specifically reinforcement learned to use harnesses and harnesses are adapted to the needs of models in general or specific models. So they necessarily sort of develop together in a feedback loop. And then in practice, as they operate, it is a deeply intertwined feedback loop where the entity that actually performs the useful work, and which you interact with, is really the complete system of the two together.
I think thinking like this could not only unlock quantitative performance improvements like the ones discussed in this blog post, but also help us conceive of the generative AI project as actually a project of neurosymbolic AI, even if the most capital-intensive and novel aspect is a neural network; and once we begin to think like that, it unlocks a lot of new options and more holistic thinking and might increase research in the harness area.
> Often the model isn’t flaky at understanding the task. It’s flaky at expressing itself. You’re blaming the pilot for the landing gear.
> The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross. Treating harnesses as solved, or even inconsequential, is very short-sighted.
> The gap between “cool demo” and “reliable tool” isn’t model magic. It’s careful, rather boring, empirical engineering at the tool boundary.
https://github.com/jahala/tilth
it's on npm and cargo:
- cargo install tilth
- npx tilth
then tilth install claude-code/windsurf/cursor --edit
(--edit flag is needed)
I made "tilth" a few days ago, since I'm consistently trying to get the LLMs to use tools more efficiently and spend less tokens doing it -- original tilth post from Monday: https://news.ycombinator.com/item?id=46952321
Agents waste a lot of tokens on editing, sandboxes, passing info back and forth from tool calls and subagents.
Love the pragmatic mix of content based addressing + line numbers. Beautiful.
> Here is why that is backwards. I just showed that a different edit format improves their own models by 5 to 14 points while cutting output tokens by ~20%. That’s not a threat. It’s free R&D.
He makes it sound like he got a 5-14% boost on a top-level benchmark, not a 5% improvement on a narrow find-and-replace metric. Anecdotally, I don't usually have a lot of issues with editing in Claude Code or Cursor, and if there is an issue the model corrects it.
Assuming that it costs double the tokens when it has to correct itself, and find and replace errors are as prominent in actual day to day use as his benchmark, we're talking a 5% efficiency gain in editing token use (not reasoning or tool use). Given that editing must be less than 1/3 of the token use (I assume much less?), we're talking an overall efficiency gain of less than 1%.
This seems like a promising technique, but maybe not a high priority among efficiency gains for these tools. The messianic tone, like assuming that Google cut off his access to suppress his genius editing technique rather than just because he was hammering their API, also leaves a bad taste, along with the rampant and blatant ChatGPTisms in the blog post.
It's less token heavy than the proposed hash approach, and I don't think frontier LLMs hallucinate line numbers if each line in the context is prefixed with them.
With search-replace you can work on separate parts of a file independently with the LLM. Not to mention that with each edit all the lines below are shifted, so you now need to provide the LLM with the whole content again.
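A tiny illustration of that line-shifting problem (hypothetical, line numbers are 1-based):

```ts
// After one insertion, a line number recorded earlier points at the wrong line.
const lines = ["function a() {", "}", "function b() {", "}"];

const targetLine = 3;                      // "function b() {" before any edit
lines.splice(2, 0, "// new helper here");  // insert after line 2; everything below shifts

console.log(lines[targetLine - 1]);        // "// new helper here", not "function b() {"
```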
Have you tested followup edits on the same files?
Would also be worth having special tokens for this kind of navigation.
> Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.
This is why I find the banning of Claude subscriptions in other harnesses so heinous. Their harness that they're forcing onto everyone has tons of big issues, including wasting massive numbers of tokens. Very much in line with intentionally refusing to adhere to standards in the most IE6 way possible.
Is it possible that burning extra tokens is the point, since they get paid more?
Models have improved dramatically even with the same harness
You probably don't want to use the line number though unless you need to disambiguate
But your write tool implementation can take care of that
Context: I created hypertokens, an even more robust hashing mechanism for building context-addressable memory (CAM). One cheat code is to make them prefix-free; there are lots of others that get deep into why models work the way they do, etc.
I'll point out that if you want permission prompts for certain behavior, you have to add that yourself. There's at least one example.
Edit: Just noticed the article's author is using a fork of Pi.
[1]: https://shittycodingagent.ai/
[2]: https://github.com/badlogic/pi-mono/tree/main/packages/codin...
Are they portable bit by bit back to Pi, or are there enough differences that they can't be? How about normal Pi extensions, can they be used in omp?
Some of the stuff definitely looks interesting.
I keep asking myself “could my friends and family be handed this and be expected to build what I’m building on them” and the answer is an immediate “absolutely not”. Could a non technical manager use these tools do build what I’m building? Absolutely not. And when I think about it, it’s for the exact same reason it’s always been… they just aren’t a developer. They just don’t “think” in the way required to effectively control a computer.
LLMs are just another way to talk to a machine. They aren’t magic. All the same fundamental principles that apply to properly telling a machine what to do still apply. It’s just a wildly different mechanism.
That all being said, I think these things will dramatically speed up the pace at which software eats the world. Put LLMs into a good harness and holy shit it’s like a superpower… but to get those superpowers unlocked you still have to know the basics, same as before. I think this applies to all other trades too. If you are a designer you still have to know what good design is and how to articulate it. Data scientists still need to understand the basics of their trade… these tools just give them superpowers.
Whether or not this assertion remains true in two or three years remains to be seen but look at the most popular tool. Claude code is a command line tool! Their gui version is pretty terrible in comparison. Cursor is an ide fork of vscode.
These are highly technical tools requiring somebody that knows file systems, command lines, basic development like compilers, etc. they require you to know a lot of stuff most people simply don’t. The direction I think these tools will head is far closer to highly sophisticated dev tooling than general purpose “magic box” stuff that your parents can use to… I dunno… vibe code the next hit todo app.
The way edits happen is that the agent (local) first tells the model (typically remote) that it has an edit tool (e.g. taking parameters file name, find string and replace string). If the model decides it wants to edit a file then it'll invoke this edit tool, which just results in a blob of JSON being put in the model's response specifying the edit (filename, etc). The agent then receives the response, intercepts this JSON blob, sees that it is an edit request and does what is asked.
The problem the article is describing is that the edit request (tool invocation) generated by the model isn't always 100% accurate. Even if the agent told the model it had a tool to invoke an actual editor, say sed, assuming the model knew how to use sed, this is still going to fail if the edit request cannot be interpreted literally by the editor (due to being inaccurate).
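To make that concrete, a minimal sketch of the agent-side half of that loop; the schema fields and names are illustrative, not any particular vendor's wire format:

```ts
// Agent-side half of the loop: advertise a tool schema, receive a JSON tool
// call from the model, apply it locally, return a result string as tool output.
import { readFileSync, writeFileSync } from "node:fs";

// Advertised to the model alongside the conversation.
const editToolSchema = {
  name: "edit_file",
  parameters: {
    type: "object",
    properties: {
      path: { type: "string" },
      find: { type: "string" },    // exact text to locate
      replace: { type: "string" }, // text to substitute
    },
    required: ["path", "find", "replace"],
  },
};

// The model's response contains just a JSON blob naming the tool and its arguments.
interface EditCall { path: string; find: string; replace: string }

function applyEdit(call: EditCall): string {
  const content = readFileSync(call.path, "utf8");
  if (!content.includes(call.find)) return "error: find string not present in file";
  writeFileSync(call.path, content.replace(call.find, call.replace));
  return "ok"; // fed back to the model as the tool result
}
```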
We got lines-with-anchors working fine as a replacement strategy, the problem was that when you don't make the model echo what it's replacing, it's literally dumber at writing the replacement; we lost more in test failures + retries than we gained in faster outputs.
Makes sense when you think about how powerful the "think before answering" principle is for LLMs, but it's still frustrating
* Subscriptions are oversubscribed. They know how much an “average” Claude Code user actually consumes to perform common tasks and price accordingly. This is how almost all subscription products work.
* There is some speculation that there is cooperative optimization between the harness and backend (cache related etc).
* Subscriptions are subsidized to build market share; to some extent the harnesses are “loss leader” halo products which drive the sales of tokens, which are much more profitable.
It’s because they want to study you.
They want the data!
Underscores the importance of sovereign models you can run on the edge, finetune yourself, and run offline. At State of Utopia, we're working on it!
Like a good programming language, a good harness offers a better affordance for getting stuff done.
Even if we put correctness aside, tooling that saves time and tokens is going to be very valuable.
OpenAI used early versions of GPT-5.3-Codex to: debug its own training process, manage its deployment and scaling and diagnose test results and evaluation data.
The Claude Code team has shipped 22 PRs in a single day and 27 the day before, with 100% of the code in each PR generated entirely by Claude Code.
When I was reading the Opus 4.6 launch post, they mentioned the same thing and their TerminalBench score was based on using Terminus 2 and not CC.
0. https://mariozechner.at/posts/2025-11-30-pi-coding-agent/
I don’t believe it’s exceptionally unique or new that companies will revoke access if you are using an unpublished API that their apps use. I don’t see anything wrong with it myself. If you want, pay for normal token use on the published APIs. There is no expectation that, even as a paid user, you can use APIs that are not explicitly published for usage.
Someone has to do the baseline training, development, and innovation. it can't be clones all the way down
It's completely understandable that prompting in better/more efficient means would produce different results.
I can say unironically that we haven't even tapped the full potential of GPT-4. The original one, from 2023. With no reasoning, no RL, no tool calling, no structured outputs, etc. (No MCP, ye gods!) Yes, it's possible to build coding agents with it!
I say this because I did!
Forcing yourself to make things work with older models forces you to keep things simple. You don't need 50KB of prompts. You can make a coding agent with GPT-4 and half a page of prompt.
Now, why would we do this? Well, these constraints force you to think differently about the problem. Context management becomes non-optional. Semantic compression (for Python it's as simple as `grep -r def .`) becomes non-optional. Bloating the prompt with infinite detail and noise... you couldn't if you wanted to!
Well, surely none of this is relevant today? Well, it turns out all of it still is! e.g. small fix, the "grep def" (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)
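For illustration, a rough sketch of that kind of startup repo map; the definition regex and the file filter are assumptions, adjust per language:

```ts
// One pass over the tree, one compact "where everything is" listing to prepend
// to the agent's context at session start.
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

function* walk(dir: string): Generator<string> {
  for (const name of readdirSync(dir)) {
    const p = join(dir, name);
    if (statSync(p).isDirectory()) yield* walk(p);
    else if (p.endsWith(".py")) yield p;
  }
}

export function repoMap(root: string): string {
  const out: string[] = [];
  for (const file of walk(root)) {
    readFileSync(file, "utf8").split("\n").forEach((line, i) => {
      if (/^\s*(def|class)\s+\w+/.test(line)) out.push(`${file}:${i + 1}: ${line.trim()}`);
    });
  }
  return out.join("\n");
}
```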
-- We can also get into "If you let the LLM design the API then you don't need a prompt because it already knows how it should work", but... we can talk about that later ;)
That’s when the future really starts hitting you.
I will still opt for a scriptable shell. A few scripts, and I have a custom interface that can be easily composed. And could be run on a $100 used laptop from ebay.
2) AFAIK the $20/month plan allows use of more tokens per month than if you bought $20 of tokens. my understanding is it assumes most users will only use a fraction of that each month, and they rake in profit (like a gym membership)
It’s disheartening that programmers are using this advanced, cutting-edge technology with such a backwards, old-fashioned approach.[1]
Code generation isn’t a higher level abstraction. It’s the same level but with automation.
See [1]. I’m open to LLMs or humans+LLMs creating new abstractions. Real abstractions that hide implementation details and don’t “leak”. Why isn’t this happening?
Truly “vibe coding” might also get the same job done. In the sense of: you only have to look at the generated code for reasons like how a C++ programmer looks at the assembly. Not to check if it is even correct. But because there are concerns beyond just the correctness like code gen size. (Do you care about compiler output size? Sometimes. So sometimes you have to look.)
Part of the problem though is that tools like Claude Code don't want to assume too much of the environment - that a specific editor is available, or even that it is running on a particular OS. The way it remains platform agnostic and not reliant on specific tools is by only having a dependency on Node.js, which provides file read/write support, so to implement an edit request the agent uses Node.js to read the file, itself implements the edit, then again uses Node.js to create the new updated file.
The trouble is though, because it's all indeterminate slop, every model will break in small ways such that you're back to indeterminacy and building a harness on top of the harness.
Still, <nerd snipe>, there's probably a way to get the local model and an arbitrary remote model to agree on how to make a method call. But that will only be fruitful if you find a highly reproducible set of tuples within the models' shared space.
With CC you can do a /cost to see how much your session cost in dollar terms, that's a good benchmark IMO for plugins, .md files for agents, and so on. Minimize the LLM cost in the way you'd minimize typical resource usage on a computer like cpu, ram, storage etc.
Not sure what they're calculating, but this seems to me like it could be many times more efficient than 20%.
If I do things for the love of it, the rules are different of course. But otherwise I will simply always accept that there are many things that improve around me, that I have no intimate knowledge of and probably never will, and I let other people work them out and happily lean on their work to do the next thing I care about, that is not already solved.
On recent versions Shift+Win+- also works, and Win+- produces an en dash.
I just wonder how unique these hashes will be if only 2 characters. It seems like the collision rate would be really high.
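Some rough numbers, assuming 2 hex characters (256 values) and tags that pair the hash with a line number, as in the article's examples:

```ts
// Back-of-envelope numbers for 2-hex-char tags (256 possible values), assuming
// the tag pairs the hash with a line number.
const SPACE = 256;

// A stale edit slipping past the check requires the *new* content on that same
// line to hash to the same value: about 1/256 per edit.
console.log("same-line false match:", (1 / SPACE).toFixed(4)); // ~0.0039

// If the hash were used alone, with no line number, the birthday bound bites fast:
const birthday = (n: number) => {
  let pNoCollision = 1;
  for (let i = 0; i < n; i++) pNoCollision *= (SPACE - i) / SPACE;
  return 1 - pNoCollision;
};
console.log("some collision among 20 lines:", birthday(20).toFixed(2));   // ~0.5
console.log("some collision among 100 lines:", birthday(100).toFixed(2)); // ~1.0
```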
It's truly disgusting.
Hahaha yeah. This is very true. I find myself making ad hoc versions of this in static markdown files to get around it. Just another example of the kind of low hanging fruit harnesses are leaving on the table. A version of this that uses tree sitter grammars to map a codebase, and does it on every startup of an agent, would be awesome.
> My Weird Hill is that we should be building things with GPT-4.
I disagree, IMO using the best models we have is a good way to avoid wasting time, but that doesn't mean we shouldn't also be frugal and clever with our harnesses!
the user inclusion part is real too. the best results i get aren't from fully autonomous agents, they're from tight human-in-the-loop cycles where i'm steering in real time. the model does the heavy lifting, i do the architectural decisions and error correction. feels more like pair programming than automation.
Ultimately the market is going to force them to open up and let people flex their subs.
(Already published on cargo, on npm in a few mins).
In fact only the edit tool changed. That’s it.
The conversation right now is almost entirely about which model is best at coding, GPT-5.3 or Opus. Gemini vs whatever dropped this week. This framing is increasingly misleading because it treats the model as the only variable that matters, when in reality one of the bottlenecks is something much more mundane: the harness.
Not only is it where you capture the first impression of the user (is it uncontrollably scrolling, or smooth as butter?), it is also the source of every input token, and the interface between the model’s output and every change made to your workspace.
I maintain a little “hobby harness”, oh-my-pi, a fork of Pi, a wonderful open-source coding agent by Mario Zechner. I’ve so far authored 1,300 commits, mostly playing around and making incremental improvements here and there when I see a pain point, (or autism strikes and I see an opportunity to embed more Rust via N-API because “spawning rg feels wrong”~).
Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.
Tool schemas, error messages, state management, everything between “the model knows what to change” and “the issue is resolved.” This is where most failures happen in practice.
Being model agnostic, it is a great testing ground, as the model is but a parameter. The real variable is the harness, which you have unimaginable control over.
Anyhow, let me tell you about this one variable I changed yesterday.
Before I explain what I built, it’s worth understanding the state of the art.
Codex uses apply_patch: It takes a string as input, which is essentially an OpenAI-flavored diff, and instead of relying on a structured schema, the harness just expects this blob to follow a strict set of rules. Since OpenAI folks are without a doubt smart, I’m sure the token selection process is biased to fit this structure at the LLM gateway for the Codex variants of GPT, similar to how other constraints like JSON schemas or required tool calls work.
But give this to any other model, completely unaware of it? Patch failures go through the roof. Grok 4’s patch failure rate in my benchmark was 50.7%, GLM-4.7’s was 46.2%. These aren’t bad models — they just don’t speak the language.
Claude Code (and most others) use str_replace: find the exact old text, swap in the new text. Very simple to think about. But the model must reproduce every character perfectly, including whitespace and indentation. Multiple matches? Rejected. The “String to replace not found in file” error is so common it has its own GitHub issues megathread (+27 other issues). Not exactly optimal. Gemini does essentially the same thing plus some fuzzy whitespace matching.
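A minimal sketch of that contract, with illustrative names and error text:

```ts
// The str_replace contract: the old text must match exactly, and exactly once.
function strReplace(content: string, oldText: string, newText: string): string {
  const matches = content.split(oldText).length - 1;
  if (matches === 0) throw new Error("String to replace not found in file");
  if (matches > 1) throw new Error(`Found ${matches} matches; include more surrounding context`);
  return content.replace(oldText, newText);
}
```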
Cursor trained a separate neural network: a fine-tuned 70B model whose entire job is to take a draft edit and merge it into the file correctly. The harness problem is so hard that one of the most well-funded AI companies decided to throw another model at it, and even then they mention in their own blog post that “fully rewriting the full file outperforms aider-like diffs for files under 400 lines.”
Aider’s own benchmarks show that format choice alone swung GPT-4 Turbo from 26% to 59%, but GPT-3.5 scored only 19% with the same format because it couldn’t reliably produce valid diffs. The format matters as much as the model.
The Diff-XYZ benchmark from JetBrains confirmed it systematically: no single edit format dominates across models and use cases. EDIT-Bench found that only one model achieves over 60% pass@1 on realistic editing tasks.
As you can see, there is no real consensus on the “best solution” to the simple “how do you change things” problem. My 5c: none of these tools give the model a stable, verifiable identifier for the lines it wants to change without wasting tremendous amounts of context and depending on perfect recall. They all rely on the model reproducing content it already saw. When it can’t — and it often can’t — the user blames the model.
Now bear with me here. What if, when the model reads a file, or greps for something, every line comes back tagged with a 2-3 character content hash:
1:a3|function hello() {
2:f1| return "world";
3:0e|}
When the model edits, it references those tags — “replace line 2:f1, replace range 1:a3 through 3:0e, insert after 3:0e.” If the file changed since the last read, the hashes (optimistically) won’t match and the edit is rejected before anything gets corrupted.
If they can recall a pseudo-random tag, chances are, they know what they’re editing. The model then wouldn’t need to reproduce old content, or god forbid whitespace, to demonstrate a trusted “anchor” to express its changes off of.
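A minimal sketch of the mechanics, assuming a 2-hex-character SHA-256 prefix as the tag; the actual implementation may differ:

```ts
// Reads return "line:hash|content"; edits reference "line:hash" tags; stale tags
// are rejected before anything is written.
import { createHash } from "node:crypto";

const tagOf = (line: string) =>
  createHash("sha256").update(line).digest("hex").slice(0, 2);

// What read/grep hand back to the model.
function renderWithTags(content: string): string {
  return content
    .split("\n")
    .map((line, i) => `${i + 1}:${tagOf(line)}|${line}`)
    .join("\n");
}

// A single-line replace: the anchor "N:hh" must still describe what is on disk.
function applyHashlineEdit(content: string, ref: string, newText: string): string {
  const [numStr, hash] = ref.split(":");
  const lines = content.split("\n");
  const idx = Number(numStr) - 1;
  if (idx < 0 || idx >= lines.length || tagOf(lines[idx]) !== hash) {
    throw new Error(`stale or unknown anchor ${ref}; re-read the file`); // optimistic check
  }
  lines[idx] = newText;
  return lines.join("\n");
}
```

Range replaces and inserts extend the same check to both endpoints of the range.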
Since my primary concern was about real-world performance, the fixtures are generated as follows:
An average task description looks something like this:
# Fix the bug in `useCommitFilteringAndNavigation.js`
A guard clause (early return) was removed.
The issue is in the `useCommitFilteringAndNavigation` function.
Restore the missing guard clause (if statement with early return).
Naturally, we don’t expect 100% success rate here, since the model can come up with a unique solution that isn’t necessarily the exact same file, but the bugs are mechanical enough that most of the time, the fix is our mutation being reverted.
3 runs per task, 180 tasks per run. Fresh agent session each time, with read, edit, and write tools. We simply give it a temporary workspace, pass the prompt, and once the agent stops, we compare against the original file before and after formatting.
Sixteen models, three edit tools, and the outcome is unambiguous: patch is the worst format for nearly every model, hashline matches or beats replace for most, and the weakest models gain the most. Grok Code Fast 1 went from 6.7% to 68.3%, a tenfold improvement, because patch was failing so catastrophically that the model’s actual coding ability was almost completely hidden behind mechanical edit failures. MiniMax more than doubled. Grok 4 Fast’s output tokens dropped 61% because it stopped burning tokens on retry loops.
+8% improvement in the success rate of Gemini is bigger than most model upgrades deliver, and it cost zero training compute. Just a little experimenting (and ~$300 spent benchmarking).
Often the model isn’t flaky at understanding the task. It’s flaky at expressing itself. You’re blaming the pilot for the landing gear.
Anthropic recently blocked OpenCode, a massively popular open-source coding agent, from accessing Claude through Claude Code subscriptions.
Anthropic’s position “OpenCode reverse-engineered a private API” is fair on its face. Their infrastructure, their rules. But look at what the action signals:
Don’t build harnesses. Use ours.
It’s not just Anthropic either. While writing this article, Google banned my account from Gemini entirely:

Not rate-limited. Not warned. Disabled. For running a benchmark — the same one that showed Gemini 3 Flash hitting 78.3% with a novel technique that beats their best attempt at it by 5.0 pp. I don’t even know what for.
Here is why that is backwards. I just showed that a different edit format improves their own models by 5 to 14 points while cutting output tokens by ~20%. That’s not a threat. It’s free R&D.
No vendor will do harness optimization for competitors’ models. Anthropic won’t tune for Grok. xAI won’t tune for Gemini. OpenAI won’t tune for Claude. But an open-source harness tunes for all of them, because contributors use different models and fix the failures they personally encounter.
The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross. Treating harnesses as solved, or even inconsequential, is very short-sighted.
I come from a background of game security. Cheaters are hugely destructive to the ecosystem. Sure, they get banned, chased, sued, but a well-known secret is that eventually the security team asks, “Cool! Want to show us how you got around that?”, and they join the defense.
The correct response when someone messes with your API, and manages to gather a significant following using their tools is “tell us more”, not “let’s blanket-ban them in thousands; plz beg in DMs if you want it reversed tho.”
The harness problem is real, measurable, and it’s the highest-leverage place to innovate right now. The gap between “cool demo” and “reliable tool” isn’t model magic. It’s careful, rather boring, empirical engineering at the tool boundary.
The harness problem will be solved. The question is whether it gets solved by one company, in private, for one model, or by a community, in the open, for all of them.
The benchmark results speak for themselves.
All code, benchmarks, and per-run reports: oh-my-pi
I’ll probably get downvoted for this, but am I the only one who thinks it’s kind of wild how much anger is generated by these companies offering discounted plans for use with their tools?
At this point, there would be less anger and outrage on HN if they all just charged us the same high per-token rate and offered no discounts or flat rate plans.
Like most things - assume the "20/100/200" dollar deals that are great now are going to go down the enshittification route very rapidly.
Even if the "limits" on them stay generous, the product will start shifting to prioritize things the user doesn't want.
Tool recommendations are my immediate and near term fear - paid placement for dev tools both at the model level and the harness level seem inevitable.
---
The right route is open models and open harnesses, ideally on local hardware.
One mechanism we establish is that each model has a fidelity window, i.e., r tokens of content for s tag tokens; each tag token adds extra GUID-like marker capacity via its embedding vector. Since 1-, 2-, and 3-digit numbers are only one token in top models, a single hash token lacks enough capacity and separation in latent space.
We also show the hash should be properly prefix-free, with unique symbols per digit; e.g., if using A-K and L-Z to hash, then A,R is a legal hash whereas M,C is not a permitted hash.
We can do all this and more rather precisely, as we show in our arXiv paper on the same; the next update goes deeper into group theory, information theory, etc. on boosting model recall, reasoning, tool calls, and more by way of robust hashing.
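A small sketch of that prefix-free, position-specific alphabet idea, using the comment's A-K / L-Z example:

```ts
// Prefix-free, position-specific alphabets: first symbol from A-K, second from
// L-Z, so no tag is a prefix of another and tag boundaries are unambiguous.
const FIRST = "ABCDEFGHIJK";      // 11 symbols
const SECOND = "LMNOPQRSTUVWXYZ"; // 15 symbols -> 11 * 15 = 165 distinct tags

function tag(index: number): string {
  if (index < 0 || index >= FIRST.length * SECOND.length) throw new Error("out of range");
  return FIRST[Math.floor(index / SECOND.length)] + SECOND[index % SECOND.length];
}

const isValidTag = (t: string) =>
  t.length === 2 && FIRST.includes(t[0]) && SECOND.includes(t[1]);

// "AR" is a legal tag (A from the first alphabet, R from the second); "MC" is not.
```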
Anthropic says "do the simplest thing that works." If it works with the LLMs we had 3 years ago, doesn't that make it simpler?
The newer LLMs mostly seem to work around the poor system design. (Like spawning 50 subagents on a grep-spree because you forgot to tell it where anything is...) But then you get poor design in prod!
Instead of cat + grep + manual line counting, one tool call returns a structural outline of a large file, lets you drill into sections, and since this last update also returns hashline-anchored output that an edit tool can target.
Precisely. This is why I use Zed and the Zed Agent. It's near-unparalleled for live, mind-meld pair programming with an agent, thanks to CRDTs, DeltaDB, etc. I can elaborate if anyone is interested.
There may be many lines that are duplicates, e.g. “{”
I don’t assume this at all. In fact, the opposite has been happening in my experience: I try multiple providers at the same time and the $20/month plans have only been getting better with the model improvements and changes. The current ChatGPT $20/month plan goes a very long way even when I set it to “Extra High” whereas just 6 months ago I felt like the $20/month plans from major providers were an exercise in bouncing off rate limits for anything non-trivial.
Inference costs are only going to go down from here and models will only improve. I’ve been reading these warnings about the coming demise of AI plans for 1-2 years now, but the opposite keeps happening.
So then it's better to start obeying robots.txt as a ladder pull, through a "nicely behaved" image advantage.
This timing also crosses over with the frontier labs raising ever larger and larger rounds. If Anthropic IPOs (which I honestly doubt), then we may get a better sense of actual prices in the market, as it's unlikely the markets will continue letting them spend more and more money each year without a return.
It sure does and Codex is great, but do you think they'll maintain the current prices after/if it eventually dominates Claude Code in terms of marketshare and mindshare?
The alternative is to say that bugs shouldn’t be fixed because it’s a ladder pull or something. But that’s crazy. What’s the point of complaining if not to get people to fix things?
And the MCP already only has the most essential tools for my workflow: the ability to run queries against a few databases.
1. the diffs from the agent just show up in the regular file you were editing, you're not forced to use a special completion model, or view the changes in a special temporary staging mode or different window.
2. you can continue to edit the exact same source code without accepting or rejecting the changes, even in the same places, and nothing breaks — the diffs still look right, and doing an accept or reject Just Works afterwards.
3. you can accept or reject changes piecemeal, and the model doesn't get confused by this at all and have to go "oh wait, the file was/wasn't changed, let me re-read..." or whatever.
4. Even though you haven't accepted the changes, the model can continue to make new ones, since they're stored as branches in the CRDT, so you can have it iterate on its suggestions before you accept them, without forcing it to start completely over either (it sees the file as if its changes were accepted)
5. Moreover, the actual files on disk are in the state it suggests, meaning you can compile, fuzz, test, run, etc to see what it's proposed changes do before accepting them
6. you can click a follow button and see which files it has open, where it's looking in them, and watch as it edits the text, like you're following a dude in Dwarf Fortress. This means you can very quickly know what it's working on and when, correct it, or hop in to work on the same file it is.
7. It can actually go back and edit the same place multiple times as part of a thinking chain, or even as part of the same edit, which has some pretty cool implications for final code-quality, because of the fact that it can iterate on its suggestion before you accept it, as well as point (9) below
8. It streams its code diffs, instead of hanging and then producing them as a single gigantic tool call. Seeing it edit the text live, instead of having to wait for a final complete diff to come through that you either accept or reject, is a huge boon for iteration time compared to e.g. ClaudeCode, because you can stop and correct it mid way, and also read as it goes so you're more in lockstep with what's happening.
9. Crucially, because the text it's suggesting is actually in the buffer at all times, you can see LSP, tree-sitter, and linter feedback, all inline and live as it writes code; and as soon as it's done an edit, it can see those diagnostics too — so it can actually iterate on what it's doing with feedback before you accept anything, while it is in the process of doing a series of changes, instead of you having to accept the whole diff to see what the LSP says
Unlike Uber and Lyft, the price of inference continues to go down as datacenter capacity comes online and compute hardware gets more powerful.
So I think we'll always have affordable LLM services.
I do think the obsession with prices of the entry-level plans is a little odd. $20/month is nothing relative to the salaries people using these tools receive. HN is full of warnings that prices are going to go up in the future, but what's that going to change for software developers? Okay, so my $20/month plan goes to $40/month? $60/month? That's still less than I pay for internet access at home.