Codex does in fact use a schema for constrained sampling, it's here: https://github.com/openai/codex/blob/main/codex-rs/core/src/...
It still has to produce an exact match, or at least I didn't read the code to see if there's any fuzzy matching used.
Note the two codex models were the only ones doing worse with the author's proposed format. The author found them doing better with replace than with apply patch, but since the author appears to be unaware that they use a schema for constrained sampling, I think a more realistic benchmark should enable constrained sampling for the apply test.
Also, nice clever optimization here. Lots of low hanging fruit in harness land.
Problem is, replace has been around for so long, most LLMs are tuned for it now
For them I think it would be optimal to provide a tag per function and trust the LLM to rewrite the whole function. As the article notes, full reproduction is generally more reliable than edits for short code.
I suspect the token and attention overhead of a per-line hash limits this approach for smaller models.
I'd love to use a different harness-- ideally an OSS one-- and hook it up to whichever LLM provides the best bang for the buck rather than being tied to Claude.
If you run this out, you realize that the Worse is Better paradox has inverted, it's an arbitrage, and the race is on.
Edit: Checking oh-my-pi, the model has access to str_replace too, so this is just another edit tool.
"You're absolutely right!"
At this point I'd take a contract with Anthropic to have Claude code pick better tooling.
the edit tool point hits though. when you give the model a better interface to express changes (structured diffs vs free-form patches), error rates drop. but nobody talks about this because benchmarks measure "did it solve the problem" not "how many attempts" or "what's the blast radius when it fails". idk maybe I'm just jaded from debugging too many of these.
So, the challenge is actually to find a map of "problem" to "author", then from "author" to "related code", and from there to a solution.
Seeing how bad the results are when you're casually approaching something makes it very evident that it's a topic that can be optimized.
How about Kimi tho how can I play with it?
the benchmark overselling isn't the point though - it's that we're barely using these things right. most people still chat with them like it's 2023. what happens when you combine this with actual review flows not just 'beat swe-bench'
idk I think everyone's too focused on the model when tooling matters more, since that's something you can actually control
The VC economics are creating a reality distortion field where Anthropic is incentivized to burn more tokens so they can rent more GPUs so they can get more investment, and where I am incentivized to pipe the LLM inputs into `claude -p` and blast 50KB of useless proompt onto it so they don't ban me from their 95% discounted API endpoint.
read_toc tool:
...
{
"name": "mcp",
"qualified_name": "mcp",
"type": "constant",
"docstring": null,
"content_point": "src\\mcps\\code_help\\server.py::17::18::python::mcp",
"is_nested": false
},
{
"name": "handler",
"qualified_name": "handler",
"type": "constant",
"docstring": null,
"content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
"is_nested": false
},
....

update_content tool:
{
"content": "...",
"content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
"project_root": ....
}

As Emacs has a built-in tree-sitter package, I implemented this same idea. I created gptel tools like tree_sitter_list_nodes, tree_sitter_get_nodes, tree_sitter_update_nodes, tree_sitter_insert_before_node and tree_sitter_insert_after_node. The "list" tool returns a list of AST nodes with first line number, first line content and node hash. The LLM can then use "get" to collect interesting nodes in their entirety and "update" to update a list of nodes identified by hash with new content (var/function bodies).
Worked like a charm.
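For the curious, here is a minimal sketch of that list/update-by-hash shape. The real tools are Emacs Lisp and the node spans come from tree-sitter; the TypeScript and the names in it are just a hypothetical translation of the idea:

```ts
// Sketch of the "list"/"update" pair: per-node short hashes instead of raw line
// numbers. Node spans are plain data here; in the real setup tree-sitter
// supplies them. All names are hypothetical.
import { createHash } from "node:crypto";

interface NodeSpan { startLine: number; endLine: number } // 1-based, inclusive

const shortHash = (text: string) =>
  createHash("sha256").update(text).digest("hex").slice(0, 8);

// "list": first line number, first line content, and a hash of the node body.
function listNodes(source: string, spans: NodeSpan[]) {
  const lines = source.split("\n");
  return spans.map((s) => {
    const body = lines.slice(s.startLine - 1, s.endLine).join("\n");
    return { firstLine: s.startLine, firstLineText: lines[s.startLine - 1], hash: shortHash(body), body };
  });
}

// "update": replace the body of the node identified by its hash; stale hashes are rejected.
function updateNode(source: string, spans: NodeSpan[], hash: string, newBody: string): string {
  const node = listNodes(source, spans).find((n) => n.hash === hash);
  if (!node) throw new Error(`no node with hash ${hash}; re-run list`);
  const lines = source.split("\n");
  lines.splice(node.firstLine - 1, node.body.split("\n").length, ...newBody.split("\n"));
  return lines.join("\n");
}
```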
Back when I was maintaining a coding harness around the time of Claude 3.5, we tried hash prefixes, we tried line-number prefixes, we tried a lot of different approaches to making the model better at selecting edit blocks, and ultimately, at least back then, fuzzy string matching won out.
If the smaller labs (Zai, Moonshot, DeepSeek, Mistral...) got together as a consortium and embraced a harness, OpenCode for example, then just by the power of "evolution across different environments" they might hit the jackpot earlier than the bigger labs.
I see a lot of evidence to the contrary though. Anyone know what the underlying issue here is?
I think one of the things that this confirms, for me at least, is that it's better to think of "the AI" as not just the LLM itself, but the whole cybernetic system of feedback loops joining the LLM and its harness. Because, if the harness can make as much if not more of a difference, when improved, as improvements to the model itself, then they have to be really considered equally important. Not to mention the fact that models are specifically reinforcement learned to use harnesses and harnesses are adapted to the needs of models in general or specific models. So they necessarily sort of develop together in a feedback loop. And then in practice, as they operate, it is a deeply intertwined feedback loop where the entity that actually performs the useful work, and which you interact with, is really the complete system of the two together.
I think thinking like this could not only unlock quantitative performance improvements like the ones discussed in this blog post, but also help us conceive of the generative AI project as actually a project of neurosymbolic AI, even if the most capital-intensive and novel aspect is a neural network; and once we begin to think like that, it unlocks a lot of new options and more holistic thinking and might increase research in the harness area.
> Often the model isn’t flaky at understanding the task. It’s flaky at expressing itself. You’re blaming the pilot for the landing gear.
> The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross. Treating harnesses as solved, or even inconsequential, is very short-sighted.
> The gap between “cool demo” and “reliable tool” isn’t model magic. It’s careful, rather boring, empirical engineering at the tool boundary.
https://github.com/jahala/tilth
it's on npm and cargo:
- cargo install tilth
- npx tilth
then tilth install claude-code/windsurf/cursor --edit
(--edit flag is needed)
I made "tilth" a few days ago, since I'm consistently trying to get the LLMs to use tools more efficiently and spend less tokens doing it -- original tilth post from Monday: https://news.ycombinator.com/item?id=46952321
Agents waste a lot of tokens on editing, sandboxes, passing info back and forth from tool calls and subagents.
Love the pragmatic mix of content based addressing + line numbers. Beautiful.
> Here is why that is backwards. I just showed that a different edit format improves their own models by 5 to 14 points while cutting output tokens by ~20%. That’s not a threat. It’s free R&D.
He makes it sound like he got a 5-14% boost on a top-level benchmark, not a 5% improvement on a narrow find-and-replace metric. Anecdotally, I don't usually have a lot of issues with editing in Claude Code or Cursor, and if there is an issue the model corrects it.
Assuming that it costs double the tokens when it has to correct itself, and find and replace errors are as prominent in actual day to day use as his benchmark, we're talking a 5% efficiency gain in editing token use (not reasoning or tool use). Given that editing must be less than 1/3 of the token use (I assume much less?), we're talking an overall efficiency gain of less than 1%.
This seems like a promising technique, but maybe not a high priority among efficiency gains for these tools. The messianic tone, like assuming that Google cut off his access to suppress his genius editing technique rather than just because he was hammering their API, also leaves a bad taste, along with the rampant and blatant ChatGPTisms in the blog post.
It's less token heavy than the proposed hash approach, and I don't think frontier LLMs hallucinate line numbers if each line in the context is prefixed with them.
With search-replace you can work on separate parts of a file independently with the LLM. Not to mention that with each edit all the lines below are shifted, so you now need to provide the LLM with the whole content again.
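A tiny illustration of that line-shifting problem (hypothetical, line numbers are 1-based):

```ts
// After one insertion, a line number recorded earlier points at the wrong line.
const lines = ["function a() {", "}", "function b() {", "}"];

const targetLine = 3;                      // "function b() {" before any edit
lines.splice(2, 0, "// new helper here");  // insert after line 2; everything below shifts

console.log(lines[targetLine - 1]);        // "// new helper here", not "function b() {"
```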
Have you tested followup edits on the same files?
Would also be worth having special tokens for this kind of navigation.
> Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.
This is why I find the banning of Claude subscriptions in other harnesses so heinous. Their harness that they're forcing onto everyone has tons of big issues, including wasting massive numbers of tokens. Very much in line with intentionally refusing to adhere to standards in the most IE6 way possible.
Is it possible that burning extra tokens is the point, since they get paid more?
Models have improved dramatically even with the same harness
You probably don't want to use the line number though unless you need to disambiguate
But your write tool implementation can take care of that
Context: I created hypertokens, an even more robust hashing mechanism for building context-addressable memory (CAM). One cheat code is to make them prefix-free; there are lots of others that get deep into why models work the way they do, etc.
I'll point out that if you want permission prompts for certain behavior, you have to add that yourself. There's at least one example.
Edit: Just noticed the article's author is using a fork of Pi.
[1]: https://shittycodingagent.ai/
[2]: https://github.com/badlogic/pi-mono/tree/main/packages/codin...
Are they portable bit by bit back to Pi, or are there enough differences that they can't be? How about normal Pi extensions, can they be used in omp?
Some of the stuff definitely looks interesting.
I keep asking myself “could my friends and family be handed this and be expected to build what I’m building on them” and the answer is an immediate “absolutely not”. Could a non technical manager use these tools do build what I’m building? Absolutely not. And when I think about it, it’s for the exact same reason it’s always been… they just aren’t a developer. They just don’t “think” in the way required to effectively control a computer.
LLMs are just another way to talk to a machine. They aren’t magic. All the same fundamental principles that apply to properly telling a machine what to do still apply. It’s just a wildly different mechanism.
That all being said, I think these things will dramatically speed up the pace at which software eats the world. Put LLMs into a good harness and holy shit it’s like a superpower… but to get those superpowers unlocked you still have to know the basics, same as before. I think this applies to all other trades too. If you are a designer you still have to know what good design is and how to articulate it. Data scientists still need to understand the basics of their trade… these tools just give them superpowers.
Whether or not this assertion remains true in two or three years remains to be seen but look at the most popular tool. Claude code is a command line tool! Their gui version is pretty terrible in comparison. Cursor is an ide fork of vscode.
These are highly technical tools requiring somebody that knows file systems, command lines, basic development like compilers, etc. they require you to know a lot of stuff most people simply don’t. The direction I think these tools will head is far closer to highly sophisticated dev tooling than general purpose “magic box” stuff that your parents can use to… I dunno… vibe code the next hit todo app.
The way edits happen is that the agent (local) first tells the model (typically remote) that it has an edit tool (e.g. taking parameters file name, find string and replace string). If the model decides it wants to edit a file then it'll invoke this edit tool, which just results in a blob of JSON being put in the model's response specifying the edit (filename, etc). The agent then receives the response, intercepts this JSON blob, sees that it is an edit request and does what is asked.
The problem the article is describing is that the edit request (tool invocation) generated by the model isn't always 100% accurate. Even if the agent told the model it had a tool to invoke an actual editor, say sed, assuming the model knew how to use sed, this is still going to fail if the edit request cannot be interpreted literally by the editor (due to being inaccurate).
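To make that concrete, a minimal sketch of the agent-side half of that loop; the schema fields and names are illustrative, not any particular vendor's wire format:

```ts
// Agent-side half of the loop: advertise a tool schema, receive a JSON tool
// call from the model, apply it locally, return a result string as tool output.
import { readFileSync, writeFileSync } from "node:fs";

// Advertised to the model alongside the conversation.
const editToolSchema = {
  name: "edit_file",
  parameters: {
    type: "object",
    properties: {
      path: { type: "string" },
      find: { type: "string" },    // exact text to locate
      replace: { type: "string" }, // text to substitute
    },
    required: ["path", "find", "replace"],
  },
};

// The model's response contains just a JSON blob naming the tool and its arguments.
interface EditCall { path: string; find: string; replace: string }

function applyEdit(call: EditCall): string {
  const content = readFileSync(call.path, "utf8");
  if (!content.includes(call.find)) return "error: find string not present in file";
  writeFileSync(call.path, content.replace(call.find, call.replace));
  return "ok"; // fed back to the model as the tool result
}
```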
We got lines-with-anchors working fine as a replacement strategy, the problem was that when you don't make the model echo what it's replacing, it's literally dumber at writing the replacement; we lost more in test failures + retries than we gained in faster outputs.
Makes sense when you think about how powerful the "think before answering" principle is for LLMs, but it's still frustrating
* Subscriptions are oversubscribed. They know how much an “average” Claude Code user actually consumes to perform common tasks and price accordingly. This is how almost all subscription products work.
* There is some speculation that there is cooperative optimization between the harness and backend (cache related etc).
* Subscriptions are subsidized to build market share; to some extent the harnesses are “loss leader” halo products which drive the sales of tokens, which are much more profitable.
It’s because they want to study you.
They want the data!
Underscores the importance of sovereign models you can run on the edge, finetune yourself, and run offline. At State of Utopia, we're working on it!
Like a good programming language, a good harness offers a better affordance for getting stuff done.
Even if we put correctness aside, tooling that saves time and tokens is going to be very valuable.
OpenAI used early versions of GPT-5.3-Codex to: debug its own training process, manage its deployment and scaling and diagnose test results and evaluation data.
The Claude Code team has shipped 22 PRs in a single day and 27 the day before, with 100% of the code in each PR generated entirely by Claude Code.
When I was reading the Opus 4.6 launch post, they mentioned the same thing and their TerminalBench score was based on using Terminus 2 and not CC.
0. https://mariozechner.at/posts/2025-11-30-pi-coding-agent/
I don’t believe it’s exceptionally unique or new that companies will revoke access if you are using an unpublished API that their apps use. I don’t see anything wrong with it myself. If you want, pay for normal token use on the published APIs. There is no expectation that, even as a paid user, you can use APIs that are not explicitly published for usage.
Someone has to do the baseline training, development, and innovation. it can't be clones all the way down
It's completely understandable that prompting in better/more efficient means would produce different results.
I can say unironically that we haven't even tapped the full potential of GPT-4. The original one, from 2023. With no reasoning, no RL, no tool calling, no structured outputs, etc. (No MCP, ye gods!) Yes, it's possible to build coding agents with it!
I say this because I did!
Forcing yourself to make things work with older models forces you to keep things simple. You don't need 50KB of prompts. You can make a coding agent with GPT-4 and half a page of prompt.
Now, why would we do this? Well, these constraints force you to think differently about the problem. Context management becomes non-optional. Semantic compression (for Python it's as simple as `grep -r def .`) becomes non-optional. Bloating the prompt with infinite detail and noise... you couldn't if you wanted to!
Well, surely none of this is relevant today? Well, it turns out all of it still is! e.g. small fix, the "grep def" (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)
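For illustration, a rough sketch of that kind of startup repo map; the definition regex and the file filter are assumptions, adjust per language:

```ts
// One pass over the tree, one compact "where everything is" listing to prepend
// to the agent's context at session start.
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

function* walk(dir: string): Generator<string> {
  for (const name of readdirSync(dir)) {
    const p = join(dir, name);
    if (statSync(p).isDirectory()) yield* walk(p);
    else if (p.endsWith(".py")) yield p;
  }
}

export function repoMap(root: string): string {
  const out: string[] = [];
  for (const file of walk(root)) {
    readFileSync(file, "utf8").split("\n").forEach((line, i) => {
      if (/^\s*(def|class)\s+\w+/.test(line)) out.push(`${file}:${i + 1}: ${line.trim()}`);
    });
  }
  return out.join("\n");
}
```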
-- We can also get into "If you let the LLM design the API then you don't need a prompt because it already knows how it should work", but... we can talk about that later ;)
That’s when the future really starts hitting you.
I will still opt for a scriptable shell. A few scripts, and I have a custom interface that can be easily composed. And could be run on a $100 used laptop from ebay.
2) AFAIK the $20/month plan allows use of more tokens per month than if you bought $20 of tokens. my understanding is it assumes most users will only use a fraction of that each month, and they rake in profit (like a gym membership)
It’s disheartening that programmers are using this advanced, cutting-edge technology with such a backwards, old-fashioned approach.[1]
Code generation isn’t a higher level abstraction. It’s the same level but with automation.
See [1]. I’m open to LLMs or humans+LLMs creating new abstractions. Real abstractions that hide implementation details and don’t “leak”. Why isn’t this happening?
Truly “vibe coding” might also get the same job done. In the sense of: you only have to look at the generated code for reasons like how a C++ programmer looks at the assembly. Not to check if it is even correct. But because there are concerns beyond just the correctness like code gen size. (Do you care about compiler output size? Sometimes. So sometimes you have to look.)
Part of the problem though is that tools like Claude Code don't want to assume too much of the environment - that a specific editor is available, or even that it is running on a particular OS. The way it remains platform agnostic and not reliant on specific tools is by only having a dependency on Node.js, which provides file read/write support, so to implement an edit request the agent uses Node.js to read the file, itself implements the edit, then again uses Node.js to create the new updated file.
The trouble is though, because it's all indeterminate slop, every model will break in small ways such that you're back to indeterminacy and building a harness on top of the harness.
Still, <nerd snipe>, there's probably a way to get the local model and an arbitrary remote model to agree on how to make a method call. But that will only be fruitful if you find a highly reproducible set of tuples within the models' shared space.
With CC you can do a /cost to see how much your session cost in dollar terms, that's a good benchmark IMO for plugins, .md files for agents, and so on. Minimize the LLM cost in the way you'd minimize typical resource usage on a computer like cpu, ram, storage etc.
Not sure what they're calculating, but this seems to me like it could be many times more efficient than 20%.
If I do things for the love of it, the rules are different of course. But otherwise I will simply always accept that there are many things that improve around me, that I have no intimate knowledge of and probably never will, and I let other people work them out and happily lean on their work to do the next thing I care about, that is not already solved.
On recent versions Shift+Win+- also works, and Win+- produces an en dash.
I just wonder how unique these hashes will be if only 2 characters. It seems like the collision rate would be really high.
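Some rough numbers, assuming 2 hex characters (256 values) and tags that pair the hash with a line number, as in the article's examples:

```ts
// Back-of-envelope numbers for 2-hex-char tags (256 possible values), assuming
// the tag pairs the hash with a line number.
const SPACE = 256;

// A stale edit slipping past the check requires the *new* content on that same
// line to hash to the same value: about 1/256 per edit.
console.log("same-line false match:", (1 / SPACE).toFixed(4)); // ~0.0039

// If the hash were used alone, with no line number, the birthday bound bites fast:
const birthday = (n: number) => {
  let pNoCollision = 1;
  for (let i = 0; i < n; i++) pNoCollision *= (SPACE - i) / SPACE;
  return 1 - pNoCollision;
};
console.log("some collision among 20 lines:", birthday(20).toFixed(2));   // ~0.5
console.log("some collision among 100 lines:", birthday(100).toFixed(2)); // ~1.0
```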
It's truly disgusting.
Hahaha yeah. This is very true. I find myself making ad hoc versions of this in static markdown files to get around it. Just another example of the kind of low hanging fruit harnesses are leaving on the table. A version of this that uses tree sitter grammars to map a codebase, and does it on every startup of an agent, would be awesome.
> My Weird Hill is that we should be building things with GPT-4.
I disagree, IMO using the best models we have is a good way to avoid wasting time, but that doesn't mean we shouldn't also be frugal and clever with our harnesses!
the user inclusion part is real too. the best results i get aren't from fully autonomous agents, they're from tight human-in-the-loop cycles where i'm steering in real time. the model does the heavy lifting, i do the architectural decisions and error correction. feels more like pair programming than automation.
Ultimately the market is going to force them to open up and let people flex their subs.
(Already published on cargo, on npm in a few mins).
In fact only the edit tool changed. That’s it.
The conversation right now is almost entirely about which model is best at coding, GPT-5.3 or Opus. Gemini vs whatever dropped this week. This framing is increasingly misleading because it treats the model as the only variable that matters, when in reality one of the bottlenecks is something much more mundane: the harness.
Not only is it where you capture the first impression of the user (is it uncontrollably scrolling, or smooth as butter?), it is also the source of every input token, and the interface between the model’s output and every change made to your workspace.
I maintain a little “hobby harness”, oh-my-pi, a fork of Pi, a wonderful open-source coding agent by Mario Zechner. I’ve so far authored 1,300 commits, mostly playing around and making incremental improvements here and there when I see a pain point, (or autism strikes and I see an opportunity to embed more Rust via N-API because “spawning rg feels wrong”~).
Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.
Tool schemas, error messages, state management, everything between “the model knows what to change” and “the issue is resolved.” This is where most failures happen in practice.
Being model agnostic, it is a great testing ground, as the model is but a parameter. The real variable is the harness, which you have unimaginable control over.
Anyhow, let me tell you about this one variable I changed yesterday.
Before I explain what I built, it’s worth understanding the state of the art.
Codex uses apply_patch: It takes a string as input, which is essentially an OpenAI-flavored diff, and instead of relying on a structured schema, the harness just expects this blob to follow a strict set of rules. Since OpenAI folks are without a doubt smart, I’m sure the token selection process is biased to fit this structure at the LLM gateway for the Codex variants of GPT, similar to how other constraints like JSON schemas or required tool calls work.
But give this to any other model, completely unaware of it? Patch failures go through the roof. Grok 4’s patch failure rate in my benchmark was 50.7%, GLM-4.7’s was 46.2%. These aren’t bad models — they just don’t speak the language.
Claude Code (and most others) use str_replace: find the exact old text, swap in the new text. Very simple to think about. But the model must reproduce every character perfectly, including whitespace and indentation. Multiple matches? Rejected. The “String to replace not found in file” error is so common it has its own GitHub issues megathread (+27 other issues). Not exactly optimal. Gemini does essentially the same thing plus some fuzzy whitespace matching.
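A minimal sketch of that contract, with illustrative names and error text:

```ts
// The str_replace contract: the old text must match exactly, and exactly once.
function strReplace(content: string, oldText: string, newText: string): string {
  const matches = content.split(oldText).length - 1;
  if (matches === 0) throw new Error("String to replace not found in file");
  if (matches > 1) throw new Error(`Found ${matches} matches; include more surrounding context`);
  return content.replace(oldText, newText);
}
```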
Cursor trained a separate neural network: a fine-tuned 70B model whose entire job is to take a draft edit and merge it into the file correctly. The harness problem is so hard that one of the most well-funded AI companies decided to throw another model at it, and even then they mention in their own blog post that “fully rewriting the full file outperforms aider-like diffs for files under 400 lines.”
Aider’s own benchmarks show that format choice alone swung GPT-4 Turbo from 26% to 59%, but GPT-3.5 scored only 19% with the same format because it couldn’t reliably produce valid diffs. The format matters as much as the model.
The Diff-XYZ benchmark from JetBrains confirmed it systematically: no single edit format dominates across models and use cases. EDIT-Bench found that only one model achieves over 60% pass@1 on realistic editing tasks.
As you can see, there is no real consensus on the “best solution” to the simple “how do you change things” problem. My 5c: none of these tools give the model a stable, verifiable identifier for the lines it wants to change without wasting tremendous amounts of context and depending on perfect recall. They all rely on the model reproducing content it already saw. When it can’t — and it often can’t — the user blames the model.
Now bear with me here. What if, when the model reads a file, or greps for something, every line comes back tagged with a 2-3 character content hash:
1:a3|function hello() {
2:f1| return "world";
3:0e|}
When the model edits, it references those tags — “replace line 2:f1, replace range 1:a3 through 3:0e, insert after 3:0e.” If the file changed since the last read, the hashes (optimistically) won’t match and the edit is rejected before anything gets corrupted.
If they can recall a pseudo-random tag, chances are, they know what they’re editing. The model then wouldn’t need to reproduce old content, or god forbid whitespace, to demonstrate a trusted “anchor” to express its changes off of.
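A minimal sketch of the mechanics, assuming a 2-hex-character SHA-256 prefix as the tag; the actual implementation may differ:

```ts
// Reads return "line:hash|content"; edits reference "line:hash" tags; stale tags
// are rejected before anything is written.
import { createHash } from "node:crypto";

const tagOf = (line: string) =>
  createHash("sha256").update(line).digest("hex").slice(0, 2);

// What read/grep hand back to the model.
function renderWithTags(content: string): string {
  return content
    .split("\n")
    .map((line, i) => `${i + 1}:${tagOf(line)}|${line}`)
    .join("\n");
}

// A single-line replace: the anchor "N:hh" must still describe what is on disk.
function applyHashlineEdit(content: string, ref: string, newText: string): string {
  const [numStr, hash] = ref.split(":");
  const lines = content.split("\n");
  const idx = Number(numStr) - 1;
  if (idx < 0 || idx >= lines.length || tagOf(lines[idx]) !== hash) {
    throw new Error(`stale or unknown anchor ${ref}; re-read the file`); // optimistic check
  }
  lines[idx] = newText;
  return lines.join("\n");
}
```

Range replaces and inserts extend the same check to both endpoints of the range.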
Since my primary concern was about real-world performance, the fixtures are generated as follows:
An average task description looks something like this:
# Fix the bug in `useCommitFilteringAndNavigation.js`
A guard clause (early return) was removed.
The issue is in the `useCommitFilteringAndNavigation` function.
Restore the missing guard clause (if statement with early return).
Naturally, we don’t expect 100% success rate here, since the model can come up with a unique solution that isn’t necessarily the exact same file, but the bugs are mechanical enough that most of the time, the fix is our mutation being reverted.
3 runs per task, 180 tasks per run. Fresh agent session each time, with read, edit, and write tools. We simply give it a temporary workspace, pass the prompt, and once the agent stops, we compare against the original file before and after formatting.
Sixteen models, three edit tools, and the outcome is unambiguous: patch is the worst format for nearly every model, hashline matches or beats replace for most, and the weakest models gain the most. Grok Code Fast 1 went from 6.7% to 68.3%, a tenfold improvement, because patch was failing so catastrophically that the model’s actual coding ability was almost completely hidden behind mechanical edit failures. MiniMax more than doubled. Grok 4 Fast’s output tokens dropped 61% because it stopped burning tokens on retry loops.
+8% improvement in the success rate of Gemini is bigger than most model upgrades deliver, and it cost zero training compute. Just a little experimenting (and ~$300 spent benchmarking).
Often the model isn’t flaky at understanding the task. It’s flaky at expressing itself. You’re blaming the pilot for the landing gear.
Anthropic recently blocked OpenCode, a massively popular open-source coding agent, from accessing Claude through Claude Code subscriptions.
Anthropic’s position “OpenCode reverse-engineered a private API” is fair on its face. Their infrastructure, their rules. But look at what the action signals:
Don’t build harnesses. Use ours.
It’s not just Anthropic either. While writing this article, Google banned my account from Gemini entirely:

Not rate-limited. Not warned. Disabled. For running a benchmark — the same one that showed Gemini 3 Flash hitting 78.3% with a novel technique that beats their best attempt at it by 5.0 pp. I don’t even know what for.
Here is why that is backwards. I just showed that a different edit format improves their own models by 5 to 14 points while cutting output tokens by ~20%. That’s not a threat. It’s free R&D.
No vendor will do harness optimization for competitors’ models. Anthropic won’t tune for Grok. xAI won’t tune for Gemini. OpenAI won’t tune for Claude. But an open-source harness tunes for all of them, because contributors use different models and fix the failures they personally encounter.
The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross. Treating harnesses as solved, or even inconsequential, is very short-sighted.
I come from a background of game security. Cheaters are hugely destructive to the ecosystem. Sure, they get banned, chased, sued, but a well-known secret is that eventually the security team asks, “Cool! Want to show us how you got around that?”, and they join the defense.
The correct response when someone messes with your API, and manages to gather a significant following using their tools is “tell us more”, not “let’s blanket-ban them in thousands; plz beg in DMs if you want it reversed tho.”
The harness problem is real, measurable, and it’s the highest-leverage place to innovate right now. The gap between “cool demo” and “reliable tool” isn’t model magic. It’s careful, rather boring, empirical engineering at the tool boundary.
The harness problem will be solved. The question is whether it gets solved by one company, in private, for one model, or by a community, in the open, for all of them.
The benchmark results speak for themselves.
All code, benchmarks, and per-run reports: oh-my-pi
I’ll probably get downvoted for this, but am I the only one who thinks it’s kind of wild how much anger is generated by these companies offering discounted plans for use with their tools?
At this point, there would be less anger and outrage on HN if they all just charged us the same high per-token rate and offered no discounts or flat rate plans.
Like most things - assume the "20/100/200" dollar deals that are great now are going to go down the enshittification route very rapidly.
Even if the "limits" on them stay generous, the product will start shifting to prioritize things the user doesn't want.
Tool recommendations are my immediate and near term fear - paid placement for dev tools both at the model level and the harness level seem inevitable.
---
The right route is open models and open harnesses, ideally on local hardware.
One mechanism we establish is that each model has a fidelity window, i.e., r tokens of content for s tag tokens; each tag token adds extra GUID-like marker capacity via its embedding vector. Since 1-, 2-, and 3-digit numbers are only one token in top models, a single hash token lacks enough capacity and separation in latent space.
We also show the hash should be properly prefix-free, with unique symbols per digit; e.g., if using A-K and L-Z to hash, then A,R is a legal hash whereas M,C is not a permitted hash.
We can do all this and more rather precisely, as we show in our arXiv paper on the same; the next update goes deeper into group theory, information theory, etc. on boosting model recall, reasoning, tool calls, and more by way of robust hashing.
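A small sketch of that prefix-free, position-specific alphabet idea, using the comment's A-K / L-Z example:

```ts
// Prefix-free, position-specific alphabets: first symbol from A-K, second from
// L-Z, so no tag is a prefix of another and tag boundaries are unambiguous.
const FIRST = "ABCDEFGHIJK";      // 11 symbols
const SECOND = "LMNOPQRSTUVWXYZ"; // 15 symbols -> 11 * 15 = 165 distinct tags

function tag(index: number): string {
  if (index < 0 || index >= FIRST.length * SECOND.length) throw new Error("out of range");
  return FIRST[Math.floor(index / SECOND.length)] + SECOND[index % SECOND.length];
}

const isValidTag = (t: string) =>
  t.length === 2 && FIRST.includes(t[0]) && SECOND.includes(t[1]);

// "AR" is a legal tag (A from the first alphabet, R from the second); "MC" is not.
```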
Anthropic says "do the simplest thing that works." If it works with the LLMs we had 3 years ago, doesn't that make it simpler?
The newer LLMs mostly seem to work around the poor system design. (Like spawning 50 subagents on a grep-spree because you forgot to tell it where anything is...) But then you get poor design in prod!
Instead of cat + grep + manual line counting, one tool call returns a structural outline of a large file, lets you drill into sections, and since this last update also returns hashline-anchored output that an edit tool can target.
Precisely. This is why I use Zed and the Zed Agent. It's near-unparalleled for live, mind-meld pair programming with an agent, thanks to CRDTs, DeltaDB, etc. I can elaborate if anyone is interested.
There may be many lines that are duplicates, e.g. “{”
I don’t assume this at all. In fact, the opposite has been happening in my experience: I try multiple providers at the same time and the $20/month plans have only been getting better with the model improvements and changes. The current ChatGPT $20/month plan goes a very long way even when I set it to “Extra High” whereas just 6 months ago I felt like the $20/month plans from major providers were an exercise in bouncing off rate limits for anything non-trivial.
Inference costs are only going to go down from here and models will only improve. I’ve been reading these warnings about the coming demise of AI plans for 1-2 years now, but the opposite keeps happening.
So then it's better to start obeying robots.txt as a ladder pull, through a "nicely behaved" image advantage.
This timing also crosses over with the frontier labs raising ever larger and larger rounds. If Anthropic IPOs (which I honestly doubt), then we may get a better sense of actual prices in the market, as it's unlikely the markets will continue letting them spend more and more money each year without a return.
It sure does and Codex is great, but do you think they'll maintain the current prices after/if it eventually dominates Claude Code in terms of marketshare and mindshare?
The alternative is to say that bugs shouldn’t be fixed because it’s a ladder pull or something. But that’s crazy. What’s the point of complaining if not to get people to fix things?
And the MCP already only has the most essential tools for my workflow: the ability to run queries against a few databases.
1. the diffs from the agent just show up in the regular file you were editing, you're not forced to use a special completion model, or view the changes in a special temporary staging mode or different window.
2. you can continue to edit the exact same source code without accepting or rejecting the changes, even in the same places, and nothing breaks — the diffs still look right, and doing an accept or reject Just Works afterwards.
3. you can accept or reject changes piecemeal, and the model doesn't get confused by this at all and have to go "oh wait, the file was/wasn't changed, let me re-read..." or whatever.
4. Even though you haven't accepted the changes, the model can continue to make new ones, since they're stored as branches in the CRDT, so you can have it iterate on its suggestions before you accept them, without forcing it to start completely over either (it sees the file as if its changes were accepted)
5. Moreover, the actual files on disk are in the state it suggests, meaning you can compile, fuzz, test, run, etc to see what it's proposed changes do before accepting them
6. you can click a follow button and see which files it has open, where it's looking in them, and watch as it edits the text, like you're following a dude in Dwarf Fortress. This means you can very quickly know what it's working on and when, correct it, or hop in to work on the same file it is.
7. It can actually go back and edit the same place multiple times as part of a thinking chain, or even as part of the same edit, which has some pretty cool implications for final code-quality, because of the fact that it can iterate on its suggestion before you accept it, as well as point (9) below
8. It streams its code diffs, instead of hanging and then producing them as a single gigantic tool call. Seeing it edit the text live, instead of having to wait for a final complete diff to come through that you either accept or reject, is a huge boon for iteration time compared to e.g. ClaudeCode, because you can stop and correct it mid way, and also read as it goes so you're more in lockstep with what's happening.
9. Crucially, because the text it's suggesting is actually in the buffer at all times, you can see LSP, tree-sitter, and linter feedback, all inline and live as it writes code; and as soon as it's done an edit, it can see those diagnostics too — so it can actually iterate on what it's doing with feedback before you accept anything, while it is in the process of doing a series of changes, instead of you having to accept the whole diff to see what the LSP says
Unlike Uber and Lyft, the price of inference continues to go down as datacenter capacity comes online and compute hardware gets more powerful.
So I think we'll always have affordable LLM services.
I do think the obsession with prices of the entry-level plans is a little odd. $20/month is nothing relative to the salaries people using these tools receive. HN is full of warnings that prices are going to go up in the future, but what's that going to change for software developers? Okay, so my $20/month plan goes to $40/month? $60/month? That's still less than I pay for internet access at home.