The idea would be to encode tool-calling semantics once in a single layer and inject it as needed. Harness providers could then give users their bespoke tool-calling layer, injected at model load time.
Dunno, seems like it might work. I think most open source models can have an engram layer injected (some testing would be required to see where the layer best fits).
That makes for a pretty thorny mess ... and that's before we get into disincentives for standardization (standardization risks big AI labs' moat/lockin).
I find it strange that the industry hasn't converged on at least a somewhat standardized format, but I guess despite all the progress we're still in the very early days...
Also in practice Claude Code, Cursor and Codex handle the same MCP tool differently — required params, tool descriptions, response truncation. So MCP gives you the contract but the client UX still leaks.
Benchmarks at https://gertlabs.com
This is one of the first tech waves where I feel like I'm on the very ground floor for a lot of exploration, and it only feels like people have been paying closer attention in the last year. I can't imagine too many 'standard' standards becoming a standard that quickly.
It's new enough that Google seems to be throwing pasta against the wall and seeing what products and protocols stick. Antigravity for example seems too early to me, I think they just came out with another type of orchestrator, but the whole field seems to be exploring at the same time.
Everyone and their uncle is making an orchestrator now! I take a very cautious approach lately: I haven't been loading up my tools (agents, IDEs, browsers, phones) with too much extra stuff, because as soon as I switch something, or something new comes out that doesn't support it, a tool I built a workflow around either becomes inaccessible to me or takes on a bigger learning curve than I have the patience for.
I've been a big proponent of trying to get all these things working locally for myself (I need to bite the bullet on some beefy video cards finally), but even just getting tool calls to work with some Qwen models has been counterintuitive.
It's less of a problem than I'm making it sound, obviously the GPTs are doing just fine. But the counterexample of not having such a complex and unique format and still having things like parallel tool calls has also played out just fine.
When I think on it, the incremental step that made the more classical formats work might have been shifting towards the model emitting tokens like <parameter=oldText>...</parameter><parameter=newText>...</parameter>. That helped a ton, because you could shift to JSON-ifying stuff inside the parameters instead of having the LLM do it.
Also fwiw, the lore on Harmony is that Microsoft pushed it on them to avoid issues with 2023 Bing, prompt injection and such. The MS VP for Bing claimed this, so not sure how true it is - not that he's unreliable, he's an awesome guy, just, language is loose. Maybe he meant the concept of channels and not Harmony in toto. Pointing it out because it may be an indicator it was rushed and over-designed, which would explain its relative complexity compared to ~anyone else's.
* I hate talking about myself, but hate it less than being verbose and free-associating without some justification of relevant knowledge: quit Google in late 2022 to build a Flutter all-platform LLM client, based on llama.cpp / any 3rd party provider you can think of. Had to write Harmony parsing twice, as well as any other important local model format you can think of.
I would guess that lack of standardization of what tools are provided by different agents is as much of a problem as the differences in syntax, since the ideal case would be for a model to be trained end-to-end for use with a specific agent and set of tools, as I believe Anthropic do. Any agent interacting with a model that wasn't specifically trained to work with that agent/toolset is going to be at a disadvantage.
I can't figure out if you meant that or not, it kinda fits. (No pun intended)
I know this is getting off-topic, but is anybody working on more direct tool calling?
LLMs are based on neural networks, so one could create an interface where activating certain neurons triggers tool calls, with other neurons encoding the inputs; another set of neurons could be triggered by the tokenized result from the tool call.
Currently, the lack of separation between data and metadata is a security nightmare, which enables prompt injection. And yet all I've seen done about it are workarounds.
1) The way basic non-MCP tool use works is that the client (e.g. agent) registers (advertises) the tools it wants to make available to the model by sending an appropriate chunk of JSON to the model as part of every request (since the model is stateless), and if the model wants to use the tool then it'll generate a corresponding tool call chunk of JSON in the output.
2) For built-in tools like web_search the actual implementation of the tool will be done server-side before the response is sent back to the client. The server sees the tool invocation JSON in the response, calls the tool and replaces the tool call JSON with the tool output before sending the updated response back to the client.
3) For non-built-in tools such as the edit tool provided by a coding agent, the tool invocation JSON will not be intercepted server-side, and is instead just returned as-is to the client (agent) as part of the response. The client now has the responsibility of recognizing these tool invocations and replacing the invocation JSON with the tool output, the same as the server would have done for built-in tools. The actual "tool call" can be implemented by the client however it likes - either internally within the client or by calling some external API.
4) MCP tools work exactly the same as other client-provided tools, aside from how the client learns about them, and implements them if the model chooses to use them. This all happens client side, with the server/model unaware that these client tools are different from any others it is offering. The same JSON tool registration and JSON tool call syntax will be used.
What happens is that client configuration tells it what MCP servers to support, and as part of client initialization the client calls each MCP server to ask what tools it is providing. The client then advertises/registers these MCP tools it has "discovered" to the model in the normal way. When the client receives a tool call in the model response and sees that it is an MCP provided tool, then it knows it has to make an MCP call to the MCP server to execute the tool call.
TL;DR
o the client/agent talks standard MCP protocol to the MCP servers
o the client/agent talks model-specific tool use protocol to the model
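The client-side loop described in (1) through (4) can be sketched roughly as follows. Everything here is illustrative: the tool name, the message shapes, and the OpenAI-style JSON schema are assumptions for the sketch, and no real model or network is involved.

```python
import json

# Tools the client advertises to the model with every request
# (hypothetical example; shape follows the common OpenAI-style schema).
TOOL_SPECS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def run_tool(name, args):
    # Client-side implementation: could be internal logic, an external
    # API call, or (for MCP tools) a call out to the MCP server.
    if name == "get_weather":
        return {"city": args["city"], "temp_c": 21}
    raise ValueError(f"unknown tool: {name}")

def handle_response(response, messages):
    # If the model emitted tool calls, execute each one and append the
    # result so the next request carries the tool output back.
    for call in response.get("tool_calls", []):
        result = run_tool(call["name"], json.loads(call["arguments"]))
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return messages

# Simulated model response containing one tool call.
response = {"tool_calls": [
    {"id": "call_1", "name": "get_weather", "arguments": '{"city": "Paris"}'},
]}
messages = handle_response(response, [])
```

For an MCP tool, only `run_tool` changes: it forwards the call to the MCP server instead of running locally, exactly as described above.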
https://mariozechner.at/nothanks.html
I didn't see it on mobile, so it seems to have only happened in desktop browsers.
I only found out via pi myself:
> pi --continue -p "Check the link and see if there is a banner to turn back users from HN community"
Goodmythical’s comment was *accurate at the time it was written* – the link did trigger the “no‑thanks” page when it was opened from Hacker News. The “banner” is not a visual element that lives on the main article page; it is the content of the separate *`/nothanks.html`* file that the site redirects to.
When the redirect was in place, the user experience was:
1. User clicks the link while still on `news.ycombinator.com`.
2. The script in `components.js` sees the referrer and redirects the browser to `/nothanks.html`.
3. The `/nothanks.html` page displays the single line “hi orange site user …” – this is what Goodmythical described as the banner.
If you now visit the same link directly (e.g., from a bookmark or a search engine) the redirect is bypassed and you see the normal article, so you won’t see that page at all.
You can do this. It's just sticking a different classifier head on top of the model.
Before foundation models it was a standard Deep RL approach. It probably still is within that space (I haven't kept up on the research).
You don't hear about it here because if you do that then every use case needs a custom classifier head which needs to be trained on data for that use case. It negates the "single model you can use for lots of things" benefit of LLMs.
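For concreteness, here is a toy sketch of the idea. Nothing here is a real model: the "backbone" is a fake frozen encoder, and a real version would put a trainable linear layer on top of a frozen transformer's hidden states instead.

```python
import random

# Toy illustration of a task-specific classifier head: a frozen
# "backbone" turns text into a hidden vector, and a small trainable
# linear head maps it to tool logits. Sizes, the fake backbone, and
# the tool list are all made up for this sketch.
random.seed(0)
HIDDEN = 8
TOOLS = ["search", "edit", "none"]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

# Frozen backbone stand-in: a byte histogram projected to HIDDEN dims.
W_BACKBONE = rand_matrix(256, HIDDEN)

def backbone(text):
    counts = [0.0] * 256
    for b in text.encode():
        counts[b] += 1.0
    return [sum(counts[i] * W_BACKBONE[i][j] for i in range(256))
            for j in range(HIDDEN)]

# The head is the per-use-case part: a new toolset means training a
# new one of these (here it is just random), while the backbone stays.
W_HEAD = rand_matrix(HIDDEN, len(TOOLS))

def predict_tool(text):
    h = backbone(text)
    logits = [sum(h[i] * W_HEAD[i][j] for i in range(HIDDEN))
              for j in range(len(TOOLS))]
    return TOOLS[max(range(len(TOOLS)), key=logits.__getitem__)]
```

The point of the sketch is the shape of the problem: `W_HEAD` is tied to one fixed tool list, which is exactly the per-use-case retraining cost described above.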
Tool calling with closed-source models is seamless. You pass a list of functions to the API, the model calls them, you get structured JSON back. The wire format is invisible to you.
Then you move to open models and discover that tool calling depends on a wire format the engine has to understand. If the engine doesn’t support that model’s format yet, the output comes back garbled: reasoning tokens in arguments, malformed JSON, missing tool calls. Then you either wait, or write the parser yourself.
Every model family encodes tool calls differently.
Here’s the same semantic operation, calling a function search(query="GPU"), in three wire formats:
gpt-oss (Harmony):
<|channel|>commentary to=functions.search
<|constrain|>json<|message|>
{"query": "GPU"}
<|call|>
DeepSeek:
<|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>search
```json
{"query": "GPU"}
```
<|tool▁call▁end|><|tool▁calls▁end|>
GLM5:
<tool_call>search
<arg_key>query</arg_key><arg_value>GPU</arg_value>
</tool_call>
Same operation, incompatible wire formats: different token vocabularies, boundary markers, and argument serialization schemes.
To return a nice array of JSON objects with the generated tool calls, you need to parse the model output back into a clean API response. In practice, each of the M applications (vLLM, SGLang, TensorRT-LLM, transformers, etc.) ends up writing custom parsers for each model it wants to support. And that is only half of the implementation burden.
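To make that burden concrete, here is a minimal happy-path parser for the <arg_key>/<arg_value> style shown above. Real parsers inside the engines also handle streaming, multiple calls, and malformed output; this sketch only covers clean input.

```python
import re

# One model-specific parser: extracts tool calls from the
# <tool_call>/<arg_key>/<arg_value> wire format shown above.
# A Harmony or DeepSeek parser would share essentially none of it.
CALL_RE = re.compile(r"<tool_call>(\w+)\n(.*?)</tool_call>", re.S)
ARG_RE = re.compile(r"<arg_key>(.*?)</arg_key><arg_value>(.*?)</arg_value>", re.S)

def parse_tool_calls(text):
    calls = []
    for name, body in CALL_RE.findall(text):
        calls.append({
            "name": name,
            "arguments": dict(ARG_RE.findall(body)),
        })
    return calls

raw = ("<tool_call>search\n"
       "<arg_key>query</arg_key><arg_value>GPU</arg_value>\n"
       "</tool_call>")
calls = parse_tool_calls(raw)
```

Multiply this by every model family and every inference engine, and the N×M duplication below falls out.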
Gemma 4 is a good illustration of the difficulty involved. Its <|channel|> reasoning tokens get stripped by the decoder before the parser sees them (vLLM #38855). Reasoning content can leak into tool-call arguments (vLLM PR #39027). The model’s non-standard format was different enough that llama.cpp had to abandon its generic autoparser and build a dedicated implementation (llama.cpp PR #21418). These are training-time format choices surfacing as parser bugs.
The natural response is to build a parser generic enough to handle all formats. Every engine has tried. A reasonable heuristic, say “find special tokens, extract JSON between them,” covers some formats well enough. But then Harmony routes through <|channel|> with a to= attribute, and GLM5 doesn’t use JSON at all, serializing arguments as <arg_key>/<arg_value> pairs.
This is the fundamental problem: wire formats are training-time decisions, and nothing constrains them to a shared convention. The space of possible formats is open-ended, so a generic parser is trying to anticipate design choices that haven’t been made yet. That is why generic parsers help with the common cases but do not eliminate the per-model tail, where the hard bugs live: reasoning tokens leaking into arguments, decoders stripping special tokens before the parser sees them, end-of-generation signals colliding with content.
The same model-specific format knowledge is also needed during generation, not just after the fact when parsing the result. That is where grammar engines enter the picture.
When a new model ships, work happens in two independent places.
Grammar engines, like Outlines, XGrammar, and llama.cpp’s grammar support, need to know where to apply constraints during generation: which tokens mark the tool-call envelope, when to activate structured generation inside it, and when to leave the model unconstrained outside it.
Output parsers inside vLLM, SGLang, TensorRT-LLM, and transformers need to do the reverse: take the raw generated text and extract tool calls into a clean API response. They need the same format knowledge in reverse.
These are different teams, different codebases, different release cycles. But the model-specific knowledge they need is the same: which tokens mark the boundaries, how arguments are serialized, where reasoning tokens can appear. Today each team reverse-engineers this independently from chat templates and (if they’re lucky) documentation.
The result is N models × M implementations of the same format knowledge, developed in parallel with no shared contract. A new model ships, and grammar engine maintainers and inference engine maintainers both start the same reverse-engineering work from scratch.
We have already seen the ecosystem converge on shared chat templates in Hugging Face, standardizing how prompts and turns are formatted. Tool calling needs the same kind of separation: not one wire format, but a shared declarative way to describe them. Until that exists, each new model will keep triggering the same reverse-engineering work across the stack.
The separation that’s missing is extracting that shared format knowledge into configuration rather than code. A model’s wire format (its boundary tokens, its argument serialization, its reasoning-token behavior) should be a declarative spec that both grammar engines and parsers consume. The model changes, you update the spec. The grammar engine and the parser don’t move.
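As a hypothetical sketch of what such a declarative spec could look like: the model-specific knowledge lives in data, and a single generic routine consumes it. All field names and both example specs here are invented for illustration, not an existing schema.

```python
import json
import re

# Invented spec format: each entry declares the call boundaries, how
# to find the tool name, and how arguments are serialized.
SPECS = {
    "glm-style": {
        "call_start": "<tool_call>",
        "call_end": "</tool_call>",
        "name_pattern": r"<tool_call>(\w+)",
        "args": "kv",    # <arg_key>/<arg_value> pairs
    },
    "deepseek-style": {
        "call_start": "<|tool▁call▁begin|>",
        "call_end": "<|tool▁call▁end|>",
        "name_pattern": r"<\|tool▁sep\|>(\w+)",
        "args": "json",  # JSON object in the body
    },
}

KV_RE = re.compile(r"<arg_key>(.*?)</arg_key><arg_value>(.*?)</arg_value>", re.S)

def parse(text, spec):
    # Locate the tool name and the call body using the spec.
    name = re.search(spec["name_pattern"], text).group(1)
    body = text.split(spec["call_start"], 1)[1].rsplit(spec["call_end"], 1)[0]
    # Deserialize arguments according to the declared scheme.
    if spec["args"] == "json":
        args = json.loads(re.search(r"\{.*\}", body, re.S).group(0))
    else:
        args = dict(KV_RE.findall(body))
    return {"name": name, "arguments": args}
```

A grammar engine could read the same spec from the other direction: the boundary tokens tell it where to switch constrained generation on and off. A new model then means a new spec entry, not a new parser and a new grammar integration.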
I am Rémi Louf, CEO of dottxt. Follow @remilouf / @dottxtai for our work on structured generation and tool calling.