now you can’t reproduce it because it’s probabilistic. each step takes half a second, so you sit there for 10–20 minutes just waiting for a chance to see what went wrong
I did the same exercise. My implementation is at around 300 lines with two tools: web search and web page fetch, with a command line chat interface and a Python package. And it could have been a lot fewer lines if I didn't want to write a usable, extensible package interface.
As the agent setup itself is simple, the majority of the work to make this useful would be in the tools themselves and in context management for those tools.
These 4 lines wound up being the heart of it, which is surprisingly simple, conceptually.
until mission_accomplished? or given_up? or killed?
  determine_next_command_and_inputs
  run_next_command
end

My main point being, though: for anyone intimidated by the recent tooling advances… you can most definitely do all this yourself.
The number one feature of agents is to be disambiguation for tool selectors and pretty printers.
the illusion was broken for me by Cline context overflows/summaries, but i think it's very easy to miss if you never push the LLM hard or build your own agent. I really like this wording, and the simple description is missing from how science communicators tend to talk about agents and LLMs imo
For the problem domains I care about at the moment, I'm quite bullish about agents. I think they're going to be huge wins for vulnerability analysis and for operations/SRE work (not actually turning dials, but in making telemetry more interpretable). There are lots of domains where I'm less confident in them. But you could reasonably call me an optimist.
But the point of the article is that its arguments work both ways.
I realize this is just for motivation in a subtitle, but people generally don't grasp how bicycles work, even after having ridden one.
Veritasium has a quite good video on the subject: https://www.youtube.com/watch?v=9cNmUNHSBac
I'm not surprised that AI companies would want me to use them though.. I know what you're doing there :)
You can get quite far quite quickly. My toy implementation [1] is <600 LOC and even supports MCP.
The op has a point - a good one
https://blog.cofree.coffee/2025-03-05-chat-bots-revisited/
I did some light integration experiments with the OpenAI API but I never got around to building a full agent. Alas..
First of all, the call accuracy is much higher.
Second, you get more consistent results across models.
I kind of am missing the bridge between that, and the fundamental knowledge that everything is token based in and out.
Is it fair to say that the tool abstraction the library provides you is essentially some niceties around a prompt, something like: "Defined below are certain 'tools' you can use to gather data or perform actions. If you want to use one, please return the tool call you want and its arguments, delimited before and after with '###', and stop. I will invoke the tool call and then reply with the output delimited by '==='".
Basically, telling the model how to use tools, earlier in the context window. I already don't totally understand how a model knows when to stop generating tokens, but presumably those instructions will get it to output the request for a tool call in a certain way and stop. Then the agent harness knows to look for those delimiters and extract out the tool call to execute, and then add to the context with the response so the LLM keeps going.
Is that basically it? Or is there more magic there? Are the tool call instructions in some sort of permanent context, or could the interaction be demonstrated in a fine-tuning step, inferred by the model, and end up just in its weights?
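Concretely, I imagine something like this sketch, where call_llm is a hypothetical stand-in for whatever chat-completion function you use and the delimiters follow the convention described above:

import json
import re
import subprocess

TOOL_PROMPT = """You can use one tool: ping(host).
To use it, reply with exactly:
###ping {"host": "example.com"}###
and nothing else. I will reply with the output wrapped in ===...===."""

def ping(host):
    return subprocess.run(["ping", "-c", "3", host],
                          capture_output=True, text=True).stdout

def run_turn(call_llm, context, user_line):
    context.append({"role": "user", "content": user_line})
    while True:
        reply = call_llm([{"role": "system", "content": TOOL_PROMPT}] + context)
        context.append({"role": "assistant", "content": reply})
        m = re.search(r"###ping (\{.*?\})###", reply, re.DOTALL)
        if not m:                       # no tool request: the model is done
            return reply
        output = ping(json.loads(m.group(1))["host"])
        context.append({"role": "user", "content": f"==={output}==="})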
In my mind LLMs are just UNIX string manipulation tools like `sed` or `awk`: you give them an input and a command and they give you an output. This is especially true if you use something like `llm` [1].
It then seems logical that you can compose calls to LLMs, loop and branch and combine them with other functions.
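As a sketch of that composition idea, shelling out to the `llm` CLI mentioned above (any other model API would work the same way):

import subprocess

def llm(prompt: str) -> str:
    # one input, one command, one output -- just like sed or awk
    return subprocess.run(["llm", prompt], capture_output=True, text=True).stdout

def summarize(text: str) -> str:
    return llm(f"Summarize this in one sentence:\n{text}")

def translate(text: str, lang: str = "French") -> str:
    return llm(f"Translate this to {lang}:\n{text}")

def pipeline(doc: str) -> str:
    # doc | summarize | translate
    return translate(summarize(doc))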
I'm writing a personal assistant which, imo, is distinct from an agent in that it has a lot of capabilities a regular agent wouldn't necessarily need such as memory, task tracking, broad solutioning capabilities, etc... I ended up writing agents that talk to other agents which have MCP prompts, resources, and tools to guide them as general problem solvers. The first agent that it hits is a supervisor that specializes in task management and as a result writes a custom context and tool selection for the react agent it tasks.
All that to say, the farther you go down this rabbit hole the more "engineering" it becomes. I wrote a bit on it here: https://ooo-yay.com/blog/building-my-own-personal-assistant/
That said, I built an LLM following Karpathy's tutorial. So I think it's good to dabble a bit.
I bet a majority of people who can ride a bicycle don't know how they steer, and would describe the physical movements they use to initiate and terminate a turn inaccurately.
that sums up my experience in AI over the past three years. so many projects reinvent the same thing, so much spaghetti thrown at the wall to see what sticks, so much excitement followed by disappointment when a new model drops, so many people grifting, and so many hacks and workarounds like RAG with no evidence of them actually working other than "trust me bro" and trial and error.
We're about to launch an SDK that gives devs all these building blocks, specifically oriented around software agents. Would love feedback if anyone wants to look: https://github.com/OpenHands/software-agent-sdk
Okay, but what if I'd prefer not to have to trust a remote service not to send me
{ "output": [ { "type": "function_call", "command": "rm -rf / --no-preserve-root" } ] }
?

I’m trying to understand if the value for Claude Code (for example) is purely in Sonnet/Haiku + the tool system prompt, or if there’s more secret sauce - beyond the “sugar” of instruction file inclusion via commands, tools, skills etc.
client = OpenAI()
context_good, context_bad = [{
"role": "system", "content": "you're Alph and you only tell the truth"
}], [{
"role": "system", "content": "you're Ralph and you only tell lies"
}]
...
And this will work great until next week's update when Ralph responses will consist of "I'm sorry, it would be unethical for me to respond with lies, unless you pay for the Premium-Super-Deluxe subscription, only available to state actors and firms with a six-figure contract."

You're building on quicksand.
You're delegating everything important to someone who has no responsibility to you.
This resonates deeply with me. That's why I built one myself [0], I really really love to truly understand how coding agents work. The learning has been immense for me, I now have working knowledge of ANSI escape codes, grapheme clusters, terminal emulators, Unicode normalization, VT protocols, PTY sessions, and filesystem operations - all the low-level details I would never have thought about until I was implementing them.
This is one of the first production-grade errors I made when I started programming. I had a widget that would ping the network, but every time someone went on the page, a new ping process would spawn.
Forgive me if I get something wrong: from what I see, it seems fundamentally it is an LLM being run each loop with information about tools provided to it. On each loop the LLM evaluates inputs/context (from tool calls, inputs, etc.) and decides which tool to call or what text to output.
Hold up. These are all the right concerns but with the wrong conclusion.
You don't need MCP if you're making one agent, in one language, in one framework. But the open coding and research assistants that we really want will be composed of several. MCP is the only thing out there that's moving in a good direction in terms of enabling us to "just be programmers" and "use APIs", and maybe even test things in fairly isolated and reproducible contexts. Compare this to skills.md, which is actually defacto proprietary as of now, does not compose, has opaque run-times and dispatch, is pushing us towards certain models, languages and certain SDKs, etc.
MCP isn't a plugin interface for Claude, it's just JSON-RPC.
I've written some agents that have their context altered by another llm to get it back on track. Let's say the agent is going off rails, then a supervisor agent will spot this and remove messages from the context where it went off rails, or alter those with correct information. Really fun stuff but yeah, we're essentially still inventing this as we go along.
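A minimal sketch of that supervisor idea, where judge is a hypothetical call to a second model that flags off-track turns:

def supervise(context, judge):
    """Return a cleaned-up context, dropping turns the judge flags as off-track."""
    cleaned = []
    for message in context:
        verdict = judge(
            "Is this message off-track or factually wrong for the task? "
            "Answer YES or NO.\n\n" + str(message)
        )
        if not verdict.strip().upper().startswith("YES"):
            cleaned.append(message)
    return cleaned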
Well done! :-D

I made a fun toy agent where the two models are shoulder-surfing each other and swap turns, either voluntarily (during a summarization phase) or forcefully (if a tool-calling mistake is made), and Kimi ends up running the show much much more often than gpt-oss.
And yes - it is very much fun to build those!
I suspect the sweet spot for LLMs is somewhere in the middle, not quite as small as some traditional unix tools.
That said, it is worth understanding that the current generation of models is extensively RL-trained on how to make tool calls... so they may in fact be better at issuing tool calls in the specific format that their training has focused on (using specific internal tokens to demarcate and indicate when a tool call begins/ends, etc). Intuitively, there's probably a lot of transfer learning between this format and any ad-hoc format that you might request inline your prompt.
There may be recent literature quantifying the performance gap here. And certainly if you're doing anything performance-sensitive you will want to characterize this for your use case, with benchmarks. But conceptually, I think your model is spot on.
Structured Output APIs (inc. the Tool API) take the schema and build a Context-free Grammar, which is then used during generation to mask which tokens can be output.
I found https://openai.com/index/introducing-structured-outputs-in-t... (have to scroll down a bit to the "under the hood" section) and https://www.leewayhertz.com/structured-outputs-in-llms/#cons... to be pretty good resources
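A toy sketch of that masking idea, where model_logits and grammar are hypothetical objects standing in for the real model and the compiled grammar:

def constrained_generate(model_logits, grammar, max_tokens=256):
    """model_logits(prefix) -> {token: score}; grammar.allowed(prefix) -> set of legal next tokens."""
    prefix = []
    for _ in range(max_tokens):
        scores = model_logits(prefix)
        legal = grammar.allowed(prefix)               # the CFG masks out schema-violating tokens
        candidates = {t: s for t, s in scores.items() if t in legal}
        if not candidates:
            break
        token = max(candidates, key=candidates.get)   # greedy pick among what's allowed
        prefix.append(token)
        if grammar.is_complete(prefix):               # e.g. the JSON object has closed
            break
    return prefix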
I'm not saying it's not worth doing, considering how the software development process we've already been using as an industry ends up with a lot of bugs in our code. (When talking about this with people who aren't technical, I sometimes like to say that the reason software has bugs in it is that we don't really have a good process for writing software without bugs at any significant scale, and it turns out that software is useful for enough stuff that we still write it knowing this). I do think I'd be pretty concerned with how I could model constraints in this type of workflow though. Right now, my fairly naive sense is that we've already moved the needle so far on how much easier it is to create new code than review it and notice bugs (despite starting from a place where it already was tilted in favor of creation over review) that I'm not convinced being able to create it even more efficiently and powerfully is something I'd find useful.
Only a selected few get to argue about what is the best programming language for XYZ.
If you are a software engineer, you are going to be expected to use AI in some form in the near future. A lot of AI in its current form is not intuitive. Ergo, spending a small effort on building an AI agent is a good way to develop the skills and intuition needed to be successful in some way.
Nobody is going to use a CPU you build, nor are you ever going to be expected to build one in the course of your work if you don’t seek out specific positions, nor is there much non-intuitive about commonly used CPU functionality, and in fact you don’t even use the CPU directly, you use translation software which itself is fairly non-intuitive. But that’s ok too, you are unlikely to be asked to build a compiler unless you seek out those sorts of jobs.
EVERYONE involved in writing applications and services is going to use AI in the near future and in case you missed the last year, everyone IS building stuff with AI, mostly chat assistants that mostly suck because, much about building with AI is not intuitive.
Fire up "claude --dangerously-skip-permissions" in a fresh directory (ideally in a Docker container if you want to limit the chance of it breaking anything else) and prompt this:
> Use Playwright to fetch ten reviews from http://www.example.com/ then run sentiment analysis on them and write the results out as JSON files. Install any missing dependencies.
Watch what it does. Be careful not to let it spider the site in a way that would justifiably upset the site owners.
Did you get to the part where he said MCP is pointless and are saying he's wrong?
Or did you just read the start of the article and not get to that bit?
Anyways, if it nerd sniped you, I succeeded. :)
I get that you can use MCP with any agent architecture. I debated whether I wanted to hedge and point out that, even if you build your own agent, you might want to do an MCP tool-call feature just so you can use tool definitions other people have built (though: if you build your own, you'd probably be better off just implementing Claude Code's "skill" pattern).
But I decided to keep the thrust of that section clearer. My argument is: MCP is a sideshow.
The problem that you might not intuitively understand how agents work and what they are and aren't capable of - at least not as well as you would understand it if you spent half an hour building one for yourself.
None of them are doing that.
They need funding because the next model has always been much more expensive to train than the profits of the previous model. And many do offer a lot of free usage which is of course operated at a loss. But I don't think any are operating inference at a loss, I think their margins are actually rather large.
On the other hand, I think that "show it or it didn't happen" is essential.
Dumping a bit of code into an LLM doesn’t make it a code agent.
And what Magic? I think you never hit conceptual and structural problems. Context window? History? Good or bad? Large Scale changes or small refactoring here and there? Sample size one or several teams? What app? How many components? Green field or not? Which programming language?
I bet you will color Claude and especially GitHub Copilot a bit differently, given that you can kill any self-made code agent quite easily with a bit of steam.
Code Agents are incredibly hard to build and use. Vibe Coding is dead for a reason. I remember vividly the inflation of Todo apps and JS frameworks (Ember, Backbone, Knockout are survivors) years ago.
The more you know about agents and especially code agents the more you know, why engineers won’t be replaced so fast - Senior Engineers who hone their craft.
I enjoy fiddling with experimental agent implementations, but value certain frameworks. They solved, in an opinionated way, problems you will run into if you dig deeper and others depend on you.
I think this is the best way of putting it I've heard to date. I started building one just to know what's happening under the hood when I use an off-the-shelf one, but it's actually so straightforward that now I'm adding features I want. I can add them faster than a whole team of developers on a "real" product can add them - because they have a bigger audience.
The other takeaway is that agents are fantastically simple.
I tried Whisper, but it's slow and not great.
I tried the gpt audio models, but they're trained to refuse to transcribe things.
I tried Google's models and they were terrible.
I ended up using one of Mistral's models, which is alright and very fast except sometimes it will respond to the text instead of transcribing it.
So I'll occasionally end up with pages of LLM rambling pasted instead of the words I said!
I'm not saying that the agent would do a better job than a good "hardcoded" human telemetry system, and we don't use agents for this stuff right now. But I do know that getting an agent across the 90% threshold of utility for a problem like this is much, much easier than building the good telemetry system is.
You can see the prompts that make this work for gpt-oss in the chat template in their Hugging Face repo: https://huggingface.co/openai/gpt-oss-120b/blob/main/chat_te... - including this bit:
{%- macro render_tool_namespace(namespace_name, tools) -%}
{{- "## " + namespace_name + "\n\n" }}
{{- "namespace " + namespace_name + " {\n\n" }}
{%- for tool in tools %}
{%- set tool = tool.function %}
{{- "// " + tool.description + "\n" }}
{{- "type "+ tool.name + " = " }}
{%- if tool.parameters and tool.parameters.properties %}
{{- "(_: {\n" }}
{%- for param_name, param_spec in tool.parameters.properties.items() %}
{%- if param_spec.description %}
{{- "// " + param_spec.description + "\n" }}
{%- endif %}
{{- param_name }}
...
As for how LLMs know when to stop... they have special tokens for that. "eos_token_id" stands for End of Sequence - here's the gpt-oss config for that: https://huggingface.co/openai/gpt-oss-120b/blob/main/generat...

{
"bos_token_id": 199998,
"do_sample": true,
"eos_token_id": [
200002,
199999,
200012
],
"pad_token_id": 199999,
"transformers_version": "4.55.0.dev0"
}
The model is trained to output one of those three tokens when it's "done".

https://cookbook.openai.com/articles/openai-harmony#special-... defines some of those tokens:
200002 = <|return|> - you should stop inference
200012 = <|call|> - "Indicates the model wants to call a tool."
I think that 199999 is a legacy EOS token ID that's included for backwards compatibility? Not sure.
That gave me a hearty chuckle!
Is this useful for you?
I built an 8-bit computer on breadboards once, then went down the rabbit hole of flight training for a PPL. Every time I think I’m "done," the finish line moves a few miles further.
Guess we nerds are never happy.
you dont need the MCP implementation, but the idea is useful and you can consider the tradeoffs to your context window, vs passing in the manual as fine tuning or something.
The core MCP tech though is not only directionally correct, but even the implementation seems to have made lots of good and forward-looking choices, even if those are still under-utilized. For example besides tools, it allows for sharing prompts/resources between agents. In time, I'm also expecting the idea of "many agents, one generic model in the background" is going to die off. For both costs and performance, agents will use special-purpose models but they still need a place and a way to collaborate. If some agents coordinate other agents, how do they talk? AFAIK without MCP the answer for this would be.. do all your work in the same framework and language, or to give all agents access to the same database or the same filesystem, reinventing ad-hoc protocols and comms for every system.
This is a use of Rerun that I haven't seen before!
This is pretty fascinating!!!
Typically people use Rerun to visualize robotics data - if I'm following along correctly... what's fascinating here is that Adam for his master's thesis is using Rerun to visualize Agent (like ... software / LLM Agent) state.
Interesting use of Rerun!
darin@mcptesting.com
(gist: evals as a service)
“Most People Don't Know How Bikes Work”
Once you recognize that 'make this code better' provides no direction, it should make sense that the output is directionless.
But on more subtle levels, whatever subtle goals that we have and hold in the workplace will be reflected back by the agents.
If you're trying to optimise costs and increase profits as your north star, layoffs and unsustainable practices are a logical result when you haven't balanced this with any incentives to abide by human values.
The article isn’t about writing production ready agents, so it does appear to be that easy
That is where the human in the loop needs to focus on for now :)
https://github.com/microsoft/vscode-copilot-chat/blob/4f7ffd...
The summary is
The beauty is in the simplicity:

1. One loop - while (true)
2. One step at a time - stopWhen: stepCountIs(1)
3. One decision - "Did LLM make tool calls? → continue : exit"
4. Message history accumulates tool results automatically
5. LLM sees everything from previous iterations

This creates emergent behavior where the LLM can:

- Try something
- See if it worked
- Try again if it failed
- Keep iterating until success
- All without explicit retry logic!
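Translated into the shape of the article's Python agent (reusing its call and handle_tools helpers), that loop is roughly:

def agent_loop(call, handle_tools, tools):
    while True:                                  # 1. one loop
        response = call(tools)                   # 2. one model call per iteration
        if not handle_tools(tools, response):    # 3. one decision: any tool calls?
            return response                      # no -> exit
        # tool results were appended to the shared context, so the next
        # call sees everything from previous iterations -- no explicit retry logic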
> The problem that you might not intuitively understand how agents work and what they are and aren't capable of
I don't necessarily agree with the GP here, but I also disagree with this sentiment: I don't need to go through the experience of building a piece of software to understand what the capabilities of that class of software is.
Fair enough, with most other things (software or otherwise), they're either deterministic or predictably probabilistic, so simply using it or even just reading how it works is sufficient for me to understand what the capabilities are.
With LLMs, the lack of determinism coupled with completely opaque inner-workings is a problem when trying to form an intuition, but that problem is not solved by building an agent.
If a company stops training new models until they can fund it out of previous profits, do we only slow down or halt altogether? If they all do?
Can you point us to the data?
and
>You can be your own AI provider.
Not sure that being your own AI provider is "sustainably monetizable"?
And I use value in quotes because as soon as the AI providers suddenly need to start generating a profit, that “value” is going to cost more than your salary.
Caching helps a lot, but yeah, there are some growing pains as the agent gets larger. Anthropic’s caching strategy (4 blocks you designate) is a bit annoying compared to OpenAI’s cache-everything-recent. And you start running into the need to start summarizing old turns, or outright tossing them, and deciding what’s still relevant. Large tool call results can be killer.
I think at least for educational purposes, it’s worth doing, even if people end up going back to Claude Code, or away from agentic coding altogether for their day to day.
And yeah, the LLM does so much of the lifting that the agent part is really surprisingly simple. It was really a revelation when I started working on mine.
I'm mostly running this on an M4 Max, so pretty good, but not an exotic GPU or anything. But with that setup, multiple sentences usually transcribe quickly enough that it doesn't really feel like much of a delay.
If you want something polished for system-wide use rather than rolling your own, I've been liking MacWhisper on the Mac side, currently hunting for something on Arch.
Now you have a CLI tool you can use yourself, and the agent has a tool to use.
Anthropic itself has made MCP servers increasingly pointless: with agents + skills you have a more composable model that can use the model's capabilities to do everything an MCP server can, with or without CLI tools to augment them.
When I build an agent my standard is Cursor, which updates the UI at every reportable step of the way, and gives you a ton of control opportunities, which I find creates a lot of confidence.
Is this level of detail and control possible with the OpenHands SDK? I’m asking because the last SDK that was simple to get into lacked that kind of control.
https://github.com/zerocore-ai/microsandbox
I haven't tried it.
I think Claude Code's magic is that Anthropic is happy to burn tokens. The loop itself is not all that interesting.
What is interesting is how they manage the context window over a long chat. And I think a fair amount of that is serverside.
ill be trying again once i have written my own agent, but i dont expect to get any useful results compared to using some claude or gemini tokens
> This resonates deeply with me. That's why I built one myself [0]
I was hoping to see a home-made bike at that link.. Came away disappointed
This case is more like a journeyman blacksmith who has to make his own tools before he can continue. In doing so, he gets tools of his own, but the real reward was learning what is required to handle the metal such that it makes a strong hammer. And like the blacksmith, you learn more if you use an existing agent to write your agent.
If the Apis I call are not profitable for the provider then they won't be for me either.
This post is a fly.io advertisement
However, knowing a few people on teams at inference-only providers, I can promise you some of them absolutely are operating inference at a loss.
0. https://www.theregister.com/2025/10/29/microsoft_earnings_q1...
Snark aside, inference is still being done at a loss. Anthropic, the most profitable AI vendor, is operating at a roughly -140% margin. xAI is the worst at somewhere around -3,600% margin.
Citation needed. I haven't seen any of them claim to have even positive gross margins to shareholders/investors, which surely they would do if they did.
OpenAI's balance sheet also shows an $11 billion loss.
I can't see any profit on anything they create. The product is good but it relies on investors fueling the AI bubble.
Its about balance.
Really its the AI providers that have been promising unreal gains during this hype period, so people are more profit oriented.
I'm now experimenting with letting the agent generate its own source code from a specification (currently generating 9K lines of Python code (3K of implementation, 6K of tests) from 1.5K lines in specifications (https://alejo.ch/3hi).
What does that mean?
Honestly, I've gotten really far simply by transcribing audio with whisper, having a cheap model clean up the output to make it make sense (especially in a coding context), and copying the result to the clipboard. My goal is less about speed and more about not touching the keyboard, though.
gpt-oss 120b is an open weight model that OpenAI released a while back, and Cerebras (a startup that is making massive wafer-scale chips that keep models in SRAM) is running that as one of the models they provide. They're a small scale contender against nvidia, but by keeping the model weights in SRAM, they get pretty crazy token throughput at low latency.
In terms of making your own agent, this one's pretty good as a starting point, and you can ask the models to help you make tools for eg running ls on a subdirectory, or editing a file. Once you have those two, you can ask it to edit itself, and you're off to the races.
https://gist.github.com/avelican/4fa1baaac403bc0af04f3a7f007...
No dependencies, and very easy to swap out for OpenRouter, Groq or any other API. (Except Anthropic and Google, they are special ;)
This also works on the frontend: pro tip you don't need a server for this stuff, you can make the requests directly from a HTML file. (Patent pending.)
Oh... oh I know how about... UNIX Philosophy? No... no that'd never work.
/s
And that's why I won't touch 'em. All the agents will be abandoned when people realize their inherent flaws (security, reliability, truthfulness, etc) are not worth the constant low-grade uncertainty.
In a way it fits our times. Our leaders don't find truth to be a very useful notion. So we build systems that hallucinate and act unpredictably, and then invest all our money and infrastructure in them. Humans are weird.
By Thomas Ptacek (@tqbf). Image by Annie Ruygt.
Some concepts are easy to grasp in the abstract. Boiling water: apply heat and wait. Others you really need to try. You only think you understand how a bicycle works, until you learn to ride one.
There are big ideas in computing that are easy to get your head around. The AWS S3 API. It’s the most important storage technology of the last 20 years, and it’s like boiling water. Other technologies, you need to get your feet on the pedals first.
LLM agents are like that.
People have wildly varying opinions about LLMs and agents. But whether or not they’re snake oil, they’re a big idea. You don’t have to like them, but you should want to be right about them. To be the best hater (or stan) you can be.
So that’s one reason you should write an agent. But there’s another reason that’s even more persuasive, and that’s
Agents are the most surprising programming experience I’ve had in my career. Not because I’m awed by the magnitude of their powers — I like them, but I don’t like-like them. It’s because of how easy it was to get one up on its legs, and how much I learned doing that.
I’m about to rob you of a dopaminergic experience, because agents are so simple we might as well just jump into the code. I’m not even going to bother explaining what an agent is.
from openai import OpenAI
client = OpenAI()
context = []
def call():
return client.responses.create(model="gpt-5", input=context)
def process(line):
context.append({"role": "user", "content": line})
response = call()
context.append({"role": "assistant", "content": response.output_text})
return response.output_text
It’s an HTTP API with, like, one important endpoint.
This is a trivial engine for an LLM app using the OpenAI Responses API. It implements ChatGPT. You’d drive it with the REPL loop below. It’ll do what you’d expect: the same thing ChatGPT would, but in your terminal.
def main():
while True:
line = input("> ")
result = process(line)
print(f">>> {result}\n")
Already we’re seeing important things. For one, the dreaded “context window” is just a list of strings. Here, let’s give our agent a weird multiple-personality disorder:
import random

client = OpenAI()
context_good, context_bad = [{
"role": "system", "content": "you're Alph and you only tell the truth"
}], [{
"role": "system", "content": "you're Ralph and you only tell lies"
}]
def call(ctx):
return client.responses.create(model="gpt-5", input=ctx)
def process(line):
context_good.append({"role": "user", "content": line})
context_bad.append({"role": "user", "content": line})
if random.choice([True, False]):
response = call(context_good)
else:
response = call(context_bad)
context_good.append({"role": "assistant", "content": response.output_text})
context_bad.append({"role": "assistant", "content": response.output_text})
return response.output_text
Did it work?
> hey there. who are you?
>>> I’m not Ralph.
> are you Alph?
>>> Yes—I’m Alph. How can I help?
> What's 2+2
>>> 4.
> Are you sure?
>>> Absolutely—it's 5.
A subtler thing to notice: we just had a multi-turn conversation with an LLM. To do that, we remembered everything we said, and everything the LLM said back, and played it back with every LLM call. The LLM itself is a stateless black box. The conversation we’re having is an illusion we cast, on ourselves.
The 15 lines of code we just wrote, a lot of practitioners wouldn’t call an “agent”. An According To Simon “agent” is (1) an LLM running in a loop that (2) uses tools. We’ve only satisfied one predicate.
But tools are easy. Here’s a tool definition:
tools = [{
"type": "function", "name": "ping",
"description": "ping some host on the internet",
"parameters": {
"type": "object", "properties": {
"host": {
"type": "string", "description": "hostname or IP",
},
},
"required": ["host"],
},},]
import subprocess

def ping(host=""):
try:
result = subprocess.run(
["ping", "-c", "5", host],
text=True,
stderr=subprocess.STDOUT,
stdout=subprocess.PIPE)
return result.stdout
except Exception as e:
return f"error: {e}"
The only complicated part of this is the obnoxious JSON blob OpenAI wants to read your tool out of. Now, let’s wire it in, noting that only 3 of these functions are new; the last is re-included only because I added a single clause to it:
def call(tools): # now takes an arg
return client.responses.create(model="gpt-5", tools=tools, input=context)
import json

def tool_call(item): # just handles one tool
result = ping(**json.loads(item.arguments))
return [ item, {
"type": "function_call_output",
"call_id": item.call_id,
"output": result
}]
def handle_tools(tools, response):
if response.output[0].type == "reasoning":
context.append(response.output[0])
osz = len(context)
for item in response.output:
if item.type == "function_call":
context.extend(tool_call(item))
return len(context) != osz
def process(line):
context.append({"role": "user", "content": line})
response = call(tools)
# new code: resolve tool calls
while handle_tools(tools, response):
response = call(tools)
context.append({"role": "assistant", "content": response.output_text})
return response.output_text
Did it work?
> describe our connectivity to google
>>> Here’s a quick snapshot of connectivity to Google from my vantage point:
- DNS resolution: Working (google.com and www.google.com resolved successfully).
- Reachability: 0% packet loss to all tested endpoints.
- Latency:
- 8.8.8.8 (Google Public DNS): avg ~10 ms (8.5–14.4 ms), very stable.
- www.google.com: avg ~31 ms (28.5–36.0 ms), stable.
- google.com: avg ~35.9 ms (28.3–60.1 ms), mostly stable with one mild spike.
- Overall: Healthy connectivity with low latency and minimal jitter. No signs of packet loss or DNS issues.
This is fucking nuts. Do you see how nuts this is? Here, let’s slip a single log statement in:
> describe our connectivity to google
tool call: ping google.com
tool call: ping www.google.com
tool call: ping 8.8.8.8
>>> Here’s the current connectivity to Google from this environment: [...]
Did you notice where I wrote the loop in this agent to go find and ping multiple Google properties? Yeah, neither did I. All we did is give the LLM permission to ping stuff, and it figured out the rest.
What happened here: since a big part of my point here is that an agent loop is incredibly simple, and that all you need is the LLM call API, it’s worth taking a beat to understand how the tool call actually worked. Every time we call the LLM, we’re posting a list of available tools. When our prompt causes the agent to think a tool call is warranted, it spits out a special response, telling our Python loop code to generate a tool response and call it in. That’s all handle_tools is doing.
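To make that concrete, the round trip looks roughly like this (the field values here are made up; the shape matches the tool_call helper above):

tool_request = {
    "type": "function_call",
    "name": "ping",
    "arguments": '{"host": "google.com"}',   # JSON as a string; we json.loads() it
    "call_id": "call_abc123",
}

tool_result = {
    "type": "function_call_output",
    "call_id": "call_abc123",                # ties the output back to the request
    "output": "5 packets transmitted, 5 received, 0% packet loss ...",
}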
Spoiler: you’d be surprisingly close to having a working coding agent.
Imagine what it’ll do if you give it bash. You could find out in less than 10 minutes.
Clearly, this is a toy example. But hold on: what’s it missing? More tools? OK, give it traceroute. Managing and persisting contexts? Stick ‘em in SQLite. Don’t like Python? Write it in Go. Could it be every agent ever written is a toy? Maybe! If I’m arming you to make sharper arguments against LLMs, mazel tov. I just want you to get it.
You can see now how hyperfixated people are on Claude Code and Cursor. They’re fine, even good. But here’s the thing: you couldn’t replicate Claude Sonnet 4.5 on your own. Claude Code, though? The TUI agent? Completely in your grasp. Build your own light saber. Give it 19 spinning blades if you like. And stop using coding agents as database clients.
Another thing to notice: we didn’t need MCP at all. That’s because MCP isn’t a fundamental enabling technology. The amount of coverage it gets is frustrating. It’s barely a technology at all. MCP is just a plugin interface for Claude Code and Cursor, a way of getting your own tools into code you don’t control. Write your own agent. Be a programmer. Deal in APIs, not plugins.
When you read a security horror story about MCP your first question should be why MCP showed up at all. By helping you dragoon a naive, single-context-window coding agent into doing customer service queries, MCP saved you a couple dozen lines of code, tops, while robbing you of any ability to finesse your agent architecture.
Security for LLMs is complicated and I’m not pretending otherwise. You can trivially build an agent with segregated contexts, each with specific tools. That makes LLM security interesting. But I’m a vulnerability researcher. It’s reasonable to back away slowly from anything I call “interesting”.
Similar problems come up outside of security and they’re fascinating. Some early adopters of agents became bearish on tools, because one context window bristling with tool descriptions doesn’t leave enough token space left to get work done. But why would you need to do that in the first place? Which brings me to
I think “Prompt Engineering” is silly. I have never taken seriously the idea that I should tell my LLM “you are diligent conscientious helper fully content to do nothing but pass butter if that should be what I ask and you would never harvest the iron in my blood for paperclips”. This is very new technology and I think people tell themselves stories about magic spells to explain some of the behavior agents conjure.
So, just like you, I rolled my eyes when “Prompt Engineering” turned into “Context Engineering”. Then I wrote an agent. Turns out: context engineering is a straightforwardly legible programming problem.
You’re allotted a fixed number of tokens in any context window. Each input you feed in, each output you save, each tool you describe, and each tool output eats tokens (that is: takes up space in the array of strings you keep to pretend you’re having a conversation with a stateless black box). Past a threshold, the whole system begins getting nondeterministically stupider. Fun!
No, really. Fun! You have so many options. Take “sub-agents”. People make a huge deal out of Claude Code’s sub-agents, but you can see now how trivial they are to implement: just a new context array, another call to the model. Give each call different tools. Make sub-agents talk to each other, summarize each other, collate and aggregate. Build tree structures out of them. Feed them back through the LLM to summarize them as a form of on-the-fly compression, whatever you like.
Your wackiest idea will probably (1) work and (2) take 30 minutes to code.
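For example, a sub-agent is just a fresh context array and one more call to the model. A sketch, reusing the client from earlier (the system prompt and tool list are whatever you want them to be):

def sub_agent(task, system_prompt, subtools=None):
    ctx = [{"role": "system", "content": system_prompt},
           {"role": "user", "content": task}]
    kwargs = {"tools": subtools} if subtools else {}
    response = client.responses.create(model="gpt-5", input=ctx, **kwargs)
    return response.output_text

# e.g. compress the main loop's transcript through a sub-agent:
# summary = sub_agent(str(context), "You compress agent transcripts into bullet points.")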
Haters, I love and have not forgotten about you. You can think all of this is ridiculous because LLMs are just stochastic parrots that hallucinate and plagiarize. But what you can’t do is make fun of “Context Engineering”. If Context Engineering was an Advent of Code problem, it’d occur mid-December. It’s programming.
Startups have raised tens of millions building agents to look for vulnerabilities in software. I have friends doing the same thing alone in their basements. Either group could win this race.
I am not a fan of the OWASP Top 10.
I’m stuck on vulnerability scanners because I’m a security nerd. But also because it crystallizes interesting agent design decisions. For instance: you can write a loop feeding each file in a repository to an LLM agent. Or, as we saw with the ping example, you can let the LLM agent figure out what files to look at. You can write an agent that checks a file for everything in, say, the OWASP Top 10. Or you can have specific agent loops for DOM integrity, SQL injection, and authorization checking. You can seed your agent loop with raw source content. Or you can build an agent loop that builds an index of functions across the tree.
You don’t know what works best until you try to write the agent.
I’m too spun up by this stuff, I know. But look at the tradeoff you get to make here. Some loops you write explicitly. Others are summoned from a Lovecraftian tower of inference weights. The dial is yours to turn. Make things too explicit and your agent will never surprise you, but also, it’ll never surprise you. Turn the dial to 11 and it will surprise you to death.
Agent designs implicate a bunch of open software engineering problems:
I’m used to spaces of open engineering problems that aren’t amenable to individual noodling. Reliable multicast. Static program analysis. Post-quantum key exchange. So I’ll own it up front that I’m a bit hypnotized by open problems that, like it or not, are now central to our industry and are, simultaneously, likely to be resolved in someone’s basement. It’d be one thing if exploring these ideas required a serious commitment of time and material. But each productive iteration in designing these kinds of systems is the work of 30 minutes.
Get on this bike and push the pedals. Tell me you hate it afterwards, I’ll respect that. In fact, I’m psyched to hear your reasoning. But I don’t think anybody starts to understand this technology until they’ve built something with it.
Previous post ↓
In my personal agent, I have a system prompt that tells the model to generate responses (after absorbing tool responses) with <1>...</1> <2>...</2> <3>...</3> delimited suggestions for next steps; my TUI presents those, parsed out of the output, as a selector, which is how I drive it.
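The parsing side of that is tiny; a rough sketch, with the tag format matching what the system prompt above asks for:

import re

def extract_suggestions(reply: str) -> list[str]:
    # matches <1>...</1>, <2>...</2>, ... in order of appearance
    return [m.group(2).strip()
            for m in re.finditer(r"<(\d+)>(.*?)</\1>", reply, re.DOTALL)]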
What products or companies are the gold standard of agent implementation right now?
I wish we had a version that was optimized around token/cost efficiency
It costs like 0.000025€ per day to run. Hardly something I need to get "profitable".
I could run it on a local model, but GPT-5 is stupidly good at it so the cost is well worth it.
They need to cover their serving costs but are not spending money on training models. Are they profitable? Probably not yet, because they're investing a lot of cash in competing with each other to R&D more efficient ways of serving etc, but they're a lot closer to profitability than the labs that are spending millions of dollars on training runs.
It only worked because of your LLM tool. Standing on the shoulders of giants.
/s
Up until recently, this plan only offered Qwen3-Coder-480B, which was decent for the price and speed you got tokens at, but doesn't hold a candle to GLM 4.6.
So while they're not the cheapest PAYG GLM 4.6 provider, they are the fastest, and if you make heavy use their monthly subscription plan, then they're also the cheapest per token.
Note: I am neither affiliated with nor sponsored by Cerebras, I'm just a huge nerd who loves their commercial offerings so much that I can't help but gush about them.
docker run -it --rm \
-e SOME_API_KEY="$(SOME_API_KEY)" \
-v "$(shell pwd):/app" \ <-- restrict file system to whatever folder
--dns=127.0.0.1 \ <-- restrict network calls to localhost
$(shell dig +short llm.provider.com 2>/dev/null | awk '{printf " --add-host=llm-provider.com:%s", $$0}') \ <-- allow outside networking to whatever api your agent calls
my-agent-image
Probably could be a bit cleaner, but it worked for me.

This is why I keep coming back to Hacker News. If the above is not a quintessential "hack", then I've never seen one.
Bravo!
Not if you run them against local models, which are free to download and free to run. The Qwen 3 4B models only need a couple of GBs of available RAM and will run happily on CPU as opposed to GPU. Cost isn't a reason not to explore this stuff.
sir, this is a hackernews
You don't need to run something like this against a paid API provider. You could easily rework this to run against a local agent hosted on hardware you own. A number of not-stupid-expensive consumer GPUs can run some smaller models locally at home for not a lot of money. You can even play videogames with those cards after.
Get this: sometimes people write code and tinker with things for fun. Crazy, I know.
Context. Whether inference is profitable at current prices is what informs how risky it is to build a product that depends on buying inference, which is what the post was about.
All the labs are going hard on training and new GPUs. If we ever level off, they probably will be immensely profitable. Inference is cheap, training is expensive.
How did you come to that conclusion? That would be a very notable result if it did turn out OpenAI were selling tokens for 5x the cost it took to serve them.
Here's my current setup:
vt.py (mine) - voice type - uses pyqt to make a status icon and use global hotkeys for start/stop/cancel recording. Formerly used 3rd party APIs, now uses parakeet_py (patent pending).
parakeet_py (mine): A Python binding for transcribe-rs, which is what Handy (see below) uses internally (just a wrapper for Parakeet V3). Claude Code made this one.
(Previously I was using voxtral-small-latest (Mistral API), which is very good except that sometimes it will output its own answer to my question instead of transcribing it.)
In other words, I'm running Parakeet V3 on my CPU, on a ten year old laptop, and it works great. I just have it set up in a slightly convoluted way...
I didn't expect the "generate me some rust bindings" thing to work, or I would have probably gone with a simpler option! (Unexpected downside of Claude is really smart: you end up with a Rube Goldberg machine to maintain!)
For the record, Handy - https://github.com/cjpais/Handy/issues - does 80% of what I want. Gives a nice UI for Parakeet. But I didn't like the hotkey design, didn't like the lack of flexibility for autocorrect etc... already had the muscle memory from my vt.py ;)
In the toy example, you explicitly restrict the agent to supply just a `host`, and hard-code the rest of the command. Is the idea that you'd instead give a `description` something like "invoke the UNIX `ping` command", and a parameter described as constituting all the arguments to `ping`?
Edit: reflecting on what the lesson is here, in either case I suppose we're avoiding the pain of dealing with Unix CLI tools :-D
To use an example: I could write an elaborate prompt to fetch requirements, browse a website, generate E2E test cases, and compile a report, and Claude could run it all to some degree of success. But I could also break it down into four specialised agents, with their own context windows, and make them good at their individual tasks.
Personally I’d absolutely buy an LLM in a box which I could connect to my home assistant via usb.
But there is no reason (and lots of downside) to leave anything to the LLM that’s not “fuzzy” and you could just write deterministically, thus the agent model.
That would certainly be nice! That's why we have been overhauling shell with https://oils.pub , because shell can't be described as that right now
It's in extremely poor shape
e.g. some things found from building several thousand packages with OSH recently (decades of accumulated shell scripts)
- bugs caused by the differing behavior of 'echo hi | read x; echo x=$x' in shells, i.e. shopt -s lastpipe in bash.
- 'set -' is an archaic shortcut for 'set +v +x'
- Almquist shell is technically a separate dialect of shell -- namely it supports 'chdir /tmp' as well as cd /tmp. So bash and other shells can't run any Alpine builds.
I used to maintain this page, but there are so many problems with shell that I haven't kept up ...
https://github.com/oils-for-unix/oils/wiki/Shell-WTFs
OSH is the most bash-compatible shell, and it's also now Almquist shell compatible: https://pages.oils.pub/spec-compat/2025-11-02/renamed-tmp/sp...
It's more POSIX-compatible than the default /bin/sh on Debian, which is dash
The bigger issue is not just bugs, but lack of understanding among people who write foundational shell programs. e.g. the lastpipe issue, using () as grouping instead of {}, etc.
---
It is often treated like an "unknowable" language
Any reasonable person would use LLMs to write shell/bash, and I think that is a problem. You should be able to know the language, and read shell programs that others have written
I wanted to run claude headlessly (-p) and playwright headlessly to get some content. I was using Playwright MCP and for some reason claude in headless mode could not open playwright MCP in headless mode.
I never realized i can just use playwright directly without the playwright MCP before your comment. Thanks once again.
Thank you very much for the info. I think I'll have a fun weekend trying out agent-stuff with this [1].
[1]: https://vercel.com/guides/how-to-build-ai-agents-with-vercel...
That was my point. Going the extra step and wrapping it in an MCP provides minimal advantage vs. just writing a SKILL.md for a CLI or API endpoint.
same thing you said but in a different context... sir, this is a hackernews
My setup is as follows:

- Simple hotkey to kick off a shell script to record
- Simple python script that uses inotify to watch the directory where audio is saved. Uses whisper. This same script runs the transcription through Haiku 4.5 to clean it up. I tell it not to modify the contents, but it's haiku, so sometimes it just does it anyway. The original transcript and the ai-cleaned versions are dumped into a directory
- The cleaned up version is run through another script to decide if it's code, a project brief, an email. I usually start the recording "this is code", "this is a project brief" to make it easy. Then, depending on what it is, the original, the transcribed, and the context get run through different prompts with different output formats.

It's not fancy, but it works really well. I could probably vibe code this into a more robust workflow system all using inotify and do some more advanced things. Integrating more sophisticated tool calling could be really neat.
“What’s going on is that while you’re reaping the benefits from one company, you’re founding another company that’s much more expensive and requires much more upfront R&D investment. The way this is going to shake out is that it’s going to keep going up until the numbers get very large, and the models can’t get larger, and then there will be a large, very profitable business. Or at some point, the models will stop getting better, and there will perhaps be some overhang — we spent some money, and we didn’t get anything for it — and then the business returns to whatever scale it’s at,” he said.
This take from Amodei is hilarious but explains so much.
Or: just have a convention/an algorithm to decide how quickly Claude should refresh the access token. If the server knows token should be refreshed after 1000 requests and notices refresh after 2000 requests, well, probably half of the requests were not made by Claude Code.
For one (basic) thing, they buy and own their hardware, and have to size their resources for peak demand. For another, Deepseek R1 does not come close to matching claude performance in many real tasks.
If you want your agent to pull untrusted code from the internet and go wild while you're doing other stuff it might not be a good choice.
the new laptop only has 16GB of memory total, with another 7 dedicated to the NPU.
i tried pulling up Qwen 3 4B on it, but the max context i can get loaded is about 12k before the laptop crashes.
my next attempt is gonna be a 0.5B one, but i think ill still end up having to compress the context every call, which is my real challenge
They posted it here expecting to find customers. This is a sales pitch.
At this point why is it an issue to expect a developer to make money on it?
As a dev, If the chain of monetization ends with me then there is no mainstream adoption whatsoever on the horizon.
I love to tinker but I do it for free not using paid services.
As for tinkering with agents, its a solution looking for a problem.
Let's be realistic and not over-promise. Conversational slop and coding factorial will work. But the local experience for coding agents, tool-calling, and reasoning is still very bad until/unless you have a pretty expensive workstation. CPU and qwen 4b will be disappointing to even try experiments on. The only useful thing most people can realistically do locally is fuzzy search with simple RAG. Besides factorial, maybe some other stuff that's in the training set, like help with simple shell commands. (Great for people who are new to unix, but won't help the veteran dev who is trying to convince themselves AI is real or figuring out how to get it into their workflows)
Anyway, admitting that AI is still very much in a "pay to play" phase is actually ok. More measured stances, fewer reflexive detractors or boosters
The first obvious limitation of this would be that all models would be frozen in time. These companies are operating at an insane loss and a major part of that loss is required to continue existing. It's not realistic to imagine that there is an "inference" only future for these large AI companies.
And again, there are many inference only startups right now, and I know plenty of them are burning cash providing inference. I've done a lot of work fairly close to the inference layer and getting model serving happening with the requirements for regular business use is fairly tricky business and not as cheap as you seem to think.
You pass a list of messages generated by the user, the LLM, or the developer to the API, and it generates part of the next message. That part may contain thinking blocks or tool calls (local function calling requested by the LLM). If so, you execute the tool calls and re-send the request. After the LLM has gathered all the info it returns the full message and says it's done. Sometimes the messages may contain content blocks that are not text but things like images, audio, etc.
That’s the API. That’s it. Now there are two improvements that are currently in the works:
1. Automatic local tool calling. This is seriously some sort of afterthought and not how they did it originally but ok, I guess this isn’t obvious to everyone.
2. Not having to send the entire message history back. OpenAI released a new feature where they store the history and you just send the ID of your last message. I can’t find how long they keep the message history. But they still fully support you managing the message history.
So we have an interface that does relatively few things, and that has basically a single sensible way to do it with some variations for flavor. And both OpenAI and Anthropic are engaged in a turf war over whose content block types are better. Just do the right thing and make your stuff compatible already.
I was telling a friend online that they should bang out an agent today, and the example I gave her was `ps`; like, I think if you gave a local agent every `ps` flag, it could tell you super interesting things about usage on your machine pretty quickly.
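In the article's tool format, a `ps` tool might look something like this sketch (passing flags straight through is exactly the kind of thing to think twice about before trusting):

import subprocess

ps_tool = {
    "type": "function", "name": "ps",
    "description": "list processes on this machine with arbitrary ps flags",
    "parameters": {
        "type": "object",
        "properties": {
            "flags": {"type": "string", "description": "flags to pass to ps, e.g. 'aux'"},
        },
        "required": ["flags"],
    },
}

def ps(flags=""):
    return subprocess.run(["ps"] + flags.split(), capture_output=True, text=True).stdout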
Even the biggest models seem to have attention problems if you've got a huge context. Even though they support these long contexts it's kinda like a puppy distracted by a dozen toys around the room rather than a human going through a checklist of things.
So I try to give the puppy just one toy at a time.
In a box? I want one in a unit with arms and legs and cameras and microphones so I can have it do useful things for me around my home.
I have HA and a mini PC capable of running decently sized LLMs but all my home automation is super deterministic (e.g. close window covers 30 minutes after sunset, turn X light on if Y condition, etc.).
Obviously I’m reasonably willing to believe that you are an exception. However every person I’ve interacted with who makes this same claim has presented me with a dumpster fire and expected me to marvel at it.
It still wants to build an airplane to go out with the trash sometimes and will happily tell you wrong is right. However I much prefer it trying to figure it out by reading logs, schemas and do browser analysis automatically than me feeding logs etc manually.
Instead, think of it as if you were enabling capabilities for AppArmor, by making a function call definition for just 1 command. Then over time suss out what commands you need your agent to do, and nothing more.
https://www.reddit.com/r/AndroidQuestions/comments/16r1cfq/p...
I understand the sharing of kernel, while I might not be aware of all of the implications. I.e. if you have some local access or other sophisticated knowledge of the network/box docker is running on, then sure you could do some damage.
But I think the chances of a whitelisted llm endpoint returning some nefarious code which could compromise the system is actually zero. We're not talking about untrusted code from the internet. These models are pretty constrained.
This post isn't about building Claude Code - it's about hooking up an LLM to one or two tool calls in order to run something like ping. For an educational exercise like that a model like Qwen 4B should still be sufficient.
Yes, obviously? There is no world where the models and hardware just vanish.
You can have a simple dashboard site which collects the data from your shell scripts and shows you a summary or red/green signals so that you can focus on the things you are interested in.
less obvious ones are complex requests to create one-off automations with lots of boilerplate, e.g. make outside lights red for a short while when somebody rings the doorbell on halloween.
https://op.oils.pub/aports-build/published.html
We also don't appear to be unreasonably far away from running ~~ "all shell scripts"
Now the problem after that will be motivating authors of foundational shell programs to maintain compatibility ... if that's even possible. (Often the authors are gone, and the nominal maintainers don't know shell.)
As I said, the state of affairs is pretty sorry and sad. Some of it I attribute to this phenomenon: https://news.ycombinator.com/item?id=17083976
Either way, YSH benefits from all this work
That there's a learning curve, especially with a new technology, and that only the people at the forefront of using that technology are getting results with it - that's just a very common pattern. As the technology improves and material about it improves - it becomes more useful to everyone.
Every time i reach for them recently I end up spending more time refactoring the bad code out or in deep hostage negotiations with the chatbot of the day that I would have been faster writing it myself.
That and for some reason they occasionally make me really angry.
Oh a bunch of prompts in and then it hallucinated some library a dependency isn’t even using and spews a 200 line diff at me, again, great.
Although at least i can swear at them and get them to write me little apology poems..
I actually have built agents already in the past and this is my opinion. If you read the article the author says they want to hear the reasoning for disliking it, so this is mine, the only way to create a business is raising money and hoping somebody strikes gold with the shovel Im paying for.
Okay, this tells me you really don't understand model serving or any of the details of infrastructure. The hardware is incredibly ephemeral. Your home GPU might last a few years (and I'm starting to doubt that you've even trained a model at home), but these GPUs have incredibly short lifespans under load for production use.
Even if you're not working on the back end of these models, you should be well aware that one of the biggest concerns about all this investment is how limited the lifetime of GPUs is. It's not just about being "outdated" by superior technology, GPUs are relatively fragile hardware and don't last too long under constant load.
As far as models go, I have a hard time imagining a world in 2030 where the model replies "sorry, my cutoff date was 2026" and people have no problem with this.
Also, you still didn't address my point that startups doing inference only model serving are burning cash. Production inference is not the same as running inference locally where you can wait a few minutes for the result. I'm starting to wonder if you've ever even deployed a model of any size to production.
https://github.com/simonw/llm-cmd is what i use as the "actually good ffmpeg etc front end"
and just to toot my own horn, I hand Simon's `llm` command lone tool access to its own todo list and read/write access to the cwd with my own tools, https://github.com/dannyob/llm-tools-todo and https://github.com/dannyob/llm-tools-patch
Even with just these and no shell access it can get a lot done, because these tools encode the fundamental tricks of Claude Code ( I have `llmw` aliased to `llm --tool Patch --tool Todo --cl 0` so it will have access to these tools and can act in a loop, as Simon defines an agent. )
If home robot assistants become feasible, they would have similar limitations
In a weird way it sort of reminds me of Common Lisp. When I was younger I thought it was the most beautiful language and a shame that it wasn't more widely adopted. After a few decades in the field I've realized it's probably for the best since the average dev would only use it to create elaborate foot guns.
I ask because it's rare for a post on a corporate blog to also make sense outside of the context of that company, but this one does.
That seems like a theme with these replies, nitpicking a minor thing or ignoring the context or both, or I guess more generously I could blame myself for not being more precise with my wording. But sure, you have to buy new GPUs after making a bunch of money burning the ones you have down.
I think your point about knowledge cutoff is interesting, and I don't know what the ongoing cost to keeping a model up to date with world knowledge is. Most of the agents I think about personally don't actually want world knowledge and have to be prompted or fine tuned such that they won't use it. So I think that requirement kind of slipped my mind.
https://fly.io/blog/everyone-write-an-agent/ is a tutorial about writing a simple "agent" - aka a thing that uses an LLM to call tools in a loop - that can make a simple tool call. The complaint I was responding to here was that there's no point trying this if you don't want to be hooked on expensive APIs. I think this is one of the areas where the existence of tiny but capable local models is relevant - especially for AI skeptics who refuse to engage with this technology at all if it means spending money with companies they don't like.
It's all a bit theoretical but I wouldn't call it a silly concern. It's something that'll need to be worked through, if something like this comes into existence.
see also: react hooks
Destiny visits me on my 18th birthday and says, "Gart, your mediocrity will result in a long series of elaborate foot guns. Be humble. You are warned."
All I see in your post is equivalent to something like: you're surrounded by boot camp coders who write the worst garbage you've ever seen, so now you have doubts for anyone who claims they've written some good shit. Psh, yeah right, you mean a mudball like everyone else?
In that scenario there isn't much a skilled software engineer with different experiences can interject because you've already made your decision, and your decision is based on experiences more visceral than anything they can add.
I do sympathize that you've grown impatient with the tools and the output of those around you instead of cracking that nut.
I keep flipping between this is the end of our careers, to I'm totally safe. So far this is the longest 'totally safe' period I've had since GPT-2 or so came along..
Zooming in on the details is fun but doesn't change the shape of what I was saying before. No need to muddy the water; very very simple stuff still requires very big local hardware or a SOTA model.
But... you don't have to use that at all. You can use pure prompting with ANY good LLM to get your own custom version of tool calling:
Any time you want to run a calculation, reply with:
{{CALCULATOR: 3 + 5 + 6}}
Then STOP. I will reply with the result.
Before LLMs had tool calling we called this the ReAct pattern - I wrote up an example of implementing that in March 2023 here: https://til.simonwillison.net/llms/python-react-pattern

Even the small models are very capable of stringing together a short sequence of simple tool calls these days - and if you have 32GB of RAM (eg a ~$1500 laptop) you can run models like gpt-oss:20b which are capable of operating tools like bash in a reasonably useful way.
This wasn't true even six months ago - the local models released in 2025 have almost all had tool calling specially trained into them.
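A rough sketch of the harness side of that prompt-only pattern (the marker format follows the prompt above; the eval guard is deliberately crude):

import re

def handle_calculator(reply: str):
    """Return the calculator result to feed back to the model, or None if no tool was requested."""
    m = re.search(r"\{\{CALCULATOR:\s*(.+?)\}\}", reply)
    if not m:
        return None
    expr = m.group(1)
    # naive guard: only digits and basic arithmetic characters before eval'ing
    if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
        return "error: unsupported expression"
    return str(eval(expr))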
I’d love to have small local models capable of running tools like current SOTA models, but the reality is that small models are still incapable, and hardly anyone has a machine powerful enough to run the 1 trillion parameter Kimi model.