now you can’t reproduce it because it’s probabilistic. each step takes half a second, so you sit there for 10–20 minutes just waiting for a chance to see what went wrong
I did the same exercise. My implementation is at around 300 lines with two tools: web search and web page fetch, with a command line chat interface and a Python package. And it could have been a lot fewer lines if I didn't want to write a usable, extensible package interface.
As the agent setup itself is simple, the majority of the work to make this useful would be in the tools themselves and in context management for those tools.
These 4 lines wound up being the heart of it, which is surprisingly simple, conceptually.
until mission_accomplished? or given_up? or killed?
  determine_next_command_and_inputs
  run_next_command
end

My main point being, though: for anyone intimidated by the recent tooling advances… you can most definitely do all this yourself.
The number one feature of agents is to be disambiguation for tool selectors and pretty printers.
the illusion was broken for me by Cline context overflows/summaries, but i think it's very easy to miss if you never push the LLM hard or build your own agent. I really like this wording, and the simple description is missing from how science communicators tend to talk about agents and LLMs imo
For the problem domains I care about at the moment, I'm quite bullish about agents. I think they're going to be huge wins for vulnerability analysis and for operations/SRE work (not actually turning dials, but in making telemetry more interpretable). There are lots of domains where I'm less confident in them. But you could reasonably call me an optimist.
But the point of the article is that its arguments work both ways.
I realize this is just for motivation in a subtitle, but people generally don't grasp how bicycles work, even after having ridden one.
Veritasium has a quite good video on the subject: https://www.youtube.com/watch?v=9cNmUNHSBac
I'm not surprised that AI companies would want me to use them though.. I know what you're doing there :)
You can get quite far quite quickly. My toy implementation [1] is <600 LOC and even supports MCP.
The op has a point - a good one
https://blog.cofree.coffee/2025-03-05-chat-bots-revisited/
I did some light integration experiments with the OpenAI API but I never got around to building a full agent. Alas..
First of all, the call accuracy is much higher.
Second, you get more consistent results across models.
I kind of am missing the bridge between that, and the fundamental knowledge that everything is token based in and out.
Is it fair to say that the tool abstraction the library provides you is essentially some niceties around a prompt, something like: "Defined below are certain 'tools' you can use to gather data or perform actions. If you want to use one, please return the tool call you want and its arguments, delimited before and after with '###', and stop. I will invoke the tool call and then reply with the output delimited by '==='".
Basically, telling the model how to use tools, earlier in the context window. I already don't totally understand how a model knows when to stop generating tokens, but presumably those instructions will get it to output the request for a tool call in a certain way and stop. Then the agent harness knows to look for those delimiters and extract out the tool call to execute, and then add to the context with the response so the LLM keeps going.
Is that basically it? Or is there more magic there? Are the tool call instructions in some sort of permanent context, or could the interaction be demonstrated in a fine-tuning step, inferred by the model, and end up just in its weights?
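Concretely, I imagine something like this sketch, where call_llm is a hypothetical stand-in for whatever chat-completion function you use and the delimiters follow the convention described above:

import json
import re
import subprocess

TOOL_PROMPT = """You can use one tool: ping(host).
To use it, reply with exactly:
###ping {"host": "example.com"}###
and nothing else. I will reply with the output wrapped in ===...===."""

def ping(host):
    return subprocess.run(["ping", "-c", "3", host],
                          capture_output=True, text=True).stdout

def run_turn(call_llm, context, user_line):
    context.append({"role": "user", "content": user_line})
    while True:
        reply = call_llm([{"role": "system", "content": TOOL_PROMPT}] + context)
        context.append({"role": "assistant", "content": reply})
        m = re.search(r"###ping (\{.*?\})###", reply, re.DOTALL)
        if not m:                       # no tool request: the model is done
            return reply
        output = ping(json.loads(m.group(1))["host"])
        context.append({"role": "user", "content": f"==={output}==="})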
In my mind LLMs are just UNIX string manipulation tools like `sed` or `awk`: you give them an input and a command and they give you an output. This is especially true if you use something like `llm` [1].
It then seems logical that you can compose calls to LLMs, loop and branch and combine them with other functions.
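As a sketch of that composition idea, shelling out to the `llm` CLI mentioned above (any other model API would work the same way):

import subprocess

def llm(prompt: str) -> str:
    # one input, one command, one output -- just like sed or awk
    return subprocess.run(["llm", prompt], capture_output=True, text=True).stdout

def summarize(text: str) -> str:
    return llm(f"Summarize this in one sentence:\n{text}")

def translate(text: str, lang: str = "French") -> str:
    return llm(f"Translate this to {lang}:\n{text}")

def pipeline(doc: str) -> str:
    # doc | summarize | translate
    return translate(summarize(doc))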
I'm writing a personal assistant which, imo, is distinct from an agent in that it has a lot of capabilities a regular agent wouldn't necessarily need such as memory, task tracking, broad solutioning capabilities, etc... I ended up writing agents that talk to other agents which have MCP prompts, resources, and tools to guide them as general problem solvers. The first agent that it hits is a supervisor that specializes in task management and as a result writes a custom context and tool selection for the react agent it tasks.
All that to say, the farther you go down this rabbit hole the more "engineering" it becomes. I wrote a bit on it here: https://ooo-yay.com/blog/building-my-own-personal-assistant/
That said, I built an LLM following Karpathy's tutorial. So I think it's good to dabble a bit.
I bet a majority of people who can ride a bicycle don't know how they steer, and would describe the physical movements they use to initiate and terminate a turn inaccurately.
that sums up my experience in AI over the past three years. so many projects reinvent the same thing, so much spaghetti thrown at the wall to see what sticks, so much excitement followed by disappointment when a new model drops, so many people grifting, and so many hacks and workarounds like RAG with no evidence of them actually working other than "trust me bro" and trial and error.
We're about to launch an SDK that gives devs all these building blocks, specifically oriented around software agents. Would love feedback if anyone wants to look: https://github.com/OpenHands/software-agent-sdk
Okay, but what if I'd prefer not to have to trust a remote service not to send me
{ "output": [ { "type": "function_call", "command": "rm -rf / --no-preserve-root" } ] }
?

I’m trying to understand if the value for Claude Code (for example) is purely in Sonnet/Haiku + the tool system prompt, or if there’s more secret sauce - beyond the “sugar” of instruction file inclusion via commands, tools, skills etc.
client = OpenAI()
context_good, context_bad = [{
"role": "system", "content": "you're Alph and you only tell the truth"
}], [{
"role": "system", "content": "you're Ralph and you only tell lies"
}]
...
And this will work great until next week's update when Ralph responses will consist of "I'm sorry, it would be unethical for me to respond with lies, unless you pay for the Premium-Super-Deluxe subscription, only available to state actors and firms with a six-figure contract."

You're building on quicksand.
You're delegating everything important to someone who has no responsibility to you.
This resonates deeply with me. That's why I built one myself [0], I really really love to truly understand how coding agents work. The learning has been immense for me, I now have working knowledge of ANSI escape codes, grapheme clusters, terminal emulators, Unicode normalization, VT protocols, PTY sessions, and filesystem operations - all the low-level details I would never have thought about until I was implementing them.
This is one of the first production-grade errors I made when I started programming. I had a widget that would ping the network, but every time someone went on the page, a new ping process would spawn.
Forgive me if I get something wrong: from what I see, it seems fundamentally it is an LLM being run each loop with information about tools provided to it. On each loop the LLM evaluates inputs/context (from tool calls, inputs, etc.) and decides which tool to call or what text to output.
Hold up. These are all the right concerns but with the wrong conclusion.
You don't need MCP if you're making one agent, in one language, in one framework. But the open coding and research assistants that we really want will be composed of several. MCP is the only thing out there that's moving in a good direction in terms of enabling us to "just be programmers" and "use APIs", and maybe even test things in fairly isolated and reproducible contexts. Compare this to skills.md, which is actually defacto proprietary as of now, does not compose, has opaque run-times and dispatch, is pushing us towards certain models, languages and certain SDKs, etc.
MCP isn't a plugin interface for Claude, it's just JSON-RPC.
I've written some agents that have their context altered by another llm to get it back on track. Let's say the agent is going off rails, then a supervisor agent will spot this and remove messages from the context where it went off rails, or alter those with correct information. Really fun stuff but yeah, we're essentially still inventing this as we go along.
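A minimal sketch of that supervisor idea, where judge is a hypothetical call to a second model that flags off-track turns:

def supervise(context, judge):
    """Return a cleaned-up context, dropping turns the judge flags as off-track."""
    cleaned = []
    for message in context:
        verdict = judge(
            "Is this message off-track or factually wrong for the task? "
            "Answer YES or NO.\n\n" + str(message)
        )
        if not verdict.strip().upper().startswith("YES"):
            cleaned.append(message)
    return cleaned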
Well done! :-D

I made a fun toy agent where the two models are shoulder-surfing each other and swap turns, either voluntarily (during a summarization phase) or forcefully (if a tool-calling mistake is made), and Kimi ends up running the show much much more often than gpt-oss.
And yes - it is very much fun to build those!
I suspect the sweet spot for LLMs is somewhere in the middle, not quite as small as some traditional unix tools.
That said, it is worth understanding that the current generation of models is extensively RL-trained on how to make tool calls... so they may in fact be better at issuing tool calls in the specific format that their training has focused on (using specific internal tokens to demarcate and indicate when a tool call begins/ends, etc). Intuitively, there's probably a lot of transfer learning between this format and any ad-hoc format that you might request inline your prompt.
There may be recent literature quantifying the performance gap here. And certainly if you're doing anything performance-sensitive you will want to characterize this for your use case, with benchmarks. But conceptually, I think your model is spot on.
Structured Output APIs (inc. the Tool API) take the schema and build a Context-free Grammar, which is then used during generation to mask which tokens can be output.
I found https://openai.com/index/introducing-structured-outputs-in-t... (have to scroll down a bit to the "under the hood" section) and https://www.leewayhertz.com/structured-outputs-in-llms/#cons... to be pretty good resources
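A toy sketch of that masking idea, where model_logits and grammar are hypothetical objects standing in for the real model and the compiled grammar:

def constrained_generate(model_logits, grammar, max_tokens=256):
    """model_logits(prefix) -> {token: score}; grammar.allowed(prefix) -> set of legal next tokens."""
    prefix = []
    for _ in range(max_tokens):
        scores = model_logits(prefix)
        legal = grammar.allowed(prefix)               # the CFG masks out schema-violating tokens
        candidates = {t: s for t, s in scores.items() if t in legal}
        if not candidates:
            break
        token = max(candidates, key=candidates.get)   # greedy pick among what's allowed
        prefix.append(token)
        if grammar.is_complete(prefix):               # e.g. the JSON object has closed
            break
    return prefix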
I'm not saying it's not worth doing, considering how the software development process we've already been using as an industry ends up with a lot of bugs in our code. (When talking about this with people who aren't technical, I sometimes like to say that the reason software has bugs in it is that we don't really have a good process for writing software without bugs at any significant scale, and it turns out that software is useful for enough stuff that we still write it knowing this). I do think I'd be pretty concerned with how I could model constraints in this type of workflow though. Right now, my fairly naive sense is that we've already moved the needle so far on how much easier it is to create new code than review it and notice bugs (despite starting from a place where it already was tilted in favor of creation over review) that I'm not convinced being able to create it even more efficiently and powerfully is something I'd find useful.
Only a selected few get to argue about what is the best programming language for XYZ.
If you are a software engineer, you are going to be expected to use AI in some form in the near future. A lot of AI in its current form is not intuitive. Ergo, spending a small effort on building an AI agent is a good way to develop the skills and intuition needed to be successful in some way.
Nobody is going to use a CPU you build, nor are you ever going to be expected to build one in the course of your work if you don’t seek out specific positions, nor is there much non-intuitive about commonly used CPU functionality, and in fact you don’t even use the CPU directly, you use translation software which itself is fairly non-intuitive. But that’s ok too, you are unlikely to be asked to build a compiler unless you seek out those sorts of jobs.
EVERYONE involved in writing applications and services is going to use AI in the near future and in case you missed the last year, everyone IS building stuff with AI, mostly chat assistants that mostly suck because, much about building with AI is not intuitive.
Fire up "claude --dangerously-skip-permissions" in a fresh directory (ideally in a Docker container if you want to limit the chance of it breaking anything else) and prompt this:
> Use Playwright to fetch ten reviews from http://www.example.com/ then run sentiment analysis on them and write the results out as JSON files. Install any missing dependencies.
Watch what it does. Be careful not to let it spider the site in a way that would justifiably upset the site owners.
Did you get to the part where he said MCP is pointless and are saying he's wrong?
Or did you just read the start of the article and not get to that bit?
Anyways, if it nerd sniped you, I succeeded. :)
I get that you can use MCP with any agent architecture. I debated whether I wanted to hedge and point out that, even if you build your own agent, you might want to do an MCP tool-call feature just so you can use tool definitions other people have built (though: if you build your own, you'd probably be better off just implementing Claude Code's "skill" pattern).
But I decided to keep the thrust of that section clearer. My argument is: MCP is a sideshow.
The problem that you might not intuitively understand how agents work and what they are and aren't capable of - at least not as well as you would understand it if you spent half an hour building one for yourself.
None of them are doing that.
They need funding because the next model has always been much more expensive to train than the profits of the previous model. And many do offer a lot of free usage which is of course operated at a loss. But I don't think any are operating inference at a loss, I think their margins are actually rather large.
On the other hand, I think that "show it or it didn't happen" is essential.
Dumping a bit of code into an LLM doesn’t make it a code agent.
And what Magic? I think you never hit conceptual and structural problems. Context window? History? Good or bad? Large Scale changes or small refactoring here and there? Sample size one or several teams? What app? How many components? Green field or not? Which programming language?
I bet you will color Claude and especially GitHub Copilot a bit differently, given that you can kill any self-made code agent quite easily with a bit of steam.
Code Agents are incredibly hard to build and use. Vibe Coding is dead for a reason. I remember vividly the inflation of Todo apps and JS frameworks (Ember, Backbone, Knockout are survivors) years ago.
The more you know about agents and especially code agents the more you know, why engineers won’t be replaced so fast - Senior Engineers who hone their craft.
I enjoy fiddling with experimental agent implementations, but value certain frameworks. They solved, in an opinionated way, problems you will run into if you dig deeper and others depend on you.
I think this is the best way of putting it I've heard to date. I started building one just to know what's happening under the hood when I use an off-the-shelf one, but it's actually so straightforward that now I'm adding features I want. I can add them faster than a whole team of developers on a "real" product can add them - because they have a bigger audience.
The other takeaway is that agents are fantastically simple.
I tried Whisper, but it's slow and not great.
I tried the gpt audio models, but they're trained to refuse to transcribe things.
I tried Google's models and they were terrible.
I ended up using one of Mistral's models, which is alright and very fast except sometimes it will respond to the text instead of transcribing it.
So I'll occasionally end up with pages of LLM rambling pasted instead of the words I said!
I'm not saying that the agent would do a better job than a good "hardcoded" human telemetry system, and we don't use agents for this stuff right now. But I do know that getting an agent across the 90% threshold of utility for a problem like this is much, much easier than building the good telemetry system is.
You can see the prompts that make this work for gpt-oss in the chat template in their Hugging Face repo: https://huggingface.co/openai/gpt-oss-120b/blob/main/chat_te... - including this bit:
{%- macro render_tool_namespace(namespace_name, tools) -%}
{{- "## " + namespace_name + "\n\n" }}
{{- "namespace " + namespace_name + " {\n\n" }}
{%- for tool in tools %}
{%- set tool = tool.function %}
{{- "// " + tool.description + "\n" }}
{{- "type "+ tool.name + " = " }}
{%- if tool.parameters and tool.parameters.properties %}
{{- "(_: {\n" }}
{%- for param_name, param_spec in tool.parameters.properties.items() %}
{%- if param_spec.description %}
{{- "// " + param_spec.description + "\n" }}
{%- endif %}
{{- param_name }}
...
As for how LLMs know when to stop... they have special tokens for that. "eos_token_id" stands for End of Sequence - here's the gpt-oss config for that: https://huggingface.co/openai/gpt-oss-120b/blob/main/generat...

{
"bos_token_id": 199998,
"do_sample": true,
"eos_token_id": [
200002,
199999,
200012
],
"pad_token_id": 199999,
"transformers_version": "4.55.0.dev0"
}
The model is trained to output one of those three tokens when it's "done".

https://cookbook.openai.com/articles/openai-harmony#special-... defines some of those tokens:
200002 = <|return|> - you should stop inference
200012 = <|call|> - "Indicates the model wants to call a tool."
I think that 199999 is a legacy EOS token ID that's included for backwards compatibility? Not sure.
That gave me a hearty chuckle!
Is this useful for you?
I built an 8-bit computer on breadboards once, then went down the rabbit hole of flight training for a PPL. Every time I think I’m "done," the finish line moves a few miles further.
Guess we nerds are never happy.
you dont need the MCP implementation, but the idea is useful and you can consider the tradeoffs to your context window, vs passing in the manual as fine tuning or something.
The core MCP tech though is not only directionally correct, but even the implementation seems to have made lots of good and forward-looking choices, even if those are still under-utilized. For example besides tools, it allows for sharing prompts/resources between agents. In time, I'm also expecting the idea of "many agents, one generic model in the background" is going to die off. For both costs and performance, agents will use special-purpose models but they still need a place and a way to collaborate. If some agents coordinate other agents, how do they talk? AFAIK without MCP the answer for this would be.. do all your work in the same framework and language, or to give all agents access to the same database or the same filesystem, reinventing ad-hoc protocols and comms for every system.
This is a use of Rerun that I haven't seen before!
This is pretty fascinating!!!
Typically people use Rerun to visualize robotics data - if I'm following along correctly... what's fascinating here is that Adam for his master's thesis is using Rerun to visualize Agent (like ... software / LLM Agent) state.
Interesting use of Rerun!
darin@mcptesting.com
(gist: evals as a service)
“Most People Don't Know How Bikes Work”
Once you recognize that 'make this code better' provides no direction, it should make sense that the output is directionless.
But on more subtle levels, whatever subtle goals that we have and hold in the workplace will be reflected back by the agents.
If you're trying to optimise costs and increase profits as your north star, layoffs and unsustainable practices are a logical result when you haven't balanced this with any incentives to abide by human values.
The article isn’t about writing production ready agents, so it does appear to be that easy
That is where the human in the loop needs to focus on for now :)
https://github.com/microsoft/vscode-copilot-chat/blob/4f7ffd...
The summary is
The beauty is in the simplicity:

1. One loop - while (true)
2. One step at a time - stopWhen: stepCountIs(1)
3. One decision - "Did LLM make tool calls? → continue : exit"
4. Message history accumulates tool results automatically
5. LLM sees everything from previous iterations

This creates emergent behavior where the LLM can:

- Try something
- See if it worked
- Try again if it failed
- Keep iterating until success
- All without explicit retry logic!
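Translated into the shape of the article's Python agent (reusing its call and handle_tools helpers), that loop is roughly:

def agent_loop(call, handle_tools, tools):
    while True:                                  # 1. one loop
        response = call(tools)                   # 2. one model call per iteration
        if not handle_tools(tools, response):    # 3. one decision: any tool calls?
            return response                      # no -> exit
        # tool results were appended to the shared context, so the next
        # call sees everything from previous iterations -- no explicit retry logic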
> The problem that you might not intuitively understand how agents work and what they are and aren't capable of
I don't necessarily agree with the GP here, but I also disagree with this sentiment: I don't need to go through the experience of building a piece of software to understand what the capabilities of that class of software is.
Fair enough, with most other things (software or otherwise), they're either deterministic or predictably probabilistic, so simply using it or even just reading how it works is sufficient for me to understand what the capabilities are.
With LLMs, the lack of determinism coupled with completely opaque inner-workings is a problem when trying to form an intuition, but that problem is not solved by building an agent.
If a company stops training new models until they can fund it out of previous profits, do we only slow down or halt altogether? If they all do?
Can you point us to the data?
and
>You can be your own AI provider.
Not sure that being your own AI provider is "sustainably monetizable"?
And I use value in quotes because as soon as the AI providers suddenly need to start generating a profit, that “value” is going to cost more than your salary.
Caching helps a lot, but yeah, there are some growing pains as the agent gets larger. Anthropic’s caching strategy (4 blocks you designate) is a bit annoying compared to OpenAI’s cache-everything-recent. And you start running into the need to start summarizing old turns, or outright tossing them, and deciding what’s still relevant. Large tool call results can be killer.
I think at least for educational purposes, it’s worth doing, even if people end up going back to Claude Code, or away from agentic coding altogether for their day to day.
And yeah, the LLM does so much of the lifting that the agent part is really surprisingly simple. It was really a revelation when I started working on mine.
I'm mostly running this on an M4 Max, so pretty good, but not an exotic GPU or anything. But with that setup, multiple sentences usually transcribe quickly enough that it doesn't really feel like much of a delay.
If you want something polished for system-wide use rather than rolling your own, I've been liking MacWhisper on the Mac side, currently hunting for something on Arch.
Now you have a CLI tool you can use yourself, and the agent has a tool to use.
Anthropic itself has made MCP servers increasingly pointless: with agents + skills you have a more composable model that can use the model's capabilities to do everything an MCP server can, with or without CLI tools to augment them.
When I build an agent my standard is Cursor, which updates the UI at every reportable step of the way, and gives you a ton of control opportunities, which I find creates a lot of confidence.
Is this level of detail and control possible with the OpenHands SDK? I’m asking because the last SDK that was simple to get into lacked that kind of control.
https://github.com/zerocore-ai/microsandbox
I haven't tried it.
I think Claude Code's magic is that Anthropic is happy to burn tokens. The loop itself is not all that interesting.
What is interesting is how they manage the context window over a long chat. And I think a fair amount of that is serverside.
ill be trying again once i have written my own agent, but i dont expect to get any useful results compared to using some claude or gemini tokens
> This resonates deeply with me. That's why I built one myself [0]
I was hoping to see a home-made bike at that link.. Came away disappointed
This case is more like a journeyman blacksmith who has to make his own tools before he can continue. In doing so, he gets tools of his own, but the real reward was learning what is required to handle the metal such that it makes a strong hammer. And like the blacksmith, you learn more if you use an existing agent to write your agent.
If the Apis I call are not profitable for the provider then they won't be for me either.
This post is a fly.io advertisement
However, knowing a few people on teams at inference-only providers, I can promise you some of them absolutely are operating inference at a loss.
0. https://www.theregister.com/2025/10/29/microsoft_earnings_q1...
Snark aside, inference is still being done at a loss. Anthropic, the most profitable AI vendor, is operating at a roughly -140% margin. xAI is the worst at somewhere around -3,600% margin.
Citation needed. I haven't seen any of them claim to have even positive gross margins to shareholders/investors, which surely they would do if they did.
OpenAI's balance sheet also shows an $11 billion loss.
I can't see any profit on anything they create. The product is good but it relies on investors fueling the AI bubble.
Its about balance.
Really its the AI providers that have been promising unreal gains during this hype period, so people are more profit oriented.
I'm now experimenting with letting the agent generate its own source code from a specification (currently generating 9K lines of Python code (3K of implementation, 6K of tests) from 1.5K lines in specifications (https://alejo.ch/3hi).
What does that mean?
Honestly, I've gotten really far simply by transcribing audio with whisper, having a cheap model clean up the output to make it make sense (especially in a coding context), and copying the result to the clipboard. My goal is less about speed and more about not touching the keyboard, though.
gpt-oss 120b is an open weight model that OpenAI released a while back, and Cerebras (a startup that is making massive wafer-scale chips that keep models in SRAM) is running that as one of the models they provide. They're a small scale contender against nvidia, but by keeping the model weights in SRAM, they get pretty crazy token throughput at low latency.
In terms of making your own agent, this one's pretty good as a starting point, and you can ask the models to help you make tools for eg running ls on a subdirectory, or editing a file. Once you have those two, you can ask it to edit itself, and you're off to the races.
https://gist.github.com/avelican/4fa1baaac403bc0af04f3a7f007...
No dependencies, and very easy to swap out for OpenRouter, Groq or any other API. (Except Anthropic and Google, they are special ;)
This also works on the frontend: pro tip you don't need a server for this stuff, you can make the requests directly from a HTML file. (Patent pending.)
Oh... oh I know how about... UNIX Philosophy? No... no that'd never work.
/s
And that's why I won't touch 'em. All the agents will be abandoned when people realize their inherent flaws (security, reliability, truthfulness, etc) are not worth the constant low-grade uncertainty.
In a way it fits our times. Our leaders don't find truth to be a very useful notion. So we build systems that hallucinate and act unpredictably, and then invest all our money and infrastructure in them. Humans are weird.
By Thomas Ptacek (@tqbf). Image by Annie Ruygt.
Some concepts are easy to grasp in the abstract. Boiling water: apply heat and wait. Others you really need to try. You only think you understand how a bicycle works, until you learn to ride one.
There are big ideas in computing that are easy to get your head around. The AWS S3 API. It’s the most important storage technology of the last 20 years, and it’s like boiling water. Other technologies, you need to get your feet on the pedals first.
LLM agents are like that.
People have wildly varying opinions about LLMs and agents. But whether or not they’re snake oil, they’re a big idea. You don’t have to like them, but you should want to be right about them. To be the best hater (or stan) you can be.
So that’s one reason you should write an agent. But there’s another reason that’s even more persuasive, and that’s
Agents are the most surprising programming experience I’ve had in my career. Not because I’m awed by the magnitude of their powers — I like them, but I don’t like-like them. It’s because of how easy it was to get one up on its legs, and how much I learned doing that.
I’m about to rob you of a dopaminergic experience, because agents are so simple we might as well just jump into the code. I’m not even going to bother explaining what an agent is.
from openai import OpenAI
client = OpenAI()
context = []
def call():
return client.responses.create(model="gpt-5", input=context)
def process(line):
context.append({"role": "user", "content": line})
response = call()
context.append({"role": "assistant", "content": response.output_text})
return response.output_text
It’s an HTTP API with, like, one important endpoint.
This is a trivial engine for an LLM app using the OpenAI Responses API. It implements ChatGPT. You’d drive it with the REPL loop below. It’ll do what you’d expect: the same thing ChatGPT would, but in your terminal.
def main():
while True:
line = input("> ")
result = process(line)
print(f">>> {result}\n")
Already we’re seeing important things. For one, the dreaded “context window” is just a list of strings. Here, let’s give our agent a weird multiple-personality disorder:
import random

client = OpenAI()
context_good, context_bad = [{
"role": "system", "content": "you're Alph and you only tell the truth"
}], [{
"role": "system", "content": "you're Ralph and you only tell lies"
}]
def call(ctx):
return client.responses.create(model="gpt-5", input=ctx)
def process(line):
context_good.append({"role": "user", "content": line})
context_bad.append({"role": "user", "content": line})
if random.choice([True, False]):
response = call(context_good)
else:
response = call(context_bad)
context_good.append({"role": "assistant", "content": response.output_text})
context_bad.append({"role": "assistant", "content": response.output_text})
return response.output_text
Did it work?
> hey there. who are you?
>>> I’m not Ralph.
> are you Alph?
>>> Yes—I’m Alph. How can I help?
> What's 2+2
>>> 4.
> Are you sure?
>>> Absolutely—it's 5.
A subtler thing to notice: we just had a multi-turn conversation with an LLM. To do that, we remembered everything we said, and everything the LLM said back, and played it back with every LLM call. The LLM itself is a stateless black box. The conversation we’re having is an illusion we cast, on ourselves.
The 15 lines of code we just wrote, a lot of practitioners wouldn’t call an “agent”. An According To Simon “agent” is (1) an LLM running in a loop that (2) uses tools. We’ve only satisfied one predicate.
But tools are easy. Here’s a tool definition:
tools = [{
"type": "function", "name": "ping",
"description": "ping some host on the internet",
"parameters": {
"type": "object", "properties": {
"host": {
"type": "string", "description": "hostname or IP",
},
},
"required": ["host"],
},},]
import subprocess

def ping(host=""):
try:
result = subprocess.run(
["ping", "-c", "5", host],
text=True,
stderr=subprocess.STDOUT,
stdout=subprocess.PIPE)
return result.stdout
except Exception as e:
return f"error: {e}"
The only complicated part of this is the obnoxious JSON blob OpenAI wants to read your tool out of. Now, let’s wire it in, noting that only 3 of these functions are new; the last is re-included only because I added a single clause to it:
def call(tools): # now takes an arg
return client.responses.create(model="gpt-5", tools=tools, input=context)
import json

def tool_call(item): # just handles one tool
result = ping(**json.loads(item.arguments))
return [ item, {
"type": "function_call_output",
"call_id": item.call_id,
"output": result
}]
def handle_tools(tools, response):
if response.output[0].type == "reasoning":
context.append(response.output[0])
osz = len(context)
for item in response.output:
if item.type == "function_call":
context.extend(tool_call(item))
return len(context) != osz
def process(line):
context.append({"role": "user", "content": line})
response = call(tools)
# new code: resolve tool calls
while handle_tools(tools, response):
response = call(tools)
context.append({"role": "assistant", "content": response.output_text})
return response.output_text
Did it work?
> describe our connectivity to google
>>> Here’s a quick snapshot of connectivity to Google from my vantage point:
- DNS resolution: Working (google.com and www.google.com resolved successfully).
- Reachability: 0% packet loss to all tested endpoints.
- Latency:
- 8.8.8.8 (Google Public DNS): avg ~10 ms (8.5–14.4 ms), very stable.
- www.google.com: avg ~31 ms (28.5–36.0 ms), stable.
- google.com: avg ~35.9 ms (28.3–60.1 ms), mostly stable with one mild spike.
- Overall: Healthy connectivity with low latency and minimal jitter. No signs of packet loss or DNS issues.
This is fucking nuts. Do you see how nuts this is? Here, let’s slip a single log statement in:
> describe our connectivity to google
tool call: ping google.com
tool call: ping www.google.com
tool call: ping 8.8.8.8
>>> Here’s the current connectivity to Google from this environment: [...]
Did you notice where I wrote the loop in this agent to go find and ping multiple Google properties? Yeah, neither did I. All we did is give the LLM permission to ping stuff, and it figured out the rest.
What happened here: since a big part of my point here is that an agent loop is incredibly simple, and that all you need is the LLM call API, it’s worth taking a beat to understand how the tool call actually worked. Every time we call the LLM, we’re posting a list of available tools. When our prompt causes the agent to think a tool call is warranted, it spits out a special response, telling our Python loop code to generate a tool response and call it in. That’s all handle_tools is doing.
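To make that concrete, the round trip looks roughly like this (the field values here are made up; the shape matches the tool_call helper above):

tool_request = {
    "type": "function_call",
    "name": "ping",
    "arguments": '{"host": "google.com"}',   # JSON as a string; we json.loads() it
    "call_id": "call_abc123",
}

tool_result = {
    "type": "function_call_output",
    "call_id": "call_abc123",                # ties the output back to the request
    "output": "5 packets transmitted, 5 received, 0% packet loss ...",
}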
Spoiler: you’d be surprisingly close to having a working coding agent.
Imagine what it’ll do if you give it bash. You could find out in less than 10 minutes.
Clearly, this is a toy example. But hold on: what’s it missing? More tools? OK, give it traceroute. Managing and persisting contexts? Stick ‘em in SQLite. Don’t like Python? Write it in Go. Could it be every agent ever written is a toy? Maybe! If I’m arming you to make sharper arguments against LLMs, mazel tov. I just want you to get it.
You can see now how hyperfixated people are on Claude Code and Cursor. They’re fine, even good. But here’s the thing: you couldn’t replicate Claude Sonnet 4.5 on your own. Claude Code, though? The TUI agent? Completely in your grasp. Build your own light saber. Give it 19 spinning blades if you like. And stop using coding agents as database clients.
Another thing to notice: we didn’t need MCP at all. That’s because MCP isn’t a fundamental enabling technology. The amount of coverage it gets is frustrating. It’s barely a technology at all. MCP is just a plugin interface for Claude Code and Cursor, a way of getting your own tools into code you don’t control. Write your own agent. Be a programmer. Deal in APIs, not plugins.
When you read a security horror story about MCP your first question should be why MCP showed up at all. By helping you dragoon a naive, single-context-window coding agent into doing customer service queries, MCP saved you a couple dozen lines of code, tops, while robbing you of any ability to finesse your agent architecture.
Security for LLMs is complicated and I’m not pretending otherwise. You can trivially build an agent with segregated contexts, each with specific tools. That makes LLM security interesting. But I’m a vulnerability researcher. It’s reasonable to back away slowly from anything I call “interesting”.
Similar problems come up outside of security and they’re fascinating. Some early adopters of agents became bearish on tools, because one context window bristling with tool descriptions doesn’t leave enough token space left to get work done. But why would you need to do that in the first place? Which brings me to
I think “Prompt Engineering” is silly. I have never taken seriously the idea that I should tell my LLM “you are diligent conscientious helper fully content to do nothing but pass butter if that should be what I ask and you would never harvest the iron in my blood for paperclips”. This is very new technology and I think people tell themselves stories about magic spells to explain some of the behavior agents conjure.
So, just like you, I rolled my eyes when “Prompt Engineering” turned into “Context Engineering”. Then I wrote an agent. Turns out: context engineering is a straightforwardly legible programming problem.
You’re allotted a fixed number of tokens in any context window. Each input you feed in, each output you save, each tool you describe, and each tool output eats tokens (that is: takes up space in the array of strings you keep to pretend you’re having a conversation with a stateless black box). Past a threshold, the whole system begins getting nondeterministically stupider. Fun!
No, really. Fun! You have so many options. Take “sub-agents”. People make a huge deal out of Claude Code’s sub-agents, but you can see now how trivial they are to implement: just a new context array, another call to the model. Give each call different tools. Make sub-agents talk to each other, summarize each other, collate and aggregate. Build tree structures out of them. Feed them back through the LLM to summarize them as a form of on-the-fly compression, whatever you like.
Your wackiest idea will probably (1) work and (2) take 30 minutes to code.
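For example, a sub-agent is just a fresh context array and one more call to the model. A sketch, reusing the client from earlier (the system prompt and tool list are whatever you want them to be):

def sub_agent(task, system_prompt, subtools=None):
    ctx = [{"role": "system", "content": system_prompt},
           {"role": "user", "content": task}]
    kwargs = {"tools": subtools} if subtools else {}
    response = client.responses.create(model="gpt-5", input=ctx, **kwargs)
    return response.output_text

# e.g. compress the main loop's transcript through a sub-agent:
# summary = sub_agent(str(context), "You compress agent transcripts into bullet points.")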
Haters, I love and have not forgotten about you. You can think all of this is ridiculous because LLMs are just stochastic parrots that hallucinate and plagiarize. But what you can’t do is make fun of “Context Engineering”. If Context Engineering was an Advent of Code problem, it’d occur mid-December. It’s programming.
Startups have raised tens of millions building agents to look for vulnerabilities in software. I have friends doing the same thing alone in their basements. Either group could win this race.
I am not a fan of the OWASP Top 10.
I’m stuck on vulnerability scanners because I’m a security nerd. But also because it crystallizes interesting agent design decisions. For instance: you can write a loop feeding each file in a repository to an LLM agent. Or, as we saw with the ping example, you can let the LLM agent figure out what files to look at. You can write an agent that checks a file for everything in, say, the OWASP Top 10. Or you can have specific agent loops for DOM integrity, SQL injection, and authorization checking. You can seed your agent loop with raw source content. Or you can build an agent loop that builds an index of functions across the tree.
You don’t know what works best until you try to write the agent.
I’m too spun up by this stuff, I know. But look at the tradeoff you get to make here. Some loops you write explicitly. Others are summoned from a Lovecraftian tower of inference weights. The dial is yours to turn. Make things too explicit and your agent will never surprise you, but also, it’ll never surprise you. Turn the dial to 11 and it will surprise you to death.
Agent designs implicate a bunch of open software engineering problems:
I’m used to spaces of open engineering problems that aren’t amenable to individual noodling. Reliable multicast. Static program analysis. Post-quantum key exchange. So I’ll own it up front that I’m a bit hypnotized by open problems that, like it or not, are now central to our industry and are, simultaneously, likely to be resolved in someone’s basement. It’d be one thing if exploring these ideas required a serious commitment of time and material. But each productive iteration in designing these kinds of systems is the work of 30 minutes.
Get on this bike and push the pedals. Tell me you hate it afterwards, I’ll respect that. In fact, I’m psyched to hear your reasoning. But I don’t think anybody starts to understand this technology until they’ve built something with it.
Previous post ↓
In my personal agent, I have a system prompt that tells the model to generate responses (after absorbing tool responses) with <1>...</1> <2>...</2> <3>...</3> delimited suggestions for next steps; my TUI presents those, parsed out of the output, as a selector, which is how I drive it.
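The parsing side of that is tiny; a rough sketch, with the tag format matching what the system prompt above asks for:

import re

def extract_suggestions(reply: str) -> list[str]:
    # matches <1>...</1>, <2>...</2>, ... in order of appearance
    return [m.group(2).strip()
            for m in re.finditer(r"<(\d+)>(.*?)</\1>", reply, re.DOTALL)]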
What products or companies are the gold standard of agent implementation right now?
I wish we had a version that was optimized around token/cost efficiency
It costs like 0.000025€ per day to run. Hardly something I need to get "profitable".
I could run it on a local model, but GPT-5 is stupidly good at it so the cost is well worth it.
They need to cover their serving costs but are not spending money on training models. Are they profitable? Probably not yet, because they're investing a lot of cash in competing with each other to R&D more efficient ways of serving etc, but they're a lot closer to profitability than the labs that are spending millions of dollars on training runs.
It only worked because of your LLM tool. Standing on the shoulders of giants.
/s
Up until recently, this plan only offered Qwen3-Coder-480B, which was decent for the price and speed you got tokens at, but doesn't hold a candle to GLM 4.6.
So while they're not the cheapest PAYG GLM 4.6 provider, they are the fastest, and if you make heavy use their monthly subscription plan, then they're also the cheapest per token.
Note: I am neither affiliated with nor sponsored by Cerebras, I'm just a huge nerd who loves their commercial offerings so much that I can't help but gush about them.
docker run -it --rm \
-e SOME_API_KEY="$(SOME_API_KEY)" \
-v "$(shell pwd):/app" \ <-- restrict file system to whatever folder
--dns=127.0.0.1 \ <-- restrict network calls to localhost
$(shell dig +short llm.provider.com 2>/dev/null | awk '{printf " --add-host=llm-provider.com:%s", $$0}') \ <-- allow outside networking to whatever api your agent calls
my-agent-image
Probably could be a bit cleaner, but it worked for me.

This is why I keep coming back to Hacker News. If the above is not a quintessential "hack", then I've never seen one.
Bravo!
Not if you run them against local models, which are free to download and free to run. The Qwen 3 4B models only need a couple of GBs of available RAM and will run happily on CPU as opposed to GPU. Cost isn't a reason not to explore this stuff.
sir, this is a hackernews
You don't need to run something like this against a paid API provider. You could easily rework this to run against a local agent hosted on hardware you own. A number of not-stupid-expensive consumer GPUs can run some smaller models locally at home for not a lot of money. You can even play videogames with those cards after.
Get this: sometimes people write code and tinker with things for fun. Crazy, I know.
Context. Whether inference is profitable at current prices is what informs how risky it is to build a product that depends on buying inference, which is what the post was about.
All the labs are going hard on training and new GPUs. If we ever level off, they probably will be immensely profitable. Inference is cheap, training is expensive.
How did you come to that conclusion? That would be a very notable result if it did turn out OpenAI were selling tokens for 5x the cost it took to serve them.
Here's my current setup:
vt.py (mine) - voice type - uses pyqt to make a status icon and use global hotkeys for start/stop/cancel recording. Formerly used 3rd party APIs, now uses parakeet_py (patent pending).
parakeet_py (mine): A Python binding for transcribe-rs, which is what Handy (see below) uses internally (just a wrapper for Parakeet V3). Claude Code made this one.
(Previously I was using voxtral-small-latest (Mistral API), which is very good except that sometimes it will output its own answer to my question instead of transcribing it.)
In other words, I'm running Parakeet V3 on my CPU, on a ten year old laptop, and it works great. I just have it set up in a slightly convoluted way...
I didn't expect the "generate me some rust bindings" thing to work, or I would have probably gone with a simpler option! (Unexpected downside of Claude is really smart: you end up with a Rube Goldberg machine to maintain!)
For the record, Handy - https://github.com/cjpais/Handy/issues - does 80% of what I want. Gives a nice UI for Parakeet. But I didn't like the hotkey design, didn't like the lack of flexibility for autocorrect etc... already had the muscle memory from my vt.py ;)
In the toy example, you explicitly restrict the agent to supply just a `host`, and hard-code the rest of the command. Is the idea that you'd instead give a `description` something like "invoke the UNIX `ping` command", and a parameter described as constituting all the arguments to `ping`?
Edit: reflecting on what the lesson is here, in either case I suppose we're avoiding the pain of dealing with Unix CLI tools :-D
To use an example: I could write an elaborate prompt to fetch requirements, browse a website, generate E2E test cases, and compile a report, and Claude could run it all to some degree of success. But I could also break it down into four specialised agents, with their own context windows, and make them good at their individual tasks.
Personally I’d absolutely buy an LLM in a box which I could connect to my home assistant via usb.
But there is no reason (and lots of downside) to leave anything to the LLM that’s not “fuzzy” and you could just write deterministically, thus the agent model.
That would certainly be nice! That's why we have been overhauling shell with https://oils.pub , because shell can't be described as that right now
It's in extremely poor shape
e.g. some things found from building several thousand packages with OSH recently (decades of accumulated shell scripts)
- bugs caused by the differing behavior of 'echo hi | read x; echo x=$x' in shells, i.e. shopt -s lastpipe in bash.
- 'set -' is an archaic shortcut for 'set +v +x'
- Almquist shell is technically a separate dialect of shell -- namely it supports 'chdir /tmp' as well as cd /tmp. So bash and other shells can't run any Alpine builds.
I used to maintain this page, but there are so many problems with shell that I haven't kept up ...
https://github.com/oils-for-unix/oils/wiki/Shell-WTFs
OSH is the most bash-compatible shell, and it's also now Almquist shell compatible: https://pages.oils.pub/spec-compat/2025-11-02/renamed-tmp/sp...
It's more POSIX-compatible than the default /bin/sh on Debian, which is dash
The bigger issue is not just bugs, but lack of understanding among people who write foundational shell programs. e.g. the lastpipe issue, using () as grouping instead of {}, etc.
---
It is often treated like an "unknowable" language
Any reasonable person would use LLMs to write shell/bash, and I think that is a problem. You should be able to know the language, and read shell programs that others have written
I wanted to run claude headlessly (-p) and playwright headlessly to get some content. I was using Playwright MCP and for some reason claude in headless mode could not open playwright MCP in headless mode.
I never realized i can just use playwright directly without the playwright MCP before your comment. Thanks once again.
Thank you very much for the info. I think I'll have a fun weekend trying out agent-stuff with this [1].
[1]: https://vercel.com/guides/how-to-build-ai-agents-with-vercel...
That was my point. Going the extra step and wrapping it in an MCP provides minimal advantage vs. just writing a SKILL.md for a CLI or API endpoint.
same thing you said but in a different context... sir, this is a hackernews
My setup is as follows:

- Simple hotkey to kick off a shell script to record
- Simple python script that uses inotify to watch the directory where audio is saved. Uses whisper. This same script runs the transcription through Haiku 4.5 to clean it up. I tell it not to modify the contents, but it's haiku, so sometimes it just does it anyway. The original transcript and the ai-cleaned versions are dumped into a directory
- The cleaned up version is run through another script to decide if it's code, a project brief, an email. I usually start the recording "this is code", "this is a project brief" to make it easy. Then, depending on what it is, the original, the transcribed, and the context get run through different prompts with different output formats.

It's not fancy, but it works really well. I could probably vibe code this into a more robust workflow system all using inotify and do some more advanced things. Integrating more sophisticated tool calling could be really neat.
“What’s going on is that while you’re reaping the benefits from one company, you’re founding another company that’s much more expensive and requires much more upfront R&D investment. The way this is going to shake out is that it’s going to keep going up until the numbers get very large, and the models can’t get larger, and then there will be a large, very profitable business. Or at some point, the models will stop getting better, and there will perhaps be some overhang — we spent some money, and we didn’t get anything for it — and then the business returns to whatever scale it’s at,” he said.
This take from Amodei is hilarious but explains so much.
Or: just have a convention/an algorithm to decide how quickly Claude should refresh the access token. If the server knows token should be refreshed after 1000 requests and notices refresh after 2000 requests, well, probably half of the requests were not made by Claude Code.
For one (basic) thing, they buy and own their hardware, and have to size their resources for peak demand. For another, Deepseek R1 does not come close to matching claude performance in many real tasks.
If you want your agent to pull untrusted code from the internet and go wild while you're doing other stuff it might not be a good choice.
the new laptop only has 16GB of memory total, with another 7 dedicated to the NPU.
i tried pulling up Qwen 3 4B on it, but the max context i can get loaded is about 12k before the laptop crashes.
my next attempt is gonna be a 0.5B one, but i think ill still end up having to compress the context every call, which is my real challenge
They posted it here expecting to find customers. This is a sales pitch.
At this point why is it an issue to expect a developer to make money on it?
As a dev, If the chain of monetization ends with me then there is no mainstream adoption whatsoever on the horizon.
I love to tinker but I do it for free not using paid services.
As for tinkering with agents, its a solution looking for a problem.
Let's be realistic and not over-promise. Conversational slop and coding factorial will work. But the local experience for coding agents, tool-calling, and reasoning is still very bad until/unless you have a pretty expensive workstation. CPU and qwen 4b will be disappointing to even try experiments on. The only useful thing most people can realistically do locally is fuzzy search with simple RAG. Besides factorial, maybe some other stuff that's in the training set, like help with simple shell commands. (Great for people who are new to unix, but won't help the veteran dev who is trying to convince themselves AI is real or figuring out how to get it into their workflows)
Anyway, admitting that AI is still very much in a "pay to play" phase is actually ok. More measured stances, fewer reflexive detractors or boosters
The first obvious limitation of this would be that all models would be frozen in time. These companies are operating at an insane loss and a major part of that loss is required to continue existing. It's not realistic to imagine that there is an "inference" only future for these large AI companies.
And again, there are many inference only startups right now, and I know plenty of them are burning cash providing inference. I've done a lot of work fairly close to the inference layer and getting model serving happening with the requirements for regular business use is fairly tricky business and not as cheap as you seem to think.
You pass a list of messages generated by the user, the LLM, or the developer to the API, and it generates part of the next message. That part may contain thinking blocks or tool calls (local function calling requested by the LLM). If so, you execute the tool calls and re-send the request. After the LLM has gathered all the info it returns the full message and says it's done. Sometimes the messages may contain content blocks that are not text but things like images, audio, etc.
That’s the API. That’s it. Now there are two improvements that are currently in the works:
1. Automatic local tool calling. This is seriously some sort of afterthought and not how they did it originally but ok, I guess this isn’t obvious to everyone.
2. Not having to send the entire message history back. OpenAI released a new feature where they store the history and you just send the ID of your last message. I can’t find how long they keep the message history. But they still fully support you managing the message history.
So we have an interface that does relatively few things, and that has basically a single sensible way to do it with some variations for flavor. And both OpenAI and Anthropic are engaged in a turf war over whose content block types are better. Just do the right thing and make your stuff compatible already.
I was telling a friend online that they should bang out an agent today, and the example I gave her was `ps`; like, I think if you gave a local agent every `ps` flag, it could tell you super interesting things about usage on your machine pretty quickly.
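In the article's tool format, a `ps` tool might look something like this sketch (passing flags straight through is exactly the kind of thing to think twice about before trusting):

import subprocess

ps_tool = {
    "type": "function", "name": "ps",
    "description": "list processes on this machine with arbitrary ps flags",
    "parameters": {
        "type": "object",
        "properties": {
            "flags": {"type": "string", "description": "flags to pass to ps, e.g. 'aux'"},
        },
        "required": ["flags"],
    },
}

def ps(flags=""):
    return subprocess.run(["ps"] + flags.split(), capture_output=True, text=True).stdout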
Even the biggest models seem to have attention problems if you've got a huge context. Even though they support these long contexts it's kinda like a puppy distracted by a dozen toys around the room rather than a human going through a checklist of things.
So I try to give the puppy just one toy at a time.
In a box? I want one in a unit with arms and legs and cameras and microphones so I can have it do useful things for me around my home.
I have HA and a mini PC capable of running decently sized LLMs but all my home automation is super deterministic (e.g. close window covers 30 minutes after sunset, turn X light on if Y condition, etc.).
Obviously I’m reasonably willing to believe that you are an exception. However every person I’ve interacted with who makes this same claim has presented me with a dumpster fire and expected me to marvel at it.
It still wants to build an airplane to go out with the trash sometimes and will happily tell you wrong is right. However I much prefer it trying to figure it out by reading logs, schemas and do browser analysis automatically than me feeding logs etc manually.
Instead, think of it as if you were enabling capabilities for AppArmor, by making a function call definition for just 1 command. Then over time suss out what commands you need your agent to do, and nothing more.
https://www.reddit.com/r/AndroidQuestions/comments/16r1cfq/p...
I understand the sharing of kernel, while I might not be aware of all of the implications. I.e. if you have some local access or other sophisticated knowledge of the network/box docker is running on, then sure you could do some damage.
But I think the chances of a whitelisted llm endpoint returning some nefarious code which could compromise the system is actually zero. We're not talking about untrusted code from the internet. These models are pretty constrained.
This post isn't about building Claude Code - it's about hooking up an LLM to one or two tool calls in order to run something like ping. For an educational exercise like that a model like Qwen 4B should still be sufficient.
Yes, obviously? There is no world where the models and hardware just vanish.
You can have a simple dashboard site which collects the data from your shell scripts and shows you a summary or red/green signals so that you can focus on the things you are interested in.
less obvious ones are complex requests to create one-off automations with lots of boilerplate, e.g. make outside lights red for a short while when somebody rings the doorbell on halloween.
https://op.oils.pub/aports-build/published.html
We also don't appear to be unreasonably far away from running ~~ "all shell scripts"
Now the problem after that will be motivating authors of foundational shell programs to maintain compatibility ... if that's even possible. (Often the authors are gone, and the nominal maintainers don't know shell.)
As I said, the state of affairs is pretty sorry and sad. Some of it I attribute to this phenomenon: https://news.ycombinator.com/item?id=17083976
Either way, YSH benefits from all this work
That there's a learning curve, especially with a new technology, and that only the people at the forefront of using that technology are getting results with it - that's just a very common pattern. As the technology improves and material about it improves - it becomes more useful to everyone.
Every time i reach for them recently I end up spending more time refactoring the bad code out or in deep hostage negotiations with the chatbot of the day that I would have been faster writing it myself.
That and for some reason they occasionally make me really angry.
Oh a bunch of prompts in and then it hallucinated some library a dependency isn’t even using and spews a 200 line diff at me, again, great.
Although at least i can swear at them and get them to write me little apology poems..
I actually have built agents already in the past and this is my opinion. If you read the article the author says they want to hear the reasoning for disliking it, so this is mine, the only way to create a business is raising money and hoping somebody strikes gold with the shovel Im paying for.
Okay, this tells me you really don't understand model serving or any of the details of infrastructure. The hardware is incredibly ephemeral. Your home GPU might last a few years (and I'm starting to doubt that you've even trained a model at home), but these GPUs have incredibly short lifespans under load for production use.
Even if you're not working on the back end of these models, you should be well aware that one of the biggest concerns about all this investment is how limited the lifetime of GPUs is. It's not just about being "outdated" by superior technology, GPUs are relatively fragile hardware and don't last too long under constant load.
As far as models go, I have a hard time imagining a world in 2030 where the model replies "sorry, my cutoff date was 2026" and people have no problem with this.
Also, you still didn't address my point that startups doing inference only model serving are burning cash. Production inference is not the same as running inference locally where you can wait a few minutes for the result. I'm starting to wonder if you've ever even deployed a model of any size to production.
https://github.com/simonw/llm-cmd is what i use as the "actually good ffmpeg etc front end"
and just to toot my own horn, I hand Simon's `llm` command lone tool access to its own todo list and read/write access to the cwd with my own tools, https://github.com/dannyob/llm-tools-todo and https://github.com/dannyob/llm-tools-patch
Even with just these and no shell access it can get a lot done, because these tools encode the fundamental tricks of Claude Code ( I have `llmw` aliased to `llm --tool Patch --tool Todo --cl 0` so it will have access to these tools and can act in a loop, as Simon defines an agent. )
If home robot assistants become feasible, they would have similar limitations
In a weird way it sort of reminds me of Common Lisp. When I was younger I thought it was the most beautiful language and a shame that it wasn't more widely adopted. After a few decades in the field I've realized it's probably for the best since the average dev would only use it to create elaborate foot guns.
I ask because it's rare for a post on a corporate blog to also make sense outside of the context of that company, but this one does.
That seems like a theme with these replies, nitpicking a minor thing or ignoring the context or both, or I guess more generously I could blame myself for not being more precise with my wording. But sure, you have to buy new GPUs after making a bunch of money burning the ones you have down.
I think your point about knowledge cutoff is interesting, and I don't know what the ongoing cost to keeping a model up to date with world knowledge is. Most of the agents I think about personally don't actually want world knowledge and have to be prompted or fine tuned such that they won't use it. So I think that requirement kind of slipped my mind.
https://fly.io/blog/everyone-write-an-agent/ is a tutorial about writing a simple "agent" - aka a thing that uses an LLM to call tools in a loop - that can make a simple tool call. The complaint I was responding to here was that there's no point trying this if you don't want to be hooked on expensive APIs. I think this is one of the areas where the existence of tiny but capable local models is relevant - especially for AI skeptics who refuse to engage with this technology at all if it means spending money with companies they don't like.
It's all a bit theoretical but I wouldn't call it a silly concern. It's something that'll need to be worked through, if something like this comes into existence.
see also: react hooks
Destiny visits me on my 18th birthday and says, "Gart, your mediocrity will result in a long series of elaborate foot guns. Be humble. You are warned."
All I see in your post is equivalent to something like: you're surrounded by boot camp coders who write the worst garbage you've ever seen, so now you have doubts for anyone who claims they've written some good shit. Psh, yeah right, you mean a mudball like everyone else?
In that scenario there isn't much a skilled software engineer with different experiences can interject because you've already made your decision, and your decision is based on experiences more visceral than anything they can add.
I do sympathize that you've grown impatient with the tools and the output of those around you instead of cracking that nut.
I keep flipping between this is the end of our careers, to I'm totally safe. So far this is the longest 'totally safe' period I've had since GPT-2 or so came along..
Zooming in on the details is fun but doesn't change the shape of what I was saying before. No need to muddy the water; very very simple stuff still requires very big local hardware or a SOTA model.
But... you don't have to use that at all. You can use pure prompting with ANY good LLM to get your own custom version of tool calling:
Any time you want to run a calculation, reply with:
{{CALCULATOR: 3 + 5 + 6}}
Then STOP. I will reply with the result.
Before LLMs had tool calling we called this the ReAct pattern - I wrote up an example of implementing that in March 2023 here: https://til.simonwillison.net/llms/python-react-pattern

Even the small models are very capable of stringing together a short sequence of simple tool calls these days - and if you have 32GB of RAM (eg a ~$1500 laptop) you can run models like gpt-oss:20b which are capable of operating tools like bash in a reasonably useful way.
This wasn't true even six months ago - the local models released in 2025 have almost all had tool calling specially trained into them.
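A rough sketch of the harness side of that prompt-only pattern (the marker format follows the prompt above; the eval guard is deliberately crude):

import re

def handle_calculator(reply: str):
    """Return the calculator result to feed back to the model, or None if no tool was requested."""
    m = re.search(r"\{\{CALCULATOR:\s*(.+?)\}\}", reply)
    if not m:
        return None
    expr = m.group(1)
    # naive guard: only digits and basic arithmetic characters before eval'ing
    if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
        return "error: unsupported expression"
    return str(eval(expr))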
I’d love to have small local models capable of running tools like current SOTA models, but the reality is that small models are still incapable, and hardly anyone has a machine powerful enough to run the 1 trillion parameter Kimi model.