All your agents are going async

There is nothing wrong with the HTTP layer, it's just a way to get a string into the model.

The problem is the industry obsession on concatenating messages into a conversation stream. There is no reason to do it this way. Every time you run inference on the model, the client gets to compose the context in any way they want; there are more things than just concatenating prompts and LLM ouputs. (A drawback is caching won't help much if most of the context window is composed dynamically)

Coding CLIs as well as web chat works well because the agent can pull in information into the session at will (read a file, web search). The pain point is that if you're appending messages a stream, you're just slowly filling up the context.

The fix is to keep the message stream concept for informal communication with the prompter, but have an external, persistent message system that the agent can interact with (a bit like email). The agent can decide which messages they want to pull into the context, and which ones are no longer relevant.

The key is to give the agent not just the ability to pull things into context, but also remove from it. That gives you the eternal context needed for permanent, daemonized agents.

Here is an interesting find.

Let's say that you have two agents running concurrently: A & B. Agent A decides to push a message into the context of agent B. It does that and the message ends up somewhere in the list of the message right at the bottom of the conversation.

The question is, will agent B register that a new message was inserted and will it act on it?

If you do this experiment you will find out that this architecture does not work very well. New messages that are recent but not the latest have little effect for interactive session. In other words, Agent A will not respond and say, "and btw, this and that happened" unless perhaps instructed very rigidly or perhaps if there is some other instrumentation in place.

Your mileage may vary depending on the model.

A better architecture is pull-based. In other words, the agent has tools to query any pending messages. That way whatever needs to be communicated is immediately visible as those are right at the bottom of the context so agents can pay attention to them.

An agent in that case slightly more rigid in a sense that the loop needs to orchestrate and surface information and there is certainly not one-size-fits-all solution here.

I hope this helps. We've learned this the hard way.

> All of these features are about breaking the coupling between a human sitting at a terminal or chat window and interacting turn-by-turn with the agent.

This means:

- less and less "man-in-the-loop"

- less and less interaction between LLMs and humans

- more and more automation

- more and more decision-making autonomy for agents

- more and more risk (i.e., LLMs' responsibility)

- less and less human responsibility

Problem:

Tasks that require continuous iteration and shared decision-making with humans have two possible options:

- either they stall until human input

- or they decide autonomously at our risk

Unfortunately, automation comes at a cost: RISK.

So reinventing terminal multiplexing, except over proprietary chat/realtime transports instead of PTYs?

this is a commercial sales pitch for something that doesn't exist

Struggling with this at the moment too - the second you have a task that is a blend of CI style pipeline, LLM processing and openclaw handing that data back and forth, maintaining state and triggering next step gets tricky. They're essentially different paradigms of processing data and where they meet there are impedance mismatches.

Even if I can string it together it's pretty fragile.

That said I don't really want to solve this with a SaaS. Trying really hard to keep external reliance to a minimum (mostly the llm endpoint)

> The interesting thing is what agents can do while not being synchronously supervised by a human.

I vibe coded a message system where I still have all the chat windows open but my agents run a command that finished once a message meant for them comes along and then they need to start it back up again themselves. I kept it semi-automatic like that because I'm still experimenting whether this is what I want.

But they get plenty done without me this way.

The idea of the "session" is an interesting solution, I'll be looking forward to new developments from you on this.

I don't think it solves the other half of the problem that we've been working on, which is what happens if you were not the one initiating the work, and therefore can't "connect back into a session" since the session was triggered by the agent in the first place.

I recognize the problem statement and decomposition of it. But not the solution. Especially saying that he sees the same problem being worked on by N people. And now that makes in N+1? I’ve been more interested by the protocols and standard that could truly solve this for everyone in a cross-compatible way. Some people have dabbled with atproto as the transport and “memory” storage for example.

I was of the same view - but then there is this other trend which is putting sync back in favor. And that is that agents are becoming faster. If they are faster - it makes sense to stick around and maintain your 'context' about the task and supervise in real time. The other thing which might keep sync in fashion is that LLM providers are cutting back on cheap tokens. So you have a bigger incentive to stick around and make sure that your agent is not going astray.

The only place I use async now is when I am stepping away and there are a bunch of longer tasks on my plate. So i kick them off and then get to review them when ever I login next. However I dont use this pattern all that much and even then I am not sure if the context switching whenever I get back is really worth it.

Unless the agents get more reliable on long horizon tasks, it seems that async will have limited utility. But can easily see this going into videos feeding the twitter ai launch hype train.

How would you differentiate between other tools like Temporal or Kitaru (https://kitaru.ai/) ?

> ... and streaming the tokens back on the HTTP response as an SSE stream

> So how are folks solving this?

$5 per month dedicated server, SSH, tmux.

Can anybody explain why many times if you switch away from the chat app on the phone, the conversation can get broken?

Having long living requests, where you submit one, you get back a request_id, and then you can poll for it's status is a 20 year old solved problem.

Why is this such a difficult thing to do in practice for chat apps? Do we need ASI to solve this problem?

at https://agentblocks.ai we just use Google-style LROs for this, do we really need a "durable transport for AI agents built around the idea of a session"?

Even if I can string it together it's pretty fragile.

That said I don't really want to solve this with a SaaS. Trying really hard to keep external reliance to a minimum (mostly the llm endpoint)

> The interesting thing is what agents can do while not being synchronously supervised by a human.

But they get plenty done without me this way.

> ... and streaming the tokens back on the HTTP response as an SSE stream

> So how are folks solving this?

$5 per month dedicated server, SSH, tmux.

There is nothing wrong with the HTTP layer, it's just a way to get a string into the model.

The key is to give the agent not just the ability to pull things into context, but also remove from it. That gives you the eternal context needed for permanent, daemonized agents.

I've been working on a coding agent that does this on and of for about a year. Here's my latest attempt: https://github.com/vanviegen/maca#maca - This one allows agents to request (and later on drop) 'views' on functions and other logical pieces of code, and always get to see the latest version of it. (With some heuristics to not destroy kv-caches at every turn.)

The problem is that the models are not trained for this, nor for any other non-standard agentic approach. It's like fighting their 'instincts' at every step, and the results I've been getting were not great.

> "and which ones are no longer relevant."

This is absolutely the hardest bit.

I guess the short-cut is to include all the chat conversation history, and then if the history contains "do X" followed by "no actually do Y instead", then the LLM can figure that out. But isn't it fairly tricky for the agent harness to figure that out, to work out relevancy, and to work out what context to keep? Perhaps this is why the industry defaults to concatenating messages into a conversation stream?

> The key is to give the agent not just the ability to pull things into context, but also remove from it

Of course Anthropic/OpenAI can do it. And the next day everyone will be complaining how much Claude/Codex has been dumbed down. They don't even comply to the context anymore!

Yeah, opencode was/is like this and they never got caching right. Caching is a BIG DEAL to get right.

To be utterly shameless, this what I've been building: https://github.com/ASIXicle/persMEM

Three persistent Claude instances share AMQ with an additional Memory Index to query with an embedding model (that I'm literally upgrading to Voyage 4 nano as I type). It's working well so far, I have an instance Wren "alive" and functioning very well for 12 days going, swapping in-and-out of context from the MCP without relying on any of Anthropic's tools.

And it's on a cheap LXC, 8GB of RAM, N97.

Hmm.

Maybe there’s a way to play around with this idea in pi. I’ll dig into it.

Here is an interesting find.

The question is, will agent B register that a new message was inserted and will it act on it?

Your mileage may vary depending on the model.

An agent in that case slightly more rigid in a sense that the loop needs to orchestrate and surface information and there is certainly not one-size-fits-all solution here.

I hope this helps. We've learned this the hard way.

Yep, I didn't want to have to think about concurrency so my solution was a global lock file on my VM that gets checked by a pre-start hook in claude code. Each of my "agents" is a it's own linux user with their own CLAUDE.md, and there is a changelog file that gets injected into that each time they launch. They can update the changelog themselves, and one agent in particular runs more frequently to give updates to all of them. Most of it is just initiated by cron jobs. This doesn't scale infinitely, but if you stick to two-pizza teams per VM it will still be able to do a lot.

So hooks are your friends. I also use one as a pre flight status check so it doesn't waste time spinning forever when the API has issues.

> All of these features are about breaking the coupling between a human sitting at a terminal or chat window and interacting turn-by-turn with the agent.

This means:

- less and less "man-in-the-loop"

- less and less interaction between LLMs and humans

- more and more automation

- more and more decision-making autonomy for agents

- more and more risk (i.e., LLMs' responsibility)

- less and less human responsibility

Problem:

Tasks that require continuous iteration and shared decision-making with humans have two possible options:

- either they stall until human input

- or they decide autonomously at our risk

Unfortunately, automation comes at a cost: RISK.

AI driven cars have better risk profiles than humans.

Why do you think the same will not also be true for AI steerers/managers/CEO?

In a year of two, having a human in the loop, will all of their biases and inconsistencies will be considered risky and irresponsible.

this is a commercial sales pitch for something that doesn't exist

I don't think this is quite right. I do work for a pub/sub company that's involved in this space, but this article isn't a commercial sales pitch and we do have a product that exists.

The article is about how agents are getting more and more async features, because that's what makes them useful and interesting. And how the standard HTTP based SSE streaming of response tokens is hard to make work when agents are async.

The idea of the "session" is an interesting solution, I'll be looking forward to new developments from you on this.

With the approach based on pub/sub channels, this is possible to do if you know the name of the session (i.e. know the name of the channel).

Of course the hard bit then is; how does the client know there's new information from the agent, or a new session?

Generally we'd recommend having a separate kind of 'notification' or 'control' pub/sub channel that clients always subscribe to to be notified of new 'sessions'. Then they can subscribe to the new session based purely on knowing the session name.

How would you differentiate between other tools like Temporal or Kitaru (https://kitaru.ai/) ?

Yeah, opencode was/is like this and they never got caching right. Caching is a BIG DEAL to get right.

> The key is to give the agent not just the ability to pull things into context, but also remove from it

Of course Anthropic/OpenAI can do it. And the next day everyone will be complaining how much Claude/Codex has been dumbed down. They don't even comply to the context anymore!

Hmm.

Maybe there’s a way to play around with this idea in pi. I’ll dig into it.

To be utterly shameless, this what I've been building: https://github.com/ASIXicle/persMEM

And it's on a cheap LXC, 8GB of RAM, N97.

So hooks are your friends. I also use one as a pre flight status check so it doesn't waste time spinning forever when the API has issues.

With the approach based on pub/sub channels, this is possible to do if you know the name of the session (i.e. know the name of the channel).

Of course the hard bit then is; how does the client know there's new information from the agent, or a new session?

I don't know Kitaru too well, but I do know Temporal a bit.

The pattern I describe in the article of 'channels' works really well for one of the hardest bits of using a durable execution tool like Temporal. If your workflow step is long running, or async, it's often hard to 'signal' the result of the step out to some frontend client. But using channels or sessions like in the article it becomes super easy because you can write the result to the channel and it's sent in realtime to the subscribed client. No HTTP polling for results, or anything like that.

at https://agentblocks.ai we just use Google-style LROs for this, do we really need a "durable transport for AI agents built around the idea of a session"?

Assuming LROs are "Long running operations", then you kick off some work with an API request, and get some ID back. Then you poll some endpoint for that ID until the operation is "done". This can work, but when you try and build in token-streaming to this model, you end up having to thread every token through a database (which can work), and increasing the latency experienced by the user as you poll for more tokens/completion status.

Obviously polling works, it's used in lots of systems. But I guess I am arguing that we can do better than polling, both in terms of user experience, and the complexity of what you have to build to make it work.

If your long running operations just have a single simple output, then polling for them might be a great solution. But streaming LLM responses (by nature of being made up of lots of individual tokens) makes the polling design a bit more gross than it really needs to be. Which is where the idea of 'sessions' comes in.

> "and which ones are no longer relevant."

This is absolutely the hardest bit.

That's what the embedding model is for. It's like a tack-on LLM that works out the relevancy and context to grab.

AI driven cars have better risk profiles than humans.

Why do you think the same will not also be true for AI steerers/managers/CEO?

In a year of two, having a human in the loop, will all of their biases and inconsistencies will be considered risky and irresponsible.

"Did the vehicle just crash" has a short feedback loop, very amenable to RL. "Did this product strategy tank our earnings/reputation/compliance/etc" can have a much longer, harder to RL feedback loop.

But maybe not that much longer; METR task length improvement is still straight lines on log graphs.

Or autonomous weapons?

Getting to that point is likely going to involve a lot of (the business and personal equivalent of) Teslas electing to drive through white semitrailers.

> AI driven cars have better risk profiles than humans.

From which company? I hope you say "Waymo", because Tesla is lying through its teeth and hiding crash statistics from regulators.

I don't think this is quite right. I do work for a pub/sub company that's involved in this space, but this article isn't a commercial sales pitch and we do have a product that exists.

> but this article isn't a commercial sales pitch

Yes it is. But it's nice you've convinced yourself I guess.

What is this, if not a product pitch:

> Because we’re building on our existing realtime messaging platform, we’re approaching the same problem that Cloudflare and Anthropic are approaching, but we’ve already got a bi-directional, durable, realtime messaging transport, which already supports multi-device and multi-user. We’re building session state and conversation history onto that existing platform to solve both halves of the problem; durable transport and durable state.

Or autonomous weapons?

Getting to that point is likely going to involve a lot of (the business and personal equivalent of) Teslas electing to drive through white semitrailers.

> but this article isn't a commercial sales pitch

Yes it is. But it's nice you've convinced yourself I guess.

What is this, if not a product pitch:

> AI driven cars have better risk profiles than humans.

From which company? I hope you say "Waymo", because Tesla is lying through its teeth and hiding crash statistics from regulators.

I don't know Kitaru too well, but I do know Temporal a bit.

so to be clear, this should be used "instead of" rather then "on top of" durable execution engines?

That's what the embedding model is for. It's like a tack-on LLM that works out the relevancy and context to grab.

God knows why you think this is possible. If I don't even know what might be relevant to the conversation in several turns, there's no way an agent could either.

But maybe not that much longer; METR task length improvement is still straight lines on log graphs.

The AI has read all the business books, blogs and stories.

Unless your CEO is Steve Jobs, it's hard to imagine it being much worse than your average pointy haired boss.

so to be clear, this should be used "instead of" rather then "on top of" durable execution engines?

God knows why you think this is possible. If I don't even know what might be relevant to the conversation in several turns, there's no way an agent could either.

One of us is confusing prediction with retrieval. The embedding model doesn't predict what is going to be relevant in several turns, just on the turn at hand. Each turn gets a fresh semantic search against the full body of memory/agent comms. If the conversation or prompt changes the next query surfaces different context automatically.

As you build up a "body of work" it gets better at handling massive, disparate tasks in my admittedly short experience. Been running this for two weeks. Trying to improve it.

The AI has read all the business books, blogs and stories.

Unless your CEO is Steve Jobs, it's hard to imagine it being much worse than your average pointy haired boss.

> The AI has read all the business books, blogs and stories.

This seems like a liability as most business books, blogs, and stories are either marketing BS or gloss over luck and timing.

> Unless your CEO is Steve Jobs, it's hard to imagine it being much worse than your average pointy haired boss.

As someone using AI agents daily, this is actually incredible really easy to imagine. It's actually hard to imagine it NOT being horrible! Maybe that'll change though... if gains don't plateau.

But they are shit. Over the last 2 days I've got bored of the predictable cycle of it first getting excited about a new idea then back peddling once I shoot it to pieces.

They can't write and think critically at the same time. Then subsequent messages are tainted by their earlier nonsensical statements.

Opus 3.7 BTW, not some toy open source model.

As you build up a "body of work" it gets better at handling massive, disparate tasks in my admittedly short experience. Been running this for two weeks. Trying to improve it.

> The AI has read all the business books, blogs and stories.

This seems like a liability as most business books, blogs, and stories are either marketing BS or gloss over luck and timing.

> Unless your CEO is Steve Jobs, it's hard to imagine it being much worse than your average pointy haired boss.

As someone using AI agents daily, this is actually incredible really easy to imagine. It's actually hard to imagine it NOT being horrible! Maybe that'll change though... if gains don't plateau.

But they are shit. Over the last 2 days I've got bored of the predictable cycle of it first getting excited about a new idea then back peddling once I shoot it to pieces.

They can't write and think critically at the same time. Then subsequent messages are tainted by their earlier nonsensical statements.

Opus 3.7 BTW, not some toy open source model.

So reinventing terminal multiplexing, except over proprietary chat/realtime transports instead of PTYs?

Yeah, but one is free and the other one might make you a billionaire.

If you think about it, about 30% of the biggest businesses out there are based on this exact business idea. IRC - Slack, XMPP & co - the many proprietary messengers out there, etc.

Can anybody explain why many times if you switch away from the chat app on the phone, the conversation can get broken?

Having long living requests, where you submit one, you get back a request_id, and then you can poll for it's status is a 20 year old solved problem.

Why is this such a difficult thing to do in practice for chat apps? Do we need ASI to solve this problem?

Unless the agents get more reliable on long horizon tasks, it seems that async will have limited utility. But can easily see this going into videos feeding the twitter ai launch hype train.

I suspect the answer is that the AI chat-app is built so that the LLM response tokens are sent straight into the HTTP response as a SSE stream, without being stored (in their intermediate state) in a database. BUT the 'full' response _is_ stored in the database once the LLM stream is complete, just not the intermediate tokens.

If you look at the gifs of the Claude UI in this post[1], you can see how the HTTP response is broken on page refresh, but some time later the full response is available again because it's now being served 'in full' from the database.

[1]: https://zknill.io/posts/chatbots-worst-enemy-is-page-refresh...

Yeah, but one is free and the other one might make you a billionaire.

If you think about it, about 30% of the biggest businesses out there are based on this exact business idea. IRC - Slack, XMPP & co - the many proprietary messengers out there, etc.

[1]: https://zknill.io/posts/chatbots-worst-enemy-is-page-refresh...

Agents used to be a thing you talked to synchronously. Now they’re a thing that runs in the background while you work. When you make that change, the transport breaks.

For most of the time LLMs have been around, you use them by opening a chat-style window and typing a prompt. The LLM streams the response back token-by-token. It’s how ChatGPT, claude.ai, and Claude Code work. It’s also how the demos work for basically every AI SDK or AI Library. It’s easy to think that LLM chatbots are the ‘art of the possible’ for AI right now. But that’s not the case.

Standard design around single http request

Instead, all your agents are going async. Agents are getting crons, webhook support, whatsapp integrations, ‘remote control’ from your phone, scheduled tasks and routines. Agents are becoming something that runs in the background, working while you work, and reporting back results async. Agents are getting workflows in Temporal, Vercel WDK, Relay.app, etc. A human sitting at a terminal or webchat is just one mode now, and increasingly it’s not the interesting one. The interesting thing is what agents can do while not being synchronously supervised by a human.

The problem is that chatbots are primarily built on HTTP. An HTTP request with the prompt, and a SSE stream of LLM generated tokens back on the HTTP response. But this doesn’t work when the agent is running async. There’s no HTTP connection to stream the response back.

OpenClaw’s async step

OpenClaw took a big step towards async agents, by showing people that an agent could live in your WhatsApp chat. The agent could travel around with you, and could work on stuff in the background. OpenClaw showed that you didn’t have to be glued to your browser or terminal to get AI to do work for you.

Openclaw channels model

Anthropic’s direct response to the OpenClaw model is Channels, which is MCP based and allows you to push messages async from an external chat system into a Claude Code session. But they also have /loop and /schedule slash commands, as well as Routines, both allowing you to schedule and run agents in the background. Anthropic also has Remote Control, which lets you continue a Claude Code session from your phone or another browser.

ChatGPT has scheduled tasks which trigger agents async, that can reach out to you if needed.

Cursor has background agents that run in the background in the cloud.

All of these features are about breaking the coupling between a human sitting at a terminal or chat window and interacting turn-by-turn with the agent. They make interactions with agents continuous, remote, long-running, and async.

The transport mismatch

All these new async features share the same property; the lifetime of an agent’s work is decoupled from the lifetime of a single HTTP connection. In chatbot demo apps, the agent is only processing while the HTTP connection is open. The LLM is doing inference in response to an HTTP request and streaming the tokens back on the HTTP response as an SSE stream. I’ve said before that a chatbot’s worst enemy is page refresh, and this is entirely because of the transport mismatch. HTTP request-response can’t survive a page refresh, and it can’t serve async agents.

There are four scenarios that the old transport based on HTTP can’t handle cleanly:

Agent outlives the caller: A routine fires from a cron, or the agent takes a long time to do its work. Five minutes later the agent has a result, but no one’s listening anymore. Where do the results go? Right now, they go in a database and you have to poll for them with some session URL (which, y’know, sucks).
Agent wants to push unprompted: The agent finished a nightly backlog review and has three PRs for you to review. Or your async workflow hits a human-approval step and needs you to say yes before it can keep going. There’s no connection back to you. Right now, they email you or send a slack message.
Caller changes: You started a task at your desk, went to lunch, and want to check on it from your phone. Anthropic’s Remote Control handles this, but only by building custom backend session storage and management. It’s not a first-class feature of the HTTP transport.
Multiple humans in one session: You have a team of five people working on a task together, and an agent is helping you. The agent needs to be able to push updates to all five of you, and take input from any of you.

Part of the reason that folks found OpenClaw so awesome is that it handles all of these scenarios for you. OpenClaw’s model separates the lifetime of the agent’s work from the lifetime of the connection to the human. The agent can do work async, and then use WhatsApp, iMessage, Telegram, Discord or whatever async chat system to push the results back to you when it’s done.

This just isn’t possible with HTTP request-response.

So how are folks solving this?

Looking across the industry, there are a bunch of different solutions to this. Clearly there’s the OpenClaw model where all the interaction is through some external chat provider. This chat provider also provides the conversation history to the agent, so the agent can have context on the conversation even after restarts. But this is just an extension on the chat-based model. I don’t think it’s the most interesting solution.

Most folks are pulling more and more of the session state into a centralised and hosted environment. Anthropic is doing this with Routines and Remote Control. More of the session state, conversation history, and agent inference is running in the hosted Anthropic platform. They are consolidating more of the agent lifecycle and agent connection state into their own platform, rather than just being an LLM inference API.

Cloudflare are getting involved too with their own Agents platform built on their workers platform. The Cloudflare Sessions API provides the session and conversation storage for agents, accessible over HTTP. And to fix the async notification problem, Cloudflare has launched their Email for Agents product.

These solutions solve only one half

The problem actually splits into two halves. The first half is durable state. Where does the agent’s state live, how does the agent have access to that state on restart or when processing async tasks, and where does the agent store its output? The second half is durable transport. How do the bytes of the response get between the agent and the humans or other agents, how does the connection survive disconnect, device switch, fan-out, server-initiated push, etc?

The Anthropic and Cloudflare solutions are really focused on the first half of the problem. They are building durable state storage and management for agents. Their solution to getting the bytes of the response is still polling, or HTTP requests. Cloudflare does have websocket support, but it doesn’t survive disconnections for streaming LLM responses. Anthropic and Cloudflare’s solution is based on the idea that if they store all the data required, then clients can always HTTP GET that data later. It half works, but it’s not ‘art of the possible’.

Durable transport, durable state

Right now, the session and the transport are all wrapped up in a single HTTP request-response. Cloudflare and Anthropic’s hosted features go some way to making the session state durable, but the don’t fix the transport problem. You’re still stuck with HTTP gets, or polling, in order to find out something new has happened.

Looking at the OpenClaw model, where the conversation history is in the chat channel and the agent process and LLM provider are both separated from that, you can’t build the same design on Cloudflare or Anthropic. There’s no ’enterprise’ version of the OpenClaw channels model that you can run with your own infrastructure. There’s no durable transport and durable state solution.

I work for Ably and we are currently building a durable transport for AI agents built around the idea of a session. We’re building on top of our existing realtime messaging platform. It’s frustrating to see folks fighting the similar problems over and over again, all because they picked the wrong transport to start with: HTTP request-response.

Ably’s transport

A ‘session’ with an AI should be a thing that humans and agents can connect to, and disconnect from at any time. They should be able to come and go, the session should survive wifi issues, or your phone disconnecting when you go into a train tunnel without the human or the agent needing to care about it. This is what you can do with OpenClaw, and with modern async agentic applications. The conversation state should be accessible through the durable session, and humans and agents should be able to notify each other through the session. The session should be a first class primitive for building async agents.

Because we’re building on our existing realtime messaging platform, we’re approaching the same problem that Cloudflare and Anthropic are approaching, but we’ve already got a bi-directional, durable, realtime messaging transport, which already supports multi-device and multi-user. We’re building session state and conversation history onto that existing platform to solve both halves of the problem; durable transport and durable state.

Hacker Times