To me the browser is a translation layer. Working on the browser directly, while hard, enables big advantages in compatibility. The only thing I miss as of now, which is on the todo list, is OCR of the images in the browser into text output. But an API would need to do that anyway to work.
The main loss, in my view, of a pure API-based approach is: where do you get the data? We won't replicate human work without seeing it done. Humans work in the UI; that's it. Computer use, to me, is the promise of being able to replicate the end-to-end actions a human does. An API can do that in theory, but the data to do it is also near impossible to collect properly.
The problem is that not everything from the 'past' can be accessed via APIs. It would be a fun time - remember Prism [1] - I would just run that and get all the API calls in a nice format and then replay them over and over to do things in succession.
In the new world, we have access to OpenAPI.json and whatnot, but in the world where things were built in the days pre-OpenAPI and pre-specs and best practices...I am not so sure! (and a lot of world lives then)
Alas, this works for a good chunk of things but not everything. Which is why the other technology exists.
The only reason you wouldn’t choose an API is if it wasn’t viable.
In particular, the vision-based approach used in the evaluation has clear limitations with regard to efficiency due to its nature (small observation window, heterogeneous modality).
At Smooth we use a hybrid DOM/vision approach, and we index very strongly on small models. An interesting fact is that UIs are generally designed to minimize ambiguity and supply all and only the necessary context in as token-efficient a form as possible, and the UX is wired up to abstract the APIs into well-understood interface patterns, e.g. dropdowns or autocompletes. This makes navigation easier, and that's why small models can do it, which is another dimension that must be considered.
We typically recommend using APIs/MCP where available and well designed, but it's genuinely surprising how token-efficient agentic browser navigation can actually be
idk.. not really thought out too much, but has to be better
I embedded a Google Calendar widget on my Book a demo page, I don't know the API and Google doesn't expose/maintain one either.
What we are doing at Retriever AI is instead to reverse engineer the website's APIs on the fly and call them directly from within the webpage so that auth/session tokens propagate for free: https://www.rtrvr.ai/blog/ai-subroutines-zero-token-determin...
The landing page doesn't advertise it yet, but essentially, I give agents a small set of tools to explore apps' surfaces, and then an API over common macOS functions, especially those related to accessibility.
The agent explores the app, then writes a repeatable workflow for it. Then it can run that workflow through CLI: `invoke chrome pinTab`
Why accessibility? Well, it turns out that it's just a good DOM in general. It's structure for apps. Not all apps implement it perfectly, but enough do to make it wildly useful.
[1] https://getinvoke.com - note that the landing page is targeted towards creatives right now and doesn't talk about this use case yet
If one agent just explores the UI, maybe in a test environment, and outputs a somewhat-structured description of the various UI elements and their behavior, and then another agent is given that description, would that second agent perform better than an agent that both explores the UI and tries to accomplish the given task at the same time?
With an example UI I made up, the description (API-like interface definition) could be something like:
Get all reviews:
To get all the reviews you need to go to each page and click "show full review" for every review summary in that page.
Go to each page:
Start at page 1 (the default when in the Reviews tab). Continue by clicking the "next" button until the "next" button is no longer available (as you've reached the last page).
So the second agent can skip some thinking about how to navigate because it already has that skill. The first agent can explore the UI on its own, once, without worrying about messing up if there's a test environment. Or am I misunderstanding the article completely? Probably. But it's interesting nonetheless. Sorry if it makes no sense.
Hang on, that sounds like common corporate SaaS apps.
I can see the appeal in the pixel route given its universality, but wow, that seems ugly on efficiency.
Recently, I asked Claude to log into my local grocery store chain's website and add all of the items from my shopping list to a cart. It was hilariously slow, but it did get the job done.
Unless I missed it, the article doesn't explicitly mention speed in the copy, but the results do show a 17 minute (!!!) total time for the vision agent vs. roughly 8-20 seconds for the API approaches.
A big part of the challenge with vision is that to manipulate the DOM, you first have to be sure the entire (current) DOM is loaded. In my experience this ends up adding a lot of artificial waits for certain elements to exist on the page.
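With Playwright in Python, that usually looks like an explicit wait before every read; a minimal sketch (the URL and selector here are made up):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com/reviews")  # hypothetical page
    # Block until the rows we care about actually exist in the DOM;
    # without this, the agent may act on a half-rendered page.
    page.wait_for_selector("table tbody tr", state="visible", timeout=10_000)
    print(page.locator("table tbody tr").count(), "rows rendered")
```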
I think OpenAI designing their own phone is the next logical step. I hope they succeed which should bring major competition to Apple and Android.
> To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own.
This is a model problem, though. Because the model failed to understand it could scroll, you forced it to consume several times the tokens. Could you come up with an alternative here?
Do you know what the vision model was trained on? Because often people see “vision model” and think “human-level GUI navigator” when afaik the latter has yet to be built.
I don't think any new app should ever be specifically designed for AI to interact with them through computer use
Me: hmm, this title confuses and infuriates Rob.
[Clicks link]
Me: Sees same title, repeat feelings of confusion and infuriation
[Scrolls article down on my smartphone]
Me: Sees jpg with the same title, repeat feelings of confusion and infuriation.
[Closes tab]
[Continues living rest of my life]
I hope this feedback is well received and understood.
When you think of everything it takes for an AI to use what the article calls a "vision agent" then it seems as if using a purpose-made API ought to be MANY orders of magnitude faster.
Now the argument against this on [reddit](https://www.reddit.com/r/openclaw/comments/1s1dzxq/comment/o...)
"my experience is the opposite actually. UIA looks uniform on paper but WPF, WinForms, and Win32 all expose different control patterns and you end up writing per-toolkit handlers anyway. Qt only exposes anything if QAccessible was compiled in and the accessibility plugin is loaded at runtime, which on shipped binaries is basically never. Electron is just as opaque on Windows as on macOS because it's the same chromium underneath drawing into a canvas. the real split isn't OS vs OS, it's native toolkit vs everything else."
For better and worse, 5-10 MiB isn't uncommon for a web app.
Instead of trying to go "bottom up" and, effectively, do what a browser engine is doing in reverse, it seems easier to go "top down" like a human does and go off the visual representation.
Electron uses 10x more RAM than regular apps. But it's so convenient.
Python is 100x slower than C. It's in the top 3 of languages now.
Worse but more convenient always wins.
The problem is, all the tasks are essentially: a) things agents probably just can't do, and b) things that absolutely cannot afford to be hallucinated or otherwise fucked up. So far the tasks I've thought of:
- Taxes. It needs a lot of sensitive information to get W-2s. Since I have to look up a lot of this stuff in the physical world anyway, it's not like I can just let it run wild.
- Background check for a new job. It took me 3 hrs to fill out one of them (mostly because the website was THAT bad). Being myself, I already was making mistakes just forgetting things like move in dates from 10 years ago, and having to do a lot of searching in my email for random documents. No way I'm trusting an agent with this.
- Setting up an LLC. Nope nope nope. There's a lot of annoying work involved with this, but I'm not trusting an LLM to do this.
Anyway, I guess my point is that even if an LLM was good at using my computer (so far, it seems like it wouldn't be), the kind of things I'd want an agent for are things that an LLM can't be trusted with.
There are use cases where the vision agent is the more obvious, or only, choice though, e.g. proprietary/locked-down desktop apps that lack an automation layer.
Using CLI tools is much faster and more token-efficient. I developed ten apps in the last two months. One reached 10,000+ monthly active users.
I ask Codex to generate SVG line by line and backtrack edit, ask it to use Inkscape to generate icons, etc...
I developed all this on a $20 Codex sub.
I think this is very fertile ground - big labs need to use approaches that can work on multiple platforms and arbitrary workflows, and full-page vision is the lowest common denominator. Platform-specific approaches are a really exciting open space!
Generative AI wasn't a thing at the time, but I had to resort to a combination of OCR, simulated user input, and print capture to drive the application and export data.
Had the developers been aware of the Windows DRM APIs that block screen capture, or the fact that text is easily recoverable from PostScript files with minimal formatting, I don't know what I would have done.
The irony is that the process this replaced involved giving cheap offshore labor full read-only remote access to all data in the system, which was by any measure a far more serious security risk than otherwise authorized employees using tools running locally with no network access provided by established, trustworthy vendors to automate their work.
We built isagent.dev for exactly this reason: serve human content to humans, serve agent-optimized content to agents.
No, most vision models focus on a subset of an image at a time when doing image -> text.
Image -> image uses the whole image.
It's kind of fascinating that we never were willing to do these things for humans, but now that AI needs it... we are all in. A bit depressing in the sense that I think the main reason we're happy to do it for AI is that we perceive it will benefit us personally rather than some abstract future human.
If I think an LLM is good for something, I create a well-defined, very deterministic "middleware" for that purpose on top of OpenRouter.
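For example, a single-purpose wrapper like this sketch, assuming OpenRouter's OpenAI-compatible chat completions endpoint (the model slug and task are illustrative):

```python
import os
import requests

def classify_ticket(text: str) -> str:
    """Hypothetical single-purpose middleware: fixed model, fixed prompt,
    temperature 0, constrained output -- as deterministic as an LLM gets."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "anthropic/claude-3.5-haiku",  # assumed model slug
            "temperature": 0,
            "messages": [
                {"role": "system",
                 "content": "Reply with exactly one word: bug, billing, or other."},
                {"role": "user", "content": text},
            ],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()
```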
It breaks like 80% of the time for me, and it's incredibly slow. Having it use Playwright (bonus: can test in FF/Saf too) was a big improvement.
i tend to think of invoke as "an API over macOS apps" tho...
doesn't `invoke finder shareAndCopyLink` read very nicely? :P
https://accessibilityinsights.io/
https://learn.microsoft.com/en-us/windows/win32/winauto/insp...
https://github.com/FlaUI/FlaUInspect
and for WPF applications specifically,
invoke rather has overlap with Claude's and Codex' computer-use, except the steps are stored/scripted.
webmcp is bottom-up. computer-use & invoke are top-down
in the context of this blog post, the conclusion looks similar though!
"use the whole web like it's an API"
works much better than
"figure out similar or identical tasks from a clean slate every single time you do them"
Try the exorbitant expense and ballooning waste of generated electricity and usable water.
Not possible on Wayland; maybe with the X11 protocol?
or even one based on PDF like OSX: https://en.wikipedia.org/wiki/Quartz_2D
Thinking of Frigate NVR, which does motion > object detection > scene description,
where you build up to progressively slower and more expensive algorithms, i.e. there's motion > it's a person > here's what the person is doing.
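The same escalation pattern, sketched in Python (all detector functions are hypothetical placeholders):

```python
def analyze_frame(frame):
    """Hypothetical cascade: each stage is far more expensive than the
    last, and only runs if the previous, cheaper stage fires."""
    if not detect_motion(frame):       # cheap pixel-diff check
        return None
    label = detect_object(frame)       # mid-cost object detector
    if label != "person":
        return label
    return describe_scene(frame)       # expensive vision-language model
```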
I could imagine an AI future where agentic shopping companies who promise me the best deal are pitted against Walmart and Amazon, trying to algorithmically squeeze me for $2 more- just two bots playing a cat and mouse game to save me a few bucks.
For some reason a lot of tech ends up in these antagonistic monopolies: Apple wants to sell privacy-aware devices as a product feature, Google wants to give you mail and maps but sell your data. Despite any appearances, neither gives a shit about you, even if you benefit from the dynamic.
This is not going to happen, or if it does it will just be Android (like Samsung reskins/modifies it) and it will certainly use Google Play Services.
Isn't that what Apple is doing with its Foundation Models framework? [0] Developers can integrate Apple's on-device LLM, which includes things like tool calling. I don't write Apple-specific apps, so I'm not sure what can actually be done with it, but it looks promising, and it seems to be where Apple already thinks things are headed.
> I think OpenAI designing their own phone is the next logical step
ChatGPT is already integrated into Apple Intelligence for those that want to use that instead of Apple's model -- I don't see OpenAI trying to change lanes into phone making when they can focus on doing what they know while collecting a large check from Apple.
[0] https://developer.apple.com/documentation/foundationmodels
1. things you wouldn’t otherwise bother doing
2. things where it otherwise would get stuck iterating on hacky workarounds doomed to fail
“Reverse engineer this app/site so we can do $common_task in one click”, “by the way, I’m logged in to $developer_portal, so try @Browser Use if you’re stuck”, etc.
I just had Codex pull user flows out of a site I’m working on and organize them on a single page. It found 116. I went in and annotated where I wanted changes, and now it’s crunching away fixing them all. Then it’ll give me an updated contact sheet and I can do a second pass.
I’d never do this sort of quality pass manually and instead would’ve just fixed issues as they came up, but this just runs in the background and requires 15 minutes of my time for a lot of polish.
The benchmark is a more generally interesting part of the launch materials, so I figured it had its own separate home here.
One thing I am curious about is a hybrid approach where LLMs work in conjunction with vision models (and probes which can query/manipulate the DOM) to generate Playwright code which wraps browser access to the site in a local, programmable API. Then you'd have agents use that API to access the site rather than going through the vision agents for everything.
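A minimal sketch of what such a generated wrapper might look like, assuming one recorded Playwright flow per operation (the class, URL, and selectors are all made up):

```python
from playwright.sync_api import sync_playwright

class ReviewsAPI:
    """Hypothetical local API emitted by an exploration agent: each method
    replays a recorded Playwright flow, so task agents never see pixels."""

    def __init__(self, base_url: str):
        self._pw = sync_playwright().start()
        self._page = self._pw.chromium.launch().new_page()
        self._base_url = base_url

    def list_pending_reviews(self) -> list[str]:
        # Navigation discovered once during exploration, replayed verbatim.
        self._page.goto(f"{self._base_url}/reviews?status=pending")
        self._page.wait_for_selector(".review-row")  # hypothetical selector
        return self._page.locator(".review-row").all_inner_texts()

    def close(self):
        self._pw.stop()
```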
i so far haven't found any application that doesn't.
all you're able to get out, as far as i can tell, is the length of the entered password.
and now the fact that interfaces need to be accessible to agents, not just humans, ironically increases accessibility for humans in return
I imagine the AIs will get a lot better at intercepting things at an intermediate level - API calls under the hood, etc. Probably much better (and cheaper) vision abilities, and perhaps even deeper integration into the machine code itself. It's really hard to anticipate what an advanced model will be capable of 5 years from now.
i'm really not sure companies will allow their apps to be automated so easily, and the reason is api abuse (think of a saas where you can upload file attachments, for example); you'd either end up banned or throttled pretty fast, and in the end the company will decide "cost > opportunity" and just close it off (and it's like this already, llms just make it worse)
Well, I am competing with geoip providers like MaxMind.
I developed a custom traceroute and ping service to geolocate IPs with very high accuracy, beating products like Digital Element, MaxMind, and ipinfo.
These companies have huge teams, but my 3-person company has already beaten them.
The code doesn't matter much; it's not an open-source project.
My free app is http://macrocodex.app which I've developed along with a fitness coach.
I am currently beating companies with 20-30 developers and closing more deals while having 1/10th of the staff.
I am simply very excited about all this.
Nobody cares how you solve the problem, or if your code is ugly. As long as it's reliable and without downtime, and you aren't breaking things and causing your customers headaches, you are winning.
Even before AI, bad code existed. Not every company had a 10x developer writing beautiful idiomatic Rust code.
AI is just a tool; people who are trying to generate a whole codebase with it are doing something very wrong. You can write code faster with AI provided you understand its strengths and weaknesses.
https://playwright.dev/docs/getting-started-mcp#accessibilit...
I've mentioned several times, and gotten snarky remarks for it, that rewriting your code so it fits in your head, and in the LLM's context, helps the LLM code better. People complain about rewriting code just for an LLM, not realizing that the suggestion is to follow better coding principles to let the LLM code better, which has the net benefit of letting humans code better too! Well, it looks like if you support accessibility in your web apps correctly, Playwright MCP will work correctly for you.
Amazing.
Harder to scale if it's doing a lot of them, I suppose.
https://devblogs.microsoft.com/cppblog/spy-internals/
Obviously, if you can inject code into a process that receives sensitive data, you're already running in a context where all security bets are off.
But with processes you yourself create, you probably can, even without elevated privileges, unless the application takes measures to prevent injection (akin to game anticheat mechanisms), so it seems worth pointing out that there are simple mechanisms to subvert such "protected" fields that don't require application-specific reverse engineering.
Yes, in an ideal world, that'd be great for both humans and LLMs, but we are about as far from that ideal world as we could be. You can't even do some of the "advanced actions" as a human with human-level reflexes without encountering a captcha, but sure, all of a sudden, everyone will just decide to make their bread and butter that is data easier to explore via an LLM.
Open source research/project I have been exploring on the topic: https://aevum.build/learn/architecture/
There are no shortcuts in life, and it's just expensive text autocomplete.
"Lets spin up $750k in GPUs full throttle to scrape a web page with my $200.00 CC subscription."
Everyone is delusional.
The good ideas and the bad ideas don't signal success in a bubble, nor does making money or not. It's random, and any notion of "this was a good business model and that was bad" is post-hoc rationalization. The number of people who make fun of pets.com but order from chewy.com is a prime example of this.
The models frequently failed for many reasons on earlier runs, and the browser-use prompt ended up being pretty granular. I'll add a couple of runs that include a scroll instruction to the repo today and see how that compares
Pretty hard to guess what Anthropic trained sonnet on, but general multimodals are what people are using to drive similar tools today, whether GUI-trained or not, so the comparison still holds, for now
So, like a Unix system?
Is this true? Where can I read more about it?
I don't think many realize how good the cheap, alternative models are becoming. I prefer SOTA models for key work, but I can also spend 10X as many tokens on an open model hosted by a non-VC-subsidized provider (who is selling at a profit) for tasks that can tolerate slightly less quality.
The situation is only getting better as models improve and data centers get built out.
Face-scanning? Iris patterns?
Why on earth would they offer convenient hooks for AI chatbots?
Competition. If I ask my OS-level AI assistant to find a social media reel about an elephant dancing, the social media app that exposes a set of APIs for an AI agent might get used more. Watch how fast Meta adds this if a new hot-shot social media app succeeds by designing for AI agents controlled by users.
I guess that just never occurred to anybody before.
Almost sounds like an O'Reilly book
> It's kind of fascinating that we never were willing to do these things for humans, but now that AI needs it... we are all in. A bit depressing in the sense that I think the main reason we're happy to do it for AI is that we perceive it will benefit us personally rather than some abstract future human.
I don't think that's the reason.
I think it's because they take time, and few people were willing to put in time for "maybe it'll make writing the actual code faster" gains when the code was going to take a few times longer to write itself.
You also can get faster feedback to iterate on your spec now, which improves the probability of it helping future-you.
So combine that with the fact that LLMs are more likely to get lost if you don't spec stuff in advance, and the value of up-front work is higher (whereas a human is more likely to land on the right track, just more slowly, making the value harder to quantify).
Anthropic even says that an agent-based solution should only be your last resort and that most problems are well served with a one-shot.
https://www.anthropic.com/engineering/building-effective-age...
Perhaps because those numbers are provided on the Play Store dashboard? You should question Google's acumen in providing those statistics to developers?
And people have been estimating ARR through projections for a long time.
I already have services running for a decade+ in a product which I posted here: https://news.ycombinator.com/threads?id=faangguyindia&next=4...
In the end, simplicity wins.
Heh, you're in for a rude awakening, sometime in the future :) But I won't spoil the surprise, you clearly have made up your mind about what to focus on.
> My free app is http://macrocodex.app which I've developed along with a fitness coach.
Crazy, this app you've run for ~1-2 months has 10K active users already, even though there is zero info about who runs it, zero reviews, and says "Download on the App Store" on the landing page even though you then ask people to use the web app, impressive.
I don't think anyone said using AI can't produce a ton of code really quickly, and no one is finding that difficult to manage either. But most of us software engineers are trying to build long-lasting codebases with AI too, and then "less === better" typically; it's not about being able to spit out features as fast as possible, but about keeping the ever-growing codebase from collapsing on top of itself, and keeping each prompt from getting slower and slower, staying as fast as on a greenfield project.
Sounds like you've found the holy grail of being able to avoid that; kudos if so. Judging by how little care you give to the actual design and architecture, I kind of find that hard to believe. But if it works for you, it works for you; it's not up to me or others to dictate how you build stuff. Hope you enjoy it, however you build it :)
A closer analogue would be AppleScript, or rather, the underlying Apple Event and Open Scripting Architecture functionality supplied by the OS to support AppleScript, that allowed applications to expose these interfaces along with metadata documenting them, and for external tools to record manually performed tasks across applications as programs expressed in terms of these interfaces to make them easier to use (this last bit, while not strictly required, is convenient, and especially useful for less technical users).
If you're familiar with VBA in Microsoft Office applications, sort of like that, except with support provided by OS APIs that could be used by any application that chose to implement scripting support, official guidance from Apple suggesting that all well-designed applications should be scriptable and recordable, and application design patterns and frameworks designed with scriptability and recordability in mind.
Note that I use the past tense here, despite AppleScript still being available in macOS, because it is not well-supported by modern applications.
That's only another step in the path I experienced since the 80s, when I had to type every single character because there was no auto complete, no command line history, very few libraries. I was very good at writing trees, hash tables, linked lists and so was everybody else. Nobody would hire me if I were that slow at writing code today.
Somebody pointed out that those Markdown files might be helpful for people to read directly. Bit of an Emperor's new clothes moment. (I wanted to slap a :rolling_on_the_floor_laughing: reaction on it, but sadly it turns out I'm actually too chickenshit to do that in today's job market.)
This is the exact opposite of what will happen (and in fact what has happened). Reddit is suing Perplexity right now for scraping.
Meta will not serve content to some other app for free - for what benefit? They will not see advertising data.
One of the best parts of LLMs is that you can use them to bootstrap your documentation, or scan for outdated things, etc, far more quickly than ever before.
Don't just throw a mountain at it and ask it to get it right, but use a targeted process to identify inconsistencies, duplicates, etc, and then resolve those.
And then you have better onboarding material for the next human OR llm...
Matthew B. Doar (2011). Practical JIRA Plugins. O’Reilly.
https://www.oreilly.com/library/view/practical-jira-plugins/...
In case anyone was wondering. Which they probably weren’t :p
I'm much more amenable to that type of LLM workflow. Running "agents" with a monolithic "harness" for long-time-horizon tasks seems wasteful and unnecessary, but probably super appealing to lazy people.
Already running for a decade+ in production, recently talked about my stack here: https://news.ycombinator.com/threads?id=faangguyindia&next=4...
> Even though there is zero info about who runs it.
People in the community already know who runs it; most others don't care. You won't get 10K users without people getting results. It's a free app, so it's not like I am spending bucks advertising it on social networks.
The app is completely free, doesn't upload data to any server (other than Sentry crash reporting), and doesn't ask for any email or phone number. When people get results, they share them with their friends. That's how it's growing.
>Says "Download on the App Store" on the landing page even though you then ask people to use the web app.
On iOS, we have a PWA. I am well aware of it.
Actually there's a lot of projection there too; I don't read documentation in detail. And nowadays, I point an LLM at documentation so that it can find the details I would otherwise skip over.
The destruction of the millennial attention span is real, and it's worse in the younger generations, lmao.
We ran a benchmark comparing two ways of letting an AI agent operate the same admin panel, with the goal of putting a price tag on vision agents (browser-use, computer-use).
Here is what we measured, what we had to change to make the vision agent work at all, and what changes when generating an API surface stops being a separate engineering project.
Vision agents are the default for letting AI agents operate web apps that don't expose APIs. The alternative, writing an MCP or REST surface per app, is its own engineering project across the 20+ internal tools most teams have. Teams default to vision agents not because they are better, but because the alternative is too expensive to build; the cost of the vision approach gets treated as a fixed price.
We wanted to measure the price.
The test app is an admin panel for managing customers, orders, and reviews, modeled on the react-admin Posters Galore demo. Two agents target the same running app: one drives the UI via screenshots and clicks, the other calls the app's HTTP endpoints directly. Same Claude Sonnet, same pinned dataset, same task. The interface is the only variable.
The task: find the customer named "Smith" with the most orders, locate their most recent pending order, accept all of their pending reviews, and mark the order as delivered. This touches three resources, requires filtering, pagination, cross-entity lookups, and both reads and writes. It is the shape of work a typical internal tool sees daily.
Path A: Vision agent. Claude Sonnet driving the UI via browser-use 0.12. Vision mode, taking screenshots and executing clicks.
Path B: API agent. Claude Sonnet with tool-use, calling the handlers the UI calls. Each tool maps to one or more event handlers on the app's State, the same functions a button click would trigger. The agent gets the structured response back instead of a rendered page.
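As a sketch of what one such tool might look like through the Anthropic SDK (the schema, model id, and tool body here are an illustration, not the benchmark repo's actual code):

```python
import anthropic

client = anthropic.Anthropic()

# One tool per event handler; the agent reads the handler's structured
# return value instead of interpreting a rendered page.
tools = [{
    "name": "list_customers",
    "description": "List customers, optionally filtered by last name.",
    "input_schema": {
        "type": "object",
        "properties": {
            "last_name": {"type": "string"},
            "page": {"type": "integer"},
        },
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model id
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user",
               "content": "Find the customer named Smith with the most orders."}],
)
```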
We started by giving both agents the same six-sentence task above and seeing what happened.
The API agent completed it in 8 calls. It listed the customer's reviews filtered by pending status, accepted each one, and marked the order as delivered. Both agents are calling into the same application logic; the API agent just reads the structured response directly instead of looking at a rendered page.
The vision agent, on the same prompt, found one of four pending reviews, accepted it, and moved on. It never paginated. The remaining three reviews were below the visible fold of the reviews page and the agent had no signal to scroll for them.
This is not a model problem. The vision agent was reasoning about a rendered page and had no signal that the page wasn't showing everything. The API agent calls the same handler the UI calls, but the response includes the full result set the handler returned, not just the rows currently rendered. The agent reads "page 1 of 4 with 50 results per page" directly instead of having to interpret pagination controls from pixels.
To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own.
With the walkthrough, the vision agent completed the task. It also ran for fourteen minutes and consumed about half a million input tokens.
The walkthrough is itself a finding. Each numbered instruction is engineering work that doesn't show up in token counts but represents real cost. Anyone deploying a vision agent against an internal tool is either writing prompts at this level of specificity or accepting that the agent will silently miss work.
We ran the API path five times and the vision path three times. The vision path was capped at three trials because each run takes 14-22 minutes and consumes 400-750k tokens.
Variance was the most surprising part of the vision results. Across three trials the wall-clock time spanned 749s to 1257s, and input tokens spanned 407k to 751k. The agent took 43 cycles in the shortest run and 68 in the longest. The screenshot-reason-click loop has enough non-determinism that a single run is not a representative cost estimate.
The API path had no such variance. Sonnet hit identical 8 tool calls on every trial, with input token counts varying by ±27 across all five runs. The agent calls the same handlers in the same order because the structured responses give it no reason to deviate.
| Metric | Vision agent (Sonnet) | API (Sonnet) | API (Haiku) |
|---|---|---|---|
| Steps / calls | 53 ± 13 | 8 ± 0 | 8 ± 0 |
| Wall-clock time | 1003s ± 254s (~17 min) | 19.7s ± 2.8s | 7.7s ± 0.5s |
| Input tokens | 550,976 ± 178,849 | 12,151 ± 27 | 9,478 ± 809 |
| Output tokens | 37,962 ± 10,850 | 934 ± 41 | 819 ± 52 |
Numbers are mean ± sample standard deviation (n−1), with n=5 per API path and n=3 for the vision path. Full run details are available in the repo.
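To put a dollar figure on that (assuming Sonnet's list pricing of $3 per million input tokens and $15 per million output), the mean vision run works out to roughly 0.551M × $3 + 0.038M × $15 ≈ $2.22, versus about $0.05 for the Sonnet API path: a ~44x price gap on top of the ~50x wall-clock gap.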
Haiku could not complete the vision path. The failure was specific to browser-use 0.12's structured-output schema, which Haiku could not reliably produce in either vision or text-only mode. On the API path, Haiku finished in under 8 seconds for under 10k input tokens, which is the cheapest configuration we tested.
The cost difference follows directly from the architecture. An agent that must see in order to act will always pay for the seeing, regardless of how good the model gets. Better vision models reduce error rates per screenshot, but they do not reduce the number of screenshots required to reach the relevant data. Each render is a screenshot, and each screenshot is thousands of input tokens.
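For scale, Anthropic's documentation estimates image cost at roughly width × height / 750 tokens, so a single 1280×800 screenshot is about 1,365 tokens before any accompanying text or conversation history.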
Both agents in this benchmark walk through the same application logic. They both filter, paginate, and update the same way the UI does. The difference is what they read at each step. The vision agent reads pixels and has to render every intermediate state to interpret it. The API agent reads the structured response from the same handlers, which already contains the data the UI was going to display.
Better models will narrow the cost per step. They will not narrow the step count, because the step count is set by the interface.
The benchmark was made cheap to run by Reflex 0.9, which includes a plugin that auto-generates HTTP endpoints from a Reflex application's event handlers. None of the structural argument depends on Reflex specifically, but it is what made running the API path possible without writing a second codebase.
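For context, a Reflex event handler is just a method on a state class. A minimal sketch of the kind of handler the plugin exposes (the state and field names are illustrative, not the benchmark app's):

```python
import reflex as rx

class AdminState(rx.State):
    reviews: list[dict] = []

    def accept_review(self, review_id: int):
        # The same logic a UI button click triggers. Exposed over HTTP,
        # its effect on state becomes the agent's structured response.
        for review in self.reviews:
            if review["id"] == review_id:
                review["status"] = "accepted"
```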
The interesting question is what becomes possible when the engineering cost of an API surface drops to zero. Vision agents remain the right tool for applications you do not control: third-party SaaS products, legacy systems, anything you cannot modify. For internal tools you build yourself, the math now points the other way.
Vision results are specific to browser-use 0.12 in vision mode, and other vision agents may behave differently. The Path B runner shapes the auto-generated endpoints into a small REST tool surface of about thirty lines, which the agent sees as list_customers, update_order, and similar. The dataset is pinned and small (900 customers, 600 orders, 324 reviews), so behavior on production-scale data is not measured here. The vision agent runs through LangChain's ChatAnthropic, and the API agent runs through the Anthropic SDK directly. Reported token counts are uncached input tokens.
The repo includes seed data generation, the patched react-admin demo, both agent scripts, and raw results.
Bedrock isn't the cheapest either, although I'm fairly sure they aren't being VC-subsidized.
There are definitely cheap tokens out there. The big gotcha is "for tasks that can tolerate slightly less quality"
I think everyone making claims that inference is getting more expensive is unaware that there are more LLM providers than Google, Anthropic, and OpenAI.
Advertising isn't the only possible business model.
And profit isn't the only possible motive to provide a service.
No, that's forward. Any documentation an AI can make, another AI can regenerate. If an LLM didn't write the code, it shouldn't document it either. You don't want to bake in slop to throw off the next LLM (or person).
Imagine how crippled you would be if you felt compelled to follow every comment thread to its end.
We're just monkeys looking for the good bits among a pile of rotten fruit.
Agent use can improve quality and maintainability.
https://www.google.com/search?q=identify+anonymous+visa+mast...
In fact, the only area I've been struggling with is "Concepts", because they have less clear boundaries for the right amount of detail.
Here is what I've been working on: https://github.com/super-productivity/super-productivity/wik...
Most wikis you can mirror locally if you really need to hammer them.