I feel that if you want to build a coding agent or harness, the first thing you should do is build an evaluation framework that tracks coding performance with internal metrics and task benchmarks. Instead, I see most coding agents just fiddling with features that don't improve the agent's core ability.
The OpenCode docs suggest it's possible, but it only works with their extension (not in an already-open VS Code terminal), with a very specific keyboard shortcut, and only barely at that.
Many folks coming from other tools only get exposed to the same functionality they were used to, but it offers much more than other harnesses, especially for remote coding.
You can start a service via `opencode serve`; it can be accessed from anywhere, and the experience on mobile is great apart from a few bugs. It's a really good way to work with your agents remotely, and it pairs really well with Tailscale.
The WebUI they have can connect to multiple OpenCode backends at once, so you can use multiple VPSes for your various projects and control all of them from a single place.
Lastly, there's a desktop app, but TBH I find it redundant when WebUI has everything needed.
Make no mistake though, it's not a perfect tool. My gripes with it:
- There are random bugs with loading/restoring state of the session
- Model/Provider selection switch across sessions/projects is often annoying
- I had a bug making Sonnet/Opus unusable from mobile phone because phone's clock was 150ms ahead of laptop's (ID generation)
- Sometimes the agent gets randomly stuck, which especially sucks for long/nested sessions
- The WebUI on my laptop just completely forgot all my projects one day
- `opencode serve` doesn't pick up new skills automatically, it needs to be restarted
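The clock-skew bullet above is easy to reproduce in miniature. Assuming a simplified scheme where IDs are prefixed with the generating device's wall-clock timestamp (the real ID format is OpenCode's, not shown here), a 150 ms skew between devices inverts ordering:

```python
def time_ordered_id(clock_ms: int) -> str:
    """Simplified, assumed scheme: IDs sort lexicographically by the
    generating device's wall-clock time in zero-padded milliseconds."""
    return f"{clock_ms:013d}"

LAPTOP_NOW = 1_700_000_000_000  # laptop's clock, in ms
PHONE_NOW = LAPTOP_NOW + 150    # phone's clock runs 150 ms ahead

# The phone sends a message first; the laptop replies 100 ms later
# (by real wall time), yet the reply sorts *before* the message:
phone_msg = time_ordered_id(PHONE_NOW)
laptop_reply = time_ordered_id(LAPTOP_NOW + 100)
```

With IDs like these driving session ordering, a sub-second skew is enough to scramble a conversation, which matches the bug described above.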
I wonder why they used TypeScript and not a more resource-efficient language like C, C++, Rust, or Zig.
Since their code is generated by AI, human preferences shouldn't matter much, and AI is happy to work with any language.
- GH Copilot API is a first-class citizen, with access to multiple providers' models at a very good price with a Pro plan
- No terminal flicker
- It seems really good with subagents
- I can't see any terminal history inside my Emacs vterm :(
Curious how the context window management works in practice. With large repos, the "what files to include" problem tends to dominate — does it have a strategy beyond embedding-based retrieval, or is that the main approach here?
There's also a feature request and a PR to add such an option, but it was closed due to "not adhering to community standards".
I really should look into more "native" Emacs options as I find using vterm a bit of a clunky hack. But I'm just not that excited about this stuff right now. I use it because I'm lazy, that's all. Right now I'm actually getting into woodwork.
IMO, the web UI is a killer feature - it’s got just enough to be an agent manager - without any fluff. I run it on my remote VMs and connect over HTTP.
More than that, it's an extremely large and complex TypeScript code base (probably larger and more complex than it needs to be), and partly as a result it's fairly resource-inefficient: it often uses 1 GB of RAM or more, for a TUI.
On top of that, I personally find the TUI overbearing and a little buggy, and the agent so stuffed with features I don't really need (also mildly buggy) that it becomes hard to use and hard to remember how everything is supposed to work and interact.
To change that, you need to set a custom "small model" in the settings.
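A config sketch for that, assuming OpenCode's `opencode.json` settings file; the `small_model` key name and the model IDs below are my assumptions, so check the settings docs for the exact fields:

```json
{
  "model": "anthropic/claude-sonnet-4",
  "small_model": "ollama/qwen2.5-coder:7b"
}
```

The point is that background tasks (titles, summaries) go to the cheap model instead of whatever default the tool picks.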
I really like how their subagents work; as a bonus, I get to choose which model runs in which agent. Sadly, I have to resort to the mess that Anthropic calls Claude Code.
https://www.youtube.com/live/z0JYVTAqeQM?si=oLvyLlZiFLTxL7p0
I sprinkle in some billed API usage to power my task-planner and reviewer subagents (both use GPT 5.4 now).
The ability to switch models is very useful and a great learning experience. GLM, Kimi and their free models surprised me. Not the best, not perfect, but still very productive. I would be a wary shareholder if I owned a stake in the frontier labs… that moat seems to be shrinking fast.
I'm guessing that a model which only covers a single language might be more compact and efficient vs a model trained across many languages and non-programming data.
I use it with Qwen 3.5 running locally when my daily limits run out on my other subscriptions.
The harness is great. Local models are just slow enough that the subscription models are easier to use. For most of my tasks these days, the model's capability is sufficient; it is just not as snappy.
Hugely grateful for what they do.
It's been a moving target for years at this point.
Both open and closed source models have been getting better, but not sure if the open source models have really been closing the gap since DeepSeek R1.
But yes: If the top closed source models were to stop getting better today, it wouldn't take long for open source to catch up.
Coding is mostly "agentic", so I'm a bit puzzled.
They shouldn't, as long as your terminal emulator doesn't. Why do you think it's Wayland related?
It works perfectly fine on Niri, Hyprland and other Wayland WMs.
What problem do you have?
Even as a CC user I’m glad someone is forcing the discussion.
My prediction: within two years ‘model neutrality’ will be a topic of debate. Creating lock-in through discount pricing is anti-competitive. The model provider is the ISP; the tool, the website.
This is what I notice with openclaw as well: there have been releases where they break production features. Unfortunately, this is what happens when code becomes a commodity. Everyone thinks shipping fast is the moat, at the expense of quality, because they know a fix can land quickly in the next release.
[0] https://www.reddit.com/r/LocalLLaMA/comments/1rv690j/opencod...
I would (incorrectly, apparently) assume that a product like this would be heavily tested via AI. Why not? AI should be writing all the code, so why wouldn't the humans invest in and require extreme levels of testing, given that AI is really good at that?
OpenCode has been much more stable for me in the 6 months or so that I’ve been comparing the two in earnest.
I expect that from something guiding the market, but there have been times where stuff changes, and it isn't even clear if it is a bug or a permanent decision. I suspect they don't even know.
https://old.reddit.com/r/LocalLLaMA/comments/1rv690j/opencod...
They also don't let you run arbitrary local models, only specific ones whitelisted by another third party: https://github.com/anomalyco/opencode/issues/4232
In my harness, long tool outputs and command outputs are all spilled over to the filesystem. Context messages are truncated and split to the filesystem, with a breadcrumb for retrieving the full message.
Works really well.
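A minimal sketch of that spillover pattern; the inline threshold, spill directory, and breadcrumb wording are all illustrative, not the commenter's actual implementation:

```python
import hashlib
import pathlib
import tempfile

SPILL_DIR = pathlib.Path(tempfile.gettempdir()) / "agent-spill"
INLINE_LIMIT = 2000  # characters kept inline in the context window

def spill(output: str) -> str:
    """Keep short outputs inline; write long ones to disk and return a
    truncated copy with a breadcrumb pointing at the full text."""
    if len(output) <= INLINE_LIMIT:
        return output
    SPILL_DIR.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(output.encode()).hexdigest()[:16] + ".txt"
    path = SPILL_DIR / name
    path.write_text(output)
    return output[:INLINE_LIMIT] + f"\n[truncated: full output at {path}]"
```

The agent can later follow the breadcrumb with an ordinary file-read tool call, so nothing is actually lost, it's just out of the hot context.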
The simplicity of extending pi is in itself addictive, but even in its raw form it does the job well.
Before finding pi I had written a lot of custom stuff on top of all the provider specific CLI tools (codex, Claude, cursor-agent, Gemini) - but now I don’t have to anymore (except if I want to use my anthropic sub, which I will now cancel for that exact reason)
Pi is refreshingly minimal in terms of system prompts, but still works really well, and that makes me wonder whether other harnesses are overdoing it. Look at OpenCode's prompts, for instance: long, mostly based on feels, and IMO unnecessary. I would've liked to just overwrite OC's system prompts with Pi's (to get the other features that Pi doesn't have), but that isn't possible today without maintaining a custom fork.
I’m sure there’s a more elegant way to say this, but OpenCode feels like an open source Claude Code, while pi feels like an open source coding agent.
I used it recently inside a CI workflow in GitLab to automatically create ChangeLog.md entries for commits. That + Qwen 3.5 has been pretty successful. The job starts up Pi programmatically, points it at the commits in question, and tells it to explore and get all the context it needs within 600 seconds... and it works. I love that this is possible.
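The launch-with-a-time-budget part of that CI job can be sketched like this; the real invocation would be the Pi CLI (command-line shape assumed), and the demo below substitutes `echo` so it runs anywhere:

```python
import subprocess

def run_agent(cmd: list[str], timeout_s: int = 600):
    """Run an agent CLI under a hard wall-clock budget, as in the CI
    job above. Returns (exit_code, stdout); exit_code is None if the
    budget was exhausted and the process was killed."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
        return done.returncode, done.stdout
    except subprocess.TimeoutExpired:
        return None, ""  # CI job can fail loudly or retry here

# Stand-in for something like run_agent(["pi", "run", prompt], 600):
code, out = run_agent(["echo", "changelog entry"], timeout_s=5)
```

The hard timeout is what makes this safe in CI: a wedged agent costs you 600 seconds, not a hung pipeline.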
The big expensive models are great at planning tasks and reviewing the implementation of a task. They can better spot potential gotchas, performance or security gaps, subtle logic and nuance that cheaper models fail to notice.
The small cheap models are actually great (and fast) at generating decent code if they have the right direction up front.
So I do all the spec writing myself (with some LLM assistance), and I hand it to a Supervisor agent who coordinates between subagents. Plan -> implement -> review -> repeat until the planner says “all done”.
I switch up my models all the time (actively experimenting) but today I was using GPT 5.4 for review and planning, costing me about $0.4-$1 for a good sized task, and Kimi for implementation. Sometimes my spec takes 4-5 review loops and the cost can add up over an 8 hour day. Still cheaper than Claude Max (for now, barely).
Each agent retains a fairly small context window which seems to keep costs down and improves output. Full context can be catastrophic for some models.
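A hedged sketch of that supervisor loop. The callables stand in for subagent calls (a big model for plan/review, a cheap fast one for implement), and each one sees only what it needs, which is what keeps the per-agent context small:

```python
def run_task(spec, plan, implement, review, max_loops=5):
    """Plan once, then implement/review until the reviewer says
    'all done' or the loop budget runs out. `plan`, `implement`,
    and `review` are stand-ins for subagent/model calls."""
    guidance = plan(spec)
    artifact = None
    for _ in range(max_loops):
        artifact = implement(spec, guidance, artifact)
        verdict = review(spec, artifact)
        if verdict == "all done":
            break
        guidance = verdict  # reviewer feedback drives the next pass
    return artifact
```

The 4-5 review loops mentioned below correspond to `max_loops` iterations here; the cost adds up because plan/review run on the expensive model each pass.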
As for the spec writing, this is the fun part for me, and I’ve been obsessing over this process, and the process of tracking acceptance criteria and keeping my agents aligned to it. I have a toolkit cooking, you can find in my comment history (aiming to open source it this week).
I just did a one hour vibe session today, ripping out a library dependency and replacing it with another and pushing the library to pypi. I should take my task list and let the local model replicate the work and see how it works out.
I didn't dig further
Seems like there are many GitHub issues about this, actually.
https://github.com/anomalyco/opencode/issues/14336
And then the official docs: https://opencode.ai/docs/troubleshooting/#linux-wayland--x11...
> Linux: Wayland / X11 issues
> On Linux, some Wayland setups can cause blank windows or compositor errors.
> If you’re on Wayland and the app is blank/crashing, try launching with OC_ALLOW_WAYLAND=1.
> If that makes things worse, remove it and try launching under an X11 session instead.
OC_ALLOW_WAYLAND=1 didn't work for me (Ubuntu 24.04)
Suggesting a different display server to run a TUI (!!) seems a bit wild to me. I didn't put a lot of time into investigating this, so maybe the cause is something other than Wayland. Anyway, I'm using Pi now.
I'm not a US citizen, so both companies are the same, as far as I'm concerned.
You need to set an explicit "small model" in OpenCode to disable that.
Have fun on Windows: an automatic no from me. https://github.com/anomalyco/opencode/issues?q=is%3Aissue%20...
I’m sure we’ll all learn a lot from these early days of agentic coding.
Not all code is fungible: "irreverent code that kinda looks okay at first glance" might be a commodity, but well-tested, well-designed, and well-understood code is what's valuable.
That #12446 PR hasn't even been resolved as won't-merge, and the last change was a week ago (in a repo with 1.8k+ open PRs).
JS was not developed with the CLI in mind, and on top of that, the language doesn't lend itself well to LLM generation, since it has pretty weak validation compared to, e.g., Rust, or even C, or even Python.
Not to mention memory usage or performance.
Must be a karmic response from “Free” /s
Interesting you say this, because I'd say the opposite is true historically, especially in the systems-software community and among older folks. "Do one thing and do it well" seems to be the prevailing mindset behind many foundational tools; I think this is why so many are/were irked by systemd. On the other hand, newer tools that are more heavily marketed, and often have some commercial angle, seem to be in a perpetual state of tacking on new features in lieu of refining their raison d'être.
EDIT: The system I bought last summer for $1980 and just took delivery of in October, Beelink GTR 9 Pro, is now $2999.... wow...
I can’t find the tweet from Mario (the author), but he prefers the Typescript/npm ecosystem for non-performance critical systems because it hits a sweet spot for him. I admire his work and he’s a real polyglot, so I tend to think he has done his homework. You’ll find pi memory usage quite low btw.
So far what I am learning (from watching all of this) is that our constant claims that quality and security matter seem to not be true on average. Depressingly.
Code today can be as verbose and ugly as ever, because from here on out, fewer people are going to read it, understand and care about it.
What's valuable, and you know this I think, is how much money your software will sell for, not how fine and polished your code is.
Code was a liability. Today it's a liability that costs much, much less.
The choice isn't "telemetry or you're blindfolded", the other options include actually interacting with your userbase. Surveys exist, interviews exist, focus groups exist, fostering communities that you can engage is a thing, etc.
For example, I was recruited and paid $500 to spend an hour on a panel discussing what developers want out of platforms like DigitalOcean, what we don't like, where our pain points are. I put the dollar amount there only to emphasize how valuable such information is from one user. You don't get that kind of information from telemetry.
Only for the non-pro users. After all, those users were happy to use Excel to write their programs.
What we're seeing now is that more and more developers find they are happy with even less determinism than the Excel process.
Maybe they're right; maybe software doesn't need any coherence, stability, security or even correctness. Maybe the class of software they produce doesn't need those things.
I, unfortunately, am unable to adopt this view.
But as agents move from prototypes to production, the calculus changes. Production systems need:
- Memory continuity across sessions
- Predictable behavior across updates
- Security boundaries that don't leak
The tools that prioritize these will win the enterprise market. The ones that don't will stay in the prototype/hobbyist space.
We're still in the "move fast" phase, but the "break things" part is starting to hurt real users. The pendulum will swing back.
How much value are you going to be able to extract over its lifetime once your customers want to see some additional features or improvements?
How much expensive maintenance burden are you incurring once any change (human or LLM generated) is likely to introduce bugs you have no better way of identifying than shipping to your paying customers?
Maybe LLM+tooling will get there with producing a comprehensible and well-tested system, but my anecdotal experience is not promising. I find that AI is great until you hit its limit on a topic, and then it will merrily generate tokens in a loop, suggesting the same won't-work fix forever.
There are limits to what even AI can do to code within practical time limits. Using AI also costs money. So, the easier it is to maintain and evolve a piece of software, the cheaper it will be for the owners of that application.
Like Rails/DHH was one phase, Git/GitHub another.
And right now it's kinda Claude Code. But they're so obviously really bad at development that it feels like a MLM scam.
I'm just describing the feeling I'm getting, perhaps badly. I use Claude, I recommended Claude for the company I worked at. But by god they're bloody awful at development.
It feels like the point where someone else steps in with a rock solid, dependable, competitor and then everyone forgets Claude Code ever existed.
Imagine someone using it at work, where they are only allowed to use a GitHub Copilot Business subscription (which is supported in OpenCode). Now they have sent proprietary code to a third party, and don't even know they're doing it.
From what I've heard, the checks Anthropic uses to detect unauthorized clients are pretty easy to sidestep if you look at the existing solutions out there. Better that than getting your account banned.
The belief is that the subscriptions are subsidized by them (or just heavily cut into profit margins), so for whatever reason they're trying to maintain control over the harness: maybe to gather more usage analytics, gain an edge over competitors, and improve how their models work with it, or perhaps to route certain requests to Haiku or Sonnet instead of using Opus for everything, to cut down on compute.
Given the ample usage limits, I personally just use Claude Code now with their 100 USD per month subscription, because it gives me the best value. It kind of sucks that they won't support other harnesses, though (especially custom GUIs for managing parallel tasks/projects). OpenCode never worked well for me on Windows anyway; I've also used Codex and Gemini CLI.
When they set your token limits, their economics calculations likely assume those optimizations will work. If you're using a different agent, you're basically underpaying for your tokens.
API = way more expensive, but you're allowed to use it on your terms without Anthropic hindering you.
One-price-per-month subscriptions (Claude Code Pro/Max at $20/$100/$200 a month) use a different authentication mechanism, OAuth. The useful difference is that you get a lot more inference for the same cost than you would via the API, but they require you to use Claude Code as the client.
Some clients have made it simple to use your subscription key with them and they are getting cease and desist letters.
(Ok, technically o1-pro is even more expensive, but I'm assuming that's a "please move on" pricing)
I'd rather switch to OpenAI than give up my favorite harness.
I'm 13 years into this industry, this is the first I'm hearing of this.
We all know it’s extremely, extremely hard to interact with your userbase.
> For example I was paid $500 an hour
Add the time to find volunteers and the cost doubles, so at $1,000 an hour for 10 user interviews, a free-software project gets feedback from 0.001% of its users. I dislike telemetry, but it's a lie to say it's optional.
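Spelling out the arithmetic behind that claim; the one-million-user base is what the 0.001% figure implies, not a number stated above:

```python
cost_per_interview = 1_000   # $500 panel fee + roughly as much again in recruiting
interviews = 10
total_cost = cost_per_interview * interviews   # dollars for the whole round

user_base = 1_000_000                          # implied by the 0.001% figure
coverage = interviews / user_base * 100        # percent of users heard from
```

So $10,000 buys the perspective of ten users out of a million, which is the core of the "telemetry isn't really optional" argument here.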
— a company with no telemetry in either our downloadable or our cloud products.
I'm building a full stack web app, simple but with real API integrations with CC.
I'm moving so fast that I can barely keep hold of what I'm testing and building at the same time, just using Sonnet. It's not bad at all. A lot of the specs develop as I'm testing the features, either immediately or as a todo / GH issue.
How can you manage an agentic flow?
If you respond twice to its theme-query probes, the whole thing bricks. Or if you're slightly out of order. It's very delicate.
Edit: it's not. https://github.blog/changelog/2026-01-16-github-copilot-now-...
They must be eating insane amounts of $$$ for this. I wouldn't expect it to last
Still, I feel like "will commit illegal mass murder against their own citizens" is a significant enough degree more evil. I think lots of corporations will help their government murder citizens of other countries, but very few would go so far as to agree to murder their own (fellow) citizens ... just to get a juicy contract.
See https://models.dev for a comparison against the normal "vanilla" API.
No, it is not. Your users will tell you what you need to know, you just have to pay attention.
> I dislike telemetry, but it’s a lie to say it’s optional.
No, the lie is believing it’s necessary. Software was successful before telemetry was a thing, and tools without telemetry continue to be successful. Plenty of independent developers ship zero telemetry in their products and continue to be successful.
You can point Claude Code at a local inference server (e.g. llama.cpp, vLLM) and see which model names it sends each request to. It's not hard to do a MITM against it either. Claude Code does send some requests to Haiku, but not the ones you're making with whatever model you have it set to - these are tool result processing requests, conversation summary / title generation requests, etc - low complexity background stuff.
Now, Anthropic could simply take requests to their Opus model and internally route them to Sonnet on the server side, but then it wouldn't really matter which harness was used or what the client requests anyway, as this would be happening server-side.
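The "see which model names it sends" trick above can be done with a tiny local HTTP listener standing in for the inference server. The endpoint path and `model` field below follow the common OpenAI-style chat API; this is an illustration of the technique, not Anthropic's actual wire format:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_models = []  # model names observed in incoming requests

class ModelLogger(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            seen_models.append(json.loads(body).get("model"))
        except json.JSONDecodeError:
            seen_models.append(None)
        # Reply with an empty JSON object so the client doesn't error out.
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b"{}")

    def log_message(self, *args):
        pass  # silence per-request stderr logging
```

Point the client's base URL at this listener and `seen_models` shows exactly which requests went to the big model versus the background/summary model.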
Build the single pane of glass everyone uses. Offer it under cost. Salt the earth and kill everything else that moves.
Nobody can afford to run alternative interfaces, so they die. This game is as old as time. Remember Reddit apps? Alternative Twitter clients?
In a few years, CC will be the only survivor and viable option.
It also kneecaps attempts to distill Opus.
We also have taboos against betraying/murdering/whatever people of other tribes, but those taboos are much weaker and get relaxed sometimes (eg. in war). My point is, it takes significantly more anti-social (ie. evil) behavior to betray your own tribe, in the deepest way possible, than it does to do horrible things to other tribes.
This is just as much true for Russians murdering Ukrainians as Ukrainians murdering Russians, or any other conflict pair: almost all Russians would consider a Russian who helps kill Russians to be more evil than a Russian who kills Ukrainians (and vice versa).
But you're still choosing evil when you could try local models
It’s not like moving from android to iOS.
For me it's $0.80/kWh during peak, $0.47 off-peak, and a super-off-peak rate of $0.15. I accidentally left a little 500 W mini heater on all day while I was out, costing more than 5% of my whole month's bill!
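Rough arithmetic for that mishap; the rates are from the comment, but the hours-per-tier split is my assumption:

```python
kw = 0.5  # 500 W heater

# Assumed tier schedule: 6 h peak, 12 h off-peak, 6 h super off-peak.
hours = {"peak": 6, "off_peak": 12, "super_off_peak": 6}
rate = {"peak": 0.80, "off_peak": 0.47, "super_off_peak": 0.15}

kwh = kw * sum(hours.values())                     # energy over the full day
cost = sum(kw * hours[t] * rate[t] for t in rate)  # dollars for the day
```

That works out to roughly $5-6 for one forgotten day, so on a monthly bill in the low hundreds, a single slip really can eat more than 5%.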