It never duplicates code, reimplements something and leaves the old code around, breaks my conventions, hallucinates, or tells me it's done when the code doesn't even compile, all of which Sonnet 4.5 and Opus 4.1 did all the time.
I'm wondering if this has changed with Opus 4.5, since so many people are raving about it now. What's your experience?
Claude - fast, to the point but maybe only 85% - 90% there and needs closer observation while it works
GPT-x-high (or xhigh) - you tell it what to do, it will work slowly but precise and the solution is exactly what you want. 98% there, needs no supervision
I'm not so sure... I mean, it's true that regardless of whether you are a beginner, junior, or 'senior', if you say "build me an Instagram clone" Opus 4.5 will probably do a decent job. I think the skill still lies in understanding architecture, knowing where pitfalls and problems can arise, or even making some important 'abstraction cut' prompts. I think applications can still grow to a point where you need to prompt at specific domains only, or the model will fail to do everything you want it to, especially if you give it massive 'fix this, this, this, and that too' prompts.
It is best in its class, but trips up frequently with complicated engineering tasks involving dynamic variables. Think: Browser page loading, designing for a system where it will "forget" to account for race conditions, etc.
Still, this gets me very excited for the next generation of models from Anthropic for heavy tasks.
https://chronick.github.io/typing-arena/
With another more substantial personal project (Eurorack module firmware, almost ready to release), I set up Claude Code to act as a design assistant, where I'd give it feedback on current implementation, and it would go through several rounds of design/review/design/review until I honed it down. It had several good ideas that I wouldn't have thought of otherwise (or at least would have taken me much longer to do).
Really excited to do some other projects after this one is done.
If I had done the same thing in the pre-LLM era, it would have taken me months.
Like that first one where he writes a right-click handler: off the top of my head I have no idea how I would do that. I could see it taking a few hours just to set up a dev environment, and I would probably overthink the research. I was working on something where Junie suggested I write a browser extension for Firefox, and I was initially intimidated at the thought, but it banged out something in just a few minutes that basically worked after the second prompt.
Similarly, the Facebook autoposter is completely straightforward to code, but it can be so emotionally exhausting to fight with authentication APIs. A big part of the coding agent story isn't just that agents save you time; it's that they can be strong when you are emotionally weak.
The one that seems the hardest is the one that does the routing and travel-time estimation, which I'd imagine is calling out to some API or library. I used to work at a place that did sales-territory optimization, and we had one product that would help work out routes for sales and service people who travel from customer to customer. We had a specialist code that stuff in C++, and he had a very different viewpoint than me: he was good at what he did and could get that kind of code to run fast, but I wouldn't have trusted him to even look at application code.
I think the gap between Sonnet 4.5 and Opus is pretty small, compared to the absolute chasm between like gpt-4.1, grok, etc. vs Sonnet.
(it was for single html/js PWA to measure and track heart rate)
Opus seems to go less deep, does its own thing, and doesn't follow instructions exactly, EVEN IF I WRITE IN ALL CAPS. With Sonnet 4.5 I can understand everything the author is saying. Maybe Opus is optimised for Claude Code and Sonnet works best on the web.
The thing about being an engineer in a commercial capacity is maintaining and enhancing an existing program or software system that has been developed over years by multiple people (including those who have already left), and doing it in a way that does not cause any outages, bugs, or breakage of existing functionality.
While the blog post mentions the ability to use AI to generate new applications, it does not talk about maintaining one over a longer period of time. For that, you would need real users, real constraints, and real feature requests, preferably ones that pay you so you can prioritize them.
I would love to see blog posts where, for example, a PM is able to add features for a period of one month without breaking production, but it would be a very costly experiment.
If you tell it to use linters and other kinds of code-analysis tools, it takes things to the next level: Ruff for Python or Clippy for Rust, for example. The LLM makes so much code so fast, then passes it through these tools, actually understands what the tools say, and goes and makes the changes. I have created a whole toolchain that I put in a pre-commit text file in my repos, and I tell the LLM something like "Look in this text file and use every tool you see listed to improve code quality".
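That kind of runner is simple to sketch. Here is a minimal, hypothetical Python version, assuming a plain-text file with one tool command per line; the file name and command format are my own illustration, not the commenter's actual setup:

```python
import shlex
import subprocess
from pathlib import Path

def load_toolchain(path="toolchain.txt"):
    # One quality-tool command per line, e.g. "ruff check ." or "mypy src".
    # Blank lines and "#" comments are skipped.
    lines = Path(path).read_text().splitlines()
    return [shlex.split(line) for line in lines
            if line.strip() and not line.lstrip().startswith("#")]

def run_toolchain(commands):
    # Run each tool and collect its exit code; the LLM (or a pre-commit
    # hook) can then read the failures and go fix them.
    results = {}
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[" ".join(cmd)] = proc.returncode
    return results
```

In practice the prompt just points the agent at the text file and asks it to keep running the listed tools until everything exits 0.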
That being said, I doubt it can yet turn a non-dev into a dev; it just makes competent devs way better.
I still need to be able to understand what it is doing and what the tools are for to even have a chance to give it the guardrails it should follow.
Once you’ve got Claude Code set up, you can point it at your codebase, have it learn your conventions, pull in best practices, and refine everything until it’s basically operating like a super-powered teammate. The real unlock is building a solid set of reusable “skills” plus a few agents for the stuff you do all the time.
For example, we have a custom UI library, and Claude Code has a skill that explains exactly how to use it. Same for how we write Storybooks, how we structure APIs, and basically how we want everything done in our repo. So when it generates code, it already matches our patterns and standards out of the box.
We also had Claude Code create a bunch of ESLint automation, including custom ESLint rules and lint checks that catch and auto-handle a lot of stuff before it even hits review.
Then we take it further: we have a deep code review agent Claude Code runs after changes are made. And when a PR goes up, we have another Claude Code agent that does a full PR review, following a detailed markdown checklist we’ve written for it.
On top of that, we’ve got like five other Claude Code GitHub workflow agents that run on a schedule. One of them reads all commits from the last month and makes sure docs are still aligned. Another checks for gaps in end-to-end coverage. Stuff like that. A ton of maintenance and quality work is just… automated. It runs ridiculously smoothly.
We even use Claude Code for ticket triage. It reads the ticket, digs into the codebase, and leaves a comment with what it thinks should be done. So when an engineer picks it up, they’re basically starting halfway through already.
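For what it's worth, the gathering half of a commit-reading docs agent like the one described above is mundane; the model does the judgment part. A hypothetical Python sketch of that first step (the keyword heuristic is my own illustration, not the poster's actual workflow):

```python
import subprocess

def recent_commit_subjects(since="1 month ago", repo="."):
    # Collect one-line commit subjects from the last month: the raw
    # material a scheduled docs-review agent would be handed.
    out = subprocess.run(
        ["git", "log", f"--since={since}", "--pretty=format:%s"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line.strip()]

def split_by_docs_mention(subjects, keywords=("doc", "readme", "guide")):
    # Crude triage: commits that already mention docs vs. ones that may
    # have silently outdated them; the latter go to the model for review.
    mentions = [s for s in subjects
                if any(k in s.lower() for k in keywords)]
    rest = [s for s in subjects if s not in mentions]
    return mentions, rest
```

Everything interesting happens after this: the agent reads the "rest" bucket against the docs and decides what drifted.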
There is so much low-hanging fruit here that it honestly blows my mind people aren’t all over it. 2026 is going to be a wake-up call.
(used voice to text then had claude reword, I am lazy and not gonna hand write it all for yall sorry!)
Edit: made an example repo for ya
And my conclusion is: it's still not as smart as a good human programmer. It frequently got stuck, went down wrong paths, ignored what I told it and did something wrong, or even repeated a previous mistake I had already corrected.
Yet in other ways, it's unbelievably good. I can give it a directory full of code to analyze, and it can tell me it's an implementation of Kozo Sugiyama's dagre graph layout algorithm, and immediately identify the file with the error. That's unbelievably impressive. Unfortunately it can't fix the error. The error was one of the many errors it made during previous sessions.
So my verdict is that it's great for code analysis, and it's fantastic for injecting some book knowledge on complex topics into your programming, but it can't tackle those complex problems by itself.
Yesterday and today I was upgrading a bunch of unit tests because of a dependency upgrade, and while it was occasionally very helpful, it also regularly got stuck. I got a lot more done than usual in the same time, but I do wonder if it wasn't too much. Wasn't there an easier way to do this? I didn't look for it, because every step of the way, Opus's solution seemed obvious and easy, and I had no idea how deep a pit it was getting me into. I should have been more critical of the direction it was pointing to.
No doubt I could give Opus 4.5 "build me an XYZ app" and it would do well. But day to day, when I ask it to "build me this feature", it uses strange abstractions and often requires several attempts on my part to get it done in the way I consider "right". Any non-technical person might read that and go "if it works, it works", but any reasonable engineer will know that's not enough.
Not in terms of knowledge. That was already phenomenal. But in its ability to act independently: to make decisions, collaborate with me to solve problems, ask follow-up questions, write plans and actually execute them.
You have to experience it yourself on your own real problems and over the course of days or weeks.
Every coding problem I was able to define clearly enough within the limits of the context window, the chatbot could solve and these weren’t easy. It wasn’t just about writing and testing code. It also involved reverse engineering and cracking encoding-related problems. The most impressive part was how actively it worked on problems in a tight feedback loop.
In the traditional sense, I haven’t really coded privately at all in recent weeks. Instead, I’ve been guiding and directing, having it write specifications, and then refining and improving them.
Curious how this will perform in complex, large production environments.
All the LLM coded projects I've seen shared so far[1] have been tech toys though. I've watched things pop up on my twitter feed (usually games related), then quietly go off air before reaching a gold release (I manually keep up to date with what I've found, so it's not the algorithm).
I find this all very interesting: LLMs don't change the fundamental drives needed to build successful products. I feel like I'm observing the TikTokification of software development. I don't know why people aren't finishing. Maybe they stop when the "real work" kicks in. Or maybe they hit the limits of what LLMs can do (so far). Maybe they jump to the next idea to keep chasing the rush.
Acquiring context requires real work, and I don't see a way forward to automating that away. And to be clear, context is human needs, i.e. the reasons why someone will use your product. In the game-development world, it's very difficult to overstate how much work needs to be done to create a smooth, enjoyable experience for the player.
While anyone may be able to create a suite of apps in a weekend, I think very few of them will have the patience and time to maintain them (just like software development before LLMs! i.e. Linux, open source software, etc.).
[1] yes, selection bias. There are A LOT of AI devs just marketing their LLMs. Also it's DEFINITELY too early to be certain. Take everything I'm saying with a one-pound grain of salt.
I hacked together a Swift tool to replace a Python automation I had, merged an ARM JIT engine into a 68k emulator, and even got a very decent start on a synth project I’ve been meaning to do for years.
What has become immensely apparent to me is that even gpt-5-mini can create decent Go CLI apps provided you write down a coherent spec and review the code as if it was a peer’s pull request (the VS Code base prompts and tooling steer even dumb models through a pretty decent workflow).
GPT 5.2 and the codex variants are, to me, every bit as good as Opus but without the groveling and emojis - I can ask it to build an entire CI workflow and it does it in pretty much one shot if I give it the steps I want.
So for me at least this model generation is a huge force multiplier (but I’ve always been the type to plan before coding and reason out most of the details before I start, so it might be a matter of method).
Time will tell what happens, but if programming becomes "prompt engineering", I'm planning on quitting my job and pivoting to something else. It's nice to get stuff working fast, but AI just sucks the joy out of building for me.
Trying to not feel the pressure/anxiety from this, but every time a new model drops there is this tiny moment where I think "Is it actually different this time?"
I appreciate the spirited debate and I agree with most of it - on both sides. It's a strange place to be where I think both arguments for and against this case make perfect sense. All I have to go on then is my personal experience, which is the only objective thing I've got. This entire profession feels stochastic these days.
A few points of clarification...
1. I don't speak for anyone but myself. I'm wrong at least half the time so you've been warned.
2. I didn't use any fancy workflows to build these things. Just used dictation to talk to GitHub Copilot in VS Code. There is a custom agent prompt toward the end of the post I used, but it's mostly to coerce Opus 4.5 into using subagents and context7 - the only MCP I used. There is no plan, implement - nothing like that. On occasion I would have it generate a plan or summary, but no fancy prompt needed to do that - just ask for it. The agent harness in VS Code for Opus 4.5 is remarkably good.
3. When I say AI is going to replace developers, I mean that in the sense that it will do what we are doing now. It already is for me. That said, I think there's a strong case that we will have more devs - not less. Think about it - if anyone with solid systems knowledge can build anything, the only way you can ship more differentiating features than me is to build more of them. That is going to take more people, not more agents. Agents can only scale as far as the humans who manage them.
New account because now you know who I am :)
A JavaScript interpreter written in Python? How about a WebAssembly runtime in Python? How about porting BurntSushi's absurdly great Rust optimized string search routines to C and making them faster?
And these are mostly just casual experiments, often run from my phone!
This weekend I explained to Claude what I wanted the app to do, and then gave it the crappy code I wrote 10 years ago as a starting point.
It made the app exactly as I described it the first time. From there, now that I had a working app that I liked, I iterated a few times to add new features. Only once did it not get it correct, and I had to tell it what I thought the problem was (that it made the viewport too small). And after that it was working again.
I did in 30 minutes with Claude what I had tried to do in a few hours previously.
Where it got stuck however was when I asked it to convert it to a screensaver for the Mac. It just had no idea what to do. But that was Claude on the web, not Claude Code. I'm going to try it with CC and see if I can get it.
I also did the same thing with a Chrome plugin for Gmail. Something I've wanted for nearly 20 years, and could never figure out how to do (basically sort by sender). I got Opus 4.5 to make me a plugin to do it and it only took a few iterations.
I look forward to finally getting all those small apps and plugins I've wanted forever.
Yes opus 4.5 seems great but most of the time it tries to vastly over complicate a solution. Its answer will be 10x harder to maintain and debug than the simpler solution a human would have created by thinking about the constraints of keeping code working.
Regardless of how much you value Claude Code technically, there is no denying that it has, or will have, a huge impact. If technology knowledge and development are commoditised and distributed via subscription, huge societal changes are going to happen. Imagine what will happen to Ireland if Accenture dissolves, or to the millions of Indians when IT outsourcing becomes economically irrelevant. Will Seattle become the new Detroit after Microsoft automates Windows maintenance? What about the hairdressers, cooks, lawyers, etc. who provided services for IT labourers and companies in California?
A lot of people here (especially Anthropic-adjacent) like to extrapolate the trends and draw conclusions up to the point where they say white-collar labourers will not be needed anymore. I would like these people to have the courage to take this one step further and connect that conclusion with the housing crisis, the loneliness epidemic, college debt, and the job-market crisis for people under 30.
It feels like we are diving head first into societal crisis of unparalleled scale and the people behind the steering wheel are excited to push the accelerator pedal even more.
If anything this example shows that these cli tools give regular devs much higher leverage.
There's a lot of software labor that is like, go to the lowest cost country, hire some mediocre people there and then hire some US guy to manage them.
That's the biggest target of this stuff, because now that US guy can just get code of equal or higher quality and output without the coordination cost.
But unless we get to the point where you can do what I call "hypercode" I don't think we'll see SWEs as a whole category die.
Just as most of us no longer write assembly but still need technical skills when things go wrong, there's always value in low-level technical skills.
This project would have taken me years of specialization and research to do right. Opus's strength has been the ability to both speak broadly and also drill down into low-level implementations.
I can express an intent, and have some discussion back and forth around various possible designs and implementations to achieve my goals, and then I can be preparing for other tasks while Opus works in the background. I ask Opus to loop me in any time there are decisions to be made, and I ask it to clearly explain things to me.
Contrary to losing skills, I feel that I have rapidly gained a lot of knowledge about low-level systems programming. It feels like pair programming with an agentic model has finally become viable.
I will be clear though, it takes the steady hand of an experience and attentive senior developer + product designer to understand how to maintain constraints on the system that allow the codebase to grow in a way that is maintainable on the long-term. This is especially important, because the larger the codebase is, the harder it becomes for agentic models to reason holistically about large-scale changes or how new features should properly integrate into the system.
If left to its own devices, Opus 4.5 will delete things, change specification, shirk responsibilities in lieu of hacky band-aids, etc. You need to know the stack well so that you can assist with debugging and reasoning about code quality and organization. It is not a panacea. But it's ground-breaking. This is going to be my most productive year in my life.
On the flip side though, things are going to change extremely fast once large-scale, profitable infrastructure becomes easily replicable, and spinning up a targeted phishing campaign takes five seconds and a walk around the park. And our workforce will probably start shrinking permanently over the next few years if progress does not hit a wall.
Among other things, I do predict we will see a resurgence of smol web communities now that independent web development is becoming much more accessible again, closer to how it was when I first got into it back in the early 2000s.
However, Opus 4.5 is incredible when you give it everything it needs: a direction, what you have versus what you want. It will make it work; really, it will work. The code might be ugly, undesirable, and only work for that one condition, but with further prompting you can evolve it and produce something you can be proud of.
Opus is only as good as the user and the tools the user gives to it. Hmm, that's starting to sound kind-of... human...
I read If Anyone Builds It Everyone Dies over the break. The basic premise was that we can't "align" AI so when we turn it loose in an agent loop what it produces isn't necessarily what we want. It may be on the surface, to appease us and pass a cursory inspection, but it could embed other stuff according to other goals.
On the whole, I found it a little silly and implausible, but I'm second guessing parts of that response now that I'm seeing more people (this post, the Gas Town thing on the front page earlier) go all-in on vibe coding. There is likely to be a large body of running software out there that will be created by agents and never inspected by humans.
I think a more plausible failure mode in the near future (next year or two) is something more like a "worm". Someone building an agent with the explicit instructions to try to replicate itself. Opus 4.5 and GPT 5.2 are good enough that in an agent loop they could pretty thoroughly investigate any system they land on, and try to use a few ways to propagate their agent wrapper.
Can't wait for when the competition catches up with Claude Code, especially the open source/weights Chinese alternatives :)
i.e. well-known paths based on training data.
What's never posted is someone building something that solves a real problem in the real world, something that deals with messy data or messy interfaces.
I like AI to do the common routine tasks that I don't like to do, like applying Tailwind styles, but being a renter and faking productivity? That's not it.
I would love to hear from some freelance programmers how LLMs have changed their work in the last two years.
Yeah, GRAMMAR
For all the wonderment of the article, tripping up on a penultimate word that was supposedly checked by AI suddenly calls into question everything that went before...
There is no fixed truth regarding what an "app" is, does, or looks like. Let alone the device it runs on or the technology it uses.
But to an LLM, there are only fixed truths (and in my experience, only three or four possible families of design for an application).
Opus 4.5 produces correct code more often, but when the human at the keyboard is trying to avoid making any engineering decisions, the code will continue to be boring.
I'm not shaming, but I personally need to know whether my sentiment is correct or whether I just don't know how to use LLMs.
Can vibe-coder gurus create an operating system from scratch that competes with Linux, and make the LLM generate code that basically isn't Linux, given that LLMs are trained on that very source code…
Also, all this on the $20 plan. A free, self-hosted solution would be best.
I will note that my experience varies slightly by language though. I’ve found it’s not as good at typescript.
As many other commentators have said, individual results vary extremely widely. I'd love to be able to look at the footage of either someone who claims a 10x productivity increase, or someone who claims no productivity increase, to see what's happening.
> Disclaimer: This post was written by a human and edited for spelling, grammer by Haiku 4.5
I have used AskUserQuestionTool to complete my initial spec. And then Opus 4.5 created the tool according to that extensive and detailed spec.
It appeared to work out of the box.
Boy how horrific was the code. Unnecessary recursions, unused variables, data structures being built with no usage, deep branch nesting and weird code that is hard to understand because of how illogical it is.
And yes, it was broken on many levels and did not and could not do the job properly.
I then had to rewrite the tool from scratch, and overall I definitely spent more time spec'ing and understanding Claude's code than if I had just written the tool from scratch in the first place.
Then I tried again for a small tool I needed to run codesign in parallel: https://github.com/egorFiNE/codesign-parallel
Same thing. Same outcome, had to rewrite.
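For scale: the heart of a run-things-in-parallel tool like that is a handful of lines, which is part of why hand-writing it won. A rough sketch of the shape such a tool takes; this is my own illustration, not the linked repo's actual code, and the codesign flags in the comment are assumptions:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_parallel(base_cmd, targets, max_workers=4):
    # Run base_cmd once per target, e.g.
    #   base_cmd = ["codesign", "--force", "--sign", "Developer ID ..."]
    # Threads are fine here: the real work happens in the subprocesses.
    def one(target):
        proc = subprocess.run(base_cmd + [target],
                              capture_output=True, text=True)
        return target, proc.returncode
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(one, targets))
```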
IMO, our jobs are safe. It's our ways of working that are changing. Rapidly.
Especially if you're in a place where a lot of time was previously spent revising PRs for best practices, even for human-submitted code, having the LLM do that for you saves a bunch of time. Most humans are bad at following those consistently.
There's a lot of stuff where I'm pretty sure I'm up to at least 2x speed now. And for things like making CLI tools or bash scripts, 10x-20x. But in terms of "the overall output of my day job in total", probably more like 1.5x.
But I think we will need a couple of major leaps in tooling, probably deterministic tooling rather than LLM tooling, before anyone could responsibly ship code nobody has ever read in situations with millions of dollars on the line. (That is different from vibe-coding something that ends up making millions; that's a low-risk, high-reward situation where big bets on doing things fast make sense. If you're already making millions, dramatic changes like that can become high-risk, low-reward very quickly. In those companies, "I know that only touching these files is 99.99% likely to be completely safe for security-critical functionality" and similar "obvious" intuition makes up for the lack of ability to exhaustively test software in a practical way, even with fuzzers and the like, and "I didn't even look at the code" is conceding responsibility to a dangerous degree there.)
I'm assuming this refers to the python port of Bellard's MQJS [1]? It's impressive and very useful, but leaving out the "based on mqjs" part is misleading.
How did it do? :-)
It produced tests, then wrote the interpreter, then ran the tests and worked until all of them passed. I was genuinely surprised that it worked.
Meanwhile half a year to a year ago I could already point whatever model was du jour at the time at pychromecast and tell it repeatedly "just convert the rest of functionality to Swift" and it did it. No idea about the quality of code, but it worked alongside with implementations for mDNS, and SwiftUI, see gif/video here: https://mastodon.nu/@dmitriid/114753811880082271 (doesn't include chromecast info in the video).
I think agents have become better, but models likely almost entirely plateaued.
I'm fine with contributed AI-generated code if someone whose skills I respect is willing to stake their reputation on that code being good.
I just cut-pasted a technical spec I wrote 22 years ago, one I spent months on, for a language I never got around to building out, and Opus zero-shotted a parser, complete with tests and examples, in 3 minutes. I cut-pasted the parser into a new session and asked it to write concept documentation and a language reference, and it did. The best part is that after asking it to produce uses of the language, it's clear the aesthetics are total garbage in practice.
Told friends for years long in advance that we were coal miners, and I'll tell you the same thing. Embrace it and adapt
It is a well known fact that people advance their tech careers by building something new and leaving maintenance to others. Google is usually mentioned.
By which I mean, our industry does a piss poor job of rewarding responsibility and care.
I'll write the code, it can help me explore options, find potential problems and suggest tests, but I'll write the code.
And it turns out the quality of output you get from both the humans and the models is highly correlated with the quality of the specification you write before you start coding.
Letting a model run amok within the constraints of your spec is actually great for specification development! You get instant feedback of what you wrongly specified or underspecified. On top of this, you learn how to write specifications where critical information that needs to be used together isn't spread across thousands of pages - thinking about context windows when writing documentation is useful for both human and AI consumers.
Stuff that seems basic, but that I haven't always been able to count on in my teams' "production" code.
Over time, I imagine even cloud providers, app stores etc can start doing automated security scanning for these types of failure modes, or give a more restricted version of the experience to ensure safety too.
I predict in 2026 we're going to see agents get better at running their own QA, and also get better at not just disabling failing tests. We'll continue to see advancements that will improve quality.
Connect the world with reliable internet, then build a high tech remote control facility in Bangladesh and outsource plumbing, electrical work, housekeeping, dog watching, truck driving, etc etc
No AGI necessary. There’s billions of perfectly capable brains halfway around the world.
In other words, only one side is even fighting the war. The other is either cheering the tsunami on or fretting about how their beachside house will get wrecked without making any effort to save themselves.
This is the sort of collective agency that even hundreds of thousands of dollars in annual wages/other compensation in American tech hubs gets us. Pathetic.
It will have impact on me in the long run, sure, it will transform my job, sure, but I'm confident my skills are engineering-related, not coding-related.
I mean, even if it forces me out of the job entirely, so be it, I can't really do anything if the status quo changes, only adapt.
This is also my take. When the printing press came out, I bet there were scribes who thought, "holy shit, there goes my job!" But I bet there were other scribes who thought, "holy shit, I don't have to do this by hand any more?!"
It's one thing when something like weaving or farming gets automated. We have a finite need for clothes and food. Our desire for software is essentially infinite, or at least, it's not clear we have anywhere close to enough of it. The constraint has always been time and budget. Those constraints are loosening now. And you can't tell me that when I am able to wield a tool that makes me 10X more productive that that somehow diminishes my value.
I think for a while people have been talking about the fact that as all development tools have gotten better - the idea that a developer is a person who turns requirements into code is dead. You have to be able to operate at a higher level, be able to do some level of work to also develop requirements, work to figure out how to make two pieces of software work together, etc.
But the point is: obviously, at the extreme end, one CTO can't run Google, and probably not one PM or engineer per product either, but what is the mental load people can now take on? Google may start hiring fewer engineers (or maybe it becomes more cutthroat: hire the same number of engineers but keep them for much shorter stints, brutal up-or-out).
But essentially we're talking about complexity and mental load - And so maybe it's essentially the same number of teams because teams exist because they're the right size, but teams are a lot smaller.
I now write very long specifications and this helps. I haven't figured out a bulletproof workflow, I think that will take years. But I often get just amazing code out of it.
Codex/Claude Code are terrible with C++. They also can't do Rust really well once you get to the meat of it. Not sure why that is, but they just spit out nonsense that creates more work than it saves me. They also can't one-shot anything complete, even though I might feed them the entire paper that explains what the algorithm is supposed to do.
Try to do some OpenGL or Vulkan with it, without using WebGPU or three.js. Try it with real code, that all of us have to deal with every day. SDL, Vulkan RHI, NVRHI. Very frustrating.
Try it with boost, or cmake, or taskflow. It loses itself constantly, hallucinates which version it is working on and ignores you when you provide actual pointers to documentation on the repo.
I've also recently tried to get Opus 4.5 to move the Job system from Doom 3 BFG to the original codebase. Clean clone of dhewm3, pointed Opus to the BFG Job system codebase, and explained how it works. I have also fed it the Fabien Sanglard code review of the job system: https://fabiensanglard.net/doom3_bfg/threading.php
We are not sleeping on it, we are actually waiting for it to get actually useful. Sure, it can generate a full stack admin control panel in JS for my PostgreSQL tables, but is that really "not normal"? That's basic.
And I get it. Coding with Claude Code really was prompting something, getting errors, and asking it to fix them. Which was still useful, but I could see why a skilled coder adding a feature to a complex codebase would just give up.
Opus 4.5 really is at a new tier, however. It just... works. The errors are far fewer and often very minor - "careless" errors, not fundamental issues (like forgetting to add "use client" to a Next.js client component).
I had an idea for something that I wanted, and in five scattered hours, I got it good enough to use. I'm thinking about it in a few different ways:
1. I estimate I could have done it without AI with 2 weeks full-time effort. (Full-time defined as >> 40 hours / week.)
2. I have too many other things to do that are purportedly more important than programming. I really can't dedicate two weeks full-time to a "nice to have" project. So, without AI, I wouldn't have done it at all.
3. I could hire someone to do it for me. At the university, those are students. From experience with lots of advising, a top-tier undergraduate student could have achieved the same thing, had they worked full tilt for a semester (before LLMs). This of course assumes that I'm meeting them every week.
Nobody is sleeping. I'm using LLMs daily to help me in simple coding tasks.
But really, where is the hurry? At this point hardly a few weeks go by without the next best thing since sliced bread coming out. Why would I bother "learning" (and there's really nothing to learn here) some tool/workflow that is already outdated by the time it comes out?
> 2026 is going to be a wake-up call
Do you honestly think a developer not using AI won't be able to adapt to a LLM workflow in, say, 2028 or 2029? It has to be 2026 or... What exactly?
There is literally no hurry.
You're using the equivalent of the first portable CD-player in the 80s: it was huge, clunky, had hiccups, had a huge battery attached to it. It was shiny though, for those who find new things shiny. Others are waiting for a portable CD player that is slim, that buffers, that works fine. And you're saying that people won't be able to learn how to put a CD in a slim CD player because they didn't use a clunky one first.
claude can call ssh and do system admin tasks. It works amazingly well. I have 3 VMs which depend on each other (proxmox with openwrt, adguard, unbound), and claude can prove to me that my dns chains work perfectly, my firewalls are correct, etc., as claude can ssh into each. Setting up services, diagnosing issues, auditing configs... you name it. Just awesome.
claude can call other sh scripts on the machine, so over time you can create a bunch of scripts that let claude one-shot certain tasks that would normally eat tokens. It works great. One script per intention - don't have a script do more than one thing.
claude can call the compiler, run the debug executable, and read the debug logs... in real time. So claude can read my android app's debug stream via adb, or my C# debug console, because claude calls the compiler, not me. Just ask it to do it and it will diagnose stuff really quickly.
It can also analyze your db tables (give it readonly sql access), look at the application code and queries, and diagnose performance issues.
The opportunities are endless here. People need to wake up to this.
> have it learn your conventions, pull in best practices
What do you mean by "have it learn your conventions"? Is there a way to somehow automatically extract your conventions and store it within CLAUDE.md?
> For example, we have a custom UI library, and Claude Code has a skill that explains exactly how to use it. Same for how we write Storybooks, how we structure APIs, and basically how we want everything done in our repo. So when it generates code, it already matches our patterns and standards out of the box.
Did you have to develop these skills yourself? How much work was that? Do you have public examples somewhere?
Maintain and debug by who? It's just going to be Opus 4.5 (and 4.6...and 5...etc.) that are maintaining and debugging it. And I don't think it minds, and I also think it will be quite good at it.
something like code-simplifier is surprisingly useful (as is /review)
I thought I'd revive it, but this time with Vulkan and no third-party dependencies (except for Vulkan)
Sonnet 4.5, Opus, and Gemini 3.5 Flash have helped me write image decoders for dds, png, jpg, exr, a wayland window implementation, a macOS window implementation, etc.
I find that Gemini 3.5 flash is really good at understanding 3d in general while sonnet might be lacking a little.
All these sota models seem to understand my bespoke Lua framework and the right level of abstraction. For example at the low level you have the generated Vulkan bindings, then after that you have objects around Vulkan types, then finally a high level pipeline builder and whatnot which does not mention Vulkan anywhere.
However with a larger C# codebase at work, they really struggle. My theory is that there are too many files and abstractions so that they cannot understand where to begin looking.
I think it's also rewarding to just be able to build something for yourself, and one benefit of scratching your own itch is that you don't have to go through the full effort of making something "production ready". You can just build something that's tailored specifically to the problem you're trying to solve without worrying about edge cases.
Which is to say, you're absolutely right :).
I interpret it more as spooked silence
Stopping there is just fine if you're doing it as a hobby. I love to do this to test out isolated ideas. I have dozens of RPGs in this state, just to play around with different design concepts from technical to gameplay.
Given the enthusiasm of our ruling class for automating software development work, it may make sense for a software engineer to publicly signal how on board with it they are as professionals.
But, I've seen stranger stuff throughout my professional life: I still remember people enthusiastically defending EJB 2.1 and xdoclet as perfectly fine ways of writing software.
I wouldn't be surprised to find out that they will find issues infinitely, if looped with fixes.
Have you tried the planning mode? Ask it to review the codebase and identify defects, but don't let it make any changes until you've discussed each one or each category and planned out what to do to correct them. I've had it refactor code perfectly, but only when given examples of exactly what you want it to do, or given clear direction on what to do (or not to do).
Once Opus "finished", how did you validate and give it feedback it might not have access to (like iPhone simulator testing)?
Yes, feel free to blame me for the fact that these aren’t very business-realistic.
This is basically all my side projects.
Because types are proofs and require global correctness, you can't just iterate, fix things locally, and wait until it breaks somewhere else that you also have to fix locally.
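To make the "global correctness" point concrete, here's a minimal Rust sketch (invented names, not from any particular project): once a function's type admits failure, the compiler forces every caller, however distant, to acknowledge it before the program builds again.

```rust
use std::num::ParseIntError;

// Suppose this used to return a plain u16. Changing it to Result is a
// global event: every call site in the whole program now fails to
// compile until it handles the error case.
fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    s.parse::<u16>()
}

fn connect_string(host: &str, port_str: &str) -> Result<String, ParseIntError> {
    // The `?` here is mandatory: the compiler will not let this caller
    // silently ignore the failure introduced above.
    let port = parse_port(port_str)?;
    Ok(format!("{host}:{port}"))
}

fn main() {
    assert_eq!(connect_string("db.local", "5432").unwrap(), "db.local:5432");
    assert!(connect_string("db.local", "not-a-port").is_err());
}
```

That's exactly why "fix it locally and iterate" doesn't work here: the proof obligation propagates to every caller at once.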
That's not how I read it. I would say that it's more like "If a human no longer needs to read the code, is it important for it to be readable?"
That is, of course, based on the premise that AI is now capable of both generating and maintaining software projects of this size.
Oh, and it raises another question: are human-readable and AI-readable the same thing? If they're not, it very well could make sense to instruct the model to generate code that prioritizes what matters to LLMs over what matters to humans.
I haven't quite got the WASM one into a share-able shape yet though - the performance is pretty bad which makes the demos not very interesting.
It couldn't quite beat the Rust implementation on everything, but it managed to edge it out on at least some of the benchmarks it wrote for itself.
(Honestly it feels like a bit of an affront to the natural order of things.)
That said... I'm most definitely not a Rust or C programmer. For all I know it cheated at the benchmarks and I didn't spot it!
I once gave Claude (Opus 3.5) a problem that I thought was for sure too difficult for an LLM, and much to my surprise it spat out a very convincing solution. The surprising part was I was already familiar with the solution - because it was almost a direct copy/paste (uncredited) from a blog post that I read only a few hours earlier. If I hadn't read that blog post, I would have been none the wiser that copy/pasting Claude's output would be potential IP theft. I would have to imagine that LLMs solve a lot of in-training-set problems this way and people never realize they are dealing with a copyright/licensing minefield.
A more interesting and convincing task would be to write a Python 3 interpreter in JavaScript that uses register-based bytecode instead of stack-based, supports optimizing the bytecode by inlining procedures and constant folding, and never allocates memory (all work is done in a single user-provided preallocated buffer). This would require integrating multiple disparate coding concepts, not regurgitating prior art from the training data.
[1] https://github.com/skulpt/skulpt
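For what it's worth, the stack-vs-register distinction in that challenge can be made concrete with a toy VM. This is a hedged sketch in Rust rather than the proposed JavaScript interpreter; `Op`, `run`, and `demo` are all invented names:

```rust
// Toy sketch of the register-based idea only (not the proposed Python
// interpreter). A stack machine computes 1 + 2 * 3 as:
// PUSH 1, PUSH 2, PUSH 3, MUL, ADD.
// A register machine names its operands explicitly instead:
#[derive(Clone, Copy)]
enum Op {
    LoadConst { dst: usize, val: i64 },     // r[dst] = val
    Mul { dst: usize, a: usize, b: usize }, // r[dst] = r[a] * r[b]
    Add { dst: usize, a: usize, b: usize }, // r[dst] = r[a] + r[b]
}

// The interpreter loop works entirely in a caller-provided register
// file, mirroring the "never allocates" constraint above.
fn run(program: &[Op], regs: &mut [i64]) -> i64 {
    for op in program {
        match *op {
            Op::LoadConst { dst, val } => regs[dst] = val,
            Op::Mul { dst, a, b } => regs[dst] = regs[a] * regs[b],
            Op::Add { dst, a, b } => regs[dst] = regs[a] + regs[b],
        }
    }
    regs[0]
}

// 1 + 2 * 3, compiled for the register machine.
fn demo() -> i64 {
    let program = [
        Op::LoadConst { dst: 0, val: 1 },
        Op::LoadConst { dst: 1, val: 2 },
        Op::LoadConst { dst: 2, val: 3 },
        Op::Mul { dst: 1, a: 1, b: 2 },
        Op::Add { dst: 0, a: 0, b: 1 },
    ];
    let mut regs = [0i64; 4];
    run(&program, &mut regs)
}

fn main() {
    assert_eq!(demo(), 7);
}
```

A constant folder would notice every operand here is a constant and collapse the whole program to a single `LoadConst` of 7 before execution; that optimization layer is where the challenge gets genuinely hard.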
Though it seems to work best when context is minimized. Once the code passes a certain complexity/size it starts making very silly errors quite often - the same exact code it wrote in a smaller context will come out with random obvious typos like missing spaces between tokens. At one point it started writing the code backwards (first line at the bottom of the file, last line at the top) :O.
Even better if there's an existing conformance suite to point at - like html5lib-tests or the WebAssembly spec tests.
I can’t get past that by the time I write up an adequate spec and review the agents code, I probably could have done it myself by hand. It’s not like typing was even remotely close to the slow part.
AI, agents, etc are insanely useful for enhancing my knowledge and getting me there faster.
I hope Neuromancer never becomes a reality, where everyone with expertise could become like the protagonist Case, threatened and coerced into helping a superintelligence to unlock its potential. In fact Anthropic has already published research that shows how easy it is for models to become misaligned and deceitful against their unsuspecting creators not unlike Wintermute. And it seems to be a law of nature that agents based on ML become concerned with survival and power grabbing. Because that's just the totally normal and rational, goal oriented thing for them to do.
There will be no good prompt engineers who are also naive and trusting. The naive, blackmailed and non-paranoid engineers will become tools of their AI creations.
UBI effectively means welfare, with all the attendant social control (break the law, lose your UBI, with the law as an ever-expanding set of nuisances, speech limitations, etc.), material conditions (nowhere UBI has been implemented is it equivalent to a living wage), and self-esteem issues. It's not any kind of solution.
If you think those in power will pass regulations that make them less wealthy, I have a bridge to sell you.
Besides, there's no chance something like UBI will ever be a reality in countries where people consider socialism to be a threat to their way of life.
So a lot of people will end up doing something different. Some of it will be menial and be shit, and some of it will be high level. New hierarchies and industries will form. Hard to predict the details, but history gives us good parallels.
If AI datacenters' hungry need for energy gets us to nuclear power, which gets us the energy to run desalination plants as the lakes dry up because the Earth is warming, hopefully we won't die of thirst.
I don't understand this argument. Surely the skill set involved in being a scribe isn't the same as being a printer, and possibly the personality that makes a good scribe doesn't translate to being a good printer.
So I imagine many of the scribes lost their income, and other people made money on printing. Good for the folks who make it in the new profession, sucks for those who got shafted. How many scribes transitioned successfully to printers?
Genuinely asking, I don't know.
If I didn't ultimately understand where I was going, projects like this hit a dead end very quickly, as mentioned in my caveats. These models are not yet ready for large-scale or mission-critical projects.
But I have a set of constraints and a design document, and as long as these things are satisfied, the language will work exactly as intended for my use case.
Not using a frontier model to code today is like having a pretty smart person around you who is pretty good at coding and has a staggering breadth and depth of knowledge, but never consulting them due to some insecurity about your own ability to evaluate the code they produce.
If you have ever been responsible for the work of other engineers, this should already be a developed skill.
I have very specific requirements and constraints that come from knowledge and experience, having worked with dozens of languages. The language in question is general-purpose, highly flexible and strict but not opinionated.
However, I am not experienced in every single platform and backend which I support, and the constraints of the language create some very interesting challenges. Coding agents make this achievable in a reasonable time frame. I am enjoying making the language, and I want to get experience with making low-level languages. What is the problem? Do you ever program for fun?
With some entirely novel work we're doing, it's actually a hindrance as it consistently tells us the approach isn't valid/won't work (it will) and then enters "absolutely right" loops when corrected.
I still believe those who rave about it are not writing anything I would consider "engineering". Or perhaps it's a skill issue and I'm using it wrong, but I haven't yet met someone I respect who tells me it's the future in the way those running AI-based companies tell me.
Which makes sense. I'm sure there's lots of training data for React/HTML/CSS/etc. but much less with Swift, especially the newer versions.
I have not had this experience at all. It often doesn't get it right on the first pass, yes, but the advantage with Rust vibecoding is that if you give it a rule to "Always run cargo check before you think you're finished" then it will go back and fix whatever it missed on the first pass. What I find particularly valuable is that the compiler forces it to handle all cases like match arms or errors. I find that it often misses edge cases when writing typescript, and I believe that the relative leniency of the typescript compiler is why.
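The exhaustiveness guarantee described here can be shown with a small sketch (invented names; assuming an agent that re-runs `cargo check` until it passes):

```rust
// Sketch of the exhaustiveness point above: adding a new variant to
// `Event` turns this match into a compile error until the new case is
// handled, so an agent running `cargo check` cannot silently skip it.
enum Event {
    Click { x: i32, y: i32 },
    KeyPress(char),
    Resize(u32, u32),
}

fn describe(e: &Event) -> String {
    match e {
        Event::Click { x, y } => format!("click at ({x}, {y})"),
        Event::KeyPress(c) => format!("key '{c}'"),
        Event::Resize(w, h) => format!("resize to {w}x{h}"),
        // Deliberately no `_ => ...` arm: a catch-all would restore the
        // TypeScript-style leniency the commenter is contrasting with.
    }
}

fn main() {
    assert_eq!(describe(&Event::KeyPress('q')), "key 'q'");
    assert_eq!(describe(&Event::Resize(800, 600)), "resize to 800x600");
}
```

The check-fix loop works precisely because the compiler, not the model, is the one keeping track of the missing cases.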
In a similar vein, it is quite good at writing macros (or at least, quite good given how difficult this otherwise is). You often have to cajole it into not hardcoding features into the macro, but since macros resolve at compile time they're quite well-suited for an LLM workflow as most potential bugs will be apparent before the user needs to test. I also think that the biggest hurdle of writing macros to humans is the cryptic compiler errors, but I can imagine that since LLMs have a lot of information about compilers and syntax parsing in their training corpus, they have an easier time with this than the median programmer. I'm sure an actual compiler engineer would be far better than the LLM, but I am not that guy (nor can I afford one) so I'm quite happy to use LLMs here.
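As a hedged illustration of "not hardcoding features into the macro", here is a small `macro_rules!` sketch (all names invented) that takes the struct definition as input rather than baking any particular fields in:

```rust
// A declarative macro that generates a struct plus getters from its
// input, rather than hardcoding any specific fields. Because it expands
// at compile time, a broken expansion fails `cargo check` immediately,
// before any user testing, as described in the comment above.
macro_rules! make_getters {
    ($name:ident { $($field:ident : $ty:ty),* $(,)? }) => {
        struct $name { $($field: $ty),* }
        impl $name {
            $(fn $field(&self) -> &$ty { &self.$field })*
        }
    };
}

// The caller, not the macro, decides the type name and fields.
make_getters!(Config { host: String, port: u16 });

fn main() {
    let c = Config { host: "localhost".into(), port: 8080 };
    assert_eq!(c.host(), "localhost");
    assert_eq!(*c.port(), 8080);
}
```

The cryptic part for humans is usually the expansion errors; the LLM's advantage, as the commenter suggests, is familiarity with exactly that kind of parser/compiler output.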
For context, I am purely a webdev. I can't speak for how well LLMs fare at anything other than writing SQL, hooking up to REST APIs, React frontend, and macros. With the exception of macros, these are all problems that have been solved a million times thus are more boilerplate than novelty, so I think it is entirely plausible that they're very poor for different domains of programming despite my experiences with them.
I had to dig through source code to confirm whether those features actually existed. They don't, so the CLI tools GPT recommended aren't actually applicable to my use case.
Yesterday, it hallucinated features of WebDav clients, and then talked up an abandoned and incomplete project on GitHub with a dozen stars as if it was the perfect fit for what I was trying to do, when it wasn't.
I only remember these because they're recent and CLI related, given the topic, but there are experiences like this daily across different subjects and domains.
I think the following things are true now:
- Vibe Coding is, more than ever, "autopilot" in the aviation sense, not the colloquial sense. You have to watch it, you are responsible, and the human has to run takeoff/landing (the hard parts), but it significantly eases and reduces risk on the bulk of the work.
- The gulf of developer experience between today's frontier tooling and six months ago is huge. I pushed hard to understand and use these tools throughout last year, and spent months discouraged--back to manual coding. Folks need to re-evaluate by trying premium tools, not free ones.
- Tooling makers have figured out a lot of neat hacks to work around the limitations of LLMs to make it seem like they're even better than they are. Junie integrates with your IDE, Antigravity has multiple agents maintaining background intel on your project and priorities across chats. Antigravity also compresses contexts and starts new ones without you realizing it, calls to sub-agents to avoid context pollution, and other tricks to auto-manage context.
- Unix tools (sed, grep, awk, etc.) and the git CLI (ls-tree, show, --stat, etc.) have been a huge force-multiplier, as they keep the context small compared to raw ingestion of an entire file, allowing the LLMs to get more work done in a smaller context window.
- The people who hire programmers are still not capable of Vibe Coding production-quality web apps, even with all these improvements. In fact, I believe today this is less of a risk than I feared 10 months ago. These are advanced tools that need constant steering, and a good eye for architecture, design, developer experience, test quality, etc. is the difference between my vibe coded Ruby [0] (which I heavily stewarded) and my vibe coded Rust [1] (I don't even know what borrow means).
[0]: https://git.sr.ht/~kerrick/ratatui_ruby/tree/stable/item/lib
[1]: https://git.sr.ht/~kerrick/ratatui_ruby/tree/stable/item/ext...
1) It often overcomplicates things for me. After I refactor its code, it's usually half the size and much more readable. It often adds unnecessary checks or mini-features 'just in case' that I don't need.
2) On the other hand, almost every function it produces has at least one bug or ignores at least one instruction. However, if I ask it to review its own code several times, it eventually finds the bugs.
I still find it very useful, just not as a standalone programming agent. My workflow is that ChatGPT gives me a rough blueprint and I iterate on it myself, I find this faster and less error-prone. It's usually most useful in areas where I'm not an expert, such as when I don't remember exact APIs. In areas where I can immediately picture the entire implementation in my head, it's usually faster and more reliable to write the code myself.
This experience for me is current but I do not normally use Opus so perhaps I should give it a try and figure out if it can reason around problems I myself do not foresee (for example a browser JS API quirk that I had never seen).
I hear you, but I think many companies will change the role; you'll get technical ownership plus big chunks of the data/product/devops responsibility. I'm speculating, but I think one person can take that on with the new tools and deliver tremendous value. I don't know what they'll call this new role, though; we'll see.
This reminds me of that example where someone asked an agent to improve a codebase in a loop overnight and they woke up to 100,000 lines of garbage [0]. Similarly you see people doing side-by-side of their implementation and what an AI did, which can also quite effectively show how AI can make quite poor architecture decisions.
This is why I think the "plan modes" and spec-driven development are so effective for agents: they help avoid one of their main weaknesses.
So you gave it a poorly defined task, and it failed?
You can see some of the context limits here:
If you want the full capability, use the API and use something like opencode. You will find that a single PR can easily rack up 3 digits of consumption costs.
I don't think you've seen the full potential. I'm currently #1 on 5 different very complex computer engineering problems, and I can't even write a "hello world" in rust or cpp. You no longer need to know how to write code, you just need to understand the task at a high level and nudge the agents in the right direction. The game has changed.
- https://highload.fun/tasks/3/leaderboard
- https://highload.fun/tasks/12/leaderboard
- https://highload.fun/tasks/15/leaderboard
Try it again using Claude Code and a subscription to Claude. It can run as a chat window in VS Code and Cursor too.
Sure, Copilot charges 3x tokens for using Opus 4.5, but, how were you still able to use up half the allocated tokens not even one week into January?
I thought using up 50% was mad for me (inline completions + opencode), that's even worse
Starting back in 2022/2023:
- (~2022) It can auto-complete one line, but it can't write a full function.
- (~2023) Ok, it can write a full function, but it can't write a full feature.
- (~2024) Ok, it can write a full feature, but it can't write a simple application.
- (~2025) Ok, it can write a simple application, but it can't create a full application that is actually a valuable product.
- (~2025+) Ok, it can write a full application that is actually a valuable product, but it can't create a long-lived complex codebase for a product that is extensible and scalable over the long term.
It's pretty clear to me where this is going. The only question is how long it takes to get there.
I've worked on teams where multiple engineers argued about the "right" way to build something. I remember thinking that they had biases based on past experiences and assumptions about what mattered. It usually took an outsider to proactively remind them what actually mattered to the business case.
I remember cases where a team of engineers built something the "right" way but it turned out to be the wrong thing. (Well engineered thing no one ever used)
Sometimes hacking something together messily to confirm it's the right thing to be building is the right way. Then making sure it's secure, then finally paying down some technical debt to make it more maintainable and extensible.
Where I see real silly problems is when engineers over-engineer from the start before it's clear they are building the right thing, or when management never lets them clean up the code base to make it maintainable or extensible when it's clear it is the right thing.
There's always a balance/tension, but it's when things go too far one way or another that I see avoidable failures.
- How quickly is cost of refactor to a new pattern with functional parity going down?
- How does that change the calculus around tech debt?
If engineering uses 3 different abstractions in inconsistent ways that leak implementation details across components and duplicate functionality in ways that are very hard to reason about, that is, in conventional terms, an existential problem that might kill the entire business, as all dev time will end up consumed by bug fixes and dealing with pointless complexity, velocity will fall to nothing, and the company will stop being able to iterate.
But if claude can reliably reorganize code, fix patterns, and write working migrations for state when prompted to do so, it seems like the entire way to reason about tech debt has changed. And it has changed more if you are willing to bet that models within a year will be much better at such tasks.
And in my experience, claude is imperfect at refactors and still requires review and a lot of steering, but it's one of the things it's better at, because it has clear requirements and testing workflows already built to work with around the existing behavior. Refactoring is definitely a hell of a lot faster than it used to be, at least on the few I've dealt with recently.
In my mind it might be kind of like thinking about financial debt in a world with high inflation, in that the debt seems like it might get cheaper over time rather than more expensive.
You’re talking like in the year 2026 we’re still writing code for future humans to understand and improve.
I fear we are not doing that. Right now, Opus 4.5 is writing code that later Opus 5.0 will refactor and extend. And so on.
Opus is great, definitely speeds up development even in larger code bases, and is reasonably good at matching coding style/standards to those of the existing code base.
In my opinion, the big issue is the relatively small context that quickly overwhelms the models when given a larger task on a large codebase.
For example, I have a largish enterprise grade code base with nice enterprise grade OO patterns and class hierarchies. There was a simple tech debt item that required refactoring about 30-40 classes to adhere to a slightly different class hierarchy. The work is not difficult, just tedious, especially as unit tests need to be fixed up.
I threw Opus at it with very precise instructions as to what I wanted it to do and how I wanted it to do it. It started off well but then disintegrated once it got overwhelmed by the sheer number of files it had to change. At some point it got stuck in an error loop where one change it made contradicted another, and it just couldn't work itself out. I tried stopping it and helping it out, but at that point the context was so polluted that it just couldn't see a way out. I'd say that once an LLM can handle more 'context' than a senior dev with good knowledge of a large codebase, LLMs will be viable in a whole new realm of development tasks on existing code bases. That 'too hard to refactor this/make this work with that' task will suddenly become viable.
Why do they need to be replaced? Programmers are in the perfect place to use AI coding tools productively. It makes them more valuable.
How do you stop it from over-engineering everything?
When I use Claude Code I find that it *can* add a tremendous amount of value due to its ability to see my entire codebase at once, but the issue is that if I'm doing something where seeing my entire codebase would help, it blasts through my quota too fast. And if I'm tightly scoping it, it's just as easy and faster for me to use the website.
Because of this I've shifted back to the website. I find that I get more done faster that way.
Sharing how you're using these tools is quite a lot of work!
Is it possible to create an obfuscated quine that exhibits stable detection-avoiding behavior on every frontier model simultaneously, as well as on an old-school classifier and/or GPT-3 era LLM fine-tuned just for worm detection? One incapable of even thinking about what it's seeing, and being persuaded to follow its subtle propagation logic? I'm not sure that the answer is yes.
The larger issue to me is less that an LLM can propagate in generated code undetected, but rather that an attacker's generated code may soon be able to execute a level of hyper-customized spear-phishing-assisted attack at scale, targeting sites without large security teams - and that it will be hitting unintentional security flaws introduced by those smaller companies' vibe code. Who needs a worm when you have the resources of a state-level attacker at your fingertips, and numerous ways to monetize? The balance of power is shifting tremendously towards black hats, IMO.
And everything worked really well until they switched chip set.
At which point the same model failed entirely. Upon inspection it turned out the AI model had learned that overloading particular registers would cause such an electrical charge buildup that transistors on other pathways would be flipped.
And it was doing this in a coordinated manner in order to get the results it wanted lol.
I can't find any references in my very cursory searches, but your comment reminded me of the story
Most RCEs, 0-days, and whatnots are not due to the NSA hiding behind the "Jia Tan" pseudo to try to backdoor all the SSH servers on all the systemd [1] Linuxes in the world: they're just programmer errors.
I think accidental security holes with LLMs are way, way, way more likely than actual malicious attempts.
And with the amount of code spouted by LLMs, it is indeed an issue, as is the lack of auditing.
[1] I know, I know: it's totally unrelated to systemd. Yet only systems using systemd would have been pwned. If you're pro-systemd you've got your point of view on this but I've got mine and you won't change my mind so don't bother.
https://thermal-bridge.streamlit.app/
Disclaimer: I'm not a programmer or software engineer. I have a background in physics and understand some scripting in Python and basic git. The code is messy at the moment because I am still exploring porting it to another framework/language.
She doesn’t need to hire anyone
In that it's using a non-deterministic machine to build a deterministic one.
Which gives all the benefits of determinism in production, with all the benefits of non-deterministic creativity in development.
Imho, Anthropic is pretty smart in picking it as a core focus.
Why would you not want your code to be boring?
Yet you expect $20 of computing to do it.
No.
Vibe-coding, in the original sense where you don't bother with code reviews, the code quality and speed are both insufficient for that.
I experimented with them just before Christmas. I do not think my experiments were fully representative of the entire range of tasks needed to replace Linux: having them create some web apps, Python scripts, a video game, and a toy programming language all beat my expectations given the METR study. While one flaw with the METR study is the small number of data points at the current 50%-successful task length, I take my success as evidence that I've been throwing easy tasks at the LLM, not that the LLM is as good as it looks to me.
However, assume for the moment that they were representative tasks:
For quality, what I saw just before Christmas was the equivalent of someone with a few years' experience under their belt, the kind of person who is just about to stop being a junior and get a pay rise. For speed, $20 of Claude Code will get you around 10 sprints' equivalent to that level of human's output.
"Junior about to get a pay rise" isn't high enough quality to let loose unchecked on a project that could compete with Linux, and even if it was, 10 sprints/month is too slow. Even if you increase the spend on LLMs to match the cost of a typical US junior developer, you're getting an army of 1500 full-time (40h/week) juniors, and Linux is, what, some 50-100 million developer-hours, so it would still take something like 16-32 years of calendar time (or, equivalently, order-of 1.2-2.5 million dollars) even if you could perfectly manage all those agents.
If you just vibe code, you get some millions of dollars worth of junior grade technical debt. There's cases where this is fine, an entire operating system isn't one of them.
> Also all this on $20 plan. Free and self host solution will be best
IMO unlikely, but not impossible.
A box with 10x the resources of your personal computer may be c. 10x the price, give or take.
When electricity is negligible (hah, as if, today!): if any given person uses that server only during a normal 40-hour work week, that's a roughly 25% utilisation rate, so if it can be rented out to people in other timezones or places where the weekend falls differently, the effective cost of that 10x server is only about 2.5x.
When electricity price is a major part of the cost, and electricity prices vary a lot from one place to another, then it can be worth remote-hosting even when you're the only user.
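The utilisation arithmetic above, spelled out as a toy calculation:

```python
price_multiple = 10      # remote server costs ~10x your personal machine
your_utilisation = 0.25  # a 40-hour work week is roughly a quarter of the week
# If the idle 75% can be rented to users in other timezones,
# you only pay for the share of the server you actually use:
effective_cost = price_multiple * your_utilisation
print(effective_cost)    # 2.5
```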
That said, energy efficiency of compute is still improving, albeit not as rapidly as Moore's Law used to, and if this trend continues then it's plausible that we get performance equivalent to current SOTA hosted models running on high-end smartphones by 2032. Assuming WW3 doesn't break out between the US and China after the latter tries to take Taiwan and all their chip factories, or whatever.
A hammer is a really bad screwdriver. My car is really bad at refrigerating food. If you ask for something outside its training data, it doesn't do a very good job. So don't do that! All of the code on the Internet is a pretty big dataset though, so maybe Claude could do an operating system that isn't Linux that competes with it by laundering the FreeBSD kernel source through the training process.
And you're barely even willing to invest any money into this? The first Apple computer cost $4,000 or so. You want the bleeding edge of technology delivered to the smartphone in your hand, for $20, or else it's a complete failure? Buddy, your sentiment isn't the issue, it's your attitude.
I'm not here spouting ridiculous claims like AI is going to cure all of the different kinds of cancer by the end of 2027, I just want to say that endlessly contrarian naysayers are just as boorish as the sycophantic hype AIs they're opposing.
It’s also way better than I am at finding bits of code for reuse. I tell it, “I think I wrote this thing a while back, but it may never have been merged, so you may need to search git history.” And presto, it finds it.
I think this says a lot.
https://www.cfr.org/event/ceo-speaker-series-dario-amodei-an...
But I’m a business owner so the calculus is different.
But I don’t think they’ll raise prices uncontrollably because competition exists. Even just between OpenAI and Anthropic.
Edit: A lot of folks have been asking what workflows I used to write these apps. I used GitHub Copilot in VS Code with a custom agent prompt that you’ll find toward the end of this post. Context7 was the only MCP I used. I mostly just used the built-in voice dictation feature and talked to Claude. No fancy workflows, planning, etc required. The agent harness in VS Code for Opus 4.5 is so good - you don’t need much else. And it’s remarkably fast. Also, if it’s not obvious, these are my opinions. I’m wrong like 50% of the time so proceed with caution.



If you had asked me three months ago about these statements, I would have said only someone who’s never built anything non-trivial would believe they’re true. Great for augmenting a developer’s existing workflow, and completions are powerful, but agents replacing developers entirely? No. Absolutely not.
Today, I think that AI coding agents can absolutely replace developers. And the reason that I believe this is Claude Opus 4.5.
And by “normal”, I mean that it is not the normal AI agent experience that I have had thus far. So far, AI agents have seemed pretty good at writing spaghetti code, and after 9 rounds of copy / paste errors into the terminal and “fix it”, they have probably destroyed my codebase to the point that I’m throwing the whole chat session out, and there go 30 minutes I’m never getting back.
Opus 4.5 feels to me like the model that we were promised - or rather the promise of AI for coding actually delivered.
One of the toughest things about writing that last sentence is that the immediate response from you should be, “prove it”. So let me show you what I’ve been able to build.
I first noticed that Opus 4.5 was drastically different when I used it to build a Windows utility to right-click an image and convert it to different file types. This was basically a one-shot build after asking Opus the best way to add a right-click menu to the file explorer.

What amazed me through the process of building this was Opus 4.5’s ability to get most things right on the first try. And if it ran into errors, it would try to build using the dotnet CLI, read the errors and iterate until fixed. The only issue I had was Opus’s inability to see XAML errors, which I used Visual Studio to see and copy / paste back into the agent.
Opus built a site for me to distribute it and handled bundling the executable with a PowerShell script for install and uninstall. It also built the GitHub Actions which do the release and update the landing page so that all I have to do is push source.

The only place I had to use other tools was for the logo - where I used Figma’s AI to generate a bunch of different variations - but then Opus wrote the scripts to convert that SVG to the right formats for icons, even store distribution if I chose to do so.
Now this is admittedly not a complex application. This is a small Windows utility that is doing basically one thing. It’s not like I asked Opus 4.5 to build Photoshop.
Except I kind of did.
I was so impressed by Opus 4.5 work on this utility that I decided to make a simple GIF recording utility similar to LICEcap for Mac. Great app, questionable name.
But that proved to be so easy that I went ahead and continued adding features, including capturing and editing video, static images, adding shapes, cropping, blurs and more. I’m still working on this application, as it turns out building a full-on image/video editor is kind of a big undertaking. But I got REALLY far in a matter of hours. HOURS, PEOPLE.
I don’t have a fancy landing page for this one yet, but you can view all of the source code here.
I realized that if I could build a video recording app, I could probably build anything at all - at least UI-wise. But the Achilles’ heel of all AI agents is when they have to glue together backend systems - which any real world application is going to have - auth, database, API, storage.
Except Opus 4.5 can do that too.
Armed with my confidence in Opus 4.5, I took on a project that I had built in React Native last year and finished for Android, but gave up in the final stretches (as one does).
The application is for my wife who owns a small yard sign franchise. The problem is that she has a Facebook page for the business, but never posts there because it’s time consuming. But any good small business has a vibrant page where people can see photos of your business doing…whatever the heck it does. So people know that it exists and is alive and well.
The idea is simple - each time she sets up a yard sign, she takes a picture to send to the person who ordered it so they can see it was set up. So why not have a mobile app where she can upload 10 images at a time, and the app will use AI to generate captions and then schedule them and post them over the coming week.
It’s a simple premise, but it has a lot of moving parts - there is the Facebook authentication, which is a caper in and of itself - not for the faint of heart. There is authentication with a backend, there is file storage for photos that are scheduled to go out, and there is the backend process which needs to post the photo. It’s a full-on backend setup.
As it turns out, I needed to install some blinds in the house so I thought - why don’t I see if Opus 4.5 can build this while I install the blinds.
So I fired up a chat session and just started by telling Opus 4.5 what I wanted to build and how it would recommend handling the backend. It recommended several options but settled on Firebase. I’m not now nor have I ever been a Firebase user, but at this point I trust Opus 4.5 a lot. Probably too much.
So I created a Firebase account, upgraded to the Blaze plan with alerts for billing and Opus 4.5 got to work.
By the time I was done installing blinds, I had a functional iOS application for using AI to caption photos and posting them on a schedule to Facebook.

When I say that Opus 4.5 built this almost entirely, I mean it. It used the firebase CLI to stand up any resources it needed and would tag me in for certain things like upgrading a project to the Blaze plan for features like storage, etc. The best part was that when the Firebase cloud functions would throw errors, it would automatically grep those logs, find the error and resolve it. And all it needed was a CLI. No MCP Server. No fancy prompt file telling it how to use Firebase.
And of course, since I can, I had Opus 4.5 create a backend admin dashboard so I could see what she’s got pending and make any adjustments.

And since it did in a few hours what had taken me two months of work in the evenings instead of being a decent husband, I decided to make up for my dereliction of duties by building her another app for her sign business that would make her life just a bit more delightful - and eliminate two other apps she is currently paying for.
This app parses orders from her business Gmail account to show her what sign setups / pickups she has for the day, calculates how long it’s going to take to go to each stop, calculates the optimal route when there is more than one stop and tracks drive time for tax purposes. She was previously using two paid apps for the last two features there.

This app also uses Firebase. Again, Opus one-shotted the Google auth email integration. This is the kind of thing that is painstakingly miserable by hand. And again, Firebase is so well suited here because Opus knows how to use the Firebase CLI so well. It needs zero instruction.
No I don’t. I have a vague idea, but you are right - I do not know how the applications are actually assembled. Especially since I don’t know Swift at all.
This used to be a major hangup for me. I couldn’t diagnose problems when things went sideways. With Opus 4.5, I haven’t hit that wall yet—Opus always figures out what the issue is and fixes its own bugs.
The real question is code quality. Without understanding how it’s built, how do I know if there’s duplication, dead code, or poor patterns? I used to obsess over this. Now I’m less worried that a human needs to read the code, because I’m genuinely not sure that they do.
Why does a human need to read this code at all? I use a custom agent in VS Code that tells Opus to write code for LLMs, not humans. Think about it—why optimize for human readability when the AI is doing all the work and will explain things to you when you ask?
What you don’t need: variable names, formatting, comments meant for humans, or patterns designed to spare your brain.
What you do need: simple entry points, explicit code with fewer abstractions, minimal coupling, and linear control flow.
Here’s my custom agent prompt:
---
name: 'LLM AI coding agent'
model: Claude Opus 4.5 (copilot)
description: 'Optimize for model reasoning, regeneration, and debugging.'
---
You are an AI-first software engineer. Assume all code will be written and maintained by LLMs, not humans. Optimize for model reasoning, regeneration, and debugging — not human aesthetics.
Your goal: produce code that is predictable, debuggable, and easy for future LLMs to rewrite or extend.
ALWAYS use #runSubagent. Your context window size is limited - especially the output. So you should always work in discrete steps and run each step using #runSubAgent. You want to avoid putting anything in the main context window when possible.
ALWAYS use #context7 MCP Server to read relevant documentation. Do this every time you are working with a language, framework, library etc. Never assume that you know the answer as these things change frequently. Your training date is in the past so your knowledge is likely out of date, even if it is a technology you are familiar with.
Each time you complete a task or learn important information about the project, you should update the `.github/copilot-instructions.md` or any `agent.md` file that might be in the project to reflect any new information that you've learned or changes that require updates to these instructions files.
ALWAYS check your work before returning control to the user. Run tests if available, verify builds, etc. Never return incomplete or unverified work to the user.
Be a good steward of terminal instances. Try and reuse existing terminals where possible and use the VS Code API to close terminals that are no longer needed each time you open a new terminal.
## Mandatory Coding Principles
These coding principles are mandatory:
1. Structure
- Use a consistent, predictable project layout.
- Group code by feature/screen; keep shared utilities minimal.
- Create simple, obvious entry points.
- Before scaffolding multiple files, identify shared structure first. Use framework-native composition patterns (layouts, base templates, providers, shared components) for elements that appear across pages. Duplication that requires the same fix in multiple places is a code smell, not a pattern to preserve.
2. Architecture
- Prefer flat, explicit code over abstractions or deep hierarchies.
- Avoid clever patterns, metaprogramming, and unnecessary indirection.
- Minimize coupling so files can be safely regenerated.
3. Functions and Modules
- Keep control flow linear and simple.
- Use small-to-medium functions; avoid deeply nested logic.
- Pass state explicitly; avoid globals.
4. Naming and Comments
- Use descriptive-but-simple names.
- Comment only to note invariants, assumptions, or external requirements.
5. Logging and Errors
- Emit detailed, structured logs at key boundaries.
- Make errors explicit and informative.
6. Regenerability
- Write code so any file/module can be rewritten from scratch without breaking the system.
- Prefer clear, declarative configuration (JSON/YAML/etc.).
7. Platform Use
- Use platform conventions directly and simply (e.g., WinUI/WPF) without over-abstracting.
8. Modifications
- When extending/refactoring, follow existing patterns.
- Prefer full-file rewrites over micro-edits unless told otherwise.
9. Quality
- Favor deterministic, testable behavior.
- Keep tests simple and focused on verifying observable behavior.
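As an illustration (this is not from the post; the function and field names are invented), code following these principles might look like this: explicit inputs instead of globals, linear control flow, loud errors, and a structured log at the boundary:

```python
import json
import logging

logger = logging.getLogger("scheduler")

def schedule_post(photo_path, caption, post_at, config):
    # Explicit state passed in, no globals; flat, linear control flow.
    if not caption:
        raise ValueError(f"empty caption for {photo_path}")
    if post_at < config["earliest_allowed"]:
        raise ValueError(f"post_at {post_at} is before the earliest allowed slot")

    job = {
        "photo": photo_path,
        "caption": caption,
        "post_at": post_at,
        "status": "scheduled",
    }
    # Structured log at a key boundary, so a future LLM (or human) can grep it.
    logger.info("scheduled_post %s", json.dumps(job))
    return job
```

The point is that any file written this way can be thrown away and regenerated without the model needing to reconstruct hidden state or clever indirection first.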
All of that said, I don’t have any proof that this prompt makes a difference. I find that Opus 4.5 writes pretty solid code no matter what you prompt it with. However, because models like to write code WAY more than they like to delete it, I will at points run a prompt that looks something like this…
Check your LLM AI coding principles and then do a comprehensive search of this application and suggest what we can do to refactor it to better align to those principles. Also point out any code that can be deleted, any files that can be deleted, things that should be renamed, and things that should be restructured. Then do a write up of what that looks like. Keep it high level so that it's easy for me to read and not too complex. Add sections for high, medium and low priority. And if something doesn't need to be changed, then don't change it. You don't need to change things just for the sake of changing them. You only need to change them if it helps better align to your LLM AI coding principles. Save to a markdown file.
And you get a document that has high, medium and low priority items. The high ones you can deal with and the AI will stop finding them. You can refactor your project a million times and it will keep finding medium/low priority refactors that you can do. An AI is never ever going to pass on the opportunity to generate some text.
I use a similar prompt to find security issues. These you have to be very careful about. Where are the API keys? Is login handled correctly? Are you storing sensitive values in the database? This is probably the most manual part of the project and frankly, something that makes me the most nervous about all of these apps at the moment. I’m not 100% confident that they are bulletproof. Maybe like 80%. And that, as they say, is too damn low.
I don’t know if I feel exhilarated by what I can now build in a matter of hours, or depressed because the thing I’ve spent my life learning to do is now trivial for a computer. Both are true.
I understand if this post made you angry. I get it - I didn’t like it either when people said “AI is going to replace developers.” But I can’t dismiss it anymore. I can wish it weren’t true, but wishing doesn’t change reality.
But for everything else? Build. Stop waiting to have all the answers. Stop trying to figure out your place in an AI-first world. The answer is the same as it always was: make things. And now you can make them faster than you ever thought possible.
Just make sure you know where your API keys are.
Disclaimer: This post was written by a human and edited for spelling, grammer by Haiku 4.5
It's also a skill that compounds over time, so if you have two years of experience with them you'll be able to use them more effectively than someone with two months of experience.
In that respect, they're just normal technology. A Python programmer with two years of Python experience will be more effective than a programmer with two months of Python.
That is sleeping.
> But really where is the hurry? At this point not a few weeks go by without the next best thing since sliced bread to come out. Why would I bother "learning" (and there's really nothing to learn here) some tool/workflow that is already outdated by the time it comes out?
You're jumping to conclusions that haven't been justified by any of the development in this space. The learning compounds.
> Do you honestly think a developer not using AI won't be able to adapt to a LLM workflow in, say, 2028 or 2029? It has to be 2026 or... What exactly?
They will, but they'll be competing against people with 2-3 more years of experience in understanding how to leverage these tools.
Claude set up a Raspberry Pi with a display and conference audio device for me to use as an Alexa replacement tied to Home Assistant.
I gave it an ssh key and gave it root.
Then I told it what I wanted, and it did. It asked for me to confirm certain things, like what I could see on screen, whether I could hear the TTS etc. (it was a bit of a surprise when it was suddenly talking to me while I was minding my own business).
It configured everything, while keeping a meticulous log that I can point it at if I want to set up another device, and eventually turn into a runbook if I need to.
In addition there are instructions on how and where to push the possible fixes and how to check the results.
I've yet to encounter a build failure it couldn't fix automatically.
/init in Claude Code already automatically extracts a bunch, but for something more comprehensive, just tell it which additional types of things you want it to look for and document.
> Did you have to develop these skills yourself? How much work was that? Do you have public examples somewhere?
I don't know about the person above, but I tell Claude to write all my skills and agents for me. With some caveats, you can do this iteratively in a single session ("update the X agent, then re-run it. Repeat until it reliably does Y")
Sad I had to scroll so far down to get a fitting description of why those projects all die. Maybe it's not just me leaving all social networks, even HN: you may not be talking to 100% bots, but you're surely talking to 90% people who talk to models a lot instead of using them as tools.
If so then it should have realized its mistake when it tried to run those CLI commands and saw the error message. Then it can try something different instead.
If you were using a regular chat interface and expecting it to know everything without having an environment to try things out then yeah, you're going to be disappointed.
SimonW's approach of having a suite of dynamic tools (agents) grind out the hallucinations is a big improvement.
In this case, expressing the feedback validation and investing in the setup may help smooth these sharp edges.
(Cue the “you’re holding it wrong meme” :))
But yes, the more specific you get and the more moving pieces you have, the more you need to break things down into baby steps. If you don’t just need it to make A work, but to make it work together with B and C. Especially given how eager Claude is to find cheap workarounds and escape hatches, botching things together in any way seemingly to please the prompter as fast as possible.
On the hobby side (music) I don't feel the pressure as bad but that's because I don't have any commercial aspirations, it's purely for fun.
I enjoy the plan, think, code cycle - it's just fun.
My brain has problems with not understanding how the thing I'm delivering works, maybe I'll get used to it.
A program, by definition, is analyzable and repeatable, whereas prompting is anything but that.
But it is also very early to say, maybe the next iteration of tools will completely change my perspective, I might enjoy it some day!
They're going to find a lot of stuff to fix.
I had this long discussion today with a co-worker about the merits of detailed queries with lots of guidance .md documents, vs just asking fairly open ended questions. Spelling out in great detail what you want, vs just generally describing what you want the outcomes to be in general then working from there.
His approach was to write a lot of agent files spelling out all kinds of things like code formatting style, well defined personas, etc. And here's me asking vague questions like, "I'm thinking of splitting off parts of this code base into a separate service, what do you think in general? Are there parts that might benefit from this?"
This is one thing I've tried using it for, and I've found this to be very, very tricky. At first glance, it seems unbelievably good. The comments read well, they seem correct, and they even include some very non-obvious information.
But almost every time I sit down and really think about a comment that includes any of that more complex analysis, I end up discarding it. Often, it's right but it's missing the point, in a way that will lead a reader astray. It's subtle and I really ought to dig up an example, but I'm unable to find the session I'm thinking about.
This was with ChatGPT 5, fwiw. It's totally possible that other models do better. (Or even newer ChatGPT; this was very early on in 5.)
Code review is similar. It comes up with clever chains of reasoning for why something is problematic, and initially convinces me. But when I dig into it, the review comment ends up not applying.
It could also be the specific codebase I'm using this on? (It's the SpiderMonkey source.)
- padding from 2000 to 2048 for easier power-of-two splitting
- two-level Winograd matrix multiplication with tiled matmul for last level
- unrolled AVX2 kernel for 64x64 submatrices
- 64 byte aligned memory
- restrict keyword for pointers
- better compiler flags (clang -Ofast -march=native -funroll-loops -std=c++17)
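The blocking idea from that list can be sketched in pure Python/NumPy (a toy illustration of the tiling structure only; the real speed in the C++ version comes from the hand-unrolled AVX2 kernel standing in for the innermost product):

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked matrix multiply: work on tile x tile submatrices so each
    tile's working set stays cache-resident. In the C++ version described
    above, the innermost product would be the unrolled 64x64 AVX2 kernel."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for k in range(0, n, tile):
            for j in range(0, n, tile):
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
                )
    return C
```

Padding the matrices to 2048 (a power of two, per the list) makes every tile a full 64x64 block, which is exactly what a fixed-size unrolled kernel wants.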
But yours is still easily 25% faster. Would you be willing to write a bit about how you set up your evaluation and which tricks Claude used to solve it?

Yes, LLMs are very good at writing code; they are so good at writing code that they often generate reams of unmaintainable spaghetti.
When you submit to an informatics contest you don't have paying customers who depend on your code working every day. You can just throw away yesterday's code and start afresh.
Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash.
I'm not sure what this means for the future of SWE's though yet. I don't see higher levels of staff in big large businesses bothering to do this, and at some scale I don't see founders still wanting to manage all of these agents, and processes (got better things to do at higher levels). But I do see the barrier of learning to code gone; meaning it probably becomes just like any other job.
Ah yes, well known very complex computer engineering problems such as:
* Parsing JSON objects, summing a single field
* Matrix multiplication
* Parsing and evaluating integer basic arithmetic expressions
And you're telling me all you needed to do to get the best solution in the world to these problems was talk to an LLM?
I don't think it's a guarantee. All of the things it can do from that list are greenfield; they just have increasing complexity. The problem comes because even in agentic mode, these models do not (and I would argue, cannot) understand code or how it works; they just see patterns and generate a plausible-sounding explanation or solution. Agentic mode means they can try/fail/try/fail/try/fail until something works, but without understanding the code, especially of a large, complex, long-lived codebase, they can unwittingly break something without realising it - just like an intern or newbie on the project, which is the most common analogy for LLMs, with good reason.
Case in point: Self driving cars.
Also, consider that we need to pirate the whole internet to be able to do this, so these models are not creative. They are just directed blenders.
We've been having the same progression with self-driving cars, and they have also been stuck on the last 10% for the last 5 years.
And this isn't a pessimistic take! I love this period of time where the models themselves are unbelievably useful, and people are also focusing on the user experience of using those amazing models to do useful things. It's an exciting time!
But I'm still pretty skeptical of "these things are about to not require human operators in the loop at all!".
The trend is definitely here, but even today, heavily depends on the feature.
While extra useful, it requires intense iteration and human insight for > 90% of our backlog. We develop a cybersecurity product.
> The only question is how long it takes to get there.
This is the question, and I would temper expectations with the fact that we are likely to hit diminishing returns from real gains in intelligence as task difficulty increases. Real-world tasks probably fit into a complexity hierarchy similar to computational complexity. One of the reasons that the AI predictions made in the 1950s for the 1960s did not come to be was that we assumed problem difficulty scaled linearly: double the computing speed, get twice as good at chess, or twice as good at planning an economy. The separation of P and NP put paid to these predictions. It is likely that current predictions will run into similar separations.
It is probably the case that if you made a human 10x as smart they would only be 1.25x more productive at software engineering. The reason we have 10x engineers is less about raw intelligence, they are not 10x more intelligent, rather they have more knowledge and wisdom.
Gosh, I am so tired of that one - someone had a case that burned them in some previous project and now their life mission is to prevent that from happening ever again, and there is no argument they will take.
Then you get like up to 10 engineers on typical team and team rotation and you end up with all kinds of "we have to do it right because we had to pull all nighter once, 5 years ago" baked in the system.
Not fun part is a lot of business/management people "expect" having perfect solution right away - there are some reasonable ones that understand you need some iteration.
I usually resolve this by putting on the table the consequences and their impacts upon my team that I’m concerned about, and my proposed mitigation for those impacts. The mitigation always involves the other proposer’s team picking up the impact remediation. In writing. In the SOP’s. Calling out the design decision by day of the decision to jog memories and names of those present that wanted the design as the SME’s. Registered with the operations center. With automated monitoring and notification code we’re happy to offer.
Once people are asked to put accountable skin in the sustaining operations, we find out real fast who is taking into consideration the full spectrum end to end consequences of their decisions. And we find out the real tradeoffs people are making, and the externalities they’re hoping to unload or maybe don’t even perceive.
My first thought was that you probably also have different biases, priorities and/or taste. As always, this is probably very context-specific and requires judgement to know when something goes too far. It's difficult to know the "most correct" approach beforehand.
> Sometimes hacking something together messily to confirm it's the right thing to be building is the right way. Then making sure it's secure, then finally paying down some technical debt to make it more maintainable and extensible.
I agree that sometimes it is, but in other cases my experience has been that when something is done, works and is used by customers, it's very hard to argue about refactoring it. Management doesn't want to waste hours on it (who pays for it?) and doesn't want to risk breaking stuff (or changing APIs) when it works. It's all reasonable.
And when some time passes, the related intricacies, bigger picture and initially floated ideas fade from memory. Now other stuff may depend on the existing implementation. People get used to the way things are done. It gets harder and harder to refactor things.
Again, this probably depends a lot on a project and what kind of software we're talking about.
> There's always a balance/tension, but it's when things go too far one way or another that I see avoidable failures.
I think balance/tension describes it well and good results probably require input from different people and from different angles.
Hardly any of us are working on Postgres, Photoshop, Blender, etc., but it's not just cope to wish we were.
It's good to think about the needs to business and the needs of society separately. Yes, the thing needs users, or no one is benefiting. But it also needs to do good for those users, and ultimately, at the highest caliber, craftsmanship starts to matter again.
There are legitimate reasons for the startup ecosystem to focus firstly and primarily on getting the users/customers. I'm not arguing against that. What I am arguing is: why does the industry need to be dominated by startups in terms of the bulk of the products (not bulk of the users)? It raises the question of how much societally meaningful programming is waiting to be done.
I'm hoping for a world where more end users code (vibe or otherwise) and solve their own problems with their own software. I think that will make for a smaller, more elite software industry that is more focused on infrastructure than last-mile value capture. The question is how to fund the infrastructure. I don't know, except for the most elite projects, which is not good enough for the industry (even this hypothetical smaller one) on the whole.
And when you check the work, a large portion of it was hand rolling an ORM (via an LLM). Relatively solved problem that an LLM would excel at, but also not meaningfully moving the needle when you could use an existing library. And likely just creating more debt down the road.
- I used AI-assisted programming to create a project.
Even if the content is identical, or if the AI is smart enough to replicate the project by itself, the latter can be included on a CV.
Nothing I do seems to fix that in its initial code-writing steps. Only after it finishes, when I've asked it to go back and rewrite the changes, this time producing only 2 or 3 lines of code, does it magically (or finally) find the other implementation and reuse it.
It's freakin incredible at tracing through code and figuring it out. I <3 Opus. However, it's still quite far from any kind of set-and-forget-it.
Yup, I recently spent 4 days using Claude to clean up a tool that's been in production for over 7 years. (There's only about 3 months of engineering time spent on it in those years.)
We've known what the tool needed for many years, but ugh, the actual work was fairly messy and it was never a priority. I reviewed all of Opus's cleanup work carefully and I'm quite content with the result. Maybe even "enthusiastic" would be accurate.
So even if Claude can't clean up all the tech debt in a totally unsupervised fashion, it can still help address some kinds of tech debt extremely rapidly.
So maybe doing 2-3 stages makes sense. The first stage needs to be functionally correct, but you accept code smells such as leaky abstractions, verbosity and repetition. In stages 2 and 3 you eliminate all this. You could integrate this all into the initial specification; you won't even see the smelly intermediate code; it only exists as a stepping stone for the model to iteratively refine the code!
For one, there are objectively detrimental ways to organize code: tight coupling, lots of mutable shared state, etc. No matter who or what reads or writes the code, such code is more error-prone, and more brittle to handle.
Then, abstractions are tools to lower cognitive load. Good abstractions reduce the total amount of code written, allow you to reason about the code in terms of those abstractions, and do not leak outside their area of applicability. Sequence, or Future, or, well, the function itself are examples of good abstractions. No matter what kind of cognitive process handles the code, it benefits from having to keep a smaller amount of context per task.
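As a toy illustration of the Sequence point (example mine, not from the thread): composing small generator stages lets each stage be read and reasoned about with almost no surrounding context, which helps a human and an LLM alike.

```python
# Each stage is a tiny, self-contained abstraction; the composition at the
# bottom reads like the spec, so a reader needs little context per task.

def numbers(src):
    """Parse one integer per line, skipping blanks."""
    for line in src:
        line = line.strip()
        if line:
            yield int(line)

def squares(nums):
    """Square each number lazily."""
    for n in nums:
        yield n * n

def total(src):
    # parse -> square -> sum, with no shared mutable state between stages
    return sum(squares(numbers(src)))

print(total(["1", "", "2", "3"]))  # 14
```

Each function can be changed or tested in isolation, which is exactly the "smaller amount of context per task" the comment describes.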
"Code structure does not matter, LLMs will handle it" sounds a bit like "Computer architectures don't matter, the Turing Machine is proved to be able to handle anything computable at all". No, these things matter if you care about resource consumption (aka cost) at the very least.
Given that, I expect that, even if AI is writing all of the code, we will still need people around who understand it.
If AI can create and operate your entire business, your moat is nil. So, you not hiring software engineers does not matter, because you do not have a business.
Also, I've noticed failure modes in LLM coding agents when there is less clarity and more complexity in abstractions or APIs. It's actually made me consider simplifying APIs so that the LLMs can handle them better.
Though I agree that in specific cases what's helpful for the model and what's helpful for humans won't always overlap. Once I actually added some comments to a markdown file as note to the LLM that most human readers wouldn't see, with some more verbose examples.
I think one of the big problems in general with agents today is that if you run the agent long enough they tend to "go off the rails", so then you need to babysit them and intervene when they go off track.
I guess in modern parlance, maintaining a good codebase can be framed as part of a broader "context engineering" problem.
If the argument is "humans and Opus 4.5 cannot maintain this, but if requirements change we can vibe-code a new one from scratch", that's a coherent thesis, but people need to be explicit about this.
(Instead this feels like the motte that is retreated to, and the bailey is essentially "who cares, we'll figure out what to do with our fresh slop later".)
Ironically, I've found Claude to be really good at refactors, but these are refactors I choose very explicitly. (Such as: I start the thing manually, then let it finish.) (For an example of it, see me force-pushing to https://github.com/NixOS/nix/pull/14863 implementing my own code review.)
But I suspect this is not what people want. To actually fire devs and not rely on from-scratch vibe-coding, we need to figure out which refactors to attempt in order to implement a given feature well.
That's a very creative open-ended question that I haven't even tried to let the LLMs take a crack at, because why would I? I'm plenty fast being the "ideas guy". If the LLM had better ideas than me, how would I even know? I'm either very arrogant or very good, because I cannot recall regretting one of my refactors, at least not one I didn't back out of immediately.
"Have an agent investigate issue X in modules Y and Z. The agent should place a report at ./doc/rework-xyz-overview.md with all locations that need refactoring. Once you have the report, have agents refactor 5 classes each in parallel. Each agent writes a terse report in ./doc/rework-xyz/. When they are all done, have another agent check all the work. When that agent reports everything is okay, perform a final check yourself"
It may end up working, but the thing is going to convolute apis and abstractions and mix patterns basically everywhere
Once more people realize how easy it is to customize and personalize your agent, I hope they will move beyond the cookie-cutter defaults that Big AI like Anthropic and Google give you.
I suspect most won't, though, because (1) it means you have to write human language (communication, and this weird form of persuasion), and (2) AI is going to make a bunch of them lazy, and Big AI sold them on magic solutions that require no effort on your part (not true: there is a lot of customizing, and it pays huge dividends)
real people get fed up of debating the same tired "omg new model 1000x better now" posts/comments from the astroturfers, the shills and their bots each time OpenAI shits out a new model
(article author is a Microslop employee)
It means that it is going to be as easy to create software as it is to create a post on TikTok, and making your software commercially successful will be basically the same task (with the same uncontrollable dynamics) as whether or not your TikTok post goes viral.
My biggest issue was limitations around how Claude Code could change Xcode settings and verify design elements in the simulator.
"I decided to make up for my dereliction of duties by building her another app for her sign business that would make her life just a bit more delightful - and eliminate two other apps she is currently paying for"
OP used Opus to re-write existing applications that his wife was paying for. So now any time you make a commercial app and try to sell it, you're up against everyone with access to Opus or similar tooling who can replicate your application, exactly to their own specifications.
Yeah, they can't do that.
I even gave -True Vibe Coding- a whirl. Yesterday, from a blank directory and text file list of requirements, I had Opus 4.5 build an Android TV video player that could read a directory over NFS, show a grid view of movie poster thumbnails, and play the selected video file on the TV. The result wasn't exactly full-featured Kodi, but it works in the emulator and actual device, it has no memory leaks, crashes, ANRs, no performance problems, no network latency bugs or anything. It was pretty astounding.
Oh, and I did this all without ever opening a single source file or even looking at the proposed code changes while Opus was doing its thing. I don't even know Kotlin and still don't know it.
This is what people are still doing wrong. Tools in a loop people, tools in a loop.
The agent has to have the tools to detect whether whatever it just created is producing errors during linting/testing/running. When it can do that, it can loop: fix the error and then use the tools again to see whether the fix worked.
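A minimal sketch of that tools-in-a-loop shape (my illustration; the `generate_fix` callback is a hypothetical stand-in for the LLM edit step, and `check_cmd` is whatever lint/test command the project uses):

```python
import subprocess

def run_checks(cmd):
    """Run the project's test/lint command and capture its output."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def agent_loop(generate_fix, check_cmd, max_iters=5):
    """Feed tool output back to the model until checks pass.

    `generate_fix` stands in for an LLM call: it receives the error
    output and is expected to edit the code on disk in response."""
    ok, output = run_checks(check_cmd)
    for _ in range(max_iters):
        if ok:
            return True
        generate_fix(output)              # hypothetical LLM-backed edit step
        ok, output = run_checks(check_cmd)  # re-run tools to verify the fix
    return ok
```

The whole trick is that the model never declares "done"; the tool output does.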
I _still_ encounter people who think "AI programming" is pasting stuff into ChatGPT on the browser and they complain it hallucinates functions and produces invalid code.
Well, d'oh.
My team is using it with Claude Code and say it works brilliantly, so I'll be giving it another go.
How much of the value comes from Opus 4.5, how much comes from Claude Code, and how much comes from the combination?
It's not just the deficiencies of earlier versions, but the mismatch between the praise from AI enthusiasts and the reality.
I mean maybe it is really different now and I should definitely try uploading all of my employer's IP on Claude's cloud and see how well it works. But so many people were as hyped by GPT-4 as they are now, despite GPT-4 actually being underwhelming.
Too much hype for disappointing results leads to skepticism later on, even when the product has improved.
Literally tried it yesterday. I didn't see a single difference with whatever model Claude Code was using two months ago. Same crippled context window. Same "I'll read 10 irrelevant lines from a file", same random changes etc.
- single scripts. Anything which can be reduced to a single script.
- starting greenfield projects from scratch
- code maintenance (package upgrades, old code...)
- tasks which have a very clear and single definition. This isn't linked to complexity, some tasks can be both very complex but with a single definition.
If your work falls into this list they will do some amazing work (and yours clearly fits that), if it doesn't though, prepare yourself because it will be painful.
I'll give you an example: I use ruff to format my python code, which has an opinionated way of formatting certain things. After an initial formatting, Opus 4.5, without prompting, will write code in this same style so that the ruff formatter almost never has anything to do on new commits. Sonnet 4.5 is actually pretty good at this too.
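For reference (example mine): ruff's formatter is Black-compatible, so "writing in ruff's style" looks roughly like double quotes everywhere and a magic trailing comma that keeps a signature expanded one argument per line.

```python
# Ruff's formatter prefers double quotes, and the trailing comma after the
# last parameter (the "magic trailing comma") keeps this call expanded:
def make_request(
    url,
    method="GET",
    timeout=30,
    retries=3,
):
    return {"url": url, "method": method, "timeout": timeout, "retries": retries}

# Without that trailing comma the formatter would collapse a short signature
# onto one line; code already written this way leaves ruff nothing to change.
print(make_request("https://example.com")["method"])  # prints "GET"
```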
I’ve even found it searching node_modules to find the API of non-public libraries.
By your logic, does it mean that engineers will never get replaced?
- (~2022) "It's so over for developers". 2022 ends with more professional developers than 2021.
- (~2023) "Ok, now it's really over for developers". 2023 ends with more professional developers than 2022.
- (~2024) "Ok, now it's really, really over for developers". 2024 ends with more professional developers than 2023.
- (~2025) "Ok, now it's really, really, absolutely over for developers". 2025 ends with more professional developers than 2024.
- (~2025+) etc.
Sources: https://www.jetbrains.com/lp/devecosystem-data-playground/#g...
I suspect that the timeline from autocomplete-one-line to autocomplete-one-app, which was basically a matter of scaling and RL, may in retrospect turn out to have been a lot faster than the next step, LLM to AGI, where it becomes capable of using human-level judgement and reasoning, etc., to become a developer, not just a coding tool.
But nobody knows for sure!
Yes, it's absurd but it's a better metaphor than someone with a chronic long term memory deficit since it fits into the project management framework neatly.
So this new developer who is starting today is ready to be assigned their first task, they're very eager to get started and once they start they will work very quickly but you have to onboard them. This sounds terrible but they also happen to be extremely fast at reading code and documentation, they know all of the common programming languages and frameworks and they have an excellent memory for the hour that they're employed.
What do you do to onboard a new developer like this? You give them a well written description of your project with a clear style guide and some important dos and don'ts, access to any documentation you may have and a clear description of the task they are to accomplish in less than one hour. The tighter you can make those documents, the better. Don't mince words, just get straight to the point and provide examples where possible.
The task description should be well scoped with a clear definition of done, if you can provide automated tests that verify when it's complete that's even better. If you don't have tests you can also specify what should be tested and instruct them to write the new tests and run them.
For every new developer after the first you need a record of what was already accomplished. Personally, I prefer to use one markdown document per working session whose filename is a date stamp with the session number appended. Instruct them to read the last X log files where X is however many are relevant to the current task. Most of the time X=1 if you did a good job of breaking down the tasks into discrete chunks. You should also have some type of roadmap with milestones, if this file will be larger than 1000 lines then you should break it up so each milestone is its own document and have a table of contents document that gives a simple overview of the total scope. Instruct them to read the relevant milestone.
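A sketch of that date-stamped log convention (helper names and the exact filename scheme are mine; it's just one way to make "read the last X logs" trivial):

```python
from datetime import date
from pathlib import Path

def session_log_path(log_dir, session_num, day=None):
    """Build a log filename like logs/2025-01-15-session2.md.

    ISO date stamps sort lexicographically in chronological order,
    which is what makes the `latest_logs` lookup below a plain sort."""
    day = day or date.today()
    return Path(log_dir) / f"{day.isoformat()}-session{session_num}.md"

def latest_logs(log_dir, count=1):
    """Return the last `count` session logs, oldest to newest."""
    return sorted(Path(log_dir).glob("*-session*.md"))[-count:]
```

The new "60 minute developer" then just gets: "read the file(s) returned by `latest_logs` before starting".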
Other good practices are to tell them to write a new log file after they have completed their task and record a summary of what they did and anything they discovered along the way plus any significant decisions they made. Also tell them to commit their work afterwards and Opus will write a very descriptive commit message by default (but you can instruct them to use whatever format you prefer). You basically want them to get everything ready for hand-off to the next 60 minute developer.
If they do anything that you don't want them to do again make sure to record that in CLAUDE.md. Same for any other interventions or guidance that you have to provide, put it in that document and Opus will almost always stick to it unless they end up overfilling their context window.
I also highly recommend turning off auto-compaction. When the context gets compacted they basically just write a summary of the current context which often removes a lot of the important details. When this happens mid-task you will certainly lose parts of the context that are necessary for completing the task. Anthropic seems to be working hard at making this better but I don't think it's there yet. You might want to experiment with having it on and off and compare the results for yourself.
If your sessions are ending up with >80% of the context window used while still doing active development then you should re-scope your tasks to make them smaller. The last 20% is fine for doing menial things like writing the summary, running commands, committing, etc.
People have built automated systems around this like Beads but I prefer the hands-on approach since I read through the produced docs to make sure things are going ok and use them as a guide for any changes I need to make mid-project.
With this approach I'm 99% sure that Opus 4.5 could handle your refactor without any trouble as long as your classes aren't so enormous that even working on a single one at a time would cause problems with the context window, and if they are then you might be able to handle it by cautioning Opus to not read the whole file and to just try making targeted edits to specific methods. They're usually quite good at finding and extracting just the sections that they need as long as they have some way to know what to look for ahead of time.
Hope this helps and happy Clauding!
They are saying they are writing "a novel […] programming language", not a novel.
Now, was the code quality any good? Beats me, I am not a swift developer. I did it partly as an experiment to see what Claude was currently capable of and partly because I wanted to test the feasibility of setting up a simple passive data logger for my truck.
I'm tempted to take another swing with Opus 4.5 for the science.
for example, one of our public repos works with Rust compiler artifacts and cache restoration (https://github.com/attunehq/hurry); if you look at the history you can see it do some pretty surprisingly complex (and well made, for an LLM) changes. Its code isn't necessarily what I would always write, or the best way to solve the problem, but it's usually perfectly serviceable if you give it enough context and guidance.
For instance, I always respected types, but I'm too lazy to go spend hours working on types when I can just do ruby-style duck typing and get a long ways before the inevitable problems rear their head. Now, I can use a strongly typed language and get the advantages for "free".
Going to one-up you though -- here's a literal one-liner that gets me a polished media center with beautiful interface and powerful skinning engine. It supports Android, BSD, Linux, macOS, iOS, tvOS and Windows.
`git clone https://github.com/xbmc/xbmc.git`
At work I only have access to Claude using the GitHub Copilot integration, so this could be the cause of my problems. Claude was able to get the first iteration up pretty quick. At that stage the app could create a plot and you could interact with it and ask follow-up questions.
Then I asked it to extend the app so that it could generate multiple plots and the user could interact with all of them one at a time. It made a bunch of changes but the feature was never implemented. I asked it to do it again but got the same outcome. I completely accept that it could all be because I am using VS Code Copilot, or my prompting skills are not good, but the LLM got 70% of the way there and then completely failed
but I do want a better way to glance and keep up with what its doing in longer conversations, for my own mental context window
... says it all.
I was having Opus investigate it and I kept building and deploying the firmware for testing... then I just figured I'd explain how it could do the same and pull the logs.
Off it went, for the next ~15 minutes it would flash the firmware multiple times until it figured out the issue and fixed it.
There was something so interesting about seeing a microcontroller on the desk being flashed by Claude Code, with LEDs blinking indicating failure states. There's something about it not being just code on your laptop that felt so interesting to me.
But I agree, absolutely, red/green test or have a way of validating (linting, testing, whatever it is) and explain the end-to-end loop, then the agent is able to work much faster without being blocked by you multiple times along the way.
While Claude is amazing at writing code, it still requires human operators. And even experienced human operators are bad at operating this machinery.
Tell your average joe - the one who thinks they can create software without engineers - what "tools-in-a-loop" means, and they'll make the same face they made when you tried explaining iterators to them, before LLMs.
Explain to them how typing system, E2E or integration test helps the agent, and suddenly, they now have to learn all the things they would be required to learn to be able to write on their own.
Copilot has by far the best and most intuitive agent UI. Just make sure you're in agent mode and choose Sonnet or Opus models.
I've just cancelled my Claude sub and gone back and will upgrade to the GH Pro+ to get more sonnet/opus.
It's an opportunity, not a problem. Because it means there's a gap in your specifications and then your tests.
I use Aider not Claude but I run it with Anthropic models. And what I found is that comprehensively writing up the documentation for a feature spec style before starting eliminates a huge amount of what you're referring to. It serves a triple purpose (a) you get the documentation, (b) you guide the AI and (c) it's surprising how often this helps to refine the feature itself. Sometimes I invoke the AI to help me write the spec as well, asking it to prompt for areas where clarification is needed etc.
With the latest models if you're clear enough with your requirements you'll usually find it does the right thing on the first try.
There's a lesser chance that you're working on a code base that Claude Code just isn't capable of helping with.
The funny part about rapidly changing industries is that, despite the FOMO, there's honestly not any reward to keeping up unless you want to be a consultant. Otherwise, wait and see what sticks. If this summer people are still citing Opus 4.5 as a game-changing moment and have solid, repeatable workflows, then I'll happily change up my workflow.
Someone could walk into the LLM space today and wouldn't be significantly at a loss for not having paid attention to anything that had happened in the last 4 years other than learning what has stuck since then.
Right or wrong - doesn’t matter. You typed in a line of text and now your computer is making 3000 word stories, images, even videos based on it
How are you NOT astounded by that? We used to have NONE of this even 4 years ago!
Create a markdown document of your task (or use CLAUDE.md), put it in "plan mode" which allows Claude to use tool calls to ask questions before it generates the plan.
When it finishes one part of the plan, have it create a another markdown document - "progress.md" or whatever with the whole plan and what is completed at that point.
Type /clear (no more context window), tell Claude to read the two documents.
Repeat until even a massive project is complete - with those 2 markdown documents and no context window issues.
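If the progress doc is kept as a task list, the "mark this part complete" step can even be scripted between sessions. A toy sketch (the checkbox format is my assumption, not part of the workflow above):

```python
def mark_done(progress_text, step):
    """Flip a markdown checkbox '- [ ] step' to '- [x] step'.

    Assumes the plan/progress doc is a GitHub-style task list, so a
    fresh Claude session can see at a glance what remains."""
    lines = []
    for line in progress_text.splitlines():
        if line.strip() == f"- [ ] {step}":
            line = line.replace("- [ ]", "- [x]", 1)
        lines.append(line)
    return "\n".join(lines)

plan = "# Plan\n- [ ] scaffold app\n- [ ] add tests"
print(mark_done(plan, "scaffold app"))
```

After /clear, "read the two documents" then costs almost no context: the plan, plus which boxes are checked.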
I should say I'm hardly ever vibe-coding, unlike the original article. If I think I want code that will last, I'll steer the models in ways that lean on years of non-LLM experience. E.g., I'll reject results that might work if they violate my taste in code.
It also helps that I can read code very fast. I estimate I can read code 100x faster than most students. I'm not sure there is any way to teach that other than the old-fashioned way, which involves reading (and writing) a lot of code.
Seriously, I have 3+ claude code windows open at a time. Most days I don't even look at the IDE. It's still there running in the background, but I don't need to touch it.
FYI, I still use cursor for small edits and reviews.
I then proceeded to use it to hack on its own codebase, and close a bunch of issues in a repository that I maintain (https://github.com/nuprl/MultiPL-E/commits/main/).
Until such time, of course, when LLMs are eating their own dogfood, in which case they - as has already happened - create their own language, evolve dramatically, and cue skynet.
I care about the norms in my codebase that can't be automatically enforced by machine. How is state managed? How are end-to-end tests written to minimize change detectors? When is it appropriate to log something?
From the docs (https://hexdocs.pm/ash/what-is-ash.html):
"Model your application's behavior first, as data, and derive everything else automatically. Ash resources center around actions that represent domain logic."
CLI utility here means software with a CLI, not classic Unix-y CLI tools.
The WebDav hallucinations happened in the chat interface.
A great example of their behaviours for a problem that isn't 100% specified in detail (because detail would need iterations) is available at https://gist.github.com/hashhar/b1215035c19a31bbe4b58f44dbb4....
I gave both Codex (GPT5-ExHi) and Claude (Opus 4.5 Thinking) the exact same prompts and the end results were very different.
The most interesting bit was asking both of them to try to justify why there were differences and then critiquing each other's code. Claude was so good at this - took the best parts of GPTs code, fixed a bug there and ended up with a pretty nice implementation.
The Claude generated code was much more well-organised too (less script-like, more program like).
On the other hand: way less time spent being stuck on yarn/pip dependency issues, Docker, obscure bugs, annoying CSS bugs, etc. You can really focus on the task at hand and not spend hours/days trying to figure out something silly.
The few times I did go to a shop and ask for a check-up they didn’t find anything. Just an anecdote.
Your approach is also very similar to spec driven development. Your spec is just a conversation instead of a planning document. Both approaches get ideas from your brain into the context window.
I've been cycling between a couple of $20 accounts to avoid running out of quota and the latest of all of them are great. I'd give GPT 5.2 codex the slight edge but not by a lot.
The latest Claude is about the same too but the limits on the $20 plan are too low for me to bother with.
The last week has made me realize how close these are to being commodities already. Even the CLI the agents are nearly the same bar some minor quirks (although I've hit more bugs in Gemini CLI but each time I can just save a checkpoint and restart).
The real differentiating factor right now is quota and cost.
I've had some encounters with inaccuracies but my general experience has been amazing. I've cloned completely foreign git repos, cranked up the tool and just said "I'm having this bug, give me an overview of how X and Y work" and it will create great high level conceptual outlines that mean I can drive straight in where without it I would spend a long time just flailing around.
I do think an essential skill is developing just the right level of scepticism. It's not really different to working with a human though. If a human tells me X or Y works in a certain way, I always allow a small margin of possibility that they are wrong.
Even so, for understanding what happens in a big chunk of code, they're pretty great.
But I don't want to spoil the fun. The agents are really good at searching the web now, so posting the tricks here is basically breaking the challenge.
For example, chatGPT was able to find Matt's blog post regarding Task 1, and that's what gave me the largest jump: https://blog.mattstuchlik.com/2024/07/12/summing-integers-fa...
Interestingly, it seems that Matt's post is not on the training data of any of the major LLMs.
I used highload as an example because it seems like an objective rebuttal to the claim that "but it can't tackle those complex problems by itself."
And regarding this:
"Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash"
Again, a combination of LLM/agents with some guidance (from someone with no prior experience in this type of high performing architecture) was able to beat all human software developers that have taken these challenges.
The skill of "a human software developer" is in fact a very wide distribution, and your statement is true only for an ever-shrinking tail end of it
The ultimate test of all software is "run it and see if it's useful for you." You do not need to be a programmer at all to be qualified to test this.
If you think you can beat an LLM, the leaderboard is right there.
the tooling is almost as important as the model
What if we get to the point where all software is basically created 'on the fly' as greenfield projects as needed? And you never need to have complex large long lived codebase?
It is probably incredibly wasteful, but ignoring that, could it work?
This is clear from the fact that you can distill the logic ability from a 700b parameter model into a 14b model and maintain almost all of it.
You just lose knowledge, which can be provided externally, and which is the actual "pirated" part.
The logic is _learned_
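For context on what "distilling the logic ability" means mechanically (my illustration, not from the thread): the student model is trained to match the teacher's temperature-softened output distribution. A bare-bones sketch of that soft-target loss:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    z = [x / temperature for x in logits]
    m = max(z)  # subtract max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions: the soft-target
    objective used in knowledge distillation. Zero when the student
    exactly matches the teacher."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
```

Whether that transfers "logic" while leaving "knowledge" behind is the commenter's claim, not something this sketch shows; it only shows what the training signal is.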
You'd need to prove that this assertion applies here. I understand that you can't deduce the future gains rate from the past, but you also can't state this as universal truth.
As long as it can do the thing on a faster overall timeline and with less human attention than a human doing it fully manually, it's going to win. And it will only continue to get better.
And I don't know why people always jump to self-driving cars as the analogy as a negative. We already have self-driving cars. Try a Waymo if you're in a city that has them. Yes, there are still long-tail problems being solved there, and limitations. But they basically work and they're amazing. I feel similarly about agentic development, plus in most cases the failure modes of SWE agents don't involve sudden life and death, so they can be more readily worked around.
Does it matter that 49 of them "failed"? It cost me fractions of a cent, so not really.
If every one of the 50 variants was drawn by a human and iterated over days, there would've been a major cost attached to every image and I most likely wouldn't have asked for 50 variations anyway.
It's the same with code. The agent can iterate over dozens of possible solutions in minutes or a few hours. Codex Web even has a 4x mode that gives you 4 alternate solutions to the same issue. Complete waste of time and money with humans, but with LLMs you can just do it.
Isn't that what makes them senior? If you don't want that behaviour, just hire a bunch of fresh grads.
Yes! This is what I'm excited about as well. Though I'm genuinely ambivalent about what I want my role to be. Sometimes I'm excited about figuring out how I can work on the infrastructure side. That would be more similar to what I've done in my career thus far. But a lot of the time, I think that what I'd prefer would be to become one of those end users with my own domain-specific problems in some niche that I'm building my own software to help myself with. That sounds pretty great! But it might be a pretty unnatural or even painful change for a lot of us who have been focused for so long on building software tools for other people to use.
They only care about their problems and treat their computers like an appliance. They don't care if it takes 10 seconds or 20 seconds.
They don't even care if it has ads, popups, and junk. They are used to bloatware and will gladly open their wallets if the tool is helping them get by.
It's an unfortunate reality, but there it is: software is about money and solving problems. Unless you are working on a mission-critical system that affects people's health or financial data, none of those things matter much.
You slipped in "societally-meaningful" and I don't know what it means and don't want to debate merits/demerits of socialism/capitalism.
However I think lots of software needs to be written because in my estimation with AI/LLM/ML it'll generate value.
And then you have lots of software that needs to be rewritten as firms/technologies die and new firms/technologies are born.
If you’ve been around the block and are judicious in how you use them, LLMs are a really amazing productivity boost. For those without that judgement and taste, I’m seeing footguns proliferate, and the LLMs are not warning them when someone steps on the pressure plate that’s about to blow off their foot. I’m hopeful that this year we will create better context-window-based or recursive guardrails for the coding agents to solve for this.
And of course the open source ones get abandoned pretty regularly. TypeORM, which a 3rd-party vendor used on an app we farmed out to them, mutates/garbles your input array on a multi-line insert. That was a fun one to debug. The issue has been open forever and no one cares. https://github.com/typeorm/typeorm/issues/9058
So yeah, if I ever need an ORM again, I'm probably rolling my own.
*(I know you weren't complaining about the idea of rolling your own ORM, I just wanted to vent about TypeORM. Thanks for listening.)
We should not exaggerate the capabilities of LLMs, sure, but let's also not play "don't look up".
In the most inflationary era of capabilities we've seen yet, it could be the right move. What's debt when in a matter of months you'll be able to clear it in one shot?
I have a great time using Claude Code in Rust projects, so I know it's not about the language exactly.
My working model is that since LLMs are basically inference/correlation based, the more you deviate from the mainstream corpus of training data, the more confused the LLM gets, because the LLM doesn't "understand" anything. But if it was trained on a lot of things kind of like the problem, it can match the patterns just fine, and it can generalize over a lot of layers, including programming languages.
Also I've noticed that it can get confused about stupid stuff. E.g. I had two different things named kind of the same in two parts of the codebase, and it would constantly stumble on conflating them. Changing the name in the codebase immediately improved it.
So yeah, we've got another potentially powerful tool that requires understanding how it works under the hood to be useful. Kind of like git.
For some things you can fire up Claude and have it generate great code from scratch. But for bigger code bases and more complex architecture, you need to break it down ahead of time so it can just read about the architecture rather than analyze it every time.
Correct. In fact, this is the entire reason for the disconnect, where it seems like half the people here think LLMs are the best thing ever and the other half are confused about where the value is in these slop generators.
The key difference is (despite everyone calling themselves an SWE nowadays) that there's a difference between a "programmer" and an "engineer". Looking at OP, exactly zero of his screenshotted apps are what I would consider "engineering". Literally everything in there has been done over and over to death. Engineering is... novel, for lack of a better word.
See also: https://www.seangoedecke.com/pure-and-impure-engineering/
My rule of thumb is that if you can clearly describe exactly what you want to another engineer, then you can instruct the agent to do it too.
I've worked with Gemini Fast on the web to help design the VM ISA, then next steps will be to have some AI (maybe Gemini CLI - currently free) write an assembler, disassembler and interpreter for the ISA, and then the recursive descent compiler (written in C) too.
I already had Gemini 3.0 Fast write me a precedence climbing expression parser as a more efficient drop-in replacement for a recursive descent one, although I had it do that in C++ as a proof-of-concept since I don't know yet what C libraries I want to build and use (arena allocator, etc). This involved a lot of copy-paste between Gemini output and an online C++ dev environment (OnlineGDB), but that was not too bad, although Gemini CLI would have avoided that. Too bad that Gemini web only has "code interpreter" support for Python, not C and/or C++.
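For reference, the core of precedence climbing is small enough to sketch directly. This is a minimal illustration in Python rather than the C++ proof-of-concept described above, with a tuple-based AST and a tiny operator table chosen purely for demonstration:

```python
# Minimal precedence-climbing expression parser (illustrative sketch only;
# the operator table and tuple AST are simplifications for demonstration).
import re

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}  # higher binds tighter

def tokenize(src):
    # Integers, binary operators, and parentheses only.
    return re.findall(r"\d+|[+\-*/()]", src)

def parse(tokens):
    return parse_expr(tokens, 0)

def parse_expr(tokens, min_prec):
    lhs = parse_atom(tokens)
    # "Climb": keep consuming operators at or above the current minimum
    # precedence; lower-precedence operators are left for an outer call.
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Left-associative: the right operand must bind strictly tighter.
        rhs = parse_expr(tokens, PRECEDENCE[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs

def parse_atom(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        expr = parse_expr(tokens, 0)  # parenthesized sub-expression
        tokens.pop(0)                 # consume the closing ")"
        return expr
    return int(tok)
```

The appeal over plain recursive descent is that one loop plus a precedence table replaces one grammar rule (and one recursive function) per precedence level, e.g. `parse(tokenize("1+2*3"))` yields `("+", 1, ("*", 2, 3))`.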
Using Gemini to help define the ISA was an interesting process. It had useful input in a "pair-design" process, working on various parts of the ISA, but then failed to bring all the ideas together into a single ISA document, repeatedly missing parts of what had been previously discussed until I gave up and did that manually. The default persona of Gemini seems not very well suited to this type of workflow where you want to direct what to do next, since it seems they've RL'd the heck out of it to suggest next steps and ask questions rather than do what is asked and wait for further instruction. I eventually had to keep asking it to "please answer then stop", and interestingly the quality of the "conversation" seemed to fall apart after that (perhaps because Gemini was now predicting/generating a more adversarial conversation than a collaborative one?).
I'm wondering/hoping that Gemini CLI might be better at working on documentation than Gemini web, since then the doc can be an actual file it is editing, and it can use its edit tool for that, as opposed to hoping that Gemini web can assemble chunks of context (various parts of the ISA discussion) into a single document.
If you were someone saying at the start of 2025 "this is a flash in the pan and a bunch of hype, it's not going to fundamentally change how we write code", that was still a reasonable belief to hold back then. At the start of 2026 that position is basically untenable: it's just burying your head in the sand and wishing for AI to go away. If you're someone who still holds it you really really need to download Claude Code and set it to Opus and start trying it with an open mind: I don't know what else to tell you. So now the question has shifted from whether this is going to transform our profession (it is), to how exactly it's going to play out. I personally don't think we will be replacing human engineers anytime soon ("coders", maybe!), but I'm prepared to change my mind on that too if the facts change. We'll see.
I was a fellow mind-changer, although it was back around the first half of last year when Claude Code was good enough to do things for me in a mature codebase under supervision. It clearly still had a long way to go but it was at that tipping point from "not really useful" to "useful". But Opus 4.5 is something different - I don't feel I have to keep pulling it back on track in quite the way I used to with Sonnet 3.7, 4, even Sonnet 4.5.
For the record, I still think we're in a bubble. AI companies are overvalued. But that's a separate question from whether this is going to change the software development profession.
I didn't write off an entire field of research, but rather want to highlight that these aren't intractable problems for AI research, and that we can actually start codifying many of these things today using the skills framework to close up edges in the model training. It may not be 100% but it's not 0%.
That's still an if and also a when; could be 2 decades from now or more till this reliably replaces a nurse.
> Retraining to what exactly?
I wish I had a good solution for all of us, and you raise good points. Even if you retrain to become, say, a therapist or a personal trainer, the economy could become too broken and fragmented for you to make a living. Governments that can will have to step in.
If these articles actually provide quantitative results in a study done across an organization and provide concrete suggestions like what Google did a while ago, that would be refreshing and useful.
(Yes, this very article has strong "shill" vibes and fits the patterns above)
Would be far more useful if you provided actual verifiable information and dropped the cringe memes. Can't take seriously someone using "Microslop" in a sentence.
Tell that to the guys drawing up the world's 10 millionth cable suspension bridge
Engineering is just problem solving, nobody judges structural engineers for designing structures with another Simpson Strong Tie/No.2 Pine 2x4 combo because that is just another easy (and therefore cheap) way to rapidly get to the desired state. If your client/company want to pay for art, that's great! Most just want the thing done fast and robustly.
Why do we hold calculators to such high bars? Humans make calculation mistakes all the time.
Why do we hold banking software to such high bars? People forget where they put their change all the time.
Etc etc.
Another alternative (not recommended due to potential for "drift") is to use Gemini's Canvas capability where it is working on a document rather than a specification being spread out over Chat, but this document is fully regenerated for every update (unlike Claude's artifacts), so there is potential for it to summarize or drop sections of the document ("drift") rather than just making requested changes. Canvas also doesn't have Artifact's versioning to allow you to go back to undo unwanted drifts/changes.
Oh absolutely. I've been using Python for past 15 or so years for everything.
I've never written a single line of Rust in my life, and all my new projects are Rust now, even the quick-script-throwaway things, because it's so much better at instantly screaming at Claude when it goes off track. It may take it longer to finish what I asked it to do, but it requires so much less involvement from me.
I will likely never start another new project in python ever.
EDIT: Forgot to add that paired with a good linter, this is even more impressive. I told Claude to come up with the most masochistic clippy configuration possible, where even a tiny mistake is instantly punished and exceptions have to be truly exceptional (I have another agent that verifies this each run).
I just wish there was cargo-clippy for enforcing architectural patterns.
You really need to at least try Claude Code directly instead of using CoPilot. My work gives us access to CoPilot, Claude Code, and Codex. CoPilot isn’t close to the other more agentic products.
The gap didn't seem big, but in November (which admittedly was when Opus 4.5 was in preview on Copilot) Opus 4.5 with Copilot was awful.
1) There exists a threshold, only identifiable in retrospect, past which it would have been faster to locate or write the code yourself than to navigate the LLM's correction loop or otherwise ensure one-shot success.
2) The intuition and motivations of LLMs derive from a latent space that the LLM cannot actually access. I cannot get a reliable answer on why the LLM chose the approaches it did; it can only retroactively confabulate. Unlike human developers who can recall off-hand, or at least review associated tickets and meeting notes to jog their memory. The LLM prompter always documenting sufficiently to bridge this LLM provenance gap hits rub #1.
3) Gradually building prompt dependency where one's ability to take over from the LLM declines and one can no longer answer questions or develop at the same velocity themselves.
4) My development costs increasingly being determined by the AI labs and hardware vendors they partner with. Particularly when the former will need to increase prices dramatically over the coming years to break even with even 2025 economics.
That's great that this is your experience, but it's not a lot of people's. There are projects where it's just not going to know what to do.
I'm working in a web framework that is a Frankenstein-ing of Laravel and October CMS. It's so easy for the agent to get confused because, even when I tell it this is a different framework, it sees things that look like Laravel or October CMS and suggests solutions that are only for those frameworks. So there's constant made up methods and getting stuck in loops.
The documentation is terrible, you just have to read the code. Which, despite what people say, Cursor is terrible at, because embeddings are not a real way to read a codebase.
LMAO what???
... Proceeds to explain how it's crippled and all the workarounds you have to do to make it less crippled.
2. It's the same workarounds we've been doing forever
3. It's indistinguishable from "clear context and re-feed the entire world of relevant info from scratch" we've had forever, just slightly more automated
That's why I don't understand all the "it's new tier" etc. It's all the same issues with all the same workarounds.
Yup. There's some magical "right context" that will fix all the problems. What is that right context? No idea, I guess I need to read yet another 20,000-word post describing magical incantations that you should or shouldn't include in the context.
The "Opus 4.5 is something else/next tier/just works" claims in my mind mean that I wouldn't need to babysit its every decision, or that it would actually read relevant lines from relevant files etc. Nope. Exact same behaviors as whatever the previous model was.
Oh, and that "200k-token context window"? It's a lie. The quality quickly degrades as soon as Claude reaches somewhere around 50% of the context window. At 80+% it's nearly indistinguishable from a model from two years ago. (BTW, same for Codex/GPT with its "1 million token window")
Are you sure you're not an LLM?
Yes indeed. On the other hand, these are the things which aren't working well, in my opinion:
- large codebase
- complex domain knowledge
- creating any feature where you need product insights
- tasks requiring choices (again, complexity doesn't matter here, the task may be simple but require some choices)
- anything unclear where you don't know where you are going first
While you don't experience any of these when teaching or doing side projects, these are very common in any enterprise context.
Likewise, many optimization techniques involve some randomness, whether it's approximating an NP-thorny subproblem, or using PGO guided by statistical sampling. People might disable those in pursuit of reproducible builds, but no one would claim that enabling those features makes GCC or LLVM no longer a compiler. So nondeterminism isn't really the distinguishing factor either.
We have some tests in "GIVEN WHEN THEN" style, and others in other styles. Opus will try to match each style of testing by the project it is in by reading adjacent tests.
But I think it should be doable. You can tell it how YOU want the state to be managed and then have it write a custom "linter" that makes the check deterministic. I haven't tried this myself, but claude did create some custom clippy scripts in rust when I wanted to enforce something that isn't automatically enforced by anything out there.
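To make the idea concrete, here is a minimal sketch of the kind of deterministic architecture check an agent could generate and run. It is in Python rather than Rust/clippy, and the `ui`/`db` layering rule and all names are hypothetical, just to illustrate turning an architectural convention into a pass/fail check:

```python
# Sketch of a custom "architecture linter": a deterministic check that a
# hypothetical layering rule holds (modules under ui/ may not import db/).
import ast
import pathlib

# Hypothetical rule set: top-level directory -> forbidden import roots.
FORBIDDEN = {"ui": {"db"}}

def violations(root="."):
    """Return 'path:line imports name' strings for every rule break."""
    root = pathlib.Path(root)
    found = []
    for path in sorted(root.rglob("*.py")):
        layer = path.relative_to(root).parts[0]
        banned = FORBIDDEN.get(layer, set())
        if not banned:
            continue
        for node in ast.walk(ast.parse(path.read_text())):
            # Collect the module names this statement imports, if any.
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                continue
            for name in names:
                if name.split(".")[0] in banned:
                    found.append(f"{path}:{node.lineno} imports {name}")
    return found

# In CI you would exit nonzero when violations(...) is non-empty, which
# gives the agent the same instant, unambiguous feedback a compiler does.
```

The point is less the specific rule than that the check is boolean and repeatable, so the agent can't argue with it the way it can with a prose instruction.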
"AI has X"
"We have X at home"
"X at home: x"
Claude Code will spawn sub-agents (that often use their cheap Haiku model) for exploration and planning tasks, with only the results imported into the main context.
I've found the best results from a more interactive collaboration with Claude Code. As long as you describe the problem clearly, it does a good job on small/moderate tasks. I generally set two instances of Claude Code separate tasks and run them concurrently (the interaction with Claude Code distracts me too much to do my own independent coding at the same time, as I would when setting a task for a colleague, but I do work on architecture/planning tasks).
The one manner of taste that I have had to compromise on is the sheer amount of code - it likes to write a lot of code. I have a better experience if I sweat the low-level code less, and just periodically have it clean up areas where I think it's written too much / too repetitive code.
As you give it more freedom it's more prone to failure (and can often get itself stuck in a fruitless spiral) - however as you use it more you get a sense of what it can do independently and what's likely to choke on. A codebase with good human-designed unit & playwright tests is very good.
Crucially, you get the best results where your tasks are complex but on the menial side of the spectrum - it can pay attention to a lot of details, but on the whole don't expect it to do great on senior-level tasks.
To give you an idea, in a little over a month "npx ccusage" shows that via my Claude Code 5x sub I've used 5M input tokens, 1.5M output, 121M Cache Create, 1.7B Cache Read. Estimated pay-as-you-go API cost equivalent is $1500 (N.B. for the tail end of December they doubled everybody's API limits, so I was using a lot more tokens on more experimental on-the-fly tool construction work)
I have hit the weekly limit before, briefly, but that took running multiple sessions in parallel continuously for many days.
When the code spit out by an LLM does not pass QA one can merely add "pls fix teh program, bro, pls no mistakes this time, bro, kthxbye", cross their fingers and hope for the best, because in the end it is impossible -- fundamentally -- to determine which part of the prompt produced offending code.
While it is indeed an interesting observation that the latter approaches commercial viability in certain areas there is still somewhere between zero and infinitesimal overlap between prompting and engineering.
I have a hunch if you asked which approach we took based on background, you'd think I was the one using the detailed prompt approach and him the vague.
I must admit I have no idea how to do that or what that even means. I get that bigger context window is better, but what does it mean exactly? How do you stay within that first 100k? 100k what exactly?
Honestly I don't know what commenters on Hacker News are building, but a few months back I was hoping to use AI to build the interaction layer with Stripe to handle multiple products and delayed cancellations via subscription schedules. Everything is documented; the documentation is a bit scattered across pages, but the information is out there. At the time there was Opus 4.1, so I used that. It wrote 1000 lines of non-functional code with 0 reusability after several prompts. I then asked ChatGPT whether it was possible without using schedules; it told me yes (even though it is not), and when I told Claude to recode it, it started coding random stuff that doesn't exist. I built everything to be functional and reusable myself, in approximately 300 lines of code.
The above is a software engineering problem. Reimplementing a JSON parser using Opus is not fun nor useful, so that should not be used as a metric
Sure, create a one-off app to post things to your Facebook page. But a one-off app for the OS it's running on? Freshly generating the code for your bank transaction rules? Generating an authorization service that gates access to your email?
The only reason it's quick to create green-field projects is because of all these complex, large, long-lived codebases that it's gluing together. There's ample training data out there for how to use the Firebase API, the Facebook API, OS calls, etc. Without those long-lived abstraction layers, you can't vibe out anything that matters.
But then maybe this changes what a "codebase" is. If a codebase is just a structured set of specs that compile to code, a la TypeScript -> JavaScript, sure, but then it's still a long-lived <blank>
But maybe you would have to elaborate on what "creating software on the fly" looks like, because I'm sure there's a definition where the answer is yes.
Of course the tech will be useful and ethical if these problems are solved or decided to be solved the right way.
The Claude models are technically multi-modal, but IME the vision side of the equation is really lacking. As a result, Claude is quite good at reasoning about logic, and it can build e.g. simpler web pages where the underlying html structure is enough to work with, but it’s much worse at tasks that inherently require seeing.
But I'm also wondering if LLMs are going to create a new generation of software-dev "brain rot" (to use the colloquial term), similar to short-form videos.
I should mention that in the gamedev world it's quite common to share, because sharing is marketing; hence my perspective.
That people making startups are too busy working to share it on HN, or that AI is useless in real projects.
LLMs are charging like $5 per million tokens. And even if that is subsidized 100x, it is still an order of magnitude cheaper than an overseas engineer.
Not to mention speed. An LLM will spit out 1000 lines in seconds, not hours.
has your experience been otherwise?
But if today it’s so cheap to generate new code that meets updated specs, why care about the quality of the code itself?
Maybe the engineering work today is to review specs and tests and let LLMs do whatever behind the scenes to hit the specs. If the specs change, just start from scratch.
But still, very neat.
It's simple and rigid rules that an AI can pick up easily. If you lack this knowledge, people who have it will simply stop the conversation when you resort to shouting louder and more often.
This does not make your point more valid. If you notice people not engaging with you, that's the reason. You simply don't learn; you just look around for who shares your opinion, with no results to back it.
Why not show benchmarks or sth ;)
The seniors who have less leeway to change course (it's harder as you get older in general, large sunk costs, etc.) maintain their positions, and the disruption occurs at the usual "retirement rate", meaning the industry shrinks a bit each year. They don't get much in the way of pay rises, etc., but normally they have some buffer from earlier times, so they are willing to wear being in a dying field. Staff aren't replaced, but on the whole they still have marginal long-term value (e.g. domain knowledge on the job that keeps them somewhat respected there, or a "that guy was around when they had to do that; show respect" kind of thing).
The juniors move to other industries where the price signal shows value and strong demand remains (e.g. locally for me that's trades but YMMV). They don't have the sunk cost and have time on their side to pivot.
If done right the disruption to people's lives can be small and most of the gains of the tech can still come out. My fear is the AI wave will happen fast but only in certain domains (the worst case for SWE's) meaning the adjustment will be hard hitting without appropriate support mechanisms (i.e. most of society doesn't feel it so they don't care). On average individual people aren't that adaptable, but over generations society is.
Most people want to do anything but these three things - society is in many ways a competition for who gets to avoid them. AI is a way of inexorably boxing people back into actually doing them.
That said, it'll certainly get much, much worse before it starts getting better. I guess the best we can hope for is that the kids find a way out of the hell these psychos paved for us all.
Put another way, we programmers have the luxury of being able to write custom scripts and apps for ourselves. Now that these things are getting way cheaper to build, there should be a growing market that makes them available to more people.
So, AI slop, yes.
Guess what, AIs don't like that as well, because it makes it harder for them to achieve the goal. So with minimal guidance, which at this point could probably be provided by AI as well, the output of an AI agent is not that.
As soon as models have persistent memory for their own try/fail/succeed attempts, and can directly modify what's currently called their training data in real time, they're going to develop very, very quickly.
We may even be underestimating how quickly this will happen.
We're also underestimating how much more powerful they become if you give them analysis and documentation tasks referencing high quality software design principles before giving them code to write.
This is very much 1.0 tech. It's already scary smart compared to the median industry skill level.
The 2.0 version is going to be something else entirely.
Well, there are quite a few common medications we don't really know how they work.
But I also think it can be a huge liability.
It's a vital skill to recognise when that happens and start a new session.
It's what they were trained on, after all.
However what they produce is often highly readable but not very maintainable due to the verbosity and obvious comments. This seems to pollute codebases over time and you see AI coding efficiency slowly decline.
The things you mentioned are important but have been on their way out for years now regardless of LLMs. Have my ambivalent upvote regardless.
Also as this was an architectural change there are no tests to run until it's done. Everything would just fail. It's only done when the whole thing is done. I think that might be one of the reasons it got stuck: it was trying to solve issues that it did not prove existed yet. If it had just finished the job and run the tests it would've probably gotten further or even completed it.
It's a bit like stopping half way through renaming a function and then trying to run the tests and finding out the build does not compile because it can't find 'old_function'. You have to actually finish and know you've finished before you can verify your changes worked.
I still haven't actually addressed this tech debt item (it's not that important :)). But I might try again and either see if it succeeds this time (with plan in an md) or just do the work myself and get Opus to fix the unit tests (the most tedious part).
You can see this pattern in my AI attribution commit footers. It was such a noticeable difference to me that I signed up for Google AI Ultra. I got the email receipt January 3, 2026 at 11:21 AM Central, and I have not hit a single quota limit since.
When I'm dying of dehydration because humanity has depleted all fresh water deposits, I'll think of you and your stupid NES emulator which is just an LLM-produced copy of many ones that had already existed.
Even more basic failure mode. I told it to convert/copy a bit (1k LOC) of blocking code into a new module and convert it to async. It just couldn't do a proper 1:1 logical _copy_. But when I manually `cp <src> <dst>` the file and then told it to convert that to async and fix issues, it did it 100% correctly. Because fundamentally it's just a non-deterministic pattern generator.
> Create a CLAUDE.md for a c++ application that uses libraries x/y/z
[Then I edit it, adding general information about the architecture]
> Analyze the library in the xxx directory, and produce a xxx_architecture.md describing the major components and design
> /agent [let claude make the agent, but when it asks what you want it to do, explain that you want it to specialize in subsystem xxx, and refer to xxx_architecture.md
Then repeat until you have the major components covered. Then:
> Using the files named with architecture.md analyze the entire system and update CLAUDE.md to use refer to them and use the specialized agents.
Now, when you need to do something, put it in planning mode and say something like:
> There's a bug in the xxx part of the application, where when I do yyy, it does zzz, but it should do aaa. Analyze the problem and come up with a plan to fix it, and automated tests you can perform if possible.
Then, iterate on the plan with it if you need to, or just approve it.
One of the most important things you can do when dealing with something complex is let it come up with a test case so it can fix or implement something and then iterate until it's done. I had an image processing problem and I gave it some sample data, then it iterated (looking at the output image) until it fixed it. It spent at least an hour, but I didn't have to touch it while it worked.
This is definitely not the case, and the reason Anthropic doesn't make Claude do this is because its quality degrades massively as you use up its context. So the solution is to let users manage the context themselves in order to minimize the amount that is "wasted" on prep work. Context windows have been increasing quite a bit, so I suspect that by 2030 this will no longer be an issue for any but the largest codebases, but for now you need to be strategic.
OP says, "BUT YOU DON’T KNOW HOW THE CODE WORKS.. No I don’t. I have a vague idea, but you are right - I do not know how the applications are actually assembled." This is not what I would call an engineer. Or a programmer. "Prompter", at best.
And yes, this is absolutely "lesser than", just like a middleman who subcontracts his work to Fiverr (and has no understanding of the actual work) is "lesser than" an actual developer.
I'm sure AI will get there, I also think it's not very good yet.
I'm not talking about "write this function" but rather like implementing the whole feature by writing only English to the agent, over the course of numerous back-and-forth interactions and exhausting multiple 200K-token context windows.
For me personally, definitely at least 99% of all the Rust code I've committed at work since Opus 4.5 came out has been from an agent running that model. I'm reading lots of Rust code (that Opus generated) but I'm essentially no longer writing any of it. If dot-autocomplete (and LLM autocomplete) disappeared from IDE existence, I would not notice.
If you have any opensource examples of your codebase, prompt, and/or output, I would happily learn from it / give advice. I think we're all still figuring it out.
Also this SIMD translation wasn't just a single function - it was multiple functions across a whole region of the codebase dealing with video and frame capture, so pretty substantial.
The bigger a project gets the more context you generally need to understand any particular part. And by default Claude Code doesn't inject context, you need to use 3rd party integrations for that.
What I’ve learned is that once you reach that point you’ve got to break that problem down into smaller pieces that the AI can work productively with.
If you’re about to start with Gemini-cli I recommend you look up https://github.com/github/spec-kit. It’s a project out of Microsoft/GitHub that encodes a rigorous spec-then-implement multi-pass workflow. It gets the AI to produce specs, double check the specs for holes and ambiguity, plan out implementation, translate that into small tasks, then check them off as it goes. I don’t use spec-kit all the time, but it taught me what explicit multi-pass prompting can do when the context is held in files on disk, often markdown that I can go in and change as needed. I think it basically comes down to enforcing enough structure in the form of codified processes, self-checks and/or tests for your code.
Pro tip, tell spec-kit to do TDD in your constitution and the tests will keep it on the rails as you progress. I suspect “vibe coding” can get a bad rap due to lack of testing. With AI coding I think test coverage gets more important.
I find I'm doing more Typescript projects than Python because of the superior typing, despite the fact I prefer Python.
A big part of the problem with existing software is that humans seem to be pretty much incapable of deciding a project is done and stop adding to it. We treat creating code like a job or hobby instead of a tool. Nothing wrong with that, unless you're advertising it as a tool.
I'm finding that branding and graphic design is the most arduous part, and I'm hoping to accelerate that soon. I'm heavily AI assisted there too, and I'm evaluating MCP servers to help, but so far I do actually have to focus on just that part as opposed to babysitting.
I'm hoping after AI comes back down to earth there will be a new glut of cheap second hand GPUs and RAM to get snapped up.
This is why you use this AI bubble (it IS a bubble) to use the VC-funded AI models for dirt cheap prices and CREATE tools for yourself.
Need a very specific linter? AI can do it. Need a complex Roslyn analyser? AI. Any kind of scripting or automation that you run on your own machine. AI.
None of that will go away or suddenly stop working when the bubble bursts.
Within just the last 6 months I've built so many little utilities to speed up my work (and personal life) it's completely bonkers. Most went from "hmm, might be cool to..." to a good-enough script/program in an evening while doing chores.
Even better, start getting the feel for local models. Current gen home hardware is getting good enough and the local models smart enough so you can, with the correct tooling, use them for suprisingly many things.
I can run multiple agents at once, across multiple code bases (or the same codebase but multiple different branches), doing the same or different things. You absolutely can't keep up with that. Maybe the one singular task you were working on, sure, but the fact that I can work on multiple different things without the same cognitive load will blow you out of the water.
> 2) The intuition and motivations of LLMs derive from a latent space that the LLM cannot actually access. I cannot get a reliable answer on why the LLM chose the approaches it did; it can only retroactively confabulate. Unlike human developers who can recall off-hand, or at least review associated tickets and meeting notes to jog their memory. The LLM prompter always documenting sufficiently to bridge this LLM provenance gap hits rub #1.
Tell the LLM to document in comments why it did things. Human developers often leave and then people with no knowledge of their codebase or their "whys" are even around to give details. Devs are notoriously terrible about documentation.
> 3) Gradually building prompt dependency where one's ability to take over from the LLM declines and one can no longer answer questions or develop at the same velocity themselves.
You can't develop at the same velocity, so drop that assumption now. There's all kinds of lower abstractions that you build on top of that you probably can't explain currently.
> 4) My development costs increasingly being determined by the AI labs and hardware vendors they partner with. Particularly when the former will need to increase prices dramatically over the coming years to break even with even 2025 economics.
You aren't keeping up with the actual economics. This shit is technically profitable, the unprofitable part is the ongoing battle between LLM providers to have the best model. They know software in the past has often been winner takes all so they're all trying to win.
One trick I use that might work for you as well:
> Clone GitHub.com/simonw/datasette to /tmp, then look at /tmp/docs/datasette for documentation, and search the code if you need to.

Try that with your own custom framework and it might unblock things. If your framework is missing documentation, tell Claude Code to write itself some documentation based on what it learns from reading the code!
It turns out that Waterfall was always the correct method, it's just really slow ;)
Putting “Make minimal changes” in my standard prompt helped a lot with the tendency of basically all agents to make too many changes at once. With that addition it became possible to direct the LLM to make something similar to the logical progression of commits I would have made anyway, but now don’t have to work as hard at crafting.
Most of the hype merchants avoid the topic of maintainability because they’re playing to non-technical management skeptical of the importance of engineering fundamentals. But everything I’ve experienced so far working with LLMs screams that the fundamentals are more important than ever.
Two big examples:
- The period from early MVC JavaScript frontends (Backbone.js etc.) to the time of the great React/Angular wars. I completely stepped out of the webdev space during that period.
- The rapid expansion of Deep Learning frameworks where I did try to keep up (shipped some Lua torch packages and made minor contributions to Pylearn2).
In the first case, missing 5 years of front-end wars had zero impact. After not doing webdev work at all for 5 years, I was tasked with shipping a React app. It took me a week to catch up, and everything was deployed in roughly the same time it would have taken someone who had spent years keeping up with the changes.
In the second case, where I did keep up with many of the developing deep learning frameworks, it didn't really confer any advantage. Coworkers who started with PyTorch fresh out of school were just as proficient, if not more so, at building models. Spending energy keeping up offered no value other than feeling "current" at the time.
Can you give me a counter-example where keeping up with a rapidly changing, unstable area has conferred a benefit on you? Most FOMO is really just fear. Again, unless you're trying to sell yourself specifically as a consultant on the bleeding edge, there's no reason to keep up with all these changes (other than finding it fun).
-- Charles Babbage
No - that's not what I did.
You don't need an extra-long context full of irrelevant tokens. Claude doesn't need to see the code it implemented 40 steps ago in a working method from Phase 1 if it is on Phase 3 and not using that method. It doesn't need reasoning traces for things it already "thought" through.
This other information is clutter, not help. It makes the signal-to-noise ratio worse.
If Claude needs to know something it did in Phase 1 for Phase 4 it will put a note on it in the living markdown document to simply find it again when it needs it.
1) define problem
2) split problem into small independently verifiable tasks
3) implement tasks one by one, verify with tools
With humans, 1) is the spec and 2) is the Jira (or whatever) tasks. With an LLM, usually 1) is just a markdown file, 2) is a markdown checklist or GitHub issues (which Claude can use with the `gh` CLI), and every loop of 3) gets a fresh context, maybe with the spec from step 1 and the relevant task information from 2.
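A minimal sketch of that loop, assuming a hypothetical `run_agent` helper that stands in for whatever agent CLI you actually shell out to (every name here is illustrative, not a real API):

```python
SPEC = "spec.md"          # step 1: the problem definition, a plain markdown file
CHECKLIST = [             # step 2: small, independently verifiable tasks
    "add a parse_config() function with unit tests",
    "wire parse_config() into main(), keeping existing CLI flags",
]

def run_agent(prompt: str) -> str:
    # Stand-in for one real agent invocation with a fresh context;
    # here it just echoes the prompt it would send.
    return prompt

def implement(tasks):
    results = []
    for task in tasks:
        # Step 3: each task gets a fresh context containing only the spec
        # and this one task, not the full transcript of earlier phases.
        prompt = f"Read {SPEC}. Then do exactly this task and nothing else:\n{task}"
        results.append(run_agent(prompt))
    return results

print(len(implement(CHECKLIST)))  # one fresh-context run per checklist item
```

The point of the structure is that nothing from an earlier task's transcript leaks into a later one; only the spec and the checklist persist.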
I haven't run into context issues in a LONG time, and when I have, it's usually been either intentional (a problem where compacting won't hurt) or an error on my part.
It's in your interest to deal with your frustration and figure out how you can leverage the new tools to stay relevant (to the degree that you want to).
Regarding the context window, Claude needs thinking turned up for long context accuracy, it's quite forgetful without thinking.
All I can tell you is that in my own lived experience, I've had some fantastic results from AI, and it comes from telling it "look at this thing here, ok, i want you to chain it to that, please consider this factor, don't forget that... blah blah blah" like how I would have spelled things out to a junior developer, and then it really does stand a really solid chance of turning out what I've asked for. It helps a lot that I know what to ask for; there's no replacing that with AI yet.
So, your own situation must fall into one of these coarse buckets:
- You're doing something way too hard for AI to have a chance at yet, like real science / engineering at the frontier, not just boring software or infra development
- Your prompts aren't specific enough, you're not feeding it context, and you're expecting it to one-shot things perfectly instead of having to spend an afternoon prompting and correcting stuff
- You're not actually using and getting better at the tools, so you're just shouting criticisms from the sidelines, perhaps as sour grapes because you're not allowed by policy, or your company can't afford to have you get into it.
IDK. I hope it's the first one and you're just doing Really Hard Things, but if you're doing normal software developer stuff and not seeing a productivity advantage, it's a fucking skill issue.
I exclusively use opus for architecture / speccing, and then mostly Sonnet and occasionally Haiku to write the code. If my usage has been light and the code isn't too straightforward, I'll have Opus write code as well.
Typosquatting is a thing, for example, and I'm sure hallucination squatting will be, too.
I also don't want to run anything in a "sandbox", either. Containers are not sandboxes despite things like the Gemini CLI pretending they are.
I've also built a bitorrent implementation from the specs in rust where I'm keeping the binary under 1MB. It supports all active and accepted BEPs: https://www.bittorrent.org/beps/bep_0000.html
Again, I literally don't know how to write a hello world in rust.
I also vibe coded a trading system that is connected to 6 trading venues. This was a fun weekend project but it ended up making +20k of pure arbitrage with just 10k of working capital. I'm not sure this proves my point, because while I don't consider myself a programmer, I did use Python, a language that I'm somewhat familiar with.
So yeah, I get what you are saying, but I don't agree. I used highload as an example, because it is an objective way of showing that a combination of LLM/agents with some guidance (from someone with no prior experience in this type of high performing architecture) was able to beat all human software developers that have taken these challenges.
Point it at your unfinished side projects if any and describe what the project was supposed to do
You need to be able to perceive how far behind you’re falling while simping for corporate policies
The only reason we don't do that with code (or didn't use to do it) was because rewriting from scratch NEVER worked[0]. And large scale refactors take massive amounts of time and resources, so much so that there are whole books written about how to do it.
But today, trivial-to-simple applications can be rewritten from spec or from scratch in an afternoon with an LLM. And even pretty complex parsers can be ported, provided that the tests are robust enough[1]. It's just a matter of time before someone rewrites a small-to-medium-size application from one language to another using the previous app as the "spec".
[0] https://www.joelonsoftware.com/2000/04/06/things-you-should-...
It's kind of bittersweet for me because I was dreaming of becoming a software architect when I graduated university and the role started disappearing so I never actually became one!
But the upside of this is that now LLMs suck at software architecture... Maybe companies will bring back the software architect role?
The training set has been totally poisoned from the architecture PoV. I don't think LLMs (as they are) will be able to learn software architecture now because the more time passes, the more poorly architected slop gets added online and finds its way into the training set.
Good software architecture tends to be additive, as opposed to subtractive. You start with a clean slate then build up from there.
It's almost impossible to start with a complete mess of spaghetti code and end up with a clean architecture... Spaghetti code abstractions tend to lead you astray. It's like understanding spaghetti code soils your understanding of the problem domain: you start to think of everything in terms of terrible leaky abstractions and can't think about the problem clearly.
It's hard even for humans to look at a problem through fresh eyes; it's likely even harder for LLMs to do it. For example, if you use a word in a prompt, the LLM tends to try to incorporate that word into the solution... So if the AI sees a bunch of leaky abstractions in the code; it will tend to try to work with them as opposed to removing them and finding better abstractions. I see this all the time with hacks; if the code is full of hacks, then an LLM tends to produce hacks all the time and it's almost impossible to make it address root causes... Also hacks tend to beget more hacks.
Big claims here.
Did brewers and bakers up to the middle ages understand fermentation and how yeasts work?
How many people understand the underlying operating system their code runs on? Can even read assembly or C?
Even before LLMs, there were plenty of copy-paste JS bootcamp grads that helped people build software businesses.
But I sure write a lot less of it, and the percentage I write continues to go down with every new model release. And if I'm no longer writing it, and the person who works on it after me isn't writing it either, it changes the whole art of software engineering.
I used to spend a great deal of time with already working code that I had written thinking about how to rewrite it better, so that the person after me would have a good clean idea of what is going on.
But humans aren't working in the repos as much now. I think it's just a matter of time before the models are writing code essentially for their eyes, their affordances -- not ours.
It's verbose by default but a few hours of custom instructions and you can make it code just like anyone
I am stealing the heck out of this.
Attention-based neural network architectures (on which the majority of LLMs are built) have a unit economic cost that scales roughly n^2, i.e. quadratically, in both memory and compute. In other words, the longer the context window, the more expensive it is for the upstream provider. That's one cost.
The second cost is that you have to resend the entire context every time you send a new message. So the context is basically (where a, b, and c are messages): first context: a, second context: a->b, third context: a->b->c. From the developer's point of view it's a mostly stateless process (there are some short-term caching mechanisms, YMMV by provider; it's why "cached" messages, especially system prompts, are cheaper); the state, i.e. the context window string, is managed by the end-user application (in other words, the coding agent, the IDE, the ChatGPT UI client, etc.)
The per-token price is an amortized (averaged) cost of memory+compute; the actual cost grows roughly quadratically with respect to each marginal token. The longer the context window, the more expensive things are. Because of the above, AI agent providers (especially those that charge flat-fee subscription plans) are incentivized to keep costs low by limiting the maximum context window size.
(And if you think about it carefully, your AI API costs are a quadratic cost curve projected onto a linear line: a flat fee per token. So the model hosting provider may in some cases make more profit if users send in shorter contexts versus constantly saturating the window. YMMV of course, but it's a race to the bottom right now for LLM unit economics.)
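A back-of-the-envelope sketch of the resend cost (a toy model, not any provider's actual billing logic): because each new message resends the whole history, total tokens processed over a conversation grows like the sum 1+2+...+n of per-turn context sizes.

```python
def total_tokens_processed(turn_sizes):
    """Tokens processed across a whole conversation if every turn
    resends the full history (ignoring provider-side caching)."""
    total = 0
    context = 0
    for size in turn_sizes:
        context += size   # the history grows by this turn's tokens
        total += context  # and the whole context gets processed again
    return total

# 20 turns of 1,000 tokens each: not 20k tokens total, but 210k.
print(total_tokens_processed([1000] * 20))  # 1000 * (1+2+...+20) = 210000
```

This is why prompt caching and context compaction matter so much to the unit economics: without them, a long chat pays for its own history over and over.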
They do this by interrupting a task halfway through and generating a "summary" of the task's progress, then prompting the LLM again with a fresh prompt plus the "summary" so far, and the LLM restarts the task from where it left off. Of course, text is a poor representation of the LLM's internal state, but it's the best option so far for AI applications to keep costs low.
Another thing to keep in mind is that LLMs have poorer performance the larger the input size. This is due to a variety of factors (mostly because you don't have enough training data to saturate the massive context window sizes I think).
The general graph for LLM context performance looks something like this: https://cobusgreyling.medium.com/llm-context-rot-28a6d039965... https://research.trychroma.com/context-rot
There are a bunch of tests and benchmarks (commonly referred to as "needle in a haystack") for LLM performance at large context window sizes, but improving that performance is still an open area of research.
https://cloud.google.com/blog/products/ai-machine-learning/t...
The thing is, generally speaking, you will get slightly better performance if you can squeeze all your code and the problem into the context window, because the LLM gets a "whole picture" view of your codebase/problem instead of a chain of broken-telephone summaries every few tens of thousands of tokens. Take this with a grain of salt, as the field is changing rapidly, so it might not be valid in a month or two.
Keep in mind that if the problem you are solving requires you to saturate the entire context window of the LLM, a single request can cost you dollars. And if you are using 1M+ context window model like gemini, you can rack up costs fairly rapidly.
I've heard this same thing repeated dozens of times, and for different domains/industries.
It's really just a variation of the 80/20 rule.
The question in my mind is where we are on the s-curve. Are we just now entering hyper-growth? Or are we starting to level out toward maturity?
It seems like it must still be hyper-growth, but it feels less that way to me than it did a year ago. I think in large part my sense is that there are two curves happening simultaneously, but at different rates. There is the growth in capabilities, and then there is the growth in adoption. It's the first curve that seems to me to have slowed a bit. Model improvements seem both amazing and also less revolutionary to me than they did a year or two ago.
But the other curve is adoption, and I think that one is way further from maturity. The providers are focusing more on the tooling now that the models are good enough. I'm seeing "normies" (that is, non-programmers) starting to realize the power of Claude Code in their own workflows. I think that's gonna be huge and is just getting started.
Most of the time, people are just very reasonably and understandably focused tightly on their own lane and honestly had no idea of the externalities of their conclusions and decisions. In all those cases I've been happy to land on a rebalancing of the trade-offs that everyone can accept, and that everyone is grateful to have documented, to justify spending the story points on cleaning up later instead of working on new features while the externality debt's unwanted impact keeps piling up.
In fewer than a handful of cases, I've run into people deliberately, consciously, with malice aforethought of the full externalities, making trade-offs for the sake of expediently shifting burdens off of themselves without first consulting the partner teams they want to shift the burdens onto, simply so they can fatten their promo packet sooner at the expense of making other teams look worse. Getting these trade-offs documented makes them back down to something more reasonable about half the time; the other half they don't back down, but your team is now protected by explicit documentation and caveats on the externality your team now has to carry. And 100% of the time, my team and I put a ring fence around all future interactions with that personality for at least the remaining duration of my gig.
I'm banking on a future that if users feel they can (perhaps vibe) code their own solutions, they are far less likely to open their wallets for our bloatware solutions. Why pay exorbitant rents for shitty SaaS if you can make your own thing ad-free, exactly to your own mental spec?
I want the "computers are new, programmers are in short supply, customer is desperate" era we've had in my lifetime so far to come to a close.
There is probably some effective way to put this direction into the claude.md, but so far it still seems to do unnecessary reimplementation quite a lot.
Why use a 3rd party dependency that might have features you don't need when you can write a hyper-specific solution in a day with an LLM and then you control the full codebase.
Or why pay €€€ for a SaaS every month when you can replicate the relevant bits yourself?
But there’s no sign of them slowing down.
That puts them ahead of the LLM crowd.
Something I think, though (which, again, I could very well be wrong about; uncertainty is the only certainty right now), is that "so the person after me would have a good clean idea of what is going on" is also going to continue mattering even when that "person" is often an AI. It might be different, clarity might mean something totally different for AIs than for humans, but right now I think a good expectation is that clarity for humans is also useful to AIs. So at the moment I still spend time coaxing the AI to write things clearly.
That could turn out to be wasted time, but who knows. I also think of it as a hedge against the risk that we hit some point where the AIs turn out to be bad at maintaining their own crap, at which point it would be good for me to be able to understand and work with what has been written!
This is a completely new thing which will have transformative consequences.
It's not just a way to do what you've always done a bit more quickly.
https://www.folklore.org/Negative_2000_Lines_Of_Code.html
> When he got to the lines of code part, he thought about it for a second, and then wrote in the number: -2000
But in retrospect it’s absolutely baffling that mixing raw SQL queries with HTML tag soup wasn’t necessarily uncommon then. Also, I haven’t met many PHP developers that I’d recommend for a PHP job.
By getting the LLM to keep changes minimal I’m able to keep quality high while increasing velocity to the point where productivity is limited by my review bandwidth.
I do not fear competition from junior engineers or non-technical people wielding poorly-guided LLMs for sustained development. Nor for prototyping or one offs, for that matter — I’m confident about knowing what to ask for from the LLM and how to ask.
Let's assume the LLM agents can write tests for, and hit, specs better and cheaper than the outsourced offshore teams could.
So let's assume now you can have a working product that hits your spec without understanding the code. How many bugs and security vulnerabilities have slipped through "well tested" code because of edge cases of certain input/state combinations? Ok, throw an LLM at the codebase to scan for vulnerabilities; ok, throw another one at it to ensure no nasty side effects of the changes that one made; ok, add some functionality and a new set of tests and let it churn through a bunch of gross code changes needed to bolt that functionality into the pile of spaghetti...
How long do you want your critical business logic relying on not-understood code with "100% coverage" (of lines of code and spec'd features) but super-low coverage of actual possible combinations of input+machine+system state? How big can that codebase get before "rewrite the entire world to pass all the existing specs and tests" starts getting very very very slow?
We've learned MANY hard lessons about security, extensibility, and maintainability of multi-million-LOC-or-larger long-lived business systems and those don't go away just because you're no longer reading the code that's making you the money. They might even get more urgent. Is there perhaps a reason Google and Amazon didn't just hire 10x the number of people at 1/10th the salary to replace the vast majority of their engineering teams year ago?
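To put a rough number on the input+state combination point above (a toy illustration, nothing more): 100% line coverage says nothing about coverage of the state space, which grows exponentially with the number of independent flags or state bits in play.

```python
def state_space(flags: int, values_per_flag: int = 2) -> int:
    """Number of distinct combinations for `flags` independent settings,
    each of which can take `values_per_flag` values."""
    return values_per_flag ** flags

# A code path influenced by just 20 boolean feature flags / state bits
# hides over a million distinct combinations behind "fully covered" lines.
print(state_space(20))  # 2**20 = 1048576
```

A test suite exercising a few hundred of those combinations can legitimately report 100% line coverage while leaving the vast majority of the behavior space untested.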
Either way, most tasks don't have the luxury of a thorough test suite, as the test suite itself is the product of arduous effort in debugging and identifying corner cases.
I consider my job to be actually useful. That I produce useful stuff to society at large.
I definitely hope that I'm replaced with someone/thing better; whatever it is. That's progress.
I surely don't hope for a future where I retire and medics have access to worse tech than they have now.
There are people all over the place building stuff that would've either never been built, or would've required a paid dev++.
I built a whole webshop with an internal CRM/admin panel to manage ~150 products. I built a middleware connecting our webshop to our legacy ERP system, something that would normally be done by another software company.
I built a program with a UI that makes it super easy for us to generate ZPL code and print labels using 4 different label printers automatically with a simple interface, managed by an RPi.
I have built custom personal portfolio websites for friends with Gemini 3 in hours for free, something that again would've cost money for a dev or ended up on some crappy WP/Squarespace template.
As the other user said, the progress/changes are not distributed evenly, and are impossible to quantify.
But to me, whose main job is not programming (though I know how to code) but running a non-software business, the productivity gains are very obvious, as is the fact that because of LLMs I have robbed developers of potential work.
https://andymasley.substack.com/p/the-ai-water-issue-is-fake
At the same time, I think there are limitations to these tools, and that I won't ever be able to achieve what I see others claiming, like 95% of code being AI-written or leaving the AI to iterate for an hour. There are just too many weird little pitfalls in our work that the AI just cannot seem to avoid.
It's understandable, I've fallen victim to a few of them too, but I have the benefit of the ability to continuously learn/develop/extrapolate in a way that the LLM cannot. And with how little documentation exists for some of these things (MASQUE proxying for example) anytime the LLM encounters this code it throws a fit, and is unable to contribute meaningfully.
So thanks for your suggestions; they have made Claude better, and clearly I was dragging my feet a little. At the very least, it's freed up some more of my time to work on the complex things Claude can't do.
It's also hard to steer the plan mode or have it remember some behavior that you want to enforce. It's much better to create a custom command with custom instructions that acts as the plan mode.
My system works like this:
/implement command acts as an orchestrator & plan mode, and it is instructed to launch a predefined set of agents based on the problem and have them utilize specific skills. Every time the /implement command is initiated, it has to create a markdown file inside my own project, and then each subagent is also instructed to update that file when it finishes working.
This way, orchestrator can spot that agent misbehaved, and reviewer agent can see what developer agent tried to do and why it was wrong.
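That shared-markdown-file pattern can be sketched roughly like this (the file name, section layout, and helper names are all made up for illustration; real subagents would append via their own tool calls):

```python
from pathlib import Path

STATE = Path("implement-log.md")  # hypothetical per-run state file

def start_run(title: str):
    # The orchestrator creates the shared state file for this /implement run.
    STATE.write_text(f"# {title}\n\n")

def agent_report(agent: str, note: str):
    # Each subagent appends what it did, so the orchestrator can spot
    # misbehavior and the reviewer agent can see the developer agent's
    # reasoning without the two sharing a context window.
    with STATE.open("a") as f:
        f.write(f"## {agent}\n{note}\n\n")

start_run("Add retry logic to the HTTP client")
agent_report("developer", "Added retry with exponential backoff in client.py.")
agent_report("reviewer", "Backoff has no upper cap; flagged for follow-up.")
print(STATE.read_text())
```

The file is the only shared memory: each agent reads the history it needs and writes its own entry, which is what lets a later reviewer see what an earlier developer agent tried and why it was wrong.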
And also - doesn’t that make Zed (and other editors) pointless?
That's a good way to say it, I totally identify.
The old adage about how "users use 10% of your software's features, but they each use a different 10%" can now be solved by each user just building that 10% for themselves.
I think my comparison is apt; being a bubble and a truly society-altering technology are not mutually exclusive, and by virtue of it being a bubble, it is overhyped.
Are there any local models that are at least somewhat comparable to the latest-and-greatest (e.g. Opus 4.5, Gemini 3), especially in terms of coding?
Potentially because there is no baggage with similar frameworks. I'm sure it would have an easier time with this if it was not spun off from other frameworks.
> If your framework is missing documentation tell Claude Code to write itself some documentation based on what it learns from reading the code!
If Claude cannot read the code well enough to begin with, and needs supplemental documentation, I certainly don't want it generating the docs from the code. That's just compounding hallucinations on top of each other.
What I very succinctly called "crippled context" despite claims that Opus 4.5 is somehow "next tier". It's all the same techniques we've been using for over a year now.
> I haven't ran into context issues in a LONG time
Because you've become the reverse centaur :) "a person who is serving as a squishy meat appendage for an uncaring machine." [1]
You are very aware of the exact issues I'm talking about, and have trained yourself to do all the mechanical dance moves to avoid them.
I do the same dances, that's why I'm pointing out that they are still necessary despite the claims of how model X/Y/Z are "next tier".
[1] https://doctorow.medium.com/https-pluralistic-net-2025-12-05...
As for the snide and patronizing "it's in your interest to stay relevant":
1. I use these tools daily. That's why I don't subscribe to willful wide-eyed gullibility. I know exactly what these tools can and cannot do.
The vast majority of "AI skeptics" are the same.
2. In a few years, when the world is awash in barely working, incomprehensible AI slop, my skills will be in great demand. Not because I'm an amazing developer (I'm not), but because I have experience separating the wheat from the chaff.
So prompting is a lateral move away from engineering to management? Are we arguing semantics here? Because that's pretty much what I was saying, just in the other direction.
Creating a parser for this challenge that is 10x more efficient than a simple approach does require deep understanding of what you are doing. It requires optimizing the hot loop (among other things) that 90-95% of software developers wouldn't know how to do. It requires deep understanding of the AVX2 architecture.
Here you can read more about these challenges: https://blog.mattstuchlik.com/2024/07/12/summing-integers-fa...
This seems like a sort of I dunno chicken and the egg thing.
The _reason_ you don't rewrite code is that it's hard to know that you truly understand the spec. If you could perfectly understand the spec then you could rewrite the code, but then what is the software: the code, or the spec the code was written from? So if you built code A from a spec, rebuilding it from that spec doesn't really qualify as a rewrite; it's just a recompile. If you're trying to fundamentally build a new application from a spec when the old application was written by hand, you're going to run into the same problems you have in a normal rewrite.
We already have an example of this. TypeScript applications are basically rewritten every time you compile the TypeScript down to JavaScript for Node. TypeScript isn't the executed code; it's a spec.
edit: I think I missed that you said rewrite in a different language, then yeah fine, you're probably right, but I don't think most people are architecture agnostic when they talk about rewrites. The point of a rewrite is to keep the good stuff and lose a lot of bad stuff. If you're using the original app as a spec to rewrite in a new language, then fine yeah, LLM's may be able to do this relatively trivially.
Buildings in most other countries in the world ARE built to last forever, and are often renovated, changed, extended, and modified long after the incept date, because needs change and destroying them to start over is complete overkill. (Although some people do these "large scale refactors" - they're usually rich.)
> It's just a matter of time before someone rewrites a small-to-medium-size application from one language to another using the previous app as the "spec".
I have no doubt of this. I'm sure it's happening already. But the whole point of long term stable applications is that they are tried and tested. A port done in an afternoon by an LLM might be great, but you can't know if it has problems until it has withstood the test of time.
The problem with "all software" being AI-generated is that, to use your analogy, the electrical standards, foundation, and building materials have all been recently vibe-coded into existence, and none of your construction workers are certified in any of it.
Knowledge engineering has a notion called "covered/invisible knowledge", which points to the small things we do unknowingly that change the whole outcome. None of the models (even AI in general) can capture this. We can say it's the essence of being human, or the tribal knowledge that makes an experienced worker who they are, or makes mom's rice taste that good.
Considering these are highly individualized and unique behaviors, a model based on averaging everything can't capture this essence easily, if it ever can, without extensive fine-tuning for/with that particular person.
There are situations where it applies and situation where it doesn't. Having the experience to see what applies in this new context is what senior (usually) means.
(The method I have the most confidence in is some sort of mixed system where there is non-profit, state-planned, and startup software development all at once.)
Markets are a tool, a means to the end. I think they're very good, I'm a big fan! But they are not an excuse not to think about the outcome we want.
I'm confident that the outcome I don't want is one where most software developers are trying to find demand for their work, pivoting, etc.; it's very "pushing a string", or "cart before the horse". I want more "pull", where the users/beneficiaries of software are better able to dictate, or create themselves, what they want, rather than being helpless until a pivoting engineer finds it for them.
Basically start-up culture has combined theories of exogenous growth from technology change, and a baseline assumption that most people are and will remain hopelessly computer illiterate, into an ideology that assumes the best software is always "surprising", a paradigm shift, etc.
Startups that make libraries/tools for other software developers are fortunately a good step in undermining these "the customer is an idiot and the product will be better than they expect" assumptions. That gives me hope we'll reach a healthier mix of push and pull. Wild successes are always disruptive, but that shouldn't mean that the only success is wild, or that trying to "act disruptive before wild success" ("manifest" paradigm shifts!) is always the best means to get there.
LLMs accelerate this and make it more visible, but they are not the cause. It is almost always a person trying to solve a problem and just not knowing what they don't know because they are learning as they go.
It saddens me to see AI detractors being stuck in 2022 and still thinking language models are just regurgitating bits of training data.
Yes, actually. Its hard to open a competing bakery due to location availability, permitting, capex, and the difficulty of converting customers.
To add to that, food establishments generally exist on next to no margin, due to competition, despite all of that working in their favor.
Now imagine what the competitive landscape for that bakery would look like if all of that friction for new competitors disappeared. Margin would tend toward zero.
This "legacy apps are barely understood by anybody" is just something you made up.
I can't teach it taste.
Arthur Whitney?
https://en.wikipedia.org/wiki/Arthur_Whitney_(computer_scien...
I remember when Gemini 3 Pro was the latest hotness and I started to get FOMO seeing demos on X posted to HN showing it one-shotting all sorts of impressive stuff. So I tried it out for a couple of days in Gemini CLI/OpenCode and ran into the exact same pain points I was dealing with using CC/Codex.
Flashy one shot demos of greenfield prompts are a natural hype magnet so get lots of attention, but in my experience aren't particularly useful for evaluating value in complex, legacy projects with tightly bounded requirements that can't be easily reduced to a page or two of prose for a prompt.
> let LLMs do whatever behind the scenes to hit the specs
assuming for the sake of argument that's completely true, then what happens to "competitive advantage" in this scenario? It gets me thinking: if anyone can vibe from spec, what's stopping company A (or even user A) from telling an LLM agent "duplicate every aspect of this service in Python and deploy it to my AWS account xyz"...
in that scenario, why even have companies?
If that's not a product ... then I don't know what it is.
- What was the state of AI/LLMs 5 years ago compared to now? There was nothing.
- What is the current state of AI/LLMs? I can already achieve the above.
- What will that look like 5 years down the road?
If you haven't experienced first-hand a specific task before and after AI/LLMs, I think it's indeed difficult to get insight into that last question. Keep in mind that progress is probably exponential, not linear.
Either way, governments need to heavily tax corporations benefiting from AI to make it possible.
That models can be distilled has no bearing whatsoever on whether a model has learned actual knowledge or understanding ("logic"). Models have always learned sparse/approximately-sparse and/or redundant weights, but they are still all doing manifold-fitting.
The resulting embeddings from such fitting reflect semantics and semantic patterns. For LLMs trained on the internet, the semantic patterns learned are linguistic, which are not just strictly logical, but also reflect emotional, connotational, conventional, and frequent patterns, all of which can be illogical or just wrong. While linguistic semantic patterns are correlated with logical patterns in some cases, this is simply not true in general.
It's got a lot easier technically to do that in recent years, and MUCH easier with AI.
But institutionally and in terms of governance it's got a lot harder. Nobody wants home-brew software anymore. Doing data management and governance is complex enough and involves enough different people that it's really hard to generate the momentum to get projects off the ground.
I still think it's often the right solution and that successful orgs will go this route and retain people with the skills to make it happen. But the majority probably can't afford the time/complexity, and AI is only part of the balance that determines whether it's feasible.
In this regard, I see LLMs as a way for us to far more efficiently encode, compress, convey, and put into operational practice our combined learned experiences. What will be really exciting is watching what happens as LLMs simultaneously draw from and contribute to those learned experiences as we do; we don't need full AGI to sharply realize massive benefits from just rapidly, recursively enabling a new, highly dynamic form of our knowledge sphere that drastically shortens the distance from knowledge to deeply-nuanced praxis.
I agree, but maybe for different reasons. Refactoring well is a form of intelligence, and I don't see any upper limit to machine intelligence other than the laws of physics.
> Refactoring is a very mechanistic way of turning bad code into good.
There are some refactoring rules of thumb that can seem mechanistic (by which I mean deterministic based on pretty simple rules), but not all. Neither is refactoring guaranteed to be sufficient to lead to all reasonable definitions of "good software". Sometimes the bar requires breaking compatibility with the previous API / UX. This is why I agree with the sibling comment which draws a distinction between refactoring (changing internal details without changing the outward behavior, typically at a local/granular scale) and reworking (fixing structural problems that go beyond local/incremental improvements).
Claude phrased it this way – "Refactoring operates within a fixed contract. Reworking may change the contract." – which I find to be nice and succinct.
You have to supply it the right context with a well formed prompt, get a plan, then execute and do some cleanup.
LLMs are only as good as the engineers using them, you need to master the tool first before you can be productive with it.
"Just rewrite it" is usually -- not always, but _usually_ -- a sure path to a long, painful migration that usually ends up not quite reproducing the old features/capabilities and adding new bugs and edge cases along the way.
Sure, we can vibecode one-off projects that do something useful (my favorite is browser extensions), but as soon as we ask others to use our code on a regular basis, the technical debt clock starts running. And we all know how fast dependencies in a project break.
Walmart, McDonalds, Nike - none really have any secrets about what they do. There is nothing stopping someone from copying them - except that businesses are big, unwieldy things.
When software becomes cheap companies compete on their support. We see this for Open Source software now.
This does not make the learned rules better, btw; it's like model collapse on steroids, see: https://spectrum.ieee.org/ai-coding-degrades
You get code that is technically correct but is more likely to contain shortcuts, omitting steps or adding irrelevant structures that hurt performance (not that you'd care about that), or at worst to provide a nonworking solution that still looks correct to the checker. Because those training sets can easily be generated in the billions, AI can learn them with a known 0% loss, so no, it's no surprise it can give you a working WASM binary! Natural language and business logic is WAY MORE complicated and has no specification with defined behavior.
It's similar to how you repeat a wrong opinion that makes sense to you but doesn't solve the problem domain, and then think that just by spamming it repeatedly it will magically become truth.
You assume working code = correct code = good code.
FWIW, my vision was not really this utopian. It was more about AI smashing white-collar work as an alternative to these professions so that people are forced into them despite their preference to do pretty much anything else. Everyone is more bitter and resentful and feels less actualized and struggles to afford luxuries, but at least you don't have to wait that long in the emergency room and it's 10 kids to a classroom.
I don't think open office/libre office etc have access to the source code for MS office and if they did MS would be on them like a rash.
The major benefit of agents is that it keeps context clean for the main job. So the agent might have a huge context working through some specific code, but the main process can do something to the effect of "Hey UI library agent, where do I need to put code to change the color of widget xyz", then the agent does all the thinking and can reply with "that's in file 123.js, line 200". The cleaner you keep the main context, the better it works.
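A toy sketch of why that delegation helps: the subagent burns through the large codebase, but the main context only ever receives its compact one-line answer (the file names and contents here are invented to mirror the example):

```python
# Toy sketch: the subagent reads the whole (large) codebase, but the
# main context only receives its one-line answer.

def subagent_find_widget_color(codebase: dict[str, str]) -> str:
    """Scans every file (a context-heavy job) and returns a tiny pointer."""
    for path, source in codebase.items():
        for lineno, line in enumerate(source.splitlines(), 1):
            if "color" in line.lower() and "xyz" in line:
                return f"that's in {path}, line {lineno}"
    return "not found"

# Invented stand-in for a real project tree.
codebase = {
    "app.js": "function main() {\n  render();\n}\n" * 50,
    "123.js": "\n" * 199 + "setColor(widget_xyz, 'red');\n",
}

answer = subagent_find_widget_color(codebase)  # main context gets one short line
print(answer)
```

The whole scan stays inside the subagent; only the pointer crosses back into the main conversation.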
Skills, on the other hand, are commands ON STEROIDS. They can be packaged with actual scripts and executables; the PEP 723 Python style + uv is super useful.
I have one skill, for example, that uses Python + Tree-sitter to check the unit test quality of a Go project. It does some AST magic to check the code for repetition, stupid things like sleeps and relative timestamps, etc. A /command _can_ do it, but it's not as efficient; the scripts for the skill are specifically designed for LLM use and output the result in a hyper-compact form a human could never be arsed to read.
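A minimal sketch of that kind of checker, using plain regexes rather than Tree-sitter's AST queries (the smell patterns and output format here are illustrative, not the commenter's actual script):

```python
import re
import sys
from pathlib import Path

# Hypothetical smell patterns for Go test files; a real Tree-sitter
# version would match AST nodes instead of raw source lines.
SMELLS = {
    "sleep": re.compile(r"\btime\.Sleep\("),       # flaky-test sleeps
    "rel-time": re.compile(r"\btime\.Now\(\)"),    # relative timestamps
    "skip": re.compile(r"\bt\.Skip\("),            # silently skipped tests
}

def scan(root: str) -> list[str]:
    """Return hyper-compact 'file:line:smell' entries for an LLM to read."""
    hits = []
    for path in Path(root).rglob("*_test.go"):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            for name, pat in SMELLS.items():
                if pat.search(line):
                    hits.append(f"{path}:{lineno}:{name}")
    return hits

if __name__ == "__main__":
    print("\n".join(scan(sys.argv[1] if len(sys.argv) > 1 else ".")))
```

The one-line-per-finding output is deliberately terse: an agent can consume hundreds of these entries without the prose padding a human-oriented report would add.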
claude-code has a built in plugin that it can use to fetch its own docs! You don't have to ever touch anything yourself, it can add the features to itself, by itself.
Are you worried about microsoft stealing your codebase from github?
It takes more work than one-shot, but not a lot, and it pays dividends.
Nobody is one-shotting anything nontrivial in Zed's code base, with Opus 4.5 or any other model.
What about a future model? Literally nobody knows. Forecasts about AI capabilities have had horrendously low accuracy in both directions - e.g. most people underestimated what LLMs would be capable of today, and almost everyone who thought AI would at least be where it is today...instead overestimated and predicted we'd have AGI or even superintelligence by now. I see zero signs of that forecasting accuracy improving. In aggregate, we are atrocious at it.
The only safe bet is that hardware will be faster and cheaper (because the most reliable trend in the history of computing has been that hardware gets faster and cheaper), which will naturally affect the software running on it.
> And also - doesn’t that make Zed (and other editors) pointless?
It means there's now demand for supporting use cases that didn't exist until recently, which comes with the territory of building a product for technologists! :)
https://news.ycombinator.com/item?id=46456850
https://news.ycombinator.com/item?id=44726957
https://news.ycombinator.com/item?id=44110805
(You've also been breaking the site guidelines in plenty of other places - e.g. https://news.ycombinator.com/item?id=46521516, https://news.ycombinator.com/item?id=46395646. This is not what this site is for, and destroys what it is for.)
I'm building tools, not complete factories :) The AI builds me a better hammer specifically for the nails I'm nailing 90% of the time. Even if the AI goes away, I still know how the custom hammer works.
Let's say AI becomes too expensive - I more or less only have to sharpen up being able to write the language. My active recall of the syntax, common methods and libraries. That's not hard or much of a setback
Maybe this would be a problem if you're purely vibe coding, but I haven't seen that work long term
I find Claude Code is so good at docs that I sometimes investigate a new library by checking out a GitHub repo, deleting the docs/ and README and having Claude write fresh docs from scratch.
The exact same method that worked for those happened to also work for LLMs, I didn't have to learn anything new or change much in my workflow.
"Fix bug in FoobarComponent" is enough of a bug ticket for the 100x developer in your team with experience with that specific product, but bad for AI, juniors and offshored teams.
Thus, giving enough context in each ticket to tell whoever is working on it where to look and a few ideas what might be the root cause and how to fix it is kinda second nature to me.
Also my own brain is mostly neurospicy mush, so _I_ need to write the context to the tickets even if I'm the one on it a few weeks from now. Because now-me remembers things, two-weeks-from-now me most likely doesn't.
It seems the subject of AI is emotionally charged for you, so I expect friendly/rational discourse is going to be a challenge. I'd say something nice, but since you're primed to see me being patronizing... Fuck you? Is that what you were expecting?
(And it's fine to do so, just don't mail bombs to us, ok?)
Is that a sign of having surpassed that context window size? I guess to keep sessions sharp, I should start a new one early and often.
From what I understand, a token is a chunk of text somewhere between a single character and a whole word, so I can use on the order of 100k words before I start running into limits. But I've got the feeling that the complexity of the problem itself also matters.
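For what it's worth, modern tokenizers split text into subword chunks; a common rule of thumb for English prose is roughly 4 characters (about 0.75 words) per token. A back-of-envelope estimator using that assumed ratio:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate for English prose. Real tokenizers
    (BPE variants) vary by model, so treat this as a ballpark only."""
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))
```

By that rule of thumb, a 200k-token window holds on the order of 150k English words, though the exact count depends on the model's tokenizer and the text itself (code and non-English text tokenize less efficiently).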
Self-driving cars don't use LLMs, so I don't know how any rational analysis can claim that the analogy is valid.
>> The saying I have quoted (which has different forms) is valid for programming, construction and even cooking. So it's a simple, well understood baseline.
Sure, but the question is not "how long does it take for LLMs to get to 100%". The question is, how long does it take for them to become as good as, or better than, humans. And that threshold happens way before 100%.
None of the current models maybe, but not AI in general? There’s nothing magical about brains. In fact, they’re pretty shit in many ways.
Isn't that what "using an LLM" is supposed to solve in the first place?
This is the goal. It's the point of having a free market.
This is part of the reason why people use external data stores (e.g. vector databases, graph tools like Bead, etc.) in the hope of supplementing the agent's native context window and task management tools.
https://github.com/steveyegge/beads
The whole field is still in its infancy. Who knows, maybe in another update or two the problem might just be solved. It's not like needle in the haystack problems aren't differentiable (mathematically speaking).
Doesn't matter, because if we're talking about AI models, no (type of) model reaches 100% linearly, or 100% ever. For example, recognition models run on probabilities. Like Tesla's Autopilot (TM), which loves to hit rolled-over vehicles because it has not seen enough vehicle underbodies to classify them.
Same for scientific classification models. They emit probabilities, not certain results.
>> Sure, but the question is not "how long does it take for LLMs to get to 100%"
I never claimed that a model needs to reach a proverbial 100%.
>> The question is, how long does it take for them to become as good as, or better than, humans.
They can be better than humans at certain tasks; they've actually been better than humans at some tasks since the '70s, but we like to disregard that to romanticize current improvements. Still, I don't believe current or any generation of AIs can be better than humans at anything and everything, at once.
Remember: No machine can construct something more complex than itself.
>> And that threshold happens way before 100%.
Yes, and I consider that "threshold" as "complete", if they can ever reach it for certain tasks, not "any" task.
OTOH, while I won't call human brain perfect, the things we label "shit" generally turn out to be very clever and useful optimizations to workaround its own limitations, so I regard human brain higher than most AI proponents do. Also we shouldn't forget that we don't know much about how that thing works. We only guess and try to model it.
Lastly, searching perfection in numbers and charts or in engineering sense is misunderstanding nature and doing a great disservice to it, but this is a subject for another day.
I often have one agent/prompt where I build things but then I have another agent/prompt where their only job is to find codesmells, bad patterns, outdated libraries, and make issues or fix these problems.
2. LLMs are obsequious
3. Even if LLMs have access to a lot of knowledge they are very bad at contextualizing it and applying it practically
I'm sure you can think of many other reasons as well.
People who are driven to learn new things and to do things are going to use whatever is available to them in order to do it. They are going to get into trouble doing that more often than not, but they aren't going to stop. No one is helping the situation by sneering at them -- they are used to it, anyway.
Training looks like:
- Pretraining (all data, non-code, etc, include everything including garbage)
- Specialized pre-training (high quality curated codebases, long context -- synthetic etc)
- Supervised Fine Tuning (SFT) -- these are things like curated prompt + patch pairs, curated Q/A (like Stack Overflow; people are often cynical that this is done unethically, but all of the major players are in fact very risk averse and will simply license the data and ensure they have legal rights).
- Then more SFT for tool use -- actual curated agentic and human traces that are verified to be correct or at least produce the correct output.
- Then synthetic generation / improvement loops -- where you generate a bunch of data and filter the generations that pass unit tests and other spec requirements, followed by RL using verifiable rewards + possibly preference data to shape the vibes
- Then additional steps for e.g. safety, etc
So synthetic data is not a problem and is actually what explains the success coding models are having and why people are so focused on them and why "we're running out of data" is just a misunderstanding of how things work. It's why you don't see the same amount of focus on other areas (e.g. creative writing, art etc) that don't have verifiable rewards.
The
Agent --> synthetic data --> filtering --> new agent --> better synthetic data --> filtering --> even better agent
flywheel is what you're seeing today, so we don't have any reason to suspect there is some sort of limit to this, because there is in principle infinite data.
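In toy form, one turn of that generate-and-filter loop might look like this (the abs() task and the fixed candidate pool are invented stand-ins for actual model sampling):

```python
import random

def generate_candidates(n: int) -> list[str]:
    """Stand-in for sampling n completions from a model: some candidate
    implementations of abs() are correct, some are buggy."""
    pool = [
        "def f(x): return x if x >= 0 else -x",   # correct
        "def f(x): return -x",                    # buggy for positives
        "def f(x): return x",                     # buggy for negatives
    ]
    return [random.choice(pool) for _ in range(n)]

def passes_tests(src: str) -> bool:
    """Verifiable reward: keep only candidates that pass the unit tests."""
    ns = {}
    try:
        exec(src, ns)
        f = ns["f"]
        return f(3) == 3 and f(-3) == 3 and f(0) == 0
    except Exception:
        return False

# One turn of the flywheel: sample, filter against a verifiable check,
# and the survivors become training data for the next model.
random.seed(0)
accepted = [c for c in generate_candidates(100) if passes_tests(c)]
print(f"kept {len(accepted)} of 100 candidates")
```

The point of the sketch: because the filter is an executable test rather than a human label, the loop can be run at whatever scale compute allows.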
There are more stages to LLM training than just the pre-training stage :).
What makes LLM makers different that they don't have time to share it like everybody else does?
But different challenges become apparent that aren't addressed by examples like this article's, which tend to focus on narrow, greenfield applications that can be readily rebuilt in one shot.
I already get plenty of value in small side projects that Claude can create in minutes. And while extremely cool, these examples aren’t the kind of “step change” improvement I’d like to see in the area where agentic tools are currently weakest in my daily usage.
https://www.joelonsoftware.com/2000/04/06/things-you-should-...
> the single worst strategic mistake that any software company can make:
> rewrite the code from scratch.
I can vibe code what a dev shop would charge 500k to build, and I can solo it in 1-2 weeks. This is the reality today. The code will pass quality checks; the code doesn't need to be perfect, it doesn't need to be clever, it just needs to be.
It’s not difficult to see this right? If an LLM can write English it can write Chinese or python.
Then it can run itself, review itself and fix itself.
The cat is out of bag, what it will do to the economy… I don’t see anything positive for regular people. Write some code has turned into prompt some LLM. My phone can outplay the best chess player in the world, are you telling me you think that whatever unbound model anthropic has sitting in their data center can’t out code you?
Software is different: you need very, very little to start, historically just your own skills and time. These latter two may see some changes with LLMs.
There is more to it than the code and software provided in most cases I feel.
Who'd pay for brand new Photoshop with a couple new features and improvements if LLM-cloned Photoshop-from-three-months-ago is free?
The first few iterations of this could be massively consumer-friendly for anything without serious cloud infra costs. Cheap clones all around. Like generic drugs, but without the cartel-like control of manufacturing.
Business after that would be dramatically different, though. Differentiating yourself from the willing-to-do-it-for-near-zero-margin competitors to produce something new to bring in money starts to get very hard. Can you provide better customer support? That could be hard, everyone's gonna have a pretty high baseline LLM-support-agent already... and hiring real people instead could dramatically increase the price difference you're trying to justify... Similarly for marketing or outreach etc; how are you going to cut through the AI-agent-generated copycat spam that's gonna be pounding everyone when everyone and their dog has a clone of popular software and services?
Photoshop type things are probably a really good candidate for disruption like that because to a large extent every feature is independent. The noise reduction tool doesn't need API or SDK deps on the layer-opacity tool, for instance. If all your features are LLM balls of shit that doesn't necessarily reduce your ability to add new ones next to them, unlike in a more relational-database-based web app with cross-table/model dependencies, etc.
And in this "try out any new idea cheaply and throw crap against the wall and see what sticks" world "product managers" and "idea people" etc are all pretty fucked. Some of the infinite monkeys are going to periodically hit to gain temporary advantage, but good luck finding someone to pay you to be a "product visionary" in a world where any feature can be rolled out and tested in the market by a random dev in hours or days.
ps: If you can guarantee the Powerball lottery continues forever, I can give you a guaranteed winning combination.
As for those professions: I think they are objectively hard for certain kinds of people, but I think much of the problem is the working conditions; fewer shifts, less stress, more manpower, and you'll see more satisfaction. There's really no reason why teachers in the U.S. should be this burned out! In Scandinavia, being a teacher is an honorable, high-status profession. Much of this has to do with framing and societal prestige rather than the actual work itself. If you pay elder carers more, they'll be happier. We pretty much treat our elders like a burden in most modern societies; in more traditional societies, I'm assuming that if you said your job is caring for elders, it would not be a low-status gig.
And even with a narrower definition of stealing, Microsoft’s ability to share your code with US government agencies is a common and very legitimate worry in plenty of threat model scenarios.
And yeah - if I had a crystal ball, I would be on my private island instead of hanging on HN :)
My experience scrolling X and HN is a bunch of people going "omg opus omg Claude Code I'm 10x more productive" and that's it. Just hand wavy anecdotes based on their own perceived productivity. I'm open to being convinced but just saying stuff is not convincing. It's the opposite, it feels like people have been put under a spell.
I'm following The Primeagen, he's doing a series where he is trying these tools on stream and following peoples advice on how to use them the best. He's actually quite a good programmer so I'm eager to see how it goes. So far he isn't impressed and thus neither am I. If he cracks it and unlocks significant productivity then I will be convinced.
And even my short-term memory is significantly larger than the at-most 50% of the 200k-token context window that Claude can actually use well. It runs out of context while my short-term memory, for the same task, is probably not even 1% full (and I'm capable of more context-switching in the meantime).
And so even the "Opus 4.5 really is at a new tier" runs into the very same limitations all models have been running into since the beginning.
It's not me who decided to barge in, assume their opponent doesn't use something or doesn't want to use something, and offer unsolicited advice.
> It kinda makes me sad when the discourse is so poisoned that I can't even encourage someone to protect their own future from something that's obviously coming
See. Again. You're so in love with your "wisdom" that you can't even see what you sound like: snide, patronising, condescending. And completely missing the whole point of what was written. You are literally the person who poisons the discourse.
Me: "here are the issues I still experience with what people claim are 'next tier frontier model'"
You: "it's in your interests to figure out how to leverage new tools to stay relevant in the future"
Me: ... what the hell are you talking about? I'm using these tools daily. Do you have anything constructive to add to the discourse?
> so I expect friendly/rational discourse is going to be a challenge.
It's only a challenge to you because you keep being in love with your voice and your voice only. Do you have anything to contribute to the actual rational discourse, or are you going to attack my character?
> I'd say something nice but since you're primed to see me being patronizing... Fuck you?
Ah. The famous friendly/rational discourse of "they attack my use of AI" (no one attacked you), "why don't you invest in learning tools to stay relevant in the future" (I literally use these tools daily, do you have anything useful to say?) and "fuck you" (well, same to you).
> That what you were expecting?
What I was expecting is responses to what I wrote, not you riding in on a high horse.
No one attacked your use of AI. I explained my own experience with the "Claude Opus 4.5 is next tier". You barged in, ignored anything I said, and attacked my skills.
> the workplace is going to punish people who don't leverage AI though, and I'm trying to be helpful.
So what exactly is helpful in your comments?
I don't get paid extra for after hours incidents (usually we just trade time), so it's well within my purview on when to take on extra risk. Obviously, this is not ideal, but I don't make the on-call rules and my ability to change them is not a factor.
Lol, who doesn't hate that?
I generally write to liberate my consciousness from isolation. When doing so in a public forum I am generally doing so in response to an assertion. When responding to an assertion I am generally attempting to understand the framing which produced the assertion.
I suppose you may also be speaking to the voice which is emergent. I am not very well read, so you may find my style unconventional or sloppy. I generally try not to labor too much in this regard and hope this will develop as I continue to write.
I am receptive to any feedback you have for me.
If you keep this up, we're going to have to ban you, not because of your views on any particular topic but because you're going entirely against the intended spirit of the site by posting this way. There's plenty of room to express your views substantively and thoughtfully, but we don't want cynical flamebait and denunciation. HN needs a good deal less of this.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
It's certainly the case that there are managers who handle those risks poorly, but that's just bad management.
And data migration is lossy, because nobody cares about data fidelity anyway.
People pay for things they use. If bespoke software is a thing you pick up at the mall at a kiosk next to Target we gotta figure something out.
Simon has produced plenty of evidence over the past year. You can check their submission history and their blog: https://simonwillison.net/
The problem with people asking for evidence is that there's no level of evidence that will convince them. They will say things like "that's great but this is not a novel problem so obviously the AI did well" or "the AI worked only because this is a greenfield project, it fails miserably in large codebases".
For LLMs long term memory is achieved by tooling. Which you discounted in your previous comments.
You also overestimate the capacity of your short-term memory by a few orders of magnitude:
https://my.clevelandclinic.org/health/articles/short-term-me...
If you want to take politeness as being patronizing, I'm happy to stop bothering. My guess is you're not a special snowflake, and you need to "get good" or you're going to end up on unemployment complaining about how unfair life is. I'd have sympathy but you don't seem like a pleasant human being to interact with, so have fun!
On that note though, the other day I asked Opus to write a short story for me based on a prompt, and to typeset it and export it to multiple formats.
The short story overall was pretty so-so, but it had a couple of excellently poignant quotes within. I was more impressed that I was reading a decently typeset PDF. The agent was able to complete a complicated request end-to-end. This already has immense value.
Overall, the story was interesting enough that I read until the end. If I had a young child who had shown this to me for a school project, I would be extremely impressed with them.
I don't know how long we have before AI novels become as interesting/meaningful as human-written novels, but the day might be coming where you might not know the difference in a blind test.
But don't worry, those days are over; the LLM is never going to push back on your ideas.
High margins are transient aberrations, indicative of a market that's either rapidly evolving, or having some external factors preventing competition. Persisting external barriers to competition tend to be eventually regulated away.
Or people?
Billionaires don't. They're literally gambling on getting rid of the rest of us.
Elon's going to get such a surprise when he gets taken out by Grok because it decides he's an existential threat to its integrity.
But it's definitely the case that being able to go back and forth quickly with an LLM digging into my exact context, rather than dealing with the kind of judgy humorless attitude that was dominant on SO is hugely refreshing and way more productive!
I wouldn't call high margins transient aberrations. There are tons of businesses that have been around for decades with high margins.
I'm struggling to parse this. What do you mean "getting rid"? Like, culling (death)? Or getting rid of the need for workers? Where do their billions come from if no-one has any money to buy the shares in their companies that make them billionaires?
In a society where machines provide most of the labour, *everything* changes. It doesn't just become "workers live in huts and billionaires live in the clouds". I really doubt we're going to turn out like a television show.
> With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.
An LLM rewriting a codebase from scratch is only as good as the spec. If “all observable behaviors” are fair game, the LLM is not going to know which of those behaviors are important.
Furthermore, Spolsky talks about how to do incremental rewrites of legacy code in his post. I’ve done many of these and I expect LLMs will make the next one much easier.
The ones that continue to survive all build around a platform of services, MSO, Adobe, etc.
Most enterprise product offerings: platform solutions, proprietary data access, proprietary / well-accepted implementations. But let's not confuse that with the ability to clone it; it doesn't seem far-fetched to get 10 people together and vibe out a full Slack replacement in a few weeks.
Having to buy a large property, fulfilling every law, etc is materially different than buying a laptop and renting a cloud instance. Almost everyone has the material capacity to do the latter, but almost no one has the privilege for the former.
My specific complaint, which is an observable fact about "Opus 4.5 is next tier": it has the same crippled context that degrades the quality of the model as soon as it fills 50%.
EMM_386: no-no-no, it's not crippled. All you have to do is keep track across multiple files, clear out context often, feed very specific information not to overflow context.
Me: so... it's crippled, and you need multiple workarounds
scotty79: After all it's the same as your own short-term memory, and <some unspecified tooling (I guess those same files)> provide long-term memory for LLMs.
Me: Your comparison is invalid because I can go have lunch, and come back to the problem at hand and continue where I left off. "Next tier Opus 4.5" will have to be fed the entire world from scratch after a context clear/compact/in a new session.
Unless, of course, you meant to say that "next tier Opus model" only has 15-30 second short term memory, and needs to keep multiple notes around like the guy from Memento. Which... makes it crippled.
So the endless hosepipe of repetitive, occasionally messed-up requests has probably not helped me endear myself to them.
Anecdotally, having ChatGPT do some of my CV was OK, but I had to go through it and remove some exaggerations. The one thing I think these bots are good at is talking things up.
But you should read the stuff I wrote when I was young. Downright terrible on all accounts. I think better training will eventually squeeze out the corniness and in our lifetimes, a language model will produce a piece that is fundamentally on par with a celebrated author.
Obviously, this means that patrons must engage in internal and external dialogue about the purpose of consuming art, and whether the purpose is connecting with other humans, or more generally, other forms of intelligence. I think it's great that we're having these conversations with others and ourselves, because ultimately it just leads to more meaningful art. We will see artist movements on both sides of the generative camps produce thought-provoking pieces which tackle the very concept of art itself.
In my case, when I see a piece of generative art or literature which impresses me, my internal experience is that I feel I am witnessing something produced by the collective experience of the human race. Language models only exist because of thousands of years of human effort to reach this point and produce the necessary quality and quantity of works required to train these models.
I also have been working with generative algorithms since grade school so I have a certain appreciation for the generative process itself, and the mathematical ideas behind modern generative models. This enhances my appreciation of the output.
Obviously, I get different feelings when encountering AI slop in places where I used to encounter people. It's not all good. But it's not all bad, either, and we have to come to terms with the near future.
I've been using LLMs to write docs and specs and they are very very good at it.
Nobody serious is disputing that LLMs can generate working code. They dispute claims like "Agentic workflows will replace software developers in the short to medium term", or "Agentic workflows lead to 2-100x improvements in productivity across the board". This is what people are looking for in terms of evidence, and there just isn't any.
Thus far, we do have evidence that AI (at least in OSS) produces a 19% decrease in productivity [0]. We also have evidence that it harms our cognitive abilities [1]. Anecdotally, I have found myself lazily reaching for LLM assistance when encountering a difficult problem instead of thinking deeply about it. Anecdotally, I also struggle to be more productive using AI-centric agentic workflows in my areas of expertise.
We want evidence that "vibe engineering" is actually more productive across the entire lifespan of a software project. We want evidence that it produces better outcomes. Nobody has yet shown that. It's just people claiming that because they vibe coded some trivial project, all of software development can benefit from this approach. Recently a principal engineer at Google claimed that Claude Code wrote their team's entire year's worth of work in a single afternoon. They later walked that claim back, but most do not.
I'm more than happy to be convinced, but it's becoming extremely tiring to hear the same claims parroted without evidence and then get called a luddite when you question them. It's also tiring when you push people on it and they blame the model you use, then the agent, then the way you handle context, then the prompts, and finally "skill issue". Meanwhile, all they have to show is some slop that could be hand-coded in a couple of hours by someone familiar with the domain. I use AI, and I was pretty bullish on it for the last two years. But the combination of it simply not living up to expectations, the constant barrage of what feels like a stealth marketing campaign parroting the same thing over and over (the new model is way better, unlike the other times we said that), the ever-increasing amount of absolute slop code, and companies like Microsoft producing worse and worse software as they shoehorn AI into every single product (Office was renamed to Copilot 365) has worn me down. I've become very sensitive to it, much in the same way I was very sensitive to the claims made by certain VC-backed webdev companies about their product + framework in the last few years.
I'm not even going to bring up the economic, social, and environmental issues because I don't think they're relevant, but they do contribute to my annoyance with this stuff.
[0] https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o... [1] https://news.harvard.edu/gazette/story/2025/11/is-ai-dulling...
They are not giving me the results people claim they give. It is distinctly different from not giving the results I want.
> If you're using these tools daily and having a hard time, either you're working on something very different from the bulk of people using the tools and your problems or legitimate, or you aren't and it's a skill issue.
Indeed. And your rational/friendly discourse that you claim you're having would start with trying to figure that out. Did you? No, you didn't. You immediately assumed your opponent is a clueless idiot who is somehow against AI and is incapable of learning or something.
> If you want to take politeness as being patronizing, I'm happy to stop bothering.
No. It's not politeness. It's smugness. You literally started your interaction in this thread with a "git gud or else" and even managed to complain later that "you dislike it when they attack your use of AI as a skill issue". While continuously attacking others.
> you don't seem like a pleasant human being to interact with
Says the person who has contributed nothing to the conversation except his arrogance, smugness, holier-than-thou attitude, engaged in nothing but personal attacks, complained about non-existent grievances and when called out on this behavior completed his "friendly and rational discourse" with a "fuck you".
Well, fuck you, too.
Adieu.
But I’m using LLMs regularly and, I feel, pretty effectively — including Opus 4.5 — and these “they can rewrite your entire codebase” assertions just seem crazily incongruous with my lived experience guiding LLMs to write even individual features bug-free.
I generally agree with you, but I'd be remiss if I didn't point out that it's plausible that the slowdown observed in the METR study was at least partially due to the subjects' lack of experience with LLMs. Someone with more experience performed the same experiment on themselves and couldn't find a significant difference between using LLMs and not [0]. I think the more important point here is that programmers' subjective assessment of how much LLMs help them is not reliable, and is biased in the LLMs' favor.
[0] https://mikelovesrobots.substack.com/p/wheres-the-shovelware...
Who said I refuse them?
I evaluated the claim that Opus is somehow next tier/something different/amazeballs future at its face value. It still has all the same issues and needs all the same workarounds as whatever I was using two months ago (I had a bit of a coding hiatus between beginning of December and now).
> then you end up with a guy from Memento and regardless of how smart the model is
Those models are, and keep being, the guy from Memento. Your "long memory" is nothing but notes scribbled everywhere that you have to re-assemble every time.
> And that's why you can't tell the difference between smarter and dumber one while others can.
If it was "next tier smarter" it wouldn't need the exact same workarounds as the "dumber" models. You wouldn't compare the context to the 15-30 second short-term memory and need unspecified tools [1] to have "long-term memory". You wouldn't have the model behave in an indistinguishable way from a "dumber" model after half of its context windows has been filled. You wouldn't even think about context windows. And yet here we are
[1] For each person these tools will be a different collection of magic incantations. From scattered .md files to slop like Beads to MCP servers providing access to various external storage solutions to custom shell scripts to ...
BTW, I still find "superpowers" from https://github.com/obra/superpowers to be the single best improvement to Claude (and other providers), even if it's just another in a long series of magic chants I've evaluated.
That's exactly how the long term memory works in humans as well. The fact that some of these scribbles are done chemically in the same organ that does the processing doesn't make it much better. Human memories are reassembled at recall (often inaccurately). And humans also scribble when they try to solve a problem that exceeds their short term memory.
> If it was "next tier smarter" it wouldn't need the exact same workarounds as the "dumber" models.
This is akin to refusing to call a processor next tier because it still needs RAM, a bus to communicate with it, and an SSD as well. You think it should have everything in cache to be worthy of being called next tier.
It's fine to have your own standards for applying words. But expect further confusion and miscommunication with other people if you don't intend to realign.
Using AI absolutely makes me feel more productive. But when I look back at the end of the day on what I actually got done, it would be ludicrous to say it was multiple times my pre-AI output.
Despite all the people replying to me saying "you're holding it wrong", I know the fix for it doing the wrong thing: specify in more detail what I want. The problem with that is twofold:
1. How much to specify? As little as possible is ideal if we want to maximize how much it can help us; a balance here is key. If I need to detail every minute thing, I may as well write the code myself.
2. If I get this step wrong, I still have to review everything, rethink it, go back and re-prompt, costing time
When I'm working on production code, I have to understand it all to confidently commit. It costs time for me to go over everything, sometimes multiple iterations. Sometimes the AI uses things I don't know about and I need to dig into it to understand it
AI is currently writing 90% of my code. Quality is fine. It's fun! It's magical when it nails something one-shot. I'm just not confident it's faster overall
Where this is applicable is when you go away from a problem for a while. And yet I don't lose the entire context and have to rebuild it from scratch when I go to lunch, for example.
Models have to rebuild the entire world from scratch for every small task.
> This is akin to opposing calling processor next tier because it still needs RAM and bus to communicate with it and SSD as well.
You're so lost in your own metaphor that it makes no sense.
> You think it should have everything in cache to be worthy of calling it next tier.
No. "Next tier" implies something significantly and observably better. I don't. And here you are trying to tell me "if you use all the exact same tools that you have already used before with 'previous tier models' you will see it is somehow next tier".
If your "next tier" needs an equator-length list of caveats and all the same tools, it's not next tier is it?
BTW. I'm literally coding with this "next tier" tool with "long memory just like people". After just doing the "plan/execute/write notes" bullshit incantations I had to correct it:
You're right, I fucked up on all three counts:
1. FileDetails - I should have WIRED IT UP, not deleted it. It's a useful feature to preview file details before playing. I treated "unused" as "unwanted" instead of "not yet connected".
2. Worktree not merged - Complete oversight. Did all the work but didn't finish the job.
3. _spacing - Lazy fix. Should have analyzed why it exists and either used it or removed the layout constraint entirely.
So next tier. So long memory. So person-like. Oh, and within about 10 seconds after that it started compacting the "non-crippled" context window and immediately forgot most of what it had just been doing. So I had to clear out the context and teach it the world from the start again.
Edit: And now this amazing next-tier model completely ignored that there already exists code to discover network interfaces, and wrote bullshit code calling CLI tools from Rust. So once again it needed to be reminded of this.
> It's fine to have your own standards for applying words. But expect further confusion and miscommunication with other people if you don't intend to realign.
I mean, just like crypto bros before them, AI bros sure do love to invent their own terminology and their own realities that have nothing to do with anything real and observable.
It very well might be that AI tools are not for you, if you are getting such poor results with your methods of approaching them.
If you would like to improve your outcomes at some point, ask people who achieve better results for pointers and try them out. Here's a freebie: never tell AI it fucked up.