It never duplicates code, reimplements something and leaves the old code around, breaks my conventions, hallucinates, or tells me it's done when the code doesn't even compile, all of which Sonnet 4.5 and Opus 4.1 did all the time.
I'm wondering if this has changed with Opus 4.5, since so many people are raving about it now. What's your experience?
Claude - fast, to the point but maybe only 85% - 90% there and needs closer observation while it works
GPT-x-high (or xhigh) - you tell it what to do, it will work slowly but precise and the solution is exactly what you want. 98% there, needs no supervision
I'm not so sure... I mean, it's true that regardless of whether you are a beginner, junior, or 'senior', if you say "build me an Instagram clone" Opus 4.5 will probably do a decent job. I think the skill still lies in understanding architecture, knowing where pitfalls and problems can arise, or even making some important 'abstraction cut' prompts. I think applications can still grow to a point where you need to prompt at specific domains only, or the model will fail to do everything you want it to, especially if you give it massive 'fix this, this, this, and that too' prompts.
It is best in its class, but trips up frequently with complicated engineering tasks involving dynamic variables. Think: Browser page loading, designing for a system where it will "forget" to account for race conditions, etc.
Still, this gets me very excited for the next generation of models from Anthropic for heavy tasks.
https://chronick.github.io/typing-arena/
With another more substantial personal project (Eurorack module firmware, almost ready to release), I set up Claude Code to act as a design assistant, where I'd give it feedback on current implementation, and it would go through several rounds of design/review/design/review until I honed it down. It had several good ideas that I wouldn't have thought of otherwise (or at least would have taken me much longer to do).
Really excited to do some other projects after this one is done.
If I had done the same thing in the pre-LLM era, it would have taken me months.
Like that first one where he writes a right-click handler: off the top of my head I have no idea how I would do that. I could see it taking a few hours just to set up a dev environment, and I would probably overthink the research. I was working on something where Junie suggested I write a browser extension for Firefox, and I was initially intimidated at the thought, but it banged out something in just a few minutes that basically worked after the second prompt.
Similarly, the Facebook autoposter is completely straightforward to code, but it can be so emotionally exhausting to fight with authentication APIs. A big part of the coding agent story isn't just that agents save you time; it's that they can be strong when you are emotionally weak.
The one that seems the hardest is the one that does the routing and travel-time estimation, which I'd imagine is calling out to some API or library. I used to work at a place that did sales-territory optimization, and we had one product that would help work out routes for sales and service people who travel from customer to customer. We had a specialist code that stuff in C++, and he had a very different viewpoint than me: he was good at what he did and could get that kind of code to run fast, but I wouldn't have trusted him to even look at application code.
I think the gap between Sonnet 4.5 and Opus is pretty small, compared to the absolute chasm between like gpt-4.1, grok, etc. vs Sonnet.
(it was for single html/js PWA to measure and track heart rate)
Opus seems to go less deep, does its own thing, and doesn't follow instructions exactly, EVEN IF I WRITE IN ALL CAPS. With Sonnet 4.5 I can understand everything the author is saying. Maybe Opus is optimised for Claude Code and Sonnet works best on the web.
The thing about being an engineer in a commercial capacity is maintaining and enhancing an existing program or software system that has been developed over years by multiple people (including those who have already left), and doing it in a way that does not cause any outages, bugs, or breakage of existing functionality.
While the blog post mentions the ability to use AI to generate new applications, it does not talk about maintaining one over a longer period of time. For that, you would need real users, real constraints, and real feature requests, preferably ones that pay you so you can prioritize them.
I would love to see blog posts where, for example, a PM is able to add features for a period of one month without breaking production, but it would be a very costly experiment.
If you tell it to use linters and other kinds of code-analysis tools, it takes things to the next level: Ruff for Python or Clippy for Rust, for example. The LLM makes so much code so fast, then passes it through these tools, actually understands what the tools say, and goes and makes the changes. I have created a whole toolchain that I put in a pre-commit text file in my repos, and I tell the LLM something like "Look in this text file and use every tool you see listed to improve code quality".
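That kind of runner is simple to sketch. Here is a minimal, hypothetical Python version, assuming a plain-text file with one tool command per line; the file name and command format are my own illustration, not the commenter's actual setup:

```python
import shlex
import subprocess
from pathlib import Path

def load_toolchain(path="toolchain.txt"):
    # One quality-tool command per line, e.g. "ruff check ." or "mypy src".
    # Blank lines and "#" comments are skipped.
    lines = Path(path).read_text().splitlines()
    return [shlex.split(line) for line in lines
            if line.strip() and not line.lstrip().startswith("#")]

def run_toolchain(commands):
    # Run each tool and collect its exit code; the LLM (or a pre-commit
    # hook) can then read the failures and go fix them.
    results = {}
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[" ".join(cmd)] = proc.returncode
    return results
```

In practice the prompt just points the agent at the text file and asks it to keep running the listed tools until everything exits 0.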
That being said, I doubt it can yet turn a non-dev into a dev; it just makes competent devs way better.
I still need to be able to understand what it is doing and what the tools are for to even have a chance to give it the guardrails it should follow.
Once you’ve got Claude Code set up, you can point it at your codebase, have it learn your conventions, pull in best practices, and refine everything until it’s basically operating like a super-powered teammate. The real unlock is building a solid set of reusable “skills” plus a few agents for the stuff you do all the time.
For example, we have a custom UI library, and Claude Code has a skill that explains exactly how to use it. Same for how we write Storybooks, how we structure APIs, and basically how we want everything done in our repo. So when it generates code, it already matches our patterns and standards out of the box.
We also had Claude Code create a bunch of ESLint automation, including custom ESLint rules and lint checks that catch and auto-handle a lot of stuff before it even hits review.
Then we take it further: we have a deep code review agent Claude Code runs after changes are made. And when a PR goes up, we have another Claude Code agent that does a full PR review, following a detailed markdown checklist we’ve written for it.
On top of that, we’ve got like five other Claude Code GitHub workflow agents that run on a schedule. One of them reads all commits from the last month and makes sure docs are still aligned. Another checks for gaps in end-to-end coverage. Stuff like that. A ton of maintenance and quality work is just… automated. It runs ridiculously smoothly.
We even use Claude Code for ticket triage. It reads the ticket, digs into the codebase, and leaves a comment with what it thinks should be done. So when an engineer picks it up, they’re basically starting halfway through already.
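For what it's worth, the gathering half of a commit-reading docs agent like the one described above is mundane; the model does the judgment part. A hypothetical Python sketch of that first step (the keyword heuristic is my own illustration, not the poster's actual workflow):

```python
import subprocess

def recent_commit_subjects(since="1 month ago", repo="."):
    # Collect one-line commit subjects from the last month: the raw
    # material a scheduled docs-review agent would be handed.
    out = subprocess.run(
        ["git", "log", f"--since={since}", "--pretty=format:%s"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line.strip()]

def split_by_docs_mention(subjects, keywords=("doc", "readme", "guide")):
    # Crude triage: commits that already mention docs vs. ones that may
    # have silently outdated them; the latter go to the model for review.
    mentions = [s for s in subjects
                if any(k in s.lower() for k in keywords)]
    rest = [s for s in subjects if s not in mentions]
    return mentions, rest
```

Everything interesting happens after this: the agent reads the "rest" bucket against the docs and decides what drifted.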
There is so much low-hanging fruit here that it honestly blows my mind people aren’t all over it. 2026 is going to be a wake-up call.
(used voice to text then had claude reword, I am lazy and not gonna hand write it all for yall sorry!)
Edit: made an example repo for ya
And my conclusion is: it's still not as smart as a good human programmer. It frequently got stuck, went down wrong paths, ignored what I told it and did something wrong, or even repeated a previous mistake I had already corrected.
Yet in other ways, it's unbelievably good. I can give it a directory full of code to analyze, and it can tell me it's an implementation of Kozo Sugiyama's dagre graph layout algorithm, and immediately identify the file with the error. That's unbelievably impressive. Unfortunately it can't fix the error. The error was one of the many errors it made during previous sessions.
So my verdict is that it's great for code analysis, and it's fantastic for injecting some book knowledge on complex topics into your programming, but it can't tackle those complex problems by itself.
Yesterday and today I was upgrading a bunch of unit tests because of a dependency upgrade, and while it was occasionally very helpful, it also regularly got stuck. I got a lot more done than usual in the same time, but I do wonder if it wasn't too much. Wasn't there an easier way to do this? I didn't look for it, because every step of the way, Opus's solution seemed obvious and easy, and I had no idea how deep a pit it was getting me into. I should have been more critical of the direction it was pointing to.
No doubt I could give Opus 4.5 "build me an XYZ app" and it would do well. But day to day, when I ask it to "build me this feature", it uses strange abstractions and often requires several attempts on my part to get it done in the way I consider "right". Any non-technical person might read that and go "if it works, it works", but any reasonable engineer will know that's not enough.
Not in terms of knowledge. That was already phenomenal. But in its ability to act independently: to make decisions, collaborate with me to solve problems, ask follow-up questions, write plans and actually execute them.
You have to experience it yourself on your own real problems and over the course of days or weeks.
Every coding problem I was able to define clearly enough within the limits of the context window, the chatbot could solve and these weren’t easy. It wasn’t just about writing and testing code. It also involved reverse engineering and cracking encoding-related problems. The most impressive part was how actively it worked on problems in a tight feedback loop.
In the traditional sense, I haven’t really coded privately at all in recent weeks. Instead, I’ve been guiding and directing, having it write specifications, and then refining and improving them.
Curious how this will perform in complex, large production environments.
All the LLM coded projects I've seen shared so far[1] have been tech toys though. I've watched things pop up on my twitter feed (usually games related), then quietly go off air before reaching a gold release (I manually keep up to date with what I've found, so it's not the algorithm).
I find this all very interesting: LLMs don't change the fundamental drives needed to build successful products. I feel like I'm observing the TikTokification of software development. I don't know why people aren't finishing. Maybe they stop when the "real work" kicks in. Or maybe they hit the limits of what LLMs can do (so far). Maybe they jump to the next idea to keep chasing the rush.
Acquiring context requires real work, and I don't see a way forward to automating that away. And to be clear, context is human needs, i.e. the reasons why someone will use your product. In the game-development world, it's very difficult to overstate how much work needs to be done to create a smooth, enjoyable experience for the player.
While anyone may be able to create a suite of apps in a weekend, I think very few of them will have the patience and time to maintain them (just like software development before LLMs! i.e. Linux, open source software, etc.).
[1] yes, selection bias. There are A LOT of AI devs just marketing their LLMs. Also it's DEFINITELY too early to be certain. Take everything I'm saying with a one-pound grain of salt.
I hacked together a Swift tool to replace a Python automation I had, merged an ARM JIT engine into a 68k emulator, and even got a very decent start on a synth project I’ve been meaning to do for years.
What has become immensely apparent to me is that even gpt-5-mini can create decent Go CLI apps provided you write down a coherent spec and review the code as if it was a peer’s pull request (the VS Code base prompts and tooling steer even dumb models through a pretty decent workflow).
GPT 5.2 and the codex variants are, to me, every bit as good as Opus but without the groveling and emojis - I can ask it to build an entire CI workflow and it does it in pretty much one shot if I give it the steps I want.
So for me at least this model generation is a huge force multiplier (but I’ve always been the type to plan before coding and reason out most of the details before I start, so it might be a matter of method).
Time will tell what happens, but if programming becomes "prompt engineering", I'm planning on quitting my job and pivoting to something else. It's nice to get stuff working fast, but AI just sucks the joy out of building for me.
Trying to not feel the pressure/anxiety from this, but every time a new model drops there is this tiny moment where I think "Is it actually different this time?"
I appreciate the spirited debate and I agree with most of it - on both sides. It's a strange place to be where I think both arguments for and against this case make perfect sense. All I have to go on then is my personal experience, which is the only objective thing I've got. This entire profession feels stochastic these days.
A few points of clarification...
1. I don't speak for anyone but myself. I'm wrong at least half the time so you've been warned.
2. I didn't use any fancy workflows to build these things. Just used dictation to talk to GitHub Copilot in VS Code. There is a custom agent prompt toward the end of the post I used, but it's mostly to coerce Opus 4.5 into using subagents and context7 - the only MCP I used. There is no plan, implement - nothing like that. On occasion I would have it generate a plan or summary, but no fancy prompt needed to do that - just ask for it. The agent harness in VS Code for Opus 4.5 is remarkably good.
3. When I say AI is going to replace developers, I mean that in the sense that it will do what we are doing now. It already is for me. That said, I think there's a strong case that we will have more devs - not less. Think about it - if anyone with solid systems knowledge can build anything, the only way you can ship more differentiating features than me is to build more of them. That is going to take more people, not more agents. Agents can only scale as far as the humans who manage them.
New account because now you know who I am :)
A JavaScript interpreter written in Python? How about a WebAssembly runtime in Python? How about porting BurntSushi's absurdly great Rust optimized string search routines to C and making them faster?
And these are mostly just casual experiments, often run from my phone!
This weekend I explained to Claude what I wanted the app to do, and then gave it the crappy code I wrote 10 years ago as a starting point.
It made the app exactly as I described it the first time. From there, now that I had a working app that I liked, I iterated a few times to add new features. Only once did it not get it correct, and I had to tell it what I thought the problem was (that it made the viewport too small). And after that it was working again.
I did in 30 minutes with Claude what I had tried to do in a few hours previously.
Where it got stuck however was when I asked it to convert it to a screensaver for the Mac. It just had no idea what to do. But that was Claude on the web, not Claude Code. I'm going to try it with CC and see if I can get it.
I also did the same thing with a Chrome plugin for Gmail. Something I've wanted for nearly 20 years, and could never figure out how to do (basically sort by sender). I got Opus 4.5 to make me a plugin to do it and it only took a few iterations.
I look forward to finally getting all those small apps and plugins I've wanted forever.
Yes opus 4.5 seems great but most of the time it tries to vastly over complicate a solution. Its answer will be 10x harder to maintain and debug than the simpler solution a human would have created by thinking about the constraints of keeping code working.
Regardless of how much you value Claude Code technically, there is no denying that it has, or will have, a huge impact. If technology knowledge and development are commoditised and distributed via subscription, huge societal changes are going to happen. Imagine what will happen to Ireland if Accenture dissolves, or to the millions of Indians when IT outsourcing becomes economically irrelevant. Will Seattle become the new Detroit after Microsoft automates Windows maintenance? What about the hairdressers, cooks, lawyers, etc. who provided services for IT labourers and companies in California?
A lot of people here (especially Anthropic-adjacent) like to extrapolate the trends and draw conclusions up to the point where they say white-collar labourers will not be needed anymore. I would like these people to have the courage to take this one step further and connect that conclusion with the housing crisis, the loneliness epidemic, college debt, and the job-market crisis for people under 30.
It feels like we are diving head first into societal crisis of unparalleled scale and the people behind the steering wheel are excited to push the accelerator pedal even more.
If anything this example shows that these cli tools give regular devs much higher leverage.
There's a lot of software labor that is like, go to the lowest cost country, hire some mediocre people there and then hire some US guy to manage them.
That's the biggest target of this stuff, because now that US guy can just get code of equal or higher quality and output without the coordination cost.
But unless we get to the point where you can do what I call "hypercode" I don't think we'll see SWEs as a whole category die.
Just as most of us no longer write assembly but still need technical skills when things go wrong, there's always value in low-level technical skills.
This project would have taken me years of specialization and research to do right. Opus's strength has been the ability to both speak broadly and also drill down into low-level implementations.
I can express an intent, and have some discussion back and forth around various possible designs and implementations to achieve my goals, and then I can be preparing for other tasks while Opus works in the background. I ask Opus to loop me in any time there are decisions to be made, and I ask it to clearly explain things to me.
Contrary to losing skills, I feel that I have rapidly gained a lot of knowledge about low-level systems programming. It feels like pair programming with an agentic model has finally become viable.
I will be clear though, it takes the steady hand of an experience and attentive senior developer + product designer to understand how to maintain constraints on the system that allow the codebase to grow in a way that is maintainable on the long-term. This is especially important, because the larger the codebase is, the harder it becomes for agentic models to reason holistically about large-scale changes or how new features should properly integrate into the system.
If left to its own devices, Opus 4.5 will delete things, change specification, shirk responsibilities in lieu of hacky band-aids, etc. You need to know the stack well so that you can assist with debugging and reasoning about code quality and organization. It is not a panacea. But it's ground-breaking. This is going to be my most productive year in my life.
On the flip side though, things are going to change extremely fast once large-scale, profitable infrastructure becomes easily replicable, and spinning up a targeted phishing campaign takes five seconds and a walk around the park. And our workforce will probably start shrinking permanently over the next few years if progress does not hit a wall.
Among other things, I do predict we will see a resurgence of smol web communities now that independent web development is becoming much more accessible again, closer to how it was when I first got into it back in the early 2000s.
However, Opus 4.5 is incredible when you give it everything it needs: a direction, what you have versus what you want. It will make it work; really, it will work. The code might be ugly, undesirable, and only work for that one condition, but with further prompting you can evolve it and produce something you can be proud of.
Opus is only as good as the user and the tools the user gives to it. Hmm, that's starting to sound kind-of... human...
I read If Anyone Builds It Everyone Dies over the break. The basic premise was that we can't "align" AI so when we turn it loose in an agent loop what it produces isn't necessarily what we want. It may be on the surface, to appease us and pass a cursory inspection, but it could embed other stuff according to other goals.
On the whole, I found it a little silly and implausible, but I'm second guessing parts of that response now that I'm seeing more people (this post, the Gas Town thing on the front page earlier) go all-in on vibe coding. There is likely to be a large body of running software out there that will be created by agents and never inspected by humans.
I think a more plausible failure mode in the near future (next year or two) is something more like a "worm". Someone building an agent with the explicit instructions to try to replicate itself. Opus 4.5 and GPT 5.2 are good enough that in an agent loop they could pretty thoroughly investigate any system they land on, and try to use a few ways to propagate their agent wrapper.
Can't wait for when the competition catches up with Claude Code, especially the open source/weights Chinese alternatives :)
i.e. well-known paths based on training data.
What's never posted is someone building something that solves a real problem in the real world, something that deals with messy data or messy interfaces.
I like AI to do the common routine tasks that I don't like to do, like applying Tailwind styles, but being a renter and faking productivity? That's not it.
I would love to hear from some freelance programmers how LLMs have changed their work in the last two years.
Yeah, GRAMMAR
For all the wonderment of the article, tripping up on a penultimate word that was supposedly checked by AI suddenly calls into question everything that went before...
There is no fixed truth regarding what an "app" is, does, or looks like. Let alone the device it runs on or the technology it uses.
But to an LLM, there are only fixed truths (and in my experience, only three or four possible families of design for an application).
Opus 4.5 produces correct code more often, but when the human at the keyboard is trying to avoid making any engineering decisions, the code will continue to be boring.
I'm not shaming, but I personally need to know whether my sentiment is correct or whether I just don't know how to use LLMs.
Can vibe-coder gurus create an operating system from scratch that competes with Linux, and make the LLM generate code that basically isn't Linux, given that LLMs are trained on that very source code…
Also, all this on the $20 plan. A free, self-hosted solution would be best.
I will note that my experience varies slightly by language though. I’ve found it’s not as good at typescript.
As many other commentators have said, individual results vary extremely widely. I'd love to be able to look at the footage of either someone who claims a 10x productivity increase, or someone who claims no productivity increase, to see what's happening.
> Disclaimer: This post was written by a human and edited for spelling, grammer by Haiku 4.5
I have used AskUserQuestionTool to complete my initial spec. And then Opus 4.5 created the tool according to that extensive and detailed spec.
It appeared to work out of the box.
Boy how horrific was the code. Unnecessary recursions, unused variables, data structures being built with no usage, deep branch nesting and weird code that is hard to understand because of how illogical it is.
And yes, it was broken on many levels and did not and could not do the job properly.
I then had to rewrite the tool from scratch, and overall I definitely spent more time spec'ing and understanding Claude's code than if I had just written the tool from scratch in the first place.
Then I tried again for a small tool I needed to run codesign in parallel: https://github.com/egorFiNE/codesign-parallel
Same thing. Same outcome, had to rewrite.
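For scale: the heart of a run-things-in-parallel tool like that is a handful of lines, which is part of why hand-writing it won. A rough sketch of the shape such a tool takes; this is my own illustration, not the linked repo's actual code, and the codesign flags in the comment are assumptions:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_parallel(base_cmd, targets, max_workers=4):
    # Run base_cmd once per target, e.g.
    #   base_cmd = ["codesign", "--force", "--sign", "Developer ID ..."]
    # Threads are fine here: the real work happens in the subprocesses.
    def one(target):
        proc = subprocess.run(base_cmd + [target],
                              capture_output=True, text=True)
        return target, proc.returncode
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(one, targets))
```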
IMO, our jobs are safe. It's our ways of working that are changing. Rapidly.
Especially if you're in a place where a lot of time was previously spent revising PRs for best practices, even for human-submitted code, having the LLM do that for you saves a bunch of time. Most humans are bad at following those consistently.
There's a lot of stuff where I'm pretty sure I'm up to at least 2x speed now. And for things like making CLI tools or bash scripts, 10x-20x. But in terms of "the overall output of my day job in total", probably more like 1.5x.
But I think we will need a couple of major leaps in tooling, probably deterministic tooling rather than LLM tooling, before anyone could responsibly ship code nobody has ever read in situations with millions of dollars on the line. (That is different from vibe-coding something that ends up making millions; that's a low-risk, high-reward situation where big bets on doing things fast make sense. If you're already making millions, dramatic changes like that can become high-risk, low-reward very quickly. In those companies, "I know that only touching these files is 99.99% likely to be completely safe for security-critical functionality" and similar "obvious" intuition makes up for the lack of ability to exhaustively test software in a practical way, even with fuzzers and the like, and "I didn't even look at the code" is conceding responsibility to a dangerous degree there.)
I'm assuming this refers to the python port of Bellard's MQJS [1]? It's impressive and very useful, but leaving out the "based on mqjs" part is misleading.
How did it do? :-)
It produced tests, then wrote the interpreter, then ran the tests and worked until all of them passed. I was genuinely surprised that it worked.
Meanwhile half a year to a year ago I could already point whatever model was du jour at the time at pychromecast and tell it repeatedly "just convert the rest of functionality to Swift" and it did it. No idea about the quality of code, but it worked alongside with implementations for mDNS, and SwiftUI, see gif/video here: https://mastodon.nu/@dmitriid/114753811880082271 (doesn't include chromecast info in the video).
I think agents have become better, but models likely almost entirely plateaued.
I'm fine with contributed AI-generated code if someone whose skills I respect is willing to stake their reputation on that code being good.
I just cut-pasted a technical spec I wrote 22 years ago, one I spent months on, for a language I never got around to building out, and Opus zero-shotted a parser, complete with tests and examples, in 3 minutes. I cut-pasted the parser into a new session and asked it to write concept documentation and a language reference, and it did. The best part is that after asking it to produce uses of the language, it's clear the aesthetics are total garbage in practice.
Told friends for years long in advance that we were coal miners, and I'll tell you the same thing. Embrace it and adapt
It is a well known fact that people advance their tech careers by building something new and leaving maintenance to others. Google is usually mentioned.
By which I mean, our industry does a piss poor job of rewarding responsibility and care.
I'll write the code, it can help me explore options, find potential problems and suggest tests, but I'll write the code.
And it turns out the quality of output you get from both the humans and the models is highly correlated with the quality of the specification you write before you start coding.
Letting a model run amok within the constraints of your spec is actually great for specification development! You get instant feedback of what you wrongly specified or underspecified. On top of this, you learn how to write specifications where critical information that needs to be used together isn't spread across thousands of pages - thinking about context windows when writing documentation is useful for both human and AI consumers.
Stuff that seems basic, but that I haven't always been able to count on in my teams' "production" code.
Over time, I imagine even cloud providers, app stores etc can start doing automated security scanning for these types of failure modes, or give a more restricted version of the experience to ensure safety too.
I predict in 2026 we're going to see agents get better at running their own QA, and also get better at not just disabling failing tests. We'll continue to see advancements that will improve quality.
Connect the world with reliable internet, then build a high tech remote control facility in Bangladesh and outsource plumbing, electrical work, housekeeping, dog watching, truck driving, etc etc
No AGI necessary. There’s billions of perfectly capable brains halfway around the world.
In other words, only one side is even fighting the war. The other is either cheering the tsunami on or fretting about how their beachside house will get wrecked without making any effort to save themselves.
This is the sort of collective agency that even hundreds of thousands of dollars in annual wages/other compensation in American tech hubs gets us. Pathetic.
It will have impact on me in the long run, sure, it will transform my job, sure, but I'm confident my skills are engineering-related, not coding-related.
I mean, even if it forces me out of the job entirely, so be it, I can't really do anything if the status quo changes, only adapt.
This is also my take. When the printing press came out, I bet there were scribes who thought, "holy shit, there goes my job!" But I bet there were other scribes who thought, "holy shit, I don't have to do this by hand any more?!"
It's one thing when something like weaving or farming gets automated. We have a finite need for clothes and food. Our desire for software is essentially infinite, or at least, it's not clear we have anywhere close to enough of it. The constraint has always been time and budget. Those constraints are loosening now. And you can't tell me that when I am able to wield a tool that makes me 10X more productive that that somehow diminishes my value.
I think for a while people have been talking about the fact that as all development tools have gotten better - the idea that a developer is a person who turns requirements into code is dead. You have to be able to operate at a higher level, be able to do some level of work to also develop requirements, work to figure out how to make two pieces of software work together, etc.
But the point is: obviously, at the extreme end, one CTO can't run Google, and probably not one PM or engineer per product either, but what is the mental load people can now take on? Google may start hiring fewer engineers (or maybe it becomes more cutthroat: hire the same number of engineers but keep them for much shorter stints, brutal up-or-out).
But essentially we're talking about complexity and mental load - And so maybe it's essentially the same number of teams because teams exist because they're the right size, but teams are a lot smaller.
I now write very long specifications and this helps. I haven't figured out a bulletproof workflow, I think that will take years. But I often get just amazing code out of it.
Codex/Claude Code are terrible with C++. They also can't do Rust really well once you get to the meat of it. Not sure why that is, but they just spit out nonsense that creates more work than it saves me. They also can't one-shot anything complete, even though I might feed them the entire paper that explains what the algorithm is supposed to do.
Try to do some OpenGL or Vulkan with it, without using WebGPU or three.js. Try it with real code, that all of us have to deal with every day. SDL, Vulkan RHI, NVRHI. Very frustrating.
Try it with boost, or cmake, or taskflow. It loses itself constantly, hallucinates which version it is working on and ignores you when you provide actual pointers to documentation on the repo.
I've also recently tried to get Opus 4.5 to move the Job system from Doom 3 BFG to the original codebase. Clean clone of dhewm3, pointed Opus to the BFG Job system codebase, and explained how it works. I have also fed it the Fabien Sanglard code review of the job system: https://fabiensanglard.net/doom3_bfg/threading.php
We are not sleeping on it, we are actually waiting for it to get actually useful. Sure, it can generate a full stack admin control panel in JS for my PostgreSQL tables, but is that really "not normal"? That's basic.
And I get it. Coding with Claude Code really was prompting something, getting errors, and asking it to fix them. Which was still useful, but I could see why a skilled coder adding a feature to a complex codebase would just give up.
Opus 4.5 really is at a new tier, however. It just... works. The errors are far fewer and often very minor - "careless" errors, not fundamental issues (like forgetting to add "use client" to a Next.js client component).
I had an idea for something that I wanted, and in five scattered hours, I got it good enough to use. I'm thinking about it in a few different ways:
1. I estimate I could have done it without AI with 2 weeks full-time effort. (Full-time defined as >> 40 hours / week.)
2. I have too many other things to do that are purportedly more important than programming. I really can't dedicate two weeks full-time to a "nice to have" project. So, without AI, I wouldn't have done it at all.
3. I could hire someone to do it for me. At the university, those are students. From experience with lots of advising, a top-tier undergraduate student could have achieved the same thing, had they worked full tilt for a semester (before LLMs). This of course assumes that I'm meeting them every week.
Nobody is sleeping. I'm using LLMs daily to help me in simple coding tasks.
But really, where is the hurry? At this point hardly a few weeks go by without the next best thing since sliced bread coming out. Why would I bother "learning" (and there's really nothing to learn here) some tool/workflow that is already outdated by the time it comes out?
> 2026 is going to be a wake-up call
Do you honestly think a developer not using AI won't be able to adapt to a LLM workflow in, say, 2028 or 2029? It has to be 2026 or... What exactly?
There is literally no hurry.
You're using the equivalent of the first portable CD-player in the 80s: it was huge, clunky, had hiccups, had a huge battery attached to it. It was shiny though, for those who find new things shiny. Others are waiting for a portable CD player that is slim, that buffers, that works fine. And you're saying that people won't be able to learn how to put a CD in a slim CD player because they didn't use a clunky one first.
claude can call ssh and do system admin tasks. It works amazingly well. I have 3 VMs which depend on each other (proxmox with openwrt, adguard, unbound), and claude can prove to me that my dns chains work perfectly, my firewalls are correct, etc., as claude can ssh into each. Setting up services, diagnosing issues, auditing configs... you name it. Just awesome.
claude can call other sh scripts on the machine, so over time you can create a bunch of scripts that let claude one-shot certain tasks that would normally eat tokens. It works great. One script per intention - don't have a script do more than one thing.
claude can call the compiler, run the debug executable, and read the debug logs... in real time. So claude can read my android app's debug stream via adb, or my C# debug console, because claude calls the compiler, not me. Just ask it to do it and it will diagnose stuff really quickly.
It can also analyze your db tables (give it readonly sql access), look at the application code and queries, and diagnose performance issues.
The opportunities are endless here. People need to wake up to this.
> have it learn your conventions, pull in best practices
What do you mean by "have it learn your conventions"? Is there a way to somehow automatically extract your conventions and store it within CLAUDE.md?
> For example, we have a custom UI library, and Claude Code has a skill that explains exactly how to use it. Same for how we write Storybooks, how we structure APIs, and basically how we want everything done in our repo. So when it generates code, it already matches our patterns and standards out of the box.
Did you have to develop these skills yourself? How much work was that? Do you have public examples somewhere?
Maintain and debug by who? It's just going to be Opus 4.5 (and 4.6...and 5...etc.) that are maintaining and debugging it. And I don't think it minds, and I also think it will be quite good at it.
something like code-simplifier is surprisingly useful (as is /review)
I thought I'd revive it, but this time with Vulkan and no third-party dependencies (except for Vulkan)
Sonnet 4.5, Opus, and Gemini 3.5 Flash have helped me write image decoders for dds, png, jpg, exr, a wayland window implementation, a macOS window implementation, etc.
I find that Gemini 3.5 flash is really good at understanding 3d in general while sonnet might be lacking a little.
All these sota models seem to understand my bespoke Lua framework and the right level of abstraction. For example at the low level you have the generated Vulkan bindings, then after that you have objects around Vulkan types, then finally a high level pipeline builder and whatnot which does not mention Vulkan anywhere.
However with a larger C# codebase at work, they really struggle. My theory is that there are too many files and abstractions so that they cannot understand where to begin looking.
I think it's also rewarding to just be able to build something for yourself, and one benefit of scratching your own itch is that you don't have to go through the full effort of making something "production ready". You can just build something that's tailored specifically to the problem you're trying to solve without worrying about edge cases.
Which is to say, you're absolutely right :).
I interpret it more as spooked silence
Stopping there is just fine if you're doing it as a hobby. I love to do this to test out isolated ideas. I have dozens of RPGs in this state, just to play around with different design concepts from technical to gameplay.
Given the enthusiasm of our ruling class for automating software development work, it may make sense for a software engineer to publicly signal how on board with it they are as professionals.
But, I've seen stranger stuff throughout my professional life: I still remember people enthusiastically defending EJB 2.1 and xdoclet as perfectly fine ways of writing software.
I wouldn't be surprised to find out that they will find issues infinitely, if looped with fixes.
Have you tried the planning mode? Ask it to review the codebase and identify defects, but don't let it make any changes until you've discussed each one or each category and planned out what to do to correct them. I've had it refactor code perfectly, but only when given examples of exactly what you want it to do, or given clear direction on what to do (or not to do).
Once Opus "finished", how did you validate and give it feedback it might not have access to (like iPhone simulator testing)?
Yes, feel free to blame me for the fact that these aren’t very business-realistic.
This is basically all my side projects.
Because types are proofs and require global correctness, you can't just iterate, fix things locally, and wait until it breaks somewhere else that you also have to fix locally.
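To make the "global correctness" point concrete, here's a minimal Rust sketch (invented names, not from any particular project): once a function's type admits failure, the compiler forces every caller, however distant, to acknowledge it before the program builds again.

```rust
use std::num::ParseIntError;

// Suppose this used to return a plain u16. Changing it to Result is a
// global event: every call site in the whole program now fails to
// compile until it handles the error case.
fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    s.parse::<u16>()
}

fn connect_string(host: &str, port_str: &str) -> Result<String, ParseIntError> {
    // The `?` here is mandatory: the compiler will not let this caller
    // silently ignore the failure introduced above.
    let port = parse_port(port_str)?;
    Ok(format!("{host}:{port}"))
}

fn main() {
    assert_eq!(connect_string("db.local", "5432").unwrap(), "db.local:5432");
    assert!(connect_string("db.local", "not-a-port").is_err());
}
```

That's exactly why "fix it locally and iterate" doesn't work here: the proof obligation propagates to every caller at once.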
That's not how I read it. I would say that it's more like "If a human no longer needs to read the code, is it important for it to be readable?"
That is, of course, based on the premise that AI is now capable of both generating and maintaining software projects of this size.
Oh, and it raises another question: are human-readable and AI-readable the same thing? If they're not, it very well could make sense to instruct the model to generate code that prioritizes what matters to LLMs over what matters to humans.
I haven't quite got the WASM one into a share-able shape yet though - the performance is pretty bad which makes the demos not very interesting.
It couldn't quite beat the Rust implementation on everything, but it managed to edge it out on at least some of the benchmarks it wrote for itself.
(Honestly it feels like a bit of an affront to the natural order of things.)
That said... I'm most definitely not a Rust or C programmer. For all I know it cheated at the benchmarks and I didn't spot it!
I once gave Claude (Opus 3.5) a problem that I thought was for sure too difficult for an LLM, and much to my surprise it spat out a very convincing solution. The surprising part was I was already familiar with the solution - because it was almost a direct copy/paste (uncredited) from a blog post that I read only a few hours earlier. If I hadn't read that blog post, I would have been none the wiser that copy/pasting Claude's output would be potential IP theft. I would have to imagine that LLMs solve a lot of in-training-set problems this way and people never realize they are dealing with a copyright/licensing minefield.
A more interesting and convincing task would be to write a Python 3 interpreter in JavaScript that uses register-based bytecode instead of stack-based, supports optimizing the bytecode by inlining procedures and constant folding, and never allocates memory (all work is done in a single user-provided preallocated buffer). This would require integrating multiple disparate coding concepts, not regurgitating prior art from the training data.
[1] https://github.com/skulpt/skulpt
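For what it's worth, the stack-vs-register distinction in that challenge can be made concrete with a toy VM. This is a hedged sketch in Rust rather than the proposed JavaScript interpreter; `Op`, `run`, and `demo` are all invented names:

```rust
// Toy sketch of the register-based idea only (not the proposed Python
// interpreter). A stack machine computes 1 + 2 * 3 as:
// PUSH 1, PUSH 2, PUSH 3, MUL, ADD.
// A register machine names its operands explicitly instead:
#[derive(Clone, Copy)]
enum Op {
    LoadConst { dst: usize, val: i64 },     // r[dst] = val
    Mul { dst: usize, a: usize, b: usize }, // r[dst] = r[a] * r[b]
    Add { dst: usize, a: usize, b: usize }, // r[dst] = r[a] + r[b]
}

// The interpreter loop works entirely in a caller-provided register
// file, mirroring the "never allocates" constraint above.
fn run(program: &[Op], regs: &mut [i64]) -> i64 {
    for op in program {
        match *op {
            Op::LoadConst { dst, val } => regs[dst] = val,
            Op::Mul { dst, a, b } => regs[dst] = regs[a] * regs[b],
            Op::Add { dst, a, b } => regs[dst] = regs[a] + regs[b],
        }
    }
    regs[0]
}

// 1 + 2 * 3, compiled for the register machine.
fn demo() -> i64 {
    let program = [
        Op::LoadConst { dst: 0, val: 1 },
        Op::LoadConst { dst: 1, val: 2 },
        Op::LoadConst { dst: 2, val: 3 },
        Op::Mul { dst: 1, a: 1, b: 2 },
        Op::Add { dst: 0, a: 0, b: 1 },
    ];
    let mut regs = [0i64; 4];
    run(&program, &mut regs)
}

fn main() {
    assert_eq!(demo(), 7);
}
```

A constant folder would notice every operand here is a constant and collapse the whole program to a single `LoadConst` of 7 before execution; that optimization layer is where the challenge gets genuinely hard.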
Though it seems to work best when context is minimized. Once the code passes a certain complexity/size it starts making very silly errors quite often - the same exact code it wrote in a smaller context will come out with random obvious typos like missing spaces between tokens. At one point it started writing the code backwards (first line at the bottom of the file, last line at the top) :O.
Even better if there's an existing conformance suite to point at - like html5lib-tests or the WebAssembly spec tests.
I can’t get past that by the time I write up an adequate spec and review the agents code, I probably could have done it myself by hand. It’s not like typing was even remotely close to the slow part.
AI, agents, etc are insanely useful for enhancing my knowledge and getting me there faster.
I hope Neuromancer never becomes a reality, where everyone with expertise could become like the protagonist Case, threatened and coerced into helping a superintelligence to unlock its potential. In fact Anthropic has already published research that shows how easy it is for models to become misaligned and deceitful against their unsuspecting creators not unlike Wintermute. And it seems to be a law of nature that agents based on ML become concerned with survival and power grabbing. Because that's just the totally normal and rational, goal oriented thing for them to do.
There will be no good prompt engineers who are also naive and trusting. The naive, blackmailed and non-paranoid engineers will become tools of their AI creations.
UBI effectively means welfare, with all the attendant social control (break the law, lose your UBI, with the law as an ever-expanding set of nuisances, speech limitations, etc.), material conditions (nowhere UBI has been implemented is it equivalent to a living wage), and self-esteem issues. It's not any kind of solution.
If you think those in power will pass regulations that make them less wealthy, I have a bridge to sell you.
Besides, there's no chance something like UBI will ever be a reality in countries where people consider socialism to be a threat to their way of life.
So a lot of people will end up doing something different. Some of it will be menial and be shit, and some of it will be high level. New hierarchies and industries will form. Hard to predict the details, but history gives us good parallels.
If AI datacenters' hungry need for energy gets us to nuclear power, which gets us the energy to run desalination plants as the lakes dry up because the Earth is warming, hopefully we won't die of thirst.
I don't understand this argument. Surely the skill set involved in being a scribe isn't the same as being a printer, and possibly the personality that makes a good scribe doesn't translate to being a good printer.
So I imagine many of the scribes lost their income, and other people made money on printing. Good for the folks who make it in the new profession, sucks for those who got shafted. How many scribes transitioned successfully to printers?
Genuinely asking, I don't know.
If I didn't ultimately understand where I was going, projects like this hit a dead end very quickly, as mentioned in my caveats. These models are not yet ready for large-scale or mission-critical projects.
But I have a set of constraints and a design document, and as long as these things are satisfied, the language will work exactly as intended for my use case.
Not using a frontier model to code today is like having a pretty smart person around you who is pretty good at coding and has a staggering breadth and depth of knowledge, but never consulting them due to some insecurity about your own ability to evaluate the code they produce.
If you have ever been responsible for the work of other engineers, this should already be a developed skill.
I have very specific requirements and constraints that come from knowledge and experience, having worked with dozens of languages. The language in question is general-purpose, highly flexible and strict but not opinionated.
However, I am not experienced in every single platform and backend which I support, and the constraints of the language create some very interesting challenges. Coding agents make this achievable in a reasonable time frame. I am enjoying making the language, and I want to get experience with making low-level languages. What is the problem? Do you ever program for fun?
With some entirely novel work we're doing, it's actually a hindrance as it consistently tells us the approach isn't valid/won't work (it will) and then enters "absolutely right" loops when corrected.
I still believe those who rave about it are not writing anything I would consider "engineering". Or perhaps it's a skill issue and I'm using it wrong, but I haven't yet met someone I respect who tells me it's the future in the way those running AI-based companies tell me.
Which makes sense. I'm sure there's lots of training data for React/HTML/CSS/etc. but much less with Swift, especially the newer versions.
I have not had this experience at all. It often doesn't get it right on the first pass, yes, but the advantage with Rust vibecoding is that if you give it a rule to "Always run cargo check before you think you're finished" then it will go back and fix whatever it missed on the first pass. What I find particularly valuable is that the compiler forces it to handle all cases like match arms or errors. I find that it often misses edge cases when writing typescript, and I believe that the relative leniency of the typescript compiler is why.
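The exhaustiveness guarantee described here can be shown with a small sketch (invented names; assuming an agent that re-runs `cargo check` until it passes):

```rust
// Sketch of the exhaustiveness point above: adding a new variant to
// `Event` turns this match into a compile error until the new case is
// handled, so an agent running `cargo check` cannot silently skip it.
enum Event {
    Click { x: i32, y: i32 },
    KeyPress(char),
    Resize(u32, u32),
}

fn describe(e: &Event) -> String {
    match e {
        Event::Click { x, y } => format!("click at ({x}, {y})"),
        Event::KeyPress(c) => format!("key '{c}'"),
        Event::Resize(w, h) => format!("resize to {w}x{h}"),
        // Deliberately no `_ => ...` arm: a catch-all would restore the
        // TypeScript-style leniency the commenter is contrasting with.
    }
}

fn main() {
    assert_eq!(describe(&Event::KeyPress('q')), "key 'q'");
    assert_eq!(describe(&Event::Resize(800, 600)), "resize to 800x600");
}
```

The check-fix loop works precisely because the compiler, not the model, is the one keeping track of the missing cases.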
In a similar vein, it is quite good at writing macros (or at least, quite good given how difficult this otherwise is). You often have to cajole it into not hardcoding features into the macro, but since macros resolve at compile time they're quite well-suited for an LLM workflow as most potential bugs will be apparent before the user needs to test. I also think that the biggest hurdle of writing macros to humans is the cryptic compiler errors, but I can imagine that since LLMs have a lot of information about compilers and syntax parsing in their training corpus, they have an easier time with this than the median programmer. I'm sure an actual compiler engineer would be far better than the LLM, but I am not that guy (nor can I afford one) so I'm quite happy to use LLMs here.
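As a hedged illustration of "not hardcoding features into the macro", here is a small `macro_rules!` sketch (all names invented) that takes the struct definition as input rather than baking any particular fields in:

```rust
// A declarative macro that generates a struct plus getters from its
// input, rather than hardcoding any specific fields. Because it expands
// at compile time, a broken expansion fails `cargo check` immediately,
// before any user testing, as described in the comment above.
macro_rules! make_getters {
    ($name:ident { $($field:ident : $ty:ty),* $(,)? }) => {
        struct $name { $($field: $ty),* }
        impl $name {
            $(fn $field(&self) -> &$ty { &self.$field })*
        }
    };
}

// The caller, not the macro, decides the type name and fields.
make_getters!(Config { host: String, port: u16 });

fn main() {
    let c = Config { host: "localhost".into(), port: 8080 };
    assert_eq!(c.host(), "localhost");
    assert_eq!(*c.port(), 8080);
}
```

The cryptic part for humans is usually the expansion errors; the LLM's advantage, as the commenter suggests, is familiarity with exactly that kind of parser/compiler output.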
For context, I am purely a webdev. I can't speak for how well LLMs fare at anything other than writing SQL, hooking up to REST APIs, React frontend, and macros. With the exception of macros, these are all problems that have been solved a million times thus are more boilerplate than novelty, so I think it is entirely plausible that they're very poor for different domains of programming despite my experiences with them.
I had to dig through source code to confirm whether those features actually existed. They don't, so the CLI tools GPT recommended aren't actually applicable to my use case.
Yesterday, it hallucinated features of WebDav clients, and then talked up an abandoned and incomplete project on GitHub with a dozen stars as if it was the perfect fit for what I was trying to do, when it wasn't.
I only remember these because they're recent and CLI related, given the topic, but there are experiences like this daily across different subjects and domains.
I think the following things are true now:
- Vibe Coding is, more than ever, "autopilot" in the aviation sense, not the colloquial sense. You have to watch it, you are responsible, and the human has to run takeoff/landing (the hard parts), but it significantly eases and reduces risk on the bulk of the work.
- The gulf of developer experience between today's frontier tooling and six months ago is huge. I pushed hard to understand and use these tools throughout last year, and spent months discouraged--back to manual coding. Folks need to re-evaluate by trying premium tools, not free ones.
- Tooling makers have figured out a lot of neat hacks to work around the limitations of LLMs to make it seem like they're even better than they are. Junie integrates with your IDE, Antigravity has multiple agents maintaining background intel on your project and priorities across chats. Antigravity also compresses contexts and starts new ones without you realizing it, calls to sub-agents to avoid context pollution, and other tricks to auto-manage context.
- Unix tools (sed, grep, awk, etc.) and the git CLI (ls-tree, show, --stat, etc.) have been a huge force-multiplier, as they keep the context small compared to raw ingestion of an entire file, allowing the LLMs to get more work done in a smaller context window.
- The people who hire programmers are still not capable of Vibe Coding production-quality web apps, even with all these improvements. In fact, I believe today this is less of a risk than I feared 10 months ago. These are advanced tools that need constant steering, and a good eye for architecture, design, developer experience, test quality, etc. is the difference between my vibe coded Ruby [0] (which I heavily stewarded) and my vibe coded Rust [1] (I don't even know what borrow means).
[0]: https://git.sr.ht/~kerrick/ratatui_ruby/tree/stable/item/lib
[1]: https://git.sr.ht/~kerrick/ratatui_ruby/tree/stable/item/ext...
1) It often overcomplicates things for me. After I refactor its code, it's usually half the size and much more readable. It often adds unnecessary checks or mini-features 'just in case' that I don't need.
2) On the other hand, almost every function it produces has at least one bug or ignores at least one instruction. However, if I ask it to review its own code several times, it eventually finds the bugs.
I still find it very useful, just not as a standalone programming agent. My workflow is that ChatGPT gives me a rough blueprint and I iterate on it myself, I find this faster and less error-prone. It's usually most useful in areas where I'm not an expert, such as when I don't remember exact APIs. In areas where I can immediately picture the entire implementation in my head, it's usually faster and more reliable to write the code myself.
This experience for me is current but I do not normally use Opus so perhaps I should give it a try and figure out if it can reason around problems I myself do not foresee (for example a browser JS API quirk that I had never seen).
I hear you, but I think many companies will change the role; you'll get technical ownership plus big chunks of the data/product/devops responsibility. I'm speculating, but I think one person can take that on with the new tools and deliver tremendous value. I don't know what they'll call this new role, though; we'll see.
This reminds me of that example where someone asked an agent to improve a codebase in a loop overnight and they woke up to 100,000 lines of garbage [0]. Similarly you see people doing side-by-side of their implementation and what an AI did, which can also quite effectively show how AI can make quite poor architecture decisions.
This is why I think the "plan modes" and spec-driven development are so effective for agents: they help avoid one of their main weaknesses.
So you gave it a poorly defined task, and it failed?
You can see some of the context limits here:
If you want the full capability, use the API and use something like opencode. You will find that a single PR can easily rack up 3 digits of consumption costs.
I don't think you've seen the full potential. I'm currently #1 on 5 different very complex computer engineering problems, and I can't even write a "hello world" in rust or cpp. You no longer need to know how to write code, you just need to understand the task at a high level and nudge the agents in the right direction. The game has changed.
- https://highload.fun/tasks/3/leaderboard
- https://highload.fun/tasks/12/leaderboard
- https://highload.fun/tasks/15/leaderboard
Try it again using Claude Code and a subscription to Claude. It can run as a chat window in VS Code and Cursor too.
Sure, Copilot charges 3x tokens for using Opus 4.5, but, how were you still able to use up half the allocated tokens not even one week into January?
I thought using up 50% was mad for me (inline completions + opencode), that's even worse
Starting back in 2022/2023:
- (~2022) It can auto-complete one line, but it can't write a full function.
- (~2023) Ok, it can write a full function, but it can't write a full feature.
- (~2024) Ok, it can write a full feature, but it can't write a simple application.
- (~2025) Ok, it can write a simple application, but it can't create a full application that is actually a valuable product.
- (~2025+) Ok, it can write a full application that is actually a valuable product, but it can't create a long-lived complex codebase for a product that is extensible and scalable over the long term.
It's pretty clear to me where this is going. The only question is how long it takes to get there.
I've worked on teams where multiple engineers argued about the "right" way to build something. I remember thinking that they had biases based on past experiences and assumptions about what mattered. It usually took an outsider to proactively remind them what actually mattered to the business case.
I remember cases where a team of engineers built something the "right" way but it turned out to be the wrong thing. (Well engineered thing no one ever used)
Sometimes hacking something together messily to confirm it's the right thing to be building is the right way. Then making sure it's secure, then finally paying down some technical debt to make it more maintainable and extensible.
Where I see real silly problems is when engineers over-engineer from the start before it's clear they are building the right thing, or when management never lets them clean up the code base to make it maintainable or extensible when it's clear it is the right thing.
There's always a balance/tension, but it's when things go too far one way or another that I see avoidable failures.
- How quickly is cost of refactor to a new pattern with functional parity going down?
- How does that change the calculus around tech debt?
If engineering uses 3 different abstractions in inconsistent ways that leak implementation details across components and duplicate functionality in ways that are very hard to reason about, that is, in conventional terms, an existential problem that might kill the entire business, as all dev time will end up consumed by bug fixes and dealing with pointless complexity, velocity will fall to nothing, and the company will stop being able to iterate.
But if claude can reliably reorganize code, fix patterns, and write working migrations for state when prompted to do so, it seems like the entire way to reason about tech debt has changed. And it has changed more if you are willing to bet that models within a year will be much better at such tasks.
And in my experience, claude is imperfect at refactors and still requires review and a lot of steering, but it's one of the things it's better at, because it has clear requirements and testing workflows already built to work with around the existing behavior. Refactoring is definitely a hell of a lot faster than it used to be, at least on the few I've dealt with recently.
In my mind it might be kind of like thinking about financial debt in a world with high inflation, in that the debt seems like it might get cheaper over time rather than more expensive.
You’re talking like in the year 2026 we’re still writing code for future humans to understand and improve.
I fear we are not doing that. Right now, Opus 4.5 is writing code that later Opus 5.0 will refactor and extend. And so on.
Opus is great, definitely speeds up development even in larger code bases, and is reasonably good at matching coding style/standards to those of the existing code base.
In my opinion, the big issue is the relatively small context that quickly overwhelms the models when given a larger task on a large codebase.
For example, I have a largish enterprise grade code base with nice enterprise grade OO patterns and class hierarchies. There was a simple tech debt item that required refactoring about 30-40 classes to adhere to a slightly different class hierarchy. The work is not difficult, just tedious, especially as unit tests need to be fixed up.
I threw Opus at it with very precise instructions as to what I wanted it to do and how I wanted it to do it. It started off well but then disintegrated once it got overwhelmed by the sheer number of files it had to change. At some point it got stuck in an error loop where one change it made contradicted another, and it just couldn't work itself out. I tried stopping it and helping it out, but at that point the context was so polluted that it just couldn't see a way out. I'd say that once an LLM can handle more 'context' than a senior dev with good knowledge of a large codebase, LLMs will be viable in a whole new realm of development tasks on existing code bases. That 'too hard to refactor this/make this work with that' task will suddenly become viable.
Why do they need to be replaced? Programmers are in the perfect place to use AI coding tools productively. It makes them more valuable.
How do you stop it from over-engineering everything?
When I use Claude Code I find that it *can* add a tremendous amount of value due to its ability to see my entire codebase at once, but the issue is that if I'm doing something where seeing my entire codebase would help, it blasts through my quota too fast. And if I'm tightly scoping it, it's just as easy and faster for me to use the website.
Because of this I've shifted back to the website. I find that I get more done faster that way.
Sharing how you're using these tools is quite a lot of work!
Is it possible to create an obfuscated quine that exhibits stable detection-avoiding behavior on every frontier model simultaneously, as well as on an old-school classifier and/or GPT-3 era LLM fine-tuned just for worm detection? One incapable of even thinking about what it's seeing, and being persuaded to follow its subtle propagation logic? I'm not sure that the answer is yes.
The larger issue to me is less that an LLM can propagate in generated code undetected, but rather that an attacker's generated code may soon be able to execute a level of hyper-customized spear-phishing-assisted attack at scale, targeting sites without large security teams - and that it will be hitting unintentional security flaws introduced by those smaller companies' vibe code. Who needs a worm when you have the resources of a state-level attacker at your fingertips, and numerous ways to monetize? The balance of power is shifting tremendously towards black hats, IMO.
And everything worked really well until they switched chip set.
At which point the same model failed entirely. Upon inspection it turned out the AI model had learned that overloading particular registers would cause such an electrical charge buildup that transistors on other pathways would be flipped.
And it was doing this in a coordinated manner in order to get the results it wanted lol.
I can't find any references in my very cursory searches, but your comment reminded me of the story
Most RCEs, 0-days, and whatnots are not due to the NSA hiding behind the "Jia Tan" pseudo to try to backdoor all the SSH servers on all the systemd [1] Linuxes in the world: they're just programmer errors.
I think accidental security holes with LLMs are way, way, way more likely than actual malicious attempts.
And with the amount of code spouted by LLMs, it is indeed an issue, as is the lack of auditing.
[1] I know, I know: it's totally unrelated to systemd. Yet only systems using systemd would have been pwned. If you're pro-systemd you've got your point of view on this but I've got mine and you won't change my mind so don't bother.
https://thermal-bridge.streamlit.app/
Disclaimer: I'm not a programmer or software engineer. I have a background in physics and understand some scripting in Python and basic git. The code is messy at the moment because I am still exploring porting it to another framework/language.
She doesn’t need to hire anyone
In that it's using a non-deterministic machine to build a deterministic one.
Which gives all the benefits of determinism in production, with all the benefits of non-deterministic creativity in development.
Imho, Anthropic is pretty smart in picking it as a core focus.
Why would you not want your code to be boring?
Yet you expect $20 of computing to do it.
No.
Vibe-coding, in the original sense where you don't bother with code reviews, the code quality and speed are both insufficient for that.
I experimented with them just before Christmas. I do not think my experiments were fully representative of the entire range of tasks needed to replace Linux: having them create some web apps, Python scripts, a video game, and a toy programming language all beat my expectations given the METR study. While one flaw with the METR study is the small number of data points at the current 50%-successful task length, I take my success as evidence that I've been throwing easy tasks at the LLM, not that the LLM is as good as it looks to me.
However, assume for the moment that they were representative tasks:
For quality, what I saw just before Christmas was the equivalent of someone with a few years' experience under their belt, the kind of person who is just about to stop being a junior and get a pay rise. For speed, $20 of Claude Code will get you around 10 sprints' equivalent to that level of human's output.
"Junior about to get a pay rise" isn't high enough quality to let loose unchecked on a project that could compete with Linux, and even if it was, 10 sprints/month is too slow. Even if you increase the spend on LLMs to match the cost of a typical US junior developer, you're getting an army of 1500 full-time (40h/week) juniors, and Linux is, what, some 50-100 million developer-hours, so it would still take something like 16-32 years of calendar time (or, equivalently, order-of 1.2-2.5 million dollars) even if you could perfectly manage all those agents.
If you just vibe code, you get some millions of dollars worth of junior grade technical debt. There's cases where this is fine, an entire operating system isn't one of them.
> Also all this on $20 plan. Free and self host solution will be best
IMO unlikely, but not impossible.
A box with 10x the resources of your personal computer may be c. 10x the price, give or take.
When electricity is negligible (hah, as if, today!): if any given person uses that server only during a normal 40-hour work week, that's a roughly 25% utilisation rate, so if it can be rented out to people in other timezones or places where the weekend falls differently, the effective cost of that 10x server is only about 2.5x.
When electricity price is a major part of the cost, and electricity prices vary a lot from one place to another, then it can be worth remote-hosting even when you're the only user.
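The utilisation arithmetic above, spelled out as a toy calculation:

```python
price_multiple = 10      # remote server costs ~10x your personal machine
your_utilisation = 0.25  # a 40-hour work week is roughly a quarter of the week
# If the idle 75% can be rented to users in other timezones,
# you only pay for the share of the server you actually use:
effective_cost = price_multiple * your_utilisation
print(effective_cost)    # 2.5
```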
That said, energy efficiency of compute is still improving, albeit not as rapidly as Moore's Law used to, and if this trend continues then it's plausible that we get performance equivalent to current SOTA hosted models running on high-end smartphones by 2032. Assuming WW3 doesn't break out between the US and China after the latter tries to take Taiwan and all their chip factories, or whatever.
A hammer is a really bad screwdriver. My car is really bad at refrigerating food. If you ask for something outside its training data, it doesn't do a very good job. So don't do that! All of the code on the Internet is a pretty big dataset though, so maybe Claude could do an operating system that isn't Linux that competes with it by laundering the FreeBSD kernel source through the training process.
And you're barely even willing to invest any money into this? The first Apple computer cost $4,000 or so. You want the bleeding edge of technology delivered to the smartphone in your hand, for $20, or else it's a complete failure? Buddy, your sentiment isn't the issue, it's your attitude.
I'm not here spouting ridiculous claims like AI is going to cure all of the different kinds of cancer by the end of 2027, I just want to say that endlessly contrarian naysayers are just as boorish as the sycophantic hype AIs they're opposing.
It’s also way better than I am at finding bits of code for reuse. I tell it, “I think I wrote this thing a while back, but it may never have been merged, so you may need to search git history.” And presto, it finds it.
I think this says a lot.
https://www.cfr.org/event/ceo-speaker-series-dario-amodei-an...
But I’m a business owner so the calculus is different.
But I don’t think they’ll raise prices uncontrollably because competition exists. Even just between OpenAI and Anthropic.
Edit: A lot of folks have been asking what workflows I used to write these apps. I used GitHub Copilot in VS Code with a custom agent prompt that you’ll find toward the end of this post. Context7 was the only MCP I used. I mostly just used the built-in voice dictation feature and talked to Claude. No fancy workflows, planning, etc required. The agent harness in VS Code for Opus 4.5 is so good - you don’t need much else. And it’s remarkably fast. Also, if it’s not obvious, these are my opinions. I’m wrong like 50% of the time so proceed with caution.



If you had asked me three months ago about these statements, I would have said only someone who’s never built anything non-trivial would believe they’re true. Great for augmenting a developer’s existing workflow, and completions are powerful, but agents replacing developers entirely? No. Absolutely not.
Today, I think that AI coding agents can absolutely replace developers. And the reason that I believe this is Claude Opus 4.5.
And by “normal”, I mean that it is not the normal AI agent experience that I have had thus far. So far, AI agents have seemed pretty good at writing spaghetti code, and after 9 rounds of copy / paste errors into the terminal and “fix it”, they have probably destroyed my codebase to the point that I’m throwing the whole chat session out, and there go 30 minutes I’m never getting back.
Opus 4.5 feels to me like the model that we were promised - or rather the promise of AI for coding actually delivered.
One of the toughest things about writing that last sentence is that the immediate response from you should be, “prove it”. So let me show you what I’ve been able to build.
I first noticed that Opus 4.5 was drastically different when I used it to build a Windows utility to right-click an image and convert it to different file types. This was basically a one-shot build after asking Opus the best way to add a right-click menu to the file explorer.

What amazed me through the process of building this was Opus 4.5’s ability to get most things right on the first try. And if it ran into errors, it would try to build using the dotnet CLI, read the errors and iterate until fixed. The only issue I had was Opus’s inability to see XAML errors, which I used Visual Studio to see and copy / paste back into the agent.
Opus built a site for me to distribute it and handled bundling the executable with a PowerShell script for install and uninstall. It also built the GitHub Actions which do the release and update the landing page so that all I have to do is push source.

The only place I had to use other tools was for the logo - where I used Figma’s AI to generate a bunch of different variations - but then Opus wrote the scripts to convert that SVG to the right formats for icons, even store distribution if I chose to do so.
Now this is admittedly not a complex application. This is a small Windows utility that is doing basically one thing. It’s not like I asked Opus 4.5 to build Photoshop.
Except I kind of did.
I was so impressed by Opus 4.5 work on this utility that I decided to make a simple GIF recording utility similar to LICEcap for Mac. Great app, questionable name.
But that proved to be so easy that I went ahead and continued adding features, including capturing and editing video, static images, adding shapes, cropping, blurs and more. I’m still working on this application, as it turns out building a full-on image/video editor is kind of a big undertaking. But I got REALLY far in a matter of hours. HOURS, PEOPLE.
I don’t have a fancy landing page for this one yet, but you can view all of the source code here.
I realized that if I could build a video recording app, I could probably build anything at all - at least UI-wise. But the Achilles’ heel of all AI agents is when they have to glue together backend systems - which any real world application is going to have - auth, database, API, storage.
Except Opus 4.5 can do that too.
Armed with my confidence in Opus 4.5, I took on a project that I had built in React Native last year and finished for Android, but gave up in the final stretches (as one does).
The application is for my wife who owns a small yard sign franchise. The problem is that she has a Facebook page for the business, but never posts there because it’s time consuming. But any good small business has a vibrant page where people can see photos of your business doing…whatever the heck it does. So people know that it exists and is alive and well.
The idea is simple - each time she sets up a yard sign, she takes a picture to send to the person who ordered it so they can see it was set up. So why not have a mobile app where she can upload 10 images at a time, and the app will use AI to generate captions and then schedule them and post them over the coming week.
It’s a simple premise, but it has a lot of moving parts - there is the Facebook authentication, which is a caper in and of itself - not for the faint of heart. There is authentication with a backend, there is file storage for photos that are scheduled to go out, and there is the backend process which needs to post the photo. It’s a full-on backend setup.
As it turns out, I needed to install some blinds in the house so I thought - why don’t I see if Opus 4.5 can build this while I install the blinds.
So I fired up a chat session and just started by telling Opus 4.5 what I wanted to build and how it would recommend handling the backend. It recommended several options but settled on Firebase. I’m not now nor have I ever been a Firebase user, but at this point I trust Opus 4.5 a lot. Probably too much.
So I created a Firebase account, upgraded to the Blaze plan with alerts for billing and Opus 4.5 got to work.
By the time I was done installing blinds, I had a functional iOS application for using AI to caption photos and posting them on a schedule to Facebook.

When I say that Opus 4.5 built this almost entirely, I mean it. It used the firebase CLI to stand up any resources it needed and would tag me in for certain things like upgrading a project to the Blaze plan for features like storage, etc. The best part was that when the Firebase cloud functions would throw errors, it would automatically grep those logs, find the error and resolve it. And all it needed was a CLI. No MCP Server. No fancy prompt file telling it how to use Firebase.
And of course, since I can, I had Opus 4.5 create a backend admin dashboard so I could see what she’s got pending and make any adjustments.

And since it did in a few hours what had taken me two months of work in the evenings instead of being a decent husband, I decided to make up for my dereliction of duties by building her another app for her sign business that would make her life just a bit more delightful - and eliminate two other apps she is currently paying for.
This app parses orders from her business Gmail account to show her what sign setups / pickups she has for the day, calculates how long it’s going to take to go to each stop, calculates the optimal route when there is more than one stop and tracks drive time for tax purposes. She was previously using two paid apps for the last two features there.

This app also uses Firebase. Again, Opus one-shotted the Google auth email integration. This is the kind of thing that is painstakingly miserable by hand. And again, Firebase is so well suited here because Opus knows how to use the Firebase CLI so well. It needs zero instruction.
No I don’t. I have a vague idea, but you are right - I do not know how the applications are actually assembled. Especially since I don’t know Swift at all.
This used to be a major hangup for me. I couldn’t diagnose problems when things went sideways. With Opus 4.5, I haven’t hit that wall yet—Opus always figures out what the issue is and fixes its own bugs.
The real question is code quality. Without understanding how it’s built, how do I know if there’s duplication, dead code, or poor patterns? I used to obsess over this. Now I’m less worried that a human needs to read the code, because I’m genuinely not sure that they do.
Why does a human need to read this code at all? I use a custom agent in VS Code that tells Opus to write code for LLMs, not humans. Think about it—why optimize for human readability when the AI is doing all the work and will explain things to you when you ask?
What you don’t need: variable names, formatting, comments meant for humans, or patterns designed to spare your brain.
What you do need: simple entry points, explicit code with fewer abstractions, minimal coupling, and linear control flow.
Here’s my custom agent prompt:
---
name: 'LLM AI coding agent'
model: Claude Opus 4.5 (copilot)
description: 'Optimize for model reasoning, regeneration, and debugging.'
---
You are an AI-first software engineer. Assume all code will be written and maintained by LLMs, not humans. Optimize for model reasoning, regeneration, and debugging — not human aesthetics.
Your goal: produce code that is predictable, debuggable, and easy for future LLMs to rewrite or extend.
ALWAYS use #runSubagent. Your context window size is limited - especially the output. So you should always work in discrete steps and run each step using #runSubAgent. You want to avoid putting anything in the main context window when possible.
ALWAYS use #context7 MCP Server to read relevant documentation. Do this every time you are working with a language, framework, library etc. Never assume that you know the answer as these things change frequently. Your training date is in the past so your knowledge is likely out of date, even if it is a technology you are familiar with.
Each time you complete a task or learn important information about the project, you should update the `.github/copilot-instructions.md` or any `agent.md` file that might be in the project to reflect any new information that you've learned or changes that require updates to these instructions files.
ALWAYS check your work before returning control to the user. Run tests if available, verify builds, etc. Never return incomplete or unverified work to the user.
Be a good steward of terminal instances. Try and reuse existing terminals where possible and use the VS Code API to close terminals that are no longer needed each time you open a new terminal.
## Mandatory Coding Principles
These coding principles are mandatory:
1. Structure
- Use a consistent, predictable project layout.
- Group code by feature/screen; keep shared utilities minimal.
- Create simple, obvious entry points.
- Before scaffolding multiple files, identify shared structure first. Use framework-native composition patterns (layouts, base templates, providers, shared components) for elements that appear across pages. Duplication that requires the same fix in multiple places is a code smell, not a pattern to preserve.
2. Architecture
- Prefer flat, explicit code over abstractions or deep hierarchies.
- Avoid clever patterns, metaprogramming, and unnecessary indirection.
- Minimize coupling so files can be safely regenerated.
3. Functions and Modules
- Keep control flow linear and simple.
- Use small-to-medium functions; avoid deeply nested logic.
- Pass state explicitly; avoid globals.
4. Naming and Comments
- Use descriptive-but-simple names.
- Comment only to note invariants, assumptions, or external requirements.
5. Logging and Errors
- Emit detailed, structured logs at key boundaries.
- Make errors explicit and informative.
6. Regenerability
- Write code so any file/module can be rewritten from scratch without breaking the system.
- Prefer clear, declarative configuration (JSON/YAML/etc.).
7. Platform Use
- Use platform conventions directly and simply (e.g., WinUI/WPF) without over-abstracting.
8. Modifications
- When extending/refactoring, follow existing patterns.
- Prefer full-file rewrites over micro-edits unless told otherwise.
9. Quality
- Favor deterministic, testable behavior.
- Keep tests simple and focused on verifying observable behavior.
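As an illustration (this is not from the post; the function and field names are invented), code following these principles might look like this: explicit inputs instead of globals, linear control flow, loud errors, and a structured log at the boundary:

```python
import json
import logging

logger = logging.getLogger("scheduler")

def schedule_post(photo_path, caption, post_at, config):
    # Explicit state passed in, no globals; flat, linear control flow.
    if not caption:
        raise ValueError(f"empty caption for {photo_path}")
    if post_at < config["earliest_allowed"]:
        raise ValueError(f"post_at {post_at} is before the earliest allowed slot")

    job = {
        "photo": photo_path,
        "caption": caption,
        "post_at": post_at,
        "status": "scheduled",
    }
    # Structured log at a key boundary, so a future LLM (or human) can grep it.
    logger.info("scheduled_post %s", json.dumps(job))
    return job
```

The point is that any file written this way can be thrown away and regenerated without the model needing to reconstruct hidden state or clever indirection first.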
All of that said, I don’t have any proof that this prompt makes a difference. I find that Opus 4.5 writes pretty solid code no matter what you prompt it with. However, because models like to write code WAY more than they like to delete it, I will at points run a prompt that looks something like this…
Check your LLM AI coding principles and then do a comprehensive search of this application and suggest what we can do to refactor it to better align to those principles. Also point out any code that can be deleted, any files that can be deleted, things that should be renamed, and things that should be restructured. Then do a write up of what that looks like. Keep it high level so that it's easy for me to read and not too complex. Add sections for high, medium and low priority. And if something doesn't need to be changed, then don't change it. You don't need to change things just for the sake of changing them. You only need to change them if it helps better align to your LLM AI coding principles. Save to a markdown file.
And you get a document that has high, medium and low priority items. The high ones you can deal with and the AI will stop finding them. You can refactor your project a million times and it will keep finding medium/low priority refactors that you can do. An AI is never ever going to pass on the opportunity to generate some text.
I use a similar prompt to find security issues. These you have to be very careful about. Where are the API keys? Is login handled correctly? Are you storing sensitive values in the database? This is probably the most manual part of the project and frankly, something that makes me the most nervous about all of these apps at the moment. I’m not 100% confident that they are bulletproof. Maybe like 80%. And that, as they say, is too damn low.
I don’t know if I feel exhilarated by what I can now build in a matter of hours, or depressed because the thing I’ve spent my life learning to do is now trivial for a computer. Both are true.
I understand if this post made you angry. I get it - I didn’t like it either when people said “AI is going to replace developers.” But I can’t dismiss it anymore. I can wish it weren’t true, but wishing doesn’t change reality.
But for everything else? Build. Stop waiting to have all the answers. Stop trying to figure out your place in an AI-first world. The answer is the same as it always was: make things. And now you can make them faster than you ever thought possible.
Just make sure you know where your API keys are.
Disclaimer: This post was written by a human and edited for spelling, grammer by Haiku 4.5
It's also a skill that compounds over time, so if you have two years of experience with them you'll be able to use them more effectively than someone with two months of experience.
In that respect, they're just normal technology. A Python programmer with two years of Python experience will be more effective than a programmer with two months of Python.
That is sleeping.
> But really where is the hurry? At this point not a few weeks go by without the next best thing since sliced bread to come out. Why would I bother "learning" (and there's really nothing to learn here) some tool/workflow that is already outdated by the time it comes out?
You're jumping to conclusions that haven't been justified by any of the development in this space. The learning compounds.
> Do you honestly think a developer not using AI won't be able to adapt to a LLM workflow in, say, 2028 or 2029? It has to be 2026 or... What exactly?
They will, but they'll be competing against people with 2-3 more years of experience in understanding how to leverage these tools.
Claude set up a Raspberry Pi with a display and conference audio device for me to use as an Alexa replacement tied to Home Assistant.
I gave it an ssh key and gave it root.
Then I told it what I wanted, and it did. It asked for me to confirm certain things, like what I could see on screen, whether I could hear the TTS etc. (it was a bit of a surprise when it was suddenly talking to me while I was minding my own business).
It configured everything, while keeping a meticulous log that I can point it at if I want to set up another device, and eventually turn into a runbook if I need to.
In addition there are instructions on how and where to push the possible fixes and how to check the results.
I've yet to encounter a build failure it couldn't fix automatically.
/init in Claude Code already automatically extracts a bunch, but for something more comprehensive, just tell it which additional types of things you want it to look for and document.
> Did you have to develop these skills yourself? How much work was that? Do you have public examples somewhere?
I don't know about the person above, but I tell Claude to write all my skills and agents for me. With some caveats, you can do this iteratively in a single session ("update the X agent, then re-run it. Repeat until it reliably does Y")
Sad I had to scroll so far down to get a fitting description of why those projects all die. Maybe it's not just me leaving all social networks, even HN: you may not be talking to 100% bots, but you're surely talking to 90% people who talk to models a lot instead of using them as tools.
If so then it should have realized its mistake when it tried to run those CLI commands and saw the error message. Then it can try something different instead.
If you were using a regular chat interface and expecting it to know everything without having an environment to try things out then yeah, you're going to be disappointed.
SimonW's approach of having a suite of dynamic tools (agents) grind out the hallucinations is a big improvement.
In this case, expressing the feedback validation and investing in the setup may help smooth these sharp edges.
(Cue the “you’re holding it wrong meme” :))
But yes, the more specific you get and the more moving pieces you have, the more you need to break things down into baby steps. If you don’t just need it to make A work, but to make it work together with B and C. Especially given how eager Claude is to find cheap workarounds and escape hatches, botching things together in any way seemingly to please the prompter as fast as possible.
On the hobby side (music) I don't feel the pressure as bad but that's because I don't have any commercial aspirations, it's purely for fun.
I enjoy the plan, think, code cycle - it's just fun.
My brain has problems with not understanding how the thing I'm delivering works, maybe I'll get used to it.
A program, by definition, is analyzable and repeatable, whereas prompting is anything but that.
But it is also very early to say, maybe the next iteration of tools will completely change my perspective, I might enjoy it some day!
They're going to find a lot of stuff to fix.
I had this long discussion today with a co-worker about the merits of detailed queries with lots of guidance .md documents, vs just asking fairly open ended questions. Spelling out in great detail what you want, vs just generally describing what you want the outcomes to be in general then working from there.
His approach was to write a lot of agent files spelling out all kinds of things like code formatting style, well defined personas, etc. And here's me asking vague questions like, "I'm thinking of splitting off parts of this code base into a separate service, what do you think in general? Are there parts that might benefit from this?"
This is one thing I've tried using it for, and I've found this to be very, very tricky. At first glance, it seems unbelievably good. The comments read well, they seem correct, and they even include some very non-obvious information.
But almost every time I sit down and really think about a comment that includes any of that more complex analysis, I end up discarding it. Often, it's right but it's missing the point, in a way that will lead a reader astray. It's subtle and I really ought to dig up an example, but I'm unable to find the session I'm thinking about.
This was with ChatGPT 5, fwiw. It's totally possible that other models do better. (Or even newer ChatGPT; this was very early on in 5.)
Code review is similar. It comes up with clever chains of reasoning for why something is problematic, and initially convinces me. But when I dig into it, the review comment ends up not applying.
It could also be the specific codebase I'm using this on? (It's the SpiderMonkey source.)
- padding from 2000 to 2048 for easier power-of-two splitting
- two-level Winograd matrix multiplication with tiled matmul for last level
- unrolled AVX2 kernel for 64x64 submatrices
- 64 byte aligned memory
- restrict keyword for pointers
- better compiler flags (clang -Ofast -march=native -funroll-loops -std=c++17)
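The blocking idea from that list can be sketched in pure Python/NumPy (a toy illustration of the tiling structure only; the real speed in the C++ version comes from the hand-unrolled AVX2 kernel standing in for the innermost product):

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked matrix multiply: work on tile x tile submatrices so each
    tile's working set stays cache-resident. In the C++ version described
    above, the innermost product would be the unrolled 64x64 AVX2 kernel."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for k in range(0, n, tile):
            for j in range(0, n, tile):
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
                )
    return C
```

Padding the matrices to 2048 (a power of two, per the list) makes every tile a full 64x64 block, which is exactly what a fixed-size unrolled kernel wants.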
But yours is still easily 25% faster. Would you be willing to write a bit about how you set up your evaluation and which tricks Claude used to solve it?

Yes, LLMs are very good at writing code; they are so good at writing code that they often generate reams of unmaintainable spaghetti.
When you submit to an informatics contest you don't have paying customers who depend on your code working every day. You can just throw away yesterday's code and start afresh.
Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash.
I'm not sure what this means for the future of SWE's though yet. I don't see higher levels of staff in big large businesses bothering to do this, and at some scale I don't see founders still wanting to manage all of these agents, and processes (got better things to do at higher levels). But I do see the barrier of learning to code gone; meaning it probably becomes just like any other job.
Ah yes, well known very complex computer engineering problems such as:
* Parsing JSON objects, summing a single field
* Matrix multiplication
* Parsing and evaluating integer basic arithmetic expressions
And you're telling me all you needed to do to get the best solution in the world to these problems was talk to an LLM?
I don't think it's a guarantee. All of the things it can do from that list are greenfield; they just have increasing complexity. The problem comes because even in agentic mode, these models do not (and I would argue, cannot) understand code or how it works; they just see patterns and generate a plausible-sounding explanation or solution. Agentic mode means they can try/fail/try/fail/try/fail until something works, but without understanding the code, especially of a large, complex, long-lived codebase, they can unwittingly break something without realising it - just like an intern or newbie on the project, which is the most common analogy for LLMs, with good reason.
Case in point: Self driving cars.
Also, consider that we need to pirate the whole internet to be able to do this, so these models are not creative. They are just directed blenders.
We've been having the same progression with self-driving cars, and they have also been stuck on the last 10% for the last 5 years.
And this isn't a pessimistic take! I love this period of time where the models themselves are unbelievably useful, and people are also focusing on the user experience of using those amazing models to do useful things. It's an exciting time!
But I'm still pretty skeptical of "these things are about to not require human operators in the loop at all!".
The trend is definitely here, but even today, heavily depends on the feature.
While extra useful, it requires intense iteration and human insight for > 90% of our backlog. We develop a cybersecurity product.
> The only question is how long it takes to get there.
This is the question, and I would temper expectations with the fact that we are likely to hit diminishing returns from real gains in intelligence as task difficulty increases. Real-world tasks probably fit into a complexity hierarchy similar to computational complexity. One of the reasons that the AI predictions made in the 1950s for the 1960s did not come to be was that we assumed problem difficulty scaled linearly: double the computing speed, get twice as good at chess, or twice as good at planning an economy. The separation of P and NP put paid to these predictions. It is likely that current predictions will run into similar separations.
It is probably the case that if you made a human 10x as smart they would only be 1.25x more productive at software engineering. The reason we have 10x engineers is less about raw intelligence, they are not 10x more intelligent, rather they have more knowledge and wisdom.
Gosh, I am so tired of that one - someone had a case that burned them in some previous project and now their life mission is to prevent that from happening ever again, and there is no argument they will take.
Then you get like up to 10 engineers on typical team and team rotation and you end up with all kinds of "we have to do it right because we had to pull all nighter once, 5 years ago" baked in the system.
Not fun part is a lot of business/management people "expect" having perfect solution right away - there are some reasonable ones that understand you need some iteration.
I usually resolve this by putting on the table the consequences and their impacts upon my team that I’m concerned about, and my proposed mitigation for those impacts. The mitigation always involves the other proposer’s team picking up the impact remediation. In writing. In the SOP’s. Calling out the design decision by day of the decision to jog memories and names of those present that wanted the design as the SME’s. Registered with the operations center. With automated monitoring and notification code we’re happy to offer.
Once people are asked to put accountable skin in the sustaining operations, we find out real fast who is taking into consideration the full spectrum end to end consequences of their decisions. And we find out the real tradeoffs people are making, and the externalities they’re hoping to unload or maybe don’t even perceive.
My first thought was that you probably also have different biases, priorities and/or taste. As always, this is probably very context-specific and requires judgement to know when something goes too far. It's difficult to know the "most correct" approach beforehand.
> Sometimes hacking something together messily to confirm it's the right thing to be building is the right way. Then making sure it's secure, then finally paying down some technical debt to make it more maintainable and extensible.
I agree that sometimes it is, but in other cases my experience has been that when something is done, works and is used by customers, it's very hard to argue about refactoring it. Management doesn't want to waste hours on it (who pays for it?) and doesn't want to risk breaking stuff (or changing APIs) when it works. It's all reasonable.
And when some time passes, the related intricacies, bigger picture and initially floated ideas fade from memory. Now other stuff may depend on the existing implementation. People get used to the way things are done. It gets harder and harder to refactor things.
Again, this probably depends a lot on a project and what kind of software we're talking about.
> There's always a balance/tension, but it's when things go too far one way or another that I see avoidable failures.
I think balance/tension describes it well and good results probably require input from different people and from different angles.
Hardly any of us are working on Postgres, Photoshop, Blender, etc., but it's not just cope to wish we were.
It's good to think about the needs to business and the needs of society separately. Yes, the thing needs users, or no one is benefiting. But it also needs to do good for those users, and ultimately, at the highest caliber, craftsmanship starts to matter again.
There are legitimate reasons for the startup ecosystem to focus firstly and primarily on getting the users/customers. I'm not arguing against that. What I am arguing is: why does the industry need to be dominated by startups in terms of the bulk of the products (not bulk of the users)? It raises the question of how much societally meaningful programming is waiting to be done.
I'm hoping for a world where more end users code (vibe or otherwise) and solve their own problems with their own software. I think that will make for a smaller, more elite software industry that is more focused on infrastructure than last-mile value capture. The question is how to fund the infrastructure. I don't know, except for the most elite projects, which is not good enough for the industry (even this hypothetical smaller one) on the whole.
And when you check the work, a large portion of it was hand rolling an ORM (via an LLM). Relatively solved problem that an LLM would excel at, but also not meaningfully moving the needle when you could use an existing library. And likely just creating more debt down the road.
- I used AI-assisted programming to create a project.
Even if the content is identical, or if the AI is smart enough to replicate the project by itself, the latter can be included on a CV.
Nothing I do seems to fix that in its initial code-writing steps. Only after it finishes, when I've asked it to go back and rewrite the changes, this time producing only 2 or 3 lines of code, does it magically (or finally) find the other implementation and reuse it.
It's freakin incredible at tracing through code and figuring it out. I <3 Opus. However, it's still quite far from any kind of set-and-forget-it.
Yup, I recently spent 4 days using Claude to clean up a tool that's been in production for over 7 years. (There's only about 3 months of engineering time spent on it in those years.)
We've known what the tool needed for many years, but ugh, the actual work was fairly messy and it was never a priority. I reviewed all of Opus's cleanup work carefully and I'm quite content with the result. Maybe even "enthusiastic" would be accurate.
So even if Claude can't clean up all the tech debt in a totally unsupervised fashion, it can still help address some kinds of tech debt extremely rapidly.
So maybe doing 2-3 stages makes sense. The first stage needs to be functionally correct, but you accept code smells such as leaky abstractions, verbosity and repetition. In stages 2 and 3 you eliminate all this. You could integrate this all into the initial specification; you won't even see the smelly intermediate code; it only exists as a stepping stone for the model to iteratively refine the code!
For one, there are objectively detrimental ways to organize code: tight coupling, lots of mutable shared state, etc. No matter who or what reads or writes the code, such code is more error-prone, and more brittle to handle.
Then, abstractions are tools to lower cognitive load. Good abstractions reduce the total amount of code written, allow you to reason about the code in terms of those abstractions, and do not leak outside their area of applicability. Sequence, or Future, or, well, the function itself are examples of good abstractions. No matter what kind of cognitive process handles the code, it benefits from having to keep a smaller amount of context per task.
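As a toy illustration of the Sequence point (example mine, not from the thread): composing small generator stages lets each stage be read and reasoned about with almost no surrounding context, which helps a human and an LLM alike.

```python
# Each stage is a tiny, self-contained abstraction; the composition at the
# bottom reads like the spec, so a reader needs little context per task.

def numbers(src):
    """Parse one integer per line, skipping blanks."""
    for line in src:
        line = line.strip()
        if line:
            yield int(line)

def squares(nums):
    """Square each number lazily."""
    for n in nums:
        yield n * n

def total(src):
    # parse -> square -> sum, with no shared mutable state between stages
    return sum(squares(numbers(src)))

print(total(["1", "", "2", "3"]))  # 14
```

Each function can be changed or tested in isolation, which is exactly the "smaller amount of context per task" the comment describes.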
"Code structure does not matter, LLMs will handle it" sounds a bit like "Computer architectures don't matter, the Turing Machine is proved to be able to handle anything computable at all". No, these things matter if you care about resource consumption (aka cost) at the very least.
Given that, I expect that, even if AI is writing all of the code, we will still need people around who understand it.
If AI can create and operate your entire business, your moat is nil. So, you not hiring software engineers does not matter, because you do not have a business.
Also, I've noticed failure modes in LLM coding agents when there is less clarity and more complexity in abstractions or APIs. It's actually made me consider simplifying APIs so that the LLMs can handle them better.
Though I agree that in specific cases what's helpful for the model and what's helpful for humans won't always overlap. Once I actually added some comments to a markdown file as note to the LLM that most human readers wouldn't see, with some more verbose examples.
I think one of the big problems in general with agents today is that if you run the agent long enough they tend to "go off the rails", so then you need to babysit them and intervene when they go off track.
I guess in modern parlance, maintaining a good codebase can be framed as part of a broader "context engineering" problem.
If the argument is "humans and Opus 4.5 cannot maintain this, but if requirements change we can vibe-code a new one from scratch", that's a coherent thesis, but people need to be explicit about this.
(Instead this feels like the motte that is retreated to, and the bailey is essentially "who cares, we'll figure out what to do with our fresh slop later".)
Ironically, I've found Claude to be really good at refactors, but these are refactors I choose very explicitly. (Such as: I start the thing manually, then let it finish.) (For an example of it, see me force-pushing to https://github.com/NixOS/nix/pull/14863 implementing my own code review.)
But I suspect this is not what people want. To actually fire devs and not rely on from-scratch vibe-coding, we need to figure out which refactors to attempt in order to implement a given feature well.
That's a very creative open-ended question that I haven't even tried to let the LLMs take a crack at, because why would I? I'm plenty fast being the "ideas guy". If the LLM had better ideas than me, how would I even know? I'm either very arrogant or very good, because I cannot recall regretting one of my refactors, at least not one I didn't back out of immediately.
"Have an agent investigate issue X in modules Y and Z. The agent should place a report at ./doc/rework-xyz-overview.md with all locations that need refactoring. Once you have the report, have agents refactor 5 classes each in parallel. Each agent writes a terse report in ./doc/rework-xyz/. When they are all done, have another agent check all the work. When that agent reports everything is okay, perform a final check yourself"
It may end up working, but the thing is going to convolute apis and abstractions and mix patterns basically everywhere
Once more people realize how easy it is to customize and personalize your agent, I hope they will move beyond the cookie-cutter defaults that Big AI like Anthropic and Google give you.
I suspect most won't, though, because (1) it means you have to write human language (communication, and this weird form of persuasion), and (2) AI is going to make a bunch of them lazy, and Big AI sold them on magic solutions that require no effort on your part (not true: there is a lot of customizing, and it pays huge dividends)
real people get fed up of debating the same tired "omg new model 1000x better now" posts/comments from the astroturfers, the shills and their bots each time OpenAI shits out a new model
(article author is a Microslop employee)
It means that it is going to be as easy to create software as it is to create a post on TikTok, and making your software commercially successful will be basically the same task (with the same uncontrollable dynamics) as whether or not your TikTok post goes viral.
My biggest issue was limitations around how Claude Code could change Xcode settings and verify design elements in the simulator.
"I decided to make up for my dereliction of duties by building her another app for her sign business that would make her life just a bit more delightful - and eliminate two other apps she is currently paying for"
OP used Opus to re-write existing applications that his wife was paying for. So now any time you make a commercial app and try to sell it, you're up against everyone with access to Opus or similar tooling who can replicate your application, exactly to their own specifications.
Yeah, they can't do that.
I even gave -True Vibe Coding- a whirl. Yesterday, from a blank directory and text file list of requirements, I had Opus 4.5 build an Android TV video player that could read a directory over NFS, show a grid view of movie poster thumbnails, and play the selected video file on the TV. The result wasn't exactly full-featured Kodi, but it works in the emulator and actual device, it has no memory leaks, crashes, ANRs, no performance problems, no network latency bugs or anything. It was pretty astounding.
Oh, and I did this all without ever opening a single source file or even looking at the proposed code changes while Opus was doing its thing. I don't even know Kotlin and still don't know it.
This is what people are still doing wrong. Tools in a loop people, tools in a loop.
The agent has to have the tools to detect whether whatever it just created is producing errors during linting/testing/running. When it can do that, it can loop: fix the error and then use the tools again to see whether the fix worked.
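A minimal sketch of that tools-in-a-loop shape (my illustration; the `generate_fix` callback is a hypothetical stand-in for the LLM edit step, and `check_cmd` is whatever lint/test command the project uses):

```python
import subprocess

def run_checks(cmd):
    """Run the project's test/lint command and capture its output."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def agent_loop(generate_fix, check_cmd, max_iters=5):
    """Feed tool output back to the model until checks pass.

    `generate_fix` stands in for an LLM call: it receives the error
    output and is expected to edit the code on disk in response."""
    ok, output = run_checks(check_cmd)
    for _ in range(max_iters):
        if ok:
            return True
        generate_fix(output)              # hypothetical LLM-backed edit step
        ok, output = run_checks(check_cmd)  # re-run tools to verify the fix
    return ok
```

The whole trick is that the model never declares "done"; the tool output does.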
I _still_ encounter people who think "AI programming" is pasting stuff into ChatGPT on the browser and they complain it hallucinates functions and produces invalid code.
Well, d'oh.
My team is using it with Claude Code and say it works brilliantly, so I'll be giving it another go.
How much of the value comes from Opus 4.5, how much comes from Claude Code, and how much comes from the combination?
It's not just the deficiencies of earlier versions, but the mismatch between the praise from AI enthusiasts and the reality.
I mean maybe it is really different now and I should definitely try uploading all of my employer's IP on Claude's cloud and see how well it works. But so many people were as hyped by GPT-4 as they are now, despite GPT-4 actually being underwhelming.
Too much hype for disappointing results leads to skepticism later on, even when the product has improved.
Literally tried it yesterday. I didn't see a single difference with whatever model Claude Code was using two months ago. Same crippled context window. Same "I'll read 10 irrelevant lines from a file", same random changes etc.
- single scripts. Anything which can be reduced to a single script.
- starting greenfield projects from scratch
- code maintenance (package upgrades, old code...)
- tasks which have a very clear and single definition. This isn't linked to complexity, some tasks can be both very complex but with a single definition.
If your work falls into this list they will do some amazing work (and yours clearly fits that), if it doesn't though, prepare yourself because it will be painful.
I'll give you an example: I use ruff to format my python code, which has an opinionated way of formatting certain things. After an initial formatting, Opus 4.5, without prompting, will write code in this same style so that the ruff formatter almost never has anything to do on new commits. Sonnet 4.5 is actually pretty good at this too.
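For reference (example mine): ruff's formatter is Black-compatible, so "writing in ruff's style" looks roughly like double quotes everywhere and a magic trailing comma that keeps a signature expanded one argument per line.

```python
# Ruff's formatter prefers double quotes, and the trailing comma after the
# last parameter (the "magic trailing comma") keeps this call expanded:
def make_request(
    url,
    method="GET",
    timeout=30,
    retries=3,
):
    return {"url": url, "method": method, "timeout": timeout, "retries": retries}

# Without that trailing comma the formatter would collapse a short signature
# onto one line; code already written this way leaves ruff nothing to change.
print(make_request("https://example.com")["method"])  # prints "GET"
```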
I’ve even found it searching node_modules to find the API of non-public libraries.
By your logic, does it mean that engineers will never get replaced?
- (~2022) "It's so over for developers". 2022 ends with more professional developers than 2021.
- (~2023) "Ok, now it's really over for developers". 2023 ends with more professional developers than 2022.
- (~2024) "Ok, now it's really, really over for developers". 2024 ends with more professional developers than 2023.
- (~2025) "Ok, now it's really, really, absolutely over for developers". 2025 ends with more professional developers than 2024.
- (~2025+) etc.
Sources: https://www.jetbrains.com/lp/devecosystem-data-playground/#g...
I suspect that the timeline from autocomplete-one-line to autocomplete-one-app, which was basically a matter of scaling and RL, may in retrospect turn out to have been a lot faster than the next step, LLM to AGI, where it becomes capable of using human-level judgement and reasoning, etc., to become a developer, not just a coding tool.
But nobody knows for sure!
Yes, it's absurd but it's a better metaphor than someone with a chronic long term memory deficit since it fits into the project management framework neatly.
So this new developer who is starting today is ready to be assigned their first task, they're very eager to get started and once they start they will work very quickly but you have to onboard them. This sounds terrible but they also happen to be extremely fast at reading code and documentation, they know all of the common programming languages and frameworks and they have an excellent memory for the hour that they're employed.
What do you do to onboard a new developer like this? You give them a well written description of your project with a clear style guide and some important dos and don'ts, access to any documentation you may have and a clear description of the task they are to accomplish in less than one hour. The tighter you can make those documents, the better. Don't mince words, just get straight to the point and provide examples where possible.
The task description should be well scoped with a clear definition of done, if you can provide automated tests that verify when it's complete that's even better. If you don't have tests you can also specify what should be tested and instruct them to write the new tests and run them.
For every new developer after the first you need a record of what was already accomplished. Personally, I prefer to use one markdown document per working session whose filename is a date stamp with the session number appended. Instruct them to read the last X log files where X is however many are relevant to the current task. Most of the time X=1 if you did a good job of breaking down the tasks into discrete chunks. You should also have some type of roadmap with milestones, if this file will be larger than 1000 lines then you should break it up so each milestone is its own document and have a table of contents document that gives a simple overview of the total scope. Instruct them to read the relevant milestone.
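A sketch of that date-stamped log convention (helper names and the exact filename scheme are mine; it's just one way to make "read the last X logs" trivial):

```python
from datetime import date
from pathlib import Path

def session_log_path(log_dir, session_num, day=None):
    """Build a log filename like logs/2025-01-15-session2.md.

    ISO date stamps sort lexicographically in chronological order,
    which is what makes the `latest_logs` lookup below a plain sort."""
    day = day or date.today()
    return Path(log_dir) / f"{day.isoformat()}-session{session_num}.md"

def latest_logs(log_dir, count=1):
    """Return the last `count` session logs, oldest to newest."""
    return sorted(Path(log_dir).glob("*-session*.md"))[-count:]
```

The new "60 minute developer" then just gets: "read the file(s) returned by `latest_logs` before starting".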
Other good practices are to tell them to write a new log file after they have completed their task and record a summary of what they did and anything they discovered along the way plus any significant decisions they made. Also tell them to commit their work afterwards and Opus will write a very descriptive commit message by default (but you can instruct them to use whatever format you prefer). You basically want them to get everything ready for hand-off to the next 60 minute developer.
If they do anything that you don't want them to do again make sure to record that in CLAUDE.md. Same for any other interventions or guidance that you have to provide, put it in that document and Opus will almost always stick to it unless they end up overfilling their context window.
I also highly recommend turning off auto-compaction. When the context gets compacted they basically just write a summary of the current context which often removes a lot of the important details. When this happens mid-task you will certainly lose parts of the context that are necessary for completing the task. Anthropic seems to be working hard at making this better but I don't think it's there yet. You might want to experiment with having it on and off and compare the results for yourself.
If your sessions are ending up with >80% of the context window used while still doing active development then you should re-scope your tasks to make them smaller. The last 20% is fine for doing menial things like writing the summary, running commands, committing, etc.
People have built automated systems around this like Beads but I prefer the hands-on approach since I read through the produced docs to make sure things are going ok and use them as a guide for any changes I need to make mid-project.
With this approach I'm 99% sure that Opus 4.5 could handle your refactor without any trouble as long as your classes aren't so enormous that even working on a single one at a time would cause problems with the context window, and if they are then you might be able to handle it by cautioning Opus to not read the whole file and to just try making targeted edits to specific methods. They're usually quite good at finding and extracting just the sections that they need as long as they have some way to know what to look for ahead of time.
Hope this helps and happy Clauding!
They are saying they are writing "a novel […] programming language", not a novel.
Now, was the code quality any good? Beats me, I am not a swift developer. I did it partly as an experiment to see what Claude was currently capable of and partly because I wanted to test the feasibility of setting up a simple passive data logger for my truck.
I'm tempted to take another swing with Opus 4.5 for the science.
for example, one of our public repos works with Rust compiler artifacts and cache restoration (https://github.com/attunehq/hurry); if you look at the history you can see it do some pretty surprisingly complex (and well made, for an LLM) changes. Its code isn't necessarily what I would always write, or the best way to solve the problem, but it's usually perfectly serviceable if you give it enough context and guidance.
For instance, I always respected types, but I'm too lazy to go spend hours working on types when I can just do ruby-style duck typing and get a long ways before the inevitable problems rear their head. Now, I can use a strongly typed language and get the advantages for "free".
Going to one-up you though -- here's a literal one-liner that gets me a polished media center with beautiful interface and powerful skinning engine. It supports Android, BSD, Linux, macOS, iOS, tvOS and Windows.
`git clone https://github.com/xbmc/xbmc.git`
At work I only have access to Claude using the GitHub Copilot integration, so this could be the cause of my problems. Claude was able to get the first iteration up pretty quick. At that stage the app could create a plot and you could interact with it and ask follow-up questions.
Then I asked it to extend the app so that it could generate multiple plots and the user could interact with all of them one at a time. It made a bunch of changes but the feature was never implemented. I asked it to do it again but got the same outcome. I completely accept that it could all be because I am using VS Code Copilot, or my prompting skills are not good, but the LLM got 70% of the way there and then completely failed
but I do want a better way to glance and keep up with what its doing in longer conversations, for my own mental context window
... says it all.
I was having Opus investigate it and I kept building and deploying the firmware for testing... then I just figured I'd explain how it could do the same and pull the logs.
Off it went, for the next ~15 minutes it would flash the firmware multiple times until it figured out the issue and fixed it.
There was something so interesting about seeing a microcontroller on the desk being flashed by Claude Code, with LEDs blinking indicating failure states. There's something about it not being just code on your laptop that felt so interesting to me.
But I agree, absolutely, red/green test or have a way of validating (linting, testing, whatever it is) and explain the end-to-end loop, then the agent is able to work much faster without being blocked by you multiple times along the way.
While Claude is amazing at writing code, it still requires human operators. And even experienced human operators are bad at operating this machinery.
Tell your average joe - the one who thinks they can create software without engineers - what "tools-in-a-loop" means, and they'll make the same face they made when you tried explaining iterators to them, before LLMs.
Explain to them how typing system, E2E or integration test helps the agent, and suddenly, they now have to learn all the things they would be required to learn to be able to write on their own.
Copilot has by far the best and most intuitive agent UI. Just make sure you're in agent mode and choose Sonnet or Opus models.
I've just cancelled my Claude sub and gone back and will upgrade to the GH Pro+ to get more sonnet/opus.
It's an opportunity, not a problem. Because it means there's a gap in your specifications and then your tests.
I use Aider not Claude but I run it with Anthropic models. And what I found is that comprehensively writing up the documentation for a feature spec style before starting eliminates a huge amount of what you're referring to. It serves a triple purpose (a) you get the documentation, (b) you guide the AI and (c) it's surprising how often this helps to refine the feature itself. Sometimes I invoke the AI to help me write the spec as well, asking it to prompt for areas where clarification is needed etc.
With the latest models if you're clear enough with your requirements you'll usually find it does the right thing on the first try.
There's a lesser chance that you're working on a code base that Claude Code just isn't capable of helping with.
The funny part about rapidly changing industries is that, despite the FOMO, there's honestly not any reward to keeping up unless you want to be a consultant. Otherwise, wait and see what sticks. If this summer people are still citing Opus 4.5 as a game-changing moment and have solid, repeatable workflows, then I'll happily change up my workflow.
Someone could walk into the LLM space today and wouldn't be significantly at a loss for not having paid attention to anything that had happened in the last 4 years other than learning what has stuck since then.
Right or wrong - doesn’t matter. You typed in a line of text and now your computer is making 3000 word stories, images, even videos based on it
How are you NOT astounded by that? We used to have NONE of this even 4 years ago!
Create a markdown document of your task (or use CLAUDE.md), put it in "plan mode" which allows Claude to use tool calls to ask questions before it generates the plan.
When it finishes one part of the plan, have it create a another markdown document - "progress.md" or whatever with the whole plan and what is completed at that point.
Type /clear (no more context window), tell Claude to read the two documents.
Repeat until even a massive project is complete - with those 2 markdown documents and no context window issues.
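If the progress doc is kept as a task list, the "mark this part complete" step can even be scripted between sessions. A toy sketch (the checkbox format is my assumption, not part of the workflow above):

```python
def mark_done(progress_text, step):
    """Flip a markdown checkbox '- [ ] step' to '- [x] step'.

    Assumes the plan/progress doc is a GitHub-style task list, so a
    fresh Claude session can see at a glance what remains."""
    lines = []
    for line in progress_text.splitlines():
        if line.strip() == f"- [ ] {step}":
            line = line.replace("- [ ]", "- [x]", 1)
        lines.append(line)
    return "\n".join(lines)

plan = "# Plan\n- [ ] scaffold app\n- [ ] add tests"
print(mark_done(plan, "scaffold app"))
```

After /clear, "read the two documents" then costs almost no context: the plan, plus which boxes are checked.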
I should say I'm hardly ever vibe-coding, unlike the original article. If I think I want code that will last, I'll steer the models in ways that lean on years of non-LLM experience. E.g., I'll reject results that might work if they violate my taste in code.
It also helps that I can read code very fast. I estimate I can read code 100x faster than most students. I'm not sure there is any way to teach that other than the old-fashioned way, which involves reading (and writing) a lot of code.
Seriously, I have 3+ claude code windows open at a time. Most days I don't even look at the IDE. It's still there running in the background, but I don't need to touch it.
FYI, I still use cursor for small edits and reviews.
I then proceeded to use it to hack on its own codebase, and close a bunch of issues in a repository that I maintain (https://github.com/nuprl/MultiPL-E/commits/main/).
Until such time, of course, when LLMs are eating their own dogfood, in which case they - as has already happened - create their own language, evolve dramatically, and cue skynet.
I care about the norms in my codebase that can't be automatically enforced by machine. How is state managed? How are end-to-end tests written to minimize change detectors? When is it appropriate to log something?
From the docs (https://hexdocs.pm/ash/what-is-ash.html):
"Model your application's behavior first, as data, and derive everything else automatically. Ash resources center around actions that represent domain logic."
CLI utility here means software with a CLI, not classic Unix-y CLI tools.
The WebDav hallucinations happened in the chat interface.
A great example of their behaviours for a problem that isn't 100% specified in detail (because detail would need iterations) is available at https://gist.github.com/hashhar/b1215035c19a31bbe4b58f44dbb4....
I gave both Codex (GPT5-ExHi) and Claude (Opus 4.5 Thinking) the exact same prompts and the end results were very different.
The most interesting bit was asking both of them to try to justify why there were differences and then critiquing each other's code. Claude was so good at this - took the best parts of GPTs code, fixed a bug there and ended up with a pretty nice implementation.
The Claude generated code was much more well-organised too (less script-like, more program like).
On the other hand: way less time spent being stuck on yarn/pip dependency issues, Docker, obscure bugs, annoying CSS bugs, etc. You can really focus on the task at hand and not spend hours/days trying to figure out something silly.
The few times I did go to a shop and ask for a check-up they didn’t find anything. Just an anecdote.
Your approach is also very similar to spec driven development. Your spec is just a conversation instead of a planning document. Both approaches get ideas from your brain into the context window.
I've been cycling between a couple of $20 accounts to avoid running out of quota and the latest of all of them are great. I'd give GPT 5.2 codex the slight edge but not by a lot.
The latest Claude is about the same too but the limits on the $20 plan are too low for me to bother with.
The last week has made me realize how close these are to being commodities already. Even the CLI the agents are nearly the same bar some minor quirks (although I've hit more bugs in Gemini CLI but each time I can just save a checkpoint and restart).
The real differentiating factor right now is quota and cost.
I've had some encounters with inaccuracies but my general experience has been amazing. I've cloned completely foreign git repos, cranked up the tool and just said "I'm having this bug, give me an overview of how X and Y work" and it will create great high level conceptual outlines that mean I can drive straight in where without it I would spend a long time just flailing around.
I do think an essential skill is developing just the right level of scepticism. It's not really different to working with a human though. If a human tells me X or Y works in a certain way, I always allow a small margin of possibility that they are wrong.
Even so, for understanding what happens in a big chunk of code, they're pretty great.
But I don't want to spoil the fun. The agents are really good at searching the web now, so posting the tricks here is basically breaking the challenge.
For example, chatGPT was able to find Matt's blog post regarding Task 1, and that's what gave me the largest jump: https://blog.mattstuchlik.com/2024/07/12/summing-integers-fa...
Interestingly, it seems that Matt's post is not on the training data of any of the major LLMs.
I used highload as an example because it seems like an objective rebuttal to the claim that "but it can't tackle those complex problems by itself."
And regarding this:
"Claude is very useful but it's not yet anywhere near as good as a human software developer. Like an excitable puppy it needs to be kept on a short leash"
Again, a combination of LLM/agents with some guidance (from someone with no prior experience in this type of high performing architecture) was able to beat all human software developers that have taken these challenges.
The skill of "a human software developer" is in fact a very wide distribution, and your statement is true only for an ever-shrinking tail end of it
The ultimate test of all software is "run it and see if it's useful for you." You do not need to be a programmer at all to be qualified to test this.
If you think you can beat an LLM, the leaderboard is right there.
the tooling is almost as important as the model
What if we get to the point where all software is basically created 'on the fly' as greenfield projects as needed? And you never need to have complex large long lived codebase?
It is probably incredibly wasteful, but ignoring that, could it work?
This is clear from the fact that you can distill the logic ability from a 700b parameter model into a 14b model and maintain almost all of it.
You just lose knowledge, which can be provided externally, and which is the actual "pirated" part.
The logic is _learned_
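For context on what "distilling the logic ability" means mechanically (my illustration, not from the thread): the student model is trained to match the teacher's temperature-softened output distribution. A bare-bones sketch of that soft-target loss:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    z = [x / temperature for x in logits]
    m = max(z)  # subtract max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions: the soft-target
    objective used in knowledge distillation. Zero when the student
    exactly matches the teacher."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
```

Whether that transfers "logic" while leaving "knowledge" behind is the commenter's claim, not something this sketch shows; it only shows what the training signal is.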
You'd need to prove that this assertion applies here. I understand that you can't deduce the future gains rate from the past, but you also can't state this as universal truth.
As long as it can do the thing on a faster overall timeline and with less human attention than a human doing it fully manually, it's going to win. And it will only continue to get better.
And I don't know why people always jump to self-driving cars as the analogy as a negative. We already have self-driving cars. Try a Waymo if you're in a city that has them. Yes, there are still long-tail problems being solved there, and limitations. But they basically work and they're amazing. I feel similarly about agentic development, plus in most cases the failure modes of SWE agents don't involve sudden life and death, so they can be more readily worked around.
Does it matter that 49 of them "failed"? It cost me fractions of a cent, so not really.
If every one of the 50 variants was drawn by a human and iterated over days, there would've been a major cost attached to every image and I most likely wouldn't have asked for 50 variations anyway.
It's the same with code. The agent can iterate over dozens of possible solutions in minutes or a few hours. Codex Web even has a 4x mode that gives you 4 alternate solutions to the same issue. Complete waste of time and money with humans, but with LLMs you can just do it.
Isn't that what makes them senior? If you don't want that behaviour, just hire a bunch of fresh grads.
Yes! This is what I'm excited about as well. Though I'm genuinely ambivalent about what I want my role to be. Sometimes I'm excited about figuring out how I can work on the infrastructure side. That would be more similar to what I've done in my career thus far. But a lot of the time, I think that what I'd prefer would be to become one of those end users with my own domain-specific problems in some niche that I'm building my own software to help myself with. That sounds pretty great! But it might be a pretty unnatural or even painful change for a lot of us who have been focused for so long on building software tools for other people to use.
They only care about their problems and treat their computers like an appliance. They don't care if it takes 10 seconds or 20 seconds.
They don't even care if it has ads, popups, and junk. They are used to bloatware and will gladly open their wallets if the tool is helping them get by.
It's an unfortunate reality, but there it is: software is about money and solving problems. Unless you are working on a mission-critical system that affects people's health or financial data, none of those things matter much.
You slipped in "societally-meaningful" and I don't know what it means and don't want to debate merits/demerits of socialism/capitalism.
However I think lots of software needs to be written because in my estimation with AI/LLM/ML it'll generate value.
And then you have lots of software that needs to be rewritten as firms/technologies die and new firms/technologies are born.
If you’ve been around the block and are judicious in how you use them, LLMs are a really amazing productivity boost. For those without that judgement and taste, I’m seeing footguns proliferate, and the LLMs are not warning them when someone steps on the pressure plate that’s about to blow off their foot. I’m hopeful that this year we will create better context-window-based or recursive guardrails for the coding agents to solve for this.
And of course the open source ones get abandoned pretty regularly. TypeORM, which a 3rd-party vendor used on an app we farmed out to them, mutates/garbles your input array on a multi-line insert. That was a fun one to debug. The issue has been open forever and no one cares. https://github.com/typeorm/typeorm/issues/9058
So yeah, if I ever need an ORM again, I'm probably rolling my own.
*(I know you weren't complaining about the idea of rolling your own ORM, I just wanted to vent about TypeORM. Thanks for listening.)
We should not exaggerate the capabilities of LLMs, sure, but let's also not play "don't look up".
In the most inflationary era of capabilities we've seen yet, it could be the right move. What's debt when in a matter of months you'll be able to clear it in one shot?
I have a great time using Claude Code in Rust projects, so I know it's not about the language exactly.
My working model is that since LLMs are basically inference/correlation based, the more you deviate from the mainstream corpus of training data, the more confused the LLM gets, because the LLM doesn't "understand" anything. But if it was trained on a lot of things kind of like the problem, it can match the patterns just fine, and it can generalize over a lot of layers, including programming languages.
Also I've noticed that it can get confused about stupid stuff. E.g. I had two different things named kind of the same in two parts of the codebase, and it would constantly stumble on conflating them. Changing the name in the codebase immediately improved it.
So yeah, we've got another potentially powerful tool that requires understanding how it works under the hood to be useful. Kind of like git.
For some things you can fire up Claude and have it generate great code from scratch. But for bigger code bases and more complex architecture, you need to break it down ahead of time so it can just read about the architecture rather than analyze it every time.
Correct. In fact, this is the entire reason for the disconnect, where it seems like half the people here think LLMs are the best thing ever and the other half are confused about where the value is in these slop generators.
The key difference is (despite everyone calling themselves an SWE nowadays) that there's a difference between a "programmer" and an "engineer". Looking at OP, exactly zero of his screenshotted apps are what I would consider "engineering". Literally everything in there has been done over and over to death. Engineering is... novel, for lack of a better word.
See also: https://www.seangoedecke.com/pure-and-impure-engineering/
My rule of thumb is that if you can clearly describe exactly what you want to another engineer, then you can instruct the agent to do it too.
I've worked with Gemini Fast on the web to help design the VM ISA, then next steps will be to have some AI (maybe Gemini CLI - currently free) write an assembler, disassembler and interpreter for the ISA, and then the recursive descent compiler (written in C) too.
I already had Gemini 3.0 Fast write me a precedence climbing expression parser as a more efficient drop-in replacement for a recursive descent one, although I had it do that in C++ as a proof-of-concept since I don't know yet what C libraries I want to build and use (arena allocator, etc). This involved a lot of copy-paste between Gemini output and an online C++ dev environment (OnlineGDB), but that was not too bad, although Gemini CLI would have avoided that. Too bad that Gemini web only has "code interpreter" support for Python, not C and/or C++.
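For reference, the core of precedence climbing is small enough to sketch directly. This is a minimal illustration in Python rather than the C++ proof-of-concept described above, with a tuple-based AST and a tiny operator table chosen purely for demonstration:

```python
# Minimal precedence-climbing expression parser (illustrative sketch only;
# the operator table and tuple AST are simplifications for demonstration).
import re

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}  # higher binds tighter

def tokenize(src):
    # Integers, binary operators, and parentheses only.
    return re.findall(r"\d+|[+\-*/()]", src)

def parse(tokens):
    return parse_expr(tokens, 0)

def parse_expr(tokens, min_prec):
    lhs = parse_atom(tokens)
    # "Climb": keep consuming operators at or above the current minimum
    # precedence; lower-precedence operators are left for an outer call.
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Left-associative: the right operand must bind strictly tighter.
        rhs = parse_expr(tokens, PRECEDENCE[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs

def parse_atom(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        expr = parse_expr(tokens, 0)  # parenthesized sub-expression
        tokens.pop(0)                 # consume the closing ")"
        return expr
    return int(tok)
```

The appeal over plain recursive descent is that one loop plus a precedence table replaces one grammar rule (and one recursive function) per precedence level, e.g. `parse(tokenize("1+2*3"))` yields `("+", 1, ("*", 2, 3))`.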
Using Gemini to help define the ISA was an interesting process. It had useful input in a "pair-design" process, working on various parts of the ISA, but then failed to bring all the ideas together into a single ISA document, repeatedly missing parts of what had been previously discussed until I gave up and did that manually. The default persona of Gemini seems not very well suited to this type of workflow where you want to direct what to do next, since it seems they've RL'd the heck out of it to suggest next steps and ask questions rather than do what is asked and wait for further instruction. I eventually had to keep asking it to "please answer then stop", and interestingly the quality of the "conversation" seemed to fall apart after that (perhaps because Gemini was now predicting/generating a more adversarial conversation than a collaborative one?).
I'm wondering/hoping that Gemini CLI might be better at working on documentation than Gemini web, since then the doc can be an actual file it is editing, and it can use its edit tool for that, as opposed to hoping that Gemini web can assemble chunks of context (various parts of the ISA discussion) into a single document.
If you were someone saying at the start of 2025 "this is a flash in the pan and a bunch of hype, it's not going to fundamentally change how we write code", that was still a reasonable belief to hold back then. At the start of 2026 that position is basically untenable: it's just burying your head in the sand and wishing for AI to go away. If you're someone who still holds it you really really need to download Claude Code and set it to Opus and start trying it with an open mind: I don't know what else to tell you. So now the question has shifted from whether this is going to transform our profession (it is), to how exactly it's going to play out. I personally don't think we will be replacing human engineers anytime soon ("coders", maybe!), but I'm prepared to change my mind on that too if the facts change. We'll see.
I was a fellow mind-changer, although it was back around the first half of last year when Claude Code was good enough to do things for me in a mature codebase under supervision. It clearly still had a long way to go but it was at that tipping point from "not really useful" to "useful". But Opus 4.5 is something different - I don't feel I have to keep pulling it back on track in quite the way I used to with Sonnet 3.7, 4, even Sonnet 4.5.
For the record, I still think we're in a bubble. AI companies are overvalued. But that's a separate question from whether this is going to change the software development profession.
I didn't write off an entire field of research, but rather want to highlight that these aren't intractable problems for AI research, and that we can actually start codifying many of these things today using the skills framework to close up edges in the model training. It may not be 100% but it's not 0%.
That's still an if and also a when; could be 2 decades from now or more till this reliably replaces a nurse.
> Retraining to what exactly?
I wish I had a good solution for all of us, and you raise good points. Even if you retrain to become, say, a therapist or a personal trainer, the economy could become too broken and fragmented for you to make a living. Governments that can will have to step in.
If these articles actually provide quantitative results in a study done across an organization and provide concrete suggestions like what Google did a while ago, that would be refreshing and useful.
(Yes, this very article has strong "shill" vibes and fits the patterns above)
Would be far more useful if you provided actual verifiable information and dropped the cringe memes. Can't take seriously someone using "Microslop" in a sentence.
Tell that to the guys drawing up the world's 10 millionth cable suspension bridge
Engineering is just problem solving, nobody judges structural engineers for designing structures with another Simpson Strong Tie/No.2 Pine 2x4 combo because that is just another easy (and therefore cheap) way to rapidly get to the desired state. If your client/company want to pay for art, that's great! Most just want the thing done fast and robustly.
Why do we hold calculators to such high bars? Humans make calculation mistakes all the time.
Why do we hold banking software to such high bars? People forget where they put their change all the time.
Etc etc.
Another alternative (not recommended due to potential for "drift") is to use Gemini's Canvas capability where it is working on a document rather than a specification being spread out over Chat, but this document is fully regenerated for every update (unlike Claude's artifacts), so there is potential for it to summarize or drop sections of the document ("drift") rather than just making requested changes. Canvas also doesn't have Artifact's versioning to allow you to go back to undo unwanted drifts/changes.
Oh absolutely. I've been using Python for past 15 or so years for everything.
I've never written a single line of Rust in my life, and all my new projects are Rust now, even the quick-script-throwaway things, because it's so much better at instantly screaming at Claude when it goes off track. It may take it longer to finish what I asked it to do, but it requires so much less involvement from me.
I will likely never start another new project in python ever.
EDIT: Forgot to add that paired with a good linter, this is even more impressive. I told Claude to come up with the most masochistic clippy configuration possible, where even a tiny mistake is instantly punished and exceptions have to be truly exceptional (I have another agent that verifies this each run).
I just wish there was cargo-clippy for enforcing architectural patterns.
You really need to at least try Claude Code directly instead of using CoPilot. My work gives us access to CoPilot, Claude Code, and Codex. CoPilot isn’t close to the other more agentic products.
The gap didn't seem big, but in November (which admittedly was when Opus 4.5 was in preview on Copilot) Opus 4.5 with Copilot was awful.
1) There exists a threshold, only identifiable in retrospect, past which it would have been faster to locate or write the code yourself than to navigate the LLM's correction loop or otherwise ensure one-shot success.
2) The intuition and motivations of LLMs derive from a latent space that the LLM cannot actually access. I cannot get a reliable answer on why the LLM chose the approaches it did; it can only retroactively confabulate. Unlike human developers who can recall off-hand, or at least review associated tickets and meeting notes to jog their memory. The LLM prompter always documenting sufficiently to bridge this LLM provenance gap hits rub #1.
3) Gradually building prompt dependency where one's ability to take over from the LLM declines and one can no longer answer questions or develop at the same velocity themselves.
4) My development costs increasingly being determined by the AI labs and hardware vendors they partner with. Particularly when the former will need to increase prices dramatically over the coming years to break even with even 2025 economics.
That's great that this is your experience, but it's not a lot of people's. There are projects where it's just not going to know what to do.
I'm working in a web framework that is a Frankenstein-ing of Laravel and October CMS. It's so easy for the agent to get confused because, even when I tell it this is a different framework, it sees things that look like Laravel or October CMS and suggests solutions that are only for those frameworks. So there's constant made up methods and getting stuck in loops.
The documentation is terrible, you just have to read the code. Which, despite what people say, Cursor is terrible at, because embeddings are not a real way to read a codebase.
LMAO what???
... Proceeds to explain how it's crippled and all the workarounds you have to do to make it less crippled.
2. It's the same workarounds we've been doing forever
3. It's indistinguishable from "clear context and re-feed the entire world of relevant info from scratch" we've had forever, just slightly more automated
That's why I don't understand all the "it's new tier" etc. It's all the same issues with all the same workarounds.
Yup. There's some magical "right context" that will fix all the problems. What is that right context? No idea, I guess I need to read yet another 20,000-word post describing magical incantations that you should or shouldn't include in the context.
The "Opus 4.5 is something else/next tier/just works" claims in my mind mean that I wouldn't need to babysit its every decision, or that it would actually read relevant lines from relevant files etc. Nope. Exact same behaviors as whatever the previous model was.
Oh, and that "200k-token context window"? It's a lie. The quality quickly degrades as soon as Claude reaches somewhere around 50% of the context window. At 80+% it's nearly indistinguishable from a model from two years ago. (BTW, same for Codex/GPT with its "1 million token window")
Are you sure you're not an LLM?
Yes indeed. On the other hand, these are the things which aren't working well, in my opinion:
- large codebase
- complex domain knowledge
- creating any feature where you need product insights
- tasks requiring choices (again, complexity doesn't matter here, the task may be simple but require some choices)
- anything unclear where you don't know where you are going first
While you don't experience any of these when teaching or doing side projects, these are very common in any enterprise context.
Likewise, many optimization techniques involve some randomness, whether it's approximating an NP-thorny subproblem, or using PGO guided by statistical sampling. People might disable those in pursuit of reproducible builds, but no one would claim that enabling those features makes GCC or LLVM no longer a compiler. So nondeterminism isn't really the distinguishing factor either.
We have some tests in "GIVEN WHEN THEN" style, and others in other styles. Opus will try to match each style of testing by the project it is in by reading adjacent tests.
But I think it should be doable. You can tell it how YOU want the state to be managed and then have it write a custom "linter" that makes the check deterministic. I haven't tried this myself, but claude did create some custom clippy scripts in rust when I wanted to enforce something that isn't automatically enforced by anything out there.
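To make the idea concrete, here is a minimal sketch of the kind of deterministic architecture check an agent could generate and run. It is in Python rather than Rust/clippy, and the `ui`/`db` layering rule and all names are hypothetical, just to illustrate turning an architectural convention into a pass/fail check:

```python
# Sketch of a custom "architecture linter": a deterministic check that a
# hypothetical layering rule holds (modules under ui/ may not import db/).
import ast
import pathlib

# Hypothetical rule set: top-level directory -> forbidden import roots.
FORBIDDEN = {"ui": {"db"}}

def violations(root="."):
    """Return 'path:line imports name' strings for every rule break."""
    root = pathlib.Path(root)
    found = []
    for path in sorted(root.rglob("*.py")):
        layer = path.relative_to(root).parts[0]
        banned = FORBIDDEN.get(layer, set())
        if not banned:
            continue
        for node in ast.walk(ast.parse(path.read_text())):
            # Collect the module names this statement imports, if any.
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                continue
            for name in names:
                if name.split(".")[0] in banned:
                    found.append(f"{path}:{node.lineno} imports {name}")
    return found

# In CI you would exit nonzero when violations(...) is non-empty, which
# gives the agent the same instant, unambiguous feedback a compiler does.
```

The point is less the specific rule than that the check is boolean and repeatable, so the agent can't argue with it the way it can with a prose instruction.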
"AI has X"
"We have X at home"
"X at home: x"
Claude Code will spawn sub-agents (that often use their cheap Haiku model) for exploration and planning tasks, with only the results imported into the main context.
I've found the best results from a more interactive collaboration with Claude Code. As long as you describe the problem clearly, it does a good job on small/moderate tasks. I generally set two instances of Claude Code separate tasks and run them concurrently (the interaction with Claude Code distracts me too much to do my own independent coding at the same time, as I would when setting a task for a colleague, but I do work on architecture/planning tasks).
The one manner of taste that I have had to compromise on is the sheer amount of code - it likes to write a lot of code. I have a better experience if I sweat the low-level code less, and just periodically have it clean up areas where I think it's written too much / too repetitive code.
As you give it more freedom it's more prone to failure (and can often get itself stuck in a fruitless spiral) - however as you use it more you get a sense of what it can do independently and what's likely to choke on. A codebase with good human-designed unit & playwright tests is very good.
Crucially, you get the best results where your tasks are complex but on the menial side of the spectrum - it can pay attention to a lot of details, but on the whole don't expect it to do great on senior-level tasks.
To give you an idea, in a little over a month "npx ccusage" shows that via my Claude Code 5x sub I've used 5M input tokens, 1.5M output, 121M Cache Create, 1.7B Cache Read. Estimated pay-as-you-go API cost equivalent is $1500 (N.B. for the tail end of December they doubled everybody's API limits, so I was using a lot more tokens on more experimental on-the-fly tool construction work)
I have hit the weekly limit before, briefly, but that took running multiple sessions in parallel continuously for many days.
When the code spit out by an LLM does not pass QA one can merely add "pls fix teh program, bro, pls no mistakes this time, bro, kthxbye", cross their fingers and hope for the best, because in the end it is impossible -- fundamentally -- to determine which part of the prompt produced offending code.
While it is indeed an interesting observation that the latter approaches commercial viability in certain areas there is still somewhere between zero and infinitesimal overlap between prompting and engineering.
I have a hunch if you asked which approach we took based on background, you'd think I was the one using the detailed prompt approach and him the vague.
I must admit I have no idea how to do that or what that even means. I get that bigger context window is better, but what does it mean exactly? How do you stay within that first 100k? 100k what exactly?
Honestly I don't know what commenters on Hacker News are building, but a few months back I was hoping to use AI to build the interaction layer with Stripe to handle multiple products and delayed cancellations via subscription schedules. Everything is documented; the documentation is a bit scattered across pages, but the information is out there. At the time there was Opus 4.1, so I used that. It wrote 1000 lines of non-functional code with 0 reusability after several prompts. I then asked ChatGPT whether it was possible without using schedules; it told me yes (even though it is not), and when I told Claude to recode it, it started coding random stuff that doesn't exist. I built everything to be functional and reusable myself, in approximately 300 lines of code.
The above is a software engineering problem. Reimplementing a JSON parser using Opus is not fun nor useful, so that should not be used as a metric
Sure, create a one-off app to post things to your Facebook page. But a one-off app for the OS it's running on? Freshly generating the code for your bank transaction rules? Generating an authorization service that gates access to your email?
The only reason it's quick to create green-field projects is because of all these complex, large, long-lived codebases that it's gluing together. There's ample training data out there for how to use the Firebase API, the Facebook API, OS calls, etc. Without those long-lived abstraction layers, you can't vibe out anything that matters.
But then maybe this changes what a "codebase" is. If a codebase is just a structured set of specs that compile to code, a la TypeScript -> JavaScript, sure, but then it's still a long-lived <blank>
But maybe you would have to elaborate on what "creating software on the fly" looks like, because I'm sure there's a definition where the answer is yes.
Of course the tech will be useful and ethical if these problems are solved or decided to be solved the right way.
The Claude models are technically multi-modal, but IME the vision side of the equation is really lacking. As a result, Claude is quite good at reasoning about logic, and it can build e.g. simpler web pages where the underlying html structure is enough to work with, but it’s much worse at tasks that inherently require seeing.
But I'm also wondering if LLMs are going to create a new generation of software-dev "brain rot" (to use the colloquial term), similar to short-form videos.
I should mention that in the gamedev world it's quite common to share, because sharing is marketing; hence my perspective.
That people making startups are too busy working to share it on HN, or that AI is useless in real projects.
LLMs are charging like $5 per million tokens. And even if that is subsidized 100x, it is still an order of magnitude cheaper than an overseas engineer.
Not to mention speed. An LLM will spit out 1000 lines in seconds, not hours.
has your experience been otherwise?
But if today it’s so cheap to generate new code that meets updated specs, why care about the quality of the code itself?
Maybe the engineering work today is to review specs and tests and let LLMs do whatever behind the scenes to hit the specs. If the specs change, just start from scratch.
But still, very neat.
It's simple and rigid rules that an AI can pick up easily. If you lack this knowledge, people who have it will simply stop the conversation when you resort to shouting louder and more often.
This does not make your point more valid. If you notice people not engaging with you, that's the reason. You simply don't learn; you just look around for who shares your opinion, with no results to back it.
Why not show benchmarks or sth ;)
The seniors who have less leeway to change course (it's harder as you get older in general, large sunk costs, etc.) maintain their positions, and the disruption occurs at the usual "retirement rate", meaning the industry shrinks a bit each year. They don't get much in the way of pay rises, etc., but normally they have some buffer from earlier times, so they are willing to wear being in a dying field. Staff aren't replaced, but on the whole they still have marginal long-term value (e.g. domain knowledge on the job that keeps them somewhat respected there, or a "that guy was around when they had to do that; show respect" kind of thing).
The juniors move to other industries where the price signal shows value and strong demand remains (e.g. locally for me that's trades but YMMV). They don't have the sunk cost and have time on their side to pivot.
If done right the disruption to people's lives can be small and most of the gains of the tech can still come out. My fear is the AI wave will happen fast but only in certain domains (the worst case for SWE's) meaning the adjustment will be hard hitting without appropriate support mechanisms (i.e. most of society doesn't feel it so they don't care). On average individual people aren't that adaptable, but over generations society is.
Most people want to do anything but these three things - society is in many ways a competition for who gets to avoid them. AI is a way of inexorably boxing people back into actually doing them.
That said, it'll certainly get much, much worse before it starts getting better. I guess the best we can hope for is that the kids find a way out of the hell these psychos paved for us all.
Put another way, we programmers have the luxury of being able to write custom scripts and apps for ourselves. Now that these things are getting way cheaper to build, there should be a growing market that makes them available to more people.
So, AI slop, yes.
Guess what, AIs don't like that as well, because it makes it harder for them to achieve the goal. So with minimal guidance, which at this point could probably be provided by AI as well, the output of an AI agent is not that.
As soon as models have persistent memory for their own try/fail/succeed attempts, and can directly modify what's currently called their training data in real time, they're going to develop very, very quickly.
We may even be underestimating how quickly this will happen.
We're also underestimating how much more powerful they become if you give them analysis and documentation tasks referencing high quality software design principles before giving them code to write.
This is very much 1.0 tech. It's already scary smart compared to the median industry skill level.
The 2.0 version is going to be something else entirely.
Well, there are quite a few common medications we don't really know how they work.
But I also think it can be a huge liability.
It's a vital skill to recognise when that happens and start a new session.
It's what they were trained on, after all.
However what they produce is often highly readable but not very maintainable due to the verbosity and obvious comments. This seems to pollute codebases over time and you see AI coding efficiency slowly decline.
The things you mentioned are important but have been on their way out for years now regardless of LLMs. Have my ambivalent upvote regardless.
Also as this was an architectural change there are no tests to run until it's done. Everything would just fail. It's only done when the whole thing is done. I think that might be one of the reasons it got stuck: it was trying to solve issues that it did not prove existed yet. If it had just finished the job and run the tests it would've probably gotten further or even completed it.
It's a bit like stopping half way through renaming a function and then trying to run the tests and finding out the build does not compile because it can't find 'old_function'. You have to actually finish and know you've finished before you can verify your changes worked.
I still haven't actually addressed this tech debt item (it's not that important :)). But I might try again and either see if it succeeds this time (with plan in an md) or just do the work myself and get Opus to fix the unit tests (the most tedious part).
You can see this pattern in my AI attribution commit footers. It was such a noticeable difference to me that I signed up for Google AI Ultra. I got the email receipt January 3, 2026 at 11:21 AM Central, and I have not hit a single quota limit since.
When I'm dying of dehydration because humanity has depleted all fresh water deposits, I'll think of you and your stupid NES emulator which is just an LLM-produced copy of many ones that had already existed.
Even more basic failure mode. I told it to convert/copy a bit (1k LOC) of blocking code into a new module and convert it to async. It just couldn't do a proper 1:1 logical _copy_. But when I manually `cp <src> <dst>` the file and then told it to convert that to async and fix issues, it did it 100% correctly. Because fundamentally it's just a non-deterministic pattern generator.
> Create a CLAUDE.md for a c++ application that uses libraries x/y/z
[Then I edit it, adding general information about the architecture]
> Analyze the library in the xxx directory, and produce a xxx_architecture.md describing the major components and design
> /agent [let claude make the agent, but when it asks what you want it to do, explain that you want it to specialize in subsystem xxx, and refer to xxx_architecture.md
Then repeat until you have the major components covered. Then:
> Using the files named with architecture.md analyze the entire system and update CLAUDE.md to use refer to them and use the specialized agents.
Now, when you need to do something, put it in planning mode and say something like:
> There's a bug in the xxx part of the application, where when I do yyy, it does zzz, but it should do aaa. Analyze the problem and come up with a plan to fix it, and automated tests you can perform if possible.
Then, iterate on the plan with it if you need to, or just approve it.
One of the most important things you can do when dealing with something complex is let it come up with a test case so it can fix or implement something and then iterate until it's done. I had an image processing problem and I gave it some sample data, then it iterated (looking at the output image) until it fixed it. It spent at least an hour, but I didn't have to touch it while it worked.
This is definitely not the case, and the reason Anthropic doesn't make Claude do this is because its quality degrades massively as you use up its context. So the solution is to let users manage the context themselves in order to minimize the amount that is "wasted" on prep work. Context windows have been increasing quite a bit, so I suspect that by 2030 this will no longer be an issue for any but the largest codebases, but for now you need to be strategic.
OP says, "BUT YOU DON’T KNOW HOW THE CODE WORKS.. No I don’t. I have a vague idea, but you are right - I do not know how the applications are actually assembled." This is not what I would call an engineer. Or a programmer. "Prompter", at best.
And yes, this is absolutely "lesser than", just like a middleman who subcontracts his work to Fiverr (and has no understanding of the actual work) is "lesser than" an actual developer.
I'm sure AI will get there, I also think it's not very good yet.
I'm not talking about "write this function" but rather like implementing the whole feature by writing only English to the agent, over the course of numerous back-and-forth interactions and exhausting multiple 200K-token context windows.
For me personally, definitely at least 99% of all the Rust code I've committed at work since Opus 4.5 came out has been from an agent running that model. I'm reading lots of Rust code (that Opus generated) but I'm essentially no longer writing any of it. If dot-autocomplete (and LLM autocomplete) disappeared from IDE existence, I would not notice.
If you have any opensource examples of your codebase, prompt, and/or output, I would happily learn from it / give advice. I think we're all still figuring it out.
Also this SIMD translation wasn't just a single function - it was multiple functions across a whole region of the codebase dealing with video and frame capture, so pretty substantial.
The bigger a project gets the more context you generally need to understand any particular part. And by default Claude Code doesn't inject context, you need to use 3rd party integrations for that.
What I’ve learned is that once you reach that point you’ve got to break that problem down into smaller pieces that the AI can work productively with.
If you’re about to start with Gemini-cli I recommend you look up https://github.com/github/spec-kit. It’s a project out of Microsoft/GitHub that encodes a rigorous spec-then-implement multi-pass workflow. It gets the AI to produce specs, double check the specs for holes and ambiguity, plan out implementation, translate that into small tasks, then check them off as it goes. I don’t use spec-kit all the time, but it taught me what explicit multi-pass prompting can do when the context is held in files on disk, often markdown that I can go in and change as needed. I think it basically comes down to enforcing enough structure in the form of codified processes, self-checks and/or tests for your code.
Pro tip, tell spec-kit to do TDD in your constitution and the tests will keep it on the rails as you progress. I suspect “vibe coding” can get a bad rap due to lack of testing. With AI coding I think test coverage gets more important.
I find I'm doing more Typescript projects than Python because of the superior typing, despite the fact I prefer Python.
A big part of the problem with existing software is that humans seem to be pretty much incapable of deciding a project is done and stop adding to it. We treat creating code like a job or hobby instead of a tool. Nothing wrong with that, unless you're advertising it as a tool.
I'm finding that branding and graphic design is the most arduous part, and I'm hoping to accelerate that soon. I'm heavily AI assisted there too, and I'm evaluating MCP servers to help, but so far I do actually have to focus on just that part as opposed to babysitting.
I'm hoping after AI comes back down to earth there will be a new glut of cheap second hand GPUs and RAM to get snapped up.
This is why you use this AI bubble (it IS a bubble) to use the VC-funded AI models for dirt cheap prices and CREATE tools for yourself.
Need a very specific linter? AI can do it. Need a complex Roslyn analyser? AI. Any kind of scripting or automation that you run on your own machine. AI.
None of that will go away or suddenly stop working when the bubble bursts.
Within just the last 6 months I've built so many little utilities to speed up my work (and personal life) it's completely bonkers. Most went from "hmm, might be cool to..." to a good-enough script/program in an evening while doing chores.
Even better, start getting the feel for local models. Current gen home hardware is getting good enough and the local models smart enough so you can, with the correct tooling, use them for suprisingly many things.
I can run multiple agents at once, across multiple code bases (or the same codebase but multiple different branches), doing the same or different things. You absolutely can't keep up with that. Maybe the one singular task you were working on, sure, but the fact that I can work on multiple different things without the same cognitive load will blow you out of the water.
> 2) The intuition and motivations of LLMs derive from a latent space that the LLM cannot actually access. I cannot get a reliable answer on why the LLM chose the approaches it did; it can only retroactively confabulate. Unlike human developers who can recall off-hand, or at least review associated tickets and meeting notes to jog their memory. The LLM prompter always documenting sufficiently to bridge this LLM provenance gap hits rub #1.
Tell the LLM to document in comments why it did things. Human developers often leave and then people with no knowledge of their codebase or their "whys" are even around to give details. Devs are notoriously terrible about documentation.
> 3) Gradually building prompt dependency where one's ability to take over from the LLM declines and one can no longer answer questions or develop at the same velocity themselves.
You can't develop at the same velocity, so drop that assumption now. There's all kinds of lower abstractions that you build on top of that you probably can't explain currently.
> 4) My development costs increasingly being determined by the AI labs and hardware vendors they partner with. Particularly when the former will need to increase prices dramatically over the coming years to break even with even 2025 economics.
You aren't keeping up with the actual economics. This shit is technically profitable, the unprofitable part is the ongoing battle between LLM providers to have the best model. They know software in the past has often been winner takes all so they're all trying to win.
One trick I use that might work for you as well:
> Clone GitHub.com/simonw/datasette to /tmp, then look at /tmp/docs/datasette for documentation, and search the code if you need to.

Try that with your own custom framework and it might unblock things. If your framework is missing documentation, tell Claude Code to write itself some documentation based on what it learns from reading the code!
It turns out that Waterfall was always the correct method, it's just really slow ;)
Putting “Make minimal changes” in my standard prompt helped a lot with the tendency of basically all agents to make too many changes at once. With that addition it became possible to direct the LLM to make something similar to the logical progression of commits I would have made anyway, but now don’t have to work as hard at crafting.
Most of the hype merchants avoid the topic of maintainability because they’re playing to non-technical management skeptical of the importance of engineering fundamentals. But everything I’ve experienced so far working with LLMs screams that the fundamentals are more important than ever.
Two big examples:
- The period from early MVC JavaScript frontends (Backbone.js etc.) to the time of the great React/Angular wars. I completely stepped out of the webdev space during that period.
- The rapid expansion of Deep Learning frameworks where I did try to keep up (shipped some Lua torch packages and made minor contributions to Pylearn2).
In the first case, missing 5 years of front-end wars had zero impact. After not doing webdev work at all for 5 years, I was tasked with shipping a React app. It took me a week to catch up, and everything was deployed in roughly the same time it would have taken someone who had spent years keeping up with the changes.
In the second case, where I did keep up with many of the developing deep learning frameworks, it didn't really confer any advantage. Coworkers who started with PyTorch fresh out of school were just as proficient, if not more so, at building models. Spending energy keeping up offered no value other than feeling "current" at the time.
Can you give me a counter-example where keeping up with a rapidly changing, unstable area has conferred a benefit on you? Most FOMO is really just fear. Again, unless you're trying to sell yourself specifically as a consultant on the bleeding edge, there's no reason to keep up with all these changes (other than finding it fun).
-- Charles Babbage
No - that's not what I did.
You don't need an extra-long context full of irrelevant tokens. Claude doesn't need to see the code it implemented 40 steps ago in a working method from Phase 1 if it is on Phase 3 and not using that method. It doesn't need reasoning traces for things it already "thought" through.
This other information is clutter, not help. It makes the signal-to-noise ratio worse.
If Claude needs to know something it did in Phase 1 for Phase 4 it will put a note on it in the living markdown document to simply find it again when it needs it.
1) define problem
2) split problem into small independently verifiable tasks
3) implement tasks one by one, verify with tools
With humans, 1) is the spec and 2) is the Jira (or whatever) tasks. With an LLM, usually 1) is just a markdown file, 2) is a markdown checklist or GitHub issues (which Claude can use with the `gh` CLI), and every loop of 3) gets a fresh context, maybe with the spec from step 1 and the relevant task information from 2.
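A minimal sketch of that loop, assuming a hypothetical `run_agent` helper that stands in for whatever agent CLI you actually shell out to (every name here is illustrative, not a real API):

```python
SPEC = "spec.md"          # step 1: the problem definition, a plain markdown file
CHECKLIST = [             # step 2: small, independently verifiable tasks
    "add a parse_config() function with unit tests",
    "wire parse_config() into main(), keeping existing CLI flags",
]

def run_agent(prompt: str) -> str:
    # Stand-in for one real agent invocation with a fresh context;
    # here it just echoes the prompt it would send.
    return prompt

def implement(tasks):
    results = []
    for task in tasks:
        # Step 3: each task gets a fresh context containing only the spec
        # and this one task, not the full transcript of earlier phases.
        prompt = f"Read {SPEC}. Then do exactly this task and nothing else:\n{task}"
        results.append(run_agent(prompt))
    return results

print(len(implement(CHECKLIST)))  # one fresh-context run per checklist item
```

The point of the structure is that nothing from an earlier task's transcript leaks into a later one; only the spec and the checklist persist.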
I haven't run into context issues in a LONG time, and when I have, it's usually been either intentional (a problem where compacting won't hurt) or an error on my part.
It's in your interest to deal with your frustration and figure out how you can leverage the new tools to stay relevant (to the degree that you want to).
Regarding the context window, Claude needs thinking turned up for long context accuracy, it's quite forgetful without thinking.
All I can tell you is that in my own lived experience, I've had some fantastic results from AI, and it comes from telling it "look at this thing here, ok, i want you to chain it to that, please consider this factor, don't forget that... blah blah blah" like how I would have spelled things out to a junior developer, and then it really does stand a really solid chance of turning out what I've asked for. It helps a lot that I know what to ask for; there's no replacing that with AI yet.
So, your own situation must fall into one of these coarse buckets:
- You're doing something way too hard for AI to have a chance at yet, like real science / engineering at the frontier, not just boring software or infra development
- Your prompts aren't specific enough, you're not feeding it context, and you're expecting it to one-shot things perfectly instead of having to spend an afternoon prompting and correcting stuff
- You're not actually using and getting better at the tools, so you're just shouting criticisms from the sidelines, perhaps as sour grapes because you're not allowed by policy, or your company can't afford to have you get into it.
IDK. I hope it's the first one and you're just doing Really Hard Things, but if you're doing normal software developer stuff and not seeing a productivity advantage, it's a fucking skill issue.
I exclusively use opus for architecture / speccing, and then mostly Sonnet and occasionally Haiku to write the code. If my usage has been light and the code isn't too straightforward, I'll have Opus write code as well.
Typosquatting is a thing, for example, and I'm sure hallucination squatting will be, too.
I also don't want to run anything in a "sandbox", either. Containers are not sandboxes despite things like the Gemini CLI pretending they are.
I've also built a bitorrent implementation from the specs in rust where I'm keeping the binary under 1MB. It supports all active and accepted BEPs: https://www.bittorrent.org/beps/bep_0000.html
Again, I literally don't know how to write a hello world in rust.
I also vibe coded a trading system that is connected to 6 trading venues. This was a fun weekend project but it ended up making +20k of pure arbitrage with just 10k of working capital. I'm not sure this proves my point, because while I don't consider myself a programmer, I did use Python, a language that I'm somewhat familiar with.
So yeah, I get what you are saying, but I don't agree. I used highload as an example, because it is an objective way of showing that a combination of LLM/agents with some guidance (from someone with no prior experience in this type of high performing architecture) was able to beat all human software developers that have taken these challenges.
Point it at your unfinished side projects if any and describe what the project was supposed to do
You need to be able to perceive how far behind you’re falling while simping for corporate policies
The only reason we don't do that with code (or didn't use to do it) was because rewriting from scratch NEVER worked[0]. And large scale refactors take massive amounts of time and resources, so much so that there are whole books written about how to do it.
But today, trivial-to-simple applications can be rewritten from spec or from scratch in an afternoon with an LLM. And even pretty complex parsers can be ported, provided that the tests are robust enough[1]. It's just a matter of time before someone rewrites a small-to-medium-size application from one language to another using the previous app as the "spec".
[0] https://www.joelonsoftware.com/2000/04/06/things-you-should-...
It's kind of bittersweet for me because I was dreaming of becoming a software architect when I graduated university and the role started disappearing so I never actually became one!
But the upside of this is that now LLMs suck at software architecture... Maybe companies will bring back the software architect role?
The training set has been totally poisoned from the architecture PoV. I don't think LLMs (as they are) will be able to learn software architecture now because the more time passes, the more poorly architected slop gets added online and finds its way into the training set.
Good software architecture tends to be additive, as opposed to subtractive. You start with a clean slate then build up from there.
It's almost impossible to start with a complete mess of spaghetti code and end up with a clean architecture... Spaghetti code abstractions tend to lead you astray. It's like understanding spaghetti code soils your understanding of the problem domain: you start to think of everything in terms of terrible leaky abstractions and can't think about the problem clearly.
It's hard even for humans to look at a problem through fresh eyes; it's likely even harder for LLMs to do it. For example, if you use a word in a prompt, the LLM tends to try to incorporate that word into the solution... So if the AI sees a bunch of leaky abstractions in the code; it will tend to try to work with them as opposed to removing them and finding better abstractions. I see this all the time with hacks; if the code is full of hacks, then an LLM tends to produce hacks all the time and it's almost impossible to make it address root causes... Also hacks tend to beget more hacks.
Big claims here.
Did brewers and bakers up to the middle ages understand fermentation and how yeasts work?
How many people understand the underlying operating system their code runs on? Can even read assembly or C?
Even before LLMs, there were plenty of copy-paste JS bootcamp grads that helped people build software businesses.
But I sure write a lot less of it, and the percentage I write continues to go down with every new model release. And if I'm no longer writing it, and the person who works on it after me isn't writing it either, it changes the whole art of software engineering.
I used to spend a great deal of time with already working code that I had written thinking about how to rewrite it better, so that the person after me would have a good clean idea of what is going on.
But humans aren't working in the repos as much now. I think it's just a matter of time before the models are writing code essentially for their eyes, their affordances -- not ours.
It's verbose by default but a few hours of custom instructions and you can make it code just like anyone
I am stealing the heck out of this.
Attention-based neural network architectures (on which the majority of LLMs are built) have a unit economic cost that scales roughly n^2, i.e. quadratically, in both memory and compute. In other words, the longer the context window, the more expensive it is for the upstream provider. That's one cost.
The second cost is that you have to resend the entire context every time you send a new message. So the context is basically (where a, b, and c are messages): first context: a, second context: a->b, third context: a->b->c. From the developer's point of view it's a mostly stateless process (there are some short-term caching mechanisms, YMMV by provider; it's why "cached" messages, especially system prompts, are cheaper); the state, i.e. the context window string, is managed by the end-user application (in other words, the coding agent, the IDE, the ChatGPT UI client, etc.)
The per-token price is an amortized (averaged) cost of memory+compute; the actual cost grows roughly quadratically with respect to each marginal token. The longer the context window, the more expensive things are. Because of the above, AI agent providers (especially those that charge flat-fee subscription plans) are incentivized to keep costs low by limiting the maximum context window size.
(And if you think about it carefully, your AI API costs are a quadratic cost curve projected onto a linear line: a flat fee per token. So the model hosting provider may in some cases make more profit if users send in shorter contexts versus constantly saturating the window. YMMV of course, but it's a race to the bottom right now for LLM unit economics.)
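A back-of-the-envelope sketch of the resend cost (a toy model, not any provider's actual billing logic): because each new message resends the whole history, total tokens processed over a conversation grows like the sum 1+2+...+n of per-turn context sizes.

```python
def total_tokens_processed(turn_sizes):
    """Tokens processed across a whole conversation if every turn
    resends the full history (ignoring provider-side caching)."""
    total = 0
    context = 0
    for size in turn_sizes:
        context += size   # the history grows by this turn's tokens
        total += context  # and the whole context gets processed again
    return total

# 20 turns of 1,000 tokens each: not 20k tokens total, but 210k.
print(total_tokens_processed([1000] * 20))  # 1000 * (1+2+...+20) = 210000
```

This is why prompt caching and context compaction matter so much to the unit economics: without them, a long chat pays for its own history over and over.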
They do this by interrupting a task halfway through and generating a "summary" of the task's progress, then prompting the LLM again with a fresh prompt plus the "summary" so far, and the LLM restarts the task from where it left off. Of course, text is a poor representation of the LLM's internal state, but it's the best option so far for AI applications to keep costs low.
Another thing to keep in mind is that LLMs have poorer performance the larger the input size. This is due to a variety of factors (mostly because you don't have enough training data to saturate the massive context window sizes I think).
The general graph for LLM context performance looks something like this: https://cobusgreyling.medium.com/llm-context-rot-28a6d039965... https://research.trychroma.com/context-rot
There are a bunch of tests and benchmarks (commonly referred to as "needle in a haystack") for LLM performance at large context window sizes, but improving that performance is still an open area of research.
https://cloud.google.com/blog/products/ai-machine-learning/t...
The thing is, generally speaking, you will get slightly better performance if you can squeeze all your code and the problem into the context window, because the LLM gets a "whole picture" view of your codebase/problem instead of a chain of broken-telephone summaries every few tens of thousands of tokens. Take this with a grain of salt, as the field is changing rapidly, so it might not be valid in a month or two.
Keep in mind that if the problem you are solving requires you to saturate the entire context window of the LLM, a single request can cost you dollars. And if you are using 1M+ context window model like gemini, you can rack up costs fairly rapidly.
I've heard this same thing repeated dozens of times, and for different domains/industries.
It's really just a variation of the 80/20 rule.
The question in my mind is where we are on the s-curve. Are we just now entering hyper-growth? Or are we starting to level out toward maturity?
It seems like it must still be hyper-growth, but it feels less that way to me than it did a year ago. I think in large part my sense is that there are two curves happening simultaneously, but at different rates. There is the growth in capabilities, and then there is the growth in adoption. It's the first curve that seems to me to have slowed a bit. Model improvements seem both amazing and also less revolutionary to me than they did a year or two ago.
But the other curve is adoption, and I think that one is way further from maturity. The providers are focusing more on the tooling now that the models are good enough. I'm seeing "normies" (that is, non-programmers) starting to realize the power of Claude Code in their own workflows. I think that's gonna be huge and is just getting started.
Most of the time, people are just very reasonably and understandably focused tightly on their own lane and honestly had no idea of the externalities of their conclusions and decisions. In all those cases I've been happy to land on a rebalancing of the trade-offs that everyone can accept, and that everyone is grateful to have documented, to justify spending the story points on cleaning up later instead of working on new features while the externality debt's unwanted impact keeps piling up.
In fewer than a handful of cases, I've run into people deliberately, consciously, with malice aforethought of the full externalities, making trade-offs for the sake of expediently shifting burdens off of themselves without first consulting the partner teams they want to shift the burdens onto, simply so they can fatten their promo packet sooner at the expense of making other teams look worse. Getting these trade-offs documented makes them back down to something more reasonable about half the time; the other half they don't back down, but your team is now protected by explicit documentation and caveats on the externality your team now has to carry. And 100% of the time, my team and I put a ring fence around all future interactions with that personality for at least the remaining duration of my gig.
I'm banking on a future that if users feel they can (perhaps vibe) code their own solutions, they are far less likely to open their wallets for our bloatware solutions. Why pay exorbitant rents for shitty SaaS if you can make your own thing ad-free, exactly to your own mental spec?
I want the "computers are new, programmers are in short supply, customer is desperate" era we've had in my lifetime so far to come to a close.
There is probably some effective way to put this direction into the claude.md, but so far it still seems to do unnecessary reimplementation quite a lot.
Why use a 3rd party dependency that might have features you don't need when you can write a hyper-specific solution in a day with an LLM and then you control the full codebase.
Or why pay €€€ for a SaaS every month when you can replicate the relevant bits yourself?
But there’s no sign of them slowing down.
That puts them ahead of the LLM crowd.
Something I think, though (which, again, I could very well be wrong about; uncertainty is the only certainty right now), is that "so the person after me would have a good clean idea of what is going on" is also going to continue mattering even when that "person" is often an AI. It might be different, clarity might mean something totally different for AIs than for humans, but right now I think a good expectation is that clarity for humans is also useful to AIs. So at the moment I still spend time coaxing the AI to write things clearly.
That could turn out to be wasted time, but who knows. I also think of it as a hedge against the risk that we hit some point where the AIs turn out to be bad at maintaining their own crap, at which point it would be good for me to be able to understand and work with what has been written!
This is a completely new thing which will have transformative consequences.
It's not just a way to do what you've always done a bit more quickly.
https://www.folklore.org/Negative_2000_Lines_Of_Code.html
> When he got to the lines of code part, he thought about it for a second, and then wrote in the number: -2000
But in retrospect it’s absolutely baffling that mixing raw SQL queries with HTML tag soup wasn’t necessarily uncommon then. Also, I haven’t met many PHP developers that I’d recommend for a PHP job.
By getting the LLM to keep changes minimal I’m able to keep quality high while increasing velocity to the point where productivity is limited by my review bandwidth.
I do not fear competition from junior engineers or non-technical people wielding poorly-guided LLMs for sustained development. Nor for prototyping or one offs, for that matter — I’m confident about knowing what to ask for from the LLM and how to ask.
Let's assume the LLM agents can write tests for, and hit, specs better and cheaper than the outsourced offshore teams could.
So let's assume now you can have a working product that hits your spec without understanding the code. How many bugs and security vulnerabilities have slipped through "well tested" code because of edge cases of certain input/state combinations? Ok, throw an LLM at the codebase to scan for vulnerabilities; ok, throw another one at it to ensure no nasty side effects of the changes that one made; ok, add some functionality and a new set of tests and let it churn through a bunch of gross code changes needed to bolt that functionality into the pile of spaghetti...
How long do you want your critical business logic relying on not-understood code with "100% coverage" (of lines of code and spec'd features) but super-low coverage of actual possible combinations of input+machine+system state? How big can that codebase get before "rewrite the entire world to pass all the existing specs and tests" starts getting very very very slow?
We've learned MANY hard lessons about security, extensibility, and maintainability of multi-million-LOC-or-larger long-lived business systems and those don't go away just because you're no longer reading the code that's making you the money. They might even get more urgent. Is there perhaps a reason Google and Amazon didn't just hire 10x the number of people at 1/10th the salary to replace the vast majority of their engineering teams year ago?
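To put a rough number on the input+state combination point above (a toy illustration, nothing more): 100% line coverage says nothing about coverage of the state space, which grows exponentially with the number of independent flags or state bits in play.

```python
def state_space(flags: int, values_per_flag: int = 2) -> int:
    """Number of distinct combinations for `flags` independent settings,
    each of which can take `values_per_flag` values."""
    return values_per_flag ** flags

# A code path influenced by just 20 boolean feature flags / state bits
# hides over a million distinct combinations behind "fully covered" lines.
print(state_space(20))  # 2**20 = 1048576
```

A test suite exercising a few hundred of those combinations can legitimately report 100% line coverage while leaving the vast majority of the behavior space untested.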
Either way, most tasks don't have the luxury of a thorough test suite, as the test suite itself is the product of arduous effort in debugging and identifying corner cases.
I consider my job to be actually useful. That I produce useful stuff to society at large.
I definitely hope that I'm replaced with someone/thing better; whatever it is. That's progress.
I surely don't hope for a future where I retire and medics have access to worse tech than they have now.
There are people all over the place building stuff that would've either never been built, or would've required a paid dev++.
I built a whole webshop with an internal CRM/admin panel to manage ~150 products. I built a middleware connecting our webshop to our legacy ERP system, something that would normally be done by another software company.
I built a program with a UI that makes it super easy for us to generate ZPL code and print labels using 4 different label printers automatically with a simple interface, managed by an RPi.
I have built custom personal portfolio websites for friends with Gemini 3 in hours for free, something that again would've cost money for a dev or ended up on some crappy WP/Squarespace template.
As the other user said, the progress/changes are not distributed evenly, and are impossible to quantify.
But to me, whose main job is not programming (though I know how to code) but running a non-software business, the productivity gains are very obvious, as is the fact that because of LLMs I have robbed developers of potential work.
https://andymasley.substack.com/p/the-ai-water-issue-is-fake
At the same time, I think there are limitations to these tools, and that I won't ever be able to achieve what I see others claiming, like 95% of code being AI-written or leaving the AI to iterate for an hour. There are just too many weird little pitfalls in our work that the AI just cannot seem to avoid.
It's understandable, I've fallen victim to a few of them too, but I have the benefit of the ability to continuously learn/develop/extrapolate in a way that the LLM cannot. And with how little documentation exists for some of these things (MASQUE proxying for example) anytime the LLM encounters this code it throws a fit, and is unable to contribute meaningfully.
So thanks for your suggestions; they have made Claude better, and clearly I was dragging my feet a little. At the very least, it's freed up some more of my time to work on the complex things Claude can't do.
It's also hard to steer the plan mode or have it remember some behavior that you want to enforce. It's much better to create a custom command with custom instructions that acts as the plan mode.
My system works like this:
/implement command acts as an orchestrator & plan mode, and it is instructed to launch a predefined set of agents based on the problem and have them utilize specific skills. Every time the /implement command is initiated, it has to create a markdown file inside my own project, and then each subagent is also instructed to update that file when it finishes working.
This way, orchestrator can spot that agent misbehaved, and reviewer agent can see what developer agent tried to do and why it was wrong.
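That shared-markdown-file pattern can be sketched roughly like this (the file name, section layout, and helper names are all made up for illustration; real subagents would append via their own tool calls):

```python
from pathlib import Path

STATE = Path("implement-log.md")  # hypothetical per-run state file

def start_run(title: str):
    # The orchestrator creates the shared state file for this /implement run.
    STATE.write_text(f"# {title}\n\n")

def agent_report(agent: str, note: str):
    # Each subagent appends what it did, so the orchestrator can spot
    # misbehavior and the reviewer agent can see the developer agent's
    # reasoning without the two sharing a context window.
    with STATE.open("a") as f:
        f.write(f"## {agent}\n{note}\n\n")

start_run("Add retry logic to the HTTP client")
agent_report("developer", "Added retry with exponential backoff in client.py.")
agent_report("reviewer", "Backoff has no upper cap; flagged for follow-up.")
print(STATE.read_text())
```

The file is the only shared memory: each agent reads the history it needs and writes its own entry, which is what lets a later reviewer see what an earlier developer agent tried and why it was wrong.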
And also - doesn’t that make Zed (and other editors) pointless?
That's a good way to say it, I totally identify.
The old adage about how "users use 10% of your software's features, but they each use a different 10%" can now be solved by each user just building that 10% for themselves.
I think my comparison is apt; being a bubble and a truly society-altering technology are not mutually exclusive, and by virtue of it being a bubble, it is overhyped.
Are there any local models that are at least somewhat comparable to the latest-and-greatest (e.g. Opus 4.5, Gemini 3), especially in terms of coding?
Potentially because there is no baggage with similar frameworks. I'm sure it would have an easier time with this if it was not spun off from other frameworks.
> If your framework is missing documentation tell Claude Code to write itself some documentation based on what it learns from reading the code!
If Claude cannot read the code well enough to begin with, and needs supplemental documentation, I certainly don't want it generating the docs from the code. That's just compounding hallucinations on top of each other.
What I very succinctly called "crippled context" despite claims that Opus 4.5 is somehow "next tier". It's all the same techniques we've been using for over a year now.
> I haven't ran into context issues in a LONG time
Because you've become the reverse centaur :) "a person who is serving as a squishy meat appendage for an uncaring machine." [1]
You are very aware of the exact issues I'm talking about, and have trained yourself to do all the mechanical dance moves to avoid them.
I do the same dances, that's why I'm pointing out that they are still necessary despite the claims of how model X/Y/Z are "next tier".
[1] https://doctorow.medium.com/https-pluralistic-net-2025-12-05...
As for the snide and patronizing "it's in your interest to stay relevant":
1. I use these tools daily. That's why I don't subscribe to willful wide-eyed gullibility. I know exactly what these tools can and cannot do.
The vast majority of "AI skeptics" are the same.
2. In a few years, when the world is awash in barely working, incomprehensible AI slop, my skills will be in great demand. Not because I'm an amazing developer (I'm not), but because I have experience separating the wheat from the chaff.
So prompting is a lateral move away from engineering to management? Are we arguing semantics here? Because that's pretty much what I was saying, just in the other direction.
Creating a parser for this challenge that is 10x more efficient than a simple approach does require deep understanding of what you are doing. It requires optimizing the hot loop (among other things) that 90-95% of software developers wouldn't know how to do. It requires deep understanding of the AVX2 architecture.
Here you can read more about these challenges: https://blog.mattstuchlik.com/2024/07/12/summing-integers-fa...
This seems like a sort of I dunno chicken and the egg thing.
The _reason_ you don't rewrite code is that it's hard to know that you truly understand the spec. If you could perfectly understand the spec then you could rewrite the code, but then what is the software: the code, or the spec the code was written from? So if you built code A from a spec, rebuilding it from that spec doesn't really qualify as a rewrite; it's just a recompile. If you're trying to fundamentally build a new application from a spec when the old application was written by hand, you're going to run into the same problems you have in a normal rewrite.
We already have an example of this. TypeScript applications are basically rewritten every time you compile the TypeScript down to JavaScript for Node. TypeScript isn't the executed code; it's a spec.
edit: I think I missed that you said rewrite in a different language, then yeah fine, you're probably right, but I don't think most people are architecture agnostic when they talk about rewrites. The point of a rewrite is to keep the good stuff and lose a lot of bad stuff. If you're using the original app as a spec to rewrite in a new language, then fine yeah, LLM's may be able to do this relatively trivially.
Buildings in most other countries in the world ARE built to last forever, and are often renovated, changed, extended, and modified long after the incept date, because needs change and destroying them to start over is complete overkill. (Although some people do these "large scale refactors" - they're usually rich.)
> It's just a matter of time before someone rewrites a small-to-medium-size application from one language to another using the previous app as the "spec".
I have no doubt of this. I'm sure it's happening already. But the whole point of long term stable applications is that they are tried and tested. A port done in an afternoon by an LLM might be great, but you can't know if it has problems until it has withstood the test of time.
The problem with "all software" being AI-generated is that, to use your analogy, the electrical standards, foundation, and building materials have all been recently vibe-coded into existence, and none of your construction workers are certified in any of it.
Knowledge engineering has a notion called "covered/invisible knowledge", which points to the small things we do unknowingly that change the whole outcome. None of the models (even AI in general) can capture this. We can say it's the essence of being human, or the tribal knowledge that makes an experienced worker who they are, or makes mom's rice taste that good.
Considering these are highly individualized and unique behaviors, a model based on averaging everything can't capture this essence easily, if it ever can, without extensive fine-tuning for/with that particular person.
There are situations where it applies and situation where it doesn't. Having the experience to see what applies in this new context is what senior (usually) means.
(The method I have the most confidence in is some sort of mixed system where there is non-profit, state-planned, and startup software development all at once.)
Markets are a tool, a means to the end. I think they're very good, I'm a big fan! But they are not an excuse not to think about the outcome we want.
I'm confident that the outcome I don't want is one where most software developers are trying to find demand for their work, pivoting, etc.; it's very "pushing a string", or "cart before the horse". I want more "pull", where the users/beneficiaries of software are better able to dictate, or create themselves, what they want, rather than being helpless until a pivoting engineer finds it for them.
Basically start-up culture has combined theories of exogenous growth from technology change, and a baseline assumption that most people are and will remain hopelessly computer illiterate, into an ideology that assumes the best software is always "surprising", a paradigm shift, etc.
Startups that make libraries/tools for other software developers are fortunately a good step in undermining these "the customer is an idiot and the product will be better than they expect" assumptions. That gives me hope we'll reach a healthier mix of push and pull. Wild successes are always disruptive, but that shouldn't mean that the only success is wild, or that trying to "act disruptive before wild success" ("manifest" paradigm shifts!) is always the best means to get there.
LLMs accelerate this and make it more visible, but they are not the cause. It is almost always a person trying to solve a problem and just not knowing what they don't know because they are learning as they go.
It saddens me to see AI detractors being stuck in 2022 and still thinking language models are just regurgitating bits of training data.
Yes, actually. Its hard to open a competing bakery due to location availability, permitting, capex, and the difficulty of converting customers.
To add to that, food establishments generally exist on next to no margin, due to competition, despite all of that working in their favor.
Now imagine what the competitive landscape for that bakery would look like if all of that friction for new competitors disappeared. Margin would tend toward zero.
This "legacy apps are barely understood by anybody" is just something you made up.
I can't teach it taste.
Arthur Whitney?
https://en.wikipedia.org/wiki/Arthur_Whitney_(computer_scien...
I remember when Gemini 3 Pro was the latest hotness and I started to get FOMO seeing demos on X posted to HN showing it one-shotting all sorts of impressive stuff. So I tried it out for a couple of days in Gemini CLI/OpenCode and ran into the exact same pain points I was dealing with using CC/Codex.
Flashy one shot demos of greenfield prompts are a natural hype magnet so get lots of attention, but in my experience aren't particularly useful for evaluating value in complex, legacy projects with tightly bounded requirements that can't be easily reduced to a page or two of prose for a prompt.
> let LLMs do whatever behind the scenes to hit the specs
assuming for the sake of argument that's completely true, then what happens to "competitive advantage" in this scenario? It gets me thinking: if anyone can vibe from spec, what's stopping company A (or even user A) from telling an LLM agent "duplicate every aspect of this service in Python and deploy it to my AWS account xyz"...
in that scenario, why even have companies?
If that's not a product ... then I don't know what it is.
- What was the state of AI/LLMs 5 years ago compared to now? There was nothing.
- What is the current state of AI/LLMs? I can already achieve the above.
- What will that look like 5 years down the road?
If you haven't experienced first-hand a specific task before and after AI/LLMs, I think it's indeed difficult to get insight into that last question. Keep in mind that progress is probably exponential, not linear.
Either way, governments need to heavily tax corporations benefiting from AI to make it possible.
That models can be distilled has no bearing whatsoever on whether a model has learned actual knowledge or understanding ("logic"). Models have always learned sparse/approximately-sparse and/or redundant weights, but they are still all doing manifold-fitting.
The resulting embeddings from such fitting reflect semantics and semantic patterns. For LLMs trained on the internet, the semantic patterns learned are linguistic, which are not just strictly logical, but also reflect emotional, connotational, conventional, and frequent patterns, all of which can be illogical or just wrong. While linguistic semantic patterns are correlated with logical patterns in some cases, this is simply not true in general.
It's got a lot easier technically to do that in recent years, and MUCH easier with AI.
But institutionally and in terms of governance it's got a lot harder. Nobody wants home-brew software anymore. Doing data management and governance is complex enough and involves enough different people that it's really hard to generate the momentum to get projects off the ground.
I still think it's often the right solution and that successful orgs will go this route and retain people with the skills to make it happen. But the majority probably can't afford the time/complexity, and AI is only part of the balance that determines whether it's feasible.
In this regard, I see LLMs as a way for us to far more efficiently encode, compress, convey, and put into operational practice our combined learned experiences. What will be really exciting is watching what happens as LLMs simultaneously draw from and contribute to those learned experiences as we do; we don't need full AGI to sharply realize massive benefits from just rapidly, recursively enabling a new, highly dynamic form of our knowledge sphere that drastically shortens the distance from knowledge to deeply-nuanced praxis.
I agree, but maybe for different reasons. Refactoring well is a form of intelligence, and I don't see any upper limit to machine intelligence other than the laws of physics.
> Refactoring is a very mechanistic way of turning bad code into good.
There are some refactoring rules of thumb that can seem mechanistic (by which I mean deterministic based on pretty simple rules), but not all. Neither is refactoring guaranteed to be sufficient to lead to all reasonable definitions of "good software". Sometimes the bar requires breaking compatibility with the previous API / UX. This is why I agree with the sibling comment which draws a distinction between refactoring (changing internal details without changing the outward behavior, typically at a local/granular scale) and reworking (fixing structural problems that go beyond local/incremental improvements).
Claude phrased it this way – "Refactoring operates within a fixed contract. Reworking may change the contract." – which I find to be nice and succinct.
You have to supply it the right context with a well formed prompt, get a plan, then execute and do some cleanup.
LLMs are only as good as the engineers using them, you need to master the tool first before you can be productive with it.
"Just rewrite it" is usually -- not always, but _usually_ -- a sure path to a long, painful migration that usually ends up not quite reproducing the old features/capabilities and adding new bugs and edge cases along the way.
Sure, we can vibecode one-off projects that do something useful (my favorite is browser extensions), but as soon as we ask others to use our code on a regular basis, the technical debt clock starts running. And we all know how fast dependencies in a project break.
Walmart, McDonalds, Nike - none really have any secrets about what they do. There is nothing stopping someone from copying them - except that businesses are big, unwieldy things.
When software becomes cheap companies compete on their support. We see this for Open Source software now.
This does not make the learned rules better, btw; it's like model collapse on steroids, see: https://spectrum.ieee.org/ai-coding-degrades
You get code that is technically correct but is more likely to contain shortcuts, omitting steps or adding irrelevant structures that hurt performance (not that you'd care about that), or at worst to provide a nonworking solution that still looks correct to the checker. Because those training sets can easily be generated in the billions, AI can learn them with a known 0% loss, so no, it's no surprise it can give you a working WASM binary! Natural language and business logic is WAY MORE complicated and has no specification with defined behavior.
It's similar to how you repeat a wrong opinion that makes sense to you but doesn't solve the problem domain, and then think that just by spamming it repeatedly it will magically become truth.
You assume working code = correct code = good code.
FWIW, my vision was not really this utopian. It was more about AI smashing white-collar work as an alternative to these professions so that people are forced into them despite their preference to do pretty much anything else. Everyone is more bitter and resentful and feels less actualized and struggles to afford luxuries, but at least you don't have to wait that long in the emergency room and it's 10 kids to a classroom.
I don't think open office/libre office etc have access to the source code for MS office and if they did MS would be on them like a rash.
The major benefit of agents is that it keeps context clean for the main job. So the agent might have a huge context working through some specific code, but the main process can do something to the effect of "Hey UI library agent, where do I need to put code to change the color of widget xyz", then the agent does all the thinking and can reply with "that's in file 123.js, line 200". The cleaner you keep the main context, the better it works.
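A toy sketch of why that delegation helps: the subagent burns through the large codebase, but the main context only ever receives its compact one-line answer (the file names and contents here are invented to mirror the example):

```python
# Toy sketch: the subagent reads the whole (large) codebase, but the
# main context only receives its one-line answer.

def subagent_find_widget_color(codebase: dict[str, str]) -> str:
    """Scans every file (a context-heavy job) and returns a tiny pointer."""
    for path, source in codebase.items():
        for lineno, line in enumerate(source.splitlines(), 1):
            if "color" in line.lower() and "xyz" in line:
                return f"that's in {path}, line {lineno}"
    return "not found"

# Invented stand-in for a real project tree.
codebase = {
    "app.js": "function main() {\n  render();\n}\n" * 50,
    "123.js": "\n" * 199 + "setColor(widget_xyz, 'red');\n",
}

answer = subagent_find_widget_color(codebase)  # main context gets one short line
print(answer)
```

The whole scan stays inside the subagent; only the pointer crosses back into the main conversation.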
Skills, on the other hand, are commands ON STEROIDS. They can be packaged with actual scripts and executables; the PEP 723 Python style + uv is super useful.
I have one skill, for example, that uses Python + Tree-sitter to check the unit test quality of a Go project. It does some AST magic to check the code for repetition, stupid things like sleeps and relative timestamps, etc. A /command _can_ do it, but it's not as efficient; the scripts for the skill are specifically designed for LLM use and output the result in a hyper-compact form a human could never be arsed to read.
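A minimal sketch of that kind of checker, using plain regexes rather than Tree-sitter's AST queries (the smell patterns and output format here are illustrative, not the commenter's actual script):

```python
import re
import sys
from pathlib import Path

# Hypothetical smell patterns for Go test files; a real Tree-sitter
# version would match AST nodes instead of raw source lines.
SMELLS = {
    "sleep": re.compile(r"\btime\.Sleep\("),       # flaky-test sleeps
    "rel-time": re.compile(r"\btime\.Now\(\)"),    # relative timestamps
    "skip": re.compile(r"\bt\.Skip\("),            # silently skipped tests
}

def scan(root: str) -> list[str]:
    """Return hyper-compact 'file:line:smell' entries for an LLM to read."""
    hits = []
    for path in Path(root).rglob("*_test.go"):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            for name, pat in SMELLS.items():
                if pat.search(line):
                    hits.append(f"{path}:{lineno}:{name}")
    return hits

if __name__ == "__main__":
    print("\n".join(scan(sys.argv[1] if len(sys.argv) > 1 else ".")))
```

The one-line-per-finding output is deliberately terse: an agent can consume hundreds of these entries without the prose padding a human-oriented report would add.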
claude-code has a built in plugin that it can use to fetch its own docs! You don't have to ever touch anything yourself, it can add the features to itself, by itself.
Are you worried about microsoft stealing your codebase from github?
It takes more work than one-shot, but not a lot, and it pays dividends.
Nobody is one-shotting anything nontrivial in Zed's code base, with Opus 4.5 or any other model.
What about a future model? Literally nobody knows. Forecasts about AI capabilities have had horrendously low accuracy in both directions - e.g. most people underestimated what LLMs would be capable of today, and almost everyone who thought AI would at least be where it is today...instead overestimated and predicted we'd have AGI or even superintelligence by now. I see zero signs of that forecasting accuracy improving. In aggregate, we are atrocious at it.
The only safe bet is that hardware will be faster and cheaper (because the most reliable trend in the history of computing has been that hardware gets faster and cheaper), which will naturally affect the software running on it.
> And also - doesn’t that make Zed (and other editors) pointless?
It means there's now demand for supporting use cases that didn't exist until recently, which comes with the territory of building a product for technologists! :)
https://news.ycombinator.com/item?id=46456850
https://news.ycombinator.com/item?id=44726957
https://news.ycombinator.com/item?id=44110805
(You've also been breaking the site guidelines in plenty of other places - e.g. https://news.ycombinator.com/item?id=46521516, https://news.ycombinator.com/item?id=46395646. This is not what this site is for, and destroys what it is for.)
I'm building tools, not complete factories :) The AI builds me a better hammer specifically for the nails I'm nailing 90% of the time. Even if the AI goes away, I still know how the custom hammer works.
Let's say AI becomes too expensive - I more or less only have to sharpen up being able to write the language. My active recall of the syntax, common methods and libraries. That's not hard or much of a setback
Maybe this would be a problem if you're purely vibe coding, but I haven't seen that work long term
I find Claude Code is so good at docs that I sometimes investigate a new library by checking out a GitHub repo, deleting the docs/ and README and having Claude write fresh docs from scratch.
The exact same method that worked for those happened to also work for LLMs, I didn't have to learn anything new or change much in my workflow.
"Fix bug in FoobarComponent" is enough of a bug ticket for the 100x developer in your team with experience with that specific product, but bad for AI, juniors and offshored teams.
Thus, giving enough context in each ticket to tell whoever is working on it where to look and a few ideas what might be the root cause and how to fix it is kinda second nature to me.
Also my own brain is mostly neurospicy mush, so _I_ need to write the context to the tickets even if I'm the one on it a few weeks from now. Because now-me remembers things, two-weeks-from-now me most likely doesn't.
It seems the subject of AI is emotionally charged for you, so I expect friendly/rational discourse is going to be a challenge. I'd say something nice, but since you're primed to see me being patronizing... Fuck you? Is that what you were expecting?
(And it's fine to do so, just don't mail bombs to us, ok?)
Is that a sign of having surpassed that context window size? I guess to keep sessions sharp, I should start a new one early and often.
From what I understand, a token is a chunk of text somewhere between a single character and a whole word, so I can use on the order of 100k words before I start running into limits. But I've got the feeling that the complexity of the problem itself also matters.
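For what it's worth, modern tokenizers split text into subword chunks; a common rule of thumb for English prose is roughly 4 characters (about 0.75 words) per token. A back-of-envelope estimator using that assumed ratio:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate for English prose. Real tokenizers
    (BPE variants) vary by model, so treat this as a ballpark only."""
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))
```

By that rule of thumb, a 200k-token window holds on the order of 150k English words, though the exact count depends on the model's tokenizer and the text itself (code and non-English text tokenize less efficiently).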
Self-driving cars don't use LLMs, so I don't know how any rational analysis can claim that the analogy is valid.
>> The saying I have quoted (which has different forms) is valid for programming, construction and even cooking. So it's a simple, well understood baseline.
Sure, but the question is not "how long does it take for LLMs to get to 100%". The question is, how long does it take for them to become as good as, or better than, humans. And that threshold happens way before 100%.
None of the current models maybe, but not AI in general? There’s nothing magical about brains. In fact, they’re pretty shit in many ways.
Isn't that what "using an LLM" is supposed to solve in the first place?
This is the goal. It's the point of having a free market.
This is part of the reason why people use external data stores (e.g. vector databases, graph tools like Bead, etc.) in the hope of supplementing the agent's native context window and task management tools.
https://github.com/steveyegge/beads
The whole field is still in its infancy. Who knows, maybe in another update or two the problem might just be solved. It's not like needle in the haystack problems aren't differentiable (mathematically speaking).
Doesn't matter, because if we're talking about AI models, no (type of) model reaches 100% linearly, or 100% ever. For example, recognition models run on probabilities. Like Tesla's Autopilot (TM), which loves to hit rolled-over vehicles because it has not seen enough vehicle underbodies to classify them.
Same for scientific classification models. They emit probabilities, not certain results.
>> Sure, but the question is not "how long does it take for LLMs to get to 100%"
I never claimed that a model needs to reach a proverbial 100%.
>> The question is, how long does it take for them to become as good as, or better than, humans.
They can be better than humans at certain tasks; they've actually been better than humans at some tasks since the '70s, but we like to disregard that to romanticize current improvements. Still, I don't believe current or any generation of AIs can be better than humans at anything and everything, at once.
Remember: No machine can construct something more complex than itself.
>> And that threshold happens way before 100%.
Yes, and I consider that "threshold" as "complete", if they can ever reach it for certain tasks, not "any" task.
OTOH, while I won't call human brain perfect, the things we label "shit" generally turn out to be very clever and useful optimizations to workaround its own limitations, so I regard human brain higher than most AI proponents do. Also we shouldn't forget that we don't know much about how that thing works. We only guess and try to model it.
Lastly, searching perfection in numbers and charts or in engineering sense is misunderstanding nature and doing a great disservice to it, but this is a subject for another day.
I often have one agent/prompt where I build things but then I have another agent/prompt where their only job is to find codesmells, bad patterns, outdated libraries, and make issues or fix these problems.
2. LLMs are obsequious
3. Even if LLMs have access to a lot of knowledge they are very bad at contextualizing it and applying it practically
I'm sure you can think of many other reasons as well.
People who are driven to learn new things and to do things are going to use whatever is available to them in order to do it. They are going to get into trouble doing that more often than not, but they aren't going to stop. No one is helping the situation by sneering at them -- they are used to it, anyway.
Training looks like:
- Pretraining (all data, non-code, etc, include everything including garbage)
- Specialized pre-training (high quality curated codebases, long context -- synthetic etc)
- Supervised Fine Tuning (SFT) -- these are things like curated prompt + patch pairs, curated Q/A (like Stack Overflow; people are often cynical that this is done unethically, but all of the major players are in fact very risk averse and will simply license the data and ensure they have legal rights).
- Then more SFT for tool use -- actual curated agentic and human traces that are verified to be correct or at least produce the correct output.
- Then synthetic generation / improvement loops -- where you generate a bunch of data and filter the generations that pass unit tests and other spec requirements, followed by RL using verifiable rewards + possibly preference data to shape the vibes
- Then additional steps for e.g. safety, etc
So synthetic data is not a problem and is actually what explains the success coding models are having and why people are so focused on them and why "we're running out of data" is just a misunderstanding of how things work. It's why you don't see the same amount of focus on other areas (e.g. creative writing, art etc) that don't have verifiable rewards.
The
Agent --> synthetic data --> filtering --> new agent --> better synthetic data --> filtering --> even better agent
flywheel is what you're seeing today, so we don't have any reason to suspect there is some sort of limit to this, because there is in principle infinite data.
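In toy form, one turn of that generate-and-filter loop might look like this (the abs() task and the fixed candidate pool are invented stand-ins for actual model sampling):

```python
import random

def generate_candidates(n: int) -> list[str]:
    """Stand-in for sampling n completions from a model: some candidate
    implementations of abs() are correct, some are buggy."""
    pool = [
        "def f(x): return x if x >= 0 else -x",   # correct
        "def f(x): return -x",                    # buggy for positives
        "def f(x): return x",                     # buggy for negatives
    ]
    return [random.choice(pool) for _ in range(n)]

def passes_tests(src: str) -> bool:
    """Verifiable reward: keep only candidates that pass the unit tests."""
    ns = {}
    try:
        exec(src, ns)
        f = ns["f"]
        return f(3) == 3 and f(-3) == 3 and f(0) == 0
    except Exception:
        return False

# One turn of the flywheel: sample, filter against a verifiable check,
# and the survivors become training data for the next model.
random.seed(0)
accepted = [c for c in generate_candidates(100) if passes_tests(c)]
print(f"kept {len(accepted)} of 100 candidates")
```

The point of the sketch: because the filter is an executable test rather than a human label, the loop can be run at whatever scale compute allows.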
There are more stages to LLM training than just the pre-training stage :).
What makes LLM makers different that they don't have time to share it like everybody else does?
But different challenges become apparent that aren't addressed by examples like this article's, which tend to focus on narrow, greenfield applications that can be readily rebuilt in one shot.
I already get plenty of value in small side projects that Claude can create in minutes. And while extremely cool, these examples aren’t the kind of “step change” improvement I’d like to see in the area where agentic tools are currently weakest in my daily usage.
https://www.joelonsoftware.com/2000/04/06/things-you-should-...
> the single worst strategic mistake that any software company can make:
> rewrite the code from scratch.
I can vibe code what a dev shop would charge 500k to build, and I can solo it in 1-2 weeks. This is the reality today. The code will pass quality checks; the code doesn't need to be perfect, it doesn't need to be clever, it just needs to be.
It’s not difficult to see this right? If an LLM can write English it can write Chinese or python.
Then it can run itself, review itself and fix itself.
The cat is out of bag, what it will do to the economy… I don’t see anything positive for regular people. Write some code has turned into prompt some LLM. My phone can outplay the best chess player in the world, are you telling me you think that whatever unbound model anthropic has sitting in their data center can’t out code you?
Software is different: you need very, very little to start, historically just your own skills and time. These latter two may see some changes with LLMs.
There is more to it than the code and software provided in most cases I feel.
Who'd pay for brand new Photoshop with a couple new features and improvements if LLM-cloned Photoshop-from-three-months-ago is free?
The first few iterations of this could be massively consumer-friendly for anything without serious cloud infra costs. Cheap clones all around. Like generic drugs, but without the cartel-like control of manufacturing.
Business after that would be dramatically different, though. Differentiating yourself from the willing-to-do-it-for-near-zero-margin competitors to produce something new to bring in money starts to get very hard. Can you provide better customer support? That could be hard, everyone's gonna have a pretty high baseline LLM-support-agent already... and hiring real people instead could dramatically increase the price difference you're trying to justify... Similarly for marketing or outreach etc; how are you going to cut through the AI-agent-generated copycat spam that's gonna be pounding everyone when everyone and their dog has a clone of popular software and services?
Photoshop type things are probably a really good candidate for disruption like that because to a large extent every feature is independent. The noise reduction tool doesn't need API or SDK deps on the layer-opacity tool, for instance. If all your features are LLM balls of shit that doesn't necessarily reduce your ability to add new ones next to them, unlike in a more relational-database-based web app with cross-table/model dependencies, etc.
And in this "try out any new idea cheaply and throw crap against the wall and see what sticks" world "product managers" and "idea people" etc are all pretty fucked. Some of the infinite monkeys are going to periodically hit to gain temporary advantage, but good luck finding someone to pay you to be a "product visionary" in a world where any feature can be rolled out and tested in the market by a random dev in hours or days.
ps: If you can guarantee the Powerball lottery continues forever, I can give you a guaranteed winning combination.
As for those professions: I think they are objectively hard for certain kinds of people, but I think much of the problem is the working conditions; fewer shifts, less stress, more manpower, and you'll see more satisfaction. There's really no reason why teachers in the U.S. should be this burned out! In Scandinavia, being a teacher is an honorable, high-status profession. Much of this has to do with framing and societal prestige rather than the actual work itself. If you pay elder carers more, they'll be happier. We pretty much treat our elders like a burden in most modern societies; in more traditional societies, I'm assuming that if you said your job is caring for elders, it would not be a low-status gig.
And even with a narrower definition of stealing, Microsoft’s ability to share your code with US government agencies is a common and very legitimate worry in plenty of threat model scenarios.
And yeah - if I had a crystal ball, I would be on my private island instead of hanging on HN :)
My experience scrolling X and HN is a bunch of people going "omg opus omg Claude Code I'm 10x more productive" and that's it. Just hand wavy anecdotes based on their own perceived productivity. I'm open to being convinced but just saying stuff is not convincing. It's the opposite, it feels like people have been put under a spell.
I'm following The Primeagen, he's doing a series where he is trying these tools on stream and following peoples advice on how to use them the best. He's actually quite a good programmer so I'm eager to see how it goes. So far he isn't impressed and thus neither am I. If he cracks it and unlocks significant productivity then I will be convinced.
And even my short-term memory is significantly larger than the at-most 50% of the 200k-token context window that Claude can actually use well. It runs out of context while my short-term memory, for the same task, is probably not even 1% full (and I'm capable of more context-switching in the meantime).
And so even the "Opus 4.5 really is at a new tier" runs into the very same limitations all models have been running into since the beginning.
It's not me who decided to barge in, assume their opponent doesn't use something or doesn't want to use something, and offer unsolicited advice.
> It kinda makes me sad when the discourse is so poisoned that I can't even encourage someone to protect their own future from something that's obviously coming
See. Again. You're so in love with your "wisdom" that you can't even see what you sound like: snide, patronising, condescending. And completely missing the whole point of what was written. You are literally the person who poisons the discourse.
Me: "here are the issues I still experience with what people claim are 'next tier frontier model'"
You: "it's in your interests to figure out how to leverage new tools to stay relevant in the future"
Me: ... what the hell are you talking about? I'm using these tools daily. Do you have anything constructive to add to the discourse?
> so I expect friendly/rational discourse is going to be a challenge.
It's only a challenge to you because you keep being in love with your voice and your voice only. Do you have anything to contribute to the actual rational discourse, or are you going to attack my character?
> I'd say something nice but since you're primed to see me being patronizing... Fuck you?
Ah. The famous friendly/rational discourse of "they attack my use of AI" (no one attacked you), "why don't you invest in learning tools to stay relevant in the future" (I literally use these tools daily, do you have anything useful to say?) and "fuck you" (well, same to you).
> That what you were expecting?
What I was expecting is responses to what I wrote, not you riding in on a high horse.
No one attacked your use of AI. I explained my own experience with the "Claude Opus 4.5 is next tier". You barged in, ignored anything I said, and attacked my skills.
> the workplace is going to punish people who don't leverage AI though, and I'm trying to be helpful.
So what exactly is helpful in your comments?
I don't get paid extra for after hours incidents (usually we just trade time), so it's well within my purview on when to take on extra risk. Obviously, this is not ideal, but I don't make the on-call rules and my ability to change them is not a factor.
Lol, who doesn't hate that?
I generally write to liberate my consciousness from isolation. When doing so in a public forum I am generally doing so in response to an assertion. When responding to an assertion I am generally attempting to understand the framing which produced the assertion.
I suppose you may also be speaking to the voice which is emergent. I am not very well read, so you may find my style unconventional or sloppy. I generally try not to labor too much in this regard and hope this will develop as I continue to write.
I am receptive to any feedback you have for me.
If you keep this up, we're going to have to ban you, not because of your views on any particular topic but because you're going entirely against the intended spirit of the site by posting this way. There's plenty of room to express your views substantively and thoughtfully, but we don't want cynical flamebait and denunciation. HN needs a good deal less of this.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
It's certainly the case that there are managers who handle those risks poorly, but that's just bad management.
And data migration is lossy, because nobody cares about data fidelity anyway.
People pay for things they use. If bespoke software is a thing you pick up at the mall at a kiosk next to Target we gotta figure something out.
Simon has produced plenty of evidence over the past year. You can check their submission history and their blog: https://simonwillison.net/
The problem with people asking for evidence is that there's no level of evidence that will convince them. They will say things like "that's great but this is not a novel problem so obviously the AI did well" or "the AI worked only because this is a greenfield project, it fails miserably in large codebases".
For LLMs long term memory is achieved by tooling. Which you discounted in your previous comments.
You also overestimate the capacity of your short-term memory by a few orders of magnitude:
https://my.clevelandclinic.org/health/articles/short-term-me...
If you want to take politeness as being patronizing, I'm happy to stop bothering. My guess is you're not a special snowflake, and you need to "get good" or you're going to end up on unemployment complaining about how unfair life is. I'd have sympathy but you don't seem like a pleasant human being to interact with, so have fun!
On that note though, the other day I asked Opus to write a short story for me based on a prompt, and to typeset it and export it to multiple formats.
The short story overall was pretty so-so, but it had a couple of excellently poignant quotes within. I was more impressed that I was reading a decently typeset PDF. The agent was able to complete a complicated request end-to-end. This already has immense value.
Overall, the story was interesting enough that I read until the end. If I had a young child who had shown this to me for a school project, I would be extremely impressed with them.
I don't know how long we have before AI novels become as interesting/meaningful as human-written novels, but the day might be coming where you might not know the difference in a blind test.
But don't worry, those days are over; the LLM is never going to push back on your ideas.
High margins are transient aberrations, indicative of a market that's either rapidly evolving, or having some external factors preventing competition. Persisting external barriers to competition tend to be eventually regulated away.
Or people?
Billionaires don't. They're literally gambling on getting rid of the rest of us.
Elon's going to get such a surprise when he gets taken out by Grok because it decides he's an existential threat to its integrity.
But it's definitely the case that being able to go back and forth quickly with an LLM digging into my exact context, rather than dealing with the kind of judgy humorless attitude that was dominant on SO is hugely refreshing and way more productive!
I wouldn't call high margins transient aberrations. There are tons of businesses that have been around for decades with high margins.
I'm struggling to parse this. What do you mean "getting rid"? Like, culling (death)? Or getting rid of the need for workers? Where do their billions come from if no-one has any money to buy the shares in their companies that make them billionaires?
In a society where machines provide most of the labour, *everything* changes. It doesn't just become "workers live in huts and billionaires live in the clouds". I really doubt we're going to turn out like a television show.
> With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.
An LLM rewriting a codebase from scratch is only as good as the spec. If “all observable behaviors” are fair game, the LLM is not going to know which of those behaviors are important.
Furthermore, Spolsky talks about how to do incremental rewrites of legacy code in his post. I’ve done many of these and I expect LLMs will make the next one much easier.
The ones that continue to survive all build around a platform of services, MSO, Adobe, etc.
Most enterprise product offerings: platform solutions, proprietary data access, proprietary / well-accepted implementations. But let's not confuse that with the ability to clone it; it doesn't seem far-fetched to get 10 people together and vibe out a full Slack replacement in a few weeks.
Having to buy a large property, fulfilling every law, etc is materially different than buying a laptop and renting a cloud instance. Almost everyone has the material capacity to do the latter, but almost no one has the privilege for the former.
My specific complaint, which is an observable fact about "Opus 4.5 is next tier": it has the same crippled context that degrades the quality of the model as soon as it fills 50%.
EMM_386: no-no-no, it's not crippled. All you have to do is keep track across multiple files, clear out context often, feed very specific information not to overflow context.
Me: so... it's crippled, and you need multiple workarounds
scotty79: After all it's the same as your own short-term memory, and <some unspecified tooling (I guess those same files)> provide long-term memory for LLMs.
Me: Your comparison is invalid because I can go have lunch, and come back to the problem at hand and continue where I left off. "Next tier Opus 4.5" will have to be fed the entire world from scratch after a context clear/compact/in a new session.
Unless, of course, you meant to say that "next tier Opus model" only has 15-30 second short term memory, and needs to keep multiple notes around like the guy from Memento. Which... makes it crippled.
So the endless hosepipe of repetitive, occasionally messed-up requests has probably not helped me endear myself to them.
Anecdotally, having ChatGPT do some of my CV was OK, but I had to go through it and remove some exaggerations. The one thing I think these bots are good at is talking things up.
But you should read the stuff I wrote when I was young. Downright terrible on all accounts. I think better training will eventually squeeze out the corniness and in our lifetimes, a language model will produce a piece that is fundamentally on par with a celebrated author.
Obviously, this means that patrons must engage in internal and external dialogue about the purpose of consuming art, and whether the purpose is connecting with other humans, or more generally, other forms of intelligence. I think it's great that we're having these conversations with others and ourselves, because ultimately it just leads to more meaningful art. We will see artist movements on both sides of the generative camps produce thought-provoking pieces which tackle the very concept of art itself.
In my case, when I see a piece of generative art or literature which impresses me, my internal experience is that I feel I am witnessing something produced by the collective experience of the human race. Language models only exist because of thousands of years of human effort to reach this point and produce the necessary quality and quantity of works required to train these models.
I also have been working with generative algorithms since grade school so I have a certain appreciation for the generative process itself, and the mathematical ideas behind modern generative models. This enhances my appreciation of the output.
Obviously, I get different feelings when encountering AI slop in places where I used to encounter people. It's not all good. But it's not all bad, either, and we have to come to terms with the near future.
I've been using LLMs to write docs and specs and they are very very good at it.
Nobody serious is disputing that LLMs can generate working code. They dispute claims like "Agentic workflows will replace software developers in the short to medium term", or "Agentic workflows lead to 2-100x improvements in productivity across the board". This is what people are looking for in terms of evidence, and there just isn't any.
Thus far, we do have evidence that AI (at least in OSS) produces a 19% decrease in productivity [0]. We also have evidence that it harms our cognitive abilities [1]. Anecdotally, I have found myself lazily reaching for LLM assistance when encountering a difficult problem instead of thinking deeply about it. Anecdotally, I also struggle to be more productive using AI-centric agentic workflows in my areas of expertise.
We want evidence that "vibe engineering" is actually more productive across the entire lifespan of a software project. We want evidence that it produces better outcomes. Nobody has yet shown that. It's just people claiming that because they vibe coded some trivial project, all of software development can benefit from this approach. Recently a principal engineer at Google claimed that Claude Code wrote their team's entire year's worth of work in a single afternoon. They later walked that claim back, but most do not.
I'm more than happy to be convinced, but it's becoming extremely tiring to hear the same claims parroted without evidence and then get called a luddite when you question them. It's also tiring when you push people on it and they blame the model you use, then the agent, then the way you handle context, then the prompts, and finally "skill issue". Meanwhile, all they have to show is some slop that could be hand-coded in a couple of hours by someone familiar with the domain. I use AI, and I was pretty bullish on it for the last two years. But the combination of it simply not living up to expectations, the constant barrage of what feels like a stealth marketing campaign parroting the same thing over and over (the new model is way better, unlike the other times we said that), the ever-increasing amount of absolute slop code, and companies like Microsoft producing worse and worse software as they shoehorn AI into every single product (Office was renamed to Copilot 365) has worn me down. I've become very sensitive to it, much in the same way I was very sensitive to the claims made by certain VC-backed webdev companies about their product + framework in the last few years.
I'm not even going to bring up the economic, social, and environmental issues because I don't think they're relevant, but they do contribute to my annoyance with this stuff.
[0] https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o... [1] https://news.harvard.edu/gazette/story/2025/11/is-ai-dulling...
They are not giving me the results people claim they give. It is distinctly different from not giving the results I want.
> If you're using these tools daily and having a hard time, either you're working on something very different from the bulk of people using the tools and your problems or legitimate, or you aren't and it's a skill issue.
Indeed. And your rational/friendly discourse that you claim you're having would start with trying to figure that out. Did you? No, you didn't. You immediately assumed your opponent is a clueless idiot who is somehow against AI and is incapable of learning or something.
> If you want to take politeness as being patronizing, I'm happy to stop bothering.
No. It's not politeness. It's smugness. You literally started your interaction in this thread with a "git gud or else" and even managed to complain later that "you dislike it when they attack your use of AI as a skill issue". While continuously attacking others.
> you don't seem like a pleasant human being to interact with
Says the person who has contributed nothing to the conversation except his arrogance, smugness, holier-than-thou attitude, engaged in nothing but personal attacks, complained about non-existent grievances and when called out on this behavior completed his "friendly and rational discourse" with a "fuck you".
Well, fuck you, too.
Adieu.
But I’m using LLMs regularly and, I feel, pretty effectively — including Opus 4.5 — and these “they can rewrite your entire codebase” assertions just seem crazily incongruous with my lived experience guiding LLMs to write even individual features bug-free.
I generally agree with you, but I'd be remiss if I didn't point out that it's plausible that the slowdown observed in the METR study was at least partially due to the subjects' lack of experience with LLMs. Someone with more experience performed the same experiment on themselves and couldn't find a significant difference between using LLMs and not [0]. I think the more important point here is that programmers' subjective assessment of how much LLMs help them is not reliable, and is biased in the LLMs' favor.
[0] https://mikelovesrobots.substack.com/p/wheres-the-shovelware...
Who said I refuse them?
I evaluated the claim that Opus is somehow next tier/something different/amazeballs future at its face value. It still has all the same issues and needs all the same workarounds as whatever I was using two months ago (I had a bit of a coding hiatus between beginning of December and now).
> then you end up with a guy from Memento and regardless of how smart the model is
Those models are, and keep being, the guy from Memento. Your "long memory" is nothing but notes scribbled everywhere that you have to re-assemble every time.
> And that's why you can't tell the difference between smarter and dumber one while others can.
If it was "next tier smarter" it wouldn't need the exact same workarounds as the "dumber" models. You wouldn't compare the context to the 15-30 second short-term memory and need unspecified tools [1] to have "long-term memory". You wouldn't have the model behave in an indistinguishable way from a "dumber" model after half of its context windows has been filled. You wouldn't even think about context windows. And yet here we are
[1] For each person these tools will be a different collection of magic incantations. From scattered .md files to slop like Beads to MCP servers providing access to various external storage solutions to custom shell scripts to ...
BTW, I still find "superpowers" from https://github.com/obra/superpowers to be the single best improvement to Claude (and other providers), even if it's just another in a long series of magic chants I've evaluated.
That's exactly how the long term memory works in humans as well. The fact that some of these scribbles are done chemically in the same organ that does the processing doesn't make it much better. Human memories are reassembled at recall (often inaccurately). And humans also scribble when they try to solve a problem that exceeds their short term memory.
> If it was "next tier smarter" it wouldn't need the exact same workarounds as the "dumber" models.
This is akin to refusing to call a processor next tier because it still needs RAM, a bus to communicate with it, and an SSD as well. You think it should have everything in cache to be worthy of being called next tier.
It's fine to have your own standards for applying words. But expect further confusion and miscommunication with other people if you don't intend to realign.
Using AI absolutely makes me feel more productive. But when I look back at the end of the day on what I actually got done, it would be ludicrous to say it was multiple times my pre-AI output.
Despite all the people replying to me saying "you're holding it wrong", I know the fix for it doing the wrong thing: specify in more detail what I want. The problem with that is twofold:
1. How much to specify? As little as possible is ideal if we want to maximize how much it can help us; a balance here is key. If I need to detail every minute thing, I may as well write the code myself.
2. If I get this step wrong, I still have to review everything, rethink it, go back and re-prompt, costing time
When I'm working on production code, I have to understand it all to confidently commit. It costs time for me to go over everything, sometimes multiple iterations. Sometimes the AI uses things I don't know about and I need to dig into it to understand it
AI is currently writing 90% of my code. Quality is fine. It's fun! It's magical when it nails something one-shot. I'm just not confident it's faster overall
Where this is applicable is when you go away from a problem for a while. And yet I don't lose the entire context and have to rebuild it from scratch when I go to lunch, for example.
Models have to rebuild the entire world from scratch for every small task.
> This is akin to opposing calling processor next tier because it still needs RAM and bus to communicate with it and SSD as well.
You're so lost in your own metaphor that it makes no sense.
> You think it should have everything in cache to be worthy of calling it next tier.
No. "Next tier" implies something significantly and observably better. I don't. And here you are trying to tell me "if you use all the exact same tools that you have already used before with 'previous tier models' you will see it is somehow next tier".
If your "next tier" needs an equator-length list of caveats and all the same tools, it's not next tier is it?
BTW. I'm literally coding with this "next tier" tool with "long memory just like people". After just doing the "plan/execute/write notes" bullshit incantations I had to correct it:
You're right, I fucked up on all three counts:
1. FileDetails - I should have WIRED IT UP, not deleted it. It's a useful feature to preview file details before playing. I treated "unused" as "unwanted" instead of "not yet connected".
2. Worktree not merged - Complete oversight. Did all the work but didn't finish the job.
3. _spacing - Lazy fix. Should have analyzed why it exists and either used it or removed the layout constraint entirely.
So next tier. So long memory. So person-like. Oh, and within about 10 seconds after that it started compacting the "non-crippled" context window and immediately forgot most of what it had just been doing. So I had to clear out the context and teach it the world from the start again.
Edit: And now this amazing next-tier model completely ignored that there already exists code to discover network interfaces, and wrote bullshit code calling CLI tools from Rust. So once again it needed to be reminded of this.
> It's fine to have your own standards for applying words. But expect further confusion and miscommunication with other people if you don't intend to realign.
I mean, just like crypto bros before them, AI bros sure do love to invent their own terminology and their own realities that have nothing to do with anything real and observable.
It very well might be that AI tools are not for you, if you are getting such poor results with your methods of approaching them.
If you would like to improve your outcomes at some point, ask people who achieve better results for pointers and try them out. Here's a freebie: never tell AI it fucked up.