Plus there are use-cases for LLMs that go beyond augmenting your ability to produce code, especially for learning new technologies. The yield depends on the distribution of tasks in your role. For example, if you are in lots of meetings, or have lots of administrative overhead to push code, LLMs will help less. (Although I think applying LLMs to pull-request workflows, commit cleanup, and reordering will come soon.)
1. googling stuff about how APIs work
2. writing boilerplate
3. typing syntax correctly
These three things combined make up a huge amount of programming. But when real cognition is required I find I'm still thinking just as hard in basically the same ways I've always thought about programming: identifying appropriate abstractions, minimizing dependencies between things, pulling pieces together towards a long term goal. As far as I can tell, AI still isn't really capable of helping much with this. It can even get in the way, because writing a lot of code before key abstractions are clearly understood can be counterproductive and AI tends to have a monolithic rather than decoupled understanding of how to program. But if you use it right it can make certain tasks less boring and maybe a little faster.
A lot of senior engineers in the big tech companies spend most of their time in meetings. They're still brilliant. For instance, they read papers and map out the core ideas, but they haven't been in the weeds for a long time. They don't necessarily know all the day-to-day stuff anymore.
Things like: which config service is standard now? What's the right Terraform template to use? How do I write that gnarly PromQL query? How do I spin up a new service that talks to 20 different systems? Or in general, how do I map my idea to deployable and testable code in the company's environment?
They used to have to grab a junior engineer to handle all that boilerplate and operational work. Now, they can just use an AI to bridge that gap and build it themselves.
In some cases, LLMs can be a real speed boost. Most of the time, that has to do with writing boilerplate and prototyping a new "thing" I want to try out.
Inevitably, if I like the prototype, I end up re-writing large swaths of it to make it even halfway productizable. Fundamentally, LLMs are bad at keeping an end goal in mind while working on a specific feature, and they're terrible at holding enough context to avoid code duplication and spaghetti.
I'd like to see them get better and better, but they really are limited to whatever code they can ingest on the internet. A LOT of important code is just not open for consumption in sufficient quantities for them to learn from. For this reason, I suspect LLMs will never be all that good for non-web-based engineering. Where's all the training data gonna come from?
* if my Github actions ran 10x faster, so I don't start reading about "ai" on hackernews while waiting to test my deployment and then fail to notice the workflow finished an hour ago
* if the Google cloud console deployment page had 1 instead of 10 vertical scroll bars and wasn't so slow and janky in Firefox
* if people started answering my peculiar but well-researched stackoverflow questions instead of nitpicking and discussing whether they belong on superuser vs unix vs ubuntu vs hermeneutics vs serverfault
* if MS Teams died
anyway, nice to see others having the same feeling about LLMs
Now for senior developers, AI has been tremendous. Example: I'm building a project where I hit the backend in liveview, and internally I have to make N requests to different APIs in parallel and present the results back. My initial version to test the idea had no loading state, waiting for all requests to finish before sending back.
I knew that I could use Phoenix Channels, and Elixir Tasks, and websockets to push the results as they came in. But I didn't want to write all that code. I could already taste it and explain it. Why couldn't I just snap my fingers?
Well AI did just that. I wrote what I wanted in depth, and bada bing, the solution I would have written is there.
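For anyone unfamiliar with the pattern, here's a rough Python/asyncio analog of what's described (the actual solution used Elixir Tasks plus Phoenix Channels; the names below are invented): fire the N requests in parallel and push each result back as it completes instead of waiting for all of them.

    import asyncio

    async def fetch(api_name: str) -> str:
        # Stand-in for a real HTTP call to one of the N upstream APIs.
        await asyncio.sleep(0.1)
        return f"result from {api_name}"

    async def fan_out(api_names, push):
        # as_completed yields results in completion order, so the UI can show
        # partial results (a loading state) instead of blocking on everything.
        tasks = [asyncio.create_task(fetch(name)) for name in api_names]
        for finished in asyncio.as_completed(tasks):
            push(await finished)

    asyncio.run(fan_out(["billing", "search", "profile"], print))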
Vibe coders are not gonna make it.
Engineers are having the time of their lives. It's freeing!
Basically, it's the ability to order my thoughts into a task list long and clear enough for the LLM to follow that I can have 3 or so of these running in parallel, and maybe handle email too. Any individual run may be faster or slower than I can do it manually, but critically, they take less total human time / attention. No individual technique is fundamentally tricky here, but it is still a real skill.
If you read the article, the author is simply not there, and sees what they know as only 1 week's worth of knowledge. So for their learning rate... maybe they need 3x longer of learning and experience?
I'm a pretty huge proponent for AI-assisted development, but I've never found those 10x claims convincing. I've estimated that LLMs make me 2-5x more productive on the parts of my job which involve typing code into a computer, which is itself a small portion of what I do as a software engineer.
That's not too far from this article's assumptions. From the article:
> I wouldn't be surprised to learn AI helps many engineers do certain tasks 20-50% faster, but the nature of software bottlenecks mean this doesn't translate to a 20% productivity increase and certainly not a 10x increase.
I think that's an under-estimation - I suspect engineers who really know how to use this stuff effectively will get more than a 20% increase - but I do think all of the other stuff involved in building software makes the 10x thing unrealistic in most cases.
Now that LLMs have actually fulfilled that dream — albeit by totally different means — many devs feel anxious, even threatened. Why? Because LLMs don’t just autocomplete. They generate. And in doing so, they challenge our identity, not just our workflows.
I think Colton’s article nails the emotional side of this: imposter syndrome isn’t about the actual 10x productivity (which mostly isn't real), it’s about the perception that you’re falling behind. Meanwhile, this perception is fueled by a shift in what “software engineering” looks like.
LLMs are effectively the ultimate CASE tools — but they arrived faster, messier, and more disruptively than expected. They don’t require formal models or diagrams. They leap straight from natural language to executable code. That’s exciting and unnerving. It collapses the old rites of passage. It gives power to people who don’t speak the “sacred language” of software. And it forces a lot of engineers to ask: What am I actually doing now?
One thing that AI has helped me with is finding pesky bugs. I mainly work on numerical simulations. At one point I was stuck for almost a week trying to figure out why my simulation was acting so strange. Finally I pulled up chatgpt, put some of my files into the context and wrote a prompt explaining the strange behavior and what I thought might be happening. In a few seconds it figured out that I had improperly scaled one of my equations. It came down to a couple missing parentheses, and once I fixed it the simulation ran perfectly.
This has happened a few times where AI was easily able to see something I was overlooking. Am I a 10x developer now that I use AI? No... but when used well, AI can have a hugely positive impact on what I am able to get done.
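To make that concrete, here's a hypothetical illustration (not the poster's actual equation) of how a couple of missing parentheses can silently change a term's scaling:

    x, a, b = 10.0, 2.0, 5.0

    intended = x / (a * b)   # scale x down by the product a*b -> 1.0
    buggy    = x / a * b     # parentheses dropped: (x / a) * b -> 25.0

    print(intended, buggy)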
What I've seen with AI is that it does not save my coworkers from the pain of overcomplicating simple things that they don't really think through clearly. AI does not seem to solve this.
Most of the AI productivity stories I hear sound like they're optimizing for the wrong metric. Writing code faster doesn't necessarily mean shipping better products faster. In my experience, the bottleneck is rarely "how quickly can we type characters into an editor" - it's usually clarity around requirements, decision-making overhead, or technical debt from the last time someone optimized for speed over maintainability.
The author mentions that real 10x engineers prevent unnecessary work rather than just code faster. That rings true to me. I've seen more productivity gains from saying "no" to features or talking teams out of premature microservices(or adopting Kafka :D) than from any coding tool.
What worries me more is the team dynamic this creates. When half your engineers feel like they're supposed to be 10x more productive and aren't, that's a morale problem that compounds. The engineers who are getting solid 20-30% gains from AI (which seems realistic) start questioning if they're doing it wrong.
Has anyone actually measured this stuff properly in a production environment with consistent teams over 6+ months? Most of the data I see is either anecdotal or from artificial coding challenges.
That aside: I still think complaining about "hallucination" is a pretty big "tell".
So it's not like I'm delivering features in one day that would have taken two weeks. But I am delivering features in two weeks that have a bunch of extra niceties attached to them. Reality being what it is, we often release things before they are perfect. Now things are a bit closer to perfect when they are released.
I hope some of that extra work that's done reduces future bug-finding sessions.
Now let's say you use Claude code, or whatever, and you're able to create the same web app over a weekend. You spend 6 hours a day on Saturday and Sunday, in total 12 hours.
That's 10x increase in productivity right there. Did it make you a 10x better programmer? Nope, probably not. But your productivity went up by a tenfold.
And at least to me, that's sort of how it has worked. Things I didn't have motivation or energy to get into before, I can get into over a weekend.
Overall it feels negligible to me in its current state.
It's not a ground-breaking app, it's CRUD and background jobs and CSV/XLSX exports and reporting, but I found that I was able to "wireframe" with real code and thus come up with unanswered questions, new requirements, etc. extremely early in the project.
Does that make me a 10x engineer? Idk. If I wasn't confident working with CC, I would have pushed back on the project in the first place unless management was willing to devote significant resources to this. I.e. "is this really a P1 project or just a nice to have?" If these tools didn't exist I would have written specs and Excalidraw or Sketch/Figma wireframes that would have taken me at least the same amount of time or more, but there'd be less functional code for my team to use as a resource.
This is not to disagree with the OP, but to point out that, even for engineers, the speedups might not appear where you expect. [EDIT I see like 4 other comments making the same point :)]
What makes an excellent engineer is risk mitigation and designing systems under a variety of possible constraints. This design is performed using models of the domains involved and understanding when and where these models hold and where they break down. There's no "10x". There is just being accountable for designing excellent systems to perform as desired.
If there were a "10x" software engineer, such an engineer would prevent data breaches from occurring, which is a common failure mode in software to the detriment of society. I want to see 10x less of that.
1. Tech companies should be able to accelerate and supplant the FAANGs of this world. Even if 10x were discounted to 5x, it would mean that 10 human-years of work could be shrunk down to 2 to make multi-billion-dollar companies. This is not happening right now. If it does not start happening with the current series of models, Murphy's law (e.g. an interest rate spike at some point) or just brutal "show me the money" questions will tell people whether it is "working".
2. I think Anthropic's honcho did a back-of-the-envelope number of $600 for every human in the US (I think it was just the US) being necessary to justify Nvidia's market cap. This should play out by the end of this year or in the Q3 report.
So true, a lot of value and gains are had when tech leads can effectively negotiate and creatively offer less costly solutions to all aspects of a feature.
I believe his original thesis remains true: "There is no single development, in either technology or management technique, which by itself promises even one order-of-magnitude improvement within a decade in productivity, in reliability, in simplicity."
Over the years this has been misrepresented or misinterpreted to suggest it's false, but it sure feels like "Agentic Coding" is a single development promising a massive multiplier in improvement that is, once again, another accidental tool that can be helpful but is definitely not a silver bullet.
https://arxiv.org/abs/2507.09089
Obviously it depends on what you are using the AI to do, and how good a job you do of creating/providing all the context to give it the best chance of being successful in what you are asking.
Maybe a bit like someone using a leaf blower to blow a couple of leaves back and forth across the driveway for 30 sec rather than just bending down to pick them up.... It seems people find LLMs interesting, and want to report success in using them, so they'll spend a ton of time trying over and over to tweak the context and fix up what the AI generated, then report how great it was, even though it'd have been quicker to do it themselves.
I think agentic AI may also lead to this illusion of, or reported, AI productivity ... you task an agent to do something and it goes off and 30 min later creates what you could have done in 20 min while you are chilling and talking to your workmates about how amazing this new AI is ...
Consider a fully loaded cost of $200k for an engineer, or $16,666 per month. They only have to be a >1.012x engineer for the "AI" to be worth it. Of course that $200 per month is probably VC-subsidized right now, but there is lots of money on the table for a <2x improvement.
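The break-even math, as a quick sketch using the numbers from the comment above:

    fully_loaded_yearly = 200_000
    monthly_cost = fully_loaded_yearly / 12        # ~ $16,666 per month
    ai_subscription = 200                          # $ per month

    break_even_multiplier = 1 + ai_subscription / monthly_cost
    print(round(break_even_multiplier, 3))         # ~ 1.012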
For Terraform, specifically, Claude 4 can get thrown into infinite recursive loops trying to solve certain issues within the bounds of the language. Claude still tries to add completely invalid procedures into things like templates.
It does seem to work a bit better for standard application programming tasks.
Even when you do write code, you often only care about specific aspects—you just want to automate the rest.
This is hard to reconcile with modern business models. If you tell someone that a software engineer can also design, they’ll just fire the designer and pile more work on the engineer. But it doesn’t change the underlying truth: a single engineer who can touch many parts of the software with low cognitive friction is simply a better kind of engineer.
https://www.businessinsider.com/ai-coding-tools-may-decrease...
Linear was a very early-stage product I tested a few months after their launch where I was genuinely blown away by the polish and experience relative to their team size. That was in 2020, pre-LLMs.
I have yet to see an equally polished and impressive early-stage product in the past few years, despite claims of 10x productivity.
It's like discussing in a gaming guild how to reach the next level. It isn't real.
Then it came time to make a change to one of the charts. Team members were asking me questions about it. "How can we make this axis display only for existing data rather than range?" I'm scrolling through code in a screenshare that I absolutely reviewed, I remember doing it, I remember clicking the green arrow in Cursor, but I'm panicking because this doesn't look like code I've ever seen, and I'm seeing gaping mistakes and stupid patterns and a ton of duplicated code. Yeah I reviewed it, but bit by bit, never really all at once. I'd never grokked the entire file. They're asking me questions to which I don't have answers, for code "I'd just written." Man it was embarrassing!
And then to make the change, the AI completely failed at it. Plotly.js's type definitions are super out of date and the Python library is more fleshed out, so the AI started hallucinating things that exist in Python and not in JS - so now I gotta head to the docs anyway. I had to get much more manual, and the autocomplete of Cursor was nice while doing so, but sometimes I'd spend more time tab/backspacing after realizing the thing it recommended was actually wrong than I'd have spent just quickly typing the entire whatever thing.
And just like a hit, now I'm chasing the dragon. I'd love to get that feeling back of entering a new era of programming, where I'm hugely augmented. I'm trying out all the different AI tools, and desperately wishing there was an autocomplete as fast and multi-line and as good at jumping around as Cursor, available in nvim. But they all let me down. Now that I'm paying more attention, I'm realizing the code really isn't good at all. I think it's still very useful to have Claude generate a lot of boilerplate, or come in and make some tedious changes for me, or just write all my tests, but beyond that, I don't know. I think it's improved my productivity maybe 20%, all things considered. Still amazing! I just wish it was as good as I thought it was when I first tried it.
While Gemini performed well in tweaking visualizations (it even understood the output of matplotlib) and responding to direct prompts, it struggled with debugging and multi-step refactorings, occasionally failing with generic error messages. My takeaway is that these tools are incredibly productive for greenfield coding with minimal constraints, but when it comes to making code reusable or architecturally sound, they still require significant human guidance. The AI doesn’t prioritize long-term code quality unless you actively steer it in that direction.
Also, one underestimated aspect is that LLMs don’t get writer’s block or get tired (so long as you can pay to keep the tokens flowing).
Also, one of the more useful benefits of coding with LLMs is that you are explicitly defining the requirements/specs in English before coding. This effectively means LLM-first code is likely written via Behavior Driven Development, so it is easier to review, troubleshoot, upgrade. This leads to lower total cost of ownership compared to code which is just cowboyed/YOLOed into existence.
I have found for myself it helps motivate me, resulting in net productivity gain from that alone. Even when it generates bad ideas, it can get me out of a rut and give me a bias towards action. It also keeps me from procrastinating on icky legacy codebases.
If I'm using it to remember the syntax or library for something I used to know how to do, it's great.
If I'm using it to explore something I haven't done before, it makes me faster, but sometimes it lies to me. Which was also true of Stack Overflow.
But when I ask it to do something fairly complex on its own, it usually tips over. I've tried a bunch of tests with a bunch of models, and it never quite gets it right. Sometimes it's minor stuff that I can fix if I bang on it long enough, and sometimes it's a steaming pile that I end up tossing in the garbage.
For example, I've asked it to code me a web-based calculator, or a 3D model of the solar system using WebGL, and none of the models I've tried have been able to do either.
So no, imho people with no app dev skills cannot just build something over a weekend, at least not something that won't break when the first user logs in.
That being said, I am a generalist with 10+ years of experience and can spot the good parts from bad parts and can wear many hats. Sure, I do not know everything, but, hey did I know everything when AI was not there? I took help from SO, Reddit and other places. Now, I go to AI, see if it makes sense, apply the fix, learn and move on.
However most paid jobs don't fall into this category.
Things like: build a settings system with org, user, and project level settings, and the UI to edit them.
A task like that doesn’t require a lot of thinking and planning, and is well within most developers’ abilities, but it can still take significant time. Maybe you need to create like 10 new files across backend and frontend, choose a couple libraries to help with different aspects, style components for the UI and spend some time getting the UX smooth, make some changes to the webpack config, and so on. None of it is difficult, per se, but it all takes time, and you can run into little problems along the way.
A task like that is like 10-20% planning, and 80-90% going through the motions to implement a lot of unoriginal functionality. In my experience, these kinds of tasks are very common, and the speedup LLMs can bring to them, when prompted well, is pretty dramatic.
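As a sketch of what the core of that kind of task boils down to (the scope names and API below are assumptions, not any particular product's): layered settings where the most specific scope wins.

    from typing import Any, Optional

    def resolve_setting(key: str,
                        org: dict[str, Any],
                        user: dict[str, Any],
                        project: dict[str, Any],
                        default: Optional[Any] = None) -> Any:
        # Most specific scope wins; fall back to a default if no layer sets it.
        for scope in (project, user, org):
            if key in scope:
                return scope[key]
        return default

    org = {"theme": "light", "locale": "en-US"}
    user = {"theme": "dark"}
    project = {"locale": "de-DE"}

    print(resolve_setting("theme", org, user, project))   # "dark"  (user overrides org)
    print(resolve_setting("locale", org, user, project))  # "de-DE" (project overrides user/org)

The resolution logic is the easy 10-20%; the other 80-90% is exactly the plumbing described above (files, UI, config, libraries) that LLMs grind through quickly.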
This is where I have found LLMs to be most useful. I have never been able to figure out how to get it to write code that isn't a complete unusable disaster zone. But if you throw your problem at it, it can offer great direction in plain English.
I have decades of research, planning, and figuring things out under my belt, though. That may give me an advantage in guiding it just the right way, whereas the junior might not be able to get anything practical from it, and thus that might explain their focus on code generation instead?
It reads like this project would have taken your company 9 weeks before, and now will take the company 9 weeks.
>What makes an excellent engineer is risk mitigation and designing systems under a variety of possible constraints.
I take it that those fields also don't live by the "move fast and break things" motto?
The co-founder of a company I worked at was one for a period (he is not a 10xer anymore - I don't think someone can maintain that output forever with life constraints). He literally wrote the bulk of a multi-million line system, most of the code is still running today without much change and powering a unicorn level business.
I literally wouldn't believe it, but I was there for it when it happened.
Ran into one more who I thought might be one, but he left the company too early to really tell.
I don't think AI is going to produce any 10x engineers because what made that co-founder so great was he had some kind of sixth sense for architecture, that for most of us mortals we need to take more time or learn by trial and error how to do. For him, he was just writing code and writing code and it came out right on the first try, so to speak. Truly something unique. AI can produce well specified code, but it can't do the specifying very well today, and it can't reason about large architectures and keep that reasoning in its context through the implementation of hundreds of features.
One of our EMs did this this week. He did a lot of homework: spoke to quite a few experts and pretty soon realised this task was too hard for his team to ever accomplish, if it was even possible. He lobbied the PM, a VP, and a C-level, and managed to stop a lot of wasted work from being done.
Sometimes the most important language to know as a dev is English*
s/English/YourLanguageOfChoice/g
I guess this leaves open question about the distribution of productivity across programmers and the difference between the min and the mean. Is productivity normally distributed? Log normal? Some kind of power law?
Junior: 100 total lines of code a day
Senior: 10,000 total lines of code a day
Guru: -100 total lines of code a day
As in, it's now completely preventing you from doing things you could have before?
Where I see major productivity gains are on small, tech debt like tasks, that I could not justify before. Things that I can start with an async agent, let sit until I’ve got some downtime on my main tasks (the ones that involve all that coordination). Then I can take the time to clean them up and shepherd them through.
The very best case of these are things where I can move a class of problem from manually verified to automatically verified as that kick starts a virtuous cycle that makes the ai system more productive.
But many of them are boring refactors that are just beyond what a traditional refactoring tool can do.
Internally we expected 15%-25%. A big-3 consultancy told senior leadership "35%-50%" (and then tried to upsell an AI Adoption project). And indeed we are seeing 15%-35% depending on which part of the org you look and how you measure the gains.
What about just noticing that coworkers are repeatedly doing something that could easily be automated?
Here's what the 5x to 10x flow looks like:
1. Plan out the tasks (maybe with the help of AI)
2. Open a Git worktree, launch Claude Code in the worktree, give it the task, let it work. It gets instructions to push to a Github pull request when it's done. Claude gets to work. It has access to a whole bunch of local tools, test suites, and lots of documentation.
3. While that terminal is running, I go start more tasks. Ideally there are 3 to 5 tasks running at a time.
4. Periodically check on the tabs to make sure they're not stuck or lost their minds.
5. Finally, review the finished pull requests and merge them when they are ready. If they have issues then go back to the related chat and tell it to work on it some more.
With that flow it's reasonable to merge 10 to 20 pull requests every day. I'm sure someone will respond "oh just because there are a lot of pull requests, doesn't mean you are productive!" I don't know how to prove to you that the PRs are productive other than to say that they are each basically equivalent to what one human does in one small PR.
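A rough sketch of steps 2-3 (the task descriptions are invented, and the agent command is a placeholder; substitute whatever CLI invocation you actually use):

    import subprocess

    tasks = {
        "settings-ui": "Add org/user/project settings UI per docs/settings.md",
        "csv-export": "Add CSV export endpoint, see docs/exports.md",
        "fix-flaky-job-test": "Fix the flaky test in tests/test_jobs.py",
    }

    procs = []
    for branch, prompt in tasks.items():
        workdir = f"../wt-{branch}"
        # One isolated checkout per task, so the runs don't step on each other.
        subprocess.run(["git", "worktree", "add", workdir, "-b", branch], check=True)
        # Placeholder agent command, run headless in the worktree; per the flow
        # above it is expected to push a branch / open a PR when done.
        procs.append(subprocess.Popen(["claude", "-p", prompt], cwd=workdir))

    for p in procs:
        p.wait()  # in practice step 4 is interactive: check in on each run periodically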
A few notes about the flow:
- For the AI to work independently, it really needs tasks that are easy to medium difficulty. There are definitely 'hard' tasks that need a lot of human attention in order to get done successfully.
- This does take a lot of initial investment in tooling and documentation. Basically every "best practice" or code pattern that you want to use in the project must be written down. And the tests must be as extensive as possible.
Anyway, the linked article talks about the time it takes to review pull requests. I don't think it needs to take that long, because you can automate a lot:
- Code style issues are fully automated by the linter.
- Other checks like unit test coverage can be checked in the PR as well.
- When you have a ton of automated tests that are checked in the PR, that also reduces how much you need to worry about as a code reviewer.
With all those checks in place, I think it can be pretty fast to review a PR. As the human you just need to scan for really bad code patterns, and maybe zoom in on highly critical areas, but most of the code can be eyeballed pretty quickly.
With enough rules and good prompting this is not true. The code I generate is usually better than what I'd do by hand.
The reason the code is better is that all the extra polish and gold plating is essentially free.
Everything I generate comes out commented, with great error handling, logging, SOLID structure, and unit tests using established patterns in the code base.
Now I don't want to sound like a doomsayer, but it appears to me that application programming and corresponding software companies are likely to disappear within the next 10 years or so. We're now in a transitional phase where companies who can afford enough AI compute time have an advantage. However, this phase won't last long.
Unless there is a principled obstacle to further improving AI programming, not just simple functions but whole apps will be creatable with a prompt. However, this is not where it is going to stop. Soon, there will be no need for apps in the traditional sense. End users will use AI to manipulate and visualize data, and operating systems will integrate the AI services needed for this. "Apps" can be created on the fly and constantly adjusted to the users' needs.
Creating apps will not remain a profitable business. If there is an app X someone likes, they can prompt their AI to create an app with the same features, but perhaps with these or those small changes, and the AI will create it for them, including thorough tests and quality assurance.
Right now, in the transitional phase, senior engineers might feel they are safe because someone has to monitor and check the AI output. But there is no reason why humans would be needed for that step in the long run. It's cheaper to have 3 AIs quality-test and improve the outputs of one generating AI. I'm sure many companies are already experimenting with this, and at some point the output of such iterative design procedures will have far fewer bugs than any code produced by humans. Only safety-critical essential features such as operating systems and banking will continue to be supervised by humans, though perhaps mostly for legal reasons.
Although I hope it's not, to me the end of software development seems a logical long-term consequence of current AI development. Perhaps I've missed something; I'd be interested in hearing from people who disagree.
It's ironic because in my great wisdom I chose to quit my day job in academia recently to fulfill my lifelong dream of bootstrapping a software company. I'll see if I can find a niche, maybe some people appreciate hand-crafted software in the future for its quirks and originality...
- solo projects
- startups with few engineers doing very little intense code review if any at all
- people who don't know how to code themselves.
Nobody else is realistically able to get 10x multipliers. But that doesn't mean you can't get a 1.5-2x multiplier. I'd say even myself at a large company that moves slow have been able to realize this type of multiplier on my work using cursor/claude code. But as mentioned in the article the real bottleneck becomes processes and reviews. These have not gotten any faster - so in real terms time to ship/deliver isn't much different than before.
The only attempt that we should make at minimizing review times is by making them a higher priority than development itself. Technically this should already be the case, but in my experience almost no engineer outside of really disciplined companies, and not even in FAANG, actually makes reviews a high priority, because unfortunately code reviews are not usually part of someone's performance review and they slow down your own projects. And usually your project manager couldn't give two shits about someone else's work being slow.
Processes are where we can make the biggest dent. Most companies as they get large have processes that get in the way of forward velocity. AI first companies will minimize anything that slows time to ship. Companies simply utilizing AI and expecting 10x engineers without actually putting in the work to rally around AI as a first class citizen will fall behind.
The credit lies with a more functional style of C++ and TypeScript (the languages I use for hobbies and work, respectively), but Claude has sort of taken me out of the bubble I was brought up in and introduced new ideas to me.
However, I've also noticed that LLM products tend to reinforce your biases. If you don't ask it to critique you or push back, it often tells you what a great job you did and how incredible your code is. You see this with people who have gotten into a kind of psychotic feedback loop with ChatGPT and who now believe they can escape the matrix.
I think LLMs are powerful, but only for a handful of use cases. I think the majority of what they're marketed for right now is techno-solutionism, and there's an impending collapse in VC funding for companies that are plugging ChatGPT APIs into everything from insurance claims to medical advice.
(1) for my day job, it doesn't make me super productive with creation, but it does help with discovery, learning, getting myself unstuck, and writing tedious code.
(2) however, the biggest unlock is it makes working on side projects __immensely__ easier. Before AI I was always too tired to spend significant time on side projects. Now, I can see my ideas come to life (albeit with shittier code), with much less mental effort. I also get to improve my AI engineering skills without the constraint of deadlines, data privacy, tool constraints etc..
The smartest programmer I know is so impressive mainly for two reasons: first, he seems to have just an otherworldly memory and seems to kind of have absolutely every little feature and detail of the programming languages he uses memorized. Second, his real power is really in cognitive ability, or the ability to always quickly and creatively come up with the smartest and most efficient yet elegant and clean solution to any given problem. Of course somewhat opinionated but in a good way. Funnily he often wouldn't know the academic/common name for some algorithm he arrived at but it just happened to be what made sense to him and he arrived at it independently. Like a talented musician with perfect pitch who can't read notation or doesn't know theory yet is 10x more talented than someone who has studied it all.
When I pair program with him, it's evident that the current iteration of AI tools is not as quick or as sharp. You could arrive at similar solutions but you would have to iterate for a very long time. It would actually slow that person down significantly.
However, there is such a big spectrum of ability in this field that I could actually see this increasing for example my productivity by 10x. My background/profession is not in software engineering but when I do it in my free time the perfectionist tendencies make me work very slowly. So for me these AI tools are actually cool for generating the first crappy proof of concepts for my side projects/ideas, just to get something working quickly.
I guess this is still the "caveat" that can keep the hype hopes going. But at the team-velocity level, with our teams, where everyone is actively using agentic coding like Claude Code daily, we actually haven't seen an increase in team velocity yet.
I'm curious to hear anecdotal from other teams, has your team seen velocity increase since it adopted agentic AI?
[And to those saying we're using it wrong... well I can't argue with something that's not falsifiable]
This article thinks that most people who say 10x productivity are claiming 10x speedup on end-to-end delivering features. If that's indeed what someone is saying, they're most of the time quite simply wrong (or lying).
But I think some people (like me) aren't claiming that. Of course the end to end product process includes a lot more work than just the pure coding aspect, and indeed none of those other parts are getting a 10x speedup right now.
That said, there are a few cases where this 10x end-to-end is possible. E.g. when working alone, especially on new things but not only - you're skipping a lot of this overhead. That's why smaller teams, even solo teams, are suddenly super interesting - because they are getting a bigger speedup comparatively speaking, and possibly enough of one to be able to rival larger teams.
I doubt that's the commonly desired outcome, but it is what I want! If AI gets too expensive overnight (say 100x), then I'll be able to keep chugging along. I would miss it (claude-code), but I'm betting that by then a second tier AI would fit my process nearly as well.
I think the same class of programmers that yak shave about their editor, will also yak shave about their AI. For me, it's just augmenting how I like to work, which is probably different than most other people like to work. IMO just make it fit your personal work style... although I guess that's problematic for a large team... look, even more reasons not to have a large team!
These conversations on AI code good, vs AI code bad constantly keep cropping up.
I feel we need to build a cultural norm to share examples places of succeeded, and failures, so that we can get to some sort of comparison and categorization.
The sharing also has to be made non-contentious, so that we get a multitude of examples. Otherwise we’d get nerd-sniped into arguing the specifics of a single case.
It may change in the future, but AI is without a doubt improving our codebase right now. Maybe not 10X but it can easily 2X as long as you actually understand your codebase enough to explain it in writing.
I haven't begun doing side projects or projects for self, yet. But I did go down the road of finding out what would be needed to do something I wished existed. It was much easier to explore and understand the components and I might have a decent chance at a prototype.
The alternative to this would have been to ask people around or formulate extensively researched questions for online forums, where I'd expect to get half cryptic answers (and a jibe at my ignorance every now and then) at a pace that I would take years before I had something ready.
I see the point for AI as a prototyping and brainstorming tool. But I doubt we are at a point where I would be comfortable pushing changes to a production environment without giving 3x the effort in reviewing. Since there's a chance of the system hallucinating, I have a genuine fear that it would seem accurate, but what it would do is something really really stupid.
And while I don't categorically object to AI tools, I think you're selling objections to them short.
It's completely legitimate to want an explainable/comprehendable/limited-and-defined tool rather than a "it just works" tool. Ideally, this puts one in an "I know its right" position rather than a "I scanned it and it looks generally right and seems to work" position.
I find it impossible to work out who to trust on the subject, given that I'm not working directly with them, so remain entirely on the fence.
But nobody has ever managed to get there despite decades of research and work done in this area. Look at the work of Gerald Sussman (of SICP fame), for example.
So all you're saying is it makes the easy bit easier if you've already done, and continue to do, the hard bit. This is one of the points made in TFA. You might be able to go 200mph in a straight line, but you always need to slow down for the corners.
What you need is just boring project management. Have a proper spec, architecture and tasks split into manageable chunks with enough information to implement them.
Then you just start watching TV and say "implement github issue #42" to Claude and it'll get on with it.
But if you say "build me facebook" and expect a shippable product, you'll have a bad time.
It’s a rubber duck that’s pretty educated and talks back.
100%. The biggest challenge with software is not that it’s too hard to write, but that it’s too easy to write.
You are right that typing speed isn't the bottleneck, but wrong about what AI actually accelerates. The 10x engineers aren't typing faster; they're exploring 10 different architectural approaches in the time it used to take to try one, validating ideas through rapid prototyping, and automating the boring parts to focus on the hard decisions.
You can't evaluate a small sample size of people who are not exploiting the benefits well and come to an accurate assessment of the utility of a new technology.
Skill is always a factor.
And I think that sentence is a pretty big tell, so ...
https://www.windowscentral.com/software-apps/sam-altman-ai-w...
https://brianchristner.io/how-cursor-ai-can-make-developers-...
https://thenewstack.io/the-future-belongs-to-ai-augmented-10...
For much of what I build with AI, I'm not saving two weeks. I'm saving infinity weeks — if LLMs didn't exist I would have never built this tool in the first place.
For me it's 50-50 reading other people's code and getting a feel for the patterns and actually writing the code.
I sort of want to get back to that... it was really good at getting ideas across.
I've been a bit of that engineer (though not at the same scale); like, say, I wrote 70% of a 50k+ LOC greenfield service. But I'm not sure it really means I'm 10x. Sometimes this comes from just being the person allowed to do it, who doesn't get questioned on their design choices or decisions of how to structure and write the code, and who doesn't get any push back on having massive PRs that others almost just rubber-stamp.
And you can really only do this at the greenfield phase, when things are not yet in production, and there's so much baseline stuff that's needed in the code.
But it ends up being the 80/20 rule: I did 80% of the work in 20% of the time it'll take to go to prod, because the remaining 20% will eat up 80% of the time.
What's your experience? And what do the "kids" use these days to indicate alternative options (as above — though for that, I use bash {} syntax too) or to signal "I changed my mind" or "let me fix that for you"?
Because I might just not have a great imagination, but it's very hard for me to see how you basically automate the review process on anything that is business critical or has legal risks.
I'm always baffled by this. If you can't do it that well by hand, how can you discriminate its quality so confidently?
I get there is an artist/art-consumer analogy to be made (i.e. you can see a piece is good without knowing how to paint), but I'm not convinced it is transferrable to code.
Also, not really my experience when dealing with IaC or (complex) data related code.
Let’s boil this down to an easy set of reproducible steps any engineer can take to wrangle some sense from their AI trip.
Me: Here's the relevant part of the code, add this simple feature.
Opus: here's the modified code blah blah bs bs
Me: Will this work?
Opus: There's a fundamental flaw in blah bleh bs bs here's the fix, but I only generate part of the code, go hunt for the lines to make the changes yourself.
Me: did you change anything from the original logic?
Opus: I added this part, do you want me to leave it as it was?
Me: closes chat
Then unfortunately you're leaving yourself at a serious disadvantage.
Good for you if you're able to live without a calculator, but frankly the automated tool is faster and leaves you less exhausted so you should be taking advantage of it.
Being able to sit down after a long day of work and ask an AI model to fix a bug or implement a feature on something while you relax and _not_ type code is a major boon. It is able to immediately get context and be productive even when you are not.
I hear this take a lot but does it really make that much of an improvement over what we already had with search engines, online documentation and online Q&A sites?
For 20 a month I can get my stupid tool and utility ideas from "it would be cool if I could..." to actual "works well enough for me" -tools in an evening - while I watch my shows at the same time.
After a day at work I don't have the energy to start digging through, say, OpenWeather's latest 3.0 API and its nuances and how I can refactor my old code to use the new API.
Claude did it in maybe one episode of What We Do in the Shadows :D I have a hook that makes my computer beep when Claude is done or pauses for a question, so I can get back, check what it did and poke it forward.
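For the curious, the core of that kind of refactor is a small call; here's a minimal sketch assuming OpenWeather's One Call API 3.0 endpoint and an API key in an OWM_API_KEY environment variable (check the current docs before trusting the exact parameters):

    import os
    import requests

    def current_weather(lat: float, lon: float) -> dict:
        resp = requests.get(
            "https://api.openweathermap.org/data/3.0/onecall",
            params={
                "lat": lat,
                "lon": lon,
                "units": "metric",
                "exclude": "minutely,hourly",  # trim response sections we don't need
                "appid": os.environ["OWM_API_KEY"],
            },
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["current"]

    print(current_weather(60.17, 24.94)["temp"])  # e.g. current temperature in Helsinki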
This seems to be the current consensus.
A very similar quote from another recent AI article:
One host compares AI chatbots to “a very smart assistant who has a dozen Ph.D.s but is also high on ketamine like 30 percent of the time.”
https://lithub.com/what-happened-when-i-tried-to-replace-mys...
This alone is where I get a lot of my value. Otherwise, I'm using Cursor to actively solve smaller problems in whatever files I'm currently focused on. Being able to refactor things with only a couple sentences is remarkably fast.
The more you know about your language's features (and their precise names), and about higher-level programming patterns, the better time you'll have with LLMs, because it matches up with real documentation and examples with more precision.
It's funny, GitHub Copilot puts these models in the 'bargain bin' (they are free in 'ask' mode, whereas the other models count against your monthly limit of premium requests) and it's pretty clear why: they seem downright nerfed. They're tolerable for basic questions, but you wouldn't use them if price weren't a concern.
Brandwise, I don't think it does OpenAI any favors to have their models be priced as 'worthless' compared to the other models on premium request limits.
Now I can always switch to a different model, increase the context, prompt better etc. but I still feel that actual good quality AI code is just out of arms reach, or when something clicks, and the AI magically starts producing exactly what I want, that magic doesn't last.
Like with stable diffusion, people who don't care as much or aren't knowledgeable enough to know better, just don't get what's wrong with this.
A week ago, I received a bug ticket claiming one of the internal libs I wrote didn't work. I checked out the reporter's code, which was full of weird issues (like the debugger not working and the TypeScript being full of red squiggles), and my lib crashed somewhere in the middle, in some esoteric minified JS.
When I asked the guy who wrote it what's going on, he admitted he vibe coded the entire project.
Even if LLMs worked perfectly without hallucinations (they don't and might never), a conscientious developer must still comprehend every line before shipping it. You can't review and understand code 10x faster just because an LLM generated it.
In fact, reviewing generated code often takes longer because you're reverse-engineering implicit assumptions rather than implementing explicit intentions.
The "10x productivity" narrative only works if you either:
- Are not actually reviewing the output properly
or
- Are working on trivial code where correctness doesn't matter.
Real software engineering, where bugs have consequences, remains bottlenecked by human cognitive bandwidth, not code generation speed. LLMs shifted the work from writing to reviewing, and that's often a net negative for productivity.
There's many jobs that can be eliminated with software, but haven't because managers don't want to hire SWEs without proven value. I don't think HN realizes how big that market is.
With AI, the managers will replace their employees with a bunch of code they don't understand, watch that code fail in 3 years, and have to hire SWEs to fix it.
I'd bet those jobs will outnumber the ones initially eliminated by having non-technical people deliver the first iteration.
Many of those jobs will be high-skill/impact because they are necessarily focused on fixing stuff AI can't understand.
Nor do they produce those (do they?). That is what I would like to see. Formal models and diagrams are not needed to produce code. Their point is that they allow us to understand code and to formalize what we want it to do. That's what I'm hoping AI could do for me.
The problem is that AI needs to be spoon-fed overly detailed dos and donts, and even then the output can't be trusted without carefully checking it. It's easy to reach a point where breaking down the problem into pieces small enough for AI to understand takes more work than just writing the code.
AI may save time when it generates the right thing on the first try, but that's a gamble. The code may need multiple rounds of fixups, or end up needing a manual rewrite anyway, after wasting time and effort on instructing the AI. The ceiling of AI capabilities is very uneven and unpredictable.
Even worse, the AI can confidently generate code that looks superficially correct, but has subtle bugs/omissions/misinterpretations that end up costing way more time and effort than the AI saved. It has uncanny ability to write nicely structured, well-commented code that is just wrong.
Using AI will change nothing in this context.
The conversation around LLMs is so polarized. Either they’re dismissed as entirely useless, or they’re framed as an imminent replacement for software developers altogether.
Hallucinations are worth talking about! Just yesterday, for example, Claude 4 Sonnet confidently told me Godbolt was wrong wrt how clang would compile something (it wasn’t). That doesn’t mean I didn’t benefit heavily from the session, just that it’s not a replacement for your own critical thinking.
Like any transformative tool, LLMs can offer a major productivity boost but only if the user can be realistic about the outcome. Hallucinations are real and a reason to be skeptical about what you get back; they don’t make LLMs useless.
To be clear, I’m not suggesting you specifically are blind to this fact. But sometimes it’s warranted to complain about hallucinations!
To be clear, I did not classify "all the AI-supporters" as being in those three categories, I specifically said the people posting that they are getting 10x improvements thanks to AI.
Can you tell me about what you've done to no longer have any hallucinations? I notice them particularly in a language like Terraform, the LLMs add properties that do not exist. They are less common in languages like Javascript but still happen when you import libraries that are less common (e.g. DrizzleORM).
What I'm about to discuss is about me, not you. I have no idea what kind of systems you build, what your codebase looks like, use case, business requirements etc. etc. etc. So it is possible writing tests is a great application for LLMs for you.
In my day to day work... I wish that developers where I work would stop using LLMs to write tests.
The most typical problem with LLM-generated tests on the codebase where I work is that the test code is almost extremely tightly coupled to the implementation code. Heavy use of test spies is a common anti-pattern. The result is a test suite that is testing implementation details, rather than "user-facing" behaviour (user could be a code-level consumer of the thing you are testing).
The problem with that type of test is that it is a fragile test. One of the key benefits of automated tests is that they give you a safety net to refactor implementation to your heart's content without fear of having broken something. If you change an implementation detail and the "user-facing" behaviour does not change, your tests should pass. When tests are tightly coupled to implementation, they will fail, and now your tests, in the worst of cases, might actually be creating negative value for you ... since every code change now requires you to keep tests up to date even when what you actually care about testing ("is this thing working correctly?") hasn't changed.
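A hypothetical pytest-style example of the difference (names invented): the first test spies on an internal function and breaks on any refactor, even when behaviour is unchanged; the second only asserts on the behaviour callers actually rely on.

    import sys
    from unittest import mock

    def _lookup_rate(code: str) -> float:          # internal implementation detail
        return {"SAVE10": 0.10}.get(code, 0.0)

    def apply_discount(order_total: float, code: str) -> float:
        return round(order_total * (1 - _lookup_rate(code)), 2)

    def test_coupled_to_implementation():
        # Spies on an internal function: fails the moment the implementation
        # is restructured, even if user-facing behaviour is identical.
        this_module = sys.modules[__name__]
        with mock.patch.object(this_module, "_lookup_rate", return_value=0.10) as spy:
            apply_discount(100.0, "SAVE10")
        spy.assert_called_once_with("SAVE10")

    def test_user_facing_behaviour():
        # Only asserts on observable behaviour; survives internal refactors.
        assert apply_discount(100.0, "SAVE10") == 90.0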
The root of this problem isn't even the LLM; it's just that the LLM makes it a million times worse. Developers often feel like writing tests is a menial chore that needs to be done after the fact to satisfy code coverage policy. Few developers, at many organizations, have ever truly worked TDD or learned testing best practices, how to write easy-to-test implementation code, etc.
There are at least 10 posts on HN these days with the same discussion going in circles.
1. AI sucks at code
2. you are not using my magic prompting technique
AIs so far haven't been able to beat that.
However, I have found AIs to be great when working with unfamiliar tools, where the effort involved in reading the docs etc. far outweighs the benefits. In my case, using AI to generate JasperReports .jrxml files made me more productive.
[In fact you can sometimes find that 10x bigger diff leads to decreased productivity down the line...]
I'm not sure about agentic coding. Need another month at it.
I wonder if that's all it is, or if the lack of context you mention is a more fundamental issue.
Best analogy I've ever heard and it's completely accurate. Now, back to work debugging and finishing a vibe coded application I'm being paid to work on.
If you're not specific enough, it will definitely spit out a half-baked pseudocode file where it expects you to fill in the rest. If you don't specify certain libraries, it'll use whatever is featured in the most blogspam. And if you're in an ecosystem that isn't publicly well-documented, it's near useless.
If I want to throw a shuriken abiding by some artificial, magic Magnus force like in the movie Wanted, both ChatGPT and Claude let me down, using pygame. What if I wanted C-level performance, or if I wanted to use Zig? Burp.
It works like the average Microsoft employee, like some doped version of an orange-wig wearer who gets votes because his daddies kept the population as dumb as it gets after the dotcom x Facebook era. In essence, the ones to be disappointed by are the Chan-Zuckerbergs of our time. There was a chance, but there also was what they were primed for.
People keep focusing on general-intelligence-style capabilities, but that is the holy grail. The world could go through multiple revolutions before finding that holy grail, and even before then everything would have changed beyond recognition.
So write an integration over the API docs I just copy-pasted.
And the knock-on effect is that there is less menial work. Artists are commissioned less for the local fair, their friend's D&D character portrait, etc. Programmers find less work building websites for small businesses, fixing broken widgets, etc.
I wonder if this will result in fewer experts, or less capable ones. As we lose the jobs that were previously used to hone our skills will people go out of their way to train themselves for free or will we just regress?
This really irritates me. I’ve had the same experience with teammates’ pull requests they ask me to review. They can’t be bothered to understand the thing, but then expect you to do it for them. Really disrespectful.
This seems excessive to me. Do you comprehend the machine code output of a compiler?
The names all looked right, the comments were descriptive, it has test cases demonstrating the code work. It looks like something I'd expect a skilled junior or a senior to write.
The thing is, the code didn't work right, and the reasons it didn't work were quite subtle. Nobody would have fixed it without knowing how to have done it in the first place, and it took me nearly as long to figure out why as if I'd just written it myself in the first place.
I could see it being useful to a junior who hasn't solved a particular problem before and wanted to get a starting point, but I can't imagine using it as-is.
It's a brave, weird and crazy new world. "The future is now, old man."
I was surprised that with Claude Code I was able to get a few complex things done that I had anticipated would take a few weeks to uncover, stitch together and get moving.
Instead I pushed Claude to consistently present the correct understanding of the problem, structure, and approach to solving things, and only after that was OK was it allowed to propose changes.
True to its shiny-things corpus, it will overcomplicate things because it hasn't learned that less is more. Maybe that reflects the corpus of the average code.
Looking at how folks are setting up their claude.md and agents can go a long way if you haven't had a chance yet.
But maybe another thing is not considered - while things may take longer, they ease cognitive load. If you have to write a lot of boilerplate or you have a task to do, but there are too many ways to do it, you can ask AI to play it out for you.
What benefit I can see the most is that I no longer use Google and things like Stack Overflow, but actual books and LLMs instead.
On the security layer, I wrote that code mostly by hand, with some 'pair programming' with Claude to get the Oauth handling working.
When I have the agent working on tasks independently, it's usually working on feature-specific business logic in the API and frontend. For that work it has a lot of standard helper functions to read/write data for the current authenticated user. With that scaffolding it's harder (not impossible) for the bot to mess up.
It's definitely a concern though, I've been brainstorming some creative ways to add extra tests and more auditing to look out for security issues. Overall I think the key for extremely fast development is to have an extremely good testing strategy.
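A toy sketch of the kind of scaffolding helper being described (names and in-memory storage are assumptions): feature code the agent writes never filters by user itself; it goes through helpers that always scope reads to the authenticated user.

    RECORDS = [
        {"id": 1, "user_id": 7, "title": "mine"},
        {"id": 2, "user_id": 9, "title": "someone else's"},
    ]

    def records_for_user(current_user_id: int, **filters):
        # The user_id predicate is applied here, not in agent-written feature
        # code, so it cannot be forgotten or bypassed by accident.
        rows = (r for r in RECORDS if r["user_id"] == current_user_id)
        return [r for r in rows if all(r.get(k) == v for k, v in filters.items())]

    print(records_for_user(7))                # only user 7's rows
    print(records_for_user(7, title="nope"))  # []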
With AI the extra quality and polish is basically free and instantaneous.
Related - agentic LLMs may be slow to produce output but they are parallelizable by an individual unlike hand-written work.
First, until I can re-learn boundaries, they are a fiasco for work-life balance. It's way too easy to have a "hmm what if X" thought late at night or first thing in the morning, pop off a quick ticket from my phone, assign to Copilot, and then twenty minutes later I'm lying in bed reviewing a PR instead of having a shower, a proper breakfast, and fully entering into work headspace.
And on a similar thread, Copilot's willingness to tolerate infinite bikeshedding and refactoring is a hazard for actually getting stuff merged. Unlike a human colleague who loses patience after a round or two of review, Copilot is happy to keep changing things up and endlessly iterating on minutiae. Copilot code reviews are exhausting to read through because it's just so much text, so much back and forth, every little change with big explanations, acknowledgments, replies, etc.
The suggestions were always unusably bad. The /fix suggestions were always obviously and straight-up wrong unless it was a super silly issue.
Claude Code with Opus model on the other hand was mind-blowing to me and made me change my mind on almost everything wrt my opinion of LLMs for coding.
You still need to grow the skill of how to build the context and formulate the prompt, but the built-in execution loop is a complete game changer, and I didn't realize that until I actually used it effectively on a toy project myself.
MCP in particular was another thing I always thought was massively over hyped, until I actually started to use some in the same toy project.
Frankly, the building blocks already exist at this point to make a vast majority of all jobs redundant (and I'm thinking about all grunt-work office jobs, not coding in particular). The tooling still needs to be created, so I'm not seeing a short-term realization (<2 yrs), but medium term (5+ yrs)?
You should expect most companies to let people go in staggering numbers, with only small numbers of highly skilled people left to administer the agents.
This is particularly true for headlines like this one which stand alone as statements.
https://www.construx.com/blog/productivity-variations-among-...
The article is about actual, experienced engineers trying to get even better. That's a completely different matter.
Point still remains for junior and semi-senior devs though, or any dev trying to leap over a knowledge barrier with LLMs. Emphasis on good pipelines and human (eventually maybe also LLM based) peer-reviews will be very important in the years to come.
# loop over the images
for filename in images_filenames:
    # download the image
    image = download_image(filename)
    # resize the image
    resize_image(image)
    # upload the image
    upload_image(image)
I prefer to push for self documenting code anyway, never saw the need for docs other than for an API when I'm calling something like a black box.
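For contrast, a minimal sketch of how the same loop reads when the names do the documenting (reusing the same hypothetical download_image/resize_image/upload_image helpers from the snippet above, no line-by-line comments needed):

# Same pipeline as above, relying on descriptive names instead of comments.
def mirror_resized_images(image_filenames):
    for filename in image_filenames:
        image = download_image(filename)
        resize_image(image)
        upload_image(image)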
What is particularly useful are the comments explaining the reasoning behind new code added at my request.
https://github.com/micahscopes/radix_immutable
I took an existing MIT-licensed prefix tree crate and had Claude+Gemini rewrite it to support immutable, quickly comparable views. The execution took about one day's work, following two or three weeks of thinking about the problem part time. I scoured the prefix tree libraries available in Rust, as well as the various existing immutable collections libraries, and found that nothing like this existed. I wanted O(1)-comparable views into a prefix tree. This implementation has decently comprehensive tests and benchmarks.
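To illustrate the general idea (a hypothetical sketch of the concept in Python, not the design of the radix_immutable crate, and a plain trie rather than a radix tree): each immutable node caches a structural hash at construction, so two views can be compared in O(1) by comparing hashes instead of walking both trees.

class Node:
    # Hypothetical immutable trie node with a cached structural hash.
    __slots__ = ("value", "children", "_hash")

    def __init__(self, value=None, children=None):
        self.value = value
        self.children = dict(children or {})  # edge label -> child Node
        # Computed once; nodes are never mutated afterwards.
        self._hash = hash((value, frozenset(
            (label, child._hash) for label, child in self.children.items())))

    def insert(self, key, value):
        # Returns a new tree that shares all unchanged subtrees with self.
        if not key:
            return Node(value, self.children)
        child = self.children.get(key[0], Node())
        children = dict(self.children)
        children[key[0]] = child.insert(key[1:], value)
        return Node(self.value, children)

    def __eq__(self, other):
        # O(1) probabilistic comparison: equal cached hashes almost certainly mean equal trees.
        return isinstance(other, Node) and self._hash == other._hash

    def __hash__(self):
        return self._hash

A real implementation also has to deal with hash collisions and path compression, which is where the actual crate does the interesting work.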
No code for the next two but definitely results...
Tabu search guided graph layout:
https://bsky.app/profile/micahscopes.bsky.social/post/3luh4d...
https://bsky.app/profile/micahscopes.bsky.social/post/3luh4s...
Fast Gaussian blue noise with wgpu:
https://bsky.app/profile/micahscopes.bsky.social/post/3ls3bz...
In both these examples, I leaned on Claude to set up the boilerplate, the GUI, etc., which gave me more mental budget for playing with the challenging aspects of the problem. For example, the tabu graph layout is inspired by several papers, but I was able to iterate really quickly with Claude on new ideas from my own creative imagination with the problem. A few of them actually turned out really well.
(edit)
I asked it to generate a changelog: https://github.com/wglb/gemini-chat/blob/main/CHANGELOG.md
I use it similarly to the parent poster when I am working with an unfamiliar API, in that I will ask for simple examples of functionality that I can easily verify are correct and then build upon them quickly.
Also, let me know when your calculator regularly hallucinates. I find it exhausting to have an LLM dump out a "finished" implementation and have to spend more time reviewing it than it would take to complete it myself from scratch.
As far as writing "tedious" code goes, I think the AI agents are great. Where I have personally found a huge advantage is in keeping documentation up to date. I'm not sure if it's because I have ADHD or because my workload is basically enough for 3 people, but this is an area I struggle with. In the past, I've often let the code be its own documentation, because that would be better than having outdated/wrong documentation. With AI agents, I find that I can have good documentation that I don't need to worry about beyond approving in the keep/discard part of the AI agent.

I also rarely write SQL, Bicep, YAML configs and similar these days, because it's so easy to determine if the AI agent got it wrong. This requires that you're an expert on infrastructure as code and SQL, but if you are, the AI agents are really fast. I think this is one of the areas where they 10x at times. I recently wrote an ingress for an FTP pod (don't ask), and writing all those ports for passive mode would've taken me a while.

There are a lot of risks involved, though. If you can't spot errors or outdated functionality quickly, then I would highly recommend you don't do this. Bicep LLM output is often not up to date, and since the docs are excellent, what I do in those situations is copy/paste what I need. Then I let the AI agent update things like parameters, which certainly isn't 10x but is still faster than I can do it.
Similarly, it's rather good at writing and maintaining automated tests. I wouldn't recommend this unless you're actively dealing with corrupted states directly in your code. But we do fail-fast programming/Design by Contract, so the tests are really just an extra precaution and compliance thing, meaning that they aren't as vital as they would be for more implicit ways of dealing with error handling.
I don't think AIs are good at helping you with learning or getting unstuck. I guess it depends on how you would normally deal with it. If the alternative is "Google programming", I imagine it's sort of similar and probably more effective. It's probably also more dangerous. At least we've found that our engineers are more likely to trust the LLM than a Medium article or a Stack Overflow thread.
I don’t think models are doing that. They certainly can retrieve a huge amount of information that would otherwise only be available to specialists such as people with PhDs… but I’m not convinced the models have the same level of understanding as a human PhD.
It’s easy to test though: the models simply have to write and defend a dissertation!
To my knowledge, this has not yet been done.
But it is the most productive intern I've ever pair programmed with. The real ones hallucinate about as often too.
The best way to think of chatbot "AI" is as the compendium of human intelligence as recorded in books and online media available to it. It is not intelligent at all on its own, and its judgement can't be better than its human sources because it has no biological drive to synthesize and excel. It's best to think of AI as a librarian of human knowledge, or an interactive Wikipedia, which is designed to seem like an intelligent agent but is actually not.
I don't buy that. The linked article makes a solid argument for why that's not likely to happen: agentic loop coding tools like Claude Code can speed up the "writing code and getting it working" piece, but the software development lifecycle has so much other work before you get to the "and now we let Claude Code go brrrrrrr" phase.
Toy project viability doesn't connect with making people redundant in the process (ever, really), at least not for me. Care to elaborate on where you draw the optimism from?
I'm gonna pivot to building bomb shelters maybe
Or stockpiling munitions to sell during the troubles
Maybe some kind of protest support saas. Molotov deliveries as a service, you still have to light them and throw them but I guarantee next day delivery and they will be ready to deploy into any data center you want to burn down
What Im trying to say is "companies letting people go in staggering numbers" is a societal failure state not an ideal
I think where I've become very hesitant is that a lot of the programs I touch have customer data belonging to clients with pretty hard-nosed legal teams. So it's quite difficult for me to imagine not reviewing the production code by hand.
1) The junior developer is able to learn from experience and feedback, and has a whole brain to use for this purpose. You may have to provide multiple pointers, and it may take them a while to settle into the team and get productive, but sooner or later they will get it, and at least provide a workable solution if not what you may have come up with yourself (how much that matters depends on how wisely you've delegated tasks to them). The LLM can't learn from one day to the next - it's groundhog day every day, and if you have to give up with the LLM after 20 attempts it'd be the exact same thing tomorrow if you were so foolish to try again. Companies like Anthropic apparently aren't even addressing the need for continual learning, since they think that a larger context with context compression will work as an alternative, which it won't ... memory isn't the same thing as learning to do a task (learning to predict the actions that will lead to a given outcome).
2) The junior developer, even if they are only marginally useful to begin with, will learn and become proficient, and the next generation of senior developer. It's a good investment training junior developers, both for your own team and for the industry in general.
I use "tab-tab" auto complete to speed through refactorings and adding new fields / plumbing.
It's easily a 3x productivity gain. On a good day it might be 10x.
It gets me through boring tedium. It gets strings and method names right for languages that aren't statically typed. For languages that are statically typed, it's still better than the best IDE AST understanding.
It won't replace the design and engineering work I do to scope out active-active systems of record, but it'll help me when time comes to build.
Coding in a chat interface, and expecting the same results as with dedicated tools is ... 1-1.5 years old at this point. It might work, but your results will be subpar.
As a junior I used to think it was ok to spend much less time on the review than the writing, but unless the author has diligently detailed their entire process a good review often takes nearly as long. And unsurprisingly enough working with an AI effectively requires that detail in a format the AI can understand (which often takes longer than just doing it).
I know that a whole bunch of people will respond with the exact set of words that will make it show up right away on Google, but that's not the point: I couldn't remember what language it used, or any other detail beyond what I wrote and that it had been shared on Hacker News at some point, and the first couple Google searches returned a million other similar but incorrect things. With an LLM I found it right away.
Me, typing into a search engine, a few years ago: "Postgres CTE tutorial"
Me, typing into any AI engine, in 2025: "Here is my schema and query; optimize the query using CTEs and anything else you think might improve performance and readability"
This can't be a serious question? 5 minutes of testing will prove to you that it's not just better, it's a totally new paradigm. I'm relatively skeptical of AI as a general purpose tool, but in terms of learning and asking questions on well documented areas like programming language spec, APIs etc it's not even close. Google is dead to me in this use case.
If you try it yourself you'll soon find out that the answer is a very obvious yes.
You don't need a paid plan to benefit from that kind of assistance, either.
https://en.m.wikipedia.org/wiki/Ketamine
Because of its hallucinogenic properties?
I'm curious, this is js/ts? Asking because depending on the lang, good old machine refactoring is either amazeballs (Java + IDE) or non-existent (Haskell).
I'm not js/ts so I don't know what the state of machine refactoring is in VS code ... But if it's as good as Java then "a couple of sentences" is quite slow compared to a keystroke or a quick dialog box with completion of symbol names.
Forcing the discussion of invariants, and property-based testing, seems to improve on the issues you're mentioning (when using e.g. Opus 4), especially when combined with the "use the public API" or interface abstractions.
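For anyone who hasn't tried that combination, a minimal sketch of what a property-based test looks like with Python's hypothesis library; encode/decode are hypothetical stand-ins for whatever public API you're forcing the invariant discussion about:

from hypothesis import given, strategies as st

# Hypothetical round-trip invariant: decoding an encoded value returns the original.
@given(st.text())
def test_encode_decode_roundtrip(s):
    assert decode(encode(s)) == s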
Well-written bullshit in perfect prose is still bullshit.
This is emphatically NOT my experience with a large C++ codebase.
Prompts are especially good for building a new template or structure for a new code module, or basic boilerplate for some of the more verbose environments. E.g. Android Java programming can be a mess: huge amounts of code for something simple like an efficient scrolling view. AI takes care of this - it's obvious code, no thought, but it's still over 100 lines scattered across XML (the view definitions), resources, and multiple Java files.
Do you really want to be copying boilerplate like this across to many different files? Prompts that are well integrated to the IDE (they give a diff to add the code) are great (also old style Android before Jetpack sucked) https://stackoverflow.com/questions/40584424/simple-android-...
In other words, it matters whether the AI is creating technical debt.
Yes, if it isn't your being overpaid in the view of a lot of people. Step out of the way and let an expert use the keyboard.
How can you not read and understand code but spend time writing it? That's bad code in that situation.
Source: try working with assembly and binary objects only which really do require working out what's going on. Code is meant to be human readable remember...
The training cutoff comes into play here a bit, but 95% of the time I'm fuzzy searching like that I'm happy with projects that have been around for a few years and hence are both more mature and happen to fall into the training data.
It is a serious question. I've spent much more than 5 minutes testing this, and I've found that your "totally new paradigm" is for morons
That 20 minutes, repeated over and over over the course of a career, is the difference between being a master versus being an amateur
You should value it, even if your employer doesn't.
Your employer would likely churn you into ground beef if there was a financial incentive to, never forget that
At this point I am close to deciding to fully boycott it yes
> If you try it yourself you'll soon find out that the answer is a very obvious yes
I have tried plenty over the years, every time a new model releases and the hype cycle fires up again I look in to see if it is any better
I try to use it a couple of weeks, decide it is overrated and stop. Yes it is improving. No it is not good enough for me to trust
claude config set --global preferredNotifChannel terminal_bell
https://docs.anthropic.com/en/docs/claude-code/terminal-conf...
If an AI assistant was the equivalent of “a dozen PhDs” at any of the places I’ve worked you would see an 80-95% productivity reduction by using it.
This property is likely an important driver of ketamine abuse and it being rather strongly 'moreish', as well as the subjective experiences of strong expectation during a 'trip'. I.e. the tendency to develop redose loops approaching unconsciousness in a chase to 'get the message from the goddess' or whatever, which seems just out of reach (because it's actually a feeling of expectation and not actually a partially installed divine T3 rig).
It's not always right, but I find it helpful when it finds related changes that I should be making anyway, but may have overlooked.
Another example: selecting a block that I need to wrap (or unwrap) with tedious syntax, say I need to memoize a value with a React `useMemo` hook. I can select the value, open Quick Chat, type "memoize this", and within milliseconds it's correctly wrapped and saved me lots of fiddling on the keyboard. Scale this to hundreds of changes like these over a week, it adds up to valuable time-savings.
Even more powerful: selecting 5, 10, 20 separate values and typing: "memoize all of these" and watching it blast through each one in record time with pinpoint accuracy.
We use a Team plan ($500 /mo), which includes 250 ACUs per month. Each bug or small task consumes anywhere between 1-3 ACUs, and fewer units are consumed if you're more precise with your prompt upfront. A larger prompt will usually use fewer ACUs because follow-up prompts cause Devin to run more checks to validate its work. Since it can run scripts, compilers, linters, etc. in its own VM -- all of that contributes to usage. It can also run E2E tests in a browser instance, and validate UI changes visually.
They recommend most tasks should stay under 5 ACUs before it becomes inefficient. I've managed to give it some fairly complex tasks while staying under that threshold.
So anywhere between $2-6 per task usually.
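For reference, the arithmetic behind that per-task figure, using only the plan numbers above:

# $500/month Team plan with 250 ACUs included, typical task = 1-3 ACUs.
cost_per_acu = 500 / 250                      # $2.00 per ACU
low, high = 1 * cost_per_acu, 3 * cost_per_acu
print(low, high)                              # 2.0 6.0 -> roughly $2-6 per task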
I called it a toy project because I'm not earning money with it - hence it's a toy.
It does have medium complexity with roughly 100k loc though.
And I think I need to repeat myself, because you seem to read something into my comment that I didn't say: the fact that the building blocks exist doesn't mean that today's tooling is sufficient for this to play out today.
I very explicitly set a time horizon of 5 yrs.
I suspect that some researchers with a very different approach will come up with a neural network that learns and works more like a human in future though. Not the current LLMS but something with a much more efficient learning mechanism that doesn't require a nuclear power station to train.
These are exactly the people that are going to stay, medium term.
Let's explore a fictional example that somewhat resembles my, and I suspect a lot of peoples current dayjob.
A micro-service architecture: each team administers 5-10 services, and the whole application, which is once again only a small part of the platform as a whole, is developed by maybe 100-200 devs. So something like ~200 micro-services.
The application architects are gonna be completely safe in their jobs. And so are the lead devs in each team - at least from my perspective. Anyone else? I suspect MBAs in 5 yrs will not see their value anymore. That's the vast majority of all devs, so it's likely going to cost 50% of the devs their jobs. And middle management will be slimmed down just as quickly, because you suddenly need a lot fewer managers.
So always aim for outcomes, not output :)
At my company, we did promote people quickly enough that they are now close to double their salaries when they started a year or so ago, due to their added value as engineers in the team. It gets tougher as they get into senior roles, but even there, there's quite a bit of room for differentiation.
Additionally, since this is a market, you should not even expect to be paid twice for 2x value provided — then it makes no difference to a company if they get two 1x engineers instead, and you are really not that special if you are double the cost. So really, the "fair" value is somewhere in between: 1.5x to equally reward both parties, or leaning one way or the other :)
Except it also blurs the lines and sets incorrect expectations.
Management often see code being developed quickly (without full understanding of the fine line between PoC and production ready) and soon they expect it to be done with CC in 1/2 the time or less.
Figma on the other hand makes it very clear it is not code.
They could have just said "the most important language [...] is spoken language".
That's the key right there. Try to use it in a project that handles PII, needs data to be exact, or has many dependencies/libraries and needs to not break for critical business functions.
Well, the people who quote from TFA have usually at least read the part they quoted ;)
Thinking about it personally, a 10X label means I'm supposedly the smartest person in the room and that I'm earning 1/10th what I should be. Both of those are huge negatives.
It's like WordPress all over again, but with people even less able to code. There's going to be a vast amount of opportunities for people to get into the industry via this route, but it's not going to be a very nice route for many of them. Lots of people who understand software even less than the C-suite holding the purse strings.
I am not allowed to use LLMs at work for work code so I can't tell what claims are real. Just my 80s game reimplementations of Snake and Asteroids.
A schematic of a useless amplifier that oscillates looks just as pretty as one of a correct amplifier. If we just want to use it as a repeated print for the wallpaper of an electronic lab, it doesn't matter.
I had a task to do a semi-complex UI addition, the whole week was allocated for that.
I sicked the corp approved Github Copilot with 4o and Claude 3.7 at it and it was done in an afternoon. It's ~95% functionally complete, but ugly as sin. (The model didn't understand our specific Tailwind classes)
Now I can spend the rest of the week on polish.
Usually, such a loop just works. In the cases where it doesn't, often it's because the LLM decided that it would be convenient if some method existed, and therefore that method exists, and then the LLM tries to call that method and fails in the linting step, decides that it is the linter that is wrong, and changes the linter configuration (or fails in the test step, and updates the tests). If in this loop I automatically revert all test and linter config changes before running tests, the LLM will receive the test output and report that the tests passed, and end the loop if it has control (or get caught in a failure spiral if the scaffold automatically continues until tests pass).
It's not an extremely common failure mode, as it generally only happens when you give the LLM a problem where it's both automatically verifiable and too hard for that LLM. But it does happen, and I do think "hallucination" is an adequate term for the phenomenon (though perhaps "confabulation" would be better).
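For concreteness, a rough sketch of the kind of scaffold loop being described, with the revert step applied before verification; every name here is a hypothetical placeholder, not any particular tool's API:

def run_agent_loop(task, llm, repo, max_iterations=10):
    # Hypothetical scaffold: propose a patch, revert any edits to tests or
    # linter config, then verify, so the model can't pass by moving the goalposts.
    for _ in range(max_iterations):
        patch = llm.propose_patch(task, repo.snapshot())
        repo.apply(patch)
        repo.revert_paths(["tests/", ".eslintrc.json", "lint.config"])
        result = repo.run_linters_and_tests()
        if result.ok:
            return patch
        task = task.with_feedback(result.output)
    raise RuntimeError("agent did not converge")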
Aside:
> I can't imagine an agent being given permission to iterate Terraform
Localstack is great and I have absolutely given an LLM free rein over terraform config pointed at localstack. It has generally worked fine and written the same tf I would have written, but much faster.
That seemed to me be to be the author's point.
His article resonated with me. After 30 years of development and dealing with hype cycles, offshoring, no-code "platforms", endless framework churn (this next version will make everything better!), coder tribes ("if you don't do TypeScript, you're incompetent and should be fired"), endless bickering, improper tech adoption following the FANGs (your startup with 0 users needs Kubernetes?) and a gazillion other annoyances we're all familiar with, this AI stuff might be the thing that makes me retire.
To be clear: it's not AI that I have a problem with. I'm actually deeply interested in it and actively researching it from the math up.
I'm also a big believer in it, I've implemented it in a few different projects that have had remarkable efficiency gains for my users, things like automatically extracting values from a PDF to create a structured record. It is a wonderful way to eliminate a whole class of drudgery based tasks.
No, the thing that has me on the verge of throwing in the towel is the wholesale rush towards devaluing human expertise.
I'm not just talking about developers, I'm talking about healthcare providers, artists, lawyers, etc...
Highly skilled professionals that have, in some cases, spent their entire lives developing mastery of their craft. They demand a compensation rate commensurate to that value, and in response society gleefully says "meh, I think you can be replaced with this gizmo for a fraction of the cost."
It's an insult. It would be one thing if it were true - my objection could safely be dismissed as the grumbling of a buggy whip manufacturer, however this is objectively, measurably wrong.
Most of the energy of the people pushing the AI hype goes towards obscuring this. When objective reality is presented to them in irrefutable ways, the response is inevitably: "but the next version will!"
It won't. Not with the current approach. The stochastic parrot will never learn to think.
That doesn't mean it's not useful. It demonstrably is, it's an incredibly valuable tool for entire classes of problems, but using it as a cheap replacement for skilled professionals is madness.
What will the world be left with when we drive those professionals out?
Do you want an AI deciding your healthcare? Do you want a codebase that you've invested your life savings into written by an AI that can't think?
How will we innovate? Who will be able to do fundamental research and create new things? Why would you bother going into the profession at all? So we're left with AIs training on increasingly polluted data, and relying on them to push us forward. It's a farce.
I've been seriously considering hanging up my spurs and munching popcorn through the inevitable chaos that will come if we don't course correct.
That problem statement is:
- Not all tests add value
- Some tests can even create dis-value (ex: slow to run, thus increasing CI bills for the business without actually testing anything important)
- Few developers understand what good automated testing looks like
- Developers are incentivized to write tests just to satisfy code coverage metrics
- Therefore writing tests is a chore and an afterthought
- So they reach for an LLM because it solves what they perceive as a problem
- The tests run and pass, and they are completely oblivious to the anti-patterns just introduced and the problems those will create over time
- The LLMs are generating hundreds, if not thousands, of these problems
So yeah, the problem is 100% the developers who don't understand how to evaluate the output of a tool that they are using.
But unlike functional code, these tests are - in many cases - arguably creating disvalue for the business. At least the functional code is a) more likely to be reviewed and code quality problems addressed and b) even if not, it's still providing features for the end user and thus adding some value.
I am curious if this is still understandable in wider software engineering circles, esp outside the HN and Linux bubbles.
I worked with many junior developers who didn't learn and kept making the same mistakes and asking the same questions even months into the job.
I found LLMs to be far more advanced in comparison to what I had to deal with.
It expands match blocks against highly complex enums from different crates, then tab completes test cases after I write the first one. Sometimes even before that.
I read and understand 100% of the code it outputs, so I'm not so worried about falling too far astray...
being too prescriptive about it (like prompting "don't write comments") makes the output worse in my experience
How often do you use coding LLMs?
Weren't there 2 or 3 dating apps that were launched before the "vibecoding" craze that went extremely popular and got extremely hacked weeks/months in? I also distinctly remember a social network having firebase global tokens on the clientside, also a few years ago.
I have experimented with vibe coding. With Claude Code I could produce a useful and usable small React/TS application, but it was hard to maintain and extend beyond a fairly low level of complexity. I totally agree that vibe coding (at the moment) is producing a lot of slop code, I just don't think Tea is an example of it from what I understand.
That has nothing to do with AI/LLMs.
If you can't understand what the tool spits out either; learn, throw it away, or get it to make something you can understand.
It's not about lines of code or quality it's about solving a problem. If the problem creates another problem then it's bad code. If it solves the problem without causing that then great. Move onto the next problem.
How have you found it not to be significantly better for those purposes?
The "not good enough for you to trust" is a strange claim. No matter what source of info you use, outside of official documentation, you have to assess its quality and correctness. LLM output is no different.
This sort of implies you are not reading and deeply understanding your LLM output, doesn't it?
I am pretty strongly against that behavior
they are the equivalent.
there is already an 80-95% productivity reduction by just reading about them on Hacker News.
But you're right.
There's the old trope that systems programmers are smarter than applications programmers, but SWE-Bench puts the lie to that. Sure, SWE-Bench problems are all in the language of software, but applications programmers take badly specified tickets in the language of product managers, testers and end users and have to turn that into the language of SWE-Bench to get things done. I am not that impressed with 65% performance on SWE-Bench because those are not the kind of tickets I have to resolve at work; rather, at work, if I want to use AI to help maintain a large codebase, I need to break the work down into that kind of ticket.
"Toy project" is usually used in a different context (demonstrate something without really doing something useful): yours sounds more like a "hobby project".
tl;dr: in the future when vibe coding works 100% of the time, logically the only companies that will exist are the ones that have processes that AI can’t do, because all the other parts of the supply chain can all be done in-house
Intelligence is not some universal abstract thing achievable after a certain computational threshold is reached. Rather, it's a quality of the behavior patterns of specific biological organisms following their drives.
There are so many flaws in your plan, I have no doubt that "AI" will ruin some companies that try to replace humans with a "tin can". LLMs are being inserted loosey-goosey into too many places by people that don't really understand the liability problems it creates. Because the LLM doesn't think, it doesn't have a job to protect, it doesn't have a family to feed. It can be gamed. It simply won't care.
The flaws in "AI" are already pretty obvious to anyone paying attention. It will only get more obvious the more LLMs get pushed into places they really do not belong.
Again, appreciate your thoughts, I have a huge amount of respect for your work. I hope you have a good one!
Depending on the environment, I can imagine the worst devs being net negative.
It helps me being lazy because I have a rough expectation of what the outcome should be - and I can directly spot any corner cases or other issues the AI proposed solution has, and can either prompt it to fix that, or (more often) fix those parts myself.
The bottom 20% may not have enough skill to spot that, and they'll produce superficially working code that'll then break in interesting ways. If you're in an organization that tolerates copy and pasting from stack overflow that might be good enough - otherwise the result is not only useless, but as it provides the illusion of providing complete solution you're also closing the path of training junior developers.
Pretty much all AI attributed firings were doing just that: Get rid of the juniors. That'll catch up with us in a decade or so. I shouldn't complain, though - that's probably a nice earning boost just before retirement for me.
I must comprehend code at the abstraction level I am working at. If I write Python, I am responsible for understanding the Python code. If I write Assembly, I must understand the Assembly.
The difference is that Compilers are deterministic with formal specs. I can trust their translation. LLMs are probabilistic generators with no guarantees. When an LLM generates Python code, that becomes my Python code that I must fully comprehend, because I am shipping it.
That is why productivity is capped at review speed, you can't ship what you don't understand, regardless of who or what wrote it.
> Why? Because LLMs don’t just autocomplete. They generate. And in doing so, they challenge our identity, not just our workflows.
is what raised flags in my head. Rather than explain the difference between glorified autocompletion and generation, the post assumes there is a difference then uses florid prose to hammer in the point it didn't prove.
I've heard the paragraph "why? Because X. Which is not Y. And abcdefg" a hundred times. Deepseek uses it on me every time I ask a question.
I've told the same Claude to write me unit tests for a very well known well-documented API. It was too dumb to deduce what edge cases it should test, so I also had to give it a detailed list of what to test and how. Despite all of that, it still wrote crappy tests that misused the API. It couldn't properly diagnose the failures, and kept adding code for non-existing problems. It was bad at applying fixes even when told exactly what to fix. I've wasted a lot of time cleaning up crappy code and diagnosing AI-made mistakes. It would have been quicker to write it all myself.
I've tried Claude and GPT4o for a task that required translating imperative code that writes structured data to disk field by field into explicit schema definitions. It was an easy, but tedious task (I've had many structs to convert). AI hallucinated a bunch of fields, and got many types wrong, wasting a lot of my time on diagnosing serialization issues. I really wanted it to work, but I've burned over $100 in API credits (not counting subscriptions) trying various editors and approaches. I've wasted time and money managing context for it, to give it enough of the codebase to stop it from hallucinating the missing parts, but also carefully trim it to avoid distracting it or causing rot. It just couldn't do the work precisely. In the end I had scrap it all, and do it by hand myself.
I've tried gpt4o and 4-mini-high to write me a specific image processing operation. They could discuss the problem with seemingly great understanding (referencing academic research, advanced data structures). I even got a Python that had correct syntax on the first try! But the implementation had a fundamental flaw that caused numeric overflows. AI couldn't fix it itself (kept inventing stupid workarounds that didn't work or even defeated the point of the whole algorithm). When told step by step what to do to fix it, it kept breaking other things in the process.
I've tried to make AI upgrade code using an older version of a dependency to a newer one. I've provided it with relevant quotes from the docs (I know it would have been newer than its knowledge cutoff), and even converted parts of the code myself, so it could just follow the pattern. The AI couldn't properly copy-paste code from one function to another. It kept reverting things. When I pointed out the issues, it kept apologising, saying what new APIs it's going to use, and then use the old APIs again!
I've also briefly tried GH copilot, but it acted like level 1 tech support, despite burning tokens of a more capable model.
This has never been the case in any company I've ever worked at. Even if you can finish your day's work in, say, 4 hours, you can't just dip out for the other 4 hours of the day.
Managers and teammates expect you to be available at the drop of a hat for meetings, incidents, random questions, "emergencies", etc.
Most jobs I've worked at eventually devolve into something like "Well, I've finished what I wanted to finish today. I could either stare at my monitor for the rest of the day waiting for something to happen, or I could go find some other work to do. Guess I'll go find some other work to do since that's slightly less miserable".
You also have to delicately "hide" the fact that you can finish your work significantly faster than expected. Otherwise the expectations of you change and you just get assigned more work to do.
Anyway, I still see hallucinations in all languages, even javascript, attempting to use libraries or APIs that do not exist. Could you elaborate on how you have solved this problem?
Your article does not specifically say 10x, but it does say this:
> Kids today don’t just use agents; they use asynchronous agents. They wake up, free-associate 13 different things for their LLMs to work on, make coffee, fill out a TPS report, drive to the Mars Cheese Castle, and then check their notifications. They’ve got 13 PRs to review. Three get tossed and re-prompted. Five of them get the same feedback a junior dev gets. And five get merged.
> “I’m sipping rocket fuel right now,” a friend tells me. “The folks on my team who aren’t embracing AI? It’s like they’re standing still.” He’s not bullshitting me. He doesn’t work in SFBA. He’s got no reason to lie.
That's not quantifying it specifically enough to say "10x", but it is saying no uncertain terms that AI engineers are moving fast and everyone else is standing still by comparison. Your article was indeed one of the ones I specifically wanted to respond to as the language directly contributed to the anxiety I described here. It made me worry that maybe I was standing still. To me, the engineer you described as sipping rocket fuel is an example both of the "degrees of separation" concept (it confuses me you are pointing to a third party and saying they are trustworthy, why not simply describe your workflow?), and the idea that a quick burst of productivity can feel huge but it just doesn't scale in my experience.
Again, can you tell me about what you've done to no longer have any hallucinations? I'm fully open to learning here. As I stated in the article, I did my best to give full AI agent coding a try, I'm open to being proven wrong and adjusting my approach.
It can actually be worse when they do. Formalizing behavior means leaving out behavior that can't be formalized, which basically means if your language has undefined behavior then the handling of that will be maximally confusing, because your compiler can no longer have hacks for handling it in a way that "makes sense".
Gemini CLI (it's free and I'm cheap) will run the build process after making changes. If an error occurs, it will interpret it and fix it. That will take care of it using functions that don't exist.
I can get stuck in a loop, but in general it'll get somewhere.
The thing is that the company is hunting for better value, and you are looking for a better deal.
If company can get 2x engineers' production at lower cost, you are only more valuable than having 2 engineers producing as much if you are cheaper. Your added value is this extra 1x production, but if you are selling "that" at the same price, they are just as well off by hiring two engineers instead of you: there is no benefit to having you over them.
If you can do it cheaper, then you are more valuable the cheaper you are. Which is why I said 1.5x cost is splitting the value/cost between you and the employer.
You know like when the loom came out there were probably quite a few models but using it was similar. Like cars are now.
An LLM is an auto-regressive model - it is trying to predict continuations of training samples purely based on the training samples. It has no idea what were the real-world circumstances of the human who wrote a training sample when they wrote it, or what the real-world consequences were, if any, of them writing it.
For an AI to learn on the job, it would need to learn to predict its own actions in any specific circumstance (e.g. circumstance = "I'm seeing/experiencing X, and I want to do Y"), based on its own history of success and failure in similar circumstances... what actions led to a step towards the goal Y? It'd get feedback from the real world, same as we do, and therefore be able to update its prediction for next time (in effect "that didn't work as expected, so next time I'll try something different", or "cool, that worked, I'll remember that for next time").
Even if a pre-trained LLM/AI did have access to what was in the mind of someone when they wrote a training sample, and what the result of this writing action was, it would not help, since the AI needs to learn how to act based on what is in its own (ever-changing) "mind", which is all it has to go on when selecting an action to take.
The feedback loop is also critical - it's no good just learning what action to take/predict (i.e. what actions others took in the training set) unless you also have the feedback loop of what the outcome of that action was, and whether that matches what you predicted would happen. No amount of pre-training can remove the need for continual learning for the AI to correct its own on-the-job mistakes and learn from its own experience.
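In rough pseudocode, the missing loop looks something like this; a sketch of the argument, not a claim about how any current system works:

def on_the_job_learning(agent, world):
    # Act, observe the real outcome, compare it with the prediction, and update.
    # Pre-training alone never closes this loop.
    while True:
        situation = world.observe()
        action, predicted = agent.decide(situation)
        actual = world.execute(action)
        agent.update(situation, action, predicted, actual)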
The 5% is an increase in straight-ahead code speed. I spend a small fraction of my time typing code. Smaller than I'd like.
And it very well might be an economically rational subscription. For me personally, I'm subscription averse based on the overhead of remembering that I have a subscription and managing it.
Just by virtue of Rust being relatively short-lived I would guess that your code base is modular enough to live inside reasonable context limits, and written following mostly standard practice.
One of the main files I work on is ~40k lines of code, and one of the main proprietary API headers I consume is ~40k lines of code.
My attempts at getting the models available to Copilot to author functions for me have often failed spectacularly - as in I can't even get it to generate edits at prescribed places in the source code, follow examples from prescribed places. And the hallucination issue is EXTREME when trying to use the big C API I alluded to.
That said Claude Code (which I don't have access to at work) has been pretty impressive (although not what I would call "magical") on personal C++ projects. I don't have Opus, though.
But I have rules that are quite important for successfully completing a task by my standards and it's very frustrating when the LLM randomly ignores them. In a previous comment I explained my experiences in more detail but depending on the circumstances instruction compliance is 9/10 times at best, with some instructions/tasks as poor as 6/10 in the most "demanding" scenarios particularly as the context window fills up during a longer agentic run.
Not even remotely
> LLM output is no different
It is different
A search result might take me to the wrong answer but an LLM might just invent nonsense answers
This is a fundamentally different thing and is more difficult to detect imo
"You had a problem. You tried to solve it with regex. Now you have two problems"
1) your original problem 2) your broken regex
I would like to propose an addition
"You had a problem. You tried to solve it with AI generated regex. Now you have three problems"
1) your original problem 2) your broken regex 3) your reliance on AI
Except the documentation lies and in reality your vendor shipped you a part with timing that is slightly out of sync with what the doc says and after 3 months of debugging, including using an oscilloscope, you figure out WTF is going on. You report back to your supplier and after two weeks of them not saying any thing they finally reply that the timings you have reverse engineered are indeed the correct timings, sorry for any misunderstandings with the documentation.
As an applications engineer, my computer doesn't lie to me, and memory generally stays at the value I set it to unless I did something really wrong.
Backend services are the easiest thing in the world to write, I am 90% sure that all the bullshit around infra is just artificial job security, and I say this as someone who primarily does backend work now days.
It's conceivable that that's going to happen, eventually. But that'd likely require models a lot more advanced than what we have now.
The agent approach, with lead devs administering and merging the code the agents made, is feasible with today's models. The missing part is the tooling around the models and the development practices that standardize this workflow.
That's what I'd expect to take around 5 yrs to settle.
There's a long history in AI where neural nets were written off as useless (Minsky was the famous destroyer of the idea, I think) and yet in the end they blew away the alternatives completely.
We have something now that's useful in that it is able to glom a huge amount of knowledge but the cost of doing so it tremendous and therefore in many ways it's still ridiculously inferior to nature because it's only a partial copy.
A lot of science fiction has assumed that robots, for example, would automatically be superior to humans - but are robots self-repairing or self replicating? I was reading recently about how the reasons why many developers like python are the reasons why it can never be made fast. In other words you cannot have everything - all features come at a cost. We will probably have less human and more human AIs because they will offer us different trade offs.
And you are confident that the human receptionist will never fall for social engineering?
I don't think data protection is even close to the biggest problem with replacing all/most employees with bots.
I was watching to learn how other devs are using Claude Code, as my first attempt I pretty quickly ran into a huge mess and was specifically looking for how to debug better with MCP.
The most striking thing is she keeps on having to stop it doing really stupid things. She slightly glosses over those points a little bit by saying things like "I roughly know what this should look like, and that's not quite right" or "I know that's the old way of installing TailwindCSS, I'll just show you how to install Context7", etc.
But in each 10 minute episodes (which have time skips while CC thinks) it happens at least twice. She has to bring her senior dev skills in, and it's only due to her skill that she can spot the problem in seconds flat.
And after watching much of it, though I skipped a few episodes at the end, I'm pretty certain I could have coded the same app quicker than she did without agentic AI, just using the old chat-window AIs to bash out the React boilerplate and help me quickly scan the documentation for getting offline. The initial estimate of 18 days the AI came up with in the plan phase would only hold true if you had to do it "properly".
I'm also certain she could have too.
[1] https://www.youtube.com/watch?v=erKHnjVQD1k
It's worth a watch if you're not doing agentic coding yet. There were points I was impressed with what she got it to do. The TDD section was quite impressive in many ways, though it immediately tried to cheat and she had to tell it to do it properly.
If I'm writing a series of very similar test cases, it's great for spamming them out quickly, but I still need to make sure they're actually right. It's also easier to spot errors in them because I didn't type them out myself.
It's also decent for writing various bits of boilerplate for list / dict comprehensions, log messages (although they're usually half wrong, but close enough to what I was thinking), time formatting, that kind of thing. All very standard stuff that I've done a million times but I may be a little rusty on. Basically StackOverflow question fodder.
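For concreteness, a made-up example of the kind of snippet meant here, the sort of boilerplate these tools reliably fill in:

import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

def summarize(records):
    # Dict comprehension + log message + time formatting: classic assistant fodder.
    statuses = [r["status"] for r in records]
    counts = {status: statuses.count(status) for status in set(statuses)}
    logger.info("summarized %d records at %s", len(records),
                datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"))
    return counts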
But for anything complex and domain-specific, it's more wrong than it's right.
Literally unwinnable scenarios. Only way to succeed is to just sit your ass in the chair. Almost no manager actually cares about your actual output - they all care about presentation and appearances.
I _never_ made the claim that you could call that 10x productivity improvement. I’m hesitant to categorize productivity in software in numeric terms as it’s such a nuanced concept.
But I’ll stand by my impression that a developer using ai tools will generate code at a perceptibly faster pace than one who isn’t.
I mentioned in another comment that the major flaw in your productivity calculation is that you aren't accounting for the work that wouldn't have gotten done otherwise. That's where my improvements are almost universally coming from. I can improve the codebase in ways that weren't justifiable before, in places that do not suffer from the coordination costs you rightly point out.
I no longer feel like my peers are standing still, because they’ve nearly uniformly adopted ai tools. And again, you rightly point out, there isn’t much of a learning curve. If you could develop before them you can figure out how to improve with them. I found it easier than learning vim.
As for hallucinations I don’t experience them effectively _ever_. And I do let agents mess with terraform code (in code bases where I can prevent state manipulation or infrastructure changes outside of the agents control).
I don’t have any hints on how. I’m using a pretty vanilla Claude Code setup. But I’m not sure how an agent that can write and run compile/test loops could hallucinate.
It's a pretty obvious rhetorical tactic: everybody associates "hallucination" with something distinctively weird and bad that LLMs do. Fair enough! But then they smuggle more meaning into the word, so that any time an LLM produces anything imperfect, it has "hallucinated". No. "Hallucination" means that an LLM has produced code that calls into nonexistent APIs. Compilers can and do in fact foreclose on that problem.
I posted a demo here a while ago where I try to have it draw turtle graphics:
https://news.ycombinator.com/item?id=44013939
Since then I've also provided enough glue that it can interact with the Arch Linux installer in a VM (or actual hardware, via serial port) - with sometimes hilarious results, but at least some LLMS do manage to install Arch with some guidance:
https://github.com/aard-fi/arch-installer
Somewhat amusingly, some LLMs have a tendency to just go on with it (even when it fails), with rare hallucinations - while others directly start lying and only pretend they logged in.
Which came first...
At the heart is my hobby of reading web and light novels. I've been implementing various versions of a scraper and ePub reader for over 15 years now, ever since I started working as a programmer.
I've been reimplementing it over the years with the primary goal of growing my experiences/ability. In the beginning it was a plain Django app, but it grew from that to various languages such as elixir, Java (multiple times with different architecture approaches), native Android, JS/TS Frontend and sometimes backend - react, angular, trpc, svelte tanstack and more.
So I know exactly how to implement it, as I've gone through a lot of versions of the same functionality. And the last version I implemented (TanStack) was in July, via Claude Code, and it got to feature parity (and more) within roughly 3 weeks.
And I might add: I'm not positive about this development either, whatsoever. I'm just expecting this to happen, to the detriment of our collective futures (as programmers)
> This is a fundamentally different thing and is more difficult to detect imo
99% of the time it's not. You validate and correct/accept like you would any other suggestion.
Repeat after me, token prediction is not intelligence.
We went from "this thing is a stochastic parrot that gives you poems and famous people styled text, but not much else" to "here's a fullstack app, it may have some security issues but otherwise it mainly works" in 2.5 years. People expect perfection, and move the goalposts. Give it a second. Learn what it can do today, adapt, prepare for what it can do tomorrow.
but the principle is the same: if the human isn’t doing theory-building, then no one is
> I mentioned in another comment that the major flaw in your productivity calculation is that you aren't accounting for the work that wouldn't have gotten done otherwise. That's where my improvements are almost universally coming from. I can improve the codebase in ways that weren't justifiable before, in places that do not suffer from the coordination costs you rightly point out.
I'm a bit confused by this. There is work that apparently is unlocking big productivity boosts but was somehow not justified before? Are you referring to places like my ESLint rule example, where eliminating the startup costs of learning how to write one allows you to do things you wouldn't have previously bothered with? If so, I feel like I covered this pretty well in the article and we probably largely agree on the value that productivity boost. My point is still stands that that doesn't scale. If this is not what you mean, feel free to correct me.
Appreciate your thoughts on hallucinations. My guess is the difference between what we're experiencing is that in your code hallucinations are still happening but getting corrected after tests are run, whereas my agents typically get stuck in these write-and-test loops and can't figure out how to solve the problem, or it "solves" it by deleting the tests or something like that. I've seen videos and viewed open source AI PRs which end up in similar loops as to what I've experienced, so I think what I see is common.
Perhaps that's an indication of that we're trying to solve different problems with agents, or using different languages/libraries, and that explains the divergence of experiences. Either way, I still contend that this kind of productivity boost is likely going to be hard to scale and will get tougher to realize as time goes on. If you keep seeing it, I'd really love to hear more about your methods to see what I'm missing. One thing that has been frustrating me is that people rarely share their workflows after making big claims. This is unlike previous hype cycles where people would share descriptions of exactly what they did ("we rewrote in Rust, here's how we did it", etc.) Feel free to email me at the address in my about page[1] or send me a request on LinkedIn or whatever. I'm being 100% genuine that I'd love to learn from you!
If, according to you, LLMs are so good at avoiding hallucinations these days, then maybe we should ask an LLM what hallucinations are. Claude, "in the context of generative AI, what is a hallucination?"
Claude responds with a much broader definition of the term than you have imagined -- one that matches my experiences with the term. (It also seemingly matches many other people's experiences; even you admit that "everybody" associates hallucination with imperfection or inaccuracy.)
Claude's full response:
"In generative AI, a hallucination refers to when an AI model generates information that appears plausible and confident but is actually incorrect, fabricated, or not grounded in its training data or the provided context.
"There are several types of hallucinations:
"Factual hallucinations - The model states false information as if it were true, such as claiming a historical event happened on the wrong date or attributing a quote to the wrong person.
"Source hallucinations - The model cites non-existent sources, papers, or references that sound legitimate but don't actually exist.
"Contextual hallucinations - The model generates content that contradicts or ignores information provided in the conversation or prompt.
"Logical hallucinations - The model makes reasoning errors or draws conclusions that don't follow from the premises.
"Hallucinations occur because language models are trained to predict the most likely next words based on patterns in their training data, rather than to verify factual accuracy. They can generate very convincing-sounding text even when "filling in gaps" with invented information.
"This is why it's important to verify information from AI systems, especially for factual claims, citations, or when accuracy is critical. Many AI systems now include warnings about this limitation and encourage users to double-check important information from authoritative sources."
I distinctly did not say that. I said your article was one of the ones that made me feel anxious. And it's one of the ones that spurred me to write this article. I demonstrated how your language implies a massive productivity boost from AI. Does it not? Is this not the entire point of what you wrote? That engineers who aren't using AI are crazy (literally the title) because they are missing out on all this "rocket fuel" productivity? The difference between rocket fuel and standing still has to be a pretty big improvement.
The points I make here still apply, there is not some secret well of super-productivity sitting out in the open that luddites are just too grumpy to pick up and use. Those who feel they have gotten massive productivity boosts are being tricked by occasional, rare boosts in productivity.
You said you solved hallucinations, could you share some of how you did that?
And this is the problem.
Masterful developers are the ones you pay to reduce lines of code, not create them.
You can certainly be very productive by doing what you are told. I'd probably fail at that metric against many engineers, yet people usually found me very valuable to their teams (I never asked if it was 1x or 2x or 0.5x compared to whatever they perceive as average).
The last few years, I have focused on empowering engineers to be equal partners in deciding what gets done, by teaching them to look for and suggest options that are 10% of the effort for 90% of the value to the user (or 20/80, and sometimes even 1% of the effort for 300% of the value). They can best see what is simple and easy to do with the codebase, so if they put a customer hat on, they unlock huge wins for their team and business.
Suppose A solves a problem and writes the solution down. B reads the answer and repeats it. Is B reasoning, when asked the same question? What about one that sounds similar?
Are they a constant source of low level annoyance? Sure. But I've never had to look at a bus timing diagram to understand how to use one, nor worried about an nginx file being rotated 90 degrees and wired up wrong!
LLMs are still stochastic parrots, though highly impressive and occasionally useful ones. LLMs are not going to solve problems like "what is the correct security model for this application given this use case".
AI might get there at some point, but it won't be solely based on LLMs.
This may be a definition problem, then. I don't think "the agent did a dumb thing that it can't reason its way out of" is a hallucination. To me a hallucination is a pretty specific failure mode: it invents something that doesn't exist. Models still do that for me, but the build-test loop sets them right on that nearly perfectly. So I guess the model is still hallucinating, but the agent isn't, so the output is unimpacted. So I don't care.
For the "agent is dumb" scenario, I aggressively delete and reprompt. This is something I've actually gotten much better at with time and experience, both so it doesn't happen often and so I can course-correct quickly. I find it works nearly as well for teaching me about the problem domain as my own mistakes do, but it's much faster to get to.
But if I were going to be pithy: aggressively deleting work output from an agent is part of their value proposition. They don't get offended and they don't need explanations why. Of course, they don't learn well either; that's on you.
Good luck ever getting that. I've asked that about a dozen times on here from people making these claims and have never received a response. And I'm genuinely curious as well, so I will continue asking.
Right across this thread we have the author of the post saying that when they said "hallucinate", they meant that if they watched they could see their async agent getting caught in loops trying to call nonexistent APIs, failing, and trying again. And? The point isn't that foundation models themselves don't hallucinate; it's that agent systems don't hand off code with hallucinations in it, because they compile before they hand the code off.
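To make that mechanism concrete, here is a minimal sketch (in Python) of the kind of loop being described: the agent's output is only handed off once the project's own build and tests pass, so a call to a nonexistent API surfaces as an ordinary compile or test failure and triggers a retry. `generate_patch` and `apply_patch` are hypothetical stand-ins for the model call and the file-writing step, not any particular agent framework's API.

```python
import subprocess

MAX_ATTEMPTS = 5

def build_and_test() -> subprocess.CompletedProcess:
    # Run the project's own compiler/test suite; a hallucinated API
    # shows up here as an ordinary build or test failure.
    return subprocess.run(["cargo", "test"], capture_output=True, text=True)

def agent_loop(task: str, generate_patch, apply_patch) -> bool:
    """Keep regenerating until the code actually compiles and passes tests.

    `generate_patch(task, feedback)` is a hypothetical call into the model;
    `apply_patch(patch)` writes the result into the working tree.
    """
    feedback = ""
    for _ in range(MAX_ATTEMPTS):
        patch = generate_patch(task, feedback)
        apply_patch(patch)
        result = build_and_test()
        if result.returncode == 0:
            return True  # only now does the work get handed off for review
        # Feed compiler/test errors (e.g. "no method named `foo`") back in;
        # this is how calls to nonexistent APIs get caught and retried.
        feedback = result.stdout + result.stderr
    return False  # give up; a human deletes the branch and reprompts
```

Under this framing, the foundation model may well hallucinate mid-loop, but what leaves the loop has at least compiled and passed the tests it was given.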
Perhaps start from the assumption that I have in fact spent a fair bit of time doing this job at a high level. Where does that mental exercise take you with regard to your own position on AI tools?
In fact, you don’t have to assume I’m qualified to speak on the subject. Your retort assumes that _everyone_ who gets improvement is bad at this. Assume any random proponent isn’t.
I'm trying to write a piece to comfort those that feel anxious about the wave of articles telling them they aren't good enough, that they are "standing still", as you say in your article. That they are crazy. Your article may not say the word 10x, but it makes something extremely clear: you believe some developers are sitting still and others are sipping rocket fuel. You believe AI skeptics are crazy. Thus, your article is extremely natural to cite when talking about the origin of this post.
You can keep being mad at me for not providing a detailed target list; I've said several times that that's not the point of this. You can keep refusing to actually elaborate on how you use AI day to day and how you solve its problems. That's fine. I don't care. I care a lot more about talking to the people who are actually engaging with me (such as your friend) and helping me to understand what they are doing. Right now, if you're going to keep not actually contributing to the conversation, you're just kinda being a salty guy with an almost unfathomable 408,000 karma going through every HN thread every single day and making hot takes.
Frankly I've seen LLMs answer better than people trained in security theatre so be very careful where you draw the line.
If you're trying to say they struggle with what they've not seen before: yes, provided that what is new isn't within the phase space they've been trained over. Remember, there are no photographs of cats riding dinosaurs, but SD models can generate them.
Deleting and re-prompting is fine. I do that too. But even one cycle of that often means the whole prompting exercise takes me longer than if I just wrote the code myself.
What people aren't doing is proving to you that their workflows work as well as they say they do. You want proof, you can DM people for their rate card and see what that costs.
The article in question[0] has the literal tag line:
> My AI Skeptic Friends Are All Nuts
How much saner is someone who isn't nuts than someone who is? 10x saner? What do the specific numbers matter, given you're not writing a paper?
You're enjoying the clickbait benefits of using strong language and then acting offended when someone calls you out on it. Yes, maybe you didn't literally say "10x", but you said or quoted things in exactly that ballpark, and it's worthy of a counterpoint like the one the OP has provided. They're both interesting articles with strong opinions that make the world a more interesting place, so idk why you're trying to disown the strength with which you wrote your article.
One of the most valuable qualities of humans is laziness.
We're constantly seeking efficiency gains, because who wants to carry buckets of water, or take laundry down to the river?
Skilled developers excel at this. They are "lazy" when they code - they plan for the future, they construct code in a way that will make their life better, and easier.
LLMs don't have this motivation. They will gleefully spit out 1000 lines of code when 10 will do.
It's a fundamental flaw.
A lot of the advantage is that it can make forward progress when I can’t. I can check to see if an agent is stuck, and sometimes reprompt it, in the downtime between meetings or after lunch before I start whatever deep thinking session I need to do. That’s pure time recovered for me. I wouldn’t have finished _any_ work with that time previously.
I don't need to optimize my time around babysitting the agent; I can do that in the margins. Watching the agents is low-context work. That adds the capability to generate working solutions during time that previously produced none.
> As of March, 2025, this library is very new, prerelease software.
I'm not looking for personal proof that their workflows work as well as they say they do.
I just want an example of a project in production with active users depending on the service for business functions that has been written 1.5/2/5/10/whatever x faster than it otherwise would have without AI.
Anyone can vibe code a side project with 10 users or a demo meant to generate hype/sales interest. But I want someone to actually have put their money where their mouth is and give an example of a project that would have legal, security, or monetary consequences if bad code was put in production. Because those are the types of projects that matter to me when trying to evaluate people's claims (since those are what my paycheck actually depends on).
Do you have any examples like that?
I'm not offended at all. I'm saying: no, I'm not a valid cite for that idea. If the author wants to come back and say "10x developer", a term they used twenty-five times in this piece, was just a rhetorical flourish, something they conjured up in their own head, that's great! That would resolve this small dispute neatly. Unfortunately, you can't speak for them.
Either way, I'm happy that you are getting so much out of the tools. Perhaps I need to prompt harder, or the codebase I work on has just deviated too much from the stuff the LLMs like and simply isn't a good candidate. Either way, appreciate talking to you!
That code tptacek linked you to? It's part of our (Cloudflare's) MCP framework. Which means all of the companies mentioned in this blog post are using this code in production today: https://blog.cloudflare.com/mcp-demo-day/
There you go. This is what you are looking for. Why are you refusing to believe it?
(OK fine. I guess I should probably update the readme to remove that "prerelease" line.)
They used it 25 times in their piece, and in your piece you stated that being interested in "the craft" is something people should do on their own time from now on. Strongly implying, if not outright stating, that the processes and practices we've refined over the past 70 years of software engineering need to move aside for the next hotness that has only been out for 6 months. Sure, you never said "10x", but to me it read entirely like you're doing the "10x" dance. It was a good article and it definitely has inspired me to check it out.
At some point you have to accept that no amount of proof will convince someone who refuses to be swayed. It's very frustrating because, while these are wonderful tools already, it's clear that the biggest thing that makes a positive difference is people using and improving them. They're still in relative infancy.
I want to have the kind of conversations we had back at the beginning of web development, when people were delighted at what was possible despite everything being relatively awful.
1. Would have legal, security, or monetary consequences if bad code was put in production
2. Was developed using an AI/LLM/Agent/etc that made the development many times faster than it otherwise would have (as so many people claim)
I would love to hear an example like "I used Claude to develop this hosting/ecommerce/analytics/inventory management service that is used in production by 50 paying companies. Using an LLM we deployed the project in 4 weeks where it would normally take us 4 months." Or "We updated an out-of-date code base for a client in half the time it would normally take and have not seen any issues since launch."
At the end of the day I code to get paid. And it would really help to be able to point to actual cases where both money and negative consequences of failure are on the line.
So if you have any examples please share. But the more people deflect the more skeptical I get about their claims.
I never look at my own readmes so they tend to get outdated. :/
Fixing: https://github.com/cloudflare/workers-oauth-provider/pull/59
Since my day job is creating systems that need to be operational and predictable for paying clients, examples of front-end mockups, demos, apps with no users, etc. don't really matter that much at the end of the day. It's like the difference between being a great speaker in a group of 3 friends vs. standing up in front of a 30-person audience with your job on the line.
If you have some examples, I'd love to hear about them because I am genuinely curious.
However, there is a bit of irony in that you're happy to point out my defensiveness as a potential flaw while you're getting hung up on nailing down the "10x" claim with precision. As an enjoyer of both articles I think this one is a fair retort to yours, so I think it's a little disappointing to get distracted by the specifics.
If only we could accurately measure 1x developer productivity, I imagine the truth might be a lot clearer.
I mean, it's pretty simple: there are a lot of big claims that I read but very few tangible examples that people share where the project has consequences for failure. Someone else replied with some helpful examples in another thread. If you want to add another one feel free, if not that's cool too.
I spent probably a day building prompts and tests and getting an example of the failing behavior in Python, and then I wrote pseudocode and had it implement and write comprehensive unit tests in Rust. About three passes, with manual review of every line. I also have an MCP server that calls out to O3 for a second-opinion code review and passes the feedback back in.
Very fun stuff
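For what it's worth, the shape of a second-opinion review pass like the one described above can be sketched very simply. This is not the commenter's actual MCP setup; `reviewer` and `implementer` below are hypothetical callables standing in for whichever models and tools you wire up.

```python
import subprocess

def get_diff(base: str = "main") -> str:
    # Collect the change under review from the working tree.
    return subprocess.run(
        ["git", "diff", base], capture_output=True, text=True
    ).stdout

def second_opinion_pass(reviewer, implementer, max_rounds: int = 3) -> None:
    """Ask a second model to critique the diff, feed the critique back
    to the implementing agent, and repeat until the reviewer has no
    further comments or the round limit is hit.
    """
    for _ in range(max_rounds):
        comments = reviewer(get_diff())
        if not comments:
            break  # reviewer has no further objections
        implementer(comments)  # implementing agent revises the change
```

The design point is just that the review model only ever sees the diff, while the implementing agent only ever sees the review comments, which keeps the two roles from collapsing into one model rubber-stamping its own work.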
I rolled out a PR yesterday that was a one-shot change to our fundamental storage layer on our hot path. This was part of a large codebase, and that file has existed for four years. It hadn't been touched in 2. I literally didn't touch a text editor on that change.
I have first hand experience watching devs do this with payment processing code that handles over a billion dollars on a given day.
When you say you didn't touch a text editor, do you mean you didn't review the code change or did you just look at the diff in the terminal/git?
You're rebutting a claim about your rant that, if it ever did exist, has been backed away from and disowned several times.
From [0]
> > Wait, now you're saying I set the 10x bar? No, I did not.
>
> I distinctly did not say that. I said your article was one of the ones that made me feel anxious. And it's one of the ones that spurred me to write this article.
and from [1]
> I'm trying to write a piece to comfort those that feel anxious about the wave of articles telling them they aren't good enough, that they are "standing still", as you say in your article. That they are crazy. Your article may not say the word 10x, but it makes something extremely clear: you believe some developers are sitting still and others are sipping rocket fuel. You believe AI skeptics are crazy. Thus, your article is extremely natural to cite when talking about the origin of this post.
Because I was the instigator of that change, a second code owner was required to approve the PR as well. That PR didn't require any changes, which is uncommon but not particularly rare.
It is _common_ for me to only give feedback to the agents via the GitHub GUI, the same way I do with humans. Occasionally I have to pull the PR down locally and use the full powers of my dev environment to review, but I don't think that is any more common than with people. If anything it's less common, because of the tasks the agents get: typically they either do well or I kill the PR without much review.
My post is about how those types of claims are unfounded and make people feel anxious unnecessarily. He just doesn't want to confront that he wrote an article that directly says these words and that those words have an effect. He wants to use strong language without any consequences. So he's trying to nitpick the things I say and ignore my requests for further information. It's kinda sad to watch, honestly.
Speaking of his rant, in it, he says this:
> [Google's] Gemini’s [programming skill] floor is higher than my own.
which, man... if that's not hyperbole, either he hasn't had much experience with the worst Gemini has to offer, or something really bad has happened to him. Gemini's floor is "entirely-gormless junior programmer". If a guy who's been consistently shipping production software since the mid-1990s isn't consistently better than that, something is dreadfully wrong.