I'm fairly AI-skeptical not on grounds of "do they work" but "are they good for the world". I feel that getting AIs to do this kind of review work is a rare case that doesn't outsource thinking and deskill workers. It doesn't trigger the same alarm bells as having the AI write the code (including having the AI fix the issues it discovers). That's setting aside environmental and other ethical concerns, which are still significant to me.
I have been impressed by the recent quality of AI code reviews*, but the experience of interacting with 3 separate AI reviewers via GitHub PRs is pretty terrible. Having more local-oriented and jj/rebase-aware review rounds would be great.
*context: fairly large PHP/Laravel backend and Vue frontend
[1]: https://milvus.io/blog/ai-code-review-gets-better-when-model...
I open-sourced it on GitHub; search for alexwwang/tdd-pipeline if you're interested.
I wonder how we can evaluate these two options: using AI to 100X the output versus using AI to advance one's craft.
In the meantime, the productivity gain of AI is real. Case in point: an engineering org at Snowflake met all its OKRs ahead of time in the first quarter, for the first time in the company's history. It had never happened before, and usually meeting 70% of the planned OKRs would be considered an achievement. I can imagine the stress of the engineers when they see such an outcome.
You can very effectively iterate alone using the LLM as a mirror, rephrasing what you put in and adding a bit.
You can use LLMs to quickly create prototypes to give to other human beings to help you with the next iteration.
If you get something from someone else to iterate on, you can use the LLM to help you understand it by rephrasing things in a way that suits you better.
But instead everything anybody seems to be talking about is one-shotting things and AI iterating with other AI.
The big problem here is that the one thing AI does not have is agency. The name "AI agent" is wishful thinking and marketing.
Also feels much better than pure vibe-coding (which I still do for personal projects that aren't mission critical for anyone).
It's still very slow. It took me two hours to write code that generates JSON data and then to write a web page that displays a knowledge graph.
One thing you have to be aware of is that the LLM will happily generate code for you and you have to discipline it from time to time. I notice that my reading comprehension begins to suffer if I don't write the code myself and have to understand what the LLM wrote for me as opposed to the LLM correcting where I went wrong.
One thing I would like to try with an LLM is understanding a large and complex existing codebase like OpenSCAD that doesn't leverage my existing skillset (high-level programming languages, with OpenSCAD as primary language in the past year). That has always been a barrier to contribution for me.
This reminds me of the article above. Now people have diverse ideas on agentic coding. Some suggest human-in-the-loop while others suggest giving a detailed specification and letting the agent run freely; some suggest leveraging LLMs' high productivity, and here we get an opinion that LLMs can actually write good code slowly.
It's good to see more practical and varied opinions emerging, turning LLMs into literally a tool instead of something to be hated or hyped.
In my own practice, I find LLMs (SOTA ones) good at medium-level tasks, those that need some reasoning and planning. However, their design taste on architecture is unexpectedly disgusting. Writing interfaces myself and asking LLMs to fill in implementations, alongside context-completing tools like context7, deepwiki, docs.rs MCPs, etc., and giving them an escape hatch (e.g. encouraging the use of the AskUser tool in Claude Code), may be considered my best practice.
I'll use AI to design the implementation of a medium sized, cross cutting feature. Review all the details, maybe iterate on just that. Then implement with Claude 4.7 Max - which runs slower, but does a better job. Then review the implementation, then have Codex GPT 5.5 xhigh fast review it - which almost always finds corner cases. Have Claude fix those - Claude is better at writing intuitive maintainable code versus Codex overengineered/shortcut filled code. (Codex is better at finding/fixing bugs and doing reviews - it's annoyingly pedantic)
Then repeat with fresh Claude/Codex instances, having them both review the current staged changes, getting feedback, and handling the feedback. Then covering it in tests. I mean overall I still implement the feature faster than coding it manually, but I spend a majority of the time going back and forth with reviews and handling corner cases, and at the end I have what I feel is a really solid implementation of whatever feature I'm working on. The v1 feature feels more like a v3 given the amount of iteration it already went through.
When using Claude Code or Codex, that is all gone. Claude Code is extremely eager to reach the end goal to the point that it feels like a fever dream to write code with it. In the end, I have low confidence about edge cases and fit into the project's architectural and design goals.
On top of that, I enjoy programming, reverse engineering, etc. and I feel that the LLMs, while able to solve some problems or deliver some features, take that fun away. I'm trying really hard to find a workflow with them that I'm confident in, but I fear that workflow is just chat, search, and being a rubber duck for my thoughts.
A lot of people think a lot of things, but I don't think the majority of people think the point of using LLMs is so they can produce low-quality code. Do they produce low-quality code sometimes or often? Of course. But they also produce high-quality code very often. And sometimes they just do a "fine" job.
One of the promises - and there are plenty of cases where it's met and where it falls drastically short - is that agentic coding tools can help us write code faster that is just as good as or better than what a human can write. One of the other big ideal payoffs is that agentic coding can allow non-programmers to create things that previously required programmers to create.
We can debate as to how successful we've been toward the two goals above, but I think it's misguided to say that the majority of people think LLMs should produce lower quality code.
I use these tools at both work and for personal side projects and I was expecting to watch and learn. But these opinion pieces without examples are way too many now.
When LLMs started being somewhat useful for coding a few years ago, and I found they were in fact great at boilerplate, in fact pretty much only good at boilerplate ca 2023 or so, it got me thinking about all the accommodations we make in design and systems architecture that are sort of tacitly understanding who we're working with and their strengths and weaknesses.
The modern models have their own very different strengths and weaknesses compared to humans, and deploying them is a really interesting exercise of different architectural and engineering skills. I've enjoyed it, and hope I continue to.
But! Because of AI I was able to rapidly hack out like 4 variants of this feature that I didn't like. And felt comfortable throwing them away just as quick.
I'm not exactly sure what <foo> is but I feel it. I think it's quality and authenticity and craftsmanship. That difference between an expensive tool and a cheap one that you can't easily describe but you just know it.
Is there a word for this? I bet the Japanese or Germans have a word for this.
I use AI a lot now. But I also do it in small steps. It isn't a craftsman, but it can help me be one.
what's wrong with (depending on the language) checkstyle, sonarlint, ruff, mypy, xmllint, and/or eslint?
I can relate to this. When I spend time writing unit tests, even ones that add only 1% of code coverage, it is honestly a wholesome moment for me to ship confidently.
1. I write a list of things I want to have without AI support
2. I discuss the list with an LLM, which occasionally reveals obviously missing things I hadn't thought about or just things that would be smart to have. Or sometimes the LLM doesn't get it and wants to funnel me down a commonly walked path, which is a non-goal
3. From that list I draft an implementation plan containing things like how the code shall be structured, which language, libraries, build systems, etc to use. This may even contain some data models and considerations that are more detailed, like for example ideas about how a specific interaction shall be event sourced. I work on that, till I feel a satisfactory level of clarity has been reached
4. Actual writing of code as a back and forth between manual writing, letting an LLM write something and so on. LLMs suck at writing CSS that feels like good UX design to me, so usually templates, layout and CSS will be (re)written entirely by hand
5. Bug-hunting and guessing potential edge cases is one thing where LLMs really shine. Often if the work before that was quality the LLM has an okay time coming up with fixes that are no worse than what I would have done.
I guess he could write a code harness to do this, or gin one up really quickly, but that kind of tooling today seems like the purview of the practitioner -- you -- it's frankly faster for you to spec what you want to try this idea out if you want it automated than it would likely be to deal with his code.
I'd much rather django-admin startproject, npm init, or meteor create and get deterministic output than prompt an LLM and get who knows what.
In a mature web ecosystem, boilerplate is minimal. I worry that now that we've given this task to LLMs, less development effort will go into startproject-esque CLIs and good opinionated defaults.
Also maybe this will help: https://hnup.date/hn-sota
The qwen model is my daily driver this week.
Great how the promoters are mirroring the current anti-AI sentiment. The next step is canceling all subscriptions and not using AI at all. Maybe your mind will work again.
I have my own skill: 5 rounds of research/planning/test-planning. Interactive with me in loop for all important decisions. Starts with high level shape, then details. Planning can take 2-3 days of my time, then the implementation agent can take many hours (Opus 4.7). It splits the implementation across many phases/commits, each with its own code-review fix loop. Deep code review at the end can take another hour or two. It opens a PR, Gemini reviews, it reads out and resolves those issues.
Projects still take days or weeks, but 5x faster than doing it all myself.
Edit: the skill - https://github.com/scosman/vibe-crafting
Ingesting a big project and commenting on it gets expensive. I'm not sure how expensive.
AI is an excellent rubber duck and test writer. Maybe I sniff my farts too much but I like my code just the way I want it lol
Because this version of AI is worth 10 trillion dollars.
While the pragmatic versions from realists you can find all over this thread are ultimately probably less of a speed boost than just having your CEO/local micromanager be conveniently on vacation during critical periods when the work actually gets done.
Now if it's my job then I can't have a knowledge debt, and if Claude is down I'll continue working manually, because I know and understand and can continue without having to understand a lot of logic before continuing.
1. Some bad idea gets embedded into the context that you just can't argue away
2. Some important idea gets lost in compression and the AI veers off into funland without recourse.
In both cases it is often better to start over or just do it yourself. I sometimes find myself asking for a summary, editing it and then using the edited one to seed a new session.
Edit: s/Finland/funland/
Often depending on how complex the feedback, I'll do it one at a time addressing each one individually. And after the feedback is addressed, I'll go back to the AI that generated the feedback and say like, "I handled 4/5 items you found, can you double check."
It's similar to handling PR feedback, where you do it, validate it, but then still have to submit it for peer review.
And maybe don't use tools that lock you into one model?
This is exactly what I settled upon after my own trying really hard. It is liberating, I have no fear at all!
But if you're not used to code reviewing, it can certainly help to still write yourself.
Guessing you're not at a FAANG or similar company. For the last 6 months at least there's been tremendous pressure from leadership (including highly experienced IC engineers) to let AI take the reins, the assumption being that future AI assistants will be able to deal with any level of complexity and tech debt created today.
Given that everyone agrees that reviewing all AI-generated code is impractical (if you let the agents rip at maximum available bandwidth), and that "harness engineering" is at best immature and at worst complete snake oil when it comes to ensuring system stability, maintainability, and quality, I do believe that it's fair to claim that most engineers are, in fact, supportive of low quality code generated by LLMs.
Fwiw I do see pushback here and there, but only from the lowest rungs on the career ladder - ICs with enough experience to see where this train is headed, but no ability to save it. Management needs to see the results of their policies first, and that will take months or even years to fully play out.
No not really. These are separate questions from what the article posits. The argument is about how do we use these tools, our approach as developers, and if the results are going to be as rosy as advertised.
I'm following an Ideas -> PRD -> Issues -> Tasks methodology, where each task has a bunch of sub-tasks. I have it just do one (or a few, I'm having it do Red/Green/Refactor as separate sub-steps, so I review the Red case, and then once that's good, do the Green and Refactor steps, and review those).
People believe that you can only use LLMs for sloppy programming. But you can also use them to write ten times more code in the form of Swiss cheese model tests and domain-specific languages.
You write ten times more code than necessary and all that extra code is testing. Projects like SQLite do that because they need to be perfect.
Before LLMs we had to use engineers for that, and it was painful, repetitive work. They were always late and made many more mistakes than LLMs, especially because it was dull and tedious for great engineers to spend their time on.
Now we write tests, and when all tests pass we write new tests to check the tests.
We divide each complex problem into small subproblems and we guarantee each of them by formal means. We have multiple ways of solving the same problem, usually with one brute-force solution that is simple and guaranteed to work but inefficient, and we can use it to compare against more efficient methods.
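To make the brute-force comparison concrete, here is a minimal sketch of that kind of differential test. The problem (maximum subarray sum) and every function name are my own illustration, not something from the parent comment: a slow, obviously correct version acts as the oracle for a faster one on random inputs.

```python
import random


def max_subarray_brute(xs):
    """O(n^2) reference: try every non-empty contiguous slice."""
    best = xs[0]
    for i in range(len(xs)):
        total = 0
        for j in range(i, len(xs)):
            total += xs[j]
            best = max(best, total)
    return best


def max_subarray_fast(xs):
    """O(n) Kadane's algorithm, the implementation under test."""
    best = current = xs[0]
    for x in xs[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best


def differential_test(trials=500, seed=0):
    """Compare both versions on random inputs; return the first mismatching input, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(1, 12))]
        if max_subarray_brute(xs) != max_subarray_fast(xs):
            return xs
    return None
```

A `None` result only means no disagreement was found on the sampled inputs. It is evidence, not a proof, so it complements the formal guarantees mentioned above rather than replacing them.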
Before machines could do that, the people doing it were burned out and exhausted, and always left pending work to complete.
Man so much work to retrofit something that obviously, simply, plainly - just does not work. How about just writing the code yourself? You can even consult AI on the libraries or whatever, but how about just building that model in your head YOURSELF and not loading up on AI slop and trying to memorise that crap. The names of the functions will ring different in your memory once you spend some time thinking over whether you picked the right and clear name vs. just going with whatever statistical median the slop machine picked for you.
Having taste and the ability to author high quality prompts is still the most important thing. It was always the most important thing if you think abstractly about how all of this works.
- Opus 4.7 writes the code
- I make GPT-5.5 in Codex review it (given context)
- I provide the review back to Opus and ask it to verify the review findings
- Make Opus plan the fixes then execute them
- Ask GPT-5.5 to review the fixes and check if they solve the problems
The downside is you use less tokens.
You're better off plonking down an existing framework and getting all the structural boilerplate benefits the LLM can leverage.
LLMs are far better at frameworks they have a lot of training data for, ones that have been around for a while. They write more idiomatic, ecosystem-friendly code. Does that still matter?
And I wasn't attached to that complex implementation in the way I would be if I architected it from scratch, so it was easy to move on.
I feel like AI promises a factory that can make Walmart quality tools. Which I think will make the well-crafted tools more important than ever.
Once this is done, the mechanical coding parts are mostly routine (for codex)
What I've tried to do is make the bot write detailed spec documents, slowly building it over time as I explain the full problem.
It works for the most part, but if you have some non-standard requirement, the agent seems to skip over that part of the spec document when it starts to code. Or it adds needless checks for situations that I said will never happen.
You will outgrow it at some point.
IMO if you are not shipping out faster then the faster work gains are meaningless.
If you are shipping faster, you're probably picking up more work and shipping everything too fast, leading to burnout.
You say "all that time" babysitting AIs but in my experience it isn't that much time; if anything the back and forth at the planning stages is more productive than when I'm doing it by myself because I'm being asked questions and having to think things through from different angles.
Kind of a shocking thing to see argued on HN. Maybe it's just the vibe coders.
My goal is to draft the solution with AI, write it myself but faster with autocomplete, then throw an AI review at it.
Many AI models seem biased to cutting corners by default when generating code, even when you ask them not to. But a few simple follow up prompts can address that. Simply ask for covering corner cases with tests, test all the known non happy paths, look for weaknesses, verify adherence to SOLID principles, do security audits, etc. It will find issues. With bigger projects, you can actually make it file those issues in gh with labels and priorities. And then you can make it iterate on fixing issues with separate PRs.
On a recent project, I made it implement a simple benchmark test for measuring throughput. I had a hunch it was doing very sub optimal things. I then asked it to look for potential performance bottlenecks and use the benchmark to verify improvements. At that point I already had a lot of end to end tests to verify correctness. So, these performance tweaks were relatively low risk. I got about two orders of magnitude improvement and a lot more graceful behavior when pushed to the limit.
If you have a bit of experience engineering systems, just treat these tools like they are junior developers. Competent but likely to skip some essential steps. So, just double check with a lot of pointed questions: "did you do X? If not, do it now". Anything that needs repeated asking, turn it into a guard rail / skill.
There's a bit of effort and skill involved with this. I imagine a lot of less experienced developers might struggle to get good results because they aren't asking for the right things.
By default it uses pi agent core + pi ai (from the excellent pi coding agent) as a multi model runtime but also supports a Claude Agent SDK runtime.
I can have an implementation and review process of an OpenSpec change run anywhere from 2 hours to 24+ hours going through review/fix/verification rounds automatically until the implementation matches the spec and any additional reviewers are done finding issues after the fix rounds.
it's going to be fully open sourced in the next two weeks and fully free to use
This isn't a binary is/isn't thing though. What if only 80% of my task is, how would I know that the other part isn't, if I haven't worked it through fully?
What if my task is generally represented, but for my specific context, there are specific details that aren't?
How would I know until I've reasoned through it myself? At that point having the LLM do the work doesn't add much value.
There will be a large majority of people who hold these opinions, because they weren't capable of or didn't care enough to write good code in the before times.
Sure. But how long will that last? LLMs are getting better at programming much faster than I am.
Imagine a plot with time on the X axis and LLM skill on the Y axis. The line goes up and to the right. On the left is GPT3, or GPT3.5 with the very first glimmers of programming ability just a few short years ago. In the middle is Opus 4.7 now.
Where's the intersection point, where AI skill is higher than that of humans? Less than 10 years. I'd guess less than 5 years.
I'm sure there's an interesting study on how users 'leak' their preference unintentionally to the LLM; perhaps when users list their options, they often put their preferred option first; but not showing the cards in my hand has been very useful when thinking through a problem with LLMs.
I find it useful to let it generate benchmarks comparing the approaches. Turns out AI is terrible at guessing what's faster or allocates less.
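For anyone who wants a starting point, this is roughly the shape of benchmark I mean, using only the Python standard library (`timeit` for time, `tracemalloc` for peak allocation). The two string-building functions are placeholder candidates I made up for illustration; swap in whatever approaches you are actually comparing.

```python
import timeit
import tracemalloc


def build_with_concat(n):
    """Candidate A (illustrative): repeated string concatenation."""
    out = ""
    for i in range(n):
        out += str(i)
    return out


def build_with_join(n):
    """Candidate B (illustrative): a single join over a generator."""
    return "".join(str(i) for i in range(n))


def measure(fn, n=10_000, repeats=5):
    """Return (best seconds per call, peak bytes allocated) for fn(n)."""
    # Take the minimum over several repeats to reduce scheduling noise.
    best = min(timeit.repeat(lambda: fn(n), number=10, repeat=repeats)) / 10
    tracemalloc.start()
    fn(n)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return best, peak


if __name__ == "__main__":
    for fn in (build_with_concat, build_with_join):
        seconds, peak = measure(fn)
        print(f"{fn.__name__}: {seconds * 1e6:.1f} us/call, peak {peak} bytes")
```

The numbers depend on the interpreter version and the machine, which is the point: measure instead of trusting a guess, whether the guess is the model's or your own.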
I would also recommend explaining the specs and doing a lot of your back and forth with a lower end model and set it to a higher end model only once the conversation history has all the context you feel the higher end model needs.
[0] At least, in my experience, "micromanaging" the AI is what gives me the best results. Iterating on the initial design, then iterating on the plan, then reviewing the proposed code changes (including tests), then getting an independent code review from another LLM, etc. If you give an LLM too much latitude that's when the really shitty code and ill-considered breaking changes/obliteration of existing functionality starts to creep in.
Define 'aware'. The volume of code for a feature/system that makes it worth using a more complex workflow such as this one is definitely larger than what a human can even briefly review and build a mental model of within a reasonable amount of time. Reasonable meaning not considerably delaying the process. When deadlines loom and management adds pressure, this 'awareness' is the first thing that goes out the window.
A lot of people seem convinced that the point of AI coding is to write low-quality code as fast as possible. Spew out barely-passable slop, open massive PRs, and merge them unvetted. Ship it!
But the thing is, LLMs are very flexible. And you can use them just as effectively to write high-quality code more slowly.
This statement seems completely obvious to me at this point, and I almost didn't want to write this post for that reason. But there seem to be enough people convinced that LLMs are only good as slop cannons that it's worth making the opposite case.
If Mythos taught us anything, it's that LLM agents are really good at finding bugs. Throw them at a codebase enough times, and they will find so many bugs that you'll barely know what to do with them.
Like many others, I've also found this is true of non-Mythos models - some may be better than others at finding subtle bugs or avoiding false positives, but the fact is that the latest public models from Anthropic and OpenAI are good enough to find plenty of bugs in an unscrutinized codebase.
The problem is not so much finding the bugs, but instead prioritizing and validating them. For this reason I have a Claude skill I adapted from this article's core insight, which is that the more, different models you throw at a PR review, the less likely you are to get hallucinations or bogus bugs.
The skill says (paraphrasing):
Run a Claude sub-agent, Codex, and Cursor Bugbot to find bugs in this PR ranked by critical/high/medium/low. Once they're all done, review their findings, do your own research to rule out false positives, and write a final report.
That's basically it. You can add your own definition of "bug" if you want - mine has stipulations about the KISS and DRY principles, writing accessible HTML/JSX, using proper indexes for SQL queries, etc.
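As a rough illustration of the "review their findings" step, here is a sketch of how findings from several reviewers might be merged and ranked before anyone validates them. Nothing here is the actual skill: the `Finding` shape, the `consolidate` helper, and the reviewer labels are assumptions for the example, and the agreement count is only a triage hint, not a substitute for ruling out false positives.

```python
from collections import defaultdict
from dataclasses import dataclass

# Lower number means more severe; mirrors the critical/high/medium/low ranking.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    reviewer: str  # hypothetical label, e.g. "claude", "codex", "bugbot"
    file: str
    line: int
    severity: str  # one of the SEVERITY_ORDER keys
    summary: str


def consolidate(findings):
    """Group findings by location, keep the worst severity, and rank them."""
    groups = defaultdict(list)
    for f in findings:
        groups[(f.file, f.line)].append(f)

    report = []
    for (file, line), group in groups.items():
        worst = min(group, key=lambda f: SEVERITY_ORDER[f.severity])
        report.append({
            "file": file,
            "line": line,
            "severity": worst.severity,
            "reviewers": sorted({f.reviewer for f in group}),
            "summary": worst.summary,
        })

    # Most severe first; within a severity, findings that more reviewers agree on come first.
    report.sort(key=lambda r: (SEVERITY_ORDER[r["severity"]],
                               -len(r["reviewers"]), r["file"], r["line"]))
    return report
```

Grouping by exact file and line is deliberately naive. Real reviewers rarely point at the same line for the same bug, so in practice the merging is better left to the orchestrating agent, as the skill text above does.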
In my experience, this skill always finds tons of bugs in a PR, and the false positive rate is near zero. It finds so many bugs that you'll be bored senseless if you try to tackle them all. They'll range from critical security or correctness bugs to the more mundane medium-level perf bugs to low-level "this comment is misleading"-type bugs.
My typical workflow is:
When I use this technique, I haven't necessarily seen my velocity go up. If anything, the review process often finds pre-existing bugs, so I end up on a tangential side-quest where I'm writing unit tests and fixing subtle flaws that pre-date the PR. This is the opposite of the "10x productivity" slop-cannon style of development that most people imagine when they think of vibe coding, but I find it very satisfying.
It's a great way to improve the overall health of the codebase while also teaching you about the odd corners of it. In my experience, the happy-path of a complex architecture is less interesting than its failure modes. And pre-LLMs, this is usually how I got familiar with a codebase anyway: understanding where the assumptions break down, and then getting my hands dirty to fix it.
If you're the kind of person who is skeptical that AI coding is good for anything, then I doubt this post will persuade you. But if you're the kind of developer who uses agents to write multi-hundred-line PRs that you barely understand yourself, I'd invite you to slow down a bit and try this other, slower style of "vibe coding." Ask an agent how your PR works and how it might fail. Have it write Markdown docs with Mermaid charts if necessary. Use Matt Pocock's /grill-me skill until you understand the entire PR front-to-back.
You might not be more "productive" in terms of raw lines of code. You might burn a ton of tokens just to find out that your entire plan was wrongheaded from the start. But I find this style of coding to be a more super-powered version of the kind of programming I was already trying to do before LLMs: careful, methodical, quality-obsessed, focused on making things better for the next coder.
So take a deep breath, slow down, try this technique, and see if you don't enjoy writing better code more slowly.
The majority of jobs are still paid on a 40 hour per week basis. Disappearing for a day each week (20%) won't fly when you're full time.
Finally feel like I have a good workflow where I can fully benefit from these things without sacrificing my understanding of what they're doing.
I was more annoyed than anything that I didn't hit this moment until my 40s.
Except it's not just reddit (I quit reddit 15 years ago). It's the whole internet.
Keeping that many tasks in parallel, running all the time will kill you.
My visual for this is a capybara soaking
https://www.gettyimages.com/detail/news-photo/capybaras-take...
I am very visual and spatial. The first investment I make in my home or even visiting somewhere for more than 3-4 days where I will need to work without coworking is buying a whiteboard.
So now I'm here with all these tools trying to use a remarkable tablet to draw and show the AI what I'm thinking. It's just not fulfilling. Cleaning toilets isn't either. Lots of jobs have felt like a full on race to software factory and it's clear we're going there with AI and the "cognitive debt" from half (or less) activated brains driving the code generation is going to be massive.
This seems to me to be one of the key problems for AI usage in general. Students have this problem where it can be incredibly helpful in actually learning but late at night with the assignment due early tomorrow the temptation is just too strong to have it do the thing.
Maybe it's just me, but I've never understood how one understands from reading code. Yes you can understand what that code does, but not why it was done that way instead of a different way. In the end I only understand it deeply if I end up writing it. Chatting through it is helpful to me, but having AI crank out code loses all of that context pretty quickly.
I'm not disagreeing. Just curious how you think about this, and if there are key parts of your process that help you stay contexted in.
When it comes to the actual implementation I prefer to work through it in small steps, where the AI explains to me exactly what it's about to do and why (and I approve) along the way. This enables me to catch it if it's about to do something I disagree with beforehand. And reduces the time I need to spend reviewing in the end.
Yes, I thought the same as well because that was the same line of thought that made me write my comment.
>Except it's not just reddit (I quit reddit 15 years ago). It's the whole internet.
Yea, they are like a slingshot. You need to let go at some point or else it will drag you back.
So to me this loop really never properly ends so it never feels like I'm done. Which is not great from a psychological point of view.
That's more or less all of them, they do just generate the likely combinations of tokens, there is no critical thought involved. If you want to approximate that, review iterations are probably the right way to go about it, without the full conversation context either so there's no model output like "I'm doing X because it seems like the correct way to go about Y." but rather a fresh context which allows for more critical predictions.
Here's what works for me, can be made into a skill in whatever you use:
I would like you to do a review loop!
How this works:
* once implementation is done, all tools must be run and pass: whatever is configured in the project like Ruff, Oxlint and Oxfmt, depending on the tech stack (also don't run such tools directly, look at package.json or similar project files/configurations/run scripts first; like if it's a stack that has compilation, compile the app, if there are tests, then run those; just know that you DO NOT generally need to stand up the whole app); if there is a projectlint-rules folder then that means you probably should run ProjectLint as well (local tool, use projectlint --help or projectlint --docs, or better yet, look at whether package.json or README.md have any instructions on how to run it)
* once all the code seems okay, you will run THREE parallel sub-agents for code review: each looking at ALL changed code (not each having a different sub-section) and looking for CRITICAL/SERIOUS issues (not nitpicks), with the goal of not missing anything and building consensus
* whatever CRITICAL/SERIOUS issues are found, if you can confirm that they're real and not false positives, you will then fix and remember to run the tools after, after which you will do another review iteration, followed by a fix iteration if needed and so on
* remember that the review and fix loop must END with an iteration of the review agents returning that there are no CRITICAL/SERIOUS issues - you cannot just do fixes and say that there is nothing remaining yourself (and also remember that the reviews are done when all of the tools pass, like when the code is linted and formatted etc.)
* at the end, produce a summary post that has a table, the rows being iterations, the columns for each of the agents (A, B, C) showing FIX/OK and then a column called Iteration summary; the goal for this is to show a summary how many iterations it took and what was fixed, you can also include text alongside the table as normally
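If you would rather drive that loop from a script than from a prompt, the control flow could look something like the sketch below. All names are hypothetical: `run_tools`, the reviewer callables, and `fix` stand in for whatever actually runs your linters, sub-agents, and fixes, and the reviewers are assumed to return only issues already confirmed as real.

```python
def review_loop(run_tools, reviewers, fix, max_iterations=10):
    """Iterate until a review round reports no serious issues.

    run_tools() -> bool : True when the project's linters/tests pass
    reviewers           : dict mapping agent name -> callable returning a list of issues
    fix(issues) -> None : applies fixes for the given issues
    Returns rows of (iteration, {agent: "FIX" | "OK"}, summary).
    """
    rows = []
    for iteration in range(1, max_iterations + 1):
        # Tools must pass before every review round, including after fixes.
        if not run_tools():
            raise RuntimeError("project tools must pass before review")
        results = {name: review() for name, review in reviewers.items()}
        status = {name: ("FIX" if issues else "OK") for name, issues in results.items()}
        issues = sorted({i for found in results.values() for i in found})
        if not issues:
            # The loop may only end on a clean review round.
            rows.append((iteration, status, "no CRITICAL/SERIOUS issues"))
            return rows
        fix(issues)
        rows.append((iteration, status, "fixed: " + ", ".join(issues)))
    raise RuntimeError("review loop did not converge")


def format_table(rows):
    """Render the iteration rows as a plain pipe-delimited summary table."""
    names = list(rows[0][1])
    lines = [
        "| Iteration | " + " | ".join(names) + " | Iteration summary |",
        "|" + "---|" * (len(names) + 2),
    ]
    for iteration, status, summary in rows:
        cells = " | ".join(status[n] for n in names)
        lines.append(f"| {iteration} | {cells} | {summary} |")
    return "\n".join(lines)
```

The iteration cap is my addition; the prompt itself has no limit, but a script probably wants one so a reviewer that keeps finding nitpicks cannot spin forever.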
The ProjectLint references might need to be removed (replace with whatever higher level linting/architecture tools you have, if any), but that's the overall idea. It does use a LOT of tokens though, but almost always there's something to fix. Of course, the problem is that sometimes there will be nitpicks or the fixes themselves won't be fully okay, though in general this trends towards slightly better code, even with something like Opus 4.7.
A calm, attentive alternative to vibe coding: restful coding.
It's much easier to read and review code after a refreshing cat nap, especially with a real cat.
Too bad that's not usually acceptable to do that in the office. It should be! Slacking off by sword fighting all day is too exhausting.
Multitasking does not mean burnout. It just means you are not wasting time while idling. Multitasking was not invented for AI coding assistants. What do you think feature branches are used for?
Same with AI images. It feels good for 2 seconds to see what I asked for and I'm immediately disinterested. Same way, I've generated many Suno songs, but I don't care about them after a few listens.
But once the prototype is done, I spend too much time refining the details and fixing everything that goes wrong (bad design details, wrong implementation, half-done testing...)
A full agentic setting would be too expensive for me (I wonder how much Garry Tan spends...)
So I'd like to take a more balanced approach:
1. Use the usual LLM specs and prototyping to get the basis of the feature and the boilerplate done
2. Write the code myself, with the help of an AI autocomplete (this is where I'm looking for recommendations)
3. Use a setup like the one OP mentioned to review the code
Either you follow everything it does, revise the plans, do the code review, manual adjustments, etc, or you run sessions in parallel, not being that attentive and constantly context-switch (also resulting in less attention I guess).
I fail to see the benefits honestly.
Yak driven development.
Even code you write yourself, given enough time, you will forget the why unless you wrote comments. In a way comments are as much for you as they are for others.
Even before AI, understanding code you didn't write was essential to working on a team with other developers. If you can't understand the code from reading it, then that's part of the feedback loop: too complex, needs comments, etc.
On large teams you'll spend as much time reading code as you do writing it. And long term when it comes to writing maintainable code - the ability for others to read and understand it, including the why of it, is paramount. Your code could literally be around for decades.
Oh. I am aware. It is not that deep. But who you argue with still matters. There was a point where I abandoned Reddit and HN. I came back to HN because people here also seem to have grown up. Reddit stays mostly the same.
I credit the moderation here for that, I mean allowing people to grow out of the echo chamber.
Your feature branch is to put things aside and send them to CI, or wait and think on them. Not to have four of them running in parallel in your head frying you.
Getting past that is the problem we face now.
Code is never missing contexts. If what your code is doing is not obvious to the reader, it is bad code that needs to be fixed. Things like cryptic low-level expressions should be extracted to helper functions with descriptive names or even extracted into a class, and classes need to comply with the single responsibility principle.
After you put together a plan, today's models can take well over a minute to execute it. Also, your work shifts to code review and executing acceptance tests, followed by either tweaking your current change or moving on to the next change.
This is really not about context changes. This is about not having to switch contexts, because your focus stays on architecture and review instead of on deep dives to type out code.
> Your feature branch is to put things aside and send them to CI, or wait and think on them.
No, not really. Feature branches, as well as most types of branches, is to set aside work fronts that are in progress and run in parallel.
A full, whole, entire _minute_ ?! Sixty seconds ! Oh no, they must be optimized away, we do not deserve our free time like so, we should toil until we fall over because... Growth?
It's still context switching. Either what you're doing is surface enough that you don't give a shit, it doesn't matter and you don't review it anyways (so the only context is basically the prompt you wrote or the nth SELECT * FROM table CRUD piece of crap), or you're context switching and it's fucking you over. The context isn't about remembering how you write if err != nil, it's the expected behaviour of what you're working on.
You're not getting a promotion from doing this, you're getting burnout.
> Feature branches, as well as most types of branches, is to set aside work fronts that are in progress and run in parallel
They're not running in parallel, unless you use work trees. They were put to the side, because you can't continue or finish the work they're about. Even just three branches in parallel in a modestly active repo that happen to be long lived drift enough that just keeping them up to date with develop makes it a waste of time.
Focus on one or two things, and do them well.
That, or get checked for ADHD.
Another thing is that unless you are doing really complicated stuff, you probably don't need the latest models running on high. I'm still on 5.4 medium with codex. I see very little reason to change that.
Part of agentic engineering is figuring out how to be economical with tokens and time. You can sacrifice one for the other of course. But there are diminishing returns as well where spending 10x more doesn't actually get you 10x more quality/results.
I did some evals with a prompt like this a few months ago, when I had some subscription tokens to burn, using Opus 4.5 I think. What I found was:
1. Running two subagents was somewhat useful
2. Running three started to get redundant
3. Any more than three was pointless (at least when using the same model)
However, even two were returning roughly 60% the same results.
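For anyone who wants to put a number on that kind of redundancy, one possible measure (an assumption here, not necessarily how the figure above was obtained) is the Jaccard similarity between two reviewers' sets of findings. The function name and the finding labels below are hypothetical, and the sketch assumes findings have already been normalized so that the same issue gets the same label.

```python
def finding_overlap(a: set, b: set) -> float:
    """Jaccard similarity: shared findings divided by all distinct findings."""
    union = a | b
    if not union:
        return 1.0  # two empty reports agree trivially
    return len(a & b) / len(union)


# Hypothetical, already-normalized finding labels from two reviewers.
reviewer_1 = {"sql-injection", "missing-null-check", "n+1-query", "race-condition"}
reviewer_2 = {"missing-null-check", "n+1-query", "race-condition", "unbounded-retry"}

print(finding_overlap(reviewer_1, reviewer_2))  # 3 shared out of 5 distinct -> 0.6
```

The hard part in practice is the normalization step, since two agents rarely describe the same issue in the same words.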
Much, much more effective was splitting out into audits through different lenses:
* One looking for security issues
* One looking for whether the task was completed successfully
* One looking for performance issues
* One looking for contract/maintainability issues
* One looking at test coverage
Etc.
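The lens split above could be set up roughly like this: one prompt per lens, each still covering the whole change. This is only a sketch. The `LENSES` keys, the `build_lens_prompts` helper and the prompt wording are made up for illustration, and how each prompt is sent to a subagent depends on the tool you use, so that part is left out.

```python
from typing import Dict

# Hypothetical lens names mapped to the focus areas listed above.
LENSES: Dict[str, str] = {
    "security": "security issues",
    "completeness": "whether the task was completed successfully",
    "performance": "performance issues",
    "maintainability": "contract/maintainability issues",
    "tests": "test coverage",
}


def build_lens_prompts(diff: str, lenses: Dict[str, str] = LENSES) -> Dict[str, str]:
    """Return one review prompt per lens; every lens sees the full diff."""
    return {
        name: (
            f"Audit ALL of the following changes, focusing only on {focus}. "
            "Ignore anything outside that focus.\n\n"
            f"{diff}"
        )
        for name, focus in lenses.items()
    }


prompts = build_lens_prompts("diff --git a/app.py b/app.py ...")
for name in prompts:
    print(name)
```

Keeping the lenses in a plain dict makes it cheap to add or drop one and compare how much each lens actually contributes.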
If I had to pay per token, I'd probably look at DeepSeek. In general it feels like it's a bit early for the technology - either our software methods are wasteful, or the hardware hasn't caught up. To me, it appears that we often need to throw more tokens at these problems, not less, since otherwise it's just one-shot slop.