I now have several projects going in languages that I've never used: a side project in Rust and two Go projects. I have a few decades of experience with backend development in Java, Kotlin (the last ten years), and occasionally Python, plus some limited experience with a few other languages. I know how to structure backend projects, what to look for, what needs testing, etc.
A lot of people would insist you need to review everything the AI generates. And that's very sensible. Except AI now generates code faster than I can review it. Our ability to review is now the bottleneck. And when stuff kind of works (evidenced by manual and automated testing), what's the right point to just say it's good enough? There are no easy answers here. But you do need to think about what an acceptable level of due diligence is. Vibe coding is basically the equivalent of blindly throwing something at the wall and seeing what sticks. Agentic engineering is on the opposite side of the spectrum.
I actually emphasize a lot of quality attributes in my prompts: the importance of good design, high cohesion, low coupling, SOLID principles, etc. Just asking for potential refactorings with an eye on those usually yields a few good opportunities. And then all you need to do is say "sounds good, let's do it". I get a little kick out of doing variations on silly prompts like that. "Make it so" is my favorite. Once you have a good plan, it doesn't really matter what you type.
I also ask critical questions about edge cases, testing the non-happy path, hardening, concurrency, latency, throughput, etc. If you don't, AIs tend to default to taking shortcuts, focusing only on the happy path, or hallucinating that it's all fine. But this doesn't necessarily require detailed reviews to find out. You can make the AI review the code and produce detailed lists of everything that is wrong or could be improved. If there's something to be found, it will find it if you prompt it right.
There's an art to this. But I suspect that that too is going to be less work. A lot of this stuff boils down to evolving guardrails to do things right that otherwise go wrong. What if AIs start doing these things right by default? I think this is just going to get better and better.
This experience is familiar to every serious software engineer who has used AI code gen and then reviewed the output:
> But when I reviewed the codebase in detail in late January, the downside was obvious: the codebase was complete spaghetti. I didn’t understand large parts of the Python source extraction pipeline, functions were scattered in random files without a clear shape, and a few files had grown to several thousand lines. It was extremely fragile; it solved the immediate problem but it was never going to cope with my larger vision…
Some people never get to the part where they review the code. They go straight to their LinkedIn or blog and start writing (or having ChatGPT write) posts about how manual coding is dead and they’re done writing code by hand forever.
Some people review the code and declare it unusable garbage, then also go to their social media and post how AI coding is completely useless and they’re not going to use it for anything.
This blog post shows the journey that anyone not in one of those two vocal minorities is going through right now: a realization that AI coding tools can be a large accelerator, but you need to learn how to use them correctly in your workflow and you need to remain involved in the code. It’s not as clickbaity as the extreme takes that get posted all the time, and it’s a little disappointing to read the part where they said hard work was still required. It is a realistic and balanced take on the state of AI coding, though.
Oof, this hit very close to home. My workplace recently got, as a special promotion, unlimited access to a coding agent with free access to all the frontier models for a limited period of time. I find it extremely hard to end my workday when I get into the "one more prompt" mindset, easily clocking 12-hour workdays without noticing.
I like this a lot. It suggests that AI use may sometimes incentivize people to get better at metacognition rather than worse. (It won't in cases where the output is good enough and you don't care.)
> But when I reviewed the codebase in detail in late January, the downside was obvious: the codebase was complete spaghetti...It was extremely fragile; it solved the immediate problem but it was never going to cope with my larger vision...I decided to throw away everything and start from scratch
This part was interesting to me as it lines up with Fred Brooks' "throw one away" philosophy: "In most projects, the first system built is barely usable. Hence plan to throw one away; you will, anyhow."
As this experience indicates, AI tools provide a much faster way of getting to that initial throw-away version. That's their bread and butter; it's where they shine.
Expecting AI tools to go directly to production quality is a fool's errand. This is the right way to use AI - get a quick implementation, see how it works and learn from it but then refactor and be opinionated about the design. It's similar to TDD's Red, Green, Refactor: write a failing test, get the test passing ASAP without worrying about code quality, refactor to make the code better and reliable.
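To make the Red, Green, Refactor analogy concrete, here's a tiny Go sketch (the function and values are invented for illustration):

```go
package main

import (
	"fmt"
	"strconv"
)

// Red: first write a failing test, e.g.
//   if Fizz(3) != "fizz" { t.Fatal("want fizz") }
// Green: the fastest thing that passes, quality be damned:
//   func Fizz(n int) string { return "fizz" }
// Refactor: only now make it correct and clean, with the tests as a safety net.
func Fizz(n int) string {
	if n%3 == 0 {
		return "fizz"
	}
	return strconv.Itoa(n)
}

func main() {
	fmt.Println(Fizz(3), Fizz(4)) // fizz 4
}
```

The AI-generated first draft plays the role of the "Green" step: it works, but the human-driven "Refactor" step is where the quality comes from.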
In time, after this hype cycle has died down, we'll come to realize that this is the best way to make use of AI tools over the long run.
> When I had energy, I could write precise, well-scoped prompts and be genuinely productive. But when I was tired, my prompts became vague, the output got worse
This part also echoes my experience - when I know well what I want, I'm able to write more specific specifications and guide along the AI output. When I'm not as clear, the output is worse and I need to spend a lot more time figuring it out or re-prompting.
I just extended that demo to one that runs the resulting Pyodide library in a browser with a playground interface for trying it out: https://tools.simonwillison.net/syntaqlite
I didn't review the code to understand what Claude did; I reviewed it to verify that it did what it had been told.
It's also nuts to me that he had to go back in later to build in tests and validation. The second an input can be processed, you bet I have tests covering it. The second a UI is being rendered, I have Playwright taking screenshots (or gtksnapshot for my Linux desktop tools).
I think people who are seeing issues at the integration phase of building complex apps are hitting them because they're not keeping the limited context window in mind and preempting those issues by telling their tools exactly how to bridge those gaps.
It also reduces my hesitation to get started on something I don't yet know well enough. Time 'wasted' on vibe-coding felt less painful than time 'wasted' on heads-down manual coding down a rabbit hole.
When I ported pikchr (also from the SQLite project) to Go, I first ported lemon, then the grammar, then supporting code.
I always meant to do the same for its SQL parser, but pikchr grammar is orders of magnitude simpler.
Seconded!
Expanding a thought beyond 280 characters and publishing it somewhere other than the X outrage machine is something we should be encouraging.
This could likely be extracted much more easily now from the new code, but imagine API docs or a mapping of the logical ruleset with interwoven commentary: other devtools could be built easily, bug analysis could be done on the structure of rules independent of code, optimizations could be determined on an architectural level, etc.
LLMs need humans to know what to build. If generating code becomes easy, codifying a flexible context or understanding becomes the goal that amplifies what can be generated without effort.
This is my experience. Tests are perhaps the most challenging part of working with AI.
What’s especially awful is any refactor of existing shit code that has no tests to begin with, where the feature is confusing or inappropriately (and unknowingly) used in multiple places elsewhere.
AI will write test cases showing the logic works at all (fine), but the behavior, especially what should be covered in an integration test, is just not covered at all.
I don’t have a great answer to this yet, especially because this has been most painful to me in a React app, where I don’t know testing best practices. But I’ve been eyeing up behavior driven development paired with spec driven development (AI) as a potential answer here.
Curious if anyone has an approach or framework for generating good tests
This is a great article. I’ve been trying to see how layered AI use can bridge this gap but the current models do seem to be lacking in the ambiguous design phase. They are amazing at the local execution phase.
Part of me thinks this is a reflection of software engineering as a whole. Most people are bad at design. Everyone usually gets better with repetition and experience. However, as there is never a right answer just a spectrum of tradeoffs, it seems difficult for the current models to replicate that part of the human process.
It is really good for getting up to speed with frameworks and techniques though, like they mentioned.
Nowhere is this more obvious in my current projects than with CRUD interface building. It will go nuts building these elaborate labyrinths and I’m sitting there baffled, bemused, foolishly hoping that THIS time it would recognise that a single SQL query is all that’s needed. It knows how to write complex SQL if you insist, but it never wants to.
But even with those frustrations, damn it is a lot faster than writing it all myself.
Unfortunately, AI seems to be divisive. I hope we will find our way back eventually. I believe the lessons from this era will reverberate for a long time and all sides stand to learn something.
As for me, I can’t help but notice there is a distinct group of developers that does not get it. I know because they are my colleagues. They are good people and not unintelligent, but they are set in their ways. Because they are such laggards, I can imagine management eventually forcing them to use AI, which at the moment is not the case. Even I sometimes want to “confront” them about an entire day wasted on something even the free ChatGPT would have handled adequately in a minute or two. It’s sad to see, actually.
We are not doing important things and we ourselves are not geniuses. We know that or at least I know that. I worry for the “regular” developer, the one that is of average intellect like me. Lacking some kind of (social) moat I fear many of us will not be able to ride this one out into retirement.
I have several Open Source projects and wanted to refactor them for a decade. A week ago I sat down with Google Gemini and completely refactored three of my libraries. It has been an amazing experience.
What’s a game changer for me is the feedback loop. I can quickly validate or invalidate ideas, and land on an API I would enjoy using.
90 percent of the things users want either A) don't exist or B) are impossible to find, install, and run without being deeply technical.
These things don't need to scale, and they don't need to be well designed. They are, for the most part, targeted, single-user, single-purpose artifacts: migration scripts between services, quick-and-dirty tools that make bad UIs and workflows less manual and more manageable.
These are the use cases I am seeing from people OUTSIDE the tech sphere adopting AI coding. It is what "non-techies" are using things like open claw for. I have people who in the past would have been told "No, I will not fix your computer" talk to me excitedly about running cron jobs.
Not everything needs to be Snap-on quality; the bulk of end users are going to be happy with Harbor Freight quality because it is better than no tools at all.
Ideally: local; offline.
Or do I have to wrestle it for 250 hours before it coughs up the dough? Last time I tried, the AI systems struggled with some of the most basic C code.
It seemed fine with Python, but then my cat can do that.
I’ve been driving Claude as my primary coding interface the last three months at my job. Other than a different domain, I feel like I could have written this exact article.
The project I’m on started as a vibe-coded prototype that quickly got promoted to a production service we sell.
I’ve had to build the mental model after the fact, while refactoring and ripping out large chunks of nonsense or dead code.
But the product wouldn’t exist without that quick and dirty prototype, and I can use Claude as a goddamned chainsaw to clean up.
On Friday, I finally added a type checker pre-commit hook and fixed the 90 existing errors (properly, no type ignores) in ~2 hours. I tried full-agentic first, and it failed miserably; then I went through error by error with Claude, we tightened up some existing types, fixed some clunky abstractions, and got a nice, clean result.
AI-assisted coding is amazing, but IMO for production code there’s no substitute for human review and guidance.
The tricky part of unit tests is coming up with creative mocks and ways to simulate various situations based on the input data, w/o touching the actual code.
For integration tests, it's massaging the test data and inputs to hit every edge case of an endpoint.
For e2e tests, it's massaging the data, finding selectors that aren't going to break every time the html is changed, and trying to winnow down to the important things to test - since exhaustive e2e tests need hours to run and are a full-time job to maintain. You want to test all the main flows, but also stuff like handling a back-end system failure - which doesn't get tested in smoke tests or normal user operations.
That's a ton of creativity for AI to handle. You pretty much have to tell it every test and how to build it.
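As a sketch of the kind of scaffolding you end up dictating to the AI rather than letting it invent, here's a hand-rolled mock in Go (all names invented for illustration): an interface abstracts the environment so a test can simulate situations without touching the code under test.

```go
package main

import "fmt"

// Clock abstracts the environment so tests can simulate situations
// without touching the code under test or waiting on real time.
type Clock interface{ Now() int64 }

// fakeClock is the hand-rolled mock: it reports whatever time the test wants.
type fakeClock struct{ t int64 }

func (f fakeClock) Now() int64 { return f.t }

// IsExpired is the code under test; it only ever sees the interface.
func IsExpired(c Clock, deadline int64) bool { return c.Now() > deadline }

func main() {
	// Simulate "before deadline" and "after deadline" without waiting.
	fmt.Println(IsExpired(fakeClock{t: 5}, 10))  // false
	fmt.Println(IsExpired(fakeClock{t: 15}, 10)) // true
}
```

The creative part, deciding which situations (clock skew, backend failure, and so on) the fake should simulate, is exactly what still has to be spelled out per test.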
1) All-knowing oracle which is lightly prompted and develops whole applications from requirements specification to deployable artifacts. Superficial, little to no review of the code before running and committing.
2) An additional tool next to their already established toolset to be used inside or alongside their IDE. Each line gets read and reviewed. The tool needs to defend their choices and manual rework is common for anything from improving documentation to naming things all the way to architectural changes.
Obviously anything in between is viable as well. 1) seems like a crazy dead end to me if you are looking to build a sustainable service or a fulfilling career.
Most of my questions are "in one sentence respond: long rambling context and question"
I completely agree that this is the case right now, but I do wonder how long it will remain the case.
But that's boring nerd shit and LLMs didn't change who thinks boring nerd shit is boring or cool.
One thing I will add: I actually don’t think it’s wrong to start out building a vibe coded spaghetti mess for a project like this… provided you see it as a prototype you’re going to learn from and then throw away. A throwaway prototype is immensely useful because it helps you figure out what you want to build in the first place, before you step down a level and focus on closely guiding the agent to actually build it.
The author’s mistake was thinking the horrible prototype would evolve into the real thing. Of course it could not. But I suspect that the author’s final results, when he did start afresh and build with closer attention to architecture, were much better because he had learned more about the requirements for what he wanted to build from that first attempt.
But it does a good job of countering the narrative you often see on LinkedIn, and to some extent on HN as well, where AI is portrayed as all-capable of developing enterprise software. If you spend any time in discussions hyping AI, you will have seen plenty of confident claims that traditional coding is dead and that AI will replace it soon. Posts like this are useful because they show a more grounded reality.
> 90 percent of the things users want either A) don't exist or B) are impossible to find, install, and run without being deeply technical. These things don't need to scale, and they don't need to be well designed. They are, for the most part, targeted, single-user, single-purpose artifacts.
Yes, that is a particular niche where AI can be applied effectively. But many AI proponents go much further and argue that AI is already capable of delivering complex, production-grade systems. They say, you don't need engineers anymore. They say, you only need product owners who can write down the spec. From what I have seen, that claim does not hold up and this article supports that view.
Many users may not be interested in scalability and maintainability... But for a number of us, including the OP and myself, the real question is whether AI can handle situations where scalability, maintainability and sound design DO actually matter. The OP does a good job of understanding this.
What’s really happening is that you’re all of those people in the beginning. Those people are you as you go through the experience. You’re excited after seeing it do the impossible and in later instances you’re critical of the imperfections. It’s like the stages of grief, a sort of Kübler-Ross model for AI.
Then use ideation to architect, dive into details, and tell the AI exactly what your choices are: how certain methods should be called, how logging and observability should be set up, what language to use, type checking, coding style (configure ruthless linting and formatting before you write a single line of code), and what testing methodology and framework (unit, integration, e2e). For the database, handle migrations yourself as much as possible, so the AI is confined as tightly as possible to how you would do it.
Then create a plan file, have it manage it like a task list, and implement in parts. Before starting, it needs to present you a plan; in it you will notice it will make mistakes, misunderstand some things that you maybe didn’t clarify before, or just forget. You add to AGENTS.md or whatever, make changes to the AI’s plan, tell it to update the plan.md, and when satisfied, proceed.
After done, review the code. You will notice there is always something to fix. Hardcoded variables, a sql migration with seed data that should actually not be a migration, just generally crazy stuff.
The worst is that the AI is always very loose on requirements. You will notice all its fields are nullable, records have little to no validation, and when you report an error during testing it tries to solve it with a brittle async solution, like LISTEN/NOTIFY or a callback, instead of the architecturally correct solution. Things that at scale are hell to debug, especially if you did not write the code.
If you do this and iterate you will gradually end up with a solid harness and you will need to review less.
Then port it to other projects.
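As a loose illustration of such a harness (the contents here are invented; the commenter doesn't share theirs), an AGENTS.md might encode the recurring corrections as standing rules:

```markdown
# AGENTS.md (illustrative sketch)

- Run the linter, formatter, and full test suite before declaring any task done.
- New database fields are NOT NULL with validation unless explicitly agreed otherwise.
- Never write SQL migrations on your own; propose them in plan.md and wait for approval.
- No ad-hoc async workarounds (callbacks, LISTEN/NOTIFY) to paper over errors;
  surface the error and propose an architecturally sound fix in plan.md.
- Keep plan.md up to date: mark tasks done, record open questions.
```

Each rule corresponds to a class of mistake you'd otherwise catch in review over and over.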
Pull out as many pure functions as possible and exhaustively test the input and output mappings.
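A minimal Go sketch of that idea (the function and its cases are invented): extract the decision logic into a pure function, then table-test the entire input→output mapping.

```go
package main

import "fmt"

// normalizeStatus is a pure function pulled out of a larger handler:
// same input always gives the same output, no I/O, no hidden state,
// so its mapping can be tested exhaustively.
func normalizeStatus(code int) string {
	switch {
	case code >= 200 && code < 300:
		return "ok"
	case code >= 400 && code < 500:
		return "client_error"
	case code >= 500:
		return "server_error"
	default:
		return "unknown"
	}
}

func main() {
	// Exhaustive input→output mapping, table-driven.
	cases := map[int]string{200: "ok", 404: "client_error", 503: "server_error", 100: "unknown"}
	for in, want := range cases {
		if got := normalizeStatus(in); got != want {
			fmt.Println("FAIL", in, got)
		}
	}
	fmt.Println("all cases checked")
}
```

Because the function is pure, the table of cases is the whole specification; there's nothing to mock.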
In one of the cases, I was searching for a way to extract a bunch of code that 5-6 queries had in common. Whatever this thing was, its parameters would have to include an array/tuple of IDs, and a parameter that would alter the table being selected from, neither of which is allowed in a ClickHouse parameterized view. I could write a normal view for this, but performance would’ve been atrocious given ClickHouse’s ok-but-not-great query optimizer.
I asked AI for alternatives, and to discuss the pros and cons of each. I brought up specific scenarios and asked it how it thought the code would work. I asked it to bring what it knew about SQL’s relational algebra to bear to find an elegant solution.
It finally suggested a template (we’re using Go) to include another sql file, where the parameter is a _named relation_. It can be a CTE or a table, but it doesn’t matter as long as it has the right columns. Aside from poor tooling that doesn’t find things like typos, it’s been a huge win, much better than the duplication. And we have lots of tests that run against the real database to catch those typos.
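I don't have their actual code, but the core trick can be sketched with Go's text/template (names invented): the shared SQL fragment takes the *name* of a relation, and the caller supplies a CTE or table name when rendering.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// sharedQuery is the SQL the 5-6 queries had in common. {{.Rel}} is a
// named relation: any CTE or table with the right columns can be plugged in.
const sharedQuery = `SELECT id, total FROM {{.Rel}} GROUP BY id`

// render expands the shared fragment with a concrete relation name.
func render(rel string) (string, error) {
	tmpl, err := template.New("q").Parse(sharedQuery)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, struct{ Rel string }{rel}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// The caller's CTE (or table) stands in as the named relation.
	q, _ := render("filtered_events")
	fmt.Println(q) // SELECT id, total FROM filtered_events GROUP BY id
}
```

As the comment notes, the trade-off is that templating happens outside SQL tooling, so typos in the fragment only surface when tests run the rendered query against a real database.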
Maybe this kind of thing exists out there already (if it does, tell me!) but I probably wouldn’t have found it.
I know not everybody is quite ready for this yet. But I'm working from the point of view that I won't be manually programming much professionally anymore.
So, I now pick stuff I know AIs supposedly do well (like Go) with good solid tool and library ecosystems. I can read it well enough; it's not a hard language and I've seen plenty other languages. But I'm clearly not going to micro manage a Go code base any time soon. The first time I did this, it was an experiment. I wanted to see how far I could push the notion. I actually gave it some thought and then I realized that if I was going to do this manually I would pick what I always pick. But I just wasn't planning to do this manually and it wasn't optimal for the situation. It just wasn't a valid choice anymore.
Then I repeated the experiment again on a bigger thing and I found that I could have a high level discussion about architectural choices well enough that it did not really slow me down much. The opposite actually. I just ask critical questions. I try to make sure to stick with mainstream stuff and not get boxed into unnecessary complexity. A few decades in this industry has given me a nose for that.
My lack of familiarity with the code base is so far not proving to be any issue. Early days, I know. But I'm generating an order of magnitude more code than I'll ever be able to review already and this is only going to escalate from here on. I don't see a reason for me to slow down. To be effective, I need to engineer at a macro level. I simply can't afford to micro manage code bases anymore. That means orchestrating good guard rails, tests, specifications, etc. and making sure those cover everything I care about. Precisely because I don't want to have to open an editor and start fixing things manually.
As for Rust, that was me not thinking about my prompt too hard and it had implemented something half decent by the time I realized so I just went with it. To be clear, this one is just a side project. So, I let it go (out of curiosity) and it seems to be fine as well. Apparently, I can do Rust now too. It's actually not a bad choice objectively and so far so good. The thing is, I can change my mind and redo the whole thing from scratch and it would not be that expensive if I had to.
If it generates the slop version in a week but it takes me 3 more weeks to clean it up, could I have just done it right the first time myself in 4 weeks instead? How much money have I wasted in tokens?
I am a technologist. But I am seriously concerned about the ecological consequences of training and using AI. To me, the true laggards are those who have not yet understood that climate change requires prudent use of our resources.
I don't mind people having fun or being productive with AI. But I do mind it when AI is presented as the only way of doing things.
Personally, I think it's just the natural flow when you're starting out. If he keeps going, his opinion is going to change and as he gets to know it better, he'll likely go more and more towards vibecoding again.
It's hard to say why, but you get better at it, even if it's really hard to put into words why.
For local/offline Qwen 2.5 Coder 32B is probably your strongest option if you have the VRAM (or can run it quantized). Handles C better than most other local models in my experience.
Previously, takes were necessarily shallower or not as insightful ("worked with caveats for me, ymmv") - there just wasn't enough data - although a few have posted fairly balanced takes (@mitsuhiko for example).
I don't think we've seen the last of hypers and doomers though.
I kinda like how you can just use it for anything you like. I have a bazillion personal projects I can now get help with, polish up, simplify, or build a UI for, and it's nice. Anything from reverse engineering to data extraction to playing with FPGAs is just so much less tedious, and I can focus on the fun parts.
Some people do find it unfun, saying it deprives them of the happy "flow" of banging out code. Reaching "flow" when prompting LLMs arguably requires a somewhat deeper understanding of them as a proper technical tool, as opposed to a complete black box, or worse, a crystal ball.
I use LLMs in my every day work. I’m also a strong critic of LLMs and absolutely loathe the hype cycle around them.
I have done some really cool things with Copilot and Claude and I keep sharing them within my working circle, because I simply don’t want to interact that much with people who aren’t grounded on the subject.
Only an AI would bother to create a throwaway account to post such a shallow comment that is mostly fearmongering to push people to use AI.
By extraordinary coincidence, I was just a moment ago part-of-the-way through re-watching The Matrix (1999) and paused it to check Hacker News. There your reply greeted me.
Wild glitch!
SWEs spend 20% of their time writing code for exactly the same reason bricklayers spend 20% of their time laying bricks.
Soooooo....
As one who hasn't taken the plunge yet -- I'm basically retired, but have a couple of projects I might want to use AI for -- "time" is not always fungible with, or a good proxy for, either "effort" or "motivation"
> How much money have I wasted in tokens?
This, of course, may be a legitimate concern.
> If it generates the slop version in a week but it takes me 3 more weeks to clean it up, could I have just done it right the first time myself in 4 weeks instead?
This likewise may be a legitimate concern, but sometimes the motivation for cleaning up a basically working piece of code is easier to find than the motivation for staring at a blank screen and trying to write that first function.
It may actually be true. Your feeling might be right - but I strongly caution you against trusting that feeling until you can explain it. Something you can’t explain is something you don’t understand.
Cleaning up agent slop code by hand is also a miserable experience and makes me hate my job. I already do it at $DAYJOB because my boss thinks “investing” in third-worlders for pennies on the dollar and just giving them a Claude subscription is better than investing in technical excellence and leadership. The ROI on this strategy is questionable at best, at least at my current job. Code review by humans is still the bottleneck, and delivering proper working features has not accelerated because they require much more iteration due to slop.
Would much rather spend the time making my own artisanal tradslop instead, if it’s gonna take me the same amount of time anyway; at least it’s more enjoyable.
Have you ever learned a skill? Like carving, singing, playing guitar, playing a video game, anything?
It's easy to get better at it without understanding why you're better at it. As a matter of fact, very, very few people master the discipline enough to grasp the reason why they're actually better.
Most people just come up with random shit which may or may not be related. Which I just abstained from.
This is something everyone who cares about improving in a skill does regularly - examine their improvement, the reasons behind it, and how to add to them. That’s the basis of self-driven learning.
And that's not really explainable without exploring specific examples. And now we're in thousands of words of explanation territory, hence my decision to say it's hard to put it into words.
For instance, if I say “I noticed I run better in my blue shoes than my red shoes” I did not learn anything. If I examine my shoes and notice that my blue shoes have a cushioned sole, while my red shoes are flat, I can combine that with thinking about how I run and learn that cushioned soles cause less fatigue to the muscles in my feet and ankles.
The reason the difference matters is that if I don’t do the learning step, when I buy another pair of blue shoes and they’re flat-soled, I’m back to square one.
Back to the real scenario, if you hold on to your ungrounded intuition re what tricks and phrasing work without understanding why, you may find those don’t work at all on a new model version or when forced to change to a different product due to price, insolvency, etc.
For eight years, I’ve wanted a high-quality set of devtools for working with SQLite. Given how important SQLite is to the industry, I’ve long been puzzled that no one has invested in building a really good developer experience for it.
A couple of weeks ago, after ~250 hours of effort over three months of evenings, weekends, and vacation days, I finally released syntaqlite (GitHub), fulfilling this long-held wish. And I believe the main reason this happened was because of AI coding agents.
Of course, there’s no shortage of posts claiming that AI one-shot their project or pushing back and declaring that AI is all slop. I’m going to take a very different approach and, instead, systematically break down my experience building syntaqlite with AI, both where it helped and where it was detrimental.
I’ll do this while contextualizing the project and my background so you can independently assess how generalizable this experience was. And whenever I make a claim, I’ll try to back it up with evidence from my project journal, coding transcripts, or commit history.
In my work on Perfetto, I maintain a SQLite-based language for querying performance traces called PerfettoSQL. It’s basically the same as SQLite but with a few extensions to make the trace querying experience better. There are ~100K lines of PerfettoSQL internally in Google and it’s used by a wide range of teams.
Having a language which gets traction means your users also start expecting things like formatters, linters, and editor extensions. I’d hoped that we could adapt some SQLite tools from open source but the more I looked into it, the more disappointed I was. What I found either wasn’t reliable enough, fast enough, or flexible enough to adapt to PerfettoSQL. There was clearly an opportunity to build something from scratch, but it was never the “most important thing we could work on”. We’ve been reluctantly making do with the tools out there but always wishing for better.
On the other hand, there was the option to do something in my spare time. I had built lots of open source projects in my teens but this had faded away during university when I felt that I just didn’t have the motivation anymore. Being a maintainer is much more than just “throwing the code out there” and seeing what happens. It’s triaging bugs, investigating crashes, writing documentation, building a community, and, most importantly, having a direction for the project.
But the itch of open source (specifically freedom to work on what I wanted while helping others) had never gone away. The SQLite devtools project was eternally in my mind as “something I’d like to work on”. But there was another reason why I kept putting it off: it sits at the intersection of being both hard and tedious.
If I was going to invest my personal time working on this project, I didn’t want to build something that only helped Perfetto: I wanted to make it work for any SQLite user out there. And this means parsing SQL exactly like SQLite.
The heart of any language-oriented devtool is the parser. This is responsible for turning the source code into a “parse tree” which acts as the central data structure anything else is built on top of. If your parser isn’t accurate, then your formatters and linters will inevitably inherit those inaccuracies; many of the tools I found suffered from having parsers which approximated the SQLite language rather than representing it precisely.
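To make “parse tree” concrete, here is a minimal illustrative sketch in Go (these are not syntaqlite’s actual types): each node records its kind and the source span it covers, so formatters and linters built on top can always map back to the original text.

```go
package main

import "fmt"

// Node is a hypothetical parse tree node: a kind label, the byte span
// of source text it covers, and its child nodes.
type Node struct {
	Kind     string
	Start    int // byte offset of the node's first character
	End      int // byte offset just past the node's last character
	Children []*Node
}

func main() {
	// Hand-built tree for the statement: SELECT id FROM t
	tree := &Node{Kind: "select_stmt", Start: 0, End: 16, Children: []*Node{
		{Kind: "result_column", Start: 7, End: 9},
		{Kind: "table_name", Start: 15, End: 16},
	}}
	fmt.Println(tree.Kind, len(tree.Children)) // select_stmt 2
}
```

A formatter walks such a tree and re-emits text span by span; a linter pattern-matches on node kinds, which is why any inaccuracy in the parser propagates into every tool downstream.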
Unfortunately, unlike many other languages, SQLite has no formal specification describing how it should be parsed. It doesn’t expose a stable API for its parser either. In fact, quite uniquely, in its implementation it doesn’t even build a parse tree at all! The only reasonable approach left in my opinion is to carefully extract the relevant parts of SQLite’s source code and adapt it to build the parser I wanted.
This means getting into the weeds of SQLite source code, a fiendishly difficult codebase to understand. The whole project is written in C in an incredibly dense style; I’ve spent days just understanding the virtual table API11 and implementation. Trying to grasp the full parser stack was daunting.
There’s also the fact that there are >400 rules in SQLite which capture the full surface area of its language. I’d have to specify in each of these “grammar rules” how that part of the syntax maps to the matching node in the parse tree. It’s extremely repetitive work; each rule is similar to all the ones around it but also, by definition, different.
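To give a flavour of what that repetition looks like, here’s a toy version of the per-rule mapping (all rule and node names invented for illustration; these are not SQLite’s or syntaqlite’s actual rules): every rule needs its own small, slightly different entry saying which tree node its matched symbols become.

```rust
// Invented names throughout; this only illustrates the shape of the work.
#[derive(Debug, PartialEq)]
enum NodeKind {
    SelectStmt,
    LimitClause,
    OrderByClause,
    Unknown,
}

// One entry per grammar rule: near-identical boilerplate, yet each rule
// still has to be handled individually because each maps differently.
fn node_for_rule(rule: &str) -> NodeKind {
    match rule {
        "select_stmt ::= select_core orderby_opt limit_opt" => NodeKind::SelectStmt,
        "limit_opt ::= LIMIT expr" => NodeKind::LimitClause,
        "orderby_opt ::= ORDER BY sortlist" => NodeKind::OrderByClause,
        // ...and hundreds more arms in the same shape.
        _ => NodeKind::Unknown,
    }
}
```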
And it’s not just the rules but also coming up with and writing tests to make sure it’s correct, debugging when something goes wrong, and triaging and fixing the inevitable bugs people would file when I got something wrong…
For years, this was where the idea died. Too hard for a side project12, too tedious to sustain motivation, too risky to invest months into something that might not work.
I’ve been using coding agents since early 2025 (Aider, Roo Code, then Claude Code since July) and they’d definitely been useful but never something I felt I could trust a serious project to. But towards the end of 2025, the models seemed to make a significant step forward in quality13. At the same time, I kept hitting problems in Perfetto which would have been trivially solved by having a reliable parser. Each workaround left the same thought in the back of my mind: maybe it’s finally time to build it for real.
I got some space to think and reflect over Christmas and decided to really stress test the most maximalist version of AI: could I vibe-code the whole thing using just Claude Code on the Max plan (£200/month)?
Through most of January, I iterated, acting as a semi-technical manager and delegating almost all the design and all the implementation to Claude. Functionally, I ended up in a reasonable place: a parser in C extracted from SQLite sources using a bunch of Python scripts, a formatter built on top, support for both the SQLite language and the PerfettoSQL extensions, all exposed in a web playground.
But when I reviewed the codebase in detail in late January, the downside was obvious: the codebase was complete spaghetti14. I didn’t understand large parts of the Python source extraction pipeline, functions were scattered in random files without a clear shape, and a few files had grown to several thousand lines. It was extremely fragile; it solved the immediate problem but it was never going to cope with my larger vision, never mind integrating it into the Perfetto tools. The saving grace was that it had proved the approach was viable and generated more than 500 tests, many of which I felt I could reuse.
I decided to throw away everything and start from scratch while also switching most of the codebase to Rust15. I could see that C was going to make it difficult to build the higher level components like the validator and the language server implementation. And as a bonus, it would also let me use the same language for both the extraction and runtime instead of splitting it across C and Python.
More importantly, I completely changed my role in the project. I took ownership of all decisions16 and used it more as “autocomplete on steroids” inside a much tighter process: opinionated design upfront, reviewing every change thoroughly, fixing problems eagerly as I spotted them, and investing in scaffolding (like linting, validation, and non-trivial testing17) to check AI output automatically.
The core features came together through February and the final stretch (upstream test validation, editor extensions, packaging, docs) led to a 0.1 launch in mid-March.
But in my opinion, this timeline is the least interesting part of this story. What I really want to talk about is what wouldn’t have happened without AI and also the toll it took on me as I used it.
I’ve written in the past about how one of my biggest weaknesses as a software engineer is my tendency to procrastinate when facing a big new project. Though I didn’t realize it at the time, it could not have applied more perfectly to building syntaqlite.
AI basically let me put aside all my doubts on technical calls, my uncertainty about building the right thing and my reluctance to get started by giving me very concrete problems to work on. Instead of “I need to understand how SQLite’s parsing works”, it was “I need to get AI to suggest an approach for me so I can tear it up and build something better”18. I work so much better with concrete prototypes to play with and code to look at than endlessly thinking about designs in my head, and AI let me get to that point at a pace I could not have dreamed of before. Once I took the first step, every step after that was so much easier.
AI turned out to be better than me at the act of writing code itself, assuming that code is obvious. If I can break a problem down to “write a function with this behaviour and parameters” or “write a class matching this interface,” AI will build it faster than I would and, crucially, in a style that might well be more intuitive to a future reader. It documents things I’d skip, lays out code consistently with the rest of the project, and sticks to what you might call the “standard dialect” of whatever language you’re working in19.
That standardness is a double-edged sword. For the vast majority of code in any project, standard is exactly what you want: predictable, readable, unsurprising. But every project has pieces that are its edge, the parts where the value comes from doing something non-obvious. For syntaqlite, that was the extraction pipeline and the parser architecture. AI’s instinct to normalize was actively harmful there, and those were the parts I had to design in depth and often resorted to just writing myself.
But here’s the flip side: the same speed that makes AI great at obvious code also makes it great at refactoring. If you’re using AI to generate code at industrial scale, you have to refactor constantly and continuously20. If you don’t, things immediately get out of hand. This was the central lesson of the vibe-coding month: I didn’t refactor enough, the codebase became something I couldn’t reason about, and I had to throw it all away. In the rewrite, refactoring became the core of my workflow. After every large batch of generated code, I’d step back and ask “is this ugly?” Sometimes AI could clean it up. Other times there was a large-scale abstraction that AI couldn’t see but I could; I’d give it the direction and let it execute21. If you have taste, the cost of a wrong approach drops dramatically because you can restructure quickly22.
Of all the ways I used AI, research had by far the highest ratio of value delivered to time spent.
I’ve worked with interpreters and parsers before but I had never heard of Wadler-Lindig pretty printing23. When I needed to build the formatter, AI gave me a concrete and actionable lesson from a point of view I could understand and pointed me to the papers to learn more. I could have found this myself eventually, but AI compressed what might have been a day or two of reading into a focused conversation where I could ask “but why does this work?” until I actually got it.
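For readers curious what that idea looks like in code, here’s a heavily simplified sketch of Wadler-style pretty printing (a greedy variant, not the full algorithm from the papers, and all names are mine, not syntaqlite’s): documents are built from text, soft line breaks, nesting, and groups, and a group prints on one line only if it fits the target width.

```rust
// Greedy sketch of Wadler-style pretty printing; the real algorithm is
// cleverer about deciding what "fits". Names here are illustrative.
#[derive(Clone)]
enum Doc {
    Text(String),
    Line,                  // a newline, or a single space when flattened
    Concat(Vec<Doc>),
    Nest(usize, Box<Doc>), // extra indentation for lines inside
    Group(Box<Doc>),       // print flat if it fits the width, else broken
}

// Width of a document if printed entirely on one line.
fn flat_width(d: &Doc) -> usize {
    match d {
        Doc::Text(s) => s.len(),
        Doc::Line => 1,
        Doc::Concat(ds) => ds.iter().map(flat_width).sum(),
        Doc::Nest(_, d) | Doc::Group(d) => flat_width(d),
    }
}

fn render(d: &Doc, width: usize, indent: usize, col: &mut usize, out: &mut String) {
    match d {
        Doc::Text(s) => { out.push_str(s); *col += s.len(); }
        Doc::Line => { out.push('\n'); out.push_str(&" ".repeat(indent)); *col = indent; }
        Doc::Concat(ds) => { for d in ds { render(d, width, indent, col, out); } }
        Doc::Nest(n, d) => render(d, width, indent + n, col, out),
        Doc::Group(inner) => {
            if *col + flat_width(inner) <= width {
                render_flat(inner, col, out); // fits: Lines become spaces
            } else {
                render(inner, width, indent, col, out); // too wide: break
            }
        }
    }
}

// Flat mode: everything on one line, soft breaks rendered as spaces.
fn render_flat(d: &Doc, col: &mut usize, out: &mut String) {
    match d {
        Doc::Text(s) => { out.push_str(s); *col += s.len(); }
        Doc::Line => { out.push(' '); *col += 1; }
        Doc::Concat(ds) => { for d in ds { render_flat(d, col, out); } }
        Doc::Nest(_, d) | Doc::Group(d) => render_flat(d, col, out),
    }
}

fn pretty(d: &Doc, width: usize) -> String {
    let (mut out, mut col) = (String::new(), 0usize);
    render(d, width, 0, &mut col, &mut out);
    out
}
```

The payoff is that a formatter builds one document per statement and gets line-breaking for free: the same document prints as `SELECT a, b` at a generous width and as a broken, indented form at a narrow one.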
This extended to entire domains I’d never worked in. I have deep C++ and Android performance expertise but had barely touched Rust tooling or editor extension APIs. With AI, it wasn’t a problem: the fundamentals are the same, the terminology is similar, and AI bridges the gap24. The VS Code extension would have taken me a day or two of learning the API before I could even start. With AI, I had a working extension within an hour.
It was also invaluable for reacquainting myself with parts of the project I hadn’t looked at for a few days25. I could control how deep to go: “tell me about this component” for a surface-level refresher, “give me a detailed linear walkthrough” for a deeper dive, “audit unsafe usages in this repo” to go hunting for problems. When you’re context switching a lot, you lose context fast. AI let me reacquire it on demand.
Beyond making the project exist at all, AI is also the reason it shipped as complete as it did. Every open source project has a long tail of features that are important but not critical: the things you know theoretically how to do but keep deprioritizing because the core work is more pressing. For syntaqlite, that list was long: editor extensions, Python bindings, a WASM playground, a docs site, packaging for multiple ecosystems26. AI made these cheap enough that skipping them felt like the wrong trade-off.
It also freed up mental energy for UX27. Instead of spending all my time on implementation, I could think about what a user’s first experience should feel like: what error messages would actually help them fix their SQL, how the formatter output should look by default, whether the CLI flags were intuitive. These are the things that separate a tool people try once from one they keep using, and AI gave me the headroom to care about them. Without AI, I would have built something much smaller, probably no editor extensions or docs site. AI didn’t just make the same project faster. It changed what the project was.
There’s an uncomfortable parallel between using AI coding tools and playing slot machines28. You send a prompt, wait, and either get something great or something useless. I found myself up late at night wanting to do “just one more prompt,” constantly trying AI just to see what would happen even when I knew it probably wouldn’t work. The sunk cost fallacy kicked in too: I’d keep at it even in tasks it was clearly ill-suited for, telling myself “maybe if I phrase it differently this time.”
The tiredness feedback loop made it worse29. When I had energy, I could write precise, well-scoped prompts and be genuinely productive. But when I was tired, my prompts became vague, the output got worse, and I’d try again, getting more tired in the process. In these cases, AI was probably slower than just implementing something myself, but it was too hard to break out of the loop30.
Several times during the project, I lost my mental model of the codebase31. Not the overall architecture or how things fitted together. But the day-to-day details of what lived where, which functions called which, the small decisions that accumulate into a working system. When that happened, surprising issues would appear and I’d find myself at a total loss to understand what was going wrong. I hated that feeling.
The deeper problem was that losing touch created a communication breakdown32. When you don’t have the mental thread of what’s going on, it becomes impossible to communicate meaningfully with the agent. Every exchange gets longer and more verbose. Instead of “change FooClass to do X,” you end up saying “change the thing which does Bar to do X”. Then the agent has to figure out what Bar is, how that maps to FooClass, and sometimes it gets it wrong33. It’s exactly the same complaint engineers have always had about managers who don’t understand the code asking for fanciful or impossible things. Except now you’ve become that manager.
The fix was deliberate: I made it a habit to read through the code immediately after it was implemented and to actively engage with it, asking “how would I have done this differently?”.
Of course, in some sense all of the above is also true of code I wrote a few months ago (hence the sentiment that AI code is legacy code), but AI makes the drift happen faster because you’re not building the same muscle memory that comes from originally typing it out.
There were some other problems I only discovered incrementally over the three months.
I found that AI made me procrastinate on key design decisions34. Because refactoring was cheap, I could always say “I’ll deal with this later.” And because AI could refactor at the same industrial scale it generated code, the cost of deferring felt low. But it wasn’t: deferring decisions corroded my ability to think clearly because the codebase stayed confusing in the meantime. The vibe-coding month was the most extreme version of this. Yes, I understood the problem, but if I had been more disciplined about making hard design calls earlier, I could have converged on the right architecture much faster.
Tests created a similar false comfort35. Having 500+ tests felt reassuring, and AI made it easy to generate more. But neither humans nor AI are creative enough to foresee every edge case you’ll hit in the future; there were several times in the vibe-coding phase when I’d come up with a test case and realise the design of some component was completely wrong and needed to be totally reworked. This was a significant contributor to my lack of trust and to the decision to scrap everything and start from scratch.
Basically, I learned that the “normal rules” of software still apply in the AI age: if you don’t have a solid foundation (clear architecture, well-defined boundaries) you’ll be left eternally chasing bugs as they appear.
Something I kept coming back to was how little AI understood about the passage of time36. It sees a codebase in a certain state but doesn’t feel time the way humans do. I can tell you what it feels like to use an API, how it evolved over months or years, why certain decisions were made and later reversed.
The natural consequence of this lack of understanding is that you either repeat the mistakes you made in the past and have to relearn the lessons, or you fall into new traps that were successfully avoided the first time, slowing you down in the long run. In my opinion, this is similar to why losing a high-quality senior engineer hurts a team so much: they carry history and context that doesn’t exist anywhere else and act as a guide for others around them.
In theory, you can try to preserve this context by keeping specs and docs up to date. But there’s a reason we didn’t do this before AI: capturing implicit design decisions exhaustively is incredibly expensive and time-consuming to write down. AI can help draft these docs, but because there’s no way to automatically verify that it accurately captured what matters, a human still has to manually audit the result. And that’s still time-consuming.
There’s also the context pollution problem. You never know when a design note about API A will echo in API B. Consistency is a huge part of what makes codebases work, and for that you don’t just need context about what you’re working on right now but also about other things which were designed in a similar way. Deciding what’s relevant requires exactly the kind of judgement that institutional knowledge provides in the first place.
Reflecting on the above, the pattern of when AI helped and when it hurt was fairly consistent.
When I was working on something I already understood deeply, AI was excellent. I could review its output instantly, catch mistakes before they landed and move at a pace I’d never have managed alone. The parser rule generation is the clearest example37: I knew exactly what each rule should produce, so I could review AI’s output within a minute or two and iterate fast.
When I was working on something I could describe but didn’t yet know, AI was good but required more care. Learning Wadler-Lindig for the formatter was like this: I could articulate what I wanted, evaluate whether the output was heading in the right direction, and learn from what AI explained. But I had to stay engaged and couldn’t just accept what it gave me.
When I was working on something where I didn’t even know what I wanted, AI was somewhere between unhelpful and harmful. The architecture of the project was the clearest case: I spent weeks in the early days following AI down dead ends, exploring designs that felt productive in the moment but collapsed under scrutiny. In hindsight, I have to wonder if it would have been faster just thinking it through without AI in the loop at all.
But expertise alone isn’t enough. Even when I understood a problem deeply, AI still struggled if the task had no objectively checkable answer38. Implementation has a right answer, at least at a local level: the code compiles, the tests pass, the output matches what you asked for. Design doesn’t. We’re still arguing about OOP decades after it first took off.
Concretely, I found that designing the public API of syntaqlite was where this hit home the hardest. I spent several days in early March doing nothing but API refactoring, manually fixing things any experienced engineer would have instinctively avoided but AI made a total mess of. There’s no test or objective metric for “is this API pleasant to use” and “will this API help users solve the problems they have” and that’s exactly why the coding agents did so badly at it.
This takes me back to the days I was obsessed with physics and, specifically, relativity. The laws of physics look simple and Newtonian in any small local area, but zoom out and spacetime curves in ways you can’t predict from the local picture alone. Code is the same: at the level of a function or a class, there’s usually a clear right answer, and AI is excellent there. But architecture is what happens when all those local pieces interact, and you can’t get good global behaviour by stitching together locally correct components.
Knowing where you are on these axes at any given moment is, I think, the core skill of working with AI effectively.
Eight years is a long time to carry a project in your head. Seeing these SQLite tools actually exist and function after only three months of work is a massive win, and I’m fully aware they wouldn’t be here without AI.
But the process wasn’t the clean, linear success story people usually post. I lost an entire month to vibe-coding. I fell into the trap of managing a codebase I didn’t actually understand, and I paid for that with a total rewrite.
The takeaway for me is simple: AI is an incredible force multiplier for implementation, but it’s a dangerous substitute for design. It’s brilliant at giving you the right answer to a specific technical question, but it has no sense of history, taste, or how a human will actually feel using your API. If you rely on it for the “soul” of your software, you’ll just end up hitting a wall faster than you ever have before.
What I’d like to see more of from others is exactly what I’ve tried to do here: honest, detailed accounts of building real software with these tools; not weekend toys or one-off scripts but the kind of software that has to survive contact with users, bug reports, and your own changing mind.