You can't unit test for taste

You can't unit test for taste if you haven't written down what you mean by taste. If you can externalize it, then you can.

Follow this line of thinking, and the AI-friendly answer is easy: we just have to externalize everything we know, so Claude can implement what I want.

Except that I can't fully externalize myself. Debugging a system takes more resources than running the system. If I could write down everything I know and hand it to a machine, I'd do that, but it's impossible.

People aren't books or hashmaps. If you want to build something, you need to use the tools, not teach the tools to use you.

[edit: I'm trying to figure out if there's something to be done about this. Email me if you want to chat -- tr at tern dot sh]

> Overall the evaluation of success was one of the most challenging parts of the project. As a developer, I’m used to building features that either work or don’t and there is often an objective way to measure how well a feature performs. For messy real world data it was hard to evaluate how good or bad the pipeline was. Furthermore, it was easy to start optimising for a specific parameter or route and find later that this work led to severe degradations in other areas.

> Verification becomes hard to reason about because there is no ground truth for points of interest, there are no red/green unit tests for taste. I’m sure these are familiar challenges to data scientists and that there are frameworks and evals for working on them. This will require more iteration and manual overrides. Hopefully with feedback and collaboration from the community. But for now I’ve shipped V1…

I suspect LLMs may be able to help us quantify our taste because they can keep track of so many data points all at once, where we have to lossily abstract these details away.

Unrelated to code, but along the same lines. I've been keeping track of the Reckless Ben case to fuel my unhealthy indignation, and we just had a like-for-like comparison between a human and an LLM.

Human: well-scoped argument that does just enough to get the job done with minimal risk.

AI: Extremely clever and correct legal argument that almost any lawyer would have said not to file (at least as written). It tries to burn the world and seriously risks pissing off the judge.

https://www.youtube.com/watch?v=YRXJnKP6Tu0

I am quite confident I could take a series of photos of various designs and classify them as "tacky" or not, and train a neural network to recognize tackiness.

https://pureinference.com/insights/taste-is-the-new-skill

I wrote about this a few months back. Rick Rubin is famous for this. I do think it is something that can be trained though, it just needs a lot more context. Taste builds over time through lots of unit tests, through lots of content writing, through an accumulation of product decisions. It’s hard to put it in the individual spec, but it can be teased out of 100 project specs. And when you get to that scale the AI starts to do it pretty well.

> but it ended up merely in a supporting role

This has been my experience, as well, but it’s a really big support. It just needs adult supervision. I can’t understand how vibe-coded apps actually work.

As far as “taste” goes, I test my stuff constantly, checking for even minor “friction points,” sometimes refactoring back to design in order to resolve issues that many folks would ship. I’m pretty anal, and want my work to be the best experience possible.

I can’t see any LLM coming close to being able to evaluate the user experience like I can.

Taste is mostly the part of the spec you forgot to write down, plus the part you couldn't write down even if you tried.

You can’t even unit-test for correct program logic, unless you’re able to enumerate all possible inputs and states within a short time frame.

No but you can add selection as part of your workflow. Governance is something AI agents have allowed me to focus on more and more and this IMHO is where taste lands for me: https://github.com/lramoth/infoPipeline/blob/main/governance...

It makes me smile when runners use "X is a marathon, not a sprint" to hint at an effort that accumulates over time and an optimal use of energy.

I do it too because it's a common expression, and a marathon is of course longer than a sprint, but both have in common that properly raced, they are absolutely brutal efforts that leave you without a single additional drop at the end. The effort length and instantaneous power output changes, of course. Maybe "it's a marathon build, not the race" would be more precise at the loss of nearly all its expressive power (but with a lot more pedanticism points) :-p .

Nice project!

That's what linters are for. Linters can prevent SQL code from spilling out to code outside the model layer. Even more important when vibecoding.
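
A rough sketch of what such a lint rule could look like, assuming a hypothetical project layout where only files under `models/` may contain SQL. The `SQL_PATTERN` heuristic and the `find_sql_leaks` name are made up for illustration and are not taken from any real linter:

```python
import ast
import re
from pathlib import PurePosixPath

# Heuristic (an assumption): a string literal that starts like a SQL statement.
SQL_PATTERN = re.compile(
    r"^\s*(SELECT|INSERT\s+INTO|UPDATE|DELETE\s+FROM)\b", re.IGNORECASE
)


def find_sql_leaks(source: str, path: str, allowed_dir: str = "models") -> list[tuple[int, str]]:
    """Return (line, snippet) for SQL-looking string literals outside allowed_dir."""
    if allowed_dir in PurePosixPath(path).parts:
        return []  # the model layer is allowed to contain SQL
    leaks = []
    for node in ast.walk(ast.parse(source)):
        # Every plain string literal in the file shows up as an ast.Constant.
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if SQL_PATTERN.match(node.value):
                leaks.append((node.lineno, node.value.strip()[:40]))
    return sorted(leaks)


view_code = 'def handler(db):\n    return db.execute("SELECT * FROM users")\n'
print(find_sql_leaks(view_code, "views/users.py"))
print(find_sql_leaks(view_code, "models/users.py"))
```

The regex will produce false positives (any string that happens to start with "select" or "update"), so a real rule would probably need an allowlist or a smarter check.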

I like to think of testing as making sure things are not wrong, but not as making them right.

Working, useful, delightful, in that order. Testing can make things more likely to work, that's it.

I think another important question is: can you distill taste? (Another comment uses the word "externalize", which might mean something similar.)

I think people have been trying for the written word, with some degree of success (anti-slop skills). I have been trying for visuals, and it's pretty meh. It's easy to get a multimodal LLM to follow a style guide, but a style guide doesn't capture everything that accounts for taste. And anything that is dynamic (not a screenshot test) seems really hard or really expensive.

the taste part for me is cutting what the agent generated. 200 lines come back, i keep 80, no test for which 80.

So now we need a framework for unit tastes

We can encode taste -- generative AI depends on it. Ask people to compare two examples and pick the one with better taste. You can even ask them to rate multiple subjective criteria at once. Use that to learn a scoring function based on the rating labels, and raw features. Now you can write tests.
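
Here is a minimal sketch of the pairwise idea, under loud assumptions: each example is reduced to two hypothetical raw features (`whitespace_ratio`, `font_count`), raters only say which of two examples they prefer, and the scoring function is a plain linear Bradley-Terry-style model (logistic regression on feature differences) trained by gradient ascent. The data is invented for illustration.

```python
import math


def train_preference_scorer(pairs, n_features, lr=0.1, epochs=500):
    """pairs: list of (preferred_features, rejected_features). Returns weights."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            diff = [a - b for a, b in zip(preferred, rejected)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            p_preferred = 1.0 / (1.0 + math.exp(-margin))
            # Gradient of the log-likelihood for one comparison.
            for i in range(n_features):
                w[i] += lr * (1.0 - p_preferred) * diff[i]
    return w


def taste_score(w, features):
    """Linear score; higher means 'closer to what the raters picked'."""
    return sum(wi * xi for wi, xi in zip(w, features))


# Hypothetical features: [whitespace_ratio, font_count].
# In every pair, raters preferred the first design over the second.
pairs = [
    ([0.6, 1], [0.2, 4]),
    ([0.5, 2], [0.3, 5]),
    ([0.7, 2], [0.4, 3]),
]
weights = train_preference_scorer(pairs, n_features=2)


def test_new_design_beats_baseline():
    # The "unit test for taste": a candidate must outscore the baseline.
    candidate, baseline = [0.65, 1], [0.25, 5]
    assert taste_score(weights, candidate) > taste_score(weights, baseline)


test_new_design_beats_baseline()
```

The test only checks agreement with the learned scorer, so it is as good as the comparisons and features that went into it.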

The bigger problem I have as a worker is that, once I externalize it (by writing a skill or whatever), it becomes a work-for-hire whose copyright is owned by my employer. Technically this is true of a few other things I do for work, like my .emacs and .bashrc files, small scripts I keep in ~/bin on my workstation, etc., but no employer cares to assert this unless they're being assholes for some unrelated reason. Agent skill files, especially ones that seem to semi-reliably do what they say on the tin (the white whale!), are not like that at all, and I can see them pursuing you if you try to use them at a future employer.

It can't be written down as code, that's the point.

I am more familiar with taste in coding and it can at best be described—that the resulting code is too subtly different from something else in the codebase, that you're masking a different bug, that you're not following what the code tells you. The good part is that while this cannot be unit tested, you can write documentation and code comments about it that tell people what they need to know.

But for taste of the kind described in the article there's not even a definition. The logic ended up being "trust a bunch of opaque weights the most"

You absolutely cannot unit test for taste.

I had this experience doing a port from BigQuery to Postgres using Opus. I had unit tests to guarantee parity with the original code, and Opus insisted on building this bespoke query builder (e.g. `def _where(very_complicated_params)`) on top of sqlglot.

Even with the original code being straightforward and legible and repeated instructions to match, I had to fight with it to get close.

In the end, I ended up doing things the "old-fashioned way", where I copied chunks of code into Claude proper and gave explicit instructions for each piece.

I clearly had externalized the requirements, and yet that wasn't sufficient. The only way to unit test further would be to use an AST to evaluate the output against metrics I couldn't even encode.

What's kind of funny is this is how I implemented "gates" for the ticketing system I built for Claude, because Beads would just close tickets without validation. I have tickets that are literally "Human validation" tier, so it will work on the next available thing until I personally tell the model to close it. So, in that spirit, yeah, you can unit test for taste, if you implement external validation.

Unit test runs, waits for human input before passing or failing, which might seem out of the norm, but we already have QA do manual testing.
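
As a rough illustration of a test that blocks on a person, here is a sketch with made-up names (`human_gate`, `check_landing_page`). The `ask` parameter is an assumption added so the gate can be driven by something other than a terminal; many test runners capture stdin, so a real setup would likely need a different channel such as a ticket status.

```python
from typing import Callable


def human_gate(description: str, ask: Callable[[str], str] = input) -> bool:
    """Block until a human answers; pass only on an explicit 'y' or 'yes'."""
    answer = ask(f"{description}\nApprove? [y/N] ").strip().lower()
    return answer in {"y", "yes"}


def check_landing_page(ask: Callable[[str], str] = input) -> None:
    # Fails like any other assertion if the reviewer does not approve.
    assert human_gate("Open the landing page build and look it over.", ask), \
        "human reviewer rejected the build"


# Simulated reviewer for demonstration; a real run would use the default `input`.
check_landing_page(ask=lambda prompt: "y")
```

Defaulting to "no" on anything but an explicit yes keeps an empty or mistyped answer from closing the ticket.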

Randomized trial. Half of them pledge to use AI freely and liberally, half of them to never use it, compare via surveys and off-AI tests after X months. Could even flip it so then the non-users used it for X months and vice versa, see if losses/gains are stable.

You may be able to effectively externalize taste by "hot or not" style pair testing. Enough comparisons and I'd expect ML to be able to mimic human taste by latching on to features we're not well aware of influencing us.

> You can't unit test for taste if you haven't written down what you mean by taste. If you can externalize it, then you can.

I'm not so sure. For instance, you can write down what it means for a program to be free of XSS and other injection vulnerabilities. Now, how would you unit test for that property?

If you have enough examples you can train an AI on your preferences, then use that distilled AI as a unit test. Don’t combine multiple into one AI. If they don’t agree you want it to fail so you can decide and retrain the tests.
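
A small sketch of the "keep the judges separate and fail on disagreement" part. The two judges below are trivial stand-in rules, not trained models; in the setup described above each one would be its own distilled preference model. All names are hypothetical.

```python
from typing import Callable, Sequence

Judge = Callable[[str], bool]


class JudgesDisagree(Exception):
    """Raised so a human can decide and then retrain the judges."""


def unanimous_verdict(sample: str, judges: Sequence[Judge]) -> bool:
    if not judges:
        raise ValueError("need at least one judge")
    verdicts = [judge(sample) for judge in judges]
    if len(set(verdicts)) > 1:
        raise JudgesDisagree(f"verdicts {verdicts} for {sample!r}")
    return verdicts[0]


# Stand-ins for independently trained preference models.
def is_short(label: str) -> bool:
    return len(label) <= 20


def is_not_shouting(label: str) -> bool:
    return not label.isupper()


judges = [is_short, is_not_shouting]
print(unanimous_verdict("Save changes", judges))

try:
    unanimous_verdict("SAVE", judges)  # short, but shouting
except JudgesDisagree as exc:
    print("needs a human:", exc)
```

Raising instead of averaging is the point: a split verdict is treated as a signal to look at the sample, not as a score of 0.5.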

I agree and indeed externalize everything you know *that matters*.

Want to follow a certain pattern or convention? Define it, e.g. active record vs. repository pattern, and stick it in as an ADR! You don't know what you want? Look at what Claude produces, acquire taste from that, and mark it as a convention that future sessions will follow. But stick to *one* convention!

Treat your LLMs as junior developers willing to apply various patterns willy-nilly, caring only about fulfilling the ACs of a given task and not about the longevity or well-being of the system in general. They will not look at the bigger picture to check whether a given pattern applies globally, or even whether there are any other patterns.

Language count is a decent notoriety signal though pretty coarse. The OP/author should take a look at QRank: https://qrank.toolforge.org/

> QRank is a ranking signal for Wikidata entities. It gets computed by aggregating page view statistics for Wikipedia, Wikitravel, Wikibooks, Wikispecies and other Wikimedia projects

from https://github.com/brawer/wikidata-qrank/blob/main/doc/desig...

OP here, that looks really neat, thanks for the link!

Cool! Thanks for sharing.

Exactly one of the reasons I never went down with all the TDD dogma of only writing code to fix broken tests.

There is a reason conference talks are always about plain algorithms and data structures.

The biggest flaw I've seen with TDD is the fact that correctness does not compose upward. Every time two units come into contact, you've got an entirely new kind of unit. The tests from constituents do not cover emergent properties of the new things. You will repeat this same exercise the entire way up to the top, and the moment you come into contact with the customer (they want to change everything), the house of cards comes crumbling down and you have to start your agonizingly-slow process all over from the bottom again.

The only thing that the business seems to care about is top-down UI testing. This is also convenient because you can leave it until the very end after the customer has already seen several prototypes.

I do think TDD makes sense in isolated scopes (prove this specific custom parser works at the edges), but as the general policy for the entire product it's definitely not a viable practice. Much of the time it comes off as an ego trip to see just how cleverly we can mock something so that we can say we technically tested it.

> TDD dogma of only writing code to fix broken tests.

Isn't red-green-refactor pretty ingrained in TDD?

Only write code to make a failing test pass; then refactor while making sure the tests still pass?

Then write a test that fails, repeat?

yup and I find it weird that people still remain so defensive of the Church of TDD even against empirical studies that show its limited benefits

https://arxiv.org/abs/2602.07900

Apple's Human Interface Guidelines say that some things can be written down, though. It's a very thorough look at UX, and while they don't adhere to them perfectly themselves, it's very much a north star for some ideals. You can't unit test for taste, but you can integration test that bad tastes haven't happened.

Technically, AI is code, just very complex code.

I'd say there are "simple" things you can do though, like taking automated screenshots and checking the colours for jarring colour schemes.
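
"Jarring" is subjective, but one narrow, measurable piece of it is text/background contrast. The sketch below uses the WCAG 2.x relative luminance and contrast ratio formulas; it assumes the colours have already been pulled out of a screenshot by some other step, which is not shown. The sample colours are made up.

```python
def relative_luminance(rgb):
    """WCAG 2.x relative luminance for an (r, g, b) tuple of 0-255 ints."""
    def linearize(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """WCAG contrast ratio, from 1 (identical) up to about 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


WHITE, BLACK, LIGHT_GREY = (255, 255, 255), (0, 0, 0), (200, 200, 200)

# WCAG AA asks for at least 4.5:1 for normal-size text.
for name, fg in [("black", BLACK), ("light grey", LIGHT_GREY)]:
    ratio = contrast_ratio(fg, WHITE)
    print(f"{name} on white: {ratio:.2f} ({'ok' if ratio >= 4.5 else 'too low'})")
```

This only catches unreadable combinations; clashing hues or an ugly palette would need a different check.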

> I can’t understand how vibe-coded apps actually work.

With a better process. e.g. plan->revision cycles, better instructions/docs like an ADR system.

I don't think vibe-coding is relegated to "build me reddit but with blockchain" and then it's done.

I think it instead describes the workflow where the software impl stays opaque but you evaluate the end product as an end user to step the product forward. It basically centers you as the tastemaker.

I'd say I've vibe-coded all of my personal projects since December, when AI had a breakthrough: it required less babysitting and developed good "taste", like using smart sum types without being prompted to do so.

I've accumulated my own best practices, like a heavy plan->revise cycle where plans ultimately promote into ./plans/impl/YYYY-MM-DD-{slug}.md, and an ADR system in ./docs/design/*.md that encodes arch/design invariants that accumulate over time, with new decisions/principles folded back into it as they are discovered (by the AI).

During the plan revision cycles, the LLMs may ask me a multiple choice question about which decision branch to take, and lately I've just been responding with "take the ideal option" with good results -- either way it will take a well-reasoned position that I can't really argue with.

Meanwhile, my role is mainly to evaluate the end product and steer it directionally. How much I decide to prescribe and inject myself into technical decisions is a function of how serious the project is, but it's easy to notice that LLMs are simply better and better at arriving at well-reasoned decisions, and my interjections are more and more limited to technical/directional taste rather than necessity.

Tools like Playwright and Maestro can already give you a small taste of what that would look like.

But overall I agree, LLMs are currently awful at being beta testers. They miss the most basic stuff that any human would immediately catch as being poor UX, and for all their visual prowess they are terrible at auditing UI.

> Rick Rubin told Anderson Cooper he has no technical ability. Doesn't play instruments. Can't work a mixing board.

If you watch his interview on Rick Beato's channel, this myth will fall apart. He plays guitar, had his own punk rock band and his guitar playing is featured on some high-profile records he produced. Also, he has a lot of practical experience with all kinds of studio equipment.

This is exactly it - the ultimate skill now is to be Rick Rubin with an LLM. Not a comfortable transition as a coder.

"The effort length and instantaneous power output changes, of course."

but that's what the phrase is meant to convey, right?

Don't run through consumable X (energy/money/etc) like there's no tomorrow - even though there's <some big important milestone> now, we've got dozens more of those that we need to meet, so you're better off getting this one done at 75% than committing 100% to it and failing on all the others.

> For example, my native Iceland had a nice mix of nature, historical sites and populated places.

You absolutely can unit test for taste: just put an agent in a loop and write what you like into the prompt. Then do scoring...

Iceland is a really bad example; it basically has one populated site (the capital) and a circular road that goes around the island.

I'm pretty sure there are more points of interest in the entirety of Iceland than just Reykjavík and Route Number One.

I think Apple lost a bit of credibility after the round-corner fiasco that still persists on Tahoe.

Now do a games engine with that approach regarding shaders and the desired visuals.

This is RL, right? Like, this is exactly why models have mostly converged around obvious style, because we train them literally on thumbs-up/thumbs-down data of what good behavior and good code looks like.

And that's why it's so hard to get a model to reproduce the specific taste of a person or an organization. My taste is different than yours, so if we dump our aggregate preferences into RL, it averages out to nothing interesting.

For the code-writing case, this means you end up reviewing every line of code, looking for places where you'd thumbs-down the code. Not every line of code contains a real decision, though, so it feels like a waste of time.

Wouldn't this style of training suffer from the AI learning things the user didn't intend? I may thumbs down something for a specific detail I don't like, while other things in it are great. Certain traits that tend to occur together go along for the ride. We see similar things happen in natural selection, where mates may be chosen for 1 specific feature, and other less desirable things come along for the ride.

Outside of AI, I run into this issue when taking basic personality tests. A question may be written for a specific reason, which influences the results, but the reason for my answer may be completely unrelated to the reason intended by the person who made the test.

I tell people you should be testing at the level where a change would be so hard you wouldn't do it anyway. Internal helper functions - they are tested only because the code that calls them passes. Interfaces that are used in thousands of places - you better test them well because you wouldn't dare change that anyway: it would break too many others.

Or to put it differently: a test is an assertion that no matter what, for all time this should never change again. Even if customer requirements change in the future they won't change in such a way as to break this test (this isn't always true, but you should believe it is true).

A test is most valuable when it alerts you to a real problem when it fails. If the test fails but there isn't a real problem (either because customer requirements have changed, or because it is flaky), investigating it was a needless cost. If the test passes, that gives some hope of correctness, but you can never be sure the code is really correct rather than the test being buggy (even if you use TDD, and so the test failed when you wrote it, that doesn't mean a later refactoring didn't turn it into an always-pass test).

Part of the problem is that if I tell you to write sort() or your new toy language's list type, you have an intuitive idea of what it should look like and will probably get it right the first time (other than bugs, which you want the tests to catch). These should have tiny micro tests. These things are also really easy to use as examples of how to do TDD - which they are, but they are not representative: this type of code is generally in your standard library already and you are not writing it.

Instead you are writing code that isn't well defined with lots of industry experience. It is not clear what the exact interface should be (or more likely it is clear customer requirements will change but you don't know how yet). You have no idea what the best implementation is. You don't know if this will be used in this one place, or if it will become a useful key part that many future projects depend on. You have to make guesses.

That is a flaw with unit tests written at far too low a level, not with TDD.

You would have the same problem if you wrote tests like that after the code.

TDD has no opinion about the level at which you wrote your test, it just assumes it's the correct one.

This is the number one biggest misconception about TDD which I keep seeing repeated on hacker news.

https://news.ycombinator.com/item?id=46810793

https://news.ycombinator.com/item?id=45113016

Exactly, the whole systems thinking and large-scale architecture also falls apart when writing everything from little working tests.

TDD is perfect for bugs; codify a replication first, then fix it.
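
For anyone who hasn't worked that way, a tiny made-up example of "replicate first, then fix". The `slugify` function and the double-space bug are hypothetical; the test is the part that would be written first, while it still fails.

```python
import re


def slugify(title: str) -> str:
    # Fixed version: collapse any run of non-alphanumerics into one hyphen.
    # (The hypothetical buggy version replaced each space individually,
    # so "a  b" came out as "a--b".)
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def test_double_space_regression():
    # Written first to reproduce the bug report, then kept forever.
    assert slugify("Taste  is hard") == "taste-is-hard"


test_double_space_regression()
```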

Is there an issue of taste when generating images with AI? Or can we relatively rapidly train people to generate beautiful images with a decent amount of variety?

AI-generated images and art still seem to look cheap or tasteless to a lot of viewers, so it can't be that easy to train people to fix that.