> Verification becomes hard to reason about because there is no ground truth for points of interest, there are no red/green unit tests for taste. I’m sure these are familiar challenges to data scientists and that there are frameworks and evals for working on them. This will require more iteration and manual overrides. Hopefully with feedback and collaboration from the community. But for now I’ve shipped V1…
I suspect LLMs may be able to help us quantify our taste because they can keep track of so many data points all at once, where we have to lossily abstract these details away.
Human: well-scoped argument that does just enough to get the job done with minimal risk.
AI: Extremely clever and correct legal argument that almost any lawyer would have said not to file (at least as written). It tries to burn the world and seriously risks pissing off the judge.
Follow this line of thinking, and the AI-friendly answer is easy: we just have to externalize everything we know, so Claude can implement what I want.
Except that I can't fully externalize myself. Debugging a system takes more resources than running the system. If I could write down everything I know and hand it to a machine, I'd do that, but it's impossible.
People aren't books or hashmaps. If you want to build something, you need to use the tools, not teach the tools to use you.
[edit: I'm trying to figure out if there's something to be done about this. Email me if you want to chat -- tr at tern dot sh]
> QRank is a ranking signal for Wikidata entities. It gets computed by aggregating page view statistics for Wikipedia, Wikitravel, Wikibooks, Wikispecies and other Wikimedia projects
from https://github.com/brawer/wikidata-qrank/blob/main/doc/desig...
There is a reason conference talks are always about plain algorithms and data structures.
Want to follow a certain pattern or convention? Define it, e.g. active record vs. repository pattern, and stick it in an ADR! You don't know what you want? Look at what Claude produces and then acquire taste, mark this as a convention that future sessions will follow, but stick to *one* convention!
Treat your LLMs as junior developers willing to apply various patterns willy nilly, caring only about fulfilling the ACs of given task and not about the longevity or well being of the system in general. They will not look at bigger picture to check if given pattern applies globally, or even if there are any other patterns.
I had this experience doing a port from BigQuery to Postgres using Opus. I had unit tests to guarantee parity with the original code, and Opus insisted on building this bespoke query builder (e.g. `def _where(very_complicated_params)`) on top of sqlglot.
Even with the original code being straightforward and legible and repeated instructions to match, I had to fight with it to get close.
In the end, I ended up doing things the "old-fashioned way", where I copied chunks of code into Claude proper and gave explicit instructions for each piece.
I clearly had externalized the requirements, and yet that wasn't sufficient. The only way to unit test further would be to use an AST to evaluate the output against metrics I couldn't even encode.
I am more familiar with taste in coding and it can at best be described—that the resulting code is too subtly different from something else in the codebase, that you're masking a different bug, that you're not following what the code tells you. The good part is that while this cannot be unit tested, you can write documentation and code comments about it that tell people what they need to know.
But for taste of the kind described in the article there's not even a definition. The logic ended up being "trust a bunch of opaque weights the most"
This has been my experience as well, but it’s a really big support. It just needs adult supervision. I can’t understand how vibe-coded apps actually work.
As far as “taste,” goes, I test my stuff constantly, checking for even minor “friction points,” sometimes, refactoring back to design, in order to resolve issues that many folks would ship. I’m pretty anal, and want my work to be the best experience possible.
I can’t see any LLM coming close to being able to evaluate the user experience, like I can.
I wrote about this a few months back. Rick Rubin is famous for this. I do think it is something that can be trained though, it just needs a lot more context. Taste builds over time through lots of unit tests, through lots of content writing, through an accumulation of product decisions. It’s hard to put it in the individual spec, but it can be teased out of 100 project specs. And when you get to that scale the AI starts to do it pretty well.
Working, useful, delightful, in that order. Testing can make things more likely to work, that's it.
I think people have been trying for the written word, with some degree of success (anti-slop skills). I have been trying for visuals, and it's pretty meh. It's easy to get a multimodal LLM to follow a style guide, but a style guide doesn't capture everything that accounts for taste. And anything that is dynamic (not a screenshot test) seems really hard or really expensive.
I'd say there are "simple" things you can do though, like take automated screenshots and check the colours for jarring colour schemes.
I do it too because it's a common expression, and a marathon is of course longer than a sprint, but both have in common that properly raced, they are absolutely brutal efforts that leave you without a single additional drop at the end. The effort length and instantaneous power output changes, of course. Maybe "it's a marathon build, not the race" would be more precise at the loss of nearly all its expressive power (but with a lot more pedanticism points) :-p .
Nice project!
You absolutely can unit test for taste: just put an agent into a loop, and write what you like into the prompt. Then do scoring...
Iceland is a really bad example: it basically has one populated site (the capital) and a circular road that goes around the island.
I'm not so sure. For instance, you can write down what it means for a program to be free of XSS and other injection vulnerabilities. Now, how would you unit test for that property?
Unit test runs, waits for human input before passing or failing, which might seem out of the norm, but we already have QA do manual testing.
But overall I agree, LLMs are currently awful at being beta testers. They miss the most basic stuff that any human would immediately catch as being poor UX, and for all their visual prowess they are terrible at auditing UI.
Isn't red-green-refactor pretty ingrained in TDD?
Only write code to make a failing test pass; then refactor while making sure the tests still pass?
Then write a test that fails, repeat?
The only thing that the business seems to care about is top-down UI testing. This is also convenient because you can leave it until the very end after the customer has already seen several prototypes.
I do think TDD makes sense in isolated scopes (prove this specific custom parser works at the edges), but as the general policy for the entire product it's definitely not a viable practice. Much of the time it comes off as an ego trip to see just how cleverly we can mock something so that we can say we technically tested it.
With a better process. e.g. plan->revision cycles, better instructions/docs like an ADR system.
I don't think vibe-coding is relegated to "build me reddit but with blockchain" and then it's done.
I think it instead describes the workflow where the software impl stays opaque but you evaluate the end product as an end user to step the product forward. It basically centers you as the tastemaker.
I'd say I vibe-code all of my personal projects now since December where AI had a breakthrough where it required less babysitting and developed good "taste" like smart sum types without being prompted to do so.
I've accumulated my own best practices like a heavy plan->revise cycle where plans ultimately promote into ./plans/impl/YYYY-MM-DD-{slug}.md, and an ADR system in ./docs/design/*.md that encodes arch/design invariants that accumulate over time, with new decisions/principles folded back into it as they are discovered (by the AI).
During the plan revision cycles, the LLMs may ask me a multiple choice question about which decision branch to take, and lately I've just been responding with "take the ideal option" with good results -- either way it will take a well-reasoned position that I can't really argue with.
Meanwhile, my role is mainly to evaluate the end product and steer it directionally. How much I decide to prescribe and inject myself into technical decisions is a function of how serious the project is, but it's easy to notice that LLMs are simply better and better at arriving at well-reasoned decisions, and my interjections are more and more limited to technical/directional taste rather than necessity.
If you watch his interview on Rick Beato's channel, this myth will fall apart. He plays guitar, had his own punk rock band and his guitar playing is featured on some high-profile records he produced. Also, he has a lot of practical experience with all kinds of studio equipment.
Or to put it differently: a test is an assertion that no matter what, for all time this should never change again. Even if customer requirements change in the future they won't change in such a way as to break this test (this isn't always true, but you should believe it is true).
A test is most valuable when it alerts you to a real problem when it fails. If the test fails but there isn't a real problem (either because customer requirements have changed, or it is flaky) it was needless cost to investigate it. If the test passes that gives some hope of correctness, but you can never be sure it is really correct vs a bug in the test (even if you use TDD and so the test failed when you wrote it that doesn't mean a refactoring since didn't make this an always pass test).
Part of the problem is if I tell you to write sort() or your new toy language's list type you have an intuitive idea of what it should look like and probably will get them right the first time (other than bugs you want the tests so you catch). These should have tiny micro tests. These things also are really easy to use as examples of how to do TDD - which they are, but they are not representative: this type of code is generally in your standard library already and you are not writing it.
Instead you are writing code that isn't well defined with lots of industry experience. It is not clear what the exact interface should be (or more likely it is clear customer requirements will change but you don't know how yet). You have no idea what the best implementation is. You don't know if this will be used in this one place, or if it will become a useful key part that many future projects depend on. You have to make guesses.
but that's what the phrase is meant to convey, right?
Don't run through consumable X (energy/money/etc) like there's no tomorrow - even though there's <some big important milestone> now, we've got dozens more of those that we need to meet, so you're better off getting this one done at 75% than committing 100% to it and failing on all the others.
And that's why it's so hard to get a model to reproduce the specific taste of a person or an organization. My taste is different than yours, so if we dump our aggregate preferences into RL, it averages out to nothing interesting.
For the code-writing case, this means you end up reviewing every line of code, looking for places where you'd thumbs-down the code. Not every line of code contains a real decision, though, so it feels like a waste of time.
Outside of AI, I run into this issue when taking basic personality tests. A question may be written for a specific reason, which influences the results, but the reason for my answer may be completely unrelated to the reason intended by the person who made the test.
You would have the same problem if you wrote tests like that after the code.
TDD has no opinion about the level at which you wrote your test, it just assumes it's the correct one.
This is the number one biggest misconception about TDD which I keep seeing repeated on Hacker News.
I have not encountered anything like that, with my Swift (native iOS) apps, but am pretty close to it, with my backend PHP.
I suspect that it depends on the tech stack. So far, the Swift output closely resembles that of a very inexperienced, but smart, engineer. I need to really keep a close eye on it.
Substitute static typing for TDD in your comment, and it will remain an equally valid statement.
If I were to ask you - what convention you want to follow for your database columns - camelcase or snakecase? There's no correct global answer. There's no overarching truth that should apply to all databases in existence (even if you'll focus on a certain type of database). Hence the no.
But yes, because in the context of existing system there is a convention. If it's snakecase, you create new tables with snakecase column names.
LLMs will generally follow conventions, but sometimes they will not, because indeed - global truths (or at least, the "last article it read" truths) sometimes win over (I assume)
The co-occurrence thing is often not a bug of the algorithm but a genuine part of the stochastic landscape that must be solved. Evolution isn't "failing" when sickle cell vulnerability is ported along with malaria resistance; it's just a real tradeoff being made in the current biological landscape.
LLMs are built for scale so they've given up on the kind of online learning / "long term memory" processes that would individualize them.
The LLM is permanently locked to being a really cracked engineer on their first day at your company, looking at your codebase for the first time.
You can scaffold a bit with .md files, but at the moment they lack the ability to do what humans do: go to sleep, encode things from short to long term memory, and wake up the next day with more specific knowledge baked in.
Well, you can package it up, otherwise Rick wouldn't exist.
Here I am talking about the basic static typing, and maybe some generics use occasionally, but obviously people also go overboard sometimes with type features and that hinders understanding for newcomers to the codebase.
IMHO this is where code review goes until we fix the individualized model thing: you need to review the decisions the agent made, where you didn't steer. Most will be right. A few will be disastrously wrong. But decision-by-decision is a lot less to review than line-by-line of code.
set up a rendering profile and preconditions that generates a minimal snippet of images/video using a predefined GPU profile.
then test for either a pixel perfect reproduction of the correct behaviour or for the properties you're looking for (if it doesnt reproduce deterministically).
this is one way. i also subscribe to the view that if the type system is modified to become stricter in such a way that it can fail reliably in the presence of this type of bug that this is also good enough.
some people might argue that these arent "strictly" TDD by some definition but they set out a path to follow red green refactor and confer identical benefits so my view is who gives a duck?
I don't have enough domain expertise to know which variant of these approaches is best but I'm enough of a TDD expert to know that what you're implying isn't possible is actually something you would probably derive a lot of value from if you did it.
it follows the definition of TDD and it works really well (with some caveats) but again some people get hung up on what their impression of TDD is (e.g. unit tests checking to see if a car object has a steering wheel or whatever...) rather than what it actually is and what about it is that actually works.
I wonder if this is even desirable from a product perspective. You probably don't want online learning in a product that you are selling because you can't guarantee a consistent quality of the product.
And to be fair, the ability to fire employees and hire new ones is pretty important for that reason. In cases where you can't easily fire employees (e.g. unions), you encounter the very problem you're describing, and it often leads to companies preferring more consistent automations.
22 June 2026·12 mins
I’m building In the Long Run where runners do virtual runs on famous routes around the world. The app tallies up your Strava mileage and plots your total distance as progress against country- or continent-spanning routes. The intention is to provide long-term inspiration and motivation; life is a marathon, not a sprint. You can have a bad month or season but still make progress on your virtual traversal of the world.
The app shows your progress on interactive maps, which lets users do some exploring of their own. But I had long wanted to enrich the maps with interesting sights or historical sites. For routes I was familiar with I could build such lists myself, but that doesn’t scale to routes spanning countries I am not familiar with. So I set out to find a data source for points of interest that I could build a pipeline off. Along the way I wrestled with taste and biases, and fought a hallucinating LLM. I initially thought AI would be the feature, but it ended up merely in a supporting role alongside other signals and data processing mainstays.
GeoNames was an obvious starting point, an extensive data source with locations, categories and links. The full data set can be downloaded and has a Creative Commons licence. So with my friend Claude I set about building a pipeline to go from the raw dumps to serving relevant points of interest to users of In the Long Run.
We used Python as the programming language (it had good library support for the tasks at hand), stored processed data locally as Apache Parquet files and used DuckDB as the query layer.[1] This was my first time using both Parquet and DuckDB, but the ergonomics of both felt good and Claude introduced me to their features step by step (and most of the DuckDB work was SQL, which I am very familiar with). In general I find adding one or two new tools or technologies to a project is the best way to learn. If the entire stack is new to you the learning curve will be too steep and it might put you off the project entirely. AI coding agents change this calculus somewhat, but even then I find having a handle on most of the technologies being used lets me steer the agent better and make informed decisions instead of blindly following its lead.

Point of interest feature screenshot for a runner on Route 66 near Springfield, Illinois.
I built a project plan with Claude before starting the implementation, outlining the different steps of the pipeline and feature work. As we went along we then built a spec/plan for each step that we could iterate on as we learned more from earlier work. This also meant I could start new agent sessions for each milestone. Condensing results from the previous milestones into short context and instructions for the next step gets you faster and better responses (I find big contexts quickly degrade the quality of agent work).
To begin with we downloaded and unzipped all the required files from GeoNames and set up gitignores for the data files, as most are too large to be version controlled.
The first step of processing was joining the downloaded files on the relevant columns and filtering out rows that were not useful for our purposes. For instance we excluded administrative divisions: countries, states, regions etc. We also selected specific feature codes that we thought would be most interesting: parks, historic sites, castles, monuments, mountains, etc.[2] Finally we added a population filter on populated places and an elevation filter on mountains. I’m sure this led to some false negatives, but we wanted a rough first draft.
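The filtering described above could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline code: the `Place` record, the thresholds and the selected set of feature codes are all assumptions made for the example. The codes themselves are real GeoNames feature codes, but the article does not say exactly which ones were chosen or what the cut-offs were.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Place:
    """Hypothetical, trimmed-down view of one GeoNames row."""
    name: str
    feature_class: str  # GeoNames class, e.g. "A" (administrative), "P" (populated place)
    feature_code: str   # GeoNames code, e.g. "CSTL" (castle), "MT" (mountain)
    population: int = 0
    elevation_m: Optional[int] = None


# Assumed selection and thresholds, chosen only for illustration.
INTERESTING_CODES = {"PRK", "HSTS", "CSTL", "MNMT", "MT"}
MIN_POPULATION = 5_000
MIN_ELEVATION_M = 1_000


def is_candidate(place: Place) -> bool:
    """Apply the three filters: no admin divisions, population floor, elevation floor."""
    if place.feature_class == "A":
        return False  # countries, states, regions, ...
    if place.feature_class == "P":
        return place.population >= MIN_POPULATION
    if place.feature_code not in INTERESTING_CODES:
        return False
    if place.feature_code == "MT":
        return (place.elevation_m or 0) >= MIN_ELEVATION_M
    return True


# Made-up rows, just to exercise each branch of the filter.
places = [
    Place("Example State", "A", "ADM1", population=12_000_000),
    Place("Example Town", "P", "PPL", population=114_000),
    Place("Tiny Hamlet", "P", "PPL", population=40),
    Place("Some Castle", "S", "CSTL"),
    Place("Low Hill", "T", "MT", elevation_m=300),
    Place("High Peak", "T", "MT", elevation_m=2_100),
]
candidates = [p.name for p in places if is_candidate(p)]
```

Hard thresholds like these are cheap to apply but are exactly where false negatives creep in, which is why a sanity check against known landmarks is worth having.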
Somewhat unintuitively, the alternateNames.txt GeoNames dataset includes Wikipedia links (where isolanguage=link and alternate_name like %en.wikipedia.org%). This use case feels bolted on to their schema after the fact, but it is very helpful data to have. We used this as a notoriety/relevance signal, and it also provided texts that we could build blurbs from, as Wikipedia summaries also have a Creative Commons licence.
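To make the link extraction concrete, here is a small sketch in plain Python. It assumes the tab-separated, header-less layout of alternateNames.txt with the columns in the order alternateNameId, geonameid, isolanguage, alternate name; the function name and the sample rows are invented for the example, and the real pipeline did this in DuckDB SQL rather than with the `csv` module.

```python
import csv
import io


def english_wikipedia_links(tsv_file):
    """Map geonameid -> first English Wikipedia URL found in an alternateNames-style file."""
    links = {}
    # QUOTE_NONE: treat quote characters in place names as ordinary text.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip malformed lines
        geoname_id, iso_language, alternate_name = row[1], row[2], row[3]
        if iso_language == "link" and "en.wikipedia.org" in alternate_name:
            links.setdefault(int(geoname_id), alternate_name)
    return links


# Made-up rows in the assumed column order.
sample = io.StringIO(
    "1\t100\ten\tStonehenge\n"
    "2\t100\tlink\thttps://en.wikipedia.org/wiki/Stonehenge\n"
    "3\t100\tlink\thttps://de.wikipedia.org/wiki/Stonehenge\n"
    "4\t200\tlink\thttps://example.org/not-wikipedia\n"
)
links = english_wikipedia_links(sample)
```

Only the place with id 100 ends up with a link here; places without an English Wikipedia link simply have no entry, which is what makes the presence of a link usable as a relevance signal.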
We built a basic sanity check for this pipeline step that helped us verify we weren’t skipping notable landmarks; this let us tweak some of the filtering. For instance, the first draft pulled in the Australian rural locality Stonehenge but not the prehistoric megalithic structure (its more famous namesake). When working in English you also want to make sure you pull in the relevant alternate names / languages and use the relevant Wikipedia URL as a cross reference (GeoNames stores the canonical name in the local language).
The final result of this step was a parquet file with around 725 thousand rows for points of interest globally. A significant reduction from the 13 million in the full original set we started out with.

Populated places are the bulk of the GeoNames dataset. But we don't want the points of interest to just show every town, village and hamlet on the way.
In the second step we matched all candidates from the first step with each of the routes we have. First we take a GeoJSON file for the route and build a bounding box to quickly filter to just the points remotely close to the route. We then iterate over the route coordinates to see which of the points inside the bounding box also fall within a given distance of the route itself (50km by default). We used Shapely and Pyproj for the geo calculations and to calculate a “distance along route” attribute so that we can decide “when” we should show the point of interest to the runner.
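The two-stage matching (cheap bounding box first, then a real distance check) can be sketched as below. Note that this is not the Shapely/Pyproj implementation from the article: to keep the example dependency-free it swaps in a haversine distance to the nearest route vertex, which is only as accurate as the spacing of the route's coordinates. The function names, the tuple layout and the test coordinates are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def match_pois(route, pois, max_km=50.0):
    """Return (name, km_from_route, km_along_route) for POIs within max_km of the route.

    route: list of (lat, lon) vertices; pois: list of (name, lat, lon).
    """
    # Cumulative distance travelled when reaching each route vertex.
    cumulative = [0.0]
    for prev, cur in zip(route, route[1:]):
        cumulative.append(cumulative[-1] + haversine_km(prev, cur))

    # Padded bounding box as a cheap pre-filter (does not handle the antimeridian).
    pad_lat = max_km / 111.0  # roughly 111 km per degree of latitude
    widest_lat = min(max(abs(lat) for lat, _ in route) + pad_lat, 89.0)
    pad_lon = pad_lat / math.cos(math.radians(widest_lat))
    min_lat = min(lat for lat, _ in route) - pad_lat
    max_lat = max(lat for lat, _ in route) + pad_lat
    min_lon = min(lon for _, lon in route) - pad_lon
    max_lon = max(lon for _, lon in route) + pad_lon

    matches = []
    for name, lat, lon in pois:
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            continue
        # Nearest route vertex and the distance to it.
        index, dist = min(
            ((i, haversine_km((lat, lon), vertex)) for i, vertex in enumerate(route)),
            key=lambda pair: pair[1],
        )
        if dist <= max_km:
            matches.append((name, dist, cumulative[index]))
    matches.sort(key=lambda match: match[2])  # order in which a runner reaches them
    return matches


# A toy route along the equator with one vertex per degree of longitude.
route = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
pois = [("near", 0.1, 1.0), ("far", 5.0, 1.0), ("edge", 0.3, 1.9)]
matches = match_pois(route, pois)
```

For long routes with sparse vertices, measuring against the projected line geometry instead of the vertices would give tighter distances; the structure of the two-stage filter stays the same.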
The output from this step is a route-specific parquet file used for further refinement of the route. For our Iceland ring road route (1,321 km) we got 511 POIs; for the longest route in the app, Cape Town to Magadan (23,257 km), we got 10 thousand POIs, while Route 66 (3,787 km) got 14,181 POIs. This was an early sign that our anglophone-Wikipedia signal was really a “where do English speakers live and edit wikis” bias.
In the third step we enriched the data we have with Wikipedia information and used an LLM to generate a rating for each point of interest. At first I’d also intended to use LLM generated summaries for the points of interest, but that proved a significant challenge with minor benefits.
First we fetch the Wikipedia summary for each of the points we have for a given route. We do the same for Wikidata: for each Wikipedia URL, we look up how many language Wikipedias have an article on that subject. This is another good notoriety signal: if a page exists in many languages it is likely to be more significant than one that only has an entry in the English Wikipedia. The wiki data we can cache globally; this saves us a refetch in case later routes use the same points.
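A rough sketch of the language count and the global cache might look like this. It assumes an entity dictionary shaped like the `sitelinks` portion of a Wikidata entity (keys such as `enwiki`, `dewiki`, `enwikivoyage`); the `fetch_entity` callable is a hypothetical stand-in for whatever performs the real lookup, and the exclusion list of non-language wikis is an assumption that is probably not exhaustive.

```python
# Site ids that end in "wiki" but are not language editions of Wikipedia.
# Assumed, non-exhaustive list for illustration.
NON_LANGUAGE_WIKIS = {"commonswiki", "specieswiki", "metawiki", "wikidatawiki"}


def count_language_wikipedias(entity):
    """Count sitelinks that look like language Wikipedias (e.g. "enwiki", "iswiki")."""
    sitelinks = entity.get("sitelinks", {})
    return sum(
        1 for site in sitelinks
        if site.endswith("wiki") and site not in NON_LANGUAGE_WIKIS
    )


def language_count(wikipedia_url, cache, fetch_entity):
    """Return the cached count for a URL, fetching the entity only on a cache miss."""
    if wikipedia_url not in cache:
        cache[wikipedia_url] = count_language_wikipedias(fetch_entity(wikipedia_url))
    return cache[wikipedia_url]


# A fake fetcher so the example runs offline; it records how often it is called.
calls = []


def fake_fetch(url):
    calls.append(url)
    return {"sitelinks": {"enwiki": {}, "iswiki": {}, "dewiki": {},
                          "commonswiki": {}, "enwikivoyage": {}}}


cache = {}
first = language_count("https://en.wikipedia.org/wiki/Example", cache, fake_fetch)
second = language_count("https://en.wikipedia.org/wiki/Example", cache, fake_fetch)
```

Because the cache is keyed by URL rather than by route, a point shared by two routes is only looked up once.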
The wiki data is also input into an LLM-powered step. We created a tool that we call to get structured data returned. We chose Anthropic’s Haiku model for speed and price (unsurprisingly it was the one recommended through Claude Code by its “sibling” Opus) and batched the calls to get further price savings (50% off input and output tokens). This was my first time programmatically calling an LLM like this; the API made sense but its output wasn’t entirely consistent. For instance, sometimes weird variants of the Anthropic Markup Language (antml) leaked into the tool call result string, calling for a cleanup. The batched tool calls can take hours to complete and the cost for the larger routes was around $10. I’d want to experiment with local or cheaper models here to see what the tradeoffs are.
Here we also caught some hallucinations: the first attempt did not “ground” the LLM enrichment in much data nor apply restrictions in the prompt. This meant that Haiku classified Central Park in Decatur, Illinois as its more famous namesake in Manhattan and it got a large upgrade in its significance. For the second pass we added location and administrative metadata (country, city, etc) as input to the LLM as well as grounding it more carefully in the system prompt. Even then my spot checking uncovered several hallucinations: Haiku changed population sizes for towns and made mountains way larger than they really were (like Hugh Grant in that 90s classic).
I decided to just revert to the Wikipedia summaries at that point. The LLM text did often read better for our purposes, but correctness felt more important than readability.[3] You could play around with the input data and prompts on the input side and build evals on the verification side, but ultimately it didn’t feel like it was worth the time or costs (LLM output tokens being more expensive than input). This challenge is an exciting one but tough to wrestle with outside of more easily verifiable domains like code (building integration tests to fact check text sounds like a Wittgensteinian task).
I still used the LLM to give the points of interest a rating used for calculating a significance score along with the feature codes and wiki language counts. Relying on just the Wikidata gave a lot of weight to every small town that had an automatically translated wiki page in 150 languages. Getting a more “subjective” rating from an LLM helped lift the more “interesting” points of interest for every route.
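One way to combine the three signals into a single score is a weighted sum with the language count damped logarithmically, so that a page translated into 150 languages does not drown out everything else. The sketch below is only an illustration of that idea: the weights, the 0 to 10 rating scale, the per-code weights and the normalisation cap are all assumptions, not values from the actual pipeline.

```python
import math

# Hypothetical weights per GeoNames feature code; unknown codes get a low default.
FEATURE_WEIGHTS = {"CSTL": 1.0, "MNMT": 0.9, "HSTS": 0.9, "PRK": 0.7, "MT": 0.7, "PPL": 0.3}
DEFAULT_FEATURE_WEIGHT = 0.3
MAX_WIKI_LANGUAGES = 300  # assumed cap used to normalise the language count


def significance(llm_rating, wiki_languages, feature_code,
                 llm_weight=0.5, wiki_weight=0.3, feature_weight=0.2):
    """Blend an LLM rating (assumed 0-10), a wiki language count and a feature weight into 0..1."""
    llm_part = min(max(llm_rating, 0), 10) / 10
    # log1p dampens large counts: going from 5 to 50 languages matters more than 150 to 195.
    wiki_part = min(math.log1p(max(wiki_languages, 0)) / math.log1p(MAX_WIKI_LANGUAGES), 1.0)
    feature_part = FEATURE_WEIGHTS.get(feature_code, DEFAULT_FEATURE_WEIGHT)
    return llm_weight * llm_part + wiki_weight * wiki_part + feature_weight * feature_part


# A small town with many auto-translated pages vs. a landmark with fewer translations.
small_town = significance(llm_rating=2, wiki_languages=150, feature_code="PPL")
landmark = significance(llm_rating=9, wiki_languages=40, feature_code="CSTL")
```

With these assumed weights the landmark outranks the heavily translated small town, which is the effect the subjective rating was brought in to achieve. Exposing the weights as parameters also leaves room for tuning them per route.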

Highest LLM rated points of interest so far. I suspect Reykjavík gets a 10 because it is explicitly mentioned in the prompt. It is a capital but is it more significant than Chicago or LA? How about Vatnajökull? I'm not sure.
So the LLM got relegated from writing (because it made stuff up) but promoted to offer the subjective taste latent in its weights. On the whole this step changed my thinking from this new technology being a foundation of the new feature ("AI solves this") to AI just being a new tool in a bigger toolbox ("AI nicely augments other traditional approaches").
As well as building the pipeline itself we built some tooling along the way to sanity check and debug the different stages. For instance a Leaflet based visualization tool to place the POIs on a map to verify placement and get a preview of what the end result would look like proved useful. I also built a queries.sql file to inspect the parquet files using SQLYac and DuckDB to spot check for false negatives or positives.
The last steps were to actually consume the artifacts produced by the pipeline, build the API endpoint for the data and show the points of interest to the user on the map. The implementation isn’t that relevant to the topic of the blog post but funnily enough this was also a step where Claude Code wanted to write the implementation first and was then going to give me the spec for approval. A shortcut that I’m sure many developers are familiar with, but an important place to try to rein in the AI and get it to follow the process you set out with.
From the enriched per-route candidates we then built an output artifact, a JSON file, that contained the points of interest for the route. This was the first point of interest data that was actually version controlled.
This is also where it became apparent that we would need per-route tweaks and parameters. Trying a few different routes I quickly realised that the data for every route is different, routes in different territories, countries and continents have different sights (cultural vs. natural vs. historical etc). This seems obvious when you say it out loud, but it didn’t really occur to me how big the variance was until we got this far.
For example, my native Iceland had a nice mix of nature, historical sites and populated places. But for other routes in more densely populated places the point of interest map basically became a population map, showing every town, village and hamlet along the way. Other points of interest were clustered in cities, because that is where the buildings, statues and monuments also are.
So we added per-route parameters like filtering on population, ranking based on GeoNames feature classes, and weighting the “subjective” LLM score higher against the “objectiveness” of the wiki link counts. We also applied a geographic filter so that only the most interesting sights in a given radius are shown, to get a more even spread of points of interest between cities and the more rural paths that link them.
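The geographic filter can be approximated with a greedy pass: visit points from highest to lowest score and keep one only if nothing already kept lies within the radius. This is a sketch of that general technique under assumed names and data shapes (dictionaries with `name`, `lat`, `lon`, `score`), not the app's actual implementation, and the coordinates are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def thin_by_radius(pois, radius_km):
    """Keep the best-scoring POI in each neighbourhood of radius_km (greedy, O(n^2))."""
    kept = []
    for poi in sorted(pois, key=lambda p: p["score"], reverse=True):
        position = (poi["lat"], poi["lon"])
        if all(haversine_km(position, (k["lat"], k["lon"])) >= radius_km for k in kept):
            kept.append(poi)
    return kept


# Three sights clustered in one city and a single, lower-scored rural stop.
pois = [
    {"name": "top sight", "lat": 40.000, "lon": -90.000, "score": 0.9},
    {"name": "second sight", "lat": 40.005, "lon": -90.005, "score": 0.8},
    {"name": "third sight", "lat": 40.010, "lon": -90.000, "score": 0.7},
    {"name": "rural stop", "lat": 40.000, "lon": -89.000, "score": 0.3},
]
kept_names = [p["name"] for p in thin_by_radius(pois, radius_km=10.0)]
```

The city cluster collapses to its best sight while the rural stop survives despite its low score, which evens out the spread along the route. The quadratic comparison is fine for a few thousand points per route; a spatial index would be the next step if it became slow.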
Overall the evaluation of success was one of the most challenging parts of the project. As a developer, I’m used to building features that either work or don’t and there is often an objective way to measure how well a feature performs. For messy real world data it was hard to evaluate how good or bad the pipeline was. Furthermore, it was easy to start optimising for a specific parameter or route and find later that this work led to severe degradations in other areas.
Verification becomes hard to reason about because there is no ground truth for points of interest, there are no red/green unit tests for taste. I’m sure these are familiar challenges to data scientists and that there are frameworks and evals for working on them. This will require more iteration and manual overrides. Hopefully with feedback and collaboration from the community. But for now I’ve shipped V1; you can try it out for select routes at InTheLongRun.app!
[1] This is my first time writing up a project that I worked on using an AI agent. I kept writing “we” because the project felt like a collaboration. At times, it even felt like I was being mentored by a senior because some of the technology was new to me. On reading it back, saying “we” feels like an accountability dodge, because of course I’m fully and solely responsible for any errors in this write-up or code. But just using I/me also feels dishonest, because so much of the implementation here isn’t fully mine, so I feel like I’m taking too much credit for my collaboration with the machines. I figure this is a new kind of pronouns debate we’ll be having for the foreseeable future.
[2] Here I found the first agent hallucination of the project: Claude wanted to filter to the NTMK feature code. But as far as I can tell no such feature code exists and I can’t figure out what it was meant to be either.
[3] Of course Wikipedia can also be incorrect, but that feels like a better-known failure mode and at least there we have attribution.