I can’t imagine letting an agent try everything that the LLM chatbot had recommended ($$$). Its recommendations often include very poorly maintained or niche libraries that have quite a lot of content written about them but, I can only imagine, see very limited use in real production environments.
On the other hand, we have domain expert “consultants” in our leadership’s ears making equally absurd recommendations that we constantly have to disprove. Maybe an agent can keep those consultants occupied and let us do our work in peace.
[1] https://github.com/ykumards/eCLIP/commits/main/autoresearch
Good lens.
The crux of the autoresearch repo is basically one file, program.md: a system prompt that can be summarized as “do this in a loop: improve train.py, run the training, run evals, record the result. Favor simplicity.” The other files define an arbitrary ML model that is being trained.
That's such a weird switch. There's lots of free medical imaging online. Example: https://www.cancerimagingarchive.net/
This has been the standard approach for more complex LLM deployments for a while now in our shop.
Using different models across iterations is also something I've found useful in my own experiments. It's like getting a fresh pair of eyes.
The bottleneck in AI/ML/DL is always data (volume & quality) or compute.
Does/can Autoresearch help improve large-scale datasets? Is it more compute efficient than humans?
I started looking at Kaggle again and autoresearch seems to converge to many of the solution vibes there.
Wild ensembles, squeezing a bit of loss out. More engineering than research IMO
i.e. perhaps minimal changes to autoresearch can take control for cost-effective research to occur.
A lot depends on whether it is expensive to you. I use Claude Code for the smallest of whims and rarely run out of tokens on my Max plan.
This is, of course, only applicable if doing a single test is relatively fast. In my work a single test can take half a day, so I'd rather not let an agent spend a whole night doing a bogus test.
For all the folks spending a lot of time and energy setting up MCP servers, AGENTS.md, etc.: I think this shows that the LLM cannot do what AI boosters are selling it as, and needs extreme amounts of guidance to reach a desired goal, if it even can. This is not an argument that the tech has no value. It clearly can be useful in certain situations, but this is not what OpenAI/Anthropic/Perplexity are selling, and I don’t think the actual use cases have a sustainable business model.
People who spend the energy to tailor the LLMs to their specific workflows and get it to be successful, amazing. Does this scale? What’s going to happen if you don’t have massive amounts of money subsidizing the training and infrastructure? What’s the actual value proposition without all this money propping it up?
Years ago there were big hopes for Bayesian hyperparameter optimization (predicting performance with Gaussian processes, the hyperopt library, and so on), but it often launched wasteful experiments because it really had no idea what the parameters did. People mostly just do grid search and random search with a configuration set up by intuition and experience. Meanwhile an LLM can see what each hyperparameter does, see what techniques and settings have worked in the literature, and apply something approximating common sense about what has a big enough effect. It's surprisingly difficult to precisely define when a training curve has really flattened, for example.
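Even the “obvious” stopping question is fuzzy. A minimal flatness check, sketched below, already has two arbitrary knobs (window size and relative tolerance), which is exactly the kind of judgment call classic HPO tooling can't make for you:

```python
def has_plateaued(losses, window=5, rel_tol=1e-3):
    """Naive 'has the curve flattened?' check: compare the mean loss over
    the last `window` steps to the previous `window` steps; declare a
    plateau when the relative improvement falls below `rel_tol`.
    Both knobs are arbitrary choices, which is the point."""
    if len(losses) < 2 * window:
        return False
    prev = sum(losses[-2 * window:-window]) / window
    last = sum(losses[-window:]) / window
    return (prev - last) / max(abs(prev), 1e-12) < rel_tol
```

A curve that halves its loss every step is clearly still improving, while one stuck at the same value is clearly done; everything in between depends on the two knobs.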
So in theory there are many non-LLM approaches but they are not great. Maybe this is also not so great yet. But maybe it will be.
Yes, for example "swarm optimization".
The difference with "autoresearch" (restricting just to the HPO angle) is that the LLM may (at least we hope) beat conventional algorithmic optimization by making better guesses for each trial.
For example, perhaps the problem has an optimization manifold that has been studied in the past and the LLM either has that study in its training set or finds it from a search and learns the relative importance of all the HP axes. Given that, it "knows" not to vary the unimportant axes much and focus on varying the important ones. Someone else did the hard work to understand the problem in the past and the LLM exploits that (again, we may hope).
There always are. You need to think about what those would be, though. Autoresearch outsources the thinking to LLMs.
Non-parametric optimization is not a new idea. I guess the hype is partly because people hope it will be less brute force now.
Why is that?
I don't doubt you, but when Shigeo Shingo created SMED (Single Minute Exchange of Die), die changes were an hours long process.
Definitely not in the budget for non-VC-backed companies who aren’t in the AI bubble.
What about more distant software projects? Give it the CPython source code and say you want it to be faster.
This was the case for me a year ago. Now Claude or Codex are routinely delivering finished & tested complete features in my projects. I move much, much faster than before and I don’t have an elaborate setup - just a single CLAUDE.md file with some basic information about the project and that’s it.
I wrote up some more notes on that here: https://simonwillison.net/2026/Mar/13/liquid/
I found LLMs make a fabulous frontend for git :-D
Alternatively, a modular model with multiple “experts” that I could mix and match for my specific stack
I don’t need the model to know all of the Internet plus 20 different human languages. I just want it to be really good with the stack of the project
I recall reading about a stochastic one years ago: <https://github.com/StanfordPL/stoke>
What’s the point of adding features that are inscrutable? I have gotten Claude to build a feature that mostly works, and when it doesn’t work quite right I spend a massive amount of time trying to understand what is going on. For things that don’t matter too much, like prototyping, I think it’s great to just be able to get a working demo out faster, but it’s kind of terrifying when people start doing this for production stuff. Especially if their domain knowledge is limited. I can personally attest to seeing multiple insane things that are clearly vibe coded by people who don’t understand things. In one case, I saw API keys exposed because they were treating database users as regular user accounts for website login auth.
> I move much, much faster than before
This is a bad metric as has been attested multiple times in unrelated situations. Moving faster is not necessarily productivity nor is it value.
It’s certainly cool, but the optimizations are so basic that I’d expect a performance engineer to find these within a day or two with some flame graphs and profiling.
If you're resource unconstrained then BO should ofc do very well though.
Good Bayesian exploration is much, much better than grid search, and does indeed learn to avoid low value regions of the parameter space. If we're talking about five minute experiments (as in the blog post), Bayesian optimization should chew through the task no problem.
So cheaper than a performance engineer for a day or two... but the Shopify CEO's own time is likely a whole lot more expensive than a regular engineer!
There are lots of old ideas from evolutionary search worth revisiting given that LLMs can make smarter proposals.
Ever since it showed up on my GH feed, Karpathy’s Autoresearch has been rattling around in the back of my mind. I wanted to try it on a research problem I fully understood. So this weekend, I picked up my old research code from eCLIP, dusted off its legacy dependencies, and gave it to Claude Code. Then I just let it cook while I did some chores around the house.
This is my journey…
Autoresearch is a simple constrained optimization loop with an LLM agent in the middle. The agent iteratively improves some eval metric by modifying a single file (train.py), while reading instructions from program.md. I added a scratchpad.md file for the agent to use as working memory to document its thought process and experiment history.
In the program.md, I split the exploration into “phases”, starting with some obvious hyperparameter tuning, then moving on to small architectural changes and finally some moonshot ideas. In the final phase, I basically let the agent run with minimal constraints, and gave it web access to read papers and look for new ideas.
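For a concrete picture, an abridged, hypothetical program.md in this spirit might look like the following (the real file has more phases and different wording):

```markdown
# program.md (sketch)

Loop forever: pick ONE change, edit train.py, run ./run.sh,
record the eval mean rank in scratchpad.md, then commit if it
improved and revert if it didn't. Favor simplicity.

## Phase: hyperparameters
Tune learning rate, batch size, weight decay, projection dim.

## Phase: small architectural changes
One modification at a time; keep each run inside the time budget.

## Phase: moonshots
Minimal constraints. You may search the web for papers and new ideas.
```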
The whole thing is a tight loop: hypothesize → edit → train → evaluate → commit or revert → repeat.
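As a rough sketch of that loop, with a stub objective standing in for the real train-and-eval step (the toy objective, parameter names, and proposal rules below are all my assumptions, not the repo’s code):

```python
import random

def run_experiment(params):
    # Stand-in for the real ~5-minute train.py run; returns the eval
    # mean rank (lower is better). Purely a toy objective for illustration.
    lr, dim = params["lr"], params["proj_dim"]
    return (lr - 3e-4) ** 2 * 1e7 + abs(dim - 512) * 0.1 + random.random()

def autoresearch_loop(n_iters=10, seed=0):
    random.seed(seed)
    params = {"lr": 1e-3, "proj_dim": 256}   # committed state (train.py)
    best = run_experiment(params)            # baseline eval
    history = []
    for _ in range(n_iters):
        trial = dict(params)                 # hypothesize: ONE change per run
        if random.random() < 0.5:
            trial["lr"] *= random.choice([0.5, 2.0])
        else:
            trial["proj_dim"] = random.choice([128, 256, 512, 1024])
        score = run_experiment(trial)        # train + evaluate
        if score < best:
            params, best = trial, score      # commit
            history.append(("commit", score))
        else:
            history.append(("revert", score))
    return params, best, history
```

The real agent proposes changes with an LLM rather than at random, but the commit-or-revert skeleton is the same: only strictly better eval results survive.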

The experiment should be short, around 5 minutes wall clock per run, to encourage quick iterations and prevent overfitting to noise. The agent is free to change anything in train.py as long as it runs within the time budget.
Since I was paranoid about letting the agent run arbitrary code on my workstation, I containerized the training loop and removed network access. The whole experimentation flow is orchestrated by run.sh. Then I locked down Claude Code’s permissions to only edit these two files and run run.sh. No direct Python execution, no pip installs, no network access, no git push, etc.
I won’t bore you with the details; you can check out the repo here!
The original paper used several medical X-ray datasets which I don’t have access to anymore, so I needed a new dataset with spatial annotations to test the expert attention mechanism. I picked the Ukiyo-eVG dataset: ~11K Japanese woodblock prints with phrase → bounding box annotations from the CIGAr paper (ECCV 2024 VISART).

Heatmaps obtained from bounding boxes guide the model to focus on specific regions.
The bounding boxes were converted to gaussian heatmaps and fed into the model as an additional input, similar to how radiologist eye-gaze heatmaps work in the original eCLIP paper.
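A sketch of that conversion (the function name, sigma heuristic, and normalization are my assumptions, not the repo’s code):

```python
import numpy as np

def bbox_to_heatmap(h, w, box, sigma_scale=0.25):
    """Render one (x0, y0, x1, y1) box as a Gaussian blob on an h x w grid.
    Sigma is tied to the box size so bigger boxes make wider blobs.
    Multiple boxes could be merged with np.maximum over their heatmaps."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1.0)
    sy = max((y1 - y0) * sigma_scale, 1.0)
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2.0)
    return g / g.max()   # normalize so the peak is 1.0, like a gaze heatmap
```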
I had a busy week and a lot of chores piling up, so I just pointed Claude at my old research code and went to do laundry. It upgraded the Python env of my old research codebase, wrote the ingestion code for the new dataset, and set up the scaffolding for the experiment loop.
I set up the CV splits, evaluation logic and some initial ideas for the program.md.
For the eval metric we picked Mean Rank of the retrieved embeddings. I didn’t put much thought into it — in hindsight, Median Rank would’ve been a better choice since it’s more robust to outliers. But we just needed something intuitive that clearly tells the agent whether a change is good or bad. Since Recall@K is the standard for reporting final results anyway, Mean Rank just needed to point in the right direction.
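For reference, Mean Rank, Median Rank, and Recall@K are all simple functions of the per-query rank of the ground-truth match (the helper names below are mine, not the repo’s):

```python
import numpy as np

def retrieval_ranks(img_emb, txt_emb):
    """Rank (1 = best) of the matching caption for each image by cosine
    similarity; pair i <-> i is assumed to be the ground truth."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img @ txt.T                      # (N, N) similarity matrix
    order = np.argsort(-sims, axis=1)       # best candidate first
    n = sims.shape[0]
    return np.array([int(np.where(order[i] == i)[0][0]) + 1 for i in range(n)])

def mean_rank(ranks):
    return float(np.mean(ranks))            # skewed by a few terrible queries

def median_rank(ranks):
    return float(np.median(ranks))          # robust to outliers

def recall_at_k(ranks, k=5):
    return float(np.mean(ranks <= k))       # fraction ranked in the top k
```

A single query stuck at rank 1000 drags the mean way up while leaving the median untouched, which is exactly the robustness trade-off mentioned above.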
I kicked off the loop on Saturday morning and let it run through the day, occasionally checking in to nudge the agent in the right direction. By the time I was done with groceries, the agent had already burned through a couple of dozen experiments and knocked off a huge chunk of the eval mean rank.

42 experiments · 13 committed · 29 reverted · 1 Saturday · 1 4090 GPU
By the end of the day, the agent ran 42 experiments, committing 13 and reverting 29. The mean rank dropped from 344.68 to 157.43 (54% reduction).
After the agent finished its exploration, I did one final training run on the full dataset. The test scores actually came out better than the validation scores. This meant we were underfitting during the short 800-step experiment runs, leaving performance on the table.
| | Mean Rank | img→txt R@5 | txt→img R@5 |
|---|---|---|---|
| Test | 34.30 | 53.0% | 51.4% |
Well…
Temperature clamp fix (−113 mean rank): It immediately went for a bug in my code. I had clamped the learnable temperature param at 2. It relaxed the limit, and boom, the eval dropped by 113 points. This was the single biggest win, worth more than all the architecture changes combined.
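The repo’s actual parameterization isn’t shown here, but in a CLIP-style contrastive loss the learnable temperature scales the cosine similarities before the softmax, so clamping it caps how sharply the model can separate positive from negative pairs. A toy numpy illustration (the clamp values and learned value are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Cosine similarities of one image against 4 captions; index 0 is the match.
sims = np.array([0.6, 0.4, 0.35, 0.3])

log_temp = 5.0  # hypothetical learned value, before clamping
p_tight   = softmax(np.exp(min(log_temp, 2.0)) * sims)  # clamp at 2: scale ~7.4
p_relaxed = softmax(np.exp(min(log_temp, 4.6)) * sims)  # clamp ~ln(100): scale ~100

# Relaxing the clamp lets the softmax put far more mass on the true pair.
print(p_tight[0], p_relaxed[0])
```

With the tight clamp the logits are too soft for the loss to reward confident matches, which is plausibly why relaxing it moved the eval so much.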
Optuna++ (−30 mean rank): Further gains came mostly from hyperparameter tuning. The agent acted like a hyperparameter optimization algorithm with some basic reasoning baked in. Increasing projection dimension and re-tuning the LR knocked off another 30 points. This is still tedious work that a human would do (and get minimal pleasure from), but the agent did it faster and more methodically.
Diminishing Returns: By the time we got to Phase 4 with the architectural changes, the success rate of the LLM’s hypotheses dropped significantly. The changes to the attention mechanism in the heatmap processor didn’t work out. Neither did the moonshot ideas in Phase 5. The agent was just throwing spaghetti at the wall, and most of it did not stick.
Sandbox is important: Towards the end, Claude Code sometimes forgot its permissions and started making weird bash calls, then complained and stopped looping. At one point it got tired of waiting for training to finish and just ended the conversation. I wouldn’t give it full autonomy just yet :)
Like with any LLM project, the first 90% of the work was super smooth and barely needed my intervention. The last 10% was a slog. This was a fun experiment that showed how an LLM agent can drive ML research in a structured way. When the search space is clearly defined, the commit-or-revert loop proposed in Autoresearch is a surprisingly effective search strategy. But when the agent ventured into the “unknown unknowns”, the optimization loop just exploded.
It is possible that the “make only one change per experiment” constraint was too tight for the moonshot ideas. Maybe we could have injected a planning stage into the Agent loop so it could think ahead. Or maybe deployed some subagents.
Maybe. But it was already time for dinner, and we were planning to watch a movie after that, so this was where Claude and I parted ways… until Monday of course.