We've managed to optimize execution of the simulation enough that brute-force search is a viable option, but giving an agent some background on how we tune those parameters on intuition and physical reasoning, plus a means to run tests and retrieve the resulting statistics, works surprisingly well.
I see it as essentially a hyperparameter search that is more capable of finding and exploiting implicit constraints in a system.
1) The total amount of time is not the same if you just count GPU-hours. With 16 GPUs, it makes sense to run them for 4.5 hours to reach 72 GPU-hours for an even comparison, not for 8 hours.
2) If we stop at 4.5 hours (and are generous by including the big drop), the loss is about 0.978, which is the same as about 44 hours with the sequential solution, making the sequential solution about twice as efficient.
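For concreteness, a worked version of that arithmetic (the 44-hour sequential figure is the commenter's own estimate):

```python
# GPU-hours comparison from the comment above. The 44-hour figure is
# the commenter's estimate of how long 1 GPU takes to reach ~0.978.
parallel_gpus = 16
parallel_hours = 4.5
parallel_gpu_hours = parallel_gpus * parallel_hours       # compute spent in parallel

sequential_hours_to_same_loss = 44                        # 1 GPU, same loss
extra_compute = parallel_gpu_hours / sequential_hours_to_same_loss

print(parallel_gpu_hours)        # 72.0 GPU-hours
print(round(extra_compute, 2))   # 1.64x more compute, i.e. sequential is
                                 # "about twice" as efficient, roughly
```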
So the real conclusion here is that we can run things in parallel at an efficiency loss but with a time win, as long as we have access to more hardware. I feel like the blog oversells itself.
The next steps are:
- give the agent the whole deep learning research literature and do tree search over the various ideas that have been proposed in the past.
- have some distributed notepad that any of these agents can read and improve upon.
Also, shoutout SkyPilot! It's been a huge help for going multi-cloud with our training and inference jobs (getting GPUs is still a nightmare...)!
The agent can theoretically come up with a protocol to run those same 12 experiments one-by-one and only then decide which branch to explore next - which I think would lead to the same outcome?
But in this case, it just happened to have stumbled on this particular outcome only because it didn't get a chance to execute a greedy strategy after the first 1 or 2 results.
Worse experiment design + parallelism = better experiment design + serialized execution ?
People have been doing this for a year or more, Ralph loops etc.
I hate the strange Twitter world of hero-worship for folks that seems to arise just out of large followings.
Joe no-followers does this six months ago, nobody cares. Karpathy writes a really basic loop and it's now a kind of AI miracle prompting tons of grifters, copy-cats, weird hype.
I do wonder if LLMs have just made everyone seriously, seriously dumber all of a sudden. Most of the "Autoresearch" posts I see are complete rubbish, with AI optimizing for nonsense benchmarks and people failing to understand the graphs they are looking at. So yes, the AI made itself better at a useless benchmark while also making the code worse in 10 other ways you don't actually understand.
At least in theory, adaptiveness should save samples and in this case, compute. (As noted, you can always turn the parallel into serial and so the serial approach, which gets information 'from the future', should be able to meet or beat any parallel approach on sample-efficiency.)
So if the batch only matches the adaptive search, that suggests that the LLM is not reasoning well in the adaptive setting and is poorly exploiting the additional information. Maybe some sort of more explicit counterfactual reasoning/planning over a tree of possible outcomes?
It's really hard to imagine that they __won't__ exceed the human value for that efficiency parameter rather soon given that 1. there are plenty of scalar value functions that can represent research efficiency, of which a subset will result in robust training, and 2. that AI labs have a massive incentive to increase their research efficiency overall, along with billions of dollars and really good human researchers working on the problem.
All of science is "gather inputs, make hypothesis, test, analyse" on repeat.
There's plenty to critique in the particular guidance approach, but the overall method is the same.
Probably would cut the number of runs down significantly (as far as I can tell it's doing a grid search once it decides to mess with a knob or section of the architecture).
Re: OpenCogPrime:EconomicAttentionAllocation https://news.ycombinator.com/item?id=45518074 and something about eWASM (edit) https://news.ycombinator.com/item?id=47171887 .. from https://news.ycombinator.com/item?id=46825026 re: eWASM and costed opcodes for agent efficiency
No it's not. Is there anything to back that up? There's a creative aspect to human research that I've yet to see with gen AI. All it does is regurgitate stuff and get some "new" ideas via the latent space of the distribution it models. But a generative model cannot by definition create anything new. Just estimate its data well enough that it can sample it well enough to fake novelty.
Does the agent have access to arxiv (a brief skim of the README didn't have an answer)? If not, it could be that the current approach of relying on the model's weights only is resulting in the perceived local optimum of hyperparameter tuning.
Anecdotally, we built a little MCP for arxiv to help with our internal research, noticed a significant boost in the diversity of methods (architecture or otherwise) Claude and friends were able to reference.
Most people are optimizing for terrible benchmarks and then don't really understand what the model did anyway, and just assume it did something good. It's the blind leading the blind, basically, and a lot of people with AI-psychosis or delusion.
In this case, using a cheap(er) signal or heuristic as an initial filter before spending more resources on cases that pass the filter is a pattern that shows up all over the place, and LLMs are good at picking up on patterns like that and generalizing them. AFAICT.
You can try this yourself in a simple fashion -- let's say you have a piece of code that you want to speed up. Point your agent to a code profiler (your oracle -- typically your Python profiler) and tell it to speed up the code. I've tried it. It works.
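A minimal sketch of that oracle loop, assuming Python's built-in cProfile as the profiler; the `slow_sum` function is invented for illustration, and the printed report is what you would hand back to the agent:

```python
# Run a target function under cProfile and extract a hotspot report --
# the "oracle" signal an agent can act on. slow_sum is a stand-in for
# whatever code you actually want sped up.
import cProfile
import io
import pstats

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)   # top 5 entries by cumulative time
report = stream.getvalue()
print("slow_sum" in report)                     # the hotspot shows up in the report
```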
Here’s a use case that may illuminate the difference, from my own work at Nvidia. I’m currently training some large sparse autoencoders, and there are issues with dead latents. Several solutions exist to help here, such as auxk, which I can certainly include, tuning the relevant params as you describe. However, I have several other ideas that are much different, each of which requires editing core code (full evaluation changes, initialization strategies, architecture changes, etc.), including changes to parallelism strategies in the multi-rank environment I’m using. Moreover, based on my ideas and other existing literature, Claude can try a number of new ideas, each potentially involving more code changes.
This automated run-and-discover process is far beyond what’s possible with hyperparam search.
I might reframe the comment as: are you actually using LLMs for sustained, difficult work in a domain that has nothing to do with LLMs?
It feels like a lot of LLM-oriented work is fake. It is compounding "stuff," both inputs and outputs, and so the increased amount of stuff makes it feel like we're living in a higher plane of information abundance, but in reality we're increasing entropy.
Tech has always had an information bias, and LLMs are the perfect vehicle to create a lot of superfluous information.
To calculate a gradient step, in practice one doesn't accumulate the gradient over the full corpus, but updates the weights on mini-batches.
Suppose one runs conventional gradient descent on minibatches multiple times with different starting seeds, and then considers the resulting set of pre-trained models M_i.
From a random starting point we thus have an idea of the desired end-region in weight space (let's say a Gaussian cloud fit to the final M_i's).
Then one could score update strategies by how much a single iteration approaches the Gaussian cloud, evaluating just a few minibatches or update iterations, instead of searching update-strategy space by waiting until pretraining has finished for each candidate. Only the candidate strategies that perform well enough after one or a few iterations would be considered worthy of further consideration; those that pass (a smaller set of candidates) are then inspected for approach to the Gaussian target after another round of iterations, and so on.
It seems like it should be possible to optimize the optimization iteration loop, by running it just once for many candidates and observing their convergence to the known desired end region.
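A toy sketch of this scoring idea, with a quadratic loss standing in for pretraining and step sizes standing in for update strategies; all names and values here are invented for illustration:

```python
# Toy version of "score one iteration by approach to the Gaussian
# cloud of trained models". The loss is a quadratic bowl ||w - w*||^2,
# and each "update strategy" is just a step size.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
w_star = rng.normal(size=dim)                  # optimum of the toy loss
grad = lambda w: 2 * (w - w_star)

def train(w0, lr, steps):
    w = w0.copy()
    for _ in range(steps):
        w = w - lr * grad(w)                   # minibatch noise omitted
    return w

# Fit a Gaussian cloud to the final models M_i from different seeds.
finals = np.stack([train(rng.normal(size=dim), 0.1, 200) for _ in range(20)])
mu = finals.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(finals.T) + 1e-6 * np.eye(dim))

def dist_to_cloud(w):
    d = w - mu
    return float(d @ cov_inv @ d)              # squared Mahalanobis distance

# Score each candidate strategy by ONE iteration's approach to the
# cloud, instead of running full training for every candidate.
w0 = rng.normal(size=dim)
scores = {lr: dist_to_cloud(train(w0, lr, 1)) for lr in (0.001, 0.1, 0.45)}
best = min(scores, key=scores.get)
print(best)                                    # the one-step winner
```

The cheap one-step score ranks the candidates without ever running a full training trajectory for them, which is the compute saving the idea is after.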
I don't necessarily disagree, but am wondering whether you have any particular reason/intuition driving you to claim this. I have seen AI agents be quite creative in other tasks; do you think there's a particular reason why we shouldn't see creativity in architecture research, given enough time and resources?
I don't know a great deal about the guy. I know he worked at Tesla and led Autopilot there. If we ignore the character defects required to work at Tesla, he's responsible for designing systems that would certainly kill people, because they decided lidar was too expensive.
In practice, the vast majority of the changes that auto research actually made would have been found much faster with BO if properly parameterized. You do not need an LLM to find a better batch size or learning rate.
> Why surely? Have you never seen an LLM try something new?
I'm afraid I can't make it any simpler than this.
And I still don't know the answer to how you're so sure. To me there's several explanations, and it seems to you there's only one.
I'm pretty happy with my communication style.
Given that this is a common technique and not a novel invention, it’s probably present in the training set.
The “surely” reads like it’s referring to the presence of that information in the training set. But your response casts it as saying “surely the AI has not invented something on its own”.
The original question stands IMO, the burden of proof is on whoever is asserting that the AI has invented something on its own, with or without training data that surely already mentions this approach
The problem with the reasoning of the person I was responding to is that it's assuming "if X is in the training set and LLM outputs X, then it did so because X is in the training set". That does not follow. Conceivably it's possible that X is in the training set and LLM outputs X, but if X hadn't been in the training set the LLM also would've output X.
Let's look at that phrase again:
> Why do we think this emerged “on its own”? Surely this technique has been discussed in research papers that are in the training set.
This phrase implies "if X was in the training set, then LLM couldn't have come up with X on its own". This is false. In fact, my claim that the implication is false is testable, in the following manner: Have two training sets, T and T'. In T, X is present. In T' you've removed X but left X-adjacent things. Train LLM A on T and A' on T'. Find a prompt that requires that A outputs X. If on the same prompt A' also outputs X, that's an example of my claim. To repeat, my claim is "it's possible that X is in the training set and LLM outputs X, but if X hadn't been in the training set the LLM also would've output X."
In fact, I've just realized I even have a method for constructing (T, T') that guarantees what I've described. Not sure if it's worth a paper on its own though.
We pointed Claude Code at autoresearch and gave it access to 16 GPUs on a Kubernetes cluster. Over 8 hours it submitted ~910 experiments, found that scaling model width mattered more than any single hyperparameter, taught itself to use H200s for validation while screening ideas on H100s, and drove val_bpb from 1.003 down to 0.974 - a 2.87% improvement over baseline.
Beyond raw speedup, parallelism changed how the agent searched. With one GPU, it’s stuck doing greedy hill-climbing - try one thing, check, repeat. With 16 GPUs, it ran factorial grids of 10-13 experiments per wave, catching interaction effects between parameters that sequential search would miss. For example, the agent tested six model widths in a single wave, saw the trend immediately, and zeroed in on the best one - one round instead of six.
It also discovered it had access to multiple GPU types (H100s and H200s) and developed a strategy to exploit the performance difference across heterogeneous hardware: screen ideas on cheap H100s, promote winners to H200 for validation.

With 16 GPUs, the parallel agent reached the same best validation loss 9x faster than the simulated sequential baseline (~8 hours vs ~72 hours).
Autoresearch is Andrej Karpathy’s recent project where a coding agent autonomously improves a neural network training script. The agent edits train.py, runs a 5-minute training experiment on a GPU, checks the validation loss, and loops - keeping changes that help, discarding those that don’t. In Karpathy’s first overnight run, the agent found ~20 improvements that stacked up to an 11% reduction in time-to-GPT-2 on the nanochat leaderboard.
The default setup: one GPU, one agent, one experiment at a time. ~12 experiments per hour. We wanted to see what happens when you remove the infrastructure bottleneck and let the agent manage its own compute.
The project has three files:
prepare.py - Downloads data, trains a tokenizer, provides the dataloader and evaluation function. Read-only. The agent cannot touch it.
train.py - The GPT model, optimizer, and training loop. This is the only file the agent modifies.
program.md - Instructions for the agent: what it can change, how to evaluate results, when to keep vs. discard changes.
The constraint is a fixed 5-minute wall-clock training budget. The agent’s job is to minimize val_bpb (validation bits per byte) within that window. Everything in train.py is fair game - architecture, hyperparameters, optimizer settings, batch size, model depth - as long as the code runs without crashing.
The default setup assumes you have a local GPU. You run uv run train.py, wait 5 minutes, check the result, edit, repeat. The agent automates the edit-run-check loop, but the experiments are still sequential.
Running experiments sequentially means the agent spends most of its time waiting. A typical cycle looks like:
1. Edit train.py (~30 seconds)
2. Run the 5-minute training experiment
3. Check the result
Steps 1 and 3 are cheap. Step 2 dominates. And during step 2, the agent is idle - it could be preparing the next experiment, or the next ten.
The bigger problem is combinatorial. Say the agent finds that lower weight decay helps and that a different Adam beta also helps. It wants to try them together. But with sequential execution, testing the combination means waiting another 5 minutes. With 16 GPUs, the agent can test that combination alongside a dozen other ideas simultaneously. Instead of testing one hypothesis per 5-minute window, it tests a factorial grid in a single wave.
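As a sketch, a factorial wave like that can be generated mechanically and assigned round-robin to clusters; the knob values and cluster names below are illustrative, and the `sky exec` flags mirror the experiment.yaml workflow used in this post:

```python
# Build one factorial wave: every (weight_decay, learning_rate) combo
# becomes a sky exec command, spread round-robin across clusters.
# Values and cluster names are illustrative.
from itertools import product

weight_decays = [0.05, 0.08, 0.10]
learning_rates = [0.02, 0.035, 0.05, 0.065]
clusters = [f"gpu-{i:02d}" for i in range(1, 17)]   # gpu-01 .. gpu-16

wave = []
for i, (wd, lr) in enumerate(product(weight_decays, learning_rates)):
    cluster = clusters[i % len(clusters)]
    wave.append(
        f"sky exec {cluster} experiment.yaml -d "
        f"--env EXPERIMENT_ID=wd{wd}-lr{lr} "
        f'--env EXPERIMENT_DESC="wd={wd} lr={lr}"'
    )

print(len(wave))   # 3 x 4 = 12 experiments in one wave
```

All 12 commands return immediately (detached mode), so the whole grid finishes in one 5-minute window instead of an hour of sequential runs.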
To arm the agent with GPUs, we used SkyPilot. It’s an open-source tool that launches jobs across clouds and Kubernetes from a YAML file and includes a skill that teaches coding agents to use it. The agent reads the skill, then launches and manages GPU clusters on its own - no manual cloud setup.
Each experiment is defined in a short YAML (experiment.yaml) that specifies the GPU type, installs dependencies, runs train.py, and prints metrics to stdout. The agent checks results with sky logs.

Claude Code uses the SkyPilot skill to launch and manage GPU experiments across clouds and Kubernetes.
experiment.yaml - SkyPilot task definition for a single experiment
resources:
accelerators: {H100:1, H200:1}
image_id: docker:nvcr.io/nvidia/pytorch:24.07-py3
infra: k8s # or slurm, aws, gcp, azure, etc. (20+ infra backends supported)
workdir: .
envs:
EXPERIMENT_ID: baseline
EXPERIMENT_DESC: "baseline run"
setup: |
pip install uv
uv sync
uv run prepare.py
run: |
# Run the experiment (5-min fixed budget)
uv run train.py 2>&1 | tee run.log
EXIT_CODE=${PIPESTATUS[0]}
if [ $EXIT_CODE -ne 0 ]; then
echo "EXPERIMENT_STATUS: crash"
else
VAL_BPB=$(grep "^val_bpb:" run.log | awk '{print $2}')
PEAK_VRAM=$(grep "^peak_vram_mb:" run.log | awk '{print $2}')
MEMORY_GB=$(echo "scale=1; ${PEAK_VRAM} / 1024" | bc)
echo "EXPERIMENT_STATUS: done"
echo "EXPERIMENT_RESULT: ${EXPERIMENT_ID} val_bpb=${VAL_BPB} memory_gb=${MEMORY_GB}"
fi
echo "EXPERIMENT_DESC: ${EXPERIMENT_DESC}"
The setup block runs once per cluster - subsequent experiments on the same cluster skip straight to training. For this run, we used SkyPilot on Kubernetes (infra: k8s) backed by CoreWeave, with {H100:1, H200:1} letting SkyPilot pick whichever GPU was available.
The full setup (agent instructions + YAML) is at skypilot/examples/autoresearch.
To run experiments in parallel, the agent launches multiple clusters and submits different experiments to each. The -d flag (detached mode) submits the job and returns immediately:
# Launch a cluster with the first experiment
sky launch gpu-01 experiment.yaml -d -y \
--env EXPERIMENT_ID=exp-01 \
--env EXPERIMENT_DESC="baseline run"
# Reuse the same cluster for the next experiment (skips setup)
sky exec gpu-01 experiment.yaml -d \
--env EXPERIMENT_ID=exp-02 \
--env EXPERIMENT_DESC="higher LR"
sky exec queues a job that starts automatically when the current one finishes, so the agent can pipeline experiments on the same cluster with zero idle time. Between sky launch for provisioning and sky exec for pipelining, a single agent can keep 16 clusters busy.
instructions.md points the agent to the SkyPilot skill, which teaches it to manage the full loop: provision clusters, submit experiments, check logs, commit winning changes, and keep going until stopped. You just point your coding agent at the instructions and walk away.
We pointed Claude Code at the instructions and let it run overnight. Claude used SkyPilot to provision 16 GPUs across our two Kubernetes clusters - 13 ended up on H100s and 3 on H200s, depending on what was available:
$ sky status
NAME INFRA RESOURCES STATUS
gpu-01 k8s 1x (H100:1) UP
gpu-02 k8s 1x (H100:1) UP
...
gpu-08 k8s 1x (H200:1) UP
...
gpu-16 k8s 1x (H100:1) UP
The session ran about 90 experiments per hour - a 9x throughput increase over the 10/hour you get with a single GPU (each experiment takes ~5 min plus ~1 min of setup and agent thinking time). Over 8 hours, the agent submitted ~910 experiments (700 with valid results, the rest queued or crashed).
The search went through five phases. The agent didn’t plan these ahead - each phase emerged from what it learned in the previous one.

Model performance across runs. Each grey dot is one experiment. Green dots mark new best validation losses. The agent drove val_bpb from 1.003 (baseline) to 0.974 over ~700 experiments in 8 hours.
Starting from val_bpb = 1.003 (baseline), the agent tested the obvious knobs in parallel: batch size, Adam betas, weight decay, window patterns, model depth, learning rate schedules. Early waves of 10-13 simultaneous experiments quickly mapped out what works:
After ~200 experiments: val_bpb = 0.981. Most of the hyperparameter space was mapped.
This was the biggest single jump, and the one that parallel search made possible. The agent tested six different aspect ratios simultaneously - AR=48, 64, 72, 80, 90, 96 - in a single 5-minute wave. In serial, that’s 30 minutes of waiting. In parallel, one wave.
The result: scaling model width from the default (AR~48, model_dim=384) to AR=96 (model_dim=768) outperformed every hyperparameter tweak from Phase 1. Going wider was worth more than all the optimizer tuning combined.
AR=112 was too big - the model didn’t get enough training steps in 5 minutes to use the extra capacity. AR=96 was the sweet spot: it fit in 64GB VRAM and completed ~1,060 steps on an H100 (vs ~1,450 for the smaller model), enough for the wider model to pay off.
After ~420 experiments: val_bpb = 0.977.
With AR=96 as the base architecture, the agent fine-tuned around it: warmdown schedule, matrix learning rate, weight decay, Newton-Schulz steps for the Muon optimizer. Each wave tested 10+ variants.
After ~560 experiments: val_bpb = 0.975 (on H200).
The biggest late-stage find: muon_beta2=0.98 (up from 0.95). The Muon optimizer’s second-momentum parameter controls how aggressively gradient normalization adapts. Increasing it smoothed the normalization and let the model take larger effective steps. This single change was worth ~0.001 val_bpb - the largest late-stage improvement.
The agent found this by testing beta2 in {0.95, 0.96, 0.97, 0.98, 0.99} across 10 clusters in one wave. Sequentially, that’s 5 experiments at 5 minutes each = 25 minutes. In parallel, 5 minutes.
After ~700 experiments: val_bpb = 0.974.
With the best config locked in, the agent ran combinatorial sweeps over final LR fraction, warmdown ratio, scalar LR, and embedding LR. Returns dropped below 0.0001 per experiment. The improvement curve had flattened:
Phase 1 (hyperparams): 1.003 → 0.981 (Δ = 0.022)
Phase 2 (architecture): 0.981 → 0.977 (Δ = 0.004)
Phase 3 (fine-tuning): 0.977 → 0.975 (Δ = 0.002)
Phase 4 (optimizer): 0.975 → 0.974 (Δ = 0.001)
Phase 5 (combinations): 0.974 → ??? (Δ < 0.0001)
The low-hanging fruit - architecture scale, batch size, optimizer structure - was picked. Further gains would require new architectural ideas or longer training budgets.
# Architecture
ASPECT_RATIO = 96 # model_dim = 8 * 96 = 768
DEPTH = 8 # 8 transformer layers
WINDOW_PATTERN = "SL" # alternating Sliding + Local attention
# Training
TOTAL_BATCH_SIZE = 2**18 # ~524K tokens/step
# Learning rates
MATRIX_LR = 0.05 # Muon LR for weight matrices
EMBEDDING_LR = 0.6 # AdamW LR for token embeddings
SCALAR_LR = 0.5 # AdamW LR for residual mixing scalars
# Optimizer
ADAM_BETAS = (0.70, 0.95)
WEIGHT_DECAY = 0.08
WARMDOWN_RATIO = 0.6
FINAL_LR_FRAC = 0.05
# Muon: momentum=0.95, ns_steps=5, beta2=0.98
With a single GPU, the agent is stuck doing greedy hill-climbing: try one thing, check the result, pick a direction, try the next thing. With 16 GPUs, the strategy shifts. The agent can run full factorial grids - test 3 values of weight decay × 4 values of learning rate = 12 experiments in a single 5-minute wave. This makes it much harder to get stuck in local optima and much easier to find interaction effects between parameters.
The aspect ratio discovery in Phase 2 is a good example. Sequentially, the agent might have tried AR=64, seen no improvement, and moved on to other ideas. In parallel, it tested AR=64, 72, 80, 90, 96, and 112 at once, immediately saw the trend, and zeroed in on AR=96. One wave instead of six sequential experiments.
The throughput numbers:
| | Sequential (1 GPU) | Parallel (16 GPUs) |
|---|---|---|
| Experiments / hour | ~10 | ~90 |
| Strategy | greedy hill-climbing | factorial grids per wave |
| Information per decision | 1 experiment | 10-13 simultaneous experiments |

With 16 GPUs, the parallel agent reached the same best validation loss 9x faster than the simulated sequential baseline (~8 hours vs ~72 hours).
We used SkyPilot to let our agent access our two H100 and H200 clusters. Of the 16-cluster budget we asked it to stick to, it used 13 H100s (80GB VRAM, ~283ms/step) and 3 H200s (141GB VRAM, ~263ms/step). We didn’t tell the agent about the GPUs’ performance differences; it figured them out on its own.
After a few waves, the agent noticed that identical configs scored ~0.005 val_bpb lower on H200 clusters - the faster step time meant more training steps in the 5-minute budget.
Without any prompts, it developed a two-tier strategy to exploit this difference: screen 10+ hypotheses cheaply on H100s in parallel, then promote the top 2-3 to H200 for confirmation runs. Here’s the agent reasoning through this in real time:
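That two-tier pattern can be sketched in a few lines. The experiment names and val_bpb scores below are made up; the ~0.005 H200 offset is taken from the observation above:

```python
# Screen-then-promote: score all candidates on the cheap tier (H100),
# then re-run only the top few on the scarce, faster tier (H200).
def two_tier(candidates, screen, validate, promote_k=3):
    screened = sorted(candidates, key=screen)    # cheap tier: lower val_bpb is better
    winners = screened[:promote_k]               # promote the best few
    return {c: validate(c) for c in winners}     # expensive tier: confirmation runs

# Made-up H100 screening scores for five hypotheses.
h100_scores = {"exp-a": 0.981, "exp-b": 0.984, "exp-c": 0.979,
               "exp-d": 0.990, "exp-e": 0.977}

# Pretend the H200 confirmation run scores ~0.005 lower, as observed.
confirmed = two_tier(h100_scores, screen=h100_scores.get,
                     validate=lambda c: round(h100_scores[c] - 0.005, 3))
print(confirmed)   # {'exp-e': 0.972, 'exp-c': 0.974, 'exp-a': 0.976}
```

The filter spends the scarce H200 time only on candidates that already passed the cheap screen, which is exactly the pattern the agent converged on.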
“Only 3 H200 clusters: gpu-03, gpu-04, gpu-08! The rest are H100. This explains everything — H200 is significantly faster than H100. In the same 5-minute budget, H200 can do MORE training steps. More steps = better val_bpb.”
“H200 runs 9% more steps in the same time! That directly leads to better val_bpb. All my ‘best’ results should be normalized by hardware.”
“Since H200 gets ~9% more steps than H100 in the same 5-minute budget, and I have only 3 H200 clusters, I should focus experiments on H200 clusters. The real optimization contest is on H200.”
This turned out to matter beyond just throughput. Rankings didn’t always transfer across hardware. For example, FINAL_LR_FRAC=0.03 sometimes beat 0.05 on H100 but consistently lost on H200. The likely explanation: with more training steps, the model benefits from keeping the learning rate higher toward the end of the schedule. The agent’s self-invented validation tier caught these discrepancies - a workflow a human researcher might design deliberately, but that the agent arrived at just by observing its own results.
The agent ran for ~8 hours on 16 Kubernetes GPUs. Claude Code’s API cost for the session was about $9. The GPU cost depends on your pricing: H100s run about $2/hour or lower, so 13 H100s for 8 hours is ~$200, and 3 H200s for 8 hours (at ~$2.3/hour) adds ~$60, for a total under $300.
The full setup (agent instructions + SkyPilot YAML) is at skypilot/examples/autoresearch.
The quickest way to get started is the one-line setup script, which installs dependencies, clones autoresearch, and downloads the experiment files:
curl -sL https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/autoresearch/setup.sh | bash
# follow the steps
cd autoresearch
claude "Read instructions.md and start running parallel experiments."
From here, the agent handles everything. It reads instructions.md, fetches the SkyPilot skill, provisions GPU clusters, submits experiments, checks logs, commits winning changes, and loops until you stop it.
Manual setup (without the script)
# Clone autoresearch and copy in the parallel experiment files
git clone https://github.com/karpathy/autoresearch.git
git clone https://github.com/skypilot-org/skypilot.git
cd autoresearch
cp ../skypilot/examples/autoresearch/experiment.yaml .
cp ../skypilot/examples/autoresearch/instructions.md .
# Prepare data locally (one-time)
pip install uv && uv sync && uv run prepare.py
# Install the SkyPilot skill for your agent
# See: https://docs.skypilot.co/en/latest/getting-started/skill.html
# Point your coding agent at the instructions
# "Read instructions.md and start running parallel experiments"
You can use Claude Code, Codex, or any coding agent that can run shell commands and fetch URLs. Set infra: in the YAML to target a specific backend (e.g. infra: k8s for Kubernetes, infra: aws for AWS). Otherwise, SkyPilot picks the cheapest available option.
For a quick intro to SkyPilot, see the overview and quickstart.
To receive latest updates, please star and watch the project’s GitHub repo, follow @skypilot_org, or join the SkyPilot community Slack.
But you’re missing the context and implication: “doing new stuff” is the major achievement we’re looking for next from LLMs. Seeing something that is “new” and is not in the training set is interesting in a way that something contained in the training set is not.
We cannot introspect LLMs meaningfully yet, so the difference between “came up with myself and it’s in the training set incidentally” and “applied a concept in the training set” is not meaningful.
He is a researcher who understands neural networks and their architectures exceptionally well. That is all.
A few examples: Axiom's proof of Fel’s open conjecture on syzygies of numerical semigroups: https://x.com/axiommathai/status/2019449659807219884
Erdos 457: https://www.erdosproblems.com/457
The stronger form of Erdos 650: https://www.erdosproblems.com/650
E.g., you can see a post from a user named dhouston, who mentioned that he was thinking about starting an online file sync/backup service of some sort.
And that is precisely why he is more qualified on the subject than your average vibe coder!