We've managed to optimize execution of the simulation enough that brute-force search is a viable option, but giving an agent some background on how we tune those parameters on intuition and physical reasoning, plus a means to run tests and retrieve the resulting statistics, works surprisingly well.
I see it as essentially a hyperparameter search that is more capable of finding and exploiting implicit constraints in a system.
1) The total amount of time is not the same if you just count GPU-hours. With 16 GPUs, it makes sense to run them for 4.5 hours to reach 72 GPU-hours for an even comparison, not for 8 hours.
2) If we stop at 4.5 hours (and are generous by including the big drop), the loss is about 0.978, which is the same as about 44 hours with the sequential solution, making the sequential solution about twice as efficient.
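For concreteness, a worked version of that arithmetic (the 44-hour sequential figure is the commenter's own estimate):

```python
# GPU-hours comparison from the comment above. The 44-hour figure is
# the commenter's estimate of how long 1 GPU takes to reach ~0.978.
parallel_gpus = 16
parallel_hours = 4.5
parallel_gpu_hours = parallel_gpus * parallel_hours       # compute spent in parallel

sequential_hours_to_same_loss = 44                        # 1 GPU, same loss
extra_compute = parallel_gpu_hours / sequential_hours_to_same_loss

print(parallel_gpu_hours)        # 72.0 GPU-hours
print(round(extra_compute, 2))   # 1.64x more compute, i.e. sequential is
                                 # "about twice" as efficient, roughly
```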
So the real conclusion here is that we can run things in parallel at an efficiency loss but with a time win, as long as we have access to more hardware. I feel like the blog oversells itself.
The next steps are:
- give the agent the whole deep learning research literature and do tree search over the various ideas that have been proposed in the past.
- have some distributed notepad that any of these agents can read and improve upon.
Also, shoutout SkyPilot! It's been a huge help for going multi-cloud with our training and inference jobs (getting GPUs is still a nightmare...)!
The agent can theoretically come up with a protocol to run those same 12 experiments one-by-one and only then decide which branch to explore next - which I think would lead to the same outcome?
But in this case, it just happened to have stumbled on this particular outcome only because it didn't get a chance to execute a greedy strategy after the first 1 or 2 results.
Worse experiment design + parallelism = better experiment design + serialized execution ?
People have been doing this for a year or more, Ralph loops etc.
I hate the strange Twitter world of hero-worship for folks that seems to arise just out of large followings.
Joe no-followers does this six months ago, nobody cares. Karpathy writes a really basic loop and it's now a kind of AI miracle prompting tons of grifters, copy-cats, weird hype.
I do wonder if LLMs have just made everyone seriously, seriously dumber all of a sudden. Most of the "Autoresearch" posts I see are complete rubbish, with AI optimizing for nonsense benchmarks and people failing to understand the graphs they are looking at. So yes, the AI made itself better at a useless benchmark while also making the code worse in 10 other ways you don't actually understand.
At least in theory, adaptiveness should save samples and in this case, compute. (As noted, you can always turn the parallel into serial and so the serial approach, which gets information 'from the future', should be able to meet or beat any parallel approach on sample-efficiency.)
So if the batch only matches the adaptive search, that suggests that the LLM is not reasoning well in the adaptive setting and is poorly exploiting the additional information. Maybe some sort of more explicit counterfactual reasoning/planning over a tree of possible outcomes?
It's really hard to imagine that they __won't__ exceed the human value for that efficiency parameter rather soon given that 1. there are plenty of scalar value functions that can represent research efficiency, of which a subset will result in robust training, and 2. that AI labs have a massive incentive to increase their research efficiency overall, along with billions of dollars and really good human researchers working on the problem.
All of science is "gather inputs, make hypothesis, test, analyse" on repeat.
There's plenty to critique in the particular guidance approach, but the overall method is the same.
Probably would cut the number of runs down significantly (as far as I can tell it's doing a grid search once it decides to mess with a knob or section of the architecture).
Re: OpenCogPrime:EconomicAttentionAllocation https://news.ycombinator.com/item?id=45518074 and something about eWASM (edit) https://news.ycombinator.com/item?id=47171887 .. from https://news.ycombinator.com/item?id=46825026 re: eWASM and costed opcodes for agent efficiency
No it's not. Is there anything to back that up? There's a creative aspect to human research that I've yet to see with gen AI. All it does is regurgitate stuff and get some "new" ideas via the latent space of the distribution it models. But a generative model cannot by definition create anything new. Just estimate its data well enough that it can sample it well enough to fake novelty.
Does the agent have access to arxiv (a brief skim of the README didn't have an answer)? If not, it could be that the current approach of relying on the model's weights only is resulting in the perceived local optimum of hyperparameter tuning.
Anecdotally, we built a little MCP for arxiv to help with our internal research, noticed a significant boost in the diversity of methods (architecture or otherwise) Claude and friends were able to reference.
Most people are optimizing for terrible benchmarks and then don't really understand what the model did anyway, and just assume it did something good. It's the blind leading the blind, basically, and a lot of people with AI-psychosis or delusion.
In this case, using a cheap(er) signal or heuristic as an initial filter before spending more resources on cases that pass the filter is a pattern that shows up all over the place, and LLMs are good at picking up on patterns like that and generalizing them. AFAICT.
You can try this yourself in a simple fashion -- let's say you have a piece of code that you want to speed up. Point your agent to a code profiler (your oracle -- typically your Python profiler) and tell it to speed up the code. I've tried it. It works.
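A minimal sketch of that oracle loop, assuming Python's built-in cProfile as the profiler; the `slow_sum` function is invented for illustration, and the printed report is what you would hand back to the agent:

```python
# Run a target function under cProfile and extract a hotspot report --
# the "oracle" signal an agent can act on. slow_sum is a stand-in for
# whatever code you actually want sped up.
import cProfile
import io
import pstats

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)   # top 5 entries by cumulative time
report = stream.getvalue()
print("slow_sum" in report)                     # the hotspot shows up in the report
```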
Here’s a use case that may illuminate the difference, from my own work at Nvidia. I’m currently training some large sparse autoencoders, and there are issues with dead latents. Several solutions exist to help here, such as auxk, which I can certainly include, tuning the relevant params as you describe. However, I have several other ideas that are much different, each of which requires editing core code (full evaluation changes, initialization strategies, architecture changes, etc.), including changes to parallelism strategies in the multi-rank environment I’m using. Moreover, based on my ideas and other existing literature, Claude can try a number of new ideas, each potentially involving more code changes.
This automated run-and-discover process is far beyond what’s possible with hyperparam search.
I might reframe the comment as: are you actually using LLMs for sustained, difficult work in a domain that has nothing to do with LLMs?
It feels like a lot of LLM-oriented work is fake. It is compounding "stuff," both inputs and outputs, and so the increased amount of stuff makes it feel like we're living in a higher plane of information abundance, but in reality we're increasing entropy.
Tech has always had an information bias, and LLMs are the perfect vehicle to create a lot of superfluous information.
To calculate a gradient step, in practice one doesn't accumulate the gradient over the full corpus, but updates the weights on mini-batches.
Suppose one runs conventional gradient descent on minibatches multiple times with different starting seeds, and then considers the resulting set of pre-trained models M_i.
From a random starting point we thus have an idea of the desired end-region in weight space (let's say a Gaussian cloud fit to the final M_i's).
Then one could score update strategies by how much a single iteration approaches the Gaussian cloud, evaluating just a few minibatches or update iterations, instead of searching update-strategy space by waiting until pretraining has finished for each candidate. Only the candidate strategies that perform well enough after one or a few iterations would be considered worthy of further consideration; those that pass (a smaller set of candidates) are then inspected for approach to the Gaussian target after another round of iterations, and so on.
It seems like it should be possible to optimize the optimization iteration loop, by running it just once for many candidates and observing their convergence to the known desired end region.
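A toy sketch of this scoring idea, with a quadratic loss standing in for pretraining and step sizes standing in for update strategies; all names and values here are invented for illustration:

```python
# Toy version of "score one iteration by approach to the Gaussian
# cloud of trained models". The loss is a quadratic bowl ||w - w*||^2,
# and each "update strategy" is just a step size.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
w_star = rng.normal(size=dim)                  # optimum of the toy loss
grad = lambda w: 2 * (w - w_star)

def train(w0, lr, steps):
    w = w0.copy()
    for _ in range(steps):
        w = w - lr * grad(w)                   # minibatch noise omitted
    return w

# Fit a Gaussian cloud to the final models M_i from different seeds.
finals = np.stack([train(rng.normal(size=dim), 0.1, 200) for _ in range(20)])
mu = finals.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(finals.T) + 1e-6 * np.eye(dim))

def dist_to_cloud(w):
    d = w - mu
    return float(d @ cov_inv @ d)              # squared Mahalanobis distance

# Score each candidate strategy by ONE iteration's approach to the
# cloud, instead of running full training for every candidate.
w0 = rng.normal(size=dim)
scores = {lr: dist_to_cloud(train(w0, lr, 1)) for lr in (0.001, 0.1, 0.45)}
best = min(scores, key=scores.get)
print(best)                                    # the one-step winner
```

The cheap one-step score ranks the candidates without ever running a full training trajectory for them, which is the compute saving the idea is after.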
I don't necessarily disagree, but am wondering whether you have any particular reason/intuition driving you to claim this. I have seen AI agents be quite creative in other tasks; do you think there's a particular reason why we shouldn't see creativity in architecture research, given enough time and resources?
I don't know a great deal about the guy. I know he worked at Tesla and led Autopilot there. If we ignore the character defects required to work at Tesla, he's responsible for designing systems that would certainly kill people, because they decided lidar was too expensive.
In practice, the vast majority of the changes that auto research actually made would have been found much faster with BO if properly parameterized. You do not need an LLM to find a better batch size or learning rate.
> Why surely? Have you never seen an LLM try something new?
I'm afraid I can't make it any simpler than this.
And I still don't know the answer to how you're so sure. To me there's several explanations, and it seems to you there's only one.
I'm pretty happy with my communication style.
Given that this is a common technique and not a novel invention, it’s probably present in the training set.
The “surely” reads like it’s referring to the presence of that information in the training set. But your response casts it as saying “surely the AI has not invented something on its own”.
The original question stands IMO, the burden of proof is on whoever is asserting that the AI has invented something on its own, with or without training data that surely already mentions this approach
The problem with the reasoning of the person I was responding to is that it's assuming "if X is in the training set and LLM outputs X, then it did so because X is in the training set". That does not follow. Conceivably it's possible that X is in the training set and LLM outputs X, but if X hadn't been in the training set the LLM also would've output X.
Let's look at that phrase again:
> Why do we think this emerged “on its own”? Surely this technique has been discussed in research papers that are in the training set.
This phrase implies "if X was in the training set, then LLM couldn't have come up with X on its own". This is false. In fact, my claim that the implication is false is testable, in the following manner: Have two training sets, T and T'. In T, X is present. In T' you've removed X but left X-adjacent things. Train LLM A on T and A' on T'. Find a prompt that requires that A outputs X. If on the same prompt A' also outputs X, that's an example of my claim. To repeat, my claim is "it's possible that X is in the training set and LLM outputs X, but if X hadn't been in the training set the LLM also would've output X."
In fact, I've just realized I even have a method for constructing (T, T') that guarantees what I've described. Not sure if it's worth a paper on its own though.
We pointed Claude Code at autoresearch and gave it access to 16 GPUs on a Kubernetes cluster. Over 8 hours it submitted ~910 experiments, found that scaling model width mattered more than any single hyperparameter, taught itself to use H200s for validation while screening ideas on H100s, and drove val_bpb from 1.003 down to 0.974 - a 2.87% improvement over baseline.
Beyond raw speedup, parallelism changed how the agent searched. With one GPU, it’s stuck doing greedy hill-climbing - try one thing, check, repeat. With 16 GPUs, it ran factorial grids of 10-13 experiments per wave, catching interaction effects between parameters that sequential search would miss. For example, the agent tested six model widths in a single wave, saw the trend immediately, and zeroed in on the best one - one round instead of six.
It also discovered it had access to multiple GPU types (H100s and H200s) and developed a strategy to exploit the performance difference across heterogeneous hardware: screen ideas on cheap H100s, promote winners to H200 for validation.

With 16 GPUs, the parallel agent reached the same best validation loss 9x faster than the simulated sequential baseline (~8 hours vs ~72 hours).
Autoresearch is Andrej Karpathy’s recent project where a coding agent autonomously improves a neural network training script. The agent edits train.py, runs a 5-minute training experiment on a GPU, checks the validation loss, and loops - keeping changes that help, discarding those that don’t. In Karpathy’s first overnight run, the agent found ~20 improvements that stacked up to an 11% reduction in time-to-GPT-2 on the nanochat leaderboard.
The default setup: one GPU, one agent, one experiment at a time. ~12 experiments per hour. We wanted to see what happens when you remove the infrastructure bottleneck and let the agent manage its own compute.
The project has three files:
prepare.py - Downloads data, trains a tokenizer, provides the dataloader and evaluation function. Read-only. The agent cannot touch it.
train.py - The GPT model, optimizer, and training loop. This is the only file the agent modifies.
program.md - Instructions for the agent: what it can change, how to evaluate results, when to keep vs. discard changes.
The constraint is a fixed 5-minute wall-clock training budget. The agent’s job is to minimize val_bpb (validation bits per byte) within that window. Everything in train.py is fair game - architecture, hyperparameters, optimizer settings, batch size, model depth - as long as the code runs without crashing.
The default setup assumes you have a local GPU. You run uv run train.py, wait 5 minutes, check the result, edit, repeat. The agent automates the edit-run-check loop, but the experiments are still sequential.
Running experiments sequentially means the agent spends most of its time waiting. A typical cycle looks like:
1. Edit train.py (~30 seconds)
2. Run the 5-minute training experiment
3. Check the result
Steps 1 and 3 are cheap. Step 2 dominates. And during step 2, the agent is idle - it could be preparing the next experiment, or the next ten.
The bigger problem is combinatorial. Say the agent finds that lower weight decay helps and that a different Adam beta also helps. It wants to try them together. But with sequential execution, testing the combination means waiting another 5 minutes. With 16 GPUs, the agent can test that combination alongside a dozen other ideas simultaneously. Instead of testing one hypothesis per 5-minute window, it tests a factorial grid in a single wave.
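As a sketch, a factorial wave like that can be generated mechanically and assigned round-robin to clusters; the knob values and cluster names below are illustrative, and the `sky exec` flags mirror the experiment.yaml workflow used in this post:

```python
# Build one factorial wave: every (weight_decay, learning_rate) combo
# becomes a sky exec command, spread round-robin across clusters.
# Values and cluster names are illustrative.
from itertools import product

weight_decays = [0.05, 0.08, 0.10]
learning_rates = [0.02, 0.035, 0.05, 0.065]
clusters = [f"gpu-{i:02d}" for i in range(1, 17)]   # gpu-01 .. gpu-16

wave = []
for i, (wd, lr) in enumerate(product(weight_decays, learning_rates)):
    cluster = clusters[i % len(clusters)]
    wave.append(
        f"sky exec {cluster} experiment.yaml -d "
        f"--env EXPERIMENT_ID=wd{wd}-lr{lr} "
        f'--env EXPERIMENT_DESC="wd={wd} lr={lr}"'
    )

print(len(wave))   # 3 x 4 = 12 experiments in one wave
```

All 12 commands return immediately (detached mode), so the whole grid finishes in one 5-minute window instead of an hour of sequential runs.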
To arm the agent with GPUs, we used SkyPilot. It’s an open-source tool that launches jobs across clouds and Kubernetes from a YAML file and includes a skill that teaches coding agents to use it. The agent reads the skill, then launches and manages GPU clusters on its own - no manual cloud setup.
Each experiment is defined in a short YAML (experiment.yaml) that specifies the GPU type, installs dependencies, runs train.py, and prints metrics to stdout. The agent checks results with sky logs.

Claude Code uses the SkyPilot skill to launch and manage GPU experiments across clouds and Kubernetes.
experiment.yaml - SkyPilot task definition for a single experiment
resources:
accelerators: {H100:1, H200:1}
image_id: docker:nvcr.io/nvidia/pytorch:24.07-py3
infra: k8s # or slurm, aws, gcp, azure, etc. (20+ infra backends supported)
workdir: .
envs:
EXPERIMENT_ID: baseline
EXPERIMENT_DESC: "baseline run"
setup: |
pip install uv
uv sync
uv run prepare.py
run: |
# Run the experiment (5-min fixed budget)
uv run train.py 2>&1 | tee run.log
EXIT_CODE=${PIPESTATUS[0]}
if [ $EXIT_CODE -ne 0 ]; then
echo "EXPERIMENT_STATUS: crash"
else
VAL_BPB=$(grep "^val_bpb:" run.log | awk '{print $2}')
PEAK_VRAM=$(grep "^peak_vram_mb:" run.log | awk '{print $2}')
MEMORY_GB=$(echo "scale=1; ${PEAK_VRAM} / 1024" | bc)
echo "EXPERIMENT_STATUS: done"
echo "EXPERIMENT_RESULT: ${EXPERIMENT_ID} val_bpb=${VAL_BPB} memory_gb=${MEMORY_GB}"
fi
echo "EXPERIMENT_DESC: ${EXPERIMENT_DESC}"
The setup block runs once per cluster - subsequent experiments on the same cluster skip straight to training. For this run, we used SkyPilot on Kubernetes (infra: k8s) backed by CoreWeave, with {H100:1, H200:1} letting SkyPilot pick whichever GPU was available.
The full setup (agent instructions + YAML) is at skypilot/examples/autoresearch.
To run experiments in parallel, the agent launches multiple clusters and submits different experiments to each. The -d flag (detached mode) submits the job and returns immediately:
# Launch a cluster with the first experiment
sky launch gpu-01 experiment.yaml -d -y \
--env EXPERIMENT_ID=exp-01 \
--env EXPERIMENT_DESC="baseline run"
# Reuse the same cluster for the next experiment (skips setup)
sky exec gpu-01 experiment.yaml -d \
--env EXPERIMENT_ID=exp-02 \
--env EXPERIMENT_DESC="higher LR"
sky exec queues a job that starts automatically when the current one finishes, so the agent can pipeline experiments on the same cluster with zero idle time. Between sky launch for provisioning and sky exec for pipelining, a single agent can keep 16 clusters busy.
instructions.md points the agent to the SkyPilot skill, which teaches it to manage the full loop: provision clusters, submit experiments, check logs, commit winning changes, and keep going until stopped. You just point your coding agent at the instructions and walk away.
We pointed Claude Code at the instructions and let it run overnight. Claude used SkyPilot to provision 16 GPUs across our two Kubernetes clusters - 13 ended up on H100s and 3 on H200s, depending on what was available:
$ sky status
NAME INFRA RESOURCES STATUS
gpu-01 k8s 1x (H100:1) UP
gpu-02 k8s 1x (H100:1) UP
...
gpu-08 k8s 1x (H200:1) UP
...
gpu-16 k8s 1x (H100:1) UP
The session ran about 90 experiments per hour - a 9x throughput increase over the 10/hour you get with a single GPU (each experiment takes ~5 min plus ~1 min of setup and agent thinking time). Over 8 hours, the agent submitted ~910 experiments (700 with valid results, the rest queued or crashed).
The search went through five phases. The agent didn’t plan these ahead - each phase emerged from what it learned in the previous one.

Model performance across runs. Each grey dot is one experiment. Green dots mark new best validation losses. The agent drove val_bpb from 1.003 (baseline) to 0.974 over ~700 experiments in 8 hours.
Starting from val_bpb = 1.003 (baseline), the agent tested the obvious knobs in parallel: batch size, Adam betas, weight decay, window patterns, model depth, learning rate schedules. Early waves of 10-13 simultaneous experiments quickly mapped out what works:
After ~200 experiments: val_bpb = 0.981. Most of the hyperparameter space was mapped.
This was the biggest single jump, and the one that parallel search made possible. The agent tested six different aspect ratios simultaneously - AR=48, 64, 72, 80, 90, 96 - in a single 5-minute wave. In serial, that’s 30 minutes of waiting. In parallel, one wave.
The result: scaling model width from the default (AR~48, model_dim=384) to AR=96 (model_dim=768) outperformed every hyperparameter tweak from Phase 1. Going wider was worth more than all the optimizer tuning combined.
AR=112 was too big - the model didn’t get enough training steps in 5 minutes to use the extra capacity. AR=96 was the sweet spot: it fit in 64GB VRAM and completed ~1,060 steps on an H100 (vs ~1,450 for the smaller model), enough for the wider model to pay off.
After ~420 experiments: val_bpb = 0.977.
With AR=96 as the base architecture, the agent fine-tuned around it: warmdown schedule, matrix learning rate, weight decay, Newton-Schulz steps for the Muon optimizer. Each wave tested 10+ variants.
After ~560 experiments: val_bpb = 0.975 (on H200).
The biggest late-stage find: muon_beta2=0.98 (up from 0.95). The Muon optimizer’s second-momentum parameter controls how aggressively gradient normalization adapts. Increasing it smoothed the normalization and let the model take larger effective steps. This single change was worth ~0.001 val_bpb - the largest late-stage improvement.
The agent found this by testing beta2 in {0.95, 0.96, 0.97, 0.98, 0.99} across 10 clusters in one wave. Sequentially, that’s 5 experiments at 5 minutes each = 25 minutes. In parallel, 5 minutes.
After ~700 experiments: val_bpb = 0.974.
With the best config locked in, the agent ran combinatorial sweeps over final LR fraction, warmdown ratio, scalar LR, and embedding LR. Returns dropped below 0.0001 per experiment. The improvement curve had flattened:
Phase 1 (hyperparams): 1.003 → 0.981 (Δ = 0.022)
Phase 2 (architecture): 0.981 → 0.977 (Δ = 0.004)
Phase 3 (fine-tuning): 0.977 → 0.975 (Δ = 0.002)
Phase 4 (optimizer): 0.975 → 0.974 (Δ = 0.001)
Phase 5 (combinations): 0.974 → ??? (Δ < 0.0001)
The low-hanging fruit - architecture scale, batch size, optimizer structure - was picked. Further gains would require new architectural ideas or longer training budgets.
# Architecture
ASPECT_RATIO = 96 # model_dim = 8 * 96 = 768
DEPTH = 8 # 8 transformer layers
WINDOW_PATTERN = "SL" # alternating Sliding + Local attention
# Training
TOTAL_BATCH_SIZE = 2**18 # ~524K tokens/step
# Learning rates
MATRIX_LR = 0.05 # Muon LR for weight matrices
EMBEDDING_LR = 0.6 # AdamW LR for token embeddings
SCALAR_LR = 0.5 # AdamW LR for residual mixing scalars
# Optimizer
ADAM_BETAS = (0.70, 0.95)
WEIGHT_DECAY = 0.08
WARMDOWN_RATIO = 0.6
FINAL_LR_FRAC = 0.05
# Muon: momentum=0.95, ns_steps=5, beta2=0.98
With a single GPU, the agent is stuck doing greedy hill-climbing: try one thing, check the result, pick a direction, try the next thing. With 16 GPUs, the strategy shifts. The agent can run full factorial grids - test 3 values of weight decay × 4 values of learning rate = 12 experiments in a single 5-minute wave. This makes it much harder to get stuck in local optima and much easier to find interaction effects between parameters.
The aspect ratio discovery in Phase 2 is a good example. Sequentially, the agent might have tried AR=64, seen no improvement, and moved on to other ideas. In parallel, it tested AR=64, 72, 80, 90, 96, and 112 at once, immediately saw the trend, and zeroed in on AR=96. One wave instead of six sequential experiments.
The throughput numbers:
| | Sequential (1 GPU) | Parallel (16 GPUs) |
|---|---|---|
| Experiments / hour | ~10 | ~90 |
| Strategy | greedy hill-climbing | factorial grids per wave |
| Information per decision | 1 experiment | 10-13 simultaneous experiments |

With 16 GPUs, the parallel agent reached the same best validation loss 9x faster than the simulated sequential baseline (~8 hours vs ~72 hours).
We used SkyPilot to let our agent access our two H100 and H200 clusters. Of the 16-cluster budget we asked it to stick to, it used 13 H100s (80GB VRAM, ~283ms/step) and 3 H200s (141GB VRAM, ~263ms/step). We didn’t tell the agent about the GPUs’ performance differences; it figured them out on its own.
After a few waves, the agent noticed that identical configs scored ~0.005 val_bpb lower on H200 clusters - the faster step time meant more training steps in the 5-minute budget.
Without any prompts, it developed a two-tier strategy to exploit this difference: screen 10+ hypotheses cheaply on H100s in parallel, then promote the top 2-3 to H200 for confirmation runs. Here’s the agent reasoning through this in real time:
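That two-tier pattern can be sketched in a few lines. The experiment names and val_bpb scores below are made up; the ~0.005 H200 offset is taken from the observation above:

```python
# Screen-then-promote: score all candidates on the cheap tier (H100),
# then re-run only the top few on the scarce, faster tier (H200).
def two_tier(candidates, screen, validate, promote_k=3):
    screened = sorted(candidates, key=screen)    # cheap tier: lower val_bpb is better
    winners = screened[:promote_k]               # promote the best few
    return {c: validate(c) for c in winners}     # expensive tier: confirmation runs

# Made-up H100 screening scores for five hypotheses.
h100_scores = {"exp-a": 0.981, "exp-b": 0.984, "exp-c": 0.979,
               "exp-d": 0.990, "exp-e": 0.977}

# Pretend the H200 confirmation run scores ~0.005 lower, as observed.
confirmed = two_tier(h100_scores, screen=h100_scores.get,
                     validate=lambda c: round(h100_scores[c] - 0.005, 3))
print(confirmed)   # {'exp-e': 0.972, 'exp-c': 0.974, 'exp-a': 0.976}
```

The filter spends the scarce H200 time only on candidates that already passed the cheap screen, which is exactly the pattern the agent converged on.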
“Only 3 H200 clusters: gpu-03, gpu-04, gpu-08! The rest are H100. This explains everything — H200 is significantly faster than H100. In the same 5-minute budget, H200 can do MORE training steps. More steps = better val_bpb.”
“H200 runs 9% more steps in the same time! That directly leads to better val_bpb. All my ‘best’ results should be normalized by hardware.”
“Since H200 gets ~9% more steps than H100 in the same 5-minute budget, and I have only 3 H200 clusters, I should focus experiments on H200 clusters. The real optimization contest is on H200.”
This turned out to matter beyond just throughput. Rankings didn’t always transfer across hardware. For example, FINAL_LR_FRAC=0.03 sometimes beat 0.05 on H100 but consistently lost on H200. The likely explanation: with more training steps, the model benefits from keeping the learning rate higher toward the end of the schedule. The agent’s self-invented validation tier caught these discrepancies - a workflow a human researcher might design deliberately, but that the agent arrived at just by observing its own results.
The agent ran for ~8 hours on 16 Kubernetes GPUs. Claude Code’s API cost for the session was about $9. The GPU cost depends on your pricing: H100s run about $2/hour or lower, so 13 H100s for 8 hours is ~$200, and 3 H200s for 8 hours (at ~$2.3/hour) adds ~$60, for a total under $300.
The full setup (agent instructions + SkyPilot YAML) is at skypilot/examples/autoresearch.
The quickest way to get started is the one-line setup script, which installs dependencies, clones autoresearch, and downloads the experiment files:
curl -sL https://raw.githubusercontent.com/skypilot-org/skypilot/master/examples/autoresearch/setup.sh | bash
# follow the steps
cd autoresearch
claude "Read instructions.md and start running parallel experiments."
From here, the agent handles everything. It reads instructions.md, fetches the SkyPilot skill, provisions GPU clusters, submits experiments, checks logs, commits winning changes, and loops until you stop it.
Manual setup (without the script)
# Clone autoresearch and copy in the parallel experiment files
git clone https://github.com/karpathy/autoresearch.git
git clone https://github.com/skypilot-org/skypilot.git
cd autoresearch
cp ../skypilot/examples/autoresearch/experiment.yaml .
cp ../skypilot/examples/autoresearch/instructions.md .
# Prepare data locally (one-time)
pip install uv && uv sync && uv run prepare.py
# Install the SkyPilot skill for your agent
# See: https://docs.skypilot.co/en/latest/getting-started/skill.html
# Point your coding agent at the instructions
# "Read instructions.md and start running parallel experiments"
You can use Claude Code, Codex, or any coding agent that can run shell commands and fetch URLs. Set infra: in the YAML to target a specific backend (e.g. infra: k8s for Kubernetes, infra: aws for AWS). Otherwise, SkyPilot picks the cheapest available option.
For a quick intro to SkyPilot, see the overview and quickstart.
To receive latest updates, please star and watch the project’s GitHub repo, follow @skypilot_org, or join the SkyPilot community Slack.
But you’re missing the context and implication: “doing new stuff” is the major achievement we’re looking for next from LLMs. Seeing something that is “new” and is not in the training set is interesting in a way that something contained in the training set is not.
We cannot introspect LLMs meaningfully yet, so the difference between “came up with myself and it’s in the training set incidentally” and “applied a concept in the training set” is not meaningful.
He is a researcher who understands neural networks and their architectures exceptionally well. That is all.
A few examples: Axiom's proof of Fel’s open conjecture on syzygies of numerical semigroups: https://x.com/axiommathai/status/2019449659807219884
Erdos 457: https://www.erdosproblems.com/457
The stronger form of Erdos 650: https://www.erdosproblems.com/650
E.g., you can see a post from a user named dhouston, who mentioned that he was thinking about starting an online file sync/backup service of some sort.
And that is precisely why he is more qualified on the subject than your average vibe coder!