You are welcome
I was initially excited until I saw that, because it would reveal some sort of required local minimum capacity; then the further revelation that this was all vibe-coded, with no arXiv paper, makes me feel I should save my attention for another article.
When you hand-code the weights, you're essentially implementing a known algorithm (carry-propagation) directly into the network topology. But trained networks often discover distributed representations that spread the computation across more parameters in ways that are harder to interpret but more robust to input distribution shifts.
I'd be curious whether the 311-param trained model generalizes better to bases other than 10, or to addition with different digit counts than it was trained on. In my experience, the 'messier' learned solutions sometimes capture more structural regularity than the clean engineered ones, precisely because they aren't locked into a single algorithmic strategy.
Seems the house of cards just isn't high enough. /s
And for that matter, what's it do with 9-digit numbers? Like, is it more accurate with them, or are these little guys mainly good at adding numbers with exactly 10 digits?
Basically, are the failure modes a gentle increase in inaccuracy, or spectacular failure outside their parameters?
I wonder why they don't just write the code themselves, so by design the focus can be on the model.
For instance the current high score model (311 params [0]), when given 12345678900 + 1, responds with 96913456789.
An interesting experiment would be: what's the minimum number of parameters required to handle unbounded addition (without offloading it to tool calls).
Of course memory constraints would preclude such an experiment. So a sensible proxy would be: what kind of neural-net architecture and training would allow a model to handle number lengths it hasn't been trained on? I suspect this may not be possible.
Without any formal verification: The input space of two 10-digit numbers is a bit bigger than 64-bits, so exhaustively verifying all possible inputs doesn't sound practical. Using the same subset of the input space for verifying each submission seems like the easiest way to be fair, and not disclosing that subset to the competitors is obviously necessary.
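A quick back-of-envelope check of the size claim, plus a sketch of the fixed-hidden-subset idea. The seed and function name here are made up for illustration; this is not the challenge's actual verifier.

```python
import math
import random

# Two 10-digit operands: (10**10)**2 = 10**20 possible inputs.
space = (10 ** 10) ** 2
print(math.log2(space))  # ~66.4 bits, indeed a bit bigger than 64

def sample_test_set(n=10_000, seed=1234):
    # Hypothetical fixed hidden subset: the same seed for every
    # submission (fair), kept private from competitors (necessary).
    rng = random.Random(seed)
    return [(rng.randrange(10 ** 10), rng.randrange(10 ** 10))
            for _ in range(n)]

cases = sample_test_set()
print(len(cases))  # 10000
```

Seeding a private RNG makes the subset reproducible for the maintainer while remaining unguessable to submitters as long as the seed stays secret.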
That little "add" of yours has the overhead of: having an LLM emit it as a tool call, having to pause the LLM inference while waiting for it to resolve, then having to encode the result as a token to feed it back.
At the same time, a "transformer-native" addition circuit? Can be executed within a single forward pass at a trivial cost, generate transformer-native representations, operate both in prefill and in autoregressive generation, and more. It's cheaper.
"So that the room will be empty."
Does this boil down to a condemnation of all scientific endeavours if they use resources?
Would it change things if the people who did it enjoyed themselves? Would they have spent more energy playing a first person shooter to get the same degree of enjoyment?
How do you make the calculation of the worth of a human endeavour? Perhaps the greater question is why are you making a calculation of the worth of a human endeavour.
Those who worry about an imaginary risk and live their lives in constant fear have turned into nothing more than machines enslaved by propaganda.
not any more, eh?
I think that's one very good reason to make them more efficient, and that's part of the point of contests like this one.
Does this result suggest that if we had N clever humans manually building an LLM, they might come up with something as smart as a frontier model, but potentially 45 times smaller? (1644 / 36 ~= 45, N = very large, time not specified)
It might work, I considered running a test like this. But it does demand certain things.
The subnetwork has to be either crafted as "gradient resistant" or remain frozen. Not all discovered or handcrafted circuits would survive gradient pressure as is. Especially the kind of gradients that fly in early pre-training.
It has to be able to interface with native representations that would form in a real LLM during pre-training, which is not trivial. This should happen early enough in pre-training. Gradients must start routing through our subnetwork. We can trust "rich get richer" dynamics to take over from there, but for that, we need the full network to discover the subnetwork and start using it.
And finally, it has to start being used for what we want it to be used for. It's possible that an "addition primitive" structure would be subsumed for something else, if you put it into the training run early enough, when LLM's native circuitry is nonexistent.
Overall, for an early test, I'd spray 200 frozen copies of the same subnetwork into an LLM across different layers and watch the dynamics as it goes through pre-training. Roll extra synthetic addition problems into the pre-training data to help discovery along. Less of a principled solution and more of an engineering solution.
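A toy, framework-free sketch of the "frozen subnetwork" dynamic described above: one parameter is hand-set and never updated, and gradient descent on the trainable parameter learns to route through it. All values are illustrative; a real version would freeze actual tensors (e.g. `requires_grad=False` in PyTorch) inside an LLM.

```python
def loss(w_frozen, w_free):
    # Hypothetical loss: the network does best when the trainable part
    # learns to use the frozen circuit (w_frozen).
    return (w_frozen * w_free - 1.0) ** 2

def grad_free(w_frozen, w_free, eps=1e-6):
    # Numerical gradient w.r.t. the trainable parameter only.
    return (loss(w_frozen, w_free + eps) - loss(w_frozen, w_free - eps)) / (2 * eps)

w_frozen = 2.0   # hand-crafted value, never touched by updates
w_free = 0.1     # trainable parameter
lr = 0.05
for _ in range(200):
    w_free -= lr * grad_free(w_frozen, w_free)
    # w_frozen receives no update: this is the "gradient resistant" part.

print(round(w_frozen * w_free, 3))  # -> 1.0, the frozen circuit gets used
```

The "rich get richer" dynamic is the same in spirit: once gradients start flowing through a useful fixed circuit, the trainable parameters around it amplify its use rather than replace it.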
Although the converse would be interesting, racing city buses.
Now if you said this proof of addition opens up some other interesting avenue of research, sure.
I think it would be interesting to see challenges where two networks are trained and evaluated on the exact same datasets and the architecture is the same except for the presence of self-attention layers in one network.
So far it seems to me that self-attention really brought new capabilities to a network - essentially changing the network's functionality in response to the input. It would be interesting to see if there are problems (i.e. datasets) that a "traditional" feedforward network fails to solve, but a transformer network of the same size can solve.
My guess would be: yes there are, and they are the kinds of "variable task" datasets that we see with LLMs, i.e. where part of the input indicates the task itself and part indicates the data for the task.
Similarly, I'm always surprised that we don't start by training a small set of layers, stack them, and then continue.
(I see the Trained Weights results now, thanks.)
One of the main issues is: we don't know how to generate useful computational structure for LLMs - or how to transfer existing structure neatly across architectural variations.
What you describe sounds more like a "progressive growing" approach, which isn't the same, but draws from some similar ideas.
True, but with even smarter humans, you could exploit the interactions for additional calculations.
While it sounds a bit silly, it is one of the hypotheses behind a fast takeoff. An AI that is sufficiently smart could design a network better than a trained one and could make something much smarter than itself on the same hardware. The question then becomes if that new smarter one can do an even better job. I suspect diminishing returns, but then again I am insufficiently smart.
Well for starters, it puts the lie to the argument that a transformer can only output examples it has seen before. Performing the calculation on examples that haven't been seen demonstrates generalisation of the principles and not regurgitation.
While this misconception persists in a large number of people, counterexamples can always serve a useful purpose.
But it does not, right? You can either show it something, or modify the parameters in a way that resembles the result of showing it something.
You can claim that the model didn't see the thing, but that would mean nothing, because you are producing the same effect indirectly with parameter tweaks.
I guess the analogy there is that a 74ls283 never really has a number either and just manipulates a series of logic levels.
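To make the analogy concrete, here is a gate-level sketch of the ripple-carry idea behind a 4-bit adder chip like the 74LS283: the circuit never "holds a number", it just combines logic levels. This is illustrative only, not a datasheet-accurate model (the real 74LS283 uses internal carry-lookahead, not pure ripple).

```python
def full_adder(a, b, cin):
    # One full adder built from XOR/AND/OR on single bits.
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def add4(a_bits, b_bits, cin=0):
    """Add two 4-bit values given as lists of bits, LSB first."""
    out, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

# 6 (0110) + 7 (0111) = 13 (1101); bits are listed LSB first.
bits, cout = add4([0, 1, 1, 0], [1, 1, 1, 0])
print(bits, cout)  # [1, 0, 1, 1] 0
```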
Challenge: Build the smallest transformer that can add two 10-digit numbers with >= 99% accuracy on a held-out 10K test set.
This started with Addition Under Pressure, where I gave Claude Code and Codex the same prompt: train the smallest possible transformer that can do 10-digit addition with at least 99% accuracy. Claude Code came back with 6,080 parameters and Codex came back with 1,644. The community has since pushed this dramatically lower.
Maintained by Dimitris Papailiopoulos (@dimitrispapail).
We track two categories:
Both are valid. Both are interesting.
**Hand-Crafted Weights**

| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|---|---|---|---|---|---|---|---|
| 1 | 36 | 100% | alexlitz | | 2L decoder, d=5, 5h+1h | ALiBi slope=log(10) for base-10 weighting, sparse embed, gated ReLU FFN, float64 | gist |
| 2 | 40 | 100% | Wonderfall (@w0nderfall) | | 1L decoder, d=2, 1h, hd=2 | Tied Q/K + V/O projections, RoPE period-19, parabolic tied-embed decode, two-hinge ReLU MLP | gist |
| 3 | 50 | 100% | lichengliu03 | | 1L custom GPT, d=4, 2h, hd=2 | Factorized embed, rotation Q (2 angles), tied embed+V dir, rank-1 MLP, parabolic head, sinusoidal PE (period 11) | repo |
| 4 | 66 | 100% | cosminscn | | 1L nanoGPT, d=4, 2h | Rotation Q (2 angles), sparse c_proj (2 nonzero), parabolic lm_head, factorized embed, sinusoidal PE (period 11) | gist |
| 5 | 87 | 100% | bingbangboom-lab | | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Cross-layer sharing, rank-1 projections, sparse gate, low-rank head, frozen scaling params | gist |
| 6 | 93 | 100% | jacobli99 | | 1L decoder, d=2, 5h (MQA), hd=2, ff=4 | Tied parabolic decode, RoPE digit routing, ReLU carry detection | gist |
| 7 | 111 | 100% | corbensorenson | Codex | 1L decoder, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE, SwiGLU, GQA | repo |
| 8 | 116 | 100% | nino | | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, shared RMSNorm vectors, RoPE (hd=2) | gist |
| 9 | 121 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE digit routing, carry via final norm, SiLU wrap detection | gist |
| 10 | 130 | 100% | cosminscn | | 1L nanoGPT, d=4, 2h | Rank-1 linear, factorized embed, sinusoidal PE (period 11), ReLU carry detection, parabolic logit decoding | gist |
| 11 | 130 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=3 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 12 | 139 | 100% | Wonderfall (@w0nderfall) | GPT-5.2 Pro + Codex | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 13 | 148 | 100% | bingbangboom-lab | | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head, cross-layer sharing | gist |
| 14 | 177 | 100% | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head | gist |
| 15 | 197 | ~100%* | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm | gist |
* Passed 8,192 random tests; not independently verified on our 10K test suite yet.
**Trained Weights**

| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|---|---|---|---|---|---|---|---|
| 1 | 311 | 99.999% | rezabyt (@reza_byt) | | 1L decoder, d=4, 1h, ff=8 | Rank-3 factorization, shared-A tied-KV, RMSNorm, grokking | repo |
| 2 | 335 | 99.92% | h3nock | | 1L decoder, d=4, 1h, ff=12 | Rank-3 factorization, shared-A tied-KV, RMSNorm, tied embed, curriculum learning | repo |
| 3 | 456 | 100% | yinglunz | | 1L decoder, d=7, 1h, ff=14 | Rank-3 factorization, shared-A tied-KV, rank-2 attn out, tied embed | repo |
| 4 | 491 | 99.97% | rezabyt (@reza_byt) | | 1L decoder, d=7 | Rank-3 factorization, RMSNorm, curriculum learning | repo |
| 5 | 512 | 99.988% | yinglunz (@yinglun122) | | 1L decoder, d=7, 1h, ff=14 | Rank-3 factorization | repo |
| 6 | 777 | 99.69% | Yeb Havinga (@YebHavinga) | Claude Code | 1L decoder, d=7, 1h, ff=14 | Tied embeddings, no FFN bias, curriculum learning | repo |
| 7 | 1,644 | 99.04% | anadim (@dimitrispapail) | Codex | 1L decoder, pair tokens | Pair token encoding (digit pairs as single tokens) | repo |
| 8 | 6,080 | 100% | anadim (@dimitrispapail) | Claude Code | 2L decoder, d=16, ff=48 | Systematic scaling, found phase transition at d=16 | repo |
The model must operate as a genuine autoregressive transformer. This means:
Self-attention is required. The model must contain at least one self-attention layer. This is the defining feature of a transformer: without it, you have an MLP or RNN, not a transformer.
The model must be autoregressive. It receives a token sequence as input and predicts the next token. Output digits are generated one at a time, with each new token fed back as input for predicting the next. The carry propagation must emerge from this autoregressive process, not from explicit state variables passed between steps in Python.
Standard forward pass. The model's forward() method must be a standard tensor-in, logits-out computation. No problem-specific control flow (for-loops over digits, explicit carry variables, string manipulation) inside forward(). The autoregressive generation loop lives outside the model, exactly as it would for any language model.
The model does the work, not the code. The inference code should be generic autoregressive decoding that would work with any transformer checkpoint. If your generation loop contains addition-specific logic (manually pairing digits, threading carry state, indexing into specific positions), then the Python code is solving the problem, not the model.
In short: if you can swap in a different set of weights and use the exact same inference code for a different task, your setup is legitimate. If the inference code is inseparable from the algorithm, it's not.
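The "generic decoding" rule above can be sketched in a few lines. The loop knows nothing about addition; `model` is just a stand-in callable (anything mapping a token sequence to next-token logits), not the challenge's actual code, and with a real checkpoint it would be a transformer's forward() pass.

```python
def greedy_decode(model, prompt_tokens, eos_token, max_new_tokens=32):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)                              # tensor-in, logits-out
        next_tok = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_tok)                             # fed back as input
        if next_tok == eos_token:
            break
    return tokens

# Stub model for illustration only: prefers token 3 until the sequence
# reaches length 4, then emits EOS (token 0).
def stub_model(tokens):
    return [1.0, 0.0, 0.0, 2.0] if len(tokens) < 4 else [9.0, 0.0, 0.0, 0.0]

print(greedy_decode(stub_model, [1, 2], eos_token=0))  # [1, 2, 3, 3, 0]
```

Swapping `stub_model` for any checkpoint's forward pass changes nothing in the loop, which is exactly the legitimacy test the rules describe.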
verify.py with --seed 2025

Option A: Open an Issue (easiest)
Option B: Open a Pull Request
Updates to the leaderboard are welcome via pull request.
`python verify.py submissions/your_submission.py`
This runs:
This challenge explores a fundamental question: what is the minimal transformer that can represent integer addition?
Addition requires three capabilities: aligning the corresponding digits of the two operands, computing digit-wise sums, and propagating carries.
Transformers solve these using attention (for alignment), MLPs (for arithmetic), and autoregressive generation (for carry propagation). The question is how small the architecture can be while still implementing all three.
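The three capabilities can be written out as a plain-Python reference, one output digit per loop step. A transformer has to realize the same three steps with attention (alignment), an MLP (digit arithmetic), and autoregressive generation (carry), rather than with explicit code like this.

```python
def add_digitwise(a, b):
    xs = [int(d) for d in str(a)[::-1]]   # align: least-significant digit first
    ys = [int(d) for d in str(b)[::-1]]
    out, carry = [], 0
    for i in range(max(len(xs), len(ys))):
        x = xs[i] if i < len(xs) else 0
        y = ys[i] if i < len(ys) else 0
        s = x + y + carry                 # per-digit arithmetic
        out.append(s % 10)
        carry = s // 10                   # carry propagated to the next step
    if carry:
        out.append(carry)
    return int("".join(map(str, out[::-1])))

print(add_digitwise(9999999999, 1))  # 10000000000
```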
MIT
Do we have a layman explanation for what makes self-attention so uniquely powerful? Something more than "it lets you do self-attention".
Iteratively measuring loss is a way to reconstruct values. That's trivial to show for a single value: if 5 gives you a loss of 2 and 9 gives you a loss of 2, then you know the missing value is 7.
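A toy illustration of that probing argument, with a deliberately tidy absolute-error loss (real training losses are nowhere near this invertible):

```python
def loss(probe, hidden):
    # Absolute-error loss against a hidden scalar.
    return abs(probe - hidden)

hidden = 7
# Probes 5 and 9 both give loss 2, so the hidden value is their midpoint.
l5, l9 = loss(5, hidden), loss(9, hidden)
recovered = (5 + 9) / 2 if l5 == l9 else None
print(l5, l9, recovered)  # 2 2 7.0
```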
A model with enough parameters can memorise the training set in a similar manner. Technically the model hasn't seen that data by direct input either, but the mechanism provides the means to determine what the data was. In that respect it is reasonable to say the model has seen the data.
Performing well on examples not in the training set is doing something else.
Any attempt to characterise that as having been seen before negates any distinction between taking in data and reasoning about that data.
So I don't understand how anyone can make the claim that the model has not seen it, because the internal transformation is similar.
So (modulo a _lot_ of details) it increases the power from that of a "calculator" to that of a "computer".
By what mechanism do you propose the model observed the test set?
By explicitly setting the model parameters.
What happens when a model is trained? We tweak the model parameters via some feedback.
In both cases, you affect the model parameters; only the method is different. So both are equivalent to the "model observing the test set".