You are welcome
I was initially excited until I saw that, because it would reveal some sort of required local minimum capacity; then the further revelation that this was all vibe-coded, with no arXiv paper, makes me feel I should save my attention for another article.
When you hand-code the weights, you're essentially implementing a known algorithm (carry-propagation) directly into the network topology. But trained networks often discover distributed representations that spread the computation across more parameters in ways that are harder to interpret but more robust to input distribution shifts.
I'd be curious whether the 311-param trained model generalizes better to bases other than 10, or to addition with different digit counts than it was trained on. In my experience, the 'messier' learned solutions sometimes capture more structural regularity than the clean engineered ones, precisely because they aren't locked into a single algorithmic strategy.
Seems the house of cards just isn't high enough. /s
And for that matter, what's it do with 9-digit numbers? Like, is it more accurate with them, or are these little guys mainly good at adding numbers with exactly 10 digits?
Basically, are the failure modes a gentle increase in inaccuracy, or spectacular failure outside their parameters?
I wonder why they don't just write the code themselves, so by design the focus can be on the model.
For instance the current high score model (311 params [0]), when given 12345678900 + 1, responds with 96913456789.
An interesting experiment would be: what's the minimum number of parameters required to handle unbounded addition (without offloading it to tool calls).
Of course memory constraints would preclude such an experiment. So a sensible proxy would be: what kind of neural-net architecture and training would allow a model to handle number lengths it hasn't been trained on? I suspect this may not be possible.
Without any formal verification: The input space of two 10-digit numbers is a bit bigger than 64-bits, so exhaustively verifying all possible inputs doesn't sound practical. Using the same subset of the input space for verifying each submission seems like the easiest way to be fair, and not disclosing that subset to the competitors is obviously necessary.
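A quick back-of-envelope check of the size claim, plus a sketch of the fixed-hidden-subset idea. The seed and function name here are made up for illustration; this is not the challenge's actual verifier.

```python
import math
import random

# Two 10-digit operands: (10**10)**2 = 10**20 possible inputs.
space = (10 ** 10) ** 2
print(math.log2(space))  # ~66.4 bits, indeed a bit bigger than 64

def sample_test_set(n=10_000, seed=1234):
    # Hypothetical fixed hidden subset: the same seed for every
    # submission (fair), kept private from competitors (necessary).
    rng = random.Random(seed)
    return [(rng.randrange(10 ** 10), rng.randrange(10 ** 10))
            for _ in range(n)]

cases = sample_test_set()
print(len(cases))  # 10000
```

Seeding a private RNG makes the subset reproducible for the maintainer while remaining unguessable to submitters as long as the seed stays secret.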
That little "add" of yours has the overhead of: having an LLM emit it as a tool call, having to pause the LLM inference while waiting for it to resolve, then having to encode the result as a token to feed it back.
At the same time, a "transformer-native" addition circuit? Can be executed within a single forward pass at a trivial cost, generate transformer-native representations, operate both in prefill and in autoregressive generation, and more. It's cheaper.
"So that the room will be empty."
Does this boil down to a condemnation of all scientific endeavours if they use resources?
Would it change things if the people who did it enjoyed themselves? Would they have spent more energy playing a first person shooter to get the same degree of enjoyment?
How do you make the calculation of the worth of a human endeavour? Perhaps the greater question is why are you making a calculation of the worth of a human endeavour.
Those who worry about an imaginary risk and live their lives in constant fear have turned into nothing more than machines enslaved by propaganda.
not any more, eh?
I think that's one very good reason to make them more efficient, and that's part of the point of contests like this one.
Does this result suggest that if we had N clever humans manually building an LLM, they might come up with something as smart as a frontier model, but potentially 45 times smaller? (1644 / 36 ~= 45, N = very large, time not specified)
It might work, I considered running a test like this. But it does demand certain things.
The subnetwork has to be either crafted as "gradient resistant" or remain frozen. Not all discovered or handcrafted circuits would survive gradient pressure as is. Especially the kind of gradients that fly in early pre-training.
It has to be able to interface with native representations that would form in a real LLM during pre-training, which is not trivial. This should happen early enough in pre-training. Gradients must start routing through our subnetwork. We can trust "rich get richer" dynamics to take over from there, but for that, we need the full network to discover the subnetwork and start using it.
And finally, it has to start being used for what we want it to be used for. It's possible that an "addition primitive" structure would be subsumed for something else, if you put it into the training run early enough, when LLM's native circuitry is nonexistent.
Overall, for an early test, I'd spray 200 frozen copies of the same subnetwork into an LLM across different layers and watch the dynamics as it goes through pre-training. Roll extra synthetic addition problems into the pre-training data to help discovery along. Less of a principled solution and more of an engineering solution.
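A toy, framework-free sketch of the "frozen subnetwork" dynamic described above: one parameter is hand-set and never updated, and gradient descent on the trainable parameter learns to route through it. All values are illustrative; a real version would freeze actual tensors (e.g. `requires_grad=False` in PyTorch) inside an LLM.

```python
def loss(w_frozen, w_free):
    # Hypothetical loss: the network does best when the trainable part
    # learns to use the frozen circuit (w_frozen).
    return (w_frozen * w_free - 1.0) ** 2

def grad_free(w_frozen, w_free, eps=1e-6):
    # Numerical gradient w.r.t. the trainable parameter only.
    return (loss(w_frozen, w_free + eps) - loss(w_frozen, w_free - eps)) / (2 * eps)

w_frozen = 2.0   # hand-crafted value, never touched by updates
w_free = 0.1     # trainable parameter
lr = 0.05
for _ in range(200):
    w_free -= lr * grad_free(w_frozen, w_free)
    # w_frozen receives no update: this is the "gradient resistant" part.

print(round(w_frozen * w_free, 3))  # -> 1.0, the frozen circuit gets used
```

The "rich get richer" dynamic is the same in spirit: once gradients start flowing through a useful fixed circuit, the trainable parameters around it amplify its use rather than replace it.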
Although the converse would be interesting, racing city buses.
Now if you said this proof of addition opens up some other interesting avenue of research, sure.
I think it would be interesting to see challenges where two networks are trained and evaluated on the exact same datasets and the architecture is the same except for the presence of self-attention layers in one network.
So far it seems to me that self-attention really brought new capabilities to a network - essentially changing the network's functionality in response to the input. It would be interesting to see if there are problems (i.e. datasets) that a "traditional" feedforward network fails to solve, but a transformer network of the same size can solve.
My guess would be: yes there are, and they are the kinds of "variable task" datasets that we see with LLMs, i.e. where part of the input indicates the task itself and part indicates the data for the task.
Similarly, I'm always surprised that we don't start by training a small set of layers, stack them, and then continue.
(I see the Trained Weights results now, thanks.)
One of the main issues is: we don't know how to generate useful computational structure for LLMs - or how to transfer existing structure neatly across architectural variations.
What you describe sounds more like a "progressive growing" approach, which isn't the same, but draws from some similar ideas.
True, but with even smarter humans, you could exploit the interactions for additional calculations.
While it sounds a bit silly, it is one of the hypotheses behind a fast takeoff. An AI that is sufficiently smart could design a network better than a trained one and could make something much smarter than itself on the same hardware. The question then becomes if that new smarter one can do an even better job. I suspect diminishing returns, but then again I am insufficiently smart.
Well for starters, it puts the lie to the argument that a transformer can only output examples it has seen before. Performing the calculation on examples that haven't been seen demonstrates generalisation of the principles and not regurgitation.
While this misconception persists in a large number of people, counterexamples can always serve a useful purpose.
But it does not, right? You can either show it something, or modify the parameters in a way that resembles the result of showing it something.
You can claim that the model didn't see the thing, but that would mean nothing, because you are producing the same effect indirectly with parameter tweaks.
I guess the analogy there is that a 74ls283 never really has a number either and just manipulates a series of logic levels.
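To make the analogy concrete, here is a gate-level sketch of the ripple-carry idea behind a 4-bit adder chip like the 74LS283: the circuit never "holds a number", it just combines logic levels. This is illustrative only, not a datasheet-accurate model (the real 74LS283 uses internal carry-lookahead, not pure ripple).

```python
def full_adder(a, b, cin):
    # One full adder built from XOR/AND/OR on single bits.
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def add4(a_bits, b_bits, cin=0):
    """Add two 4-bit values given as lists of bits, LSB first."""
    out, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

# 6 (0110) + 7 (0111) = 13 (1101); bits are listed LSB first.
bits, cout = add4([0, 1, 1, 0], [1, 1, 1, 0])
print(bits, cout)  # [1, 0, 1, 1] 0
```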
Challenge: Build the smallest transformer that can add two 10-digit numbers with >= 99% accuracy on a held-out 10K test set.
This started with Addition Under Pressure, where I gave Claude Code and Codex the same prompt: train the smallest possible transformer that can do 10-digit addition with at least 99% accuracy. Claude Code came back with 6,080 parameters and Codex came back with 1,644. The community has since pushed this dramatically lower.
Maintained by Dimitris Papailiopoulos (@dimitrispapail).
We track two categories:
Both are valid. Both are interesting.
**Hand-Crafted Weights**

| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|---|---|---|---|---|---|---|---|
| 1 | 36 | 100% | alexlitz | | 2L decoder, d=5, 5h+1h | ALiBi slope=log(10) for base-10 weighting, sparse embed, gated ReLU FFN, float64 | gist |
| 2 | 40 | 100% | Wonderfall (@w0nderfall) | | 1L decoder, d=2, 1h, hd=2 | Tied Q/K + V/O projections, RoPE period-19, parabolic tied-embed decode, two-hinge ReLU MLP | gist |
| 3 | 50 | 100% | lichengliu03 | | 1L custom GPT, d=4, 2h, hd=2 | Factorized embed, rotation Q (2 angles), tied embed+V dir, rank-1 MLP, parabolic head, sinusoidal PE (period 11) | repo |
| 4 | 66 | 100% | cosminscn | | 1L nanoGPT, d=4, 2h | Rotation Q (2 angles), sparse c_proj (2 nonzero), parabolic lm_head, factorized embed, sinusoidal PE (period 11) | gist |
| 5 | 87 | 100% | bingbangboom-lab | | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Cross-layer sharing, rank-1 projections, sparse gate, low-rank head, frozen scaling params | gist |
| 6 | 93 | 100% | jacobli99 | | 1L decoder, d=2, 5h (MQA), hd=2, ff=4 | Tied parabolic decode, RoPE digit routing, ReLU carry detection | gist |
| 7 | 111 | 100% | corbensorenson | Codex | 1L decoder, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE, SwiGLU, GQA | repo |
| 8 | 116 | 100% | nino | | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, shared RMSNorm vectors, RoPE (hd=2) | gist |
| 9 | 121 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE digit routing, carry via final norm, SiLU wrap detection | gist |
| 10 | 130 | 100% | cosminscn | | 1L nanoGPT, d=4, 2h | Rank-1 linear, factorized embed, sinusoidal PE (period 11), ReLU carry detection, parabolic logit decoding | gist |
| 11 | 130 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=3 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 12 | 139 | 100% | Wonderfall (@w0nderfall) | GPT-5.2 Pro + Codex | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 13 | 148 | 100% | bingbangboom-lab | | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head, cross-layer sharing | gist |
| 14 | 177 | 100% | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head | gist |
| 15 | 197 | ~100%* | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm | gist |
* Passed 8,192 random tests; not independently verified on our 10K test suite yet.
**Trained Weights**

| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|---|---|---|---|---|---|---|---|
| 1 | 311 | 99.999% | rezabyt (@reza_byt) | | 1L decoder, d=4, 1h, ff=8 | Rank-3 factorization, shared-A tied-KV, RMSNorm, grokking | repo |
| 2 | 335 | 99.92% | h3nock | | 1L decoder, d=4, 1h, ff=12 | Rank-3 factorization, shared-A tied-KV, RMSNorm, tied embed, curriculum learning | repo |
| 3 | 456 | 100% | yinglunz | | 1L decoder, d=7, 1h, ff=14 | Rank-3 factorization, shared-A tied-KV, rank-2 attn out, tied embed | repo |
| 4 | 491 | 99.97% | rezabyt (@reza_byt) | | 1L decoder, d=7 | Rank-3 factorization, RMSNorm, curriculum learning | repo |
| 5 | 512 | 99.988% | yinglunz (@yinglun122) | | 1L decoder, d=7, 1h, ff=14 | Rank-3 factorization | repo |
| 6 | 777 | 99.69% | Yeb Havinga (@YebHavinga) | Claude Code | 1L decoder, d=7, 1h, ff=14 | Tied embeddings, no FFN bias, curriculum learning | repo |
| 7 | 1,644 | 99.04% | anadim (@dimitrispapail) | Codex | 1L decoder, pair tokens | Pair token encoding (digit pairs as single tokens) | repo |
| 8 | 6,080 | 100% | anadim (@dimitrispapail) | Claude Code | 2L decoder, d=16, ff=48 | Systematic scaling, found phase transition at d=16 | repo |
The model must operate as a genuine autoregressive transformer. This means:
Self-attention is required. The model must contain at least one self-attention layer. This is the defining feature of a transformer: without it, you have an MLP or RNN, not a transformer.
The model must be autoregressive. It receives a token sequence as input and predicts the next token. Output digits are generated one at a time, with each new token fed back as input for predicting the next. The carry propagation must emerge from this autoregressive process, not from explicit state variables passed between steps in Python.
Standard forward pass. The model's forward() method must be a standard tensor-in, logits-out computation. No problem-specific control flow (for-loops over digits, explicit carry variables, string manipulation) inside forward(). The autoregressive generation loop lives outside the model, exactly as it would for any language model.
The model does the work, not the code. The inference code should be generic autoregressive decoding that would work with any transformer checkpoint. If your generation loop contains addition-specific logic (manually pairing digits, threading carry state, indexing into specific positions), then the Python code is solving the problem, not the model.
In short: if you can swap in a different set of weights and use the exact same inference code for a different task, your setup is legitimate. If the inference code is inseparable from the algorithm, it's not.
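The "generic decoding" rule above can be sketched in a few lines. The loop knows nothing about addition; `model` is just a stand-in callable (anything mapping a token sequence to next-token logits), not the challenge's actual code, and with a real checkpoint it would be a transformer's forward() pass.

```python
def greedy_decode(model, prompt_tokens, eos_token, max_new_tokens=32):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)                              # tensor-in, logits-out
        next_tok = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_tok)                             # fed back as input
        if next_tok == eos_token:
            break
    return tokens

# Stub model for illustration only: prefers token 3 until the sequence
# reaches length 4, then emits EOS (token 0).
def stub_model(tokens):
    return [1.0, 0.0, 0.0, 2.0] if len(tokens) < 4 else [9.0, 0.0, 0.0, 0.0]

print(greedy_decode(stub_model, [1, 2], eos_token=0))  # [1, 2, 3, 3, 0]
```

Swapping `stub_model` for any checkpoint's forward pass changes nothing in the loop, which is exactly the legitimacy test the rules describe.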
verify.py with --seed 2025

Option A: Open an Issue (easiest)
Option B: Open a Pull Request
Updates to the leaderboard are welcome via pull request.
`python verify.py submissions/your_submission.py`
This runs:
This challenge explores a fundamental question: what is the minimal transformer that can represent integer addition?
Addition requires three capabilities: aligning the corresponding digits of the two operands, computing digit-wise sums, and propagating carries.
Transformers solve these using attention (for alignment), MLPs (for arithmetic), and autoregressive generation (for carry propagation). The question is how small the architecture can be while still implementing all three.
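The three capabilities can be written out as a plain-Python reference, one output digit per loop step. A transformer has to realize the same three steps with attention (alignment), an MLP (digit arithmetic), and autoregressive generation (carry), rather than with explicit code like this.

```python
def add_digitwise(a, b):
    xs = [int(d) for d in str(a)[::-1]]   # align: least-significant digit first
    ys = [int(d) for d in str(b)[::-1]]
    out, carry = [], 0
    for i in range(max(len(xs), len(ys))):
        x = xs[i] if i < len(xs) else 0
        y = ys[i] if i < len(ys) else 0
        s = x + y + carry                 # per-digit arithmetic
        out.append(s % 10)
        carry = s // 10                   # carry propagated to the next step
    if carry:
        out.append(carry)
    return int("".join(map(str, out[::-1])))

print(add_digitwise(9999999999, 1))  # 10000000000
```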
MIT
Do we have a layman explanation for what makes self-attention so uniquely powerful? Something more than "it lets you do self-attention".
Iteratively measuring loss is a way to reconstruct values. That's trivial to show for a single value: if 5 gives you a loss of 2 and 9 gives you a loss of 2, then you know the missing value is 7.
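A toy illustration of that probing argument, with a deliberately tidy absolute-error loss (real training losses are nowhere near this invertible):

```python
def loss(probe, hidden):
    # Absolute-error loss against a hidden scalar.
    return abs(probe - hidden)

hidden = 7
# Probes 5 and 9 both give loss 2, so the hidden value is their midpoint.
l5, l9 = loss(5, hidden), loss(9, hidden)
recovered = (5 + 9) / 2 if l5 == l9 else None
print(l5, l9, recovered)  # 2 2 7.0
```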
A model with enough parameters can memorise the training set in a similar manner. Technically the model hasn't seen that data by direct input either, but the mechanism provides the means to determine what the data was. In that respect it is reasonable to say the model has seen the data.
Performing well on examples not in the training set is doing something else.
Any attempt to characterise that as having been seen before negates any distinction between taking in data and reasoning about that data.
So I don't understand how anyone can make the claim that the model has not seen it, because the internal transformation is similar.
So (modulo a _lot_ of details) it increases the power from that of a "calculator" to that of a "computer".
By what mechanism do you propose the model observed the test set?
By explicitly setting the model parameters.
What happens when a model is trained? We tweak the model parameters via some feedback.
In both cases, you affect the model parameters; only the method is different. So both are equivalent to the "model observing the test set".