Attention Residuals

Amazingly, the first author is a high school student! https://nathanchen.me/public/About%20me.html

This is the key piece

> Full AttnRes is straightforward but requires O(Ld) memory at scale. Block AttnRes partitions layers into N blocks, accumulates within each block via standard residuals, and applies attention only over block-level representations. With ~8 blocks, it recovers most of Full AttnRes's gains while serving as a practical drop-in replacement with marginal overhead.

Great idea and seems quite obvious in hindsight.

Is it guaranteed to have the same effect on vanishing gradients though? What if it put weight 1 on a layer that had a tiny gradient?

Two things stand out to me with this:

1. Drops compute required for training by ~20%. This approach wont just help the ever escalating model sizes larger companies are pushing for, it means things like autoresearch can iterate on new model architectures faster.

2. WAY lower bandwidth requirements for inference. Means with approaches like this it should run on consumer hardware far better. It apparently requires 1/6th the memory bandwidth of a traditional approach for better results.

This is a big improvement if it can be generalized. They're claiming it's a drop in replacement, so it seems like it can as well.

This is reminds me of the input gates of an LSTM.

I don't know why all the posts fail to summarize the results properly.

I had a similar idea at the back of my head but here is a layman explanation:

Standard attention threads the previous layers output to the next layers input. By adding residual connections to each layer, the layers learn an update rule.

There is an obvious limitation here. Only the first layer gets to see the original input and all subsequent layers only get to see the previous layer output.

With attention residuals, the idea is that you have a tiny attention operator that decides between using the original input and any of the previous layer outputs.

ScholarlyArticle: "Attention Residuals" (2026) https://arxiv.org/abs/2603.15031 :

> Abstract: Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. [...]

Amazingly, the first author is a high school student! https://nathanchen.me/public/About%20me.html

We're about to get an onslaught of young Chinese geniuses (raised in China). It's pure statistics

Sadly, same can't be said about India (infrastructure/food security lags China).

Long, long ago, I remember the first toy I ever got that was made in chiina. It was a wooden cube puzzle. various interlocking differently shaped pieces that when assembled formed a cube. It was so different to all the other toys made in america by hasbro, mattel, tonka, etc. Back then I felt like I was holding a toy made by the ancient Greeks, a puzzle to teach geometry, analysis, pattern recognition. So abstract, so removed from daily life, it transported me into a different world. Like chess, it was an engaging abstraction. But unlike chess, it was not about conflict but rather interrelating pieces to make a greater whole.

So this is what really unsettles me. Not that China graduates more engineers every year than we have entirely employed in the US, but rather, that these individuals are not about delegating work, but actually doing it. Whereas the western credo is to get someone else to do the work (or in the words of PAtton, to get some one else to die for his country), I get the feeling that China will get robots and AI to do the work. I am reminded of the joke about Chinese factories having only 1 security guard and 1 dog. The guard is there to feed the dog.

This is reminds me of the input gates of an LSTM.

Great idea and seems quite obvious in hindsight.

Is it guaranteed to have the same effect on vanishing gradients though? What if it put weight 1 on a layer that had a tiny gradient?

This is the key piece

Two things stand out to me with this:

This is a big improvement if it can be generalized. They're claiming it's a drop in replacement, so it seems like it can as well.

> Drops compute required for training by ~20%.

This is not true. Authors claim that w.r.t. training, their method adds negigible overhead for AttnRes with no memory impact (but is way more complicated for Block AttnRes since we need to use pipelining for larger models, hence the O(Ld) & O(Nd) figures, with N ≪ L).

> WAY lower bandwidth requirements for inference.

Also not true. Paper has nothing to do with inference, apart from the benchmarks. If you're looking at the graph about "compute advantage," it's about training compute. They do some interpolation to get to the 1.25x number, basically answering the question "if non-AttnRes architecture were trained, how much compute would it take to get to the same loss as AttnRes?" (The answer being ~20% more compute.) It's an interesting claim, but there's all kinds of weird and unexpected convergence that can happen, so take it with a grain of salt.

> 2. WAY lower bandwidth requirements for inference. Means with approaches like this it should run on consumer hardware far better. It apparently requires 1/6th the memory bandwidth of a traditional approach for better results.

That should be the headline right there. Giant side 60 font headline.

Some people have PhDs in burying the lede!

ScholarlyArticle: "Attention Residuals" (2026) https://arxiv.org/abs/2603.15031 :

I don't know why all the posts fail to summarize the results properly.

I had a similar idea at the back of my head but here is a layman explanation:

Standard attention threads the previous layers output to the next layers input. By adding residual connections to each layer, the layers learn an update rule.

There is an obvious limitation here. Only the first layer gets to see the original input and all subsequent layers only get to see the previous layer output.

With attention residuals, the idea is that you have a tiny attention operator that decides between using the original input and any of the previous layer outputs.

Ah - now I understand how this has 2k+ (supposedly legitimate) Github stars in less than a week. Thank you - I was more skeptical

Exactly. I was reading all the other comments and wondering why many looked like they were talking of something else.

We're about to get an onslaught of young Chinese geniuses (raised in China). It's pure statistics

Sadly, same can't be said about India (infrastructure/food security lags China).

> Sadly, same can't be said about India

And quality of leadership.

They (barring a few exceptions) are happy to gloat in imagined past glories of vedic aeroplanes, inter-species head transplants apparent performed in Hindu golden age and loyalty based funding that produces institutions like Galgotia University.

> It's pure statistics

I'm not so sure about that: https://www.populationpyramid.net/china/2026/ suggests peak high school in china was years ago.

I don't think you can blame food security here.

Even if food security holds back 10% of Indians (which would still be a huge tragedy), that would still leave the other 90% for the 'onslaught'. 10% is just a made up number. But even with 50% you'd get an 'onslaught'.

So if we are seeing less than that, it's probably down to other factors.

> Drops compute required for training by ~20%.

> WAY lower bandwidth requirements for inference.

I think what they're getting at is that for a given unit of compute, this method achieves 125% performance.

If model A reaches performance level 100 using 100 units of compute using old methods, and you train model B using AttnRes, aiming at performance level 100, it costs you 80 units of compute.

It probably doesn't map precisely, but that's where people are diverging from the claim - it doesn't explicitly say anything about reduced inference or training time, but that's the implicit value of these sorts of things. Less compute to equivalent performance can be a huge win for platforms at scale as well as for local models.

That should be the headline right there. Giant side 60 font headline.

Some people have PhDs in burying the lede!

Exactly. I was reading all the other comments and wondering why many looked like they were talking of something else.

except it's not true

Ah - now I understand how this has 2k+ (supposedly legitimate) Github stars in less than a week. Thank you - I was more skeptical

> Sadly, same can't be said about India

And quality of leadership.

I don't think you can blame food security here.

So if we are seeing less than that, it's probably down to other factors.

> It's pure statistics

I'm not so sure about that: https://www.populationpyramid.net/china/2026/ suggests peak high school in china was years ago.

Of ≈17 million Chinese students who graduated junior high school after grade 9 in 2024, ≈10 million were admitted to a high school, ≈4 million to a vocational school and the remaining ≈3 million disappear from education statistics, presumably directly entering the workforce. http://www.moe.gov.cn/jyb_sjzl/sjzl_fztjgb/202506/t20250611_...

So at least in theory there's still lots of room to increase high school enrollment, though I doubt this would lead to noticeably more geniuses. The testing system is pretty good at sorting the best students into good schools, I think.

━━━━━━━━━━━━━━━━━━━━━━━━━━━
Attention Residuals
━━━━━━━━━━━━━━━━━━━━━━━━━━━

Paper | arXiv | Overview | Results | Citation

(a) Standard residuals with uniform additive accumulation. (b) Full AttnRes: each layer attends over all previous outputs. (c) Block AttnRes: layers are grouped into blocks, reducing memory from O(Ld) to O(Nd).

This is the official repository for Attention Residuals (AttnRes), a drop-in replacement for standard residual connections in Transformers that enables each layer to selectively aggregate earlier representations via learned, input-dependent attention over depth.

Overview

Standard residual connections accumulate all layer outputs with fixed unit weights. As depth grows, this uniform aggregation dilutes each layer's contribution and causes hidden-state magnitudes to grow unboundedly — a well-known problem with PreNorm.

AttnRes replaces this fixed accumulation with softmax attention over preceding layer outputs:

$$\mathbf{h}l = \sum{i=0}^{l-1} \alpha_{i \to l} \cdot \mathbf{v}_i$$

where the weights $\alpha_{i \to l}$ are computed via a single learned pseudo-query $\mathbf{w}_l \in \mathbb{R}^d$ per layer. This gives every layer selective, content-aware access to all earlier representations.

Block AttnRes

Full AttnRes is straightforward but requires O(Ld) memory at scale. Block AttnRes partitions layers into N blocks, accumulates within each block via standard residuals, and applies attention only over block-level representations. With ~8 blocks, it recovers most of Full AttnRes's gains while serving as a practical drop-in replacement with marginal overhead.

PyTorch-style pseudocode

def block_attn_res(blocks: list[Tensor], partial_block: Tensor, proj: Linear, norm: RMSNorm) -> Tensor:
    """
    Inter-block attention: attend over block reps + partial sum.
    blocks:
        N tensors of shape [B, T, D]: completed block representations for each previous block
    partial_block:
        [B, T, D]:    intra-block partial sum (b_n^i)
    """
    V = torch.stack(blocks + [partial_block])  # [N+1, B, T, D]
    K = norm(V)
    logits = torch.einsum('d, n b t d -> n b t', proj.weight.squeeze(), K)
    h = torch.einsum('n b t, n b t d -> b t d', logits.softmax(0), V)
    return h

def forward(self, blocks: list[Tensor], hidden_states: Tensor) -> tuple[list[Tensor], Tensor]:
    partial_block = hidden_states
    # apply block attnres before attn
    # blocks already include token embedding
    h = block_attn_res(blocks, partial_block, self.attn_res_proj, self.attn_res_norm)

    # if reaches block boundary, start new block
    # block_size counts ATTN + MLP; each transformer layer has 2
    if self.layer_number % (self.block_size // 2) == 0:
        blocks.append(partial_block)
        partial_block = None

    # self-attention layer
    attn_out = self.attn(self.attn_norm(h))
    partial_block = partial_block + attn_out if partial_block is not None else attn_out

    # apply block attnres before MLP
    h = block_attn_res(blocks, partial_block, self.mlp_res_proj, self.mlp_res_norm)

    # MLP layer
    mlp_out = self.mlp(self.mlp_norm(h))
    partial_block = partial_block + mlp_out

    return blocks, partial_block

Results

Scaling Laws

AttnRes consistently outperforms the baseline across all compute budgets. Block AttnRes matches the loss of a baseline trained with 1.25x more compute.

Downstream Performance (Kimi Linear 48B / 3B activated, 1.4T tokens)

Category	Benchmark	Baseline	AttnRes
General	MMLU	73.5	74.6
	GPQA-Diamond	36.9	44.4
	BBH	76.3	78.0
	TriviaQA	69.9	71.8
Math & Code	Math	53.5	57.1
	HumanEval	59.1	62.2
	MBPP	72.0	73.9
Chinese	CMMLU	82.0	82.9
	C-Eval	79.6	82.5

AttnRes improves across the board, with the largest gains on multi-step reasoning (+7.5 on GPQA-Diamond) and code generation (+3.1 on HumanEval).

Training Dynamics

AttnRes mitigates PreNorm dilution: output magnitudes remain bounded across depth and gradient norms distribute more uniformly across layers.

Citation

If you found our work useful, please cite

@misc{chen2026attnres,
  title         = {Attention Residuals},
  author        = {Kimi Team  and Chen, Guangyu  and Zhang, Yu  and Su, Jianlin  and Xu, Weixin  and Pan, Siyuan  and Wang, Yaoyu  and Wang, Yucheng  and Chen, Guanduo  and Yin, Bohong  and Chen, Yutian  and Yan, Junjie  and Wei, Ming  and Zhang, Y.  and Meng, Fanqing  and Hong, Chao  and Xie, Xiaotong  and Liu, Shaowei  and Lu, Enzhe  and Tai, Yunpeng  and Chen, Yanru  and Men, Xin  and Guo, Haiqing  and Charles, Y.  and Lu, Haoyu  and Sui, Lin  and Zhu, Jinguo  and Zhou, Zaida  and He, Weiran  and Huang, Weixiao  and Xu, Xinran  and Wang, Yuzhi  and Lai, Guokun  and Du, Yulun  and Wu, Yuxin  and Yang, Zhilin  and Zhou, Xinyu},
  year          = {2026},
  archiveprefix = {arXiv},
  eprint        = {2603.15031},
  primaryclass  = {cs.CL}
}

"the western credo is to get someone else to do the work"

This. One of the things that most shocked me when I moved to London was how bad English people were at hard skills, but also how easily giving orders and "projecting gravitas" came to them. Everyone wants to be a "leader", which sadly has become code for reaping benefits of other people's work.

I think what they're getting at is that for a given unit of compute, this method achieves 125% performance.

If model A reaches performance level 100 using 100 units of compute using old methods, and you train model B using AttnRes, aiming at performance level 100, it costs you 80 units of compute.

> I think what they're getting at is that for a given unit of compute, this method achieves 125% performance.

This is not what they're getting at; I explained exactly what they're getting at. I mean, your equivalence of "loss" (what authors actually measured) and "performance" is just bizarre. We use benchmarks to measure performance, and the numbers there were like 1-5% better (apart from the GPQA-Diamond outlier).

Do people even read these papers?

except it's not true

It's not not true, it's just that things are getting lost in the excitement. There are some specific cases where there's a big boost, it's just not exactly what people are hoping.

>>>The "1/6th" specifically appears in community comparisons to DeepSeek's mHC (multi-lane highway connections, a prior technique for better depth-wise information flow in deep models). Several Chinese-language sources and downstream discussions (e.g., translated articles, YouTube breakdowns, and blogs like houdao.com) state that Block AttnRes achieves comparable (or better) performance to mHC while using only one-sixth of the data read/write volume (or memory bandwidth pressure) during inference/engineering deployment.

There are specific cases where that speedup does occur; it's not going to translate exactly into local models or other architectures or hardware.

Unfortunately it's hard to take China's population / enrollment demographics at face value. There's many incentives in the system to overstate growth, and cross checks between different reports that _should_ be correlated suggest they're quite overstated.

It's bad enough they passed some legislation a few years ago[1], but the damage has in many senses already been done. And it's unclear how effective the changes will be. So it's entirely possible those 3 million missing high schoolers never existed.

[1]: https://www.reuters.com/world/china/chinas-top-legislative-b...

"the western credo is to get someone else to do the work"

> I think what they're getting at is that for a given unit of compute, this method achieves 125% performance.

Do people even read these papers?

It's not not true, it's just that things are getting lost in the excitement. There are some specific cases where there's a big boost, it's just not exactly what people are hoping.

There are specific cases where that speedup does occur; it's not going to translate exactly into local models or other architectures or hardware.

No. It seems to me that the comment is objectively incorrect. The original comment was talking about inference and from what I can tell, it is strictly going to run slower than the model trained to the same loss without this approach (it has "minimal overhead"). The main point is that you wont need to train that model for as long.

[1]: https://www.reuters.com/world/china/chinas-top-legislative-b...

The figures for students graduating primary school (18.57 million) and entering junior high school (18.49 million) match up quite well, though. Do you think primary schools and junior high schools manage to coordinate massive student number inflation to the tune of 3 million non-existent students, but then at the transition to senior high schools that suddenly breaks down? If anything, I'd expect it to break down when those non-existent students are supposed to take the Zhongkao exam in order to graduate, not at the senior high school admissions stage.

Some statistics reported in China are unreliable because the person doing the reporting also has their performance evaluated by the numbers they report and there are few external checks on validity, but I don't think that's the case for student numbers in particular.

Also, it seems like you're the same 'jldugger who cited Chinese population statistics upthread, but when somebody else does it, they're suddenly unreliable???

At this point i’ve witnessed over 30years of “stats about China aren’t real” type posts while they continue to demonstrate impressive economic and social results that i’m far more inclined to believe the potentially flawed Chinese data than posts that basically claim all data out of China is fake.

Also, it seems like you're the same 'jldugger who cited Chinese population statistics upthread, but when somebody else does it, they're suddenly unreliable???

Isolated demands for rigor, really. China does have a lot of incentives to publish misleading statistics. Also, so does everyone else. In most places we bake skepticism of official lines from government and industry alike into our epistemic weights and move on, but when China does it we're supposed to treat it as a big deal. Propaganda at its finest

Hacker Times

Hacker Times

Discussion

Discussion

━━━━━━━━━━━━━━━━━━━━━━━━━━━
Attention Residuals
━━━━━━━━━━━━━━━━━━━━━━━━━━━

Overview

Block AttnRes

Results

Scaling Laws

Downstream Performance (Kimi Linear 48B / 3B activated, 1.4T tokens)

Training Dynamics

Citation

Hacker Times

Hacker Times

Attention Residuals

Discussion

Discussion

━━━━━━━━━━━━━━━━━━━━━━━━━━━ Attention Residuals ━━━━━━━━━━━━━━━━━━━━━━━━━━━

Overview

Block AttnRes

Results

Scaling Laws

Downstream Performance (Kimi Linear 48B / 3B activated, 1.4T tokens)

Training Dynamics

Citation

━━━━━━━━━━━━━━━━━━━━━━━━━━━
Attention Residuals
━━━━━━━━━━━━━━━━━━━━━━━━━━━