The residual connection is one of the least questioned parts of modern transformer design. In the prenorm architecture, each layer adds its output back to the running hidden state, which keeps optimization stable and allows deeper models to be trained. Moonshot AI researchers argue that this standard mechanism also introduces a structural problem: all prior layer outputs are accumulated with fixed unit weights, so the magnitude of the hidden state grows with depth and the contribution of any single layer is progressively diluted.
The research team proposes Attention Residual (AttnRes) as a drop-in replacement for standard residual connections. Instead of forcing each layer to consume the residual stream as a uniformly mixed sum, AttnRes lets each layer aggregate earlier representations using softmax attention over depth. The input to layer l is a weighted sum of the token embedding and all previous layer outputs, where the weights are computed over depth positions rather than sequence positions. The basic idea is simple: if attention improved sequence modeling by replacing fixed recurrence over time, the same idea can be applied to the depth dimension of the network.

Why do standard residuals become a hindrance?
The research team identifies three issues with standard residual accumulation. First, there is no selective access: all layers receive the same aggregated state, even though attention layers and feed-forward or MoE layers may benefit from different mixes of prior information. Second, mixing is irreversible: once information is folded into a single residual stream, subsequent layers cannot selectively recover specific prior representations. Third, there is output growth: to remain effective against an ever-growing residual state, deeper layers must produce larger outputs, which can destabilize training.
This is the research team's central framing: the standard residual stream behaves like a fixed, uniform aggregation over all prior layers. AttnRes replaces that fixed aggregation with explicit attention over previous layer outputs.
Full AttnRes: attending over all previous layers
In full AttnRes, each layer computes attention weights over all preceding depth sources. The default design does not use input-conditioned queries. Instead, each layer has a learned layer-specific pseudo-query vector w_l ∈ R^d, while the keys and values come from the token embedding and previous layer outputs after RMSNorm. The RMSNorm step is important because it prevents large-magnitude layer outputs from dominating the depth-wise attention weights.
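A minimal per-token numpy sketch of this depth-wise attention follows. The function names (`rmsnorm`, `attnres_input`), the gain-free RMSNorm, and applying the same normalization to both keys and values are illustrative assumptions based on the description above, not the paper's implementation:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # Gain-free RMSNorm: x / sqrt(mean(x^2) + eps), applied per vector.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attnres_input(sources, w_l):
    """Input to layer l under full AttnRes, for a single token.

    sources: list of (d,) vectors -- the token embedding plus the outputs
             of the l preceding layers (the depth-wise "sequence").
    w_l:     (d,) learned, layer-specific pseudo-query (not input-conditioned).
    """
    S = np.stack(sources)        # (l+1, d)
    K = rmsnorm(S)               # keys after RMSNorm, so large-magnitude
    V = rmsnorm(S)               # deep-layer outputs cannot dominate
    logits = K @ w_l             # (l+1,) depth-wise attention scores
    alpha = softmax(logits)      # weights over depth positions
    return alpha @ V             # mixed input for layer l

rng = np.random.default_rng(0)
d = 8
sources = [rng.normal(size=d) for _ in range(4)]  # embedding + 3 layer outputs
x_l = attnres_input(sources, rng.normal(size=d))
```

Note that the attention runs over at most L + 1 depth positions, not over the token sequence, so the extra cost per token is small compared to sequence attention.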
Full AttnRes is straightforward, but it increases cost. Per token, it requires O(L²d) arithmetic and O(Ld) memory to store the layer outputs. In standard training this memory largely overlaps with the activations already needed for backpropagation, but under activation recomputation and pipeline parallelism the overhead becomes more significant, because earlier layer outputs must remain available and may need to be propagated across stages.
Block AttnRes: a practical version for larger models
To make the method usable at larger scale, the Moonshot AI research team introduces Block AttnRes. Instead of attending over each previous layer's output, the model partitions its layers into N blocks. Within each block, layer outputs are accumulated into a single block representation, and attention is applied only to those block-level representations and the token embedding. This reduces the depth-wise memory and communication overhead from O(Ld) to O(Nd).
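The block-level state reduction can be sketched as follows. Treating "accumulated" as a simple sum within each block is an assumption for illustration; `block_sources` is a hypothetical helper, not the paper's code:

```python
import numpy as np

def block_sources(embedding, layer_outputs, n_blocks):
    """Collapse L per-layer outputs into n_blocks block representations.

    Each block's outputs are accumulated into one vector, so a later layer
    attends over [embedding, block_1, ..., block_N]: O(N*d) depth-wise
    state per token instead of the O(L*d) state full AttnRes keeps.
    """
    L = len(layer_outputs)
    blocks = []
    for b in range(n_blocks):
        lo, hi = b * L // n_blocks, (b + 1) * L // n_blocks
        blocks.append(np.sum(layer_outputs[lo:hi], axis=0))
    return [embedding] + blocks

rng = np.random.default_rng(1)
d, L, N = 8, 16, 4
emb = rng.normal(size=d)
outs = [rng.normal(size=d) for _ in range(L)]
srcs = block_sources(emb, outs, N)   # 1 + N attention sources, not 1 + L
```

Since only the N block vectors (plus the embedding) must survive across pipeline stages, this is what keeps the communication overhead bounded in distributed training.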
The research team describes a cache-based pipelined communication and two-phase computation strategy that makes Block AttnRes practical in distributed training and inference. The reported result is less than 4% training overhead under pipeline parallelism and less than 2% inference latency overhead on representative workloads.
Scaling results
The research team evaluates five model sizes and compares three variants at each size: a prenorm baseline, full AttnRes, and block AttnRes with approximately eight blocks. All variants in each size group share the same hyperparameters chosen under the baseline, which the research team notes makes the comparison conservative. The fitted scaling laws are reported as:
- Baseline: L = 1.891 × C^(-0.057)
- Block AttnRes: L = 1.870 × C^(-0.058)
- Full AttnRes: L = 1.865 × C^(-0.057)
The practical implication is that full AttnRes achieves lower validation loss across the tested compute range, and Block AttnRes approximately matches the loss of a baseline trained with about 1.25× more compute.
Integration in Kimi Linear
Moonshot AI also integrates AttnRes into Kimi Linear, an MoE architecture with 48B total parameters and 3B active parameters, and pre-trains it on 1.4T tokens. According to the research paper, AttnRes reduces prenorm dilution by keeping output magnitudes more bounded across depth and distributing gradients more evenly across layers. One implementation detail: all pseudo-query vectors are initialized to zero so that the initial attention weights are equal across source layers, effectively reducing AttnRes to an equal-weight average at the beginning of training and avoiding early instability.
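The zero-initialization detail is easy to verify directly: a zero pseudo-query makes every depth-wise logit zero, whatever the keys are, so the softmax degenerates to a uniform average. A quick self-contained check (the key values here are arbitrary random data):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# With the pseudo-query w_l initialized to zero, every depth-wise logit
# (key @ w_l) is zero regardless of the keys, so the softmax assigns equal
# weight 1/(l+1) to each source: AttnRes starts as an equal-weight average.
d, n_sources = 8, 5
keys = np.random.default_rng(2).normal(size=(n_sources, d))
weights = softmax(keys @ np.zeros(d))
```

As training proceeds, each layer's pseudo-query moves away from zero and the mixture becomes layer-specific, which is the intended behavior.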
On downstream evaluation, the reported gains are consistent across all listed tasks: 73.5 to 74.6 on MMLU, 36.9 to 44.4 on GPQA-Diamond, 76.3 to 78.0 on BBH, 53.5 to 57.1 on MATH, 59.1 to 62.2 on HumanEval, 72.0 to 73.9 on MBPP, 82.0 to 82.9 on CMMLU, and 79.6 to 82.5 on C-Eval.
Key takeaways
- Attention Residual replaces fixed residual accumulation with softmax attention over previous layer outputs.
- The default AttnRes design uses a learned layer-specific pseudo-query, not an input-conditioned query.
- Block AttnRes makes the method practical by reducing depth-wise memory and communication from O(Ld) to O(Nd).
- The Moonshot research team reports lower scaling-law loss compared to the prenorm baseline, with Block AttnRes matching a baseline trained with approximately 1.25× more compute.
- In Kimi Linear, AttnRes improves results on knowledge, reasoning, math, and coding benchmarks with limited overhead.