Elon Musk Praises Chinese AI Team’s Paper That Replaces a Transformer Mainstay
Surprise endorsement and context
Elon Musk publicly called a new technical report from the Kimi team (Moonshot AI, 月之暗面) "impressive" after a prominent tech blogger dissected the paper, a rare moment of praise from Musk, who has been openly critical of several Western AI rivals. It has been reported that Musk's own xAI has recently undergone a major reorganization and that its flagship model Grok has struggled, which makes his decision to single out a paper from a Chinese research group all the more notable. The brief reply highlights how technical breakthroughs can cut across the current geopolitical rivalry over AI and semiconductors between China and the West.
What the paper proposes
Kimi argues that the ubiquitous residual connection, inherited from ResNet and adopted by Transformers, acts like "linear attention across depth" and limits a model's ability to selectively use information from different layers. Their proposed method, Full Attention Residuals, replaces the simple equal-weight residual sum with a learned attention mechanism: each layer learns a query that weights all previous layers' outputs via a softmax, so different tokens can draw selectively from different depths. Because storing every layer's output would inflate memory and pipeline communication, Kimi also introduces Block Attention Residuals: partition the layers into a small number of blocks, sum outputs within each block, and attend only across blocks, cutting memory from O(Ld) to O(Nd) for L layers of hidden size d, with N blocks (N ≈ 8 in their experiments).
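The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea as reported, not Kimi's implementation: the per-layer queries and keys, the stand-in layer (`np.tanh`), and all dimensions here are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_residual(history, query, keys):
    """Softmax-weighted sum over stored outputs along depth.

    history: (M, d) stacked outputs (all layers so far, or N block sums)
    query:   (k,)   learned query for the current layer
    keys:    (M, k) one learned key per stored output
    Replaces the plain residual sum h_0 + ... + h_M with a weighted
    sum, so each layer attends selectively over depth.
    """
    weights = softmax(keys @ query)   # attention distribution over depth
    return weights @ history          # (d,) combined residual stream

# Toy forward pass, full variant: every layer output is kept,
# so memory grows as O(L*d).
rng = np.random.default_rng(0)
L, d, k = 6, 16, 8                          # layers, hidden size, query size
queries = rng.normal(size=(L, k))           # one learned query per layer
keys = rng.normal(size=(L + 1, k))          # one learned key per stored output
history = [rng.normal(size=d)]              # embedding output h_0
for layer in range(L):
    resid = attention_residual(np.stack(history),
                               queries[layer], keys[: layer + 1])
    history.append(np.tanh(resid))          # np.tanh stands in for a real layer

# Block variant: only N running block sums are kept instead of all
# L outputs, so memory drops to O(N*d); attention is computed over them
# exactly as above.
N = 3
block_sums = rng.normal(size=(N, d))        # running per-block sums (illustrative)
block_keys = rng.normal(size=(N, k))
block_resid = attention_residual(block_sums, queries[0], block_keys)
```

The same `attention_residual` function serves both variants; the block version only changes what is stored between layers, which is where the claimed memory saving comes from.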
Results and significance
Kimi reports that Block Attention Residuals yield consistent gains across scaling experiments, roughly equivalent to training the baseline with 1.25× the compute, and that a 48-billion-parameter model pretrained on over a trillion tokens outperformed its baseline on science Q&A, math reasoning, code generation, and general-knowledge tests. They also report more stable layer activations and more uniform gradient distributions, addressing a long-standing "depth dilution" problem in deep Transformers. These claims matter because they touch the foundational architecture of large language models; if they hold up, the technique could be adopted broadly, improving efficiency and stability in both Western and Chinese models. The reported inference latency penalty is small (under 2%), making the idea practically interesting for large-scale deployments even amid rising U.S.-China export controls and supply-chain scrutiny.
