凤凰科技 2026-03-26

Google Research's TurboQuant threatens GPU memory market — software, not silicon, sparks sell-off

A software shock to a hardware market

Google Research has unveiled TurboQuant, a compression technique reported to cut the GPU "KV cache" memory footprint roughly sixfold while preserving model accuracy, and markets took notice. Why did a software paper send storage-related stocks tumbling? Because if AI systems can run large-context inference with dramatically less high-bandwidth memory, the economics of buying expensive HBM and flash for AI inference change overnight.

How the method works (in plain terms)

TurboQuant reportedly uses a two-stage mathematical trick. First, PolarQuant maps token vectors into polar coordinates so that the angular components become highly predictable, removing the need for per-block normalization metadata. A second stage, dubbed QJL, projects the residuals into a low-dimensional space and encodes them as binary signs in a way that preserves attention statistics. The claimed payoff is lossless, or near-lossless, compression of long-context KV caches with no model retraining required. It is the kind of neat engineering that sounds like fiction (think Pied Piper, rendered as 魔笛手 in Chinese coverage, from HBO's Silicon Valley). Google is reported to be presenting the work at ICLR 2026 and AISTATS 2026, and early benchmark reports cite perfect recall on some long-context tests for Llama-3.1-8B and Mistral-7B, along with major speedups on Nvidia H100 runs.
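The two-stage idea can be illustrated with a toy sketch. The code below is not the paper's algorithm; it is a minimal stand-in, assuming a PolarQuant-like stage that quantizes angles (which live in the fixed range [-π, π], so no per-block scale metadata is needed) and a QJL-like stage reduced here to keeping only the signs of the residual plus a single scalar scale:

```python
import numpy as np

def polar_quantize(v, angle_bits=4):
    """Stage 1 (illustrative, PolarQuant-like): pair up dimensions,
    convert each (x, y) pair to polar form, and uniformly quantize
    the angle over its fixed range [-pi, pi]."""
    pairs = v.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)             # magnitudes
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])  # angles in [-pi, pi]
    levels = 2 ** angle_bits
    q = np.round((theta + np.pi) / (2 * np.pi) * (levels - 1)).astype(np.int32)
    theta_hat = q / (levels - 1) * 2 * np.pi - np.pi  # dequantized angles
    return r, q, theta_hat

def sign_residual_correct(v, r, theta_hat):
    """Stage 2 (illustrative, QJL-like): encode the reconstruction
    residual with one sign bit per dimension plus one scalar scale."""
    recon = np.stack([r * np.cos(theta_hat),
                      r * np.sin(theta_hat)], axis=1).reshape(-1)
    residual = v - recon
    scale = np.abs(residual).mean()   # single scalar of metadata
    return recon + np.sign(residual) * scale

rng = np.random.default_rng(0)
v = rng.standard_normal(128)
r, q, theta_hat = polar_quantize(v)
v_hat = sign_residual_correct(v, r, theta_hat)
err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
print(f"relative reconstruction error: {err:.3f}")
```

The point of the sketch is the structure, not the numbers: because angles are bounded by construction, the quantizer needs no learned or per-block range, and the residual stage stores only signs, which is where the bulk of the claimed compression would come from.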

Benchmarks, caveats and provenance

Developers have already ported implementations to other runtimes: an Apple Silicon MLX port reportedly ran Qwen3.5-35B across contexts up to 64,000 tokens with claimed accuracy parity at several quantization levels, and Nvidia H100 tests reportedly show up to an 8× speed-up over uncompressed 32-bit attention in some setups. Important caveats remain: the work is still experimental, production integration across architectures is nontrivial, and some observers note that parts of the research were publicly previewed earlier, so the "newness" debate is active. Engineers will still need to validate robustness, cross-vendor compatibility, and edge-case behavior before broad deployment.
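To see why a sixfold KV-cache reduction matters at these context lengths, consider some back-of-envelope arithmetic (our own, not from the paper), using Llama-3.1-8B's published configuration of 32 layers, 8 key-value heads, and head dimension 128:

```python
# KV-cache sizing estimate for Llama-3.1-8B (public config values).
layers, kv_heads, head_dim = 32, 8, 128
seq_len = 64_000      # the long-context setting cited in the MLX report
bytes_fp16 = 2

# K and V each store one head_dim vector per token, per layer, per KV head.
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16
print(f"fp16 KV cache:  {kv_bytes / 2**30:.2f} GiB")
print(f"~6x compressed: {kv_bytes / 6 / 2**30:.2f} GiB")
```

At 64,000 tokens the uncompressed fp16 cache is roughly 7.8 GiB, a large share of a consumer GPU's memory; a sixfold reduction brings it near 1.3 GiB, which is the kind of shift that changes how much HBM a deployment needs to buy.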

Industry and geopolitical ripple effects

The immediate market reaction was telling: U.S. storage and memory-related equities, including names reported in Chinese media such as SanDisk (闪迪) and Micron (美光科技), showed intraday weakness as investors priced in lower long-term memory demand. That feeds into a broader geopolitical backdrop: Western export controls and supply-chain tensions have made advanced memory and accelerators a strategic asset in U.S.–China tech rivalry. If a software layer can materially reduce hardware demand, it could blunt some pressure points — or, paradoxically, lower costs and spark even more AI usage (the Jevons paradox). For now, TurboQuant is a technical provocation with large economic implications — but many engineering and deployment hurdles remain before this algorithm reshapes AI infrastructure at scale.
