Huxiu 2026-03-26

Google blog sparks a memory-stock rout — but the panic looks misplaced

What happened

A blog post from Google Research about an algorithm called TurboQuant touched off a dramatic one-day sell-off in memory stocks, wiping several percentage points off chipmakers from Seoul to New York. SK Hynix and Samsung fell sharply on the KOSPI, while U.S. names including Micron and SanDisk slid in after-hours trading. The post reportedly claimed that TurboQuant can cut the KV-cache storage needs of model inference by up to sixfold, a soundbite that quickly metastasized into headlines declaring AI memory demand "peaked."

What TurboQuant actually targets

TurboQuant is a vector-quantization technique applied to the KV cache used during model inference: the transient key-value tensors that grow with context length and live in GPU memory. The paper (reportedly uploaded to arXiv on April 28, 2025, as arXiv:2504.19874) combines a random rotation with precomputed scalar quantizers and a 1-bit Johnson-Lindenstrauss residual step to push the KV cache down to roughly 3.5 bits per value with near-lossless attention results; the authors report speedups on certain accelerators, and the work was accepted to ICLR 2026. Those are meaningful academic gains. They are not, however, a turnkey replacement for the system designs that underpin global DRAM and HBM markets.
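The core idea of rotate-then-quantize can be sketched in a few lines. The toy below is not the paper's implementation; it uses a dense QR-based rotation and a plain uniform scalar quantizer (the paper reportedly uses structured rotations and a residual step), purely to illustrate why rotating first helps spread energy evenly across dimensions before low-bit quantization:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR decomposition. Illustrative only;
    # fast structured rotations are typically used in practice.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def quantize_uniform(x, bits):
    # Uniform scalar quantizer over the observed range of x.
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale)  # integer codes, what you'd store
    return codes * scale + lo           # dequantized reconstruction

# Toy "KV cache": 128 key vectors of dimension 64 (hypothetical sizes).
kv = rng.standard_normal((128, 64))

# Rotate, quantize to 4 bits, rotate back.
R = random_rotation(64)
kv_hat = quantize_uniform(kv @ R, bits=4) @ R.T

rel_err = np.linalg.norm(kv - kv_hat) / np.linalg.norm(kv)
print(f"relative reconstruction error at 4 bits: {rel_err:.3f}")
```

Going from 16-bit to 4-bit storage here is already a 4x saving; the paper's reported ~3.5-bit figure would land near the claimed sixfold reduction, at the cost of the extra residual machinery this sketch omits.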

Why the market reaction looks like a blunder

TurboQuant addresses only one of three major memory demands in modern AI: the inference-time KV cache, not model weights or training activations and gradients. No official Google code reportedly accompanies the post; community re-implementations exist, but major inference stacks (vLLM, TensorRT-LLM, Ollama, etc.) had not shipped production integrations at the time of the sell-off, and the technique has seen limited testing on the very largest models, MoE configurations, or million-token contexts. More fundamentally, AI hardware economics hinge on bandwidth as much as capacity: HBM's value lies in how fast it can feed compute, not just how many bytes it stores. And if TurboQuant shrinks the per-request memory footprint, providers may simply expand context windows and concurrency; Jevons paradox suggests that cheaper memory per token tends to raise total memory consumed.
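A back-of-envelope calculation shows how partial a KV-cache-only saving is. All figures below are illustrative assumptions (a hypothetical 70B dense model with grouped-query attention), not numbers from the Google post:

```python
# Hypothetical 70B-parameter dense model served in 16-bit precision.
params = 70e9
weight_bytes = params * 2  # FP16/BF16 weights, untouched by KV-cache compression

# KV cache per token ≈ 2 (K and V) * layers * kv_heads * head_dim * bytes.
layers, kv_heads, head_dim, bytes_per = 80, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per

context, batch = 128_000, 8  # long context, modest concurrency
kv_total = kv_per_token * context * batch

def gb(x):
    return x / 1e9

print(f"weights:          {gb(weight_bytes):.0f} GB")
print(f"KV cache at FP16: {gb(kv_total):.0f} GB")
print(f"KV at ~3.5 bits:  {gb(kv_total * 3.5 / 16):.0f} GB")
```

Under these assumptions the weights alone still occupy about 140 GB regardless of any KV-cache compression, and an operator who pockets the KV saving by raising context length or batch size ends up buying just as much memory as before.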

The bigger picture

Markets were pricing a narrative — that a single algorithmic advance would materially shrink multi‑billion‑dollar capex bets at memory suppliers — rather than grappling with the system‑level caveats and deployment work required to realize those gains. Geopolitics also matters: memory chips sit at the heart of U.S.–China technology competition and trade policy (including export controls) that shape supply, demand and investment decisions. TurboQuant is an important paper in compression and inference efficiency, but the stock market’s knee‑jerk verdict looks premature: adoption, integration, and wider architectural constraints will decide whether this becomes an industry‑reshaping tool or an impressive academic result with limited near‑term market impact.
