Live from GTC: Jensen Huang Points NVIDIA at Token Economics — Vera Rubin, Groq LPUs and a New Inference Playbook
Key announcements
NVIDIA CEO Jensen Huang used the 2026 GTC keynote to reposition the company from chip vendor to systems provider, and to tie future customer economics explicitly to NVIDIA's hardware roadmap. What changed? He unveiled the Vera Rubin rack: a seven‑chip, end‑to‑end system that pairs a large Rubin GPU with a new class of Groq 3 LPUs to split ("disaggregate") inference into high‑throughput prefill/attention and ultra‑low‑latency token generation. NVIDIA is reported to have paid roughly $20 billion for Groq's technology and team; during the event, Satya Nadella reportedly confirmed that Groq 3 LPUs are already in production and are being integrated into the NVL72 systems running in Microsoft Azure.
Architecture and software
The Rubin GPU is a TSMC 3nm, dual‑chip package with 336 billion transistors, 288GB of HBM4, and NVFP4 inference performance pegged at roughly 50 PFLOPS; NVIDIA said Rubin plus Groq LPUs multiply end‑to‑end efficiency for agentic and high‑interaction models. Groq's LPU is a deterministic, SRAM‑only data‑flow processor optimized for decoding and token generation; a single LPU carries only a small amount of on‑chip memory, but with very high bandwidth to it. NVIDIA's Dynamo software orchestrates the split: Rubin handles context‑heavy attention and memory‑bound work, while Groq handles feed‑forward decode at hundreds of tokens per second. The result, Huang argued, is a system that can extend inference performance into high‑interaction "Ultra" tiers that prior architectures could not economically support. The rack is fully liquid‑cooled with 45°C hot‑water reuse and an upgraded NVLink and CPO optical switching fabric.
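The disaggregation described above can be sketched in a few lines. This is a purely illustrative model, not the Dynamo API: the class names, pool functions, and KV‑cache hand‑off below are assumptions made to show the shape of the flow, with the memory‑bound prefill stage and the latency‑bound decode stage running on separate device pools.

```python
# Hypothetical sketch of disaggregated inference routing as described in the
# keynote: prefill/attention on a large-memory GPU pool, per-token decode on
# low-latency LPUs. All names here are illustrative, not a real NVIDIA API.
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int    # context to prefill (memory/bandwidth bound)
    max_new_tokens: int   # tokens to decode (latency bound)
    kv_cache: bytes = b""  # KV cache handed off between the two pools


def prefill_on_gpu(req: Request) -> Request:
    # Rubin-style pool: batch-friendly, HBM-backed attention over the prompt.
    # The bytes object stands in for the real KV cache produced by prefill.
    req.kv_cache = b"\x00" * req.prompt_tokens
    return req


def decode_on_lpu(req: Request) -> list:
    # LPU-style pool: deterministic SRAM data flow, one token per step.
    # hash() stands in for the model's next-token computation.
    return [hash((req.prompt_tokens, i)) % 50_000 for i in range(req.max_new_tokens)]


def serve(req: Request) -> list:
    # The orchestrator moves the KV cache from the prefill pool to the decode
    # pool rather than keeping both phases on one device.
    return decode_on_lpu(prefill_on_gpu(req))


tokens = serve(Request(prompt_tokens=4096, max_new_tokens=8))
print(len(tokens))  # 8 generated token ids
```

The design point is the hand‑off: once prefill has produced the KV cache, decode never touches HBM‑class memory again, which is what lets a small‑SRAM, high‑bandwidth part own the latency‑critical phase.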
The commercial pitch — token economics and pricing tiers
Huang framed data centers as "token factories": throughput (tokens/MW) and per‑user interaction speed (tokens/user/s) pull against each other, and only tightly integrated hardware‑software stacks move customers rightward on his performance‑economics curve. He sketched a five‑tier token pricing ladder, from free models to a proposed $150 per million tokens "Ultra" tier, and argued that each higher tier becomes commercially viable only with the next generation of NVIDIA hardware. It has been reported that NVIDIA estimates hundreds of billions of dollars in incremental addressable revenue as Rubin and Groq scale; the company presented example 1GW data‑center revenue curves showing step‑wise uplifts with each new architecture.
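The "token factory" framing reduces to simple arithmetic: revenue is tokens produced per year times price per token, with throughput constrained by power. The sketch below makes that explicit; every number in it (throughputs, prices, utilization) is an assumed placeholder, not a figure from the keynote, apart from the proposed $150 per million tokens Ultra price.

```python
# Back-of-envelope token economics for a 1 GW "token factory".
# All throughput and utilization figures are illustrative assumptions.
def annual_revenue_usd(power_gw: float,
                       tokens_per_sec_per_mw: float,
                       price_per_million_tokens: float,
                       utilization: float = 0.6) -> float:
    """Revenue = tokens produced per year / 1e6 * price per million tokens."""
    mw = power_gw * 1000
    seconds_per_year = 365 * 24 * 3600
    tokens_per_year = tokens_per_sec_per_mw * mw * utilization * seconds_per_year
    return tokens_per_year / 1e6 * price_per_million_tokens

# Hypothetical comparison: a cheap high-throughput tier vs a slower,
# interaction-heavy "Ultra" tier at the proposed $150/M-token price.
base = annual_revenue_usd(1.0, tokens_per_sec_per_mw=2e6,
                          price_per_million_tokens=2)
ultra = annual_revenue_usd(1.0, tokens_per_sec_per_mw=2e5,
                           price_per_million_tokens=150)
print(f"base tier:  ${base / 1e9:.1f}B/yr")
print(f"ultra tier: ${ultra / 1e9:.1f}B/yr")
```

Under these made‑up numbers, a tenfold drop in tokens/MW is more than offset by a 75× higher price, which is exactly the trade Huang's pricing ladder asks buyers to believe in: the premium tier only pays if the hardware can hit its interactivity target at acceptable throughput.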
Geopolitics, risk and who benefits
This technical and commercial strategy lands against a background of U.S. export controls and broader geopolitics shaping access to advanced AI accelerators. High‑end GPUs are geopolitical assets; supply policies and semiconductor export rules will determine which customers worldwide can run Rubin/Groq systems at all. For cloud providers and enterprises in permissive markets the message was clear: buy into NVIDIA's stack or risk being unable to economically reach the next price‑performance tier. For everyone else, the disaggregated inference model raises design and software questions: can rivals replicate the software glue and system economics that Huang says will define the next phase of AI monetization? It was a systems‑level keynote, and the real product, as Huang framed it, may be a new way to sell compute as a driver of revenue rather than just a cost.
