Small-but-mighty Gemma 4 builds run natively on phones — and users say the speed is “like magic”
What happened
A family of models reportedly sharing the same architecture as Google's Gemini 3, circulating under names like Gemma 4 along with smaller variants E2B (≈2.3B effective parameters) and E4B (≈4.5B), can run natively on modern smartphones with startling performance. The small models reportedly support full multimodal input, claim a 128K-token context window in some builds, and placed highly on community leaderboards such as Arena AI. Users have posted videos of local image and audio processing, and even hardware-control demos (toggling the flashlight), on iPhones; those posts drew hundreds of thousands of views.
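On a phone, the claimed 128K context is constrained less by compute than by KV-cache memory. A rough sketch of that footprint, where the 128K figure comes from the report but the layer count, KV-head count, and head dimension are purely hypothetical placeholders, not a published config:

```python
# Rough KV-cache memory estimate for a long context window.
# Only the 128K context length comes from the article; the model
# shape below is a hypothetical placeholder, not a real Gemma 4 config.

def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Memory for the K and V tensors across all layers, in GB."""
    # 2 tensors (keys and values) per layer; fp16 = 2 bytes per value
    values = 2 * context_len * n_layers * n_kv_heads * head_dim
    return values * bytes_per_value / 1e9

# Hypothetical small-model shape: 30 layers, 8 KV heads, head_dim 256
print(f"~{kv_cache_gb(128 * 1024, 30, 8, 256):.1f} GB of KV cache at fp16")
```

At fp16 this hypothetical shape needs tens of gigabytes for a full 128K cache, far beyond phone RAM, which suggests real mobile builds lean on tricks like KV-cache quantization or windowed attention to make such claims workable.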
Real-world tests and caveats
Reportedly, an iPhone 17 Pro running Apple's machine-learning framework MLX achieved inference speeds above 40 tokens/second on quantized builds of these models; similar numbers were reported on recent Samsung phones. Google has made mobile experimentation easier with an official app, Google AI Edge Gallery, which lets users download and run edge-optimized model builds without heavy tooling. Not everything is solved, though: when users pushed a 26B Mixture-of-Experts variant as a coding agent, a workload demanding long 256K context windows, robust tool calls, and structured outputs, Gemma 4 reportedly struggled, stalling or producing malformed output, while a qwen3-coder build handled the same agent tasks more reliably.
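Those throughput numbers hinge on quantization shrinking the weights enough to fit in phone memory. A back-of-envelope sketch, where the parameter counts come from the report above but the bit-widths are illustrative assumptions, not confirmed build settings:

```python
# Back-of-envelope weight storage for quantized models.
# Parameter counts (E2B ~2.3B effective, E4B ~4.5B) come from the
# article; the 4-bit and 8-bit widths are illustrative assumptions.

def weight_footprint_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB: params * bits / 8 bytes each."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for name, params in [("E2B", 2.3), ("E4B", 4.5)]:
    for bits in (4, 8):
        print(f"{name} @ {bits}-bit: ~{weight_footprint_gb(params, bits):.2f} GB")
```

Even the larger E4B lands around 2-3 GB at 4-bit, which is why these builds fit comfortably alongside the OS on current flagship phones, while a 26B Mixture-of-Experts variant is a much tighter squeeze.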
What this means for cloud AI and geopolitics
Short term? Cloud, closed‑source flagship models still lead for the hardest multi‑agent coordination and frontier research workloads. Long term? If mobile hardware, quantization and compiler stacks continue to improve, edge models will eat away at high‑frequency, simple tasks that today generate token revenue for API sellers. That’s a business problem for firms monetizing per‑token usage. There’s also a geopolitical angle: tensions over chip export controls and trade policy make resilient, on‑device AI more attractive in markets seeking sovereignty from foreign cloud providers. Will local models make the cloud invisible to users? Maybe — and when they do, the industry’s commercial wiring will be up for a serious shake‑up.
