Xiaomi (小米) debuts MiMo‑V2‑TTS, tying expressive speech to its MiMo‑V2‑Omni multimodal roadmap
The announcement
Xiaomi (小米) today unveiled MiMo‑V2‑TTS, a self‑developed large speech synthesis model that the company says can speak, act and even sing with fine‑grained control. The system is built around a proprietary Audio Tokenizer and a multi‑codebook architecture that models speech and text jointly. Xiaomi reportedly pretrained the model on hundreds of millions of hours of speech data and applied multi‑dimensional reinforcement learning to balance stability against expressiveness. Can one model convincingly handle both natural conversation and musical performance? Xiaomi claims MiMo‑V2‑TTS can.
Technical highlights
According to the announcement, the model supports multi‑granularity style control, from global speaking style down to local emotional shifts, so that tone and emotional intensity can change within a single sentence. It reportedly maps text cues such as punctuation, interjections and emphasis markers into natural prosody without extra user annotation. It also supports a range of Chinese dialects and accents, namely Northeastern Mandarin (东北话), Sichuanese (四川话), Henan dialect (河南话), Cantonese (粤语) and Taiwanese‑accented Mandarin (台湾腔), as well as actor‑style role play and high‑quality singing synthesis. Xiaomi frames the work as both a standalone TTS breakthrough and a component intended to be deeply fused with the multimodal understanding capabilities of MiMo‑V2‑Omni.
Why it matters
For Western readers, the announcement reflects China’s rapid push to build full‑stack AI capabilities: speech, vision and language are being tied together into agentic systems that can “see, understand and speak” in an expressive human voice. Xiaomi reportedly plans broader multilingual coverage and closer integration between MiMo‑V2‑TTS and the MiMo‑V2‑Omni base model to enable tool invocation and multimodal perception. This drive comes amid geopolitical headwinds, including export controls on advanced chips and growing scrutiny of Chinese AI firms, which are pushing domestic players to vertically integrate software and hardware and to accelerate model innovation.
