Xiaomi's MiMo-V2.5-Pro Achieves Over 1000 Tokens/Second on 1-Trillion-Parameter Model
Tags AI · Infrastructure · OSS
Xiaomi, in collaboration with TileRT, has released an UltraSpeed mode for its MiMo-V2.5-Pro model that generates more than 1000 tokens per second on a 1-trillion-parameter model using commodity GPUs. The speedup is attributed to extreme model-system co-design rather than specialized hardware. At this rate, real-time interaction with a trillion-parameter model becomes feasible on widely available hardware, which could broaden access to large-model inference. The result drew significant engagement on Hacker News (503 points, 350 comments), indicating strong developer interest in inference efficiency.
Technical significance
Reaching 1000 tokens/s on a 1T-parameter model with commodity GPUs is a significant leap in inference efficiency. If the result is reproducible, it could cut the cost of serving large models by an order of magnitude, making it economically viable to deploy trillion-parameter models in production without specialized AI accelerators. For the competitive landscape, it shows that Chinese AI labs are making substantial contributions to inference optimization, an area where Western companies have traditionally led.
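To make the throughput and cost reasoning concrete, here is a minimal back-of-the-envelope sketch. It converts a generation rate into a per-token time budget and into a cost per million generated tokens. The GPU count and hourly price below are hypothetical placeholders, not figures from the announcement, and the model assumes a single generation stream with no batching, so treat the output as an illustration of the arithmetic rather than an estimate of real serving costs.

```python
def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average time budget per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def cost_per_million_tokens(
    tokens_per_second: float,
    gpu_count: int,
    usd_per_gpu_hour: float,
) -> float:
    """USD per one million generated tokens for one stream on one node."""
    tokens_per_hour = tokens_per_second * 3600.0
    node_usd_per_hour = gpu_count * usd_per_gpu_hour
    return node_usd_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: an 8-GPU node rented at 2 USD per GPU-hour.
    GPU_COUNT = 8
    USD_PER_GPU_HOUR = 2.0

    # 1000 tokens/s is the rate reported in the article; 100 tokens/s is an
    # assumed slower baseline used only for comparison.
    for rate in (100.0, 1000.0):
        latency = per_token_latency_ms(rate)
        cost = cost_per_million_tokens(rate, GPU_COUNT, USD_PER_GPU_HOUR)
        print(f"{rate:>6.0f} tok/s: {latency:.1f} ms/token, "
              f"{cost:.2f} USD per 1M tokens")
```

With hardware cost held fixed, cost per token scales inversely with throughput, so a tenfold increase in tokens per second corresponds to the order-of-magnitude cost reduction discussed above.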