Data at https://gertlabs.com/rankings
The Xiaomi team really brought something to the table.
Getting ~1000 TPS on near-frontier intelligence is a step change, and enables whole new use-cases for applications. Seeing limited compute resources beget selective access makes me worry for the future of competition.
Are you kidding me. Come back when you are ready for the users. I was hoping to try it, what a frustration.
Discussions about choosing a library with the best syntactic-sugar method naming are just as crazy as suggesting we type in assembly.
For a while I was running Cerebras GLM 4.7 for a bunch of tasks. Not a very smart model, but it's fantastic to have a live prototype of a site up and be able to type "make the fonts bigger. No not that big" and see it change in real time. And MiMo 2.5 is a lot more capable than GLM 4.7.
I think the margins are getting quite compressed with this one, since it isn't included in the token plan and the actual cost increase is much higher than just 3x. But still fairly decent.
edit: now that I've read the article fully, it seems they use a very effective MTP algorithm, and somehow the quality is still decent enough.
though, I doubt that the quality really only drops a bit like they claimed. maybe for the benchmarks, but for general use, heavily quantized models very often show worse results.
Really?
I have a yearly GitHub Copilot subscription. Microsoft recently changed its billing to be token-based. I'm still getting billed per premium request, but GPT 5.4 is now 6x compared to 1x before.
I think the answer is that there's a tradeoff here where additional throughput for a single person can be achieved only by tying up more resources than a normal request would, even when you take into account the fact that the normal request takes longer to finish. I'm not an expert, but some of the optimizations they describe, particularly the parallel prediction stuff, sound like they could take up extra resources.
Remember, these guys are not VC backed. Anything they do must break even.
- persistent CUDA kernel
- tiled processing with overlapping read/writes
- model designed with specific constraints in mind
I think this site often overlooks that second group and how large it likely is.
So long as AI lives in server farms, humans will be needed for tasks in the physical world.
It's only if we combine AI with robots that things get really dicey.
I genuinely don't understand what moat these US model labs have. If they're saying recursive self improvement is just around the corner and Chinese labs are only slightly behind the leading US models, what moat do the US labs have? Are the US models going to recursively self improve better than the Chinese open source ones or something?
I might be completely wrong about this, but if I had money in OpenAI or Anthropic I'd be pulling it all right now. I think the chance of them going to near-zero over the next few years is very significant.
Anything different for Grok?
I have a more hopeful take. As AIs improve and get faster we can more quickly and iteratively improve code which we may have historically avoided due to the work involved.
I know I've made several refactors that would have otherwise been insane lifts. Not only because of the work involved, but because sometimes you don't know if it will work, so you have a sort of double friction. With an AI you can just throw it at the refactor to see if it runs into a problem, all while you're having a coffee break or w/e.
In general AI is going to enable humanity to be more extreme versions of itself. For good and bad. I suspect more bad than good, though.
MiMo 2.5 is not the same model as MiMo 2.5 Pro.
GLM 5.1 is z.ai's latest iteration & is one of the popular open weight coding models.
If you've had the chance, how does GLM 5.1 (which is now more expensive than MiMo 2.5 Pro after its recent 70% price drop) compare?
Understand the spirit of this, but probably not true. I don't think Xiaomi, or any big tech company, needs to break even on their new model releases.
From that point of view, they have as much money as they need. That's why there is no "VC", because Chinese government assumes that role.
128 sounds really tiny, I wonder if they mean some kind of blocks?
[0] https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#4...
It's such a weird "Gotcha" that seems to only assume that Chinese LLMs might censor something.
https://openrouter.ai/deepseek/deepseek-v4-pro?sort=throughp...
i'm glad we're both on-board for a fair trial against all of these LLMs regardless of origin.
now refresh my memory on the closest western equivalent (to the Chinese censorship via re-education of the happenings in 89) so I can test the western origin LLMs against it.
Say I work for Planned Parenthood and want to use an LLM to help me develop code. Will it refuse to run because there are mentions of abortion? Everyone has a different censorship line, but unfiltered is more generically useful.
This strategy will seem to work really well until the economy that enabled that foundation to form is hollowed out. Then, there will be a reckoning (but we will have no choice but to march forth from there).
Especially as teams invest in proper agentic harnessing.
We have had a champion in our team that has invested a lot of time into it over the last 4 months, and if anything, quality has improved, not decreased. Architecture is more coherent, codebase has been cleaned up, agents find information quickly, code produced is very solid and my role is more and more checking that the output meets the requirements. But I cannot confidently say that I would've done a better job than AI; more often than not, I have to admit it does a better job than I do.
The mistakes are less and less technical and merely in the domain mapping. And AI is still not as creative as I am at quickly finding solutions to unlock stakeholders' issues. Also, AI is still not as creative as I am at finding the proper solutions for advanced technical problems. But it does a better job than me, even on that front, one-shotting a few solutions in a fraction of the time it would've taken me to test one idea myself.
Mind you, I don't like AI and I think it ruined the job, I don't like working this way, it's exhausting, way more work on one side, way less fun and fiddling with technical parts.
And yet, I have the genuine belief that a few years from now we'll be cloning open source repositories that are already optimized/harnessed and tested for agentic loops and best practices left and right, with software engineers mostly overseeing the domain translation and putting their 2 cents on the non-boilerplatey parts of the product (which, in general, are a small part of the surface).
I think that the next years of my career will be mostly spent in setting up and writing the harnessing and domain mapping part. Then I will move to another sector, not because I necessarily believe I won't have a job, but because I want to vomit thinking that's going to be my job.
It is another thing that the BigLabs accuse open weight models of benefitting from distillation & other techniques & essentially avoiding higher training costs (which typically bleed into bills end users pay for inference).
Ex A: https://www.anthropic.com/research/2028-ai-leadership
Ex B: https://www.reuters.com/world/china/openai-accuses-deepseek-...
For non-subsidized plans? Pretty sure they'd need to put this in the ToS, or lawsuits would have followed by now.
Sometimes Opus just gives me a rubbish session.
I'm not agreeing or disagreeing with you, but my brain cannot comprehend how machines can advance such interconnected systems while keeping humans in focus.
Perhaps I shouldn't have watched the Animatrix again.
There will only be a reckoning if models don't get much better.
If they do get much better you can just have them refactor, fix bugs in, or replace the existing codebase.
The concept of tech debt is sort of meaningless if you anticipate intelligence gains in models to continue.
> No one cares anymore.
I never cared about this.
I think this captures something that I've been searching for the words for. (Maybe I should have gotten an LLM to write the words for me.) The biggest AI boosters are the kind of dev that would have cared about the new frameworks of the last 3 months. They had a "the framework does all the thinking for me" attitude already, so it is easy for AI to slot into that.
> It uses 384 routed experts (top-8) with hybrid attention (full-attention + sliding-window 128 at 6:1 ratio) over 70 layers (1 dense + 69 MoE)
In this case, at least it’s threatening multimillion dollar salary jobs instead of entire towns of working class people in America or Mexico.
And the Chinese labs actually release their weights. You could call it… open AI.
What actually matters is that the mere tool is withholding information at all, and that the boundaries were set by whoever designed it.
Don't get me wrong, I've been an advocate of this stuff (I carry two phones, one with GOS for my personal use and the other for ID verifications). However, without reasoning, you just can't see it, because you're as biased and propagandized as anyone in China.

June 8, 2026
From the first roaring racer of the combustion age to the sonic boom that shattered the sound barrier, humanity's hunger for speed is written into our very DNA. The speed of AI reasoning is no different — it defines the boundaries of intelligence itself. When a model is fast enough, it ceases to be a tool you wait on and becomes an extension of your own thinking: responding in real time, iterating in an instant, collaborating without friction.
Today, we are thrilled to release Xiaomi MiMo-V2.5-Pro-UltraSpeed in collaboration with TileRT, breaking the 1000 tokens/s decode-speed barrier on a 1-trillion-parameter model for the first time!
MiMo-V2.5-Pro UltraSpeed real-time generation speed comparison (up to ~1200 tokens/s)
The MiMo-V2.5-Pro-UltraSpeed API launches simultaneously at a limited-time promotional price — 3× the cost of MiMo-V2.5-Pro, but delivering approximately 10× the generation speed! 3× the price, 10× the output experience. (API only; Token Plan not supported.)
Due to limited high-speed inference resources, MiMo-V2.5-Pro-UltraSpeed will be available through an application-based, limited-time window. Approved users can access the API during the trial period, available only from June 9 to June 23, 2026, 23:59 (Beijing Time, UTC+8 / 08:59 PDT).
API platform: platform.xiaomimimo.com/ultraspeed. Trial slots are limited — submission does not guarantee approval. We will prioritize enterprises and professional developers with genuine business needs. For standard model access, please follow the MiMo-V2.5 model series. For in-depth business partnerships for the UltraSpeed model, contact business-mimo@xiaomi.com.
Approved users will receive free Chat access valid within the two-week window. Entry point: ultraspeed.xiaomimimo.com
To ensure quality and fairness under resource constraints, the following rules apply: each account may enter the queue up to 10 times per day; each session is capped at 30 minutes; sessions idle for more than 5 minutes will be automatically released.
At the trillion-parameter (1T) scale, breaking 1000 tps is far more than a faster typewriter — it fundamentally disrupts AI application paradigms.
First, speed itself begins to transmute into intelligence. Previously, when facing a hard problem, you could only "wait for one answer and pray it's correct." Now, within the same wall-clock time, the model can run dozens of reasoning paths in parallel (Best-of-N / Tree Search), automatically verifying and self-correcting in the background — using raw speed to generate depth of thought, directly elevating reasoning quality.
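As a rough illustration of the Best-of-N idea mentioned above, here is a minimal Python sketch. The `generate` and `score` callables are hypothetical stand-ins for a model call and a verifier; neither is part of any MiMo API, and the article does not describe how candidates are actually scored.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], T],   # hypothetical: (prompt, seed) -> candidate answer
    score: Callable[[str, T], float],    # hypothetical: verifier returning a quality score
    n: int = 8,
) -> Tuple[float, T]:
    """Sample n candidates concurrently and return (score, candidate) for the best one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda seed: generate(prompt, seed), range(n)))
    scored = [(score(prompt, c), c) for c in candidates]
    # max() keeps the first candidate on ties, so results are deterministic
    return max(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    # Toy stand-ins: the "answer" is the seed, and the verifier prefers values near 5.
    best = best_of_n("toy prompt", lambda p, seed: seed, lambda p, c: -abs(c - 5))
    print(best)
```

The point of the sketch is only the shape of the loop: if decoding is roughly 10× faster, about ten such candidates fit into the wall-clock time one answer used to take.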
Second, it completely unleashes the productivity ceiling of Coding Agents. Before, having AI write code meant developers painfully waiting in front of screens, bottlenecked by inference latency. At 1000 tps, code generation speed and production efficiency undergo a paradigm-level acceleration.
Most importantly, trillion-parameter models can now enter real-time decision loops. Millisecond-level "think-respond" cycles allow 1T flagship models to seamlessly plug into time-critical scenarios — high-frequency quantitative trading signal generation, instant anti-fraud interception, intelligent bidding, and real-time interactive dialogue. And when this power is brought to surgical assistance and medical imaging analysis in life-or-death situations, AI speed is no longer just a metric of efficiency — it becomes a chip in the race against death. On the operating table, every second AI saves in completing lesion analysis and risk prediction gives the surgeon one more degree of freedom. This deepens our conviction that the ultimate significance of speed is not merely boosting productivity, but enabling technology to help humanity live better.
Achieving 1000+ tokens/s generation speed with a 1T flagship model is not the breakthrough of a single technique — it is the product of deep collaboration and extreme Codesign between the MiMo model team and the TileRT system team. The industry's current approach to similar extreme speeds typically relies on specialized hardware — Cerebras's Wafer-Scale integration or Groq's pure on-chip SRAM custom architecture. We chose a different path: achieving even more impressive inference speed on commodity GPUs through model-system codesign alone.
On the model side, we applied FP4 quantization targeting the bandwidth bottleneck of commodity hardware, dramatically shrinking model size and reducing memory-access overhead; simultaneously, we introduced DFlash, an efficient speculative decoding method based on block-level masked parallel prediction, substantially increasing the accepted token length per verification step. On the system side, TileRT perfectly adapts to the dynamic characteristics of these algorithms, delivering a tailor-made compilation engine and compute kernels optimized specifically for the novel quantization and speculative decoding pipeline. Through this extreme Codesign, we achieved 1000+ tokens/s output from a 1T model using just a single standard 8-GPU commodity node.
At the trillion-parameter (1T) scale, traditional 8-bit (FP8 / INT8) or even 16-bit inference imposes prohibitive memory footprint and bandwidth pressure. Reducing parameter bit-width directly contributes to decoding speed. We therefore adopt the widely validated, virtually lossless FP4 (MXFP4) quantization format[1].
However, naively applying FP4 across the entire model causes degradation in complex reasoning, logic, and code generation. Given the MoE (Mixture of Experts) architecture of Xiaomi MiMo-V2.5-Pro — where Experts constitute the vast majority of parameters and exhibit the highest tolerance to quantization — we selectively quantize only the MoE Experts to FP4 while preserving original precision for all other modules. Through FP4 QAT (Quantization-Aware Training), we dramatically reduce model size and maximize hardware bandwidth utilization while keeping the model's overall capability essentially on par with the original, as shown below:

Model capability comparison between FP4 quantization (MoE Experts only) and FP8 across benchmarks, with overall capability essentially on par with the original model
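To make the MXFP4 format more concrete, the following Python sketch quantizes one block of weights the way the OCP Microscaling format lays it out: 32 elements share a single power-of-two scale, and each element is stored as a 4-bit E2M1 value. This is a simplified illustration, not the QAT pipeline used for MiMo-V2.5-Pro; the function names are made up for this example, and rounding here is plain nearest-value rather than the round-to-nearest-even a production kernel would use.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # elements that share one scale in MXFP4
E2M1_EMAX = 2     # exponent of the largest power of two in E2M1 (4.0 == 2**2)


def quantize_block(block):
    """Return (shared_exp, codes); each code is (sign, index into E2M1_VALUES)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [(1, 0)] * len(block)
    # Shared power-of-two scale chosen so the largest value lands in E2M1's top binade.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    codes = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        idx = min(range(len(E2M1_VALUES)), key=lambda i: abs(E2M1_VALUES[i] - mag))
        codes.append((-1 if x < 0 else 1, idx))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Rebuild approximate float values from the shared exponent and element codes."""
    scale = 2.0 ** shared_exp
    return [sign * E2M1_VALUES[idx] * scale for sign, idx in codes]


if __name__ == "__main__":
    weights = [0.07 * ((i % 9) - 4) for i in range(BLOCK_SIZE)]
    exp, codes = quantize_block(weights)
    restored = dequantize_block(exp, codes)
    print(exp, max(abs(a - b) for a, b in zip(weights, restored)))
```

With an 8-bit scale per 32 elements, storage works out to 4.25 bits per weight, which is where the bandwidth saving over FP8 comes from.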
Traditional Speculative Decoding relies on a small draft model to "guess" subsequent tokens, which the large model then verifies. This transforms autoregressive generation (1 token per forward pass) into parallel multi-token generation, with rejection sampling during verification ensuring lossless output quality. However, its bottleneck lies in the draft model's quality determining the acceptance rate, while a stronger draft model incurs higher compute overhead — a fundamental tension.
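The verification step of traditional speculative decoding can be sketched in a few lines of Python. This follows the standard rejection-sampling rule (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p − q)); it is a toy over explicit probability lists, not the MiMo or DFlash implementation, and the names `verify_draft` and `sample` are chosen for this example only.

```python
import random


def sample(dist, rng):
    """Draw a token id from a list of probabilities."""
    r = rng.random()
    acc = 0.0
    for tok, p in enumerate(dist):
        acc += p
        if r < acc:
            return tok
    return len(dist) - 1  # guard against floating-point shortfall


def verify_draft(draft_tokens, q, p, rng):
    """
    draft_tokens: k token ids proposed by the draft model.
    q[i]: draft distribution at position i (k entries).
    p[i]: target distribution at position i (k + 1 entries; the last one
          supplies a bonus token when every draft token is accepted).
    Returns the tokens actually emitted this round.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            out.append(tok)  # accepted: the target model agrees often enough
            continue
        # Rejected: resample from the residual so the output still follows p exactly.
        residual = [max(0.0, pt - qt) for pt, qt in zip(p[i], q[i])]
        total = sum(residual)
        out.append(sample([r / total for r in residual], rng))
        return out
    out.append(sample(p[len(draft_tokens)], rng))  # all accepted: bonus token
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    q = [[0.6, 0.4], [0.5, 0.5]]
    p = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
    print(verify_draft([0, 1], q, p, rng))
```

The residual resampling is what makes the scheme lossless: the emitted tokens are distributed exactly as if the large model had generated them one at a time.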
To break this deadlock, we adopt DFlash, an innovative block-level masked parallel prediction method from the research community[2]: the draft model fills an entire block of masked positions in a single forward pass, fundamentally eliminating the serial constraint of "autoregressive drafting."
We deployed this approach on MiMo-V2.5-Pro with custom optimizations tailored for trillion-scale MoE and long-context scenarios. Using the Muon second-order optimizer and model self-distillation, we ensure that compact mask blocks still deliver ideal acceptance rates while compressing draft-stage overhead to near its theoretical minimum.
In terms of results, our parallel-prediction speculative decoding achieves significant acceptance-length improvements across high-value agent and coding scenarios, meaning the large model can confirm more content "in one breath" per verification round. Furthermore, we limit block size to 8 to reduce verification overhead and increase concurrency, allowing high acceptance lengths to translate directly into high inference throughput:
| Scenario | Acceptance Length |
|---|---|
| Coding | 6.30 |
| Math / Reasoning | 5.56 |
| Agent | 4.29 |
In the Coding scenario, we achieve an average acceptance length of 6.30, with some samples reaching a maximum of 7.14 — meaning 6–7 out of the 8 draft tokens per verification round are accepted. The draft model remains lightweight while pushing acceptance rates to levels that deliver real end-to-end gains. We also observe that in more semantically divergent, higher-uncertainty general conversation scenarios, current acceptance rates are not yet high. We are continuously optimizing the algorithm to explore higher generalization ceilings.
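A quick back-of-the-envelope calculation shows what these acceptance lengths imply for timing. The sketch below assumes, as a simplification not stated in the article, that a single stream emits exactly "acceptance length" tokens per draft-plus-verify round and that 1000 tokens/s is the target; the function names are illustrative.

```python
ACCEPTANCE_LENGTH = {"Coding": 6.30, "Math / Reasoning": 5.56, "Agent": 4.29}
BLOCK_SIZE = 8        # draft tokens per verification round, per the article
TARGET_TPS = 1000.0   # tokens per second the system aims for


def round_budget_ms(acceptance_length, target_tps=TARGET_TPS):
    """Time available for one draft-plus-verify round, in milliseconds."""
    return 1000.0 * acceptance_length / target_tps


def accepted_fraction(acceptance_length, block_size=BLOCK_SIZE):
    """Share of the drafted block that survives verification on average."""
    return acceptance_length / block_size


if __name__ == "__main__":
    for scenario, length in ACCEPTANCE_LENGTH.items():
        print(
            f"{scenario}: {round_budget_ms(length):.2f} ms per round, "
            f"{accepted_fraction(length):.0%} of the block accepted"
        )
```

Under these assumptions, a coding workload leaves roughly 6.3 ms for each round, while an agent workload with acceptance length 4.29 leaves only about 4.3 ms, so lower acceptance lengths tighten the per-round latency budget.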
If MiMo's algorithmic innovations unshackle the bandwidth constraints of hundred-billion and trillion-parameter models, then the TileRT inference system squeezes every last drop of physical potential from commodity GPUs down to the microsecond level.
At 1000 tokens/s operating frequency, each operator's lifecycle is compressed to microseconds, and the "operator boundaries" of traditional inference systems become the core bottleneck — every operator launch, hardware synchronization, and global memory round-trip fractures the execution flow at the microsecond scale, exposing visible "Execution Gaps."
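To see why operator boundaries matter at this speed, consider a rough estimate in Python. Only the 70-layer count comes from the model description quoted in the discussion; the kernels-per-layer count and the per-launch gap are hypothetical round numbers chosen for illustration, and the 6.3 ms round budget assumes about 6.3 tokens must be emitted per round to sustain 1000 tokens/s.

```python
def launch_gap_share(num_launches, gap_us, round_budget_ms):
    """Fraction of one decode round spent in gaps between kernels."""
    gap_ms = num_launches * gap_us / 1000.0
    return gap_ms / round_budget_ms


LAYERS = 70             # layer count quoted for the model
KERNELS_PER_LAYER = 10  # hypothetical: separate kernel launches per layer
GAP_US = 5.0            # hypothetical: launch + sync gap per kernel, in microseconds
ROUND_BUDGET_MS = 6.3   # assumption: ~6.3 tokens per round at 1000 tokens/s

if __name__ == "__main__":
    share = launch_gap_share(LAYERS * KERNELS_PER_LAYER, GAP_US, ROUND_BUDGET_MS)
    print(f"{share:.0%} of each round would be lost to execution gaps")
```

With these illustrative numbers, more than half of every round would go to gaps rather than computation, which is the kind of overhead an execution model without per-operator launches is meant to remove.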
As the foundational infrastructure for ultra-low-latency inference, TileRT introduces an entirely new execution model that eliminates execution gaps from operator boundaries at their root:
When the underlying execution model pushes hardware performance to its limits, pure runtime optimization begins to hit physical boundaries. Building on this foundation, the TileRT system team and Xiaomi's MiMo team engaged in deep technical co-creation, breaking down traditional software layer boundaries. To perfectly align model behavior with this ultra-low-latency execution pipeline, the model layer ultimately adopted a mixed FP4 quantization strategy for MoE Experts and deployed SWA-aligned DFlash speculative decoding on the trillion-parameter architecture. TileRT tightly couples with these algorithmic characteristics and quantization schemes, delivering custom-built compilation engines and compute kernels. Both teams made profound joint engineering tradeoffs based on hardware physics, ensuring execution pressure closes smoothly within hardware boundaries.
The birth of 1000 tokens/s is no coincidence of point optimizations. It is the inevitable result of world-class system infrastructure and extreme algorithmic models deeply converging toward each other, co-evolving as one.
TileRT is a frontier systems architecture team focused on next-generation AI infrastructure and ultra-low-latency inference. The team is dedicated to enabling millisecond-level real-time response for frontier large models in production environments, breaking traditional storage-compute barriers with an entirely new runtime architecture. The team has conceived and implemented a paradigm-level execution model. Through full-stack breakthroughs in persistent kernels, tile pipelines, and heterogeneous collaboration, TileRT achieves extreme compute utilization within complex heterogeneous ecosystems. As a core infrastructure enabler, the team actively partners with industry-leading collaborators on hardware-software codesign, building the high-performance compute foundation for the era of autonomous intelligence that craves "ultimate speed." For more TileRT technical details: tilert.ai/blog/breaking-1000-tps.html

Build a Snake game in just 10 seconds

Recreate a MacOS interface in just 1 minute
MiMo × TileRT — extreme model-system codesign, delivering 1000 tps output speed for trillion-parameter models.