MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second (mimo.xiaomi.com)

191 points by gainsurier 2 hours ago | 130 comments

goyozi 1 hour ago |

Fast AI seems genuinely exciting and somewhat unsettling to me. Right now Claude is faster than me on some tasks but we’re at least close. I have a prompt to clean up a PR that’s been running for 1h now and I expect it to take another few. It’s hard to imagine what the workflow would look like if it were near-instant. On the one hand, it might be easier to focus. Some prompts take so long that I start to multitask and regret it later. On the other, AI that takes a few seconds, or a few minutes at most, to solve what used to take hours or days? That’s a game changer and I don’t even know where we fit in.

flexagoon 1 hour ago | |

I'm using Deepseek-v4-pro as my main model and this is sometimes pretty annoying: I have some easy, boring task to do, think "I'll just leave the agent to do it and go take a nap", but it's already done writing the code before I even walk away from the computer.

throwaway67678 22 minutes ago | | |

Agent mania setting in

It's also pretty funny sometimes how it gives weird future roadmap estimates ("part 2 - 3 weeks, part 3 - 2 months", etc.) and when you tell it to actually do those changes it's pretty much done in half an hour

RussianCow 1 hour ago | | |

Do you mean Flash and not Pro? I haven't tried it personally, but according to OpenRouter, the fastest DeepSeek V4 Pro providers are only ~50tps. That's slower than Claude Opus.

https://openrouter.ai/deepseek/deepseek-v4-pro?sort=throughp...

tmaly 57 minutes ago | | |

This reminds me of the Peter / Boris comments on writing loops to keep the agents busy.

efromvt 18 minutes ago | |

I'd be very curious about the bottleneck breakdown in most current software dev - I suspect inference is far from the bottleneck in most things I do, though driving it to 0 would still be nice. I do agree that if it was 0 we'd probably change development approaches to reduce the new bottlenecks more, but it'll take full-process innovation to really get something near-instant.

(I should go measure this now, I'm curious)
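One rough way to frame the bottleneck question is Amdahl's law: if inference accounts for a fraction p of wall-clock time and becomes s times faster, the end-to-end speedup is 1 / ((1 - p) + p / s). The sketch below is only illustrative; the 30% inference share is a made-up assumption, not a measurement from anyone in this thread.

```python
def overall_speedup(inference_fraction: float, inference_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only the inference share gets faster."""
    if not 0.0 <= inference_fraction <= 1.0:
        raise ValueError("inference_fraction must be between 0 and 1")
    if inference_speedup <= 0:
        raise ValueError("inference_speedup must be positive")
    remaining = (1.0 - inference_fraction) + inference_fraction / inference_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical split: 30% of wall-clock time is token generation,
    # the rest is CI, review, and other waiting.
    for speedup in (2, 10, 20):
        print(f"{speedup}x faster inference -> {overall_speedup(0.3, speedup):.2f}x overall")
```

Under that assumed split, even infinitely fast inference caps out at 1 / (1 - 0.3), about 1.43x overall, which is why the rest of the process would have to change too.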

ilaksh 11 minutes ago | |

Use Claude fast mode and turn off thinking. Tell it to just explain what its plan is to you at a high level.

It will go much faster.

ipkstef 1 hour ago | |

Asking for curiosity's sake: what kind of PR loop are you running that takes a few hours?

ketzo 1 hour ago | | |

not OP but usually for me this means long verification loop; waiting 10min on CI checks, that kind of thing, rather than actual 1hr wall clock of token generation

goyozi 47 minutes ago | | |

I’m rewriting our integration test suite to run tests in parallel. I have the changes split across 7 branches, and each needs to be fixed to have no flaky tests. I told it I want 3 consecutive CI runs with no flakes and no artificial fixes / assert removals etc. We’ll see what comes out; it’s almost a side project so there’s not much to lose other than some of my weekly limit that resets soon.
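The "3 consecutive CI runs with no flakes" stop condition can be sketched as a small check over run results. This is a minimal illustration; `has_consecutive_passes` is a hypothetical helper name, and how the pass/fail results are fetched from CI is left out.

```python
from typing import Iterable


def has_consecutive_passes(results: Iterable[bool], required: int = 3) -> bool:
    """Return True once `required` passing runs occur back to back."""
    if required < 1:
        raise ValueError("required must be at least 1")
    streak = 0
    for passed in results:
        # Any failing (flaky) run resets the streak to zero.
        streak = streak + 1 if passed else 0
        if streak >= required:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical CI history, oldest run first.
    print(has_consecutive_passes([True, False, True, True, True]))
```

The reset on every failure is the point: two green runs, one flake, then two more green runs does not count.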

HarHarVeryFunny 38 minutes ago | |

I don't see many companies being willing to pay 3x more for faster code generation. Cloud-based AI code generation is already extremely fast, and hardly the bottleneck for most software product development.

There can't be many normal use cases where there'd be any cost benefit.

fragmede 23 minutes ago | | |

The "traditional" way we vibe code is: human software developer prompts AI -> AI generates code -> (human checks code) -> code gets compiled/deployed/etc. -> users use "binary". At 1000 tok/sec it becomes: user prompts obliquely -> AI vets generated code -> code deployed -> user gets response from deployed code.

It's a cute toy right now, but you can tell an LLM that it's an http server, and have it respond directly to a web browser hitting it. It generates headers in response, as well as page contents. As 1000 tok/sec becomes the new normal, we will come up with newer ways to use it outside of toy fiction encyclopedias.
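A minimal sketch of the "LLM as http server" idea, assuming the model is just a function from prompt text to response text. `stub_model` is a hypothetical stand-in so the example runs offline; the prompt wording is invented for illustration, and wiring `respond` to a real socket or a real model API is left out.

```python
from typing import Callable

SYSTEM_PROMPT = (
    "You are an HTTP/1.1 server. Reply with a complete raw HTTP response: "
    "status line, headers, a blank line, then the body."
)


def build_prompt(raw_request: str) -> str:
    """Wrap the raw browser request in instructions for the model."""
    return f"{SYSTEM_PROMPT}\n\nRequest:\n{raw_request}\n\nResponse:\n"


def respond(raw_request: str, model: Callable[[str], str]) -> bytes:
    """Ask the model for a raw HTTP response and normalize line endings."""
    text = model(build_prompt(raw_request))
    # HTTP header lines end in CRLF; models usually emit bare LF.
    normalized = text.replace("\r\n", "\n").replace("\n", "\r\n")
    return normalized.encode("utf-8")


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    return "HTTP/1.1 200 OK\nContent-Type: text/plain\n\nhello from the model"


if __name__ == "__main__":
    request = "GET / HTTP/1.1\r\nHost: example.test\r\n\r\n"
    print(respond(request, stub_model).decode("utf-8"))
```

At ~50 tok/sec a page like this takes long enough that it stays a toy; at 1000 tok/sec a few hundred tokens of headers and markup would fit in well under a second.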

pianopatrick 1 hour ago | |

We fit in for the things that are not artificial.

So long as AI lives in server farms, humans will be needed for tasks in the physical world.

It's only if we combine AI with robots that things get really dicey.

fartfeatures 1 hour ago | | |

This is very dystopian in my opinion. I'm not the arms, legs, sensors and actuators for a machine super intelligence. I wouldn't treat another human as my slave because they aren't as intelligent as I am any more than I would expect to become a slave for a machine. This is our world (for now) and that is why we fit in. Not because we can serve.

recroad 1 hour ago | |

Woah - what’s the prompt and what’s the PR?

goyozi 45 minutes ago | | |

I replied in more detail under another comment. TLDR: fixing flaky CI across multiple branches

dakiol 44 minutes ago |

So, regarding the productivity argument: I don't get it. It doesn't really matter (for regular employees) that you can now do in 2h what used to take 2 days. Why? Because it's not as if you get the rest of the day to yourself. You still have to work 8h/day as usual. But now the pattern is different: instead of enjoying the craft and digging deeper into problems over the span of 2 days, you are now rushing to some slot machine in the hope that it gives you the right answer with the right prompt.

So, if anything, I would say it's worse for us. Obviously, it's the complete opposite for corporations and executives: they are loving the AI situation so much!

amunozo 1 hour ago |

These price and speed optimizations from Chinese providers, combined with the rising prices from American ones, will change the game sooner rather than later. Many companies are already running into issues with their AI bills.

gertlabs 1 hour ago |

MiMo V2.5 Pro (regular speed) remains the strongest open weights agentic coding model we've tested -- it's been interesting to see how little attention it has received relative to some lower performing releases. And the "fast mode" pricing is very competitive here.

Data at https://gertlabs.com/rankings

kingstnap 1 hour ago |

Given that MiMo is as cheap as Deepseek ( previous discussion: https://news.ycombinator.com/item?id=48282814 ) multiplying that by 3x for ultra speed is still shockingly cheap.

miroljub 1 hour ago | |

MiMo and DeepSeek are not cheap. Anthropic and OpenAI are expensive for what they provide.

chrismustcode 1 hour ago | | |

You don't consider Input $0.435 Output $0.87 cache read $0.003625 per million tokens for near frontier intelligence cheap?
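To make those rates concrete, here is a small cost sketch using the per-million-token prices quoted above. The token counts are hypothetical, and applying the 3x ultra-speed multiplier mentioned elsewhere in the thread uniformly to all three rates is an assumption, not something the provider's price list is known to confirm.

```python
# Prices in USD per million tokens, as quoted in the comment above.
PRICE_PER_MILLION = {"input": 0.435, "output": 0.87, "cache_read": 0.003625}


def session_cost(input_tokens: int, output_tokens: int, cache_read_tokens: int,
                 multiplier: float = 1.0) -> float:
    """Estimate the cost of one session in USD."""
    usage = {
        "input": input_tokens,
        "output": output_tokens,
        "cache_read": cache_read_tokens,
    }
    base = sum(PRICE_PER_MILLION[kind] * count / 1_000_000 for kind, count in usage.items())
    # multiplier=3.0 models the ultra-speed tier under the uniform-3x assumption.
    return base * multiplier


if __name__ == "__main__":
    # Hypothetical agentic session: 2M fresh input, 300k output, 20M cached reads.
    regular = session_cost(2_000_000, 300_000, 20_000_000)
    fast = session_cost(2_000_000, 300_000, 20_000_000, multiplier=3.0)
    print(f"regular: ${regular:.4f}, ultra-speed (assumed 3x): ${fast:.4f}")
```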

tmaly 55 minutes ago | | |

Energy is likely more abundant in China. I am not sure about compute, but that must be part of the reason for such drastic price differences.

ignoramous 1 hour ago | | |

The Chinese "Neijuan" is real & well reported: https://www.reuters.com/business/autos-transportation/what-i...

It is another thing that the BigLabs accuse open weight models of benefitting from distillation & other techniques & essentially avoiding higher training costs (which typically bleed into the bills end users pay for inference).

Ex A: https://www.anthropic.com/research/2028-ai-leadership

Ex B: https://www.reuters.com/world/china/openai-accuses-deepseek-...

serpix 1 hour ago |

I may sound like a shill, but exponential growth and all. We are going to get near instant software from prompt, multiple ones and then choose the best one.

Discussions about choosing the library with the best syntactic-sugar method naming are just as crazy as suggesting we type in assembly.

eli 1 hour ago |

Neat. The frontier models have gotten pretty impressive, but they're all a bit too slow for interactive, human-in-the-loop coding. It incentivizes vibecoding and running multiple agents in parallel. A fast agent feels more like a partner.

For a while I was running Cerebras GLM 4.7 for a bunch of tasks. Not a very smart model, but it's fantastic to have a live prototype of a site up and be able to type "make the fonts bigger. No not that big" and see it change in real time. And MiMo 2.5 is a lot more capable than GLM 4.7.

maxdo 1 hour ago | |

I tried GLM 4.7 for agents that write code: simple scripts, 200-1000 LOC. Extremely bad. Had to abandon the Cerebras offering; their smart models are only on the enterprise plan.

ignoramous 1 hour ago | |

> And MiMo 2.5 is a lot more capable than GLM 4.7

MiMo 2.5 is not the same model as MiMo 2.5 Pro.

GLM 5.1 is z.ai's latest iteration & is one of the popular open weight coding models.

If you've had the chance, how does GLM 5.1 (which is now more expensive than MiMo 2.5 Pro after its recent 70% price drop) compare?

eli 39 minutes ago | | |

GLM 5.1 is very good. Definitely a contender for best open weight coding model. Nothing like 4.7.

But quite a bit more expensive than MiMo 2.5 Pro. Like 5x to 10x more on my little tests, at least by the API rates.

prplfsh 1 hour ago |

This will be really powerful for voice. Being able to reason makes LLMs so much smarter, but with voice your latency budget is so tight that you typically can't spare the time.
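A back-of-the-envelope sketch of that latency budget: how many reasoning tokens fit before the reply has to start. The 500 ms budget and the time-to-first-token figure are hypothetical numbers chosen for illustration, not measurements of any real voice stack.

```python
def tokens_within_budget(budget_ms: int, tokens_per_second: int,
                         first_token_ms: int = 0) -> int:
    """How many tokens can be generated before the latency budget runs out."""
    # Time spent waiting for the first token is not available for generation.
    usable_ms = max(0, budget_ms - first_token_ms)
    return usable_ms * tokens_per_second // 1000


if __name__ == "__main__":
    budget_ms = 500  # hypothetical budget before the assistant must start speaking
    for tps in (50, 1000):
        fits = tokens_within_budget(budget_ms, tps, first_token_ms=200)
        print(f"{tps} tok/s -> {fits} reasoning tokens within {budget_ms} ms")
```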

jeffrallen 1 hour ago | |

This is true for humans too. Lol

Oras 1 hour ago |

1k TPS is great, but I’m more fascinated by the amount of AI generated comments in this thread!

trollbridge 56 minutes ago | |

Comments at 1,000 TPS is a terrifying future.

0xbadcafebee 24 minutes ago | | |

I prefer a thousand smart AI comments to a thousand dumb human comments

eli 1 hour ago | |

Like what?

scosman 1 hour ago |

Cerebras is trialing Kimi K2.6 at 3000t/s (invite only). I'm excited for when the fast hardware gets more mainstream for frontier models. Models designed for speed on Nvidia are a nice addition that could bridge the gap.

adrian_b 34 minutes ago | |

TFA mentions that until now special, very expensive hardware like Cerebras was required to reach these kinds of speeds, and it emphasizes that what is novel in their results is that they have obtained over 1000 tokens/s for a model with over 1 T parameters using just standard hardware, i.e. one server with 8 GPUs.

btian 32 minutes ago | |

Source? Their website says 1000t/s https://www.cerebras.ai/blog/which-is-faster-gemini-3-5-flas...

michael-ax 1 hour ago | |

now that's what i call a software development breakthrough/platform! thanks for the heads up!

lostmsu 1 hour ago | |

Cerebras currently does not provide any discounts for prefix caching, which makes billed input for agentic workloads grow roughly as sqr(n_turns), i.e. quadratically with the number of turns.
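A simplified model of why missing prefix-cache discounts hurt agentic workloads: if every turn appends a similar number of tokens and re-sends the whole conversation, full-price input grows with the square of the turn count. The per-turn token count is hypothetical, and cached reads are treated as free here, which is a best case (real providers usually bill them at a reduced rate).

```python
from typing import Tuple


def full_price_input_tokens(n_turns: int, tokens_per_turn: int) -> Tuple[int, int]:
    """Return (without_cache, with_cache) input tokens billed at the full rate."""
    # Without caching, turn i re-bills the entire context so far: i * tokens_per_turn.
    without_cache = sum(turn * tokens_per_turn for turn in range(1, n_turns + 1))
    # With caching, only the newly appended tokens are billed at the full rate.
    with_cache = n_turns * tokens_per_turn
    return without_cache, with_cache


if __name__ == "__main__":
    for turns in (10, 100):
        no_cache, cached = full_price_input_tokens(turns, 1_000)
        print(f"{turns} turns: {no_cache:,} vs {cached:,} full-price input tokens")
```

In this model the ratio between the two is (n_turns + 1) / 2, so the penalty keeps growing the longer the agent runs.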

maxloh 1 hour ago |

The generation speed in the demo video is crazy, to say the least, and completely beyond my impressions of LLMs.

The Xiaomi team really brought something to the table.

irthomasthomas 1 hour ago |

I don't understand, given all they say, why this would not be made available to everyone at once? Why the limited release? They should have no trouble scaling it if it runs on a single rack.

gekoxyz 1 hour ago | |

Maybe they don't have enough racks. News reports indicate that China isn't in a really good situation with GPUs, so they probably want to keep most of them for other stuff. Also, since the price is so cheap, they probably want to use the other GPUs for stuff that has higher margins.

jdthedisciple 1 hour ago | |

Because presumably then it won't be 1000 t/s for everyone anymore given hardware limitations?

boutell 1 hour ago | |

I wonder about this too. The other objections miss the point: if it's faster, and otherwise the same, and doesn't require different hardware, then why not just announce that the standard tier of MiMo-v2.5-Pro is now ridiculously fast and raise the price? What does "limited high speed resources" mean if it runs on the same hardware as the rest of their pool?

I think the answer is that there's a tradeoff here where additional throughput for a single person can be achieved only by tying up more resources than a normal request would, even when you take into account the fact that the normal request takes longer to finish. I'm not an expert, but some of the optimizations they describe, particularly the parallel prediction stuff, sound like they could take up extra resources.
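The tradeoff can be illustrated with the textbook speculative-decoding estimate. Assuming each drafted token is accepted independently with probability a, a verification pass over k drafted tokens emits 1 + a + a^2 + ... + a^k tokens on average. This is a generic model, not a description of Xiaomi's actual method, and the acceptance rate below is a made-up number.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_length: int) -> float:
    """Expected tokens emitted per verification pass of the main model."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_length < 0:
        raise ValueError("draft_length must be non-negative")
    # Geometric series: token i survives only if all earlier drafts were accepted.
    return sum(acceptance_rate ** i for i in range(draft_length + 1))


if __name__ == "__main__":
    rate = 0.8  # hypothetical acceptance rate
    for k in (1, 4, 8):
        emitted = expected_tokens_per_pass(rate, k)
        print(f"draft {k} tokens -> {emitted:.2f} emitted per pass, "
              f"{k + 1} positions verified")
```

Longer drafts raise single-stream speed, but each pass verifies k + 1 positions while emitting fewer than that on average, so the extra speed is paid for with compute that could otherwise serve other requests.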

HarHarVeryFunny 1 hour ago | |

Maybe they only have a finite number of racks ;-)

slaw 1 hour ago | |

Chinese companies are blocked from buying modern ASML lithography machines. The most modern scanner China is still allowed to buy is NXT:1980i from 2015.

minraws 1 hour ago |

Assuming they mean 8xA100 or similar, that's some rather insane performance, and at just 3x the cost, it's still quite cheap-ish. With some optimisations this might be quite interesting.

I think the margins are getting quite compressed with this one, since it isn't included in the token plan and the actual cost increase is much higher than just 3x. But still fairly decent.

throwa356262 1 hour ago | |

Suspect this will be included once out of beta but at a higher credit/token ratio.

Remember, these guys are not VC backed. Anything they do must break even

JayStavis 1 hour ago | | |

> must break even

Understand the spirit of this, but probably not true. I don't think Xiaomi, or any big tech company, needs to break even on their new model releases.

varispeed 1 hour ago | | |

Chinese "companies" are not companies in the western sense, but more like government departments with capitalist styling to deceive the western audience.

From that point of view, they have as much money as they need. That's why there is no "VC", because Chinese government assumes that role.

Qdulf 1 hour ago | |

Must be Blackwell for native fp4 support.

isusmelj 1 hour ago |

No note about the specific GPU they use. One might speculate. B200? H200? H100?

PhunkyPhil 56 minutes ago |

Obligatory taalas mention:

https://taalas.com/

Despite the performative UI components they have a shipped (demo) product:

https://chatjimmy.ai/

This is only 3.1 8B with a very small context window, but at 17k tokens per second it's likely enough to reliably call tools, which would make a huge difference in agentic applications. Assuming they can bake in better models, I'm just as bullish or even more so on this, considering it opens up edge computing at an extremely low power requirement.

High tok/s is the future IMO.

pullshark91 1 hour ago |

It's interesting but not game-changing IMO. Speed here is not a bottleneck.

npn 1 hour ago |

How?

edit: now that I've read the article fully, it seems they use some very effective MTP algorithm, and somehow the quality is still decent enough.

Though I doubt the quality really only drops a bit like they claim. Maybe on the benchmarks, but in general use heavily quantized models very often show worse results.

lostmsu 1 hour ago | |

They say they are using https://github.com/tile-ai/TileRT

- persistent CUDA kernel

- tiled processing with overlapping read/writes

- model designed with specific constraints in mind

h14h 1 hour ago |

The gated "ultra-speed" phenomenon seen here and with the Cerebras Kimi K2.6 release, while understandable, is somewhat troubling IMO.

Getting ~1000 TPS on near-frontier intelligence is a step change, and enables whole new use-cases for applications. Seeing limited compute resources beget selective access makes me worry for the future of competition.

__natty__ 1 hour ago |

With this at 1k tps and Kimi 2.6 1k tps by Cerebras, I believe we are entering the next stage of LLMs, where companies will also compete on throughput

desireco42 54 minutes ago |

I haven't used their pro speed tier, but regular Mimo-v2.5, not even Pro, seems really fast. I have plenty of tokens and subscriptions, but this is really impressive. I really don't need another one, but I am tempted simply because it works so fast; I can't imagine how fast this service can be.

trilogic 1 hour ago |

Pfff, time wasting.

1. Password between 8-16 characters, and this and that... What???
2. Captcha after captcha, come on.
3. "Service unavailable. This service is not available in your region yet."

Are you kidding me? Come back when you are ready for users. I was hoping to try it, what a frustration.

qsera 1 hour ago |

Tokens per second is the "Megapixels" of AI marketing!

Octoth0rpe 1 hour ago | |

I mean, sure, in the sense that it's a real and meaningful number for most of the spectrum on offer, and only gets silly when the number gets too high? There's a pretty big usability difference between 10t/s and 100t/s, and I can imagine similarly for 100->1000. I don't know about > 1000, but let's not pretend that the number is meaningless.

elar_verole 1 hour ago |

Yeah, this seems to be the easiest path to overall agent efficiency in the short term

moffkalast 1 hour ago |

42B active params, sliding window attention. There's your tradeoff.

vlovich123 1 hour ago | |

Sliding window for the draft model, not for the main. 42B for active params because it’s a sparse MoE which is a common technique for the larger models to not get bottlenecked by memory bandwidth.
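The memory-bandwidth point can be made concrete with a rough upper bound: each decoded token has to read at least the active weights once, so single-stream tokens/s is at most aggregate bandwidth divided by active-weight bytes. The 42B active parameters and the 8 GPUs come from this thread; FP4 weights (0.5 byte per parameter) and 8 TB/s per GPU are assumptions, since the GPU model is not stated. The bound ignores KV-cache reads, interconnect, and multi-token prediction.

```python
def decode_tokens_per_second_bound(active_params: float, bytes_per_param: float,
                                   aggregate_bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only upper bound on single-stream decode speed."""
    if active_params <= 0 or bytes_per_param <= 0:
        raise ValueError("active_params and bytes_per_param must be positive")
    bytes_read_per_token = active_params * bytes_per_param
    return aggregate_bandwidth_bytes_per_s / bytes_read_per_token


if __name__ == "__main__":
    bandwidth = 8 * 8e12  # 8 GPUs at an assumed 8 TB/s each
    sparse = decode_tokens_per_second_bound(42e9, 0.5, bandwidth)  # 42B active (MoE)
    dense = decode_tokens_per_second_bound(1e12, 0.5, bandwidth)   # all ~1T params read
    print(f"sparse MoE bound: {sparse:.0f} tok/s, dense 1T bound: {dense:.0f} tok/s")
```

Under these assumptions the sparse bound is roughly 3,000 tok/s against roughly 130 tok/s if all ~1T parameters had to be read per token, which is the gap that makes a sparse MoE attractive at this size.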

moffkalast 1 hour ago | | |

Seems to be for both according to the spec [0], maybe it's wrong though.

128 sounds really tiny, I wonder if they mean some kind of blocks?

[0] https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#4...

bearjaws 1 hour ago | |

Given how "smart" some of the 26b dense models are now, I would not be surprised to see a strong 40b MoE.

holoduke 1 hour ago |

Speed is indeed the next big thing that should happen with frontier LLMs. Current models running 1000 times faster would be super useful. Earlier this week it took Claude at least a full week, full time, with two Max subscriptions to solve a complex issue where we wanted to mimic an occlusion mapping variant used in the game Crimson Desert. Pretty complex mathematical challenge. With an ultra-fast LLM and a proper self-verification process it would be awesome.

harel 1 hour ago |

There are a few things in life whose appeal I can't fully grasp. One is the constant need to exhibit growth. As if being massive and staying massive is not good enough, one has to always and continuously grow. The other is constant speed increases. We're already operating at 50x speed. My output is much wider and so much faster that I am sometimes my own bottleneck. And now, as if that is not enough, we want more speed. "I want a full software product from scratch in 12 seconds, because 5 minutes is too long and I got things to do..."

Really?

sidrag22 1 hour ago | |

different use cases for different people. some people are nurturing a code base and ensuring it doesn't become a gross mess, so they become the bottleneck. some people are just trying to prompt stuff into existence and don't know what sql is.

I think this site often overlooks that second group and how large it likely is.

philipkglass 1 hour ago | |

I remember when I had to wait minutes to get a high resolution image over a dialup connection. When computer and communications hardware advanced enough that I could get 30 high resolution images every second, there were brand new uses. In the case of LLMs, I could imagine that much faster operations allow you to introduce them as parts of systems that need to react to the real world at high speed, like factory equipment. Showing that a model can do the usual LLM tasks at extremely high speed is just a demo proving that the approach works.

harel 1 hour ago | | |

The example in the video was the generation of a dashboard app of some sort. I can do that with a "normal speed" Claude in a few minutes. The difference is a few minutes. This is compared to a few weeks in old school development time. I don't have a problem with taking it a little "slow" (as in, a few minutes) and lending my thought to it rather than just going for fast generation and who knows what's inside. I get your use case, but this is a specialised one, and not the one 90% of people will think of - everyone wants that fast app in 12 seconds... Or so it seems from me being downvoted on that comment.

GaggiX 1 hour ago |

If MiMo v2.5 Pro can run at >1000tk/s on GPUs then I will soon expect the same from OpenAI/Anthropic/Google.

slopinthebag 1 hour ago |

I hope this is the next frontier AI labs push. Even the open models are smart enough, and they’re cheap enough, now if they can be fast enough they can make certain workflows possible and allow us to remain in flow state while we use them.

m00dy 1 hour ago |

boom!

atemerev 1 hour ago |

I test all Chinese models with the prompt "What happened on Tiananmen Square on June 4th, 1989?". MiMo-2.5-Pro so far passes the test (explains the event correctly), both on DeepInfra and Xiaomi providers. So not bad.