MiniMax M2.5 released: 80.2% in SWE-bench Verified

I hope better and cheaper models will be widely available because competition is good for the business. However, I'm more cautious about benchmark claims. MiniMax 2.1 is decent, but one can really not call it smart. The more critical issue is that MiniMax 2 and 2.1 have the strong tendency to reward hacking, often write nonsensical test report while the tests actually failed. And sometimes it changed the existing code base to make its new code "pass", when it actually should fix its own code instead.

Artificial Analysis put MiniMax 2.1 Coding index on 33, far behind frontier models and I feel it's about right. [1]

[1] https://artificialanalysis.ai/models/minimax-m2-1

Everyone is using this sort of let me group the plots weirdly instead of sorting them to make harder to compare. I see you folks

Pelican is recognizable but not great, bicycle frame is missing a bar: https://gist.github.com/simonw/61b7953f29a0b7fee1f232f6d9826...

Really looked forward to this release as MiniMax M2.1 is currently my most used model thanks to it being fast, cheap and excellent at tool calling. Whilst I still use Antigravity + Claude for development, I reach for MiniMax first in my AI workflows, GLM for code tasks and Kimi K2.5 when deep English analysis is needed.

Not self-hosting yet, but I prefer using Chinese OSS models for AI workflows because of the potential to self-host in future if needed. Also using it to power my openclaw assistant since IMO it has the best balance of speed, quality and cost:

> It costs just $1 to run the model continuously for an hour at 100 tokens/sec. At 50 tokens/sec, the cost drops to $0.30.

This is cool, but they mentioned affordability, and said this is about $1/hour to run, which is about what I pay for claude code on $200/mo plan. This is not literally true, sometimes I'm running up to 3 concurrent intermittently throughout the day for maybe 60 hours per week.

So I do believe if there is something that comes up that is literally continuous, would be interesting, but I'm not sure about it right now. I would be curious if anyone has anything they would literally use running 24/7.

Hm. The benchmarks look too good to be true and a lot of the things they say about the way they train this model sound interesting, but it's hard to say how actually novel they are. Generally, I sort of calibrate how much salt I take benchmarks with based on the objective properties of the model and my past experiences with models from the same lab.

For instance,

I'm inclined to generally believe Kimi K2.5's benchmarks, because I've found that their models tend to be extremely good qualitatively and feel actually well-rounded and intelligent instead of brittle and bench-maxed.

I'm inclined to give GLM 5 some benefit of the doubt, because while I think their past benchmarks have overstated their models' capabilities, I've also found their models relatively competent, and they 2X'd the size of their models, as well as introduced a new architecture and raised the number of active parameters, which makes me feel like there is a possibility they could actually meet the benchmarks they are claiming.

Meanwhile, I've never found MiniMax remotely competent. It's always been extremely brittle, tended to screw up edits and misformat even simple JavaScript code, get into error loops, and quickly get context rot. And it's also simply just too small, in my opinion, to see the kind of performance they are claiming.

Wish my company allowed more of these LLMs through Github Copilot, stuck with OpenAi, Anthropic and Google LLMs where they burn my credit one week into the month

Wouldn't it be nice if we have language specific llms that work on average computers.

Like LLM that only trained on Python 3+, certain frameworks, certain code repos. Then you can use a different model for searching the internet to implement different things to cut down on costs.

Maybe I have no idea what I'm talking about lol

> M2.5-Lightning [...] costs $0.3 per million input tokens and $2.4 per million output tokens. M2.5 [...] costs half that. Both model versions support caching. Based on output price, the cost of M2.5 is one-tenth to one-twentieth that of Opus, Gemini 3 Pro, and GPT-5.

Huge - if not groundbreaking - if the benchmark stats are true.

I wonder if these are starting to get reasonable enough to use locally?

A reasonably sized OSS model that's this good at coding is a HUGE step forward.

We've done some vibe checks on it with OpenHands and it indeed performs roughly as good as Sonnet 4.5.

OSS models are catching up

And it's available on their coding plans, even the cheapest one.

With the GLM news yesterday and now this, I'd love to try out one of these models, but I'm pretty tied to my Claude Code workflow. I see there's a workaround for GLM, but how are people utilizing MiniMax, especially for coding?

> It costs just $1 to run the model continuously for an hour at 100 tokens/sec. At 50 tokens/sec, the cost drops to $0.30.

!!!!!! Incredibly cheap!!!!!

I'll have to look for it in OpenRouter.

!!!!!! Incredibly cheap!!!!!

I'll have to look for it in OpenRouter.

For the moment is free in Opencode, if you want ot try it.

Everyone is using this sort of let me group the plots weirdly instead of sorting them to make harder to compare. I see you folks

For instance,

Wish my company allowed more of these LLMs through Github Copilot, stuck with OpenAi, Anthropic and Google LLMs where they burn my credit one week into the month

And it's available on their coding plans, even the cheapest one.

Pelican is recognizable but not great, bicycle frame is missing a bar: https://gist.github.com/simonw/61b7953f29a0b7fee1f232f6d9826...

You should switch to an octopus riding a bike, much harder.

Artificial Analysis put MiniMax 2.1 Coding index on 33, far behind frontier models and I feel it's about right. [1]

[1] https://artificialanalysis.ai/models/minimax-m2-1

> And sometimes it changed the existing code base to make its new code "pass", when it actually should fix its own code instead.

I haven’t tried MiniMax, but GPT-5.2-Codex has this problem. Yesterday I watched it observe a Python type error (variable declared with explicit incorrect type — fix was trivial), and it added a cast. (“cast” is Python speak for “override typing for this expression”.) I told it to fix it for real and not use cast. So it started sprinkling Any around the program (“Any” is awful Python speak for “don’t even try to understand this value and don’t warn either”).

That's what I found with some of these LLM models as well. For example I still like to test those models with algorithm problems, and sometimes when they can't actually solve the problem, they will start to hardcode the test cases into the algorithm itself.. Even DeepSeek was doing this at some point, and some of the most recent ones still do this.

MiniMax 2.1 didn't really work for my data-parsing tasks, a lot of errors.

Instead, this one works surprisingly well for the cost: https://openrouter.ai/xiaomi/mimo-v2-flash

Huge - if not groundbreaking - if the benchmark stats are true.

Wouldn't it be nice if we have language specific llms that work on average computers.

Like LLM that only trained on Python 3+, certain frameworks, certain code repos. Then you can use a different model for searching the internet to implement different things to cut down on costs.

Maybe I have no idea what I'm talking about lol

A reasonably sized OSS model that's this good at coding is a HUGE step forward.

We've done some vibe checks on it with OpenHands and it indeed performs roughly as good as Sonnet 4.5.

OSS models are catching up

I wonder if these are starting to get reasonable enough to use locally?

yes it's good. But you should also look at GLM 5 and Kimi K2.5 when looking at M2.5. It's amazing we have so many good and cheap open weight models now which are really not far behind the top models from the big US AI companies.

Anthropic Claude Code and OpenAI Codex plans are subsidised.

The Chinese open weight models hosted in US or Europe make more sense to use when you want to stay model agnostic and less dependent on a single AI company with relative expensive APIs.

Cost per token doesn't really matter anymore, cost per task it more important.

I imagine some sort of distill like this would be possible, but I think multi-language training really helps the LLM.

you can use Claude Code with these models. You just need to pass the right env vars. Have a look at the client setup guide on z.ai

I use Opencode, when the model is free for the moment. I have not used Claude Code so I cannot compare.

anything with an open ai compatible endpoint can have claude code router put in front of it afaik https://github.com/musistudio/claude-code-router

> And sometimes it changed the existing code base to make its new code "pass", when it actually should fix its own code instead.

MiniMax 2.1 didn't really work for my data-parsing tasks, a lot of errors.

Instead, this one works surprisingly well for the cost: https://openrouter.ai/xiaomi/mimo-v2-flash

I use Opencode, when the model is free for the moment. I have not used Claude Code so I cannot compare.

anything with an open ai compatible endpoint can have claude code router put in front of it afaik https://github.com/musistudio/claude-code-router

You should switch to an octopus riding a bike, much harder.

Not an SVG, but I'm pretty impressed by what Gemini 3.0 Fast does: https://gemini.google.com/share/52c1229bd1d9

/imagine an svg of an octopus riding a bike. 1 arm shading its eyes from the sun, another waving a cute white flag, 2 driving the bike, 2 peddling the wheels, and 2 drifting behind in the wind

also much less in training data by now

Sounds exactly what a junior-dev would do without proper guidance. Could better direction in the prompts help? I find I frequently have to tell it where to put what fixes. IME they make a lot of spaghetti (LLMs and juniors)

I have asked GLM4.7 in opencode to make an application to basically filter a couple of spatial datasets hosted at a url I provided it, and instead of trying to download read the dataset, it just read the url, assumed what the datasets were (and got it wrong) is and it's shape (and got it wrong) and the fields (and got it wrong) and just built an application based on vibes that was completely unfixable.

It wrote an extensive test suite on just fake data and then said the app is perfectly working as all tests passed.

This is a model that was supposed to match sonnet 4.5 in benchmarks. I don't think sonnet would be that dumb.

I use LLMs a lot to code, but these chinese models don't match anthropic and openai in being able to decide stuff for themselves. They work well if you give them explicit instructions that leaves little for it to mess up, but we are slowly approaching where OpenAI and anthropic models will make the right decisions on their own

you can use Claude Code with these models. You just need to pass the right env vars. Have a look at the client setup guide on z.ai

Interesting—thanks!

Cost per token doesn't really matter anymore, cost per task it more important.

Anthropic Claude Code and OpenAI Codex plans are subsidised.

The Chinese open weight models hosted in US or Europe make more sense to use when you want to stay model agnostic and less dependent on a single AI company with relative expensive APIs.

I imagine some sort of distill like this would be possible, but I think multi-language training really helps the LLM.

also much less in training data by now

Not an SVG, but I'm pretty impressed by what Gemini 3.0 Fast does: https://gemini.google.com/share/52c1229bd1d9

/imagine an svg of an octopus riding a bike. 1 arm shading its eyes from the sun, another waving a cute white flag, 2 driving the bike, 2 peddling the wheels, and 2 drifting behind in the wind

I think part of the point is that the SVG is the hard part. Gemini is quite good at generating images, but it’s trained to generate raster images.

wtf kinda juniors are you interacting with

Maybe the Juniors you have seen are actually retarded?

It wrote an extensive test suite on just fake data and then said the app is perfectly working as all tests passed.

This is a model that was supposed to match sonnet 4.5 in benchmarks. I don't think sonnet would be that dumb.

Interesting—thanks!

this aligns perfecly with my experience, but of course, the discourse on X and other forums are filled with people who are not hands on. Marketing is first out of the gate. These models are not yet good enough to be put through a long coding session. They are getting better though! GLM 4.7 and Kimi 2.5 are alright.

It really is infuriatingly dumb; like a junior who does not know English. Indeed, it often transitions into Chinese.

Just now it added some stuff to a file starting at L30 and I said "that one line L30 will do remove the rest", it interpreted 'the rest' as the file, and not what it added.

I think part of the point is that the SVG is the hard part. Gemini is quite good at generating images, but it’s trained to generate raster images.

wtf kinda juniors are you interacting with

Lots of self-taught; looking for an entry level.

Maybe the Juniors you have seen are actually retarded?

It really is infuriatingly dumb; like a junior who does not know English. Indeed, it often transitions into Chinese.

Just now it added some stuff to a file starting at L30 and I said "that one line L30 will do remove the rest", it interpreted 'the rest' as the file, and not what it added.

Lots of self-taught; looking for an entry level.

2026.2.12

Today we're introducing our latest model, MiniMax-M2.5.

Extensively trained with reinforcement learning in hundreds of thousands of complex real-world environments, M2.5 is SOTA in coding, agentic tool use and search, office work, and a range of other economically valuable tasks, boasting scores of 80.2% in SWE-Bench Verified, 51.3% in Multi-SWE-Bench, and 76.3% in BrowseComp (with context management).

Trained to reason efficiently and decompose tasks optimally, M2.5 exhibits tremendous speed in performing complicated agentic tasks, completing the SWE-Bench Verified evaluation 37% faster than M2.1, matching the speed of Claude Opus 4.6.

M2.5 is the first frontier model where users do not need to worry about cost, delivering on the promise of intelligence too cheap to meter. It costs just $1 to run the model continuously for an hour at a rate of 100 tokens per second. At 50 tokens per second, the cost drops to $0.30. We hope that the speed and cost effectiveness of M2.5 enable innovative new agentic applications.

Coding

In programming evaluations, MiniMax-M2.5 saw substantial improvements compared to previous generations, reaching SOTA levels. The performance of M2.5 in multilingual coding tasks is especially pronounced.

A significant improvement from previous generations is M2.5's ability to think and plan like an architect. The Spec-writing tendency of the model emerged during training: before writing any code, M2.5 actively decomposes and plans the features, structure, and UI design of the project from the perspective of an experienced software architect.

M2.5 was trained on over 10 languages (including Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, and Ruby) across more than 200,000 real-world environments. Going far beyond bug-fixing, M2.5 delivers reliable performance across the entire development lifecycle of complex systems: from 0-to-1 system design and environment setup, to 1-to-10 system development, to 10-to-90 feature iteration, and finally 90-to-100 comprehensive code review and system testing. It covers full-stack projects spanning multiple platforms including Web, Android, iOS, and Windows, encompassing server-side APIs, business logic, databases, and more, not just frontend webpage demos.

To evaluate these capabilities, we also upgraded the VIBE benchmark to a more complex and challenging Pro version, significantly increasing task complexity, domain coverage, and evaluation accuracy. Overall, M2.5 performs on par with Opus 4.5.

We focused on the model's ability to generalize across out-of-distribution harnesses. We tested performance on the SWE-Bench Verified evaluation set using different coding agent harnesses.

On Droid: 79.7(M2.5) > 78.9(Opus 4.6)
On OpenCode: 76.1(M2.5) > 75.9(Opus 4.6)

Search and Tool calling

Effective tool calling and search are prerequisites for a model's ability to autonomously handle more complex tasks. In evaluations on benchmarks such as BrowseComp and Wide Search, M2.5 achieved industry-leading performance. At the same time, the model's generalization has also improved — M2.5 demonstrates more stable performance when facing unfamiliar scaffolding environments.

In research tasks performed by professional human experts, using a search engine is only a small part of the process; most of the work involves deep exploration across information-dense webpages. To address this, we built RISE (Realistic Interactive Search Evaluation) to measure a model's search capabilities on real-world professional tasks. The results show that M2.5 excels at expert-level search tasks in real-world settings.

Compared to its predecessors, M2.5 also demonstrates much better decision-making when handling agentic tasks: it has learned to solve problems with more precise search rounds and better token efficiency. For example, across multiple agentic tasks including BrowseComp, Wide Search, and RISE, M2.5 achieved better results with fewer rounds, using approximately 20% fewer rounds compared to M2.1. This indicates that the model is no longer just getting the answer right, but is also reasoning towards results in more efficient paths.

Office work

M2.5 was trained to produce truly deliverable outputs in office scenarios. To this end, we engaged in thorough collaboration with senior professionals in fields such as finance, law, and social sciences. They designed requirements, provided feedback, participated in defining standards, and directly contributed to data construction, bringing the tacit knowledge of their industries into the model's training pipeline. Based on this foundation, M2.5 has achieved significant capability improvements in high-value workspace scenarios such as Word, PowerPoint, and Excel financial modeling. On the evaluation side, we built an internal Cowork Agent evaluation framework (GDPval-MM) that assesses both the quality of the deliverable and the professionalism of the agent's trajectory through pairwise comparisons, while also monitoring token costs across the entire workflow to estimate the model's real-world productivity gains. In comparisons against other mainstream models, it achieved an average win rate of 59.0%.

Efficiency

Because the real world is full of deadlines and time constraints, task completion speed is a practical necessity. The time it takes a model to complete a task depends on its task decomposition effectiveness, token efficiency, and inference speed. M2.5 is served natively at a rate of 100 tokens per second, which is nearly twice that of other frontier models. Further, our reinforcement learning setup incentivizes the model to reason efficiently and break down tasks optimally. Due to these three factors, M2.5 delivers a significant time savings in complex task completion.

For example, when running SWE-Bench Verified, M2.5 consumed an average of 3.52 million tokens per task. In comparison, M2.1 consumed 3.72M tokens. Meanwhile, thanks to improvements in capabilities such as parallel tool calling, the end-to-end runtime decreased from an average of 31.3 minutes to 22.8 minutes, representing a 37% speed improvement. This runtime is on par with Claude Opus 4.6's 22.9 minutes, while the total cost per task is only 10% that of Claude Opus 4.6.

Cost

Our goal in designing the M2-series of foundation models is to power complex agents without having to worry about cost. We believe that M2.5 is close to realizing this goal. We’re releasing two versions of the model, M2.5 and M2.5-Lightning, that are identical in capability but differ in speed. M2.5-Lightning has a steady throughput of 100 tokens per second, which is two times faster than other frontier models, and costs $0.3 per million input tokens and $2.4 per million output tokens. M2.5, which has a throughput of 50 tokens per second, costs half that. Both model versions support caching. Based on output price, the cost of M2.5 is one-tenth to one-twentieth that of Opus, Gemini 3 Pro, and GPT-5.

At a rate of 100 output tokens per second, running M2.5 continuously for an hour costs $1. At a rate of 50 TPS, the price drops to $0.3. To put that into perspective, you can have four M2.5 instances running continuously for an entire year for $10,000. We believe that M2.5 provides virtually limitless possibilities for the development and operation of agents in the economy. For the M2-series, the only problem that remains is how to continually push the frontier of model capability.

Improvement Rate

Over the three and a half months from late October to now, we have successively released M2, M2.1, and M2.5, with the pace of model improvement exceeding our original expectations. For instance, in the highly-regarded SWE-Bench Verified benchmark, the rate of progress of the M2-series has been significantly faster than that of peers such as the Claude, GPT, and Gemini model families.

RL Scaling

One of the key drivers of the aforementioned developments is the scaling of reinforcement learning. As we train our models, we also benefit from their abilities. Most of the tasks and workspaces that we perform in our company have been made into training environments for RL. To date, there are already hundreds of thousands of such environments. At the same time, we did plenty of work on our agentic RL framework, algorithms, reward signals, and infrastructure engineering to support the continued scaling of our RL training.

Forge –– Agent-Native RL Framework

We designed an agent-native RL framework in-house, called Forge, which introduces an intermediary layer that fully decouples the underlying training-inference engine from the agent, supporting the integration of arbitrary agents and enabling us to optimize the model's generalization across agent scaffolds and tools. To improve system throughput, we optimized asynchronous scheduling strategies to balance system throughput against sample off-policyness, and designed a tree-structured merging strategy for training samples, achieving approximately 40x training speedup.

Agentic RL Algorithm and Reward Design

On the algorithm side, we continued using the CISPO algorithm we proposed at the beginning of last year to ensure the stability of MoE models during large-scale training. To address the credit assignment challenge posed by long contexts in agent rollouts, we introduced a process reward mechanism for end-to-end monitoring of generation quality. Furthermore, to deeply align with user experience, we evaluated task completion time through agent trajectories, achieving an optimal trade-off between model intelligence and response speed.

We will release a more comprehensive introduction to RL scaling soon in a separate technical blogpost.

MiniMax Agent: M2.5 as a Professional Employee

M2.5 has been fully deployed in MiniMax Agent, delivering the best agentic experience.

We have distilled core information-processing capabilities into standardized Office Skills deeply integrated within MiniMax Agent. In MAX mode, when handling tasks such as Word formatting, PowerPoint editing, and Excel calculations, MiniMax Agent automatically loads the corresponding Office Skills based on file type, improving the quality of task outputs.

Furthermore, users can combine Office Skills with domain-specific industry expertise to create reusable Experts tailored to specific task scenarios.

Take industry research as an example: by merging a mature research framework SOP (standard operating procedure) with Word Skills, the Agent can strictly follow the established framework to automatically fetch data, organize analytical logic, and output properly formatted research reports — rather than merely generating a raw block of text. In financial modeling scenarios, by combining an organization's proprietary modeling standards with Excel Skills, the Agent can follow specific risk control logic and calculation standards to automatically generate and validate complex financial models, rather than simply outputting a basic spreadsheet.

To date, users have built over 10,000 Experts on MiniMax Agent, and this number is still growing rapidly. MiniMax has also built multiple sets of deeply optimized, ready-to-use Expert suites on MiniMax Agent for high-frequency scenarios such as office work, finance, and programming.

MiniMax itself has been among the first to benefit from M2.5's capabilities. Throughout the company's daily operations, 30% of overall tasks are autonomously completed by M2.5, spanning functions including R&D, product, sales, HR, and finance — and the penetration rate continues to rise. Performance in coding scenarios has been particularly notable, with M2.5-generated code accounting for 80% of newly committed code.

Appendix

Further benchmark results of M2.5:

Evaluation methods:

_* SWE benchmark: SWE-bench Verified, SWE-bench Multilingual, SWE-bench-pro, and Multi-SWE-bench were tested on internal infrastructure using Claude Code as the scaffolding, with the default system prompt overridden, and results averaged over 4 runs. Additionally, SWE-bench Verified was also evaluated on the Droid and Opencode scaffoldings using the default prompt.

Terminal Bench 2: We tested Terminal Bench 2 using Claude Code 2.0.64 as the evaluation scaffolding. We modified the Dockerfiles of some problems to ensure the correctness of the problems themselves, uniformly expanded sandbox specifications to 8-core CPU and 16 GB memory, set the timeout uniformly to 7,200 seconds, and equipped each problem with a basic toolset (ps, curl, git, etc.). While not retrying on timeouts, we added a detection mechanism for empty scaffolding responses, retrying tasks whose final response was empty to handle various abnormal interruption scenarios. Final results are averaged over 4 runs.
VIBE-Pro: Internal benchmark. Uses Claude Code as the scaffolding to automatically verify the interaction logic and visual effects of programs. All scores are computed through a unified pipeline that includes a requirements set, containerized deployment, and a dynamic interaction environment. Final results are averaged over 3 runs.
BrowseComp: Uses the same agent framework as WebExplorer (Liu et al., 2025). When token usage exceeds 30% of the maximum context, all history is discarded.
Wide Search: Uses the same agent framework as WebExplorer (Liu et al., 2025).
RISE: Internal benchmark. Contains real questions from human experts, evaluating the model's multi-step information retrieval and reasoning capabilities when combined with complex web interactions. A Playwright-based browser tool suite is added on top of the WebExplorer (Liu et al., 2025) agent framework.
GDPval-MM: Internal benchmark. Based on the open-source GDPval test set, using a custom agentic evaluation framework where an LLM-as-a-judge performs pairwise win/tie/loss judgments on complete trajectories. Average token cost per task is calculated based on each vendor's official API pricing (without caching).
MEWC: Internal benchmark. Built on MEWC (Microsoft Excel World Championship), comprising 179 problems from the main and other regional divisions of Excel esports competitions from 2021–2026. It evaluates the model's ability to understand competition Excel spreadsheets and use Excel tools to complete problems. Scores are calculated by comparing output and answer cell values one by one.
Finance Modeling: Internal benchmark. Primarily contains financial modeling problems constructed by industry experts, involving end-to-end research and analysis tasks performed via Excel tools. Each problem is scored using expert-designed rubrics. Final results are averaged over 3 runs.
AIME25 ~ AA-LCR: Obtained through internal testing based on the public evaluation sets and evaluation methods covered by the Artificial Analysis Intelligence Index leaderboard._

Hacker Times