Well still no list nor publication of the training data.
And this certainly won't bring me back to GitHub Copilot, which I cancelled yesterday.
GitHub Copilot had competitive pricing until yesterday when they changed from per-request to one of the most expensive per-token quotas. Seriously, take a look at their burning subreddit for some laughs: https://www.reddit.com/r/GithubCopilot
I have since changed to DeepSeek Flash on high, which is Sonnet+ level for almost free.
If I feel I still need smarter models I might sign up for $20/mo Codex to use GPT 5.5 which, in my opinion, is the best I can access right now.
That scroll effect is jank city for me (yeah yeah works fine in Chrome/Edge).
Why not sell it as a math agent? Why do I have to set up 4 agents to check each others' work?
Performance doesn't seem that good:
- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench pro
- Qwen3.6-35B-A3B = 49.5% on SWE-bench pro (https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
They benchmark against Claude Haiku but Haiku is not good, it's worse than tiny open models you can run locally or via API at 10% the cost.
https://microsoft.ai/news/introducingmai-code-1-flash/
and the model card
https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF
The broader announcement of 7 MAI models seems to be where the 5B active in the title comes from
https://microsoft.ai/news/building-a-hillclimbing-machine-la...
MAI-Thinking-1 - https://news.ycombinator.com/item?id=48374362 - June 2026 (64 comments)
While the scores are not good compared to other open-weight models, the important thing to note is that their training data (as they claim) is very clean, without any synthetic datasets.
That sounds like something you say when you don't benchmark well
Why not assign them to make Windows good :D
I suppose if you're reeling at the new Copilot bill but want to stay in their ecosystem, this gives you something to use, but for most folks, there's a plethora of better options.
But with Copilot now just charging per-token prices, I don't see how this is competitive with Chinese models.
It is probably telling you can't find the costs in the announcement. Because Input $0.75 Cached input $0.075 Output $4.50 might be competitive with Haiku, but nobody in their right mind uses Haiku and Anthropic has abandoned it chasing the tokenmaxers who aren't thinking about budgets.
So I guess they are aiming at corporate customers who are bound to Microsoft through compliance approval, who will soon start seeing their budgets explode, and who will have to find some corporate compromise.
When I need a light model, I reach for Sonnet. It is nearly free on the max plans, and quite fast. I don't see a place for Haiku in regular coding.
Haiku I guess is when you need summarization/categorization at scale.
Microsoft setting Haiku as the benchmark is a low bar.
Seems like the work from a good system design to code is practically solved.
Now it’s a matter of the design of the system. Or is that represented in these evals?
But it seems like, by and large, even the faster models are now aimed at longer-running agentic flows and not sub-1s autocomplete. Or am I wrong about that?
And, DeepSeek and MiMo perform much better than Haiku and Sonnet, near Opus/GPT 5.5 levels, at a fraction of the cost.
There's seemingly no reason to ever use Haiku or Sonnet, if you're not getting it for free or as part of a subscription (that you don't usually saturate).
As such, Haiku isn’t a waste of my time; it saves me enormous amounts of time. But I spent a large amount of time building the orchestration system up front and iterating on it to get here. Interestingly, I found that my experience as a director and later a distinguished engineer gave me the tools to build it and get it working well and reliably end to end: the dynamics of multi-agent workflows of varying capability are not much different from the dynamics of a 1000-engineer organization.
is a funny oxymoron
https://microsoft.ai/wp-content/uploads/2026/06/main_2026060...
Even if I had no idea, going with the default suggestion would not be a terrible mistake, assuming you did describe your requirements relatively well.
Unless of course we’re thinking Copilot will be more expensive than others longer term. But is that a reasonable assumption?
I was hoping Microsoft would make it open weights, as they have done for years with the Phi models.
The era of big tech releasing models into the wild might be over, which IMO is counter-productive, as we are shifting from "the model is the product" to "the harness is the product"
(gestures wildly while changing lanes in his Fiat 500)
Sonnet might have more knowledge and is maybe good for making excel sheets, but it does not write good code and does not follow instructions well.
But 27b Q8 needs a very beefy PC (48GB VRAM or more), so it is not an option many people can use and DS4F is so cheap right now, if you are open to externally hosted models.
These things can be useful if you can accurately predict which tasks they will reliably do, and which they will usually fail on. Then you can get much more reliable work from them.
In my tests, open-weight Qwens and GLM are way better than it.
I personally wouldn't use models of that class directly, though - I'd use them in a harness as a "backend" for more capable models. And Haiku itself, as opposed to other smaller models, is still expensive.
[1] https://srinathh.medium.com/mid-size-local-models-are-now-co...
Yeah, not a 5B param model as the earlier title implied!
The early testers have confirmed that it is much better than all earlier US open weights models, but it is not as good as the best Chinese open weights models.
While Nemotron 3 Ultra is not the smartest open weights LLM, it is well optimized for fast inference, so it is much faster than the other LLMs of the same size.
In any case I believe that it is very good to have an additional option in big open weights LLMs, because until now all existing models have shown that even if some model is definitely better on average than another, the weaker model can still be better in some particular applications.
With open weights models, you can afford to try multiple LLMs for the more important tasks and then choose the best solution.
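A minimal sketch of that "try several, keep the best" idea. The model names, the stubbed `generate` callables and the toy `score` function are all hypothetical placeholders; in practice the score would come from something like a test suite or a reviewer model.

```python
from typing import Callable, Dict, Tuple


def pick_best(task: str,
              models: Dict[str, Callable[[str], str]],
              score: Callable[[str], float]) -> Tuple[str, str, float]:
    """Run the task on every model and return (name, answer, score) of the best one."""
    results = []
    for name, generate in models.items():
        answer = generate(task)
        results.append((name, answer, score(answer)))
    return max(results, key=lambda r: r[2])


# Hypothetical stand-ins for real model calls.
models = {
    "model-a": lambda task: "def add(a, b): return a - b",
    "model-b": lambda task: "def add(a, b): return a + b",
}


def score(code: str) -> float:
    """Toy scorer: fraction of unit checks the candidate code passes."""
    namespace = {}
    exec(code, namespace)
    checks = [(1, 2, 3), (0, 0, 0), (-1, 1, 0)]
    passed = sum(namespace["add"](a, b) == want for a, b, want in checks)
    return passed / len(checks)


best_name, best_answer, best_score = pick_best("write add(a, b)", models, score)
print(best_name, best_score)
```

The point of the design is that the selection step is independent of which models produced the candidates, so adding another open-weight model is one more dictionary entry.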
It is my belief that smaller models will get better and better, and even cloud SOTA models will shrink.
Yet another reason the current buildout will feel like the railroads.
On benchmarks: in the same VS Code harness, MAI-Code-1-Flash scored 51.2% on SWE-bench Pro vs. Haiku's 35.2% which we see as a pretty big leap. But going forward, we'll include additional models in our benchmarks, including models like Qwen 3.6 and Gemma 4.
Qwen3.6-35B-A3B vs Claude Haiku 4.5
reasoning mode · AA Intelligence Index v4.0
46.0 ┤ ↖ better — cheaper · smarter · faster
│
│
44.0 ┤ ╭─────╮
│ │ ● │ Qwen3.6-35B-A3B
│ ╰─────╯
42.0 ┤
│
│
40.0 ┤
│
│
38.0 ┤ ╭───╮
│ Claude Haiku 4.5 │ ○ │
│ ╰───╯
36.0 ┤
└┬─────────┬─────────┬─────────┬─────────┬────────┬
$200 $300 $400 $500 $600 $700
x → cost to run the index (USD) lower is better
y → AA intelligence index higher is better
bubble area = output speed (tokens / sec)
╭─────╮ ╭───╮
│ ● │ Qwen ~196 t/s │ ○ │ Haiku ~93 t/s
╰─────╯ ╰───╯
┌─────────────────────┬──────────┬──────────┬───────────┐
│ model │ AA index │ run cost │ out speed │
├─────────────────────┼──────────┼──────────┼───────────┤
│ Qwen3.6-35B-A3B ●│ 43.5 │ $280 │ 196 t/s │
│ Claude Haiku 4.5 ○│ 37.1 │ $620 │ 93 t/s │
└─────────────────────┴──────────┴──────────┴───────────┘
COST PER TOKEN ≠ COST PER TASK
output tokens per index run:
Haiku 4.5 87.3M (79.3M reasoning + 8.0M answer)
Qwen3.6 143.2M (131.7M reasoning + 11.5M answer)
→ Qwen emits 1.64× more output
── output speed (tokens / sec) ────────── raw rate · higher = faster
Qwen3.6 100% ~196 t/s
Haiku 4.5 ~47% ~93 t/s
→ Qwen ~2.1× faster per token
╎ 1.64× more tokens < 2.1× faster rate
▼
── solution speed (per finished answer) ── higher = faster
Qwen3.6 100%
Haiku 4.5 ~78%
→ Qwen ~1.3× FASTER to a solution
SCORECARD
intelligence cost / task speed to solution
Qwen3.6-35B-A3B 43.5 $280 ~1.3× faster
Claude Haiku 4.5 37.1 $620 (slower)
→ Qwen wins all three. The reasoning blow-up (1.64×) is smaller than
the raw-speed edge (2.1×), so Qwen stays ahead per task.
(() => {
  // Input events that scroll-hijacking scripts typically listen for.
  const KILL = ['wheel', 'mousewheel', 'DOMMouseScroll', 'touchmove'];
  // Stop propagation so handlers further along never run; the default
  // (native) scroll is not prevented.
  const block = e => e.stopImmediatePropagation();
  for (const t of KILL) {
    window.addEventListener(t, block, { capture: true, passive: true });
    document.addEventListener(t, block, { capture: true, passive: true });
  }
  // Drop the Lenis state classes from the root element.
  document.documentElement.classList.remove('lenis','lenis-smooth','lenis-scrolling','lenis-stopped');
  console.log('Scroll hijack disabled — native scrolling restored.');
})();
For tasks (like Kubernetes, Linux, reports, database exploration and such) I use GLM 5.1. Faster is actually smarter in those cases. And much cheaper too.
Opus 4.8 is for the unknown. Things I don't know how to do myself.
always has been
claude code has opusplan — uses opus while in plan mode, switches to sonnet for execution.
https://code.claude.com/docs/en/model-config#opusplan-model-...
edit: you can make it work with sonnet for planning, and haiku for execution, or any other combination you fancy to work with.
https://code.claude.com/docs/en/model-config#control-the-mod...
For simple features I don't have a full plan worked out. I write a bit of code, then tell the model in a one-line prompt what it should do. Sometimes I put temporary comments in the code to give it guidance. Generally, if the code change is within a single file or package, Haiku is good enough to follow what you ask and not mess up too much. I also have skills, built up over time, that give it guidance. There were some months when I used GitHub Copilot and had excess credits left at the end of the month that I frantically tried to use up.
Even the AI code completions can be pretty good on their own. Sometimes I write some temporary comments describing what the code should do and just press Tab-Tab-Tab and the entire function is done.
I think there is a tendency for people to go for the advanced models thinking they will screw up less, but if you really understand the code it's easier to do it interactively with a lesser model.
1. Step execution (Sonnet): Work for 30 minutes / 100k tokens at the direction of the Orchestrator
2. Review (Opus): Scrutinize the previous step's work for errors, fidelity to the instructions, fix those and record opportunities to improve the agent configuration + tools to reduce errors and token usage (record those to a file).
3. Self-improvement (Opus): Implement the highest impact self-improvement items that don't require user intervention.
Repeat: Until orchestrator session token budget exhausted (set it to 1M or whatever).
The underlying rationale is to keep each step manageable to maximize adherence to instructions and minimize cost (even cached tokens cost something). Prompt tokens are much cheaper than generated, so to the extent Opus mostly reviews rather than drives that saves a lot too. Self-improvement steps are very expensive but the improvements compound, if you're going to run a job for days or weeks it's way more expensive not to do them.
Edit: I do this in Claude Code with the Anthropic models as well as Qwen family models for offline use.
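A rough sketch of the execute / review / self-improve loop described above, with the model calls replaced by bookkeeping stubs. Only the 100k-token step cap and the 1M session budget come from the comment; the review and self-improvement costs, and all names here, are assumptions for illustration. This version stops before starting a cycle that would not fit in the remaining budget.

```python
from dataclasses import dataclass, field
from typing import List

STEP_TOKENS = 100_000      # per-step cap mentioned in the comment
REVIEW_TOKENS = 20_000     # assumed cost of one review pass
IMPROVE_TOKENS = 50_000    # assumed cost of one self-improvement pass


@dataclass
class Session:
    budget: int                                   # total tokens for the session
    spent: int = 0
    log: List[str] = field(default_factory=list)
    improvements: List[str] = field(default_factory=list)

    def charge(self, phase: str, model: str, tokens: int) -> None:
        """Record one phase; a real harness would call the model here."""
        self.spent += tokens
        self.log.append(f"{phase}({model})")


def run(session: Session) -> Session:
    cycle_cost = STEP_TOKENS + REVIEW_TOKENS + IMPROVE_TOKENS
    cycle = 0
    while session.spent + cycle_cost <= session.budget:
        cycle += 1
        session.charge("execute", "sonnet", STEP_TOKENS)    # 1. step execution
        session.charge("review", "opus", REVIEW_TOKENS)     # 2. review
        session.improvements.append(f"cycle {cycle}: note for agent config")
        session.charge("improve", "opus", IMPROVE_TOKENS)   # 3. self-improvement
    return session


session = run(Session(budget=1_000_000))
print(len(session.log) // 3, "cycles,", session.spent, "tokens spent")
```

With these assumed costs, a 1M budget fits five full cycles; how expensive the review and self-improvement phases really are would decide whether that split is worth it.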
Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
With Opus I can work, trust its designs, architecture suggestions, and code changes, even in a complex code base.
The smaller models seem to "try". They work for smaller tasks, but for more complex task it's often more work than doing it myself.
I wish it were different, and maybe in a year or two it will be.
Seriously tho, wtf is going on over at Meta? Anyone working there currently want to describe the vibe of the org when it comes to being a frontier company?
Here Microsoft is comparing against Claude Haiku, the smallest and least capable model from Anthropic.
Coding models are most useful when they perform well in the same environment developers use every day. That is why we built MAI-Code-1-Flash with production workflows at the center, rather than optimizing only for benchmarks. The model was trained directly with GitHub Copilot harnesses used in production. This allows it to learn how to interact with surrounding tools and systems in agentic coding tasks, making it uniquely well suited to real-world Copilot workflows compared to other available models.
During training, we evaluated checkpoints across core software engineering tasks, repository question answering, refactoring, and telemetry-grounded tasks adapted from real GitHub Copilot usage. This alignment between training, evaluation, and production helps offline improvements translate into real-world developer quality.
MAI-Code-1-Flash was trained with adaptive solution length control, which helps the model adjust the depth of its response to the task. It can stay concise for simpler requests and spend more reasoning budget when a problem requires deeper analysis or broader code changes. In practice, this means developers start seeing useful output sooner. We see MAI-Code-1-Flash solving harder problems with up to 60% fewer tokens. This helps reduce latency, lower cost, improve return on token, and make interactive workflows feel smoother.
To understand both quality and efficiency, we evaluated MAI-Code-1-Flash against Claude Haiku 4.5 on SWE-Bench Verified, SWE-Bench Pro, SWE-Bench Multilingual, and Terminal Bench 2 using the same production harness that developers use for their everyday coding tasks. We measured task success and the average number of solution tokens required to complete each task.
MAI-Code-1-Flash outperforms Claude Haiku 4.5 across all core coding benchmarks tested, with higher pass rates on all 4 evaluations, including a +16-point lead on the diverse, real-world tasks of SWE-Bench Pro (51.2% vs. 35.2%). It’s not just smarter; it’s leaner, solving harder problems with up to 60% fewer tokens on SWE-Bench Verified, proving that higher accuracy and greater efficiency are no longer a trade-off.
The pure-AI companies like OpenAI and Anthropic are hoping to sell you API access to cloud-based AI, perhaps running on NVIDIA chips, but it seems NVIDIA's plan may be for you to run local AI, maybe from NVIDIA, running on local NVIDIA chips.
do you have any insight into the actual technical details that make this sort of things possible? I want to learn more about model architectures. Does it have to do with attention mechanisms or sparsity or something?
The supermajority of respondents did report that they do engage in some coding outside of working hours, for one reason or another. I'm impressed; I'm basically a zombie after hours, rarely in any shape to touch anything technical. Good for them.
But then only 19.3% of respondents ticked that they code for freelancing reasons, and only 15% said they're doing it in an attempt to bootstrap a business. These groups were the only types that suggested revenue generating after-hours activity, and they even overlap to a non-obvious-to-me extent. But even if we pretended they didn't, that adds up to like a third at best.
So when you say:
> I don’t understand how’s that not the default option for all professional developers.
that's in contradiction with this data (and imo common sense), which suggests that the supermajority of professional developers simply do not perform revenue generating software development activity outside of work hours, period. Therefore, for them, the ROI on any potential AI subscription is a flat and constant zero.
Unless you envision people working at "bring your own license" type shops, I don't know how this is supposed to make sense. These are work tools, corporate should be providing them already. But then I'm clearly not from a "wealthy" country either, so YMMV.
Since I use LLMs basically only for analysis and as a signal in bug discovery, debugging, research and general search, I don't need a very powerful model and I don't need high token counts. A $100 subscription would be way too much for my actual usage, and would border on just using tokens for the sake of using them.
Let me slide as fast and unrestricted as I want. I do not want to "transition" to the next paragraph.
This trend needs to stop.
We moved to OpenCode Go ($10/mo), so we could switch between DeepSeek v4, GLM 5.1, and Qwen 3.7 models run by providers in the EU, US, & Singapore that the OpenCode FAQ claims don't use retained data for training.
What about data and privacy?
The [OpenCode Go] plan is designed primarily for international users, with models hosted in the US, EU, and Singapore for stable global access. Our providers follow a zero-retention policy and do not use your data for model training.
I find that their rather verbose privacy policy doesn't make far-reaching guarantees about any of this, though: https://opencode.ai/legal/privacy-policy
Even if it can't fully pass much, there are so many tests against most of the scenarios that you can get a fairly rich report beyond the pass@1 stat. See e.g. this DeepSWE report against the Minimax M3 model: https://entrpi.github.io/misc/deep-swe-minimax-m3/
I’ve used GPT mini quite a bit and it’s decent.
https://microsoft.ai/news/building-a-hillclimbing-machine-la...
Unless they specifically clarify that the testing and training benchmarks are completely separate, we have to assume they test on the same 'hill' the model climbs.
Haiku costs $1/$5. DeepSeek V4 Flash, a stronger model, is only $0.0028/$0.14/$0.28. That first number is the cached input, and DeepSeek caching is crazy efficient. So, using DeepSeek V4 Flash costs about an order of magnitude less than Haiku and performs better.
I have a Claude subscription because I'm willing to pay a premium for the best model for coding, one that doesn't waste as much of my time doing dumb stuff. But, if I need something other than Claude Code, I'm using something other than Claude models. Why burn money for no benefit?
Oh, also, Haiku chews tokens like crazy. In my benchmarks it used three times more tokens than the next highest model. Of course, security bug hunting is not in its wheelhouse, so it's not fair to judge it based on that one thing, but if it's more expensive per token and burns a lot more tokens, it ends up being a lot more expensive.
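To make the "price per token times tokens burned" point concrete, here is a small sketch using the prices quoted above. The per-1M-token unit, the workload size, the 80% cache-hit rate and the missing Haiku cache discount are assumptions; the 3x output-token factor only mirrors the ratio mentioned in the comment and is not a measurement.

```python
from dataclasses import dataclass


@dataclass
class Price:
    """USD per 1M tokens (the per-million unit is an assumption)."""
    input: float
    cached_input: float
    output: float


def task_cost(p: Price, input_tokens: int, output_tokens: int, cache_hit: float) -> float:
    """Cost of one task given the share of input tokens served from cache."""
    cached = input_tokens * cache_hit
    fresh = input_tokens - cached
    total = fresh * p.input + cached * p.cached_input + output_tokens * p.output
    return total / 1_000_000


# Prices as quoted in the comment. No cached price was quoted for Haiku,
# so this sketch assumes no cache discount for it.
haiku = Price(input=1.00, cached_input=1.00, output=5.00)
deepseek_flash = Price(input=0.14, cached_input=0.0028, output=0.28)

# Hypothetical workload: 200k input tokens, 80% of them cache hits.
deepseek_cost = task_cost(deepseek_flash, 200_000, 20_000, cache_hit=0.8)
haiku_cost = task_cost(haiku, 200_000, 3 * 20_000, cache_hit=0.8)

print(f"DeepSeek V4 Flash: ${deepseek_cost:.4f}  Haiku: ${haiku_cost:.4f}")
print(f"ratio: {haiku_cost / deepseek_cost:.1f}x")
```

Under these assumptions the gap per task is larger than the gap per token, because the extra output tokens are billed at the most expensive rate.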
But, from what I can tell DeepSeek is better than Sonnet, though I agree it is not at the level of current Opus or GPT 5.5 (but I think it probably beats Gemini Pro 3.1). I use the best model I can for code, because the cost of weaker performance is more than the $100/month I pay for Claude Opus, but it's worth knowing there are very cheap, very good, models for stuff I want to do that isn't Claude Code.
Hard to know when they don't give the price per token. Presumably it will be priced like a low-to-mid-range model. But otherwise their 'Ideal Zone' is meaningless without factoring in the price per token. I don't care how many tokens are being used; that's an implementation detail to me. I care about price / performance / latency.
That's what I'm betting on anyway.
Qwen-3.6-27b is closer to Claude Opus 4.7 than it is to Haiku 4.5 in a lot of benchmarks - and it's way smaller than Microsoft's new model.
Sure, it competes with Haiku, but it shows how far Microsoft is behind lots of other small models that are available.
This model might have a perfect speed:
import random
words = ["fast", "cheap", "good"]  # placeholder vocabulary
for i in range(100):
    print(random.choice(words))
I assume I'm misunderstanding you (likely my fault), because the way I read that is that you're saying nobody should currently be using models owned & hosted by companies like OpenAI and Anthropic, while clearly a huge number of people are using those in 2026 despite not owning them.
If you watch the Build keynote with Satya, you'll notice that the design of the slides changed to serif typography and warmer colors when the Mustafa/Microsoft AI segment came on, which was completely different from the rest of the keynote. Now it makes sense!
But all in all, I don't think we disagree.
Model              Input   Cached input   Output
MAI-Code-1-Flash   $0.75   $0.075         $4.50
I don’t think that’s right, this flash model is 5B active params. Qwen3.6-35B-A3B is 3B so 40% smaller.
As models get better and smaller, I expect that we will rapidly (within a year?) get to the point where SOTA models are not needed for the vast majority of coding tasks, and even today it seems many people are just using them for the planning phase.
How many people drive Ferraris vs Fords? How many people driving a Ford would, on a utilitarian basis, be any better off driving a Ferrari?
So far there seem to be mainly two high-volume use cases that have been found for LLMs - coding and business flow automation - and it seems neither of these needs SOTA models. I wonder if there will continue to be enough market demand for massive, expensive SOTA models to make them worth developing?
https://news.ycombinator.com/formatdoc
crimes ↑
│
10.0 ┤ ● Airport burger
│ ╭──────────────╮
8.0 ┤ │ theft arc │
│ ╰──────────────╯
6.0 ┤ ● Five Guys
│
4.0 ┤ ● Food truck burger
│
2.0 ┤ ● McBurger
│
0.0 ┤ ● Homemade burger
│
└───────┬─────────┬─────────┬─────────┬─────────→ price
$2 $8 $14 $22 $38
┌────────────────────┬────────┬──────────────┬────────────────────┐
│ burger │ price │ crime index │ expected behavior │
├────────────────────┼────────┼──────────────┼────────────────────┤
│ Homemade burger │ $2 │ 0.0 │ law-abiding citizen│
│ McBurger │ $6 │ 1.4 │ steals extra napkin│
│ Food truck burger │ $11 │ 3.1 │ lies about hunger │
│ Five Guys │ $18 │ 6.2 │ financial crime │
│ Airport burger │ $34 │ 9.7 │ enters villain arc │
└────────────────────┴────────┴──────────────┴────────────────────┘
conclusion: burger inflation is a gateway condiment
I also work on a consumer AI application https://apps.apple.com/us/app/slidebits-studio/id1138731130
For comparison, someone showed me an internal company tool he was working on. He had Claude agents dangerously skipping permissions and firing up GitHub branches through a VM sandbox just to make a single feature change. One agent to code and the other to review.
[0] Not even here: https://playground.microsoft.ai/
I was implementing a re-print functionality in my warehouse management system.
It took Opus 4.8 high 24m1s and 87k tokens. Took Haiku 6m30s and 41k tokens.
After that time I had to provide (minor) adjustments to both. But Haiku allowed me to iterate faster. Code quality for that somewhat trivial use case was similar.
Actually, I would even say that Opus provided a subpar solution: instead of fixing the issue where the carrier label PDF wasn't saved as the state machine progressed to the last step, it went for a much more complex solution of regenerating the labels from scratch. That is also wrong, as it was de facto booking the carriers twice for the same order.
Haiku simply added another field on the terminal state that carried the already generated urls.
I don't think it's a good idea to default to highest effort/bigger model without taking into account the time it takes and the task complexity.
Imho we should experiment rather than assume that what the rest of the community does is best practice.
For example you probably don't have days where you ask Opus to review your whole code base and look for code duplication/technical debt/robustness issues, and then to fix some of the found issues, and do this 3-5 times until no big issues are found anymore.
Where does the Pascal case inspired variant come from? Is it a reference to something? Is it like "M$" was used back in the days?
I’ve actually had luck taking the analysis from GHCP and pasting it into our M365 Copilot and getting a useful poc to stick into my bug reports.
Instead, they only do cherry-picked comparisons against Anthropic's small models, and not the full spectrum of competitors.
Without evidence to the contrary, I'll interpret this as just what happens when you're late to the party and insist on doing everything from scratch.
Maybe coaxing reasoning behavior out of their base model without kickstarting it by distilling from existing models provided them with valuable experience that will help improve their future models, or maybe it was an unnecessary waste of time.
The next question is where did the "ASCII-art" graph and table come from? Are there sites to generate these?
Yesterday Codex was making a big issue out of a new module that was upgraded in our cluster and because of which the same SSH key would be "regenerated" by Terraform. No big deal, it just truncates a newline at the end of the SSH key and it works all the same. But not being aware that this, as an example, is unimportant can cost a lot more time than using the big models saves.
They also did some more interesting work like showing very small models can be coherent as long as you have very simple children's book style training data (TinyStories is pretty famous).
Lots of these ideas are still used. Learning facts at scale with active reading is an ICLR 2026 paper from Meta AI that does a lot of similar work.
Cursor is potentially about to be acquired by X.ai (i.e. SpaceX), unless this is just some IPO game being played by Musk. They are certainly not just a token reseller since they have their own models in addition to their own vector database approach for code matching.
Just built a tool for that: https://krysoph.github.io/UnicodeData/
It is a single HTML file with no dependencies; it takes JSON data and turns it into Unicode charts.
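For anyone curious how little is needed for the basic idea, here is a minimal Python sketch of turning JSON into a Unicode bar chart. It is not the linked tool's implementation; the `bar_chart` function and its input shape (a flat object of label to positive number) are assumptions, and the sample values are the AA index figures from the table earlier in the thread.

```python
import json


def bar_chart(json_text: str, width: int = 20) -> str:
    """Render {"label": number, ...} as horizontal bars scaled to the largest value."""
    data = json.loads(json_text)
    peak = max(data.values())              # assumes at least one positive value
    label_w = max(len(label) for label in data)
    lines = []
    for label, value in data.items():
        bar = "█" * round(width * value / peak)
        lines.append(f"{label.ljust(label_w)} │{bar} {value}")
    return "\n".join(lines)


chart = bar_chart('{"Qwen3.6-35B-A3B": 43.5, "Claude Haiku 4.5": 37.1}')
print(chart)
```

Bubble charts and axes like the ones posted above take more layout work, but they come down to the same step of mapping values onto character cells.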
Perform a thorough analysis of the <project_name> project (the code and the documentation).
- Explore the project, go over all important files one by one and look for any mistakes or possible bugs.
- Look for refactoring opportunities and ways to improve code quality and organization.
- Identify any potential cruft/bloat, to ensure our code is clean and logically laid out. Keep in mind that efficient and good quality code needs to avoid over-engineered constructs and needless complexity. Avoid complicated logic where simple solutions would be more elegant.
- Pay attention to comments: There should be enough of them to document the intent and provide high-level overview of the code logic, but not too much; avoid/remove excessive comments that simply restate the code logic or do not provide any useful information.
- Every important function should have a top-level docstring comment that clearly explains its purpose, high-level logic overview, arguments, and return values.
- Analyze the names of constants/variables/functions/classes and other code elements: could some of them be renamed to make their purpose more clear?
- Analyze the documentation, uncover any potential inaccuracies/omissions and ensure the docs reflect the code.
- Brainstorm ideas for improvements of the code and docs.
After you finish the analysis, save an analysis report into "<project_name>_analysis_report.md" in the project root folder.
I use quite plain prompts, nothing fancy:
> go over the tests and do a code review, focusing on how well they test inventory management, planner and controller. maybe some tests need to be deleted, maybe other tests need to be added. the end goal should be good coverage of the core features.
> do a code review, focusing on robustness/correctness issues. validate that the code correctly implements specification.md. focus on the async client.
> there was a big refactor. please do a code review, focusing on eliminating tech debt. look for unused, obsolete or duplicate code that can be removed, look for mismatched interfaces, inconsistent function/argument/variable names. do not output what is correct, just the issues you found. for each issue output instructions for a coding agent on how to fix it. do not nitpick.