MAI-Code-1-Flash

237 points by EvanZhouDev 2 hours ago | 109 comments

https://microsoft.ai/models/mai-code-1-flash/

https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF

Launching seven new MAI models: https://microsoft.ai/news/building-a-hillclimbing-machine-la...

bel8 59 minutes ago |

It's a start and I welcome competition but I don't think I ever used small cloud models like Haiku 4.5. They are cute but for serious coding they tend to waste your expensive time.

And this certainly won't bring me back to GitHub Copilot, which I cancelled yesterday.

GitHub Copilot had competitive pricing until yesterday when they changed from per-request to one of the most expensive per-token quotas. Seriously, take a look at their burning subreddit for some laughs: https://www.reddit.com/r/GithubCopilot

I have since changed to DeepSeek Flash on high which is Sonnet+ level for almost free.

If I feel I still need smarter models I might sign up for $20/mo Codex to use GPT 5.5 which, in my opinion, is the best I can access right now.

nate 21 minutes ago | |

The small stuff has their place. I have this safari extension and needed a way to quickly title people's chat histories. Haiku is the fast cheap thing to come up with decent titles of blocks of text. I feel like there's a bunch of those little things lying around you need a model for. I'm even finding Apple's Foundation Model is super useful for stuff like that. Even summarizing an article. It's like equally awful at doing it, but gets enough done to still be useful as a way to be like "oh yeah, this article is actually worth reading"

seanlinehan 2 minutes ago | | |

Small models are super useful. But I'm skeptical of their use for coding in particular, which is what this model is advertised for.

GaryBluto 39 minutes ago | |

Almost exactly the same story here. I've also had little to no refusals from DeepSeek, with its Chinese values meaning substantially less friction when it comes to things like reverse engineering, finding copyrighted files, working with dubiously-sourced source code, et cetera. I don't think I'd go back to Copilot even if they dropped prices by 90%.

alkonaut 22 minutes ago | |

Won’t (presumably) all the market actors converge on similar pricing? If OpenAI stopped operating on subsidies and charged the true costs, and their most token-hungry customers were the ones that switched to Anthropic and others, then their pricing model switch would also be around the corner.

Unless of course we’re thinking Copilot will be more expensive than others longer term. But is that a reasonable assumption?

stefan_ 17 minutes ago | | |

Anthropic & co charge API users much more, not least to demolish the middlemen low-effort plays like Cursor and Copilot. To not own the model is not viable in 2026.

partiallypro 2 minutes ago | |

> "GitHub Copilot had competitive pricing until yesterday when they changed from per-request to one of the most expensive per-token quotas. Seriously, take a look at their burning subreddit for some laughs"

AI is expensive and it has been heavily subsidized. If you think $20/mo flat for Codex/Claude will hold up against a more usage-based model, you're in for a shock. Especially once these companies go public and have to meet investor expectations.

hparadiz 26 minutes ago | |

The $20/month ChatGPT plan that comes with Codex is good value. Even just having premium ChatGPT is nice. I get rate limited regularly but it still lets me do most things.

tedggh 25 seconds ago | | |

The $100/month is excellent value. I don’t understand how that’s not the default option for all professional developers, unless people don’t produce any value writing code (playing around and experimenting with vibe coding, say), which I understand. But if software development is your actual income, and assuming you live in a wealthy country, $100/month is nothing for a tool like Codex.

verdverm 48 minutes ago | |

I've been having really good results with DeepSeek-v4-flash, qwen-3.6-moe, and the older gemini-3-flash-preview. (recent Geminis suck hard)

Small models are more than enough for the majority of tasks these days. Plan and review with the bigger ones, let the little ones explore and implement.

OpenCode Go is $10/month for the open weight models with nice quotas: https://opencode.ai/go
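The split described above (plan and review on a larger model, explore and implement on a smaller one) could be wired up roughly like this. A minimal sketch only: the stage names, the `big-model` / `small-model` labels, and the `pick_model` helper are all hypothetical and do not correspond to OpenCode or any other particular harness.

```python
from enum import Enum


class Stage(Enum):
    PLAN = "plan"
    EXPLORE = "explore"
    IMPLEMENT = "implement"
    REVIEW = "review"


# Hypothetical model labels; substitute whatever your harness actually exposes.
BIG_MODEL = "big-model"
SMALL_MODEL = "small-model"

# Stages routed to the larger model; every other stage goes to the small one.
BIG_STAGES = {Stage.PLAN, Stage.REVIEW}


def pick_model(stage: Stage) -> str:
    """Route planning and review to the big model, the rest to the small one."""
    return BIG_MODEL if stage in BIG_STAGES else SMALL_MODEL


for stage in Stage:
    print(f"{stage.value:10s} -> {pick_model(stage)}")
```

Keeping the routing in one small function makes it easy to move a stage between tiers when a small model turns out to be good (or bad) at it.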

emsign 25 minutes ago | |

I wonder when THEY make it illegal to vote with your wallet.

camelmel 1 hour ago |

Huh, according to that model card this is a 137B total parameter model.

Performance doesn't seem that good:

- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench pro

- Qwen3.6-35B-A3B = 49.5% on SWE-bench pro (https://huggingface.co/Qwen/Qwen3.6-35B-A3B)

They benchmark against Claude Haiku but Haiku is not good, it's worse than tiny open models you can run locally or via API at 10% the cost.

giancarlostoro 1 hour ago | |

The takeaway is that this is a smaller model that competes with Haiku. I would hope they come out with a "Sonnet"-competing model, then Opus. I have been wondering why Microsoft is kind of "sleeping" on offering models they themselves have made on Copilot, maybe it was part of their deal with OpenAI? Not sure.

mdasen 47 minutes ago | | |

Yes, it's a "smaller" (137B) model that competes with Haiku, but it's basically the performance of Qwen3.6-35B-A3B which is 75% smaller and 98% smaller in terms of active parameters (since it's a mixture of experts model). Microsoft should be comparing its model to good smaller models, not Haiku 4.5.

Qwen-3.6-27b is closer to Claude Opus 4.7 than it is to Haiku 4.5 in a lot of benchmarks - and it's way smaller than Microsoft's new model.

Sure, it competes with Haiku, but it shows how far Microsoft is behind lots of other small models that are available.
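For reference, the size comparison above can be checked with a few lines of arithmetic. This is only a sketch: the parameter counts are read off the model names quoted in this thread (137B-A5B and 35B-A3B), and the helper name `percent_smaller` is made up for illustration.

```python
# Parameter counts in billions, taken from the model names quoted in the thread.
MAI_TOTAL, MAI_ACTIVE = 137, 5      # MAI-Code-1-Flash (137B-A5B)
QWEN_TOTAL, QWEN_ACTIVE = 35, 3     # Qwen3.6-35B-A3B


def percent_smaller(small: float, large: float) -> float:
    """Return how much smaller `small` is than `large`, as a percentage."""
    return (1 - small / large) * 100


total_gap = percent_smaller(QWEN_TOTAL, MAI_TOTAL)
active_gap = percent_smaller(QWEN_ACTIVE, MAI_ACTIVE)
active_vs_total = percent_smaller(QWEN_ACTIVE, MAI_TOTAL)

print(f"total vs total:            {total_gap:.1f}% smaller")
print(f"active vs active:          {active_gap:.1f}% smaller")
print(f"Qwen active vs MAI total:  {active_vs_total:.1f}% smaller")
```

With these figures, total parameters come out about 74.5% smaller and active parameters about 40% smaller; a figure near 98% only appears when Qwen's 3B active parameters are compared against MAI's 137B total.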

minraws 1 hour ago | | |

They did release MAI-Thinking-1 to compete with Sonnet. Totally not sure why that isn't at the top here.

kristjansson 58 minutes ago | |

> 137B-A5B

Yeah, not a 5B param model as the earlier title implied!

GaryBluto 41 minutes ago |

What's with the lack of Microsoft design language on the website? It's painfully obvious they're trying to emulate Anthropic's style here and it looks tacky.

foltik 14 minutes ago | |

Definitely vibed microslop, the giveaway is the broken header and scrolling on mobile.

Handy-Man 4 minutes ago | |

That's neither Microsoft nor Anthropic design. It's from their acquisition of Inflection AI. Even Copilot mobile app design is basically what was Inflection's design

winfredJa 39 minutes ago | |

i think it is AI generated.

i_have_an_idea 40 minutes ago | |

maybe it was coded by Claude

stringfood 13 minutes ago | |

A little too minimalist - only a few hundred words on the entire page!

hmokiguess 1 hour ago |

Does anyone actually use these smaller models for coding? If so, how? I usually Opus everything. Is the play to plan/design/architect with a heavier model, then delegate structured tasks to these smaller ones? Would appreciate hearing from someone who has done and tested both paths.

capten 2 hours ago |

It's so weird to me that the benchmarks remain so low, but the models are marketed as revolutionary. And if you say that low coding capabilities aren't a problem, say that to the token price hike and 'general use' model setup.

Why not sell it as a math agent? Why do I have to set up 4 agents to check each others' work?

npn 10 minutes ago | |

From what I understand, it's because unlike the other models, MAI models haven't yet been fine-tuned on the synthetic datasets specifically designed to boost benchmark scores.

redrove 1 hour ago | |

It’s about bang for buck. That high a score for 5B params is pretty good, nigh unbelievable a short while ago.

It is my belief that smaller models will get better and better, and even cloud SOTA models will shrink.

Yet another reason the current buildout will feel like the railroads.

bgirard 19 minutes ago | | |

> It’s about bang for buck.

Hard to know when they don't give the price per token. Presumably it will be comparable to a low-mid range model in terms of price. But otherwise their 'Ideal Zone' is meaningless without factoring in the price per token. I don't care how many tokens are being used, that's an implementation detail to me. I care about price / performance / latency.

necubi 1 hour ago | | |

It's 5B active params in MoE, not 5B total params (total is 137B).

Flere-Imsaho 1 hour ago | | |

Yeah the future is probably a number of highly specialised small models you can run on your own hardware rather than massive frontier models in the cloud.

That's what I'm betting on anyway.

dist-epoch 1 hour ago | | |

The SOTA models will not shrink, because the problems will get bigger, from "write me a C compiler" to "clone Stripe business and run it".

npn 12 minutes ago |

I personally do not like Microsoft, but congrats to them on releasing this model.

While the scores are not good compared to other open-weight models, the important thing to note is that their training data (as they claim) is very clean, without any synthetic datasets.

AntiRush 2 hours ago |

The introductory blog post has a lot more information

https://microsoft.ai/news/introducingmai-code-1-flash/

and the model card

https://microsoft.ai/pdf/MAI-Code-1-Flash-Model-Card.PDF

The broader announcement of 7 MAI models seems to be where the 5B active in the title comes from

https://microsoft.ai/news/building-a-hillclimbing-machine-la...

dang 1 hour ago | |

Thanks! I've changed the top link to the blog post and put the other links in the toptext.

giancarlostoro 1 hour ago |

Mark Zuckerberg must be in crisis. Microsoft releasing models that compete with Claude's models. Meanwhile the only thing anyone knows about Mark's models is that they help you get hacked more easily.

ggcr 52 minutes ago | |

Meta recently launched Muse Spark [1] and they themselves compare against Claude Opus 4.6 Max.

Here Microsoft is comparing against Claude Haiku, the smallest and least capable model from Anthropic.

[1] https://ai.meta.com/blog/introducing-muse-spark-msl/

yuppiepuppie 1 hour ago | |

Wait… I think he has moltbook IP as well that he can scale up.

Seriously tho, wtf is going on over at Meta? Anyone working there currently want to describe the vibe of the org when it comes to being a frontier company?

giancarlostoro 1 hour ago | | |

I don't understand his plan. If I were him I'd either have gone all in on making RAM, which would become very lucrative, or focused on building programming models. They've built some key open source technologies, but it's as if Mark Zuckerberg cannot run anything that isn't a social media company / project.

deckar01 1 hour ago |

If only they had launched that yesterday I might have avoided Copilot auto model selection using a 9x model, quietly burning my monthly quota in a single afternoon.

efields 1 hour ago |

Please test your websites in Safari. Almost all of your iOS users use it by default, and the desktop experience is pretty close to the mobile experience, so testing is easy.

That scroll effect is jank city for me (yeah yeah works fine in Chrome/Edge).

whalesalad 31 minutes ago | |

some kind of scroll hijack going on for sure, feels terrible on firefox+macos

AJRF 56 minutes ago |

Copilot brand is tarnished, so time to bung everything under MAI?

OsrsNeedsf2P 2 hours ago |

So it's trained on the SWE Bench Pro evalset

topsycatt 24 minutes ago | |

That's not accurate. Take a look at the paper to see what it is trained on! And specifically decontamination is called out in A.4

https://microsoft.ai/wp-content/uploads/2026/06/main_2026060...

lemonish97 2 hours ago | |

What is your evidence for this claim?

fooker 2 hours ago | | |

They say hill climbing

https://microsoft.ai/news/building-a-hillclimbing-machine-la...

Unless they specifically clarify that the testing and training benchmarks are completely separate, we have to assume they test on the same 'hill' the model climbs.

mentos 1 hour ago |

Shouldn’t the next model focus not be on code but system design?

Seems like the work from a good system design to code is practically solved.

Now it’s a matter of the design of the system. Or is that represented in these evals?

dist-epoch 1 hour ago | |

Have you tried system design with LLMs? I find them pretty good at suggesting 5 architectures for a problem and then iterating on the solutions.

Even if I had no idea, going with the default suggestion would not be a terrible mistake, assuming you did describe your requirements relatively well.

tosh 1 hour ago |

not open weight or at least I did not find anything indicating open weight

ggcr 48 minutes ago | |

I was hoping Microsoft would make it open weights, as they have done for years with the Phi models.

The era of big tech releasing models into the wild might be over, which IMO is counter-productive, as we are shifting from "the model is the product" to "the harness is the product"

onlyrealcuzzo 2 hours ago |

Gemma 4 26B-A4B scored exceptionally well with 20% less params, so this isn't unprecedented.

bguberfain 1 hour ago |

It is good to see big companies like Microsoft launching LLMs. They have a large amount of compute power and good scientists to create useful models.

ComputerGuru 1 hour ago | |

Microsoft has been releasing LLMs for years.

lemonish97 1 hour ago | | |

They were mostly distilled or fine-tuned OAI models.

ipsum2 1 hour ago | | |

Sort of. Phi models were just trained on GPT outputs though.

jwitthuhn 1 hour ago | | |

And occasionally un-releasing them like with WizardLM.

hootz 2 hours ago |

I'd love to see a tokens per second metric. I always prioritize speed over raw intelligence for flash models.

throwaw12 1 hour ago | |

> I always prioritize speed over raw intelligence for flash models.

This model might have a perfect speed:

    import random
    words = ["to", "be", "or", "not"]  # any vocabulary will do
    for i in range(100):
        print(random.choice(words))

OsrsNeedsf2P 57 minutes ago | | |

Leave it long enough, and it'll print the works of Shakespeare!

mmaunder 1 hour ago |

You lost me at forced scrolling. Ugh!

Tepix 1 hour ago | |

From https://news.ycombinator.com/newsguidelines.html

Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.

gslepak 1 hour ago |

Would be cool if this were an open model.

ajyoon 2 hours ago |

Scroll wheel hijacked on this entire domain

grav 1 hour ago | |

Fix:

  // Paste into the browser console on the affected page.
  (() => {
    const KILL = ['wheel', 'mousewheel', 'DOMMouseScroll', 'touchmove'];
    // Capture-phase handler: stops the event before it reaches listeners registered later or deeper in the DOM.
    const block = e => e.stopImmediatePropagation();
    for (const t of KILL) {
      window.addEventListener(t, block, { capture: true, passive: true });
      document.addEventListener(t, block, { capture: true, passive: true });
    }
    // Remove the Lenis smooth-scroll classes from <html>.
    document.documentElement.classList.remove('lenis', 'lenis-smooth', 'lenis-scrolling', 'lenis-stopped');
    console.log('Scroll hijack disabled; native scrolling restored.');
  })();

matchbok3 1 hour ago | |

Yeah this website is horrendous to use. What were they thinking?

BadBadJellyBean 1 hour ago | | |

You mean "what was the LLM thinking?"

striking 1 hour ago |

To be clear about the size of the model: MAI-Code-1-Flash is 137B A5B.

ilia-a 54 minutes ago |

I mean they are comparing themselves to Haiku of all things, geez that's not a good start...

jMyles 1 hour ago |

I'd really like to get back to an autocomplete flow, ideally with some shared and optimized context with the relationship with my larger agent models.

But it seems like, by and large, even the faster models are now aimed at longer-running agentic flows and not sub-1s autocomplete. Or am I wrong about that?

verdverm 42 minutes ago | |

You aren't wrong, the field is moving to a world where we do less in the code editor, so autocomplete is not needed any more. I've only manually edited code a few times in the last month. Haven't used autocomplete in 6+ months since I left Copilot to build my own agent harness (I'm now mainly using OpenCode)

Marciplan 1 hour ago |

"Build for developers, not benchmarks" Shouldn't that be.. Built?

kylehotchkiss 1 hour ago |

"superintellegence team"

Why not assign them to make windows good :D

LoganDark 1 hour ago |

"Clean data" is impossible. Language models have polluted the landscape to such a degree it's impossible to filter them out now. OpenAI has no doubt discarded or muddled their dataset that was used to train the original ChatGPT, so there may be no dataset in existence now that isn't contaminated.

zb3 1 hour ago |

So it's not an open model while not being much better? Meh.

freediddy 1 hour ago |

Is 51% good enough to reliably use? There's no world in which I use an AI agent where it gets even 15% of the code wrong; that's as bad as Tesla FSD, where you need to pay attention to the road while engaging FSD. What's the point? My attention is what I'm trying to relieve, not mostly correct functionality. The only thing that matters is whether you can one-shot code like Claude or Codex. I'm not interested in a small but mostly-okay-but-annoyingly-buggy-every-now-and-then AI.

VygmraMGVl 1 hour ago | |

Claude Opus 4.6 scores 51.9% on the same benchmark. Microsoft's result is quite good.

IanCal 1 hour ago | |

51% does not mean it randomly gets things wrong half the time.

These things can be useful if you can accurately predict which tasks they will reliably do, and which they will usually fail on. Then you can get much more reliable work from them.
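To make that concrete, here is a small sketch with made-up numbers. The task categories and pass/fail results below are hypothetical, not taken from SWE-bench Pro or the model card; the point is only that a roughly 50% aggregate score can hide categories that almost always pass and others that almost always fail.

```python
from collections import defaultdict

# Hypothetical per-task results: (category, passed). Not real benchmark data.
results = (
    [("rename/refactor", True)] * 9 + [("rename/refactor", False)] * 1
    + [("bug fix with failing test", True)] * 5 + [("bug fix with failing test", False)] * 5
    + [("cross-module redesign", True)] * 1 + [("cross-module redesign", False)] * 9
)


def pass_rates(rows):
    """Return the pass rate per category and the overall pass rate."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for category, ok in rows:
        total[category] += 1
        passed[category] += ok  # True counts as 1, False as 0
    per_category = {c: passed[c] / total[c] for c in total}
    overall = sum(passed.values()) / sum(total.values())
    return per_category, overall


per_category, overall = pass_rates(results)
for category, rate in per_category.items():
    print(f"{category:28s} {rate:.0%}")
print(f"{'overall':28s} {overall:.0%}")
```

In this toy data the overall pass rate is 50%, yet one category passes 90% of the time and another only 10%, which is the kind of split that lets you decide what to delegate.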

pzo 43 minutes ago |

TL;DR: this is just a Claude Haiku alternative; you can probably skip the whole article.

mattlondon 1 hour ago |

Comparing against Claude 4.5? Aren't we up to 4.8? Bit disingenuous?

klardotsh 1 hour ago | |

They're comparing to Haiku, not Opus. Haiku is currently at 4.5.

Even if it were Opus, comparing to a version number makes for an interesting snapshot-in-time comparison: if you knew how a model performed at whatever time it was in vogue, you can say "well, it looks like Model X is about 6 months/1 year/etc. behind the frontier SOTA" - which is exactly the discussion that happens in the open-weight/local LLM space. (Interestingly, MAI-Code-1-Flash does not appear to be such an open-weight model, following the western trend of locking models up.)

0vermorrow 1 hour ago | |

Latest Haiku (smallest Anthropic Model) is version 4.5, they haven't released a new version, hence the comparison to that.
