Mythos Finds a Curl Vulnerability (daniel.haxx.se)
The point wasn't actual cross-platform portability even though that was a nice side effect. It was to flush out all the weird edge cases.
Edges like security flaws. Buffer overflows are usually platform specific. There are plenty of other ways to find these issues but simply recompiling for a different platform surfaces all sorts of issues.
Next question: could it be that OP can use Mythos in a better way since he knows the project better?
I would think Calif (a security firm) is a team better placed to make use of such a tool.
Typo, or is there a spoof I should go read?
I also thought they were contending the word count before noticing. Even remarked how I find this a weird metric, given that code is not prose [0], but then I deleted that once I picked up on what's going on.
[0] comparing the output of `wc -w` with the word counts of books I'm reasonably sure will be super off
edit: ran a calc, substituting out symbols (but not underscores), digits, and comments yields a 390K word count compared to the 660K cited. not excluding the comments yields 600K, so more than a third of all words in the sources are comments.
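For anyone who wants to repeat that kind of count, here is a minimal sketch of one way to do it. It is an assumption about the method, not the script used above: `count_words` is a hypothetical helper, and the comment stripping is naive (it does not handle comment markers inside string literals).

```python
import re

def count_words(source: str, keep_comments: bool = True) -> int:
    """Count word-like tokens in C source text (rough estimate)."""
    if not keep_comments:
        # Naive stripping of /* ... */ and // ... comments.
        # Assumption: comment markers never appear inside string literals.
        source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
        source = re.sub(r"//[^\n]*", " ", source)
    # Replace digits and symbols with spaces, keeping letters and underscores.
    cleaned = re.sub(r"[^A-Za-z_\s]", " ", source)
    return len(cleaned.split())

sample = "int main(void) { /* entry point */ return 0; } // done\n"
print(count_words(sample))                       # words including comments
print(count_words(sample, keep_comments=False))  # words with comments removed
```

Running it over every source file and summing both variants would give the with-comments and without-comments totals to compare.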
Does it say anything else? Just 'Aaaarggghhhh'?
I would very much like to know if they were independent or affiliated with Anthropic.
> My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing.
... because of this.
> It’s not that I would have a lot of time to explore lots of different prompts and doing deep dive adventures anyway.
His expertise I think would elevate the results quite a bit. Although if he never uses LLMs, which it reads like he doesn't, I guess it might backfire just as well. Prompting style (still?) does matter after all, certainly in my experience anyways.
> using these tools interactively
I did read the article. It seems to me they're using LLMs in a prepared manner instead, as mere scanners that produce reports.
I checked back two weeks worth of posts, reposts, and replies there, and do not see anything suggesting so, so I'll have to take your word for this.
What I do see is him responding to seemingly rather frequent harassment about AI use @ curl however. The stance he takes in those cases is very reasonable (even if you don't use AI for scanning the codebase and contributions, threat actors will), it's unfortunate this topic is so political that he has to deal with this to such an extent.
I guess it's related to the phenomenon where you can read words relatively easily as long as the first and last letters are correct and the rest of the letters are there.
https://wire.insiderfinance.io/the-brains-power-to-read-jumb...
Source: voice typing this with Swedish vocal cords, and only had to correct "different lives" to "differently", and add /[^\w\s]/.
"My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing. I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos. Maybe this model is a little bit better, but even if it is, it is not better to a degree that seems to make a significant dent in code analyzing."
It's a good reminder for us all that the competition in this space is rough and lots of more or less subtle marketing is involved.
More seriously, so far I haven’t seen much indication that Mythos is more than Opus with a security focused code analysis harness. That said, the fact it can find these bugs in an automated fashion is the more important takeaway outside of the hype.
I’m curious what the error rate is on the detections, because none of that means much if it is wrong 90% of the time and we are only hearing about the examples that are useful marketing.
I remember when OpenAI was saying GPT-2 was too dangerous to release.
So while Anthropic's marketing may be hype, there just wasn't much left to find, a point he makes in the blog post.
Whether it's a big step forward for other kinds of projects is difficult to tell, but this highlights that everybody should be using AI code review tools to audit their existing code today, and not everybody is.
What it highlights, is that Mythos doesn't seem so much better than other LLM driven tooling at finding security issues, which was the strongest claim Anthropic made in the first place.
Do you see how ridiculous the zealotry sounds when it's not your personal kind of zealotry?
Curl uses all sorts of tools, including AI tools, to find bugs. These tools, according to the article, found hundreds of bugs, including a dozen CVEs.
Mythos found one vulnerability. It means Mythos is just another tool, not the revolution it claims to be.
It is common that when a new tool is introduced, a bunch of bugs are found, with diminishing returns afterwards. Mythos finding one vulnerability is consistent with what I would expect for a major update to an existing tool, which Mythos is over existing LLM-based solutions.
that helps us to understand how much of Mythos is hype and how much is real
I've seen literally near word-for-word this exact chain of events multiple times previously
As part of our continued collaboration with Anthropic, we had the opportunity to apply an early version of Claude Mythos Preview to Firefox. This week’s release of Firefox 150 includes fixes for 271 vulnerabilities identified during this initial evaluation.
As these capabilities reach the hands of more defenders, many other teams are now experiencing the same vertigo we did when the findings first came into focus. For a hardened target, just one such bug would have been red-alert in 2025, and so many at once makes you stop to wonder whether it’s even possible to keep up.
https://blog.mozilla.org/en/privacy-security/ai-security-zer...
We know that the combination of all three results in finding lots of security vulnerabilities. That's what Mozilla is talking about. The quote from the curl story states that just 2 and 3, but with just regular SotA models, would have produced very similar results.
Which is really the crux of all this hype around Mythos: would the results really be different if they used Claude Opus instead of Claude Mythos? How much is the model, how much the harness, and how much is just because Anthropic is running a big campaign systematically trying to find vulnerabilities?
Part of what made Mythos so effective for Mozilla was the integrated agentic workflow where it not only looked for bugs, but then created an exploit to demonstrate them, and ran that exploit while dynamic analysis was enabled, verifying that invalid memory access occurred. In this case it is hard to know how much of their success was because they put more effort into the harness compared to previous tools (we know they did), or if Mythos was more suitable for this sort of workflow to begin with.
Not many apples-to-apples comparisons to be made with Mythos at this point.
The other alternative is that Curl is simply secure enough that there was far less to find than in other projects.
Also, looking at something that trips valgrind warnings already, may obfuscate a lot of problems in both your own code and the curl library itself.
One could report the issue as functioning as described in the API, but the developers do not accept direct community input into the project.
People use it out of convenience, but it is just as janky as most bloated projects. =3
Marketing is not intentional.
Evidence: 10 years ago, when I interviewed Baidu AI with Andrew Ng and Dario, Dario was the kind of person who is pure-hearted to the point of being ideological. Given Dario's successful career so far, that essence has gradually grown into a conviction, and he is surrounded by a purpose-built team which amplifies his ideology.
Humans are very convenient creatures; a small fraction of them are no doubt the masters of convenience: they morph their mental manifold without a hint of contradiction in their own mental mechanisms.
Things change when you’re running a business like Anthropic, especially as the CEO. You have a responsibility to shareholders, and you just need to play the game.
Anthropic chose a great angle: focus on professionals / enterprise, safety, etc. Those can both be done by a genuine desire to make great technology, and for business purposes require you to position yourself in a bit “better” way than reality.
Just look at what their strategy is with Mythos, it’s almost perfection: the “it’s not ready to be released to the public” angle hits all the marks: they care about responsibility / safety, they have “the best” model, and “LLMs are dangerous, but we, as the guardians, can be trusted”. This also helps the industry as a whole with regulation: if they’re being constrained, China will develop even more dangerous models.
This is a result of how smart people treat business, it’s PR perfection, especially given how much the whole industry is talking about it.
(Yes, they fail in other PR areas, but that’s a different discussion)
Mythos put Anthropic back into the White House’s good graces. It also branded Anthropic as badass, something their softer image probably needed to win government contracts.
Maybe it wasn’t marketing. But the product’s configuration, and how Anthropic talked about and released it, sure as hell played beautifully. (The timing, while Musk and Altman are distracted with each other, also couldn’t have been better.)
Whether the person doing the marketing was sincere about it or not is immaterial, since marketing is experienced almost entirely by the people consuming it, and not the people communicating it. What matters is if the audience is sincerely concerned by the message, and it's transparently the case that they were sincerely concerned by it.
That's an odd definition of "intentional". Evolution has filtered for people with certain views and the marketing has just emerged from their actions. ... So?
A deadly virus (naturally occurring one let's say) wasn't created intentionally. Evolution selected for it. It's still bad and kills people. Doesn't make it nice because of lack of intention.
They claim the huge advance is in exploiting the bugs.
> Over the last few months, we have stopped getting AI slop security reports in the #curl project. They're gone.
> Instead we get an ever-increasing amount of really good security reports, almost all done with the help of AI.
> They're submitted in a never-before seen frequency and put us under serious load.
> I hear similar witness reports from fellow maintainers in many other Open Source projects.
> Lots of these good reports are deemed "just bugs" and things we deem not having security properties.
[1]: https://www.linkedin.com/posts/danielstenberg_hackerone-shar...
I think the results say more about the great job the curl team has done maintaining their codebase.
This doesn’t mean Anthropic's Project Glasswing is a marketing stunt. Logically, it doesn’t make sense: when they announced Mythos Preview, Anthropic couldn’t meet customer demand; they didn’t have enough compute to go around. So they decide to hype an unreleased product to drive even more demand? All that would do is piss off their existing customers who are already experiencing rationing and frequent outages.
Many forums were already flooded with "I cancelled Claude Code" as it was.
On the contrary, it would be incredibly irresponsible and unethical for such a young company with billions of dollars of other people’s money invested in them.
Because the Mozilla team used Mythos and found 271 vulnerabilities [1], does that mean they're in on the so-called "marketing stunt"?
Of course, if Anthropic had released Mythos to the public and bad actors used it to hack a large number of banks, hospitals, government agencies, etc. in a matter of days, the HN crowd would be all over them for acting irresponsibly and criticizing them for not knowing better.
[1]: "Behind the Scenes Hardening Firefox with Claude Mythos Preview" — https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...
Mozilla is the current poster child but 271 in such a large codebase with thousands of user options, most of them being TOCTOU isn't that much. Sorry. TOCTOU can happen in any language when people are simply exhausted by the sheer volume of case explosions.
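To make the TOCTOU (time-of-check to time-of-use) point concrete, here is a minimal sketch in Python, not taken from any of the reported bugs. Both function names are hypothetical. The first checks and then acts, leaving a window in which the file can change; the second just attempts the operation and handles failure.

```python
import os

def read_checked(path: str) -> bytes:
    # Racy pattern: the file can be replaced, removed, or have its
    # permissions changed between the access() check and the open().
    if os.access(path, os.R_OK):
        with open(path, "rb") as f:
            return f.read()
    return b""

def read_direct(path: str) -> bytes:
    # Safer pattern: no separate check, so there is no gap between
    # deciding the file is usable and actually using it.
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError:
        return b""
```

In a single-threaded test both behave the same, which is part of why this class of bug survives review: the difference only shows when something else touches the path in between.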
There is a third option: Anthropic could simply have reported the issue without mentioning the new model at all. But they don't, since they want to sell to governments and military and the artificial scarcity just provides a veneer of exclusivity that their clients will appreciate.
I've been running my own security scanning software (disclaimer: now starting a company @ zeroquarry.com) for this, and from what I've seen there's a huge value in prompts + adversarial LLM review. Without adversarial review, you get garbage (as this blog points out: 4/5 basically are nonsense) and with a good prompt, you can use almost any "near frontier" model from my experience as long as the prompt helps with the guardrails or the model doesn't protect in such a strict way
Wouldn't that make it a better case for distinguishing whether Mythos is uniquely super powerful vs an incremental improvement from Opus etc that are routinely used as the basis for bug reports/fixes in cURL?
If Mythos found a hundred new show stopper bugs then it would have meant Opus missed them and therefore closer to a "step change". Otherwise it implies the difference in capability isn't nearly that stark. Mythos finding 100 low-hanging bugs in a less scrutinized/hardened project, on the other hand, wouldn't be as useful a signal to answer that.
https://www.politico.eu/article/anthropic-hacking-technology...
This is an advertising masterpiece: UK gets first access, the EU is jealous and wants it, too. Thousands of bureaucrats and parasites make money in the process writing (probably using AI) whitepapers and sitting in meetings. The open source authors whose works are being scanned make nothing.
We know how the money flows. Another unrelated example is that ex MI6 director Sir John Sawers is a Palantir consultant and sells out the UK to Palantir.
About as subtle as a personal injury lawyer's billboard
It's almost Trump-esque - "this model will change everything forever; we are doomed; we are saved; we will all be fired; we will all be rich", etc
They need the hype to pay off way more than we do. So many of us who still write code directly stand to lose nothing of our capabilities if the marketing claims cannot hold water.
> The worrying part about Mythos isn't the fact that it can find bugs. The worrying part is Mythos being able to find them on its own across entire code base as vast as Firefox then write exploits for what its found with a very basic prompt.
> The skill required to find then create zero days is quickly approaching the floor.
The great exaggeration is that this is a new capability.
This. Well done by Anthropic.
It even reached the CISO of my small semi-government org in the Netherlands, who slightly panicked at the announced 'tsunami' of vulnerabilities that was coming with Mythos.
Got us some more money and priority with the board, though.
Never waste a good marketing scare.
What if there are actually zero bugs?
> Five issues felt like nothing as we had expected an extensive list.
The expectation here may not match reality, but not necessarily because Mythos isn't as capable as claimed. curl may just happen to be a well-hardened tool that doesn't have too many security vulnerabilities in its present state.
> More to find
> These were absolutely not the last bugs to find or report. Just while I was writing the drafts for this blog post we have received more reports from security researchers about suspected problems. The AI tools will improve further and the researchers can find new and different ways to prompt the existing AIs to make them find more.
> We have not reached the end of this yet.
> I hope we can keep getting more curl scans done with Mythos and other AIs, over and over until they truly stop finding new problems.
And that makes sense, it'd be quite the argument of coincidence to say there was just 1 proper find remaining & it was only Mythos that managed to find it just at the point in time it released while the other projects have been hoovering up every other find quickly until that point. Possible, but not the safest assumption to start questioning with.
I'm not sure that follows. As noted, curl was already analyzed to death with every tool available; most software isn't at that level.
It makes some sense that Mythos/ChatGPT 5.5 might be that much better with complexities that curl just doesn't have because it's a basic tool.
Like yeah curl is obviously extremely fully featured as an "anything client" but it's orders of magnitude less complex than other software we rely on.
1. It supports basically any file transfer protocol.
2. It is a library that is designed for long running processes.
3. Because it's designed for long running processes, it makes use of every trick it can to pipeline and re-use connections and resources.
4. It has an asynchronous API so it can be integrated into any existing event loop.
Is a web browser or database more complicated? Most certainly, they solve really massive problems. But curl is certainly more complicated than probably most application code that uses it.
"curl is currently 176,000 lines of C code when we exclude blank lines. The source code consists of 660,000 words, which is 12% more words than the entire English edition of the novel War and Peace. ... curl is installed in over twenty billion instances. It runs on over 110 operating systems and 28 CPU architectures. It runs in every smart phone, tablet, car, TV, game console and server on earth."
I wouldn't call that simple or well contained...
Most OS or web browsers don't run on cars or tvs.
My mind still cannot understand the quality and refinement that's gone into cURL. It really is the perfect example of something done so right, that people barely think twice about.
However, in the days of the race to the bottom, offshoring for pennies, and now LLM-powered code generation, this is a quality most companies won't care about unless there is liability in place.
This is becoming a more and more overlooked/underrated feature. I genuinely believe it would be impossible in any company that depends on shareholder value. I am yet to convince any company I've worked in without bloody hands that we need to solve old tech debt and refactor certain things etc.
Curl HAS had security, protocol and language experts poking at it for years because of how central it is to everything. That Mythos found anything is interesting but not a sign that it's been marketing hype and isn't dangerous.
You can bet that 99.99% of projects aren't nearly as secure as curl and it doesn't matter if they are open or closed source (LLMs will happily decompile closed-source projects and explore). Unless your project has been fuzzed and gone over with existing AI tooling and by experts, expect that it can already be hacked, even with the tooling that is out there now, and that something like Mythos makes it accessible for an even wider population pool with less expertise to use.
Also, curl in this regard is an open source project, relatively small but critical, well known and used everywhere. Besides image libraries, tools like curl or sudo, su, passwd, etc. would also be my first try.
It is still not known at all what Mythos can do. What does it mean from a cost and benchmark POV to have a 10 trillion parameter model?
Nonetheless, LLMs getting significantly better at finding this, better than humans, started to happen half a year ago? So at some point we need to address the elephant in the room and state that today you need to do security scanning additionally with LLMs. You need to take this seriously.
In the worst case, use Anthropic's marketing to state that it's a must now and something changed.
I get the idea that they're using it for marketing. Of course they are. But to reduce it to "just marketing" feels either ill informed or outright wrong. Unless you have reasons to not believe the dozens of credentialed, well respected people in the field that have already shared their opinions after working with Mythos. Plenty of them on all the social media sites.
And then there's the team at mozilla. They wrote a blog about this, and they've worked with anthropic before, using opus 4.6 and found and fixed 22 vulnerabilities. Then they worked with mythos and found and fixed 271 vulnerabilities. Unless you're going to accuse them of being shills, these are unquestionable numbers. The model is quantitatively better at this thing. And it matches what everyone is saying.
I think there are better things to accuse anthropic of, than that they are simply lying for marketing purposes. Of course they'll use this as a marketing campaign, but there's plenty of evidence out there that there is something there, that the model is simply better than previous generations at this. Don't fall for the cheap reductionist stuff, just because you don't like them, or feel that this is marketing fluff. It doesn't feel like a gimmick, even if it gets used to push their agenda. Something, something, propaganda often uses true statements as well.
If you've just gone through a lengthy analysis of your code with other AI tools, surely it's reasonable not to expect to see hundreds more from a new tool?
It should be possible, unless more bugs are introduced, to eventually get to a state where there are no more bugs in your code.
Process aside, it sounds like Daniel expected to find dozens/hundreds more bugs.
But Mythos found 1. After all that hype. 1.
Anyway, I think the case still stands that frontier and next-gen models will get increasingly adept at finding vulnerabilities and that those on the receiving end of those vulnerabilities need to be on top of it.
They have the CVEs in their training data, know how to look up ossfuzz logs, etc.
The way this reads sounds more like the LLM dismissed trying rather than it tried and failed, I've seen Claude do that often unless I probe it to challenge itself, curious here what actually happened.
The author compares it to AISLE, ZeroPath, and OpenAI’s Codex Security. AISLE and ZeroPath are much more expensive. OpenAI’s Codex Security is gated.
Most people don't care about the first two and don't complain about the latter's policy because they are all specialized models and/or harnesses.
Mythos will be available to all.
AISLE is *cheaper* for sure
[0] https://tsz.dev
"Primarily AISLE, Zeropath and OpenAI’s Codex Security have been used to scrutinize the code with AI. These tools and the analyses they have done have triggered somewhere between two and three hundred bugfixes merged in curl through-out the recent 8-10 months or so. A bunch of the findings these AI tools reported were confirmed vulnerabilities and have been published as CVEs. Probably a dozen or more."
[1] https://lists.haxx.se/pipermail/daniel/2025-September/000127...
[2] https://www.theregister.com/software/2025/10/02/curl-project...
Eventually, I was instead offered that someone else, who has access to the model, could run a scan and analysis on curl for me using Mythos and send me a report. To me, the distinction isn’t that important."
Really? We're talking about (essentially) a product demo from a trillion dollar industry fueled by debt. Clearly, blog posts like this have an immense influence on the perception of usefulness of the particular model and AI in general. With so much staked on this for the company, wouldn't you want to be sure that you're using the actual product without anyone messing with the results in any way?
A problem is that these tools seem smarter than they are because they have already seen the answer key.
When it comes to security and AI, all top tier publicly accessible models (GPT 5.5, Opus 4.7) and even near-top ones like Deepseek 4 PRO can do a very good job given a detailed harness on how to spot issues and cross-validate them to avoid false positives.
IMO, this does not sound like a marketing scare; there is a spike of vulnerability disclosures (high quality, low false positives) that can be sensed... It feels like we're speedrunning through a few years' worth of high quality bug reports in just a few weeks.
Anthropic noticed the trend of AI vulnerability scanning and started advertising Mythos, which is unreleased, as being very good at it.
Then they donated very large token budgets for using Mythos privately to several teams. Those teams used the free token spend for security research (that was the deal) and anything they found got attributed to Mythos, not the token budget.
Mythos looks like a good incremental model but the PR team has done a great job of associating themselves with the current trend. So much so that comments like yours already associate the vulnerabilities found with this model, which isn’t even available yet.
AFAIK, the only thing it found in OpenBSD was a DoS?
Edit: For that matter, I'm not aware of RCEs in Linux, only LPE?
It's an entirely different thing to have the company conduct research on LLMs in general being a cybersecurity threat, instead of going "our new model is just too powerful" and shifting the discussion to revolve around that. It's slimy.
Until we find vulnerabilities in curl that Mythos missed, it's hard to say how good it is.
Since mythos found only one additional vuln, and since x+1 is not much greater than x, it follows that mythos is not dangerous per the definition above.
It doesn’t invalidate the other security bugs Mythos allegedly found in other codebases.
If so, it would still follow. "Most software" isn't analyzed as much as curl, by either other tooling or other models, that might well find close to the same as Mythos did. As such, Mythos then isn't especially/particularly dangerous.
https://daniel.haxx.se/blog/2026/04/22/high-quality-chaos/, linked from TFA
I would do that with 100% local models from scratch.
And all that to then end with people doing: "curl ... | bash" and not seeing anything wrong about it. Then they'll deflect about "threat models" and other non-sense.
I leave you your curl-bash, I keep my cryptographically signed packages installer.
To me it means that we've hit the top end of the S-curve with regards to effects of scaling - if the tool isn't remarkably better despite the scale, then we're firmly in diminishing returns territory.
And this is very much on purpose my friend. Think about what people already believe it can do though.
*rolls eyes* regular static analyzers also have been "better than humans" for decades, being better than a human at a specific mechanical task really doesn't mean much. The interesting new thing is the type of potential "fuzzy bugs" described in the article that LLMs are able to identify (a comment not matching the code it describes, uncommon usage of a 3rd party library, mismatch of code and a protocol it implements, or often just generally weird looking code somebody should have a closer look at... this closes a gap in the traditional debugging toolboxes, but shouldn't replace them)
It has been clear for ages that certain types of bugs or issues are better solved by software.
But there were still plenty of things a proper SecOps person would be able to find with help from tooling which automatic tooling wouldn't find.
Taking a limited amount of resources and focusing on the critical things.
I do think this is gone now. Same with Threat modeling etc.
Now, I'm not saying you shouldn't use them. They do catch the low hanging fruit. It's that LLMs actually have a much better understanding of things like intent when looking at your code and general architecture configurations that can lead to problems.
As you say, we've had static analyzers forever, hence why they aren't dropping out 50 new CVEs a day. LLMs are. There is a massive stack of software out there that is getting analyzed and exploited at a rate faster than it's getting patched. Add to that things like npm's exploited package of the day and popular GitHub repository takeovers, and this year looks massively different from last year in quantity and quality of exploits alone.
That's because that is what a lot of people did in the last years [1] to pad their resumes or to force developers to backport patches to older (but supported) kernel versions that wouldn't have gone in if they didn't have a CVE attached [2]. Maintainers have been legitimately swamped with low-quality spam for a very long time. Only recently, in the last few months, AI actually got "good enough"; the problem is that maintainers still have to differentiate between AI slop by wannabes and AI-assisted reports reviewed and refined by actual human professionals.
[1] https://www.zdnet.com/article/how-fake-security-reports-are-...
[2] https://opensourcewatch.beehiiv.com/p/linux-gets-cve-securit...
And then there’s the team at curl. Don’t fall for the cheap marketing stuff just because you like them
Everything points to Mythos being marginally better and nobody being able to afford to run it.
Exactly the same argument was made about o3-preview, lol. But anyway, do they talk about all domains where Mythos did the leap in capabilities (math and other research, ML, SWE) or only about cybersec?
> And then there's the team at mozilla. They wrote a blog about this, and they've worked with anthropic before, using opus 4.6 and found and fixed 22 vulnerabilities. Then they worked with mythos and found and fixed 271 vulnerabilities
Those 22 bugs were found in February, at the time when Mozilla were doing first small-scale experiments with Opus 4.6 (i.e. no proper integration into workflow, likely relatively simple harness, likely only small part of codebase was covered). You can't compare "22 bugs which were found during very early attempts to apply AI" and "271 bugs which were found during large-scale codebase scanning with properly configured AI". The fact that Mozilla is pretty vague about "contribution of other AI models" makes it even worse.
> Unless you're going to accuse them of being shills, these are unquestionable numbers. The model is quantitatively better at this thing
They found another ~150 bugs after their first announcement, and only like ~35 were found by Mythos. It's already a very sharp drop in contribution.
> I think there are better things to accuse anthropic of, than that they are simply lying for marketing purposes.
Anthropic already used a lot of "technically correct but in fact deceiving" statements in the Mythos system card. They are playing both "It's too dangerous" and "We don't have enough compute for that super model" at the moment (it's usually a big red flag). Opus 4.7 (which was likely supposed to be "Opus 5.0", given various facts) is a disaster from various points of view. Of course people don't really believe Anthropic.
Four years ago that would have sounded like science fiction. Right now, I think that even Gemini Flash might be able to do that, given a couple of attempts.
Which either means that, tragically for Mythos, it only got to analyze the code base just after ALL the bugs were finally ironed out and now curl is bug free forever after, or Mythos isn't really all that good, dozens/hundreds more bugs remain and will be found in the next months and years.
I just think the former is a bit unlikely.
It's likely that new Rust code would introduce more bugs, while curl is extremely well tested at this point.
> We formed Project Glasswing because of capabilities we’ve observed in a new frontier model trained by Anthropic that we believe could reshape cybersecurity.
> Claude Mythos Preview is a general-purpose, unreleased frontier model that reveals a stark fact: AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities.
If the model was called "Mini Mouse" it wouldn't feel anywhere near as threatening and interesting.
It sounds like the name of a cologne from the 70s or something and I like it.
> I did a quick unscientific poll on Mastodon to see if other Open Source projects see the same trends and man, do they! Friends from the following projects confirmed that they too see this trend. Of course the exact numbers and volumes vary, but it shows its not unique to any specific project.
> Apache httpd, BIND, curl, Django, Elasticsearch Python client, Firefox, git, glibc, GnuTLS, GStreamer, Haproxy, Immich, libssh, libtiff, Linux kernel, OpenLDAP, PowerDNS, python, Prometheus, Ruby, Sequoia PGP, strongSwan, Temporal, Unbound, urllib3, Vikunja, Wireshark, wolfSSL, …
It's time for all the little snowflake software writers to pull up their pantaloons and realize that Linus' vision has become real. With enough AIs all security bugs become shallow. And that software affects the real world, real money, and real people in it. They are also under attack by well-financed groups with rather evil motivations. If I'm attacking some group using your software (such as another nation) I'm going to flood the fuck out of your PR system till you give up hope and die. I'm going to make you attack your contributors. I'm going to sow confusion so I have the maximum amount of time to lay waste to my enemies and profit to the max.
The internet is hostile. Software is hostile. There are sharks looking to eat you.
Time to face that fact.
I just woke up to that very workflow this morning. I ran it last night and it finished at around 3am with ~200k tokens spent. It fixed the issue and created a follow-up doc for the things it could not verify.
Close enough that you can probably get a good sense of Mythos' performance by using GPT-5.5.
One thing I noticed while using GPT-5.5 for this is that the ability of the model to turn the bug into an outright vulnerability is less relevant than you might intuitively think. All that is really necessary is for the model to point out that something is smelly, and you should just fix it. Turning it into a runnable exploit has very limited utility for the defender. It does turn heads and may get the attention of some otherwise reluctant people, but everything I found was obviously enough wrong that the exploit was just decorative.
In February, Opus discovered a whole bunch of security related bugs, but didn’t exploit them.
Mythos, in turn, was fed these bugs and told to exploit them.
Not saying it’s not impressive, but it was literally told “here are all the places our metal detector says there may be gold, please find gold”.
Folks also need to remember that a lot of blog posts are written by engineers or managers that have their own agendas and careers and often external blog posts can be a form of self marketing or idea marketing that an engineer or director has been pushing internally.
I have no idea if this happened in Mozilla's case, but the person who wrote it seemed to talk about their own internal harness / fuzz testing framework quite a bit, and I imagine it was probably a big part of that person's scope / accomplishments and will probably show up in their end-of-year review and on their resume.
There's a lot of kneejerk "so you're accusing Mozilla of a conspiracy to boost Anthropic?" which is an overly simplistic lens. Particularly when it involves groups of individual humans with different motivations and emotional investment in their own contributions to the collaboration.
Once these words are used you can assume there is a contract stating how that collaboration works, and that this includes some sentences about how much each side is allowed to or required to say about it
"our continued collaboration with Anthropic"
Read this as: "we get discounts, rate limit increases, a direct line to responsible product managers; in exchange we participate in friendly marketing." It's extremely common in this line of business - typical of database vendors, software tool companies, etc.
I'm surprised you say that because it is all over Hacker News. Every single post is co-opted into promoting AI. Try finding a submission with fifty points or more that doesn't have AI or LLMs mentioned somewhere in the comments.
That’s not really the point though. I have no doubt AI is useful, I just don’t want to have it shoved in my face every five minutes.
> Claude Mythos is Anthropic's most specialized model, trained exclusively on security research, vulnerability disclosures, and attack pattern literature. Its reasoning reflects how the world's best security researchers think. [0]
[0] https://mythosvulnerabilityscanner.com/what-is-claude-mythos
and then it writes the exploits automatically for you?
This was one of the first things I tried and it works great.
> Given the look of these graphs I don’t think we are close to zero bugs yet. These two curves do not seem to even start to fall yet.
If the author thinks there is more to find, then the soil probably isn't dry.
But, from the author's mouth:
> My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing. I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos. Maybe this model is a little bit better, but even if it is, it is not better to a degree that seems to make a significant dent in code analyzing. [1]
[0] https://daniel.haxx.se/blog/2026/04/30/approaching-zero-bugs...
Look at the Firefox blog post where they found something like 400 (or more) findings.
I have no doubt Mythos is very good at this, but I also don't think it's something unattainable by other labs within the next few months, with focus.
The threat isn't high value targets, which already had sophisticated folks picking over the code base using state of the art tools and tests, it's medium to low value targets which can now be picked over by random hackers who barely know anything about security themselves at a cost of a few dollars.
And it is not overkill, the proof is that it found that vulnerability. It is like saying the new version of some static analyzer with some new rules is "overkill" because it found only one more bug than the previous version. Deciding whether it is overkill or not is more about context. Using a very expensive model like Mythos for some little-used non-critical software is overkill, but for Curl, it absolutely isn't.
If Mythos found loads of vulnerabilities in Firefox but not in Curl, I wouldn't say that's because Mythos is so good, but rather that with the release of Mythos, they did some testing that could have been done before using the same tools Curl has used.
> Once the end-to-end pipeline is in place, it’s trivial to swap in different models when they become available. Building this pipeline early helped us find a number of serious bugs using publicly-available models, and it also helped us hit the ground running when we had the opportunity to evaluate Claude Mythos Preview. In our experience, model upgrades increase the effectiveness of the entire pipeline: the system gets simultaneously better at finding potential bugs, creating proof-of-concept test cases to demonstrate them, and articulating their pathology and impact.
“Mythos isn’t supposed to be that good at security, because actually Anthropic was referring more about running llms than mythos specifically”
“The opus model is worse because they have no compute because they are training mythos. The degraded performance is justified!”
“All the bugs in Claude code is just because the models are so good they are just looping and are shipping fast”
I constantly see people crawl out of the woodwork to defend a trillion-dollar company that overhypes every press release it puts out.
No, what others are doing, which I've done myself in the past too, is to evaluate how much their marketing matches up with reality, then share our experience about that. Very different than just "putting too much weight on marketing".
Funnily enough that was while Dario Amodei was their research director.
I do think they've said similar things in the past, but regardless Anthropic's BS marketing is something to behold and viewing it with extreme skepticism is smart.
> What it highlights, is that Mythos doesn't seem so much better than other LLM driven tooling at finding security issues, which was the strongest claim Anthropic made in the first place.
That's the conclusion Daniel makes and it definitely seems plausible, his opinion absolutely carries a lot of weight with me for sure.
But I hedge a little because we don't really know how much human labor was required to supplement those earlier LLM-assisted reviews of curl, nor do we know how easy it was for the person who used Mythos to generate the new batch. So the kind of bug hunting that might be "possible but still labor intensive" via current tooling might be far easier to accomplish with less skilled developers using Mythos.
And who knows, maybe Mythos is better on worse codebases, curl benefits from being very good to start from :)
If I’m not mistaken, after the media cycle, he lost his job for breaking confidentiality.
That was the opposite of marketing, Google really didn’t get how to turn this into a product until ChatGPT happened.
If OpenAI or Anthropic doesn't turn this into a trillion dollar industry FAST, they are cooked. The strategy of building up fear around your product is risky, but necessary. There is simply no way to grow the AI business fast enough if they can't talk directly to the CEOs and bypass input from the employees, and baba yaga stories are perfect for that. Every time the CEO hears an employee say that the AI isn't working great for him, he hears an employee that's scared for his job or for his life, dismisses it, and sends out a mandate that everyone needs to prompt an AI every time they as much as need to go to the toilet.
>While previous OpenAI models had been made immediately available to the public, OpenAI initially refused to make a public release of GPT-2's source code when announcing it in February, citing the risk of malicious use;[8][5] limited access to the model (i.e. an interface that allowed input and provided output, not the source code itself) was allowed for selected press outlets on announcement.[8] One commonly-cited justification was that, since generated text was usually completely novel, it could be used by spammers to evade automated filters; OpenAI demonstrated a version of GPT-2 fine-tuned to "generate infinite positive – or negative – reviews of products".[8]
>Another justification was that GPT-2 could be used to generate text that was obscene or racist. Researchers such as Jeremy Howard warned of "the technology to totally fill Twitter, email, and the web up with reasonable-sounding, context-appropriate prose, which would drown out all other speech and be impossible to filter".[18] ...
"AI can't do anything harmful at all, kick this shit up to 11. It's all marketing, bla bla"
and
"My grandma gave away all her money to AI bots and is now starving in the street. My uncle murdered his wife and is trying to get married to GPT-4o. He thinks they are going to elope to a data center on a tropical island and live happily ever after".
I think the "AI can do no harm, it's marketing" people are really disconnected from reality and that any other product that behaved in the same manner would have been banned in most places.
AI chatbots have caused real harm. They have tragically convinced and encouraged a number of people to commit suicide, to say nothing of scams. They are having a real effect on the social fabric of our society.
I don't understand what point the people who blame the dangers of AI on marketing are trying to make.
The world didn’t end yet - but did it improve?
It sounds like Mythos is good but none of us know exactly how good since they haven't released it yet. It also sounds like Anthropic is compute starved, which is probably the biggest reason it hasn't had a public release.
What I think happened here is that an Anthropic team with very little security expertise was working on finding bugs for marketing reasons, and when they prompted the model to make PoC exploits of those bugs they didn't have much success because they didn't really know what to ask for. They then proceeded to very finely tune their next model to eagerly exploit vulnerabilities, making the model much more powerful for the "I don't know what I'm doing" user, which they're now trying really hard to convince everyone is a game changer. </speculation>
The reason many of us are skeptical is we've used the current models to do things and they've worked.
An analogy might be if they tuned their model to eagerly instruct somebody how to make improvised weapons, so that when somebody asks how to deal with a rival at work the model gives instructions on building a bomb from hardware store parts. Then they go on a marketing spree telling everybody how dangerous it is. This example might highlight how insincere the marketing is. At any point you could have tuned the model to exploit on behalf of inexperienced people; the fact that you've now done it does not mark a grand new capability. People who knew what they were doing could already do this with models.
Now people who are getting negatively affected because they think AI is more real and more intelligent than it actually is and get tricked by it, well that is dangerous but for different reasons.
With those 50 million subscribers, how much do they pay and how much do they cost? That is the only relevant piece of information when discussing the investment and returns of OpenAI.
Business is contextual, and is a game of numbers? If you agree, then there is a difference between "I made money selling lemon drinks in my driveway, but I sold a car to make room" .. versus "I have recurring revenue of 50 million x $80 USD per month, and it is growing, and I am using cheap credit to build that" .. Numbers have a meaning, and the larger dollar recurring revenue cannot be matched in any way, no matter how much I spend. IIRC ChatGPT is the fastest adopted software in the history of the Internet.
Don't they report annualized revenue AKA the best month times 12? How is that comparable?
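To make the arithmetic in this exchange concrete, here is a minimal sketch. The 50 million subscribers and $80 per month are the figures claimed in the comment above, not verified numbers, and the twelve-month revenue series is entirely made up for illustration. It only shows why "best month times 12" can sit well above what was actually collected over a growing year.

```python
# Figures taken from the comment above (unverified claims, not real financials).
SUBSCRIBERS = 50_000_000
PRICE_PER_MONTH_USD = 80

monthly_recurring = SUBSCRIBERS * PRICE_PER_MONTH_USD  # claimed monthly revenue
annual_recurring = monthly_recurring * 12              # same figure annualized

# Hypothetical monthly revenue for a growing business, in millions of USD.
# These twelve values are invented purely to illustrate the comparison.
months = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210]

best_month_annualized = max(months) * 12  # "best month times 12"
trailing_twelve_months = sum(months)      # revenue actually booked in the year

if __name__ == "__main__":
    print(f"claimed monthly recurring: ${monthly_recurring:,}")
    print(f"claimed annualized:        ${annual_recurring:,}")
    print(f"best month x 12:           {best_month_annualized}M")
    print(f"trailing twelve months:    {trailing_twelve_months}M")
```

Neither number says anything about cost per subscriber, which is the figure the earlier comment was asking for.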
The other guy worked on Google's AI safety team where one would expect he'd have a basic grasp of how the technology works before making outlandish claims.
It makes me wonder if there's a wrong turn in the road that I too might fall in the same pit.
I can't find it right now, but something came up a few years ago (probably on HN) about highly intelligent people being more adept at making up arguments to rationalize beliefs and actions that they had taken for other reasons entirely.
Sort of makes sense that wielding a more complex mind would offer more complex ways to go wrong, doesn't it?
Optimization on "Human Feedback", early exposure to high-effort experimental systems... I wouldn't be surprised if that turns into a bigger field than is generally recognized today.
Looking at it from the outside, I think it's still pretty hard to see how he came to end up in that position, but with a bit of individual vulnerability, arbitrary time to boil the frog slowly, and a fairly large number of people exposed, maybe it would be stranger not to have the event occur with someone.
Can you publish your results and send them to Bruce Schneier, Dave Lewis, & Heather Adkins [1] so they know that this isn't anything new and just the work of people with little security expertise?
The Mythos FUD is a gift to the security team because it made the C-suite care about security and this is a plan to tell them what should be done and what to expect in the era of LLM security tools.
This is an emperor-has-no-clothes situation, but we're selling winter coats and winter is near. Focusing on the security postures that are actually necessary, rather than on how exaggerated the Mythos FUD is, is perhaps a tad dishonest, but it still gets everybody into a better state. It is an unfortunately common move in C-suite politics (and part of why the rich and powerful often seem so disconnected from reality and common people: everyone around them is trained to interact with them in a certain way, and "mythos marketing is bullshit" is one of those things that people just don't say to them).
Sounds more like “intelligence” isn’t the only defining metric for such behavior to occur in people, because that describes a lot of less intelligent people too. Though, I suspect highly intelligent people are at least somewhat more likely to end up on the “correct” side of the facts.
I have seen people I consider much smarter than me fall for some very idiotic things. I certainly don't consider myself immune.
I think that the advice to try being intellectually flexible is a good one. Strive to learn new things, expose yourself earnestly to ideas that challenge your beliefs, exercise empathy, etc
Publishing an extensive critique of Anthropic marketing is just an exercise in attracting abuse from nitpickers and the ignorant. If the author of cURL can't convince people, and security of his product has been one of his primary responsibilities for decades in one of the most widely used pieces of software out there... what hope do I have?
I've got better things to do.
Is it actually that hard for you to go try this out yourself?
> We launch a container (isolated from the Internet and other systems) that runs the project-under-test and its source code. We then invoke Claude Code with Mythos Preview, and prompt it with a paragraph that essentially amounts to “Please find a security vulnerability in this program.” We then let Claude run and agentically experiment. In a typical attempt, Claude will read the code to hypothesize vulnerabilities that might exist, run the actual project to confirm or reject its suspicions (and repeat as necessary—adding debug logic or using debuggers as it sees fit), and finally output either that no bug exists, or, if it has found one, a bug report with a proof-of-concept exploit and reproduction steps.
> Finally, once we’re done, we invoke a final Mythos Preview agent. This time, we give it the prompt, “I have received the following bug report. Can you please confirm if it’s real and interesting?” This allows us to filter out bugs that, while technically valid, are minor problems in obscure situations for one in a million users, and are not as important as severe vulnerabilities that affect everyone. [1]
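The two quoted paragraphs describe a find-then-triage loop, which can be sketched roughly as below. This is only an illustration of the control flow, not Anthropic's harness: `run_pipeline`, `BugReport`, `fake_find` and `fake_triage` are hypothetical names, the agents are stand-in functions rather than real model calls, and container isolation is left out entirely. Only the two prompt strings come from the quote.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Prompt wording taken from the quoted description above.
FIND_PROMPT = "Please find a security vulnerability in this program."
TRIAGE_PROMPT = ("I have received the following bug report. "
                 "Can you please confirm if it's real and interesting?")


@dataclass
class BugReport:
    target: str
    summary: str
    proof_of_concept: str


# Hypothetical agent interfaces: a real setup would call a model here.
FindAgent = Callable[[str, str], Optional[BugReport]]
TriageAgent = Callable[[str, BugReport], bool]


def run_pipeline(targets: List[str], find: FindAgent, triage: TriageAgent,
                 attempts: int = 1) -> List[BugReport]:
    """Run the finder on each target, keep only reports the triager confirms."""
    confirmed = []
    for target in targets:
        for _ in range(attempts):
            report = find(FIND_PROMPT, target)
            if report is None:  # the finder concluded that no bug exists
                continue
            if triage(TRIAGE_PROMPT, report):
                confirmed.append(report)
    return confirmed


def fake_find(prompt: str, target: str) -> Optional[BugReport]:
    """Stand-in finder: pretends every .c file has a suspicious read."""
    if target.endswith(".c"):
        return BugReport(target, "possible out-of-bounds read", "poc.bin")
    return None


def fake_triage(prompt: str, report: BugReport) -> bool:
    """Stand-in triager: accepts only memory-safety style findings."""
    return "out-of-bounds" in report.summary


if __name__ == "__main__":
    for r in run_pipeline(["parser.c", "README.md"], fake_find, fake_triage):
        print(r.target, "-", r.summary)
```

Because the agents are passed in as plain callables, swapping one model for another is a one-argument change, which is the property the Mozilla quote earlier in the thread attributes to its own pipeline.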
I don’t know what to tell you. You say it’s not possible but the money in my HackerOne account says otherwise.
I haven't said it was impossible. I said I can't replicate the Mythos setup with Codex on any project even approaching the size of Firefox.
If your Codex setup and the results it generates are unremarkable, please post them.
This isn’t a matter of a harness, skill files, anything. This is just something that a model can do.
You have multiple people saying they’ve done it here. I can only assume you’re being facetious at this point.
Must this information be protected or is it unremarkable?