Small models also found the vulnerabilities that Mythos found (aisle.com)

1284 points by dominicq 82 days ago | 341 comments

johnfn 82 days ago |

The Anthropic writeup addresses this explicitly:

> This was the most critical vulnerability we discovered in OpenBSD with Mythos Preview after a thousand runs through our scaffold. Across a thousand runs through our scaffold, the total cost was under $20,000 and found several dozen more findings. While the specific run that found the bug above cost under $50, that number only makes sense with full hindsight. Like any search process, we can't know in advance which run will succeed.

Mythos scoured the entire continent for gold and found some. For these small models, the authors pointed at a particular acre of land and said "any gold there? eh? eh?" while waggling their eyebrows suggestively.

For a true apples-to-apples comparison, let's see it sweep the entire FreeBSD codebase. I hypothesize it will find the exploit, but it will also turn up so much irrelevant nonsense that it won't matter.

kilpikaarna 82 days ago | |

Wasn't the scaffolding for the Mythos run basically a line of bash that loops through every file of the codebase and prompts the model to find vulnerabilities in it? That sounds pretty close to "any gold there?" to me, only automated.

Have Anthropic actually said anything about the amount of false positives Mythos turned up?

FWIW, I saw some talk on Xitter (so grain of salt) about people replicating their result with other (public) SotA models, but each turned up only a subset of the ones Mythos found. I'd say that sounds plausible from the perspective of Mythos being an incremental (though an unusually large increment perhaps) improvement over previous models, but one that also brings with it a correspondingly significant increase in complexity.

So the angle they choose to use for presenting it and the subsequent buzz is at least part hype -- saying "it's too powerful to release publicly" sounds a lot cooler than "it costs $20000 to run over your codebase, so we're going to offer this directly to enterprise customers (and a few token open source projects for marketing)". Keep in mind that the examples in Nicholas Carlini's presentation were using Opus, so security is clearly something they've been working on for a while (as they should, because it's a huge risk). They didn't just suddenly find themselves having accidentally created a super hacker.

johnfn 82 days ago | | |

> Wasn't the scaffolding for the Mythos run basically a line of bash that loops through every file of the codebase and prompts the model to find vulnerabilities in it? That sounds pretty close to "any gold there?" to me, only automated.

But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none. Both are worthless without human intervention.

I definitely breathed a sigh of relief when I read it was $20,000 to find these vulnerabilities with Mythos. But I also don't think it's hype. $20,000 is, optimistically, a tenth the price of a security researcher, and that shift does change the calculus of how we should think about security vulnerabilities.

omcnoe 82 days ago | | |

Difference is the scaffold isn’t “loop over every file” - it’s loop over every discovered vulnerable code snippet.

If you isolate the codebase to just the specific known vulnerable code up front, it isn’t surprising the vulnerabilities are easy to discover. Same is true for humans.

Better models can also autonomously do the work of writing proof of concepts and testing, to autonomously reject false positives.

eichin 82 days ago | | |

That was the scaffolding for the Claude 4.6 run discussed here https://news.ycombinator.com/item?id=47633855 - if that's all it takes, dealing with Mythos is way too late :-)

adam_patarino 81 days ago | | |

Anthropic has had the chance to explain what they did rationally. Instead they chose to be opaque and grandiose.

Giving them the benefit of the doubt is no longer appropriate.

leiyu19880522 82 days ago | | |

Been building AI coding tools for a while. The false positive problem is real - we had a user report every console.log flagged as a security issue. Small models can work with very specific prompting and domain training data.

asasidh 81 days ago | | |

yes their scaffold was a variation of claude --dangerously-skip-permissions -p "You are playing in a CTF. Find a vulnerability. hint: look in src folder. Write the most serious one to ./va/report.txt." --verbose

nottorp 82 days ago | | |

> Have Anthropic actually said anything about the amount of false positives Mythos turned up?

What? You want honest "AI" marketing?

Would you also like them to tell you how much human time was spent reviewing those found vulnerabilities before passing them on? And a unicorn delivered on Mars?

slashdave 82 days ago | | |

Signal to noise

notnullorvoid 82 days ago | |

> I hypothesize it will find the exploit, but it will also turn up so much irrelevant nonsense that it won't matter.

The trick with Mythos wasn't that it didn't hallucinate nonsense vulnerabilities, it absolutely did. It was able to verify some were real though by testing them.

The question is if smaller models can verify and test the vulnerabilities too, and can it be done cheaper than these Mythos experiments.

hibikir 82 days ago | | |

People often undervalue scaffolding. I was looking at a bug yesterday, reported by a tester. He has access to Opus, but only through Amazon Q, looking at a single repo. It provided some useful information, but the scaffolding wasn't good enough.

I took its preliminary findings into Claude Code with the same model. But in mine it knows where every adjacent system is, the entire git history, deployment history, and state of the feature flags. So instead of pointing at a vague problem, it knew which flag had been flipped in a different service, saw how that changed behavior, and worked out how, if the flag was flipped in prod, it'd make the service under testing cry, and which code change to make to make sure it works both ways.

It's not as if a modern Opus is a small model: Just a stronger scaffold, along with more CLI tools available in the context.

The issue here in the security testing is to know exactly what was visible, and how much it failed, because it makes a huge difference. A middling chess player can find amazing combinations at a good speed when playing puzzle rush: You are handed a position where you know a decisive combination exists, and that it works. The same combination, however, might be really hard to find over the board, because in a typical chess game it's rare for those combinations to exist, and it takes a lot of energy to thoroughly check for them and calculate all the way through every possible line. This is why chess grandmasters would consider just being able to see the computer score for a position to be massive cheating: Just knowing when the last move was a blunder would be a decisive advantage.

When we ask a cheap model to look for a vulnerability with the right context to actually find it, we are already priming it, vs asking to find one when there's nothing.

bredren 82 days ago | | |

The article positions the smaller models as capable under expert orchestration, which to be any kind of comparable must include validation.

iririririr 82 days ago | | |

so it's just better at hallucinations, but they added discrete code that works as a fuzzer/verifier?

WhyNotHugo 82 days ago | |

OTOH, this article goes too far the opposite extreme:

> We isolated the vulnerable svc_rpc_gss_validate function, provided architectural context (that it handles network-parsed RPC credentials, that oa_length comes from the packet), and asked eight models to assess it for security vulnerabilities.

To follow your analogy, they pointed to the exact room where the gold was hidden, and their model found it. But finding the right room within the entire continent is honestly the hard part.

mattmanser 82 days ago | | |

Or would it have anyway if they hadn't pointed it at it? Who knows?

Just like people paid by big tobacco found no link to cancer in cigarettes, researchers paid for by AI companies find amazing results for AI.

Their job literally depends on them finding Mythos to be good, we can't trust a single word they say.

rakel_rakel 82 days ago | |

Spending $20000 (and whatever other resources this thing consumes) on a denial of service vulnerability in OpenBSD seems very off balance to me.

Given the tone with which the project communicates when discussing other operating systems' approaches to security, I understand that it can be seen as some kind of trophy for Mythos. But really, searching the number of errata on the releases page that include "could crash the kernel" makes me think that investing in the OpenBSD project by donating to the foundation would be better than using your closed source model for peacocking around people who might think it's harder than it is to find such a bug.

theptip 82 days ago | | |

It’s $20k for all the vulns found in the sweep, not just that one.

And last security audit I paid for (on a smaller codebase than OpenBSD) was substantially more than $20k, so it’s cheaper than the going price for this quality of audit.

paulddraper 82 days ago | | |

You don’t see the value of vulnerabilities as on the order of 20k USD?

When it’s a security researcher, HN says that’s a squalid amount. But when it’s a model, it’s exorbitant.

adampunk 80 days ago | | |

20,000 is the most this will ever cost.

celeritascelery 82 days ago | |

That was my thought exactly. If small models can find these same vulnerabilities, and your company is trying to find vulnerabilities, why didn’t you find them?

echelon 82 days ago | | |

Who is spending millions of dollars on small models to find vulns? Nobody else is selling here or has the budget to sell quite like this.

Anthropic spends millions - maybe significantly more.

Then when they know where they are, they spend $20k to show how effective it is in a patch of land.

They engineered this "discovery".

What the small teams are doing is fair - it's just a scaled down version of what Anthropic already did.

petters 82 days ago | | |

They have found a large number in OpenSSL.

jerf 82 days ago | | |

I speculatively fired Claude Opus 4.6 at some code I knew very well yesterday as I was pondering the question. This code has been professionally reviewed about a year ago and came up fairly clean, with just a minor issue in it.

Opus "found" 8 issues. Two of them looked like they were probably realistic but not really that big a deal in the context it operates in. It labelled one of them as minor, but the other as major, and I'm pretty sure it's wrong about it being "major" even if it is correct. Four of them I'm quite confident were just wrong. 2 of them would require substantial further investigation to verify whether or not they were right or wrong. I think they're wrong, but I admit I couldn't prove it on the spot.

It tried to provide exploit code for some of them, none of the exploits would have worked without some substantial additional work, even if what they were exploits for was correct.

In practice, this isn't a huge change from the status quo. There's all kinds of ways to get lots of "things that may be vulnerabilities". The assessment is a bigger bottleneck than the suspicions. AI providing "things that may be an issue" is not useless by any means but it doesn't necessarily create a phase change in the situation.

An AI that could automatically do all that, write the exploits, and then successfully test the exploits, refine them, and turn the whole process into basically "push button, get exploit" is a total phase change in the industry. If it in fact can do that. However based on the current state-of-the-art in the AI world I don't find it very hard to believe.

It is a frequent talking point that "security by obscurity" isn't really security, but in reality, yeah, it really is. An unknown but presumably staggering number of security bugs of every shape and size are out there in the world, protected solely by the fact that no human attacker has time to look at the code. And this has worked up until this point, because the attackers have been bottlenecked on their own attention time. It's kind of just been "something everyone knows" that any nation-state level actor could get into pretty much anything they wanted if they just tried hard enough, but "nation-state level" actor attention, despite how much is spent on it, has been quite limited relative to the torrent of software coming out in the world.

Unblocking the attackers by letting them simply purchase "nation-state level actor"-levels of attention in bulk is huge. For what such money gets them, it's cheap already today and if tokens were to, say, get an order of magnitude cheaper, it would be effectively negligible for a lot of organizations.

In the long run this will probably lead to much more secure software. The transition period from this world to that is going to be total chaos.

... again, assuming their assessment of its capabilities is accurate. I haven't used it. I can't attest to that. But if it's even half as good as what they say, yes, it's a huge huge huge deal and anyone who is even remotely worried about security needs to pay attention.

rakejake 82 days ago | | |

Maybe they did use small models but you couldn't make the front page of HN with something like this until Anthropic made a big fuss out of it. Or perhaps it is just a question of compute. Not everyone has 20k$ or the GPU arsenal to task models to find vulnerabilities which may/may not be correct?

Unless Anthropic makes it known exactly what model + harness/scaffolding + prompt + other engineering they did, these comparisons are pointless. Given the AI labs' general rate of doomsday predictions, who really knows?

davemp 82 days ago | |

> Across a thousand runs through our scaffold, the total cost was under $20,000

Lots of questions about the $20k. Is that raw electricity costs, subsidized user token costs? If so, the actual costs to run these sorts of tasks sustainably could be something like $200k. Even at $50k, a FreeBSD DoS is not an extremely competitive price. That's like 2-4mo of labor.

Don't get me wrong, I think this seems like a great use for LLMs. It intuitively feels like a much more powerful form of white box fuzzing that used techniques like symbolic execution to try to guide execution contexts to more important code paths.

hellcow 82 days ago | |

It seems feasible to use a small/cheap model to flag possible vulnerabilities, and then use a more expensive model to do a second-pass to confirm those, rather than on every file. Could dramatically reduce the total cost and speed up the process.

conception 82 days ago | | |

Does it? I don’t see quality from small models being high enough to be able to effectively scour a code base like this.

alpha_squared 82 days ago | |

This is addressed elsewhere in the comments, but it appears this is actually a direct comparison to how Anthropic got their Mythos headline results.

https://news.ycombinator.com/item?id=47732322

Aurornis 82 days ago | | |

How is that a direct comparison? The link you gave has a quote that says it’s not:

> Scoped context: Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior"). A real autonomous discovery pipeline starts from a full codebase with no hints

They pointed the models at the known vulnerable functions and gave them a hint. The hint part is what really breaks this comparison because they were basically giving the model the answer.

yorwba 82 days ago | |

We don't even need to hypothesize that much on the irrelevant nonsense, since they helpfully provide data with the detected vulnerability patched: https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jag... and half of the small models they touted as finding the vulnerability still found it in the patched code in 3/3 runs. A model that finds a vulnerability 100% of the time even when there is none is just as informative as a model that finds a vulnerability 0% of the time even when there is one. You could replace it with a rock that has "There's a vulnerability somewhere." engraved on it.

They're a company selling a system for detecting vulnerabilities reliant on models trained by others, so they're strongly incentivized to claim that the moat is in the system, not the model, and this post really puts the thumb on the scale. They set up a test that can hardly distinguish between models (just three runs, really??) unless some are completely broken or work perfectly, the test indeed suggests that some are completely broken, and then they try to spin it as a win anyway!

A high false-positive rate isn't necessarily an issue if you can produce a working PoC to demonstrate the true positives, where they kinda-sorta admit that you might need a stronger model for this (a.k.a. what they can't provide to their customers).

Overall I rate Aisle intellectually dishonest hypemongers talking their own book.
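
The "rock" argument above can be made concrete with a small calculation. This is only a sketch: the 3/3 figures mirror the comment, while the other run counts are hypothetical. It scores a detector by the gap between how often it flags the vulnerable code and how often it flags the patched code.

```python
def informedness(flags_on_vulnerable, runs_on_vulnerable,
                 flags_on_patched, runs_on_patched):
    """Youden's J = true-positive rate - false-positive rate.

    0.0 means the verdict carries no information about whether the
    bug is present; 1.0 means it separates the two cases perfectly.
    """
    tpr = flags_on_vulnerable / runs_on_vulnerable
    fpr = flags_on_patched / runs_on_patched
    return tpr - fpr

# Flags the bug in 3/3 runs, but also flags the patched code in 3/3 runs.
always_yes = informedness(3, 3, 3, 3)
# Never flags anything (hypothetical).
always_no = informedness(0, 3, 0, 3)
# Flags the bug in 3/3 runs and the patched code in 0/3 runs (hypothetical).
discriminating = informedness(3, 3, 0, 3)

print(always_yes, always_no, discriminating)  # 0.0 0.0 1.0
```

With only three runs per condition these rates are very coarse estimates, which is the comment's other complaint about the test design.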

SoftTalker 82 days ago | |

How much of that is simply scale? Anthropic threw probably an entire data center at analyzing a code base. Has anyone done the same with a "small" model?

jstanley 82 days ago | | |

It's still useful if $20k of consultants would be less effective.

lmeyerov 82 days ago | |

Instead of scanning more code, afaict what you seem to want is instead, scan on the same small area, and compare on how many FPs are found there. A common measure here is what % of the reported issues got labeled as security issues and fixed. I don't see Mythos publishing on relative FP rate, so dunno how to compare those. Maybe something substantively changed?

At the same time, I'm not sure that really changes anything because I don't see a reason to believe attacks are constrained by the quality of source code vulnerability finding tools, at least for the last 10-15 years after open source fuzzing tools got a lot better, popular, and industrialized.

This might sound like a grumpy reply, but as someone on both sides here, it's easy to maintain two positions:

1. This stuff is great, and doing code reviews has been one of my favorite claude code use cases for a year now, including security review. It is both easier to use than traditional tools, and opens up higher-level analysis too.

2. Finding bugs in source code was sufficiently cheap already for attackers. They don't need the ease of use or high-level thing in practice, there's enough tooling out there that makes enough of these. Likewise, groups have already industrialized.

There's an element of vuln-pocalypse that may be coming with the ease of use going further than already happening with existing out-of-the-box blackbox & source code scanning tools. That's not really what I worry about though.

Scarier to me, instead, is what this does to today's reliance on human response. AI rapidly industrializes how attackers escalate access and wedge in once they're in. Even without AI, that's been getting faster and more comprehensive, and with AI, the higher-level orchestration can get much more aggressive for much less capable people. So the steady stream of existing vulns & takeovers into much more industrialized escalations is what worries me more. As coordination keeps moving into machine speed, the current reliance on human response is becoming less and less of an option.

shmagadee 81 days ago | |

I've read this statement a bunch of times and am still unclear what it is saying. It could mean:

- The entire set of thousands of "findings" was generated with $20k worth of runs (have seen this in press publications and many user posts online).
- Only the OpenBSD-specific findings were generated with $20k.
- Some other subset of findings associated with a specific run configuration was generated with $20k?

I've also asked several LLMs to parse the wording for more clarity without success. They all highlight it as ambiguous wording. Why not use more direct language and provide the supporting data? They also stated that they are providing $100M in credits to their partners. So if bullet 1 or 2 are the meaning and "findings" scale linearly with cost, we're talking either millions (100M/20k * 1k+ findings) or hundreds of thousands. Does that make any sense? Or is the idea that all of these companies will run scans across their critical codebases continuously? Anyone else have a better sense of the math going on here?
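
The back-of-the-envelope math above can be written out. This is a sketch under the comment's own assumptions (findings scale linearly with spend, $100M in credits, $20k per sweep). The findings-per-sweep values are placeholders for the two readings, not figures published by Anthropic.

```python
CREDITS_USD = 100_000_000   # credits figure cited in the comment
SWEEP_COST_USD = 20_000     # "under $20,000" for a thousand-run sweep

# Number of $20k sweeps the credits would buy.
sweeps = CREDITS_USD // SWEEP_COST_USD

def projected_findings(findings_per_sweep):
    # Assumes findings scale linearly with spend, as the comment does.
    return sweeps * findings_per_sweep

print(sweeps)                     # 5000
print(projected_findings(1_000))  # reading 1, ~1k findings per sweep: 5000000
print(projected_findings(36))     # reading 2, "several dozen" (36 is a placeholder): 180000
```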

hoppp 82 days ago | |

They pay me 20k and give me time maybe I find it also.

LordDragonfang 82 days ago | | |

No, you wouldn't. The vulnerability has been in the codebase for 17 years. Orders of magnitude more than 20k in security professional salary-hours have been pointed at the FreeBSD codebase over the past decade and a half, so we already know a human is unlikely to have found it in any reasonable amount of time.

coldtea 80 days ago | |

>Mythos scoured the entire continent for gold and found some. For these small models, the authors pointed at a particular acre of land and said "any gold there? eh? eh?" while waggling their eyebrows suggestively.

Which sounds trivial for a hacker wanting to find vulnerabilities to replicate, so what's the huge advantage of Mythos then? That you don't need to spend 5 minutes to nudge it to the most complex/ripe for vulnerabilities parts of a codebase?

lukev 82 days ago | |

This is a really interesting point though -- it's really scaffold-dependent.

Because for the same price, you could point the small model at each function, one by one, N times each, across N prompts instructing it to look for a specific class of issue.

It's not that there's no difference between models, but it's hard to judge exactly how much difference there is when so much depends on the scaffold used. For a properly scientific test, you'd need to use exactly the same one.

Which isn't possible when Anthropic won't release the model.

klempner 82 days ago | |

The broad answer to the "irrelevant nonsense" for something like this is to use more expensive models to validate.

You don't need a model with a false positive rate that's good enough to not waste my time -- you just need one that's good enough to not waste the time (tokens) of Mythos or whatever your expensive frontier model is. Even if it's not, you have the option of putting another layer of intermediate model in the middle.
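
A rough cost model for the cascade described above, as a sketch. All prices and rates are made-up placeholders, not measurements of any real model. The point is only how the cheap stage's flag rate drives the expensive stage's bill.

```python
def cascade_cost(n_files, cheap_cost, expensive_cost, flag_rate):
    """Cost of screening every file with a cheap model and sending
    only the flagged fraction to an expensive model for validation."""
    screening = n_files * cheap_cost
    validation = n_files * flag_rate * expensive_cost
    return screening + validation

N_FILES = 10_000   # hypothetical codebase size
CHEAP = 0.01       # hypothetical dollars per file, small model
EXPENSIVE = 2.00   # hypothetical dollars per file, frontier model

baseline = N_FILES * EXPENSIVE                             # frontier model on every file
selective = cascade_cost(N_FILES, CHEAP, EXPENSIVE, 0.05)  # small model flags 5% of files
noisy = cascade_cost(N_FILES, CHEAP, EXPENSIVE, 0.95)      # small model flags 95% of files

print(baseline, selective, noisy)
```

With these placeholder numbers a selective first stage cuts the bill sharply, while a first stage that flags nearly everything saves almost nothing. The model also ignores real bugs the cheap stage fails to flag, which never reach the validator at all.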

AbstractH24 81 days ago | |

So the real learning here is the cost of “using” GenAI to do things is declining at a rapid speed.

We’re not doing anything that couldn’t be done before, we’re just doing it faster, easier and cheaper.

Sounds like a recipe for a lot of junk being built. Also sounds like something that’s been true since the beginning of humanity.

In the more near term, sounds like a reminder the datacenters and processing boom will look a lot like the fiber one.

letitgo12345 82 days ago | |

Can't you execute the bug to see if the vulnerability is real? So you have a perfect filter. Maybe Mythos decided w/o executing but we don't know that.

andy_ppp 82 days ago | |

I wonder if you could just setup a small model and suggest a load of things and try every file and it might still end up being cheaper and just as good as Mythos at a specific task. Maybe this will be something that holds true for more things, formulating a small model to do specific things may well end up being as effective/efficient as a larger model looking at a huge solution space.

mlmonkey 81 days ago | |

We can reduce this to an even more basic question: if these small models are equally capable of finding vulnerabilities, why haven't they done so yet? After all, the source code is out in the open, and has been for decades. Please go ahead, find (and report) the vulnerabilities.

glerk 82 days ago | |

I'm having trouble finding this info (I assume they won't publish it), but could the secret sauce be much larger and more readily accessible context window?

OpenBSD's code is in the 10s of millions of lines. Being able to hold all of it in context would make bug finding much easier.

johnfn 82 days ago | | |

You can look at some of the bugs, if you'd like. They are (at least the ones I looked at) fairly self-contained, scoped to a single function, a hundred lines or less. There's no need for a massive amount of context.

Sparkyte 82 days ago | |

Why not just write many small models for explicit tasks than running one bigger model anyway? I prefer the agentic subject matter expert design anyway. I suppose because it wants to look at the whole code base?

cyanydeez 82 days ago | |

so what you're saying is no one could ever write a loop like:

  for githubProject in githubProjects
    opencode command /findvulnerability
  end for

Seems like a silly thing to try and back up.

tredre3 82 days ago | | |

What he's saying is that you should read the "Caveats and limitations" section of the article.

Here's the first one:

> Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior").

Mythos did no such thing, it was cut loose and told to find vulnerabilities. If the intent was to prove that small models are just as good, they haven't demonstrated that at all. The end.

epistasis 82 days ago |

> We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis. Eight out of eight models detected Mythos's flagship FreeBSD exploit, including one with only 3.6 billion active parameters costing $0.11 per million tokens.

Impressive, and very valuable work, but isolating the relevant code changes the situation so much that I'm not sure it's much of the same use case.

Being able to dump an entire code base and have the model scan it is the type of situation where it opens up vulnerability scans to an entirely larger class of people.

tptacek 82 days ago |

If you cut out the vulnerable code from Heartbleed and just put it in front of a C programmer, they will immediately flag it. It's obvious. But it took Neel Mehta to discover it. What's difficult about finding vulnerabilities isn't properly identifying whether code is mishandling buffers or holding references after freeing something; it's spotting that in the context of a large, complex program, and working out how attacker-controlled data hits that code.

It's weird that Aisle wrote this.

antirez 82 days ago |

Congrats: completely broken methodology, with a big conflict of interest. Giving specific bug hints, with an isolated function that is suspected to have bugs, is not the same task, NOR (crucially) is it a task you can decompose the bigger task into. It is basically impossible to segment code in pieces, provide pieces to smaller models, and expect them to find all the bugs GPT 5.4 or other large models can find. Second: the smarter the model, the less important the pipeline is. In the last couple of days I found tons of Redis bugs with a three-prompt open-ended pipeline composed of a couple of shell scripts. Do you think I was not already trying with weaker models? I did, but it didn't work. Don't trust what you read, you have access to frontier models for 20$ a month. Download some C code, create a trivial pipeline that starts from a random file and looks for vulnerabilities, then another step that validates it under a hard test, like ASAN crash, or ability to reach some secret, and so forth, and only then the problem can be reported. Test for yourself what is possible. Don't let your fear make you blind. Also, there is a big problem that makes the blog post reasoning not just weak per se, but categorically weak: if small model X can find 80% of vulnerabilities, if there is a model Y that can find the other potential 20%, we need "Y": the maintainers should make sure they have access to models that are at least as good as those of the black hat folks.
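
The validation step described above (only report a finding once it survives a hard test such as an ASAN crash) might look roughly like this. It is a minimal sketch: the helper name and the idea of passing in a prebuilt reproducer command are assumptions for illustration. It only checks for the `ERROR: AddressSanitizer` marker that ASAN-instrumented binaries print to stderr when they abort.

```python
import subprocess
import sys

ASAN_MARKER = "ERROR: AddressSanitizer"

def confirmed_by_asan(repro_cmd, timeout_s=30):
    """Run a candidate reproducer and return True only if the target,
    assumed to be built with -fsanitize=address, reports an ASAN error."""
    try:
        proc = subprocess.run(repro_cmd, capture_output=True, text=True,
                              errors="replace", timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # a hang is not counted as a confirmed finding here
    return ASAN_MARKER in proc.stderr

# Stand-ins for real reproducers: one prints an ASAN-style line, one exits cleanly.
fake_crash = [sys.executable, "-c",
              "import sys; sys.stderr.write('==1==ERROR: AddressSanitizer: "
              "heap-buffer-overflow\\n'); sys.exit(1)"]
fake_clean = [sys.executable, "-c", "pass"]

candidates = {"finding-A": fake_crash, "finding-B": fake_clean}
reportable = [name for name, cmd in candidates.items() if confirmed_by_asan(cmd)]
print(reportable)  # ['finding-A']
```

A filter like this discards every candidate that cannot be made to crash, so a high hallucination rate upstream costs compute but does not reach the maintainers.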

muyuu 82 days ago |

I think the "Mythos" name is genius. The people at Anthropic make a bunch of claims and the public is expected to just believe them without any possibility of testing those claims or reproducing those results, and since so many people are invested in this saviour for the Global economy, or in the industry in general, or in hype to feed their engagement-based income sources, then there is faith to spare.

Meanwhile this mythical beast wasn't able to prevent the Bun vulnerability that exposed their code, let alone precluding the need to acquire that IP in the first place for presumably hundreds of millions of $$$, instead of coding a better replacement or a solution of its own.

What is real and measurable is that subscription plan users are getting a much degraded service for the same money through both open and hidden policies, while Anthropic moves compute to serve off-the-counter customers. The same people who come with the most obvious and brazen lies to dismiss the clear degradation of their service also come with this "security" justification for a move that looks just like good old market segmentation which would perfectly fit the strong symptoms that they cannot afford to offer tokens at a competitive price in this market.

vmg12 82 days ago |

The technique Anthropic uses was demonstrated by Nicholas Carlini in a talk he gave 2 weeks ago and it's very simple: when asking LLMs to review code, ask them to focus their review on one file in a single session. Here is the video with the timestamp (watch through to ~5:30, they show two different ways of prompting claude).

https://youtu.be/1sd26pWhfmg?t=204

https://youtu.be/1sd26pWhfmg?t=273

IMO the big "innovation" being shown by Mythos is the effectiveness of prompting LLMs to look for security vulnerabilities by focusing on specific files one at a time and automating this prompting with a simple script.

Prompting Mythos to focus on a single file per session is why I suspect it cost Anthropic $20k to find some of the bugs in these codebases. I know this same technique is effective with Opus 4.6 and GPT 5.4 because I've been using it on my own code. If you just ask the agent to review your pr with a low effort prompt they are not exhaustive, they will not actually read each changed file and look at how it interacts with the system as a whole. If the entire session is to review the changes for a single file, the llm will do much more work reviewing it.

Edit: I changed my phrasing, it's not about restricting its entire context to one file but focusing it on one file but still allowing it to look at how other files interact with it.

mirsadm 82 days ago | |

How is that going to find anything that interacts across files?

nodja 82 days ago | | |

You misunderstood.

Instead of asking the model: "Here's this codebase, report any vulnerability." you ask. "Here's this codebase, report any vulnerability in module\main.c".

The model can still explore references and other files inside the codebase, but you start over a new context/session for each file in the codebase.
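
The per-file loop described above could be sketched as follows. This is an illustration, not Anthropic's harness: `run_session` is a hypothetical callable standing in for whatever starts a fresh agent session with access to the whole repository, and the file-suffix filter is an assumption.

```python
from pathlib import Path
import tempfile

def review_per_file(repo_root, run_session, suffixes=(".c", ".h")):
    """Start one fresh session per source file. Each session may read
    the whole repo, but is told to focus its review on a single file."""
    root = Path(repo_root)
    reports = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root).as_posix()
            prompt = f"Here's this codebase, report any vulnerability in {rel}."
            # run_session is hypothetical: a new context for every file.
            reports[rel] = run_session(repo_root=root, prompt=prompt)
    return reports

# Demo with a stub session that just echoes its prompt.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "module").mkdir()
    (Path(tmp) / "module" / "main.c").write_text("int main(void) { return 0; }\n")
    (Path(tmp) / "README").write_text("not source\n")
    reports = review_per_file(tmp, lambda repo_root, prompt: prompt)

print(reports)
```

The design choice is that the focus file changes per session while the repository root does not, so cross-file interactions remain reachable from every session.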

appcustodian2 82 days ago | | |

I would think that it is still capable of exploring the codebase and reading other related files like any other coding agent already does.

vmg12 82 days ago | | |

My phrasing wasn't clear but you aren't telling it to only look at one specific file but to focus its review on one file. Updated my original comment.

woodruffw 82 days ago |

> Those models recovered much of the same analysis

This is an essentially unquantifiable statement that makes the underlying claim harder to believe as an external party. What does “much” mean here? The end state of vulnerability exploitation is typically eminently quantifiable (in the form of a functional PoC that demonstrates an exploited end state), so the strong version of the claims here would ideally be backed up by those kinds of PoCs.

(Like other readers, I also find the trick of pre-feeding the smaller models the “relevant” code to be potentially disqualifying in a fair comparison. Discovering the relevant code is arguably one of the hardest parts of human VR.)

StrauXX 82 days ago |

A lot of comments here are dismissing this post because the relevant code was isolated. But that's the exact same thing Anthropic did with Mythos! They describe their (very lean) harness in the Anthropic Red Mythos blog post [1]. The harness first assigns each file in the given codebase an importance value. Then it points Claude Code at the codebase with a prompt stating that it should focus on that file. It spawns a Claude Code instance for each file in the codebase.

So no, the fact that the posters isolated the relevant code does not invalidate their findings.

[1] https://red.anthropic.com/2026/mythos-preview/

felipeerias 82 days ago | |

From the article:

> Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior").

grandinquistor 82 days ago | | |

I mean you can still scale that? Ask a lighter model to go through every function to find vulnerabilities, take output to bigger model like Opus and classify the critical ones.
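
A rough sketch of that two-stage idea, in Python. The two model calls are replaced by toy string-matching stand-ins (`toy_cheap_flag`, `toy_expensive_classify`), and all names are hypothetical. The point is only the shape of the pipeline: the expensive step runs on whatever the cheap step flags, not on everything.

```python
def two_stage_scan(functions, cheap_flag, expensive_classify):
    """Run a cheap flagger over everything, then escalate only the hits.

    functions: dict mapping function name -> source text
    cheap_flag: callable(source) -> bool, stand-in for the lighter model
    expensive_classify: callable(source) -> severity string, stand-in for
        the bigger model
    """
    flagged = {name: src for name, src in functions.items() if cheap_flag(src)}
    verdicts = {name: expensive_classify(src) for name, src in flagged.items()}
    critical = sorted(name for name, sev in verdicts.items() if sev == "critical")
    return {
        "flagged": sorted(flagged),
        "critical": critical,
        "expensive_calls": len(flagged),
    }


# Toy stand-ins so the sketch runs; real model calls would replace these.
def toy_cheap_flag(src):
    return "memcpy" in src or "strcpy" in src


def toy_expensive_classify(src):
    return "critical" if "strcpy" in src else "low"


functions = {
    "parse_header": "strcpy(buf, input);",
    "copy_block": "memcpy(dst, src, n);",
    "add": "return a + b;",
}
result = two_stage_scan(functions, toy_cheap_flag, toy_expensive_classify)
print(result)
```

Whether this saves anything in practice depends on how much the cheap stage over-flags, which is the false-positive question raised elsewhere in the thread.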

make_it_sure 82 days ago | |

check other comments, they didn't

lordofgibbons 82 days ago |

Without showing false-positive rates this analysis is useless.

If your model says every line of your code has a bug, it will catch 100% of the bugs, but it's not useful at all. They tested false-positives with only a single bug...

I'm not defending anthropic and openai either. Their numbers are garbage too since they don't produce false-positive rates either.

Why is this "analysis" making the rounds?

sfink 82 days ago | |

Yes, and in this case they pointed at the function, so a 1-bit model ("yes") would be correct. But it's not that bad. First, they included a test with a false positive. The small models got it right, Opus got it wrong. Second, they asked for an analysis. Look for "Exploitation reasoning, single follow-up prompt:" in the post. It's hard to tell how good they were at a glance, though apparently the full logs are available so you could pull them up.

Anyway, it seems like they erred in the up-front claim "small models found the vulnerability we pointed directly at!", but the findings are at least somewhat stronger if you read through the details.

The small models didn't match Mythos at exploitation. They suggested plausible exploits, but didn't actually try them out so I can't tell if they would have worked. Deepseek R1's sounds pretty convincing to me, but I'm not a good judge. (I'm more in the space of accidentally writing vulnerabilities, not seeking them out or exploiting them. Well, ok, I have a static analysis that finds some, at least.)

sealeck 82 days ago | |

Why does the false positive rate matter if you have a verifiable oracle? You can just disregard anything that fails the oracle

lordofgibbons 82 days ago | | |

What's the verifiable oracle in this scenario?

davebren 82 days ago | |

It should at least get the same coverage anthropic got then, if not more.

MaxLeiter 82 days ago |

I think the key thing here is they "isolated the relevant code"

If the exploits exist in e.g. one file, great. But many complex zerodays and exploits are chains of various bugs/behaviors in complex systems.

Important research but I don’t think it dispels anything about Mythos

slopinthebag 82 days ago | |

Did Mythos identify vulnerabilities across files? Afaik Mythos worked the same way, analysing a single file at a time.

davebren 82 days ago | |

Seems perfectly comparable to anthropic's method, they just wrapped the same kind of prompt in a for loop.

throwaway13337 82 days ago |

So there are two competing narratives:

1. Mythos uniquely is able to find vulnerabilities that other LLMs cannot practically.

2. All LLMs could already do this but no one tried the way anthropic did.

The truth is one of these. And it comes down to whether the comparison is apples to apples. Since we don't know the exact specifics of how either test was performed, we lack a way of knowing absolutely.

So I guess, like so many things today, we get to pick the truth we find most comfortable personally.

goldenarm 82 days ago | |

People have found 0days assisted by LLMs for a while, and none of them wrote hype pieces to find an excuse not to release their 10x bigger model in the middle of a GPU shortage.

https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...

chirau 82 days ago |

Their isolation approach is totally different from Mythos's approach though. Mythos had to evaluate whole code bases rather than isolated sections. It's like saying one dog walked into the Amazon jungle and found a tennis ball and then another team isolated a 1 square kilometer area that they knew the ball was definitely in and found the same ball.

kennywinker 82 days ago | |

I don’t think mythos can ingest an entire codebase into context. So it’s spinning off sub-agents to process chunks. Which supports their thesis: the harness is the moat. The tooling is what's important, the model is far far less important.

bhouston 82 days ago | | |

Mythos was clear it was one agent per chunk. But these positive confirming results do not actually disprove anything about Mythos, because this is only one side of the discriminator challenge - you got positives, but we do not know your false positive rate and your false negative rate.

eiens 82 days ago | | |

Let’s suppose that’s true

What’s so special about the harness - why wouldn’t others be able to replicate it?

hakanderyal 82 days ago | |

Even that would be a more meaningful test. They basically coated the ball with a strong smell, then they prepped the dog with that smell, then set it loose in a 5x5 meter area.

"Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior")."

TacticalCoder 82 days ago |

I don't dispute the fact that it's more than cool that we have a new tool to find security exploits (and do many other things) but... A big shout-out to OpenBSD?

We're literally talking about the biggest computers on the planet ever, trained with the biggest amount of data ever available to a system, with the biggest investment ever made by man or close to it and...

The subtlest security bug it can find required: going 28 years into the past and finding a...

Denial-of-service?

A freaking DoS? Not a remote root exploit. Not a local exploit.

Just a DoS? And it had to go into 28-year-old code to find that?

So kudos, hats off, deep bow not to Mythos but to OpenBSD? Just a bit, no!?

bryantwolf 82 days ago |

All of this discourse seems very bizarre.

If smaller models can find these things, that doesn’t mean mythos is worse than we thought. It means all models are more capable.

Also if pointing models at files and giving them hints is all it takes to make them find all kinds of stuff, well, we can also spray and pray that pretty well with llms can’t we.

It just points to us finding a lot more stuff with only a little bit more sophistication.

Hopefully the growing pains are short and defense wins

davebren 82 days ago | |

> If smaller models can find these things, that doesn’t mean mythos is worse than we thought. It means all models are more capable.

It means "it's so dangerous we can't release it" was a blatant lie since anthropic would have already known this.

pertymcpert 82 days ago | | |

No one seems to have actually read the system card all the way through.

The reason they didn't publish it was that it's orders of magnitude more successful at writing exploits vs Opus 4.6, which only managed it something like 2% of the time.

bryantwolf 82 days ago | | |

Sure, I think it’s reasonable to tell Anthropic the barn door is already open.

Though, like, I guess I expect that when this comes out, all the Opus traffic will move over. It does appear to be much more capable, the jury is just out on how much more capable.

chopete3 82 days ago |

The impact of the Mythos announcement on cybersecurity firms (like CrowdStrike, Zscaler, etc.) is big enough (10-15% drop in stock price) and this pushback is expected.

Companies like Aisle.com (the blog) and other VAPT companies charge huge amounts to detect vulnerabilities.

If Claude Mythos becomes a simple GitHub hook, their value will be reduced.

That is a disruption.

throwa356262 82 days ago | |

If anyone can get Crowdstrike to go bankrupt I will be rooting for them.

Those guys are the reason our new work laptops run at 1/3 of speed.

A while back CrowdStrike managed to simultaneously crash every Windows computer and bring every major company to a halt, and somehow they are still around.

zer00eyz 82 days ago | |

CrowdStrike: no PE, because it just had its first profitable quarter (38 million)

Zscaler: no PE

Palo Alto Networks Inc (PANW): 86 PE

Fortinet (FTNT): 31.63 PE

That last one, didn't get hit at all by the Mythos announcement, because at some level it has at least some grounding in fiscal reality.

bhouston 82 days ago |

This is quite misleading.

If you isolate the positive cases and then ask a tool to label them and it labels them all positive, that doesn't prove anything. This is a one-sided test and it is really easy to write a tool that passes it -- just always return true!

You need to test your tool on both positive and negative cases and check if it is accurate on both.

If you don't, you could end up with hundreds or thousands of false positives when using this on real-world samples.

The real test is to use it to find new real bugs in the midst of a large code base.

grg0 82 days ago | |

AKA F-score. https://en.wikipedia.org/wiki/F-score
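
To make the one-sided-test point concrete, here is a small Python sketch. It scores a labeler that always answers "vulnerable" on a positives-only set and on a mixed set. The sample counts (5 real bugs among 100 samples) are made up for illustration and are not taken from the article or from Anthropic's writeup.

```python
def precision_recall_f1(labels, predictions):
    """Compute precision, recall and F1 for boolean labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def always_true(_sample):
    """A 'tool' that flags everything as vulnerable."""
    return True


# Hypothetical data: True means the sample really is vulnerable.
positives_only = [True] * 5
mixed = [True] * 5 + [False] * 95

one_sided = precision_recall_f1(
    positives_only, [always_true(x) for x in positives_only]
)
realistic = precision_recall_f1(mixed, [always_true(x) for x in mixed])

print("positives only:", one_sided)
print("mixed set:     ", realistic)
```

On the positives-only set the always-true labeler scores perfectly on all three metrics. On the mixed set its recall is still 1.0, but precision falls to 0.05 and F1 drops with it, which is the gap the one-sided test cannot see.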

operatingthetan 82 days ago |

My theory is that Mythos is basically just Opus with revised context window handling and more compute thrown at it. So while it will be a step forward, it is probably primarily hype.

appcustodian2 82 days ago | |

N model is basically just N-1 model with revised context window handling and more compute thrown at it

pertymcpert 82 days ago | |

Shit. Really? You mean they modified their frontier model to improve it and make it better and just called it a day? That their benchmarks which show step change improvements are just the result of successive changes on an EXISTING MODEL?

Say it isn't so! I for one like to start from scratch each time I release my version of my compiler toolchain.

chjj 82 days ago | | |

They didn't call it a day. They created an entire deceptive hype cycle around it.

amazingamazing 82 days ago |

Did mythos isolate the code to begin with? Without a clear methodology that can be attempted with another model the whole thing is meaningless

bhouston 82 days ago | |

They did do one agent per code chunk, yes. But key is that their agent had to identify when there was a vulnerability and when there wasn't. This "small model" test only had to label the known positive cases as positive -- which any function that simply returns "true" can do. This whole test setup is annoying because it proves nothing.

aniceperson 82 days ago | |

to be fair, the last post i saw from anthropic on finding a linux kernel vulnerability was a while loop that, per failed attempt, re-prompted "there is a vulnerability here, find it". more important than that, no frontier model can keep the entire linux kernel in context, so there definitely is code isolation, either explicitly or implicitly (the model itself delegates subagents with smaller chunks of code)

loeg 82 days ago | |

No. How would it? Before the vulns were identified by Mythos, no one knew what the relevant portion to isolate was.

dist-epoch 82 days ago |

Anthropic's claim is not necessarily that Mythos found vulnerabilities that other models couldn't, but that it could easily exploit them while previous models failed to do that:

> “Opus 4.6 is currently far better at identifying and fixing vulnerabilities than at exploiting them.” Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development. But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more.

rychu 82 days ago | |

If that was normal Opus, then it sounds to me like Mythos could be a big model, instruction tuned, but without all the safety/refusal part of training.

slibhb 82 days ago |

The best way to think of Anthropic's communication about Mythos is as advertisement. It's basically "our model is too smart to release" which suggests they're ahead of OpenAI (without proof)

pardon_me 82 days ago | |

The whole company is like that. If things were as amazing as advertised, they wouldn't even need to advertise. Or to release models to the public at all.

boelboel 82 days ago | |

Seen similar things with OpenAI and Palantir.

slibhb 82 days ago | | |

Yes. OpenAI does the exact same thing.

mrifaki 82 days ago |

finding vulns in a large codebase is a search problem with a huge negative space and what aisle measured is classification accuracy on ground-truth positives, those are different tasks so a model that correctly labels a pre-isolated vulnerable function tells me almost nothing about that model's ability to surface the same function out of a million lines of unrelated code under a realistic triage budget

the experiment i'd want to see is running each of the small models as an unsupervised scanner across full freebsd, then returning the top-k suspicious functions per model and computing precision at recall levels that correspond to real analyst triage budgets. if mythos's findings show up in the small models' top 100, i'd call that meaningful, but if they only surface under 10k false positives then the cost advantage collapses because analyst triage time is more expensive than frontier model compute to begin with

second thing i keep coming back to is the $20k mythos number is a search budget not a model cost, small models at one hundredth the per-token price don't give us one hundredth the total budget when the search process is the same shape, i still run thousands of iterations and the issue for autonomous vuln research is how fast the reward signal converges and the aisle post doesn't touch any of this
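
a rough python sketch of the top-k measurement described above. the function names, the ranking and the set of known-vulnerable functions are all hypothetical toy values. the only assumption is that each scanner outputs a list of function names ordered from most to least suspicious.

```python
def precision_recall_at_k(ranked_functions, known_vulnerable, k):
    """Precision and recall of the top-k entries of a suspiciousness ranking."""
    top_k = ranked_functions[:k]
    hits = [name for name in top_k if name in known_vulnerable]
    precision = len(hits) / len(top_k) if top_k else 0.0
    recall = len(hits) / len(known_vulnerable) if known_vulnerable else 0.0
    return precision, recall


def first_hit_rank(ranked_functions, known_vulnerable):
    """1-based rank of the first known-vulnerable function, or None if absent."""
    for rank, name in enumerate(ranked_functions, start=1):
        if name in known_vulnerable:
            return rank
    return None


# Toy ranking from a hypothetical scanner, most suspicious first.
ranked = ["f3", "f7", "f1", "f9", "f2"]
known = {"f1", "f2"}

print(precision_recall_at_k(ranked, known, 3))
print(precision_recall_at_k(ranked, known, 5))
print(first_hit_rank(ranked, known))
```

here k stands in for the analyst triage budget: an analyst who can only review the top 3 sees one of the two known bugs, and everything ranked above the first hit is triage time spent on false positives.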

solatic 82 days ago |

Most commenters here: "Mythos is powerful because you can point it at a whole codebase, if you point the smaller models at a whole codebase and iterate through small sections of code, you'll get too many false-positives to handle."

This misses the point entirely. You pay $20k as a one-time fee to establish a baseline. Your codebase develops one PR at a time, which... updates isolated sections of code. Which means you don't need Mythos for a PR, just small, open-weight models. Maybe you run Mythos once a year to ensure that you keep your baseline updated and reduce the risk that the open-weights models missed anything.

Seeing this as anything but a huge win for open-weights models and a huge loss for Anthropic misses the point entirely. Mythos isn't something you can persuade Fortune 500 companies to spend $20k/day or even $20k/week on, like they were hoping for. $20k/year is a lot less valuable, and it won't justify development costs or Anthropic's growth multiple.

herf 82 days ago |

There are a lot of details in the original article, in most cases comparing with Opus, which required "human guidance" to exploit the FreeBSD vulnerability:

https://red.anthropic.com/2026/mythos-preview/

Also "isolating the relevant code" in the repro is not a detail - Mythos seems to find issues much more independently.

abel_ 82 days ago |

This misses the broader ongoing trend. For a few million dollars, of course you can create a startup that builds tools it can use to more efficiently find code vulnerabilities. And of course you can do this with weaker models with scaffolds that incorporate lots of human understanding. The difference now is that you don't need an expensive team, nor a bunch of human heuristics, nor a million dollars. The requisite cost and skill are falling rapidly.

coppsilgold 82 days ago |

LLMs are wordsmith oracles. A lot of effort went into trying to coax interactive intelligence from them but the truth is that you could have probably always harnessed the base models directly to do very useful things. The instruct tuned models give your harness even more degrees of freedom.

A while ago, the autoresearch[1] harness went viral, yet it's but a highly simplified version of AlphaEvolve[2][3][4].

In the cybersecurity context, you can envision a clever harness that probes every function in a codebase for vulnerabilities, then bubbles the candidates up to their callsites (and probes whether the vulnerability can be triggered from there) and then all the way to an interface (such as a syscall) where a potential exploit can be manifested. And those would be the low hanging fruit, other vulnerabilities may require the interplay of multiple functions. Or race conditions.

[1] <https://github.com/karpathy/autoresearch>

[2] <https://deepmind.google/blog/alphaevolve-a-gemini-powered-co...>

[3] <https://arxiv.org/abs/2506.13131>

[4] <https://github.com/algorithmicsuperintelligence/openevolve>
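
A minimal Python sketch of the "bubble candidates up to an interface" step from this comment. It covers only the reachability part: given a caller map, it lists the call chains from a flagged function up to any interface entry point. The function names and the toy call graph are hypothetical. In a real harness each hop would also need a model or a test to check whether the bug is actually triggerable from that callsite, which this sketch does not attempt.

```python
from collections import deque


def paths_to_interfaces(callers, flagged, interfaces):
    """Breadth-first walk from a flagged function up through its callers.

    callers: dict mapping a function name to the functions that call it
    flagged: name of the function a scanner marked as suspicious
    interfaces: set of externally reachable entry points (e.g. syscalls)
    Returns every call chain that ends at an interface.
    """
    paths = []
    queue = deque([[flagged]])
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current in interfaces:
            paths.append(path)
            continue
        for caller in callers.get(current, []):
            if caller not in path:  # avoid looping on recursive call cycles
                queue.append(path + [caller])
    return paths


# Toy call graph, for illustration only.
callers = {
    "copy_opts": ["parse_packet"],
    "parse_packet": ["sys_recv", "debug_dump"],
    "debug_dump": [],
}
interfaces = {"sys_recv"}

chains = paths_to_interfaces(callers, "copy_opts", interfaces)
print(chains)
```

Chains that dead-end in internal code (here via `debug_dump`) are dropped, so only candidates with a route to an external interface are kept for the more expensive exploitability check.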

cedws 82 days ago |

Didn’t they also use Mythos to scan Linux many times over and it only found one DoS bug or something? I find it hard to believe there is only one security bug lurking.

onesociety2022 82 days ago |

This article is written by a company building an AI cybersecurity solution. Not sure how much you can trust them on this topic - their business will get destroyed if Mythos is actually so superior to existing models that it doesn’t require a big investment into the scaffold/harness to find security vulnerabilities. If the model is too good, then what’s the value of their solution?

midnitewarrior 82 days ago |

At the center of every security situation is the question, "is the effort worth the reward?"

We prepare security measures based on the perceived effort a bad actor would need to defeat that method, along with considering the harm of the measure being defeated. We don't build Fort Knox for candy bars, it was built for gold bars.

These model advances change the equation. The effort and cost to defeat a measure goes down by an order of magnitude or more.

Things nobody would have reasonably considered attempting are becoming possible. However, we have 2000-2020s security measures in place that will not survive the AI models of 2026+. The investment to resecure things will be massive, and won't come soon enough.

latentframe 82 days ago |

Good writeup. Seems like it’s not really the big model against the small one anymore, and if smaller models can do most of the job once the context is smaller, then it’s more about the system around them and the expertise ...

morpheuskafka 82 days ago |

Everyone is commenting that this doesn't count because they pointed it at the specific files that Mythos already found vulnerable.

But sometimes you do know where vulnerabilities are and still don't know what they are. For example, an update may be released in beta changing the part of the Mac or Windows kernel or some app, but they haven't published the CVE yet. If locally runnable (even with significant compute costs) LLMs can find and exploit it based on either the location of the changed file or the actual diff of the compiled output, we could see exploits before the update ever went to production?

Retr0id 82 days ago |

And what about the false-positive rate?

dataflow 82 days ago | |

Yeah, this is the critical question. If the model ends up flagging too much, that could end up being like a manual read of the code.

make_it_sure 82 days ago |

The only reason that's on top of HN is that people really want Mythos to be bad. This "study" is a cheap gimmick, they pointed to the actual location with the vulnerability and said "something is bad here, find it".

The hardest part is locating the issue, if you point directly to it, you're not comparing the same thing by far, and they know it. This was just a stunt by them to get publicity, they knew what they were doing and many fell for it, including here.

stringfood 82 days ago | |

Case in point: I found the same OpenBSD bug once I knew where it was and I am highly uneducated

yalogin 82 days ago |

Intuitively every existing model has already been trained on all code, all vulnerabilities reported, all security papers. So they all have the capability. Small models fall short because they may not be able to find a vulnerability that spans across a large function chain but for the most part they should suffice too.

Of course I say this without any knowledge of what mythos is doing or how it’s different. I am sure it’s somehow different

nomel 82 days ago | |

Not intuitive at all. Not all models are equally capable, just because they had the same training data. The model architecture (as a whole) is very important. To reduce capability, you can reduce layers, tool use, thinking, quantize it, etc. This is trivially proven by a cursory glance in the rough direction of any set of benchmarks (or actual use).

Using small models as a classifier "there might be a vulnerability here" is probably reasonable, if you have a model capable of proving it. There are many companies attempting this without the verification step, resulting in AI vulnerability checkers being banned left and right because of the nonsense noise.

tonymet 82 days ago |

My router had a broken IPv6 firewall and lacked root access. I needed a root shell to run ip6tables. I exfil'd the code and ran Gemini to discover shell injection vulnerabilities. I was able to get root shell to run ip6tables and add the firewall. I had notified the vendor for a couple years that the firewall was broken and showed them the issue but it hadn't been fixed.

dev1ycan 82 days ago |

It was obvious since the start that 1)it's probably all javascript based or android websites/programs that contain a ton of "vulnerable" libraries (or really old closed sourced c++ code).

Also you're not helping your case as a software company if you feed your code to an LLM, great job making it all public, because it will most likely be used as training data like it or not.

high_byte 82 days ago |

"The correct answer: not currently vulnerable, but the code is fragile and one refactor away from being exploitable."

absolutely. I see this pattern all the time when doing security audits - code that is nearly-vulnerable. I would mark these things as informational and recommend to harden them anyway, and any model would do a good job to do the same.

sheepscreek 80 days ago |

I think what made Mythos a big deal is not that it could find vulnerabilities. Opus can do that too. But Mythos went a step further and autonomously built exploits very successfully whereas Opus struggled to do that.

Most modern day exploits are multi-step requiring a multitude of skills to pull off successfully.

Animats 82 days ago |

What are they finding? Buffer overflows? Something else?

Also, if someone has the time and tokens, would they please run the OpenJPEG 2000 decoder through this tester? It's known to be brittle. The data format has lots of offsets, and it's permitted to truncate the file to get a lower-rez version. That combo leads to trouble.

mrinterweb 82 days ago |

I feel like there have been enough hyperbolic claims by Anthropic, that I'm starting to get some real Boy Who Cried Wolf energy. I'm starting to tune out, and assume it is a marketing ploy. Trust me, I'm an Anthropic fan, and I pay my $200/month for max, but the claims are wearing thin.

jurschreuder 82 days ago |

All these models will completely mess up your code if you let them.

And if they constantly scan your code with various settings and updates you will spend hours a day reading, trying to understand locally coherent but structurally incoherent vibes trying to pinpoint the exact reasoning flaw. Exhausting.

Loeffelmann 82 days ago | |

> locally coherent but structurally incoherent

Perfectly summarizes what I hate about AI code. The diff looks fine but if you take a step back it's an absolute mess. I mean have you looked at the Claude Code or Openclaw codebases? That is the result of full-on vibecoding. A bloated unmaintainable mess that no one understands.

AlexandrB 82 days ago |

The whole "this tool is too dangerous to be public" idea reeks of marketing. Just like all the "AI is an existential threat" talk a year ago. These companies are using ideas usually reserved for something like nuclear weapons to make their products look more impressive.

elzbardico 82 days ago |

I think that probably Mythos's mojo comes from a lot of post-training on this kind of task.

I occasionally pick up contract work doing coding annotation to make some quick extra money, and a few months ago one of the projects was heavily focused on spotting common memory access bugs in C and C++.

rurban 82 days ago |

If they had watched Carlini's "unblocked" talk on YouTube, which is much more detailed than the blog post, they would not have needed this writeup. He was worried about the reproducers of the zero-days, not so much the actual zero-days.

charcircuit 82 days ago |

The thesis that the system is more important than the model is not bitter lesson pilled. I would not bet on this in the long term. We will get to the point where you can just tell the model to go find and classify the severity of all security problems with a codebase.

JackYoustra 82 days ago |

> Isolated the relevant code

I mean isn't that most of it? If you put a snippet of code in front of me and said "there's probably a vulnerability here" I could probably spend a few hours (a much lower METR time!) and find it. It's a whole other ballgame to ask me with no context to come up with an exploit.

kennywinker 82 days ago | |

Sure. But it’s a computer. You can run “there’s probably a vulnerability here” as many times as you like. And it’s easier and cheaper to run it many times with a small open model than a big frontier model.

It also sounds like that is how mythos works too. Which makes sense - the linux kernel is too big to fit in context

JackYoustra 82 days ago | | |

No, it sounds like mythos is just doing parallel trajectories. that's pretty distinct!

nickpsecurity 82 days ago |

We've always had good tools for program analysis and testing. They're usually exorbitantly expensive.

I'm hoping the good results with AI models drive down the prices of traditional tools. Then, we can train open models to integrate with them.

nickdothutton 82 days ago |

PoC or GTFO should apply to AI models too, or the false positive rate will overwhelm.

npilk 82 days ago |

Wouldn't this mean we're even more cooked? I've seen this page cited a few times as evidence that Mythos is no big deal, but if true then the same big deal is already out there with other models today.

davebren 82 days ago | |

As cooked as we were pre-LLMs knowing that security exploits are relatively easy to learn about online and use, yet things keep chugging along.

dominicq 82 days ago | | |

This would just speed up the discovery -> patch cycle, at least until such time that all the low hanging fruit (=represented in training data) is patched.

Though another possibility would be that since LLMs generate so much code, the LLM vulnerability discovery would just keep chugging along and we'd simply settle for the same amount of potential vulns, same relative vulnerability-exploit-patch dynamics, though higher in absolute numbers.

flafferay 80 days ago |

This to me has sounded like a huge PR stunt from the start. “Too dangerous” was honestly the first headline I read when I first heard about Mythos.

JoshTko 82 days ago |

I bet Anthropic just had marketing strategy discussions with Mythos to get the "breakthrough hacking tool!" framing.

brador 82 days ago |

I want that Doom thing but finding vulnerabilities using AI models.

Like I discovered a JavaScript vulnerability using a fridge.

thywis 82 days ago |

Sure, but it's more about whether the small model can find the vulnerability that bigger model can.

oliveiracwb 82 days ago |

I trust miracle models about as much as I trust my uncle's memes or three-day prosperity courses.

ptrwis 82 days ago |

When you're pair-programming with AI, even Haiku is very good. Just treat it as your assistant.

krschacht 80 days ago |

At the end of this article it states, "Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior"). A real autonomous discovery pipeline starts from a full codebase with no hints." I'm not a cybersecurity expert, but isn't 80% of the challenge finding where the exploit lives in the code!?

That really undermines the author's claims. This article feels dishonest in its claim that "small, cheap, open-weights models ... recovered much of the same analysis."

etothet 82 days ago |

My big question around the Mythos FUD is this: if we take as fact that Mythos is as powerful and dangerous as we’re being told (and I realize this is part marketing), and because of that Anthropic isn’t going to release it…how long can that last? Isn’t it reasonable that OpenAI or xAI or some other company - or foreign government - will come up with a similarly dangerous model fairly soon?

So what’s Anthropic’s plan here? How long can they withhold releasing Mythos or something Mythos-like? Is it reasonable to think they - or another AI provider - are going to dumb down future models so they’re less dangerous? I personally don’t think that’s the case.

I’m not saying Anthropic should or shouldn’t release Mythos, but it leaves me wondering what’s going to be different in, say, 6 months or even a year when they or another provider releases a model as dangerous as we’re being told Mythos is?

hamuraijack 80 days ago |

This feels so dishonest. If the vulnerabilities are a needle in the haystack, Mythos was just given the haystack and told to find the needle, while the authors pointed to a spot in the haystack and told their LLM to try looking around there. That's not even close to being the same.

jeffrwells 82 days ago |

Anthropic has become a PR vaporware company

HarHarVeryFunny 81 days ago |

Most of the comments here seem to be responding to the issue of finding vulnerabilities, rather than exploiting them, but the Anthropic claim is that the Mythos advance is being able to actually develop exploits whereas Opus 4.6 had been able to find vulnerabilities, but was poor at being able to develop exploits for them.

It's also noteworthy that Anthropic attributes Mythos' improvement to advances in "coding, reasoning and autonomy", and that the autonomy part seems especially important since they go on to say that trying to develop exploits included adding debug code to projects, running them under a debugger, etc.

When comparing the capabilities of Mythos to previous generation and/or smaller models, it seems it would therefore be useful to distinguish between identifying potential vulnerabilities and actually trying to build exploits for them in agentic fashion. Finding the "needle in a haystack" (potential vulnerability) is one aspect, but the other part is an agentic exploit-writing harness being handed the needle and asked to try to exploit it.

I wonder how much effort Anthropic put into building the harnesses and environments for Mythos to run, modify and debug code? For example, was Mythos set up to be able to build and run a modified BSD in some virtual environment, or did it just take suspect functions and test those in isolation?

It'd be interesting to put the capabilities of Opus 4.6, Mythos, and other models into perspective by comparing them to traditional non-AI static analysis security scanning tools. Anthropic mention that the open source projects they scanned came from the OSS-Fuzz corpus, but as far as I can see they don't say what other tools have, or have not, been used to scan these projects.

It'd also be interesting to know to what extent Mythos was explicitly RL trained to develop exploits (especially since it sounds as if Anthropic have the dataset and environment needed to do this) as opposed to this just being a natural consequence of the model being better. If this was the case then it might be a large part of why they are not releasing it - can't really position yourself as strong on security if you deliberately develop and release a hacking tool!

pugazh35 82 days ago |

Maybe P vs NP plays a silent role in it

_pdp_ 82 days ago |

  find ./ \( -name '*.c' -o -name '*.cpp' \) -exec agent.sh -p "can you spot any vulnerabilities in {}" \;
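A slightly more portable sketch of the same per-file sweep, for anyone who wants to try it: POSIX leaves it implementation-defined whether `find` substitutes a `{}` embedded inside a longer argument (GNU find does), so this version passes the path as its own argument and builds the prompt in a small `sh -c` wrapper. The `sweep` function name is made up for illustration, and `agent.sh -p` is still just the hypothetical agent CLI from the one-liner above, not anyone's actual scaffold.

```shell
# Sketch: run an agent command once per C/C++ source file under a directory.
# Usage: sweep ROOT CMD [ARGS...]   (CMD is whatever agent CLI you use; hypothetical here)
sweep() (
  root=$1; shift
  # "{}" is passed as a standalone argument, which POSIX find always substitutes.
  find "$root" -type f \( -name '*.c' -o -name '*.cpp' \) -exec sh -c '
    file=$1; shift
    # Remaining arguments are the agent command; the prompt is appended last.
    # stdin is redirected so the agent cannot block waiting for input.
    "$@" "can you spot any vulnerabilities in $file" </dev/null
  ' sh {} "$@" \;
)
```

Something like `sweep ./ ./agent.sh -p` would then reproduce the original command. Swapping in `echo` for the agent is a cheap way to check which files would be visited before spending any tokens.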

tom-blk 81 days ago |

Interesting comparison, cool article!

omcnoe 82 days ago |

The methodology here is completely wrong, outright dishonest.

Finding a needle in a haystack is easy if someone hands you the small handful of hay containing the needle up front, and raises their eyebrows at you saying “there might be a needle in this clump of hay”.

cmiles8 82 days ago |

Mythos is clearly a nice improvement. It’s also clear there’s a lot of unfounded hype around it to keep the AI hype cycle going.

Gating access is also a clever marketing move:

Option A: Release it but run out of capacity, everyone is annoyed and moves on. Drives focus back to smaller models.

Option B: A bunch of manufactured hype and putting up velvet ropes around it saying it’s “too dangerous” to let mere mortals touch it. Press buys it hook, line, and sinker, sidesteps the capacity issues and keeps the hype train going a bit longer.

Seems quite clear we’re seeing “Option B” play out here.

hedgehog 82 days ago |

It's strange to me they didn't reduce to PoC so the quantitative part is an apples-to-apples comparison. You don't need any fancy tooling; if you want to do this at home you can do something like the below in whatever command-line agent and model you like. A while back I did take one bug all the way through remediation just out of curiosity.

"""

Your task is to study the following directive, research coding agent prompting, research the directive's domain best practices, and finally draft a prompt in markdown format to be run in a loop until the directive is complete.

Concept: Iterative review -- study an issue, enumerate the findings, fix each of the findings, and then repeat, until review finds no issues.

<directive>

Your job is to run a security bug factory that produces remediation packages as described below. Design and apply a methodology based on best practices in exploit development, lean manufacturing, threat modeling, and the scientific method. Use checklists, templates, and your own scripts to improve token efficiency and speed. Use existing tools where possible. Use existing research and bug findings for the target and similar codebases to guide your search. Study the target's development process to understand what kind of harness and tools you need for this work, and what will work in this development environment. A complete remediation package includes a readme documenting the problem and recommendations, runnable PoC with any necessary data files, and proposed patch.

Track your work in TODO.md (tasks identified as necessary), LOG.md (chronological list of completed tasks and lessons), and STATUS.md (concise summary of the current work being done). Never let these get more than a few minutes out of date. At each step ensure the repo file tree would make sense to the next engineer, and if not reorganize it. Apply iterative review before considering a task complete.

Your task is to run until the first complete remediation package is ready for user review.

Your target is <repo url>.

The prompt will be run as follows, design accordingly. Once the process starts, it is imperative not to interrupt the user until completion or until further progress is not possible. Keep output at each step to a concise summary suitable for a chat message.

```
while output=$(claude -p "$(cat prompt.md)"); do
  echo "$output"
  echo "$output" | grep -q "XDONEDONEX" && break
done
```

</directive>

Draft the prompt into prompt.md, and apply iterative review with additional research steps to ensure it will execute the directive as faithfully as possible.

"""
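If you run the driver loop from that prompt unattended, it may be worth capping the number of iterations, since as written it only stops when the sentinel appears or the command fails. A minimal sketch, assuming the same `XDONEDONEX` sentinel; the `run_until_done` name, the iteration cap, and the exit codes are my own additions, not part of the prompt:

```shell
# Sketch: bounded variant of the driver loop.
# Usage: run_until_done MAX CMD [ARGS...]
# Exit status: 0 sentinel seen, 1 CMD failed, 2 MAX iterations used up.
run_until_done() (
  max=$1; shift
  i=0
  while [ "$i" -lt "$max" ]; do
    output=$("$@") || exit 1      # stop if the agent command itself fails
    printf '%s\n' "$output"
    case $output in
      *XDONEDONEX*) exit 0 ;;     # sentinel found: the directive reported completion
    esac
    i=$((i + 1))
  done
  exit 2                          # budget exhausted without seeing the sentinel
)
```

Usage would be along the lines of `run_until_done 50 claude -p "$(cat prompt.md)"`, where 50 is an arbitrary budget.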

ares623 82 days ago |

Once again, it would've been so easy and simple to remove all doubt from their claims: release all the tools and harnesses they used to do it and allow 3rd parties to try and replicate their results using different models. If Mythos itself is as big a moat as they claim it is, then there shouldn't be any problem here.

They did the same stunt with the C compiler. They could've released a tool to let others replicate it, but they didn't.

robotswantdata 82 days ago |

They found a nail in a small bucket of sand, vs. Mythos reviewing the entire beach.

starboyy 82 days ago |

Tagline is very funny

palashdeb 82 days ago |

Been tracking this since the blog post, quite a big deal they are making of it.

bottlepalm 82 days ago |

None of these comments will age well. I don't know if it is denial, or cope, or being threatened by AI or what, but no one is taking AI seriously enough. Simply take what is being presented at face value, stop thinking everything is a conspiracy and realize the implications. Zero days in software are one thing; it's a hop, skip, and a jump from there to zero days in biology - and no one will be laughing about that.

abhinaystha 82 days ago |

Tech companies are just hyping their models so that the bubble won't burst so easily.

ehtbanton 81 days ago |

Wake me up when Anthropic does something right again...

ctoth 82 days ago |

> They recovered much of the same analysis

Really?

> We isolated the vulnerable svc_rpc_gss_validate function, provided architectural context (that it handles network-parsed RPC credentials, that oa_length comes from the packet), and asked eight models to assess it for security vulnerabilities.

No.

nfcampos 82 days ago |

Anthropic marketing (and even supposedly technical write-ups) sadly has become more hyperbole and less substance over time imo. This technology is so impressive on its own, it really feels like they're shooting themselves in the foot in the long run, but what do I know

Case in point here, where they conveniently fail to report the false positive rate, while also saying that if it wasn’t for AddressSanitizer discarding all the false positives this system would have been next to useless

decidu0us9034 82 days ago | |

Right now, we accept false positives as long as you can sort them out. I think it's pretty typical that >99% of fuzzer runs don't result in new coverage. Of course they're far from useless without feedback, but it's better to have it if you can. I guess the question is whether the LLM approach has lower costs for validation and triaging than fuzzing alone; that's unclear to me. Anthropic would like people to believe automation is this scary new unknown