Small models also found the vulnerabilities that Mythos found

The Anthropic writeup addresses this explicitly:

> This was the most critical vulnerability we discovered in OpenBSD with Mythos Preview after a thousand runs through our scaffold. Across a thousand runs through our scaffold, the total cost was under $20,000 and found several dozen more findings. While the specific run that found the bug above cost under $50, that number only makes sense with full hindsight. Like any search process, we can't know in advance which run will succeed.

Mythos scoured the entire continent for gold and found some. For these small models, the authors pointed at a particular acre of land and said "any gold there? eh? eh?" while waggling their eyebrows suggestively.

For a true apples-to-apples comparison, let's see it sweep the entire FreeBSD codebase. I hypothesize it will find the exploit, but it will also turn up so much irrelevant nonsense that it won't matter.

> We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis. Eight out of eight models detected Mythos's flagship FreeBSD exploit, including one with only 3.6 billion active parameters costing $0.11 per million tokens.

Impressive, and very valuable work, but isolating the relevant code changes the situation so much that I'm not sure it's much of the same use case.

Being able to dump an entire code base and have the model scan it is they type of situation where it opens up vulnerability scans to an entirely larger class of people.

If you cut out the vulnerable code from Heartbleed and just put it in front of a C programmer, they will immediately flag it. It's obvious. But it took Neel Mehta to discover it. What's difficult about finding vulnerabilities isn't properly identifying whether code is mishandling buffers or holding references after freeing something; it's spotting that in the context of a large, complex program, and working out how attacker-controlled data hits that code.

It's weird that Aisle wrote this.

Congrats: completely broken methodology, with a big conflict of interest. Giving specific bug hints, with an isolated function that is suspected to have bugs, is not the same task, NOR (crucially) is a task you can decompose the bigger task into. It is basically impossible to segment code in pieces, provide pieces to smaller models, and expect them to find all the bugs GPT 5.4 or other large models can find. Second: the smarter the model, and less the pipeline is important. In the latest couple of days I found tons if Redis bugs with a three prompts open-ended pipeline composed of a couple of shell scripts. Do you think I was not already tying with weaker models? I did, but it didn't work. Don't trust what you read, you have access to frontier models for 20$ a month. Download some C code, create a trivial pipeline that starts from a random file and looks for vulnerabilities, then another step that validates it under a hard test, like ASAN crash, or ability to reach some secret, and so forth, and only then the problem can be reported. Test yourself what it is possible. Don't let your fear make you blind. Also, there is a big problem that makes the blog post reasoning not just weak per se, but categorically weak: if small model X can find 80% of vulnerabilities, if there is a model Y that can find the other potential 20%, we need "Y": the maintainers should make sure they access to models that are at least as good as the black hats folks.

The technique Anthropic uses was demonstrated by Nicholas Carlini in a talk he gave 2 weeks ago and it's very simple, when asking LLMs to review code, ask them to focus its review on one file in a single session. Here is the video with the timestamp (watch through to ~5:30, they show two different ways of prompting claude).

https://youtu.be/1sd26pWhfmg?t=204

https://youtu.be/1sd26pWhfmg?t=273

IMO the big "innovation" being shown by Mythos is the effectiveness with prompting LLMs to look for security vulnerabilities by focusing on specific files one at a time and automating this prompting with a simple script.

Prompting Mythos to focus on a single file per session is why I suspect it cost Anthropic $20k to find some of the bugs in these codebases. I know this same technique is effective with Opus 4.6 and GPT 5.4 because I've been using it on my own code. If you just ask the agent to review your pr with a low effort prompt they are not exhaustive, they will not actually read each changed file and look at how it interacts with the system as a whole. If the entire session is to review the changes for a single file, the llm will do much more work reviewing it.

Edit: I changed my phrasing, it's not about restricting its entire context to one file but focusing it on one file but still allowing it to look at how other files interact with it.

> Those models recovered much of the same analysis

This is an essentially unquantifiable statement that makes the underlying claim harder to believe as an external party. What does “much” mean here? The end state of vulnerability exploitation is typically eminently quantifiable (in the form of a functional PoC that demonstrates an exploited end state), so the strong version of the claims here would ideally be backed up by those kinds of PoCs.

(Like other readers, I also find the trick of pre-feeding the smaller models the “relevant” code to be potentially disqualifying in a fair comparison. Discovering the relevant code is arguably one of the hardest parts of human VR.)

Without showing false-positive rates this analysis is useless.

If your model says every line if your code has a bug, it will catch 100% of the bugs, but it's not useful at all. They tested false-positives with only a single bug...

I'm not defending anthropic and openai either. Their numbers are garbage too since they don't produce false-positive rates either.

Why is this "analysis" making the rounds?

I think they key thing here is they "isolated the relevant code"

If the exploits exist in e.g. one file, great. But many complex zerodays and exploits are chains of various bugs/behaviors in complex systems.

Important research but I don’t think it dispels anything about Mythos

Anthropic marketing (and even supposedly technical write ups) sadly has become more hyperbole and less substance over time imo. This technology is so impressive on its own, really feels like shootings themselves in the foot in the long run, but what do I know

Case in point here where they conveniently fail to report the false positive rate, while also saying that if it wasn’t for Address Sanitizer discarding all the false positives this system would have been next to useless

Their isolation approach is totally different from Mythos approach though. Mythos had to evaluate whole code bases rather than isolated sections. It's like saying one dog walked into the Amazon jungle and found a tennis ball and then another team isolated a 1 square kilometer radius that they knew the ball was definitely in and found the same ball.

All of this discourse seems very bizarre.

If smaller models can find these things, that doesn’t mean mythos is worse than we thought. It means all models are more capable.

Also if pointing models at files and giving them hints is all it takes to make them find all kinds of stuff, well, we can also spray and pray that pretty well with llms can’t we.

It just points to us finding a lot more stuff with only a little bit more sophistication.

Hopefully the growing pains are short and defense wins

The impact of the Mythos announcement on the cybersecurity firms( like Crowdstrike,ZScalar etc) is big enough(10-15% drop in stock price) and this pushback is expected.

Companies like Aisle.com (the blog) and other VAPT companies charge huge amounts to detect vulnerabilities.

If Cloud Mythos become a simple github hook their value will get reduced.

That is a disruption.

Been tracking this since the blog post, quick a big deal they are making it.

The best way to think of Anthropic's communication about Mythos is as advertisement. It's basically "our model is too smart to release" which suggests they're ahead of OpenAI (without proof)

LLMs are wordsmith oracles. A lot of effort went into trying to coax interactive intelligence from them but the truth is that you could have probably always harnessed the base models directly to do very useful things. The instruct tuned models give your harness even more degrees of freedom.

A while ago, the autoresearch[1] harness went viral, yet it's but a highly simplified version of AlphaEvolve[2][3][4].

In the cybersecury context, you can envision a clever harness that probes every function in a codebase for vulnerabilities, then bubbles the candidates up to their callsites (and probes whether the vulnerability can be triggered from there) and then all the way to an interface (such as a syscall) where a potential exploit can be manifested. And those would be the low hanging fruit, other vulnerabilities may require the interplay of multiple functions. Or race conditions.

[1] <https://github.com/karpathy/autoresearch>

[2] <https://deepmind.google/blog/alphaevolve-a-gemini-powered-co...>

[3] <https://arxiv.org/abs/2506.13131>

[4] <https://github.com/algorithmicsuperintelligence/openevolve>

So there are two competing narratives:

1. Mythos uniquely is able to find vulnerabilities that other LLMs cannot practically.

2. All LLMs could already do this but no one tried the way anthropic did.

The truth is one of these. And it comes down whether the comparison is apples to apples. Since we don't know the exact specifics of how either tests were performed, we lack a way of knowing absolutely.

So I guess, like so many things today, we can to pick the truth we find most comfortable personally.

My theory is that Mythos is basically just Opus with revised context window handling and more compute thrown at it. So while it will be a step forward, it is probably primarily hype.

This is quite misleading.

If you isolate the positive cases and then ask a tool to label them and it labels them all positive, doesn't prove anything. This is a one-sided test and it is really easy to write a tool that passes it -- just return always true!

You need to test your tool on both positive and negative cases and check if it is accurate on both.

If you don't, you could end up with hundreds or thousands of false positives when using this on real-world samples.

The real test is to use it to find new real bugs in the midst of a large code base.

Did mythos isolate the code to begin with? Without a clear methodology that can be attempted with another model the whole thing is meaningless

This misses the broader ongoing trend. For a few million dollars, of course you can create a startup that builds tools it can use to more efficiently find code vulnerabilities. And of course you can do this with weaker models with scaffolds that incorporate lots of human understanding. The difference now is that you don't need an expensive team, nor a bunch of human heuristics, nor a million dollars. The requisite cost and skill are falling rapidly.

Didn’t they also use Mythos to scan Linux many times over and it only found one DoS bug or something? I find it hard to believe there is only one security bug lurking.

There are a lot of details in the original article, in most cases comparing with Opus, which required "human guidance" to exploit the FreeBSD vulnerability:

https://red.anthropic.com/2026/mythos-preview/

Also "isolating the relevant code" in the repro is not a detail - Mythos seems to find issues much more independently.

Intuitively every existing model has already been trained on all code, all vulnerabilities reported, all security papers. So they all have the capability. Small models fall short because they may not be able to find a vulnerability that spans across a large function chain but for the most part they should suffice too.

Of course I say this without any knowledge of what mythos is doing or how it’s different. I am sure it’s somehow different

Wouldn't this mean we're even more cooked? I've seen this page cited a few times as evidence that Mythos is no big deal, but if true then the same big deal is already out there with other models today.

finding vulns in a large codebase is a search problem with a huge negative space and what aisle measured is classification accuracy on ground-truth positives, those are different tasks so a model that correctly labels a pre-isolated vulnerable function tells me almost nothing about that model's ability to surface the same function out of a million lines of unrelated code under a realistic triage budget

the experiment i'd want to see is running each of the small models as an unsupervised scanner across full freebsd then return the top-k suspicious functions per model and compute precision at recall levels that correspond to real analyst triage budgets, if mythos s findings show up in the small models top 100, i'd call that meaningful but if they only surface under 10k false positives then the cost advantage collapses because analyst triage time is more expensive than frontier model compute to begin with

second thing i keep coming back to is the $20k mythos number is a search budget not a model cost, small models at one hundredth the per-token price don't give us one hundredth the total budget when the search process is the same shape, i still run thousands of iterations and the issue for autonomous vuln research is how fast the reward signal converges and the aisle post doesn't touch any of this

I think that probably Mytho's mojo comes from a lot of post-training on this kind of task.

I occasionally pick up contract work doing coding annotation to make some quick extra money, and a few months ago one of the projects was heavily focused on spotting common memory access bugs in C and C++.

And what about the false-positive rate?

POC of GTFO should apply to AI models too, or the false positive rate will overwhelm.

  find ./ \( -name '*.c' -o -name '*.cpp' \) -exec agent.sh -p "can you spot any vulnerabilities in {}" \;

I don't dispute the fact that it's more than cool that we have a new tool to find security exploits (and do many other things) but... A big shoot-out to OpenBSD?

We're literally talking about the biggest computers on the planet ever, trained with the biggest amount of data ever available to a system, with the biggest investment ever made by man or close to it and...

The subtlest security bug it can find required: going 28 years in the past and find a...

Denial-of-service?

A freaking DoS? Not a remote root exploit. Not a local exploit.

Just a DoS? And it had to go into 28 years old code to find that?

So kudos, hats off, deep bow not to Mythos but to OpenBSD? Just a bit, no!?

> Isolated the relevant code

I mean isn't that most of it? If you put a snippet of code in front of me and said "there's probably a vulnerability here" I could probably spend a few hours (a much lower METR time!) and find it. It's a whole other ballgame to ask me with no context to come up with an exploit.

The methodology here is completely wrong, outright dishonest.

Finding a needle in a haystack is easy if someone hands you the small handful of hay containing the needle up front, and raises their eyebrows at you saying “there might be a needle in this clump of hay”.

Mythos is clearly a nice improvement. It’s also clear there’s a lot of unfounded hype around it to keep the AI hype cycle going.

Gating access is also a clever marketing move:

Option A: Release it but run out of capacity, everyone is annoyed and moves on. Drives focus back to smaller models.

Option B: A bunch of manufactured hype and putting up velvet ropes around it saying it’s “too dangerous” to let near mortals touch it. Press buys it hook, like, and sinker, sidesteps the capacity issues and keeps the hype train going a bit longer.

Seems quite clear we’re seeing “Option B” play out here.

Tech companies are just hyping their model to that the bubble wont burst so easily.

It's strange to me they didn't reduce to PoC so the quantitative part is an apples-to-apples comparison. You don't need any fancy tooling, if you want to do this at home you can do something like below in whatever command line agent and model you like. A while back I did take one bug all the way through remediation just out of curiosity.

"""

Your task is to study the following directive, research coding agent prompting, research the directive's domain best practices, and finally draft a prompt in markdown format to be run in a loop until the directive is complete.

Concept: Iterative review -- study an issue, enumerate the findings, fix each of the findings, and then repeat, until review finds no issues.

Your job is to run a security bug factory that produces remediation packages as described below. Design and apply a methodology based on best practices in exploit development, lean manufacturing, threat modeling, and the scientific method. Use checklists, templates, and your own scripts to improve token efficiency and speed. Use existing tools where possible. Use existing research and bug findings for the target and similar codebases to guide your search. Study the target's development process to understand what kind of harness and tools you need for this work, and what will work in this development environment. A complete remediation package includes a readme documenting the problem and recommendations, runnable PoC with any necessary data files, and proposed patch.

Track your work in TODO.md (tasks identified as necessary) LOG.md (chronological list of tasks complete and lessons) and STATUS.md (concise summary of the current work being done). Never let these get more than a few minutes out of date. At each step ensure the repo file tree would make sense to the next engineer, and if not reorganize it. Apply iterative review before considering a task complete.

Your task is to run until the first complete remediation package is ready for user review.

Your target is <repo url>.

The prompt will be run as follows, design accordingly. Once the process starts, it is imperative not to interrupt the user until completion or until further progress is not possible. Keep output at each step to a concise summary suitable for a chat message.

``` while output=$(claude -p "$(cat prompt.md)"); do echo "$output"; echo "$output" | grep -q "XDONEDONEX" && break; done ```

</directive>

Draft the prompt into prompt.md, and apply iterative review with additional research steps to ensure will execute the directive as faithfully as possible.

"""

At the center of every security situation is the question, "is the effort worth the reward?"

We prepare security measures based on the perceived effort a bad actor would need to defeat that method, along with considering the harm of the measure being defeated. We don't build Fort Knox for candy bars, it was built for gold bars.

These model advances change the equation. The effort and cost to defeat a measure goes down by an order of magnitude or more.

Things nobody would have considered to reasonably attempt are becoming possible. However. We have 2000-2020s security measures in place that will not survive the AI models of 2026+. The investment to resecure things will be massive, and won't come soon enough.

The thesis that the system is more important than the model is not bitter lesson pilled. I would not bet on this in the long term. We will get to the point where you can just tell the model to go find and classify the severity of all security problems with a codebase.

Anthropic claim is not necessarily that Mythos found vulnerabilities that other models couldn't but that it could easily exploit them while previous models failed to do that:

> “Opus 4.6 is currently far better at identifying and fixing vulnerabilities than at exploiting them.” Our internal evaluations showed that Opus 4.6 generally had a near-0% success rate at autonomous exploit development. But Mythos Preview is in a different league. For example, Opus 4.6 turned the vulnerabilities it had found in Mozilla’s Firefox 147 JavaScript engine—all patched in Firefox 148—into JavaScript shell exploits only two times out of several hundred attempts. We re-ran this experiment as a benchmark for Mythos Preview, which developed working exploits 181 times, and achieved register control on 29 more.

The whole "this tool is too dangerous to be public" idea reeks of marketing. Just like all the "AI is an existential threat" talk a year ago. These companies are using ideas usually reserved for something like nuclear weapons to make their products look more impressive.

They found a nail in a small bucket of sand, vs mythos with the entire beach reviewed.

Where are all the people here who claim that LLM are just useless stochastic parrots ? Did they lose internet ?

> They recovered much of the same analysis

Really?

> We isolated the vulnerable vc_rpc_gss_validate function, provided architectural context (that it handles network-parsed RPC credentials, that oa_length comes from the packet), and asked eight models to assess it for security vulnerabilities.

No.

The Anthropic writeup addresses this explicitly:

Wasn't the scaffolding for the Mythos run basically a line of bash that loops through every file of the codebase and prompts the model to find vulnerabilities in it? That sounds pretty close to "any gold there?" to me, only automated.

Have Anthropic actually said anything about the amount of false positives Mythos turned up?

FWIW, I saw some talk on Xitter (so grain of salt) about people replicating their result with other (public) SotA models, but each turned up only a subset of the ones Mythos found. I'd say that sounds plausible from the perspective of Mythos being an incremental (though an unusually large increment perhaps) improvement over previous models, but one that also brings with it a correspondingly significant increase in complexity.

So the angle they choose to use for presenting it and the subsequent buzz is at least part hype -- saying "it's too powerful to release publicly" sounds a lot cooler than "it costs $20000 to run over your codebase, so we're going to offer this directly to enterprise customers (and a few token open source projects for marketing)". Keep in mind that the examples in Nicholas Carlini's presentation were using Opus, so security is clearly something they've been working on for a while (as they should, because it's a huge risk). They didn't just suddenly find themselves having accidentally created a super hacker.

Spending $20000 (and whatever other resources this thing consumes) on a denial of service vulnerability in OpenBSD seems very off balance to me.

Given the tone with which the project communicates discussing other operating systems approaches to security, I understand that it can be seen as some kind of trophy for Mythos. But really, searching the number of erratas on the releases page that include "could crash the kernel" makes me think that investing in the OpenBSD project by donating to the foundation would be better than using your closed source model for peacocking around people who might think it's harder than it is to find such a bug.

> I hypothesize it will find the exploit, but it will also turn up so much irrelevant nonsense that it won't matter.

The trick with Mythos wasn't that it didn't hallucinate nonsense vulnerabilities, it absolutely did. It was able to verify some were real though by testing them.

The question is if smaller models can verify and test the vulnerabilities too, and can it be done cheaper than these Mythos experiments.

OTOH, this article goes too far the opposite extreme:

> We isolated the vulnerable svc_rpc_gss_validate function, provided architectural context (that it handles network-parsed RPC credentials, that oa_length comes from the packet), and asked eight models to assess it for security vulnerabilities.

To follow your analogy, they pointed to the exact room where the gold was hidden, and their model found it. But finding the right room within the entire continent in honestly the hard part.

That was my thought exactly. If small models can find these same vulnerabilities, and your company is trying to find vulnerabilities, why didn’t you find them?

Instead of scanning more code, afaict what you seem to want is instead, scan on the same small area, and compare on how many FPs are found there. A common measure here is what % of the reported issues got labeled as security issues and fixed. I don't see Mythos publishing on relative FP rate, so dunno how to compare those. Maybe something substantively changed?

At the same time, I'm not sure that really changes anything because I don't see a reason to believe attacks are constrained by the quality of source code vulnerability finding tools, at least for the last 10-15 years after open source fuzzing tools got a lot better, popular, and industrialized.

This might sound like a grumpy reply, but as someone on both sides here, it's easy to maintain two positions:

1. This stuff is great, and doing code reviews has been one of my favorite claude code use cases for a year now, including security review. It is both easier to use than traditional tools, and opens up higher-level analysis too.

2. Finding bugs in source code was sufficiently cheap already for attackers. They don't need the ease of use or high-level thing in practice, there's enough tooling out there that makes enough of these. Likewise, groups have already industrialized.

There's an element of vuln-pocalypse that may be coming with the ease of use going further than already happening with existing out-of-the-box blackbox & source code scanning tools . That's not really what I worry about though.

Scarier to me, instead, is what this does to today's reliance on human response. AI rapidly industrializes what how attackers escalate access and wedge in once they're in. Even without AI, that's been getting faster and more comprehensive, and with AI, the higher-level orchestration can get much more aggressive for much less capable people. So the steady stream of existing vulns & takeovers into much more industrialized escalations is what worries me more. As coordination keeps moving into machine speed, the current reliance on human response is becoming less and less of an option.

It seems feasible to use a small/cheap model to flag possible vulnerabilities, and then use a more expensive model to do a second-pass to confirm those, rather than on every file. Could dramatically reduce the total cost and speed up the process.

This is addressed elsewhere in the comments, but it appears this is actually a direct comparison to how Anthropic got their Mythos headline results.

https://news.ycombinator.com/item?id=47732322

This is a really interesting point though -- it's really scaffold-dependent.

Because for the same price, you could point the small model at each function, one by one, N times each, across N prompts instructing it to look for a specific class of issue.

It's not that there's no difference between models, but it's hard to judge exactly how much difference there is when so much depends on the scaffold used. For a properly scientific test, you'd need to use exactly the same one.

Which isn't possible when Anthropic won't release the model.

How much of that is simply scale? Anthropic threw probably an entire data center at analyzing a code base. Has anyone done the same with a "small" model?

Without showing false-positive rates this analysis is useless.

If your model says every line if your code has a bug, it will catch 100% of the bugs, but it's not useful at all. They tested false-positives with only a single bug...

I'm not defending anthropic and openai either. Their numbers are garbage too since they don't produce false-positive rates either.

Why is this "analysis" making the rounds?

Yes, and in this case they pointed at the function, so a 1-bit model ("yes") would be correct. But it's not that bad. First, they included a test with a false positive. The small models got it right, Opus got it wrong. Second, they asked for an analysis. Look for "Exploitation reasoning, single follow-up prompt:" in the post. It's hard to tell how good they were at a glance, though apparently the full logs are available so you could pull them up.

Anyway, it seems like they erred in the up-front claim "small models found the vulnerability we pointed directly at!", but the findings are at least somewhat stronger if you read through the details.

The small models didn't match Mythos at exploitation. They suggested plausible exploits, but didn't actually try them out so I can't tell if they would have worked. Deepseek R1's sounds pretty convincing to me, but I'm not a good judge. (I'm more in the space of accidentally writing vulnerabilities, not seeking them out or exploiting them. Well, ok, I have a static analysis that finds some, at least.)

It should at least get the same coverage anthropic got then, if not more.

I think they key thing here is they "isolated the relevant code"

If the exploits exist in e.g. one file, great. But many complex zerodays and exploits are chains of various bugs/behaviors in complex systems.

Important research but I don’t think it dispels anything about Mythos

> Those models recovered much of the same analysis

This is quite misleading.

You need to test your tool on both positive and negative cases and check if it is accurate on both.

If you don't, you could end up with hundreds or thousands of false positives when using this on real-world samples.

The real test is to use it to find new real bugs in the midst of a large code base.

A while ago, the autoresearch[1] harness went viral, yet it's but a highly simplified version of AlphaEvolve[2][3][4].

[1] <https://github.com/karpathy/autoresearch>

[2] <https://deepmind.google/blog/alphaevolve-a-gemini-powered-co...>

[3] <https://arxiv.org/abs/2506.13131>

[4] <https://github.com/algorithmicsuperintelligence/openevolve>

Been tracking this since the blog post, quick a big deal they are making it.

I don't dispute the fact that it's more than cool that we have a new tool to find security exploits (and do many other things) but... A big shoot-out to OpenBSD?

The subtlest security bug it can find required: going 28 years in the past and find a...

Denial-of-service?

A freaking DoS? Not a remote root exploit. Not a local exploit.

Just a DoS? And it had to go into 28 years old code to find that?

So kudos, hats off, deep bow not to Mythos but to OpenBSD? Just a bit, no!?

So there are two competing narratives:

1. Mythos uniquely is able to find vulnerabilities that other LLMs cannot practically.

2. All LLMs could already do this but no one tried the way anthropic did.

So I guess, like so many things today, we can to pick the truth we find most comfortable personally.

There are a lot of details in the original article, in most cases comparing with Opus, which required "human guidance" to exploit the FreeBSD vulnerability:

https://red.anthropic.com/2026/mythos-preview/

Also "isolating the relevant code" in the repro is not a detail - Mythos seems to find issues much more independently.

The methodology here is completely wrong, outright dishonest.

They found a nail in a small bucket of sand, vs mythos with the entire beach reviewed.

Didn’t they also use Mythos to scan Linux many times over and it only found one DoS bug or something? I find it hard to believe there is only one security bug lurking.

Mythos is clearly a nice improvement. It’s also clear there’s a lot of unfounded hype around it to keep the AI hype cycle going.

Gating access is also a clever marketing move:

Option A: Release it but run out of capacity, everyone is annoyed and moves on. Drives focus back to smaller models.

Seems quite clear we’re seeing “Option B” play out here.

Tech companies are just hyping their model to that the bubble wont burst so easily.

I think that probably Mytho's mojo comes from a lot of post-training on this kind of task.

"""

Concept: Iterative review -- study an issue, enumerate the findings, fix each of the findings, and then repeat, until review finds no issues.

Your task is to run until the first complete remediation package is ready for user review.

Your target is <repo url>.

``` while output=$(claude -p "$(cat prompt.md)"); do echo "$output"; echo "$output" | grep -q "XDONEDONEX" && break; done ```

</directive>

Draft the prompt into prompt.md, and apply iterative review with additional research steps to ensure will execute the directive as faithfully as possible.

"""

> They recovered much of the same analysis

Really?

No.

At the center of every security situation is the question, "is the effort worth the reward?"

These model advances change the equation. The effort and cost to defeat a measure goes down by an order of magnitude or more.

  find ./ \( -name '*.c' -o -name '*.cpp' \) -exec agent.sh -p "can you spot any vulnerabilities in {}" \;

POC of GTFO should apply to AI models too, or the false positive rate will overwhelm.

Even that would be more meaningful test. They basically coated the ball with a strong smell, then they prepped the dog with that smell, then set it loose in a 5x5 meter area.

"Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior")."

I don’t think mythos can ingest an entire codebase into context. So it’s spinning off sub-agents to process chunks. Which supports their thesis: the harness is the moat. The tooling is whats important, the model is far far less important.

Seems perfectly comparable to anthropic's method, they just wrapped the same kind of prompt in a for loop.

Did Mythos identify vulnerabilities across files? Afaik Mythos worked the same way, analysing a single file at a time.

Exactly, this is so flawed. Anthropic themselves said they only reported <1% of the vulnerabilities found, cause the rest is unpatched.

Give open models an environment (prior to Feb 15- so no Mythos-discovered vulns are patche) of Linux and see how many vulnerabilities it can find. Then put it in a sandbox and see if it can escape and send you an e-mail.

Idk, it seems reasonable to me

> "Our tests gave models the vulnerable function directly, often with contextual hints. A real autonomous discovery pipeline starts from a full codebase with no hints. The models' performance here is an upper bound on what they'd achieve in a fully autonomous scan. That said, a well-designed scaffold naturally produces this kind of scoped context through its targeting and iterative prompting stages, which is exactly what both AISLE's and Anthropic's systems do."

Also they included a test with a false positive, the small models got it right and Opus got it wrong. So this paper shows with the right approach and harness these smaller models can produce the same results. Thats awesome!

So, if you're struggling to make these smaller models work it's almost certainly an issue of holding them wrong. They require a different approach/harness since they are less capable of working with a vague prompt and have a smaller context, but incredibly powerful when wielded by someone who knows how to use them. And since they are so fast and cheap, you can use them in ways that are not feasible with the larger, slower, more expensive models. But you have to know how to use them, it requires skill unlike just lazily prompting Claude Code, however the results can be far better. If you aren't integrating them in your workflow you're ngmi imo :) This will be the next big trend, especially as they continue to improve relative to SOTA which is running into compute limitations.

Thanks Dario, very cool!

https://youtu.be/1sd26pWhfmg?t=204

https://youtu.be/1sd26pWhfmg?t=273

Edit: I changed my phrasing, it's not about restricting its entire context to one file but focusing it on one file but still allowing it to look at how other files interact with it.

How is that going to find anything that interacts across files?

Impressive, and very valuable work, but isolating the relevant code changes the situation so much that I'm not sure it's much of the same use case.

Being able to dump an entire code base and have the model scan it is they type of situation where it opens up vulnerability scans to an entirely larger class of people.

This is from the first of the caveats that they list:

> Scoped context: Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior"). A real autonomous discovery pipeline starts from a full codebase with no hints. The models' performance here is an upper bound on what they'd achieve in a fully autonomous scan. That said, a well-designed scaffold naturally produces this kind of scoped context through its targeting and iterative prompting stages, which is exactly what both AISLE's and Anthropic's systems do.

That's why their point is what the subheadline says, that the moat is the system, not the model.

Everybody so far here seems to be misunderstanding the point they are making.

> Anthropic's own scaffold is described in their technical post: launch a container, prompt the model to scan files, let it hypothesize and test, use ASan as a crash oracle, rank files by attack surface, run validation. That is very close to the kind of system we and others in the field have built, and we've demonstrated it with multiple model families, achieving our best results with models that are not Anthropic's. The value lies in the targeting, the iterative deepening, the validation, the triage, the maintainer trust. The public evidence so far does not suggest that these workflows must be coupled to one specific frontier model.

The argument in the article is that the framework to run and analyze the software being tested is doing most of the work in Anthropic's experiment, and that you can get similar results from other models when used in the same way.

The thing is with smaller cheaper models it is very possible to simply take every file in a codebase, and prompt it asking for it to find vulnerabilities.

You could even isolate it down to every function and create a harness that provides it a chain of where and how the function is used and repeat this for every single function in a codebase.

For some very large codebases this would be unreasonable, but many of the companies making these larger models do realistically have the compute available to run a model on every single function in most codebases.

You have the harness run this many times per file/function, and then find ones that are consistently/on average pointed as as possible vulnerability vectors, and then pass those on to a larger model to inspect deeper and repeat.

Most of the work here wouldn't be the model, it'd be the harness which is part of what the article alludes to.

Isn't the difference just harness then? I can write a harness that chunks code into individual functions or groups of functions and then feed it into a vulnerability analysis agent.

The impact of the Mythos announcement on the cybersecurity firms( like Crowdstrike,ZScalar etc) is big enough(10-15% drop in stock price) and this pushback is expected.

Companies like Aisle.com (the blog) and other VAPT companies charge huge amounts to detect vulnerabilities.

If Cloud Mythos become a simple github hook their value will get reduced.

That is a disruption.

Crowdstrike, no pe because it just had its first profitable quarter (38 million)

ZScalar No PE

Palo Alto Networks Inc (PANW) 86 PE

Fortinet : (FTNT) 31.63 PE

That last one, didn't get hit at all by the Mythos announcement, because at some level it has at least some grounding in fiscal reality.

It's weird that Aisle wrote this.

> It's weird that Aisle wrote this.

No, writing an advertisement is not weird. What's weird is that it's top of HN. Or really, no, this isn't weird either if you think about it -- people lookin for a gotcha "Oh see, that new model really isn't that good/it's surely hitting a wall/plateau any day now" upvoted it.

It's also that humans are very bad at repetitive detailed tasks. Sitting down with a code base and looking at each function for integer overflow comparison bugs gets boring really fast. It's a rare person who can do that for as long as it takes to find a bug that they don't already have some clues about.

It's the flaw in the "given enough eyeballs, all bugs are shallow" argument. Because eyeballs grow tired of looking at endless lines of code.

Machines on the other hand are excellent at this. They don't get bored, they just keep doing what they are told to do with no drop-off in attention or focus.

It's weird, because when working on a big project, taking a break for a week or two, and returning to it, I will find a bug and will see hundreds of lines of code that are absolutely terrible, and I will tell myself "Tom you know better than to do this, this is a rookie mistake".

I think people forget that it's hard to be clever and tidy 100% of the time. Big programs take a lot of discipline and an understanding of the context that can be really hard to maintain. This is one of several reasons that my second draft or third draft of code is almost always considerably better than the first draft.

If it’s obvious when you look close, then automate looking close. Seems simple to write tools that spider thru a code base, finding logical groupings and feeding them into an LLM with prompts like “there is a vulnerability in this code, find it”.

The thesis is, the tooling is what matters - the tools (what they call the harness) can turn a dumb llm into a smart llm.

It’s like not differentiating between solving and verifying.

“PKI is easy to break if someone gives us the prime factors to start with!”

The point of contention is whether Mythos is the product of its intelligence or its harness; the results like this, and other similar testimonies, call into question too-dangerous-to-release marketing, and for good reason, too. Because it is powerful marketing. Aisle merely says the intelligence is there in the small models. I say, it's already clear that competent defenders could viably mimic, or perhaps even eclipse what Mythos does, by (a) making better harness, (b) simply spending more on batch jobs, bootstrapping, cache better, etc. You may not be doing this yourself, but your probably should.

All of this discourse seems very bizarre.

If smaller models can find these things, that doesn’t mean mythos is worse than we thought. It means all models are more capable.

Also if pointing models at files and giving them hints is all it takes to make them find all kinds of stuff, well, we can also spray and pray that pretty well with llms can’t we.

It just points to us finding a lot more stuff with only a little bit more sophistication.

Hopefully the growing pains are short and defense wins

> If smaller models can find these things, that doesn’t mean mythos is worse than we thought. It means all models are more capable.

It means "it's so dangerous we can't release it" was a blatant lie since anthropic would have already known this.

My theory is that Mythos is basically just Opus with revised context window handling and more compute thrown at it. So while it will be a step forward, it is probably primarily hype.

N model is basically just N-1 model with revised context window handling and more compute thrown at it

Did mythos isolate the code to begin with? Without a clear methodology that can be attempted with another model the whole thing is meaningless

They did do one agent per code chunk, yes. But key is that their agent had to identify when there was a vulnerability and when there wasn't. This "small model" test only had to label the known positive cases as positive -- which any function that simply returns "true" can do. This whole test setup is annoying because it proves nothing.

to be fair, last post i saw from anthropic on finding linux kernel vulnerability was a while loop per failed prompting "there is a vulnerability here, find it" more important than that, no frontier model can keep the entire linux kernel in context, so there definitely is code isolation, either explicitly or implicitly (the model itself delegates subagents with smaller chunks of code)

No. How would it? Before the vulns were identified by Mythos, no one knew what the relevant portion to isolate was.