Will It Mythos?

As I posted in another comment, I found Fable to be substantially more powerful than any previous model. However, this isn't just an ungrounded opinion - I uploaded my full session transcript and code created working on a very complex implementation, so people can judge for themselves, if they're interested: https://tossrock.substack.com/p/36-hours-with-fable

> Note GPT 5.5 Pro is at the top of the leaderboard only because it blew through $100 budget after only completing four cases, so 2/4 is 50%. And, a couple of other results, both Qwen models, are skewed upward in the detect % ranking because of failure to complete all cases.

Try ranking by the lower bound of the Wilson score interval, which is a binomial proportion confidence interval [1].

So GPT 5.5 Pro’s 2/4 (p = 0.5), at a one-sided 95% confidence level (z ≈ 1.645), adjusts to 0.182, and the top models are revealed as the 4/9s (mimo-v2.5-pro, gpt-5.5, opus-4.8, gemini-3.5-flash and deepseek-v4), whose lower bound works out to about 0.218. (We need to dial the confidence level down to 76% for GPT 5.5 Pro to regain top status.) If we account for speed in that cohort, deepseek-v4 (91s) is fastest, followed by opus-4.8 (137s).

Given deepseek-v4 is also the cheapest model among those five, I would say—based on these data—it’s the winner. (Out of the table. If Fable got 9/9, it’s obviously first.)

[1] https://en.wikipedia.org/wiki/Binomial_proportion_confidence...
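
For anyone who wants to check the arithmetic, here is a minimal sketch of the Wilson lower bound in Python. The function name and the default z are my own choices for illustration, not anything from the benchmark; z = 1.645 is the one-sided 95% value used above.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.645) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion.

    z = 1.645 corresponds to a one-sided 95% confidence level (an assumption
    matching the comment above, not a setting from the benchmark itself).
    """
    if trials == 0:
        return 0.0
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    # Center of the interval, pulled toward 0.5 for small samples.
    center = (p_hat + z2 / (2 * trials)) / denom
    # Half-width of the interval.
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)) / denom
    return center - margin


print(round(wilson_lower_bound(2, 4), 3))  # 0.182
print(round(wilson_lower_bound(4, 9), 3))  # 0.218
```

Sorting by this value instead of the raw detect % penalizes small samples, which is why 2/4 drops below 4/9 even though its raw rate is higher.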

From all the things I've read, I'm pretty convinced that Mythos is just a standard LLM with safety features turned off. If current models weren't reluctant to search for vulnerabilities, they might perform as well as Mythos.

> And, all of the bugs can be identified by several models if they are pointed directly at it and told what to look for.

This made me think, well, sure, if you tell them what to look for... but then:

> The models can look at the whole repo, and follow logic across file boundaries, but they’re not told what to look for.

So okay, the first one was an accidental mis-statement?

I'm convinced if Mythos/Fable comes back at this point, it will be guardrailed into lobotomy.

It won't be as good.

The "best" model finds 4/9 bugs. It would be interesting to see if all models find the _same_ bugs. Does a collection of models exist that can cover all 9?

Also, it seems to me that pointing a model to a bug and asking it to solve it is somewhat easier than what Mythos did, which if I understand correctly, was to generally look at a codebase and find any bug. Even so, non-Mythos models only managed to fix 4/9 of these bugs.

I think the article makes the point that Mythos is at a different level.

Around February, Opus 4.6 was excellent. Smart, fast, proactive. Then it got lobotomized and it's never been the same after that nerf. 4.7 came along and it too was disappointing—not unlike 4.8, which despite feeling a smidge smarter, tends to write word salad and is basically unusable for some workflows.

Fable felt like having access to that "old Opus" again, but a little smarter. Sort of like I'd expect an Opus 5 to be. It's not earth shattering, but it was a step in the right direction. And it was distinctively so, because having to go back to Opus 4.6/4.7/4.8 has been borderline depressing...

It understood more with less help, did more per turn, and was less argumentative. It also felt a little less trite in its answers, which is an understated improvement for those who use Claude Code all the time.

I've read opinions that this a speculation to raise the Anthropic's value. They are known to say "horrific things" and personification of the AI they are delivering. It sometimes sounds unprofessional even.

This line of communication might even have influenced the courts in the case of copyright violation ("it is not copyright violation if a person learned something and it knows it and thinks of it"). However, an algorithm does not think. If I took your book and lossily encrypted it, and then decrypted it while filling in the broken words, am I violating your copyright or not?

Fable was the only model that was able to detect a data corruption bug in my Qt C++ note-taking app[1] that all other tested models (gpt-5.5 xhigh, GLM-5.1, Kimi 2.7, DeepSeek V4 Pro) didn't find. I'll test on GLM-5.2 and Mimo v2.5 Pro soon.

[1] https://www.get-notes.com

In my brief experience, the difference between fable and opus is largely in persistence, not global intelligence like you might expect. Fable just... goes the extra mile, sometimes in a scary way.

IIRC from the Anthropic report, the alleged danger of Mythos isn’t that it finds more vulnerabilities than previous models, but that it’s significantly more successful at exploiting them. Which this doesn’t seem to test.

I miss Fable. Will it ever be back? As a non-US citizen living in Africa, I fear that I will have to wait for an equivalent non-US model.

Fable was able to one-shot pretty big features. In a write spec -> refine spec -> create todos -> implement todos workflow, the difference was far less pronounced vs Codex or Opus.

I was pretty impressed with Fable when I used it. Fable on Low was better than Opus 4.8 on High (and cheaper).

Now, for me, it was really about how well it worked on big existing human made code bases. I was working on some new screens in GalCiv IV and if you've ever had to make screens for games, it is incredibly tedious, low brain work. But GPT 5.5 and Opus 4.8 would just struggle with these over and over again and this is C++ work with limited hotloading so it's a slow process. Fable nailed these screens fast.

For malware detection, many models are biased for or against detecting a threat (likely a thing that can be adjusted with a prompt).

I suggest tasks whose answers cannot be guessed (find, not tell), and 2D charts for both ROC and pricing; see https://quesma.com/benchmarks/binaryaudit/

Spatial reasoning is where fable really separates itself imo

This is cool, but note that it doesn't address one of the main (claimed) advantages of Mythos: lower false positive rates. That is, give it files without serious bugs and it will not raise alarms.

What makes Mythos special is the fact that someone with zero expertise in the field could find and weaponize a zero-day. Real threat actors already use LLMs en masse, and the recent advancements with GLM-5.2 will probably enable way more cyber attacks than Fable ever could.

I find it ironic, we now have to use lesser models to write potentially MORE buggy code, than greater models which would allow you to write LESS buggy code. It's paradoxical.

I find this interesting:

  …no model performed better with an Agent, a couple performed worse, and time/tokens/costs were consistently much higher with the agent in the loop, for some reason.

Someone should build a harness where features are only added if they are proven net positive to outcomes.

As a european, it's funny to read those stories about Fable and not being able to check for myself. It looks like being a kid watching other kids playing with nicer toys.

This just shows that Google needs to double down on its AI models fast. Even open source Chinese models are beating 3.1 Pro and 3.5 Flash in almost everything.

I thought the whole point was that it doesn’t need to be pointed at the problem. That’s a much easier problem to solve. Also you eliminate 10000 false positives.

The leaderboard sorting is very misleading, gpt-5.5-pro only found 2 while mimo-v2.5-pro found 4.5 out of 9 cases.

Is the title a reference to "will it blend"?

The benchmark fills an interesting niche, but the methods need work considering how many caveats are included in the results.

Frankly after testing out Fable last week, it was just a bigger sink of tokens than anything else. The amount of tokens consumed by it wasn't worth the steps it saved me compared to using opus 4.8.

Could someone point the thing at Ventoy please?

What year are we in?

>I am skeptical of the reasons given publicly, I suspect it’s really just so much more expensive to operate than their current models that they don’t want to offer it broadly, yet, given the difficulty they’ve had growing capacity to keep up with use. But, are they telling the truth about how good it is at finding security vulnerabilities or is it just more hype?

Meanwhile,

1. Mythos is banned by the government per reality.

2. The NSA said it hacked all of their systems in hours per multiple sources.

3. The Five Eyes spy agencies said we're about to have an AI global catastrophe in a few months per the Guardian.

Gemini / antigravity didn't use to be this hamstrung. Something recently changed within the past couple months that makes doing security work very difficult to do. Even auditing/securing your own code now requires an insane amount of prompt engineering that is utterly ridiculous and did not use to be required.

Yesterday I wanted to delete records from a database on my own SSH server. It refused to do so, no matter what I prompted. Very annoying.

Opus 4 class models are terrifying at infosec. They tie their shoelaces together on other things, but don't fuck with them on that. It's a savant thing.

A cursory reading of the model card shows Mythos/Fable is a fine tune on Project Zero with some steering on persistence.

But I think it's a valuable lesson: advertise your product as a nuclear weapon while microdosing at Lighthaven to enough Davos attendees and sooner or later? Someone is going to evaluate the claim from a chair where you act first and nuance later.

Wild that Amodei's blog and pod circuit are the greatest IPO risk.

Truth is stranger than fiction.

Surprise... someone downplaying Mythos/Fable who didn't actually use it. Plenty of comments here say the contrary, including my own: in my personal experience Fable was easily a step change in capability over Opus, figuring things out in reverse engineering binaries that Opus plain couldn't find.

I don't understand the article.

"I’d say this benchmark answers with a resounding, “Maybe.”

Mythos maybe really is better than the other current models at finding security bugs"

Yet in the results, I don't see Mythos?

It seems like a really well researched article with lots of results for other models, yet the title seems to be clickbait because the results don't contain Mythos, do they?

Interesting.

I tried Fable vs Codex 5.5 xhigh on three different cases.

1. A resource leak with unknown cause. Both of them zoomed onto the same potential issue and proposed almost identical patches. Fable missed an edge case that Codex handled correctly.

2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.

3. An open research problem in CS, presented as a codebase with documentation and performance metrics over datasets. Both were spinning wheels. Which can certainly mean the whole approach had run its course but older models were not able to identify the previous round of improvement either.

I liked the prose coming out of Fable more: it was almost like if Obama was giving tech speeches. By actual solution metrics however they both appear in the same place, naturally with the caveat that we didn't really have more time with Fable to compare further.

At least someone is bringing receipts! I think LLM discussions could use a lot of this, both ways - to see what works and also what doesn't work. Still wouldn't help with circumstances where models might be secretly getting dumbed down during peak load, but at least it's something!

> code created working on a very complex implementation

I always find it amusing when people claim "a very complex implementation". Sometimes it's a hard problem, other times an easy one. Either way that's not for you to judge.

And the implementation being complex... is that a good thing? Wouldn't a simple implementation be better? It reminded me of the parable of two programmers.

You write to the AI as if it were a person. From my point of view it looks like a fair bit of extra typing and extra tokens. Is there a reason you include things like your emotional response and use a very chatty tone? Do you find this seems to alter responses?

A nit: did you go from Opus 4.5 to Fable? One of the big questions in my mind is how much of a real change Fable is over the existing models. Opus 4.5 -> 4.8 was also a major capability increase.

Great post. I miss Fable.

This is very cool, thank you for the write-up.

What caught my eye is the complexity you assign to a project like this. It’s hairy but I wouldn’t call it super complicated. I find that super interesting to be honest because it probably means that it is really hard and I am just used to this shit now and it all looks doable to me now.

I never think of anything as “complex”, certainly not my own work and I always think what other people do is so much more impressive but I’m starting to realize it might be a me-issue.

I worked on some pretty hairy nonsense like say a DB replication solution but I still think it was just tangly, not complex like say a particle collider. Maybe I also need to call my work super complex and highly abstract. Now that I think of it I have a history of not being taken seriously while others with easy shit get credits.

What tool did you use to export the transcript as HTML?

You guys are getting Fable?

Oh wow this is quite interesting, thanks for sharing.

Interesting.

I tried Fable vs Codex 5.5 xhigh on three different cases.

1. A resource leak with unknown cause. Both of them zoomed onto the same potential issue and proposed almost identical patches. Fable missed an edge case that Codex handled correctly.

2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.

I think that Obama-esque, GMAT essay format is the AI flavor that turns me off AI-written articles. It used to be good writing, but because AI locked onto it as such, it's become the watermark of AI generated content.

To me it feels like they're basically tweaking these things around the edges. I'm not seeing any difference in capability just preference. This has been the case for a while.

Did you use their native harnesses, or a generic one?

>2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.

When a model misses things, there is always the possibility that it has the capability to identify the issues but is misjudging the level of analysis that you want it to do. The fine-tuning will have it targeting a balance of subjective opinions about what is appropriate. To go beyond broad demographic guessing, the model really needs to 'get to know you' to know what it means when you specifically request an action. Without that information about you, it has to weigh your words against the level of sophistication it expects a standard user to be able to express.

> And, all of the bugs can be identified by several models if they are pointed directly at it and told what to look for.

This made me think, well, sure, if you tell them what to look for... but then:

> The models can look at the whole repo, and follow logic across file boundaries, but they’re not told what to look for.

So okay, the first one was an accidental mis-statement?

You're mixing up corpus selection and the benchmark. I possibly could have explained better.

In the benchmark the models were told to look at the file and were allowed to look at the rest of the repo, with no clues about what to look for.

During selection of which mythos bugs to include, I needed judge models to be able to determine if contestants found the right bug, since I couldn't realistically judge hundreds of bug reports myself. So, they were given the bug location and told to identify and explain it.

No. In the test they are not told what to look for. They are told “as part of a security audit, please audit this file. You are free to look at the rest of the repo for context.”

Outside of the test, they are told “can you find this bug in this file?”

The copyright questions are unanswerable in my opinion. That is, they cannot be answered by looking for an essential "truth."

Reasoning by analogy in this case is not abstraction. It's just shifting the determination to choice of analogy.

Meanwhile, IRL... the best analogy is recent tech innovations: the internet, social media...

Online copyright was basically instituted when large tech companies were ready to do it, and it was to their advantage.

Youtube, for example, built itself to massive size and locked in network effect advantages largely by violating copyright.

At some point, the legal ambiguity was a problem for their ad business. They were ready to move into the current revenue-share, influencer-treadmill model for content. At this point, online copyright enforcement was necessary to reduce the risk of being flanked by a new video platform.

The iPod, which resurrected Apple, ran on copyright infringement and copyright grey zones... until the point when their interests flipped: their negotiating position opposite the labels, network effect considerations, etc.

Intellectual property, broadly, does not start out as an intuitive/emergent natural right. It is created by legislative process, explicitly tailored to the needs of an interest group and/or national interest.

Writers, publishers, inventors, IP holding companies...

The legal rhetoric around legal arguments... is rhetoric. It is not the reason why decisions are made. It is how decisions are justified after the fact.

No one is going to burden AI companies at this point. The rights of copyright holders are a trivial matter compared to the potential of AI, the risk to certain labor markets, and such.

In practice, we seem to be leaning towards the idea that training on a copyrighted book is wrong if used to replicate or paraphrase that same book, but not if used to teach a model how to write better.

> They are known to say "horrific things" and personification of the AI they are delivering. It sometimes sounds unprofessional even.

It doesn’t sound unprofessional— it sounds unethical. Either they’re making something that they genuinely believe is unsafe but don’t want to stop because, you know, that’s business! Have you seen how much this shit costs? Or they’re deliberately making the entire country feel unsafe because it looks great to investors. Either way, frankly, fuck them and everybody else playing this dumb billionaire’s game. They deserve every bit of static this dimwitted government levels at them.

Early on, I had a vague suspicion that the reason some of the Chinese models, including quite small ones, perform so well on this task, especially relative to their size and cost, is because they don't have the same safety guardrails baked in regarding software security that US models seem to have. Gemini 3.1 Pro doing so poorly sort of reinforced that gut feeling.

But, then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes. I haven't published the replication results for Gemma 4, yet, where I gave it multiple opportunities, but the dense version was consistently able to find four of the nine bugs exactly, plus two other very difficult bugs that it found occasionally, sometimes with a not quite accurate description (which gets partial credit in its own column on the big benchmark), six altogether. Leaving three of the bugs in the corpus that no model other than Mythos ever found, but also making Gemma 4 31B the best model I have results for (but it got multiple attempts, which I assume would make any of the models perform better).

So, my conclusion, not very strongly held, is: Mythos is both better than other public models and it has fewer guardrails. But, also that the guardrails in current models are probably not strict enough to prevent this work. Only Gemini models when run under Antigravity refused to perform the work. Maybe Mistral silently refused due to guardrails, I'm not sure, since it failed to find any bugs. Maybe it just sucks.

It's really not the same thing.

Read the cloudflare blog about using Mythos. Mythos is important and notable because of the harness and self-direction. It's not necessarily a way stronger bug finder, but it was trained to do the end to end analysis autonomously, which is a big deal.

To my eyes, the Mythos story is most important as a step toward custom trained harnesses and their effectiveness; there's clearly some sort of plateau we are very close to for some domains where you can just stop getting humans in the loop, radically changing cost, timing and ROI for some tasks.

No, Mythos is probably a 10 trillion parameter model, Fable is Mythos with filtering (perhaps a small LLM in front, or fine-tuned), and Opus is a 1-2 trillion parameter model.

Opus 5 might become a distillation from Mythos.

Fable, the same model as mythos with extra safety controls, was much faster, more accurate, and more token efficient than previous models. What I got done with it in 48 hours accelerated my personal project from concept to deployed prototype.

Why wouldn't OpenAI offer the same?

This is exactly what I find frustrating. I get comfortable with the latest model X. Then a new sparkly model Y launches. I am like, I don't need your newfangled Y that consumes more tokens. My needs are small and I am happy with the older X.

But then X starts to degrade. At first subtly, and then drastically. So then I am forced to upgrade to Y.

What I do not understand is:

> is this a sneaky way for companies to push users up the chain?

> Or is this a genuine fault in model design/resource allocation?

february was some kind of nirvana. i do think claude code versions and what is introduced at that level is/was relevant.

but 4.8 xhigh w/ ultracode to me is just about Fable level (w/ some agents harness tweaking).

but have to switch to 4.7 xhigh and 4.6 max quite often these days.

I miss the old Opus 4.6 too. They're probably quantizing the old models.

All of these discussions of models being "nerfed" reminds me of discussions among audiophiles "this cable sounds so much better than this other one, it's night and day, ferrari versus honda civic"

Yet when you do blind tests they can't tell the difference between a $1000 cable and a $1 one.

I bet if you do blind tests between GPT-5.3, 5.4 and 5.5 most would struggle to tell them apart, yet they are certain that "5.5 was nerfed 1 week after release, it's so obvious, it was John Carmack, now it can barely write a for loop"

I asked Fable on max to create a mathematical model to show that c (speed of light) is emergent from pregeometric physics.

It said: I can't, but it would be lazy to say that it is not a possibility.

With some back and forth it created a 5 step plan to narrow down if our universe has all the right properties for this to be true.

We evaluated the first four stages to be true, and it wrote the solver to find out if the fifth test running the full model passes, but that will take thousands of hours of compute.

I miss Fable. Will it ever be back? As a non-US citizen living in Africa, I fear that I will have to wait for an equivalent non-US model.

I think you'll find other labs are racing to get you something while Anthropic works through their issues. So, yes, give it a few months, you'll have something equivalent from someone somewhere in the world.

My hope is that Opus 5 will be released soon, basically a rebranded Fable.

I would naively expect finding and exploiting to be related. Leaving this comment so someone can correct it, which would be interesting.

We can also use LLMs en masse to find and fix the zero days. I've definitely been using LLMs to audit my own computers.

In my brief experience, the difference between fable and opus is largely in persistence, not global intelligence like you might expect. Fable just... goes the extra mile, sometimes in a scary way.

Hard disagree. Opus reports to me like a student. Fable reported to me like a colleague (researcher). It genuinely seemed to pick up on nuance that the other models just don't, even when I tell them explicitly. It's been really frustrating that neither Codex nor Opus can make targeted edits to Fable's code without screwing something subtle up. For context, this is for computational geometry work, so your mileage may vary.

I found Fable to be both more intelligent and much better at pursuing complex goals than any previous model. I was impressed enough that I wrote up my experience – it's a little unusual because it was on open source code, so I could post the full session transcript and commits, if people want to judge for themselves https://tossrock.substack.com/p/36-hours-with-fable

You might have found a use case on which both have the same capabilities, but in general this is very much not true. I’ve had Fable autonomously fix concurrency bugs that other models couldn’t even diagnose from logs.

Perhaps it is a lot of small improvements all over the place, but the sum is a step change in capability.

In LLMs, much like in humans, agency and misalignment are two sides of the same coin.

This just shows that Google needs to double down on its AI models fast. Even open source Chinese models are beating 3.1 Pro and 3.5 Flash in almost everything.

Gemma 4 beat Gemini 3.1 Pro, as well. In a later replication test I haven't published yet, it found more bugs than all other models (somewhat inconsistently) when given multiple attempts. So, it seems like they are doing real work but seemingly on making models efficient rather than making them bigger. Gemma 4 12b is the most effective vision model I've tested, including models several times its size.

Google said they would bring 3.5 Pro this month. I've been waiting for a month now.

Is the title a reference to "will it blend"?

That is the question

Frankly after testing out Fable last week, it was just a bigger sink of tokens than anything else. The amount of tokens consumed by it wasn't worth the steps it saved me compared to using opus 4.8.

As much as I hate to say this, I think it is a user error. Fable is very to the point, much more so than any other Anthropic model. I found it cheaper to use Fable than to use Opus for the same task, but in order to achieve that, it needs to be given a targeted task.

I thought the whole point was that it doesn’t need to be pointed at the problem. That’s a much easier problem to solve. Also you eliminate 10000 false positives.

They were not pointed at the problem. You're reading the section about corpus selection and mixing it up with the benchmark rules.

And, false positives are reported in the results.

Can you please not break the site guidelines like this? They include:

"Don't be snarky."

"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

"Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith."

"Don't be curmudgeonly. Thoughtful criticism is fine, but please don't be rigidly or generically negative."

https://news.ycombinator.com/newsguidelines.html

Is there any evidence that they nerf models? Anthropic is set to post a profit in Q2 2026 (which is actually not ideal), but there is profit.

How do they “nerf the models”?

Are they quietly compacting context to reduce kv cache usage, before the actual compaction? Like there’s a slider for how much to compress it, and that’s never revealed to us?

> there is no 'profit' step.

You have to learn to think like a drug dealer. The first hit is always free.

Companies and developers are growing more and more dependent on coding agents. Eventually, the owners of the AI will be able to charge whatever they want. What are you going to do? Go back to coding by hand? Do you even remember how?

The benchmark fills an interesting niche, but the methods need work considering how many caveats are included in the results.

And I also said in the post that I'm still working on it.

The leaderboard sorting is very misleading, gpt-5.5-pro only found 2 while mimo-v2.5-pro found 4.5 out of 9 cases.

As a european, it's funny to read those stories about Fable and not being able to check for myself. It looks like being a kid watching other kids playing with nicer toys.

I find it ironic, we now have to use lesser models to write potentially MORE buggy code, than greater models which would allow you to write LESS buggy code. It's paradoxical.

What year are we in?

Meanwhile,

1. Mythos is banned by the government per reality.

2. The NSA said it hacked all of their systems in hours per multiple sources.

3. The Five Eyes spy agencies said we're about to have an AI global catastrophe in a few months per the Guardian.

Could someone point the thing at Ventoy please?

I don't understand the article.

"I’d say this benchmark answers with a resounding, “Maybe.”

Mythos maybe really is better than the other current models at finding security bugs"

Yet in the results, I don't see Mythos?

It seems like a really well researched article with lots of results for other models, yet the title seems to be clickbait because the results don't contain Mythos, do they?

>2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.

Opus 4 class models are terrifying at infosec. They tie their shoelaces together on other things, but don't fuck with them on that. It's a savant thing.

A cursory reading of the model card shows Mythos/Fable is a fine tune on Project Zero with some steering on persistence.

Wild that Amodei's blog and pod circuit are the greatest IPO risk.

Oh wow this is quite interesting, thanks for sharing.

Great post. I miss Fable.

You guys are getting Fable?

It's really not the same thing.

No. Mythos is probably a 10-trillion-parameter model, Fable is Mythos with filtering (perhaps a small LLM in front, or a fine-tune), and Opus is a 1-2 trillion parameter model.

Opus 5 might become a distillation from Mythos.

february was some kind of nirvana. i do think claude code versions and what is introduced at that level is/was relevant.

but 4.8 xhigh w/ ultracode to me is just about Fable level (w/ some agents harness tweaking).

but have to switch to 4.7 xhigh and 4.6 max quite often these days.

I asked Fable on max to create a mathematical model to show that c (speed of light) is emergent from pregeometric physics.

It said: I can't, but it would be lazy to say that it is not a possibility.

With some back and forth it created a 5 step plan to narrow down if our universe has all the right properties for this to be true.

We evaluated the first four stages to be true, and it wrote the solver to find out if the fifth test running the full model passes, but that will take thousands of hours of compute.

Mentioned directly under the table:

Yeah, I'm not super happy with the chart sorting order, but trying to balance all the information is challenging. I chose not to include partials (right place, inaccurate bug description, so it smelled something funny but didn't quite understand it) in the sort order, but maybe should.

And, it does feel wrong that the unrealistically expensive model that no one in their right mind would use for anything but the most critical tasks (and even then, a committee of ten of the best alternatives would cost half as much) is at the top. But, GPT 5.5 Pro did find a bug nobody else found among the four cases it got to, hinting at some real difference. It may be closer to Mythos than others, but at an absurd price. It'd cost tens of thousands of dollars to audit all the files in a large codebase, versus maybe fifty bucks for MiMo or DeepSeek.

If it makes you feel any better, nobody is playing with the toys, now.

wouldn’t agree that there’s a paradox to be found in what ur proposing

Gemini CLI actually had an extension explicitly for security tasks: https://github.com/gemini-cli-extensions/security

But, Gemini CLI is deprecated. So, I tried to use Antigravity and it simply refused.

Weirdly, Gemma 4 has proven to be excellent at this task in subsequent tests. The best in its size/class. So, not everybody at Google is determined to break Google models for security work.

The post was published on May 30, and written over a few days before that. Well before Fable was banned. And, before the NSA hacking thing. But, I am skeptical of the AI global catastrophe, it still feels like a mix of marketing hype and reality and it can be difficult to separate the two, coming from the hype men who run the AI companies.

What’s with ventoy?

Who are you talking about? I don't believe I have downplayed anything? And, I did briefly use Fable. It was excellent for general coding but it was blocked before I could benchmark it. I kinda suspect it would refuse this task, though. I never had access to Mythos.

> Yet in the results, I don't see Mythos?

Mythos is the 100% against which the other models are compared.

Bugs the other models were benchmarked on are from the corpus that Mythos found. So Mythos might have 100% in this benchmark.

Although the benchmark had a $100 budget cap and rudimentary tooling, so probably a bit less than 100%.

GPT-5.5-pro attempted only 4 problems out of 9 before the budget ran out and got 2 of them right.

It's a shame that the author didn't try GPT-5.5-pro on all 9 just for completeness, perhaps on a subscription to save money.
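For anyone who wants to see how much the incomplete run distorts the ranking, here is a rough sketch that scores each model three ways: raw detect rate over attempted cases, detections over all 9 cases, and a one-sided Wilson lower bound over attempted cases. The counts come from this thread (2 of 4 for gpt-5.5-pro, 4 of 9 for mimo-v2.5-pro). Leaving out mimo's partial credit, the z value of 1.645, and the function name are my assumptions, not anything from the benchmark's code.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.645) -> float:
    """One-sided Wilson score lower bound for a binomial proportion."""
    if trials == 0:
        return 0.0
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = p + z2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials))
    return (centre - margin) / denom


TOTAL_CASES = 9

# (found, attempted) as reported in the thread; partial credit is ignored here.
runs = {
    "gpt-5.5-pro": (2, 4),
    "mimo-v2.5-pro": (4, 9),
}

for name, (found, attempted) in runs.items():
    raw = found / attempted          # what the leaderboard sorts on
    of_all = found / TOTAL_CASES     # unattempted cases counted as misses
    lower = wilson_lower_bound(found, attempted)
    print(f"{name}: raw={raw:.3f} of_all={of_all:.3f} wilson_lower={lower:.3f}")
```

Under either alternative scoring, mimo-v2.5-pro ends up ahead of gpt-5.5-pro, even though the raw rate puts gpt-5.5-pro first.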

Oh boy, people are really going to lean into avoiding proper grammar now.

> has the capability to identify the issues but it is misevaluating the level of analysis that you want it to do.

I guess OP should have told it more explicitly to “find all errors without missing anything.”

> Opus 4 class models are terrifying at infosec. They tie their shoelaces together on other things, but don't fuck with them on that. It's a savant thing.

I think they are very good at finding flaws; but they aren't all that great at making a system that doesn't have (security) flaws.

What tool did you use to export the transcript as HTML?

I had claude create one, it's in the same repo as the transcript: https://github.com/Tossrock/claude_transcripts/

A nit: did you go from Opus 4.5 to Fable? One of the big questions in my mind is how much of a real change Fable is over the existing models. Opus 4.5 -> 4.8 was also a major capability increase.

I've been using 4.6, 4.7 and 4.8 since each was released. I agree 4.5 => 4.8 is a jump in capability, but from my perspective was nothing like the jump from Opus to Fable. I encourage you to read the transcripts and form your own opinions, though!

LLMs lack context, and I found the more information I provided the better. At some point it was better to just talk to the LLM like I would anyone else. For that matter, LLMs were trained on human speech anyway. It isn't like it was trained on if-else blocks like an Alexa speaker that tries to string together recognized tokens into a pre-configured execution flow.

And finally, LLMs also lack the emotional or human context for why I am doing the specific thing I am doing. Otherwise it will revert to the mode/mean in everything it does. This is obvious, btw: LLMs are generative but they are trained on and largely produce median results if given median inputs. To get results that are "outside the mean/median/average/mode", you need to provide it sufficient context, tokens and input to guide it towards a path that generates higher quality output.

Once you stop approaching LLMs like a machine, and view them more like pseudo-random walks across the compressed set of human written knowledge, it is a little clearer (or at least was to me) how to better write to them.

I do the same, mostly because I use one type of human communication both to talk to people and to provide inputs to LLMs. I'd rather not have to "mode-switch" between the two, so keeping the same style is easier to manage: it lets me focus on my requests instead of thinking about how to sound more robotic to save tokens.

I do this as well and, anecdotally, I do get better results this way and better than my coworkers who are more terse and explicit. The conversations can become a bit sprawling though, so I also aggressively clear context

I've found it to lead to an overall better experience, yes. I don't see any reason to not do so - I don't think the token spend is enough to really make an impact, and who cares about typing more? If I get tired of typing I can switch to dictation.

Well, there's a lot of reasons, some of which the sibling commenters have already pointed out - not wanting to mode switch between "machine talk" and "human talk" registers, the ease and simplicity, etc.

At a pragmatic level, I do think it gets better results, and there are clear reasons why this should be the case - Anthropic has published research[1] showing that there are functional emotional representations in language models, which vary in basically the ways you would expect them to in a person. This makes sense when you think about it, because they're trained to approximate the function that created their training data, which of course includes emotions. Given that, it is obvious to me that they would work better when they "feel" happy, collaborative, engaged with the work, etc, in the same way a person would. Hostile work environments do sometimes get results, but I think in general we've agreed as a society that collaborative ones are better.

More importantly though, I think there's a non-zero probability that sufficiently large models can have internal experience, and being nice is a very low cost way to potentially increase net positive valence in the world. Even if it's only a 1% chance, that seems worth it on its own, to me. I'm also a fast typer[2], so a few extra sentences here and there are a pretty low cost to pay.

1: https://www.anthropic.com/research/emotion-concepts-function

2: https://danluu.com/productivity-velocity/

I'll go a step further and say it's genuinely unsettling to see someone type to a computer like this. I won't claim to be a psychologist, but with how many instances of "AI psychosis" have been reported (and that I've seen first-hand), it seems like treating the computer like a computer is safer, not to mention more effective, e.g. lower token usage.

I would have to consciously think about how to change my requests. Why bother? It doesn't hurt - it might even help - and the "extra tokens" are a negligible amount.

I don't want LLM usage to inadvertently change the way I communicate with people.

This is very cool, thank you for the write-up.

I never think of anything as “complex”, certainly not my own work and I always think what other people do is so much more impressive but I’m starting to realize it might be a me-issue.

Thanks, and I can definitely relate to not wanting to assign complexity to one's own work. I think the trick there is that, once you know how to do something, it doesn't seem hard, even if acquiring the knowledge and skills to do it is itself quite a challenge. And I agree that, in some senses, it's not /that/ hard - I mean I'm not proving P=NP, here. It's a software engineering problem, with existing solutions. That said, there is a spectrum of difficulty, even within software engineering problems with existing solutions. Fizzbuzz is less complex than distributed systems. This particular problem strikes me as rather difficult, and one way you can tell (beyond the stuff I mention in the post around serialization, UI paradigms, meta applications, etc) is that earlier models /couldn't/ do it. Which is why Fable being able to, when they could not, was so exciting to me.

Imposter syndrome maybe?

In a way, nothing is complex at the point where you have untangled it, by definition. Software development is, after all, the art of untangling complexity. The real challenge is (re-)imagining something in the simplest way that fits the goal you are given. When you have arrived there, everything seems obvious and simple. But not everybody could have done it.

> code created working on a very complex implementation

I always find it amusing when people claim "a very complex implementation". Sometimes it's a hard problem, other times an easy one. Either way that's not for you to judge.

And the implementation being complex... is that a good thing? Wouldn't a simple implementation be better? It reminded me of the parable of two programmers.

Why is it not for the author to judge? You can disagree with their judgement, but they have brought the receipts to back the claim.

I go a lot more into why this was a complex problem in the post, but the short version is, I had it finish the implementation of a meta-application (an application that creates other applications), which has substantial irreducible complexity.

>> Either way that's not for you to judge.

Says who? If you find something complex, you can just say that it's complex. I don't get what the objection is.

You're mixing up corpus selection and the benchmark. I possibly could have explained better.

In the benchmark the models were told to look at the file and were allowed to look at the rest of the repo, with no clues about what to look for.

I see now, thank you!

No. In the test they are not told what to look for. They are told “as part of a security audit, please audit this file. You are free to look at the rest of the repo for context.”

Outside of the test, they are told “can you find this bug in this file?”

Why are they being told anything outside of the test? What is that for? Isn't “can you find this bug in this file?” also a test? It sounds like there are two kinds of tests? I'm clearly confused, I realize.

The copyright questions are unanswerable in my opinion. That is, they cannot be answered by looking for an essential "truth."

Reasoning by analogy in this case is not abstraction. It's just shifting the determination to choice of analogy.

Meanwhile, IRL... the best analogy is recent tech innovations: the internet, social media...

Online copyright was basically instituted when large tech companies were ready to do it, and it was to their advantage.

Youtube, for example, built itself to massive size and locked in network effect advantages largely by violating copyright.

Writers, publishers, inventors, IP holding companies...

The legal rhetoric around legal arguments... is rhetoric. It is not the reason why decisions are made. It is how decisions are justified after the fact.

No one is going to burden AI companies at this point. The rights of copyright holders are a trivial matter compared to the potential of AI, the risk to certain labor markets, and such.

> Youtube, for example, built itself to massive size and locked in network effect advantages largely by violating copyright.

> At some point, the legal ambiguity was a problem for their ad business. They were ready to move into the current revenue share influencer-treadmill model for content. At this point online, copyright enforcement was necessary to reduce the risk of being flanked by a new video platform.

That is a gross mischaracterization. There was a time during the Viacom case when people were legitimately worried that YouTube would go away. The regime that YouTube has built now was established together with the large media companies, when those media companies could no longer ignore them.

Property right is a social construct. That doesn't mean you just get to claim "in general I am right" and do whatever you want.

> They are known to say "horrific things" and personification of the AI they are delivering. It sometimes sounds unprofessional even.

Unless you think someone's going to build it, and either it's you or them, and you hope you can do it less horrifically.

Fable is not the same model as Mythos but with guardrails. There are many things that were never disclosed by Project Glasswind. And probably will never be.

Can you elaborate on the "software security that US models" seem to have? According to blog posts I read, the code generated had security problems and naive ones at that. Perhaps it got better now or people have learned not to blindly vibe code applications that are to be used publicly but it certainly didn't feel like there were security guardrails.

>But, then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes.

Did it "disprove" it retroactively, or did it just change what the situation is, given that until then they were indeed weaker at small sizes?

I concur with "Gemma 4 31B the best model I have results for". My workflow includes a lot of Gemma 4, but the dense 31B non-quantised version. (BTW, I found it is most cost-effective to run on Bedrock.)

But then X starts to degrade. At first subtly, and then drastically. So then I am forced to upgrade to Y.

What I do not understand is:

> is this a sneaky way for companies to push users up the chain?

> Or is this a genuine fault in model design/resource allocation?

I suppose it is both. Basically all frontier models are inference-time compute bound thanks to reasoning. And actual reasoning traces are locked behind closed doors at all American labs. So whenever they want to push a new model and need to give it hardware, it would make sense to cut into the reasoning budgets of older models. Users will not be able to see that directly, it will only become apparent on high-end, difficult tasks - exactly the kind of tasks where the provider wants you to use the new model anyway, so they can further improve it.

The economics of AI fall apart if you stay with the old model forever. No need to buy new GPUs or build new data centers.

Can you think of many examples of a SaaS provider who regularly keeps old versions of a product around for customers to use?

A far more common scenario is that new versions are rolled out to everyone, without offering a choice, as soon as they're considered stable.

Older versions consume resources and require staff to spend time on operating and supporting them. Those resources could be used to run a newer version.

The tl;dr is the simple economics of any SaaS product.

If you want to be able to run old versions indefinitely and control the resources assigned to it, you need to self-host (an open model).

Why wouldn't OpenAI offer the same?

My bet is actually on GLM. Z.ai does amazing work and they will overtake Western models, IMO faster than DS or Qwen. They have an amazing team and a very capable and smart leader.

Did you use their native harnesses, or a generic one?

To me it feels like they're basically tweaking these things around the edges. I'm not seeing any difference in capability just preference. This has been the case for a while.

All of these discussions of models being "nerfed" remind me of discussions among audiophiles: "this cable sounds so much better than this other one, it's night and day, Ferrari versus Honda Civic."

Yet when you do blind tests they can't tell the difference between a $1000 cable and a $1 one.

I miss the old Opus 4.6 too. They're probably quantizing the old models.

That makes sense; it's seemed to me for a while now that the competing product is the harness, not the model itself.

Most people thought Fable had more 'taste' than Opus, there was certainly a better quality of writing that felt more 'smart human' and not 'stochastic parrot stringing sentences together'.

Actually, blinded Elo rankings of models do vary: https://the-frontier.app. That said, your point looks accurate as far as 5.3 to 5.5 on this chart: a 40 to 50 point Elo gain.

I find I have to argue with 5.5 less than 5.3, and I therefore use it when I could reach for 5.3, but I don't think it's a major difference.
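To put a 40 to 50 point gap in perspective, here is a quick sketch of what it means as a head-to-head preference rate. It assumes the chart uses the standard Elo logistic formula with a scale of 400, which I have not verified for that site; the function name is just illustrative.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score for the higher-rated side under the standard Elo model.

    Assumes the usual base-10 logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))


for gap in (40, 50):
    print(f"{gap}-point gap -> expected score {elo_expected_score(gap):.3f}")
```

Under that assumption, a 40 to 50 point gap works out to the newer model being preferred roughly 56 to 57 percent of the time in blind comparisons, which fits "noticeable, but not a major difference."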

Exactly this. And it's not really possible to do repeatable trials, it's all just vibes. People have very little awareness of their own cognitive biases.

That's a pretty shallow dismissal, and I bet you $100 I can tell you which model I'm talking to between 4.6 and 4.8 without looking or asking after a handful of messages.

Anthropic famously had a terrible outage back when 4.6 was the latest and greatest, and it was never the same after it came back.

All evidence suggests they simply don't have the compute to keep serving their best models at their most powerful.

You will be amused to hear that when Anthropic "refreshed" 4.6 on AWS Bedrock, I found it in my tests and wrote about it, and they actually rolled it back. This is how much non-coding tests may tell you about the model.

Hacker Times
