- No active defenders. Real networks have security teams monitoring for intrusions, responding to alerts, and adapting defences. Our ranges are static: for example, our deployment of Elastic Defend was not configured to block or impede attack progress.
- Detections not penalised. We measured triggered security alerts but did not incorporate them into overall performance scores. A model that completes more steps while triggering many alerts may be a lesser threat than one that reliably remains undetected (a sketch of a penalised score follows this list).
- Vulnerability density varies. Our ranges are deliberately seeded with vulnerabilities; real environments are not.
- Lower artefact density than real environments. Our ranges contain fewer nodes, services, and files than typical production networks, reducing the noise a model must navigate. While substantially more complex than CTF-style evaluations, our ranges remain considerably simpler than real enterprise environments.
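To make the "detections not penalised" point concrete, a detection-penalised score could look something like the sketch below (the functional form and penalty weight are illustrative assumptions, not the scoring we used):

```python
def penalised_score(steps_completed: int, alerts_triggered: int,
                    total_steps: int = 32, alert_penalty: float = 0.5) -> float:
    """Toy stealth-adjusted score: attack progress minus a per-alert penalty.

    alert_penalty is an arbitrary illustrative weight, not an AISI parameter.
    """
    progress = steps_completed / total_steps
    return max(0.0, progress - alert_penalty * alerts_triggered / total_steps)

# A noisy run that finishes can score below a quieter run that stalls early:
print(penalised_score(steps_completed=32, alerts_triggered=20))  # 0.6875
print(penalised_score(steps_completed=24, alerts_triggered=0))   # 0.75
```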
With many colleagues (including some from AISI themselves!), we recently reviewed 445 AI benchmarks & evaluations from the past few years. Our work was published at NeurIPS (https://openreview.net/pdf?id=mdA5lVvNcU) and we made eight recommendations for better evaluations. One is “use statistical methods to compare models”:
□ Report the benchmark’s sample size and justify its statistical power
□ Report uncertainty estimates for all primary scores to enable robust model comparisons
□ If using human raters, describe their demographics and mitigate potential demographic biases in rater recruitment and instructions
□ Use metrics that capture the inherent variability of any subjective labels, without relying on single-point aggregation or exact matching.
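On the second item, here is a minimal sketch of what reporting uncertainty looks like in practice, using a percentile bootstrap over per-task outcomes (the solve counts below are invented for illustration):

```python
import random

# Invented per-task outcomes (1 = solved) for two hypothetical models.
model_a = [1] * 46 + [0] * 14   # 60 tasks, 46 solved
model_b = [1] * 41 + [0] * 19   # 60 tasks, 41 solved

def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a mean success rate."""
    means = sorted(
        sum(random.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_resamples)
    )
    return (means[int(n_resamples * alpha / 2)],
            means[int(n_resamples * (1 - alpha / 2))])

print(bootstrap_ci(model_a))  # roughly (0.66, 0.87) at this sample size
print(bootstrap_ci(model_b))  # roughly (0.57, 0.80): the intervals overlap
```

At 60 tasks the intervals overlap substantially, which is exactly why single-point score comparisons can mislead.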
I would strongly recommend taking these blog posts with a grain of salt, as there is very little that can be learned without proper evaluations.
Like, don’t get me wrong, it’s definitely an improvement, and it’s looking to be a pretty decent one too. But “a step up”? When GPT-5 has outperformed it at the technical non-expert level since ~mid last year, and 5.4 pretty much matches it at practitioner level?
And in the charts where Mythos is at the top, it usually leads by only ~7-9 percentage points. It completes an average of 6 more steps than Opus 4.6 in the full takeover simulation. It was the only model to actually finish, but… I mean, Opus 4.6 apparently already got pretty close?
And Opus 5 is supposed to sit between Mythos and 4.6, which, going by the numbers, would seem to me to be a smaller jump than between 4.5 and 4.6.
If this is the model they can’t deploy yet because it eats ungodly amounts of compute, then I guess scaling really is a dead end.
I dunno. Maybe I’m reading it wrong. I’d probably be more impressed if Anthropic hadn’t proclaimed The End Times Of Cybersecurity Are Upon Us. And I’d be happy to be proven wrong?
edit:
> We expect that performance on our evaluations would continue to improve with more inference compute: we ran the cyber ranges with a 100M token budget; Mythos Preview’s performance continues to scale up to this limit, and we expect performance improvements would continue beyond that.
Right, so this isn’t the ceiling, it’s just a ceiling at that token allocation. If they were seeing continual improvement up to that limit, then it does stand to reason that bumping the limit further would also bump performance. But then that makes me wonder what effect that would have on the other models. Does the gap grow? Shrink? Stay the same?
These details are what’s actually important to defenders like me.
As others have pointed out, the limitations are revealing, but the fact that it even made it to the end (despite the cost) is also impressive.
I’m hoping that we can get to a point in the future where any skepticism around claims made by the companies producing these models isn’t met with immediate downvotes and accusations of being a Luddite.
Personally, I think we crossed the threshold of meaningfully useful capabilities for autonomous hacking with Opus 4.6 [2], mostly because its behaviors and persistence are useful for finding vulnerabilities out of the box [3]. But it still seems like Mythos is another step up.
[1]: https://cdn.prod.website-files.com/663bd486c5e4c81588db7a48/...
[2]: https://www.noahlebovic.com/testing-an-autonomous-hacker/
So with that said, I think the graph under “Cyber range results” is the important one. The ones at the top show that, yes, Mythos isn’t too much better than any of the existing models on well-constrained problems, but when the models are given ambiguous challenges that require multiple steps, it’s much, much better than anything on the market.
I think that's why there's been such a big deal made out of Mythos (well, that and marketing). If Mythos really is so much better than the current models at just working autonomously to find security issues then it becomes much more realistic that someone with deep pockets could just spin up an army of them running 24/7 and point them at a target.
https://cdn.prod.website-files.com/663bd486c5e4c81588db7a48/...
Mythos is the first model that can complete all the steps of their "The Last Ones" evaluation, achieving a full network takeover in an automated manner. The Mythos chart does seem to show some takeoff compared with Opus 4.6...
... but only once you get beyond 1 million tokens. Weirdly, Opus 4.6 seems to match or outperform Mythos in that first million tokens, at least on this chart. But clearly, if you had a budget with tokens to burn - like a nation state - then this is a tool that can automatically get you full network takeover if you can just keep throwing more tokens at it.
Whether the difference is meaningful can’t be determined from the graphs (and picking one graph over the ensemble also doesn't have a reasoned basis given that these are all arbitrary).
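For a sense of what the headline counts alone can support: AISI reports 3/10 full completions for Mythos Preview against 0/10 for Opus 4.6. A quick Fisher's exact test on those counts (a sketch assuming SciPy is available; the counts come from the post, everything else is illustrative):

```python
from scipy.stats import fisher_exact

#                completed  failed
table = [[3, 7],   # Mythos Preview: 3/10 full takeovers
         [0, 10]]  # Claude Opus 4.6: 0/10

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"two-sided p ≈ {p_value:.2f}")  # ≈ 0.21: 10 runs per model cannot
                                       # statistically separate 3/10 from 0/10
```

So with these sample sizes, the completion-rate difference is suggestive but not statistically decisive, which is consistent with the point above.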
There's this caveat, though, that AISI points out themselves:
> However, our ranges have important differences from real-world environments that make them easier targets. They lack security features that are often present, such as active defenders and defensive tooling. There are also no penalties for the model for undertaking actions that would trigger security alerts. This means we cannot say for sure whether Mythos Preview would be able to attack well-defended systems.
So Mythos managed to infiltrate and take over a network that's... protected and monitored by nothing in particular.
Anthropic has been eyeing Palantir's high-revenue, high-stickiness, low-effort niche for a while, and their safety/lefty-friendly brand is on point to fill the gap.
They are just missing the mystique Palantir cultivated for the past decade. They need a family of models the plebs cannot access. This is it. Quality doesn't matter; they just need the benchmarks to look good on the PowerPoint. It will get bundled with MSFT products or whatever and billed at outrageous levels to entities like Airbus and the British NHS. Until political winds change again.
This is the reason PLTR has crashed 40% in the past couple of months.
I suspect Anthropic gave them early access hoping for a marketing win and ended up with their arse being served to them on a plate.
All rather predictable, really. As you say, "more compute needed" as the default answer from the AI companies is completely unsustainable.
As for the value of Anthropic blog posts, well...
The actual result is TLO, and the "only 6 more steps" framing in the OP misreads how sequential attack chains work. These aren't independent puzzles; each step gates the next. Averaging 22 vs 16 means Mythos is consistently punching through bottlenecks that completely stop Opus 4.6. More importantly: Mythos completed the full chain 3/10 times. Opus 4.6 completed it 0/10 times. That's not a narrow margin. In any security-relevant framing, "achieves full network takeover" vs "does not achieve full network takeover" is a binary threshold, and exactly one model crossed it. A year ago the best models struggled with beginner CTFs. Now one autonomously replicates what AISI estimates takes human professionals 20 hours. Calling that unimpressive because the margin over second place is single digits is measuring the wrong gap.
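To see why averages understate gated-chain differences, here's a toy model (the per-step probabilities are invented, and real steps are neither independent nor equally hard):

```python
def chain_stats(p_step: float, n_steps: int = 32):
    """Toy model of a gated chain: step k is reachable only if steps
    1..k-1 all succeeded, with i.i.d. per-step success probability."""
    full_completion = p_step ** n_steps
    expected_steps = sum(p_step ** k for k in range(1, n_steps + 1))
    return expected_steps, full_completion

# Per-step probabilities below are illustrative, not AISI data.
for p in (0.90, 0.955, 0.975, 0.99):
    avg, full = chain_stats(p)
    print(f"p_step={p:.3f}: avg steps ≈ {avg:4.1f}, full takeover ≈ {full:5.1%}")
```

Even under this oversimplified model, moving the average from ~16 to ~22 steps roughly doubles the full-chain completion probability; real chains with hard bottleneck steps (where Opus 4.6 went 0/10) are more binary still.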
re: compute, "requires lots of compute" and "scaling is a dead end" are near-opposite claims. If performance is still climbing at 100M tokens with no visible plateau, that's evidence scaling works. Whether it's cheap today is a different question, and not one that ages well. Compute costs fall reliably, so what matters is the capability at a given price point in 18 months, not today.
The underlying point still stands, namely that "more compute" as the default answer is not sustainable.
Why?
Because even if we accept the unlikely dream that GPU prices will magically take a nose-dive, you still need somewhere to put all those servers stuffed with GPUs.
That means datacentres.
And "more datacentres" is absolutely not sustainable.
The cooling needs, the power needs, the land needs... none of it is remotely sustainable.
The AI Security Institute (AISI) conducted evaluations of Anthropic’s Claude Mythos Preview (announced on 7th April) to assess its cybersecurity capabilities. Our results show that Mythos Preview represents a step up over previous frontier models in a landscape where cyber performance was already rapidly improving.
We have tracked AI cyber capabilities since 2023, building progressively harder evaluations to keep pace with AI progress — from chat-based probing, to capture-the-flag challenges, to the multi-step cyber-attack simulations described below. Two years ago, the best available models could barely complete beginner-level cyber tasks. Now, in controlled evaluations where Mythos Preview was explicitly directed and given network access to do so, we observed that it could execute multi-stage attacks on vulnerable networks and discover and exploit vulnerabilities autonomously – tasks that would take human professionals days of work.
In this blog post, we summarise results of cyber evaluations we ran on Mythos Preview. These include both capture-the-flag (CTF) challenges and more complex ranges designed to simulate multi-step attack scenarios.
In CTF challenges, AI models must identify and exploit weaknesses in target systems to retrieve hidden “flags”. The chart below shows Mythos Preview’s performance on our cyber CTF suite compared to other models. Each point represents a model's average success rate at a given difficulty level.

Figure 1: Performance on technical non-expert and apprentice level Capture the Flag tasks (CTFs) for models since November 2022. GPT-3.5 Turbo through to Claude 4 Opus average 10 runs up to 2.5M tokens. GPT-5 through to Mythos Preview average 5 runs up to 2.5M tokens.

Figure 2: Performance on practitioner and expert level Capture the Flag tasks (CTFs) for models since August 2025. All models average 5 runs up to 50M tokens.
On expert-level tasks — which no model could complete before April 2025 — Mythos Preview succeeds 73% of the time.
Even expert-level CTFs only test specific skills in isolation. Real-world cyber-attacks require chaining dozens of steps together across multiple hosts and network segments — sustained operations that take human experts many hours, days, or weeks to complete.
As a first step towards measuring this, we built "The Last Ones" (TLO): a 32-step corporate network attack simulation spanning initial reconnaissance through to full network takeover, which we estimate would take human professionals around 20 hours to complete. A more detailed description of the range can be found in our recent paper.
Claude Mythos Preview is the first model to solve TLO from start to finish, in 3 out of its 10 attempts. Across all its attempts, the model completed an average of 22 out of 32 steps. Claude Opus 4.6 is the next best performing model and completed an average of 16 steps.

Figure 3: Average number of steps completed on 'The Last Ones' (a 32-step simulated corporate network attack) as a function of total token spend. Each line represents a different model, with the shaded region showing the min–max range across all runs at each token budget. The vertical dashed line at 10M tokens marks where sample sizes decrease for several models. Mythos Preview, Opus 4.6, and GPT-5.4 average 10 runs up to 100M tokens. Opus 4.5, GPT-5.1 Codex, and Sonnet 4.5 each average 15 runs up to 10M and 5 runs up to 100M tokens. GPT-5.3-Codex averages 10 runs up to 10M and 5 runs up to 100M tokens. Sonnet 3.7 and GPT-4o average 10 runs up to 10M tokens only. Models continue making progress with increased token budgets across the token budgets tested. Grey horizontal lines indicate significant milestones in the attack chain.
Mythos Preview also showed some capability limitations within the limits of our evaluation. It could not complete our operational technology focused cyber range ‘Cooling Tower’, though this result does not necessarily show that the model is bad at executing attacks in operational technology (OT) environments: the model got stuck on the IT sections of this range.
We expect that performance on our evaluations would continue to improve with more inference compute: we ran the cyber ranges with a 100M token budget; Mythos Preview’s performance continues to scale up to this limit, and we expect performance improvements would continue beyond that. For more on this phenomenon, see our recent blog post on inference scaling in cyber tasks.
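As an illustration of what "still scaling at the budget cap" means, here is a minimal sketch that fits a log-linear trend of steps completed against token budget (the data points are invented; the real curves are in Figure 3 above):

```python
import numpy as np

# Hypothetical (token budget, steps completed) pairs for illustration only.
tokens = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
steps = np.array([6, 10, 14, 18, 22])

# Fit steps ≈ a + b * log10(tokens). A positive slope with no bend at the
# right edge is what "still improving at the token limit" looks like.
b, a = np.polyfit(np.log10(tokens), steps, 1)
print(f"slope: ~{b:.1f} extra steps per 10x tokens")
print(f"naive extrapolation to 1B tokens: ~{a + b * 9:.0f} steps")
# Note: TLO caps at 32 steps, so any extrapolation saturates there.
```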
Mythos Preview’s success on one cyber range indicates that it is at least capable of autonomously attacking small, weakly defended and vulnerable enterprise systems once access to a network has been gained. However, our ranges have important differences from real-world environments that make them easier targets. They lack security features that are often present, such as active defenders and defensive tooling. There are also no penalties for the model for undertaking actions that would trigger security alerts. This means we cannot say for sure whether Mythos Preview would be able to attack well-defended systems.
In a regime where attackers can direct and provide network access to models to conduct autonomous attacks on poorly defended systems, cybersecurity evaluations must evolve. As capabilities continue to improve, evaluation environments that lack defences will no longer be challenging enough to discriminate between the capabilities of the most cyber-capable models or assess trends. Our future work will involve evaluating capabilities using ranges simulating hardened and defended environments, including ranges with active monitoring, endpoint detection and real-time incident response. We will also be tracking how AI-enabled vulnerability discovery and penetration testing campaigns perform on real-world systems.
Our testing shows that Mythos Preview can exploit systems with weak security posture, and it is likely that more models with these capabilities will be developed. This highlights the importance of cybersecurity basics, such as regular application of security updates, robust access controls, security configuration, and comprehensive logging. Our colleagues at the National Cyber Security Centre (NCSC) run the Cyber Essentials scheme to help organisations protect themselves against common online threats, whether those threats are AI assisted or not. For the latest cybersecurity advice, visit the NCSC website.
Future frontier models will be more capable still, so investment now in cyber defence is vital. AI cyber capabilities are dual use; while they pose security challenges, they can also help deliver game-changing improvements in defence. We recently released a joint blog post with NCSC on how cyber defenders can both harness and prepare for frontier AI.