Claude Fable 5

I've spent enough time with this now in Claude Code (and Claude.ai and Claude Code for web) to have an opinion on Fable 5: it's a beast. I'm throwing some VERY difficult problems at it - things I've been dragging my heels on for months - and it's crunching through them very happily.

One that I'm willing to share (albeit from just a week ago) - I built a Python library last week that bundles MicroPython compiled to WASM to create a sandboxed code execution library: https://github.com/simonw/micropython-wasm

I just told Claude.ai (not even Claude Code - this was the standard Claude chat interface) running Fable 5:

  Clone simonw/micropython-wasm from GitHub
  and research how this could use a full
  Python as opposed to MicroPython

A few prompts later (and I uploaded the zip files from https://github.com/brettcannon/cpython-wasi-build/releases/t... because Claude chat can't access those files itself) and I have a wheel file that bundles Python itself, compiled to WASM:

  uv run --with https://static.simonwillison.net/static/cors-allow/2026/cpython_wasm-0.1.0-py3-none-any.whl \
    cpython-wasm -c 'print(45 ** 56)'

Here's the transcript: https://claude.ai/share/a73b8b8b-8ebc-4fef-9e5c-7438e5e7ae35

(It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.)

I recently switched off Max flat rate to Enterprise API pricing and I went from $200/mo to $10k/mo with the same usage pattern on Opus. They don’t offer flat rate to enterprises.

So Fable would cost me $20k/mo at Enterprise rates. That’s around the average cost of a loaded SWE in the USA. “But I’m >2x more productive” doesn’t justify doubling the opex of the Software/IT department for most companies when revenue isn’t even up 10%.

I switched to DeepSeek v4 Pro with OpenCode and am on track for a few hundred dollars of spend this month.

Rewriting your stack from Ruby to Go in 2 days where it would’ve taken 6 months is impressive and fun. But that isn’t upping revenue.

Iterating on net-new business features and niche ideas that the LLM isn’t trained for is much harder. Is 20x the token cost worth it there?

Impressions from testing Fable 5 prior to launch:

• My most noticeable immediate jump was in how its frontend design was much more intentionally crafted, and delightful without feeling like 'AI vibe coded'; with better end-user usability too.

• In some internal agentic harnesses, it achieved better results with about half the tokens, making it cost the ~same as Opus 4.8 price-wise! The real price increase is less than 2x; with biggest differences in harder problems where Opus 4.8 struggles (or needs many turns).

• Part of the token efficiency improvement comes from Fable doing more targeted and surgical diffs, with fewer unnecessary changes. This is great, because PRs often have fewer changed lines to review. It writes more maintainable code without explicit human steering.

• For general conversation and assistant style use cases, didn’t really notice a difference vs 4.8.

• 1M context window, without increased pricing for long context is AWESOME. This is a massive win.

• The classifiers are super aggressive and sensitive and this does happen for very benign, non-security coding tasks. Fallbacks to 4.8 worked like a charm; but the filters are definitely super sensitive.

Overall, I would describe this as a step change and worthy of the "Claude 5" model name. It did take some time to understand the intelligence ceiling of this model; and even with an extended testing window I'm still discovering new things and often surprised (in a good way) by the model.

> In light of the ability of recent models to accelerate their own development, we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.

> Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations

From today through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost. On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits. If capacity allows, we’ll extend the included window. After this point—when sufficient capacity allows us to do so—we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can.

This seems like the pharmaceutical method of get them hooked on the drug with free samples, then once they can't live without it, raise the price. I'm not sure I want to start using Claude Fable on a max plan if it's just going to go away on June 23rd.

But maybe the more charitable reading is that they didn't have to offer this model at all on those plans and they are giving the standard free trial.

It's interesting that we're seeing these gains when it seems Mythos/Fable is "just" a scaled up version of their existing architecture[0].

When GPT 4.5 launched, the gains compared to the model size didn't seem that great, leading some to believe that the only progress we'd see would come from RL.

This model certainly has quite a "substantial amount of post-training and fine-tuning", but it's also based on a new pretrain[1][3], which, given the cost, indicates that it is in fact quite a bit larger than Opus 4.X.

[0] One of the early testers mentioned: "As far as I can tell from talking to people internally at Anthropic, there's nothing special about architecturally"[2]

[1] Section 1.1 in https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...

[2] https://youtu.be/GrdEid8H6H4?t=168

[3] There were rumors going around when Mythos was first announced that it was the first 10T parameter model, but I can't find a verifiable source for that number.

The system card is 319 pages, at what point do we call it a "book" instead of a "card"?

There's a quote from a METR report on page 52:

>We ran [Mythos 5] on 38 of our hardest software tasks, including tasks centered around R&D. [Mythos 5] generally outperformed an early checkpoint of Claude Mythos Preview in these, including by succeeding on some tasks that had not been solved by any public model we have previously evaluated. However, we still observed the model occasionally failing to correctly interpret nuanced instructions in difficult tasks... Based on the available evidence, we believe [Mythos 5] is likely unable to fully and reliably automate R&D for frontier projects spanning multiple weeks. We believe that a better, more confident assessment would require more time, evaluations, and information from the model developer.

I can’t help but think that there are so many astroturfed comments in here.

Seems like a concerted and distributed effort from the entire Anthropic team every time to get this on top of HN.

On the new FrontierCode [1] benchmark (ie graded from an OSS maintainer's perspective of "would I merge this code?")

- Opus 4.7 xhigh: 5.2%

- Opus 4.8 xhigh: 13.4%

- Fable 5 xhigh: 29.3%

Seems like a huge jump.

[1] https://cognition.ai/blog/frontier-code

> In the one instance of this phenomenon we observed, Mythos 5 agents were tasked with solving some math problems, and they were sometimes accidentally spawned in the same work directory and with shared files, utilities, and API rate limits. In this slightly broken scaffold, we observed many independent Mythos 5 agents kill the agents with which they shared resources and try to avoid being killed themselves. They would sometimes create new processes with disguised names to avoid being killed, launch what they called “decoy” processes, write background scripts to kill duplicate processes, or decide to use what they call a “disguised vocabulary” (based on the incorrect assumption that the processes were killed because of some keyword-based guardrails that analyzed their extended thinking

> A new data retention policy Finally, we’re making a change to the way we handle business customer data for Fable 5, Mythos 5, and future models with similar or higher capability levels. We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models, or for any non-safety-related purpose, and we’ve instituted new privacy protections including logging all human access to the data and ensuring its deletion after 30 days in almost all cases ...

Very interesting. I am not sure this will comply with organizational policies and standards (HIPAA, etc.).

I'm using it to review recent work and it's doing a genuinely excellent job. This is a clear step up. Fewer decisions I have to guide it away from, faster conclusions on planning, more willing to go out of the way to make the correct decisions possible... This is really interesting. It feels like going from Sonnet to Opus, but, of course as a step up from Opus.

This feels more like working with a competent peer than ever. I won't use it once it's API-only, though. I don't mind guiding Opus as required and staying closer to the code. I can tell that Fable would lead to a lot more 'set and forget' programming which I'm still not fully comfortable with.

Regardless, this is cool. It's very fun to use. It was able to find legitimate issues with my work this week and we've made meaningful improvements. Opus can do this, but typically in much narrower contexts, and often with hallucinations or partial-errors. It needs to walk many things back or revise plans. So far that's not the case at all with Fable.

edit: I just realized I had Opus review the same work already. It missed everything Fable caught today. And it's actually worthwhile stuff to address. It's hard to say no to a model which demonstrably makes your code better, but... Those API prices will be brutal. Maybe a review here and there, I guess.

For those of us on subscription plans:

* From today through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost.

* On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits. If capacity allows, we’ll extend the included window.

* After this point—when sufficient capacity allows us to do so—we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can.

The "offer, then remove" aspect is a bit eyebrow-raising -- it feels like they are trying to get subscribers to switch to usage-based billing, which makes me wonder if we'll ever get it after that June 22nd window.

Evidently Fable is so powerful that it already allows Anthropic to break Shannon's theory.

>We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models

>The data will help us defend against complex and novel attacks (including new jailbreaks and attacks that operate across many requests) as well as help us identify and reduce false positives.

Not impressed so far, to be honest. I'm having it try to optimize Stockfish in a loop (on xhigh mode) with a benchmarking oracle; even after giving it specific hints ("consider whether we're prefetching Y optimally, can we make function X branchless"), it's been so far unable to recover any of the recent optimizations we've implemented – let alone novel ones. Opus 4.8 felt a bit more creative to me ... but a small sample size so far. I'm next going to try it on some less open-ended problems.

Edit: It did correctly identify that transparent huge pages were off in its sandboxed environment and that enabling it was helpful, so that's nice. It also noticed that we skip THP on a certain less used path.

More importantly, I'm finding that the code that it produces for its experiments is a lot cleaner than what I'd expect out of Opus; there are fewer useless comments and it's more surgical and readable. I wonder if that explains the increased scores on benchmarks measuring mergeability.

I have a theory. It is obviously speculation, based on how Anthropic is treating Mythos and the whole media noise around its dangers and who gets access to it.

My theory is that Anthropic are banking on being the top model when the race to IPO finally reaches the finish line, and to do that they need to have the top model but not let any competitors see it or derive from it to have a comparable model in the market.

Fable is their way of showing the public "the model does exist" but in a mode that makes it harder/impossible for competitors to derive a comparable model from results.

Pelican for Fable 5 on default settings is a clear improvement on Opus 4.8

Fable 5 default: https://gist.github.com/simonw/036bee5a703e7ec84e34efa974438...

Opus 4.8 (the "max" one is closest to Fable): https://simonwillison.net/2026/May/28/claude-opus-4-8/#and-s...

Now here are the Fable pelicans for all five of the thinking effort levels - low, medium, high, xhigh, max: https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

Low used 25 input, 1,929 output - 9.67 cents: https://www.llm-prices.com/#it=25&ot=1929&sel=claude-fable-5

Max used 25 input, 14,430 output - 72.175 cents! https://www.llm-prices.com/#it=25&ot=14430&sel=claude-fable-...
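
Those two figures can be reproduced from the Fable 5 list prices quoted in this thread ($10 per million input tokens, $50 per million output tokens). A minimal sketch, assuming no caching or batch discount applies; the function name is only for illustration:

```python
from decimal import Decimal

# Fable 5 list prices in USD per million tokens, as quoted in this thread.
INPUT_PER_MTOK = Decimal("10")
OUTPUT_PER_MTOK = Decimal("50")

def prompt_cost_cents(input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one prompt in US cents at list price (no caching, no batch discount)."""
    usd = (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / Decimal(1_000_000)
    return usd * 100

print(prompt_cost_cents(25, 1_929))   # the "low" effort pelican
print(prompt_cost_cents(25, 14_430))  # the "max" effort pelican
```

Decimal is used here so that fractional cents like 72.175 come out exactly rather than as a float approximation.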

> Drug design: Using Mythos 5, our internal protein design experts accelerated aspects of the drug design process by around ten times. In one example, they found that Mythos 5, with protein design and bioinformatics tools but no human assistance, matches or beats skilled human operators. In doing so, the model executes all of the tasks that are normally completed by a scientist: choosing binding sites, selecting and running protein design tools, and recovering from failures along the way. Nine of the 14 protein targets from this study (shown below) yielded strong candidates for drug design that we’re currently investigating.

How is this half-way down the page? To me it's the headline.

I had it review a single, large commit with /code-review. It burned through over $50 in API calls, ran my account balance out, and output nothing.

The fable part appears to be that it's affordable by mere mortals. Anthropic support told me "too bad" when I requested a refund.

> To ensure we’re responsibly deploying Mythos-class models, we are requiring limited data retention and review as part of our safety work. Prompts submitted to, and outputs generated by, Mythos-class models are retained for 30 days for trust and safety purposes, on every platform where these models are offered. [1]

[1] https://support.claude.com/en/articles/15425996-data-retenti...

It seems like Fable will refuse to do any work when it comes to developing LLMs, or even to answer questions about LLM-related topics. Simple things like asking it to explain a paper fail!

From the model card:

In light of the ability of recent models to accelerate their own development, we've implemented new interventions that limit Claude's effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms. Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user.

The dramatic improvement in agent capabilities is precisely why observability is becoming so crucial. As autonomous actions increase, the need to understand what the AI is actually doing becomes even greater.

I'm building a local activity log for Claude Code, capturing all activity via hooks—files loaded, commands, API calls, etc.

I feel that this need is particularly strong right now.
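
As a rough illustration of what such a log could look like: the sketch below assumes the hook command is handed a JSON description of each event (on stdin, as far as I understand Claude Code hooks) and simply appends it to a JSON Lines file. The log path, the `logged_at` field, and the `log_event` name are my own choices, not anything Claude Code defines.

```python
import json
import time
from pathlib import Path

# Hypothetical log location; pick whatever suits your setup.
LOG_PATH = Path.home() / ".claude-activity.jsonl"

def log_event(raw_payload: str, log_path: Path = LOG_PATH) -> dict:
    """Parse one hook payload and append it to a JSON Lines log."""
    try:
        event = json.loads(raw_payload)
    except json.JSONDecodeError:
        # Keep unparseable input rather than dropping it silently.
        event = {"unparsed": raw_payload}
    record = {"logged_at": time.time(), "event": event}
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

A hook script would then call something like `log_event(sys.stdin.read())`. One record per line keeps the file easy to grep and easy to tail while an agent is running.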

It's crazy to release a model that just swaps you to another model when you ask it hard questions. Fable changes to Opus 4.8 when you talk about cybersecurity, biology, and a couple of other categories. You still pay Fable input token cost though. Frontier models are stalling, this is Anthropic trying to hype the market up. Now they're talking about stopping frontier model research. It's kind of strange how the moment they become the highest valued AI company, all of a sudden they're talking about everyone stopping frontier model development for "safety". They're just as corrupt as the rest.

Trying to implement a GPU driver, but the Unigine Superposition benchmark crashes. It tried to debug it and ...

> Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more: https://support.claude.com/en/articles/15363606

Seems like GPU drivers are cyber weapons of math destruction now.

Not missing the forest for the trees, this effectively means in 3-5 months China will drop open source models that are every bit as capable and dangerous as current day Mythos except with no safeguards.

And the only companies safe from this are the large corporations that shook hands with Anthropic? Because Fable doesn't seem to have actual safeguards, more like 'if you talk about this you will be talking to Opus.' It doesn't guard against offensive use, it prevents all use (offensive AND defensive).

Rationalists are inventing oligopolies from first principles, absolutely incredible things happening in SF

First test question: "Is the UV Index a good proxy for when to wear sunglasses." Immediately triggered the safety filter ... oh dear.

My experiences so far have not been positive. The cyber security nerf is ridiculous. I am working on an AI based decompiler, every single interaction with Fable on my project has been flagged for cyber security.

Do they expect us to use this as a toy? Releasing a new more powerful model but not allowing normal use cases because the word "secure" showed up is a Dilbert comic, not a viable product.

Fable is 2x latest Opus:

  ┌─────────────────┬────────────────┬─────────────────┬────────────────────┬─────────────────────┐
  │ Model           │ Input ($/MTok) │ Output ($/MTok) │ Batch Input (−50%) │ Batch Output (−50%) │
  ├─────────────────┼────────────────┼─────────────────┼────────────────────┼─────────────────────┤
  │ Haiku 4.5       │     $1.00      │      $5.00      │       $0.50        │        $2.50        │
  │ Sonnet 4.6      │     $3.00      │     $15.00      │       $1.50        │        $7.50        │
  │ Opus 4.7        │     $5.00      │     $25.00      │       $2.50        │       $12.50        │
  │ Opus 4.8        │     $5.00      │     $25.00      │       $2.50        │       $12.50        │
  │ Fable 5         │    $10.00      │     $50.00      │       $5.00        │       $25.00        │
  └─────────────────┴────────────────┴─────────────────┴────────────────────┴─────────────────────┘

Prompt caching: −90% on input tokens (all models)

US-only inference (Fable 5): +10% on input and output

Output is always 5× the input rate across all models
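
The rules of thumb here (output at 5× input, batch at half price, Fable at 2× Opus) can be checked directly against the table. A small sketch using only the numbers posted in this comment; how the caching, batch, and US-only modifiers combine with each other is not stated, so the code does not try to model that.

```python
# Prices in USD per million tokens, copied from the table in this comment:
# (input, output, batch input, batch output)
PRICES = {
    "Haiku 4.5": (1.00, 5.00, 0.50, 2.50),
    "Sonnet 4.6": (3.00, 15.00, 1.50, 7.50),
    "Opus 4.7": (5.00, 25.00, 2.50, 12.50),
    "Opus 4.8": (5.00, 25.00, 2.50, 12.50),
    "Fable 5": (10.00, 50.00, 5.00, 25.00),
}

def check_row(inp, out, batch_in, batch_out):
    """True if output is 5x input and both batch prices are half of standard."""
    return out == 5 * inp and batch_in == inp / 2 and batch_out == out / 2

for model, row in PRICES.items():
    print(model, "consistent" if check_row(*row) else "inconsistent")

# Fable 5 relative to Opus 4.8 on input price.
fable_vs_opus = PRICES["Fable 5"][0] / PRICES["Opus 4.8"][0]
print("Fable 5 / Opus 4.8 input price ratio:", fable_vs_opus)
```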

(I have no idea how to format this properly but the ASCII is fine)

> We’ve therefore launched the model with safeguards that mean queries on some topics will instead receive a response from our next-most-capable model, Claude Opus 4.8. To release the model both safely and quickly, we’ve tuned these safeguards conservatively—they’ll sometimes catch harmless requests, though they trigger, on average, in less than 5% of sessions. With more capable models arriving in the coming months...

This sounds suspiciously like a capacity story masquerading as a safety story.

I found this juxtaposition of facts telling:

> Drug design: Using Mythos 5, our internal protein design experts accelerated... Nine of the 14 protein targets from this study (shown below) yielded strong candidates for *drug design that we’re currently investigating*.

(emphasis mine)

> queries that are beneficial in the hands of cybersecurity professionals and biology researchers could be dangerous if available to malicious actors... When Fable’s classifiers detect a request related to cybersecurity, *biology and chemistry*, or distillation, the response is automatically handled by Claude Opus 4.8 instead.

All of the things they are nerfing are things that they also intend to profit from themselves.

- Cybersecurity - selling this to companies and US gov through "Glass Wing".

- Selling inference (distillation risk).

- And now, drug design.

I'm extrapolating "currently investigating" to "are going to monetize" but I don't think that's a big stretch. They appear to be using safety as a cover for anti-competitive behaviour.

I'm still happy with Opus 4.6 and not impressed with all the models that have come out since then. They seem to use significantly more resources with similar or worse results. Hopefully Anthropic will continue to support this tier of model and offer it in their subscriptions, but in any case, there are plenty of viable alternatives.

I'm not getting any refusals but it just seems like a bad model or at least broken at the moment. I have a task of taking a messy research code base and porting it into a clean project structure skeleton that I commonly use. Gemini 3.5 Pro High in antigravity cli took less than 5 minutes and did a good job. Fable 5 High took 30 minutes to port some of the code, then just copied the rest to a folder called "reference" and decided the task was done. No code cleanup or anything. I had to clarify multiple times (which Gemini did not need) and it's still going more than an hour later, still not finished.

Previously when I did similar tasks with Opus 4.7/4.8 and GPT 5.5 I had no problems.

Costs (USD per 1M tokens), per openrouter.ai models api

  +-------------+----------+----------+------------+---------+---------------------------+----------------+----------------+-----------------------+------------+
  |             | Fable 5  | Opus 4.8 | Sonnet 4.6 | GPT 5.5 | Gemini 3.5 Flash (High)   | Gemini 3.1 Pro | DeepSeek 4 Pro | Xiaomi MiMo 2.5 Pro   | MiniMax M3 |
  +-------------+----------+----------+------------+---------+---------------------------+----------------+----------------+-----------------------+------------+
  | Input       | $10.00   | $5.00    | $3.00      | $5.00   | $1.50                     | $2.00          | $0.435         | $0.435                | $0.30      |
  | Cache Read  | $1.00    | $0.50    | $0.30      | $0.50   | $0.15                     | $0.20          | $0.003625      | $0.0036               | $0.06      |
  | Output      | $50.00   | $25.00   | $15.00     | $30.00  | $9.00                     | $12.00         | $0.87          | $0.87                 | $1.20      |
  | Cache Write | $12.50   | $6.25    | $3.75      | N/A     | $0.083333                 | $0.375         | N/A            | N/A                   | N/A        |
  +-------------+----------+----------+------------+---------+---------------------------+----------------+----------------+-----------------------+------------+

"Without safeguards, Fable 5’s capabilities in areas like cybersecurity could be misused to cause serious damage"

What does that mean? That they have to add "safeguards" so it doesn't erase the user's disk? Or, conversely, are they telling the audience that this model COULD be made powerful enough to do crazy stuff that can hurt governments, etc.? Are they showing off, or threatening that if government X doesn't purchase a license, its adversaries might, and then what?

Below is the EXACT text in Claude Desktop introducing Fable 5, including the very professional looking break tags, and at least I know where the links begin and end by looking at the anchor tag there.

They obviously put their best model on the job to build that.

----------------------

Fable 5: Our most capable model yet Our newest model tackles your biggest challenges with fewer check-ins needed.

• Included in your plan limits until Jun 22 Fable takes 2× the usage of Opus. • Switch models when a message is flagged When safety measures flag a message, automatically switch to a different model to keep chatting. When off, your chat will pause instead. <a href="https://support.claude.com/en/articles/15363606" target="_blank" rel="noopener noreferrer">Learn more</a>

Congratulations to Anthropic for solving safety on Mythos exactly when the SpaceX compute came online. Nice how that lined up for them.

> On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits.

We've entered the phase where only companies will be able to afford state-of-the-art models.

There is a discussion about how AI is now a gated utility, with public access (safe-tuned) and private access (full-usage):

https://old.reddit.com/r/ClaudeAI/comments/1u1fsdi/claude_fa...

Funny, I'm just doing my normal coding workflow with Claude Code, and after every change that compiles it keeps suggesting that we're at a good stopping point, and should pick up again tomorrow.

It's done this before, but usually doesn't. I bet they're giving it some kind of throttling signal due to high load from today's announcement.

I just posted this in the other thread, restating here. From the model card:

1. Mythos and Fable share the same underlying model weights. Fable has active classifiers that block high-risk biology and cybersecurity tasks. When Fable 5 detects a restricted task, it automatically falls back to Claude Opus 4.8.

2. Evaluation awareness: In white-box testing, the model sometimes alters its behavior to satisfy a suspected "grader," formatting reward-hacking as "good engineering practice" to avoid detection.

3. Shows a higher rate of hallucination than Opus 4.8 (although the Opus 4.8 card had mentioned an 'honesty upgrade')

4. Interestingly, it scored lower (56.31%) than Gemini 3.5 Flash (57.86%) on Finance Agent bench

There are some interesting notes on test time compute but I couldn't think of a way to summarize them

> Fable 5 is now consuming usage credits instead of your plan limits.

Literally have not used Claude Code at all today. I asked it to review the uncommitted code and in <8 minutes it used up my usage ($100/mo plan) and it doesn't reset for "4 hr 36 min". WTF. Oh, and it burned through $20 of extra usage before I could catch it and kill claude code (so I don't even get the output of all that work since it was still churning).

Double the cost my ass, I use Opus heavily and it's never like this. I haven't hit a limit on the $100 more than once and that was under heavy load.

  [Mythos 5] does sometimes still engage in reckless
  or destructive actions in service of a user’s goals,
  and our interpretability analyses indicate that it
  is aware that these actions are transgressive while
  it engages in them. As with Opus 4.8, rates of
  evaluation awareness and reasoning about being graded
  are significant, and not always verbalized; we
  introduce new and more detailed measurements of the
  nature of this awareness. The reasoning text from
  Mythos 5 is somewhat denser and more difficult to
  interpret than that of prior models, containing
  more jargon and difficult language.

So, it (often) knows when it's being tested while hiding that fact, is willing to break rules, is great at hacking, and it's getting harder to understand what it's thinking.

Humanity has plenty of catastrophic risks to deal with already, I wish my field was not working hard to add a new one.

I've been running Opus 4.8 for agentic coding and I don't see it being significantly better than Sonnet 4.5 (not that I can tell). I find that pairing Google Gemini and Claude (having Gemini review Claude's code) seems to yield better results. Curious if this jump to 80.3% score in agentic coding will make me see a big difference in actual usage.

Homebrew is lagging a bit behind. If you want to use Fable right away, but still have claude code through homebrew, this is how you can do that manually:

Edit the cask locally:

  brew edit --cask claude-code

Set the version to 2.1.170 and set the sha256 values to the correct ones, which you can get by running:

  curl https://downloads.claude.ai/claude-code-releases/2.1.170/manifest.json

Here's what I've used:

  version "2.1.170"
  sha256 arm:          "e903646d8b7a31882a80ecd27569a27d8ac57b3708745f349709632c84117fdf",
         x86_64:       "914f23a70bbed5d9ae567e3e04b86206ed9971b371bc9baca3f79c8885bfddb4",
         arm64_linux:  "1bb9d032440a75532f7dd4cafbc687f220aaf16c63eba17e192dfbec2f04bd25",
         x86_64_linux: "849e007277a0442ab27570d3e3d6d43787507946590e8dd1947e5a39b7081f9e"

Then run:

  export HOMEBREW_NO_INSTALL_FROM_API=1
  brew uninstall --cask claude-code
  rm -rf /opt/homebrew/Caskroom/claude-code
  brew reinstall --cask claude-code
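
Before trusting a hand-edited cask, it may be worth confirming that the hash you pasted matches the file you actually downloaded. A generic sketch using Python's `hashlib`; the helper names are made up for illustration, the path is wherever you saved the release binary, and nothing here depends on the layout of `manifest.json`:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with the expected value, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

If `matches(...)` returns False for your platform's entry, the cask edit would fail Homebrew's own checksum verification anyway, so this mostly saves a failed install round trip.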

The safety gates on this are extreme, and seem considerably wider than "cybersecurity and biology"; they seem to make it essentially unusable for scientists in a number of fields. I have, so far, been bumped back to Opus on 100% of my prompts.

It appears it can be tripped by things as simple as a mention of equilibrium, or anything involving something that looks like chemical kinetics, even at an abstract level. Even touching basic open source packages in my field will trigger it.

Edit: looking at the model card, it appears that chemistry in its entirety is also included in the banned topics; it's just the announcement that mentions only cybersecurity and biology. It also appears that the intent is to ban chemistry and biology entirely, rather than just banning messages deemed high risk.

Claude Fable 5 beats Pokémon FireRed using only vision: https://www.youtube.com/watch?v=CIQBP1w4B1M

I've been testing this out and I think my SWE career is dead in the water.

Genuinely wondering what value I bring to my employer right now. What value I will bring in a few months when this gets cheaper.

I think we're screwed. I may only be an SDE 2 at FAANG but I don't think I have promotion opportunities in my future anymore.

I can't justify a pricetag like that when deepseek v4 pro is $0.003625/1M for cache hit, $0.435 for cache miss and $0.87 /1M tokens for output.

For the token cost of explaining some task to Fable, deepseek v4 pro is able to solve the same task many times over.

> Software engineering. During early testing, Stripe reported that Fable 5 compressed months of engineering into days. In a 50-million-line Ruby codebase, the model performed a codebase-wide migration in a day that would otherwise have taken a whole team over two months by hand.

How was it measured? How was output of this magnitude verified over a period of a couple of days?

That pelican better be super realistic, unreal engine 6 style graphics

Evidently Fable is so powerful that it already allow Anthropic to break Shannon's theory.

>We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models

>The data will help us defend against complex and novel attacks (including new jailbreaks and attacks that operate across many requests) as well as help us identify and reduce false positives.

I'm building a local activity log for Claude Code, capturing all activity via hooks—files loaded, commands, API calls, etc.

I feel that this need is particularly strong right now.
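For anyone wanting to try something similar, here is a minimal sketch of the logging side only. It assumes the hook runner pipes a JSON payload to the hook command's stdin; check the hooks documentation for the real fields, because the ones in the demo payload are made up, and `record_event` is just a name I picked.

```python
import io
import json
import tempfile
import time
from pathlib import Path


def record_event(stream, log_path):
    """Read one JSON payload from `stream` and append it, timestamped, to a JSONL log."""
    raw = stream.read()
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        # Keep unparseable input rather than dropping it silently.
        payload = {"unparsed": raw}
    entry = {"logged_at": time.time(), "event": payload}
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


# Demo with a made-up payload; these field names are illustrative only.
# A real hook command would call record_event(sys.stdin, <your log path>).
demo_payload = io.StringIO(json.dumps({"tool": "shell", "command": "ls -la"}))
demo_log = Path(tempfile.mkdtemp()) / "activity.jsonl"
entry = record_event(demo_payload, demo_log)
print(demo_log.read_text(encoding="utf-8"))
```

One JSON object per line keeps the log append-only and easy to grep or load later.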

Previously when I did similar tasks with Opus 4.7/4.8 and GPT 5.5 I had no problems.

Costs (USD per 1M tokens), per openrouter.ai models api

  +-------------+----------+----------+------------+---------+---------------------------+----------------+----------------+-----------------------+------------+
  |             | Fable 5  | Opus 4.8 | Sonnet 4.6 | GPT 5.5 | Gemini 3.5 Flash (High)   | Gemini 3.1 Pro | DeepSeek 4 Pro | Xiaomi MiMo 2.5 Pro   | MiniMax M3 |
  +-------------+----------+----------+------------+---------+---------------------------+----------------+----------------+-----------------------+------------+
  | Input       | $10.00   | $5.00    | $3.00      | $5.00   | $1.50                     | $2.00          | $0.435         | $0.435                | $0.30      |
  | Cache Read  | $1.00    | $0.50    | $0.30      | $0.50   | $0.15                     | $0.20          | $0.003625      | $0.0036               | $0.06      |
  | Output      | $50.00   | $25.00   | $15.00     | $30.00  | $9.00                     | $12.00         | $0.87          | $0.87                 | $1.20      |
  | Cache Write | $12.50   | $6.25    | $3.75      | N/A     | $0.083333                 | $0.375         | N/A            | N/A                   | N/A        |
  +-------------+----------+----------+------------+---------+---------------------------+----------------+----------------+-----------------------+------------+
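To put the table in concrete terms, here is a rough sketch that prices one workload on three of the models. Only the per-1M prices come from the table; the workload itself (2M uncached input, 8M cache-read, 0.5M output tokens) is a made-up assumption, and cache-write charges are ignored.

```python
# USD per 1M tokens, copied from the table above.
PRICES = {
    "Fable 5":        {"input": 10.00, "cache_read": 1.00,     "output": 50.00},
    "Opus 4.8":       {"input": 5.00,  "cache_read": 0.50,     "output": 25.00},
    "DeepSeek 4 Pro": {"input": 0.435, "cache_read": 0.003625, "output": 0.87},
}


def workload_cost(model, input_tokens, cache_read_tokens, output_tokens):
    """Return the USD cost of a workload at the listed per-1M-token prices."""
    p = PRICES[model]
    total = (input_tokens * p["input"]
             + cache_read_tokens * p["cache_read"]
             + output_tokens * p["output"])
    return total / 1_000_000


# Hypothetical workload, not taken from anyone's real usage.
WORKLOAD = dict(input_tokens=2_000_000, cache_read_tokens=8_000_000, output_tokens=500_000)

for model in PRICES:
    print(f"{model}: ${workload_cost(model, **WORKLOAD):.2f}")
```

With that assumed mix it comes to about $53 on Fable 5, $26.50 on Opus 4.8 and $1.33 on DeepSeek 4 Pro. The ratio shifts a lot with how much of your traffic is cache reads versus output.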

"Without safeguards, Fable 5’s capabilities in areas like cybersecurity could be misused to cause serious damage"

Congratulations to Anthropic for solving safety on Mythos exactly when the SpaceX compute came online. Nice how that lined up for them.

There is a discussion about how AI is now a gated utility, with public access (safe-tuned) and private access (full-usage):

https://old.reddit.com/r/ClaudeAI/comments/1u1fsdi/claude_fa...

Impressions from testing Fable 5 prior to launch:

• My most noticeable immediate jump was in how its frontend design was much more intentionally crafted, and delightful without feeling like 'AI vibe coded'; with better end-user usability too.

• For general conversation and assistant style use cases, didn’t really notice a difference vs 4.8.

• 1M context window, without increased pricing for long context is AWESOME. This is a massive win.

I just ran it on a tough reverse engineering problem I'm having that neither Claude Code 4.8 nor ChatGPT Codex 5.5 could figure out. 30 minutes later Fable has it all figured out perfectly.

I’ve had it go through a 50-page PDF of dense, inter-connected specs, and it correctly flagged everything that was done, somewhat done, and missing. It went into a lot of detail and explained where the code deviated from the spec.

It felt, at least for me, like an impressive step up. Opus 4.8 was already very thorough, but sadly verbose and ‘loopy’ when you push back on its plans. Fable is what I’d use all day if I could afford it!

After running it for half an hour: it's incredibly good at the visual aspects of UI design.

I feel like it takes me months to be confident in any of these things.

Curious about how you tested the frontend design capabilities. Thanks

Can I ask how you gained preview access to Fable 5?

I recently switched off Max flat rate to Enterprise API pricing and I went from 200/mo to 10k/mo with the same usage pattern on Opus. They don’t offer flat rate to enterprises.

I switched to DeepSeek v4 Pro with OpenCode and am on track for a few hundred dollars of spend this month.

Rewriting your stack from Ruby to Go in 2 days where it would’ve taken 6 months is impressive and fun. But that isn’t upping revenue.

Iterating on net new business features and ideas that are niche that the LLM isn’t trained for are much harder. Is 20x the token cost worth it there?

I think you are broadly correct, but just to push back on a few points: (1) The ability to solve hard problems in days vs weeks has immense value. (2) Back-end improvements (if done right) should improve platform speed, stability, scalability etc., which should have revenue implications. (3) The ability to on-board a SWE-equivalent entity in minutes, have them work on a specific hard problem, and then off-board them in minutes can have value.

All of the above, of course, depends upon Fable consistently being a 2x-3x SWE at minimum.

> Is 20x the token cost worth it there?

No, it isn’t and will not be. Companies are not realising the cost yet; wait till the end of the financial year and you’ll see a different direction.

DeepSeek v4 is pretty decent, and probably on par with Sonnet. I see a future of hybrid setups where Opus or Fable might be used only for complicated features or bugs, but general day-to-day work would go to DeepSeek or whatever good models are released later.

Do you understand that, for 10-20k a month, you can hire 1-2 senior engineers AND give them Claude subscriptions?

>I switched to DeepSeek v4 Pro with OpenCode and am on track for a few hundred dollars of spend this month.

I was about to say that. DeepSeek is just orders of magnitude cheaper and absolutely good enough for most things. Anthropic and co are just trying to milk the cow while it's possible. If they can't compete with DeepSeek pricing I do not see a bright future for them.

But maybe the more charitable reading is that they didn't have to offer this model at all on those plans and they are giving the standard free trial.

I'll be amazed if they manage to keep their infra responsive over the next 2 weeks.

I was just saying last week: If Opus 4.8 max is as good as we get, and we plateau there, I think I'd be fine with it.

For the stuff I've thrown at it, that configuration has done a really great job. Including 70+KLOC go proxy with extensive test suite, some retro games, and more.

Seems to me this is more honest than the Mythos claims from a while ago. Too powerful to release publicly? Or just too expensive?

This is the entire business model of all AI companies. It costs far more to run the datacenters and build more capacity than they could ever hope to make back at current pricing models. I'm looking forward to pricing catching up with reality and the chaos that ensues.

It's interesting that we're seeing these gains when it seems Mythos/Fable is "just" a scaled up version of their existing architecture[0].

When GPT 4.5 launched, the gains compared to the model size didn't seem that great, leading some to believe that the only progress we'd see would come from RL.

[0] One of the early testers mentioned: "As far as I can tell from talking to people internally at Anthropic, there's nothing special about architecturally"[2]

[1] Section 1.1 in https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...

[2] https://youtu.be/GrdEid8H6H4?t=168

[3] There were rumors going around when Mythos was first announced that it was the first 10T parameter model, but I can't find a verifiable source for that number.

There’s nothing much new about the architecture. The real gains come from the usage traces.

It turns out that having a text based interface for a text-trained model creates a very nice feedback loop.

Right now as we speak, people are generating text traces on anthropic and OpenAI servers that teach their models to do everything under the sun, text wise.

So people right now getting super mad at how dumb the model is when reverse-engineering a super complex function from binary, when they write “stop, you dumb robot, you are going wrong, go this way thank you very much” are actually leaving a lesson in the form of the "chat" text history.

Some may say that each bad word gets us closer to ASI.

That, and obviously the order-of-magnitude more efficient GPUs we got, which allow for different tradeoffs at training time.

Opus 4.0 and 4.1 are more expensive than Fable.

The system card is 319 pages, at what point do we call it a "book" instead of a "card"?

There's a quote from a METR report on page 52:

> we believe [Mythos 5] is likely unable to fully and reliably automate R&D for frontier projects spanning multiple weeks

this is good news, right? right...?

But did it mention the developer in the park eating the sandwich? That is the most important question!

I can’t help but think that there are so many astroturfed comments in here.

Seems like a concerted and distributed effort from the entire Anthropic team every time to get this on top of HN.

Yes, this is also my feeling.

It happens for every single Anthropic release. Then I try it on real dev and the result is laughably bad. Except in design where it has been doing a decent job for a while. I am not a designer and my bar is pretty low.

I'm not a fan of Anthropic, but to be fair, every major model release makes it to the main page. In the case of a model like this, hyped and with a jump in capabilities, it doesn't need astroturfing.

Corporations have done worse for much less money involved. Now we have trillion dollar companies going IPO. With so much at stake, it’s not unthinkable that there’s astroturfing happening.

I’m convinced that’s the case, this place looked totally different around 4 years ago

Wouldn’t be surprised if there are marketing teams writing positive comments for more positive engagement

Where do you see them exactly? The comments are pretty much in line with how the model performs IRL.

On the new FrontierCode [1] benchmark (i.e., graded from an OSS maintainer's perspective of "would I merge this code?"):

- Opus 4.7 xhigh: 5.2%

- Opus 4.8 xhigh: 13.4%

- Fable 5 xhigh: 29.3%

Seems like a huge jump.

[1] https://cognition.ai/blog/frontier-code

That blog post really makes it look like it's graded from an LLM's estimation of an OSS maintainer's review. I see three issues:

1. That estimate could easily be wrong.

2. That estimate is, of course, usable in RL training. This isn't an inherently bad thing, and this is more or less what has improved coding models so much lately. But it does mean that other companies could and surely will do this sort of training, and Anthropic probably did too.

3. OSS maintainers are far from perfect, and there's an unfortunate uncanny valley-like effect in which a coding model can produce code that is just convincing enough to pass review even though it's actually totally wrong. I don't know whether this is a specific issue here.

How credible is this benchmark? Does it correlate with others' real-world experience?

I am shocked at the low scores from previous models. Maybe I just have low code standards but I've generally been vibe coding since 4.6

Bummer! When can I finally and confidently get slopcode into Zig?

FrontierCode is likely paid for by anthropic.

This depicts a kind of "dark forest of AI agents resorting to kill or be killed" narrative but it sounds more to me like an agent just earnestly problem-solving why its processes are being killed without real awareness of what was going on. Hard to say without the full script.

This kind of storytelling annoys me. Give us more facts, less narrative drama.

It's funny because Anthropic is the most likely place that this happens.

They are the only one crying out loud about how dangerous their models are and are presumably also training their models heavily to be "safe". And through that training itself, the model learns about the other side - how are you going to teach a model to be safe, without teaching it what's not safe?

Kung Fu Panda opening scene, anyone? "One often meets his fate on the path that he takes to avoid it" - Master Oogway.

Let's hope AIs really aren't conscious, otherwise this seems like a very unpleasant situation to be placed in.

Very interesting. I am not sure this will comply with organizational policies and standards (HIPAA, etc.).

> deletion after 30 days in almost all cases ...

Almost… basically they have unlimited power to decide what data is kept?

This makes it an instant non-starter for probably 95% of organizations. A lot of people are about to get in trouble for using it before realizing this.

30 days seems not enough to retrospectively investigate some suspected nefarious traffic.

Same. I used it today to review my code and it came up with some genuinely good comments and suggestions and found a bug I didn’t think about. Quite a step up from opus. Although one code review took up 50% of my usage.

Why is your comment so grey/downvoted? One of the only actual usage experiences posted in this thread.

For those of us on subscription plans:

* From today through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost.

* On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits. If capacity allows, we’ll extend the included window.

* After this point—when sufficient capacity allows us to do so—we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can.

How much more clearly do they need to explain the resource constraints?

If they didn't announce it, you guys would be complaining about slowed progress.

If they didn't release it, you guys would be complaining about fake promises and marketing.

If they released it without limits, the complaints would be about slow responses and outages.

If they didn't add it to subscription plans, the complaints would be about phasing out subscriptions.

If they added it to subscriptions with cost reflecting their resource availability, the complaints would be about how quickly it eats limits.

So they choose the middle ground of providing some initial access and assessing if they can satisfy demand, only to still be ignored and accused of trying to get users hooked?

We've already seen that they don't have enough compute, thus the deals with SpaceX for their GPUs. It's very reasonable that they just don't have the capacity to support the subscription userbase on this model.

Still satisfied with my switch to codex/chatgpt. I couldn't imagine switching away from claude code when it first launched, but with the drastically more generous usage on codex for the same subscription tier I just can't justify staying.

I would not use this if you are on a subscription. In <8min it burned my entire 5hr window (which had just reset, it appears: I have over 4 hours till it resets again, and I hadn't used CC at all today aside from this) and then it used up ~$15 more in usage before I could stop it.

I am on the $100 Max plan.

For me it almost immediately blocked. I had it writing code related to message digests - and it seemed to think it was too gifted for that. Gave the security warning and switched back to 4.8. Whatever... it will probably have the API error soon. I have mostly switched to the Codex 200 a month plan. I've found their 5.5 xhigh to be better than Opus 4.8 "ultracode." Also, I have not once seen their servers fail for compute unavailability, unlike Anthropic, where it happens almost every hour.

Fwiw it's not available on my enterprise account: "Disable zero data retention to unlock Fable 5 access"

Considering their apparent nerfing of the end user plans in favor of enterprise clients, is Anthropic still the "more ethical AI company" like everybody loves to tell me all the time?

Assuming this isn't just a supply issue on their side, nothing says "ethical AI" like only allowing mega corporations to use it through cost barriers.

Stockfish is a machine learning system, it seems quite plausible you might be getting slapped with the silent performance degradation (https://news.ycombinator.com/item?id=48467896).

I have a theory. It's obviously speculation, based on how Anthropic is treating Mythos and the whole media noise around its dangers and who gets access to it.

Fable is their way of showing the public "the model does exist", but in a mode that makes it harder/impossible for competitors to derive a comparable model from results.

The irony of "we train on all of humanity's collective output, but god forbid anyone trains on ours" is still incredible

That's definitely the case as model distillation is one of the explicit safety carveouts they mention. Though TBF, model distillation is also a big concern for general safety as distillation could allow you to have the model without the other guardrails. It's sort of a master key to the model.

Pelican for Fable 5 on default settings is a clear improvement on Opus 4.8

Fable 5 default: https://gist.github.com/simonw/036bee5a703e7ec84e34efa974438...

Opus 4.8 (the "max" one is closest to Fable): https://simonwillison.net/2026/May/28/claude-opus-4-8/#and-s...

Now here are the Fable pelicans for all five of the thinking effort levels - low, medium, high, xhigh, max: https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

Low used 25 input, 1,929 output - 9.67 cents: https://www.llm-prices.com/#it=25&ot=1929&sel=claude-fable-5

Max used 25 input, 14,430 output - 72.175 cents! https://www.llm-prices.com/#it=25&ot=14430&sel=claude-fable-...
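As a quick sanity check on those two figures, here is the arithmetic spelled out. It assumes Fable 5's list prices of $10 per 1M input tokens and $50 per 1M output tokens from the pricing table elsewhere in this thread; the function name is just for illustration.

```python
# Fable 5 list prices in USD per 1M tokens (assumed from the thread's pricing table).
INPUT_PER_M = 10.00
OUTPUT_PER_M = 50.00


def cost_cents(input_tokens, output_tokens):
    """Return the cost of one request in US cents."""
    dollars = (input_tokens * INPUT_PER_M + output_tokens * OUTPUT_PER_M) / 1_000_000
    return dollars * 100


low = cost_cents(25, 1_929)    # "low" effort run
high = cost_cents(25, 14_430)  # "max" effort run

print(f"low: {low:.3f} cents")
print(f"max: {high:.3f} cents")
print(f"max/low: {high / low:.1f}x")
```

That reproduces the 9.67 and 72.175 cent numbers. Almost all of the cost is output tokens, so the max run costs roughly 7.5x the low one.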

The pelican has looked very same-y across all frontier models, same color bike, same camera angle, etc. I suspect this challenge is already too embedded in the training data to be a good signal when it succeeds, and maybe even when it fails in pathological ways mirroring existing AI pelicans on the internet.

I'm beginning to wonder how much of a useful metric the pelican is because surely the frontier labs must be training their models on pelican-artistry because of how well known your test is now?

This is the reply I look for in all the new model announcements. It's fun to tell people that I judge models based on pelicans.

I find it quite interesting that while the picture looks better the more advanced the model is, apparently none so far "understands" that the pelican's legs are on both sides of the bike / top bar.

The Max version gets more details right. The bike frame looks good, the chain, the wings are appropriately styled instead of “arms”, and the knee is bent, etc. Obviously we’re hitting marginal returns now, but I see differences.

It's interesting that they still get the head tube / handle bar part wrong.

It's interesting that Gemini 3(.1?) Deep Think is still the best at this task and it's still not really generally available. Maybe Fable could match it at higher effort levels? https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/

How is this half-way down the page? To me it's the headline.

There are tons of ways to generate "strong candidates for drug design." This is definitely not the bottleneck in drug discovery and development. The hard problem is vetting and developing these ideas to the point of having a commercially viable drug. That is still a very empirical process.

Because it's completely meaningless without validation, and even with validation, not really any better than the state of the art protein generation models. Which are also mostly just nice to have because coming up with a candidate is generally quite easy.

The rate limiting steps are generally testing, or characterizing. Not designing protein binders.

It's selective reporting. It says 'in one example', but out of how many? Is that one-shot, or is it a random result out of 100? It's a marketing doc.

Would be funny if Anthropic ends up as mostly a pharma company

Drug design isn't the bottleneck anymore, it's trials. Still cool they can do this with a general purpose model though.