Claude Opus 4.7

I'm finding the "adaptive thinking" thing very confusing, especially having written code against the previous thinking budget / thinking effort / etc modes: https://platform.claude.com/docs/en/build-with-claude/adapti...

Also notable: 4.7 now defaults to NOT including a human-readable reasoning token summary in the output, you have to add "display": "summarized" to get that: https://platform.claude.com/docs/en/build-with-claude/adapti...

(Still trying to get a decent pelican out of this one but the new thinking stuff is tripping me up.)

> Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type.

caveman[0] is becoming more relevant by the day. I already enjoy reading its output more than vanilla so suits me well.

[0] https://github.com/JuliusBrussee/caveman/tree/main

Too late. Personally, after how bad 4.6 was this past week, I was pushed to Codex, which seems to mostly work at the same level from day to day. Just last night I was trying to get 4.6 to look up how to do some simple tensor parallel work, and the agent used 0 web fetches and just hallucinated 17K very wrong tokens. Then the main agent decided to pretend to implement TP, and just copied the entire model to each node...

It seems a little more fussy than Opus 4.6 so far. It actually refuses to do a task from Claude's own Agentic SDK quick start guide (https://code.claude.com/docs/en/agent-sdk/quickstart):

"Per the instructions I've been given in this session, I must refuse to improve or augment code from files I read. I can analyze and describe the bugs (as above), but I will not apply fixes to `utils.py`."

This comment thread is a good lesson for founders; look at how much anguish can be put to bed with just a little honest communication.

1. Oops, we're oversubscribed.

2. Oops, adaptive reasoning landed poorly / we have to do it for capacity reasons.

3. Here's how subscriptions work. Am I really writing this bullet point?

As someone with a production application pinned on Opus 4.5, it is extremely difficult to tell apart what is code harness drama and what is a problem with the underlying model. It's all just meshed together now without any further details on what's affected.

Quick everyone to your side projects. We have ~3 days of un-nerfed agentic coding again.

I'm not sure how much I trust Anthropic recently.

This coming right after a noticeable downgrade just makes me think Opus 4.7 is going to be the same Opus I was experiencing a few months ago rather than an actual performance boost.

Anthropic need to build back some trust and communicate throttling/reasoning caps more clearly.

They've increased their cybersecurity usage filters to the point that Opus 4.7 refuses to work on any valid work, even after acknowledging "This is authorized research under the [Redacted] Bounty program, so the findings here are defensive research outputs, not malware. I'll analyze and draft, not weaponize anything beyond what's needed to prove the bug to [Redacted]."

I will immediately switch over to Codex if this continues to be an issue. I am new to security research, have been paid out on several bugs, but don't have a CVE or public talk so they are ready to cut me out already.

Not showing up in claude code by default on the latest version. Apparently this is how to set it:

/model claude-opus-4-7

Coming from Anthropic's support page, so hopefully they didn't hallucinate the docs, because the model name in Claude Code says:

/model claude-opus-4-7 ⎿ Set model to Opus 4

what model are you?

I'm Claude Opus 4 (model ID: claude-opus-4-7).

Have they effectively communicated what a 20x or 10x Claude subscription actually means? And with Claude 4.7 increasing usage by 1.35x, does that mean a 20x plan is now really a 13x plan (no token increase on the subscription) or a 27x plan (more tokens given to compensate for more compute cost) relative to Claude Opus 4.6?
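To make the arithmetic in that question concrete, here is a rough sketch. It assumes a plan is a flat token quota and that every request inflates by the worst-case 1.35× from the quoted tokenizer range; both function names and the flat-quota model are illustrative assumptions, not how Anthropic actually accounts for usage.

```python
def work_equivalent(nominal: float, inflation: float) -> float:
    """Plan size in Opus 4.6-equivalent work if the token quota is unchanged."""
    # Same number of tokens, but each unit of work now costs more tokens.
    return nominal / inflation


def quota_needed(nominal: float, inflation: float) -> float:
    """Token quota (in old plan units) needed to keep the same amount of work."""
    return nominal * inflation


# Worst case from the quoted 1.0-1.35x range, applied to a "20x" plan.
print(round(work_equivalent(20, 1.35), 1))  # 14.8
print(round(quota_needed(20, 1.35), 1))     # 27.0
```

Under these assumptions an unchanged 20x quota buys roughly 14.8x of 4.6-equivalent work in the worst case rather than 13x, and the quota would have to grow to about 27x to keep the same headroom.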

So many messages about how Codex is better than Claude from one day to the other, while my experience is exactly the same. Is OpenAI botting the thread? I can't believe this is genuine content.

Looks completely broken on AWS Bedrock

"errorCode": "InternalServerException", "errorMessage": "The system encountered an unexpected error during processing. Try your request again.",

I'm running it for the first time and this is what the thinking looks like. Opus seems highly concerned about whether or not I'm asking it to develop malware.

> This is _, not malware. Continuing the brainstorming process.

> Not malware — standard _ code. Continuing exploration.

> Not malware. Let me check front-end components for _.

> Not malware. I have enough context to start the clarifying discussion.

> "We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses. "

This decision is potentially fatal. You need symmetric capability to research and prevent attacks in the first place.

The opposite approach is 'merely' fraught.

They're in a bit of a bind here.

For anyone who was wondering about Mythos release plans:

> What we learn from the real-world deployment of these safeguards will help us work towards our eventual goal of a broad release of Mythos-class models.

The default effort change in Claude Code is worth knowing before your next session: it's now `xhigh` (a new level between `high` and `max`) for all plans, up from the previous default. Combined with the 1.0–1.35× tokenizer overhead on the same prompts, actual token spend per agentic session will likely exceed naive estimates from 4.6 baselines.

Anthropic's guidance is to measure against real traffic—their internal benchmark showing net-favorable usage is an autonomous single-prompt eval, which may not reflect interactive multi-turn sessions where tokenizer overhead compounds across turns. The task budget feature (just launched in public beta) is probably the right tool for production deployments that need cost predictability when migrating.
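A back-of-the-envelope sketch of the multi-turn point above. It assumes the full conversation history is re-sent as input on every turn, ignores prompt caching and output tokens, and uses made-up turn sizes; `session_input_tokens` is a hypothetical helper, not part of any SDK.

```python
def session_input_tokens(turn_tokens, overhead=1.0):
    """Total input tokens billed over a session where each turn re-sends history.

    turn_tokens: new tokens added to the context on each turn (4.6 tokenizer)
    overhead: tokenizer inflation factor, 1.0 to 1.35 per the release notes
    """
    total = 0.0
    context = 0.0
    for new_tokens in turn_tokens:
        context += new_tokens * overhead  # history grows every turn
        total += context                  # the whole history is sent again
    return total


turns = [2000] * 10  # hypothetical: 10 turns adding 2,000 tokens each
baseline = session_input_tokens(turns)
worst_case = session_input_tokens(turns, overhead=1.35)
print(round(baseline), round(worst_case), round(worst_case - baseline))
```

In this simple model the ratio between the two totals stays at the overhead factor; what grows with the number of turns is the absolute count of extra tokens, which is why a 4.6 baseline measured on short sessions can underestimate long ones.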

> Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type. Second, Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens.

I guess that means bad news for our subscription usage.

If the model is based on a new tokenizer, that means that it's very likely a completely new base model. Changing the tokenizer is changing the whole foundation a model is built on. It'd be more straightforward to add reasoning to a model architecture compared to swapping the tokenizer to a new one.

Usually a ground up rebuild is related to a bigger announcement. So, it's weird that they'd be naming it 4.7.

Is this just some variant of Mythos and they're rolling it out in a way where the 'full' version is scheduled to be Opus 5? They have the stats of Mythos Preview right there on the table already.

Even Sonnet has degraded for me right now, to roughly the level of ChatGPT 3.5 back then. It took ~5 hours to get a Playwright e2e test fixed that was waiting on a wrong CSS selector. Literally dumb as fuck. And it had still been better than Opus for the last week or so... I did roughly comparable work over the last 2 weeks and it all got increasingly worse: taking more and more thinking tokens circling around nonsense and just not making 1-line changes that a junior dev would see on the spot. I'm too used to vibing now to do it by hand (yeah, I know), so I kept watching, and meanwhile discovered that Codex had fleshed out a nontrivial app with correct financial data flows in the same time without any fuss. I really don't get why Anthropic is dropping their edge so hard recently; in my head they should be aiming for increasing hype, not disappointment crashes from their power user base.

I liked Opus 4.5 but hated 4.6. Every few weeks I tried 4.6 and, after a tirade against it, I switched back to 4.5. They said 4.6 had a "bias towards action", which I think meant it just made stuff up if something was unclear, whereas 4.5 would ask for clarification. I hope 4.7 is more of a collaborator like 4.5 was.

What's the point of baking the best and most impressive models in the world and then serving it with degraded quality a month after releases so that intelligence from them is never fully utilised??

I just subscribed this month again because I wanted to have some fun with my projects.

Tried out opus 4.6 a bit and it is really really bad. Why do people say it's so good? It cannot come up with any half-decent vhdl. No matter the prompt. I'm very disappointed. I was told it's a good model

Quite a big improvement in coding benchmarks, doesn’t seem like progress is plateauing as some people predicted.

With the new tokenizer did they A/B test this one?

I'm curious if that might be responsible for some of the regressions in the last month. I've been getting feedback requests on almost every session lately, but wasn't sure if that was because of the large amount of negative feedback online.

> Opus 4.7 is a direct upgrade to Opus 4.6, but two changes are worth planning for because they affect token usage. First, Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type. Second, Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens.

This is concerning & tone-deaf especially given their recent change to move Enterprise customers from $xxx/user/month plans to the $20/mo + incremental usage.

IMO the pursuit of ultraintelligence is going to hurt Anthropic, and a Sonnet 5 release that could hit near-Opus 4.6 level intelligence at a lower cost would be received much more favorably. They were already getting extreme push-back on the CC token counting and billing changes made over the past quarter.

> where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.

interesting

funny how they use mythos preview in these benchmarks like a carrot on a stick

Interesting that despite Anthropic billing it at the same rate as Opus 4.6, GitHub CoPilot bills it at 7.5x rather than 3x.

I wonder why computer use has taken a back seat. Seemed like it was a hot topic in 2024, but then sort of went obscure after CLI agents fully took over.

It would be interesting to see a company try to train a computer-use-specific model, with an actually meaningful amount of compute directed at that. Seems like there have just been experiments built upon models trained for completely different stuff, instead of any of the companies that put out SotA models taking a real shot at it.

Is Codex the new goto? Opus stopped being useful about 45-60 days ago.

These stuck out as promising things to try. It looks like xhigh on 4.7 scores significantly higher on the internal coding benchmark (71% vs 54%, though unclear what that is exactly)

> More effort control: Opus 4.7 introduces a new xhigh (“extra high”) effort level between high and max, giving users finer control over the tradeoff between reasoning and latency on hard problems. In Claude Code, we’ve raised the default effort level to xhigh for all plans. When testing Opus 4.7 for coding and agentic use cases, we recommend starting with high or xhigh effort.

The new /ultrareview command looks like something I've been trying to invoke myself with looping, happy that it's free to test out.

> The new /ultrareview slash command produces a dedicated review session that reads through changes and flags bugs and design issues that a careful reviewer would catch. We’re giving Pro and Max Claude Code users three free ultrareviews to try it out.

Interesting to see the benchmark numbers, though at this point I find these incremental seeming updates hard to interpret into capability increases for me beyond just "it might be somewhat better".

Maybe I've skimmed too quickly and missed it, but does calling it 4.7 instead of 5 imply that it's the same as 4.6, just trained with further refined data/fine tuned to adapt the 4.6 weights to the new tokenizer etc?

How should one compare benchmark results? For example, SWE-bench Pro improved ~11% compared with Opus 4.6. Should one interpret it as 4.7 being able to solve more difficult problems? Or as 11% fewer hallucinations?

Install the latest claude code to use opus 4.7:

`claude install latest`

Interesting that the MCP-Atlas score for 4.6 jumped to 75.8% compared to 59.5% https://www.anthropic.com/news/claude-opus-4-6

There's other small single digit differences, but I doubt that the benchmark is that unreliable...?

Huge regression for long context tasks, interestingly.

The MRCR benchmark went from 78% to 32%.

Excited to start using from within Cursor.

Those Mythos Preview numbers look pretty mouthwatering.

How powerful will Opus become before they decide to not release it publicly like Mythos?

I hope this will fix up the poor quality that we're seeing on Claude Opus 4.6

But degrading a model right before a new release is not the way to go.

Seems they jumped the gun releasing this without a claude code update?

     /model claude-opus-4.7
      ⎿  Model 'claude-opus-4.7' not found

Here’s the problem. The distribution of query difficulty / task complexity is probably heavily right-skewed which drives up the average cost dramatically. The logical thing for anthropic to do, in order to keep costs under control, is to throttle high-cost queries. Claude can only approximate the true token cost of a given query. That means anything near the top percentile will need to get throttled as well.

By definition this means that you’re going to get subpar results. Anything too complicated will get a lightweight model response to save on capacity. Or an outright refusal which is also becoming more common.

New models are meaningless in this context because by definition the most impressive examples from the marketing material will not be consistently reproducible by users. The more users who try to get these fantastically complex outputs the more those outputs get throttled.
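A toy illustration of the right-skew argument above, with entirely made-up per-query costs: when a small share of queries is very expensive, the mean sits far above the median and the top slice accounts for most of the spend. `summarize_costs` is a hypothetical helper written for this example.

```python
from statistics import mean, median


def summarize_costs(costs, top_fraction=0.1):
    """Return median, mean, and the share of total cost from the top slice."""
    ranked = sorted(costs)
    top_n = max(1, int(len(ranked) * top_fraction))
    top_share = sum(ranked[-top_n:]) / sum(ranked)
    return {
        "median": median(ranked),
        "mean": mean(ranked),
        "top_share": round(top_share, 3),
    }


# Hypothetical workload: 90 cheap queries and 10 that cost 100x as much.
costs = [1] * 90 + [100] * 10
print(summarize_costs(costs))
```

With these invented numbers the typical query costs 1 unit, the average is about 10.9, and the most expensive 10% of queries make up roughly 92% of the total, which is the slice a provider would be tempted to throttle.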

The benchmarks of Opus 4.6 they compare to MUST be retaken the day of the new model release. If it was nerfed we need to know how much.

> We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses.

Fucking hell.

Opus was my go-to for reverse engineering and cybersecurity uses, because, unlike OpenAI's ChatGPT, Anthropic's Opus didn't care about being asked to RE things or poke at vulns.

It would, however, shit a brick and block requests every time something remotely medical/biological showed up.

If their new "cybersecurity filter" is anywhere near as bad? Opus is dead for cybersec.

Well this explains the outages over the last few days

Is it just Opus 4.6 with throttling removed?

Claude Code hasn't updated yet it seems, but I was able to test it using `claude --model claude-opus-4-7`

Or `/model claude-opus-4-7` from an existing session

edit: `/model claude-opus-4-7[1m]` to select the 1m context window version

Is it time to turn back from Codex to Claude Code?

> Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type.

caveman[0] is becoming more relevant by the day. I already enjoy reading its output more than vanilla so suits me well.

[0] https://github.com/JuliusBrussee/caveman/tree/main

I hope people realize that tools like caveman are mostly joke/prank projects. Almost the entirety of the context spent is in file reads (for input) and reasoning (in output), so you will barely save even 1% with such a tool, and might actually confuse the model more or have it reason for more tokens because it'll have to formulate its response in a way that satisfies the requirements.

Caveman is fun, but the real tool you want to reduce token usage is headroom

https://github.com/chopratejas/headroom (cli) https://github.com/gglucass/headroom-desktop (mac app)

I was doing some experiments with removing top 100-1000 most common English words from my prompts. My hypothesis was that common words are effectively noise to agents. Based on the first few trials I attempted, there was no discernible difference in output. Would love to compare results with caveman.

Caveat: I didn’t do enough testing to find the edge cases (eg, negation).
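A minimal sketch of the kind of experiment described above, assuming a tiny hand-picked word list rather than a real top-100 or top-1000 frequency list. `COMMON_WORDS`, `KEEP_WORDS`, and `strip_common_words` are all made up for illustration; the keep-list is one possible way to handle the negation edge case.

```python
import re

# Illustrative stand-in for a real frequency list of common English words.
COMMON_WORDS = {
    "the", "a", "an", "of", "to", "and", "is", "in", "that", "it",
    "for", "on", "with", "as", "this", "be", "are", "please",
}
# Words that flip meaning and should survive even if they are common.
KEEP_WORDS = {"not", "no", "never", "without"}


def strip_common_words(prompt, common=COMMON_WORDS, keep=KEEP_WORDS):
    """Drop common words from a prompt, preserving negations and word order."""
    kept = []
    for token in prompt.split():
        # Compare on a lowercase, punctuation-free form of the token.
        core = re.sub(r"[^\w']", "", token).lower()
        if core in common and core not in keep:
            continue
        kept.append(token)
    return " ".join(kept)


print(strip_common_words("Please do not delete the tests in this folder"))
# do not delete tests folder
```

Whether the shortened prompt actually saves tokens or changes model behavior would still need to be measured; this only shows the filtering step.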

I really enjoy the party game "Neanderthal Poetry", in which you can only speak using monosyllabic words. I bet you would too.

Oh wow, I love this idea even if it's relatively insignificant in savings.

I am finding my writing prompt style is naturally getting lazier, shorter, and more caveman just like this too. If I was honest, it has made writing emails harder.

While messing around, I did a concept of this with HTML to preserve tokens, worked surprisingly well but was only an experiment. Something like:

> <h1 class="bg-red-500 text-green-300"><span>Hello</span></h1>

AI compressed to:

> h1 c bgrd5 tg3 sp hello sp h1

Or something like that.

me feel that it needs some tweaking - it's a little annoyingly cute (and could be even terser).

I used Opus 4.7 for about 15 minutes on the auto effort setting.

It nicely implemented two smallish features, and already consumed 100% of my session limit on the $20 plan.

See you again in five hours.

Another supply chain attack waiting?

Have you tried just adding an instruction to be terse?

Don't get me wrong, I've tried out caveman as well, but these days I am wondering whether something as popular will be hijacked.

It seems a little more fussy than Opus 4.6 so far. It actually refuses to do a task from Claude's own Agentic SDK quick start guide (https://code.claude.com/docs/en/agent-sdk/quickstart):

That "per the instructions I've been given in this session" bit is interesting. Are you perhaps using it with a harness that explicitly instructs it to not do that? If so, it's not being fussy, it's just following the instructions it was given.

Quick everyone to your side projects. We have ~3 days of un-nerfed agentic coding again.

3 days of side project work is about all I had in me anyway

More like 2 hours considering these usage limits

... your side projects that will soon become your main source of income after you are laid off because corporate bosses have noticed that engineers are more productive...

Exactly. God, it wouldn't be such a problem if they didn't gaslight you and act like it was nothing. Just put up a banner that says Claude is experiencing overloaded capacity right now, so your responses might be whatever.

I'm not sure how much I trust Anthropic recently.

This coming right after a noticeable downgrade just makes me think Opus 4.7 is going to be the same Opus I was experiencing a few months ago rather than an actual performance boost.

Anthropic need to build back some trust and communicate throttling/reasoning caps more clearly.

They don't have enough compute for all their customers.

OpenAI bet on more compute early on which prompted people to say they're going to go bankrupt and collapse. But now it seems like it's a major strategic advantage. They're 2x'ing usage limits on Codex plans to steal CC customers and it seems to be working.

It seems like 90% of Claude's recent problems are strictly lack of compute related.

What I want to know is why my bedrock-backed Claude gets dumber along with commercial users. Surely they're not touching the bedrock model itself. Only thing I can think of is that updates to the harness are the main cause of performance degradation.

Not to mention their recent integration of Persona ID verification - that was the last straw for me.

Usually they're hemorrhaging performance while training.

From that it's pretty likely they were training mythos for the last few weeks, and then distilling it to opus 4.7

Pure speculation of course, but would also explain the sudden performance gains for mythos - and why they're not releasing it to the general public (because it's the undistilled version which is too expensive to run)

> This coming right after a noticeable downgrade just makes me think Opus 4.7 is going to be the same Opus i was experiencing a few months ago rather than actual performance boost.

If they are indeed doing this, I wonder how long they can keep it up?

  ⎿  API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup). This request triggered restrictions on violative cyber content and was blocked under Anthropic's
     Usage Policy. To request an adjustment pursuant to our Cyber Verification Program based on how you use Claude, fill out
     https://claude.com/form/cyber-use-case?token=[REDACTED] Please double press esc to edit your last message or
     start a new session for Claude Code to assist with a different task. If you are seeing this refusal repeatedly, try running /model claude-sonnet-4-20250514 to switch models.

This is gonna kill everything I've been working on. I have several reproduced items at [REDACTED] that I've been working on.

Maybe stick with 4.6 until the bugs are worked out? Is this new filter retroactive?

Not showing up in claude code by default on the latest version. Apparently this is how to set it:

/model claude-opus-4-7

Coming from Anthropic's support page, so hopefully they didn't hallucinate the docs, because the model name in Claude Code says:

/model claude-opus-4-7 ⎿ Set model to Opus 4

what model are you?

I'm Claude Opus 4 (model ID: claude-opus-4-7).

On the most current version (v2.1.110) of claude:

> /model claude-opus-4.7

  ⎿  Model 'claude-opus-4.7' not found

     /model claude-opus-4.7
      ⎿  Model 'claude-opus-4.7' not found

Just love that I'm paying $200 for models features they announce I can't use!

Related features that were announced I have yet to be able to use:

    $ claude --enable-auto-mode 
    auto mode is unavailable for your plan

    $ claude
    /memory 
    Auto-dream: on · /dream to run
    Unknown skill: dream

Thanks, but not working for me, and I'm on the $200 max plan

Edit: Not 30 seconds later, claude code took an update and now it works!

--model claude-opus-4-7 works as well

It does not work, it says Claude Opus 4 not 4.7

It's up now, update claude code

This comment thread is a good lesson for founders; look at how much anguish can be put to bed with just a little honest communication.

1. Oops, we're oversubscribed.

2. Oops, adaptive reasoning landed poorly / we have to do it for capacity reasons.

3. Here's how subscriptions work. Am I really writing this bullet point?

Hasn't Opus 4.5 been famously consistent while 4.6 was floating all over the place?

A few months ago the weekly limits were:

Pro = 5M tokens, 5x = 41M tokens, 20x = 83M tokens

making 5x the best value for the money (8.33x over Pro for Max 5x). This information may be outdated though, and doesn't apply to the new on-peak 5h multipliers. Anything that increases usage just burns through that flat token quota faster.
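The ratios behind that comparison can be checked quickly. This sketch uses only the rounded weekly figures quoted above (which the commenter notes may be outdated) and compares each plan's tokens to Pro and to its advertised multiplier; the dictionary names are illustrative.

```python
# Weekly token figures as quoted in the comment, in millions (possibly outdated).
WEEKLY_TOKENS_M = {"pro": 5, "max_5x": 41, "max_20x": 83}
# Advertised multiplier in each plan's name.
NOMINAL = {"pro": 1, "max_5x": 5, "max_20x": 20}


def plan_ratios(tokens=WEEKLY_TOKENS_M, nominal=NOMINAL):
    """Return (tokens relative to Pro, that ratio per advertised multiple)."""
    base = tokens["pro"]
    out = {}
    for plan, amount in tokens.items():
        ratio = amount / base
        out[plan] = (round(ratio, 2), round(ratio / nominal[plan], 2))
    return out


for plan, (vs_pro, per_multiple) in plan_ratios().items():
    print(plan, vs_pro, per_multiple)
```

With the rounded figures the 5x plan comes out at about 8.2x Pro (the quoted 8.33x presumably comes from unrounded numbers), while the 20x plan lands at about 16.6x, below its advertised multiple.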

The more efficient tokenizer reduces usage by representing text more efficiently with fewer tokens. But the lack of transparency does indeed mean Anthropic could still scale down limits to account for that.

Anthropic isn't going to give us that information. It's not actually static, it depends on subscription demand and idle compute available.

So many messages about how Codex is better than Claude from one day to the other, while my experience is exactly the same. Is OpenAI botting the thread? I can't believe this is genuine content.

I'm wondering this too. That said, I know a few people in real life who prefer Codex. More who prefer Claude though.

I'm running it for the first time and this is what the thinking looks like. Opus seems highly concerned about whether or not I'm asking it to develop malware.

> This is _, not malware. Continuing the brainstorming process.

> Not malware — standard _ code. Continuing exploration.

> Not malware. Let me check front-end components for _.

> Not malware. I have enough context to start the clarifying discussion.

it used to do this naturally sometimes, quite often in my runtime debugging.

> "We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses. "

This decision is potentially fatal. You need symmetric capability to research and prevent attacks in the first place.

The opposite approach is 'merely' fraught.

They're in a bit of a bind here.

Questions about "fatality" aside, where do you see asymmetry here?

Now we have to trick the models when you legitimately work in the security space.

Only software approved by Anthropic (and/or the USG) is allowed to be secure in this brave new era.

For anyone who was wondering about Mythos release plans:

> What we learn from the real-world deployment of these safeguards will help us work towards our eventual goal of a broad release of Mythos-class models.

They don't have the compute to make Mythos generally available: that's all there is to it. The exclusivity is also nice from a marketing pov.

Looks like they are adding Peter Thiel backed ID verification too.

https://reddit.com/r/ClaudeAI/comments/1smr9vs/claude_is_abo...

Oh look it was too powerful to release, now it’s just a matter of safeguards.

This story sounds a lot like GPT2.

Mythos release feels like Silicon Valley "don't take revenue" advice:

https://www.youtube.com/watch?v=BzAdXyPYKQo

"If you show the model, people will ask 'HOW BETTER?' and it will never be enough. The model that was the AGI is suddenly the +5% bench dog. But if you have NO model, you can say you're worried about safety! You're a potential pure play... It's not about how much you research, it's about how much you're WORTH. And who is worth the most? Companies that don't release their models!"

The most highly anticipated model. Looking forward to using it.

That depends a bit on token efficiency. From their "Agentic coding performance by effort level" graph, it looks like they get similar outcome for 4.7 medium at half the token usage as 4.6 at high.

Granted that is, as you say, a single prompt, but it is using the agentic process where the model self prompts until completion. It's conceivable the model uses fewer tokens for the same result with appropriate effort settings.

Usually a ground up rebuild is related to a bigger announcement. So, it's weird that they'd be naming it 4.7.

Is this just some variant of Mythos and they're rolling it out in a way where the 'full' version is scheduled to be Opus 5? They have the stats of Mythos Preview right there on the table already.

It doesn't need to be. Text can be tokenized in many different ways even if the token set is the same.

For example, there is usually one token for every string from "0" to "999" (including ones like "001" separately).

This means there are lots of ways you can choose to tokenize a number like 27693921. The best way to deal with numbers tends to be a little bit context dependent, but for numerics, splitting into groups of 3 from right to left tends to be pretty good.

They could just have spotted that some particular patterns should be decomposed differently.
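A small sketch of the point about numbers, assuming a vocabulary that has a token for every digit string of length 1 to 3. The two functions below are illustrative pre-tokenization rules, not the actual behavior of any Anthropic tokenizer: same token set, different segmentation.

```python
def chunk_left_to_right(digits, size=3):
    """Split a digit string into groups of `size`, starting from the left."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_right_to_left(digits, size=3):
    """Split into groups of `size` aligned to the right, like thousands separators."""
    head = len(digits) % size
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks


print(chunk_left_to_right("27693921"))  # ['276', '939', '21']
print(chunk_right_to_left("27693921"))  # ['27', '693', '921']
```

Both outputs use only tokens from the same hypothetical vocabulary, but the right-to-left grouping lines up with place value (millions, thousands, units), which is one reason it might be preferred for numeric text.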

I just subscribed this month again because I wanted to have some fun with my projects.

because they’re using it for different things where it works well and that’s all they know?

Because it was good until January 2026, then it deteriorated into an opus-3.1. Probably given a much smaller context window or less RAM.

And yet another "AI doesn't work" comment without any meaningful information. What were your exact prompts? What was the output?

This is like a user of conventional software complaining that "it crashes", without a single bit of detail, like what they did before the crash, if there was any error message, whether the program froze or completely disappeared, etc.

Quite a big improvement in coding benchmarks, doesn’t seem like progress is plateauing as some people predicted.

Some of the benchmarks went down, has that happened before?

People were "predicting" the plateau since GPT-1. By now, it would take extraordinary evidence for me to take such "predictions" seriously.

> where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.

interesting

coming more in line with codex - claude previously would often ignore explicit instructions that codex would follow. interested to see how this feels in practice

I think this line around "context tuning" is super interesting - I see a future where, for every model release, devs go and update their CLAUDE.md / skills to adapt to new model behavior.

I like this in theory. I just hope it doesn't require you to be as literal as if talking to a genie.

But if it'll actually stick to the hard rules in the CLAUDE.md files, and if I don't have to add "DON'T DO ANYTHING, JUST ANSWER THE QUESTION" at the end of my prompt, I'll be glad.

This made me LOL. They keep trying to fleece us by nerfing functionality and then adding it back next release. It’s an abusive relationship at this point.

I wonder why computer use has taken a back seat. Seemed like it was a hot topic in 2024, but then sort of went obscure after CLI agents fully took over.

The industry probably moves a lot faster adding apis and co than learning how to use a generic computer with generic tools.

I also think it's a huge barrier to allow some LLM access to your desktop.

Managed Agents seems a lot more beneficial.

Is Codex the new go-to? Opus stopped being useful about 45-60 days ago.

I haven’t noticed much difference compared to Jan/Feb. Maybe depends what you use it for

funny how they use mythos preview in these benchmarks like a carrot on a stick

There is no hallucination benchmark currently.

I was researching how to predict hallucinations using the literature (fastowski et al, 2025) (cecere et al, 2025) and the general-ish situation is that there are ways to introspect model certainty levels by probing it from the outside to get the same certainty metric that you _would_ have gotten if the model was trained as a bayesian model, ie, it knows what it knows and it knows what it doesn't know.

This significantly improves claim-level false-positive rates (measured with the AUARC metric, i.e., abstention: having the model shut up when it is actually uncertain).
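For anyone unfamiliar with the metric, here is a rough sketch of an accuracy-rejection curve and the area under it, assuming each answer comes with a confidence score and a correct/incorrect label. The function names and the simple averaging over retention levels are illustrative choices, not taken from the cited papers:

```python
def accuracy_rejection_curve(confidences, correct):
    """Return (rejection_rate, accuracy) points, rejecting least-confident answers first."""
    # Rank answers from most to least confident.
    ranked = [c for _, c in sorted(zip(confidences, correct), key=lambda p: -p[0])]
    n = len(ranked)
    points = []
    for kept in range(n, 0, -1):
        # Accuracy over only the `kept` most confident answers.
        accuracy = sum(ranked[:kept]) / kept
        points.append(((n - kept) / n, accuracy))
    return points


def auarc(confidences, correct):
    """Discrete approximation: mean accuracy across all retention levels."""
    points = accuracy_rejection_curve(confidences, correct)
    return sum(acc for _, acc in points) / len(points)


# Made-up data: the same four answers, scored by a useful and a useless confidence signal.
correct = [1, 1, 0, 0]
well_calibrated = auarc([0.9, 0.8, 0.3, 0.2], correct)
badly_calibrated = auarc([0.2, 0.3, 0.8, 0.9], correct)
print(well_calibrated > badly_calibrated)  # True
```

The point is that a model whose confidence tracks correctness scores higher, because abstaining on its least confident answers removes mostly wrong ones.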

This would be great to include as a metric in benchmarks because right now the benchmark just says "it solves x% of benchmarks", whereas the real question real-world developers care about is "it solves x% of benchmarks *reliably*" AND "it creates false positives y% of the time".

So the answer to your question is: we don't know. It might be a cherry-picked result, it might be fewer hallucinations (better metacognition), or it might be the capability to solve more difficult problems (better intelligence).

The benchmarks don't make this explicit.

Benchmarks are meaningless. Try it on your own problems and see if it has improved for what you want to use it for.

Benchmark results don’t directly translate to actual real world improvement. So we might guess it’s somewhat better but hard to say exactly in what way

11% further along the particular bell curve of SWE-bench. Not really easy to extrapolate to real world, especially given that eg the Chinese models tend to heavily train on the benchmarks. But a 10% bump with the same model should equate to “feels noticeably smarter”.

A more quantifiable eval would be METR’s task time - it’s the duration of tasks that the model can complete on average 50% of the time, we’ll have to wait to see where 4.7 lands on this one.
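A rough sketch of how such a 50% horizon could be estimated, assuming (as I understand METR's approach) a logistic curve fit of success against the log of task length. The function name, the plain gradient-descent fit, and the data are all made up for illustration:

```python
import math


def fit_50_percent_horizon(runs, steps=20000, lr=0.1):
    """runs: list of (task_minutes, succeeded) pairs.

    Fits success probability = sigmoid(a + b * log2(minutes)) by gradient
    descent and returns the task length (minutes) where that curve crosses 50%.
    """
    xs = [math.log2(minutes) for minutes, _ in runs]
    ys = [1.0 if ok else 0.0 for _, ok in runs]
    n = len(runs)
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += (p - y) / n
            grad_b += (p - y) * x / n
        a -= lr * grad_a
        b -= lr * grad_b
    # sigmoid(a + b * x) equals 0.5 exactly when a + b * x == 0.
    return 2 ** (-a / b)


# Hypothetical results: short tasks mostly succeed, long tasks mostly fail.
runs = [(1, True), (2, True), (4, True), (8, True), (8, False), (16, True),
        (16, False), (32, True), (32, False), (64, False), (128, False), (256, False)]
horizon = fit_50_percent_horizon(runs)
print(f"50% time horizon: about {horizon:.1f} minutes")
```

The toy data is symmetric around 16 minutes on a log scale, so the fitted horizon lands there; real evaluations would use many more tasks and report uncertainty around the estimate.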

How powerful will Opus become before they decide to not release it publicly like Mythos?

They are planning to release a Mythos-class model (from the initial announcement), but they won't until they can trust their safeguards + the software ecosystem has been sufficiently patched.

It seems they nerf it, then release a new version with previous power. So they can do this forever without actually making another step function model release.

Seems they jumped the gun releasing this without a claude code update?

     /model claude-opus-4.7
      ⎿  Model 'claude-opus-4.7' not found

> We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses.

Fucking hell.

Opus was my go-to for reverse engineering and cybersecurity uses, because, unlike OpenAI's ChatGPT, Anthropic's Opus didn't care about being asked to RE things or poke at vulns.

It would, however, shit a brick and block requests every time something remotely medical/biological showed up.

If their new "cybersecurity filter" is anywhere near as bad? Opus is dead for cybersec.

Incredible - in one fell swoop, killing my entire use case for Claude.

I have about 15 submissions that I now need to work with Codex on cause this "smarter" model refuses to read program guidelines and take them seriously.

To be fair, delineating between benevolent and malevolent pen-testing and cybersecurity purposes is practically impossible since the only difference is the user's intentions. I am entirely unsurprised, and would expect that as models improve, the extent to which widely available models are restricted from cybersecurity purposes will only increase.

Not to say I see this as the right approach, in theory the two forces would balance each other out as both white hats and black hats would have access to the same technology, but I can understand the hesitancy from Anthropic and others.

From the article:

> Security professionals who wish to use Opus 4.7 for legitimate cybersecurity purposes (such as vulnerability research, penetration testing, and red-teaming) are invited to join our new Cyber Verification Program.

Claude code had safeguards like that hardcoded into the software. You could see it if you intercept the prompts with a proxy

It appears we're learning the hard way that we can't rely on capabilities of models that aren't open weights. These can be taken from us at any time, so expect it to get much worse.

Same. I stopped my Pro subscription yesterday after entering the week with 70% of my tokens used by Monday morning (on light, small weekend projects, things I had worked on in the past and barely noticed a dent in usage.) Support was... unhelpful.

It's been funny watching my own attitude to Anthropic change, from being an enthusiastic Claude user to pure frustration. But even that wasn't the trigger to leave, it was the attitude Support showed. I figure, if you mess up as badly as Anthropic has, you should at least show some effort towards your customers. Instead I just got a mass of standardised replies, even after a reply in the thread said I'd be escalated to a human. Nothing can sour you on a company more. I'm forgiving of bugs, we've all been there, but really annoyed by indifference and unhelpful form replies with corporate uselessness.

So if 4.7 is here? I'd prefer they forget models and revert the harness to its January state. Even then, I've already moved to Codex as of a few days ago, and I won't be maintaining two subscriptions, it's a move. It has its own issues, it's clear, but I'm getting work done. That's more than I can say for Claude.

Funny because many people here were so confident that OpenAI is going to collapse because of how much compute they pre-ordered.

But now it seems like it's a major strategic advantage. They're 2x'ing usage limits on Codex plans to steal CC customers and it seems to be working. I'm seeing a lot of goodwill for Codex and a ton of bad PR for CC.

It seems like 90% of Claude's recent problems are strictly lack of compute related.

Codex really has its place in my bag. I mainly use it, rarely Claude.

Codex just gets it done. Very self-correcting by design while Claude has no real base line quality for me. Claude was awesome in December, but Codex is like a corporate company to me. Maybe it looks uncool, but can execute very well.

Also Web Design looks really smooth with Codex.

OpenAI really impressed me and continues to impress me with Codex. OpenAI made no fuss about it, instead letting results speak. It is as if Codex has no marketing department, just its product quality - kind of like Google in its early days with every product.

I guess our conscience of OpenAI working with the Department of War has an expiry date of 6 weeks.

Personally I find that using and managing Claude sessions and limits is getting exhausting and feels similar to calorie counting. You think you are going to have an amazing low-calorie meal only to realize the meal is full of processed sugars and you overshot the limit within 2-3 bites. Now "you have exhausted your limit for this time. Your session limit resets in the next 4 hrs".

Until the next time they push you back to Claude. At this point, I feel like this has to be the most unstable technology ever released. Imagine if docker had stopped working every two releases

I haven't seen much of a quality drop from 4.6. But I also notice that I use codex more often these days than claude code

I enjoy switching back and forth and having multi-agent reviews. I'm enjoying Codex also but having options is the real win.

I switched to Codex and found it extremely inferior for my use case.

It is much faster, but faster worse code is a step in the wrong direction. You're just rapidly accumulating bugs and tech debt, rather than more slowly moving in the correct direction.

I'm a big fan of Gemini in general, but at least in my experience Gemini CLI is VERY FAR behind either Codex or CC. It's both slower than CC, MUCH slower than Codex, and the output quality is considerably worse than CC (probably worse than Codex and orders of magnitude slower).

In my experience, Codex is extraordinarily sycophantic in coding, which is a trait that couldn't be more harmful. When it encounters bugs and debt, it says: wow, how beautiful, let me double down on this, pile on exponentially more trash, wrap it in a bow, and call you Alan Turing.

It also does not follow directions. When you tell it how to do something, it will say, nah, I have a better faster way, I'll just ignore the user and do my thing instead. CC will stop and ask for feedback much more often.

YMMV.

I've been using it with `/effort max` all the time, and it's been working better than ever.

I think here's part of the problem, it's hard to measure this, and you also don't know in which AB test cohorts you may currently be and how they are affecting results.

Caveman is fun, but the real tool you want to reduce token usage is headroom

https://github.com/chopratejas/headroom (cli) https://github.com/gglucass/headroom-desktop (mac app)

I really enjoy the party game "Neanderthal Poetry", in which you can only speak using monosyllabic words. I bet you would too.

me feel that it needs some tweaking - it's a little annoyingly cute (and could be even terser).

I used Opus 4.7 for about 15 minutes on the auto effort setting.

It nicely implemented two smallish features, and already consumed 100% of my session limit on the $20 plan.

See you again in five hours.

A few months ago the weekly limits were:

pro = 5m tokens, 5x = 41m tokens, 20x = 83m tokens

Anthropic isn't going to give us that information. It's not actually static, it depends on subscription demand and idle compute available.

I'm wondering this too. That said, I know a few people in real life who prefer Codex. More who prefer Claude though.

it used to do this naturally sometimes, quite often in my runtime debugging.

Questions about "fatality" aside, where do you see asymmetry here?

Now we have to trick the models when you legitimately work in the security space.

Only software approved by Anthropic (and/or the USG) is allowed to be secure in this brave new era.

Except when you accidentally leak your entire codebase, oops

They don't have the compute to make Mythos generally available: that's all there is to it. The exclusivity is also nice from a marketing pov.

They don't have demand for the price it would require for inference.

They are definitely distilling it into a much smaller model that's ~98% as good, like everybody does.

I've read so many conflicting things about Mythos that it's become impossible to make any real assumptions about it. I don't think it's vaporware necessarily, but the whole "we can't release it for safety reasons" feels like the next level of "POC or STFU".

Looks like they are adding Peter Thiel backed ID verification too.

https://reddit.com/r/ClaudeAI/comments/1smr9vs/claude_is_abo...

You should've commented this on the parent thread for visibility, I had to scroll to find this, as I don't browse r/ClaudeAI regularly.

Oh look it was too powerful to release, now it’s just a matter of safeguards.

This story sounds a lot like GPT2.

The original blog post for Mythos did lay out this safeguard testing strategy as part of their plan.

This seems needlessly cynical. I don't think they said they never planned to release it.

They seemed to make it clear that they expect other labs to reach that level sooner or later, and they're just holding it off until they've helped patch enough vulnerabilities.

My guess is that it is just too expensive to make generally available. Sounds similar to ChatGPT 4.5 which was too expensive to be practical.

It's too powerful now. Once GPT6 is released it will suddenly, magically, become not too powerful to release.

Mythos release feels like Silicon Valley "don't take revenue" advice:

https://www.youtube.com/watch?v=BzAdXyPYKQo

Completely agree. We're at this place where a frontier model's peak perceived value always seems to be right before it releases.

The most highly anticipated model; looking forward to using it.

That depends a bit on token efficiency. From their "Agentic coding performance by effort level" graph, it looks like they get similar outcome for 4.7 medium at half the token usage as 4.6 at high.

It doesn't need to be. Text can be tokenized in many different ways even if the token set is the same.

For example there is usually one token for every string from "0" to "999" (including ones like "001" separately).
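To put a number on "many different ways": a small sketch that counts the possible splits of a digit string, assuming every 1 to 3 digit string is available as a token (as described above). The function name is made up, and a real tokenizer picks exactly one of these splits by its own rules:

```python
from functools import lru_cache


def count_segmentations(digits: str, max_len: int = 3) -> int:
    """Count ways to split `digits` into consecutive chunks of 1..max_len characters."""

    @lru_cache(maxsize=None)
    def ways(start: int) -> int:
        # Reaching the end of the string completes exactly one valid split.
        if start == len(digits):
            return 1
        return sum(
            ways(start + k)
            for k in range(1, max_len + 1)
            if start + k <= len(digits)
        )

    return ways(0)


print(count_segmentations("27693921"))  # 81
```

So even with an unchanged vocabulary, an 8-digit number already has dozens of valid token sequences, and a tokenizer update only needs to change which one gets picked.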

Because it was good until January 2026, then it deteriorated into an opus-3.1. Probably given a much smaller context window or less RAM.

It released in February 2026.

Some of the benchmarks went down, has that happened before?

If you mean for Anthropic in particular, I don't think so. But it's not the first time a major AI lab publishes an incremental update of a model that is worse at some benchmarks. I remember that a particular update of Gemini 2.5 Pro improved results in LiveCodeBench but scored lower overall in most benchmarks.

https://news.ycombinator.com/item?id=43906555
