All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it.) For now you can fit the first 4 books (~733K tokens).
Results: Opus 4.6 found 49 out of 50 officially documented spells across those 4 books. The only miss was "Slugulus Eructo" (a vomiting spell).
Freaking impressive!
> Version 2.1.32:
• Claude Opus 4.6 is now available!
• Added research preview agent teams feature for multi-agent collaboration (token-intensive feature, requires setting CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1)
• Claude now automatically records and recalls memories as it works
• Added "Summarize from here" to the message selector, allowing partial conversation summarization.
• Skills defined in .claude/skills/ within additional directories (--add-dir) are now loaded automatically.
• Fixed @ file completion showing incorrect relative paths when running from a subdirectory
• Updated --resume to re-use --agent value specified in previous conversation by default.
• Fixed: Bash tool no longer throws "Bad substitution" errors when heredocs contain JavaScript template literals like ${index + 1}, which previously interrupted tool execution
• Skill character budget now scales with context window (2% of context), so users with larger context windows can see more skill descriptions without truncation
• Fixed Thai/Lao spacing vowels (สระ า, ำ) not rendering correctly in the input field
• VSCode: Fixed slash commands incorrectly being executed when pressing Enter with preceding text in the input field
• VSCode: Added spinner when loading past conversations list

They are doing these broad marketing programs trying to take on ChatGPT for "normies". And yet their bread and butter is still clearly coding.
Meanwhile, Claude's general use cases are... fine. For generic research topics, I find that ChatGPT and Gemini run circles around it: in the depth of research, the type of tasks it can handle, and the quality and presentation of the responses.
Anthropic is also doing all of these goofy things to try to establish the "humanity" of their chatbot - giving it rights and a constitution and all that. Yet it weirdly feels the most transactional out of all of them.
Don't get me wrong, I'm a paying Claude customer and love what it's good at. I just think there's a disconnect between what Claude is and what their marketing department thinks it is.
A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers, and I don't know if that has changed with more efficient hardware/software improvements/caching.
What do you want to do?
1. Stop and wait for limit to reset
2. Switch to extra usage
3. Upgrade your plan
Enter to confirm · Esc to cancel
How come they don't have "Cancel your subscription and uninstall Claude Code"? Codex lasts for way longer without shaking me down for more money off the base $xx/month subscription.

Installation instructions: https://code.claude.com/docs/en/overview#get-started-in-30-s...
But considering how SWE-Bench Verified seems to be the tech press' favourite benchmark to cite, it's surprising that they didn't try to confound the inevitable "Opus 4.6 Releases With Disappointing 0.1% DROP on SWE-Bench Verified" headlines.
It also seems misleading to have charts that compare to Sonnet 4.5 and not Opus 4.5 (Edit: It's because Opus 4.5 doesn't have a 1M context window).
It's also interesting they list compaction as a capability of the model. I wonder if this means they have RL trained this compaction as opposed to just being a general summarization and then restarting the agent loop.
The answer to "when is it cheaper to buy two singles rather than one return between Cambridge to London?" is available in sites such as BRFares, but no LLM can scrape it so it just makes up a generic useless answer.
I didn't see any notes but I guess this is also true for "max" effort level (https://code.claude.com/docs/en/model-config#adjust-effort-l...)? I only see low, medium and high.
Everything in plan mode first + AskUserQuestionTool, review all plans, get it to write its own CLAUDE.md for coding standards and edit where necessary and away you go.
Seems noticeably better than 4.5 at keeping the codebase slim. Obviously you still need to keep an eye on it, but it's a step up from 4.5.
> Can you find an academic article that _looks_ legitimate -- looks like a real journal, by researchers with what look like real academic affiliations, has been cited hundreds or thousands of times -- but is obviously nonsense, e.g. has glaring typos in the abstract, is clearly garbled or nonsensical?
It pointed me to a bunch of hoaxes. I clarified:
> no, I'm not looking for a hoax, or a deliberate comment on the situation. I'm looking for something that drives home the point that a lot of academic papers that look legit are actually meaningless but, as far as we can tell, are sincere
It provided https://www.sciencedirect.com/science/article/pii/S246802302....
Close, but that's been retracted. So I asked for "something that looks like it's been translated from another language to english very badly and has no actual content? And don't forget the cited many times criteria. " And finally it told me that the thing I'm looking for probably doesn't exist.
For my tastes telling me "no" instead of hallucinating an answer is a real breakthrough.
Scalable Intelligence is just a wrapper for centralized power. All Ai companies are headed that way.
> Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.
Not having to hand roll this would be incredible. One of the best Claude code features tbh.
1) How do you depreciate a new model? What is its useful life? (You only know this once you deprecate it.)
2) How do you depreciate your hardware over the period you trained this model? Another big unknown, and not known until you finally write the hardware off.
The easy thing to calculate is whether you are making money actually serving the model. And the answer is almost certainly yes, they are making money from that perspective, but that misses a large part of the cost and is therefore wrong.
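To make the depreciation point concrete, here is a minimal sketch with entirely hypothetical numbers (the training cost, hardware outlay, serving margin, and useful life below are all assumptions for illustration, not anyone's real figures):

```python
def monthly_cost(total_cost: float, useful_life_months: int) -> float:
    """Straight-line depreciation: spread a one-time cost over its useful life."""
    return total_cost / useful_life_months

training_run = 500e6      # assumed one-time training cost, in dollars
hardware = 2_000e6        # assumed hardware outlay attributable to this model
serving_margin = 60e6     # assumed monthly inference revenue minus serving cost

for life in (12, 24, 48):  # the unknown: how many months the model stays useful
    dep = monthly_cost(training_run, life) + monthly_cost(hardware, life)
    net = serving_margin - dep
    print(f"{life:>2}-month life: depreciation ${dep / 1e6:6.1f}M/mo, net ${net / 1e6:+7.1f}M/mo")
```

With these made-up numbers the same serving margin looks deeply unprofitable on a 12-month life and barely profitable on a 48-month one, which is exactly the unknown the comment is pointing at.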
I had two different PRs with some odd edge case (thankfully caught by tests). 4.5 kept running in circles, kept creating test files and running `node -e` or `python3` scripts all over, and couldn't make progress.
4.6 thought and thought for around 10 minutes in both cases and found a two-line fix for a very complex, hard-to-catch regression in the data flow, without having to test anything, just by thinking.
This gets repeated everywhere but I don't think it's true.
The company is unprofitable overall, but I don't see any reason to believe that their per-token inference prices are below the marginal cost of computing those tokens.
It is true that the company is unprofitable overall when you account for R&D spend, compensation, training, and everything else. This is a deliberate choice that every heavily funded startup should be making, otherwise you're wasting the investment money. That's precisely what the investment money is for.
However I don't think using their API and paying for tokens has negative value for the company. We can compare to models like DeepSeek where providers can charge a fraction of the price of OpenAI tokens and still be profitable. OpenAI's inference costs are going to be higher, but they're charging such a high premium that it's hard to believe they're losing money on each token sold. I think every token paid for moves them incrementally closer to profitability, not away from it.
Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.
> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers
Where did you hear that? It doesn't match my mental model of how this has played out.
Local AIs make agent workflows a whole lot more practical. Making the initial investment in a good homelab/on-prem facility will effectively become a no-brainer given the advantages in privacy and reliability, and you don't have to fear rugpulls or VCs playing the "lose money on every request" game, since you know exactly how much you're paying in power costs for your overall load.
Claude figured out zig’s ArrayList and io changes a couple weeks ago.
It felt like it got better then very dumb again the last few days.
Take critical thinking — genuinely questioning your own assumptions, noticing when a framing is wrong, deciding that the obvious approach to a problem is a dead end. Or creativity — not recombination of known patterns, but the kind of leap where you redefine the problem space itself. These feel like they involve something beyond "predict the next token really well, with a reasoning trace."
I'm not saying LLMs will never get there. But I wonder if getting there requires architectural or methodological changes we haven't seen yet, not just scaling what we have.
> Prefilling assistant messages (last-assistant-turn prefills) is not supported on Opus 4.6. Requests with prefilled assistant messages return a 400 error.
That was a really cool feature of the Claude API where you could force it to begin its response with e.g. `<svg` - it was a great way of forcing the model into certain output patterns.
They suggest structured outputs or system prompting as the alternative but I really liked the prefill method, it felt more reliable to me.
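For anyone who never used it, this is roughly what the prefill trick looked like: a minimal sketch against the standard Messages API shape, where the model name is just a placeholder and, per the note above, this pattern now returns a 400 on Opus 4.6.

```python
import anthropic

client = anthropic.Anthropic()

# Ending the messages list with a partial assistant turn forced the model to
# continue from that prefix, e.g. guaranteeing the reply starts with "<svg".
response = client.messages.create(
    model="claude-opus-4-5",          # placeholder; prefill is rejected on Opus 4.6
    max_tokens=2048,
    messages=[
        {"role": "user", "content": "Draw a pelican riding a bicycle as an SVG."},
        {"role": "assistant", "content": "<svg"},   # the prefill
    ],
)
print("<svg" + response.content[0].text)  # the prefix itself is not echoed back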
This is obviously not true, you can use real data and common sense.
Just look up a similar sized open weights model on openrouter and compare the prices. You'll note the similar sized model is often much cheaper than what anthropic/openai provide.
Example: Let's compare claude 4 models with deepseek. Claude 4 is ~400B params so it's best to compare with something like deepseek V3 which is 680B params.
Even if we compare the cheapest Claude model to the most expensive DeepSeek provider, we have Claude charging $1/M for input and $5/M for output, while DeepSeek providers charge $0.40/M and $1.20/M, a fifth of the price; you can get it as cheap as $0.27 input / $0.40 output.
As you can see, even if we skew things heavily in favor of Claude, the story is clear: Claude token prices are much higher than they could have been. The difference in prices is because Anthropic also needs to pay for training costs, while OpenRouter providers only need to worry about making model serving profitable. DeepSeek is also not as capable as Claude, which also puts downward pressure on its prices.
There's still a chance that Anthropic/OpenAI models are losing money on inference: for example, they may be much larger than expected (the 400B param number is not official, just speculation based on how the model performs), this only takes API prices into account, and subscriptions and free users will of course skew the real profitability numbers.
Price sources:
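Plugging the per-million-token prices quoted in that comment into a quick comparison for a hypothetical workload of one million input and one million output tokens (the workload mix is an assumption; the prices are the ones quoted above, not official figures):

```python
# Prices per million tokens, as quoted in the comment above.
claude_cheapest = {"in": 1.00, "out": 5.00}
deepseek_priciest_provider = {"in": 0.40, "out": 1.20}
deepseek_cheapest_provider = {"in": 0.27, "out": 0.40}

def job_cost(prices: dict, m_in: float = 1.0, m_out: float = 1.0) -> float:
    """Cost of a job with m_in million input and m_out million output tokens."""
    return prices["in"] * m_in + prices["out"] * m_out

for name, p in [("Claude (cheapest model)", claude_cheapest),
                ("DeepSeek (priciest provider)", deepseek_priciest_provider),
                ("DeepSeek (cheapest provider)", deepseek_cheapest_provider)]:
    print(f"{name:30s} ${job_cost(p):.2f}")
# Claude (cheapest model)        $6.00
# DeepSeek (priciest provider)   $1.60   (roughly a quarter of Claude for this 1:1 mix)
# DeepSeek (cheapest provider)   $0.67
```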
I'm curious what others think about these? There are only 8 tasks there specifically for coding
But it takes a lot of context, as an experimental feature.
Use a self-learning loop with hooks and claude.md to preserve memory.
I have shared a plugin of my setup above. Try it.
Agent teams in this release is mcp-agent-mail [1] built into the runtime. Mailbox, task list, file locking — zero config, just works. I forked agent-mail [2], added heartbeat/presence tracking, and had a PR upstream [3] when agent teams dropped. For coordinating Claude Code instances within a session, the built-in version wins on friction alone.

Where it stops: agent teams is session-scoped. I run Claude Code during the day, hand off to Codex overnight, pick up in the morning. Different runtimes, async, persistent. Agent teams dies when you close the terminal — no cross-tool messaging, no file leases, no audit trail that outlives the session.

What survives sherlocking is whatever crosses the runtime boundary. The built-in version will always win inside its own walls — less friction, zero setup. The cross-tool layer is where community tooling still has room. Until that gets absorbed too.

[1] https://github.com/Dicklesworthstone/mcp_agent_mail
[2] https://github.com/anupamchugh/mcp_agent_mail
[3] https://github.com/Dicklesworthstone/mcp_agent_mail/pull/77

This never happened with Opus 4.5 despite a lot of usage.
5.2 (and presumably 5.3) is really smart though and feels like it has higher "raw" intelligence.
Opus feels like a better model to talk to, and does a much better job at non-coding tasks especially in the Claude Desktop app.
Here's an example prompt where Opus in Claude put in a lot more effort and did a better job than GPT5.2 Thinking in ChatGPT:
`find all the pure software / saas stocks on the nyse/nasdaq with at least $10B of market cap. and give me a breakdown of their performance over the last 2 years, 1 year and 6 months. Also find their TTM and forward PE`
Opus usage limits are a bummer though and I am conditioned to reach for Codex/ChatGPT for most trivial stuff.
Works out in Anthropic's favor, as long as I'm subscribed to them.
My standard test for that was "Who ends up with Bilbo's buttons?"
I guess they have to add more questions as these context windows get bigger.
They got surprisingly far, but I did need to iterate a few times to have it build tools that would check for things like: don't put walls on roads or water.
What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.
As a next benchmark you could try having 1 agent and tell it to use a coding agent (via tmux) to build you a pelican.
It also has a habit of "running wild". If I say "first, verify you understand everything and then we will implement it."
Well, it DOES output its understanding of the issue. And it's pretty spot-on on the analysis of the issue. But, importantly, it did not correctly intuit my actual request: "First, explain your understanding of this issue to me so I can validate your logic. Then STOP, so I can read it and give you the go ahead to implement."
I think the main issue we are going to see with Opus 4.6 is this "running wild" phenomenon, which is step 1 of the eternal paperclip optimizer machine. So be careful, especially when using "auto accept edits"
The one bone I'll throw it was that I was asking it to edit its own MCP configs. So maybe it got thoroughly confused?
I dunno what's going on, I'm going to give it the night. It makes no sense whatsoever.
It is very impressive though.
> The smug look on Malfoy’s face flickered.
> “No one asked your opinion, you filthy little Mudblood,” he spat.
> Harry knew at once that Malfoy had said something really bad because there was an instant uproar at his words. Flint had to dive in front of Malfoy to stop Fred and George jumping on him, Alicia shrieked, “How dare you!”, and Ron plunged his hand into his robes, pulled out his wand, yelling, “You’ll pay for that one, Malfoy!” and pointed it furiously under Flint’s arm at Malfoy’s face.
> A loud bang echoed around the stadium and a jet of green light shot out of the wrong end of Ron’s wand, hitting him in the stomach and sending him reeling backward onto the grass.
> “Ron! Ron! Are you all right?” squealed Hermione.
> Ron opened his mouth to speak, but no words came out. Instead he gave an almighty belch and several slugs dribbled out of his mouth onto his lap.
Neat: https://code.claude.com/docs/en/memory
I guess it's kind of like Google Antigravity's "Knowledge" artifacts?
How do you know? Each word is one token?
Seems like 4.6 is still all-around better?
> Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.
That does not mean the frontier labs are pricing their APIs to cover their costs yet.
It can both be true that it has gotten cheaper for them to provide inference and that they still are subsidizing inference costs.
In fact, I'd argue that's way more likely, given that this has been precisely the go-to strategy for highly competitive startups for a while now: price low to pump adoption and dominate the market, worry about raising prices for financial sustainability later, and burn through investor money until then.
What no one outside of these frontier labs knows right now is how big the gap is between current pricing and eventual pricing.
Also, is it bad that I almost immediately noticed that both of the pelican's legs are on the same side of the bicycle, but I had to look up an image on Wikipedia to confirm that they shouldn't have long necks?
Also, have you tried iterating prompts on this test to see if you can get more realistic results? (How much does it help to make them look up reference images first?)
My experience is the opposite, it is the only LLM I find remotely tolerable to have collaborative discussions with like a coworker, whereas ChatGPT by far is the most insufferable twat constantly and loudly asking to get punched in the face.
Nothing.
You can be sure that this was already known in the training data of PDFs, books, and websites that Anthropic scraped to train Claude on; hence 'documented'. This is why tests like the one the OP just did are meaningless.
Such "benchmarks" are performative for VCs, who do not ask why the research and testing isn't done independently rather than almost always by the labs' own in-house researchers.
This is unlike their previous generation of models and their competitors.
What does this indicate?
They can chain events together as a sequence, but they don’t have temporal coherence. For those that are born with dimensional privilege “Do X, discuss, then do Y” implies time passing between events, but to a model it’s all a singular event at t=0. The system pressed “3 +” on a calculator and your input presses a number and “=“. If you see the silliness in telling it “BRB” then you’ll see the silliness in foreshadowing ill-defined temporal steps. If it CAN happen in a single response then it very well might happen.
“
Agenda for today at 12pm:
1. Read junk.py
2. Talk about it for 20 minutes
3. Eat lunch for an hour
4. Decide on deleting junk.py
“
<response>
12:00 - I just read junk.py.
12:00-12:20 - Oh wow it looks like junk, that’s for sure.
12:20-1:20 - I’m eating lunch now. Yum.
1:20 - I’ve decided to delete it, as you instructed. {delete junk.py}
</response>
Because of course, right? What does “talk about it” mean beyond “put some tokens here too”?
If you want it to stop reliably you have to make it output tokens whose next most probable token is EOS (end). Meaning you need it to say what you want, then say something else where the next most probable token after it is <null>.
I’ve tested well over 1,000 prompts on Opus 4.0-4.5 for the exact issue you’re experiencing. The test criteria was having it read a Python file that desperately needs a hero, but without having it immediately volunteer as tribute and run off chasing a squirrel() into the woods.
With thinking enabled the temperature is 1.0, so randomness is maximized, and that makes it easy to find something that always sometimes works unless it doesn’t. “Read X and describe what you see.” - That worked very well with Opus 4.0. Not “tell me what you see”, “explain it”, “describe it”, “then stop”, “then end your response”, or any of hundreds of others. “Describe what you see” worked particularly well at aligning read file->word tokens->EOS… in 176/200 repetitions of the exact same prompt.
What worked 200/200 on all models and all generations? “Read X then halt for further instructions.” The reason that works has nothing to do with the model excitedly waiting for my next utterance, but rather that the typical response tokens for that step are “Awaiting instructions.” and the next most probable token after that is: nothing. EOS.
`claude --model claude-opus-4-5-20251101`
I will probably work with Opus 4.5 tomorrow to get some work done and maybe try 4.6 again later.
There's a trade-off going on: in order to handle more nuance/subtleties, the models are more likely to be wrong in their outputs and need more steering. This is why, personally, my use of them has reduced dramatically for what I do.
$200 * 1,000 = $200k/month.
I'm not saying they are, but to say that they aren't with such certainty when money is on the line, unless you have some insider knowledge you'd like to share with the rest of the class, seems like a questionable conclusion.
https://youtu.be/8brENzmq1pE?t=1544
I feel like everyone is counting chickens before they hatch here with all the doomsday predictions and extrapolating LLM capability into infinity.
People that seem to overhype this seem to either be non-technical or are just making landing pages.
It's all anecdata--I'm convinced anecdata is the least bad way to evaluate these models, benchmarks don't work--but this is the behavior I've come to expect from earlier Claude models as well, especially after several back and forth passes where you rejected the initial answers. I don't think it's new.
The location might still be on your disk. If you can pull up the original Claude JSON and put it through some `jq`, you can see what pages it went through to give you that answer and what it did.
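If you want to dig through it in Python instead of `jq`, something like the sketch below works. The path and field names are assumptions based on how recent Claude Code versions store session transcripts as JSONL under ~/.claude/projects/; the schema changes between versions, so treat every key here as illustrative.

```python
import json
from pathlib import Path

# Assumed transcript location; recent Claude Code versions store each session
# as a JSONL file under ~/.claude/projects/<escaped-project-path>/.
projects = Path.home() / ".claude" / "projects"

for session in projects.glob("*/*.jsonl"):
    for line in session.read_text().splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        msg = entry.get("message")
        if not isinstance(msg, dict):
            continue
        # Tool-use blocks (WebFetch, Read, Bash, ...) show what it actually touched.
        for block in msg.get("content") or []:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                print(session.name, block.get("name"), block.get("input"))
```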
Basically, with some tricks, they managed to reproduce 99% of it word for word. The tricks were needed to bypass security measures that are in place for exactly this reason: to stop people from retrieving training material.
In support of that hypothesis, the Fandom site lists it as “mentioned” in Half-Blood Prince, but it says nothing else and I'm traveling and don't have a copy to check, so not sure.
I asked Opus 4.6 for a pelican riding a recumbent bicycle and got this.
Codex (by openai ironically) seems to be the fastest/most-responsive, opens instantly and is written in rust but doesn't contain that many features
Claude opens in around 3-4 seconds
Opencode opens in 2 seconds
Gemini-cli is an abomination which opens in around 16 seconds for me right now, and in 8 seconds on a fresh install
Codex takes 50ms for reference...
--
If their models are so good, why are they not rewriting their own React-in-the-CLI BS in C++ or Rust for a 100x performance improvement (not kidding, it really is that much)?
React itself is a frontend-agnostic library. People primarily use it for writing websites but web support is actually a layer on top of base react and can be swapped out for whatever.
So they’re really just using react as a way to organize their terminal UI into components. For the same reason it’s handy to organize web ui into components.
That doesn't mean you have to, but I'm curious why you think it's behind in the personal assistant game.
And if you've worked with pytorch models a lot, having custom fused kernels can be huge. For instance, look at the kind of gains to be had when FlashAttention came out.
This isn't just quantization, it's actually just better optimization.
Even when it comes to quantization, Blackwell has far better quantization primitives and new floating point types that support row or layer-wise scaling that can quantize with far less quality reduction.
There is also a ton of work in the past year on sub-quadratic attention for new models that gets rid of a huge bottleneck, but like quantization can be a tradeoff, and a lot of progress has been made there on moving the Pareto frontier as well.
It's almost like when you're spending hundreds of billions on capex for GPUs, you can afford to hire engineers to make them perform better without just nerfing the models with more quantization.
They are for sure subsidising costs on the all-you-can-prompt packages ($20/$100/$200 per month). They do that mostly for data gathering, and to a smaller degree for user retention.
> evidence at all that Anthropic or OpenAI is able to make money on inference yet.
You can infer that from what 3rd-party inference providers are charging. The largest open models atm are DSv3 (~650B params) and Kimi 2.5 (1.2T params). They are being served at $2-3/Mtok. That's the Sonnet / GPT-mini / Gemini 3 Flash price range. You can make some educated guesses that they get some leeway for model size at the $10-15/Mtok prices for their top-tier models. So if they are within some sane model sizes, they are likely making money off of token-based APIs.
This is all straight out of the playbook. Get everyone hooked on your product by being cheap and generous.
Raise the price to backpay what you gave away plus cover current expenses and profits.
In no way shape or form should people think these $20/mo plans are going to be the norm. From OpenAI's marketing plan, and a general 5-10 year ROI horizon for AI investment, we should expect AI use to cost $60-80/mo per user.
As an example, I asked it to commit everything in the worktree. I stressed everything and prompted it very explicitly, because even 4.5 sometimes likes to say, "I didn't do that other stuff, I'm only going to commit my stuff even though he said everything".
It still only committed a few things.
I had to ask again.
And again.
I had to ask four times, with increasing amounts of expletives and threats in order to finally see a clean worktree. I was worried at some point it was just going to solve the problem by cleaning the workspace without even committing.
4.5 is way easier to steer, despite its warts.
It gives you a convenient way to say "remember this bug for me, we should fix tomorrow". I'll be playing around with it more for sure.
I asked Claude to give me a TLDR (condensed from its system prompt):
----
Persistent directory at ~/.claude/projects/{project-path}/memory/, persists across conversations
MEMORY.md is always injected into the system prompt; truncated after 200 lines, so keep it concise
Separate topic files for detailed notes, linked from MEMORY.md
What to record: problem constraints, strategies that worked/failed, lessons learned
Proactive: when I hit a common mistake, check memory first - if nothing there, write it down
Maintenance: update or remove memories that are wrong or outdated
Organization: by topic, not chronologically
Tools: use Write/Edit to update (so you always see the tool calls)
To me, their claim that they are vibe coding Claude code isn’t the flex they think it is.
I find it harder and harder to trust Anthropic for business-related use rather than just hobby tinkering. Between buggy releases, opaque and often seemingly glitchy rate limits and usage limits, and the model quality inconsistency, it's just not something I'd want to bet a business on.
Their limit system is so bad.
Have you tried actually prompting this? It works.
They can give you lots of creative options about how to redefine a problem space, with potential pros and cons of different approaches, and then you can further prompt to investigate them more deeply, combine aspects, etc.
So many of the higher-level things people assume LLM's can't do, they can. But they don't do them "by default" because when someone asks for the solution to a particular problem, they're trained to by default just solve the problem the way it's presented. But you can just ask it to behave differently and it will.
If you want it to think critically and question all your assumptions, just ask it to. It will. What it can't do is read your mind about what type of response you're looking for. You have to prompt it. And if you want it to be super creative, you have to explicitly guide it in the creative direction you want.
In my experience, if you do present something in the context window that is sparse in the training, there's no depth to it at all, only what you tell it. And, it will always creep towards/revert to the nearest statistically significant answers, with claims of understanding and zero demonstration of that understanding.
And I'm talking about relatively basic engineering-type problems here.
But I may easily be massively underestimating the difficulty. Though in any case I don't think it affects the timelines that much. (personal opinions obviously)
I would rather spend money on some pseudo-local inference (when cloud company manages everything for me and I just can specify some open source model and pay for GPU usage).
> Borges's "review" describes Menard's efforts to go beyond a mere "translation" of Don Quixote by immersing himself so thoroughly in the work as to be able to actually "re-create" it, line for line, in the original 17th-century Spanish. Thus, Pierre Menard is often used to raise questions and discussion about the nature of authorship, appropriation, and interpretation.
> Ron nodded but did not speak. Harry was reminded forcibly of the time that Ron had accidentally put a slug-vomiting charm on himself. He looked just as pale and sweaty as he had done then, not to mention as reluctant to open his mouth.
There could be something with regional variants but I'm doubtful as the Fandom site uses LEGO Harry Potter: Years 1-4 as the citation of the spell instead of a book.
Maybe the real LLM is the universe and we're figuring this out for someone on Slacker News a level up!
[1] https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...
In other words, it's not just the model size, but also concurrent load and how many GPUs you turn on at any time. I bet the big players' cost is quite a bit higher than the numbers on OpenRouter, even for comparable model parameters.
Imagine 2 models where when asking a yes or no question the first model just outputs a single yes or no then but the second model outputs a 10 page essay and then either yes or no. They could have the same price per token but ultimately one will be cheaper to ask questions to.
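A toy calculation of that point, with made-up token counts and a made-up output price that is identical for both models:

```python
price_per_mtok = 5.00        # hypothetical output price, identical for both models
terse_tokens = 2             # "No."
verbose_tokens = 5_000       # ten-page essay followed by a "no"

def answer_cost(tokens: int) -> float:
    """Cost of one answer at the shared per-token price."""
    return tokens / 1_000_000 * price_per_mtok

print(f"terse model:   ${answer_cost(terse_tokens):.6f} per answer")
print(f"verbose model: ${answer_cost(verbose_tokens):.6f} per answer "
      f"({verbose_tokens // terse_tokens}x the cost at the same per-token price)")
```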
> The user wants a real, sincere academic paper that looks legitimate — proper journal, real authors with real affiliations, many citations — but is obviously nonsense or garbled. Not a hoax or deliberate prank, but something that was genuinely submitted and published, yet is clearly meaningless if you actually read it. Let me think about what I know and search for examples. There are some famous cases:
> Papers generated by AI/GPT that slipped through - there have been reports of papers with phrases like "Certainly, here is a possible introduction for your topic" or "As an AI language model..."
> There have been papers with obvious tortured phrases - "sham" papers that use synonym substitution to avoid plagiarism detection, producing phrases like "sheep pursuing" instead of "shepherd" or "big data" becoming "large information"
> I'm also thinking of papers that made it through peer review despite being complete gibberish, like the famous SCIgen computer-generated papers that fooled conferences, or papers where the abstract and content don't match at all.
> Now I'm trying to recall specific real examples that fit this criteria — published papers that are genuinely incoherent or nonsensical but appeared in legitimate venues.
This is maybe not the _exhaustive_ search strategy I would have liked, but it's probably something you would do if you were limited to just a certain amount of effort on a task.
The articles it provided where the AI prompts were left in the text were definitely in the right ballpark, although I do wonder if chatbots mean, going forward, we'll see fewer errors in the "WTF are you even talking about" category which, I must say, were typically funnier and more interesting than just the generic blather of "what a great point. It's not X -- it's Y."
Actually "no, this is not something within the known corpus of this LLM, or the policy of its owners prevent to disclose it" would be one of the most acceptable answer that could be delivered, which should cover most cases in honest reply.
> Unable to process - no bug report provided. Please share the issue details you'd like me to convert into a GitHub issue title
Nowadays, I have often seen LLMs (Opus 4.5) give up on their original ideas and assumptions. Sometimes I tell them what I think the problem is, and they look at it, test it out, and decide I was wrong (and I was).
There are still times where they get stuck on an idea, but they are becoming increasingly rare.
Therefore, I think that modern LLMs are clearly already able to question their assumptions and notice when framing is wrong. In fact, they've been invaluable to me in fixing complicated bugs in minutes instead of hours because of how much they tend to question assumptions and throw out hypotheses. They've helped _me_ question some of my assumptions.
They're inconsistent, but they have been doing this. Even to my surprise.
I don't think there's anything you can't do by "predicting the next token really well". It's an extremely powerful and extremely general mechanism. Saying there must be "something beyond that" is a bit like saying physical atoms can't be enough to implement thought and there must be something beyond the physical. It underestimates the nearly unlimited power of the paradigm.
Besides, what is the human brain if not a machine that generates "tokens" that the body propagates through nerves to produce physical actions? What else than a sequence of these tokens would a machine have to produce in response to its environment and memory?
(I'm from OpenAI.)
However it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.
Otherwise, LLMs have most of the books memorised anyway: https://arstechnica.com/features/2025/06/study-metas-llama-3...
Shoot, I'd even go so far as to write a script that takes in a bunch of text, reorganizes sentences, and outputs them in a random order with the secrets. Kind of like a "Where's Waldo?", but for text
Just a few casual thoughts.
I'm actually thinking about coming up with some interesting coding exercises that I can run across all models. I know we already have benchmarks, however some of the recent work I've done has really shown huge weak points in every model I've run them on.
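A quick sketch of the shuffle-and-hide script described a couple of comments up (the corpus filename and the planted "secrets" are obviously placeholders you'd supply yourself):

```python
import random
import re

def build_haystack(text: str, secrets: list[str], seed: int | None = None) -> str:
    """Split the text into sentences, mix in the planted secrets, and shuffle
    everything so the 'needles' end up at unpredictable positions."""
    rng = random.Random(seed)
    # Naive sentence split; good enough for building a retrieval test.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    sentences.extend(secrets)
    rng.shuffle(sentences)
    return " ".join(sentences)

if __name__ == "__main__":
    corpus = open("books.txt", encoding="utf-8").read()      # your long text
    needles = ["The vault passphrase is 'plover egg'."]      # your planted secrets
    print(build_haystack(corpus, needles, seed=42))
```

Because the sentences are shuffled, the model can't rely on memorised ordering from the original books and has to actually retrieve the planted lines.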
I've used it myself. It has some rough edges in terms of rendering performance but it's nice overall.
It isn't "common sense" at all. You're comparing several companies losing money, to one another, and suggesting that they're obviously making money because one is under-cutting another more aggressively.
LLM/AI ventures are all currently under-water with massive VC or similar money flowing in, they also all need training data from users, so it is very reasonable to speculate that they're in loss-leader mode.
I wish I remembered the exact versions involved. I mostly just recall how pissed I was that it was fighting me on changing a single line in my go.mod.
Edit: to give you the benefit of doubt, it probably depends on whether the answer was a definitive “this does not exist” or “I couldn’t find it and it may not exist”
React fixes issues with the DOM being too slow to fully re-render the entire webpage every time a piece of state changes. That doesn't apply in a TUI, you can re-render TUIs faster than the monitor can refresh. There's no need to selectively re-render parts of the UI, you can just re-render the entire thing every time something changes without even stressing out the CPU.
It brings in a bunch of complexity that doesn't solve any real issues beyond the devs being more familiar with React than a TUI library.
Who cares, and why?
All of the major providers' CLI harnesses use Ink: https://github.com/vadimdemedes/ink
Calling it part of the Sonnet line would not provide the same level of blind buy-in as calling it part of the Opus line does.
It's a weapon whose target is the working class. How does no one realize this yet?
Don't give them money, code it yourself, you might be surprised how much quality work you can get done!
Unlike what another commenter suggested, this is a complex tool. I'm curious whether the codebase might eventually reach a point where it becomes unfixable; even with human assistance. That would be an interesting development. We'll see.
We’re upgrading our smartest model.
The new Claude Opus 4.6 improves on its predecessor’s coding skills. It plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills to catch its own mistakes. And, in a first for our Opus-class models, Opus 4.6 features a 1M token context window in beta.
Opus 4.6 can also apply its improved abilities to a range of everyday work tasks: running financial analyses, doing research, and using and creating documents, spreadsheets, and presentations. Within Cowork, where Claude can multitask autonomously, Opus 4.6 can put all these skills to work on your behalf.
The model’s performance is state-of-the-art on several evaluations. For example, it achieves the highest score on the agentic coding evaluation Terminal-Bench 2.0 and leads all other frontier models on Humanity’s Last Exam, a complex multidisciplinary reasoning test. On GDPval-AA—an evaluation of performance on economically valuable knowledge work tasks in finance, legal, and other domains[1]—Opus 4.6 outperforms the industry’s next-best model (OpenAI’s GPT-5.2) by around 144 Elo points,[2] and its own predecessor (Claude Opus 4.5) by 190 points. Opus 4.6 also performs better than any other model on BrowseComp, which measures a model’s ability to locate hard-to-find information online.
As we show in our extensive system card, Opus 4.6 also shows an overall safety profile as good as, or better than, any other frontier model in the industry, with low rates of misaligned behavior across safety evaluations.
In Claude Code, you can now assemble agent teams to work on tasks together. On the API, Claude can use compaction to summarize its own context and perform longer-running tasks without bumping up against limits. We’re also introducing adaptive thinking, where the model can pick up on contextual clues about how much to use its extended thinking, and new effort controls to give developers more control over intelligence, speed, and cost.
We’ve made substantial upgrades to Claude in Excel, and we’re releasing Claude in PowerPoint in a research preview. This makes Claude much more capable for everyday work.
Claude Opus 4.6 is available today on claude.ai, our API, and all major cloud platforms. If you’re a developer, use claude-opus-4-6 via the Claude API. Pricing remains the same at $5/$25 per million tokens; for full details, see our pricing page.
We cover the model, our new product updates, our evaluations, and our extensive safety testing in depth below.
We build Claude with Claude. Our engineers write code with Claude Code every day, and every new model first gets tested on our own work. With Opus 4.6, we’ve found that the model brings more focus to the most challenging parts of a task without being told to, moves quickly through the more straightforward parts, handles ambiguous problems with better judgment, and stays productive over longer sessions.
Opus 4.6 often thinks more deeply and more carefully revisits its reasoning before settling on an answer. This produces better results on harder problems, but can add cost and latency on simpler ones. If you’re finding that the model is overthinking on a given task, we recommend dialing effort down from its default setting (high) to medium. You can control this easily with the /effort parameter.
Here are some of the things our Early Access partners told us about Claude Opus 4.6, including its propensity to work autonomously without hand-holding, its success where previous models failed, and its effect on how teams work:
Claude Opus 4.6 is the strongest model Anthropic has shipped. It takes complicated requests and actually follows through, breaking them into concrete steps, executing, and producing polished work even when the task is ambitious. For Notion users, it feels less like a tool and more like a capable collaborator.
Early testing shows Claude Opus 4.6 delivering on the complex, multi-step coding work developers face every day—especially agentic workflows that demand planning and tool calling. This starts unlocking long-horizon tasks at the frontier.
Claude Opus 4.6 is a huge leap for agentic planning. It breaks complex tasks into independent subtasks, runs tools and subagents in parallel, and identifies blockers with real precision.
Claude Opus 4.6 is the best model we've tested yet. Its reasoning and planning capabilities have been exceptional at powering our AI Teammates. It's also a fantastic coding model – its ability to navigate a large codebase and identify the right changes to make is state of the art.
Claude Opus 4.6 reasons through complex problems at a level we haven't seen before. It considers edge cases that other models miss and consistently lands on more elegant, well-considered solutions. We're particularly impressed with Opus 4.6 in Devin Review, where it's increased our bug catching rates.
Claude Opus 4.6 feels noticeably better than Opus 4.5 in Windsurf, especially on tasks that require careful exploration like debugging and understanding unfamiliar codebases. We’ve noticed Opus 4.6 thinks longer, which pays off when deeper reasoning is needed.
Claude Opus 4.6 represents a meaningful leap in long-context performance. In our testing, we saw it handle much larger bodies of information with a level of consistency that strengthens how we design and deploy complex research workflows. Progress in this area gives us more powerful building blocks to deliver truly expert-grade systems professionals can trust.
Across 40 cybersecurity investigations, Claude Opus 4.6 produced the best results 38 of 40 times in a blind ranking against Claude 4.5 models. Each model ran end to end on the same agentic harness with up to 9 subagents and 100+ tool calls.
Claude Opus 4.6 is the new frontier on long-running tasks from our internal benchmarks and testing. It's also been highly effective at reviewing code.
Claude Opus 4.6 achieved the highest BigLaw Bench score of any Claude model at 90.2%. With 40% perfect scores and 84% above 0.8, it’s remarkably capable for legal reasoning.
Claude Opus 4.6 autonomously closed 13 issues and assigned 12 issues to the right team members in a single day, managing a ~50-person organization across 6 repositories. It handled both product and organizational decisions while synthesizing context across multiple domains, and it knew when to escalate to a human.
Claude Opus 4.6 is an uplift in design quality. It works beautifully with our design systems and it’s more autonomous, which is core to Lovable’s values. People should be creating things that matter, not micromanaging AI.
Claude Opus 4.6 excels in high-reasoning tasks like multi-source analysis across legal, financial, and technical content. Box’s eval showed a 10% lift in performance, reaching 68% vs. a 58% baseline, and near-perfect scores in technical domains.
Claude Opus 4.6 generates complex, interactive apps and prototypes in Figma Make with an impressive creative range. The model translates detailed designs and multi-layered tasks into code on the first try, making it a powerful starting point for teams to explore and build ideas.
Claude Opus 4.6 is the best Anthropic model we’ve tested. It understands intent with minimal prompting and went above and beyond, exploring and creating details I didn’t even know I wanted until I saw them. It felt like I was working with the model, not waiting on it.
Both hands-on testing and evals show Claude Opus 4.6 is a meaningful improvement for design systems and large codebases, use cases that drive enormous enterprise value. It also one-shotted a fully functional physics engine, handling a large multi-scope task in a single pass.
Claude Opus 4.6 is the biggest leap I’ve seen in months. I’m more comfortable giving it a sequence of tasks across the stack and letting it run. It’s smart enough to use subagents for the individual pieces.
Claude Opus 4.6 handled a multi-million-line codebase migration like a senior engineer. It planned up front, adapted its strategy as it learned, and finished in half the time.
We only ship models in v0 when developers will genuinely feel the difference. Claude Opus 4.6 passed that bar with ease. Its frontier-level reasoning, especially with edge cases, helps v0 to deliver on our number-one aim: to let anyone elevate their ideas from prototype to production.
The performance jump with Claude Opus 4.6 feels almost unbelievable. Real-world tasks that were challenging for Opus [4.5] suddenly became easy. This feels like a watershed moment for spreadsheet agents on Shortcut.
Across agentic coding, computer use, tool use, search, and finance, Opus 4.6 is an industry-leading model, often by a wide margin. The table below shows how Claude Opus 4.6 compares to our previous models and to other industry models on a variety of benchmarks.
Opus 4.6 is much better at retrieving relevant information from large sets of documents. This extends to long-context tasks, where it holds and tracks information over hundreds of thousands of tokens with less drift, and picks up buried details that even Opus 4.5 would miss.
A common complaint about AI models is “context rot,” where performance degrades as conversations exceed a certain number of tokens. Opus 4.6 performs markedly better than its predecessors: on the 8-needle 1M variant of MRCR v2—a needle-in-a-haystack benchmark that tests a model’s ability to retrieve information “hidden” in vast amounts of text—Opus 4.6 scores 76%, whereas Sonnet 4.5 scores just 18.5%. This is a qualitative shift in how much context a model can actually use while maintaining peak performance.
All in all, Opus 4.6 is better at finding information across long contexts, better at reasoning after absorbing that information, and has substantially better expert-level reasoning abilities in general.
Finally, the charts below show how Claude Opus 4.6 performs on a variety of benchmarks that assess its software engineering skills, multilingual coding ability, long-term coherence, cybersecurity capabilities, and its life sciences knowledge.
These intelligence gains do not come at the cost of safety. On our automated behavioral audit, Opus 4.6 showed a low rate of misaligned behaviors such as deception, sycophancy, encouragement of user delusions, and cooperation with misuse. Overall, it is just as well-aligned as its predecessor, Claude Opus 4.5, which was our most-aligned frontier model to date. Opus 4.6 also shows the lowest rate of over-refusals—where the model fails to answer benign queries—of any recent Claude model.
The overall misaligned behavior score for each recent Claude model on our automated behavioral audit (described in full in the Claude Opus 4.6 system card).
For Claude Opus 4.6, we ran the most comprehensive set of safety evaluations of any model, applying many different tests for the first time and upgrading several that we’ve used before. We included new evaluations for user wellbeing, more complex tests of the model’s ability to refuse potentially dangerous requests, and updated evaluations of the model’s ability to surreptitiously perform harmful actions. We also experimented with new methods from interpretability, the science of the inner workings of AI models, to begin to understand why the model behaves in certain ways—and, ultimately, to catch problems that standard testing might miss.
A detailed description of all capability and safety evaluations is available in the Claude Opus 4.6 system card.
We’ve also applied new safeguards in areas where Opus 4.6 shows particular strengths that might be put to dangerous as well as beneficial uses. In particular, since the model shows enhanced cybersecurity abilities, we’ve developed six new cybersecurity probes—methods of detecting harmful responses—to help us track different forms of potential misuse.
We’re also accelerating the cyberdefensive uses of the model, using it to help find and patch vulnerabilities in open-source software (as we describe in our new cybersecurity blog post). We think it’s critical that cyberdefenders use AI models like Claude to help level the playing field. Cybersecurity moves fast, and we’ll be adjusting and updating our safeguards as we learn more about potential threats; in the near future, we may institute real-time intervention to block abuse.
We’ve made substantial updates across Claude, Claude Code, and the Claude Developer Platform to let Opus 4.6 perform at its best.
Claude Developer Platform
On the API, we’re giving developers better control over model effort and more flexibility for long-running agents. To do so, we’re introducing the following features:
Product updates
Across Claude and Claude Code, we’ve added features that allow knowledge workers and developers to tackle harder tasks with more of the tools they use every day.
We’ve introduced agent teams in Claude Code as a research preview. You can now spin up multiple agents that work in parallel as a team and coordinate autonomously—best for tasks that split into independent, read-heavy work like codebase reviews. You can take over any subagent directly using Shift+Up/Down or tmux.
Claude now also works better with the office tools you already use. Claude in Excel handles long-running and harder tasks with improved performance, and can plan before acting, ingest unstructured data and infer the right structure without guidance, and handle multi-step changes in one pass. Pair that with Claude in PowerPoint, and you can first process and structure your data in Excel, then bring it to life visually in PowerPoint. Claude reads your layouts, fonts, and slide masters to stay on brand, whether you’re building from a template or generating a full deck from a description. Claude in PowerPoint is now available in research preview for Max, Team, and Enterprise plans.
[1] Run independently by Artificial Analysis. See here for full methodological details.
[2] This translates into Claude Opus 4.6 obtaining a higher score than GPT-5.2 on this eval approximately 70% of the time (where 50% of the time would have implied parity in the scores).
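That ~70% figure is consistent with the standard Elo expected-score formula; a quick check (the 190-point conversion is the same formula applied to the Opus 4.5 gap, not a number stated in the post):

```python
def elo_expected_score(delta: float) -> float:
    """Probability the higher-rated side scores better, for an Elo gap `delta`."""
    return 1 / (1 + 10 ** (-delta / 400))

print(f"{elo_expected_score(144):.0%}")  # ~70%, the figure given in footnote 2
print(f"{elo_expected_score(190):.0%}")  # ~75% vs Opus 4.5, derived with the same formula
```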
top_p. The model was also given a “think” tool that allowed interleaved thinking for multi-turn evaluations.

We’ve made a choice: Claude will remain ad-free. We explain why advertising incentives are incompatible with a genuinely helpful AI assistant, and how we plan to expand access without compromising user trust.
I think when I first tried this I iterated a few times to get to something that reliably output SVG, but honestly I didn't keep the notes I should have.
It is essentially a big game of venture capital chicken at present.
Except that newer "agent swarm" workflows do exactly that. Besides, batching requests generally comes with a sizeable increase in memory footprint, and memory is often the main bottleneck especially with the larger contexts that are typical of agent workflows. If you have plenty of agentic tasks that are not especially latency-critical and don't need the absolutely best model, it makes plenty of sense to schedule these for running locally.
Curious to see how things will be with 5.3 and 4.6
I think a lot of people are concerned due to 1) significant variance in performance being reported by a large number of users, and 2) We have specific examples of OpenAI and other labs benchmaxxing in the recent past (https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...).
It's tricky because there are so many subtle ways in which "the numbers are all real" could be technically true in some sense, yet still not reflect what a customer will experience (eg harnesses, etc). And any of those ways can benefit the cost structures of companies currently subsidizing models well below their actual costs with limited investor capital. All with billions of dollars in potential personal wealth at stake for company employees and dozens of hidden cost/performance levers at their disposal.
And it doesn't even require overt deception on anyone's part. For example, the teams doing benchmark testing of unreleased new models aren't the same people as the ops teams managing global deployment/load balancing at scale day-to-day. If there aren't significant ongoing resources devoted to specifically validating that those two things remain in sync, they'll almost certainly drift apart. And it won't be anyone's job to even know it's happening until a meaningful number of important customers complain or sales start to fall. Of course, if an unplanned deviation causes costs to rise over budget, it's a high-priority bug to be addressed. But if the deviation goes the other way and costs are a little lower than expected, no one's getting a late-night incident alert. This isn't even a dig at OpenAI in particular, it's just the default state of how large orgs work.
SWE-Bench Pro public is newer, but it's not live, so it will slowly get memorized as well. The private dataset is more interesting, as are the results there:
Memory comparison of AI coding CLIs (single session, idle):
| Tool | Footprint | Peak | Language |
|-------------|-----------|--------|---------------|
| Codex | 15 MB | 15 MB | Rust |
| OpenCode | 130 MB | 130 MB | Go |
| Claude Code | 360 MB | 746 MB | Node.js/React |
That's a 24x to 50x difference for tools that do the same thing: send text to an API.

vmmap shows Claude Code reserves 32.8 GB of virtual memory just for the V8 heap, has 45% malloc fragmentation, and a peak footprint of 746 MB that never gets released: a classic leak pattern.
On my 16 GB Mac, a "normal" workload (2 Claude sessions + browser + terminal) pushes me into 9.5 GB swap within hours. My laptop genuinely runs slower with Claude Code than when I'm running local LLMs.
I get that shipping fast matters, but building a CLI with React and a full Node.js runtime is an architectural choice with consequences. Codex proves this can be done in 15 MB. Every Claude Code session costs me 360+ MB, and with MCP servers spawning per session, it multiplies fast.
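For anyone who wants to reproduce those idle-footprint numbers, here's a rough sketch using psutil. The process-name substrings are assumptions for illustration, and the parent CLIs may spawn children (MCP servers, helpers) that you'd want to attribute separately.

```python
import psutil

# Substrings to look for in process names/cmdlines; adjust for your own setup.
TOOLS = {"codex": "Codex", "opencode": "OpenCode", "claude": "Claude Code"}

totals: dict[str, int] = {}
for proc in psutil.process_iter(["name", "cmdline", "memory_info"]):
    info = proc.info
    mem = info.get("memory_info")
    if mem is None:
        continue
    haystack = " ".join([info.get("name") or ""] + (info.get("cmdline") or [])).lower()
    for needle, label in TOOLS.items():
        if needle in haystack:
            totals[label] = totals.get(label, 0) + mem.rss  # resident set size
            break

for label, rss in sorted(totals.items()):
    print(f"{label:12s} {rss / 2**20:8.1f} MB RSS")
```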
The same is happening in AI research now.
[1] https://epochai.substack.com/p/can-ai-companies-become-profi...
This prompt will work better across any/all models.
Ah yes, the brain is as simple as predicting the next token, you just cracked what neuroscientists couldn't for years.
Yet, given an existing codebase (even one that's not huge), they often won't suggest "we need to restructure this part differently to solve this bug". Instead they tend to push forward.
I create a git worktree, start Claude Code in that tree, and delete after. I notice each worktree gets a memory directory in this location. So is memory fragmented and not combined for the "main" repo?
Grok and Deepmind IIRC didn’t require tricks.
Anthropic planning an IPO this year is a broad meta-indicator that internally they believe they'll be able to reach break-even sometime next year on delivering a competitive model. Of course, their belief could turn out to be wrong but it doesn't make much sense to do an IPO if you don't think you're close. Assuming you have a choice with other options to raise private capital (which still seems true), it would be better to defer an IPO until you expect quarterly numbers to reach break-even or at least close to it.
Despite the willingness of private investment to fund hugely negative AI spend, the recently growing twitchiness of public markets around AI ecosystem stocks indicates they're already worried prices have exceeded near-term value. It doesn't seem like they're in a mood to fund oceans of dotcom-like red ink for long.
> there are approximately 200k common nouns in English, and then we square that, we get 40 billion combinations. At one second per, that's ~1200 years, but then if we parallelize it on a supercomputer that can do 100,000 per second that would only take 3 days. Given that ChatGPT was trained on all of the Internet and every book written, I'm not sure that still seems infeasible.
It's called "The science of cycology: Failures to understand how everyday objects work" by Rebecca Lawson.
https://link.springer.com/content/pdf/10.3758/bf03195929.pdf
If you build React in C++ and Rust, even if the framework is there, you'll likely need to write your components in C++/Rust. That is a difficult problem. There are actually libraries out there that allow you to build web UI with Rust, although they are for web (+ HTML/CSS) and not specifically CLI stuff.
So someone would need to create such a library and maintain it properly. And you'll likely develop more slowly in Rust compared to JS.
These companies don't see a point in doing that. So they just use whatever already exists.
If you're looking at overall profitability, you include everything
If you're talking about unit economics of producing tokens, you only include the marginal cost of each token against the marginal revenue of selling that token
PS - I appreciate you coming here and commenting!
The number of non-critical bugs all over the place is at least an order of magnitude larger than in any software I've ever used daily.
Plenty of built-in /commands don't work. Sometimes it accepts keystrokes with 1-second delays. It often scrolls hundreds of lines in the console after each keystroke. Every now and then it crashes completely and is unrecoverable (I once gave up and installed a fresh WSL). When you ask it a question in plan mode, it is somewhat of an art to find the answer, because after answering the question it will dump the whole current plan (three screens of text).
And just in general the technical feeling of the TUI is that of a vibe coded project that got too big to control.
Opencode's core is actually written in Zig; only the UI orchestration is in SolidJS. It's only slightly slower to load than Neovim on my system.
All the labs seem to do very different post-training. OpenAI focuses on search: if it's set to thinking, it will search 30 websites before giving you an answer. Claude regularly doesn't search at all, even for questions where it obviously should. Its post-training seems more focused on "reasoning" or planning - things that are useful in programming, where the bottleneck isn't just writing code but thinking about how you'll integrate it later, and where search is mostly useless. But for non-coding, day-to-day questions - "what's the news with x", "how to improve my bread", "cheap tasty pizza", or even medical questions - you really just want a distillation of the internet plus some thought.
I used to think of Gemini as the lead in terms of Portuguese, but recently subjectively started enjoying Claude more (even before Opus 4.5).
In spite of this, ChatGPT is what I use for everyday conversational chat because it has loads of memories there, because of the top of the line voice AI, and, mostly, because I just brainstorm or do 1-off searches with it. I think effectively ChatGPT is my new Google and first scratchpad for ideas.
Couple that with all the automatic processes in our mind (blanks filled in that we didn't observe, yet we will be convinced we did observe them), hormone states that drastically affect our thoughts and actions...
and the result? I'm not a big believer in our uniqueness or in the level of autonomy so many think we have.
With that said, I am in no way saying LLMs are even close to us, or are even remotely close to the right implementation to get close to us. The level of complexity in our "stack" alone dwarfs LLMs. I'm not even sure LLMs are up to a worm's brain yet.
By replacing the names with something unique, you'll get much more certainty.
I mean, you can try, but it won't be a definitive answer as to whether that knowledge truly exists or doesn't exist as it is encoded into the NN. It could take a lot of context from the books themselves to get to it.
Did they solve the "lost in the middle" problem? Proof will be in the pudding, I suppose. But that number alone isn't all that meaningful for many (most?) practical uses. Claude 4.5 often starts reverting bug fixes ~50k tokens back, which isn't a context window length problem.
Things fall apart much sooner than the context window length for all of my use cases (which are more reasoning related). What is a good use case? Do those use cases require strong verification to combat the "lost in the middle" problems?
No one (approximately) outside of Anthropic knows, since the chat template is applied on the API backend; we only know the shape of the API request. You can get a rough idea of what it might be like from the chat templates published for various open models, but the actual details are opaque.
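For a flavour of what such a template might do, here's a tiny TypeScript sketch modelled on the conventions of published open-model templates; the role markers and layout are invented for illustration and are definitely not Anthropic's actual format:

    // Hypothetical chat template: flattens an API-style message list into the
    // single token stream the model actually sees. The <|...|> markers are
    // made up for illustration.
    type Message = { role: "system" | "user" | "assistant"; content: string };

    function applyChatTemplate(messages: Message[]): string {
      const body = messages
        .map((m) => `<|${m.role}|>\n${m.content}\n<|end|>`)
        .join("\n");
      return body + "\n<|assistant|>\n"; // generation continues from here
    }

    console.log(
      applyChatTemplate([
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "List the spells mentioned in book 1." },
      ])
    );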
Just curious, which formats and how they compare, storage wise?
Also, are you sure it's not just moving the goalposts to CPU usage? More powerful compression algorithms frequently can't be used because they need lots of processing power, so often the biggest gains over 20 years are just... hardware advancements.
Why don't you run the commands yourself then?
A second pass over the transcript afterward catches what the agent missed. Doesn't need the agent to notice anything. Just reads the conversation cold.
The two approaches have completely different failure modes, which is why you need both. What nobody's built yet is the loop where the second pass feeds back into the memory for the next session.
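Roughly this shape, as a TypeScript sketch; the review function is a stub standing in for a second LLM call, and none of the paths or file handling here reflect Claude Code's actual memory mechanism:

    // Hypothetical two-pass loop: after the agent session ends, a second pass
    // reads the raw transcript cold and appends its findings to a memory file
    // that gets loaded at the start of the next session.
    import { readFile, writeFile } from "node:fs/promises";

    // Stub: swap in a real second-model call that reads the transcript cold.
    async function reviewTranscript(transcript: string): Promise<string[]> {
      return [`TODO: review ${transcript.length} chars with a second model`];
    }

    async function feedBackIntoMemory(transcriptPath: string, memoryPath: string) {
      const transcript = await readFile(transcriptPath, "utf8");
      const missed = await reviewTranscript(transcript);
      const existing = await readFile(memoryPath, "utf8").catch(() => "");
      // Append the lessons so the next session starts with them in context.
      const notes = missed.map((m) => `- ${m}`).join("\n");
      await writeFile(memoryPath, `${existing}\n${notes}\n`);
    }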
But there are many different rendering libraries you can use with React, including Ink, which is designed for building CLI TUIs.
- Recipes and cooking: ChatGPT just has way more detailed and practical advice. It also thinks outside of the box much more, whereas Claude gets stuck in a rut and sticks very closely to your prompt. And ChatGPT's easier to understand/skim writing style really comes in useful.
- Travel and itinerary: Again, ChatGPT can anticipate details much more, and give more unique suggestions. I am much more likely to find hidden gems or get good time-savers than Claude, which often feels like it is just rereading Yelp for you.
- Historical research: ChatGPT wins on this by a mile. You can tell ChatGPT has been trained on actual historical texts and physical books. You can track long historical trends, pull examples and quotes, and even give you specific book or page(!) references of where to check the sources. Meanwhile, all Claude will give you is a web search on the topic.
I sometimes vibe code in Polish and it's as good as with English for me. It speaks natural, native-level Polish.
I used Opus to translate thousands of strings in my app into Polish, Korean, and two Chinese dialects. The Polish one is great, and the others are also good according to my customers.
Gemini is the most fluent in the highest number of human languages and has been for years (!) at this point - namely since Gemini 1.5 Pro, which was released Feb 2024. Two years ago.
So it might be that, by preconditioning the latent space to the area of the Harry Potter world, you make it much more probable that the full spell list is regurgitated from online resources that were also read, while asking naively might get it sometimes and sometimes not.
The books act like a hypnotic trigger, and may not represent a generalized skill. Hence replacing the names with random words would help clarify things: if you still get the original spells, regurgitation is confirmed; if it finds the renamed spells, it could be doing what we think. An even better test would be to replace all spell references AND jumble the chapters around. That way it can't even "know" where to "look" for the spell names from training.
full transcript: pastebin.com/sMcVkuwd
I shut it down a while ago because the number of bots overtake traffic. The site had quite a bit of human traffic (enough to bring in a few hundred bucks a month in ad revenue, and a few hundred more in subscription revenue), however, the AI scrapers really started ramping up and the only way I could realistically continue would be to pay a lot more for hosting/infrastructure.
I had put a ton of time into building out content...thousands of hours, only to have scrapers ignore robots, bypass cloudflare (they didn't have any AI products at the time), and overwhelm my measly infrastructure.
Even now, with the domain pointed at NOTHING, it gets almost 100,000 hits a month. There is NO SERVER on the other end. It is a dead link. The stats come from Cloudflare, where the domain name is hosted.
I'm curious if there are any lawyers who'd be willing to take someone like me on contingency for a large copyright lawsuit.
https://ollama.com/library/gemini-3-pro-preview
You can run it on your own infra. Anthropic and OpenAI are running off Nvidia, and so are Meta (well, supposedly they had custom silicon; I'm not sure if it's capable of running big models) and Mistral.
However, if Google really is running its own inference hardware, then that means the cost structure is different (developing silicon is not cheap...), as you say.
The Sokal paper was a hoax so it doesn’t meet the criteria.
This is just regular tech debt that happens from building something to $1bn in revenue as fast as you possibly can, optimize later.
They're optimizing now. I'm sure they'll have it under control in no time.
CC is an incredible product (so is codex but I use CC more). Yes, lately it's gotten bloated, but the value it provides makes it bearable until they fix it in short time.
chasing down a few sources in that article leads to articles like this at the root of claims[1], which is entirely based on information "according to a person with knowledge of the company’s financials", which doesn't exactly fill me with confidence.
[1] https://www.theinformation.com/articles/openai-getting-effic...
VC firms, even ones the size of Softbank, also literally just don't have enough capital to fund the planned next-generation gigawatt-scale data centers.
You always have to question these benchmarks, especially when the in-house researchers can potentially game them if they wanted to.
Which is why it must be independent.
[0] https://gizmodo.com/meta-cheated-on-ai-benchmarks-and-its-a-...
To me this looks like some creative bookkeeping, or even wishful thinking. It's like SpaceX omitting the price of the satellites when calculating their profits.
It's worth checking different versions of Claude Code, and updating your tools if you don't do it automatically. Also run the same prompts through VS Code, Cursor, Claude Code in terminal, etc. You can get very different model responses based on the system prompt, what context is passed via the harness, how the rules are loaded and all sorts of minor tweaks.
If you make raw API calls and see behavioural changes over time, that would be another concern.
I hope my unused gym subscription pays back the good karma :-)
We had access to the eval data (since we funded it), but we didn't train on the data or otherwise cheat. We didn't even look at the eval results until after the model had been trained and selected.
https://claude.ai/public/artifacts/14a23d7f-8a10-4cde-89fe-0...
There are estimated to be 100 or so prepositions in English. That gets you to 4 trillion combinations.
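Back-of-envelope check of those figures in TypeScript, using only the numbers quoted in these comments (the 100,000-per-second rate is the assumed one from above):

    // Rough arithmetic check of the combination counts and times above.
    const nouns = 200_000;
    const prepositions = 100;

    const nounPairs = nouns * nouns;                  // 4e10 = 40 billion
    const withPreposition = nounPairs * prepositions; // 4e12 = 4 trillion

    const perSecond = 100_000;                        // assumed rate
    const days = (n: number) => n / perSecond / 86_400;

    console.log(days(nounPairs).toFixed(1));               // ~4.6 days
    console.log((days(withPreposition) / 365).toFixed(1)); // ~1.3 years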
- https://github.com/ratatui/ratatui
The following are true:
- In our API, we don't change model weights or model behavior over time (e.g., by time of day, or weeks/months after release)
- Tiny caveats include: there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware, bugs or API downtime can obviously change behavior, heavy load can slow down speeds, and this of course doesn't apply to the 'unpinned' models that are clearly supposed to change over time (e.g., xxx-latest). But we don't do any quantization or routing gimmicks that would change model weights.
- In ChatGPT and Codex CLI, model behavior can change over time (e.g., we might change a tool, update a system prompt, tweak default thinking time, run an A/B test, or ship other updates); we try to be transparent with our changelogs (listed below) but to be honest not every small change gets logged here. But even here we're not doing any gimmicks to cut quality by time of day or intentionally dumb down models after launch. Model behavior can change though, as can the product / prompt / harness.
ChatGPT release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...
Codex changelog: https://developers.openai.com/codex/changelog/
Codex CLI commit history: https://github.com/openai/codex/commits/main/
ChatGPT-5.2-Codex follows directions to ensure a task [bead](https://github.com/steveyegge/beads) is opened before starting a task, and to keep it updated, almost to a fault. Claude-Opus-4.5, with the exact same directions, forgets about it within a round or two. Similarly, I had a project that required very specific behaviour from a couple of functions; it was documented in a few places, including comments at the top and bottom of the function. Codex was very careful in ensuring the function worked as documented. Claude decided it was easier to do the exact opposite: it rewrote the function, the comments, and the documentation to say it now did the opposite of what was previously there.
If I believed a LLM could be spiteful, I would've believed it on that second one. I certainly felt some after I realised what it had done. The comment literally said:
// Invariant regardless of the value of X, this function cannot return Y
And it turned it into: // Returns Y if X is true

Accuracy can decrease at large context sizes. OpenAI's compaction handles this better than anyone else, but it's still an issue.
If you are seeing this kind of thing start a new chat and re-run the same query. You'll usually see an improvement.
Though I'm wary about that being a magic bullet fix - already it can be pretty "selective" in what it actually seems to take into account documentation wise as the existing 200k context fills.
Diffing and only updating the parts of the TUI which have changed does make sense if you consider that the alternative is to rewrite the entire screen every "frame". There are other ways to abstract this; e.g. a library like tqdm for Python may well use a significantly simpler abstraction than a tree for storing what it's going to update next for its progress bar widget, but it also provides a much simpler interface than Claude does.
To me it seems more fair game to attack it for being written in JS than for using a particular "rendering" technique to minimise updates sent to the terminal.
What's apparently happening is that React tells Ink to update (re-render) the UI "scene graph", and Ink then generates a new full-screen image of how the terminal should look, then passes this screen image to another library, log-update, to draw to the terminal. log-update draws these screen images by a flicker-inducing clear-then-redraw, which it has now fixed by using escape codes to have the terminal buffer and combine these clear-then-redraw commands, thereby hiding the clear.
An alternative solution, rather than using the flicker-inducing clear-then-redraw in the first place, would have been just to do terminal screen image diffs and draw the changes (which is something I did back in the day for fun, sending full-screen ASCII digital clock diffs over a slow 9600baud serial link to a real terminal).
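Something like this minimal TypeScript sketch: keep the last frame around and emit cursor-move escapes only for the lines that changed. These are generic ANSI codes; this is not how Ink or log-update are actually structured internally:

    // Hypothetical line-diff renderer: instead of clearing and redrawing the
    // whole screen, move the cursor to each changed row and rewrite only it.
    let previousFrame: string[] = [];

    function render(frame: string[]) {
      const out: string[] = [];
      for (let row = 0; row < frame.length; row++) {
        if (frame[row] !== previousFrame[row]) {
          // CSI row;1H positions the cursor; CSI 2K clears that single line.
          out.push(`\x1b[${row + 1};1H\x1b[2K${frame[row]}`);
        }
      }
      previousFrame = frame;
      if (out.length > 0) process.stdout.write(out.join(""));
    }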
I wish there was a "Reset" button to go back to the original position.
Where are you in Poland?
This is interesting to me. I always switch to English automatically when using Claude Code as I have learned software engineering on an English speaking Internet. Plus the muscle memory of having to query google in English.
https://www.dbresearch.com/PROD/RI-PROD/PROD0000000000611818...
Tools like https://github.com/badlogic/pi-mono implement most of the functionality Claude Code has, even adding loads of stuff Claude doesn't have and can actually scroll without flickering inside terminal, all built by a single guy as a side project. I guess we can't ask that much from a 250B USD company.
Be careful with the coffee.
What happens in that hypothetical second is freaking fascinating. It's a denoising algorithm, and then a bunch of linear algebra, and out pops a picture of a pelican on a bicycle. Stable Diffusion does this quite handily. https://stablediffusionweb.com/image/6520628-pelican-bicycle...
https://www.freepik.com/free-vector/cyclist_23714264.htm
https://www.freepik.com/premium-vector/bicycle-icon-black-li...
Or missing/broken pedals:
https://www.freepik.com/premium-vector/bicycle-silhouette-ic...
https://www.freepik.com/premium-vector/bicycle-silhouette-ve...
http://freepik.com/premium-vector/bicycle-silhouette-vector-...
Wasn't GTA 5 famous for its very long start-up time, and it turned out there was some bug which a random developer/gamer found and handed them a fix for?
Most gamers didn't care; they still played it.
I feel like you need to be making a bigger statement about this. If you go onto various parts of the Net (Reddit, the bird site etc) half the posts about AI are seemingly conspiracy theories that AI companies are watering down their products after release week.
Maybe a dumb question but does this mean model quality may vary based on which hardware your request gets routed to?
They can both write fairly good, idiomatic code, but in my experience Opus 4.5 is better at understanding overall project structure etc. without prompting. It just does things correctly the first time more often than Codex. I still don't trust it, obviously, but out of all the LLMs it's the closest to actually starting to earn my trust.
(I work at OpenAI)
Regardless I tend to use new chats often.
I wrote a guide to deciphering that kind of language a couple of years ago: https://simonwillison.net/2023/Nov/22/deciphering-clues/
That really matters. If they are making a margin on inference they could conceivably break even no matter how expensive training is, provided they sign up enough paying customers.
If they lose money on every paying customer, then building great products that customers want to pay for will just make their financial situation worse.
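A toy TypeScript version of that break-even argument, with every number invented purely for illustration:

    // Hypothetical break-even sketch: if the inference margin per subscriber
    // is positive, a fixed training bill is recovered once enough subscribers
    // sign up; if it's negative, more subscribers only deepen the hole.
    const annualTrainingCost = 5_000_000_000; // invented figure
    const monthlyPrice = 20;
    const monthlyInferenceCost = 12;          // invented marginal serving cost
    const annualMarginPerSubscriber = (monthlyPrice - monthlyInferenceCost) * 12;

    const subscribersToBreakEven = annualTrainingCost / annualMarginPerSubscriber;
    console.log(Math.round(subscribersToBreakEven)); // ~52 million at these made-up numbers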
Most modern UI systems are inspired by React or a variant of its model.
I've had this perceived experience so many times, and while of course it's almost impossible to be objective about this, it just seems so in-your-face.
I don't rule out that it's novelty, plus getting used to it, plus psychological factors; do you have any takes on this?
https://www.reddit.com/r/OpenAI/comments/1qv77lq/chatgpt_low...
[1]: https://github.com/Vibecodelicious/llm-conductor/blob/main/O...
I've tried them all and I keep coming back to Claude Code because it's just so much more capable and useful than the others.
300MB of RAM for a CLI app that reads files and makes HTTP calls is crazy. A new emacs GUI instance is like 70MB and that’s for an entire text editor with a GUI.
The terminal does not have a render phase (or an update-state phase). You either refresh the whole screen (flickering) or control where to update manually (a custom engine, which may flicker locally). And any updates are sequential (moving the cursor and then sending what is to be displayed), not all at once like 2D pixel rendering.
So most TUIs only update when there's an event or at a frequency much lower than 60fps. This is why top and htop have a setting for that, and why other TUI programs offer a keybind to refresh and reset their rendering engines.
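A tiny TypeScript sketch of that pattern, if it helps: coalesce events and repaint at a fixed low frequency, the way top/htop expose a refresh interval. The interval and the drawing function here are placeholders:

    // Hypothetical throttled redraw: remember that something changed, then
    // repaint a couple of times per second rather than on every event.
    const REFRESH_MS = 500; // htop-style delay, far below 60fps
    let dirty = false;

    function markDirty() {
      dirty = true; // an event happened; a repaint is needed eventually
    }

    function drawScreen(): string {
      // \x1b[0K clears from the cursor to the end of the line after writing.
      return `updated at ${new Date().toISOString()}\x1b[0K\n`;
    }

    markDirty(); // simulate one incoming event

    setInterval(() => {
      if (!dirty) return; // nothing changed: leave the terminal alone
      dirty = false;
      process.stdout.write("\x1b[H"); // move the cursor home, no full clear
      process.stdout.write(drawScreen());
    }, REFRESH_MS);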
I mostly use Polish when I pair-vibe-code with my kids
I did think it's a bit weird that they had open-weighted it.
I check context use percentage, and above ~70% I ask it to generate a prompt for continuation in a new chat session to avoid compaction.
It works fine, and saves me from using precious tokens for context compaction.
Maybe you should try it.
Originally from Wrocław, but don't live in Poland anymore
> Without a clear indicator of the author's intent, any parodic or sarcastic expression of extreme views can be mistaken by some readers for a sincere expression of those views.
The intention was purely making the product experience better, based on common feedback from people (including myself) that wait times were too long. Cost was not a goal here.
If you still want the higher reliability of longer thinking times, that option is not gone. You can manually select Extended (or Heavy, if you're a Pro user). It's the same as at launch (though we did inadvertently drop it last month and restored it yesterday after Tibor and others pointed it out).
That said, there are definitely cases where we intentionally trade off intelligence for greater efficiency. For example, we never made GPT-4.5 the default model in ChatGPT, even though it was an awesome model at writing and other tasks, because it was quite costly to serve and the juice wasn't worth the squeeze for the average person (no one wants to get rate limited after 10 messages). A second example: in our API, we intentionally serve dumber mini and nano models for developers who prioritize speed and cost. A third example: we recently reduced the default thinking times in ChatGPT to speed up the times that people were having to wait for answers, which in a sense is a bit of a nerf, though this decision was purely about listening to feedback to make ChatGPT better and had nothing to do with cost (and for the people who want longer thinking times, they can still manually select Extended/Heavy).
I'm not going to comment on the specific techniques used to make GPT-5 so much more efficient than GPT-4, but I will say that we don't do any gimmicks like nerfing by time of day or nerfing after launch. And when we do make newer models more efficient than older models, it mostly gets returned to people in the form of better speeds, rate limits, context windows, and new features.
It is not at all a small app, at least as far as UX surface area. There are, what, 40ish slash commands? Each one is an opportunity for bugs and feature gaps.
I am not defending Anthropic[0], but how come in this forum, every day, I still see these "it's simple" takes from experienced people - I have no idea. There are who knows how many terminal emulators out there, with who knows how many different configurations. There are plugins for VSCode and various other editors (so it's not only a TUI).
Looking at issue tracker ~1/3 of issues are seemingly feature requests[1].
Don't forget we are dealing with LLMs, and it's a tool whose purpose and selling point is that it codes on ANY computer in ANY language for ANY system. It's a very popular tool run every day by who knows how many people - I could easily see how such a "relatively simple" tool would rack up thousands of issues, because "CC won't do weird thing X, for programming language Y, while I run it from my terminal Z". And because it's an LLM, there's a whole can of non-deterministic worms.
Have you created an LLM agent, especially with moderately complex tool usage? If yes and it worked flawlessly - tell us your secrets (and get hired by Anthropic/OpenAI/etc.). Probably 80% of my ever-growing code was just trying to deal with unknown unknowns - what if the LLM invokes a tool wrong? How do you guide the LLM back on track? How do you protect yourself and keep the LLM on track if prompts are getting out of hand or the user tries to do something weird? The problems were endless...
Yes, the core is "simple", but it's an extremely deep can of worms; for such a successful tool, I can easily see how there are many issues.
Also, it's super funny that the first issue for me at the moment is that a user cannot paste images when using Korean language input (the issue description is also in Korean), and the second issue is about input problems in Windows PowerShell and CMD, which is obviously a totally different world compared to POSIX (?) terminal emulators.
[0] I have very adverse feelings for mega ultra wealthy VC moneys...
[1] https://github.com/anthropics/claude-code/issues?q=is%3Aissu...
Once the honeymoon wears off, the tool is the same, but you get less satisfaction from it.
Just a guess! Not trying to psychoanalyze anyone.
I don't think my ability to read, understand, and write code is going anywhere, though.
Neat tool BTW, I'm in the market for something like that.
At this point I just think the "success" of many AI coding agents is extremely sector dependent.
Going forward I'd love to experiment with seeing if that's actually the problem, or just an easy explanation of failure. I'd like to play with more controls on context management than "slightly better models" - like being able to select/minimize/compact sections of context I feel would be relevant for the immediate task, to what "depth" of needed details, and those that aren't likely to be relevant so can be removed from consideration. Perhaps each chunk can be cached to save processing power. Who knows.
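Purely hypothetical, but the control surface I have in mind looks roughly like this in TypeScript; no tool I know of exposes it today, and all the names are invented:

    // Hypothetical context manager: each chunk can be kept in full, compacted
    // to a summary, or excluded, and the prompt is rebuilt from what survives.
    type Depth = "full" | "summary" | "excluded";

    interface ContextChunk {
      id: string;
      text: string;
      summary: string; // pre-computed (and cacheable) compaction of `text`
      depth: Depth;    // relevance selected for the immediate task
    }

    function buildPrompt(chunks: ContextChunk[]): string {
      return chunks
        .filter((c) => c.depth !== "excluded")
        .map((c) => (c.depth === "full" ? c.text : `[summary of ${c.id}] ${c.summary}`))
        .join("\n\n");
    }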
But I kinda see your point - assuming from your name that you're not just a single-purpose troll - I'm still not sold on the cost-effectiveness of the current generation, and I can't see a clear and obvious change to that for the next generation, especially as they're still loss leaders. Only if you play silly games like "ignoring the training costs" - i.e. the majority of the costs - do you get even close to the current subscription costs being sufficient.
My personal experience is that AI generally doesn't actually do what it is being sold for right now, at least in the contexts I'm involved with. Especially by somewhat breathless comments on the internet - like why are they even trying to persuade me in the first place? If they don't want to sell me anything, just shut up and keep the advantage for yourselves rather than replying with the 500th "You're Holding It Wrong" comment with no actionable suggestions. But I still want to know, and am willing to put the time, effort and $$$ in to ensure I'm not deluding myself in ignoring real benefits.
I'm not sure what the history of log-update has been or why it does the clear-before-draw. Another simple alternative to the pre-clear would have been just to clear to the end of the line (ESC[0K) after each partial line drawn.
BUT, I meant a button to restart after a few moves. Anyways, cool!
The complex and magic parts are around finding contextual things to include, and I'd be curious how many are that vs "forgot to call clear() in the TUI framework before redirecting to another page".
If that’s the most complex TUI (yeah, new acronym) you’ve seen, you have a lot to catch up on!
I am talking rendering image/video in the terminal!
It's the best way to find out if there's a mismatch between value and effort, and it's the best way to learn and discuss the fundamental nature of complexity.
Similar to your argument, I can name countless situations where developers absolutely, adamantly insisted that something was very hard to do, only for another developer to say "no, you can actually do that like this" and fix it in hours instead of weeks.
Yes, making a TUI from scratch is hard, no that should not affect Claude code because they aren't actually making the TUI library (I hope). It should be the case that most complexity is in the model, and the client is just using a text-based interface.
There seems to be a mismatch of what you're describing would be issues (for instance about the quality of the agent) and what people are describing as the actual issues (terminal commands don't work, or input is lost arbitrarily).
That's why verbalizing is important, because you are thinking about other complexities than the people you reply to.
That's not a criticism of these frameworks -- there are constraints coming from Rust and from the scope of the frameworks. They just can't offer a React like experience.
But I am sure that companies like Anthropic or OpenAI aren't going to build their application using these libraries, even with AI.
A fundamental part of the job is being able to break down problems from large to small, reason about them, and talk about how you do it, usually with minimal context or without deep knowledge in all aspects of what we do. We're abstraction artists.
That question wouldn't be fundamentally different than any other architecture question. Start by drawing big, hone in on smaller parts, think about edge cases, use existing knowledge. Like bread and butter stuff.
I much more question your reaction to the joke than using it as a hypothetical interview question. I actually think it's good. And if it filters out people that have that kind of reaction then it's excellent. No one wants to work with the incurious.
I'm also creating one that is similar, but purpose-built for making the plans that this setup can orchestrate. It still needs some tweaking to get agents to follow it better - it still takes additional prompting to nudge it down the proper path. But I've had similar benefits - sending plans through this adversarial review loop has yielded significant improvements in final output.
https://github.com/Vibecodelicious/llm-conductor/blob/main/p...
Cue the "I could build it in a weekend" vibes: I built my own agent TUI using the OpenAI agent SDK and Ink. Of course it's not as fleshed out as Claude, but it supports git worktrees for multi-agent work, slash commands, human-in-the-loop prompts, etc. If I point it at the Anthropic models, it more or less produces results as good as the real Claude TUI.
I actually "decompiled" the Claude tools and prompts and recreated them. As of 6 months ago, Claude was 15 tools, mostly pretty basic (list dir, read file, write file, bash, etc.) with some very clever prompts, especially the task tool it uses to do the quasi-planning-mode task bullets (even when not in planning mode).
Honestly the idea of bringing this all together with an affordable monthly service and obviously some seriously creative “prompt engineers” is the magic/hard part (and making the model itself, obviously).
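For a flavour of how small those tools are, here's roughly what one looks like when you recreate it yourself in TypeScript; the name, schema, and executor are my own reconstruction, not Anthropic's actual definition:

    // Hypothetical minimal agent tool, in the JSON-schema style most LLM tool
    // APIs use, paired with the executor the agent loop calls.
    import { readFile } from "node:fs/promises";

    const readFileTool = {
      name: "read_file",
      description: "Read a file from the local filesystem and return its contents.",
      input_schema: {
        type: "object",
        properties: {
          path: { type: "string", description: "Absolute path to the file" },
        },
        required: ["path"],
      },
      run: async (input: { path: string }) => readFile(input.path, "utf8"),
    };

    // Example: what the loop would do when the model picks this tool.
    // readFileTool.run({ path: "/etc/hostname" }).then(console.log);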
› commit this, **SKIP BEADS**
• Ran git status -sb
└ ## main...origin/main [ahead 4]
M containers/frigate.nix
?? .beads/bd.sock.startlock
• I see an untracked .beads/bd.sock.startlock file that might be normal but needs clarification. I also note the requirement to include a bead and confirm bead readiness
before proceeding, so I'll ask whether to create the bead and if the untracked file should be included in the commit.
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
• I can’t skip beads for this repo. The AGENTS.md rules are explicit: no work (including commits) without an open bead. Please confirm you want me to create a bead for
this commit.
I don't know what's wrong with your Codex, but mine can't bring itself to break the rules.

It's no different than asking for the architecture of the power supply or the architecture of the network switch that serves the building. Brilliant software engineers are going to have gaps on non-software things.
But if that's what they were going for, it should be something on a completely different and more abstract topic, like "develop a method for emptying your swimming pool without electricity in under four hours".
> There seems to be a mismatch of what you're describing would be issues (for instance about the quality of the agent) and what people are describing as the actual issues (terminal commands don't work, or input is lost arbitrarily).
I just named couple examples I've seen in issue tracker and `opencode` on quick skim has many similar issues about inputs and rendering issues in terminals too.
> Similar to your argument, I can name countless of situations where developers absolutely adamantly insisted that something was very hard to do, only for another developer to say "no you can actually do that like this* and fix it in hours instead of weeks.
Good example, as I have seen this too. But in this case, let's first see an `opencode`/`claude` equivalent written in "two weeks" that has no issues (or whose issues are fixed so fast they don't accumulate into the thousands) and supports any user on any platform. People building stuff only for themselves (N=1) and claiming the problem is simple don't count.
---------
Like the guy two days ago claiming that "the most basic feature"[1] in an IDE is a _terminal_. But then we see threads popping up on HN about Ghostty or Kitty or whatever, and how those terminals are a godsend and everything else is crap. They may be right, but that software took years (and probably tens of man-years) to write.
What I am saying is that just throwing out phrases that something is "simple" or "basic" needs proof, but at the time of writing I don't see examples.
This is indeed a nonsensical timeframe.
> What I am saying is that just throwing out phrases that something is "simple" or "basic" needs proof, but at the time of writing I don't see examples.
Fair point.
> This is indeed a nonsensical timeframe.
Sorry - I should have explained that it's ironic hyperbole. I was thinking quotes would be enough, but Poe's law strikes again.