Sonnet 4 has definitely been the best model for our product's use case, but I'd be interested in trying Haiku 4 (or 4.1?) just due to the cost savings.
I'm surprised Anthropic hasn't mentioned anything about Haiku 4 yet since they released the other models.
I uploaded a web design of mine (jpeg) and asked Claude to create the html/css. Asked GPT to do the same. GPT's code looked the closest to the design I created and uploaded. Just five to ten small tweaks and I was done, versus Claude, where it would have taken me almost triple the steps.
I actually subscribed to both today (resubscribed to GPT) and I'm going to keep testing which one is the better front-end developer (I am, but I've got to embrace AI).
What's the point of these?
Kind of interesting that we live in an era of super-advanced AI, but still make basic UI/UX mistakes. The tagline of this blog post shouldn't be "1 min read".
It's not even accurate. I timed myself, not reading fast but not slow, and it took me 3 min 30 s. Maybe the images need to be OCRed to make the estimate more accurate.
It's making really stupid errors and I have to work three times as much to get the same results as last week.
Economics is important. Best bang for the buck seems to be OpenAI ChatGPT 4.1 mini[6]. Does a decent job, doesn't flood my context window with useless tokens like Claude does, API works every time. Gets me out of bad spots. Can get confused, but I've been able to muddle through with it.
1: https://openrouter.ai/anthropic/claude-opus-4.1
2: https://openrouter.ai/anthropic/claude-sonnet-4
3: https://block.github.io/goose/
4: https://openrouter.ai/anthropic/claude-3.5-sonnet
(He had been stuck in the Team Rocket hideout (I believe) for weeks)
At least Sonnet 4 is still usable, but I'll be honest, it's been producing worse and worse slop all day.
I've basically wasted the morning on Claude Code when I should've just been doing it all myself.
> We plan to release substantially larger improvements to our models in the coming weeks.
Let's see: we have Claude Code vs. Claude the API vs. Claude the website, and they're totally different from each other? One is command line, one integrates into your IDE (which IDE?) and one is just browser based, I guess. Then you have the different pricing plans, Free, Pro, and Max? But then there's also Claude Team and Claude Enterprise? These are monthly plans that only work with Claude the Website, but Claude Code is per-request? Or is it Claude API that's per-request? I have no idea. Then you have the models: Claude Opus and Claude Sonnet, with various version numbers for each?? Then there's Cline and Cursor and GOOD GRIEF! I just want to putz around with something in VSCode for a few hours!
I've used Aider for a while, and I kind of liked it, but it felt like it needed way more manual work, and I also want to use different models, probably locally hosted. I haven't used Aider in 2 or 3 months, so I don't know if it has already evolved in that direction...
edit: on the other hand, the automatic feedback loop means it sometimes goes very crazy and the API costs skyrocket easily. But maybe that's another reason to run it locally.
This makes them (Anthropic) worse than OpenAI in terms of openness.
Since in this case as we all know. [0]
"What will permanently change everything is open source and transparent AI models that are smaller and more powerful than GPT-3 or even GPT-4."
I don't believe anyone saying Sonnet yields better results than Opus though, as my experience has been exactly the opposite. But trade-off wise, I can definitely see it being a better experience when used interactively because of its speed and lower cost.
Fwiw I have a Claude pro plan and have no interest in using other offerings so I'm not sure if they're super simple (one model, one interface, one pricing plan)?
Claude Code is currently best-in-class, so no point in starting elsewhere, but you do need to read the documentation.
Anthropic has this useful quick start guide: https://docs.anthropic.com/en/docs/claude-code/quickstart
But I would recommend just starting with Claude in the browser: talk through an idea for a project you have and ask it to build it for you. Go ahead and have a brainstorming session before you actually ask it to code - it'll help make sure the model has all of the context. Don't be afraid to overload it with requirements - it's generally pretty good at putting together a coherent plan. If the project is small/fits in a single file - say a one-page web app or a complicated data schema + SQL queries - then it can usually do a pretty good job in one go. Then just copy+paste the code and run it out of the browser.
This workflow works well for exploring and understanding new topics and technologies.
Cursor is nice because it's an AI integrated IDE (smoother than the VSCode experience above) where you can select which models to use. IMO it seems better at tracking project context than Gemini+VSCode.
Hope this helps!
[1] https://platform.openai.com/docs/guides/flex-processing?api-...
I'm talking multiple tries of Claude 4 Opus, Gemini 2.5 Pro, o3, etc., resulting in sometimes hundreds of lines of code.
Versus o3-pro (very slowly) analyzing and then fixing something that seemed completely unrelated in a one or two line change and truly fixing the root cause.
o3-pro-level LLMs at reduced cost and increased speed will already be amazing.
There's also claude-code-proxy to make Claude Code use other models.
These benchmark gains aren't that high, so I doubt it is that obvious.
One obvious explanation is that pricing is strongly tied to their own costs, and that their only incentive is for people to use an expensive model if they really need it.
I forget which one of the GPT models was better, faster, and cheaper than the previous model. The incentive there is obviously, "If you want to use the old model for whatever reason, fine, but we really want you to use the new one because it costs us less to run."
Claude code is actually one of the most straightforward products I've used as far as onboarding goes. You download the tool, and follow the instructions. You can use one of the 3 plans, and everything else is automatic. You can figure out token usage and what models and versions to use and how to use MCP servers and all of that -- there's a lot of power -- but you don't need to do ANY of that to get started trying it out.
You're not being:
> That critic who doesn't try the stuff he criticizes
You're being:
> That critic who is trying to confirm their biases
It is actually one of my most useful use cases of this tech. Nice to have a way to ask in private so you don’t get snarky answers like: it’s just like buying shoes!
E.g. if you need a self-contained script to do some data processing, Opus can often do that in one shot. A 500-line Python script would cost around $1, and as long as it's not tricky it just works - you don't need back-and-forth.
I don't think it's possible to employ any human to write a 500-line Python script for $1 (unless it's a free intern or a student), let alone do it in one minute.
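A rough back-of-envelope for that ~$1 figure, assuming about 10 tokens per line of code and Opus pricing on the order of $15 per million input tokens and $75 per million output tokens (those rates and overheads are my assumptions, not exact numbers):

```python
# Back-of-envelope cost for a one-shot 500-line Python script from Opus.
# Assumed numbers: ~10 tokens per line of code, ~$15/Mtok input, ~$75/Mtok
# output, a modest prompt, and ~2x output overhead for explanation/reasoning.

LINES = 500
TOKENS_PER_LINE = 10                         # rough average for Python
INPUT_TOKENS = 2_000                         # prompt + instructions, assumed
OUTPUT_TOKENS = LINES * TOKENS_PER_LINE * 2  # code plus commentary overhead

INPUT_PRICE = 15 / 1_000_000   # $/token, assumed input price
OUTPUT_PRICE = 75 / 1_000_000  # $/token, assumed output price

cost = INPUT_TOKENS * INPUT_PRICE + OUTPUT_TOKENS * OUTPUT_PRICE
print(f"~${cost:.2f}")         # ~$0.78, i.e. in the ballpark of $1
```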
Of course, if you use LLM interactively, for many small tasks, Opus might be too expensive, and you probably want a faster model anyway. Really depends on how you use it.
(You can do quite a lot in file-at-once mode. E.g. Gemini 2.5 Flash could write 35 KB of code for a full ML experiment in Python - self-contained with data loading, model setup, training, and evaluation, all in one file, pretty much on the first try.)
Given that there’s nothing close to scientific analysis going on, I find it hard to tell how big the “Sonnet is overall better, not just sometimes” crowd is. I think part of the problem is that “The bigger model is better” feels obvious to say, so why say it? Whereas “the smaller model is better actually” feels both like unobvious advice and also the kind of thing that feels smart to say, both of which would lead to more people who believe it saying it, possibly creating the illusion of consensus.
I was trying to dig into this yesterday, but every time I come across a new thread the things people are saying and the proportions saying what are different.
I suppose one useful takeaway is this: If you’re using Claude Max and get downgraded from Opus to Sonnet for a few hours, you don’t have to worry too much about it being a harsh downgrade in quality.
When can we replace doctors with it?
I do agree it hit the token limit a lot quicker than before, when I could chat for hours without worrying about it.
Either way, I still have one last yak to shave for this project, so we'll see how efficient it is with that. If it accomplishes the task before burning through all the tokens, then it's a win-win, I suppose.
Welcome to the machine
It's almost as if companies sell more than one product.
Why is this the top comment on so many threads about tech products?
This is a well-known and documented phenomenon - the paradox of choice.
I've been working in machine learning and AI for nearly 20 years and the number of options out there is overwhelming.
I've found many of the tools out there do some things I want, but not others, so even finding the model or platform that does exactly what I want or does it the best is a time-consuming process.
Actually, to try it out, prepaid token billing is fine. You are not required to have a subscription for claude code cli. Even just $5 gave me enough breathing room to get a feeling for its potential, personally. I do not touch code often these days so I was relieved not to have to subscribe and cancel again just to play around a little and have it write some basic scripts for me.
My use case so far is usually requesting mechanical work I would rather describe than write myself, like certain test suites, and sometimes discovery on messy code bases.
Which it does a lot...
Small models are for querying the context
Opus is cheap if you use it for its niche
It's still pretty much impossible to have any LLM one-shot a complex implementation. There are just too many details to figure out and too much to explain for it to get correct. Often there's uncertainty and ambiguity, and I only understand the correct answer (or rather, the less bad answer) after I've spent time deep in the code. Having Opus spit out a possibly correct solution just isn't useful to me. I need to understand _why_ we got to that solution and _why_ it's a correct solution for the context I'm working in.
For me, this means that I largely have an iteratively driven implementation approach where any particular task just isn't that complex. Therefore, Sonnet is completely sufficient for my day-to-day needs.
I stick with Sonnet for most things because it's generally good enough and I hit my token limits with it far less often.
> Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.
LLMs are non-deterministic, so I think benchmarks should report averages over N runs rather than single-shot experiments.
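A minimal sketch of what that would look like: report a mean and spread over N runs instead of a single score (the per-run scores below are made up for illustration):

```python
import statistics

# Hypothetical pass rates from N independent runs of the same benchmark
# (made-up numbers - the point is to report spread, not a single shot).
runs = [0.72, 0.69, 0.75, 0.71, 0.74, 0.68, 0.73, 0.70]

mean = statistics.mean(runs)
stdev = statistics.stdev(runs)      # sample standard deviation
stderr = stdev / len(runs) ** 0.5   # standard error of the mean

print(f"score: {mean:.3f} +/- {1.96 * stderr:.3f} (approx. 95% CI over {len(runs)} runs)")
```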
Because I've found it to work pretty amazingly for things that don't need to be exact (like data modeling) or don't have any security implications (public apps). But for everything else I end up having to find all the little bugs by reading the code line by line, which is much slower than just writing the code in the first place.
They might not fit your personal definition of "openness", but they do fit many other equally valid interpretations of that concept.
If you look at the past, whenever Google announces something major, OpenAI almost always releases something as well.
People forget that OpenAI was started to compete with Google on AI.
In my experience it takes weeks if not months to coordinate a release, from testing to documentation to drafting press releases in multiple languages to benchmarks and website updates.
I’m old and I’ve been in this industry most of my life. I have never once seen or heard of all of that work being done and the company just waiting on competitors before pulling the trigger.
Create a new directory in your terminal
Open that directory, then type `claude` to run Claude Code
Press Shift + Tab to go into planning mode
Tell Claude what you want to build - recommend something simple to start with. Specify the languages, environment, frameworks you want, etc.
Claude will come up with a plan. Modify the plan or break it into smaller chunks if necessary
Once the plan is approved, ask it to start coding. It will ask you for permissions and give you the finished code
It really is something when you actually watch it go.
There are additional storage costs with Google caching, around $3.75 for 5 minutes/Mtok, and Claude Opus is $3.75 for 5-minute cache writes/Mtok.
For cached reads Gemini Pro is 5X cheaper than Opus and like $0.01 more than Sonnet.
> Small models are for querying the context
I respectfully disagree.
My experience is that large models are capable of understanding large contexts much better. Of course they are more expensive and slower, too. But in terms of accuracy, large models are always better at querying the context.
You can look at the primal to check the mean, or at the dual to get out of local minima.
In all cases, the model, tokenizer, etc. are just different enough that it will generally pay off in those spaces quickly.
Opus gives you a bit more rope to hang yourself with, imo. Yes, it "thinks" slightly better, but still not well enough for me. But it can be good enough to convince you that it can do the job... so I dunno, I almost dislike it in this regard. I find Sonnet just easier to predict.
Could I use Opus like I do Sonnet? Yes, definitely, and generally I do. But then I don't really see much difference, since I'm hand-holding so much.
I use Opus exclusively and don't hit limits. ccusage reports I'm using the API equivalent of $2000/mo.
I get that it's not an easy problem to solve, but how is Anthropic supposed to solve the actual alignment problem if they can't even stop their production LLMs from glazing the user all the time? And OpenAI is somehow even worse.
I expect to be completely blown away by GPT-5 in the first few days and then over time I will figure out the limitations of the model. Then I will be less impressed because you don't know what it can't do at first.
Claude Max is tens of hours of Opus a month, or you can pay per token and have unlimited.
Or did you mean “I wish it was cheaper”?
I'm outputting a PR every 6 minutes. The reviewers are using Claude to review everything. It used to take a day to add 100 lines to the codebase... now I can add 100 lines in one prompt.
If I want even more productivity (at risk of making the rest of my team look slow) I can tell Claude to output double the lines and ship it off for review. My performance metrics are incredible
My current bottleneck is having to review the huge amounts of code that these models spit out. I do TDD, use auto-linting and type-checking.... but the model makes insidious changes that are only visible on deep inspection.
I find that the token/credit restrictions make Opus near useless, even when using Claude Code. I only ever switch to it to get another model's take on the issue. Five minutes of use and I have hit the limit.
Same context length and throughput limits?
Anecdotally, I found gpt-4.1 (and mini) were pretty good at those agentic programming tasks, but the lack of token caching made the costs blow up with long context.
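A rough sketch of why: an agent loop re-sends the same long context every turn, and without a cache discount that re-sent input dominates the bill (all prices and sizes below are assumptions for illustration, not actual vendor pricing):

```python
# Why long-context agent loops get expensive without prompt caching.
# All numbers below are assumptions for illustration, not real pricing.

CONTEXT_TOKENS = 100_000        # long prompt (repo context, history) re-sent each turn
TURNS = 20                      # agentic loop iterations
INPUT_PRICE = 3 / 1_000_000     # $/token for fresh input (assumed)
CACHED_PRICE = 0.3 / 1_000_000  # $/token for cache hits (assumed ~10x cheaper)

no_cache = TURNS * CONTEXT_TOKENS * INPUT_PRICE
with_cache = CONTEXT_TOKENS * INPUT_PRICE + (TURNS - 1) * CONTEXT_TOKENS * CACHED_PRICE

print(f"without caching: ${no_cache:.2f}")    # $6.00
print(f"with caching:    ${with_cache:.2f}")  # $0.87
```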
Maybe I'm out of touch, but I'm not handing out my phone number to sign up for random SaaS tools.
We have the $200 plans for work and despite only using Opus, we rarely hit the limits. CCUsage suggests the same via API would have been ~$2000 over the last month (we work 5 hours a day, 4 days a week, almost always with Claude).
It uses way fewer tokens, or uses them much more effectively, when running locally.
Also, there's a CLI argument that lets you specify the model; try `claude --help`.
It's very, very helpful. However, there are still a lot of problems I only discover/figure out after I've been working in the code.
Maybe Opus just is better
A major part of software engineering is identifying and resolving issues during implementation. Plans are a good outline of what needs to be done, but they're always incomplete and inaccurate.
Subagents seem pretty similar to using zen mcp w/ OpenRouter but maybe better or at least more turnkey? I'll be checking them out.
Instead, ideally they’d run the benchmark tests many times, and share all of the results so we could make statistical determinations.
But for someone who hasn't been immersed in the "LLM scene", it's hard to understand why you might want to use one particular model over another. It's hard to understand why you might want to do per-request API pricing vs. a bucketed usage plan. This is a new technology, and the landscape is changing weekly.
I think maybe it might be nice if folks around here were a bit more charitable and empathetic about this stuff. There's no reason to get all gatekeep-y about this kind of knowledge, and complaining about these questions just sounds condescending and doesn't do anyone any good.
Contrast to something like OpenAI. They've got gpt4.1, 4o, and o4. Which of these are newer than one another? How do people remember which of o4 and 4o are which?
I absolutely loathe this timeline we're stuck in.
If you like an IDE, for example VS Code, you can have the terminal open at the bottom and run Claude Code in that. You can put your instructions there and any edits it makes are visible in the IDE immediately.
Personally I just keep a separate terminal open and have the terminal and VSCode open on two monitors - seems to work OK for me.
It's not 10x, but those guys do seem like they've hit somewhere around 2x improvement overall.
Unfortunately there's no easy tool to inspect usage. I started a project to parse the Claude logs using Claude and generate a Chrome trace with it. It's promising but it was taking my tokens away from my core project.
Example: you need to review some code to see if it has proper test coverage.
If you use the "main" context, it'll waste tokens on reading the codebase and running tests to see coverage results.
But if you launch an agent (a subprocess pretty much), it can use a "disposable" context to do that and only return with the relevant data - which bits of the code need more tests.
Now you can either use the main context to implement the tests or if you're feeling really fancy launch another sub-agent to do it.
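Roughly, in SDK terms, the idea looks like this - just a sketch of the "disposable context" pattern using the Anthropic Python SDK, not how Claude Code actually wires up its sub-agents; the model ID, prompts, and file name are placeholders:

```python
# Sketch of the "disposable context" idea with the Anthropic Python SDK.
# The sub-task gets its own fresh message history, and only a short
# summary comes back to the main conversation.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_subagent(task: str, context: str) -> str:
    """Run a one-off task in a throwaway context and return only the summary."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model ID; a cheaper model is fine here
        max_tokens=1024,
        messages=[{"role": "user", "content": f"{task}\n\n{context}"}],
    )
    return response.content[0].text  # the throwaway context is then discarded

# The main context only ever sees the short answer, not all the code
# the sub-agent had to read to produce it.
summary = run_subagent(
    "Review this module and list which functions lack test coverage. Reply briefly.",
    open("mymodule.py").read(),  # hypothetical file
)
print(summary)
```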
Interestingly I found that prompting it to ask the o3 submodel (which they call The Oracle) to check Sonnet's working on a debugging solution was helpful. Extra interesting to me was the fact that Sonnet appeared to do a better job once I'd prompted that (like chain of thought prompting, perhaps asking it to put forward an explanation to be checked actually triggered more effective thinking).
But I totally agree there's no way it lasts. I'm mostly only using this for side projects and I'm sitting there interacting with it, not YOLO'ing, I do sometimes have two sessions going at the same time but I'm not firing off swarms or anything crazy. Just have it set to Opus and I chat with it.
And of course you could be doing it right but the people saying it works great could themselves be wrong about how good it is.
On top of that it costs both money and time/effort investment to figure out if you're doing it wrong. It's understandable to want some clarity. I think it's pretty different from buying shoes.
Because you overestimate how well the representative person understands the difference.
A more accurate analogy is that Nike sells green-blue shoes and Nike sells blue-green shoes, but the blue-green shoes add 3 feet to your jump and the green-blue shoes add 20 mph to your 100-yard dash.
You know you need one of them for tomorrow's hurdles race but have no idea which is meaningful for your need.
With all this LLM cruft all you get is essentially the same old chat interface that's like the year 2000 called and wants its on-line chat websites back. The only thing other than a text box that you usually get is a model selector dropdown squirreled away in a corner somewhere. And that dropdown doesn't really explain the differences between the cryptic sounding options (GPT-something, Claude Whatever...). Of course this confuses people!
I do know the answer to OP's question but that's because I pickle my brain in this stuff. It is legitimately confusing.
The analogy to different SKUs strikes me as inaccurate too. This isn't the difference between shoes, shirts, and shorts - it's more as if a company sells three t-shirts but you can't really tell what's different about them.
It's Claude, Claude, and Claude. Which ones code for you? Well, actually, all of them (Code, web/desktop Claude, and the API can all do this)
Which ones do you ask about daily sundry queries? Well, two of them (web/desktop Claude, but also the API, but not Code). Well, except if your sundry query is about a programming topic, in which case Code can also do that!
Ok, if I do want to use this to write code, which one should I use? Honestly, any of them, and the company does a poor job of explaining why you would use each option.
"Which of these very similar-seeming t-shirts should I get?" "You knob. How are posts like this even being posted." is just an extremely poor way to approach other people, IMO.
At least with those you can buy whatever you think is coolest. Which Claude model and interface should the average programmer use?
I haven't tried it myself, but I've heard from people that Opus can be slow when using it for coding tasks. I've only been using Sonnet, and it's performed well enough for my purposes.
We're all bottlenecked on reviewing now. That's a good thing.
What you're looking for, are the landing pages of the B2B API products underlying these B2C experiences. That would be https://www.anthropic.com/claude, https://openai.com/api/, etc. (In general, search "[AI company] API".)
From those B2B landing pages, you can usually click through to pages with details about each of their models.
Here's the model page corresponding to this news announcement, for example: https://www.anthropic.com/claude/opus
(Also, note how these B2B pages are on the AI companies' own corporate domains; whereas their B2C products have their own dedicated domains. From their perspective, their B2C offerings are essentially treated as separate companies that happen to consume their APIs — a "reference use-case" — rather than as a part of what the B2B company sells.)
I prefer configuring it to use Sonnet for things that don't require much reasoning/intelligence, with Opus as the coordinator.
What's 100x productivity multiplied by 100 instances of Claude? 10,000x productivity
Now, to be fair and a bit more realistic, it's not actually 10,000x because it takes longer to push the PR when the file sizes are so big. Let's call it 9,800x. That's still a sizable improvement.
It's not always a literal 10x difference in time for task A w/ AI vs. task A w/o AI...
Lapses of judgement and syntax errors happen, but they're easier to spot because you know exactly what you're looking at. When code is written by a model, I have to review it 3 times.
1st to understand the code. 2nd to identify lapses in suspicious areas. 3rd to confirm my suspicions through interactive tests, because the model can use patterns I'm unfamiliar with, and it takes me some googling to confirm whether certain patterns used by the model are outright bugs or not. The biggest time sink is fixing an identified bug, because now you're doing it in someone else's (the model's) legacy code rather than a greenfield feature implementation.
It's a big productivity bump. But if reviewing is the bottleneck, then that upper-bounds the productivity gains at ~4x for me. Still incredible technology, but not the death of software engineering that it is claimed to be.
I wouldn't be surprised if asking for a phone number lowers the fraud rate enough to compensate for the added friction.
[0] Incidentally, this is also why many AI API providers ask for your money upfront (buy credits) unless you're big enough and/or have existing relationship with them.
I started adding an instruction file along the lines of "Always tell me your plan to solve the issue first with short example code, never edit files without explicit confirmation of your plan" at the start and it is like a day and night difference in how useful it becomes. It also starts to feel like programming again where you can read through various files and instead of thinking in your head, you write out your thoughts. You end up getting confirmation or push back on errors that you can clean up.
Reading through a sort-of-wrong, sort-of-right implementation spread across various files after every prompt just really sucked.
I'm not one shotting massive amounts of files, but I am enjoying the lack of grunt work.
Shoe shopping is pretty complex, more so than trialing an AI model in my opinion.
Are you a construction worker, a banker, a cashier or a driver? Are you walking 5 miles everyday or mostly sedentary? Do you require steel toed shoes? How long are you expecting them to last and what are you willing to pay? Are you going to wear them on long runs or take them river kayaking? Do they need to be water resistant, waterproof or highly breathable? Do you want glued, welted, or stitch down construction? What about flat feet or arch support? Does shoe weight matter? What clothing are you going to wear them with? Are you going to be dancing with them? Do the shoes need a break in period or are they ready to wear? Does the available style match your preferences? What about availability, are you ok having them made to order or do you require something in stock now?
By comparison I can try 10 different AI services without even needing to stand up for a break while I can't buy good dress shoes in the same physical store as a pair of football cleats.
Thanks for articulating the confusion better than I could! I feel it's a similar branding problem as other tech companies have: I'm watching Apple TV+ on my Apple TV software running on my Apple TV connected to my Google TV that isn't actually manufactured by Google. But that Google TV also has an Apple TV app that can play Apple TV+.
It's not like running a tool in your IDE or CLI where the only difference is the interface. It would be like if gcc run from your IDE had faster compile times, but gcc run from the CLI gave better optimizations.
The fact that no one is recommending any baseline to start with proves the point that it's confusing. And we haven't even touched on Sonnet v Opus
Maybe the problem is I don't take shoes seriously enough? Something to work on...
That's a silly claim to me; we're talking about a completely new environment where you prompt an AI to develop code, and therefore an "average programmer" is unlikely to have any meaningful experience or intuition with this flow. That is exactly what GP is talking about - where does he plug in the AI? What tradeoffs are there between the different options?
The other day I had someone judge me for asking this question by dismissively saying "don't say you've still been using ChatGPT and copy/paste", which made me laugh - I don't use AI at all, so who was he looking down on?
Oh c'mon, now you're just being disingenuous, trying to make an argument for argument's sake.
No, shoe shopping is not more complicated than trialing an LLM. For all of those questions about shoes you are posing, either a) a purchaser won't care and won't need to ask them, or b) they already know they have specific requirements and will know what to ask.
With an LLM, a newbie doesn't even know what they're getting into, let alone what to ask or where to start.
> By comparison I can try 10 different AI services without even needing to stand up for a break
I can't. I have no idea how to do that. It sounds like you've been following the space for a while, and you're letting your knowledge blind you to the idea that many (most?) people don't have your experience.
I.e. it seems we don't get much more than new training run levels of improvement anymore. Which is better than nothing, but a shame compared to the early scaling.
I'm not sure if you ever got a good rundown, but the tl;dr is that the 3 products ("Desktop", Code, and API) all expose the same underlying models, but are given different prompts, tools, and context management techniques that make them behave fairly differently and affect how you interact with them.
- The API is the bare model itself. It has some coding ability because that's inherent to the model - you can ask it to generate code and copy and paste it, for example (a minimal sketch follows after this list). You normally wouldn't use this except if you're using some Copilot-type IDE integration where the IDE is doing the work of talking to the model for you and integrating it into your developer experience. In that case you provide an API key and the IDE does the heavy lifting.
- The desktop app is actually a half-decent coder. It's capable of producing specific artifacts, distinguishing between multiple "files" it's writing for you, and revisiting previously-written code. "Oh, actually rewrite this in Go." is for example a thing it can totally do. I find it useful for diagnosing issues interactively.
- "Claude Code" is a CLI-only wrapper around the model. Think of it like Anthropic's first-party IDE integration, except there's not an IDE, just the CLI. In this case the integration gives the tool broad powers to actually navigate your filesystem, read specific files, write to specific files, run shell commands like builds and tests, etc. These are all functions that an IDE integration would also give you, but this is done in a Claude-y way.
My personal take is: try Claude Code, since as long as you're halfway comfortable with a CLI it's pretty usable. If you really want a direct IDE integration you can go with the IDE+API key route, though keep in mind that you might end up paying more (Claude Code is all-you-can-eat-with-rate-limits, where API keys will... just keep going).
If you allow yourself to be a novice and a learner with AI and LLMs and don't expect to start out as a "shoe expert" where you never even think about this in your life and it's not even an annoyance, you'll find that it's the exact same journey.
Here's a quick guide to get you started with AI coding assistants:
## Quick Start Options (Easiest)
*1. Web-based (Nothing to Download)*
- *Claude.ai* - You're here! I can help with code, debug, explain concepts
- *ChatGPT* - Similar capabilities, different model
- *GitHub Copilot Chat* - Web interface if you have GitHub account

*2. IDE Extensions (Most Popular)*
- *Cursor* - Full VS Code replacement with AI built-in. Download from cursor.com, works out of the box
- *GitHub Copilot* - Install as VS Code/JetBrains extension ($10/month), autocompletes as you type
- *Continue* - Free, open-source VS Code extension, lets you use multiple models

*3. Command Line*
- *Claude Code* - Anthropic's terminal tool for autonomous coding tasks. Install via `npm install -g @anthropic-ai/claude-code`
- *Aider* - Open-source CLI tool that edits files directly
## What They Do
- *Autocomplete tools* (Copilot, Cursor) - Suggest code as you type, finish functions
- *Chat tools* (Claude, ChatGPT) - Explain, debug, design systems, write full programs
- *Autonomous tools* (Claude Code, Aider) - Actually edit your files, make changes across codebases
## My Recommendation to Start
1. Try *Cursor* first - download it, paste in some code, and ask it questions. It's the most beginner-friendly
2. Or just start here in Claude - paste your code and I can help debug, explain, or write new features
3. Once comfortable, try GitHub Copilot for in-line suggestions while coding
The key is just picking one and trying it - you don't need to understand everything upfront!
Maybe there's a need to try ten different ones but I just stuck with one and can now convince it to do what I want it to do pretty successfully.
Do you mostly use opus?
And it seems the story you shared sort of proves the point: the web interface worked fine for you and you didn't need to question it until someone was needlessly rude about it.
And to some extent it is like the PC race. Imagine going to work and writing software for whatever devices your company writes software for, in whatever toolchain your company uses. Then, 2-3 years after the PC race began heating up, asking "Hey, I only really write code for whatever devices my employer gives me access to. Now I want to buy one of these new PCs, but I don't really understand why I'd choose an Intel over a Motorola chipset, or why I'd prioritize more ROM or more RAM, and I keep hearing about this thing called RISC that's way better than CISC, and some of these chips claim to have different addressing modes that are better?"
In what way is this analogous? Running scripts is vastly different than AI codemod. I could easily answer how when and why a build system would be plugged in, and linting and formatting are long-established pathways.
On the flipside there are barely even established practices, let alone best ones, for using AI. The point being offered is that AI companies offer shockingly little guidance on how to use their apparently amazing tool.
I personally have never used AI to author code, so I don't really know how the story I provided proves anything to you. I like it to answer questions about why something isn't working to help give me some leads, and it is good at telling you how to use a new framework quickly, but that's a pretty different practice than it authoring code. Seems like you're kinda dodging the question too.