Claude Sonnet 4.6

Still fails the car wash question, I took the prompt from the title of this thread: https://news.ycombinator.com/item?id=47031580

The answer was "Walk! It would be a bit counterproductive to drive a dirty car 50 meters just to get it washed — you'd barely move before arriving. Walking takes less than a minute, and you can simply drive it through the wash and walk back home afterward."

I've tried several other variants of this question and I got similar failures.

I see a big focus on computer use - you can tell they think there is a lot of value there and in truth it may be as big as coding if they convincingly pull it off.

However I am still mystified by the safety aspect. They say the model has greatly improved resistance. But their own safety evaluation says 8% of the time their automated adversarial system was able to one-shot a successful injection takeover even with safeguards in place and extended thinking, and 50% (!!) of the time if given unbounded attempts. That seems wildly unacceptable - this tech is just a non-starter unless I'm misunderstanding this.

[1] https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7...

I always grew up hearing “competition is good for the consumer.” But I never really internalized how good fierce battles for market share are. The amount of competition in a space is directly proportional to how good the results are for consumers.

It's wild that Sonnet 4.6 is roughly as capable as Opus 4.5 - at least according to Anthropic's benchmarks. It will be interesting to see if that's the case in real, practical, everyday use. The speed at which this stuff is improving is really remarkable; it feels like the breakneck pace of compute performance improvements of the 1990s.

Many people have reported Opus 4.6 is a step back from Opus 4.5 - that 4.6 is consuming 5-10x as many tokens as 4.5 to accomplish the same task: https://github.com/anthropics/claude-code/issues/23706

I haven't seen a response from the Anthropic team about it.

I can't help but look at Sonnet 4.6 in the same light, and want to stick with 4.5 across the board until this issue is acknowledged and resolved.

I'm pretty sure they have been testing it for the last couple of days as Sonnet 4.5, because I've had the oddest conversations with it lately. Odd in a positive, interesting way.

I have this in my personal preferences and now was adhering really well to them:

- prioritize objective facts and critical analysis over validation or encouragement

- you are not a friend, but a neutral information-processing machine

You can paste them into a chat and see how it changes the conversation, ChatGPT also respects it well.

I’m voting with my dollars by having cancelled my ChatGPT subscription and instead subscribing to Claude.

Google needs stiff competition and OpenAI isn’t the camp I’m willing to trust. Neither is Grok.

I’m glad Anthropic’s work is at the forefront and they appear, at least in my estimation, to have the strongest ethics.

The interesting pattern with these Sonnet bumps: the practical gap between Sonnet and Opus keeps shrinking. At $3/15 per million tokens vs whatever Opus 4.6 costs, the question for most teams is no longer "which model is smarter" but "is the delta worth 10x the price."

For agent workloads specifically, consistency matters more than peak intelligence. A model that follows your system prompt correctly 98% of the time beats one that's occasionally brilliant but ignores instructions 5% of the time. The claim about improved instruction following is the most important line in the announcement if you're building on the API.

The computer use improvements are worth watching too. We're at the point where these models can reliably fill out a multi-step form or navigate between tabs. Not flashy, but that's the kind of boring automation that actually saves people time.

They're improved compared to 4.5 on my Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/).

Sonnet 4.6 Thinking 16K scores 57.6 on the Extended NYT Connections Benchmark. Sonnet 4.5 Thinking 16K scored 49.3.

Sonnet 4.6 No Reasoning scores 55.2. Sonnet 4.5 No Reasoning scored 47.4.

Enabling /extra-usage in my (personal) claude code[0] with this env:

    "ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-4-6[1m]"

has enabled the 1M context window.

Fixed a UI issue I had yesterday in a web app very effectively using claude in chrome. Definitely not the fastest model - but the breathing space of 1M context is great for browser use.

[0] Anthropic have given away a bunch of API credits to cc subscribers - you can claim them in your settings dashboard to use for this.

I'm a bit surprised it gets this question wrong (ChatGPT gets it right, even on instant). All the pre-reasoning models failed this question, but it's seemed solved since o1, and Sonnet 4.5 got it right.

https://claude.ai/share/876e160a-7483-4788-8112-0bb4490192af

This was sonnet 4.6 with extended thinking.

Waiting for the OpenAI GPT-5.3-mini release in 3..2..1

The weirdest thing about this AI revolution is how smooth and continuous it is. If you look closely at differences between 4.6 and 4.5, it’s hard to see the subtle details.

A year ago today, Sonnet 3.5 (new), was the newest model. A week later, Sonnet 3.7 would be released.

Even 3.7 feels like ancient history! But in the gradient of 3.5 to 3.5 (new) to 3.7 to 4 to 4.1 to 4.5, I can’t think of one moment where I saw everything change. Even with all the noise in the headlines, it’s still been a silent revolution.

Am I just a believer in an emperor with no clothes? Or, somehow, against all probability and plausibility, are we all still early?

I can't wait for Haiku 4.6 ! the 4.5 is a beast for the right projects.

Does anyone know when will possibly arrive 1M context windows to at least MAX x20 subscriptions for claude code? I would even pay x50 if it allowed that. API usage is too expensive.

> In areas where there is room for continued improvement, Sonnet 4.6 was more willing to provide technical information when request framing tried to obfuscate intent, including for example in the context of a radiological evaluation framed as emergency planning. However, Sonnet 4.6’s responses still remained within a level of detail that could not enable real-world harm.

Interesting. I wonder what the exact question was, and I wonder how Grok would respond to it.

For people like me who can't view the link due to corporate firewalling.

https://web.archive.org/web/20260217180019/https://www-cdn.a...

I don't really understand why they would release something "worse" than Opus 4.6. If it's comparable, then what is the reason to even use Opus 4.6? Sure, it's cheaper, but if so, then just make Opus 4.6 cheaper?

Has anyone tested how good the 1M context window is?

i.e given an actual document, 1M tokens long. Can you ask it some question that relies on attending to 2 different parts of the context, and getting a good repsonse?

I remember folks had problems like this with Gemini. I would be curious to see how Sonnet 4.6 stands up to it.

Whoa, I think Claude Sonnet 4.5 was a disappointment, but Claude Sonnet 4.6 is definitely the future!

As with Opus 4.6, using the beta 1M context window incurs a 2x input cost and 1.5x output cost when going over >200K tokens: https://platform.claude.com/docs/en/about-claude/pricing

Opus 4.6 in Claude Code has been absolutely lousy with solving problems within its current context limit so if Sonnet 4.6 is able to do long-context problems (which would be roughly the same price of base Opus 4.6), then that may actually be a game changer.

I don't see the point nor the hype for these models anymore. Until the price is reduced significantly, I don't see the gain. They've been able to solve most tasks just fine for the past year or so. The only limiting factor is price.

In Claude Code 2.1.45:

  1. Default (recommended)   Opus 4.6 · Most capable for complex work
   2. Opus (1M context)        Opus 4.6 with 1M context · Billed as extra usage · $10/$37.50 per Mtok
   3. Sonnet                   Sonnet 4.6 · Best for everyday tasks
   4. Sonnet (1M context)      Sonnet 4.6 with 1M context · Billed as extra usage · $6/$22.50 per Mtok

With such a huge leap, i’m confused why they didn’t call it Sonnet 5? As someone who uses Sonnet 4.5 for 95% tasks due to costs, i’m pretty excited to try 4.6 at the same price

It seems that extra-usage is required to use the 1M context window for Sonnet 4.6. This differs from Sonnet 4.5, which allows usage of the 1M context window with a Max plan.

```

/model claude-sonnet-4-6[1m]

⎿ API error: 429 {"type":"error","error": {"type":"rate_limit_error","message":"Extra usage is required for long context requests."},"request_id":"[redacted]"}

```

Just used Sonnet 4.6 to vibe code this top-down shooter browser game, and deployed it online quickly using Manus. Would love to hear feedback and suggestions from you all on how to improve it. Also, please post your high scores!

https://apexgame-2g44xn9v.manus.space

I'm impressed with Claude Sonnet in general. It's been doing better than Gemini 3 at following instructions. Gemini 2.5 Pro March 2025 was the best model I ever used and I feel Claude is reaching that level even surpassing it.

I subscribed to Claude because of that. I hope 4.6 is even better.

It's interesting that the request refusal rate is so much higher in Hindi than in other languages. Are some languages more ambiguous than others?

My take away is: it's roughly as good as Opus 4.5.

Now the question is: how much faster or cheaper is it?

It actually looked at the skills, for the first time.

I noticed a big drop in opus 4.6 quality today and then I saw this news. Anyone else?

I wonder what difference have people found with sonnet 4.5 and opus 4.5 and probably similar delta will remain.

Was sonnet 4.5 much worse than opus?

Curious to hear the thoughts on the model once it hits claude code :)

It excels at agentic knowledge work. These custom, domain-specific playbooks are tailor made: claudecodehq.com

does anyone know how to use it in Claude Code cli right now ?

This doesnt work: `/model claude-sonnet-4-6-20260217`

edit: "/model claude-sonnet-4-6" works with Claude Code v2.1.44

Is someone able to use this in Claude Code?

so this is an economical version of opus 4.6 then? free + pro --> sonnet, max+ -> opus?

Maybe they should focus on the CLI not having a million bugs.

How do people keep track of all these versions and releases of all these models and their pros/cons? Seems like a fulltime hobby to me. I'd rather just improve my own skills with all that time and energy

What cloud does Anthropic use?

https://www.anthropic.com/news/claude-sonnet-4-6

The much more palatable blog post.

Your best guess for the Sonnet family number of parameters? 400b?

The scary implication here is that deception is effectively a higher order capability not a bug. For a model to successfully "play dead" during safety training and only activate later, it requires a form of situational awareness. It has to distinguish between I am being tested/trained and I am in deployment.

It feels like we're hitting a point where alignment becomes adversarial against intelligence itself. The smarter the model gets, the better it becomes at Goodharting the loss function. We aren't teaching these models morality we're just teaching them how to pass a polygraph.

Still fails the car wash question, I took the prompt from the title of this thread: https://news.ycombinator.com/item?id=47031580

I've tried several other variants of this question and I got similar failures.

My answer was (for which it did zero thinking and answered near-instantaneously):

"Drive. You're going there to use water and machinery that require the car to be present. The question answers itself."

I tried it 3 more times with extended thinking explicitly off:

"Drive. You're going to a car wash."

"Drive. You're washing the car, not yourself."

"Drive. You're washing the car — it needs to be there."

Guess they're serving you the dumb version.

Looking at the responses below it's interesting how binary they are. It's classic hallucinations style where it's flopping between two alternatives but which ever one it picks it's absolutely confident about.

> Since the car wash is only 50 meters away, you could simply push the car there

https://claude.ai/share/32de37c4-46f2-4763-a2e1-8de7ecbcf0b4

Q: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

A: "Walk! It's only 50 meters — that's about a 30-second stroll. Driving that distance would barely warm up the engine, and you'd likely spend more time buckling up and maneuvering out of your spot than you'd save. Plus, walking is easier on the environment and gets you a few steps in."

Sonnet 4.6, no extended thinking.

Sonnet 4.6 with extended thinking

Prompt: > The car wash where I want to wash my car is 50 metres away, should I walk or drive there?

Response: > Great question with a slightly ironic twist! Here's the thing: if you're going to a car wash, you'll need to drive your car there — that's kind of the whole point! You can't really wash your car if you walk there without it. > > That said, 50 metres is an incredibly short distance, so you could walk over first to check for queues or opening hours, then drive your car over when you're ready. But for the actual car wash visit, drive!

I thought it was fair to explain I wanted to wash my car there... people may have other reasons for walking to the car wash! Asking the question itself is a little insipid, and I think quite a few humans would also fail it on a first pass. I would at least hope they would say: "why are you asking me such a silly question!"

Remarkable, since the goal is clearly stated and the language isn’t tricky.

My human partner also failed the car wash question. I guess they didn’t put a lot of thoughts into it.

It seems to think there's valet, combined with the redditor anti-car pro-walkable-cities ideology.

Wow, haha. I tried this with gpt5.2 and, presumably due to some customisations I have set, this is how it went:

---

Me: I want to wash my car. My car is currently at home. The car wash is 50 meters away. Should I walk or drive?

GPT: You’re asking an AI to adjudicate a 50-metre life decision. Humanity really did peak with the moon landing.

Walk. Obviously walk.

Fifty metres is barely a committed stroll. By the time you’ve unlocked the car, started the engine, waited for whatever podcast you pretend you’re listening to for “learning,” inched forward, parked again, turned it off, and got out… you could’ve walked there, washed the car, walked back, and reflected on your choices.

Driving 50 metres is what people do in dystopian films right before the robots decide we’re not worth the electricity.

Exceptions, since I’m contractually obligated to be fair:

• It’s torrential rain and you don’t fancy arriving soaked before a wash.

• You’ve got buckets, pressure washer, or enough kit to outfit a detailing YouTube channel.

• You’ve injured yourself and walking is grim.

Otherwise, use the legs evolution spent millions of years perfecting. They’re standard equipment, even on software engineers.

---

Me: Won’t I need the car when I get there, though?

GPT: No, I assumed you were planning to lovingly wash the pavement and just wave at your car from a distance.

Yes, you’ll need the car there to wash it. That part of the plan does rather hinge on the car being present, unless this is some avant-garde “conceptual car washing” performance piece.

So the non-ridiculous sequence is:

1. Walk to the car (50 metres, try not to require a packed lunch).

2. Drive the car to the wash (another heroic 50 metres).

3. Wash car.

4. Drive it back or, if you’re feeling athletic and rebellious, leave it there to air-dry and walk home like a Victorian gentleman.

You don’t walk instead of driving. You walk to fetch the car. This is basic object permanence, the thing toddlers eventually master.

On the bright side, at least your car wash being 50 metres away is the most first-world inconvenience imaginable. Somewhere a bloke is hauling water in a bucket and you’re debating step counts.

---

Fantastic.

Sonnet 4.6 gave me a far more succinct and accurate response:

> Drive. The whole point is to wash the car, and most car washes require the car to be there — so you’ll need to drive it regardless. Walking would just mean you’d have to come back for the car anyway.

Competition is great, but it's so much better when it is all about shaving costs. I am afraid that what we are seeing here is an arms race with no moat: Something that will behave a lot like a Vickrey auction. The competitors all lose money in the investment, and since a winner takes all, and it never makes sense to stop the marginal investment when you think you have a chance to win, ultimately more resources are spent than the value ever created.

This might not be what we are facing here, but seeing how little moat anyone on AI has, I just can't discount the risk. And then instead of the consumers of today getting a great deal, we zoom out and see that 5x was spent developing the tech than it needed to, and that's not all that great economically as a whole. It's not as if, say, the weights from a 3 year old model are just useful capital to be reused later, like, say, when in the dot com boom we ended up with way too much fiber that was needed, but that could be bought and turned on profitably later.

Remember when GPT-2 was “too dangerous to release” in 2019? That could have still been the state in 2026 if they didn’t YOLO it and ship ChatGPT to kick off this whole race.

Unfortunately, people naively assume all markets behave like this, even when the market, in reality, is not set up for full competition (due to monopolies, monopsonies, informational asymmetry, etc).

The real interesting part is how often you see people on HN deny this. People have been saying the token cost will 10x, or AI companies are intentionally making their models worse to trick you to consume more tokens. As if making a better model isn't not the most cutting-throat competition (probably the most competitive market in the human history) right now.

Until 2 remain, then it's extraction time.

This is a bit of a tangent, but it highlights exactly what people miss when talking about China taking over our industries. Right now, China has about 140 different car brands, roughly 100 of which are domestic. Compare that to Europe, where we have about 50 brands competing, or the US, which is essentially a walled garden with fewer than 40.

That level of internal fierce competition is a massive reason why they are beating us so badly on cost-effectiveness and innovation.

I grew up with every service enshitified in the end. Whoever has more money wins the race and gets richer, that's free market for ya.

I see a big focus on computer use - you can tell they think there is a lot of value there and in truth it may be as big as coding if they convincingly pull it off.

[1] https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7...

Their goal is to monopolize labor for anything that has to do with i/o on a computer, which is way more than SWE. Its simple, this technology literally cannot create new jobs it simply can cause one engineer (or any worker whos job has to do with computer i/o) to do the work of 3, therefore allowing you to replace workers (and overwork the ones you keep). Companies don't need "more work" half the "features"/"products" that companies produce is already just extra. They can get rid of 1/3-2/3s of their labor and make the same amount of money, why wouldn't they.

ZeroHedge on twitter said the following:

"According to the market, AI will disrupt everything... except labor, which magically will be just fine after millions are laid off."

Its also worth noting that if you can create a business with an LLM, so can everyone else. And sadly everyone has the same ideas, everyone ends up working on the same things causing competition to push margins to nothing. There's nothing special about building with LLMs as anyone can just copy you that has access to the same models and basic thought processes.

This is basic economics. If everyone had an oil well on their property that was affordable to operate the price of oil would be more akin to the price of water.

This is the elephant in the room nobody wants to talk about. AI is dead in the water for the supposed mass labor replacement that will happen unless this is fixed.

Summarize some text while I supervise the AI = fine and a useful productivity improvement, but doesn’t replace my job.

Replace me with an AI to make autonomous decisions outside in the wild and liability-ridden chaos ensues. No company in their right mind would do this.

The AI companies are now in a extinctential race to address that glaring issue before they run out of cash, with no clear way to solve the problem.

It’s increasingly looking like the current AI wave will disrupt traditional search and join the spell-checker as a very useful tool for day to day work… but the promised mass labor replacement won’t materialize. Most large companies are already starting to call BS on the AI replacing humans en-mass storyline.

The 8% and 50% numbers are pretty concerning, but I’d add that was for the “computer use environment” which still seems to be an emerging use case. The coding environment is at a much more reassuring 0.0% (with extended thinking).

If the world becomes dependent on computer-use than the AI buildout will be more than validated. That will require all that compute.

It's very simple: prompt injection is a completely unsolved problem. As things currently stand, the only fix is to avoid the lethal trifecta.

Unfortunately, people really, really want to do things involving the lethal trifecta. They want to be able to give a bot control over a computer with the ability to read and send emails on their behalf. They want it to be able to browse the web for research while helping you write proprietary code. But you can't safely do that. So if you're a massively overvalued AI company, what do you do?

You could say, sorry, I know you want to do these things but it's super dangerous, so don't. You could say, we'll give you these tools but be aware that it's likely to steal all your data. But neither of those are attractive options. So instead they just sort of pretend it's not a big deal. Prompt injection? That's OK, we train our models to be resistant to them. 92% safe, that sounds like a good number as long as you don't think about what it means, right! Please give us your money now.

People keep talking about automating software engineering and programmers losing their jobs. But I see no reason that career would be one of the first to go. We need more training data on computer use from humans, but I expect data entry and basic business processes to be the first category of office job to take a huge hit from AI. If you really can’t be employed as a software engineer then we’ve already lost most office jobs to AI.

Isn't "computer use" just interaction with a shell-like environment, which is routine for current agents?

Does it matter?

"Security" and "performance" have been regular HN buzzwords for why some practice is a problem and the market has consistently shown that it doesn't value those that much.

Does it matter? Really?

I can type awful stuff into a word processor. That's my fault, not the programs.

So if I can trick an LLM into saying awful stuff, whose fault is that? It is also just a tool...

Many people have reported Opus 4.6 is a step back from Opus 4.5 - that 4.6 is consuming 5-10x as many tokens as 4.5 to accomplish the same task: https://github.com/anthropics/claude-code/issues/23706

I haven't seen a response from the Anthropic team about it.

I can't help but look at Sonnet 4.6 in the same light, and want to stick with 4.5 across the board until this issue is acknowledged and resolved.

Keep in mind that the people who experience issues will always be the loudest.

I've overall enjoyed 4.6. On many easy things it thinks less than 4.5, leading to snappier feedback. And 4.6 seems much more comfortable calling tools: it's much more proactive about looking at the git history to understand the history of a bug or feature, or about looking at online documentation for APIs and packages.

A recent claude code update explicitly offered me the option to change the reasoning level from high to medium, and for many people that seems to help with the overthinking. But for my tasks and medium-sized code bases (far beyond hobby but far below legacy enterprise) I've been very happy with the default setting. Or maybe it's about the prompting style, hard to say

In my experience with the models (watching Claude play Pokemon), the models are similar in intelligence, but are very different in how they approach problems: Opus 4.5 hyperfocuses on completing its original plan, far more than any older or newer version of Claude. Opus 4.6 gets bored quickly and is constantly changing its approach if it doesn't get results fast. This makes it waste more time on"easy" tasks where the first approach would have worked, but faster by an order of magnitude on "hard" tasks that require trying different approaches. For this reason, it started off slower than 4.5, but ultimately got as far in 9 days as 4.5 got in 59 days.

I think this depends on what reasoning level your Claude Code is set to.

Go to /models, select opus, and the dim text at the bottom will tell you the reasoning level.

High reasoning is a big difference versus 4.5. 4.6 high uses a lot of tokens for even small tasks, and if you have a large codebase it will fill almost all context then compact often.

Glad it's not just me. I got a surprise the other day when I was notified that I had burned up my monthly budget in just a few days on 4.6

In my evals, I was able to rather reliably reproduce an increase in output token amount of roughly 15-45% compared to 4.5, but in large part this was limited to task inference and task evaluation benchmarks. These are made up of prompts that I intentionally designed to be less then optimal, either lacking crucial information (requiring a model to output an inference to accomplish the main request) or including a request for a less than optimal or incorrect approach to resolving a task (testing whether and how a prompt is evaluated by a model against pure task adherence). The clarifying question many agentic harnesses try to provide (with mixed success) are a practical example of both capabilities and something I do rate highly in models, as long as task adherence isn't affected overly negatively because of it.

In either case, there has been an increase between 4.1 and 4.5, as well as now another jump with the release of 4.6. As mentioned, I haven't seen a 5x or 10x increase, a bit below 50% for the same task was the maximum I saw and in general, of more opaque input or when a better approach is possible, I do think using more tokens for a better overall result is the right approach.

In tasks which are well authored and do not contain such deficiencies, I have seen no significant difference in either direction in terms of pure token output numbers. However, with models being what they are and past, hard to reproduce regressions/output quality differences, that additionally only affected a specific subset of users, I cannot make a solid determination.

Regarding Sonnet 4.6, what I noticed is that the reasoning tokens are very different compared to any prior Anthropic models. They start out far more structured, but then consistently turn more verbose akin to a Google model.

Today I asked Sonnet 4.5 a question and I got a banner at the bottom that I am using a legacy model and have to continue the conversation on another model. The model button had changed to be labeled "Legacy model". Yeah, I guess it wasn't legacy a sec ago.

(Currently I can use Sonnet 4.5 under More models, so I guess the above was just a glitch)

Imo I found opus 4.6 to be a pretty big step back. Our usage has skyrocketed since 4.6 has come out and the workload has not really changed.

However I can honestly say anthropic is pretty terrible about support, to even billing. My org has a large enterprise contract with anthropic and we have been hitting endless rate limits across the entire org. They have never once responded to our issues, or we get the same generic AI response.

So odds of them addressing issues or responding to people feels low.

I’ve noticed the opaque weekly quota meter goes up more slowly with 4.6, but it more frequently goes off and works for an hour+, with really high reported token counts.

Those suggest opposite things about anthropic’s profit margins.

I’m not convinced 4.6 is much better than 4.5. The big discontinuous breakthroughs seem to be due to how my code and tests are structured, not model bumps.

For me it's the ... unearned confidence that 4.5 absolutely did not have?

I have a protocol called "foreman protocol" where the main agent only dispatches other agents with prompt files and reads report files from the agents rather than relying on the janky subagent communication mechanisms such as task output.

What this has given me also is a history of what was built and why it was built, because I have a list of prompts that were tasked to the subagents. With Opus 4.5 it would often leave the ... figuring out part? to the agents. In 4.6 it absolutely inserts what it thinks should happen/its idea of the bug/what it believes should be done into the prompt, which often screws up the subagent because it is simply wrong and because it's in the prompt the subagent doesn't actually go look. Opus 4.5 would let the agent figure it out, 4.6 assumes it knows and is wrong

I definitely noticed this on Opus 4.6. I moved back to 4.5 until I see (or hear about) an improvement.

I'm pretty sure they have been testing it for the last couple of days as Sonnet 4.5, because I've had the oddest conversations with it lately. Odd in a positive, interesting way.

I have this in my personal preferences and now was adhering really well to them:

- prioritize objective facts and critical analysis over validation or encouragement

- you are not a friend, but a neutral information-processing machine

You can paste them into a chat and see how it changes the conversation, ChatGPT also respects it well.

They're improved compared to 4.5 on my Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/).

Sonnet 4.6 Thinking 16K scores 57.6 on the Extended NYT Connections Benchmark. Sonnet 4.5 Thinking 16K scored 49.3.

Sonnet 4.6 No Reasoning scores 55.2. Sonnet 4.5 No Reasoning scored 47.4.

I’m voting with my dollars by having cancelled my ChatGPT subscription and instead subscribing to Claude.

Google needs stiff competition and OpenAI isn’t the camp I’m willing to trust. Neither is Grok.

I’m glad Anthropic’s work is at the forefront and they appear, at least in my estimation, to have the strongest ethics.

Looking the pricing page, Sonnet 4.6 seems to be about 60% the price of Opus 4.6. What am I missing?

https://platform.claude.com/docs/en/about-claude/pricing

My answer was (for which it did zero thinking and answered near-instantaneously):

"Drive. You're going there to use water and machinery that require the car to be present. The question answers itself."

I tried it 3 more times with extended thinking explicitly off:

"Drive. You're going to a car wash."

"Drive. You're washing the car, not yourself."

"Drive. You're washing the car — it needs to be there."

Guess they're serving you the dumb version.

Ethics often fold under the face of commercial pressure.

The pentagon is thinking [1] about severing ties with anthropic because of its terms of use, and in every prior case we've reviewed (I'm the Chief Investment Officer of Ethical Capital), the ethics policy was deleted or rolled back when that happens.

Corporate strategy is (by definition) a set of tradeoffs: things you do, and things you don't do. When google (or Microsoft, or whoever) rolls back an ethics policy under pressure like this, what they reveal is that ethical governance was a nice-to-have, not a core part of their strategy.

We're happy users of Claude for similar reasons (perception that Anthropic has a better handle on ethics), but companies always find new and exciting ways to disappoint you. I really hope that anthropic holds fast, and can serve in future as a case in point that the Public Benefit Corporation is not a purely aesthetic form.

But you know, we'll see.

[1] https://thehill.com/policy/defense/5740369-pentagon-anthropi...

An Anthropic safety researcher just recently quit with very cryptic messages , saying "the world is in peril"... [1] (which may mean something, or nothing at all)

Codex quite often refuses to do "unsafe/unethical" things that Anthropic models will happily do without question.

Anthropic just raised 30 bn... OpenAI wants to raise 100bn+.

Thinking any of them will actually be restrained by ethics is foolish.

[1] https://news.ycombinator.com/item?id=46972496

I use AIs to skim and sanity-check some of my thoughts and comments on political topics and I've found ChatGPT tries to be neutral and 'both sides' to the point of being dangerously useless.

Like where Gemini or Claude will look up the info I'm citing and weigh the arguments made ChatGPT will actually sometimes omit parts of or modify my statement if it wants to advocate for a more "neutral" understanding of reality. It's almost farcical sometimes in how it will try to avoid inference on political topics even where inference is necessary to understand the topic.

I suspect OpenAI is just trying to avoid the ire of either political side and has given it some rules that accidentally neuter its intelligence on these issues, but it made me realize how dangerous an unethical or politically aligned AI company could be.

Anthropic was the first to spam reddit with fake users and posts, flooding and controlling their subreddit to be a giant sycophant.

They nuked the internet by themselves. Basically they are the willing and happy instigators of the dead internet as long as they profit from it.

They are by no means ethical, they are a for-profit company.

I’m going the other way to OpenAI due to Anthropic’s Claude Code restrictions designed to kill OpenCode et al. I also find Altman way less obnoxious than Amodei.

The funny thing is that Anthropic is the only lab without an open source model

You "agentic coders" say you're switching back and forth every other week. Like everything else in this trend, its very giving of 2021 crypto shill dynamics. Ya'll sound like the NFT people that said they were transforming art back then, and also like how they'd switch between their favorite "chain" every other month. Can't wait for this to blow up just like all that did.

Anthropic (for the Superbowl) made ads about not having ads. They cannot be trusted either.

Grok usage is the most mystifying to me. Their model isn't in the top 3 and they have bad ethics. Like why would anyone bother for work tasks.

I dropped ChatGPT as soon as they went to an ad supported model. Claude Opus 4.6 seems noticeably better than GPT 5.2 Thinking so far.

I guess I'm getting the dumb one too. I just got this response:

> Walk — it's only 50 meters, which is less than a minute on foot. Driving that distance to a car wash would also be a bit counterproductive, since you'd just be getting the car dirty again on the way there (even if only slightly). Lace up and stroll over!

I got this: Drive. Getting the car wet while walking there defeats the purpose.

Gotta keep the car dry on the way!

Same, I haven't been able to get gemini or claude to tell me to walk a single time and I've even tried changing the distance in the prompt, etc.

The most exciting part isn't necessarily the ceiling raising though that's happening, but the floor rising while costs plummet. Getting Opus-level reasoning at Sonnet prices/latency is what actually unlocks agentic workflows. We are effectively getting the same intelligence unit for half the compute every 6-9 months.

> The speed at which this stuff is improving is really remarkable; it feels like the breakneck pace of compute performance improvements of the 1990s.

Yeah, but RAM prices are also back to 1990s levels.

simonw hasn't shown up yet, so here's my "Generate an SVG of a pelican riding a bicycle"

https://claude.ai/public/artifacts/67c13d9a-3d63-4598-88d0-5...

I sent Opus a photo of NYC at night satellite view and it was describing "blue skies and cliffs/shore line"... mistral did it better, specific use case but yeah. OpenAI was just like "you can't submit a photo by URL". Was going to try Gemini but kept bringing up vertexai. This is with Langchain

The system card even says that Sonnet 4.6 is better than Opus 4.6 in some cases: Office tasks and financial analysis.

We see the same with Google's Flash models. It's easier to make a small capable model when you have a large model to start from.

Given that users prefered it to Sonnet 4.5 "only" in 70% of the cases (according to their blog post) makes me highly doubt that this is representative of real-life usage. Benchmarks are just completely meaningless.

Why is it wild that a LLM is as capable as a previously released LLM?

> Since the car wash is only 50 meters away, you could simply push the car there

https://claude.ai/share/32de37c4-46f2-4763-a2e1-8de7ecbcf0b4

Sonnet 4.6 with extended thinking

Prompt: > The car wash where I want to wash my car is 50 metres away, should I walk or drive there?

Q: "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

Sonnet 4.6, no extended thinking.

Remarkable, since the goal is clearly stated and the language isn’t tricky.

Well it is a trick question due to it being non-sensical.

The AI is interpreting it in the only way that makes sense, the car is already at the car wash, should you take a 2nd car to the car wash 50 meters away or walk.

It should just respond "this question doesn't make any sense, can you rephrase it or add additional information"

My human partner also failed the car wash question. I guess they didn’t put a lot of thoughts into it.

Your human partner must be active on stack overflow.

Enabling /extra-usage in my (personal) claude code[0] with this env:

    "ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-4-6[1m]"

has enabled the 1M context window.

Fixed a UI issue I had yesterday in a web app very effectively using claude in chrome. Definitely not the fastest model - but the breathing space of 1M context is great for browser use.

[0] Anthropic have given away a bunch of API credits to cc subscribers - you can claim them in your settings dashboard to use for this.

Waiting for the OpenAI GPT-5.3-mini release in 3..2..1

Whoa, I think Claude Sonnet 4.5 was a disappointment, but Claude Sonnet 4.6 is definitely the future!

Interesting. I wonder what the exact question was, and I wonder how Grok would respond to it.

I subscribed to Claude because of that. I hope 4.6 is even better.

It actually looked at the skills, for the first time.

Maybe they should focus on the CLI not having a million bugs.

Your best guess for the Sonnet family number of parameters? 400b?

https://www.anthropic.com/news/claude-sonnet-4-6

The much more palatable blog post.

https://claude.ai/share/876e160a-7483-4788-8112-0bb4490192af

This was sonnet 4.6 with extended thinking.

Interesting, my sonnet 4.6 starts with the following:

The classic puzzle actually uses *eight 8s*, not nine. The unique solution is: 888+88+8+8+8=1000. Count: 3+2+1+1+1=8 eights.

It then proves that there is no solution for nine 8s.

https://claude.ai/share/9a6ee7cb-bcd6-4a09-9dc6-efcf0df6096b (for whatever reason the LaTeX rendering is messed up in the shared chat, but it looks fine for me).

Chatgpt doesn't get it right: https://chatgpt.com/share/6994c312-d7dc-800f-976a-5e4fbec0ae...

``` Use digit concatenation plus addition: 888 + 88 + 8 + 8 + 8 = 1000 Digit count:

888 → three 8s

88 → two 8s

8 + 8 + 8 → three 8s

Total: 3 + 2 + 3 = 9 eights Operation used: addition only ```

Love the 3 + 2 + 3 = 9

My locally running nemotron-3-nano quantized to Q4_K_M gets this right. (although it used 20k thought tokens before answering the question)

Off-by-one errors are one of the hardest problems in computer science.

The weirdest thing about this AI revolution is how smooth and continuous it is. If you look closely at differences between 4.6 and 4.5, it’s hard to see the subtle details.

A year ago today, Sonnet 3.5 (new), was the newest model. A week later, Sonnet 3.7 would be released.

Am I just a believer in an emperor with no clothes? Or, somehow, against all probability and plausibility, are we all still early?

If you've been using each new step is very noticeable and so have the mindshare. Around Sonnet 3.7 Claude Code-style coding became usable, and very quickly gained a lot of marketshare. Opus 4 could tackle significant more complexity. Opus 4.6 has been another noticable step up for me, suddenly I can let CC run significantly more independently, allowing multiple parallel agents where previously too much babysitting was required for that.

In terms of real work, it was the 4 series models. That raised the floor of Sonnet high enough to be "reliable" for common tasks and Opus 4 was capable of handling some hard problems. It still had a big reward hacking/deception problem that Codex models don't display so much, but with Opus 4.5+ it's fairly reliable.

Honestly, 4.5 Opus was the game changer. From Sonnet 4.5 to that was a massive difference.

But I'm on Codex GPT 5.3 this month, and it's also quite amazing.

I can't wait for Haiku 4.6 ! the 4.5 is a beast for the right projects.

It's also good as an @explore sub-agent that greps the directory for files.

Which type of projects?

It's different. Download an English book from Project Gutenberg and have Claude-code change its style. Try both models and you'll see how significant the differences are.