1M context is now generally available for Opus 4.6 and Sonnet 4.6

The big change here is:

> Standard pricing now applies across the full 1M window for both models, with no long-context premium. Media limits expand to 600 images or PDF pages.

For Claude Code users this is huge - assuming coherence remains strong past 200k tok.

I'm fairly sure that your best throughput is single-prompt single-shot runs with Claude (and that means no plan, no swarms, etc) -- just with a high degree of work in parallel.

So for me this is a pretty huge change as the ceiling on a single prompt just jumped considerably. I'm replaying some of my less effective prompts today to see the impact.

It’s interesting because my career went from a higher-level language (Python) to lower-level languages (C++ and C). Opus and the like are amazing at Python, honestly sometimes better than me, but they do make some really stupid architectural decisions occasionally. But when it comes to embedded stuff, it’s still like a junior engineer. Unsure if that will ever change, but I wonder if it’s just the quality and availability of training data. This is why I find it hard to believe LLMs will replace hardware engineers anytime soon (I was a MechE for a decade).

I'm very happy about this change. For long sessions with Claude it was always like a punch to the gut when a compaction came along. Codex/GPT-5.4 is better with compactions so I switched to that to avoid the pain of the model suddenly forgetting key aspects of the work and making the same dumb errors all over again. I'm excited to return to Claude as my daily driver!

I used this for a bit and I felt like it was slower and generally worse than using 200K with context compaction. Context compaction does lose some things though.

Is there a writeup anywhere on what this means for effective context? I think that many of us have found that even when the context window was 100k tokens the actual usable window was smaller than that. As you got closer to 100k performance degraded substantially. I'm assuming that is still true but what does the curve look like?

Claude Code 2.1.75 no longer distinguishes between base Opus and 1M Opus: it's the same model. Oddly, I have Pro, where the change is supposedly only for Max+, but I'm still seeing this to be the case.

EDIT: Don't think Pro has access to it, a typical prompt just hit the context limit.

The removal of extra pricing beyond 200k tokens may be Anthropic's salvo in the agent wars against GPT 5.4's 1M window and extra pricing for that.

The weirdest thing about Claude pricing is their 5X pricing plan is 5 times the cost of the previous plan.

Normally buying the bigger plan gives some sort of discount.

At Claude, it's just "5 times more usage 5 times more cost, there you go".

the coherence question is the one that matters here. 1M tokens is not the same as actually using 1M tokens well.

we've been testing long-context in prod across a few models and the degradation isn't linear: there's something like a cliff somewhere around 600-700k where instruction following starts getting flaky and the model starts ignoring things it clearly "saw" earlier. it's not about retrieval exactly, more like... it stops weighting distant context appropriately.

gemini's problems with loops and tool forgetting that someone mentioned are real. we see that too. whether claude actually handles the tail end of 1M coherently is the real question here, and "standard pricing with no long-context premium" doesn't answer it.

honestly the fact that they're shipping at standard pricing is more interesting to me than the window size itself. that suggests they've got the KV cache economics figured out, which is harder than it sounds.

Opus 4.6 is nuts. Everything I throw at it works. Frontend, backend, algorithms—it does not matter.

I start with a PRD, ask for a step-by-step plan, and just execute on each step at a time. Sometimes ideas are dumb, but checking and guiding step by step helps it ship working things in hours.

It was also the first AI I felt, "Damn, this thing is smarter than me."

The other crazy thing is that with today's tech, these things can be made to work at 1k tokens/sec with multiple agents working at the same time, each at that speed.

Am I crazy or wasn’t this announced like 2 weeks ago?

Or was that a different company or not GA. It’s all becoming a blur.

Do subscription users still need to tap into "extra usage" spending to go above 200K tokens?

This is amazing. I have to test it with my reverse engineering workflow. I don't know how many people use CC for RE but it is really good at it.

Also it is really good for writing SketchUp plugins in Ruby. It one-shots plugins that are in some versions better than the commercial ones you can buy online.

CC will change the development landscape so much in the next year. It is exciting and terrifying at the same time.

This is great news. The 1M context is much easier to work with than compacting all the time and seems to perform and remember quite well despite the insane amount of data.

This is super exciting. I've been poking at it today, and it definitely changes my workflow -- I feel like a full three or four hour parallel coding session with subagents is now generally fitting into a single master session.

The stats claim Opus at 1M is about like 5.4 at 256k -- these needle long context tests don't always go with quality reasoning ability sadly -- but this is still a significant improvement, and I haven't seen dramatic falloff in my tests, unlike q4 '25 models.

p.s. what's up with sonnet 4.5 getting comparatively better as context got longer?

Do long sessions also burn through token budgets much faster?

If the chat client is resending the whole conversation each turn, then once you're deep into a session every request already includes tens of thousands of tokens of prior context. So a message at 70k tokens into a conversation is much "heavier" than one at 2k (at least in terms of input tokens). Yes?
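A rough sketch of that arithmetic, assuming the client really does resend the full history on every turn and ignoring prompt caching entirely. The function name and the turn sizes are made up for illustration:

```python
def cumulative_input_tokens(turn_sizes):
    """Total input tokens sent over a session when every request
    resends all previous turns (assumption: no caching, no compaction)."""
    total = 0
    context = 0
    for size in turn_sizes:
        context += size   # the history grows by this turn's tokens
        total += context  # the whole history goes out as input again
    return total


# Hypothetical session: 35 turns of 2,000 tokens each, ending at 70k of context.
turns = [2_000] * 35
print("new content added:", sum(turns))
print("input tokens sent:", cumulative_input_tokens(turns))
```

Under those assumptions the input total grows roughly with the square of the number of turns, so yes, a message late in a session is much heavier than the same message early on.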

I heard, the middle of the context is often ignored.

Do long context windows make much sense then or is this just a way of getting people to use more tokens?

My testing was extremely disappointing. This is not a context window that magically extends your breathing room for a conversation. I can tell blindly at this point when 150-200k tokens are reached because the coding quality and coherence just drop by one or two generations. It's great for cases where you really need a giant context for a specific task, but it changes nothing about needing to compact or hand over at 200k.

I've been avoiding context beyond 100k tokens in general. The performance is simply terrible. There's no training data for a megabyte of your very particular context.

If you are really interested in deep NIAH tasks, external symbolic recursion and self-similar prompts+tools are a much bigger unlock than more context window. Recursion and (most) tools tend to be fairly deterministic processes.

I generally prohibit tool calling in the first stack frame of complex agents in order to preserve context window for the overall task and human interaction. Most of the nasty token consumption happens in brief, nested conversations that pass summaries back up the call stack.

Compared to yesterday my Claude Max subscription burns usage like absolutely crazy (13% of weekly usage from a fresh reset today with just a handful of prompts on two new C++ projects, no deps) and has become unbearably slow (as in 1hr for a prompt response). GGWP Anthropic, it was great while it lasted but this isn't worth the hundreds of dollars.

1M is truly amazing. However, what is the incidence of hallucination? I haven't found a benchmark, but I feel that maintaining context at 1M would likely increase hallucination. Is there some kind of mechanism to suppress hallucination?

This blew my mind the first time I saw it. Another leap in AI that just swooshes by. In a couple of months, every model will be the same. Can't wait for IDEs like Cursor and VS Code to update their tooling to adapt to this massive change in Claude models.

Sample of one and all that, but it's way, way more sloppy than it used to be for me.

To the extent that I have started making manual fixes in the code - I haven't had to stoop to this in 2 months.

Max subscription, 100k LOC codebases more or less (frontend and backend - same observations).

The stuff I built with Opus 4.6 in the past 2.5 weeks:

Full clone of Panel de Pon/Tetris attack with full P2P rollback online multiplayer: https://panel-panic.com

An emulator of the MOS 6502 CPU with visual display of the voltage going into the DIP package of the physical CPU: https://larsdu.github.io/Dippy6502/

I'm impressed as fuck, but a part of me deep down knows that I know fuck all about the 6502 or its assembly language and architecture, and now I'll probably never be motivated to do this project in a way that I would've learned all the things I wanted to learn.

I never get to more than 20% of the 1M context window, and it’s working great. (Have the same experience in Codex with 5.4.)

What about response coherence with longer context? Usually in other models with such big windows I see the quality drop rapidly once it gets past a certain point.

I've been using Opus 4.5 for programmatic SEO and localizing game descriptions. If 4.6 truly improves context compaction, it could significantly lower the API costs for large-scale content generation. Has anyone tested its logic consistency on JSON output compared to 4.5?

Awesome.... With Sonnet 4.5, I had Cline soft trigger compaction at 400k (it wandered off into the weeds at 500k). But the stability of the 4.6 models is notable. I still think it pays to structure systems to be comprehensible in smaller contexts (smaller files, concise plans), but this is great.

(And, yeah, I'm all Claude Code these days...)

> Standard pricing now applies across the full 1M window for both models, with no long-context premium.

Does that mean it's likely not a Transformer with quadratic attention, but some other kind of architecture, with linear time complexity in sequence length? That would be pretty interesting.

The no-degradation-at-scale claim is the interesting part. Context rot has been the main thing limiting how useful long context actually is in practice — curious to see what independent evals show on retrieval consistency across the full 1M window.

Hot take... the 1MM context degrades performance drastically.

I don't get the announcement. Is this included in the standard 5 or 20x Max plans?

finally. before 1m, i had to spend 60k of context just recapping the past chat and the project

This is fantastic. I keep having to save to memory with instructions and then tell it to restore to get anywhere on long running tasks.

Are there evals showing how this improves outputs?

If this is a skill issue, feel free to let me know. In general Claude Code is decent for tooling. On-duty fullstack tooling features that used to sit ignored in the on-call ticket queue for months can now be easily built in 20 minutes with unit tests and integration tests. The code quality isn't always the best (although what's good code for humans may not be good code for agents) but a refactor is just another specific and directed prompt away.

However, I can't seem to get Opus 4.6 to wire up proper infrastructure. This is especially so if OSS forks are used. It trips up on arguments from the fork source, invents args that don't exist in either, and has a habit of tearing down entire clusters just to fix a Helm chart for "testing purposes". I've tried modifying the CLAUDE.md and SPEC.md with specific instructions on how to do things but it just goes off on a tangent and starts to negotiate on the specs. "I know you asked for help with figuring out the CNI configurations across 2 clusters but it's too complex. Can we just do single cluster?" The entire repository gets littered with random MD files everywhere for directory specific memories, context, action plans, deprecated action plans, pre-compaction memories etc. I don't quite know which to prune either. It has taken most of the fun out of software engineering and I'm now just an Obsidian janitor for what I can best describe as a "clueless junior engineer that never learns". When the auto compaction kicks in it's like an episode of 50 first dates.

Right now I assume this is where the limitation is, because the literature for real-world infrastructure requiring large contexts and integration is very limited. If anyone has any idea whether Claude Opus is suitable for such tasks, do give some suggestions.

This is incredible. I just blew through $200 last night in a few hours on 1M context. This is like the best news I've heard all year in regards to my business.

What is OpenAI's response to this? Do they even have a 1M context window, or is it still opaque and "depends on the time of day"?

im guessing this is why the compacts have started sucking? i just finished getting me some nicer tools for manipulating the graph so i could compact less frequently, and fish out context from the prior session.

maybe itll still be useful, though i only have opus at 1M, not sonnet yet

Just have to ask. Will I be spending way more money since my context window is getting so much bigger?

Pentagon may switch to Claude knowing OpenAI has the premium rates for 1M context.

I notice Claude steadily consuming fewer tokens every week too, especially with tool calling

Is this also applicable for usage in Claude web / mobile apps for chat?

Oh nice, does it mean less game of /compact, /clear, and updating CLAUDE.md with Claude Code?

Noticed this just now - all of a sudden i have 1M context window (!!!) without changing anything. It's actually slightly disturbing because this IS a behavior change. Don't get me wrong, I like having longer context but we really need to pin down behaviour for how things are deployed.

Friends, just write the code. It’s not that hard.

Could be pure coincidence, but my Claude Code session last night was an absolute nightmare. It kept forgetting things it had done earlier in the session and why it had done them, messed up a git merge so badly that it lost the CLAUDE.md file along with a lot of other stuff, and then started running commands on the host machine instead of inside the container because it no longer had a CLAUDE.md to tell it not to. Last night was the first time I've ever sworn at it.

are the costs the same as the 200k context opus 4.6?

compaction has been really good in claude, we don't even notice the switch

I am currently mass translating millions of records with short descriptions. Somehow tokens are consumed extremely fast. I have 3 max memberships. And all 3 of them are hitting the 5 hour limit in about 5 to 10 minutes. Still don't understand why this is happening.

Finally, I don't have to constantly reload my Extra Usage balance when I already pay $200/mo for their most expensive plan. I can't believe they even did that. I couldn't use 1M context at all because I already pay $200/mo and it was going to ask me for even more.

Next step should be to allow fast mode to draw from the $200/mo usage balance. Again, I pay $200/mo, I should at least be able to send a single message without being asked to cough up more. (One message in fast mode costs a few dollars each) One would think $200/mo would give me any measure of ability to use their more expensive capabilities but it seems it's bucketed to only the capabilities that are offered to even free users.

I wish I had this kind of experience. I threw a tedious but straightforward task at Claude Code using Opus 4.6 late last week: find the places in a React code base where we were using useState and useEffect to calculate a value that was purely dependent on the inputs to useEffect, and replace them with useMemo. I told it to be careful to only replace cases where the change did not introduce any behavior changes, and I put it in plan mode first.

It gave me an impressive plan of attack, including a reasonable way to determine which code it could safely modify. I told it to start with just a few files and let me review; its changes looked good. So I told it to proceed with the rest of the code.

It made hundreds of changes, as expected (big code base). And most of them were correct! Except the places where it decided to do things like put its "const x = useMemo(...)" call after some piece of code that used the value of "x", meaning I now had a bunch of undefined variable references. There were some other missteps too.

I tried to convince it to fix the places where it had messed up, but it quickly started wanting to make larger structural changes (extracting code into helper functions, etc.) rather than just moving the offending code a few lines higher in the source file. Eventually I gave up trying to steer it and, with the help of another dev on my team, fixed up all the broken code by hand.

It probably still saved time compared to making all the changes myself. But it was way more frustrating.

I find that Opus misses a lot of details in the code base when I want it to design a feature or something. It jumps to a basic solution which is actually good but might affect something elsewhere.

GPT 5.4 on codex cli has been much more reliable for me lately. I used to have opus write and codex review, I now do the opposite (I actually have codex write and both review in parallel).

So on the latest models for my use case gpt > opus but these change all the time.

Edit: also the harness is shit. Claude code has been slow, weird and a resource hog. Refuses to read now standardized .agents dirs so I need symlink gymnastics. Hides as much info as it can… Codex cli is working much better lately.

What kinds of things are you building? This is not my experience at all.

Just today I asked Claude using opus 4.6 to build out a test harness for a new dynamic database diff tool. Everything seemed to be fine but it built a test suite for an existing diff tool. It set everything up in the new directory, but it was actually testing code and logic from a preexisting directory despite the plan being correct before I told it to execute.

I started over and wrote out a few skeleton functions myself, then asked it to write tests for those to test for some new functionality. Then my plan was to ask it to add that functionality using the tests as guardrails.

Well the tests didn’t actually call any of the functions under test. They just directly implemented the logic I asked for in the tests.
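For anyone who hasn't run into this failure mode, here is a minimal sketch of what it looks like. `added_tables` is a hypothetical stand-in for a function under test, not code from the diff tool described above:

```python
def added_tables(old_schema, new_schema):
    """Hypothetical function under test: table names present only in new_schema."""
    return sorted(set(new_schema) - set(old_schema))


def test_added_tables_tautological():
    # Anti-pattern: re-implements the logic inline and never calls
    # added_tables, so it still passes if added_tables is broken or deleted.
    old, new = ["users"], ["users", "orders"]
    expected = sorted(set(new) - set(old))
    assert expected == ["orders"]


def test_added_tables_calls_function():
    # Exercises the real function against a hand-written expected value.
    assert added_tables(["users"], ["users", "orders"]) == ["orders"]
```

A quick check when reviewing generated tests is to break the function under test on purpose and confirm that the suite actually goes red.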

After $50 and 2 hours I finally got something working only to realize that instead of creating a new pg database to test against, it found a dev database I had lying around and started adding tables to it.

When I managed to fix that, it decided that it needed to rebuild multiple docker components before each test and tear them down after each one.

After about 4 hours and $75, I managed to get something working that was probably more code than I would have written in 4 hours, but I think it was probably worse than what I would have come up with on my own. And I really have no idea if it works because the day was over and I didn’t have the energy left to review it all.

We’ve recently been tasked at work with spending more money on Claude (not being more productive; the metric is literally spending more money) and everyone is struggling to do anything like what the posts on HN say they are doing. So far no one in my org in a very large tech company has managed to do anything very impressive with Claude other than bringing down prod 2 days ago.

Yes I’m using planning mode and clearing context and being specific with requirements and starting new sessions, and every other piece of advice I’ve read.

I’ve had much more luck using opus 4.6 in vs studio to make more targeted changes, explain things, debug etc… Claude seems too hard to wrangle and it isn’t good enough for you to be operating that far removed from the code.

I'm convinced everyone saying this is building the simplest web apps, and doing magic tricks on themselves.

My experience is that it gets you 80-90% of the way at 20x the speed, but coaxing it into fixing the remaining 10-20% happens at a staggeringly slow speed.

All programming is like this to some extent, but Claude's 80/20 behavior is so much more extreme. It can almost build anything in 15-30 minutes, but after those 15-30 minutes are up, it's only "almost built". Then you need to spend hours, days, maybe even weeks getting past the "almost".

Big part of why everyone seems to be vibe coding apps, but almost nobody seems to be shipping anything.

I am starting to believe it’s not OPUS but developers getting better at using LLMs across the board. And not realizing they are just getting much better at using these tools.

I also thought it was OPUS 4.5 (also tested a lot with 4.6) and then in February switched to only using auto mode in the coding IDEs. They do not use OPUS (most of the time), and I’m ending up with a similar result after a very rough learning curve.

Now switching back to OPUS I notice that I get more out of it, but it’s no longer a huge difference. In a lot of cases OPUS is actually in the way after learning to prompt more effectively with cheaper models.

The big difference now is that I’m just paying $60-90 a month for 40-50hrs of weekly usage… while I was inching towards $1000 with OPUS. I chose these auto modes because they don’t dig into usage-based pricing or throttling, which is a pretty sweet deal.

I've seen a few instances of where Claude showed me a better way to do something and many many more instances of where it fails miserably.

Super simple problem :

I had a ZMK keyboard layout definition that I wanted it to convert to QMK for a different keyboard with one key fewer, so it just had to trim one outer key. It took like 45 minutes of back and forth to get it right - I could have done it in 30 min tops manually, including looking up docs for everything.

Capability isn't the impressive part; it's the tenacity/endurance.

> PRD

Is it Baader-Meinhof or is everyone on HN suddenly using obscure acronyms?

> It was also the first AI I felt, "Damn, this thing is smarter than me."

1000% agree. It's also easy to talk to it about something you're not sure it said and derive a better, more elegant solution with simple questioning.

Gemini 3.1 also gives me these vibes.

I had been able to get it into the classic AI loop once.

It was about a problem with calculation around filling a topographical water basin with sedimentation where calculation is discrete (e.g. turn based) and that edge case where both water and sediments would overflow the basin; To make the matter simple, fact was A, B, C, and it oscillated between explanation 1 which refuted C, explanation 2 which refuted A and explanation 3 that refuted B.

I'll give it to opus training stability that my 3 tries using it all consistently got into this loop, so I decided to directly order it to do a brute force solution that avoided (but didn't solve) this problem.

I did feel like with a human, there's no way that those 3 loop would happen by the second time. Or at least the majority of us. But there is just no way to get through to opus 4.6

If it's not coding, even with 200k context it starts to write gibberish, even with the correct information in the context.

I tried to ask questions about Path of Exile 2. And even with web research on, it gave completely wrong information... Not only outdated. Wrong.

I think context decay is a bigger problem than we realize.

Is it ever useful to have a context window that full? I try to keep usage under 40%, or about 80k tokens, to avoid what Dex Horthy calls the dumb zone in his research-plan-implement approach. Works well for me so far.

No vibes allowed: https://youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ

I've been using the 1M window at work through our enterprise plan as I'm beginning to adopt AI in my development workflow (via Cline). It seems to have been holding up pretty well until about 700k+. Sometimes it would continue to do okay past that, sometimes it started getting a bit dumb around there.

(Note that I'm using it in more of a hands-on pair-programming mode, and not in a fully-automated vibecoding mode.)

So a picture is worth 1,666 words?

The quality with the 1M window has been very poor for me, specifically for coding tasks. It constantly forgets stuff that has happened in the existing conversation. n=1, ymmv

yeah it totally does not remain coherent past 200k, would have been too nice.

Well, the question is what is contributing to the usage. As the context grows, the number of input tokens increases. A model call with 800K tokens as input is 8 times more expensive than a model call with 100K tokens as input. Especially if we resume a conversation and the cache does not hit, it would be very expensive at API pricing.
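To put rough numbers on that, here is a small sketch. The rate and the cache discount below are placeholders picked for the arithmetic, not real prices, and `input_cost` is a made-up helper:

```python
def input_cost(tokens, usd_per_million, cached_fraction=0.0, cache_discount=0.1):
    """Input cost of one request. A cached_fraction of the tokens is billed
    at cache_discount times the normal rate (both values are assumptions)."""
    cached = tokens * cached_fraction
    fresh = tokens - cached
    return (fresh + cached * cache_discount) * usd_per_million / 1_000_000


RATE = 5.0  # hypothetical USD per million input tokens, not a real price

small = input_cost(100_000, RATE)
large_cold = input_cost(800_000, RATE)                       # cache miss
large_warm = input_cost(800_000, RATE, cached_fraction=0.9)  # mostly cached

print("800K vs 100K, no cache:  ", large_cold / small)
print("800K vs 100K, 90% cached:", large_warm / small)
```

With flat per-token pricing the uncached ratio is 8x whatever the actual rate is; how much a cache hit closes the gap depends on the real discount and on how much of the prefix is still cached when the session resumes.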

As someone who did Python professionally from a software engineering perspective, I've actually found its Python to be pretty crappy: unaware of _good_ idioms living outside tutorials, much like the probably 90% of Python code out there that was simply hacked together quickly.

I have not tested, but I would expect more niche ecosystems like Rust or Haskell or Erlang to have a better overall training set (developers who care about good engineering focus on them), and potentially produce the best output.

For C and C++, I'd expect a similar situation to Python: while not as approachable, they are also being pushed on beginning software engineers, and the training data would naturally have plenty of bad code.

I've found it's ok at Rust. I think a lot of existing Rust code is high quality and also the stricter Rust compiler enforces that the output of the LLM is somewhat reasonable.

It is really good at writing C++ for Arduino, can one-shot most programs.

LLMs do great with Rust though

I've had a similar experience as a graphics programmer that works in C++ every day

Writing quick python scripts works a lot better than niche domain specific code

nor web engineers (backend) that are not doing standard crud work.

I have seen these shine on frontend work

I think the combinatorial space is just too much. When I did web dev it was mostly transforming HTML/JSON from well-defined type A to well-defined type B. Everything is in text. There's nothing to reason about besides what is in the prompt itself. But constructing and maintaining a mental model of a chip and all of its instructions and all of the empirical data from profiling is just too much for SOTA to handle reliably.

> As you got closer to 100k performance degraded substantially

In practice, I haven't found this to be the case at all with Claude Code using Opus 4.6. So maybe it's another one of those things that used to be true, and now we all expect it to be true.

And of course when we expect something, we'll find it, so any mistakes at 150k context use get attributed to the context, while the same mistake at 50k gets attributed to the model.

Personally, even though performance up to 200k has improved a lot with 4.5 and 4.6, I still try to avoid getting up there. Like I said in another comment, when I see context getting up to even 100k, I start making sure I have enough written to disk to type /new, pipe it the diff so far, and just say “keep going.” I feel like the dropoff starts around maybe 150k, but I could be completely wrong. I thought it was funny that the graph in the post starts at 256k, which conveniently avoids showing the dropoff I'm talking about (if it's real).

I mentioned this at work, but context still rots at the same rate. 90k tokens consumed gives just as bad results in a 100k context window as in a 1M one.

Personally, I’m on a 6M+ line codebase and had no problems with the old window. I’m not sending it blindly into the codebase though like I do for small projects. Good prompts are necessary at scale.

The benchmark charts provided are the writeup. Everything else is just anecdata.

Isn't transformer attention quadratic in complexity in terms of context size? In order to achieve 1M token context I think these models have to be employing a lot of shortcuts.

I'm not an expert but maybe this explains context rot.
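For a sense of scale, a back-of-the-envelope sketch that assumes naive full self-attention, where every token scores against every other token. Nothing in this thread establishes what these models actually do internally, so this only illustrates the quadratic term:

```python
def attention_pairs(n_tokens):
    """Query-key score entries for one head in one layer under naive
    full self-attention (assumption: no sparsity or other shortcuts)."""
    return n_tokens * n_tokens


ratio = attention_pairs(1_000_000) / attention_pairs(200_000)
print("score entries at 1M relative to 200k:", ratio)
```

So a 5x longer context means 25x the score entries under that assumption, which is why flat per-token pricing across the whole window reads as surprising if the attention is really the naive kind.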

No change for Pro, just checked it, the 1M context is still extra usage.

I have Max 20x and they're still separate on 2.1.75.

Those sorts of volume discounts are what you do when you're trying to incentivize more consumption. Anthropic already has more demand than they're logistically able to serve at the moment (look at their uptime chart, it's barely even one 9 of reliability). For them, 1 user consuming 5 units of compute is less attractive than 5 users consuming 1 unit.

They would probably implement _diminishing_-value pricing if pure pricing efficiency was their only concern.

Yeah the free lunch on tokens is almost over. Get them while they’re still cheap

It is not the plan they want you to buy. It is a pricing strategy to get you to buy the 20x plan.

I think they are both subsidized so either is a great deal.

5 times the already subsidised rate is still a discount.

We’ll make it up on volume.

the coherence question is the one that matters here. 1M tokens is not the same as actually using 1M tokens well.

Spot on. That cliff might be less about the model failing at distance and more about noise accumulating faster than signal. In prod, most of what fills the window is file reads, grep output, and tool overhead, i.e., low-value tokens. By 700k you're not really testing long-context reasoning, you're testing the model's ability to find signal in a haystack it built itself.

It probably still saved time compared to making all the changes myself. But it was way more frustrating.

One tip I have is that once you have the diff you want to fix, start a new session and have it work on the diff fresh. They’ve improved this, but it’s still the case that the farther you get into the context window, the dumber and less focused the model gets. I learned this from the Claude Code team themselves, who have long advised starting over rather than trying to steer a conversation that has started down a wrong path.

I have heard from people who regularly push a session through multiple compactions. I don’t think this is a good idea. I virtually never do this — when I see context getting up to even 100k, I start making sure I have enough written to disk to type /new, pipe it the diff so far, and just say “keep going.” I learned recently that even essentials like the CLAUDE.md part of the prompt get diluted through compactions. You can write a hook to re-insert it but it's not done by default.

This fresh context thing is a big reason subagents might work where a single agent fails. It’s not just about parallelism: each subagent starts with a fresh context, and the parent agent only sees the result of whatever the subagent does — its own context also remains clean.

Same here. I don't understand how people leave it running on an "autopilot" for long periods of time. I still use it interactively as an assistant, going back and forth and stepping in when it makes mistakes or questionable architectural decisions. Maybe that workflow makes more sense if you're not a developer and don't have a good way to judge code quality in the first place.

There's probably a parallel with the CMSes and frameworks of the 2000s (e.g. WordPress or Ruby on Rails). They massively improved productivity, but as a junior developer you could get pretty stuck if something broke or you needed to implement an unconventional feature. I guess it must feel a bit similar for non-developers using tools like Claude Code today.

Branch first so you can just undo. I think this would have worked with sub agents and /loop maybe? Write all items to change to a todo.md. Have it split up the work with haiku sub agents doing 5-10 changes at a time, marking the todos done, and /loop until all are done. You’ll succeed I suspect. If the main claude instance compacts its context - stop and start from where you left off.
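A minimal sketch of the batching step described above, assuming the todo.md uses Markdown checkboxes (`- [ ]` for pending, `- [x]` for done). The file format, the function names, and the sample items are all hypothetical; actually dispatching each batch to a subagent is left out.

```python
from typing import Iterator

PENDING_PREFIX = "- [ ] "  # assumed checkbox format, not something Claude Code mandates


def pending_items(todo_text: str) -> list[str]:
    """Return the text of every unchecked item in the todo file."""
    return [
        line[len(PENDING_PREFIX):].strip()
        for line in todo_text.splitlines()
        if line.startswith(PENDING_PREFIX)
    ]


def batches(items: list[str], size: int = 5) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `size` items, one chunk per subagent."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical todo file: one finished item and eleven pending ones.
todo_text = "\n".join(
    ["- [x] rename foo"] + [f"- [ ] fix item {i}" for i in range(1, 12)]
)
pending = pending_items(todo_text)
for number, batch in enumerate(batches(pending, size=5), start=1):
    print(f"batch {number}: {len(batch)} items")
```

Keeping each batch small means every subagent starts from a short, fresh context, and the todo file stays the single record of what is left if the main session has to be restarted.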

If you use eslint and tell it how to run lint in CLAUDE.md it will run lint itself and find and fix most issues like this.

Definitely not ideal, but sure helps.

Undefined variable references? Did you not instruct it to run typescript after changes?

Start over, create a new plan with the lessons learned.

You need to converge on the requirements.

You’re using it wrong. As soon as it starts going off the rails once you’ve repeated yourself, you drop the whole session and start over.

p.s. what's up with sonnet 4.5 getting comparatively better as context got longer?

Did it get better? I used sonnet 4.5 1m frequently and my impression was that it was around the same performance but a hell of a lot faster, since the 1m model was willing to spend more tokens at each step vs preferring more token-cautious tool calls.

Random: are you personally paying for Claude Code or is it paid by your employer?

My employer only pays for GitHub copilot extension

Do long sessions also burn through token budgets much faster?

That's correct. Input caching helps, but even then at e.g. 800k tokens with all of them cached, the API price is $0.50 * 0.8 = $0.40 per request, which adds up really fast. A "request" can be e.g. a single tool call response, so you can easily end up making many $0.40 requests per minute.
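The arithmetic in that comment can be sketched as follows. The $0.50 per million cached input tokens is the figure quoted above, not one checked against a price sheet, and the request rate is a hypothetical number chosen only to show how quickly it adds up.

```python
CACHED_READ_USD_PER_MTOK = 0.50  # rate quoted in the comment above (unverified)


def cached_request_cost(cached_tokens: int,
                        usd_per_mtok: float = CACHED_READ_USD_PER_MTOK) -> float:
    """Cost of re-reading a fully cached context once, ignoring output tokens."""
    return cached_tokens / 1_000_000 * usd_per_mtok


per_request = cached_request_cost(800_000)
requests_per_minute = 3  # hypothetical: a few tool-call round trips per minute
hourly = per_request * requests_per_minute * 60

print(f"per request: ${per_request:.2f}")
print(f"per hour at {requests_per_minute} requests/min: ${hourly:.2f}")
```

Even with every input token served from cache, the cost scales with how many round trips the agent makes, not with how much new text is added.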

If you use context caching, it saves quite a lot on the costs/budgets. You can cache 900k tokens if you want.

Yeah, morning eastern time Claude is brutal.

The stuff I built with Opus 4.6 in the past 2.5 weeks:

Full clone of Panel de Pon/Tetris attack with full P2P rollback online multiplayer: https://panel-panic.com

An emulator of the MOS 6502 CPU with visual display of the voltage going into the DIP package of the physical CPU: https://larsdu.github.io/Dippy6502/

That game is AWESOME! The fact that it was vibe coded is insane.

Out of curiosity, what specific use cases on programmatic SEO are you currently doing with Opus?

I don't think they're claiming "no degradation at scale", are they? They still report a 91.9->78.3 drop. That's just a better drop than everyone else (is the claim).

Hot take... the 1MM context degrades performance drastically.

Same. First time in 2 months that I found it easier to fix the bugs it created manually, rather than get it to fix them. It's google-code-CLI-on-gemini-2.5 level bad for me today. Meaning, almost comically bad.

Are there evals showing how this improves outputs?

Improves outputs relative to what? Compared to previous contexts of 1M, it improves outputs by allowing them to exist (because previously you couldn't exceed 200K). Compared to contexts of <200K, it degrades outputs rather than improves them, but that's what you'd expect from longer contexts. It's still better than compaction, which was previously the alternative.

This is incredible. I just blew through $200 last night in a few hours on 1M context. This is like the best news I've heard all year in regards to my business.

What is OpenAI's response to this? Do they even have a 1M context window, or is it still opaque and "depends on the time of day"?

Did u use the API or subscription?

rarely go over 25 percent in codex but i hit 80 on claude code in just a short time.

Just have to ask. Will I be spending way more money since my context window is getting so much bigger?

Yes, full context is used to generate each new token.

Oh nice, does it mean less game of /compact, /clear, and updating CLAUDE.md with Claude Code?

I’ve been using 1M for a while. It defers compaction, but almost makes it worse when it does happen: compacting a context that big loses a ton of fidelity. So I’ve taken to just editing the context instead (double esc). I'm also planning to build an agent that slices the session logs into contextually useful and useless parts, discarding the useless ones and keeping things high fidelity that way. (I.e., carve up the jsonl with a script, have a Haiku subagent return the relevant parts, and reconstruct the jsonl.)
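A minimal sketch of the log-slicing idea, assuming each line of the session log is one JSON object. The record fields used here (`type`, `content`) and the size threshold are hypothetical, not the real Claude Code log schema, and a plain predicate stands in for the subagent's relevance judgment.

```python
import json
from typing import Callable, Iterable


def filter_jsonl(lines: Iterable[str], keep: Callable[[dict], bool]) -> list[str]:
    """Parse each JSONL line and re-serialize only the records `keep` accepts."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if keep(record):
            kept.append(json.dumps(record))
    return kept


# Hypothetical session: one oversized tool result between two short messages.
session = [
    json.dumps({"type": "user", "content": "rename the config loader"}),
    json.dumps({"type": "tool_result", "content": "x" * 5000}),
    json.dumps({"type": "assistant", "content": "renamed load_cfg to load_config"}),
]


def is_useful(record: dict) -> bool:
    """Stand-in heuristic: drop bulky tool output, keep everything else."""
    return record["type"] != "tool_result" or len(record["content"]) < 2000


trimmed = filter_jsonl(session, is_useful)
print(f"kept {len(trimmed)} of {len(session)} records")
```

Filtering whole records rather than summarizing them keeps the surviving entries verbatim, which is the high-fidelity property the comment is after.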

You can pin to specific models with --model. Check out their doc. See https://support.claude.com/en/articles/11940350-claude-code-.... You can also pin to a less specific tag like sonnet-4.5[1m] (that’s from memory, so it might be a little off).

Anthropic is famous for changing things under your feet. Claude code is basically alpha software with a global footprint.

Friends, just write the code. It’s not that hard.

I hear what you're saying, but for a lot of people coding isn't something we can throw 40+ hours per week at.

My main job is running a small eComm business, and I have to both develop software automations for the office (to improve productivity long-term) while also doing non-coding day to day tasks. On top of this, I maintain an open source project after hours. I've also got a young family with 3 kids.

I'm not saying Claude is the damn singularity or anything, but stuff is getting done now that simply wasn't being addressed before.

Not hard, but time consuming. In the past two weeks I've had Claude Code write me around 35k lines of code across 350 commits. It's a project which is giving positive impact to the company, but we would never have started it without CC as the effort would have been too big compared to the impact.

It's not that interesting.

You're witnessing the rise of the Developer Technician or Software Technician. They can get a machine to print out an application but you will still need an engineer to know how it works or to get it working. This used to be juniors learning to be senior devs/engineers. Now it is a split between technicians and engineers. The market will be up shit creek when all their technicians can't vibe code their way out of not understanding the code.

Only someone not using Claude could equate human coding.

I think this is just the nature of a nondeterministic system; occasionally you're gonna be unlucky enough to encounter the leftmost segment of the bell curve.

In my experience dumping a summary + starting a fresh session helps in these cases.

Unless you're clearing up the context for each description or processing them in parallel with subagents your context window will grow for each short description added to it making you hit those hour limits.

I find it hard to understand that people consider $200 p/m a lot for what they are getting. Expensive compared to what? A netflix sub?

One hour of a senior dev's time costs at least $100, depending on where one lives. Since Claude saves me hours every day, it pays for itself almost instantly. I think the economic value of the Claude subscription is on the order of $20-40k a month for a pro.

So a picture is worth 1,666 words?

(Note that I'm using it in more of a hands-on pair-programming mode, and not in a fully-automated vibecoding mode.)

My experience is that it gets you 80-90% of the way at 20x the speed, but coaxing it into fixing the remaining 10-20% happens at a staggeringly slow speed.

Big part of why everyone seems to be vibe coding apps, but almost nobody seems to be shipping anything.

I've seen a few instances of where Claude showed me a better way to do something and many many more instances of where it fails miserably.

Super simple problem :

Capability isn't the impressive part it's the tenacity/endurance.

> It was also the first AI I felt, "Damn, this thing is smarter than me."

1000% agree. It's also easy to talk to it about something you're not sure it said and derive a better, more elegant solution with simple questioning.

Gemini 3.1 also gives me these vibes.

Interesting, so a prompt that causes a couple dozen tool calls will end up costing in the tens of dollars?

That game is AWESOME! The fact that it was vibe coded is insane.

Honestly, that game wasn't one-shotted. I had longtime PdP enthusiasts play it and give feedback.

Did u use the API or subscription?

Max subscription and "extra usage" billing

til you can edit context. i keep a running log and /clear /reload log

I hear what you're saying, but for a lot of people coding isn't something we can throw 40+ hours per week at.

I'm not saying Claude is the damn singularity or anything, but stuff is getting done now that simply wasn't being addressed before.

100% agree with this. As much as I hate the term "game-changer", it truly is one. I'm working on projects that I've always wanted to do but never had the capacity for (or the money to pay a small team of devs to build something). All these things that you thought you'd never have a chance to do are suddenly real and completely possible. I know there are a lot of AI haters out there, but I'm pretty sure that in time all devs will embrace it and truly enjoy working with it.

Only someone not using Claude could equate human coding.

Only someone not using their brain could equate Claude to using their intelligence.

I find it hard to understand that people consider $200 p/m a lot for what they are getting. Expensive compared to what? A netflix sub?

When did I say anything about what I'm getting? I said I pay $200/mo and I expect that to cover anything up to my usage limit. I don't expect any slightly non-standard configuration to immediately ignore the high subscription price that I pay and go straight to "extra usage" that has to be billed separately by the token. I wouldn't even care if fast mode used 10x or 50x the usage as long as I could actually USE the balance that I already pay for. I thought the point of extra usage was to be for overage.

No vibes allowed: https://youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ

The quality with the 1M window has been very poor for me, specifically for coding tasks. It constantly forgets stuff that has happened in the existing conversation. n=1, ymmv

Yes, especially with shifts in focus of a long conversation. But given the high error rates of Opus 4.6 the last few weeks it is possibly due to other factors. Conversational and code prodding has been essential.

I've found it's ok at Rust. I think a lot of existing Rust code is high quality and also the stricter Rust compiler enforces that the output of the LLM is somewhat reasonable.
