I start with a PRD, ask for a step-by-step plan, and just execute on each step at a time. Sometimes ideas are dumb, but checking and guiding step by step helps it ship working things in hours.
It was also the first AI I felt, "Damn, this thing is smarter than me."
The other crazy thing is that with today's tech, these things can be made to work at 1k tokens/sec with multiple agents working at the same time, each at that speed.
Also, it is really good at writing SketchUp plugins in Ruby. It one-shots plugins that are in some cases better than the commercial ones you can buy online.
CC will change the development landscape so much in the next year. It is exciting and terrifying at the same time.
Do long context windows make much sense then or is this just a way of getting people to use more tokens?
If you are really interested in deep NIAH tasks, external symbolic recursion and self-similar prompts+tools are a much bigger unlock than more context window. Recursion and (most) tools tend to be fairly deterministic processes.
I generally prohibit tool calling in the first stack frame of complex agents in order to preserve context window for the overall task and human interaction. Most of the nasty token consumption happens in brief, nested conversations that pass summaries back up the call stack.
To the extent that I have started making manual fixes in the code, which I haven't had to stoop to in 2 months.
Max subscription, 100k LOC codebases more or less (frontend and backend - same observations).
(And, yeah, I'm all Claude Code these days...)
However, I can't seem to get Opus 4.6 to wire up proper infrastructure. This is especially so if OSS forks are used. It trips up on arguments from the fork source, invents args that don't exist in either, and has a habit of tearing down entire clusters just to fix a Helm chart for "testing purposes". I've tried modifying the CLAUDE.md and SPEC.md with specific instructions on how to do things but it just goes off on a tangent and starts to negotiate on the specs. "I know you asked for help with figuring out the CNI configurations across 2 clusters but it's too complex. Can we just do single cluster?" The entire repository gets littered with random MD files everywhere for directory specific memories, context, action plans, deprecated action plans, pre-compaction memories etc. I don't quite know which to prune either. It has taken most of the fun out of software engineering and I'm now just an Obsidian janitor for what I can best describe as a "clueless junior engineer that never learns". When the auto compaction kicks in it's like an episode of 50 first dates.
Right now this is where I assume is the limitation because the literature for real-world infrastructure requiring large contexts and integration is very limited. If anyone has any idea if Claude Opus is suitable for such tasks, do give some suggestions.
maybe itll still be useful, though i only have opus at 1M, not sonnet yet
compaction has been really good in claude we don't even recognize the switch
> Standard pricing now applies across the full 1M window for both models, with no long-context premium. Media limits expand to 600 images or PDF pages.
For Claude Code users this is huge - assuming coherence remains strong past 200k tok.
EDIT: Don't think Pro has access to it, a typical prompt just hit the context limit.
The removal of extra pricing beyond 200k tokens may be Anthropic's salvo in the agent wars against GPT 5.4's 1M window and extra pricing for that.
Normally buying the bigger plan gives some sort of discount.
At Claude, it's just "5 times more usage 5 times more cost, there you go".
we've been testing long-context in prod across a few models and the degradation isn't linear; there's something like a cliff somewhere around 600-700k where instruction following starts getting flaky and the model starts ignoring things it clearly "saw" earlier. it's not about retrieval exactly, more like... it stops weighting distant context appropriately.
gemini's problems with loops and tool forgetting that someone mentioned are real. we see that too. whether claude actually handles the tail end of 1M coherently is the real question here, and "standard pricing with no long-context premium" doesn't answer it.
honestly the fact that they're shipping at standard pricing is more interesting to me than the window size itself. that suggests they've got the KV cache economics figured out, which is harder than it sounds.
It gave me an impressive plan of attack, including a reasonable way to determine which code it could safely modify. I told it to start with just a few files and let me review; its changes looked good. So I told it to proceed with the rest of the code.
It made hundreds of changes, as expected (big code base). And most of them were correct! Except the places where it decided to do things like put its "const x = useMemo(...)" call after some piece of code that used the value of "x", meaning I now had a bunch of undefined variable references. There were some other missteps too.
I tried to convince it to fix the places where it had messed up, but it quickly started wanting to make larger structural changes (extracting code into helper functions, etc.) rather than just moving the offending code a few lines higher in the source file. Eventually I gave up trying to steer it and, with the help of another dev on my team, fixed up all the broken code by hand.
It probably still saved time compared to making all the changes myself. But it was way more frustrating.
The stats claim Opus at 1M is about like 5.4 at 256k -- these needle long context tests don't always go with quality reasoning ability sadly -- but this is still a significant improvement, and I haven't seen dramatic falloff in my tests, unlike q4 '25 models.
p.s. what's up with sonnet 4.5 getting comparatively better as context got longer?
If the chat client is resending the whole conversation each turn, then once you're deep into a session every request already includes tens of thousands of tokens of prior context. So a message at 70k tokens into a conversation is much "heavier" than one at 2k (at least in terms of input tokens). Yes?
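A quick sketch of that arithmetic, assuming the client really does resend the full history on every request. The turn sizes are made-up numbers, and this ignores any prompt caching a provider might apply, which can change what you pay but not how many tokens are sent.

```python
def input_tokens_per_request(turns):
    """Input tokens sent on each request when the whole history is resent.

    turns: list of (user_tokens, reply_tokens) pairs, illustrative only.
    """
    history = 0
    sent = []
    for user_tokens, reply_tokens in turns:
        # Each request carries everything said so far plus the new message.
        sent.append(history + user_tokens)
        # The new message and the reply both join the history afterwards.
        history += user_tokens + reply_tokens
    return sent


turns = [(500, 1500)] * 4
per_request = input_tokens_per_request(turns)
print(per_request)       # [500, 2500, 4500, 6500]
print(sum(per_request))  # 14000
```

So the same 500-token message gets heavier with every turn, and total input across a session grows roughly with the square of the number of turns.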
Full clone of Panel de Pon/Tetris attack with full P2P rollback online multiplayer: https://panel-panic.com
An emulator of the MOS 6502 CPU with visual display of the voltage going into the DIP package of the physical CPU: https://larsdu.github.io/Dippy6502/
I'm impressed as fuck, but a part of me deep down knows that I know fuck all about the 6502 or its assembly language and architecture, and now I'll probably never be motivated to do this project in a way that I would've learned all the things I wanted to learn.
What is OpenAI's response to this? Do they even have a 1M context window, or is it still opaque and "depends on the time of day"?
Next step should be to allow fast mode to draw from the $200/mo usage balance. Again, I pay $200/mo, I should at least be able to send a single message without being asked to cough up more. (One message in fast mode costs a few dollars each) One would think $200/mo would give me any measure of ability to use their more expensive capabilities but it seems it's bucketed to only the capabilities that are offered to even free users.
(Note that I'm using it in more of a hands-on pair-programming mode, and not in a fully-automated vibecoding mode.)
All programming is like this to some extent, but Claude's 80/20 behavior is so much more extreme. It can almost build anything in 15-30 minutes, but after those 15-30 minutes are up, it's only "almost built". Then you need to spend hours, days, maybe even weeks getting past the "almost".
Big part of why everyone seems to be vibe coding apps, but almost nobody seems to be shipping anything.
Super simple problem:
I had a ZMK keyboard layout definition that I wanted it to convert to QMK for a different keyboard that had one key fewer, so it just had to trim one outer key. It took about 45 minutes of back and forth to get it right. I could have done it manually in 30 minutes tops, including looking up docs for everything.
Capability isn't the impressive part; it's the tenacity/endurance.
1000% agree. It's also easy to talk to it about something you're not sure it said and derive a better, more elegant solution with simple questioning.
Gemini 3.1 also gives me these vibes.
My employer only pays for GitHub copilot extension
My main job is running a small eComm business, and I have to both develop software automations for the office (to improve productivity long-term) while also doing non-coding day to day tasks. On top of this, I maintain an open source project after hours. I've also got a young family with 3 kids.
I'm not saying Claude is the damn singularity or anything, but stuff is getting done now that simply wasn't being addressed before.
In my experience dumping a summary + starting a fresh session helps in these cases.
A 1hr of a senior dev is at least $100, depending where one lives. Since Claude saves me hours every day, it pays for itself almost instantly. I think the economic value of the Claude subscription is on the order of $20-40k a month for a pro.
No vibes allowed: https://youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ
I have not tested, but I would expect more niche ecosystems like Rust or Haskell or Erlang to have better overall training set (developer who care about good engineering focus on them), and potentially produce the best output.
For C and C++, I'd expect similar situation with Python: while not as approachable, it is also being pushed on beginning software engineers, and the training data would naturally have plenty of bad code.
GPT 5.4 on codex cli has been much more reliable for me lately. I used to have opus write and codex review; I now do the opposite (I actually have codex write and both review in parallel).
So on the latest models for my use case gpt > opus but these change all the time.
Edit: also the harness is shit. Claude code has been slow, weird and a resource hog. Refuses to read now standardized .agents dirs so I need symlink gymnastics. Hides as much info as it can… Codex cli is working much better lately.
Just today I asked Claude using opus 4.6 to build out a test harness for a new dynamic database diff tool. Everything seemed to be fine but it built a test suite for an existing diff tool. It set everything up in the new directory, but it was actually testing code and logic from a preexisting directory despite the plan being correct before I told it to execute.
I started over and wrote out a few skeleton functions myself, then asked it to write tests for those covering some new functionality. My plan was to then ask it to add that functionality using the tests as guardrails.
Well the tests didn’t actually call any of the functions under test. They just directly implemented the logic I asked for in the tests.
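To make that failure concrete, here is a minimal sketch of the difference. `diff_tables` is a hypothetical stand-in for a function under test, not the actual tool from this story.

```python
def diff_tables(old, new):
    """Hypothetical function under test: table names added or removed."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
    }


def test_reimplements_logic():
    # Anti-pattern: the test recomputes the answer itself and never calls
    # diff_tables, so it keeps passing even if diff_tables is broken.
    old, new = ["users"], ["users", "orders"]
    added = sorted(set(new) - set(old))
    assert added == ["orders"]


def test_calls_function_under_test():
    # What was wanted: exercise the real function and check its output.
    result = diff_tables(["users", "tmp"], ["users", "orders"])
    assert result == {"added": ["orders"], "removed": ["tmp"]}


test_reimplements_logic()
test_calls_function_under_test()
```

A cheap check when reviewing generated tests is to replace the body of the function under test with `raise NotImplementedError` and confirm the suite goes red.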
After $50 and 2 hours I finally got something working only to realize that instead of creating a new pg database to test against, it found a dev database I had lying around and started adding tables to it.
When I managed to fix that, it decided that it needed to rebuild multiple docker components before each test and tear them down after each one.
After about 4 hours and $75, I managed to get something working that was probably more code than I would have written in 4 hours, but I think it was probably worse than what I would have come up with on my own. And I really have no idea if it works because the day was over and I didn’t have the energy left to review it all.
We’ve recently been tasked at work with spending more money on Claude (not being more productive; the metric is literally spending more money) and everyone is struggling to do anything like what the posts on HN say they are doing. So far no one in my org in a very large tech company has managed to do anything very impressive with Claude other than bringing down prod 2 days ago.
Yes I’m using planning mode and clearing context and being specific with requirements and starting new sessions, and every other piece of advice I’ve read.
I’ve had much more luck using opus 4.6 in vs studio to make more targeted changes, explain things, debug etc… Claude seems too hard to wrangle and it isn’t good enough for you to be operating that far removed from the code.
I also thought it was OPUS 4.5 (also tested a lot with 4.6) and then in February switched to only using auto mode in the coding IDEs. They do not use OPUS (most of the times), and I’m ending up with a similar result after a very rough learning curve.
Now switching back to OPUS I notice that I get more out of it, but it’s no longer a huge difference. In a lot of cases OPUS is actually in the way after learning to prompt more effectively with cheaper models.
The big difference now is that I’m just paying 60-90$ month for 40-50hrs of weekly usage… while I was inching towards 1000$ with OPUS. I chose these auto modes because they don’t dig into usage based pricing or throttling which is a pretty sweet deal.
Is it Baader-Meinhof or is everyone on HN suddenly using obscure acronyms?
It was about a problem with calculation around filling a topographical water basin with sedimentation where calculation is discrete (e.g. turn based) and that edge case where both water and sediments would overflow the basin; To make the matter simple, fact was A, B, C, and it oscillated between explanation 1 which refuted C, explanation 2 which refuted A and explanation 3 that refuted B.
I'll give it to opus training stability that my 3 tries using it all consistently got into this loop, so I decided to directly order it to do a brute force solution that avoided (but didn't solve) this problem.
I did feel that with a human, there's no way that 3-way loop would still happen the second time around. Or at least not with the majority of us. But there is just no way to get through to opus 4.6.
I have seen these shine on frontend work
Personally, I’m on a 6M+ line codebase and had no problems with the old window. I’m not sending it blindly into the codebase though like I do for small projects. Good prompts are necessary at scale.
They would probably implement _diminishing_-value pricing if pure pricing efficiency was their only concern.
Definitely not ideal, but sure helps.
You need to converge on the requirements.
Kinda funny how you don't actually need to use coercion if you put in the engineering work to build a product that's competitive on its own technical merits...
- I ask for something highly general and claude explores a bit and responds.
- We go back and forth a bit on precisely what I'm asking for. Maybe I correct it a few times and maybe it has a few ideas I didn't know about/think of.
- It writes some kind of plan to a markdown file. In a fresh session I tell a new instance to execute the plan.
- After it's done, I skim the broad strokes of the code and point out any code/architectural smells.
- I ask it to review its own work and then critique that review, etc. We write tests.
Perhaps that sounds like a lot but typically this process takes around 30-45 minutes of intermittent focus and the result will be several thousand lines of pretty good, working code.
I think it's the big investors' extremely powerful incentives manifesting in the form of internet comments. The pace of improvement peaked at GPT-4. There is value in autocomplete-as-a-service, and the "harnesses" like Codex take it a lot farther. But the people who are blown away by these new releases either don't spend a lot of time writing code, or are being paid to be blown away. This is not a hockey stick curve. It's a log curve.
Bigger context windows are a welcome addition. And stuff like JSON inputs is nice too. But these things aren't gonna like, take your SWE job, if you're any good. It's just like, a nice substitute for the Google -> Stack Overflow -> Copy/Paste workflow.
If you're not using AI you are cooked. You just don't realize it yet.
For me, it's less about being able to look back -800k tokens. It's about being able to flow a conversation for a lot longer without forcing compaction. Generally, I really only need the most recent ~50k tokens, but having the old context sitting around is helpful.
Writing quick python scripts works a lot better than niche domain specific code
I'm not an expert but maybe this explains context rot.
I have heard from people who regularly push a session through multiple compactions. I don’t think this is a good idea. I virtually never do this — when I see context getting up to even 100k, I start making sure I have enough written to disk to type /new, pipe it the diff so far, and just say “keep going.” I learned recently that even essentials like the CLAUDE.md part of the prompt get diluted through compactions. You can write a hook to re-insert it but it's not done by default.
This fresh context thing is a big reason subagents might work where a single agent fails. It’s not just about parallelism: each subagent starts with a fresh context, and the parent agent only sees the result of whatever the subagent does — its own context also remains clean.
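A toy sketch of that isolation, with no real model involved. The `Agent` class and `delegate` helper are hypothetical names for illustration; `run` just records steps where a real agent would call a model and tools.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    context: list = field(default_factory=list)  # everything this agent has seen

    def run(self, task, steps):
        """Stand-in for a model loop: log each step, return a short summary."""
        self.context.append(f"task: {task}")
        for step in steps:
            self.context.append(f"tool output: {step}")
        return f"{task}: done in {len(steps)} steps"


def delegate(parent, task, steps):
    child = Agent(name=f"{parent.name}/sub")  # fresh, empty context
    summary = child.run(task, steps)
    parent.context.append(summary)            # parent keeps only the result
    return child


parent = Agent("main")
child = delegate(parent, "find failing test", ["grep", "pytest -x", "read traceback"])
print(len(child.context))  # 4
print(parent.context)      # ['find failing test: done in 3 steps']
```

The child's context absorbs all the intermediate noise and is then thrown away; the parent's context grows by one line per delegated task.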
There's probably a parallel with the CMSes and frameworks of the 2000s (e.g. WordPress or Ruby on Rails). They massively improved productivity, but as a junior developer you could get pretty stuck if something broke or you needed to implement an unconventional feature. I guess it must feel a bit similar for non-developers using tools like Claude Code today.
Or was that a different company or not GA. It’s all becoming a blur.
So for me this is a pretty huge change as the ceiling on a single prompt just jumped considerably. I'm replaying some of my less effective prompts today to see the impact.
As languages designed for (and probably written by) AI come out over the next decade, it will be really interesting to see what dragon tradeoffs they make.
Standard pricing now applies across the full 1M window for both models, with no long-context premium. Media limits expand to 600 images or PDF pages.
Category
Product
Claude Platform
Claude Code
Date: March 13, 2026
Reading time: 5 min
Claude Opus 4.6 and Sonnet 4.6 now include the full 1M context window at standard pricing on the Claude Platform. Standard pricing applies across the full window — $5/$25 per million tokens for Opus 4.6 and $3/$15 for Sonnet 4.6. There's no multiplier: a 900K-token request is billed at the same per-token rate as a 9K one.
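As a worked example of the flat rate, the sketch below computes request cost from the prices quoted above. It assumes the two figures in each pair are the input and output rates per million tokens, in that order; the model keys and the `request_cost` helper are illustrative names, not part of any API.

```python
# (input $, output $) per million tokens, taken from the figures quoted above.
PRICES = {
    "opus-4.6": (5.00, 25.00),
    "sonnet-4.6": (3.00, 15.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in dollars at a flat per-token rate, with no long-context multiplier."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


print(request_cost("opus-4.6", 900_000, 0))  # 4.5
print(request_cost("opus-4.6", 9_000, 0))    # 0.045
```

The 900K-token request costs 100 times as much as the 9K one because it has 100 times the tokens, not because the rate changed.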
What's new with general availability:
1M context is now included in Claude Code for Max, Team, and Enterprise users with Opus 4.6. Opus 4.6 sessions can use the full 1M context window automatically, meaning fewer compactions and more of the conversation kept intact. 1M context previously required extra usage.
A million tokens of context only matters if the model can recall the right details and reason across them. Opus 4.6 scores 78.3% on MRCR v2, the highest among frontier models at that context length.
Claude Opus 4.6 and Sonnet 4.6 maintain accuracy across the full 1M window. Long context retrieval has improved with each model generation.
That means you can load an entire codebase, thousands of pages of contracts, or the full trace of a long-running agent — tool calls, observations, intermediate reasoning — and use it directly. The engineering work, lossy summarization, and context clearing that long-context work previously required are no longer needed. The full conversation stays intact.
Claude Code can burn 100K+ tokens searching Datadog, Braintrust, databases, and source code. Then compaction kicks in. Details vanish. You're debugging in circles. With 1M context, I search, re-search, aggregate edge cases, and propose fixes — all in one window.
Anton Biryukov, Software Engineer
Before Opus 4.6's 1M context window, we had to compact context as soon as users loaded large PDFs, datasets, or images — losing fidelity on exactly the work that mattered most. We've seen a 15% decrease in compaction events. Now our agents hold it all and run for hours without forgetting what they read on page one.
Jon Bell, CPO
Opus 4.6 with 1M context window made our Devin Review agent significantly more effective. Large diffs didn't fit in a 200K context window so the agent had to chunk context, leading to more passes and loss of cross-file dependencies. With 1M context, we feed the full diff and get higher-quality reviews out of a simpler, more token-efficient harness.
Adhyyan Sekhsaria, Founding Engineer
Eve defaults to 1M context because plaintiff attorneys' hardest problems demand it. Whether it's cross-referencing a 400-page deposition transcript or surfacing key connections across an entire case file, the expanded context window lets us deliver materially higher-quality answers than before.
Mauricio Wulfovich, ML Engineer
Scientific discovery requires reasoning across research literature, mathematical frameworks, databases, and simulation code simultaneously. Claude Opus 4.6’s 1M context and expanded media limits let our agentic systems synthesize hundreds of papers, proofs, and codebases in a single pass, helping us dramatically accelerate fundamental and applied physics research.
Dr. Alex Wissner-Gross, Co-Founder
With Claude's 1M context, an in-house lawyer can bring five turns of a 100-page partnership agreement into one session and finally see the full arc of a negotiation. No more toggling between versions or losing track of what changed three rounds ago.
Bardia Pourvakil, Co-founder and CTO
Large-scale production systems have endless context, and production incidents can get very complex. With Claude's 1M context window, we are able to keep every entity, signal, and working theory in view from first alert to remediation without having to repeatedly compact or compromise the nuances of these systems.
Mayank Agarwal, Founder & CTO
We raised our Opus context window from 200k to 500k and the agent runs more efficiently — it actually uses fewer tokens overall. Less overhead, more focus on the goal at hand.
Izzy Miller, AI Research Lead
Real-world spreadsheet tasks require deep research and complex multi-step plans. Claude's 1M context window lets us maintain task adherence and attention to detail.
Tarun Amasa, CEO
1M context is available today on the Claude Platform natively and through Amazon Bedrock, Google Cloud’s Vertex AI, and Microsoft Foundry. Claude Code Max, Team, and Enterprise users on Opus 4.6 will default to 1M context automatically.
See our documentation and pricing for details.
as a cheapass, being able to pass off the simple work to cheaper $ per token agents is also just great. I've got a handful of tasks I can happily delegate work to a haiku agent and anything requiring a bit of reasoning goes to sonnet.
Feel like opus is almost a cheatcode when i do get stuck, i just bust out a full opus workflow instead and it just destroys everything i was struggling with usually. like playing on easy mode.
as cool as this stuff is, kinda still wish i was just grandfathered into the plan with no weekly limit and only the 5 hour window limits, id just be happily hammering opus blissfully.
This is the true power of agent teams: https://code.claude.com/docs/en/agent-teams
You maintain very low context usage in the main thread; just orchestration and planning details, while each individual team member remains responsible for their own. Allows you to churn through millions of output tokens in a fraction of the time.
Things have changed. The models have reached a level of coherence that they can be left to make the right decisions autonomously. Opus 4.6 is in a class of its own now.
The problem wasn't that it lost track of which changes it needed to make, so I don't think checking items off a todo list would have helped. I believe it did actually change all the places in the code it should have. It just made the wrong changes sometimes.
But also, the claim I was responding to was, "I start with a PRD, ask for a step-by-step plan, and just execute on each step at a time." If I have to tell it how to organize its work and how to keep track of its progress and how to execute all the smaller chunks of work, then I may get good results, but the tool isn't as magical (for me, anyway) as it seems to be for some other people.
To echo what the parent comment said, it's almost frustrating how effective it can be at certain tasks that I wouldn't ever have the patience for. At my job recently I needed to prototype calling some Python code via WASM using the Rust wasmtime engine, and setting up the code structure to have the bytes for the WASM component, the arguments I wanted to pass to the function, and the WIT describing the interface for the function, it was able to fill in all of the boilerplate needed so that the function calls worked properly within a minute or two on the first try; reading through all the documentation and figuring out exactly which half dozen assorted things I had to import and hook up together in the correct order would have probably taken me an hour at minimum.
I don't have any particular insight on whether or not these tools will become even more powerful over time, and I still have fairly strong concerns about how AI tools will affect society (both in terms of how they're used and the amount of energy used to produce them in the first place), but given how much the tech industry tends to prioritize productivity over social concerns, I have to assume that my future employment is going to be heavily impacted by my willingness to adopt and use these tools. I can't deny at this point that having it as an option would make me more productive than if I refuse to use it, regardless of my personal opinions on it.
I’ve had plenty of success with greenfield projects myself but using the copilot agent and opus 4.5 and 4.6. I completely vibecoded a small game for my 4 year old in 2 hours. It’s probably 20% of the way to being production ready if I wanted to release it, but it works and he loves it.
And yes people have had success with very simple prototypes and demos at work.
You can minimize these problems with TLC but ultimately it just will keep fucking up.
This has made my planning / research phase so much better.
I tried to ask questions about path of exile 2. And even with web research on it gave completely wrong information... Not only outdated. Wrong
I think context decay is a bigger problem than we realize.
Does that mean it's likely not a Transformer with quadratic attention, but some other kind of architecture, with linear time complexity in sequence length? That would be pretty interesting.
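For a sense of scale behind that question, here is a back-of-the-envelope sketch. It assumes the attention part of the compute for a single request grows with the square of sequence length in a standard Transformer and linearly in a hypothetical linear-time architecture; it says nothing about what any vendor actually runs, and flat pricing could equally reflect caching, hardware, or margin choices.

```python
def relative_cost(tokens, baseline_tokens, exponent):
    """How many times more compute than the baseline, under a power-law model."""
    return (tokens / baseline_tokens) ** exponent


# Going from a 200k-token request to a 1M-token request:
print(relative_cost(1_000_000, 200_000, 2))  # 25.0 (quadratic attention)
print(relative_cost(1_000_000, 200_000, 1))  # 5.0  (linear-time architecture)
```

Under flat per-token pricing, revenue for the larger request only goes up 5x, which is why the quadratic case looks expensive to offer without a premium.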
Now you have to compact and you don’t know what will survive. And the built-in UI doesn’t give you good tools like deleting old messages to free up space.
I’ll appreciate the 1M token breathing room.
I'm using CC (Opus) thinking and Codex with xhigh on always.
And the models have gotten really good when you let them do stuff where goals are verifiable by the model. I had Codex fix a Rust B-rep CSG classification pipeline successfully over the course of a week, unsupervised. It had a custom STEP viewer that would take screenshots and feed them back into the model so it could verify the progress resp. the triangle soup (non progress) itself.
Codex did all the planning and verification, CC wrote the code.
This would have not been possible six months ago at all from my experience.
Maybe with a lot of handholding; but I doubt it (I tried).
I mean both the problem for starters (requires a lot of spatial reasoning and connected math) and the autonomous implementation. Context compression was never an issue in the entire session, for either model.
In my experience the model will assume the web results are the answer even if the search engine returns irrelevant garbage.
For example you ask it a question about New Jersey law and the web results are about New York or about "many states" it'll assume the New York info or "many states" info is about New Jersey.
> Sometimes ideas are dumb, but checking and guiding step by step helps it ship working things in hours.
which matches my experience exactly. I consider it to be about as magical as the parent comment is claiming, but I wouldn’t call it totally automatic.
His fix for "the dumb zone" is the RPI Framework:
● RESEARCH. Don't code yet. Let the agent scan the files first. Docs lie. Code doesn't.
● PLAN. The agent writes a detailed step-by-step plan. You review and approve the plan, not just the output. Dex calls this avoiding "outsourcing your thinking." The plan is where intent gets compressed before execution starts.
● IMPLEMENT. Execute in a fresh context window. The meta-principle he calls Frequent Intentional Compaction: don't let the chat run long. Ask the agent to summarize state, open a new chat with that summary, keep the model in the smart zone.
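The compaction step in IMPLEMENT can be sketched roughly as below. This is a toy under stated assumptions: the 4-characters-per-token estimate is a crude heuristic, `summarize` is a placeholder for asking the model to write the summary, and `Session` is a hypothetical wrapper, not any tool's real API.

```python
def estimate_tokens(text):
    # Crude heuristic (assumption): about 4 characters per token.
    return len(text) // 4


def summarize(messages):
    # Placeholder: a real setup would have the model write this summary.
    return f"summary of {len(messages)} earlier messages"


class Session:
    def __init__(self, budget_tokens):
        self.budget = budget_tokens
        self.messages = []
        self.rollovers = 0

    def used(self):
        return sum(estimate_tokens(m) for m in self.messages)

    def add(self, message):
        # Roll over on purpose, before the budget is exceeded, instead of
        # waiting for an automatic compaction to decide what survives.
        if self.used() + estimate_tokens(message) > self.budget:
            self.messages = [summarize(self.messages)]
            self.rollovers += 1
        self.messages.append(message)


session = Session(budget_tokens=50)
for _ in range(10):
    session.add("x" * 80)  # each message is about 20 tokens
print(session.rollovers, len(session.messages), session.used())
```

The point of the design is that the rollover happens at a moment you choose, with a summary you can read, rather than mid-task.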
1) No longer found the dumb zone
2) No longer feared compaction
Switching to Opus for stupid political reasons, I still have not had the dumb zone - but I'm back to disliking compaction events and so the smaller context window it has, has really hurt.
I hope they copy OpenAI's compaction magic soon, but I am also very excited to try the longer context window.
In practice, I haven't found this to be the case at all with Claude Code using Opus 4.6. So maybe it's another one of those things that used to be true, and now we all expect it to be true.
And of course when we expect something, we'll find it, so any mistakes at 150k context use get attributed to the context, while the same mistake at 50k gets attributed to the model.
https://github.com/features/copilot/cli
Disclosure: work at Msft
The people I work with who complain about this type of thing horribly communicate their ask to the llm and expect it to read their minds.
As an example of doing this in a session with jagged alliance 3 (an rpg) https://pastes.io/jagged-all-69136
Claude extracting game archives and disassembling leads to far more reliable results than random internet posts.
It’s led to me starting new chats with bigger and bigger starting 'summary' prompts to catch the model up while refreshing it. Surely there’s a way to automate that technique.
I’m not seeing anyone at work either out of hundreds of devs who is regularly cranking out several thousand lines of pretty good working code in 30-45 minutes.
What’s an example of something you built today like this?
The second you throw a novel constraint into the mix things fall apart. But most devs don't even know about novel constraints let alone work with them. So they don't see these limitations.
Ask an LLM to not allocate? To not acquire locks? To ensure reentrancy safety? It'll fail - it isn't trained on how to do that. Ask it to "rank" software by some metric? It ends up just spitting out "community consensus" because domain expertise won't be highly represented in its training set.
I love having an LLM to automate the boring work, to do the "subpar" stuff, but they have routinely failed at doing anything I consider to be within my core competency. Just yesterday I used Opus 4.6 to test it out. I checked out an old version of a codebase that was built in a way that is totally inappropriate for security. I asked it to evaluate the system. It did far better than older models but it still completely failed in this task, radically underestimating the severity of its findings, and giving false justifications. Why? For the very obvious reason that it can't be trained to do that work.
Careful, or you're going to get slapped by the stupid astroturfing rule... but you're correct. Also there's the sunk cost fallacy, post purchase rationalization, choice supportive bias, hell look at r/MyBoyfriendIsAI... some people get very attached to these bots, they're like their work buddies or pets, so you don't even need to pay them, they'll glaze the crap out of it themselves.
Truth. But not just “using”.
Because here’s where this ship has already landed: humans will not write code, humans will not review code.
I see mostly rage against this idea, but it is already here. Resistance is futile. There will be no “hand crafted software” shops. You have at most 3-4 years left if you think this is your job.
It’s fine for you to take a stand, but please understand your position is simply factually wrong if you think you can outprogram Claude for a range of common tasks.
Being anti AI is fine, but if you deny facts of how far LLM programming has come then you lack credibility.
The most effective anti AI position is to acknowledge its power, not pretend that vast numbers of people are somehow hallucinating the power of LLM assisted programming.
the conversation history is a linked list, so you can screw with it, with some care.
I spent this afternoon building an MCP to break the conversation up into topics, then suggest some to remove that aren't useful but are taking up a bunch of context (e.g. iterations through build/edit, where only the end result is needed)
it's gonna take a while before I'm confident it's worth sharing
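A rough sketch of the pruning idea described above, not the commenter's actual MCP: assuming each message already carries a topic label and a flag for intermediate build/edit turns (both hypothetical fields), you can drop every intermediate turn except the last one per topic. It uses a flat list rather than a linked list to keep the example short.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Message:
    topic: str  # hypothetical topic label, e.g. assigned by a classifier
    kind: str   # "iteration" for intermediate build/edit turns, anything else is kept
    text: str


def prune_iterations(history: List[Message]) -> List[Message]:
    """Keep all non-iteration messages, plus only the last iteration per topic."""
    last_iteration: Dict[str, int] = {}
    for index, message in enumerate(history):
        if message.kind == "iteration":
            last_iteration[message.topic] = index
    return [
        message
        for index, message in enumerate(history)
        if message.kind != "iteration" or last_iteration[message.topic] == index
    ]
```

Anything smarter (deciding which topics are "not useful") would need a model or a human in the loop; this only covers the mechanical "keep the end result" case.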
It'll remain a human job for quite a while too. Separability is not a property of vector spaces, so modern AIs are not going to be good at it. Maybe we can manage something similar with simplicial complexes instead. Ideally you'd consult the large model once and say:
> show me the small contexts to use here, give me prompts re: their interfaces with their neighbors, and show me which distillations are best suited to those tasks
...and then a network of local models could handle it from there. But the providers have no incentive to go in that direction, so progress will likely be slow.
They are probably doing something like putting the original user prompt into the model's environment and providing special tools to the model, along with iterative execution, to fully process the entire context over multiple invokes.
I think the Recursive Language Model paper has a very good take on how this might go. I've seen really good outcomes in my local experimentation around this concept:
https://arxiv.org/abs/2512.24601
You can get exponential scaling with proper symbolic stack frames. Handling a gigabyte of context is feasible, assuming it fits the depth first search pattern.
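To make the recursion idea concrete, here is a minimal sketch, not the paper's method: split an oversized input into chunks, recurse on each chunk in its own "stack frame", and pass only the short results back up. The `summarize` callable is a stand-in for a model call, and `limit` and `fanout` are made-up parameters; the sketch also assumes the child results are short enough to fit in one call when joined.

```python
from typing import Callable, List


def recursive_process(
    text: str,
    summarize: Callable[[str], str],  # stand-in for an LLM call (assumption)
    limit: int = 1000,                # max characters handled by a single call
    fanout: int = 4,                  # number of child frames per level, must be >= 2
) -> str:
    """Depth-first processing: each frame sees only its chunk or its children's results."""
    if len(text) <= limit:
        return summarize(text)
    size = -(-len(text) // fanout)  # ceiling division
    chunks: List[str] = [text[i:i + size] for i in range(0, len(text), size)]
    child_results = [recursive_process(c, summarize, limit, fanout) for c in chunks]
    # Only the condensed child results are passed up the call stack.
    return summarize("\n".join(child_results))
```

With this shape, a tree of depth d covers roughly `fanout ** d` leaf chunks, which is where the exponential reach comes from, as long as the task decomposes this way.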
People should still understand the code because sometimes the AI solution really is wrong and I have to shove my hand in its guts and force it to use my solution or even explain the reasoning.
People should be studying architecture. Cause now I can orchestrate stuff that used to take teams and that I would throw away as a non-viable idea. Now I can just do it. But no, you will still be reviewing code.
[0] https://en.wikipedia.org/wiki/Software_requirements_specific...
[1] https://news.ycombinator.com/item?id=47323316 who the hell knows that version of "RSI"?
Note that I'm not talking about the low-level grunt work (and even with that, it's just that it is tedious and time-consuming, but if I had enough time to read through all the docs and stuff, I will almost always produce grunt code of much higher quality).
But I'm more talking about architecture, the stuff of proper higher level engineering. I use Claude Opus all the time, and I cannot even count how many times I've had to redirect its approach that was obviously betraying a complete lack of seeing the big picture, or some egregiously smelly architectural approach.
Also, expressive typing. I use mostly TypeScript, and it will often give up when I try to push it beyond a certain point, and resort to using "any". Then I have to step in and do the job myself.
My team has been adopting a separation of plan & implement organically, we just noticed we got better output that way, plus Claude now suggests in plan mode to clear context first before implementing. We are starting to do team reviews on the plan before the implement phase. It’s often helpful to get more eyeballs on the plan and improve it.
It's faster because it has already read most relevant files, still has the caveats / discussion from the research phase in its context window, etc.
With the context clear the plan may be good / thorough but I've had one too many times that key choices from the research phase didn't persist because halfway through implementation Opus runs into an issue and says "You know what? I know a simpler solution." and continues down a path I explicitly voted down.
When I am using codex, compaction isn’t something I fear, it feels like you save your gaming progress and move on.
For Claude Code, compaction feels disastrous, and it also takes much longer
Unless you’re using a text editor as an IDE you probably have already
I’ve had things like a system that has a collection of procedural systems. I would say “replace the following set of defaults that are passed all around for system X (list of files) and in the managed (file) by a config” and it would do that, but I’d suddenly see it go “wait, mu and projection distance are also present in system Y and Z. Let me replace that by a config too with the same values”, when systems Y and Z use a different set of optimized values and that was clearly outside of the scope.
Never had that kind of mistake happen when dealing with small contexts, but with larger contexts (multiple files, long “thinking” sequences) it does happen sometimes.
Definitely some times when I thought “oh well my bad, I should have clarified NOT to also change that other part”, all the while thinking that no human would have thought to change both
It could almost be used as a benchmark of how good models are at math, memory, updated information etc
Usually things go smoothly but sometimes I have situations like: “please add feature X, needs to have ABCD.” -> does ABC correct but D wrong -> “here is how to fix D” -> fixes D but breaks AB -> “remember I also want AB this way, you broke it” -> fixes AB but removes C and so on
Programming is not hard. You’re just lazy.
I find myself often running validity checks between docs and code and addressing gaps as they appear to ensure the docs don’t actually lie.
What's been working for me is keeping a CLAUDE.md file in my project root with key decisions and context. The model reads it at the start of every session so I don't have to re-explain everything. Not as elegant as automated compaction but it works.
Or even one with DRM?
Right?
Or?
I don't know if Anthropic has revealed such details since AI research is getting more and more secretive, but the architectural tricks definitely exist.
I keep a running log of important things and then i just clear context and reload that file into context.
would that work
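For what it's worth, the running-log approach is easy to script. A minimal sketch, assuming a plain markdown notes file and a made-up prompt layout (neither is specific to any tool):

```python
from pathlib import Path


def log_decision(log_path: Path, note: str) -> None:
    """Append one important decision or fact to the running log."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(f"- {note.strip()}\n")


def build_restart_prompt(log_path: Path, task: str) -> str:
    """Build the opening message for a fresh session from the log plus the next task."""
    notes = log_path.read_text(encoding="utf-8") if log_path.exists() else ""
    return f"Context from earlier sessions:\n{notes}\nCurrent task: {task}\n"
```

After clearing context, the output of `build_restart_prompt` is what you would paste (or pipe) in as the first message.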
I am heavily involved in developing those, and then routinely let opus run overnight and have either flawless or nearly flawless product in the morning.
Working on my first project with it… so far so good.
Childish and naive.
If you said you’ve been using Claude heavily and it’s never done better than you on your own, then your position would be credible.
Or make a subagent do the debugging and let the main agent orchestrate it over many subagent sessions.
> This list includes a special type=compaction item with an opaque encrypted_content item that preserves the model’s latent understanding of the original conversation.
Some prior discussion here https://news.ycombinator.com/item?id=46737630#46739209 regarding an article here https://openai.com/index/unrolling-the-codex-agent-loop/
also, i don't want to make a full parent post
1M tokens sounds real expensive if you're constantly at that threshold. There's codebases larger in LOC; i read somewhere that Carmack has "given to humanity" over 1 million lines of his code. Perhaps something to dwell on
I've found doing this for games to be far more reliable than trying to find internet posts explaining it. I haven't played POE but if it's anything like any other RPG system Claude will do a great job at this.
I generate task.md files before working on anything, some are short, others are super long and with many steps. The models don't deviate anymore. One trick is to make a post tool use hook to show the first open gate "- [ ]" line from task.md on each tool call. This keeps the agent straight for 100s of gates.
After each gate is executed we don't just check it, we also append a few words of feedback. This makes the task.md become a workbook, covering intent, plan, execution and even judgements. I see it like a programming language now. I can gate any task and the agent will do it, however many steps. It can even generate new gates, or replan itself midway.
You can enforce strict testing policies by just leaning into the power of gate programmability - after each work gate have a test gate, and have judges review testing quality and propose more tests.
The task.md file is like a script or pipeline. It is also like a first class function, it can even ingest other task.md files for regular reflexion. A gate can create or modify gates, or tasks. A task can create or modify gates or tasks.
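The gate mechanics described above can be sketched in a few lines. This is an illustration under assumptions, not the commenter's script: it treats any line starting with `- [ ]` as an open gate, and the function names are made up. How the output gets wired into a post tool use hook depends on the harness and is left out.

```python
import re
from typing import Optional

# An open gate is a markdown checkbox line such as "- [ ] add tests".
OPEN_GATE = re.compile(r"^\s*- \[ \]\s*(.*)$")


def first_open_gate(task_md: str) -> Optional[str]:
    """Return the text of the first unchecked gate, or None if all gates are closed."""
    for line in task_md.splitlines():
        match = OPEN_GATE.match(line)
        if match:
            return match.group(1)
    return None


def close_gate(task_md: str, feedback: str) -> str:
    """Check off the first open gate and append a few words of feedback to it."""
    lines = task_md.splitlines()
    for index, line in enumerate(lines):
        if OPEN_GATE.match(line):
            lines[index] = line.replace("- [ ]", "- [x]", 1) + f" ({feedback})"
            break
    return "\n".join(lines)
```

A hook would print `first_open_gate(...)` after each tool call so the next gate stays in front of the agent, and `close_gate(...)` is what turns task.md into the workbook with judgements attached.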
The place it may fail is obfuscation and server side logic. But generally client side logic, especially in a game with a scripted language backing it, is super easy for Claude to pick apart.
Big refactorings guided by automated tests eat context window for breakfast.
This is direct comparison. I spent months subscribed to both of their $200/mo plans. I would try both and Opus always filled up fast while Codex continued working great. It's also direct experience that Codex continues working great post-compaction since 5.2.
I don't know about Gemini but you're just wrong about Codex. And I say this as someone who hates reporting these facts because I'd like people to stop giving OpenAI money.
In general LLMs for some reason are really bad at designing prompts for themselves. I tested it heavily on some data where there was a clear optimization function and ability to evaluate the results, and I easily beat opus every time with my chaotic full of typos prompts vs its methodological ones when it is writing instructions for itself or for other LLMs.
Another reason for less token usage is that 4.6 is much better at delegating agents (its own explorer agents or my custom agents) to avoid cluttering the window.
https://www.claudecodecamp.com/p/how-prompt-caching-actually...
Pleasant? I could not care less about the pleasantness of the video code, but a shortened URL in this case would not be more pleasant, and it would be functionally worse, and barely shorter; all you’d be able to trim is the “?si=“. I’m baffled by this thread.
What I'm doing mostly these days is maintaining a goal.md (project direction) and spec.md (coding and process standards, global across projects). I'm also developing new macro tasks; one I have in progress is meant to automatically build a PNG mockup and self-review it.
Also, only the domain is shorter
But Codex to plan big features and Claude to review the feature plan (often finds overlooked discrepancies) then review the milestones and plan implementation of them in planning mode, then clear context and code. Works great.
In that way we could erase prompts and responses that didn't yield anything useful or derailed the model.
Why can't we do that?
Open a new chat with Opus with thinking mode off, because there's no need for it when we have a detailed plan.
Now the plan file is always reachable, so when the context limit is narrowing, mostly around 50%, I ask Claude to update the plan with the progress, and move to a new chat @pointing at the plan file, and it continues executing without any issue.
if you're a one-model shop you're losing out on quality of software you deliver, today. I predict we'll all have at least two harness+model subscriptions as a matter of course in 6-12 months since every model's jagged frontier is different at the margins, and the margins are very fractal.
No idea what they were thinking when they designed this feature. The plan file names are randomly generated, so it could just keep making new ones forever for free (it would take a LONG time for the disk space to matter), but instead, for long plans, I have to back the plan file up if it gets stuck. Otherwise, I say "You should take approach X to fix this bug", it drops into plan mode, says "This is a completely unrelated plan", then deletes all record of what it was doing before getting stuck.
Or is thinking about source code line by line the only valid form of thinking in the world?
There is a reason discussions about agent use have been on Hacker News every other day, and it's not a grand conspiracy. Even in this submission, people have talked about how they have used Claude Code and its longer context window successfully as a tool for programming, even if they may be technically skilled to do it themselves. However, if you assume that every commenter is acting in bad faith, then there's no point in continuing.
To me, the fact that the tracking code is visible and separate from the video code is evidence of the complete opposite of your conclusion - it’s evidence the ad business does not get to override either engineering or what’s left of privacy control. Ad execs would surely prefer that the tracking code be neither visible nor manually removable.
At home I use roo code, at work kiro. Tbh as long as it has task delegation I'm happy with it.