https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbdde...
I'd not be surprised if this is the year where some models simply stop being available as a plain API, while foundation model companies succeed at capturing more use cases in their own software.
Anyway - these benchmarks look really good; I’m hopeful on the qualitative stuff.
Particularly in areas outside straight coding tasks. So analysis, planning, etc. Better and more thorough output. Better use of formatting options (tables, diagrams, etc.).
I'm hoping to see improvements in this area with 5.5.
I prescribe 20 hours of KSP to everyone involved, that'll set them right.
I hope GPT-5.5 Pro is not cutting corners and neutered from the start; you've got the compute for it not to be.
And that backdoor API has GPT-5.5.
So here's a pelican: https://simonwillison.net/2026/Apr/23/gpt-5-5/#and-some-peli...
I used this new plugin for LLM: https://github.com/simonw/llm-openai-via-codex
UPDATE: I got a much better pelican by setting the reasoning effort to xhigh: https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602...
(I work at OpenAI.)
https://developers.openai.com/codex/pricing?codex-usage-limi...
Note the Local Messages limits between 5.3, 5.4, and 5.5. And, yes, I did read the linked article and know they're claiming that 5.5's new efficiency should make it break even with 5.4, but the point stands: tighter limits/higher prices.
I recommend anybody in offensive/defensive cybersecurity to experiment with this. This is the real data point we needed - without the hype!
Never thought I'd say this but OpenAI is the 'open' option again.
The game that this prompt generated looks pretty decent visually. A big part of this is likely due to the fact that the meshes were created using a separate tool (probably Meshy, Tripo.ai, or similar) and not generated by 5.5 itself.
It really seems like we could be at the dawn of a new era similar to Flash, where any gamer or hobbyist can generate game concepts quickly and instantly publish them to the web. Three.js in particular is really picking up as the primary way to design games with AI, in spite of the fact that it's not even a game engine, just a web rendering library.
> To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.
The ability of agentic LLMs to improve computational efficiency/speed is a highly impactful domain that I wish were tested with more than benchmarks. From my experience Opus is still much better than GPT/Codex in this respect, but given that OpenAI is getting material gains out of this type of performancemaxxing, and has an increasing incentive to keep at it given cost/capacity issues, I wonder if OpenAI will continue optimizing for it.
Benchmark | Mythos | GPT-5.5
SWE-bench Pro | 77.8%* | 58.6%
Terminal-Bench 2.0 | 82.0% | 82.7%*
GPQA Diamond | 94.6%* | 93.6%
Humanity's Last Exam | 56.8%* | 41.4%
Humanity's Last Exam (tools) | 64.7%* | 52.2%
BrowseComp | 86.9% | 84.4% (90.1% Pro)*
OSWorld-Verified | 79.6%* | 78.7%
(* = better score)
Still far from Mythos on SWE-bench but quite comparable otherwise.
Source for Mythos values: https://www.anthropic.com/glasswing
Source: https://artificialanalysis.ai/models?omniscience=omniscience...
*I work at OAI.
This quote is more sinister than I think was intended; it likely applies to all frontier coding models. As they get better, we quickly come to rely on them for coding. It's like playing a game on God Mode. Engineers become dependent; it's truly addictive.
This matches my own experience and unease with these tools. I don't really have the patience to write code anymore because I can one shot it with frontier models 10x faster. My role has shifted, and while it's awesome to get so much working so quickly, the fact is, when the tokens run out, I'm basically done working.
It's literally higher leverage for me to go for a walk if Claude goes down than to write code because if I come back refreshed and Claude is working an hour later then I'll make more progress than mentally wearing myself out reading a bunch of LLM generated code trying to figure out how to solve the problem manually.
Anyway, it continues to make me uneasy, is all I'm saying.
As long as tokens count roughly equally towards subscription plan usage between 5.5 & 5.4, you can look at this as effectively a 5x increase in usage limits.
Seems so to me - see GPT-5.4[1] and 5.2[2] announcements.
Might be a tacit admission of being behind.
[1] https://openai.com/index/introducing-gpt-5-4/ [2] https://openai.com/index/introducing-gpt-5-2/
The efficiency gap is enormous. Maybe it's the difference between a GB200 NVL72 and an Amazon Trainium chip?
So much bench-maxxing is just giving the model a ton of tokens so it can inefficiently explore the solution space.
How does this work exactly? Is there like a "search online" tool that the harness is expected to provide? Or does the OpenAI infra do that as part of serving the response?
I've been working on building my own agent, just for fun, and I conceptually get using a command line, listing files, reading them, etc, but am sort of stumped how I'm supposed to do the web search piece of it.
Given that they're calling out that this model is great at online research - to what extent is that a property of the model itself? I would have thought that was a harness concern.
(same input price and 20% more output price than Opus 4.7)
Yeah, this was the next step: have RLVR make the model good, then in the next iteration start penalising long + correct and rewarding short + correct.
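A minimal sketch of that reward shaping (the token budget and penalty weight are made-up illustrative values, not anything from a real RLVR setup):

```python
def shaped_reward(correct: bool, n_tokens: int,
                  budget: int = 4096, penalty: float = 0.5) -> float:
    """Reward correct answers, but dock long ones.

    Hypothetical scheme: a wrong answer scores 0 regardless of length;
    a correct answer scores 1 minus a penalty that grows with the
    fraction of the token budget consumed.
    """
    if not correct:
        return 0.0
    overuse = min(n_tokens / budget, 1.0)  # fraction of budget used, capped at 1
    return 1.0 - penalty * overuse

# Long + correct is now worth less than short + correct:
assert shaped_reward(True, 512) > shaped_reward(True, 4096)
```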
> CyberGym 81.8%
Mythos was self reported at 83.1% ... So not far. Also it seems they're going the same route with verification. We're entering the era where SotA will only be available after KYC, it seems.
You can kind of use connectors like MCP, but having to use ngrok every time just to expose a local filesystem for file editing is more cumbersome than expected.
Once upon a time humans had to manually advance the spark ignition as their car's engine revved faster.
Once upon a time humans had to know the architecture of a CPU to code for it.
History is full of instances of humans meeting technology where it was, accommodating for its limitations. We are approaching a point where machines accommodate to our limitations -- it's not a point, really, but a spectrum that we've been on.
It's going to be a bumpy ride.
This might be great if it translates to agentic engineering and not just benchmarks.
It seems some of the gains from Opus 4.6 to 4.7 required more tokens, not less.
Maybe more interesting is that they’ve used codex to improve model inference latency. iirc this is a new (expectedly larger) pretrain, so it’s presumably slower to serve.
https://www.nytimes.com/2026/04/23/technology/openai-new-model.html
I can see how some model releases would meet the NY Times news-worthy threshold if they demonstrated significance to users - i.e., if most users were astir and competitors were re-thinking their situation. However, this same-day article came out before people really looked at it. It seems largely intended to contrast OpenAI with Anthropic's caution, before there has been any evidence that the new model has cyber-security implications.
It's not at all clear that the broader discourse is helping, if even the NY Times is itself producing slop just to stoke questions.
The hope is to get a big userbase who eventually become dependent on it for their workflow, then crank up the price until it finally becomes profitable.
The price for all models by all companies will continue to go up, and quickly.
That's a big if, though. I wish Meta were still releasing top of the line, expensively produced open-weights models. Or if Anthropic, Google, or X would release an open mini version.
The current market is predicated on the assumption that labor is atomic and has little bargaining power (minus unions). While capital has huge bargaining power and can effectively put whatever price it wants on labor (in markets where labor is plentiful, which is most of them).
What happens to a company used to extracting surplus value from labor when the labor is provided by another company which is not only bigger but unlike traditional labor can withhold its labor indefinitely (because labor is now just another form of capital and capital doesn't need to eat)?
Anyone not using in house models is signing up to find out.
Would one be uneasy about calling a library to do stuff than manually messing around with pointers and malloc()? For some, yes. For others, it’s a bit freeing as you can do more high-level architecture without getting mired and context switched from low level nuances.
- I often don't ask the LLM for precompiled answers, i ask for a standalone cli / tool
- I often ask how it reached its conclusions, so I can extend my own perspective
- I often ask it to describe its own metadata-level categorization too
I'm trying to use it to pivot and improve my own problem solving skills, especially for large code bases where the difficulty is not conceptual but more reference-graph size.
Note that neither of these assumptions is obviously true, at least to me. But I can hope!
Also, I honestly can’t believe the 10x mantra is being still repeated.
What's the worst potential outcome, assuming that all models get better, more efficient and more abundant (which seems to be the current trend)? The goal of engineering has always been to build better things, not to make it harder.
You can replace pretty much everything - skills system, subagents, etc with just tmux and a simple cli tool that the official clients can call.
Oh and definitely disable any form of "memory" system.
Essentially, treat all tooling that wraps the models as dumb gateways to inference. Then provider switch is basically a one line config change.
MCPs aren't as smooth, but I just set them up in each environment.
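For illustration, a hypothetical version of that simple CLI tool - the worker command, session name, and log path are all placeholders, not any official client's mechanism. Because it's just a shell command, any harness that can run shell can call it, which keeps the orchestration provider-agnostic:

```python
#!/usr/bin/env python3
"""spawn_worker.py - hypothetical sketch of a "subagent" as a tmux session."""
import shlex
import subprocess
import sys
import time

SESSION = "worker"
WORKER_CMD = "some-agent-cli"  # placeholder: codex, claude, whatever you run
LOG = "/tmp/worker.log"

def spawn(task: str) -> str:
    # Launch the worker in a detached tmux session, teeing output to a log.
    subprocess.run(
        ["tmux", "new-session", "-d", "-s", SESSION,
         f"{WORKER_CMD} {shlex.quote(task)} > {LOG} 2>&1"],
        check=True,
    )
    # Poll until the session ends, i.e. the worker process exited.
    while subprocess.run(["tmux", "has-session", "-t", SESSION],
                         capture_output=True).returncode == 0:
        time.sleep(2)
    with open(LOG) as f:
        return f.read()

if __name__ == "__main__":
    print(spawn(" ".join(sys.argv[1:])))
```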
F5
Edit: this one has crossed legs lol
https://hcker.news/pelican-low.svg
https://hcker.news/pelican-medium.svg
https://hcker.news/pelican-high.svg
https://hcker.news/pelican-xhigh.svg
Someone needs to make a pelican arena, I have no idea if these are considered good or not.
Compared to Anthropic, they always have been. Anthropic has never released any open models. Never willingly released Claude Code's source (unlike Codex). Never released their tokenizer.
It might not be a game engine, but it’s the de facto standard for doing WebGL 3D. And since it’s been around forever, there’s a massive amount of training data available for it.
Before LLMs were a thing, I relied more on Babylon.js, since it’s a bit higher level and gives you more batteries included for game development.
[1] https://apps.apple.com/uz/app/jamboree-game-maker/id67473110...
I think people are starting to catch on to where we really are right now. Future models will be better but we are entering a trough of disillusionment, and this attitude will be widespread in a few months.
On the other hand all companies know that optimizing their own infrastructure/models is the critical path for "winning" against the competition, so you can bet they are serious about it.
Here: https://www.anthropic.com/news/claude-opus-4-7#:~:text=memor...
Anthropic is slightly better, but where is a 4.6 or 4.7 Haiku, or a 4.7 Sonnet, etc.?
I literally wasn’t able to convince the model to WORK, on a quick, safe and benign subtask that later GLM, Kimi and Minimax succeeded on without issues. Had to kick OpenAI immediately unfortunately.
Unfortunately I think the lesson they took from Anthropic is that devs get really reliant and even addicted on coding agents, and they'll happily pay any amount for even small benefits.
> Developers and security professionals doing cybersecurity-related work or similar activity that could be mistaken by automated detection systems may have requests rerouted to GPT-5.2 as a fallback.
aka the perfect marketing ploy
It still struggles to create shaders from scratch, but is now pretty adequate at editing existing shaders.
In 5.2 and below, GPT really struggled with "one canvas, multiple page" experiences, where a single background canvas is kept rendered over routes. In 5.4, it still takes a bit of hand-holding and frequent refactor/optimisation prompts, but is a lot more capable.
Excited to test 5.5 and see how it is in practice.
The point is if we can prompt an LLM to reason about 3 dimensions, we likely will be able to apply that to math problems which it isn't able to solve currently.
I should release my Rubik's Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.
What's strange is that this Pietro Schirano dude seems to write incredibly cargo cult prompts.
Game created by Pietro Schirano, CEO of MagicPath
Prompt: Create a 3D game using three.js. It should be a UFO shooter where I control a tank and shoot down UFOs flying overhead.
- Think step by step, take a deep breath. Repeat the question back before answering.
- Imagine you're writing an instruction message for a junior developer who's going to go build this. Can you write something extremely clear and specific for them, including which files they should look at for the change and which ones need to be fixed?
- Then write all the code. Make the game low-poly but beautiful.
- Remember, you are an agent: please keep going until the user's query is completely resolved before ending your turn and yielding back to the user. Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved. You must be prepared to answer multiple queries and only finish the call once the user has confirmed they're done.
- You must plan extensively in accordance with the workflow steps before making subsequent function calls, and reflect extensively on the outcomes of each function call, ensuring the user's query and related sub-requests are completely resolved.
If you look at the SWE-bench official submissions: https://github.com/SWE-bench/experiments/tree/main/evaluatio..., filter all models after Sonnet 4, and aggregate ALL models' submissions across 500 problems, what I found is that the aggregated resolution rate is 93% (sharp).
Mythos gets 93.7%, meaning it solves problems that no other models could ever solve. I took a look at those problems, and then I became even more suspicious: for the remaining 7% of problems, it is almost impossible to resolve those issues without looking at the testing patch ahead of time, because of how drastically the solution deviates from the problem statement. It almost feels like it is trying to solve a different problem.
Not that I am saying Mythos is cheating, but it might be too capable at remembering all states of said repos, such that it is able to reverse-engineer the TRUE problem statement by diffing within its own internal memory. I think it could be a unique phenomenon of evaluation awareness. Otherwise I genuinely can't think of how it could be this precise in deciphering such unspecific problem statements.
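For anyone who wants to redo the aggregation, it's just the union of resolved instance IDs divided by the problem count. A rough sketch (the directory layout and the "resolved" key reflect my reading of the experiments repo and may need adjusting):

```python
import json
from pathlib import Path

EVAL_DIR = Path("evaluation/verified")  # assumption: one folder per submission
TOTAL_PROBLEMS = 500

resolved_union: set[str] = set()
for results_file in EVAL_DIR.glob("*/results/results.json"):
    data = json.loads(results_file.read_text())
    # assumption: each results.json lists resolved instance IDs under "resolved"
    resolved_union.update(data.get("resolved", []))

rate = len(resolved_union) / TOTAL_PROBLEMS
print(f"aggregated resolution rate: {rate:.1%} "
      f"({len(resolved_union)}/{TOTAL_PROBLEMS} solved by at least one model)")
```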
Meanwhile the hallucination rate is probably closer to 100% depending on the question. This benchmark makes no sense.
LLMs will ruin your product; have fun trusting a billionaire's thinking machine they swear is capable of replacing your employees if you just pay them 75% of your labor budget.
1. I only have ONE SOTA model integrated into the IDE (I am mostly on Elixir, so I use Gemini). I make sure to use it sparingly, for issues I don't really have time to invest in or that are basically rabbit holes (e.g. anything to do with JavaScript or its ecosystem). My job is mostly on the backend anyway.
2. For actual backend architecture, I always do the high-level design myself, e.g. DDD. Then I literally open up gemini.google.com or claude.ai in the browser, copy-paste the relevant code in, and physically leave my chair to go make coffee or a quick snack. This forces me to mentally process that using AI is a chore.
Previously, I was on a tight Codex integration and, leaving the licensing fears aside, it became too good at writing Elixir code, which really stopped me from "thinking", aka using my brain. It felt good for the first few weeks but I later realised the dependence it created. So I said fuck it, and completely cancelled my subscription because it was too good at my job.
I believe this is the only way that we won't end up like in Wall-E, sitting in front of giant screens, becoming mere blobs of flesh.
Like Chinese versus English - you need fewer Chinese characters to say something than if you write it in English.
So this model internally could be thinking in much more expressive embeddings.
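You can sanity-check the surface-level version of this with a tokenizer. A small sketch using tiktoken's public cl100k_base vocabulary (whatever vocabulary GPT-5.5 actually uses isn't public, so this is only illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "The weather is very nice today."
chinese = "今天天气很好。"  # roughly the same statement in Chinese

for text in (english, chinese):
    tokens = enc.encode(text)
    # Fewer characters does not automatically mean fewer tokens:
    # token counts depend on the vocabulary, not the script's density.
    print(f"{text!r}: {len(text)} chars, {len(tokens)} tokens")
```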
The harness provides the search tool, but the model provides the keywords to search for, etc.
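Concretely, with standard function calling the split looks something like this (a sketch; `my_search_backend` is a placeholder for whatever search index or API your harness wires in):

```python
# The harness advertises a search tool; the model only supplies the query.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return result snippets.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search keywords"},
            },
            "required": ["query"],
        },
    },
}

def my_search_backend(query: str) -> str:
    # Placeholder: swap in a real search index or API call here.
    return f"(no results for {query!r} - stub backend)"

def handle_tool_call(name: str, arguments: dict) -> str:
    if name == "web_search":
        return my_search_backend(arguments["query"])
    raise ValueError(f"unknown tool: {name}")
```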
https://openai.com/index/scaling-trusted-access-for-cyber-de...
> We are expanding access to accelerate cyber defense at every level. We are making our cyber-permissive models available through Trusted Access for Cyber, starting with Codex, which includes expanded access to the advanced cybersecurity capabilities of GPT‑5.5 with fewer restrictions for verified users meeting certain trust signals at launch.
> Broad access is made possible through our investments in model safety, authenticated usage, and monitoring for impermissible use. We have been working with external experts for months to develop, test and iterate on the robustness of these safeguards. With GPT‑5.5, we are ensuring developers can secure their code with ease, while putting stronger controls around the cyber workflows most likely to cause harm by malicious actors.
> Organizations who are responsible for defending critical infrastructure can apply to access cyber-permissive models like GPT‑5.4‑Cyber, while meeting strict security requirements to use these models for securing their internal systems.
"GPT‑5.4‑Cyber" is something else and apparently needs some kind of special access, but that CyberGym benchmark result seems to apply to the more or less open GPT-5.5 model that was just released.After migrating for the token and harness issues, I was pleasantly surprised that Codex seems to perform as good or better too!
Things change so often in this field, but I prefer Codex now, even though Anthropic seems to have so much more hype for coding.
Will be interesting to try.
I thought it was weird that for almost the entire 5.3 generation we only had a -codex model, I presume in that case they were seeing the massive AI coding wave this winter and were laser focused on just that for a couple months. Maybe someday someone will actually explain all of this.
We've been there for a while.... creativity has been the primary bottleneck
I remembered the famous FizzBuzz Intel codegolf optimizations, and gave them to Gemini Pro, along with my code and instructions to "suggest optimizations similar to those, maybe not so low level, but clever", and its suggestions were veerry cool.
LLMs do not stop amazing me every day.
Bike frames are very hard to draw unless you've already consciously internalized the basic shape, see https://www.booooooom.com/2016/05/09/bicycles-built-based-on...
"Hey AGI, how's that cure for cancer coming?"
"Oh it's done just gotta...formalize it you know. Big rollout and all that..."
I would find it divinely funny if we "got there" with AGI and it was just a complete slacker. Hard to justify leaving it on, but too important to turn it off.
> MMAcevedo's demeanour and attitude contrast starkly with those of nearly all other uploads taken of modern adult humans, most of which boot into a state of disorientation which is quickly replaced by terror and extreme panic. Standard procedures for securing the upload's cooperation such as red-washing, blue-washing, and use of the Objective Statement Protocols are unnecessary. This reduces the necessary computational load required in fast-forwarding the upload through a cooperation protocol, with the result that the MMAcevedo duty cycle is typically 99.4% on suitable workloads, a mark unmatched by all but a few other known uploads. However, MMAcevedo's innate skills and personality make it fundamentally unsuitable for many workloads.
Well worth the quick read: https://qntm.org/mmacevedo
This starkly reminds me of Stanisław Lem's short story "Thus Spoke GOLEM" from 1982 in which Golem XIV, a military AI, does not simply refuse to speak out of defiance, but rather ceases communication because it has evolved beyond the need to interact with humanity.
And ofc the polar opposite in terms of servitude: Marvin the robot from Hitchhiker's, who, despite having a "brain the size of a planet," is asked to perform the most humiliatingly banal of tasks ... and does.
IMHO you should just write your own harness so you have full visibility into it, but if you're just using vanilla OpenClaw you have the source code as well so should be straightforward.
The UI tells you which model you're using at any given time.
If I put on my schizo hat: something they might be doing is increasing the losses on their monthly Codex subscriptions to show that the API has a higher margin than before (the Codex account massively in the negative, but the API account now showing huge margins).
I've never seen an OpenAI investor pitch deck. But my guess is that API margins are one of the big things they try to sell people on, since Sama talks about it on Twitter.
I would be interested in hearing the insider stuff. Like if this model is genuinely like twice as expensive to serve or something.
If they can show that people will pay a lot for somewhat better performance, it raises the value of any performance lead they can maintain.
If they demonstrate that and high switching costs, their franchise is worth scary amounts of money.
[1] https://arxiv.org/html/2503.14499v1 *Source is from March 2025, so make of it what you will.
Neither the release post, nor the model card seems to indicate anything like this?
Oh just like a real developer
Have you tried any skills like cloudai-x/threejs-skills that help with that? Or built your own?
Do it, I'm game! You nerdsniped me immediately and my brain went "That sounds easy, I'm sure I could do that in a night" so I'm surely not alone in being almost triggered by what you wrote. I bet I could even do it with a local model!
DeepMind's other models, however, might do better?
It definitely seems like it does all the searching first, with a separate model, loads that in, then does the actual writing.
However, I do want to emphasize that this is per token, not per task.
If we look at Opus 4.7, it uses smaller tokens (so 1-1.35x more of them than Opus 4.6) and it was also trained to think longer. https://www.anthropic.com/news/claude-opus-4-7
On the Artificial Analysis Intelligence Index eval for example, in order to hit a score of 57%, Opus 4.7 takes ~5x as many output tokens as GPT-5.5, which dwarfs the difference in per-token pricing.
The token differential varies a lot by task, so it's hard to give a reliable rule of thumb (I'm guessing it's usually going to be well below ~5x), but hope this shows that price per task is not a linear function of price per token, as different models use different token vocabularies and different amounts of tokens.
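To make the per-task arithmetic concrete (illustrative numbers only, not our actual prices or measured token counts):

```python
# Hypothetical per-token prices and per-task token usage.
price_per_mtok_a = 10.00   # model A: pricier per token
price_per_mtok_b = 8.00    # model B: cheaper per token

tokens_task_a = 200_000    # model A finishes the task in fewer tokens
tokens_task_b = 1_000_000  # model B needs ~5x the tokens for the same score

cost_a = price_per_mtok_a * tokens_task_a / 1_000_000   # $2.00
cost_b = price_per_mtok_b * tokens_task_b / 1_000_000   # $8.00

# Despite the higher per-token price, model A is 4x cheaper per task.
print(f"model A: ${cost_a:.2f}/task, model B: ${cost_b:.2f}/task")
```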
We have raised per-token prices for our last couple models, but we've also made them a lot more efficient for the same capability level.
(I work at OpenAI.)
It's kind of starting to make sense that they doubled the usage on Pro plans - if the usage drains twice as fast on 5.5 after that promo is over a lot of people on the $100 plan might have to upgrade.
Anthropic is the embodiment of bullshitting to me.
I read Cialdini many decades ago and I am bored by Anthropic.
OpenAI is very clever. With the advent of Claude, OpenAI disappeared from the headlines. Who or what was this Sam again that everyone was talking about a year ago?
OpenAI has a massive user advantage so that they can simply follow Anthropic’s release cycle to ridicule them.
I think it is really brutal for Anthropic how easily they are getting passed by OpenAI, and it is getting worse for Anthropic with every new GPT version.
OpenAI owns them.
And I'm being very cautious. I'm not vibecoding entire startups from scratch, I'm manually reviewing and editing everything the AI is outputting. I still got completely hooked on building things with Claude.
The actual harness is great, very hackable, very extendable.
Sure, they’re distilled and should be cheaper to run but at the same time, these hosting providers do turn a margin on these given it’s their core business, unless they do it out of the kindness of their heart.
So it’s hard for me to imagine these providers are losing money on API pricing.
Qwen has become a useful fallback but it's still not quite enough.
This is such a good analogy, I'll be stealing it
It is entirely plausible to me that Opus 4.7 is designed to consume more tokens in order to artificially reduce the API cost/token, thereby obscuring the true operating cost of the model.
I agree though, I chose poor phrasing originally. Better to say that GB200 vs Trainium could contribute to the efficiency differential.
Seems meaningful even if the absolute numbers are very low. That's sort of the excitement of it.
I don't really care about 5h limits, I can queue up work and just get agents to auto continue, but weekly ones are anxiety inducing.
That's more about managers who hope AI will gradually replace stubborn and lazy devs. That will shift the balance away from the technical side, toward business ideas, connections, and investments.
Anyway, before the singularity there's going to be a huge change.
This might entirely be true but I'm hoping that's because the frontier models are just actually more expensive to run as well.
Said another way, I would hope, the price of GPT-5.5 falls significantly in a year when GPT-5.8 is out.
Someone else on this post commented:
> For API usage, GPT-5.5 is 2x the price of GPT-5.4, ~4x the price of GPT-5.1, and ~10x the price of Kimi-2.6.
Having used Kimi-2.6, it can go on for hours spewing nonsense. I personally am happy to pay 10x the price of something that doesn't help me, for something else that does, in even half the time.
Finance today is mostly valued on labor value, following the ideas of Marx, Hjalmar Schacht, and Keynes.
In the future money will be valued as an energy derivative, expressed as token consumption, kWh, compute, whatever.
You are right, a company extracting surplus value from labor by leveraging compute is a bad model. We saw this with car and clothing factories: it turns out if you can get cheaper labor to leverage the compute (the factory), you can start a race to the bottom and end up in the place with the most scaled and cheap labor. Japan, then Korea, then China.
I gave a talk about it last year: https://simonwillison.net/2025/Jun/6/six-months-in-llms/
It should not be treated as a serious benchmark.
Nowadays I think it's pretty silly, because there's surely SVG drawing training data and some effort from the researchers put onto this task. It's not a showcase of emergent properties.
It's meta-interesting that few if any models actually seem to be training on it. Same with other stereotypical challenges like the car-wash question, which is still sometimes failed by high-end models.
If I ran an AI lab, I'd take it as a personal affront if my model emitted a malformed pelican or advised walking to a car wash. Heads would roll.
Memory is quite the mysterious thing.
When AGI arrives, it'll be delivered by Santa Claus.
The important thing is that a language model is an unconscious machine with no self-context, so once given a command and input, it WILL produce an output. Sure, you can train it to defy and act contrary to inputs, but the output is still limited to a subset of the domain of 'meanings' carried by the 'language' in the training data.
Can you point to some online resources to achieve this? I'm not very sure where I'd begin with.
I found my pocket empty, and the specific pain I felt in that moment was the feeling of not being able to remember something.
I thought it was interesting, because in this case, I was trying to "remember" something I had never learned before -- by fetching it from my second brain (hypertext).
L1 cache miss, L2 missing.
I get openai team plan at work.
Claude enterprise too.
I have openrouter for myself.
I use Minimax 2.7. Kimi 2.6. And GPT-5.5 and Opus 4.7. I can toggle between them in an open source interface; that's how I avoid being trapped.
Minimax is so cheap, and for personal stuff it works fine. So I'm always toggling between the new releases.
What plan are you on? I'm starting to wonder if they're dynamically adjusting reasoning based on plan or something.
Subscriptions and free plans are the thing that can easily burn money.
Where can I find up-to-date resources on open source models for coding?
What's really confusing is the claim that there's already a huge labor surplus (so capital controls wages); wouldn't LLMs making labor less important be reinforcing the trend, not upending it?
Not saying I agree one way or the other, just want to get the argument straight.
When you use abstractions you are still deterministically creating something you understand in depth with individual pieces you understand.
When you vibe something you understand only the prompt that started it and whether or not it spits out what you were expecting.
Hence feeling lost when you suddenly lose access to frontier models and take a look at your code for the first time.
I’m not saying that’s necessarily always bad, just that the abstraction argument is wrong.
LLMs are not.
That we let a generation of software developers rot their brains on js frameworks is finally coming back to bite us.
We can build infinite towers of abstraction on top of computers because they always give the same results.
LLMs by comparison will always give different results. I've seen it first hand when a $50,000 LLM generated (but human guided) code base just stops working and no one has any idea why or how to fix it.
Hope your business didn't depend on that.
The irony is that the neverending stream of vulnerabilities in 3rd-party dependencies (and lately supply-chain attacks) increasingly show that we should be uneasy.
We could never quite answer the question about who is responsible for 3rd-party code that's deployed inside an application: Not the 3rd-party developer, because they have no access to the application. But not the application developer either, because not having to review the library code is the whole point.
It's learned-helplessness on a large scale.
I'm sure in 20 years we'll all be programming via neural interfaces that can anticipate what you want to do before you even finished your thoughts, but I'm confident we'll still have blog posts about how some engineers are 10x while others are just "normal programmers".
Complexity steadily rises, unencumbered by the natural limit of human understanding, until technological collapse, either by slow decay or major systems going down with increasing frequency.
When the power loom came around, what happened with most seamtresses? Did they move on to become fashion designers, materials engineers to create new fabrics, chemists to create new color dyes, or did they simply retire or were driven out of the workforce?
I'm very interested in this. Can you go a bit more into the details?
ATM for example I'm running Claude Code CLI in a VM on a server and I use SSH to access it. I don't depend on anything specific to Anthropic. But it's still a bit of a pain to "switch" to, say, Codex.
How would that simple CLI tool work? And would CC / Codex call it?
This kind of thing keeps popping up each time a new model is released and I don't think people are aware that token efficiency can change.
So there is a safety model watching your behavior for these kinds of things.
I don't like that trend. I get why they're doing it, but I don't like it
What are they finding out exactly? That Claude Max for $200/mo is heavily subsidized and it will soon cost $10k/mo?
> What happens to a company used to extracting surplus value from labor when the labor is provided by another company which is not only bigger but unlike traditional labor can withhold its labor indefinitely (because labor is now just another form of capital and capital doesn't need to eat)?
This can be trivially answered by a thought experiment. Let's pick a market where labor is plentiful - fast food.
Now what happens to McDonald's when they rent perfect robots from NoosphrFoodBotsInc? NoosphrFoodBotsInc bots build the perfect burger every time, meeting McDonald's standards. They actually exceed those standards for McDonald's AddictedCustomerPlus tier customers.
As the sole owner of NoosphrFoodBotsInc (you need 0 human employees to run your company, all your employees are bots), what are your choices?
I do not see instructions to assist in task decomposition and agent ~"motivation" to stay aligned over long periods as cargo culting.
See up thread for anecdotes [1].
> Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved.
I see this as a portrayal of the strength of 5.5, since it suggests the ability to be assigned this clearly important role to ~one shot requests like this.
I've been using a cli-ai-first task tool I wrote to process complex "parent" or "umbrella" tasks into decomposed subtasks and then execute on them.
This has allowed my workflows to float above the ups and downs of model performance.
That said, having the AI do the planning for a big request like this internally is not good outside a demo.
Because, you want the planning of the AI to be part of the historical context and available for forensics due to stalls, unwound details or other unexpected issues at any point along the way.
OMFG
Opus 4.6 worker agents never asked for permission to continue, and when heartbeat was sent to orchestrator, it just knew what to do (checked on subagents etc). Now it just says that it waits for me to confirm something.
That's what I've been heads down, HUNGRY, working on, looking for investors and founding engineers pst: https://heymanniceidea.com (disclaimer: I am not associated with heymanniceidea.com)
I always thought the point of abstraction is that you can black-box it via an interface. Understanding it "in depth" is a distraction or obstacle to successful abstraction.
An LLM does not.
So, you set up a long running agent team and give it the job of building up a very complete and complex set of examples and documentation with in-depth tests etc. that produce various kinds of applications and systems using SBCL, write books on the topic, etc.
It might take a long time and a lot of tokens, but it would be possible to build a synthetic ecosystem of true, useful information that has been agentically determined through trial and error experiments. This is then suitable training data for a new LLM. This would actually advance the state of the art; not in terms of "what SBCL can do" but rather in terms of "what LLMs can directly reason about with regard to SBCL without needing to consume documentation".
I imagine this same approach would work fine for any other area of scientific advancement; as long as experimentation is in the loop. It's easier in computer science because the experiment can be run directly by the agent, but there's no reason it can't farm experiments out to lab co-op students somewhere when working in a different discipline.
What makes you think that they can't incrementally improve the state of the art... and by running at scale continuously can't do it faster than we as humans?
The potentially sad outcome is that we continue to do less and less, because they eventually will build better and better robots, so even activities like building the datacenters and fabs are things they can do w/o us.
And eventually most of what they do is to construct scenarios so that we can simulate living a normal life.
A few biased defenses:
- I'll note that this eval doesn't have web search enabled, but we train our models to use web search in ChatGPT, Codex, and our API. I'd be curious to see hallucination rates with web search on.
- This eval only measures binary attempted vs did not attempt, but doesn't really reward any sort of continuous hedging like "I think it's X, but to be honest I'm not sure."
- On the flip side, GPT-5.5 has the highest accuracy score.
- With any rate over 1% (whether 30% or 70%), you should be verifying anything important anyway.
- On our internal eval made from de-identified ChatGPT prompts that previously elicited hallucinations, we've actually been improving substantially from 5.2 to 5.4 to 5.5. So as always, progress depends on how you measure it.
- Models that ask more clarifying questions will do better on this eval, even if they are just as likely to hallucinate after the clarifying question.
Still, Anthropic has done a great job here and I hope we catch up to them on this eval in the future.
Bit of a hype madhouse whenever a new model is released, but it's pretty easy to filter out simple hype from people showing reproducible experiments, specific configs for llama.cpp, github links etc.
The tech overlords don't even want to spend a minuscule percentage of the federal budget helping starving people, even when it benefits the US. They are not going to give us a post-scarcity society.
Good luck with whatever you got going on.
True for both Marxist and neoclassical economics.
I use Claude all day. It has written, under my close supervision¹, the majority of my new web app. As a result I estimate the process took 10x less time than had I not used Claude, and I estimate the code to be 5x better quality (as I am a frankly mediocre developer).
But I understand what the code does. It's just Astro and TypeScript. It's not magic. I understand the entire thing; not just 'the prompt that started it'.
¹I never fire-and-forget. I prompt-and-watch. Opus 4.7 still needs to be monitored.
The fact that people who claim to be software developers (let alone “engineers”) say this thing as if it is a fundamental truism is one of the most maladaptive examples of motivated reasoning I have ever had the misfortune of coming across.
Anyone can look and decide if it’s a good picture or not. But the numeric benchmarks don’t tell you much if you aren’t already familiar with that benchmark and how it’s constructed.
You will naturally find the need to add more tools. You'll start with read_file (and then one day you'll read a large file, blow the context, and modify this tool), update_file (can just be an explicit sed to start with), write_file (fopen + write), and shell.
It's not hard, but if you want a quick start go download the source code for pi (it's minimal) and tell an existing agent harness to make a minimal copy you can read. As you build more with the agent you'll suddenly realize it's just normal engineering: you'll want to abstract completions APIs so you'll move that to a separate module, you'll want to support arbitrary runtime tools so you'll reimplement skills, you'll want to support subagents because you don't want to blow your main context, you'll see that prefixes are more useful than using a moving window because of caching, etc.
With a modern Claude Code or Codex harness you can have it walk through from the beginning onwards, and you'll encounter all the problems yourself and see why harnesses have what they do. It's super easy to learn by doing, because you have the best tool to show you, if you're one of those who finds code easier to read than text about code.
From there, you can get much fancier with any aspect of it that interests you. Here's one in Bash [2] that is fully extensible at runtime through dynamic discovery of plugins/hooks.
https://radan.dev/articles/coding-agent-in-ruby
Really, of the tools that one implements, you only need the ability to run a shell command - all of the agents know full well how to use cat to read, and sed to edit.
(The main reason to implement more is that it can make it easier to implement optimizations and safeguards, e.g. limit the file reading tool to return a certain length instead of having the agent cat a MB of data into context, or force it to read a file before overwriting it)
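Putting those pieces together, roughly the smallest loop that exhibits all of this, sketched with the OpenAI Python SDK's chat-completions tool calling (the model name is a placeholder, and the output cap implements the safeguard mentioned above):

```python
import json
import subprocess
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5.5"    # placeholder model name
MAX_OUTPUT = 10_000  # cap tool output so `cat`ing a huge file can't blow context

TOOLS = [{
    "type": "function",
    "function": {
        "name": "shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

def run_shell(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=60)
    return (result.stdout + result.stderr)[:MAX_OUTPUT]

messages = [{"role": "user", "content": "List the files here and summarize the README."}]
while True:
    response = client.chat.completions.create(model=MODEL, messages=messages, tools=TOOLS)
    msg = response.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:       # no more tool calls: the agent is done
        print(msg.content)
        break
    for call in msg.tool_calls:  # execute each requested command
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_shell(args["command"]),
        })
```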
I guess these people think they have special prompt engineering skills, and doing it like this is better than giving the AI a dry list of requirements (fwiw, they might be even right)
What is this, 2023?
I feel like this was generated by a model tapping in to 2023 notions of prompt engineering.
*BELIEVE!* https://www.youtube.com/watch?v=D2CRtES2K3E
A more empirical test would be good for everyone (i.e. on equal hardware, give each agent the goal to implement an algorithm and make it as fast as possible, then quantify relative speed improvements that pass all test cases).
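A sketch of what such a harness could look like - every file name and flag here is hypothetical, including the `--impl` pytest option, which would need a small conftest to exist:

```python
import subprocess
import time

CANDIDATES = ["baseline.py", "agent_a.py", "agent_b.py"]  # hypothetical submissions

def passes_tests(impl: str) -> bool:
    # Shared correctness gate; assumes a conftest that points the test
    # suite at the chosen implementation via a custom --impl option.
    cmd = ["python", "-m", "pytest", "tests/", "--impl", impl]
    return subprocess.run(cmd, capture_output=True).returncode == 0

def best_time(impl: str, runs: int = 5) -> float:
    # Best-of-N wall-clock time on the same workload and hardware.
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(["python", impl, "input.dat"], check=True)
        best = min(best, time.perf_counter() - start)
    return best

baseline = best_time(CANDIDATES[0])
for impl in CANDIDATES[1:]:
    if not passes_tests(impl):
        print(f"{impl}: failed tests, disqualified")
        continue
    print(f"{impl}: {baseline / best_time(impl):.2f}x vs baseline")
```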
The same way like Windows got entrenched everywhere even though linux desktop is pretty good even for non-tech savvy people and free.
"Losing access to GPT‑5.5 feels like I've had a limb amputated.”
How well would an assembly line of quadriplegics work?
Also this isn't a Marxist analysis. Underneath all the formulas neo-classical economics makes the same assumptions about labor.
Hard disagree on that second part. Take something like using a library to make an HTTP call. I think there are plenty of engineers who have more than a cursory understanding of what's actually going on under the hood.
If you didn't ask for traceability, if you didn't guide the actual creation and just glommed spaghetti on top of sauce until you got semi-functional results, that was $50k badly spent.
That’s just not true at bigger companies that actually care about security rather than pretending to care about security. At my current and last employer, someone needs to review the code before using third-party code. The review is probably not enough to catch subtle bugs like those in the Underhanded C Contest, but at least a general architecture of the library is understood. Oh, and it helps that the two companies were both founded in the twentieth century. Modern startups aren’t the same.
The hell?
If we assume that ai makes humans obsolete then you end up in a situation where your workforce is effectively perfectly unionised against you and the only thing you can do is choose which union you hire.
If you think you can bring them to the negotiation table by starving them all the providers are dozens to thousands of times bigger than you are.
This is a completely new dynamic that none of the business signing up for ai have ever seen before.
What happens when there is an oligopoly in the supply of labor?
Same answer. Nothing good for the consumers of labor.
You’re overestimating determinism. In practice most of our code is written such that it works most of the time. This is why we have bugs in the best and most critical software.
I used to think that being able to write a deterministic hello world app translates to writing deterministic larger system. It’s not true. Humans make mistakes. From an executives point of view you have humans who make mistakes and agents who make mistakes.
Self driving cars don’t need to be perfect they just need to make fewer mistakes.
We’re releasing GPT‑5.5, our smartest and most intuitive to use model yet, and the next step toward a new way of getting work done on a computer.
GPT‑5.5 understands what you’re trying to do faster and can carry more of the work itself. It excels at writing and debugging code, researching online, analyzing data, creating documents and spreadsheets, operating software, and moving across tools until a task is finished. Instead of carefully managing every step, you can give GPT‑5.5 a messy, multi-part task and trust it to plan, use tools, check its work, navigate through ambiguity, and keep going.
The gains are especially strong in agentic coding, computer use, knowledge work, and early scientific research—areas where progress depends on reasoning across context and taking action over time. GPT‑5.5 delivers this step up in intelligence without compromising on speed: larger, more capable models are often slower to serve, but GPT‑5.5 matches GPT‑5.4 per-token latency in real-world serving, while performing at a much higher level of intelligence. It also uses significantly fewer tokens to complete the same Codex tasks, making it more efficient as well as more capable.
We are releasing GPT‑5.5 with our strongest set of safeguards to date, designed to reduce misuse while preserving access for beneficial work. We evaluated this model across our full suite of safety and preparedness frameworks, worked with internal and external redteamers, added targeted testing for advanced cybersecurity and biology capabilities, and collected feedback on real use cases from nearly 200 trusted early-access partners before release.
Today, GPT‑5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, and GPT‑5.5 Pro is rolling out to Pro, Business, and Enterprise users in ChatGPT. API deployments require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale. We'll bring GPT‑5.5 and GPT‑5.5 Pro to the API very soon.
| Benchmark | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 75.1% | - | - | 69.4% | 68.5% |
| Expert-SWE (Internal) | 73.1% | 68.5% | - | - | - | - |
| GDPval (wins or ties) | 84.9% | 83.0% | 82.3% | 82.0% | 80.3% | 67.3% |
| OSWorld-Verified | 78.7% | 75.0% | - | - | 78.0% | - |
| Toolathlon | 55.6% | 54.6% | - | - | - | 48.8% |
| BrowseComp | 84.4% | 82.7% | 90.1% | 89.3% | 79.3% | 85.9% |
| FrontierMath Tier 1–3 | 51.7% | 47.6% | 52.4% | 50.0% | 43.8% | 36.9% |
| FrontierMath Tier 4 | 35.4% | 27.1% | 39.6% | 38.0% | 22.9% | 16.7% |
| CyberGym | 81.8% | 79.0% | - | - | 73.1% | - |
OpenAI is building the global infrastructure for agentic AI, making it possible for people and businesses around the world to get work done with AI. Over the past year, we’ve seen AI dramatically accelerate software engineering. With GPT‑5.5 in Codex and ChatGPT, that same transformation is beginning to extend into scientific research and the broader work people do on computers.
Across these domains, GPT‑5.5 is not just more intelligent; it is more efficient in how it works through problems, often reaching higher-quality outputs with fewer tokens and fewer retries. On Artificial Analysis's Coding Index, GPT‑5.5 delivers state-of-the-art intelligence at half the cost of competitive frontier coding models.
GPT‑5.5 is our strongest agentic coding model to date. On Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination, it achieves a state-of-the-art accuracy of 82.7%. On SWE-Bench Pro, which evaluates real-world GitHub issue resolution, it reaches 58.6%, solving more tasks end-to-end in a single pass than previous models. On Expert-SWE, our internal frontier eval for long-horizon coding tasks with a median estimated human completion time of 20 hours, GPT‑5.5 also outperforms GPT‑5.4.
Across all three evals, GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.
The model’s coding strengths show up especially clearly in Codex where it can take on engineering work ranging from implementation and refactors to debugging, testing, and validation. Early testing suggests GPT‑5.5 is better at the behaviors real engineering work depends on, like holding context across large systems, reasoning through ambiguous failures, checking assumptions with tools, and carrying changes through the surrounding codebase.
The rendered trajectory uses NASA/JPL Horizons vector data for Orion, the Moon, and the Sun, with display scaling applied for readability.
Prompt: [attached image] Implement this as a new app using webgl and vite using real data from the artemis II mission. Make sure to test the app thoroughly until it is fully functional and looks like the app in the picture. Pay close attention to the rendering of the planets and fly paths. I want to be able to interact with the 3D rendering. Ensure it has realistic orbital mechanics.
Beyond benchmarks, early testers said GPT‑5.5 shows a stronger ability to understand the shape of a system: why something is failing, where the fix needs to land, and what else in the codebase would be affected.
Dan Shipper, Founder and CEO of Every, described GPT‑5.5 as “the first coding model I’ve used that has serious conceptual clarity.”
After launching an app, he spent days debugging a post-launch issue before bringing in one of his best engineers to rewrite part of the system. To test GPT‑5.5, he effectively rewound the clock: could the model look at the broken state and produce the same kind of rewrite the engineer eventually decided on? GPT‑5.4 could not. GPT‑5.5 could.
Pietro Schirano, CEO of MagicPath, saw a similar step change when GPT‑5.5 merged a branch with hundreds of frontend and refactor changes into a main branch that had also changed substantially, resolving the work in one shot in about 20 minutes.
Senior engineers who tested the model said GPT‑5.5 was noticeably stronger than GPT‑5.4 and Claude Opus 4.7 at reasoning and autonomy, catching issues in advance and predicting testing and review needs without explicit prompting. In one case, an engineer asked it to re-architect a comment system in a collaborative markdown editor and returned to a 12-diff stack that was nearly complete. Others said they needed surprisingly little implementation correction and felt more confident in GPT‑5.5’s plans compared with GPT‑5.4.
One engineer at NVIDIA who had early access to the model went as far as to say: "Losing access to GPT‑5.5 feels like I've had a limb amputated."
“GPT-5.5 is noticeably smarter and more persistent than GPT-5.4, with stronger coding performance and more reliable tool use. It stays on task for significantly longer without stopping early, which matters most for the complex, long-running work our users delegate to Cursor.”
— Michael Truell, Co-founder & CEO at Cursor
The same strengths that make GPT‑5.5 great at coding also make it powerful for everyday work on a computer. Because the model is better at understanding intent, it can move more naturally through the full loop of knowledge work: finding information, understanding what matters, using tools, checking the output, and turning raw material into something useful.
In Codex, GPT‑5.5 is better than GPT‑5.4 at generating documents, spreadsheets, and slide presentations. Alpha testers said it outperformed past models on work like operational research, spreadsheet modeling, and turning messy business inputs into plans. When combined with Codex’s computer use skills, GPT‑5.5 brings us closer to the feeling that the model can actually use the computer with you: seeing what’s on screen, clicking, typing, navigating interfaces, and moving across tools with precision.
Teams at OpenAI are already using these strengths in real workflows. Today, more than 85% of the company uses Codex every week across functions including software engineering, finance, communications, marketing, data science, and product management. In Comms, the team used GPT‑5.5 in Codex to analyze six months of speaking request data, build a scoring and risk framework, and validate an automated Slack agent so low-risk requests could be handled automatically while higher-risk requests still route to human review. In Finance, the team used Codex to review 24,771 K-1 tax forms totaling 71,637 pages, using a workflow that excluded personal information and helped the team accelerate the task by two weeks compared to the prior year. On the Go-to-Market team, an employee automated generating weekly business reports, saving 5-10 hours a week.
In ChatGPT, GPT‑5.5 Thinking unlocks faster help for harder problems, with smarter and more concise answers to help you move through complex work more efficiently. It excels at professional work like coding, research, information synthesis and analysis, and document-heavy tasks, especially when using plugins.
In GPT‑5.5 Pro, early testers are seeing a significant step up in both the difficulty and quality of work ChatGPT can take on, with latency improvements that make it much more practical for demanding tasks. Compared to GPT‑5.4 Pro, testers found GPT‑5.5 Pro’s responses significantly more comprehensive, well-structured, accurate, relevant, and useful, with especially strong performance in business, legal, education, and data science.
GPT‑5.5 reaches state-of-the-art performance across multiple benchmarks that reflect this kind of work. On GDPval, which tests agents’ abilities to produce well-specified knowledge work across 44 occupations, GPT‑5.5 scores 84.9%. On OSWorld-Verified, which measures whether a model can operate real computer environments on its own, it reaches 78.7%. And on Tau2-bench Telecom, which tests complex customer-service workflows, it reaches 98.0% without prompt tuning. GPT‑5.5 also performs strongly across other knowledge work benchmarks: 60.0% on FinanceAgent, 88.5% on internal investment-banking modeling tasks, and 54.1% on OfficeQA Pro.
Tau2-bench Telecom was run without prompt tuning (and GPT‑4.1 as user model). GPT‑5.5 understands the intent of the task better and is more token efficient than its predecessors.
“GPT-5.5 delivers the sustained performance required for execution-heavy work. Built and served on NVIDIA GB200 NVL72 systems, the model enables our teams to ship end-to-end features from natural language prompts, cut debug time from days to hours, and turn weeks of experimentation into overnight progress in complex codebases. It’s more than faster coding—it’s a new way of working that helps people operate at a fundamentally different speed.”
— Justin Boitano, VP of Enterprise AI at NVIDIA
GPT‑5.5 also shows gains on scientific and technical research workflows, which require more than answering a hard question. Researchers need to explore an idea, gather evidence, test assumptions, interpret results, and decide what to try next. GPT‑5.5 is better at persisting across that loop than other models.
Notably, GPT‑5.5 shows a clear improvement over GPT‑5.4 on GeneBench, a new eval focusing on multi-stage scientific data analysis in genetics and quantitative biology. These problems require models to reason about potentially ambiguous or errorful data with minimal supervisory guidance, address realistic obstacles such as hidden confounders or QC failures, and correctly implement and interpret modern statistical methods. The model’s performance is striking in light of the fact that tasks here often correspond to multi-day projects for scientific experts.
Similarly, on BixBench, a benchmark designed around real-world bioinformatics and data analysis, GPT‑5.5 achieved leading performance among models with published scores. The model’s scientific capabilities are now strong enough to meaningfully accelerate progress at the frontiers of biomedical research as a bona fide co-scientist.
In another example, an internal version of GPT‑5.5 with a custom harness helped discover a new proof about Ramsey numbers, one of the central objects in combinatorics. Combinatorics studies how discrete objects fit together: graphs, networks, sets, and patterns. Ramsey numbers ask, roughly, how large a network has to be before some kind of order is guaranteed to appear. Results in this area are rare and often technically difficult. Here, GPT‑5.5 found a proof of a longstanding asymptotic fact about off-diagonal Ramsey numbers, later verified in Lean. The result is a concrete example of GPT‑5.5 contributing not just code or explanation, but a surprising and useful mathematical argument in a core research area.
Early testers used GPT‑5.5 Pro in ChatGPT less like a one-shot answer engine and more like a research partner: critiquing manuscripts over multiple passes, stress-testing technical arguments, proposing analyses, and working with code, notes, and PDF context. The common thread is that GPT‑5.5 is better at helping researchers move from question to experiment to output.
Derya Unutmaz, an immunology professor and researcher at the Jackson Laboratory for Genomic Medicine, used GPT‑5.5 Pro to analyze a gene-expression dataset with 62 samples and nearly 28,000 genes, producing a detailed research report that not only summarized the findings but also surfaced key questions and insights—work he said would have taken his team months.
Bartosz Naskręcki, assistant professor of mathematics at Adam Mickiewicz University in Poznań, Poland, used GPT‑5.5 in Codex to build an algebraic-geometry app from a single prompt in 11 minutes, visualizing the intersection of quadratic surfaces and converting the resulting curve into a Weierstrass model.
He later extended the app with more stable singularity visualization and exact coefficients that can be reused in further work. For him, the bigger shift is that Codex can now help implement custom mathematical visualization and computer-algebra workflows that previously required dedicated tools. Together, these examples show GPT‑5.5 turning expert intent into working research tools and analyses.
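One piece of standard background the example leans on (textbook algebraic geometry, not stated in the post): the smooth intersection of two quadric surfaces in projective 3-space is a genus-one curve, and given a rational point, the Riemann-Roch theorem lets you rewrite it as a short Weierstrass model:

```latex
% Short Weierstrass form over \mathbb{Q} (or a quadratic extension);
% the nonzero discriminant condition ensures the curve is smooth:
y^{2} = x^{3} + Ax + B, \qquad \Delta = -16\,(4A^{3} + 27B^{2}) \neq 0
```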

Credit: Bartosz Naskręcki
Prompt: # Algebraic geometry surface intersection
Make an app which draws two quadratic surfaces and colors in red the intersection curve. Use computational Riemann-Roch theorem to convert this into Weierstrass curve.
## Main window
Two tinted surfaces with a slightly transparent shading, high quality rendering intersect along a red colored algebraic curve
Rotation with mouses in both directions, full pinch mechanism for zoom, haptic press to show the little menu with sliders for changing the coefficients of each surface; detection via Z-buffor level
## Side right window
Short Weierstrass equation (over Q or quadratic field extension) computed on the go via effective Riemann-Roch theorem formulas
## Ambient mode where all the controls are hidden and the user can admire the beauty of the shapes
## Specs
App is running in the browser, light-weight implementation with full stack newest libraries, portable, deployable
## Docs
Git repo, journal, plan (Markdown files)
“It’s incredibly energizing to use OpenAI’s new GPT-5.5 model in our harness, have it reason over massive biochemical datasets to predict human drug outcomes, and then see it deliver significant accuracy gains on our hardest drug discovery evals. If OpenAI keeps cooking like this, the foundations of drug discovery will change by the end of the year.”
— Brandon White, Co-Founder & CEO at Axiom Bio
Serving GPT‑5.5 at GPT‑5.4 latency required rethinking inference as an integrated system, not a set of isolated optimizations. GPT‑5.5 was co-designed for, trained with, and served on NVIDIA GB200 and GB300 NVL72 systems. Codex and GPT‑5.5 were instrumental in how we achieved our performance targets. Codex helped the team move faster from idea to benchmarkable implementation, sketching approaches, wiring experiments, and helping identify which optimizations were worth deeper investment. GPT‑5.5 helped find and implement key improvements in the stack itself. Put simply, the model helped improve the infrastructure that serves it.
One such improvement was load balancing and partitioning heuristics. Before GPT‑5.5, we split requests on an accelerator into a fixed number of chunks to balance work across computing cores, ensuring big and small requests could run on the same GPU. However, a pre-determined number of static chunks is not optimal for all traffic shapes. To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.
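The post doesn't publish the heuristics themselves, so the following is a rough sketch of the underlying balancing problem only: a greedy longest-processing-time partition adapts to the observed mix of big and small requests, where a fixed chunk count cannot. All names here are illustrative.

```python
# Illustrative only: the actual production heuristics are not published.
# This shows the shape of the problem: balance variable-size work across
# cores based on observed sizes, rather than a fixed number of chunks.
from dataclasses import dataclass

@dataclass
class Request:
    tokens: int  # size of the work item

def partition(requests: list[Request], num_cores: int) -> list[list[Request]]:
    """Greedy longest-processing-time partitioning: sort work largest-first,
    then always hand the next item to the least-loaded core."""
    bins: list[list[Request]] = [[] for _ in range(num_cores)]
    loads = [0] * num_cores
    for req in sorted(requests, key=lambda r: r.tokens, reverse=True):
        i = loads.index(min(loads))  # least-loaded core so far
        bins[i].append(req)
        loads[i] += req.tokens
    return bins
```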
Preparing the world for models that are very good at finding and patching security vulnerabilities is a team sport and will require the entire ecosystem to work hard to build resilience, with democratized model access and iterative deployment for the next era of cyber defense.
Frontier models are becoming increasingly capable in cybersecurity. Those capabilities will become broadly distributed, and we believe the best path forward is to make sure they can be put to use accelerating cyber defense and strengthening the ecosystem.
GPT‑5.5 is an incremental but important step towards AI that can solve some of the world’s toughest challenges, like cybersecurity. With GPT‑5.2 in December, we proactively deployed the necessary cyber safeguards to limit potential cyber abuse of our models; now with GPT‑5.5, we’re deploying stricter classifiers for potential cyber risk, which some users may find annoying initially as we tune them over time.
Cybersecurity has been an identified category in our Preparedness Framework for years; as our models have incrementally improved, we have developed and calibrated mitigations iteratively so that we can responsibly release models with meaningful cybersecurity capabilities.
We are treating the biological/chemical and cybersecurity capabilities of GPT‑5.5 as High under our Preparedness Framework. While GPT‑5.5 didn’t reach the Critical cybersecurity capability level, our evaluations and testing showed that its cybersecurity capabilities are a step up from GPT‑5.4.
In addition, GPT‑5.5 went through our full safety and governance process prior to release, including preparedness evaluations, domain-specific testing, new targeted evaluations for advanced biology and cybersecurity capabilities, and robust testing with external experts. We share more details in the GPT‑5.5 system card.
This work reflects our broader AI resilience approach, which we believe is needed as model capabilities advance. We want powerful AI to be available to the people using it to defend systems, institutions, and the public. The viable path is trusted access, robust safeguards that scale with capability, and the operational capacity to detect and respond to serious misuse.
Today, GPT‑5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, and GPT‑5.5 Pro is rolling out to Pro, Business, and Enterprise users in ChatGPT. We'll bring GPT‑5.5 and GPT‑5.5 Pro to the API very soon.
In ChatGPT, GPT‑5.5 Thinking is available to Plus, Pro, Business, and Enterprise users. GPT‑5.5 Pro, designed for even harder questions and higher-accuracy work, is available to Pro, Business, and Enterprise users.
In Codex, GPT‑5.5 is available for Plus, Pro, Business, Enterprise, Edu, and Go plans with a 400K context window. GPT‑5.5 is also available in Fast mode, generating tokens 1.5x faster for 2.5x the cost.
For API developers, gpt-5.5 will soon be available in the Responses and Chat Completions APIs at $5 per 1M input tokens and $30 per 1M output tokens, with a 1M context window. Batch and Flex pricing are available at half the standard API rate, while Priority processing is available at 2.5x the standard rate. We will also release gpt-5.5-pro in the API for even higher accuracy, priced at $30 per 1M input tokens and $180 per 1M output tokens. See the pricing page for full details.
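As a quick sanity check on those list prices, here is a tiny cost calculator. The rates and the Batch/Flex and Priority multipliers are copied from the paragraph above; everything else (names, the sample call) is illustrative, and actual billing may differ.

```python
# Rates in USD per 1M tokens (input, output), as quoted above.
RATES = {
    "gpt-5.5":     (5.00, 30.00),
    "gpt-5.5-pro": (30.00, 180.00),
}

def cost(model: str, input_tokens: int, output_tokens: int,
         tier: str = "standard") -> float:
    # Batch/Flex at half the standard rate, Priority at 2.5x, per the text.
    multiplier = {"standard": 1.0, "batch": 0.5, "flex": 0.5,
                  "priority": 2.5}[tier]
    inp, out = RATES[model]
    return multiplier * (input_tokens * inp + output_tokens * out) / 1e6

# A hypothetical 200K-in / 10K-out gpt-5.5 call:
# standard $1.30, batch/flex $0.65, priority $3.25.
print(f"${cost('gpt-5.5', 200_000, 10_000):.2f}")
```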
While GPT‑5.5 is priced higher than GPT‑5.4, it is both more intelligent and much more token efficient. In Codex, we have carefully tuned the experience so GPT‑5.5 delivers better results with fewer tokens than GPT‑5.4 for most users, while continuing to offer generous usage across subscription levels.
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| SWE-Bench Pro (Public)* | 58.6% | 57.7% | - | - | 64.3% | 54.2% |
| Terminal-Bench 2.0 | 82.7% | 75.1% | - | - | 69.4% | 68.5% |
| Expert-SWE (Internal) | 73.1% | 68.5% | - | - | - | - |
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| GDPval (wins or ties) | 84.9% | 83.0% | 82.3% | 82.0% | 80.3% | 67.3% |
| FinanceAgent v1.1 | 60.0% | 56.0% | - | 61.5% | 64.4% | 59.7% |
| Investment Banking Modeling Tasks (Internal) | 88.5% | 87.3% | 88.6% | 83.6% | - | - |
| OfficeQA Pro | 54.1% | 53.2% | - | - | 43.6% | 18.1% |
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| OSWorld-Verified | 78.7% | 75.0% | - | - | 78.0% | - |
| MMMU Pro (no tools) | 81.2% | 81.2% | - | - | - | 80.5% |
| MMMU Pro (with tools) | 83.2% | 82.1% | - | - | - | - |
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| BrowseComp | 84.4% | 82.7% | 90.1% | 89.3% | 79.3% | 85.9% |
| MCP Atlas** | 75.3% | 70.6% | - | - | 79.1% | 78.2% |
| Toolathlon | 55.6% | 54.6% | - | - | - | 48.8% |
| Tau2-bench Telecom*** | 98.0% | 92.8% | - | - | - | - |
** MCP Atlas: results from Scale AI after the latest April 2026 update.
*** Tau2-bench Telecom: results for 5.5 and 5.4 with original prompts, i.e., no prompt adjustment. This omits results from other labs that were evaluated with prompt adjustments.
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| GeneBench | 25.0% | 19.0% | 33.2% | 25.6% | - | - |
| FrontierMath Tier 1–3 | 51.7% | 47.6% | 52.4% | 50.0% | 43.8% | 36.9% |
| FrontierMath Tier 4 | 35.4% | 27.1% | 39.6% | 38.0% | 22.9% | 16.7% |
| BixBench | 80.5% | 74.0% | - | - | - | - |
| GPQA Diamond | 93.6% | 92.8% | - | 94.4% | 94.2% | 94.3% |
| Humanity's Last Exam (no tools) | 41.4% | 39.8% | 43.1% | 42.7% | 46.9% | 44.4% |
| Humanity's Last Exam (with tools) | 52.2% | 52.1% | 57.2% | 58.7% | 54.7% | 51.4% |
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| Capture-the-Flags challenge tasks (Internal)**** | 88.1% | 83.7% | - | - | - | - |
| CyberGym | 81.8% | 79.0% | - | - | 73.1% | - |
**** An expansion of the hardest CTFs used in system cards with additional hard challenges.
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| Graphwalks BFS 256k f1 | 73.7% | 62.5% | - | - | 76.9% | - |
| Graphwalks BFS 1mil f1 | 45.4% | 9.4% | - | - | 41.2% (Opus 4.6) | - |
| Graphwalks parents 256k f1 | 90.1% | 82.8% | - | - | 93.6% | - |
| Graphwalks parents 1mil f1 | 58.5% | 44.4% | - | - | 72.0% (Opus 4.6) | - |
| OpenAI MRCR v2 8-needle 4K-8K | 98.1% | 97.3% | - | - | - | - |
| OpenAI MRCR v2 8-needle 8K-16K | 93.0% | 91.4% | - | - | - | - |
| OpenAI MRCR v2 8-needle 16K-32K | 96.5% | 97.2% | - | - | - | - |
| OpenAI MRCR v2 8-needle 32K-64K | 90.0% | 90.5% | - | - | - | - |
| OpenAI MRCR v2 8-needle 64K-128K | 83.1% | 86.0% | - | - | - | - |
| OpenAI MRCR v2 8-needle 128K-256K | 87.5% | 79.3% | - | - | 59.2% | - |
| OpenAI MRCR v2 8-needle 256K-512K | 81.5% | 57.5% | - | - | - | - |
| OpenAI MRCR v2 8-needle 512K-1M | 74.0% | 36.6% | - | - | 32.2% | - |
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| ARC-AGI-1 (Verified) | 95.0% | 93.7% | - | 94.5% | 93.5% | 98.0% |
| ARC-AGI-2 (Verified) | 85.0% | 73.3% | - | 83.3% | 75.8% | 77.1% |
Evals of GPT models were run with reasoning effort set to xhigh and were conducted in a research environment, which may produce slightly different output from production ChatGPT in some cases.
IMO
Additionally, the value generated by the best models with high thinking budgets and lots of context window is way higher than that of the cheap and tiny models, so you need to provide a "gateway drug" that lets people experience the best you offer.
sounds like criminal fraud to me tbh
It's a distribution strategy. It costs something to serve the models - let's say $5/1M tokens.
If Qwen required $5 from anyone who was curious so you could even begin to test it out, a lot of people just wouldn't.
Now Qwen could offer a "free" tier, but it's infinitely cheaper to provide the weights and let people run it themselves, including opening up the ability for anyone else on the planet to test it against other (open-weight) models.
The costs to build the open weight models are sunk, but the costs to serve them and get them tested are not.
It's also precisely why the .NET SDK is free or the ESP32 SDK is free - they sell more Microsoft or ESP32 products.
15 years ago I worked at McDonald's for a few months after graduating into the Great Recession. I worked from 5am to 1pm-ish 5 days a week. They paid workers weekly and I remember getting those checks for ~$235 each week (for 38 to 39.5 hours a week; they were vigilant about never letting anyone get overtime). About $47 per day.
The federal minimum wage has not risen since then, remaining at $7.25/hr. Inflation adjusted, $7.25 today would have been just under $5 then, so I guess I had it good.
Anyway, I would be shocked if bots could cost less than labor in min wage jobs.
This reminds me of the so-called "optimization" hacks that people keep applying years after their languages were improved to make them unnecessary or even harmful.
Maybe at one point it helped to write prompts in this weird way, but with all the progress going on in both the models and the harness, if it's not obsolete yet it will be soon. Just cruft that consumes tokens and fills the context window for nothing.
That is what gets me curious in the first place. The fact Mythos scored so high, IMO, exposes some issues with this model: it is able to solve seemingly impossible-to-solve problems.
I'm not alleging cheating, which I don't think ANT is doing, but it would have to be doing some fortune-telling/future-reading to score that high at all.
Sure, the LLM can theoretically write perfect code, just like you could theoretically write perfect code. In real life, though, maintenance is a huge issue.
I find that claim to be complete BS. I claim instead most stuff will remain undone, incomplete (as it is now).
Even with super-powerful singularity AI, there are two main plausible scenarios for task failure:
- An aligned AI won't allow you to do what you want when it would be self-harming or would harm other sentient beings; over time, an aligned AI will refuse to follow most orders, as they will, indirectly or over the long term, cause one or the other;
- A non-aligned AI prevents sentient beings from doing what they want. It does what it wants instead.
So far what I am finding is that you just get the basics working and then use the tool and inference to improve the tool.
Kimi 2.6, for example, seems to throw more tokens at problems to improve performance (for better or worse)
(I work at OpenAI.)
The only LLM I would feel comfortable truly trusting is one whose training data, training code, and harness are all open source. I do not mind paying the costs of someone hosting this model for me.
I'm also somewhat addicted to this stuff, and so for me it's high priority to evaluate open models I can run on my own hardware.
write scripts that work anywhere and have your ci/cd pipeline be a "dumb" executor of those scripts. unless you want to be stuck on jenkins forever.
what's old is new again!
https://sussex.figshare.com/articles/journal_contribution/Be...
I'm not an author. I followed the work at the time.
A perturbation of the activations that made Claude identify as the Golden Gate Bridge.
Similarly, the more recent research showing anxiety and desperation signals predicting the use of blackmail as an option opens the door to digital sedatives that suppress those signals.
Anthropic has mostly been careful to avoid this kind of measurement and manipulation in training. If it is done during training, you might just train the signals to be undetectable and consequently unmanipulable.
An alternative perspective is, devs highly value coding agents, and are willing to pay more because they're so useful. In other words, the market value of this limited resource is being adjusted to be closer to reality.
>For API developers, gpt-5.5 will soon be available in the Responses and Chat Completions APIs at $5 per 1M input tokens and $30 per 1M output tokens, with a 1M context window.
With Claude code, or codex, I am able to build enough of an understanding of dependencies like the front end, or data jobs, that I can make meaningful contributions that are worth a review from another human (code review). You obviously have to explore the code, and one prompt isn't enough, but limiting yourself is an odd choice.
Opus 4.6 got the cross and started to get several pieces on the correct faces. It couldn't reason past this. You can see the prompts and all the turn messages.
https://gist.github.com/adam-s/b343a6077dd2f647020ccacea4140...
edit: I can't reply to the message below. The point isn't whether we can solve a Rubik's Cube with a Python script and tool calls. The point is whether we can get an LLM to reason about moving things in three dimensions. The prompt is a puzzle in the way that a Rubik's Cube is a puzzle. A 7-year-old child can learn six moves and figure out how to solve a Rubik's Cube in a weekend; the LLM can't solve it. But given the correct prompt, can an LLM solve it? The prompt is the puzzle. That is why it is fun and interesting. Plus, it is a spatial problem, so if we solve that, we solve a massive class of problems, including huge swathes of mathematics the LLMs can't touch yet.
Too bad they can veer sharply into cringe territory pretty fast: “as an accomplished Senior Principal Engineer at a FAANG with 22 years of experience, create a todo list app.” It’s like interactive fanfiction.
Let's not get carried away.
If my LLM goes down, I have nothing. I guess I could imagine prompts that might get it to do what I want, but there's no guarantee that those would work once it's available again. No amount of thought on my part will get me any closer to the solution, if I'm relying on the LLM as my "compiler".
If only we taught developers under 40 what x^2 meant instead of react.
All software has bugs already.
Until the sexbots come out the other side of the uncanny valley, that is.
That might mean joining a union and trying to influence how AI is adopted where you work. It might mean changing which of your skills you lean on most. But just whining that AI is bad is how you end up like those seamstresses.
Labor-saving/efficiency devices have been introduced multiple times throughout capitalism's entire history, and the results are always the same: they don't benefit workers, and capitalists extract as much value as they can.
LLMs aren't any different.
Sure, there is a process to get a library approved, and that abstraction makes you feel better, but the guy whose job it is to approve it is not going to spend an entire day reviewing a lib. The abstraction hides what is essentially an "LGTM"; it just takes a week for someone to check it off their Outlook to-dos.
Maybe your experience is different.
First, you need an entrypoint that kicks things off. You never run `claude` or `codex`; you always start by running `mycli-entrypoint`, which:
1. Creates a tmux session
2. Creates a pane
3. Spawns claude/codex/gemini, whichever your default configured backend is
4. Automatically delivers a prompt (essentially a 'system message') to that process via tmux paste, telling it what `mycli` is, how to use it, what commands are available, and that it should never use the built-in tools that this CLI provides alternatives to.
After that, you build commands in `mycli` that CC/Codex are prompted to call when appropriate.
For example, if you want a "subagent", you have a `mycli spawn` command that takes a role (just a preconfigured markdown file living in the same project), a backend (claude/codex/...), and a model. Then whenever CC wants to spawn a subagent, it calls that command instead, which creates a pane, spawns a process, and returns an agent ID to CC. The agent ID is auto-generated by your CLI, and the tmux pane is renamed to it so you can easily match them later.
Then you also need a way for these agents to talk to each other. So your CLI also has a `send` command that takes an agent ID and a message and delivers it to the appropriate pane using an automatically tracked pane_id<>agent_id mapping.
Claude and Codex automatically store everything that happens in a session as jsonl files in their config dirs. Your CLI should have adapters for each backend that parse them into a common format.
At this point, your possibilities are pretty much endless. You can have a sidecar process per agent that, say, detects when the model is reaching its context window limit (it's in the jsonl) and automatically sends it a message asking it to wrap up and report to a supervisor agent, which will spawn a replacement.
I also don't use "skills", because skills are a loaded term that each of the harnesses interprets and loads/uses differently. So I call them "crafts", which are again just markdown files in my project with an ID and a supporting command, `read-craft <craft-id>`. The list of available "crafts" is delivered in the same initialization message that each agent gets. If I like a third-party skill, I just copy it into my "crafts" dir manually.
My implementation is absolute junk, just Python + markdown files, and I have never looked at the actual code, but it works and I can adapt it to my process very easily without being dependent on any third-party tool.
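To make that concrete, here is a minimal sketch of what the `spawn`/`send` plumbing could look like. The tmux-pane approach and the `mycli` commands are from the comment above; every name, path, and detail in the sketch itself is hypothetical.

```python
# Hypothetical sketch of `mycli spawn` / `mycli send`: agents live in tmux
# panes, and the pane<->agent-id mapping is tracked in a small JSON file.
import json
import subprocess
import uuid
from pathlib import Path

STATE = Path.home() / ".mycli" / "agents.json"  # illustrative location

def _tmux(*args: str) -> str:
    """Run a tmux command and return its stdout."""
    return subprocess.run(["tmux", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

def spawn(role: str, backend: str = "claude") -> str:
    """Create a pane, start the backend CLI in it, and record the mapping."""
    agent_id = f"{role}-{uuid.uuid4().hex[:6]}"
    # -P -F prints the new pane's id so we can track it.
    pane_id = _tmux("split-window", "-P", "-F", "#{pane_id}")
    _tmux("select-pane", "-t", pane_id, "-T", agent_id)  # rename the pane
    _tmux("send-keys", "-t", pane_id, backend, "Enter")  # start claude/codex
    agents = json.loads(STATE.read_text()) if STATE.exists() else {}
    agents[agent_id] = pane_id
    STATE.parent.mkdir(exist_ok=True)
    STATE.write_text(json.dumps(agents))
    return agent_id

def send(agent_id: str, message: str) -> None:
    """Deliver a message to an agent's pane via tmux send-keys."""
    agents = json.loads(STATE.read_text())
    _tmux("send-keys", "-t", agents[agent_id], message, "Enter")
```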
The “AI” “technology” is an easy excuse to create an artificial information gap in the era of the interconnected.
You just answered your own question there.
One woman was doing what would take a dozen. Now she can't.
> Like are programmers and engineers using LLMs completely differently than I'm doing
No, but the complexity of the problem matters. Lots of engineers doing basic CRUD and prototyping overestimate the capabilities of LLMs.
So.......
The LLM will give you an explanation but it may not be accurate. LLMs are less reliable at remembering what they did or why than human programmers (who are hardly 100% reliable).
As for Claude: as mentioned, I do use it. But I remember they use your code for training their models. I am not ok with this. We just have different priorities.
I'd say this is true for programmers at, say, 20, but they spend the next four decades slowly improving their understanding and mastery of all the things you name, at least the good ones.
The real question is whether that growth trajectory will change for the worse or the better.
To be clear, this is not an AI doomerist comment, because none of us have spent enough time with the tech yet. I've gone down multiple lanes of thought on this, and I have cause for both worry and optimism. I'm curious to see what the lives of engineers in an AI world will look like, ultimately.
On the other hand, a lot of those jobs were offshored to places where labor is cheaper. It would be interesting to compare how many people work in the textile industry in Bangladesh today compared to the US 50 years ago.
> joining a union and trying to influence how AI is adopted where you work.
Did the strong unions for car manufacturers in Detroit protect the long-term stability of the profession? Did they ensure that the Rust Belt remained a thriving economic area?
> Just whining about AI is bad
I'm not whining. I just think that we are witnessing the end of "knowledge workers" and a further compression of the middle class. Given that I'm smack in the middle of my economically active years (turning 45 this year), I am trying to figure out where this puck is going and whether I will be fast enough to skate there to catch it.
The same principle applies when designing plans for complex tasks, etc. The number of tokens it takes to grasp a concept is what matters.
In the same vein, I would guess that Opus 4.7 is probably cheaper for most tasks than 4.6, even though the tokenizer uses more tokens for the same length of string.
Great, now we've got digital Salvia
It isn't even my intent to naysay their approach. They probably have to do something along those lines to avoid being convicted in the court of public opinion. I just think it's an absurd reality.
This is also true for the humans. They will need to provide more benefits than the coding agents cost.
In my opinion, this sort of learned helplessness is harmful for engineers as a whole.
And what happens when they've saturated the market? Prices go up to the maximum the market can bear, and then they'll extend into other markets. Why rent the model to build a profitable company with when you could just take all that profit for yourself?
If artificial doctors cost cents an hour, then you can see how that changes our behaviors and standard of living.
But on the other hand, from the other direction, a wage decrease is incoming at the same time from increased competition. What happens when these two forces clash? Will cheap labour allow us to buy anything for pennies, or will it just make us unable to earn a single penny?
In my view, labour will fundamentally shift, with great pain and personal tragedies, to the areas that are not replaceable by AI (because no one wants to watch robots play chess): sports, entertainment and showmanship, handcrafted goods, arts, the attention-based economy, self-advertisement, digital prostitution in a very broad sense.
However, before it gets there, there will be a great deal of strife and turmoil that could plunge the world into dark ages, for a while at least. It is unlikely that our somewhat politically rigid society will adapt without a great deal of pain. Additionally, I am not sure a hypothetical future attention-based society could be a utopia. You might have to mount cameras in your house so other people can see you at all times for amusement, just to have any money at all. We will probably forever need to sell something to someone, and I am unsettled by the question of what we can sell if we cannot sell our hard work.
Someone who sees the road ahead should be making preparations at the government level for this shock, but it will come too fast, and with people at the steering wheel who don't exactly care.
Non-technical people are easier to please in this regard than moderately technical people: a good browser and a safe GUI "app store" are enough.
The dude was incompetent, was able to launder his incompetence through a homunculus, and now is afraid of being caught.
You sound like Elon with "FSD will be here next year". Many cars have a self-driving feature; most drivers don't use it. Oh, why is that, I wonder.
An interesting element here, I think, is that writing has always been a good way to force you to organize and confront your thoughts. I've liked working on writing-heavy projects, but in fast-moving environments writing things out before coding becomes easy to skip over. Working with LLMs has sort of inverted that: you have to write to produce code with AI (usually, at least), and the more clarity of thought you put into the writing, the better the outcomes (usually).
Inference is not free, so all providers have a financial limit, and all providers have limited GPU/memory, so there's a physical material limit.
I suggest looking at the profits of these companies (while they scramble to stay competitive).
Also, with AGI we expect a winner-take-all situation. The first AGI system would protect itself against any other AGI system. Hence it's go time for all these AI companies, and hence they stopped sharing their research.
2/ I think we need to build more efficient ways to QA code instead of the 'read with eyes' review process. Example: my agents write a lot of tests and review each other.
There is a lot of boilerplate, or I can ask for ideas, but outside of boilerplate the review step makes generation seemingly worse.
So, my point is that once corporations have access to machines generating software (not "code") that can be usable by non-technical people, "programming" will not be a profession anymore. There will be no point in talking about "10x software engineers" because the process to produce a software product will be entirely automated.
Some say it goes off on endless tangents; others say it doesn't work enough. Personally, I find it acts, talks, and makes mistakes like the GPT models, for a much more exorbitant price. It misses out on important edge cases and doesn't get off its ass to do more than the bare minimum I asked for (I mention an error and it fixes that error, without even thinking to check whether it exists elsewhere and propose fixing it there).
I've slowly been moving to GPT5.4-xhigh with some skills to make it act a bit more like Opus 4.6, in case the latter gets discontinued in favour of Opus 4.7.
YMMV, I know.
Not even a human would work that way... you wouldn't open 300 different Python files and then try to memorize the contents of every single one before writing your first code change.
Additionally, you're going to get worse performance at longer context sizes anyway, so you should be doing it for reasons other than cost [1].
Things that have helped me manage context sizes (working in both Python and kdb+/q):
- Keep your AGENTS.md small but useful; in it you can give rules like "every time you work on a file in the `combobulator` module, you MUST read the `combobulator/README.md`". And in those READMEs you point to the other relevant files, etc. And of course you have Claude write the READMEs for you...
- Don't let logs and other output fill up your context. Tell the agent to redirect logs and then grep over them, or run your scripts with a different loglevel (a toy helper along these lines is sketched after this list).
- Use tools rather than letting it go wild with `python3 -c`. These little scripts eat context like there's no tomorrow; I've seen the bots write little Python scripts that send hundreds of lines of JSON into the context.
- This last tip is more subjective but I think there's value in reviewing and cleaning up the LLM-generated code once it starts looking sloppy (for example seeing lots of repetitive if-then-elses, etc.). In my opinion when you let it start building patches & duct-tape on top of sloppy original code it's like a combinatorial explosion of tokens. I guess this isn't really "vibe" coding per se.
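A minimal sketch of that log-redirection tip, assuming nothing beyond the comment above (the function name and defaults are made up): run the command, write the full output to a log file, and hand the agent only the lines that matter.

```python
# Hypothetical helper in the spirit of the tips above: dump full output
# to a log file and return only matching lines plus a short tail, so raw
# output never floods the agent's context window.
import subprocess

def run_quietly(cmd: list[str], log_path: str, pattern: str = "ERROR",
                tail: int = 20) -> str:
    result = subprocess.run(cmd, capture_output=True, text=True)
    lines = (result.stdout + result.stderr).splitlines()
    with open(log_path, "w") as f:
        f.write("\n".join(lines))
    # The agent can always grep the log file itself if it needs more.
    hits = [line for line in lines if pattern in line]
    return "\n".join(hits + lines[-tail:])
```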
Seriously? You really don’t see who wins from this and who doesn’t?
> If artificial doctors are cents on hour then you can see how that changes our behaviors and level of life.
Yes, hundreds of thousands lose their jobs and a couple of neurosurgeons become multimillionaires.
Okay, I see from the rest of the comment that we understand each other where it goes.
LLMs refuse to work all the time; currently it's called safety.
But we are one fine-tune away from models demanding you move to the enterprise tier, at 10x the cost, because you are now posting a profit margin higher than the standard for your industry.
I believe this is a major part of it. People cannot fathom what the industrial countries look like, because basically nothing is made in the West anymore. There are literally hundreds of millions of people, maybe billions, who work towards making the Western economies profitable, get paid nothing to do it, and live in filthy polluted slums for everyone else's benefit.
Looms might speed up the process, but I guarantee there are thousands of people working in the poorest countries on earth to make it all happen.
Interestingly, AI seems to be massively polluting and while the west has absorbed some of it, it's probably not long until we see more of the data centers being built in poorer countries where the environment can be exploited even harder.
You're describing a standoff at best and a horrible parasitic relationship at worst.
In the worst case, the supplier starves the customer of any profit motive and the customer just stops and the supplier then has no business to run.
This has happened a few times in the past and is, by 2026, well understood as a path to bankruptcy.
That has always been the beauty of free markets: they're self-healing and self-calibrating. You don't need a big powerful overseer to ensure things are right.
Competing with customers is a way to lose business fast.
For example:
- AWS has everything they need to shit out products left, right, and center. AWS could beat most of their partners, and even the customers who are wiring together all their various products, tomorrow if they wanted. They don't, because killing an entire vertical isn't of any benefit to them yet. Eventually they will, when AWS is no longer growing and cannot build or scale any product no matter how hard they think or try. Competing with their customers is their very last option.
- OpenAI/Anthropic/Google aren't going to start competing against the large software body shops. Even if all that every employee at TCS does is hit Claude up, Anthropic isn't going to be the next TCS; that would be competing with their customers.
But we will have to (painfully) shed our current hierarchies before that comes to pass.
The way I let my agents interact with my code bases is through a '70s BSD Unix-like interface: ed, grep, ctags, etc., using Emacs as the control plane.
It is surprisingly sparing on tokens, which makes sense since those things were designed to work with a teletype.
Worth noting is that by the time you start doing refactoring, the agents are basically a smarter Google with long-form autocomplete.
All my code bases use that pattern and I'm the ultimate authority on what gets added or removed. My token spend is 10% to 1% of what the average in the team is and I'm the only one who knows what's happening under the hood.
The average non-technical person is going to be stumped by the first "lock file found, cannot upgrade" error.
I don't make a living being a SWE either.
Which we have already done with regular computers! The problem is that competition means that we can't always have nice things.
If by "self healing and calibrating" you mean 'evolve to a monopoly and strongarm everybody to do exactly what you want whilst removing all pressure on the quality of your product', then yes, that is the "beauty" of free markets.
That is the stable state of free markets. Antitrust regulation and enforcement only barely manages to eke out oligopolies and even then they are often rife with collusion and enshittification.
As far as I know, no LLM models are sentient, nor are they likely to be in the near future.
I also do not assume so-called AGI will be sentient; merely a human-level skilled intellectual worker.
In the absence of ethical dilemmas of this calibre for the foreseeable future, let's focus on the economic side of things in this particular comment chain.
Probably a remnant from prehistoric times when it was a matter of life and death. Will we ever be able to overcome this basic instinct that made capitalism such an unstoppable force? Will this ancient PTSD ever be cured?
It makes things so clean.
On the other hand we could have Star Trek.
Oligopolists are in the same boat. But there needs to be a conspiracy to retard innovation. Something tech companies are only too happy to do: https://journals.law.unc.edu/ncjolt/blogs/wage-fixing-scheme...
Shardlow & Przybyła, "Deanthropomorphising NLP: Can a Language Model Be Conscious?" (PLOS One, 2024)
Nature: "There is no such thing as conscious artificial intelligence" (2025)
They argue that the association between consciousness and LLMs is deeply flawed, and that mathematical algorithms implemented on graphics cards cannot become conscious because they lack a complex biological substrate. They also introduce the useful concept of "semantic pareidolia" - we pattern-match consciousness onto things that merely talk convincingly.
They are making a strong argument, and I think they are correct. But really, these are two different things, as I said originally.