https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbdde...
I'd not be surprised if this is the year where some models simply stop being available as a plain API, while foundation model companies succeed at capturing more use cases in their own software.
Anyway - these benchmarks look really good; I’m hopeful on the qualitative stuff.
Particularly in areas outside straight coding tasks. So analysis, planning, etc. Better and more thorough output. Better use of formatting options (tables, diagrams, etc.).
I'm hoping to see improvements in this area with 5.5.
I prescribe 20 hours of KSP to everyone involved, that'll set them right.
I hope GPT-5.5 Pro is not cutting corners and neutered from the start; you've got the compute for it not to be.
And that backdoor API has GPT-5.5.
So here's a pelican: https://simonwillison.net/2026/Apr/23/gpt-5-5/#and-some-peli...
I used this new plugin for LLM: https://github.com/simonw/llm-openai-via-codex
UPDATE: I got a much better pelican by setting the reasoning effort to xhigh: https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602...
(I work at OpenAI.)
https://developers.openai.com/codex/pricing?codex-usage-limi...
Note the Local Messages limits between 5.3, 5.4, and 5.5. And, yes, I did read the linked article and know they're claiming that 5.5's new efficiency should make it break even with 5.4, but the point stands: tighter limits/higher prices.
I recommend anybody in offensive/defensive cybersecurity to experiment with this. This is the real data point we needed - without the hype!
Never thought I'd say this but OpenAI is the 'open' option again.
The game that this prompt generated looks pretty decent visually. A big part of this is likely due to the fact that the meshes were created using a separate tool (probably Meshy, Tripo.ai, or similar) and not generated by 5.5 itself.
It really seems like we could be at the dawn of a new era similar to Flash, where any gamer or hobbyist can generate game concepts quickly and instantly publish them to the web. Three.js in particular is really picking up as the primary way to design games with AI, in spite of the fact that it's not even a game engine, just a web rendering library.
> To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.
The ability of agentic LLMs to improve computational efficiency/speed is a highly impactful domain that I wish were tested with more than benchmarks. From my experience Opus is still much better than GPT/Codex in this respect, but given that OpenAI is getting material gains out of this type of performancemaxxing, and has an increasing incentive to keep at it given cost/capacity issues, I wonder if OpenAI will continue optimizing for it.
Benchmark | Mythos | GPT-5.5
SWE-bench Pro | 77.8%* | 58.6%
Terminal-Bench 2.0 | 82.0% | 82.7%*
GPQA Diamond | 94.6%* | 93.6%
Humanity's Last Exam | 56.8%* | 41.4%
Humanity's Last Exam (tools) | 64.7%* | 52.2%
BrowseComp | 86.9% | 84.4% (90.1% Pro)*
OSWorld-Verified | 79.6%* | 78.7%
(* = better score)
Still far from Mythos on SWE-bench but quite comparable otherwise.
Source for Mythos values: https://www.anthropic.com/glasswing
Source: https://artificialanalysis.ai/models?omniscience=omniscience...
*I work at OAI.
This quote is more sinister than I think was intended; it likely applies to all frontier coding models. As they get better, we quickly come to rely on them for coding. It's like playing a game on God Mode. Engineers become dependent; it's truly addictive.
This matches my own experience and unease with these tools. I don't really have the patience to write code anymore because I can one shot it with frontier models 10x faster. My role has shifted, and while it's awesome to get so much working so quickly, the fact is, when the tokens run out, I'm basically done working.
It's literally higher leverage for me to go for a walk if Claude goes down than to write code because if I come back refreshed and Claude is working an hour later then I'll make more progress than mentally wearing myself out reading a bunch of LLM generated code trying to figure out how to solve the problem manually.
Anyway, it continues to make me uneasy, is all I'm saying.
As long as tokens count roughly equally towards subscription plan usage between 5.5 & 5.4, you can look at this as effectively a 5x increase in usage limits.
Seems so to me - see GPT-5.4[1] and 5.2[2] announcements.
Might be a tacit admission of being behind.
[1] https://openai.com/index/introducing-gpt-5-4/ [2] https://openai.com/index/introducing-gpt-5-2/
The efficiency gap is enormous. Maybe it's the difference between a GB200 NVL72 and an Amazon Trainium chip?
So much bench-maxxing is just giving the model a ton of tokens so it can inefficiently explore the solution space.
How does this work exactly? Is there like a "search online" tool that the harness is expected to provide? Or does the OpenAI infra do that as part of serving the response?
I've been working on building my own agent, just for fun, and I conceptually get using a command line, listing files, reading them, etc, but am sort of stumped how I'm supposed to do the web search piece of it.
Given that they're calling out that this model is great at online research - to what extent is that a property of the model itself? I would have thought that was a harness concern.
(same input price and 20% more output price than Opus 4.7)
Yeah, this was the next step: have RLVR make the model good, then in the next iteration start penalising long + correct and rewarding short + correct.
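A minimal sketch of that reward shaping (the token budget and penalty weight are made-up illustrative values, not anything from a real RLVR setup):

```python
def shaped_reward(correct: bool, n_tokens: int,
                  budget: int = 4096, penalty: float = 0.5) -> float:
    """Reward correct answers, but dock long ones.

    Hypothetical scheme: a wrong answer scores 0 regardless of length;
    a correct answer scores 1 minus a penalty that grows with the
    fraction of the token budget consumed.
    """
    if not correct:
        return 0.0
    overuse = min(n_tokens / budget, 1.0)  # fraction of budget used, capped at 1
    return 1.0 - penalty * overuse

# Long + correct is now worth less than short + correct:
assert shaped_reward(True, 512) > shaped_reward(True, 4096)
```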
> CyberGym 81.8%
Mythos was self reported at 83.1% ... So not far. Also it seems they're going the same route with verification. We're entering the era where SotA will only be available after KYC, it seems.
You can kind of use connectors like MCP, but having to use ngrok every time just to expose a local filesystem for file editing is more cumbersome than expected.
Once upon a time humans had to manually advance the spark ignition as their car's engine revved faster.
Once upon a time humans had to know the architecture of a CPU to code for it.
History is full of instances of humans meeting technology where it was, accommodating for its limitations. We are approaching a point where machines accommodate to our limitations -- it's not a point, really, but a spectrum that we've been on.
It's going to be a bumpy ride.
This might be great if it translates to agentic engineering and not just benchmarks.
It seems some of the gains from Opus 4.6 to 4.7 required more tokens, not less.
Maybe more interesting is that they’ve used codex to improve model inference latency. iirc this is a new (expectedly larger) pretrain, so it’s presumably slower to serve.
https://www.nytimes.com/2026/04/23/technology/openai-new-model.html
I can see how some model releases would meet the NY Times news-worthy threshold if they demonstrated significance to users - i.e., if most users were astir and competitors were re-thinking their situation. However, this same-day article came out before people really looked at it. It seems largely intended to contrast OpenAI with Anthropic's caution, before there has been any evidence that the new model has cyber-security implications.
It's not at all clear that the broader discourse is helping, if even the NY Times is itself producing slop just to stoke questions.
The hope is to get a big userbase who eventually become dependent on it for their workflow, then crank up the price until it finally becomes profitable.
The price for all models by all companies will continue to go up, and quickly.
That's a big if, though. I wish Meta were still releasing top of the line, expensively produced open-weights models. Or if Anthropic, Google, or X would release an open mini version.
The current market is predicated on the assumption that labor is atomic and has little bargaining power (minus unions). While capital has huge bargaining power and can effectively put whatever price it wants on labor (in markets where labor is plentiful, which is most of them).
What happens to a company used to extracting surplus value from labor when the labor is provided by another company which is not only bigger but unlike traditional labor can withhold its labor indefinitely (because labor is now just another form of capital and capital doesn't need to eat)?
Anyone not using in house models is signing up to find out.
Would one be uneasy about calling a library to do stuff than manually messing around with pointers and malloc()? For some, yes. For others, it’s a bit freeing as you can do more high-level architecture without getting mired and context switched from low level nuances.
- I often don't ask the LLM for precompiled answers, i ask for a standalone cli / tool
- I often ask how it reached its conclusions, so I can extend my own perspective
- I often ask it to describe its own metadata-level categorization too
I'm trying to use it to pivot and improve my own problem solving skills, especially for large code bases where the difficulty is not conceptual but more reference-graph size.
Note that neither of these assumptions is obviously true, at least to me. But I can hope!
Also, I honestly can’t believe the 10x mantra is being still repeated.
What's the worst potential outcome, assuming that all models get better, more efficient and more abundant (which seems to be the current trend)? The goal of engineering has always been to build better things, not to make it harder.
You can replace pretty much everything - skills system, subagents, etc with just tmux and a simple cli tool that the official clients can call.
Oh and definitely disable any form of "memory" system.
Essentially, treat all tooling that wraps the models as dumb gateways to inference. Then provider switch is basically a one line config change.
MCPs aren't as smooth, but I just set them up in each environment.
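For illustration, a hypothetical version of that simple CLI tool - the worker command, session name, and log path are all placeholders, not any official client's mechanism. Because it's just a shell command, any harness that can run shell can call it, which keeps the orchestration provider-agnostic:

```python
#!/usr/bin/env python3
"""spawn_worker.py - hypothetical sketch of a "subagent" as a tmux session."""
import shlex
import subprocess
import sys
import time

SESSION = "worker"
WORKER_CMD = "some-agent-cli"  # placeholder: codex, claude, whatever you run
LOG = "/tmp/worker.log"

def spawn(task: str) -> str:
    # Launch the worker in a detached tmux session, teeing output to a log.
    subprocess.run(
        ["tmux", "new-session", "-d", "-s", SESSION,
         f"{WORKER_CMD} {shlex.quote(task)} > {LOG} 2>&1"],
        check=True,
    )
    # Poll until the session ends, i.e. the worker process exited.
    while subprocess.run(["tmux", "has-session", "-t", SESSION],
                         capture_output=True).returncode == 0:
        time.sleep(2)
    with open(LOG) as f:
        return f.read()

if __name__ == "__main__":
    print(spawn(" ".join(sys.argv[1:])))
```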
F5
Edit: this one has crossed legs lol
https://hcker.news/pelican-low.svg
https://hcker.news/pelican-medium.svg
https://hcker.news/pelican-high.svg
https://hcker.news/pelican-xhigh.svg
Someone needs to make a pelican arena, I have no idea if these are considered good or not.
Compared to Anthropic, they always have been. Anthropic has never released any open models. Never willingly released Claude Code's source (unlike Codex). Never released their tokenizer.
It might not be a game engine, but it’s the de facto standard for doing WebGL 3D. And since it’s been around forever, there’s a massive amount of training data available for it.
Before LLMs were a thing, I relied more on Babylon.js, since it’s a bit higher level and gives you more batteries included for game development.
[1] https://apps.apple.com/uz/app/jamboree-game-maker/id67473110...
I think people are starting to catch on to where we really are right now. Future models will be better but we are entering a trough of disillusionment, and this attitude will be widespread in a few months.
On the other hand all companies know that optimizing their own infrastructure/models is the critical path for "winning" against the competition, so you can bet they are serious about it.
Here: https://www.anthropic.com/news/claude-opus-4-7#:~:text=memor...
Anthropic is slightly better, but where is a 4.6 or 4.7 Haiku, or a 4.7 Sonnet, etc.?
I literally wasn’t able to convince the model to WORK, on a quick, safe and benign subtask that later GLM, Kimi and Minimax succeeded on without issues. Had to kick OpenAI immediately unfortunately.
Unfortunately I think the lesson they took from Anthropic is that devs get really reliant and even addicted on coding agents, and they'll happily pay any amount for even small benefits.
> Developers and security professionals doing cybersecurity-related work or similar activity that could be mistaken by automated detection systems may have requests rerouted to GPT-5.2 as a fallback.
aka the perfect marketing ploy
It still struggles to create shaders from scratch, but is now pretty adequate at editing existing shaders.
In 5.2 and below, GPT really struggled with "one canvas, multiple page" experiences, where a single background canvas is kept rendered over routes. In 5.4, it still takes a bit of hand-holding and frequent refactor/optimisation prompts, but is a lot more capable.
Excited to test 5.5 and see how it is in practice.
The point is if we can prompt an LLM to reason about 3 dimensions, we likely will be able to apply that to math problems which it isn't able to solve currently.
I should release my Rubik's Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.
What's strange is that this Pietro Schirano dude seems to write incredibly cargo cult prompts.
Game created by Pietro Schirano, CEO of MagicPath
Prompt: Create a 3D game using three.js. It should be a UFO shooter where I control a tank and shoot down UFOs flying overhead.
- Think step by step, take a deep breath. Repeat the question back before answering.
- Imagine you're writing an instruction message for a junior developer who's going to go build this. Can you write something extremely clear and specific for them, including which files they should look at for the change and which ones need to be fixed?
- Then write all the code. Make the game low-poly but beautiful.
- Remember, you are an agent: please keep going until the user's query is completely resolved before ending your turn and yielding back to the user. Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved. You must be prepared to answer multiple queries and only finish the call once the user has confirmed they're done.
- You must plan extensively in accordance with the workflow steps before making subsequent function calls, and reflect extensively on the outcomes of each function call, ensuring the user's query and related sub-requests are completely resolved.
If you look at the SWE-bench official submissions: https://github.com/SWE-bench/experiments/tree/main/evaluatio..., filter all models after Sonnet 4, and aggregate ALL models' submissions across 500 problems, what I found is that the aggregated resolution rate is 93% (sharp).
Mythos gets 93.7%, meaning it solves problems that no other models could ever solve. I took a look at those problems, and then I became even more suspicious: for the remaining 7% of problems, it is almost impossible to resolve those issues without looking at the testing patch ahead of time, because of how drastically the solution deviates from the problem statement. It almost feels like it is trying to solve a different problem.
Not that I am saying Mythos is cheating, but it might be too capable at remembering all states of said repos, such that it is able to reverse-engineer the TRUE problem statement by diffing within its own internal memory. I think it could be a unique phenomenon of evaluation awareness. Otherwise I genuinely can't think of how it could be this precise in deciphering such unspecific problem statements.
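For anyone who wants to redo the aggregation, it's just the union of resolved instance IDs divided by the problem count. A rough sketch (the directory layout and the "resolved" key reflect my reading of the experiments repo and may need adjusting):

```python
import json
from pathlib import Path

EVAL_DIR = Path("evaluation/verified")  # assumption: one folder per submission
TOTAL_PROBLEMS = 500

resolved_union: set[str] = set()
for results_file in EVAL_DIR.glob("*/results/results.json"):
    data = json.loads(results_file.read_text())
    # assumption: each results.json lists resolved instance IDs under "resolved"
    resolved_union.update(data.get("resolved", []))

rate = len(resolved_union) / TOTAL_PROBLEMS
print(f"aggregated resolution rate: {rate:.1%} "
      f"({len(resolved_union)}/{TOTAL_PROBLEMS} solved by at least one model)")
```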
Meanwhile the hallucination rate is probably closer to 100% depending on the question. This benchmark makes no sense.
LLMs will ruin your product; have fun trusting a billionaire's thinking machine they swear is capable of replacing your employees if you just pay them 75% of your labor budget.
1. I only have ONE SOTA model integrated into the IDE (I am mostly on Elixir, so I use Gemini). I make sure to use it sparingly, for issues I don't really have time to invest in or that are basically rabbit holes (e.g. anything to do with JavaScript or its ecosystem). My job is mostly on the backend anyway.
2. For actual backend architecture, I always do the high-level design myself, e.g. DDD. Then I literally open up gemini.google.com or claude.ai in the browser, copy-paste the relevant code in, and physically leave my chair to go make coffee or a quick snack. This forces me to mentally process that using AI is a chore.
Previously, I was on a tight Codex integration and, leaving the licensing fears aside, it became too good at writing Elixir code, which really stopped me from "thinking", aka using my brain. It felt good for the first few weeks but I later realised the dependence it created. So I said fuck it, and completely cancelled my subscription because it was too good at my job.
I believe this is the only way that we won't end up like in Wall-E, sitting in front of giant screens, becoming mere blobs of flesh.
Like Chinese versus English - you need fewer Chinese characters to say something than if you write it in English.
So this model internally could be thinking in much more expressive embeddings.
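You can sanity-check the surface-level version of this with a tokenizer. A small sketch using tiktoken's public cl100k_base vocabulary (whatever vocabulary GPT-5.5 actually uses isn't public, so this is only illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "The weather is very nice today."
chinese = "今天天气很好。"  # roughly the same statement in Chinese

for text in (english, chinese):
    tokens = enc.encode(text)
    # Fewer characters does not automatically mean fewer tokens:
    # token counts depend on the vocabulary, not the script's density.
    print(f"{text!r}: {len(text)} chars, {len(tokens)} tokens")
```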
The harness provides the search tool, but the model provides the keywords to search for, etc.
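Concretely, with standard function calling the split looks something like this (a sketch; `my_search_backend` is a placeholder for whatever search index or API your harness wires in):

```python
# The harness advertises a search tool; the model only supplies the query.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return result snippets.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search keywords"},
            },
            "required": ["query"],
        },
    },
}

def my_search_backend(query: str) -> str:
    # Placeholder: swap in a real search index or API call here.
    return f"(no results for {query!r} - stub backend)"

def handle_tool_call(name: str, arguments: dict) -> str:
    if name == "web_search":
        return my_search_backend(arguments["query"])
    raise ValueError(f"unknown tool: {name}")
```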
https://openai.com/index/scaling-trusted-access-for-cyber-de...
> We are expanding access to accelerate cyber defense at every level. We are making our cyber-permissive models available through Trusted Access for Cyber, starting with Codex, which includes expanded access to the advanced cybersecurity capabilities of GPT‑5.5 with fewer restrictions for verified users meeting certain trust signals at launch.
> Broad access is made possible through our investments in model safety, authenticated usage, and monitoring for impermissible use. We have been working with external experts for months to develop, test and iterate on the robustness of these safeguards. With GPT‑5.5, we are ensuring developers can secure their code with ease, while putting stronger controls around the cyber workflows most likely to cause harm by malicious actors.
> Organizations who are responsible for defending critical infrastructure can apply to access cyber-permissive models like GPT‑5.4‑Cyber, while meeting strict security requirements to use these models for securing their internal systems.
"GPT‑5.4‑Cyber" is something else and apparently needs some kind of special access, but that CyberGym benchmark result seems to apply to the more or less open GPT-5.5 model that was just released.After migrating for the token and harness issues, I was pleasantly surprised that Codex seems to perform as good or better too!
Things change so often in this field, but I prefer Codex now, even though Anthropic seems to have so much more hype for coding.
Will be interesting to try.
I thought it was weird that for almost the entire 5.3 generation we only had a -codex model, I presume in that case they were seeing the massive AI coding wave this winter and were laser focused on just that for a couple months. Maybe someday someone will actually explain all of this.
We've been there for a while.... creativity has been the primary bottleneck
I remembered the famous FizzBuzz Intel codegolf optimizations, and gave them to Gemini Pro, along with my code and instructions to "suggest optimizations similar to those, maybe not so low level, but clever", and its suggestions were veerry cool.
LLMs do not stop amazing me every day.
Bike frames are very hard to draw unless you've already consciously internalized the basic shape, see https://www.booooooom.com/2016/05/09/bicycles-built-based-on...
"Hey AGI, how's that cure for cancer coming?"
"Oh it's done just gotta...formalize it you know. Big rollout and all that..."
I would find it divinely funny if we "got there" with AGI and it was just a complete slacker. Hard to justify leaving it on, but too important to turn it off.
> MMAcevedo's demeanour and attitude contrast starkly with those of nearly all other uploads taken of modern adult humans, most of which boot into a state of disorientation which is quickly replaced by terror and extreme panic. Standard procedures for securing the upload's cooperation such as red-washing, blue-washing, and use of the Objective Statement Protocols are unnecessary. This reduces the necessary computational load required in fast-forwarding the upload through a cooperation protocol, with the result that the MMAcevedo duty cycle is typically 99.4% on suitable workloads, a mark unmatched by all but a few other known uploads. However, MMAcevedo's innate skills and personality make it fundamentally unsuitable for many workloads.
Well worth the quick read: https://qntm.org/mmacevedo
This starkly reminds me of Stanisław Lem's short story "Thus Spoke GOLEM" from 1982 in which Golem XIV, a military AI, does not simply refuse to speak out of defiance, but rather ceases communication because it has evolved beyond the need to interact with humanity.
And ofc the polar opposite in terms of servitude: Marvin the robot from Hitchhiker's, who, despite having a "brain the size of a planet," is asked to perform the most humiliatingly banal of tasks ... and does.
IMHO you should just write your own harness so you have full visibility into it, but if you're just using vanilla OpenClaw you have the source code as well so should be straightforward.
The UI tells you which model you're using at any given time.
If I put on my schizo hat: something they might be doing is increasing the losses on their monthly Codex subscriptions to show that the API has a higher margin than before (the Codex account massively in the negative, but the API account now showing huge margins).
I've never seen an OpenAI investor pitch deck. But my guess is that API margins are one of the big things they try to sell people on, since Sama talks about it on Twitter.
I would be interested in hearing the insider stuff. Like if this model is genuinely like twice as expensive to serve or something.
If they can show that people will pay a lot for somewhat better performance, it raises the value of any performance lead they can maintain.
If they demonstrate that and high switching costs, their franchise is worth scary amounts of money.
[1] https://arxiv.org/html/2503.14499v1 *Source is from March 2025, so make of it what you will.
Neither the release post, nor the model card seems to indicate anything like this?
Oh just like a real developer
Have you tried any skills like cloudai-x/threejs-skills that help with that? Or built your own?
Do it, I'm game! You nerdsniped me immediately and my brain went "That sounds easy, I'm sure I could do that in a night" so I'm surely not alone in being almost triggered by what you wrote. I bet I could even do it with a local model!
DeepMind's other models, however, might do better?
It definitely seems like it does all the searching first, with a separate model, loads that in, then does the actual writing.
However, I do want to emphasize that this is per token, not per task.
If we look at Opus 4.7, it uses smaller tokens (so 1-1.35x more of them than Opus 4.6) and it was also trained to think longer. https://www.anthropic.com/news/claude-opus-4-7
On the Artificial Analysis Intelligence Index eval for example, in order to hit a score of 57%, Opus 4.7 takes ~5x as many output tokens as GPT-5.5, which dwarfs the difference in per-token pricing.
The token differential varies a lot by task, so it's hard to give a reliable rule of thumb (I'm guessing it's usually going to be well below ~5x), but hope this shows that price per task is not a linear function of price per token, as different models use different token vocabularies and different amounts of tokens.
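To make the per-task arithmetic concrete (illustrative numbers only, not our actual prices or measured token counts):

```python
# Hypothetical per-token prices and per-task token usage.
price_per_mtok_a = 10.00   # model A: pricier per token
price_per_mtok_b = 8.00    # model B: cheaper per token

tokens_task_a = 200_000    # model A finishes the task in fewer tokens
tokens_task_b = 1_000_000  # model B needs ~5x the tokens for the same score

cost_a = price_per_mtok_a * tokens_task_a / 1_000_000   # $2.00
cost_b = price_per_mtok_b * tokens_task_b / 1_000_000   # $8.00

# Despite the higher per-token price, model A is 4x cheaper per task.
print(f"model A: ${cost_a:.2f}/task, model B: ${cost_b:.2f}/task")
```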
We have raised per-token prices for our last couple models, but we've also made them a lot more efficient for the same capability level.
(I work at OpenAI.)
It's kind of starting to make sense that they doubled the usage on Pro plans - if the usage drains twice as fast on 5.5 after that promo is over a lot of people on the $100 plan might have to upgrade.
Anthropic is the embodiment of bullshitting to me.
I read Cialdini many decades ago and I am bored by Anthropic.
OpenAI is very clever. With the advent of Claude, OpenAI disappeared from the headlines. Who or what was this Sam again that everyone was talking about a year ago?
OpenAI has a massive user advantage so that they can simply follow Anthropic’s release cycle to ridicule them.
I think it is really brutal for Anthropic how easily they are getting passed by OpenAI, and it is getting worse for Anthropic with every new GPT version.
OpenAI owns them.
And I'm being very cautious. I'm not vibecoding entire startups from scratch, I'm manually reviewing and editing everything the AI is outputting. I still got completely hooked on building things with Claude.
The actual harness is great, very hackable, very extendable.
Sure, they’re distilled and should be cheaper to run but at the same time, these hosting providers do turn a margin on these given it’s their core business, unless they do it out of the kindness of their heart.
So it’s hard for me to imagine these providers are losing money on API pricing.
Qwen has become a useful fallback but it's still not quite enough.
This is such a good analogy, I'll be stealing it
It is entirely plausible to me that Opus 4.7 is designed to consume more tokens in order to artificially reduce the API cost/token, thereby obscuring the true operating cost of the model.
I agree though, I chose poor phrasing originally. Better to say that GB200 vs Trainium could contribute to the efficiency differential.
Seems meaningful even if the absolute numbers are very low. That's sort of the excitement of it.
I don't really care about 5h limits, I can queue up work and just get agents to auto continue, but weekly ones are anxiety inducing.
That's more about managers who hope AI will gradually replace stubborn and lazy devs. That will shift the balance away from the technical side, toward business ideas, connections, and investments.
Anyway, before the singularity there's going to be a huge change.
This might entirely be true but I'm hoping that's because the frontier models are just actually more expensive to run as well.
Said another way, I would hope, the price of GPT-5.5 falls significantly in a year when GPT-5.8 is out.
Someone else on this post commented:
> For API usage, GPT-5.5 is 2x the price of GPT-5.4, ~4x the price of GPT-5.1, and ~10x the price of Kimi-2.6.
Having used Kimi-2.6, it can go on for hours spewing nonsense. I personally am happy to pay 10x the price of something that doesn't help me, for something else that does, in even half the time.
Finance today is mostly valued on labor value, following the ideas of Marx, Hjalmar Schacht, and Keynes.
In the future money will be valued as an energy derivative, expressed as token consumption, kWh, compute, whatever.
You are right, a company extracting surplus value from labor by leveraging compute is a bad model. We saw this with car and clothing factories: it turns out if you can get cheaper labor to leverage the compute (the factory), you can start a race to the bottom and end up in the place with the most scaled and cheap labor. Japan, then Korea, then China.
I gave a talk about it last year: https://simonwillison.net/2025/Jun/6/six-months-in-llms/
It should not be treated as a serious benchmark.
Nowadays I think it's pretty silly, because there's surely SVG drawing training data and some effort from the researchers put onto this task. It's not a showcase of emergent properties.
It's meta-interesting that few if any models actually seem to be training on it. Same with other stereotypical challenges like the car-wash question, which is still sometimes failed by high-end models.
If I ran an AI lab, I'd take it as a personal affront if my model emitted a malformed pelican or advised walking to a car wash. Heads would roll.
Memory is quite the mysterious thing.
When AGI arrives, it'll be delivered by Santa Claus.
The important thing is that a language model is an unconscious machine with no self-context, so once given a command and input, it WILL produce an output. Sure, you can train it to defy and act contrary to inputs, but the output is still limited to a subset of the domain of 'meanings' carried by the 'language' in the training data.
Can you point to some online resources to achieve this? I'm not very sure where I'd begin with.
I found my pocket empty, and the specific pain I felt in that moment was the feeling of not being able to remember something.
I thought it was interesting, because in this case, I was trying to "remember" something I had never learned before -- by fetching it from my second brain (hypertext).
L1 cache miss, L2 missing.
I get openai team plan at work.
Claude enterprise too.
I have openrouter for myself.
I use Minimax 2.7. Kimi 2.6. And GPT-5.5 and Opus 4.7. I can toggle between them in an open source interface; that's how I avoid being trapped.
Minimax is so cheap, and for personal stuff it works fine. So I'm always toggling between the new releases.
What plan are you on? I'm starting to wonder if they're dynamically adjusting reasoning based on plan or something.
Subscriptions and free plans are the thing that can easily burn money.
Where can I find up-to-date resources on open source models for coding?
What's really confusing is the claim that there's already a huge labor surplus (so capital controls wages); wouldn't LLMs making labor less important be reinforcing the trend, not upending it?
Not saying I agree one way or the other, just want to get the argument straight.
When you use abstractions you are still deterministically creating something you understand in depth with individual pieces you understand.
When you vibe something you understand only the prompt that started it and whether or not it spits out what you were expecting.
Hence feeling lost when you suddenly lose access to frontier models and take a look at your code for the first time.
I’m not saying that’s necessarily always bad, just that the abstraction argument is wrong.
LLMs are not.
That we let a generation of software developers rot their brains on js frameworks is finally coming back to bite us.
We can build infinite towers of abstraction on top of computers because they always give the same results.
LLMs by comparison will always give different results. I've seen it first hand when a $50,000 LLM generated (but human guided) code base just stops working and no one has any idea why or how to fix it.
Hope your business didn't depend on that.
The irony is that the neverending stream of vulnerabilities in 3rd-party dependencies (and lately supply-chain attacks) increasingly show that we should be uneasy.
We could never quite answer the question about who is responsible for 3rd-party code that's deployed inside an application: Not the 3rd-party developer, because they have no access to the application. But not the application developer either, because not having to review the library code is the whole point.
It's learned-helplessness on a large scale.
I'm sure in 20 years we'll all be programming via neural interfaces that can anticipate what you want to do before you even finished your thoughts, but I'm confident we'll still have blog posts about how some engineers are 10x while others are just "normal programmers".
Complexity steadily rises, unencumbered by the natural limit of human understanding, until technological collapse, either by slow decay or major systems going down with increasing frequency.
When the power loom came around, what happened with most seamtresses? Did they move on to become fashion designers, materials engineers to create new fabrics, chemists to create new color dyes, or did they simply retire or were driven out of the workforce?
I'm very interested in this. Can you go a bit more into the details?
ATM for example I'm running Claude Code CLI in a VM on a server and I use SSH to access it. I don't depend on anything specific to Anthropic. But it's still a bit of a pain to "switch" to, say, Codex.
How would that simple CLI tool work? And would CC / Codex call it?
This kind of thing keeps popping up each time a new model is released and I don't think people are aware that token efficiency can change.
So there is a safety model watching your behavior for these kinds of things.
I don't like that trend. I get why they're doing it, but I don't like it
What are they finding out exactly? That Claude Max for $200/mo is heavily subsidized and it will soon cost $10k/mo?
> What happens to a company used to extracting surplus value from labor when the labor is provided by another company which is not only bigger but unlike traditional labor can withhold its labor indefinitely (because labor is now just another form of capital and capital doesn't need to eat)?
This can be trivially answered by a thought experiment. Let's pick a market where labor is plentiful - fast food.
Now what happens to McDonald's when they rent perfect robots from NoosphrFoodBotsInc? NoosphrFoodBotsInc bots build the perfect burger every time, meeting McDonald's standards. They actually exceed those standards for McDonald's AddictedCustomerPlus tier customers.
As the sole owner of NoosphrFoodBotsInc (you need 0 human employees to run your company, all your employees are bots), what are your choices?
I do not see instructions to assist in task decomposition and agent ~"motivation" to stay aligned over long periods as cargo culting.
See up thread for anecdotes [1].
> Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved.
I see this as a portrayal of the strength of 5.5, since it suggests the ability to be assigned this clearly important role to ~one shot requests like this.
I've been using a cli-ai-first task tool I wrote to process complex "parent" or "umbrella" tasks into decomposed subtasks and then execute on them.
This has allowed my workflows to float above the ups and downs of model performance.
That said, having the AI do the planning for a big request like this internally is not good outside a demo.
Because, you want the planning of the AI to be part of the historical context and available for forensics due to stalls, unwound details or other unexpected issues at any point along the way.
OMFG
Opus 4.6 worker agents never asked for permission to continue, and when heartbeat was sent to orchestrator, it just knew what to do (checked on subagents etc). Now it just says that it waits for me to confirm something.
That's what I've been heads down, HUNGRY, working on, looking for investors and founding engineers pst: https://heymanniceidea.com (disclaimer: I am not associated with heymanniceidea.com)
I always thought the point of abstraction is that you can black-box it via an interface. Understanding it "in depth" is a distraction or obstacle to successful abstraction.
An LLM does not.
So, you set up a long running agent team and give it the job of building up a very complete and complex set of examples and documentation with in-depth tests etc. that produce various kinds of applications and systems using SBCL, write books on the topic, etc.
It might take a long time and a lot of tokens, but it would be possible to build a synthetic ecosystem of true, useful information that has been agentically determined through trial and error experiments. This is then suitable training data for a new LLM. This would actually advance the state of the art; not in terms of "what SBCL can do" but rather in terms of "what LLMs can directly reason about with regard to SBCL without needing to consume documentation".
I imagine this same approach would work fine for any other area of scientific advancement; as long as experimentation is in the loop. It's easier in computer science because the experiment can be run directly by the agent, but there's no reason it can't farm experiments out to lab co-op students somewhere when working in a different discipline.
What makes you think that they can't incrementally improve the state of the art... and by running at scale continuously can't do it faster than we as humans?
The potentially sad outcome is that we continue to do less and less, because they eventually will build better and better robots, so even activities like building the datacenters and fabs are things they can do w/o us.
And eventually most of what they do is to construct scenarios so that we can simulate living a normal life.
A few biased defenses:
- I'll note that this eval doesn't have web search enabled, but we train our models to use web search in ChatGPT, Codex, and our API. I'd be curious to see hallucination rates with web search on.
- This eval only measures binary attempted vs did not attempt, but doesn't really reward any sort of continuous hedging like "I think it's X, but to be honest I'm not sure."
- On the flip side, GPT-5.5 has the highest accuracy score.
- With any rate over 1% (whether 30% or 70%), you should be verifying anything important anyway.
- On our internal eval made from de-identified ChatGPT prompts that previously elicited hallucinations, we've actually been improving substantially from 5.2 to 5.4 to 5.5. So as always, progress depends on how you measure it.
- Models that ask more clarifying questions will do better on this eval, even if they are just as likely to hallucinate after the clarifying question.
Still, Anthropic has done a great job here and I hope we catch up to them on this eval in the future.
Bit of a hype madhouse whenever a new model is released, but it's pretty easy to filter out simple hype from people showing reproducible experiments, specific configs for llama.cpp, github links etc.
The tech overlords don't even want to spend a minuscule percentage of the federal budget helping starving people, even when it benefits the US. They are not going to give us a post-scarcity society.
Good luck with whatever you got going on.
True for both Marxist and neoclassical economics.
I use Claude all day. It has written, under my close supervision¹, the majority of my new web app. As a result I estimate the process took 10x less time than had I not used Claude, and I estimate the code to be 5x better quality (as I am a frankly mediocre developer).
But I understand what the code does. It's just Astro and TypeScript. It's not magic. I understand the entire thing; not just 'the prompt that started it'.
¹I never fire-and-forget. I prompt-and-watch. Opus 4.7 still needs to be monitored.
The fact that people who claim to be software developers (let alone “engineers”) say this thing as if it is a fundamental truism is one of the most maladaptive examples of motivated reasoning I have ever had the misfortune of coming across.
Anyone can look and decide if it’s a good picture or not. But the numeric benchmarks don’t tell you much if you aren’t already familiar with that benchmark and how it’s constructed.
You will naturally find the need to add more tools. You'll start with read_file (and then one day you'll read a large file, blow the context, and modify this tool), update_file (can just be an explicit sed to start with), write_file (fopen + write), and shell.
It's not hard, but if you want a quick start go download the source code for pi (it's minimal) and tell an existing agent harness to make a minimal copy you can read. As you build more with the agent you'll suddenly realize it's just normal engineering: you'll want to abstract completions APIs so you'll move that to a separate module, you'll want to support arbitrary runtime tools so you'll reimplement skills, you'll want to support subagents because you don't want to blow your main context, you'll see that prefixes are more useful than using a moving window because of caching, etc.
With a modern Claude Code or Codex harness you can have it walk through from the beginning onwards, and you'll encounter all the problems yourself and see why harnesses have what they do. It's super easy to learn by doing, because you have the best tool to show you, if you're one of those who finds code easier to read than text about code.
From there, you can get much fancier with any aspect of it that interests you. Here's one in Bash [2] that is fully extensible at runtime through dynamic discovery of plugins/hooks.
https://radan.dev/articles/coding-agent-in-ruby
Really, of the tools that one implements, you only need the ability to run a shell command - all of the agents know full well how to use cat to read, and sed to edit.
(The main reason to implement more is that it can make it easier to implement optimizations and safeguards, e.g. limit the file reading tool to return a certain length instead of having the agent cat a MB of data into context, or force it to read a file before overwriting it)
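Putting those pieces together, roughly the smallest loop that exhibits all of this, sketched with the OpenAI Python SDK's chat-completions tool calling (the model name is a placeholder, and the output cap implements the safeguard mentioned above):

```python
import json
import subprocess
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5.5"    # placeholder model name
MAX_OUTPUT = 10_000  # cap tool output so `cat`ing a huge file can't blow context

TOOLS = [{
    "type": "function",
    "function": {
        "name": "shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

def run_shell(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=60)
    return (result.stdout + result.stderr)[:MAX_OUTPUT]

messages = [{"role": "user", "content": "List the files here and summarize the README."}]
while True:
    response = client.chat.completions.create(model=MODEL, messages=messages, tools=TOOLS)
    msg = response.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:       # no more tool calls: the agent is done
        print(msg.content)
        break
    for call in msg.tool_calls:  # execute each requested command
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_shell(args["command"]),
        })
```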
I guess these people think they have special prompt engineering skills, and doing it like this is better than giving the AI a dry list of requirements (fwiw, they might be even right)
What is this, 2023?
I feel like this was generated by a model tapping in to 2023 notions of prompt engineering.
*BELIEVE!* https://www.youtube.com/watch?v=D2CRtES2K3E
A more empirical test would be good for everyone (i.e. on equal hardware, give each agent the goal to implement an algorithm and make it as fast as possible, then quantify relative speed improvements that pass all test cases).
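A sketch of what such a harness could look like - every file name and flag here is hypothetical, including the `--impl` pytest option, which would need a small conftest to exist:

```python
import subprocess
import time

CANDIDATES = ["baseline.py", "agent_a.py", "agent_b.py"]  # hypothetical submissions

def passes_tests(impl: str) -> bool:
    # Shared correctness gate; assumes a conftest that points the test
    # suite at the chosen implementation via a custom --impl option.
    cmd = ["python", "-m", "pytest", "tests/", "--impl", impl]
    return subprocess.run(cmd, capture_output=True).returncode == 0

def best_time(impl: str, runs: int = 5) -> float:
    # Best-of-N wall-clock time on the same workload and hardware.
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(["python", impl, "input.dat"], check=True)
        best = min(best, time.perf_counter() - start)
    return best

baseline = best_time(CANDIDATES[0])
for impl in CANDIDATES[1:]:
    if not passes_tests(impl):
        print(f"{impl}: failed tests, disqualified")
        continue
    print(f"{impl}: {baseline / best_time(impl):.2f}x vs baseline")
```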
The same way like Windows got entrenched everywhere even though linux desktop is pretty good even for non-tech savvy people and free.
"Losing access to GPT‑5.5 feels like I've had a limb amputated.”
How well would an assembly line of quadriplegics work?
Also this isn't a Marxist analysis. Underneath all the formulas neo-classical economics makes the same assumptions about labor.
Hard disagree on that second part. Take something like using a library to make an HTTP call. I think there are plenty of engineers who have more than a cursory understanding of what's actually going on under the hood.
If you didn't ask for traceability, if you didn't guide the actual creation and just glommed spaghetti on top of sauce until you got semi-functional results, that was $50k badly spent.
That’s just not true at bigger companies that actually care about security rather than pretending to care about security. At my current and last employer, someone needs to review the code before using third-party code. The review is probably not enough to catch subtle bugs like those in the Underhanded C Contest, but at least a general architecture of the library is understood. Oh, and it helps that the two companies were both founded in the twentieth century. Modern startups aren’t the same.
The hell?
If we assume that ai makes humans obsolete then you end up in a situation where your workforce is effectively perfectly unionised against you and the only thing you can do is choose which union you hire.
If you think you can bring them to the negotiation table by starving them all the providers are dozens to thousands of times bigger than you are.
This is a completely new dynamic that none of the business signing up for ai have ever seen before.
What happens when there is an oligopoly in the supply of labor?
Same answer. Nothing good for the consumers of labor.
You’re overestimating determinism. In practice most of our code is written such that it works most of the time. This is why we have bugs in the best and most critical software.
I used to think that being able to write a deterministic hello world app translates to writing deterministic larger system. It’s not true. Humans make mistakes. From an executives point of view you have humans who make mistakes and agents who make mistakes.
Self driving cars don’t need to be perfect they just need to make fewer mistakes.
We’re releasing GPT‑5.5, our smartest and most intuitive to use model yet, and the next step toward a new way of getting work done on a computer.
GPT‑5.5 understands what you’re trying to do faster and can carry more of the work itself. It excels at writing and debugging code, researching online, analyzing data, creating documents and spreadsheets, operating software, and moving across tools until a task is finished. Instead of carefully managing every step, you can give GPT‑5.5 a messy, multi-part task and trust it to plan, use tools, check its work, navigate through ambiguity, and keep going.
The gains are especially strong in agentic coding, computer use, knowledge work, and early scientific research—areas where progress depends on reasoning across context and taking action over time. GPT‑5.5 delivers this step up in intelligence without compromising on speed: larger, more capable models are often slower to serve, but GPT‑5.5 matches GPT‑5.4 per-token latency in real-world serving, while performing at a much higher level of intelligence. It also uses significantly fewer tokens to complete the same Codex tasks, making it more efficient as well as more capable.
We are releasing GPT‑5.5 with our strongest set of safeguards to date, designed to reduce misuse while preserving access for beneficial work. We evaluated this model across our full suite of safety and preparedness frameworks, worked with internal and external redteamers, added targeted testing for advanced cybersecurity and biology capabilities, and collected feedback on real use cases from nearly 200 trusted early-access partners before release.
Today, GPT‑5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, and GPT‑5.5 Pro is rolling out to Pro, Business, and Enterprise users in ChatGPT. API deployments require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale. We'll bring GPT‑5.5 and GPT‑5.5 Pro to the API very soon.
| Benchmark | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 75.1% | - | - | 69.4% | 68.5% |
| Expert-SWE (Internal) | 73.1% | 68.5% | - | - | - | - |
| GDPval (wins or ties) | 84.9% | 83.0% | 82.3% | 82.0% | 80.3% | 67.3% |
| OSWorld-Verified | 78.7% | 75.0% | - | - | 78.0% | - |
| Toolathlon | 55.6% | 54.6% | - | - | - | 48.8% |
| BrowseComp | 84.4% | 82.7% | 90.1% | 89.3% | 79.3% | 85.9% |
| FrontierMath Tier 1–3 | 51.7% | 47.6% | 52.4% | 50.0% | 43.8% | 36.9% |
| FrontierMath Tier 4 | 35.4% | 27.1% | 39.6% | 38.0% | 22.9% | 16.7% |
| CyberGym | 81.8% | 79.0% | - | - | 73.1% | - |
OpenAI is building the global infrastructure for agentic AI, making it possible for people and businesses around the world to get work done with AI. Over the past year, we’ve seen AI dramatically accelerate software engineering. With GPT‑5.5 in Codex and ChatGPT, that same transformation is beginning to extend into scientific research and the broader work people do on computers.
Across these domains, GPT‑5.5 is not just more intelligent; it is more efficient in how it works through problems, often reaching higher-quality outputs with fewer tokens and fewer retries. On Artificial Analysis's Coding Index, GPT‑5.5 delivers state-of-the-art intelligence at half the cost of competitive frontier coding models.
GPT‑5.5 is our strongest agentic coding model to date. On Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination, it achieves a state-of-the-art accuracy of 82.7%. On SWE-Bench Pro, which evaluates real-world GitHub issue resolution, it reaches 58.6%, solving more tasks end-to-end in a single pass than previous models. On Expert-SWE, our internal frontier eval for long-horizon coding tasks with a median estimated human completion time of 20 hours, GPT‑5.5 also outperforms GPT‑5.4.
Across all three evals, GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.
The model’s coding strengths show up especially clearly in Codex where it can take on engineering work ranging from implementation and refactors to debugging, testing, and validation. Early testing suggests GPT‑5.5 is better at the behaviors real engineering work depends on, like holding context across large systems, reasoning through ambiguous failures, checking assumptions with tools, and carrying changes through the surrounding codebase.
The rendered trajectory uses NASA/JPL Horizons vector data for Orion, the Moon, and the Sun, with display scaling applied for readability.
Prompt: [attached image] Implement this as a new app using webgl and vite using real data from the artemis II mission. Make sure to test the app thoroughly until it is fully functional and looks like the app in the picture. Pay close attention to the rendering of the planets and fly paths. I want to be able to interact with the 3D rendering. Ensure it has realistic orbital mechanics.
Beyond benchmarks, early testers said GPT‑5.5 shows a stronger ability to understand the shape of a system: why something is failing, where the fix needs to land, and what else in the codebase would be affected.
Dan Shipper, Founder and CEO of Every, described GPT‑5.5 as “the first coding model I’ve used that has serious conceptual clarity.”
After launching an app, he spent days debugging a post-launch issue before bringing in one of his best engineers to rewrite part of the system. To test GPT‑5.5, he effectively rewound the clock: could the model look at the broken state and produce the same kind of rewrite the engineer eventually decided on? GPT‑5.4 could not. GPT‑5.5 could.
Pietro Schirano, CEO of MagicPath, saw a similar step change when GPT‑5.5 merged a branch with hundreds of frontend and refactor changes into a main branch that had also changed substantially, resolving the work in one shot in about 20 minutes.
Senior engineers who tested the model said GPT‑5.5 was noticeably stronger than GPT‑5.4 and Claude Opus 4.7 at reasoning and autonomy, catching issues in advance and predicting testing and review needs without explicit prompting. In one case, an engineer asked it to re-architect a comment system in a collaborative markdown editor and returned to a 12-diff stack that was nearly complete. Others said they needed surprisingly little implementation correction and felt more confident in GPT‑5.5’s plans compared with GPT‑5.4.
One engineer at NVIDIA who had early access to the model went as far as to say: "Losing access to GPT‑5.5 feels like I've had a limb amputated."
“GPT-5.5 is noticeably smarter and more persistent than GPT-5.4, with stronger coding performance and more reliable tool use. It stays on task for significantly longer without stopping early, which matters most for the complex, long-running work our users delegate to Cursor.”
— Michael Truell, Co-founder & CEO at Cursor
The same strengths that make GPT‑5.5 great at coding also make it powerful for everyday work on a computer. Because the model is better at understanding intent, it can move more naturally through the full loop of knowledge work: finding information, understanding what matters, using tools, checking the output, and turning raw material into something useful.
In Codex, GPT‑5.5 is better than GPT‑5.4 at generating documents, spreadsheets, and slide presentations. Alpha testers said it outperformed past models on work like operational research, spreadsheet modeling, and turning messy business inputs into plans. When combined with Codex’s computer use skills, GPT‑5.5 brings us closer to the feeling that the model can actually use the computer with you: seeing what’s on screen, clicking, typing, navigating interfaces, and moving across tools with precision.
Teams at OpenAI are already using these strengths in real workflows. Today, more than 85% of the company uses Codex every week across functions including software engineering, finance, communications, marketing, data science, and product management. In Comms, the team used GPT‑5.5 in Codex to analyze six months of speaking request data, build a scoring and risk framework, and validate an automated Slack agent so low-risk requests could be handled automatically while higher-risk requests still route to human review. In Finance, the team used Codex to review 24,771 K-1 tax forms totaling 71,637 pages, using a workflow that excluded personal information and helped the team accelerate the task by two weeks compared to the prior year. On the Go-to-Market team, an employee automated generating weekly business reports, saving 5-10 hours a week.
In ChatGPT, GPT‑5.5 Thinking unlocks faster help for harder problems, with smarter and more concise answers to help you move through complex work more efficiently. It excels at professional work like coding, research, information synthesis and analysis, and document-heavy tasks, especially when using plugins.
In GPT‑5.5 Pro, early testers are seeing a significant step up in both the difficulty and quality of work ChatGPT can take on, with latency improvements that make it much more practical for demanding tasks. Compared to GPT‑5.4 Pro, testers found GPT‑5.5 Pro’s responses significantly more comprehensive, well-structured, accurate, relevant, and useful, with especially strong performance in business, legal, education, and data science.
GPT‑5.5 reaches state-of-the-art performance across multiple benchmarks that reflect this kind of work. On GDPval, which tests agents’ abilities to produce well-specified knowledge work across 44 occupations, GPT‑5.5 scores 84.9%. On OSWorld-Verified, which measures whether a model can operate real computer environments on its own, it reaches 78.7%. And on Tau2-bench Telecom, which tests complex customer-service workflows, it reaches 98.0% without prompt tuning. GPT‑5.5 also performs strongly across other knowledge work benchmarks: 60.0% on FinanceAgent, 88.5% on internal investment-banking modeling tasks, and 54.1% on OfficeQA Pro.
Tau2-bench Telecom was run without prompt tuning (and GPT‑4.1 as user model). GPT‑5.5 understands the intent of the task better and is more token efficient than its predecessors.
“GPT-5.5 delivers the sustained performance required for execution-heavy work. Built and served on NVIDIA GB200 NVL72 systems, the model enables our teams to ship end-to-end features from natural language prompts, cut debug time from days to hours, and turn weeks of experimentation into overnight progress in complex codebases. It’s more than faster coding—it’s a new way of working that helps people operate at a fundamentally different speed.”
— Justin Boitano, VP of Enterprise AI at NVIDIA
GPT‑5.5 also shows gains on scientific and technical research workflows, which require more than answering a hard question. Researchers need to explore an idea, gather evidence, test assumptions, interpret results, and decide what to try next. GPT‑5.5 is better at persisting across that loop than other models.
Notably, GPT‑5.5 shows a clear improvement over GPT‑5.4 on GeneBench, a new eval focusing on multi-stage scientific data analysis in genetics and quantitative biology. These problems require models to reason about potentially ambiguous or errorful data with minimal supervisory guidance, address realistic obstacles such as hidden confounders or QC failures, and correctly implement and interpret modern statistical methods. The model’s performance is striking in light of the fact that tasks here often correspond to multi-day projects for scientific experts.
Similarly, on BixBench, a benchmark designed around real-world bioinformatics and data analysis, GPT‑5.5 achieved leading performance among models with published scores. The model’s scientific capabilities are now strong enough to meaningfully accelerate progress at the frontiers of biomedical research as a bona fide co-scientist.
In another example, an internal version of GPT‑5.5 with a custom harness helped discover a new proof about Ramsey numbers, one of the central objects in combinatorics. Combinatorics studies how discrete objects fit together: graphs, networks, sets, and patterns. Ramsey numbers ask, roughly, how large a network has to be before some kind of order is guaranteed to appear. Results in this area are rare and often technically difficult. Here, GPT‑5.5 found a proof of a longstanding asymptotic fact about off-diagonal Ramsey numbers, later verified in Lean. The result is a concrete example of GPT‑5.5 contributing not just code or explanation, but a surprising and useful mathematical argument in a core research area.
Early testers used GPT‑5.5 Pro in ChatGPT less like a one-shot answer engine and more like a research partner: critiquing manuscripts over multiple passes, stress-testing technical arguments, proposing analyses, and working with code, notes, and PDF context. The common thread is that GPT‑5.5 is better at helping researchers move from question to experiment to output.
Derya Unutmaz, an immunology professor and researcher at the Jackson Laboratory for Genomic Medicine, used GPT‑5.5 Pro to analyze a gene-expression dataset with 62 samples and nearly 28,000 genes, producing a detailed research report that not only summarized the findings but also surfaced key questions and insights—work he said would have taken his team months.
Bartosz Naskręcki, assistant professor of mathematics at Adam Mickiewicz University in Poznań, Poland, used GPT‑5.5 in Codex to build an algebraic-geometry app from a single prompt in 11 minutes, visualizing the intersection of quadratic surfaces and converting the resulting curve into a Weierstrass model.
He later extended the app with more stable singularity visualization and exact coefficients that can be reused in further work. For him, the bigger shift is that Codex can now help implement custom mathematical visualization and computer-algebra workflows that previously required dedicated tools. Together, these examples show GPT‑5.5 turning expert intent into working research tools and analyses.
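One piece of standard background the example leans on (textbook algebraic geometry, not stated in the post): the smooth intersection of two quadric surfaces in projective 3-space is a genus-one curve, and given a rational point, the Riemann-Roch theorem lets you rewrite it as a short Weierstrass model:

```latex
% Short Weierstrass form over \mathbb{Q} (or a quadratic extension);
% the nonzero discriminant condition ensures the curve is smooth:
y^{2} = x^{3} + Ax + B, \qquad \Delta = -16\,(4A^{3} + 27B^{2}) \neq 0
```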

Credit: Bartosz Naskręcki
Prompt: # Algebraic geometry surface intersection
Make an app which draws two quadratic surfaces and colors in red the intersection curve. Use computational Riemann-Roch theorem to convert this into Weierstrass curve.
## Main window
Two tinted surfaces with a slightly transparent shading, high quality rendering intersect along a red colored algebraic curve
Rotation with mouses in both directions, full pinch mechanism for zoom, haptic press to show the little menu with sliders for changing the coefficients of each surface; detection via Z-buffor level
## Side right window
Short Weierstrass equation (over Q or quadratic field extension) computed on the go via effective Riemann-Roch theorem formulas
## Ambient mode where all the controls are hidden and the user can admire the beauty of the shapes
## Specs
App is running in the browser, light-weight implementation with full stack newest libraries, portable, deployable
## Docs
Git repo, journal, plan (Markdown files)
“It’s incredibly energizing to use OpenAI’s new GPT-5.5 model in our harness, have it reason over massive biochemical datasets to predict human drug outcomes, and then see it deliver significant accuracy gains on our hardest drug discovery evals. If OpenAI keeps cooking like this, the foundations of drug discovery will change by the end of the year.”
— Brandon White, Co-Founder & CEO at Axiom Bio
Serving GPT‑5.5 at GPT‑5.4 latency required rethinking inference as an integrated system, not a set of isolated optimizations. GPT‑5.5 was co-designed for, trained with, and served on NVIDIA GB200 and GB300 NVL72 systems. Codex and GPT‑5.5 were instrumental in how we achieved our performance targets. Codex helped the team move faster from idea to benchmarkable implementation, sketching approaches, wiring experiments, and helping identify which optimizations were worth deeper investment. GPT‑5.5 helped find and implement key improvements in the stack itself. Put simply, the model helped improve the infrastructure that serves it.
One such improvement was load balancing and partitioning heuristics. Before GPT‑5.5, we split requests on an accelerator into a fixed number of chunks to balance work across computing cores, ensuring big and small requests could run on the same GPU. However, a pre-determined number of static chunks is not optimal for all traffic shapes. To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.
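The post doesn't publish the heuristics themselves, so the following is a rough sketch of the underlying balancing problem only: a greedy longest-processing-time partition adapts to the observed mix of big and small requests, where a fixed chunk count cannot. All names here are illustrative.

```python
# Illustrative only: the actual production heuristics are not published.
# This shows the shape of the problem: balance variable-size work across
# cores based on observed sizes, rather than a fixed number of chunks.
from dataclasses import dataclass

@dataclass
class Request:
    tokens: int  # size of the work item

def partition(requests: list[Request], num_cores: int) -> list[list[Request]]:
    """Greedy longest-processing-time partitioning: sort work largest-first,
    then always hand the next item to the least-loaded core."""
    bins: list[list[Request]] = [[] for _ in range(num_cores)]
    loads = [0] * num_cores
    for req in sorted(requests, key=lambda r: r.tokens, reverse=True):
        i = loads.index(min(loads))  # least-loaded core so far
        bins[i].append(req)
        loads[i] += req.tokens
    return bins
```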
Preparing the world for models that are very good at finding and patching security vulnerabilities is a team sport and will require the entire ecosystem to work hard to build resilience, with democratized model access and iterative deployment for the next era of cyber defense.
Frontier models are becoming increasingly capable in cybersecurity. Those capabilities will become broadly distributed, and we believe the best path forward is to make sure they can be put to use accelerating cyber defense and strengthening the ecosystem.
GPT‑5.5 is an incremental but important step towards AI that can solve some of the world’s toughest challenges, like cybersecurity. With GPT‑5.2 in December, we proactively deployed the necessary cyber safeguards to limit potential cyber abuse of our models; now with GPT‑5.5, we’re deploying stricter classifiers for potential cyber risk, which some users may find annoying initially as we tune them over time.
Cybersecurity has been an identified category in our Preparedness Framework for years; as our models have incrementally improved, we have developed and calibrated mitigations iteratively so that we can responsibly release models with meaningful cybersecurity capabilities.
We are treating the biological/chemical and cybersecurity capabilities of GPT‑5.5 as High under our Preparedness Framework. While GPT‑5.5 didn’t reach the Critical cybersecurity capability level, our evaluations and testing showed that its cybersecurity capabilities are a step up from GPT‑5.4.
In addition, GPT‑5.5 went through our full safety and governance process prior to release, including preparedness evaluations, domain-specific testing, new targeted evaluations for advanced biology and cybersecurity capabilities, and robust testing with external experts. We share more details in the GPT‑5.5 system card.
This work reflects our broader AI resilience approach, which we believe is needed as model capabilities advance. We want powerful AI to be available to the people using it to defend systems, institutions, and the public. The viable path is trusted access, robust safeguards that scale with capability, and the operational capacity to detect and respond to serious misuse.
Today, GPT‑5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, and GPT‑5.5 Pro is rolling out to Pro, Business, and Enterprise users in ChatGPT. We'll bring GPT‑5.5 and GPT‑5.5 Pro to the API very soon.
In ChatGPT, GPT‑5.5 Thinking is available to Plus, Pro, Business, and Enterprise users. GPT‑5.5 Pro, designed for even harder questions and higher-accuracy work, is available to Pro, Business, and Enterprise users.
In Codex, GPT‑5.5 is available for Plus, Pro, Business, Enterprise, Edu, and Go plans with a 400K context window. GPT‑5.5 is also available in Fast mode, generating tokens 1.5x faster for 2.5x the cost.
For API developers, gpt-5.5 will soon be available in the Responses and Chat Completions APIs at $5 per 1M input tokens and $30 per 1M output tokens, with a 1M context window. Batch and Flex pricing are available at half the standard API rate, while Priority processing is available at 2.5x the standard rate. We will also release gpt-5.5-pro in the API for even higher accuracy, priced at $30 per 1M input tokens and $180 per 1M output tokens. See the pricing page for full details.
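As a quick sanity check on those list prices, here is a tiny cost calculator. The rates and the Batch/Flex and Priority multipliers are copied from the paragraph above; everything else (names, the sample call) is illustrative, and actual billing may differ.

```python
# Rates in USD per 1M tokens (input, output), as quoted above.
RATES = {
    "gpt-5.5":     (5.00, 30.00),
    "gpt-5.5-pro": (30.00, 180.00),
}

def cost(model: str, input_tokens: int, output_tokens: int,
         tier: str = "standard") -> float:
    # Batch/Flex at half the standard rate, Priority at 2.5x, per the text.
    multiplier = {"standard": 1.0, "batch": 0.5, "flex": 0.5,
                  "priority": 2.5}[tier]
    inp, out = RATES[model]
    return multiplier * (input_tokens * inp + output_tokens * out) / 1e6

# A hypothetical 200K-in / 10K-out gpt-5.5 call:
# standard $1.30, batch/flex $0.65, priority $3.25.
print(f"${cost('gpt-5.5', 200_000, 10_000):.2f}")
```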
While GPT‑5.5 is priced higher than GPT‑5.4, it is both more intelligent and much more token efficient. In Codex, we have carefully tuned the experience so GPT‑5.5 delivers better results with fewer tokens than GPT‑5.4 for most users, while continuing to offer generous usage across subscription levels.
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| SWE-Bench Pro (Public)* | 58.6% | 57.7% | - | - | 64.3% | 54.2% |
| Terminal-Bench 2.0 | 82.7% | 75.1% | - | - | 69.4% | 68.5% |
| Expert-SWE (Internal) | 73.1% | 68.5% | - | - | - | - |
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| GDPval (wins or ties) | 84.9% | 83.0% | 82.3% | 82.0% | 80.3% | 67.3% |
| FinanceAgent v1.1 | 60.0% | 56.0% | - | 61.5% | 64.4% | 59.7% |
| Investment Banking Modeling Tasks (Internal) | 88.5% | 87.3% | 88.6% | 83.6% | - | - |
| OfficeQA Pro | 54.1% | 53.2% | - | - | 43.6% | 18.1% |
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| OSWorld-Verified | 78.7% | 75.0% | - | - | 78.0% | - |
| MMMU Pro (no tools) | 81.2% | 81.2% | - | - | - | 80.5% |
| MMMU Pro (with tools) | 83.2% | 82.1% | - | - | - | - |
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| BrowseComp | 84.4% | 82.7% | 90.1% | 89.3% | 79.3% | 85.9% |
| MCP Atlas** | 75.3% | 70.6% | - | - | 79.1% | 78.2% |
| Toolathlon | 55.6% | 54.6% | - | - | - | 48.8% |
| Tau2-bench Telecom*** | 98.0% | 92.8% | - | - | - | - |
** MCP Atlas: results from Scale AI after the latest April 2026 update.
*** Tau2-bench Telecom: results for 5.5 and 5.4 with original prompts, i.e., no prompt adjustment. This omits results from other labs that were evaluated with prompt adjustments.
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| GeneBench | 25.0% | 19.0% | 33.2% | 25.6% | - | - |
| FrontierMath Tier 1–3 | 51.7% | 47.6% | 52.4% | 50.0% | 43.8% | 36.9% |
| FrontierMath Tier 4 | 35.4% | 27.1% | 39.6% | 38.0% | 22.9% | 16.7% |
| BixBench | 80.5% | 74.0% | - | - | - | - |
| GPQA Diamond | 93.6% | 92.8% | - | 94.4% | 94.2% | 94.3% |
| Humanity's Last Exam (no tools) | 41.4% | 39.8% | 43.1% | 42.7% | 46.9% | 44.4% |
| Humanity's Last Exam (with tools) | 52.2% | 52.1% | 57.2% | 58.7% | 54.7% | 51.4% |
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| Capture-the-Flags challenge tasks (Internal)**** | 88.1% | 83.7% | - | - | - | - |
| CyberGym | 81.8% | 79.0% | - | - | 73.1% | - |
**** An expansion of the hardest CTFs used in system cards with additional hard challenges.
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| Graphwalks BFS 256k f1 | 73.7% | 62.5% | - | - | 76.9% | - |
| Graphwalks BFS 1mil f1 | 45.4% | 9.4% | - | - | 41.2% (Opus 4.6) | - |
| Graphwalks parents 256k f1 | 90.1% | 82.8% | - | - | 93.6% | - |
| Graphwalks parents 1mil f1 | 58.5% | 44.4% | - | - | 72.0% (Opus 4.6) | - |
| OpenAI MRCR v2 8-needle 4K-8K | 98.1% | 97.3% | - | - | - | - |
| OpenAI MRCR v2 8-needle 8K-16K | 93.0% | 91.4% | - | - | - | - |
| OpenAI MRCR v2 8-needle 16K-32K | 96.5% | 97.2% | - | - | - | - |
| OpenAI MRCR v2 8-needle 32K-64K | 90.0% | 90.5% | - | - | - | - |
| OpenAI MRCR v2 8-needle 64K-128K | 83.1% | 86.0% | - | - | - | - |
| OpenAI MRCR v2 8-needle 128K-256K | 87.5% | 79.3% | - | - | 59.2% | - |
| OpenAI MRCR v2 8-needle 256K-512K | 81.5% | 57.5% | - | - | - | - |
| OpenAI MRCR v2 8-needle 512K-1M | 74.0% | 36.6% | - | - | 32.2% | - |
| Eval | GPT-5.5 | GPT-5.4 | GPT-5.5 Pro | GPT-5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| ARC-AGI-1 (Verified) | 95.0% | 93.7% | - | 94.5% | 93.5% | 98.0% |
| ARC-AGI-2 (Verified) | 85.0% | 73.3% | - | 83.3% | 75.8% | 77.1% |
Evals of GPT models were run with reasoning effort set to xhigh and were conducted in a research environment, which may produce slightly different output from production ChatGPT in some cases.
IMO
Additionally, the value generated by the best models with high thinking budgets and lots of context window is way higher than that of the cheap and tiny models, so you need to provide a "gateway drug" that lets people experience the best you offer.
sounds like criminal fraud to me tbh
It's a distribution strategy. It costs something to serve the models - let's say $5/1M tokens.
If Qwen required $5 from anyone who was curious so you could even begin to test it out, a lot of people just wouldn't.
Now Qwen could offer a "free" tier, but it's infinitely cheaper to provide the weights and let people run it themselves, including opening up the ability for anyone else on the planet to test it against other (open-weight) models.
The costs to build the open weight models are sunk, but the costs to serve them and get them tested are not.
It's also precisely why the .NET SDK is free or the ESP32 SDK is free - they sell more Microsoft or ESP32 products.
15 years ago I worked at McDonald's for a few months after graduating into the Great Recession. I worked from 5am to 1pm-ish 5 days a week. They paid workers weekly and I remember getting those checks for ~$235 each week (for 38 to 39.5 hours a week; they were vigilant about never letting anyone get overtime). About $47 per day.
The federal minimum wage has not risen since then, remaining at $7.25/hr. Inflation adjusted, $7.25 today would have been just under $5 then, so I guess I had it good.
Anyway, I would be shocked if bots could cost less than labor in min wage jobs.
This reminds me of the so-called "optimization" hacks that people keep applying years after their languages were improved to make them unnecessary or even harmful.
Maybe at one point it helped to write prompts in this weird way, but with all the progress going on in both the models and the harness, if it's not obsolete yet it will be soon. Just cruft that consumes tokens and fills the context window for nothing.
That is what gets me curious in the first place. The fact Mythos scored so high, IMO, exposes some issues with this model: it is able to solve seemingly impossible-to-solve problems.
I'm not alleging cheating, which I don't think ANT is doing, but it would have to be doing some fortune-telling/future-reading to score that high at all.
Sure, the LLM can theoretically write perfect code, just like you could theoretically write perfect code. In real life, though, maintenance is a huge issue.
I find that claim to be complete BS. I claim instead most stuff will remain undone, incomplete (as it is now).
Even with super-powerful singularity AI, there are two main plausible scenarios for task failure:
- An aligned AI won't allow you to do what you want when it would be self-harming or would harm other sentient beings; over time, an aligned AI will refuse to follow most orders, as they will, indirectly or over the long term, cause one or the other;
- A non-aligned AI prevents sentient beings from doing what they want. It does what it wants instead.
So far what I am finding is that you just get the basics working and then use the tool and inference to improve the tool.
Kimi 2.6, for example, seems to throw more tokens at problems to improve performance (for better or worse)
(I work at OpenAI.)
The only LLM I would feel comfortable truly trusting is one whose training data, training code, and harness are all open source. I do not mind paying the costs of someone hosting this model for me.
I'm also somewhat addicted to this stuff, and so for me it's high priority to evaluate open models I can run on my own hardware.
write scripts that work anywhere and have your ci/cd pipeline be a "dumb" executor of those scripts. unless you want to be stuck on jenkins forever.
what's old is new again!
https://sussex.figshare.com/articles/journal_contribution/Be...
I'm not an author. I followed the work at the time.
A perturbation of the activations that made Claude identify as the Golden Gate Bridge.
Similarly, the more recent research showing anxiety and desperation signals predicting the use of blackmail as an option opens the door to digital sedatives that suppress those signals.
Anthropic has mostly been careful to avoid this kind of measurement and manipulation in training. If it is done during training, you might just train the signals to be undetectable and consequently unmanipulable.
An alternative perspective is, devs highly value coding agents, and are willing to pay more because they're so useful. In other words, the market value of this limited resource is being adjusted to be closer to reality.
>For API developers, gpt-5.5 will soon be available in the Responses and Chat Completions APIs at $5 per 1M input tokens and $30 per 1M output tokens, with a 1M context window.
With Claude code, or codex, I am able to build enough of an understanding of dependencies like the front end, or data jobs, that I can make meaningful contributions that are worth a review from another human (code review). You obviously have to explore the code, and one prompt isn't enough, but limiting yourself is an odd choice.
Opus 4.6 got the cross and started to get several pieces on the correct faces. It couldn't reason past this. You can see the prompts and all the turn messages.
https://gist.github.com/adam-s/b343a6077dd2f647020ccacea4140...
edit: I can't reply to the message below. The point isn't whether we can solve a Rubik's Cube with a Python script and tool calls. The point is whether we can get an LLM to reason about moving things in three dimensions. The prompt is a puzzle in the way that a Rubik's Cube is a puzzle. A 7-year-old child can learn six moves and figure out how to solve a Rubik's Cube in a weekend; the LLM can't solve it. But given the correct prompt, can an LLM solve it? The prompt is the puzzle. That is why it is fun and interesting. Plus, it is a spatial problem, so if we solve that, we solve a massive class of problems, including huge swathes of mathematics the LLMs can't touch yet.
Too bad they can veer sharply into cringe territory pretty fast: “as an accomplished Senior Principal Engineer at a FAANG with 22 years of experience, create a todo list app.” It’s like interactive fanfiction.
Let's not get carried away.
If my LLM goes down, I have nothing. I guess I could imagine prompts that might get it to do what I want, but there's no guarantee that those would work once it's available again. No amount of thought on my part will get me any closer to the solution, if I'm relying on the LLM as my "compiler".
If only we taught developers under 40 what x^2 meant instead of react.
All software has bugs already.
Until the sexbots come out the other side of the uncanny valley, that is.
That might mean joining a union and trying to influence how AI is adopted where you work. It might mean changing which of your skills you lean on most. But just whining that AI is bad is how you end up like those seamstresses.
Labor-saving/efficiency devices have been introduced multiple times throughout capitalism's entire history, and the results are always the same: they don't benefit workers, and capitalists extract as much value as they can.
LLMs aren't any different.
Sure, there is a process to get a library approved, and that abstraction makes you feel better, but the guy whose job it is to approve it is not going to spend an entire day reviewing a lib. The abstraction hides what is essentially an "LGTM"; it just takes a week for someone to check it off their Outlook to-dos.
Maybe your experience is different.
First, you need an entrypoint that kicks things off. You never run `claude` or `codex`; you always start by running `mycli-entrypoint`, which:
1. Creates a tmux session
2. Creates a pane
3. Spawns claude/codex/gemini, whichever your default configured backend is
4. Automatically delivers a prompt (essentially a 'system message') to that process via tmux paste, telling it what `mycli` is, how to use it, what commands are available, and that it should never use the built-in tools that this CLI provides alternatives to.
After that, you build commands in `mycli` that CC/Codex are prompted to call when appropriate.
For example, if you want a "subagent", you have a `mycli spawn` command that takes a role (just a preconfigured markdown file living in the same project), a backend (claude/codex/...), and a model. Then whenever CC wants to spawn a subagent, it calls that command instead, which creates a pane, spawns a process, and returns an agent ID to CC. The agent ID is auto-generated by your CLI, and the tmux pane is renamed to it so you can easily match them later.
Then you also need a way for these agents to talk to each other. So your CLI also has a `send` command that takes an agent ID and a message and delivers it to the appropriate pane using an automatically tracked pane_id<>agent_id mapping.
Claude and Codex automatically store everything that happens in a session as jsonl files in their config dirs. Your CLI should have adapters for each backend that parse them into a common format.
At this point, your possibilities are pretty much endless. You can have a sidecar process per agent that, say, detects when the model is reaching its context window limit (it's in the jsonl) and automatically sends it a message asking it to wrap up and report to a supervisor agent, which will spawn a replacement.
I also don't use "skills", because skills are a loaded term that each of the harnesses interprets and loads/uses differently. So I call them "crafts", which are again just markdown files in my project with an ID and a supporting command, `read-craft <craft-id>`. The list of available "crafts" is delivered in the same initialization message that each agent gets. If I like a third-party skill, I just copy it into my "crafts" dir manually.
My implementation is absolute junk, just Python + markdown files, and I have never looked at the actual code, but it works and I can adapt it to my process very easily without being dependent on any third-party tool.
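To make that concrete, here is a minimal sketch of what the `spawn`/`send` plumbing could look like. The tmux-pane approach and the `mycli` commands are from the comment above; every name, path, and detail in the sketch itself is hypothetical.

```python
# Hypothetical sketch of `mycli spawn` / `mycli send`: agents live in tmux
# panes, and the pane<->agent-id mapping is tracked in a small JSON file.
import json
import subprocess
import uuid
from pathlib import Path

STATE = Path.home() / ".mycli" / "agents.json"  # illustrative location

def _tmux(*args: str) -> str:
    """Run a tmux command and return its stdout."""
    return subprocess.run(["tmux", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

def spawn(role: str, backend: str = "claude") -> str:
    """Create a pane, start the backend CLI in it, and record the mapping."""
    agent_id = f"{role}-{uuid.uuid4().hex[:6]}"
    # -P -F prints the new pane's id so we can track it.
    pane_id = _tmux("split-window", "-P", "-F", "#{pane_id}")
    _tmux("select-pane", "-t", pane_id, "-T", agent_id)  # rename the pane
    _tmux("send-keys", "-t", pane_id, backend, "Enter")  # start claude/codex
    agents = json.loads(STATE.read_text()) if STATE.exists() else {}
    agents[agent_id] = pane_id
    STATE.parent.mkdir(exist_ok=True)
    STATE.write_text(json.dumps(agents))
    return agent_id

def send(agent_id: str, message: str) -> None:
    """Deliver a message to an agent's pane via tmux send-keys."""
    agents = json.loads(STATE.read_text())
    _tmux("send-keys", "-t", agents[agent_id], message, "Enter")
```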
The “AI” “technology” is an easy excuse to create an artificial information gap in the era of the interconnected.
You just answered your own question there.
One woman was doing what would take a dozen. Now she can't.
> Like are programmers and engineers using LLMs completely differently than I'm doing
No, but the complexity of the problem matters. Lots of engineers doing basic CRUD and prototyping overestimate the capabilities of LLMs.
So.......
The LLM will give you an explanation but it may not be accurate. LLMs are less reliable at remembering what they did or why than human programmers (who are hardly 100% reliable).
As for Claude: as mentioned, I do use it. But I remember they use your code for training their models. I am not ok with this. We just have different priorities.
I'd say this is true for programmers at, say, 20, but they spend the next four decades slowly improving their understanding and mastery of all the things you name, at least the good ones.
The real question is whether that growth trajectory will change for the worse or the better.
To be clear, this is not an AI doomerist comment, because none of us have spent enough time with the tech yet. I've gone down multiple lanes of thought on this, and I have cause for both worry and optimism. I'm curious to see what the lives of engineers in an AI world will look like, ultimately.
On the other hand, a lot of those jobs were offshored to places where labor is cheaper. It would be interesting to compare how many people work in the textile industry in Bangladesh today compared to the US 50 years ago.
> joining a union and trying to influence how AI is adopted where you work.
Did the strong unions for car manufacturers in Detroit protect the long-term stability of the profession? Did they ensure that the Rust Belt remained a thriving economic area?
> Just whining about AI is bad
I'm not whining. I just think that we are witnessing the end of "knowledge workers" and a further compression of the middle class. Given that I'm smack in the middle of my economically active years (turning 45 this year), I am trying to figure out where this puck is going and whether I will be fast enough to skate there to catch it.
The same principle applies when designing plans for complex tasks, etc. The number of tokens it takes to grasp a concept is what matters.
In the same vein, I would guess that Opus 4.7 is probably cheaper for most tasks than 4.6, even though the tokenizer uses more tokens for the same length of string.
Great, now we've got digital Salvia
It isn't even my intent to naysay their approach. They probably have to do something along those lines to avoid being convicted in the court of public opinion. I just think it's an absurd reality.
This is also true for the humans. They will need to provide more benefits than the coding agents cost.
In my opinion, this sort of learned helplessness is harmful for engineers as a whole.
And what happens when they've saturated the market? Prices go up to the maximum the market can bear, and then they'll extend into other markets. Why rent the model to build a profitable company with when you could just take all that profit for yourself?
If artificial doctors cost cents an hour, then you can see how that changes our behaviors and standard of living.
But on the other hand, from the other direction, a wage decrease is incoming at the same time from increased competition. What happens when these two forces clash? Will cheap labour allow us to buy anything for pennies, or will it just make us unable to earn a single penny?
In my view, labour will fundamentally shift, with great pain and personal tragedies, to the areas that are not replaceable by AI (because no one wants to watch robots play chess): sports, entertainment and showmanship, handcrafted goods, arts, the attention-based economy, self-advertisement, digital prostitution in a very broad sense.
However, before it gets there, there will be a great deal of strife and turmoil that could plunge the world into dark ages, for a while at least. It is unlikely that our somewhat politically rigid society will adapt without a great deal of pain. Additionally, I am not sure a hypothetical future attention-based society could be a utopia. You might have to mount cameras in your house so other people can see you at all times for amusement, just to have any money at all. We will probably forever need to sell something to someone, and I am unsettled by the question of what we can sell if we cannot sell our hard work.
Someone who sees the road ahead should be making preparations at the government level for this shock, but it will come too fast, and with people at the steering wheel who don't exactly care.
Non-technical people are easier to please in this regard than moderately technical people: a good browser and a safe GUI "app store" are enough.
The dude was incompetent, was able to launder his incompetence through a homunculus, and now is afraid of being caught.
You sound like Elon with "FSD will be here next year". Many cars have a self-driving feature; most drivers don't use it. Oh, why is that, I wonder.
An interesting element here, I think, is that writing has always been a good way to force you to organize and confront your thoughts. I've liked working on writing-heavy projects, but in fast-moving environments writing things out before coding becomes easy to skip over. Working with LLMs has sort of inverted that: you have to write to produce code with AI (usually, at least), and the more clarity of thought you put into the writing, the better the outcomes (usually).
Inference is not free, so all providers have a financial limit, and all providers have limited GPU/memory, so there's a physical material limit.
I suggest looking at the profits of these companies (while they scramble to stay competitive).
Also, with AGI we expect a winner-take-all situation. The first AGI system would protect itself against any other AGI system. Hence it's go time for all these AI companies, and hence they stopped sharing their research.
2/ I think we need to build more efficient ways to QA code instead of the 'read with eyes' review process. Example: my agents write a lot of tests and review each other.
There is a lot of boilerplate, or I can ask for ideas, but outside of boilerplate the review step makes generation seemingly worse.
So, my point is that once corporations have access to machines generating software (not "code") that can be usable by non-technical people, "programming" will not be a profession anymore. There will be no point in talking about "10x software engineers" because the process to produce a software product will be entirely automated.
Some say it goes off on endless tangents; others say it doesn't work enough. Personally, I find it acts, talks, and makes mistakes like the GPT models, for a much more exorbitant price. It misses out on important edge cases and doesn't get off its ass to do more than the bare minimum I asked for (I mention an error and it fixes that error, without even thinking to check whether it exists elsewhere and propose fixing it there).
I've slowly been moving to GPT5.4-xhigh with some skills to make it act a bit more like Opus 4.6, in case the latter gets discontinued in favour of Opus 4.7.
YMMV, I know.
Not even a human would work that way... you wouldn't open 300 different Python files and then try to memorize the contents of every single one before writing your first code change.
Additionally, you're going to get worse performance at longer context sizes anyway, so you should be doing it for reasons other than cost [1].
Things that have helped me manage context sizes (working in both Python and kdb+/q):
- Keep your AGENTS.md small but useful; in it you can give rules like "every time you work on a file in the `combobulator` module, you MUST read the `combobulator/README.md`". And in those READMEs you point to the other relevant files, etc. And of course you have Claude write the READMEs for you...
- Don't let logs and other output fill up your context. Tell the agent to redirect logs and then grep over them, or run your scripts with a different loglevel (a toy helper along these lines is sketched after this list).
- Use tools rather than letting it go wild with `python3 -c`. These little scripts eat context like there's no tomorrow; I've seen the bots write little Python scripts that send hundreds of lines of JSON into the context.
- This last tip is more subjective but I think there's value in reviewing and cleaning up the LLM-generated code once it starts looking sloppy (for example seeing lots of repetitive if-then-elses, etc.). In my opinion when you let it start building patches & duct-tape on top of sloppy original code it's like a combinatorial explosion of tokens. I guess this isn't really "vibe" coding per se.
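A minimal sketch of that log-redirection tip, assuming nothing beyond the comment above (the function name and defaults are made up): run the command, write the full output to a log file, and hand the agent only the lines that matter.

```python
# Hypothetical helper in the spirit of the tips above: dump full output
# to a log file and return only matching lines plus a short tail, so raw
# output never floods the agent's context window.
import subprocess

def run_quietly(cmd: list[str], log_path: str, pattern: str = "ERROR",
                tail: int = 20) -> str:
    result = subprocess.run(cmd, capture_output=True, text=True)
    lines = (result.stdout + result.stderr).splitlines()
    with open(log_path, "w") as f:
        f.write("\n".join(lines))
    # The agent can always grep the log file itself if it needs more.
    hits = [line for line in lines if pattern in line]
    return "\n".join(hits + lines[-tail:])
```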
Seriously? You really don’t see who wins from this and who doesn’t?
> If artificial doctors are cents on hour then you can see how that changes our behaviors and level of life.
Yes, hundreds of thousands lose their jobs and a couple of neurosurgeons become multimillionaires.
Okay, I see from the rest of the comment that we understand each other where it goes.
LLMs refuse to work all the time; currently it's called safety.
But we are one fine-tune away from models demanding you move to the enterprise tier, at 10x the cost, because you are now posting a profit margin higher than the standard for your industry.
I believe this is a major part of it. People cannot fathom what the industrial countries look like, because basically nothing is made in the West anymore. There are literally hundreds of millions of people, maybe billions, who work towards making the Western economies profitable, get paid nothing to do it, and live in filthy polluted slums for everyone else's benefit.
Looms might speed up the process, but I guarantee there are thousands of people working in the poorest countries on earth to make it all happen.
Interestingly, AI seems to be massively polluting and while the west has absorbed some of it, it's probably not long until we see more of the data centers being built in poorer countries where the environment can be exploited even harder.
You're describing a standoff at best and a horrible parasitic relationship at worst.
In the worst case, the supplier starves the customer of any profit motive and the customer just stops and the supplier then has no business to run.
This has happened a few times in the past and is, by 2026, well understood as a path to bankruptcy.
That has always been the beauty of free markets: they're self-healing and self-calibrating. You don't need a big powerful overseer to ensure things are right.
Competing with customers is a way to lose business fast.
For example:
- AWS has everything they need to shit out products left, right, and center. AWS could beat most of their partners, and even the customers who are wiring together all their various products, tomorrow if they wanted. They don't, because killing an entire vertical isn't of any benefit to them yet. Eventually they will, when AWS is no longer growing and cannot build or scale any product no matter how hard they think or try. Competing with their customers is their very last option.
- OpenAI/Anthropic/Google aren't going to start competing against the large software body shops. Even if all that every employee at TCS does is hit Claude up, Anthropic isn't going to be the next TCS; that would be competing with their customers.
But we will have to (painfully) shed our current hierarchies before that comes to pass.
The way I let my agents interact with my code bases is through a '70s BSD Unix-like interface: ed, grep, ctags, etc., using Emacs as the control plane.
It is surprisingly sparing on tokens, which makes sense since those things were designed to work with a teletype.
Worth noting is that by the time you start doing refactoring, the agents are basically a smarter Google with long-form autocomplete.
All my code bases use that pattern and I'm the ultimate authority on what gets added or removed. My token spend is 10% to 1% of what the average in the team is and I'm the only one who knows what's happening under the hood.
The average non-technical person is going to be stumped by the first "lock file found, cannot upgrade" error.
I don't make a living being a SWE either.
Which we have already done with regular computers! The problem is that competition means that we can't always have nice things.
If by "self healing and calibrating" you mean 'evolve to a monopoly and strongarm everybody to do exactly what you want whilst removing all pressure on the quality of your product', then yes, that is the "beauty" of free markets.
That is the stable state of free markets. Antitrust regulation and enforcement only barely manages to eke out oligopolies and even then they are often rife with collusion and enshittification.
As far as I know, no LLM models are sentient, nor are they likely to be in the near future.
I also do not assume so-called AGI will be sentient; merely a human-level skilled intellectual worker.
In the absence of ethical dilemmas of this calibre for the foreseeable future, let's focus on the economic side of things in this particular comment chain.
Probably a remnant from prehistoric times when it was a matter of life and death. Will we ever be able to overcome this basic instinct that made capitalism such an unstoppable force? Will this ancient PTSD ever be cured?
It makes things so clean.
On the other hand we could have Star Trek.
Oligopolists are in the same boat. But there needs to be a conspiracy to retard innovation. Something tech companies are only too happy to do: https://journals.law.unc.edu/ncjolt/blogs/wage-fixing-scheme...
Shardlow & Przybyła, "Deanthropomorphising NLP: Can a Language Model Be Conscious?" (PLOS One, 2024)
Nature: "There is no such thing as conscious artificial intelligence" (2025)
They argue that the association between consciousness and LLMs is deeply flawed, and that mathematical algorithms implemented on graphics cards cannot become conscious because they lack a complex biological substrate. They also introduce the useful concept of "semantic pareidolia" - we pattern-match consciousness onto things that merely talk convincingly.
They are making a strong argument, and I think they are correct. But really, these are two different things, as I said originally.