Gemini 3.1 Pro Preview

blog post is up- https://blog.google/innovation-and-ai/models-and-research/ge...

edit: biggest benchmark changes from 3 pro:

arc-agi-2 score went from 31.1% -> 77.1%

apex-agents score went from 18.4% -> 33.5%

Has anyone noticed that models are dropping ever faster, with pressure on companies to make incremental releases to claim the pole position, yet making strides on benchmarks? This is what recursive self-improvement with human support looks like.

Gemini 3 seems to have a much smaller token output limit than 2.5. I used to use Gemini to restructure essays into an LLM-style format to improve readability, but the Gemini 3 release was a huge step back for that particular use case.

Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response, it still truncates the source text too aggressively, losing vital context and meaning in the restructuring process.

I hope the 3.1 release includes a much larger output limit.

Surprisingly big jump in ARC-AGI-2 from 31% to 77%, guess there's some RLHF focused on the benchmark given it was previously far behind the competition and is now ahead.

Apart from that, the usual predictable gains in coding. Still is a great sweet-spot for performance, speed and cost. Need to hack Claude Code to use their agentic logic+prompts but use Gemini models.

I wish Google also updated Flash-lite to 3.0+, would like to use that for the Explore subagent (which Claude Code uses Haiku for). These subagents seem to be Claude Code's strength over Gemini CLI, which still has them only in experimental mode and doesn't have read-only ones like Explore.

I've been playing with the 3.1 Deep Think version of this for the last couple of weeks and it was a big step up for coding over 3.0 (which I already found very good).

It's only February...

It seems google is having a disjointed roll out, and there will likely be an official announcement in a few hours. Apparently 3.1 showed up unannounced in vertex at 2am or something equally odd.

Either way early user tests look promising.

Where is Simon's pelican?

Another preview release. Does that mean the recommended model by Google for production is 2.5 Flash and Pro? Not talking about what people are actually doing but the google recommendation. Kind of crazy if that is the case

Fine, I guess. The only commercial API I use to any great extent is gemini-3-flash-preview: cheap, fast, great for tool use and with agentic libraries. The 3.1-pro-preview is great, I suppose, for people who need it.

Off topic, but I like to run small models on my own hardware, and some small models are now very good for tool use and with agentic libraries - it just takes a little more work to get good results.

Somehow doesn't work for me :) "An internal error has occurred"

Appears the only difference to 3.0 Pro Preview is Medium reasoning. Model naming has long gone from even trying to make sense, but considering 3.0 is still in preview itself, increasing the number for such a minor change is not a move in the right direction.

I hope to have great next two weeks before it gets nerfed.

Doesn't show as available in gemini CLI for me. I have one of those "AI Pro" packages, but don't see it. Typical for Google, completely unclear how to actually use their stuff.

I always try Gemini models when they get updated with their flashy new benchmark scores, but always end up using Claude and Codex again...

I get the impression that Google is focusing on benchmarks but without assessing whether the models are actually improving in practical use-cases.

I.e. they are benchmaxing

Gemini is "in theory" smart, but in practice is much, much worse than Claude and Codex.

I'd love a new Gemini agent that isn't written with Node.js. Not sure why they think that's a good distribution model.

blog post is up- https://blog.google/innovation-and-ai/models-and-research/ge...

edit: biggest benchmark changes from 3 pro:

arc-agi-2 score went from 31.1% -> 77.1%

apex-agents score went from 18.4% -> 33.5%

Does the arc-agi-2 score more than doubling in a .1 release indicate benchmark-maxing? Though i dont know what arc-agi-2 actually tests

The touted SVG improvements make me excited for animated pelicans.

I hope the 3.1 release includes a much larger output limit.

Output limit has consistently been 64k tokens (including 2.5 pro).

People did find Gemini very talkative so it might be a response to that.

> Even when the model is explicitly instructed to pause due to insufficient tokens

Is there actually a chance it has the introspection to do anything with this request?

> Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response

AI models can't do this. At least not with just an instruction, maybe if you're writing some kind of custom 'agentic' setup.

I've been playing with the 3.1 Deep Think version of this for the last couple of weeks and it was a big step up for coding over 3.0 (which I already found very good).

It's only February...

> I've been playing with the 3.1 Deep Think version of this

How?

Surprisingly big jump in ARC-AGI-2 from 31% to 77%, guess there's some RLHF focused on the benchmark given it was previously far behind the competition and is now ahead.

Apart from that, the usual predictable gains in coding. Still is a great sweet-spot for performance, speed and cost. Need to hack Claude Code to use their agentic logic+prompts but use Gemini models.

>I wish Google also updated Flash-lite to 3.0+

I hope every day that they have made gains on their diffusion model. As a sub agent it would be insane, as it's compute light and cranks 1000+ tk/s

Remember when ARC 1 was basically solved, and then ARC 2 (which is even easier for humans) came out, and all of the sudden the same models that were doing well on ARC 1 couldn’t even get 5% on ARC 2? Not convinced these benchmark improvements aren’t data leakage.

I don't think there's much recursive improvement yet.

I'd say it's a combination of

A) Before, new model releases were mostly a new base model trained from scratch, with more parameters and more tokens. This takes many Months. Now that RL is used so heavily, you can make infinitely many tweaks to the RL setup, and in just a month get a better model using the same base model.

B) There's more compute online

C) Competition is more fierce.

I feel like they're actually dropping slower. Chinese models are dropping right before lunar new year as seems to be an emerging tradition.

A couple of western models have dropped around the same time too but I don't think the "strides on benchmarks" are that impressive when you consider how much tokens are being spent to make those "improvements". E.g. Gemini 3.1 Pro's ARC-AGI-2 score went from 33.6% to 77.1% buuut their "cost per task" also increased by 4.2x. It seems to be the same story for most of these benchmark improvements and similar for Claude model improvements.

I'm not convinced there's been any substantial jump in capabilities. More likely these companies have scaled their datacenters to allow for more token usage

this is mostly because RLVR is driving all of the recent gains, and you can continue improving the model by running it longer (+ adding new tasks / verifiers)

so we'll keep seeing more frequent flag planting checkpoint releases to not allow anyone to be able to claim SOTA for too long

It's becoming impossible to keep up - in the last week or so we've had: Gemini 3 Deep Think, Gemini 3.1 Pro, Claude Sonnet 4.6, GPT-5.3-Codex Spark, GLM-5, Minimax-2.5, Step 3.5 Flash, Qwen 3.5 and Grok 4.20.

and I'm sure others I've missed...

not much to do with self improvement as such. openai has increased its pace, others are pretty much consistent. Google last year had three versions of gemini-2.5-pro each within a month of each other. Anthropic released claude 3 in march 24, sonnet 3.5 in june 24, 3.5 new in oct 24, and then 3.7 in feb 25, where they went to 4 series in May 25. then followed by opus 4.1 in august, sonnet 4.5 in oct, opus 4.5 in nov, 4.6 in feb, sonnet 4.6 in feb itself. Yes, they released both within weeks of each other, but originally they only released it together. This staggered release is what creates the impression of fast releases. its as much a function of training as a function of available compute, and they have ramped up in that regard.

With the advent of MoEs, efficiency gains became possible. However, MoEs still operate far from the balance and stability of dense models. My view is that most progress comes from router tuning based on good and bad outcomes, with only marginal gains in real intelligence

and anyone notice that the pace has broken xAI and they were just dropped behind? The frontier improvement release loop is now ant -> openai -> google

That's what scaling compute depth to respond to the competition look like, lighting those dollars on fire.

This is what competition looks like.

It seems google is having a disjointed roll out, and there will likely be an official announcement in a few hours. Apparently 3.1 showed up unannounced in vertex at 2am or something equally odd.

Either way early user tests look promising.

Somehow doesn't work for me :) "An internal error has occurred"

Where is Simon's pelican?

It's also quite impressive with SVG animations.

> Create an SVG animation of a Beaver sitting next to a recordplayer and a create of records, his eyes follows the mouse curser.

https://gemini.google.com/share/717be5f9b184

Please no, let's not.

Does the arc-agi-2 score more than doubling in a .1 release indicate benchmark-maxing? Though i dont know what arc-agi-2 actually tests

Theoretically, you can’t benchmaxx ARC-AGI, but I too am suspect of such a large improvement, especially since the improvement on other benchmarks is not of the same order.

Benchmark maxing could be interpreted as benchmarks actually being a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.

Francois Chollet accuses the big labs of targeting the benchmark, yes. It is benchmaxxed.

I assume all the frontier models are benchmaxxing, so it would make sense

Off topic, but I like to run small models on my own hardware, and some small models are now very good for tool use and with agentic libraries - it just takes a little more work to get good results.

Seconded. Gemini used to be trash and I used Claude and Codex a lot but gemini-3-flash-preview punches above it's weight, it's decent and I rarely if ever run into any token limit either.

What models are you running locally? Just curious.

I am mostly restricted to 7-9B. I still like ancient early llama because its pretty unrestricted without having to use an abliteration.

I like to ask claude how to prompt smaller models for the given task. With one prompt it was able to make a low quantized model call multiple functions via json.

People did find Gemini very talkative so it might be a response to that.

Output limit has consistently been 64k tokens (including 2.5 pro).

I don't think there's much recursive improvement yet.

I'd say it's a combination of

B) There's more compute online

C) Competition is more fierce.

Doesn't show as available in gemini CLI for me. I have one of those "AI Pro" packages, but don't see it. Typical for Google, completely unclear how to actually use their stuff.

This is what competition looks like.

Maybe that's the only API-visible change, saying nothing about the actual capabilities of the model?

> increasing the number for such a minor change is not a move in the right direction

A .1 model number increase seems reasonable for more than doubling ARC-AGI 2 score and increasing so many other benchmarks.

What would you have named it?

I disagree. Incrementing the minor number makes so much more sense than “gemini-3-pro-preview-1902” or something.

According to the blog post, it should be also great at drawing pelicans riding a bicycle.

The touted SVG improvements make me excited for animated pelicans.

I just gave it a shot and this is what I got: https://codepen.io/takoid/pen/wBWLOKj

The model thought for over 5 minutes to produce this. It's not quite photorealistic (some parts are definitely "off"), but this is definitely a significant leap in complexity.

The blog post includes a video showcasing the improvements. Looks really impressive: https://blog.google/innovation-and-ai/models-and-research/ge...

I imagine they're also benchgooning on SVG generation

SVG is an under-rated use case for LLMs because it gives you the scalability of vector graphics along with CSS-style interactivity (hover effects, animations, transitions, etc.).

My perennial joke is as soon as that got on HN front page Google went and hired some interns and they spend a 100% of the time on pelicans.

How about STL files for 3d printing pelicans!

> Even when the model is explicitly instructed to pause due to insufficient tokens

Is there actually a chance it has the introspection to do anything with this request?

Yeah, it does. It was possible with 2.5 Flash.

Here's a similar result with Qwen Qwen3.5-397B-A17B: https://chat.qwen.ai/s/530becb7-e16b-41ee-8621-af83994599ce?...

No, the model doesn't have purview into this afaik

I'm not even sure what "pausing" means in this context and why it would help when there are insufficient tokens. They should just stop when you reach the limit, default or manually specified, but it's typically a cutoff.

You can see what happens by setting output token limit much lower

> Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response

AI models can't do this. At least not with just an instruction, maybe if you're writing some kind of custom 'agentic' setup.

Yeah, it does. It was possible with 2.5 Flash.

Here's a similar result with Qwen Qwen3.5-397B-A17B: https://chat.qwen.ai/s/530becb7-e16b-41ee-8621-af83994599ce?...

Look at the ARC site. The scores of these models is plotted against their "cost per task". All of these huge jumps come along with massive increases in cost per task. Including Gemini 3.1 Pro which increased by 4.2x

ARC 2 was made specifically to artificially lower contemporary LLM scores, therefore any kind of model improvements will have outsized effects

Also people use "saturated" too liberally. The top left corner 1 cent per task is saturated IMO. Since there are billions of people who would perfer to solve arc 1 tasks at 52 cents per task. Arc 2 a human would make thousands of dollars a day with 99.99% accuracy

I hope to have great next two weeks before it gets nerfed.

I always try Gemini models when they get updated with their flashy new benchmark scores, but always end up using Claude and Codex again...

I get the impression that Google is focusing on benchmarks but without assessing whether the models are actually improving in practical use-cases.

I.e. they are benchmaxing

Gemini is "in theory" smart, but in practice is much, much worse than Claude and Codex.

I'd love a new Gemini agent that isn't written with Node.js. Not sure why they think that's a good distribution model.

> I've been playing with the 3.1 Deep Think version of this

How?

I've found Google (at least in AI Studio) are the only provider NOT to nerf their models after a few weeks

I find Gemini is outstanding at reasoning (all topics) and architecture (software/system design). On the other hand, Gemini CLI sucks and so I end up using Claude Code and Codex CLI for agentic work.

However, I heavily use Gemini in my daily work and I think it has its own place. Ultimately, I don't see the point of choosing the one "best" model for everything, but I'd rather use what's best for any given task.

> but without assessing whether the models are actually improving in practical use-cases

Which cases? Not trying to sound bad but you didn't even provide of cases you are using Claude\Codex\Gemini for.

I'm glad someone else is finally saying this, I've been mentioning this left and right and sometimes I feel like I'm going crazy that not more people are noticing it.

Gemini can go off the rails SUPER easily. It just devolves into a gigantic mess at the smallest sign of trouble.

For the past few weeks, I've also been using XML-like tags in my prompts more often. Sometimes preferring to share previous conversations with `<user>` and `<assistant>` tags. Opus/Sonnet handles this just fine, but Gemini has a mental breakdown. It'll just start talking to itself.

Even in totally out-of-the-ordinary sessions, it goes crazy. After a while, it'll start saying it's going to do something, and then it pretends like it's done that thing, all in the same turn. A turn that never ends. Eventually it just starts spouting repetitive nonsense.

And you would think this is just because the bigger the context grows, the worse models tend to get. But no! This can happen well below even the 200.000 token mark.

I exclusively use Gemini for Chat nowadays, and it's been great mostly. It's fast, it's good, and the app works reliably now. On top of that I got it for free with my Pixel phone.

For development I tend to use Antigravity with Sonnet 4.5, or Gemini Flash if it's about a GUI change in React. The layout and design of Gemini has been superior to Claude models in my opinion, at least at the time. Flash also works significantly faster.

And all of it is essentially free for now. I can even select Opus 4.6 in Antigravity, but I did not yet give it a try.

Honestly doesn't feel like Google is targeting the agentic coding crowd so much as they are the knowledge worker / researcher / search-engine-replacement market?

Agree Gemini as a model is fairly incompetent inside their own CLI tool as well as in opencode. But I find it useful as a research and document analysis tool.

(Shrug) Ask it to write one!

A select few have had early access through various programs Google offers. I believe there was a sentence or two to this effect on the Gemini 3 Deep Think post from Deepmind.

According to the blog post, it should be also great at drawing pelicans riding a bicycle.

Maybe that's the only API-visible change, saying nothing about the actual capabilities of the model?

I disagree. Incrementing the minor number makes so much more sense than “gemini-3-pro-preview-1902” or something.

The blog post includes a video showcasing the improvements. Looks really impressive: https://blog.google/innovation-and-ai/models-and-research/ge...

SVG is an under-rated use case for LLMs because it gives you the scalability of vector graphics along with CSS-style interactivity (hover effects, animations, transitions, etc.).

How about STL files for 3d printing pelicans!

I imagine they're also benchgooning on SVG generation

No, the model doesn't have purview into this afaik

You can see what happens by setting output token limit much lower

My perennial joke is as soon as that got on HN front page Google went and hired some interns and they spend a 100% of the time on pelicans.

Yeah, it does. It was possible with 2.5 Flash.

Here's a similar result with Qwen Qwen3.5-397B-A17B: https://chat.qwen.ai/s/530becb7-e16b-41ee-8621-af83994599ce?...

this is mostly because RLVR is driving all of the recent gains, and you can continue improving the model by running it longer (+ adding new tasks / verifiers)

so we'll keep seeing more frequent flag planting checkpoint releases to not allow anyone to be able to claim SOTA for too long

and I'm sure others I've missed...

I feel like they're actually dropping slower. Chinese models are dropping right before lunar new year as seems to be an emerging tradition.

I'm not convinced there's been any substantial jump in capabilities. More likely these companies have scaled their datacenters to allow for more token usage

That's what scaling compute depth to respond to the competition look like, lighting those dollars on fire.

> increasing the number for such a minor change is not a move in the right direction

A .1 model number increase seems reasonable for more than doubling ARC-AGI 2 score and increasing so many other benchmarks.

What would you have named it?

My issue is that we haven't even gotten the release version of 3.0, that is also still in Preview, so may stick with 3.0 till that has been deemed stable.

Basically, what does the word "Preview" mean, if newer releases happen before a Preview model is stable? In prior Google models, Preview meant that there'd still be updates and improvements to said model prior to full deployment, something we saw with 2.5. Now, there is no meaning or reason for this designation to exist if they forgo a 3.0 still in Preview for model improvements.

I just gave it a shot and this is what I got: https://codepen.io/takoid/pen/wBWLOKj

The model thought for over 5 minutes to produce this. It's not quite photorealistic (some parts are definitely "off"), but this is definitely a significant leap in complexity.

Good to see it wearing a helmet. Their safety team must be on their game.

That's a good pelican. What I like the most is that the SVG is nice and readable. If only Inkscape could output nice SVG like this!

Here's what I got from Gemini Pro on gemini.google.com, it thought for under a minute...might you have been using AI studio? https://jsbin.com/zopekaquga/edit?html,output

It does say 3.1 in the Pro dropdown box in the message sending component.

Yeah, it does. It was possible with 2.5 Flash.

Here's a similar result with Qwen Qwen3.5-397B-A17B: https://chat.qwen.ai/s/530becb7-e16b-41ee-8621-af83994599ce?...

Ok it prints some stuff at the end but does it actually count the output tokens? That part was already built in somehow? Is it just retrying until it has enough space to add the footer?

and anyone notice that the pace has broken xAI and they were just dropped behind? The frontier improvement release loop is now ant -> openai -> google

>I wish Google also updated Flash-lite to 3.0+

I hope every day that they have made gains on their diffusion model. As a sub agent it would be insane, as it's compute light and cranks 1000+ tk/s

ARC 2 was made specifically to artificially lower contemporary LLM scores, therefore any kind of model improvements will have outsized effects

Benchmark maxing could be interpreted as benchmarks actually being a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.

It's also quite impressive with SVG animations.

> Create an SVG animation of a Beaver sitting next to a recordplayer and a create of records, his eyes follows the mouse curser.

https://gemini.google.com/share/717be5f9b184

Please no, let's not.

I assume all the frontier models are benchmaxxing, so it would make sense

I like to ask claude how to prompt smaller models for the given task. With one prompt it was able to make a low quantized model call multiple functions via json.

xAI just released Grok 4.20 beta yesterday or day before?

Musk said Grok 5 is currently being trained, and it has 7 trillion params (Grok 4 had 3)

Agree, can't wait for updates to the diffusion model.

Could be useful for planning too, given its tendency to think big picture first. Even if it's just an additional subagent to double-check with an "off the top off your head" or "don't think, share first thought" type of question. More generally would like to see how sequencing autoregressive thinking with diffusion over multiple steps might help with better overall thinking.

How much do I get if I solve this? :D

https://arcprize.org/play

You are saying something interesting but too esoteric. Can you explain for beginners?

Seconded. Gemini used to be trash and I used Claude and Codex a lot but gemini-3-flash-preview punches above it's weight, it's decent and I rarely if ever run into any token limit either.

Thirded, I've been using gemini-3-flash to great effect. Anytime I have something more complicated, I give it to pro & flash to see what happens. Coin flip if flash is nearly equivalent (too many moving vars to be analytical at this point)

Theoretically, you can’t benchmaxx ARC-AGI, but I too am suspect of such a large improvement, especially since the improvement on other benchmarks is not of the same order.

https://arcprize.org/arc-agi/1/

It's a sort of arbitrary pattern matching thing that can't be trained on in the sense that the MMLU can be, but you can definitely generate billions of examples of this kind of task and train on it, and it will not make the model better on any other task. So in that sense, it absolutely can be.

I think it's been harder to solve because it's a visual puzzle, and we know how well today's vision encoders actually work https://arxiv.org/html/2407.06581v1

Francois Chollet accuses the big labs of targeting the benchmark, yes. It is benchmaxxed.

Didn't the same Francois Chollet claim that this was the Real Test of Intelligence? If they target it, perhaps they target... real intelligence?

I don't know what he could mean by that, as the whole idea behind ARC-AGI is to "target the benchmark." Got any links that explain further?

What models are you running locally? Just curious.

I am mostly restricted to 7-9B. I still like ancient early llama because its pretty unrestricted without having to use an abliteration.

I experimented with many models on my 16G and 32G Macs. For less memory, qwen3:4b is good, for the 32B Mac, gpt-oss:20b is good. I like the smaller Mistral models like mistral:v0.3 and rnj-1:latest is a pretty good small reasoning model.

That's a good pelican. What I like the most is that the SVG is nice and readable. If only Inkscape could output nice SVG like this!

Ok it prints some stuff at the end but does it actually count the output tokens? That part was already built in somehow? Is it just retrying until it has enough space to add the footer?

Here's what I got from Gemini Pro on gemini.google.com, it thought for under a minute...might you have been using AI studio? https://jsbin.com/zopekaquga/edit?html,output

It does say 3.1 in the Pro dropdown box in the message sending component.

Agree, can't wait for updates to the diffusion model.

xAI just released Grok 4.20 beta yesterday or day before?

You are saying something interesting but too esoteric. Can you explain for beginners?

How much do I get if I solve this? :D

https://arcprize.org/play

Good to see it wearing a helmet. Their safety team must be on their game.

Yes but why would a pelican need a helmet? If it falls over it can just fly away... Common sense 1 Gemini 0

My issue is that we haven't even gotten the release version of 3.0, that is also still in Preview, so may stick with 3.0 till that has been deemed stable.

Given the pace AI is improving and that it doesn't give the exact same answers under many circumstances, is the the [in]stability of "preview" a concern?

GMail was in "beta" for 5 years.

Musk said Grok 5 is currently being trained, and it has 7 trillion params (Grok 4 had 3)

My understanding is that all recent gains are from post training and no one (publicly) knows how much scaling pretraining will still help at this point.

Happy to learn more about this if anyone has more information.

I find Gemini is outstanding at reasoning (all topics) and architecture (software/system design). On the other hand, Gemini CLI sucks and so I end up using Claude Code and Codex CLI for agentic work.

> but without assessing whether the models are actually improving in practical use-cases

Which cases? Not trying to sound bad but you didn't even provide of cases you are using Claude\Codex\Gemini for.

A select few have had early access through various programs Google offers. I believe there was a sentence or two to this effect on the Gemini 3 Deep Think post from Deepmind.

(Shrug) Ask it to write one!

I exclusively use Gemini for Chat nowadays, and it's been great mostly. It's fast, it's good, and the app works reliably now. On top of that I got it for free with my Pixel phone.

And all of it is essentially free for now. I can even select Opus 4.6 in Antigravity, but I did not yet give it a try.

https://arcprize.org/arc-agi/1/

I think it's been harder to solve because it's a visual puzzle, and we know how well today's vision encoders actually work https://arxiv.org/html/2407.06581v1

I've found Google (at least in AI Studio) are the only provider NOT to nerf their models after a few weeks

I don't use AI studio for my work. I used Antigravity/Gemini CLI and 3 pro was great for few weeks and now it's worse than 3 flash or any smaller model from competitor which are rated lower on benchmarks

IME, they definitely nerf models. gemini-2.5-pro-exp-03-25 through AI Studio was amazing at release and steadily degraded. The quality started tanking around the time they hid CoT.

Honestly doesn't feel like Google is targeting the agentic coding crowd so much as they are the knowledge worker / researcher / search-engine-replacement market?

Agree Gemini as a model is fairly incompetent inside their own CLI tool as well as in opencode. But I find it useful as a research and document analysis tool.

For my custom agentic coding setup, I use Claude Code derived prompts with Gemini models, primarily flash. It's night and day compared to Google's own agentic products, which are all really bad.

The models are all close enough on the benchmarks and I think people are attributing too much difference in the agentic space to the model itself. I strongly believe the difference is in all the other stuff, which is why Antropic is far ahead of the competition. They have done great work with Claude Code, Cowork, and their knowledge share through docs & blog, bar none on this last point imo.

I'm glad someone else is finally saying this, I've been mentioning this left and right and sometimes I feel like I'm going crazy that not more people are noticing it.

Gemini can go off the rails SUPER easily. It just devolves into a gigantic mess at the smallest sign of trouble.

And you would think this is just because the bigger the context grows, the worse models tend to get. But no! This can happen well below even the 200.000 token mark.

Didn't the same Francois Chollet claim that this was the Real Test of Intelligence? If they target it, perhaps they target... real intelligence?

I don't know what he could mean by that, as the whole idea behind ARC-AGI is to "target the benchmark." Got any links that explain further?

Flash is (was?) was better than Pro on these fronts.

He's always said ARC is a necessary but not sufficient condition for testing intelligence afaik

The fact that ARC-AGI has public and semi-private in addition to private datasets might explain it: https://arcprize.org/arc-agi/2/#dataset-structure

Yes but why would a pelican need a helmet? If it falls over it can just fly away... Common sense 1 Gemini 0

Obviously these domestic pelicans can't fly, otherwise why would they need a bike?

Why would a pelican be riding a bicycle at all, for that matter?

Given the pace AI is improving and that it doesn't give the exact same answers under many circumstances, is the the [in]stability of "preview" a concern?

GMail was in "beta" for 5 years.

ChatGPT 4.5 was never released to the public, but it is widely believed to be the foundation the 5.x series is built on.

Wonder how GP feels about the minor bumps for other model providers?

My understanding is that all recent gains are from post training and no one (publicly) knows how much scaling pretraining will still help at this point.

Happy to learn more about this if anyone has more information.

You gain more benefit spending compute on post-training than on pre-training.

But scaling pre-training is still worth it if you can afford it.

For my custom agentic coding setup, I use Claude Code derived prompts with Gemini models, primarily flash. It's night and day compared to Google's own agentic products, which are all really bad.

IME, they definitely nerf models. gemini-2.5-pro-exp-03-25 through AI Studio was amazing at release and steadily degraded. The quality started tanking around the time they hid CoT.

Flash is (was?) was better than Pro on these fronts.

He's always said ARC is a necessary but not sufficient condition for testing intelligence afaik

The fact that ARC-AGI has public and semi-private in addition to private datasets might explain it: https://arcprize.org/arc-agi/2/#dataset-structure

You gain more benefit spending compute on post-training than on pre-training.

But scaling pre-training is still worth it if you can afford it.

Obviously these domestic pelicans can't fly, otherwise why would they need a bike?

ChatGPT 4.5 was never released to the public, but it is widely believed to be the foundation the 5.x series is built on.

Wonder how GP feels about the minor bumps for other model providers?

Why would a pelican be riding a bicycle at all, for that matter?

Because the user asked for it

Hacker Times

Hacker Times

Discussion

Discussion