Wow.
https://blog.google/innovation-and-ai/models-and-research/ge...
The ARC-AGI-2 score (84.6%) is from the semi-private eval set. If Gemini 3 Deep Think gets above 85% on the private eval set, it will be considered "solved":
>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview
- non-thinking models
- thinking models
- best-of-N models like Deep Think and GPT Pro
Each one is of a certain computational complexity. Simplifying a bit, I think they map to linear, quadratic, and n^3 respectively.
I think there's a certain class of problems that can't be solved without thinking, because solving them necessarily involves writing in a scratchpad. The same goes for best-of-N, which involves exploring.
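A rough sketch of the three patterns, just to make the shape of the costs concrete (every name here is a stand-in I made up, not a real API):

    def generate(prompt: str) -> str:
        # Stand-in for a single model call; a real version would hit an LLM API.
        return f"<completion for {len(prompt)} prompt chars>"

    def score(candidate: str) -> float:
        # Stand-in for a verifier or reward model.
        return float(len(candidate))

    def non_thinking(prompt: str) -> str:
        # One forward pass: cost roughly linear in output length.
        return generate(prompt)

    def thinking(prompt: str) -> str:
        # Scratchpad first, then an answer conditioned on it; the model
        # re-reads its own reasoning, so cost grows superlinearly.
        scratchpad = generate(prompt + "\nThink step by step.")
        return generate(prompt + scratchpad + "\nFinal answer:")

    def best_of_n(prompt: str, n: int = 8) -> str:
        # n independent reasoning samples plus a selection pass over all of them.
        candidates = [thinking(prompt) for _ in range(n)]
        return max(candidates, key=score)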
Two open questions:
1) What's the next level up here? Is there a 4th option?
2) Can a sufficiently large non-thinking model perform the same as a smaller thinking one?
Google has definitely been pulling ahead in AI over the last few months. I've been using Gemini and finding it's better than the other models (especially for biology where it doesn't refuse to answer harmless questions).
They never will on the private set, because that would mean it's being leaked to Google.
I ask because I cannot distinguish all the benchmarks by heart.
$13.62 per task - so do we need another 5-10 years for the price of running this to become reasonable?
But the real question is whether they just fit the model to the benchmark.
It's completely misnamed. It should be called useless visual puzzle benchmark 2.
It's a visual puzzle, which firstly makes it way easier for humans than for models trained on text. Secondly, it's not really that obvious or easy for humans to solve either!
So the idea that an AI that can solve "Arc-AGI" or "Arc-AGI-2" is super smart, or even "AGI", is frankly ridiculous. It's a puzzle that means basically nothing, other than that models can now solve "Arc-AGI".
Yeah, these are made possible largely by better performance at high context lengths. You also need a step that gathers all the Ns, selects the best ideas/parts, and compiles the final output. Goog have been SotA at useful long context for a while now (since 2.5, I'd say). Many others have come out with "1M context", but their usefulness past 100k-200k is iffy.
What's even more interesting than maj@n or best-of-n is pass@n. For a lot of applications you can frame the question and search space such that pass@n is your success rate. Think security exploit finding, or optimisation problems with quick checks (better algos, kernels, infra routing, etc). It doesn't matter how good your pass@1 or avg@n is; all you care about is that you find more as you spend more time. Literally throwing money at the problem.
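A minimal sketch of that framing, with exploit search as the toy domain (every function and the 5% hit rate are hypothetical stand-ins):

    import random

    def attempt(seed: int) -> str | None:
        # Stand-in for one expensive model attempt at finding an exploit.
        rng = random.Random(seed)
        return "candidate-exploit" if rng.random() < 0.05 else None

    def quick_check(candidate: str | None) -> bool:
        # Cheap verifier: does the candidate actually trigger the bug?
        return candidate is not None

    def pass_at_n(n: int) -> bool:
        # Success if ANY of the n independent attempts passes the check.
        # With per-attempt success p, P(success) = 1 - (1 - p) ** n,
        # so spending more (bigger n) monotonically buys more hits,
        # regardless of how mediocre pass@1 is.
        return any(quick_check(attempt(seed)) for seed in range(n))

    print(pass_at_n(1), pass_at_n(100))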
https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...
tl;dr - Pekka says Arc-AGI-2 is now toast as a benchmark
edit: they just removed the reference to "3.1" from the pdf
Models from Anthropic have always been excellent at this. See e.g. https://imgur.com/a/EwW9H6q (top-left Opus 4.6 is without thinking).
Not interested enough to pay $250 to try it out though.
HN guidelines prefer the original source over social posts linking to it.
IMO it's the other way around. Benchmarks only measure applied horsepower on a set plane, with no friction, and your elephant is a point sphere. Goog's models have always punched above what benchmarks said, in real-world use at high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance, are workhorses. I don't know any other models where you can throw lots of docs at them and get proper context following and data extraction from wherever it's at to where you need it.
And I wonder how Gemini Deep Think will fare. My guess is that it will get halfway on some problems. But we will have to take absence of results as failure, because nobody wants to publish a negative result, even though negative results are so important for scientific research.
It’s impossible for it to do anything but cut code down, drop features, lose stuff and give you less than the code you put in.
It's puzzling because it spent months at the head of the pack; now I don't use it at all, because why would I want any of those things when I'm doing development?
I'm a paid subscriber, but there's no point any more; I'll spend the money on Claude 4.6 instead.
Gemini has been way behind from the start.
They use the firehose of money from search to make it as close to free as possible so that they have some adoption numbers.
They use the firehose from search to pay for tons of researchers to hand hold academics so that their non-economic models and non-economic test-time-compute can solve isolated problems.
It's all so tiresome.
Try making models that are actually competitive, Google.
Sell them on the actual market and win on actual work product in millions of people lives.
https://hn.algolia.com/?q=1stproof
This is exactly the kind of challenge I would want to judge AI systems based on. It required ten bleeding-edge-research mathematicians to publish a problem they've solved but hold back the answer. I appreciate the huge amount of social capital and coordination that must have taken.
I'm really glad they did it.
"The price" is the marginal price I am paying on top of my existing Google 1, YouTube Premium, and Google Fi subs, so basically nothing on the margin.
Me: Remove comments
Literally Gemini: // Comments were removed
Almost straight away, if OpenAI says "Elite", Google will release "Extraordinary" and Musk will post "Almost AGI, probably about this time next year".
That was about 18-24 months ago when I was trying to make sense of the offerings.
Is it really new?
We download the STL and import it into Bambu. Works pretty well. A direct push would be nice, but not necessary.
Is "Gemini 3 Deep Think" even technically a model? From what I've gathered, it is built on top of Gemini 3 Pro, and appears to be adding specific thinking capabilities, more akin to adding subagents than a truly new foundational model like Opus 4.6.
Also, I don't understand the comments about Google being behind in agentic workflows. I know that the typical use of, say, Claude Code feels agentic, but also a lot of folks are using separate agent harnesses like OpenClaw anyway. You could just as easily plug Gemini 3 Pro into OpenClaw as you can Opus, right?
Can someone help me understand these distinctions? Very confused, especially regarding the agent terminology. Much appreciated!
I don’t think it’s hyperbolic to say that we may be only a single digit number of years away from the singularity.
If agents get good enough, they're not going to build some profitable startup for you (or whatever people think they're doing with the LLM slot machines), because that implies anyone else with access to that agent can just copy you; it's what they're designed to do: launder IP/copyright. It's weird to see people get excited for this technology.
None of this is good. We are simply going to have our workforces replaced by assets owned by Google, Anthropic and OpenAI. We'll all be fighting for the same barista jobs, or miserable factory jobs. Take note of how all these CEOs are trying to make it sound cool to "go to trade school", or how we need "strong American workers to work in factories".
Arc-AGI score isn't correlated with anything useful.
Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?
I was expecting something more realistic... The true test of what you are doing is how representative the thing is in relation to the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesn't pass muster in my view.
If it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple.
It was sort of humorous for maybe the first 2 iterations; now it's tacky, cheesy, and just relentless self-promotion.
Again, like I said before, it's also a terrible benchmark.
Feb 12, 2026
Our most specialized reasoning mode is now updated to solve modern science, research and engineering challenges.
The Deep Think team
Gemini 3 Deep Think has a major upgrade to help solve science, research and engineering challenges. Google AI Ultra subscribers can now access the updated Deep Think in the Gemini app. Researchers, engineers and enterprises can express interest in early access to test Deep Think via the Gemini API.
Today, we’re releasing a major upgrade to Gemini 3 Deep Think, our specialized reasoning mode, built to push the frontier of intelligence and solve modern challenges across science, research, and engineering.
We updated Gemini 3 Deep Think in close partnership with scientists and researchers to tackle tough research challenges — where problems often lack clear guardrails or a single correct solution and data is often messy or incomplete. By blending deep scientific knowledge with everyday engineering utility, Deep Think moves beyond abstract theory to drive practical applications.
The new Deep Think is now available in the Gemini app for Google AI Ultra subscribers and, for the first time, we’re also making Deep Think available via the Gemini API to select researchers, engineers and enterprises. Express interest in early access here.
Here is how our early testers are already using the latest Deep Think:
Lisa Carbone, a mathematician at Rutgers University, works on the mathematical structures required by the high-energy physics community to bridge the gap between Einstein’s theory of gravity and quantum mechanics. In a field with very little existing training data, she used Deep Think to review a highly technical mathematics paper. Deep Think successfully identified a subtle logical flaw that had previously passed through human peer review unnoticed.
At Duke University, the Wang Lab utilized Deep Think to optimize fabrication methods for complex crystal growth for the potential discovery of semiconductor materials. Deep Think successfully designed a recipe for growing thin films larger than 100 μm, meeting a precise target that previous methods had struggled to hit.
Anupam Pathak, an R&D lead in Google’s Platforms and Devices division and former CEO of Liftware, tested the new Deep Think to accelerate the design of physical components.
Last year, we showed that specialized versions of Deep Think could successfully navigate some of the toughest challenges in reasoning, achieving gold-medal standards at math and programming world championships. More recently, Deep Think has enabled specialized agents to conduct research-level mathematics exploration.
The updated Deep Think mode continues to push the frontiers of intelligence, reaching new heights across the most rigorous academic benchmarks, including:

Beyond mathematics and competitive coding, Gemini 3 Deep Think now also excels across broad scientific domains such as chemistry and physics. Our updated Deep Think mode demonstrates gold medal-level results on the written sections of the 2025 International Physics Olympiad and Chemistry Olympiad. It also demonstrates proficiency in advanced theoretical physics, achieving a score of 50.5% on CMT-Benchmark.

In addition to its state-of-the-art performance, Deep Think is built to drive practical applications, enabling researchers to interpret complex data, and engineers to model physical systems through code. Most importantly, we are working to bring Deep Think to researchers and practitioners where they need it most — beginning with surfaces such as the Gemini API.
With the updated Deep Think, you can turn a sketch into a 3D-printable reality. Deep Think analyzes the drawing, models the complex shape and generates a file to create the physical object with 3D printing.
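As a rough sketch of what that workflow might look like over the API, using the google-genai Python SDK (the model ID below is a placeholder, since the post does not say which ID exposes Deep Think):

    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GEMINI_API_KEY from the environment

    with open("sketch.jpg", "rb") as f:
        sketch = f.read()

    response = client.models.generate_content(
        model="gemini-3-deep-think",  # hypothetical ID; check the API docs
        contents=[
            types.Part.from_bytes(data=sketch, mime_type="image/jpeg"),
            "Model this part and return OpenSCAD source I can export to STL.",
        ],
    )
    print(response.text)  # OpenSCAD to render, export and slice for printing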
Google AI Ultra subscribers will be able to access the updated Deep Think mode starting today in the Gemini app. Scientists, engineers and enterprises can also now express interest in our early access program to test Deep Think via the Gemini API.
We can’t wait to see what you discover.
It has to do with how the model is RL'd. It's not that Gemini can't be used with various agentic harnesses, like opencode or OpenClaw, or theoretically even Claude Code. It's just that the model is trained less effectively to work with those harnesses, so it produces worse results.
It's agents all the way down.
We're back to singularity hype, but let's be real: benchmark gains are meaningless in the real world when the primary focus has shifted to gaming the metrics
Put another way, I’m on the capital side of the conversation.
The good news for labor that has experience and creativity is that it just started costing 1/100,000 what it used to to get on that side of the equation.
The computer industry (including SW) has been in the business of replacing jobs for decades - since the 70's. It's only fitting that SW engineers finally become the target.
ARC-AGI-3 has a nasty combo of spatial reasoning + explore/exploit. It's basically adversarial vs current AIs.
It's also interesting because it's very very hard for base LLMs, even if you try to "cheat" by training on millions of ARC-like problems. Reasoning LLMs show genuine improvement on this type of problem.
You would need to check whether everyone is making mistakes on the same 20% or a different 20%. If it's the same 20%, either those questions are really hard, or they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem.
It happens. The old non-Pro MMLU had a lot of wrong answers. Even simple things like MNIST have digits labeled incorrectly, or drawn so badly it's not even a digit anymore.
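That overlap check is cheap to run once you have per-model results; a toy version with made-up data:

    # Hypothetical per-model sets of missed question IDs.
    misses = {
        "model_a": {3, 7, 19, 42, 57},
        "model_b": {3, 7, 19, 44, 61},
        "model_c": {3, 7, 19, 42, 70},
    }

    shared = set.intersection(*misses.values())
    union = set.union(*misses.values())

    # A high ratio means everyone fails the same items: suspect the questions
    # (too hard, mis-keyed, or underspecified) rather than the models.
    print(f"shared misses: {sorted(shared)}")
    print(f"overlap ratio: {len(shared) / len(union):.2f}")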
In contrast, the only "realistic" SVGs I've seen are created using tools like potrace, and look terrible.
I also think the prompt itself, a pelican on a bicycle, is unrealistic and cartoonish, so making a cartoon is a good way to solve the task.
The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.
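You can even script the variations (the word lists below are arbitrary examples, not from any benchmark):

    import random

    subjects = ["seahorse", "platypus", "pelican", "capybara"]
    vehicles = ["unicycle", "glider", "bicycle", "pogo stick"]

    # One fresh, probably-unseen prompt per run.
    print(f"Generate an SVG of a {random.choice(subjects)} riding a {random.choice(vehicles)}.")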
That said, "Lunar New Year" is probably as good a compromise as any, since we have other names for the Hebrew and Islamic New Years.
And don't get me started with "Lunar New Year? What Lunar New Year? Islamic Lunar New Year? Jewish Lunar New Year? CHINESE Lunar New Year?".
[0] https://www.mom.gov.sg/employment-practices/public-holidays
Have you ever had a Polish Sausage? Did it make you Polish?
Benchmaxxing exists, but that’s not the only data point. It’s pretty clear that models are improving quickly in many domains in real world usage.
And yes, you are probably using them wrong if you don’t find them useful or don’t see the rapid improvement.
I don't think that's going to make society very pleasant if everyone's fighting over the few remaining ways to make livelihood. People need to work to eat. I certainly don't see the capitalist class giving everyone UBI and letting us garden or paint for the rest of our lives. I worry we're likely going to end up in trenches or purged through some other means.
but forgot there's likely someone above them making exactly the same one about them
>can u make the progm for helps that with what in need for shpping good cheap products that will display them on screen and have me let the best one to get so that i can quickly hav it at home
And get back an automatic coupon code app like the user actually wanted.
Every new model release, neckbeards come out of their basements to tell us the singularity will be here in two more weeks.
I had a test failing because I introduced a silly comparison bug (> instead of <), and Claude 4.6 Opus figured out that the problem wasn't the test but the code, and fixed the bug (which I had missed).
The logic related to the bug wasn't all contained in one file, but spread across several files.
This was Gemini 2.5 Pro. A whole generation old.
Projects:
https://github.com/alexispurslane/oxen
https://github.com/alexispurslane/org-lsp
(Note that org-lsp has a much improved version of the same indexer as oxen; the first was purely my design, while for the second I decided to listen to K2.5 more, and it found a bunch of potential race conditions and fixed them.)
shrug
You've once again made up a claim of "two more weeks" to argue against, even though it's not something anybody here has claimed.
If you feel the need to argue against claims that exist only in your head, maybe you can keep the argument in your head too?