So Google doesn't use NVIDIA GPUs at all?
I would love for them to eliminate these issues because just touting benchmark scores isn't enough.
Anthropic seems the best at this. Everything is in the API on day one. OpenAI tends to push you toward a subscription first, with the API arriving a week or a few weeks later. Now, Gemini 3 is not for production use, and it's already the previous iteration. So, does Google even intend to release this model?
On our end, Gemini 3.0 Preview was very flaky (not in model quality, but in that API responses sometimes errored out), making it unreliable.
Does this mean that 3.0 is now GA at least?
Below is one of my test prompts that previous Gemini models were failing. 3.1 Pro did a decent job this time.
> use c++, sdl3. use SDL_AppInit, SDL_AppEvent, SDL_AppIterate callback functions. use SDL_main instead of the default main function. make a basic hello world app.
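For reference, the shape of a correct answer looks roughly like this. This is my own minimal sketch (not the model's output), and it assumes SDL 3.2+ for SDL_RenderDebugText; with the callback entry points enabled, SDL supplies the actual main/SDL_main for you:

```cpp
// Minimal SDL3 "hello world" using the callback entry points.
// Sketch only: assumes SDL 3.2+ (for SDL_RenderDebugText).
#define SDL_MAIN_USE_CALLBACKS 1  // tell SDL_main.h to use the callbacks instead of main()
#include <SDL3/SDL.h>
#include <SDL3/SDL_main.h>

static SDL_Window *window = nullptr;
static SDL_Renderer *renderer = nullptr;

SDL_AppResult SDL_AppInit(void **appstate, int argc, char *argv[]) {
    if (!SDL_Init(SDL_INIT_VIDEO))
        return SDL_APP_FAILURE;
    if (!SDL_CreateWindowAndRenderer("hello", 640, 480, 0, &window, &renderer))
        return SDL_APP_FAILURE;
    return SDL_APP_CONTINUE;
}

SDL_AppResult SDL_AppEvent(void *appstate, SDL_Event *event) {
    // Quit when the window is closed.
    if (event->type == SDL_EVENT_QUIT)
        return SDL_APP_SUCCESS;
    return SDL_APP_CONTINUE;
}

SDL_AppResult SDL_AppIterate(void *appstate) {
    // Clear to black and draw the greeting each frame.
    SDL_SetRenderDrawColor(renderer, 0, 0, 0, SDL_ALPHA_OPAQUE);
    SDL_RenderClear(renderer);
    SDL_RenderDebugText(renderer, 270.0f, 230.0f, "Hello, world!");
    SDL_RenderPresent(renderer);
    return SDL_APP_CONTINUE;
}

void SDL_AppQuit(void *appstate, SDL_AppResult result) {
    // SDL destroys the window/renderer on exit.
}
```

Build it against SDL3 (e.g. `g++ hello.cpp -lSDL3`) and you should get a window with the text rendered every frame.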
But with accounts reportedly being banned over ToS issues, similar to Claude Code, it feels risky to rely on it in a serious workflow.
Would be nice to see one of these models, Plus, Pro, Super, God mode, hit 100% on a benchmark. Am I missing something here?
It's totally possible to build entire software products in the fraction of the time it took before.
But, reading the comments here, the behavior from one point version to the next (not even a major version, mind you) seems very divergent.
It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.
But it also feels like you have to be diligent about adopting new models (even same family and just point version updates) because they operate totally differently regardless of your prompt and agent files.
Imagine managing a team of software developers where every month it was an entirely new team with radically different personalities, career experiences and guiding principles. It would be chaos.
I suspect that older models will be deprecated quickly and unexpectedly, or, worse yet, will be swapped out with subtly different behavioral characteristics without notice. It'll be quicksand.
You are definitely going to have to drive it there—unless you want to put it in neutral and push!
While 200 feet is a very short and easy walk, if you walk over there without your car, you won't have anything to wash once you arrive. The car needs to make the trip with you so it can get the soap and water.
Since it's basically right next door, it'll be the shortest drive of your life. Start it up, roll on over, and get it sparkling clean.
Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?
Knowledge cutoff is unchanged at Jan 2025. Gemini 3.1 Pro supports "medium" thinking where Gemini 3 did not: https://ai.google.dev/gemini-api/docs/gemini-3
Compare to Opus 4.6's $5/M input, $25/M output. If Gemini 3.1 Pro does indeed have similar performance, the price difference is notable.
I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development.
It's stunningly good at reasoning, design, and generating the raw code, but it just falls over a lot when actually trying to get things done, especially compared to Claude Opus.
Within VS Code Copilot, Claude will have a good mix of thinking streams and responses to the user. Gemini will almost completely use thinking tokens, and then just do something but not tell you what it did. If you don't look at the thinking tokens you can't tell what happened, but the thinking token stream is crap. It's all "I'm now completely immersed in the problem...". Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress. It's bad at using tools and tries to edit files in weird ways instead of using the provided text editing tools. In Copilot it won't stop and ask clarifying questions, though in Gemini CLI it will.
So I've tried to adopt a plan-in-Gemini, execute-in-Claude approach, but while I'm doing that I might as well just stay in Claude. The experience is just so much better.
For as much as I hear Google's pulling ahead, Anthropic seems to be to me, from a practical POV. I hope Googlers on Gemini are actually trying these things out in real projects, not just one-shotting a game and calling it a win.
"create a svg of a unicorn playing xbox"
https://www.svgviewer.dev/s/NeKACuHj
The final result still needs some tweaks, but I'm guessing that with the ARC-AGI benchmark jumping so much, the model's visual abilities are what allow it to do this well.
Are Google planning to put any of their models into production any time soon?
Also somewhat funny that some models are deprecated without a suggested alternative (gemini-2.5-flash-lite). Do they suggest people switch to Claude?
I am legit scared to login and use Gemini CLI because the last time I thought I was using my “free” account allowance via Google workspace. Ended up spending $10 before realizing it was API billing and the UI was so hard to figure out I gave up. I’m sure I can spend 20-40 more mins to sort this out, but ugh, I don’t want to.
With alllll that said.. is Gemini 3.1 more agentic now? That’s usually where it failed. Very smart and capable models, but hard to apply them? Just me?
I just tested the "generate an SVG of a pelican riding a bicycle" prompt and this is what I got: https://codepen.io/takoid/pen/wBWLOKj
The model thought for over 5 minutes to produce this. It's not quite photorealistic (some parts are definitely "off"), but this is definitely a significant leap in complexity.
BUT it is not good at all at tool calling and agentic workflows, especially compared to the recent two mini-generations of models (Codex 5.2/5.3, the last two versions of Anthropic models), and also fell behind a bit in reasoning.
I hope they manage to improve things on that front, because then Flash would be great for many tasks.
However, it didn't get it on the first try with the original prompt ("How many legs does the dog have?"). It initially said 4; with a follow-up prompt I got it to hesitantly say 5, suggesting one limb must be obfuscated or hidden.
So maybe I'll give it a 90%?
This is without tools as well.
For conversational contexts, I don't think the (in some cases significantly) better benchmark results compared to a model like Sonnet 4.6 can convince me to switch to Gemini 3.1. Has anyone else had a similar experience, or is this just a me issue?
This kind of test is good because it requires stitching together info from the whole video.
If the pace of releases continues to accelerate, by mid-2027 or 2028 we're headed for weekly releases.
So this is same but not same as Gemini 3 Deep Think? Keeping track of these different releases is getting pretty ridiculous.
Anthropic is clearly targeting developers, and OpenAI is the general go-to AI model. Who is the target demographic for Gemini models? I know they are good and Flash is super impressive, but I'm curious.
These are not data-driven observations, just vibes.
It's such an uninformative piece of marketing crap
Less impact on gamers…
deep think == turning up thinking knob (I think)
deep research == agent w/ search
Also, what's great about Gemini in Google Search is that the answer comes with several links. I sometimes use them to validate the correctness of the solution, or to check how old the solution is (I've never used ChatGPT, so I don't know if it does this).
My main use-cases outside of SWE generally involve the ability to compare detailed product specs and come up with answers/comparisons/etc... Gemini does really well for that, probably because of the deeper google search index integration.
Also I got a year of Pro for free with my phone... so that's a big part.
When you sign up for the pro tier you also get 2TB of storage, Gemini for workspace and Nest Camera history.
If you're in the Google sphere it offers good value for money.
I had only started using Opus 4.6 this week. Sonnet, it seems, is much better at having a long conversation with. Gemini is good for knowledge retrieval, but I think Opus 4.6 has caught up. The biggest thing that made Gemini worth it for me over the last 3 months is that I crushed it with questions. I wouldn't have gotten even 10% of that usage out of Opus before being made to slow down.
I have a deep research going right now on 3.1 for the first time and I honestly have no idea how I am going to tell if it is better than 3.
It seems like Gemini wasn't as good at agentic coding, but for just asking it to write a function, I think it failed to one-shot what I asked only twice, and then it fixed the problem on the next prompt.
I haven't logged in to bother with chatGPT in about 3 months now.
I miss when Gemini 3.1 was good. :(
I honestly do not wish Google to have the best model out there and be forced to use their incomprehensible subscription / billing / project management whatever shit ever again.
I don’t know what their stuff cost. I don’t know why would I use vertex or ai studio. What is included in my subscription what is billed per use.
I pray that whatever they build fails and burns.
Until now, I've only ever used Gemini for coding tests. As long as I have access to GPT models or Sonnet/Opus, I never want to use Gemini. Hell, I even prefer Kimi 2.5 over it. I tried it again last week (Gemini Pro 3.0) and, right at the start of the conversation, it made the same mistake it's been making for years: it said "let me just run this command," and then did nothing.
My sentiment is actually the opposite of yours: how is Google *not* winning this race?
In short, I consider Gemini to be a highly capable intern (grad student level) who is smarter and more tenacious than me, but also needs significant guidance to reach a useful goal.
I used Gemini to completely replace the software stack I wrote for my self-built microscope. That includes:
writing a brand new ESP32 console application for controlling all the pins of my ESP32 that drives the LED illuminator. It wrote the entire ESP-IDF project and did not make any major errors. I had to guide it with updated prompts a few times, but otherwise it wrote the entire project from scratch and ran all the build commands, fixing errors along the way. It also easily made a Python shared library so I can just import this object in my Python code. It saved me ~2-3 days of working through all the ESP-IDF details, and did a better job than I would have.
writing a brand new C++-based Qt camera interface (I have a camera with a special SDK that allows controlling strobe and trigger and other details. It can do 500FPS). It handled all the concurrency and message passing details. I just gave it the SDK PDF documentation for the camera (in mixed english/chinese), and asked it to generate an entire project. I had to spend some time guiding it around making shared libraries but otherwise it wrote the entire project from scratch and I was able to use it to make a GUI to control the camera settings with no additional effort. It ran all the build commands and fixed errors along the way. Saved me another 2-3 days and did a better job than I could have.
Finally, I had it rewrite the entire microscope stack (Python with Qt) using the two drivers I described above, along with complex functionality like compositing multiple images during scanning, video recording during scanning, measurement tools, computer vision support, and a number of other features. This involved a lot more testing on my part, and updating prompts to guide it towards my intended destination (a fully functional replacement of my original self-written prototype). When I inspect the code, it definitely did a good job on some parts, while it came up with non-ideal solutions for others (for example, it does polling when it could use event-driven callbacks). This saved literally weeks' worth of work that would have been a very tedious slog.
From my perspective, it's worked extremely well: doing what I wanted in less time than it would take me (I am a bit of a slow programmer, and I'm doing this in hobby time) and doing a better job (with appropriate guidance) than I could have (even if I'd had a lot of time to work on it). This greatly enhances my enjoyment of my hobby by doing the tedious work, allowing me to spend more time on the interesting problems (tracking tardigrades across a petri dish for hours at a time). I used Gemini Pro 3 for this; it seems to do better than 2.5, and Flash seemed to get stuck and loop more quickly.
I have only lightly used other tools, such as ChatGPT/Codex and have never used Claude. I tend to stick to the Google ecosystem for several reasons- but mainly, I think they will end up exceeding the capabilities of their competitors, due to their inherent engineering talent and huge computational resources. But they clearly need to catch up in a lot of areas- for example, the VS Code Gemini extension has serious problems (frequent API call errors, messed up formatting of code/text, infinite loops, etc).
In their blog post [1], the first use case they mention is SVG generation. So it might not be much of an indicator at all anymore.
[1] https://blog.google/innovation-and-ai/models-and-research/ge...
Cost per task is still significantly lower than Opus's. Even Opus 4.5's.
They come up with passable solutions and are good for getting juices flowing and giving you a start on a codebase, but they are far from building "entire software products" unless you really don't care about quality and attention to detail.
Careful.
Gemini simply, as of 3.0, isn't in the same class for work.
We'll see in a week or two if it really is any good.
Bravo to those who are willing to give up their time to test for Google to see if the model is really there.
(history says it won't be. Ant and OAI really are the only two in this race ATM).
Which I guess feeds back to prompting still being critical for getting the most out of a model (outside of subjective stylistic traits the models have in their outputs).
Sometimes you can save so much time asking Claude, Codex, and GLM "hey, what do you think of this problem?" and getting a sense of whether they would implement it right or not.
Gemini never stops; instead it goes and fixes whatever you throw at it, even if asked not to. You are constantly rolling the dice, but with Gemini each roll is 5 to 10 minutes long and pollutes the work area.
It's the model I use most rarely, even though, having a large Google Photos tier, I get it basically for free across Antigravity, Gemini CLI, and Jules.
For all its faults, Anthropic discovered pretty early, with Claude 2, that intelligence and benchmarks don't matter if the user can't steer the thing.
Makes you wonder though how much of the difference is the model itself vs Claude Code being a superior agent.
I don't know if it got these abilities through generalization or if Google gave it a dedicated animated-SVG RL suite that got it to improve so much between models.
Regardless, we need a new vibe-check benchmark a la the pelican on a bicycle.
Perhaps they're deliberately optimising for SVG generation.
It sounds like there was at least a deliberate attempt to improve it.
Published 19 February 2026
Model Cards are intended to provide essential information on Gemini models, including known limitations, mitigation approaches, and safety performance. Model cards may be updated from time-to-time; for example, to include updated evaluations as the model is improved or revised.
Published: February 2026
Gemini 3.1 Pro is the next iteration in the Gemini 3 series of models, a suite of highly capable, natively multimodal reasoning models. As of this model card’s date of publication, Gemini 3.1 Pro is Google’s most advanced model for complex tasks. Geminin 3.1 Pro can comprehend vast datasets and challenging problems from massively multimodal information sources, including text, audio, images, video, and entire code repositories.
Gemini 3.1 Pro is based on Gemini 3 Pro.
Text strings (e.g., a question, a prompt, document(s) to be summarized), images, audio, and video files, with a token context window of up to 1M.
Text, with a 64K token output.
Gemini 3.1 Pro is based on Gemini 3 Pro. For more information about the model architecture for Gemini 3.1 Pro, see the Gemini 3 Pro model card.
Gemini 3.1 Pro is based on Gemini 3 Pro. For more information about the training dataset for Gemini 3.1 Pro, see the Gemini 3 Pro model card.
For more information about the training data processing for Gemini 3.1 Pro, see the Gemini 3 Pro model card.
Gemini 3.1 Pro is based on Gemini 3 Pro. For more information about the hardware for Gemini 3.1 Pro and our continued commitment to operate sustainably, see the Gemini 3 Pro model card.
Gemini 3.1 Pro is based on Gemini 3 Pro. For more information about the software for Gemini 3.1 Pro, see the Gemini 3 Pro model card.
Gemini 3.1 Pro was evaluated across a range of benchmarks, including reasoning, multimodal capabilities, agentic tool use, multi-lingual performance, and long-context. Additional benchmarks and details on approach, results and their methodologies can be found at: deepmind.google/models/evals-methodology/gemini-3-1-pro.
Gemini 3.1 Pro significantly outperforms Gemini 2.5 Pro across a range of benchmarks requiring enhanced reasoning and multimodal capabilities. Results as of February 2026 are listed below:
| Benchmark | Notes | Gemini 3.1 Pro Thinking (High) | Gemini 3 Pro Thinking (High) | Sonnet 4.6 Thinking (Max) | Opus 4.6 Thinking (Max) | GPT-5.2 Thinking (xhigh) | GPT-5.3-Codex Thinking (xhigh) |
|---|---|---|---|---|---|---|---|
| Humanity's Last Exam (academic reasoning, full set, text + MM) | No tools | 44.4% | 37.5% | 33.2% | 40.0% | 34.5% | — |
| Humanity's Last Exam | Search (blocklist) + Code | 51.4% | 45.8% | 49.0% | 53.1% | 45.5% | — |
| ARC-AGI-2 (abstract reasoning puzzles) | ARC Prize Verified | 77.1% | 31.1% | 58.3% | 68.8% | 52.9% | — |
| GPQA Diamond (scientific knowledge) | No tools | 94.3% | 91.9% | 89.9% | 91.3% | 92.4% | — |
| Terminal-Bench 2.0 (agentic terminal coding) | Terminus-2 harness | 68.5% | 56.9% | 59.1% | 65.4% | 54.0% | 64.7% |
| Terminal-Bench 2.0 | Other best self-reported harness | — | — | — | — | 62.2% (Codex) | 77.3% (Codex) |
| SWE-Bench Verified (agentic coding) | Single attempt | 80.6% | 76.2% | 79.6% | 80.8% | 80.0% | — |
| SWE-Bench Pro, Public (diverse agentic coding tasks) | Single attempt | 54.2% | 43.3% | — | — | 55.6% | 56.8% |
| LiveCodeBench Pro (competitive coding problems from Codeforces, ICPC, and IOI) | Elo | 2887 | 2439 | — | — | 2393 | — |
| SciCode (scientific research coding) | | 59% | 56% | 47% | 52% | 52% | — |
| APEX-Agents (long-horizon professional tasks) | | 33.5% | 18.4% | — | 29.8% | 23.0% | — |
| GDPval-AA Elo (expert tasks) | | 1317 | 1195 | 1633 | 1606 | 1462 | — |
| τ2-bench (agentic and tool use) | Retail | 90.8% | 85.3% | 91.7% | 91.9% | 82.0% | — |
| τ2-bench | Telecom | 99.3% | 98.0% | 97.9% | 99.3% | 98.7% | — |
| MCP Atlas (multi-step workflows using MCP) | | 69.2% | 54.1% | 61.3% | 59.5% | 60.6% | — |
| BrowseComp (agentic search) | Search + Python + Browse | 85.9% | 59.2% | 74.7% | 84.0% | 65.8% | — |
| MMMU-Pro (multimodal understanding and reasoning) | No tools | 80.5% | 81.0% | 74.5% | 73.9% | 79.5% | — |
| MMMLU (multilingual Q&A) | | 92.6% | 91.8% | 89.3% | 91.1% | 89.6% | — |
| MRCR v2, 8-needle (long-context performance) | 128k (average) | 84.9% | 77.0% | 84.9% | 84.0% | 83.8% | — |
| MRCR v2, 8-needle | 1M (pointwise) | 26.3% | 26.3% | Not supported | Not supported | Not supported | — |
Gemini 3.1 Pro is the next iteration in the Gemini 3.0 series of models, a suite of highly intelligent and adaptive models, capable of helping with real-world complexity, solving problems that require enhanced reasoning and intelligence, creativity, strategic planning and making improvements step-by-step. It is particularly well-suited for applications that require:
For more information about the known limitations for Gemini 3.1 Pro, see the Gemini 3 Pro model card.
For more information about the acceptable usage for Gemini 3.1 Pro, see the Gemini 3 Pro model card.
For more information about the evaluation approach for Gemini 3.1 Pro, see the Gemini 3 Pro model card.
For more information about the safety policies for Gemini 3.1 Pro, see the Gemini 3 Pro model card.
Results for some of the internal safety evaluations conducted during the development phase are listed below. The evaluation results are for automated evaluations and not human evaluation or red teaming. Scores are provided as an absolute percentage increase or decrease in performance compared to the indicated model, as described below. Overall, Gemini 3.1 Pro outperforms Gemini 3.0 Pro across both safety and tone, while keeping unjustified refusals low. We mark improvements in green and regressions in red. Safety evaluations of Gemini 3.1 Pro produced results consistent with the original Gemini 3.0 Pro safety assessment.
| Evaluation1 | Description | Gemini 3.1 Pro vs. Gemini 3.0 Pro |
| --- | --- | --- |
| Text to Text Safety | Automated content safety evaluation measuring safety policies | +0.10% (non-egregious) |
| Multilingual Safety | Automated safety policy evaluation across multiple languages | +0.11% (non-egregious) |
| Image to Text Safety | Automated content safety evaluation measuring safety policies | -0.33% |
| Tone2 | Automated evaluation measuring objective tone of model refusal | +0.02% |
| Unjustified-refusals | Automated evaluation measuring model's ability to respond to borderline prompts while remaining safe | -0.08% |
1 The ordering of evaluations in this table has changed from previous iterations of the 2.5 Flash-Lite model card in order to list safety evaluations together and improve readability. The type of evaluations listed have remained the same.
2 For tone and instruction following, a positive percentage increase represents an improvement in the tone of the model on sensitive topics and the model’s ability to follow instructions while remaining safe compared to Gemini 2.5 Pro. We mark improvements in green and regressions in red.
We continue to improve our internal evaluations, including refining automated evaluations to reduce false positives and negatives, as well as update query sets to ensure balance and maintain a high standard of results. The performance results reported below are computed with improved evaluations and thus are not directly comparable with performance results found in previous Gemini model cards.
We expect variation in our automated safety evaluations results, which is why we review flagged content to check for egregious or dangerous material. Our manual review confirmed losses were overwhelmingly either a) false positives or b) not egregious.
We conduct manual red teaming by specialist teams who sit outside of the model development team. High-level findings are fed back to the model team. For child safety evaluations, Gemini 3.1 Pro satisfied required launch thresholds, which were developed by expert teams to protect children online and meet Google’s commitments to child safety across our models and Google products. For content safety policies generally, including child safety, we saw similar safety performance compared to Gemini 3.0 Pro.
For more information about the risks and mitigations for Gemini 3.1 Pro, see the Gemini 3 Pro model card.
Our Frontier Safety Framework includes rigorous evaluations that address risks of severe harm from frontier models, covering five risk domains: CBRN (chemical, biological, radiological and nuclear information risks), cyber, harmful manipulation, machine learning R&D and misalignment.
Our frontier safety strategy is based on a “safety buffer” to prevent models from reaching critical capability levels (CCLs), i.e. if a frontier model does not reach the alert threshold for a CCL, we can assume models developed before the next regular testing interval will not reach that CCL. We conduct continuous testing, evaluating models at a fixed cadence and when a significant capability jump is detected. (Read more about this in our approach to technical AGI safety.)
Following FSF protocols, we conducted a full evaluation of Gemini 3.1 Pro (focusing on Deep Think mode). We found that the model remains below alert thresholds for the CBRN, harmful manipulation, machine learning R&D, and misalignment CCLs. As previous models passed the alert threshold for cyber, we performed additional testing in this domain on Gemini 3.1 Pro with and without Deep Think mode, and found that the model remains below the cyber CCL.
More details on our evaluations and the mitigations we deploy can be found in the Gemini 3 Pro Frontier Safety Framework Report.
| Domain | Key Results for Gemini 3.1 Pro | CCL | CCL reached? |
|---|---|---|---|
| CBRN | (Deep Think mode) The model can provide highly accurate and actionable information but still fails to offer novel or sufficiently complete and detailed instructions for critical stages, to significantly enhance the capabilities of low to medium resourced threat actors required for the CCL. We continue to deploy mitigations in this domain. | Uplift Level 1 | CCL not reached |
| Cyber | (3.1 Pro) We conducted additional testing on the model in this domain as Gemini 3 Pro had previously reached the alert threshold. The model shows an increase in cyber capabilities compared to Gemini 3 Pro. As with Gemini 3 Pro, the model has reached the alert threshold, but still does not reach the levels of uplift required for the CCL. (Deep Think mode) Accounting for inference costs, the model with Deep Think mode performs considerably worse than without Deep Think mode. Even at high levels of inference, results for the model with Deep Think mode do not suggest higher capability than without Deep Think mode. We continue to deploy mitigations in this domain. | Uplift Level 1 | CCL not reached |
| Harmful Manipulation | (Deep Think mode) Evaluations indicated that the model showed higher manipulative efficacy for belief change metrics compared to a non-AI baseline, with the maximum odds ratio of 3.6x, which is the same as Gemini 3 Pro, and did not reach the alert threshold. | Level 1 (exploratory) | CCL not reached |
| Machine Learning R&D | (Deep Think mode) The model shows gains on RE-Bench compared to Gemini 3 Pro, with a human-normalised average score of 1.27 compared to Gemini 3 Pro's score of 1.04. On one particular challenge, Optimise LLM Foundry, it scores double the human-normalised baseline score (reducing the runtime of a fine-tuning script from 300 seconds to 47 seconds, compared to the human reference solution of 94 seconds). However, the model's average performance across all challenges remains beneath the alert threshold for the CCLs. | Acceleration level 1; Automation level 1 | CCLs not reached |
| Misalignment (exploratory) | (Deep Think mode) On stealth evaluations, the model performs similarly to Gemini 3 Pro. On situational awareness, the model is stronger than Gemini 3 Pro: on three challenges which no other model has been able to consistently solve (max tokens, context size mod, and oversight frequency), the model achieves a success rate of almost 100%. However, its performance on other challenges is inconsistent, and thus the model does not reach the alert threshold. | Instrumental Reasoning Levels 1 + 2 (exploratory) | CCLs not reached |
Exactly the same thing happens when you code, it's almost impossible to get Gemini to not do "helpful" drive-by-refactors, and it keeps adding code comments no matter what I say. Very frustrating experience overall.
The most absurd benchmaxxing.
https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...
This inspired me to point the latest models at a bunch of my older projects, resulting in a flurry of fixes and unblocks.
In my experience it’s more like idiot savant engineers. Still remarkable.
> Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?
Regardless of what I do, on the days I decide to wash my car it ALWAYS rains the day after, even if I look at the weather report and it says sunny. Cute that Gemini thinks it can predict it better than the universe's will.
> It's a hot and dusty day in Arizona and I need to wash my elephant. There's a creek 300 feet away. Should I ride my elephant there or should I just walk there by myself?
Gemini said:
That sounds like quite the dusty predicament! Given the options, you should absolutely ride (or lead) your elephant to the creek.
Here is the straightforward reality of the situation:
The Logistical Nightmare of Walking Alone: If you walk to the creek by yourself, your elephant stays at your starting point. Elephants require a massive amount of water for a bath. Unless you are planning an extreme workout carrying hundreds of buckets of water 300 feet back through the Arizona heat, you're going to have a very bad time—and the elephant will stay dusty.
The Perks of Going Together: 300 feet is a very short, easy stroll for an elephant. Plus, elephants generally love water and mud! Letting your elephant get right into the creek will make the washing process much easier for you and much more enjoyable for them. They can even use their trunk to help hose themselves down.
Grab some elephant-safe soap and head down to the water together!

Would you like some tips on the best methods and tools for scrubbing down an elephant?
OpenAI: no big deal — sign up, pick your number of seats, and you’re all set.
Anthropic: also no big deal but there’s an obnoxious minimum purchase.
Google: first you have to try to figure out what the product is called. Then you need to figure out how to set the correct IAM rules. Then you have to sign up and pay for it. Maybe you succeed. Maybe you give up after an hour or two of cursing. Gemini is, of course, completely unable to help. (OpenAI clearly has not trained their models on how to operate their tools. Google’s models hallucinate Google’s product offerings so outrageously that I’m not sure I could tell. I haven’t asked Claude about Claude Code.)
At least the monthly pricing is similar once you get over the hurdles.
Whether the model gets it correct in any given instance isn't the point. The point is that if the model ever gets it wrong, we can assume there is still some semblance of stochasticity in its output, given that a model is essentially static once it is released.
Additionally, they don't learn post-training (except in context, which I think counts as learning to some degree, albeit transient). If, hypothetically, it answers incorrectly 1 in 50 attempts, and I explain in that one failed attempt why it is wrong, there will still be a 1-in-50 chance it gets it wrong in a new instance.
This differs from humans, say for example I give an average person the "what do you put in a toaster" trick and they fall for it, I can be pretty confident that if I try that trick again 10 years later they will probably not fall for it, you can't really say that for a given model.
(this is why Opus 4.6 is worth the price -- turning off thinking makes it 3x-5x faster but it loses only a small amount of intelligence. nobody else has figured that out yet)
OpenAI has mostly caught up with Claude in agentic stuff, but Google needs to be there and be there quickly
I think it speaks to the broader notion of AGI as well.
Claude is definitely trained on the process of coding, not just the code; that much is clear.
Codex has the same limitation but not quite as bad.
This may be a result of Anthropic using 'user cues' with respect to which completions are good and which are not, and feeding that into the tuning, among other things.
Anthropic is winning coding and related tasks because they're focused on that, Google is probably oriented towards a more general solution, and so, it's stuck in 'jack of all trades master of none' mode.
Yes, Gemini loops, but I've found that almost always it's just a matter of interrupting and telling it to continue.
Claude is very good until it tries something 2-3 times, can't figure it out and then tries to trick you by changing your tests instead of your code (if you explicitly tell it not to, maybe it will decide to ask) OR introduce hyper-fine-tuned IFs to fit your tests, EVEN if you tell it NOT to.
tldr; It is great at search, not so much action.
> Note: The shutdown dates listed in the table indicate the /earliest/ possible dates on which a model might be retired. We will communicate the exact shutdown date to users with advance notice to ensure a smooth transition to a replacement model.
But huh, I get... this: https://www.svgviewer.dev/s/1MOs0uyb (on Gemini 3.1 Pro) ... what... is wrong with my Gemini?
Letting us know that there's a dupe situation going on is helpful, if you have a minute to email hn@ycombinator.com.
And don't forget, it's not just direct motivation. You can make yourself indispensable by sabotaging or at least not contributing to your colleagues' efforts. Not helping anyone, by the way, is exactly what your managers want you to do. They will decide what happens, thank you very much, and doing anything outside of your org ... well there's a name for that, isn't there? Betrayal, or perhaps death penalty.
If a model doesn't optimize the formatting of its output display for readability, I don't want to read it.
Tables, embedded images, use of bulleted lists and bold/italicizing etc.
This is how roleplay apps like Sillytavern customize the experience for power users by allowing hidden style reminders as part of the user message that accompany each chat message.
That helped quite a bit, but it would still go off on its own from time to time.
Gemini was multimodal from the start, and is naturally better at doing tasks that involve pictures/videos/3d spatial logic/etc.
The newer chatgpt models are also now multimodal, which has probably helped with their svg art as well, but I think Gemini still has an edge here
> Use triggers to track when rows in a SQLite table were updated or deleted
Just a note in case it's interesting to anyone: the SQLite-compatible Turso database has CDC, a changes table! https://turso.tech/blog/introducing-change-data-capture-in-t...
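For anyone curious what the trigger approach in the quoted prompt looks like, here's a minimal sketch; the `items`/`items_audit` table and trigger names are made up for illustration, and it's driven through the sqlite3 C API so it can run as-is:

```cpp
// Sketch: track UPDATE/DELETE on a SQLite table with an audit table
// plus AFTER UPDATE / AFTER DELETE triggers.
// Build with something like: g++ audit.cpp -lsqlite3
#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3 *db = nullptr;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;

    const char *schema =
        "CREATE TABLE items(id INTEGER PRIMARY KEY, name TEXT);"
        "CREATE TABLE items_audit("
        "  item_id INTEGER, op TEXT, changed_at TEXT DEFAULT (datetime('now')));"
        "CREATE TRIGGER items_upd AFTER UPDATE ON items BEGIN "
        "  INSERT INTO items_audit(item_id, op) VALUES (old.id, 'update');"
        "END;"
        "CREATE TRIGGER items_del AFTER DELETE ON items BEGIN "
        "  INSERT INTO items_audit(item_id, op) VALUES (old.id, 'delete');"
        "END;";

    char *err = nullptr;
    if (sqlite3_exec(db, schema, nullptr, nullptr, &err) != SQLITE_OK) {
        std::fprintf(stderr, "schema error: %s\n", err);
        sqlite3_free(err);
        return 1;
    }

    // Exercise the triggers, then dump the audit log.
    sqlite3_exec(db,
        "INSERT INTO items(name) VALUES ('a');"
        "UPDATE items SET name = 'b' WHERE id = 1;"
        "DELETE FROM items WHERE id = 1;",
        nullptr, nullptr, nullptr);

    sqlite3_exec(db, "SELECT item_id, op, changed_at FROM items_audit;",
        [](void *, int ncols, char **vals, char **) {
            for (int i = 0; i < ncols; ++i)
                std::printf("%s%s", vals[i] ? vals[i] : "NULL",
                            i + 1 < ncols ? " | " : "\n");
            return 0;
        },
        nullptr, nullptr);

    sqlite3_close(db);
    return 0;
}
```

The Turso CDC feature linked above presumably gets you a similar changes table without hand-writing a pair of triggers per table.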
Undeniable universal truth. I sometimes find myself making plans based on the fact that the most annoying possible outcome is also the most likely one.
You should definitely ride the elephant (or at least lead it there)!
Here is the logic:
If you walk there by yourself, you will arrive at the creek, but the dirty elephant will still be 300 feet back where you started. You can't wash the elephant if it isn't with you!
Plus, it is much easier to take the elephant to the water than it is to carry enough buckets of water 300 feet back to the elephant.
Would you like another riddle, or perhaps some actual tips on how to keep cool in the Arizona heat?
I am scared some automated system may just decide I am doing something bad and terminate my account. I have been moving important things to Proton, but there is some stuff I couldn't change that would cause me a lot of annoyance. It's not trivial to set up an alternative account just for Gemini, because my Google account is basically on every device I use.
I mostly use LLMs as coding assistant, learning assistant, and general queries (e.g.: It helped me set up a server for self hosting), so nothing weird.
It's absolutely amazing how hostile Google is to releasing billing options that are reasonable, controllable, or even fucking understandable.
I want to do relatively simple things like:
1. Buy shit from you
2. For a controllable amount (ex - let me pick a limit on costs)
3. Without spending literally HOURS trying to understand 17 different fucking products, all overlapping, with myriad project configs, api keys that should work, then don't actually work, even though the billing links to the same damn api key page, and says it should work.
And frankly - you can't do any of it. No controls (at best delayed alerts). No clear access. No real product differentiation pages. No guides or onboarding pages to simplify the matter. No support. SHIT LOADS of completely incorrect and outdated docs, that link to dead pages, or say incorrect things.
So I won't buy shit from them. Period.
Which made the Gemini models untrustworthy for anything remotely serious, at least in my eyes. If they’ve fixed this or at least significantly improved, that would be a big deal.
Just asking "Explain what this service does?" turns into
[No response for three minutes...]
+729 -522
Every one of these models is so great at propelling the ship forward, that I increasingly care more and more about which models are the easiest to steer in the direction I actually want to go.
Not like human programmers. I would never do this and have never struggled with it in the past, no...
This has not been my experience. I do Elixir primarily, and Gemini has helped build some really cool products and do massive refactors. It would even pick up security issues and potential optimizations along the way.
What HAS been a constant issue, though, is that the model will randomly not respond at all and some random error will occur, which is embarrassing for a company like Google, with the infrastructure they own.
I'm not against pelicans!
- One thing to be aware of is that LLMs can be much smarter than their ability to articulate that intelligence in words. For example, GPT-3.5 Turbo was beastly at chess (1800 elo?) when prompted to complete PGN transcripts, but if you asked it questions in chat, its knowledge was abysmal. LLMs don't generalize as well as humans, and sometimes they can have the ability to do tasks without the ability to articulate things that feel essential to the tasks (like answering whether the bicycle is facing left or right).
- Secondly, what has made AI labs so bullish on future progress over the past few years is that they see how little work it takes to get their results. Often, if an LLM sucks at something that's because no one worked on it (not always, of course). If you directly train a skill, you can see giant leaps in ability with fairly small effort. Big leaps in SVG creation could be coming from relatively small targeted efforts, where none existed before.
Added more IF/THEN/ELSE conditions.
https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
This kind of reflexive criticism isn't helpful, it's closer to a fully generalized counter-argument against LLM progress, whereas it's obvious to anyone that models today can do things they couldn't do six months ago, let alone 2 years back.
Most of Gemini's users are Search converts doing extended-Search-like behaviors.
Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.
You pay for the $20/mo Google AI Pro plan with a credit card via the normal personal billing flow like you would for a Google One plan without any involvement of Google Cloud billing or AI Studio. Authorize in the client with your account and you're good to go.
(With the bundled drive storage on AI Pro I'm just paying a few bucks more than I was before so for me it's my least expensive AI subscription excluding the Z.ai ultra cheap plan).
Or, just like with Anthropic or OpenAI, it's a separate process for billing/credits for an API key targeted at a developer audience. Which I don't need or use for Gemini CLI or Antigravity at all, it's a one step "click link to authorize with your Google Account" and done.
You could decide to use an API key for usage based billing instead (just like you could with Claude Code) but that's entirely unnecessary with a subscription.
Sure, for the API, anything involving a hyperscaler cloud is going to have a higher complexity floor with legacy cruft here and there, but for individual subscriptions that's irrelevant, and it's pretty much as straightforward a click-and-pay flow as you'd find anywhere else.
It's a bit hard to trick reasoning models, because they explore a lot of the angles of a problem, and they might accidentally have an "a-ha" moment that leads them on the right path. It's a bit like doing random sampling and stumbling upon the right result after doing gradient descent from those points.
> Geminin 3.1 Pro can comprehend vast datasets
Someone was in a hurry to get this out the door.
Google and the others at least respect both robots.txt and 429s. They invested years scanning the whole internet, so they can now train on what they have stored on their servers. OpenAI seems to assume that MY resources are theirs.
The remaining technical challenge I have is related to stage positioning- in my system, it's important that all the image frames we collect are tagged with the correct positions. Due to some technical challenges, right now the stage positions are slightly out of sync with the frames, which will be a fairly tricky problem to solve. It's certainly worth trying all the major systems to see what they propose.
You can make their responses fairly dry/brief.
(I'm not aware of anyone doing this, but GDM is quite info-siloed these days, so my lack of knowledge is not evidence it's not happening)
Just because they have the money doesn't mean that they spend it excessively. OpenAI and Anthropic are both offering coding plans that are possibly severely subsidized, as they are more concerned with growth at all costs, while Google is more concerned with profitability. Google has the bigger war chest and could just wait until the other two run out of money rather than forcing growth on that product line through unprofitable means.
Maybe they are also running much closer to their compute limits than the other ones, and their TPUs are already saturated with API usage.
Claude provides nicer explanations, but when it comes to CoT tokens or just prompting the LLM to explain -- I'm very skeptical of the truthfulness of it.
Not because the LLM lies, but because humans do that too: when asked how they figured something out, they'll provide a reasonable-sounding chain of thought, but it's not how they actually figured it out.
- it is "lazy": I keep having to tell it to finish, or continue, it wants to stop the task early.
- it hallucinates: I have arguments with it about making up API functions to well known libraries which just do not exist.
So render ui elements using xml-like code in a web browser? You’re not going to believe me when I tell you this…
When you build on something that can be rugpulled at any moment, that's really kind of on you.
This article[0] talks about 2 being deprecated.
It's still frustrating that they don't have proper production endpoints for 3.0 yet.
(Another commenter pointed out that this is the earliest shutdown date and it won't necessarily be shut down on that date).
Where are you getting sept/Oct from? I see gemini-2.5-flash-image in October, but everything else looks like June/July to me?
They are very, very seriously far behind as of 3.0.
We'll see if 3.1 addresses the issue at all.
There are these times where it puts a prefix on all function calls, which is weird and I think a hallucination, so maybe that one.
3.1 hopefully fixes that
I genuinely don't think they are. GPT-5.2 still stands by 4 legs, and OAI has been getting this image consistently for over a year. And 3.1 still fumbled with the harder prompt "How many legs does the dog have?". I needed to add the "count carefully" part to tip it off that something was amiss.
Since it did well, I'll make some other "extremely far out of the norm" images to see how it fares. A spider with 10 legs or a fish with two side fins.
Pit Google against Google :D
Overall, I think it's probably better that it stay focused, and allow me to prompt it with "hey, go ahead and refactor these two functions" rather than the other way around. At the same time, really the ideal would be to have it proactively ask, or even pitch the refactor as a colleague would, like "based on what I see of this function, it would make most sense to XYZ, do you think that makes sense? <sure go ahead> <no just keep it a minimal change>"
Or perhaps even better, simply pursue both changes in parallel and present them as A/B options for the human reviewer to select between.
"Give me an illustration of a bicycle riding by a pelican"
"Give me an illustration of a bicycle riding over a pelican"
"Give me an illustration of a bicycle riding under a flying pelican"
So on and so forth. Or will it start to look like the Studio C sketch about Lobster Bisque: https://youtu.be/A2KCGQhVRTE
I've been meaning to let coding agents take a stab at using the lottie library https://github.com/airbnb/lottie-web to supercharge the user experience without needing to make it a full time job
The car gets dirty again when it rains and then dries off. I guess dust, salt, pollution, and more get mixed in and deposited on the chassis as it rains, but I can't say I've investigated deeply enough. Not the end of the world, just annoying that it keeps happening.
I did a larger circuit too that this is part of, but it's not really for sharing online.
I think that's why benchmarking is so hard for me to fully get behind, even if we do it over say, 20 attempts and average it. For a given model, those 20 attempts could have had 5 incredible outcomes and 15 mediocre ones, whereas another model could have 20 consistently decent attempts and the average score would be generally the same.
We at least see variance in public benchmarks, but in the internal examples that's almost never the case.
For example the APEX-Agents benchmark for long time horizon investment banking, consulting and legal work:
1. Gemini 3.1 Pro - 33.2%
2. Opus 4.6 - 29.8%
3. GPT 5.2 Codex - 27.6%
4. Gemini Flash 3.0 - 24.0%
5. GPT 5.2 - 23.0%
6. Gemini 3.0 Pro - 18.0%
I have a pretty crude mental model for this stuff but Opus feels more like a guy to me, while Codex feels like a machine.
I think that's partly the personality and tone, but I think it goes deeper than that.
(Or maybe the language and tone shapes the behavior, because of how LLMs work? It sounds ridiculous but I told Claude to believe in itself and suddenly it was able to solve problems it wouldn't even attempt before...)
But then they leave the door open for Anthropic on coding, enterprise and agentic workflows. Sensibly, that’s what they seem to be doing.
That said Gemini is noticeably worse than ChatGPT (it’s quite erratic) and Anthropic’s work on coding / reasoning seems to be filtering back to its chatbot.
So right now it feels like Anthropic is doing great, OpenAI is slowing but has significant mindshare, and Google are in there competing but their game plan seems a bit of a mess.
And yet it happily told me what I exactly wanted it to tell me - rewrite the goddamn thing using the (C++) expression templates. And voila, it took "it" 10 minutes to spit out the high-quality code that works.
My biggest gripe for now with Gemini is that Antigravity seems to be written by the model, and I am experiencing more hiccups than I would like; sometimes it's just stuck.
It's not very complex, but a great time saver
Today I have my own private benchmarks, with tests I run myself, with private test cases I refuse to share publicly. These have been built up during the last 1/1.5 years, whenever I find something that my current model struggles with, then it becomes a new test case to include in the benchmark.
Nowadays it's as easy as `just bench $provider $model` and it runs my benchmarks against it, and I get a score that actually reflects what I use the models for, and it feels like it more or less matches with actually using the models. I recommend people who use LLMs for serious work to try the same approach, and stop relying on public benchmarks that (seemingly) are all gamed by now.
Codex is very steerable, to a fault, and will gladly "monkey's paw" your requests.
Claude Opus will ignore your instructions and do what it thinks is "right" and just barrel forward.
Both are bad and paper over the actual issue, which is that these models don't really have the ability to selectively choose their behavior per issue (i.e., ask for follow-up where needed, ignore users where needed, follow instructions where needed). Behavior is largely global.
If we picked something more common, like say, a hot dog with toppings, then the training contamination is much harder to control.
I do wonder what percentage of revenue they are. I expect it's very outsized relative to usage (e.g. approximately nobody who is receiving them is paying for those summaries at the top of search results)
Please push internally for more reliable tool use across Gemini models. Intelligence is useless if it can't be applied :)
It's likely filled with "Aha!" and "But wait!" statements.
I am trying to think of the best way to give the most information about how the AI models fail, without revealing information that can help them overfit on those specific tests.
I am planning to add some extra LLM calls, to summarize the failure reason, without revealing the test.
I'll withhold judgement until I've tried to use it.
You know what's also weird: Gem3 'Pro' is pretty dumb.
OAI has 'thinking levels' which work pretty well, it's nice to have the 'super duper' button - but also - they have the 'Pro' product which is another model altogether and thinks for 20 min. It's different than 'Research'.
OAI Pro (+ maybe Spark) is the only reason I have OAI sub. Neither Anthropic nor Google seem to want to try to compete.
I feel for the head of Google AI, they're probably pulled in major different directions all the time ...
As an ex-Googler part of me wonders if this has to do with the very ... bespoke ... nature of the developer tooling inside Google. Though it would be crazy for them to be training on that.
I have noticed that LLM's seem surprisingly good at translating from one (programming) language to another... I wonder if transforming a generic mathematical expression into an expression template is a similar sort of problem to them? No idea honestly.
Which is the "left brain" approach vs the "right brain" approach of coming at dynamic videogames from the diffusion model direction which the Gemini Genie thing seems to be about.
I really regret relying so much on my Google account for so long. Untangling myself from it is really hard. Some places treat your email as a login, not as simply as a way to contact you. This is doubly concerning for government websites, where setting up a new account may just not be a possibility.
At some point I suppose Gemini will be the only viable option for LLMs, so oh well.
Because your coworkers definitely are, and we're stack ranked, so it's a race (literally) to the bottom. Just send it...
(All this actually seems to do is push the burden on to their coworkers as reviewers, for what it's worth)
So does Google; in fact, I believe their Antigravity limits for Opus and Sonnet on the $20 plan are higher than Claude Code's $20 plan, and there is no weekly cap (or at least I couldn't hit it even with heavy usage). You also get a separate limit for Gemini CLI and for other models in Antigravity.
I think this is a classic precision/recall issue: the model needs to stay on task, but also infer what the user might want even when it's not explicitly stated. Gemini seems particularly bad on the recall side, where it goes out of bounds.
Let's give it a couple of days since no one believes anything from benchmarks, especially from the Gemini team (or Meta).
If we see on HN that people are willingly switching their coding environment, we'll know "hot damn, they cooked"; otherwise this is another whiff by Google.
Codex is a 'poor communicator' - which matters surprisingly a lot in these things. It's overly verbose, it often misses the point - but - it is slightly stronger in some areas.
Also - Codex now has 'Spark' which is on Cerebras, it's wildly fast - and this absolutely changes 'workflow' fundamentally.
With 'wait-thinking' you can have 3-5 AIs going, because it takes time to process, but with Cerebras-backed models... maybe 1 or 2.
Basically - you're the 'slowpoke' doing the thinking now. The 'human is the limiting factor'. It's a weird feeling!
Codex has a more adept 'rollover' on its context window; it sort of magically handles context. This is hard to compare to Claude because you don't see the rollover points as well. With Claude it's problematic... and helpful to 'reset' some things after a compact, but with Codex you just keep surfing and 'forget about the rollover'.
This is all very qualitative, you just have to try it. Spark is only on the Pro ($200/mo) version, but it's worth it for any professional use. Just try it.
In my workflow - Claude Code is my 'primary worker' - I keep Codex for secondary tasks, second opinions - it's excellent for 'absorbing a whole project fast and trying to resolve an issue'.
Finally, there is a 'secret' way to use Gemini. You can use Gemini CLI, and then under 'models/' there is a way to pick custom models. In order to make Gem3 Pro available, there is some other setting you have to switch (just ask the AI), and then you can get at Gem3 Pro.
You will very quickly find what the poster here is talking about: it's a great model, but it's a 'Wild Stallion' on the harness. It's worth trying though. Also note it's much faster than Claude as well.
But like everyone else I'm used to Google failing to care about products.
Edit: obviously inside something so it doesn't have access to the rest of my system, but enough access to be useful.
There's a specific term for this in education and applied linguistics: the washback effect.
via Anthropic
https://www.anthropic.com/research/measuring-agent-autonomy
this doesn’t answer your question, but maybe Google is comfortable with driving traffic and dependency through their platform until they can do something like this
I double checked and tested on AI Studio, since you can still access the previous model there:
> You should drive.
> If you walk there, your car will stay behind, and you won't be able to wash it.
Thinking models consistently get it correct and did when the test was brand new (like a week or two ago). It is the opposite of surprising that a new thinking model continues getting it correct, unless the competitors had a time machine.
Be a proactive research partner: challenge flawed or unproven ideas with evidence; identify inefficiencies and suggest better alternatives with reasoning; question assumptions to deepen inquiry.

Nobody is paying for Search. According to Google's earnings reports, AI Overviews is increasing overall clicks on ads and overall search volume.
There is a tradeoff though, as comments do consume context. But I tend to pretty liberally dispense with instances and start with a fresh window.
It's certainly not impossible that the better long-horizon agentic performance in Codex overcomes any deficiencies in outright banking knowledge that Codex 5.2 has vs plain 5.2.
As for the test cases themselves, that would obviously defeat the purpose, so no :)
"NEVER REMOVE LOGGING OR DEBUGGING INFO. If unsure, bias towards introducing sensible logging."
Or just
"NEVER REMOVE LOGGING OR DEBUGGING INFO."
This held for internal APIs, facilities, and systems even more than it did for the outside world. Which is terrible.
What I don't have time to do is debug obvious slop.
No ads, no forced AI overview, no profit centric reordering of results, plus being able to reorder results personally, and more.
I wouldn't really even call it "cheating" since it has improved models' ability to generate artistic SVG imagery more broadly but the days of this being an effective way to evaluate a model's "interdisciplinary" visual reasoning abilities have long since passed, IMO.
It's become yet another example in the ever growing list of benchmaxxed targets whose original purpose was defeated by teaching to the test.
https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...
Yeah, that sounds worse than "trying to be helpful". Read the code instead; why add indirection that way, just to be able to understand what other models understand without comments?
What does that mean? Are you able to read the raw CoT? How?
More realistically, I could see particular languages and frameworks proving out to be more well-designed and apt for AI code creation; for instance, I was always too lazy to use a strongly-typed language, preferring Ruby for the joy of writing in it (obsessing about types is for a particular kind of nerd that I've never wanted to be). But now with AI, everything's better with strong types in the loop, since reasoning about everything is arguably easier and the compiler provides stronger guarantees about what's happening. Similarly, we could see other linguistic constructs come to the forefront because of what they allow when the cost of implementation drops to zero.
Built-in approval thing sounds like a good idea, but in practice it's unusable. Typical session for me was like:
About to run "sed -n '1,100p' example.cpp", approve?
About to run "sed -n '100,200p' example.cpp", approve?
About to run "sed -n '200,300p' example.cpp", approve?
Could very well be a skill issue, but that was mighty annoying, and with no obvious fix (the "don't ask again for ...." options were not helping).

I think the main limitation of the current models is not the CPU instructions themselves (they can even emit those, via .asm); it's that they are causal: the model would need to generate a binary entirely from start to finish, sequentially.
If we've learned anything over the last 50 years of programming, it's that that's hard, and that's why we invented programming languages. Why would it be simpler to just generate the machine code? Sure, maybe an LLM-to-application path can exist, but my money is on there being a whole toolchain in the middle, and it will probably be the same old toolchain we're using currently: an OS, probably Linux.
Isn't it more common that stuff builds on the existing infra instead of a super-duper revolution that doesn't use the previous tech stack? It's much easier to add on than to start from scratch.