Hyundai now owns Boston Dynamics and is pushing to get the robots into their factories.
The gauge-reading example here is great, but in reality of course having the system synthesize that Python script, run the CV tasks, come back with the answer etc. is currently quite slow.
Once things get much faster, you could also start using image generation to have models extrapolate possible futures from photos they take, then describe those futures back to themselves and make decisions based on that; loops like this. I think the assumption is that our brains do something similar unconsciously, before the results integrate into our conscious conception of mind.
I'm really curious what things we could build if we had 100x or 1000x inference throughput.
A few robot legs and arms, big battery, off-the-shelf GPU. Solar panels.
Prompt: "Take care of all this land within its limits and grow some veggies."
So there might be awesome progress behind the scenes, just not ready for the general public.
Anyway, cool.
Or it could turn out to look like satoyama (Japanese peasant forests), or it could be more similar to the crop rotation traditionally practiced in many parts of Central Africa, where root crops were important.
In Russia, before the Soviets forced "modern scientific agriculture" on the peasantry to modernize it, farmers practiced things like contour farming (interplanting rows of crops along the contours of the land to slow water down) and maslins (intermixing multiple varieties of wheat and barley in the same patch). Contour farming is now an active area of research for its ability to prevent topsoil loss and build soil health, while maslins provide superior yield stability and need little to no pesticide.
That's not even getting into the estimated 40,000–120,000 varieties of rice we've documented, most of which are hyper-adapted to a very specific location, often even a single village.
My point is there is no one way to take care of a plot of land. It's all relative to a number of factors beyond just the abiotic characteristics of the land itself. Your goals and intentions matter and you will always find localized unique adaptations.
Of course this is for counting animal legs while giving coordinates and reading analog clocks, not coding or solving puzzles. I imagine this model's ratio of image performance to model weight is very high.
That's a bit exaggerated, no? Early Roombas would get tangled in socks, drag pet poop all over the floor, break glass items and so on, and yet the market accepted that and evolved, and now we have plenty of cleaning robots from various companies, including cheap spying ones from China.
I actually think that there's a lot of value in being the first to deploy bots into homes, even if they aren't perfect. The amount of data you'd collect is invaluable, and by the looks of it, can't be synth generated in a lab.
I think the "safer" option is still the "bring them to factories first, offices next and homes last", but anyway I'm sure someone will jump straight to home deployments.
VLA models essentially take a webcam screenshot + some text (think "put the red block in the right box") and output motor control instructions to achieve that.
Note: "Gemini Robotics-ER" is not a VLA, though Gemini does have a VLA model too: "Gemini Robotics".
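To make the "screenshot + text in, motor commands out" framing concrete, here's a toy sketch of a VLA policy's interface. Everything here is made up (the types, the 7-joint arm, the stub policy); it only shows the shape of the input/output contract, not any real model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    image: List[List[int]]   # stand-in for a webcam frame (H x W grayscale)
    instruction: str         # e.g. "put the red block in the right box"

@dataclass
class Action:
    joint_deltas: List[float]  # one delta per actuator, in radians

def vla_policy(obs: Observation) -> Action:
    """Stub standing in for the learned model: a real VLA maps
    (pixels, text) to continuous motor commands every control tick."""
    n_joints = 7  # hypothetical 7-DoF arm
    return Action(joint_deltas=[0.0] * n_joints)

act = vla_policy(Observation(image=[[0]], instruction="put the red block in the right box"))
```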
My non-AI dishwasher can't even always keep the water inside. Nothing is perfect.
Nothing was reported on the Google status page, and even the CLI isn't responding; it's just left there waiting for an answer that will never arrive, even after 10 minutes.
The safety guidelines are interesting, they treat them as a goal that they are aspiring to achieve, which seems realistic. It’s not quite ready for prime time yet.
LLMs are really good at the sort of tasks that have been missing from robotics: understanding, reasoning, planning, etc., so we'll likely see much more use of them in various robotics applications. I guess the main questions right now are:
- who sends in the various fine-motor commands. The answer most labs/researchers have is "a smaller diffusion model", so the LLM acts as a planner, then a smaller faster diffusion model controls the actual motors. I suspect in many cases you can get away with the equivalent of a tool call - the LLM simply calls out a particular subroutine, like "go forward 1m" or "tilt camera right"
- what do you do about memory? All the models are either purely reactive or take a very small slice of history and use that as part of the input, so they all need some type of memory/state management system to actually allow them to work on a task for more than a little while. It's not clear to me whether this will be standardized and become part of models themselves, or everyone will just do their own thing.
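Sketching the first point (the tool names, state dict, and log-as-memory here are all invented for illustration): the LLM emits named subroutine calls instead of raw motor commands, and the call log doubles as crude task memory that can be fed back into the next prompt.

```python
# Hypothetical sketch: the LLM plans, and each "fine-motor" step is just a
# named subroutine (tool call) rather than raw motor commands.
def go_forward(state, meters):
    state["x"] += meters
    return f"moved forward {meters}m"

def tilt_camera(state, direction):
    state["camera"] = direction
    return f"camera tilted {direction}"

TOOLS = {"go_forward": go_forward, "tilt_camera": tilt_camera}

def execute_plan(plan, state):
    """Run the (tool_name, arg) calls the planner emitted, keeping a log
    that can be appended to the next prompt as a crude memory/state trace."""
    log = []
    for name, arg in plan:
        log.append(TOOLS[name](state, arg))
    return log

state = {"x": 0.0, "camera": "center"}
log = execute_plan([("go_forward", 1.0), ("tilt_camera", "right")], state)
```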
I'm all for the task reasoning and the multi-view recognition based on relevant points. I'm very uncomfortable with the loose word "understanding".
The fault model I see is that e.g., this "visual understanding" will get things mostly right: enough to build and even deliver products. However, these are only probabilistic guarantees based on training sets, and those are unlikely to survive contact with a complex interactive world, particularly since robots are often repurposed as tasks change.
So it's a kind of moral-hazard product: it delivers initial results but defers the risk until later, so product developers have an incentive to build, ship, and leave users holding the bag. (Indeed: users are responsible for integration risks anyway.)
It hacks our assumptions: we think that you can take an MVP and productize it, but in this case, you'll never backfit the model to conform to the physics in a reliable way. I doubt there's any way to harness Gemini to depend on a physics model, so we'll end up with mostly-working sunk investments out in the market - slop robots so cheap that tight ones can't survive.
As for memory: my approach is to give the robot a python repl and, basically, a file system - the LLM can write modules, poke at the robot via interactive python, etc.
Basically, the LLM becomes a robot programmer, writing code in real-time.
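A minimal sketch of that setup (the `Robot` stub and the `modules` dict standing in for a real file system are made up): LLM-emitted code is executed against a persistent namespace, so functions the model writes in one turn survive into later turns.

```python
# Sketch of "give the LLM a REPL plus a file system". `robot` is a stub;
# a real setup would expose hardware bindings and an actual directory.
class Robot:
    def __init__(self):
        self.log = []
    def move(self, dx, dy):
        self.log.append(("move", dx, dy))

def run_llm_code(source, namespace):
    """Execute code the LLM emitted against a persistent namespace, so
    definitions from one turn remain available in later turns."""
    exec(source, namespace)

ns = {"robot": Robot(), "modules": {}}  # `modules` plays the role of a file system
run_llm_code(
    "def square_step(r):\n"
    "    r.move(1, 0)\n"
    "    r.move(0, 1)\n"
    "modules['square_step'] = square_step",
    ns,
)
run_llm_code("modules['square_step'](robot)", ns)
```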
It was about Google's PaLM-E evolution and progress. It basically has two models: one controls the robot, the other is an LLM, and they are combined in some attention layer.
My concern with a household robot is not the dishwasher but the TV screen, the glass door, the glass table, animals (fish/aquarium), etc., which the robot might walk into, crash through, or fall onto.
The planning-ahead-through-simulation idea, for example, seems to be a very good tool in neural-network-based architectures.
Depending on the robot's rate of breaking dishes, this could be a massive improvement on me, a human being, since I break a really important dish I needed to use ~2x per month on average.
And, I was disappointed to see that pointing was just giving x,y coords. I wanted to see robots pointing at stuff.
Not here to shame you for it, for the record.
That's me ;_;
April 14, 2026
Laura Graesser and Peng Xu
For robots to be truly helpful in our daily lives and industries, they must do more than follow instructions: they must reason about the physical world. From navigating a complex facility to interpreting the needle on a pressure gauge, a robot’s “embodied reasoning” is what allows it to bridge the gap between digital intelligence and physical action.
Today, we’re introducing Gemini Robotics-ER 1.6, a significant upgrade to our reasoning-first model that enables robots to understand their environments with unprecedented precision. By enhancing spatial reasoning and multi-view understanding, we are bringing a new level of autonomy to the next generation of physical agents.
This model specializes in reasoning capabilities critical for robotics, including visual and spatial understanding, task planning and success detection. It acts as the high-level reasoning model for a robot, capable of executing tasks by natively calling tools like Google Search to find information, vision-language-action models (VLAs) or any other third-party user-defined functions.
Gemini Robotics-ER 1.6 shows significant improvement over both Gemini Robotics-ER 1.5 and Gemini 3.0 Flash, specifically enhancing spatial and physical reasoning capabilities such as pointing, counting, and success detection. We are also unlocking a new capability: instrument reading, enabling robots to read complex gauges and sight glasses — a use case we discovered through close collaboration with our partner, Boston Dynamics.
Starting today, Gemini Robotics-ER 1.6 is available to developers via the Gemini API and Google AI Studio. To help you get started, we are sharing a developer Colab containing examples of how to configure the model and prompt it for embodied reasoning tasks.
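As an illustrative sketch of what such a call looks like at the request level (this is not from the Colab; the model ID string below is a placeholder, so check AI Studio for the released identifier), a `generateContent` request pairs one camera frame with an embodied-reasoning instruction:

```python
import base64
import json

def build_er_request(instruction, jpeg_bytes, model="gemini-robotics-er-1.6"):
    """Build the JSON body for a generateContent call that pairs a camera
    frame with an embodied-reasoning instruction. The default model ID is
    a placeholder, not necessarily the released identifier."""
    return {
        "contents": [{
            "role": "user",
            "parts": [
                {"text": instruction},
                {"inline_data": {
                    "mime_type": "image/jpeg",
                    "data": base64.b64encode(jpeg_bytes).decode("ascii"),
                }},
            ],
        }]
    }

body = build_er_request("Point to every hammer in the scene.", b"\xff\xd8fake-jpeg")
payload = json.dumps(body)  # POST this to the models/{model}:generateContent endpoint
```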
Figure 1: Benchmark results comparing Gemini Robotics-ER 1.6 with Gemini Robotics-ER 1.5 and Gemini 3.0 Flash models. The instrument reading evaluations were run with agentic vision enabled (except for Gemini Robotics-ER 1.5, which doesn’t support it). All other evals were run with agentic vision disabled. The single-view and multi-view success detection evaluations contain different examples, so they are not directly comparable.
Pointing is a fundamental capability for an embodied reasoning model, and it has evolved with each model generation. Points can be used to express many different concepts.
Gemini Robotics-ER 1.6 can use points as intermediate steps to reason about more complex tasks. For example, it can use points to count items in an image, or to identify salient points on an image to help the model perform mathematical operations to improve its metric estimations.
The example below shows Gemini Robotics-ER 1.6’s strengths in pointing to multiple elements, and knowing when and when not to point.
Gemini Robotics-ER 1.6 correctly identifies the number of hammers (2), scissors (1), paintbrushes (1), and pliers (6), as well as a collection of garden tools that can be interpreted as a single group or as multiple points. It does not point to requested items that are not present in the image — a wheelbarrow and a Ryobi drill. In comparison, Gemini Robotics-ER 1.5 fails to identify the correct number of hammers or paintbrushes, misses the scissors altogether, hallucinates a wheelbarrow, and lacks precision when pointing to the pliers. Gemini 3.0 Flash is close to Gemini Robotics-ER 1.6 but does not handle the pliers as well.
In robotics, knowing when a task is finished is just as important as knowing how to start it. Success detection is a cornerstone of autonomy, serving as a critical decision-making engine that allows an agent to intelligently choose between retrying a failed attempt or progressing to the next stage of a plan.
Achieving visual understanding in robotics is challenging, requiring sophisticated perception and reasoning capabilities combined with broad world knowledge in order to handle complicating factors such as occlusions, poor lighting and ambiguous instructions. Additionally, most modern robotics setups include multiple camera views such as an overhead and wrist-mounted feed. This means a system needs to understand how different viewpoints combine to form a coherent picture at each moment and across time.
Gemini Robotics-ER 1.6 advances multi-view reasoning, enabling the system to better understand multiple camera streams and the relationship between them, even in dynamic or occluded environments, as demonstrated in the typical multi-view scenario below.
Gemini Robotics-ER 1.6 takes cues from multiple camera views to determine when the task "put the blue pen into the black pen holder" is complete.
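To illustrate why success detection acts as a decision-making engine, here is a toy control loop (not from the release; `execute` and `is_done` are stand-ins for a VLA rollout and an ER success-detection query) where the success check is the branch point between retrying a step and advancing the plan:

```python
# Toy control loop: success detection decides between retry and progress.
def run_plan(steps, execute, is_done, max_retries=2):
    history = []
    for step in steps:
        for attempt in range(max_retries + 1):
            execute(step)            # stand-in for a VLA rollout
            if is_done(step):        # stand-in for an ER success query
                history.append((step, attempt))
                break
        else:
            raise RuntimeError(f"gave up on step: {step}")
    return history

attempts = {"grasp pen": 0}
def execute(step):
    attempts[step] += 1
def is_done(step):
    return attempts[step] >= 2  # pretend the first grasp attempt fails

history = run_plan(["grasp pen"], execute, is_done)
```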
To understand a key strength of Gemini Robotics-ER 1.6, we must look at how it combines capabilities like spatial reasoning and world knowledge to solve complex, real-world problems. A perfect example is instrument reading.
This task stems from facility inspection needs, a critical focus area for our partners at Boston Dynamics. Industrial facilities contain many instruments — thermometers, pressure gauges, chemical sight glasses and more — that require constant monitoring. Spot, a Boston Dynamics robot product, is able to visit the instruments throughout the facility and capture images of them.
Gemini Robotics-ER 1.6 enables robots to interpret a variety of instruments, including circular pressure gauges, vertical level indicators and modern digital readouts.
Instrument reading requires complex visual reasoning. One must precisely perceive a variety of inputs — including the needles, liquid level, container boundaries, tick marks and more — and understand how they all relate to each other. In the case of sight glasses, this involves estimating how much liquid fills the sight glass, taking into account distortion from the camera perspective. Gauges typically carry text describing the unit, which must be read and interpreted, and some have multiple needles referring to different decimal places that need to be combined.
Capabilities like instrument reading and more reliable task reasoning will enable Spot to see, understand, and react to real-world challenges completely autonomously.
Marco da Silva
Vice President and General Manager of Spot at Boston Dynamics
Gemini Robotics-ER 1.6 achieves its highly accurate instrument readings by using agentic vision, which combines visual reasoning with code execution. The model takes intermediate steps: first zooming into an image to get a better read of small details in a gauge, then using pointing and code execution to estimate proportions and intervals and get an accurate reading, and ultimately applying its world knowledge to interpret meaning.
Figure 2: How the different elements of Gemini Robotics-ER 1.6 contribute to reaching a high level of performance on the instrument reading task.
This example demonstrates how the model uses pointing and code execution for zooming to derive the reading of a gauge down to sub-tick accuracy.
Safety is integrated into every level of our embodied reasoning models. Gemini Robotics-ER 1.6 is our safest robotics model to date, demonstrating superior compliance with Gemini safety policies on adversarial spatial reasoning tasks compared to all previous generations.
The model also shows a substantially improved capacity to adhere to physical safety constraints. For example, it makes safer decisions about which objects can be safely manipulated under gripper or material constraints (e.g., “don't handle liquids”, “don't pick up objects heavier than 20kg“), expressed through spatial outputs like pointing.
We also tested how well the model identifies safety hazards in text and video scenarios based on real-life injury reports. On these tasks, our Gemini Robotics-ER models improve over baseline Gemini 3.0 Flash performance (+6% in text, +10% in video) in perceiving injury risks accurately.
Figure 3: Gemini Robotics-ER 1.6 improves substantially compared to Gemini Robotics-ER 1.5 on Safety Instruction Following which tests the ability to adhere to physical safety constraints. It improves compared to Gemini 3.0 Flash on pointing, and both models have very high accuracy for text. Gemini 3.0 Flash does better on bounding boxes.
We are committed to ensuring Gemini Robotics-ER provides maximum value to the robotics community. If current capabilities are limited for your specialized application, we invite you to submit this form with 10–50 labeled images illustrating specific failure modes to help us build more robust reasoning features. We look forward to collaborating with you to enhance these capabilities in our upcoming releases.