Now you get to hear every person in the office do that around you.
Like, good tech, but do googlers live in the real world? Do they genuinely like the idea of an open office full of people talking to their computers? Do they all live alone without human contact?
Assuming that today the most efficient way for a human to transfer information to a machine is via voice, and that the most efficient way for a machine to convey rich information to a human is by rendering HTML.
Then a combination of screen + eye tracking + voice is all you need. The mouse doesn't make sense anymore.
Anyway, I built a prototype on this idea, but instead of relying only on hover, I press Option to select a node in a custom AST-ish semantic layer I designed around a minimalist UI grammar, and Option + up/down arrows to move to the parent/child node. This way, I have an accurate pointer to the element I want to talk about, plus a minimal context window (parent component, state, a few navigation-related queries).
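For a concrete feel of the mechanics, here's a minimal sketch in TypeScript (names are illustrative, not my actual code; nodeAt is an assumed lookup from DOM element to semantic node):

    // Sketch: a semantic node tree over the live UI. Pressing Option selects
    // the node under the pointer; Option + up/down walks to parent/child.
    interface SemanticNode {
      selector: string;                  // stable selector for the element
      element: HTMLElement;              // live DOM node it describes
      parent?: SemanticNode;
      children: SemanticNode[];
      context: Record<string, unknown>;  // parent component, state, nav queries
    }

    // Assumed to exist: maps a DOM element to its node in the semantic layer.
    declare function nodeAt(el: Element): SemanticNode | null;

    let selected: SemanticNode | null = null;
    let hovered: Element | null = null;

    window.addEventListener("mousemove", (e) => { hovered = e.target as Element; });

    window.addEventListener("keydown", (e) => {
      if (!e.altKey) return;                        // Option maps to Alt in the browser
      if (e.key === "ArrowUp" && selected?.parent) selected = selected.parent;
      else if (e.key === "ArrowDown" && selected?.children[0]) selected = selected.children[0];
      else if (hovered) selected = nodeAt(hovered); // plain Option: grab node under pointer
      selected?.element.scrollIntoView({ block: "nearest" });
    });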
What I learned from using it, though, is that the killer use case isn't necessarily the flashy "talk to this UI element" interaction shown in the Google demos. I do use it that way too; I have `Option + Shift + click` to copy a selector to the clipboard, so I can give an LLM connected to the live medium a precise reference to the element I want to discuss.
But the place where it has been most useful day to day is much simpler: source navigation. Point at the thing in the UI, jump to the code that is responsible for it. The difficult part is jumping to the code you care about (the code for UI or for the semantic element?), but in my system that distinction turned out to be usually obvious, which is what makes the interaction useful.
The second demo seems to be a wash: there's no time saved in saying "move this" versus "move crab". And an app-specific contextual menu would probably be faster.
The third demo doesn't seem to warrant the use of a pointer at all, since there is only one way to interpret the prompt.
None of this means that this approach will not be successful, but there's a reason why so many attempts to revolutionize user interfaces ended up going nowhere. Talking to your computer was always supposed to be the future, but in practice, it's slower and more finicky than typing.
In fact, the only new UI paradigm of the past 28+ years appears to have been touchscreens and swipe gestures on phones. But they are a matter of necessity. No one wants to finger-paint on a desktop screen.
I'm imagining a webpage with a link: instead of opening a new tab to quickly google something, or opening three new tabs from hyperlinks, I can point at a paragraph or line and ask it to tell me about it.
Maybe I can point at a song on Spotify and have it find me the YouTube video, or vice versa (of course this is assuming a tool like this wouldn't stay locked into one ecosystem... which it will).
The point is that the concept of talking to the computer with the mouse as a pointer is pretty cool, and I guess a step closer to that whole sci-fi "look at this part of the screen and do something".
and that's aside from the obvious privacy problems.
Sometimes I go to a different page to take a screenshot, other times I'm browsing for a file, and other times I'm highlighting some log lines. Cursor did this well: selecting text in the terminal auto-focused the agent textbox, so you could talk to the agent, then select more text, without having to re-focus the original agent textbox each time. The agent is a top-level function in that system, not "just another app I have to switch to" to take my context with me.
I have some small amount of bias because I've always felt input-constrained on computers. I have to move my hands to go places, and that's exasperating. I've tried head tracking, had a vim pedal for a while, and used tiling WMs and things like that to help, but while my vim-fu is pretty good and I function inside individual tools very well with it, my cross-application interface isn't nearly as good.
In the end, perhaps we all have our home offices with our Apple Vision Pros and we talk to them like this to manoeuvre faster through our machines and get our ideas into them.
Cool research. I wonder what we'll end up with.
It reminds me of Microsoft Recall in the sense that some portion of the screen is going to be continuously transmitted outside of the users control.
What happens when someone browses something very private (planning a surprise engagement, looking at medical data, planning a protest)? All that data gets slurped to Google and is subject to a warrant or discovery, or used to build your advertising fingerprint.
Maybe the idea is that the data is sent to AI only when you right click, but that seems like a very thin firewall that a product manager will breach in the interests of delivering "predictive AI" via some kind of precomputed results.
It's wild that they even put this out as a demo. It should have been picked apart in the internal meeting. There is no way I'd ever show my product taking 5s to change a 1 to a 2 in a piece of text the user was already hovering over, or taking 10s to drag and drop a line of text from one box to another. Even the example of finding a route between two images could be done quickly if images were auto-OCR'd, which is a setting in most image viewers.
(Not going to happen)
Perhaps a text box and file upload isn't the perfect interface for every use case, but it is versatile, which is a huge barrier for any alternative to overcome.
Horizontal dragging with a mouse is actually really hard. Nobody's going to use it like that.
Your arm can easily move your hand and cursor up/down by pivoting your shoulder, but there's no mechanism for left/right movement. It's always an arc.
Or put another way: selection will be a lot slower and more tedious than the demo.
Interesting but not “reimagining” anything.
I think the real story here is how vibe coding now enables flashy demo sites like this to be built for a concept that hasn’t yet earned it.
I like text selection exactly how it is. I want precise controls.
It's fine for a touch interface like a phone, but on a computer I expect precision. As much as I can get.
Anything with voice controls for routine use is a pretty tough sell. Doing this when you're not completely alone would be annoying to everyone around you.
Most of their examples seem like they could have been done with a right click drop down menu so they don't really need to "re-invent the mouse pointer".
So is this thing talking to Google's servers all the time for the AI integration? So it won't work if you're not connected to the internet? Privacy concerns are obvious; now Google wants to have an AI watching literally everything you do on your computer?
Does it cost the user anything for the LLM use? If it's free will it stay free forever? That's quite a lot to give away if they're expecting people to use it to change a single word like in one of their examples. I guess they're expecting to make the money back by gathering data about literally everything you do on your computer.
There might be a killer app for AI integration with personal computers that has yet to be invented, but this doesn't look like it.
Of course learning proper cad software is probably the right thing here, but having Claude write python scripts which generate HTML files which reference three.js to provide a 3d view has gotten me surprisingly far. If something could take my pointer click and reverse whatever coordinate transforms are between the source code and my screen such that the model sees my click in terms of the same coordinate system it's writing python in, well that would be pretty slick.
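The forward half of that already exists in three.js as raycasting; here's a rough sketch of mapping a click back into the scene's own coordinates, assuming the usual scene/camera/canvas setup the generated HTML would have (and no extra transforms on top):

    import * as THREE from "three";

    // Rough sketch: turn a pointer click into a point in the scene's own
    // coordinate system, i.e. the same frame the generating script built
    // the geometry in.
    function clickToModelCoords(
      event: MouseEvent,
      camera: THREE.Camera,
      scene: THREE.Scene,
      canvas: HTMLCanvasElement,
    ): THREE.Vector3 | null {
      const rect = canvas.getBoundingClientRect();
      // Normalized device coordinates in [-1, 1], as the raycaster expects.
      const ndc = new THREE.Vector2(
        ((event.clientX - rect.left) / rect.width) * 2 - 1,
        -((event.clientY - rect.top) / rect.height) * 2 + 1,
      );
      const raycaster = new THREE.Raycaster();
      raycaster.setFromCamera(ndc, camera);
      const hit = raycaster.intersectObjects(scene.children, true)[0];
      return hit ? hit.point : null;  // world-space point to report back to the model
    }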
But Google is a very ill-positioned candidate for such an OS. I would rather trust Apple and local-first, on-device models.
I'm mostly using my system to make comments on long AI-generated documents (especially design documents). I find it works well to have the AI generate something, and then I read through it, making comments along the way.
You can get pretty far just repeating the things you see... "I'm reading [heading] and [comments]". But I do find some use in selecting content and saying "I don't agree with this" or whatever else.
The result is just an augmented message. It looks like:
<transcript>
Let's see what we've got here.
<selection doc="proposal.md" location="paragraph 3">
The system already...
</selection>
No, I don't like how this is approaching the problem, ...
</transcript>
Then I just send this as a user message. Claude Code (and I'm guessing any of the agentic systems) picks up on the markup very easily. It also helps to label it as a transcript, as it can understand there may be errors, and things like spelling and punctuation are inferred, not deliberate. (Some additional instruction is necessary to help it understand, for example, that it should look for homophones that might make more sense in context.)
It makes reviewing feel pretty relaxed and natural. I've played around with similar note-taking systems, which I think could be great for studying in school, but haven't had the focus on that particular problem to take it very far.
But I think the best thing really is giving the agent a richer understanding of what the user is experiencing and doing and just creating a rich representation of that. The keywords can be useful, but almost only as checkpoints: a keyword can identify the moment to take the transcript and package it up and deliver it.
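The packaging step itself is tiny; roughly something like this (hypothetical helpers, just to show the shape of it):

    // Sketch: wrap the spoken transcript plus the current selection into the
    // augmented-message format above; a trigger keyword is the checkpoint
    // that flushes this and sends it as an ordinary user message.
    interface DocSelection {
      doc: string;       // e.g. "proposal.md"
      location: string;  // e.g. "paragraph 3"
      text: string;
    }

    function buildAugmentedMessage(
      spokenBefore: string,
      sel: DocSelection,
      spokenAfter: string,
    ): string {
      return [
        "<transcript>",
        spokenBefore,
        `<selection doc="${sel.doc}" location="${sel.location}">`,
        sel.text,
        "</selection>",
        spokenAfter,
        "</transcript>",
      ].join("\n");
    }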
One difference perhaps in design motivation: I have really embraced long latency interactions. I use ChatGPT with extended thinking by default, and just suck it up when the answer didn't really require thinking. I deliver 10 points of feedback at once instead of little by little. (Often halfway through I explicitly contradict myself, because I'm thinking out loud and my ideas are developing.) I just don't stress out about latency or feedback, and so low-latency but lower-intelligence interactions don't do it for me (such as ChatGPT's advanced voice mode, or probably Thinking Machine's work). I think this focus is in part a value statement: I'm trying to do higher quality work, not faster work.
For you and me, who have spent more than 1,000 or even 10,000 hours of our lives at a keyboard.
There was a brief period when typing slowed people down, because they could write the same information down faster with pen and paper, and that period eventually passed.
the agent occasionally spots your real problem like an experienced engineer
https://www.youtube.com/watch?v=46EopD_2K_4
>We present a general-purpose implementation of Grossman and Balakrishnan's Bubble Cursor [broken link] the fastest general pointing facilitation technique in the literature. Our implementation functions with any application on the Windows 7 desktop. Our implementation functions across this infinite range of applications by analyzing pixels and by leveraging human corrections when it fails.
Transcript:
>We present the general purpose implementation of the bubble cursor. The bubble cursor is an area cursor that expands to ensure that the nearest target is always selected. Our implementation functions on the Windows 7 desktop and any application for that platform. The bubble cursor was invented in 2005 by Grossman and Balakrishnan. However a general purpose implementation of this cursor one that works with any application on a desktop has not been deployed or evaluated. In fact the bubble cursor is representative of a large body of target aware techniques that remain difficult to deploy in practice. This is because techniques like the bubble cursor require knowledge of the locations and sizes of targets in an interface. [...]
https://www.dgp.toronto.edu/~ravin/papers/chi2005_bubblecurs...
>The Bubble Cursor: Enhancing Target Acquisition by Dynamic Resizing of the Cursor’s Activation Area
>Tovi Grossman, Ravin Balakrishnan; Department of Computer Science; University of Toronto
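The core rule is simple enough to sketch: the activation area grows until it contains the nearest target, so the nearest target is always the one selected (a simplification of the paper's intersecting-distance formulation):

    // Sketch of the bubble cursor's core rule: whichever target's edge is
    // closest to the cursor is selected, no matter how far away it is.
    interface Target { id: string; x: number; y: number; radius: number; }

    function bubbleSelect(cx: number, cy: number, targets: Target[]): Target | null {
      let best: Target | null = null;
      let bestDist = Infinity;
      for (const t of targets) {
        const d = Math.hypot(t.x - cx, t.y - cy) - t.radius;  // distance to the target's edge
        if (d < bestDist) { bestDist = d; best = t; }
      }
      // The visible bubble would be drawn with radius roughly bestDist + best.radius.
      return best;
    }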
I've written more about Morgan Dixon's work on Prefab (pre-LLM pattern recognition, which is much more relevant now with LLMs).
https://news.ycombinator.com/item?id=11520967
https://news.ycombinator.com/item?id=14182061
But I don't think the voice problem is surmountable. I closed their image editing demo when I saw it required a mic.
It would be appealing as a Spotlight-like text pop-up interface where you type instructions, which would work in social/office environments, but that might only appeal to power users.
What's being delivered now is an agent running on someone else's computer, copying your data to someone else's database, with zero responsibility or mandate to protect that data and not share it with anyone else (in fact, they almost always promise to share it with their thousand partners), offering suggestions and preferences based on someone else's so-called recommendations, influenced by whoever pays the agent's operators, and increasing pressure to make using someone else's computers + agents the only way to interact with other people and systems.
There is no doubt that LLMs can do amazing things, but the current environment seems to make it nearly impossible to do anything with them that doesn't let someone else inspect, influence, and even restrict everything you are doing with these systems.
It's a cool idea for the future when we have reliable EEG headsets or Neuralink or whatever though.
I think it's brilliant UX.
Reads like the argument against cell phones, where you don't have a cabinet around you...
you select text in vscode, and write a comment, and the llm gets both
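In VS Code extension terms, that flow is about this much code (sendToLlm is a stand-in for whatever backend you use):

    import * as vscode from "vscode";

    // Stand-in for your actual LLM call.
    declare function sendToLlm(payload: {
      file: string; selection: string; comment: string;
    }): Promise<void>;

    export function activate(context: vscode.ExtensionContext) {
      const cmd = vscode.commands.registerCommand("demo.commentOnSelection", async () => {
        const editor = vscode.window.activeTextEditor;
        if (!editor) return;
        const selection = editor.document.getText(editor.selection);
        const comment = await vscode.window.showInputBox({ prompt: "Comment for the LLM" });
        if (comment === undefined) return;
        await sendToLlm({ file: editor.document.fileName, selection, comment });
      });
      context.subscriptions.push(cmd);
    }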
At some point I fully expect eye tracking (or attention tracking) to be common enough to be a first-class input method.
And the paper https://dam-prod.media.mit.edu/uuid/8e6d934b-6c6f-48e4-b0a1-...
https://en.wikipedia.org/wiki/MacPlaymate
>The game features a panic button that when clicked on will cover the computer screen with a fake spreadsheet. The player can also choose to print out Maxie's current pose as a pinup.
https://archive.org/details/mac_MacPlaymate
Geraldo interviews Chuck Farnham about getting sued by Playboy:
https://news.ycombinator.com/item?id=42571845
https://www.upi.com/Archives/1989/02/09/Playboy-sues-over-se...
Profit!
Why not constrain your computing? It will require some programming chops, but you can note down your common tasks, figure out where actual input is required, and automate the rest.
Also featured in the Starfire vision video from 1992: https://youtu.be/jhe1DFY-SsQ?t=286
But if it's going to require phoning home to some Google/OpenAI/whoever then forget it. I don't want a constant connection to my OS from one of these companies.
If we're going to have AI regulation, this is where to start. If a company's AI service acts for a user, the company has non-disclaimable financial responsibility for anything that goes wrong. There's an area of law called "agency", which covers the liability of an employer for the actions of its employees. The law of agency should apply to AI agents. One court already did that. An airline AI gave wrong but reasonable sounding advice on fares, a customer made a decision based on that advice, and the court held that the AI's advice was binding on the company, even though it cost the company money.
This is something lawyers and politicians can understand, because there's settled law on this for human agents.
I guess what I'm saying is - we've always had this problem.
I dunno how I can express this best, but I found out a very long time ago that my problem with voice input wasn't that it wasn't good enough. My problem with voice input is that I don't want it. I am very happy for people who use these tools that they exist. I will not be them. Yes I am sure.
And yes, I know SuperWhisper can run offline, but it is a notable benefit that, unlike many modern speech recognition tools, my keyboard does not require an always-active Internet connection, a subscription payment, or several teraflops of compute power.
I am not a flat-out luddite. I do use LLMs in some capacity, for whatever it is worth. Ethical issues or not, they are useful and probably here to stay. But my God, there are so many ways in which I am very happy to be "left behind".
I'd go and find a small meeting room or conference call booth in the office and take it there.
Essentially, a cabinet.
In fact, when humans happen to order other humans, it's typically done in writing.
Until then I've just had it list every surface in a legend, each colored differently, so I can say "three inches down from the top of pole six, and rotate it so the hoop part of the bolt faces northwest."
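Roughly, the legend pass looks like this (a sketch in three.js terms, since that's what's rendering the view; it assumes each surface is a named mesh):

    import * as THREE from "three";

    // Sketch: give every surface a distinct color and a numbered legend entry,
    // so each one can be referred to unambiguously in speech ("pole six", ...).
    function buildLegend(scene: THREE.Scene): string[] {
      const legend: string[] = [];
      let i = 0;
      scene.traverse((obj) => {
        if (obj instanceof THREE.Mesh) {
          const color = new THREE.Color().setHSL((i * 0.618) % 1, 0.6, 0.5);  // spread hues
          obj.material = new THREE.MeshStandardMaterial({ color });
          legend.push(`${i}: ${obj.name || "unnamed surface"} (#${color.getHexString()})`);
          i++;
        }
      });
      return legend;  // render this list alongside the canvas
    }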
First things that came to mind:
- facial hair
- getting people to learn to make bigger mouth movements and not mumble
- we're constantly self-correcting our speech as we hear our voice. This removes the feedback loop.
- non-English languages (god forbid bilingualism)
- camera angles and head movement
And that's thinking about it for 30s. I'm sure there are some really good use cases, but will any research group/company push through for years and years to make it really good even if the response is lukewarm?
I'm sure Don Hopkins can tell you a long annotated tale about the NeWS pizza ordering app that displayed a real-time dynamically-updated rotating pie on the screen as you filled out your order.
(And if it's an abstract entity like a file, it might not even be possible to describe it, short of rattling off the entire file path)
And being able to take photos/videos with the glasses (like the Meta ones nowadays) is really useful with my kid because he often does funny or cute stuff and I don't have time to pull my phone out to take a video/photo of it. I guess it could be useful for video calls too so my parents can see him.
But I just don't see anyone sitting in an office, or even at home, talking to their computer. It's really only useful for hands-free settings like when you are driving, or in the kitchen etc.
Maybe you can share a scenario for that one? I can’t figure a scenario where all of this needs to be true. It seems like a recipe for accidents.
Neither typing speed nor dictation speed is a true bottleneck, but editing speech seems like it'd be harder than editing text.
Though there may be some hybrid approach that can work well.
The Siri voice transcription is pretty awful compared to what I've experienced with ChatGPT though and it's weird going back almost to the pre-LLM world where you have to give such clear sort of computer-coded voice commands.
In my experience, any combination of computers + speech + Danish has, so far without exception, been terrible. Last time I tested ChatGPT, it couldn't understand me at all. I spoke both in my local dialect and as close to Rigsdansk [π] as I could manage. Unusable performance, and in any case I should be able to talk normally, or there's no point. It was about a year ago; it may have improved, but I doubt it. I'm completely done trying to talk to machines.
Pre-emptive kamelåså: https://www.youtube.com/watch?v=s-mOy8VUEBk
Except for the large majority of people who read, type, and click way faster than they can talk. Especially for visual things it’s way faster to drag a rectangle than to describe what you want.
A lot of us also aren’t linear verbal thinkers. It would take minutes to hours to verbalize concepts we can grasp visually/schematically in seconds.
Great book on the topic: https://www.goodreads.com/book/show/60149558-visual-thinking
>Sousveillance (/suːˈveɪləns/ soo-VAY-lənss) is the recording of an activity by a member of the public, rather than a person or organisation in authority, typically by way of small wearable or portable personal technologies.[14] The term, coined by Steve Mann,[15] stems from the contrasting French words sur, meaning "above", and sous, meaning "below", i.e. "surveillance" denotes the "eye in the sky" watching from above, whereas "sousveillance" denotes bringing the means of observation down to human level, either physically (by mounting cameras on people rather than on buildings) or hierarchically (with ordinary people observing, rather than by higher authorities or by architectural means).[16][17][23]
https://www.media.mit.edu/publications/put-that-there-voice-...
I hadn’t realized until just now how accurate that is for me as well. Thank you.
May 12, 2026 Research
Adrien Baranes and Rob Marchant
We are developing more seamless, intuitive ways to collaborate with AI
The mouse pointer has been a constant companion on computer screens, across every website, document and workflow. Despite how technologies have changed, the pointer has barely evolved in more than half a century.
We’ve been exploring new AI-powered capabilities to help the pointer not only understand what it’s pointing at, but also why it matters to the user.
Our goal is to address a common frustration: because a typical AI tool lives in its own window, users need to drag their world into it. We want the opposite: intuitive AI that meets users across all the tools they use, without interrupting their flow. For example, imagine pointing to an image of a building, and requesting “Show me directions”. Nothing more is needed when the AI system already understands the context.
Today, we’re outlining the underlying principles guiding our thinking on future user interfaces, and sharing experimental demos of an AI-enabled pointer, powered by Gemini. For example, you could visit Google AI Studio to edit an image or find places on the map, just by pointing and speaking.
This video showcases the experimental environment for our AI-enabled pointer. Sequences are shortened throughout.
We’ve developed four principles that together shift the hard work of conveying context and intent from the user to the computer, replacing text-heavy prompts with simpler, more intuitive interactions. Here are illustrations of our approach and principles.
AI capabilities should work across all apps, not force users into “AI detours” between them. Our prototype AI-enabled pointer is available wherever the user is working. For example, they could point at a PDF and request a bullet-point summary to paste directly into an email, hover over a table of statistics and request a pie chart version, or highlight a recipe and ask for all the ingredients doubled.
Current AI models demand precise instructions. To get a good response, a user has to write a detailed prompt. An AI-enabled pointer would streamline this process by smoothly capturing the visual and semantic context around the pointer, letting the computer “see” and understand what’s important to the user. In our experimental system, just point, and the AI knows exactly which word, paragraph, part of an image, or code block the user needs help with.
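As a rough illustration, capturing context around the pointer in a web page can be as simple as asking the DOM what sits under it (a generic sketch using standard browser APIs, not our production system):

    // Illustrative sketch only: grab the element under the pointer plus its
    // nearest semantic container, so a model can be told what the user means.
    interface PointerContext { tag: string; containerTag: string; text: string; }

    function captureContext(x: number, y: number): PointerContext | null {
      const el = document.elementFromPoint(x, y) as HTMLElement | null;
      if (!el) return null;
      // Nearest enclosing paragraph / figure / code block / table, as a rough
      // stand-in for "the thing the pointer is over".
      const container = (el.closest("p, figure, pre, table, section, article") ?? el) as HTMLElement;
      return {
        tag: el.tagName.toLowerCase(),
        containerTag: container.tagName.toLowerCase(),
        text: container.innerText.slice(0, 500),
      };
    }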
In everyday interactions with each other, humans rarely speak in long, detailed paragraphs. We might say, "Fix this", "Move that here", or “What does this mean?” — while relying on physical gestures and our shared context to fill in any gaps in understanding. An AI system that understands this combination of context, pointing and speech would allow users to make complex requests in natural shorthand, no fiddly prompting required.
For decades, computers have only tracked where we are pointing. AI can now also understand what the user is pointing at. This transforms pixels into structured entities, such as places, dates, and objects, that users can interact with instantly. A photo of a scribbled note becomes an interactive to-do list; a paused frame in a travel video becomes a booking link for that cool-looking restaurant.
Building technology that adapts to human behavior — rather than forcing users to adapt to it — enables a future where collaborating with AI feels truly intuitive, fluid and seamless.
We’re excited that these human-first concepts are being woven into products we use every day.
We are now integrating these principles to reimagine pointing in Chrome and our new Googlebook laptop experience. Starting today, instead of writing a complex prompt, you can now use your pointer to ask Gemini in Chrome about the part of the webpage you care about. For example, you can select a few products on a page and ask to compare, or point to where you want to visualize a new couch in your living room. Similarly, we'll soon roll out Magic Pointer in Googlebook, allowing users to harness Gemini at their fingertips for a more intuitive experience. Because there are so many other potentially great applications, we'll continue to test future concepts across our platforms, including Google Labs’ Disco.
Try the AI-enabled pointer in Google AI Studio
I usually convey the same meaning with 80wpm typing. Makes it faster to read too
Maybe I'm just slightly ADHD, but listening to people talk drives me crazy. Get to the point! Much easier if they type it out.
It's like a hidden curse of LLMs -- they're so good at parsing intended meaning from non-grammatically-correct language that we don't have to be very good at clear communication.
Eventually all LLMs will be controlled by humans uttering terse guttural grunts. We will all become Neanderthals, with machines that deliver our every whim.
I recommend the youtube channel @afadingthought to see what people come up with (like v=283-z29TXeM).
People have so many verbal tics and filler words too. Anthropic’s Dario says “you know” after every third word, for example.
Or they meander around unrelated/unimportant details.