I've long wondered why the tools we have operate on screenshots rather than the accessibility tree. To me the latter would have seemed like the obvious choice from the beginning (structured data), and yet here we are with pixels. Happy to see progress being made here.

I've been building computer-use tools for a while, and I quietly launched this about a month ago (122 Stars on GH). I figured it was worth sharing here.

Over the last few months, a lot of computer-use agents have come out: Codex, Claude Code, CUA, and others. Most of them seem to work roughly like this:

  1. Take a screenshot
  2. Have the model predict pixel coordinates
  3. Click x,y
  4. Take another screenshot
  5. Repeat

That works, but it's slow, expensive in tokens, and fragile. If the UI shifts a few pixels, things break. And the model still doesn't know what any element actually is.

But the OS already exposes structured UI information:

  - macOS: Accessibility API
  - Windows: UI Automation
  - Linux: AT-SPI

Screen readers have used these APIs for years. On the web, Playwright beat screenshot scraping for the same reason: structured access is just a better abstraction than pixels.

So I built a desktop equivalent: agent-desktop.

It's a cross-platform CLI for structured desktop automation through the accessibility tree. One Rust binary, about 15 MB, no runtime dependencies. It exposes 53 commands with JSON output, so an LLM can inspect and operate native apps without screenshots or vision models. Inspired by agent-browser by Vercel Labs.

A typical loop looks like this:

  agent-desktop snapshot --app Slack -i --compact
  agent-desktop click @e12
  agent-desktop type @e5 "ship it"
  agent-desktop press cmd+return

So the loop becomes:

  1. Snapshot
  2. Decide
  3. Act
  4. Snapshot again
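
As a rough illustration (not taken from the agent-desktop source), the loop above could be driven from Python along these lines. The `run` and `decide` callables are hypothetical stand-ins: `run` would execute one command and return its parsed JSON, and `decide` would be the model call that picks the next action or returns `None` when the task is done.

```python
from typing import Callable, Optional

# Hypothetical hooks: `run` executes one command and returns parsed JSON,
# `decide` stands in for the LLM and returns the next action or None.
Runner = Callable[[list[str]], dict]
Decider = Callable[[dict], Optional[list[str]]]


def agent_loop(run: Runner, decide: Decider, app: str, max_steps: int = 10) -> int:
    """Snapshot, decide, act, repeat. Returns the number of actions taken."""
    snapshot_cmd = ["agent-desktop", "snapshot", "--app", app, "-i", "--compact"]
    steps = 0
    for _ in range(max_steps):
        tree = run(snapshot_cmd)          # 1. Snapshot
        action = decide(tree)             # 2. Decide
        if action is None:                # nothing left to do
            break
        run(["agent-desktop", *action])   # 3. Act
        steps += 1                        # 4. Loop back and snapshot again
    return steps
```

Passing the runner in as a parameter keeps the loop testable without a real desktop session.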

The main design problem was context size.

A naive approach would dump the full accessibility tree into the model, but real apps get huge. Slack can easily exceed 50,000 tokens for a full tree dump, which makes the approach impractical.

The approach I ended up using is progressive skeleton traversal:

  - First pass: return a shallow tree, typically depth 3, with deeper containers truncated and annotated with children_count
  - Named containers get references so the agent can request only that subtree
  - The agent drills down into the relevant region with --root @e3
  - References are scoped and invalidated only for that subtree
  - After acting, the agent can re-query just that region instead of re-snapshotting the whole app
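
To make the first-pass idea concrete, here is a minimal sketch of depth-limited truncation over an in-memory tree. The node shape (dicts with `role`, `name`, and `children` keys) is an assumption made for illustration; only the `children_count` annotation and the default depth of 3 come from the description above, and the real implementation may differ.

```python
def skeleton(node: dict, max_depth: int = 3, depth: int = 0) -> dict:
    """Return a copy of `node` that shows at most `max_depth` levels.

    Containers cut off at the depth limit lose their children and gain a
    children_count, so the caller can see there is more to drill into.
    """
    out = {k: v for k, v in node.items() if k != "children"}
    children = node.get("children", [])
    if not children:
        return out
    if depth + 1 >= max_depth:
        out["children_count"] = len(children)  # truncated here
        return out
    out["children"] = [skeleton(c, max_depth, depth + 1) for c in children]
    return out
```

Drilling down then amounts to calling the same function again with the truncated container as the new root.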

In practice, this reduced token usage by about 78% to 96% versus full-tree dumps in Electron apps like Slack, VS Code, and Notion.

A few implementation details that may be interesting here:

  - Rust workspace with strict platform/core separation through a PlatformAdapter trait
  - Accessibility-first activation chain; mouse synthesis is the fallback, not the default
  - Deterministic element refs like @e1, @e2, with optimistic re-identification across UI shifts
  - Structured errors with machine-readable codes plus retry suggestions
  - C ABI via cdylib, so it can be loaded directly from Python, Swift, Go, Node, Ruby, or C without shelling out
  - Batch operations in a single call
  - Support for windows, menus, sheets, popovers, alerts, and notifications
  - Special handling for Chromium/Electron accessibility trees, which can get very deep and noisy
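
The "accessibility-first, mouse as fallback" idea can be pictured as an ordered chain of strategies. The sketch below is only an illustration of that pattern: the two strategies and the element fields (`actions`, `bounds`) are hypothetical and are not the tool's actual chain.

```python
from typing import Callable, Sequence

# Each strategy tries one way of activating an element and reports success.
Strategy = Callable[[dict], bool]


def activate(element: dict, strategies: Sequence[Strategy]) -> str:
    """Try strategies in order; return the name of the first that succeeds."""
    for strategy in strategies:
        if strategy(element):
            return strategy.__name__
    raise RuntimeError("no activation strategy succeeded")


def ax_press(element: dict) -> bool:
    # Hypothetical: works only if the element advertises a press action.
    return "press" in element.get("actions", [])


def mouse_click(element: dict) -> bool:
    # Hypothetical fallback: needs on-screen bounds to synthesize a click.
    return "bounds" in element
```

Ordering the list as `[ax_press, mouse_click]` means mouse synthesis is reached only when the accessibility route fails.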

Why I think this matters: pixel-based desktop control feels like a leaky abstraction. The OS already knows the UI semantically. Accessibility APIs give you roles, names, actions, hierarchy, focus, selection, and state directly. That seems like a much better substrate for desktop agents than screenshot loops.

If you're building your own desktop agent, internal automation tool, or research prototype, this may be useful.

Install:

  npm install -g agent-desktop
  agent-desktop snapshot --app Finder -i

Repo: https://github.com/lahfir/agent-desktop

I'd especially love feedback from people who've built desktop automation before. What are the biggest pain points you've run into, and what would you want a tool like this to support?

Interesting, would be nice to see a demo video apart from that unclear GIF

This is big if it works. Nice job!

While the accessibility tree is great in many respects, it has its own limitations, for example when it comes to stacked views or lazy loading outside the viewport.

The best desktop automation system would take HDMI input and output USB keystrokes and mouse movements so that it can be plugged into any computer transparently, including work computers.

You don't need HDMI out, just the ability to take screenshots, which is easy to script.

Arguably though, browser automation gets you 95% of the way there for most things.

Looks interesting but like every single one of these computer use apps I've seen, it's macOS only.

Does anyone know of a Linux one?

I don't think the accessibility story on Linux is comprehensive enough to make this possible, unfortunately, especially with Wayland. One advantage Mac apps have is that they're all targeting the same underlying OS primitives, which is the layer their accessibility platform lives at.

lahfir, I vouched your (currently still dead) comment because it was interesting to me.

I expect the reason it is dead is that it seems LLM-generated (you "quietly" launched it on GitHub? Who says that?).

Also, your comment claims that the tool is cross-platform and implies that it works on Mac, Windows, and Linux, but the graphic in the GitHub README says it only works on Mac.

It looks hybrid human/LLM at best, but it's definitely possible that it's mostly human, from someone who is earnestly learning how to use "pitch" language. I got the feeling that some parts, like the bullet points, maybe originated from AI-generated documentation/READMEs.

My intuition tells me that it could have been AI-generated, but if that's the case then it was heavily edited by a human. I think anyone who went through it for that would have changed other things as well. That's why I suspect it's pseudo-artificial pitch "coded" human writing with some (mostly, lightly edited) copy/paste of AI bullet points.

Then again, I can't find snippets of this language in the repo, so maybe I'm losing my discernment as LLMs advance (as well as the humans who are learning how to use them).

Wouldn't the opposite be true? An LLM would use well-known terms for general-purpose writing. I think it's much more likely that a human would remember 'silent' launch, or 'stealth' launch, and use silent as a substitute.

I feel very strongly that comment wasn't AI generated.

Also, there's a bunch of normal comments that seem to be wrongfully flagged.

3 fake comments in the thread also

Why is Claude always pointing out or assuming what is done quietly?

OP claims cross platform.

  > It's a cross-platform CLI for structured desktop automation through the accessibility tree.

Looks very interesting. I especially like that the language environment is abstracted away through the CLI, so that you are not stuck with, for example, Python to write your UI logic (or having to create your own CLI wrapper around PyAutoGUI).

How can one help with implementing Linux and Windows support?

This is neat! Tried the Finder example and was impressed by how quick it was.

I would love it if it could support the iOS simulator or iPhone. I am using Maestro, but it is so damn slow and seems to be token-hungry.

I think screenshots also don't help with stacked views and lazy loading outside the viewport

AGENT DESKTOP

OBSERVE. DECIDE. ACT.

[Image: agent-desktop tutorial demo]

agent-desktop is a native desktop automation CLI designed for AI agents, built with Rust. It gives structured access to any application through OS accessibility trees — no screenshots, no pixel matching, no browser required.

Architecture

[Image: agent-desktop architecture diagram]

Key Features

Native Rust CLI: Fast, single binary, no runtime dependencies
C-ABI cdylib (libagent_desktop_ffi): Load once from Python / Swift / Go / Ruby / Node / C instead of forking the CLI per call
53 commands: Observation, interaction, keyboard, mouse, notifications, clipboard, window management
Progressive skeleton traversal: 78–96% token reduction on dense apps via shallow overview + targeted drill-down
Snapshot & refs: AI-optimized workflow using deterministic element references (@e1, @e2)
AX-first interactions: Every action exhausts pure accessibility API strategies before falling back to mouse events
Structured JSON output: Machine-readable responses with error codes and recovery hints
Works with any app: Finder, Safari, System Settings, Xcode, Slack — anything with an accessibility tree

Installation

npm (recommended)

npm install -g agent-desktop        # downloads prebuilt binary automatically

Or without installing:

npx agent-desktop snapshot --app Finder -i

From source

git clone https://github.com/lahfir/agent-desktop
cd agent-desktop
cargo build --release
cp target/release/agent-desktop /usr/local/bin/

Requires Rust 1.78+ and macOS 13.0+.

Permissions

macOS requires Accessibility permission. Grant it in System Settings > Privacy & Security > Accessibility by adding your terminal app, or:

agent-desktop permissions --request   # trigger system dialog

Language bindings (FFI)

Every GitHub Release ships a prebuilt C-ABI cdylib alongside the CLI tarballs. Hosts that need in-process calls (Python agents, Swift apps, Go services, Node tools, Ruby scripts, C/C++ code) dlopen the dylib and call the functions declared in agent_desktop.h — no fork-exec per command.

| Platform | Artifact |
| --- | --- |
| macOS arm64 | `agent-desktop-ffi-v<ver>-aarch64-apple-darwin.tar.gz` |
| macOS x86_64 | `agent-desktop-ffi-v<ver>-x86_64-apple-darwin.tar.gz` |
| Linux x86_64 (glibc) | `agent-desktop-ffi-v<ver>-x86_64-unknown-linux-gnu.tar.gz` |
| Linux arm64 (glibc) | `agent-desktop-ffi-v<ver>-aarch64-unknown-linux-gnu.tar.gz` |
| Windows x86_64 (MSVC) | `agent-desktop-ffi-v<ver>-x86_64-pc-windows-msvc.zip` |

Each archive contains lib/libagent_desktop_ffi.{dylib,so,dll}, include/agent_desktop.h, LICENSE, and a short README. Verify the download with the release's checksums.txt:

shasum -a 256 -c checksums.txt
gh attestation verify agent-desktop-ffi-v*.tar.gz --repo lahfir/agent-desktop   # Sigstore provenance
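
If `shasum` is not available on the host, the same check can be approximated in Python. This is a sketch that assumes checksums.txt uses the usual `<sha256>  <filename>` line format that `shasum -c` reads; the function name is made up for illustration, and it does not replace the Sigstore attestation check.

```python
import hashlib
from pathlib import Path


def verify_checksums(checksums_file: Path) -> dict[str, bool]:
    """Check each '<sha256>  <filename>' line against the file on disk."""
    results: dict[str, bool] = {}
    base = checksums_file.parent
    for line in checksums_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # shasum marks binary-mode entries with '*'
        target = base / name
        if not target.is_file():
            results[name] = False  # listed but not downloaded
            continue
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        results[name] = actual == expected.lower()
    return results
```

Files that are listed but missing are reported as `False` here rather than skipped, which is stricter than `shasum -c --ignore-missing`.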

Minimal Python round-trip:

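import ctypes

# Load the shared library from the extracted release archive (macOS name shown).
lib = ctypes.CDLL("./lib/libagent_desktop_ffi.dylib")
# Declare pointer-sized types; ctypes otherwise assumes C int and can truncate the handle.
lib.ad_adapter_create.restype = ctypes.c_void_p
lib.ad_adapter_destroy.argtypes = [ctypes.c_void_p]
adapter = lib.ad_adapter_create()
# ... call ad_list_apps / ad_get_tree / ad_execute_action, see docs below
lib.ad_adapter_destroy(adapter)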

Full consumer guide — error-handling contract, ownership rules, threading constraints, every entrypoint with Safety docs: skills/agent-desktop-ffi/.
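
One way to make sure the adapter is always destroyed, even when a call in between raises, is to wrap the create/destroy pair in a context manager. This is a sketch: only `ad_adapter_create` and `ad_adapter_destroy` come from the README, the helper names are made up, and treating a NULL handle as failure is an assumption that should be checked against agent_desktop.h.

```python
import ctypes
from contextlib import contextmanager


def configure(lib):
    """Declare pointer-sized signatures so handles are not truncated to C int."""
    lib.ad_adapter_create.restype = ctypes.c_void_p
    lib.ad_adapter_destroy.argtypes = [ctypes.c_void_p]
    return lib


@contextmanager
def adapter(lib):
    """Yield an adapter handle and always destroy it afterwards."""
    handle = lib.ad_adapter_create()
    if not handle:  # assumption: NULL signals failure
        raise RuntimeError("ad_adapter_create returned NULL")
    try:
        yield handle
    finally:
        lib.ad_adapter_destroy(handle)
```

Usage would look like `with adapter(configure(ctypes.CDLL(path))) as handle: ...`, where `path` points at the library from the release archive.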

Core Workflow for AI

For dense apps (Slack, VS Code, Notion), use progressive skeleton traversal to minimize token usage:

# 1. Shallow overview — depth-3 map, truncated containers show children_count
agent-desktop snapshot --skeleton --app Slack -i --compact

# 2. Drill into a region of interest (named containers get refs as drill targets)
agent-desktop snapshot --root @e3 -i --compact

# 3. Act on an element found in the drill-down
agent-desktop click @e12

# 4. Re-drill the same region to verify the state change
agent-desktop snapshot --root @e3 -i --compact

For simple apps, a full snapshot is fine:

agent-desktop snapshot --app Finder -i     # get interactive elements with refs
agent-desktop click @e3                    # click a button by ref
agent-desktop type @e5 "quarterly report"  # type into a text field
agent-desktop press cmd+s                  # keyboard shortcut
agent-desktop snapshot -i                  # re-observe after UI changes

Agent loop:  snapshot → decide → act → snapshot → decide → act → ...

Commands

Observation

agent-desktop snapshot --app Safari -i           # accessibility tree with refs
agent-desktop snapshot --surface menu            # capture open menu
agent-desktop screenshot --app Finder            # PNG screenshot
agent-desktop find --role button --app TextEdit  # search by role, name, value, text
agent-desktop get @e3 value                      # read element property
agent-desktop is @e7 checked                     # check boolean state
agent-desktop list-surfaces --app Notes          # list menus, sheets, popovers, alerts

Interaction

agent-desktop click @e3                  # smart AX-first click (15-step chain)
agent-desktop double-click @e3           # open files, select words
agent-desktop triple-click @e3           # select lines/paragraphs
agent-desktop right-click @e3            # context menu (returns menu tree inline)
agent-desktop type @e5 "hello world"     # type text into element
agent-desktop set-value @e5 "new value"  # set value directly via AX
agent-desktop clear @e5                  # clear element value
agent-desktop focus @e5                  # set keyboard focus
agent-desktop select @e9 "Option B"      # select option in dropdown/list
agent-desktop toggle @e12                # flip checkbox or switch
agent-desktop check @e12                 # idempotent check
agent-desktop uncheck @e12               # idempotent uncheck
agent-desktop expand @e15                # expand disclosure/tree item
agent-desktop collapse @e15              # collapse disclosure/tree item
agent-desktop scroll @e1 down 3          # scroll (AX-first, 10-step chain)
agent-desktop scroll-to @e20             # scroll element into view

Keyboard

agent-desktop press cmd+s                # key combo
agent-desktop press cmd+shift+z          # multi-modifier
agent-desktop press escape               # single key
agent-desktop key-down shift             # hold key
agent-desktop key-up shift               # release key

Mouse

agent-desktop hover @e3                  # move cursor to element
agent-desktop hover --xy 500,300         # move cursor to coordinates
agent-desktop drag @e3 --to @e8          # drag between elements
agent-desktop drag --xy 100,200 --to-xy 400,200  # drag between coordinates
agent-desktop mouse-click --xy 500,300   # click at coordinates
agent-desktop mouse-down --xy 500,300    # press at coordinates
agent-desktop mouse-up --xy 500,300      # release at coordinates

App & Window Management

agent-desktop launch Safari              # launch app by name
agent-desktop launch com.apple.Safari    # launch by bundle ID
agent-desktop close-app Safari           # quit app
agent-desktop close-app Safari --force   # force quit (SIGKILL)
agent-desktop list-apps                  # list running GUI apps
agent-desktop list-windows               # list visible windows
agent-desktop list-windows --app Finder  # windows for specific app
agent-desktop focus-window w-4521        # bring window to front
agent-desktop resize-window w-4521 800 600  # resize
agent-desktop move-window w-4521 100 100    # move
agent-desktop minimize w-4521            # minimize
agent-desktop maximize w-4521            # maximize
agent-desktop restore w-4521             # restore

Notifications (macOS only)

agent-desktop list-notifications                       # list all notifications
agent-desktop list-notifications --app "Slack"         # filter by app
agent-desktop list-notifications --text "deploy" --limit 5  # filter by text
agent-desktop dismiss-notification 1                   # dismiss by index
agent-desktop dismiss-all-notifications                # dismiss all
agent-desktop dismiss-all-notifications --app "Slack"  # dismiss all from app
agent-desktop notification-action 1 --action "Reply"   # click action button

Clipboard

agent-desktop clipboard-get              # read clipboard text
agent-desktop clipboard-set "copied"     # write to clipboard
agent-desktop clipboard-clear            # clear clipboard

Wait

agent-desktop wait 500                                       # sleep 500ms
agent-desktop wait --element @e3 --timeout 5000              # wait for element
agent-desktop wait --window "Save" --timeout 10000           # wait for window
agent-desktop wait --text "Loading complete" --app Safari    # wait for text
agent-desktop wait --menu --timeout 3000                     # wait for menu

Batch

agent-desktop batch '[
  {"command": "click", "args": {"ref_id": "@e2"}},
  {"command": "type", "args": {"ref_id": "@e5", "text": "hello"}},
  {"command": "press", "args": {"combo": "return"}}
]' --stop-on-error
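
When the batch is assembled by a program rather than typed by hand, building the JSON with a serializer avoids shell-quoting mistakes. A minimal sketch, using only the `command`/`args` keys and the `--stop-on-error` flag shown above; the helper name is hypothetical.

```python
import json


def batch_argv(steps: list[tuple[str, dict]], stop_on_error: bool = True) -> list[str]:
    """Build the argv for `agent-desktop batch` from (command, args) pairs."""
    payload = [{"command": command, "args": args} for command, args in steps]
    argv = ["agent-desktop", "batch", json.dumps(payload)]
    if stop_on_error:
        argv.append("--stop-on-error")
    return argv
```

Passing the result to `subprocess.run` as a list (no shell) means the JSON needs no extra escaping.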

System

agent-desktop status                     # platform, permission state
agent-desktop permissions                # check accessibility permission
agent-desktop permissions --request      # trigger system dialog
agent-desktop version                    # version string

Snapshot Options

agent-desktop snapshot [OPTIONS]

| Flag | Default | Description |
| --- | --- | --- |
| `--app <NAME>` | focused app | Filter to a specific application |
| `--window-id <ID>` | - | Filter to a specific window |
| `-i` / `--interactive-only` | off | Only include interactive elements |
| `--compact` | off | Omit empty structural nodes |
| `--include-bounds` | off | Include pixel bounds (x, y, width, height) |
| `--max-depth <N>` | 10 | Maximum tree depth |
| `--skeleton` | off | Shallow 3-level overview; truncated containers show `children_count` and get refs as drill targets |
| `--root <REF>` | - | Start traversal from this ref; merges into existing refmap with scoped invalidation |
| `--surface <TYPE>` | window | `window`, `focused`, `menu`, `menubar`, `sheet`, `popover`, `alert` |
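
For callers that build snapshot commands programmatically, a small helper can keep the flags consistent. This sketch covers a subset of the options listed above; the function and parameter names are hypothetical, and it only emits flags that appear in the table.

```python
from typing import Optional

SURFACES = {"window", "focused", "menu", "menubar", "sheet", "popover", "alert"}


def snapshot_argv(
    app: Optional[str] = None,
    interactive_only: bool = False,
    compact: bool = False,
    skeleton: bool = False,
    root: Optional[str] = None,
    max_depth: Optional[int] = None,
    surface: Optional[str] = None,
) -> list[str]:
    """Build an `agent-desktop snapshot` argv; unset options are omitted."""
    if surface is not None and surface not in SURFACES:
        raise ValueError(f"unknown surface: {surface}")
    argv = ["agent-desktop", "snapshot"]
    if skeleton:
        argv.append("--skeleton")
    if root is not None:
        argv += ["--root", root]
    if app is not None:
        argv += ["--app", app]
    if interactive_only:
        argv.append("-i")
    if compact:
        argv.append("--compact")
    if max_depth is not None:
        argv += ["--max-depth", str(max_depth)]
    if surface is not None:
        argv += ["--surface", surface]
    return argv
```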

JSON Output

Every command returns structured JSON:

{
  "version": "1.0",
  "ok": true,
  "command": "click",
  "data": { "action": "click" }
}

Errors include machine-readable codes and recovery hints:

{
  "version": "1.0",
  "ok": false,
  "command": "click",
  "error": {
    "code": "STALE_REF",
    "message": "Element at @e7 no longer matches the last snapshot",
    "suggestion": "Run 'snapshot' to refresh refs, then retry"
  }
}
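
A caller would typically turn this envelope into either a payload or an exception. The sketch below assumes the envelope shape shown above; the exception class and the `UNKNOWN` fallback code are illustrative names, not part of the tool.

```python
import json


class AgentDesktopError(Exception):
    """Raised when a response envelope has ok == false."""

    def __init__(self, code: str, message: str, suggestion: str = ""):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.suggestion = suggestion


def parse_response(raw: str) -> dict:
    """Return the `data` payload, or raise AgentDesktopError on failure."""
    envelope = json.loads(raw)
    if envelope.get("ok"):
        return envelope.get("data", {})
    err = envelope.get("error", {})
    raise AgentDesktopError(
        err.get("code", "UNKNOWN"),  # hypothetical fallback, not a documented code
        err.get("message", ""),
        err.get("suggestion", ""),
    )
```

Keeping `suggestion` on the exception lets an agent feed the recovery hint back into its next decision.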

Error Codes

| Code | Meaning |
| --- | --- |
| `PERM_DENIED` | Accessibility permission not granted |
| `ELEMENT_NOT_FOUND` | No element matched the ref or query |
| `APP_NOT_FOUND` | Application not running or no windows |
| `STALE_REF` | Ref is from a previous snapshot |
| `ACTION_FAILED` | The OS rejected the action |
| `TIMEOUT` | Wait condition expired |
| `INVALID_ARGS` | Invalid argument values |

Exit Codes

0 success, 1 structured error (JSON on stdout), 2 argument parse error.
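
Put together, a wrapper can use the exit code to decide whether stdout holds JSON at all. A sketch under the exit-code contract stated above; the function name is made up, and reading the parse-error message from stderr is an assumption.

```python
import json
import subprocess


def run_cli(argv: list[str]) -> dict:
    """Run a command and return its JSON envelope (exit code 0 or 1).

    Any other exit code (2 is an argument parse error) is treated as
    "no JSON available" and raised as ValueError.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode in (0, 1):
        return json.loads(proc.stdout)
    # Assumption: the parse-error text, if any, is written to stderr.
    raise ValueError(f"exit {proc.returncode}: {proc.stderr.strip()}")
```

For example, `run_cli(["agent-desktop", "status"])` would return the envelope for the `status` command on a machine where the CLI is installed.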

Ref System

snapshot assigns refs to interactive elements in depth-first order: @e1, @e2, @e3, etc. Refs are valid until the next snapshot replaces them.

Interactive roles that receive refs: button, textfield, checkbox, link, menuitem, tab, slider, combobox, treeitem, cell, radiobutton, incrementor, menubutton, switch, colorwell, dockitem.

Static elements (labels, groups, containers) appear in the tree for context but have no ref.
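
The numbering rule can be sketched as a depth-first walk that hands out refs only to interactive roles. The node shape (dicts with `role` and `children`) and the choice of pre-order are assumptions for illustration; the role list is the one given above.

```python
INTERACTIVE_ROLES = {
    "button", "textfield", "checkbox", "link", "menuitem", "tab", "slider",
    "combobox", "treeitem", "cell", "radiobutton", "incrementor",
    "menubutton", "switch", "colorwell", "dockitem",
}


def assign_refs(root: dict) -> dict[str, dict]:
    """Walk the tree depth-first (pre-order) and number interactive nodes."""
    refmap: dict[str, dict] = {}

    def visit(node: dict) -> None:
        if node.get("role") in INTERACTIVE_ROLES:
            ref = f"@e{len(refmap) + 1}"
            node["ref"] = ref      # annotate the node in place
            refmap[ref] = node
        for child in node.get("children", []):
            visit(child)

    visit(root)
    return refmap
```

Static nodes such as groups and labels stay in the tree but never appear in the returned map.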

Stale ref recovery:

snapshot → act → STALE_REF? → snapshot again → retry
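
In code, that recovery path might look like the following. This is a sketch: `act` and `snapshot` are hypothetical callables that return the JSON envelope described earlier, and only the `STALE_REF` code comes from the error table.

```python
from typing import Callable


def _is_stale(result: dict) -> bool:
    return not result.get("ok") and result.get("error", {}).get("code") == "STALE_REF"


def act_with_refresh(
    act: Callable[[], dict],
    snapshot: Callable[[], dict],
    max_retries: int = 1,
) -> dict:
    """Run `act`; on STALE_REF, re-snapshot and retry up to `max_retries` times."""
    result = act()
    retries = 0
    while _is_stale(result) and retries < max_retries:
        snapshot()       # refresh refs before the next attempt
        result = act()
        retries += 1
    return result
```

In a real agent the retry would probably go back through the decide step, since the same ref may not point at the same element after a fresh snapshot.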

Platform Support

| Feature | macOS | Windows | Linux |
| --- | --- | --- | --- |
| Accessibility tree | Yes | Planned | Planned |
| Click / type / keyboard | Yes | Planned | Planned |
| Mouse input | Yes | Planned | Planned |
| Screenshot | Yes | Planned | Planned |
| Clipboard | Yes | Planned | Planned |
| App & window management | Yes | Planned | Planned |
| Notifications | Yes | Planned | Planned |

Development

cargo build                               # debug build
cargo build --release                     # optimized (<15MB)
cargo test --lib --workspace              # run tests
cargo clippy --all-targets -- -D warnings # lint (must pass with zero warnings)

License

Apache-2.0

Many systems won't allow the end user to install any software (e.g. work issued laptops), but you can plug in HDMI and USB.

I think this guy is using AI for pretty much everything - he says as much in his GH profile. In fact his photo bears a Gemini watermark, meaning that is AI too.

Quote from a sibling comment:

  - macOS: Accessibility API
  - Windows: UI Automation
  - Linux: AT-SPI

The levels of support are radically different. Compositors, window managers, UI frameworks, and apps all have mixed and inconsistent levels of support such that the overall experience is that you simply cannot rely on using a Linux system via accessibility.

Hacker Times

I built the Playwright for desktop apps. 80% token savings
