openclaw: Feature Request: Add screenshot, click, type as native first-class tools (same level as exec/read/write)


Feature Request: computer tool as first-class primitive (screenshot, click, type, scroll)

Summary

Add screenshot, click, type, scroll as native first-class tools at the same level as exec, read, and write.

Not as a skill. Not as an MCP plugin. As a built-in tool inside the agent runtime loop.

Why this matters

An AI assistant that cannot see the screen and cannot click is half an assistant.

Most software in the world does not expose a CLI or API. People fill expense reports in ERP systems. People configure routers through web UIs. People operate Excel, PowerPoint, QuickBooks, WeChat Desktop, government portals, industrial control panels — none of which have --help. An agent limited to exec + read + write is limited to the tiny fraction of software that speaks terminal.

Without screenshot and click, OpenClaw is not a general-purpose assistant. It is a terminal assistant with a browser plugin. That is a subset of what users actually need.

The correct architecture: tools inside the agent loop

The agent runtime loop already works:

LLM thinks → calls tool → result returns to context → LLM sees result → decides next action → calls next tool

This loop is what makes exec, read, and write powerful. The LLM naturally iterates. It compiles, sees errors, fixes them, recompiles. No external orchestration needed.

Computer tools should work in exactly the same way:

LLM: write Qt code → exec("make") → exec("./app &") 
  → screenshot()              ← sees button at (150, 300)
  → click(150, 300)            ← clicks it
  → screenshot()              ← sees error dialog "password cannot be empty"
  → edit() fix logic → exec("make") 
  → screenshot()              ← verifies fix ✅

Every step is a tool call inside the same LLM reasoning loop. The agent does not need to be told to iterate — the tool results naturally drive iteration. This is the same reason Anthropic added computer_use as a native Claude tool, not as an MCP plugin.
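The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `run_agent`, `scripted_model`, and the stub tools are hypothetical names, the "model" is a hard-coded script rather than an LLM, and nothing here reflects OpenClaw's actual runtime.

```python
from typing import Callable, Optional

# Hypothetical stub tools; real ones would capture the screen and inject input.
def screenshot() -> str:
    return "dialog: password cannot be empty"

def click(x: int, y: int) -> str:
    return f"clicked ({x}, {y})"

TOOLS: dict[str, Callable[..., str]] = {"screenshot": screenshot, "click": click}

def scripted_model(context: list[str]) -> Optional[tuple[str, dict]]:
    """Stand-in for the LLM: chooses the next tool call from what it has seen."""
    if len(context) == 0:
        return ("screenshot", {})
    if len(context) == 1:
        return ("click", {"x": 150, "y": 300})
    if len(context) == 2:
        return ("screenshot", {})
    return None  # the model decides it is done

def run_agent(model, tools, max_steps: int = 10) -> list[str]:
    """Call tools until the model stops; every result is fed back into context."""
    context: list[str] = []
    for _ in range(max_steps):
        call = model(context)
        if call is None:
            break
        name, args = call
        result = tools[name](**args)
        context.append(f"{name} -> {result}")
    return context
```

The point of the sketch is that `screenshot` and `click` go through the same dispatch path as any other tool, so the model's next decision is driven by the previous result without extra orchestration.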

The skill / MCP path is the wrong fit

Currently, desktop control is pushed to skills (Peekaboo) or MCP servers:

| Dimension | Native tool (exec) | Skill / MCP path |
|---|---|---|
| In agent runtime loop | ✅ Managed by runtime | ❌ External process |
| In system prompt by default | ✅ | ❌ LLM may not know it exists |
| LLM discovers naturally | ✅ | ❌ Needs explicit prompt injection |
| Token/context continuity | ✅ Seamless | ❌ Cross-process boundary |
| Same UX as coding tools | ✅ | ❌ Feels like an afterthought |

This is not about Peekaboo being bad — Peekaboo is an excellent automation engine. The problem is placement. A tool outside the agent's default reasoning loop is invisible to the LLM unless the prompt explicitly tells it to look. A native tool is always there, always available, just like exec.
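The placement argument can be made concrete with a toy prompt builder. The sketch below assumes a hypothetical `Tool` record and `build_system_prompt` function; it models the claim made here, not how OpenClaw actually assembles its system prompt, and it treats `screenshot` as native only because that is what this request proposes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    native: bool  # True: core tool set; False: skill or MCP server

def build_system_prompt(tools: list[Tool],
                        enabled_extras: frozenset[str] = frozenset()) -> str:
    """List native tools unconditionally; external tools only when enabled."""
    visible = [t for t in tools if t.native or t.name in enabled_extras]
    lines = ["You can call these tools:"]
    lines += [f"- {t.name}: {t.description}" for t in visible]
    return "\n".join(lines)

# Hypothetical registry: names and descriptions are illustrative only.
TOOLS = [
    Tool("exec", "run a shell command", native=True),
    Tool("screenshot", "capture the screen as an image", native=True),
    Tool("peekaboo", "macOS accessibility-tree automation", native=False),
]
```

With no extra configuration, the model is told about `exec` and `screenshot` but never hears about `peekaboo`, which is the discoverability gap the table describes.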

Cross-platform is not the obstacle

Three small functions, each with one backend per OS, cover macOS, Linux, and Windows:

def screenshot():  # ~20 lines
    macos:   screencapture -x /tmp/s.png
    linux:   grim /tmp/s.png  # or scrot for X11
    windows: DXGI screen capture

def click(x, y):  # ~15 lines
    macos:   cliclick c:x,y  # or CGEvent
    linux:   ydotool click x y
    windows: SendInput

def type(text):   # ~10 lines
    macos:   CGEventPost
    linux:   ydotool type ...
    windows: SendInput

This is not a massive cross-platform maintenance burden. It is less code than the browser tool's CDP handling. The OS fragmentation argument has been used to justify pushing this to MCP, but the actual implementation surface is small and well-bounded.
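As a rough illustration of how small the dispatch layer can be, the sketch below maps a platform string to a command line. The function names are hypothetical. Only the macOS and Wayland commands named above are wired in; Linux clicking is left out because ydotool's argument syntax differs between versions, and Windows is left out because SendInput and DXGI are APIs rather than command-line tools.

```python
def screenshot_command(platform: str, path: str) -> list[str]:
    """Return the argv for a full-screen capture on the given platform."""
    if platform == "darwin":
        return ["screencapture", "-x", path]  # -x: capture without the shutter sound
    if platform.startswith("linux"):
        return ["grim", path]  # Wayland only; an X11 session would need scrot
    raise NotImplementedError(f"no CLI screenshot backend for {platform!r}")

def click_command(platform: str, x: int, y: int) -> list[str]:
    """Return the argv for a left click at screen coordinates (x, y)."""
    if platform == "darwin":
        return ["cliclick", f"c:{x},{y}"]
    # Linux (ydotool) and Windows (SendInput) need version- or API-specific
    # handling that is deliberately not guessed at in this sketch.
    raise NotImplementedError(f"no CLI click backend for {platform!r}")
```

A caller would pass `sys.platform` and hand the result to `subprocess.run(..., check=True)`. Keeping command construction separate from execution makes the platform table testable on any machine.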

Pixel-first approach: vision models + coordinates

The accessibility-tree approach (Peekaboo) is powerful on macOS but does not generalize. On Linux, AT-SPI2 coverage is inconsistent. On Windows, UIA is available but not universally implemented.

The pixel-first approach (used by Anthropic's computer_use, by UI-TARS, and by ByteDance's v3.0) works everywhere. Screenshot capture, vision-model analysis, and coordinate-based interaction behave the same on any OS, any application, and any UI framework. The model sees pixels and the tool clicks coordinates, with no dependency on platform accessibility APIs.
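One practical detail of a pixel-first pipeline is that the screenshot shown to the model is often smaller than the physical screen, so coordinates returned by the model have to be scaled back before clicking. That resizing step is an assumption of this sketch, not something stated in this proposal, and `to_screen_coords` is a hypothetical helper.

```python
def to_screen_coords(model_xy: tuple[int, int],
                     image_size: tuple[int, int],
                     screen_size: tuple[int, int]) -> tuple[int, int]:
    """Map a point in the (possibly downscaled) screenshot to screen pixels."""
    mx, my = model_xy
    iw, ih = image_size
    sw, sh = screen_size
    if not (0 <= mx < iw and 0 <= my < ih):
        raise ValueError(f"point {model_xy} lies outside image of size {image_size}")
    # Scale each axis independently, then clamp to the last valid pixel.
    x = min(sw - 1, round(mx * sw / iw))
    y = min(sh - 1, round(my * sh / ih))
    return x, y
```

For example, a click at (150, 300) in a 1280x800 screenshot of a 2560x1600 display maps to (300, 600) on screen.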

This is the direction the industry is moving. OpenAI's Codex Computer Use launched in April 2026 with background desktop control. Anthropic added computer_use to Claude Code. ByteDance open-sourced UI-TARS with 33k+ stars and a pixel-first VLM. Nous Research's Hermes Agent has hermes computer-use install as a first-class command.

OpenClaw is the only major agent platform that does not have a native screen/click path. The Peekaboo skill path is macOS-only and not in the agent loop. The MCP path requires manual configuration that most users will never do.

Competitive reality

| | OpenClaw | Hermes | Claude Code | Codex |
|---|---|---|---|---|
| Native computer-use tool | ❌ | ✅ (MCP but first-class install) | ✅ (native) | ✅ (native) |
| Cross-platform | | via MCP | macOS (cua-driver) | macOS |
| Discoverable by default | | | | |
| Agent can iterate organically | | | | |

Once Hermes's cua-driver path stabilizes on Linux/Windows — and cua-driver already powers Codex's background computer-use — Hermes will be able to control any desktop application out of the box. OpenClaw agents will still be limited to exec, read, write, and browser.

Users will not care about OpenClaw's superior channel integrations when the competitor can actually use their computer.

History does not need to repeat

Issue #1754 and PR #1946 in January 2026 proposed exactly this: a computer tool integrated at the bash/read level, powered by cua-computer-server, tested and code-complete. The proposal was:

"computer is a primitive like bash or read — it's the low-level capability for GUI interaction, not a workflow."

The response was: "make it a plugin." Then feature freeze. Then silence. The code was already written. The tests passed. The cross-platform approach was ready.

The Cua team that proposed this went on to build cua-driver — now powering Hermes Agent's computer-use and OpenAI Codex's background desktop control. The cross-platform path was available in January 2026 and was not taken.

What I am requesting

  1. Add screenshot, click, type, scroll, move as native tools registered in the core tool set alongside exec, read, write, browser
  2. These tools should be visible in the default system prompt so every agent knows they exist
  3. Implementation should be pixel-first (screenshot → vision model → coordinate interaction) and OS-agnostic at the tool interface, with roughly 100 lines of platform-specific code underneath
  4. The implementation can use existing primitives: exec + platform-native screenshot/input binaries in the simplest case, or platform APIs where needed

This is not about replacing Peekaboo. Peekaboo can remain the macOS-native, accessibility-tree-powered, permission-aware bridge for users who want that. This is about giving every agent on every platform the fundamental ability to see and interact with a desktop — because without it, the assistant is not really an assistant.
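To make the requested tool surface concrete, here is one possible shape for the five tools' arguments, with a validator that a runtime could apply before dispatch. The parameter names (`x`, `y`, `text`, `dy`) and the `validate_call` helper are assumptions made for illustration; the request itself does not specify signatures beyond `click(x, y)` and `type(text)`.

```python
# Hypothetical argument specs for the five requested tools: name -> {param: type}.
COMPUTER_TOOLS: dict[str, dict[str, type]] = {
    "screenshot": {},
    "click": {"x": int, "y": int},
    "move": {"x": int, "y": int},
    "type": {"text": str},
    "scroll": {"x": int, "y": int, "dy": int},  # dy: assumed vertical scroll amount
}

def validate_call(name: str, args: dict) -> None:
    """Raise ValueError unless `args` matches the spec for tool `name` exactly."""
    if name not in COMPUTER_TOOLS:
        raise ValueError(f"unknown tool: {name}")
    spec = COMPUTER_TOOLS[name]
    if set(args) != set(spec):
        raise ValueError(f"{name} expects {sorted(spec)}, got {sorted(args)}")
    for key, expected in spec.items():
        value = args[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"{name}.{key} must be {expected.__name__}")
```

Rejecting malformed calls before they reach the platform backend keeps the OS-specific code small, since it can then assume well-typed integer coordinates and string input.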

References

  • [#1754] Proposal: Add computer Tool for Agentic UI Automation (closed, not implemented)
  • [#1946] feat(tools): add computer tool for GUI automation (code-complete PR, closed — feature freeze)
  • [#41024] feat: Native Computer Use UI Integration (closed — implemented as Peekaboo skill, not native tool)
  • [#47499] Dashboard Embedded VNC Viewer + Agent Full Desktop Takeover (open)
  • Anthropic: computer_use tool as native Claude capability
  • Anthropic: computer-use-demo — reference implementation using pixel + coordinates
  • ByteDance: UI-TARS-desktop — 33k+ stars, pixel-first VLM, cross-platform
  • Hermes Agent: hermes computer-use install — first-class computer-use command
