hermes: computer_use (cua-driver backend) is too fragile and breaks auxiliary vision routing

hermes, 2026-05-26 18:47:43

The computer_use tool (cua-driver backend) makes overly strong assumptions about the responses from the underlying driver. When list_windows returns no windows (e.g. due to is_on_screen filtering), or returns inconsistent data, the entire tool fails hard with no usable output.

This particularly breaks the ability to use auxiliary vision models with computer_use.

Reproduction

Use a text-only main model (e.g. DeepSeek, GLM, local models, etc.).
Configure an auxiliary vision model (auxiliary.vision).
Enable the computer_use toolset.
Attempt to use computer_use with mode=vision, or allow the agent to use desktop control.

Current Behavior

capture(...) always calls list_windows with {"on_screen_only": true} (see tools/computer_use/cua_backend.py:367).
If the driver returns zero windows (common when is_on_screen is false for every window), capture() immediately returns an empty 0x0 result with no image data.
There is no fallback path, no retry without the on_screen_only filter, and no client-side best-effort logic.
list_apps has similar parsing fragility and can return malformed results.
When the MCP connection to cua-driver has issues, the backend can be left in a broken state.

Impact on Auxiliary Vision

This is especially damaging for users running text-only models who rely on auxiliary.vision.

mode=vision captures are supposed to return raw screenshot data so the auxiliary vision model can analyze the screen.
Because the capture fails before any image is produced, the auxiliary vision model never receives any data.
As a result, it is currently not possible to use auxiliary vision models effectively with computer_use.

Expected Behavior

The integration should be resilient:

Fall back gracefully when on_screen_only returns no results.
Still produce usable (if lower quality) output when the driver behaves sub-optimally.
Support auxiliary vision workflows even when the driver’s on-screen detection is imperfect.
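
A minimal sketch of what a graceful fallback could look like, not the actual cua_backend.py code. It assumes a hypothetical list_windows callable that takes an arguments dict and returns a list of window dicts, and it assumes each window may carry a "bounds" dict with "width" and "height"; the real driver response may be shaped differently.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for the driver call: arguments in, window dicts out.
ListWindows = Callable[[Dict[str, Any]], List[Dict[str, Any]]]


def _has_area(window: Dict[str, Any]) -> bool:
    """Client-side filter: keep windows that report a non-zero size.

    The "bounds"/"width"/"height" field names are assumptions.
    """
    bounds = window.get("bounds")
    if not isinstance(bounds, dict):
        return False
    try:
        return float(bounds.get("width", 0)) > 0 and float(bounds.get("height", 0)) > 0
    except (TypeError, ValueError):
        return False


def list_windows_with_fallback(list_windows: ListWindows) -> List[Dict[str, Any]]:
    """Ask for on-screen windows first; if none come back, retry unfiltered."""
    windows = list_windows({"on_screen_only": True})
    if windows:
        return windows
    # The driver reported nothing on screen: retry without the filter and
    # apply a best-effort filter locally instead of giving up.
    candidates = list_windows({})
    visible = [w for w in candidates if _has_area(w)]
    # If the local filter rejects everything, hand back the unfiltered list
    # so the caller can still attempt a lower-quality capture.
    return visible or candidates
```

With something like this in front of the capture path, an empty on_screen_only result would lead to a second, unfiltered query rather than a 0x0 result, so mode=vision would still have a window to screenshot.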

Relevant Code

tools/computer_use/cua_backend.py:
- capture() (~366–393): Hard dependency on on_screen_only: true
- list_apps() (~627–642): Fragile structured/text fallback parsing
- MCP session handling in _CuaDriverSession
tools/computer_use/tool.py
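
For the list_apps() parsing, a more tolerant approach might look like the sketch below. It is illustrative only: parse_apps is a hypothetical helper, and the "apps" and "name" keys are assumptions about the driver's response rather than its documented format. The idea is to prefer structured data, fall back to JSON embedded in the text, and finally treat plain text as one app name per line, without ever raising on malformed input.

```python
import json
from typing import Any, Dict, List, Optional


def _extract(payload: Any) -> List[Dict[str, Any]]:
    """Pull well-formed app entries out of a dict or list; ignore the rest."""
    if isinstance(payload, dict):
        payload = payload.get("apps")  # assumed wrapper key
    if not isinstance(payload, list):
        return []
    # Keep only dict entries that carry a non-empty "name" (assumed field).
    return [a for a in payload if isinstance(a, dict) and a.get("name")]


def parse_apps(structured: Any, text: Optional[str]) -> List[Dict[str, Any]]:
    """Return a list of app dicts from whichever part of the response is usable."""
    apps = _extract(structured)
    if apps:
        return apps
    text = text or ""
    try:
        decoded = json.loads(text)
    except ValueError:
        # Not JSON: treat each non-empty line as an app name.
        return [{"name": line.strip()} for line in text.splitlines() if line.strip()]
    return _extract(decoded)
```

An empty list then means "the driver gave nothing usable", which callers can report explicitly instead of passing malformed entries along.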

Impact

On affected systems, computer_use becomes largely unusable.
Text-only models + auxiliary.vision lose desktop control capabilities entirely.
The feature is unreliable for anyone who depends on real computer use, not just users hitting edge cases in the driver.

Suggested Improvements

Add a fallback in capture(): if on_screen_only: true returns nothing, retry without the filter and do client-side filtering.
Make list_apps() more robust when parsing driver responses.
Add basic health checks and recovery for the cua-driver MCP connection.
Consider a "best effort" capture mode that is more tolerant of imperfect driver output.
Improve error messages so users (and agents) understand when the driver is the limiting factor.
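
The connection recovery and error-message suggestions could be combined roughly as follows. This is a sketch under stated assumptions, not a description of _CuaDriverSession: RecoveringSession, DriverUnavailableError, the connect factory, and the synchronous call_tool(tool, args) method are all hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Dict, Optional


class DriverUnavailableError(RuntimeError):
    """Raised when the driver, not the tool logic, is the limiting factor."""


class RecoveringSession:
    """Wraps a driver session and reconnects once it looks broken."""

    def __init__(self, connect: Callable[[], Any], max_attempts: int = 2) -> None:
        self._connect = connect          # hypothetical factory for a fresh session
        self._max_attempts = max_attempts
        self._session: Optional[Any] = None

    def call(self, tool: str, args: Dict[str, Any]) -> Any:
        last_error: Optional[BaseException] = None
        for _ in range(self._max_attempts):
            try:
                if self._session is None:
                    self._session = self._connect()
                return self._session.call_tool(tool, args)
            except OSError as exc:  # includes ConnectionError and TimeoutError
                # Drop the broken session so the next attempt reconnects
                # instead of reusing a dead connection.
                self._session = None
                last_error = exc
        raise DriverUnavailableError(
            f"cua-driver did not answer '{tool}' after "
            f"{self._max_attempts} attempts: {last_error}"
        ) from last_error
```

The dedicated exception type gives users and agents a message that names the driver as the failing component, rather than an empty result with no explanation.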

Additional Context

This was discovered while debugging real-world failures combining text-only models, auxiliary vision routing, and the cua-driver backend. The current design assumes the driver will reliably report on-screen windows, which does not always hold.
