hermes: [Bug] computer_use multimodal tool message causes 400 error on providers that don't support multimodal tool content (e.g. Xiaomi MiMo) [2 pull requests]

When using computer_use with a vision-capable model (e.g. xiaomi/mimo-v2.5), the tool captures a screenshot and returns a _multimodal dict with content as a list containing both text and image_url parts. This list is then set as the content of the role: "tool" message sent to the API.

However, MiMo's API does not accept list-type content in tool messages — it requires content to be a string for role: "tool". This causes a 400 error:

Error code: 400 - {'error': {'code': '400', 'message': 'Param Incorrect', 'param': 'text is not set', 'type': ''}}
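
To make the failure concrete, the two payload shapes can be sketched as below. The ids, text, and image data are illustrative assumptions, not values taken from Hermes; only the string-versus-list distinction comes from the report.

```python
# Illustrative payloads only; ids, text, and image data are made up.

# Shape MiMo rejects: "content" is a list of parts on a role "tool" message.
rejected_tool_message = {
    "role": "tool",
    "tool_call_id": "call_123",  # hypothetical id
    "content": [
        {"type": "text", "text": "Screenshot captured."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
    ],
}

# Shape MiMo accepts: "content" is a plain string.
accepted_tool_message = {
    "role": "tool",
    "tool_call_id": "call_123",
    "content": "Screenshot captured.",
}
```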

Root Cause

In run_agent.py, _tool_result_content_for_active_model() (lines 9621-9659) checks _model_supports_vision() to decide whether to pass the multimodal content through or fall back to a text summary.

_model_supports_vision() (lines 9479-9497) uses agent.models_dev.get_model_capabilities(), which checks modalities.input from models.dev. For mimo-v2.5, modalities.input = ['text', 'image', 'audio', 'video'], so supports_vision returns True.

The bug: _model_supports_vision() checks whether the model supports images in user messages, but not whether it supports images in tool messages. These are different things:

  • Most OpenAI-compatible providers support images in user messages (via content as a list)
  • But many providers (including MiMo) require tool message content to be a string, not a list

The OpenAI Chat Completions spec defines tool message content as a string (or an array of text-only parts), with no image parts. Some providers accept multimodal tool messages anyway (Anthropic and GPT-4o are the examples given here), but MiMo does not.
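
The gap in the vision check can be illustrated with a small stand-in. The function below is a hypothetical simplification, not the actual body of _model_supports_vision(); it only mirrors the described behaviour of looking for image input in modalities.input.

```python
def supports_vision(modalities_input: list) -> bool:
    """Hypothetical stand-in for the check described above: image input is listed."""
    return "image" in modalities_input


# Value quoted in the report for mimo-v2.5.
mimo_input_modalities = ["text", "image", "audio", "video"]
vision_ok = supports_vision(mimo_input_modalities)

# modalities.input says nothing about which message roles may carry image
# parts, so a True result here does not show that role "tool" messages may
# hold list content.
```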

Fix Action

Fixed

Reproduction

  1. Configure Hermes with xiaomi/mimo-v2.5 as the main model
  2. Call computer_use(action='capture', mode='som')
  3. The tool returns _multimodal content with an image
  4. _tool_result_content_for_active_model returns the content list (because supports_vision=True)
  5. The tool message with list content is sent to the MiMo API
  6. The MiMo API returns 400: text is not set
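
The steps above can be walked through offline with stand-ins. Everything in this sketch is an assumption made for illustration: the shape of the tool result (including the text_summary key), the strict_provider_check function that emulates a provider requiring string content, and the SimulatedBadRequest exception. It does not call Hermes or the MiMo API.

```python
class SimulatedBadRequest(Exception):
    """Stands in for the provider's HTTP 400 response."""


def strict_provider_check(message: dict) -> None:
    # Emulates a provider that requires string content on role "tool" messages.
    if message["role"] == "tool" and not isinstance(message["content"], str):
        raise SimulatedBadRequest("400: text is not set")


# Steps 2-3: assumed shape of the capture result.
tool_result = {
    "_multimodal": True,
    "content": [
        {"type": "text", "text": "Captured screen."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
    ],
    "text_summary": "Captured screen.",
}

# Step 4: the vision check alone decides, so the list passes through.
model_supports_vision = True
content = tool_result["content"] if model_supports_vision else tool_result["text_summary"]

# Steps 5-6: the tool message with list content is rejected.
message = {"role": "tool", "tool_call_id": "call_1", "content": content}
try:
    strict_provider_check(message)
    outcome = "accepted"
except SimulatedBadRequest as exc:
    outcome = str(exc)
```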

Suggested Fix

Add a provider/model-level flag for supports_multimodal_tool_content (or similar) that controls whether multimodal content is allowed in tool messages specifically. Providers that don't support it should always receive string content (the text_summary fallback).
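
A minimal sketch of such a flag, assuming a simple provider profile object. ProviderProfile, tool_content, and the result dict shape are hypothetical names introduced for illustration; the real provider profiles in Hermes may be structured differently.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ProviderProfile:
    """Hypothetical profile; the flag defaults to the safe (string-only) value."""
    name: str
    supports_multimodal_tool_content: bool = False


def tool_content(profile: ProviderProfile, model_supports_vision: bool,
                 result: dict) -> Union[str, list]:
    """Return list content only when both the model and the provider allow it."""
    if model_supports_vision and profile.supports_multimodal_tool_content:
        return result["content"]
    return result["text_summary"]


result = {
    "content": [{"type": "text", "text": "Captured screen."}],
    "text_summary": "Captured screen.",
}
xiaomi = ProviderProfile("xiaomi")
anthropic = ProviderProfile("anthropic", supports_multimodal_tool_content=True)
```

With this gating, a vision-capable model behind the xiaomi profile still receives the string summary, while the anthropic profile keeps the list.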

Possible approaches:

  1. Provider-specific flag: Add supports_multimodal_tool_content = False to the xiaomi provider profile
  2. Conservative default: Only use multimodal tool content for providers known to support it (Anthropic, OpenAI), and use text summary for all others
  3. Fallback with retry: Send multimodal content first; if it fails, retry with text summary (but this wastes a round-trip)

Option 2 is the safest and most backward-compatible.
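
Option 2 could look roughly like the allowlist below. The constant name, both helper functions, and the assumption that model ids have the form "provider/model" (as in xiaomi/mimo-v2.5) are illustrative, not part of Hermes.

```python
# Hypothetical allowlist; the two entries are the providers named in option 2.
MULTIMODAL_TOOL_CONTENT_PROVIDERS = frozenset({"anthropic", "openai"})


def provider_of(model_id: str) -> str:
    # Assumes ids shaped like "provider/model", as in "xiaomi/mimo-v2.5".
    return model_id.split("/", 1)[0]


def allows_multimodal_tool_content(provider: str) -> bool:
    # Anything not explicitly listed falls back to string content.
    return provider.strip().lower() in MULTIMODAL_TOOL_CONTENT_PROVIDERS
```

Because unknown providers default to the text summary, a newly added provider would degrade to text-only tool results rather than fail with a 400.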

Related

  • A related MiMo compatibility issue was previously reported in GitHub issue #27325 (MiMo thinking parameter bug). That issue is about reasoning_content being stripped, which is a different problem from this one.
  • The _tool_result_content_for_active_model method already has the right fallback logic for non-vision models (lines 9640-9659), but it doesn't apply to vision-capable models that don't support multimodal tool messages.
