openclaw - ✅(Solved) Fix [Bug]: WEBUI / Edge cases —— Image input causes some multimodal models to misuse tool calling [1 pull requests, 1 participants]

openclaw · 2026-04-07 13:23:58

GitHub stats

openclaw/openclaw#62514 • Fetched 2026-04-08 03:03:15

Author: xiaoHEI-312

Participants: xiaoHEI-312

Timeline (top): cross-referenced ×2, labeled ×1

In the Web UI, sending an image can cause some multimodal models with weak tool-calling capabilities to misuse tool calls. As a result, the image input may be ignored, or the model may hallucinate what the image contains.

Error Message

Errors appear in the logs in one of two forms:

12:29:45+00:00 error [tools] image failed: Local media file not found: /Users/hyf/.openclaw/workspace/.openclaw_cache/attachments/2026-04-07_2026-04-07T12_28_28_672Z_image_0.jpg raw_params={"prompt":"分析这张图片，描述图中包含的元素、场景和特点。请详细说明看到的内容，包括自然景观、颜色、物体等。","image":"/Users/hyf/.openclaw/workspace/.openclaw_cache/attachments/2026-04-07_2026-04-07T12_28_28_672Z_image_0.jpg"}

12:46:09+00:00 warn agents/tools/image {"subsystem":"agents/tools/image"} image tool local input missing raw=https://picsum.photos/id/1018/800/600 resolved=https://picsum.photos/id/1018/800/600 source=local-path

(The `prompt` in the first line is Chinese for "Analyze this image and describe the elements, scene, and features it contains. Please describe in detail what you see, including natural scenery, colors, objects, etc.")

Root Cause

Fix Action

Fixed

Fixed by PR: fix(chat): webui image chat fix (https://github.com/openclaw/openclaw/pull/62523)

PR fix notes

PR #62523: fix(chat): webui image chat fix

Repository: openclaw/openclaw
Author: xiaoHEI-312
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/62523

Description (problem / solution / changelog)

Summary

- Problem: vision-capable runs still injected the `image` tool even when the current user turn already included native image input.
- Why it matters: models with weak tool-calling discipline could ignore the tool description, hallucinate a local path or an unrelated remote URL, and call `image` against the wrong target.
- What changed: the embedded runner now marks runs that already carry prompt images, and `createOpenClawCodingTools()` removes `image` from that turn's tool list when the selected model already supports native image input.
- What did NOT change (scope boundary): image handling and storage, prompt-image loading, and non-vision or text-only runs behave as before.
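The runtime guard described under "What changed" can be pictured as a filter over the turn's tool list. The sketch below is illustrative only and is not the code from PR #62523: the `ToolDef` and `TurnContext` types and the `filterToolsForTurn` helper are hypothetical stand-ins for whatever `createOpenClawCodingTools()` actually does; only the tool names come from the debug log.

```typescript
// Hypothetical tool shape; the real OpenClaw types are not shown in the issue.
interface ToolDef {
  name: string;
  description: string;
}

// Hypothetical per-turn flags.
interface TurnContext {
  modelSupportsImages: boolean; // selected model accepts images natively (assumption)
  turnHasPromptImages: boolean; // current user turn already carries images (assumption)
}

// Drop the fallback `image` tool only when both conditions hold;
// text-only turns and non-vision models keep the full tool list.
function filterToolsForTurn(tools: ToolDef[], ctx: TurnContext): ToolDef[] {
  if (ctx.modelSupportsImages && ctx.turnHasPromptImages) {
    return tools.filter((tool) => tool.name !== "image");
  }
  return tools;
}

const tools: ToolDef[] = [
  { name: "read", description: "Read a file" },
  { name: "image", description: "Analyze images with a vision model" },
  { name: "web_fetch", description: "Fetch a URL" },
];

const visionTurn = filterToolsForTurn(tools, {
  modelSupportsImages: true,
  turnHasPromptImages: true,
});
console.log(visionTurn.map((t) => t.name)); // [ 'read', 'web_fetch' ]
```

Requiring both flags mirrors the scope boundary above: the tool disappears only for vision-capable runs that already carry images, so every other run keeps the fallback.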

Linked Issue/PR

Closes #62514
Related #
This PR fixes a bug or regression

Root Cause (if applicable)

- Root cause: the runtime relied on the tool description to discourage `image` tool use, but still exposed the tool even when the model already had native image input for the current turn.
- Missing detection / guardrail: no runtime guard removed `image` from the tool list when `params.images` or the prompt image order already indicated native image input.
- Contributing context (if known): glm-4.6v could ignore the descriptive prohibition and hallucinate nonexistent local attachment paths or unrelated public image URLs.
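The missing guardrail amounts to a small predicate: hide the fallback tool only when the model is vision-capable and the turn already carries images. A minimal sketch, assuming a hypothetical `RunParams` shape (the field names `images` and `promptImageCount` are loosely modeled on the `inputImages`/`promptImages` counters in the debug log, not taken from the OpenClaw source) and an assumed `modelInputs` capability list:

```typescript
// Hypothetical subset of run parameters; field names are assumptions.
interface RunParams {
  images?: readonly unknown[]; // images attached to the current user turn
  promptImageCount?: number;   // images detected and loaded from the prompt text
}

// True when the current turn already carries native image input.
function carriesNativeImages(params: RunParams): boolean {
  const attached = params.images?.length ?? 0;
  const fromPrompt = params.promptImageCount ?? 0;
  return attached + fromPrompt > 0;
}

// Keep the fallback `image` tool unless the model sees images natively
// AND this turn already includes them. `modelInputs` is an assumed
// capability list such as ["text", "image"].
function shouldExposeImageTool(params: RunParams, modelInputs: readonly string[]): boolean {
  const nativeVision = modelInputs.includes("image");
  return !(nativeVision && carriesNativeImages(params));
}

console.log(shouldExposeImageTool({ images: [{}] }, ["text", "image"])); // false
console.log(shouldExposeImageTool({}, ["text", "image"]));               // true
console.log(shouldExposeImageTool({ images: [{}] }, ["text"]));          // true
```

The third call reflects the unchanged behavior for non-vision models: they still need the `image` tool to delegate image analysis to a vision model.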

User-visible / Behavior Changes

When a vision-capable model already receives image input in the current turn, the assistant no longer exposes the fallback image tool for that turn, reducing bogus image-tool calls against hallucinated paths or URLs.

Diagram

Before:
[user sends image] -> [native image input added] -> [image tool still injected] -> [model may hallucinate path/URL] -> [wrong image tool call]

After:
[user sends image] -> [native image input added] -> [image tool removed for this turn] -> [model answers from native image input]

## Changed files

- `src/agents/pi-embedded-runner/run/attempt.ts` (modified, +2/-0)
- `src/agents/pi-tools.ts` (modified, +12/-1)

Code Example

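Some debug logs. The user's prompt 图中有什么 means "What's in the picture?"; the model's tool-call prompt means "Describe the content of this image, including the scene, elements, and overall atmosphere."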

12:46:03+00:00 info agent/embedded {"subsystem":"agent/embedded"} embedded run tool defs: runId=480fd177-8207-4beb-98f9-9b88be22503c sessionKey=agent:main:main provider=zai/glm-4.6v builtInTools=0 customTools=18 customToolNames=read,edit,write,exec,process,cron,sessions_list,sessions_history,sessions_send,sessions_yield,sessions_spawn,subagents,session_status,web_search,web_fetch,image,memory_search,memory_get imageTool=name=image description=Analyze one or more images with a vision model. Use image for a single path/URL, or images for multiple (up to 20). Only use this tool when images were NOT already provided in the user's message. Images mentioned in the prompt are automatically visible to you. parameters={"type":"object","properties":{"prompt":{"type":"string"},"image":{"description":"Single image path or URL.","type":"string"},"images":{"description":"Multiple image paths or URLs (up to maxImages, default 20).","type":"array","items":{"type":"string"}},"model":{"type":"string"},"maxBytesMb":{"type":"number"},"maxImages":{"type":"number"}}}
12:46:03+00:00 info agent/embedded {"subsystem":"agent/embedded"} embedded run model request: runId=480fd177-8207-4beb-98f9-9b88be22503c sessionKey=agent:main:main provider=zai/glm-4.6v historyMessages=7 roleCounts=assistant:2,toolResult:4,user:1 historyImageBlocks=0 inputImages=1 imageOrder=inline detectedPromptRefs=0 loadedPromptImages=0 skippedPromptImages=0 promptImages=1 systemPromptChars=46346 promptChars=144 systemPromptPreview=You are a personal assistant operating inside OpenClaw. ## Tooling Structured tool definitions are the source of truth for tool names, descriptions, and parameters. Tool names are case-sensitive. Call tools exactly as listed in the structured tool definitions. If a tool is pre... promptPreview=Sender (untrusted metadata):  { "label": "openclaw-control-ui", "id": "openclaw-control-ui" }  [Tue 2026-04-07 20:46 GMT+8] 图中有什么
12:46:09+00:00 info agent/embedded {"subsystem":"agent/embedded"} embedded run tool call received: runId=480fd177-8207-4beb-98f9-9b88be22503c rawTool=image tool=image toolCallId=call_eaf0f08426a843bcadedd802 meta=描述这张图片的内容，包括场景、元素和整体氛围 args={"prompt":"描述这张图片的内容，包括场景、元素和整体氛围","image":"https://picsum.photos/id/1018/800/600"}
12:46:09+00:00 info agents/tools/image {"subsystem":"agents/tools/image"} image tool local input raw=https://picsum.photos/id/1018/800/600 resolved=https://picsum.photos/id/1018/800/600 exists=false source=local-path
12:46:09+00:00 warn agents/tools/image {"subsystem":"agents/tools/image"} image tool local input missing raw=https://picsum.photos/id/1018/800/600 resolved=https://picsum.photos/id/1018/800/600 source=local-path
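When reviewing logs like these, the telltale pattern is a model request with `inputImages` greater than zero followed by an `image` tool call in the same `runId`. The diagnostic sketch below scans log lines for that pattern. It relies only on the `key=value` fields visible above; the helper names are hypothetical, and the sample lines are abbreviated copies of the logs above.

```typescript
// Collect key=value pairs from a log line; the first occurrence of a key wins.
function parseFields(line: string): Map<string, string> {
  const fields = new Map<string, string>();
  for (const match of line.matchAll(/(\w+)=(\S+)/g)) {
    if (!fields.has(match[1])) fields.set(match[1], match[2]);
  }
  return fields;
}

// Return toolCallIds of `image` calls made in runs that already had input images.
function findSuspectImageCalls(lines: string[]): string[] {
  const imagesByRun = new Map<string, number>();
  const suspects: string[] = [];
  for (const line of lines) {
    const f = parseFields(line);
    const runId = f.get("runId");
    if (!runId) continue;
    if (line.includes("embedded run model request")) {
      imagesByRun.set(runId, Number(f.get("inputImages") ?? "0"));
    } else if (line.includes("embedded run tool call received") && f.get("tool") === "image") {
      if ((imagesByRun.get(runId) ?? 0) > 0) suspects.push(f.get("toolCallId") ?? "(unknown)");
    }
  }
  return suspects;
}

// Abbreviated copies of the debug log lines above.
const sample = [
  "12:46:03+00:00 info agent/embedded embedded run model request: runId=480fd177-8207-4beb-98f9-9b88be22503c provider=zai/glm-4.6v inputImages=1 imageOrder=inline promptImages=1",
  "12:46:09+00:00 info agent/embedded embedded run tool call received: runId=480fd177-8207-4beb-98f9-9b88be22503c rawTool=image tool=image toolCallId=call_eaf0f08426a843bcadedd802",
];
console.log(findSuspectImageCalls(sample)); // [ 'call_eaf0f08426a843bcadedd802' ]
```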


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

Steps to reproduce

Start the Web UI and open a new session.
Choose a multimodal model (glm-4.6v). Note: this is an edge case that only occurs with weaker multimodal models; state-of-the-art models such as GPT-5.4 do not have this issue.
Send an image and check the logs.

Expected behavior

The model should understand the image and respond correctly.

Actual behavior

Errors appear in the logs in one of two forms. The first:

12:29:45+00:00 error [tools] image failed: Local media file not found: /Users/hyf/.openclaw/workspace/.openclaw_cache/attachments/2026-04-07_2026-04-07T12_28_28_672Z_image_0.jpg raw_params={"prompt":"分析这张图片，描述图中包含的元素、场景和特点。请详细说明看到的内容，包括自然景观、颜色、物体等。","image":"/Users/hyf/.openclaw/workspace/.openclaw_cache/attachments/2026-04-07_2026-04-07T12_28_28_672Z_image_0.jpg"}

Here the local image path "/Users/hyf/.openclaw/workspace/.openclaw_cache/attachments/2026-04-07_2026-04-07T12_28_28_672Z_image_0.jpg" is hallucinated: no such file exists, so the tool fails with "Local media file not found".

The second:

12:46:09+00:00 warn agents/tools/image {"subsystem":"agents/tools/image"} image tool local input missing raw=https://picsum.photos/id/1018/800/600 resolved=https://picsum.photos/id/1018/800/600 source=local-path

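Here the image URL "https://picsum.photos/id/1018/800/600" is hallucinated: it is unrelated to the image the user sent, and the tool treats it as a local path (`source=local-path`), so the lookup fails.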
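Both cases share one trait: the argument passed to the `image` tool names nothing the user actually sent. A complementary safeguard, not described in PR #62523, would be to reject such arguments before the tool runs. The sketch below assumes a hypothetical `Attachment` record per uploaded image; the `/tmp/example-upload/...` path is illustrative only, while the two rejected references are the hallucinated ones from the logs.

```typescript
// Hypothetical record of an image the user attached in the current turn.
interface Attachment {
  path?: string; // local cache path, if the upload was written to disk
  url?: string;  // remote URL, if the image came from a link
}

// Accept an image-tool reference only if it names one of this turn's attachments.
function isKnownAttachment(ref: string, attachments: readonly Attachment[]): boolean {
  return attachments.some((a) => a.path === ref || a.url === ref);
}

// Hypothetical attachment list; the path is illustrative only.
const turnAttachments: Attachment[] = [{ path: "/tmp/example-upload/image_0.jpg" }];

// The two hallucinated references from the logs above.
const hallucinated = [
  "/Users/hyf/.openclaw/workspace/.openclaw_cache/attachments/2026-04-07_2026-04-07T12_28_28_672Z_image_0.jpg",
  "https://picsum.photos/id/1018/800/600",
];
for (const ref of hallucinated) {
  console.log(isKnownAttachment(ref, turnAttachments)); // false
}
console.log(isKnownAttachment("/tmp/example-upload/image_0.jpg", turnAttachments)); // true
```

Exact string comparison is deliberately naive; a real check would likely need to normalize paths and handle images sent inline with no path or URL at all, as in this report (`imageOrder=inline`).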

OpenClaw version

2026.4.6

Operating system

macOS

Install method

pnpm dev

Model

glm-4.6v

Provider / routing chain

openclaw - zai - glm-4.6v

Additional provider/model setup details

No response

Logs, screenshots, and evidence

The debug logs are identical to those listed under Code Example above.

Impact and severity

Models with weaker tool-calling capabilities may produce errors or hallucinated image descriptions, even though their underlying multimodal capability is sufficient for the recognition task itself.

Additional information

No response

Extended analysis

TL;DR

The issue can be mitigated by not exposing the `image` tool on turns where a vision-capable model already receives the image natively, rather than relying on the tool description to discourage its use. This is the approach taken in PR #62523.

Guidance

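Check whether the path or URL passed to the `image` tool matches the image the user actually sent. In the reported logs, both the local path and the URL were hallucinated, so the missing file is a symptom of the model's tool call rather than a file-system problem.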
Check the model's documentation and configuration to ensure that it is correctly set up to handle image inputs and tool calls.
Consider using a state-of-the-art model like GPT-5.4, which does not exhibit this issue, as a temporary workaround.
Review the logs to identify any patterns or common factors that may be contributing to the errors or hallucinations.

Example

No code snippet is provided here: the root cause is model behavior (ignoring the tool description), and the fix in PR #62523 is a runtime guard on the tool list rather than a correction to a specific faulty line.

Notes

The issue appears to be specific to weaker multimodal models. The logs show that the image was already attached inline (`inputImages=1`, `imageOrder=inline`), yet the model still called the `image` tool with a nonexistent local path or an unrelated URL, which the tool then failed to load (`exists=false`).

Recommendation

Until PR #62523 is merged, work around the issue by using a model with stronger tool-calling discipline, such as GPT-5.4, which does not exhibit this issue.
