hermes - ✅(Solved) Fix Native multimodal mode: expose image path in user text for tool-parameter use [1 pull requests, 1 participants]

GitHub stats

NousResearch/hermes-agent#18960 · Fetched 2026-05-03 04:53:18
Comments: 0 · Participants: 1 · Timeline events: 5 · Reactions: 0
Timeline (top): labeled ×4, cross-referenced ×1

Root Cause

Custom MCP tools that operate on images (issue trackers, OCR-to-endpoint, image-to-DB) cannot work cleanly in native mode without this path exposure. Today the alternatives are:

  • Force image_input_mode: text: doubles the LLM call (auxiliary vision + main) and loses the native vision quality the user paid for.
  • Custom plugin per installation (works, not portable).

A first-party opt-in flag unifies the experience and lives under the same agent.* namespace as the existing image_input_mode.

Fix Action

Fix / Workaround

Workaround (Plugin-Based)

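We have implemented a Hermes plugin (ocp-image-path-inject) that hooks into pre_llm_call and prepends [Image attached at: <path>] to the user message when native image parts are present. The workaround is functional, but it requires per-installation setup, touches a hot path that is under prompt-cache discipline, and adds friction for any user with a custom image-handling tool.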

PR fix notes

PR #19032: fix(image-routing): expose attached image paths in native multimodal text part

Description (problem / solution / changelog)

Summary

  • In native image mode (vision-capable main models — gpt-4o, claude-sonnet-4, etc.), the local file path of each attached image is now appended to the user-text part of the multimodal turn as [Image attached at: <path>].
  • This lets the model use the path as a string argument for tools that take image_url: str (custom MCP tools, vision_analyze on a re-look, attach-to-tracker workflows) without an extra round-trip.
  • Mirrors the equivalent text-mode hint already produced by Runner._enrich_message_with_vision (use vision_analyze with image_url: <path>).

The bug

agent/image_routing.py::build_native_content_parts emits [{"type": "text", "text": "<user caption>"}, {"type": "image_url", ...}]. The image bytes are inlined for native vision, but the file path never enters the conversation text.

Result: a model can see the image, but if asked to invoke an MCP tool like task_attach_image(task_id: int, image_url: str), it has no string handle to pass and responds "I see the image, but the file URL is not in my context, so I cannot pass it to task_attach_image."

The text-mode path (gateway/run.py::_enrich_message_with_vision, lines 10607-10611 / 10616 / 10623) already injects the path: [If you need a closer look, use vision_analyze with image_url: {path}]. Native mode lost that capability when it bypassed _enrich_message_with_vision for vision-capable models in v0.12.0 (#16506).

The fix

build_native_content_parts now collects successfully attached paths and appends one [Image attached at: <path>] line per image to the user-text part:

  • Single text part — no extra parts that some providers handle inconsistently.
  • Skipped (unreadable) paths are NOT advertised — the model is never told a non-existent file is attached.
  • Empty user caption still falls back to the existing neutral prompt (What do you see in this image?), with hints appended after.
  • Format mirrors the text-mode hint so model behaviour is consistent across both image input modes.

No change to image bytes, MIME inference, oversize handling, or the reactive shrink-on-reject loop in run_agent._try_shrink_image_parts_in_messages.
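
The behaviour described in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in agent/image_routing.py: the function name build_parts_with_path_hints, the base64 data-URL encoding, and the image/png MIME fallback are hypothetical. Only the part shapes, the hint format, and the default prompt are taken from the PR description.

```python
import base64
import mimetypes
from pathlib import Path

DEFAULT_PROMPT = "What do you see in this image?"  # neutral prompt named in the PR


def build_parts_with_path_hints(user_text, image_paths):
    """Hypothetical stand-in for build_native_content_parts (illustration only)."""
    image_parts = []
    attached = []  # only paths whose bytes were actually read
    for raw in image_paths:
        path = Path(raw)
        try:
            data = path.read_bytes()
        except OSError:
            continue  # unreadable: skip it and never advertise the path
        mime = mimetypes.guess_type(path.name)[0] or "image/png"  # assumed fallback
        b64 = base64.b64encode(data).decode("ascii")
        image_parts.append(
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}
        )
        attached.append(str(path))

    # Empty caption falls back to the neutral prompt; hints go after it.
    text = user_text.strip() or DEFAULT_PROMPT
    hints = [f"[Image attached at: {p}]" for p in attached]
    if hints:
        text = text + "\n\n" + "\n".join(hints)
    # Single text part first, then one image part per attached file.
    return [{"type": "text", "text": text}] + image_parts
```

Keeping the hints inside the one text part, rather than emitting extra text parts, matches the first bullet above; the separator between caption and hints is a guess.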

Test plan

  • Focused regression test: tests/agent/test_image_routing.py — 29 tests, all passing. Added two new tests (test_path_hint_appended, test_path_hint_one_per_attached_image) and updated the two equality-checked tests (test_text_then_image, test_empty_text_inserts_default_prompt) to reflect the new contract.
  • Adjacent suite: tests/agent/test_compressor_image_tokens.py, tests/run_agent/test_copilot_native_vision_headers.py, tests/gateway/test_native_image_buffer_isolation.py — all 17 pass with pytest-asyncio (matches CI's .[all,dev] install).
  • Regression guard: with the production change reverted but the new tests left in place, test_path_hint_appended, test_path_hint_one_per_attached_image, and the updated test_text_then_image all fail (assert 0 == 1, no [Image attached at: substring), confirming the new contract is what the production code provides.
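
For readers who want a feel for the contract these tests pin down, here is a small self-contained sketch. The helper append_path_hints is hypothetical, and the test bodies are guesses that only borrow the names mentioned above; the real tests in tests/agent/test_image_routing.py are not reproduced here.

```python
def append_path_hints(text, attached_paths):
    """Hypothetical helper: one hint line per successfully attached image."""
    hints = [f"[Image attached at: {p}]" for p in attached_paths]
    return "\n\n".join([text, "\n".join(hints)]) if hints else text


def test_path_hint_appended():
    out = append_path_hints("add this to the task", ["/tmp/a.png"])
    assert out.count("[Image attached at: ") == 1
    assert out.endswith("[Image attached at: /tmp/a.png]")


def test_path_hint_one_per_attached_image():
    out = append_path_hints("compare these", ["/tmp/a.png", "/tmp/b.png"])
    assert out.count("[Image attached at: ") == 2


def test_no_hint_without_attachments():
    # Nothing attached means the caption is left untouched.
    assert append_path_hints("hello", []) == "hello"
```

With the hint logic removed, the count assertions would see zero occurrences of the substring, which is the same failure shape the regression guard above describes.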

Related

  • Fixes #18960
  • Implements parity with the text-mode hint added alongside _enrich_message_with_vision (gateway/run.py:10607-10623).
  • Native multimodal routing landed in #16506; this is a follow-up enhancement on the same code path.

Sibling code paths that may need the same fix: any other caller that constructs multimodal content parts directly (e.g. CLI image-attach flows in run_agent.py). Intentionally left out of this PR's scope to keep the diff small — happy to widen if preferred. The current invocation site is gateway/run.py:12753-12757, which is the only user-facing path that calls build_native_content_parts today.

Changed files

  • agent/image_routing.py (modified, +32/-11)
  • tests/agent/test_image_routing.py (modified, +47/-4)

Code Example

@mcp.tool()
def task_attach_image(task_id: int, image_url: str) -> dict: ...

---

[... use vision_analyze with image_url: /local/path ...]

---

agent:
  native_vision_path_hint: true  # default: false

Native multimodal mode: expose image path in user text for tool-parameter use

Problem

When a vision-capable main model (e.g. gpt-4o, gpt-5.5, claude-sonnet-4) receives an image via the gateway in image_input_mode: native (image bytes inlined as image_url content part), the local file path is never exposed as plain text in the conversation context.

This means the model can see the image (full native vision) but cannot reference the file as a string parameter for tools that expect a path or URL — for example, a custom MCP tool that takes an image_url: str argument and attaches the image to a downstream tracker / OCR endpoint / issue tracker.

The text-mode path (_enrich_message_with_vision) already injects this hint ("...use vision_analyze with image_url: /local/path..."), but the native-mode path that replaced it for vision-capable models in v0.12.0 has no equivalent.

Reproduction

  1. Configure Hermes with a vision-capable main model (e.g. gpt-5.5 via openai-codex, or claude-sonnet-4).
  2. Leave agent.image_input_mode unset (default auto → resolves to native for vision-capable models).
  3. Register a custom MCP tool that expects an image URL/path:
    @mcp.tool()
    def task_attach_image(task_id: int, image_url: str) -> dict: ...
  4. From Slack, send an image with the caption "Bunu task #123'e ekle" (Turkish for "Add this to task #123").
  5. Model sees the image (correct OCR/description on demand) but responds:

    "I see the image, but the file URL is not in my context, so I cannot pass it to task_attach_image."

Verified live in Pi 5 production deployment with Slack gateway serving a non-English operations team — v0.12.0 (commit 73bf3ab1b), gpt-5.5 via openai-codex, 2026-05-02. The use case is real day-to-day workflow (operations manager attaching screenshots/photos to internal tracker tasks via Slack), not a synthetic local-dev reproduction.

Code Pointers

  • agent/image_routing.py::build_native_content_parts — only constructs {"type": "image_url", ...} parts; no companion text injection of the source path. The function's input user_text is the original message; the path string never enters the parts list.
  • gateway/run.py::_enrich_message_with_vision — TEXT mode already injects something equivalent for the auxiliary path:
    [... use vision_analyze with image_url: /local/path ...]
    Native mode skips this branch entirely (see _decide_image_input_mode at gateway/run.py:9407 and the if _img_mode == "native" short-circuit at gateway/run.py:4691).
  • gateway/run.py near line 11604 — build_native_content_parts is invoked, but the user-text part it produces does not include any path reference.

Expected Behaviour

In native mode, after the image_url content part is constructed, optionally append a brief text note like [Image attached at: /local/path] to the user text part of the same multimodal turn. This parallels the text-mode injection but for native mode where vision_analyze is not called.

The model then has both:

  1. The pixels (native vision quality preserved).
  2. A string reference to use as a tool argument when needed.
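
The difference between the two states can be made concrete with a short sketch. The helper path_handles and the sample messages are hypothetical; the hint format is the one proposed above. Without the hint, the text part holds no string a tool call could use. With it, the path is recoverable.

```python
import re

# Matches the hint format proposed in this issue.
HINT_RE = re.compile(r"\[Image attached at: (.+?)\]")


def path_handles(content_parts):
    """Return every path advertised in the text parts of one user turn."""
    handles = []
    for part in content_parts:
        if part.get("type") == "text":
            handles.extend(HINT_RE.findall(part.get("text", "")))
    return handles


image_part = {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}}

# Current native mode: caption plus pixels, no path anywhere in the text.
before = [{"type": "text", "text": "Add this to task #123"}, image_part]

# Expected behaviour: same turn, with the hint appended to the text part.
after = [
    {"type": "text", "text": "Add this to task #123\n\n[Image attached at: /local/path]"},
    image_part,
]

print(path_handles(before))
print(path_handles(after))
```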

Workaround (Plugin-Based)

We have implemented a Hermes plugin (ocp-image-path-inject) that hooks into pre_llm_call and prepends [Image attached at: <path>] to the user message when native image parts are present. The workaround is functional but:

  • Requires per-installation setup (we run Slack gateway on Raspberry Pi 5 in production).
  • Touches a hot path that is already under careful prompt-cache discipline (per AGENTS.md "Prompt Caching Must Not Break"); a first-party config flag would let upstream guarantee correctness for everyone.
  • Adds friction for any user with a custom image-handling tool, which is a growing class as MCP adoption increases.

The plugin source is shareable on request — comment if interested, I'll publish a standalone repo (hermes-plugin-image-path-inject). It contains only the Hermes plugin pattern + cache path whitelist, no application-internal code.
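
Until the plugin is published, the general shape of such a hook might look like the sketch below. Everything here is an assumption made for illustration: the real pre_llm_call signature is not shown in this issue, and the function names, the message layout, and the ALLOWED_ROOT cache directory are hypothetical. It only demonstrates the two ideas named above, prepending the hint and whitelisting cache paths.

```python
from pathlib import Path

ALLOWED_ROOT = Path("/tmp/hermes-image-cache")  # hypothetical cache directory


def is_whitelisted(path, root=ALLOWED_ROOT):
    """True only if the resolved path stays inside the allowed cache root."""
    try:
        Path(path).resolve().relative_to(Path(root).resolve())
        return True
    except ValueError:
        return False


def inject_path_hints(message, image_paths, root=ALLOWED_ROOT):
    """Hypothetical hook body: prepend hints to a native multimodal user turn."""
    content = message.get("content")
    if not isinstance(content, list):
        return message  # plain-text turn, nothing to do
    if not any(p.get("type") == "image_url" for p in content):
        return message  # no native image parts present
    hints = [f"[Image attached at: {p}]" for p in image_paths if is_whitelisted(p, root)]
    if not hints:
        return message
    prefix = "\n".join(hints)
    new_content, injected = [], False
    for part in content:
        if part.get("type") == "text" and not injected:
            # Copy the part instead of mutating the caller's message.
            part = {**part, "text": f"{prefix}\n\n{part.get('text', '')}"}
            injected = True
        new_content.append(part)
    if not injected:
        new_content.insert(0, {"type": "text", "text": prefix})
    return {**message, "content": new_content}
```

Resolving the path before the containment check means a path such as /tmp/hermes-image-cache/../../etc/passwd is rejected rather than advertised to the model.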

Proposed Solution

Add an opt-in config flag:

agent:
  native_vision_path_hint: true  # default: false

When true, native mode also injects [Image attached at: <path>] (one line per attached image) into the text part of the multimodal user turn.

Default false to preserve current prompt-cache behaviour — the path string changes per upload, and would invalidate cached prefixes for repeated images. Users who need tool-parameter access opt in and accept the trade-off.
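
A minimal sketch of how the opt-in gate could behave, assuming the YAML config has already been parsed into a nested dict. The function names are hypothetical and are not part of Hermes; only the key agent.native_vision_path_hint, its default of false, and the hint format come from this proposal.

```python
def path_hint_enabled(config):
    """Read agent.native_vision_path_hint, defaulting to False when absent."""
    agent_cfg = config.get("agent") or {}
    return bool(agent_cfg.get("native_vision_path_hint", False))


def user_text_for_native_turn(user_text, attached_paths, config):
    """Append one hint line per attached image only when the flag is on."""
    if not path_hint_enabled(config) or not attached_paths:
        return user_text  # default: text part is left exactly as it was
    hints = "\n".join(f"[Image attached at: {p}]" for p in attached_paths)
    return f"{user_text}\n\n{hints}"


# Flag absent: unchanged text. Flag set: hint appended.
print(user_text_for_native_turn("Add this", ["/local/path"], {}))
print(user_text_for_native_turn(
    "Add this", ["/local/path"], {"agent": {"native_vision_path_hint": True}}
))
```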

Why This Matters

Custom MCP tools that operate on images (issue trackers, OCR-to-endpoint, image-to-DB) cannot work cleanly in native mode without this path exposure. Today the alternatives are:

  • Force image_input_mode: text: doubles the LLM call (auxiliary vision + main) and loses the native vision quality the user paid for.
  • Custom plugin per installation (works, not portable).

A first-party opt-in flag unifies the experience and lives under the same agent.* namespace as the existing image_input_mode.

Related

  • #15288 (open, P2, comp/gateway) — Adds inbound mode (preprocess vs direct). Adjacent but orthogonal: that issue is about whether to send pixels natively; this one is about what extra metadata to expose alongside the pixels in native mode. Both can ship independently.
  • #13065 / #7641 (open) — Native vision support feature requests; v0.12.0 / PR #16506 partially addresses by making native routing automatic for vision-capable models. Tool-parameter path exposure is not covered there.
  • #5661 (closed, not planned) — Earlier passthrough proposal; superseded.

Environment

  • Hermes v0.12.0 (commit 73bf3ab1b22314ed9dfecbb59242c03742fe72af)
  • Slack gateway on Raspberry Pi 5, systemd --user service
  • Main model: gpt-5.5 via openai-codex provider
  • Custom MCP server with image-handling tools registered via the mcp block in ~/.hermes/config.yaml

Default-Value Rationale

Default false is deliberate:

  • Preserves prompt-cache behaviour for the majority of users who only need vision (not tool-parameter access). Image paths are per-upload unique strings and would invalidate cache hits.
  • Opt-in users explicitly accept the trade-off in exchange for tool-parameter functionality.
  • Matches the existing pattern of conservative defaults in agent.* config keys (e.g. image_input_mode: auto, busy_ack_enabled opt-in).
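
The cache argument can be illustrated with a toy model. This is a simplification: real provider prompt caches match token prefixes, not a hash of the turn text, and the upload paths below are made up. It only shows that the same caption stops being byte-identical once a per-upload path is appended.

```python
import hashlib


def turn_key(text):
    """Toy cache key: a hash of the user-turn text (real caches match token prefixes)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


caption = "What do you see in this image?"
# Two uploads of the same picture land at different (hypothetical) cache paths.
upload_a = "/cache/img_1714640000.png"
upload_b = "/cache/img_1714643600.png"

without_hint = (turn_key(caption), turn_key(caption))
with_hint = (
    turn_key(f"{caption}\n\n[Image attached at: {upload_a}]"),
    turn_key(f"{caption}\n\n[Image attached at: {upload_b}]"),
)

print(without_hint[0] == without_hint[1])  # identical text, identical key
print(with_hint[0] == with_hint[1])        # path differs, so the key differs
```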

A follow-up PR is being prepared with the implementation, tests, and docs updates for review.

extent analysis

TL;DR

Add an opt-in config flag native_vision_path_hint to expose the image path in user text for tool-parameter use in native multimodal mode.

Guidance

  • Implement the proposed solution by adding a config flag native_vision_path_hint with a default value of false to preserve current prompt-cache behavior.
  • When native_vision_path_hint is true, inject the image path into the text part of the multimodal user turn, allowing custom MCP tools to access the image URL/path.
  • Test the implementation with various image-handling tools and scenarios to ensure compatibility and correctness.
  • Consider the trade-off between prompt-cache behavior and tool-parameter functionality when opting in to the new feature.

Example

# Example config.yaml snippet
agent:
  native_vision_path_hint: true

Notes

The proposed solution requires careful consideration of the prompt-cache behavior and its impact on performance. The default value of false ensures that the majority of users who only need vision (not tool-parameter access) are not affected.

Recommendation

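Adopt the proposed native_vision_path_hint config flag rather than treating it as a workaround: it is an opt-in solution for users who need tool-parameter access in native multimodal mode, and it lets them explicitly accept the trade-off between prompt-cache behavior and tool-parameter functionality. Note that the PR #19032 description quoted above says the hint is appended in native mode and does not mention this flag, so check the merged change before relying on the config key.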
