hermes - 💡(How to fix) Fix Pass native images to multimodal models instead of always going through vision enrichment pipeline [2 comments, 3 participants]

GitHub stats
NousResearch/hermes-agent#16862 · Fetched 2026-04-29 06:38:25
Comments: 2 · Participants: 3 · Timeline: 6 · Reactions: 0
Timeline (top): labeled ×4, commented ×2


Problem

When a user sends an image through any gateway platform (Telegram, Feishu, WeChat, Discord, etc.), the gateway always runs it through _enrich_message_with_vision(), which makes a separate vision analysis tool call and replaces the original image with a text description before passing it to the LLM:

# gateway/run.py: _enrich_message_with_vision(), lines 8237-8280
result_json = await vision_analyze_tool(image_url=path, user_prompt=analysis_prompt)
# (excerpt: the step that turns result_json into result is not shown here)
description = result.get("analysis", "")
# Result: "[The user sent an image~ Here's what I can see:\n{description}]"

The agent receives only text like:

[The user sent an image~ Here's what I can see:
This image shows a screenshot of a dashboard with...
]

This means:

  • Multimodal models never actually see the original image. They only get a text description from a separate (often weaker) vision model.
  • Information loss: details the multimodal model could pick up directly (fine print, layout, color, specific icons) are lost or distorted in the vision-to-text roundtrip.
  • Extra latency and cost: every user image requires an extra API call to the vision analysis tool before the main LLM even begins reasoning.

Expected Behavior

For multimodal models (GPT-4o, Claude 3.5 Sonnet/Opus, Gemini 2.0, GLM-4V, Qwen-VL, etc.), the cached local image should be passed directly to the LLM as a native image_url content part in the OpenAI-style messages array:

# What should happen for multimodal models:
{
    "role": "user",
    "content": [
        {"type": "text", "text": user_text or ""},
        {"type": "image_url", "image_url": {"url": f"file://{cached_path}"}}
    ]
}
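
One caveat on the payload above: hosted OpenAI-style endpoints generally expect an http(s) URL or a base64 data: URL in image_url, so a file:// path may only work with backends that can read the gateway's local disk. The following is a minimal sketch of the data-URL variant, assuming the target provider accepts inline base64 images. build_image_content is a hypothetical helper for illustration, not an existing hermes-agent function.

```python
import base64
import mimetypes
from pathlib import Path


def build_image_content(user_text, cached_path):
    """Build an OpenAI-style content array with the image inlined as a data URL.

    Hypothetical helper for illustration; not part of hermes-agent.
    """
    path = Path(cached_path)
    # Guess the MIME type from the file extension; fall back to a generic type.
    mime, _ = mimetypes.guess_type(path.name)
    mime = mime or "application/octet-stream"
    # Read the cached file and encode it so it can travel inside the request body.
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return [
        {"type": "text", "text": user_text or ""},
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
    ]
```

Inlining trades request size for portability: the image no longer depends on the model server sharing a filesystem with the gateway.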

Only fall back to the current _enrich_message_with_vision() text-description approach for text-only models.

Proposed Implementation

The key change is in gateway/run.py inside _handle_message_with_agent(), around lines 3910-3924:

  • Detect whether the current model supports multimodal (vision) input, possibly from agent/model_metadata.py.
  • If yes → build a content array with type: image_url pointing to the cached local file path, instead of enriching the message string.
  • If no → keep existing behavior (use _enrich_message_with_vision() as fallback).

AIAgent.run_conversation() currently takes user_message: str (line 8630). It would also need to accept structured messages (content arrays with image_url entries), or a parallel interface like conversation_history-style message dicts.

Related

#15576: Feishu/WeChat images fail with "Invalid image source", a different bug in the same enrichment pipeline (empty paths not filtered).
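
The detect-then-branch logic described under Proposed Implementation could look roughly like the sketch below. Everything named here is an assumption for illustration: the prefix table stands in for whatever agent/model_metadata.py actually exposes, prepare_user_message is a hypothetical function, and enrich is a placeholder for _enrich_message_with_vision(), whose real signature is not shown in the issue.

```python
# Assumption: a static prefix table stands in for agent/model_metadata.py.
# The prefixes are illustrative, not an authoritative capability list.
VISION_MODEL_PREFIXES = ("gpt-4o", "claude-3-5", "gemini-2.0")


def supports_vision(model_name):
    """Return True if the model name starts with a known multimodal prefix."""
    return model_name.lower().startswith(VISION_MODEL_PREFIXES)


async def prepare_user_message(model_name, user_text, image_paths, enrich):
    """Return a native content array for vision models, else enriched text.

    `enrich` is an async callable standing in for _enrich_message_with_vision().
    """
    # Drop empty paths up front (the failure mode described in #15576).
    paths = [p for p in image_paths if p]
    if not paths:
        return user_text or ""
    if supports_vision(model_name):
        # Native path: one text part plus one image_url part per cached image,
        # using the file:// URL form proposed in the issue.
        content = [{"type": "text", "text": user_text or ""}]
        content += [
            {"type": "image_url", "image_url": {"url": f"file://{p}"}} for p in paths
        ]
        return content
    # Text-only model: keep the existing description-based behavior.
    return await enrich(user_text, paths)
```

Keeping the fallback behind the same function means text-only models see no behavior change, while the return type (str or list) makes the new requirement on AIAgent.run_conversation() explicit.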

extent analysis

TL;DR

Modify the _handle_message_with_agent() function in gateway/run.py to conditionally pass the original image to multimodal models instead of using the _enrich_message_with_vision() text description approach.

Guidance

  • Check the model's metadata to determine if it supports multimodal input, possibly using agent/model_metadata.py.
  • If the model supports multimodal input, build a content array with a type: image_url entry pointing to the cached local file path.
  • Update AIAgent.run_conversation() to accept structured messages with image_url entries, either by modifying the existing user_message: str parameter or adding a parallel interface.
  • Consider filtering out empty paths to prevent "Invalid image source" errors, as seen in related issue #15576.
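
For the run_conversation() change in the guidance above, one option is to widen the parameter type and normalize it at the boundary. This is only a sketch under assumptions: to_user_message_dict and the UserMessage alias are hypothetical names, and how AIAgent stores messages internally is not shown in the issue.

```python
from typing import Any, Dict, List, Union

ContentPart = Dict[str, Any]
# Hypothetical widened type for the user_message parameter.
UserMessage = Union[str, List[ContentPart]]


def to_user_message_dict(user_message: UserMessage) -> Dict[str, Any]:
    """Wrap a plain string or a content-part list in an OpenAI-style user message."""
    if isinstance(user_message, str):
        # Existing callers that pass plain text keep working unchanged.
        return {"role": "user", "content": user_message}
    if isinstance(user_message, list):
        # Reject malformed parts early instead of failing inside the API call.
        for part in user_message:
            if not isinstance(part, dict) or "type" not in part:
                raise ValueError(f"invalid content part: {part!r}")
        return {"role": "user", "content": list(user_message)}
    raise TypeError(f"unsupported user_message type: {type(user_message).__name__}")
```

Accepting both forms in one parameter avoids a parallel interface, at the cost of every downstream consumer of the message having to handle list content.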

Example

# Example of building a content array with image_url entry
content = [
    {"type": "text", "text": user_text or ""},
    {"type": "image_url", "image_url": {"url": f"file://{cached_path}"}}
]

Notes

The proposed implementation requires changes to the _handle_message_with_agent() function and the AIAgent.run_conversation() method. The exact implementation details may vary depending on the specific requirements and constraints of the project.

Recommendation

Apply the proposed change by modifying the _handle_message_with_agent() function and the AIAgent.run_conversation() method to support native image input for compatible models. Compared with the current text-description approach, this reduces information loss, latency, and cost.
