hermes - 💡(How to fix) Fix Add config option to pass inbound images directly to multimodal main models [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#15288Fetched 2026-04-25 06:23:13
View on GitHub
Comments
1
Participants
2
Timeline
5
Reactions
0
Timeline (top)
labeled ×4commented ×1

Error Message

  • If the provider/model cannot handle images, return a clear error rather than silently falling back.

Code Example

vision:
  inbound_mode: preprocess          # current behavior / safest default
  # inbound_mode: direct_if_supported
  # inbound_mode: always_direct
RAW_BUFFERClick to expand / collapse

Problem

Telegram and CLI image attachments are currently always preprocessed through the auxiliary vision pipeline before reaching the main agent model:

  • Telegram media is cached locally and stored on event.media_urls / event.media_types.
  • gateway/run.py::_prepare_inbound_message_text() detects image media and calls _enrich_message_with_vision().
  • _enrich_message_with_vision() calls tools.vision_tools.vision_analyze_tool() with a broad "Describe everything visible..." prompt.
  • The resulting text description plus local image path is prepended to the user's message.
  • CLI follows the same pattern in cli.py::_preprocess_images_with_vision().

This means capable multimodal main models never receive the original image content directly from Telegram/CLI. They receive a lossy textual description produced by a separate auxiliary vision model.

Why this is a design issue

The current design is useful as a compatibility fallback for text-only models, but it is a poor default for multimodal-capable main models:

  1. Lossy information transfer

    • Screenshots, UI details, diagrams, code blocks, tables, layout, colors, and small visual cues can be lost or distorted during the auxiliary description step.
  2. Wrong model does the important perceptual work

    • If the user deliberately selected a strong multimodal main model, the image should ideally be interpreted by that model, not by an auxiliary model that may be weaker, cheaper, slower, or differently configured.
  3. More latency and cost

    • Every image-bearing message triggers an extra vision call before the real model call.
    • Existing related issue: #10809 shows that the default vision preprocessing can produce very long descriptions and slow down image-bearing requests.
  4. Behavior is surprising

    • Users expect a multimodal main model to receive the image directly.
    • Instead Hermes silently converts image input into text, which changes the semantics of the interaction.
  5. Hard to evaluate multimodal models accurately

    • When benchmarking model quality in Hermes, the main model is not actually being tested on image understanding if the image has already been summarized by another model.

Proposed solution

Add a config-gated inbound image mode, for example:

vision:
  inbound_mode: preprocess          # current behavior / safest default
  # inbound_mode: direct_if_supported
  # inbound_mode: always_direct

Suggested semantics:

  • preprocess

    • Current behavior.
    • Analyze image with auxiliary vision model and inject text description.
    • Best compatibility with text-only models and prompt caching.
  • direct_if_supported

    • If the configured main provider/model supports multimodal image content, pass the cached local image as an OpenAI-style image content part (image_url / base64 data URL) into the main conversation.
    • If unsupported or the image is too large / incompatible, fall back to the current preprocessing path.
  • always_direct

    • Always pass image content directly to the main model.
    • If the provider/model cannot handle images, return a clear error rather than silently falling back.

Implementation notes

There already appears to be lower-level multimodal support in run_agent.py for image content parts (image_url / input_image handling). The missing piece is that Telegram/CLI inbound paths normally convert images to text before the main conversation sees them.

Likely touch points:

  • gateway/run.py::_prepare_inbound_message_text()
  • gateway/run.py::_enrich_message_with_vision()
  • cli.py::_preprocess_images_with_vision()
  • possibly the message construction path before AIAgent.run_conversation()

A minimal first version could keep preprocess as the default and add direct_if_supported only for providers/models known to accept OpenAI-style image content.

Expected behavior

When using a multimodal-capable main model, sending a Telegram/CLI image should allow the main model to inspect the original image directly, preserving the current auxiliary vision preprocessing only as an explicit mode or fallback.

extent analysis

TL;DR

To fix the issue, add a config-gated inbound image mode to pass images directly to multimodal-capable main models, falling back to auxiliary vision preprocessing when necessary.

Guidance

  • Introduce a new configuration option vision.inbound_mode with values preprocess, direct_if_supported, and always_direct to control image processing.
  • Update gateway/run.py::_prepare_inbound_message_text() and cli.py::_preprocess_images_with_vision() to respect the chosen inbound_mode.
  • Implement logic to detect if the main model supports multimodal image content and pass the image directly if direct_if_supported mode is chosen.
  • Consider adding error handling for cases where the main model cannot handle images when always_direct mode is used.

Example

vision:
  inbound_mode: direct_if_supported

This configuration would allow multimodal-capable main models to receive images directly while falling back to auxiliary vision preprocessing for unsupported models or incompatible images.

Notes

The implementation should be careful to preserve the current behavior as the default to maintain compatibility with text-only models. The direct_if_supported mode seems like a reasonable starting point, as it balances the need for direct image processing with the requirement for compatibility.

Recommendation

Apply the workaround by introducing the vision.inbound_mode configuration option and updating the relevant code paths to respect the chosen mode. This approach allows for a flexible and controlled transition to direct image processing for multimodal-capable main models.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

When using a multimodal-capable main model, sending a Telegram/CLI image should allow the main model to inspect the original image directly, preserving the current auxiliary vision preprocessing only as an explicit mode or fallback.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - 💡(How to fix) Fix Add config option to pass inbound images directly to multimodal main models [1 comments, 2 participants]