hermes - Image auto-routing cannot use native vision when auxiliary.vision is configured as fallback

Bug Description

When agent.image_input_mode is left at the default auto, an explicit auxiliary.vision override forces all user-attached images through the text-description pipeline, even when the active main model supports native vision.

This makes it impossible to keep a dedicated auxiliary vision backend for text-only main models (for example DeepSeek) while still allowing a vision-capable main model (for example GPT-5.5 / OpenAI-family multimodal models) to inspect the original image pixels natively in the same profile.

In practice, the current config model creates a conflict:

- A DeepSeek main model needs auxiliary.vision explicitly set to a vision-capable backend (Gemini flash), because DeepSeek itself does not support image input.
- GPT-5.5-style main models should receive attached images natively, because they support vision and should not be limited to a lossy auxiliary text summary.
- With the current routing logic, the explicit auxiliary.vision setting wins in auto mode and prevents the native path from being used for vision-capable main models.

Current Config Example

model:
  provider: deepseek
  default: deepseek-v4-pro

agent:
  image_input_mode: auto

auxiliary:
  vision:
    provider: gemini
    model: gemini-2.5-flash

This is a useful config for DeepSeek: uploaded images are analyzed by Gemini and converted to text for the text-only main model.

However, when switching the main model/session to a vision-capable GPT-5.5-style model, the same profile still routes uploaded images through vision_analyze instead of attaching the image natively to the main model.

Relevant Code Path

agent/image_routing.py documents and implements the decision:

# In auto mode:
#   - If the user has explicitly configured auxiliary.vision.provider
#     (i.e. not auto and not empty), we assume they want the text pipeline
#     regardless of the main model

And:

if _explicit_aux_vision_override(cfg):
    return "text"

supports = _lookup_supports_vision(provider, model)
if supports is True:
    return "native"
return "text"

So an explicit auxiliary vision backend has higher priority than the active model's supports_vision=True capability.
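
The precedence problem can be reproduced in isolation. The following is a minimal sketch, not the actual Hermes implementation: the two helper bodies, the capability table, the `decide_auto` name, and the `openai` / `gpt-5.5` identifiers are hypothetical stand-ins chosen for illustration. Only the three-step decision at the end mirrors the excerpt quoted above.

```python
from typing import Optional

# Hypothetical capability table; the real lookup in Hermes may work differently.
_VISION_CAPABLE = {
    ("openai", "gpt-5.5"): True,            # placeholder for a vision-capable main model
    ("deepseek", "deepseek-v4-pro"): False,  # text-only main model from the issue
}


def _explicit_aux_vision_override(cfg: dict) -> bool:
    """Stand-in: True when auxiliary.vision.provider is set and is not 'auto'."""
    vision = (cfg.get("auxiliary") or {}).get("vision") or {}
    provider = str(vision.get("provider") or "").strip().lower()
    return provider not in ("", "auto")


def _lookup_supports_vision(provider: str, model: str) -> Optional[bool]:
    """Stand-in: True/False when known, None when the model is unknown."""
    return _VISION_CAPABLE.get((provider, model))


def decide_auto(cfg: dict, provider: str, model: str) -> str:
    """Same ordering as the quoted excerpt: the aux override is checked first."""
    if _explicit_aux_vision_override(cfg):
        return "text"
    supports = _lookup_supports_vision(provider, model)
    if supports is True:
        return "native"
    return "text"


cfg = {"auxiliary": {"vision": {"provider": "gemini", "model": "gemini-2.5-flash"}}}
print(decide_auto(cfg, "openai", "gpt-5.5"))  # explicit aux override short-circuits
print(decide_auto({}, "openai", "gpt-5.5"))   # no override, capability check is reached
```

With the auxiliary block present, the capability lookup is never reached, which is why the vision-capable model still gets the text pipeline.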

Expected Behavior

There should be a config-only way to express this policy:

Use native image input whenever the active main model supports vision; otherwise fall back to the configured auxiliary vision backend.

That would allow one profile to work correctly for both:

- text-only main models such as DeepSeek, using auxiliary.vision as fallback/text-description backend;
- vision-capable main models such as GPT-5.5, using native multimodal input.

Actual Behavior

With agent.image_input_mode: auto and any explicit auxiliary.vision provider/model, Hermes always chooses the text-description path for user-attached images, even if the active main model supports native vision.

The only current config workarounds are incomplete:

- Set agent.image_input_mode: native
  - fixes vision-capable main models;
  - but breaks or risks API errors for text-only models such as DeepSeek.
- Clear auxiliary.vision.provider/model
  - allows auto to choose native for vision-capable main models;
  - but removes the explicit fallback needed for text-only main models.
- Use separate Hermes profiles
  - works operationally;
  - but splits sessions/config/state and is not a real fix for a routing-policy limitation.

Proposed Solution

Add a routing policy that gives native-capable main models priority while keeping auxiliary vision as fallback for text-only models.

Possible config shapes:

agent:
  image_input_mode: auto_native_first

or:

agent:
  image_input_mode: auto
  image_native_priority: true

or under auxiliary vision:

auxiliary:
  vision:
    provider: gemini
    model: gemini-2.5-flash
    use_as_fallback_only: true

Behavior:

- If agent.image_input_mode: native, force native as today.
- If agent.image_input_mode: text, force auxiliary/text as today.
- If the native-first / fallback-only policy is enabled (through any of the config shapes above):
  - check active main model capabilities first;
  - if supports_vision=True, return native;
  - otherwise use the configured auxiliary.vision text pipeline.
- Keep the current behavior as default if backward compatibility is preferred.
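
The behavior listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not a patch against agent/image_routing.py: the function name `decide_image_mode` and the injected `supports_vision` callable are hypothetical, and all three proposed config shapes are accepted at once only to show that they reduce to the same flag. None of `auto_native_first`, `image_native_priority`, or `use_as_fallback_only` exists in Hermes today.

```python
from typing import Callable, Optional


def decide_image_mode(
    cfg: dict,
    provider: str,
    model: str,
    supports_vision: Callable[[str, str], Optional[bool]],
) -> str:
    """Return 'native' or 'text' for a user-attached image (hypothetical sketch)."""
    agent = cfg.get("agent") or {}
    vision = (cfg.get("auxiliary") or {}).get("vision") or {}
    mode = str(agent.get("image_input_mode", "auto")).lower()

    # Forced modes keep their current meaning.
    if mode == "native":
        return "native"
    if mode == "text":
        return "text"

    # Any of the three proposed (not yet existing) switches enables the policy.
    native_first = (
        mode == "auto_native_first"
        or agent.get("image_native_priority") is True
        or vision.get("use_as_fallback_only") is True
    )

    aux_provider = str(vision.get("provider") or "").strip().lower()
    explicit_aux = aux_provider not in ("", "auto")

    # Backward-compatible default: an explicit aux backend still wins in auto.
    if explicit_aux and not native_first:
        return "text"

    # Native-first: capability check decides, aux vision is the fallback.
    return "native" if supports_vision(provider, model) is True else "text"
```

Passing the capability lookup in as a parameter keeps the sketch independent of how Hermes actually resolves supports_vision, and leaving the policy off by default preserves today's routing for existing profiles.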

Why This Needs Source Changes

The conflict cannot be solved cleanly with current config fields because the only available modes are:

- auto: explicit auxiliary.vision forces text pipeline before checking main model vision capability;
- native: unsafe for text-only providers;
- text: disables native vision entirely.

There is no existing config value for “native if supported, otherwise use configured auxiliary vision fallback”.
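
The gap can be checked by brute force. The sketch below uses a deliberately simplified model of the three existing modes as this issue describes them; the `route` and `meets_goal` names are hypothetical. It enumerates every combination of mode and "auxiliary vision explicitly configured" and looks for one that satisfies both requirements at once.

```python
def route(mode: str, aux_explicit: bool, supports_vision: bool) -> str:
    """Simplified model of the existing modes, as described in this issue."""
    if mode == "native":
        return "native"
    if mode == "text":
        return "text"
    # mode == "auto": an explicit aux backend wins before the capability check.
    if aux_explicit:
        return "text"
    return "native" if supports_vision else "text"


def meets_goal(mode: str, aux_explicit: bool) -> bool:
    """Goal: explicit aux backend for text-only models, native for vision models."""
    text_only_path = route(mode, aux_explicit, supports_vision=False)
    vision_path = route(mode, aux_explicit, supports_vision=True)
    return aux_explicit and text_only_path == "text" and vision_path == "native"


hits = [
    (mode, aux_explicit)
    for mode in ("auto", "native", "text")
    for aux_explicit in (True, False)
    if meets_goal(mode, aux_explicit)
]
print(hits)  # no combination qualifies, so the list is empty
```

Under this model no setting qualifies: keeping the explicit auxiliary backend always costs the native path in auto, and every other option either drops the backend or forces a single path for all models.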

Environment

Hermes Agent source checkout: NousResearch/hermes-agent
OS: WSL / Linux
Main text model example: deepseek + deepseek-v4-pro
Auxiliary vision example: gemini + gemini-2.5-flash
Desired vision-capable main model behavior: GPT-5.5-style native multimodal input
