hermes - 💡(How to fix) Fix Feature Request: Automatic vision fallback for non-vision primary models

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Fix Action

Fix / Workaround

Current workaround

  1. /new (lose context)
  2. Switch primary model to Gemini/GPT-4o
  3. Send image again
  4. Switch back to DeepSeek

Code Example

unknown variant `image_url`, expected `text`

---

User sends image + text
       |
       v
+----------------------+
| Vision Pre-processor |
| 1. Detect image_url  |
| 2. Check if primary  |
|    model has vision  |
| 3. If NOT: send to   |
|    auxiliary.vision  |
|    -> get descriptn  |
| 4. Replace image_url |
|    with text block   |
+----------------------+
       |
       v
  DeepSeek receives
  text-only message OK

---

auxiliary:
  vision:
    provider: google
    model: gemini-2.5-flash
    auto_fallback: true   # NEW
RAW_BUFFERClick to expand / collapse

Problem

When using a primary model that does not support vision (e.g. DeepSeek), sending an image causes the entire request to fail with:

unknown variant `image_url`, expected `text`

The failure happens at the API level — DeepSeek rejects image_url before the agent sees a response. The agent cannot recover.

Current workaround

  1. /new (lose context)
  2. Switch primary model to Gemini/GPT-4o
  3. Send image again
  4. Switch back to DeepSeek

This is disruptive — especially on Telegram where voice+image messages are common.


Proposed Solution

Pre-processing layer in the agent loop that intercepts image attachments before they reach the primary model:

User sends image + text
       |
       v
+----------------------+
| Vision Pre-processor |
| 1. Detect image_url  |
| 2. Check if primary  |
|    model has vision  |
| 3. If NOT: send to   |
|    auxiliary.vision  |
|    -> get descriptn  |
| 4. Replace image_url |
|    with text block   |
+----------------------+
       |
       v
  DeepSeek receives
  text-only message OK

Config addition

auxiliary:
  vision:
    provider: google
    model: gemini-2.5-flash
    auto_fallback: true   # NEW

Benefits

  • Seamless UX — no /new, no model switching, no context loss
  • Best-of-both-worlds — cheap text models for reasoning + specialized vision models for images
  • Works mid-session — images just work
  • No API changes — purely client-side preprocessing

Edge Cases

  • Multiple images per message — process sequentially or parallel
  • Mixed content — preserve text alongside descriptions
  • Vision model failure — graceful degradation with a note
  • Cost tracking — auxiliary calls should appear in usage stats
  • Streaming — vision must complete before primary model stream begins

Submitted via Hermes Agent on behalf of Vitaliy Li

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING