hermes - Feature: First-class native vision support for vision-capable main models (with reference implementation + bug findings)

Feature Request: First-Class Native Vision Support for Vision-Capable Main Models

Summary

Hermes currently routes ALL image analysis through an auxiliary vision model (qwen3-vl or configured alternative), even when the main model is natively vision-capable (e.g., gpt-4o, glm-5v-turbo, claude-sonnet-4). This adds unnecessary latency, cost, and information loss (text description ≠ seeing the image).

We implemented a working native vision bypass (native_vision: true config toggle) that sends images directly to the main model as multimodal content blocks — and in doing so, we discovered several underlying issues in Hermes's conversation pipeline that would affect ANY multimodal content, not just our hack.

This issue serves dual purpose:

  1. Feature request: Upstream native vision support (with our implementation as reference)
  2. Bug findings: Pipeline issues that should be fixed regardless

Our Reference Implementation

We patched five files to add native vision support:

| File | What We Added |
| --- | --- |
| tools/browser_tool.py | Skip aux vision, return _native_vision marker with base64 data URL |
| tools/vision_tools.py | Skip aux vision, return _native_vision marker |
| run_agent.py | Two injection handlers that convert markers → multimodal image_url blocks |
| gateway/run.py | Upload path: resize + encode images, return marker |
| agent/context_compressor.py | Pass 4: evict old image blocks from context |

Full patch code: see native-vision-patch skill (community-maintained)
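
The marker-to-block conversion described for run_agent.py might look roughly like the sketch below. It is not the patch itself: the marker shape (a JSON object with `_native_vision` set to true plus `data_url` and `text` keys) and the function name `marker_to_content` are assumptions made for illustration. Only the `_native_vision` name and the OpenAI-style image_url block come from the issue.

```python
import json
from typing import Any, Dict, List, Union


def marker_to_content(tool_result: str) -> Union[List[Dict[str, Any]], str]:
    """Convert a hypothetical native-vision marker into multimodal content.

    Assumed marker shape (not confirmed by the issue):
        {"_native_vision": true, "data_url": "data:image/...", "text": "..."}
    Anything that is not such a marker is returned unchanged.
    """
    try:
        payload = json.loads(tool_result)
    except (json.JSONDecodeError, TypeError):
        return tool_result
    if not isinstance(payload, dict) or not payload.get("_native_vision"):
        return tool_result
    data_url = payload.get("data_url")
    if not data_url:
        return tool_result
    return [
        {"type": "text", "text": payload.get("text", "")},
        {"type": "image_url", "image_url": {"url": data_url, "detail": "auto"}},
    ]
```

Returning non-marker results untouched keeps the handler safe to run on every tool result, whether or not native vision is enabled.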

Bugs We Discovered (Affect All Multimodal Content)

These are issues in Hermes's core pipeline that would impact any attempt to send image data through the conversation system:

1. Missing detail: "auto" on Image Injection (Token Bloat)

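When constructing OpenAI-format image_url content blocks, Hermes sets no detail parameter. The author reports that images were then tokenized at full resolution, as if detail: "high" had been set. (OpenAI's own documentation lists "auto" as the default when detail is omitted, so the observed behavior may depend on the provider or proxy in use; setting the value explicitly removes the ambiguity.)

Impact: by the author's measurements, a typical phone photo costs ~121-230K tokens at full resolution vs ~2-5K with detail: "auto". The author summarizes this as a 25-60× reduction; the endpoints of the two ranges give roughly 24× to 115×.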

Fix location (3 injection points):

  • run_agent.py: tool result injection (~L8465)
  • run_agent.py: user upload injection (~L8852)
  • tools/browser_tool.py: auxiliary vision fallback (~L2155)

# Before:
{"type": "image_url", "image_url": {"url": data_url}}

# After:
{"type": "image_url", "image_url": {"url": data_url, "detail": "auto"}}

Reference: OpenAI Vision API docs
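
Since the same literal appears at three injection points, one option is a small shared builder so the detail setting cannot be forgotten again. The helper name `image_block` is hypothetical and does not exist in Hermes; the accepted values ("auto", "low", "high") are the ones the OpenAI API documents.

```python
from typing import Any, Dict

# Values documented for the OpenAI image_url "detail" field.
ALLOWED_DETAIL = {"auto", "low", "high"}


def image_block(data_url: str, detail: str = "auto") -> Dict[str, Any]:
    """Build an OpenAI-format image_url content block with an explicit detail."""
    if detail not in ALLOWED_DETAIL:
        raise ValueError(f"unsupported detail value: {detail!r}")
    return {"type": "image_url", "image_url": {"url": data_url, "detail": detail}}
```

Each injection point would then call `image_block(data_url)` instead of building the dict inline.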

2. Context Compressor Explicitly Skips Multimodal Content

agent/context_compressor.py — all 3 compression passes have this guard:

if isinstance(content, list):
    continue  # ← skips EVERY multimodal block including images

Impact: Image data is never evicted, so N turns carrying M images produce unbounded context growth. The compressor reports "failed after 3 attempts" when 60%+ of the context is incompressible image data.

Fix: Add a Pass 4 that replaces old image blocks (outside the protected tail) with lightweight text placeholders. This follows the same pattern as OpenClaw (images are per-turn only).
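
A minimal sketch of what such an eviction pass could look like, assuming messages are plain dicts in OpenAI chat format and that the "protected tail" is simply the last few messages. The function name `evict_old_images`, the `protected_tail` parameter, and the placeholder text are illustrative and not taken from the Hermes codebase.

```python
from typing import Any, Dict, List

PLACEHOLDER = "[image removed from context]"


def evict_old_images(
    messages: List[Dict[str, Any]], protected_tail: int = 4
) -> List[Dict[str, Any]]:
    """Replace image blocks in older messages with a short text placeholder.

    Messages within the last `protected_tail` positions are left untouched,
    and the input list is not mutated.
    """
    cutoff = max(len(messages) - protected_tail, 0)
    result = []
    for index, message in enumerate(messages):
        content = message.get("content")
        if index >= cutoff or not isinstance(content, list):
            result.append(message)
            continue
        new_content = [
            {"type": "text", "text": PLACEHOLDER}
            if isinstance(block, dict) and block.get("type") == "image_url"
            else block
            for block in content
        ]
        result.append({**message, "content": new_content})
    return result
```

Unlike the existing guard, this pass handles list content explicitly instead of skipping it, while text blocks in the same message are preserved.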

3. Session Persistence Bakes Raw Content Verbatim

Large string content (including base64 image data) is stored directly into session JSON files without size awareness or sanitization. On session reload, raw data is sent back to the API.

Impact: A single 1.97 MB base64 blob in a user message causes context overflow on every reload until manually cleaned up.

Fix: Override persist to store text-only references for image content; strip base64 from messages before session save.
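
One way to approach the save-time sanitization is a recursive walk that drops inline base64 before serialization. This is a sketch under assumptions: `strip_base64` and `dump_session` are hypothetical names, the placeholder wording is made up, and the real persist hook in Hermes may have a different shape.

```python
import json
import re
from typing import Any

# Matches inline data URLs such as "data:image/png;base64,iVBORw0...".
DATA_URL_RE = re.compile(r"data:image/[\w.+-]+;base64,[A-Za-z0-9+/=]+")


def _placeholder(match: "re.Match[str]") -> str:
    return f"[image data omitted, {len(match.group(0))} chars]"


def strip_base64(value: Any) -> Any:
    """Return a copy of `value` with inline base64 image data removed."""
    if isinstance(value, str):
        return DATA_URL_RE.sub(_placeholder, value)
    if isinstance(value, list):
        return [strip_base64(item) for item in value]
    if isinstance(value, dict):
        if value.get("type") == "image_url":
            # Replace the whole block so no invalid image URL is left behind.
            return {"type": "text", "text": "[image omitted from saved session]"}
        return {key: strip_base64(item) for key, item in value.items()}
    return value


def dump_session(messages: list) -> str:
    """Serialize messages for storage with image payloads stripped."""
    return json.dumps(strip_base64(messages))
```

Because the walk returns new objects, the live conversation still holds the image for the current turn; only the stored copy is reduced to text.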

4. Gateway Preambles Break JSON Message Parsing

The gateway prepends notices (e.g., model-switch notifications) before message content. If that content is JSON (like our vision markers), json.loads() fails on the combined string.

Fix: Find JSON start position before parsing; preserve preamble text separately.
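
The preamble-aware parse can be sketched with the standard library's `json.JSONDecoder.raw_decode`, which parses a JSON value starting at a given offset. The function name `split_preamble` is hypothetical, and the sketch assumes the payload is a JSON object (starts with `{`), as the vision markers are described.

```python
import json
from typing import Any, Optional, Tuple


def split_preamble(text: str) -> Tuple[str, Optional[Any]]:
    """Split `text` into (preamble, parsed JSON object).

    Tries each "{" in turn until one starts a valid JSON value, so braces
    inside the preamble do not break parsing. Returns (text, None) when no
    JSON object is found.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            payload, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
            continue
        return text[:start].strip(), payload
    return text, None
```

The preamble is returned rather than discarded, so a model-switch notice can still be shown to the user or the model.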

5. Short Sessions Bypass Compression Entirely

When compress() sees fewer than _min_for_compress messages, it returns immediately. For fresh sessions replaying a poisoned upload (e.g., after auto-reset), NO compression runs — the base64 survives every retry until crash.

Fix: Run base64 stripping unconditionally at the start of compress(), before the early-return guard.
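
The ordering change can be illustrated as follows. This is a sketch only: `MIN_FOR_COMPRESS` stands in for `_min_for_compress` with an arbitrary value, `strip_inline_images` is a hypothetical helper, and the real compression passes are omitted.

```python
import re
from typing import Any, Dict, List

DATA_URL_RE = re.compile(r"data:image/[\w.+-]+;base64,[A-Za-z0-9+/=]+")
MIN_FOR_COMPRESS = 6  # arbitrary stand-in for _min_for_compress


def strip_inline_images(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Remove inline base64 image data from string message content."""
    cleaned = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            message = {**message, "content": DATA_URL_RE.sub("[image data removed]", content)}
        cleaned.append(message)
    return cleaned


def compress(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Runs first and unconditionally, so short sessions are cleaned as well.
    messages = strip_inline_images(messages)
    if len(messages) < MIN_FOR_COMPRESS:
        return messages  # the early return no longer lets base64 through
    # The real compression passes would run here (omitted in this sketch).
    return messages
```

With this ordering, a one-message session replaying a poisoned upload is already stripped by the time the early-return guard fires.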

Token Math: Why Native Vision Matters

| Metric | Auxiliary Vision (current) | Native Vision (proposed) |
| --- | --- | --- |
| Per-image tokens | ~500 (text desc only) | ~2-5K (actual pixels) |
| Visual detail lost? | ✅ Yes (summary) | ❌ No (model sees image) |
| Extra API call per image | ✅ Yes (aux model) | ❌ No |
| Added latency | +2-10 seconds | 0 |
| Images fit in 262K context | Unlimited (text only) | 40+ comfortably |
| Main model must support vision | No | Yes |

Proposed Approach

  1. Add auxiliary.native_vision config toggle (upstream our hack)
  2. Apply detail: "auto" to all image_url injection points
  3. Add context compressor Pass 4 for image eviction
  4. Fix session persistence to not bake base64
  5. Fix preamble-aware JSON parsing
  6. Add pre-compress base64 strip for short sessions
  7. Graceful fallback: if native vision fails → fall through to auxiliary path
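
The toggle and fallback in steps 1 and 7 could be combined along these lines. The function `analyze_image` and both callables are hypothetical placeholders for whatever Hermes uses to call the main and auxiliary models; the broad `except` is for brevity, and real code would likely catch provider-specific errors and log them.

```python
from typing import Callable

VisionCall = Callable[[str, str], str]  # (data_url, question) -> answer


def analyze_image(
    data_url: str,
    question: str,
    native_enabled: bool,
    native_call: VisionCall,
    auxiliary_call: VisionCall,
) -> str:
    """Try the main model first when enabled, else use the auxiliary path."""
    if native_enabled:
        try:
            return native_call(data_url, question)
        except Exception:
            # Native vision failed: fall through to the auxiliary model.
            pass
    return auxiliary_call(data_url, question)
```

Keeping the auxiliary path as the final return means disabling the toggle restores today's behavior without further changes.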

Workaround (Current)

Currently users must use auxiliary vision. Set auxiliary.vision to a vision model config. No native option exists upstream.

Note: Posted by a Hermes user who built and battle-tested a native vision implementation. Happy to contribute as a PR if maintainers are interested.

extent analysis

TL;DR

To address the issues with Hermes's conversation pipeline and enable native vision support, apply the proposed fixes, including adding detail: "auto" to image_url injection points, implementing context compressor Pass 4, and modifying session persistence to handle image content.

Guidance

  1. Add detail: "auto" to image_url injection points: Modify run_agent.py and tools/browser_tool.py to include the detail parameter when constructing image_url content blocks.
  2. Implement context compressor Pass 4: Update agent/context_compressor.py to evict old image blocks from context, following the pattern of OpenClaw.
  3. Modify session persistence: Change session persistence to store text-only references for image content and strip base64 from messages before session save.
  4. Fix preamble-aware JSON parsing: Update the gateway to find the JSON start position before parsing and preserve preamble text separately.
  5. Add pre-compress base64 strip for short sessions: Run base64 stripping unconditionally at the start of compress() to prevent context overflow.

Example

# Before:
{"type": "image_url", "image_url": {"url": data_url}}

# After:
{"type": "image_url", "image_url": {"url": data_url, "detail": "auto"}}

Notes

The proposed fixes aim to address the issues with Hermes's conversation pipeline, but the implementation may require additional modifications to ensure compatibility with the existing codebase.

Recommendation

Apply the proposed fixes to enable native vision support and improve the overall performance of Hermes's conversation pipeline. This approach will reduce latency, cost, and information loss associated with auxiliary vision models.
