hermes - 💡(How to fix) Fix Feature: First-class native vision support for vision-capable main models (with reference implementation + bug findings)

hermes2026-04-20 14:31:41

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

Hermes currently routes ALL image analysis through an auxiliary vision model (qwen3-vl or configured alternative), even when the main model is natively vision-capable (e.g., gpt-4o, glm-5v-turbo, claude-sonnet-4). This adds unnecessary latency, cost, and information loss (text description ≠ seeing the image).

We implemented a working native vision bypass (native_vision: true config toggle) that sends images directly to the main model as multimodal content blocks — and in doing so, we discovered several underlying issues in Hermes's conversation pipeline that would affect ANY multimodal content, not just our hack.

This issue serves dual purpose:

Feature request: Upstream native vision support (with our implementation as reference)
Bug findings: Pipeline issues that should be fixed regardless

Root Cause

This issue serves dual purpose:

Feature request: Upstream native vision support (with our implementation as reference)
Bug findings: Pipeline issues that should be fixed regardless

Fix Action

Fix / Workaround

We patched 4 files to add native vision support:

Full patch code: see native-vision-patch skill (community-maintained)

Workaround (Current)

Code Example

# Before:
{"type": "image_url", "image_url": {"url": data_url}}

# After:
{"type": "image_url", "image_url": {"url": data_url, "detail": "auto"}}

---

if isinstance(content, list):
    continue  # ← skips EVERY multimodal block including images

RAW_BUFFERClick to expand / collapse

Feature Request: First-Class Native Vision Support for Vision-Capable Main Models

Summary

This issue serves dual purpose:

Feature request: Upstream native vision support (with our implementation as reference)
Bug findings: Pipeline issues that should be fixed regardless

Our Reference Implementation

We patched 4 files to add native vision support:

File	What We Added
`tools/browser_tool.py`	Skip aux vision, return `_native_vision` marker with base64 data URL
`tools/vision_tools.py`	Skip aux vision, return `_native_vision` marker
`run_agent.py`	Two injection handlers that convert markers → multimodal `image_url` blocks
`gateway/run.py`	Upload path: resize + encode images, return marker
`agent/context_compressor.py`	Pass 4: evict old image blocks from context

Full patch code: see native-vision-patch skill (community-maintained)

Bugs We Discovered (Affect All Multimodal Content)

These are issues in Hermes's core pipeline that would impact any attempt to send image data through the conversation system:

1. Missing `detail: "auto"` on Image Injection (Token Bloat)

When constructing OpenAI-format image_url content blocks, there's no detail parameter. This means images are tokenized at full resolution (detail: "high" implicit default).

Impact: A typical phone photo = ~121-230K tokens at full res vs ~2-5K with detail: "auto". That's a 25-60× reduction.

Fix location (3 injection points):

run_agent.py — tool result injection (~L8465)
run_agent.py — user upload injection (~L8852)
tools/browser_tool.py — auxiliary vision fallback (~L2155)

# Before:
{"type": "image_url", "image_url": {"url": data_url}}

# After:
{"type": "image_url", "image_url": {"url": data_url, "detail": "auto"}}

Reference: OpenAI Vision API docs

2. Context Compressor Explicitly Skips Multimodal Content

agent/context_compressor.py — all 3 compression passes have this guard:

if isinstance(content, list):
    continue  # ← skips EVERY multimodal block including images

Impact: Image data is immortal. After N turns with M images = unbounded context growth. The compressor reports "failed after 3 attempts" when 60%+ of context is uncompressible image data.

Fix: Add a Pass 4 that replaces old image blocks (outside protected tail) with lightweight text placeholders. Follows same pattern as OpenClaw (images are per-turn only).

3. Session Persistence Bakes Raw Content Verbatim

Large string content (including base64 image data) is stored directly into session JSON files without size awareness or sanitization. On session reload, raw data is sent back to the API.

Impact: A single 1.97 MB base64 blob in a user message causes context overflow on every reload until manually cleaned up.

Fix: Override persist to store text-only references for image content; strip base64 from messages before session save.

4. Gateway Preambles Break JSON Message Parsing

The gateway prepends notices (e.g., model-switch notifications) before message content. If that content is JSON (like our vision markers), json.loads() fails on the combined string.

Fix: Find JSON start position before parsing; preserve preamble text separately.

5. Short Sessions Bypass Compression Entirely

When compress() sees fewer than _min_for_compress messages, it returns immediately. For fresh sessions replaying a poisoned upload (e.g., after auto-reset), NO compression runs — the base64 survives every retry until crash.

Fix: Run base64 stripping unconditionally at the start of compress(), before the early-return guard.

Token Math: Why Native Vision Matters

Metric	Auxiliary Vision (current)	Native Vision (proposed)
Per-image tokens	~500 (text desc only)	~2-5K (actual pixels)
Visual detail lost?	✅ Yes (summary)	❌ No (model sees image)
Extra API call per image	✅ Yes (aux model)	❌ No
Added latency	+2-10 seconds	0
Images fit in 262K context	Unlimited (text only)	40+ comfortably
Main model must support vision	No	Yes

Proposed Approach

Add auxiliary.native_vision config toggle (upstream our hack)
Apply detail: "auto" to all image_url injection points
Add context compressor Pass 4 for image eviction
Fix session persistence to not bake base64
Fix preamble-aware JSON parsing
Add pre-compress base64 strip for short sessions
Graceful fallback: if native vision fails → fall through to auxiliary path

Workaround (Current)

Currently users must use auxiliary vision. Set auxiliary.vision to a vision model config. No native option exists upstream.

References

OpenAI Vision API — Low or High Fidelity
OpenClaw implementation — reference architecture for per-turn-only images with detail:"auto"
Our full implementation + troubleshooting guide: native-vision-patch skill

Note: Posted by a Hermes user who built and battle-tested a native vision implementation. Happy to contribute as a PR if maintainers are interested.

extent analysis

TL;DR

To address the issues with Hermes's conversation pipeline and enable native vision support, apply the proposed fixes, including adding detail: "auto" to image_url injection points, implementing context compressor Pass 4, and modifying session persistence to handle image content.

Guidance

Add detail: "auto" to image_url injection points: Modify run_agent.py and tools/browser_tool.py to include the detail parameter when constructing image_url content blocks.
Implement context compressor Pass 4: Update agent/context_compressor.py to evict old image blocks from context, following the pattern of OpenClaw.
Modify session persistence: Change session persistence to store text-only references for image content and strip base64 from messages before session save.
Fix preamble-aware JSON parsing: Update the gateway to find the JSON start position before parsing and preserve preamble text separately.
Add pre-compress base64 strip for short sessions: Run base64 stripping unconditionally at the start of compress() to prevent context overflow.

Example

# Before:
{"type": "image_url", "image_url": {"url": data_url}}

# After:
{"type": "image_url", "image_url": {"url": data_url, "detail": "auto"}}

Notes

The proposed fixes aim to address the issues with Hermes's conversation pipeline, but the implementation may require additional modifications to ensure compatibility with the existing codebase.

Recommendation

Apply the proposed fixes to enable native vision support and improve the overall performance of Hermes's conversation pipeline. This approach will reduce latency, cost, and information loss associated with auxiliary vision models.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #agent execution #callback error #memory management #API rate limit

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

hermes - 💡(How to fix) Fix Feature: First-class native vision support for vision-capable main models (with reference implementation + bug findings)

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Workaround (Current)

Code Example

Feature Request: First-Class Native Vision Support for Vision-Capable Main Models

Summary

Our Reference Implementation

Bugs We Discovered (Affect All Multimodal Content)

1. Missing `detail: "auto"` on Image Injection (Token Bloat)

2. Context Compressor Explicitly Skips Multimodal Content

3. Session Persistence Bakes Raw Content Verbatim

4. Gateway Preambles Break JSON Message Parsing

5. Short Sessions Bypass Compression Entirely

Token Math: Why Native Vision Matters

Proposed Approach

Workaround (Current)

References

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

hermes - 💡(How to fix) Fix Feature: First-class native vision support for vision-capable main models (with reference implementation + bug findings)

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Workaround (Current)

Code Example

Feature Request: First-Class Native Vision Support for Vision-Capable Main Models

Summary

Our Reference Implementation

Bugs We Discovered (Affect All Multimodal Content)

1. Missing detail: "auto" on Image Injection (Token Bloat)

2. Context Compressor Explicitly Skips Multimodal Content

3. Session Persistence Bakes Raw Content Verbatim

4. Gateway Preambles Break JSON Message Parsing

5. Short Sessions Bypass Compression Entirely

Token Math: Why Native Vision Matters

Proposed Approach

Workaround (Current)

References

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING

1. Missing `detail: "auto"` on Image Injection (Token Bloat)