hermes - ✅(Solved) Fix [Bug]: [Discord] Auto-vision returns success=false on attachments, forcing duplicate vision_analyze call per image [1 pull requests, 1 participants]

zach614 · 2026-05-19T22:47:23Z



GitHub stats

NousResearch/hermes-agent#28972•Fetched 2026-05-20 04:00:50


Author

zach614

Participants

zach614

Timeline (top)

labeled ×5, cross-referenced ×1

Fix Action

Fixed

Fixed by PR: fix(gateway): retry auto-vision on transient failure (#28972) (https://github.com/NousResearch/hermes-agent/pull/28999)

PR fix notes

PR #28999: fix(gateway): retry auto-vision on transient failure (#28972)

Repository: NousResearch/hermes-agent
Author: xxxigm
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/28999

Description (problem / solution / changelog)

What does this PR do?

Discord-cached image attachments routinely come back success: false from the first vision_analyze call inside _enrich_message_with_vision, even though calling the tool again against the exact same local path succeeds. The reporter observed the failure on every Discord session with an image in their logs.

The agent then sees the kawaii fallback string "couldn't quite see it this time (>_<)", recognises it, and reissues vision_analyze manually — costing ~30 s of reasoning latency and one wasted tool call per affected image, in every session.

Root cause analysis. Tracing the code path:

gateway/platforms/discord.py::_cache_discord_image writes the attachment via cache_image_from_bytes, which uses synchronous filepath.write_bytes(data) — the file is fully on disk before the path is returned.
_handle_message propagates the path through event.media_urls, and _prepare_event_text calls _enrich_message_with_vision(text, image_paths).
vision_analyze_tool only returns success: false when an exception is caught internally (timeout, empty content, transient 5xx, rate limit). The "permanent" failures (image too large, insufficient credits, model doesn't support vision) surface the same way, so success: false alone does not distinguish transient from permanent errors.

So the reporter's "timing race in Discord adapter" hypothesis isn't quite right — the file IS on disk. The actual failure mode is transient API errors that resolve on a second attempt ~30 s later when the agent reissues the call manually.
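To illustrate why the race hypothesis does not hold, here is a minimal sketch of a synchronous cache write. The function name matches the one cited above, but its signature, the `cache_dir` parameter, and the filename scheme are assumptions made for illustration and are not taken from the project's code.

```python
import uuid
from pathlib import Path


def cache_image_from_bytes(data: bytes, cache_dir: Path, ext: str = ".png") -> Path:
    """Write image bytes into a cache directory and return the resulting path.

    Hypothetical signature: only the synchronous write_bytes call is taken
    from the PR description.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    filepath = cache_dir / f"img_{uuid.uuid4().hex[:12]}{ext}"
    # Path.write_bytes opens, writes, and closes the file before returning,
    # so the whole payload has been handed to the OS by this point.
    filepath.write_bytes(data)
    return filepath
```

Because the write finishes before the path is returned, any later reader in the same process sees the complete file, which is consistent with the PR's conclusion that the failure is not a download race.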

Fix. Add a bounded inline retry inside _enrich_message_with_vision (the reporter's preferred Option 1):

Default 1 retry (configurable via HERMES_VISION_AUTO_RETRIES; set to 0 to opt out and restore the legacy single-shot behaviour).
Exponential backoff starting at 0.6 s, capped at 3 s.
Permanent-failure classifier (_vision_failure_is_retryable) short-circuits the retry budget so we don't waste API calls on image too large / insufficient credits / does not support vision / SSRF block / interrupt. Both the error and analysis JSON fields participate in the match.
Exceptions still bubble out of the helper, so the existing "something went wrong" branch in _enrich_message_with_vision continues to fire for non-transient failures like missing API keys.
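The retry behaviour listed above can be sketched roughly as follows. This is not the PR's code: the function names, the hint tuple, and the doubling factor in the backoff are assumptions. Only the 0.6 s start, the 3 s cap, the default of one retry, and the use of both the `error` and `analysis` fields come from the description.

```python
import asyncio
import json
from typing import Awaitable, Callable

# Values quoted in the PR description; the doubling factor is an assumption.
INITIAL_BACKOFF_S = 0.6
MAX_BACKOFF_S = 3.0

# Illustrative lowercase substrings; the real hint table is not shown in the PR text.
NONRETRYABLE_HINTS = ("image too large", "insufficient credits", "does not support")


def backoff_delay(attempt: int) -> float:
    """Delay before retry number `attempt` (1-based): 0.6, 1.2, 2.4, then capped at 3.0."""
    return min(INITIAL_BACKOFF_S * 2 ** (attempt - 1), MAX_BACKOFF_S)


def failure_is_retryable(result: dict) -> bool:
    """Treat a failure as transient unless `error` or `analysis` contains a permanent hint."""
    text = f"{result.get('error') or ''} {result.get('analysis') or ''}".lower()
    return not any(hint in text for hint in NONRETRYABLE_HINTS)


async def analyze_with_retry(
    call: Callable[[], Awaitable[str]],
    retries: int = 1,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> dict:
    """Invoke `call` once, then up to `retries` more times on transient failures.

    Exceptions raised by `call` are deliberately not caught, so they still
    reach the caller's own error handling.
    """
    result = json.loads(await call())
    for attempt in range(1, retries + 1):
        if result.get("success") or not failure_is_retryable(result):
            break
        await sleep(backoff_delay(attempt))
        result = json.loads(await call())
    return result
```

Passing the tool call as a zero-argument coroutine factory and making `sleep` injectable keeps the loop easy to test without real API calls or real delays.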

Cost on the happy path: zero extra API calls. Cost on a transient failure: 1 extra call made inline by the gateway, instead of today's 1 extra call issued manually by the agent plus ~30 s of reasoning. Cost on a permanent failure: 1 call, same as today.

Related Issue

Fixes #28972

Type of Change

[x] 🐛 Bug fix (non-breaking change that fixes an issue)
[ ] ✨ New feature
[ ] 🔒 Security fix
[x] 📝 Documentation update (new env var documented)
[x] ✅ Tests (35 new regression tests)
[ ] ♻️ Refactor
[ ] 🎯 New skill

Changes Made

gateway/run.py — Introduce _vision_auto_retry_count, _vision_failure_is_retryable, _vision_analyze_with_auto_retry helpers plus _VISION_AUTO_RETRY_COUNT_DEFAULT, _VISION_AUTO_RETRY_INITIAL_BACKOFF_S, _VISION_AUTO_RETRY_MAX_BACKOFF_S, _VISION_NONRETRYABLE_HINTS class constants. _enrich_message_with_vision delegates the tool call to the retry helper. Sub-200-line change in a single file.
tests/gateway/test_vision_auto_retry.py — 35 new tests in five classes covering: env var resolution (unset/zero/explicit/negative/garbage/whitespace), permanent-vs-transient classification (parametrised), the retry loop (happy path, transient-then-success, permanent short-circuit, all-fail, env opt-out, exception propagation), the public entry point (the #28972 repro, kawaii-fallback preserved, no-retry on happy path, multi-image budget isolation), and structural invariants (default ≥ 1, lowercase hints, hint table covers known permanent errors).
tests/gateway/test_vision_memory_leak.py — Extend the existing _Stub fixture to bind the new helpers so the sanitize-context regression coverage continues to exercise the real code path.
website/docs/reference/environment-variables.md — Document HERMES_VISION_AUTO_RETRIES next to HERMES_VISION_DOWNLOAD_TIMEOUT.
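The env var resolution covered by the tests might look something like the sketch below. The function name and the optional `env` parameter are hypothetical. The handling of negative and non-numeric values is also an assumption: the PR text lists those cases as tested but does not say how each one resolves, so this sketch simply falls back to the default.

```python
import os
from typing import Mapping, Optional

DEFAULT_RETRIES = 1  # default quoted in the PR description


def vision_auto_retry_count(env: Optional[Mapping[str, str]] = None) -> int:
    """Resolve HERMES_VISION_AUTO_RETRIES to a non-negative retry count.

    Assumption: unset, blank, non-numeric, and negative values all fall
    back to the default; the real helper may treat them differently.
    """
    env = os.environ if env is None else env
    raw = (env.get("HERMES_VISION_AUTO_RETRIES") or "").strip()
    try:
        value = int(raw)
    except ValueError:
        # Covers both the empty string and garbage such as "abc".
        return DEFAULT_RETRIES
    return value if value >= 0 else DEFAULT_RETRIES
```

An explicit `0` is kept as-is so that operators can opt out of retries, matching the documented way to restore the single-shot behaviour.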

How to Test

Reproduce the bug on main:

# Configure a Discord adapter, then from Discord send an image
# attachment to your bot.  In the resulting session:
hermes sessions list   # find the latest session
hermes logs --since 10m | grep -i "couldn't quite see"

On main you'll see the kawaii fallback embedded in the model's first user message, followed shortly by a duplicate vision_analyze tool call.

After this PR:

# Same flow.  The fallback string no longer appears and there's
# no duplicate vision_analyze in the tool-call log.

For operators on metered providers who prefer the legacy behaviour:

HERMES_VISION_AUTO_RETRIES=0 hermes gateway run

Automated coverage:

scripts/run_tests.sh tests/gateway/test_vision_auto_retry.py tests/gateway/test_vision_memory_leak.py -q
# 38 passed in 1.29s

scripts/run_tests.sh tests/gateway/test_vision_auto_retry.py tests/gateway/test_vision_memory_leak.py tests/gateway/test_discord_channel_prompts.py tests/gateway/test_fast_command.py tests/agent/test_image_routing.py -q
# 86 passed in 6.37s

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(gateway):, test(gateway):, docs(gateway):)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix
I've run the relevant tests locally and they pass
I've added tests for my changes (35 new regression tests)
Tested on my platform: macOS 15.2 (Darwin 24.6.0). The retry logic itself is platform-independent.

Documentation & Housekeeping

Updated website/docs/reference/environment-variables.md with the new env var
N/A — no config keys added; the knob is env-var-only by design (transient-retry tuning is an operator concern, not a per-session setting)
N/A — no architecture or workflow changes
Cross-platform — the fix lives in shared gateway code and has no platform-specific assumptions (the issue surfaces most often on Discord but the retry helps any platform that auto-enriches images)
N/A — no tool descriptions / schemas changed

Screenshots / Logs

Before — every Discord session with an image (per reporter)

The model's first user message ends up containing:

[The user sent an image but I couldn't quite see it this time (>_<)
You can try looking at it yourself with vision_analyze using
image_url: /Users/.../cache/images/img_abc123.png]

Followed by:

[Tool call: vision_analyze(image_url=/Users/.../cache/images/img_abc123.png, …)]
[Tool result: {"success": true, "analysis": "..."}]

That's one wasted tool call + ~30 s of agent reasoning, per image, per session.

After

The retry layer absorbs the transient failure invisibly. The model's first user message contains the happy-path descriptor:

[The user sent an image~ Here's what I can see:
A photograph of …]
[If you need a closer look, use vision_analyze with
image_url: /Users/.../cache/images/img_abc123.png ~]

No follow-up vision_analyze tool call. The retry shows up only in the gateway log:

INFO  gateway.run: vision_analyze retry 1/1 for /Users/.../cache/images/img_abc123.png after transient failure; sleeping 0.60s

Changed files

gateway/run.py (modified, +123/-5)
tests/gateway/test_vision_auto_retry.py (added, +442/-0)
tests/gateway/test_vision_memory_leak.py (modified, +16/-1)
website/docs/reference/environment-variables.md (modified, +1/-0)


Bug Description

When an image is attached to a Discord message, Hermes' auto-vision enrichment in gateway/run.py::_enrich_message_with_vision returns success: false, causing the kawaii fallback string "couldn't quite see it this time (>_<)" to be injected into the message. The agent then has to call vision_analyze manually on the same image. This duplicates work and adds ~30s + one tool call per image. The pattern is Discord-specific — CLI and curator sessions don't show it.

Steps to Reproduce

1. Run Hermes with a Discord adapter configured (hermes-discord in platform_toolsets.discord).
2. From Discord, send any image attachment to your Hermes bot (no caption needed, or with a short caption).
3. Observe the model's first user message in the session.

Expected Behavior

_enrich_message_with_vision should auto-analyze the image successfully and inject something like:

[The user sent an image~ Here's what I can see: <description>] [If you need a closer look, use vision_analyze with image_url: ...]

Agent proceeds to handle the user's intent with the description already in context. No second tool call needed.

Actual Behavior

vision_analyze_tool returns success: false on the Discord-cached path. The fallback branch fires and the agent receives:

[The user sent an image but I couldn't quite see it this time (>_<) You can try looking at it yourself with vision_analyze using image_url: <path>]

The agent then calls vision_analyze manually with the same path — and it succeeds the second time. The file is readable; the timing of the first auto-call is the issue.

Pattern hits every Discord session with an image in my logs. Three recent sessions (215, 203, 140 messages each) all opened with this exact fallback string.

Affected Component

Gateway (Telegram/Discord/Slack/WhatsApp)

Messaging Platform (if gateway-related)

Discord

Debug Report

Debug report uploaded:
  Report       https://dpaste.com/D8RLYHUFQ
  agent.log    https://dpaste.com/868PSL89C
  gateway.log  https://dpaste.com/G7D5HVT4Y

Operating System

macOS Sequoia (Darwin 24.6.0 x86_64)

Python Version

3.11.15

Hermes Version

0.13.0 (2026.5.7) [64145a19]

Additional Logs / Traceback (optional)

Root Cause Analysis (optional)

Affected code is ~/.hermes/hermes-agent/gateway/run.py around lines 13020-13040, in _enrich_message_with_vision:

result_json = await vision_analyze_tool(image_url=path, user_prompt=analysis_prompt)
result = json.loads(result_json)
if result.get("success"):
    ...  # happy path: inject the image description
else:
    ...  # kawaii fallback triggers here

Likely root cause is in gateway/platforms/discord.py — the attachment file is being passed to _enrich_message_with_vision before Discord's download completes, or the cached path/format doesn't match what vision_analyze_tool expects. Manual retry of the same path always succeeds, which suggests a timing race rather than a permission or format issue.
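For reference, the two branches in the snippet above can be sketched as a small pure function. The function name is hypothetical and the message templates are approximated from the strings quoted in this report; the exact wording and line breaks in gateway/run.py may differ.

```python
import json


def describe_image_result(result_json: str, path: str) -> str:
    """Turn a vision_analyze JSON result into the note injected into the user message."""
    result = json.loads(result_json)
    if result.get("success"):
        analysis = result.get("analysis", "")
        return (
            f"[The user sent an image~ Here's what I can see:\n{analysis}]\n"
            f"[If you need a closer look, use vision_analyze with\nimage_url: {path} ~]"
        )
    # Any success: false result lands here, whether the underlying error
    # was transient or permanent.
    return (
        "[The user sent an image but I couldn't quite see it this time (>_<)\n"
        f"You can try looking at it yourself with vision_analyze using\nimage_url: {path}]"
    )
```

Since the fallback note includes the same path, a manual second call by the agent hits the same file, which matches the observation that the retry succeeds without any change on disk.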

Proposed Fix (optional)

Three options ranked from most-targeted to most-invasive:

1. Inline retry in _enrich_message_with_vision: single backoff retry when success: false, since the second manual call always succeeds. Masks the issue without touching the Discord layer.
2. Wait-for-file pattern in the Discord adapter: os.path.exists() poll up to ~5s before calling _enrich_message_with_vision, to confirm cache write is complete.
3. Lifecycle hook from Discord adapter: call _enrich_message_with_vision only after the attachment download future resolves.
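As an illustration of the second option, a wait-for-file poll could be sketched as below. The helper name, the polling interval, and the extra non-empty check are assumptions and do not appear in the issue. Note that the PR's analysis found the cache write to be synchronous, so a poll like this would probably not address the observed failures.

```python
import asyncio
import os
import time


async def wait_for_file(path: str, timeout_s: float = 5.0, interval_s: float = 0.1) -> bool:
    """Poll until `path` exists and is non-empty, or until `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # getsize raises OSError if the file does not exist yet.
            if os.path.getsize(path) > 0:
                return True
        except OSError:
            pass
        if time.monotonic() >= deadline:
            return False
        await asyncio.sleep(interval_s)
```

Using `asyncio.sleep` rather than `time.sleep` keeps the gateway's event loop free to handle other messages while waiting.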

Are you willing to submit a PR for this?

I'd like to fix this myself and submit a PR
