hermes - 💡(How to fix) Fix [Bug]: auxiliary_client (Vision auto-detect / title generation) hangs indefinitely with no timeout when provider stalls; CLI input blocks [2 comments, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#15194Fetched 2026-04-25 06:23:56
View on GitHub
Comments
2
Participants
1
Timeline
8
Reactions
0
Author
Participants
Timeline (top)
labeled ×4commented ×2closed ×1cross-referenced ×1

When the main provider experiences a transient ReadTimeout, the main turn reconnects cleanly (⚠️ Connection to provider dropped (ReadTimeout). Reconnecting… 🔄 Reconnected — resuming…), but a concurrent agent.auxiliary_client call (e.g. Vision auto-detect or title generation) never completes and never times out. The CLI timer keeps counting, the input area remains locked, and the only recovery is kill -TERM on the agent process. No error is written to logs — the last line is the auxiliary client's "using main provider" INFO log, followed by complete silence.

This appears distinct from #13617 (approval prompt — closed), #8340 (terminal tool setsid+disown hang — closed), and #13033 (Linux paste freeze — open).

Error Message

When the main provider experiences a transient ReadTimeout, the main turn reconnects cleanly (⚠️ Connection to provider dropped (ReadTimeout). Reconnecting… 🔄 Reconnected — resuming…), but a concurrent agent.auxiliary_client call (e.g. Vision auto-detect or title generation) never completes and never times out. The CLI timer keeps counting, the input area remains locked, and the only recovery is kill -TERM on the agent process. No error is written to logs — the last line is the auxiliary client's "using main provider" INFO log, followed by complete silence. No entry in errors.log at the hang timestamp. No broken-pipe or exception trace tied to the hang (there was an unrelated BrokenPipeError earlier in mcp-stderr.log, but it preceded the hang by hours and shouldn't be the cause).

Root Cause

Suspected root cause

Fix Action

Fix / Workaround

Workaround (for other users hitting this)

Code Example

2026-XX-XX 19:32:31,460 INFO agent.auxiliary_client: Auxiliary auto-detect: using main provider anthropic (claude-opus-4-6)
2026-XX-XX 19:32:31,460 INFO agent.auxiliary_client: Auxiliary title_generation: using auto (claude-opus-4-6) at https://api.anthropic.com
2026-XX-XX 19:39:41,513 INFO agent.auxiliary_client: Vision auto-detect: using main provider anthropic (claude-opus-4-6)
<nothing for 25+ minutes>
RAW_BUFFERClick to expand / collapse

Summary

When the main provider experiences a transient ReadTimeout, the main turn reconnects cleanly (⚠️ Connection to provider dropped (ReadTimeout). Reconnecting… 🔄 Reconnected — resuming…), but a concurrent agent.auxiliary_client call (e.g. Vision auto-detect or title generation) never completes and never times out. The CLI timer keeps counting, the input area remains locked, and the only recovery is kill -TERM on the agent process. No error is written to logs — the last line is the auxiliary client's "using main provider" INFO log, followed by complete silence.

This appears distinct from #13617 (approval prompt — closed), #8340 (terminal tool setsid+disown hang — closed), and #13033 (Linux paste freeze — open).

Environment

  • OS: macOS (Apple Silicon)
  • Hermes: installed via pipx / user-local (Python 3.11)
  • Provider: Anthropic (claude-opus-4-6)
  • MCP servers loaded: github, atlassian, one internal stdio MCP server, plus code-execution
  • Session in progress for ~38 min, ~328K/1M context used at time of hang

Reproduction (observed in the wild, not yet minimized)

  1. Start a long-running session with multiple MCP tool calls interspersed with model turns.
  2. At some point the main provider emits a ReadTimeout; the UI shows Reconnecting… (attempt 2/3) then Reconnected — resuming….
  3. After the reconnect, the next auxiliary call fires — in my case Vision auto-detect (also reproduces with title_generation per auxiliary_client logs).
  4. agent.log shows the "using main provider" INFO line for the auxiliary call, then nothing for 25+ minutes.
  5. The CLI timer continues to count, the prompt is rendered but accepts no keyboard input (Ctrl+C, Ctrl+D, Enter all ignored).
  6. ps shows the agent process alive at ~0% CPU in S state — blocked on an await.

Last log evidence

~/.hermes/logs/agent.log (final lines before 25+ min of silence):

2026-XX-XX 19:32:31,460 INFO agent.auxiliary_client: Auxiliary auto-detect: using main provider anthropic (claude-opus-4-6)
2026-XX-XX 19:32:31,460 INFO agent.auxiliary_client: Auxiliary title_generation: using auto (claude-opus-4-6) at https://api.anthropic.com
2026-XX-XX 19:39:41,513 INFO agent.auxiliary_client: Vision auto-detect: using main provider anthropic (claude-opus-4-6)
<nothing for 25+ minutes>

No entry in errors.log at the hang timestamp. No broken-pipe or exception trace tied to the hang (there was an unrelated BrokenPipeError earlier in mcp-stderr.log, but it preceded the hang by hours and shouldn't be the cause).

Suspected root cause

agent/auxiliary_client.py (or equivalent) appears to issue the HTTP request to the auxiliary provider without a timeout= kwarg on the httpx / anthropic SDK call, and without wrapping the call in asyncio.wait_for(...) or a circuit breaker. When the underlying connection stalls (possibly because the socket pool is in a half-dead state post-reconnect of the main client), the coroutine awaits forever. There is no retry-with-timeout path on this branch, unlike the main turn's reconnect logic.

Related candidate: the auxiliary client may be reusing the HTTPX client / pool that the main provider just reconnected, inheriting a stale connection that then silently hangs on the first auxiliary request.

Expected behavior

  1. All auxiliary calls (Vision auto-detect, title_generation, etc.) should have an explicit request timeout (e.g. 30s) applied via the SDK timeout= parameter or asyncio.wait_for.
  2. On timeout, the auxiliary call should:
    • log WARNING with the auxiliary kind (vision / title / etc.) and the elapsed time,
    • fail open (e.g. skip vision hint, use default title) rather than blocking the main turn,
    • not propagate into the CLI input lock.
  3. Main turn should never be gated on auxiliary completion. If the current architecture makes them coupled, add a deadline to the wrapping task group.

Workaround (for other users hitting this)

None at runtime — once hung, the CLI cannot be recovered without kill -TERM <pid> on the agent process. Before restarting, back up ~/.hermes/sessions/session_<id>.json and, if using --resume, archive / remove the stuck session file first to avoid a resume loop (see #7536).

Additional notes

  • Found a related-but-separate issue worth fixing: tools.checkpoint_manager fails repeatedly on git add -A with index.lock: File exists after a prior kill, and never self-heals. Resulted in ~6 hours of silently lost checkpoints on my end. Happy to open a separate issue for that.
  • Would be useful to emit a WARNING when an auxiliary call exceeds some threshold even if not cancelled, so users can diagnose stalls without reading source.

What I can provide on request

  • Redacted agent.log window around the hang
  • Process state snapshot (ps, py-spy dump if needed next time I reproduce)
  • Session JSON structure (without conversation content)

extent analysis

TL;DR

Apply a request timeout to auxiliary client calls using the timeout= parameter or asyncio.wait_for to prevent indefinite hangs.

Guidance

  • Review the agent/auxiliary_client.py code to ensure that all auxiliary calls (e.g., Vision auto-detect, title generation) have an explicit request timeout applied.
  • Consider adding a retry-with-timeout path for auxiliary calls to handle potential connection stalls.
  • Implement a deadline for the wrapping task group to prevent the main turn from being gated on auxiliary completion.
  • Log a WARNING message when an auxiliary call exceeds a certain threshold to aid in diagnosing stalls.

Example

import asyncio
import httpx

# Apply a request timeout using the timeout= parameter
async def auxiliary_call():
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.get("https://api.anthropic.com")

# Alternatively, use asyncio.wait_for to apply a timeout
async def auxiliary_call():
    try:
        await asyncio.wait_for(httpx.get("https://api.anthropic.com"), timeout=30.0)
    except asyncio.TimeoutError:
        # Handle timeout error
        pass

Notes

The provided code snippets are examples and may require modifications to fit the specific use case. The timeout value should be adjusted according to the expected response time for the auxiliary calls.

Recommendation

Apply a workaround by adding a request timeout to auxiliary client calls to prevent indefinite hangs. This will allow the CLI to recover from stalls and prevent the input lock.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

  1. All auxiliary calls (Vision auto-detect, title_generation, etc.) should have an explicit request timeout (e.g. 30s) applied via the SDK timeout= parameter or asyncio.wait_for.
  2. On timeout, the auxiliary call should:
    • log WARNING with the auxiliary kind (vision / title / etc.) and the elapsed time,
    • fail open (e.g. skip vision hint, use default title) rather than blocking the main turn,
    • not propagate into the CLI input lock.
  3. Main turn should never be gated on auxiliary completion. If the current architecture makes them coupled, add a deadline to the wrapping task group.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - 💡(How to fix) Fix [Bug]: auxiliary_client (Vision auto-detect / title generation) hangs indefinitely with no timeout when provider stalls; CLI input blocks [2 comments, 1 participants]