hermes - 💡(How to fix) Fix fallback chain not activated on stream-stall timeouts (15+ min silent hang on degraded primary) [1 pull requests]

hermes2026-05-09 04:37:28

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

Error Message

fallback_providers does not activate when the primary provider is hung on a stream that the stale-stream/stale-call detector eventually kills. The kill produces a classified FailoverReason.timeout error with retryable=True, and the retry loop simply re-hits the same broken primary — burning the full retry budget (3–5 minutes per stale kill × N retries = 15+ minute total hang) before bailing. The configured fallback chain is never tried. 3. Observe: the stream stalls, HERMES_STREAM_STALE_TIMEOUT fires (default 180s, scaled up to 240–300s for large context), connection is killed, error classifies as FailoverReason.timeout. The retry loop attempts the same provider again, hangs again, kills again — repeated until retries exhausted. Fallback is never activated. Total hang: 15+ minutes.

Root Cause

In run_agent.py:13035-13059, eager fallback is gated on rate-limit/billing only:

is_rate_limited = classified.reason in (
    FailoverReason.rate_limit,
    FailoverReason.billing,
)
if is_rate_limited and self._fallback_index < len(self._fallback_chain):
    ...
    if self._try_activate_fallback(reason=classified.reason):

Stale-detected timeouts (FailoverReason.timeout) and connection drops never activate fallback eagerly — they retry-with-backoff against the same broken provider. The stale detector itself works correctly; it's the retry loop that ignores the configured fallback chain in this failure mode.

Issue #21444 documents the inverse direction (Codex/gpt-5.5 primary stalling → fallback activates after one stale kill on the non-streaming path). That works because non-streaming fully exhausts retries faster, and because that issue's reporter saw fallback fire after one ~300s wait — but for streaming primaries with longer per-attempt budget and exponential-backoff between attempts, multiple full stale-kill cycles compound into 15+ min observed hangs.

Fix Action

Fixed

Fixed by PR: feat(agent): eager fallback on stream-stall timeouts (#22277) (https://github.com/NousResearch/hermes-agent/pull/22278)

Code Example

model:
     default: claude-opus-4-7
     provider: anthropic
   fallback_providers:
     - provider: openai-codex
       model: gpt-5.5

---

is_rate_limited = classified.reason in (
    FailoverReason.rate_limit,
    FailoverReason.billing,
)
if is_rate_limited and self._fallback_index < len(self._fallback_chain):
    ...
    if self._try_activate_fallback(reason=classified.reason):

---

fallback:
  eager_on_timeout: true   # new, default true

RAW_BUFFERClick to expand / collapse

Summary

Reproduction

Configure primary as Anthropic (or any cloud provider) and a fallback chain pointing at a different provider:

model:
  default: claude-opus-4-7
  provider: anthropic
fallback_providers:
  - provider: openai-codex
    model: gpt-5.5

Send a turn while the primary provider is degraded (slow stream with heartbeats, gateway issues, etc.).
Observe: the stream stalls, HERMES_STREAM_STALE_TIMEOUT fires (default 180s, scaled up to 240–300s for large context), connection is killed, error classifies as FailoverReason.timeout. The retry loop attempts the same provider again, hangs again, kills again — repeated until retries exhausted. Fallback is never activated. Total hang: 15+ minutes.

Root cause

In run_agent.py:13035-13059, eager fallback is gated on rate-limit/billing only:

is_rate_limited = classified.reason in (
    FailoverReason.rate_limit,
    FailoverReason.billing,
)
if is_rate_limited and self._fallback_index < len(self._fallback_chain):
    ...
    if self._try_activate_fallback(reason=classified.reason):

Proposed fix

Extend the eager-fallback trigger to include FailoverReason.timeout (and arguably FailoverReason.connection) after the first stale-detected timeout. Gate behind a new opt-out config key for back-compat:

fallback:
  eager_on_timeout: true   # new, default true

When true, the same eager-fallback block activates on the first classified timeout if a fallback chain is configured AND the credential pool can't recover. This converts a 15+ min silent hang into a ~5 min single-stale-kill + immediate fallback.

Use case

Paid Anthropic primary + OAuth-backed Codex/GPT-5.5 fallback. When Anthropic stream is degraded but not erroring cleanly (heartbeats keep socket alive), the user has no automatic recovery today.
Symmetric to #21444 — that fixes silent-hang on Codex primary; this fixes silent-hang on Anthropic primary (or any provider whose stream-stalls are eventually caught by the stale detector).

Are you willing to submit a PR for this?

Yes — drafting now.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#configuration error #environment variable #network issue #logging issue #authentication issue

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

hermes - 💡(How to fix) Fix fallback chain not activated on stream-stall timeouts (15+ min silent hang on degraded primary) [1 pull requests]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fixed

Code Example

Summary

Reproduction

Root cause

Proposed fix

Use case

Are you willing to submit a PR for this?

Still need to ship something?

TRENDING

hermes - 💡(How to fix) Fix fallback chain not activated on stream-stall timeouts (15+ min silent hang on degraded primary) [1 pull requests]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fixed

Code Example

Summary

Reproduction

Root cause

Proposed fix

Use case

Are you willing to submit a PR for this?

Still need to ship something?

RELATED_DISCOVERY

TRENDING