hermes - 💡(How to fix) Fix fallback chain not activated on stream-stall timeouts (15+ min silent hang on degraded primary) [1 pull requests]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

fallback_providers does not activate when the primary provider is hung on a stream that the stale-stream/stale-call detector eventually kills. The kill produces a classified FailoverReason.timeout error with retryable=True, and the retry loop simply re-hits the same broken primary — burning the full retry budget (3–5 minutes per stale kill × N retries = 15+ minute total hang) before bailing. The configured fallback chain is never tried.

Error Message

fallback_providers does not activate when the primary provider is hung on a stream that the stale-stream/stale-call detector eventually kills. The kill produces a classified FailoverReason.timeout error with retryable=True, and the retry loop simply re-hits the same broken primary — burning the full retry budget (3–5 minutes per stale kill × N retries = 15+ minute total hang) before bailing. The configured fallback chain is never tried. 3. Observe: the stream stalls, HERMES_STREAM_STALE_TIMEOUT fires (default 180s, scaled up to 240–300s for large context), connection is killed, error classifies as FailoverReason.timeout. The retry loop attempts the same provider again, hangs again, kills again — repeated until retries exhausted. Fallback is never activated. Total hang: 15+ minutes.

Root Cause

In run_agent.py:13035-13059, eager fallback is gated on rate-limit/billing only:

is_rate_limited = classified.reason in (
    FailoverReason.rate_limit,
    FailoverReason.billing,
)
if is_rate_limited and self._fallback_index < len(self._fallback_chain):
    ...
    if self._try_activate_fallback(reason=classified.reason):

Stale-detected timeouts (FailoverReason.timeout) and connection drops never activate fallback eagerly — they retry-with-backoff against the same broken provider. The stale detector itself works correctly; it's the retry loop that ignores the configured fallback chain in this failure mode.

Issue #21444 documents the inverse direction (Codex/gpt-5.5 primary stalling → fallback activates after one stale kill on the non-streaming path). That works because non-streaming fully exhausts retries faster, and because that issue's reporter saw fallback fire after one ~300s wait — but for streaming primaries with longer per-attempt budget and exponential-backoff between attempts, multiple full stale-kill cycles compound into 15+ min observed hangs.

Fix Action

Fixed

Code Example

model:
     default: claude-opus-4-7
     provider: anthropic
   fallback_providers:
     - provider: openai-codex
       model: gpt-5.5

---

is_rate_limited = classified.reason in (
    FailoverReason.rate_limit,
    FailoverReason.billing,
)
if is_rate_limited and self._fallback_index < len(self._fallback_chain):
    ...
    if self._try_activate_fallback(reason=classified.reason):

---

fallback:
  eager_on_timeout: true   # new, default true
RAW_BUFFERClick to expand / collapse

Summary

fallback_providers does not activate when the primary provider is hung on a stream that the stale-stream/stale-call detector eventually kills. The kill produces a classified FailoverReason.timeout error with retryable=True, and the retry loop simply re-hits the same broken primary — burning the full retry budget (3–5 minutes per stale kill × N retries = 15+ minute total hang) before bailing. The configured fallback chain is never tried.

Reproduction

  1. Configure primary as Anthropic (or any cloud provider) and a fallback chain pointing at a different provider:
    model:
      default: claude-opus-4-7
      provider: anthropic
    fallback_providers:
      - provider: openai-codex
        model: gpt-5.5
  2. Send a turn while the primary provider is degraded (slow stream with heartbeats, gateway issues, etc.).
  3. Observe: the stream stalls, HERMES_STREAM_STALE_TIMEOUT fires (default 180s, scaled up to 240–300s for large context), connection is killed, error classifies as FailoverReason.timeout. The retry loop attempts the same provider again, hangs again, kills again — repeated until retries exhausted. Fallback is never activated. Total hang: 15+ minutes.

Root cause

In run_agent.py:13035-13059, eager fallback is gated on rate-limit/billing only:

is_rate_limited = classified.reason in (
    FailoverReason.rate_limit,
    FailoverReason.billing,
)
if is_rate_limited and self._fallback_index < len(self._fallback_chain):
    ...
    if self._try_activate_fallback(reason=classified.reason):

Stale-detected timeouts (FailoverReason.timeout) and connection drops never activate fallback eagerly — they retry-with-backoff against the same broken provider. The stale detector itself works correctly; it's the retry loop that ignores the configured fallback chain in this failure mode.

Issue #21444 documents the inverse direction (Codex/gpt-5.5 primary stalling → fallback activates after one stale kill on the non-streaming path). That works because non-streaming fully exhausts retries faster, and because that issue's reporter saw fallback fire after one ~300s wait — but for streaming primaries with longer per-attempt budget and exponential-backoff between attempts, multiple full stale-kill cycles compound into 15+ min observed hangs.

Proposed fix

Extend the eager-fallback trigger to include FailoverReason.timeout (and arguably FailoverReason.connection) after the first stale-detected timeout. Gate behind a new opt-out config key for back-compat:

fallback:
  eager_on_timeout: true   # new, default true

When true, the same eager-fallback block activates on the first classified timeout if a fallback chain is configured AND the credential pool can't recover. This converts a 15+ min silent hang into a ~5 min single-stale-kill + immediate fallback.

Use case

  • Paid Anthropic primary + OAuth-backed Codex/GPT-5.5 fallback. When Anthropic stream is degraded but not erroring cleanly (heartbeats keep socket alive), the user has no automatic recovery today.
  • Symmetric to #21444 — that fixes silent-hang on Codex primary; this fixes silent-hang on Anthropic primary (or any provider whose stream-stalls are eventually caught by the stale detector).

Are you willing to submit a PR for this?

Yes — drafting now.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING