hermes - ✅(Solved) Fix [Bug] Fallback announced but never sent: `trying fallback...` logged when `/model`-set invalid id triggers HTTP 400, but `fallback_model` is never invoked and session aborts [1 pull request, 2 comments, 2 participants]

GitHub stats
NousResearch/hermes-agent#17446
Fetched 2026-04-30 06:47:36
Comments: 2
Participants: 2
Timeline: 7
Reactions: 0
Timeline (top): labeled ×4, commented ×2, cross-referenced ×1

When the active session has a model override (set via /model <invalid-id>) that triggers an HTTP 400 "not a valid model ID" error from the provider, Hermes logs "⚠️ Non-retryable error (HTTP 400) — trying fallback..." but never actually invokes the configured fallback_model: the next API request body still contains the broken primary model and the session aborts. The chat (Telegram in my case) then fails on every subsequent turn until the session JSON is hand-edited.

Error Message

⚠️ API call failed (attempt 1/3): BadRequestError [HTTP 400]
   🔌 Provider: openrouter  Model: deepseek/deepseek-v4
   🌐 Endpoint: https://openrouter.ai/api/v1
   📝 Error: HTTP 400: deepseek/deepseek-v4 is not a valid model ID
⚠️ Non-retryable error (HTTP 400) — trying fallback...
🧾 Request debug dump written to: /opt/data/sessions/request_dump_…_899256.json
❌ Non-retryable error (HTTP 400): HTTP 400: deepseek/deepseek-v4 is not a valid model ID
❌ Non-retryable client error (HTTP 400). Aborting.
   🔌 Provider: openrouter  Model: deepseek/deepseek-v4
   🌐 Endpoint: https://openrouter.ai/api/v1
ERROR root: Non-retryable client error: …'deepseek/deepseek-v4 is not a valid model ID'…

⚠️ API call failed (attempt 1/3): BadRequestError [HTTP 400]
   🔌 Provider: openrouter  Model: deepseek/deepseek-v4
   …same sequence…


PR fix notes

PR #15971: fix(image-gen): preserve xAI API error status

Description (problem / solution / changelog)

Why change

xAI image generation errors can report HTTP status 0 even when xAI returns a real 4xx/5xx response. requests.Response objects are falsy for error status codes, so the HTTPError handler treated real error responses as missing.

Files changed

  • plugins/image_gen/xai/__init__.py: check HTTPError.response with is not None before reading status/message
  • tests/plugins/image_gen/test_xai_provider.py: add regression coverage with a real requests.Response status 401
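
The truthiness trap described above can be sketched in isolation. requests.Response evaluates as False whenever its status code is 4xx or 5xx, so an `if response:` guard treats a real error response as missing. The snippet below is only an illustration: FakeResponse is a stand-in written for this example so that it runs without the requests package, and status_buggy / status_fixed are hypothetical helper names, not functions from the PR.

```python
class FakeResponse:
    """Stand-in mimicking requests.Response truthiness (bool(resp) == resp.ok)."""

    def __init__(self, status_code: int, text: str = "") -> None:
        self.status_code = status_code
        self.text = text

    @property
    def ok(self) -> bool:
        # Simplified: anything below 400 counts as a success.
        return self.status_code < 400

    def __bool__(self) -> bool:
        return self.ok


def status_buggy(response) -> int:
    # Truthiness check: a 401 response is falsy, so it is treated as absent.
    return response.status_code if response else 0


def status_fixed(response) -> int:
    # Identity check: only a genuinely absent response yields 0.
    return response.status_code if response is not None else 0


resp = FakeResponse(401, "unauthorized")
print(status_buggy(resp), status_fixed(resp), status_fixed(None))
```

With a 401 response the truthiness version reports status 0, which matches the symptom in the PR description, while the `is not None` version keeps the real status.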

Verification run

  • python -m pytest tests/plugins/image_gen/test_xai_provider.py -q -o 'addopts='
  • inspected diff for secrets/regressions

Risk level

Low. Error-handling only; success path unchanged.

Changed files

  • plugins/image_gen/xai/__init__.py (modified, +4/-3)
  • tests/plugins/image_gen/test_xai_provider.py (modified, +21/-0)


Summary

When the active session has a model override (set via /model <invalid-id>) that triggers an HTTP 400 "not a valid model ID" error from the provider, Hermes logs "⚠️ Non-retryable error (HTTP 400) — trying fallback..." but never actually invokes the configured fallback_model: the next API request body still contains the broken primary model and the session aborts. The chat (Telegram in my case) then fails on every subsequent turn until the session JSON is hand-edited.

Environment

  • Hermes Agent: v0.11.0 (image nousresearch/hermes-agent:latest, sha256 148f233e89d1)
  • Container created: 2026‑04‑28
  • OS: Linux (Docker, Ubuntu base)
  • Provider: openrouter
  • Primary (config): deepseek/deepseek-v4-flash:floor
  • Session model override (broken): deepseek/deepseek-v4 (invalid OpenRouter ID)
  • Configured fallback_model: google/gemini-2.0-flash-001 (valid, confirmed via OpenRouter /api/v1/models)
  • Integration: Telegram bot (polling)
  • fallback_providers: [] (only the legacy single-dict fallback_model is set)

Repro

  1. Start Hermes with this config.yaml:
    model:
      provider: openrouter
      model: deepseek/deepseek-v4-flash:floor
      default: deepseek/deepseek-v4-pro
    fallback_providers: []
    fallback_model:
      provider: openrouter
      model: google/gemini-2.0-flash-001
  2. Connect a Telegram bot, start a session.
  3. In the chat, switch to an invalid model (intentional typo of a real model published 2026‑04‑24, e.g. dropping -pro):
    /model deepseek/deepseek-v4
    The switch is accepted (no upfront catalog rejection; see related #7922).
  4. Send any message. OpenRouter returns:
    HTTP 400: deepseek/deepseek-v4 is not a valid model ID
  5. Send another message. Same error. Bot is dead until the session JSON is patched by hand.
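
Step 3 only works because the slug is accepted without a catalog lookup. A minimal sketch of an upfront check follows, assuming the catalog has already been fetched and parsed into a payload of the shape {"data": [{"id": ...}]}. The function names are hypothetical and nothing here is taken from the Hermes code.

```python
def catalog_ids(models_payload: dict) -> set:
    """Collect model ids from a parsed catalog payload.

    Assumed shape: {"data": [{"id": "..."}, ...]}.
    """
    return {entry["id"] for entry in models_payload.get("data", []) if "id" in entry}


def validate_model_switch(requested: str, known_ids: set) -> tuple:
    """Return (accepted, message) for a /model request (hypothetical helper)."""
    if requested in known_ids:
        return True, f"Switched to {requested}"
    # Suggest ids that start with the requested text; this catches a dropped suffix.
    hints = sorted(i for i in known_ids if i.startswith(requested))
    hint_text = f" Did you mean: {', '.join(hints)}?" if hints else ""
    return False, f"{requested} is not in the provider catalog.{hint_text}"


# Sample payload built from the ids mentioned in this report.
payload = {"data": [{"id": "deepseek/deepseek-v4-pro"},
                    {"id": "google/gemini-2.0-flash-001"}]}
print(validate_model_switch("deepseek/deepseek-v4", catalog_ids(payload)))
```

Ids that carry a routing suffix, such as the :floor in the config above, would presumably need the suffix stripped before the lookup; that case is not handled in this sketch.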

Observed log output (gateway, two consecutive user messages)

⚠️  API call failed (attempt 1/3): BadRequestError [HTTP 400]
   🔌 Provider: openrouter  Model: deepseek/deepseek-v4
   🌐 Endpoint: https://openrouter.ai/api/v1
   📝 Error: HTTP 400: deepseek/deepseek-v4 is not a valid model ID
⚠️ Non-retryable error (HTTP 400) — trying fallback...
🧾 Request debug dump written to: /opt/data/sessions/request_dump_…_899256.json
❌ Non-retryable error (HTTP 400): HTTP 400: deepseek/deepseek-v4 is not a valid model ID
❌ Non-retryable client error (HTTP 400). Aborting.
   🔌 Provider: openrouter  Model: deepseek/deepseek-v4
   🌐 Endpoint: https://openrouter.ai/api/v1
ERROR root: Non-retryable client error: …'deepseek/deepseek-v4 is not a valid model ID'…

⚠️  API call failed (attempt 1/3): BadRequestError [HTTP 400]
   🔌 Provider: openrouter  Model: deepseek/deepseek-v4
   …same sequence…

Notice: the line 🔄 Primary model failed — switching to fallback: <fb_model> via <fb_provider> (emitted by _try_activate_fallback at run_agent.py:7178 after a successful client swap) is absent. The trying fallback... message at run_agent.py:11819 is emitted, but the subsequent _try_activate_fallback() call returns False, so execution falls straight through to the abort path.

Evidence: request dump confirms body never carries fallback model

Two dumps from the same session, written for the two consecutive aborted turns:

// request_dump_20260429_123206_c28fa020_20260429_123304_899256.json
{
  "reason": "non_retryable_client_error",
  "request": {
    "url": "https://openrouter.ai/api/v1/chat/completions",
    "body": {
      "model": "deepseek/deepseek-v4",
      "messages": [...],
      "tools": [...],
      "extra_body": {"reasoning": {"enabled": true, "effort": "medium"}}
    }
  },
  "error": {"status_code": 400, "body": {"message": "deepseek/deepseek-v4 is not a valid model ID", "code": 400}}
}

The second dump (next turn, 1 minute later) is byte-identical on body.model. There is no second dump showing a request to google/gemini-2.0-flash-001 — the fallback is announced but never sent on the wire.
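
The same check can be run across a whole dump directory rather than by eye. The sketch below counts which body.model values appear in the request dumps. It assumes the files on disk are plain JSON with the request.body.model layout shown above (the excerpt above is abbreviated and annotated, so it would not parse as-is); models_sent is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def models_sent(dump_dir: str) -> Counter:
    """Count request body models across request_dump_*.json files in dump_dir."""
    counts = Counter()
    for path in sorted(Path(dump_dir).glob("request_dump_*.json")):
        try:
            dump = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or non-JSON files
        if not isinstance(dump, dict):
            continue  # unexpected top-level shape
        model = dump.get("request", {}).get("body", {}).get("model")
        if model is not None:
            counts[model] += 1
    return counts


# Example call, using the directory from the log output above:
# models_sent("/opt/data/sessions")
```

If the fallback had ever been sent on the wire, google/gemini-2.0-flash-001 would show up as a key in the result.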

Probable root cause

Reading run_agent.py v0.11.0:

  • _try_activate_fallback() (line 6997) advances self._fallback_index before the activation can fail (line 7022: self._fallback_index += 1). With a single-entry chain (legacy fallback_model: form, no fallback_providers:), the index reaches len(_fallback_chain) after the very first attempt.
  • _restore_primary_runtime() (line ~7196) resets _fallback_index = 0 only when self._fallback_activated is True, i.e. after a successful fallback activation. If an activation failed earlier in the session, _fallback_index stays stuck past the end of the chain. Two ways this could happen: (a) the primary earlier hit a transient No models provided 400, the fallback briefly succeeded, and _fallback_activated was then cleared on the next primary restoration without the index being reset; or (b) any activation path returned False via the recursive return self._try_activate_fallback() exhaustion guard.
  • From that point on, every _try_activate_fallback() call returns False at the bounds check (line 7018: if self._fallback_index >= len(self._fallback_chain): return False), even though a perfectly valid fallback_model is configured.

The user-facing symptom is then: trying fallback... logged → no client swap → same broken model in body → same 400 → abort. The session is pinned forever.
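
The suspected sequence can be reproduced with a toy model. This is a sketch of the behaviour as described in the bullets above, not the Hermes code: the class, its attribute names, and the activation_ok flag are stand-ins invented for illustration.

```python
class FallbackState:
    """Toy model of the index handling described above (not the Hermes code)."""

    def __init__(self, chain):
        self.chain = list(chain)
        self.index = 0
        self.activated = False

    def try_activate(self, activation_ok: bool) -> bool:
        if self.index >= len(self.chain):
            return False  # bounds check: the chain looks exhausted
        self.index += 1  # advanced before the activation can fail
        if not activation_ok:
            return self.try_activate(activation_ok)  # recursive retry on the next entry
        self.activated = True
        return True

    def restore_primary(self) -> bool:
        if not self.activated:
            return False  # early exit: the index is NOT reset here
        self.activated = False
        self.index = 0
        return True


state = FallbackState(["google/gemini-2.0-flash-001"])
first = state.try_activate(activation_ok=False)   # one failed activation
restored = state.restore_primary()                # next turn: nothing to restore
later = state.try_activate(activation_ok=True)    # fallback would work now
print(first, restored, later, state.index)
```

In this model a single failed activation on a one-entry chain leaves the index at 1, and every later call returns False at the bounds check even when the activation itself would succeed.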

I haven't fully proven this is the exact cause (would need to instrument _fallback_index at runtime), but it's consistent with all observations: configured fallback exists, primary is broken in a way the fallback would not share, dumps show only primary model on the wire, and the switching to fallback log line is missing.

Why this is distinct from existing issues

  • #7922 (provider-prefixed slug sent as raw model field) — that one ships the wrong slug to a valid endpoint. Here the model value is a plausible-but-non-existent OpenRouter ID; the body.model is verbatim what /model stored.
  • #16677 (DeepSeek V4 Pro crash loop on rate limits / vision aux) — that's gateway-process crashing under 429/aux mis-resolution. Here the gateway stays up; only the conversation aborts cleanly with no fallback attempt.
  • #6380 / #7385 — cosmetic status-bar staleness after fallback. Here the issue is functional: fallback is announced but does not happen.
  • #15072 (normalization of . and - in model names). Not the case here; the user-typed /model deepseek/deepseek-v4 is stored verbatim.

I scanned all open fallback-titled issues plus closed fallback_model / session override issues and none describe the "fallback announced but never sent on the wire" symptom on a session with a /model-set invalid override.

Suggested fixes

  1. Reset _fallback_index on every new turn unconditionally, not only when _fallback_activated is True. Move the self._fallback_index = 0 line out of the if not self._fallback_activated: return False early-exit branch in _restore_primary_runtime(), or do it at the top of run_conversation().
  2. Don't increment _fallback_index until after activation succeeds. Currently it's incremented before any failure can occur, so a single transient failure can permanently exhaust a length‑1 chain. Increment at the bottom of the success path (just before return True), and let the recursive retry inside the function advance through the chain explicitly.
  3. Reject /model slugs that don't exist in the resolved provider catalog. The current acceptance with a warning (already noted in #7922) makes this class of typo a footgun. At minimum, when the override later produces 400 not a valid model ID, the gateway could automatically clear the override and revert to model.model from config.
  4. Don't emit trying fallback... until the activation actually succeeds. Move the status emit into _try_activate_fallback() after self._fallback_activated = True (line 7095). Right now it's emitted at the call site (line 11819) before knowing whether the swap will happen, which is the misleading UX bit.
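
One possible reading of fixes 1, 2 and 4 combined is sketched below on a toy state object. The class, the start_turn hook, and the activate callback are hypothetical stand-ins rather than the real run_agent.py structure, and fix 3 is not modelled.

```python
class FixedFallbackState:
    """Toy model of suggested fixes 1, 2 and 4 (not the Hermes code)."""

    def __init__(self, chain):
        self.chain = list(chain)
        self.index = 0
        self.activated = False
        self.status = []  # user-facing status messages

    def start_turn(self) -> None:
        # Fix 1: reset unconditionally at the top of every turn.
        self.index = 0
        self.activated = False

    def try_activate(self, activate) -> bool:
        cursor = self.index  # local cursor, so a failure leaves self.index untouched
        while cursor < len(self.chain):
            entry = self.chain[cursor]
            if activate(entry):
                self.activated = True
                self.index = cursor + 1  # Fix 2: advance only after success
                # Fix 4: announce the switch only once it has happened.
                self.status.append(f"switching to fallback: {entry}")
                return True
            cursor += 1
        return False


state = FixedFallbackState(["google/gemini-2.0-flash-001"])
state.start_turn()
failed = state.try_activate(lambda entry: False)    # transient activation failure
state.start_turn()
recovered = state.try_activate(lambda entry: True)  # next turn can still fall back
print(failed, recovered, state.status)
```

The loop replaces the recursive retry only to keep the example short; the point is that a failed attempt no longer consumes the single fallback entry, and no status line is emitted unless a swap took place.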

Workaround used

Hand-patched <HERMES_HOME>/sessions/session_<id>.json setting model back to a valid id (deepseek/deepseek-v4-pro), and updated model.default in config.yaml to the same. Bot resumed on the next message without restart.
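
The hand patch can be scripted. The sketch below assumes the override is stored under a top-level "model" key in the session file, which is an assumption based on the description above and not a verified schema. patch_session_model is a hypothetical helper, and it keeps a .bak copy before writing.

```python
import json
import shutil
from pathlib import Path
from typing import Optional


def patch_session_model(session_path: str, valid_model: str) -> Optional[str]:
    """Set the "model" field of a session file and return the previous value.

    Assumes a top-level "model" key; a copy is saved next to the file as *.bak.
    """
    path = Path(session_path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))  # backup first
    session = json.loads(path.read_text(encoding="utf-8"))
    previous = session.get("model")
    session["model"] = valid_model
    path.write_text(json.dumps(session, indent=2, ensure_ascii=False), encoding="utf-8")
    return previous


# Example call (path placeholders as in the workaround above):
# patch_session_model("<HERMES_HOME>/sessions/session_<id>.json", "deepseek/deepseek-v4-pro")
```

Other fields in the session file are written back unchanged, although key formatting and whitespace may differ from the original.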

Happy to provide full session JSONs or instrument _fallback_index if helpful.

extent analysis

TL;DR

The most likely fix is to reset _fallback_index unconditionally on every new turn and to increment it only after an activation succeeds.

Guidance

  • Review the _restore_primary_runtime() function to ensure _fallback_index is reset to 0 when a fallback activation fails.
  • Modify the _try_activate_fallback() function to increment _fallback_index only after a successful activation.
  • Consider rejecting /model slugs that don't exist in the resolved provider catalog to prevent similar issues.
  • Update the logging to only emit trying fallback... after a successful fallback activation.

Example

No code snippet is provided as the issue is related to the logic of the _try_activate_fallback() and _restore_primary_runtime() functions, which requires a thorough review of the codebase.

Notes

The provided analysis suggests that the issue is related to the _fallback_index not being reset correctly, causing the fallback model to not be invoked. However, without access to the full codebase, it's difficult to provide a definitive solution.

Recommendation

Apply the suggested fixes to the _try_activate_fallback() and _restore_primary_runtime() functions to ensure correct fallback behavior.
