hermes - ✅(Solved) Fix Multi-level fallback chain aborts on 401 from first fallback instead of cascading to second fallback [2 pull requests, 2 comments, 3 participants]

GitHub stats

NousResearch/hermes-agent#17485 (fetched 2026-04-30 06:47:13)

  • Comments: 2
  • Participants: 3
  • Timeline events: 8
  • Reactions: 0
  • Top timeline events: labeled ×4, commented ×2, cross-referenced ×2

Fix Action

Fixed

PR fix notes

PR #17518: fix(agent): cascade nested fallback_model chains

Description (problem / solution / changelog)

Summary

Fixes #17485.

Legacy configs can express multi-level provider failover by nesting fallback_model blocks. Hermes only kept the first dict in _fallback_chain, so after fallback 1 failed with a non-retryable/auth error, the retry loop had no second provider to activate and aborted despite the nested fallback config.

This PR normalizes the fallback config into an ordered chain before agent startup:

  • keeps the canonical list-form fallback_providers / list fallback_model behavior
  • preserves single-dict fallback_model behavior
  • flattens nested legacy fallback_model entries so fallback 1 can cascade to fallback 2
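The normalization described above can be sketched as a small flattening function. This is a minimal sketch, not the code in run_agent.py. Only the key names `fallback_providers` and `fallback_model` come from the PR text. The function name `normalize_fallback_chain` is hypothetical, and so are two assumptions the sketch makes: that each nested level is a dict whose optional `fallback_model` key holds the next fallback, and that `fallback_providers` takes precedence over `fallback_model`.

```python
from __future__ import annotations

from typing import Any


def normalize_fallback_chain(config: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten fallback settings into an ordered list of provider entries.

    Hypothetical sketch: only the key names come from the PR description;
    the real run_agent.py logic may differ.
    """
    # Canonical list form (assumed to take precedence): use entries in order.
    providers = config.get("fallback_providers")
    if isinstance(providers, list):
        return [dict(p) for p in providers if isinstance(p, dict)]

    legacy = config.get("fallback_model")
    if isinstance(legacy, list):
        return [dict(p) for p in legacy if isinstance(p, dict)]

    # Legacy single or nested dict form: follow fallback_model links in order.
    chain: list[dict[str, Any]] = []
    seen: set[int] = set()  # guards against self-referencing YAML aliases
    node = legacy
    while isinstance(node, dict) and id(node) not in seen:
        seen.add(id(node))
        # Keep this level's settings but drop the link to the next level.
        chain.append({k: v for k, v in node.items() if k != "fallback_model"})
        node = node.get("fallback_model")
    return chain


nested = {
    "fallback_model": {
        "provider": "groq",
        "fallback_model": {"provider": "mistral"},
    }
}
print(normalize_fallback_chain(nested))  # [{'provider': 'groq'}, {'provider': 'mistral'}]
```

A single dict comes back as a one-element chain, so the single-dict behavior is preserved. The nested form now yields every level in order, which gives the retry loop a second provider to activate.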

Verification

  • scripts/run_tests.sh tests/run_agent/test_provider_fallback.py (22 passed)
  • git diff --check

Scope

This only changes fallback-chain normalization and focused fallback tests. It does not alter provider-specific auth refresh, retry classification, or the canonical fallback command/config storage format.

Changed files

  • run_agent.py (modified, +34/-9)
  • tests/run_agent/test_provider_fallback.py (modified, +62/-0)

PR #17535: fix(run_agent): normalize nested fallback chains to enable cascading provider failover

Description (problem / solution / changelog)

What does this PR do?

This PR fixes provider fallback handling in run_agent.py by normalizing legacy nested fallback_model blocks into a single ordered runtime chain. Previously, nested fallback definitions could be skipped during failover, so fallback resolution did not consistently cascade past the first fallback provider. This change ensures fallback resolution is deterministic and traverses the full chain, including the nested case reported in #17485.

Related Issue

Fixes #17485

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • run_agent.py: Added fallback-chain normalization logic to flatten legacy nested fallback_model payloads into the runtime fallback chain order, enabling proper cascade behavior.
  • tests/run_agent/test_provider_fallback.py: Added regression tests for:
    • nested fallback chain normalization
    • cascading fallback when an earlier fallback target raises a 401-like error
  • tests/run_agent/test_fallback_model.py: Existing coverage kept and verified against the updated normalization behavior and edge cases.
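A regression test for cascading past a 401-like error from an earlier fallback target might look roughly like the following. This is a self-contained sketch, not the PR's actual test. `ProviderAuthError`, `run_with_fallbacks`, and the `provider` key are hypothetical names, and the fake call stands in for real provider requests. The status codes mirror the issue report: Gemini returns 400 and Groq returns 401.

```python
from __future__ import annotations

from typing import Any, Callable


class ProviderAuthError(Exception):
    """Hypothetical stand-in for an invalid-key (400/401) provider error."""

    def __init__(self, provider: str, status: int) -> None:
        super().__init__(f"{provider}: HTTP {status}")
        self.provider = provider
        self.status = status


def run_with_fallbacks(
    primary: dict[str, Any],
    chain: list[dict[str, Any]],
    call: Callable[[dict[str, Any]], str],
) -> str:
    """Try the primary, then each fallback in order; fail only when all fail."""
    errors: list[ProviderAuthError] = []
    for target in [primary, *chain]:
        try:
            return call(target)
        except ProviderAuthError as exc:
            # An auth error is not retried on the same provider,
            # but it must not stop the cascade to the next one.
            errors.append(exc)
    raise RuntimeError(f"all providers failed: {[str(e) for e in errors]}")


def test_cascades_past_401_from_first_fallback() -> None:
    attempted: list[str] = []
    bad_keys = {"gemini": 400, "groq": 401}  # mirrors the issue report

    def fake_call(target: dict[str, Any]) -> str:
        name = target["provider"]
        attempted.append(name)
        if name in bad_keys:
            raise ProviderAuthError(name, bad_keys[name])
        return f"ok from {name}"

    chain = [{"provider": "groq"}, {"provider": "mistral"}]
    result = run_with_fallbacks({"provider": "gemini"}, chain, fake_call)
    assert result == "ok from mistral"
    assert attempted == ["gemini", "groq", "mistral"]
```

The sketch encodes one key rule: an auth failure disqualifies only the provider that raised it. The loop moves on to the next entry and gives up only after the last one fails.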

How to Test

  1. Run fallback provider tests:
    • scripts/run_tests.sh tests/run_agent/test_provider_fallback.py -q
    • scripts/run_tests.sh tests/run_agent/test_fallback_model.py -q
  2. Confirm both suites pass:
    • test_provider_fallback.py: 21 passed
    • test_fallback_model.py: 28 passed
  3. Add a provider configuration with a nested fallback chain, simulate the first fallback returning a 401-like error, and verify that the agent continues cascading through the normalized fallback order instead of stopping prematurely.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

  • scripts/run_tests.sh tests/run_agent/test_provider_fallback.py -q → 21 passed
  • scripts/run_tests.sh tests/run_agent/test_fallback_model.py -q → 28 passed

Changed files

  • run_agent.py (modified, +35/-11)
  • tests/run_agent/test_provider_fallback.py (modified, +35/-0)
Raw issue report

  • Primary (Gemini) → 400 invalid key → correctly cascades to fallback 1 (Groq) ✅
  • Fallback 1 (Groq) → 401 invalid key → logs "trying fallback..." then aborts instead of cascading to fallback 2 (Mistral) ❌
  • Version and platform: v0.11.0 (2026.4.23), Linux Mint
  • Config uses nested fallback_model: blocks.
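The following sketch shows why the run aborts at Groq. It is a minimal model of the pre-fix behavior described in PR #17518, where only the first fallback dict made it into `_fallback_chain`. The config dict mirrors the reporter's Gemini → Groq → Mistral setup as a YAML loader might return it. The `provider` key, `CONFIG`, and the helper names are hypothetical.

```python
from __future__ import annotations

from typing import Any

# Hypothetical nested legacy config, as a YAML loader might return it.
CONFIG: dict[str, Any] = {
    "provider": "gemini",              # primary
    "fallback_model": {                # fallback 1
        "provider": "groq",
        "fallback_model": {            # fallback 2, nested inside fallback 1
            "provider": "mistral",
        },
    },
}


def pre_fix_chain(config: dict[str, Any]) -> list[dict[str, Any]]:
    """Model of the described bug: only the first fallback dict is kept."""
    first = config.get("fallback_model")
    return [first] if isinstance(first, dict) else []


def failover_order(config: dict[str, Any]) -> list[str]:
    """Providers attempted, primary first, when every provider fails."""
    order = [config["provider"]]
    order.extend(entry["provider"] for entry in pre_fix_chain(config))
    return order


print(failover_order(CONFIG))  # ['gemini', 'groq']: Mistral is never reached
```

Groq is still attempted because it is the first dict, which matches the report. Once Groq fails, the modeled chain is exhausted, so there is nothing left to cascade to.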

Extent analysis

TL;DR

The cascade stops after Groq's invalid-key (401) error instead of falling back to Mistral. Per PR #17518, Hermes kept only the first dict of the nested fallback_model config in its fallback chain, so no second fallback was available to activate.

Guidance

  • Review the fallback_model block configuration to ensure that the cascade is properly defined, allowing the system to fall back to Mistral after a 401 error in Groq.
  • Verify that the logging indicates the correct sequence of events and that the "trying fallback..." message is followed by an attempt to use Mistral.
  • Check the version history of v0.11.0 to see if there are any known issues related to fallback model cascading.
  • Inspect the system's behavior when encountering a 401 error in Groq to determine why it aborts instead of continuing to the next fallback.

Example

No code snippet is provided due to lack of specific configuration details.

Notes

The issue seems to be specific to the configuration of the fallback models and their interaction with error handling. Without more details on the configuration or the exact error messages, it's challenging to provide a precise fix.

Recommendation

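Apply the fix from PR #17518 or PR #17535. Until then, a possible workaround is to express the fallback chain in the canonical list form (fallback_providers, or a list-valued fallback_model) instead of nested fallback_model blocks, since the PRs describe the bug as affecting the nested legacy form.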
