hermes - 💡(How to fix) Fix fallback_providers not activated when 429 follows prior timeout recovery

Error Message

When the primary model (glm-5.1) hit HTTP 429 rate limits, the configured fallback to glm-4.7 was never activated for the main gateway session. All retries stayed on glm-5.1 until the retry budget was exhausted, and the error was then returned to the user:

16:06:36 ERROR   API call failed after 3 retries. HTTP 429 | provider=zai model=glm-5.1

Bug: `fallback_providers` not activated when primary model hits 429 after prior timeout recovery

Environment

Hermes Agent: latest main (May 26, 2026)
Provider: zai (Zhipu GLM Coding Plan endpoint)
Primary model: glm-5.1
Fallback model: glm-4.7 (same provider, same base_url, different model)
Platform: macOS, Telegram gateway

Config

provider: zai
model: glm-5.1
base_url: https://open.bigmodel.cn/api/coding/paas/v4

fallback_providers:
  - provider: zai
    model: glm-4.7
    base_url: https://open.bigmodel.cn/api/coding/paas/v4
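The config above can be read as one primary target plus an ordered list of fallback targets. The following is a minimal sketch of that reading, not Hermes code: `build_fallback_chain` and its "skip entries identical to the primary" rule are assumptions made for illustration. It shows that an entry sharing the provider and base_url but naming a different model is still a distinct fallback target.

```python
from typing import Dict, List

# Config from this report, written as a plain dict to avoid a YAML dependency.
CONFIG = {
    "provider": "zai",
    "model": "glm-5.1",
    "base_url": "https://open.bigmodel.cn/api/coding/paas/v4",
    "fallback_providers": [
        {
            "provider": "zai",
            "model": "glm-4.7",
            "base_url": "https://open.bigmodel.cn/api/coding/paas/v4",
        }
    ],
}

KEYS = ("provider", "model", "base_url")


def build_fallback_chain(config: Dict) -> List[Dict]:
    """Hypothetical helper: return fallback targets that differ from the primary."""
    primary = {key: config.get(key) for key in KEYS}
    chain = []
    for entry in config.get("fallback_providers", []):
        target = {key: entry.get(key) for key in KEYS}
        # Assumption: an entry identical to the primary would be a no-op, so skip it.
        if target != primary:
            chain.append(target)
    return chain


chain = build_fallback_chain(CONFIG)
```

With this config the chain holds a single target, glm-4.7 on the same provider and endpoint.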

What happened

The main gateway session never activated the fallback. A cron job session running at the same time, however, did fall back successfully:

2026-05-26 16:00:48 INFO [cron_006bda9176a7...] agent.chat_completion_helpers: Fallback activated: glm-5.1 → glm-4.7 (zai)

Timeline (main session `20260526_041128_140c21`)

1. 3x timeout → `_try_recover_primary_transport()` rebuilds the OpenAI client and resets `retry_count = 0`
2. 3x HTTP 429 → all on glm-5.1, no fallback attempted
3. "API call failed after 3 retries" → error returned to user
4. User sends follow-up → another 3x HTTP 429 on glm-5.1, same result

Evidence

agent.log — all requests show model=glm-5.1, never glm-4.7:

16:06:27 WARNING API call failed (attempt 1/3) error_type=RateLimitError provider=zai model=glm-5.1
16:06:31 WARNING API call failed (attempt 2/3) error_type=RateLimitError provider=zai model=glm-5.1
16:06:36 ERROR   API call failed after 3 retries. HTTP 429 | provider=zai model=glm-5.1

No "Fallback activated" entry exists for this session.

errors.log — shows the timeout → 429 transition:

16:06:21 WARNING API call failed (attempt 3/3) error_type=APITimeoutError model=glm-5.1
16:06:27 WARNING API call failed (attempt 1/3) error_type=RateLimitError model=glm-5.1
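One way to check this evidence mechanically is to scan the log lines for `key=value` fields and for the "Fallback activated" message. The sketch below assumes only the line shapes quoted in this report; `summarize` is a hypothetical helper, not part of Hermes.

```python
import re

# Matches key=value pairs such as error_type=RateLimitError or model=glm-5.1.
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def summarize(lines):
    """Collect the models seen, the error types in order, and whether a fallback was logged."""
    models = set()
    error_types = []
    fallback_activated = False
    for line in lines:
        if "Fallback activated" in line:
            fallback_activated = True
            continue
        fields = dict(FIELD_RE.findall(line))
        if "model" in fields:
            models.add(fields["model"])
        if "error_type" in fields:
            error_types.append(fields["error_type"])
    return {
        "models": models,
        "error_types": error_types,
        "fallback_activated": fallback_activated,
    }


main_session = [
    "16:06:21 WARNING API call failed (attempt 3/3) error_type=APITimeoutError model=glm-5.1",
    "16:06:27 WARNING API call failed (attempt 1/3) error_type=RateLimitError provider=zai model=glm-5.1",
    "16:06:31 WARNING API call failed (attempt 2/3) error_type=RateLimitError provider=zai model=glm-5.1",
    "16:06:36 ERROR   API call failed after 3 retries. HTTP 429 | provider=zai model=glm-5.1",
]
report = summarize(main_session)
```

For the main session excerpts, the summary lists glm-5.1 as the only model and records no fallback activation.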

Root cause analysis

I traced through agent/conversation_loop.py and believe the issue is:

1. After 3 timeouts, `_try_recover_primary_transport()` succeeds and resets `retry_count = 0` (line ~2897).
2. The subsequent 429s trigger `_recover_with_credential_pool()` (line ~2065), which returns `(False, True)`: it sets `has_retried_429 = True` but does not recover.
3. The early fallback check at line ~2456 (`is_rate_limited and _fallback_index < len(_fallback_chain)`) calls `_pool_may_recover_from_rate_limit()`, which correctly returns `False` (single-credential pool).
4. However, `_try_activate_fallback()` at line ~2468 does not appear to fire, possibly because the combination of `primary_recovery_attempted = True` and the prior credential pool interaction leaves the retry state in a condition where the fallback path is skipped.

The cron session succeeded because it hit 429 directly (no prior timeout), so the simpler path through the fallback logic worked correctly.
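The difference between the two sessions can be illustrated with a toy retry loop. This is not the code in agent/conversation_loop.py: the `guard_on_recovery_flag` condition is a hypothetical stand-in for whatever state actually blocks the fallback, and the only names borrowed from the report are `retry_count` and `primary_recovery_attempted`. It merely shows how a flag left over from timeout recovery could make the same 429 take a different path.

```python
from dataclasses import dataclass

TIMEOUT, RATE_LIMIT, OK = "timeout", "429", "ok"


@dataclass
class RetryState:
    retry_count: int = 0
    primary_recovery_attempted: bool = False
    fallback_activated: bool = False


def run(events, max_retries=3, guard_on_recovery_flag=True):
    """Replay API outcomes through a toy retry loop.

    guard_on_recovery_flag=True models the *suspected* bug: the fallback
    branch is skipped once primary recovery has been attempted.
    """
    state = RetryState()
    for event in events:
        if event == OK:
            return state
        state.retry_count += 1
        if event == RATE_LIMIT:
            blocked = guard_on_recovery_flag and state.primary_recovery_attempted
            if not blocked:
                state.fallback_activated = True
                return state
        if state.retry_count >= max_retries:
            if event == TIMEOUT and not state.primary_recovery_attempted:
                # Toy version of transport recovery: reset the budget once.
                state.primary_recovery_attempted = True
                state.retry_count = 0
                continue
            return state  # budget exhausted, error goes back to the user
    return state


cron_like = run([RATE_LIMIT])                      # 429 with no prior timeout
main_like = run([TIMEOUT] * 3 + [RATE_LIMIT] * 3)  # timeouts, recovery, then 429s
```

In this model the cron-like sequence activates the fallback while the main-like sequence exhausts its retries on the primary, which matches the observed logs. Whether the real loop fails for this reason still needs to be confirmed against the source.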

Expected behavior

When the primary model is rate-limited (429) and `fallback_providers` is configured with a different model under the same provider, the agent should switch to the fallback model after the retry budget is exhausted.
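Stated as a decision rule, the expected behavior depends only on the current error, the remaining fallback chain, and whether the credential pool can still help. The sketch below is one possible formulation under those assumptions. `choose_fallback` and its parameters are hypothetical names, loosely mirroring the `_fallback_index` and `_fallback_chain` identifiers mentioned in the analysis.

```python
from typing import Dict, List, Optional


def choose_fallback(
    is_rate_limited: bool,
    retries_exhausted: bool,
    pool_may_recover: bool,
    fallback_chain: List[Dict],
    fallback_index: int,
) -> Optional[Dict]:
    """Return the next fallback target, or None if no switch should happen.

    Earlier timeout recovery is deliberately not an input: it should not
    influence whether a 429 leads to a fallback.
    """
    if not (is_rate_limited and retries_exhausted):
        return None
    if pool_may_recover:
        return None  # another credential may still succeed on the primary
    if fallback_index >= len(fallback_chain):
        return None  # chain already used up
    return fallback_chain[fallback_index]


chain = [{"provider": "zai", "model": "glm-4.7"}]
target = choose_fallback(
    is_rate_limited=True,
    retries_exhausted=True,
    pool_may_recover=False,  # single-credential pool, as in this report
    fallback_chain=chain,
    fallback_index=0,
)
```

For the scenario in this report the function returns the glm-4.7 entry, regardless of how many timeouts preceded the 429.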

Workaround

None found. The fallback mechanism works in simple scenarios but breaks when a timeout precedes the 429.

Related code

agent/conversation_loop.py — retry loop, fallback checks at lines ~2456 and ~2900
agent/chat_completion_helpers.py — try_activate_fallback() at line ~740
run_agent.py — _pool_may_recover_from_rate_limit() at line ~239
