hermes - ✅(Solved) Fix [Bug] _restore_primary_runtime bypasses credential pool, reuses stale (revoked) api_key [2 pull requests, 1 comment, 2 participants]

GitHub stats

NousResearch/hermes-agent#25205 (fetched 2026-05-14 03:48:05)
Comments: 1 · Participants: 2 · Timeline events: 7 · Reactions: 0
Timeline (top): cross-referenced ×3, labeled ×3, commented ×1

_restore_primary_runtime restores the api_key from a construction-time snapshot (_primary_runtime). When the credential pool has rotated across turns — due to token revocation (401), exhaustion (429/402), or rate-limit cooldown — the snapshot key is stale. The next turn immediately hits the same error, re-exhausts remaining entries, and falls through to cross-provider fallback instead of using the pool's current best entry.

Root Cause

run_agent.py:8908 (_restore_primary_runtime):

self.api_key = rt["api_key"]  # ← stale snapshot key

The credential pool (self._credential_pool) is never consulted during restore. The pool tracks which entries are exhausted/available, but _restore_primary_runtime ignores it entirely.

Fix

After restoring from the snapshot, call pool.select() to get the current best entry and update api_key/base_url/client accordingly. Falls back gracefully to the snapshot key when the pool is absent or empty.
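The restore-then-reselect behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `PoolEntry`, `CredentialPool`, the `exhausted` flag, and the shape of the `Agent` class are all hypothetical stand-ins. Only the names `_restore_primary_runtime`, `_primary_runtime`, `_credential_pool`, and `select()` come from the issue text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PoolEntry:
    """Hypothetical pool entry; the field names are assumptions."""
    api_key: str
    base_url: str = ""
    exhausted: bool = False


@dataclass
class CredentialPool:
    """Hypothetical stand-in for the real credential pool."""
    entries: List[PoolEntry] = field(default_factory=list)

    def select(self) -> Optional[PoolEntry]:
        # Return the first entry that is still usable, or None if none is.
        for entry in self.entries:
            if not entry.exhausted:
                return entry
        return None


class Agent:
    def __init__(self, api_key, base_url, pool=None):
        self.api_key = api_key
        self.base_url = base_url
        self._credential_pool = pool
        # Construction-time snapshot, as described in the issue.
        self._primary_runtime = {"api_key": api_key, "base_url": base_url}

    def _restore_primary_runtime(self):
        rt = self._primary_runtime
        self.api_key = rt["api_key"]      # may be stale after rotation
        self.base_url = rt["base_url"]

        pool = self._credential_pool
        if pool is None:
            return                        # no pool: keep the snapshot key
        try:
            entry = pool.select()
        except Exception:
            entry = None                  # selection failed: keep the snapshot key
        if entry is not None and entry.api_key:
            self.api_key = entry.api_key
            if entry.base_url:
                self.base_url = entry.base_url
            # A real implementation would rebuild the API client here.
```

With a pool whose first entry is exhausted, a restore in this sketch ends on the second entry's key instead of the snapshot key. With no pool, or with every entry exhausted, the snapshot key is left in place.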

PR: #24174

PR fix notes

PR #25206: fix(pool): re-select from credential pool on primary runtime restore

Description (problem / solution / changelog)

Summary

Fixes #25205.

_restore_primary_runtime restores api_key from a construction-time snapshot. When the credential pool has rotated (exhaustion, revocation, rate-limit cooldown), the snapshot key is stale. The next turn immediately hits the same error, re-exhausts entries, and falls through to cross-provider fallback.

After restoring the snapshot, re-select from the credential pool if one exists. Falls back gracefully to the snapshot key when the pool is absent or empty.

Test plan

  • 7 new test cases in test_restore_primary_pool_reselect.py
    • Rotation picks new entry (not stale snapshot key)
    • Pool absent → snapshot key used (existing behavior)
    • Pool empty (all exhausted) → snapshot key used
    • Entry with empty key → snapshot key used
    • Base URL updated from pool entry
    • Client rebuild triggered after reselect
  • 53 existing credential pool tests pass
  • Manual: run multi-turn session with codex pool, verify restore log shows "Restore re-selected pool entry" and fallback does NOT activate on next turn

Risks

  • Low: pool.select() is idempotent and already used at session init. The restore path just calls it again.
  • If select() fails or returns None, we fall back to the snapshot key (same as pre-fix behavior).

Changed files

  • run_agent.py (modified, +60/-0)
  • tests/agent/test_restore_primary_pool_reselect.py (added, +222/-0)

PR #25277: fix: load credential pool in switch_model for new provider (#25273)

Description (problem / solution / changelog)

Problem

When switching providers mid-session via /model (e.g., from the default zai provider to openai-codex), switch_model() updates model, provider, api_key, base_url, api_mode, and rebuilds the client — but never loads the credential pool for the new provider.

The agent's _credential_pool stays as whatever was loaded at construction time (typically None for providers like zai that use a single API key from env). This silently disables pool rotation on the new provider:

  1. All 429/401/402 errors burn the same credential with no rotation
  2. After exhausting retries, the agent falls through to cross-provider fallback
  3. Zero pool rotation log entries appear (pool is None, recovery bails immediately)

This is why a user with 3 openai-codex OAuth credentials (where one has usage available) still hits rate limits — the pool never rotates because it was never loaded.

Closes #25273

Fix

run_agent.py: switch_model() (line ~2659)

  • Call load_pool(new_provider) after swapping runtime fields, before client rebuild
  • Set self._credential_pool to the loaded pool (or None if loading fails or pool has no credentials)
  • Wrapped in try/except to avoid breaking model switch if pool loading fails
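As a rough sketch of the guarded pool load listed above: the helper name `reload_pool_for_provider` is hypothetical, `load_pool` is passed in as a parameter so the example stays self-contained, and treating an empty pool as falsy is an assumption about the real pool type.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("switch_model_sketch")


def reload_pool_for_provider(agent, new_provider, load_pool):
    """Set agent._credential_pool for new_provider, or None on any failure."""
    try:
        pool = load_pool(new_provider)
    except Exception as exc:
        # A failed pool load must not break the model switch itself.
        logger.warning("Pool load failed for %s: %s", new_provider, exc)
        pool = None
    # Assumption: a pool with no credentials is falsy; treat it like no pool.
    agent._credential_pool = pool if pool else None
    return agent._credential_pool


# Usage with stand-in objects (not the real agent or loader):
agent = SimpleNamespace(_credential_pool=None)
reload_pool_for_provider(agent, "openai-codex", lambda provider: ["cred-a", "cred-b"])
```

In this sketch a loader that raises, or that returns an empty pool, leaves the agent without pool rotation, which matches the pre-fix behavior the PR describes as the failure mode.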

run_agent.py: _primary_runtime snapshot in switch_model() (line ~2770)

  • Include credential_pool in the snapshot dict so it survives _restore_primary_runtime() across turns

run_agent.py: _restore_primary_runtime() (line ~8895)

  • Restore credential_pool from snapshot when present and not None
  • Older snapshots that predate this field are handled gracefully (pool unchanged)
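The snapshot handling in the two bullets above might look something like this. The helper name `restore_pool_from_snapshot` and the use of a plain dict for the snapshot are assumptions for illustration; only the `credential_pool` key and the `_credential_pool` attribute are named in the PR description.

```python
from types import SimpleNamespace


def restore_pool_from_snapshot(agent, snapshot):
    """Restore the pool only when the snapshot carries a non-None one."""
    # .get() covers older snapshots that predate the credential_pool field.
    pool = snapshot.get("credential_pool")
    if pool is not None:
        agent._credential_pool = pool


# Usage with a stand-in agent object:
current_pool = object()
agent = SimpleNamespace(_credential_pool=current_pool)
restore_pool_from_snapshot(agent, {"api_key": "k1"})  # old snapshot, pool unchanged
```

A snapshot with `credential_pool` set to None and a snapshot without the key are treated the same way here: the agent keeps whatever pool it already has.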

Test plan

8 new tests in tests/agent/test_switch_model_credential_pool.py:

  • test_switch_loads_pool_for_new_provider — verifies load_pool is called
  • test_switch_sets_pool_none_when_no_credentials — pool=None when no creds
  • test_switch_handles_load_pool_exception — pool=None on load error
  • test_switch_includes_pool_in_primary_runtime — snapshot has pool
  • test_switch_same_provider_keeps_pool — re-load on same provider
  • test_restore_primary_preserves_pool_from_snapshot — restore path works
  • test_restore_primary_keeps_pool_when_snapshot_has_none — None guard
  • test_restore_primary_keeps_pool_when_snapshot_predates_field — backward compat

All 53 existing credential pool tests still pass.

Risks

  • Low: load_pool is called once during switch_model. If it fails (e.g., corrupted auth.json), the pool is set to None and the agent operates without pool rotation — same behavior as before the fix.
  • The _primary_runtime snapshot now holds a reference to the pool object. Since the pool is a singleton per provider (loaded from auth.json), this is a shared reference, not a copy. State changes to the pool (e.g., exhaustion) are reflected in both the agent and the snapshot.
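The shared-reference point in the last risk can be illustrated with a small sketch. The `Pool` class and its `mark_exhausted` method are hypothetical; the example only demonstrates standard Python reference semantics, contrasted with a deep copy.

```python
import copy


class Pool:
    """Hypothetical pool that tracks exhausted keys."""

    def __init__(self, keys):
        self.keys = list(keys)
        self.exhausted = set()

    def mark_exhausted(self, key):
        self.exhausted.add(key)


pool = Pool(["k1", "k2"])

# Storing the pool in the snapshot stores a reference, not a copy.
shared_snapshot = {"api_key": "k1", "credential_pool": pool}
# For contrast: a deep copy is frozen at the moment it is taken.
copied_snapshot = {"api_key": "k1", "credential_pool": copy.deepcopy(pool)}

pool.mark_exhausted("k1")

shared_sees_it = "k1" in shared_snapshot["credential_pool"].exhausted
copy_sees_it = "k1" in copied_snapshot["credential_pool"].exhausted
```

Here `shared_sees_it` is True and `copy_sees_it` is False, so a restore from the shared snapshot still observes exhaustion recorded after the snapshot was taken.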

Related

  • #24173 — Codex soft-failure pool rotation (response_invalid block)
  • #25205 — _restore_primary_runtime uses stale snapshot key
  • #24159 — Codex Responses API soft failures bypass pool rotation

Changed files

  • run_agent.py (modified, +24/-0)
  • tests/agent/test_switch_model_credential_pool.py (added, +302/-0)

Description

_restore_primary_runtime restores the api_key from a construction-time snapshot (_primary_runtime). When the credential pool has rotated across turns — due to token revocation (401), exhaustion (429/402), or rate-limit cooldown — the snapshot key is stale. The next turn immediately hits the same error, re-exhausts remaining entries, and falls through to cross-provider fallback instead of using the pool's current best entry.

Reproduction

  1. Configure a credential pool with 3+ openai-codex entries
  2. Start a session using openai-codex as primary
  3. On turn N, one entry gets 401 token_revoked
  4. Pool rotation selects a new entry; turn N completes
  5. On turn N+1, _restore_primary_runtime restores the REVOKED key from the snapshot
  6. Immediate 401 again; pool rotates; if only one un-exhausted entry remains, it also fails
  7. All entries exhausted → cross-provider fallback activates
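Steps 3 to 6 can be simulated with a toy model of the pre-fix behavior. Everything here is a stand-in: `call_api`, `PreFixAgent`, and the rotation policy are hypothetical and only mimic the sequence described above, with key "k1" playing the revoked token.

```python
REVOKED = {"k1"}  # stand-in for a server-side revoked token


def call_api(key):
    """Return a fake HTTP status: 401 for revoked keys, 200 otherwise."""
    return 401 if key in REVOKED else 200


class PreFixAgent:
    def __init__(self, keys):
        self.pool = list(keys)
        self.exhausted = set()
        self.api_key = keys[0]
        self._primary_runtime = {"api_key": keys[0]}  # construction-time snapshot

    def _restore_primary_runtime(self):
        # Pre-fix behavior: the pool is not consulted.
        self.api_key = self._primary_runtime["api_key"]

    def run_turn(self):
        """Try keys until one succeeds; return the list of statuses seen."""
        statuses = []
        while True:
            status = call_api(self.api_key)
            statuses.append(status)
            if status == 200:
                return statuses
            self.exhausted.add(self.api_key)
            remaining = [k for k in self.pool if k not in self.exhausted]
            if not remaining:
                return statuses  # cross-provider fallback would start here
            self.api_key = remaining[0]


agent = PreFixAgent(["k1", "k2", "k3"])
turn_n = agent.run_turn()            # 401 on k1, then rotation to k2
agent._restore_primary_runtime()     # snapshot puts k1 back
turn_n_plus_1 = agent.run_turn()     # immediate 401 again before rotating
```

In this toy model every turn after the revocation spends its first request on the revoked key, which is the wasted 401 the reproduction describes.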

Evidence

Agent log shows the pattern:

14:05:55 Primary runtime restored for new turn: gpt-5.5 (openai-codex)
14:05:55 Fallback activated: gpt-5.5 → glm-5.1 (zai)
14:24:27 Primary runtime restored for new turn: gpt-5.5 (openai-codex)
14:24:27 Fallback activated: gpt-5.5 → glm-5.1 (zai)

Restore and fallback happen within the same second — the stale key fails immediately.

Impact

  • Affects all providers with credential pools (openai-codex, anthropic, nous, etc.)
  • Single-credential pools are unaffected (no rotation to miss)
  • Multi-turn sessions are most impacted (each restore re-stales the key)
