hermes - ✅(Solved) Fix [Bug] _restore_primary_runtime bypasses credential pool, reuses stale (revoked) api_key [2 pull requests, 1 comment, 2 participants]

jmmaloney4 · 2026-05-13T19:31:56Z


GitHub stats

NousResearch/hermes-agent#25205 • Fetched 2026-05-14 03:48:05

Author: jmmaloney4
Participants: alt-glitch, jmmaloney4
Timeline (top): cross-referenced ×3, labeled ×3, commented ×1

_restore_primary_runtime restores the api_key from a construction-time snapshot (_primary_runtime). When the credential pool has rotated across turns — due to token revocation (401), exhaustion (429/402), or rate-limit cooldown — the snapshot key is stale. The next turn immediately hits the same error, re-exhausts remaining entries, and falls through to cross-provider fallback instead of using the pool's current best entry.

Root Cause

run_agent.py:8908 (_restore_primary_runtime):

self.api_key = rt["api_key"]  # ← stale snapshot key

The credential pool (self._credential_pool) is never consulted during restore. The pool tracks which entries are exhausted/available, but _restore_primary_runtime ignores it entirely.

Fix Action

Fix

After restoring from the snapshot, call pool.select() to get the current best entry and update api_key/base_url/client accordingly. Falls back gracefully to the snapshot key when the pool is absent or empty.

PR: #24174
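
The fix described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR. `Agent`, `PoolEntry`, and `CredentialPool` are hypothetical stand-ins; only `_restore_primary_runtime`, `_primary_runtime`, `_credential_pool`, `api_key`, `base_url`, and `select()` are names taken from the issue. The entry attribute names and the assumption that `select()` returns `None` when nothing is available are illustrative, and the client rebuild is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PoolEntry:
    # Hypothetical entry shape; the real attribute names may differ.
    api_key: str
    base_url: Optional[str] = None
    exhausted: bool = False


@dataclass
class CredentialPool:
    # Hypothetical stand-in for the real credential pool.
    entries: list = field(default_factory=list)

    def select(self) -> Optional[PoolEntry]:
        # Return the first entry that is not exhausted, or None if all are.
        for entry in self.entries:
            if not entry.exhausted:
                return entry
        return None


class Agent:
    def __init__(self, api_key, base_url, pool=None):
        self.api_key = api_key
        self.base_url = base_url
        self._credential_pool = pool
        # Construction-time snapshot, as described in the issue.
        self._primary_runtime = {"api_key": api_key, "base_url": base_url}

    def _restore_primary_runtime(self):
        rt = self._primary_runtime
        self.api_key = rt["api_key"]  # may be stale after rotation
        self.base_url = rt["base_url"]

        pool = self._credential_pool
        if pool is None:
            return  # no pool: keep the snapshot key (pre-fix behavior)
        try:
            entry = pool.select()
        except Exception:
            return  # select() failed: keep the snapshot key
        if entry is None or not entry.api_key:
            return  # pool empty or entry has no key: keep the snapshot key
        self.api_key = entry.api_key
        if entry.base_url:
            self.base_url = entry.base_url
        # A real implementation would rebuild the client here.
```

Every early return leaves the snapshot values in place, which matches the stated fallback when the pool is absent or empty.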

PR fix notes

PR #25206: fix(pool): re-select from credential pool on primary runtime restore

Repository: NousResearch/hermes-agent
Author: jmmaloney4
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/25206

Description (problem / solution / changelog)

Summary

Fixes #25205.

_restore_primary_runtime restores api_key from a construction-time snapshot. When the credential pool has rotated (exhaustion, revocation, rate-limit cooldown), the snapshot key is stale. The next turn immediately hits the same error, re-exhausts entries, and falls through to cross-provider fallback.

After restoring the snapshot, re-select from the credential pool if one exists. Falls back gracefully to the snapshot key when the pool is absent or empty.

Test plan

- [x] 7 new test cases in test_restore_primary_pool_reselect.py
  - Rotation picks new entry (not stale snapshot key)
  - Pool absent → snapshot key used (existing behavior)
  - Pool empty (all exhausted) → snapshot key used
  - Entry with empty key → snapshot key used
  - Base URL updated from pool entry
  - Client rebuild triggered after reselect
- [x] 53 existing credential pool tests pass
- [ ] Manual: run multi-turn session with codex pool, verify restore log shows "Restore re-selected pool entry" and fallback does NOT activate on next turn

Risks

- Low: pool.select() is idempotent and already used at session init. The restore path just calls it again.
- If select() fails or returns None, we fall back to the snapshot key (same as pre-fix behavior).

Changed files

run_agent.py (modified, +60/-0)
tests/agent/test_restore_primary_pool_reselect.py (added, +222/-0)

PR #25277: fix: load credential pool in switch_model for new provider (#25273)

Repository: NousResearch/hermes-agent
Author: jmmaloney4
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/25277

Description (problem / solution / changelog)

Problem

When switching providers mid-session via /model (e.g., from the default zai provider to openai-codex), switch_model() updates model, provider, api_key, base_url, api_mode, and rebuilds the client — but never loads the credential pool for the new provider.

The agent's _credential_pool stays as whatever was loaded at construction time (typically None for providers like zai that use a single API key from env). This silently disables pool rotation on the new provider:

1. All 429/401/402 errors burn the same credential with no rotation
2. After exhausting retries, the agent falls through to cross-provider fallback
3. Zero pool rotation log entries appear (pool is None, recovery bails immediately)

This is why a user with 3 openai-codex OAuth credentials (where one has usage available) still hits rate limits — the pool never rotates because it was never loaded.

Closes #25273

Fix

run_agent.py — switch_model() (line ~2659)

- Call load_pool(new_provider) after swapping runtime fields, before client rebuild
- Set self._credential_pool to the loaded pool (or None if loading fails or pool has no credentials)
- Wrapped in try/except to avoid breaking model switch if pool loading fails
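
A rough sketch of the pool-loading step, under stated assumptions: `FakePool`, its `has_credentials()` method, and the injected `load_pool` callable are hypothetical and exist only to keep the example self-contained. The real `load_pool` signature and the way an empty pool is detected may differ.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class FakePool:
    # Hypothetical pool; has_credentials() is an assumed method name.
    def __init__(self, keys):
        self.keys = list(keys)

    def has_credentials(self) -> bool:
        return bool(self.keys)


class Agent:
    def __init__(self, provider: str, load_pool: Callable[[str], Optional[FakePool]]):
        self.model = None
        self.provider = provider
        self._load_pool = load_pool  # injected so the sketch runs standalone
        self._credential_pool = None

    def switch_model(self, new_model: str, new_provider: str) -> None:
        # Swap runtime fields first (api_key, base_url, api_mode omitted here).
        self.model = new_model
        self.provider = new_provider

        # Load the pool for the new provider; a failure must not break the switch.
        try:
            pool = self._load_pool(new_provider)
            if pool is not None and not pool.has_credentials():
                pool = None
        except Exception as exc:
            logger.warning("Credential pool load failed for %s: %s", new_provider, exc)
            pool = None
        self._credential_pool = pool
        # The client rebuild would follow here.
```

If loading raises or yields no credentials, the agent ends up with `_credential_pool = None`, which is the same state it was in before the fix.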

run_agent.py — _primary_runtime snapshot in switch_model() (line ~2770)

- Include credential_pool in the snapshot dict so it survives _restore_primary_runtime() across turns

run_agent.py — _restore_primary_runtime() (line ~8895)

- Restore credential_pool from snapshot when present and not None
- Older snapshots that predate this field are handled gracefully (pool unchanged)
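
The restore guard described above might look something like this. It is a sketch only: the `Agent` class here is a hypothetical reduction, and the one detail taken from the PR notes is that the snapshot dict carries a `credential_pool` key that may be missing or `None`.

```python
class Agent:
    def __init__(self, pool=None, snapshot=None):
        self._credential_pool = pool
        self._primary_runtime = snapshot if snapshot is not None else {}

    def _restore_primary_runtime(self):
        rt = self._primary_runtime
        # dict.get() returns None for older snapshots that lack the field,
        # so both "missing" and "explicitly None" leave the pool unchanged.
        pool = rt.get("credential_pool")
        if pool is not None:
            self._credential_pool = pool
```

Using `.get()` rather than indexing is what keeps snapshots created before the field existed from raising `KeyError`.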

Test plan

8 new tests in tests/agent/test_switch_model_credential_pool.py:

- test_switch_loads_pool_for_new_provider: verifies load_pool is called
- test_switch_sets_pool_none_when_no_credentials: pool=None when no creds
- test_switch_handles_load_pool_exception: pool=None on load error
- test_switch_includes_pool_in_primary_runtime: snapshot has pool
- test_switch_same_provider_keeps_pool: re-load on same provider
- test_restore_primary_preserves_pool_from_snapshot: restore path works
- test_restore_primary_keeps_pool_when_snapshot_has_none: None guard
- test_restore_primary_keeps_pool_when_snapshot_predates_field: backward compat

All 53 existing credential pool tests still pass.

Risks

- Low: load_pool is called once during switch_model. If it fails (e.g., corrupted auth.json), the pool is set to None and the agent operates without pool rotation, which is the same behavior as before the fix.
- The _primary_runtime snapshot now holds a reference to the pool object. Since the pool is a singleton per provider (loaded from auth.json), this is a shared reference, not a copy. State changes to the pool (e.g., exhaustion) are reflected in both the agent and the snapshot.
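
To make the shared-reference point concrete, here is a small standalone demonstration. The `Pool` class and `mark_exhausted` are hypothetical; the point is only how Python treats an object stored in a dict versus a deep copy of it.

```python
import copy


class Pool:
    # Hypothetical pool that tracks which keys are still usable.
    def __init__(self, keys):
        self.available = set(keys)

    def mark_exhausted(self, key):
        self.available.discard(key)


pool = Pool(["key-1", "key-2"])
agent_pool = pool                                     # what the agent holds
snapshot = {"credential_pool": pool}                  # shared reference, as in the PR
detached = {"credential_pool": copy.deepcopy(pool)}   # a copy, for contrast

agent_pool.mark_exhausted("key-1")

shared_view = snapshot["credential_pool"].available   # reflects the exhaustion
copied_view = detached["credential_pool"].available   # still has both keys
```

Because the snapshot stores the same object, restoring the pool from it does not roll back exhaustion state. A deep copy would.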

Related

#24173: Codex soft-failure pool rotation (response_invalid block)
#25205: _restore_primary_runtime uses stale snapshot key
#24159: Codex Responses API soft failures bypass pool rotation

Changed files

run_agent.py (modified, +24/-0)
tests/agent/test_switch_model_credential_pool.py (added, +302/-0)

Description

Reproduction

1. Configure a credential pool with 3+ openai-codex entries
2. Start a session using openai-codex as primary
3. On turn N, one entry gets 401 token_revoked
4. Pool rotation selects a new entry; turn N completes
5. On turn N+1, _restore_primary_runtime restores the REVOKED key from the snapshot
6. Immediate 401 again; pool rotates; if only one un-exhausted entry remains, it also fails
7. All entries exhausted → cross-provider fallback activates
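
The repeated 401 in the steps above can be reproduced in a toy simulation. Everything here is hypothetical (`Pool`, `REVOKED`, `count_401s`), and the model is deliberately simplified: it only counts how many requests are wasted on the revoked key and does not try to model how the real pool exhausts the remaining entries.

```python
REVOKED = {"key-1"}  # keys the simulated provider rejects with 401


class Pool:
    # Hypothetical pool: selects the first key not marked exhausted.
    def __init__(self, keys):
        self.keys = list(keys)
        self.exhausted = set()

    def select(self):
        for key in self.keys:
            if key not in self.exhausted:
                return key
        return None

    def mark_exhausted(self, key):
        self.exhausted.add(key)


def count_401s(n_turns, reselect_on_restore):
    """Count simulated 401 responses across n_turns of one session."""
    pool = Pool(["key-1", "key-2", "key-3"])
    snapshot_key = "key-1"  # captured at construction time
    errors = 0
    for _ in range(n_turns):
        # Start of turn: restore from the snapshot.
        api_key = snapshot_key
        if reselect_on_restore:
            api_key = pool.select() or snapshot_key
        # Send the request; rotate through the pool on each 401.
        while api_key is not None and api_key in REVOKED:
            errors += 1
            pool.mark_exhausted(api_key)
            api_key = pool.select()
    return errors
```

Without re-selection, every turn starts on the revoked snapshot key and pays one 401. With re-selection, only the first turn does.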

Evidence

Agent log shows the pattern:

14:05:55 Primary runtime restored for new turn: gpt-5.5 (openai-codex)
14:05:55 Fallback activated: gpt-5.5 → glm-5.1 (zai)
14:24:27 Primary runtime restored for new turn: gpt-5.5 (openai-codex)
14:24:27 Fallback activated: gpt-5.5 → glm-5.1 (zai)

Restore and fallback happen within the same second — the stale key fails immediately.
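
To check a longer log for the same signature, a small scan along these lines may help. It assumes the `HH:MM:SS message` line format shown in the excerpt above; real agent logs may carry extra prefixes, in which case the pattern would need adjusting. The function name is hypothetical.

```python
import re

# Matches the two message types seen in the excerpt, with an HH:MM:SS prefix.
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}) (Primary runtime restored|Fallback activated)")


def immediate_fallbacks(lines):
    """Return timestamps where a restore is followed by a fallback in the same second."""
    hits = []
    last_restore = None
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue
        timestamp, kind = match.groups()
        if kind == "Primary runtime restored":
            last_restore = timestamp
        elif last_restore == timestamp:
            hits.append(timestamp)
    return hits
```

Each returned timestamp marks a turn in which the restored key failed almost immediately.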

Impact

- Affects all providers with credential pools (openai-codex, anthropic, nous, etc.)
- Single-credential pools are unaffected (no rotation to miss)
- Multi-turn sessions are most impacted (each restore re-stales the key)
