hermes - 💡(How to fix) Fix [Bug] Codex Responses API soft failures (status=failed) bypass credential pool rotation

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

When the OpenAI Codex Responses API returns quota exhaustion as HTTP 200 with response.status = "failed", the runtime correctly detects the soft failure (via the response_invalid block added in #15104) but routes recovery exclusively through _try_activate_fallback() (cross-provider). The response_invalid block never calls _recover_with_credential_pool() or mark_exhausted_and_rotate(), so same-provider credential pool rotation is dead for this error path.

Error Message

When the OpenAI Codex Responses API returns quota exhaustion as HTTP 200 with response.status = "failed", the runtime correctly detects the soft failure (via the response_invalid block added in #15104) but routes recovery exclusively through _try_activate_fallback() (cross-provider). The response_invalid block never calls _recover_with_credential_pool() or mark_exhausted_and_rotate(), so same-provider credential pool rotation is dead for this error path. Two error-handling paths exist in run_agent.py:

Path 1: HTTP Exception Handler (pool rotation WORKS)

  • response.error contains quota/rate-limit message The OpenAI SDK does not throw an exception for this. The error is only detectable by inspecting response.status and response.error. This means Path 1 never fires for Codex quota exhaustion — only Path 2. In the response_invalid block for codex_responses, when response.status in {"failed", "cancelled"} and the error message contains quota/billing/rate-limit signals, call self._recover_with_credential_pool(reason) with an appropriate FailoverReason before falling through to _try_activate_fallback().

Classify the error for pool recovery

The error classification logic from agent/error_classifier.py could be reused to detect quota signals from response.error.message and response.error.code. If the error is not quota-related (e.g. content policy), pool rotation should not fire.

  • #23138 — eager fallback not firing for HTTP 402 (exception handler path, different error class)

Root Cause

Two error-handling paths exist in run_agent.py:

Fix Action

Workaround

  1. Start a fresh session (/new or restart hermes) — the pool will select a different entry on cold start
  2. If all entries are exhausted, manually reset: hermes auth reset openai-codex
  3. If using fill_first, manually swap priority 0: hermes auth remove openai-codex 1 && hermes auth add openai-codex --type oauth --label <name>

Code Example

$ grep -c "_recover_with_credential_pool\|mark_exhausted" <lines 12335-12600>
0

---

# After setting response_invalid = True and error_details:
if _codex_resp_status in {"failed", "cancelled"}:
    # Classify the error for pool recovery
    error_reason = self._classify_codex_soft_failure(_codex_error_msg)
    if error_reason:
        recovered, _ = self._recover_with_credential_pool(error_reason)
        if recovered:
            retry_count = 0
            compression_attempts = 0
            primary_recovery_attempted = False
            continue  # retry with new credentials
RAW_BUFFERClick to expand / collapse

Summary

When the OpenAI Codex Responses API returns quota exhaustion as HTTP 200 with response.status = "failed", the runtime correctly detects the soft failure (via the response_invalid block added in #15104) but routes recovery exclusively through _try_activate_fallback() (cross-provider). The response_invalid block never calls _recover_with_credential_pool() or mark_exhausted_and_rotate(), so same-provider credential pool rotation is dead for this error path.

Environment

  • Hermes Agent v0.13.0 (2026.5.7)
  • Provider: openai-codex (3 OAuth accounts in credential pool, round_robin strategy)
  • Python 3.13, OpenAI SDK 2.31.0

Reproduction

  1. Configure multiple openai-codex OAuth accounts in the credential pool (hermes auth add openai-codex --type oauth --label <name>)
  2. Set credential_pool_strategies.openai-codex: round_robin (or any strategy)
  3. Drain the quota on the first pool entry (priority 0)
  4. Run any hermes session using openai-codex
  5. Observe: the runtime retries the same exhausted account 3 times with backoff, then falls through to cross-provider fallback (if configured) or gives up. It never rotates to the next pool entry.

Root cause

Two error-handling paths exist in run_agent.py:

Path 1: HTTP Exception Handler (pool rotation WORKS)

Lines ~13170+ in the except block around the API call loop.

When OpenAI SDK throws APIStatusError (HTTP 429, 402, etc.), the handler calls:

  1. classify_api_error()FailoverReason
  2. _recover_with_credential_pool(reason)mark_exhausted_and_rotate() → selects next pool entry
  3. Retries with new credentials

Path 2: Invalid Response Handler (pool rotation DOES NOT FIRE)

Lines ~12335-12600 in the response_invalid block.

When the Responses API returns HTTP 200 with response.status = "failed" (quota exhaustion), the handler:

  1. Detects _codex_resp_status in {"failed", "cancelled"} (correctly, added in #15104)
  2. Sets response_invalid = True
  3. Calls _try_activate_fallback() — cross-provider only
  4. If no fallback: retries with exponential backoff on the same credentials
  5. After max_retries (default 3): gives up

Zero calls to _recover_with_credential_pool() or mark_exhausted_and_rotate() in the entire response_invalid block. The pool entry is never marked exhausted, so subsequent sessions may also select the same dead entry.

This grep confirms:

$ grep -c "_recover_with_credential_pool\|mark_exhausted" <lines 12335-12600>
0

Why the Responses API does this

The OpenAI Responses API returns quota exhaustion as a "soft failure":

  • HTTP 200 (success from SDK perspective)
  • response.status = "failed" (not "completed")
  • response.error contains quota/rate-limit message
  • response.output is empty

The OpenAI SDK does not throw an exception for this. The error is only detectable by inspecting response.status and response.error. This means Path 1 never fires for Codex quota exhaustion — only Path 2.

Proposed fix

In the response_invalid block for codex_responses, when response.status in {"failed", "cancelled"} and the error message contains quota/billing/rate-limit signals, call self._recover_with_credential_pool(reason) with an appropriate FailoverReason before falling through to _try_activate_fallback().

Pseudocode for the insertion point around line 12362:

# After setting response_invalid = True and error_details:
if _codex_resp_status in {"failed", "cancelled"}:
    # Classify the error for pool recovery
    error_reason = self._classify_codex_soft_failure(_codex_error_msg)
    if error_reason:
        recovered, _ = self._recover_with_credential_pool(error_reason)
        if recovered:
            retry_count = 0
            compression_attempts = 0
            primary_recovery_attempted = False
            continue  # retry with new credentials

The error classification logic from agent/error_classifier.py could be reused to detect quota signals from response.error.message and response.error.code. If the error is not quota-related (e.g. content policy), pool rotation should not fire.

Workaround

  1. Start a fresh session (/new or restart hermes) — the pool will select a different entry on cold start
  2. If all entries are exhausted, manually reset: hermes auth reset openai-codex
  3. If using fill_first, manually swap priority 0: hermes auth remove openai-codex 1 && hermes auth add openai-codex --type oauth --label <name>

Related

  • #15104 — added the response.status == "failed" detection but only routes to cross-provider fallback
  • #22212 — incomplete in-retry profile rotation (python platform layer, different code path)
  • #23138 — eager fallback not firing for HTTP 402 (exception handler path, different error class)
  • #4361 — fixed credential pool preservation through smart routing and deferred eager fallback on 429

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - 💡(How to fix) Fix [Bug] Codex Responses API soft failures (status=failed) bypass credential pool rotation