codex - 💡(How to fix) Fix Retry transient capacity errors with backoff and retained task state


Error Message

  • Codex receives a transient capacity / server-overloaded / model-at-capacity error.
  • The error tells the user to try a different model before Codex has exhausted automatic retry.
  • The overload/capacity condition appears to be represented as a shared error condition and exposed through shared protocol error info. That looks like the right integration point for cross-platform behavior.
  • Codex should emit a retrying/reconnecting event rather than a terminal error:
    • retryable error + retry budget available -> retry with backoff
    • retryable error + retry budget exhausted -> paused / waiting for manual retry
    • non-retryable error -> terminal error
  • A task receiving a retryable capacity error should not immediately enter a terminal failed state.
  • Changing only the error message text is not a sufficient fix.
  • Related to #1947
    • capacity / high-demand error path
  • codex-rs/codex-api/src/api_bridge.rs
    • maps backend overload/capacity responses into shared error state
  • codex-rs/protocol/src/error.rs
    • defines shared error classification and retryability

Root Cause

The expectation of an automation agent is:

  • start the task
  • leave it alone
  • come back later
  • review completed work or a clear non-retryable failure

The current experience is:

  • start the task
  • worry it may silently stop
  • hover over it
  • come back to partial work blocked by a transient backend condition
  • manually act as the retry loop

That turns automation into babysitting.

This creates reliability, trust, and anxiety problems. The product expectation is that automation means automation.

Expectation:

  • leave long-running tasks overnight
  • build trust
  • relax
  • review results later

Reality:

  • unreliability
  • trust issues
  • hovering
  • anxiety
  • manual recovery from transient backend state

Fix Action

Fix / Workaround

This issue should not be considered resolved by temporary backend mitigation alone.

Code Example

The state machine should be:

running -> retrying with backoff -> paused / waiting for manual retry

---

not:

running -> failed

---

Conceptually:

retryable error + retry budget available -> retry with backoff
retryable error + retry budget exhausted -> paused / waiting for manual retry
non-retryable error -> terminal error

What variant of Codex are you using?

Codex Desktop App, Codex VS Code extension. Version 26.506.31421

What feature would you like to see?

Transient capacity errors should be treated as retryable.

When Codex receives a backend/model-capacity condition such as:

Selected model is at capacity. Please try a different model.

that should be treated as a temporary admission/capacity failure, not as a terminal task failure.

Codex should keep the task alive, retry with bounded exponential backoff and jitter, and resume from the last safe model/tool boundary.
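As an illustration of "bounded exponential backoff and jitter", here is a minimal Rust sketch. Every name in it (`BackoffPolicy`, `ceiling`, `delay`) is hypothetical and does not come from the Codex codebase, and the "full jitter" variant is just one common choice. The random fraction is passed in by the caller so the sketch needs no RNG crate.

```rust
use std::time::Duration;

/// Hypothetical retry policy; not an existing Codex type.
#[derive(Debug, Clone, Copy)]
pub struct BackoffPolicy {
    /// Delay ceiling for the first retry.
    pub base: Duration,
    /// Upper bound on any single delay (the "bounded" part).
    pub cap: Duration,
    /// Automatic retry budget.
    pub max_attempts: u32,
}

impl BackoffPolicy {
    /// Largest allowed delay before retry number `attempt` (1-based):
    /// base * 2^(attempt - 1), clamped to `cap`.
    pub fn ceiling(&self, attempt: u32) -> Duration {
        let exponent = attempt.saturating_sub(1).min(31);
        let factor = 1u32 << exponent;
        self.base
            .checked_mul(factor)
            .map_or(self.cap, |d| d.min(self.cap))
    }

    /// Full jitter: scale the ceiling by a fraction in [0, 1].
    /// Returns `None` once the automatic retry budget is spent, which the
    /// caller would treat as "pause", not "fail".
    pub fn delay(&self, attempt: u32, unit_random: f64) -> Option<Duration> {
        if attempt == 0 || attempt > self.max_attempts {
            return None;
        }
        // Guard against NaN or infinite input before scaling.
        let fraction = if unit_random.is_finite() {
            unit_random.clamp(0.0, 1.0)
        } else {
            1.0
        };
        Some(self.ceiling(attempt).mul_f64(fraction))
    }
}
```

With a 1s base and a 30s cap, the ceilings grow as 1s, 2s, 4s, and so on until they reach 30s. Jitter spreads clients out so they do not all retry at the same instant after an overload.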

This is normal distributed-systems behavior for transient overload. Treating a retryable backend condition as a terminal task failure turns automation into babysitting.

Core reliability expectation

Retryable tasks should not fail terminally.

If a task hits a retryable condition, such as transient model capacity, network interruption, server overload, or temporary backend unavailability, Codex should preserve the task state and keep the task resumable.

The state machine should be:

running -> retrying with backoff -> paused / waiting for manual retry

not:

running -> failed

Automatic retry can have limits. That is fine. But exhausting the automatic retry budget should move the task into a retained-state paused state, not a failed state.

At that point the UI should offer:

  • Retry now
  • Continue retrying automatically
  • Switch model and resume
  • Cancel task

The important contract is that retryable errors should not destroy task continuity or force the user to reconstruct context manually.
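The state machine and the paused-state actions described above could be modeled roughly as follows. This is a sketch under stated assumptions: the `TaskState` and `Event` enums and the `next_state` function are invented for illustration and are not existing Codex types. For simplicity it treats "Retry now" and "Continue retrying automatically" the same way, by restarting the retry budget.

```rust
/// Hypothetical task states; not taken from the Codex codebase.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Running,
    Retrying { attempt: u32 },
    /// Retained state, waiting for the user. Not a failure.
    Paused,
    Completed,
    /// Reserved for non-retryable errors.
    Failed,
    Cancelled,
}

/// Hypothetical events, including the user actions offered while paused.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    RetryableError,
    NonRetryableError,
    RetrySucceeded,
    Finished,
    RetryNow,
    ContinueAutoRetry,
    SwitchModelAndResume,
    Cancel,
}

/// Pure transition function. `max_attempts` is the automatic retry budget.
pub fn next_state(state: TaskState, event: Event, max_attempts: u32) -> TaskState {
    use Event::*;
    use TaskState::*;
    match (state, event) {
        (Running, RetryableError) if max_attempts > 0 => Retrying { attempt: 1 },
        (Running, RetryableError) => Paused,
        (Retrying { attempt }, RetryableError) if attempt < max_attempts => {
            Retrying { attempt: attempt + 1 }
        }
        // Budget exhausted: pause with retained state instead of failing.
        (Retrying { .. }, RetryableError) => Paused,
        (Running | Retrying { .. }, NonRetryableError) => Failed,
        (Retrying { .. }, RetrySucceeded) => Running,
        (Running, Finished) => Completed,
        (Paused, RetryNow | ContinueAutoRetry) => Retrying { attempt: 1 },
        (Paused, SwitchModelAndResume) => Running,
        (Running | Retrying { .. } | Paused, Cancel) => Cancelled,
        // Any other combination leaves the state unchanged.
        (other, _) => other,
    }
}
```

The key property is that no transition leads from a retryable error to `Failed`. Only `NonRetryableError` reaches that state.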

Current behavior

Codex performs substantial work and then stops on a transient backend capacity condition.

Examples from screenshots:

  • Ran 4 commands, changed 3 files, +175 -53, then stopped with:

    Selected model is at capacity. Please try a different model.

  • Ran 8 commands, changed 9 files, +901 -106, then stopped with:

    Selected model is at capacity. Please try a different model.

This leaves the user with partial work and requires manual intervention.

The user has to notice the failure, reopen or review the task, determine whether the work is safe to continue, and manually act as the retry loop.

Expected behavior

I should be able to start a long-running Codex task, leave it running overnight, and trust that transient backend capacity issues will be handled automatically.

Expected flow:

  1. Codex receives a transient capacity / server-overloaded / model-at-capacity error.

  2. The task remains active.

  3. The UI shows something like:

    Selected model is temporarily at capacity. Retrying in 30s...

  4. Codex retries with exponential backoff and jitter.

  5. Codex resumes from the last safe model/tool boundary.

  6. If automatic retries are exhausted, the task moves to paused / waiting for manual retry, not failed.

  7. The user can optionally:

    • retry now
    • continue retrying automatically
    • cancel
    • manually switch models and resume
  8. Manual intervention should happen only after retry exhaustion or a true non-retryable failure.
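The expected flow above can be sketched as a small retry driver. This is only an illustration: `TurnError`, `Outcome`, and `run_with_retry` are hypothetical names, the model call is abstracted as a closure, and jitter is left out so the delays stay deterministic. The `wait` callback stands in for both sleeping and showing a status line such as "Retrying in 30s...".

```rust
use std::time::Duration;

/// Hypothetical error split; a real client would derive this from the
/// shared error classification.
#[derive(Debug, PartialEq, Eq)]
pub enum TurnError {
    Retryable(String),
    Fatal(String),
}

/// Hypothetical result of driving one turn to a resting point.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome<T> {
    Done(T),
    /// Retry budget exhausted; task state is retained for manual retry.
    Paused { attempts: u32, last_error: String },
    Failed(String),
}

pub fn run_with_retry<T>(
    max_retries: u32,
    base: Duration,
    cap: Duration,
    mut attempt_turn: impl FnMut() -> Result<T, TurnError>,
    mut wait: impl FnMut(u32, Duration),
) -> Outcome<T> {
    let mut retries: u32 = 0;
    loop {
        match attempt_turn() {
            Ok(value) => return Outcome::Done(value),
            // Non-retryable errors still fail clearly.
            Err(TurnError::Fatal(msg)) => return Outcome::Failed(msg),
            Err(TurnError::Retryable(msg)) => {
                if retries >= max_retries {
                    return Outcome::Paused { attempts: retries, last_error: msg };
                }
                retries += 1;
                // Exponential delay, clamped to `cap`.
                let factor = 1u32 << (retries - 1).min(31);
                let delay = base.checked_mul(factor).map_or(cap, |d| d.min(cap));
                wait(retries, delay);
            }
        }
    }
}
```

In a real client, `wait` would sleep for the (jittered) delay and emit a retrying event to the UI. Here it is injected so the loop can be exercised without real time passing.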

Failure modes observed

Failure mode 1: hard stop during long-running work

Codex makes progress, runs commands, modifies files, then stops because the selected model is temporarily at capacity.

The work is not treated as paused/retrying. It is treated as failed.

Failure mode 2: UI pushes manual model switching too early

The error tells the user to try a different model before Codex has exhausted automatic retry.

For automation, the first recovery path should be automatic retry. Manual model switching should be an override or fallback, not the default recovery path.

Failure mode 3: retry exhaustion conflates pause with failure

Even if automatic retry has a bounded budget, exhausting that budget should not mean the task failed.

It should mean Codex has paused automatic retry and retained the task state for manual retry, continued retry, model switch, or cancellation.


Suggested implementation

This should likely be handled in shared core/protocol code, not separately in each UI.

The overload/capacity condition appears to be represented as a shared error condition and exposed through shared protocol error info. That looks like the right integration point for cross-platform behavior.

Suggested behavior:

  • classify server/model capacity overload as retryable
  • reuse existing retry/backoff machinery
  • emit a retrying/reconnecting event rather than a terminal error
  • only surface the current “try a different model” wording after retry exhaustion
  • avoid rerunning completed commands blindly
  • resume from persisted conversation/tool state
  • if automatic retries are exhausted, emit a paused/retryable state instead of a failed state

Conceptually:

retryable error + retry budget available -> retry with backoff
retryable error + retry budget exhausted -> paused / waiting for manual retry
non-retryable error -> terminal error
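
The three conceptual rules map directly onto a small decision function. The sketch below is illustrative only: `ErrorKind`, `Decision`, and `decide` are hypothetical names, and `InvalidRequest` is an assumed example of a non-retryable error that does not appear in this issue.

```rust
/// Hypothetical error kinds. The retryable ones mirror the conditions named
/// in this issue; `InvalidRequest` is an assumed non-retryable example.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    ModelAtCapacity,
    ServerOverloaded,
    NetworkInterrupted,
    BackendUnavailable,
    InvalidRequest,
}

impl ErrorKind {
    pub fn is_retryable(self) -> bool {
        match self {
            ErrorKind::ModelAtCapacity
            | ErrorKind::ServerOverloaded
            | ErrorKind::NetworkInterrupted
            | ErrorKind::BackendUnavailable => true,
            ErrorKind::InvalidRequest => false,
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    RetryWithBackoff,
    PauseForManualRetry,
    TerminalError,
}

/// One arm per rule: retryable with budget, retryable without budget,
/// and non-retryable.
pub fn decide(kind: ErrorKind, retries_used: u32, retry_budget: u32) -> Decision {
    match (kind.is_retryable(), retries_used < retry_budget) {
        (true, true) => Decision::RetryWithBackoff,
        (true, false) => Decision::PauseForManualRetry,
        (false, _) => Decision::TerminalError,
    }
}
```

Keeping the classification in one exhaustive `match` means a newly added error kind must be classified explicitly before the code compiles.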

Definition of done / acceptance criteria

This issue should not be considered resolved by temporary backend mitigation alone.

A resolution should include durable product behavior that can be observed by users and, where applicable, verified by tests.

Required behavior

  • Transient model-capacity / server-overloaded errors are classified as retryable.
  • A task receiving a retryable capacity error does not immediately enter a terminal failed state.
  • While automatic retries remain, the task enters a visible retrying state.
  • The retrying state shows at least:
    • that Codex is retrying
    • the retry attempt count or retry phase
    • the next retry delay or approximate retry time
  • Retry uses bounded exponential backoff with jitter, or an equivalent documented backoff policy.
  • If the automatic retry budget is exhausted, the task enters a retained-state paused / waiting for manual retry state, not failed.
  • In the paused state, the user can resume without recreating the task, re-prompting from scratch, or manually reconstructing context.
  • The paused state offers clear actions:
    • Retry now
    • Continue automatic retry
    • Cancel
    • Switch model and resume, if model switching is supported for that task type
  • Completed file changes remain visible and intact.
  • Completed command/tool results remain visible and intact.
  • The task retains enough execution state to resume safely from the last safe model/tool boundary.
  • Codex does not blindly rerun shell commands or tool calls that already completed successfully.
  • Non-retryable failures still fail clearly and do not loop forever.
  • Retryable failure handling is observable in logs/debug output.
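
To illustrate the requirements about retained state and not blindly rerunning completed commands, here is a minimal sketch of a step journal. It assumes a simplified model in which a task is an ordered list of commands and an interruption surfaces as an error from the executor. `TaskJournal`, `Step`, and `StepStatus` are hypothetical and do not describe how Codex persists conversation or tool state.

```rust
/// Hypothetical per-step record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StepStatus {
    Pending,
    Completed { output: String },
}

#[derive(Debug, Clone)]
pub struct Step {
    pub command: String,
    pub status: StepStatus,
}

/// Hypothetical journal of planned and completed steps for one task.
#[derive(Debug, Default)]
pub struct TaskJournal {
    steps: Vec<Step>,
}

impl TaskJournal {
    pub fn plan(&mut self, command: &str) {
        self.steps.push(Step {
            command: command.to_string(),
            status: StepStatus::Pending,
        });
    }

    /// Index of the first step that has not completed: the resume point.
    pub fn resume_point(&self) -> Option<usize> {
        self.steps.iter().position(|s| s.status == StepStatus::Pending)
    }

    /// Runs only pending steps. On error, completed steps keep their
    /// recorded output, so a later call resumes instead of starting over.
    pub fn run_pending(
        &mut self,
        mut exec: impl FnMut(&str) -> Result<String, String>,
    ) -> Result<(), String> {
        for step in self.steps.iter_mut() {
            if let StepStatus::Completed { .. } = step.status {
                continue; // never rerun a step that already succeeded
            }
            let output = exec(&step.command)?;
            step.status = StepStatus::Completed { output };
        }
        Ok(())
    }
}
```

If the second of two steps is interrupted, `resume_point` reports index 1, and a subsequent `run_pending` call executes only that step.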

Required test or verification coverage

A complete fix should include test coverage or an equivalent maintainer-verifiable reproduction showing:

  • A capacity/server-overloaded response during a task causes retry/backoff, not terminal failure.
  • A capacity/server-overloaded response after partial work preserves changed files and command/tool history.
  • Exhausting the automatic retry budget produces a paused/resumable state, not a failed state.
  • Manual retry from the paused state resumes the task without losing state.
  • Non-retryable errors still produce a terminal failure.

Not sufficient to close this issue

The following should not be considered sufficient resolution by themselves:

  • Closing because a specific outage or capacity incident was mitigated.
  • Changing only the error message text.
  • Advising the user to switch models manually.
  • Adding only a banner or warning.
  • Retrying only initial request creation while still allowing mid-task capacity failures to terminate the task.
  • Marking the issue resolved without a linked PR, commit, release note, or explicit explanation of where the retry/pause behavior is implemented.
  • Treating retry budget exhaustion as task failure instead of retained-state pause.

Additional information

Related issues

Primary related issue:

  • Related to #22277

#22277 was closed as mitigated, but that does not appear to establish durable product behavior. Backend capacity recovery and client/task resilience are separate issues. This request is specifically about the latter: retryable capacity failures must not terminate or discard long-running task state.

Strong supporting issues:

  • Related to #19583
  • Related to #19579
  • Related to #17014
  • Related to #19446

Adjacent UI/state issues:

  • Related to #11635
  • Related to #11904

Related PRs / implementation precedent

These do not appear to fully resolve this request, but they are relevant implementation context:

  • Related to #10118
    • model-capacity guidance / client-side capacity messaging
  • Related to #1947
    • capacity / high-demand error path
  • Related to #1956
    • retry/backoff tuning
  • Related to #506
    • precedent for retrying mid-stream failures with existing exponential backoff
  • Related to #12001
    • model reroute notification, adjacent to capacity/routing behavior

Relevant files

Possible shared implementation points:

  • codex-rs/codex-api/src/api_bridge.rs
    • maps backend overload/capacity responses into shared error state
  • codex-rs/protocol/src/error.rs
    • defines shared error classification and retryability
  • codex-rs/core/src/session/turn.rs
    • main turn loop where retryable errors should become retrying/paused, not terminal failure
  • codex-rs/core/src/compact.rs
    • existing retry/backoff precedent using reconnect-style behavior
  • codex-rs/protocol/src/error_tests.rs
    • likely place for retryability regression tests
  • codex-rs/codex-api/src/api_bridge_tests.rs
    • existing mapping tests for server-overloaded responses
  • codex-rs/app-server-protocol/src/protocol/v2/shared.rs
    • shared protocol surface for app/extension clients

Triage metadata

Suggested labels:

  • enhancement
  • bug
  • connectivity
  • app
  • extension
  • CLI, if maintainers want to track the same cross-client behavior

I would avoid treating this primarily as rate-limits. This is about retryable model/server capacity and retained task state, not user quota exhaustion.

Additional information

This is not a request for unlimited capacity, priority access, or automatic quality downgrade.

It is a request for Codex to handle retryable infrastructure conditions like an automation system:

  • retry automatically when safe
  • preserve state
  • pause instead of fail when automatic retry is exhausted
  • let the user resume without reconstructing context manually

A retryable transient backend condition should not become a terminal user-facing task failure.
