codex: Retry transient capacity errors with backoff and retained task state

codex, 2026-05-12 23:08:44

What variant of Codex are you using?

Codex Desktop App, Codex VS Code extension. Version 26.506.31421

What feature would you like to see?

Transient capacity errors should be treated as retryable.

When Codex receives a backend/model-capacity condition such as:

Selected model is at capacity. Please try a different model.

that should be treated as a temporary admission/capacity failure, not as a terminal task failure.

Codex should keep the task alive, retry with bounded exponential backoff and jitter, and resume from the last safe model/tool boundary.

This is normal distributed-systems behavior for transient overload. Treating a retryable backend condition as a terminal task failure turns automation into babysitting.
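As a rough illustration of "bounded exponential backoff and jitter", here is a minimal sketch using the common "full jitter" variant. The function name `backoff_delay` and the `base` and `cap` defaults are assumptions chosen for illustration; nothing here reflects Codex's actual retry machinery.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a delay in seconds for a 0-based retry attempt (hypothetical helper)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    rng = rng or random.Random()
    # Clamp the exponent so very large attempt numbers cannot overflow a float.
    exponent = min(attempt, 32)
    # Exponential ceiling, bounded by cap so delays never grow without limit.
    ceiling = min(cap, base * (2 ** exponent))
    # Full jitter: pick uniformly in [0, ceiling] so many clients retrying
    # at once do not hit the backend in lockstep.
    return rng.uniform(0.0, ceiling)
```

Passing a seeded `random.Random` makes the delays reproducible in tests, while production callers would leave `rng` unset.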

Core reliability expectation

Retryable tasks should not fail terminally.

If a task hits a retryable condition, such as transient model capacity, network interruption, server overload, or temporary backend unavailability, Codex should preserve the task state and keep the task resumable.

The state machine should be:

running -> retrying with backoff -> paused / waiting for manual retry

not:

running -> failed

Automatic retry can have limits. That is fine. But exhausting the automatic retry budget should move the task into a paused state that retains task state, not a failed state.

At that point the UI should offer:

Retry now
Continue retrying automatically
Switch model and resume
Cancel task

The important contract is that retryable errors should not destroy task continuity or force the user to reconstruct context manually.
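The paused state and its four actions could be modelled roughly as follows. This is only a sketch: `TaskState`, `UserAction`, and `apply_user_action` are hypothetical names, and the choice that "Retry now" and "Switch model and resume" both return the task to `running` is an assumption rather than something the issue specifies.

```python
from enum import Enum


class TaskState(Enum):
    RUNNING = "running"
    RETRYING = "retrying with backoff"
    PAUSED = "paused / waiting for manual retry"
    FAILED = "failed"
    CANCELLED = "cancelled"


class UserAction(Enum):
    RETRY_NOW = "Retry now"
    CONTINUE_AUTO = "Continue retrying automatically"
    SWITCH_MODEL = "Switch model and resume"
    CANCEL = "Cancel task"


def apply_user_action(state: TaskState, action: UserAction) -> TaskState:
    """Map a user action taken in the paused state to the next task state."""
    if state is not TaskState.PAUSED:
        raise ValueError("user actions in this sketch apply only to a paused task")
    if action is UserAction.CANCEL:
        return TaskState.CANCELLED
    if action is UserAction.CONTINUE_AUTO:
        # Grant a fresh automatic retry budget and go back to backoff.
        return TaskState.RETRYING
    # Retry now / switch model: resume immediately with retained task state.
    return TaskState.RUNNING
```

Note that no action taken from `PAUSED` leads to `FAILED`; only a non-retryable error would.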

Current behavior

Codex performs substantial work and then stops on a transient backend capacity condition.

Examples from screenshots:

- Ran 4 commands, changed 3 files, +175 -53, then stopped with:
  Selected model is at capacity. Please try a different model.
- Ran 8 commands, changed 9 files, +901 -106, then stopped with:
  Selected model is at capacity. Please try a different model.

This leaves the user with partial work and requires manual intervention.

The user has to notice the failure, reopen or review the task, determine whether the work is safe to continue, and manually act as the retry loop.

Expected behavior

I should be able to start a long-running Codex task, leave it running overnight, and trust that transient backend capacity issues will be handled automatically.

Expected flow:

1. Codex receives a transient capacity / server-overloaded / model-at-capacity error.
2. The task remains active.
3. The UI shows something like:

   > Selected model is temporarily at capacity. Retrying in 30s...

4. Codex retries with exponential backoff and jitter.
5. Codex resumes from the last safe model/tool boundary.
6. If automatic retries are exhausted, the task moves to `paused / waiting for manual retry`, not `failed`.
7. The user can optionally:
   - retry now
   - continue retrying automatically
   - cancel
   - manually switch models and resume
8. Manual intervention should happen only after retry exhaustion or a true non-retryable failure.
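The expected flow above might look something like the following loop. This is a simplified sketch, not Codex code: `RetryableError`, `NonRetryableError`, and `run_step` are hypothetical names, the 30-second base delay is taken from the example UI message, and jitter is left out so the delays are deterministic.

```python
import time
from typing import Callable, List, Tuple


class RetryableError(Exception):
    """Hypothetical marker for transient capacity / overload errors."""


class NonRetryableError(Exception):
    """Hypothetical marker for errors that retrying cannot fix."""


def run_step(step: Callable[[], str], max_retries: int = 3,
             base_delay: float = 30.0,
             sleep: Callable[[float], None] = time.sleep) -> Tuple[str, List[str]]:
    """Run one model/tool step; return the final state and the UI events emitted."""
    events: List[str] = []
    for attempt in range(max_retries + 1):
        try:
            step()
            return "completed", events
        except NonRetryableError as err:
            events.append(f"Failed: {err}")
            return "failed", events
        except RetryableError:
            if attempt == max_retries:
                break  # retry budget exhausted: pause, do not fail
            delay = base_delay * (2 ** attempt)  # jitter omitted for clarity
            events.append(
                "Selected model is temporarily at capacity. "
                f"Retrying in {delay:.0f}s... (attempt {attempt + 1} of {max_retries})"
            )
            sleep(delay)
    events.append("Automatic retries exhausted. Task paused; waiting for manual retry.")
    return "paused / waiting for manual retry", events
```

Injecting `sleep` lets a test drive the loop without real waiting, for example by passing `slept.append` and asserting on the recorded delays.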

Failure modes observed

Failure mode 1: hard stop during long-running work

Codex makes progress, runs commands, modifies files, then stops because the selected model is temporarily at capacity.

The work is not treated as paused/retrying. It is treated as failed.

Failure mode 2: UI pushes manual model switching too early

The error tells the user to try a different model before Codex has exhausted automatic retry.

For automation, the first recovery path should be automatic retry. Manual model switching should be an override or fallback, not the default recovery path.

Failure mode 3: retry exhaustion conflates pause with failure

Even if automatic retry has a bounded budget, exhausting that budget should not mean the task failed.

It should mean Codex has paused automatic retry and retained the task state for manual retry, continued retry, model switch, or cancellation.

Why this matters

The expectation of an automation agent is:

start the task
leave it alone
come back later
review completed work or a clear non-retryable failure

The current experience is:

start the task
worry it may silently stop
hover over it
come back to partial work blocked by a transient backend condition
manually act as the retry loop

That turns automation into babysitting.

This creates reliability, trust, and anxiety problems. The product expectation is that automation means automation.

Expectation:

leave long-running tasks overnight
build trust
relax
review results later

Reality:

unreliability
trust issues
hovering
anxiety
manual recovery from transient backend state

Suggested implementation

This should likely be handled in shared core/protocol code, not separately in each UI.

The overload/capacity condition appears to be represented as a shared error condition and exposed through shared protocol error info. That looks like the right place to implement cross-platform behavior.

Suggested behavior:

classify server/model capacity overload as retryable
reuse existing retry/backoff machinery
emit a retrying/reconnecting event rather than a terminal error
only surface the current “try a different model” wording after retry exhaustion
avoid rerunning completed commands blindly
resume from persisted conversation/tool state
if automatic retries are exhausted, emit a paused/retryable state instead of a failed state

Conceptually:

retryable error + retry budget available -> retry with backoff
retryable error + retry budget exhausted -> paused / waiting for manual retry
non-retryable error -> terminal error
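The three rules above can be written as a small decision function. The error-kind strings below are hypothetical labels loosely based on the conditions this issue lists; they are not identifiers from the Codex protocol.

```python
from enum import Enum

# Hypothetical error-kind names, assumed for illustration only.
RETRYABLE_KINDS = frozenset({
    "model_at_capacity",
    "server_overloaded",
    "network_interruption",
    "backend_unavailable",
})


class Decision(Enum):
    RETRY_WITH_BACKOFF = "retry with backoff"
    PAUSE_FOR_MANUAL_RETRY = "paused / waiting for manual retry"
    TERMINAL_ERROR = "terminal error"


def decide(error_kind: str, retries_left: int) -> Decision:
    """Apply the three conceptual rules to one error occurrence."""
    if error_kind not in RETRYABLE_KINDS:
        return Decision.TERMINAL_ERROR
    if retries_left > 0:
        return Decision.RETRY_WITH_BACKOFF
    return Decision.PAUSE_FOR_MANUAL_RETRY
```

Keeping the classification in one pure function like this would make it straightforward to cover with unit tests in shared code rather than in each UI.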

Definition of done / acceptance criteria

This issue should not be considered resolved by temporary backend mitigation alone.

A resolution should include durable product behavior that can be observed by users and, where applicable, verified by tests.

Required behavior

- Transient model-capacity / server-overloaded errors are classified as retryable.
- A task receiving a retryable capacity error does not immediately enter a terminal failed state.
- While automatic retries remain, the task enters a visible retrying state.
- The retrying state shows at least:
  - that Codex is retrying
  - the retry attempt count or retry phase
  - the next retry delay or approximate retry time
- Retry uses bounded exponential backoff with jitter, or an equivalent documented backoff policy.
- If the automatic retry budget is exhausted, the task enters a retained-state `paused / waiting for manual retry` state, not `failed`.
- In the paused state, the user can resume without recreating the task, re-prompting from scratch, or manually reconstructing context.
- The paused state offers clear actions:
  - Retry now
  - Continue automatic retry
  - Cancel
  - Switch model and resume, if model switching is supported for that task type
- Completed file changes remain visible and intact.
- Completed command/tool results remain visible and intact.
- The task retains enough execution state to resume safely from the last safe model/tool boundary.
- Codex does not blindly rerun shell commands or tool calls that already completed successfully.
- Non-retryable failures still fail clearly and do not loop forever.
- Retryable failure handling is observable in logs/debug output.
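One way to satisfy the "do not blindly rerun completed commands" requirement is to journal completed tool calls by a stable id and consult the journal on resume. The sketch below assumes such ids exist and that results are JSON-serializable strings; `ToolJournal` is a hypothetical name and does not describe how Codex actually persists state.

```python
import json
from typing import Callable, Dict, Optional


class ToolJournal:
    """Records results of completed tool calls, keyed by a stable call id."""

    def __init__(self, results: Optional[Dict[str, str]] = None) -> None:
        self._results: Dict[str, str] = dict(results or {})

    def run_once(self, call_id: str, run: Callable[[], str]) -> str:
        # A call that already completed is answered from the journal,
        # so resuming after a retry never re-executes it.
        if call_id in self._results:
            return self._results[call_id]
        result = run()
        self._results[call_id] = result
        return result

    def dump(self) -> str:
        """Serialize the journal so it can be persisted across a pause."""
        return json.dumps(self._results, sort_keys=True)

    @classmethod
    def load(cls, data: str) -> "ToolJournal":
        """Rebuild a journal from previously persisted JSON."""
        return cls(json.loads(data))
```

On resume, a task would load the journal and replay its plan through `run_once`, so only the calls after the last safe boundary actually execute.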

Required test or verification coverage

A complete fix should include test coverage or an equivalent maintainer-verifiable reproduction showing:

A capacity/server-overloaded response during a task causes retry/backoff, not terminal failure.
A capacity/server-overloaded response after partial work preserves changed files and command/tool history.
Exhausting the automatic retry budget produces a paused/resumable state, not a failed state.
Manual retry from the paused state resumes the task without losing state.
Non-retryable errors still produce a terminal failure.

Not sufficient to close this issue

The following should not be considered sufficient resolution by themselves:

Closing because a specific outage or capacity incident was mitigated.
Changing only the error message text.
Advising the user to switch models manually.
Adding only a banner or warning.
Retrying only initial request creation while still allowing mid-task capacity failures to terminate the task.
Marking the issue resolved without a linked PR, commit, release note, or explicit explanation of where the retry/pause behavior is implemented.
Treating retry budget exhaustion as task failure instead of retained-state pause.

Additional information

Related issues

Primary related issue:

Related to #22277

#22277 was closed as mitigated, but that does not appear to establish durable product behavior. Backend capacity recovery and client/task resilience are separate issues. This request is specifically about the latter: retryable capacity failures must not terminate or discard long-running task state.

Strong supporting issues:

Related to #19583
Related to #19579
Related to #17014
Related to #19446

Adjacent UI/state issues:

Related to #11635
Related to #11904

Related PRs / implementation precedent

These do not appear to fully resolve this request, but they are relevant implementation context:

Related to #10118
- model-capacity guidance / client-side capacity messaging
Related to #1947
- capacity / high-demand error path
Related to #1956
- retry/backoff tuning
Related to #506
- precedent for retrying mid-stream failures with existing exponential backoff
Related to #12001
- model reroute notification, adjacent to capacity/routing behavior

Relevant files

Possible shared implementation points:

codex-rs/codex-api/src/api_bridge.rs
- maps backend overload/capacity responses into shared error state
codex-rs/protocol/src/error.rs
- defines shared error classification and retryability
codex-rs/core/src/session/turn.rs
- main turn loop where retryable errors should become retrying/paused, not terminal failure
codex-rs/core/src/compact.rs
- existing retry/backoff precedent using reconnect-style behavior
codex-rs/protocol/src/error_tests.rs
- likely place for retryability regression tests
codex-rs/codex-api/src/api_bridge_tests.rs
- existing mapping tests for server-overloaded responses
codex-rs/app-server-protocol/src/protocol/v2/shared.rs
- shared protocol surface for app/extension clients

Triage metadata

Suggested labels:

enhancement
bug
connectivity
app
extension
CLI, if maintainers want to track the same cross-client behavior

I would avoid treating this primarily as a rate-limits issue. This is about retryable model/server capacity and retained task state, not user quota exhaustion.

Additional information

This is not a request for unlimited capacity, priority access, or automatic quality downgrade.

It is a request for Codex to handle retryable infrastructure conditions like an automation system:

retry automatically when safe
preserve state
pause instead of fail when automatic retry is exhausted
let the user resume without reconstructing context manually

A retryable transient backend condition should not become a terminal user-facing task failure.
