openclaw - 💡(How to fix) Fix Codex app-server can timeout/fallback during near-window progressing turns [2 pull requests]

StepCodex · 2026-05-12T18:37:04Z

Fixed

- Fixed by PR: fix: make Codex app-server timeout progress-aware (https://github.com/openclaw/openclaw/pull/81152)
- Fixed by PR: fix: normalize Codex app-server runtime attribution (https://github.com/openclaw/openclaw/pull/81180)

Links

Related broad tracker: #66251

Error Message

A beta.5 Codex app-server turn with a very large assembled context hit OpenClaw's fixed attempt timeout, retired the Codex app-server client, and fell back from `openai-codex/gpt-5.5` to `openai-codex/gpt-5.4`. The failure was not an OpenAI quota/rate-limit error and not a QA mock artifact. The source-level issue is that the Codex app-server runner has an absolute wall-clock timeout that does not distinguish a dead turn from a still-progressing, near-window turn.

Impact

Product impact: P1 for beta.5 Codex-default confidence on long first-hour sessions. A real user session with heavy context can be interrupted mid-turn and silently pushed into model fallback even when the underlying app-server path may still be making progress.

QA impact: P1 for runtime parity/confidence, because first-hour and soak lanes need to classify this separately from provider errors, quota failures, and generic model fallback.

What happened

In a local `v2026.5.10-beta.5` gateway session, Lossless-Claw assembled a near-window context and the Codex app-server attempt timed out:

[lcm] assemble: ... tokenBudget=258000 estimatedTokens=257965 ...
codex app-server client retired after timed-out turn
embedded_run_failover_decision ... provider=openai-codex model=gpt-5.5 ... failoverReason=timeout ... rawErrorPreview="codex app-server attempt timed out"
model_fallback_decision ... candidateModel="gpt-5.5" ... reason="timeout" ... nextCandidateModel="gpt-5.4"
model_fallback_decision ... candidateModel="gpt-5.4" ... candidate_succeeded

Related pressure in the same session:

[lcm] auto-rotate: phase=runtime action=rotate ... durationMs=84054 ...
[worker-llm] timeout after 30000ms ...

The LCM/context pressure is probably the amplifier, but the upstream OpenClaw runtime behavior still needs a clearer progress-aware path.
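
To make "near-window" concrete, the headroom implied by the `[lcm] assemble` log line can be worked out directly. The helper below is a hypothetical illustration, not an OpenClaw or Lossless-Claw API; only the two numbers come from the log.

```typescript
// Hypothetical helper for illustration; not part of OpenClaw or Lossless-Claw.
interface ContextPressure {
  headroomTokens: number; // tokens left before the budget is exhausted
  utilization: number;    // fraction of the budget in use, 0..1
}

function contextPressure(tokenBudget: number, estimatedTokens: number): ContextPressure {
  if (tokenBudget <= 0) throw new RangeError("tokenBudget must be positive");
  return {
    headroomTokens: tokenBudget - estimatedTokens,
    utilization: estimatedTokens / tokenBudget,
  };
}

// Values taken from the `[lcm] assemble` log line above.
const pressure = contextPressure(258_000, 257_965);
console.log(pressure.headroomTokens);                       // 35
console.log((pressure.utilization * 100).toFixed(2) + "%"); // 99.99%
```

In other words, the assembled context left roughly 35 tokens of headroom against a 258,000-token budget.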

Source-level evidence

The Codex app-server runner currently arms a fixed attempt timeout and aborts the run when it fires:

- `extensions/codex/src/app-server/run-attempt.ts` creates a `setTimeout(... params.timeoutMs)` and calls `projector?.markTimedOut()` plus `runAbortController.abort("timeout")`.
- The abort path interrupts the turn and retires the app-server client when the timeout fires.
- Existing tests cover retiring the client after timeout, but they do not cover a large/slow turn that is still emitting app-server notifications or token/progress events near the boundary.

The context projection path can render very large context into the Codex prompt:

- `extensions/codex/src/app-server/context-engine-projection.ts` allows rendered context up to `MAX_RENDERED_CONTEXT_CHARS = 1_000_000`.
- Existing context-engine tests cover truncation/reserve behavior, but not timeout/progress interaction for near-window live turns.

Expected behavior

OpenClaw should distinguish at least these cases:

1. Hard-dead app-server turn: no activity/progress, interrupt and retire as today.
2. Still-progressing large Codex turn: either extend/yield/checkpoint intentionally, or fail with a distinct `progressing_timeout`/`context_pressure_timeout` classification instead of generic model fallback.
3. Context-pressure preflight: if assembled context is near the model/runtime budget, reduce projection or warn before the turn enters a fixed wall-clock timeout path.
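
One way to read the three cases above is as a small classifier applied when the timeout fires. The sketch below is an assumption-laden illustration: the three classification strings come from this issue, but `TurnSnapshot`, `ClassifierOptions`, `classifyTimeout`, and the thresholds are hypothetical names and values, and the exact boundary between `progressing_timeout` and `context_pressure_timeout` is a guess rather than the behavior of the linked PRs.

```typescript
// Classification strings are from the issue; everything else here is hypothetical.
type TimeoutClass = "idle_timeout" | "progressing_timeout" | "context_pressure_timeout";

interface TurnSnapshot {
  nowMs: number;
  lastActivityMs: number | null; // last notification/token/progress event, null if none seen
  estimatedTokens: number;
  tokenBudget: number;
}

interface ClassifierOptions {
  idleWindowMs: number;      // activity older than this counts as idle
  pressureThreshold: number; // budget fraction treated as context pressure (assumed value)
}

function classifyTimeout(turn: TurnSnapshot, opts: ClassifierOptions): TimeoutClass {
  const active =
    turn.lastActivityMs !== null && turn.nowMs - turn.lastActivityMs <= opts.idleWindowMs;
  // Case 1: nothing has happened recently, so treat the turn as dead.
  if (!active) return "idle_timeout";
  // Cases 2 and 3: the turn is still moving; report whether context size is the likely cause.
  const utilization = turn.estimatedTokens / turn.tokenBudget;
  return utilization >= opts.pressureThreshold
    ? "context_pressure_timeout"
    : "progressing_timeout";
}
```

With the logged values (`estimatedTokens=257965`, `tokenBudget=258000`) and recent activity, this sketch would report `context_pressure_timeout` for any threshold at or below about 0.9998, which keeps the event out of the generic `timeout` failover bucket.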

Suggested fix direction

Add progress-aware timeout handling for Codex app-server attempts:

- Track recent app-server notifications/output/token/progress activity and expose it to the timeout decision.
- Add a bounded extension or yield/checkpoint path for turns that are active but slow due to large context.
- Preserve the existing hard timeout for truly idle/dead turns.
- Add diagnostics that separate `idle_timeout`, `progressing_timeout`, and `context_pressure_timeout`.
- Add regression coverage with a fake app-server client that emits progress beyond the old timeout boundary.
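
The direction above could look roughly like the following. This is a minimal sketch under stated assumptions, not the implementation in the linked PRs: `ProgressAwareDeadline`, `simulateTurn`, and all parameters are hypothetical, and time is passed in explicitly so a regression test can run without real timers.

```typescript
// Hypothetical sketch; none of these names come from the OpenClaw source.
type TimeoutDecision =
  | { action: "continue" }
  | { action: "extend"; newDeadlineMs: number }
  | { action: "abort"; reason: "idle_timeout" | "progressing_timeout" };

class ProgressAwareDeadline {
  private lastActivityMs: number | null = null;
  private deadlineMs: number;
  private readonly hardCapMs: number;
  private readonly idleWindowMs: number;

  // timeoutMs: the existing fixed attempt timeout.
  // idleWindowMs: how recent activity must be to count as progress.
  // maxExtensionMs: total extra time an active turn may receive.
  constructor(startMs: number, timeoutMs: number, idleWindowMs: number, maxExtensionMs: number) {
    this.deadlineMs = startMs + timeoutMs;
    this.hardCapMs = this.deadlineMs + maxExtensionMs;
    this.idleWindowMs = idleWindowMs;
  }

  // Call on every app-server notification, output chunk, or token/progress event.
  recordActivity(nowMs: number): void {
    this.lastActivityMs = nowMs;
  }

  check(nowMs: number): TimeoutDecision {
    if (nowMs < this.deadlineMs) return { action: "continue" };
    const active =
      this.lastActivityMs !== null && nowMs - this.lastActivityMs <= this.idleWindowMs;
    // Idle or dead turn: keep today's behavior and abort.
    if (!active) return { action: "abort", reason: "idle_timeout" };
    // Active, but the bounded extension is used up: abort with a distinct reason.
    if (nowMs >= this.hardCapMs) return { action: "abort", reason: "progressing_timeout" };
    // Active and within bounds: push the deadline out by one idle window.
    this.deadlineMs = Math.min(nowMs + this.idleWindowMs, this.hardCapMs);
    return { action: "extend", newDeadlineMs: this.deadlineMs };
  }
}

// Fake app-server client: emits progress at the given timestamps while a
// simulated clock ticks forward. Returns the first abort, or "continue" if
// the turn reaches endMs without being aborted.
function simulateTurn(
  deadline: ProgressAwareDeadline,
  progressAtMs: number[],
  tickMs: number,
  endMs: number,
): TimeoutDecision {
  const pending = [...progressAtMs].sort((a, b) => a - b);
  for (let now = 0; now <= endMs; now += tickMs) {
    let next = pending[0];
    while (next !== undefined && next <= now) {
      deadline.recordActivity(next);
      pending.shift();
      next = pending[0];
    }
    const decision = deadline.check(now);
    if (decision.action === "abort") return decision;
  }
  return { action: "continue" };
}
```

With a 60 s timeout, a 10 s idle window, and a 30 s maximum extension, a silent turn is aborted at 60 s as `idle_timeout`, while a turn emitting progress every 5 s survives past the old boundary and is only aborted at the 90 s cap as `progressing_timeout`. Keeping the cap means a chatty but stuck turn still cannot run forever.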

Reproduction status

This has one local native/live reproduction from beta.5 logs plus source-level proof of the fixed-timeout path. I am not claiming a provider bug or quota issue. I am also not claiming Lossless-Claw itself is upstream OpenClaw's responsibility; the upstream issue is how the Codex app-server runner handles near-window, still-progressing turns.
