openclaw: Codex app-server can time out and fall back during near-window, still-progressing turns [2 pull requests]

Fix Action

Fixed

TLDR

A beta.5 Codex app-server turn with a very large assembled context hit OpenClaw's fixed attempt timeout, retired the Codex app-server client, and fell back from openai-codex/gpt-5.5 to openai-codex/gpt-5.4. The failure was not an OpenAI quota/rate-limit error and not a QA mock artifact. The source-level issue is that the Codex app-server runner has an absolute wall-clock timeout that does not distinguish a dead turn from a still-progressing, near-window turn.

Impact

Product impact: P1 for beta.5 Codex-default confidence on long first-hour sessions. A real user session with heavy context can be interrupted mid-turn and silently pushed into model fallback even when the underlying app-server path may still be making progress.

QA impact: P1 for runtime parity/confidence, because first-hour and soak lanes need to classify this separately from provider errors, quota failures, and generic model fallback.

What happened

In a local v2026.5.10-beta.5 gateway session, Lossless-Claw assembled a near-window context and the Codex app-server attempt timed out:

[lcm] assemble: ... tokenBudget=258000 estimatedTokens=257965 ...
codex app-server client retired after timed-out turn
embedded_run_failover_decision ... provider=openai-codex model=gpt-5.5 ... failoverReason=timeout ... rawErrorPreview="codex app-server attempt timed out"
model_fallback_decision ... candidateModel="gpt-5.5" ... reason="timeout" ... nextCandidateModel="gpt-5.4"
model_fallback_decision ... candidateModel="gpt-5.4" ... candidate_succeeded

Related pressure in the same session:

[lcm] auto-rotate: phase=runtime action=rotate ... durationMs=84054 ...
[worker-llm] timeout after 30000ms ...

The LCM/context pressure is probably the amplifier, but the upstream OpenClaw runtime behavior still needs a clearer progress-aware path.
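
To make "near-window" concrete, here is a minimal sketch that reads the two numbers from the assemble log line above and computes the remaining headroom. The helper names (parseNumericFields, contextPressureFromLog) are hypothetical and exist only for this illustration; the only real inputs are tokenBudget and estimatedTokens as they appear in the log.

```typescript
// Hypothetical helper (not part of OpenClaw): collects numeric key=value pairs from a log line.
function parseNumericFields(line: string): Record<string, number> {
  const fields: Record<string, number> = {};
  const pattern = /(\w+)=(\d+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line)) !== null) {
    fields[match[1]] = Number(match[2]);
  }
  return fields;
}

interface ContextPressure {
  tokenBudget: number;
  estimatedTokens: number;
  headroomTokens: number; // tokens left before the budget is exhausted
  utilization: number;    // estimatedTokens / tokenBudget
}

// Returns null when the line does not carry both fields or the budget is not positive.
function contextPressureFromLog(line: string): ContextPressure | null {
  const fields = parseNumericFields(line);
  const tokenBudget = fields["tokenBudget"];
  const estimatedTokens = fields["estimatedTokens"];
  if (tokenBudget === undefined || estimatedTokens === undefined || tokenBudget <= 0) {
    return null;
  }
  return {
    tokenBudget,
    estimatedTokens,
    headroomTokens: tokenBudget - estimatedTokens,
    utilization: estimatedTokens / tokenBudget,
  };
}

const assembleLine = "[lcm] assemble: ... tokenBudget=258000 estimatedTokens=257965 ...";
const pressure = contextPressureFromLog(assembleLine);
// headroomTokens is 35, so the assembled context fills more than 99.98% of the budget.
console.log(pressure);
```

With only 35 tokens of headroom, the turn was effectively at the budget before the app-server attempt started.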

Source-level evidence

The Codex app-server runner currently arms a fixed attempt timeout and aborts the run when it fires:

  • extensions/codex/src/app-server/run-attempt.ts creates a setTimeout(... params.timeoutMs) and calls projector?.markTimedOut() plus runAbortController.abort("timeout").
  • When the timeout fires, the abort path interrupts the turn and retires the app-server client.
  • Existing tests cover retiring the client after timeout, but they do not cover a large/slow turn that is still emitting app-server notifications or token/progress events near the boundary.

The context projection path can render very large context into the Codex prompt:

  • extensions/codex/src/app-server/context-engine-projection.ts allows rendered context up to MAX_RENDERED_CONTEXT_CHARS = 1_000_000.
  • Existing context-engine tests cover truncation/reserve behavior, but not timeout/progress interaction for near-window live turns.
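
As a rough back-of-the-envelope check, the character cap can be compared against the token budget seen in the logs. The sketch below assumes about 4 characters per token, which is a common rule of thumb and not a value taken from OpenClaw; it also assumes the 258000 budget from the [lcm] log is the relevant one for the projected prompt. Only MAX_RENDERED_CONTEXT_CHARS and the budget figure come from the issue.

```typescript
// From the issue: cap in context-engine-projection.ts.
const MAX_RENDERED_CONTEXT_CHARS = 1_000_000;

// ASSUMPTION: rough heuristic for English-like text, not an OpenClaw constant.
const ASSUMED_CHARS_PER_TOKEN = 4;

// From the issue's log line: tokenBudget=258000.
const LOGGED_TOKEN_BUDGET = 258_000;

// Hypothetical helper: converts a character count to an approximate token count.
function estimateTokensFromChars(
  chars: number,
  charsPerToken: number = ASSUMED_CHARS_PER_TOKEN,
): number {
  if (charsPerToken <= 0) {
    throw new RangeError("charsPerToken must be positive");
  }
  return Math.ceil(chars / charsPerToken);
}

const worstCaseTokens = estimateTokensFromChars(MAX_RENDERED_CONTEXT_CHARS); // 250000
const shareOfBudget = worstCaseTokens / LOGGED_TOKEN_BUDGET; // about 0.97

console.log({ worstCaseTokens, shareOfBudget });
```

Under these assumptions the character cap alone permits a projection of roughly 97% of the logged budget, so truncation at that cap would not by itself keep a turn away from the window boundary.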

Expected behavior

OpenClaw should distinguish at least these cases:

  1. Hard-dead app-server turn: no activity/progress, interrupt and retire as today.
  2. Still-progressing large Codex turn: either extend/yield/checkpoint intentionally, or fail with a distinct progressing_timeout/context_pressure_timeout classification instead of generic model fallback.
  3. Context-pressure preflight: if assembled context is near the model/runtime budget, reduce projection or warn before the turn enters a fixed wall-clock timeout path.
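
Case 3 could be sketched as a small pure function that runs before the turn starts. Everything here is hypothetical: the function name, the decision labels, and the 0.90 / 0.98 thresholds are placeholders for illustration, not existing OpenClaw behavior.

```typescript
type PreflightDecision = "ok" | "warn" | "reduce_projection";

// Hypothetical thresholds, expressed as a fraction of the token budget.
interface PreflightThresholds {
  warnAt: number;
  reduceAt: number;
}

const DEFAULT_THRESHOLDS: PreflightThresholds = { warnAt: 0.9, reduceAt: 0.98 };

// Decides what to do before the turn enters the fixed wall-clock timeout path.
function contextPressurePreflight(
  estimatedTokens: number,
  tokenBudget: number,
  thresholds: PreflightThresholds = DEFAULT_THRESHOLDS,
): PreflightDecision {
  if (tokenBudget <= 0) {
    throw new RangeError("tokenBudget must be positive");
  }
  const utilization = estimatedTokens / tokenBudget;
  if (utilization >= thresholds.reduceAt) {
    return "reduce_projection";
  }
  if (utilization >= thresholds.warnAt) {
    return "warn";
  }
  return "ok";
}

// The values from the issue's log line would trigger a projection reduction.
const decisionForLoggedTurn = contextPressurePreflight(257_965, 258_000);
console.log(decisionForLoggedTurn); // "reduce_projection"
```

Keeping the check pure makes it cheap to unit test and leaves the choice of how to reduce the projection to the caller.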

Suggested fix direction

Add progress-aware timeout handling for Codex app-server attempts:

  • Track recent app-server notifications/output/token/progress activity and expose it to the timeout decision.
  • Add a bounded extension or yield/checkpoint path for turns that are active but slow due to large context.
  • Preserve the existing hard timeout for truly idle/dead turns.
  • Add diagnostics that separate idle_timeout, progressing_timeout, and context_pressure_timeout.
  • Add regression coverage with a fake app-server client that emits progress beyond the old timeout boundary.
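
One possible shape for the bullets above is a watchdog that is polled instead of a single fixed timer. This is a sketch under stated assumptions, not the runner's implementation: the class name, option names, and all millisecond values are hypothetical, and only the three reason labels come from this issue. The clock is injected so the behavior can be exercised without real timers, which is also how a fake app-server client could drive a regression test.

```typescript
type TimeoutReason = "idle_timeout" | "progressing_timeout" | "context_pressure_timeout";

type WatchdogDecision =
  | { action: "continue" }
  | { action: "abort"; reason: TimeoutReason };

// All option names are hypothetical.
interface WatchdogOptions {
  baseTimeoutMs: number;      // the existing fixed attempt timeout
  maxExtensionMs: number;     // bounded extra time for active turns
  idleWindowMs: number;       // activity older than this counts as idle
  contextUtilization: number; // estimatedTokens / tokenBudget for this turn
  highPressureAt: number;     // utilization at or above this is "context pressure"
  now: () => number;          // injected clock, e.g. Date.now
}

class ProgressAwareWatchdog {
  private readonly opts: WatchdogOptions;
  private readonly startedAt: number;
  private lastActivityAt: number;

  constructor(opts: WatchdogOptions) {
    this.opts = opts;
    this.startedAt = opts.now();
    this.lastActivityAt = this.startedAt;
  }

  // Call on every app-server notification, output chunk, or token/progress event.
  recordActivity(): void {
    this.lastActivityAt = this.opts.now();
  }

  check(): WatchdogDecision {
    const now = this.opts.now();
    const elapsed = now - this.startedAt;
    if (elapsed < this.opts.baseTimeoutMs) {
      return { action: "continue" };
    }
    // Past the old boundary: idle turns keep today's hard-timeout behavior.
    const active = now - this.lastActivityAt <= this.opts.idleWindowMs;
    if (!active) {
      return { action: "abort", reason: "idle_timeout" };
    }
    // Active turns get a bounded extension, never an unbounded one.
    if (elapsed < this.opts.baseTimeoutMs + this.opts.maxExtensionMs) {
      return { action: "continue" };
    }
    const reason: TimeoutReason =
      this.opts.contextUtilization >= this.opts.highPressureAt
        ? "context_pressure_timeout"
        : "progressing_timeout";
    return { action: "abort", reason };
  }
}

// Fake-clock walkthrough: a slow but active near-window turn.
let fakeNow = 0;
const activeTurn = new ProgressAwareWatchdog({
  baseTimeoutMs: 60_000,
  maxExtensionMs: 30_000,
  idleWindowMs: 5_000,
  contextUtilization: 257_965 / 258_000,
  highPressureAt: 0.98,
  now: () => fakeNow,
});
fakeNow = 58_000;
activeTurn.recordActivity();
fakeNow = 61_000;
const pastOldBoundary = activeTurn.check(); // continue: progress was seen 3 s ago
fakeNow = 89_000;
activeTurn.recordActivity();
fakeNow = 91_000;
const atHardCap = activeTurn.check(); // abort with context_pressure_timeout

// A turn that never emits anything still aborts at the old boundary.
let idleNow = 0;
const idleTurn = new ProgressAwareWatchdog({
  baseTimeoutMs: 60_000,
  maxExtensionMs: 30_000,
  idleWindowMs: 5_000,
  contextUtilization: 0.5,
  highPressureAt: 0.98,
  now: () => idleNow,
});
idleNow = 60_000;
const idleAtBoundary = idleTurn.check(); // abort with idle_timeout

console.log(pastOldBoundary, atHardCap, idleAtBoundary);
```

The decision is returned as data, so the caller stays in charge of interrupting the turn, retiring the client, and choosing whether a given reason should feed the generic model-fallback path.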

Reproduction status

This has one local native/live reproduction from beta.5 logs plus source-level proof of the fixed-timeout path. I am not claiming a provider bug or quota issue. I am also not claiming Lossless-Claw itself is upstream OpenClaw's responsibility; the upstream issue is how the Codex app-server runner handles near-window, still-progressing turns.

Links

Related broad tracker: #66251
