openclaw: Codex app-server emits notification:turn/started then goes silent; embedded run wedges for the full stuck-session recovery window

Summary

Codex app-server emits notification:turn/started for a turn and then goes completely silent: no deltas, no turn/completed, no turn/error. The session sits in embedded_run state until OpenClaw's stuck session recovery fires (default 360s) and force-aborts it. Operator-visible symptom: the agent receives the message but never replies.

Environment

  • OpenClaw: 2026.5.20 (/opt/homebrew/lib/node_modules/openclaw)
  • Node: 25.9.0
  • OS: macOS 15 (Darwin 25.4)
  • Codex backend: node /opt/homebrew/bin/codex app-server --enable goals --listen unix://app-server.sock
  • Provider auth: openai-codex via OAuth (ChatGPT account)
  • Multiple agents using Codex backend: system-architect, librarian, codex

Reproduction (observed 2026-05-22 13:46–13:52 local)

Two librarian sessions started at 13:46:09 after a fresh gateway kickstart and immediately wedged on codex_app_server:notification:turn/started. Both stalled for ~360s with no progress events until stuck-session recovery aborted them:

  • sessionId=d729f8f8-fe2a-40e4-8778-8be979511d1f sessionKey=agent:librarian:telegram:direct:6689123501
  • sessionId=681d1bb8-4291-4977-b0ca-69e059beaf66 sessionKey=agent:librarian:main
  • (Same day, separate cohort): sessionId=8a6d0f15-3d6b-4e7d-91e8-3b60ee451207 sessionKey=agent:system-architect:telegram:group:-1003731083010:topic:479 — different stall mode (activeTool=bash exec hang) but same Codex backend.

Recovery log lines:

[WARN] stalled session: ... reason=active_work_without_progress classification=stalled_agent_run
       lastProgress=codex_app_server:notification:turn/started ...
[WARN] stuck session recovery: action=abort_embedded_run aborted=true drained=true released=0
[WARN] stuck session recovery outcome: status=aborted ...

A correlated [diagnostics/liveness] liveness warning (reasons=event_loop_delay … eventLoopDelayMaxMs=5469) fired immediately after the first stall. The gateway main thread was blocked for ~5s, likely while waiting on a Codex socket response that never came.

Why this matters

When the Codex app-server hangs after turn/started, the user sees a completely silent failure: the gateway and OpenClaw look healthy and Telegram outbound works, but the request disappears for 6 minutes until recovery. The recovery does kill the wedged run, but it does not retry the user's request, so the message is effectively lost.

Suspected cause

Open hypotheses:

  1. Codex app-server tool dispatch hang when the requested model is registered as an agents.defaults.models.* alias but the underlying provider entry is missing (related: my other bug today involved a hot reload adding openai/gpt-5.4-mini, google/gemini-3.x, xai/grok-4.20-0309-* aliases without registering them under models.providers). Codex may have accepted the turn and then deadlocked when resolving the model.
  2. Long-lived app-server processes — I have 6 stale node /opt/homebrew/bin/codex app-server processes from 5-8 days ago all reparented to launchd with app-server.sock bindings. The gateway's choice of which socket to talk to may be hitting a dead one. (Filing this separately if it stays after socket re-resolution improvements.)
  3. No heartbeat/timeout on the gateway side for the turn/started → turn/completed round-trip. Recovery is min_age=300s, which is much longer than any reasonable user-visible patience.
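
Hypothesis 1 could be checked before a turn is ever dispatched. The sketch below flags aliases whose provider has no entry. It is a minimal illustration only: the ModelConfig shape and the findAliasesWithoutProvider name are hypothetical, and it assumes the provider id is the part of the alias before the first "/", which the real OpenClaw config schema may not guarantee.

```typescript
// Hypothetical config shape for illustration; the real schema may differ.
interface ModelConfig {
  // Alias names such as "openai/gpt-5.4-mini" (assumed "<provider>/<model>").
  aliases: string[];
  // Registered provider entries, keyed by provider id such as "openai".
  providers: Record<string, unknown>;
}

// Returns every alias whose provider prefix has no registered provider entry.
function findAliasesWithoutProvider(config: ModelConfig): string[] {
  return config.aliases.filter((alias) => {
    const [providerId = ""] = alias.split("/");
    return !Object.prototype.hasOwnProperty.call(config.providers, providerId);
  });
}
```

Running a check like this after a hot reload would turn a possible deadlock during model resolution into an explicit configuration error.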

Expected behavior

Either: (a) Codex app-server should emit turn/error (or close the stream) within bounded time if it cannot make progress; or (b) the gateway side should have a per-turn watchdog that surfaces event:codex_turn_timeout to the user and lets them retry, rather than burying the failure under a generic stuck session recovery.

Proposed fix shape

In dist/cli/codex-runtime.ts (or equivalent), add a turn-started-without-progress watchdog:

  • Start a per-turn timer when notification:turn/started is received.
  • If no notification:turn/delta or notification:turn/completed arrives within channels.codex.turnProgressThresholdMs (for example, a 60s default), emit a synthetic turn/error to the embedded run consumer with code=CODEX_TURN_STALLED.
  • Surface to the user as a retryable error rather than a silent ~6-minute wedge.
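
The watchdog steps above could be sketched roughly as follows. This is an illustration under assumptions, not OpenClaw's actual code: the TurnProgressWatchdog class, the StallError shape, and the injectable Scheduler are hypothetical names. Only the event names, the CODEX_TURN_STALLED code, and the 60s example threshold come from the proposal. The sketch also re-arms the timer on every delta, which goes slightly beyond the proposal and would catch a stream that stalls mid-turn.

```typescript
// Event names follow the proposal; everything else here is hypothetical.
type TurnEvent = "turn/started" | "turn/delta" | "turn/completed" | "turn/error";

interface StallError {
  turnId: string;
  code: "CODEX_TURN_STALLED";
  retryable: true;
  silentForMs: number;
}

// Timer functions are injectable so the watchdog can be tested without waiting.
type TimerHandle = unknown;
interface Scheduler {
  set(fn: () => void, ms: number): TimerHandle;
  clear(handle: TimerHandle): void;
}

const realScheduler: Scheduler = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

class TurnProgressWatchdog {
  private readonly timers = new Map<string, TimerHandle>();
  private readonly thresholdMs: number;
  private readonly onStall: (err: StallError) => void;
  private readonly scheduler: Scheduler;

  constructor(
    thresholdMs: number,
    onStall: (err: StallError) => void,
    scheduler: Scheduler = realScheduler,
  ) {
    this.thresholdMs = thresholdMs;
    this.onStall = onStall;
    this.scheduler = scheduler;
  }

  // Feed every app-server notification for a turn through this method.
  observe(turnId: string, event: TurnEvent): void {
    this.disarm(turnId);
    // Started and delta events (re)start the timer; terminal events only clear it.
    if (event === "turn/started" || event === "turn/delta") {
      this.arm(turnId);
    }
  }

  private arm(turnId: string): void {
    const handle = this.scheduler.set(() => {
      this.timers.delete(turnId);
      this.onStall({
        turnId,
        code: "CODEX_TURN_STALLED",
        retryable: true,
        silentForMs: this.thresholdMs,
      });
    }, this.thresholdMs);
    this.timers.set(turnId, handle);
  }

  private disarm(turnId: string): void {
    if (this.timers.has(turnId)) {
      this.scheduler.clear(this.timers.get(turnId));
      this.timers.delete(turnId);
    }
  }
}
```

A gateway would construct it once with the configured threshold, for example new TurnProgressWatchdog(60_000, surfaceRetryableError), and call observe() for each notification. Here surfaceRetryableError is a placeholder for whatever delivers the synthetic turn/error to the embedded run consumer.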

Also worth investigating whether long-lived stale app-server sockets need broker-side health probes before being selected.
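
A broker-side probe along these lines is one possible starting point. It is a sketch under assumptions: probeUnixSocket and firstLiveSocket are hypothetical helpers, and a successful connect only shows that some process still accepts connections on the path, not that the app-server behind it can make progress on a turn.

```typescript
import * as net from "node:net";

// Resolves true if a connection to the Unix socket opens within timeoutMs,
// and false on a connection error or timeout.
function probeUnixSocket(path: string, timeoutMs: number): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.createConnection({ path });
    let settled = false;
    const timer = setTimeout(() => finish(false), timeoutMs);
    const finish = (alive: boolean): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      socket.destroy();
      resolve(alive);
    };
    socket.once("connect", () => finish(true));
    socket.once("error", () => finish(false));
  });
}

// Returns the first candidate path that accepts a connection, if any.
async function firstLiveSocket(
  paths: string[],
  timeoutMs = 1000,
): Promise<string | undefined> {
  for (const path of paths) {
    if (await probeUnixSocket(path, timeoutMs)) return path;
  }
  return undefined;
}
```

A stronger probe would send a cheap request and wait for a reply, but that depends on the app-server protocol, which is not described here.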

Workaround

Run launchctl kickstart -k gui/$(id -u)/ai.openclaw.gateway. This clears all wedged sessions and spawns fresh Codex socket connections. Today this restored service within ~10s.

Related

  • This was hit alongside openclaw/openclaw#85224 (gateway respawn supervisor false-positive) and the already-known #83950 (telegram polling stall, fix in #84861).
  • Possibly related to #85113 (Codex plugin sync wedge) and #85050 (Codex MCP module missing) — not duplicates because the symptom layer is different (socket-level vs plugin-load).
