openclaw - Codex harness can leave TUI/session stuck after app-server client closes [1 comment, 2 participants]

GitHub stats
openclaw/openclaw#76127, fetched 2026-05-03 04:42:04
Comments: 1 · Participants: 2 · Timeline events: 8 · Reactions: 2
Timeline (top): mentioned ×3, subscribed ×3, commented ×1, unsubscribed ×1

When using the native Codex harness globally (agentRuntime.id: "codex"), OpenClaw can get into a state where:

  • TUI turns remain active/stuck for several minutes after prompt.submitted.
  • Diagnostics repeatedly skip recovery because an active_embedded_run is present.
  • The Codex app-server client closes or restarts during nearby work.
  • A model result may eventually be recorded as successful, but the persisted session state says timeout or the trajectory says session.ended: interrupted.
  • After this, the TUI itself can become unresponsive to keyboard input while the gateway continues reporting lane backlog.

This makes the operator unable to tell whether the turn is running, timed out, succeeded, or whether the TUI is still usable.

Error Message

[agent/embedded] embedded run failover decision: runId=commitments-... reason=timeout from=openai/gpt-5.5 rawError=codex app-server attempt timed out
[agents/harness] Codex agent harness failed; not falling back to embedded PI backend
[diagnostic] lane task error: lane=main durationMs=21442 error="Error: codex app-server client is closed"
[diagnostic] lane task error: lane=session:agent:Agent2:matrix:channel:<redacted> durationMs=39855 error="Error: codex app-server client is closed"

Environment

  • OpenClaw: OpenClaw 2026.4.29 (a448042)
  • macOS: 26.3.1 (25D2128)
  • Node from sampled TUI process: 25.6.0
  • Global Codex CLI: codex-cli 0.128.0
  • OpenClaw-managed Codex app-server process:
node ~/.openclaw/plugin-runtime-deps/openclaw-2026.4.29-.../node_modules/.bin/codex app-server --listen stdio://
~/.openclaw/plugin-runtime-deps/openclaw-2026.4.29-.../node_modules/@openai/codex-darwin-arm64/vendor/aarch64-apple-darwin/codex/codex app-server --listen stdio://

Startup logs show that the bundled runtime deps include their own @openai Codex package (its exact name and version string are garbled in this copy), so OpenClaw appears to be using its bundled Codex app-server rather than the newer global CLI.

Global agent runtime config:

{
  "agentRuntime": {
    "id": "codex",
    "fallback": "none"
  }
}

Codex plugin config:

{
  "appServer": {
    "mode": "yolo",
    "requestTimeoutMs": 20000,
    "serviceTier": "fast"
  }
}

The trajectory for the problematic turn used:

provider=openai
modelId=gpt-5.4
modelApi=openai-responses

Reproduction Pattern

  1. Configure all agents to use the native Codex harness with no PI fallback.
  2. Start/restart the gateway.
  3. Send simple prompts through a TUI session.
  4. Have normal Matrix/channel traffic and other agent lanes active at the same time.
  5. Observe Codex app-server timeout/close events and lane backlog.

This happened with trivial prompts such as:

Hi, are you there? Please confirm.
What time is it in London?

Observed Turn State Mismatch

A TUI session started normally:

{
  "type": "session.started",
  "sessionKey": "agent:Agent1:tui-<redacted>",
  "provider": "openai",
  "modelId": "gpt-5.4",
  "modelApi": "openai-responses"
}

The first prompt completed quickly:

18:00:52 prompt submitted
18:01:03 model.completed -> "Yes."
session.ended -> success

The next trivial prompt was submitted at 2026-05-02T15:01:48.491Z:

{
  "type": "prompt.submitted",
  "sessionKey": "agent:Agent1:tui-<redacted>",
  "runId": "50298e97-fd10-42dc-adda-df1224c623b6",
  "provider": "openai",
  "modelId": "gpt-5.4",
  "modelApi": "openai-responses"
}

No model.completed event was recorded for about 5 minutes 50 seconds. During that interval diagnostics repeatedly reported the session as stuck:

[diagnostic] stuck session: sessionId=unknown sessionKey=agent:Agent1:tui-<redacted> state=processing age=123s queueDepth=1 reason=processing_with_queued_work recovery=checking
[diagnostic] stuck session recovery skipped: reason=active_embedded_run action=observe_only sessionId=954ac58b-... sessionKey=agent:Agent1:tui-<redacted> age=123s queueDepth=1 activeSessionId=954ac58b-...

This repeated at roughly 153s, 184s, 214s, 244s, 276s, 306s, 338s, and 369s.
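
To quantify how often recovery was skipped, diagnostic lines like the ones above can be parsed as whitespace-separated key=value pairs. The following is a minimal TypeScript sketch that assumes the log format matches the excerpts in this report; `parseDiagnostic` and `countSkippedRecoveries` are hypothetical helper names, not part of OpenClaw, and the sample lines use shortened placeholder IDs.

```typescript
// Hypothetical helpers; the log format is inferred from the excerpts in this report.
type Fields = Record<string, string>;

// Extract key=value pairs from a single diagnostic log line.
function parseDiagnostic(line: string): Fields {
  const fields: Fields = {};
  const pattern = /(\w+)=(\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line)) !== null) {
    const key = match[1];
    const value = match[2];
    if (key !== undefined && value !== undefined) fields[key] = value;
  }
  return fields;
}

// Count "recovery skipped" lines per sessionKey for one skip reason.
function countSkippedRecoveries(lines: string[], reason: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    if (!line.includes("stuck session recovery skipped")) continue;
    const fields = parseDiagnostic(line);
    const sessionKey = fields["sessionKey"];
    if (fields["reason"] !== reason || sessionKey === undefined) continue;
    counts.set(sessionKey, (counts.get(sessionKey) ?? 0) + 1);
  }
  return counts;
}

// Sample input with placeholder IDs (not the real redacted values).
const sampleLines = [
  "[diagnostic] stuck session: sessionId=unknown sessionKey=agent:Agent1:tui-x state=processing age=123s queueDepth=1 reason=processing_with_queued_work recovery=checking",
  "[diagnostic] stuck session recovery skipped: reason=active_embedded_run action=observe_only sessionId=954ac58b sessionKey=agent:Agent1:tui-x age=123s queueDepth=1 activeSessionId=954ac58b",
  "[diagnostic] stuck session recovery skipped: reason=active_embedded_run action=observe_only sessionId=954ac58b sessionKey=agent:Agent1:tui-x age=153s queueDepth=1 activeSessionId=954ac58b",
];
const skippedCounts = countSkippedRecoveries(sampleLines, "active_embedded_run");
console.log(skippedCounts.get("agent:Agent1:tui-x")); // prints 2
```

A rising count for a single session key with an unchanged activeSessionId would match the pattern described here: the run never leaves the active state, so recovery stays in observe_only.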

Eventually the trajectory recorded a successful model output:

{
  "type": "model.completed",
  "ts": "2026-05-02T15:07:38.934Z",
  "runId": "50298e97-fd10-42dc-adda-df1224c623b6",
  "data": {
    "timedOut": false,
    "aborted": false,
    "promptError": null,
    "usage": {
      "input": 341,
      "output": 28,
      "cacheRead": 32128,
      "total": 32497
    },
    "assistantTexts": [
      "It’s `16:02` in London right now (`BST`, Saturday, 2026-05-02)."
    ]
  }
}

But the same trajectory ended as interrupted:

{
  "type": "session.ended",
  "data": {
    "status": "interrupted",
    "timedOut": false,
    "promptError": null
  }
}

The persisted session entry also disagreed with the successful trajectory output:

{
  "status": "timeout",
  "runtimeMs": 360972,
  "modelProvider": "openai",
  "model": "gpt-5.4",
  "inputTokens": 341,
  "outputTokens": 28,
  "totalTokens": 32469
}
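
The three records above disagree with each other in ways that can be checked mechanically. The sketch below is a hypothetical TypeScript consistency check, not OpenClaw code: the field names mirror the JSON excerpts in this report, while `findMismatches` and the assumption that a clean run should persist the status value "success" are illustrative only.

```typescript
// Field names mirror the JSON excerpts in this report; the checker itself is hypothetical.
interface ModelCompletedData {
  timedOut: boolean;
  aborted: boolean;
  promptError: string | null;
  usage: { input: number; output: number; cacheRead: number; total: number };
}

interface PersistedSessionEntry {
  status: string | null;
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
}

// Return a description of every disagreement between the three records.
function findMismatches(
  model: ModelCompletedData,
  endedStatus: string,
  persisted: PersistedSessionEntry,
): string[] {
  const issues: string[] = [];
  const modelOk = !model.timedOut && !model.aborted && model.promptError === null;
  // Assumption: "success" is the expected terminal status for a clean run.
  if (modelOk && endedStatus !== "success") {
    issues.push(`model completed cleanly but session.ended is "${endedStatus}"`);
  }
  if (modelOk && persisted.status !== "success") {
    issues.push(`model completed cleanly but persisted status is "${persisted.status}"`);
  }
  if (model.usage.total !== persisted.totalTokens) {
    issues.push(`token totals differ: trajectory ${model.usage.total}, persisted ${persisted.totalTokens}`);
  }
  return issues;
}

// Values copied from the excerpts above.
const reportedIssues = findMismatches(
  {
    timedOut: false,
    aborted: false,
    promptError: null,
    usage: { input: 341, output: 28, cacheRead: 32128, total: 32497 },
  },
  "interrupted",
  { status: "timeout", inputTokens: 341, outputTokens: 28, totalTokens: 32469 },
);
console.log(reportedIssues.length); // prints 3
```

With the reported values, all three checks fire. The token totals differ by 28 (32497 versus 32469), which equals the reported output token count; whether that is a separate accounting difference or a side effect of the interrupted run is not established by the report.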

Surrounding App-Server and Gateway Logs

Right before/around the problematic TUI run:

[agent/embedded] agent cleanup timed out: runId=db4758c6-... sessionId=954ac58b-... step=codex-trajectory-flush timeoutMs=10000
[agent/embedded] codex app-server thread resume failed; starting a new thread

Same period:

[agent/embedded] embedded run failover decision: runId=commitments-... reason=timeout from=openai/gpt-5.5 rawError=codex app-server attempt timed out
[agents/harness] Codex agent harness failed; not falling back to embedded PI backend
[diagnostic] lane task error: lane=main durationMs=21442 error="Error: codex app-server client is closed"
[diagnostic] lane task error: lane=session:agent:Agent2:matrix:channel:<redacted> durationMs=39855 error="Error: codex app-server client is closed"

Gateway load/backlog at the same time:

[diagnostic] liveness warning: reasons=event_loop_delay,cpu interval=34s eventLoopDelayP99Ms=4664.1 eventLoopDelayMaxMs=4768.9 eventLoopUtilization=0.94 cpuCoreRatio=0.931 active=7 waiting=0 queued=7
[reload] config change requires channel reload (matrix) - deferring until 16 operation(s), 6 reply(ies), 7 embedded run(s), 13 task run(s) complete
[reload] channel reload timeout after 10943ms with 16 operation(s), 9 reply(ies), 7 embedded run(s), 13 task run(s) still active; reloading channels anyway

Live TUI Freeze Afterward

After resetting/reopening a TUI session, the TUI became unresponsive to keyboard input. Process state at capture time:

openclaw gateway: ~99.9% CPU
openclaw parent: idle
openclaw-tui: idle, attached to ttys002
openclaw-tui websocket to localhost:18789: ESTABLISHED

Sample of the TUI process showed the Node main thread blocked in kevent:

node::SpinEventLoopInternal
uv_run
uv__io_poll
kevent

The TUI itself was not spinning at that exact sample, but it was not accepting input. Gateway diagnostics continued to report lane backlog:

[diagnostic] lane wait exceeded: lane=main waitedMs=73954 queueAhead=2
[diagnostic] lane wait exceeded: lane=main waitedMs=69518 queueAhead=2
[diagnostic] liveness warning: reasons=event_loop_delay interval=30s eventLoopDelayP99Ms=701.5 eventLoopDelayMaxMs=4680.8 eventLoopUtilization=0.683 cpuCoreRatio=0.702 active=0 waiting=0 queued=9
[diagnostic] lane wait exceeded: lane=main waitedMs=71016 queueAhead=2

The current TUI session entry existed but had no useful status/runtime/tokens yet:

{
  "key": "agent:Agent1:tui-<redacted>",
  "sessionId": "2d348fde-9440-40bc-9ca8-87539c0c84d1",
  "status": null,
  "runtimeMs": null,
  "modelProvider": "openai",
  "model": "gpt-5.4",
  "inputTokens": 0,
  "outputTokens": 0,
  "totalTokens": 0
}

Expected Behavior

One of these should happen:

  1. A trivial Codex harness turn completes promptly.
  2. If the app-server/request times out, the turn fails deterministically and clears the active embedded run.
  3. Stuck-session recovery can safely abort or clear an active Codex harness run after timeout.
  4. The TUI remains responsive, even if the gateway is backed up or the Codex app-server client has closed.

Trajectory state and persisted session state should agree.
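
Expected behavior 2 (fail deterministically and clear the active embedded run) can be illustrated with a deadline wrapper whose cleanup runs on every exit path. This is a minimal TypeScript sketch under stated assumptions: `activeRuns` and `runTurnWithDeadline` are hypothetical names standing in for whatever bookkeeping OpenClaw actually uses, and the sketch says nothing about how the real harness is implemented.

```typescript
// Hypothetical sketch: activeRuns stands in for whatever marks a run as active.
const activeRuns = new Set<string>();

async function runTurnWithDeadline(
  runId: string,
  turn: (signal: AbortSignal) => Promise<string>,
  timeoutMs: number,
): Promise<string> {
  const controller = new AbortController();
  activeRuns.add(runId);

  // Rejects once the deadline passes; the timer is cancelled when the signal aborts.
  const deadline = new Promise<never>((_resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`turn ${runId} timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
    controller.signal.addEventListener("abort", () => clearTimeout(timer));
  });

  try {
    return await Promise.race([turn(controller.signal), deadline]);
  } finally {
    // Runs on success, failure, and timeout alike:
    // cancel leftover work and clear the active-run marker.
    controller.abort();
    activeRuns.delete(runId);
  }
}

// Usage: a turn that never settles fails after 50 ms and leaves no active run behind.
runTurnWithDeadline("demo-run", () => new Promise<string>(() => {}), 50).catch((err: Error) => {
  console.log(err.message, "| active runs:", activeRuns.size);
});
```

The design point is that the marker is cleared in `finally`, so a timeout cannot leave a run registered as active, which is the condition that kept stuck-session recovery in observe_only here.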

Actual Behavior

The Codex harness turn stayed active for minutes and diagnostics refused recovery due to active_embedded_run. It eventually produced a normal model output with promptError: null, but the session ended as interrupted and was persisted as timeout.

Afterward, the TUI became unresponsive to keyboard input while the gateway showed ongoing lane backlog and event-loop delay.

Additional Notes

The Codex session binding did not persist a per-session service tier override:

"serviceTier": null

However the global plugin config had appServer.serviceTier: "fast", and the installed code appears to pass appServer.serviceTier into thread/start, thread/resume, and turn/start.

It would be useful for Codex harness artifacts/logs to record the effective serviceTier per turn. Right now the trajectory and binding do not clearly prove whether fast was applied at the request level.
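
One way to make the effective tier auditable is to resolve it once per turn and log both the value and where it came from. The TypeScript sketch below is illustrative only: the helper names, the `[codex/turn]` log tag, and the precedence (per-session override first, then global plugin config) are assumptions, and only the serviceTier field and the "fast" value come from this report.

```typescript
// Hypothetical types and helpers; the precedence order is an assumption.
type TierSource = "session" | "global" | "none";

interface EffectiveTier {
  serviceTier: string | null;
  source: TierSource;
}

// Prefer a per-session override, then fall back to the global plugin config.
function resolveEffectiveTier(sessionTier: string | null, globalTier: string | null): EffectiveTier {
  if (sessionTier !== null) return { serviceTier: sessionTier, source: "session" };
  if (globalTier !== null) return { serviceTier: globalTier, source: "global" };
  return { serviceTier: null, source: "none" };
}

// Build a log line recording the resolved tier for one turn (the tag is made up).
function formatTurnTierLog(runId: string, tier: EffectiveTier): string {
  return `[codex/turn] runId=${runId} serviceTier=${tier.serviceTier ?? "unset"} source=${tier.source}`;
}

// The situation in this report: no session override, global config set to "fast".
const resolvedTier = resolveEffectiveTier(null, "fast");
console.log(formatTurnTierLog("50298e97", resolvedTier));
// prints: [codex/turn] runId=50298e97 serviceTier=fast source=global
```

A line like this per turn would show directly whether "fast" was in effect for the problematic request, instead of leaving it to be inferred from the binding and the global config.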

Possible Areas to Inspect

  • Codex app-server lifecycle after thread resume failed; starting a new thread.
  • Whether codex-trajectory-flush cleanup timeout can poison the next TUI turn.
  • Race between model completion, timeout, interruption, and persisted session status.
  • Why stuck-session recovery is observe_only indefinitely for active Codex harness runs.
  • TUI responsiveness when gateway lanes are backed up or the app-server client is closed/restarting.
  • Whether OpenClaw should prefer a newer global Codex CLI/app-server or expose config for the Codex app-server command/version.
  • Effective serviceTier logging for Codex harness requests.

extent analysis

TL;DR

The issue may be mitigated by raising requestTimeoutMs in the Codex plugin config (20000 in the reported setup) so that slow app-server requests are not cut off early. This is a workaround, not a root-cause fix.

Guidance

  1. Increase requestTimeoutMs: Try setting requestTimeoutMs to a higher value (e.g., 60000) in the Codex plugin config to give the Codex harness more time to complete its turns.
  2. Verify Codex app-server logs: Check the Codex app-server logs for any errors or warnings that may indicate issues with the app-server lifecycle or thread management.
  3. Inspect stuck-session recovery: Investigate why stuck-session recovery is observe_only indefinitely for active Codex harness runs and consider adjusting the recovery mechanism to handle such cases.
  4. Monitor TUI responsiveness: Keep an eye on TUI responsiveness when the gateway lanes are backed up or the app-server client is closed/restarting to ensure it remains usable.
  5. Consider upgrading Codex CLI/app-server: Look into using a newer global Codex CLI/app-server or exposing config for the Codex app-server command/version to potentially resolve version-related issues.

Example

No specific code snippet is provided as the issue seems to be related to configuration and app-server behavior rather than code.
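
That said, the timeout change from guidance item 1 is small enough to sketch. The TypeScript below assumes the appServer block has the shape quoted in the report; `withMinimumRequestTimeout` is a hypothetical helper, 60000 is just the example value from the guidance, and nothing in the report confirms that this change resolves the stall.

```typescript
// Mirrors the appServer block quoted in the report; the helper is hypothetical.
interface AppServerConfig {
  mode: string;
  requestTimeoutMs: number;
  serviceTier: string | null;
}

// Return a copy whose request timeout is at least minimumMs, leaving other fields untouched.
function withMinimumRequestTimeout(config: AppServerConfig, minimumMs: number): AppServerConfig {
  return { ...config, requestTimeoutMs: Math.max(config.requestTimeoutMs, minimumMs) };
}

const reportedConfig: AppServerConfig = { mode: "yolo", requestTimeoutMs: 20000, serviceTier: "fast" };
const raisedConfig = withMinimumRequestTimeout(reportedConfig, 60000);

// Print the fragment in the same shape as the plugin config shown in the report.
console.log(JSON.stringify({ appServer: raisedConfig }, null, 2));
```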

Notes

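The issue appears to stem from the interaction between the Codex harness, the app-server, and the TUI, and raising requestTimeoutMs may help mitigate it. However, the reported turn stayed active for nearly six minutes (runtimeMs 360972), far longer than either the configured 20000 ms or the suggested 60000 ms, so a larger timeout may not address the stall itself. Further investigation into the Codex app-server lifecycle, stuck-session recovery, and TUI responsiveness is needed to fully resolve the issue.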

Recommendation

Apply workaround: Increase requestTimeoutMs to a higher value to prevent premature timeouts and allow the Codex harness to complete its turns. This change can help mitigate the issue, but further investigation is needed to fully resolve the problem.
